# Looping over data sets


## _Determining matches_

Which of these files is not matched by the expression `glob.glob('data/*as*.csv')`?

1. `data/gapminder_gdp_africa.csv`
2. `data/gapminder_gdp_americas.csv`
3. `data/gapminder_gdp_asia.csv`


### Solution

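Option 1 is not matched by the glob: the pattern `*as*` requires the file name to contain `as`, which `americas` and `asia` do but `africa` does not.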
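
One way to check this answer without having the data files on disk is to test each name against the pattern with the standard library's `fnmatch` module, which implements the shell-style wildcards that `glob` relies on. This is only a minimal sketch: the `candidates` list repeats the three options from the question, and unlike `glob`, `fnmatch` also lets `*` match `/`, which makes no difference for these names.

```python
from fnmatch import fnmatchcase

pattern = 'data/*as*.csv'
candidates = [
    'data/gapminder_gdp_africa.csv',
    'data/gapminder_gdp_americas.csv',
    'data/gapminder_gdp_asia.csv',
]

# Keep only the names that the pattern does NOT match.
unmatched = [name for name in candidates if not fnmatchcase(name, pattern)]
print(unmatched)
```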


## _Minimum file size_

Modify this program so that it prints the number of records in the file that has the fewest records.

```{python}
import glob
import pandas as pd
fewest = ____
for filename in glob.glob('data/*.csv'):
    dataframe = pd.____(filename)
    fewest = min(____, dataframe.shape[0])
print('smallest file has', fewest, 'records')
```

Note that the [DataFrame.shape](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) property returns a tuple with the number of rows and columns of the data frame.


### Solution

```{python}
import glob
import pandas as pd

fewest = float("Inf")
for filename in glob.glob("../../data/*.csv"):
    dataframe = pd.read_csv(filename)
    fewest = min(fewest, dataframe.shape[0])
print("smallest file has", fewest, "records")
```

You might have chosen to initialize the `fewest` variable with a number greater than the numbers you’re dealing with, but that could lead to trouble if you reuse the code with bigger numbers. Python lets you use positive infinity, which will work no matter how big your numbers are. What other special strings does the [`float` function](https://docs.python.org/3/library/functions.html#float) recognize?
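
To illustrate why infinity works as a starting value, the sketch below runs the same `min` update over a made-up list of record counts. The `record_counts` values are hypothetical and not taken from the data files. It also shows `float` parsing negative infinity and not-a-number, two of the other special strings described in the linked documentation.

```python
import math

# Hypothetical record counts standing in for dataframe.shape[0] of each file.
record_counts = [142, 30, 33, 2]

fewest = float('inf')
for count in record_counts:
    # Any finite number is smaller than positive infinity,
    # so the first comparison always replaces the starting value.
    fewest = min(fewest, count)
print('smallest file has', fewest, 'records')

# float() also recognizes negative infinity and not-a-number.
negative_infinity = float('-inf')
not_a_number = float('nan')
print(negative_infinity < 0, math.isnan(not_a_number))
```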


## Comparing data

Write a program that reads in the regional data sets and plots the average GDP per capita for each region over time in a single chart. Pandas will raise an error if it encounters non-numeric columns in a dataframe computation, so you may need to either filter out those columns or tell pandas to ignore them.


### Solution

This solution builds a useful legend by using the [string `split` method](https://docs.python.org/3/library/stdtypes.html#str.split) to extract the region from a path of the form `'../../data/gapminder_gdp_<region>.csv'`.
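
To make the extraction concrete, here is a minimal sketch applied to a single literal path rather than to the output of `glob.glob`. The example path is an assumption that follows the naming scheme described above. The second variant uses `pathlib.Path.stem` from the standard library as an alternative to slicing off the extension.

```python
from pathlib import Path

# Example path following the gapminder_gdp_<region>.csv naming scheme (assumed for illustration).
filename = '../../data/gapminder_gdp_europe.csv'

# split('_') gives ['../../data/gapminder', 'gdp', 'europe.csv'];
# take the last piece and drop the 4-character '.csv' suffix.
region_from_split = filename.split('_')[-1][:-4]

# Path.stem is the final path component without its suffix: 'gapminder_gdp_europe'.
region_from_path = Path(filename).stem.split('_')[-1]

print(region_from_split, region_from_path)
```

Slicing with `[:-4]` only works for a four-character extension such as `.csv`, whereas `Path.stem` removes the suffix whatever its length.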


```{python}
import glob
import pandas as pd
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
for filename in glob.glob("../../data/gapminder_gdp*.csv"):
    dataframe = pd.read_csv(filename)
    # extract <region> from the filename, expected to be in the format 'data/gapminder_gdp_<region>.csv'.
    # we will split the string using the split method and `_` as our separator,
    # retrieve the last string in the list that split returns (`<region>.csv`),
    # and then remove the `.csv` extension from that string.
    # NOTE: the pathlib module covered in the next callout also offers
    # convenient abstractions for working with filesystem paths and could solve this as well:
    # from pathlib import Path
    # region = Path(filename).stem.split('_')[-1]
    region = filename.split("_")[-1][:-4]
    # pandas raises errors when it encounters non-numeric columns in a dataframe computation
    # but we can tell pandas to ignore them with the `numeric_only` parameter
    dataframe.mean(numeric_only=True).plot(ax=ax, label=region)
    # NOTE: another way of doing this selects just the columns with gdp in their name using the filter method
    # dataframe.filter(like="gdp").mean().plot(ax=ax, label=region)

plt.legend()
plt.show()
```

Licensed under [CC-BY 4.0](http://swcarpentry.github.io/python-novice-gapminder/18-style/index.html) 2018–2023 by [The Carpentries](https://carpentries.org/)

Licensed under [CC-BY 4.0](http://swcarpentry.github.io/python-novice-gapminder/18-style/index.html) 2016–2018 by [Software Carpentry Foundation](https://software-carpentry.org/)
