## Use a `for` loop to process files given a list of their names.

*   A filename is just a character string.
*   And lists can contain character strings.

In [None]:
import pandas as pd

In [None]:
file_list = ['../data/gapminder_gdp_africa.csv', '../data/gapminder_gdp_asia.csv']

for filename in file_list:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())

## Use `glob.glob` to find sets of files whose names match a pattern.

*   In Unix, the term "globbing" means "matching a set of files with a pattern".
*   The most common patterns are:
    *   `*` meaning "match zero or more characters"
    *   `?` meaning "match exactly one character"
*   Python's standard library includes the `glob` module, which provides this kind of pattern matching.
*   The `glob` module contains a function, also called `glob`, that matches file patterns.
*   E.g., `glob.glob('*.txt')` matches all files in the current directory 
    whose names end with `.txt`.
*   Result is a (possibly empty) list of character strings.
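
To make the two wildcards concrete, here is a minimal sketch that creates a few empty files in a temporary directory and globs them. The file names (`a1.txt`, `a22.txt`, and so on) are made up for illustration and are not part of the lesson data.

```python
import glob
import os
import tempfile

# Hypothetical file names, created only for this demonstration.
names = ['a1.txt', 'a22.txt', 'b1.txt', 'notes.md']

with tempfile.TemporaryDirectory() as tmpdir:
    for name in names:
        # Create an empty file with this name.
        open(os.path.join(tmpdir, name), 'w').close()

    # '*' matches zero or more characters: every name ending in .txt
    star_matches = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(tmpdir, '*.txt'))
    )

    # '?' matches exactly one character: 'a1.txt' but not 'a22.txt'
    question_matches = sorted(
        os.path.basename(p) for p in glob.glob(os.path.join(tmpdir, 'a?.txt'))
    )

    # A pattern that matches nothing gives an empty list, not an error.
    no_matches = glob.glob(os.path.join(tmpdir, '*.pdb'))

print(star_matches)
print(question_matches)
print(no_matches)
```

The results are sorted here because `glob.glob` does not promise any particular order.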


In [None]:
import glob
print('all csv files in data directory:', glob.glob('../data/*.csv'))

In [None]:
# Any debug files?
print('all PDB files:', glob.glob('*.pdb'))

## Use `glob` and `for` to process batches of files.

*   Helps a lot if the files are named and stored systematically and consistently
    so that simple patterns will find the right data.

In [None]:
for filename in glob.glob('../data/gapminder_*.csv'):
    data = pd.read_csv(filename)
    min_52 = data['gdpPercap_1952'].min()
    print(filename, min_52)

*   This pattern matches the file containing the whole data set as well as the per-region files.
*   Use a more specific pattern in the exercises to exclude the whole data set.
*   But note that the minimum of the entire data set is also the minimum of one of the regional data sets,
    which is a nice check on correctness.
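
The correctness check mentioned above can be sketched on toy data. The numbers below are invented stand-ins for two regional files and the combined file; only the column name `gdpPercap_1952` comes from the lesson.

```python
import pandas as pd

# Made-up values standing in for two regional data sets.
region_a = pd.DataFrame({'gdpPercap_1952': [300.0, 1200.0, 850.0]})
region_b = pd.DataFrame({'gdpPercap_1952': [410.0, 2900.0]})

# Made-up "whole" data set: the same rows, all in one table.
whole = pd.DataFrame({'gdpPercap_1952': [300.0, 1200.0, 850.0, 410.0, 2900.0]})

# Minimum of each region, and minimum of the whole.
regional_minima = [df['gdpPercap_1952'].min() for df in (region_a, region_b)]
overall_min = whole['gdpPercap_1952'].min()

# The smallest regional minimum should equal the overall minimum.
print(regional_minima, overall_min)
print(overall_min == min(regional_minima))
```

If the two values ever disagree on real data, either a file was missed by the pattern or the whole data set contains rows that no regional file has.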

## Combine data sets using `pd.concat`

We can [concatenate](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) (i.e. join) data sets together
using the aptly named `pd.concat` function.

In [None]:
file_list = glob.glob('../data/gapminder_*.csv')

data_list = list()
for fn in file_list:
    data0 = pd.read_csv(fn)
    data_list.append(data0)
    
data = pd.concat(data_list, sort=True)

In [None]:
# Look at first rows
data.head()
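
What `pd.concat` does to the rows is easier to see on tiny tables. This is a minimal sketch with two made-up data frames (the country names `A`, `B`, `C` and the values are placeholders). It also shows that each input keeps its own row labels unless `ignore_index=True` is passed.

```python
import pandas as pd

# Two small, made-up data frames with the same columns.
first = pd.DataFrame({'country': ['A', 'B'], 'gdpPercap_1952': [1.0, 2.0]})
second = pd.DataFrame({'country': ['C'], 'gdpPercap_1952': [3.0]})

# Stack the rows of both frames; the original row labels are kept,
# so the label 0 appears twice.
stacked = pd.concat([first, second])
print(list(stacked.index))

# Ask pandas to renumber the rows from 0 instead.
renumbered = pd.concat([first, second], ignore_index=True)
print(list(renumbered.index))
```

Repeated row labels are harmless for summaries such as `min()` or `mean()`, but they can be surprising when selecting rows by label.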

> ## Efficient concat
>
> There is a much cleaner way to build the combined data set: pass a list comprehension to `pd.concat`. Try that now.

In [None]:
# Build up dataframe using concat and a list comprehension

> ## Determining Matches
>
> Which of these files is *not* matched by the expression `glob.glob('data/*as*.csv')`?
>
> 1. `data/gapminder_gdp_africa.csv`
> 2. `data/gapminder_gdp_americas.csv`
> 3. `data/gapminder_gdp_asia.csv`
> 4. 1 and 2 are not matched.

> ## Minimum File Size
>
> Modify this program so that it prints the number of records in
> the file that has the fewest records.


In [None]:
import glob
import pandas
fewest = ____
for filename in glob.glob('../data/gapminder_*.csv'):
    dataframe = pandas.____(filename)
    fewest = min(____, dataframe.shape[0])
print('smallest file has', fewest, 'records')

> ## Comparing Data
>
> Write a program that reads in the regional data sets
> and plots the average GDP per capita for each region over time
> in a single chart.

In [None]:
import glob
import pandas 
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1)
for filename in glob.glob('../data/gapminder_gdp*.csv'):
    dataframe = pandas.read_csv(filename)
    # extract <region> from a filename of the form
    # '../data/gapminder_gdp_<region>.csv'
    region = filename.rpartition('_')[2][:-4]
    # average over countries; numeric_only skips the text 'country' column
    dataframe.mean(numeric_only=True).plot(ax=ax, label=region, rot=90)
plt.legend()
plt.show()
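
The slice `[:-4]` above assumes the extension is exactly four characters long (`.csv`). As an alternative sketch, `pathlib` can remove the extension whatever its length. The helper name `region_from_filename` is hypothetical and not part of the lesson.

```python
from pathlib import Path


def region_from_filename(filename):
    """Return the text after the last underscore in the file's stem."""
    # Path(...).stem drops the directory and the extension,
    # e.g. '../data/gapminder_gdp_asia.csv' -> 'gapminder_gdp_asia'
    stem = Path(filename).stem
    # rpartition splits at the last '_'; element 2 is what follows it.
    return stem.rpartition('_')[2]


print(region_from_filename('../data/gapminder_gdp_asia.csv'))
print(region_from_filename('../data/gapminder_gdp_africa.csv'))
```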

# Where to?

[Check-in](02-Checkin.ipynb) on putting some data sets together.

### Questions:
- How can I process many data sets with a single command?

### Objectives:
- Be able to read and write globbing expressions that match sets of files.
- Use `glob` to create lists of files.
- Write `for` loops to perform operations on files given their names in a list.

### Keypoints:
- Use a `for` loop to process files given a list of their names.
- Use `glob.glob` to find sets of files whose names match a pattern.
- Use `glob` and `for` to process batches of files.

## References

### Software Carpentry
* [Looping Over Data Sets](http://swcarpentry.github.io/python-novice-gapminder/13-looping-data-sets/)
