<a href="https://colab.research.google.com/github/CBIIT/python-carpentry-workshop/blob/main/python_looping_over_data_sets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Looping Over Data Sets

## Download two data sets





In [None]:
# Download two data sets from the web
!wget https://raw.githubusercontent.com/swcarpentry/python-novice-gapminder/gh-pages/data/gapminder_gdp_africa.csv
!wget https://raw.githubusercontent.com/swcarpentry/python-novice-gapminder/gh-pages/data/gapminder_gdp_asia.csv

In [None]:
!ls -F

In [None]:
!mv *.csv sample_data/

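The `*` in the `mv *.csv` command above is a shell wildcard: the shell expands it to every file name that matches the pattern. The most common patterns are: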
- `*` meaning "match zero or more characters"
- `?` meaning "match exactly one character"
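To see the difference between the two wildcards without touching any files, the sketch below uses Python's standard-library `fnmatch` module, which applies the same shell-style patterns to plain strings. The file names in `names` are hypothetical and chosen only for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only to illustrate the two wildcards.
names = ['data.csv', 'data1.csv', 'data12.csv', 'notes.txt']

# '*' matches zero or more characters, so all three data*.csv names match.
star_matches = [n for n in names if fnmatchcase(n, 'data*.csv')]

# '?' matches exactly one character, so only 'data1.csv' matches.
question_matches = [n for n in names if fnmatchcase(n, 'data?.csv')]

print(star_matches)      # ['data.csv', 'data1.csv', 'data12.csv']
print(question_matches)  # ['data1.csv']
```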

## Use a `for` loop to process files given a list of their names

In [None]:
import pandas as pd
for filename in ['sample_data/gapminder_gdp_africa.csv', 'sample_data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())

## Use `glob.glob` to find sets of files whose names match a pattern

- Python’s standard library contains the `glob` module to provide pattern matching functionality
- The `glob` module contains a function also called `glob` to match file patterns
- E.g., `glob.glob('*.txt')` matches all files in the current directory whose names end with `.txt`.
- Result is a (possibly empty) list of character strings.
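The points above can be tried in isolation with a throwaway directory. This is a minimal sketch: the file names `a.txt`, `b.txt`, and `c.csv` are made up for the demonstration, and the temporary directory is deleted when the `with` block ends.

```python
import glob
import os
import tempfile

# Build a throwaway directory with hypothetical file names for the demo.
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ['a.txt', 'b.txt', 'c.csv']:
        open(os.path.join(tmpdir, name), 'w').close()

    # glob.glob returns matches in arbitrary order, so sort for stable output.
    txt_files = sorted(glob.glob(os.path.join(tmpdir, '*.txt')))

    # A pattern that matches nothing gives an empty list, not an error.
    pdb_files = glob.glob(os.path.join(tmpdir, '*.pdb'))

    txt_names = [os.path.basename(f) for f in txt_files]

print(txt_names)  # ['a.txt', 'b.txt']
print(pdb_files)  # []
```

Because the order of the returned list is not guaranteed, wrapping the call in `sorted()` is a reasonable habit whenever the processing order matters.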

In [None]:
import glob
print('all csv files in data directory:', glob.glob('sample_data/*.csv'))

In [None]:
print('all PDB files:', glob.glob('*.pdb'))

## Use `glob` and `for` to process batches of files

- It helps a lot if the files are named and stored systematically and consistently, so that simple patterns will find the right data.
- Use a more specific pattern to exclude files you do not want, such as a single file that holds the whole data set.

In [None]:
for filename in glob.glob('sample_data/gapminder_gdp_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())
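The effect of a more specific pattern can be sketched without downloading anything. The example below assumes a hypothetical combined file named `gapminder_all.csv` sitting next to the two regional files, and it fills all three with made-up numbers (not real Gapminder values). It uses the standard-library `csv` module instead of pandas so that it runs on its own.

```python
import csv
import glob
import os
import tempfile

# Made-up rows for illustration only; these are not real Gapminder values.
fake_files = {
    'gapminder_gdp_africa.csv': [('A', 10.0), ('B', 5.0)],
    'gapminder_gdp_asia.csv': [('C', 7.0), ('D', 9.0)],
    'gapminder_all.csv': [('A', 10.0), ('C', 7.0)],  # hypothetical combined file
}

minima = {}
with tempfile.TemporaryDirectory() as tmpdir:
    for name, rows in fake_files.items():
        with open(os.path.join(tmpdir, name), 'w', newline='') as f:
            writer = csv.writer(f)
            writer.writerow(['country', 'gdpPercap_1952'])
            writer.writerows(rows)

    # The broad pattern picks up the combined file as well.
    broad = glob.glob(os.path.join(tmpdir, '*.csv'))

    # The specific pattern skips it; sorted() fixes the processing order.
    for path in sorted(glob.glob(os.path.join(tmpdir, 'gapminder_gdp_*.csv'))):
        with open(path, newline='') as f:
            values = [float(row['gdpPercap_1952']) for row in csv.DictReader(f)]
        minima[os.path.basename(path)] = min(values)

print(len(broad))  # 3
print(minima)      # {'gapminder_gdp_africa.csv': 5.0, 'gapminder_gdp_asia.csv': 7.0}
```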

## Dealing with File Paths

The `pathlib` module provides useful abstractions for file and path manipulation, such as returning the name of a file without its extension. This is very useful when looping over files and directories. In the example below, we create a `Path` object and inspect its attributes.

In [None]:
from pathlib import Path

p = Path("sample_data/gapminder_gdp_africa.csv")
print(p.parent)
print(p.stem)
print(p.suffix)
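A common reason to split a path this way is to build an output file name from an input file name. The sketch below shows one way to do that with `Path.with_suffix` and `Path.with_name`. The `.png` and `_summary.csv` output names are hypothetical, and no files are read or written.

```python
from pathlib import Path

p = Path('sample_data/gapminder_gdp_africa.csv')

# with_suffix swaps the extension; with_name replaces the final path component.
png_path = p.with_suffix('.png')
summary_path = p.with_name(p.stem + '_summary.csv')

# as_posix() prints forward slashes on every operating system.
print(p.name)                   # gapminder_gdp_africa.csv
print(png_path.as_posix())      # sample_data/gapminder_gdp_africa.png
print(summary_path.as_posix())  # sample_data/gapminder_gdp_africa_summary.csv
```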