<a href="https://colab.research.google.com/github/CBIIT/python-carpentry-workshop/blob/main/week_5b_looping_over_data_sets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Looping Over Data Sets

## Download data sets





In [1]:
# Download two data sets from the web
!wget https://raw.githubusercontent.com/swcarpentry/python-novice-gapminder/gh-pages/data/gapminder_gdp_africa.csv
!wget https://raw.githubusercontent.com/swcarpentry/python-novice-gapminder/gh-pages/data/gapminder_gdp_asia.csv

--2022-06-07 19:05:26--  https://raw.githubusercontent.com/swcarpentry/python-novice-gapminder/gh-pages/data/gapminder_gdp_africa.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8095 (7.9K) [text/plain]
Saving to: ‘gapminder_gdp_africa.csv’


2022-06-07 19:05:26 (93.6 MB/s) - ‘gapminder_gdp_africa.csv’ saved [8095/8095]

--2022-06-07 19:05:26--  https://raw.githubusercontent.com/swcarpentry/python-novice-gapminder/gh-pages/data/gapminder_gdp_asia.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5103 (5.0K) [text/plain]
Savi

In [2]:
!ls -F

gapminder_gdp_africa.csv    gapminder_gdp_asia.csv    sample_data/
gapminder_gdp_africa.csv.1  gapminder_gdp_asia.csv.1


In [3]:
!mv *.csv sample_data/

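The `*.csv` in the command above is a wildcard pattern: the shell expands it to every file name that matches. The most common patterns are: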
- `*` meaning "match zero or more characters"
- `?` meaning "match exactly one character"
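
To see how the two wildcards differ, here is a minimal sketch using `fnmatch.fnmatchcase` from Python's standard library, which applies Unix shell-style wildcard rules to plain strings. The file names in the list are made up for illustration and do not exist on disk.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, used only to illustrate the patterns.
names = ['data.csv', 'data1.csv', 'data12.csv', 'notes.txt']

# `*` matches zero or more characters.
star_matches = [n for n in names if fnmatchcase(n, 'data*.csv')]

# `?` matches exactly one character.
question_matches = [n for n in names if fnmatchcase(n, 'data?.csv')]

print(star_matches)      # ['data.csv', 'data1.csv', 'data12.csv']
print(question_matches)  # ['data1.csv']
```

Note that `data?.csv` rejects both `data.csv` (no character in that position) and `data12.csv` (two characters), while `data*.csv` accepts all three.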

## Use a `for` loop to process files given a list of their names


- A filename is a character string.
- Lists can contain character strings.

In [4]:
import pandas as pd
for filename in ['sample_data/gapminder_gdp_africa.csv', 'sample_data/gapminder_gdp_asia.csv']:
    data = pd.read_csv(filename, index_col='country')
    print(filename, data.min())

sample_data/gapminder_gdp_africa.csv gdpPercap_1952    298.846212
gdpPercap_1957    335.997115
gdpPercap_1962    355.203227
gdpPercap_1967    412.977514
gdpPercap_1972    464.099504
gdpPercap_1977    502.319733
gdpPercap_1982    462.211415
gdpPercap_1987    389.876185
gdpPercap_1992    410.896824
gdpPercap_1997    312.188423
gdpPercap_2002    241.165876
gdpPercap_2007    277.551859
dtype: float64
sample_data/gapminder_gdp_asia.csv gdpPercap_1952    331.0
gdpPercap_1957    350.0
gdpPercap_1962    388.0
gdpPercap_1967    349.0
gdpPercap_1972    357.0
gdpPercap_1977    371.0
gdpPercap_1982    424.0
gdpPercap_1987    385.0
gdpPercap_1992    347.0
gdpPercap_1997    415.0
gdpPercap_2002    611.0
gdpPercap_2007    944.0
dtype: float64


## Use `glob.glob` to find sets of files whose names match a pattern

- Python’s standard library contains the `glob` module, which provides pattern-matching functionality.
- The `glob` module contains a function, also called `glob`, that matches file name patterns.
- E.g., `glob.glob('*.txt')` matches all files in the current directory whose names end with `.txt`.
- Result is a (possibly empty) list of character strings.
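
The points above can be tried without any real data. The sketch below is an illustration only: it creates a throwaway directory with a few empty files (the names are hypothetical) and then matches them. Because `glob.glob` returns matches in arbitrary order, the result is sorted to make it predictable.

```python
import glob
import os
import tempfile

# Create a temporary directory holding a few empty, hypothetical files.
with tempfile.TemporaryDirectory() as tmpdir:
    for name in ['a.txt', 'b.txt', 'c.csv']:
        open(os.path.join(tmpdir, name), 'w').close()

    # Sort the matches, since glob.glob does not guarantee any order.
    txt_files = sorted(glob.glob(os.path.join(tmpdir, '*.txt')))

    # A pattern that matches nothing returns an empty list, not an error.
    pdb_files = glob.glob(os.path.join(tmpdir, '*.pdb'))

# Strip the temporary directory so only the file names remain.
txt_names = [os.path.basename(f) for f in txt_files]
print(txt_names)  # ['a.txt', 'b.txt']
print(pdb_files)  # []
```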

In [5]:
import glob
print('all csv files in data directory:', glob.glob('sample_data/*.csv'))

all csv files in data directory: ['sample_data/gapminder_gdp_africa.csv', 'sample_data/gapminder_gdp_asia.csv', 'sample_data/california_housing_train.csv', 'sample_data/california_housing_test.csv', 'sample_data/mnist_train_small.csv', 'sample_data/mnist_test.csv']


In [6]:
print('all PDB files:', glob.glob('sample_data/*.pdb'))

all PDB files: []


## Use `glob` and `for` to process batches of files

- It helps a lot if the files are named and stored systematically and consistently, so that simple patterns find the right data.
- Use a more specific pattern, such as `gapminder_gdp_*.csv`, to exclude files you do not want, such as the other CSV files in `sample_data/`.

In [7]:
for filename in glob.glob('sample_data/gapminder_gdp_*.csv'):
    data = pd.read_csv(filename)
    print(filename, data['gdpPercap_1952'].min())

sample_data/gapminder_gdp_africa.csv 298.8462121
sample_data/gapminder_gdp_asia.csv 331.0


## Dealing with File Paths

The `pathlib` module provides useful abstractions for file and path manipulation, such as returning the name of a file without its extension. This is very useful when looping over files and directories. In the example below, we create a `Path` object and inspect its attributes.

In [8]:
from pathlib import Path

p = Path("sample_data/gapminder_gdp_africa.csv")
print(p.parent)
print(p.stem)
print(p.suffix)

sample_data
gapminder_gdp_africa
.csv
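
As a rough sketch of how `Path` attributes help inside a loop, the example below uses `Path.glob` to find matching files and `stem` to derive a short label from each name. It is an illustration under assumptions: the files are empty placeholders created in a temporary directory, and taking the text after the last underscore as the region is a convention chosen for this example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmpdir:
    data_dir = Path(tmpdir)

    # Create empty placeholder files that mimic the lesson's naming scheme.
    for region in ['africa', 'asia']:
        (data_dir / f'gapminder_gdp_{region}.csv').touch()
    (data_dir / 'notes.txt').touch()  # should not match the pattern

    regions = []
    # Path.glob yields Path objects; sort them for a predictable order.
    for path in sorted(data_dir.glob('gapminder_gdp_*.csv')):
        # stem drops the '.csv' suffix; the last '_' piece is the region.
        regions.append(path.stem.split('_')[-1])

print(regions)  # ['africa', 'asia']
```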


## Exercises



1.   [Determining Matches](http://swcarpentry.github.io/python-novice-gapminder/14-looping-data-sets/index.html#determining-matches)
2.   [Minimum File Size](http://swcarpentry.github.io/python-novice-gapminder/14-looping-data-sets/index.html#minimum-file-size)
3.   [Comparing Data](http://swcarpentry.github.io/python-novice-gapminder/14-looping-data-sets/index.html#comparing-data)



## Key Points

- Use a `for` loop to process files given a list of their names.
- Use `glob.glob` to find sets of files whose names match a pattern.
- Use `glob` and `for` to process batches of files.


