## `glob()`ing: Selecting multiple filenames without hard-coding their names

Often, you have many files to read in, and you don't want to type the names of every one of them. "Hard-coding" filenames like this is tedious and error-prone. It's also not flexible or scalable: if you collect more data or remove some bad data files, you have to manually update your list of file names. This is not a big deal if you only have three files, but it's much more work (with more room for error) if you have 20, 50, or 100 files.

Fortunately, Python has ways of listing all the files in a folder (or even in sub-folders), and using *filters* and *wildcards* to list only files whose names match specific criteria. This is done using a function called `glob()` from the package `glob` (yes, [seriously](https://docs.python.org/3/library/glob.html)). To use glob you first have to import it:

In [1]:
import pandas as pd 
from glob import glob

`glob()` requires one argument, which is a string specifying the pattern you want to match filenames against. For example, if you wanted to use `glob()` to list one of the input files above, `s1.csv`, you would use:

In [2]:
glob('s1.csv')

['s1.csv']

So you can see, `glob()` returns a list of filenames that match the string you specified. This is not terribly useful when you provide an exact file name, except it can tell you if a file is present or not. For example, if we glob a filename that doesn't exist, we get an empty list:

In [3]:
glob('ceci_nest_pas_une_file.txt')

[]
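Because an empty list counts as "false" in Python, the result of `glob()` can be used directly in an `if` statement to check whether a file is present. Here is a minimal sketch of that idea; the variable name `matches` and the printed messages are just illustrative choices:

```python
from glob import glob

# glob() returns a list; it is empty when nothing matches the pattern
matches = glob('ceci_nest_pas_une_file.txt')

if matches:
    print('Found:', matches)
else:
    print('No matching file found')
```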

Where `glob()` really gets useful, though, is in its use of filters and wildcards. `glob()` uses Unix shell-style *wildcard patterns*, a simple syntax for matching particular patterns of characters. These patterns look a bit like [**regular expressions**](https://en.wikipedia.org/wiki/Regular_expression) (regexp), but they are a different, much simpler syntax. For now we'll use the simplest-but-most-powerful pattern of them all: the wildcard, `*`.

The `*` is called the "wildcard" because, just like a "wild" card in a card game (which can be used to represent any other card in the deck), `*` matches any number of any characters, including none at all. So if we do:

In [4]:
glob('*')

['lesson5_ch1a.ipynb',
 'session_1.csv',
 'maze_data_1.csv',
 'maze_data_3.csv',
 'birthday_months.csv',
 'session_3.csv',
 'session_2.csv',
 'lesson5_ch1c.ipynb',
 'maze_data_2.csv',
 'lesson5_ch2j.ipynb',
 'lesson5.md',
 'lesson5_ch1g.ipynb',
 'lesson5_ch3c.ipynb',
 'learning_objectives.md',
 'lesson5_ch3a.ipynb',
 'lesson5_ch1e.ipynb',
 'lesson5_ch2h.ipynb',
 'images',
 'lesson5_ch1b.ipynb',
 'visuzalization.md',
 'lesson5_ch3d.ipynb',
 'vis_extras.md',
 'rt_data.csv',
 'lesson5_ch2i.ipynb',
 'lesson5_ch1d.ipynb',
 'introduction.md',
 'lesson5_ch1f.ipynb',
 'sample_data.xlsx',
 'lesson5_ch3b.ipynb',
 'lesson5_ch2g.ipynb',
 'lesson5_ch1j.ipynb',
 'lesson5_ch1h.ipynb',
 'getting_help.md',
 'lesson5_ch2e.ipynb',
 'lesson2.ipynb',
 's2.csv',
 's3.csv',
 'lesson5_ch2a.ipynb',
 'lesson5_ch2c.ipynb',
 's1.csv',
 'lesson5_ch2d.ipynb',
 'eye_colour.csv',
 'lesson5_ch1i.ipynb',
 'study1.csv',
 'rt_data_code.txt',
 'lesson1.ipynb',
 'study2.csv',
 'lesson5_ch2f.ipynb',
 'data',
 'lesson5_ch2b.

...we get a list of all the files and folders in the current directory (note that the folders `images` and `data` are listed too), because every name contains some number of characters. The one exception is names that start with a `.` (hidden files), which `*` does not match.

This gets useful when we combine the wildcard with a *substring* - a set of characters that appears in one or more file names. For example, if we want to list all files that end with the `.csv` extension, we use:

In [5]:
glob('*.csv')

['session_1.csv',
 'maze_data_1.csv',
 'maze_data_3.csv',
 'birthday_months.csv',
 'session_3.csv',
 'session_2.csv',
 'maze_data_2.csv',
 'rt_data.csv',
 's2.csv',
 's3.csv',
 's1.csv',
 'eye_colour.csv',
 'study1.csv',
 'study2.csv',
 'fav_colour.csv']

We can get fancier, too. For example, if we have a number of CSV files whose names all start with `study`, followed by a number, we can use:

In [6]:
filenames = glob('study*.csv')
print(filenames)

['study1.csv', 'study2.csv']


However, you have to be careful and check that you are being specific enough. For example, above we hard-coded the names of three files:
`filenames = ['s1.csv', 's2.csv', 's3.csv']`
    
We could try using `glob()` to get the same result:

In [7]:
filenames = glob('s*.csv')
print(filenames)

['session_1.csv', 'session_3.csv', 'session_2.csv', 's2.csv', 's3.csv', 's1.csv', 'study1.csv', 'study2.csv']


...but you can see we get several other files whose names also start with `s` but have more than just a number after them. Fortunately, there's another, more restricted wildcard that matches any *single* character rather than any number of characters: `?`

In [8]:
filenames = glob('s?.csv')
print(filenames)

['s2.csv', 's3.csv', 's1.csv']
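Notice that the file names above are not in alphabetical order. `glob()` returns matches in whatever order the file system provides them, so if the order matters (for example, to process participants in sequence), you can wrap the call in Python's built-in `sorted()`. The sketch below creates three empty files in a temporary folder purely so that it can run anywhere; the temporary folder and the name `basenames` are assumptions made for illustration:

```python
import os
import tempfile
from glob import glob

# Create a throwaway folder with three empty files (for illustration only)
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ['s2.csv', 's3.csv', 's1.csv']:
        open(os.path.join(tmp_dir, name), 'w').close()

    # sorted() returns a new list in alphabetical order
    filenames = sorted(glob(os.path.join(tmp_dir, 's?.csv')))

    # Strip the temporary folder from each path so only the file name is left
    basenames = [os.path.basename(f) for f in filenames]

print(basenames)  # ['s1.csv', 's2.csv', 's3.csv']
```

In the lesson folder itself, the equivalent is simply `sorted(glob('s?.csv'))`. Keep in mind that the sort is alphabetical, so `s10.csv` would come before `s2.csv`; zero-padded numbers such as `01` and `02` avoid that problem.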


If you want to match a specific number of characters, but you don't care what those characters are, you can use that many `?`s:

In [9]:
glob('??.csv')

['s2.csv', 's3.csv', 's1.csv']
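If you want to experiment with patterns without touching any real files, one option is the standard library's `fnmatch` module, which applies the same wildcard rules to a plain list of strings. This is a side-by-side sketch of the patterns used so far; the list `names` is a hand-typed subset of the file names shown above, not something read from disk:

```python
import fnmatch

# A hand-typed list of names to test patterns against (no files needed)
names = ['s1.csv', 's2.csv', 's3.csv', 'session_1.csv', 'study1.csv', 'rt_data.csv']

# * matches any number of characters, so longer names starting with s match too
star_matches = fnmatch.filter(names, 's*.csv')
print(star_matches)
# ['s1.csv', 's2.csv', 's3.csv', 'session_1.csv', 'study1.csv']

# ? matches exactly one character
single_matches = fnmatch.filter(names, 's?.csv')
print(single_matches)
# ['s1.csv', 's2.csv', 's3.csv']

# ?? matches exactly two characters, whatever they are
double_matches = fnmatch.filter(names, '??.csv')
print(double_matches)
# ['s1.csv', 's2.csv', 's3.csv']
```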

We can also use `glob()` to find files inside folders. For example, imagine we have a folder of data from an experiment, which is named `data`. Inside `data`, there is one folder for each participant (named according to the participant's ID code), and each of these folders contains a CSV file named after the participant. The naming convention for the subjects is `subj_??`, where the `??` represents a two-digit ID code, starting with 01 and increasing. We can generate a list of all the data files using:

In [10]:
filenames = glob('data/subj_??/subj_??.csv')
print(filenames)

['data/subj_01/subj_01.csv', 'data/subj_03/subj_03.csv', 'data/subj_02/subj_02.csv', 'data/subj_05/subj_05.csv']


Note here that we use `/` to separate folder names in the pattern, so `data/subj_??` means "a subfolder of `data` whose name matches `subj_??`".
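If you don't know (or don't want to spell out) how deeply the files are nested, `glob()` also accepts the special pattern `**` together with the argument `recursive=True`, which searches the starting folder and all of its sub-folders. The sketch below builds a small stand-in for the `data` folder inside a temporary directory so that it can run anywhere; the temporary directory, the two example subjects, and the name `relative` are assumptions made for illustration:

```python
import os
import tempfile
from glob import glob

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build a small stand-in layout: data/subj_01/subj_01.csv, and so on
    for subj in ['subj_01', 'subj_02']:
        folder = os.path.join(tmp_dir, 'data', subj)
        os.makedirs(folder)
        open(os.path.join(folder, subj + '.csv'), 'w').close()

    # ** matches any number of nested folders, but only when recursive=True
    pattern = os.path.join(tmp_dir, 'data', '**', '*.csv')
    filenames = sorted(glob(pattern, recursive=True))

    # Show paths relative to the temporary folder, using / as the separator
    relative = [os.path.relpath(f, tmp_dir).replace(os.sep, '/') for f in filenames]

print(relative)  # ['data/subj_01/subj_01.csv', 'data/subj_02/subj_02.csv']
```

With the lesson's own folder, the equivalent call would be `glob('data/**/*.csv', recursive=True)`. A recursive search picks up every CSV file under `data`, so it is less specific than the `subj_??` pattern.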

Also note that `glob()` can save us a lot of trouble, because:
1. It scales/adapts to any number of files
2. It makes no assumptions about the filenames, beyond what you tell it. For example, above you'll see there is no `subj_04`, but `glob()` doesn't care - it only tells you what *is* there. 

Conversely, hard-coding file names becomes more tedious and error-prone the more there are, and you have to carefully double-check that you don't specify the names of files you don't actually have.
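To make the comparison concrete, here is a small sketch that recreates the situation above (subjects 01, 02, 03, and 05, with no 04) in a temporary folder, then compares what `glob()` finds against a hard-coded list that assumes subjects 01 to 05 all exist. The temporary folder and the names `present`, `found`, `hard_coded`, and `missing` are illustrative assumptions:

```python
import os
import tempfile
from glob import glob

# The subjects that actually have data (note there is no subj_04)
present = ['subj_01', 'subj_02', 'subj_03', 'subj_05']

with tempfile.TemporaryDirectory() as tmp_dir:
    # Recreate the data/subj_??/subj_??.csv layout with empty files
    for subj in present:
        folder = os.path.join(tmp_dir, 'data', subj)
        os.makedirs(folder)
        open(os.path.join(folder, subj + '.csv'), 'w').close()

    # glob() lists only the files that really exist
    pattern = os.path.join(tmp_dir, 'data', 'subj_??', 'subj_??.csv')
    found = sorted(os.path.basename(f) for f in glob(pattern))

# A hard-coded list that wrongly assumes subjects 01 through 05 all exist
hard_coded = ['subj_%02d.csv' % i for i in range(1, 6)]

# Names in the hard-coded list that would fail when we try to read them
missing = [name for name in hard_coded if name not in found]

print(found)    # ['subj_01.csv', 'subj_02.csv', 'subj_03.csv', 'subj_05.csv']
print(missing)  # ['subj_04.csv']
```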