# Colab only
❗ This notebook is designed to run on Google Colab.   
🎉 These top few cells should install the necessary libraries and set you up to run the rest of the notebook!

You should already have a Google account ready to go!

To bore you with details, these cells will:
- Install needed packages which are not already installed on Google Colab
- Download the needed data and extract it where it is needed

### 🧙‍♀️ Wizards
🧙‍♂️ If you're a wizard, here's some info about Colab you may want to know:
- Each Colab notebook runs on its own temporary Linux virtual machine with its own filesystem.   
- If your notebook is shut down, the temporary instance is deleted; this is why you need to mount Google Drive to keep files between sessions
- Colab seems to let you have about 3 notebook instances running at any time; each of these runs on its own VM

In [None]:
# Now ensure the data we need is in the correct place
!wget --no-verbose --output-document data.zip https://github.com/CurtinIDS/CIDS_Carpentries_Python/releases/download/stable/data.tar.gz
!unzip -o data.zip -d ../data
# Now list the contents to make sure we see the inflammation CSV files
!echo 
!echo The data folder contents:
!ls -al ../data

In [None]:
# This cell has been automatically inserted from build_scripts/colab_nb_builder.py
# It should make this notebook google-colab compatible!

!pip install -q --upgrade pip 
!pip install -q ipykernel
!pip install -q numpy
!pip install -q matplotlib
!pip install -q pandas
!echo All done! Test below if it works.

# CIDS Carpentries Workshop - Episode 6 - Analysing Data from Multiple Files
This lesson is adapted from the Software Carpentry [Programming with Python](https://swcarpentry.github.io/python-novice-inflammation/index.html) lesson.

---

## ❓ Questions and Objectives
What should you be able to answer by the end of this episode?

### Questions
- How can I do the same operations on many different files?


### Objectives
- Use a library function to get a list of filenames that match a wildcard pattern.
- Write a `for` loop to process multiple files.

---


As a final piece to processing our inflammation data, we need a way to get a list of all the files in our `data` directory whose names start with `inflammation-` and end with `.csv`. The following library will help us to achieve this:

In [None]:
# Importing libraries


The `glob` library contains a function, also called `glob`, that finds files and directories whose names match a pattern. We provide those patterns as strings: the character `*` matches zero or more characters, while `?` matches any one character.
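
To make the difference between the two wildcards concrete, here is a minimal sketch that runs without the workshop data. It assumes a throwaway temporary directory and a handful of made-up file names; none of them are part of the lesson's `data` folder.

```python
import glob
import os
import tempfile

# Hypothetical file names, created in a throwaway directory purely for illustration
example_names = ['inflammation-01.csv', 'inflammation-02.csv',
                 'inflammation-10.csv', 'small-01.csv', 'notes.txt']

with tempfile.TemporaryDirectory() as tmpdir:
    # Create an empty file for each made-up name
    for name in example_names:
        open(os.path.join(tmpdir, name), 'w').close()

    # `*` matches zero or more characters, so every inflammation CSV file matches
    star_pattern = os.path.join(tmpdir, 'inflammation*.csv')
    star_matches = sorted(os.path.basename(p) for p in glob.glob(star_pattern))

    # `?` matches exactly one character, so only files 01 and 02 fit this pattern
    question_pattern = os.path.join(tmpdir, 'inflammation-0?.csv')
    question_matches = sorted(os.path.basename(p) for p in glob.glob(question_pattern))

print(star_matches)
print(question_matches)
```

Neither pattern picks up `small-01.csv` or `notes.txt`, because the literal parts of a pattern must match exactly.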

We can use this to get the names of all the CSV files in a directory:

In [None]:
# Printing the list of files in the data folder that start with inflammation


As these examples show, `glob.glob`'s result is a list of file and directory paths in arbitrary order. This means we can loop over it to do something with each filename in turn.

In our case, the "something" we want to do is generate a set of plots for each file in our inflammation dataset.

If we want to start by analysing just the first three files in alphabetical order, we can use the `sorted` built-in function to generate a new sorted list from the `glob.glob` output:

In [None]:
# Sorting and slicing the list of files in the data folder that start with inflammation
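
A minimal sketch of the sort-and-slice step is shown here. It uses a hard-coded list of hypothetical paths in place of real `glob.glob` output, so it runs without the data folder.

```python
# Hypothetical, deliberately unordered paths standing in for glob.glob output
unsorted_paths = ['../data/inflammation-03.csv',
                  '../data/inflammation-01.csv',
                  '../data/inflammation-12.csv',
                  '../data/inflammation-02.csv']

# sorted() returns a new, alphabetically ordered list and leaves the original untouched
sorted_paths = sorted(unsorted_paths)

# Slicing with [0:3] keeps only the first three entries
first_three = sorted_paths[0:3]

print(first_three)
```

Alphabetical order matches numerical order here only because the file numbers are zero-padded (`01`, `02`, ...). In the notebook itself, the same two steps would be applied to the `glob.glob` result and stored in `filenames` for the loop that follows.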


Revisiting the code from `Grouping Plots` in `3_Visualising_Tabular_Data`, we will use a `for` loop to generate subplots for the average, max and min for our files.

In [None]:
# Importing libraries
import numpy as np
import matplotlib.pyplot as plt

# Looping over each file in the list
for filename in filenames:
    print(filename)

    # Copying the code from the previous lesson
    data = np.loadtxt(fname=filename, delimiter=',')

    fig = plt.figure(figsize=(10.0, 3.0))

    axes1 = fig.add_subplot(1, 3, 1)
    axes2 = fig.add_subplot(1, 3, 2)
    axes3 = fig.add_subplot(1, 3, 3)

    axes1.set_ylabel('average')
    axes1.plot(np.mean(data, axis=0))

    axes2.set_ylabel('max')
    axes2.plot(np.amax(data, axis=0))

    axes3.set_ylabel('min')
    axes3.plot(np.amin(data, axis=0))

    fig.tight_layout()
    plt.show()

The plots generated for the second clinical trial file look very similar to those for the first file: their average plots show similar "noisy" rises and falls, their maxima plots show exactly the same linear rise and fall, and their minima plots show similar staircase structures.

The third dataset shows much noisier average and maxima plots that are far less suspicious than those of the first two datasets. However, its minima plot shows that the minimum is consistently zero on every day of the trial. If we produce a heatmap for the third data file, we see the following.

In [None]:
# Plotting a heatmap of the third datafile


We can see that there are zero values sporadically distributed across all patients and days of the clinical trial, suggesting that there were potential issues with data collection throughout the trial. In addition, we can see that the last patient in the study didn't have any inflammation flare-ups at all throughout the trial, suggesting that they may not even suffer from arthritis!
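
Observations like these can also be checked numerically rather than by eye. The sketch below uses a small made-up array (rows are patients, columns are days) instead of the real third data file, so the numbers are illustrative only.

```python
import numpy as np

# Made-up stand-in for a data file: 4 patients (rows) by 5 days (columns)
example_data = np.array([
    [0.0, 1.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 3.0, 2.0, 0.0],
    [0.0, 2.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],  # a patient with no recorded inflammation
])

# Count the zero readings on each day by summing booleans down each column
zeros_per_day = np.sum(example_data == 0, axis=0)

# Find the row indices of patients whose readings are zero on every day
all_zero_patients = np.where(np.all(example_data == 0, axis=1))[0]

print(zeros_per_day)
print(all_zero_patients)
```

With real data, `example_data` would be replaced by the array returned from `np.loadtxt` for the file of interest.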

---

## 🏆 Exercises

### ✏️ Exercise 1 : Plotting Differences

Plot the difference between the average inflammation reported in the first and second datasets (stored in `inflammation-01.csv` and `inflammation-02.csv`, respectively).
That is, plot the difference between the leftmost plots of the first two figures.

### ✏️ Exercise 2 : Generate Composite Statistics

Use each of the files once to generate a dataset containing values averaged over all patients by completing the code inside the loop given below:

In [None]:
filenames = glob.glob('../data/inflammation*.csv')
composite_data = np.zeros((60, 40))
for filename in filenames:
    data = np.loadtxt(fname=filename, delimiter=',')
    # sum each new file's data into composite_data as it's read
# and then divide the composite_data by number of samples


Then use pyplot to plot the average, max, and min of `composite_data` across all patients.

In [None]:
fig = plt.figure(figsize=(10.0, 3.0))

axes1 = fig.add_subplot(1, 3, 1)
axes2 = fig.add_subplot(1, 3, 2)
axes3 = fig.add_subplot(1, 3, 3)

axes1.set_ylabel('average')
axes1.plot(np.mean(composite_data, axis=0))

axes2.set_ylabel('max')
axes2.plot(np.amax(composite_data, axis=0))

axes3.set_ylabel('min')
axes3.plot(np.amin(composite_data, axis=0))

fig.tight_layout()

plt.show()

---

After spending some time investigating the heatmap and statistical plots, as well as doing the above exercises to plot differences between datasets and to generate composite patient statistics, we gain some insight into the twelve clinical trial datasets.

The datasets appear to fall into two categories:
1. seemingly "ideal" datasets that agree excellently with Dr. Maverick's claims, but display suspicious maxima and minima (such as `inflammation-01.csv` and `inflammation-02.csv`)
2. "noisy" datasets that somewhat agree with Dr. Maverick's claims, but show concerning data collection issues such as sporadic missing values and even an unsuitable candidate making it into the clinical trial

In fact, it appears that all three of the "noisy" datasets (`inflammation-03.csv`, `inflammation-08.csv`, and `inflammation-11.csv`) are identical down to the last value. Armed with this information, we confront Dr. Maverick about the suspicious data and duplicated files.
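
One way to test a duplication claim like this is a byte-for-byte file comparison with the standard library's `filecmp` module. This is a stricter check than comparing loaded values, since two files holding the same numbers in different text formats would not count as identical. The sketch below is an assumption about how such a check could be written, not the method used to prepare this lesson; the helper `find_identical_files` and the throwaway files are hypothetical.

```python
import filecmp
import os
import tempfile


def find_identical_files(paths):
    """Return pairs of paths whose contents are byte-for-byte identical."""
    pairs = []
    for i, first in enumerate(paths):
        for second in paths[i + 1:]:
            # shallow=False compares the file contents, not just size and timestamps
            if filecmp.cmp(first, second, shallow=False):
                pairs.append((first, second))
    return pairs


# Throwaway files with made-up contents, used only to exercise the helper
with tempfile.TemporaryDirectory() as tmpdir:
    contents = {'a.csv': '0,1,2\n', 'b.csv': '0,1,3\n', 'c.csv': '0,1,2\n'}
    paths = []
    for name, text in sorted(contents.items()):
        path = os.path.join(tmpdir, name)
        with open(path, 'w') as handle:
            handle.write(text)
        paths.append(path)

    identical = [(os.path.basename(a), os.path.basename(b))
                 for a, b in find_identical_files(paths)]

print(identical)
```

Passing the sorted list of inflammation file paths to the helper would list every duplicated pair.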

Dr. Maverick has admitted to fabricating the clinical data for their drug trial. They did this after discovering that the initial trial had several issues, including unreliable data recording and poor participant selection. In order to prove the efficacy of their drug, they created fake data. When asked for additional data, they attempted to generate more fake datasets, and also included the original poor-quality dataset several times in order to make the trials seem more realistic.

🎉 Congratulations! We've investigated the inflammation data and proven that the datasets have been synthetically generated.

But it would be a shame to throw away the synthetic datasets that have taught us so much already, so we'll forgive the imaginary Dr. Maverick and continue to use the data to learn how to program!

---

## 🔑 Key Points
- Use `glob.glob(pattern)` to create a list of files whose names match a pattern.
- Use `*` in a pattern to match zero or more characters, and `?` to match any single character.

