## Python Library Ecosystem Exercise: Parsing, Manipulating, and Exploring Patient Metadata

We are going to revisit the same public dataset of COVID-19 chest x-ray images as before. Now, instead of writing loops and iterating over data entries one by one, we will use Python libraries to save ourselves a lot of coding and, often, gain speed.

To start, let's make sure the libraries are installed:

### Install Libraries

Open a command prompt or terminal and use `pip` to install the libraries.

```bash
pip install numpy scipy matplotlib pandas scikit-image scikit-learn
```

Check that these libraries are installed, either by printing the list of installed packages:

```bash
pip list
```

or by opening Python in the terminal and attempting to import each one:

In [149]:
import importlib # this is only to import libraries in a loop by name. You would normally use 'import numpy' etc.

library_names = ['numpy', 'scipy', 'matplotlib', 'pandas', 'skimage', 'sklearn']

for lib in library_names:
    try:
        importlib.import_module(lib)
        print('imported, ', lib)
    except ImportError:
        print(lib, ' library not installed ...')

imported,  numpy
imported,  scipy
imported,  matplotlib
imported,  pandas
imported,  skimage
imported,  sklearn
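
If you also want to know which version of each library is installed, the standard library can report it. The following is only a sketch: the helper name `installed_version` and the `pip_names` mapping are made up for this example, and note that the import name and the `pip` name differ for scikit-image and scikit-learn.

```python
import importlib.util
from importlib import metadata

# import name -> name used with pip (hypothetical mapping for this sketch)
pip_names = {
    'numpy': 'numpy',
    'scipy': 'scipy',
    'matplotlib': 'matplotlib',
    'pandas': 'pandas',
    'skimage': 'scikit-image',
    'sklearn': 'scikit-learn',
}

def installed_version(import_name, pip_name):
    """Return the version string, or None if the module cannot be found."""
    if importlib.util.find_spec(import_name) is None:
        return None  # module is not importable
    try:
        return metadata.version(pip_name)
    except metadata.PackageNotFoundError:
        return 'unknown version'  # importable, but no package metadata found

for import_name, pip_name in pip_names.items():
    print(import_name, '->', installed_version(import_name, pip_name))
```

A value of `None` means the library still needs to be installed with `pip`.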


## 1. Read in the metadata.csv using ```pandas```

In [150]:
# let's list the contents of the dataset repository
import os 

os.listdir('../datasets/covid-chestxray-dataset')

['.github',
 '.gitignore',
 '.gitkeep',
 'annotations',
 'docs',
 'images',
 'metadata.csv',
 'README.md',
 'requirements.txt',
 'SCHEMA.md',
 'scripts',
 'tests',
 'volumes']

As before, store the dataset folder in a variable and construct the file path to the metadata programmatically to help readability.

In [151]:
dataset_folder = '../datasets/covid-chestxray-dataset'
metadata_file = os.path.join(dataset_folder,
                            'metadata.csv')
print(metadata_file) # Note the \ if on Windows. It looks weird, but will work fine

../datasets/covid-chestxray-dataset\metadata.csv
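
To see where the mixed separators come from, here is a small sketch using `pathlib` from the standard library. It is not part of the original exercise: the `Pure...Path` classes only build path strings for a given operating system, without touching the disk, so it gives the same result on any machine.

```python
from pathlib import PurePosixPath, PureWindowsPath

# the same path pieces, joined with the rules of each operating system
parts = ('..', 'datasets', 'covid-chestxray-dataset', 'metadata.csv')

posix_path = PurePosixPath(*parts)      # Linux / macOS style, uses '/'
windows_path = PureWindowsPath(*parts)  # Windows style, uses '\'

print(posix_path)
print(windows_path)
```

`os.path.join` only inserts the separator of the current operating system between the pieces it joins, which is why the forward slashes already typed in `dataset_folder` are kept in the output above.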


In [152]:
def pandas_read_csv(filepath):
    import pandas as pd # importing inside the function keeps this example self-contained; imports usually go at the top of the file
    
    return pd.read_csv(filepath)

metadata_table = pandas_read_csv(metadata_file)

# you will see the data is nicely formatted now on print and 'NaN' is used for empty entries
print(metadata_table)

    patientid  offset sex   age                   finding RT_PCR_positive  \
0           2     0.0   M  65.0  Pneumonia/Viral/COVID-19               Y   
1           2     3.0   M  65.0  Pneumonia/Viral/COVID-19               Y   
2           2     5.0   M  65.0  Pneumonia/Viral/COVID-19               Y   
3           2     6.0   M  65.0  Pneumonia/Viral/COVID-19               Y   
4           4     0.0   F  52.0  Pneumonia/Viral/COVID-19               Y   
..        ...     ...  ..   ...                       ...             ...   
945       479     0.0   F  40.0                 Pneumonia             NaN   
946       479    70.0   F  40.0                 Pneumonia             NaN   
947       480     NaN   M  26.0                 Pneumonia             NaN   
948       481     NaN   M  50.0                 Pneumonia             NaN   
949       481     NaN   M  50.0                 Pneumonia             NaN   

    survival intubated intubation_present went_icu  ...              date  

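This structure makes it easy to access contents. For example, the column names act as keys. A single column such as `patientid` can be pulled out with `metadata_table['patientid']`, and `.loc` can slice a range of columns by name. Below we take every column from `patientid` onward, which here is the whole table:

In [153]:
col_subset = metadata_table.loc[:,'patientid':] # the first ':' selects all rows; 'patientid': selects this column and every column after it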

print(col_subset)

    patientid  offset sex   age                   finding RT_PCR_positive  \
0           2     0.0   M  65.0  Pneumonia/Viral/COVID-19               Y   
1           2     3.0   M  65.0  Pneumonia/Viral/COVID-19               Y   
2           2     5.0   M  65.0  Pneumonia/Viral/COVID-19               Y   
3           2     6.0   M  65.0  Pneumonia/Viral/COVID-19               Y   
4           4     0.0   F  52.0  Pneumonia/Viral/COVID-19               Y   
..        ...     ...  ..   ...                       ...             ...   
945       479     0.0   F  40.0                 Pneumonia             NaN   
946       479    70.0   F  40.0                 Pneumonia             NaN   
947       480     NaN   M  26.0                 Pneumonia             NaN   
948       481     NaN   M  50.0                 Pneumonia             NaN   
949       481     NaN   M  50.0                 Pneumonia             NaN   

    survival intubated intubation_present went_icu  ...              date  

### Indexing and slicing the table by rows, or columns 

A `pandas.DataFrame` can be subset by rows or columns using `.loc` (by label or boolean mask) or `.iloc` (by integer position).

In [154]:
import pandas as pd # re-import to make it available globally

# we can subset the first columns, from 'patientid' (1st column, index 0) up to 'RT_PCR_positive' (6th column, index 5) inclusive
subset_col_table_by_column_name = metadata_table.loc[:,'patientid':'RT_PCR_positive'] # subset by column name; the end label is included
subset_col_table_by_column_index = metadata_table.iloc[:,0:6] # subset by integer position; the end position 6 is excluded

print(subset_col_table_by_column_name) # several columns give a pandas.DataFrame; a single column would be a 1D pandas.Series
print('==============================')
print(subset_col_table_by_column_index)
print('==============================')

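# Let's find all rows associated with patientid '479'. The id is compared as a string, so it needs quotes. (The variable names say 'patientid_2' but hold the rows for patient 479.)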
subset_patientid_2_data = metadata_table.loc[metadata_table['patientid'].values=='479']
subset_col_subset_patientid_2_data = subset_col_table_by_column_index.loc[metadata_table['patientid'].values=='479']

print('++++++++++++++++++++++++++++++')
print(subset_patientid_2_data)
print('++++++++++++++++++++++++++++++')
print(subset_col_subset_patientid_2_data)

    patientid  offset sex   age                   finding RT_PCR_positive
0           2     0.0   M  65.0  Pneumonia/Viral/COVID-19               Y
1           2     3.0   M  65.0  Pneumonia/Viral/COVID-19               Y
2           2     5.0   M  65.0  Pneumonia/Viral/COVID-19               Y
3           2     6.0   M  65.0  Pneumonia/Viral/COVID-19               Y
4           4     0.0   F  52.0  Pneumonia/Viral/COVID-19               Y
..        ...     ...  ..   ...                       ...             ...
945       479     0.0   F  40.0                 Pneumonia             NaN
946       479    70.0   F  40.0                 Pneumonia             NaN
947       480     NaN   M  26.0                 Pneumonia             NaN
948       481     NaN   M  50.0                 Pneumonia             NaN
949       481     NaN   M  50.0                 Pneumonia             NaN

[950 rows x 6 columns]
    patientid  offset sex   age                   finding RT_PCR_positive
0           2 

In [155]:
# be careful when subsetting by name (.loc) after selecting rows: the row index is not renumbered, so each row keeps its label from the original table!

print('subsetting by index works')
print(subset_col_subset_patientid_2_data.iloc[0])

print('=================================')
print('=================================')
print('subsetting by name must use the name in original table, else you will get keyerror')
print(subset_col_subset_patientid_2_data.loc[0])

subsetting by index works
patientid                479
offset                     0
sex                        F
age                       40
finding            Pneumonia
RT_PCR_positive          NaN
Name: 945, dtype: object
subsetting by name must use the name in original table, else you will get keyerror


KeyError: 0
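
The same behaviour can be reproduced on a tiny table. The sketch below uses made-up values (the `toy` table is not the real metadata) and shows one possible way around the `KeyError`: renumbering the rows with `reset_index(drop=True)`.

```python
import pandas as pd

# toy table standing in for the metadata (values are invented)
toy = pd.DataFrame({'patientid': ['2', '2', '479', '479'],
                    'age': [65, 65, 40, 40]})

subset = toy.loc[toy['patientid'] == '479']
print(subset.index.tolist())  # the rows keep their original labels

first_by_position = subset.iloc[0]  # first row of the subset, by position
first_by_label = subset.loc[2]      # same row, using its original label

# renumber the rows from 0 if you prefer labels that match positions
renumbered = subset.reset_index(drop=True)
print(renumbered.loc[0])
```

Whether to renumber depends on the task: keeping the original labels lets you trace a row back to the full table.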

There are some quirks when working directly with `pandas.Series` and `pandas.DataFrame` objects and performing mathematical or plotting operations using e.g. `numpy` and `matplotlib`. It is therefore valuable to know how to convert them to pure `numpy` arrays. The downside is that we lose the associated row and column labels.
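
A minimal sketch of the conversion, again on an invented `toy` table rather than the real metadata. `to_numpy()` is available on both `pandas.Series` and `pandas.DataFrame`; the column names have to be kept separately if we still need them.

```python
import numpy as np
import pandas as pd

# toy table standing in for the metadata (values are invented)
toy = pd.DataFrame({'patientid': ['2', '4', '479'],
                    'age': [65.0, 52.0, 40.0]})

age_array = toy['age'].to_numpy()       # one column -> 1D array
table_array = toy.to_numpy()            # whole table -> 2D array
column_names = toy.columns.to_numpy()   # keep the labels separately

print(age_array.shape, age_array.dtype)
print(table_array.shape)
print(column_names)
```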

### Let's compare Pandas vs Our Pure Python naive line-by-line csv reading code

First let's define a function for the previous line-by-line reading code

In [156]:
def read_csv_line_by_line(filepath):
    
    metadata_contents = []

    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f: # note the for loop iteration.
            # strip blank space, split by comma
            line_contents = line.strip().split(',')
            # append into empty list
            metadata_contents.append(line_contents)

    return metadata_contents

We use the `time` function from the `time` python module to time both approaches to reading the .csv file

In [157]:
import time

# first time pandas
t1_pandas = time.time() # start the clock
metadata_table_pandas = pandas_read_csv(metadata_file) 
t2_pandas = time.time() # stop the clock
print('pandas reading time: ', t2_pandas-t1_pandas)

# second time pure python line-by-line
t1_python = time.time()
metadata_table_python = read_csv_line_by_line(metadata_file)
t2_python = time.time()
print('python reading time: ', t2_python-t1_python)

pandas reading time:  0.029920339584350586
python reading time:  0.022936582565307617


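**Wow! Pure Python was faster than pandas!** This is a small warning: using a library does not automatically make code fast. A single function call may hide many steps underneath that slow the code down. Note also that a single timing of a few hundredths of a second is noisy, so differences this small should be read with caution.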
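
For a steadier comparison, each reader can be run several times with the standard `timeit` module. This is only a sketch under assumptions: it times an invented in-memory CSV (`csv_text`) instead of the real `metadata.csv`, and the function names are made up for the example, so the numbers will differ from those above.

```python
import io
import timeit

import pandas as pd

# invented in-memory CSV with 1000 rows, standing in for the real file
csv_text = 'patientid,age\n' + '\n'.join(f'{i},{20 + i % 60}' for i in range(1000))

def read_with_pandas():
    return pd.read_csv(io.StringIO(csv_text))

def read_line_by_line():
    return [line.strip().split(',') for line in io.StringIO(csv_text)]

# run each reader 5 times per measurement, and take 3 measurements
pandas_times = timeit.repeat(read_with_pandas, number=5, repeat=3)
python_times = timeit.repeat(read_line_by_line, number=5, repeat=3)

# the minimum is usually the least disturbed by other activity on the machine
print('pandas best time per read: ', min(pandas_times) / 5)
print('python best time per read: ', min(python_times) / 5)
```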

Our test here was a little unfair, since we don't return an array. Let's use numpy to revise the line-by-line code to return the column names and an array of values, which is effectively what pandas offers. 

In [158]:
def read_csv_line_by_line_numpy(filepath):
    import numpy as np
    
    metadata_contents = []

    with open(filepath, 'r', encoding='utf-8') as f:
        for line in f: # note the for loop iteration.
            # strip blank space, split by comma
            line_contents = line.strip().split(',')
            # append into empty list
            metadata_contents.append(line_contents)
    columns = np.hstack(metadata_contents[0]) # this is the first line of file, and we use np.hstack to turn into 1D array
    data = np.array(metadata_contents[1:],dtype=object) # we use np.array to convert the rest to numpy array

    return data, columns # note we return two things now.

# third time pure python line-by-line with numpy array conversion
t1_python_numpy = time.time()
metadata_table_numpy, metadata_table_columns = read_csv_line_by_line_numpy(metadata_file)
t2_python_numpy = time.time()

print('pandas reading time: ', t2_pandas-t1_pandas)
print('python reading time: ', t2_python-t1_python)
print('python + numpy conversion reading time: ', t2_python_numpy-t1_python_numpy)

pandas reading time:  0.029920339584350586
python reading time:  0.022936582565307617
python + numpy conversion reading time:  0.01695561408996582


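The Python + numpy conversion is still faster! But there is a problem. You may see a warning about ragged nested sequences (NumPy raises it, or in recent versions an error, when `dtype=object` is not given). Even when the conversion runs quietly, this is not good. A table should be a regular n_rows x n_cols matrix.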

We check the shape of the numpy array:

In [159]:
print('python + numpy data shape', metadata_table_numpy.shape)

python + numpy data shape (950,)


This is 1-dimensional when it should be 2!

What's the problem? Our code splits each line at every comma ','. However, commas do not exclusively separate columns. Some column entries, such as those in `date` and `clinical_notes`, contain ',' in their text!
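
The effect is easy to reproduce with a few invented lines (these are not taken from the real file). Splitting on every comma gives rows of different lengths, and numpy can only store such a ragged list as a 1D array of lists:

```python
import numpy as np

# invented lines; the second one has a comma inside a quoted entry
lines = [
    'patientid,date,finding',
    '2,"January 22, 2020",Pneumonia',
    '4,2020,Pneumonia',
]

rows = [line.strip().split(',') for line in lines]
row_lengths = [len(row) for row in rows]
print(row_lengths)  # the quoted date was cut in two, so one row is longer

ragged = np.array(rows[1:], dtype=object)
print(ragged.shape)  # 1D: one Python list per row, not a 2D table
```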

`pandas` was able to read the table correctly because its parser follows CSV quoting rules: an entry that contains commas is wrapped in double quotes in the file, and the parser keeps everything inside the quotes as one entry. By default it also reports an error when a row has more fields than expected, rather than silently producing a ragged table. We would need to write much more code to detect and correct for the extra commas ourselves. This is generally not worth it and we might not get it right! More general handling of potential errors is why, even though it may be slower, it is better practice to use a well-developed library.
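
As a sketch of what quote-aware parsing looks like, the invented text below (not the real file) is read once with Python's built-in `csv` module and once with `pandas`. Both keep the quoted date as a single entry:

```python
import csv
import io

import pandas as pd

# invented CSV text; the date entry contains a comma and is therefore quoted
csv_text = ('patientid,date,finding\n'
            '2,"January 22, 2020",Pneumonia\n'
            '4,2020,Pneumonia\n')

# the standard-library csv module understands quoted entries
rows = list(csv.reader(io.StringIO(csv_text)))
print([len(row) for row in rows])

# pandas applies the same quoting rules and returns a regular table
table = pd.read_csv(io.StringIO(csv_text))
print(table.shape)
print(table.loc[0, 'date'])
```

So if a pure Python reader is really needed, `csv.reader` is a safer starting point than `line.split(',')`.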

You will now repeat the earlier exercises, this time using `metadata_table` and library functions instead of loops over the `metadata_contents` list.

Try to do as many as you can - you can team up.

## 2. Getting the data we want from the metadata table, now using library functions

We can revisit the exercises you previously did with loops and replace them. We can also start viewing the associated images.

#### Exercise 1: Create an array for `patientid` from `metadata_table`. Hint: answer already given above.

In [160]:
# Feel free to write code in here, or else use your favorite Python IDE.

#### Exercise 2: Find the number of unique `patientid` values as well as the unique ids themselves. Hint: `numpy.unique`

In [161]:
# Feel free to write code in here, or else use your favorite Python IDE.
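
As a hint for Exercise 2, here is how `numpy.unique` behaves on a small invented array (not the real `patientid` column). It returns the sorted unique values, and with `return_counts=True` also how often each one occurs:

```python
import numpy as np

# invented ids, stored as strings like in the metadata table
ids = np.array(['2', '2', '4', '479', '479', '479'])

unique_ids, counts = np.unique(ids, return_counts=True)

print(unique_ids)       # the distinct ids, sorted
print(counts)           # number of rows for each distinct id
print(len(unique_ids))  # number of unique ids
```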

#### Exercise 3: Find the `age` of each unique `patientid`.

In [162]:
# Feel free to write code in here, or else use your favorite Python IDE.

#### Exercise 4: Find the `finding` of each unique `patientid`. How many unique `finding` are there?

In [163]:
# Feel free to write code in here, or else use your favorite Python IDE.

## 3. Exploring and visualizing the data

We can use the various libraries to make plots, and explore the data further to get some insights.

####  Exercise 5: Plot a histogram of `age` using matplotlib. You can import matplotlib's plotting interface with `import matplotlib.pyplot as plt`

In [164]:
# Feel free to write code in here, or else use your favorite Python IDE.
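
To get started with Exercise 5, the sketch below plots a histogram of a few invented ages (not the real `age` column). Missing ages appear as `NaN` in the table, so they are removed first; the choice of 5 bins is arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

# invented ages, including one missing value
ages = np.array([65.0, 52.0, 40.0, 26.0, 50.0, np.nan])

valid_ages = ages[~np.isnan(ages)]  # drop missing entries before plotting

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(valid_ages, bins=5)
ax.set_xlabel('age')
ax.set_ylabel('number of entries')
# in a notebook the figure is displayed automatically; in a script call plt.show()
```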

####  Exercise 6: Write code to find the image path/s associated with each unique patient. Use `scikit-image` to read and `matplotlib` to display them.  

In [165]:
# Feel free to write code in here, or else use your favorite Python IDE.
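
The reading and display part of Exercise 6 can be sketched as follows. Since the layout of the image folder is not shown here, the sketch writes an invented image (`toy_xray.png`, a hypothetical name) to a temporary folder and reads it back; for the exercise you would instead build `image_path` from the metadata table.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np
from skimage import io

# write an invented grayscale image to a temporary folder (stand-in for a real x-ray)
image_folder = tempfile.mkdtemp()
image_path = os.path.join(image_folder, 'toy_xray.png')
synthetic = np.tile(np.arange(64, dtype=np.uint8) * 4, (64, 1))  # horizontal gradient
io.imsave(image_path, synthetic, check_contrast=False)

# read the image back as a numpy array and display it
img = io.imread(image_path)
print(img.shape, img.dtype)

fig, ax = plt.subplots()
ax.imshow(img, cmap='gray')
ax.set_title(os.path.basename(image_path))
```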

## Extension: Processing images and machine learning

Generally, working directly on individual pixel intensities is not very informative. Consequently, features are usually extracted from each image to form one vector per image, which is then input to machine learning algorithms. This part explores this a little using scikit-image and scikit-learn.

#### Extension exercise 1: (PCA on raw image intensities) 
Choose 1 image per unique patient. Read each image using scikit-image. Each image is 2-dimensional; flatten it into a 1D vector and apply PCA to the collection of vectors.

In [167]:
# Feel free to write code in here, or else use your favorite Python IDE. 
# Below is some code comments to outline the logic to get started 


# 1. Write a loop over unique patients and use one image to describe each of them; if there are multiple, take e.g. the first image file.

# 2. For each patient image, read it in using img=skimage.io.imread(imgfile). If it is a color image, i.e. img.shape has 3 numbers, the last of which is 3, make it grayscale, e.g. using skimage.color.rgb2gray

# 3. Flatten the image to form a 1-D vector, i.e. img_flat = img.ravel(). PCA needs all vectors to have the same length, so resize the images to a common shape first, e.g. using skimage.transform.resize

# 4. Perform PCA on all flattened images using the scikit-learn library to reduce them to 2 dimensions, i.e. n_components=2. This is done using sklearn.decomposition.PCA. See https://scikit-learn.org/1.5/auto_examples/decomposition/plot_pca_iris.html for example usage

# 5. Use matplotlib to plot the 2-dimensions, and color by `finding`
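
Steps 3 and 4 can be sketched on invented data. The random arrays below stand in for grayscale x-rays of different sizes, and the common shape of 32 x 32 is an arbitrary choice for this example, not a recommendation for the real images.

```python
import numpy as np
from skimage.transform import resize
from sklearn.decomposition import PCA

# invented grayscale "images" of different sizes, one per patient
rng = np.random.default_rng(0)
sizes = [(40, 50), (64, 64), (32, 48), (50, 50), (45, 60), (64, 32)]
images = [rng.random(size) for size in sizes]

# resize to a common shape so every flattened vector has the same length
resized = [resize(img, (32, 32), anti_aliasing=True) for img in images]

# one row per image, one column per pixel
flattened = np.stack([img.ravel() for img in resized])
print(flattened.shape)

# project every image onto the first 2 principal components
pca = PCA(n_components=2)
embedding = pca.fit_transform(flattened)
print(embedding.shape)
```

Each row of `embedding` gives the 2 coordinates to plot for one patient in step 5.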

#### Extension exercise 2: (Use ORB features to encode each image to train a SVM classifier)
Again, use 1 image per unique patient. Read this scikit-image [example](https://scikit-image.org/docs/stable/auto_examples/features_detection/plot_fisher_vector.html#sphx-glr-auto-examples-features-detection-plot-fisher-vector-py). Then modify the example code to extract ORB image features for each image and train a classifier to predict `finding`. HINT: The target in this case, `finding`, is not numerical. You will need to use [sklearn.preprocessing.LabelEncoder](https://scikit-learn.org/1.5/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder) to transform it into integers for training.
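
The label encoding mentioned in the hint can be sketched on a short invented list of findings (the real column has more categories). `LabelEncoder` assigns one integer to each distinct string and can translate predictions back afterwards:

```python
from sklearn.preprocessing import LabelEncoder

# invented list of findings, one per patient
findings = ['Pneumonia/Viral/COVID-19', 'Pneumonia',
            'Pneumonia/Viral/COVID-19', 'Pneumonia']

encoder = LabelEncoder()
labels = encoder.fit_transform(findings)  # strings -> integers

print(encoder.classes_)  # the distinct findings, in sorted order
print(labels)            # integer label for each patient

# integers -> original strings, e.g. to report classifier predictions
decoded = encoder.inverse_transform(labels)
print(decoded)
```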