In [109]:
import os
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [110]:
pd.set_option("display.max_rows", 15)

## The Dataset

In former notebooks we worked with the [ChestX-ray8 dataset](https://arxiv.org/abs/1705.02315); here we have a smaller X-ray dataset containing 5856 images.
Instead of 14 different diseases, we concentrate on far fewer possible labels, so that *hopefully* the number of images is enough to train good deep learning classifiers.

The images are distributed across two folders (`NORMAL` and `PNEUMONIA`). The related metadata can be found in `x_ray_metadata_portfolio.csv`, but it is also reflected in the image file names.

## Import metadata

In [111]:
path = "../data/ChestXray_pneumonia_prediction/"
csv_file = 'x_ray_metadata_portfolio.csv'
metadata = pd.read_csv(path + csv_file)

# Portfolio exercises:
### 1. load and inspect the data 
- what are missing/problematic entries?

In [112]:
metadata.head()

Unnamed: 0,patient_id,label,infection_type,folder,image
0,1,normal,none,NORMAL,IM-0001-0001.jpeg
1,3,normal,none,NORMAL,IM-0003-0001.jpeg
2,5,normal,none,NORMAL,IM-0005-0001.jpeg
3,6,normal,none,NORMAL,IM-0006-0001.jpeg
4,7,normal,none,NORMAL,IM-0007-0001.jpeg


In [113]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5856 entries, 0 to 5855
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   patient_id      5856 non-null   object
 1   label           5856 non-null   object
 2   infection_type  5856 non-null   object
 3   folder          5856 non-null   object
 4   image           5856 non-null   object
dtypes: object(5)
memory usage: 228.9+ KB


In [114]:
metadata.isnull().sum()

patient_id        0
label             0
infection_type    0
folder            0
image             0
dtype: int64

Luckily there are no null values in our data and we can use it without deleting any entries.

### 2. data exploration and cleaning 
- remove, fill, change data (if you think this makes sense or is necessary/beneficial)
- inspect two different types of labels (`label` and `infection_type`). How many classes are there and is there any bias? If so, is there anything we have to take care of?

In [115]:
metadata['infection_type'].value_counts()

bacteria    2780
none        1583
virus       1493
Name: infection_type, dtype: int64

In [116]:
metadata['label'].value_counts()

pneumonia    4273
normal       1583
Name: label, dtype: int64

In [117]:
# ratio of pneumonia images to normal images
(metadata['label'] == 'pneumonia').sum() / (metadata['label'] == 'normal').sum()

2.6993051168667086

There are three infection types: bacteria, virus and none. There are roughly as many viral infections (1493) as healthy patients (1583), but bacterial infections (2780) are almost twice as frequent as either of the other two classes.
Looking only at the `label` column, the data is also heavily biased towards diseased patients: there are about 2.7 times as many pneumonia images as normal ones.
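
One common way to take care of such an imbalance is to weight the loss per class. The following is a minimal sketch, assuming the counts reported above and the "balanced" heuristic `n_samples / (n_classes * count)`; the names `label_counts` and `class_weight` are illustrative, and the integer encoding (0 = normal, 1 = pneumonia) is an assumption about how the labels are encoded later.

```python
import pandas as pd

# Class counts taken from the `label` value counts above.
label_counts = pd.Series({"pneumonia": 4273, "normal": 1583})

n_samples = label_counts.sum()
n_classes = len(label_counts)

# "Balanced" heuristic: rarer classes get proportionally larger weights.
weights = n_samples / (n_classes * label_counts)

# Keras expects integer class indices (assumed here: 0 = normal, 1 = pneumonia).
class_weight = {0: float(weights["normal"]), 1: float(weights["pneumonia"])}
print(class_weight)
```

A dictionary like this could be passed as `class_weight=class_weight` to `model.fit`, so that errors on the rarer normal images count more during training.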

In [118]:
# check that no pneumonia case is listed without an infection type
no_infection = metadata["infection_type"] == "none"
has_pneumonia = metadata["label"] == "pneumonia"
counter = (no_infection & has_pneumonia).sum()

print("There are {} patients with pneumonia, that have no bacterial or viral infection".format(counter))

There are 0 patients with pneumonia, that have no bacterial or viral infection


In [119]:
metadata['folder'].value_counts()

PNEUMONIA    4273
NORMAL       1583
Name: folder, dtype: int64

In [120]:
# check if all pictures are in the right folder
mismatched = metadata[metadata['folder'].str.lower() != metadata['label']]
for index in mismatched.index:
    print('Error in row: ', index)

In [121]:
metadata['patient_id'].value_counts()

person23      31
person124     20
person441     18
person30      15
person1320    14
              ..
person1580     1
0907           1
person1579     1
person1577     1
0001           1
Name: patient_id, Length: 2790, dtype: int64

It is also noteworthy that some patients appear multiple times in the dataset (up to 31 images for `person23`), while others have only a single entry.
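
Repeated patients matter for the later split: if images of the same patient land in both training and test data, the test score can look better than it should. The sketch below shows one possible way to split by patient using scikit-learn's `GroupShuffleSplit`. It runs on a tiny made-up table (`demo`) that only mimics the `patient_id` column, and it is an alternative to the random per-image split, not the method used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Tiny illustrative table; the IDs mimic the `patient_id` column above.
demo = pd.DataFrame({
    "patient_id": ["person23", "person23", "person23", "person124",
                   "person124", "0001", "0907", "person30", "person30", "person441"],
    "label": [1, 1, 1, 1, 1, 0, 0, 1, 1, 1],
})

# Split on patients, not on images, so no patient ends up on both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(splitter.split(demo, groups=demo["patient_id"]))

train_patients = set(demo.iloc[train_idx]["patient_id"])
test_patients = set(demo.iloc[test_idx]["patient_id"])
print("Shared patients:", train_patients & test_patients)
```

Because the split is made on groups, the resulting image shares are only approximately 70/30 and depend on how many images each selected patient contributes.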

### 3. Prepare data for machine learning: 
- Separate data / labels
- Split into train ~70% / validation ~15% / test ~15%

In [144]:
# shuffle data
shuffled_data = metadata.sample(frac=1, random_state=42)

# combine folder and image file name to file path
shuffled_data['filepath'] = shuffled_data['folder'] + "/" + shuffled_data['image']

# encode the label as integer: pneumonia -> 1, normal -> 0
shuffled_data['label'] = (shuffled_data['label'] == 'pneumonia').astype(np.int32)
shuffled_data.head()

Unnamed: 0,patient_id,label,infection_type,folder,image,filepath
3649,person276,1,bacteria,PNEUMONIA,person276_bacteria_1297.jpeg,PNEUMONIA/person276_bacteria_1297.jpeg
4211,person403,1,virus,PNEUMONIA,person403_virus_803.jpeg,PNEUMONIA/person403_virus_803.jpeg
960,0553,0,none,NORMAL,NORMAL2-IM-0553-0001.jpeg,NORMAL/NORMAL2-IM-0553-0001.jpeg
23,0031,0,none,NORMAL,IM-0031-0001.jpeg,NORMAL/IM-0031-0001.jpeg
810,0353,0,none,NORMAL,NORMAL2-IM-0353-0001.jpeg,NORMAL/NORMAL2-IM-0353-0001.jpeg


In [145]:
# split train and test data into 70% and 30% respectively
train_df, test_df = train_test_split(shuffled_data, test_size=0.3, random_state=42)
# split the 30% test data into 15% and 15% for validation and test respectively
test_df, val_df = train_test_split(test_df, test_size=0.5, random_state=42)

# check the distribution after the split
print(test_df.shape, val_df.shape, train_df.shape)

print("Train data share: ", train_df.shape[0] / shuffled_data.shape[0])
print("Test data share: ", test_df.shape[0] / shuffled_data.shape[0])
print("Validation data share: ", val_df.shape[0]/ shuffled_data.shape[0])

(878, 6) (879, 6) (4099, 6)
Train data share:  0.6999658469945356
Test data share:  0.14993169398907105
Validation data share:  0.15010245901639344


### 4. Build an image generator pipeline
- Use keras `ImageDataGenerator` to define a `train_generator` and a `validation_generator`.
- Use the generator to rescale the pixel values and to resize the images to the desired format.

In [146]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Normalize images
image_generator = ImageDataGenerator(
    rescale=1.0/255
)

# Pick your label column
label_column = "label"  # binary target: 1 = pneumonia, 0 = normal

# Define the data generators
train_generator = image_generator.flow_from_dataframe(
    dataframe=train_df,
    directory=path,
    x_col="filepath",
    y_col="label",
    target_size=(320, 320),
    batch_size=32,
    class_mode="raw",
    color_mode="rgb",
)

val_generator = image_generator.flow_from_dataframe(
    dataframe=val_df,
    directory=path,
    x_col="filepath",
    y_col="label",
    target_size=(320, 320),
    batch_size=32,
    class_mode="raw",
    color_mode="rgb",
    shuffle=False,  # this is crucial for later evaluation!
)

Found 4099 validated image filenames.
Found 879 validated image filenames.


### 5. Train your own custom-made CNN (pneumonia yes/no prediction)
- Define a conventional CNN image classifier (convolution-pooling steps, followed by dense layers) to predict if a patient has pneumonia or not.

In [147]:
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.models import Model


# Define the input shape
inputs = Input(shape=(320, 320, 3))

# Define the CNN architecture
x = Conv2D(32, (3, 3), activation='relu')(inputs)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(128, (3, 3), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(128, (3, 3), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(128, (3, 3), activation='relu')(x)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
x = Dense(64, activation='relu')(x)
# Single sigmoid unit: outputs the predicted probability of pneumonia
outputs = Dense(1, activation="sigmoid")(x)

# Create the model
model = Model(inputs=inputs, outputs=outputs)

model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_4 (InputLayer)        [(None, 320, 320, 3)]     0         
                                                                 
 conv2d_15 (Conv2D)          (None, 318, 318, 32)      896       
                                                                 
 max_pooling2d_15 (MaxPoolin  (None, 159, 159, 32)     0         
 g2D)                                                            
                                                                 
 conv2d_16 (Conv2D)          (None, 157, 157, 64)      18496     
                                                                 
 max_pooling2d_16 (MaxPoolin  (None, 78, 78, 64)       0         
 g2D)                                                            
                                                                 
 conv2d_17 (Conv2D)          (None, 76, 76, 128)       7385

In [148]:
import tensorflow as tf

metrics = [
    'accuracy',
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall')
]

# Compile the model (pass `metrics=metrics` instead to also track precision and recall)
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

In [149]:
# Train the model
history = model.fit(
    train_generator,
    epochs=10,
    validation_data=val_generator,
    verbose=1
)

Epoch 1/10
  2/129 [..............................] - ETA: 10:00 - loss: 0.6388 - accuracy: 0.5781

KeyboardInterrupt: 

### 6. Use transfer learning to train a CNN classifier (pneumonia yes/no prediction)
- Pick any suitable CNN you like (e.g. Resnet-50, Densenet, MobileNet...) and create your own image classifier from there.
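
As a starting point, a transfer-learning classifier could look like the sketch below. It assumes DenseNet121 as the base model (one of several reasonable choices) and wraps everything in a hypothetical helper `build_transfer_model`. The pooling layer and single sigmoid output are illustrative, not a prescribed architecture.

```python
import tensorflow as tf


def build_transfer_model(input_shape=(320, 320, 3), weights="imagenet"):
    """Frozen DenseNet121 base plus a small binary classification head."""
    base_model = tf.keras.applications.DenseNet121(
        include_top=False, weights=weights, input_shape=input_shape
    )
    base_model.trainable = False  # keep the pretrained features fixed at first

    inputs = tf.keras.Input(shape=input_shape)
    x = base_model(inputs, training=False)  # keep BatchNorm in inference mode
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Calling `build_transfer_model()` downloads the ImageNet weights on first use. Note that the generators above only rescale pixels to the range 0 to 1; whether the base model's own `preprocess_input` normalization works better here is worth checking.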

### 7. Evaluate both models
- Evaluate the models on data other than the training data.
- Plot a ROC curve and compute the area under the curve (AUC). What does this tell you?
- Compare to your custom-made CNN: which one achieves better results (and why)? Which one trains faster (and why)? What would you prefer to use in an actual medical application?
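
The ROC and AUC computation can be sketched with scikit-learn as follows. The labels and scores here are synthetic stand-ins; in the notebook they would be the true labels of the evaluation split and the sigmoid outputs of `model.predict(...)` on a generator created with `shuffle=False`.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic stand-ins for true labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# False/true positive rates for every decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC: probability that a random positive is scored above a random negative.
auc = roc_auc_score(y_true, y_score)
print(f"AUC = {auc:.2f}")  # 0.75: 3 of the 4 positive/negative pairs are ranked correctly
```

The curve itself can then be drawn with `plt.plot(fpr, tpr)`. An AUC of 0.5 corresponds to random ranking, 1.0 to a perfect separation of the two classes.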

### 8. Train a CNN for a multi-class prediction (bacteria, virus, none)
- Adapt a CNN (e.g. pick one from 5. or 6.) and modify it to predict whether a patient has a bacterial infection, a viral infection, or neither, using the `infection_type` column as label.
- This also requires adapting the generators.
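
One possible way to adapt the generators is sketched below. It assumes the dataframe still has the `filepath` and `infection_type` columns created earlier; the helper name `make_multiclass_generator` and the class order in `CLASSES` are illustrative choices.

```python
CLASSES = ["none", "bacteria", "virus"]  # fixed order -> stable class indices


def make_multiclass_generator(df, directory, shuffle=True):
    """Yield (image batch, one-hot infection_type batch) pairs."""
    # Imported lazily so the helper can be defined without loading TensorFlow.
    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    generator = ImageDataGenerator(rescale=1.0 / 255)
    return generator.flow_from_dataframe(
        dataframe=df,
        directory=directory,
        x_col="filepath",
        y_col="infection_type",
        classes=CLASSES,
        class_mode="categorical",  # one-hot labels for a softmax output
        target_size=(320, 320),
        batch_size=32,
        color_mode="rgb",
        shuffle=shuffle,
    )
```

With one-hot labels, the model's last layer would become `Dense(3, activation="softmax")` and the loss `categorical_crossentropy`. For evaluation, the generator should again be created with `shuffle=False`.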

### 9. Evaluate the multi-class model
- Evaluate the multi-class prediction model, similar to what you did in (7.). Again, include a ROC curve (now of course for all 3 possible labels).
- Is the performance comparable, better, worse than the pneumonia/no pneumonia case?
- Which label can be predicted with the highest precision?
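
Per-class precision and one-vs-rest AUC values can be computed as in the following sketch. All numbers are synthetic stand-ins for the true class indices and the softmax outputs of a multi-class model, and the class order in `CLASSES` is an assumption.

```python
import numpy as np
from sklearn.metrics import precision_score, roc_auc_score
from sklearn.preprocessing import label_binarize

CLASSES = ["none", "bacteria", "virus"]

# Synthetic stand-ins for true class indices and softmax outputs.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_prob = np.array([
    [0.80, 0.10, 0.10],
    [0.60, 0.30, 0.10],
    [0.10, 0.70, 0.20],
    [0.25, 0.50, 0.25],
    [0.10, 0.60, 0.30],
    [0.10, 0.20, 0.70],
])
y_pred = y_prob.argmax(axis=1)

# Per-class precision: of all images predicted as class k, how many are class k?
precision = precision_score(y_true, y_pred, average=None, labels=[0, 1, 2])

# One-vs-rest AUC per class from the binarized labels.
y_onehot = label_binarize(y_true, classes=[0, 1, 2])
auc_per_class = [roc_auc_score(y_onehot[:, k], y_prob[:, k]) for k in range(3)]

for name, p, a in zip(CLASSES, precision, auc_per_class):
    print(f"{name}: precision={p:.2f}, AUC={a:.2f}")
```

The same binarized labels can be fed column by column into `roc_curve` to draw one ROC curve per class.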

### 10. Possible improvements
Try two possible strategies to improve the results you got on 8. and 9.
- Data augmentation.
- Model fine-tuning. This refers to training a large network with a lower learning rate, but with more layers of the base model set to `trainable`.
We haven't done this before, so here is some example code showing how to "unfreeze" some layers and make them trainable again:
```python
# Now: unfreeze some of the base model layers and do a second pass of training
for layer in model.layers[:100]:
    layer.trainable = False
for layer in model.layers[100:]:
    layer.trainable = True
``` 
You can still use the `Adam` optimizer, but preferably with a much lower learning rate, maybe `1e-5`.
- Do you see any promising effect of one or both of those strategies?
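
For the data augmentation strategy listed above, a minimal sketch with `ImageDataGenerator` could look like this. The transformation ranges are illustrative guesses rather than tuned values, and the names `train_augmenter` and `eval_generator` are hypothetical.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment only the training images; validation/test data stay untouched.
train_augmenter = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=10,        # small random rotations (degrees)
    width_shift_range=0.05,   # horizontal shift, fraction of image width
    height_shift_range=0.05,  # vertical shift, fraction of image height
    zoom_range=0.1,           # random zoom in/out by up to 10 %
)

# Evaluation data is only rescaled, so that metrics stay comparable.
eval_generator = ImageDataGenerator(rescale=1.0 / 255)
```

`train_augmenter.flow_from_dataframe(...)` can then replace the plain training generator. Flips are left out here on purpose, since a mirrored chest X-ray is anatomically implausible; whether they help anyway is an open question for this dataset.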

## Final submission:
Please address all the above mentioned points in this notebook (e.g., using text cells where needed for explanations or answers). Obviously, you can use code snippets from notebooks we have already worked on during the live coding sessions.


### Happy hacking!!!