# **Clinical Data Science and Machine Learning with Python**

## **Day 2**

**Instructor**: Teresa Krieger, BIH/Charité (teresa.krieger@charite.de)

**Content**:

1.   References
2.   Library imports and data download
3.   Data preparation
4.   Model building
5.   Model training
6.   Model evaluation
7.   Model prediction
8.   Optional: More MNIST

---
## **1. References**

In this course, we will use Python 3.6 (default in Colab as of February 2021).
The following documentation and links might be useful to you:

- Deep Learning:
  - https://www.deeplearningbook.org/
- TensorFlow and Keras:
  - https://www.tensorflow.org/tutorials/
  - https://keras.io/guides/
- Source of the pneumonia X-ray dataset:
  - https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia
- We will loosely follow this tutorial:
  - https://www.kaggle.com/code/amyjang/tensorflow-pneumonia-classification-on-x-rays/notebook

You can also take a look at [this](https://www.youtube.com/playlist?list=PLblh5JKOoLUICTaGLRoHQDuF_7q2GfuJF) machine learning series on YouTube, where you can learn more about topics such as bias and variance, regression, and other common machine learning algorithms.

---
## **2. Library imports and data download**

In [1]:
import numpy as np
print(np.__version__)


1.24.3


In [4]:
!pip show protobuf



Name: protobuf
Version: 4.25.2
Summary: 
Home-page: https://developers.google.com/protocol-buffers/
Author: protobuf@googlegroups.com
Author-email: protobuf@googlegroups.com
License: 3-Clause BSD License
Location: /Users/ricoandreschmitt/anaconda3/envs/myenv/lib/python3.8/site-packages
Requires: 
Required-by: tensorboard, tensorflow


In [2]:
import re
import os
import pickle
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from sklearn.model_selection import train_test_split
import tensorflow.keras.utils as image

from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import RMSprop
from google.colab import files

import zipfile


%matplotlib inline


In this session, we will be working with chest X-ray images from patients with and without pneumonia. This data is available on Kaggle. Before we can download it, we need to set up a connection to Kaggle in our colab environment:

In [None]:
!pip install -q kaggle
!wget -O kaggle.json "https://www.dropbox.com/s/ewjoj1ge5u130m9/kaggle.json?dl=0"
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json



Now we can download the data from Kaggle:

In [None]:
!kaggle datasets download paultimothymooney/chest-xray-pneumonia

The dataset is organised into 3 folders (train, test, val), each of which contains one subfolder per image category (PNEUMONIA/NORMAL). There are 5,863 X-ray images (JPEG) across the two categories. We still need to unzip the compressed data and store the folder names:

In [None]:
# Extract compressed data
with zipfile.ZipFile('chest-xray-pneumonia.zip', mode='r') as zf:   # Here, mode = 'r' means we are reading the zip file
  zf.extractall()

In [None]:
# Define the folder paths:
img_dir = os.path.join(os.getcwd(), 'chest_xray')   # This is the parent directory
train_img_dir = os.path.join(img_dir, 'train')
test_img_dir = os.path.join(img_dir, 'test')
val_img_dir = os.path.join(img_dir, 'val')

# Print the paths
print('Parent directory for images: '+img_dir)
print('Directory for training images: '+train_img_dir)
print('Directory for test images: '+test_img_dir)
print('Directory for validation images: '+val_img_dir)
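
Before looking at individual images, it can help to check how many files each folder actually contains. The following is a minimal sketch, assuming the `chest_xray` folder layout described above; the helper name `count_images` is hypothetical and not part of the course material. Folders that do not exist are simply counted as 0.

```python
import os

def count_images(root_dir, splits=('train', 'test', 'val'),
                 classes=('NORMAL', 'PNEUMONIA')):
    """Return {split: {class: number of image files}}; missing folders count as 0."""
    counts = {}
    for split in splits:
        counts[split] = {}
        for cls in classes:
            folder = os.path.join(root_dir, split, cls)
            if os.path.isdir(folder):
                # Count only files with a typical image extension
                n = sum(name.lower().endswith(('.jpeg', '.jpg', '.png'))
                        for name in os.listdir(folder))
            else:
                n = 0
            counts[split][cls] = n
    return counts

# Usage with the parent directory defined above
for split, per_class in count_images(os.path.join(os.getcwd(), 'chest_xray')).items():
    print(split, per_class)
```

If the two categories turn out to be very unevenly represented, keep this in mind when interpreting the accuracy later on.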

We would now like to take a look at some example images from the NORMAL and PNEUMONIA categories inside our training data folder.

In [None]:
samples_normal=os.listdir(train_img_dir+"/NORMAL/")[0:4]
samples_pneumonia=os.listdir(train_img_dir+"/PNEUMONIA/")[0:4]
f, ax = plt.subplots(2,4, figsize=(30,10))
for i in range(4):
    img = plt.imread(train_img_dir+"/NORMAL/"+samples_normal[i])
    ax[0,i].imshow(img, cmap='gray')
    ax[0,i].set_title("Normal")
for i in range(4):
    img = plt.imread(train_img_dir+"/PNEUMONIA/"+samples_pneumonia[i])
    ax[1,i].imshow(img, cmap='gray')
    ax[1,i].set_title("Pneumonia")
plt.show()

In [None]:
import os
import zipfile
# Alternative setup for a local copy of the archive (adjust the paths to your machine)
local_zip = 'Xrayimage/XrayArchive.zip'
extract_dir = '/Xrayimage'   # Extraction target; all paths below are derived from it
with zipfile.ZipFile(local_zip, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)
base_dir = os.path.join(extract_dir, 'chest_xray')
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'test')
# Build the paths to the folders with normal and pneumonia images
train_healthy_dir = os.path.join(train_dir, 'NORMAL')
train_sick_dir = os.path.join(train_dir, 'PNEUMONIA')
# We use these for validation, i.e. we don't train with these.
validation_healthy_dir = os.path.join(validation_dir, 'NORMAL')
validation_sick_dir = os.path.join(validation_dir, 'PNEUMONIA')

---
## **3. Data preparation**

To prepare the images for processing, we will use the `ImageDataGenerator` class from `Keras`. It automatically generates batches of image data for training and testing our model. It can also perform real-time data augmentation, for example by applying random rotations and vertical or horizontal flips. You can find out more in the documentation [here](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator).

For now, we will only use the `rescale` argument to map the pixel values of all images from the range [0...255], as is common for 8-bit images, to the range [0...1]. A generator reads images from the source folders in batches as they are needed during training and evaluation. We will therefore set up separate generators for training, validation and testing:





In [None]:
train_generator = ImageDataGenerator(rescale = 1/255).flow_from_directory(
    'chest_xray/train/',
    target_size = (300,300),   # dimensions to which all images will be resized (in pixels)
    batch_size = 128,          # data is loaded in batches
    class_mode = 'binary'      # refers to the binary labels (0/1)
)

test_generator = ImageDataGenerator(rescale = 1/255).flow_from_directory(
    'chest_xray/test/',
    target_size = (300, 300),
    batch_size = 128,
    class_mode = 'binary'
)

val_generator = ImageDataGenerator(rescale = 1/255).flow_from_directory(
    'chest_xray/val/',
    target_size = (300, 300),
    batch_size = 128,
    class_mode = 'binary'
)
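
To make the effect of `rescale = 1/255` and `batch_size = 128` more concrete, here is a small sketch using NumPy only. It applies the same multiplication to a randomly generated stand-in image (not a real X-ray) and computes how many batches one pass over the data would take. The image count `n_images` is a made-up placeholder, not the size of our dataset.

```python
import math
import numpy as np

# Random 8-bit stand-in for one resized RGB image (values 0..255)
rng = np.random.default_rng(seed=0)
fake_image = rng.integers(0, 256, size=(300, 300, 3), dtype=np.uint8)

# Same operation as rescale = 1/255: multiply every pixel value by 1/255
rescaled = fake_image * (1 / 255)
print('before:', fake_image.dtype, fake_image.min(), fake_image.max())
print('after: ', rescaled.dtype, rescaled.min(), rescaled.max())

# Number of batches needed to see every image once (the last batch may be smaller)
n_images = 1000      # hypothetical number of images
batch_size = 128
steps_per_epoch = math.ceil(n_images / batch_size)
print('batches per epoch:', steps_per_epoch)
```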

---
## **4. Model building**

The model we are going to build consists of several components:

*   **tf.keras.layers.Conv2D()**: the convolution layer, which extracts image features
*   **tf.keras.layers.MaxPooling2D()**: a layer that downsamples the feature maps while retaining the most prominent features
*   **tf.keras.layers.Flatten()**: flattens the result into a one-dimensional array
*   **tf.keras.layers.Dense()**: a densely connected layer

We will build a convolutional neural network with four convolutional blocks, each consisting of a Conv2D() and a MaxPooling2D() step. The output of the final block is then flattened and fed into a fully connected (dense) layer. A dropout layer is added to reduce overfitting.



In [None]:
model = tf.keras.models.Sequential([

    # Note the input shape is the size of the image (300 x 300 px) x 3 colours
    # This is the first convolution
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(300, 300, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),

    # The second convolution
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),

    # The third convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),

    # The fourth convolution
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),

    # Flatten output
    tf.keras.layers.Flatten(),

    # Densely connected hidden layer with 512 neurons
    tf.keras.layers.Dense(512, activation='relu'),

    # Dropout layer
    tf.keras.layers.Dropout(0.3),

    # One output neuron with a sigmoid activation function -
    # this will contain a value from 0 ('normal') to 1 ('pneumonia')
    tf.keras.layers.Dense(1, activation='sigmoid')
])

We can inspect the architecture of our model by printing a summary as follows:

In [None]:
model.summary()
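
To see where the numbers in the summary come from, the sketch below recomputes the output sizes and parameter counts by hand. It assumes the Keras defaults used in the model above: 3×3 convolutions with stride 1 and no padding, and 2×2 max pooling with stride 2. The helper names are hypothetical.

```python
def conv_output_size(size, kernel=3):
    """Spatial size after a convolution with stride 1 and no padding."""
    return size - kernel + 1

def pool_output_size(size, pool=2):
    """Spatial size after max pooling with pool size 2 and stride 2."""
    return size // pool

size, channels = 300, 3      # input: 300 x 300 pixels, 3 colour channels
total_params = 0
for filters in (16, 32, 64, 128):
    # Each filter has 3*3*channels weights plus one bias
    params = (3 * 3 * channels + 1) * filters
    size = pool_output_size(conv_output_size(size))
    channels = filters
    total_params += params
    print(f'{filters:>3} filters: output {size} x {size} x {channels}, {params} parameters')

flat = size * size * channels            # length of the flattened vector
dense_params = (flat + 1) * 512          # weights + biases of the hidden dense layer
output_params = (512 + 1) * 1            # weights + bias of the output neuron
total_params += dense_params + output_params
print('flattened length:', flat)
print('dense layer parameters:', dense_params)
print('total parameters:', total_params)
```

Almost all parameters sit in the first dense layer, because the flattened output of the last convolutional block is still a long vector.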

In [None]:
tf.keras.utils.plot_model(
    model,
    to_file="model.png"
)

Before the model can be trained, we need to configure the following settings:

*   **loss**: with a sigmoid activation function in the final step, we select `binary_crossentropy` as the loss function
*   **optimizer**: `RMSprop` (Root Mean Square Propagation) with a learning rate of 0.001 will be used
*   **metrics**: we will use `accuracy` as our metric to evaluate the prediction accuracy on every epoch

We can now compile the model:

In [None]:
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(learning_rate=0.001),
              metrics = ['accuracy'])
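
For intuition about the loss, here is a minimal NumPy sketch of the binary cross-entropy formula, averaged over a few made-up labels and predicted probabilities. The function name and the clipping constant `eps` are assumptions for illustration; this is not the Keras implementation itself.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities away from 0 and 1 to avoid log(0)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

labels = [0, 0, 1, 1]                        # 0 = normal, 1 = pneumonia
confident_and_right = [0.1, 0.1, 0.9, 0.9]   # made-up model outputs
confident_and_wrong = [0.9, 0.9, 0.1, 0.1]

loss_right = binary_crossentropy(labels, confident_and_right)
loss_wrong = binary_crossentropy(labels, confident_and_wrong)
print('loss when predictions are right:', loss_right)
print('loss when predictions are wrong:', loss_wrong)
```

The loss is small when the predicted probability is close to the true label and grows quickly for confident but wrong predictions.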

---
#### **_Your turn_: Exercises**

**Exercise 1**: Define a different sequential model for our images (e.g. using a different number of convolutions and adding one or two more dropout layers). Store this as variable `model2`.

In [None]:
model2 = tf.keras.models.Sequential([

    # Note the input shape is the size of the image (300 x 300 px) x 3 colours
    # This is the first convolution
    tf.keras.layers.Conv2D(20, (3,3), activation='relu', input_shape=(300, 300, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),

    # The second convolution
    tf.keras.layers.Conv2D(30, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),

    # The third convolution
    tf.keras.layers.Conv2D(60, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),

    # The fourth convolution
    tf.keras.layers.Conv2D(120, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),

    # Flatten output
    tf.keras.layers.Flatten(),

    # Densely connected hidden layer with 512 neurons
    tf.keras.layers.Dense(512, activation='relu'),

    # Dropout layer
    tf.keras.layers.Dropout(0.3),

    # One output neuron with a sigmoid activation function -
    # this will contain a value from 0 ('normal') to 1 ('pneumonia')
    tf.keras.layers.Dense(1, activation='sigmoid')
])

---
## **5. Model training**

Now we are ready to train our model. We will train for 25 epochs and store our progress in `history`. Note that this might take a few minutes - time for some tea or coffee!

In [None]:
history = model.fit(
    train_generator,
    validation_data = val_generator,
    epochs = 25
)

The `history` object contains the `history.history` attribute, which is a record of training loss values and metrics values at successive epochs - this is what we're interested in here:

In [None]:
history = history.history

After training, we can save our model and the training history:

In [None]:
# Save model
tf.keras.saving.save_model(
    model, 'trained_model.h5', overwrite=True, save_format='h5'
)

# Save training history
with open('training_history', 'wb') as file_pi:
  pickle.dump(history, file_pi)

**Taking too long?** If you've finished your tea or coffee but the above is still not done, you can also interrupt execution of the cell by clicking on `Runtime > Interrupt execution` in the top menu. Now you can just download the trained model as well as the training history by executing the following code:

In [None]:
# Load model and training history
# (The URLs are quoted so that the shell does not treat '&' as a command separator)
!wget -O trained_model.h5 "https://www.dropbox.com/scl/fi/syngj8c2lijfoo228zl0n/trained_model.h5?rlkey=ec2h2ejk976j065s5rqg5kdvm&dl=0"
model = tf.keras.saving.load_model('trained_model.h5')

!wget -O training_history.pickle "https://www.dropbox.com/scl/fi/wxdtgif7kcrgrh08d94ph/training_history?rlkey=yhw2muwjvwzpr71146ks0dxxm&dl=0"
with open('training_history.pickle', 'rb') as file_to_read:
  history = pickle.load(file_to_read)

---
## **6. Model evaluation**

To evaluate our model, we can plot the accuracy as a function of training epochs for our training and validation data:

In [None]:
plt.plot(history['accuracy'])
plt.plot(history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()
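
The curves can also be read off numerically. The sketch below uses a small made-up history dictionary with the same keys as ours (the numbers are illustrative only, not results from this model) to find the epoch with the best validation accuracy and the gap between training and validation accuracy.

```python
import numpy as np

# Made-up values for illustration; replace with your own `history` dictionary
example_history = {
    'accuracy':     [0.80, 0.90, 0.94, 0.96, 0.98],
    'val_accuracy': [0.75, 0.81, 0.88, 0.81, 0.75],
}

train_acc = np.array(example_history['accuracy'])
val_acc = np.array(example_history['val_accuracy'])

# Epoch (counted from 0) with the highest validation accuracy
best_epoch = int(np.argmax(val_acc))
# Difference between training and validation accuracy at each epoch
gap = train_acc - val_acc

print('best epoch:', best_epoch)
print('validation accuracy at best epoch:', val_acc[best_epoch])
print('train/val gap per epoch:', gap)
```

A gap that keeps growing while the validation accuracy stagnates or drops is a typical sign of overfitting.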

---
#### **_Your turn_: Exercises**

**Exercise 1**: Plot the loss instead of the accuracy for training and validation data. Place the legend in the upper right corner of the plot.

In [None]:
plt.plot(history['loss'])          
plt.plot(history['val_loss'])      
plt.title('Model Loss')            
plt.ylabel('Loss')                 
plt.xlabel('Epoch')                
plt.legend(['Train', 'Val'], loc='upper right')  
plt.show()                        



---
## **7. Model prediction**

So how does our model perform on unseen data? We can use the `model.evaluate` function to check this.


In [None]:
result = model.evaluate(test_generator)
print('loss for test data :', result[0])
print('accuracy for test data :', result[1])

You will probably obtain a test accuracy above 80%, which is not bad for such a simple model!

Now our model is ready to make predictions! This can be done with the `model.predict` function. To predict whether a given image corresponds to a patient with or without pneumonia, we first need to download the image file and feed it into the `model.predict` function. For simplicity, we will use a file from the test data set, but we could of course also use our model for any other chest X-ray!

In [None]:
!wget -O test_image_1.jpeg "https://www.dropbox.com/s/z2dwy069smbrtym/test_image_1.jpeg?dl=0"
path = 'test_image_1.jpeg'  # File path to an image from the test data set
img = tf.keras.utils.load_img(path, target_size=(300,300))   # Load image
x = tf.keras.utils.img_to_array(img)     # Turn image into array
x = x/255 # Scale
x = np.expand_dims(x, axis=0)   # Add one dimension to match the input size expected by our model
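
The extra dimension is needed because the model always expects a batch of images, even if the batch contains only one. A short sketch with a zero-filled stand-in array (not a real image) shows how the shape changes:

```python
import numpy as np

# Stand-in for one preprocessed image: height x width x colour channels
single_image = np.zeros((300, 300, 3), dtype=np.float32)

# Add a leading batch dimension: a batch containing one image
batch_of_one = np.expand_dims(single_image, axis=0)
print(single_image.shape, '->', batch_of_one.shape)

# Several images of the same size can be stacked into one batch instead
batch_of_two = np.stack([single_image, single_image])
print(batch_of_two.shape)
```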

In [None]:
prob_pneumonia = model.predict(x)
prob_pneumonia

---
#### **_Your turn_: Exercises**

**Exercise 1**: The variable `prob_pneumonia` gives the probability that the image comes from a pneumonia patient. Write a few lines of code to print 'The patient has pneumonia' if this probability is greater than 50%, and 'The patient does not have pneumonia' otherwise.
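
One possible solution is sketched below. It assumes that the prediction is an array of shape (1, 1), as returned for a single image; the array `example_prob` is a made-up stand-in so that the snippet runs on its own, and the function name `describe_prediction` is hypothetical.

```python
import numpy as np

def describe_prediction(prob, threshold=0.5):
    """Turn a predicted pneumonia probability into a short message."""
    p = float(np.squeeze(prob))   # extract the single number from the array
    if p > threshold:
        return 'The patient has pneumonia'
    return 'The patient does not have pneumonia'

# Made-up stand-in for the output of model.predict(x)
example_prob = np.array([[0.93]])
print(describe_prediction(example_prob))
```

With your own prediction, you would call `describe_prediction(prob_pneumonia)` instead.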


**Exercise 2**: As (future) medical doctors, you might also want to take a look at the image yourself. You can display it using the `imshow` function of `matplotlib` as shown below. Do you agree with your model?

In [None]:
plt.imshow(img)

**Exercise 3**: If you're still feeling motivated, you can repeat the evaluation for the image file called `test_image_2.jpeg` which you can download from https://www.dropbox.com/s/z471e1sbeac29g7/test_image_2.jpeg?dl=0.

**Exercise 4:** If you're STILL feeling motivated and you have some time to spare this afternoon: What happens if, instead of our original model with four convolutions, you use your alternative model (`model2`)? Does this affect the performance of your model?

---
## **8. Optional: More MNIST**

Here you can try out how different network architectures and training parameters affect the performance of a deep learning model for the MNIST dataset. We will start with the following architecture:
*   input shape 28×28×1 (the size of the images),
*   1st convolutional layer with 64 filters, kernel size (3,3), stride (1,1) and ReLU activations,
*   dropout layer which drops 20% of the input units,
*   2nd convolutional layer with 32 filters, kernel size (3,3), stride (1,1) and ReLU activations,
*   flatten layer,
*   dense output layer with 10 units and softmax activation.

**Load and prepare data**

In [None]:
# Import the MNIST dataset
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Load MNIST images and labels and normalise (range 0 to 1) image data
(X_train, y_train), (X_test, y_test) = mnist.load_data() # 60,000 training + 10,000 test images/labels
X_train = X_train / 255.0
X_test  = X_test / 255.0

# Reshape dataset to have a single channel
X_train = X_train.reshape((X_train.shape[0], 28, 28, 1)) # The CNN requires this layout (batch_size, height, width, n_channels)
X_test = X_test.reshape((X_test.shape[0], 28, 28, 1)) # The CNN requires this layout (batch_size, height, width, n_channels)

# One-hot encode target values (i.e. turn each digit label into a vector of ten entries with a single 1 and nine 0s)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
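
To illustrate what one-hot encoding does, here is a NumPy-only sketch on five example digit labels. It is meant as an equivalent illustration, not as the implementation of `to_categorical`.

```python
import numpy as np

labels = np.array([5, 0, 4, 1, 9])   # example digit labels

# Row i of the 10 x 10 identity matrix is the one-hot vector for digit i
one_hot = np.eye(10)[labels]
print(one_hot)

# The original labels can be recovered as the position of the 1 in each row
recovered = one_hot.argmax(axis=1)
print(recovered)
```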

**Build model**

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, Dense, Flatten, Dropout

model = Sequential()
model.add(Input(shape=(28,28,1)))
model.add(Conv2D(64, kernel_size=(3,3), strides=(1,1), activation='relu'))
model.add(Dropout(0.2))
model.add(Conv2D(32, kernel_size=(3,3), strides=(1,1), activation='relu'))
model.add(Flatten()) # This flattens the feature maps (height x width x channels) into a single vector of length height*width*channels.
model.add(Dense(10, activation='softmax'))

model.summary()

**Compile model**

In [None]:
from tensorflow.keras.losses import CategoricalCrossentropy

# Loss function
loss = CategoricalCrossentropy()

model.compile(loss=loss,
              optimizer='adam',
              metrics=['accuracy'])

**Train model**

In [None]:
# Define number of epochs and batch size
epochs = 5
batch_size = 512

# Fit model
history = model.fit(x=X_train, y=y_train,
                    validation_split=0.1,
                    epochs=epochs,
                    batch_size=batch_size)
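
The batch size and the number of epochs together determine how many weight updates the model performs. The following sketch works this out for the settings above, assuming the 60,000 MNIST training images mentioned earlier and that `validation_split=0.1` sets aside 10% of them for validation.

```python
import math

n_train_total = 60_000     # MNIST training images
validation_split = 0.1
batch_size = 512
epochs = 5

n_val = round(n_train_total * validation_split)   # images held out for validation
n_fit = n_train_total - n_val                     # images actually used for fitting

# One weight update per batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(n_fit / batch_size)
total_updates = steps_per_epoch * epochs

print('images used for fitting:', n_fit)
print('weight updates per epoch:', steps_per_epoch)
print('weight updates in total:', total_updates)
```

Halving the batch size roughly doubles the number of updates per epoch, which is one reason why the batch size can affect both training time and final performance.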

**Plot loss and accuracy as a function of epochs**

In [None]:
n_epochs = np.arange(0,epochs)

fig, (ax1,ax2) = plt.subplots(2,1,figsize=(8,16))
ax1.plot(n_epochs, history.history['loss'], label='training loss')
ax1.plot(n_epochs, history.history['val_loss'], label='validation loss')
ax1.set_ylim(-0.05,1.05)
ax1.legend()

ax2.plot(n_epochs, history.history['accuracy'], label='training accuracy')
ax2.plot(n_epochs, history.history['val_accuracy'], label='validation accuracy')
ax2.set_ylim(-0.05,1.05)
ax2.legend()
plt.show()

**Evaluate model**

In [None]:
loss_and_metrics = model.evaluate(X_test, y_test,
                                  batch_size=batch_size)
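
The accuracy reported here is the fraction of images for which the class with the highest predicted probability matches the true class. A minimal sketch with made-up probabilities for three classes (instead of ten, to keep it short) shows the calculation:

```python
import numpy as np

# Made-up softmax outputs for four samples and three classes
probs = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1],
                  [0.3, 0.3, 0.4],
                  [0.2, 0.5, 0.3]])

# One-hot encoded true labels for the same four samples
y_true = np.array([[0, 1, 0],
                   [1, 0, 0],
                   [0, 1, 0],
                   [0, 1, 0]])

predicted_class = probs.argmax(axis=1)   # most probable class per sample
true_class = y_true.argmax(axis=1)       # position of the 1 in each label vector
accuracy = float(np.mean(predicted_class == true_class))

print('predicted:', predicted_class)
print('true:     ', true_class)
print('accuracy: ', accuracy)
```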

---
#### **_Your turn_: Exercises**

**Exercise 1:** Try changing the architecture of your model, e.g. by adding layers or changing the type of layers. How does this affect model performance?

**Exercise 2:** Try changing the training parameters of your model, e.g. the number of epochs or the batch size. How does this affect model performance?

# **Well done!**



