# <b> Make sure you are running this in Google Colab.</b>
Go to the link: https://drive.google.com/drive/folders/11eX89jZEuR0yWtBWwzLuIZDqQmsyXEcT?usp=sharing and click on <i> "add the shortcut to My Drive" </i>. The shared folder will then appear in your own My Drive.

For a detailed step by step procedure, please open this link to follow the instructions: https://docs.google.com/document/d/1uhH_5W_Aoa8zlyMSvmU_yuhyJJgR6-7l/edit?usp=sharing&ouid=107456846675416400203&rtpof=true&sd=true

<b> Before starting your notebook, click on Runtime: Change Runtime to GPU (connect to a hosted runtime i.e., GPU) . If not possible, proceed with the CPU runtime. </b>

# Deep Learning for beginners
## Classification of suspicious breast lesions

### This is a private dataset of Contrast Enhanced Mammography (CEM) kindly supplied by MUMC+ for the purposes of this course only. Please make sure you delete image data after the workshop.

## Getting Started: Setting up Google Drive and installing some packages

In [None]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
!pip install -q livelossplot

## Description of the dataset
* The dataset comprises 997 MUMC patients who were recalled after a screening mammography.
* Each patient has two views of CEM images: craniocaudal (CC) and mediolateral oblique (MLO).
* Each view has two types of image: the low-energy image (almost like a standard mammogram) and the recombined image.
* Masses have pathological confirmation of status, labeled 0 for benign and 1 for malignant.
* The data have been reduced to a 256 x 256 crop around the tumor while maintaining the same greyscale values.
* The arrays have two channels containing (1) the low-energy image and (2) the recombined low- and high-energy image.
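
The figures in this list imply a simple size budget: 997 patients with two views each give 1994 arrays of shape 256 x 256 x 2. The short sketch below works that out and estimates the in-memory footprint. The `float32` storage type is an assumption made for illustration, since the dtype of the `.npy` files is not stated here.

```python
# Back-of-the-envelope size check for the dataset described above.
n_patients = 997
views_per_patient = 2            # CC and MLO
height, width, channels = 256, 256, 2
bytes_per_value = 4              # ASSUMPTION: float32; the actual dtype may differ

n_images = n_patients * views_per_patient
bytes_per_image = height * width * channels * bytes_per_value
total_gib = n_images * bytes_per_image / 1024**3

print("Number of images:", n_images)
print("Bytes per image:", bytes_per_image)
print(f"Approximate total size: {total_gib:.2f} GiB")
```

Under that assumption the whole dataset needs roughly 1 GiB, which is why loading everything into one array is feasible on a standard Colab runtime.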


![picture](https://drive.google.com/uc?export=view&id=1NcUiIikQX4jxCQXgCLkvQHRKlYj8g6ll)  

#### <b>Notebook structure:</b>
The script will take you through 5 steps where you'll learn how to train the models. <br>
 - <b>1. Importing libraries. </b>
 - <b>2. Getting Your Data Ready. </b>
 - <b>3. Define the Network. </b>
 - <b>4. Train your network. </b>
 - <b>5. Test your network. </b>

<br>


#### <b>Filling in Missing Values:</b>

You will have to fill in some missing values in the notebook. They will help you to better understand the deep learning workflow.

#### <b> If you get stuck you can always open the "cheat" notebook, but where's the fun in that? </b> :-)

## 1. Import libraries

In [None]:
#deactivate warnings
import warnings
warnings.filterwarnings("ignore")

#import utilities
import numpy as np
import pandas as pd
from glob import glob
import os
import shutil
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in favour of tqdm.notebook.tqdm

#import images
from skimage.io import imread
import matplotlib.pyplot as plt

#import metrics and data-splitting utilities
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

#import keras functions
import tensorflow as tf
from tensorflow.keras import backend as K  # use tf.keras consistently rather than mixing in standalone keras
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.nasnet import NASNetMobile
from tensorflow.keras.applications.xception import Xception
from tensorflow.keras.layers import Dropout, Flatten, Dense, GlobalAveragePooling2D, Input, Concatenate, GlobalMaxPooling2D
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import CSVLogger, ReduceLROnPlateau, ModelCheckpoint, LearningRateScheduler
from tensorflow.keras.optimizers import Adam
from livelossplot.tf_keras import PlotLossesCallback

## 2. Getting your data ready

### 2.1 Visualizing the data contents

In [None]:
folder_pth = r"/content/drive/MyDrive/breast_lesion_classification"

csv_path = os.path.join(folder_pth, r"df_for_workshop.csv")
info_csv = pd.read_csv(csv_path)
print(info_csv)

                              image_ID position orientation  classification
0     ID_LeckFg19Qxio7VJiy7Wu7vpvD.npy        R         MLO               1
1     ID_H5ViU9RHKQO74w8Cq8gtZBLkW.npy        R          CC               1
2     ID_srgTQm5KBFIH45HtaahcXm66V.npy        R         MLO               0
3     ID_2yxLITGuMLRO2UkSrLfKE5l1t.npy        R          CC               0
4     ID_6EBGfrbMC5HhWyCNEGdZrmtbA.npy        L         MLO               0
...                                ...      ...         ...             ...
1989  ID_uB9NDT2Tck4ec48l1iV3GwKfC.npy        L          CC               0
1990  ID_xaQ4byQ3ovcmc1X3oHtGAoi4K.npy        R         MLO               1
1991  ID_9EWw4bRgn32ovvxDo3JqTy0cu.npy        R          CC               1
1992  ID_9ZlE026gbum6IgmhypRs9es5r.npy        R         MLO               1
1993  ID_o2NNSrjzBdhnLlCnIxHD4xdab.npy        R          CC               1

[1994 rows x 4 columns]


In [None]:
#Assign the target variable
classification = info_csv.classification
print("The targets in the classification variable:")
classification.head()

The targets in the classification variable:


0    1
1    1
2    0
3    0
4    0
Name: classification, dtype: int64

In [None]:
#load images
image_ID = info_csv["image_ID"] # getting image IDs from the csv file
path_npy = os.path.join(folder_pth, r"dataset_for_workshop")


# read all image patches into the variable X using np.load
X = []
for image_id in tqdm(list(image_ID)):
    X.append(np.load(os.path.join(path_npy,image_id)))

X = np.array(X)


In [None]:
print("the size of the dataset is:", X.shape)

#extract shape parameter
IMAGE_SIZE = X[0].shape
N_IMAGES = len(X)
print("Number of images:", N_IMAGES)
print("Image size:",IMAGE_SIZE[:2])
print("Number of channels:", IMAGE_SIZE[-1])

In [None]:
# Using the 'train_test_split' function from the sklearn Python package,
# we divided the dataset into a training set and a testing set,
# opting for a 70/30 split.
X_train, X_test, y_train, y_test = train_test_split(X,classification,test_size=0.3, random_state=42)
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
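
With a binary target it can help to keep the benign/malignant ratio identical in both subsets. The sketch below shows the `stratify` argument of `train_test_split` on synthetic stand-in data; the arrays `X_demo` and `y_demo` are made up for illustration and are not the workshop data, and the notebook's own split does not use stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 tiny "images", 30% of them labeled positive.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8, 8, 2))
y_demo = np.array([1] * 30 + [0] * 70)

# stratify=y_demo asks scikit-learn to preserve the class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("Positive fraction (train):", y_tr.mean())
print("Positive fraction (test):", y_te.mean())
```

Without `stratify`, the two fractions can drift apart by chance, which matters most for small or imbalanced datasets.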

### 2.2 Making Training and Testing Data Arrays based on the train_test_split
The train and test data have already been split into two folders, `train` and `test`. Build one training array and one test array from them, each containing image patches of size 256 x 256.

In [None]:
train_ID = os.listdir(os.path.join(folder_pth, "train"))
X_train = []
y_train = []
for image_id in tqdm(list(train_ID)):
    X_train.append(np.load(os.path.join(folder_pth, "train",image_id)))
    y_train.append(classification[image_ID == image_id].values[0])
X_train = np.array(X_train)

test_ID = os.listdir(os.path.join(folder_pth, "test"))
X_test = []
y_test = []
for image_id in tqdm(list(test_ID)):
    X_test.append(np.load(os.path.join(folder_pth, "test",image_id)))
    y_test.append(classification[image_ID == image_id].values[0])

X_test = np.array(X_test)

In [None]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)

In [None]:
#extract shape parameter
IMAGE_SHAPE = X_train[0].shape
IMAGE_SIZE = X_train[0].shape[1]
print("Image shape:",IMAGE_SHAPE[:2])
print("Number of channels:", IMAGE_SHAPE[-1])

### 2.3 Preprocessing the data

In [None]:
#normalisation coef, first mean second std
norm_coeff_re=[897,728] #norm_coeff_re : for the recombined image, first channel of the array
norm_coeff_le=[770,847] #norm_coeff_le : for the low energy image, second channel of the array

In [None]:
#normalize images (use coefficients)
new_X_train = []
for i, x in enumerate(X_train):
    x_le = (x[:,:,1]-norm_coeff_le[0])/norm_coeff_le[1] # z-score
    x_re = (x[:,:,0]-norm_coeff_re[0])/norm_coeff_re[1] # z-score
    new_X_train.append(np.concatenate((np.expand_dims(x_re, axis=2), np.expand_dims(x_le, axis=2)),axis = 2)) #concatenate channels on the 3rd axis

new_X_test = []
for i, x in enumerate(X_test):
    x_le = (x[:,:,1]-norm_coeff_le[0])/norm_coeff_le[1] # z-score
    x_re = (x[:,:,0]-norm_coeff_re[0])/norm_coeff_re[1] # z-score
    new_X_test.append(np.concatenate((np.expand_dims(x_re, axis=2), np.expand_dims(x_le, axis=2)),axis = 2)) #concatenate channels on the 3rd axis

new_X_train = np.array(new_X_train)
new_X_test = np.array(new_X_test)

print("new_X_train shape: ", ... ) # FILL IN THE VALUE
print("new_X_test shape: ", ... ) ## FILL IN THE VALUE

What you should see: <br>
new_X_train shape:  (1395, 256, 256, 2) <br>
new_X_test shape:  (599, 256, 256, 2)
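
The per-image loop above applies a z-score, (x - mean) / std, to each channel separately. As a minimal sketch, the same result can be obtained in one step with NumPy broadcasting along the channel axis. `X_demo` is a synthetic stand-in for `X_train`, and the value range used to generate it is an assumption.

```python
import numpy as np

# Coefficients copied from the notebook: [mean, std] per channel.
norm_coeff_re = [897, 728]  # channel 0 (recombined image)
norm_coeff_le = [770, 847]  # channel 1 (low-energy image)

# Synthetic stand-in for X_train: 4 small two-channel patches (value range is made up).
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 4095, size=(4, 16, 16, 2))

# Broadcasting applies one mean/std pair per channel along the last axis.
means = np.array([norm_coeff_re[0], norm_coeff_le[0]], dtype=float)
stds = np.array([norm_coeff_re[1], norm_coeff_le[1]], dtype=float)
X_norm = (X_demo - means) / stds

# Reference: the explicit per-image, per-channel computation.
X_loop = np.array([
    np.stack([(x[:, :, 0] - norm_coeff_re[0]) / norm_coeff_re[1],
              (x[:, :, 1] - norm_coeff_le[0]) / norm_coeff_le[1]], axis=2)
    for x in X_demo
])
print("Same result:", np.allclose(X_norm, X_loop))
```

The vectorised form avoids the Python loop and the intermediate lists, at the cost of being slightly less explicit about which coefficient belongs to which channel.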

In [None]:
#using only one channel for the Xception model
X_train = np.expand_dims(new_X_train[:,:,:,0],axis=3)
print(X_train.shape)

X_test = ... ## FILL IN THE VALUE Do it the same way as for X_train: use only one channel.
print(X_test.shape)

### 2.4 Visualizing the Slices

In [None]:
# sanity check: look at what the data look like;
# try changing the range and swapping X_test for X_train

for i in range(10):
    idx = np.random.randint(0, len(X_test))
    plt.figure()
    plt.imshow(np.squeeze(X_test[idx,:,:]/255))
    plt.title("Label - " + str(y_test[idx]))
    plt.show()

## 3. Define the Network

In [None]:
#enter the number of units for the dense layer

def CNN_xception(input_shape):
    base_model = Xception(weights=None, include_top=False,input_shape=input_shape,pooling = 'avg')
    #hint : change the last layers if you built your own model
    x = base_model.output
    print(x.shape)
    x = Dense(1024, activation='relu')(x) #choose number of units for dense layer, 1024 for pre-loaded weights
    print(x.shape)
    predictions = Dense(1, activation='sigmoid')(x)
    print(predictions.shape)

    model = Model(inputs=base_model.input, outputs=predictions)
    return model

callbacks_list = [PlotLossesCallback()]

model1 = CNN_xception((IMAGE_SIZE, IMAGE_SIZE,1))

#model architecture
model1.summary()

## 4. Train the Model

### 4.1 Mode 1: Training the model using GPU

In [None]:
#optimizers are documented at https://keras.io/optimizers/
model1.compile(optimizer=Adam(learning_rate= ... ), loss='binary_crossentropy', metrics=['accuracy']) ## FILL IN THE VALUE What would be an appropriate learning rate?

#choose the batch size and the number of epochs
history = model1.fit(X_train/255, np.array(y_train), batch_size= ... , epochs=10,
                    validation_data=(X_test/255, np.array(y_test)), callbacks=callbacks_list) ## FILL IN THE VALUE What would be an appropriate batch size?
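
The batch size determines how many weight updates happen per epoch. The sketch below works through that arithmetic for the 1395 training images reported in section 2.3. The candidate batch sizes are illustrative only and are not the intended answers to the blanks above.

```python
import math

n_train = 1395   # training images, as reported in section 2.3
epochs = 10      # as in the fit call above

# Candidate values for illustration only, not recommendations.
steps = {}
for batch_size in (8, 16, 32, 64):
    steps[batch_size] = math.ceil(n_train / batch_size)  # last batch may be smaller
    total_updates = steps[batch_size] * epochs
    print(f"batch_size={batch_size:>2}: {steps[batch_size]} steps/epoch, "
          f"{total_updates} updates in total")
```

Smaller batches give more (noisier) updates per epoch and use less GPU memory; larger batches give fewer, smoother updates.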

### 4.2: Mode 2: Use the pre-trained weights

In [None]:
# use this cell if you cannot train the model on a GPU
model1.load_weights(os.path.join(folder_pth, r"checkpoint.hdf5"))

## 5. Test the model
### 5.1 Get the Predictions on the Test Set

In [None]:
# predict the probability of malignancy for each test image

predictions = []
for x in tqdm(X_test):
    predictions.append(model1.predict(np.expand_dims(x,axis=0)/255))
predictions = np.array(predictions).flatten()

### 5.2 Plot the ROC curve

In [None]:
false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, predictions) #use roc_curve function from sklearn library
area_under_curve = auc(false_positive_rate, true_positive_rate) #use auc function from sklearn library

plt.plot([0, 1], [0, 1], 'k--')
plt.plot(false_positive_rate, true_positive_rate, label='AUC = {:.3f}'.format(area_under_curve))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()
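
One way to read the AUC value: it equals the probability that a randomly chosen malignant case receives a higher score than a randomly chosen benign case. The sketch below computes that directly by comparing all positive/negative pairs. The helper `auc_by_ranking` and the toy labels and scores are hypothetical and only serve as an illustration; for real work, `roc_curve` and `auc` from scikit-learn remain the tools used in this notebook.

```python
import numpy as np

def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score against every negative score.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

# Toy labels and scores (made up for illustration).
y_demo = [0, 0, 1, 1]
s_demo = [0.1, 0.4, 0.35, 0.8]
print(auc_by_ranking(y_demo, s_demo))  # 3 of 4 pairs are ordered correctly -> 0.75
```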

### 5.3 Classification Report

In [None]:
classification_threshold = 0.5
true_labels = y_test
predicted_labels = predictions > classification_threshold

print(classification_report(true_labels, predicted_labels))
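
The numbers in the classification report follow from four counts obtained after thresholding: true positives, false positives, false negatives, and true negatives. As a small worked sketch on made-up labels and scores (not the workshop predictions), precision, recall, and F1 for the positive class can be computed by hand:

```python
import numpy as np

# Toy labels and scores, made up for illustration.
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores_demo = np.array([0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3])
threshold_demo = 0.5

y_pred_demo = scores_demo > threshold_demo

tp = int(np.sum(y_pred_demo & (y_true_demo == 1)))   # malignant, predicted malignant
fp = int(np.sum(y_pred_demo & (y_true_demo == 0)))   # benign, predicted malignant
fn = int(np.sum(~y_pred_demo & (y_true_demo == 1)))  # malignant, predicted benign
tn = int(np.sum(~y_pred_demo & (y_true_demo == 0)))  # benign, predicted benign

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("precision:", precision, "recall:", recall, "f1:", f1)
```

Raising the threshold typically trades recall for precision, and lowering it does the opposite.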

### 5.4 Test the model with test time augmentation (TTA) to improve results

Documentation: https://towardsdatascience.com/test-time-augmentation-tta-and-how-to-perform-it-with-keras-4ac19b67fb4d

In [None]:
#you can change the transformations; see the ImageDataGenerator documentation
train_datagen = ImageDataGenerator(
        rescale=1./255,
        shear_range=...,
        vertical_flip=True,
        horizontal_flip=True) ## FILL IN THE VALUE Define the shear range for the test time augmentation (shear intensity, given as a shear angle in degrees)

In [None]:
#choose the number of times the transformations are applied to the test set
tta_idx = 4
tta_prediction = np.empty((tta_idx,) + predictions.shape[:] + (1,)) #allocate one (n_samples, 1) slot per TTA round
for i in tqdm(range(tta_idx)):
    val_it_plain = train_datagen.flow(X_test,y_test, batch_size=len(y_test), shuffle = False)
    x,y = next(val_it_plain) #built-in next() works whether or not the iterator exposes a .next() method
    tta_prediction[i] = model1.predict(x)

tta_prediction=np.array(tta_prediction)
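
The idea behind TTA is to predict on several transformed copies of each test image and average the results. The sketch below shows that averaging step in plain NumPy, using deterministic flips instead of the random `ImageDataGenerator` transforms. `fake_predict` is a hypothetical stand-in for `model1.predict`, and `tta_predict` is an illustrative helper, not part of Keras.

```python
import numpy as np

def tta_predict(predict_fn, images):
    """Average predictions over the original and three flipped variants of each image."""
    variants = [
        images,                    # original
        images[:, :, ::-1, :],     # horizontal flip (width axis)
        images[:, ::-1, :, :],     # vertical flip (height axis)
        images[:, ::-1, ::-1, :],  # both flips
    ]
    preds = np.stack([predict_fn(v) for v in variants], axis=0)
    return preds.mean(axis=0)

# Hypothetical stand-in for model1.predict: sigmoid of a weighted pixel sum.
_rng = np.random.default_rng(0)
_weights = _rng.normal(size=(8, 8, 1))

def fake_predict(batch):
    logits = (batch * _weights).sum(axis=(1, 2, 3))
    return 1.0 / (1.0 + np.exp(-logits))

images_demo = _rng.normal(size=(5, 8, 8, 1))
plain = fake_predict(images_demo)
averaged = tta_predict(fake_predict, images_demo)
print(plain.shape, averaged.shape)
```

Because the four variants cover every flip combination, the averaged prediction no longer depends on whether the input image arrives flipped.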

In [None]:
mean_prediction = np.mean(tta_prediction,axis = 0).reshape(len(y_test))

# ROC validation plot with TTA
false_positive_rate, true_positive_rate, threshold = roc_curve(y_test, mean_prediction) #use roc_curve function from sklearn library
area_under_curve = auc(false_positive_rate, true_positive_rate) #use auc function from sklearn library

plt.plot([0, 1], [0, 1], 'k--')
plt.plot(false_positive_rate, true_positive_rate, label='AUC = {:.3f}'.format(area_under_curve))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
#plt.savefig(ROC_PLOT_FILE, bbox_inches='tight')
plt.show()

In [None]:
classification_threshold = ... ## FILL IN THE VALUE Define the classification threshold, it ranges from 0 to 1
true_labels = y_test
predicted_labels = mean_prediction > classification_threshold

print(classification_report(true_labels, predicted_labels))
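
There is no single correct threshold; the choice depends on how false positives and false negatives are weighted. One common heuristic, shown here only as a sketch on made-up data, is to pick the threshold that maximises Youden's J statistic (true positive rate minus false positive rate). The helper `youden_threshold` is hypothetical and is not the intended answer to the blank above.

```python
import numpy as np

def youden_threshold(y_true, scores):
    """Return the candidate threshold maximising TPR - FPR (Youden's J) and that J value."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    best_t, best_j = None, -np.inf
    for t in np.unique(scores):
        pred = scores >= t  # predict positive at or above the threshold
        tpr = np.sum(pred & (y_true == 1)) / n_pos
        fpr = np.sum(pred & (y_true == 0)) / n_neg
        if tpr - fpr > best_j:
            best_t, best_j = float(t), float(tpr - fpr)
    return best_t, best_j

# Toy labels and scores, made up for illustration.
y_demo = [0, 0, 0, 1, 1, 1, 0, 1]
s_demo = [0.1, 0.2, 0.25, 0.3, 0.6, 0.7, 0.75, 0.9]
best_t, best_j = youden_threshold(y_demo, s_demo)
print("Best threshold:", best_t, "Youden's J:", best_j)
```

In a clinical setting a missed malignancy is usually costlier than a false alarm, so a threshold favouring sensitivity may be preferred over the J-optimal one.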

If you have made it this far, well done! Now you can play with the data and try new models.

# Second part: try to implement your own model with 2 channels




Suggested changes: <br>
* you can load the views (CC and MLO) separately and train on those subsets
* during pre-processing you can implement another normalization method, filtering, rebinning, or data augmentation
* for training you can try a different base model (VGG16, DenseNet121, InceptionV3, ...) and pre-load the ImageNet weights
* after loading a base model, you can add more layers or change the number of units
* try different hyperparameters: the learning rate, the loss function, the batch size, the optimizer, ...
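
One practical point when pre-loading ImageNet weights: the Keras application models with `weights='imagenet'` expect three input channels, whereas the arrays here have one or two. The sketch below shows one possible way to build a three-channel array. The helper `to_three_channels` is hypothetical, and using the channel mean as the third channel is an arbitrary assumption, not a recommendation from the course.

```python
import numpy as np

def to_three_channels(x):
    """Return a 3-channel copy of a batch shaped (N, H, W, C) with C equal to 1 or 2."""
    n_channels = x.shape[-1]
    if n_channels == 1:
        # Repeat the single channel three times.
        return np.repeat(x, 3, axis=-1)
    if n_channels == 2:
        # ASSUMPTION: use the mean of the two channels as the third channel.
        third = x.mean(axis=-1, keepdims=True)
        return np.concatenate([x, third], axis=-1)
    raise ValueError(f"Expected 1 or 2 channels, got {n_channels}")

# Synthetic stand-ins for the single-channel and two-channel arrays.
one_channel = np.zeros((2, 16, 16, 1))
two_channel = np.ones((2, 16, 16, 2))
print(to_three_channels(one_channel).shape)
print(to_three_channels(two_channel).shape)
```

When training from scratch with `weights=None`, as in `CNN_xception`, no such conversion is needed and `input_shape` can use two channels directly.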

## In case the images cannot be loaded entirely into memory
The generator function loads the data batch by batch. To train the model, pass the generator to model1.fit(); older TensorFlow versions used model1.fit_generator() for this, which is now deprecated.

In [None]:
# def generator(gen_type, batch_size = 10):

#     assert gen_type in ["train","test"], "Allowed gen_type: train or test"

#     ids = os.listdir(gen_type)

#     while True:

#         X = []
#         y = []

#         offset = np.random.randint(0,len(ids)-batch_size)
#         batch_ids = ids[offset:offset+batch_size]

#         for image_id in batch_ids:

#             X.append(np.load(os.path.join(gen_type,image_id))) #possible that there might be cache files and stuff, so may be check the extension as well for train_Id
#             y.append(classification[image_ID == image_id].values[0])


#         yield np.array(X),np.array(y)

# train_gen = generator("train")
# test_gen = generator("test")

# model1.fit(train_gen, steps_per_epoch=2000,  # fit() accepts generators; fit_generator() is deprecated
#     epochs=20,
#     validation_data=test_gen,
#     validation_steps=800)
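
As a runnable variation on the commented-out generator, the sketch below shuffles the file list once per epoch so that every file is visited exactly once, rather than drawing a random offset each time, and it skips files that do not end in `.npy`. The function name `npy_batch_generator`, the `labels_by_file` dictionary, and the temporary demo files are all hypothetical; in the notebook the labels would come from `classification` and `image_ID`.

```python
import os
import tempfile
import numpy as np

def npy_batch_generator(folder, labels_by_file, batch_size=10, seed=0):
    """Yield (X, y) batches forever, visiting every .npy file once per epoch."""
    rng = np.random.default_rng(seed)
    # Keep only .npy files so stray cache or hidden files are ignored.
    ids = sorted(f for f in os.listdir(folder) if f.endswith(".npy"))
    while True:
        order = rng.permutation(len(ids))  # new random order each epoch
        for start in range(0, len(ids), batch_size):
            batch_ids = [ids[i] for i in order[start:start + batch_size]]
            X = np.array([np.load(os.path.join(folder, f)) for f in batch_ids])
            y = np.array([labels_by_file[f] for f in batch_ids])
            yield X, y

# Demo with temporary synthetic files (names and labels are made up).
with tempfile.TemporaryDirectory() as tmp:
    labels_demo = {}
    for i in range(7):
        name = f"ID_demo{i}.npy"
        np.save(os.path.join(tmp, name), np.zeros((4, 4, 2)) + i)
        labels_demo[name] = i % 2
    open(os.path.join(tmp, "notes.txt"), "w").close()  # non-.npy file, should be skipped

    gen = npy_batch_generator(tmp, labels_demo, batch_size=3)
    shapes = [next(gen)[0].shape for _ in range(3)]
    print(shapes)
```

With seven files and a batch size of three, one epoch consists of two full batches and one batch with the single remaining file, so `steps_per_epoch` would be `ceil(7 / 3) = 3`.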

 🎨 **Auxiliary task: Play around with different built-in Keras models**

Reference: https://www.tensorflow.org/api_docs/python/tf/keras/applications

In [None]:
#enter the number of units for the dense layer
tf.keras.backend.clear_session()

def CNN_trial(input_shape):
  #play with different keras base models
    base_model = tf.keras.applications................(weights=None,             # try using any keras model here, refer the previous code block that defines the model
                                                       include_top=False,
                                                       input_shape=input_shape,
                                                       pooling = 'avg')
    #hint : change the last layers if you built your own model
    x = base_model.output
    print(x.shape)
    x = Dense(1024, activation='relu')(x) #choose number of units for dense layer, 1024 for pre-loaded weights
    print(x.shape)
    predictions = Dense(1, activation='sigmoid')(x)
    print(predictions.shape)

    model = Model(inputs=base_model.input, outputs=predictions)
    return model


model2 = CNN_trial((IMAGE_SIZE, IMAGE_SIZE,1))

#model architecture
model2.summary()  # summary() prints by itself; wrapping it in print() adds a stray "None"

# training setup: compile the model and define the callbacks
from tensorflow.keras.callbacks import ModelCheckpoint, LearningRateScheduler, ReduceLROnPlateau
model2.compile(optimizer= Adam(learning_rate=0.001),loss='binary_crossentropy',metrics=['accuracy'])
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss',
                                                  patience=50,
                                                  verbose=0,
                                                  mode='min',
                                                  )
model_checkpoint_path = "./trial_checkpoint.hdf5"
# load previously saved weights of the model
# model2.load_weights(model_checkpoint_path)

check_point = tf.keras.callbacks.ModelCheckpoint(model_checkpoint_path,
                                                 monitor='val_loss',
                                                 verbose=1,
                                                 save_best_only=True,
                                                 mode='min',
                                                 )
reduce_lrplateau = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=10,
                                     min_delta=0.0001, min_lr=1e-6, verbose=1)
pretrain_callbacks = [check_point, early_stopping, reduce_lrplateau, PlotLossesCallback()]

history =model2.fit(X_train,  np.array(y_train), batch_size=4, epochs=1000,
                    validation_data=(X_test, np.array(y_test)),
                    callbacks=pretrain_callbacks,
                    verbose=2,
                    shuffle=True)