# Red blood cell classifier : practical application of a convNet

In this example, you will build your own convNet to perform a specific task: classifying images of red blood cells (RBC) as being of either good or bad quality.

Data were kindly provided by Viviana Claveria & Manouk Abkarian from the Centre de Biologie Structurale, Montpellier (France).

This notebook was inspired by the EMBL notebooks from the Kreshuk lab (https://github.com/kreshuklab/teaching-dl-course-2020/).

## I - Data importation

The data are RGB images saved in the PNG format on Google Drive. All images are already sorted into two classes:
- good RBC (1)
- bad RBC (0)


The first step is to load all the python packages we will use in the notebook:

In [None]:
import os
from glob import glob
import random
import sys
import warnings
import numpy as np
from tqdm import tqdm

from skimage.io import imread, imshow, imread_collection, concatenate_images
from skimage.transform import resize
from skimage.morphology import label

import tensorflow as tf
from tensorflow import keras
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, Conv2D, MaxPooling2D, Flatten, BatchNormalization, Input
from keras.layers import RandomFlip, Resizing, Rescaling, RandomRotation

import matplotlib.pyplot as plt
%matplotlib inline

If your data are saved on your Google Drive, you first need to connect Google Drive to Google Colab and move to the folder containing the dataset.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

base_dir = '/content/drive/MyDrive/Deep_learning_formation_MRI/Doc_JB_2022/Notebooks for workshop /Data/RBC_classification/RBC_dataset_simplified'
os.chdir(base_dir)
%ls

Since the images do not all have the same X,Y dimensions, we define a set of parameters to homogenize the training/testing sets. For example, below, the width and height of all images are set to IMG_WIDTH and IMG_HEIGHT respectively. Since we are working with RGB images, the number of channels is set to 3.

Note that **the path to the different folders will need to be updated** accordingly.

In [None]:
# Set the image size
# -----------------
IMG_WIDTH = 85
IMG_HEIGHT = 85
IMG_CHANNEL = 3

# Define the path where the data are saved
# ----------------------------------------

goodRBC_train_PATH = 'train/good_RBC/'
badRBC_train_PATH = 'train/bad_RBC/'

goodRBC_val_PATH = 'validation/good_RBC/'
badRBC_val_PATH = 'validation/bad_RBC/'

goodRBC_test_PATH = 'test/good_RBC/'
badRBC_test_PATH = 'test/bad_RBC/'

The following function **get_data** is used to load the images and convert them to the right format (according to the parameters defined above).

In [None]:
def get_data(path):

  # get the total number of samples
  # -------------------------------

  ids = next(os.walk(path))[2]
  X = np.zeros((len(ids), IMG_HEIGHT, IMG_WIDTH, IMG_CHANNEL), dtype=np.uint8)
  img_size = np.zeros((len(ids),2))

  # sys.stdout.flush()
  
  # loop over all the images of the folder
  # --------------------------------------

  for n, id_ in tqdm(enumerate(ids), total=len(ids)):
    path_new = os.path.join(path, id_)

    # read the file with skimage and resize it so that all the images
    # have the same dimensions
    # -------------------------

    img = imread(path_new)
    img_size[n,:] = img.shape[0:2] # original (height, width) of the image
    img = resize(img, (IMG_HEIGHT, IMG_WIDTH), mode='constant', preserve_range=True)

    if len(img.shape) == 3:
      X[n] = img
    else:
      # grayscale image: duplicate the single channel to get 3 channels
      img = np.stack((img,)*3, axis=-1)
      X[n] = img

  # Print the median height/width of the original images (+/- standard deviation)
  print(f'median image height : {np.median(img_size[:,0])} +/- {np.std(img_size[:,0])}')
  print(f'median image width : {np.median(img_size[:,1])} +/- {np.std(img_size[:,1])}')

  return X
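
**get_data** handles grayscale and 3-channel images. PNG files can also carry an alpha channel, in which case the resized array would have 4 channels and could not be stored in `X`. Below is a minimal sketch of a helper covering that case. The name `to_rgb` is hypothetical (it is not part of the original notebook), and dropping the alpha channel is only one possible choice.

```python
import numpy as np

def to_rgb(img):
    """Return an (H, W, 3) array from a grayscale, RGB or RGBA image."""
    if img.ndim == 2:
        # Grayscale: repeat the single channel three times
        return np.stack((img,) * 3, axis=-1)
    if img.ndim == 3 and img.shape[-1] == 4:
        # RGBA: drop the alpha channel (assumption: transparency is not informative)
        return img[..., :3]
    if img.ndim == 3 and img.shape[-1] == 3:
        return img
    raise ValueError(f"Unsupported image shape: {img.shape}")

# Dummy arrays standing in for images read from disk
gray = np.zeros((85, 85))
rgba = np.zeros((85, 85, 4))
print(to_rgb(gray).shape, to_rgb(rgba).shape)
```

In `get_data`, such a helper could be called on `img` right after `resize`, before storing the result in `X[n]`.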

The training, validation and test sets are built below, together with the corresponding ground-truth labels.
**Note that it takes ~20-30min to load the data.**

In [None]:
# Load the training set - the images are classified into two different folders. 
# Below, the "good" images are associated to a label equal to 1. The label for 
# the "bad" images is set to 0.  
# ----------------------------
Good_RBC = get_data(goodRBC_train_PATH)
y_good = np.ones((Good_RBC.shape[0],))
Bad_RBC = get_data(badRBC_train_PATH)
y_bad = np.zeros((Bad_RBC.shape[0],))

X_train = np.concatenate((Good_RBC,Bad_RBC), axis=0)
Y_train = np.concatenate((y_good,y_bad), axis=0)

# Load the validation set
# -----------------------
Good_RBC_validation = get_data(goodRBC_val_PATH)
y_good = np.ones((Good_RBC_validation.shape[0],))
Bad_RBC_validation = get_data(badRBC_val_PATH)
y_bad = np.zeros((Bad_RBC_validation.shape[0],))

X_validation = np.concatenate((Good_RBC_validation,Bad_RBC_validation), axis=0)
Y_validation = np.concatenate((y_good,y_bad), axis=0)

# Load the test set
# -----------------
Good_RBC_test = get_data(goodRBC_test_PATH)
y_good = np.ones((Good_RBC_test.shape[0],))
Bad_RBC_test = get_data(badRBC_test_PATH)
y_bad = np.zeros((Bad_RBC_test.shape[0],))

X_test = np.concatenate((Good_RBC_test,Bad_RBC_test), axis=0)
Y_test = np.concatenate((y_good,y_bad), axis=0)


Print the composition of the dataset:

In [None]:
print(f'The train dataset is composed of {Good_RBC.shape[0]} images belonging to the "goodRBC" class and {Bad_RBC.shape[0]} images belonging to the "badRBC" class')
print(f'The validation dataset is composed of {Good_RBC_validation.shape[0]} images belonging to the "goodRBC" class and {Bad_RBC_validation.shape[0]} images belonging to the "badRBC" class')
print(f'The test dataset is composed of {Good_RBC_test.shape[0]} images belonging to the "goodRBC" class and {Bad_RBC_test.shape[0]} images belonging to the "badRBC" class')
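
The class proportions matter when interpreting the accuracy later on: a classifier that always predicts the majority class already reaches an accuracy equal to the fraction of that class. The sketch below computes this baseline on made-up labels. The helper name `class_balance` is hypothetical.

```python
import numpy as np

def class_balance(y):
    """Return the fraction of samples labelled 1 ("good") and the
    accuracy of a classifier that always predicts the majority class."""
    y = np.asarray(y)
    frac_good = float(np.mean(y == 1))
    baseline = max(frac_good, 1.0 - frac_good)
    return frac_good, baseline

# Toy labels (made up for illustration): 3 "good" and 1 "bad"
y_toy = np.concatenate((np.ones(3), np.zeros(1)))
frac_good, baseline = class_balance(y_toy)
print(f"good fraction: {frac_good:.2f}, majority-class baseline accuracy: {baseline:.2f}")
```

With the notebook variables, the same function could be applied to `Y_train`, `Y_validation` and `Y_test`.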

## II - Data visualization

Display a few examples of images belonging to the "GoodRBC" class

In [None]:
plt.rcParams['figure.figsize'] = (9,9) # Make the figures a bit bigger

for i in range(9):
    plt.subplot(3,3,i+1)
    num = random.randrange(len(Good_RBC)) # randint includes its upper bound and could raise an IndexError
    im = Good_RBC[num]
    plt.imshow(im)
    
plt.tight_layout()
plt.show()

And the same for the "BadRBC" class

In [None]:
plt.rcParams['figure.figsize'] = (9,9) # Make the figures a bit bigger

for i in range(9):
    plt.subplot(3,3,i+1)
    num = random.randrange(len(Bad_RBC)) # randint includes its upper bound and could raise an IndexError
    im = Bad_RBC[num]
    plt.imshow(im)
    
plt.tight_layout()
plt.show()

## III - Definition of the model and training

Define the model and the compilation options. 


In [None]:
model = ...

model.compile(...)

model.summary()
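
The cell above is left as an exercise. As a point of comparison, here is a minimal sketch of one possible small convNet for 85x85 RGB inputs and a single sigmoid output. The layer sizes, the optimizer and the names `build_small_convnet` and `example_model` are illustrative assumptions, not the reference solution of the workshop.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_small_convnet(height=85, width=85, channels=3):
    """Build and compile a small binary image classifier (illustrative sizes)."""
    model = keras.Sequential([
        keras.Input(shape=(height, width, channels)),
        layers.Rescaling(1.0 / 255),             # map pixel values from [0, 255] to [0, 1]
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),   # probability of the "good" class (label 1)
    ])
    # Binary cross-entropy matches a single sigmoid output with 0/1 labels
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

example_model = build_small_convnet()
example_model.summary()
```

The sigmoid output combined with the binary cross-entropy loss matches the 0/1 labels built in section I.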

Train the model on the training set, using the validation set to monitor the training.

In [None]:
history = ...

Display the loss function during the training


In [None]:
history_dict = history.history

loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

n = len(loss_values)
epochs = range(1, n+1)

plt.rcParams['figure.figsize'] = (7,7) # Make the figures a bit bigger
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs', fontsize=15)
plt.ylabel('Loss', fontsize=15)
plt.legend()

plt.show()
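
A common way to read these curves is to look for the epoch at which the validation loss is lowest: if the training loss keeps decreasing after that point while the validation loss rises, the model is probably overfitting. Here is a minimal sketch using made-up loss values. The helper name `best_epoch` is hypothetical.

```python
import numpy as np

def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss, and that loss."""
    val_losses = np.asarray(val_losses, dtype=float)
    idx = int(np.argmin(val_losses))
    return idx + 1, float(val_losses[idx])

# Made-up validation losses, for illustration only
toy_val_loss = [0.69, 0.52, 0.41, 0.38, 0.40, 0.45]
epoch, loss = best_epoch(toy_val_loss)
print(f"lowest validation loss {loss:.2f} at epoch {epoch}")
```

In the notebook, the function could be called as `best_epoch(history.history['val_loss'])`.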

Display the accuracy on the training and validation sets during the training.

In [None]:
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

n = len(acc_values)
epochs = range(1, n+1)

plt.rcParams['figure.figsize'] = (7,7) # Make the figures a bit bigger
plt.plot(epochs, acc_values, 'bo', label='Training accuracy')
plt.plot(epochs, val_acc_values, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs', fontsize=15)
plt.ylabel('Accuracy', fontsize=15)
plt.legend()

plt.show()

Evaluate the model on the test set:

In [None]:
test_loss, test_acc = model.evaluate(X_test, Y_test)
print('test_acc:', test_acc)

In [None]:
# model.predict returns, for each input image, the predicted probability of
# belonging to class 1 ("good" RBC).
# -----------------------------------------------------------

predicted_classes = model.predict(X_test)
predicted_classes = np.round(predicted_classes) # defines the threshold between the two classes at 0.5
predicted_classes = predicted_classes.flatten()

# Check which items we got right / wrong
# --------------------------------------

correct_indices = np.nonzero(predicted_classes[:] == Y_test[:])[0]
incorrect_indices = np.nonzero(predicted_classes[:] != Y_test[:])[0]

print(len(correct_indices))
print(len(incorrect_indices))
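
The numbers of correct and incorrect predictions can be broken down further into true/false positives and negatives, which shows whether the errors are mostly "bad" cells accepted as "good" or the opposite. Here is a minimal sketch on made-up values. The helper name `confusion_counts` is hypothetical, and class 1 ("good") is treated as the positive class.

```python
import numpy as np

def confusion_counts(y_true, y_prob, threshold=0.5):
    """Return (TP, TN, FP, FN) for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true).astype(int).ravel()
    # A probability strictly above the threshold is assigned to class 1
    y_pred = (np.asarray(y_prob).ravel() > threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    return tp, tn, fp, fn

# Made-up labels and probabilities, for illustration only
y_toy = [1, 1, 0, 0, 1]
p_toy = [0.9, 0.4, 0.2, 0.7, 0.8]
tp, tn, fp, fn = confusion_counts(y_toy, p_toy)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

In the notebook, the function could be called with `Y_test` and the raw output of `model.predict(X_test)`.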

In [None]:
plt.rcParams['figure.figsize'] = (9,9) # Make the figures a bit bigger

for i in range(9):
    plt.subplot(3,3,i+1)
    num = random.randrange(len(correct_indices)) # randint includes its upper bound and could raise an IndexError
    plt.imshow(X_test[correct_indices[num]], interpolation='none')
    plt.title("Predicted {}, Class {}".format(predicted_classes[correct_indices[num]],
                                              Y_test[correct_indices[num]]))
    
plt.tight_layout()
plt.show()

In [None]:
plt.rcParams['figure.figsize'] = (9,9) # Make the figures a bit bigger

for i in range(9):
    plt.subplot(3,3,i+1)
    num = random.randrange(len(incorrect_indices)) # randint includes its upper bound and could raise an IndexError
    plt.imshow(X_test[incorrect_indices[num]], interpolation='none')
    plt.title("Predicted {}, Class {}".format(predicted_classes[incorrect_indices[num]], 
                                              Y_test[incorrect_indices[num]]))
    
plt.tight_layout()
plt.show()

## IV - Transfer learning

In the following section, we will see how transfer learning can be used to improve the performance of a classifier. The idea is to use a network (such as VGG16) that was previously trained on a large image database (ImageNet, more than a million images) and has therefore learned to recognize a wide variety of image features.

By adding new layers at the end of the pre-trained network, we can reuse its feature-recognition ability and apply it to a completely new problem.

The first step is to load the pre-trained VGG16 network. Many other models are available in Keras (https://keras.io/api/applications/) and they can all be used for transfer learning.

In [None]:
# From keras.applications, import the VGG16 network pretrained on the ImageNet
# database. The include_top option is set to False, meaning that the classifier
# part (composed of a dense network) is removed.
# Keras expects the input shape as (height, width, channels).
# ----------------------------------------------
conv_base = keras.applications.vgg16.VGG16(
    include_top=False,
    weights='imagenet',
    input_shape=(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNEL)
)

conv_base.summary()

We then build our model around VGG16:

In [None]:
model_VGG = Sequential([
    
    # Initialization, normalization and image augmentation 
    Input(shape=(IMG_HEIGHT, IMG_WIDTH, IMG_CHANNEL)),
    Rescaling(scale=1./255),
    RandomFlip(mode="horizontal_and_vertical"),
    
    # Add the VGG16 network without the classifier
    conv_base,

    Flatten(),
    
    # Create the fully connected layers for the final classification
    Dense(256, activation = 'relu'), # 256 FCN nodes
    Dropout(0.5),
    Dense(128, activation = 'relu'), # 128 FCN nodes
    Dropout(0.5),
    Dense(1, activation = 'sigmoid'),
])
model_VGG.summary()

By default, all the parameters of the VGG16 network can be retrained. For transfer learning, we need to define which part of the network will be trained, and which part will be "frozen" (i.e. will keep the weights obtained from training on the ImageNet database).
In our case, only the last convolution block of VGG16 and the densely connected part will be trained:

In [None]:
# The conv_base is composed of 19 layers. The last 4 layers belong to
# block5. Therefore, only the last four layers will be set to "trainable".
# ------------------------------------------------------------------------

conv_base.trainable = True
for n in range(15):
  conv_base.layers[n].trainable = False

# Now, when looking at the network summary, the global architecture did not 
# change. However, the number of trainable parameters dropped by almost half.
model_VGG.summary()

# Compile the new network
model_VGG.compile(optimizer = 'adam', 
            loss='binary_crossentropy',
            metrics=['accuracy'])
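
Freezing by index relies on the order of the layers. An equivalent sketch selects the layers by name instead, using the `block5` prefix of the Keras VGG16 layer names. To keep the example light, `weights=None` is used here (random weights, no download), so it only illustrates the freezing logic. The variable name `demo_base` is hypothetical.

```python
from tensorflow import keras

# Same architecture as conv_base, but with random weights to avoid the download
demo_base = keras.applications.vgg16.VGG16(
    include_top=False,
    weights=None,
    input_shape=(85, 85, 3),
)

# Train only the layers of the last convolution block
demo_base.trainable = True
for layer in demo_base.layers:
    layer.trainable = layer.name.startswith("block5")

trainable_names = [layer.name for layer in demo_base.layers if layer.trainable]
print(len(demo_base.layers), trainable_names)
```

As in the cell above, the model has to be compiled after changing the `trainable` flags for the change to be taken into account during training.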

Finally, train the new model and, if needed, save it.

In [None]:
history = model_VGG.fit(X_train, Y_train,
                    epochs = 50,
                    batch_size = 64,
                    validation_data=(X_validation, Y_validation),
                    shuffle = True)

# Save the model
# --------------

# model_VGG.save('RBC_classification_VGG16_1.h5')

In [None]:
history_dict = history.history

loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']

n = len(loss_values)
epochs = range(1, n+1)

plt.rcParams['figure.figsize'] = (7,7) # Make the figures a bit bigger
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs', fontsize=15)
plt.ylabel('Loss', fontsize=15)
plt.legend()

plt.show()

In [None]:
acc_values = history_dict['accuracy']
val_acc_values = history_dict['val_accuracy']

n = len(acc_values)
epochs = range(1, n+1)

plt.rcParams['figure.figsize'] = (7,7) # Make the figures a bit bigger
plt.plot(epochs, acc_values, 'bo', label='Training accuracy')
plt.plot(epochs, val_acc_values, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs', fontsize=15)
plt.ylabel('Accuracy', fontsize=15)
plt.legend()

plt.show()

In [None]:
test_loss, test_acc = model_VGG.evaluate(X_test, Y_test)
print('test_acc:', test_acc)