# <center>CNN on COVID-19 CT lung scans</center>
## <div align=right>Made by Ihor Markevych</div>

[My Medium article](https://medium.com/@ih.markevych/4294e29b72b?source=friends_link&sk=ce5ebe4837e469ce0ce4f7145b3db08d).

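Since the dataset is small, we use the validation set as the test set, which leaves more data for training. We also do not set a random seed, so weight initialization, batch shuffling, and augmentation differ on each run; looking at several runs gives a more robust evaluation of performance than a single one. Note that `validation_split` picks the validation files by their sorted order, so the held-out subset itself stays the same between runs.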
  
----------

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import cv2

import tensorflow as tf

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import EarlyStopping
import os

import sklearn.metrics

In [None]:
DIR = '/kaggle/input/covidct/'
SUBDIR_POS = 'CT_COVID/'
SUBDIR_NEG = 'CT_NonCOVID/'
print(f'Positive samples: {len(os.listdir(DIR + SUBDIR_POS))}.')
print(f'Negative samples: {len(os.listdir(DIR + SUBDIR_NEG))}.')

In [None]:
im = cv2.imread('/kaggle/input/covidct/CT_COVID/2020.03.20.20037325-p23-122.png', 0) / 255
plt.imshow(im, cmap='gray', vmin=0, vmax=1) 
plt.show()

In [None]:
EPOCHS = 40
BATCH_SIZE = 64
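# Inverse-time decay, equivalent to the legacy `decay` argument: lr / (1 + decay * step)
LR_SCHEDULE = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.001, decay_steps=1, decay_rate=0.001 / EPOCHS)
OPTIMIZER = tf.keras.optimizers.Adam(learning_rate=LR_SCHEDULE)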
img_height, img_width = 248, 248
# The metric is compiled as 'accuracy', so the validation key is 'val_accuracy'
es = EarlyStopping(monitor='val_accuracy', mode='max',
                   verbose=1, 
                   patience=10, restore_best_weights=True)

In [None]:
# https://stackoverflow.com/questions/42443936/keras-split-train-test-set-when-using-imagedatagenerator
train_datagen = ImageDataGenerator(
    rescale=1./255,
    horizontal_flip=True,
    rotation_range=5,
    width_shift_range=0.05,
    height_shift_range=0.05,
    shear_range=0.05,
    zoom_range=0.05,
    validation_split=0.2) 

train_generator = train_datagen.flow_from_directory(
    DIR,
    target_size=(img_height, img_width),
    batch_size=BATCH_SIZE,
    class_mode='binary',
    color_mode="grayscale",
    subset='training') 

validation_generator = train_datagen.flow_from_directory(
    DIR, 
    target_size=(img_height, img_width),
    batch_size=BATCH_SIZE,
    class_mode='binary',
    color_mode="grayscale",
    subset='validation') 

In [None]:
def create_model():
    model = Sequential([
        Conv2D(16, 1, padding='same', activation='relu', input_shape=(img_height, img_width, 1)),
        MaxPooling2D(),
        Conv2D(32, 3, padding='same', activation='relu'),
        MaxPooling2D(),
        Conv2D(64, 5, padding='same', activation='relu'),
        MaxPooling2D(),
        Conv2D(64, 5, padding='same', activation='relu'),
        MaxPooling2D(),
        
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.4),
        Dense(64, activation='relu'),
        Dropout(0.5),
        Dense(8, activation='relu'),
        Dropout(0.3),
        Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer=OPTIMIZER,
                  loss='binary_crossentropy',
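                  # Explicit names keep the history keys ('Precision', 'Recall') stable across TF versions
                  metrics=['accuracy',
                           tf.keras.metrics.Precision(name='Precision'),
                           tf.keras.metrics.Recall(name='Recall')])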
    
    return model

In [None]:
model = create_model()
model.summary()

In [None]:
# `fit` accepts generators directly; `fit_generator` is deprecated
hist = model.fit(
        train_generator,
        steps_per_epoch = train_generator.samples // BATCH_SIZE,
        validation_data = validation_generator, 
        validation_steps = validation_generator.samples // BATCH_SIZE,
        epochs = EPOCHS,
        verbose = 0, 
        callbacks=[es])

In [None]:
plt.title('Accuracy')
plt.plot(hist.history['accuracy'])
plt.plot(hist.history['val_accuracy'])
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

plt.title('Loss')
plt.plot(hist.history['loss'])
plt.plot(hist.history['val_loss'])
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

plt.title('Recall')
plt.plot(hist.history['Recall'])
plt.plot(hist.history['val_Recall'])
plt.ylabel('recall')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

plt.title('Precision')
plt.plot(hist.history['Precision'])
plt.plot(hist.history['val_Precision'])
plt.ylabel('precision')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper left')
plt.show()

In [None]:
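# The generator shuffles by default, so collect labels and predictions batch by batch to keep them aligned
validation_generator.reset()
y_true, y_prob = [], []
for _ in range(len(validation_generator)):
    x_batch, y_batch = next(validation_generator)
    y_true.append(y_batch)
    y_prob.append(model.predict(x_batch, verbose=0).ravel())
y_true = np.concatenate(y_true).astype(int)
y_pred = (np.concatenate(y_prob) > 0.5).astype(int)

for name, value in zip(model.metrics_names, model.evaluate(validation_generator, verbose=0)):
    print(f'{name}: {value}')

print(f'F1 score: {sklearn.metrics.f1_score(y_true, y_pred)}')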

In [None]:
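# Class indices follow the alphabetical folder order (CT_COVID = 0, CT_NonCOVID = 1), so read the labels from the generator
labels = sorted(validation_generator.class_indices, key=validation_generator.class_indices.get)
pd.DataFrame(sklearn.metrics.confusion_matrix(y_true, y_pred),
             columns=[f'pred {name}' for name in labels],
             index=[f'true {name}' for name in labels])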

### Saving model

In [None]:
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)
model.save_weights("model.h5")

### Underfit / overfit

We observe that the model is not overfitting: validation loss is lower than training loss. Dropout explains this gap. It is active only during training, so at prediction time the model uses all of its units and generally makes better predictions. However, adding more neurons, layers, or filters makes the model perform worse, so we conclude that the model has no avoidable bias left to remove.
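
To make the dropout argument concrete, here is a minimal pure-Python sketch of inverted dropout, the scheme the Keras `Dropout` layer follows (kept units are scaled by `1 / (1 - rate)` during training, and the layer is a pass-through at prediction time). The function name `dropout` and the sample activations are illustrative assumptions, not part of the notebook.

```python
import random


def dropout(values, rate, training, rng=random):
    """Apply inverted dropout to a list of activations (requires 0 <= rate < 1)."""
    if not training or rate == 0.0:
        # Prediction time: every unit is used and nothing is rescaled.
        return list(values)
    keep_prob = 1.0 - rate
    # Training time: drop each unit with probability `rate`, scale the survivors.
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]


activations = [1.0, 2.0, 3.0, 4.0]  # illustrative values
train_out = dropout(activations, rate=0.5, training=True, rng=random.Random(0))
infer_out = dropout(activations, rate=0.5, training=False)
print('training :', train_out)
print('inference:', infer_out)
```

During training part of the signal is zeroed out, so the training loss is measured on a handicapped network, while the validation loss is measured on the full one.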
  
Generalizability of the model is limited, since the training set contains only around 600 images. It would also be possible to add images of other pneumonia-like illnesses and check whether a model can distinguish them correctly. That task may be significantly harder, and we might have to settle for just two classes (healthy person versus coronavirus or other illness), because telling pneumonia from coronavirus may be extremely hard from CT scans alone.
  
To build a better model, we can take a pretrained model (or pretrain one ourselves) that is capable of diagnosing similar illnesses and apply transfer learning with the COVID-19 dataset. CT lung scans of a person with pneumonia can be quite close to those of a person with coronavirus, so a pneumonia dataset could be used for pretraining.
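
The transfer learning idea could look roughly like the sketch below. It is only an outline under assumptions: `build_transfer_model` is a hypothetical helper, and `stand_in_base` is a randomly initialised placeholder for a network that would really have been pretrained on a related CT dataset, which this notebook does not have.

```python
import tensorflow as tf


def build_transfer_model(base_model, dropout_rate=0.5, learning_rate=1e-4):
    """Freeze a pretrained feature extractor and put a new binary head on top."""
    base_model.trainable = False  # keep the pretrained weights fixed
    inputs = tf.keras.Input(shape=base_model.input_shape[1:])
    x = base_model(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(dropout_rate)(x)
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model


# Placeholder for a real pretrained network (hypothetical, random weights here).
stand_in_base = tf.keras.Sequential([
    tf.keras.Input(shape=(248, 248, 1)),
    tf.keras.layers.Conv2D(8, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
])
transfer_model = build_transfer_model(stand_in_base)
transfer_model.summary()
```

Only the new `Dense` head is trained at first; unfreezing some of the base layers with a small learning rate afterwards is a common follow-up step.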

Since we do not set a seed, performance varies slightly between runs, with accuracy in the 70-80% range. If we continue training, we observe overfitting, so we stop training after 40 epochs.
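
The stopping rule can be illustrated with a simplified pure-Python sketch of patience-based early stopping on a metric that should increase. This is not the Keras `EarlyStopping` implementation; `early_stopping_epoch` is a hypothetical helper and the scores are made-up values chosen only to show the mechanics.

```python
def early_stopping_epoch(val_scores, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a metric to maximise."""
    best_score = float('-inf')
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch, wait = score, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Training ran to the end without exhausting the patience budget.
    return len(val_scores) - 1, best_epoch


illustrative_scores = [0.60, 0.70, 0.72, 0.71, 0.70, 0.69]  # not real results
stop_epoch, best_epoch = early_stopping_epoch(illustrative_scores, patience=3)
print(f'stopped at epoch {stop_epoch}, best weights from epoch {best_epoch}')
```

With `restore_best_weights=True`, as in the callback defined earlier, the model would keep the weights from `best_epoch` rather than from the last epoch.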