# AI-PP7 computer vision and image classification

[![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/PauliusU/PP7-computer-vision-and-image-classification/blob/master/AI-PP7-computer-vision-and-image-classification.ipynb)

## Problem statement and goals

This project aims to train a model that classifies people as having or not having Covid based on chest X-ray images.

The project uses the [Covid-Chestxray-Dataset](https://github.com/ieee8023/covid-chestxray-dataset) and the CheXpert dataset mentioned in [this article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7372265/) from the PP7 requirements.

Loading the data in Google Colab was one of the biggest challenges. I downloaded the combined dataset from the [DeepCovid](https://github.com/shervinmin/DeepCovid) repo, which is also mentioned in the same article, and later uploaded it to my personal Google Drive.


## Getting the dataset

In [None]:
# Determine if notebook is running in Google Colab environment
import sys

IS_IN_COLAB = 'google.colab' in sys.modules
print(IS_IN_COLAB)

In [None]:
# Check if GPU is used
import tensorflow as tf
print("Number of GPUs available:", len(tf.config.list_physical_devices('GPU')))

# Print the list of available training devices - alternative method to verify TensorFlow sees the GPU
tf.config.list_physical_devices()

> ❗WARNING if running the notebook in Google Colab ❗ The cell below will ask for permission to access your personal Google Drive. It requires some manual steps, but it reduces the dataset download time to zero.

A couple of semi-manual steps are needed for setup.

1. Give permission to access Google Drive.
2. Open this link to get the dataset: [DeepCovid dataset in Google Drive](https://drive.google.com/drive/folders/1NhP7oV3mPk4H9VS4eY3aN7rftGz6zFkw).
3. Add the shared dataset folder to your personal Google Drive by clicking "Add shortcut to Drive", as shown in the picture below.
   
![Example](./assets/drive.png)

In [None]:
if IS_IN_COLAB:
    # If running in Google Colab, setup Google Drive to be able to access the dataset

    from google.colab import drive

    drive.mount("/content/drive")

    # Set paths to training and test data in Google Drive
    train_data_path = "/content/drive/MyDrive/DeepCovid/train"
    test_data_path = "/content/drive/MyDrive/DeepCovid/test"
else:
    # If running locally
    train_data_path = "./dataset/train"
    test_data_path = "./dataset/test"

## Setup diagnostics

In [None]:
# Setup diagnostics to be later used for model analysis

from matplotlib import pyplot as plt

diagnostic_plot_name = "diagnostic_summary_plot.png"


def summarize_diagnostics(history):
    """Plot train/test loss and accuracy per epoch and save the figure to disk."""
    plt.subplot(211)
    plt.title('Cross Entropy Loss')
    plt.plot(history.history['loss'], color='blue', label='train')
    plt.plot(history.history['val_loss'], color='orange', label='test')
    plt.legend()  # show the 'train'/'test' labels set above
    plt.subplot(212)
    plt.title('Classification Accuracy')
    plt.plot(history.history['accuracy'], color='blue', label='train')
    plt.plot(history.history['val_accuracy'], color='orange', label='test')
    plt.legend()
    plt.tight_layout()  # keep the two subplot titles from overlapping
    plt.savefig(diagnostic_plot_name)
    plt.close()


def get_diagnostic_summary() -> None:
    """Display the diagnostic plot saved by summarize_diagnostics()."""
    plt.figure()
    if IS_IN_COLAB:
        img = plt.imread(f"/content/{diagnostic_plot_name}")
    else:
        img = plt.imread(f"./{diagnostic_plot_name}")
    plt.imshow(img)
    plt.show()


## Set up the model

Quick research suggested that the `VGG model` is a good starting point because of the modular structure of its architecture, which makes it relatively easy to understand and implement. In addition, it achieved top performance in the ILSVRC 2014 competition.

For this model, a fixed image size of 200x200 pixels is used. Several sources proposed starting by stacking convolutional layers (Conv2D) with small 3×3 filters, followed by a max pooling layer (MaxPooling2D). Together, these layers form a block. Blocks can be repeated, with the number of filters increasing with the depth of the network, for example 32, 64, 128, 256 for the first four blocks of the model. The baseline defined below uses a single such block with 32 filters.
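To make the block structure more concrete, here is a minimal sketch in plain Python (no TensorFlow needed) that tracks how the feature map shrinks as blocks are stacked. It assumes `'same'` padding for the convolutions and 2x2 max pooling with default settings, as in the model code of this notebook; the helper name `vgg_block_shapes` is made up for illustration.

```python
def vgg_block_shapes(input_size=200, filters=(32, 64, 128, 256)):
    """Return (height, width, channels) after each conv + 2x2 max-pool block.

    Assumes 'same' padding (the convolution keeps the spatial size) and
    2x2 max pooling, which halves height and width (rounding down).
    """
    shapes = []
    size = input_size
    for n_filters in filters:
        size = size // 2  # effect of the 2x2 max pooling layer
        shapes.append((size, size, n_filters))
    return shapes


for depth, (h, w, c) in enumerate(vgg_block_shapes(), start=1):
    print(f"after block {depth}: {h}x{w}x{c} = {h * w * c} values when flattened")
```

Under these assumptions, a single block leaves 100x100x32 = 320,000 values to flatten, so the following `Dense(128)` layer alone needs about 41 million weights. Four blocks would reduce the flattened vector to 12x12x256 = 36,864 values.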

In [None]:

# Create a base model

from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dense, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import SGD  # required by define_model() below


def define_model():
    model = Sequential()
    # Stacking convolutional layers with small 3×3 filters and image size of 200x200 pixels
    model.add(Conv2D(32, (3, 3), activation='relu',
              kernel_initializer='he_uniform', padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='sigmoid'))
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model


Further preparation consists of defining an ImageDataGenerator instance, which scales the pixel values to the range 0-1.
Then iterators need to be prepared for both the train and test datasets. ImageDataGenerator's flow_from_directory function is used to create an iterator for each of the directories (one for 'train', the other for 'test').
Finally, we can create a plot of the metrics collected during training, which are stored in the `history` dictionary of the object returned by the call to fit().
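As a rough illustration of the two preprocessing ideas above, the sketch below shows what rescaling by 1/255 does to raw pixel values and how many batches an iterator yields per epoch. The helper names and the image count of 1000 are hypothetical and only serve the example; they are not taken from the dataset.

```python
import math


def rescale_pixels(pixels, factor=1.0 / 255.0):
    """Multiply each raw pixel value by factor, like rescale=1.0/255.0."""
    return [value * factor for value in pixels]


def batches_per_epoch(num_images, batch_size):
    """Batches needed to see every image once; the last one may be smaller."""
    return math.ceil(num_images / batch_size)


print(rescale_pixels([0, 128, 255]))   # values now lie between 0 and 1
print(batches_per_epoch(1000, 64))     # hypothetical 1000 images, batch of 64
```

The second helper mirrors what `len(train_iterator)` reports, which is why the harness below can pass it as `steps_per_epoch`.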

In [None]:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

mode = "binary"
batch = 64
size = (200, 200)
epochs = 20

def run_test_harness(model):
    datagen = ImageDataGenerator(rescale=1.0/255.0)
    print('train')
    train_iterator = datagen.flow_from_directory(
        train_data_path, class_mode=mode, batch_size=batch, target_size=size)
    print('test')
    # Set iterators
    test_iterator = datagen.flow_from_directory(
        test_data_path, class_mode=mode, batch_size=batch, target_size=size)
    history = model.fit(
        train_iterator,
        steps_per_epoch=len(train_iterator),
        validation_data=test_iterator,
        validation_steps=len(test_iterator),
        epochs=epochs,
        verbose=0,
    )

    _, acc = model.evaluate(test_iterator, steps=len(test_iterator), verbose=0)
    print('> %.3f' % (acc * 100.0))

    summarize_diagnostics(history)

## Train the model

> ❗WARNING❗ Running the code cell below takes a substantial amount of time. On my local machine it took 15-20 minutes. Transfer learning is not used here.

In [None]:
# Training time
run_test_harness(define_model())

**Interim result**: during my tests I consistently got an accuracy of 97-98%, which exceeded my initial expectations. Let's look at the diagnostic plot to see in more detail what is going on.

In [None]:
# Generate diagnostic plot
get_diagnostic_summary()

Intermediate conclusions: the plot indicates that the model starts overfitting the training dataset after about 1 or 2 epochs.
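One way to read the same signal without a plot is to look for the epoch where the validation loss is lowest; after that point the training loss typically keeps falling while the validation loss rises. The sketch below assumes a list shaped like `history.history['val_loss']`. The helper name `best_epoch` and the loss values are made up for illustration and are not results from this notebook.

```python
def best_epoch(val_loss):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
    return best_index + 1


# Illustrative values only, not measured results
example_val_loss = [0.20, 0.12, 0.14, 0.18, 0.25]
print(f"validation loss is lowest at epoch {best_epoch(example_val_loss)}")
```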

As mentioned earlier, accuracy is already quite high (97-98%). However, let's see if we can improve the results by using a regularization technique such as dropout.
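For intuition, here is a minimal plain-Python sketch of what a dropout layer does during training: each value is zeroed with probability `rate`, and the surviving values are scaled by `1 / (1 - rate)` so the expected sum stays the same. At inference time the input passes through unchanged. The function below is an illustration of the idea, not the Keras implementation.

```python
import random


def dropout(values, rate, training=True, rng=random):
    """Inverted dropout on a list of numbers (illustrative sketch)."""
    if not training or rate == 0.0:
        return list(values)  # inference: pass values through unchanged
    keep_scale = 1.0 / (1.0 - rate)
    # Drop each value with probability `rate`, scale up the ones that are kept
    return [v * keep_scale if rng.random() >= rate else 0.0 for v in values]


random.seed(0)
print(dropout([1.0] * 8, rate=0.5))                  # mix of 0.0 and 2.0
print(dropout([1.0] * 8, rate=0.5, training=False))  # unchanged
```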

## Re-train the model

> ❗WARNING❗ Running the code cell below takes even more time than the cell above. On my local machine it took about 30-45 minutes.

In [None]:

from tensorflow.keras.layers import Dropout
from tensorflow.keras.optimizers import SGD


def define_model_with_dropout():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu',
              kernel_initializer='he_uniform', padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model


epochs = 40  # Double the number of epochs from 20 to 40.

run_test_harness(define_model_with_dropout())


In [None]:
# Get new diagnostics
get_diagnostic_summary()


**Final result and conclusions**

Tuning the model did not improve the results. In fact, in my tests accuracy dropped from 97-98% to 96%. Also, the plot suggests that overfitting occurs slightly later (around the 3rd or 4th epoch). To reduce overfitting, further tuning is needed.

Implementing a solution for this project clearly showed me how hungry for computing power AI projects can be. I was frustrated by the GPU limitations imposed by Google Colab. Therefore, the initial cells include logic for training on the downloaded dataset locally and for checking whether a GPU is used at all. The combined dataset of ~5200 images proved to be a tough challenge for both Google Colab and my local environment. Running the notebook takes at least an hour.

This is where transfer learning comes in. One of its main benefits is saving resources and improving efficiency when training new models. I believe training performance would greatly benefit from it. However, given the surprisingly high initial accuracy, I decided to leave transfer learning for next time.