# AI-PP7 computer vision and image classification

[![Open in Google Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/PauliusU/PP7-computer-vision-and-image-classification/blob/master/AI-PP7-computer-vision-and-image-classification.ipynb)

## Problem statement and goals

This project aims to train a model that classifies people as having or not having Covid based on chest X-ray images.

The project uses the [Covid-Chestxray-Dataset](https://github.com/ieee8023/covid-chestxray-dataset) and the CheXpert dataset mentioned in [this article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7372265/) from the PP7 requirements.

Loading the data in Google Colab was one of the biggest challenges. I downloaded the combined dataset from the [DeepCovid](https://github.com/shervinmin/DeepCovid) repo, which is also mentioned in the same article, and later uploaded it to my personal Google Drive.


In [None]:
# Determine if notebook is running in Google Colab environment
import sys

IS_IN_COLAB = 'google.colab' in sys.modules
print(IS_IN_COLAB)

> ❗WARNING if running the notebook in Google Colab ❗ The cell below will ask for permission to access your personal Google Drive. It requires some manual steps, but it reduces the dataset download time to zero.

A couple of semi-manual steps are needed to set this up.

1. Give permission to access Google Drive.
2. Open this link to get the dataset: [DeepCovid dataset in Google Drive](https://drive.google.com/drive/folders/1NhP7oV3mPk4H9VS4eY3aN7rftGz6zFkw).
3. Add the shared dataset folder to your personal Google Drive by clicking "Add shortcut to Drive", as shown in the picture below.
   
![Example](./assets/drive.png)

In [None]:
if IS_IN_COLAB:
    # If running in Google Colab, set up Google Drive to be able to access the dataset

    from google.colab import drive

    drive.mount("/content/drive")

    # Set paths to training and test data in Google Drive
    train_data_path = "/content/drive/MyDrive/DeepCovid/train"
    test_data_path = "/content/drive/MyDrive/DeepCovid/test"
else:
    # If running locally
    train_data_path = "./dataset/train"
    test_data_path = "./dataset/test"
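
Before training, it can help to confirm that the paths above resolve and to see how many images each class holds, since `flow_from_directory()` (used later in this notebook) treats every subdirectory as one class. The helper below is a minimal sketch and not part of the original notebook: the name `count_images_per_class` and the list of accepted file extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust it to match the dataset.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp"}


def count_images_per_class(data_path):
    """Return {class_name: image_count} for each subdirectory of data_path."""
    root = Path(data_path)
    if not root.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {root}")
    counts = {}
    # Each direct subdirectory is treated as one class
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1 for f in class_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling `count_images_per_class(train_data_path)` and `count_images_per_class(test_data_path)` would then show the class split of each folder, which is useful context when reading an accuracy figure.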

> ❗WARNING❗ Running the cell below takes a substantial amount of time. On my local machine it took 15-20 minutes, because the model is trained from scratch, without transfer learning.

Quick research indicated that the general architectural principles of the VGG models are a good starting point: they achieved top performance in the ILSVRC 2014 competition, and the modular structure of the architecture is easy to understand and implement. I also plan to use this particular pre-trained model later, when implementing transfer learning.

The architecture stacks convolutional layers with small 3×3 filters, followed by a max pooling layer. Together, these layers form a block, and the blocks can be repeated, with the number of filters increasing with the depth of the network, for example 32, 64, 128, 256 for the first four blocks of the model. The baseline below uses a single such block with 32 filters and a fixed image size of 200x200 pixels.
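
To make the effect of stacking blocks concrete, the sketch below works through the arithmetic for 200x200 RGB inputs. It assumes each block holds one 3×3 convolution with `same` padding followed by 2×2 max pooling, and that the blocks feed a 128-unit dense layer and a single sigmoid output, as in the baseline defined later in this notebook. The function name `vgg_style_summary` is hypothetical, and the one-convolution-per-block layout is a simplification (the original VGG blocks stack two or more convolutions).

```python
def vgg_style_summary(filters, input_size=200, channels=3, dense_units=128):
    """Return (feature-map side length, total trainable parameters).

    Assumes one 3x3 'same' convolution plus one 2x2 max pooling per block,
    then Flatten -> Dense(dense_units) -> Dense(1).
    """
    size, in_channels, params = input_size, channels, 0
    for out_channels in filters:
        # 3x3 kernel weights for every input/output channel pair, plus biases
        params += 3 * 3 * in_channels * out_channels + out_channels
        size //= 2  # 2x2 max pooling halves each side, rounding down
        in_channels = out_channels
    flat = size * size * in_channels
    params += flat * dense_units + dense_units  # hidden dense layer
    params += dense_units + 1                   # single sigmoid output unit
    return size, params


print(vgg_style_summary([32]))                # (100, 40961153)
print(vgg_style_summary([32, 64, 128, 256]))  # (12, 5107265)
```

Under these assumptions the single-block model has about 41 million parameters, almost all of them in the first dense layer, while four blocks shrink the feature map enough to bring the total down to about 5 million.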


In [None]:

# Create base model

from matplotlib import pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dropout, MaxPooling2D, Dense, Flatten
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator



def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu',
              kernel_initializer='he_uniform', padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='sigmoid'))
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Next, we need to prepare the data. This involves first defining an instance of ImageDataGenerator that scales the pixel values to the range 0-1.
# Then iterators need to be prepared for both the train and test datasets. We can use the flow_from_directory() method of the data generator to create one iterator for each of the train/ and test/ directories.
# Finally, we can create a plot from the "history" dictionary of the History object returned by the call to fit().


diagnostic_plot_name = "diagnostic_summary_plot.png"


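def summarize_diagnostics(history):
    """Plot train/test loss and accuracy curves and save them to a file."""
    # Top panel: loss
    plt.subplot(211)
    plt.title('Cross Entropy Loss')
    plt.plot(history.history['loss'], color='blue', label='train')
    plt.plot(history.history['val_loss'], color='orange', label='test')
    plt.legend()  # without this the 'train'/'test' labels are never drawn
    # Bottom panel: accuracy
    plt.subplot(212)
    plt.title('Classification Accuracy')
    plt.plot(history.history['accuracy'], color='blue', label='train')
    plt.plot(history.history['val_accuracy'], color='orange', label='test')
    plt.legend()
    plt.tight_layout()  # keep the lower title from overlapping the upper panel
    plt.savefig(diagnostic_plot_name)
    plt.close()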


mode = "binary"
batch = 64
size = (200, 200)
epochs = 20


def run_test_harness(model):
    datagen = ImageDataGenerator(rescale=1.0/255.0)
    print('train')
    train_it = datagen.flow_from_directory(
        train_data_path, class_mode=mode, batch_size=batch, target_size=size)
    print('test')
    test_it = datagen.flow_from_directory(
        test_data_path, class_mode=mode, batch_size=batch, target_size=size)
    history = model.fit(
        train_it, steps_per_epoch=len(train_it),
        validation_data=test_it, validation_steps=len(test_it),
        epochs=epochs, verbose=0)

    _, acc = model.evaluate(test_it, steps=len(test_it), verbose=0)
    print('> %.3f' % (acc * 100.0))

    summarize_diagnostics(history)


run_test_harness(define_model())
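
The `steps_per_epoch=len(train_it)` argument in the harness works because the length of a Keras directory iterator is the number of batches needed to cover the dataset once. A small sketch of that arithmetic follows. `batches_per_epoch` is a hypothetical helper, and the image count of 3000 is only an illustrative figure, not a measured size of the dataset.

```python
import math


def batches_per_epoch(num_images, batch_size):
    """Number of batches needed to see every image once (the last may be smaller)."""
    return math.ceil(num_images / batch_size)


print(batches_per_epoch(3000, 64))  # 47
```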


In [None]:
# Wow! I got 96%:) Did not expect such high accuracy on the very first run.
# We can also check out the diagnostic summary.

def get_diagnostic_summary() -> None:
    """Helper function to display the saved diagnostic plot."""
    plt.figure()
    if IS_IN_COLAB:
        img = plt.imread(f"/content/{diagnostic_plot_name}")
    else:
        img = plt.imread(f"./{diagnostic_plot_name}")
    plt.imshow(img)
    plt.show()

get_diagnostic_summary()

In [None]:
# WARNING! This one might also take a while to run. For me it took more than 1.5 hours.

# We can see that the model starts to overfit the training dataset after about 1 or 2 epochs.
# I'm satisfied with the result, but the model would benefit from a regularization technique, so let's add dropout.

def define_model_with_dropout():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu',
              kernel_initializer='he_uniform', padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Let's also increase the number of epochs to give the model more space for refinement.
epochs = 50

run_test_harness(define_model_with_dropout())
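
For intuition about what the `Dropout` layers do, here is a minimal pure-Python sketch of inverted dropout, the scheme Keras uses: during training each value is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)`, while at inference the input passes through unchanged. The function `dropout` and its arguments are illustrative only and are not Keras API.

```python
import random


def dropout(values, rate, training=True, seed=None):
    """Apply inverted dropout to a list of floats."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(values)  # inference: values pass through unchanged
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - rate)
    # Drop each value with probability `rate`; rescale the ones that are kept
    return [0.0 if rng.random() < rate else v * scale for v in values]


activations = [1.0, 2.0, 3.0, 4.0]
print(dropout(activations, rate=0.5, seed=0))          # each entry is 0.0 or doubled
print(dropout(activations, rate=0.5, training=False))  # [1.0, 2.0, 3.0, 4.0]
```

The rescaling keeps the expected sum of the activations the same in training and inference, which is why no extra correction is needed when the model is evaluated.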


In [None]:
# Get new diagnostics

get_diagnostic_summary()
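
The point where overfitting sets in can also be read numerically rather than from the plot: the epoch with the lowest validation loss is a common marker for where further training stops helping. The sketch below assumes a dictionary shaped like Keras's `history.history`. The function name `best_epoch` is hypothetical, and the numbers in `example_history` are made up for illustration and are not results from this notebook.

```python
def best_epoch(history_dict, metric="val_loss"):
    """Return the 1-based epoch at which `metric` was lowest, and its value."""
    values = history_dict[metric]
    index = min(range(len(values)), key=values.__getitem__)
    return index + 1, values[index]


# Made-up values for illustration only
example_history = {
    "loss": [0.40, 0.20, 0.10, 0.05],
    "val_loss": [0.30, 0.25, 0.28, 0.35],
}
print(best_epoch(example_history))  # (2, 0.25)
```

With a real run, `best_epoch(history.history)` could be called inside `run_test_harness`, where `history` is in scope.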

In [None]:
# Well, the accuracy dropped, and it seems that overfitting occurred slightly later, around epoch 3-4.
# This model would need way more tuning to reduce overfitting.

# I think this will be enough. I planned to use transfer learning, but the results are good, so I will leave it for next time.
# My conclusion is that Keras is the best AI framework. I really enjoy working with it: everything is simple and straightforward, and the API is well documented and understandable.
# I liked the project; it was interesting to work with images. Most of the time was spent figuring out how to best load the data and then running the model.
# I was very surprised by the first result; it was way higher than I expected.
# I also began to see how much computing power is needed for real-world AI applications: training on ~3000 images took almost an hour.