# CDA (Confirmatory Data Analysis)

In this notebook, we explore the Chest X-ray dataset from Kaggle (https://www.kaggle.com/praveengovi/coronahack-chest-xraydataset). The dataset contains chest X-rays of patients who were either well (normal) or had pneumonia. The pneumonia cases were caused by a bacterial infection, a virus, or ARDS (smoking/stress-related causes). The viral cases are further split into coronavirus and other viruses.

We used this dataset to train three classifiers: 1) normal vs pneumonia, 2) virus vs bacteria, and 3) covid vs non-covid. The models can be applied hierarchically to diagnose covid. We took this approach because the dataset is heavily imbalanced: the covid images make up only around 1% of the whole dataset, whereas the normal/pneumonia split is around 27%-73% and the virus/bacteria split around 36%-64%. Hence, if we can classify the latter two binary variables well, the models may still be useful to a doctor even if we cannot distinguish covid from non-covid.
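The hierarchical use of the three models can be sketched as below. This is a minimal illustration, not code from this project: the function name `hierarchical_diagnosis`, the 0.5 threshold, and the assumption that each sigmoid output is the probability of pneumonia, virus, and covid respectively are all hypothetical. The actual meaning of each output depends on how `flow_from_directory` orders the class folders. ARDS cases are not handled by this sketch.

```python
def hierarchical_diagnosis(p_pneumonia, p_virus, p_covid, threshold=0.5):
    """Combine the outputs of the three binary classifiers into one label.

    Assumption (hypothetical): each argument is the sigmoid output of the
    corresponding model, read as the probability of the named class.
    """
    # Stage 1: normal vs pneumonia.
    if p_pneumonia < threshold:
        return "normal"
    # Stage 2: only reached for pneumonia cases, virus vs bacteria.
    if p_virus < threshold:
        return "bacterial pneumonia"
    # Stage 3: only reached for viral cases, covid vs non-covid.
    if p_covid < threshold:
        return "viral pneumonia (non-covid)"
    return "viral pneumonia (covid)"


# Later stages are ignored once an earlier stage rules them out.
print(hierarchical_diagnosis(0.10, 0.90, 0.90))  # normal
print(hierarchical_diagnosis(0.95, 0.80, 0.20))  # viral pneumonia (non-covid)
```

Because each stage only runs on cases passed down by the previous one, an error in the first two classifiers propagates to the covid decision.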

Note that this is part one of two notebooks. We start with CDA because computer vision is a complex problem. In the second notebook, we perform an EDA, where we visualize the models learnt here. The models we use are strictly **CNNs**, and nothing more.

## Preliminaries (imports and function definitions)

In [21]:
from keras import layers, models, optimizers  # needed by the model definitions below
from keras.models import load_model
from keras.preprocessing.image import ImageDataGenerator

## Pneumonia vs normal (chest) classifier

### Model

In [None]:
# From pneumonia_classifer.py:
# Don't run this code; it is shown only to document the model that was trained.
# All three classifiers share this architecture, except that this one has no Dropout layer.

normpneu_model = models.Sequential()
normpneu_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 1)))
normpneu_model.add(layers.MaxPooling2D(2, 2))
normpneu_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
normpneu_model.add(layers.MaxPooling2D(2, 2))
normpneu_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
normpneu_model.add(layers.Flatten())
normpneu_model.add(layers.Dense(512, activation='relu'))
normpneu_model.add(layers.Dense(1, activation='sigmoid'))
normpneu_model.compile(optimizer=optimizers.RMSprop(lr=1e-4), loss='binary_crossentropy', metrics=['acc'])

# data augmentation
train_norm_pneu_gen = ImageDataGenerator(
    rescale=1. / 255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)
valid_norm_pneu_gen = ImageDataGenerator(rescale=1. / 255)
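To make the architecture above more concrete, the following sketch traces the feature-map size through the layers and counts the trainable parameters. It assumes the Keras defaults for the layers shown (`padding='valid'` and stride 1 for `Conv2D`, stride 2 for the 2x2 `MaxPooling2D`). The helper names are hypothetical. A `Dropout` layer adds no parameters, so the count is the same for all three models.

```python
def conv_out(size, kernel=3):
    # 'valid' convolution with stride 1 shrinks each side by kernel - 1.
    return size - kernel + 1


def pool_out(size, pool=2):
    # 2x2 max pooling with stride 2 halves each side (rounding down).
    return size // pool


size, channels, params = 150, 1, 0  # 150x150 grayscale input

# (filters, followed_by_pooling) for the three Conv2D layers.
for filters, pooled in [(32, True), (64, True), (64, False)]:
    params += (3 * 3 * channels + 1) * filters  # kernel weights + one bias per filter
    size = conv_out(size)
    if pooled:
        size = pool_out(size)
    channels = filters

flat = size * size * channels   # units after Flatten
params += (flat + 1) * 512      # Dense(512)
params += (512 + 1) * 1         # Dense(1)

print(flat, params)  # 73984 37936577
```

Almost all of the roughly 37.9 million parameters sit in the first dense layer, which is fed by the 34x34x64 output of the last convolution.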

### Accuracy and loss scores

![norm_pneu_train_val_acc.svg](attachment:norm_pneu_train_val_acc.svg) 

![norm_pneu_train_val_loss.svg](attachment:norm_pneu_train_val_loss.svg)

### Test set score

In [11]:
test_datagen = ImageDataGenerator(rescale=1. / 255)

norm_pneu_test_generator = test_datagen.flow_from_directory(
    'Coronahack-Chest-XRay-Dataset/test/norm_pneu',
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary',
    color_mode='grayscale'
)

norm_pneu_model = load_model('pneumonia_model_data/normal_and_pneumonia_small_1.h5')

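# Note: int() truncates, so the final partial batch is skipped
# (788 images / 20 per batch -> 39 steps, i.e. 780 images are evaluated).
test_loss, test_acc = norm_pneu_model.evaluate(
    norm_pneu_test_generator,
    steps=int(norm_pneu_test_generator.n / norm_pneu_test_generator.batch_size))
print("Norm/Pneu loss,acc:", test_loss, test_acc)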

Found 788 images belonging to 2 classes.
Norm/Pneu loss,acc: 0.3111162483692169 0.8858974575996399
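The `steps` expression in the cell above uses `int(...)`, which rounds down. The short calculation below, using only the numbers printed above, shows how many images that covers and what a rounded-up step count would be. Using `math.ceil` here is a suggestion, not what was run for the reported score, and it should make the generator yield one final smaller batch.

```python
import math

n_images, batch_size = 788, 20  # from the cell above and its output

steps_truncated = int(n_images / batch_size)    # what the cell above computes
steps_full = math.ceil(n_images / batch_size)   # would cover every image

images_seen = steps_truncated * batch_size
print(steps_truncated, images_seen)  # 39 780
print(steps_full)                    # 40

# The reported accuracy is consistent with 780 evaluated images.
reported_acc = 0.8858974575996399
correct = round(reported_acc * images_seen)
print(correct)                       # 691
```

So the reported score corresponds to 691 correct predictions out of 780 images, with 8 test images left out of the evaluation.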


## Virus vs bacteria classifier

### Model

In [None]:
# From virus_classifier.py
# Don't run this code; it is shown only to document the model that was trained.
# Same architecture as the normal/pneumonia model, plus a Dropout layer after Flatten.

virus_model = models.Sequential()
virus_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 1)))
virus_model.add(layers.MaxPooling2D(2, 2))
virus_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
virus_model.add(layers.MaxPooling2D(2, 2))
virus_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
virus_model.add(layers.Flatten())
virus_model.add(layers.Dropout(0.5))
virus_model.add(layers.Dense(512, activation='relu'))
virus_model.add(layers.Dense(1, activation='sigmoid'))
virus_model.compile(optimizer=optimizers.RMSprop(lr=1e-4), loss='binary_crossentropy', metrics=['acc'])

# data augmentation
train_virus_gen = ImageDataGenerator(
    rescale=1. / 255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)

test_virus_gen = ImageDataGenerator(rescale=1. / 255)

### Accuracy and loss scores

![virus_bac_train_val_acc.svg](attachment:virus_bac_train_val_acc.svg)

![virus_bac_train_val_loss.svg](attachment:virus_bac_train_val_loss.svg)

### Test set score

In [15]:
virus_bac_test_generator = test_datagen.flow_from_directory(
    'Coronahack-Chest-XRay-Dataset/test/virus_bacteria',
    target_size=(150, 150),
    batch_size=20,
    class_mode='binary',
    color_mode='grayscale'
)
virus_bac_model = load_model('virus_model_data/virus_small_1.h5')
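# Note: int() truncates, so the final partial batch is skipped
# (708 images / 20 per batch -> 35 steps, i.e. 700 images are evaluated).
test_loss, test_acc = virus_bac_model.evaluate(
    virus_bac_test_generator,
    steps=int(virus_bac_test_generator.n / virus_bac_test_generator.batch_size))
print("Virus/Bac loss,acc:", test_loss, test_acc)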


Found 708 images belonging to 2 classes.
Virus/Bac loss,acc: 0.5828590989112854 0.7028571367263794
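For context, the test accuracies can be compared with a majority-class baseline, i.e. the accuracy of always predicting the more common class. The sketch below uses the approximate whole-dataset proportions quoted in the introduction and assumes, without confirmation, that the test folders have similar proportions. The function name `majority_baseline` is hypothetical.

```python
# Approximate class proportions from the introduction (whole dataset).
splits = {
    "normal/pneumonia": (0.27, 0.73),
    "virus/bacteria": (0.36, 0.64),
}
# Test accuracies reported in this notebook (rounded).
test_acc = {
    "normal/pneumonia": 0.8859,
    "virus/bacteria": 0.7029,
}


def majority_baseline(proportions):
    # Accuracy obtained by always predicting the most frequent class.
    return max(proportions)


gains = {}
for name, proportions in splits.items():
    baseline = majority_baseline(proportions)
    gains[name] = test_acc[name] - baseline
    print(f"{name}: baseline {baseline:.2f}, model {test_acc[name]:.4f}, "
          f"gain {gains[name]:+.4f}")
```

Under that assumption, the virus/bacteria model improves on the baseline by only about six percentage points, compared with about sixteen for the normal/pneumonia model.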


## Covid vs non-covid (virus) classifier

### Model

In [None]:
# From covid_classifier.py
# Don't run this code; it is shown only to document the model that was trained.
# Same architecture as the virus/bacteria model (including the Dropout layer).

covid_model = models.Sequential()
covid_model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 1)))
covid_model.add(layers.MaxPooling2D(2, 2))
covid_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
covid_model.add(layers.MaxPooling2D(2, 2))
covid_model.add(layers.Conv2D(64, (3, 3), activation='relu'))
covid_model.add(layers.Flatten())
covid_model.add(layers.Dropout(0.5))
covid_model.add(layers.Dense(512, activation='relu'))
covid_model.add(layers.Dense(1, activation='sigmoid'))
covid_model.compile(optimizer=optimizers.RMSprop(lr=1e-4), loss='binary_crossentropy', metrics=['acc'])


# data augmentation
train_covid_gen = ImageDataGenerator(
    rescale=1. / 255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
)

valid_covid_gen = ImageDataGenerator(rescale=1. / 255)

### Accuracy and loss scores

![covid_train_val_acc.svg](attachment:covid_train_val_acc.svg)

![covid_train_val_loss.svg](attachment:covid_train_val_loss.svg)

### Test set score

In [20]:
covid_test_generator = test_datagen.flow_from_directory(
    'Coronahack-Chest-XRay-Dataset/test/covid_noncovid',
    target_size=(150, 150),
    batch_size=5,
    class_mode='binary',
    color_mode='grayscale'
)

covid_model = load_model('covid_model_data/covid_small_1.h5')
test_loss, test_acc = covid_model.evaluate(
    covid_test_generator,
    steps=int(covid_test_generator.n / covid_test_generator.batch_size))
print("Covid/Non-covid loss,acc:", test_loss, test_acc)

Found 30 images belonging to 2 classes.
Covid/Non-covid loss,acc: 0.16160979866981506 0.9666666388511658


## Conclusion 

We found that the normal/pneumonia classifier works quite well (a test accuracy of approx. 89%), as did the covid/non-covid classifier (a test accuracy of approx. 97%). However, the latter is based on very little data (only 30 test images), and therefore we are not keen on believing its accuracy just yet. The virus/bacteria classifier did not perform as well, with a test accuracy of only around 70%.
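One way to quantify how much the small covid test set limits our confidence is a confidence interval on the accuracy. The sketch below uses the Wilson score interval, which is not part of the original analysis. The counts are inferred from the reported accuracies: 29 of 30 for covid/non-covid, and 691 of 780 for normal/pneumonia, since only 39 full batches of 20 were evaluated. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


# Counts inferred from the reported test accuracies (see lead-in).
covid_low, covid_high = wilson_interval(29, 30)
norm_low, norm_high = wilson_interval(691, 780)

print(f"covid/non-covid:  {covid_low:.3f} to {covid_high:.3f}")
print(f"normal/pneumonia: {norm_low:.3f} to {norm_high:.3f}")
```

The covid interval is several times wider than the normal/pneumonia one, which supports treating the 97% figure with caution.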

None of our models appears to be overfitting, perhaps thanks to data augmentation and, in the virus/bacteria and covid models, dropout.

In the second notebook (EDA), we visualize the filters to see what was actually learnt. This lets us check further whether the right thing has been learnt (or not).