<a href="https://colab.research.google.com/github/AhmedAlshaari/AhmedAlshaari/blob/main/Lung_X_ray_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ahmed Alshaari - Lung X-ray Classification

---



This project creates a submission for the Kaggle competition found [here](https://www.kaggle.com/c/cap-4611-spring-21-assignment-5/overview). The competition provides a dataset of lung X-rays, and the goal is to classify each image into one of the following categories of lung damage:

* Normal (No damage)
* Virus
* Bacteria
* Stress-smoking

The submission created with this notebook achieved an accuracy of 71.79%, which is also listed on the Kaggle competition leaderboard [here](https://www.kaggle.com/c/cap-4611-spring-21-assignment-5/leaderboard).


### Imports

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # statistical plots
import matplotlib.pyplot as plt # graphs and plots
import tensorflow as tf
import os
import io
from google.colab import files # Colab-only helper for file upload/download
# Import Keras only through tensorflow.keras to avoid mixing two Keras installs
from tensorflow import keras
from tensorflow.keras import layers, optimizers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D, Activation
from tensorflow.keras.preprocessing.image import ImageDataGenerator

### Linking Google Drive and unzipping the folder that contains the data

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')
!unzip gdrive/My\ Drive/cap-4611-spring-21-assignment-5.zip

Establishing the directory paths

In [None]:
data_dir = '/content/data'  # renamed from `dir` to avoid shadowing the Python builtin

train_csv_path = '/content/assignment5_training_data_metadata.csv'
test_csv_path = '/content/assignment5_test_data_metadata.csv'

train_csv = pd.read_csv(train_csv_path, index_col='id')
test_csv = pd.read_csv(test_csv_path, index_col='id')

### Taking a look at the data

In [None]:
train_csv

id,image_name,label,cause,type
0,IM-0128-0001.jpeg,Normal,,
1,IM-0127-0001.jpeg,Normal,,
2,IM-0125-0001.jpeg,Normal,,
3,IM-0122-0001.jpeg,Normal,,
4,IM-0119-0001.jpeg,Normal,,
...,...,...,...,...
5304,1-s2.0-S0929664620300449-gr2_lrg-c.jpg,Pnemonia,COVID-19,Virus
5305,1-s2.0-S0929664620300449-gr2_lrg-b.jpg,Pnemonia,COVID-19,Virus
5306,1-s2.0-S0929664620300449-gr2_lrg-a.jpg,Pnemonia,COVID-19,Virus
5307,1-s2.0-S0140673620303706-fx1_lrg.jpg,Pnemonia,COVID-19,Virus


In [None]:
train_csv['type'].unique()

array([nan, 'Virus', 'bacteria', 'Stress-Smoking'], dtype=object)

### Missing Values
The output above shows NaN values in the `type` column. They appear on rows whose image is labeled Normal, so the value is simply missing rather than unknown, and those entries are filled with 'Normal'.

In [None]:
# Assign the result back instead of using inplace=True on a column selection
train_csv['type'] = train_csv['type'].fillna('Normal')

In [None]:
train_csv.isnull().sum()

image_name       0
label            0
cause         5217
type             0
dtype: int64
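
The assumption that every missing `type` belongs to a Normal image can be checked before filling. The following is a minimal sketch on a small hypothetical frame, `demo`, that mimics the `label` and `type` columns of `train_csv`; its rows are illustrative and not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for train_csv; the values are illustrative only.
demo = pd.DataFrame({
    'label': ['Normal', 'Normal', 'Pnemonia', 'Pnemonia'],
    'type': [None, None, 'Virus', 'bacteria'],
})

# Labels of the rows whose type is missing.
labels_with_missing_type = demo.loc[demo['type'].isna(), 'label'].unique()
print(labels_with_missing_type)

# Fill only if every missing type sits on a row labeled Normal.
if set(labels_with_missing_type) <= {'Normal'}:
    demo['type'] = demo['type'].fillna('Normal')

print(demo['type'].value_counts())
```

Running the same check on `train_csv` itself would confirm whether the fill is safe for the real data.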

In [None]:
train_csv

id,image_name,label,cause,type
0,IM-0128-0001.jpeg,Normal,,Normal
1,IM-0127-0001.jpeg,Normal,,Normal
2,IM-0125-0001.jpeg,Normal,,Normal
3,IM-0122-0001.jpeg,Normal,,Normal
4,IM-0119-0001.jpeg,Normal,,Normal
...,...,...,...,...
5304,1-s2.0-S0929664620300449-gr2_lrg-c.jpg,Pnemonia,COVID-19,Virus
5305,1-s2.0-S0929664620300449-gr2_lrg-b.jpg,Pnemonia,COVID-19,Virus
5306,1-s2.0-S0929664620300449-gr2_lrg-a.jpg,Pnemonia,COVID-19,Virus
5307,1-s2.0-S0140673620303706-fx1_lrg.jpg,Pnemonia,COVID-19,Virus


### Generating the image data and aligning it with the CSV file

In [None]:
size = (64, 64)

gen = ImageDataGenerator(
    validation_split=0.2,
    rescale=1./255.,
    rotation_range=90,
  )

train_dir = '/content/images/images/train'

train_gen = gen.flow_from_dataframe(
        train_csv,
        directory=train_dir,
        x_col='image_name',
        y_col='type',
        class_mode='categorical',
        target_size=size,
        color_mode='rgb',
        batch_size = 64,
        shuffle=True,
        subset='training'
      )

val_gen = gen.flow_from_dataframe(
        train_csv,
        directory=train_dir,
        x_col='image_name',
        y_col='type',
        class_mode='categorical',
        target_size=size,
        color_mode='rgb',
        batch_size = 64,
        shuffle=True,
        subset='validation'
      )

Found 5286 validated image filenames belonging to 4 classes.
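
The number of batches per epoch follows from the image count, the validation split, and the batch size. The sketch below works through that arithmetic. It assumes the validation subset receives `int(validation_split * n)` files and the training subset receives the rest; this is an assumption about the generator's behavior, not something the output above confirms.

```python
import math

num_images = 5286        # from the "Found 5286 validated image filenames" output
validation_split = 0.2   # same value passed to ImageDataGenerator
batch_size = 64

# Assumption: validation gets int(validation_split * num_images) files,
# training gets the remainder.
num_val = int(validation_split * num_images)
num_train = num_images - num_val

# Batches needed to see every image once (the last batch may be partial).
steps_all = math.ceil(num_images / batch_size)
steps_train = math.ceil(num_train / batch_size)
steps_val = math.ceil(num_val / batch_size)

print(num_train, num_val)                 # 4229 1057
print(steps_all, steps_train, steps_val)  # 83 67 17
```

Under these assumptions, 83 steps corresponds to all 5286 images, while an 80% training subset needs 67. Calling `len()` on a generator returns its batch count directly, which avoids hard-coding either number.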


Creating the model architecture by adding layers, then compiling the model.

In [None]:
model = Sequential()

num_classes = len(train_gen.class_indices)  # number of distinct values in the 'type' column

# First convolution layer with relu activation and padding with 0s
model.add(Conv2D(32, (3, 3), padding='same'))
model.add(Activation('relu'))

# Second convolution layer with relu activation, followed by 2x2 max pooling and a dropout layer to reduce overfitting
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

# Third convolution layer with relu activation and padding with 0s
model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))

# Fourth convolution layer with relu activation, followed by 2x2 max pooling and a dropout layer to reduce overfitting
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))

# Flatten the feature maps into a vector, pass it through a 512-unit dense layer with dropout,
# then a 4-unit softmax layer that outputs one probability per class
model.add(Flatten())
model.add(Dense(512))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(4, activation='softmax'))

# Compiling the model with the adam optimizer and categorical crossentropy loss function
model.compile(optimizer = 'adam', loss='categorical_crossentropy', metrics=['accuracy'])
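
To see how the layers above shrink a 64x64 input, the spatial size can be traced by hand. This is a rough sketch in plain Python, assuming stride-1 convolutions and non-overlapping 2x2 pooling, which are the Keras defaults for the arguments used here. The helper names `conv_out` and `pool_out` are hypothetical and not part of Keras.

```python
def conv_out(size, kernel, padding):
    """Spatial size after a stride-1 convolution."""
    return size if padding == 'same' else size - kernel + 1


def pool_out(size, pool):
    """Spatial size after non-overlapping max pooling (stride equals pool size)."""
    return size // pool


side = 64                           # target_size used by the generators
side = conv_out(side, 3, 'same')    # first conv:  64
side = conv_out(side, 3, 'valid')   # second conv: 62
side = pool_out(side, 2)            # pooling:     31
side = conv_out(side, 3, 'same')    # third conv:  31
side = conv_out(side, 3, 'valid')   # fourth conv: 29
side = pool_out(side, 2)            # pooling:     14

flat_features = side * side * 64            # 64 filters in the last conv layer
dense_params = flat_features * 512 + 512    # weights plus biases of Dense(512)
print(side, flat_features, dense_params)    # 14 12544 6423040
```

Most of the model's parameters therefore sit in the first dense layer, which is one reason the 0.5 dropout is placed right after it. `model.summary()` reports the same shapes once the model has been built.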

### Fitting the model with 70 epochs

In [None]:
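# Train on the training generator; len(train_gen) is its number of batches per epoch
model.fit(train_gen, epochs=70, steps_per_epoch=len(train_gen))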

Epoch 1/70
Epoch 2/70
Epoch 3/70
Epoch 4/70
Epoch 5/70
Epoch 6/70
Epoch 7/70
Epoch 8/70
Epoch 9/70
Epoch 10/70
Epoch 11/70
Epoch 12/70
Epoch 13/70
Epoch 14/70
Epoch 15/70
Epoch 16/70
Epoch 17/70
Epoch 18/70
Epoch 19/70
Epoch 20/70
Epoch 21/70
Epoch 22/70
Epoch 23/70
Epoch 24/70
Epoch 25/70
Epoch 26/70
Epoch 27/70

### Generating Data for submission

In [None]:
# No augmentation at inference time: only rescale, so predictions are deterministic
test_gen = ImageDataGenerator(
    rescale=1./255.,
  )

test_dir = '/content/images/images/test'

test = test_gen.flow_from_dataframe(
        test_csv,
        directory=test_dir,
        x_col='image_name',
        class_mode=None,
        target_size=size,
        color_mode='rgb',
        batch_size = 64,
        shuffle=False
      )

### Predicting the target vector

In [None]:
predictions = np.argmax(model.predict(test), axis=-1)
predictions

### Fixing the labels on the target vector

In [None]:
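# Convert Keras class indices to the label ids used in the submission:
# index 1 becomes 4 and index 0 becomes 1; indices 2 and 3 are left unchanged.
for i in range(len(predictions)):
  if predictions[i] == 1:
    predictions[i] = 4
  elif predictions[i] == 0:
    predictions[i] = 1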

predictions
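
The same relabeling can be written as an explicit lookup table, which avoids depending on the order of the `if` checks. This is a minimal sketch: the mapping is read off the loop above (0 to 1, 1 to 4, with 2 and 3 unchanged), and the names `INDEX_TO_SUBMISSION_ID` and `remap_labels` are hypothetical.

```python
# Assumed mapping, taken from the loop above: class index 0 -> 1 and 1 -> 4,
# while 2 and 3 stay unchanged.
INDEX_TO_SUBMISSION_ID = {0: 1, 1: 4, 2: 2, 3: 3}


def remap_labels(predicted_indices):
    """Return a new list with each class index replaced by its submission id."""
    return [INDEX_TO_SUBMISSION_ID[int(i)] for i in predicted_indices]


example = [0, 1, 2, 3, 1, 0]
print(remap_labels(example))  # [1, 4, 2, 3, 4, 1]
```

Because the function builds a new list instead of editing in place, a value that has already been remapped is never remapped a second time.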

Generating output file for submission

In [None]:
output = pd.DataFrame({'id': test_csv.index, 'type': predictions})

output.to_csv('/content' + '/submission-file.csv', index = False)
print("Your submission was successfully saved!")