## Lab Exercise 1: Pneumonia Classification Using Machine Learning

Pneumonia is an inflammation of the lungs, usually caused by a viral or bacterial infection, that can prevent a patient from taking in enough oxygen. Chest X-rays are commonly used to check for pneumonia: doctors examine the X-ray images and make a diagnosis from them.

Our objective for this lab exercise is to use machine learning techniques, namely k-nearest neighbours (kNN) and support vector machines (SVM), to classify chest X-ray images as either ‘normal’ or ‘pneumonia’.

This lab exercise proceeds step by step, following the typical machine learning pipeline. Some parts of the code are pre-written for you, and others require you to write your own code to achieve the desired outcome. Every part that requires your own code is marked with a [YOUR CODE HERE] comment (shown in green), together with the specific instructions to follow.

## Import Python libraries and mount Google Drive for dataset

In [None]:
import cv2
import os
import math
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from google.colab.patches import cv2_imshow

In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 1. Data Preparation

In [None]:
datadir_root = # [YOUR CODE HERE] specify the root data directory containing the train, val and test folders

datadir_train = os.path.join(datadir_root, "train")
datadir_val = os.path.join(datadir_root, "val")
datadir_test = os.path.join(datadir_root, "test")

In [None]:
labels = ['NORMAL', 'PNEUMONIA']
img_size = (64, 64)

def get_data(datadir):
  data = []
  for label in labels:
    path = os.path.join(datadir, label)
    filenames = os.listdir(path)
    for filename in filenames:
      # Load the image in grayscale
      # [YOUR CODE HERE] img = ...

      # Resize the image to 'img_size' defined earlier
      # [YOUR CODE HERE] img = ...

      # Append the 'data' list with the image and its label
      data.append([img, label])

  print(f'Successfully loaded {len(data)} images')
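  # Convert the 'data' list to a NumPy array. Each row mixes an image array with a
  # string label, so recent NumPy versions may require dtype=object here.
  # [YOUR CODE HERE]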

  return data
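
A note on the array conversion in `get_data`: each row of `data` pairs a 2-D image with a string label, so the rows are not uniform. The sketch below uses random stand-in images (not the real dataset) to show one way such a list can be stored. Using `dtype=object` is an assumption about how the conversion is meant to be done, not necessarily the intended solution.

```python
import numpy as np

rng = np.random.default_rng(0)
img_size = (64, 64)

# Stand-in for the loaded images: four random 8-bit grayscale arrays.
data = []
for label in ['NORMAL', 'NORMAL', 'PNEUMONIA', 'PNEUMONIA']:
    img = rng.integers(0, 256, size=img_size, dtype=np.uint8)
    data.append([img, label])

# Each row holds an image and a string, so store the entries as generic objects.
data = np.array(data, dtype=object)

print(data.shape)        # (4, 2): one row per image, columns are [image, label]
print(data[0, 0].shape)  # (64, 64)
print(data[0, 1])        # NORMAL
```

With this layout, `data[:, 0]` selects the images and `data[:, 1]` the labels, which is the indexing used in the splitting cell further down.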

In [None]:
# load the train, val and test sets by calling the 'get_data' function defined earlier
train = get_data(datadir_train)
val = # [YOUR CODE HERE]
test = # [YOUR CODE HERE]

In [None]:
# For each of 'train', 'val' and 'test', split the features (X) and labels (Y) into two separate NumPy arrays
trainX, trainY = np.array(train[:,0]), np.array(train[:,1])
valX, valY = # [YOUR CODE HERE]
testX, testY = # [YOUR CODE HERE]

In [None]:
# normalize 'trainX', 'valX' and 'testX' by scaling pixel intensities from [0, 255] to [0, 1]
trainX = trainX / 255
valX = # [YOUR CODE HERE]
testX = # [YOUR CODE HERE]
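
To see what the division by 255 does, here is a minimal sketch on a hand-made 2x2 array standing in for an 8-bit grayscale image. The array values are chosen for illustration only.

```python
import numpy as np

# A tiny 2x2 stand-in for an 8-bit grayscale image.
img = np.array([[0, 51], [102, 255]], dtype=np.uint8)

# True division promotes uint8 to float64, so no precision is lost to integer math.
scaled = img / 255

print(scaled.dtype)                # float64
print(scaled.min(), scaled.max())  # 0.0 1.0
```

Scaling every pixel by the same constant does not change which neighbours are nearest, but it does change the magnitude of the distances, which affects which `gamma` values suit an RBF kernel later on.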

In [None]:
# Flatten each image in 'trainX', 'valX' and 'testX' into a 1-D feature vector, collecting the vectors in Python lists

trainX_flatten = []
valX_flatten = []
testX_flatten = []

for X in trainX:
  X = X.flatten()
  trainX_flatten.append(X)
trainX = trainX_flatten

for X in valX:
  X = X.flatten()
  valX_flatten.append(X)
valX = valX_flatten

for X in testX:
  X = X.flatten()
  testX_flatten.append(X)
testX = testX_flatten
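
The loops above turn every 64x64 image into a vector of 4096 features. As a sketch of what that looks like, the example below flattens three stand-in images with a loop and compares the result with a single `reshape` call. The `reshape` form assumes the images are held in one regular array of shape `(N, 64, 64)`, which is not necessarily how they are stored in this notebook.

```python
import numpy as np

# Three stand-in 64x64 "images" stacked into one array of shape (3, 64, 64).
images = np.arange(3 * 64 * 64, dtype=np.float64).reshape(3, 64, 64)

# Loop-based flattening: one 1-D vector per image.
flat_list = [img.flatten() for img in images]

# Vectorized equivalent: keep the first axis, merge the remaining ones.
flat_array = images.reshape(len(images), -1)

print(len(flat_list), flat_list[0].shape)  # 3 (4096,)
print(flat_array.shape)                    # (3, 4096)
```

Both forms are accepted by scikit-learn estimators, which expect a 2-D input of shape `(n_samples, n_features)`.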

## 2. Model training (k-NN)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

all_k = np.arange(1,21)
all_accuracy = []

for k in all_k:
  kNN_model = # [YOUR CODE HERE] specify the kNN model using n_neighbors = k and weights = 'distance'

  # [YOUR CODE HERE] fit the kNN model on the train data

  # [YOUR CODE HERE] use the fitted kNN model to predict the val data
  
  # [YOUR CODE HERE] obtain the accuracy of the kNN model on the val data
  
  # [YOUR CODE HERE] append the 'all_accuracy' list with the accuracy obtained above

# select the value of k corresponding to the maximum accuracy value in 'all_accuracy'
best_k = all_k[np.argmax(all_accuracy)]

best_accuracy = # [YOUR CODE HERE] return the maximum accuracy value in 'all_accuracy'
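
The loop above is a validation sweep: fit one model per candidate `k`, score it on held-out data, and keep the best. The sketch below runs the same pattern on synthetic 2-D points rather than the X-ray data. The helper `make_blobs` and the names `candidate_k`, `val_accuracy` and `chosen_k` are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_blobs(n_per_class):
    """Hypothetical helper: two noisy 2-D clusters labelled like the lab's classes."""
    X = np.vstack([rng.normal(0.0, 1.0, size=(n_per_class, 2)),
                   rng.normal(2.0, 1.0, size=(n_per_class, 2))])
    y = np.array(['NORMAL'] * n_per_class + ['PNEUMONIA'] * n_per_class)
    return X, y

X_train, y_train = make_blobs(100)
X_val, y_val = make_blobs(40)

candidate_k = np.arange(1, 21)
val_accuracy = []
for k in candidate_k:
    model = KNeighborsClassifier(n_neighbors=int(k), weights='distance')
    model.fit(X_train, y_train)
    val_accuracy.append(accuracy_score(y_val, model.predict(X_val)))

# np.argmax returns the first index of the maximum, so ties go to the smallest k.
chosen_k = candidate_k[np.argmax(val_accuracy)]
print(chosen_k, max(val_accuracy))
```

Selecting `k` on the validation set rather than the test set keeps the test set untouched for the final evaluation.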


In [None]:
# use matplotlib.pyplot to generate a graph of 'all_accuracy' against 'all_k' and label all axes of the graph
# [YOUR CODE HERE]

print(f'The best value of k is {best_k} with a corresponding validation accuracy of {best_accuracy}')

In [None]:
# combine 'trainX' and 'valX' into an overall training set called 'trainvalX', and likewise combine 'trainY' and 'valY' into 'trainvalY'
trainvalX = trainX + valX
trainvalY = trainY.tolist() + valY.tolist()

# fit the kNN model on the 'trainval' data using n_neighbors = best_k and weights = 'distance' 
# [YOUR CODE HERE]

# predict the test data
# [YOUR CODE HERE]

## 3. Model evaluation (k-NN)

In [None]:
from sklearn.metrics import confusion_matrix

# generate a confusion matrix for the kNN model on the test data
# [YOUR CODE HERE]


In [None]:
# generate the classification report for the kNN model on the test data
# [YOUR CODE HERE]
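
When reading the confusion matrix and classification report, it helps to know how the cells are laid out. The sketch below uses eight hand-written labels (not model output) to show the layout scikit-learn uses and how recall for the pneumonia class follows from it.

```python
from sklearn.metrics import classification_report, confusion_matrix

labels = ['NORMAL', 'PNEUMONIA']
y_true = ['NORMAL'] * 3 + ['PNEUMONIA'] * 5
y_pred = ['NORMAL', 'NORMAL', 'PNEUMONIA',
          'PNEUMONIA', 'PNEUMONIA', 'PNEUMONIA', 'PNEUMONIA', 'NORMAL']

# Rows are true classes, columns are predicted classes, both in 'labels' order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(cm)
# [[2 1]
#  [1 4]]

# Recall for PNEUMONIA = correctly flagged cases / all true pneumonia cases.
pneumonia_recall = cm[1, 1] / cm[1].sum()
print(pneumonia_recall)  # 0.8

print(classification_report(y_true, y_pred, labels=labels))
```

In a screening setting, the bottom-left cell (true pneumonia predicted as normal) counts the missed cases, so recall for the pneumonia class is often worth examining alongside overall accuracy.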


## 4. Model training (SVM)

In [None]:
from sklearn.svm import SVC
import warnings
warnings.filterwarnings("ignore")   # suppress warnings raised when an evaluation metric is undefined (evaluates to NaN)

# create an array of 21 candidate gamma values, log-spaced from 1e-10 to 1e10 (one per decade)
all_gamma = np.logspace(-10, 10, 21)
all_accuracy = []

for gamma in all_gamma:
  SVM_model = # [YOUR CODE HERE] specify the SVM model using kernel = 'rbf' and gamma = gamma

  # fit the SVM model on the train data
  # [YOUR CODE HERE] 
  
  # use the fitted SVM model to predict the val data
  # [YOUR CODE HERE] 

  # obtain the accuracy of the SVM model on the val data
  # [YOUR CODE HERE] 

  # append the 'all_accuracy' list with the accuracy obtained above
  # [YOUR CODE HERE] 

# select the value of gamma corresponding to the maximum accuracy value in 'all_accuracy'
best_gamma = all_gamma[np.argmax(all_accuracy)]

best_accuracy = # [YOUR CODE HERE] return the maximum accuracy value in 'all_accuracy'
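
For an RBF kernel, `gamma` controls how far the influence of a single training point reaches: very small values make the kernel nearly flat, and very large values make each point matter only in its immediate vicinity. The sketch below sweeps the same log-spaced grid on synthetic 2-D points rather than the X-ray data. The helper `make_blobs` and the names `candidate_gamma`, `val_accuracy` and `chosen_gamma` are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_blobs(n_per_class):
    """Hypothetical helper: two noisy 2-D clusters labelled like the lab's classes."""
    X = np.vstack([rng.normal(0.0, 1.0, size=(n_per_class, 2)),
                   rng.normal(2.0, 1.0, size=(n_per_class, 2))])
    y = np.array(['NORMAL'] * n_per_class + ['PNEUMONIA'] * n_per_class)
    return X, y

X_train, y_train = make_blobs(100)
X_val, y_val = make_blobs(40)

candidate_gamma = np.logspace(-10, 10, 21)  # 1e-10, 1e-9, ..., 1e10
val_accuracy = []
for gamma in candidate_gamma:
    model = SVC(kernel='rbf', gamma=float(gamma))
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the given data.
    val_accuracy.append(model.score(X_val, y_val))

chosen_gamma = candidate_gamma[np.argmax(val_accuracy)]
print(chosen_gamma, max(val_accuracy))
```

On data like this, accuracy typically drops towards both ends of the grid, which is why the search spans many orders of magnitude instead of a narrow linear range.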


In [None]:
# use matplotlib.pyplot to generate a graph of 'all_accuracy' against 'all_gamma' and label all axes of the graph
# [YOUR CODE HERE]

print(f'The best value of gamma is {best_gamma} with a corresponding validation accuracy of {best_accuracy}')
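
Because the gamma candidates span twenty orders of magnitude, a linear x-axis would squash almost all of them against the origin. The sketch below plots accuracy against gamma on a logarithmic axis instead. The function `plot_gamma_sweep` is a hypothetical helper, and the accuracy values are made up purely to exercise the plot.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_gamma_sweep(gammas, accuracies):
    """Hypothetical helper: accuracy against gamma on a logarithmic x-axis."""
    fig, ax = plt.subplots()
    ax.plot(gammas, accuracies, marker='o')
    ax.set_xscale('log')
    ax.set_xlabel('gamma')
    ax.set_ylabel('Validation accuracy')
    ax.set_title('SVM (RBF kernel): validation accuracy vs. gamma')
    return fig, ax

# Made-up accuracies with a single peak, not real results.
gammas = np.logspace(-10, 10, 21)
accuracies = 0.5 + 0.4 * np.exp(-((np.log10(gammas) + 3) ** 2) / 20)
fig, ax = plot_gamma_sweep(gammas, accuracies)
```

In a notebook, the figure renders when the cell finishes; elsewhere, call `plt.show()` after creating it.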

In [None]:
# fit the SVM model on 'trainval' data using kernel = 'rbf' and gamma = best_gamma and predict the test data
# [YOUR CODE HERE]


## 5. Model evaluation (SVM)

In [None]:
# generate a confusion matrix for the SVM model on the test data
# [YOUR CODE HERE]


In [None]:
# generate the classification report for the SVM model on the test data
# [YOUR CODE HERE]
