# Assignment 3: Image classification

## Instructions

In this assignment, you will be working on the COVID-19 lung X-Ray dataset. It consists of X-ray images that are labeled as either "Normal" or "Covid", with the latter indicating an X-ray of a COVID-19 patient.

To predict whether a person has COVID, you will train the following five classification models, fine-tune their hyperparameters, and finally select the best model and test it on the test data.

1.   Decision Tree
2.   Gradient Boosting
3.   Random Forest
4.   Support Vector Machines
5.   Neural Networks

The dataset can be found here: [Covid 19 X-Ray Dataset](https://drive.google.com/drive/folders/1Nwa6L58NwF23PvImpDrp-W8oTOheu_b3?usp=sharing). The data is already split into training, validation, and test.

To help you with this assignment, you are provided with some instructions to follow. We will check these criteria when evaluating your submissions.

## Part 1: Data Exploration

In [None]:
# load the dataset
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# define the path for training, testing and validation datasets
train_dir = "/content/drive/MyDrive/Classroom/Training"
val_dir = "/content/drive/MyDrive/Classroom/Validation"
test_dir = "/content/drive/MyDrive/Classroom/Testing"

Since the data is organized in folders and subfolders, we use the os library to build the path to each class subfolder:

```
# os.path.join(folder, 'destination directory name')
```

where the destination will be either "COVID" or "Normal".
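
As a minimal sketch of how `os.path.join` assembles these paths, the snippet below builds a temporary folder tree that only mimics the dataset layout. The temporary directory and the `root` variable are assumptions for illustration; the real folders live on Google Drive.

```python
import os
import tempfile

# Build a throwaway folder tree that mimics the dataset layout
# (illustration only; this is not the real dataset location).
root = tempfile.mkdtemp()
for label in ("COVID", "Normal"):
    os.makedirs(os.path.join(root, "Training", label))

train_folder = os.path.join(root, "Training")
covid_path = os.path.join(train_folder, "COVID")
normal_path = os.path.join(train_folder, "Normal")

print(covid_path)
print(sorted(os.listdir(train_folder)))  # the two class subfolders
```

Using `os.path.join` instead of string concatenation keeps the path separators correct on every operating system.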



In [None]:
# import libraries
import os

# Function to count samples in each folder
def count_samples(folder):
    covid_path = os.path.join(folder, 'COVID')
    normal_path = os.path.join(folder, 'Normal')
    num_covid = len(os.listdir(covid_path))
    num_normal = len(os.listdir(normal_path))
    return num_covid, num_normal

# count samples in each folder
train_covid, train_normal = count_samples(train_dir)
test_covid, test_normal = count_samples(test_dir)
val_covid, val_normal = count_samples(val_dir)

# output the counts
print("Number of samples in each folder:")
print("Training - COVID:", train_covid)
print("Training - Normal:", train_normal)
print("Validation - COVID:", val_covid)
print("Validation - Normal:", val_normal)
print("Testing - COVID:", test_covid)
print("Testing - Normal:", test_normal)

Your next step is to visualize some of the data. When we were working with a dataframe, it was easy to inspect a few samples, since a dataframe has a `head` method and many other functions for that purpose.

In our case, we are dealing with images. The simplest option is to browse the folder in your Drive, but we want to see the images inside the notebook.

To do that, you should use the cv2 library and use the method

```
cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
```
so that we can read an image that is in grayscale.


In addition, the image should be in a viewable format, so we need to plot then show the image using the matplotlib library.

In [None]:
# import libraries
import cv2
import matplotlib.pyplot as plt

# function to visualize images from each folder
def visualize_images(folder, label, num_samples=2):
    covid_path = os.path.join(folder, 'COVID')
    normal_path = os.path.join(folder, 'Normal')

    # visualize COVID images
    covid_images = os.listdir(covid_path)[:num_samples]
    print(f"\nVisualizing {num_samples} COVID images from {label} folder:")
    for img_name in covid_images:
        img_path = os.path.join(covid_path, img_name)
        img = cv2.imread(img_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        plt.imshow(img)
        plt.title("COVID")
        plt.axis('off')
        plt.show()

    # visualize Normal images
    normal_images = os.listdir(normal_path)[:num_samples]
    print(f"\nVisualizing {num_samples} Normal images from {label} folder:")
    for img_name in normal_images:
        img_path = os.path.join(normal_path, img_name)
        img = cv2.imread(img_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        plt.imshow(img)
        plt.title("Normal")
        plt.axis('off')
        plt.show()

# visualize images from each folder
visualize_images(train_dir, "Training")
visualize_images(test_dir, "Testing")
visualize_images(val_dir, "Validation")

## Part 2: Data Preparation

Since we are dealing with images and folders, and since we know that we do not have a dataframe to choose from, we need to present our folders and images in a suitable format for the different models.

To do that, we will use the function

```
load_images_from_folder()
```
that will take as parameters the folder path and the label, and return 2 arrays, one for the features (which are the images) and one for the labels.

To build this function, we first need to initialize 2 empty lists to hold the new values that we will be working with, and we need to create a dictionary that maps the labels COVID and Normal to numeric values (0 and 1).

Next, we need to loop over the images inside each folder and do the following:

1.   Read the image
2.   Convert each image to grayscale
3.   Normalize the pixel values to the range between 0 and 1
4.   Flatten the image
5.   Append the image and its label to the corresponding arrays
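
The normalization, flattening, and appending steps can be sketched on a synthetic array before touching real files. This is only an illustration: `fake_img` is random noise standing in for a grayscale image that has already been read and resized to 100x100, and the label value 0 is assumed to stand for COVID.

```python
import numpy as np

# Stand-in for a grayscale image already resized to 100x100:
# random 8-bit pixel values, not a real X-ray.
rng = np.random.default_rng(0)
fake_img = rng.integers(0, 256, size=(100, 100), dtype=np.uint8)

normalized = fake_img.astype("float32") / 255.0  # scale pixels to [0, 1]
flat = normalized.flatten()                      # 100x100 grid -> 10000 values

images, labels = [], []
images.append(flat)  # one row per image
labels.append(0)     # assumed mapping: 0 = COVID

X = np.array(images)
y = np.array(labels)
print(X.shape, y.shape)
```

Each image becomes one row of 10000 features, which is the tabular layout that the scikit-learn models in Part 3 expect.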


In [None]:
# import libraries
import numpy as np

def load_images_from_folder(folder_path, label, target_size=(100, 100)):
    images = []
    labels = []
    label_dict = {'COVID': 0, 'Normal': 1}  # Define a dictionary to map labels to numeric values
    label_code = label_dict[label]

    label_path = os.path.join(folder_path, label)
    for img_name in os.listdir(label_path):
        img_path = os.path.join(label_path, img_name)
        # load the image
        img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)  # Read the image in grayscale
        # resize the image
        img_resized = cv2.resize(img, target_size)  # Resize the image to the specified target size
        # preprocess the resized image (e.g., normalize pixel values)
        img_resized = img_resized.astype('float32') / 255.0  # Normalize pixel values to [0, 1]
        # flatten the resized image
        img_flat = img_resized.flatten()
        # append the flattened image and corresponding label
        images.append(img_flat)
        labels.append(label_code)
    return np.array(images), np.array(labels)


Next, call this method for all the folders you have to prepare your data for training.

In [None]:
# load images from the training folder for both COVID and Normal labels
train_covid_images, train_covid_labels = load_images_from_folder(train_dir, 'COVID')
train_normal_images, train_normal_labels = load_images_from_folder(train_dir, 'Normal')

# load images from the testing folder for both COVID and Normal labels
test_covid_images, test_covid_labels = load_images_from_folder(test_dir, 'COVID')
test_normal_images, test_normal_labels = load_images_from_folder(test_dir, 'Normal')

# load images from the validation folder for both COVID and Normal labels
val_covid_images, val_covid_labels = load_images_from_folder(val_dir, 'COVID')
val_normal_images, val_normal_labels = load_images_from_folder(val_dir, 'Normal')

In [None]:
train_covid_images.shape

In [None]:
# combine COVID and Normal images and labels for training, validation and test
train_images = np.concatenate((train_covid_images, train_normal_images), axis=0)
train_labels = np.concatenate((train_covid_labels, train_normal_labels), axis=0)

val_images = np.concatenate((val_covid_images, val_normal_images), axis=0)
val_labels = np.concatenate((val_covid_labels, val_normal_labels), axis=0)

test_images = np.concatenate((test_covid_images, test_normal_images), axis=0)
test_labels = np.concatenate((test_covid_labels, test_normal_labels), axis=0)
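
After concatenation, every COVID sample comes before every Normal sample. Some utilities are sensitive to this ordering; for example, scikit-learn's `learning_curve` takes prefixes of the training data unless `shuffle=True` is passed. A minimal sketch of shuffling images and labels with the same permutation follows; the helper name `shuffle_together`, the seed, and the toy arrays are assumptions for illustration.

```python
import numpy as np

def shuffle_together(images, labels, seed=42):
    """Return images and labels shuffled with the same random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    return images[order], labels[order]

# Toy stand-in for the concatenated arrays: 3 "COVID" rows then 3 "Normal" rows.
toy_images = np.arange(12, dtype="float32").reshape(6, 2)
toy_labels = np.array([0, 0, 0, 1, 1, 1])

shuffled_images, shuffled_labels = shuffle_together(toy_images, toy_labels)
print(shuffled_labels)
```

On the real data this would be called as `train_images, train_labels = shuffle_together(train_images, train_labels)`.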


## Part 3: Models


### Decision Tree

Train a Decision Tree model using the default hyperparameters and evaluate its performance on the validation set.

In [None]:
# import libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


# initialize and train the decision tree model
dt_clf = DecisionTreeClassifier()
dt_clf.fit(train_images, train_labels)

# predict labels for the validation set
val_set_pred = dt_clf.predict(val_images)

# report the validation accuracy of the trained model
val_accuracy = accuracy_score(val_labels, val_set_pred)
print(f"Validation Accuracy: {val_accuracy}")


In [None]:
# plot the learning curve of the trained model to examine bias and variance
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, cv=None, n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    plt.xlabel("Training examples")
    plt.ylabel("Score")

    # Determine sizes and scores
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)

    # Calculate the average and standard deviation of the training and test scores
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    # Plot the learning curve
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")

    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt



plot_learning_curve(dt_clf, "Decision Tree Learning Curve", train_images, train_labels, cv=5)
plt.show()

Does your model suffer from overfitting (high variance) or underfitting (high bias) or neither and why?

Answer:

#### Hyperparameter tuning

Use the **GridSearchCV** module of the **sklearn** library to tune the hyperparameters of your Decision Tree model on the validation set. You should try to tune as many hyperparameters as you can, such as split criterion, maximum depth, etc. Your tuning should be guided by the observations you made from the learning curve of the untuned model.

In [None]:
# import libraries
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 10, 20]
}
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=3)
grid_search.fit(train_images, train_labels)
print("Best parameters:", grid_search.best_params_)
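
With `cv=3`, GridSearchCV cross-validates on folds of the training data and never sees the separate validation folder. One possible way to make it score candidates on the validation set instead is `PredefinedSplit`. The sketch below uses small random arrays as stand-ins for the image data; the names `X_train`, `X_val`, `test_fold`, and the grid values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-ins for the flattened image arrays (not real data).
rng = np.random.default_rng(0)
X_train, y_train = rng.random((80, 20)), rng.integers(0, 2, 80)
X_val, y_val = rng.random((20, 20)), rng.integers(0, 2, 20)

# -1 marks rows that are only ever used for training;
# 0 marks the rows of the single validation fold.
test_fold = np.concatenate([np.full(len(X_train), -1),
                            np.zeros(len(X_val), dtype=int)])
split = PredefinedSplit(test_fold)

X_all = np.concatenate([X_train, X_val])
y_all = np.concatenate([y_train, y_val])

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [None, 5, 10], "criterion": ["gini", "entropy"]},
    cv=split,
)
search.fit(X_all, y_all)
print(search.best_params_)
```

Note that with the default `refit=True`, the best configuration is retrained on the training and validation rows together once the search finishes.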

### Gradient Boosting

Train a Gradient Boosting model using the default hyperparameters and validate it on the validation set.



In [None]:
# import libraries
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# initialize and train the Gradient Boosting model
gb_clf = GradientBoostingClassifier()
gb_clf.fit(train_images, train_labels)

# predict labels for the validation set
val_set_pred = gb_clf.predict(val_images)

# report the validation accuracy of the trained model
val_accuracy = accuracy_score(val_labels, val_set_pred)
print(f"Validation Accuracy: {val_accuracy}")


In [None]:
# plot the learning curve of the trained model to examine bias and variance
plot_learning_curve(gb_clf, "Gradient Boosting Learning Curve", train_images, train_labels, cv=5)
plt.show()


Does your model suffer from overfitting (high variance) or underfitting (high bias) or neither and why?

Answer:

#### Hyperparameter tuning

Use the **GridSearchCV** module of the **sklearn** library to tune the hyperparameters of your Gradient Boosting model on the validation set. You should try to tune as many hyperparameters as you can, such as criterion, maximum depth, learning rate, number of estimators, etc. Your tuning should be guided by the observations you made from the learning curve of the untuned model.

In [None]:
param_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 10, 20]
}
grid_search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=3)
grid_search.fit(train_images, train_labels)
print("Best parameters:", grid_search.best_params_)
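
The grid above only varies tree-shape parameters. A hedged sketch of a grid that also covers the learning rate and the number of estimators is shown below; it runs on a small synthetic dataset so that it finishes quickly, and the specific values are illustrative assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic problem so the sketch runs quickly (not the X-ray data).
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

gb_param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [20, 50],
    "max_depth": [2, 3],
}
gb_search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                         gb_param_grid, cv=3)
gb_search.fit(X_demo, y_demo)
print(gb_search.best_params_, round(gb_search.best_score_, 3))
```

Learning rate and number of estimators interact, since a smaller learning rate usually needs more trees, so it can help to tune them together.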

### Random Forest

Train a Random Forest model using the default hyperparameters and validate it on the validation set.


In [None]:
# import libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# initialize and train the Random Forest model
rf_clf = RandomForestClassifier()
rf_clf.fit(train_images, train_labels)

# predict labels for the validation set
val_set_pred = rf_clf.predict(val_images)

# report the validation accuracy of the trained model
val_accuracy = accuracy_score(val_labels, val_set_pred)
print(f"Validation Accuracy: {val_accuracy}")


In [None]:
# plot the learning curve of the trained model to examine bias and variance
plot_learning_curve(rf_clf, "Random Forest Learning Curve", train_images, train_labels, cv=5)
plt.show()

Does your model suffer from overfitting (high variance) or underfitting (high bias) or neither and why?

Answer:

#### Hyperparameter tuning

Use the **GridSearchCV** module of the **sklearn** library to tune the hyperparameters of your Random Forest model on the validation set. You should try to tune as many hyperparameters as you can, such as split criterion, maximum depth, number of estimators, etc. Your tuning should be guided by the observations you made from the learning curve of the untuned model.
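
As a starting point, here is a minimal sketch of a Random Forest grid search on synthetic data. The grid values are assumptions for illustration; `ParameterGrid` is used only to show how quickly the number of candidates grows as parameters are added.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid

# Small synthetic problem so the sketch runs quickly (not the X-ray data).
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

rf_param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
    "max_features": ["sqrt", "log2"],
    "criterion": ["gini", "entropy"],
}
n_candidates = len(ParameterGrid(rf_param_grid))  # 2 * 2 * 2 * 2 combinations
print("Candidates:", n_candidates, "-> fits with cv=3:", n_candidates * 3)

rf_search = GridSearchCV(RandomForestClassifier(random_state=0),
                         rf_param_grid, cv=3)
rf_search.fit(X_demo, y_demo)
print(rf_search.best_params_)
```

On 10000-feature image data each fit is far slower than in this toy setting, so it is worth checking the candidate count before launching the search.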

In [None]:
# code Here

### Support Vector Machines

Train a linear Support Vector Machines model using the default hyperparameters and validate it on the validation set.

In [None]:
# import libraries
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score


# initialize and train a linear Support Vector Machines model
svm_clf = SVC(kernel='linear')
svm_clf.fit(train_images, train_labels)

# report the number of support vectors (one count per class)
print("Number of support vectors per class:", svm_clf.n_support_)

# predict labels for the validation set
val_set_pred = svm_clf.predict(val_images)

# report the validation accuracy of the trained model
val_accuracy = accuracy_score(val_labels, val_set_pred)
print(f"Validation Accuracy: {val_accuracy}")


In [None]:
# plot the learning curve of the trained model to examine bias and variance
plot_learning_curve(svm_clf, "SVM Learning Curve", train_images, train_labels, cv=5)
plt.show()

Does your model suffer from overfitting (high variance) or underfitting (high bias) or neither and why?

Answer:

#### Hyperparameter tuning

Use the **GridSearchCV** module of the **sklearn** library to tune the hyperparameters of your Support Vector Machines model on the validation set. You should try to tune as many hyperparameters as you can, such as the regularization parameter C, the kernel, the degree of polynomial kernels, and the gamma parameter. Your tuning should be guided by the observations you made from the learning curve of the untuned model.
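
Some SVM hyperparameters only matter for certain kernels: `degree` applies to the polynomial kernel, and `gamma` has no effect on the linear kernel. GridSearchCV accepts a list of grids, which avoids fitting redundant combinations. The following is a sketch on synthetic data with illustrative values, not a recommended search space.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

# Small synthetic problem so the sketch runs quickly (not the X-ray data).
X_demo, y_demo = make_classification(n_samples=120, n_features=10, random_state=0)

# One dictionary per kernel, each listing only the parameters that kernel uses.
svm_param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    {"kernel": ["poly"], "C": [0.1, 1, 10], "degree": [2, 3], "gamma": ["scale"]},
]
print("Candidates:", len(ParameterGrid(svm_param_grid)))

svm_search = GridSearchCV(SVC(), svm_param_grid, cv=3)
svm_search.fit(X_demo, y_demo)
print(svm_search.best_params_)
print("Support vectors per class:", svm_search.best_estimator_.n_support_)
```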

In [None]:
# code Here

### Feed-forward Neural Network

In [None]:
# install necessary libraries if needed
!pip install tensorflow
!pip install scikeras
!pip install keras==2.15.0

Train a Feed-forward Neural Network consisting of three layers. The first hidden layer has 128 units and uses the ReLU activation function, the second hidden layer has 64 units and uses the ReLU activation function, and the output layer has 1 unit and uses sigmoid activation for binary classification. Use the Adam optimizer to train the neural network, and then validate your model using the validation set.
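
To get a feel for the size of this network, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-output pair plus one bias per unit. The sketch below assumes 10000 inputs (100x100 flattened images); the helper name `dense_param_count` is hypothetical.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total

# 10000 inputs assumes 100x100 flattened images.
print(dense_param_count([10000, 128, 64, 1]))  # 1288449
```

Almost all of the parameters sit in the first layer (10000 x 128 weights), which is worth keeping in mind when judging overfitting on a small dataset.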

In [None]:
# import libraries
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from scikeras.wrappers import KerasClassifier, KerasRegressor
from keras.models import Sequential
from keras.layers import Flatten
from tensorflow.keras.layers import Dense
from sklearn.metrics import accuracy_score


# function to create a dense model
def create_dense_model(input_shape):
    model = Sequential([
        Flatten(input_shape=input_shape),  # Flatten the input images
        Dense(128, activation='relu'),  # First hidden layer with 128 units and ReLU activation
        Dense(64, activation='relu'),   # Second hidden layer with 64 units and ReLU activation
        Dense(1, activation='sigmoid')  # Output layer with 1 unit and sigmoid activation for binary classification
    ])
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# function to extract the dense model from the pipeline
def extract_model(pipeline):
    return pipeline.steps[-1][1]

# Create a pipeline for the dense model
dense_pipeline = Pipeline([
    ('flatten', FunctionTransformer(lambda x: x.reshape((x.shape[0], -1)))),
    ('dense', KerasClassifier(build_fn=create_dense_model, input_shape=(10000,), epochs=10, batch_size=32))
])

# train the network
nn_model = create_dense_model((train_images.shape[1],))
history = nn_model.fit(train_images, train_labels, epochs=10, batch_size=32, validation_data=(val_images, val_labels))

# predict labels for the validation set
val_probabilities = nn_model.predict(val_images)
val_set_pred = (val_probabilities > 0.5).astype("int32").ravel()  # flatten (n, 1) predictions to shape (n,)

# report the validation accuracy of the trained model
val_accuracy = accuracy_score(val_labels, val_set_pred)
print(f"Validation Accuracy: {val_accuracy}")


In [None]:
# plot the learning curve of the trained model to examine bias and variance
import matplotlib.pyplot as plt

# Plotting training and validation accuracy
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Learning Curve')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

Does your model suffer from overfitting (high variance) or underfitting (high bias) or neither and why?

Answer:

#### Hyperparameter tuning

Use the **RandomizedSearchCV** module of the **sklearn** library to tune the hyperparameters of your Neural Network model on the validation set. You should try to tune as many hyperparameters as you can, such as the learning rate, the number of hidden units, batch size, learning rate decay, and the number of hidden layers. Your tuning should be guided by the observations you made from the learning curve of the untuned model.
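
Unlike a grid search, a randomized search evaluates only a fixed number of configurations drawn from the search space. The sketch below illustrates that sampling step with scikit-learn's `ParameterSampler`, without training any network. The parameter names and values in `search_space` are hypothetical and would need to match the arguments of your own model-building function.

```python
from sklearn.model_selection import ParameterSampler

# Hypothetical search space; the names mirror the hyperparameters listed above
# and are not tied to any particular Keras wrapper.
search_space = {
    "learning_rate": [1e-4, 1e-3, 1e-2],
    "hidden_units": [32, 64, 128, 256],
    "n_hidden_layers": [1, 2, 3],
    "batch_size": [16, 32, 64],
}

# Draw 5 of the 108 possible combinations.
sampled_configs = list(ParameterSampler(search_space, n_iter=5, random_state=0))
for config in sampled_configs:
    print(config)
```

The `n_iter` argument of RandomizedSearchCV plays the same role: it caps how many configurations are trained, which matters when each fit is a full neural network training run.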

In [None]:
# code Here

## Part 4: Testing

Select the best tuned model among the five models you trained in Part 3, and then test it on the test data.
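
One way to make the selection explicit is to score every tuned model on the validation set and keep the highest scorer. The helper below is a sketch under the assumption that each candidate exposes a `score(X, y)` method returning accuracy, as scikit-learn classifiers do; the name `select_best` is hypothetical.

```python
def select_best(candidates, X_val, y_val):
    """Return (name, model, score) for the candidate with the highest validation score.

    candidates maps a display name to a fitted estimator exposing
    score(X, y), as scikit-learn classifiers do.
    """
    scored = {name: model.score(X_val, y_val) for name, model in candidates.items()}
    best_name = max(scored, key=scored.get)
    return best_name, candidates[best_name], scored[best_name]
```

It could be called as `best_name, best_model, best_score = select_best(tuned_models, val_images, val_labels)`, where `tuned_models` is a dictionary you build from your own tuned estimators. A plain Keras model does not provide a `score` method of this form, so the neural network would need to be wrapped or scored separately.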

In [None]:
# Predict labels for the test set using the best model
# (best_model is the tuned estimator you selected above, e.g. grid_search.best_estimator_)
test_predictions = best_model.predict(test_images)

# Report the test accuracy of the trained model
test_accuracy = accuracy_score(test_labels, test_predictions)
print(f"Test Accuracy: {test_accuracy}")