# Using Machine Learning Tools 2023, Assignment 3

## Sign Language Image Classification using Deep Learning

## Overview

In this assignment you will implement different deep learning networks to classify images of hands in poses that correspond to letters in American Sign Language. The dataset is contained in the assignment zip file, along with some images and a text file describing the dataset. It is similar in many ways to other MNIST datasets.

The main aims of the assignment are:

 - To implement and train different types of deep learning network;
 
 - To systematically optimise the architecture and parameters of the networks;
  
 - To explore over-fitting and know what appropriate actions to take in these cases.
 

It is the intention that this assignment will take you through the process of implementing and optimising deep learning approaches. The way that you work is more important than the results for this assignment, as what is most crucial for you to learn is how to take a dataset, understand the problem, write appropriate code, optimize performance and present results. A good understanding of the different aspects of this process and how to put them together well (which will not always be the same, since different problems come with different constraints or difficulties) is the key to being able to effectively use deep learning techniques in practice.

This assignment relates to the following ACS CBOK areas: abstraction, design, hardware and software, data and information, HCI and programming.


## Scenario

A client is interested in having you (or rather the company that you work for) investigate whether it is possible to develop an app that would enable American sign language to be translated for people that do not sign, or those that sign in different languages/styles. They have provided you with a labelled data of images related to signs (hand positions) that represent individual letters in order to do a preliminary test of feasibility.

Your manager has asked you to do this feasibility assessment, but subject to a constraint on the computational facilities available.  More specifically, you are asked to do **no more than 50 training runs in total** (including all models and hyperparameter settings that you consider).  

In addition, you are told to **create a validation set and any necessary test sets using _only_ the supplied testing dataset.** It is unusual to do this, but here the training set contains a lot of non-independent, augmented images and it is important that the validation images must be totally independent of the training data and not made from augmented instances of training images.

The clients have asked to be informed about the following:
 - **unbiased accuracy** estimate of a deep learning model (since DL models are fast when deployed)
 - the letter with the lowest individual accuracy
 - the most common error (of one letter being incorrectly labelled as another)
 
Your manager has asked you to create a jupyter notebook that shows the following:
 - loading the data, checking it, fixing any problems, and displaying a sample
 - training and optimising both **densely connected** *and* **CNN** style models
 - finding the best one, subject to a rapid turn-around and corresponding limit of 50 training runs in total
 - reporting clearly what networks you have tried, the method you used to optimise them, the associated learning curves, their summary performance and selection process to pick the best model
     - this should be clear enough that another employee, with your skillset, should be able to take over from you and understand your methods
 - results from the model that is selected as the best, showing the information that the clients have requested
 - a statistical test between the best and second-best models, to see if there is any significant difference in performance (overall accuracy)
 - it is hoped that the accuracy will exceed 96% overall and better than 90% for every individual letter, and you are asked to:
     - report the overall accuracy
     - report the accuracy for each individual letter
     - write a short recommendation regarding how likely you think it is to achieve these goals either with the current model or by continuing to do a small amount of model development/optimisation


## Guide to Assessment

This assignment is much more free-form than others in order to test your ability to run a full analysis like this one from beginning to end, using the correct procedures. So you should use a methodical approach, as a large portion of the marks are associated with the decisions that you take and the approach that you use.  There are no marks associated with the performance - just report what you achieve, as high performance does not get better marks - to get good marks you need to use the right steps, as you've used in other assignments and workshops.

Make sure that you follow the instructions found in the scenario above, as this is what will be marked.  And be careful to do things in a way that gives you an *unbiased* result.

The notebook that you submit should be similar to those in the other assignments, where it is important to clearly structure your outputs and code so that it could be understood by your manager or your co-worker - or, even more importantly, the person marking it! This does not require much writing, beyond the code, comments and the small amount that you've seen in previous assignments.  Do not write long paragraphs to explain every detail of everything you do - it is not that kind of report and longer is definitely not better.  Just make your code clear, your outputs easy to understand (short summaries often help here), and include a few small markdown cells that describe or summarise things when necessary.

Marks for the assignment will be determined according to the general rubric that you can find on MyUni, with a breakdown into sections as follows:
 - 10%: Loading, investigating, manipulating and displaying data
 - 20%: Initial model successfully trained (and acting as a baseline)
 - 45%: Optimisation of an appropriate set of models in an appropriate way (given the constraint of 50 training runs)
 - 25%: Comparison of models, selection of the best two and reporting of final results

Remember that most marks will be for the **steps you take**, rather than the achievement of any particular results. There will also be marks for showing appropriate understanding of the results that you present.  

What you need to do this assignment can all be found in the first 10 weeks of workshops, lectures and also the previous two assignments. The one exception to this is the statistical test, which will be covered in week 11.

## Final Instructions

While you are free to use whatever IDE you like to develop your code, your submission should be formatted as a Jupyter notebook that interleaves Python code with output, commentary and analysis. 
- Your code must use the current stable versions of python libraries, not outdated versions.
- All data processing must be done within the notebook after calling appropriate load functions.
- Comment your code, so that its purpose is clear to the reader!
- In the submission file name, do not use spaces or special characters.

The marks for this assignment are mainly associated with making the right choices and executing the workflow correctly and efficiently. Make sure you have clean, readable code as well as producing outputs, since your coding will also count towards the marks (however, excessive commenting is discouraged and will lose marks, so aim for a modest, well-chosen amount of comments and text in outputs).

This assignment can be solved using methods from sklearn, pandas, matplotlib and keras, as presented in the workshops. Other high-level libraries should not be used, even though they might have nice functionality such as automated hyperparameter or architecture search/tuning/optimisation. For the deep learning parts please restrict yourself to the library calls used in workshops 7-10 or ones that are very similar to these. You are expected to search and carefully read the documentation for functions that you use, to ensure you are using them correctly.

As ususal, feel free to use code from the workshops as a base for this assignment but be aware that they will normally not do *exactly* what you want (code examples rarely do!) and so you will need to make suitable modifications.





In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from keras.utils.np_utils import to_categorical
from collections import Counter
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from keras.preprocessing.image import ImageDataGenerator

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

ModuleNotFoundError: No module named 'pandas'

# 1. Data Prepareation

### Load train and test datasets

In [None]:
train = pd.read_csv('sign_mnist_train.csv')
test = pd.read_csv('sign_mnist_test.csv')

### Check if any null values

In [None]:
print(train.isnull().any().sum())
print(test.isnull().any().sum())

### Check labels

In [None]:
train_labels_unique = train['label'].unique()
test_labels_unique = test['label'].unique()

print(f"labels in train dataset: {sorted(train_labels_unique)}")
print(f"labels in test dataset: {sorted(test_labels_unique)}")

In [None]:
train = train[train['label'] != 200]

In [None]:
train_labels_unique = train['label'].unique()
test_labels_unique = test['label'].unique()

print(f"labels in train dataset: {sorted(train_labels_unique)}")
print(f"labels in test dataset: {sorted(test_labels_unique)}")

### Check pixel values

In [None]:
train_pixels_min = train.drop('label', axis=1).values.min()
train_pixels_max = train.drop('label', axis=1).values.max()

test_pixels_min = test.drop('label', axis=1).values.min()
test_pixels_max = test.drop('label', axis=1).values.max()

print(f"Train pixel values range: {train_pixels_min} - {train_pixels_max}")
print(f"Test pixel values range: {test_pixels_min} - {test_pixels_max}")

### Display Data

In [None]:
train_images = train.drop('label', axis=1).values
train_labels = train['label'].values

unique_labels = np.unique(train_labels)
fig, axes = plt.subplots(3, 8, figsize=(20, 8))
axes = axes.ravel()

for i in range(24):
    label_index = np.where(train_labels == unique_labels[i])[0][0]
    axes[i].imshow(train_images[label_index].reshape(28, 28), cmap='gray')
    axes[i].set_title(f"Label: {unique_labels[i]}")
    axes[i].axis('off')  # Hide axis

plt.subplots_adjust(hspace=0.4)
plt.show()

In [None]:
train_labels = train['label'].values
true_labels = train_labels

# Count occurrences of each label
label_counts = np.bincount(true_labels)

# Plotting
plt.figure(figsize=(15,7))
plt.bar(range(len(label_counts)), label_counts)
plt.xlabel('Labels')
plt.ylabel('Number of Samples')
plt.title('Distribution of Labels in Training Set')
plt.xticks(range(len(label_counts)), [chr(i + 65) for i in range(len(label_counts))])
plt.show()

In [None]:
true_labels = test['label'].values

# Count occurrences of each label
label_counts = np.bincount(true_labels)

# Plotting
plt.figure(figsize=(15,7))
plt.bar(range(len(label_counts)), label_counts)
plt.xlabel('Labels')
plt.ylabel('Number of Samples')
plt.title('Distribution of Labels in Test Set')
plt.xticks(range(len(label_counts)), [chr(i + 65) for i in range(len(label_counts))])
plt.show()

In [None]:
train_images = train_images / 255.0
test_images = test.drop('label', axis=1).values / 255.0

# Reshape images to have a single channel (grayscale)
train_images = train_images.reshape(-1, 28, 28, 1)
test_images = test_images.reshape(-1, 28, 28, 1)

train_labels = to_categorical(train_labels)
test_labels = to_categorical(test['label'].values)

# 2. Baseline Model Training

### Densely connected

In [None]:
model = Sequential([
    Flatten(input_shape=(28, 28, 1)),  # Flattens the input
    Dense(512, activation='relu'),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(25, activation='softmax')
])

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=10, validation_split=0.2, batch_size=32)

In [None]:
loss, accuracy = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {accuracy}")

### CNN style

In [None]:
# Build the model
model = Sequential([
    Conv2D(16, kernel_size=(3, 3), activation='selu', input_shape=(28,28,1)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(32, kernel_size=(3, 3), activation='selu'),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(128, activation='selu'),
    Dense(25, activation='softmax')
])

In [None]:
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(train_images, train_labels, batch_size=32, epochs=10, validation_split=0.2)

In [None]:
loss, accuracy = model.evaluate(test_images, test_labels)
print(f"Test accuracy: {accuracy}")

# 3. Optimisation

In [None]:
import keras


epochs = 50

early_stop = keras.callbacks.EarlyStopping(monitor='accuracy', patience=4, restore_best_weights=True) 

In [None]:
def find_worst_letter(model):
    predictions = model.predict(test_images)
    predicted_labels = np.argmax(predictions, axis=1)
    true_labels = np.argmax(test_labels, axis=1)

    accuracies = []

    # Calculate accuracy for each letter
    for i in range(25): # since there are 24 classes
        correct_predictions = np.sum((predicted_labels == i) & (true_labels == i))
        total_true = np.sum(true_labels == i)
        acc = correct_predictions / total_true if total_true != 0 else 0
        
        if i == 9:
            accuracies.append(1.0)
        else:
            accuracies.append(acc)

    # Finding the letter with the lowest accuracy
    lowest_acc = np.argmin(accuracies)

    print(f"Letter with the lowest accuracy: {chr(65 + lowest_acc)} with accuracy {accuracies[lowest_acc]*100:.2f}%")
    return accuracies

In [None]:
def most_common_mismatch(model):
    predictions = model.predict(test_images)
    predicted_labels = np.argmax(predictions, axis=1)
    true_labels = np.argmax(test_labels, axis=1)

    # Finding mismatches between true and predicted labels
    mismatches = [(true, predicted) for true, predicted in zip(true_labels, predicted_labels) if true != predicted]

    # Counting mismatches to find the most common error
    error_counts = Counter(mismatches)

    most_common, error_count = error_counts.most_common(1)[0]

    print(f"The most common error is {chr(65 + most_common[0])} being mislabeled as {chr(65 + most_common[1])} with {error_count} occurrences.")

In [None]:
def model_cnn_factory(hiddensizes, actfn, optimizer, learningrate=0.0001, kernel_size = 3, dropout = 0.1):
    model = keras.models.Sequential()
    model.add(Conv2D(filters=hiddensizes[0], kernel_size=3, 
                                  strides=1, activation=actfn, padding="same", 
                                  input_shape=[28, 28, 1]))    
    model.add(Dropout(0.1))              
    model.add(MaxPooling2D(pool_size=2))
    for n in hiddensizes[1:-1]:
        model.add(Conv2D(filters=n, kernel_size=3, strides=1, 
                                      padding="same", activation=actfn))  
                                      
        model.add(MaxPooling2D(pool_size=2))
        model.add(Dropout(dropout))
        
    model.add(Conv2D(filters=hiddensizes[-1], kernel_size=3, 
                                  strides=1, padding="same", activation=actfn))  
                                  # 2nd Conv

    model.add(Flatten())
    model.add(Dense(256, activation='selu'))
    model.add(Dropout(0.5))
    model.add(Dense(128, activation='selu'))
    model.add(Dropout(0.5))
    model.add(Dense(25, activation = "softmax"))  

    model.compile(loss="categorical_crossentropy", 
                  optimizer=optimizer(learning_rate=learningrate), metrics=["accuracy"])   
    return model

### Model 1: 

- Activation function: Selu
- Optimizer: Nadam (Learning rate: 0.0005)
- Kernel size: 3x3
- Dropout rate: 0.2

In [None]:
model_1 = model_cnn_factory(hiddensizes = [32, 64, 64, 64, 32], actfn = "selu", 
                        optimizer = keras.optimizers.Nadam, learningrate = 0.0005,
                        kernel_size = 3, dropout = 0.2)

In [None]:
history_1 = model_1.fit(train_images, train_labels, batch_size=64, verbose = 0,
                        epochs=epochs, validation_split=0.1, callbacks=[early_stop])

loss, accuracy = model_1.evaluate(test_images, test_labels)
print(f"Test accuracy: {accuracy}")

In [None]:
letter_acc_model1 = find_worst_letter(model_1)

In [None]:
most_common_mismatch(model_1)

### Model 2: 

- Activation function: ReLU
- Optimizer: Nadam (Learning rate: 0.0005)
- Kernel size: 3x3
- Dropout rate: 0.2

In [None]:
model_2 = model_cnn_factory(hiddensizes = [32, 64, 64, 64, 32], actfn = "relu", 
                        optimizer = keras.optimizers.Nadam, learningrate = 0.0005,
                        kernel_size = 3, dropout = 0.2)

In [None]:
history_2 = model_2.fit(train_images, train_labels, batch_size=64, verbose = 0,
                        epochs=epochs, validation_split=0.1, callbacks=[early_stop])

loss, accuracy = model_2.evaluate(test_images, test_labels)
print(f"Test accuracy: {accuracy}")

In [None]:
letter_acc_model2 = find_worst_letter(model_2)

In [None]:
most_common_mismatch(model_2)

### Model 3: 
- Activation function: ReLU
- Optimizer: Nadam (Learning rate: 0.0005)
- Kernel size: 5x5
- Dropout rate: 0.2

In [None]:
model_3 = model_cnn_factory(hiddensizes = [32, 64, 64, 64, 32], actfn = "relu", 
                        optimizer = keras.optimizers.Nadam, learningrate = 0.0005,
                        kernel_size = 5, dropout = 0.2)

In [None]:
history_3 = model_3.fit(train_images, train_labels, batch_size=64, verbose = 0,
                        epochs=epochs, validation_split=0.1, callbacks=[early_stop])

loss, accuracy = model_3.evaluate(test_images, test_labels)
print(f"Test accuracy: {accuracy}")

In [None]:
letter_acc_model3 = find_worst_letter(model_3)

In [None]:
most_common_mismatch(model_3)

### Model 4:
- Activation function: ReLU
- Optimizer: SGD (Learning rate: 0.01)
- Kernel size: 3x3
- Dropout rate: 0.2

In [None]:
model_4 = model_cnn_factory(hiddensizes = [32, 64, 64, 64, 32], actfn = "relu", 
                        optimizer = keras.optimizers.SGD, learningrate = 0.01,
                        kernel_size = 3, dropout = 0.2)

In [None]:
history_4 = model_4.fit(train_images, train_labels, batch_size=64, verbose = 0,
                        epochs=epochs, validation_split=0.1, callbacks=[early_stop])

In [None]:
loss, accuracy = model_4.evaluate(test_images, test_labels)
print(f"Test accuracy: {accuracy}")

In [None]:
letter_acc_model4 = find_worst_letter(model_4)

In [None]:
most_common_mismatch(model_4)

### Model 5:
- Activation function: ReLU
- Optimizer: Nadam (Learning rate: 0.0001)
- Kernel size: 5x5
- Dropout rate: 0.2

In [None]:
model_5 = model_cnn_factory(hiddensizes = [32, 64, 64, 64, 32], actfn = "relu", 
                        optimizer = keras.optimizers.Nadam, learningrate = 0.0001,
                        kernel_size = 5, dropout = 0.2)

In [None]:
history_5 = model_5.fit(train_images, train_labels, batch_size=64, verbose = 0,
                        epochs=epochs, validation_split=0.1, callbacks=[early_stop])

loss, accuracy = model_5.evaluate(test_images, test_labels)
print(f"Test accuracy: {accuracy}")

In [None]:
letter_acc_model5 = find_worst_letter(model_5)

In [None]:
most_common_mismatch(model_5)

### Report

**Model Architecture**: We utilized a CNN (Convolutional Neural Network) with multiple convolutional layers, dropouts, and dense layers for classification.

**Optimization Methods**: We experimented with various configurations like activation functions, optimizers, learning rates, kernel sizes, and dropout rates.

**Early Stopping**: We introduced an early stopping mechanism to halt training if there's no improvement in accuracy after 4 epochs to avoid overfitting and save computation time.

**Evaluation Metrics**: We used categorical cross-entropy as the loss function and accuracy as the metric. We also evaluated each model on its ability to identify the letter with the lowest individual accuracy and its most common labeling error.

In [None]:
plt.figure(figsize=(10, 6))

history_list = [history_1, history_2, history_3, history_4, history_5]
model_name = ['model 1' , 'model 2', 'model 3', 'model 4', 'model 5']
for i in range (5):
    if (i == 0):
        ax1 = plt.subplot(2, 3, i + 1)
    else:
        ax = plt.subplot(2, 3, i + 1, sharey = ax1)
    train_acc = history_list[i].history['accuracy']
    val_acc = history_list[i].history['val_accuracy']

    plt.plot(train_acc, label='Training Accuracy')
    plt.plot(val_acc, label='Validation Accuracy')
    plt.title(model_name[i])

plt.show()

In [None]:
plt.figure(figsize=(10, 10))

letter_acc_list = [letter_acc_model1, letter_acc_model2, letter_acc_model3, letter_acc_model4, letter_acc_model5]
model_name = ['model 1' , 'model 2', 'model 3', 'model 4', 'model 5']
for i in range (5):
    if (i == 0):
        ax1 = plt.subplot(3, 2, i + 1)
    else:
        ax = plt.subplot(3, 2, i + 1, sharey = ax1)
    
    plt.bar(range(len(letter_acc_list[i])), letter_acc_list[i])
    plt.title(model_name[i])

plt.show()

Among the 5 models, Model 3 shows the highest test accuracy of 98.80%, Letter with the lowest accuracy is 90.38. It uses the ReLU activation function, Nadam optimizer with a learning rate of 0.0005, and a kernel size of 5x5. The second best model is Model 5, it has test accuracy of 97.25, Letter with the lowest accuracy is 88.25.It uses the ReLU activation function, Nadam optimizer with a learning rate of 0.0001, and a kernel size of 5x5. Thus, Model 3 and Model 5 is our recommended model.

# Report

In [None]:
final_model_1 = model_cnn_factory(hiddensizes = [32, 64, 64, 64, 32], actfn = "relu", 
                        optimizer = keras.optimizers.Nadam, learningrate = 0.0005,
                        kernel_size = 5, dropout = 0.3)

In [None]:
history_1 = final_model_1.fit(train_images, train_labels, batch_size=64, verbose = 0,
                        epochs=epochs, validation_split=0.1, callbacks=[early_stop])

loss, accuracy = final_model_1.evaluate(test_images, test_labels)
print(f"Test accuracy: {accuracy}")

In [None]:
letter_acc_final_model_1 = find_worst_letter(final_model_1)

In [None]:
most_common_mismatch(final_model_1)

In [None]:
final_model_2 = model_cnn_factory(hiddensizes = [32, 64, 64, 64, 32], actfn = "relu", 
                        optimizer = keras.optimizers.Nadam, learningrate = 0.0001,
                        kernel_size = 5, dropout = 0.3)

In [None]:
history_2 = final_model_2.fit(train_images, train_labels, batch_size=64, verbose = 0,
                        epochs=epochs, validation_split=0.1, callbacks=[early_stop])

loss, accuracy = final_model_2.evaluate(test_images, test_labels)
print(f"Test accuracy: {accuracy}")

In [None]:
letter_acc_final_model_2 = find_worst_letter(final_model_2)

In [None]:
most_common_mismatch(final_model_2)

In [None]:
plt.figure(figsize=(10, 4))

ax = plt.subplot(1, 2, 1)
plt.bar(range(len(letter_acc_final_model_1)), letter_acc_final_model_1)
plt.title("final model 1 accuracy for each letter")
ax = plt.subplot(1, 2, 2, sharey = ax)
plt.bar(range(len(letter_acc_final_model_2)), letter_acc_final_model_2)
plt.title("final model 2 accuracy for each letter")
plt.show()

### Compare the final two model:

**Overall Accuracy:**

final model 1: 98.70%

final model 2: 97.64%

**Accuracy for Each Individual Letter:**

For final model 1:

S: 90.38% (Lowest accuracy)

Other letters: Above 90%

For final model 2:

Y: 88.25% (Lowest accuracy)

Other letters: Above 90%

**Recommendation:**

Both models exceed the overall accuracy goal of 96%. However, when examining individual letters, final model 2 (for letter 'Y') fall slightly below the desired 90% threshold. So final model 1 is better.

However, when examining individual letters, final model 1, the accuracy is slightly upon the 90% threshold, so we may try possible steps include using data augmentation techniques for underrepresented letters, fine-tuning the model architecture, or employing ensemble methods.

### Final model

In [None]:
final_model = model_cnn_factory(hiddensizes = [32, 64, 64, 64, 32], actfn = "relu", 
                        optimizer = keras.optimizers.Nadam, learningrate = 0.0005,
                        kernel_size = 5, dropout = 0.35)

In [None]:
history = final_model.fit(train_images, train_labels, batch_size=64, verbose = 0,
                        epochs=epochs, validation_split=0.1, callbacks=[early_stop])

loss, accuracy = final_model.evaluate(test_images, test_labels)
print(f"Test accuracy: {accuracy}")

In [None]:
letter_acc_final_model = find_worst_letter(final_model_1)

In [None]:
plt.bar(range(len(letter_acc_final_model)), letter_acc_final_model)
plt.title("")
plt.show()

In [None]:
most_common_mismatch(final_model_1)

The final result is:

Overall Accuracy: 98.21%

Letter with the lowest accuracy: R with accuracy 91.42%

The most common error is N being mislabeled as M with 25 occurrences.
