# Abstract

This report documents the final solution to the task. It first walks the reader through the task description and requirements. It then presents the solution, starting with data preparation and preprocessing, followed by model design and training, and ending with the preparation of the test predictions. An appendix shows all the results and curves.

# Task description

## Requirements

In this task, a classifier has to be trained on black and white images to identify the character in each image out of 62 different classes. The following requirements have to be fulfilled:
1.	Approach:
A deep learning classifier has to be designed, trained and optimized following a cross validation approach, and the training and testing results have to be saved.
2.	Indicators:
The two main goodness indicators/metrics to be used are accuracy and area under the curve (AUC).
3.	Output format:
A text file containing the predictions for the unlabeled test data, with each prediction on a separate line/row, in the same order as the provided data.


## Dataset

The provided data is structured as follows:
1.	Dataset splits:
The data is split into two main folders, training and testing. The testing folder contains 7100 images, each named according to its order. This data has no labels, since the teacher uses it for the overall task evaluation. The prediction submission file is created from this data.
The training folder contains around 53172 images distributed over 62 subfolders, where each subfolder corresponds to a separate class/label. This data is used for training, validation and performance evaluation.
2.	Dataset description:
The data consists of black and white character images, each of size (128 x 128 x 3), in 62 different categories. The first ten classes (1 – 10) are the digits (0 – 9), the capital letters (A – Z) are assigned classes (11 – 36), and the lowercase letters (a – z) are assigned classes (37 – 62).


# Task solution

## Data preparation

The main objective of this step is to record the images and their labels in a reusable format, so that data gathering and label assignment do not have to be repeated. In addition, the data is compressed into a data structure that is easy to import at the beginning of the training. A further motivation is the limited computational and storage resources. The following steps are performed:

### Data and labels reading

All training images are read and stacked in a NumPy array, and a second array is built for the corresponding labels, which are extracted from the file/folder name. The same is done for the testing data, except that there are no labels to extract.
In addition, a dictionary is built to map each class number to its character, e.g. class number ‘11’ corresponds to character ‘A’.

In [None]:
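import glob

import cv2
import numpy as np

train_dir = './Train/*/*.png'
train_imgs, train_labels = [], []

for im_dir in glob.glob(train_dir):
  im = cv2.imread(im_dir)
  # Class number: the two characters right before the path separator that precedes 'img'
  idx = im_dir.index('img')
  im_class = int(im_dir[idx-3:idx-1])
  train_imgs.append(im)
  train_labels.append(im_class)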
print(len(train_imgs))
train_imgs = np.stack(train_imgs)
print(train_imgs.shape)
print(len(train_labels))
train_labels = np.stack(train_labels)
print(train_labels.shape)
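
The slicing above depends on the exact position of `img` in the path. As a minimal alternative sketch, the label can be read from the parent folder name instead. This assumes a layout of the form `./Train/<class number>/<file>.png`; the helper name `label_from_path` and the example file name are hypothetical.

```python
from pathlib import Path

def label_from_path(image_path):
    """Return the class number encoded in the parent folder name.

    Assumes a layout like ./Train/<class number>/<file>.png (an assumption,
    not confirmed by the dataset description).
    """
    return int(Path(image_path).parent.name)

# Hypothetical file name, used only for illustration
print(label_from_path('./Train/11/img011-00001.png'))
```

This variant does not depend on the file name at all, so it keeps working if the images are renamed.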

In [None]:
import string

numbers = [str(i) for i in range(0,10)]
upper_case_letters = list(string.ascii_uppercase)
lower_case_letters = list(string.ascii_lowercase)
digits = numbers + upper_case_letters + lower_case_letters

digits_dict = {}

for idx in range(len(digits)):
  digits_dict[idx+1] = digits[idx]

classes = np.array(list(digits_dict.items()))

print(classes.shape)

### Compressing

Both training and testing data are compressed into ‘train_data.npz’ and ‘test_data.npz’ respectively, using the “compressArray” function.

### Uncompressing

Later, the compressed files are uncompressed into NumPy arrays (for the data) and a dictionary (for the classes) using the “uncompressArray” function.

In [None]:
def compressArray(file_dir, images=None, labels=None, classes=None, train=True):
  # Save the arrays into a single compressed .npz archive
  with open(file_dir, 'wb') as file:
    if train:
      np.savez_compressed(file, x_train=images, y_train=labels, classes=classes)
    else:
      np.savez_compressed(file, x_test=images)

def uncompressArray(file_dir, train=True):
  # Load the arrays back; labels and classes exist only in the training archive
  key = 'x_train' if train else 'x_test'
  with open(file_dir, 'rb') as file:
    loaded_file = np.load(file)
    images = loaded_file[key].copy()
    labels, classes = None, None
    if train:
      labels = loaded_file['y_train'].copy()
      # classes is stored as an array of (class number, character) pairs
      classes = dict(loaded_file['classes'].copy())
  return images, labels, classes
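
The round trip through a compressed archive can be illustrated with a few stand-in arrays. The sketch below calls `np.savez_compressed` and `np.load` directly rather than the two helper functions, and the tiny arrays and the temporary path are made up for illustration. It also shows a side effect of storing the class table as an array of mixed pairs: the class numbers come back as strings.

```python
import os
import tempfile

import numpy as np

# Stand-in data: 4 blank images, their labels, and a small (class number, character) table
images = np.zeros((4, 128, 128, 3), dtype=np.uint8)
labels = np.array([1, 11, 37, 62])
classes = np.array([(1, '0'), (11, 'A'), (37, 'a'), (62, 'z')])

path = os.path.join(tempfile.mkdtemp(), 'train_data.npz')
np.savez_compressed(path, x_train=images, y_train=labels, classes=classes)

with np.load(path) as archive:
    restored_images = archive['x_train']
    restored_labels = archive['y_train']
    restored_classes = dict(archive['classes'])

print(restored_images.shape, restored_labels)
# The mixed (int, str) pairs were stored as a string array, so the keys are strings
print(restored_classes['11'])
```

If integer keys are preferred, the keys can be converted back with `int(...)` when the dictionary is rebuilt.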

## Packages importing

The first step in the training is to import the needed packages: some are needed for model building, some for training, and others for data processing.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation,\
                                    Conv2D, MaxPooling2D, Flatten

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, StratifiedKFold

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import cv2
import os

## Data preprocessing

The second important step is data preprocessing, which includes importing and processing the data to prepare it for the training stage.

### Data importing

The training data and labels are imported into np arrays. The data and labels are in matching order.

In [None]:
train_file = './data/train_data.npz'
X_train, Y, classes = uncompressArray(train_file)
X_train.shape, Y.shape

### Image resizing

Due to the limited computational resources, the images are resized from 128x128x3 to 32x32x3. Smaller images require less processing power and memory.

In [None]:
X = []
WIDTH, HEIGHT = 32, 32
for image in X_train:
  X.append(cv2.resize(image, (WIDTH,HEIGHT), interpolation=cv2.INTER_CUBIC))
X = np.asarray(X)
del X_train

### Data splitting

Next, the data is shuffled and split into train and test data (note that test here means the held-out labeled test data, not the unlabeled task evaluation data). 80% of the data is used for training and the remaining 20% for testing.

In [None]:
X, x_test, Y, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

### Data normalization

Up to this point, the pixel values in the images are in the range (0 – 255). Before going further, these values are normalized for the training and testing images, so that the new pixel value range is (0 – 1).
An extra processing step is performed here to reduce the computational cost: only one of the three available channels is extracted from each image. This is possible because the images are black and white, which means that the three channels are identical and one channel carries the same information as all three. Each image becomes of size 32x32 instead of 32x32x3 (notice that the last dimension is removed).

In [None]:
X = X[:,:,:,0]/255.
x_test = x_test[:,:,:,0]/255.
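
Dropping two channels is only lossless if the three channels really are identical. A minimal sketch of how this could be checked before normalizing is given below; the random stand-in batch replaces the real images and is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in batch: 5 grayscale images replicated over 3 channels
gray = rng.integers(0, 256, size=(5, 32, 32), dtype=np.uint8)
batch = np.stack([gray, gray, gray], axis=-1)

# Compare the first channel against the other two
channels_identical = (np.array_equal(batch[..., 0], batch[..., 1])
                      and np.array_equal(batch[..., 0], batch[..., 2]))
print(channels_identical)

if channels_identical:
    # Keep one channel and scale the pixel values to the range 0..1
    normalized = batch[..., 0] / 255.0
    print(normalized.shape, normalized.min() >= 0.0, normalized.max() <= 1.0)
```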

### Data reshaping

After the previous step, the lost dimension needs to be restored while keeping only one channel per image. This is done by reshaping the NumPy arrays.

In [None]:
def reshape_data(X):
  X = X.reshape(X.shape[0], X.shape[1], X.shape[2], 1)
  return X

X = reshape_data(X)
x_test = reshape_data(x_test)
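
As a side note, the same trailing channel axis can be added with NumPy indexing instead of an explicit reshape. The sketch below compares both forms on a stand-in array (the array `X_demo` is made up for illustration).

```python
import numpy as np

X_demo = np.zeros((5, 32, 32))

# Explicit reshape, as in reshape_data
reshaped = X_demo.reshape(X_demo.shape[0], X_demo.shape[1], X_demo.shape[2], 1)
# Equivalent: append a new axis of length 1
expanded = X_demo[..., np.newaxis]

print(reshaped.shape, expanded.shape)
```

Both results have the shape (5, 32, 32, 1), which matches the input shape (32, 32, 1) expected by the model for each image.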

### One hot encoding

In this step the labels are vectorized, i.e. converted to a one hot encoding representation. Since the labels start from ‘1’ and the encoded vector should have exactly as many entries as there are classes, ‘1’ is subtracted from the labels when encoding them. Note that at this stage only the test labels are processed; the training labels are processed inside the cross validation loop, for each fold separately.

In [None]:
y_test = to_categorical(y_test-1, dtype ="uint8")
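
To make the label shift concrete, the following NumPy-only sketch builds the same kind of one hot vectors by hand. It is an illustration of the encoding, not the Keras implementation, and the three example labels are made up.

```python
import numpy as np

num_classes = 62
labels = np.array([1, 11, 62])          # labels as stored, starting from 1

# Row i of the identity matrix is the one hot vector for index i
one_hot = np.eye(num_classes, dtype=np.uint8)[labels - 1]

print(one_hot.shape)
print(one_hot.argmax(axis=1) + 1)       # recovers the original labels
```

Without the subtraction, label 62 would need a 63rd column, and column 0 would never be used.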

## Model design

To build the classification model, a deep sequential model is designed. The design contains convolutional, activation, pooling, dropout and dense layers.

### Model architecture

The model consists of sequential layers. The input layer is a Conv layer with ReLU activation, which receives the image and passes it to another Conv layer with ReLU activation of the same size. The result is passed to a MaxPooling layer, after which 25% of the nodes are dropped. The output of this block is sent to a second block with the same structure but twice the number of filters (64 instead of 32). The last step is to make the prediction: the feature maps are flattened and passed to a Dense layer of 512 units with ReLU activation, where 50% of the nodes are dropped. Finally, a Dense layer whose size equals the number of classes produces the predictions, using SoftMax as the activation function.

### Model parameters

Overall, the model has 2,194,462 trainable parameters.
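
The total can be reproduced by hand from the layer sizes: 3x3 kernels, a 32x32x1 input, two pooling steps that reduce the feature maps to 8x8, a Dense layer of 512 units and 62 output classes. The helper names in the sketch below are hypothetical, and the layer names follow the usual Keras defaults.

```python
def conv_params(kernel, in_channels, filters):
    # kernel weights plus one bias per filter
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    # one weight per input and unit, plus one bias per unit
    return inputs * units + units

layers = {
    'conv2d':   conv_params(3, 1, 32),
    'conv2d_1': conv_params(3, 32, 32),
    'conv2d_2': conv_params(3, 32, 64),
    'conv2d_3': conv_params(3, 64, 64),
    'dense':    dense_params(8 * 8 * 64, 512),   # flattened 8x8x64 feature maps
    'dense_1':  dense_params(512, 62),
}
total = sum(layers.values())
print(layers)
print(total)
```

Pooling, dropout, flatten and activation layers have no parameters, so the two Dense layers account for almost all of the total.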

In [None]:
def classificationModel(input_shape, num_classes):
  model = Sequential()

  model.add(Conv2D(32, (3,3), padding='same', activation="relu", input_shape=input_shape))
  model.add(Conv2D(32, (3,3), padding='same', activation="relu"))
  model.add(MaxPooling2D(pool_size=(2,2)))
  model.add(Dropout(0.25))
  
  model.add(Conv2D(64, (3,3), padding='same', activation="relu"))
  model.add(Conv2D(64, (3,3), padding='same', activation="relu"))
  model.add(MaxPooling2D(pool_size=(2,2)))
  model.add(Dropout(0.25))
  
  model.add(Flatten())
  model.add(Dense(512))
  model.add(Activation('relu'))
  model.add(Dropout(0.5))
  model.add(Dense(num_classes))
  model.add(Activation('softmax'))

  return model

In [None]:
input_shape = (32,32,1)
no_classes = 62
Model = classificationModel(input_shape, no_classes)
Model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv2d (Conv2D)             (None, 32, 32, 32)        320       
                                                                 
 conv2d_1 (Conv2D)           (None, 32, 32, 32)        9248      
                                                                 
 max_pooling2d (MaxPooling2D  (None, 16, 16, 32)       0         
 )                                                               
                                                                 
 dropout (Dropout)           (None, 16, 16, 32)        0         
                                                                 
 conv2d_2 (Conv2D)           (None, 16, 16, 64)        18496     
                                                                 
 conv2d_3 (Conv2D)           (None, 16, 16, 64)        36928     
                                                        

## Model training

For the training step, the cross validation approach is performed, and the training parameters are optimized to achieve the best performance. During the training the best model is saved, and it is the one which is used to predict the (unlabeled) test data.

### Training parameters
a.	Batch size: 32
b.	Number of epochs: 20
c.	Input shape: (32, 32, 1)
d.	Loss function: Categorical Cross Entropy (CCE)
e.	Optimizer: Adam
f.	Metrics: Accuracy and Area Under Curve (AUC)
g.	Number of folds (for cross validation): 5

In [None]:
Batch_size = 32
no_epochs = 20
input_shape = (32,32,1)
no_classes = 62
Loss_function = 'Categorical Cross Entropy'
Optimizer = 'Adam'
metrics=['accuracy', 'auc']
no_folds = 5

### Cross validation

As mentioned in the training parameters, the number of folds is 5, which means that the training is performed on five different sets of training and validation data. The validation data of each fold is chosen by the fold function. After every fold, the resulting model is evaluated against the test data that was held out previously, and the best model is saved.
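
How a stratified fold function divides the data can be seen on a small stand-in label array. The sketch below uses 10 classes with 10 samples each instead of the real labels, which is an assumption made to keep the output readable.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-in labels: 10 classes with 10 samples each
y_demo = np.repeat(np.arange(1, 11), 10)
X_demo = np.zeros((len(y_demo), 1))

kfold = StratifiedKFold(n_splits=5, shuffle=False)
fold_summaries = []
for fold_no, (train_idx, val_idx) in enumerate(kfold.split(X_demo, y_demo), start=1):
    # Count how many validation samples each class contributes in this fold
    per_class = np.bincount(y_demo[val_idx], minlength=11)[1:]
    fold_summaries.append((len(train_idx), len(val_idx), per_class.tolist()))
    print(fold_no, len(train_idx), len(val_idx), per_class.tolist())
```

Each fold keeps the class proportions of the full data, so every class is represented in both the training and the validation part.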

In [None]:
# Model configuration
batch_size = 32
img_width, img_height, img_num_channels = X.shape[1:]
loss_function = 'categorical_crossentropy'
no_epochs = 20
optimizer = 'adam'
verbosity = 1
num_folds = 5
no_classes = 62

# Determine shape of the data
input_shape = (img_width, img_height, img_num_channels)

# Define per-fold score containers
acc_per_fold = []
loss_per_fold = []
auc_per_fold = []
best_model = None
best_fold = 0
best_acc = 0

# Define the K-fold Cross Validator
kfold = StratifiedKFold(n_splits=num_folds, shuffle=False)

# K-fold Cross Validation model evaluation
fold_no = 1
for train, test in kfold.split(X, Y):

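  fold_dir = os.path.dirname(f'./Folds/{fold_no}/')
  # Make sure the fold folder exists before files are written into it
  os.makedirs(fold_dir, exist_ok=True)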

  Y_train = to_categorical(Y[train]-1, dtype ="uint8")
  Y_test = to_categorical(Y[test]-1, dtype ="uint8")

  # Define the model architecture
  model = classificationModel(input_shape, no_classes)

  # Compile the model
  model.compile(loss=loss_function,
                optimizer=optimizer,
                metrics=['accuracy', tf.keras.metrics.AUC(name='auc', multi_label=True)])
  
  # Include the epoch in the file name (uses `str.format`)
  checkpoint_file = 'cp-{epoch:02d}.ckpt'
  checkpoint_path = os.path.join(fold_dir, checkpoint_file)

  # Create a callback that saves the model's weights every 5 epochs
  cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path, 
                                                    verbose=2, 
                                                    save_weights_only=True,
                                                    save_freq='epoch',
                                                    period=5)

  model.save_weights(checkpoint_path.format(epoch=0))

  # Generate a print
  print('------------------------------------------------------------------------')
  print(f'Training for fold {fold_no} ...')

  # Fit data to model
  history = model.fit(X[train], Y_train,
              batch_size=batch_size,
              epochs=no_epochs,
              verbose=verbosity, 
              callbacks=[cp_callback],
              validation_data=(X[test], Y_test))
  
  # save plots
  savePlots(history, fold_dir)
  # save history to csv: 
  hist_csv_file = f'history_{fold_no}.csv'
  hist_csv_path = os.path.join(fold_dir, hist_csv_file)
  # convert the history.history dict to a pandas DataFrame:     
  hist_df = pd.DataFrame(history.history)

  with open(hist_csv_path, mode='w') as f:
    hist_df.to_csv(f)

  scores = model.evaluate(x_test, y_test, verbose=0)
  current_acc = scores[1]*100
  current_loss = scores[0]
  current_auc = scores[2]

  print(f'Score for fold {fold_no}: {model.metrics_names[0]} of {current_loss}; {model.metrics_names[2]} of {current_auc}; {model.metrics_names[1]} of {current_acc}%')
  
  acc_per_fold.append(current_acc)
  loss_per_fold.append(current_loss)
  auc_per_fold.append(current_auc)

  if (fold_no == 1) or (current_acc > best_acc):
    best_fold = fold_no
    best_model = model
    best_acc = current_acc

  # Increase fold number
  fold_no = fold_no + 1

os.makedirs('./best_model', exist_ok=True)
best_model.save('./best_model/best_model.h5')
# acc_per_fold = acc_per_fold[1:]
# == Provide average scores ==
print('------------------------------------------------------------------------')
print('Score per fold')
for i in range(0, len(acc_per_fold)):
  print('------------------------------------------------------------------------')
  print(f'> Fold {i+1} - Loss: {loss_per_fold[i]} - Accuracy: {acc_per_fold[i]}% - AUC: {auc_per_fold[i]}')
print('------------------------------------------------------------------------')
print('Average scores for all folds:')
print(f'> Accuracy: {np.mean(acc_per_fold)} (+- {np.std(acc_per_fold)})')
print(f'> Loss: {np.mean(loss_per_fold)}')
print(f'> AUC: {np.mean(auc_per_fold)}')
print('------------------------------------------------------------------------')
print('Best fold of all folds:')
print(f'> Fold: {best_fold}')
print('------------------------------------------------------------------------')
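
The loop above calls `savePlots`, whose definition is not included in this report. The following is a hypothetical sketch of what such a helper could look like, assuming it only needs the `history.history` dictionary and a target folder. The output file names and the stand-in history values are made up for illustration.

```python
import os
import tempfile
from types import SimpleNamespace

import matplotlib
matplotlib.use('Agg')  # render to files without a display
import matplotlib.pyplot as plt

def savePlots(history, fold_dir):
    """Hypothetical sketch: save one training/validation curve per metric."""
    os.makedirs(fold_dir, exist_ok=True)
    saved = []
    for metric in ('loss', 'accuracy', 'auc'):
        if metric not in history.history:
            continue
        fig, ax = plt.subplots()
        ax.plot(history.history[metric], label='training')
        if f'val_{metric}' in history.history:
            ax.plot(history.history[f'val_{metric}'], label='validation')
        ax.set_xlabel('epoch')
        ax.set_ylabel(metric)
        ax.legend()
        path = os.path.join(fold_dir, f'{metric}.png')
        fig.savefig(path)
        plt.close(fig)
        saved.append(path)
    return saved

# Usage with a stand-in for the object returned by model.fit (dummy values)
fake_history = SimpleNamespace(history={
    'loss': [1.2, 0.6, 0.3], 'val_loss': [1.0, 0.7, 0.4],
    'accuracy': [0.6, 0.8, 0.9], 'val_accuracy': [0.65, 0.78, 0.88],
})
saved_files = savePlots(fake_history, tempfile.mkdtemp())
print(saved_files)
```

The metric names match the ones used in `model.compile` above, where the AUC metric is given the name `auc`.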

### Best Model

The best model achieved an accuracy of 92.77% on the training data and 91.97% on the testing data.

| Split | Accuracy | Loss | AUC |
| --- | --- | --- | --- |
| Training | 92.77% | 0.1789 | 0.9987 |
| Validation | 91.81% | 0.2202 | 0.9973 |
| Testing | 91.97% | 0.2207 | 0.9968 |

The training and validation curves show smooth learning behaviour and proper convergence. It also appears that the learning process started to stabilize after the 10th epoch. Higher accuracy might have been achieved with more epochs, but at an increased risk of overfitting.

<img src='https://drive.google.com/uc?id=12HnGtmbuCtOlaOMq2lE5ZrNJLNlnqm0i'>
<img src='https://drive.google.com/uc?id=12HPcTTe_zH6l-aKCJz3M8KNM9C4RGGL8'>
<img src='https://drive.google.com/uc?id=12H6oxNEIyLYDs0ogfF7xkHJi-xjPPnLM'>

## Test data prediction

The best model is used to predict the classes/labels of the provided (unlabeled) test data, and the results/predictions are prepared for submission.

### Data preprocessing

The same steps that are performed for the training data are performed here, i.e. importing, resizing, normalization and reshaping. Note that neither splitting nor one hot encoding is needed here (because there are no labels).

In [None]:
# Read test data
test_file = './data/test_data.npz'
X_test, _, _ = uncompressArray(test_file, train=False)

# Preprocess test data
# Resize
X_TEST = []
WIDTH, HEIGHT = 32, 32
for image in X_test:
  X_TEST.append(cv2.resize(image, (WIDTH,HEIGHT), interpolation=cv2.INTER_CUBIC))
X_TEST = np.asarray(X_TEST)
del X_test
# Normalize
X_TEST = X_TEST[:,:,:,0]/255. # extract only one channel
# Reshape
def reshape_data(X):
  X = X.reshape(X.shape[0], X.shape[1], X.shape[2], 1)
  return X
X_TEST = reshape_data(X_TEST)

### Model loading

The best model is loaded to predict the classes of the test data.

In [None]:
Best_model = tf.keras.models.load_model('./best_model/best_model.h5')

### Predictions generation

The predictions are generated as vectors of class probabilities (one value per class), then ArgMax is used to return the class/label. Here ‘1’ is added to the predicted label, because the model was trained on labels from which ‘1’ had been subtracted. The final predictions are collected in a list, then converted to a dictionary which contains the test image file names as keys and their corresponding predictions as values.

In [None]:
OneHot_predictions = Best_model.predict(X_TEST)

predictions = [np.argmax(prediction)+1 for prediction in OneHot_predictions] # Labels start from 1

pred_dict = {}
for idx, prediction in enumerate(predictions):
  key = f'Test{(idx+1):04d}.png'
  value = str(prediction)
  pred_dict[key] = value
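
The step from model output to class number, and from class number back to a character, can be sketched on a stand-in output array. The probability values below are made up for illustration, and the character string repeats the class order described in the dataset section (digits, capital letters, lowercase letters).

```python
import string

import numpy as np

# Stand-in for the model output: 3 probability-like vectors over 62 classes
probabilities = np.full((3, 62), 0.001)
probabilities[0, 0] = 0.9     # index 0  -> class 1
probabilities[1, 10] = 0.9    # index 10 -> class 11
probabilities[2, 61] = 0.9    # index 61 -> class 62

# ArgMax gives a zero-based index, so 1 is added to get the class number
predicted_classes = np.argmax(probabilities, axis=1) + 1

characters = string.digits + string.ascii_uppercase + string.ascii_lowercase
predicted_chars = [characters[c - 1] for c in predicted_classes]

print(predicted_classes.tolist(), predicted_chars)
```

Calling `np.argmax` with `axis=1` on the whole array gives the same result as the list comprehension over single predictions used above.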

### Creating submission file

Lastly, the submission file is created by writing the content of the predictions dictionary into a txt file, with each prediction on a separate row/line, in the following format:
*class;TestImage*.

In [None]:
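results_dir = os.path.dirname('./submission_results/')
# Make sure the output folder exists before the file is written
os.makedirs(results_dir, exist_ok=True)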
results_file = 'submission.txt'
results_path = os.path.join(results_dir, results_file)

with open(results_path, 'w') as f:
  f.write('class;TestImage\n')
  for key, value in pred_dict.items():
    line = f'{value};{key}\n'
    f.write(line)
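
Before submitting, the written file can be read back and checked against the expected format. The sketch below is one possible sanity check and not part of the original solution. The function name `check_submission` is hypothetical, and the demo file contains made-up rows.

```python
import os
import tempfile

def check_submission(path, expected_rows, num_classes=62):
    """Return True if the file has the header and only well-formed rows."""
    with open(path) as f:
        lines = f.read().splitlines()
    if not lines or lines[0] != 'class;TestImage':
        return False
    rows = lines[1:]
    if len(rows) != expected_rows:
        return False
    for row in rows:
        parts = row.split(';')
        # Each row must be "<class number>;<file name>"
        if len(parts) != 2 or not parts[0].isdigit():
            return False
        if not 1 <= int(parts[0]) <= num_classes:
            return False
    return True

# Usage on a tiny stand-in file
demo_path = os.path.join(tempfile.mkdtemp(), 'submission.txt')
with open(demo_path, 'w') as f:
    f.write('class;TestImage\n11;Test0001.png\n37;Test0002.png\n')

print(check_submission(demo_path, expected_rows=2))
```

For the real file, `expected_rows` would be the number of test images, i.e. 7100 according to the dataset description.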