## Problem Statement

You can download the train and test datasets from [here](https://drive.google.com/drive/folders/1F2PjpJ_u_iaD-Fs0wwcymRiVVLK34-Fu). The dataset has 4 classes. Labels are
provided for the training data; you have to submit labels for the test data. Feel free to use any machine
learning or deep learning technique.

### Imports

The following libraries are used in this notebook:
* [NumPy](http://www.numpy.org/) for matrix and numerical operations.
* [Pandas](https://pandas.pydata.org/) for data transformation and analysis.
* [Matplotlib](https://matplotlib.org/) and [Seaborn](https://seaborn.pydata.org/) for data visualization.
* [Scikit-learn](https://scikit-learn.org/stable/) for machine learning classifiers, data splitting and evaluation metrics.
* [Keras](https://keras.io/) for building deep learning based classifiers.

In [2]:
#standard utilities
import os
import pickle #to load pickle data
from collections import Counter

#data science and visualization libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#scikit-learn utilities
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (classification_report, accuracy_score, 
                             confusion_matrix)  

#deep learning library
import keras
from keras.models import Sequential, load_model
from keras.layers import (Conv2D, MaxPooling2D, 
                          Dense, Flatten, 
                          Dropout, BatchNormalization)
from keras.optimizers import Adam, RMSprop
from keras.preprocessing.image import ImageDataGenerator

%matplotlib inline


In [None]:
#Main directory path for all the files
PATH = r'..\CV_problem'

In [None]:
os.listdir(PATH)

### Data Loading and Visualization

The data is provided in pickle format, so the first step is to load the pickle file, which takes just two lines as shown in the following cell.

In [None]:
with open(f'{PATH}\\Data\\train_image.pkl', 'rb') as image_file:
    train_images = pickle.load(image_file)


In [None]:
print(f"Number of training samples: {len(train_images)}")

In [None]:
train_labels = np.array(np.load(f'{PATH}\\Data\\train_label.pkl', allow_pickle=True))

In [None]:
Counter(train_labels)

So, this dataset has 2000 training samples for each of the 4 classes (0, 2, 3 and 6), i.e. 8000 training images in total. As the classes are equally represented, the dataset is perfectly balanced and doesn't need any oversampling or undersampling.<br>
Now, let's have a look at some of the images from our training data. Before plotting, we will first convert the list `train_images` to a `numpy` array.

In [3]:
train_images = np.array(train_images)


In [None]:
print(f"Shape of train_images: {train_images.shape}")

We already know that 8000 is the number of training samples. From the shape of `train_images` we can conclude that each image is represented by a vector of length 784. So, it is very likely that each image was originally 28 x 28 pixels and has been flattened to a vector of length 28\*28 = 784. Let's check this inference by reshaping the samples to 28 x 28 and plotting them. Below is the function `plot_multiple_data()`, which plots `n_rows*n_columns` images in a single figure.
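
As a quick sanity check of the flattening argument, the sketch below reshapes a synthetic array (not the real dataset; `fake_images` is a made-up stand-in for `train_images`) and confirms that row `r` of the reshaped image is simply the slice `[r*28:(r+1)*28]` of the vector. It assumes the images were flattened in row-major order, which is NumPy's default.

```python
import numpy as np

# Synthetic stand-in for train_images: 5 flattened "images" of 784 values each.
# (Hypothetical data, used only to illustrate the reshape.)
fake_images = np.arange(5 * 784).reshape(5, 784)

side = int(np.sqrt(fake_images.shape[1]))  # 28 if the vector is a flattened square
assert side * side == fake_images.shape[1], "vector length is not a perfect square"

first_image = fake_images[0].reshape(side, side)

# Row-major flattening: row 3 of the image is the slice [3*28 : 4*28] of the vector
row_3 = fake_images[0][3 * side:4 * side]

print(first_image.shape)                      # (28, 28)
print(np.array_equal(first_image[3], row_3))  # True
```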

In [None]:
def plot_multiple_data(n_rows, n_columns, indices):
    '''
    Parameters-
        n_rows, n_columns: Number of rows and columns in the figure
        indices: List of indices for the images from the dataset
    '''
    #figure that will be displayed; figsize is (width, height)
    fig = plt.figure(figsize=(n_columns*2, n_rows*2))

    #Showing n_rows*n_columns images from the dataset specified by indices
    for i in range(n_rows*n_columns):
        plt.subplot(n_rows, n_columns, i + 1) #subplot positions start at 1
        plt.imshow(train_images[indices[i]].reshape(28, 28))
        plt.title(f'Label: {train_labels[indices[i]]}') #corresponding label to each of the image
    fig.tight_layout()  #for better padding amongst subplots
    plt.show()

In [None]:
random_indices = np.random.permutation(train_images.shape[0]) #generate random indices for plotting
plot_multiple_data(10, 10, random_indices)

After seeing the above subplots, we can say that:
* Label 0 is for Half sleeve T-shirts/Tops
* Label 2 is for Long sleeve T-shirts or Pullovers
* Label 3 is for Dress
* Label 6 is for Shirts

### Data Pre-processing <a id='preprocessing'></a>


In [None]:
#setting random seed so that every time we run random, we get the same result
np.random.seed(42)

#### Splitting the available training data into train and validation sets

Here, we'll use `scikit-learn`'s `train_test_split` to split the data in a 90:10 ratio: 90% of the data is for training and the remaining 10% for validation.

In [None]:
Train_x, Val_x, Train_y, Val_y = train_test_split(train_images, train_labels, test_size=0.1, random_state=42)

In [None]:
print(f"Shape of Training features: {Train_x.shape}")
print(f"Shape of Training labels: {Train_y.shape}")
print(f"Shape of Validation features: {Val_x.shape}")
print(f"Shape of Validation labels: {Val_y.shape}")  

#### Pre-processing for CNN classifier

Convolutional Neural Networks (CNNs) expect each image as a 2D array (plus a channel axis), whereas in our case each image is a vector of length 784.

##### Reshaping the Train_x and Val_x 

The following function, `reshape_vector`, takes the datasets with images in vector form and returns them with each image reshaped to 28 x 28 x 1. Since the images are grayscale, they have a single channel, so the 3rd dimension is equal to 1.

In [None]:
def reshape_vector(Train_x, Val_x):
    '''
    Parameters-
        Train_x, Val_x: Training and validation sets with images as vector array
    Returns-
        Reshaped Train_x and Val_x 
    '''
    return Train_x.reshape((-1, 28, 28, 1)), Val_x.reshape((-1, 28, 28, 1))

##### Normalizing by scaling down pixel values to the range [0, 1]

Pixel values range from 0 to 255. Scaling them down to the range 0 to 1 keeps the inputs small and on a common scale, which generally makes gradient-based training more stable.

In [None]:
def normalize(Train_x, Val_x): 
    '''
    Parameters-
        Train_x, Val_x: Reshaped Train_x and Val_x
    Returns-
        Normalized Train_x and Val_x
    
    '''
    return Train_x.astype("float32") / 255.0, Val_x.astype("float32") / 255.0

##### One-hot encoding Labels

Later in the notebook, the CNN classifier will not output a class directly; instead, it will output a probability for each class, so every output lies in the range [0, 1]. As our labels are 0, 2, 3 and 6, they have to be transformed into vectors of 0s and 1s with one position per class. This transformation is one-hot encoding. <br>
Here `pandas`' `get_dummies` method comes in handy, as it transforms the labels into the required encoded form.

In [None]:
def one_hot_encode(Train_y, Val_y):
    '''
    Parameters-
        Train_y, Val_y = Array of labels
    Returns-
        One-hot encoded labels
    '''
    return pd.get_dummies(Train_y), pd.get_dummies(Val_y)
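
One caveat with encoding the two label arrays separately is that the columns depend on which labels happen to be present in each array. With a balanced 800-sample validation set this is unlikely to matter, but a minimal NumPy sketch that encodes against a fixed class list avoids the question entirely. The helper name `one_hot_fixed` and the constant `CLASSES` are hypothetical and not part of the original notebook.

```python
import numpy as np

CLASSES = np.array([0, 2, 3, 6])  # label values present in the training data (sorted)

def one_hot_fixed(labels, classes=CLASSES):
    """One-hot encode labels against a fixed, sorted class list (hypothetical helper)."""
    labels = np.asarray(labels)
    # searchsorted maps each label value to its position in the sorted class list;
    # clip keeps out-of-range values indexable so the check below can reject them
    idx = np.clip(np.searchsorted(classes, labels), 0, len(classes) - 1)
    if not np.array_equal(classes[idx], labels):
        raise ValueError("found a label that is not in the class list")
    return np.eye(len(classes), dtype="float32")[idx]

encoded = one_hot_fixed([6, 0, 3])
print(encoded)
# [[0. 0. 0. 1.]
#  [1. 0. 0. 0.]
#  [0. 0. 1. 0.]]

# A batch that contains only one class still gets all four columns
print(one_hot_fixed([0, 0]).shape)  # (2, 4)
```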

### Classification 

Now that the pre-processing functions are defined, we can start trying machine learning algorithms and choose the one that performs best.

#### 1. Using K-Nearest Neighbors Classifier

Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the class that has the most representatives among its nearest neighbors. In scikit-learn's implementation, the number of nearest neighbors is set by the parameter `n_neighbors`. Here, we'll use `n_neighbors` = 3. <br>
The classifier is fitted on the training data by calling the `fit()` method on the classifier's instance.
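
To make the majority vote concrete, here is a minimal NumPy sketch of the idea for a single query point. It is not scikit-learn's implementation (which uses optimized neighbor search and its own tie-breaking); the function `knn_predict` and the toy data are made up for illustration, and Euclidean distance is assumed.

```python
import numpy as np
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Predict the label of one query point by majority vote (illustrative helper)."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_x - query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    return Counter(train_y[nearest].tolist()).most_common(1)[0][0]

# Toy 2-D data (made up), reusing two of the dataset's label values
toy_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
toy_y = np.array([0, 0, 0, 6, 6])

print(knn_predict(toy_x, toy_y, np.array([0.3, 0.3])))  # 0
print(knn_predict(toy_x, toy_y, np.array([4.8, 5.2])))  # 6
```

In the second query, the three nearest neighbours carry labels 6, 6 and 0, so the vote goes to 6 even though one neighbour disagrees.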

In [None]:
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(Train_x, Train_y) 

In [None]:
neigh_predictions = neigh.predict(Val_x)

In [None]:
neigh_accuracy = accuracy_score(Val_y, neigh_predictions)
print(f'Accuracy score of KNN Classifier: {neigh_accuracy}')

78% accuracy is reasonably good considering the simplicity of the algorithm and the minimal parameter tuning. Let's look at how many samples from the validation set were predicted correctly by plotting the confusion matrix. Again, the values of the confusion matrix are easy to get with `scikit-learn`, and it can then be plotted using `matplotlib` and `seaborn`.

In [None]:
def plot_confusion_matrix(y_true, y_pred, labels):
    '''
    Parameters-
        y_true: Array of true labels
        y_pred: Array of predicted labels
        labels: List of labels ([0, 2, 3, 6])
    '''
    #rows are true labels and columns are predicted labels, both ordered as in labels
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    
    ax = plt.subplot()
    #annot=True to annotate cells, fmt='g' to print the counts as plain numbers
    #passing the tick labels here keeps them aligned with the rows and columns of cm
    sns.heatmap(cm, annot=True, ax=ax, fmt='g', xticklabels=labels, yticklabels=labels)

    # labels and title
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix')

In [None]:
labels = [0, 2, 3, 6]
plot_confusion_matrix(Val_y, neigh_predictions, labels)

The following conclusions can be drawn:
* Out of 217 samples that have label 0, 185 have been predicted correctly. 
* There are 25 samples whose actual label is 0 but which were predicted as 6.
* Similarly, 30 samples with actual label 2 were predicted as 6.
* Out of 208 samples that have label 6, 42 were predicted as 0, 33 as 2 and 7 as 3.

So, most of the model's confusion involves label 6. Let's look at some more metrics by generating a classification report to get a clearer view of the results.
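
The counts quoted in the list are enough to work out the per-class recall (the fraction of a class's samples that were predicted correctly) for labels 0 and 6 by hand. The sketch below uses only those quoted numbers; the variable names are illustrative.

```python
# Counts quoted in the text for the KNN validation confusion matrix
label0_total, label0_correct = 217, 185
label6_total = 208
label6_misclassified = 42 + 33 + 7   # predicted as 0, 2 and 3 respectively

# Whatever was not misclassified sits on the diagonal of the confusion matrix
label6_correct = label6_total - label6_misclassified

recall_0 = label0_correct / label0_total
recall_6 = label6_correct / label6_total

print(label6_correct)                          # 126
print(f"Recall for label 0: {recall_0:.3f}")   # Recall for label 0: 0.853
print(f"Recall for label 6: {recall_6:.3f}")   # Recall for label 6: 0.606
```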

In [None]:
neigh_classification_report = classification_report(Val_y, neigh_predictions)
print(f'Classification report of KNN Classifier: \n{neigh_classification_report}')

Clearly, precision, recall and f1-score are all lowest for class label 6.

#### 2. Using Random Forest Classifier

The second classifier we'll use is the Random Forest Classifier. It is an ensemble that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.<br>
In the cell below, we set three parameters and keep the rest at their defaults. `criterion` is the function used to measure the quality of a split; here it is set to `entropy`, i.e. information gain, which is computed using logarithms. `max_depth` is the maximum depth of the decision trees in the forest. `n_estimators` is the number of trees to be used; more trees usually help, at the cost of longer training time.<br>
The same steps that were followed for the KNN classifier are followed here too.
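
To illustrate what the `entropy` criterion measures, the following sketch computes Shannon entropy and the information gain of a candidate split on a toy label array. It is a simplified illustration, not scikit-learn's internal code; the helper names and the toy labels are made up, and base-2 logarithms are assumed (a different base only rescales the values).

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# Toy labels (made up): a balanced four-class node, like the full training set
parent = np.array([0, 0, 2, 2, 3, 3, 6, 6])
left, right = parent[:4], parent[4:]   # a split that separates {0, 2} from {3, 6}

print(entropy(parent))                        # 2.0
print(information_gain(parent, left, right))  # 1.0
```

A balanced four-class node has the maximum entropy of 2 bits; the split above halves the number of candidate classes on each side, so it gains exactly 1 bit.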

In [None]:
forest = RandomForestClassifier(criterion='entropy', max_depth=50, n_estimators=100)
forest.fit(Train_x, Train_y)

In [None]:
forest_predictions = forest.predict(Val_x)

In [None]:
forest_accuracy = accuracy_score(Val_y, forest_predictions)
print(f'Accuracy score of Random Forest Classifier: {forest_accuracy}')

Random forest classifier is certainly more accurate than KNN classifier. Let's see if it is able to correctly predict label 6 this time.

In [None]:
labels = [0, 2, 3, 6]
plot_confusion_matrix(Val_y, forest_predictions, labels)

Well, this time too classifier performs most poorly for label 6, but there certainly is significant improvement, if we look at number of correctly predicted labels.

In [None]:
forest_classification_report = classification_report(Val_y, forest_predictions)
print(f'Classification report of Random Forest Classifier: \n{forest_classification_report}')

#### 3. Using Support Vector Classifier

Before moving on to CNNs, which usually outperform other algorithms on computer vision problems, we'll try one more popular classification algorithm: the Support Vector Machine classifier. The idea of an SVM classifier can be put simply: the algorithm finds a line or a hyperplane that separates the data into classes.<br>
Here, `scikit-learn`'s implementation, `SVC`, will be used to build the classifier. The specified parameters are: `C`, the penalty parameter of the error term; `kernel`, set to `poly` (a polynomial kernel), as the data is unstructured; and `gamma`, the kernel coefficient, set to `auto`, so the coefficient will be `1 / n_features`.
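
For reference, the polynomial kernel has the form `(gamma * <x, z> + coef0) ** degree`. The sketch below evaluates it by hand with `gamma = 1 / n_features`; it assumes `SVC`'s defaults of `degree=3` and `coef0=0.0`, since the notebook does not set them, and the two input vectors are made up for illustration.

```python
import numpy as np

def poly_kernel(x, z, gamma, degree=3, coef0=0.0):
    """Polynomial kernel: (gamma * <x, z> + coef0) ** degree."""
    return (gamma * np.dot(x, z) + coef0) ** degree

n_features = 784                 # length of each flattened image vector
gamma_auto = 1.0 / n_features    # what gamma='auto' resolves to

# Two made-up "images" with constant pixel values, for illustration only
x = np.full(n_features, 2.0)
z = np.full(n_features, 3.0)

# <x, z> = 784 * 6, so gamma * <x, z> is about 6 and the kernel value about 6**3 = 216
k_value = poly_kernel(x, z, gamma_auto)
print(round(k_value, 6))
```

Dividing by the number of features keeps the inner product on a scale that does not grow with the image size before it is raised to the third power.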

In [None]:
svc = SVC(C=10, kernel='poly', gamma='auto')
svc.fit(Train_x, Train_y)

In [None]:
svc_predictions = svc.predict(Val_x)

In [None]:
svc_accuracy = accuracy_score(Val_y, svc_predictions)
print(f'Accuracy score of Support Vector Classifier: {svc_accuracy}')

We are down on accuracy by about 1%. Let's look at confusion matrix and classification report.

In [None]:
labels = [0, 2, 3, 6]
plot_confusion_matrix(Val_y, svc_predictions, labels)

In [None]:
svc_classification_report = classification_report(Val_y, svc_predictions)
print(f'Classification report of Support Vector Classifier: \n{svc_classification_report}')

SVC performs about as well as the Random forest classifier (somewhat poorer). At this point, we've tried several of the most widely used machine learning algorithms and should move on to a more complex, deep learning algorithm: the CNN.

#### 4. Using Convolutional Neural Networks

CNNs are made up of neurons with learnable weights and biases. 
The main advantage of CNNs over plain ANNs and other machine learning algorithms is that they operate over volumes, so the spatial arrangement of the pixels is preserved. That is also the reason we have to reshape the image vectors into 2D form. <br>
Now, we'll call the pre-processing functions that were defined specifically for CNNs earlier in the [Data Pre-processing](#preprocessing) section.

In [None]:
Train_x, Val_x = reshape_vector(Train_x, Val_x)
Train_x, Val_x = normalize(Train_x, Val_x)
Train_y, Val_y = one_hot_encode(Train_y, Val_y)

After running the above, we're all set to build the CNN model. Here, we'll use `Keras`' [Sequential API](https://keras.io/models/sequential/), in which a model is a linear stack of layers added one after another.<br> We'll keep tuning the parameters until the results are reasonable. The most important thing to watch for is overfitting, so we'll monitor the validation accuracy of the models at each epoch.

##### Model 1

The first CNN model consists of the following layers:
* [Conv2D](https://keras.io/layers/convolutional/)
* [MaxPooling2D](https://keras.io/layers/pooling/)
* [Dropout](https://keras.io/layers/core/)
* [Flatten](https://keras.io/layers/core/)
* [Dense](https://keras.io/layers/core/)

In [None]:
model = Sequential()

#feature extraction part
model.add(Conv2D(filters = 32, kernel_size = (3, 3), input_shape = (28,28,1), activation = 'relu')) #output: 26 x 26 x 32 
model.add(Conv2D(filters = 32, kernel_size = (3, 3), activation = 'relu')) #output: 24 x 24 x 32
model.add(MaxPooling2D(pool_size = (2, 2))) #output: 12 x 12 x 32

model.add(Dropout(0.50)) #output: 12 x 12 x 32

model.add(Conv2D(filters = 64, kernel_size = (3, 3), activation = 'relu')) #output: 10 x 10 x 64
model.add(Conv2D(filters = 64, kernel_size = (3, 3), activation = 'relu')) #output: 8 x 8 x 64
model.add(MaxPooling2D(pool_size = (2, 2))) #output: 4 x 4 x 64

model.add(Dropout(0.50)) #output: 4 x 4 x 64
model.add(Conv2D(filters = 32, kernel_size = (3, 3), activation='relu')) #output: 2 x 2 x 32
model.add(MaxPooling2D(pool_size = (2, 2))) #output: 1 x 1 x 32

#classification part
model.add(Flatten()) #output: 32
model.add(Dense(units = 32, activation = "relu")) #output: 32
model.add(Dense(units = 4, activation = "softmax")) #output: 4 (probabilites for each class)

The summary of the whole model can be printed by calling the `summary()` method.

In [None]:
model.summary()
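
As a cross-check on the shape comments in the model definition and on the total reported by `summary()`, the sketch below traces Model 1 by hand. It assumes stride-1 convolutions without padding and non-overlapping 2 x 2 pooling, as in the cell above; the helper functions are hypothetical and written only for this illustration. Dropout and pooling layers have no trainable parameters.

```python
def conv_out(size, kernel=3, padding="valid"):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1

def pool_out(size, pool=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // pool

def conv_params(in_channels, filters, kernel=3):
    """Kernel weights plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(in_units, out_units):
    """Weight matrix plus one bias per output unit."""
    return in_units * out_units + out_units

# Model 1 feature extractor as (layer type, number of filters)
layers = [("conv", 32), ("conv", 32), ("pool", None),
          ("conv", 64), ("conv", 64), ("pool", None),
          ("conv", 32), ("pool", None)]

size, channels, total = 28, 1, 0
for kind, filters in layers:
    if kind == "conv":
        total += conv_params(channels, filters)
        size, channels = conv_out(size), filters
    else:
        size = pool_out(size)

flat = size * size * channels   # units after Flatten
total += dense_params(flat, 32) + dense_params(32, 4)

print(size, channels, flat)  # 1 32 32
print(total)                 # 84644
```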

So, in total we have 84,644 parameters to train. To train them, the compilation step requires a loss function, an optimizer for that loss function and metrics to evaluate the model.<br> As we are using the softmax function to get the outputs, we need a loss function that rewards a high probability for the true class. Cross-entropy loss is the standard choice for this. Several optimizers are available to minimize this loss; here, we'll use the Adam optimizer. The name "Adam" is derived from adaptive moment estimation, and its authors list the following advantages: 
* Straightforward to implement and computationally efficient.
* Invariant to diagonal rescale of the gradients.
* Well suited for problems that are large in terms of data and/or parameters.
* Appropriate for non-stationary objectives.
* Appropriate for problems with very noisy/or sparse gradients.
* Hyper-parameters have intuitive interpretation and typically require little tuning.<br>

The learning rate `lr` of Adam is set to 0.001, which is also the default value.
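
A single Adam update can be sketched in a few lines of NumPy. This is an illustration of the update rule as published by its authors, not Keras' implementation; the function `adam_step` is hypothetical, and the defaults for `beta1`, `beta2` and `eps` are the ones suggested in the Adam paper. The example also illustrates the rescaling property from the list above: the very first step has a size close to `lr` whether the gradient is 2 or 200.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameters and moment estimates."""
    m = beta1 * m + (1 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta0 = np.array([1.0])   # made-up starting parameter
zeros = np.zeros_like(theta0)

# First step with a small and a 100x larger gradient
small, _, _ = adam_step(theta0, np.array([2.0]), zeros, zeros, t=1)
large, _, _ = adam_step(theta0, np.array([200.0]), zeros, zeros, t=1)

print(np.round(theta0 - small, 6))  # step size is about lr = 0.001
print(np.round(theta0 - large, 6))  # same step size despite the larger gradient
```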


In [None]:
model.compile(loss = 'categorical_crossentropy', optimizer=Adam(lr=0.001), metrics =['accuracy'])
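
To show what the loss named in the compile step computes, here is a small NumPy sketch of softmax followed by categorical cross-entropy. It is an illustrative re-implementation rather than Keras' own code; the logits and targets are made up, and the clipping constant `eps` is an assumption added to avoid taking `log(0)`.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean over the batch of -sum(y_true * log(y_prob))."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=-1).mean())

# Made-up raw scores for two samples over the four classes
logits = np.array([[2.0, 0.5, 0.1, 0.1],
                   [0.1, 0.2, 0.1, 3.0]])
y_true = np.array([[1, 0, 0, 0],    # first sample belongs to the first class
                   [0, 0, 0, 1]])   # second sample belongs to the fourth class

probs = softmax(logits)
loss_good = categorical_crossentropy(y_true, probs)
loss_bad = categorical_crossentropy(y_true[::-1], probs)  # targets swapped

print(np.allclose(probs.sum(axis=1), 1.0))  # True: each row is a probability vector
print(loss_good < loss_bad)                 # True: confident correct outputs cost less
```

Only the probability assigned to the true class contributes to the loss, which is why pushing that probability towards 1 drives the loss towards 0.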

In [None]:
epochs = 25 #number of times training data will pass through model
batch_size=64 #number of training samples passed through model at a time
history = model.fit(Train_x, Train_y,
                     batch_size = batch_size,
                     epochs = epochs,
                     verbose = 2,
                     validation_data = (Val_x, Val_y))

Using a CNN gives a straight 4% increase in validation accuracy. <br>

In [None]:
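#newer Keras versions record accuracy under 'accuracy' instead of 'acc'
acc_key = 'acc' if 'acc' in history.history else 'accuracy'

train_loss = history.history['loss']
train_acc = history.history[acc_key]
val_loss = history.history['val_loss']
val_acc = history.history['val_' + acc_key]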

In [None]:
def plot_loss_acc(history, n_epochs):
    
    '''
    Parameters-
    history: Default Keras' callback which records training metrics
    n_epochs: Number of times data is passed through model during training
    '''
    
    #newer Keras versions record accuracy under 'accuracy' instead of 'acc'
    acc_key = 'acc' if 'acc' in history.history else 'accuracy'
    
    #list to keep accuracy and loss values obtained during training and validation
    history_record = []
    
    #during training
    history_record.append(history.history['loss'])
    history_record.append(history.history[acc_key])
    
    #during validation
    history_record.append(history.history['val_loss'])
    history_record.append(history.history['val_' + acc_key])
    
    fig = plt.figure(figsize=(8, 4))
    
    #plotting two subplots
    #first subplot is for loss values
    #second for accuracy values
    for i in range(1, 3):
        
        plt.subplot(1, 2, i)
        plt.plot(np.arange(n_epochs), history_record[i - 1], label = "Training")
        plt.plot(np.arange(n_epochs), history_record[i + 1], label = "Validation")
        
        #axis labels
        plt.xlabel('Epochs')
        if(i % 2 != 0):
            plt.ylabel('Loss function values')
        else:
            plt.ylabel('Accuracy values')
        plt.legend()
        
    fig.tight_layout() #for better padding amongst subplots
    plt.show()
    

In [None]:
plot_loss_acc(history, 25)

We can see a gradual decrease in the loss and an increase in accuracy for the training as well as the validation set. Also, the difference between training accuracy and validation accuracy is ~1%, so the model shows no clear sign of overfitting.<br>

Let's generate predictions and evaluate this model with metrics that we also used above for machine learning algorithms.

In [None]:
predictions = model.predict(Val_x)

At this time, a particular prediction will look like:

In [None]:
model.predict(Val_x[0].reshape(1, 28, 28, 1))

The output of the model is a softmax vector whose length equals the number of classes, so the predicted vector above holds the probability corresponding to each class. As the value at the 2nd position (index 1) is the maximum, the predicted class is the second class, i.e. 2. <br>

In the following cell, these predicted vectors are converted into the corresponding class labels.

In [None]:
#dictionary mapping from indices of output vector to corresponding class labels
prediction_dict = {0: 0, 1: 2, 2: 3, 3: 6}

#array to hold the predicted labels
CNN_predictions = np.zeros(len(predictions))

#loop through predicted vectors and adding predicted classes to CNN_predictions 
for i in range(len(predictions)):
    
    #get index of maximum element in vector
    arg_max = np.argmax(predictions[i])
    
    #add value of key: arg_max from prediction_dict to CNN_predictions
    CNN_predictions[i] = prediction_dict[arg_max]
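
The same index-to-label conversion can also be written without an explicit loop. The sketch below is an equivalent vectorized alternative, shown on made-up softmax outputs rather than the real `predictions`; `class_labels` and `example_probs` are illustrative names. Unlike an array created with `np.zeros`, the result keeps an integer dtype.

```python
import numpy as np

# Column i of the network output corresponds to class_labels[i]
class_labels = np.array([0, 2, 3, 6])

# Made-up softmax outputs for three samples (each row sums to 1)
example_probs = np.array([[0.10, 0.70, 0.15, 0.05],
                          [0.05, 0.05, 0.10, 0.80],
                          [0.60, 0.20, 0.10, 0.10]])

# argmax along axis 1 gives the winning column of every row at once;
# indexing class_labels with that array maps the columns to labels
predicted = class_labels[np.argmax(example_probs, axis=1)]
print(predicted)  # [2 6 0]
```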

As `Val_y` has also been converted to an array of one-hot vectors, we have to decode it back to its original form.

In [None]:
#get index of maximum element in vector
Orig_Val_y = Val_y.values.argmax(axis=1)

Orig_Val_y = np.array([prediction_dict[y] for y in Orig_Val_y])

In [None]:
Orig_Val_y[:5]

In [None]:
CNN_predictions[:5]

Now, `Orig_Val_y` and `CNN_predictions` are both in the required form.

In [None]:
labels = [0, 2, 3, 6]
plot_confusion_matrix(Orig_Val_y, CNN_predictions, labels)

Clearly, this model is performing better than the Random forest classifier and SVC, but it is still underperforming for class label 6.<br>

We'll save this model in HDF5 format for now so that if the later models do not outperform this one, we can consider it the final one.

In [None]:
model.save(f'{PATH}//Models//model-1.h5')

As the model is now saved to disk, it can be deleted from memory.

In [None]:
del model

##### Model 2

The configuration of this second model is kept almost the same as the first one. Two notable changes are that the `padding` parameter is set to `same` for the first four convolution layers and that the number of output units of the intermediate dense layer is increased to 64, which increases the number of parameters to 91,972 as shown in the summary of the model. The extra parameters increase the model's capacity, so it should be able to perform better. In addition, the training cell applies image augmentation through `ImageDataGenerator`.

In [None]:
model = Sequential()

model.add(Conv2D(filters = 32, kernel_size = (3, 3), input_shape = (28,28,1), padding = "same", activation = 'relu')) #output: 28 x 28 x 32 
model.add(Conv2D(filters = 32, kernel_size = (3, 3), padding = "same", activation = 'relu')) #output: 28 x 28 x 32
model.add(MaxPooling2D(pool_size = (2, 2))) #output: 14 x 14 x 32

model.add(Dropout(0.50)) #output: 14 x 14 x 32

model.add(Conv2D(filters = 64, kernel_size = (3, 3), padding = "same", activation = 'relu')) #output: 14 x 14 x 64
model.add(Conv2D(filters = 64, kernel_size = (3, 3), padding = "same", activation = 'relu')) #output: 14 x 14 x 64
model.add(MaxPooling2D(pool_size = (2, 2))) #output: 7 x 7 x 64

model.add(Dropout(0.50)) #output: 7 x 7 x 64

model.add(Conv2D(filters = 32, kernel_size = (3, 3), activation='relu')) #output: 5 x 5 x 32
model.add(MaxPooling2D(pool_size = (2, 2))) #output: 2 x 2 x 32

model.add(Flatten()) #output: 128
model.add(Dense(units = 64, activation = "relu")) #output: 64
model.add(Dense(units = 4, activation = "softmax")) #output: 4

In [None]:
model.summary()

In [None]:
model.compile(loss = 'categorical_crossentropy', optimizer= Adam(lr=0.001), metrics =['accuracy'])

In [None]:
epochs = 40 #number of times training data will pass through model
batch_size=64 #number of training samples passed through model at a time

#Image augmenter
aug = ImageDataGenerator(rotation_range=20, zoom_range=(0.9, 1.1),
                         width_shift_range=0.1, height_shift_range=0.1, shear_range=0.5,
                         horizontal_flip=True, fill_mode="nearest")

#fitting the model
history = model.fit_generator(aug.flow(Train_x, Train_y, batch_size = batch_size),
                              validation_data=(Val_x, Val_y), epochs = epochs)

In [None]:
plot_loss_acc(history, 40)

Final validation accuracy for this model is 87.12% which is better than the previous model. Let's just repeat the steps we performed for generating predictions and confusion matrix for the previous model.

In [None]:
predictions = model.predict(Val_x)

In [None]:
#dictionary mapping from indices of output vector to corresponding class labels
prediction_dict = {0: 0, 1: 2, 2: 3, 3: 6}

#array to hold the predicted labels
CNN_predictions = np.zeros(len(predictions))

#loop through predicted vectors and adding predicted classes to CNN_predictions 
for i in range(len(predictions)):
    
    #get index of maximum element in vector
    arg_max = np.argmax(predictions[i])
    
    #add value of key: arg_max from prediction_dict to CNN_predictions
    CNN_predictions[i] = prediction_dict[arg_max]

In [None]:
Orig_Val_y = Val_y.values.argmax(axis=1)
Orig_Val_y = np.array([prediction_dict[y] for y in Orig_Val_y])

In [None]:
labels = [0, 2, 3, 6]
plot_confusion_matrix(Orig_Val_y, CNN_predictions, labels)

In [None]:
print(f"Classification report for CNN classifier: \n{classification_report(Orig_Val_y, CNN_predictions)}")

This model has clearly outperformed all previous classifiers by a good margin. In particular, the scores for class label 6 in the classification report have increased by almost 10%. <br>

Let's now save the model.

In [None]:
model.save(f'{PATH}//Models//model-2.h5')

Model 2 has shown a good level of performance.
So, let's train Model 2 on the training data (without augmentation this time) for 10 more epochs.

In [None]:
history = model.fit(Train_x, Train_y, batch_size = batch_size,
                    validation_data=(Val_x, Val_y), epochs = 10)

Well, we have an increase in accuracy of about 1%, so the 447 seconds it took to train this model were not in vain. We'll call this model Model 2.2, as it is just a further-trained version of Model 2. Let's reuse the above cells for prediction and evaluation.

In [None]:
predictions = model.predict(Val_x)

In [None]:
#dictionary mapping from indices of output vector to corresponding class labels
prediction_dict = {0: 0, 1: 2, 2: 3, 3: 6}

#array to hold the predicted labels
CNN_predictions = np.zeros(len(predictions))

#loop through predicted vectors and adding predicted classes to CNN_predictions 
for i in range(len(predictions)):
    
    #get index of maximum element in vector
    arg_max = np.argmax(predictions[i])
    
    #add value of key: arg_max from prediction_dict to CNN_predictions
    CNN_predictions[i] = prediction_dict[arg_max]

In [None]:
Orig_Val_y = Val_y.values.argmax(axis=1)
Orig_Val_y = np.array([prediction_dict[y] for y in Orig_Val_y])

In [None]:
labels = [0, 2, 3, 6]
plot_confusion_matrix(Orig_Val_y, CNN_predictions, labels)

In [None]:
print(f"Classification report for CNN classifier: \n{classification_report(Orig_Val_y, CNN_predictions)}")

In [None]:
model.save(f'{PATH}//Models//model-2.2.h5')

At this point, we are done with building the classifiers, and our final chosen model is Model 2.2, as it generalizes and predicts better than the others.

In [None]:
del model

### Generating prediction for Test set using Model 2.2

#### Test data loading and pre-processing

In [None]:
with open(f'{PATH}//Data//test_image.pkl', 'rb') as image_file:
    test_images = pickle.load(image_file)

In [None]:
print(f'Number of test samples: {len(test_images)}')

In [None]:
#converting the images to numpy array
test_images = np.array(test_images)

In [None]:
print(f'Shape of test images: {test_images.shape}')

In [None]:
#reshaping the image vectors to 2D arrays
test_images = test_images.reshape((-1, 28, 28, 1))
print(f'Shape of test images: {test_images.shape}')

In [None]:
#normalization
test_images = test_images.astype("float32") / 255.0

#### Loading trained model 

In [None]:
model = load_model(f'{PATH}//Models//model-2.2.h5')

#### Generating predictions and converting them to class labels

In [None]:
test_predictions = model.predict(test_images)

In [None]:
#dictionary mapping from indices of output vector to corresponding class labels
prediction_dict = {0: 0, 1: 2, 2: 3, 3: 6}

#array to hold the predicted labels for test set
CNN_test_predictions = np.zeros(len(test_predictions))

#loop through predicted vectors and add predicted classes to CNN_test_predictions 
for i in range(len(test_predictions)):
    
    #get index of maximum element in vector
    arg_max = np.argmax(test_predictions[i])
    
    #add value of key: arg_max from prediction_dict to CNN_test_predictions
    CNN_test_predictions[i] = prediction_dict[arg_max]

In [None]:
CNN_test_predictions = CNN_test_predictions.astype(int)

#### Plotting the test images alongside their predicted labels


In [None]:
#reloading the test_images
with open(f'{PATH}//Data//test_image.pkl', 'rb') as image_file:
    test_images = np.array(pickle.load(image_file))
    
fig = plt.figure(figsize=(20, 20))

indices = np.random.permutation(test_images.shape[0]) #generating random indices for plotting

#Showing the first 100 images from the dataset specified by indices
for i in range(100):
    plt.subplot(10, 10, i + 1) #subplot positions start at 1
    plt.imshow(test_images[indices[i]].reshape(28, 28))
    plt.title(f'Predicted Label: {CNN_test_predictions[indices[i]]}') #corresponding predicted label to each of the image

fig.tight_layout()  #for better padding amongst subplots
plt.show()

#### Storing the prediction results in a CSV file

This is the required format of submission:

In [None]:
sample_df = pd.read_csv(f'{PATH}\\Data\\hitkul(sample_submission).csv')

In [None]:
sample_df

Submission:

In [None]:
submission_df = pd.DataFrame({'image_index': np.arange(len(test_images)), 'class': CNN_test_predictions})

In [None]:
submission_df.head(10)

In [None]:
submission_df.to_csv(f'{PATH}\\akshay_aggarwal.csv', index=False)

In [None]:
df = pd.read_csv(f'{PATH}\\akshay_aggarwal.csv')

In [None]:
df.head()