## Impact of data preprocessing methods on the accuracy of selected classifiers

#### Preprocessing methods: 
- HOG transformation
- data deskewing
- dataset shuffling

#### Classifier types:
- SVM
- Random Forest Classifier
- Neural Network

### Imports and utility function definitions

In [None]:
import cv2
import numpy as np
import matplotlib.pyplot as plt

from typing import Tuple, List

In [None]:
train_batch = 4000
test_batch = 1000

In [None]:
def openCVHOG(img: np.array) -> np.array:
    """
    Function to transform single image in numpy array format
    into its HOG descriptor.

    Args:
      img : np.array
        single image
    Returns:
      np.array
        HOG descriptor for input image
    """
    winSize = (20,20)
    blockSize = (10,10)
    blockStride = (5,5)
    cellSize = (10,10)
    nbins = 9
    derivAperture = 1
    winSigma = -1.
    histogramNormType = 0
    L2HysThreshold = 0.2
    gammaCorrection = 1
    nlevels = 64
    signedGradients = True

    hog = cv2.HOGDescriptor(winSize, blockSize, blockStride, cellSize, 
                            nbins, derivAperture, winSigma, 
                            histogramNormType, L2HysThreshold, 
                            gammaCorrection, nlevels, signedGradients)
    descriptor = np.ravel(hog.compute(img))
    
    return descriptor

def HOG_transformation(X_train: np.array, 
                       X_test: np.array) -> Tuple[np.ndarray, np.ndarray]:
    """
    Function to transform images in dataset using HOG transformation.
    Each image becomes an 81-element float32 descriptor.
    """
    HOG_train = [openCVHOG(image) for image in X_train]
    HOG_train_reshaped = np.float32(HOG_train).reshape(-1,81)

    HOG_test = [openCVHOG(image) for image in X_test]
    HOG_test_reshaped = np.float32(HOG_test).reshape(-1,81)

    return HOG_train_reshaped, HOG_test_reshaped
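
The value 81 used in `reshape(-1, 81)` follows from the HOG parameters set in `openCVHOG`. The sketch below reproduces that arithmetic in plain Python. `hog_descriptor_length` is a hypothetical helper (it is not part of OpenCV), and it assumes a single detection window per image.

```python
def hog_descriptor_length(win_size, block_size, block_stride, cell_size, nbins):
    """Length of a HOG descriptor for one detection window (illustrative helper)."""
    # number of block positions along each axis of the window
    blocks_x = (win_size[0] - block_size[0]) // block_stride[0] + 1
    blocks_y = (win_size[1] - block_size[1]) // block_stride[1] + 1
    # number of cells contained in one block
    cells_per_block = (block_size[0] // cell_size[0]) * (block_size[1] // cell_size[1])
    # every cell contributes one histogram with nbins entries
    return blocks_x * blocks_y * cells_per_block * nbins


# parameters copied from openCVHOG: winSize, blockSize, blockStride, cellSize, nbins
length = hog_descriptor_length((20, 20), (10, 10), (5, 5), (10, 10), 9)
print(length)  # 81
```

With these settings there are 3 x 3 block positions, one cell per block and 9 bins per cell, which gives 81 values.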

In [None]:
SZ = 28  # side length of an MNIST image in pixels
affine_flags = cv2.WARP_INVERSE_MAP | cv2.INTER_LINEAR

def deskew(img: np.array) -> np.array:
    """
    Function to deskew single image in numpy array format.

    Args:
      img : np.array
        single image
    Returns:
      np.array
        deskewed input image
    """
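    m = cv2.moments(img)
    # almost no vertical spread: skew is undefined, return the image unchanged
    if abs(m['mu02']) < 1e-2:
        return img.copy()
    # skew estimated from the second-order central moments
    skew = m['mu11'] / m['mu02']
    # horizontal shear that undoes the skew (applied as an inverse map)
    M = np.float32([[1, skew, -0.5 * SZ * skew], [0, 1, 0]])
    img = cv2.warpAffine(img, M, (SZ, SZ), flags=affine_flags)
    return img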

def deskew_transformation(X_train: np.array, 
                          X_test: np.array) -> Tuple[np.ndarray, np.ndarray]:
    """
    Function to deskew every image in the train and test sets.
    """
    X_train_deskewed = np.asarray([deskew(im) for im in X_train])
    X_test_deskewed = np.asarray([deskew(im) for im in X_test])

    return X_train_deskewed, X_test_deskewed

In [None]:
from sklearn.utils import shuffle

def shuffle_transformation(X_train: np.array, y_train: np.array,
                           X_test: np.array, y_test: np.array
                           ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Function to shuffle the train and test sets; images and labels
    are permuted together so each image keeps its label.
    """
    X_train, y_train = shuffle(X_train, y_train)
    X_test, y_test = shuffle(X_test, y_test)
    return X_train, y_train, X_test, y_test

In [None]:
def flatten_transformation(X_train: np.array,
                           X_test: np.array) -> Tuple[np.ndarray, np.ndarray]:
    """
    Function to flatten each image in the dataset into a 1D vector.
    """
    train_samples_number, nx, ny = X_train.shape
    X_train = X_train.reshape((train_samples_number,nx*ny))

    test_samples_number, nx, ny = X_test.shape
    X_test = X_test.reshape((test_samples_number,nx*ny))

    return X_train, X_test

In [None]:
from keras.utils import to_categorical

def to_nn_format_transformation(X_train: np.array, y_train: np.array,
                                X_test: np.array, y_test: np.array
                                ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Function to transform images and labels in dataset
    into format readable for neural network.
    """
    # flatten 3D image stacks (samples, rows, cols) into 2D (samples, features)
    if len(X_train.shape) > 2:
        _, im_length, im_height = X_train.shape

        X_train = X_train.reshape((X_train.shape[0], im_length * im_height))
        X_test = X_test.reshape((X_test.shape[0], im_length * im_height))
    
    X_train = X_train.astype('float32') / 255
    X_test = X_test.astype('float32') / 255

    y_train = to_categorical(y_train)
    y_test = to_categorical(y_test)

    return X_train, y_train, X_test, y_test
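
The label encoding step can be illustrated without Keras. The sketch below is a plain-Python stand-in for one-hot encoding. `one_hot` is a hypothetical helper written for this illustration. Like `to_categorical` when no class count is given, it infers the number of classes from the largest label it sees, so encoding two label sets separately can produce rows of different widths.

```python
def one_hot(labels, num_classes=None):
    """One-hot encode integer labels (illustrative stand-in for to_categorical)."""
    if num_classes is None:
        # infer the class count from the largest label present
        num_classes = max(labels) + 1
    return [[1.0 if cls == label else 0.0 for cls in range(num_classes)]
            for label in labels]


train_encoded = one_hot([0, 3, 9])                 # largest label 9 -> 10 columns
test_encoded = one_hot([1, 2, 7])                  # largest label 7 -> 8 columns
test_fixed = one_hot([1, 2, 7], num_classes=10)    # explicit class count

print(len(train_encoded[0]), len(test_encoded[0]), len(test_fixed[0]))  # 10 8 10
```

Passing the class count explicitly keeps the encoded train and test labels the same width even when a subset happens to lack the highest digit.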

In [None]:
from keras.datasets import mnist

def load_mnist_dataset() -> Tuple[Tuple[np.ndarray, np.ndarray], Tuple[np.ndarray, np.ndarray]]:
    """
    Function to load the MNIST dataset and cast it to uint8,
    the format required by the later processing steps.
    """
    (train_images, train_labels), (test_images, test_labels) = mnist.load_data()
    train_images, test_images = train_images.astype('uint8'), test_images.astype('uint8')
    train_labels, test_labels = train_labels.astype('uint8'), test_labels.astype('uint8')
    return (train_images, train_labels), (test_images, test_labels)

### Selected models and available parameters

#### Support Vector Machines

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

parameters_svc = {
    'kernel' : ('linear', 'rbf', 'poly'),
    'C' : [0.1, 0.5, 1, 5, 10],
    'gamma' : ['auto', 'scale', 0.1, 0.5, 1, 3]
}

In [None]:
def calc_best_acc_for_svc(train_set: np.array, train_labels: np.array, 
                          test_set: np.array, test_labels: np.array, 
                          print_report: bool = False) -> dict:
    """
    Function to calculate the best accuracy for svc model
    with the use of globally defined parameters_svc.
    Returns dict with the best parameters and their test accuracy.
    """
    # cross validation with grid search
    gcv_svc = GridSearchCV(SVC(), parameters_svc, scoring='accuracy')
    gcv_svc.fit(train_set[:train_batch], train_labels[:train_batch])

    # evaluation
    predict_labels = gcv_svc.predict(test_set[:test_batch])

    # report
    if print_report:
        print(classification_report(test_labels[:test_batch], predict_labels))
    
    # summary
    accuracy = accuracy_score(test_labels[:test_batch], predict_labels)
    best_params = gcv_svc.best_params_
    
    return {'accuracy': accuracy, 'best_params': best_params}

#### Random Forest Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier

parameters_rfc = {
    'n_estimators' : [50, 100, 200],
    'max_depth' : [None, 50, 200],
    'min_samples_split' : [2, 5, 10]
}

In [None]:
def calc_best_acc_for_rfc(train_set: np.array, train_labels: np.array, 
                          test_set: np.array, test_labels: np.array, 
                          print_report: bool = False) -> dict:
    """
    Function to calculate the best accuracy for random forest classifier model
    with the use of globally defined parameters_rfc.
    Returns dict with the best parameters and their test accuracy.
    """
    
    # cross validation with grid search
    gcv_rfc = GridSearchCV(RandomForestClassifier(), parameters_rfc, scoring='accuracy')
    gcv_rfc.fit(train_set[:train_batch], train_labels[:train_batch])

    # evaluation
    predict_labels = gcv_rfc.predict(test_set[:test_batch])

    # report
    if print_report:
        print(classification_report(test_labels[:test_batch], predict_labels))
    
    # summary
    accuracy = accuracy_score(test_labels[:test_batch], predict_labels)
    best_params = gcv_rfc.best_params_
    
    return {'accuracy': accuracy, 'best_params': best_params}

#### Neural network

In [None]:
# note: recent Keras releases no longer ship this wrapper module
from keras.wrappers.scikit_learn import KerasClassifier

parameters_nn = {
    'batch_size' : [10, 50, 100],
    'epochs' : [10, 50, 100]
}
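
The cost of each grid search depends on the number of parameter combinations. The sketch below counts them for the three grids defined in this notebook. `count_combinations` is a hypothetical helper, and `cv_folds = 5` is an assumption: it matches the default of recent scikit-learn versions, since `GridSearchCV` is called here without an explicit `cv` argument.

```python
from itertools import product


def count_combinations(grid):
    """Number of parameter combinations in a grid-search dictionary."""
    return len(list(product(*grid.values())))


# grids copied from the notebook
grids = {
    'SVC': {
        'kernel': ('linear', 'rbf', 'poly'),
        'C': [0.1, 0.5, 1, 5, 10],
        'gamma': ['auto', 'scale', 0.1, 0.5, 1, 3],
    },
    'RFC': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 50, 200],
        'min_samples_split': [2, 5, 10],
    },
    'NN': {
        'batch_size': [10, 50, 100],
        'epochs': [10, 50, 100],
    },
}

cv_folds = 5  # assumption: default number of folds, not set explicitly in the notebook

fits = {name: count_combinations(grid) * cv_folds for name, grid in grids.items()}
print(fits)  # {'SVC': 450, 'RFC': 135, 'NN': 45}
```

Under that assumption the SVC search alone trains 450 models, which is one reason the notebook limits training to the first `train_batch` samples.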

In [None]:
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import confusion_matrix
from functools import reduce

def create_model(shape_tuple: Tuple[int, ...]) -> Sequential:
    """
    Function to create a sequential neural network model: one hidden
    dense layer (512 units, ReLU) and a 10-class softmax output.
    """
    size = reduce(lambda x, y: x * y, shape_tuple, 1)
    model = Sequential()
    model.add(Dense(512, activation='relu', input_shape=(size,)))
    model.add(Dense(10, activation='softmax'))
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

def calc_best_acc_for_nn(train_set: np.array, train_labels: np.array, 
                          test_set: np.array, test_labels: np.array, 
                          print_report: bool = False) -> dict:
    """
    Function to calculate the best accuracy for neural network
    with the use of globally defined parameters_nn.
    Returns dict with the best parameters and their test accuracy.
    """
    (encoded_train_set, encoded_train_labels,
     encoded_test_set, encoded_test_labels) = to_nn_format_transformation(
        train_set, train_labels, test_set, test_labels)
    record_shape = encoded_train_set[0].shape

    # cross validation with grid search on NN
    neural_network = KerasClassifier(build_fn=create_model, shape_tuple=record_shape, verbose=0)
    gcv_nn = GridSearchCV(neural_network, parameters_nn)

    gcv_nn.fit(encoded_train_set[:train_batch, :], encoded_train_labels[:train_batch, :])

    # evaluation
    predict_labels = gcv_nn.predict(encoded_test_set[:test_batch, :])
    
    # report
    if print_report:
        print(f"Best parameters: {gcv_nn.best_params_}")
        print(f"Best score: {gcv_nn.best_score_}")

        # confusion matrix (rows: true digits, columns: predicted digits)
        labels = list(range(10))
        cm = confusion_matrix(test_labels[:test_batch], predict_labels, labels=labels)
        print(cm)

        print(classification_report(test_labels[:test_batch], predict_labels))
    
    # summary
    accuracy = accuracy_score(test_labels[:test_batch], predict_labels)
    best_params = gcv_nn.best_params_
    
    return {'accuracy': accuracy, 'best_params': best_params}

#### Function to summarize results from SVM, RFC and NN

In [None]:
from prettytable import PrettyTable

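def generate_summary_table(train_set, train_labels, test_set, test_labels, svc_list, rfc_list, nn_list):
    """
    Function to train and evaluate SVC, RFC and NN on the given data,
    append each accuracy to the matching list and return a summary table.
    """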
    svc_results = calc_best_acc_for_svc(train_set, train_labels, test_set, test_labels)
    rfc_results = calc_best_acc_for_rfc(train_set, train_labels, test_set, test_labels)
    nn_results = calc_best_acc_for_nn(train_set, train_labels, test_set, test_labels)

    svc_list.append(svc_results['accuracy'])
    rfc_list.append(rfc_results['accuracy'])
    nn_list.append(nn_results['accuracy'])
    
    summary_table = PrettyTable()

    summary_table.field_names = ["Method", "Accuracy", "Best parameters"]
    summary_table.add_row(["SVC", svc_results['accuracy'], svc_results['best_params']])
    summary_table.add_row(["RFC", rfc_results['accuracy'], rfc_results['best_params']])
    summary_table.add_row(["NN", nn_results['accuracy'], nn_results['best_params']])

    return summary_table

In [None]:
# to summarize overall performance of trained models
svc_accuracies = []
rfc_accuracies = []
nn_accuracies = []

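def generate_overall_summary(svc_list, rfc_list, nn_list):
    """
    Function to build the overall table; expects each accuracy list to hold
    six entries, in the order the exercise cells were run.
    """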
    summary_table = PrettyTable()
    
    summary_table.field_names = ["Case", "SVC", "RFC", "NN"]

    summary_table.add_row(["Exercise 1", "", "", ""])
    summary_table.add_row(["Deskewed & HOG", svc_list[0], rfc_list[0], nn_list[0]])
    summary_table.add_row(["Only HOG", svc_list[1], rfc_list[1], nn_list[1]])

    summary_table.add_row(["Exercise 2", "", "", ""])
    summary_table.add_row(["HOG", svc_list[2], rfc_list[2], nn_list[2]])
    summary_table.add_row(["Reshape to 1D", svc_list[3], rfc_list[3], nn_list[3]])

    summary_table.add_row(["Exercise 3", "", "", ""])
    summary_table.add_row(["Without shuffling", svc_list[4], rfc_list[4], nn_list[4]])
    summary_table.add_row(["With shuffling", svc_list[5], rfc_list[5], nn_list[5]])

    return summary_table

### Exercise 1

#### Task

Compare performance of selected models (SVM, Random Forest Classifier, Neural Network) using:

a) deskewed images 

b) non-deskewed images

#### Solution

In [None]:
# a
(train_images, train_labels), (test_images, test_labels) = load_mnist_dataset()

train_images, test_images = deskew_transformation(train_images, test_images)
train_images, test_images = HOG_transformation(train_images, test_images)

print(generate_summary_table(train_images, train_labels, test_images, test_labels, svc_accuracies, rfc_accuracies, nn_accuracies))

+--------+----------+-----------------------------------------------------------------+
| Method | Accuracy |                         Best parameters                         |
+--------+----------+-----------------------------------------------------------------+
|  SVC   |  0.964   |           {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}           |
|  RFC   |   0.94   | {'max_depth': 200, 'min_samples_split': 2, 'n_estimators': 100} |
|   NN   |  0.921   |                {'batch_size': 10, 'epochs': 100}                |
+--------+----------+-----------------------------------------------------------------+




In [None]:
# b
(train_images, train_labels), (test_images, test_labels) = load_mnist_dataset()

train_images, test_images = HOG_transformation(train_images, test_images)

print(generate_summary_table(train_images, train_labels, test_images, test_labels, svc_accuracies, rfc_accuracies, nn_accuracies))

+--------+----------+------------------------------------------------------------------+
| Method | Accuracy |                         Best parameters                          |
+--------+----------+------------------------------------------------------------------+
|  SVC   |  0.959   |             {'C': 10, 'gamma': 0.5, 'kernel': 'rbf'}             |
|  RFC   |  0.924   | {'max_depth': None, 'min_samples_split': 5, 'n_estimators': 200} |
|   NN   |  0.912   |                {'batch_size': 10, 'epochs': 100}                 |
+--------+----------+------------------------------------------------------------------+




#### Observations
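Deskewing improved accuracy for all three models: by 0.5 percentage points for SVM (0.959 to 0.964), 1.6 for the Random Forest Classifier (0.924 to 0.940) and 0.9 for the neural network (0.912 to 0.921). HOG alone therefore gave slightly worse results than HOG applied to deskewed images. A likely explanation is that deskewing removes slant-related ambiguity from the image set (e.g. a slanted 1 can be mistaken for an upright 7). No random seed is fixed, and the same pipeline rerun in Exercise 2a yields 0.945 (RFC) and 0.94 (NN), so differences of this size should be read with some caution.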
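
To make the skew measure concrete, the sketch below computes the same ratio of central moments that `deskew` reads from `cv2.moments` (`mu11 / mu02`), here in plain Python on tiny synthetic strokes. `skew_from_moments` is a hypothetical helper written for this illustration, and the 7x7 images are made up.

```python
def skew_from_moments(img):
    """Return mu11 / mu02 for a 2D list of pixel intensities."""
    total = sum(sum(row) for row in img)
    if total == 0:
        return 0.0
    # intensity-weighted centroid
    x_mean = sum(x * v for row in img for x, v in enumerate(row)) / total
    y_mean = sum(y * v for y, row in enumerate(img) for v in row) / total
    # second-order central moments
    mu11 = sum((x - x_mean) * (y - y_mean) * v
               for y, row in enumerate(img) for x, v in enumerate(row))
    mu02 = sum((y - y_mean) ** 2 * v
               for y, row in enumerate(img) for v in row)
    # same guard as in deskew: no vertical spread means no measurable skew
    if abs(mu02) < 1e-2:
        return 0.0
    return mu11 / mu02


size = 7
upright = [[1 if x == 3 else 0 for x in range(size)] for y in range(size)]
slanted = [[1 if x == y else 0 for x in range(size)] for y in range(size)]

print(skew_from_moments(upright), skew_from_moments(slanted))  # 0.0 1.0
```

An upright stroke has zero skew, while a stroke that shifts one pixel sideways per row has a skew of 1; `deskew` uses this value to build the shear that straightens the digit.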

### Exercise 2

#### Task

Compare performance of selected models (SVM, Random Forest Classifier, Neural Network):

a) using HOG transformation from OpenCV

b) reshaping images to one-dimensional array

#### Solution

In [None]:
# a

(train_images, train_labels), (test_images, test_labels) = load_mnist_dataset()

train_images, test_images = deskew_transformation(train_images, test_images)
train_images, test_images = HOG_transformation(train_images, test_images)

print(generate_summary_table(train_images, train_labels, test_images, test_labels, svc_accuracies, rfc_accuracies, nn_accuracies))

+--------+----------+------------------------------------------------------------------+
| Method | Accuracy |                         Best parameters                          |
+--------+----------+------------------------------------------------------------------+
|  SVC   |  0.964   |           {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}            |
|  RFC   |  0.945   | {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200} |
|   NN   |   0.94   |                {'batch_size': 10, 'epochs': 100}                 |
+--------+----------+------------------------------------------------------------------+




In [None]:
# b

(train_images, train_labels), (test_images, test_labels) = load_mnist_dataset()

train_images, test_images = deskew_transformation(train_images, test_images)
train_images, test_images = flatten_transformation(train_images, test_images)

print(generate_summary_table(train_images, train_labels, test_images, test_labels, svc_accuracies, rfc_accuracies, nn_accuracies))

+--------+----------+------------------------------------------------------------------+
| Method | Accuracy |                         Best parameters                          |
+--------+----------+------------------------------------------------------------------+
|  SVC   |  0.962   |           {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}           |
|  RFC   |  0.944   | {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 200} |
|   NN   |  0.957   |                 {'batch_size': 10, 'epochs': 50}                 |
+--------+----------+------------------------------------------------------------------+




#### Observations
The classical classifiers (SVM, RFC) performed well with both representations and scored marginally higher with the Histogram of Oriented Gradients (0.964 vs 0.962 and 0.945 vs 0.944). The neural network was noticeably more accurate on flattened images (0.957 vs 0.940). Flattening keeps all 784 pixel values, including information that is not always essential, whereas the HOG descriptor compresses each image to 81 values. A plausible interpretation is that the neural network benefits from the extra raw information, while SVM and RFC do at least as well once irrelevant detail has been filtered out.
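
The difference between the two representations can be shown on a toy example. The sketch below builds a magnitude-weighted histogram of signed gradient orientations for one small cell and compares it with the flattened pixels. It is a simplified illustration of the HOG idea, not OpenCV's implementation: there is no bin interpolation and no block normalization, and `orientation_histogram` and the 6x6 cell are hypothetical.

```python
import math


def orientation_histogram(cell, nbins=9):
    """Magnitude-weighted histogram of signed gradient orientations (0..360 degrees)."""
    height, width = len(cell), len(cell[0])
    hist = [0.0] * nbins
    bin_width = 360.0 / nbins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # central differences approximate the image gradient
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.degrees(math.atan2(gy, gx)) % 360.0
            hist[int(angle // bin_width) % nbins] += magnitude
    return hist


# vertical edge: dark left half, bright right half
cell = [[0, 0, 0, 255, 255, 255] for _ in range(6)]

hist = orientation_histogram(cell)
flat = [pixel for row in cell for pixel in row]

print(len(flat), len(hist))  # 36 9
print(hist[0], sum(hist[1:]))  # 2040.0 0.0
```

The histogram summarizes the cell as "one strong edge pointing right" in 9 numbers, while the flattened vector keeps every pixel and leaves it to the model to discover that structure.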

### Exercise 3

#### Task

Compare performance of selected models (SVM, Random Forest Classifier, Neural Network):

a) without shuffling images

b) with shuffling images

#### Solution

In [None]:
# a

(train_images, train_labels), (test_images, test_labels) = load_mnist_dataset()

train_images, test_images = deskew_transformation(train_images, test_images)
train_images, test_images = flatten_transformation(train_images, test_images)

print(generate_summary_table(train_images, train_labels, test_images, test_labels, svc_accuracies, rfc_accuracies, nn_accuracies))

+--------+----------+-----------------------------------------------------------------+
| Method | Accuracy |                         Best parameters                         |
+--------+----------+-----------------------------------------------------------------+
|  SVC   |  0.962   |           {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}          |
|  RFC   |  0.935   | {'max_depth': 200, 'min_samples_split': 2, 'n_estimators': 100} |
|   NN   |  0.961   |                 {'batch_size': 10, 'epochs': 50}                |
+--------+----------+-----------------------------------------------------------------+




In [None]:
# b

(train_images, train_labels), (test_images, test_labels) = load_mnist_dataset()

train_images, test_images = deskew_transformation(train_images, test_images)
train_images, test_images = flatten_transformation(train_images, test_images)
train_images, train_labels, test_images, test_labels = shuffle_transformation(train_images, train_labels, test_images, test_labels)

print(generate_summary_table(train_images, train_labels, test_images, test_labels, svc_accuracies, rfc_accuracies, nn_accuracies))

+--------+----------+----------------------------------------------------------------+
| Method | Accuracy |                        Best parameters                         |
+--------+----------+----------------------------------------------------------------+
|  SVC   |  0.967   |          {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}          |
|  RFC   |   0.95   | {'max_depth': 50, 'min_samples_split': 2, 'n_estimators': 200} |
|   NN   |  0.961   |                {'batch_size': 50, 'epochs': 50}                |
+--------+----------+----------------------------------------------------------------+




#### Observations
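There is a noticeable rise in classification accuracy for SVM (0.962 to 0.967) and especially for the Random Forest Classifier (0.935 to 0.950), while the performance of the neural network is unchanged (0.961). This can be interpreted as an advantage of shuffling data before training: it randomly rearranges the samples and thus removes unwanted patterns or an unfavourable ordering. Because only the first `train_batch` training samples and the first `test_batch` test samples are used, shuffling also changes which samples are selected, so part of the difference may come from the subset rather than from the order alone.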
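
The effect of ordering on a truncated training set can be sketched with a hypothetical dataset that is sorted by class. This is an illustration only, not a statement about how MNIST is ordered. `shuffle_together` is a made-up helper that applies one permutation to samples and labels, which is the same guarantee `sklearn.utils.shuffle` gives when both arrays are passed in one call.

```python
import random


def shuffle_together(samples, labels, seed=0):
    """Apply one random permutation to samples and labels so pairs stay aligned."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    return [samples[i] for i in order], [labels[i] for i in order]


# hypothetical dataset sorted by class: 50 samples of class 0, then 50 of class 1
labels = [0] * 50 + [1] * 50
samples = [f"img_{i}" for i in range(100)]
batch = 20  # only the first `batch` samples are used, as with train_batch

classes_before = set(labels[:batch])
shuffled_samples, shuffled_labels = shuffle_together(samples, labels)
classes_after = set(shuffled_labels[:batch])

print(classes_before)  # {0}
print(classes_after)
```

Without shuffling, the truncated batch contains a single class; after shuffling it is a random sample of the whole set, and every image still carries its original label.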

### Summary

In [None]:
print(generate_overall_summary(svc_accuracies, rfc_accuracies, nn_accuracies))

+-------------------+-------+-------+-------+
|        Case       |  SVC  |  RFC  |   NN  |
+-------------------+-------+-------+-------+
|     Exercise 1    |       |       |       |
|   Deskewed & HOG  | 0.964 |  0.94 | 0.921 |
|      Only HOG     | 0.959 | 0.924 | 0.912 |
|     Exercise 2    |       |       |       |
|        HOG        | 0.964 | 0.945 |  0.94 |
|   Reshape to 1D   | 0.962 | 0.944 | 0.957 |
|     Exercise 3    |       |       |       |
| Without shuffling | 0.962 | 0.935 | 0.961 |
|   With shuffling  | 0.967 |  0.95 | 0.961 |
+-------------------+-------+-------+-------+


The table above collects the accuracy scores from every exercise. For a single model they differ by up to about 5 percentage points (NN: 0.912 to 0.961). The results suggest that data preprocessing is a vital part of building a machine learning model: skipping a step can decrease performance (Exercise 1). In addition, there is no single appropriate way to preprocess samples before training, since different models can require data in different forms (Exercise 2). Shuffling the data before fitting is another important step: samples are sometimes collected in a specific order, which can affect the performance of a machine learning algorithm (Exercise 3).
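
The spread per model can be checked directly from the values in the summary table. The short sketch below copies those accuracies and reports the gap between the best and worst case for each model in percentage points.

```python
# accuracies copied from the summary table, in row order
accuracies = {
    "SVC": [0.964, 0.959, 0.964, 0.962, 0.962, 0.967],
    "RFC": [0.94, 0.924, 0.945, 0.944, 0.935, 0.95],
    "NN": [0.921, 0.912, 0.94, 0.957, 0.961, 0.961],
}

# difference between best and worst case, in percentage points
spread_pp = {model: round((max(scores) - min(scores)) * 100, 1)
             for model, scores in accuracies.items()}

print(spread_pp)  # {'SVC': 0.8, 'RFC': 2.6, 'NN': 4.9}
```

The neural network is the most sensitive to preprocessing in these runs, while the SVM varies by less than one percentage point.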