**MNIST Data Recognition**

First, import the necessary libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
print("Libraries Imported")

Then import the data:

In [None]:
train_data = pd.read_csv("../input/train.csv")
test_data = pd.read_csv("../input/test.csv")
print("Data imported")

We can take a look at the data:

In [None]:
print("Number of images: %d" % len(train_data))
train_data.head()

We can see that there are 42k images. Each row holds a label, which is the digit the image corresponds to, and 784 pixel columns that together represent a flattened array of that image's pixel values.

To take a look at one of the images, take that image's row of pixels as a pandas Series. Then convert it into a NumPy array and reshape it into a 28x28 array (the image).

In [None]:
image1 = train_data.loc[0, train_data.columns != "label"]
plt.imshow(np.array(image1).reshape((28, 28)), cmap="gray")
plt.show()

Each image is 28x28 pixels in size. This image corresponds to the digit "1".
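
The reshape step above relies on NumPy's default row-major ordering: flat pixel `i` ends up at row `i // 28`, column `i % 28`. Here is a minimal sketch using a stand-in array (the values are just indices, not real pixel data):

```python
import numpy as np

# Stand-in for one flattened image: 784 values, here simply 0..783
flat = np.arange(28 * 28)

# Reshape into a 28x28 grid using NumPy's default row-major (C) order
image = flat.reshape((28, 28))

# Flat index i maps to (row, col) = (i // 28, i % 28)
i = 300
row, col = divmod(i, 28)
print(image.shape)            # (28, 28)
print(image[row, col] == i)   # True
```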

Now take a look at the distribution and range of values that the pixels can have:

In [None]:
plt.hist(image1)
plt.xlabel("Pixel Intensity")
plt.ylabel("Counts")
plt.show()

Each pixel in the pixel array has an integer value between 0 and 255, which corresponds to its grayscale value. Most pixels have values close to zero, which makes sense: the image above is dark everywhere except where the digit is drawn.

It is common practice to normalize the training data so that every feature sits on a small, consistent scale. In this case, divide each pixel intensity by the maximum value it can have (255) so that the range of intensities shrinks from 0-255 to 0-1.
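
As a quick illustration of that scaling, here is a small sketch on a made-up array of intensities (the values are hypothetical, chosen only to span the 8-bit range):

```python
import numpy as np

# Hypothetical pixel intensities spanning the full 8-bit range
pixels = np.array([0, 51, 128, 255])

# Divide by the maximum possible intensity to rescale into [0, 1]
scaled = pixels / 255

print(scaled.min(), scaled.max())  # 0.0 1.0
```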

Also, to estimate the model's accuracy and to check for overfitting or underfitting, we want to split the training data into a training dataset and a test dataset.

Firstly, split the training data into images and labels and divide each pixel intensity by 255.

Then, split the training data in a 3:1 ratio of training data to test data so that the model has some unseen data to run accuracy tests on. Use random_state=1 to set the RNG seed so that the resulting datasets can be reproduced.

Finally, flatten the label data into a 1D array.

In [None]:
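# Separate pixel columns from labels and scale pixels to the 0-1 range
train_images = train_data.loc[:, train_data.columns != "label"] / 255
train_labels = train_data.label
test_data = test_data / 255
# Hold out 25% of the labelled data for testing; the fixed seed makes the split reproducible
x_train, x_test, y_train, y_test = train_test_split(train_images, train_labels, test_size=0.25, random_state=1)
# Convert the label Series into flat NumPy arrays
y_train = y_train.values.ravel()
y_test = y_test.values.ravel()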

print("Data cleaned and split")
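
To see what fixing `random_state` buys us, here is a small sketch on toy data (the arrays `X` and `y` are made up for illustration). Two calls with the same seed should give the same partition, and `test_size=0.25` should hold out a quarter of the rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 samples with one feature each, plus matching labels
X = np.arange(8).reshape(-1, 1)
y = np.arange(8)

# Two splits with the same seed should produce identical partitions
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.25, random_state=1)

print(len(a_train), len(a_test))        # 6 2
print(np.array_equal(a_test, b_test))   # True
```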

This section can be ignored. It just gives the choice to run tests on smaller samples of the data.

In [None]:
sample_size = len(x_train)
x_train_sample = x_train.iloc[0:sample_size, :]
y_train_sample = y_train[0:sample_size]
x_test_sample = x_test.iloc[0:sample_size, :]
y_test_sample = y_test[0:sample_size]

print("Data samples created")

Now this is where the magic begins. We choose the SVC (Support Vector Classifier) model with default parameters to begin with and fit it to the training data.

In [None]:
#SVC classifier
model = SVC()
model.fit(x_train_sample, y_train_sample)
print("Model trained")
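
If you are curious what "default parameters" means here, `get_params()` lists them. This is only a sketch for inspection; it uses a separate, untrained instance (named `default_model` here so it does not overwrite the trained model), and the exact defaults can differ between scikit-learn versions:

```python
from sklearn.svm import SVC

# A fresh, untrained classifier used only to inspect its settings
default_model = SVC()
params = default_model.get_params()

# Print a few of the defaults that SVC() was constructed with
for name in ("kernel", "C", "gamma"):
    print(name, "=", params[name])
```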

Now calculate the accuracy of the model when tested with training data and test data. If the model performs well on the training dataset and poorly on the test dataset, then the model is likely overfitting. However, if the model performs poorly on both datasets, we might suspect that the model has instead underfitted the data.

Thus, for a well-fitted model, we expect the training and test accuracies to be pretty close together. If the accuracy scores are low, then a number of extra measures might need to be taken, such as cleaning or manipulating the dataset further, changing the number of images used, or tuning the model parameters.
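
The reasoning above can be written down as a rough rule of thumb. The helper below is hypothetical, and both thresholds (`max_gap` and `min_acc`) are illustrative assumptions rather than established cut-offs:

```python
def diagnose_fit(train_acc, test_acc, max_gap=0.05, min_acc=0.90):
    """Rough heuristic; both thresholds are illustrative assumptions."""
    # A large gap between training and test accuracy suggests overfitting
    if train_acc - test_acc > max_gap:
        return "possible overfitting"
    # Low accuracy on both datasets suggests underfitting
    if train_acc < min_acc and test_acc < min_acc:
        return "possible underfitting"
    return "looks reasonably well fitted"

print(diagnose_fit(0.99, 0.85))  # possible overfitting
print(diagnose_fit(0.60, 0.58))  # possible underfitting
print(diagnose_fit(0.94, 0.94))  # looks reasonably well fitted
```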

In [None]:
#training metrics
train_predicts = model.predict(x_train_sample)
train_acc = round(accuracy_score(y_train_sample, train_predicts) * 100)
print("Training Accuracy: %d%%" %train_acc)

#test metrics
test_predicts = model.predict(x_test_sample)
test_acc = round(accuracy_score(y_test_sample, test_predicts) * 100)
print("Test Accuracy: %d%%" % test_acc)

In this case, on the first run, the model achieved a 94% training accuracy and 94% test accuracy. This means that the model probably isn't overfitting or underfitting, which is good news. To increase this accuracy, the model's parameters might need to be tuned.
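
One possible way to approach that tuning is a cross-validated grid search. The sketch below is not part of the original run: to stay fast it uses scikit-learn's small built-in 8x8 digits dataset instead of the Kaggle data, and the candidate values for `C` and `gamma` are arbitrary assumptions:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small built-in 8x8 digits set, used here only to keep the example fast
digits = load_digits()
X_small = digits.data / 16  # pixel values in this dataset range from 0 to 16
y_small = digits.target

# Illustrative grid; these candidate values are assumptions, not tuned choices
param_grid = {"C": [1, 10], "gamma": ["scale", 0.01]}

# Try every combination with 3-fold cross-validation
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X_small, y_small)

print(search.best_params_)
print("Best cross-validated accuracy: %.3f" % search.best_score_)
```

On the full 42k-image dataset a search like this can take a long time, so it may be worth running it on a smaller sample first.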

Finally, use the test submission data to create predictions and output them to a CSV file for submission to Kaggle.

In [None]:
#submission predictions
predictions = model.predict(test_data)
print("Finished submission predictions")

#export submission data
submission = pd.DataFrame(predictions)
submission.index.name = "ImageId"
submission.index += 1
submission.columns = ["Label"]
submission.to_csv("digit_submissions.csv", header=True)

print("Exported submission predictions")