# Build a digit classifier

This file trains a digit classifier using a HOG feature extractor and a linear SVM model. The MNIST handwritten digit database was chosen because it is publicly available and commonly used to evaluate machine learning algorithms.

In [9]:
#!/usr/bin/env python
# train_digits.py

import numpy as np
from sklearn import datasets
from skimage.feature import hog
from sklearn.svm import LinearSVC
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
from sklearn.utils import shuffle

In [5]:
def get_hog(X):
    """Extract HOG features for every image in the data set."""
    features = []
    for image in X:
        # each row of X is a flattened 28x28 image; visualisation stays at its default (off)
        ft = hog(image.reshape((28, 28)), orientations=9,
                 pixels_per_cell=(14, 14), cells_per_block=(1, 1))
        features.append(ft)

    return np.array(features)


HOG (Histogram of Oriented Gradients) was used as the feature descriptor. It was chosen for its ease of use: scikit-image provides a free, open-source implementation (`skimage.feature.hog`, imported above) that can be applied easily. Furthermore, handwritten digits consist of curves and edges with strong contrast, which a gradient-based descriptor is expected to capture well.
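
To make the size of the descriptor concrete, the following is a minimal sketch of the arithmetic behind the feature vector length. It assumes the usual dense layout in which blocks slide one cell at a time, which is how scikit-image's `hog` is understood to arrange its output; `hog_feature_length` is a hypothetical helper written for illustration and is not part of any library.

```python
def hog_feature_length(image_shape, pixels_per_cell, cells_per_block, orientations):
    """Number of values in a HOG descriptor built from dense, overlapping blocks."""
    # whole cells that fit along each image axis
    cells_r = image_shape[0] // pixels_per_cell[0]
    cells_c = image_shape[1] // pixels_per_cell[1]
    # blocks are assumed to slide one cell at a time
    blocks_r = cells_r - cells_per_block[0] + 1
    blocks_c = cells_c - cells_per_block[1] + 1
    # every block holds one orientation histogram per cell
    return blocks_r * blocks_c * cells_per_block[0] * cells_per_block[1] * orientations


# settings used by get_hog: 2 x 2 cells of 14 x 14 pixels, 9 orientation bins each
print(hog_feature_length((28, 28), (14, 14), (1, 1), 9))  # 36
```

With the settings used here, each 28x28 image is therefore reduced from 784 pixel values to 36 features. That keeps training fast, but it is a coarse summary of the image.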

In [8]:
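def main():
    """Train a digit classifier on the MNIST handwritten digit database"""
    # fetch_mldata was removed from scikit-learn; fetch_openml serves the same MNIST images
    dataset = datasets.fetch_openml('mnist_784', version=1, as_frame=False)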
    X = dataset.data
    y = dataset.target

    X, y = shuffle(X, y, random_state=0)
    # intensity normalisation
    X = X / 255.0 * 2 - 1

    # feature extraction
    X = get_hog(X)

    # split the data set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=0.2,
                                                        random_state=42)
    print("training a classifier")
    classifier = LinearSVC()
    classifier.fit(X_train, y_train)
    
    print("test score: ", classifier.score(X_test, y_test))

    joblib.dump(classifier, "../classifier/linear_svm_hog.pkl")

if __name__ == '__main__':
    main()

training a classifier
test score:  0.861


80% of the dataset is used as the training set, and the remaining 20% is used to evaluate the final model. A linear SVM was selected because it requires less training time and still gives reasonable performance. I used the default parameters for the linear SVM model; however, performance could be improved further by tuning the parameters with a grid search.
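
As a rough illustration of that tuning step, here is a minimal sketch using scikit-learn's `GridSearchCV`. Several things in it are assumptions rather than part of this project: it uses the small `load_digits` data set bundled with scikit-learn as a stand-in for the HOG features of MNIST, and the grid of `C` values, `max_iter=5000` and `cv=3` are arbitrary choices for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

# stand-in data (assumption): scikit-learn's bundled 8x8 digits, scaled to [0, 1]
X, y = load_digits(return_X_y=True)
X = X / 16.0

# same 80/20 split as in main()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# hypothetical grid over the regularisation strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 3-fold cross-validation on the training set only; the test set stays untouched
search = GridSearchCV(LinearSVC(max_iter=5000), param_grid, cv=3)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("held-out score:", search.score(X_test, y_test))
```

Because the search only sees the training split, the held-out score remains a fair estimate of the tuned model. In this project the same pattern would be applied to the output of `get_hog` instead of the stand-in data.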