# Machine Learning in Python - Roll your Own Estimator Example

This notebook demonstrates how scikit-learn can be extended to include new models by implementing the **TemplateMatchClassifier**.

The TemplateMatchClassifier is very simple:
* **Training:** For each target feature level, calculate the average value of all descriptive features for instances that have that target level. Store these average vectors as templates for each target level.
* **Prediction:** When a new prediction needs to be made, compare the descriptive feature values of the new query instance to each template and return the target feature level that belongs to the template that is closest (based on Euclidean distance) to the query case.
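
The two steps above can be sketched in a few lines of plain NumPy. This is only an illustration under simple assumptions (dense numeric features, Euclidean distance); the function names `fit_templates` and `predict_nearest` are made up for this sketch and are not part of the class defined later.

```python
import numpy as np


def fit_templates(X, y):
    """Return the sorted class labels and one mean vector (template) per class."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    templates = np.vstack([X[y == c].mean(axis=0) for c in classes])
    return classes, templates


def predict_nearest(X, classes, templates):
    """Return, for each row of X, the label of the closest template."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances, shape (n_queries, n_classes)
    dists = np.linalg.norm(X[:, None, :] - templates[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


# Toy data: one instance of class 1 and three instances of class 2
X_toy = [[1, 23, 3, 4], [5, 6, 7, 8], [7, 5, 6, 2], [4, 9, 12, 43]]
y_toy = [1, 2, 2, 2]
classes, templates = fit_templates(X_toy, y_toy)

queries = [[2, 15, 6, 21], [8, 9, 7, 6], [1, 24, 3, 4]]
toy_predictions = predict_nearest(queries, classes, templates)
print(toy_predictions)  # prints [2 2 1]
```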

## Import Packages Etc

In [409]:
from IPython.display import display, HTML, Image

from TAS_Python_Utilities import data_viz
from TAS_Python_Utilities import data_viz_target
from TAS_Python_Utilities import visualize_tree

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pyplot
import random

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
from sklearn import metrics
from scipy.spatial import distance


%matplotlib inline
#%qtconsole

## Define TemplateMatchClassifier

Define and test out the TemplateMatchClassifier class. To build a scikit-learn classifier we extend from the **BaseEstimator** and **ClassifierMixin** classes and implement the `__init__`, **fit**, **predict**, and **predict_proba** methods.
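
To see what the two base classes contribute on their own, here is a deliberately trivial sketch. `ConstantClassifier` is a hypothetical estimator invented for this illustration: it always predicts one fixed label. `BaseEstimator` supplies `get_params` and `set_params` from the constructor signature, and `ClassifierMixin` supplies an accuracy-based `score`. The mixin is listed first here, which is the order the scikit-learn developer guide recommends.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class ConstantClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical estimator that always predicts the same label."""

    def __init__(self, constant=0):
        # Constructor arguments are stored unchanged so get_params can find them
        self.constant = constant

    def fit(self, X, y):
        # Attributes learned during fit end with an underscore by convention
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        return np.full(len(X), self.constant)


clf_demo = ConstantClassifier(constant=1).fit([[0], [1], [2]], [1, 1, 0])
demo_params = clf_demo.get_params()                 # inherited from BaseEstimator
demo_score = clf_demo.score([[5], [6]], [1, 0])     # inherited from ClassifierMixin
print(demo_params, demo_score)
```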

### Define the TemplateMatchClassifier Class

In [410]:
# Create a new classifier which is based on the scikit-learn BaseEstimator and ClassifierMixin classes
class TemplateMatchClassifier(BaseEstimator, ClassifierMixin):
    """The TemplateMatchClassifier is a very simple classification algorithm that stores one template (the mean feature vector) per class and predicts the class whose template is closest to the query instance. The TemplateMatchClassifier only works for categorical target features and numeric descriptive features.
        - Training: For each target level, calculate the mean of the descriptive features over the training instances with that level, and store these mean vectors as templates.
        - Prediction: Compare each query instance to every template and return the target level of the closest template.

    Parameters
    ----------
    dis_mat : string, optional (default = 'euclidean')
        The distance metric used to compare a query instance to the templates. One of 'euclidean', 'manhattan', 'chebyshev', or 'cosine'.
    add_noise : bool, optional (default = False)
        Whether or not a little bit of noise should be added to the stored class distribution. This does not change the templates or the predictions.

    Attributes
    ----------
    classes_ : array of shape = [n_classes] 
        The class labels (single output problem).
    templates_ : dict
        A dictionary mapping each class label to its template (the mean feature vector, rounded to 2 decimal places).
    distribution_ : dict
        A dictionary of the probability of each class.
        
    Examples
    --------
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.model_selection import cross_val_score
    >>> clf = TemplateMatchClassifier()
    >>> iris = load_iris()
    >>> cross_val_score(clf, iris.data, iris.target, cv=10)

    """
    
    # Constructor for the classifier object
    def __init__(self, dis_mat='euclidean', add_noise = False):
        self.add_noise = add_noise
        self.dis_mat = dis_mat

    # The fit function to train a classifier
    def fit(self, X, y):
        """Build the class templates from the training set (X, y).
        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            The training input samples.
        y : array-like, shape = [n_samples] 
            The target values (class labels) as integers or strings.
        Returns
        -------
        self : object
        """
            
        # Check that X and y have correct shape
        X, y = check_X_y(X, y)

        # Count the number of occurrences of each class in the target vector (uses the NumPy unique function, which returns the unique values and their counts)
        unique, counts = np.unique(y, return_counts=True)
        
        uni_count = dict(zip(unique,counts))
        
        templates = {}
        
        for i in range(len(y)):
            if y[i] in templates.keys():
                templates[y[i]] = np.add(templates[y[i]],X[i])
            else:
                templates[y[i]] = X[i]
                
        for key in uni_count.keys():
            templates[key] = np.round(np.true_divide(templates[key],uni_count[key]),2)
        
        
        #add the templates to self
        self.templates_ = templates
        
        # Store the classes seen during fit
        self.classes_ = unique

        # Normalise the counts to sum to 1
        dist = counts/sum(counts)
            
        # If the add_noise attribute is true add a little noise to the distribution
        if(self.add_noise):
            for i in  range(len(dist)):
                dist[i] = dist[i] + dist[i]*random.uniform(-0.25, 0.25)
            # Renormalise the distribution
            dist = dist/sum(dist)
            
        # Create a new dictionary of classes and their normalised frequencies (the distribution)
        self.distribution_ = dict(zip(unique, dist))
        
        # Return the classifier
        return self
    
    # Make a set of predictions for a set of query instances using the named distance metric
    def predict_distance_metric(self, X, distance_metric='euclidean'):
        # Map each supported metric name to its scipy distance function
        metric_functions = {
            'euclidean': distance.euclidean,
            'manhattan': distance.cityblock,
            'chebyshev': distance.chebyshev,
            'cosine': distance.cosine,
        }
        metric_name = distance_metric.casefold()
        if metric_name not in metric_functions:
            raise ValueError("enter a valid distance_metric: euclidean, manhattan, chebyshev, cosine")
        metric_function = metric_functions[metric_name]

        # Initialise an empty list to store the predictions made
        predictions = list()
        for instance in X:
            # Start from an infinite distance so that any template can become the best match
            min_target = self.classes_[0]
            min_distance = np.inf
            for key, value in self.templates_.items():
                dis = np.round(metric_function(instance, value), 2)
                if min_distance > dis:
                    min_distance = dis
                    min_target = key
            # Store the label itself (no rounding) so that string labels also work
            predictions.append(min_target)

        return predictions

    # The predict function to make a set of predictions for a set of query instances
    def predict(self, X, distance_metric='euclidean'):
        """Predict class labels of the input samples X.
        Parameters
        ----------
        X : array-like of shape = [n_samples, n_features]
            The input samples.
        distance_metric : string, optional (default = 'euclidean')
            The distance metric to use for this call. Any value other than
            the default 'euclidean' overrides the ``dis_mat`` set in the constructor.
        Returns
        -------
        p : array of shape = [n_samples, ].
            The predicted class labels of the input samples. 
        """
        
        # Check that fit has been called by confirming that the distribution_ dictionary has been set up
        check_is_fitted(self, ['distribution_'])

        # Check that the input features match the type and shape of the training features
        X = check_array(X)

        # Select the distance metric. A non-default metric passed to predict takes
        # preference; otherwise fall back to the metric set in the constructor.
        if distance_metric != 'euclidean':
            final_distance_metric = distance_metric
        else:
            final_distance_metric = self.dis_mat
        
        # Find the closest template for each query instance
        predictions = self.predict_distance_metric(X, final_distance_metric)
            
        return np.array(predictions)
    
    
    # The predict_proba function to generate class probabilities for a set of query instances
    def predict_proba(self, X):
        """Predict class probabilities of the input samples X.
        Parameters
        ----------
        X : array-like matrix of shape = [n_samples, n_features]
            The input samples. 
        Returns
        -------
        p : array of shape = [n_samples, n_labels].
            The predicted class label probabilities of the input samples. 
        """
        
        # Check that fit has been called by confirming that the distribution_ dictionary has been set up
        check_is_fitted(self, ['distribution_'])

        # Check that the input features match the type and shape of the training features
        X = check_array(X)

        # Initialise an array to store the prediction scores generated
        predictions = np.zeros((len(X), len(self.classes_)))

        # Predict the class of every query instance once, rather than once per instance
        preds = self.predict(X)

        # Iterate through the predicted classes of the query instances
        for idx, pred in enumerate(preds):
            # Always give the predicted class a probability of 0.9 and all other classes the remaining probability mass equally distributed.
            predictions[idx, ]= 0.1/(len(self.classes_) - 1)
            predictions[idx, list(self.classes_).index(pred)] = 0.9
            
        return predictions

### Test the TemplateMatchClassifier

Do a simple test of the TemplateMatchClassifier

In [411]:
a = np.array([[1,23,3,4], [5,6,7,8], [7,5,6,2], [4,9,12,43]])
y = np.array([1, 2, 2, 2])

In [412]:
my_model = TemplateMatchClassifier()

In [413]:
sel = my_model.fit(a, y)

In [414]:
my_model.distribution_

{1: 0.25, 2: 0.75}

In [415]:
q = np.array([[2,15,6,21], [8,9,7,6],[1,24,3,4]])

In [416]:
my_model.predict(q)

array([2, 2, 1])

In [417]:
my_model.predict_proba(q)

array([[0.1, 0.9],
       [0.1, 0.9],
       [0.9, 0.1]])
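
The probabilities above are not estimated from distances. The predicted class is simply given 0.9 and the remaining 0.1 is shared equally among the other classes. A vectorised sketch of that rule follows. The function name `fixed_confidence_proba` and its `confidence` parameter are hypothetical, and the sketch assumes `classes` is sorted (as `np.unique` returns it) and contains at least two labels.

```python
import numpy as np


def fixed_confidence_proba(pred_labels, classes, confidence=0.9):
    """Give each predicted class `confidence` and split the rest equally."""
    pred_labels = np.asarray(pred_labels)
    classes = np.asarray(classes)  # assumed sorted, with at least two entries
    n_classes = len(classes)
    # Every cell starts with an equal share of the leftover probability mass
    proba = np.full((len(pred_labels), n_classes), (1.0 - confidence) / (n_classes - 1))
    # Column index of each predicted label (valid because classes is sorted)
    cols = np.searchsorted(classes, pred_labels)
    proba[np.arange(len(pred_labels)), cols] = confidence
    return proba


demo_proba = fixed_confidence_proba([2, 2, 1], classes=[1, 2])
print(demo_proba)
```

Because every row has the same shape of distribution, these scores carry no more information than the predicted label itself.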

Fit a model to the iris dataset

In [418]:
from sklearn.datasets import load_iris
iris = load_iris()

clf = TemplateMatchClassifier()
clf.fit(iris.data, iris.target)
clf.distribution_

{0: 0.3333333333333333, 1: 0.3333333333333333, 2: 0.3333333333333333}

Do a simple Iris cross-validation experiment

In [419]:
clf = TemplateMatchClassifier()
cross_val_score(clf, iris.data, iris.target, cv=10)

array([0.86666667, 0.93333333, 0.93333333, 0.93333333, 0.93333333,
       0.8       , 1.        , 0.93333333, 1.        , 1.        ])

Fit a model to the iris dataset with noise added to the distribution

In [420]:
clf = TemplateMatchClassifier(add_noise = True)
clf.fit(iris.data, iris.target)
clf.distribution_

{0: 0.33219446390833224, 1: 0.2979084014412676, 2: 0.36989713465040014}

In [421]:
from sklearn.datasets import load_iris
clf = TemplateMatchClassifier(add_noise = True)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10)

array([0.86666667, 0.93333333, 0.93333333, 0.93333333, 0.93333333,
       0.8       , 1.        , 0.93333333, 1.        , 1.        ])

## Load & Partition Data

### Setup - IMPORTANT

Take only a sample of the dataset for fast testing

In [422]:
data_sampling_rate = 0.1

Setup the number of folds for all grid searches (should be 5 - 10)

In [423]:
cv_folds = 10

### Load & Partition Data

Load the dataset and explore it.

In [424]:
dataset = pd.read_csv('fashion-mnist_train.csv')
dataset = dataset.sample(frac=data_sampling_rate) #take a sample from the dataset so everything runs smoothly
num_classes = 10
classes = {0: "T-shirt/top", 1:"Trouser", 2: "Pullover", 3:"Dress", 4:"Coat", 5:"Sandal", 6:"Shirt", 7:"Sneaker", 8:"Bag", 9:"Ankle boot"}
display(dataset.head())

Unnamed: 0,label,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
22810,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
44607,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3796,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
31664,3,0,0,0,0,0,0,0,0,39,...,54,17,0,0,0,0,0,0,0,0
9195,2,0,0,0,0,0,2,0,0,0,...,2,1,0,0,129,104,45,0,0,0


Isolate the descriptive features we are interested in

In [425]:
X = dataset[dataset.columns.difference(["label"])]
Y = np.array(dataset["label"])

In [426]:
X = X/255

In [427]:
X_train_plus_valid, X_test, y_train_plus_valid, y_test \
    = train_test_split(X, Y, random_state=0, \
                                    train_size = 0.7)

X_train, X_valid, y_train, y_valid \
    = train_test_split(X_train_plus_valid, \
                                        y_train_plus_valid, \
                                        random_state=0, \
                                        train_size = 0.5/0.7)
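
The two-stage split keeps 70% of the rows for training plus validation and 30% for testing, then keeps 0.5/0.7 of that 70% for training. The result is roughly 50% train, 20% validation and 30% test. The sketch below checks the arithmetic on dummy data; the 6000-row size is an assumption made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-in data (assumed size, not the real dataset)
X_demo = np.arange(6000).reshape(-1, 1)
y_demo = np.arange(6000) % 10

# Stage 1: 70% train+valid, 30% test
X_tv, X_te, y_tv, y_te = train_test_split(X_demo, y_demo, random_state=0, train_size=0.7)
# Stage 2: 0.5/0.7 of the 70% becomes train, the rest becomes valid
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv, random_state=0, train_size=0.5 / 0.7)

for name, part in [("train", X_tr), ("valid", X_va), ("test", X_te)]:
    print(name, len(part), round(len(part) / len(X_demo), 2))
```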

## Train and Evaluate a Simple Model

In [428]:
my_model = TemplateMatchClassifier(add_noise = True)
my_model.fit(X_train, y_train)

TemplateMatchClassifier(add_noise=True, dis_mat='euclidean')
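
The fitted model is evaluated below under four distance metrics. As a quick reminder of how they differ, this small sketch applies the same `scipy.spatial.distance` functions used by the classifier to two arbitrary example vectors (the vectors are made up for illustration).

```python
import numpy as np
from scipy.spatial import distance

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])

metric_values = {
    "euclidean": distance.euclidean(u, v),  # sqrt(3^2 + 4^2 + 0^2)
    "manhattan": distance.cityblock(u, v),  # 3 + 4 + 0
    "chebyshev": distance.chebyshev(u, v),  # max(3, 4, 0)
    "cosine": distance.cosine(u, v),        # 1 - cosine similarity
}
for name, value in metric_values.items():
    print(name, round(float(value), 4))
```

Cosine distance depends only on the angle between the vectors, so it ignores overall brightness differences between images, while Chebyshev distance depends only on the single largest per-pixel difference.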

#### Euclidean Distance

In [429]:
# Make a set of predictions for the training data
y_pred = my_model.predict(X_train)

# Print performance details
accuracy = metrics.accuracy_score(y_train, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_train, y_pred))

# Print confusion matrix
# print(metrics.confusion_matrix(y_train, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_train, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

Accuracy: 0.679
              precision    recall  f1-score   support

           0       0.74      0.70      0.72       310
           1       0.96      0.87      0.92       323
           2       0.51      0.48      0.49       275
           3       0.67      0.78      0.72       298
           4       0.55      0.55      0.55       316
           5       0.50      0.78      0.61       313
           6       0.31      0.18      0.22       290
           7       0.77      0.80      0.78       302
           8       0.96      0.77      0.85       286
           9       0.85      0.85      0.85       287

    accuracy                           0.68      3000
   macro avg       0.68      0.68      0.67      3000
weighted avg       0.68      0.68      0.67      3000

Confusion Matrix


True \ Predicted,0,1,2,3,4,5,6,7,8,9,All
0,216,2,7,37,3,32,11,0,2,0,310
1,8,282,6,17,2,6,2,0,0,0,323
2,3,0,132,2,59,35,40,0,4,0,275
3,8,8,2,231,15,20,14,0,0,0,298
4,2,1,55,24,175,20,38,0,1,0,316
5,0,0,0,0,0,245,1,43,1,23,313
6,56,0,49,18,60,54,51,0,2,0,290
7,0,0,0,0,0,44,0,241,0,17,302
8,0,0,7,17,4,23,4,8,221,2,286
9,0,0,1,0,1,14,6,22,0,243,287
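
The figures in the classification report can be recovered directly from the confusion matrix: recall divides each diagonal entry by its row total, precision divides it by its column total, and accuracy is the diagonal sum over the grand total. A short sketch using the training-set matrix above (without the `All` column):

```python
import numpy as np

# Rows are true classes 0-9, columns are predicted classes 0-9
cm = np.array([
    [216,   2,   7,  37,   3,  32,  11,   0,   2,   0],
    [  8, 282,   6,  17,   2,   6,   2,   0,   0,   0],
    [  3,   0, 132,   2,  59,  35,  40,   0,   4,   0],
    [  8,   8,   2, 231,  15,  20,  14,   0,   0,   0],
    [  2,   1,  55,  24, 175,  20,  38,   0,   1,   0],
    [  0,   0,   0,   0,   0, 245,   1,  43,   1,  23],
    [ 56,   0,  49,  18,  60,  54,  51,   0,   2,   0],
    [  0,   0,   0,   0,   0,  44,   0, 241,   0,  17],
    [  0,   0,   7,  17,   4,  23,   4,   8, 221,   2],
    [  0,   0,   1,   0,   1,  14,   6,  22,   0, 243],
])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all instances of the true class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of that class
accuracy = np.trace(cm) / cm.sum()

print("accuracy:", round(float(accuracy), 3))
print("precision:", np.round(precision, 2))
print("recall:", np.round(recall, 2))
```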


In [430]:
# Make a set of predictions for the test data
y_pred = my_model.predict(X_test)

# Print performance details
accuracy = metrics.accuracy_score(y_test, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))

# Print confusion matrix
# print(metrics.confusion_matrix(y_test, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
pd.crosstab(np.array(y_test), y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Accuracy: 0.6911111111111111
              precision    recall  f1-score   support

           0       0.74      0.65      0.69       189
           1       0.96      0.89      0.92       174
           2       0.58      0.45      0.51       191
           3       0.70      0.79      0.74       202
           4       0.59      0.62      0.61       186
           5       0.49      0.74      0.59       166
           6       0.31      0.25      0.28       151
           7       0.74      0.79      0.77       173
           8       0.94      0.78      0.85       192
           9       0.88      0.90      0.89       176

    accuracy                           0.69      1800
   macro avg       0.69      0.69      0.68      1800
weighted avg       0.70      0.69      0.69      1800

Confusion Matrix


True \ Predicted,0,1,2,3,4,5,6,7,8,9,All
0,122,3,4,27,6,14,11,0,2,0,189
1,8,155,1,5,1,1,2,0,1,0,174
2,0,0,86,2,36,23,41,0,3,0,191
3,12,3,0,159,8,11,8,0,1,0,202
4,0,0,29,15,116,6,20,0,0,0,186
5,0,0,0,0,0,123,0,33,0,10,166
6,22,1,23,9,25,31,38,0,2,0,151
7,0,0,0,0,0,24,0,137,0,12,173
8,1,0,6,10,4,14,3,4,150,0,192
9,0,0,0,0,0,6,1,11,0,158,176


#### Manhattan Distance

In [431]:
# Make a set of predictions for the training data
y_pred = my_model.predict(X_train,"Manhattan")

# Print performance details
accuracy = metrics.accuracy_score(y_train, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_train, y_pred))

# Print confusion matrix
# print(metrics.confusion_matrix(y_train, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_train, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

Accuracy: 0.6166666666666667
              precision    recall  f1-score   support

           0       0.76      0.61      0.68       310
           1       0.81      0.93      0.86       323
           2       0.64      0.30      0.41       275
           3       0.53      0.69      0.60       298
           4       0.45      0.64      0.53       316
           5       0.44      0.55      0.49       313
           6       0.32      0.14      0.20       290
           7       0.58      0.89      0.70       302
           8       0.98      0.55      0.71       286
           9       0.85      0.80      0.82       287

    accuracy                           0.62      3000
   macro avg       0.64      0.61      0.60      3000
weighted avg       0.63      0.62      0.60      3000

Confusion Matrix


True \ Predicted,0,1,2,3,4,5,6,7,8,9,All
0,190,10,6,61,7,29,6,0,1,0,310
1,2,299,2,8,6,4,2,0,0,0,323
2,3,1,82,3,106,38,40,0,1,1,275
3,7,54,1,207,15,12,2,0,0,0,298
4,1,3,16,39,201,22,34,0,0,0,316
5,0,0,0,1,0,173,0,121,0,18,313
6,47,2,19,33,93,52,42,0,2,0,290
7,0,0,0,0,0,23,0,269,0,10,302
8,0,1,2,32,16,30,5,30,158,12,286
9,0,0,0,3,1,11,0,43,0,229,287


In [432]:
# Make a set of predictions for the test data
y_pred = my_model.predict(X_test,"Manhattan")

# Print performance details
accuracy = metrics.accuracy_score(y_test, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))

# Print confusion matrix
# print(metrics.confusion_matrix(y_test, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
pd.crosstab(np.array(y_test), y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Accuracy: 0.6244444444444445
              precision    recall  f1-score   support

           0       0.86      0.63      0.73       189
           1       0.72      0.95      0.82       174
           2       0.62      0.28      0.38       191
           3       0.60      0.66      0.63       202
           4       0.48      0.68      0.56       186
           5       0.45      0.57      0.50       166
           6       0.30      0.21      0.25       151
           7       0.56      0.90      0.69       173
           8       0.97      0.50      0.66       192
           9       0.87      0.85      0.86       176

    accuracy                           0.62      1800
   macro avg       0.64      0.62      0.61      1800
weighted avg       0.65      0.62      0.61      1800

Confusion Matrix


True \ Predicted,0,1,2,3,4,5,6,7,8,9,All
0,119,8,0,34,10,11,5,0,2,0,189
1,0,166,1,4,2,0,1,0,0,0,174
2,0,0,53,3,66,25,43,1,0,0,191
3,1,45,0,134,9,7,6,0,0,0,202
4,0,7,15,18,126,9,11,0,0,0,186
5,0,0,0,0,0,94,0,66,0,6,166
6,18,4,14,14,38,31,31,0,1,0,151
7,0,0,0,0,0,9,0,156,0,8,173
8,1,1,2,16,14,16,5,33,96,8,192
9,0,0,0,1,0,5,0,21,0,149,176


#### Chebyshev distance

In [433]:
# Make a set of predictions for the training data
y_pred = my_model.predict(X_train,"Chebyshev")

# Print performance details
accuracy = metrics.accuracy_score(y_train, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_train, y_pred))

# Print confusion matrix
# print(metrics.confusion_matrix(y_train, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_train, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

Accuracy: 0.4673333333333333
              precision    recall  f1-score   support

           0       0.52      0.63      0.57       310
           1       0.99      0.46      0.62       323
           2       0.41      0.44      0.42       275
           3       0.78      0.26      0.39       298
           4       0.60      0.15      0.24       316
           5       0.70      0.25      0.37       313
           6       0.18      0.57      0.27       290
           7       0.81      0.53      0.64       302
           8       0.41      0.81      0.55       286
           9       0.81      0.62      0.70       287

    accuracy                           0.47      3000
   macro avg       0.62      0.47      0.48      3000
weighted avg       0.63      0.47      0.48      3000

Confusion Matrix


True \ Predicted,0,1,2,3,4,5,6,7,8,9,All
0,195,1,10,9,0,0,86,0,9,0,310
1,76,147,4,4,0,0,91,0,1,0,323
2,2,0,122,1,11,0,126,0,13,0,275
3,47,1,4,78,1,0,165,0,2,0,298
4,7,0,103,1,47,0,156,0,2,0,316
5,2,0,3,0,0,79,58,33,110,28,313
6,39,0,44,6,18,0,164,0,19,0,290
7,0,0,0,0,0,28,13,159,90,12,302
8,3,0,7,1,1,1,38,0,232,3,286
9,2,0,4,0,0,5,10,4,83,179,287


In [434]:
# Make a set of predictions for the test data
y_pred = my_model.predict(X_test,"Chebyshev")

# Print performance details
accuracy = metrics.accuracy_score(y_test, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))

# Print confusion matrix
# print(metrics.confusion_matrix(y_test, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
pd.crosstab(np.array(y_test), y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Accuracy: 0.4527777777777778
              precision    recall  f1-score   support

           0       0.52      0.60      0.56       189
           1       0.97      0.40      0.57       174
           2       0.45      0.41      0.43       191
           3       0.74      0.18      0.29       202
           4       0.57      0.12      0.20       186
           5       0.74      0.17      0.27       166
           6       0.14      0.56      0.22       151
           7       0.80      0.57      0.67       173
           8       0.47      0.82      0.60       192
           9       0.82      0.72      0.76       176

    accuracy                           0.45      1800
   macro avg       0.62      0.45      0.46      1800
weighted avg       0.63      0.45      0.46      1800

Confusion Matrix


True \ Predicted,0,1,2,3,4,5,6,7,8,9,All
0,113,1,7,7,1,0,54,0,6,0,189
1,43,70,0,1,0,0,59,0,1,0,174
2,2,0,78,1,6,0,101,0,3,0,191
3,41,0,0,37,0,0,124,0,0,0,202
4,0,0,50,2,23,0,107,0,4,0,186
5,1,0,2,0,0,28,34,24,60,17,166
6,15,0,31,2,10,0,84,0,9,0,151
7,0,0,0,0,0,8,5,99,53,8,173
8,2,0,4,0,0,1,25,0,157,3,192
9,1,1,2,0,0,1,4,1,40,126,176


#### Cosine distance

In [435]:
# Make a set of predictions for the training data
y_pred = my_model.predict(X_train,"cosine")

# Print performance details
accuracy = metrics.accuracy_score(y_train, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_train, y_pred))

# Print confusion matrix
# print(metrics.confusion_matrix(y_train, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_train, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

Accuracy: 0.6813333333333333
              precision    recall  f1-score   support

           0       0.74      0.78      0.76       310
           1       0.98      0.89      0.93       323
           2       0.57      0.53      0.55       275
           3       0.69      0.85      0.76       298
           4       0.54      0.65      0.59       316
           5       0.56      0.34      0.42       313
           6       0.38      0.27      0.31       290
           7       0.68      0.84      0.75       302
           8       0.90      0.84      0.87       286
           9       0.68      0.81      0.74       287

    accuracy                           0.68      3000
   macro avg       0.67      0.68      0.67      3000
weighted avg       0.67      0.68      0.67      3000

Confusion Matrix


True \ Predicted,0,1,2,3,4,5,6,7,8,9,All
0,242,0,1,35,2,0,26,0,4,0,310
1,3,288,8,19,4,0,1,0,0,0,323
2,1,0,146,2,82,0,37,0,7,0,275
3,7,6,1,252,4,0,27,0,1,0,298
4,1,1,58,30,206,0,18,0,2,0,316
5,1,0,0,0,0,106,2,105,4,95,313
6,71,0,39,16,79,0,77,0,8,0,290
7,0,0,0,0,0,32,0,255,0,15,302
8,1,0,2,11,2,17,9,4,240,0,286
9,0,0,0,0,1,34,6,13,1,232,287


In [436]:
# Make a set of predictions for the test data
y_pred = my_model.predict(X_test,"cosine")

# Print performance details
accuracy = metrics.accuracy_score(y_test, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))

# Print confusion matrix
# print(metrics.confusion_matrix(y_test, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
pd.crosstab(np.array(y_test), y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Accuracy: 0.6911111111111111
              precision    recall  f1-score   support

           0       0.74      0.70      0.72       189
           1       0.95      0.89      0.92       174
           2       0.65      0.54      0.59       191
           3       0.73      0.86      0.79       202
           4       0.54      0.69      0.60       186
           5       0.59      0.29      0.39       166
           6       0.30      0.25      0.27       151
           7       0.69      0.86      0.77       173
           8       0.91      0.82      0.86       192
           9       0.71      0.91      0.80       176

    accuracy                           0.69      1800
   macro avg       0.68      0.68      0.67      1800
weighted avg       0.69      0.69      0.68      1800

Confusion Matrix


True \ Predicted,0,1,2,3,4,5,6,7,8,9,All
0,133,3,0,25,3,0,22,0,3,0,189
1,9,154,2,6,2,0,0,0,1,0,174
2,0,0,104,2,54,0,29,0,2,0,191
3,7,4,0,173,4,0,13,0,1,0,202
4,0,1,28,14,128,0,13,0,2,0,186
5,0,0,0,0,0,48,1,59,1,57,166
6,29,0,26,9,45,0,37,0,5,0,151
7,0,0,0,0,0,13,0,149,1,10,173
8,2,0,1,8,2,8,8,6,157,0,192
9,0,0,0,0,0,12,1,2,0,161,176


## Do a Cross Validation Experiment With Our Model

In [437]:
my_model = TemplateMatchClassifier()
scores = cross_val_score(my_model, X_train_plus_valid, y_train_plus_valid, cv=cv_folds, n_jobs=-1)
print(scores)

[0.67058824 0.68009479 0.66824645 0.65795724 0.63809524 0.68809524
 0.72488038 0.67942584 0.6882494  0.70023981]
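
Per-fold scores are usually summarised by their mean and spread. A minimal sketch using the ten fold accuracies printed above (the variable name `fold_scores` is chosen for this example):

```python
import numpy as np

# The ten fold accuracies reported by cross_val_score above
fold_scores = np.array([0.67058824, 0.68009479, 0.66824645, 0.65795724, 0.63809524,
                        0.68809524, 0.72488038, 0.67942584, 0.6882494, 0.70023981])

# Mean accuracy across folds and the standard deviation between folds
print("mean accuracy: %.3f" % fold_scores.mean())
print("std across folds: %.3f" % fold_scores.std())
```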


## Do a Grid Search Through Distance Metrics

In [440]:
# Set up the parameter grid to search
param_grid = [
 {'add_noise': [False, True]}
]

# Perform the search
my_tuned_model = GridSearchCV(TemplateMatchClassifier(dis_mat='manhattan'), param_grid, cv=cv_folds, verbose = 2, n_jobs=-1)
my_tuned_model.fit(X_train_plus_valid, y_train_plus_valid)

# Print details
print("Best parameters set found on development set:")
print(my_tuned_model.best_params_)
print(my_tuned_model.best_score_)


Fitting 10 folds for each of 2 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 out of  20 | elapsed:    0.7s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    0.9s finished


Best parameters set found on development set:
{'add_noise': False}
0.6123809523809524
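
The grid above varies only `add_noise` while the distance metric stays fixed at `'manhattan'`. A search over the metric itself could look like the following sketch. It is self-contained and therefore uses a compact hypothetical stand-in estimator (`NearestTemplate`, whose `metric` parameter is passed to `scipy.spatial.distance.cdist`) and the Iris data, rather than the notebook's class and dataset.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV


class NearestTemplate(ClassifierMixin, BaseEstimator):
    """Hypothetical compact stand-in for a template-matching classifier."""

    def __init__(self, metric="euclidean"):
        self.metric = metric

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes_ = np.unique(y)
        # One mean vector (template) per class, stacked row by row
        self.templates_ = np.vstack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance from every query row to every template
        dists = cdist(np.asarray(X, dtype=float), self.templates_, metric=self.metric)
        return self.classes_[dists.argmin(axis=1)]


iris_demo = load_iris()
metric_grid = {"metric": ["euclidean", "cityblock", "chebyshev", "cosine"]}
metric_search = GridSearchCV(NearestTemplate(), metric_grid, cv=5)
metric_search.fit(iris_demo.data, iris_demo.target)

print(metric_search.best_params_)
print(round(metric_search.best_score_, 3))
```

Because the metric is a constructor parameter here, `GridSearchCV` can clone the estimator and set it per candidate, which is not possible when the metric is only passed as an extra argument to `predict`.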


In [443]:
# Make a set of predictions for the test data
y_pred = my_tuned_model.predict(X_test)

# Print performance details
accuracy = metrics.accuracy_score(y_test, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(np.array(y_test), y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Accuracy: 0.6227777777777778
              precision    recall  f1-score   support

           0       0.86      0.63      0.73       189
           1       0.72      0.95      0.82       174
           2       0.59      0.24      0.34       191
           3       0.61      0.67      0.64       202
           4       0.47      0.70      0.56       186
           5       0.45      0.57      0.50       166
           6       0.30      0.20      0.24       151
           7       0.57      0.90      0.69       173
           8       0.97      0.50      0.66       192
           9       0.87      0.85      0.86       176

    accuracy                           0.62      1800
   macro avg       0.64      0.62      0.60      1800
weighted avg       0.65      0.62      0.61      1800

Confusion Matrix


Predicted,0,1,2,3,4,5,6,7,8,9,All
True,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
0,119,8,0,35,9,11,5,0,2,0,189
1,1,166,1,3,2,0,1,0,0,0,174
2,0,0,45,3,76,24,42,1,0,0,191
3,1,44,0,135,9,7,6,0,0,0,202
4,0,7,13,15,131,9,11,0,0,0,186
5,0,0,0,0,0,94,0,66,0,6,166
6,17,4,16,15,37,31,30,0,1,0,151
7,0,0,0,0,0,9,0,156,0,8,173
8,1,1,1,15,15,18,5,32,96,8,192
9,0,0,0,1,0,5,0,21,0,149,176


Demo the predict_proba function.

In [None]:
# Make a set of predictions for the test data
y_pred = my_tuned_model.predict_proba(X_test)
_ = pd.DataFrame(y_pred).hist(figsize = (10,10))