# Machine Learning in Python - Roll your Own Estimator Example

This notebook demonstrates how scikit-learn can be extended to include new models by implementing the **EducatedGuessClassifier**.

The **EducatedGuessClassifier** is a **very naive** classification algorithm that calculates the distribution across classes in a training dataset and, when asked to make a prediction, returns a random class selected according to that distribution. The EducatedGuessClassifier only works for categorical target features.

The EducatedGuessClassifier is very simple:
* **Training:** Calculate the distribution across the target levels in the training dataset and store it as a map from each level to its relative frequency.
* **Prediction:** When a prediction is needed, draw a random value from the distribution defined by the training dataset.
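The two steps above can be sketched in plain Python before any scikit-learn machinery is involved. This is only an illustrative sketch: the helper names `fit_distribution` and `guess` are hypothetical and are not part of the class defined later in the notebook.

```python
import random
from collections import Counter


def fit_distribution(labels):
    """Return {class: relative frequency} for a list of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def guess(distribution, n):
    """Draw n class labels at random, weighted by the learned distribution."""
    classes = list(distribution)
    weights = list(distribution.values())
    return random.choices(classes, weights=weights, k=n)


distribution = fit_distribution([1, 2, 2, 2])
print(distribution)            # {1: 0.25, 2: 0.75}
print(guess(distribution, 5))  # five labels drawn from {1, 2}; varies per run
```

Note that the descriptive features never appear in this sketch: the guesses depend only on the target values seen during training.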

**NOTE THAT THE EDUCATEDGUESSCLASSIFIER IS A TERRIBLE MODEL AND IS ONLY USED AS A VERY SIMPLE DEMONSTRATION OF HOW TO IMPLEMENT AN ML ALGORITHM IN SCIKIT-LEARN**

## Import Packages Etc

In [None]:
from IPython.display import display, HTML, Image

from TAS_Python_Utilities import data_viz
from TAS_Python_Utilities import data_viz_target
from TAS_Python_Utilities import visualize_tree

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import pyplot
import random

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array, check_is_fitted
from sklearn.utils.multiclass import unique_labels
from sklearn import metrics
from scipy.spatial import distance


%matplotlib inline
#%qtconsole

## Define EducatedGuessClassifier

Define and test the EducatedGuessClassifier class. To build a scikit-learn classifier we extend the **BaseEstimator** and **ClassifierMixin** classes and implement the **`__init__`**, **fit**, **predict**, and **predict_proba** methods.
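To see what the base classes contribute, the sketch below uses a hypothetical `ConstantClassifier` (not part of this notebook) whose constructor only stores its argument. Under that assumption, `BaseEstimator` supplies `get_params` and `set_params`, and `clone` can build an unfitted copy, which is what tools such as `cross_val_score` and `GridSearchCV` rely on. The mixin is listed before `BaseEstimator` here, which, to the best of our reading, is the ordering scikit-learn's developer guide suggests.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone


class ConstantClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier that always predicts the same label."""

    def __init__(self, constant=0):
        # Only store constructor arguments; no validation or computation here
        self.constant = constant

    def fit(self, X, y):
        # Attributes learned during fit end with an underscore by convention
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        return np.full(len(X), self.constant)


clf = ConstantClassifier(constant=1)
clf.fit([[0], [1]], [0, 1])

params = clf.get_params()  # read back the constructor arguments
copy = clone(clf)          # same parameters, but not fitted

print(params)
print(hasattr(clf, "classes_"), hasattr(copy, "classes_"))
```

The clone carries the constructor parameters but none of the fitted state, so everything learned from data belongs in `fit`, not in `__init__`.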

### Define the EducatedGuessClassifier Class

In [None]:
# Create a new classifier which is based on the scikit-learn BaseEstimator and ClassifierMixin classes
class EducatedGuessClassifier(BaseEstimator, ClassifierMixin):
    """The EducatedGuessClassifier is a very naive classification algorithm that calculates the distribution across classes in a training dataset and when asked to make a prediction returns a random class selected according to that distribution. The EducatedGuessClassifier only works for categorical target features. 
        - Training: Calculate the distribution across the target levels in the training dataset and store it as a map.
        - Prediction: When a prediction is needed, draw a random value from the distribution defined by the training dataset.

    Parameters
    ----------
    add_noise : bool, optional (default = False)
        Whether or not a little bit of noise should be added to the distribution.

    Attributes
    ----------
    classes_ : array of shape = [n_classes] 
        The class labels (single output problem).
    distribution_ : dict
        A dictionary of the probability of each class.

    Examples
    --------
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.model_selection import cross_val_score
    >>> clf = EducatedGuessClassifier()
    >>> iris = load_iris()
    >>> cross_val_score(clf, iris.data, iris.target, cv=10)

    """
    
    # Constructor for the classifier object
    def __init__(self, add_noise = False):
        self.add_noise = add_noise

    # The fit function to train a classifier
    def fit(self, X, y):
        """Estimate the class distribution from the training set (X, y).
        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            The training input samples. They are validated but otherwise
            ignored, because the model only uses the target values.
        y : array-like, shape = [n_samples]
            The target values (class labels) as integers or strings.
        Returns
        -------
        self : object
        """
            
        # Check that X and y have correct shape
        X, y = check_X_y(X, y)

        # Count the number of occurrences of each class in the target vector (uses the NumPy unique function, which returns the unique values and their counts)
        unique, counts = np.unique(y, return_counts=True)
        
        # Store the classes seen during fit
        self.classes_ = unique

        # Normalise the counts to sum to 1
        dist = counts/sum(counts)
            
        # If the add_noise attribute is true add a little noise to the distribution
        if self.add_noise:
            for i in range(len(dist)):
                dist[i] = dist[i] + dist[i]*random.uniform(-0.25, 0.25)
            # Renormalise the distribution
            dist = dist/sum(dist)
            
        # Create a new dictionary of classes and their normalised frequencies (the distribution)
        self.distribution_ = dict(zip(unique, dist))
        
        # Return the classifier
        return self

    # The predict function to make a set of predictions for a set of query instances
    def predict(self, X):
        """Predict class labels of the input samples X.
        Parameters
        ----------
        X : array-like of shape = [n_samples, n_features]
            The input samples. They are validated, but only the number of
            rows is used: one class label is drawn per row.
        Returns
        -------
        p : array of shape = [n_samples, ].
            The predicted class labels of the input samples. 
        """
        
        # Check that fit has been called by confirming that the distribution_ dictionary has been set up
        check_is_fitted(self, ['distribution_'])

        # Check that the input features match the type and shape of the training features
        X = check_array(X)

        # Initialise an empty list to store the predictions made
        predictions = list()
        
        # Iterate through the query instances in the query dataset 
        for instance in X:
            
            #Generate a random class according to the learned distribution
            pred = random.choices(list(self.distribution_.keys()), list(self.distribution_.values()))
            
            predictions.append(pred[0])
            
        return np.array(predictions)
    
    
    # The predict_proba function to generate class probabilities for a set of query instances
    def predict_proba(self, X):
        """Predict class probabilities of the input samples X.
        Parameters
        ----------
        X : array-like matrix of shape = [n_samples, n_features]
            The input samples. 
        Returns
        -------
        p : array of shape = [n_samples, n_classes].
            The predicted class label probabilities of the input samples. 
        """
        
        # Check that fit has been called by confirming that the distribution_ dictionary has been set up
        check_is_fitted(self, ['distribution_'])

        # Check that the input features match the type and shape of the training features
        X = check_array(X)

        # Initialise an array to store the prediction scores generated
        predictions = np.zeros((len(X), len(self.classes_)))

        # Iterate through the query instances in the query dataset 
        for idx, instance in enumerate(X):
            
            #Generate a random class according to the learned distribution
            pred = random.choices(list(self.distribution_.keys()), list(self.distribution_.values()))[0]

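            # Always give the predicted class a probability of 0.9 and share the remaining probability mass (0.1) equally among the other classes.
            # Note: this assumes at least two classes were seen during fit (otherwise the division below is by zero).
            predictions[idx, :] = 0.1/(len(self.classes_) - 1)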
            predictions[idx, list(self.classes_).index(pred)] = 0.9
            
        return predictions
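
The `predict` method above draws one label per query instance in a Python loop. As a minimal sketch of an alternative, assuming NumPy's `Generator.choice` is acceptable in place of `random.choices`, all labels can be drawn in a single call. The helper name `sample_labels` and its `seed` argument are hypothetical and not part of the class.

```python
import numpy as np


def sample_labels(classes, probabilities, n_samples, seed=None):
    """Draw n_samples labels from classes, weighted by probabilities."""
    rng = np.random.default_rng(seed)
    return rng.choice(classes, size=n_samples, p=probabilities)


classes = np.array([1, 2])
probabilities = np.array([0.25, 0.75])

labels = sample_labels(classes, probabilities, n_samples=1000, seed=0)
print(labels[:10])
print((labels == 2).mean())  # share of label 2, expected to be close to 0.75
```

Passing a seed makes the guesses reproducible, whereas the class as written relies on the global state of Python's `random` module.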

### Test the EducatedGuessClassifier

Do a simple test of the EducatedGuessClassifier

In [None]:
a = np.array([[1,23,3,4], [5,6,7,8], [7,5,6,2], [4,9,12,43]])
y = np.array([1, 2, 2, 2])

In [None]:
my_model = EducatedGuessClassifier()

In [None]:
my_model.fit(a, y)

In [None]:
my_model.distribution_

In [None]:
q = np.array([[2,15,6,21], [8,9,7,6]])

In [None]:
my_model.predict(q)

In [None]:
my_model.predict_proba(q)

Fit a model to the iris dataset

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

clf = EducatedGuessClassifier()
clf.fit(iris.data, iris.target)
clf.distribution_

Do a simple Iris cross-validation experiment

In [None]:
clf = EducatedGuessClassifier()
cross_val_score(clf, iris.data, iris.target, cv=10)
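
As a rough sanity check on these scores: if the test labels follow the same distribution as the training labels and each guess is independent of the true label, the expected accuracy is the sum of the squared class probabilities. The sketch below works this out under those assumptions; `expected_accuracy` is a hypothetical helper, and the Iris distribution is written out by hand from its 50 instances per class.

```python
def expected_accuracy(distribution):
    """Probability that a random guess matches a random true label."""
    return sum(p * p for p in distribution.values())


iris_distribution = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}

print(expected_accuracy(iris_distribution))   # approximately 0.333
print(expected_accuracy({1: 0.25, 2: 0.75}))  # 0.625
```

Individual fold scores will scatter around this value because each fold contains only a handful of instances.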

Fit a model to the iris dataset with noise added to the distribution

In [None]:
clf = EducatedGuessClassifier(add_noise = True)
clf.fit(iris.data, iris.target)
clf.distribution_
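
The noise step in `fit` scales each probability by a random factor between 0.75 and 1.25 and then renormalises. The sketch below reproduces that step on a plain dictionary so the effect can be inspected in isolation; the function name `add_noise_to` and its `scale` argument are hypothetical.

```python
import random


def add_noise_to(distribution, scale=0.25):
    """Perturb each probability by up to +/- scale (relative), then renormalise."""
    noisy = {
        label: p * (1 + random.uniform(-scale, scale))
        for label, p in distribution.items()
    }
    total = sum(noisy.values())
    return {label: p / total for label, p in noisy.items()}


original = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
noisy = add_noise_to(original)

print(noisy)                # varies per run
print(sum(noisy.values()))  # 1.0 up to floating-point rounding
```

With the default scale, each renormalised probability stays between 0.6 and roughly 1.67 times its original value, so the noise perturbs the distribution without changing which classes are possible.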

In [None]:
from sklearn.datasets import load_iris
clf = EducatedGuessClassifier(add_noise = True)
iris = load_iris()
cross_val_score(clf, iris.data, iris.target, cv=10)

## Load & Partition Data

### Setup - IMPORTANT

Take only a sample of the dataset for fast testing

In [None]:
data_sampling_rate = 0.1

Set up the number of folds for all grid searches (should be 5 - 10)

In [None]:
cv_folds = 10

### Load & Partition Data

Load the dataset and explore it.

In [None]:
dataset = pd.read_csv('fashion-mnist_train.csv')
dataset = dataset.sample(frac=data_sampling_rate) # take a sample from the dataset so everything runs smoothly
num_classes = 10
classes = {0: "T-shirt/top", 1:"Trouser", 2: "Pullover", 3:"Dress", 4:"Coat", 5:"Sandal", 6:"Shirt", 7:"Sneaker", 8:"Bag", 9:"Ankle boot"}
display(dataset.head())

Isolate the descriptive features we are interested in

In [None]:
X = dataset[dataset.columns.difference(["label"])]
Y = np.array(dataset["label"])

In [None]:
X = X/255

In [None]:
X_train_plus_valid, X_test, y_train_plus_valid, y_test \
    = train_test_split(X, Y, random_state=0, \
                                    train_size = 0.7)

X_train, X_valid, y_train, y_valid \
    = train_test_split(X_train_plus_valid, \
                                        y_train_plus_valid, \
                                        random_state=0, \
                                        train_size = 0.5/0.7)
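
The two `train_size` values above combine into an overall split of roughly 50% training, 20% validation, and 30% test. The arithmetic is worked through below; the variable names are illustrative only.

```python
first_split_train = 0.7         # share kept for training + validation
second_split_train = 0.5 / 0.7  # share of that subset kept for training

train = first_split_train * second_split_train
valid = first_split_train * (1 - second_split_train)
test = 1 - first_split_train

print(round(train, 3), round(valid, 3), round(test, 3))  # 0.5 0.2 0.3
```

The actual partition sizes can differ slightly from these fractions because `train_test_split` has to work with whole instances.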

## Train and Evaluate a Simple Model

In [None]:
my_model = EducatedGuessClassifier(add_noise = True)
my_model.fit(X_train, y_train)

In [None]:
# Make a set of predictions for the training data
y_pred = my_model.predict(X_train)

# Print performance details
accuracy = metrics.accuracy_score(y_train, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_train, y_pred))

# Print confusion matrix
# print(metrics.confusion_matrix(y_train, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
display(pd.crosstab(y_train, y_pred, rownames=['True'], colnames=['Predicted'], margins=True))

In [None]:
# Make a set of predictions for the test data
y_pred = my_model.predict(X_test)

# Print performance details
accuracy = metrics.accuracy_score(y_test, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))

# Print confusion matrix
# print(metrics.confusion_matrix(y_test, y_pred))

# Print nicer homemade confusion matrix
print("Confusion Matrix")
pd.crosstab(np.array(y_test), y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

## Do a Cross Validation Experiment With Our Model

In [None]:
my_model = EducatedGuessClassifier()
scores = cross_val_score(my_model, X_train_plus_valid, y_train_plus_valid, cv=cv_folds, n_jobs=-1)
print(scores)
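
For comparison, scikit-learn ships a comparable baseline: `DummyClassifier` with `strategy="stratified"` also draws its predictions from the class distribution seen during training. The sketch below runs it on the Iris dataset rather than Fashion-MNIST so that it is self-contained; the choice of dataset and `random_state` is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()

# Baseline that guesses labels according to the training class distribution
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline_scores = cross_val_score(baseline, iris.data, iris.target, cv=10)

print(baseline_scores)
print(baseline_scores.mean())
```

Scores from the EducatedGuessClassifier without noise should land in a similar range on the same data, which makes this a quick check that the custom estimator behaves as intended.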

## Do a Grid Search Over the add_noise Parameter

In [None]:
# Set up the parameter grid to search
param_grid = [
 {'add_noise': [False, True]}
]

# Perform the search
my_tuned_model = GridSearchCV(EducatedGuessClassifier(), param_grid, cv=cv_folds, verbose = 2, n_jobs=-1)
my_tuned_model.fit(X_train_plus_valid, y_train_plus_valid)

# Print details
print("Best parameters set found on development set:")
print(my_tuned_model.best_params_)
print(my_tuned_model.best_score_)


In [None]:
# Make a set of predictions for the test data
y_pred = my_tuned_model.predict(X_test)

# Print performance details
accuracy = metrics.accuracy_score(y_test, y_pred) # , normalize=True, sample_weight=None
print("Accuracy: " +  str(accuracy))
print(metrics.classification_report(y_test, y_pred))

# Print confusion matrix
print("Confusion Matrix")
pd.crosstab(np.array(y_test), y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

Demo the predict_proba function.

In [None]:
# Make a set of predictions for the test data
y_pred = my_tuned_model.predict_proba(X_test)
_ = pd.DataFrame(y_pred).hist(figsize = (10,10))