# COMP47590 Advanced Machine Learning
# Roll your Own Estimator Example

This notebook demonstrates how scikit-learn can be extended to include new models by implementing the **EducatedGuessClassifier**.

The **EducatedGuessClassifier** is a **very naive** classification algorithm that calculates the distribution across classes in a training dataset and, when asked to make a prediction, returns a random class selected according to that distribution. The EducatedGuessClassifier only works for categorical target features.

The EducatedGuessClassifier is very simple:
* **Training:** Calculate the distribution across the target levels in the training dataset and store it as a map from class to probability.
* **Prediction:** When a prediction is needed, draw a random class from the distribution learned from the training dataset.
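
The two steps can be sketched in a few lines of plain NumPy, outside of any scikit-learn class. This is only an illustration: the toy labels, the seed, and the variable names are assumptions made for the example.

```python
import numpy as np

# Hypothetical toy target vector; any array of class labels would do
y_train = np.array(["a", "b", "b", "b"])

# Training: estimate the class distribution from the label frequencies
classes, counts = np.unique(y_train, return_counts=True)
probs = counts / counts.sum()
distribution = dict(zip(classes, probs))

# Prediction: draw one label per query instance from that distribution
rng = np.random.RandomState(0)  # seed chosen arbitrarily for repeatability
n_queries = 5
predictions = rng.choice(classes, size=n_queries, p=probs)

print(distribution)
print(predictions)
```

Note that the query features play no part in the prediction, which is why this approach is only useful as a teaching example or a baseline.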

More details of building classifiers are available in the scikit-learn documentation: [https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator](https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator)

There is also a nice GitHub repository containing template projects (and a nice example classifier implementation) for creating your own scikit-learn contributions: [https://github.com/scikit-learn-contrib/project-template/](https://github.com/scikit-learn-contrib/project-template/)

**NOTE THAT THE EDUCATEDGUESSCLASSIFIER IS A TERRIBLE MODEL AND IS ONLY USED AS A VERY SIMPLE DEMONSTRATION OF HOW TO IMPLEMENT AN ML ALGORITHM IN SCIKIT-LEARN**

## Import Packages Etc

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import sklearn.model_selection
import sklearn.base
import sklearn.utils.validation
import sklearn.metrics
import sklearn.datasets
import sklearn.utils

%matplotlib inline

## The EducatedGuessClassifier

Define and test a simple custom classifier using scikit-learn.

### Define the EducatedGuessClassifier Class

Define the EducatedGuessClassifier class. To build a scikit-learn classifier we extend the **BaseEstimator** and **ClassifierMixin** classes and implement the **\_\_init\_\_**, **fit**, **predict**, and **predict_proba** methods.

In [22]:
# Create a new classifier which is based on the scikit-learn BaseEstimator and ClassifierMixin classes
class EducatedGuessClassifier(sklearn.base.BaseEstimator, sklearn.base.ClassifierMixin):
    """The EducatedGuessClassifier is a very naive classification algorithm that calculates the distribution across classes in a training dataset and when asked to make a prediction returns a random class selected according to that distribution. The EducatedGuessClassifier only works for categorical target features.
        - Training: Calculate the distribution across the target levels in the training dataset and store it as a map.
        - Prediction: When a prediction is needed, draw a random class from the distribution learned from the training dataset.

    Parameters
    ----------
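    add_noise : bool, optional (default=False)
        Whether or not a little bit of noise should be added to the distribution.
    random_state : int, RandomState instance or None, optional (default=None)
        Seed or generator used to draw the noise and the random predictions.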

    Attributes
    ----------
    classes_ : array of shape = [n_classes]
        The class labels (single output problem).
    distribution_: dict
        A dictionary of the probability of each class.

    Examples
    --------
    >>> from sklearn.datasets import load_iris
    >>> from sklearn.model_selection import cross_val_score
    >>> clf = EducatedGuessClassifier()
    >>> iris = load_iris()
    >>> cross_val_score(clf, iris.data, iris.target, cv=10)

    """

    # Constructor for the classifier object
    def __init__(self, add_noise = False, random_state=None):
        self.random_state = random_state
        self.add_noise = add_noise

    # The fit function to train a classifier
    def fit(self, X, y):
        """Build an educated guess classifier from the training set (X, y).
        Parameters
        ----------
        X : array-like, shape = [n_samples, n_features]
            The training input samples. They are validated but otherwise
            unused, since this model only looks at the target values.
        y : array-like, shape = [n_samples]
            The target values (class labels) as integers or strings.
        Returns
        -------
        self : object
        """

        # Check that X and y have correct shape
        X, y = sklearn.utils.validation.check_X_y(X, y)

        # Set up the random number generator to be used to generate
        # predictions - this follows the recommended scikit-learn pattern
        # https://scikit-learn.org/stable/developers/develop.html#coding-guidelines
        self.random_state_ = sklearn.utils.validation.check_random_state(self.random_state)

        # Count the number of occurrences of each class in the target vector (uses the NumPy unique function, which returns the sorted unique values and their counts)
        unique, counts = np.unique(y, return_counts=True)

        # Store the classes seen during fit
        self.classes_ = unique

        # Normalise the counts to sum to 1
        dist = counts/sum(counts)

        # If the add_noise attribute is true add a little noise to the distribution
        if self.add_noise:
            for i in range(len(dist)):
                dist[i] = dist[i] + dist[i] * self.random_state_.uniform(-0.1, 0.1)
            # Renormalise the distribution
            dist = dist / sum(dist)

        # Create a new dictionary of classes and their normalised frequencies (the distribution)
        self.distribution_ = dict(zip(unique, dist))

        # Return the classifier
        return self

    # The predict function to make a set of predictions for a set of query instances
    def predict(self, X):
        """Predict class labels of the input samples X.
        Parameters
        ----------
        X : array-like of shape = [n_samples, n_features]
            The input samples. They are validated but otherwise unused;
            only the number of rows determines the output.
        Returns
        -------
        p : array of shape = [n_samples, ].
            The predicted class labels of the input samples.
        """

        # Check that fit has been called by confirming that the distribution_ dictionary has been set up
        sklearn.utils.validation.check_is_fitted(self, ['distribution_'])

        # Check that the input features match the type and shape of the training features
        X = sklearn.utils.validation.check_array(X)

        # Initialise an empty list to store the predictions made
        predictions = list()

        # Iterate through the query instances in the query dataset
        for instance in X:

            # Generate a random class according to the learned distribution
            pred = self.random_state_.choice(list(self.distribution_.keys()), p = list(self.distribution_.values()))

            predictions.append(pred)

        return np.array(predictions)


    # The predict_proba function to generate class probability scores for a set of query instances
    def predict_proba(self, X):
        """Predict class probabilities of the input samples X.
        Parameters
        ----------
        X : array-like matrix of shape = [n_samples, n_features]
            The input samples.
        Returns
        -------
        p : array of shape = [n_samples, n_labels].
            The predicted class label probabilities of the input samples.
        """

        # Check that fit has been called by confirming that the distribution_ dictionary has been set up
        sklearn.utils.validation.check_is_fitted(self, ['distribution_'])

        # Check that the input features match the type and shape of the training features
        X = sklearn.utils.validation.check_array(X)

        # Initialise an array to store the prediction scores generated
        predictions = np.zeros((len(X), len(self.classes_)))

        # Iterate through the query instances in the query dataset
        for idx, instance in enumerate(X):

            # Generate a random class according to the learned distribution
            pred = self.random_state_.choice(list(self.distribution_.keys()), p = list(self.distribution_.values()))

            # Always give the predicted class a probability of 0.9 and share the remaining probability mass equally among the other classes
            predictions[idx, :] = 0.1 / (len(self.classes_) - 1)
            predictions[idx, list(self.classes_).index(pred)] = 0.9

        return predictions
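
The constructor above only stores its arguments under the same names, which is the convention that lets scikit-learn tools inspect and copy an estimator. A minimal sketch of what inheriting from `BaseEstimator` provides, using a hypothetical stand-in class (`TinyEstimator` is an assumption for illustration, not part of this notebook's classifier):

```python
from sklearn.base import BaseEstimator, clone


class TinyEstimator(BaseEstimator):
    """Hypothetical stand-in estimator used only to illustrate the convention."""

    def __init__(self, add_noise=False, random_state=None):
        # Store arguments verbatim under the same names; no validation here
        self.add_noise = add_noise
        self.random_state = random_state


est = TinyEstimator(add_noise=True, random_state=7)

# get_params is inherited from BaseEstimator and reads the __init__ signature
params = est.get_params()

# clone builds a fresh, unfitted copy from those parameters
est_copy = clone(est)

# set_params is how search tools switch an estimator to new settings
est.set_params(add_noise=False)

print(params)
```

Cross-validation and grid search helpers rely on this behaviour to create one independent copy of the estimator per fit.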

### Test The EducatedGuessClassifier With a Toy Dataset

Define a toy dataset.

In [23]:
a = np.array([[1,23,3,4], [5,6,7,8], [7,5,6,2], [4,9,12,43]])
y = np.array([1, 2, 2, 2])

In [24]:
sklearn.utils.resample(a, y, n_samples = 2)

[array([[ 4,  9, 12, 43],
        [ 7,  5,  6,  2]]),
 array([2, 2])]

Create and fit an EducatedGuessClassifier object.

In [25]:
my_model = EducatedGuessClassifier()
my_model.fit(a, y)

Examine the extracted distribution.

In [26]:
my_model.distribution_

{1: 0.25, 2: 0.75}

Define some simple query data.

In [27]:
q = np.array([[2,15,6,21], [8,9,7,6]])

Make predictions for the query instances.

In [28]:
my_model.predict(q)

array([1, 2])
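
Because `random_state` was left at its default of `None`, the predictions above will generally differ from run to run. The sketch below uses NumPy's `RandomState` directly (the kind of object `check_random_state` returns for an integer seed) to show that a fixed seed makes the draws repeatable; the seed value and variable names are assumptions for illustration.

```python
import numpy as np

classes = [1, 2]
probs = [0.25, 0.75]

# Two generators built from the same (arbitrary) seed
rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(42)

draws_a = rng_a.choice(classes, size=10, p=probs)
draws_b = rng_b.choice(classes, size=10, p=probs)

# Identical seeds give identical sequences of sampled labels
print(np.array_equal(draws_a, draws_b))  # prints True
```

Passing an integer `random_state` to the classifier's constructor should have the same effect on its predictions after each call to `fit`.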

Make predictions with scores for the query instances.

In [29]:
my_model.predict_proba(q)

array([[0.1, 0.9],
       [0.1, 0.9]])
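
The scoring rule behind these outputs (0.9 for the sampled class, the rest shared equally) can be sketched as a standalone, vectorised helper. The function name and its arguments are hypothetical, and the sketch assumes at least two classes.

```python
import numpy as np


def guess_scores(pred_indices, n_classes, top=0.9):
    """Hypothetical helper: give each chosen class `top`, share the rest equally.

    Assumes n_classes >= 2, otherwise the division below fails.
    """
    pred_indices = np.asarray(pred_indices)
    rest = (1.0 - top) / (n_classes - 1)
    scores = np.full((len(pred_indices), n_classes), rest)
    # Overwrite the column of the chosen class in every row
    scores[np.arange(len(pred_indices)), pred_indices] = top
    return scores


# Three classes, with class 1 and then class 0 chosen for two query instances
scores = guess_scores([1, 0], n_classes=3)
print(scores)
```

Each row sums to 1, so the result is a valid set of class probabilities even though it says nothing about the query instance itself.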

## Test the Model

Load the iris dataset.

In [39]:
iris = sklearn.datasets.load_iris()

Examine the target values of the iris dataset and their distribution before fitting a model.

In [40]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [41]:
pd.Series(iris.target).value_counts()

|   | count |
|---|-------|
| 0 | 50 |
| 1 | 50 |
| 2 | 50 |


In [42]:
x_s, y_s = sklearn.utils.resample(iris.data, iris.target, n_samples = 20,
                    stratify = iris.target)
x_s

array([[5.8, 2.8, 5.1, 2.4],
       [5.5, 2.4, 3.8, 1.1],
       [6.1, 2.8, 4. , 1.3],
       [5.8, 2.7, 4.1, 1. ],
       [6.2, 2.8, 4.8, 1.8],
       [4.4, 3.2, 1.3, 0.2],
       [6.4, 2.8, 5.6, 2.2],
       [4.6, 3.4, 1.4, 0.3],
       [5.4, 3.4, 1.5, 0.4],
       [4.3, 3. , 1.1, 0.1],
       [4.7, 3.2, 1.6, 0.2],
       [7.7, 2.6, 6.9, 2.3],
       [6.2, 2.9, 4.3, 1.3],
       [4.9, 2.4, 3.3, 1. ],
       [7. , 3.2, 4.7, 1.4],
       [5.7, 3.8, 1.7, 0.3],
       [6. , 3. , 4.8, 1.8],
       [7. , 3.2, 4.7, 1.4],
       [4.4, 3.2, 1.3, 0.2],
       [6.7, 3.1, 5.6, 2.4]])

In [43]:
pd.Series(y_s).value_counts()

|   | count |
|---|-------|
| 1 | 7 |
| 0 | 7 |
| 2 | 6 |


In [45]:
clf = EducatedGuessClassifier()
clf.fit(iris.data, iris.target)
clf.distribution_

{0: 0.3333333333333333, 1: 0.3333333333333333, 2: 0.3333333333333333}

In [46]:
clf.predict(iris.data)

array([0, 1, 2, 0, 2, 0, 0, 2, 0, 2, 0, 1, 0, 1, 0, 0, 2, 2, 2, 1, 1, 2,
       2, 0, 1, 2, 2, 0, 2, 2, 2, 0, 1, 0, 0, 0, 1, 1, 2, 1, 0, 1, 0, 2,
       0, 0, 0, 1, 0, 2, 1, 2, 2, 0, 1, 0, 2, 0, 2, 0, 2, 1, 2, 1, 1, 1,
       1, 1, 0, 2, 2, 1, 2, 0, 1, 0, 1, 1, 0, 2, 1, 2, 1, 0, 1, 2, 0, 1,
       2, 2, 0, 2, 2, 0, 2, 2, 2, 0, 1, 0, 0, 1, 2, 1, 2, 1, 2, 0, 2, 1,
       0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 2, 0, 2, 2, 1, 0, 1, 2,
       2, 0, 0, 0, 2, 1, 0, 1, 2, 2, 1, 0, 0, 2, 0, 2, 0, 0])

Run a simple cross-validation experiment using the iris dataset.

In [48]:
clf = EducatedGuessClassifier()
sklearn.model_selection.cross_val_score(clf, iris.data, iris.target, cv=10)

array([0.4       , 0.2       , 0.2       , 0.6       , 0.33333333,
       0.2       , 0.26666667, 0.2       , 0.4       , 0.2       ])
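
Fold scores this noisy are expected from a model that guesses. As a rough guide, if the guessed label and the true label are independent draws from the same class distribution, the expected accuracy is the sum of the squared class probabilities. The sketch below computes that value and checks it with a quick simulation; the function name, seed, and sample size are assumptions for illustration.

```python
import numpy as np


def expected_guess_accuracy(probs):
    """Expected accuracy when true and guessed labels are independent draws from probs."""
    probs = np.asarray(probs, dtype=float)
    return float(np.sum(probs ** 2))


balanced = expected_guess_accuracy([1 / 3, 1 / 3, 1 / 3])  # three equal classes, as in iris
skewed = expected_guess_accuracy([0.25, 0.75])             # the toy dataset's distribution

# Quick simulation as a sanity check (seed chosen arbitrarily)
rng = np.random.RandomState(0)
truth = rng.choice(3, size=100_000, p=[1 / 3, 1 / 3, 1 / 3])
guess = rng.choice(3, size=100_000, p=[1 / 3, 1 / 3, 1 / 3])
simulated = float(np.mean(truth == guess))

print(balanced, skewed, simulated)
```

For three balanced classes this gives one third, so individual folds of 15 instances can easily land well above or below that value by chance.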

Fit a model to the iris dataset with noise added to the distribution.

In [49]:
clf = EducatedGuessClassifier(add_noise = True)
clf.fit(iris.data, iris.target)
clf.distribution_

{0: 0.30525451055384367, 1: 0.3620592923312889, 2: 0.3326861971148674}
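
The perturbation that produces a distribution like this one can be sketched in isolation: each probability is scaled by a random factor within 10% of 1 and the result is renormalised. This is a vectorised illustration of the loop in `fit`, with an arbitrary seed and variable names chosen for the example.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed for repeatability
dist = np.array([1 / 3, 1 / 3, 1 / 3])

# Scale each probability by its own random factor in [0.9, 1.1)
noisy = dist * (1.0 + rng.uniform(-0.1, 0.1, size=dist.shape))

# Renormalise so the result is a valid distribution again
noisy = noisy / noisy.sum()

print(noisy)
```

Because the noise is proportional to each probability and small, the class ordering of a clearly imbalanced distribution is preserved, while near-equal classes such as those in iris can swap places.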

Perform a cross validation experiment with the noisy version of the model.

In [50]:
sklearn.model_selection.cross_val_score(clf, iris.data, iris.target, cv=10)

array([0.4       , 0.53333333, 0.33333333, 0.26666667, 0.2       ,
       0.26666667, 0.2       , 0.26666667, 0.06666667, 0.33333333])

Perform a hyper-parameter grid search with our newly defined classifier.

In [51]:
cv_folds = 5
param_grid ={'add_noise': [False, True]}

# Perform the search
tuned_clf = sklearn.model_selection.GridSearchCV(EducatedGuessClassifier(),
                            param_grid, cv=cv_folds, verbose=2,
                            n_jobs=-1)

tuned_clf.fit(iris.data, iris.target)


Fitting 5 folds for each of 2 candidates, totalling 10 fits


In [52]:
tuned_clf.best_params_

{'add_noise': True}