# Build your Own Evaluation Framework

In this lab exercise we're going to build our own model evaluation framework. Pretty much every machine learning experiment follows the same template:

1. Load a dataset and split into train and test sets
2. Create a model and train it on your training data
3. Predict the labels for the test data and compare with the actual labels
4. Record whatever evaluation metrics you are using
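
As a rough preview, the four steps above can be sketched in a few lines. This sketch assumes that the copy of the wine data bundled with scikit-learn (`load_wine`) is an acceptable stand-in for the CSV file used in the rest of the lab; the seed value 13 and the choice of a decision tree are arbitrary.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 1. Load a dataset and split it into train and test sets
#    (load_wine is an assumed stand-in for wine.csv)
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=13
)

# 2. Create a model and train it on the training data
model = DecisionTreeClassifier(random_state=13)
model.fit(X_train, y_train)

# 3. Predict the labels for the test data
y_pred = model.predict(X_test)

# 4. Record a metric: here, the fraction of test rows predicted correctly
accuracy = (y_pred == y_test).mean()
print(f"Accuracy: {accuracy:.3f}")
```

The rest of the lab builds each of these steps by hand so that it is clear what the library calls are doing.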

For this example we're going to use the wine dataset [available from the UCI machine learning repository](https://archive.ics.uci.edu/ml/datasets/wine+quality). The csv file for this dataset is available in Brightspace.

In [1]:
import pandas as pd
import numpy as np
import math

df = pd.read_csv('wine.csv')

## Splitting our data into X and y

The first thing we want to do when we load up a dataset is separate the X and y data. If we accidentally leave our labels in the dataset when we train our models we'll get incredible results, because the model can simply read the answer from its input. They won't hold up in the real world though!

The pandas **pop()** method lets us extract a column from a dataframe. We're using pop() here to pull the Wine column out separately as our label column. Note that **pop()** has a side effect: as well as returning the column to us, it removes it from the original dataframe. Functions that both return a value and modify their input can make it difficult to see what's happening, which is why we call pop() on a copy of df below, but popping the label column is such a common use case for machine learning that the method is very convenient here.

In [2]:
X = df.copy()

y = X.pop('Wine').values

print("X Data")
print(X)

print("Labels")
print(y)

X Data
     Alcohol  Malic.acid   Ash   Acl   Mg  Phenols  Flavanoids  \
0      14.23        1.71  2.43  15.6  127     2.80        3.06   
1      13.20        1.78  2.14  11.2  100     2.65        2.76   
2      13.16        2.36  2.67  18.6  101     2.80        3.24   
3      14.37        1.95  2.50  16.8  113     3.85        3.49   
4      13.24        2.59  2.87  21.0  118     2.80        2.69   
..       ...         ...   ...   ...  ...      ...         ...   
173    13.71        5.65  2.45  20.5   95     1.68        0.61   
174    13.40        3.91  2.48  23.0  102     1.80        0.75   
175    13.27        4.28  2.26  20.0  120     1.59        0.69   
176    13.17        2.59  2.37  20.0  120     1.65        0.68   
177    14.13        4.10  2.74  24.5   96     2.05        0.76   

     Nonflavanoid.phenols  Proanth  Color.int   Hue    OD  Proline  
0                    0.28     2.29       5.64  1.04  3.92     1065  
1                    0.26     1.28       4.38  1.05  3.40     
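
The side effect of pop() is easiest to see on a tiny dataframe. The sketch below uses made-up column names and values (`feature`, `label`), not the wine data.

```python
import pandas as pd

# Made-up toy data, purely for illustration
toy = pd.DataFrame({"feature": [1.0, 2.0, 3.0], "label": ["a", "b", "a"]})

print(list(toy.columns))  # ['feature', 'label']

# pop() returns the column AND removes it from the dataframe
toy_labels = toy.pop("label")

print(list(toy.columns))    # ['feature']
print(toy_labels.tolist())  # ['a', 'b', 'a']
```

If you want to keep the original dataframe intact, pop from a copy, as the cell above does with `df.copy()`.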

## Splitting our Data into Train and Test

If we're going to evaluate our model we need one portion of the data to train our model and another portion of the data to test it. The scikit-learn library provides a convenience function to do this for us, but we're going to have a go at implementing this functionality ourselves first.

We want to write a function which takes a dataframe and an array of labels, and returns:

* 2 dataframes, one training set with X% of the data and one test set with the remainder
* 2 label arrays, one training set with X% of the data and one test set with the remainder

Our function takes a proportion between 0 and 1 and splits the data into training and test sets accordingly. Our first job is to work out how many rows belong to the training set.

In [3]:
def split_data(X: pd.DataFrame, y: np.ndarray, train_proportion: float):
    
    # We use floor here (or ceiling) to ensure that we take a whole number of rows
    num_train = math.floor(len(X) * train_proportion)
    
    # Using [start:end] indexing, this takes all rows from 0 up to num_train (exclusive)
    X_train = X.iloc[:num_train]
    # For our test set we'll take everything from num_train up to the end of the dataframe
    X_test = X.iloc[num_train:]
    
    # Do the same with the y-data (note this is just a regular array, so we don't need .iloc;
    # we can index it directly)
    y_train = y[:num_train]
    y_test = y[num_train:]
    
    # Separating the values with commas returns all 4 of them as a single tuple
    return X_train, X_test, y_train, y_test
    
X_train, X_test, y_train, y_test = split_data(X, y, 0.8)
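
As a quick check of the row arithmetic, here is the same calculation on its own. The figure of 178 rows is taken from the printed dataframe above (row indices 0 to 177).

```python
import math

num_rows = 178          # rows in the wine dataframe, per the output above
train_proportion = 0.8

# 178 * 0.8 = 142.4, so floor gives 142 training rows
num_train = math.floor(num_rows * train_proportion)
# Everything that is not in the training set goes to the test set
num_test = num_rows - num_train

print(num_train, num_test)  # 142 36
```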


The function above returns the right number of instances for train and test, but we're just taking the first N rows of the dataframe. This can be a problem, as dataframes are often ordered by the label, so our model might be missing a substantial number of rows from one class. It's always important to ensure that we randomize before sampling.
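
To see why this matters, consider a small made-up label array that is sorted by class. The values below are hypothetical and chosen only to make the effect obvious.

```python
import math
import numpy as np

# Hypothetical labels, sorted by class as many datasets are
y_sorted = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2])

num_train = math.floor(len(y_sorted) * 0.8)  # 8
y_train_demo = y_sorted[:num_train]
y_test_demo = y_sorted[num_train:]

# The training set never sees class 2, and the test set contains nothing else
print(np.unique(y_train_demo))  # [0 1]
print(np.unique(y_test_demo))   # [2]
```

A model trained on this split could never predict class 2, yet it would be tested on class 2 alone.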

We can use scikit-learn's **shuffle()** function to randomly shuffle our data before splitting it. This will make sure that we select random rows from our dataframe.

In [4]:
from sklearn.utils import shuffle

def split_data(X: pd.DataFrame, y: np.ndarray, train_proportion: float):
    
    # sklearn's shuffle() returns a shuffled copy and leaves its input unchanged,
    # so the .copy() calls are only a safeguard. With an in-place shuffle such as
    # np.random.shuffle() the copy would matter: without it the caller's y would be reordered.
    # NOTE: these are two independent shuffles, so X and y get different orderings.
    X_shuffle = shuffle(X.copy())
    y_shuffle = shuffle(y.copy())
    
    
    # We use floor here (or ceiling) to ensure that we take a whole number of rows
    num_train = math.floor(len(X) * train_proportion)
    
    # Using [start:end] indexing, this takes all rows from 0 up to num_train (exclusive)
    X_train = X_shuffle.iloc[:num_train]
    # For our test set we'll take everything from num_train up to the end of the dataframe
    X_test = X_shuffle.iloc[num_train:]
    
    # Do the same with the y-data (note this is just a regular array, so we don't need .iloc;
    # we can index it directly)
    y_train = y_shuffle[:num_train]
    y_test = y_shuffle[num_train:]
    
    # Separating the values with commas returns all 4 of them as a single tuple
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = split_data(X, y, 0.8)

We're now making sure that our choice of train and test is properly randomized. However, because we've called shuffle twice on two separate arrays, each one gets its own random order. The rows of X no longer line up with the entries of y, so most of our labels are now attached to the wrong rows.
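
The mismatch can be demonstrated on toy data. In the sketch below each made-up feature value is ten times its label, so it is easy to check whether the two arrays still line up. It also shows one possible fix, which is to draw a single permutation of row positions and apply it to both arrays; this is an alternative to the seed-based approach used in the next cell, not a replacement for it.

```python
import numpy as np

# Toy data: each feature value is 10x its label, so alignment is easy to check
toy_y = np.arange(8)
toy_X = toy_y * 10

rng = np.random.default_rng(13)

# Two independent shuffles: each array gets its own ordering,
# so the pairing between features and labels is almost certainly lost
X_independent = rng.permutation(toy_X)
y_independent = rng.permutation(toy_y)
print(np.array_equal(X_independent, y_independent * 10))

# One shared permutation of row positions keeps every pair together
order = rng.permutation(len(toy_y))
X_aligned = toy_X[order]
y_aligned = toy_y[order]
print(np.array_equal(X_aligned, y_aligned * 10))  # True
```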

Whenever we generate a random sequence of numbers, we can ensure that we get the same sequence across multiple calls by supplying a **random seed** or **random state**. NumPy provides its own RandomState class. We're going to use this to make sure X and y are shuffled into the same order: two shuffles of equal-length data that start from the same seed apply the same permutation. We'll allow the caller to pass in a random seed when they call the function.

In [5]:
from numpy.random import RandomState
from sklearn.utils import shuffle

def split_data(X: pd.DataFrame, y: np.ndarray, train_proportion: float, random_seed: int):
    
    # shuffle() returns a shuffled copy and leaves its input unchanged, so the
    # .copy() calls below are only a safeguard for the caller's data.
    # Seeding a fresh RandomState before each call gives both shuffles the same
    # starting state, so X and y are put into the same order.
    rs = RandomState(random_seed)
    X_shuffle = shuffle(X.copy(), random_state=rs)
    
    # reset the random state so we get the same result from shuffle
    rs = RandomState(random_seed)
    y_shuffle = shuffle(y.copy(), random_state=rs)
    
    
    # We use floor here (or ceiling) to ensure that we take a whole number of rows
    num_train = math.floor(len(X) * train_proportion)
    
    # Using [start:end] indexing, this takes all rows from 0 up to num_train (exclusive)
    X_train = X_shuffle.iloc[:num_train]
    # For our test set we'll take everything from num_train up to the end of the dataframe
    X_test = X_shuffle.iloc[num_train:]
    
    # Do the same with the y-data (note this is just a regular array, so we don't need .iloc;
    # we can index it directly)
    y_train = y_shuffle[:num_train]
    y_test = y_shuffle[num_train:]
    
    # Separating the values with commas returns all 4 of them as a single tuple
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = split_data(X, y, 0.8, 13)
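
The seeding behaviour that this function relies on can be checked in isolation. The sketch below uses RandomState directly on a toy range of ten positions; the seed value is arbitrary.

```python
from numpy.random import RandomState

# Two generators created from the same seed produce the same permutation
first = RandomState(13).permutation(10)
second = RandomState(13).permutation(10)

print((first == second).all())  # True

# Either result is still a valid reordering of the positions 0..9
print(sorted(first.tolist()) == list(range(10)))  # True
```

As a side note, scikit-learn's shuffle() also accepts several arrays in a single call and applies the same ordering to each of them, which avoids the need to reset the random state by hand.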

If we prefer we can use scikit-learn to do the job of splitting the dataset for us. It's often a good idea to let the libraries do the hard work, but it's important to know how to implement something yourself if your needs aren't quite met by the library.

In [6]:
from sklearn.model_selection import train_test_split
from numpy.random import RandomState


random_seed = 13
rs = RandomState(random_seed)

# train_test_split() expects the X and y parameters to correspond to each other, meaning
# that the first value in y is the label for the first row in X. It shuffles both with
# the same ordering, so every row stays paired with its label.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=rs)


def get_shape(df: pd.DataFrame):
    return f"{df.shape[0]} rows, {df.shape[1]} columns"

print(f"X Train: {get_shape(X_train)}")
print(f"X Test: {get_shape(X_test)}")
print(f"y Train: {len(y_train)} rows")
print(f"y Test: {len(y_test)} rows")

X Train: 142 rows, 13 columns
X Test: 36 rows, 13 columns
y Train: 142 rows
y Test: 36 rows
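
If you want to convince yourself that train_test_split() keeps rows and labels paired, a toy check along these lines may help. The `row_id` column and the "label is ten times the row id" rule are made up for the purpose of the check.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: each label is 10x the row_id, so pairing is easy to verify
toy_X = pd.DataFrame({"row_id": np.arange(10)})
toy_y = toy_X["row_id"].values * 10

tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=13
)

# Every row still sits next to its own label after the shuffle and split
print(np.array_equal(tX_train["row_id"].values * 10, ty_train))  # True
print(np.array_equal(tX_test["row_id"].values * 10, ty_test))    # True
print(len(tX_train), len(tX_test))  # 8 2
```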


## Putting the Dataset Together

We've seen that in order to get a dataset ready for machine learning we need to:

* Read the dataset from a file
* Split off the label column
* Split the data into train and test sets

This is generally going to be the case for any type of dataset so we can make our lives much easier by creating a function to do this work for us. The function below takes everything we've done so far and puts it together.

In [7]:
def load_dataset(filepath: str, label_column: str, train_proportion: float, random_seed: int):
    df = pd.read_csv(filepath)
    # .values converts the popped column to a plain array, matching what split_data expects
    label = df.pop(label_column).values
    X_train, X_test, y_train, y_test = split_data(df, label, train_proportion, random_seed)
    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = load_dataset('wine.csv', 'Wine', 0.8, 13)

## Training and Evaluating the Model

Now that we've split our data into X and y and train and test we're ready to train and evaluate our model. No matter what model we're evaluating or what dataset we're using, the steps here will always be the same.

1. Train the model using X_train and y_train
2. Make a prediction for each item in X_test
3. Compare the predictions (y_pred) with the actual labels (y_test) and calculate metrics

In [8]:
from sklearn.tree import DecisionTreeClassifier

# Create the model
model = DecisionTreeClassifier()


# To train the model we pass in both the data and the labels
model.fit(X_train, y_train)

# We ask the model to predict labels for each of our test rows
y_pred = model.predict(X_test)

# The numpy equal function takes 2 arrays and for each element returns true if the
# corresponding elements are equal, false otherwise. If y_pred[i] == y_test[i] then
# the model was correct
correct = np.equal(y_pred, y_test)


An array of correct/incorrect flags isn't very useful on its own. At a minimum we'll want to find the proportion of predictions which were correct (*i.e.* the accuracy). The misclassification rate is the complement of this: one minus the accuracy. The following function calculates the accuracy from y_pred and y_test.

In [9]:
def get_accuracy(y_pred, y_test):
    correct = np.equal(y_pred, y_test)
    # By summing a boolean array we count the number of True values
    total_correct = np.sum(correct)
    # Getting the length gives us the total number of predictions made
    total_predictions = len(correct)
    # Accuracy = correct predictions / total predictions
    # (the misclassification rate would be 1 minus this value)
    return total_correct / total_predictions

get_accuracy(y_pred, y_test)

0.9444444444444444
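
Note that the value above is the fraction of predictions that were correct. With a 36-row test set, 0.9444 corresponds to 34 correct predictions out of 36. The sketch below works through the same calculation on a made-up set of four predictions and shows how the accuracy and the misclassification rate relate to each other.

```python
import numpy as np

# Made-up predictions and true labels, for illustration only
toy_pred = np.array([1, 2, 2, 3])
toy_true = np.array([1, 2, 3, 3])

# Element-wise comparison: [True, True, False, True]
toy_correct = np.equal(toy_pred, toy_true)

# 3 correct out of 4 predictions
toy_accuracy = toy_correct.sum() / len(toy_correct)
# The misclassification rate is the fraction that were wrong
toy_misclassification_rate = 1 - toy_accuracy

print(toy_accuracy, toy_misclassification_rate)  # 0.75 0.25
```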

## Making the Code Reusable

Every evaluation is going to look the same, so by writing a function to train and evaluate the model we can make it very easy for ourselves to compare additional models. In order to train and evaluate a model we'll need

* X_train
* X_test
* y_train
* y_test
* a model

Let's create a function taking each of these as a parameter and returning the accuracy.



In [10]:
from typing import Any

def get_accuracy(y_pred, y_test):
    correct = np.equal(y_pred, y_test)
    # By summing a boolean array we count the number of True values
    total_correct = np.sum(correct)
    # Getting the length gives us the total number of predictions made
    total_predictions = len(correct)
    # Accuracy = correct predictions / total predictions
    return total_correct / total_predictions
    
# There's no easy way to specify a type for all sklearn models so we use typing.Any, meaning
# any type is allowed here. However, in order for the function to work the model will need
# to have a .fit() method and a .predict() method
def evaluate_model(X_train: pd.DataFrame, X_test: pd.DataFrame, y_train: np.ndarray, y_test: np.ndarray, model: Any) -> float:

    # To train the model we pass in both the data and the labels
    model.fit(X_train, y_train)

    # We ask the model to predict labels for each of our test rows
    y_pred = model.predict(X_test)
    
    return get_accuracy(y_pred, y_test)



from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

models = [LogisticRegression(max_iter=200000), DecisionTreeClassifier()]

X_train, X_test, y_train, y_test = load_dataset('wine.csv', 'Wine', 0.8, 13)
for model in models:
    # we're digging into the SKLearn model to get its name
    print(f"{type(model).__name__}: {evaluate_model(X_train, X_test, y_train, y_test, model)}")

LogisticRegression: 0.9722222222222222
DecisionTreeClassifier: 0.9444444444444444
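
Because the evaluation function only relies on .fit() and .predict(), any object that provides those two methods can be evaluated, not just scikit-learn models. As an illustration, the sketch below defines a hypothetical `MajorityClassifier` baseline (not part of scikit-learn or of this lab) and scores it with a self-contained `evaluate` helper on made-up data.

```python
import numpy as np

class MajorityClassifier:
    """Hypothetical baseline: always predicts the most common training label."""

    def fit(self, X, y):
        # Count how often each label occurs and remember the most frequent one
        values, counts = np.unique(y, return_counts=True)
        self.majority_ = values[np.argmax(counts)]
        return self

    def predict(self, X):
        # Predict the remembered label for every row, ignoring the features
        return np.full(len(X), self.majority_)

def evaluate(X_train, X_test, y_train, y_test, model) -> float:
    # Works for any object with fit() and predict()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return float(np.mean(y_pred == y_test))

# Made-up data: the features are irrelevant to this baseline
toy_X_train = np.zeros((5, 2))
toy_y_train = np.array([1, 1, 1, 2, 3])
toy_X_test = np.zeros((4, 2))
toy_y_test = np.array([1, 2, 1, 3])

baseline_score = evaluate(toy_X_train, toy_X_test, toy_y_train, toy_y_test, MajorityClassifier())
print(baseline_score)  # 0.5
```

A baseline like this gives a useful reference point: a real model should comfortably beat it on the same split.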
