<a href="https://colab.research.google.com/github/LeonardoGoncRibeiro/06_MachineLearning/blob/main/01_Basic/06_MachineLearning_ModelValidation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning: Model validation

In this course, we will learn how to validate our models so that they can be used reliably in production. Usually, when we build and train a model, we split the data into two sets: training and test. However, we often also use the test set to decide which model to use. This means that we are, in effect, fitting our choice of model to the test set, so the accuracy measured on it is no longer guaranteed to hold in a real-world environment. Thus, we should use appropriate techniques to **validate** our model.

In this course, we will use the following packages:

In [52]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GroupKFold

from sklearn.dummy import DummyClassifier
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score

from sklearn.pipeline import Pipeline

Also, we will use the following dataset:

In [2]:
uri = "https://gist.githubusercontent.com/guilhermesilveira/e99a526b2e7ccc6c3b70f53db43a87d2/raw/1605fc74aa778066bf2e6695e24d53cf65f2f447/machine-learning-carros-simulacao.csv"
# Drop the saved index column ("columns=" already implies axis=1)
df = pd.read_csv(uri).drop(columns=["Unnamed: 0"])
df.columns = ['Price', 'Sold', 'Age', 'Km_per_year']
df.head( )

| | Price | Sold | Age | Km_per_year |
|---|---|---|---|---|
| 0 | 30941.02 | 1 | 18 | 35085.22134 |
| 1 | 40557.96 | 1 | 20 | 12622.05362 |
| 2 | 89627.5 | 0 | 12 | 11440.79806 |
| 3 | 95276.14 | 0 | 3 | 43167.32682 |
| 4 | 117384.68 | 1 | 4 | 12770.1129 |


This dataset describes different cars, with their selling price, age, and kilometers driven per year. We can also see whether each car was sold or not.

## Creating a baseline model

First, we create a baseline model to compare against other models later. We split our dataset into training and test sets and then fit a model, starting with a dummy classifier.

In [3]:
# Setting a seed
SEED = 158020
np.random.seed(SEED)

# Train test split
y = df.Sold
X = df.drop('Sold', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify = y)

# Model fitting
dummy_clf = DummyClassifier( )
dummy_clf.fit(X_train, y_train)

# Evaluating model accuracy
y_pred = dummy_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)*100

print("Accuracy: {:.2f}%".format(acc))

Accuracy: 58.00%
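
As a rough illustration of what such a baseline measures, the sketch below assumes a dummy classifier that always predicts the most frequent class (scikit-learn's default strategy has changed between versions, so this is an assumption). The labels are made-up toy values chosen to mirror the 58% figure, not the real dataset.

```python
from collections import Counter

# Toy labels with a 58% majority class (hypothetical values, not the cars data)
y_toy = [1] * 58 + [0] * 42

# A "most frequent" baseline ignores the features and always predicts one label
majority_label, majority_count = Counter(y_toy).most_common(1)[0]
y_pred_toy = [majority_label] * len(y_toy)

# Its accuracy equals the share of the majority class
baseline_acc = sum(p == t for p, t in zip(y_pred_toy, y_toy)) / len(y_toy) * 100

print("Accuracy: {:.2f}%".format(baseline_acc))
```

Any real model should beat this number; otherwise it has learned nothing beyond the class proportions.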


So, a Dummy Classifier model showed an accuracy of 58%. What about a decision tree model?

In [4]:
# Setting a seed
SEED = 158020
np.random.seed(SEED)

# Train test split
y = df.Sold
X = df.drop('Sold', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify = y)

# Model fitting
dec_tree = DecisionTreeClassifier(max_depth = 2)
dec_tree.fit(X_train, y_train)

# Evaluating model accuracy
y_pred = dec_tree.predict(X_test)
acc = accuracy_score(y_test, y_pred)*100

print("Accuracy: {:.2f}%".format(acc))

Accuracy: 71.92%


A decision tree model showed a much higher accuracy. 

Note that this accuracy is very dependent on the seed used. For instance, if we change the seed to 5:

In [5]:
# Setting a seed
SEED = 5
np.random.seed(SEED)

# Train test split
y = df.Sold
X = df.drop('Sold', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, stratify = y)

# Model fitting
dec_tree = DecisionTreeClassifier(max_depth = 2)
dec_tree.fit(X_train, y_train)

# Evaluating model accuracy
y_pred = dec_tree.predict(X_test)
acc = accuracy_score(y_test, y_pred)*100

print("Accuracy: {:.2f}%".format(acc))

Accuracy: 76.84%


A much higher accuracy! However, shouldn't we have a safer way to evaluate the accuracy of our model? How can we say that, in production, we expect our model to present an accuracy of $x\%$?

To that end, we should try to use appropriate methods to evaluate the accuracy over a given set.
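
To get a feeling for how much a single train/test split can move the result, the sketch below repeats the hold-out evaluation over several seeds. It uses a synthetic dataset from make_classification as a stand-in for the cars data (an assumption, so that the example runs on its own), and the names `X_demo`, `y_demo` and `holdout_scores` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cars dataset (illustrative only)
X_demo, y_demo = make_classification(n_samples=1000, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

holdout_scores = []
for seed in range(10):
    # Only the split changes from one iteration to the next
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                              stratify=y_demo, random_state=seed)
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X_tr, y_tr)
    holdout_scores.append(accuracy_score(y_te, tree.predict(X_te)))

holdout_scores = np.array(holdout_scores) * 100
print("min = {:.2f}%, max = {:.2f}%".format(holdout_scores.min(), holdout_scores.max()))
```

The gap between the minimum and the maximum is the part of the reported accuracy that comes purely from how the data happened to be split.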

# Cross-validation

When we split our data into train and test sets, we effectively separate our data. This means that:

* We lose some data that could be used for training.
* Our accuracy is entirely dependent on the test set chosen.

However, we can use another approach: cross-validation. With cross-validation, we split the data several times and keep training and testing on different subsets. This way, every sample is used for training in some splits and for testing in another, which gives us a good estimate of how accurate our model is. The accuracy of our model is then given by the mean of the accuracies over all splits!

This idea is known as $k$-fold cross-validation, where $k$ is the number of splits in our data.
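
The splitting itself can be sketched in a few lines of plain Python. The helper `k_fold_indices` below is a hypothetical function written for illustration (it is not part of scikit-learn); it builds $k$ contiguous folds without shuffling, and each fold serves once as the test set while the remaining samples are used for training.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) lists for k contiguous folds, without shuffling."""
    indices = list(range(n_samples))
    # The first n_samples % k folds receive one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for i, (train_idx, test_idx) in enumerate(folds):
    print("Fold {}: train = {}, test = {}".format(i, train_idx, test_idx))
```

Every sample index appears in exactly one test fold, so the whole dataset ends up being used for both training and testing.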

Thus, let's use cross-validation to get the accuracy of our Decision Tree Classifier:

In [6]:
# Setting a seed
SEED = 158020
np.random.seed(SEED)

# Defining the target and explicative variables
y = df.Sold
X = df.drop('Sold', axis = 1)

# Model instancing
dec_tree = DecisionTreeClassifier(max_depth = 2)

# Running cross_validation
results = cross_validate(dec_tree, X, y, cv = 3)           # 3-fold cross-validation

results

{'fit_time': array([0.00810766, 0.00716019, 0.00715661]),
 'score_time': array([0.00218606, 0.0020206 , 0.00208902]),
 'test_score': array([0.75704859, 0.7629763 , 0.75337534])}

So, here, we are mostly interested in the ```test_score```. Let's get the mean test score:

In [7]:
# Evaluating model accuracy
acc = (results['test_score'].mean( ))*100

print("Accuracy: {:.2f}%".format(acc))

Accuracy: 75.78%


So, on average, using 3-fold cross-validation, our model showed an accuracy of 75.78% (across the 3 splits).

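Also, since we have several scores, we can quantify the uncertainty with an approximate confidence interval for the accuracy. For a significance level of 5%, $Z \approx 1.96 \approx 2.0$, so, assuming the fold scores are roughly normally distributed, we take the mean plus or minus two standard deviations: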

In [8]:
acc = (results['test_score'].mean( ))*100
std = (results['test_score'].std( ))*100

print("Accuracy is in the domain [{:.2f}%, {:.2f}%]".format(acc - 2*std, acc + 2*std))

Accuracy is in the domain [74.99%, 76.57%]
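
The arithmetic behind this interval can be checked by hand. The sketch below copies the three fold scores printed earlier and uses only the standard library. Note that NumPy's std uses the population formula (ddof = 0) by default, which is what the interval above relies on; the sample standard deviation (ddof = 1) is shown only for comparison.

```python
from statistics import mean, pstdev, stdev

# Fold scores copied from the 3-fold cross-validation output
test_scores = [0.75704859, 0.7629763, 0.75337534]

acc = mean(test_scores) * 100
std_pop = pstdev(test_scores) * 100      # population std (ddof = 0), NumPy's default
std_sample = stdev(test_scores) * 100    # sample std (ddof = 1), slightly larger

lower, upper = acc - 2 * std_pop, acc + 2 * std_pop
print("Accuracy is in the domain [{:.2f}%, {:.2f}%]".format(lower, upper))
print("Population std = {:.3f}, sample std = {:.3f}".format(std_pop, std_sample))
```

With only three values, the choice between the two formulas changes the width of the interval noticeably, which is one more reason to prefer a larger number of folds.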


Nice! Now we have an interval for our accuracy, and we are much less prone to being fooled by "luck", since the seed has a much smaller effect on the result. For instance, let's use a different seed:

In [9]:
# Setting a seed
SEED = 5
np.random.seed(SEED)

# Defining the target and explicative variables
y = df.Sold
X = df.drop('Sold', axis = 1)

# Model instancing
dec_tree = DecisionTreeClassifier(max_depth = 2)

# Running cross_validation
results = cross_validate(dec_tree, X, y, cv = 3)           # 3-fold cross-validation

# Evaluating model accuracy
acc = (results['test_score'].mean( ))*100
std = (results['test_score'].std( ))*100

print("Accuracy is in the domain [{:.2f}%, {:.2f}%]".format(acc - 2*std, acc + 2*std))

Accuracy is in the domain [74.99%, 76.57%]


We got exactly the same interval, which is very good. However, note that 3 folds may still be too few to get a good estimate of the accuracy: after all, we are computing an interval from only 3 values. Let's use 10 folds:

In [10]:
# Setting a seed
SEED = 158020
np.random.seed(SEED)

# Defining the target and explicative variables
y = df.Sold
X = df.drop('Sold', axis = 1)

# Model instancing
dec_tree = DecisionTreeClassifier(max_depth = 2)

# Running cross_validation
results = cross_validate(dec_tree, X, y, cv = 10)           # 10-fold cross-validation

# Evaluating model accuracy
acc = (results['test_score'].mean( ))*100
std = (results['test_score'].std( ))*100

print("Accuracy is in the domain [{:.2f}%, {:.2f}%]".format(acc - 2*std, acc + 2*std))

Accuracy is in the domain [74.24%, 77.32%]


Note that our interval changed, although it stayed close to the previous one.

Usually, 5 or 10 is a good choice for $k$.

## Considering randomness in the cross-validation

Note that scikit-learn's ```cross_validate``` does not shuffle the dataset before splitting it into folds when ```cv``` is an integer. To add randomness to the process, which gives us a better picture of how much our accuracy interval can vary, we can shuffle the dataset before performing the cross-validation.

Before doing that, let's create a user-defined function, so that we don't have to repeat the same lines of code:

In [11]:
def GetCrossValidationMetrics(model, X, y, cv):
  results = cross_validate(model, X, y, cv = cv)  

  acc = (results['test_score'].mean( ))*100
  std = (results['test_score'].std( ))*100  

  return (acc, std)  

Let's test our function:

In [12]:
# Setting a seed
SEED = 158020
np.random.seed(SEED)

acc, std = GetCrossValidationMetrics(dec_tree, X, y, 10)

print("Accuracy is {:.2f}%, and it is in the domain [{:.2f}%, {:.2f}%]".format(acc, acc - 2*std, acc + 2*std))

Accuracy is 75.78%, and it is in the domain [74.24%, 77.32%]


Nice! We got the same results. Now, to shuffle our folds, we create a KFold object with ```shuffle = True``` and pass it as our cross-validation method:

In [13]:
SEED = 158020
np.random.seed(SEED)

cv = KFold(n_splits = 10, shuffle = True)

acc, std = GetCrossValidationMetrics(dec_tree, X, y, cv)

print("Accuracy is {:.2f}%, and it is in the domain [{:.2f}%, {:.2f}%]".format(acc, acc - 2*std, acc + 2*std))

Accuracy is 75.78%, and it is in the domain [73.58%, 77.98%]


This time, we got a different interval (even though the mean accuracy was the same). Let's test another seed:

In [14]:
SEED = 301
np.random.seed(SEED)

cv = KFold(n_splits = 10, shuffle = True)

acc, std = GetCrossValidationMetrics(dec_tree, X, y, cv)

print("Accuracy is {:.2f}%, and it is in the domain [{:.2f}%, {:.2f}%]".format(acc, acc - 2*std, acc + 2*std))

Accuracy is 75.76%, and it is in the domain [73.26%, 78.26%]


Again, the accuracy was very similar, but the domain was different!

Note that, by using a cross-validation technique, we are less susceptible to randomness, as our accuracy changes very little for different seeds. At the same time, we can still note a difference in the domains!
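
To see what shuffling actually changes, the sketch below prints the test folds of a six-sample toy array (illustrative data only). It also passes the random_state argument directly to KFold, which is an alternative way to make a shuffled split reproducible without relying on the global NumPy seed.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(6).reshape(-1, 1)   # six toy samples

# Without shuffling, the test folds are contiguous blocks in the original row order
plain_folds = [test.tolist() for _, test in KFold(n_splits=3).split(X_toy)]
print("No shuffle:", plain_folds)

# With shuffling, a fixed random_state gives the same folds on every run
def shuffled_folds(seed):
    cv = KFold(n_splits=3, shuffle=True, random_state=seed)
    return [test.tolist() for _, test in cv.split(X_toy)]

shuffled_a = shuffled_folds(301)
shuffled_b = shuffled_folds(301)
print("Shuffled:  ", shuffled_a)
print("Reproducible:", shuffled_a == shuffled_b)
```

If the rows of a dataset are stored in some meaningful order (for example, sorted by date or by class), the unshuffled folds inherit that order, which is why shuffling is usually worth considering.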

## Stratifying our dataset in our cross-validation

When we split our data into multiple folds, another good practice is to preserve the class proportions of the original dataset in each fold. For instance, if our target labels have 20% 0's and 80% 1's, we want to keep this proportion in our splits.

With the train test split, we used the ```stratify``` parameter. However, KFold has no such parameter. (When we pass an integer as ```cv``` together with a classifier, ```cross_validate``` already uses stratified folds, but without shuffling.)

To that end, we may use the StratifiedKFold method:

In [15]:
SEED = 301
np.random.seed(SEED)

cv = StratifiedKFold(n_splits = 10, shuffle = True)

acc, std = GetCrossValidationMetrics(dec_tree, X, y, cv)

print("Accuracy is {:.2f}%, and it is in the domain [{:.2f}%, {:.2f}%]".format(acc, acc - 2*std, acc + 2*std))

Accuracy is 75.78%, and it is in the domain [74.42%, 77.14%]


Nice! Note that our accuracy changed very little, but our interval is narrower. That happens because the variance across our splits decreased. This is expected, since we are now forcing every split to have a class structure similar to that of the original set.
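
The effect of stratification is easiest to see on an extreme toy case. The sketch below builds made-up labels with 20% 0's and 80% 1's, sorted by class, and compares the share of 1's in each test fold for KFold and StratifiedKFold (both without shuffling). The helper `positive_share_per_fold` is a hypothetical name used only here.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy target: 20 zeros followed by 80 ones (illustrative only)
y_toy = np.array([0] * 20 + [1] * 80)
X_toy = np.zeros((100, 1))            # features are irrelevant for the split

def positive_share_per_fold(cv):
    """Share of 1's in the test part of each fold."""
    return [float(y_toy[test].mean()) for _, test in cv.split(X_toy, y_toy)]

plain_shares = positive_share_per_fold(KFold(n_splits=5))
strat_shares = positive_share_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain_shares)
print("StratifiedKFold:", strat_shares)
```

With plain KFold, the first test fold contains only 0's, while every stratified fold keeps the original 20/80 proportion.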

## Grouping by a feature

What if we want to perform our splits based on a feature, for instance, splitting our entries by car model so that all cars of the same model fall in the same fold? For this, there is a different cross-validation method.

First, let's generate our car model data (randomly):

In [26]:
np.random.seed(SEED)

df['Model'] = df.Age + np.random.randint(-2, 3, size = df.shape[0])
df.Model = df.Model - df.Model.min( ) + 1

So, to consider the car model when performing our splits, we simply use GroupKFold. First, we have to change our user-defined function so that it accepts the groups:

In [29]:
def GetCrossValidationMetrics(model, X, y, cv, group = None):
  results = cross_validate(model, X, y, cv = cv, groups = group)  

  acc = (results['test_score'].mean( ))*100
  std = (results['test_score'].std( ))*100  

  return (acc, std)

Now, we can do:

In [31]:
SEED = 301
np.random.seed(SEED)

cv = GroupKFold(n_splits = 10)

acc, std = GetCrossValidationMetrics(dec_tree, X, y, cv, df.Model)

print("Accuracy is {:.2f}%, and it is in the domain [{:.2f}%, {:.2f}%]".format(acc, acc - 2*std, acc + 2*std))

Accuracy is 75.80%, and it is in the domain [72.00%, 79.60%]


Nice! Again, our accuracy was very similar. However, in real-world problems, this approach may give a much better estimate of the accuracy, especially when we want to know the expected accuracy for a new car model.
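
What GroupKFold guarantees is that no group appears in both the training and the test part of a split. A minimal check on toy data (six made-up groups of two samples each, not the cars data) could look like this:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 12 samples belonging to 6 groups (illustrative only)
groups_toy = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6])
X_toy = np.zeros((12, 1))
y_toy = np.array([0, 1] * 6)

overlaps = []
cv = GroupKFold(n_splits=3)
for train_idx, test_idx in cv.split(X_toy, y_toy, groups=groups_toy):
    train_groups = set(groups_toy[train_idx].tolist())
    test_groups = set(groups_toy[test_idx].tolist())
    # The intersection should be empty for every split
    overlaps.append(train_groups & test_groups)
    print("test groups = {}, shared with train = {}".format(sorted(test_groups),
                                                            train_groups & test_groups))
```

Because the model is always evaluated on groups it has never seen, the resulting score is closer to what we should expect for a genuinely new group.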

# Cross-validation pipeline

It is important to have a cross-validation pipeline, which allows us to test different models and approaches with our cross-validation methods very easily.

Our user-defined function helps with this. However, a model pipeline should have more functionality. Below, I show a more complete pipeline for model training and validation:

In [50]:
def GetCrossValidationMetrics(scaler, model, X, y, cv, group = None):
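  # Feature normalization (fit_transform already fits the scaler).
  # Note: the scaler is fitted on all of X, so the held-out folds
  # also influence the scaling statistics.
  X_scaled = scaler.fit_transform(X)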

  # Cross-validation
  results = cross_validate(model, X_scaled, y, cv = cv, groups = group) 

  # Get accuracy
  acc = (results['test_score'].mean( ))*100
  std = (results['test_score'].std( ))*100  

  return (acc, std)

Nice! Now, let's test it:

In [51]:
SEED = 301
np.random.seed(SEED)

cv = StratifiedKFold(n_splits = 10, shuffle = True)
scaler = StandardScaler( )
model = SVC( )

acc, std = GetCrossValidationMetrics(scaler, model, X, y, cv)

print("Accuracy is {:.2f}%, and it is in the domain [{:.2f}%, {:.2f}%]".format(acc, acc - 2*std, acc + 2*std))

Accuracy is 76.71%, and it is in the domain [74.21%, 79.21%]


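Nice! However, we could also create a pipeline using the ```Pipeline``` object provided by scikit-learn. When a pipeline is cross-validated, each step (here, the scaler) is refit on the training portion of every fold, so the held-out fold does not influence the scaling. We may simply do: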

In [54]:
scaler = StandardScaler( )
model  = SVC( )

pipeline = Pipeline([('scaling', scaler), ('estimator', model)])

Once we have a pipeline, we can pass it to the cross-validation like any other model:

In [55]:
def GetCrossValidationMetrics(pipeline, X, y, cv, group = None):
  # Cross-validation
  results = cross_validate(pipeline, X, y, cv = cv, groups = group) 

  # Get accuracy
  acc = (results['test_score'].mean( ))*100
  std = (results['test_score'].std( ))*100  

  return (acc, std)

In [56]:
SEED = 301
np.random.seed(SEED)

cv = StratifiedKFold(n_splits = 10, shuffle = True)

acc, std = GetCrossValidationMetrics(pipeline, X, y, cv)

print("Accuracy is {:.2f}%, and it is in the domain [{:.2f}%, {:.2f}%]".format(acc, acc - 2*std, acc + 2*std))

Accuracy is 76.72%, and it is in the domain [74.18%, 79.26%]


Nice! Our pipeline worked.
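
One way to confirm that the scaler inside a pipeline is refit for every fold is to ask cross_validate to return the fitted estimators and inspect them. The sketch below does this on a synthetic dataset from make_classification (an assumption, used so that the example runs on its own); the names `X_demo`, `y_demo`, `demo_pipeline` and `fold_means` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the cars dataset (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)

demo_pipeline = Pipeline([('scaling', StandardScaler()), ('estimator', SVC())])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=301)

# return_estimator=True keeps the pipeline fitted on each training fold
results = cross_validate(demo_pipeline, X_demo, y_demo, cv=cv, return_estimator=True)

# Feature means learned by the scaler in each fold (one row per fold)
fold_means = np.array([est.named_steps['scaling'].mean_ for est in results['estimator']])
print(fold_means)
```

Each row differs slightly from the others because each scaler only saw the training part of its own fold, which is the behavior we want in order to avoid leaking information from the held-out data.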