# <a id='toc1_'></a>[Better model evaluation](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Better model evaluation](#toc1_)    
  - [Cross Validation](#toc1_1_)    
    - [What is cross-validation?](#toc1_1_1_)    
      - [Let's talk about bears](#toc1_1_1_1_)    
    - [Why do we do cross-validation?](#toc1_1_2_)    
    - [**Why cross-validation?**](#toc1_1_3_)    
  - [Types of cross-validation (CV)](#toc1_2_)    
    - [Stratified K-Fold Cross Validation](#toc1_2_1_)    
  - [Repeated KFold](#toc1_3_)    
    - [**How to choose K?**](#toc1_3_1_)    
    - [Shuffle Split](#toc1_3_2_)    
  - [Stratified Shuffle Split](#toc1_4_)    
  - [Time Series Cross Validation](#toc1_5_)    
  - [Extra: Leave-One-Out Cross-Validation   ](#toc1_6_)    
- [Pickling](#toc2_)    
  - [Save the model](#toc2_1_)    
  - [Load the model](#toc2_2_)    
  - [Save the data](#toc2_3_)    
  - [Load the data](#toc2_4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [None]:
from sklearn.datasets import fetch_california_housing, load_breast_cancer
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import cross_validate, StratifiedShuffleSplit, ShuffleSplit, KFold, RepeatedKFold, GroupKFold, StratifiedKFold
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, mean_absolute_percentage_error
from sklearn.metrics import accuracy_score, roc_auc_score, precision_score, recall_score

## <a id='toc1_1_'></a>[Cross Validation](#toc0_)

### <a id='toc1_1_1_'></a>[What is cross-validation?](#toc0_)

**A way to get more reliable evaluation metrics**. 

Cross-validation lets you test your model on multiple test sets, which gives you a better idea of how the model will work in the wild. However, it only works if the data and the data splits are representative of real-world data!


#### <a id='toc1_1_1_1_'></a>[Let's talk about bears](#toc0_)
If you built a bear classifier to help tourists in your nature reserve avoid bears, and trained it only on pictures from the internet, it might not be much help.

Why?  


Because bear pictures on the internet usually look like this:

![](../../../img/bear_image.png)

**A good quality, close-up picture of a bear**. But if you were to set up a surveillance camera to monitor bears in the wild you'd see something more like this:

![](../../../img/bear_wild.png)

**A poor quality, far away picture of a bear, with cubs too this time.**

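In this scenario, our model is great on the data it was trained and validated on (internet images), but it would perform poorly on the data it meets in the real world (surveillance camera images). Cross-validating on internet images alone would not reveal this, because every fold comes from the same unrepresentative source.

Assuming our data is representative of the population, cross-validation can also help us find the optimal hyperparameters for our model. More on that in the Hyperparameter Tuning class.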

### <a id='toc1_1_2_'></a>[Why do we do cross-validation?](#toc0_)

**Because we care about generalization.**

If we spent 3 months building a model to get our business to generate more revenue, e.g. optimizing flight prices, we would want our model to generalize well. Although cross-validation is useful in business (and widely used, especially with small amounts of data), it's even more important in high-stakes settings, e.g. hospitals or the police. 

This is because these fields often don't have the most up-to-date technology and may not update their models once they realize the data seen by the model in real life is very different from the data seen by the model during training (i.e. *data drift*). That's why the models used in those fields (e.g. cancer detection, recidivism prediction) need to be as good as they can be at the time they are built.

Most big companies do not rely on static models to predict consumer behaviour, but instead regularly track performance and update models. This is more of an MLOps topic and we won't cover it here but it's something you should be aware of. Keywords to search for: *model evaluation*, *concept drift*, *data drift*.

### How do we do cross-validation?

We split our data into multiple parts and use them for training and evaluation, one by one.

![](../../../img/crossval.png)
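
To make the picture above concrete, here is a minimal sketch on a toy array of ten samples (the name `X_demo` is made up for illustration and is not part of the housing example). Each fold uses a different fifth of the data for testing and the rest for training:

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, identified by their positions 0..9 (illustrative data only)
X_demo = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=False)
test_folds = []
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo)):
    print(f"Fold {fold}: train={train_idx}, test={test_idx}")
    test_folds.append(test_idx)

# Across the 5 folds, every sample is used for testing exactly once
all_test = np.concatenate(test_folds)
print(np.sort(all_test))
```

Every sample ends up in a test set exactly once, which is what distinguishes K-Fold from repeating a random train-test split.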

In [None]:
california = fetch_california_housing()
print(california["DESCR"])

In [None]:
df_cali = pd.DataFrame(california["data"], columns = california["feature_names"])
df_cali["median_house_value"] = california["target"]

df_cali.head()

In [None]:
features = df_cali.drop(columns = ["median_house_value","AveOccup", "Population", "AveBedrms"])
labels = df_cali["median_house_value"]

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.1, random_state=41)

In [None]:
rf_model = RandomForestRegressor(n_estimators=100, max_depth=10, random_state=42)

In [7]:
cv_splitter = KFold(n_splits=5, shuffle=False)
cross_validate(rf_model, X_train, y_train, cv=cv_splitter)

{'fit_time': array([4.55788422, 5.0084424 , 4.46448064, 4.3850708 , 4.69796133]),
 'score_time': array([0.05046988, 0.04568815, 0.03533125, 0.04907346, 0.0434351 ]),
 'test_score': array([0.78673323, 0.77766357, 0.77966809, 0.76711616, 0.77101187])}

In [9]:
cv_splitter = KFold(n_splits=5, shuffle=True, random_state=42)
cross_validate(rf_model, X_train, y_train, cv=cv_splitter)

{'fit_time': array([4.49471521, 4.36314607, 4.48665714, 4.41625166, 4.26591897]),
 'score_time': array([0.04983807, 0.04316187, 0.04079366, 0.05221915, 0.04421759]),
 'test_score': array([0.78358686, 0.78558343, 0.7773195 , 0.77114127, 0.77956676])}
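
The `test_score` values above are the estimator's default score, which is R² for a regressor. In practice you usually summarize the folds with a mean and a spread, and you may want more than one metric. The sketch below shows one way to do that. It uses synthetic data from `make_regression` and a smaller forest so it runs quickly, so the numbers it prints are unrelated to the housing results above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data, used here only to keep the example fast
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=20, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Ask for several metrics at once, plus the scores on the training folds
results = cross_validate(
    model, X_demo, y_demo, cv=cv,
    scoring=["r2", "neg_mean_absolute_error"],
    return_train_score=True,
)

r2 = results["test_r2"]
mae = -results["test_neg_mean_absolute_error"]  # sklearn negates error metrics
print(f"R2:  {r2.mean():.3f} +/- {r2.std():.3f}")
print(f"MAE: {mae.mean():.3f} +/- {mae.std():.3f}")
```

A large gap between `train_r2` and `test_r2` in the results is a hint that the model is overfitting.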

## <a id='toc1_2_'></a>[Types of cross-validation (CV)](#toc0_)

### <a id='toc1_3_2_'></a>[Shuffle Split](#toc0_)

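This CV method shuffles your data and then draws a random train-test split, repeating the process for as many splits as you ask for. Because each split is drawn independently, the same data point can appear in several test sets (or in none), unlike K-Fold, where every point is tested exactly once. Shuffling matters because the order of the rows should not influence how well your model scores. 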

The train-test split we used before already does this internally - this is why we had to set a `random_state`!

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_008.png)  
(Source: [3.1. Cross-validation: evaluating estimator performance, scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators))
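
As a rough illustration, the sketch below runs `ShuffleSplit` on twenty toy samples (`X_demo` is a made-up name) and counts how often each sample lands in a test set:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Twenty toy samples, identified by their positions 0..19 (illustrative data only)
X_demo = np.arange(20).reshape(-1, 1)

ss = ShuffleSplit(n_splits=4, test_size=0.25, random_state=0)
test_sets = [test_idx for _, test_idx in ss.split(X_demo)]
for i, test_idx in enumerate(test_sets):
    print(f"Split {i}: test={np.sort(test_idx)}")

# How many times was each sample used for testing?
counts = np.bincount(np.concatenate(test_sets), minlength=20)
print(counts)
```

The counts are usually uneven: some samples are tested more than once and others not at all, because the splits are drawn independently of each other.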

### <a id='toc1_2_1_'></a>[Stratified K-Fold Cross Validation](#toc0_)

This method makes sure each class of your target is represented in the same proportion in the train and test sets. If you have imbalanced data, e.g. 30% positive 70% negative, the same percentages will be present in both your train and test sets.

For classification you should generally **stratify based on the target** - which is why `sklearn` uses `StratifiedKFold` by default when you pass an integer `cv` to its cross-validation functions and the estimator is a classifier.

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_009.png)  
(Source: [3.1. Cross-validation: evaluating estimator performance, scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators))
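
Here is a small sketch of the 30% / 70% example from the text, using made-up labels (`y_demo`) and placeholder features (`X_demo`), since the split only depends on the target:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# 30% positive, 70% negative, as in the example in the text
y_demo = np.array([1] * 30 + [0] * 70)
X_demo = np.zeros((100, 1))  # placeholder features; only y drives the stratification

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
positive_rates = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Share of positive labels in this test fold
    positive_rates.append(y_demo[test_idx].mean())

print(positive_rates)
```

Each test fold keeps the 30% share of positives. A plain `KFold` gives no such guarantee.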

In [10]:
cv_splitter = StratifiedKFold(n_splits=5, shuffle=False)
cross_validate(rf_model, features, labels, cv=cv_splitter)

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.

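Why did we get this error? Stratification needs discrete classes to balance across the folds, and the California housing target (median house value) is continuous. Let's switch to a classification dataset.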

In [11]:
titanic = pd.read_csv("https://raw.githubusercontent.com/data-bootcamp-v4/data/main/titanic_train.csv")
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [25]:
# Expected number of passengers per class in each of 5 stratified folds
titanic["Survived"].value_counts() / 5

Survived
0    109.8
1     68.4
Name: count, dtype: float64

In [24]:
# Expected number of passengers per class in each of 10 stratified folds
titanic["Survived"].value_counts() / 10

Survived
0    54.9
1    34.2
Name: count, dtype: float64

In [16]:
X = titanic[["Pclass", "Age"]]
y = titanic["Survived"]

In [13]:
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

In [17]:
cv_splitter = StratifiedKFold(n_splits=5, shuffle=False)
cross_validate(rf_model, X, y, cv=cv_splitter)

{'fit_time': array([0.21109581, 0.19272804, 0.20311642, 0.19867396, 0.19804645]),
 'score_time': array([0.0105288 , 0.01576781, 0.00773025, 0.00806642, 0.00726652]),
 'test_score': array([0.65363128, 0.65168539, 0.70786517, 0.76966292, 0.73595506])}

In [18]:
cv_splitter = StratifiedKFold(n_splits=5, shuffle=True)
cross_validate(rf_model, X, y, cv=cv_splitter)

{'fit_time': array([0.228724  , 0.17601681, 0.19043827, 0.18829465, 0.19900966]),
 'score_time': array([0.00851345, 0.00853682, 0.00851727, 0.00576854, 0.01152873]),
 'test_score': array([0.67597765, 0.65168539, 0.7247191 , 0.69662921, 0.69662921])}

### <a id='toc1_4_'></a>[Stratified Shuffle Split](#toc0_)

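Same idea as the stratified CV we saw before, but built on shuffle split: each split is a fresh random draw that keeps the class proportions, so test sets can overlap between splits.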

In [None]:
# Set up the cross validator
cv_sss = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)
cv_sss.get_n_splits(X, y)

In [None]:
# Check what the stratified shuffle split does
for i, (train_indices, test_indices) in enumerate(cv_sss.split(X, y)):
    print('Split no:', i)
    print('Train indices:', train_indices[:5])
    print('Test indices:', test_indices[:5])

In [None]:
# Now see it in action! ...manually
# Stratification needs a categorical target, so we use the Titanic data (X, y)
# together with the RandomForestClassifier defined above
results = []
for train_index, test_index in cv_sss.split(X, y):
    # The splitter returns positions, so index both X and y with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    rf_model.fit(X_train, y_train)
    pred = rf_model.predict(X_test)
    results.append(accuracy_score(y_test, pred))

In [None]:
results

In [None]:
# And now using sklearn's cross_val_score
from sklearn.model_selection import cross_val_score

scores = cross_val_score(rf_model, X, y, scoring='accuracy', cv=cv_sss, n_jobs=-1)
print(scores)
scores.mean()

### <a id='toc1_3_'></a>[Repeated KFold](#toc0_)

In [19]:
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score

In [21]:
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

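# Cross validate with 10 folds, repeated 3 times (30 scores in total)
# Note: 'accuracy' is a classification metric, so scoring a regressor with it
# fails in every fold and each score comes back as nan
scores = cross_val_score(rf_model, features, labels, scoring='accuracy', cv=cv, n_jobs=-1)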
print(scores)
scores.mean()

[nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 nan nan nan nan nan nan nan nan nan nan nan nan]


nan

In [22]:
len(scores)

30
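
We get 30 scores because `RepeatedKFold` runs a shuffled K-Fold `n_repeats` times, each time with a different shuffle: 10 folds x 3 repeats. A minimal sketch on toy data (`X_demo` is a made-up name):

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Twenty toy samples, identified by their positions 0..19 (illustrative data only)
X_demo = np.arange(20).reshape(-1, 1)

rkf = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_total = rkf.get_n_splits(X_demo)
print(n_total)  # n_splits * n_repeats

# The splits come out repeat by repeat: the first 10 belong to the first repeat
splits = list(rkf.split(X_demo))
first_repeat_test = np.concatenate([test_idx for _, test_idx in splits[:10]])

# Within one repeat, every sample is tested exactly once
print(np.sort(first_repeat_test))
```

Averaging over several repeats makes the estimate depend less on any single lucky or unlucky shuffle, at the cost of more model fits.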

In [None]:
# Use a scoring metric that suits a regression target
scores = cross_val_score(rf_model, features, labels, scoring='r2', cv=cv, n_jobs=-1)
print(scores)
scores.mean()

### <a id='toc1_3_1_'></a>[**How to choose K?**](#toc0_)

> Typical values for k are k=3, k=5, and k=10, with 10 representing the most common value. This is because, given extensive testing, 10-fold cross-validation provides a good balance of low computational cost and low bias in the estimate of model performance as compared to other k values and a single train-test split. [$^{[3]}$](https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/)

### <a id='toc1_5_'></a>[Extra: Time Series Cross Validation](#toc0_)

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_013.png)   
(Source: [3.1. Cross-validation: evaluating estimator performance, scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-iterators))
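
With time series we must not train on the future to predict the past, so the splits in the picture always keep the training rows before the test rows. A minimal sketch on twelve toy time steps (`X_demo` is a made-up name):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve toy observations, already sorted by time (illustrative data only)
X_demo = np.arange(12).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X_demo))
for i, (train_idx, test_idx) in enumerate(splits):
    print(f"Split {i}: train={train_idx}, test={test_idx}")

# In every split, all training rows come before all test rows
ordered = all(train_idx.max() < test_idx.min() for train_idx, test_idx in splits)
print(ordered)
```

The training window grows with each split while the test window moves forward in time, so the data must be sorted chronologically and never shuffled.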

In [None]:
occupancy = pd.read_csv('https://raw.githubusercontent.com/sabinagio/data-analytics/main/data/occupancy.csv')
occupancy.set_index('date', inplace=True)
occupancy.head()

In [None]:
features = occupancy.drop('Occupancy', axis=1)
labels = occupancy['Occupancy']

In [None]:
from sklearn.model_selection import TimeSeriesSplit

# Set up the cross validator
ts_sss = TimeSeriesSplit(n_splits=6)
ts_sss.get_n_splits(features)

In [None]:
# Review how the time series split works
for i, (train_index, test_index) in enumerate(ts_sss.split(features)):
    print('Split no:', i)
    print('Train set size:', len(train_index))
    print('Test set size:', len(test_index))

In [None]:
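# And see it in action!... manually
# Occupancy is a class label and accuracy is a classification metric,
# so switch back to a classifier (rf_model was last set to a regressor)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
results = []
for train_index, test_index in ts_sss.split(features):
    # The splitter returns positions, so index both features and labels with .iloc
    X_train, X_test = features.iloc[train_index], features.iloc[test_index]
    y_train, y_test = labels.iloc[train_index], labels.iloc[test_index]
    rf_model.fit(X_train, y_train)
    pred = rf_model.predict(X_test)
    results.append(accuracy_score(y_test, pred))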

In [None]:
results

In [None]:
# And now using sklearn's cross_val_score
scores = cross_val_score(rf_model, features, labels, scoring='accuracy', cv=ts_sss, n_jobs=-1)
print(scores)
scores.mean()

## <a id='toc1_6_'></a>Extra: [Leave-One-Out Cross-Validation](https://machinelearningmastery.com/loocv-for-evaluating-machine-learning-algorithms/)    [&#8593;](#toc0_)
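
Leave-one-out is the extreme case of K-Fold where K equals the number of samples: each sample is the test set exactly once. A minimal sketch on five toy samples (`X_demo` is a made-up name):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

# Five toy samples, identified by their positions 0..4 (illustrative data only)
X_demo = np.arange(5).reshape(-1, 1)

loo = LeaveOneOut()
splits = list(loo.split(X_demo))
for train_idx, test_idx in splits:
    print(f"train={train_idx}, test={test_idx}")

print(len(splits))  # one split per sample
```

Since the model is refit once per sample, this gets expensive quickly and is mostly practical for small datasets.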

# <a id='toc2_'></a>[Pickling](#toc0_)

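We can pickle many Python objects, including ML models and pandas dataframes. Only load pickle files from sources you trust, because unpickling can execute arbitrary code.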

In [None]:
# pickle is part of the Python standard library, so there is nothing to install
import pickle

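## <a id='toc2_1_'></a>[Save the model](#toc0_)

In [None]:
# Fit the model first so that the pickled file contains a trained model
# rf_model.fit(X_train, y_train)
rf_model.fit(X, y) # for Ironkaggle

In [27]:
with open('rf_model.pkl', 'wb') as file:
    pickle.dump(rf_model, file)

In [28]:
# One-line alternative; the file is not closed explicitly, so prefer the `with` form above
pickle.dump(rf_model, open('rf_model.pkl', 'wb'))

## <a id='toc2_2_'></a>[Load the model](#toc0_)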

In [29]:
with open('rf_model.pkl', 'rb') as file:
    rf_model = pickle.load(file)

In [30]:
# One-line alternative; the file is not closed explicitly, so prefer the `with` form above
rf_model = pickle.load(open('rf_model.pkl', 'rb'))

## <a id='toc2_3_'></a>[Save the data](#toc0_)

In [None]:
X_train.to_pickle('train_data.pkl')
y_train.to_pickle('train_label.pkl')

X_test.to_pickle('test_data.pkl')
y_test.to_pickle('test_label.pkl')

## <a id='toc2_4_'></a>[Load the data](#toc0_)

In [None]:
X_train = pd.read_pickle('train_data.pkl')
X_train
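
A quick way to convince yourself that pickling worked is a round trip: save, load, and check that the restored model predicts the same as the original. The sketch below uses synthetic data and `pickle.dumps` / `pickle.loads` (bytes in memory) instead of files so that it is self-contained. The names `X_demo`, `y_demo`, `model` and `restored` are made up for illustration:

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small synthetic classification problem (illustrative data only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Round trip through pickle, in memory instead of on disk
payload = pickle.dumps(model)
restored = pickle.loads(payload)

# The restored model should give exactly the same predictions
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Keep in mind that a pickled model is tied to the library versions it was created with, so it is safest to load it with the same `scikit-learn` version that saved it.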