# Cross Validation
<div>
<img src="attachment:cfcca9bc-c6bf-4531-a18c-df20ccef4379.png" width="300"/>
</div>


In this activity we will use cross-validation to evaluate a decision tree classifier.

Cross-validation is a statistical method used to estimate the skill of machine learning models.

It is commonly used in applied machine learning to compare and select a model for a given predictive modeling problem because it is easy to understand, easy to implement, and results in skill estimates that generally have a lower bias than other methods, such as a single train/test split.


### Data Setup

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Read dataset
crime_df = pd.read_csv('./pb_compas.csv')

crime_df.head()

Unnamed: 0,id,name,first,last,sex,dob,age,age_cat,race,juv_fel_count,...,vr_charge_desc,type_of_assessment,decile_score.1,score_text,screening_date,v_type_of_assessment,v_decile_score,v_score_text,priors_count.1,event
0,1.0,miguel hernandez,miguel,hernandez,Male,18/04/1947,69,Greater than 45,Other,0,...,,Risk of Recidivism,1,Low,14/08/2013,Risk of Violence,1,Low,0,0
1,2.0,miguel hernandez,miguel,hernandez,Male,18/04/1947,69,Greater than 45,Other,0,...,,Risk of Recidivism,1,Low,14/08/2013,Risk of Violence,1,Low,0,0
2,3.0,michael ryan,michael,ryan,Male,6/2/85,31,25 - 45,Caucasian,0,...,,Risk of Recidivism,5,Medium,31/12/2014,Risk of Violence,2,Low,0,0
3,4.0,kevon dixon,kevon,dixon,Male,22/01/1982,34,25 - 45,African-American,0,...,Felony Battery (Dom Strang),Risk of Recidivism,3,Low,27/01/2013,Risk of Violence,1,Low,0,1
4,5.0,ed philo,ed,philo,Male,14/05/1991,24,Less than 25,African-American,0,...,,Risk of Recidivism,4,Low,14/04/2013,Risk of Violence,3,Low,4,0


In [3]:
# clean up
# drop duplicate rows ignoring the id column
crime_df.drop_duplicates(subset=crime_df.columns.difference(['id']), inplace=True)
#select columns of interest 
crime_df = crime_df[['age', 'c_charge_degree', 'race', 'sex', 'priors_count','score_text']]
print(crime_df.head())
crime_df.dropna(inplace=True)
crime_df.reset_index(drop=True, inplace=True)
print(crime_df.shape)

   age c_charge_degree              race   sex  priors_count score_text
0   69            (F3)             Other  Male             0        Low
2   31             NaN         Caucasian  Male             0     Medium
3   34            (F3)  African-American  Male             0        Low
4   24            (F3)  African-American  Male             4        Low
9   23            (F3)  African-American  Male             1       High
(10595, 6)


### k-Fold Cross-Validation
Cross-validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The cross-validation procedure consists of:
- randomly partitioning the data into k parts, or 'folds'
- setting one fold aside for testing
- training the model on the remaining k-1 folds
- evaluating the model on the test fold
- repeating the process k times, so that each fold serves as the test set once
- averaging the evaluation results over the k iterations

The average over the k folds estimates the model's performance, and the spread of the per-fold scores gives a sense of the variance of the learning algorithm (i.e., its dependence on variations in the training data). Keep in mind that the training sets overlap considerably, so the k scores are clearly not independent.
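The steps listed above can be sketched as an explicit loop. This is a minimal illustration on synthetic data from `make_classification` (an assumption, not the COMPAS data), and the names `X_demo`, `y_demo`, and `fold_scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the COMPAS data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

k = 5
folds = KFold(n_splits=k, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in folds.split(X_demo):
    # Train a fresh model on the k-1 training folds
    model = DecisionTreeClassifier(max_leaf_nodes=5, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Evaluate accuracy on the single held-out fold
    fold_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

fold_scores = np.array(fold_scores)
print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", fold_scores.mean())
print("Std across folds:", fold_scores.std())
```

`cross_val_score` wraps essentially this loop, which is why it only needs an estimator, the data, and a splitter.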

Let's revisit our first model, the decision tree classifier.

In [4]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

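Recall that we used one-hot encoding to include nominal variables in our model. In one-hot encoding, a variable with n distinct values is replaced by n indicator (dummy) columns in the design matrix. Because any one of these columns is determined by the other n-1, one column can be dropped without losing information, leaving n-1 columns. This is repeated for each nominal variable we want to use in our model.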
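As a small illustration of the column counts, the sketch below encodes a toy three-valued variable with `pd.get_dummies`, with and without `drop_first=True`. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# n = 3 distinct values -> 3 indicator columns
full = pd.get_dummies(toy, columns=["color"])

# drop_first=True removes the first category's column -> n - 1 = 2 columns
reduced = pd.get_dummies(toy, columns=["color"], drop_first=True)

print(list(full.columns))     # ['color_blue', 'color_green', 'color_red']
print(list(reduced.columns))  # ['color_green', 'color_red']
```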

In [5]:
# for learning our tree we want to use age, c_charge_degree, race, sex, priors_count
x = crime_df[['age', 'c_charge_degree', 'race', 'sex', 'priors_count']]

In [6]:
# we want to predict the score text let's make that our y
y = crime_df['score_text']

In [7]:
# converting categorical features using one-hot encoding (i.e., dummy features)
# the textbook approach using the statsmodels categorical function is deprecated
# we will be using pandas get_dummies function instead
x = pd.get_dummies(x, columns=['race','sex','c_charge_degree' ], prefix = ['dummy','dummy','dummy'])

# sex attribute has two values only so we can drop one of them since they are considered redundant 
x.drop('dummy_Male', axis=1, inplace=True)


The KFold class can (intuitively) be used to implement k-fold CV. Here we will use k = 10, a common choice for k, on the COMPAS data set. We shuffle the rows before splitting and set a random seed (random_state=1) so that the fold assignment is reproducible.

In [8]:
k = 10
crossvalidation = KFold(n_splits=k, random_state=1, shuffle=True)

Set up the CART decision tree model

In [9]:
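# cross_val_score fits a fresh clone of the estimator on each fold,
# so this fit on the full data does not affect the CV scores
cart01 = DecisionTreeClassifier(criterion="gini", max_leaf_nodes=5).fit(x, y)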

Now we can evaluate our model on each of the k folds. Note that the 'cv' argument is set to the KFold object we created above.

In [10]:
cart_cv_scores = cross_val_score(cart01, x, y, cv=crossvalidation)

Let's see how well our model performed

In [11]:
print("Cart cross validation scores with k=10: ", cart_cv_scores)
print("Average score of all folds:",cart_cv_scores.mean())

Cart cross validation scores with k=10:  [0.63018868 0.58679245 0.6254717  0.58490566 0.61792453 0.62417375
 0.62983947 0.63172805 0.63078376 0.60434372]
Average score of all folds: 0.616615176296613
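Besides the mean, it can help to look at how much the fold scores vary. The sketch below summarizes the ten CART accuracies copied from the output above; the variable name `scores` is hypothetical.

```python
import numpy as np

# Fold accuracies copied from the CART cross-validation output
scores = np.array([0.63018868, 0.58679245, 0.6254717, 0.58490566, 0.61792453,
                   0.62417375, 0.62983947, 0.63172805, 0.63078376, 0.60434372])

print(f"mean = {scores.mean():.4f}")
print(f"std  = {scores.std(ddof=1):.4f}")  # sample standard deviation across folds
print(f"min  = {scores.min():.4f}, max = {scores.max():.4f}")
```

Because the folds share most of their training data, the standard deviation is a rough indication of stability rather than a formal error bar.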


### To Do: How does this result using k-fold cross-validation compare to the result we got from using a training-testing split in the previous activity?

In [None]:
## provides a more robust estimate of the model’s performance compared to a simple training-testing split

### To Do: Repeat using k-fold cross-validation on the C5.0 Decision Tree model and compare the evaluation result to results from using training-testing split.

In [12]:
cart_c5 = DecisionTreeClassifier(criterion="entropy", max_leaf_nodes=5).fit(x, y)
cart_c5_cv_scores = cross_val_score(cart_c5, x, y, cv=crossvalidation)

print("C5.0-like Decision Tree cross-validation scores with k=10: ", cart_c5_cv_scores)
print("Average score of all folds:", cart_c5_cv_scores.mean())

C5.0-like Decision Tree cross-validation scores with k=10:  [0.62641509 0.58679245 0.6254717  0.58490566 0.61792453 0.62134089
 0.62700661 0.63172805 0.63078376 0.60434372]
Average score of all folds: 0.6156712455680867


### To Do: Repeat using k-fold cross-validation on the Random Forest Decision Tree model and compare the evaluation result to results from using training-testing split.

In [13]:
random_forest = RandomForestClassifier(random_state=1).fit(x, y)
random_forest_cv_scores = cross_val_score(random_forest, x, y, cv=crossvalidation)

print("Random Forest cross-validation scores with k=10: ", random_forest_cv_scores)
print("Average score of all folds:", random_forest_cv_scores.mean())

Random Forest cross-validation scores with k=10:  [0.62924528 0.57830189 0.62264151 0.58207547 0.61132075 0.6128423
 0.60056657 0.60528801 0.63456091 0.598678  ]
Average score of all folds: 0.6075520694140074


### To Do: If you were going to use cross-validation with the leave-one-out method on this data set, what value of k would you use? Why did you choose this value?

In [None]:
# the value of k should be set to the number of observations in the dataset.
# it provides the maximum possible training data for each fold, making it useful when the dataset is very small
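A minimal sketch of this equivalence, using scikit-learn's `LeaveOneOut` splitter on a tiny placeholder array (`X_small` is hypothetical). For the cleaned COMPAS data, with 10,595 rows, the same approach would mean 10,595 model fits.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

# Tiny placeholder array with 6 rows (hypothetical data)
X_small = np.arange(12).reshape(6, 2)

loo = LeaveOneOut()
kf = KFold(n_splits=len(X_small))  # k = n, no shuffling

print("Number of LOO splits:", loo.get_n_splits(X_small))

# Each test set holds exactly one row, and both splitters agree
loo_tests = [test.tolist() for _, test in loo.split(X_small)]
kf_tests = [test.tolist() for _, test in kf.split(X_small)]
print(loo_tests)
print(loo_tests == kf_tests)
```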

## Stratified k-folds
Stratified sampling is a method of selecting samples from a population by dividing the population into groups, referred to as "strata," based on a specific characteristic, and then choosing samples from each stratum in proportions that mirror their representation in the overall population.

Incorporating the principle of stratified sampling into cross-validation ensures that the training and test sets in every fold keep approximately the same distribution of the target feature as observed in the original dataset.
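The effect is easiest to see on deliberately imbalanced, sorted labels. This sketch compares the share of the minority class in each test fold for `KFold` and `StratifiedKFold`; the toy arrays and the helper `minority_share_per_fold` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 80 of class 0 followed by 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # the splitters ignore feature values

def minority_share_per_fold(splitter):
    """Fraction of class-1 rows in each test fold."""
    return [float(y_toy[test_idx].mean())
            for _, test_idx in splitter.split(X_toy, y_toy)]

plain = minority_share_per_fold(KFold(n_splits=5))
strat = minority_share_per_fold(StratifiedKFold(n_splits=5))

print("KFold:          ", plain)  # [0.0, 0.0, 0.0, 0.0, 1.0]
print("StratifiedKFold:", strat)  # 0.2 in every fold
```

With unshuffled `KFold`, four test folds contain no minority rows at all, while every stratified fold matches the overall 20% share.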

In [14]:
from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=3,shuffle=True,random_state=11)
splits = kfold.split(x,y)

Let's see what the proportions of the different values of the target variable look like after using stratified sampling to create our folds.

In [15]:
print(f'PROPORTION OF TARGET IN THE ORIGINAL DATA\n{y.value_counts() / len(x)}\n\n')

PROPORTION OF TARGET IN THE ORIGINAL DATA
score_text
Low       0.548372
Medium    0.255781
High      0.195847
Name: count, dtype: float64




Let's also display some information about the sizes of training and test sets and the distribution of the target feature within these sets for each split from the cross-validation process. It can help us assess how the data is divided and whether the target feature distribution is maintained across different splits.

In [16]:
# use the target y (not a feature column of x) and match each label to its index set
for n, (train_index, test_index) in enumerate(splits):
    n_total = len(train_index) + len(test_index)
    print(f'SPLIT NO {n+1}\nTRAINING SET SIZE: {np.round(len(train_index) / n_total, 2)}'+
          f'\tTEST SET SIZE: {np.round(len(test_index) / n_total, 2)}\nPROPORTION OF TARGET IN THE TRAINING SET\n'+
          f'{y.iloc[train_index].value_counts() / len(train_index)}\nPROPORTION OF TARGET IN THE TEST SET\n'+
          f'{y.iloc[test_index].value_counts() / len(test_index)}\n\n')




### To Do: Repeat creating the Random Forest Decision Tree model using stratified k-fold cross-validation and compare the evaluation result to the results from using the k-fold approach above.

In [17]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Instantiate the Random Forest model
random_forest = RandomForestClassifier(random_state=1)

# Stratified K-Fold Cross-Validation with 3 splits
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=11)
random_forest_stratified_cv_scores = cross_val_score(random_forest, x, y, cv=kfold)

# Print results
print("Random Forest cross-validation scores with Stratified K-Fold (n_splits=3):", random_forest_stratified_cv_scores)
print("Average score of all stratified folds:", random_forest_stratified_cv_scores.mean())

Random Forest cross-validation scores with Stratified K-Fold (n_splits=3): [0.59881087 0.59569649 0.59246672]
Average score of all stratified folds: 0.5956580281920827
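Note that the plain k-fold run above used 10 folds and the stratified run used 3, so part of the difference in scores may come from the smaller training sets rather than from stratification. A fairer comparison keeps the number of folds fixed and changes only the splitter. The sketch below does this on synthetic three-class data (an assumption, not the COMPAS data); `X_syn`, `y_syn`, and `results` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

# Synthetic imbalanced three-class data (stand-in for the real data set)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_classes=3,
    weights=[0.55, 0.25, 0.20], random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=1)

# Same number of folds and same seed; only the splitting strategy differs
splitters = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=1),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
}

results = {}
for name, splitter in splitters.items():
    scores = cross_val_score(model, X_syn, y_syn, cv=splitter)
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```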
