<a href="https://colab.research.google.com/github/AlvinChiew/MachineLearning/blob/main/Sklearn_CrossValidation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Objectives:
* Reduce the chance of selecting an overfitted model, especially when the dataset is small or dimensionality is high
* Primarily used to assess how well an ML model predicts unseen data, i.e. generalization performance




Here, scores from cross-validation are used for a preliminary evaluation of model performance.
They can serve to
* compare models' performance on unseen data
* estimate the best parameters for a model
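
As a minimal sketch of both uses, the snippet below compares two candidate settings of the regularization parameter `C` of `LogisticRegression` by their mean cross-validated accuracy. The candidate values (0.01 and 1.0), `max_iter=1000`, and the variable names are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_blobs(random_state=0)

# Hypothetical candidate settings, chosen only for illustration
candidates = {
    "C=0.01": LogisticRegression(C=0.01, max_iter=1000),
    "C=1.0": LogisticRegression(C=1.0, max_iter=1000),
}

mean_scores = {}
for name, candidate in candidates.items():
    # 5-fold CV: each candidate is scored on data it was not fitted on
    scores = cross_val_score(candidate, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {mean_scores[name]:.3f}")

# Pick the candidate with the highest mean CV score
best = max(mean_scores, key=mean_scores.get)
print(f"Best by mean CV score: {best}")
```

The selected setting should still be confirmed on a held-out test set, since the CV scores were used to make the choice.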

<img src="https://scikit-learn.org/stable/_images/grid_search_workflow.png" width="600">

`cross_val_predict()` can be used to check the prediction made for each sample while it sits in the held-out fold,
>predictions = cross_val_predict(model, X, y, cv=6)
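
A self-contained sketch of the call above, assuming the same `make_blobs` data and `LogisticRegression` model used later in this notebook:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_blobs(random_state=0)
model = LogisticRegression()

# Each sample is predicted by a model fitted on the folds that exclude it
predictions = cross_val_predict(model, X, y, cv=6)

print(predictions.shape)          # one prediction per sample
print((predictions == y).mean())  # fraction of out-of-fold predictions that match y
```

Because every prediction comes from a model that never saw that sample, the array can be inspected row by row to find which samples are misclassified.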

For CV method visualization : [ref](https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html)

# Import Module

In [1]:
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split, cross_val_score, LeaveOneOut, ShuffleSplit, GroupKFold
from sklearn.linear_model import LogisticRegression

# Load Data & Split

In [2]:
X, y = make_blobs(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Evaluation

## Without CV

In [5]:
model = LogisticRegression()
model.fit(X_train, y_train)
print(f'Train Score : {model.score(X_train, y_train)}')
print(f'Test Score : {model.score(X_test, y_test)}')

Train Score : 0.9066666666666666
Test Score : 0.88


## k-fold CV

In [7]:
# A separate train/test split is not needed: in each fold the data is split into train and test sets at a ratio of k-1 : 1
scores_kfold = cross_val_score(model, X, y, cv=5)
print(f'Scores for each fold : {scores_kfold}')
print(f'Mean score : {scores_kfold.mean()}')

# The mean CV score (0.89) is close to the single-split test score (0.88).
# Averaging over folds gives a more stable estimate; it does not mean the model itself improved.

Scores for each fold : [0.9  0.9  0.9  0.8  0.95]
Mean score : 0.89


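## Stratified k-fold

Stratifying the train-test split in k-fold keeps the label proportions in each fold close to those of the full dataset, which avoids folds with an uneven mix of labels/target values,<br>
e.g. when data is sorted by label or is dominated by specific label(s).
<br><br>
`sklearn.model_selection.StratifiedKFold()`<br>
Note: when `cv` is an integer, the model is a classifier, and `y` is binary or multiclass, `cross_val_score` already uses stratified k-fold.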
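
To illustrate the sorted-label case, the sketch below compares the class counts in each test fold for plain `KFold` and for `StratifiedKFold`. The toy labels (`y_sorted`) and the placeholder features (`X_dummy`) are made up for this example.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Toy labels sorted by class: 6 samples each of classes 0, 1, 2 (illustrative data)
y_sorted = np.repeat([0, 1, 2], 6)
X_dummy = np.zeros((len(y_sorted), 1))  # features do not affect how the splits are made

counts = {}
for name, splitter in [("KFold", KFold(n_splits=3)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=3))]:
    # For every test fold, count how many samples belong to each class
    counts[name] = [np.bincount(y_sorted[test_idx], minlength=3)
                    for _, test_idx in splitter.split(X_dummy, y_sorted)]
    print(name, [c.tolist() for c in counts[name]])
```

With these sorted labels, each unshuffled `KFold` test fold contains a single class, so the model is always tested on a class it never saw during training. `StratifiedKFold` places two samples of every class in each test fold.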


## LeaveOneOut method

In [9]:
# k = number of samples (rows), so each fold holds out a single sample
# Much slower, because the model is fitted once per sample

scores_loo = cross_val_score(model, X, y, cv=LeaveOneOut())
print(f'Scores for each fold : {scores_loo}')
print(f'Mean score : {scores_loo.mean()}')

Scores for each fold : [1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0.
 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
 0. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
 1. 1. 1. 1.]
Mean score : 0.89
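
The relationship between the number of folds and the number of rows can be checked directly on the splitter. This is a small sketch assuming the same `make_blobs` data as above; `n_splits` and `test_sizes` are names introduced here for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import LeaveOneOut

X, y = make_blobs(random_state=0)
loo = LeaveOneOut()

# Number of folds produced for this dataset vs. number of rows
n_splits = loo.get_n_splits(X)
print(n_splits, len(X))

# Collect the distinct test-fold sizes; every fold should hold exactly one sample
test_sizes = {len(test_idx) for _, test_idx in loo.split(X)}
print(test_sizes)
```

This also explains why each per-fold score above is either 0 or 1: with one test sample, the prediction is either wrong or right.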


## ShuffleSplit

In [12]:
# Randomly split the dataset "n_splits" times instead of following the order of the dataset
# Test sets from different splits may overlap

scores_ss = cross_val_score(model, X, y,
                            cv=ShuffleSplit(test_size=0.5, n_splits=10))
print(f'Scores for each fold : {scores_ss}')
print(f'Mean score : {scores_ss.mean()}')

Scores for each fold : [0.88 0.86 0.92 0.84 0.9  0.94 0.9  0.9  0.94 0.9 ]
Mean score : 0.898


## Group k-fold

Splits the data into k folds so that no group appears in both the train and test set of any fold.<br>
This keeps the score from being inflated by the model recognizing a subject it has already seen, e.g. in facial expression recognition where each person contributes several images. Group labels are provided up front, and each group is placed in the test set of exactly one fold.
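
The sketch below prints which groups land in the train and test sets of each fold, using placeholder features and labels (`X_dummy`, `y_dummy`) that are assumptions for illustration. Only the group labels influence the split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

X_dummy = np.zeros((12, 1))        # placeholder features
y_dummy = np.zeros(12, dtype=int)  # placeholder labels
groups = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 3, 3])

folds = []
splitter = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(splitter.split(X_dummy, y_dummy, groups=groups)):
    train_groups = set(groups[train_idx].tolist())
    test_groups = set(groups[test_idx].tolist())
    folds.append((train_groups, test_groups))
    print(f"fold {fold}: train groups {sorted(train_groups)}, test groups {sorted(test_groups)}")

# No group is shared between the train and test side of any fold
print(all(train.isdisjoint(test) for train, test in folds))
```

The exact assignment of groups to folds is decided by `GroupKFold`, which tries to balance fold sizes, so it is not printed here as a fixed expectation.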


In [18]:
X,y = make_blobs(n_samples=12, random_state=0)
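groups = [0,0,0,1,1,1,1,2,2,3,3,3] # number of distinct groups (4) must be >= k (3)
# groups is passed by keyword; recent scikit-learn versions do not accept it positionally
scores_gkf = cross_val_score(model, X, y, groups=groups, cv=GroupKFold(n_splits=3))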
print(f'Scores for each fold : {scores_gkf}')
print(f'Mean score : {scores_gkf.mean()}')
print(X)
print(y)

# The scores are poor because the groups were assigned arbitrarily and are unrelated to X and y, so group k-fold doesn't help in this case.
# With only 12 samples the scores are also very noisy, and overfitting is very likely.

Scores for each fold : [0.75       0.6        0.66666667]
Mean score : 0.6722222222222222
[[ 3.54934659  0.6925054 ]
 [ 1.9263585   4.15243012]
 [ 0.0058752   4.38724103]
 [ 1.12031365  5.75806083]
 [ 1.7373078   4.42546234]
 [ 2.36833522  0.04356792]
 [-0.49772229  1.55128226]
 [-1.4811455   2.73069841]
 [ 0.87305123  4.71438583]
 [-0.66246781  2.17571724]
 [ 0.74285061  1.46351659]
 [ 2.49913075  1.23133799]]
[1 0 2 0 0 1 1 2 0 2 2 1]
