## What is Cross Validation, and Why Should I Cross Validate My Model?

Normally in a machine learning process, the data is divided into training and test sets; the training set is used to train the model and the test set is used to evaluate its performance. However, this approach may lead to variance problems. In simpler words, a variance problem is the scenario where the accuracy obtained on one test set is very different from the accuracy obtained on another test set using the same algorithm.

In other words, we can't be sure that the model will reach the desired accuracy, or how much that accuracy will vary, in a production environment. We need some assurance about the accuracy of the predictions our model is putting out. For this, we need to validate our model.

### Train-Test Split Approach

Many are only familiar with the popular train-test split, where a certain percentage of the data is kept aside for validation. In this case we randomly split the complete data into training and test sets, then train the model on the training set and use the test set for validation, typically splitting the data 70:30 or 80:20. With this approach there is a possibility of high bias if we have limited data, because we miss some information in the data that was not used for training.

![image](images/1-16.png)

We keep using the same random 20% for testing. What if the model predicts well on this 20% just by chance or luck? We really need to be sure that we would get the same result on entirely different data (i.e. that the model generalizes).
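To make the "luck of the split" concern concrete, here is a minimal sketch that repeats the same 80:20 split with different random seeds and records the score each time. The data is synthetic (generated with `make_classification`), and the names `X_demo`, `y_demo` and `split_scores` are made up for this illustration; they are not part of the tutorial's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

split_scores = []
for seed in range(5):
    # Same 80:20 ratio every time; only the random split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.20, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]
    split_scores.append(roc_auc_score(y_te, proba))

# One score per split; the spread shows how much a single split can mislead.
print([round(s, 3) for s in split_scores])
```

If the printed scores differ noticeably from seed to seed, a single train-test split is not telling the whole story.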

### K-Folds Cross Validation

K-Fold is popular and easy to understand, and it generally gives a less biased estimate of model performance compared to other methods, because it ensures that every observation from the original dataset gets the chance to appear in both the training and the test set. It is one of the best approaches when we have limited input data.

*How does it work*

![title](images/cv.png)

1. Split the entire data randomly into k folds (the value of k shouldn't be too small or too large; typically we choose 5 to 10 depending on the data size).

2. Fit the model using k-1 (k minus 1) folds and validate it on the remaining fold. Note down the score/error.

3. Repeat this process until each of the k folds has served as the test set, then take the average of your recorded scores. That average is the performance metric for the model.
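The three steps above can be sketched by hand with scikit-learn's `KFold` splitter. This is only an illustration on synthetic data; `X_demo`, `y_demo` and `fold_scores` are names invented for the sketch, and LogisticRegression is used simply because it trains quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Step 1: split the data randomly into k = 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kf.split(X_demo):
    # Step 2: fit on k-1 folds, validate on the remaining fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    proba = model.predict_proba(X_demo[test_idx])[:, 1]
    fold_scores.append(roc_auc_score(y_demo[test_idx], proba))

# Step 3: the average of the recorded scores is the performance metric.
mean_score = np.mean(fold_scores)
print([round(s, 3) for s in fold_scores], round(mean_score, 3))
```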

Cross validation in plain English, what do we mean?
Let's say we have a dataset of 100 rows (samples) and we pick k (the number of folds) to be 5,
meaning each fold will have 20 rows [20 - 20 - 20 - 20 - 20].

As in the picture above, for the first iteration we take k - 1 folds (5 - 1 = 4), i.e. 4 folds (80 rows), for training, then test on the fold we left aside (20 rows).

For the next iteration we take another set of 20 rows for testing and train on the rest.
This continues until every fold has been used for testing.
Easy, right?
By doing this we have trained on all our data and tested on all our data.

NB: The score for each iteration is recorded.
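The 100-row, 5-fold arithmetic above can be checked directly. The sketch below uses `KFold` without shuffling on a plain array of row numbers (a stand-in for a real dataset), so each test fold is a contiguous block of 20 rows.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(100)          # stand-in for a 100-row dataset
kf = KFold(n_splits=5)         # no shuffling, so folds are contiguous blocks

fold_sizes = []
tested_rows = []
for i, (train_idx, test_idx) in enumerate(kf.split(rows), start=1):
    fold_sizes.append((len(train_idx), len(test_idx)))
    tested_rows.extend(test_idx)
    print(f"iteration {i}: train on {len(train_idx)} rows, "
          f"test on rows {test_idx[0]}-{test_idx[-1]}")
```

Every iteration trains on 80 rows and tests on 20, and across the five iterations each row is used for testing exactly once.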

You can decide to code this manually, if you like doing things the hard way!

Or use scikit-learn's built-in function 'cross_val_score' and save yourself the time!

In [4]:
from sklearn.model_selection import cross_val_score

I'll be using these algorithms:
- RandomForestClassifier
- GradientBoostingClassifier
- DecisionTreeClassifier
- LogisticRegression

In [5]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression


In [6]:
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()
lr = LogisticRegression()
dsc = DecisionTreeClassifier()

### First, Let's Use the Train-Test Split Approach

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
import pandas as pd
import numpy as np
data = pd.read_csv('data/clean_train.csv') #importing my already preprocessed data
y = data['Claim'] #defining my target variable
X = data.drop(['Claim','Customer Id'], axis = 1) #defining my input data


In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [10]:
import warnings
warnings.filterwarnings('ignore')

In [11]:
rfc.fit(X_train, y_train) #training the RandomForestClassifier
gbc.fit(X_train, y_train) #training the GradientBoostingClassifier
lr.fit(X_train, y_train)  #training the LogisticRegression
dsc.fit(X_train, y_train) #training the DecisionTreeClassifier

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [12]:
#predicting on the test set
rfc_pred = rfc.predict_proba(X_test)[:,-1]
gbc_pred = gbc.predict_proba(X_test)[:,-1]
lr_pred = lr.predict_proba(X_test)[:,-1]
dsc_pred = dsc.predict_proba(X_test)[:,-1]

In [13]:
#evaluation metric for submission
from sklearn.metrics import roc_auc_score

In [14]:
rfc_score = roc_auc_score(y_test,rfc_pred)
gbc_score = roc_auc_score(y_test,gbc_pred)
lr_score = roc_auc_score(y_test,lr_pred)
dsc_score = roc_auc_score(y_test,dsc_pred)

dict_ = {'Algorithm': ['RandomForestClassifier','GradientBoostingClassifier', 
                       'LogisticRegression','DecisionTreeClassifier'],
         'roc_auc_score':[rfc_score, gbc_score, lr_score,dsc_score]}
df = pd.DataFrame(dict_, index=[0,1,2,3])
df

| | Algorithm | roc_auc_score |
|---|---|---|
| 0 | RandomForestClassifier | 0.648452 |
| 1 | GradientBoostingClassifier | 0.720327 |
| 2 | LogisticRegression | 0.71645 |
| 3 | DecisionTreeClassifier | 0.593968 |


GradientBoostingClassifier and LogisticRegression look promising, and their scores are not too far apart.

But which of the two should I select as my model?
Cross validation will help us out here.

### K-Fold Cross Validation

In [15]:
gbc_cv = cross_val_score(gbc,X,y,scoring='roc_auc', cv = 5)#k=5, meaning 5 fold cross validation
lr_cv = cross_val_score(lr,X,y,scoring='roc_auc', cv = 5)#k=5, meaning 5 fold cross validation
gbc_cv = [i.round(3) for i in gbc_cv]
lr_cv = [i.round(3) for i in lr_cv]
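A detail worth knowing about `cv = 5`: when the estimator is a classifier and the target is binary or multiclass, scikit-learn turns the integer into a `StratifiedKFold` splitter, so each fold keeps roughly the same class balance. The sketch below checks this on synthetic data; `X_demo` and `y_demo` are invented for the illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Integer cv: scikit-learn picks the splitter for us.
scores_int = cross_val_score(clf, X_demo, y_demo, scoring='roc_auc', cv=5)

# Explicit splitter: the same folds, written out by hand.
scores_skf = cross_val_score(
    clf, X_demo, y_demo, scoring='roc_auc', cv=StratifiedKFold(n_splits=5)
)

print(np.round(scores_int, 3))
print(np.round(scores_skf, 3))
```

Because the folds are not shuffled, both models in the cell above are scored on exactly the same five splits, which is what makes their per-fold scores comparable.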

In [17]:
dict_ = {'algorithm':['GradientBoostingClassifier','LogisticRegression'],
         'cv_score':[gbc_cv,lr_cv],
          'cv_mean': [np.mean(gbc_cv),np.mean(lr_cv)],
         'cv_std': [np.std(gbc_cv),np.std(lr_cv)]}

In [18]:
df = pd.DataFrame(dict_, index = [0,1])
df

| | algorithm | cv_score | cv_mean | cv_std |
|---|---|---|---|---|
| 0 | GradientBoostingClassifier | [0.726, 0.727, 0.714, 0.697, 0.691] | 0.711 | 0.014738 |
| 1 | LogisticRegression | [0.721, 0.721, 0.713, 0.694, 0.694] | 0.7086 | 0.012274 |


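Now we can see that GradientBoostingClassifier scores slightly better than LogisticRegression on unseen data on average (cv_mean of 0.711 vs 0.7086). The gap is smaller than the spread between folds (cv_std), so the advantage is modest rather than guaranteed, but it is evident from our cv_score.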
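One way to read the two cv_score rows is fold by fold. The short sketch below only re-uses the rounded scores reported in the table above, and assumes both models were scored on the same five folds (which is what an integer `cv` without shuffling gives).

```python
import numpy as np

# Rounded per-fold ROC AUC scores copied from the table above.
gbc_scores = np.array([0.726, 0.727, 0.714, 0.697, 0.691])
lr_scores = np.array([0.721, 0.721, 0.713, 0.694, 0.694])

diff = gbc_scores - lr_scores          # positive means GBC did better on that fold
gbc_wins = int((diff > 0).sum())       # number of folds where GBC is ahead

print("per-fold difference:", diff.round(3))
print("folds won by GBC:", gbc_wins, "of", len(diff))
print("mean difference:", round(diff.mean(), 4))
```

GradientBoostingClassifier is ahead on four of the five folds, with a mean difference of 0.0024 in ROC AUC.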

## What Next?

Can we still improve on our current score?

I think there is still more juice to squeeze out of GradientBoostingClassifier.

If we can find the right combination of hyperparameters, our model should hopefully perform better.

### Parameter vs Hyperparameter: What's the Difference?

A machine learning model has two types of parameters. The first type are the parameters that the model learns from the data during training, while the second type are the hyperparameters that we choose ourselves and pass to the model before training.
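A small sketch of the distinction, using LogisticRegression on synthetic data (the data and the value `C=0.5` are arbitrary choices for illustration): `C` and `max_iter` are hyperparameters we pass in, while `coef_` and `intercept_` are parameters that only exist after the model has been fitted.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameters: chosen by us, before training.
clf = LogisticRegression(C=0.5, max_iter=1000)
print("hyperparameter C:", clf.get_params()["C"])

# Parameters: learned from the data during fit().
clf.fit(X_demo, y_demo)
print("learned coefficients:", clf.coef_)
print("learned intercept:", clf.intercept_)
```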

In [19]:
#For GradientBoostingClassifier we used the default hyperparameters
print(gbc)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='auto',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)


Normally we set the values of these hyperparameters by hand, more or less at random, and see which values give the best performance. However, picking hyperparameters this way can be exhausting.

Also, it is not easy to compare the performance of different algorithms with hand-picked hyperparameters, because one algorithm may beat another under one set of values and then perform worse than the other once the values are changed.

Therefore, instead of picking the values at random, a better approach is to use an algorithm that automatically finds the best hyperparameters for a particular model. Grid Search is one such algorithm: it tries every combination from a grid of candidate values and keeps the best-scoring one.
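To get a feel for what a grid search has to evaluate, the sketch below enumerates a grid with scikit-learn's `ParameterGrid`. The candidate values are the learning_rate and n_estimators lists searched later in this tutorial; the count of fits assumes 5-fold cross validation and ignores the final refit on the full training data.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'learning_rate': [0.1, 0.02, 0.05, 0.5, 0.01],
    'n_estimators': [100, 300, 500, 600, 750, 900, 1000],
}

# Every combination of the listed values is one candidate model.
candidates = list(ParameterGrid(param_grid))
n_folds = 5
n_fits = len(candidates) * n_folds

print(candidates[0])
print(len(candidates), "candidates x", n_folds, "folds =", n_fits, "model fits")
```

5 learning rates times 7 values of n_estimators gives 35 candidates, so a 5-fold grid search trains 175 models, which is why larger grids get slow quickly.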

## Grid Search with Sklearn 

In [20]:
from sklearn.model_selection import GridSearchCV

We already have a baseline model without tuning. I'll start by tuning n_estimators, learning_rate and max_depth (most people consider them to be the most important hyperparameters).


- Create a dictionary of parameters
- Define your GridSearch parameters
- Fit on the training data
- Get the best parameters
- Update the GridSearch parameters by adding the best parameters
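The steps in the list can be sketched end to end on a small scale. This is a toy version under stated assumptions: synthetic data, deliberately tiny grids and `cv=3` so it runs in seconds, and made-up names (`stage1`, `stage2`, `best`). It is meant to show the shape of the staged workflow, not to reproduce the tutorial's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Stage 1: create the dictionary of parameters, define the search, fit.
stage1 = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={'learning_rate': [0.05, 0.1], 'n_estimators': [20, 40]},
    scoring='roc_auc',
    cv=3,
)
stage1.fit(X_demo, y_demo)
print("stage 1 best:", stage1.best_params_)

# Stage 2: fix the stage-1 winners on the estimator, then search the next parameter.
stage2 = GridSearchCV(
    GradientBoostingClassifier(random_state=0, **stage1.best_params_),
    param_grid={'max_depth': [1, 2, 3]},
    scoring='roc_auc',
    cv=3,
)
stage2.fit(X_demo, y_demo)

# Combined result of both stages.
best = {**stage1.best_params_, **stage2.best_params_}
print("combined best:", best, round(stage2.best_score_, 3))
```

Tuning in stages keeps each grid small, at the cost of not exploring interactions between parameters tuned in different stages.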


In [21]:
# p_test3 = {'learning_rate':[0.15,0.1,0.05,0.01,0.005,0.001], 'n_estimators':[100,250,500,750,1000,1250,1500,1750]}

# tuning = GridSearchCV(estimator =GradientBoostingClassifier(max_depth=4, min_samples_split=2, min_samples_leaf=1, subsample=1,max_features='sqrt', random_state=10), 
#             param_grid = p_test3, scoring='accuracy',n_jobs=4,iid=False, cv=5)
# tuning.fit(X_train,y_train)
# tuning.grid_scores_, tuning.best_params_, tuning.best_score_

In [81]:
#Create dictionary of parameters
param = {'learning_rate':[0.1,0.02,0.05,0.5,0.01],
           'n_estimators':[100,300,500,600,750,900,1000]}
#Define GridSearch parameters (the iid argument was removed in scikit-learn 0.24, so it is omitted)
tuning = GridSearchCV(estimator=gbc,
            param_grid=param, scoring='roc_auc', n_jobs=-1, cv=5)
#Fit on training data
tuning.fit(X_train,y_train)

#Print the best parameters and score
print(tuning.best_params_, tuning.best_score_)

{'learning_rate': 0.01, 'n_estimators': 600} 0.7155278708933107


In [83]:
#Uncomment to see a breakdown of the results
# pd.DataFrame(tuning.cv_results_)
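For a more readable breakdown, the fitted search object's `cv_results_` dictionary can be trimmed to a few columns and sorted by rank. The sketch below does this on a small stand-in search (synthetic data, LogisticRegression with a hypothetical grid over `C`) so that it runs quickly; the same column names apply to any fitted `GridSearchCV`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a tiny grid (assumptions for illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1.0]},
    scoring='roc_auc',
    cv=5,
)
search.fit(X_demo, y_demo)

# Keep the most useful columns and put the best candidate first.
results = pd.DataFrame(search.cv_results_)
summary = results[
    ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
].sort_values('rank_test_score')
print(summary)
```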

### Tuning max_depth

In [84]:
param = {'max_depth':[1,2,3,4,5,6,7] }

#The best values so far, {'learning_rate': 0.01, 'n_estimators': 600}, are fixed on the estimator
tuning = GridSearchCV(estimator=GradientBoostingClassifier(n_estimators=600, learning_rate=0.01),
            param_grid=param, scoring='roc_auc', n_jobs=-1, cv=5)
tuning.fit(X_train,y_train)

print(tuning.best_params_,tuning.best_score_)

{'max_depth': 2} 0.7177232722449656


### Other factors
Tree-related parameters: min_samples_split and min_samples_leaf

In [85]:
param = {'min_samples_split':[2,4,6,8,10],
         'min_samples_leaf':[1,3,5,7,9]}

#Add the new max_depth value to the estimator
tuning = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.01, n_estimators=600, max_depth=2),
            param_grid=param, scoring='roc_auc', n_jobs=-1, cv=5)
tuning.fit(X_train,y_train)

print(tuning.best_params_,tuning.best_score_)

{'min_samples_leaf': 5, 'min_samples_split': 2} 0.7181150388044063


In [89]:
# param = {'max_features':[9,10,12,13,17,20,22]}
# tuning = GridSearchCV(estimator =GradientBoostingClassifier(learning_rate=0.01, n_estimators=600,max_depth=2, min_samples_split=2,
#                                                             min_samples_leaf=5), 
#                             param_grid = param, scoring='roc_auc',n_jobs=-1,iid=False, cv=5)
# tuning.fit(X_train,y_train)

# print(tuning.best_score_,tuning.best_params_)

In [90]:
param= {'subsample':[0.4,0.5,0.55,0.6,0.65,0.7,0.8,1]}

#Add the new min_samples_split and min_samples_leaf values to the estimator
tuning = GridSearchCV(estimator=GradientBoostingClassifier(learning_rate=0.01, n_estimators=600, max_depth=2,
                                                           min_samples_split=2, min_samples_leaf=5, random_state=10),
            param_grid=param, scoring='roc_auc', n_jobs=-1, cv=5)
tuning.fit(X_train,y_train)

print(tuning.best_score_,tuning.best_params_)

0.7181150388044063 {'subsample': 1}


In [91]:
tuning.best_estimator_

GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.01, loss='deviance', max_depth=2,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=5, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=600,
                           n_iter_no_change=None, presort='auto',
                           random_state=10, subsample=1, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

### Now let's see the performance of our tuned model using 5-fold cross validation

In [95]:
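# Note: tuning is the GridSearchCV object, so the subsample search is re-run inside each of the 5 folds
tuned_gbc_cv = cross_val_score(tuning, X, y, scoring = 'roc_auc', cv =5)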

In [96]:
default_gbc_cv = cross_val_score(gbc, X, y, scoring = 'roc_auc', cv =5)

In [97]:
# Rounding scores to 4 decimal places
tuned_gbc_cv = [i.round(4) for i in tuned_gbc_cv]
default_gbc_cv  = [i.round(4) for i in default_gbc_cv]

In [100]:
dict_ = {'model':['default_model','Tuned_model'],
         'cv_score':[default_gbc_cv,tuned_gbc_cv],
          'cv_mean': [np.mean(default_gbc_cv),np.mean(tuned_gbc_cv)],
         'cv_std': [np.std(default_gbc_cv),np.std(tuned_gbc_cv)]}

### Comparing both Models

In [101]:
df = pd.DataFrame(dict_, index = [0,1])
df

| | model | cv_score | cv_mean | cv_std |
|---|---|---|---|---|
| 0 | default_model | [0.7263, 0.7268, 0.7135, 0.6971, 0.6902] | 0.71078 | 0.014937 |
| 1 | Tuned_model | [0.7255, 0.7339, 0.7192, 0.704, 0.6927] | 0.71506 | 0.014855 |


## Conclusion:
Though the improvement looks insignificant (in some ML competitions a 0.0001 improvement can mean a lot),
you can always widen the range of values for each parameter, but note that the more values you have, the longer it
takes to run the grid search.

Another cool option is random search (RandomizedSearchCV in scikit-learn). You could check that out yourself.
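As a starting point, here is a minimal sketch of random search with scikit-learn's `RandomizedSearchCV`. Instead of trying every combination, it samples `n_iter` candidates from the listed values. The data is synthetic and the candidate values, `n_iter=5` and `cv=3` are arbitrary small choices so the example runs quickly; they are not tuned for the tutorial's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption, not the tutorial's dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# 3 x 3 x 3 = 27 possible combinations, but only n_iter of them are tried.
param_distributions = {
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [20, 40, 60],
    'max_depth': [1, 2, 3],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled candidates
    scoring='roc_auc',
    cv=3,
    random_state=0,      # makes the sampled candidates reproducible
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The run time is controlled by `n_iter` rather than by the size of the grid, which makes it easier to search over several parameters at once.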