# K-Fold Cross Validation

In this method, we split the dataset into k subsets (known as folds). We train the model on k-1 of the folds and hold out the remaining fold to evaluate the trained model. We repeat this k times, reserving a different fold for testing each time.
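
As a rough illustration of the splitting step described above, the sketch below builds the folds by hand in plain Python. The helper name `k_fold_indices` is hypothetical and is not part of scikit-learn; it only assumes contiguous, unshuffled folds.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        # Every index outside the current fold is used for training.
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


splits = list(k_fold_indices(n_samples=6, k=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each sample lands in the test fold exactly once and in the training set k-1 times, which is the property the method relies on.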

Note:
k = 10 is a commonly suggested value. A very low k behaves much like a single 
train/validation split, while a very high k approaches the leave-one-out cross-validation (LOOCV) method.
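
To make the link to LOOCV concrete, the following sketch compares `KFold` with k equal to the number of samples against scikit-learn's `LeaveOneOut` splitter. The six-element array is only an illustrative stand-in for real data.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

data = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

# With k equal to the number of samples, every test fold holds exactly one sample.
kfold_splits = [(tr.tolist(), te.tolist())
                for tr, te in KFold(n_splits=len(data)).split(data)]
loo_splits = [(tr.tolist(), te.tolist())
              for tr, te in LeaveOneOut().split(data)]

print(kfold_splits == loo_splits)
```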

In [32]:
from numpy import array
from sklearn.model_selection import KFold

In [33]:
data = array([0.1,0.2,0.3,0.4,0.5,0.6])
data

array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

In [34]:
# Signature: KFold(n_splits=5, *, shuffle=False, random_state=None)
# shuffle and random_state are keyword-only, so they must be passed by name.
kfold = KFold(n_splits=3, shuffle=True, random_state=1)



In [35]:
# Enumerate the split

for train, test in kfold.split(data):
    print('train: %s, test:%s' %(data[train],data[test]))

train: [0.1 0.4 0.5 0.6], test:[0.2 0.3]
train: [0.2 0.3 0.4 0.6], test:[0.1 0.5]
train: [0.1 0.2 0.3 0.5], test:[0.4 0.6]


# Hyper Parameter Optimization

A machine learning model is a mathematical model with a number of parameters that are learned from data. By training a model on existing data, we fit those parameters.
However, there is another kind of parameter, known as a hyperparameter, that cannot be learned directly by the regular training process. Hyperparameters are usually fixed before training begins. They express important properties of the model, such as its complexity or how fast it should learn.
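
The distinction can be seen on a small random forest. This is only a sketch: the synthetic data from `make_classification` and the names `X_demo`, `y_demo` and `model` are illustrative and not part of the wine dataset used later.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, used only for this illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hyperparameters are chosen before training and passed to the constructor.
model = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
hyperparams = model.get_params()
print(hyperparams["n_estimators"], hyperparams["max_depth"])

# Learned state (the fitted trees) exists only after calling fit().
fitted_before = hasattr(model, "estimators_")
model.fit(X_demo, y_demo)
n_trees = len(model.estimators_)
max_depth_found = max(tree.get_depth() for tree in model.estimators_)
print(fitted_before, n_trees, max_depth_found)
```

Here `n_estimators` and `max_depth` constrain what the model can learn, while the split rules inside each tree are what training actually fits.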



### Two common strategies for hyperparameter tuning are:

> GridSearchCV

> RandomizedSearchCV

In [36]:
import numpy as np
import pandas as pd

In [37]:
df = pd.read_csv("C:/Users/arun_r2/Desktop/Board_Infinity/Python/Raw_File_main/winequality-red.csv")

In [38]:
df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [39]:
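# First 11 columns are the features, column 11 is the target.
# Note: astype(int) truncates the fractional part of every feature (e.g. 0.7 becomes 0).
X = df.iloc[:,0:11].astype(int)
y = df.iloc[:,11].astype(int)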

In [40]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)

In [41]:
from sklearn.ensemble import RandomForestClassifier
Classifier = RandomForestClassifier(n_estimators = 300, random_state=0)

In [42]:
from sklearn.model_selection import cross_val_score
all_accuracies = cross_val_score(estimator=Classifier, X=X_train, y=y_train, cv=5)

In [43]:
print(all_accuracies)

[0.57421875 0.6171875  0.59375    0.6328125  0.60784314]


In [44]:
print(all_accuracies.mean())

0.6051623774509804
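
For intuition about what `cross_val_score` does with `cv=5`, the sketch below reproduces it with an explicit loop. It assumes synthetic data from `make_classification` in place of the wine CSV, and relies on scikit-learn using unshuffled `StratifiedKFold` when a classifier is given an integer `cv`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data, used only for this illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
clf = RandomForestClassifier(n_estimators=20, random_state=0)

manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X_demo, y_demo):
    fold_model = clone(clf)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    # score() returns accuracy for classifiers
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))

auto_scores = cross_val_score(clf, X_demo, y_demo, cv=5)
print(np.allclose(manual_scores, auto_scores))
```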


# Grid Search CV - Grid Search Cross-Validation

In the GridSearchCV approach, a machine learning model is evaluated for a range of hyperparameter values. It is called grid search because it looks for the best set of hyperparameters in a grid of candidate values.

Drawback: GridSearchCV evaluates every combination of hyperparameters in the grid, which makes grid search computationally very expensive.

In [45]:
grid_param = {
    'n_estimators' :[100,300,500,800,1000],
    'criterion' :['gini','entropy'],
    'bootstrap' :[True, False]
}

# 5 x 2 x 2 = 20 hyperparameter combinations will be evaluated
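
The size of a grid like this can be checked before running the search. The sketch below uses scikit-learn's `ParameterGrid` to count the combinations; the variable names `n_combinations`, `cv_folds` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

grid_param = {
    'n_estimators': [100, 300, 500, 800, 1000],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
}

# ParameterGrid enumerates the Cartesian product of all value lists.
n_combinations = len(ParameterGrid(grid_param))

# With 5-fold cross-validation, every combination is fitted once per fold.
cv_folds = 5
n_fits = n_combinations * cv_folds

print(n_combinations, n_fits)  # 20 100
```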

In [46]:
from sklearn.model_selection import GridSearchCV

gd_sr = GridSearchCV(estimator=Classifier,
                     param_grid=grid_param,
                     scoring='accuracy',
                     cv=5,
                     n_jobs=-1)
# n_jobs sets how many CPU cores run in parallel; -1 uses all available cores.



In [47]:
# import time
# start = time.time()

gd_sr.fit(X_train, y_train)

# print('execution time', time.time() - start, 'seconds')

# This takes a long time to execute: each of the 20 combinations is
# fitted on 5 folds (100 fits), plus a final refit of the best combination.

GridSearchCV(cv=5,
             estimator=RandomForestClassifier(n_estimators=300, random_state=0),
             n_jobs=-1,
             param_grid={'bootstrap': [True, False],
                         'criterion': ['gini', 'entropy'],
                         'n_estimators': [100, 300, 500, 800, 1000]},
             scoring='accuracy')

In [48]:
best_parameter = gd_sr.best_params_
print(best_parameter)

{'bootstrap': True, 'criterion': 'gini', 'n_estimators': 1000}


In [49]:
best_result = gd_sr.best_score_
print(best_result)

0.6090655637254903
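
Beyond `best_params_` and `best_score_`, a fitted search also exposes per-combination results and can be scored on held-out data. The sketch below assumes synthetic data and a deliberately tiny grid so it runs quickly; `X_demo`, `demo_search` and the other names are illustrative, not the notebook's variables.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data, used only for this illustration.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "criterion": ["gini", "entropy"]},
    scoring="accuracy",
    cv=3,
)
demo_search.fit(X_tr, y_tr)

# cv_results_ holds one row per hyperparameter combination.
results = pd.DataFrame(demo_search.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))

# score() uses the best combination, refitted on the whole training set.
test_accuracy = demo_search.score(X_te, y_te)
print(test_accuracy)
```

The cross-validated `best_score_` is computed on the training data only, so checking the refitted model against the untouched test split gives a less optimistic estimate.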


# Randomized Search CV

RandomizedSearchCV addresses this drawback of GridSearchCV by evaluating only a fixed number of hyperparameter settings. It samples settings from the grid at random to look for the best set of hyperparameters. This reduces unnecessary computation, although the best combination in the grid may not be among those sampled.
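
To see the sampling idea in isolation, the sketch below uses scikit-learn's `ParameterSampler` to draw 5 settings from a grid and compares that with the full grid size. The grid values mirror the ones used in this section; the names `n_total` and `sampled` are illustrative.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

rf_params = {
    'max_depth': [3, 5, 10],
    'max_features': [1, 2, 3, 4, 5, 6],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
    'min_samples_leaf': [1, 2, 3, 5, 7, 8, 9, 10],
}

# An exhaustive grid search would evaluate every one of these combinations.
n_total = len(ParameterGrid(rf_params))

# A randomized search only evaluates n_iter randomly drawn combinations.
sampled = list(ParameterSampler(rf_params, n_iter=5, random_state=0))

print(n_total, len(sampled))
for params in sampled:
    print(params)
```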

In [51]:
rf_params = {'max_depth': [3,5,10],
            'max_features': (1,2,3,4,5,6),
            'criterion': ['gini','entropy'],
            'bootstrap': [True,False],
            'min_samples_leaf':(1,2,3,5,7,8,9,10)
            }

In [53]:
from sklearn.model_selection import RandomizedSearchCV
# n_iter (the number of sampled settings) is keyword-only in current scikit-learn.
gd_sr = RandomizedSearchCV(Classifier, rf_params, n_iter=5, random_state=0)



In [54]:
search = gd_sr.fit(X_train,y_train)

In [56]:
search.best_params_

{'min_samples_leaf': 8,
 'max_features': 5,
 'max_depth': 10,
 'criterion': 'entropy',
 'bootstrap': True}

In [57]:
best_result = search.best_score_
print(best_result)

0.5691973039215685
