<h2 align='center' style='color:purple'>Finding the best model and tuning hyperparameters using GridSearchCV</h2>

**For the iris flower dataset in the sklearn library, we are going to find the best model and the best hyperparameters using GridSearchCV**

**Load iris flower dataset**

In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets
flowers = datasets.load_iris()
iris = flowers  # second name for the same dataset; no need to load it twice

In [2]:
dir(flowers)

['DESCR', 'data', 'feature_names', 'filename', 'target', 'target_names']

In [3]:
df=pd.DataFrame(flowers.data,columns=flowers.feature_names)

In [4]:
df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [5]:
df['target']=flowers.target

In [6]:
df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [7]:
flowers.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [8]:
print(flowers.target_names[0])
print(flowers.target_names[1])
print(flowers.target_names[2])

setosa
versicolor
virginica


In [9]:
df['Flower_Name']=df.target.apply(lambda x : flowers.target_names[x])

In [10]:
df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target,Flower_Name
0,5.1,3.5,1.4,0.2,0,setosa
1,4.9,3.0,1.4,0.2,0,setosa
2,4.7,3.2,1.3,0.2,0,setosa
3,4.6,3.1,1.5,0.2,0,setosa
4,5.0,3.6,1.4,0.2,0,setosa


In [11]:
df=df.drop(columns='target')

In [12]:
df.columns=['sepallength', 'sepalwidth', 'petallength',
       'petalwidth', 'Flower_Name']

In [13]:
df.head()

,sepallength,sepalwidth,petallength,petalwidth,Flower_Name
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


<h3 style='color:blue'>Approach 1: Use train_test_split and manually tune parameters by trial and error</h3>

In [14]:
Y=df['Flower_Name']
X=df.drop(columns='Flower_Name')

In [15]:
X.head()

,sepallength,sepalwidth,petallength,petalwidth
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [16]:
Y.head()

0    setosa
1    setosa
2    setosa
3    setosa
4    setosa
Name: Flower_Name, dtype: object

In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.3)

In [18]:
from sklearn.svm import SVC
model1=SVC(kernel='rbf',C=30,gamma='auto')

In [19]:
model1.fit(X_train,y_train)

SVC(C=30, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [20]:
model1.score(X_test,y_test )

0.9333333333333333

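If we run this again, the score may change, because train_test_split shuffles the data randomly and X_train and X_test differ on every run (unless random_state is fixed).
A single hold-out score is therefore a noisy estimate of model quality.
To get a more stable estimate we use K-fold cross-validation.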
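
The fluctuation described above can be checked with a small sketch. It assumes the same SVC settings as the cell above; the helper name `holdout_score` and the choice of seeds 0 to 4 are illustrative only and are not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)


def holdout_score(seed):
    """Fit the same SVC on a 70/30 split drawn with the given seed (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = SVC(kernel="rbf", C=30, gamma="auto")
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)


# Different seeds give different splits, so the accuracy can move around.
scores_by_seed = {seed: holdout_score(seed) for seed in range(5)}
for seed, score in scores_by_seed.items():
    print(f"random_state={seed}: accuracy={score:.3f}")

# Re-using a seed reproduces the split and therefore the score.
repeated = holdout_score(0)
print("seed 0 again:", repeated)
```

Fixing `random_state` makes one run reproducible, but it does not make the estimate less dependent on that particular split.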

<h3 style='color:blue'>Approach 2: Use K Fold Cross validation</h3>

**Manually try supplying models with different parameters to the cross_val_score function with 5-fold cross-validation**

In [21]:
from sklearn.model_selection import cross_val_score

In [22]:
cross_val_score(SVC(kernel='linear',C=10,gamma='auto'),flowers.data, flowers.target, cv=5) # cv=5: five different train/test splits (folds)

array([1.        , 1.        , 0.9       , 0.96666667, 1.        ])
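
As a rough sketch of what `cv=5` means: the 150 samples are cut into 5 folds, and each fold serves once as the test set while the other four are used for training. The generator `k_fold_indices` below is a hypothetical helper written for illustration. It makes plain contiguous folds, which is a simplification: for classifiers, cross_val_score with an integer `cv` uses stratified folds that preserve class proportions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds (illustrative only)."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(150, 5))
for fold_number, (train_idx, test_idx) in enumerate(folds):
    print(f"fold {fold_number}: train={len(train_idx)} test={len(test_idx)}")
```

Every sample is used for testing exactly once, which is why the five scores together give a steadier picture than one random split.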

In [23]:
cross_val_score(SVC(kernel='rbf',C=10,gamma='auto'),flowers.data, flowers.target, cv=5)

array([0.96666667, 1.        , 0.96666667, 0.96666667, 1.        ])

In [24]:
cross_val_score(SVC(kernel='rbf',C=20,gamma='auto'),flowers.data, flowers.target, cv=5)

array([0.96666667, 1.        , 0.9       , 0.96666667, 1.        ])

For each of the above combinations we obtained five fold scores.

Averaging them gives one number per combination, and comparing the averages tells us which parameter values work best.

There can be many combinations of kernel and regularization (C), such as

kernel: linear, C: 10, 20, 30

kernel: rbf, C: 10, 20, 30

The question is how many times you are willing to write the same line.
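
The averaging step can be checked by hand with the fold scores printed by the three cross_val_score calls above. This is only a worked example of the arithmetic; the dictionary layout and variable names are chosen for illustration.

```python
from statistics import mean

# Fold scores copied from the three cross_val_score outputs above.
fold_scores = {
    ("linear", 10): [1.0, 1.0, 0.9, 0.96666667, 1.0],
    ("rbf", 10): [0.96666667, 1.0, 0.96666667, 0.96666667, 1.0],
    ("rbf", 20): [0.96666667, 1.0, 0.9, 0.96666667, 1.0],
}

# One average per (kernel, C) combination.
mean_scores = {combo: mean(scores) for combo, scores in fold_scores.items()}
for (kernel, C), avg in mean_scores.items():
    print(f"kernel={kernel} C={C} mean accuracy={avg:.4f}")

# The combination with the highest average among these three.
best_combo = max(mean_scores, key=mean_scores.get)
print("best of the three:", best_combo)
```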

**The above approach is tiresome and very manual. We can use a for loop as an alternative**

In [25]:
kernels = ['rbf', 'linear']
C = [1,10,20]
for i in kernels:
    for j in C:
        scores=cross_val_score(SVC(kernel=i,C=j,gamma='auto'),flowers.data, flowers.target, cv=5)
        avg_score=np.array(scores).mean()
        print("kernel:"+i+" C:"+str(j)+" Score:"+str(avg_score))


kernel:rbf C:1 Score:0.9800000000000001
kernel:rbf C:10 Score:0.9800000000000001
kernel:rbf C:20 Score:0.9666666666666668
kernel:linear C:1 Score:0.9800000000000001
kernel:linear C:10 Score:0.9733333333333334
kernel:linear C:20 Score:0.9666666666666666


**From the above results, rbf with C=1 or C=10 and linear with C=1 tie for the best mean score (0.98)**

**The same thing can be done by using GridSearchCV**

<h3 style='color:blue'>Approach 3: Use GridSearchCV</h3>

For Reference:

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

**GridSearchCV does essentially the same thing as the for loop above, in a single call**
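
To make the connection with the nested for loop concrete, here is a minimal sketch of how a parameter grid expands into individual candidates. It uses only `itertools.product` and is an illustration of the idea, not scikit-learn's actual implementation.

```python
from itertools import product

param_grid = {"C": [1, 10, 20], "kernel": ["rbf", "linear"]}

# Cartesian product of all value lists: one dict per candidate setting.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

for candidate in candidates:
    print(candidate)
```

Each of these dictionaries corresponds to one pass through the inner body of the for loop: one model configuration that is then scored with cross-validation.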

In [27]:
from sklearn.model_selection import GridSearchCV
clf=GridSearchCV(SVC(gamma='auto'),
                {
                    'C': [1,10,20],
                    'kernel' :  ['rbf', 'linear']
                },cv=5, return_train_score=False)

clf.fit(flowers.data, flowers.target)

GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='warn', n_jobs=None,
             param_grid={'C': [1, 10, 20], 'kernel': ['rbf', 'linear']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [28]:
clf.cv_results_

{'mean_fit_time': array([0.00120234, 0.00079913, 0.00079951, 0.00079942, 0.00119882,
        0.00079942]),
 'std_fit_time': array([0.00039797, 0.00039957, 0.00039976, 0.00039971, 0.00039993,
        0.00039971]),
 'mean_score_time': array([0.00059977, 0.00019717, 0.00080223, 0.00079985, 0.00019999,
        0.00099988]),
 'std_score_time': array([4.89706738e-04, 3.94344330e-04, 4.01152120e-04, 3.99923708e-04,
        3.99971008e-04, 4.10190833e-07]),
 'param_C': masked_array(data=[1, 1, 10, 10, 20, 20],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_kernel': masked_array(data=['rbf', 'linear', 'rbf', 'linear', 'rbf', 'linear'],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 1, 'kernel': 'rbf'},
  {'C': 1, 'kernel': 'linear'},
  {'C': 10, 'kernel': 'rbf'},
  {'C': 10, 'kernel': 'linear'},
  {'C': 20, 'kernel': 'rbf'},
  {'C': 20

In [29]:
xf=pd.DataFrame(clf.cv_results_)

In [30]:
xf

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001202,0.000398,0.0006,0.0004897067,1,rbf,"{'C': 1, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
1,0.000799,0.0004,0.000197,0.0003943443,1,linear,"{'C': 1, 'kernel': 'linear'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
2,0.0008,0.0004,0.000802,0.0004011521,10,rbf,"{'C': 10, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
3,0.000799,0.0004,0.0008,0.0003999237,10,linear,"{'C': 10, 'kernel': 'linear'}",1.0,1.0,0.9,0.966667,1.0,0.973333,0.038873,4
4,0.001199,0.0004,0.0002,0.000399971,20,rbf,"{'C': 20, 'kernel': 'rbf'}",0.966667,1.0,0.9,0.966667,1.0,0.966667,0.036515,5
5,0.000799,0.0004,0.001,4.101908e-07,20,linear,"{'C': 20, 'kernel': 'linear'}",1.0,1.0,0.9,0.933333,1.0,0.966667,0.042164,5


In [192]:
xf[['param_C','param_kernel','mean_test_score']]

,param_C,param_kernel,mean_test_score
0,1,rbf,0.98
1,1,linear,0.98
2,10,rbf,0.98
3,10,linear,0.973333
4,20,rbf,0.966667
5,20,linear,0.966667


In [31]:
clf.best_params_

{'C': 1, 'kernel': 'rbf'}

In [32]:
clf.best_score_

0.98
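
Once the search is fitted, the winning configuration can be used directly for prediction. The sketch below assumes the same grid as above and the default `refit=True`; the variable names `search`, `sample` and `predicted` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    SVC(gamma="auto"),
    {"C": [1, 10, 20], "kernel": ["rbf", "linear"]},
    cv=5,
)
search.fit(X, y)

# best_index_ points at the winning row of cv_results_;
# when several rows tie for rank 1, the first one in grid order is reported.
best_row = search.best_index_
print(search.cv_results_["params"][best_row], search.best_score_)

# With refit=True, best_estimator_ has already been retrained on all the data.
sample = X[:3]
predicted = search.best_estimator_.predict(sample)
print(predicted)

# search.predict(...) delegates to best_estimator_.
same_as_shortcut = (search.predict(sample) == predicted).all()
```

This also explains why `best_params_` above shows `{'C': 1, 'kernel': 'rbf'}` even though three combinations share the same mean score.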

**Use RandomizedSearchCV to reduce the number of iterations by trying only a random subset of the parameter combinations.
This is useful when there are too many parameters to try and training takes a long time.
It helps reduce the cost of computation**

**GridSearchCV evaluates every combination, so on the same grid it is slower than RandomizedSearchCV, which evaluates only n_iter of them**
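
The cost difference can be estimated with simple counting. The sketch below assumes the grid used in this notebook, 5 folds and `n_iter=2`, and it ignores the one extra fit that `refit=True` performs at the end. Sampling is imitated with the standard library's `random.Random.sample` purely for illustration; when every parameter is given as a list, RandomizedSearchCV also samples without replacement.

```python
import random
from itertools import product

param_grid = {"C": [1, 10, 20], "kernel": ["rbf", "linear"]}
n_folds = 5
n_iter = 2

# Every candidate a grid search would evaluate.
all_candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

# A random search only looks at n_iter of them (seed chosen arbitrarily).
rng = random.Random(0)
sampled = rng.sample(all_candidates, n_iter)

grid_fits = len(all_candidates) * n_folds
random_fits = n_iter * n_folds

print("grid search model fits:  ", grid_fits)
print("random search model fits:", random_fits)
print("sampled candidates:", sampled)
```

The saving grows quickly with the number of parameters, but a random search may miss the best combination because it never looks at most of the grid.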

In [36]:
from sklearn.model_selection import RandomizedSearchCV 
rf=RandomizedSearchCV(SVC(gamma='auto'),
                {
                    'C': [1,10,20],
                    'kernel' :  ['rbf', 'linear']
                }, cv=5,n_iter=2, # n_iter = number of combinations to sample
                     return_train_score=False)

In [37]:
rf.fit(flowers.data, flowers.target)

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=SVC(C=1.0, cache_size=200, class_weight=None,
                                 coef0=0.0, decision_function_shape='ovr',
                                 degree=3, gamma='auto', kernel='rbf',
                                 max_iter=-1, probability=False,
                                 random_state=None, shrinking=True, tol=0.001,
                                 verbose=False),
                   iid='warn', n_iter=2, n_jobs=None,
                   param_distributions={'C': [1, 10, 20],
                                        'kernel': ['rbf', 'linear']},
                   pre_dispatch='2*n_jobs', random_state=None, refit=True,
                   return_train_score=False, scoring=None, verbose=0)

In [38]:
shop=pd.DataFrame(rf.cv_results_)

In [40]:
shop

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_kernel,param_C,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001199,0.0004,0.0008,0.0004,rbf,10,"{'kernel': 'rbf', 'C': 10}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
1,0.0006,0.00049,0.0004,0.00049,linear,20,"{'kernel': 'linear', 'C': 20}",1.0,1.0,0.9,0.933333,1.0,0.966667,0.042164,2


In [41]:
shop[['param_C','param_kernel','mean_test_score']]

,param_C,param_kernel,mean_test_score
0,10,rbf,0.98
1,20,linear,0.966667


**It randomly picked these 2 combinations of parameters**
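
Because the pick is random, a different run may select different combinations. A minimal sketch of how a seed makes the draw repeatable, using scikit-learn's `ParameterSampler` with an arbitrarily chosen `random_state=0` (the same `random_state` argument can be passed to RandomizedSearchCV):

```python
from sklearn.model_selection import ParameterSampler

param_distributions = {"C": [1, 10, 20], "kernel": ["rbf", "linear"]}

# Two samplers with the same seed draw the same two combinations.
first_draw = list(ParameterSampler(param_distributions, n_iter=2, random_state=0))
second_draw = list(ParameterSampler(param_distributions, n_iter=2, random_state=0))

print(first_draw)
print(second_draw)
```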

# How about different models with different hyperparameters?

In [42]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [43]:
model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20],
            'kernel': ['rbf','linear']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    }
}


In [44]:
for x,y in model_params.items():
    print(x)

svm
random_forest
logistic_regression


In [45]:
for x,y in model_params.items():
    print(y)

{'model': SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False), 'params': {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']}}
{'model': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators='warn',
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False), 'params': {'n_estimators': [1, 5, 10]}}
{'model': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100

In [46]:
scores=[]
for model_name, mp in model_params.items():
    # run a 5-fold grid search over this model's own parameter grid
    clf = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(flowers.data, flowers.target)
    scores.append({
        'model_name' : mp['model'],  # stores the estimator object, not the dictionary key
        'best_score' : clf.best_score_,
        'best_params' : clf.best_params_
    })

In [47]:
scores

[{'model_name': SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
      max_iter=-1, probability=False, random_state=None, shrinking=True,
      tol=0.001, verbose=False),
  'best_score': 0.98,
  'best_params': {'C': 1, 'kernel': 'rbf'}},
 {'model_name': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                         max_depth=None, max_features='auto', max_leaf_nodes=None,
                         min_impurity_decrease=0.0, min_impurity_split=None,
                         min_samples_leaf=1, min_samples_split=2,
                         min_weight_fraction_leaf=0.0, n_estimators='warn',
                         n_jobs=None, oob_score=False, random_state=None,
                         verbose=0, warm_start=False),
  'best_score': 0.9666666666666667,
  'best_params': {'n_estimators': 10}},
 {'model_name': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_int

In [48]:
pd.DataFrame(scores)

,best_params,best_score,model_name
0,"{'C': 1, 'kernel': 'rbf'}",0.98,"SVC(C=1.0, cache_size=200, class_weight=None, ..."
1,{'n_estimators': 10},0.966667,"RandomForestClassifier(bootstrap=True, class_w..."
2,{'C': 5},0.966667,"LogisticRegression(C=1.0, class_weight=None, d..."
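
Picking the winner from this table can be sketched as follows. The scores and parameters are copied from the output above; the short string labels in the `model` field are the keys of `model_params`, used here instead of the full estimator objects to keep the listing readable.

```python
# Best result per model, copied from the table above.
results = [
    {"model": "svm", "best_score": 0.98, "best_params": {"C": 1, "kernel": "rbf"}},
    {"model": "random_forest", "best_score": 0.966667, "best_params": {"n_estimators": 10}},
    {"model": "logistic_regression", "best_score": 0.966667, "best_params": {"C": 5}},
]

# Sort from highest to lowest mean cross-validation score.
ranked = sorted(results, key=lambda r: r["best_score"], reverse=True)
winner = ranked[0]

for row in ranked:
    print(f"{row['model']:<20} {row['best_score']:.4f} {row['best_params']}")
print("selected:", winner["model"], winner["best_params"])
```

Note that RandomForestClassifier was created without a fixed `random_state`, so its score can vary slightly between runs.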


**Based on the above, SVM with C=1 and kernel='rbf' gives the best mean cross-validation score (0.98) among the models tried, so it is my choice for iris flower classification**