<a href="https://colab.research.google.com/github/Kartikgc9/Grid-search-CV-error-handling-in-ML/blob/main/1_GridSearchCV.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h2 align='center' style='color:purple'>Finding the best model and tuning hyperparameters using GridSearchCV and RandomizedSearchCV</h2>

**For the iris flower dataset in the sklearn library, we are going to find the best model and the best hyperparameters using GridSearchCV**

**Step1: Load iris flower dataset**

In [1]:
from sklearn import svm, datasets
import numpy as np
# Loading Iris dataset
iris = datasets.load_iris()

**Step2: Converting the data into a DataFrame**

In [2]:
import pandas as pd
# use the feature names as column names
df = pd.DataFrame(iris.data,columns=iris.feature_names)
# add the target column (class index 0, 1 or 2)
df['flower'] = iris.target
# map each class index to its flower name
df['flower'] = df['flower'].apply(lambda x: iris.target_names[x])
df[47:150]

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),flower
47,4.6,3.2,1.4,0.2,setosa
48,5.3,3.7,1.5,0.2,setosa
49,5.0,3.3,1.4,0.2,setosa
50,7.0,3.2,4.7,1.4,versicolor
51,6.4,3.2,4.5,1.5,versicolor
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
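
Before modelling, it can help to confirm how many rows each class has, since plain accuracy is easiest to interpret when the classes are balanced. The sketch below rebuilds the DataFrame on its own so it can run independently. The name `class_counts` is introduced here for illustration, and the labels are mapped by NumPy indexing rather than `apply`, which is an alternative and not what the cell above does.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Index the array of class names with the integer targets
# (an alternative to the apply + lambda mapping)
df['flower'] = iris.target_names[iris.target]

# Number of rows per flower class
class_counts = df['flower'].value_counts()
print(class_counts)
```

The iris dataset has 150 rows split evenly across the three classes, so each class should show the same count.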


<h3 style='color:blue'>Approach 1: Use train_test_split and manually tune parameters by trial and error</h3>

**Step3: Train test split**

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
# test_size=0.3 --> 30% test data and 70% training data
# no random_state is set, so the split (and the score below) changes from run to run
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)

**Step4: Training the SVM model**

In [4]:
# creating an instance of SVM classifier
model = svm.SVC(kernel='rbf',C=30,gamma='auto')
# training the model
model.fit(X_train,y_train)
# mean accuracy on the held-out test set
model.score(X_test, y_test)

0.9555555555555556
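
A single train/test split gives a score that depends on which rows happen to land in the test set. The sketch below repeats Approach 1 with a few fixed seeds to make that variation visible. The seeds 0 to 4 and the name `split_scores` are arbitrary choices for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

split_scores = []
for seed in range(5):  # arbitrary seeds, used only to make each split reproducible
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.3, random_state=seed
    )
    model = svm.SVC(kernel='rbf', C=30, gamma='auto')
    model.fit(X_train, y_train)
    # accuracy on this particular 30% test split
    split_scores.append(model.score(X_test, y_test))

print(split_scores)
print(np.mean(split_scores), np.std(split_scores))
```

If the scores differ noticeably between seeds, comparing hyperparameters on one split can be misleading, which is the motivation for cross validation.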

<h3 style='color:blue'>Approach 2: Use K-fold cross validation</h3>

**Step5.1: Manually supply a model and its parameters to the cross_val_score function with 5-fold cross validation**

In [5]:
from numpy import mean
# k fold cross validation with cv=5
scores=cross_val_score(svm.SVC(kernel='linear',C=10,gamma='auto'),iris.data, iris.target, cv=5)
# printing scores of each fold
print(scores)
# average score
print(mean(scores))

[1.         1.         0.9        0.96666667 1.        ]
0.9733333333333334
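
To see what `cross_val_score` does with `cv=5`, here is a minimal sketch that runs the folds by hand. It relies on the documented behaviour that, for a classifier and an integer `cv`, scikit-learn uses `StratifiedKFold` without shuffling. The names `manual_scores` and `library_scores` are introduced here for illustration.

```python
import numpy as np
from sklearn import svm, datasets
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target
estimator = svm.SVC(kernel='linear', C=10, gamma='auto')

manual_scores = []
# Each iteration trains on 4 folds and scores on the remaining fold
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    fold_model = clone(estimator)  # fresh, unfitted copy for every fold
    fold_model.fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

manual_scores = np.array(manual_scores)
library_scores = cross_val_score(estimator, X, y, cv=5)

print(manual_scores)
print(library_scores)
```

Both arrays should hold the same five per-fold accuracies, because the splitter and the estimator are deterministic here.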


**Step5.2: Trying different parameters**

In [6]:
scores=cross_val_score(svm.SVC(kernel='rbf',C=10,gamma='auto'),iris.data, iris.target, cv=5)
print(scores)
print(mean(scores))

[0.96666667 1.         0.96666667 0.96666667 1.        ]
0.9800000000000001


**Step5.3: Trying yet another set of parameters**

In [7]:
scores=cross_val_score(svm.SVC(kernel='rbf',C=20,gamma='auto'),iris.data, iris.target, cv=5)
print(scores)
print(mean(scores))

[0.96666667 1.         0.9        0.96666667 1.        ]
0.9666666666666668


**Step6: The above approach is tedious and entirely manual. We can use a for loop as an alternative**

In [8]:
# candidate values for kernel and C
kernels = ['rbf', 'linear']
C = [1,10,20]
avg_scores = {}
# nested loops over every combination of kernel and C
for kval in kernels:
    for cval in C:
        cv_scores = cross_val_score(svm.SVC(kernel=kval,C=cval,gamma='auto'),iris.data, iris.target, cv=5)
        avg_scores[kval + '_' + str(cval)] = np.average(cv_scores)

avg_scores

{'rbf_1': 0.9800000000000001,
 'rbf_10': 0.9800000000000001,
 'rbf_20': 0.9666666666666668,
 'linear_1': 0.9800000000000001,
 'linear_10': 0.9733333333333334,
 'linear_20': 0.9666666666666666}
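
The loop only collects the averages, so picking a winner is still a manual step. A small sketch of that step follows, using the dictionary printed above copied in as a literal so the snippet runs on its own. The names `best_key` and `tied` are introduced here for illustration.

```python
import math

# Average scores copied from the output above
avg_scores = {
    'rbf_1': 0.9800000000000001,
    'rbf_10': 0.9800000000000001,
    'rbf_20': 0.9666666666666668,
    'linear_1': 0.9800000000000001,
    'linear_10': 0.9733333333333334,
    'linear_20': 0.9666666666666666,
}

# max() returns the first key with the highest value, so ties are hidden
best_key = max(avg_scores, key=avg_scores.get)

# List every combination whose score matches the best one
tied = [k for k, v in avg_scores.items()
        if math.isclose(v, avg_scores[best_key])]

print(best_key, avg_scores[best_key])
print(tied)
```

Three combinations share the top average here, so taking the first maximum alone does not show that the choice between them is arbitrary.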

<h3 style='color:blue'>Approach 3: Use GridSearchCV</h3>

**Step7: GridSearchCV runs the same search as the for loop above in a single call. It also records the per-fold scores, ranks the combinations, and refits the best one on the full data**

In [9]:
from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(svm.SVC(gamma='auto'), {
    'C': [1,10,20],
    'kernel': ['rbf','linear']
}, cv=5, return_train_score=False)
clf.fit(iris.data, iris.target)
clf.cv_results_

{'mean_fit_time': array([0.00189381, 0.0012361 , 0.00137582, 0.00121837, 0.00148058,
        0.00137267]),
 'std_fit_time': array([4.43517245e-04, 3.71537243e-05, 3.36549085e-05, 4.18665740e-05,
        1.10896065e-04, 1.50066995e-04]),
 'mean_score_time': array([0.00114975, 0.0008256 , 0.00088382, 0.00082445, 0.00096221,
        0.00090427]),
 'std_score_time': array([2.22056547e-04, 1.52985234e-05, 2.40739021e-05, 1.89304952e-05,
        2.99742129e-05, 1.13993545e-04]),
 'param_C': masked_array(data=[1, 1, 10, 10, 20, 20],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_kernel': masked_array(data=['rbf', 'linear', 'rbf', 'linear', 'rbf', 'linear'],
              mask=[False, False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 1, 'kernel': 'rbf'},
  {'C': 1, 'kernel': 'linear'},
  {'C': 10, 'kernel': 'rbf'},
  {'C': 10, 'kernel': 'linear'},
  {'C': 20, 'ker

**Step7.1: Converting the results into a DataFrame**

In [10]:
df = pd.DataFrame(clf.cv_results_)
df

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_C,param_kernel,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.001894,0.000444,0.00115,0.000222,1,rbf,"{'C': 1, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
1,0.001236,3.7e-05,0.000826,1.5e-05,1,linear,"{'C': 1, 'kernel': 'linear'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
2,0.001376,3.4e-05,0.000884,2.4e-05,10,rbf,"{'C': 10, 'kernel': 'rbf'}",0.966667,1.0,0.966667,0.966667,1.0,0.98,0.01633,1
3,0.001218,4.2e-05,0.000824,1.9e-05,10,linear,"{'C': 10, 'kernel': 'linear'}",1.0,1.0,0.9,0.966667,1.0,0.973333,0.038873,4
4,0.001481,0.000111,0.000962,3e-05,20,rbf,"{'C': 20, 'kernel': 'rbf'}",0.966667,1.0,0.9,0.966667,1.0,0.966667,0.036515,5
5,0.001373,0.00015,0.000904,0.000114,20,linear,"{'C': 20, 'kernel': 'linear'}",1.0,1.0,0.9,0.933333,1.0,0.966667,0.042164,6


**Step7.2: Viewing only the required columns**

In [11]:
df[['param_C','param_kernel','mean_test_score']]

,param_C,param_kernel,mean_test_score
0,1,rbf,0.98
1,1,linear,0.98
2,10,rbf,0.98
3,10,linear,0.973333
4,20,rbf,0.966667
5,20,linear,0.966667
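
When several combinations have the same mean score, it can help to sort by `rank_test_score` and look at `std_test_score` as well. The sketch below reruns the same grid so it is self-contained. The variable names `results` and `ranked` are illustrative only.

```python
import pandas as pd
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
clf = GridSearchCV(
    svm.SVC(gamma='auto'),
    {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']},
    cv=5,
)
clf.fit(iris.data, iris.target)

results = pd.DataFrame(clf.cv_results_)
# Best-ranked combinations first; the std column shows how stable each one is across folds
ranked = results.sort_values('rank_test_score')[
    ['param_C', 'param_kernel', 'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(ranked)
```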


**Step7.3: Viewing best parameters**

In [12]:
clf.best_params_

{'C': 1, 'kernel': 'rbf'}

**Step7.4: Viewing best score**

In [13]:
clf.best_score_

0.9800000000000001
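
With the default `refit=True`, GridSearchCV retrains the best combination on all the data passed to `fit` and exposes it as `best_estimator_`, so the search object can be used directly for prediction. A minimal sketch follows. Predicting on the first three rows of the training data is only for illustration and is not an evaluation.

```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
clf = GridSearchCV(
    svm.SVC(gamma='auto'),
    {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']},
    cv=5,
)
clf.fit(iris.data, iris.target)

# The model refitted on the full data with the best parameters
best_model = clf.best_estimator_

# clf.predict delegates to best_estimator_
sample = iris.data[:3]
predicted = clf.predict(sample)
print(best_model)
print(iris.target_names[predicted])
```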

**Step8: Use RandomizedSearchCV to evaluate only a random subset of the parameter combinations (n_iter of them). This is useful when there are too many combinations to try and training is slow, because it reduces the cost of computation**

In [14]:
from sklearn.model_selection import RandomizedSearchCV
rs = RandomizedSearchCV(svm.SVC(gamma='auto'), {
        'C': [1,10,20],
        'kernel': ['rbf','linear']
    },
    cv=5,
    return_train_score=False,
    n_iter=2
)
rs.fit(iris.data, iris.target)
pd.DataFrame(rs.cv_results_)[['param_C','param_kernel','mean_test_score']]

,param_C,param_kernel,mean_test_score
0,20,linear,0.966667
1,10,linear,0.973333
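
RandomizedSearchCV can also sample from a continuous distribution instead of a fixed list, and `random_state` makes the sampled combinations repeatable. The sketch below is one way to do that with `scipy.stats.loguniform`. The range 0.1 to 100 for `C`, `n_iter=5` and `random_state=0` are assumptions chosen for illustration, not values from the original notebook.

```python
from scipy.stats import loguniform
from sklearn import svm, datasets
from sklearn.model_selection import RandomizedSearchCV

iris = datasets.load_iris()

rs = RandomizedSearchCV(
    svm.SVC(gamma='auto'),
    {
        'C': loguniform(1e-1, 1e2),    # assumed range, sampled on a log scale
        'kernel': ['rbf', 'linear'],   # lists are sampled uniformly
    },
    n_iter=5,          # number of sampled combinations (illustrative)
    cv=5,
    random_state=0,    # fixed seed so the same combinations are drawn each run
)
rs.fit(iris.data, iris.target)

print(rs.best_params_)
print(rs.best_score_)
```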


**Step9: How about different models with different hyperparameters?**

In [15]:
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
# Initialising the model_params dictionary. Each model is defined as a key-value pair.
# The key is the model's name, and the value is another dictionary that contains the model
# object and its hyperparameter options.
model_params = {
    'svm': {
        'model': svm.SVC(gamma='auto'),
        'params' : {
            'C': [1,10,20],
            'kernel': ['rbf','linear']
        }
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear',multi_class='auto'),
        'params': {
            'C': [1,5,10]
        }
    }
}


**Step9.1: Applying loop on the dictionary**

In [16]:
scores = []
# model_name is the key and mp is the value of the above dictionary
# Example: 'svm' is the model name and mp is the dictionary holding its model and params
for model_name, mp in model_params.items():
    clf = GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(iris.data, iris.target)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })

df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df

,model,best_score,best_params
0,svm,0.98,"{'C': 1, 'kernel': 'rbf'}"
1,random_forest,0.946667,{'n_estimators': 5}
2,logistic_regression,0.966667,{'C': 5}
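
One caveat when comparing `best_score_` values: each one is measured on the same folds that were used to pick the parameters, so it tends to be slightly optimistic. A common way to get a less biased estimate is nested cross validation, sketched below for the SVM grid only. The names `inner_search` and `nested_scores` are illustrative, and this is an optional check, not part of the original workflow.

```python
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV, cross_val_score

iris = datasets.load_iris()

# Inner loop: the grid search picks the parameters on each training portion
inner_search = GridSearchCV(
    svm.SVC(gamma='auto'),
    {'C': [1, 10, 20], 'kernel': ['rbf', 'linear']},
    cv=5,
)

# Outer loop: each of the 5 outer folds scores a search that never saw that fold
nested_scores = cross_val_score(inner_search, iris.data, iris.target, cv=5)

print(nested_scores)
print(nested_scores.mean())
```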


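**Based on the results above, I can conclude that SVM is the best of the three models for iris flower classification, with a mean cross-validated accuracy of 0.98. C=1 with kernel='rbf' is reported as the best setting, but it ties at 0.98 with C=1, kernel='linear' and C=10, kernel='rbf', so it is the first of three equally good settings, not a unique winner**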