## Iris classification model using grid search
### LDA and Logistic regression parameter tuning

#### Problem description
The Iris dataset contains four features (the length and width of the sepals and petals) for three species of iris (Iris setosa, Iris virginica and Iris versicolor). The task is to use these measurements to predict the species of a new flower from its sepal and petal length and width.

I chose this problem to demonstrate how grid search from scikit-learn can be used to find the best parameters, or the best model, for a given dataset. We start with a very simple pipeline made of a StandardScaler and a logistic regression model.

In [1]:
# Import data and split into features/labels
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Load data once and reuse the returned object
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
# Features and labels
print(y[:5])
X.head()

[0 0 0 0 0]


| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
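The integer labels printed above (`[0 0 0 0 0]`) stand for species names. As a minimal sketch, assuming a scikit-learn version that supports `as_frame=True`, the mapping can be read from `target_names`; the variable names `iris_demo`, `X_demo`, `y_demo` and `species` are only illustrative.

```python
from sklearn.datasets import load_iris

# Load the dataset as pandas objects (assumes pandas is installed)
iris_demo = load_iris(as_frame=True)
X_demo = iris_demo.data      # DataFrame with the four measurements
y_demo = iris_demo.target    # Series of integer class labels

# Map each integer label to its species name
label_to_name = dict(enumerate(iris_demo.target_names))
species = iris_demo.target_names[y_demo.to_numpy()]

print(label_to_name)
print(species[:5])
```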


In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import set_config
set_config(display="diagram")

# Pipeline preparing

# preprocessing
preprocessing = ColumnTransformer(transformers=[
    ("Normalizer" , StandardScaler(),X.columns)
])

# iterate through the different values of C (stored in k), cross-validate each pipeline and save the mean score

k = range(1,6)
results = {}
for i in k:
    pipe = Pipeline(steps=[
        ("Preprocessor" , preprocessing),
        ("Model fitting" , LogisticRegression(C=i) )
    ])
    results[" k = "+str(i)] = cross_val_score(pipe,X,y,cv=10).mean()

print("Searching for the best C value for the Logistic regression model : " , results)


# Note that tuning different parameters for different pipeline steps this way
# gets complicated very quickly

pipe 

Searching for the best C value for the Logistic regression model :  {' k = 1': 0.9600000000000002, ' k = 2': 0.9733333333333334, ' k = 3': 0.9733333333333334, ' k = 4': 0.9800000000000001, ' k = 5': 0.9800000000000001}
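The manual loop above can also be expressed as a grid search over a single parameter. The following is a rough sketch rather than the notebook's own code: the step names `scaler` and `model` and the `demo_` variables are assumptions made for illustration, and the scaler is applied to all columns directly instead of through a `ColumnTransformer`.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# Same two steps as the loop: scaling, then logistic regression
demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# Candidate values of C, addressed as <step name>__<parameter name>
c_values = [1, 2, 3, 4, 5]
demo_grid = GridSearchCV(demo_pipe, param_grid={"model__C": c_values}, cv=10)
demo_grid.fit(X_demo, y_demo)

# Mean cross-validated accuracy per candidate, then the best one
for c, score in zip(c_values, demo_grid.cv_results_["mean_test_score"]):
    print("C =", c, "mean accuracy =", score)
print(demo_grid.best_params_, demo_grid.best_score_)
```

The grid search stores one row per candidate in `cv_results_`, so no hand-written results dictionary is needed.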


In [1]:
def create_liste_tuples(From,to,by):
    # Build one-element tuples (From,), (From + by,), ... to use as candidate
    # hidden_layer_sizes; note that the last value may overshoot `to`
    res_list = [(From,)]
    while From < to:
        From = From + by
        res_list.append((From,))
    return res_list

In [11]:
create_liste_tuples(1,1000,100)

[tuple([i]) for i, k in enumerate(range(10))]

{ _indx: i for _indx, i in enumerate( [3,4,5,6])}

var = 0 if 0 == 0 else 1


{0: 3, 1: 4, 2: 5, 3: 6}

In [None]:

# Parameters tuning using grid search cross validation


# Pipeline preparing 

# Preprocessing
preprocessing = ColumnTransformer(transformers=[
    ("Normalizer" , StandardScaler(),X.columns)
])

# to make things a bit more exciting, we can add linear discriminant analysis (LDA)
# to reduce our 4 dimensions to 2 or even 1 new feature that maximises the separation
# of the target variable (the species label)

# Dimensionality reduction with LDA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA()
# Model 
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(max_iter=1000)
# Pipeline
pipe = Pipeline(steps=[
("preprocessing" , preprocessing),
("lda" , lda),
("classifier" , mlp)
])

# Grid parameters

# Specify the grid parameters. A parameter of a given pipeline step is addressed as
# <step name>__<parameter name>, using a double underscore as the separator
param_grid = {"lda__n_components" : [1,2],
              "lda__solver" : ["svd","eigen"], 
              "classifier__hidden_layer_sizes" : create_liste_tuples(1,1000,100),
              "classifier__activation" : ['identity', 'logistic', 'tanh', 'relu'],
              "classifier__learning_rate" : ['constant', 'invscaling', 'adaptive'],
              "classifier__solver" : ['lbfgs', 'sgd', 'adam']

              }
# instantiate the grid search cross-validation class from sklearn.model_selection
# with the pipe and the grid parameters
grid = GridSearchCV(pipe,param_grid = param_grid)
# fit data
result = grid.fit(X,y)
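Before launching a grid like this one, it helps to know how many models it will train. The sketch below counts the candidates with `ParameterGrid`; it assumes the hidden layer sizes produced by `create_liste_tuples(1,1000,100)` are `(1,), (101,), ..., (1001,)` and that the search uses 5-fold cross-validation, the default in recent scikit-learn versions. The `demo_` names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Assumed equivalent of create_liste_tuples(1, 1000, 100): 11 one-element tuples
hidden_sizes = [(n,) for n in range(1, 1002, 100)]

demo_param_grid = {
    "lda__n_components": [1, 2],
    "lda__solver": ["svd", "eigen"],
    "classifier__hidden_layer_sizes": hidden_sizes,
    "classifier__activation": ["identity", "logistic", "tanh", "relu"],
    "classifier__learning_rate": ["constant", "invscaling", "adaptive"],
    "classifier__solver": ["lbfgs", "sgd", "adam"],
}

# Number of parameter combinations, and total fits with 5 folds (assumed)
n_candidates = len(ParameterGrid(demo_param_grid))
n_folds = 5
print(n_candidates, "candidates,", n_candidates * n_folds, "fits")
```

With this many neural network fits the search can take a long time; passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel on all available cores.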

In [119]:
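# Inspect results; for simplicity I chose to drop the timing and per-split columns.
# Note: the table shown below lists param_classifier__C, so it appears to come from a run
# whose classifier step was a LogisticRegression tuned over C, not the MLP grid defined above.
result_df = pd.DataFrame(result.cv_results_).drop(["mean_fit_time","std_fit_time",
                                       "mean_score_time","std_score_time",
                                       "params","split0_test_score","split1_test_score",
                                       "split2_test_score","split3_test_score","split4_test_score"],axis=1)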

result_df.sort_values(by=["rank_test_score"])

| | param_classifier__C | param_lda__n_components | param_lda__solver | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|
| 8 | 10.0 | 1 | svd | 0.986667 | 0.026667 | 1 |
| 9 | 10.0 | 1 | eigen | 0.986667 | 0.026667 | 1 |
| 0 | 0.1 | 1 | svd | 0.98 | 0.026667 | 3 |
| 1 | 0.1 | 1 | eigen | 0.98 | 0.026667 | 3 |
| 2 | 0.1 | 2 | svd | 0.98 | 0.026667 | 3 |
| 3 | 0.1 | 2 | eigen | 0.98 | 0.026667 | 3 |
| 4 | 1.0 | 1 | svd | 0.98 | 0.026667 | 3 |
| 5 | 1.0 | 1 | eigen | 0.98 | 0.026667 | 3 |
| 6 | 1.0 | 2 | svd | 0.98 | 0.026667 | 3 |
| 7 | 1.0 | 2 | eigen | 0.98 | 0.026667 | 3 |
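Besides the full results table, a fitted grid search exposes the winning configuration directly. The sketch below assumes a pipeline of scaling, LDA and a `LogisticRegression` tuned over `C` in `[0.1, 1.0, 10.0]`, which is what the columns of the table above suggest; the exact setup of that run is not shown in the notebook, so the step names and `demo_` variables here are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("lda", LinearDiscriminantAnalysis()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

demo_param_grid = {
    "lda__n_components": [1, 2],
    "lda__solver": ["svd", "eigen"],
    "classifier__C": [0.1, 1.0, 10.0],
}

demo_grid = GridSearchCV(demo_pipe, param_grid=demo_param_grid)
demo_grid.fit(X_demo, y_demo)

# Best parameter combination and its mean cross-validated accuracy
print(demo_grid.best_params_)
print(demo_grid.best_score_)

# With the default refit=True, the best pipeline is refitted on all data
# and can be used for prediction straight away
predicted = demo_grid.predict(X_demo[:5])
print(predicted)
```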


## Conclusion
GridSearchCV lets you define a set of parameter values to try with a given model, or even with a group of models. It automatically runs cross-validation for every combination of these parameters and keeps track of the resulting scores.
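Searching over a group of models can be sketched by passing a list of grids in which the pipeline step itself is one of the parameters. This is only an illustration: the choice of `KNeighborsClassifier` as the second model, the step names and the candidate values are assumptions, not part of the notebook above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_iris(return_X_y=True)

# The "classifier" step is a placeholder that the grid replaces
demo_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# One dictionary per model, each with its own parameters
demo_param_grid = [
    {"classifier": [LogisticRegression(max_iter=1000)],
     "classifier__C": [0.1, 1.0, 10.0]},
    {"classifier": [KNeighborsClassifier()],
     "classifier__n_neighbors": [3, 5, 7]},
]

demo_grid = GridSearchCV(demo_pipe, param_grid=demo_param_grid, cv=5)
demo_grid.fit(X_demo, y_demo)

# Which model won, and with what score
best_model_name = type(demo_grid.best_estimator_.named_steps["classifier"]).__name__
print(best_model_name, demo_grid.best_score_)
```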