<h1>Iris Dataset</h1>

<p>
    The Iris dataset is a classic dataset introduced by the British statistician Ronald Fisher in 1936. It contains 50 samples for each of 3 Iris species. Each sample records the length and width of the sepals and the petals, along with the name of the species it belongs to.
</p>

In this tutorial, we will implement a classifier that predicts the species of an Iris plant from the 4 features in the dataset.<br>
We will try two classifiers, LinearSVC and SVC, and tune the hyperparameters of each model with grid search.

<h3>Loading the dataset</h3>

First, let's load the Iris dataset from sklearn.

In [112]:
# Load the iris dataset and its corresponding data and targets
from sklearn.datasets import load_iris

# Store the dataset in a variable
iris = load_iris()

In [113]:
# Load the feature matrix into the variable X
X = iris.data

# Load the target into the variable y
y = iris.target
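
Before going further, it can help to confirm what was actually loaded. The following is a minimal sketch that relies only on attributes of the object returned by load_iris (data, target, feature_names, target_names):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Number of samples and number of features per sample
print(X.shape)
# Names of the measured features (sepal and petal length and width, in cm)
print(iris.feature_names)
# Names of the species, indexed by the integer labels stored in y
print(iris.target_names)
# Number of samples in each class
print(np.bincount(y))
```

The integer labels in y index into target_names, so a prediction of 0 corresponds to the first species name in that array.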

After that, we split the data into two sets, one for training and one for testing.

In [114]:
from sklearn.model_selection import train_test_split

# hold out 20% of the samples for testing
# random_state=None means the split, and so the scores below, change on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=None, test_size=0.2)
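
Because random_state=None produces a different split on each run, the scores reported in this tutorial will not reproduce exactly. The sketch below shows one possible reproducible, class-balanced variant. The seed 42, the stratify=y argument, and the `_s` variable names are assumptions made for illustration, not the settings used in the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Assumption: fixed seed and stratification, neither of which the tutorial uses
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Sizes of the training and test sets
print(X_train_s.shape, X_test_s.shape)
# With stratification the test set keeps the class balance of the full data
print(np.bincount(y_test_s))
```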

<h3>Learning data using LinearSVC Model </h3>

<h5>Choosing hyperparameters using grid search</h5>

<p>We are going to brute-force the hyperparameters of the LinearSVC model using the GridSearchCV class.<br>First, we choose the set of candidate values for each hyperparameter.<br>Then, we pass this grid, along with the model, to GridSearchCV, which evaluates every combination with cross-validation.</p>

In [115]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# choose the candidate hyperparameter values
param_grid = {
    'C': [1, 10, 100, 300, 500, 700, 1000],
    'tol': [1e-4, 1e-5, 1e-6]
}

# instantiate the grid
# cv sets the number of cross-validation folds and scoring sets the scoring strategy
grid = GridSearchCV(LinearSVC(random_state=2), param_grid, cv=10, scoring='accuracy')
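
To make the cost of this search concrete, the grid can be enumerated with ParameterGrid from sklearn.model_selection. GridSearchCV fits one model per combination per fold. A small sketch, where `combinations` and `n_folds` are illustrative names:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'C': [1, 10, 100, 300, 500, 700, 1000],
    'tol': [1e-4, 1e-5, 1e-6],
}

combinations = list(ParameterGrid(param_grid))
n_folds = 10

# 7 values of C times 3 values of tol
print(len(combinations))
# One fit per combination per fold, not counting the final refit on the whole training set
print(len(combinations) * n_folds)
```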

Now we let the GridSearchCV instance search for the best combination of hyperparameters by fitting it on the training set that we split off earlier.

In [116]:
# fit the grid with data
grid.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise',
       estimator=LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=2, tol=0.0001,
     verbose=0),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'C': [1, 10, 100, 300, 500, 700, 1000], 'tol': [0.0001, 1e-05, 1e-06]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)
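
After fitting, the score of every combination is available in the cv_results_ attribute, not only the best one. The sketch below is self-contained and deliberately smaller than the search above: the reduced grid, cv=5, the split seed, and the raised max_iter are all assumptions chosen to keep the example quick, so its numbers will differ from the tutorial's.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Assumption: a smaller grid and fewer folds than in the tutorial
small_grid = {'C': [1, 10, 100], 'tol': [1e-4, 1e-5]}
demo_grid = GridSearchCV(
    LinearSVC(random_state=2, max_iter=10000), small_grid, cv=5, scoring='accuracy'
)
demo_grid.fit(X_tr, y_tr)

results = demo_grid.cv_results_
# One line per hyperparameter combination: mean accuracy over the folds
for params, mean_score in zip(results['params'], results['mean_test_score']):
    print(params, round(mean_score, 3))
```

The value reported as best_score_ is the highest of these mean scores, which is why it is a cross-validation estimate rather than a test-set accuracy.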

Now we can see the best hyperparameters that the grid search found, along with the corresponding model.

In [117]:
# print the best hyperparameters and their mean cross-validated accuracy
print("best parameters : ", grid.best_params_)
print("best score : ", grid.best_score_)

best parameters :  {'C': 10, 'tol': 0.0001}
best score :  0.95


In [118]:
model = grid.best_estimator_
model.score(X_test,y_test)

0.93333333333333335

So the LinearSVC model yielded an accuracy of 93.3% on the test set.
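
A single accuracy figure does not show which species get confused with each other. One way to look closer is a confusion matrix, sketched below. C=10 and tol=1e-4 are taken from the best hyperparameters reported above; the split seed and max_iter=10000 are assumptions, so the counts will not match the tutorial's run.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LinearSVC(C=10, tol=1e-4, random_state=2, max_iter=10000).fit(X_tr, y_tr)

# Rows are true species, columns are predicted species
cm = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1, 2])
print(cm)

# The diagonal counts correct predictions, so accuracy is trace / total
print(np.trace(cm) / cm.sum())
```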

<h3>Learning data using SVC Model</h3>

<p>Now we will see whether the SVC model leads to better results.</p>

<p>We repeat the whole process; the only difference is the hyperparameter grid given to the GridSearchCV instance.</p>

In [119]:
from sklearn.svm import SVC

# choose the candidate hyperparameter values for two different kernels
param_grid = [
    {
        'kernel': ['rbf'],
        'gamma': [1e-3, 1e-4,1e-5],
        'C': [1, 10, 100,500, 1000]
    },
    {
        'kernel': ['linear'], 
        'C': [1, 10, 100,500, 1000]
    }
]

# instantiate the grid
# cv sets the number of cross-validation folds and scoring sets the scoring strategy
grid = GridSearchCV(SVC(random_state=0), param_grid, cv=10, scoring='accuracy')


# fit the grid with data
grid.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise',
       estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=0, shrinking=True,
  tol=0.001, verbose=False),
       fit_params=None, iid=True, n_jobs=1,
       param_grid=[{'kernel': ['rbf'], 'gamma': [0.001, 0.0001, 1e-05], 'C': [1, 10, 100, 500, 1000]}, {'kernel': ['linear'], 'C': [1, 10, 100, 500, 1000]}],
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='accuracy', verbose=0)
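
The rbf entries of this grid vary gamma, which sets the width of the kernel K(x, x') = exp(-gamma * ||x - x'||^2). As a rough illustration, the sketch below compares scikit-learn's rbf_kernel with that formula on two Iris samples, for the gamma values in the grid. The choice of samples is arbitrary.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics.pairwise import rbf_kernel

X, _ = load_iris(return_X_y=True)
# One setosa sample (index 0) and one virginica sample (index 100)
a, b = X[0:1], X[100:101]

for gamma in [1e-3, 1e-4, 1e-5]:
    manual = np.exp(-gamma * np.sum((a - b) ** 2))
    library = rbf_kernel(a, b, gamma=gamma)[0, 0]
    # Smaller gamma pushes the similarity toward 1, even across species
    print(gamma, manual, library)
```

With very small gamma the kernel treats all samples as nearly identical, which is why such values are usually paired with a large C in the grid.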

In [120]:
# print the best hyperparameters and their mean cross-validated accuracy
print("best parameters : ", grid.best_params_)
print("best score : ", grid.best_score_)

best parameters :  {'C': 500, 'gamma': 0.001, 'kernel': 'rbf'}
best score :  0.983333333333


In [121]:
model = grid.best_estimator_
model.score(X_test,y_test)

0.96666666666666667

<p>The SVC model scored 96.7% accuracy on the same test set on which the LinearSVC model scored 93.3%, so SVC appears to be the better option for this problem. With only 30 test samples, however, the gap amounts to a single misclassified sample, so the conclusion is tentative.</p>
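
A comparison based on one small test set depends heavily on how the data happened to be split. A possible complement is to cross-validate both models on the full dataset, as sketched below. The hyperparameters are copied from the grid search results reported above; max_iter=10000 and the names `candidates` and `cv_scores` are assumptions. Since those hyperparameters were tuned on part of this same data, the scores are likely somewhat optimistic.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)

# Hyperparameters copied from the grid search results reported above
candidates = {
    'LinearSVC': LinearSVC(C=10, tol=1e-4, random_state=2, max_iter=10000),
    'SVC (rbf)': SVC(C=500, gamma=0.001, kernel='rbf', random_state=0),
}

cv_scores = {}
for name, estimator in candidates.items():
    # 10-fold cross-validated accuracy on all 150 samples
    cv_scores[name] = cross_val_score(estimator, X, y, cv=10, scoring='accuracy')
    print(name, cv_scores[name].mean(), cv_scores[name].std())
```

Looking at the standard deviation alongside the mean gives a sense of whether the difference between the two models is larger than the fold-to-fold variation.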