## Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

## Load the IRIS classification dataset
The data is taken from Kaggle:
https://www.kaggle.com/arshid/iris-flower-dataset

In [2]:
iris_data = pd.read_csv('IRIS.csv')
iris_data.head()

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


Next we check whether any cell of the DataFrame holds a null value. The info() method reports the number of entries in the dataset and, for each column, the number of non-null values.

In [3]:
iris_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


Every column has 150 non-null values out of 150 entries, so the data is free from NaN values.
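A more direct way to count missing values is sketched below. The small `demo` frame is a hypothetical stand-in for `iris_data`, with one NaN added on purpose so the check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for iris_data, with one missing value added on purpose.
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, np.nan],
    "sepal_width": [3.5, 3.0, 3.2],
    "species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

# Count missing values per column; all zeros would mean no NaNs.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Single boolean answer for the whole frame.
has_missing = bool(demo.isnull().values.any())
print(has_missing)
```

Applied to `iris_data`, the same two calls should report zero missing values per column, in line with the info() output.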

## Feature Selection and Normalization
We use the first four columns as features. scikit-learn works on NumPy arrays, so we convert the selected columns to a two-dimensional array and the species column to a one-dimensional array of labels.

Standardization gives each feature zero mean and unit variance. This is good practice, especially for distance-based algorithms such as KNN, where a feature with a larger scale would otherwise dominate the distance:

In [4]:
X = iris_data[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].values
X[0:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

In [5]:
y = iris_data['species'].values
y[0:5]

array(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa'], dtype=object)

In [6]:
X = preprocessing.StandardScaler().fit(X).transform(X)
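To make the "zero mean and unit variance" claim concrete, here is a minimal sketch that compares a manual z-score with `StandardScaler` on a hypothetical toy matrix `X_demo` (not the notebook's data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrix (rows = samples, columns = features).
X_demo = np.array([[5.1, 3.5], [4.9, 3.0], [4.7, 3.2], [4.6, 3.1]])

# Manual z-score: subtract the column mean, divide by the column std.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default for std().
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

X_scaled = StandardScaler().fit_transform(X_demo)

# Both routes should agree, and each column should have mean 0 and std 1.
print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```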

## Train/Test dataset
We split the dataset into a training set (70%) and a test set (30%), fixing random_state so that the split is reproducible:

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
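Note that the notebook standardizes X before splitting, so the scaler also sees the test rows. A common alternative, sketched here under the assumption that scikit-learn's bundled iris data can stand in for IRIS.csv, is to split first and let a pipeline fit the scaler on the training rows only. The names `X_raw`, `pipe`, and `test_accuracy` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: scikit-learn's bundled iris data stands in for IRIS.csv.
X_raw, y_raw = load_iris(return_X_y=True)

# Split the unscaled data first.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=42
)

# The pipeline fits the scaler on the training rows only, then reuses
# those statistics when transforming the test rows.
pipe = make_pipeline(StandardScaler(), SVC(kernel="linear"))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

On a dataset as small and clean as this one the difference is likely to be minor, but the pattern keeps test information out of the preprocessing step.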

# Modeling

## SVM with scikit-learn
The SVM algorithm offers a choice of kernel functions. Mapping data into a higher-dimensional space is called kernelling, and the mathematical function used for the transformation is known as the kernel function. Common choices are:

1. Linear
2. Polynomial
3. Radial basis function (RBF)
4. Sigmoid
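The four kernels listed above can be written out as plain functions of two vectors. This is an illustrative NumPy sketch rather than scikit-learn's implementation: the parameter names `gamma`, `coef0`, and `degree` mirror the `SVC` arguments, but the values used here are arbitrary and are not the library defaults.

```python
import numpy as np

def linear_kernel(x, z):
    # Plain inner product.
    return float(np.dot(x, z))

def polynomial_kernel(x, z, gamma=1.0, coef0=0.0, degree=3):
    # (gamma * <x, z> + coef0) ** degree
    return float((gamma * np.dot(x, z) + coef0) ** degree)

def rbf_kernel(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2); equals 1 when x and z coincide.
    return float(np.exp(-gamma * np.sum((x - z) ** 2)))

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    # tanh(gamma * <x, z> + coef0)
    return float(np.tanh(gamma * np.dot(x, z) + coef0))

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

print(linear_kernel(x, z))      # 1*3 + 2*4 = 11.0
print(polynomial_kernel(x, z))  # 11 ** 3 = 1331.0
print(rbf_kernel(x, x))         # identical points give 1.0
print(sigmoid_kernel(x, z))
```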

Each of these functions has its own characteristics, advantages, drawbacks, and equation. Since there is no easy way of knowing in advance which one performs best on a given dataset, we usually try several in turn and compare the results.

Here we use the 'linear' kernel. In the K-fold cross validation at the end of this notebook it scored the joint-highest mean accuracy (level with 'rbf'), ahead of the 'poly' and 'sigmoid' kernels.

In [8]:
from sklearn import svm
svm_model = svm.SVC(kernel = 'linear')
svm_model.fit(X_train, y_train)

SVC(kernel='linear')

In [9]:
yhat_svm = svm_model.predict(X_test)

## K Nearest Neighbor (KNN)
Here we set the number of neighbors to 5. This value is chosen based on the results of the K-fold cross validation given at the end of this notebook.

In [10]:
from sklearn.neighbors import KNeighborsClassifier
k = 5  # number of neighbors, chosen from the cross-validation results
knn_model = KNeighborsClassifier(n_neighbors = k).fit(X_train, y_train)

In [11]:
yhat_knn = knn_model.predict(X_test)
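To show what a KNN prediction involves, here is a minimal from-scratch sketch: compute the Euclidean distance from the query point to every training point, take the k closest, and return the majority label. The function `knn_predict` and the toy data are hypothetical, and tie-breaking here may differ from `KNeighborsClassifier`.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array(["a", "a", "a", "b", "b", "b"])

print(knn_predict(X_toy, y_toy, np.array([0.3, 0.3]), k=3))
print(knn_predict(X_toy, y_toy, np.array([4.8, 5.0]), k=3))
```

Because the vote depends directly on distances, features on very different scales would distort it, which is why the features are standardized first.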

## Evaluation

In [12]:
from sklearn.metrics import confusion_matrix

In [13]:
# For SVM Model
confusion_matrix(y_test, yhat_svm)

array([[19,  0,  0],
       [ 0, 12,  1],
       [ 0,  0, 13]], dtype=int64)

The confusion matrix shows that the model predicted the species of the 45 Iris flowers in the test set with only one misclassification: one Iris-versicolor sample (second row) was predicted as Iris-virginica (third column).

In [14]:
from sklearn.metrics import f1_score
f1_score(y_test, yhat_svm, average='weighted') 

0.9777448559670783

The weighted F1 score of about 0.978 tells us that our SVM model performs well on the test set. Note that this is an F1 score, not an accuracy score.
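The weighted F1 score can be reproduced by hand from the SVM confusion matrix shown earlier. The sketch below uses only NumPy, with rows as true classes and columns as predicted classes (the scikit-learn convention).

```python
import numpy as np

# SVM confusion matrix from the notebook output:
# rows = true class, columns = predicted class.
cm = np.array([[19, 0, 0],
               [0, 12, 1],
               [0, 0, 13]])

true_positives = np.diag(cm)
precision = true_positives / cm.sum(axis=0)  # per predicted class (columns)
recall = true_positives / cm.sum(axis=1)     # per true class (rows)
f1_per_class = 2 * precision * recall / (precision + recall)

# Weighted average: each class F1 is weighted by its number of true samples.
support = cm.sum(axis=1)
weighted_f1 = float(np.sum(f1_per_class * support) / support.sum())

# Accuracy is the share of samples on the diagonal.
accuracy = float(true_positives.sum() / cm.sum())

print(f1_per_class)
print(weighted_f1)
print(accuracy)
```

The per-class F1 values work out to 1, 24/25, and 26/27, and their support-weighted mean matches the 0.9777 reported by f1_score. The accuracy, 44/45, is a separate and slightly different number.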

In [15]:
# For KNN Model
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, knn_model.predict(X_train)))
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat_knn))

Train set Accuracy:  0.9523809523809523
Test set Accuracy:  1.0


## K-fold Cross Validation


In [16]:
from sklearn.model_selection import cross_val_score
# For SVM
svm_model_linear = svm.SVC(kernel = 'linear')
score_linear = cross_val_score(svm_model_linear, X, y, cv=5)
print(score_linear.mean())

svm_model_poly = svm.SVC(kernel = 'poly')
score_poly = cross_val_score(svm_model_poly, X, y, cv=5)
print(score_poly.mean())

svm_model_rbf = svm.SVC(kernel = 'rbf')
score_rbf = cross_val_score(svm_model_rbf, X, y, cv=5)
print(score_rbf.mean())

svm_model_sigmoid = svm.SVC(kernel = 'sigmoid')
score_sigmoid = cross_val_score(svm_model_sigmoid, X, y, cv=5)
print(score_sigmoid.mean())

0.9666666666666668
0.9266666666666665
0.9666666666666666
0.9


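The 'linear' and 'rbf' kernels are effectively tied at a mean accuracy of about 0.967 (the difference in the last digit is floating-point rounding). Both are clearly ahead of the 'poly' kernel (about 0.927) and the 'sigmoid' kernel (0.9).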

In [17]:
# For KNN (note: the number of folds, cv, changes together with n_neighbors here)
knn_model_4 = KNeighborsClassifier(n_neighbors = 4)
score_4 = cross_val_score(knn_model_4, X, y, cv = 4)
print(score_4.mean())

knn_model_5 = KNeighborsClassifier(n_neighbors = 5)
score_5 = cross_val_score(knn_model_5, X, y, cv = 5)
print(score_5.mean())

knn_model_6 = KNeighborsClassifier(n_neighbors = 6)
score_6 = cross_val_score(knn_model_6, X, y, cv = 6)
print(score_6.mean())

knn_model_7 = KNeighborsClassifier(n_neighbors = 7)
score_7 = cross_val_score(knn_model_7, X, y, cv = 7)
print(score_7.mean())

0.9464793741109531
0.96
0.9533333333333333
0.9529993815708101
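In the KNN cell above, the number of folds changes together with n_neighbors, so the four scores mix two effects. A minimal sketch of a cleaner comparison keeps the folds fixed and varies only k. It assumes scikit-learn's bundled iris data as a stand-in for IRIS.csv and wraps the scaler in a pipeline so that it is refit inside each fold; `mean_scores` and `best_k` are illustrative names, and the scores may differ from those printed above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for IRIS.csv.
X_raw, y_raw = load_iris(return_X_y=True)

mean_scores = {}
for k in (4, 5, 6, 7):
    # Scaling happens inside each fold, on that fold's training part only.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Same number of folds for every k, so only k varies.
    scores = cross_val_score(model, X_raw, y_raw, cv=5)
    mean_scores[k] = scores.mean()
    print(k, round(mean_scores[k], 4))

# k with the highest mean cross-validated accuracy.
best_k = max(mean_scores, key=mean_scores.get)
print(best_k)
```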
