### Let's assume that a hobby botanist wants to identify the species of some iris flowers she has found. She has collected measurements for each iris: the length and width of the sepals and the length and width of the petals, all measured in centimeters.

In [1]:
#load the data
from sklearn.datasets import load_iris
iris_dataset = load_iris()

print("Keys of iris_dataset:\n", iris_dataset.keys())

Keys of iris_dataset:
 dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])


In [2]:
#description of the data
print("Description: \n",iris_dataset["DESCR"])

Description: 
 .. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.

In [3]:
#Data type
print("Type of data:\n", type(iris_dataset['data']))

Type of data:
 <class 'numpy.ndarray'>


In [4]:
print("First five rows of data:\n",iris_dataset["data"][:5])

First five rows of data:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


In [5]:
print("Target:\n",iris_dataset["target"])
#0=setosa, 1=versicolor, 2=virginica

Target:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [6]:
#Split the data 
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris_dataset["data"],iris_dataset['target'],random_state=0)

In [7]:
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

X_train shape:  (112, 4)
y_train shape:  (112,)
X_test shape:  (38, 4)
y_test shape:  (38,)
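
The 112/38 split above follows from the default behaviour of `train_test_split`, which holds out 25% of the samples (rounded up) for testing. The sketch below writes that default out explicitly. The `stratify` argument and the `_s` variable names are additions for illustration and are not part of the original cell; stratifying keeps the class proportions roughly equal in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()

# test_size=0.25 is the default; it is written out here for clarity.
# stratify is an assumption of this sketch, not part of the original split.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    iris_dataset["data"],
    iris_dataset["target"],
    test_size=0.25,
    stratify=iris_dataset["target"],
    random_state=0,
)

print("X_train shape:", X_train_s.shape)
print("X_test shape:", X_test_s.shape)
# Number of samples per class in the stratified test set
print("Test class counts:", np.bincount(y_test_s))
```

The separate variable names avoid overwriting the split used in the rest of the notebook.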


In [33]:
#We can also convert the data to a pandas DataFrame and continue the process with that

In [35]:
#convert iris_dataset into dataframe
import pandas as pd
iris_dataframe = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
#add target column to the dataframe
iris_dataframe['target']=iris_dataset.target
display(iris_dataframe)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  target
0                  5.1               3.5                1.4               0.2       0
1                  4.9               3.0                1.4               0.2       0
2                  4.7               3.2                1.3               0.2       0
3                  4.6               3.1                1.5               0.2       0
4                  5.0               3.6                1.4               0.2       0
..                 ...               ...                ...               ...     ...
145                6.7               3.0                5.2               2.3       2
146                6.3               2.5                5.0               1.9       2
147                6.5               3.0                5.2               2.0       2
148                6.2               3.4                5.4               2.3       2
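
As an alternative to assembling the DataFrame by hand, scikit-learn can return one directly. This is a minimal sketch, assuming a scikit-learn version that supports `as_frame=True` (the `frame` key in the output of the first cell suggests this one does) and that pandas is installed. The name `iris_frame` is chosen here for illustration.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects;
# .frame holds the four feature columns plus a 'target' column
iris_frame = load_iris(as_frame=True).frame

print(iris_frame.shape)
# Mean of each measurement per class (0=setosa, 1=versicolor, 2=virginica)
print(iris_frame.groupby("target").mean())
```

The per-class means give a quick view of which measurements separate the species.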


In [32]:
#use a k-nearest neighbors (kNN) model
#kNN suits this problem because the dataset is small and has only four features, so distance computations are cheap
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors =5)
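
To make the idea behind the classifier concrete, here is a rough sketch of k-nearest-neighbors prediction written in plain Python: compute the distance from a query point to every training point, keep the k closest, and take a majority vote over their labels. The function `knn_predict` and the toy data are hypothetical and only illustrate the principle; scikit-learn's `KNeighborsClassifier` uses more efficient search structures and its own tie-breaking.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Predict a label by majority vote among the k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: (petal length, petal width) pairs with made-up labels
train_points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3), (4.7, 1.4), (4.5, 1.5), (5.1, 1.9)]
train_labels = [0, 0, 0, 1, 1, 2]

print(knn_predict(train_points, train_labels, query=(1.4, 0.3), k=3))
```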

In [19]:
#fit the model
knn.fit(X_train, y_train)

#predict for train data
y_train_pred=knn.predict(X_train)
print('Train set prediction:\n',y_train_pred)

Train set prediction:
 [1 1 2 0 2 0 0 1 2 2 2 2 1 2 1 1 2 2 1 2 1 2 1 0 2 1 1 1 1 2 0 0 2 1 0 0 1
 0 2 1 0 1 2 1 0 2 2 2 2 0 0 2 2 0 2 0 2 2 0 0 2 0 0 0 1 2 2 0 0 0 1 1 0 0
 1 0 2 1 2 1 0 2 0 2 0 0 2 0 2 1 1 1 2 2 2 2 0 1 2 2 0 1 1 1 1 0 0 0 2 1 2
 0]




In [11]:
#predict for test data
y_test_pred=knn.predict(X_test)
print('test set prediction:\n',y_test_pred)

test set prediction:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]




In [23]:
#Checking the train accuracy (as a percentage)
print('Train set score:', knn.score(X_train,y_train)*100)
#Alternative method (as a fraction); accuracy_score expects (y_true, y_pred)
from sklearn.metrics import accuracy_score
print('Train accuracy:',accuracy_score(y_train,y_train_pred))

#Checking the test accuracy (as a percentage)
print('Test set score:', knn.score(X_test,y_test)*100)
#Alternative method (as a fraction)
print('Test accuracy:',accuracy_score(y_test,y_test_pred))

Train set score: 97.32142857142857
Train accuracy: 0.9732142857142857
Test set score: 97.36842105263158
Test accuracy: 0.9736842105263158
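
Both methods above report accuracy, which is the fraction of predictions that match the true labels. The test score shown corresponds to 37 of the 38 test samples being classified correctly, and the train score to 109 of 112. The sketch below computes the same quantity by hand with NumPy; the label arrays are hypothetical and are not taken from the iris split.

```python
import numpy as np

# Hypothetical labels for illustration only, not taken from the iris split
y_true = np.array([0, 1, 2, 2, 1, 0, 1, 2])
y_pred = np.array([0, 1, 2, 1, 1, 0, 1, 2])

# Element-wise comparison gives a boolean array; its mean is the accuracy
accuracy = np.mean(y_pred == y_true)
print("Accuracy:", accuracy)
```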




In [45]:
#introduce new data to the model
import numpy as np
X_new = np.array([[5,2.9,1,0.2]])
print("Shape of New data:",X_new.shape)

Shape of New data: (1, 4)


In [49]:
#predicting the new data
prediction = knn.predict(X_new)
print("prediction: ",prediction)
print("Predicted target name: ", iris_dataset["target_names"][prediction])

prediction:  [0]
Predicted target name:  ['setosa']
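
To see why the model chose this class, it can help to look at the neighbors behind the vote. The following is a self-contained sketch that refits the same model and then uses `kneighbors` and `predict_proba`, two methods of `KNeighborsClassifier` that the original cells do not call. With the default uniform weights, the probabilities are the share of the five neighbors belonging to each class.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

X_new = np.array([[5, 2.9, 1, 0.2]])

# Distances to, and training-set indices of, the 5 nearest neighbors
distances, indices = knn.kneighbors(X_new)
print("Neighbor labels:", y_train[indices[0]])

# Share of neighbors voting for each class, in the order of knn.classes_
probabilities = knn.predict_proba(X_new)
print("Class probabilities:", probabilities)
```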


