**Iris Species Classification Using K-Nearest Neighbours**

# Importing Libraries:

In [None]:
import pandas as pd
import numpy as np

## Exploring the Data:
The data we use is the Iris dataset. It is included in scikit-learn in the datasets module:

In [None]:
from sklearn.datasets import load_iris
iris_dataset=load_iris()

The object returned by load_iris behaves like a dictionary: it contains keys and values:

In [None]:
print('Keys of iris_dataset: \n{}'.format(iris_dataset.keys()))

Keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
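As a small aside, the returned object is a scikit-learn `Bunch`, so each key can also be read as an attribute. The sketch below only illustrates that the two access styles refer to the same data; the variable names are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Dictionary-style and attribute-style access return the same array.
data_by_key = iris_dataset['data']
data_by_attr = iris_dataset.data
same_data = np.array_equal(data_by_key, data_by_attr)
print("Same data via key and attribute: {}".format(same_data))
```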


The value of the key DESCR is a short description of the dataset:

In [None]:
print(iris_dataset['DESCR'][:200] + "\n...")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive
...


The value of the key target_names is an array of strings with the species of flower we want to predict:

In [None]:
print('Target names:{}'.format(iris_dataset['target_names']))

Target names:['setosa' 'versicolor' 'virginica']


The value of the key feature_names is a list of strings describing each feature:

In [None]:
print('Feature names: \n{}'.format(iris_dataset['feature_names']))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


The data key holds the numeric measurements of sepal length, sepal width, petal length, and petal width in a NumPy array:

In [None]:
print('Type of data: {}'.format(type(iris_dataset['data'])))

Type of data: <class 'numpy.ndarray'>


The rows of the data array correspond to flowers, while the columns represent the four measurements that were taken for each flower:

In [None]:
print('Shape of data: {}'.format(iris_dataset['data'].shape))

Shape of data: (150, 4)


The array contains measurements for 150 flowers, that is, 150 data points with 4 features each. Here are the feature values for the first five samples:

In [None]:
print('First five rows of data:\n{}'.format(iris_dataset['data'][:5]))

First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
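Since pandas is already imported, one optional way to read these rows more easily is to place them in a DataFrame with the feature names as column labels. This is a minimal sketch; the names `iris_df` and the extra `species` column are assumptions introduced for illustration and are not used elsewhere in the notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Label each column with its feature name.
iris_df = pd.DataFrame(iris_dataset['data'],
                       columns=iris_dataset['feature_names'])

# Map the integer targets to species names (illustrative extra column).
iris_df['species'] = iris_dataset['target_names'][iris_dataset['target']]

print(iris_df.head())
```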


In [None]:
print('Type of target: {}'.format(type(iris_dataset['target'])))

Type of target: <class 'numpy.ndarray'>


The target is a one-dimensional array with one entry per flower, as its shape shows:

In [None]:
print('Shape of target: {}'.format(iris_dataset['target'].shape))

Shape of target: (150,)


Printing the target values shows that the species are encoded as the integers 0, 1, and 2. These are indices into target_names, so 0 means setosa, 1 means versicolor, and 2 means virginica:

In [None]:
print('target:\n{}'.format(iris_dataset['target']))

target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
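To check the encoding, the integer labels can be counted and paired with the species names. The sketch below uses `np.bincount`; the variable name `counts` is chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Count how many samples carry each integer label.
counts = np.bincount(iris_dataset['target'])

# Pair each label with its species name and its count.
for label, (name, count) in enumerate(zip(iris_dataset['target_names'], counts)):
    print("{} -> {}: {} samples".format(label, name, count))
```

Each of the three classes has 50 samples, which matches the dataset description.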


# Measuring Success: Training and Testing Data:

We split the labelled data into two parts. One part is used to build the machine learning model and is called the training data or training set. The rest is used to assess how well the model works and is called the test data, test set, or hold-out set:

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test=train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

The output of the train_test_split function is X_train, X_test, y_train, and y_test, which are all NumPy arrays. By default, X_train contains 75% of the rows of the dataset and X_test contains the remaining 25%:

In [None]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (112, 4)
y_train shape: (112,)
X_test shape: (38, 4)
y_test shape: (38,)
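The 75%/25% proportion is the default of train_test_split, and fixing random_state makes the shuffle repeatable. The following sketch, with variable names chosen for this example, compares the default call against one that passes test_size=0.25 explicitly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Default split: test_size is not given, so 25% of the rows go to the test set.
default_split = train_test_split(X, y, random_state=0)

# Same split with the test fraction spelled out.
explicit_split = train_test_split(X, y, test_size=0.25, random_state=0)

# With the same random_state, both calls return identical arrays.
identical = all(np.array_equal(a, b)
                for a, b in zip(default_split, explicit_split))
print("Splits identical: {}".format(identical))
```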


The k-nearest neighbours classification algorithm is implemented in the KNeighborsClassifier class from the neighbors module. Here we set the number of neighbours to 1:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

To build the model on the training set, we call the fit method:

In [None]:
knn.fit(X_train,y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
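To make the idea behind n_neighbors=1 concrete, here is a minimal NumPy sketch of a one-nearest-neighbour prediction using Euclidean distance. It is an illustration only, not scikit-learn's implementation; the function `predict_1nn` and the toy arrays are hypothetical names and data made up for this example.

```python
import numpy as np

def predict_1nn(X_train, y_train, X_query):
    """Return the label of the closest training point for each query row."""
    # Differences between every query row and every training row,
    # shape (n_query, n_train, n_features).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    # Euclidean distances, shape (n_query, n_train).
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Index of the nearest training point for each query row.
    nearest = dists.argmin(axis=1)
    return y_train[nearest]

# Toy data (made up): two points of class 0 and one of class 1.
X_toy = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
y_toy = np.array([0, 0, 1])

toy_pred = predict_1nn(X_toy, y_toy, np.array([[0.9, 0.8], [4.0, 4.5]]))
print(toy_pred)
```

The first query point lies closest to a class 0 point and the second lies closest to the class 1 point, so the predicted labels are 0 and 1.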

# Making Predictions:

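Suppose we find an iris with a sepal length of 5 cm, a sepal width of 2.9 cm, a petal length of 1 cm, and a petal width of 0.2 cm. We put these measurements into a two-dimensional NumPy array with one row and four columns, because scikit-learn expects two-dimensional input, and check its shape:
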
In [None]:
X_new=np.array([[5,2.9,1,0.2]])
print("X_new.shape: {}".format(X_new.shape))

X_new.shape: (1, 4)


To make a prediction, we call the predict method of the knn object:

In [None]:
prediction=knn.predict(X_new)
print("prediction: {}".format(prediction))
print("predicted target name: {}".format(
    iris_dataset['target_names'][prediction]))

prediction: [0]
predicted target name: ['setosa']
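To see which training flower drives this prediction, the fitted classifier's kneighbors method returns the distance to, and the index of, the nearest training point. The sketch below rebuilds the same model so that it runs on its own; the names `distances`, `indices`, and `neighbour_label` are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

X_new = np.array([[5, 2.9, 1, 0.2]])

# Distance to and index of the single nearest training sample.
distances, indices = knn.kneighbors(X_new)
neighbour_label = y_train[indices[0, 0]]

print("Nearest training row: {}".format(X_train[indices[0, 0]]))
print("Its label: {}".format(iris_dataset['target_names'][neighbour_label]))
```

With a single neighbour, the predicted class is simply the label of that closest training sample.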


# Evaluating the Model:
We can make a prediction for each iris in the test data and compare it against the known species label. We can measure how well the model works by computing the accuracy, which is the fraction of flowers for which the right species was predicted:

In [None]:
y_pred=knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))

Test set predictions:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


In [None]:
print("Test set score : {}".format(np.mean(y_pred==y_test)))

Test set score : 0.9736842105263158
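The same accuracy can also be obtained from the classifier's score method, which predicts on the test data and compares against the labels in one call. The sketch below rebuilds the model so that it runs on its own; `manual_accuracy` and `score_accuracy` are names chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Accuracy computed by hand: fraction of matching predictions.
manual_accuracy = np.mean(knn.predict(X_test) == y_test)

# Accuracy computed by the classifier's own score method.
score_accuracy = knn.score(X_test, y_test)

print("Manual accuracy: {:.2f}".format(manual_accuracy))
print("knn.score:       {:.2f}".format(score_accuracy))
```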
