# **Author: Pravin Konasirasgi**


**MACHINE LEARNING BOOTCAMP PROJECT POWERED BY SHAPE AI**

**Project name: Iris_dataset using k-nearest neighbors**

**Task: Prediction using Supervised Machine Learning**

**Problem Statement: Given the length and width of the sepals and petals (in cm), predict the species of an iris flower.**

**I will be performing the following steps:**

*  Data reading and understanding
*  Exploratory Data Analysis
*  Building a k-nearest neighbor Machine Learning Model
*  Model Evaluation and Prediction


# **Importing Important Libraries:**

***Let's import some of the libraries that we will be using in this notebook:***

In [None]:
import pandas as pd
import numpy as np

# **Data Exploration:**

***Here we use the Iris dataset, which is included in the datasets module of scikit-learn. We load it by calling the load_iris function:***

In [None]:
from sklearn.datasets import load_iris
iris_dataset=load_iris()

***The iris_dataset object returned by load_iris is a Bunch object, which works like a dictionary of keys and values:***

In [None]:
print("Keys of iris dataset: \n{}".format(iris_dataset.keys()))

Keys of iris dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
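
***As a small illustrative sketch of what "works like a dictionary" means for a Bunch: the same value can be read either by key or as an attribute. The variable name same_array is only used for this illustration, and the exact set of keys may differ between scikit-learn versions:***

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Key access and attribute access return the very same NumPy array object.
same_array = iris_dataset.data is iris_dataset["data"]
print("Same array via attribute and key:", same_array)

# Like a dict, a Bunch supports membership tests and iteration over its keys.
print("Has 'target_names' key:", "target_names" in iris_dataset)
print("Keys:", sorted(iris_dataset.keys()))
```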


***The value of the key DESCR is a short text description of the dataset. Here are its first 200 characters:***

In [None]:
val=iris_dataset['DESCR']
start_val=val[:200]
print(start_val+"\n...")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive
...


***The value of the key target_names is an array of strings containing the species of flower that we want to predict, i.e., 'setosa', 'versicolor' and 'virginica'.***

In [None]:
print("Target names: {}".format(iris_dataset['target_names']))

Target names: ['setosa' 'versicolor' 'virginica']


***The value of feature_names is a list of strings describing each feature: 'sepal length', 'sepal width', 'petal length' and 'petal width', all in cm.***

In [None]:
print("Feature names: \n{}".format(iris_dataset['feature_names']))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


***The data itself contains the numeric measurements of sepal length, sepal width, petal length and petal width in a NumPy array:***

In [None]:
print("Type of data: {}".format(type(iris_dataset['data'])))

Type of data: <class 'numpy.ndarray'>


***In this data array, the rows correspond to the flowers and the columns represent the four measurements that were taken for each flower:***

In [None]:
print("Shape of data: {}".format(iris_dataset['data'].shape))

Shape of data: (150, 4)


***We see that the array contains measurements for 150 different flowers, so we have 150 data points (samples) and 4 features. Here are the feature values for the first five samples:***

In [None]:
print("First five rows of data:\n{}".format(iris_dataset['data'][:5]))

First five rows of data:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]


***We can see that all of the first five flowers have a petal width of 0.2 cm and that, among these five, the first flower has the longest sepal at 5.1 cm.***
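
***A possible way to make such observations easier is to wrap the array in a pandas DataFrame labeled with the feature names. This is only a sketch; the name iris_df is an assumption introduced here and is not used elsewhere in the notebook:***

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Label each column with its feature name so the rows are easier to read.
iris_df = pd.DataFrame(iris_dataset["data"],
                       columns=iris_dataset["feature_names"])

# First five flowers, now with column headers.
print(iris_df.head())

# Minimum, maximum and mean of every measurement over all 150 flowers.
print(iris_df.describe().loc[["min", "max", "mean"]])
```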

***Now let's look at the target key, which contains the species of each of the flowers that were measured, also as a NumPy array:***

In [None]:
print("Type of target: {}".format(type(iris_dataset['target'])))

Type of target: <class 'numpy.ndarray'>


***target is a one-dimensional array with one entry per flower:***

In [None]:
print("Shape of target: {}".format(iris_dataset['target'].shape))

Shape of target: (150,)


***Now we can see that the species are encoded as integers from 0 to 2:***

In [None]:
print("Target:\n{}".format(iris_dataset['target']))

Target:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


***The meanings of the numbers are given by the iris_dataset['target_names'] array: 0 means setosa, 1 means versicolor and 2 means virginica.***
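
***The integer codes can be turned back into species names by indexing target_names with the target array. The short sketch below also counts the flowers per class; the names species and counts are introduced only for this illustration:***

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()
target = iris_dataset["target"]
target_names = iris_dataset["target_names"]

# Indexing the names array with the integer codes decodes every label at once.
species = target_names[target]
print("First flower:", species[0], "| last flower:", species[-1])

# np.bincount counts how often each integer code (0, 1, 2) occurs.
counts = np.bincount(target)
for name, count in zip(target_names, counts):
    print("{}: {} flowers".format(name, count))
```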

# **Measuring Success: Training and Testing Data:**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

***The output of the train_test_split function is X_train, X_test, y_train and y_test, which are all NumPy arrays. With the default settings, the rows are shuffled and X_train contains 75% of the rows of the dataset, while X_test contains the remaining 25%:***

In [None]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (112, 4)
y_train shape: (112,)
X_test shape: (38, 4)
y_test shape: (38,)
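
***The sizes above follow from the default test_size of 0.25: 0.25 x 150 = 37.5 is rounded up to 38 test rows, which leaves 112 rows for training. The sketch below spells out that default and, as an optional variation that this notebook does not use, shows the stratify argument, which keeps the class proportions in both parts close to those of the full dataset. The Xs_ and ys_ names are assumptions made for this example:***

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset["data"], iris_dataset["target"]

# Same call as in the notebook, with the default test_size written explicitly.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print("Train rows:", X_train.shape[0], "| test rows:", X_test.shape[0])

# Optional variation (not used in the notebook): stratify on the labels.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Compare how many flowers of each species land in the test set.
print("Test-set class counts (plain):     ", np.bincount(y_test))
print("Test-set class counts (stratified):", np.bincount(ys_test))
```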


# **Building the k-nearest neighbor Machine Learning Algorithm Model:**

***The most important parameter of KNeighborsClassifier is the number of neighbors, which we will set to 1:***

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=1)

# **Training the Model on the Training Data:**



In [None]:
knn.fit(X_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
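
***With n_neighbors=1, the idea behind the classifier is simple: fitting essentially stores the training set, and a new flower receives the label of the single closest training flower. The following is a minimal NumPy sketch of that idea, assuming Euclidean distance (which matches the metric='minkowski', p=2 settings shown above). The helper predict_1nn is hypothetical and is not part of scikit-learn; ties between equally close neighbors may be resolved differently from the library:***

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)


def predict_1nn(X_train, y_train, X_query):
    """Hypothetical helper: give each query row the label of its closest training row."""
    # Differences between every query row and every training row,
    # shape (n_query, n_train, n_features).
    diffs = X_query[:, np.newaxis, :] - X_train[np.newaxis, :, :]
    # Euclidean distances, shape (n_query, n_train).
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Index of the closest training row for each query row.
    nearest = distances.argmin(axis=1)
    return y_train[nearest]


manual_pred = predict_1nn(X_train, y_train, X_test)

# Compare against scikit-learn's classifier with the same setting.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
agreement = np.mean(manual_pred == knn.predict(X_test))
print("Agreement with scikit-learn: {:.3f}".format(agreement))
```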

# **Making Predictions:**

***We can now make predictions using this model on new data for which we might not know the correct labels:***

In [None]:
X_new=np.array([[5,2.9,1,0.2]])
print("X_new.shape: {}".format(X_new.shape))

X_new.shape: (1, 4)


***To make a prediction, we call the predict method of the knn object:***

In [None]:
prediction=knn.predict(X_new)

print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(
    iris_dataset['target_names'][prediction]))

Prediction: [0]
Predicted target name: ['setosa']
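
***To see why the model answers 'setosa', one option is to ask the fitted classifier which training flower it considers closest to the new measurement. The sketch below uses the kneighbors method of KNeighborsClassifier; the names neighbor_row and neighbor_label are introduced only for this illustration:***

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

X_new = np.array([[5, 2.9, 1, 0.2]])

# kneighbors returns the distances to, and the row indices of, the
# nearest training samples; both arrays have shape (1, 1) here.
distances, indices = knn.kneighbors(X_new)
neighbor_row = indices[0, 0]
neighbor_label = y_train[neighbor_row]

print("Nearest training flower:", X_train[neighbor_row])
print("Distance to it: {:.3f}".format(distances[0, 0]))
print("Its species:", iris_dataset["target_names"][neighbor_label])
```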


# **Evaluating the Model:**

***Here we evaluate the model on the test set, which was not used during training:***

In [None]:
y_pred=knn.predict(X_test)
print("Test set predictions:\n {}".format(y_pred))

Test set predictions:
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


# **Checking if the predictions are right:**

In [None]:
print("Test set score: {}".format(np.mean(y_pred == y_test)))

Test set score: 0.9736842105263158


***This is the accuracy of the model on the test set: about 97%, meaning it predicted the correct species for 37 of the 38 test flowers.***
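
***The same accuracy can also be obtained from the score method of the classifier or from accuracy_score in sklearn.metrics. The sketch below checks that the three approaches agree and, as an optional extra that the notebook itself does not compute, prints a confusion matrix showing which species, if any, are confused with each other:***

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Three equivalent ways of computing the fraction of correct predictions.
manual_accuracy = np.mean(y_pred == y_test)
score_accuracy = knn.score(X_test, y_test)
metric_accuracy = accuracy_score(y_test, y_pred)
print("Manual: {:.4f} | score: {:.4f} | accuracy_score: {:.4f}".format(
    manual_accuracy, score_accuracy, metric_accuracy))

# Number of test flowers that were labeled incorrectly.
n_wrong = int(np.sum(y_pred != y_test))
print("Misclassified test flowers:", n_wrong, "of", len(y_test))

# Rows are the true species, columns the predicted species.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```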