# **Importing libraries:**

First, we import the pandas and NumPy libraries.

In [1]:
import pandas as pd
import numpy as np

# **Exploring the data:**

The data used here is the Iris dataset.
It is included in the datasets module of scikit-learn, and we can load it by calling the load_iris function.

In [2]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

In [3]:
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))

Keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
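The object returned by `load_iris` is a scikit-learn `Bunch`, which behaves like a dictionary but also exposes its keys as attributes. A minimal sketch, assuming the same `load_iris` call as above (`same_data` and `same_target` are names introduced here for illustration):

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Dictionary-style and attribute-style access return the same underlying objects.
same_data = iris_dataset["data"] is iris_dataset.data
same_target = iris_dataset["target"] is iris_dataset.target
print(same_data, same_target)
```

The rest of this notebook uses the dictionary-style access, so either form can be read interchangeably.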


The value of the key DESCR is a short description of the dataset:

In [4]:
val = iris_dataset["DESCR"]
start_val = val[:200]
print(start_val + "\n...")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive
...


The value of the key target_names is an array of strings containing the species of flower that we want to predict:

In [5]:
print("Target names: {}".format(iris_dataset["target_names"]))

Target names: ['setosa' 'versicolor' 'virginica']


The value of feature_names is a list of strings describing each feature: "sepal length", "sepal width", "petal length" and "petal width" (all in cm):

In [6]:
print("Feature names: \n{}".format(iris_dataset["feature_names"]))

Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


As shown earlier, target_names lists the class names that correspond to the target labels:

In [7]:
print("Keys of Target names: \n{}".format(iris_dataset["target_names"]))

Keys of Target names: 
['setosa' 'versicolor' 'virginica']


The data itself is stored in a NumPy array:

In [8]:
print("Type of data: \n{}".format(type(iris_dataset["data"])))

Type of data: 
<class 'numpy.ndarray'>


Shape of the data array (150 samples, 4 features each):

In [9]:
print("Shape of data: \n{}".format(iris_dataset["data"].shape))

Shape of data: 
(150, 4)


In [10]:
print("First five rows of data: \n{}".format(iris_dataset["data"][:5]))

First five rows of data: 
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
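Since pandas is already imported, the raw array can also be wrapped in a DataFrame so that each column carries its feature name. This is a hedged sketch rather than part of the original workflow; `iris_df` and the `species` column are names introduced here for illustration:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# One column per feature, labelled with the strings from feature_names.
iris_df = pd.DataFrame(iris_dataset["data"], columns=iris_dataset["feature_names"])

# Add the species name of each row by indexing target_names with the integer labels.
iris_df["species"] = iris_dataset["target_names"][iris_dataset["target"]]

print(iris_df.head())
```

The model-building steps later on still use the NumPy arrays directly; the DataFrame is only a convenience for looking at the data.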


The target is also a NumPy array:

In [11]:
print("Type of target: \n{}".format(type(iris_dataset["target"])))

Type of target: 
<class 'numpy.ndarray'>


The target is a one-dimensional array with one entry per flower:

In [12]:
print("Shape of target: \n{}".format(iris_dataset["target"].shape))

Shape of target: 
(150,)


In [13]:
print("Target: \n{}".format(iris_dataset["target"]))

Target: 
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


The meaning of the numbers is given by the iris_dataset["target_names"] array: 0 means setosa, 1 means versicolor, and 2 means virginica.
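The mapping from integer labels to species names can be applied to every sample at once with NumPy integer indexing. The sketch below rebuilds the label array printed above with `np.repeat` instead of loading the dataset, so it needs only NumPy; `species` and `counts` are illustrative names:

```python
import numpy as np

# Rebuild the label array printed above: 50 zeros, then 50 ones, then 50 twos.
target = np.repeat([0, 1, 2], 50)
target_names = np.array(["setosa", "versicolor", "virginica"])

# Indexing the names array with the integer labels gives one name per sample.
species = target_names[target]
print(species[0], species[50], species[100])

# bincount tallies how often each integer label occurs.
counts = np.bincount(target)
print(counts)
```

Because the labels are sorted by class, the data has to be shuffled before it is split into training and test sets.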

# **Measuring Success: Training and Testing Data:**

Next, we use scikit-learn (sklearn) to split the data into a training set and a test set.
In scikit-learn, data is usually denoted with a capital X, while labels are denoted by a lowercase y (X is the input to a function and y is the output).

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

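The train_test_split function returns X_train, X_test, y_train, and y_test, which are all NumPy arrays.
By default, it shuffles the data and places 75% of the rows in the training set (X_train, y_train) and the remaining 25% in the test set (X_test, y_test). Setting random_state=0 makes the shuffle reproducible.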

In [15]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (112, 4)
y_train shape: (112,)
X_test shape: (38, 4)
y_test shape: (38,)
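The 112/38 split follows from the default test fraction of 0.25: when the size is given as a fraction, scikit-learn rounds the test set size up and assigns the remaining rows to the training set. The arithmetic can be sketched in plain Python (the variable names are illustrative):

```python
import math

n_samples = 150
test_fraction = 0.25  # default used by train_test_split when no size is given

# The test set size is rounded up; the training set gets whatever remains.
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```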


# **Building the Model: k-Nearest Neighbors:**

Now we will build the actual machine learning model.
Here we use a k-nearest neighbors classifier with a single neighbor (n_neighbors=1).

In [16]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

To build the model on the training set, we call the fit method of the knn object, which takes as arguments the NumPy array X_train containing the training data and the NumPy array y_train of the corresponding training labels.

In [17]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')
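To make concrete what a nearest-neighbor classifier with a single neighbor does, here is a from-scratch sketch in plain Python. It is not how scikit-learn implements `KNeighborsClassifier`; it only illustrates the idea of returning the label of the closest training point under Euclidean distance (which is what `metric='minkowski'` with `p=2` amounts to). The function name and the tiny training set are assumptions made for illustration: the first two rows are the first two samples printed earlier, while the last two rows are illustrative values for the other species and are not taken from the notebook output.

```python
import math


def nearest_neighbor_predict(train_points, train_labels, query):
    """Return the label of the training point closest to `query`."""
    best_label = None
    best_distance = math.inf
    for point, label in zip(train_points, train_labels):
        # Euclidean distance between the training point and the query.
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(point, query)))
        if distance < best_distance:
            best_distance = distance
            best_label = label
    return best_label


# Tiny illustrative training set (see the note above about where the rows come from).
train_points = [
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],  # illustrative versicolor-like measurement
    [6.3, 3.3, 6.0, 2.5],  # illustrative virginica-like measurement
]
train_labels = ["setosa", "setosa", "versicolor", "virginica"]

prediction = nearest_neighbor_predict(train_points, train_labels, [5, 2.9, 1, 0.2])
print(prediction)
```

With one neighbor, "fitting" amounts to storing the training set; all of the work happens at prediction time.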

# **Making Predictions:**

Now we can use this model to make predictions on new data for which the correct label is not known.

In [18]:
X_new = np.array([[5, 2.9, 1, 0.2]])
print("X_new.shape: {}".format(X_new.shape))

X_new.shape: (1, 4)


To make a prediction, we call the predict method of the knn object:

In [19]:
prediction = knn.predict(X_new)
print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(iris_dataset["target_names"][prediction]))

Prediction: [0]
Predicted target name: ['setosa']
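The `predict` method expects a two-dimensional array with one row per sample, which is why `X_new` is written with double brackets. If a measurement starts out as a flat array, it can be reshaped into a single row first. A small NumPy-only sketch (`flat` and `as_row` are illustrative names):

```python
import numpy as np

# A single flower as a flat, one-dimensional array of four measurements.
flat = np.array([5, 2.9, 1, 0.2])
print(flat.shape)

# reshape(1, -1) turns it into one row with as many columns as needed.
as_row = flat.reshape(1, -1)
print(as_row.shape)
```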


# **Evaluating the Model:**

Here, we can make a prediction for each iris in the test data and compare it against its label (the known species).
We can measure how well the model works by computing the accuracy, which is the fraction of flowers for which the right species was predicted.

In [20]:
y_pred = knn.predict(X_test)
print("Test set predictions: \n {}".format(y_pred))

Test set predictions: 
 [2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


In [21]:
print("Test set score: {}".format(np.mean(y_pred == y_test)))

Test set score: 0.9736842105263158
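The same accuracy can also be obtained from the classifier's `score` method, which predicts on the test data and compares the result against the labels in one call. The sketch below repeats the steps from this notebook so that it runs on its own; `manual_accuracy` and `score_accuracy` are names introduced here:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset["data"], iris_dataset["target"], random_state=0
)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Fraction of test samples whose predicted label matches the true label.
manual_accuracy = np.mean(knn.predict(X_test) == y_test)

# score() computes the mean accuracy on the given data and labels.
score_accuracy = knn.score(X_test, y_test)
print(manual_accuracy, score_accuracy)
```

Both values are computed from the same predictions, so they should agree.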
