**ML Model on IRIS Dataset**

**Problem Statement:** A hobby botanist distinguishes flowers by the length and width of their sepals and petals. She and her friends created a database using their knowledge and research. She wants a machine learning model that can learn from this database and predict the species of a new, unknown flower based on its features (sepal length, sepal width, petal length, and petal width). With the help of this model, anyone, anywhere in the world, can find out which species a given flower belongs to.

**Importing pandas and numpy libraries:**

In [None]:
import pandas as pd
import numpy as np

**Exploring the data:**

We are using the iris dataset which is included in scikit-learn's datasets module. We load the iris dataset by calling the load_iris function.

In [None]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

Printing the keys of iris_dataset:

In [None]:
print("Keys of the iris dataset are:\n{}".format(iris_dataset.keys()))

Keys of the iris dataset are:
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])


Printing the first 200 characters of the description string stored under iris_dataset's DESCR key:

In [None]:
print("\nDESCR's first 200 characters:\n{}".format(iris_dataset['DESCR'][:200]))


DESCR's first 200 characters:
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive


Printing the values present in iris_dataset's target_names key:

In [None]:
print("\nTarget names:\n{}".format(iris_dataset['target_names']))


Target names:
['setosa' 'versicolor' 'virginica']


Printing the values present in iris_dataset's feature_names key:

In [None]:
print("\nFeature names:\n{}".format(iris_dataset['feature_names']))


Feature names:
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


Printing the type of the "data" key of iris_dataset:

In [None]:
print("\nType of data: {}".format(type(iris_dataset['data'])))


Type of data: <class 'numpy.ndarray'>


Printing the shape of "data" key of the iris_dataset:

In [None]:
print("\nShape of data: {}".format(iris_dataset['data'].shape))


Shape of data: (150, 4)


Printing the first five rows of the "data" key of the iris_dataset (each row is one flower, each column is one feature):

In [None]:
print("\nFirst five rows of data are:\n{}".format(iris_dataset['data'][:5]))


First five rows of data are:
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
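The raw array is easier to read with column names attached. As a minimal sketch, the same data can be viewed as a pandas DataFrame; the names iris_df and the extra "species" column are illustrative choices, not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# Label each column with the corresponding feature name.
iris_df = pd.DataFrame(iris_dataset['data'],
                       columns=iris_dataset['feature_names'])

# Hypothetical extra column: map each integer target to its species name.
iris_df['species'] = iris_dataset['target_names'][iris_dataset['target']]

print(iris_df.head())
```
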


Printing the type of the "target" key of the iris_dataset:

In [None]:
print("\nType of target: {}".format(type(iris_dataset['target'])))


Type of target: <class 'numpy.ndarray'>


Printing the shape of the "target" key of the iris_dataset:

In [None]:
print("\nShape of target: {}".format(iris_dataset['target'].shape))


Shape of target: (150,)


Printing all the values stored in the "target" key of the iris_dataset:

In [None]:
print("\nTarget: {}".format(iris_dataset['target']))


Target: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
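The integers 0, 1, and 2 correspond to the entries of target_names in order. A small sketch, assuming the same load_iris call as above, counts how many samples each class has; the variable name class_counts is only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_dataset = load_iris()

# np.bincount counts how often each non-negative integer label occurs.
class_counts = np.bincount(iris_dataset['target'])

for name, count in zip(iris_dataset['target_names'], class_counts):
    print("{}: {}".format(name, count))
```

The counts should agree with the "50 in each of three classes" line in the dataset description. Note also that the labels are sorted by class, which is why the data needs to be shuffled before it is split.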


Splitting the dataset into training and test data:
We use scikit-learn's train_test_split function to split the dataset into training and test sets. X represents the features, here sepal length, sepal width, petal length, and petal width. y represents the labels, here setosa, versicolor, and virginica. By default, the function shuffles the data and places 75% of it in the training set and the remaining 25% in the test set. (random_state is set to zero so that everyone running this code gets the same split.)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris_dataset['data'], 
                                        iris_dataset['target'], random_state=0)

We now check the shapes of X_train, y_train, X_test, and y_test. (They are all NumPy ndarrays.)

In [None]:
print("X_train shape: {}".format(X_train.shape))
print("y_train shape: {}".format(y_train.shape))
print("X_test shape: {}".format(X_test.shape))
print("y_test shape: {}".format(y_test.shape))

X_train shape: (112, 4)
y_train shape: (112,)
X_test shape: (38, 4)
y_test shape: (38,)
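To make the split settings visible, the sketch below writes out test_size=0.25 explicitly (the value scikit-learn falls back to when neither test_size nor train_size is given) and runs the split twice with the same random_state to check that the result is reproducible. The names split_a, split_b, and same_split are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X, y = iris_dataset['data'], iris_dataset['target']

# Two independent calls with identical settings.
split_a = train_test_split(X, y, test_size=0.25, random_state=0)
split_b = train_test_split(X, y, test_size=0.25, random_state=0)

# Compare the four returned arrays pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("Identical splits: {}".format(same_split))

X_train, X_test, y_train, y_test = split_a
print("Training samples: {}, test samples: {}".format(len(X_train), len(X_test)))
```
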


K-nearest neighbors model creation:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

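Training the model: We train the model by passing the training features and training labels to the classifier's fit method. For k-nearest neighbors, fitting essentially stores the training set so that it can be searched at prediction time.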

In [None]:
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=1, p=2,
                     weights='uniform')

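Making predictions: We give the model a new measurement whose label we do not know. scikit-learn expects a two-dimensional array of shape (number of samples, number of features), so the single flower is wrapped in an extra pair of brackets.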

In [None]:
X_new = np.array([[5, 2.9, 1, 0.2]])

print("X_new's shape: {}".format(X_new.shape))

X_new's shape: (1, 4)


Now we make predictions using the predict method of the knn object:

In [None]:
prediction = knn.predict(X_new)

print("Prediction: {}".format(prediction))
print("Predicted target name: {}".format(iris_dataset['target_names'][prediction]))

Prediction: [0]
Predicted target name: ['setosa']
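With n_neighbors=1, the classifier assigns the label of the single closest training point. The following is a hand-rolled sketch of that idea using plain NumPy and the default Euclidean distance; it is meant to illustrate the principle, not to reproduce scikit-learn's internal implementation. The names distances, nearest_index, and manual_prediction are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

X_new = np.array([[5, 2.9, 1, 0.2]])

# Euclidean distance from the new flower to every training flower.
distances = np.sqrt(((X_train - X_new) ** 2).sum(axis=1))

# The nearest neighbor is the training point with the smallest distance.
nearest_index = np.argmin(distances)
manual_prediction = y_train[nearest_index]

print("Nearest training point: {}".format(X_train[nearest_index]))
print("Manual prediction: {}".format(
    iris_dataset['target_names'][manual_prediction]))
```
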


Evaluating the model: We ask the model to predict labels for the test set, for which we already know the true labels. Comparing its predictions with those labels tells us how accurate the model is on data it has not seen during training.

In [None]:
y_pred = knn.predict(X_test)

print("Predictions made on test set: \n{}".format(y_pred))

Predictions made on test set: 
[2 1 0 2 0 2 0 1 1 1 2 1 1 1 1 0 1 1 0 0 2 1 0 0 2 0 0 1 1 0 2 1 0 2 2 1 0
 2]


Checking whether the predictions are correct: We measure the model's accuracy by comparing the predictions with the test labels element by element and taking the mean of the resulting boolean array, which gives the fraction of correct predictions.

In [None]:
print("Test set score: {}".format(np.mean(y_pred == y_test)))

Test set score: 0.9736842105263158
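The same accuracy can also be obtained from scikit-learn directly. The sketch below rebuilds the model and compares the manual mean with accuracy_score from sklearn.metrics and with the classifier's own score method; the variable names manual_score, metric_score, and method_score are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Three ways to compute the fraction of correct predictions.
manual_score = np.mean(y_pred == y_test)
metric_score = accuracy_score(y_test, y_pred)
method_score = knn.score(X_test, y_test)

print("Manual mean:    {}".format(manual_score))
print("accuracy_score: {}".format(metric_score))
print("knn.score:      {}".format(method_score))
```

All three values should agree, since each one is the number of correct predictions divided by the number of test samples.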
