In [1]:
import sklearn

#### Loading the data
scikit-learn comes with a few small standard datasets that can be loaded without downloading or reading data from external websites. In this programming task, we are going to use the Iris Plants Dataset. This dataset contains measurements of iris flowers (sepal length, sepal width, petal length and petal width) together with the species of each flower, which takes one of three values: setosa, versicolor or virginica.

The iris dataset is typically used for supervised learning tasks, and in particular for classification. The idea is that we have measurements (sepal length, sepal width, petal length and petal width) of flowers for which we know the correct species. So if we go out in nature, find some iris flowers and measure their sepal length, sepal width, petal length and petal width, we can use a model built from the iris dataset to predict which species each flower belongs to. Nice, huh? And since the quantity we predict is one of a fixed set of species (three of them here), rather than a number, it's a classification task.

In [2]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

The iris_dataset object returned by load_iris is a Bunch object. It works much like a dictionary: it holds the data together with some information about the dataset, stored as keys and values.

In [3]:
print("Keys of iris_dataset: ", iris_dataset.keys())

Keys of iris_dataset:  dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
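
A Bunch behaves like a dictionary whose keys are also available as attributes. The short sketch below illustrates this; the name `iris_demo` is used only for this illustration so that it does not interfere with the variables in the notebook.

```python
from sklearn.datasets import load_iris

iris_demo = load_iris()

# Dictionary-style and attribute-style access return the same object.
same_data = iris_demo["data"] is iris_demo.data
same_target = iris_demo["target"] is iris_demo.target
print("Same data array:  ", same_data)
print("Same target array:", same_target)

# A Bunch can be iterated like a dict, which yields its keys.
for key in iris_demo:
    print(key, "->", type(iris_demo[key]).__name__)
```

Either access style can be used; the rest of this notebook keeps the dictionary style.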


The object has several keys. We will look at five of them:

DESCR
feature_names
target_names
data
target

Let's have a closer look at each one of them.

DESCR is a short description of the dataset. Run the code below to get an extract of the first 200 characters. If you want to get a bigger extract, all you need to do is change 200 to a larger number.

In [4]:
print(iris_dataset['DESCR'][:200] + "\n.......")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive
.......


feature_names corresponds to the names of all the features in the dataset, in other words all the variables that we take into account when building our machine learning model.

In [5]:
print("Feature names: ", iris_dataset['feature_names'])

Feature names:  ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


target_names corresponds to the class labels. By running the code below, we can see that there are three class labels: 'setosa', 'versicolor' and 'virginica'.

In [6]:
print("Target names: ", iris_dataset['target_names'])

Target names:  ['setosa' 'versicolor' 'virginica']


In [8]:
# The actual data is stored in the data and target fields.
# data holds the feature values (e.g. sepal length) as a NumPy array, one row per flower.
print(iris_dataset['data'].shape)

(150, 4)


We can see that we have data for 150 iris flowers, with 4 features for each flower.
Let's get the first three rows of data.

In [9]:
print("First three rows of data:\n", iris_dataset['data'][:3])

First three rows of data:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]]


According to this output, we get the following values for the first flower:

sepal length (cm): 5.1
sepal width (cm): 3.5
petal length (cm): 1.4
petal width (cm): 0.2
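
The pairing of feature names and values shown above can also be produced in code. This is a minimal sketch; `iris_demo` and `first_flower` are names chosen for this example only.

```python
from sklearn.datasets import load_iris

iris_demo = load_iris()

# Pair each feature name with the matching value in the first row of data.
first_flower = dict(zip(iris_demo["feature_names"], iris_demo["data"][0]))
for name, value in first_flower.items():
    print(f"{name}: {value}")
```

This works because the columns of data are in the same order as the entries of feature_names.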

In [10]:
# Get the shape of target (the first two elements are printed in the next cell).
print("Shape of target: ", iris_dataset['target'].shape)

Shape of target:  (150,)


In [11]:
print("First two elements in target: ", iris_dataset['target'][:2])

First two elements in target:  [0 0]


We can see that target contains the species of each of the 150 iris flowers in the dataset. The species of the first two flowers is setosa, as 0 corresponds to setosa, 1 to versicolor and 2 to virginica. (How do we know this? Each value in target is the position of the species in target_names, counting from 0.)
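
Because each value in target is a position in target_names, the integer codes can be turned into species names by indexing. The sketch below shows this and also counts the flowers per class; `iris_demo`, `species` and `counts` are names chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris_demo = load_iris()

# Index target_names with the integer codes to get one species name per flower.
species = iris_demo["target_names"][iris_demo["target"]]
print(species[:2])

# Count how many flowers belong to each class code (0, 1, 2).
counts = np.bincount(iris_demo["target"])
for name, count in zip(iris_demo["target_names"], counts):
    print(name, count)
```

The counts match the description in DESCR: 50 flowers in each of the three classes.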

## Part 3: Splitting our dataset into training data and test data
Before using our model on previously unseen iris flowers, we need to know how well it performs. To do this, we split our labelled data into two parts: i) a training dataset that we use for building the model, and ii) a test dataset that we use for measuring the accuracy of our model. We do this with the train_test_split function, which shuffles the dataset randomly and, by default, puts 75% of the cases in the training data and 25% in the test data.

In [13]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

By setting random_state=0 we make sure that, even though the train_test_split function shuffles our dataset randomly, we can reproduce our results, because the random number generator always starts from the same fixed seed (in this case 0).
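
The effect of the seed can be checked directly. The following sketch, which uses its own variable names so that it does not overwrite the split above, compares splits made with the same seed, with a different seed, and with the default test size written out explicitly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_demo = load_iris()
X, y = iris_demo["data"], iris_demo["target"]

# Two splits with the same seed produce identical partitions.
_, X_test_a, _, y_test_a = train_test_split(X, y, random_state=0)
_, X_test_b, _, y_test_b = train_test_split(X, y, random_state=0)
same_split = (np.array_equal(X_test_a, X_test_b)
              and np.array_equal(y_test_a, y_test_b))
print("Same seed, same split:", same_split)

# A different seed generally produces a different partition.
_, X_test_c, _, _ = train_test_split(X, y, random_state=1)
print("Different seed, same split:", np.array_equal(X_test_a, X_test_c))

# The 75/25 default can be made explicit with test_size.
_, X_test_d, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
print("Explicit test_size matches default:", np.array_equal(X_test_a, X_test_d))
```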

In [14]:
print("X_train shape: ", X_train.shape)
print("y_train shape: ", y_train.shape)

X_train shape:  (112, 4)
y_train shape:  (112,)


In [15]:
print("X_test shape: ", X_test.shape)
print("y_test shape: ", y_test.shape)

X_test shape:  (38, 4)
y_test shape:  (38,)
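
A random split does not guarantee that the three species are equally represented in each part. As an optional check, the sketch below counts the classes in both parts and compares the plain split with one that uses the `stratify` parameter of train_test_split. Stratification is not used elsewhere in this notebook; it is shown here only as an illustration, and all variable names are local to this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_demo = load_iris()
X, y = iris_demo["data"], iris_demo["target"]

# Plain random split: class counts in each part depend on the shuffle.
_, _, y_tr, y_te = train_test_split(X, y, random_state=0)
train_counts = np.bincount(y_tr, minlength=3)
test_counts = np.bincount(y_te, minlength=3)
print("Train counts per class:", train_counts)
print("Test counts per class: ", test_counts)

# Stratified split: each part keeps the class proportions of y.
_, _, y_tr_s, y_te_s = train_test_split(X, y, random_state=0, stratify=y)
strat_train_counts = np.bincount(y_tr_s, minlength=3)
strat_test_counts = np.bincount(y_te_s, minlength=3)
print("Stratified train counts:", strat_train_counts)
print("Stratified test counts: ", strat_test_counts)
```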



### Part 4: Creating our first model: K Nearest Neighbours
We will now learn how to build a classification model for the iris dataset with the use of the k nearest neighbours algorithm.

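#### Building the model
To build a k nearest neighbours model, we will use the KNeighborsClassifier class from the sklearn.neighbors module. Setting n_neighbors=1 means that each prediction is based on the single closest flower in the training data.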


In [17]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)

In [18]:
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=1)
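
To make the idea behind the algorithm concrete, here is a minimal sketch of a 1 nearest neighbour prediction written with NumPy only. It assumes Euclidean distance and is not how scikit-learn implements the classifier internally; the function `predict_1nn` and the other names are hypothetical helpers for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo["data"], iris_demo["target"], random_state=0)

def predict_1nn(X_known, y_known, X_query):
    """Label each query row with the label of its closest known row."""
    # Pairwise differences, shape (n_query, n_known, n_features).
    diffs = X_query[:, np.newaxis, :] - X_known[np.newaxis, :, :]
    # Euclidean distances, shape (n_query, n_known).
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Index of the closest known row for every query row.
    nearest = distances.argmin(axis=1)
    return y_known[nearest]

manual_pred = predict_1nn(X_tr, y_tr, X_te)
manual_accuracy = np.mean(manual_pred == y_te)
print("Manual 1-NN accuracy: {:.3f}".format(manual_accuracy))
```

With a larger n_neighbors, the classifier would instead take a vote among the k closest training flowers.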

In [19]:
# Evaluate the model: score returns the accuracy on the test set.
print("Test set score: ", knn.score(X_test, y_test))

Test set score:  0.9736842105263158


In [20]:
# Print the accuracy rounded to three decimal places.
# For two decimal places, change {:.3f} to {:.2f}.

print("Test set score rounded to three decimal places: {:.3f}".format(knn.score(X_test, y_test)))

Test set score rounded to three decimal places: 0.974
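
For a classifier, the score is the fraction of test cases that are predicted correctly. The sketch below recomputes it by hand from the predictions; it rebuilds the split and the model under separate names (`knn_demo`, `X_te`, and so on) so that it can be run on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo["data"], iris_demo["target"], random_state=0)
knn_demo = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

# Accuracy is the fraction of test flowers whose predicted label is correct.
y_pred = knn_demo.predict(X_te)
n_correct = int(np.sum(y_pred == y_te))
manual_accuracy = n_correct / len(y_te)

print("Correct predictions: {} out of {}".format(n_correct, len(y_te)))
print("Manual accuracy: {:.3f}".format(manual_accuracy))
print("knn.score:       {:.3f}".format(knn_demo.score(X_te, y_te)))
```

The two accuracy values should agree, since score computes the same quantity.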


#### Using the model to make predictions
We will now use our model to make a prediction about a previously unseen iris flower case. We will first import the numpy library, then we will specify the previously unseen iris flower case (we'll call it X_new) and finally we will use the predict method on X_new to get the prediction (we'll call the result prediction, but we could use any name we want). Note that X_new is written with double brackets: predict expects a two-dimensional array with one row per flower, even when there is only one flower.

In [32]:
import numpy as np
X_new = np.array([[5.3, 2.7, 1, 0.3]])
prediction = knn.predict(X_new)

print("Prediction label: ", prediction)
print("Predicted target name: ", iris_dataset['target_names'][prediction])

Prediction label:  [0]
Predicted target name:  ['setosa']


According to this output, the prediction for case X_new is setosa.
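
The predict method works on rows of a two-dimensional array, so a single flower stored as a flat array has to be reshaped first, and several flowers can be predicted in one call. The sketch below illustrates both; the second flower is a made-up measurement used only for illustration, and the names are local to this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo["data"], iris_demo["target"], random_state=0)
knn_demo = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)

# A single flower given as a flat array has shape (4,) ...
flat = np.array([5.3, 2.7, 1, 0.3])
# ... so reshape it into one row with four columns before predicting.
one_row = flat.reshape(1, -1)
print("Shape passed to predict:", one_row.shape)

# Several flowers can be predicted at once, one row per flower.
# The second row is a made-up measurement, not taken from the dataset.
several = np.array([[5.3, 2.7, 1, 0.3],
                    [6.5, 3.0, 5.5, 2.0]])
labels = knn_demo.predict(several)
print("Predicted names:", iris_demo["target_names"][labels])
```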
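
### Decision Trees

We now repeat the exercise with a decision tree classifier. We first split the data again, this time with random_state=7, so the training and test sets differ from the ones used above.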

In [33]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=7)

In [34]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=12)
tree.fit(X_train, y_train)

DecisionTreeClassifier(random_state=12)

In [35]:
print("Accuracy on training set: ", tree.score(X_train, y_train))
print("Accuracy on test set: ", tree.score(X_test, y_test))

Accuracy on training set:  1.0
Accuracy on test set:  0.8947368421052632


The decision tree reaches 100% accuracy on the training dataset, but only about 89% on the test dataset. This gap suggests that our decision tree is overfitting the training data.

In order to reduce overfitting (and hopefully improve the accuracy of the model on test data), we can stop before the entire tree is created. We can do this by setting the maximum depth of the tree with the max_depth parameter.

In [36]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=3, random_state=12)
tree.fit(X_train, y_train)

print("Accuracy on training set: ", tree.score(X_train, y_train))
print("Accuracy on test set: ", tree.score(X_test, y_test))

Accuracy on training set:  0.9910714285714286
Accuracy on test set:  0.9210526315789473
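
To see how the depth limit affects the two accuracies, one option is to fit a tree for several values of max_depth and compare the results. This is a small sketch using the same split seed and tree seed as above; the list of depths and the variable names are choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo["data"], iris_demo["target"], random_state=7)

# Fit one tree per depth limit; None means the tree grows without a limit.
results = {}
for depth in [1, 2, 3, 4, None]:
    tree_demo = DecisionTreeClassifier(max_depth=depth, random_state=12)
    tree_demo.fit(X_tr, y_tr)
    results[depth] = (tree_demo.get_depth(),
                      tree_demo.score(X_tr, y_tr),
                      tree_demo.score(X_te, y_te))

for depth, (actual_depth, train_acc, test_acc) in results.items():
    print("max_depth={!s:>4}  depth={}  train={:.3f}  test={:.3f}".format(
        depth, actual_depth, train_acc, test_acc))
```

A single train/test split on 150 flowers gives a noisy estimate, so small differences in test accuracy between depths should be read with caution.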


In [37]:
prediction = tree.predict(X_new)

print("Prediction label: ", prediction)
print("Predicted target name: ", iris_dataset['target_names'][prediction])

Prediction label:  [0]
Predicted target name:  ['setosa']


According to this output, the prediction for case X_new is setosa. This prediction is in line with the prediction that we got using the K Nearest Neighbours classifier.
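
One advantage of a decision tree is that the rules behind a prediction can be inspected. The sketch below prints the fitted tree as text with the export_text function from sklearn.tree; it rebuilds the depth-limited tree under the name `tree_demo` so that it can be run on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris_demo = load_iris()
X_tr, X_te, y_tr, y_te = train_test_split(
    iris_demo["data"], iris_demo["target"], random_state=7)
tree_demo = DecisionTreeClassifier(max_depth=3, random_state=12).fit(X_tr, y_tr)

# Render the fitted tree as indented if/else rules on the feature values.
rules = export_text(tree_demo, feature_names=list(iris_demo["feature_names"]))
print(rules)
```

Following the rules from the top with the values in X_new shows which leaf, and therefore which class, the flower ends up in.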