# Building a Classification Model for the Iris data set

Chanin Nantasenamat

<i>Data Professor YouTube channel, http://youtube.com/dataprofessor </i>

In this Jupyter notebook, we will be building a classification model for the Iris data set using the random forest algorithm.

## 1. Import libraries

In [3]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

## 2. Load the *iris* data set

In [9]:
iris = datasets.load_iris()
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :
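
The summary statistics in the description can be recomputed directly from `iris.data`. The snippet below is a minimal sketch using NumPy; the variable names (`col_min`, `col_max`, `col_mean`, `col_sd`) are illustrative and not part of the original notebook, and the recomputed means and standard deviations may differ from the printed table in the last digit.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data

# Column-wise statistics: one value per feature
col_min = X.min(axis=0)
col_max = X.max(axis=0)
col_mean = X.mean(axis=0)
col_sd = X.std(axis=0, ddof=1)  # sample standard deviation

for name, lo, hi, mean, sd in zip(iris.feature_names, col_min, col_max, col_mean, col_sd):
    print(f"{name:<20} {lo:4.1f} {hi:4.1f} {mean:5.2f} {sd:5.2f}")
```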

In [7]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


## 3. Input features and output variable
The ***iris*** data set contains 4 input features and 1 output variable (the class label).

### 3.1. Input features

In [10]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


### 3.2. Output variable (the class label)

In [11]:
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


In [12]:
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
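
The integer codes in `iris.target` index into `iris.target_names`, so they can be decoded back to species names and counted per class. A minimal sketch (the names `species` and `class_counts` are illustrative, not from the original notebook):

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Index the array of class names with the integer codes to decode them
species = iris.target_names[iris.target]

# Count how many samples fall into each class
class_counts = dict(zip(iris.target_names.tolist(), np.bincount(iris.target).tolist()))

print(species[:3], species[-3:])
print(class_counts)
```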


## 4. Glimpse of the data

### 4.1. Input features

In [13]:
iris.data

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

### 4.2. Output variable (the class label)

In [14]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

### 4.3. Assigning *input* and *output* variables
Let's assign the 4 input variables to X and the output variable (class label) to Y.

In [19]:
X = iris.data
Y = iris.target

### 4.4. Examining the data dimensions

In [16]:
X.shape
# This corresponds to 150 flowers and 4 features

(150, 4)

In [20]:
Y.shape

(150,)

## 5. Build Classification Model using Random Forest

In [22]:
clf = RandomForestClassifier()

In [23]:
clf.fit(X, Y)
# X = input features, Y = class label

RandomForestClassifier()
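
A random forest is built from randomized trees, so two fits of `RandomForestClassifier()` generally give slightly different models. One possible way to make the result repeatable is to fix `random_state`; the value `42` and the explicit `n_estimators=100` below are assumptions chosen for illustration, not settings used in the notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# Two forests with the same seed are trained on identical bootstrap samples
clf_a = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, Y)
clf_b = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, Y)

same = np.allclose(clf_a.feature_importances_, clf_b.feature_importances_)
print(same)
```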

## 6. Feature Importance

In [28]:
print(iris.feature_names)
print(clf.feature_importances_)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[0.09201904 0.01891373 0.47707799 0.41198924]
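
The importances are easier to read when paired with the feature names and sorted. The following is a small sketch under the assumption of a fixed seed (`random_state=0`, not used in the notebook), so the exact numbers will differ from the output above.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
clf = RandomForestClassifier(random_state=0).fit(iris.data, iris.target)

# Sort feature indices from most to least important
order = np.argsort(clf.feature_importances_)[::-1]
ranked = [(iris.feature_names[i], float(clf.feature_importances_[i])) for i in order]

for name, score in ranked:
    print(f"{name:<20} {score:.3f}")
```

The importances are normalized, so the four values sum to 1.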


## 7. Make Prediction

In [29]:
X[1]

array([4.9, 3. , 1.4, 0.2])

In [30]:
X[0]

array([5.1, 3.5, 1.4, 0.2])

In [31]:
print(clf.predict([[5.1, 3.5, 1.4, 0.2]]))
# 0 means the first species of flower: setosa

[0]


In [35]:
print(clf.predict(X[[28]]))

[0]


In [38]:
print(clf.predict_proba(X[[75]]))
# Class probabilities (setosa, versicolor, virginica) for the flower at index 75, based on all of its features

[[0. 1. 0.]]
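
Each column of the `predict_proba` output corresponds to one entry of `clf.classes_`, and `predict` returns the class with the highest probability. A minimal sketch of that relationship (the seed and the names `proba` and `best` are illustrative assumptions):

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
X, Y = iris.data, iris.target
clf = RandomForestClassifier(random_state=0).fit(X, Y)

proba = clf.predict_proba(X[[75]])              # shape (1, 3): one row per sample
best = clf.classes_[np.argmax(proba, axis=1)]   # class with the highest probability

print(clf.classes_)
print(proba, best, clf.predict(X[[75]]))
```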


In [39]:
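# Refit on the full data set using species names as labels, so predictions return names instead of 0/1/2
clf.fit(iris.data, iris.target_names[iris.target])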

RandomForestClassifier()
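
When the classifier is fitted with species names as labels, `clf.classes_` holds strings and `predict` returns names rather than integer codes. A short sketch, assuming a fixed seed that the notebook does not set:

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()

# Use the decoded species names as the target instead of the integer codes
clf = RandomForestClassifier(random_state=0)
clf.fit(iris.data, iris.target_names[iris.target])

print(clf.classes_)
print(clf.predict(iris.data[[0]]))
```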

## 8. Data split (80/20 ratio)

In [42]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2)

In [43]:
X_train.shape, Y_train.shape

((120, 4), (120,))

In [19]:
X_test.shape, Y_test.shape
# For X_test: 30 flowers and 4 features; for Y_test: 30 flowers and 1 class label (= flower species)

((30, 4), (30,))
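
The split above is random, so the rows in each subset change from run to run and the classes are not guaranteed to be equally represented. One option, sketched here, is to pass `stratify` and `random_state` to `train_test_split`; both arguments are additions for illustration and are not used in the notebook.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# stratify=Y keeps the 1:1:1 class ratio in both subsets;
# random_state makes the split repeatable
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=42
)

print(np.bincount(Y_train), np.bincount(Y_test))
```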

## 9. Rebuild the Random Forest Model (using the 80/20 split)

In [51]:
clf.fit(X_train, Y_train)

RandomForestClassifier()

### 9.1. Perform prediction on a single sample from the data set

In [46]:
print(clf.predict([X[0]]))

[0]


In [47]:
print(clf.predict_proba([[5.1, 3.5, 1.4, 0.2]]))

[[1. 0. 0.]]


### 9.2. Perform prediction on the test set

#### *Predicted class labels*

In [52]:
print(clf.predict(X_test))
# Instead of checking one flower, we'll use the whole test data set as input

[1 1 0 2 0 1 1 0 0 0 0 2 1 1 2 2 2 0 1 1 0 1 2 0 0 1 2 1 2 1]


In [49]:
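# Refits on all 150 samples with species names as labels; rerun clf.fit(X_train, Y_train) before scoring against Y_test
clf.fit(iris.data, iris.target_names[iris.target])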

RandomForestClassifier()

#### *Actual class labels*

In [53]:
print(Y_test)

[1 1 0 1 0 1 1 0 0 0 0 2 1 1 2 2 2 0 1 2 0 1 2 0 0 1 2 1 2 1]
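
Comparing the two printed arrays element by element shows which test samples were misclassified. The sketch below copies the predicted and actual labels from the outputs above into NumPy arrays; the variable names are illustrative.

```python
import numpy as np

# Labels copied from the notebook outputs above
predicted = np.array([1, 1, 0, 2, 0, 1, 1, 0, 0, 0, 0, 2, 1, 1, 2,
                      2, 2, 0, 1, 1, 0, 1, 2, 0, 0, 1, 2, 1, 2, 1])
actual = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 2, 1, 1, 2,
                   2, 2, 0, 1, 2, 0, 1, 2, 0, 0, 1, 2, 1, 2, 1])

# Positions where the prediction disagrees with the true label
mismatches = np.flatnonzero(predicted != actual)

# Fraction of correct predictions
accuracy = np.mean(predicted == actual)

print(mismatches)  # [ 3 19]
print(accuracy)    # 28 of 30 correct
```

Two of the 30 test samples are misclassified (a versicolor predicted as virginica and a virginica predicted as versicolor), which gives 28/30 correct.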


## 10. Model Performance

In [54]:
print(clf.score(X_test, Y_test))

0.9333333333333333
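
For a classifier, `clf.score` returns the mean accuracy on the given data. A confusion matrix breaks that single number down by class. The following is a minimal sketch using `sklearn.metrics`; the seeds are assumptions added for repeatability, so the numbers will not necessarily match the output above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)
clf = RandomForestClassifier(random_state=0).fit(X_train, Y_train)

Y_pred = clf.predict(X_test)
accuracy = accuracy_score(Y_test, Y_pred)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(Y_test, Y_pred, labels=[0, 1, 2])

print(accuracy)
print(cm)
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total number of test samples equals the accuracy.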
