# Iris Flower Species Detection Model

The iris flower has three species (setosa, versicolor, and virginica), which differ in their sepal and petal measurements.

This notebook trains a RandomForestClassifier model that predicts the species of a flower from four features: SepalLength (cm), SepalWidth (cm), PetalLength (cm), and PetalWidth (cm).

# 1. Load Python Packages

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# 2. Pre-process the data

In [2]:
iris = pd.read_csv("Iris.csv")
iris = iris.drop("Id", axis=1)

Convert the species names into integer codes so the target column is numeric. Each species is assigned the following value:

Iris-setosa : 1

Iris-versicolor : 2

Iris-virginica : 3

In [3]:
iris.Species.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [4]:
# Map each species name to its integer code; assigning the result back avoids
# relying on an in-place replace through attribute access
iris["Species"] = iris["Species"].map(
    {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}
)
iris.Species.unique()

array([1, 2, 3], dtype=int64)
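The encoding above is easier to reverse later if the mapping lives in one dictionary. The following is a minimal sketch of that idea; the names `SPECIES_TO_CODE` and `CODE_TO_SPECIES` are hypothetical and do not appear in the notebook, and the three-row series is made-up illustration data rather than the contents of Iris.csv.

```python
import pandas as pd

# Hypothetical helper dictionaries (not part of the original notebook).
SPECIES_TO_CODE = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}
CODE_TO_SPECIES = {code: name for name, code in SPECIES_TO_CODE.items()}

# Tiny illustrative series standing in for the Species column.
species = pd.Series(["Iris-setosa", "Iris-virginica", "Iris-versicolor"])

codes = species.map(SPECIES_TO_CODE)    # names -> integer codes
decoded = codes.map(CODE_TO_SPECIES)    # integer codes -> names

print(codes.tolist())
print(decoded.tolist())
```

The reverse dictionary can be applied to model predictions to turn a code such as 3 back into a readable species name.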

In [5]:
iris.head(10)

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0            5.1           3.5            1.4           0.2        1
1            4.9           3.0            1.4           0.2        1
2            4.7           3.2            1.3           0.2        1
3            4.6           3.1            1.5           0.2        1
4            5.0           3.6            1.4           0.2        1
5            5.4           3.9            1.7           0.4        1
6            4.6           3.4            1.4           0.3        1
7            5.0           3.4            1.5           0.2        1
8            4.4           2.9            1.4           0.2        1
9            4.9           3.1            1.5           0.1        1


In [6]:
iris.dtypes

SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species            int64
dtype: object

In [7]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  150 non-null    float64
 1   SepalWidthCm   150 non-null    float64
 2   PetalLengthCm  150 non-null    float64
 3   PetalWidthCm   150 non-null    float64
 4   Species        150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


# 3. Subset the data

Separate the input attributes (x) from the target values (y).

In [8]:
x = iris.drop("Species",axis=1)
y = iris.Species

# 4. Split data into train and test data

Split the data in a 4:1 ratio: 80% for training and 20% for testing.

X_train : Training input attributes.

X_test : Testing input attributes.

Y_train : Training target attributes.

Y_test : Testing target attributes.

In [9]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size=0.2)

In [10]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((120, 4), (30, 4), (120,), (30,))
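The split above is random, so the rows that land in each set (and every score that follows) change from run to run. A minimal sketch of a reproducible, class-balanced variant is shown here. It assumes scikit-learn's bundled iris data as a stand-in for Iris.csv (its target codes are 0, 1, 2 rather than 1, 2, 3), and the seed value 42 is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled iris dataset replaces Iris.csv for this sketch.
features, target = load_iris(return_X_y=True, as_frame=True)

# random_state fixes the shuffle; stratify keeps the class proportions
# identical in the training and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().sort_index().tolist())
```

With 50 flowers per species, stratifying should place 10 of each species in the 30-row test set, whereas the unstratified split in this notebook produced 7, 10, and 13.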

In [11]:
X_train

     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
140            6.7           3.1            5.6           2.4
38             4.4           3.0            1.3           0.2
27             5.2           3.5            1.5           0.2
97             6.2           2.9            4.3           1.3
118            7.7           2.6            6.9           2.3
...            ...           ...            ...           ...
123            6.3           2.7            4.9           1.8
1              4.9           3.0            1.4           0.2
44             5.1           3.8            1.9           0.4
126            6.2           2.8            4.8           1.8


In [12]:
X_test

     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
139            6.9           3.1            5.4           2.1
98             5.1           2.5            3.0           1.1
131            7.9           3.8            6.4           2.0
13             4.3           3.0            1.1           0.1
29             4.7           3.2            1.6           0.2
149            5.9           3.0            5.1           1.8
103            6.3           2.9            5.6           1.8
21             5.1           3.7            1.5           0.4
59             5.2           2.7            3.9           1.4
74             6.4           2.9            4.3           1.3


In [13]:
Y_train

140    3
38     1
27     1
97     2
118    3
      ..
123    3
1      1
44     1
126    3
86     2
Name: Species, Length: 120, dtype: int64

In [14]:
Y_test

139    3
98     2
131    3
13     1
29     1
149    3
103    3
21     1
59     2
74     2
3      1
143    3
76     2
92     2
137    3
127    3
144    3
11     1
105    3
20     1
4      1
107    3
85     2
95     2
141    3
88     2
113    3
54     2
55     2
106    3
Name: Species, dtype: int64

# 5. Build Random Forest Classifier

Build a RandomForestClassifier model and train it on the training data.

In [15]:
clf = RandomForestClassifier(n_estimators=100)

In [16]:
clf.fit(X_train, Y_train)

RandomForestClassifier()

# 6. Prediction

Predict the species of the test data with the trained model.

In [17]:
y_label = clf.predict(X_test)

In [18]:
y_label

array([3, 2, 3, 1, 1, 3, 3, 1, 2, 2, 1, 3, 2, 2, 3, 3, 3, 1, 3, 1, 1, 3,
       2, 2, 3, 2, 3, 2, 2, 2], dtype=int64)

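Check the accuracy of the fitted model on the data it was trained on. A random forest usually scores at or near 1.0 here, so this figure says little about how well the model generalizes.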

In [19]:
clf.score(X_train, Y_train)

1.0
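A perfect training score is expected for a random forest and can be contrasted with an out-of-bag (OOB) estimate, which scores each sample using only the trees that did not see it during bootstrapping. The sketch below assumes scikit-learn's bundled iris data in place of Iris.csv; the variable names and the seed are illustrative choices, not part of the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: the bundled iris dataset replaces Iris.csv for this sketch.
X_all, y_all = load_iris(return_X_y=True)

# oob_score=True asks the forest to evaluate each sample with the trees
# whose bootstrap sample did not include it.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
forest.fit(X_all, y_all)

train_acc = forest.score(X_all, y_all)  # accuracy on data the forest has seen
oob_acc = forest.oob_score_             # accuracy on held-out bootstrap samples

print(f"training accuracy: {train_acc:.3f}")
print(f"out-of-bag accuracy: {oob_acc:.3f}")
```

The OOB figure is typically lower than the training accuracy and closer to what a separate test set reports.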

# 7. Check the Accuracy of the Model

In [20]:
clf.score(X_test,Y_test)

0.9666666666666667

In [21]:
print(classification_report(Y_test, y_label))

              precision    recall  f1-score   support

           1       1.00      1.00      1.00         7
           2       0.91      1.00      0.95        10
           3       1.00      0.92      0.96        13

    accuracy                           0.97        30
   macro avg       0.97      0.97      0.97        30
weighted avg       0.97      0.97      0.97        30



In [22]:
confusion_matrix(Y_test, y_label)

array([[ 7,  0,  0],
       [ 0, 10,  0],
       [ 0,  1, 12]], dtype=int64)

In [23]:
accuracy_score(Y_test, y_label)

0.9666666666666667
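The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when reading it: rows are the true classes and columns are the predicted classes. The following sketch copies the matrix printed above and derives precision, recall, and accuracy from it; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true classes 1-3, columns: predicted classes 1-3).
cm = np.array([[7, 0, 0],
               [0, 10, 0],
               [0, 1, 12]])

true_pos = np.diag(cm)                   # correctly classified count per class
precision = true_pos / cm.sum(axis=0)    # divide by column sums (predicted counts)
recall = true_pos / cm.sum(axis=1)       # divide by row sums (actual counts)
accuracy = true_pos.sum() / cm.sum()     # correct predictions over all samples

print("precision:", np.round(precision, 2))
print("recall:   ", np.round(recall, 2))
print("accuracy: ", round(float(accuracy), 4))
```

The single virginica flower predicted as versicolor is what lowers the precision of class 2 to 10/11 and the recall of class 3 to 12/13.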

# 8. Check feature importance

In [24]:
list(zip(X_train, clf.feature_importances_))

[('SepalLengthCm', 0.09187034299353417),
 ('SepalWidthCm', 0.023652846311441995),
 ('PetalLengthCm', 0.42882518844985534),
 ('PetalWidthCm', 0.4556516222451685)]

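# 9. Estimator Selection

Choose the number of trees by varying n_estimators. The loop below refits the model with 10 to 100 trees in steps of 10 and reports the test accuracy for each setting. Because neither the split nor the forest sets random_state, the exact figures can vary between runs.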

In [25]:
for i in range(10,110,10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i)
    clf.fit(X_train, Y_train)
    print(f"The model accuracy on test set: {clf.score(X_test, Y_test)*100:.2f}%")
    print("")

Trying model with 10 estimators...
The model accuracy on test set: 96.67%

Trying model with 20 estimators...
The model accuracy on test set: 96.67%

Trying model with 30 estimators...
The model accuracy on test set: 96.67%

Trying model with 40 estimators...
The model accuracy on test set: 96.67%

Trying model with 50 estimators...
The model accuracy on test set: 96.67%

Trying model with 60 estimators...
The model accuracy on test set: 96.67%

Trying model with 70 estimators...
The model accuracy on test set: 96.67%

Trying model with 80 estimators...
The model accuracy on test set: 96.67%

Trying model with 90 estimators...
The model accuracy on test set: 96.67%

Trying model with 100 estimators...
The model accuracy on test set: 96.67%
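Every setting scores the same on the 30-row test set, so a single split cannot separate them. One alternative, sketched here under stated assumptions, is to compare settings by cross-validation, which averages accuracy over several splits. The sketch uses scikit-learn's bundled iris data in place of Iris.csv, a reduced grid of three values, and an arbitrary seed; `cv_means` is a hypothetical name.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumption: the bundled iris dataset replaces Iris.csv for this sketch.
X_all, y_all = load_iris(return_X_y=True)

cv_means = {}
for n in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    # 5-fold cross-validation: five train/validation splits per setting.
    scores = cross_val_score(model, X_all, y_all, cv=5)
    cv_means[n] = scores.mean()
    print(f"{n:>3} estimators: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

On a dataset this small the differences between settings are often within the fold-to-fold spread, in which case a smaller forest is a reasonable choice.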



In [26]:
clf = RandomForestClassifier(n_estimators=50).fit(X_train, Y_train)

# 10. Save the model

In [27]:
import pickle
with open("iris_prediction.pkl", "wb") as f:  # closes the file after writing
    pickle.dump(clf, f)

# 11. Load the saved model

In [28]:
with open("iris_prediction.pkl", "rb") as f:  # closes the file after reading
    loaded_model = pickle.load(f)
loaded_model.score(X_test, Y_test)

0.9666666666666667
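Matching test accuracy suggests the model survived the round trip; a stricter check is that the restored model returns exactly the same predictions. The sketch below shows such a check under a few assumptions: it uses scikit-learn's bundled iris data in place of Iris.csv, writes to a temporary directory so nothing is left on disk, and uses illustrative variable names.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: the bundled iris dataset replaces Iris.csv for this sketch.
X_all, y_all = load_iris(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_all, y_all)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_prediction.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)        # serialize the fitted model
    with open(path, "rb") as f:
        restored = pickle.load(f)    # deserialize it again

# The restored model should reproduce every prediction of the original.
same_predictions = np.array_equal(model.predict(X_all), restored.predict(X_all))
print("identical predictions:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and a pickled model is best loaded with the same scikit-learn version that saved it.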

# 12. Testing with random input attributes

In [30]:
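# Pass a DataFrame with the training column names so the input matches the fitted features
loaded_model.predict(pd.DataFrame([[4.5, 3.5, 8.2, 1.5]], columns=x.columns))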



array([3], dtype=int64)
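A single predicted label hides how confident the forest is, and the petal length of 8.2 cm in this sample is larger than any value in the iris data. A minimal sketch of both checks follows, assuming scikit-learn's bundled iris data in place of Iris.csv (its column names and 0, 1, 2 target codes differ from the notebook's); `sample`, `proba`, and `in_range` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: the bundled iris dataset replaces Iris.csv for this sketch.
features, target = load_iris(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(features, target)

# Same measurements as the sample above, with matching column names.
sample = pd.DataFrame([[4.5, 3.5, 8.2, 1.5]], columns=features.columns)

# Fraction of trees voting for each class (columns follow model.classes_).
proba = model.predict_proba(sample)
print(dict(zip(model.classes_, proba[0].round(2))))

# Flag features that fall outside the range seen during training.
in_range = (sample.iloc[0] >= features.min()) & (sample.iloc[0] <= features.max())
print(in_range)
```

Tree-based models cannot extrapolate beyond the training range, so a prediction for an out-of-range sample like this one is best treated with caution.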