# Iris Classification
Submitted by: Mihir Kulkarni

# Problem Statement
The iris flower has three species (setosa, versicolor, and virginica), which differ in their
measurements. Given the measurements of iris flowers labelled by species, the task is to
train a machine learning model that learns from these measurements and classifies the
species of a flower.



# Importing necessary libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

# Loading the dataset

In [2]:
data=pd.read_csv('iris.csv')
data.head()

   Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0   1            5.1           3.5            1.4           0.2  Iris-setosa
1   2            4.9           3.0            1.4           0.2  Iris-setosa
2   3            4.7           3.2            1.3           0.2  Iris-setosa
3   4            4.6           3.1            1.5           0.2  Iris-setosa
4   5            5.0           3.6            1.4           0.2  Iris-setosa


# Analysis of data

In [3]:
# 150 rows and 6 columns (Id, four measurements, and the Species label)
data.shape

(150, 6)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [5]:
data.describe()

              Id  SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count      150.0          150.0         150.0          150.0         150.0
mean        75.5       5.843333         3.054       3.758667      1.198667
std    43.445368       0.828066      0.433594        1.76442      0.763161
min          1.0            4.3           2.0            1.0           0.1
25%        38.25            5.1           2.8            1.6           0.3
50%         75.5            5.8           3.0           4.35           1.3
75%       112.75            6.4           3.3            5.1           1.8
max        150.0            7.9           4.4            6.9           2.5


In [6]:
data['Species'].value_counts()

Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: Species, dtype: int64

In [7]:
data.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

# Choosing the algorithm for model
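From the analysis above, this is a classification problem on a small dataset with 150 examples and 6 columns.
The columns are an 'Id', four numeric measurements, and one categorical target ('Species'), and there are no null values.
The dataset is balanced, with 50 examples of each species. The summary statistics alone do not establish that the measurements are normally distributed (for example, PetalLengthCm has a mean of 3.76 but a median of 4.35), but logistic regression does not require normally distributed features.
Hence logistic regression is a reasonable algorithm to train the model and get the necessary results.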
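
The balance and distribution observations above can be checked numerically rather than read off by eye. The following is a minimal sketch, assuming scikit-learn's bundled copy of the iris data (`load_iris`, whose column names differ from those in `iris.csv`) as a stand-in for the CSV file. The names `class_counts` and `summary` are illustrative.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
iris = load_iris(as_frame=True)
features = iris.data      # four numeric measurement columns
target = iris.target      # integer species codes 0, 1, 2

# Class balance: number of examples per species code
class_counts = target.value_counts().sort_index()

# Rough symmetry check per feature: a skewness near 0 and a mean close to
# the median are consistent with, but do not prove, a normal distribution
summary = features.agg(["mean", "median", "skew"]).T

print(class_counts)
print(summary)
```

A histogram per feature (for example with `features.hist()`) would give a more direct view of the shape of each distribution.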

# Deleting irrelevant features
The column 'Id' is only a row identifier and carries no information about which species a flower is. We drop it, because keeping it could lead the model to learn spurious patterns from the row order instead of from the measurements, and such patterns would not generalize to new flowers.

In [8]:
data.drop(columns='Id',inplace=True)
data.head()

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0            5.1           3.5            1.4           0.2  Iris-setosa
1            4.9           3.0            1.4           0.2  Iris-setosa
2            4.7           3.2            1.3           0.2  Iris-setosa
3            4.6           3.1            1.5           0.2  Iris-setosa
4            5.0           3.6            1.4           0.2  Iris-setosa


# Transforming categorical features into numeric
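The four input features are already numeric; only the target 'Species' is categorical. scikit-learn's LogisticRegression also accepts string class labels, so this step is optional, but integer labels are convenient for the later steps. We convert 'Species' into integer codes using the Label Encoding technique with scikit-learn's LabelEncoder, which assigns codes in sorted order of the class names (here Iris-setosa=0, Iris-versicolor=1, Iris-virginica=2).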

In [9]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

data['Species']=label_encoder.fit_transform(data['Species'])

In [10]:
data.head()

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm  Species
0            5.1           3.5            1.4           0.2        0
1            4.9           3.0            1.4           0.2        0
2            4.7           3.2            1.3           0.2        0
3            4.6           3.1            1.5           0.2        0
4            5.0           3.6            1.4           0.2        0
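
After encoding, the species names are no longer visible in the DataFrame, so it helps to know how to recover them. The sketch below uses a small hand-written list of labels (an assumption for illustration, not the full column) to show the mapping stored in `classes_` and how `inverse_transform` turns codes back into names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels, standing in for data['Species']
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the sorted class names; the position of a name is its code
mapping = {name: code for code, name in enumerate(encoder.classes_)}

# inverse_transform maps integer codes back to the original names
decoded = encoder.inverse_transform(codes)

print(mapping)
print(list(decoded))
```

Keeping the fitted encoder around makes it possible to report predictions as species names instead of integer codes.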


# Separating independent and target variables

In [11]:
x=data.drop(columns='Species')
y=data['Species']
x.head()

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
0            5.1           3.5            1.4           0.2
1            4.9           3.0            1.4           0.2
2            4.7           3.2            1.3           0.2
3            4.6           3.1            1.5           0.2
4            5.0           3.6            1.4           0.2


# Splitting the dataset

In [12]:
from sklearn.model_selection import train_test_split as tts

xtrain,xtest,ytrain,ytest=tts(x,y,test_size=0.3,random_state=42)
xtrain.shape,xtest.shape,ytrain.shape,ytest.shape

((105, 4), (45, 4), (105,), (45,))
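
A plain random split does not guarantee that each species appears in the test set in the same proportion as in the full data. One option is the `stratify` parameter of `train_test_split`. The sketch below assumes scikit-learn's bundled iris data in place of `iris.csv`, with the same `test_size` and `random_state` as above; the variable names are illustrative.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions equal in the train and test parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

test_counts = Counter(y_test.tolist())
print(X_train.shape, X_test.shape)
print(test_counts)
```

With 50 examples per class and a 30% test share, stratification places 15 examples of each species in the test set.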

# Training the model

In [13]:
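from sklearn.linear_model import LogisticRegression

# Fit a logistic regression classifier with default settings on the training split
lr=LogisticRegression()
lr.fit(xtrain,ytrain)

# Predict the encoded species for the held-out test split
y_pred=lr.predict(xtest)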
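
Because `warnings.filterwarnings('ignore')` is set at the top of the notebook, a convergence warning from the solver would not be shown if one occurred. The following is a minimal sketch of one way to make convergence explicit. It assumes scikit-learn's bundled iris data in place of `iris.csv`; the feature scaling step, the value `max_iter=1000`, and the names `model` and `n_iter` are illustrative choices, not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Standardizing the features usually helps the solver converge quickly
max_iter = 1000
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=max_iter))
model.fit(X_train, y_train)

# n_iter_ reports how many iterations the solver actually used
n_iter = int(model.named_steps["logisticregression"].n_iter_.max())
converged = n_iter < max_iter
test_accuracy = model.score(X_test, y_test)

print(n_iter, converged, test_accuracy)
```

If `n_iter` equals `max_iter`, the solver stopped before converging and the fitted coefficients should be treated with caution.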

# Performance Metrics

In [14]:
accuracy=lr.score(xtest,ytest)
accuracy

1.0

In [15]:
from sklearn.metrics import accuracy_score,f1_score
# accuracy_score expects the true labels first, then the predictions
a=accuracy_score(ytest,y_pred)
a

1.0

In [16]:
from sklearn.metrics import confusion_matrix
confusion_matrix(ytest,y_pred)

array([[19,  0,  0],
       [ 0, 13,  0],
       [ 0,  0, 13]], dtype=int64)

In [17]:
f=f1_score(ytest,y_pred,average='weighted')
f

1.0
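
The accuracy and weighted F1 score can also be derived by hand from the confusion matrix, which makes clear what each number measures. This sketch copies the matrix from the output above and uses only NumPy; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[19, 0, 0],
               [0, 13, 0],
               [0, 0, 13]])

support = cm.sum(axis=1)               # number of true examples per class
accuracy = np.trace(cm) / cm.sum()     # correct predictions / all predictions

precision = np.diag(cm) / cm.sum(axis=0)   # per class: correct / predicted as class
recall = np.diag(cm) / support             # per class: correct / truly in class
f1_per_class = 2 * precision * recall / (precision + recall)

# average='weighted' weights each class's F1 by its support
weighted_f1 = float(np.sum(f1_per_class * support) / support.sum())

print(accuracy, weighted_f1)
```

Since every off-diagonal entry is zero, precision and recall are 1 for each class, so both summary metrics equal 1.0 for this split.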

# Result
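As we can see, the model classifies all 45 test examples correctly, giving an accuracy and weighted f1_score of 1.0 on the test set. This suggests that the three species are well separated by the four measurements. However, a perfect score on a single small test split does not guarantee that the model will predict the species of every new flower correctly.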
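
A score from one 45-example test split can vary with the choice of split, so cross-validation gives a steadier estimate. The following is a minimal sketch, assuming scikit-learn's bundled iris data in place of `iris.csv`, 5 stratified folds, and `max_iter=1000`; these settings are illustrative and not taken from the notebook above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: scikit-learn's bundled iris data stands in for iris.csv
X, y = load_iris(return_X_y=True)

# Five stratified folds: each fold keeps the 1:1:1 class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold, each from a model trained on the other four folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

mean_score = scores.mean()
std_score = scores.std()
print(scores, mean_score, std_score)
```

The mean of the fold scores estimates the expected accuracy, and their spread indicates how much the result depends on which examples land in the test part.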