# IRIS Classification model

<b>Objective:</b>
To evaluate and compare the performance of four classification models—Logistic Regression, Gaussian Naive Bayes, Random Forest, and Support Vector Classifier (SVC)—on the Iris dataset, which contains features of sepal and petal dimensions for three species of Iris flowers.

Models and Performance:

* Logistic Regression: Achieved perfect classification on the 30-sample test split, with 100% precision, recall, F1-score, and accuracy, which suggests the classes are close to linearly separable in these features.

* Gaussian Naive Bayes (GaussianNB): Also reached perfect scores, demonstrating that the continuous numeric features align well with its probabilistic assumptions.

* Random Forest: Delivered flawless predictions, showing the ensemble of decision trees accurately modeled the class boundaries.

* Support Vector Classifier (SVC): Achieved 100% accuracy as well, highlighting its ability to find optimal hyperplanes separating the three classes.

In [2]:
import numpy as np
import pandas as pd

In [3]:
df=pd.read_csv(r"C:\Users\bmsha\Documents\CodSoft\IRISClassification\dataset\IRIS.csv")
df

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [5]:
df.isna().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64

In [7]:
df.duplicated().sum()

3

In [9]:
df[df.duplicated()]

,sepal_length,sepal_width,petal_length,petal_width,species
34,4.9,3.1,1.5,0.1,Iris-setosa
37,4.9,3.1,1.5,0.1,Iris-setosa
142,5.8,2.7,5.1,1.9,Iris-virginica
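The notebook reports the duplicated rows but keeps them in the data. If one wanted to remove them before modelling, a minimal sketch could look like the following. The `demo` frame is a hypothetical stand-in built from the values listed above, not the real CSV.

```python
import pandas as pd

# Hypothetical stand-in frame that mimics the duplicated rows listed above
demo = pd.DataFrame(
    {
        "sepal_length": [4.9, 4.9, 4.9, 5.8],
        "sepal_width": [3.1, 3.1, 3.1, 2.7],
        "petal_length": [1.5, 1.5, 1.5, 5.1],
        "petal_width": [0.1, 0.1, 0.1, 1.9],
        "species": ["Iris-setosa"] * 3 + ["Iris-virginica"],
    }
)

# duplicated() flags every repeat after the first occurrence of a row
n_dupes = int(demo.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row by default
deduped = demo.drop_duplicates().reset_index(drop=True)

print(n_dupes, deduped.shape)
```

Whether to drop them is a judgment call: identical measurements can come from different flowers, so keeping the rows is also defensible.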


In [10]:
df.describe()

,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.054,3.758667,1.198667
std,0.828066,0.433594,1.76442,0.763161
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [11]:
df['species'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

### encoding for model building

In [12]:
from sklearn.preprocessing import LabelEncoder

In [13]:
le=LabelEncoder()

In [14]:
df['species']=le.fit_transform(df['species'])
# enumerate yields (integer code, class name) pairs in the encoder's sorted order
for code,name in enumerate(le.classes_):
    print(code,":",name)

0 : Iris-setosa
1 : Iris-versicolor
2 : Iris-virginica
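Since the models below predict integer codes, it can help to map predictions back to species names. A small sketch of that round trip, using a hypothetical four-item list rather than the notebook's `df`:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of species labels (not the full dataset)
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)     # classes are sorted, then numbered from 0
names = encoder.inverse_transform(codes)   # integer codes back to the original strings

# Explicit code -> name lookup table built from the fitted encoder
mapping = {code: str(name) for code, name in enumerate(encoder.classes_)}

print(list(codes), mapping)
```

The same `inverse_transform` call could be applied to any model's predicted codes, provided the fitted encoder is kept around.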


In [15]:
df.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


### Train-test split

In [16]:
from sklearn.model_selection import train_test_split

In [20]:
fet=['sepal_length','sepal_width','petal_length','petal_width']
x=df[fet]
y=df['species']
xtrain,xtest,ytrain,ytest=train_test_split(x,y,random_state=42,test_size=0.2)


In [22]:
xtrain.shape

(120, 4)

In [23]:
xtest.shape

(30, 4)
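The split above is purely random, so the three classes need not be equally represented in the test set. One option is a stratified split, sketched below. It assumes scikit-learn's bundled `load_iris` copy of the data instead of the notebook's CSV, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the notebook's CSV file
X, y = load_iris(return_X_y=True)

# stratify=y preserves the class proportions in both partitions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Number of test samples per class
test_counts = np.bincount(y_test)
print(X_train.shape, X_test.shape, test_counts)
```

With 50 samples per class and a 20% test fraction, each class contributes 10 test samples.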

### Model building

1) <b>Logistic Regression</b>

In [32]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import joblib

In [28]:
lr=LogisticRegression()
lr.fit(xtrain,ytrain)

In [31]:
ypred=lr.predict(xtest)
print(classification_report(ytest,ypred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



<h3 style='background-color:cyan'>The logistic regression model classified all 30 test samples correctly, with precision, recall, and F1-score of 1.00 for every class, so no class was favored at the expense of another. A perfect score like this suggests that the dataset is highly separable or limited in complexity, and the test set is small. The model should therefore be validated on a larger or unseen dataset before concluding that the result is not an artifact of overfitting or of this particular split.</h3>
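One way to check whether the score depends on this particular split is k-fold cross-validation. The sketch below is an assumption-laden illustration: it uses scikit-learn's bundled `load_iris` data rather than the CSV, and `max_iter=1000` is chosen only to avoid convergence warnings.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumption: the bundled Iris data stands in for the notebook's CSV file
X, y = load_iris(return_X_y=True)

# Five stratified folds, shuffled with a fixed seed for repeatability
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

# One accuracy value per fold, then their mean and spread
print(scores)
print(scores.mean(), scores.std())
```

The spread across folds gives a rough sense of how much a single 30-sample test score can vary.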

In [34]:
joblib.dump(lr,r"C:\Users\bmsha\Documents\CodSoft\IRISClassification\dataset\logstic_model.pkl")

['C:\\Users\\bmsha\\Documents\\CodSoft\\IRISClassification\\dataset\\logstic_model.pkl']
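A saved model is only useful if it can be loaded again. The following sketch shows the round trip with `joblib.load`, writing to a temporary directory. The file name `demo_model.pkl` and the use of `load_iris` are assumptions made for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Assumption: the bundled Iris data stands in for the notebook's CSV file
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for this demonstration
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored estimator should reproduce the original predictions
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Pickled estimators are generally tied to the scikit-learn version that wrote them, so recording that version next to the file is a sensible habit.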

2) <b>Naive Bayes</b>

In [35]:
from sklearn.naive_bayes import GaussianNB

In [36]:
nb=GaussianNB()

In [37]:
nb.fit(xtrain,ytrain)

In [41]:
ypred1=nb.predict(xtest)

In [43]:
print(classification_report(ytest,ypred1))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



<h3 style="background-color:cyan">GaussianNB classified every sample in this test split correctly using only the sepal and petal dimensions. Since the Iris dataset is well-structured, with clear boundaries between the classes (especially setosa), it is not surprising that a simple probabilistic model like Naive Bayes performs so well. However, perfect results can also mean the dataset is relatively small and easy, so testing with cross-validation or on different splits is a good idea to confirm that the model generalizes well.</h3>
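Testing on different splits can be sketched by repeating the same 80/20 split with several seeds. This is an illustration only: it assumes the bundled `load_iris` data, and the choice of ten seeds is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: the bundled Iris data stands in for the notebook's CSV file
X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(10):  # ten seeds, an arbitrary choice for illustration
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # score() returns the mean accuracy on the held-out part
    accuracies.append(GaussianNB().fit(X_tr, y_tr).score(X_te, y_te))

print(min(accuracies), max(accuracies))
```

If the minimum and maximum differ noticeably, a single perfect split should not be read as the model's typical performance.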

In [44]:
# Save the fitted Naive Bayes model (nb), not the logistic regression model
joblib.dump(nb,r"C:\Users\bmsha\Documents\CodSoft\IRISClassification\dataset\Naive_bayess_model.pkl")

['C:\\Users\\bmsha\\Documents\\CodSoft\\IRISClassification\\dataset\\Naive_bayess_model.pkl']

3) <b>Random Forest Classifier</b>

In [45]:
from sklearn.ensemble import RandomForestClassifier

In [46]:
rc=RandomForestClassifier()

In [47]:
rc.fit(xtrain,ytrain)

In [48]:
ypred2=rc.predict(xtest)

In [51]:
# Report on the Random Forest predictions (ypred2)
print(classification_report(ytest,ypred2))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



<h3 style="background-color:cyan">The Random Forest classifier also delivered perfect results on this test split, achieving 100% precision, recall, F1-score, and overall accuracy across all three classes. The ensemble of decision trees captured the class boundaries and made no errors on the test samples.</h3>
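To see which measurements the trees rely on, one can inspect the fitted forest's `feature_importances_`. The sketch below assumes the bundled `load_iris` data, whose feature names differ slightly from the CSV column names, and fixes `random_state` for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: the bundled Iris data stands in for the notebook's CSV file
iris = load_iris()

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(iris.data, iris.target)

# Pair each feature with its impurity-based importance, highest first
ranked = sorted(
    zip(iris.feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

On Iris the petal measurements usually dominate this ranking, which fits the clear separation of setosa noted earlier.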

In [74]:
joblib.dump(rc,r"C:\Users\bmsha\Documents\CodSoft\IRISClassification\dataset\Random_forest_model.pkl")

['C:\\Users\\bmsha\\Documents\\CodSoft\\IRISClassification\\dataset\\Random_forest_model.pkl']

4) <b>SVC</b>

In [55]:
from sklearn.svm import SVC

In [69]:
svc=SVC(kernel='linear',random_state=42)
svc.fit(xtrain,ytrain)

In [71]:
ypred4=svc.predict(xtest)

In [73]:
print(classification_report(ytest,ypred4))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



<h3 style='background-color:cyan'>The Support Vector Classifier (SVC) with a linear kernel classified every test sample correctly, achieving 100% precision, recall, F1-score, and accuracy across all three classes. On this split, linear decision boundaries were enough to distinguish the three flower species from their sepal and petal measurements. This reflects the algorithm’s strength on well-structured data that is close to linearly separable. However, to confirm that the model generalizes beyond this specific split, it is advisable to test it on different data splits or use cross-validation.</h3>
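A confusion matrix built from cross-validated predictions shows which species, if any, get confused once every sample has been held out once. This is a sketch under assumptions: the bundled `load_iris` data and a plain 5-fold scheme.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Assumption: the bundled Iris data stands in for the notebook's CSV file
X, y = load_iris(return_X_y=True)

# Each sample is predicted by a model that did not see it during training
pred = cross_val_predict(SVC(kernel="linear"), X, y, cv=5)

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y, pred)
print(matrix)
```

Any off-diagonal counts indicate where the linear boundaries fall short.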

In [76]:
joblib.dump(svc,r"C:\Users\bmsha\Documents\CodSoft\IRISClassification\dataset\SVC_model.pkl")

['C:\\Users\\bmsha\\Documents\\CodSoft\\IRISClassification\\dataset\\SVC_model.pkl']

<b>Overall Findings:</b>
All four models performed exceptionally well, correctly classifying every test instance. This indicates that the Iris dataset is well-structured, balanced, and easily separable, making it suitable for a variety of classification algorithms. While the results demonstrate strong predictive capability, cross-validation or testing on unseen data is recommended to ensure the models generalize beyond this specific dataset.
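The cross-validation recommended above can be applied to all four models in one loop, which also allows a like-for-like comparison. The sketch assumes the bundled `load_iris` data, and the hyperparameters are illustrative rather than the notebook's exact settings.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Assumption: the bundled Iris data stands in for the notebook's CSV file
X, y = load_iris(return_X_y=True)

# Illustrative settings, not necessarily those used in the notebook
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GaussianNB": GaussianNB(),
    "RandomForest": RandomForestClassifier(random_state=42),
    "SVC": SVC(kernel="linear"),
}

# Mean accuracy over 5 stratified folds for each model
results = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, mean_accuracy in results.items():
    print(f"{name}: {mean_accuracy:.3f}")
```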

<h3 style='background-color:red'><b>Conclusion:</b>
Logistic Regression, GaussianNB, Random Forest, and SVC are all highly effective on this dataset. SVC and Random Forest are generally the more flexible choices for complex or slightly noisier datasets, although this notebook does not test that, while Naive Bayes and Logistic Regression provide simpler, more interpretable solutions.</h3>