Assignment on Classification
Do the following on the Iris dataset:
1. Read the dataset into the Python environment
2. Do the necessary pre-processing steps
3. Find out which classification model gives the best result (try all the
classification algorithms discussed in the sessions)

#1. Read the dataset into the Python environment

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data=pd.read_excel('/content/iris.xls')

In [3]:
data.head()

,SL,SW,PL,PW,Classification
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,NaN,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [4]:
data.shape

(150, 5)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   SL              143 non-null    float64
 1   SW              144 non-null    float64
 2   PL              144 non-null    float64
 3   PW              150 non-null    float64
 4   Classification  150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [6]:
data.columns

Index(['SL', 'SW', 'PL', 'PW', 'Classification'], dtype='object')

#2. Do the necessary pre-processing steps

In [7]:
# Check for missing values
print(data.isnull().sum())

SL                7
SW                6
PL                6
PW                0
Classification    0
dtype: int64


In [8]:
# Fill missing values in the numerical columns with the column mean.
# Assigning the result back (rather than calling fillna with inplace=True on
# data[col]) avoids the pandas FutureWarning about chained in-place assignment.
for col in ['SL', 'SW', 'PL', 'PW']:
    data[col] = data[col].fillna(data[col].mean())

# Classification has no missing values here; the mode fill is kept as a safeguard.
data['Classification'] = data['Classification'].fillna(data['Classification'].mode()[0])
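
The mean imputation in this cell can be checked on a toy frame. The sketch below is only an illustration: the three-row DataFrame `df` and the list `numeric_cols` are made-up stand-ins, not the notebook's data. It fills every listed column in one call by passing the per-column means to `DataFrame.fillna`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
df = pd.DataFrame({"SL": [5.0, np.nan, 7.0], "SW": [3.0, 3.2, np.nan]})
numeric_cols = ["SL", "SW"]

# mean() skips NaN by default, so each column mean uses only observed values.
column_means = df[numeric_cols].mean()

# Passing a Series to fillna fills each column with the value under its label.
df[numeric_cols] = df[numeric_cols].fillna(column_means)

print(df)
print(df.isnull().sum())
```

Because the result is assigned back to `df[numeric_cols]`, no `inplace=True` is needed and no chained-assignment warning is raised.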

In [9]:
data.head()

,SL,SW,PL,PW,Classification
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,5.855944,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


In [10]:
data['Classification'].nunique()

3

In [11]:
data['Classification'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

In [12]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['Classification'] = le.fit_transform(data['Classification'])
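
After encoding, the species names are replaced by integers, so it helps to know which integer stands for which species. The following sketch, using a short hypothetical list `species` instead of the notebook's column, shows how the fitted encoder exposes that mapping through `classes_` and how `inverse_transform` recovers the names.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of labels; the notebook encodes data['Classification'].
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

le = LabelEncoder()
codes = le.fit_transform(species)

# classes_ holds the sorted unique labels; a label's position is its code.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform turns integer codes back into the original strings.
decoded = le.inverse_transform(codes)
print(list(decoded))
```

Keeping `le` around makes it possible to report predictions as species names rather than as 0, 1 and 2.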

In [13]:
x = data[['SL', 'SW', 'PL', 'PW']]
y = data['Classification']

#3. Find out which classification model gives the best result (try all the classification algorithms discussed in the sessions)

In [14]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 42)
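
The split above is purely random, so the three classes need not appear in equal numbers in the test set. One possible variant, shown here as a sketch on synthetic stand-in arrays (`X` and `y` below are invented, with 50 samples per class like Iris), passes `stratify=y` so that class proportions are preserved in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 150 samples, 2 dummy features, 3 balanced classes.
X = np.arange(300).reshape(150, 2)
y = np.repeat([0, 1, 2], 50)

# stratify=y keeps the class ratio identical in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("train class counts:", np.bincount(y_train))
print("test class counts:", np.bincount(y_test))
```

With 150 balanced samples and `test_size=0.2`, each class contributes 10 samples to the test set and 40 to the training set.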

In [15]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# List of classifiers
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=200),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC(),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'Naive Bayes': GaussianNB()
}

# Dictionary to store accuracy
accuracy_results = {}

# Train and evaluate each classifier
for name, clf in classifiers.items():
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_results[name] = accuracy
    print(f"{name}: {accuracy:.4f}")

# Best model
best_model = max(accuracy_results, key=accuracy_results.get)
print(f"\nBest Model: {best_model} with accuracy: {accuracy_results[best_model]:.4f}")

Logistic Regression: 1.0000
K-Nearest Neighbors: 0.9667
Support Vector Machine: 1.0000
Decision Tree: 1.0000
Random Forest: 1.0000
Gradient Boosting: 1.0000
Naive Bayes: 1.0000

Best Model: Logistic Regression with accuracy: 1.0000
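
The "Best Model" line deserves a caveat: `max` returns the first key it meets among equal values, so with several models at 1.0000 the winner is decided by dictionary order. The sketch below reuses the accuracies printed above (copied by hand into a literal dictionary) and lists every model that shares the top score; the variable names `tied` and `first` are illustrative.

```python
# Accuracies copied from the output above.
accuracy_results = {
    "Logistic Regression": 1.0000,
    "K-Nearest Neighbors": 0.9667,
    "Support Vector Machine": 1.0000,
    "Decision Tree": 1.0000,
    "Random Forest": 1.0000,
    "Gradient Boosting": 1.0000,
    "Naive Bayes": 1.0000,
}

# max() with a key returns the first maximal item in iteration order.
first = max(accuracy_results, key=accuracy_results.get)

# Collect every model that shares the best score instead of only the first.
best_score = max(accuracy_results.values())
tied = [name for name, score in accuracy_results.items() if score == best_score]

print("max() picks:", first)
print("tied at the top:", tied)
```

Six of the seven models are tied, so this single split cannot rank them.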


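**Performance Metrics:**

**Logistic Regression:** 1.0000

**Support Vector Machine:** 1.0000

**Decision Tree:** 1.0000

**Random Forest:** 1.0000

**Gradient Boosting:** 1.0000

**Naive Bayes:** 1.0000

**K-Nearest Neighbors:** 0.9667

All models except KNN achieved perfect accuracy on the 30-sample test set; KNN misclassified a single sample (29/30 = 0.9667). Because six models tie at 1.0000, the reported best model, Logistic Regression, is simply the first of the tied entries in the dictionary. On a test set this small, the results show that the Iris classes are easy to separate rather than that any one model is superior.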
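
A single 80/20 split gives a coarse estimate, since each test sample is worth about 3.3 percentage points of accuracy. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is an assumption-laden illustration: it uses scikit-learn's bundled `load_iris` as a stand-in for the notebook's `iris.xls` (the bundled copy has no missing values), compares only three of the models, and picks 5 folds arbitrarily.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: scikit-learn's bundled Iris, not the notebook's Excel file.
X, y = load_iris(return_X_y=True)

# Shuffled, stratified folds; the seed is fixed only for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

cv_scores = {}
for name, model in models.items():
    # One accuracy value per fold; report the mean and the spread.
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Comparing mean accuracies together with their spread across folds gives a fairer basis for choosing a model than one test split.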