<a href="https://www.kaggle.com/code/khoshbayani/iris-machine-learning?scriptVersionId=238765012" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# **Iris Dataset Classification Analysis**

This Jupyter notebook demonstrates a classification workflow using the classic **Iris Dataset**. We will:
1. **Load and inspect** the dataset
2. **Preprocess** (e.g., scaling) the features
3. **Train and evaluate** multiple classification models, specifically:
   - **K-Nearest Neighbors (KNN)**
   - **Support Vector Machine (SVM)**
   - **Decision Tree**
4. **Tune** some hyperparameters to find better performing or more generalized models.

We will also highlight potential **overfitting** vs. **generalization** by comparing training and test scores, and show how to interpret or choose final models from the results.

In [1]:
# Importing necessary libraries
from sklearn.datasets import load_iris
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeClassifier, plot_tree
import numpy as np  # if needed

# Load the Iris dataset
iris = load_iris()

# 1. **Initial Data Exploration**
We check shape, missing values, data types, and descriptive statistics for quick insights about the dataset.

In [2]:
# Checking the shape (rows x columns)
iris.data.shape

(150, 4)

In [3]:
# Checking the total number of missing values
pd.DataFrame(iris.data).isnull().sum().sum()

0

In [4]:
# Checking data types for each of the 4 feature columns
pd.DataFrame(iris.data).dtypes

0    float64
1    float64
2    float64
3    float64
dtype: object

In [5]:
# Basic descriptive statistics
pd.DataFrame(iris.data).describe()

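| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.057333 | 3.758 | 1.199333 |
| std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
| min | 4.3 | 2.0 | 1.0 | 0.1 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 |
| max | 7.9 | 4.4 | 6.9 | 2.5 |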
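
The integer column labels above come from wrapping the raw NumPy array in a DataFrame. As a minimal sketch, assuming a scikit-learn version where `load_iris` accepts `as_frame=True` (0.23 or newer), we can keep the feature names and add a species column for per-class statistics. The names `iris_frame`, `df`, and `species` are illustrative and not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects with named columns
iris_frame = load_iris(as_frame=True)
df = iris_frame.frame  # 4 feature columns plus an integer "target" column

# Map the integer target to species names for readability
df["species"] = df["target"].map(dict(enumerate(iris_frame.target_names)))

print(df.shape)
print(df["species"].value_counts())

# Per-class feature means, which hint at how separable the classes are
print(df.groupby("species")[iris_frame.feature_names].mean().round(2))
```

The per-class means make it easier to see which features separate the species before any model is trained.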


# 2. **Feature Scaling**
We use `StandardScaler` to rescale each feature to mean 0 and standard deviation 1. This benefits models that are sensitive to feature scale, especially distance-based ones such as KNN and kernel methods such as SVM.

In [6]:
X = StandardScaler().fit_transform(iris.data)
y = iris.target

In [7]:
# Checking that the scaled version has mean ~0 and std ~1
pd.DataFrame(X).describe()

| | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | -1.468455e-15 | -1.823726e-15 | -1.610564e-15 | -9.473903e-16 |
| std | 1.00335 | 1.00335 | 1.00335 | 1.00335 |
| min | -1.870024 | -2.433947 | -1.567576 | -1.447076 |
| 25% | -0.9006812 | -0.592373 | -1.226552 | -1.183812 |
| 50% | -0.05250608 | -0.1319795 | 0.3364776 | 0.1325097 |
| 75% | 0.6745011 | 0.5586108 | 0.7627583 | 0.7906707 |
| max | 2.492019 | 3.090775 | 1.785832 | 1.712096 |
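
The `std` row reads 1.00335 rather than exactly 1 because `DataFrame.describe()` reports the sample standard deviation (`ddof=1`), while `StandardScaler` normalizes with the population standard deviation (`ddof=0`). With 150 rows the ratio between the two is sqrt(150/149), roughly 1.00335. A short check, with illustrative variable names:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
n = X_scaled.shape[0]

pop_std = X_scaled.std(axis=0)                        # NumPy default: ddof=0
sample_std = pd.DataFrame(X_scaled).std().to_numpy()  # pandas default: ddof=1

print(pop_std)
print(sample_std)
print(np.sqrt(n / (n - 1)))  # expected ratio between the two estimates
```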


# 3. **Train-Test Split**
We'll split our scaled data into training and testing subsets, typically using ~20-30% for test. Below we do 22% for demonstration.

In [8]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.22, random_state=72
)
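
Note that the cells above fit the scaler on all 150 rows before splitting, so statistics from the test rows influence the scaling. The following is an alternative sketch, not the approach used in the rest of this notebook: split the raw data first with `stratify=y`, then wrap the scaler and classifier in a pipeline so the scaler is fit on the training rows only. The variable names with `_raw` and `_s` suffixes are illustrative, and the scores will differ from the outputs shown in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y = load_iris(return_X_y=True)

# stratify=y keeps the three classes balanced in both subsets
X_train_raw, X_test_raw, y_train_s, y_test_s = train_test_split(
    X_raw, y, test_size=0.22, random_state=72, stratify=y
)

# The pipeline fits StandardScaler on the training rows only,
# then applies the same transformation at prediction time
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train_raw, y_train_s)

print("train accuracy:", model.score(X_train_raw, y_train_s))
print("test accuracy:", model.score(X_test_raw, y_test_s))
```

A pipeline can also be passed directly to `GridSearchCV`, in which case the scaler is refit inside every cross-validation fold.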

# 4. **K-Nearest Neighbors**
- We do a basic KNN classification with `n_neighbors=5`.
- Then check train/test accuracy.
- Also, we'll do a small grid search to find better hyperparameters.

In [9]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)  # Fit the model on the training data

In [10]:
# Evaluate the model on the train data
knn.score(X_train, y_train)

0.9658119658119658

In [11]:
# Evaluate the model on the test data
knn.score(X_test, y_test)

0.9696969696969697

#### **Grid Search for KNN**
We will search over different `n_neighbors` and `weights` to see if we can improve performance or reduce overfitting.

In [12]:
params = {
    'n_neighbors': range(1,22,2),
    'weights': ['uniform', 'distance']
}
cv = ShuffleSplit(n_splits=3, test_size=0.33, random_state=82)
grid_search = GridSearchCV(knn, param_grid=params, cv=cv, return_train_score=True, verbose=10)
grid_search.fit(X_train, y_train)  # Fit the model on the training data
grid_search

Fitting 3 folds for each of 22 candidates, totalling 66 fits
[CV 1/3; 1/22] START n_neighbors=1, weights=uniform.............................
[CV 1/3; 1/22] END n_neighbors=1, weights=uniform;, score=(train=1.000, test=0.897) total time=   0.0s
[CV 2/3; 1/22] START n_neighbors=1, weights=uniform.............................
[CV 2/3; 1/22] END n_neighbors=1, weights=uniform;, score=(train=1.000, test=0.923) total time=   0.0s
[CV 3/3; 1/22] START n_neighbors=1, weights=uniform.............................
[CV 3/3; 1/22] END n_neighbors=1, weights=uniform;, score=(train=1.000, test=0.949) total time=   0.0s
[CV 1/3; 2/22] START n_neighbors=1, weights=distance............................
[CV 1/3; 2/22] END n_neighbors=1, weights=distance;, score=(train=1.000, test=0.897) total time=   0.0s
[CV 2/3; 2/22] START n_neighbors=1, weights=distance............................
[CV 2/3; 2/22] END n_neighbors=1, weights=distance;, score=(train=1.000, test=0.923) total time=   0.0s
[CV 3/3; 2/22] ST

In [13]:
print("best_params", grid_search.best_params_)
print("best_Score (train score)", grid_search.cv_results_["mean_train_score"][grid_search.best_index_])
print("best_Score (test score)", grid_search.cv_results_["mean_test_score"][grid_search.best_index_])

best_params {'n_neighbors': 9, 'weights': 'uniform'}
best_Score (train score) 0.9615384615384616
best_Score (test score) 0.9401709401709403


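> Grid search selects `n_neighbors=9` with uniform weights, with a mean cross-validated score of about 0.94 against a mean training score of about 0.96. This score is averaged over validation folds drawn from the training data, so it is not directly comparable to the 0.97 test accuracy of the baseline model above. We can also look for the setting with the smallest gap between training and validation scores as a rough check on overfitting.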

We'll see how different parameter combos do by analyzing the difference in train/test performance.

In [14]:
mean_train_score = grid_search.cv_results_["mean_train_score"]
mean_test_score  = grid_search.cv_results_["mean_test_score"]

In [15]:
# index of the param setting that yields the minimum difference between train & test
index_min_diff = (abs(mean_train_score - mean_test_score)).argmin()

In [16]:
grid_search.cv_results_["params"][index_min_diff]

{'n_neighbors': 15, 'weights': 'uniform'}

So, `{'n_neighbors': 15, 'weights': 'uniform'}` yields the **smallest** train-test difference, which might be a good compromise to avoid overfitting, even if the absolute test score might be slightly lower. This is a matter of preference.
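
Picking the smallest gap on its own can favour a model that is consistently mediocre on both sets. One possible compromise, sketched here under the assumption that a mean validation score within 0.02 of the best counts as "good enough", is to shortlist those candidates first and then take the smallest gap among them. The tolerance `tol` and the names `search`, `results`, and `choice` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)
y = load_iris().target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.22, random_state=72
)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": range(1, 22, 2), "weights": ["uniform", "distance"]},
    cv=ShuffleSplit(n_splits=3, test_size=0.33, random_state=82),
    return_train_score=True,
)
search.fit(X_train, y_train)

results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_train_score", "mean_test_score"]
].copy()
results["gap"] = (results["mean_train_score"] - results["mean_test_score"]).abs()

tol = 0.02  # hypothetical tolerance, not a recommended value
best_val = results["mean_test_score"].max()

# Keep only candidates close to the best validation score, then minimise the gap
shortlist = results[results["mean_test_score"] >= best_val - tol]
choice = shortlist.sort_values("gap").iloc[0]
print(choice["params"], round(choice["mean_test_score"], 3), round(choice["gap"], 3))
```

Candidates with `weights='distance'` always score 1.0 on the training folds (each point is its own nearest neighbor), so their gap mostly reflects that artifact rather than overfitting in the usual sense.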

In [17]:
best_no_overfit_knn = KNeighborsClassifier(n_neighbors=15,weights='uniform')
best_no_overfit_knn.fit(X_train, y_train)
y_true = y_test

y_pred_on_train = best_no_overfit_knn.predict(X_train)
y_pred_on_test = best_no_overfit_knn.predict(X_test)


print("Classification Report on Train Data")
cls_report_train = classification_report(y_train, y_pred_on_train, target_names=iris.target_names)
print(cls_report_train)

print("\n\n================================\n\n")

print("Classification Report on Test Data")
cls_report_test = classification_report(y_true, y_pred_on_test, target_names=iris.target_names)
print(cls_report_test)



Classification Report on Train Data
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        37
  versicolor       0.93      0.98      0.95        41
   virginica       0.97      0.92      0.95        39

    accuracy                           0.97       117
   macro avg       0.97      0.97      0.97       117
weighted avg       0.97      0.97      0.97       117





Classification Report on Test Data
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        13
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        33
   macro avg       1.00      1.00      1.00        33
weighted avg       1.00      1.00      1.00        33
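
The reports above summarize per-class precision and recall. The `confusion_matrix` function imported at the top of the notebook gives the raw counts behind those figures. A minimal sketch that rebuilds the same split and the `n_neighbors=15` model (the names `knn15`, `cm`, and `cm_df` are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.22, random_state=72
)

knn15 = KNeighborsClassifier(n_neighbors=15, weights="uniform").fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, knn15.predict(X_test))
cm_df = pd.DataFrame(cm, index=iris.target_names, columns=iris.target_names)
print(cm_df)
```

Off-diagonal entries show which species are confused with each other, which the aggregated report does not reveal directly.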



# 5. **Support Vector Machine (SVM)**
- We'll try a basic SVM with `'linear'` kernel.
- Evaluate train/test.
- Then do a grid search with different kernels and C/gamma combos.

In [18]:
# Re-split data to a different ratio just to see different test sizes.
# (But typically you'd keep the same, unless you wanted more test coverage etc.)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.44, random_state=72
)

In [19]:
svm = SVC(kernel='linear', C=1.0, random_state=72)
svm.fit(X_train, y_train)
svm

In [20]:
svm.score(X_train, y_train)

0.9642857142857143

In [21]:
svm.score(X_test, y_test)

0.9545454545454546

#### **SVM Grid Search**
We'll vary kernel, C, and gamma.

In [22]:
params = {
    'kernel': ['linear','poly','rbf','sigmoid'],
    'C': [0.1,1,10,100],
    'gamma': ['scale','auto']
}
cv = ShuffleSplit(n_splits=3, test_size=0.22, random_state=72)
svm_grid_search = GridSearchCV(svm, param_grid=params, cv=cv, return_train_score=True, verbose=10)
svm_grid_search.fit(X_train, y_train)
svm_grid_search

Fitting 3 folds for each of 32 candidates, totalling 96 fits
[CV 1/3; 1/32] START C=0.1, gamma=scale, kernel=linear..........................
[CV 1/3; 1/32] END C=0.1, gamma=scale, kernel=linear;, score=(train=0.969, test=0.947) total time=   0.0s
[CV 2/3; 1/32] START C=0.1, gamma=scale, kernel=linear..........................
[CV 2/3; 1/32] END C=0.1, gamma=scale, kernel=linear;, score=(train=0.969, test=0.947) total time=   0.0s
[CV 3/3; 1/32] START C=0.1, gamma=scale, kernel=linear..........................
[CV 3/3; 1/32] END C=0.1, gamma=scale, kernel=linear;, score=(train=0.954, test=0.947) total time=   0.0s
[CV 1/3; 2/32] START C=0.1, gamma=scale, kernel=poly............................
[CV 1/3; 2/32] END C=0.1, gamma=scale, kernel=poly;, score=(train=0.800, test=0.737) total time=   0.0s
[CV 2/3; 2/32] START C=0.1, gamma=scale, kernel=poly............................
[CV 2/3; 2/32] END C=0.1, gamma=scale, kernel=poly;, score=(train=0.754, test=0.895) total time=   0.0s
[CV 3/3;

In [23]:
# Best found parameters
svm_grid_search.best_params_

{'C': 1, 'gamma': 'scale', 'kernel': 'linear'}

In [24]:
svm_grid_search.best_score_

0.9649122807017543
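
Since `gamma` is ignored by the linear kernel, the grid above fits every linear candidate twice, once per `gamma` value, with identical results. As a sketch of one way to avoid the redundant fits, `param_grid` can be given as a list of dictionaries, each describing its own sub-grid. `ParameterGrid` is used here only to count the candidates.

```python
from sklearn.model_selection import ParameterGrid

# gamma only matters for the poly, rbf, and sigmoid kernels
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {
        "kernel": ["poly", "rbf", "sigmoid"],
        "C": [0.1, 1, 10, 100],
        "gamma": ["scale", "auto"],
    },
]

n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)  # 4 linear + 3 * 4 * 2 non-linear candidates
```

The same list can be passed as `param_grid=` to `GridSearchCV`, reducing the search from 32 to 28 candidates.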

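We see that `'C': 1, 'gamma': 'scale', 'kernel': 'linear'` is reported as the best combination by mean validation score. Several other candidates tie at 0.9649, and `GridSearchCV` reports the first of them in candidate order. Let's also check how far apart the train and validation scores are.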

In [25]:
svm_train_scores = svm_grid_search.cv_results_["mean_train_score"]
svm_test_scores  = svm_grid_search.cv_results_["mean_test_score"]


In [26]:
diff_scores = abs(svm_train_scores - svm_test_scores)
index_mean_diff_svm_scores = diff_scores.argmin()

In [27]:
# This parameter set yields the smallest train-test difference
print("params:",svm_grid_search.cv_results_["params"][index_mean_diff_svm_scores])
print("train-score:",svm_grid_search.cv_results_["mean_train_score"][index_mean_diff_svm_scores])
print("test-score:",svm_grid_search.cv_results_["mean_test_score"][index_mean_diff_svm_scores])

params: {'C': 10, 'gamma': 'scale', 'kernel': 'sigmoid'}
train-score: 0.876923076923077
test-score: 0.8771929824561403


So from a generalization standpoint, `{'C': 10, 'gamma': 'scale', 'kernel': 'sigmoid'}` has nearly equal train and test averages, but its accuracy (about 0.88) is much lower than that of the other candidates.
Let's put the scores and their differences in a DataFrame.

In [28]:
svm_scores_df = pd.DataFrame({
    "train_scores": svm_train_scores,
    "test_scores":  svm_test_scores,
    "diff_scores":  diff_scores,
    "params":       svm_grid_search.cv_results_["params"]
}).sort_values(by="diff_scores", ascending=True).reset_index(drop=True)
svm_scores_df

| | train_scores | test_scores | diff_scores | params |
|---|---|---|---|---|
| 0 | 0.876923 | 0.877193 | 0.00027 | `{'C': 10, 'gamma': 'scale', 'kernel': 'sigmoid'}` |
| 1 | 0.969231 | 0.964912 | 0.004318 | `{'C': 10, 'gamma': 'scale', 'kernel': 'poly'}` |
| 2 | 0.969231 | 0.964912 | 0.004318 | `{'C': 1, 'gamma': 'auto', 'kernel': 'linear'}` |
| 3 | 0.969231 | 0.964912 | 0.004318 | `{'C': 1, 'gamma': 'scale', 'kernel': 'linear'}` |
| 4 | 0.974359 | 0.964912 | 0.009447 | `{'C': 10, 'gamma': 'auto', 'kernel': 'poly'}` |
| 5 | 0.979487 | 0.964912 | 0.014575 | `{'C': 10, 'gamma': 'auto', 'kernel': 'linear'}` |
| 6 | 0.979487 | 0.964912 | 0.014575 | `{'C': 10, 'gamma': 'scale', 'kernel': 'linear'}` |
| 7 | 0.892308 | 0.877193 | 0.015115 | `{'C': 100, 'gamma': 'auto', 'kernel': 'sigmoid'}` |
| 8 | 0.964103 | 0.947368 | 0.016734 | `{'C': 1, 'gamma': 'auto', 'kernel': 'rbf'}` |
| 9 | 0.964103 | 0.947368 | 0.016734 | `{'C': 0.1, 'gamma': 'scale', 'kernel': 'linear'}` |


We consider the parameters below the best choice: the train-test gap is small, so there is no sign of overfitting, and the score is among the highest. Rows 1 to 3 tie on both scores; we pick row 1.

In [29]:
svm_scores_df[1:2]

| | train_scores | test_scores | diff_scores | params |
|---|---|---|---|---|
| 1 | 0.969231 | 0.964912 | 0.004318 | `{'C': 10, 'gamma': 'scale', 'kernel': 'poly'}` |


In [30]:
best_svm = SVC(kernel='poly', C=10, gamma='scale')
best_svm.fit(X_train, y_train)
y_true = y_test
y_pred_on_train = best_svm.predict(X_train)
y_pred_on_test = best_svm.predict(X_test)

print("Classification Report on Train Data")
cls_report_train = classification_report(y_train, y_pred_on_train, target_names=iris.target_names)
print(cls_report_train)

print("\n\n================================\n\n")

print("Classification Report on Test Data")
cls_report_test = classification_report(y_true, y_pred_on_test, target_names=iris.target_names)
print(cls_report_test)


Classification Report on Train Data
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        26
  versicolor       0.94      1.00      0.97        30
   virginica       1.00      0.93      0.96        28

    accuracy                           0.98        84
   macro avg       0.98      0.98      0.98        84
weighted avg       0.98      0.98      0.98        84





Classification Report on Test Data
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        24
  versicolor       0.83      1.00      0.91        20
   virginica       1.00      0.82      0.90        22

    accuracy                           0.94        66
   macro avg       0.94      0.94      0.94        66
weighted avg       0.95      0.94      0.94        66



# 6. **Decision Tree**
Let's fit a simple Decision Tree, limiting `max_leaf_nodes` to reduce overfitting.

In [31]:
decison_tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=72)
decison_tree.fit(X_train, y_train)
decison_tree

In [32]:
decison_tree.score(X_train, y_train)

0.9761904761904762

In [33]:
decison_tree.score(X_test, y_test)

0.9696969696969697

We get quite good performance with a small tree. Typically, you can also do grid search over `max_depth` or `max_leaf_nodes` or `min_samples_split` to refine further.
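
As a rough sketch of the grid search mentioned above, the following searches over `max_depth`, `max_leaf_nodes`, and `min_samples_split` on the same 56/44 split. The candidate values and the 5-fold stratified cross-validation are illustrative assumptions, not tuned choices, and the names `tree_params` and `tree_search` are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X = StandardScaler().fit_transform(load_iris().data)
y = load_iris().target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.44, random_state=72
)

# Illustrative candidate values; None means no depth limit
tree_params = {
    "max_depth": [2, 3, 4, None],
    "max_leaf_nodes": [3, 4, 6, 8],
    "min_samples_split": [2, 5, 10],
}
tree_search = GridSearchCV(
    DecisionTreeClassifier(random_state=72),
    param_grid=tree_params,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=72),
    return_train_score=True,
)
tree_search.fit(X_train, y_train)

print(tree_search.best_params_)
print("mean CV score:", round(tree_search.best_score_, 3))
print("test accuracy:", round(tree_search.score(X_test, y_test), 3))
```

With only 84 training rows, differences between candidates are often within one or two samples, so close scores should be read with caution.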


In [34]:
y_true = y_test
y_pred_on_train = decison_tree.predict(X_train)
y_pred_on_test = decison_tree.predict(X_test)

print("Classification Report on Train Data")
cls_report_train = classification_report(y_train, y_pred_on_train, target_names=iris.target_names)
print(cls_report_train)

print("\n\n================================\n\n")

print("Classification Report on Test Data")
cls_report_test = classification_report(y_true, y_pred_on_test, target_names=iris.target_names)
print(cls_report_test)


Classification Report on Train Data
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        26
  versicolor       1.00      0.93      0.97        30
   virginica       0.93      1.00      0.97        28

    accuracy                           0.98        84
   macro avg       0.98      0.98      0.98        84
weighted avg       0.98      0.98      0.98        84





Classification Report on Test Data
              precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        24
  versicolor       0.91      1.00      0.95        20
   virginica       1.00      0.91      0.95        22

    accuracy                           0.97        66
   macro avg       0.97      0.97      0.97        66
weighted avg       0.97      0.97      0.97        66
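
To see what the small tree actually learned, its decision rules can be printed as text with `export_text` (a plain-text counterpart to the `plot_tree` function imported at the top). This is a sketch that refits the same `max_leaf_nodes=4` tree on the same split; `small_tree` and `rules` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = StandardScaler().fit_transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(
    X, iris.target, test_size=0.44, random_state=72
)

small_tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=72)
small_tree.fit(X_train, y_train)

# Thresholds are in standardized units because the tree was fit on scaled features
rules = export_text(small_tree, feature_names=list(iris.feature_names))
print(rules)
print("leaves:", small_tree.get_n_leaves())
```

Decision trees are insensitive to monotonic feature scaling, so fitting on the unscaled data would give thresholds in centimeters that are easier to interpret.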




# **Conclusion**
- We tried multiple classifiers on the Iris dataset.
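- KNN, SVM, and a small Decision Tree all yield high accuracy (above 90%) on their held-out test sets.
- On the shared 44% test split, the Decision Tree (0.97 test accuracy) scores slightly higher than the tuned SVM (0.94). KNN reached 1.00, but on a different, smaller 22% split of 33 samples, so its score is not directly comparable.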
- Using cross-validation and grid search can help find good hyperparameters that either maximize test accuracy or minimize overfitting.
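
Because the models above were evaluated on different splits, a like-for-like comparison needs a common protocol. One possible sketch, assuming 5-fold stratified cross-validation on the full dataset with the scaler refit inside each fold, is shown here. The hyperparameters are the ones chosen earlier in this notebook; the names `models` and `summary` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X_raw, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=72)

models = {
    "knn": KNeighborsClassifier(n_neighbors=15),
    "svm": SVC(kernel="poly", C=10, gamma="scale"),
    "tree": DecisionTreeClassifier(max_leaf_nodes=4, random_state=72),
}

summary = {}
for name, estimator in models.items():
    # Scaling happens inside the pipeline, so each fold is scaled independently
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X_raw, y, cv=cv)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On a dataset this small, the spread across folds is often as large as the difference between models, so the ranking should be treated as indicative rather than conclusive.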



AI was used for cleaning up, organizing, and refactoring the code