# Task 1 — Classical ML with Scikit-learn

## Dataset: **Iris Species Dataset**

### Goal:
This notebook demonstrates a classical machine learning workflow using the Iris Species Dataset. The primary objectives are:
- **Data Preprocessing:** Simulate missing values (the Iris dataset has none), impute them, and prepare the data for model training.
- **Model Training:** Train a Decision Tree Classifier on the preprocessed data.
- **Model Evaluation:** Assess the model's performance using standard metrics like accuracy, precision, and recall.

In [1]:
# Import necessary libraries for data manipulation, machine learning, and evaluation.
import pandas as pd # Used for creating and manipulating DataFrames and Series.
from typing import Any # Import Any for type hinting
from sklearn.datasets import load_iris # Function to load the famous Iris flower dataset.
from sklearn.model_selection import train_test_split # Utility to split datasets into training and testing subsets.
from sklearn.tree import DecisionTreeClassifier # The machine learning model we will be using.
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report # Functions to evaluate the model's performance.
import numpy as np # Used for numerical operations, especially for handling NaN (Not a Number) values.

In [2]:
# --- Data Loading and Initial Inspection --- 

# Load the Iris dataset.
# The 'iris' object is a Bunch object, similar to a dictionary, containing data, target, feature names, etc.
iris: Any = load_iris()

# Create a pandas DataFrame (X) from the feature data (sepal length, sepal width, petal length, petal width).
# The column names are taken directly from `iris.feature_names` for clarity.
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Create a pandas Series (y) for the target variable (species).
# The target values are integers (0, 1, 2) representing the three different Iris species.
y = pd.Series(iris.target, name='species')

# Display the first 5 rows of the feature DataFrame.
# This helps in quickly understanding the structure and initial values of the dataset.
X.head()

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
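Beyond `head()`, it can help to confirm the dataset's shape and class balance before modeling. The following is a minimal sketch that reloads the data so it runs on its own; the variable name `class_counts` is illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="species")

# Shape of the feature matrix: (number of samples, number of features).
print(X.shape)

# Samples per class, with the integer labels mapped to species names.
label_to_name = {i: str(name) for i, name in enumerate(iris.target_names)}
class_counts = y.map(label_to_name).value_counts()
print(class_counts)
```

A balanced class distribution matters later, because it affects how representative a random train/test split is likely to be.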


In [3]:
# --- Data Preprocessing: Simulating and Handling Missing Values --- 

# To demonstrate robust data preprocessing, we simulate missing values in the dataset.
# Set the value at the first row (index 0) and first column (index 0, 'sepal length (cm)') to NaN.
X.iloc[0, 0] = np.nan

# Set the value at the eleventh row (index 10) and third column (index 2, 'petal length (cm)') to NaN.
X.iloc[10, 2] = np.nan

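# Impute the simulated missing values with the mean of their respective columns (pandas skips NaN when computing the mean).
# Mean imputation is a common strategy for numerical data because it avoids dropping rows.
# Note: the means here are computed on the full dataset, before the train/test split. In a real project, compute them
# on the training set only, so that no information from the test set leaks into preprocessing.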
X = X.fillna(X.mean())

# Verify that all missing values have been handled.
# This command sums the boolean results of `isnull()` for each column; a sum of 0 indicates no missing values.
X.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
dtype: int64
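The cell above imputes with means computed over all rows, including rows that later land in the test set. One way to avoid that leakage is to split first and fit the imputer on the training rows only. The sketch below assumes scikit-learn's `SimpleImputer` as the imputation tool, which the original notebook does not use; the names `imputer`, `X_train_imp`, and `X_test_imp` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="species")

# Simulate the same two missing values as in the cell above.
X.iloc[0, 0] = np.nan
X.iloc[10, 2] = np.nan

# Split first, so the test rows cannot influence the imputation statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the imputer on the training rows only, then apply it to both subsets.
imputer = SimpleImputer(strategy="mean")
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
)
X_test_imp = pd.DataFrame(
    imputer.transform(X_test), columns=X.columns, index=X_test.index
)

# Count the remaining missing values in each subset.
n_missing_train = int(X_train_imp.isnull().sum().sum())
n_missing_test = int(X_test_imp.isnull().sum().sum())
print(n_missing_train, n_missing_test)
```

With only two imputed cells the practical effect on Iris is negligible, but the split-then-fit order is the habit that carries over to larger datasets.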

In [4]:
# --- Data Splitting --- 

# Split the dataset into training and testing sets.
# X: Features (input data), y: Target variable (output labels).
# test_size=0.2: Allocates 20% of the data for testing and 80% for training.
# random_state=42: Ensures that the data split is reproducible. Using the same random_state will yield the same split every time.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
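A plain random split does not guarantee equal class proportions in the test set; the classification report later in this notebook shows supports of 10, 9, and 11. As a hedged alternative, `train_test_split` accepts a `stratify` argument that preserves class proportions. The sketch below uses the raw arrays from `load_iris(return_X_y=True)` rather than the notebook's DataFrame, purely to stay self-contained.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many test samples fall into each class.
test_counts = Counter(y_test.tolist())
print(test_counts)
```

Because Iris has 50 samples per class, a stratified 20% test set contains 10 samples of each species.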

In [5]:
# --- Model Training --- 

# Initialize the Decision Tree Classifier.
# random_state=42: Makes training deterministic. scikit-learn randomly permutes the features at each split,
# so ties between equally good splits could otherwise be broken differently from run to run.
model = DecisionTreeClassifier(random_state=42)

# Train the model using the training data.
# The model learns the patterns and relationships between features (X_train) and the target variable (y_train).
model.fit(X_train, y_train)

# Make predictions on the unseen test set.
# The model uses the learned patterns to predict the species for the test features (X_test).
y_pred = model.predict(X_test)
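One advantage of a decision tree is that the fitted model can be read directly. The sketch below prints the tree's depth, leaf count, and learned rules using `export_text` from `sklearn.tree`. It retrains on the unmodified Iris data (without the simulated missing values) so that it runs on its own, so the exact thresholds may differ slightly from the notebook's model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Size of the fitted tree.
print("depth:", model.get_depth(), "leaves:", model.get_n_leaves())

# Human-readable if/else rules, one line per node.
rules = export_text(model, feature_names=list(iris.feature_names))
print(rules)
```

An unconstrained tree grows until its leaves are pure, so if the printed tree looks deeper than the problem warrants, parameters such as `max_depth` or `min_samples_leaf` are the usual ways to limit it.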

In [6]:
# --- Model Evaluation --- 

# Calculate Accuracy: The proportion of correctly classified instances out of the total instances.
acc = accuracy_score(y_test, y_pred)

# Calculate Precision: TP / (TP + FP), i.e. of the samples predicted as a given class, the fraction that truly belong to it.
# average='macro': compute the metric for each class separately, then take the unweighted mean, so every class counts equally regardless of its support.
prec = precision_score(y_test, y_pred, average='macro')

# Calculate Recall: TP / (TP + FN), i.e. of the samples that truly belong to a given class, the fraction the classifier finds.
# average='macro': the unweighted mean of the per-class recall values.
rec = recall_score(y_test, y_pred, average='macro')


# Print the calculated evaluation metrics, formatted to three decimal places for readability.
print(f'Accuracy: {acc:.3f}')
print(f'Precision: {prec:.3f}')
print(f'Recall: {rec:.3f}')


# Print a comprehensive classification report.
# This report provides precision, recall, F1-score, and support for each class, along with overall averages.
# `target_names` are used to display the actual species names (e.g., 'setosa') instead of numerical labels (0, 1, 2).
print('\nClassification Report:\n', classification_report(y_test, y_pred, target_names=iris.target_names))

Accuracy: 1.000
Precision: 1.000
Recall: 1.000

Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        10
  versicolor       1.00      1.00      1.00         9
   virginica       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
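A perfect score on a 30-sample test set is a single estimate and depends on which rows happened to fall into the split. A hedged way to gauge how stable the result is: run stratified k-fold cross-validation over the whole dataset. The sketch below assumes 5 folds and uses the unmodified Iris data; neither choice comes from the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Five stratified folds, shuffled with a fixed seed for reproducibility.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold; each fold serves as the test set exactly once.
scores = cross_val_score(
    DecisionTreeClassifier(random_state=42), X, y, cv=cv, scoring="accuracy"
)

print("fold accuracies:", scores.round(3))
print(f"mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

The mean and standard deviation across folds give a more cautious picture of expected performance than one train/test split.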

