
In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.preprocessing import LabelEncoder

# Load the dataset (assumes 'Iris.csv' is in the working directory; adjust the path otherwise)
df = pd.read_csv('Iris.csv')

# --- 1. Preprocess the data ---

# Check for missing values: print the count of missing entries per column.
# Note: to_markdown() needs the optional 'tabulate' package to be installed.
print("Missing values per column:\n", df.isnull().sum().to_markdown(numalign="left", stralign="left"))

# Drop the 'Id' column: it is a row identifier, not a measurement,
# so it must not be used as a feature.
df_processed = df.drop('Id', axis=1)

# Separate features (X) and target (y)
# X contains the independent variables (features) used for prediction.
X = df_processed.drop('Species', axis=1)
# y contains the dependent variable (target) that we want to predict.
y = df_processed['Species']

# Encode the target variable 'Species' into numerical labels
# scikit-learn classifiers accept string class labels directly, so this step is optional;
# encoding makes the class-to-number mapping explicit. LabelEncoder numbers the classes
# in sorted order of their names (so 'Iris-setosa' becomes 0).
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)
print("\nEncoded Species labels (first 5):\n", y_encoded[:5])
print("Original Species to Encoded Mapping:\n", list(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))


# --- 2. Train a Decision Tree Classifier ---

# Split the data into training and testing sets
# train_test_split divides the dataset into subsets for training and testing.
# test_size=0.30 means 30% of the data will be used for testing, 70% for training.
# random_state ensures reproducibility of the split; the same split is generated each time.
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.30, random_state=42)

# Initialize the Decision Tree Classifier
# A Decision Tree is a non-parametric supervised learning method used for classification and regression.
# random_state fixes the tree's internal randomness (features are randomly permuted at each split),
# so repeated runs build the same tree.
dt_classifier = DecisionTreeClassifier(random_state=42)

# Train the classifier using the training data
# The .fit() method trains the model using the features (X_train) and corresponding labels (y_train).
dt_classifier.fit(X_train, y_train)

print("\nDecision Tree Classifier trained successfully.")

# --- 3. Evaluate using accuracy, precision, and recall ---

# Make predictions on the test set
# The trained model predicts the labels for the unseen test features (X_test).
y_pred = dt_classifier.predict(X_test)

# Calculate Accuracy
# Accuracy is the fraction of test samples whose predicted label matches the true label.
accuracy = accuracy_score(y_test, y_pred)
print(f"\nAccuracy: {accuracy:.4f}")

# Calculate Precision
# Precision for a class is the ratio of true positives to all instances predicted as that class.
# With more than two classes, the per-class scores must be combined into one number.
# average='weighted' computes the score for each label and averages them,
# weighted by the number of true instances (support) of each label.
precision = precision_score(y_test, y_pred, average='weighted')
print(f"Precision: {precision:.4f}")

# Calculate Recall
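# Recall for a class is the ratio of true positives to all actual instances of that class.
# As with precision, average='weighted' weights each class by its support.
# In a single-label multi-class problem, weighted recall is therefore equal to accuracy.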
recall = recall_score(y_test, y_pred, average='weighted')
print(f"Recall: {recall:.4f}")

Missing values per column:
 |               | 0   |
|:--------------|:----|
| Id            | 0   |
| SepalLengthCm | 0   |
| SepalWidthCm  | 0   |
| PetalLengthCm | 0   |
| PetalWidthCm  | 0   |
| Species       | 0   |

Encoded Species labels (first 5):
 [0 0 0 0 0]
Original Species to Encoded Mapping:
 [('Iris-setosa', np.int64(0)), ('Iris-versicolor', np.int64(1)), ('Iris-virginica', np.int64(2))]

Decision Tree Classifier trained successfully.

Accuracy: 1.0000
Precision: 1.0000
Recall: 1.0000
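
To make the `average='weighted'` setting concrete, here is a minimal sketch that recomputes the weighted scores by hand. The labels in `y_true` and `y_hat` are hypothetical toy values chosen for illustration, not results from the Iris run above, and the variable names are this example's own.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for a 3-class problem (not taken from the Iris run).
y_true = np.array([0, 0, 0, 1, 1, 2])
y_hat = np.array([0, 0, 1, 1, 2, 2])

# average=None returns one score per class instead of a single number.
per_class_precision = precision_score(y_true, y_hat, average=None)
per_class_recall = recall_score(y_true, y_hat, average=None)

# Support: how many true instances each class has.
support = np.bincount(y_true)

# Weighted average computed by hand from the per-class scores.
manual_precision = np.average(per_class_precision, weights=support)
manual_recall = np.average(per_class_recall, weights=support)

# The same quantities as computed by scikit-learn.
weighted_precision = precision_score(y_true, y_hat, average="weighted")
weighted_recall = recall_score(y_true, y_hat, average="weighted")
toy_accuracy = accuracy_score(y_true, y_hat)

print("Per-class precision:", per_class_precision)
print("Per-class recall:   ", per_class_recall)
print(f"Weighted precision: {weighted_precision:.4f} (by hand: {manual_precision:.4f})")
print(f"Weighted recall:    {weighted_recall:.4f} (by hand: {manual_recall:.4f})")
print(f"Accuracy:           {toy_accuracy:.4f}")
```

In this toy case the weighted recall and the accuracy coincide, because weighting each class's recall by its support adds up to the total number of correct predictions divided by the number of samples. Passing `average=None` on the real test set is a quick way to see whether a single class is responsible for any errors.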

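
The mapping printed in the output above shows the codes as `np.int64` values, and the predictions themselves are integer codes rather than species names. The sketch below shows one way to build a plain-integer mapping and to decode codes back to names with `inverse_transform`. The `species` list is a hypothetical sample written in the same format as the `Species` column, and `encoder`, `codes`, `mapping`, and `decoded` are names introduced only for this example.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of species names (same format as the 'Species' column).
species = ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"]

encoder = LabelEncoder()
# Classes are sorted by name before being numbered 0, 1, 2, ...
codes = encoder.fit_transform(species)

# classes_ is already in code order, so enumerate() gives the mapping directly.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}

# inverse_transform converts integer codes back to the original names.
decoded = encoder.inverse_transform(codes)

print("Codes:  ", codes)
print("Mapping:", mapping)
print("Decoded:", list(decoded))
```

Applied to the notebook, the same call on the fitted `label_encoder` (for example `label_encoder.inverse_transform(y_pred)`) would turn the predicted codes back into species names for reporting.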