# **Supervised Machine Learning: Classification Lab**
In this lab, we will explore various classification algorithms using built-in datasets from scikit-learn.

---

## 📚 **Classification on the Iris Dataset**

Classification is a fundamental task in supervised machine learning, where the goal is to predict a **categorical** class label for a given input. In this notebook, we will explore multiple classification algorithms—Logistic Regression, Naive Bayes, K-Nearest Neighbors, Support Vector Machine, and Decision Trees—using the **Iris dataset**.

We will:
- Load and understand the Iris dataset.
- Split the data into training and testing sets.
- Train different classifiers on the training set.
- Evaluate their performance on the test set using accuracy, confusion matrices, and classification reports.

---

## 🔗 **Library Overview**

- **NumPy**: Numerical Python library for handling arrays and numerical operations.  
  [Official Docs](https://numpy.org/)

- **pandas**: Data analysis and manipulation library with a convenient DataFrame structure.  
  [Official Docs](https://pandas.pydata.org/)

- **Matplotlib**: A comprehensive library for creating static, animated, and interactive visualizations in Python.  
  [Official Docs](https://matplotlib.org/)

- **Scikit-learn (`sklearn`)**:  
  - `model_selection` (train_test_split): for creating train/test sets.  
  - `datasets`: contains popular built-in datasets such as Iris, Wine, and Breast Cancer (the Boston housing dataset was removed in scikit-learn 1.2).  
  - `preprocessing` (Optional: StandardScaler): for feature scaling.  
  - Classifiers (LogisticRegression, GaussianNB, KNeighborsClassifier, SVC, DecisionTreeClassifier).  
  - `metrics`: for evaluating classification performance (accuracy, confusion matrix, classification report).  
  [Official Docs](https://scikit-learn.org/stable/)

---

## **The Iris Dataset**

The Iris dataset is a classic in machine learning:
- **150 samples** of iris flowers.
- **4 features**:
  1. Sepal length  
  2. Sepal width  
  3. Petal length  
  4. Petal width  
- **Target**: Iris species (setosa, versicolor, virginica).
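
To make the description above concrete, here is a minimal sketch that loads the dataset into a pandas DataFrame for inspection. The names `iris` and `iris_df` are illustrative choices, not part of the lab's own cells.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# One column per feature, plus a readable species column built from the integer targets
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["species"] = pd.Categorical.from_codes(iris.target, iris.target_names)

print(iris_df.shape)                      # (rows, columns)
print(iris_df.head())                     # first five samples
print(iris_df["species"].value_counts())  # samples per class
```

The class counts show that the dataset is balanced, which is one reason plain accuracy is a reasonable metric here.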



## **Step 1: Import Required Libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

**Explanation**  
In this cell, we import the necessary libraries:
- `numpy` and `pandas` for data manipulation
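- `matplotlib.pyplot` for plotting (imported for optional visualizations; the cells below do not use it)
- `scikit-learn` utilities: `train_test_split` for splitting, `StandardScaler` for optional feature scaling, and the evaluation metrics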


## **Step 2: Load and Split the Dataset**


In [2]:
from sklearn.datasets import load_iris

# Load Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f'Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}')

Training samples: 120, Testing samples: 30


**Explanation**  
Here we:
1. Load the classic **Iris dataset** from scikit-learn, which contains 150 samples of iris flowers with 4 features each (sepal length, sepal width, petal length, petal width) and a target variable (species).
2. Split the data into **training** and **testing** sets, with 80% for training and 20% for testing (random_state=42 for reproducibility).
3. Print the number of samples in each subset.
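
A plain random split does not guarantee that each class is equally represented in the test set. As a hedged variation on the split above (not something the lab itself uses), passing `stratify=y` keeps the class proportions the same in both subsets. The variable names `y_test_plain` and `y_test_strat` are assumptions made for this sketch.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same settings as the lab: class counts in the test set depend on the shuffle
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=42)

# Stratified variant: every class keeps its one-third share in both subsets
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Plain test counts:     ", np.bincount(y_test_plain))
print("Stratified test counts:", np.bincount(y_test_strat))
```

Stratification matters more on imbalanced datasets, where a random split can leave a rare class almost absent from the test set.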


## **Step 3: Train and Evaluate Classification Models**



### **Brief Overview of Each Classification Algorithm**

In this lab, we compare five popular classification algorithms: **Logistic Regression**, **Naive Bayes**, **K-Nearest Neighbors**, **Support Vector Machine**, and **Decision Trees**. Below is a short summary of each.

#### **1. Logistic Regression**
- **Key Idea**: Extends linear regression to classification by applying a **sigmoid (logistic) function** to constrain outputs to a probability (0 to 1).  
- **Usage**: Often used for **binary classification** (spam vs. not spam), but it generalizes to multi-class problems (e.g., via one-vs-rest or a multinomial/softmax formulation).  
- **Pros**: Easy to implement, interpretable coefficients, works well when classes are close to linearly separable.  
- **Cons**: Assumes a linear decision boundary, so it may underperform when the classes are not linearly separable.


>>>![Logistic Function](https://upload.wikimedia.org/wikipedia/commons/8/88/Logistic-curve.svg)

>>>*Figure: Logistic (sigmoid) curve mapping linear inputs to a 0-1 probability range.*
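
The sigmoid mapping in the figure can be sketched in a few lines of NumPy. The scores below are hypothetical values standing in for the linear term `w·x + b`; they are not taken from a fitted model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical linear scores (w·x + b) for five inputs
scores = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(scores)

# Threshold at 0.5 to turn probabilities into binary class labels
labels = (probabilities >= 0.5).astype(int)

print(np.round(probabilities, 3))
print(labels)
```

Large negative scores map close to 0, large positive scores map close to 1, and a score of exactly 0 sits on the decision boundary at 0.5.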


---

#### **2. Naive Bayes (GaussianNB)**
- **Key Idea**: Based on **Bayes’ Theorem**, assumes independence among features given the class label.  
- **Usage**: Naive Bayes variants are popular for text classification (spam detection, sentiment analysis), typically with `MultinomialNB`; `GaussianNB`, used in this lab, suits continuous features such as the Iris measurements.  
- **Pros**: Fast to train, works well with high-dimensional data, can handle small datasets effectively.  
- **Cons**: Strong (often unrealistic) independence assumption, but still performs surprisingly well in many domains.

>>>![Bayes Theorem Diagram](https://i.ytimg.com/vi/OByl4RJxnKA/maxresdefault.jpg)  
*Figure: Visual representation of Bayes' Theorem illustrating the relationship between prior, likelihood, and posterior probabilities.*
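
As a worked illustration of prior, likelihood, and posterior, the sketch below applies Bayes' theorem to a toy spam filter. All probabilities are hypothetical numbers chosen for easy arithmetic, not estimates from any dataset.

```python
# Hypothetical numbers for a toy spam filter (not taken from any dataset)
p_spam = 0.2               # prior: P(spam)
p_word_given_spam = 0.6    # likelihood: P("free" | spam)
p_word_given_ham = 0.05    # likelihood: P("free" | not spam)

# Evidence: total probability of seeing the word in any message
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Posterior via Bayes' theorem: P(spam | "free")
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(f"P(spam | 'free') = {p_spam_given_word:.3f}")
```

Seeing the word raises the probability of spam from the 20% prior to a 75% posterior, because the word is twelve times more likely in spam than in non-spam under these assumed numbers.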



---

#### **3. K-Nearest Neighbors (KNN)**
- **Key Idea**: A **lazy learner** that uses the training data directly for classification by looking at the **‘k’ closest points** in the feature space.  
- **Usage**: Good for smaller datasets without too many features; commonly used in recommendation systems or simple classification tasks.  
- **Pros**: Simple concept, no explicit training phase.  
- **Cons**: Computationally expensive at prediction time (must search for nearest neighbors), sensitive to scaling of features and outliers.

>>>![K-Nearest Neighbors Diagram](https://miro.medium.com/max/1151/0%2AItVKiyx2F3ZU8zV5)  
*Figure: Illustration of K-Nearest Neighbors algorithm classifying a new data point based on its nearest neighbors.*
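
The neighbor-voting idea can be sketched from scratch with NumPy. This is a minimal illustration, not scikit-learn's implementation; the function `knn_predict` and the toy arrays `X_toy` and `y_toy` are hypothetical names introduced here.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Hypothetical 2-D toy data: class 0 near the origin, class 1 near (5, 5)
X_toy = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))  # point near the origin
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5])))  # point near (5, 5)
```

Note that all the work happens at prediction time: there is no fitting step, which is why KNN is called a lazy learner.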


---

#### **4. Support Vector Machine (SVM)**
- **Key Idea**: Finds a **hyperplane** (or set of hyperplanes in higher-dimensional space) that best separates classes, maximizing the margin between them.  
- **Usage**: Very effective in high-dimensional spaces, can handle non-linear separations using **kernel tricks** (RBF, polynomial, etc.).  
- **Pros**: Powerful and flexible with kernels, often works well in practice even with limited data.  
- **Cons**: Can be tricky to tune (especially kernel parameters), not as interpretable as simpler linear models.

>>>![Support Vector Machine Diagram](https://learnopencv.com/wp-content/uploads/2018/07/support-vectors-and-maximum-margin.png)  
*Figure: Illustration of a Support Vector Machine showing support vectors and the maximum margin hyperplane.*
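
To illustrate why kernels matter, the sketch below compares a linear and an RBF kernel on synthetic data where one class forms a ring around the other. The dataset (`make_circles`) and its parameters are assumptions chosen for illustration; they are not part of the Iris lab.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other
X_c, y_c = make_circles(n_samples=400, noise=0.05, factor=0.4, random_state=0)
X_c_train, X_c_test, y_c_train, y_c_test = train_test_split(
    X_c, y_c, test_size=0.25, random_state=0
)

# Fit one SVM per kernel and record its test accuracy
kernel_scores = {}
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel).fit(X_c_train, y_c_train)
    kernel_scores[kernel] = model.score(X_c_test, y_c_test)
    print(f"{kernel:>6} kernel accuracy: {kernel_scores[kernel]:.2f}")
```

No straight line can separate concentric rings, so the linear kernel should score near chance while the RBF kernel separates the classes well.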



---

#### **5. Decision Trees**
- **Key Idea**: Splits the feature space into regions by recursively asking “yes/no” questions (e.g., `x_i <= threshold?`).  
- **Usage**: Widely used in many fields due to easy interpretability. Good for capturing non-linear relationships.  
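- **Pros**: Highly interpretable and requires little data prep (no scaling needed). Trees can in principle handle mixed feature types (numeric/categorical), although scikit-learn's implementation expects categorical features to be encoded numerically.  
- **Cons**: Prone to **overfitting**, and small changes in the data can produce very different trees; pruning or ensemble methods (Random Forests, Gradient Boosting) mitigate both.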

>>> ![Decision Tree Diagram](https://eloquentarduino.github.io/wp-content/uploads/2020/08/DecisionTree.png)  
*Figure: Example of a Decision Tree illustrating decision nodes, branches, and classification outcomes.*
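
The "yes/no questions" a tree learns can be printed as text with `export_text`. The sketch below fits a deliberately shallow tree on the full Iris data; `max_depth=2` and `random_state=0` are assumptions made to keep the output short and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree (max_depth=2) keeps the printed rules short
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(iris.data, iris.target)

# Each indented line is one threshold test on a single feature
rules = export_text(shallow_tree, feature_names=list(iris.feature_names))
print(rules)
```

Each path from the top of the printout to a `class:` line corresponds to one region of the feature space.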


---

### **When to Use Which Algorithm?**
- **Logistic Regression**: Baseline linear classifier, good interpretability and easy to implement.  
- **Naive Bayes**: Fast, works well for high-dimensional data (e.g., text classification).  
- **KNN**: Very intuitive, no explicit training. Works well for smaller datasets with few features.  
- **SVM**: Often robust in high-dimensional spaces, can adapt to non-linear boundaries with kernels.  
- **Decision Tree**: Easy to interpret, but can easily overfit if not regularized or combined into ensembles.

Choose based on your data characteristics (size, dimensionality), interpretability requirements, and how much time you have for tuning.
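
One practical way to compare candidates is to score them all with the same cross-validation procedure. The sketch below is one possible setup, not the lab's method: the 5-fold setting, the `models` dictionary, and `random_state=0` for the tree are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# The five classifiers covered in this lab, with the same settings where given
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "SVM (linear)": SVC(kernel="linear"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Mean accuracy over 5 folds gives a steadier estimate than one 30-sample test set
cv_means = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name:<20} mean accuracy: {cv_means[name]:.3f}")
```

Cross-validation averages over several splits, so it is less sensitive to a lucky or unlucky single train/test partition.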



### **Logistic Regression**

In [4]:
from sklearn.linear_model import LogisticRegression

# Create and train a Logistic Regression classifier
log_reg = LogisticRegression(max_iter=200)
log_reg.fit(X_train, y_train)

# Predict on the test set
y_pred = log_reg.predict(X_test)

# Evaluate the model
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 1.00
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



**Explanation**  
- **Logistic Regression** is a simple yet popular linear model used for classification.  
- We instantiate a `LogisticRegression` model with `max_iter=200` (up from the default of 100, giving the solver more iterations to converge).
- Train (fit) the model on `X_train, y_train`.
- Predict labels on the test set (`X_test`) and evaluate via:
  - **Accuracy**  
  - **Confusion Matrix**  
  - **Classification Report** (precision, recall, f1-score)
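
Beyond hard labels, a fitted logistic regression can also report class probabilities through `predict_proba`. The sketch below repeats the lab's split and model settings so it runs on its own; the names `clf` and `proba` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)

# One row per sample, one column per class; each row sums to 1
proba = clf.predict_proba(X_test[:3])
print(proba.round(3))

# The predicted label is the class with the highest probability
print(clf.predict(X_test[:3]))
```

Probabilities are useful when the cost of an error depends on how confident the model was, or when a threshold other than "most likely class" is needed.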


### **Naive Bayes**

In [5]:
from sklearn.naive_bayes import GaussianNB

# Create and train a Gaussian Naive Bayes classifier
nb = GaussianNB()
nb.fit(X_train, y_train)

# Predict on the test set
y_pred = nb.predict(X_test)

# Evaluate the model
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 1.00
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



**Explanation**  
- **Naive Bayes** (Gaussian) uses Bayes’ theorem with the assumption of feature independence.
- We instantiate and train a `GaussianNB` classifier.
- Again, we predict on `X_test` and then print accuracy, confusion matrix, and classification report.


### **K-Nearest Neighbors (KNN)**

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Create and train a KNN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)

# Evaluate
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 1.00
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



**Explanation**  
- **K-Nearest Neighbors** is a simple, instance-based algorithm that classifies a data point based on how its neighbors are labeled.
- We create a `KNeighborsClassifier` with `n_neighbors=5` (the default is 5, but we make it explicit here).
- Train and predict, then evaluate as before.
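
Because KNN relies on distances, features measured on larger scales can dominate the result. A hedged variation, not part of the lab's own cells, is to standardize features inside a pipeline using the `StandardScaler` imported in Step 1. The name `scaled_knn` is an assumption for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only, then applies it to test data
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_train, y_train)

scaled_knn_accuracy = scaled_knn.score(X_test, y_test)
print(f"Scaled KNN accuracy: {scaled_knn_accuracy:.2f}")
```

On Iris all four features are in centimeters, so scaling changes little; on datasets with mixed units it can matter a great deal.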


### **Support Vector Machine (SVM)**

In [None]:
from sklearn.svm import SVC

# Create and train an SVM classifier with a linear kernel
svm = SVC(kernel='linear')
svm.fit(X_train, y_train)

# Predict on the test set
y_pred = svm.predict(X_test)

# Evaluate
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 1.00
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



**Explanation**  
- **Support Vector Machine** with a linear kernel looks for the maximum-margin hyperplane that separates the classes in the original feature space (here, four dimensions).
- We set `kernel='linear'` to keep the model simple, then train on `X_train, y_train` and evaluate as before.
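
The hyperplane is determined only by the training points closest to the boundary, the support vectors. The sketch below repeats the lab's split and inspects them on a fitted model; the name `svc` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

svc = SVC(kernel="linear").fit(X_train, y_train)

# Number of support vectors per class, and the vectors themselves
print("Support vectors per class:", svc.n_support_)
print("Support vector array shape:", svc.support_vectors_.shape)
print(f"Fraction of training points used: {len(svc.support_) / len(X_train):.2f}")
```

Only a subset of the 120 training points ends up as support vectors; removing any of the others would leave the fitted boundary unchanged.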


### **Decision Tree Classifier**

In [None]:
from sklearn.tree import DecisionTreeClassifier

# Create and train a Decision Tree classifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)

# Predict on the test set
y_pred = dt.predict(X_test)

# Evaluate
print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


Accuracy: 1.00
[[10  0  0]
 [ 0  9  0]
 [ 0  0 11]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



**Explanation**  
- **Decision Tree** builds a flowchart-like structure to decide class labels.
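- We use `DecisionTreeClassifier` with default parameters (e.g., the Gini impurity split criterion). Because no `random_state` is set, ties between equally good splits may be broken differently from run to run.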
- As before, we fit the model and then see the performance metrics on test data.
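
A fitted tree also reports how much each feature contributed to its splits. The sketch below repeats the lab's split, adds `random_state=42` to the tree for repeatability (an assumption, since the lab's cell leaves it unset), and prints the importances along with the tree's size. The name `tree` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# Fixing random_state makes tie-breaking between equal splits repeatable
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Importances are normalized to sum to 1 across features
for name, importance in zip(iris.feature_names, tree.feature_importances_):
    print(f"{name:<20} {importance:.3f}")

print("Depth:", tree.get_depth(), "| Leaves:", tree.get_n_leaves())
```

Setting `max_depth` or `min_samples_leaf` on the classifier is a common way to limit tree size and reduce overfitting.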


## 📚 **Additional Resources**

Below are some resources and Kaggle notebooks you can explore to expand your understanding of classification techniques:


1. **Titanic - Machine Learning from Disaster (Kaggle Competition)**  
   - [Competition Page](https://www.kaggle.com/c/titanic)  
     A classic binary classification challenge (survival or not). Many public notebooks demonstrate advanced techniques like feature engineering, hyperparameter tuning, ensemble methods, etc.  
   - Example Notebooks:  
     - [A Data Science Framework: To Achieve 99% Accuracy (Beginner)](https://www.kaggle.com/code/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy)  
       Walks through the entire ML process, from data cleaning to model evaluation, focusing on classification.

2. **Penguin Dataset**  
   - **Dataset**: [Palmer Penguins Dataset](https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data)  
   - **Sample Notebook**: [Penguin Classification with ML Models](https://www.kaggle.com/code/parulpandey/penguin-dataset-the-new-iris)  
   - Often touted as the “new Iris,” it’s a multi-class classification problem for three penguin species using numeric features like flipper length, body mass, etc.

3. **Spam/Ham Email Classifier**  
   - **Dataset**: [SMS Spam Collection Dataset](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset)  

4. **Scikit-Learn Official Documentation**  
   - [Classification User Guide](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)  
   - Detailed reference on implementing classification algorithms, handling imbalanced datasets, evaluating models, etc.

5. **Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow (by Aurélien Géron)**  
   - Includes chapters on fundamental classification algorithms, hyperparameter tuning, and best practices.

By exploring these resources, you’ll see real-world data preprocessing, feature engineering, and advanced techniques that build upon the core classification methods demonstrated in this lab.
