# Supervised Learning Project - Jupyter Notebook

This notebook walks through a hands-on supervised learning project.
Each step pairs a short theoretical explanation with the code that implements it.

## 1. Introduction to Supervised Learning Project

**Supervised Learning** is a type of machine learning where the model is trained on labeled data.

**Labeled data** means each input (feature) has a known output (label or target).

The model learns to map the input to the correct output during training.

### Key Components:
- **Training a Model:** The model learns from the provided data.
- **Testing a Model:** The model is evaluated using unseen data.
- **Performance Metrics:** Accuracy, Precision, Recall, F1-score, etc.
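
To make the metrics listed above concrete, here is a minimal sketch that computes them on a small set of binary labels. The labels `y_true` and `y_hat` are invented for illustration only and are not part of the project data.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy binary labels, invented purely for illustration
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

# 3 true positives, 2 false negatives, 1 false positive, 4 true negatives
acc = accuracy_score(y_true, y_hat)    # (3 + 4) / 10 = 0.70
prec = precision_score(y_true, y_hat)  # 3 / (3 + 1) = 0.75
rec = recall_score(y_true, y_hat)      # 3 / (3 + 2) = 0.60
f1 = f1_score(y_true, y_hat)           # harmonic mean of precision and recall

print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1-score:", f1)
```

Accuracy counts every correct prediction, while precision and recall look only at the positive class, which is why the three values differ here.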

### Objective:
We will build a classification model on the Iris dataset, evaluate it, and then try to improve its performance using ensemble methods.

## 2. Building the Model

### Step 1: Import Necessary Libraries

Explanation:
- `pandas`: For data manipulation.
- `numpy`: For numerical operations.
- `sklearn`: For machine learning algorithms and evaluation.
- `matplotlib`: For visualization.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### Step 2: Load and Understand the Dataset

Explanation:
- The Iris dataset is a well-known classification dataset.
- It contains measurements of different types of iris flowers.

In [2]:
iris = load_iris()
X = iris.data  # Features
y = iris.target  # Labels

print("Feature Names:", iris.feature_names)
print("Target Names:", iris.target_names)

Feature Names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
Target Names: ['setosa' 'versicolor' 'virginica']


### Step 3: Split the Dataset

Explanation:
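- We split the dataset into 70% training and 30% testing (`test_size=0.3`); `random_state=42` makes the split reproducible.
- Holding out the test set lets us evaluate the model on data it has not seen during training.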

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
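
A plain random split does not guarantee equal class shares: the support column in the Step 5 output shows 19, 13 and 13 test samples per class. As an optional variation (not what this notebook uses), the sketch below passes `stratify=y` so each class keeps its proportion in both splits. The names `X_tr`, `X_te`, `y_tr` and `y_te` are chosen here to avoid overwriting the notebook's variables.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Stratify on y so each class keeps its share in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Count how many samples of each class ended up in each split
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("Train class counts:", train_counts)
print("Test class counts:", test_counts)
```

Because Iris has 50 samples per class, the stratified split puts 35 samples of each class in training and 15 in testing.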

### Step 4: Train the Model

Explanation:
- `RandomForestClassifier` trains many decision trees on bootstrap samples of the data and combines their votes, which usually works well for classification with little tuning.

In [4]:
model = RandomForestClassifier(random_state=42)  # fixed seed so repeated runs build the same forest
model.fit(X_train, y_train)

### Step 5: Evaluate the Model

Explanation:
- We predict on the test set.
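- Accuracy is the fraction of test predictions that match the true labels.
- The confusion matrix and classification report break the results down per class.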

In [5]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:\n", cm)

# Detailed Classification Report
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=iris.target_names))

Accuracy: 1.0
Confusion Matrix:
 [[19  0  0]
 [ 0 13  0]
 [ 0  0 13]]
Classification Report:
               precision    recall  f1-score   support

      setosa       1.00      1.00      1.00        19
  versicolor       1.00      1.00      1.00        13
   virginica       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
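
A perfect score on 45 test samples is encouraging, but it rests on a single small split. As a hedged cross-check, the sketch below runs 5-fold cross-validation on the full dataset. The names `X_all`, `y_all` and `cv_model`, and the choice of five folds, are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_all, y_all = load_iris(return_X_y=True)

# A fresh, unfitted model; cross_val_score fits a clone on each fold
cv_model = RandomForestClassifier(random_state=42)

# Each of the 5 folds serves once as the held-out test set
scores = cross_val_score(cv_model, X_all, y_all, cv=5)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

If the fold accuracies vary noticeably, the single-split result should be read with some caution.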



## 3. Combining Multiple Algorithms for Better Performance (Ensemble Learning)

**Ensemble Learning:**
- Combines multiple models to improve performance.
- Types: Voting, Bagging, Boosting, Stacking.

**Voting Classifier:**
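- Combines predictions from multiple models.
- Hard Voting: Each model votes for a class and the majority class is selected.
- Soft Voting: The predicted class probabilities are averaged and the class with the highest average is selected.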

**Benefits:**
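- Can reduce overfitting, because the errors of individual models tend to average out.
- Often improves accuracy, although an improvement is not guaranteed.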
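
Hard voting can be illustrated by hand before using scikit-learn. The sketch below takes a majority vote over made-up predictions from three hypothetical models; the `preds` array is invented for illustration.

```python
import numpy as np

# Hypothetical predictions from three models on five samples
preds = np.array([
    [0, 1, 2, 1, 0],  # model A
    [0, 1, 1, 1, 2],  # model B
    [0, 2, 2, 1, 0],  # model C
])

# For each sample (column), pick the most frequent class label.
# argmax returns the smallest label when two classes tie.
majority = np.array([np.bincount(col).argmax() for col in preds.T])
print(majority)  # [0 1 2 1 0]
```

Each sample ends up with the label that at least two of the three models agree on.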

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Create individual models
model1 = LogisticRegression(max_iter=200)
model2 = DecisionTreeClassifier()
model3 = RandomForestClassifier()

# Create Voting Classifier
voting_model = VotingClassifier(estimators=[('lr', model1), ('dt', model2), ('rf', model3)], voting='hard')

# Train the ensemble model
voting_model.fit(X_train, y_train)

# Evaluate the ensemble model
y_pred_voting = voting_model.predict(X_test)
voting_accuracy = accuracy_score(y_test, y_pred_voting)
print("Voting Classifier Accuracy:", voting_accuracy)
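
To judge whether combining models helps, it is useful to score each base model next to the ensemble on the same split. The sketch below does this and swaps in soft voting (`voting="soft"`) as a variation on the hard-voting cell above. The helper names `base_models`, `soft_vote` and `results` are assumptions for this example, and the data is reloaded so the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.3, random_state=42
)

base_models = [
    ("lr", LogisticRegression(max_iter=200)),
    ("dt", DecisionTreeClassifier(random_state=42)),
    ("rf", RandomForestClassifier(random_state=42)),
]
# Soft voting averages predicted probabilities instead of counting votes
soft_vote = VotingClassifier(estimators=base_models, voting="soft")

# Fit every model on the same training data and score it on the same test data
results = {}
for name, est in base_models + [("soft_vote", soft_vote)]:
    est.fit(X_tr, y_tr)
    results[name] = accuracy_score(y_te, est.predict(X_te))

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

On a small, clean dataset such as Iris the scores may be nearly identical, so the ensemble will not necessarily beat its best member.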

## 4. Project Activity: Develop and Train the Model (Practice)

### Practice Instructions:
1. Load a dataset (Iris or any classification dataset).
2. Perform data exploration (check for missing values, class balance).
3. Preprocess the dataset if needed.
4. Split the dataset into training and testing sets.
5. Train a classifier (Random Forest, Decision Tree, Logistic Regression).
6. Evaluate using accuracy, precision, recall.
7. Try combining models using Voting Classifier.
8. Compare individual and combined model performances.
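
As a starting point for step 2, the sketch below checks for missing values and class balance with pandas. It assumes the Iris dataset and a scikit-learn version that supports `as_frame=True`; the variable names are chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import load_iris

# as_frame=True returns a DataFrame holding the features plus a 'target' column
iris_df = load_iris(as_frame=True).frame

# Missing values per column
missing_per_column = iris_df.isna().sum()
print(missing_per_column)

# Number of samples per class, ordered by class label
class_counts = iris_df["target"].value_counts().sort_index()
print(class_counts)
```

For Iris there are no missing values and each class has 50 samples, so no imputation or rebalancing is needed. Other datasets may need both.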

## 5. Project Feedback and Discussion

### Discussion Points:
- What challenges did you face while preprocessing the data?
- Which model performed best individually?
- Did the ensemble improve model performance?
- What did you learn about the importance of choosing the right algorithm and preprocessing steps?

### Key Takeaways:
- Data preprocessing is critical for good model performance.
- Trying multiple algorithms and tuning them is essential.
- Combining models can often lead to improved results.