## LECTURE 12: Ensemble Learning 

## Course: Awfera Machine Learning

## Instructor: Dr. Shazia Saqib

## Student: Muhammad Shafiq

____________

# Introduction to Ensemble Learning

### What is Ensemble Learning?
Ensemble learning combines multiple models, often weak learners, to create a stronger model. It's like asking several experts for advice before making a decision.

### The power of collective wisdom:
By leveraging diverse perspectives, ensemble methods often outperform individual models in accuracy and robustness.
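
The "collective wisdom" idea can be illustrated with a small simulation. This is a minimal sketch, not part of the lecture: it assumes every voter is right with the same probability and that voters make their mistakes independently, which real models rarely do. The function name `majority_vote_accuracy` is hypothetical.

```python
import random

def majority_vote_accuracy(n_voters, p_correct, n_trials=20_000, seed=0):
    """Estimate how often a majority of independent voters is right."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each voter is correct with probability p_correct, independently
        correct_votes = sum(rng.random() < p_correct for _ in range(n_voters))
        if correct_votes > n_voters / 2:
            wins += 1
    return wins / n_trials

single = majority_vote_accuracy(1, 0.6)     # one weak learner
ensemble = majority_vote_accuracy(25, 0.6)  # majority of 25 weak learners
print(f"single voter: {single:.3f}, majority of 25: {ensemble:.3f}")
```

Under these assumptions the majority is right noticeably more often than any single voter. The gain shrinks when the voters' errors are correlated, which is why ensemble methods try to make their members diverse.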

## Types of Ensemble Methods:
There are three main types of ensemble learning techniques: *Bagging*, *Boosting*, and *Stacking*. Each method has its distinct characteristics and applications.

## 1. Bagging
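 - **Definition**: Bagging, or Bootstrap Aggregating, involves training multiple models in parallel, each on a bootstrap sample of the data (a random subset drawn with replacement). The final prediction is made by combining the results from all models, by voting for classification or averaging for regression.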
 - **Key Characteristics:**
    - **Parallel Models**: Models are trained independently and simultaneously.
    - **Variance Reduction**: Bagging helps in reducing variance by averaging multiple predictions, which makes it less sensitive to data fluctuations.
 - **Common Example**: Random Forest is a classic example of a bagging model. It combines multiple decision trees for classification or regression tasks.
 - **Application**: Bagging is particularly useful when you need to reduce the variance in the predictions, making it ideal for high-variance models such as decision trees.
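
The bagging procedure described above can be sketched by hand. This is an illustrative sketch on synthetic data from `make_classification`, not the lecture's dataset; the number of trees (25) and all variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption: stands in for real data)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a 0/1 majority vote cannot tie
votes = np.zeros((n_models, len(X_te)), dtype=int)

for i in range(n_models):
    # Bootstrap sample: draw len(X_tr) row indices with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=i).fit(X_tr[idx], y_tr[idx])
    votes[i] = tree.predict(X_te)

# Aggregate: majority vote across the trees (labels are 0/1)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = (y_pred == y_te).mean()
print("Bagged accuracy:", bagged_accuracy)
```

Each tree is independent of the others, so the loop could run in parallel. scikit-learn's `BaggingClassifier` packages the same idea behind a single estimator.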

## 2. Boosting
 - **Definition**: Boosting is an ensemble technique where models are trained sequentially. Each new model focuses on the errors made by previous models, improving upon them.
 - **Key Characteristics:**
   - **Sequential Models**: The models are trained one after the other, with each new model correcting the mistakes of the previous ones.
   - **Bias Reduction**: Boosting is primarily used to reduce bias by giving more weight to incorrectly classified data points.
 - **Common Example**: AdaBoost (Adaptive Boosting) is a popular boosting algorithm that increases the weights of misclassified training samples so that the next learner concentrates on them.
 - **Application**: Boosting is best suited for improving the performance of weak learners by combining them into a stronger, more accurate model.
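
The sequential reweighting behind boosting can be sketched with a hand-written binary AdaBoost loop. This is a simplified illustration using decision stumps on synthetic data; labels are mapped to -1/+1, and the number of rounds (20) and the helper name `boosted_predict` are assumptions. It is not how scikit-learn's `AdaBoostClassifier` is implemented internally.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=400, n_features=8, random_state=1)
y = np.where(y01 == 1, 1, -1)  # AdaBoost formulas are simplest with -1/+1 labels

n_rounds = 20
w = np.full(len(y), 1 / len(y))  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree fitted to the weighted data
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # Weighted error, clipped to keep the logarithm finite
    err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # this learner's say in the final vote

    # Misclassified points (y * pred == -1) gain weight, the rest lose weight
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

def boosted_predict(X_new):
    """Weighted vote of all stumps (hypothetical helper)."""
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)

first_stump_acc = (stumps[0].predict(X) == y).mean()
ensemble_acc = (boosted_predict(X) == y).mean()
print("First stump (train):", first_stump_acc)
print("Boosted ensemble (train):", ensemble_acc)
```

Unlike bagging, the rounds cannot run in parallel, because each stump depends on the weights produced by the previous one.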

## 3. Stacking
 - **Definition**: Stacking involves training multiple models and combining their outputs to create a new dataset, which is then used by another model to make predictions.
 - **Key Characteristics:**
   - **Model Combination**: The outputs from several base models are used to form a new dataset, and another model is trained on this dataset to make the final prediction.
   - **Improved Performance:** By combining the outputs from different models, stacking often results in improved accuracy and robustness.
 - **Common Example:** A stacking model may combine outputs from decision trees, support vector machines (SVM), and logistic regression, with a final estimator like Logistic Regression or Random Forest to make the final prediction.
 - **Application:** Stacking is used when you want to combine the strengths of different models and improve prediction accuracy.
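
The "new dataset" that stacking builds can be made explicit. The sketch below is an assumption-laden illustration on synthetic data: it uses out-of-fold predicted probabilities (via `cross_val_predict`) as the meta-features, which is one common way to keep the meta-model from seeing predictions made on rows the base models were trained on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

base_models = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    SVC(probability=True, random_state=0),
]

# New training set: one column per base model, holding out-of-fold P(class 1)
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on the full training split, then build the same
# columns for the test rows
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])

# Meta-model: learns how to combine the base models' outputs
meta_model = LogisticRegression().fit(meta_train, y_tr)
stacked_accuracy = meta_model.score(meta_test, y_te)
print("Stacked accuracy:", stacked_accuracy)
```

scikit-learn's `StackingClassifier` automates these steps, including the internal cross-validation.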

# What is a Random Forest?
  - 1. **Ensemble of Decision Trees**
  Random Forests combine multiple decision trees for classification or regression tasks.
  - 2. **Robust Predictions**
  By aggregating multiple trees, Random Forests reduce overfitting and improve generalization.

 - 3. **Versatile Application**
 Effective for various data types and problem domains.

## Random Forest Algorithm: How Random Forest Works
 - **Creation**
 Ensemble learning technique that creates multiple decision trees
 - **Variability**
 Each tree uses a random sample of the data and random feature subsets to introduce variability
 - **Improved Performance**
 Reduces overfitting and improves prediction performance
 - **Result Aggregation**
 Aggregates results by voting (classification) or averaging (regression)
 - **Versatility**
 Handles complex data and provides reliable predictions across various domains
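
The steps above map onto a few `RandomForestClassifier` parameters. The sketch below uses synthetic data and illustrative settings. Note that scikit-learn aggregates classification results by averaging the trees' predicted class probabilities rather than counting hard votes, which the last lines check.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=12, n_informative=5,
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

forest = RandomForestClassifier(
    n_estimators=100,     # creation: number of decision trees
    bootstrap=True,       # variability: each tree gets a bootstrap sample of rows
    max_features="sqrt",  # variability: random feature subset tried at each split
    random_state=3,
).fit(X_tr, y_tr)

# Result aggregation: the forest's probabilities are the mean over its trees
per_tree = np.stack([tree.predict_proba(X_te) for tree in forest.estimators_])
matches = np.allclose(per_tree.mean(axis=0), forest.predict_proba(X_te))

print("Trees in the forest:", len(forest.estimators_))
print("Forest output equals averaged tree output:", matches)
print("Test accuracy:", forest.score(X_te, y_te))
```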

## Advantages of Random Forests
 - **High Accuracy**
 Combines multiple models for improved predictions
 - **Handles Missing Values**
 Can be robust to incomplete data, depending on the implementation
 - **Reduced Overfitting**
 Generalizes well to unseen data.
 - **Scalability**
 Efficient for large datasets.

## Applications of Random Forest
 - **Finance**
 Credit scoring and risk assessment with robust predictions
 - **Environment**
 Monitoring land cover changes and preventing deforestation
 - **Cybersecurity**
 Fraud detection and anomaly spotting in online activities
 - **Medical Diagnosis**
 Predicting diseases based on patient data
 - **Fraud Detection**
 Identifying suspicious financial transactions
 - **Stock Market Prediction**
 Forecasting market trends and stock prices
 - **Customer Segmentation**
 Grouping customers based on behaviour patterns

## Limitations of Random Forests
 1. **Lack of interpretability**
 Complex model structure makes it difficult to explain individual predictions

 2. **Slower predictions**
 Large forests can be computationally expensive during inference
  
 3. **Hyperparameter Sensitivity**
 Performance can vary significantly based on parameter tuning.
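
One way to gauge the hyperparameter sensitivity mentioned above is a small grid search. This is a sketch on synthetic data; the grid values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=6)

# Illustrative grid: 2 x 2 x 2 = 8 parameter combinations
param_grid = {
    "n_estimators": [25, 100],
    "max_depth": [3, None],
    "max_features": ["sqrt", None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=6),
    param_grid,
    cv=3,                # 3-fold cross-validation per combination
    scoring="accuracy",
)
search.fit(X, y)

scores = search.cv_results_["mean_test_score"]
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
print("Spread across the grid:", round(scores.max() - scores.min(), 3))
```

The spread between the best and worst combination gives a rough sense of how much tuning matters for a given dataset.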

## Key Features of Random Forest
 1. **High Predictive Accuracy**
 Combines multiple decision trees for better predictions.
 2. **Resistance to Overfitting**
 Reduces overfitting by averaging tree outputs.
 3. **Handles Large Datasets**
 Processes large datasets efficiently by splitting data across trees.
 4. **Feature Importance**
 Identifies key features influencing predictions.
 5. **Built-in Cross-Validation**
 Uses out-of-bag samples for model validation.
 6. **Handles Missing Values**
 Can adapt to missing data for robust predictions, depending on the implementation.
 7. **Parallelization**
 Builds trees simultaneously for faster processing.
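
Several of these features are exposed directly by scikit-learn. The sketch below is illustrative and uses synthetic data in which, by construction (`shuffle=False`), only the first three columns carry signal. It shows the out-of-bag score, feature importances, and parallel tree building via `n_jobs`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# With shuffle=False the informative features are columns 0, 1 and 2
X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=4)

forest = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,   # score each row using only trees that did not train on it
    n_jobs=-1,        # build trees in parallel on all available cores
    random_state=4,
).fit(X, y)

print("Out-of-bag accuracy:", round(forest.oob_score_, 3))

importances = forest.feature_importances_  # normalized to sum to 1
for i in np.argsort(importances)[::-1]:
    print(f"feature {i}: {importances[i]:.3f}")
```

The out-of-bag score works like a validation estimate obtained without a separate hold-out set, since each tree leaves out the rows its bootstrap sample did not draw.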

## Random Forests vs. Decision Tree

### Decision Trees
 - Single tree structure
 - Prone to overfitting
 - Less robust to noise 

### Random Forests
 - Ensemble of trees
 - Reduced overfitting
 - More robust generalization 
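
The comparison can be checked empirically by looking at the gap between training and test accuracy. This is a sketch on synthetic data with 10% label noise (`flip_y=0.1`, an assumption chosen to make overfitting visible). On most runs the forest scores higher on held-out data, though that is not guaranteed for every dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: flip_y randomly reassigns 10% of the labels
X, y = make_classification(n_samples=600, n_features=12, n_informative=4,
                           flip_y=0.1, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=7)

models = {
    "single tree": DecisionTreeClassifier(random_state=7),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=7),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # Store (training accuracy, test accuracy) for each model
    results[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name}: train={results[name][0]:.3f}, test={results[name][1]:.3f}")
```

An unpruned tree memorizes the training set, noisy labels included, so its training accuracy says little about how it will generalize.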

## Ensemble Learning Beyond Random Forests
  - **Gradient Boosting**

  Builds trees sequentially to correct previous errors. Ideal for complex relationships.
  - **AdaBoost**

  Adjusts weights of misclassified instances. Effective for binary classification.
  - **Stacking**
  
  Combines predictions from multiple models. Versatile for various problem types.
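
The "correct previous errors" idea behind gradient boosting can be sketched for regression with squared error, where each new tree is fitted to the residuals of the current prediction. This is a simplified illustration on a synthetic sine curve; the learning rate, tree depth, and number of rounds are assumptions, and library implementations such as `GradientBoostingRegressor` add further refinements.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem: noisy sine curve
rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(100):
    # For squared error, the residual is the negative gradient of the loss
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Take a small step in the direction the new tree suggests
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting:", round(mse_history[0], 4))
print("Training MSE after 100 trees:", round(mse_history[-1], 4))
```

The training error only shows that each tree corrects part of what is left over. Held-out data would be needed to judge generalization.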

In [2]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [3]:
# Dataset URL
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"

# Column names as per dataset description
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
           'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

# Load dataset
df = pd.read_csv(url, header=None, names=columns)

# Print basic info
print(df.head())
df.info()  # info() prints its summary itself and returns None

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768

In [5]:
# Step 2: handle missing values
print("\nChecking for missing values:")
print(df.isnull().sum())

# Fill missing numerical values with the column median
df.fillna(df.median(numeric_only=True), inplace=True)

# Fill missing categorical values (if any) with the column mode
for col in df.select_dtypes(include=['object']):
    df[col] = df[col].fillna(df[col].mode()[0])


Checking for missing values:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


In [6]:
# Step 3: prepare data
# Separate features and target variable
X = df.drop(columns=['Outcome'])
y = df['Outcome']

# Step 4: apply standard scaling
# Note: the scaler is fitted on all rows before the split, so test-set
# statistics leak into training; fitting it on X_train only would avoid this.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 5: split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Step 6: train a bagging classifier built on decision trees
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Bagging model: 3 decision trees, each fitted on a bootstrap sample
# (the `estimator` keyword requires scikit-learn >= 1.2; no random_state is
# set, so the accuracy below can vary between runs)
bagging_model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=3)
bagging_model.fit(X_train, y_train)

# Prediction and accuracy
y_pred_bag = bagging_model.predict(X_test)
print("Bagging Accuracy:", accuracy_score(y_test, y_pred_bag))


Bagging Accuracy: 0.7272727272727273


In [7]:
from sklearn.ensemble import AdaBoostClassifier

# Boosting model: 10 sequentially trained weak learners
boosting_model = AdaBoostClassifier(n_estimators=10)
boosting_model.fit(X_train, y_train)

# Prediction and accuracy
y_pred_boost = boosting_model.predict(X_test)
print("Boosting Accuracy:", accuracy_score(y_test, y_pred_boost))

Boosting Accuracy: 0.7489177489177489


In [10]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Define the base models
base_model = [
    ('tree', DecisionTreeClassifier()),
    ('svm', SVC(probability=True))
]

# The meta-model is logistic regression, trained on the base models' outputs
stacking_model = StackingClassifier(estimators=base_model, final_estimator=LogisticRegression())
stacking_model.fit(X_train, y_train)

# prediction and accuracy
y_pred_stack = stacking_model.predict(X_test)
print("Stacking Accuracy:", accuracy_score(y_test, y_pred_stack))

Stacking Accuracy: 0.7359307359307359
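
The notebook imports `classification_report` and `confusion_matrix` but reports only accuracy. The sketch below shows how those two functions could add detail to an evaluation. The label arrays are made-up values for illustration, not predictions from the models above.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true labels and predictions (illustration only)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)

print("Confusion matrix:")
print(cm)
print("Accuracy:", acc)

# Per-class precision, recall and F1
print(classification_report(y_true, y_pred))
```

On an imbalanced dataset, per-class recall can reveal weaknesses that a single accuracy figure hides.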
