In [1]:
"""
Ensemble Learning - Detailed Explanation with Python Comments

Ensemble Learning is a machine learning paradigm where multiple models 
(often called "weak learners" or "base learners") are combined 
to produce a single predictive model. The main idea is to improve 
performance and robustness over any individual model.

Why use Ensemble Learning?
- Individual models have limitations: a flexible model can overfit (high variance),
  while a simple one can underfit (high bias).
- Combining models can reduce the error that comes from bias or variance.
- Ensembles often achieve better accuracy, stability, and generalization than their members.

Types of Ensemble Learning:
1. Bagging (Bootstrap Aggregating)
   - Train multiple models independently on bootstrap samples (random draws with replacement) of the training data.
   - Combine their predictions by voting (classification) or averaging (regression).
   - Example: Random Forests.

2. Boosting
   - Train models sequentially, where each new model focuses on the mistakes of the previous ones.
   - Combine models by weighted voting or summing predictions.
   - Example: AdaBoost, Gradient Boosting Machines (GBM), XGBoost.

3. Stacking (Stacked Generalization)
   - Train multiple base models on the same dataset.
   - Train a meta-model to combine base model outputs for final prediction.

4. Voting
   - Combine predictions from multiple different models by majority vote (classification) or averaging (regression).



"""


In [2]:
"""
Bagging (Bootstrap Aggregating) - Detailed Explanation

Definition:
Bagging is an ensemble learning technique designed to improve the stability 
and accuracy of machine learning algorithms by reducing variance and helping 
to avoid overfitting. It works by creating multiple versions of a predictor 
and using these to get an aggregated prediction.

How Bagging Works (Step-by-step):

1. Bootstrap Sampling:
   - From the original training dataset with N samples, generate multiple 
     new training datasets (called bootstrap samples).
   - Each bootstrap sample is created by randomly selecting N samples from 
     the original dataset **with replacement**.
   - Because of replacement, some original samples appear multiple times, 
     while others may be missing in each bootstrap sample.

2. Train Base Learners:
   - For each bootstrap sample, train a base learner (often a weak model like 
     a decision tree).
   - Each base learner is trained independently on its own bootstrap sample.

3. Aggregate Predictions:
   - For classification: aggregate by majority voting (the class predicted by 
     most models becomes the final prediction).
   - For regression: aggregate by averaging predictions from all models.
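
The three steps above can be sketched by hand. This is a minimal illustration, 
not how scikit-learn implements bagging internally. The ensemble size, the seed, 
and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

rng = np.random.default_rng(0)   # seed chosen arbitrarily for this sketch
n_models = 25                    # hypothetical ensemble size
n = len(X_train)
classes = np.unique(y_train)

trees = []
unique_fractions = []
for _ in range(n_models):
    # Step 1: draw N row indices with replacement (one bootstrap sample)
    idx = rng.integers(0, n, size=n)
    unique_fractions.append(len(np.unique(idx)) / n)
    # Step 2: fit an independent base learner on that sample
    tree = DecisionTreeClassifier(random_state=0)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Step 3: majority vote across the trees for every test row
all_preds = np.stack([t.predict(X_test) for t in trees])          # (n_models, n_test)
votes = np.stack([(all_preds == c).sum(axis=0) for c in classes])  # (n_classes, n_test)
y_pred = classes[votes.argmax(axis=0)]

print("mean fraction of distinct rows per bootstrap sample:", np.mean(unique_fractions))
print("manual bagging accuracy:", (y_pred == y_test).mean())
```

For large N, a bootstrap sample contains about 63% of the distinct original rows 
on average (1 - 1/e), which is why some rows are missing from each sample.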

Why Bagging Works:
- By training on different samples, each base learner sees a slightly different 
  dataset and thus makes different errors.
- Aggregating predictions reduces the overall variance and smooths out 
  overfitting from individual models.
- The ensemble typically outperforms any single base learner.

Important Characteristics:
- Parallelizable: Each base learner can be trained independently.
- Reduces variance but does NOT reduce bias significantly.
- Works best with unstable base learners (models that are sensitive to training 
  data variations), e.g., decision trees.

Example use case:
- Random Forest is a popular bagging method that builds many decision trees on 
  bootstrap samples and introduces additional randomness in feature selection.

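Mathematical Intuition:
- Suppose each base learner's prediction has variance σ².
- If M independent models are averaged, the variance of the average is σ²/M.
- Models trained on bootstrap samples of the same data are correlated, though. 
  With pairwise correlation ρ, the variance is ρσ² + (1 - ρ)σ²/M, so adding 
  more models cannot push it below ρσ².
- Hence the ensemble reduces variance most when its members are decorrelated, 
  which is what the extra feature randomness in Random Forest aims for.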
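
The σ²/M figure holds only for independent errors. A small simulation can check it 
and show what happens when the models' errors are correlated. The sketch assumes 
Gaussian errors with a common pairwise correlation rho; rho, M, and n_trials are 
illustrative values, not quantities from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0        # variance of each individual model's error (assumed)
M = 25              # number of models averaged (assumed)
n_trials = 100_000  # number of simulated ensembles


def variance_of_average(rho):
    # Build M errors with variance sigma2 and pairwise correlation rho
    # by mixing one shared component with M independent components.
    shared = rng.normal(size=(n_trials, 1))
    own = rng.normal(size=(n_trials, M))
    errors = np.sqrt(sigma2) * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * own)
    return errors.mean(axis=1).var()


for rho in (0.0, 0.5):
    formula = rho * sigma2 + (1 - rho) * sigma2 / M
    print(f"rho={rho}: simulated={variance_of_average(rho):.4f}, formula={formula:.4f}")
```

With rho = 0 the formula reduces to σ²/M. With rho > 0 the variance levels off 
near rho * σ² however many models are averaged.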

Limitations:
- Bagging does not improve bias (systematic errors).
- If base learner is very stable (like linear regression), bagging provides 
  little benefit.

--------------------------------------------------------
Simple Python example to illustrate Bagging with Decision Trees
--------------------------------------------------------
"""

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize a single Decision Tree (base learner)
dt = DecisionTreeClassifier(random_state=42)

# Train and test single decision tree for baseline
dt.fit(X_train, y_train)
y_pred_single = dt.predict(X_test)
print(f"Single Decision Tree Accuracy: {accuracy_score(y_test, y_pred_single):.4f}")

# Initialize Bagging Classifier with decision trees as base learners.
# scikit-learn 1.2 renamed `base_estimator` to `estimator`; from 1.4 on, the old
# keyword raises a TypeError.
bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=100,        # Number of trees in ensemble
                            max_samples=1.0,         # Each bootstrap sample has as many rows as the training set
                            bootstrap=True,          # Sample rows with replacement
                            random_state=42,
                            n_jobs=-1)               # Use all CPU cores for parallelism

# Train Bagging ensemble
bagging.fit(X_train, y_train)

# Predict on test data
y_pred_bagging = bagging.predict(X_test)

# Evaluate accuracy
print(f"Bagging Ensemble Accuracy: {accuracy_score(y_test, y_pred_bagging):.4f}")

"""
Detailed Breakdown:

- estimator=DecisionTreeClassifier() (named base_estimator before scikit-learn 1.2): 
  A decision tree is unstable, making it a good fit for bagging.
- n_estimators=100: Number of base learners trained on different bootstrap samples.
- max_samples=1.0: Each bootstrap sample is the same size as the training set.
- bootstrap=True: Sampling with replacement (bootstrap).
- n_jobs=-1: Use all CPU cores to speed up training in parallel.

Observations:
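- On noisy or harder datasets, the bagging ensemble is usually more accurate than a 
  single decision tree. On this Iris split the single tree already scores 1.0000, 
  so there is no room for improvement here.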
- Bagging reduces variance, leading to more stable and accurate predictions.
- Since models are independent, parallel training speeds up computation.
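
A single train/test split on a small dataset says little about which model is 
better. One way to compare them more fairly is cross-validation. The following is 
a hedged sketch: the fold count and ensemble size are arbitrary choices, and on a 
dataset as easy as Iris the gap between the two models may be small.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=42)
# The base learner is passed positionally, which sidesteps the
# base_estimator / estimator keyword rename between scikit-learn releases.
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=42), n_estimators=50, random_state=42
)

# 5-fold cross-validated accuracy for each model
tree_scores = cross_val_score(single_tree, X, y, cv=5)
bag_scores = cross_val_score(bagged_trees, X, y, cv=5)

print(f"single tree : {tree_scores.mean():.4f} +/- {tree_scores.std():.4f}")
print(f"bagged trees: {bag_scores.mean():.4f} +/- {bag_scores.std():.4f}")
```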

Summary:
Bagging is an effective and simple ensemble technique mainly focused on reducing variance.
It's ideal for models prone to overfitting on small data variations.
Random Forest is a special case of bagging that adds feature randomness.
"""

# End of explanation


Single Decision Tree Accuracy: 1.0000



In [None]:
"""
Boosting - Detailed Explanation

Definition:
Boosting is an ensemble learning technique that aims to create a strong classifier 
by sequentially combining multiple weak classifiers. Unlike bagging, where base 
learners are trained independently, boosting trains models sequentially, each 
trying to correct the errors of the previous ones.

Core Idea:
- Start with a weak learner (slightly better than random guessing).
- After each model is trained, increase the weight (importance) of the samples 
  that were misclassified.
- The next learner focuses more on these "hard" samples.
- Final prediction is a weighted combination of all base learners' predictions.

Why Boosting Works:
- By focusing on mistakes, the ensemble progressively reduces bias.
- Combines weak models into a strong one.
- Often achieves higher accuracy than bagging but may be more prone to overfitting if not regularized.

Key Characteristics:
- Models are trained sequentially, not independently.
- Each subsequent model focuses on the errors of the prior ensemble.
- Outputs a weighted vote or sum of predictions.
- Can reduce both bias and variance.
- Usually requires careful tuning (number of learners, learning rate).

Popular Boosting Algorithms:
1. AdaBoost (Adaptive Boosting)
2. Gradient Boosting Machines (GBM)
3. XGBoost (Extreme Gradient Boosting)
4. LightGBM, CatBoost (optimized gradient boosting variants)

Mathematical Intuition (AdaBoost):
- Assign equal weights to all training samples initially.
- Train weak learner on weighted data.
- Calculate weighted error rate.
- Compute learner's weight based on error (more accurate learner gets higher weight).
- Increase weights of misclassified samples so next learner focuses on them.
- Final prediction: weighted majority vote of all learners.
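
One round of this update can be worked through numerically. The sketch below 
assumes 10 samples of which the learner misclassifies 3, and uses the discrete 
AdaBoost form with learner weight log((1 - err) / err). The sample counts are 
made up for illustration.

```python
import numpy as np

n = 10
w = np.full(n, 1 / n)                 # equal initial weights
miss = np.zeros(n, dtype=bool)
miss[:3] = True                       # assume 3 of the 10 samples are misclassified

err = w[miss].sum() / w.sum()         # weighted error rate
alpha = np.log((1 - err) / err)       # learner weight: larger when err is smaller

w = w * np.exp(alpha * miss)          # raise the weights of the misclassified samples only
w = w / w.sum()                       # renormalise so the weights sum to 1

print(f"err={err:.3f}, alpha={alpha:.4f}")
print("weight of each misclassified sample:", w[miss][0])
print("weight of each correct sample:      ", w[~miss][0])
print("total weight on misclassified samples:", w[miss].sum())
```

After the update, the misclassified samples carry half of the total weight, so 
the learner that was just trained is no better than chance on the reweighted data. 
That forces the next learner to find a different split.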

--------------------------------------------------------
Python example: AdaBoost with Decision Trees
--------------------------------------------------------
"""

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
data = load_iris()
X, y = data.data, data.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Weak learner: Decision Tree stump (depth=1)
weak_learner = DecisionTreeClassifier(max_depth=1, random_state=42)

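# Initialize AdaBoost classifier with 50 weak learners.
# scikit-learn 1.2 renamed `base_estimator` to `estimator`; from 1.4 on, the old
# keyword raises a TypeError.
ada = AdaBoostClassifier(estimator=weak_learner, n_estimators=50, learning_rate=1.0, random_state=42)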

# Train AdaBoost model
ada.fit(X_train, y_train)

# Predict on test data
y_pred = ada.predict(X_test)

# Evaluate accuracy
print(f"AdaBoost Classifier Accuracy: {accuracy_score(y_test, y_pred):.4f}")

"""
Detailed Explanation:

- estimator=DecisionTreeClassifier(max_depth=1) (named base_estimator before scikit-learn 1.2): 
  Weak learner with low complexity.
- n_estimators=50: Maximum number of boosting rounds (weak learners).
- learning_rate=1.0: Scales the contribution of each learner to the final prediction. 
  Smaller values usually need more estimators but can reduce overfitting.

How AdaBoost Works Internally:

1. Initialize sample weights equally.
2. Train the first weak learner.
3. Compute error weighted by sample weights.
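4. Compute learner weight = log((1 - error) / error). For K classes, scikit-learn's 
   SAMME variant adds log(K - 1) and scales the result by learning_rate.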
5. Increase weights of misclassified samples.
6. Normalize weights so they sum to 1.
7. Train next learner on updated weights.
8. Repeat steps 3-7 for all learners.
9. Final prediction: weighted sum of all learners' predictions.
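
The steps above can be sketched from scratch for the two-class case. This is an 
illustrative implementation, not scikit-learn's own code. It assumes labels 
recoded as +1 / -1, keeps only two of the three Iris classes, and the number of 
rounds and all variable names are choices made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
mask = y != 0                                   # keep versicolor (1) and virginica (2)
X, y = X[mask], np.where(y[mask] == 2, 1, -1)   # relabel as +1 / -1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

n_rounds = 20                                   # hypothetical number of boosting rounds
w = np.full(len(X_train), 1 / len(X_train))     # step 1: equal weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=42)
    stump.fit(X_train, y_train, sample_weight=w)     # steps 2 and 7: fit on current weights
    miss = stump.predict(X_train) != y_train
    err = np.clip(w[miss].sum(), 1e-10, 1 - 1e-10)   # step 3, clipped to keep the log finite
    alpha = np.log((1 - err) / err)                  # step 4: learner weight
    w = w * np.exp(alpha * miss)                     # step 5: boost misclassified samples
    w = w / w.sum()                                  # step 6: renormalise
    stumps.append(stump)
    alphas.append(alpha)

# Step 9: sign of the alpha-weighted sum of the stumps' +1 / -1 votes
scores = sum(a * s.predict(X_test) for a, s in zip(alphas, stumps))
y_pred = np.where(scores >= 0, 1, -1)
print("from-scratch AdaBoost accuracy:", (y_pred == y_test).mean())
```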

Advantages of Boosting:
- Improves model accuracy by focusing on difficult samples.
- Can reduce both bias and variance.
- Works well with weak base learners.
- Often produces state-of-the-art results.

Disadvantages:
- More prone to overfitting if too many estimators or improper tuning.
- Sequential training means longer training times and less parallelism.
- Sensitive to noisy data and outliers.

Summary:
Boosting is a powerful ensemble method that builds a strong model by combining many weak learners in a sequential manner, focusing on correcting errors step-by-step. It is widely used in competitions and real-world applications for its accuracy.

"""

# End of explanation


In [None]:
"""
Stacking (Stacked Generalization) - Detailed Explanation

Definition:
Stacking is an ensemble learning technique that combines multiple different 
base models (called level-0 models) by training a higher-level model 
(called meta-model or level-1 model) to learn how to best combine their predictions.

How Stacking Works (Step-by-step):

1. Train multiple base learners (can be different types of models) 
   on the original training data.

2. Generate predictions from each base learner on a validation dataset 
   (or via cross-validation).

3. Use these predictions as input features to train a meta-model.

4. The meta-model learns to combine base learners' predictions optimally 
   to improve overall accuracy.

5. For final prediction on unseen data:
   - Get predictions from base learners.
   - Feed these predictions to the meta-model.
   - Meta-model outputs the final prediction.
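
These steps can be written out by hand to show where the meta-features come from. 
The sketch below is a simplified illustration, assuming two base models and 
out-of-fold probabilities from cross_val_predict. The model choices and variable 
names are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

base_models = [
    RandomForestClassifier(n_estimators=25, random_state=42),
    SVC(probability=True, random_state=42),
]

# Steps 1-2: out-of-fold class probabilities, so the meta-model never sees a
# prediction made by a model that was trained on that same row.
meta_train = np.hstack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")
    for m in base_models
])

# Steps 3-4: the meta-model learns how to combine the stacked probabilities
meta_model = LogisticRegression(max_iter=1000).fit(meta_train, y_train)

# Step 5: refit the base models on all training rows, then chain the predictions
meta_test = np.hstack([
    m.fit(X_train, y_train).predict_proba(X_test) for m in base_models
])
y_pred = meta_model.predict(meta_test)
print("manual stacking accuracy:", (y_pred == y_test).mean())
```

Each base model contributes one probability column per class, so the meta-model 
here trains on 2 x 3 = 6 features.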

Key Points:
- Base learners can be heterogeneous (different algorithms) or homogeneous.
- Meta-model is often a simple model like Logistic Regression or Linear Regression.
- Helps reduce bias and variance by leveraging complementary strengths of models.
- More complex than bagging or boosting.
- Requires careful validation to avoid overfitting (usually via cross-validation).

Advantages:
- Can combine diverse models to improve predictive performance.
- Often outperforms individual models and simpler ensembles.
- Flexible framework, adaptable to many problems.

Disadvantages:
- More complex to implement and tune.
- Requires extra data handling to generate meta-features.
- Training time is longer due to multiple models.
- Risk of overfitting meta-model if validation is not done properly.

--------------------------------------------------------
Python example using sklearn's StackingClassifier
--------------------------------------------------------
"""

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base learners (level-0 models)
base_learners = [
    ('rf', RandomForestClassifier(n_estimators=10, random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=10, random_state=42)),
    ('svm', SVC(probability=True, random_state=42))  # probability=True enables predict_proba, so the meta-model receives class probabilities
]

# Define meta learner (level-1 model)
meta_learner = LogisticRegression()

# Initialize stacking classifier
stacking_clf = StackingClassifier(
    estimators=base_learners,
    final_estimator=meta_learner,
    passthrough=False,    # If True, original features are also passed to meta-model
    cv=5                  # Use 5-fold CV to generate meta-features to avoid overfitting
)

# Train stacking model
stacking_clf.fit(X_train, y_train)

# Predict on test data
y_pred = stacking_clf.predict(X_test)

# Evaluate accuracy
print(f"Stacking Classifier Accuracy: {accuracy_score(y_test, y_pred):.4f}")

"""
Detailed Explanation:

- Base learners (Random Forest, Gradient Boosting, SVM) are trained on the training data.
- The stacking classifier internally performs cross-validation on the training set to create meta-features:
  predictions of base learners on held-out folds.
- Meta learner (Logistic Regression) trains on these meta-features to learn how to combine base models.
- Final predictions combine base learners' outputs through the meta learner.
- passthrough=False means only base learners' predictions are input to meta-model, not original features.

Why Use Stacking?
- Different models capture different patterns and have different strengths.
- Meta learner learns which models to trust more depending on the input.
- Can improve predictive performance beyond what a single bagging or boosting model achieves, though the gain is not guaranteed.

Best Practices:
- Use cross-validation to create meta-features to prevent overfitting.
- Keep meta learner simple to avoid complexity.
- Tune base learners individually before stacking.
- Consider using probabilities (predict_proba) instead of hard predictions for richer meta-features.

Summary:
Stacking is a powerful ensemble technique that "learns to learn" by training a meta-model to combine base learners’ predictions, often achieving superior results by exploiting the diversity of models.

"""

# End of explanation


In [None]:
"""
Voting Ensemble - Detailed Explanation

Definition:
Voting is one of the simplest ensemble learning methods where multiple different 
models (called base learners) are combined by aggregating their predictions 
to form a final prediction.

Types of Voting:
1. Hard Voting (Majority Voting):
   - Each base learner predicts a class label.
   - The final class prediction is the one that gets the majority of votes 
     from the base learners.

2. Soft Voting (Weighted Probability Averaging):
   - Each base learner predicts class probabilities.
   - The predicted probabilities are averaged (or weighted averaged).
   - The class with the highest average probability is chosen as final prediction.
   - Often performs better than hard voting when the base models produce well-calibrated 
     probabilities, because it uses their confidence and not just their labels.
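
The two rules can disagree on the same inputs. The sketch below uses made-up 
probabilities from three hypothetical classifiers for a single sample to show how.

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for ONE sample.
# Columns are classes A and B; the numbers are invented for illustration.
probs = np.array([
    [0.45, 0.55],   # classifier 1 leans slightly towards B
    [0.45, 0.55],   # classifier 2 leans slightly towards B
    [0.90, 0.10],   # classifier 3 is very confident in A
])
classes = np.array(["A", "B"])

# Hard voting: each classifier casts one vote for its most probable class
hard_votes = np.bincount(probs.argmax(axis=1), minlength=len(classes))
hard_winner = classes[hard_votes.argmax()]

# Soft voting: average the probabilities, then pick the largest
mean_probs = probs.mean(axis=0)
soft_winner = classes[mean_probs.argmax()]

print("hard votes per class:", hard_votes, "->", hard_winner)
print("mean probabilities:  ", mean_probs, "->", soft_winner)
```

Hard voting picks B (two votes to one), while soft voting picks A because the 
one confident classifier outweighs two lukewarm ones in the average.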

Why Voting Works:
- Combines multiple diverse models to reduce the risk of individual model errors.
- Aggregating multiple opinions usually leads to better generalization.
- Voting ensembles often improve accuracy and robustness compared to single models.

Characteristics:
- Models can be heterogeneous (different algorithms) or homogeneous.
- Voting does not explicitly try to reduce bias or variance like boosting or bagging but improves stability.
- Easy to implement and computationally efficient compared to stacking or boosting.

Limitations:
- Voting ensembles treat each base model equally (unless weights are applied).
- Hard voting ignores confidence (probability) information.
- Does not learn how to combine models optimally (unlike stacking).

--------------------------------------------------------
Python example using sklearn VotingClassifier
--------------------------------------------------------
"""

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load dataset
data = load_iris()
X, y = data.data, data.target

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define base models
log_clf = LogisticRegression(max_iter=1000, random_state=42)  # extra iterations give the solver room to converge on unscaled features
dt_clf = DecisionTreeClassifier(random_state=42)
svm_clf = SVC(probability=True, random_state=42)  # probability=True for soft voting

# Initialize Voting Classifier with hard voting
voting_hard = VotingClassifier(
    estimators=[('lr', log_clf), ('dt', dt_clf), ('svm', svm_clf)],
    voting='hard'  # Use majority voting on predicted classes
)

# Train hard voting ensemble
voting_hard.fit(X_train, y_train)

# Predict with hard voting
y_pred_hard = voting_hard.predict(X_test)
print(f"Hard Voting Accuracy: {accuracy_score(y_test, y_pred_hard):.4f}")

# Initialize Voting Classifier with soft voting
voting_soft = VotingClassifier(
    estimators=[('lr', log_clf), ('dt', dt_clf), ('svm', svm_clf)],
    voting='soft'  # Average predicted probabilities
)

# Train soft voting ensemble
voting_soft.fit(X_train, y_train)

# Predict with soft voting
y_pred_soft = voting_soft.predict(X_test)
print(f"Soft Voting Accuracy: {accuracy_score(y_test, y_pred_soft):.4f}")

"""
Explanation:

- Hard Voting:
  * Each model predicts a class label.
  * Final prediction is the class that gets the most votes.
  * Easy to understand and implement.
  
- Soft Voting:
  * Each model outputs class probabilities.
  * Probabilities are averaged across models.
  * The class with the highest average probability is selected.
  * Often gives better results when the probabilities are well calibrated, because it takes model confidence into account.
  
- Models Used:
  * Logistic Regression, Decision Tree, and SVM provide diversity.
  * SVM needs probability=True to output probabilities for soft voting.
  
- VotingClassifier:
  * 'estimators' is a list of (name, model) tuples.
  * 'voting' param controls hard or soft voting.

Summary:
Voting ensembles aggregate predictions of multiple base models either by majority vote (hard) or averaging probabilities (soft) to improve robustness and accuracy with minimal complexity.

"""

# End of explanation
