AdaBoost 

What is AdaBoost, and how does it differ from gradient boosting?

Who invented AdaBoost and when? (Freund & Schapire; introduced in 1995, journal version 1997)

Explain the "adaptive" part: how are sample weights adapted from one round to the next?


Write the AdaBoost algorithm pseudocode for binary classification.

What are sample weights and how are they updated?

How are weak learners combined? (Weighted majority voting)

What is the classifier weight α_t and how is it calculated?


Derive the formula for classifier weight: α_t = ½ ln((1-ε_t)/ε_t)

How are sample weights updated? w_i^{(t+1)} = w_i^{(t)} exp(-α_t y_i h_t(x_i)) / Z_t, where Z_t normalizes the weights to sum to 1

Explain the exponential loss function: L(y, F(x)) = exp(-yF(x))

Show that AdaBoost minimizes exponential loss via stagewise additive modeling.
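The derivation asked for above can be sketched as follows. This is a standard stagewise argument, assuming labels y_i ∈ {-1, +1}, weak learners h_t(x) ∈ {-1, +1}, and weights normalized to sum to 1; J(α) is a name introduced here for the weighted exponential loss of round t.

```latex
\begin{align*}
F_t(x) &= F_{t-1}(x) + \alpha\, h_t(x), \qquad
w_i^{(t)} \propto \exp\!\bigl(-y_i F_{t-1}(x_i)\bigr), \quad \sum_i w_i^{(t)} = 1 \\
J(\alpha) &= \sum_i w_i^{(t)} \exp\!\bigl(-\alpha\, y_i h_t(x_i)\bigr)
  = e^{-\alpha}(1-\varepsilon_t) + e^{\alpha}\varepsilon_t,
  \qquad \varepsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} w_i^{(t)} \\
\frac{dJ}{d\alpha} &= -e^{-\alpha}(1-\varepsilon_t) + e^{\alpha}\varepsilon_t = 0
  \;\Longrightarrow\; e^{2\alpha} = \frac{1-\varepsilon_t}{\varepsilon_t}
  \;\Longrightarrow\; \alpha_t = \tfrac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}
\end{align*}
```

Since w_i^{(t+1)} ∝ exp(-y_i F_t(x_i)) = w_i^{(t)} exp(-α_t y_i h_t(x_i)), the sample weight update is the same stagewise step seen from the point of view of the next round.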


Weak learners: Typically decision stumps (depth-1 trees)

Error rate ε_t: Weighted error of weak learner

D_t: Distribution over training samples

Final classifier: H(x) = sign(∑ α_t h_t(x))
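The pieces listed above (D_t, ε_t, α_t, the weight update, and the final sign vote) can be put together in a minimal sketch. It assumes labels in {-1, +1} and exhaustive-search decision stumps; the function names (fit_stump, adaboost_fit, adaboost_predict) are illustrative and not taken from any library.

```python
import math

def fit_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                # The stump predicts `polarity` when x_j <= thr, else `-polarity`.
                err = sum(
                    w_i
                    for row, y_i, w_i in zip(X, y, w)
                    if (polarity if row[j] <= thr else -polarity) != y_i
                )
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] <= thr else -polarity

def adaboost_fit(X, y, n_rounds=10):
    """Return a list of (alpha_t, stump_t) pairs."""
    n = len(X)
    w = [1.0 / n] * n                      # D_1: uniform distribution
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y, w)
        # Clip so a perfect stump does not give an infinite alpha.
        eps = min(max(stump[0], 1e-10), 1 - 1e-10)
        if eps >= 0.5:                     # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, stump))
        # Up-weight mistakes, down-weight correct samples, then renormalize.
        w = [
            w_i * math.exp(-alpha * y_i * stump_predict(stump, row))
            for row, y_i, w_i in zip(X, y, w)
        ]
        total = sum(w)
        w = [w_i / total for w_i in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """H(x) = sign(sum_t alpha_t h_t(x)); a zero score is mapped to +1."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Toy 1-D problem that no single stump can separate.
X = [[0], [1], [2], [3], [4], [5]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost_fit(X, y, n_rounds=3)
predictions = [adaboost_predict(model, row) for row in X]
```

The exhaustive stump search is quadratic in the number of samples per feature, which is fine for illustration but not for real data sets.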

Implementation Variants
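AdaBoost.M1: Direct multi-class extension of the original algorithm (coincides with it for binary classification)

AdaBoost.M2: Multi-class extension based on a pseudo-loss

SAMME and SAMME.R (multi-class variants; SAMME.R has been deprecated in recent scikit-learn releases)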

Real AdaBoost: Using confidence-rated predictions

Theoretical Properties
Why does AdaBoost focus on "hard" examples?

What is the margin theory for AdaBoost?

How does AdaBoost relate to forward stagewise additive modeling?

What is the training error bound? (Decreases exponentially)
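The exponential decrease mentioned above comes from the standard bound: training error ≤ ∏_t 2√(ε_t(1-ε_t)) ≤ exp(-2 ∑_t γ_t²), with γ_t = ½ - ε_t the edge over random guessing. A small numeric sketch, with illustrative function names:

```python
import math

def training_error_bound(epsilons):
    """Product of the normalizers Z_t = 2 * sqrt(eps_t * (1 - eps_t))."""
    bound = 1.0
    for eps in epsilons:
        bound *= 2.0 * math.sqrt(eps * (1.0 - eps))
    return bound

def exponential_bound(epsilons):
    """Looser bound exp(-2 * sum(gamma_t^2)) with gamma_t = 0.5 - eps_t."""
    return math.exp(-2.0 * sum((0.5 - eps) ** 2 for eps in epsilons))

# Ten rounds of weak learners that each reach 30% weighted error.
eps_per_round = [0.3] * 10
tight = training_error_bound(eps_per_round)
loose = exponential_bound(eps_per_round)
```

Each factor is below 1 whenever ε_t ≠ 0.5, so the bound shrinks geometrically as long as every round keeps some edge.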

Advantages & Limitations
Advantages:

Simple to implement

Often resistant to overfitting in practice, especially on low-noise data

Implicit feature selection (each stump splits on a single feature)

No need for extensive parameter tuning

Limitations:

Sensitive to noisy data and outliers

Each weak learner must beat random guessing (ε_t < 0.5)

Can be slow with many weak learners

Less popular than gradient boosting today

How to choose the number of estimators in AdaBoost?

What happens if a weak learner has error > 0.5?

How does AdaBoost handle multi-class problems?

Compare AdaBoost vs. Random Forest vs. Gradient Boosting.



Voting Classifiers

What is a voting classifier/regressor?

Explain hard voting vs. soft voting

What is the wisdom of crowds principle in ensemble learning?

Types of Voting
Hard Voting (Majority Voting):

Final prediction = mode of individual predictions

For classification only

Simple but effective

Soft Voting (Probability Averaging):

Final prediction = argmax of the (optionally weighted) average of predicted class probabilities

Requires classifiers to output probabilities

Often performs better than hard voting

Weighted Voting:

Assign different weights to different models

How to determine optimal weights?
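The three voting types above can be sketched from scratch in a few lines. The sketch works on the predictions for a single sample; hard_vote and soft_vote are illustrative names, and the probabilities are made up to show that hard and soft voting can disagree.

```python
from collections import Counter

def hard_vote(labels):
    """Majority vote over predicted labels (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probas, weights=None):
    """Argmax of the (optionally weighted) mean of class-probability vectors.

    Returns (winning_class_index, averaged_probabilities).
    """
    if weights is None:
        weights = [1.0] * len(probas)
    total = sum(weights)
    n_classes = len(probas[0])
    avg = [
        sum(w * p[c] for w, p in zip(weights, probas)) / total
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=avg.__getitem__), avg

# Two models lean slightly towards class 1, one is very sure of class 0.
probas = [[0.45, 0.55], [0.40, 0.60], [0.90, 0.10]]
labels = [max(range(2), key=p.__getitem__) for p in probas]

hard = hard_vote(labels)                               # counts labels only
soft, _ = soft_vote(probas)                            # uses confidence
weighted, _ = soft_vote(probas, weights=[3, 3, 1])     # trusts models 1 and 2 more
```

In this example the confident third model flips the unweighted soft vote, while the weights flip it back; weights like these are usually tuned on held-out data rather than set by hand.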

Algorithm & Implementation
How to implement voting from scratch?

scikit-learn: VotingClassifier and VotingRegressor

How to choose diverse base models? (Heterogeneous ensembles)

Theoretical Basis
Condorcet's Jury Theorem: Conditions for majority voting superiority

No Free Lunch Theorem: No single model is best on every problem, which motivates combining diverse models

Error reduction through averaging
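Condorcet's Jury Theorem can be checked numerically. The sketch below assumes an odd number of voters that are independent and share the same probability of being correct, which real classifiers rarely satisfy; majority_vote_accuracy is an illustrative name.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """P(more than half of n independent voters are correct); n should be odd."""
    k_min = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct**k * (1 - p_correct) ** (n_voters - k)
        for k in range(k_min, n_voters + 1)
    )

# Accuracy of the majority as the jury grows, for voters slightly
# better than chance (p = 0.6) and slightly worse (p = 0.4).
sizes = [1, 3, 11, 51, 101]
better_than_chance = [majority_vote_accuracy(n, 0.6) for n in sizes]
worse_than_chance = [majority_vote_accuracy(n, 0.4) for n in sizes]
```

With p > 0.5 the majority accuracy climbs towards 1 as the jury grows; with p < 0.5 it falls towards 0, which is why each member has to beat chance.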

Advantages & Use Cases
Advantages:

Simple to implement and understand

Can combine different types of models

Often improves over single models

Parallelizable (models trained independently)

Use Cases:

Combining fundamentally different algorithms

When computational resources allow multiple models

Competitions where blending helps

Common Combinations
Logistic Regression + Random Forest + SVM

Linear models + tree-based models + neural networks

Different preprocessing pipelines with same algorithm



Stacking Algorithm
Split training data into K folds

Train base models on K-1 folds, predict on holdout fold

Repeat for all folds to get out-of-fold predictions

Train meta-model on out-of-fold predictions

Retrain base models on full training data
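The five steps above can be sketched without any library. The sketch assumes 1-D numeric features, contiguous folds (so data should be shuffled beforehand), and a "fit function" convention where fit(X, y) returns a predict callable. The two base learners and the two-model convex blend are hypothetical stand-ins chosen to keep the example short.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds."""
    folds, start = [], 0
    for i in range(k):
        size = n_samples // k + (1 if i < n_samples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold_predictions(fit_fns, X, y, k=5):
    """Row i holds each base model's prediction for sample i, made by a
    model that never saw sample i during training."""
    n = len(X)
    oof = [[None] * len(fit_fns) for _ in range(n)]
    for holdout in kfold_indices(n, k):
        held = set(holdout)
        train = [i for i in range(n) if i not in held]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        for m, fit in enumerate(fit_fns):
            predict = fit(X_tr, y_tr)
            for i in holdout:
                oof[i][m] = predict(X[i])
    return oof

def fit_mean(X, y):
    """Base learner 1: always predicts the training mean."""
    mean = sum(y) / len(y)
    return lambda x: mean

def fit_nearest_neighbor(X, y):
    """Base learner 2: 1-nearest-neighbor on a scalar feature."""
    def predict(x):
        nearest = min(range(len(X)), key=lambda j: abs(X[j] - x))
        return y[nearest]
    return predict

def fit_convex_blend(Z, y):
    """Meta-learner for exactly two base models: least-squares weight
    a in [0, 1] for the blend a * z[0] + (1 - a) * z[1]."""
    num = sum((t - z[1]) * (z[0] - z[1]) for z, t in zip(Z, y))
    den = sum((z[0] - z[1]) ** 2 for z in Z)
    a = min(1.0, max(0.0, num / den)) if den > 0 else 0.5
    return lambda z: a * z[0] + (1 - a) * z[1]

def stack_fit(fit_fns, fit_meta, X, y, k=5):
    oof = out_of_fold_predictions(fit_fns, X, y, k)    # steps 1-3
    meta = fit_meta(oof, y)                            # step 4
    base = [fit(X, y) for fit in fit_fns]              # step 5
    return lambda x: meta([predict(x) for predict in base])

X = [float(i) for i in range(10)]
y = [2 * i for i in range(10)]
model = stack_fit([fit_mean, fit_nearest_neighbor], fit_convex_blend, X, y, k=5)
prediction = model(4.0)
```

The key point is that the meta-learner is trained only on out-of-fold rows; training it on in-sample base predictions would reward base models that memorize the data.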

Key Components
Base learners: Diverse models (heterogeneous stacking)

Meta-learner: Typically simple model (linear regression, logistic regression)

Out-of-Fold (OOF) predictions: Prevent data leakage

Stacking layers: Can have multiple levels (deep stacking)

Implementation Details
How to prevent target leakage in stacking?

What is the difference between blending and stacking? (Blending trains the meta-model on a single holdout set instead of out-of-fold predictions)

How to handle multi-class problems in stacking?

Should base models be correlated or uncorrelated?

Variants of Stacking
Simple stacking: One meta-model

Multi-level stacking: Multiple stacking layers

Feature-weighted linear stacking: Blend weights are linear functions of meta-features

Bayesian stacking: Combines predictive distributions; an alternative to Bayesian model averaging

Meta-Learner Choices
Linear models: Linear regression, logistic regression

Tree-based: LightGBM, XGBoost (risk of overfitting)

Neural networks: Can capture complex interactions

Ridge regression: Good default choice

Advantages & Challenges
Advantages:

Can capture strengths of different models

Often highest performance in competitions

Flexible framework

Challenges:

Complex to implement correctly

Risk of overfitting

Computationally expensive

Hard to interpret

When to Use Which?

Voting: Quick improvement over single models

Bagging (RF): Reduce variance; base models can be trained in parallel

Boosting: Maximize accuracy, reduce bias

Stacking: Competition settings, maximize performance

Blending: Simpler alternative to stacking

Performance Considerations
Diversity: Essential for all ensemble methods

Correlation: Uncorrelated errors improve ensembles

Computational cost (typical ordering): Stacking > Boosting > Bagging > Voting

Interpretability (typical ordering): Voting > Bagging > Boosting > Stacking


Model Diversity
Why is diversity important in ensembles?

How to measure model diversity? (Q-statistic, correlation, disagreement)

Techniques to increase diversity:

Different algorithms

Different hyperparameters

Different feature subsets

Different training subsets
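Two of the diversity measures named above, the disagreement measure and the Q-statistic, can be computed from per-sample correctness indicators. A minimal sketch with illustrative function names, where each input list holds 1 (or True) if the classifier got that sample right:

```python
def pairwise_counts(correct_a, correct_b):
    """Return (n11, n10, n01, n00): both right, only A right, only B right, both wrong."""
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

def disagreement(correct_a, correct_b):
    """Fraction of samples on which exactly one classifier is right."""
    n11, n10, n01, n00 = pairwise_counts(correct_a, correct_b)
    return (n10 + n01) / (n11 + n10 + n01 + n00)

def q_statistic(correct_a, correct_b):
    """Yule's Q: +1 when the classifiers err on the same samples,
    negative when they tend to err on different samples."""
    n11, n10, n01, n00 = pairwise_counts(correct_a, correct_b)
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        return 0.0   # undefined case; 0.0 is a convention chosen for this sketch
    return (n11 * n00 - n01 * n10) / denominator

model_a = [1, 1, 1, 0, 0, 1]
model_b = [1, 0, 1, 0, 1, 1]
dis = disagreement(model_a, model_b)
q = q_statistic(model_a, model_b)
```

For a whole ensemble these pairwise values are usually averaged over all pairs of members.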

Dynamic Classifier Selection
DCS: Choose best classifier per instance

DES: Dynamic ensemble selection

OLA (Overall Local Accuracy)

LCA (Local Class Accuracy)

Ensemble Pruning
Why prune ensembles? (Redundancy, overfitting)

Methods: Ranking-based, clustering-based, optimization-based

How many models in an ensemble? (Law of diminishing returns)

Online Ensemble Learning
Online Bagging (Oza & Russell)

Online Boosting

Adapting to concept drift
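Online bagging in the style of Oza & Russell replaces bootstrap resampling with a per-example draw: each base model is updated k ~ Poisson(1) times on every incoming example. A minimal sketch, in which MajorityClassLearner is a deliberately trivial, hypothetical incremental learner and the class and method names are illustrative:

```python
import math
import random
from collections import Counter

def poisson1(rng):
    """Draw k ~ Poisson(1) using Knuth's multiplication method."""
    limit = math.exp(-1.0)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

class MajorityClassLearner:
    """Hypothetical incremental learner: predicts the most frequent label seen."""
    def __init__(self):
        self.counts = Counter()

    def update(self, x, y):
        self.counts[y] += 1

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

class OnlineBagging:
    def __init__(self, make_learner, n_models=5, seed=0):
        self.rng = random.Random(seed)
        self.models = [make_learner() for _ in range(n_models)]

    def update(self, x, y):
        # Each model sees the example 0, 1, 2, ... times, which mimics
        # how often it would appear in a bootstrap sample.
        for model in self.models:
            for _ in range(poisson1(self.rng)):
                model.update(x, y)

    def predict(self, x):
        votes = [model.predict(x) for model in self.models]
        votes = [v for v in votes if v is not None]
        return Counter(votes).most_common(1)[0][0] if votes else None

bag = OnlineBagging(MajorityClassLearner, n_models=5, seed=1)
for i in range(50):                      # stream with 80% label 1
    bag.update(i, 1 if i % 5 else 0)
prediction = bag.predict(0)
```

Any learner with update and predict methods could be plugged in; handling concept drift would additionally need some way to forget or replace stale members, which this sketch does not attempt.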

AdaBoost
Prove that AdaBoost's training error decreases exponentially

Derive the weight update rule from loss minimization perspective

Why must weak learners have error < 0.5?

Voting
Prove that for independent classifiers with error p < 0.5, majority voting error → 0 as N → ∞

What is the Condorcet Jury Theorem and its assumptions?

Stacking
Why does stacking with cross-validation prevent overfitting?

How does stacking relate to Bayesian Model Averaging?

General
What is the bias-variance-covariance decomposition for ensembles?

How does ambiguity decomposition explain ensemble success?
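The ambiguity decomposition (Krogh & Vedelsby) states that, for a convex-weighted regression ensemble, the squared error of the ensemble equals the weighted average member error minus the weighted spread of the members around the ensemble prediction. A numeric check for a single target value, with an illustrative function name:

```python
def ambiguity_decomposition(preds, target, weights=None):
    """Return (ensemble_error, average_member_error, ambiguity) for one sample.

    Weights must be non-negative and sum to 1; uniform by default.
    """
    m = len(preds)
    if weights is None:
        weights = [1.0 / m] * m
    ensemble = sum(w * f for w, f in zip(weights, preds))
    ensemble_error = (ensemble - target) ** 2
    average_error = sum(w * (f - target) ** 2 for w, f in zip(weights, preds))
    ambiguity = sum(w * (f - ensemble) ** 2 for w, f in zip(weights, preds))
    return ensemble_error, average_error, ambiguity

ens_err, avg_err, amb = ambiguity_decomposition([1.0, 2.0, 4.0], target=2.0)
gap = ens_err - (avg_err - amb)   # zero up to floating-point rounding
```

Because the ambiguity term is never negative, the ensemble is never worse than the average member, and it improves exactly as much as the members disagree.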


AdaBoost
How to handle imbalanced data in AdaBoost?

What's the effect of increasing number of estimators?

How to implement AdaBoost with custom weak learners?

Stacking
How to choose base models for stacking?

What meta-learner works best?

How to prevent overfitting in multi-layer stacking?

How to handle different prediction types (probabilities vs. labels)?

Voting
How to determine optimal weights for weighted voting?

Should you use calibrated probabilities for soft voting?

How to handle models with different prediction speeds?


Conceptual
Explain AdaBoost to a non-technical person

Why does boosting often outperform bagging?

When would you choose stacking over a single complex model?

What's the difference between bagging, boosting, and stacking?

Technical
How would you implement stacking without data leakage?

What happens if all base models in voting make the same error?

Why does AdaBoost use decision stumps as weak learners?

How do you handle missing predictions in voting classifiers?

Practical
You have 5 models with accuracies: 0.85, 0.86, 0.84, 0.87, 0.83. Should you ensemble them?

How would you combine a neural network and gradient boosting model?

What metrics would you use to evaluate ensemble diversity?

How to deploy an ensemble model in production efficiently?
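For the five-model question above, a best-case estimate can be computed by enumerating all outcome patterns. The sketch assumes the models' errors are independent, which is optimistic; models trained on the same data are usually correlated, so the real gain is smaller. The function name is illustrative.

```python
from itertools import product

def majority_accuracy_independent(accuracies):
    """P(majority correct) for independent binary classifiers with given accuracies."""
    need = len(accuracies) // 2 + 1
    total = 0.0
    for outcome in product((True, False), repeat=len(accuracies)):
        if sum(outcome) >= need:
            prob = 1.0
            for is_correct, acc in zip(outcome, accuracies):
                prob *= acc if is_correct else 1.0 - acc
            total += prob
    return total

accuracies = [0.85, 0.86, 0.84, 0.87, 0.83]
upper_estimate = majority_accuracy_independent(accuracies)
best_single = max(accuracies)
```

Comparing upper_estimate with best_single shows the most an ensemble of these models could gain; measuring the pairwise error correlation on validation data tells how much of that gain is realistic.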


AdaBoost
Face detection (Viola-Jones)

Text classification

Customer churn prediction

Stacking
Kaggle competitions (many winning solutions)

Netflix Prize (blended models)

Financial forecasting

Voting
Medical diagnosis systems

Fraud detection

Quality control systems

