# Boosting Techniques – Assignment Solutions



## Question 1
**What is Boosting in Machine Learning? Explain how it improves weak learners.**

**Answer:**
Boosting is an ensemble learning technique that combines multiple **weak learners** (models that perform slightly better than random guessing) to form a strong predictive model. Models are trained **sequentially**, where each new model focuses more on the samples that previous models misclassified.

**How it improves weak learners:**
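- Emphasizes the samples the current ensemble gets wrong (higher sample weights in AdaBoost, larger residuals in Gradient Boosting)
- Forces each subsequent model to focus on these hard-to-learn patterns
- Aggregates the learners' predictions (weighted voting or summation)
- Reduces bias step by step, which improves overall accuracy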
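
The reweighting loop can be illustrated with a minimal from-scratch sketch of AdaBoost on a one-dimensional toy problem. This is an illustration under stated assumptions, not scikit-learn's implementation: the helper names (`fit_stump`, `adaboost_fit`, `adaboost_predict`) and the toy data are hypothetical, and the weak learner is a single-threshold decision stump.

```python
import math

def stump_predict(x, threshold, polarity):
    """Predict +1/-1: `polarity` at or below the threshold, its opposite above it."""
    return polarity if x <= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return the (threshold, polarity, weighted_error) stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[2]:
                best = (threshold, polarity, error)
    return best

def adaboost_fit(xs, ys, n_rounds):
    """Train `n_rounds` stumps sequentially, reweighting the samples after each one."""
    n = len(xs)
    weights = [1.0 / n] * n  # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        threshold, polarity, error = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this learner's vote
        ensemble.append((alpha, threshold, polarity))
        # Misclassified samples (y * prediction = -1) are scaled up,
        # correctly classified ones are scaled down.
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single threshold can separate
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 3):
    model = adaboost_fit(xs, ys, rounds)
    correct = sum(adaboost_predict(model, x) == y for x, y in zip(xs, ys))
    print(f"{rounds} round(s): {correct}/{len(xs)} training samples correct")
```

On this toy set a single stump gets 7 of 10 samples right, while the weighted vote of three stumps classifies all 10 correctly, which is the "weak learners become a strong learner" effect in miniature.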

## Question 2
**Difference between AdaBoost and Gradient Boosting**

**Answer:**
- **AdaBoost:** Adjusts sample weights after each iteration. Misclassified samples receive higher weights so the next learner focuses on them.
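- **Gradient Boosting:** Fits each new model to the **negative gradient of the loss** with respect to the current ensemble's predictions (the pseudo-residuals), which amounts to gradient descent in function space. For squared-error loss, these pseudo-residuals are simply the ordinary **residual errors**.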

**Key Difference:** AdaBoost focuses on re-weighting data points, while Gradient Boosting optimizes a loss function using gradients.
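
To make the residual-fitting idea concrete, here is a minimal sketch of gradient boosting for squared-error loss, where the negative gradient equals the residual. It is a simplified illustration rather than scikit-learn's algorithm: the function names and toy data are hypothetical, and the weak learner is a one-split regression stump on a single feature.

```python
def fit_regression_stump(xs, targets):
    """Least-squares single split: returns (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # the largest x would leave the right side empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def gradient_boost_fit(xs, ys, n_rounds, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)  # the constant that minimizes squared error
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For L = 1/2 * (y - F)^2 the negative gradient w.r.t. F is the residual y - F
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_regression_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        # Take a small step in the direction the stump suggests
        preds = [p + learning_rate * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def gradient_boost_predict(model, x):
    """Base prediction plus the shrunken sum of all stump outputs."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )

def mse(model, xs, ys):
    """Mean squared error of the model on (xs, ys)."""
    return sum((y - gradient_boost_predict(model, x)) ** 2
               for x, y in zip(xs, ys)) / len(ys)

# Toy regression problem: y = x^2
xs = list(range(10))
ys = [x ** 2 for x in xs]

for rounds in (0, 5, 50):
    model = gradient_boost_fit(xs, ys, rounds)
    print(f"{rounds:>2} rounds: training MSE = {mse(model, xs, ys):.2f}")
```

Unlike AdaBoost, no sample weights appear anywhere: the "focus on hard cases" comes entirely from the residuals being largest where the current ensemble is most wrong.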

## Question 3
**How does regularization help in XGBoost?**

**Answer:**
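XGBoost adds a regularization term to its training objective: L1 (Lasso-style, `reg_alpha`) and L2 (Ridge-style, `reg_lambda`) penalties on the leaf weights, plus a fixed cost per leaf (`gamma`). Together with structural limits such as `max_depth`, regularization helps prevent overfitting by:
- Shrinking large leaf weights
- Rejecting splits whose gain does not justify an additional leaf
- Encouraging simpler trees that generalize better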
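
For reference, the regularized objective can be written out as below. This follows the formulation in the XGBoost paper and documentation; here $T$ is the number of leaves in a tree, $w_j$ the weight of leaf $j$, and $G_j$, $H_j$ the sums of first- and second-order loss gradients over the samples in leaf $j$. The closed forms for the leaf weight and the split gain are stated for the case $\alpha = 0$.

```latex
\mathcal{L} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2} + \alpha \sum_{j=1}^{T} \lvert w_j \rvert

w_j^{*} = -\frac{G_j}{H_j + \lambda}
\qquad (\alpha = 0)

\text{Gain} = \frac{1}{2}\left[
  \frac{G_L^{2}}{H_L + \lambda} + \frac{G_R^{2}}{H_R + \lambda}
  - \frac{(G_L + G_R)^{2}}{H_L + H_R + \lambda}
\right] - \gamma
```

A larger $\lambda$ shrinks every leaf weight toward zero, and a split is only worthwhile when its gain stays positive after subtracting $\gamma$.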

## Question 4
**Why is CatBoost efficient for categorical data?**

**Answer:**
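CatBoost handles categorical variables internally using **ordered target statistics**, a form of target encoding in which each row's encoding is computed only from the rows that precede it in a random permutation of the training data. This avoids the dimensionality explosion of one-hot encoding on high-cardinality features, and because a row's own label never contributes to its encoding, it prevents target leakage and reduces overfitting.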
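
The order-aware idea can be sketched in a few lines of plain Python. This is a simplified illustration, not CatBoost's actual code: the function name, the `prior_weight` parameter, and the toy data are hypothetical, and the rows are processed in their given order, whereas CatBoost uses random permutations.

```python
from collections import defaultdict

def ordered_target_encode(categories, targets, prior, prior_weight=1.0):
    """Encode each row using only the targets of earlier rows in the same category."""
    sums = defaultdict(float)   # running sum of targets seen so far, per category
    counts = defaultdict(int)   # running count of rows seen so far, per category
    encoded = []
    for category, target in zip(categories, targets):
        # The current row's own target is NOT included, so it cannot leak
        encoded.append(
            (sums[category] + prior_weight * prior) / (counts[category] + prior_weight)
        )
        sums[category] += target
        counts[category] += 1
    return encoded

cities = ["A", "B", "A", "A", "B"]
defaulted = [1, 0, 0, 1, 1]
prior = sum(defaulted) / len(defaulted)  # global default rate: 0.6

encoded = ordered_target_encode(cities, defaulted, prior)
for city, value in zip(cities, encoded):
    print(city, round(value, 3))
```

The first row of each category falls back to the prior (0.6), and later rows blend the prior with the history of their category: the third row ("A") becomes (1 + 0.6) / (1 + 1) = 0.8.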

## Question 5
**Real-world applications where boosting is preferred over bagging**

**Answer:**
- Credit risk and loan default prediction
- Fraud detection
- Customer churn prediction
- Medical diagnosis
- Search ranking systems

Boosting is preferred when **reducing bias** and capturing complex patterns is critical.

## Question 6
**AdaBoost Classifier – Breast Cancer Dataset**

In [1]:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train AdaBoost
model = AdaBoostClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Prediction and accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
accuracy


0.9707602339181286

## Question 7
**Gradient Boosting Regressor – California Housing Dataset**

In [2]:

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Load dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train, y_train)

# Prediction and R2 score
y_pred = gbr.predict(X_test)
r2_score(y_test, y_pred)


0.7803012822391022

## Question 8
**XGBoost Classifier with Hyperparameter Tuning**

In [None]:

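from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# Load classification data explicitly (assumed: the Breast Cancer set from
# Question 6) rather than relying on variables left over from earlier cells,
# since Question 7 rebinds X_train/y_train to regression data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted
xgb = XGBClassifier(eval_metric='logloss', random_state=42)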

param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100]
}

grid = GridSearchCV(xgb, param_grid, cv=3, scoring='accuracy')
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
best_params = grid.best_params_
best_accuracy = best_model.score(X_test, y_test)

best_params, best_accuracy


## Question 9
**CatBoost Classifier & Confusion Matrix**

In [4]:
!pip install catboost


In [None]:

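from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
import seaborn as sns
import matplotlib.pyplot as plt

# Load classification data explicitly (assumed: the Breast Cancer set from
# Question 6), since Question 7 rebinds X_train/y_train to regression data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

cat_model = CatBoostClassifier(verbose=0, random_state=42)
cat_model.fit(X_train, y_train)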

y_pred = cat_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='d')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()


## Question 10
**FinTech Loan Default Prediction – Boosting Pipeline**

**Answer (Step-by-step):**
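1. Handle missing values using the median (numeric) and the mode (categorical)
2. Encode categorical variables; CatBoost can do this internally when the columns are passed via `cat_features`
3. Use CatBoost or XGBoost, which cope well with mixed numeric and categorical data; address class imbalance with class weights (for example `scale_pos_weight`)
4. Apply GridSearchCV for hyperparameter tuning
5. Evaluate with ROC-AUC, precision-recall, and F1-score rather than accuracy, which is misleading when defaults are rare
6. Business benefit: reduced loan defaults, better risk control, higher profitability

CatBoost is preferred because it natively handles categorical variables; class imbalance still has to be addressed explicitly, for example through class weights.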
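
The choice of evaluation metrics in step 5 can be illustrated with a small plain-Python sketch. The data below is made up for illustration (100 loans, 5 defaults), and the names `binary_metrics`, `always_repay`, and `flags_risky` are hypothetical. The final line computes the negative-to-positive ratio, the value commonly suggested for the `scale_pos_weight` parameter.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (1 = default)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Illustrative labels: 5 defaults followed by 95 repaid loans
y_true = [1] * 5 + [0] * 95

# Model A never predicts a default
always_repay = [0] * 100

# Model B flags 8 loans: it catches 4 of the 5 defaults and raises 4 false alarms
flags_risky = [1, 1, 1, 1, 0] + [1] * 4 + [0] * 91

print("always_repay:", binary_metrics(y_true, always_repay))
print("flags_risky: ", binary_metrics(y_true, flags_risky))

# Ratio of negatives to positives, a common starting value for scale_pos_weight
scale_pos_weight = y_true.count(0) / y_true.count(1)
print("scale_pos_weight:", scale_pos_weight)
```

Both models reach 95% accuracy, but the first has zero recall, while the second reaches a precision of 0.5 and a recall of 0.8. Only the imbalance-aware metrics expose the difference.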