## Assignment Code: DA-AG-015
# Boosting Techniques | Assignment

**Question 1: What is Boosting in Machine Learning? Explain how it improves weak learners.**

**Ans-**

**Boosting in Machine Learning**

 - Boosting is an ensemble learning technique that combines multiple weak learners (usually shallow decision trees) to build a strong predictive model.

 - A weak learner is a model that performs slightly better than random guessing (e.g., accuracy just above 50% in binary classification).

 - Boosting improves these weak learners by sequentially training them, where each new model focuses more on the errors (misclassified data points) made by the previous models.

**How Boosting Works**

 - Start with a weak learner (e.g., a decision stump).

 - Assign equal weights to all data points initially.

 - Train the weak learner and calculate its errors.

 - Increase the weights of misclassified samples, so the next weak learner focuses more on the hard cases.

 - Repeat the process for multiple learners.

 - Combine all learners (weighted average or weighted vote) into one final strong model.
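The reweighting loop above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: labels are in {-1, +1}, there is a single numeric feature, and each learner's vote uses the standard AdaBoost weight alpha = 0.5 * ln((1 - err) / err). The toy data and the helper names (`fit_stump`, `stump_predict`, `adaboost`, `predict`) are made up for this example.

```python
import math


def fit_stump(x, y, w):
    """Return (threshold, polarity, weighted_error) of the best 1-D decision stump."""
    best = None
    for thr in sorted(set(x)):
        for polarity in (1, -1):
            # Predict `polarity` at or below the threshold, the opposite above it
            preds = [polarity if xi <= thr else -polarity for xi in x]
            err = sum(wi for wi, pi, yi in zip(w, preds, y) if pi != yi)
            if best is None or err < best[2]:
                best = (thr, polarity, err)
    return best


def stump_predict(stump, xi):
    thr, polarity, _ = stump
    return polarity if xi <= thr else -polarity


def adaboost(x, y, n_rounds=3):
    n = len(x)
    w = [1.0 / n] * n                                # equal initial weights
    ensemble = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y, w)                   # train the weak learner
        err = min(max(stump[2], 1e-10), 1 - 1e-10)   # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # this learner's vote weight
        # Misclassified points (y * prediction == -1) are scaled up,
        # correctly classified points are scaled down
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for wi, xi, yi in zip(w, x, y)]
        total = sum(w)
        w = [wi / total for wi in w]                 # renormalize to sum to 1
        ensemble.append((alpha, stump))
    return ensemble, w


def predict(ensemble, xi):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, xi) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1


# Hypothetical toy data: no single threshold separates the two classes
X_TOY = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Y_TOY = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble, final_weights = adaboost(X_TOY, Y_TOY, n_rounds=3)
correct = sum(predict(ensemble, xi) == yi for xi, yi in zip(X_TOY, Y_TOY))
print("training accuracy with 3 stumps:", correct / len(X_TOY))
```

On this toy data a single stump misclassifies three of the ten points, while the weighted vote of three reweighted stumps fits all of them, which is the "weak learners become a strong learner" effect in miniature.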

**Question 2: What is the difference between AdaBoost and Gradient Boosting in terms of how models are trained?**


**Ans-**

**1. AdaBoost (Adaptive Boosting)**

Error-focused reweighting:

   After training each weak learner (usually a decision stump), AdaBoost adjusts the weights of training samples:

 - Misclassified points → increase weight (become more important).

 - Correctly classified points → decrease weight.

 - The next learner is trained on this reweighted dataset.

 - Final model = weighted sum of all weak learners.

 - Key idea: Learners are trained sequentially, focusing more on the hard-to-classify samples.

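**2. Gradient Boosting**

Error-focused gradient descent:

   Instead of reweighting data points, Gradient Boosting trains each new learner to predict the residual errors of the current ensemble. For squared-error loss the residual is the difference between actual values and model predictions; for other losses it is the negative gradient of the loss with respect to the predictions.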

 - At each step:

   -  Fit a weak learner on the residuals (errors of previous model).

   -  Update the model by adding this learner’s contribution using a learning rate.

 - Final model = sum of all weak learners.

 - Key idea: Learners are trained sequentially, and each one takes a gradient-descent step that reduces the loss function.
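The residual-fitting loop can be sketched in plain Python. This is a minimal illustration, assuming squared-error loss (so the negative gradient is simply `y - prediction`), a single numeric feature, and one-split regression stumps as weak learners. The toy data and the helper names (`fit_regression_stump`, `gradient_boost`, `boosted_predict`, `mse`) are hypothetical.

```python
def fit_regression_stump(x, residuals):
    """Fit a one-split regressor that predicts the mean residual on each side."""
    best = None
    for thr in sorted(set(x))[:-1]:          # skip the max so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    _, thr, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= thr else right_mean


def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


def gradient_boost(x, y, n_rounds=20, learning_rate=0.5):
    base = sum(y) / len(y)                   # initial model: predict the mean
    preds = [base] * len(y)
    learners = []
    history = [mse(y, preds)]                # training loss before any boosting
    for _ in range(n_rounds):
        # Residuals = negative gradient of 0.5 * (y - F)^2 with respect to F
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_regression_stump(x, residuals)
        learners.append(stump)
        # Add the new learner's contribution, shrunk by the learning rate
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
        history.append(mse(y, preds))
    return base, learners, history


def boosted_predict(base, learners, xi, learning_rate=0.5):
    """Final model = initial prediction + shrunken sum of all weak learners."""
    return base + learning_rate * sum(stump(xi) for stump in learners)


# Hypothetical toy data: a curve that one stump cannot fit
X_TOY = list(range(10))
Y_TOY = [xi ** 2 for xi in X_TOY]

base, learners, history = gradient_boost(X_TOY, Y_TOY)
print("training MSE before / after boosting:", history[0], history[-1])
```

Unlike AdaBoost, no sample weights appear anywhere: the only thing that changes between rounds is the target each new learner is asked to fit.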

**Question 3: How does regularization help in XGBoost?**


**Ans-**

**How Regularization Helps**

 1. Controls Model Complexity

    - The term $\gamma T$, where $T$ is the number of leaves in a tree, discourages the model from creating too many leaves, preventing overly complex trees.

 2. Prevents Overfitting

    - The L2 penalty ($\lambda \sum_j w_j^2$, where $w_j$ is the weight of leaf $j$) shrinks large leaf weights, making the model more robust to noise.

 3. Encourages Simpler Trees

    - Regularization makes the algorithm prefer trees with fewer splits unless new splits provide significant improvement.

 4. Balances Bias-Variance Tradeoff

    - Without regularization → low bias but high variance (overfitting).

    - With regularization → slightly higher bias but much lower variance (better generalization).
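The effect of the two penalties can be seen in a small numeric sketch. It assumes the formulation in the XGBoost documentation, where the L2 term carries a factor of 1/2, the optimal leaf weight is $-G/(H+\lambda)$, and a split is kept only if its gain after subtracting $\gamma$ is positive. Here `G` and `H` stand for the sums of first- and second-order gradients of the loss over the samples in a leaf; these symbols, the function names, and the example numbers are not from the text above.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight; a larger lambda shrinks it toward zero."""
    return -G / (H + lam)


def leaf_score(G, H, lam):
    """Quality contribution of one leaf (larger is better)."""
    return G * G / (H + lam)


def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    """Loss reduction from splitting a leaf in two, minus the per-leaf penalty gamma."""
    parent = leaf_score(G_left + G_right, H_left + H_right, lam)
    children = leaf_score(G_left, H_left, lam) + leaf_score(G_right, H_right, lam)
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient statistics for a candidate split
G_L, H_L, G_R, H_R = -4.0, 4.0, 3.0, 3.0

print("left leaf weight, lambda=0:", leaf_weight(G_L, H_L, lam=0.0))
print("left leaf weight, lambda=1:", leaf_weight(G_L, H_L, lam=1.0))
print("gain, lambda=1, gamma=0:", split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0))
print("gain, lambda=1, gamma=3:", split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=3.0))
```

With these numbers, raising $\lambda$ from 0 to 1 shrinks the left leaf weight, and raising $\gamma$ from 0 to 3 turns a positive gain into a negative one, so the same split would be rejected.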

**Question 4: Why is CatBoost considered efficient for handling categorical data?**

**Ans-**

**Why CatBoost is Efficient**

 1. Built-in Categorical Encoding (No Preprocessing Needed)

    -  CatBoost automatically handles categorical features without requiring one-hot encoding.

    -  It uses Ordered Target Statistics (a type of target-based encoding), which avoids overfitting and leakage.

 2. Ordered Target Statistics (OTS)

  -  Instead of replacing a category with its global average target value (which can cause leakage), CatBoost:

     - Randomly orders the dataset.

     - For each row, it calculates the average target value of the same category only from previous rows (not future ones).

  - This prevents target leakage and provides a safe, information-rich encoding.

 3. Efficient with High Cardinality Features

   - CatBoost handles features with many unique categories (e.g., zip codes, product IDs) efficiently, unlike one-hot encoding which explodes feature space.

 4. Reduces Human Effort

  - No need to manually preprocess categorical data.

  - Makes it easier for practitioners and less error-prone.

 5. Other Advantages

    - Symmetric tree structure (balanced trees → faster inference).

    - GPU support for speed.

    - Works well with imbalanced and sparse categorical data.
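The ordered target statistics idea from point 2 can be sketched in a few lines of plain Python. This is a simplified illustration, not CatBoost's actual implementation: it uses a single random permutation and a smoothed mean, and the `prior`, `prior_weight`, and `seed` parameters as well as the function name are assumptions made for this example.

```python
import random


def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each row using only the targets of rows that precede it in a random order."""
    n = len(categories)
    order = list(range(n))
    random.Random(seed).shuffle(order)       # random ordering of the dataset
    sums, counts = {}, {}                    # running target sum / row count per category
    encoded = [0.0] * n
    for i in order:
        cat = categories[i]
        s = sums.get(cat, 0.0)
        k = counts.get(cat, 0)
        # Smoothed mean over *previous* rows of the same category only
        encoded[i] = (s + prior * prior_weight) / (k + prior_weight)
        # The row's own target is added afterwards, so it never leaks into its encoding
        sums[cat] = s + targets[i]
        counts[cat] = k + 1
    return encoded


# Hypothetical toy data
cities = ["a", "b", "a", "a", "b", "c"]
defaulted = [1, 0, 1, 0, 1, 1]
print(ordered_target_stats(cities, defaulted))
```

Because a row's own label is added to the running totals only after the row has been encoded, changing that label leaves the row's encoding unchanged, which is the leakage protection described above. A category seen for the first time falls back to the prior.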

**Question 5: What are some real-world applications where boosting techniques are preferred over bagging methods?**

**Ans-**

**When Boosting is Preferred Over Bagging**

 - Bagging (e.g., Random Forests): Good at reducing variance by averaging multiple models trained in parallel.

 - Boosting (e.g., AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost): Good at reducing bias by training models sequentially, each correcting the previous one.

So, boosting is usually preferred when high accuracy is needed and data has complex patterns.

**Real-World Applications of Boosting**

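1. Finance & Banking: credit scoring and fraud detection.
2. Healthcare: disease-risk prediction from patient records.
3. E-commerce & Marketing: churn prediction, click-through prediction, and ranking.
4. Cybersecurity: intrusion and malware detection.
5. Competitions & High-Stakes Predictions: tabular-data problems where small accuracy gains matter.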

In [None]:
Datasets:
 - Use sklearn.datasets.load_breast_cancer() for classification tasks.
 - Use sklearn.datasets.fetch_california_housing() for regression tasks.

**Question 6: Write a Python program to:**

  - Train an AdaBoost Classifier on the Breast Cancer dataset
  - Print the model accuracy

In [1]:
# Ans:- AdaBoost on Breast Cancer Dataset

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize AdaBoost Classifier
model = AdaBoostClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print output
print("AdaBoost Classifier Accuracy on Breast Cancer Dataset: {:.2f}%".format(accuracy * 100))


AdaBoost Classifier Accuracy on Breast Cancer Dataset: 97.37%


**Question 7: Write a Python program to:**

  -  Train a Gradient Boosting Regressor on the California Housing dataset
  -  Evaluate performance using R-squared score


In [2]:
# Ans:- Gradient Boosting Regressor on California Housing Dataset

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Gradient Boosting Regressor
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate performance using R-squared score
r2 = r2_score(y_test, y_pred)

# Print output
print("Gradient Boosting Regressor R-squared Score on California Housing Dataset: {:.4f}".format(r2))


Gradient Boosting Regressor R-squared Score on California Housing Dataset: 0.8004


**Question 8: Write a Python program to:**

 -  Train an XGBoost Classifier on the Breast Cancer dataset
 -  Tune the learning rate using GridSearchCV
 - Print the best parameters and accuracy


In [3]:
# Ans:- XGBoost Classifier with GridSearchCV on Breast Cancer Dataset

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize XGBoost Classifier
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)

# Define parameter grid for learning rate
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3]
}

# GridSearchCV for tuning
grid = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Train model
grid.fit(X_train, y_train)

# Best parameters
print("Best Parameters:", grid.best_params_)

# Predict on test set with best estimator
y_pred = grid.best_estimator_.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("XGBoost Classifier Accuracy on Breast Cancer Dataset: {:.2f}%".format(accuracy * 100))


Best Parameters: {'learning_rate': 0.2}
XGBoost Classifier Accuracy on Breast Cancer Dataset: 95.61%




**Question 9: Write a Python program to:**

 - Train a CatBoost Classifier
 -  Plot the confusion matrix using seaborn



In [None]:
# Ans:- CatBoost Classifier with Confusion Matrix

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
from catboost import CatBoostClassifier
import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize CatBoost Classifier
model = CatBoostClassifier(
    iterations=200,
    learning_rate=0.1,
    depth=6,
    verbose=0,
    random_state=42
)

# Train the model
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("CatBoost Classifier Accuracy on Breast Cancer Dataset: {:.2f}%".format(accuracy * 100))

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Malignant", "Benign"],
            yticklabels=["Malignant", "Benign"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix - CatBoost Classifier")
plt.show()


**Question 10: You're working for a FinTech company trying to predict loan default using customer demographics and transaction behavior.
The dataset is imbalanced, contains missing values, and has both numeric and categorical features.**

Describe your step-by-step data science pipeline using boosting techniques:

 -  Data preprocessing & handling missing/categorical values
 -  Choice between AdaBoost, XGBoost, or CatBoost
 -  Hyperparameter tuning strategy
 -  Evaluation metrics you'd choose and why
 - How the business would benefit from your model


**Ans-**

 Step-by-step Pipeline (What & Why)

**1) Data preprocessing**

 - Split before any preprocessing (avoid leakage): StratifiedKFold for class imbalance.

 - Feature types:

   - Detect numeric vs categorical (object/string/boolean).

 - Missing values:

   -  Numeric → SimpleImputer(strategy="median")

   - Categorical → SimpleImputer(strategy="most_frequent")

 - Encoding:

   - Use OneHotEncoder(handle_unknown="ignore") for categorical features (portable and safe).

 - Keep preprocessing in a Pipeline + ColumnTransformer so CV evaluates the full flow correctly.

**2) Boosting choice: XGBoost over AdaBoost/CatBoost**

 - XGBoost:

   -  Handles missing values natively, so imputation is optional and the pipeline stays simple.

   -  Strong performance, rich controls (tree depth, learning rate, regularization).

   -  Widely available, and already used in Question 8.

 - AdaBoost: typically weaker on heterogeneous tabular data with class imbalance, since it is sensitive to noisy labels and offers fewer regularization controls.

 - CatBoost: excellent for categorical-heavy data, but it could not be imported in this environment (“No module named 'catboost'”).

**3) Imbalance strategy**

 - Compute scale_pos_weight = (neg/pos) on the training fold only.

 - Complement with class-threshold tuning post-training (optimize for recall of defaults while keeping acceptable precision/PR-AUC).

**4) Hyperparameter tuning**

 - RandomizedSearchCV (broad) → GridSearchCV (refine top area).

 - Key knobs:

   - n_estimators, learning_rate

   - max_depth, min_child_weight (complexity)

   - subsample, colsample_bytree (stochasticity)

   - reg_alpha, reg_lambda (regularization)

 - Scoring:

   - Primary: roc_auc (ranking quality)

   - Also track: average_precision (PR-AUC), recall, f1

**5) Evaluation metrics (and why)**

 - ROC-AUC: overall separability.

 - PR-AUC (Average Precision): more informative under imbalance.

 - Recall (of default): minimize false negatives (missed defaults).

 - Precision/F1: control operational cost of false positives.

 - Confusion matrix at a business-selected threshold (tuned via PR or cost).

**6) Business value**

 - Lower default losses: high recall for bad loans → fewer charge-offs.

 - Smarter pricing: risk-based interest rates / credit limits.

 - Portfolio quality: stable delinquency rate, capital efficiency.

 - Explainability: SHAP/feature importances → transparent policy updates, compliance.

In [None]:
# Loan Default Prediction with Boosting (XGBoost) on Imbalanced, Mixed-Type Data

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    roc_auc_score, average_precision_score, classification_report,
    confusion_matrix, precision_recall_curve
)
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

from xgboost import XGBClassifier
from scipy.stats import uniform, randint

# ----------------------------
# 1) Load data
# ----------------------------
# Replace this with your real dataset path:
# df = pd.read_csv("loan_data.csv")

# For demo purposes, we’ll synthesize a mixed-type dataset that resembles loan data:
rng = np.random.RandomState(42)
n = 6000
df = pd.DataFrame({
    "age": rng.randint(18, 70, size=n),
    "income": rng.lognormal(mean=10, sigma=0.5, size=n),                # numeric skewed
    "tenure_months": rng.randint(0, 240, size=n),
    "avg_txn_amount": rng.gamma(shape=2., scale=200., size=n),          # numeric
    "city": rng.choice(["Mumbai", "Delhi", "Bengaluru", "Hyderabad", "Pune"], size=n),
    "segment": rng.choice(["Salaried", "Self-Employed", "Student"], size=n, p=[0.65, 0.3, 0.05]),
    "has_cc": rng.choice([0, 1], size=n, p=[0.4, 0.6]).astype(int),     # categorical/binary
})
# Create an imbalanced target driven by low income, short tenure, high avg_txn.
# The intercept is a rough choice aiming for a default rate near 10-15%;
# check y.mean() and adjust it if the realized rate differs.
logit = (
    -1.2
    + 0.00008*(df["avg_txn_amount"])    # higher avg_txn -> more risk
    - 0.00006*(df["income"])            # higher income -> lower risk
    - 0.01*(df["tenure_months"] > 36)   # longer tenure -> lower risk
    + 0.5*(df["segment"] == "Self-Employed").astype(int)
    + 0.6*(df["has_cc"] == 0).astype(int)
)
p = 1/(1+np.exp(-logit))
y = (rng.rand(n) < p).astype(int)
df["default"] = y

# Inject some missingness
for col in ["income", "avg_txn_amount", "city", "segment"]:
    mask = rng.rand(n) < 0.05
    df.loc[mask, col] = np.nan

target = "default"
X = df.drop(columns=[target])
y = df[target].values

# ----------------------------
# 2) Train/validation split
# ----------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Identify column types
categorical_cols = X_train.select_dtypes(include=["object", "bool"]).columns.tolist()
# Treat integer binaries as categorical? Here we keep ints as numeric except obvious bool/object
numeric_cols = [c for c in X_train.columns if c not in categorical_cols]

# ----------------------------
# 3) Preprocess
# ----------------------------
numeric_tf = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
])

categorical_tf = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=True))
])

preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_tf, numeric_cols),
        ("cat", categorical_tf, categorical_cols),
    ],
    remainder="drop"
)

# ----------------------------
# 4) Imbalance handling: scale_pos_weight
#    (computed on train only, passed into XGBoost)
# ----------------------------
pos = (y_train == 1).sum()
neg = (y_train == 0).sum()
scale_pos_weight = max(1.0, neg / max(pos, 1))

# ----------------------------
# 5) Build pipeline with XGBoost
# ----------------------------
xgb = XGBClassifier(
    objective="binary:logistic",
    eval_metric="auc",
    tree_method="hist",
    random_state=42,
    n_estimators=400,
    learning_rate=0.05,
    max_depth=4,
    min_child_weight=1,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.0,
    reg_lambda=1.0,
    scale_pos_weight=scale_pos_weight,
    n_jobs=-1
)

pipe = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", xgb)
])

# ----------------------------
# 6) Hyperparameter search
# ----------------------------
param_dist = {
    "model__n_estimators": randint(200, 700),
    "model__learning_rate": uniform(0.01, 0.2),
    "model__max_depth": randint(3, 8),
    "model__min_child_weight": randint(1, 8),
    "model__subsample": uniform(0.6, 0.4),        # 0.6–1.0
    "model__colsample_bytree": uniform(0.6, 0.4), # 0.6–1.0
    "model__reg_alpha": uniform(0.0, 0.5),
    "model__reg_lambda": uniform(0.5, 2.0),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

random_search = RandomizedSearchCV(
    estimator=pipe,
    param_distributions=param_dist,
    n_iter=30,
    scoring="roc_auc",
    cv=cv,
    verbose=0,
    random_state=42,
    n_jobs=-1
)
random_search.fit(X_train, y_train)

# Narrow around the best for a small grid
best_params = random_search.best_params_

ref_grid = {
    "model__n_estimators": [best_params["model__n_estimators"] - 100,
                            best_params["model__n_estimators"],
                            best_params["model__n_estimators"] + 100],
    "model__learning_rate": [max(0.005, best_params["model__learning_rate"]*0.5),
                             best_params["model__learning_rate"],
                             min(0.5, best_params["model__learning_rate"]*1.5)],
    "model__max_depth": [max(3, best_params["model__max_depth"]-1),
                         best_params["model__max_depth"],
                         best_params["model__max_depth"]+1],
}

grid = GridSearchCV(
    estimator=random_search.best_estimator_,
    param_grid=ref_grid,
    scoring="roc_auc",
    cv=cv,
    verbose=0,
    n_jobs=-1
)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_

print("Best params (refined):", grid.best_params_)

# ----------------------------
# 7) Evaluation: ROC-AUC, PR-AUC, threshold tuning
# ----------------------------
# Probabilities
probs = best_model.predict_proba(X_test)[:, 1]

roc = roc_auc_score(y_test, probs)
pr_auc = average_precision_score(y_test, probs)

# Choose threshold prioritizing recall of defaults while keeping precision reasonable.
# (Tuned on the test set here for brevity; a separate validation split avoids optimistic bias.)
prec, rec, thr = precision_recall_curve(y_test, probs)

# Example policy: pick threshold with Recall >= 0.85 and best F1 among those.
# prec/rec have one more entry than thr, so drop the final point to align them.
f1_scores = (2 * prec[:-1] * rec[:-1]) / (prec[:-1] + rec[:-1] + 1e-12)
mask = rec[:-1] >= 0.85
if mask.any():
    chosen_threshold = thr[mask][np.argmax(f1_scores[mask])]
else:
    # fallback to the threshold that maximizes F1 globally
    chosen_threshold = thr[np.argmax(f1_scores)]

y_pred = (probs >= chosen_threshold).astype(int)

print(f"Test ROC-AUC: {roc:.4f}")
print(f"Test PR-AUC (Average Precision): {pr_auc:.4f}")
print(f"Chosen threshold: {chosen_threshold:.3f}")
print("\nClassification report @ chosen threshold:")
print(classification_report(y_test, y_pred, digits=4))

print("Confusion matrix @ chosen threshold:")
print(confusion_matrix(y_test, y_pred))


In [None]:
**Example output (illustrative format only; actual values depend on the generated data and the run)**

Best params (refined): {'model__learning_rate': 0.047, 'model__max_depth': 4, 'model__n_estimators': 500}
Test ROC-AUC: 0.9205
Test PR-AUC (Average Precision): 0.7132
Chosen threshold: 0.322
Classification report @ chosen threshold:
              precision    recall  f1-score   support
           0     0.95       0.86      0.90      1059
           1     0.48       0.86      0.62       141
    accuracy                         0.86      1200
   macro avg     0.71       0.86      0.76      1200
weighted avg     0.90       0.86      0.87      1200

Confusion matrix @ chosen threshold:
[[907 152]
 [ 20 121]]
