# Boosting Techniques | Assignment

## Assignment Code: DA-AG-015

### Question 1: What is Boosting in Machine Learning? Explain how it improves weak learners.
- Boosting is an ensemble machine learning technique that combines multiple weak learners (models that perform slightly better than random guessing) into a strong learner with high predictive accuracy.
- A weak learner is typically a shallow decision tree; a tree with a single split is called a decision stump and may only perform slightly better than chance.
- Boosting trains these learners sequentially, where each new model focuses on correcting the mistakes of the previous ones.
- How Boosting Improves Weak Learners:
  - Weighted Training: In AdaBoost-style boosting, data points misclassified by earlier models are given higher weights so the next model concentrates on them.
  - Sequential Learning: Each learner is fitted to the errors left by the learners before it.
  - Error Reduction: The ensemble mainly reduces bias; variance can also drop, but it is usually controlled through the learning rate, the number of rounds, and regularization.
  - Final Prediction: Combines all weak learners' predictions using weighted voting (classification) or a weighted sum (regression).

- Example Analogy:
  - Think of boosting like a teacher giving extra attention to students who failed the first test: each new test focuses more on their weak areas until most students pass.
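
The re-weighting loop described above can be sketched in a few lines. This is a minimal, illustrative version of discrete AdaBoost with decision stumps, assuming labels in {-1, +1}. The helper names (`fit_stump`, `adaboost_fit`, `adaboost_predict`) and the toy data are made up for this example and do not come from any library.

```python
import numpy as np

def stump_predict(X, feature, threshold, polarity):
    """Predict +1 where polarity * (x - threshold) > 0, otherwise -1."""
    return np.where(polarity * (X[:, feature] - threshold) > 0, 1, -1)

def fit_stump(X, y, w):
    """Return the stump (feature, threshold, polarity) with the lowest weighted error."""
    best, best_err = None, np.inf
    for feature in range(X.shape[1]):
        values = X[:, feature]
        # Candidates: one threshold below the minimum (a constant stump) plus every observed value.
        thresholds = np.concatenate(([values.min() - 1.0], np.unique(values)))
        for threshold in thresholds:
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = w[pred != y].sum()
                if err < best_err:
                    best, best_err = (feature, threshold, polarity), err
    return best, best_err

def adaboost_fit(X, y, n_rounds):
    """Train n_rounds stumps, re-weighting the samples after each round."""
    w = np.full(len(y), 1.0 / len(y))           # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        stump, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)    # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)   # vote weight of this stump
        pred = stump_predict(X, *stump)
        w = w * np.exp(-alpha * y * pred)       # misclassified points gain weight
        w = w / w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(X, ensemble):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(X, *stump) for alpha, stump in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy 1-D problem that no single stump can separate: + + + - - - + + + +
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, 1])

acc_single = np.mean(adaboost_predict(X, adaboost_fit(X, y, n_rounds=1)) == y)
acc_boosted = np.mean(adaboost_predict(X, adaboost_fit(X, y, n_rounds=3)) == y)
print("1 stump :", acc_single)
print("3 stumps:", acc_boosted)
```

On this toy data a single stump cannot fit the middle block of negatives, while three re-weighted stumps voting together can. This is the "weak learners into a strong learner" idea in miniature.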

### Question 2: What is the difference between AdaBoost and Gradient Boosting in terms of how models are trained?

| Feature                 | AdaBoost                                                                            | Gradient Boosting                                                                         |
| ----------------------- | ----------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------- |
| **Core Idea**           | Focuses on **misclassified samples** by adjusting their weights for the next model. | Fits the new model to the **residual errors** (negative gradients) of the previous model. |
| **Model Sequence**      | Each new model is trained with adjusted **sample weights**.                         | Each new model is trained on **prediction errors** of previous models.                    |
| **Loss Function**       | Primarily designed for **exponential loss** (but variants exist).                   | Can optimize **any differentiable loss function** (MSE, log-loss, etc.).                  |
| **Weighting**           | Assigns weights to samples and models based on accuracy.                            | Models are combined by adding predictions scaled by the learning rate.                    |
| **Robustness to Noise** | More sensitive to noisy labels and outliers, because misclassified points keep gaining weight. | Usually less sensitive, especially when a robust loss (e.g. Huber) is chosen. |
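
To make the "fit the residuals" column concrete, here is a small sketch of gradient boosting for squared-error loss, where the negative gradient is simply the residual `y - prediction`. It uses NumPy only. The helper `fit_regression_stump`, the synthetic sine data, and the settings (50 rounds, learning rate 0.1) are assumptions chosen for illustration, not part of any library.

```python
import numpy as np

def fit_regression_stump(x, r):
    """Find the single split on a 1-D feature that minimizes squared error on r."""
    best = None
    for t in np.unique(x)[:-1]:                  # skip the max so the right side is never empty
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def stump_predict(x, stump):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 80)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())           # round 0: predict the mean
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction                   # negative gradient of 0.5 * (y - F)^2
    stump = fit_regression_stump(x, residuals)   # the new learner is fitted to the residuals
    prediction += learning_rate * stump_predict(x, stump)
    mse_history.append(np.mean((y - prediction) ** 2))

print("MSE before boosting:", mse_history[0])
print("MSE after 50 rounds:", mse_history[-1])
```

Unlike AdaBoost, no sample weights are involved here: every round sees all points equally, and only the target (the current residual) changes.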


### Question 3: How does regularization help in XGBoost?
- Regularization in XGBoost helps prevent overfitting by controlling model complexity.
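- It uses:
  - 1. L1 (Lasso, `reg_alpha`) - Shrinks leaf weights toward zero and can set some of them exactly to zero.
  - 2. L2 (Ridge, `reg_lambda`) - Penalizes large leaf weights.
  - 3. Tree complexity control - Parameters like `max_depth`, `min_child_weight`, and `gamma` (the minimum loss reduction required to make a split) limit how far each tree can grow.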
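
The effect of these penalties can be seen in the closed-form leaf weight and split gain that follow from XGBoost's regularized objective. The sketch below is a plain-Python illustration under that assumption, not XGBoost's actual code: `G` and `H` stand for the sums of first and second derivatives of the loss over the samples in a leaf, the function names are hypothetical, and the numbers are invented.

```python
def soft_threshold(g, reg_alpha):
    """Shrink g toward zero by reg_alpha (the effect of the L1 penalty)."""
    if g > reg_alpha:
        return g - reg_alpha
    if g < -reg_alpha:
        return g + reg_alpha
    return 0.0

def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value: larger reg_lambda or reg_alpha pulls it toward zero."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)

def leaf_score(G, H, reg_lambda=1.0):
    """Quality of a leaf under L2 regularization only."""
    return G * G / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split; gamma is charged for every extra leaf."""
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    children = leaf_score(G_left, H_left, reg_lambda) + leaf_score(G_right, H_right, reg_lambda)
    return 0.5 * (children - parent) - gamma

# Invented gradient statistics for one candidate split
G_left, H_left, G_right, H_right = -4.0, 2.0, 3.0, 2.0

print(leaf_weight(G_left, H_left, reg_lambda=0.0))                  # no regularization
print(leaf_weight(G_left, H_left, reg_lambda=1.0))                  # L2 shrinks the weight
print(leaf_weight(G_left, H_left, reg_lambda=1.0, reg_alpha=1.0))   # L1 shrinks it further
print(split_gain(G_left, H_left, G_right, H_right, gamma=0.0))      # positive: split is kept
print(split_gain(G_left, H_left, G_right, H_right, gamma=5.0))      # negative: split is pruned
```

A split whose gain falls below zero once `gamma` is subtracted is not worth making, which is how `gamma` keeps trees small.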

### Question 4: Why is CatBoost considered efficient for handling categorical data?
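- Automatic Handling - Categorical columns are passed directly (via `cat_features`), so no manual one-hot or label encoding is needed.
- Ordered Target Statistics - Each row's category is encoded using only the targets of rows that come before it in a random permutation, which prevents target leakage.
- Efficient GPU Training - GPU support can speed up training on large datasets.
- Less Parameter Tuning - Often works well with default parameters.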
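
The idea behind ordered target statistics can be illustrated with a simplified sketch. Each row is encoded using only the target values of earlier rows with the same category, smoothed by a prior. This is an assumption-laden toy version: the function name, the `prior` and `prior_weight` parameters, and the data are hypothetical, and CatBoost itself works with several random permutations rather than the single fixed row order used here.

```python
from collections import defaultdict

def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row from the targets of earlier rows with the same category."""
    sums = defaultdict(float)    # running sum of targets per category
    counts = defaultdict(int)    # running row count per category
    encoded = []
    for category, target in zip(categories, targets):
        # Only rows seen so far contribute, so a row never sees its own target.
        value = (sums[category] + prior * prior_weight) / (counts[category] + prior_weight)
        encoded.append(value)
        sums[category] += target
        counts[category] += 1
    return encoded

cities = ["A", "B", "A", "A", "B"]
defaulted = [1, 0, 0, 1, 1]
encoded = ordered_target_stats(cities, defaulted)
print(encoded)
```

A plain target encoding would average over all rows of a category, including the row being encoded, which leaks the label into the feature. The ordered variant avoids that by construction.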

### Question 5: What are some real-world applications where boosting techniques are preferred over bagging methods?
- Fraud Detection (financial transactions, where a low false-negative rate is important)
- Credit Risk Modeling
- Customer Churn Prediction
- Medical Diagnosis (e.g., breast cancer detection)
- Click-through Rate Prediction in ads
- Boosting is preferred when:
  - Accuracy is more important than training speed.
  - The classes are imbalanced, since boosting concentrates on the hard (often minority-class) examples.
  - The relationships are non-linear and simpler models underfit, since boosting mainly reduces bias. With very noisy labels, bagging is often the safer choice.

#### Datasets:
● Use sklearn.datasets.load_breast_cancer() for classification tasks.

● Use sklearn.datasets.fetch_california_housing() for regression tasks.

### Question 6: Write a Python program to:
● Train an AdaBoost Classifier on the Breast Cancer dataset

● Print the model accuracy

### Question 7: Write a Python program to:
● Train a Gradient Boosting Regressor on the California Housing dataset

● Evaluate performance using R-squared score

### Question 8: Write a Python program to:
● Train an XGBoost Classifier on the Breast Cancer dataset

● Tune the learning rate using GridSearchCV

● Print the best parameters and accuracy

### Question 9: Write a Python program to:
● Train a CatBoost Classifier

● Plot the confusion matrix using seaborn

### Question 10: You're working for a FinTech company trying to predict loan default using customer demographics and transaction behavior. The dataset is imbalanced, contains missing values, and has both numeric and categorical features. Describe your step-by-step data science pipeline using boosting techniques:
● Data preprocessing & handling missing/categorical values

● Choice between AdaBoost, XGBoost, or CatBoost

● Hyperparameter tuning strategy

● Evaluation metrics you'd choose and why

● How the business would benefit from your model

- Step-by-step:

  - 1. Data Preprocessing
    - Handle missing values: median for numeric, mode for categorical. XGBoost and CatBoost can also handle missing numeric values natively.
    - Encode categorical:
      - CatBoost → no manual encoding needed
      - XGBoost/AdaBoost → One-hot or Target encoding
    - Scale numerical features (optional for tree-based methods).
  - 2. Model Choice
    - CatBoost → Best for mixed data types and less preprocessing.
    - XGBoost → Highly optimized, great for large datasets.
    - AdaBoost → Simple, but less efficient with many categorical variables.
    - Address the class imbalance with class weights (e.g. `scale_pos_weight` in XGBoost).
  - 3. Hyperparameter Tuning
    - Use GridSearchCV or RandomizedSearchCV with stratified cross-validation for learning rate, max depth, n_estimators.
  - 4. Evaluation Metrics
    - Use F1-score or ROC-AUC for imbalanced data (accuracy is misleading).
    - Confusion matrix to visualize false negatives (critical in loan defaults).
  - 5. Business Benefit
    - Reduce financial losses by identifying risky borrowers.
    - Adjust loan approval criteria dynamically.
    - Improve customer targeting for safe lending.
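
The remark that accuracy is misleading on imbalanced data can be checked with a few lines of plain Python. The sketch below uses an invented set of 100 loans with a 5% default rate. The helper `binary_metrics` and both prediction vectors are hypothetical, and the last line computes the negative-to-positive ratio that is commonly passed to XGBoost as `scale_pos_weight`.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = default)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Invented data: 5 defaults followed by 95 repaid loans
y_true = [1] * 5 + [0] * 95

# Baseline that never predicts a default
always_repaid = [0] * 100

# A model that catches 4 of 5 defaults at the cost of 5 false alarms
flagging_model = [1, 1, 1, 1, 0] + [1] * 5 + [0] * 90

baseline = binary_metrics(y_true, always_repaid)
model = binary_metrics(y_true, flagging_model)
print("baseline:", baseline)
print("model   :", model)

# Ratio of negatives to positives, a common starting value for scale_pos_weight
scale_pos_weight = y_true.count(0) / y_true.count(1)
print("scale_pos_weight:", scale_pos_weight)
```

The baseline has the higher accuracy of the two yet misses every default, whereas recall and F1 expose the difference. This is why the pipeline evaluates on those metrics rather than on accuracy.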

```python
# Solution 06

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost
model = AdaBoostClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Predict & Accuracy
y_pred = model.predict(X_test)
print("AdaBoost Accuracy:", accuracy_score(y_test, y_pred))
```

```
AdaBoost Accuracy: 0.9649122807017544
```


```python
# Solution 07

from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Load data
data = fetch_california_housing()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gradient Boosting Regressor
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

# Predict & R2 score
y_pred = model.predict(X_test)
print("Gradient Boosting R² Score:", r2_score(y_test, y_pred))
```

```
Gradient Boosting R² Score: 0.7756446042829697
```


```python
# Solution 08

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define model (use_label_encoder is omitted: recent XGBoost releases ignore it and warn)
xgb = XGBClassifier(eval_metric='logloss')

# Grid search over the learning rate with 3-fold cross-validation
param_grid = {'learning_rate': [0.01, 0.1, 0.2]}
grid = GridSearchCV(xgb, param_grid, cv=3, scoring='accuracy')
grid.fit(X_train, y_train)

# Results: best setting and accuracy of the refitted model on the test set
print("Best Params:", grid.best_params_)
y_pred = grid.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred))
```

```
Best Params: {'learning_rate': 0.1}
XGBoost Accuracy: 0.956140350877193
```

```python
# Solution 09

from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load data
data = load_breast_cancer()
X, y = data.data, data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train CatBoost (all features here are numeric, so cat_features is not needed)
model = CatBoostClassifier(iterations=100, verbose=0, random_state=42)
model.fit(X_train, y_train)

# Confusion matrix: rows are actual classes, columns are predicted classes
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=data.target_names, yticklabels=data.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('CatBoost Confusion Matrix')
plt.show()
```