
# **Boosting Techniques | Assignment Solutions**



## **Question 1**
What is Boosting in Machine Learning? Explain how it improves weak learners.


**Answer:**

**Boosting** is an ensemble learning technique used to convert a set of weak learners (models that perform slightly better than random guessing) into a strong learner.

**How it improves weak learners:**
* **Sequential Learning:** Unlike Bagging (which trains in parallel), Boosting trains models sequentially. Each new model focuses on the errors made by the previous ones.
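* **Reweighting:** In AdaBoost, data points misclassified by earlier models receive higher weights, forcing the next learner to focus on these "hard-to-classify" examples. Gradient boosting achieves a similar effect by fitting each new learner to the remaining errors (residuals) of the ensemble.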
* **Bias Reduction:** By iteratively correcting errors, boosting effectively reduces the bias of the combined model.
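
As a rough illustration of the sequential reweighting described above, the following sketch implements AdaBoost-style boosting with one-dimensional decision stumps in plain Python. The toy dataset and the helper names (`best_stump`, `adaboost`, `predict`, `accuracy`) are made up for this example; it is a minimal sketch of the discrete AdaBoost update, not a description of any library's implementation.

```python
import math

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best 1-D stump.

    A stump predicts `polarity` when x <= threshold and `-polarity` otherwise.
    """
    best = None
    for thr in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x <= thr else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    """Train n_rounds stumps sequentially, reweighting the data after each one."""
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform weights
    ensemble = []
    for _ in range(n_rounds):
        err, thr, pol = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)      # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # vote of this weak learner
        ensemble.append((alpha, thr, pol))
        # Misclassified points (y * h(x) = -1) get larger weights.
        weights = [w * math.exp(-alpha * y * (pol if x <= thr else -pol))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * (pol if x <= thr else -pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single threshold separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

def accuracy(ensemble):
    return sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)

one_round = adaboost(xs, ys, n_rounds=1)
three_rounds = adaboost(xs, ys, n_rounds=3)
print(f"Training accuracy with 1 stump : {accuracy(one_round):.1f}")
print(f"Training accuracy with 3 stumps: {accuracy(three_rounds):.1f}")
```

On this toy data a single stump misclassifies three points, while three reweighted stumps combined classify every training point correctly.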


## **Question 2**
What is the difference between AdaBoost and Gradient Boosting in terms of how models are trained?


**Answer:**

| Feature | **AdaBoost (Adaptive Boosting)** | **Gradient Boosting** |
| :--- | :--- | :--- |
| **Error Handling** | Identifies shortcomings through **misclassified data points**, whose weights it increases. | Identifies shortcomings through the **negative gradients** of the loss function (pseudo-residuals). For squared-error loss these equal the ordinary residuals (actual minus predicted values). |
| **Model Construction** | Each new tree is trained on a **re-weighted version** of the original dataset. | Each new tree is trained on the **residuals** (errors) of the current ensemble, i.e. all previous trees combined. |
| **Tree Depth** | Typically uses "stumps" (very short trees with depth=1). | Typically uses somewhat deeper trees (e.g., depth 3-8) compared to AdaBoost. |
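
To make the "train on residuals" column concrete, here is a minimal sketch of gradient boosting for squared-error loss using regression stumps in plain Python. The data, the learning rate, and the helper names (`fit_stump`, `stump_predict`, `gradient_boost`) are illustrative assumptions, not part of any library.

```python
def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def fit_stump(xs, targets):
    """Least-squares regression stump: returns (threshold, left_mean, right_mean)."""
    best = None
    for thr in sorted(set(xs))[:-1]:          # skip the last value so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    thr, left_mean, right_mean = stump
    return left_mean if x <= thr else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Each round fits a stump to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)                  # initial prediction: the mean
    preds = [base] * len(xs)
    stumps, history = [], [mse(ys, preds)]
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the plain residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
        history.append(mse(ys, preds))
    return base, stumps, history

xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]                      # toy regression target
base, stumps, history = gradient_boost(xs, ys)
print(f"MSE before boosting: {history[0]:.2f}")
print(f"MSE after {len(stumps)} rounds: {history[-1]:.4f}")
```

Unlike the AdaBoost update, no sample weights appear here: the training target itself changes from round to round.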


## **Question 3**
How does regularization help in XGBoost?


**Answer:**

Regularization in XGBoost helps prevent **overfitting**, which is a common issue in boosting algorithms due to their aggressive nature of minimizing errors. XGBoost includes standard regularization parameters in its objective function:

1.  **L1 Regularization (Lasso / Alpha):** Penalizes the absolute value of leaf weights. It can induce sparsity by pushing some leaf weights exactly to zero, so that those leaves contribute nothing to the prediction.
2.  **L2 Regularization (Ridge / Lambda):** Penalizes the square of leaf weights. This keeps the weights small and stable, making the model less sensitive to individual data points.
3.  **Gamma (Minimum Loss Reduction):** Specifies a minimum loss reduction required to make a further partition on a leaf node, effectively pruning the tree.
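
The effect of the three parameters can be sketched with the leaf-weight and split-gain formulas from the XGBoost objective, where `G` and `H` are the sums of first and second derivatives of the loss over the samples in a leaf. The function names below are hypothetical, and the soft-thresholding treatment of the L1 term is my understanding of how the library applies `reg_alpha`, so treat this as an illustration rather than the library's code.

```python
def soft_threshold(g, alpha):
    """Shrink g toward zero by alpha (the effect of the L1 penalty)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0

def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight: L2 enlarges the denominator, L1 shrinks the numerator."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)

def leaf_score(G, H, reg_lambda, reg_alpha):
    return soft_threshold(G, reg_alpha) ** 2 / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right,
               reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Loss reduction from a split; the split is kept only if this is positive."""
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda, reg_alpha)
    children = (leaf_score(G_left, H_left, reg_lambda, reg_alpha)
                + leaf_score(G_right, H_right, reg_lambda, reg_alpha))
    return 0.5 * (children - parent) - gamma

# Larger lambda shrinks the leaf weight; a large enough alpha zeroes it.
print(leaf_weight(G=-4.0, H=3.0, reg_lambda=1.0))                  # 1.0
print(leaf_weight(G=-4.0, H=3.0, reg_lambda=5.0))                  # 0.5
print(leaf_weight(G=-4.0, H=3.0, reg_lambda=1.0, reg_alpha=5.0) == 0.0)

# Gamma turns a worthwhile split into one that gets pruned.
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0, gamma=0.0))   # 8.0
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0, gamma=10.0))  # -2.0
```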


## **Question 4**
Why is CatBoost considered efficient for handling categorical data?


**Answer:**

CatBoost (Categorical Boosting) is specifically designed to handle categorical data efficiently without extensive pre-processing (like One-Hot Encoding) for the following reasons:

* **Ordered Target Statistics:** Standard Target Encoding can lead to target leakage, because each row's own label contributes to its encoded value. CatBoost instead calculates the target statistic for a data point using only the data points observed *before* it in a random permutation. The same permutation idea underlies its related "Ordered Boosting" training scheme.
* **Handling High Cardinality:** It can handle features with many categories automatically, reducing the dimensionality explosion that happens with One-Hot Encoding.
* **Feature Combinations:** CatBoost automatically combines categorical features to create new interaction features during the tree-building process.
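
The ordered target statistic can be sketched in a few lines of plain Python. The function name `ordered_target_stats`, the smoothing parameters `prior` and `prior_weight`, and the toy data are assumptions made for illustration; CatBoost's actual implementation uses several permutations and its own prior settings.

```python
import random

def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0,
                         order=None, seed=0):
    """Encode each row using only the targets of rows that precede it in `order`."""
    n = len(categories)
    if order is None:
        order = list(range(n))
        random.Random(seed).shuffle(order)   # random permutation of the rows
    sums, counts = {}, {}                    # running target sum / count per category
    encoded = [0.0] * n
    for i in order:
        cat = categories[i]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean of the targets seen so far; the row's own target is excluded.
        encoded[i] = (s + prior_weight * prior) / (c + prior_weight)
        sums[cat] = s + targets[i]
        counts[cat] = c + 1
    return encoded

categories = ['a', 'b', 'a', 'a', 'b']
targets = [1, 0, 0, 1, 1]

# Fixed order (0, 1, 2, 3, 4) so the result is easy to check by hand.
encoded = ordered_target_stats(categories, targets, order=[0, 1, 2, 3, 4])
print(encoded)  # [0.5, 0.5, 0.75, 0.5, 0.25]
```

The first occurrence of each category receives only the prior, which is why a row's own label never leaks into its encoding.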


## **Question 5**
What are some real-world applications where boosting techniques are preferred over bagging methods?


**Answer:**

Boosting is generally preferred over Bagging (like Random Forest) in scenarios where **high accuracy** is paramount and the data is clean (not overly noisy).

1.  **Imbalanced Class Problems:** Applications like **Fraud Detection** or **Rare Disease Diagnosis** can benefit from Boosting because it concentrates on hard-to-classify examples, which often belong to the minority class.
2.  **Search Ranking & Recommendations:** Learning-to-rank systems for web search and recommendations often use gradient-boosted trees (e.g., LambdaMART) for ranking results.
3.  **Kaggle/Data Science Competitions:** Boosting algorithms (XGBoost, LightGBM, CatBoost) are dominant in tabular data competitions due to their superior predictive performance.
4.  **Credit Risk Scoring:** Financial institutions use boosting to predict loan defaults with high precision.


---


## **Question 6**
Write a Python program to:
* Train an AdaBoost Classifier on the Breast Cancer dataset
* Print the model accuracy




In [None]:
# Import necessary libraries
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Load the dataset
data = load_breast_cancer()
X = data.data
y = data.target

# 2. Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train the AdaBoost Classifier
# Using default base estimator (Decision Tree Stump)
ada_clf = AdaBoostClassifier(n_estimators=50, random_state=42)
ada_clf.fit(X_train, y_train)

# 4. Predict and Evaluate
y_pred = ada_clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"AdaBoost Classifier Accuracy: {accuracy:.4f}")


## **Question 7**
Write a Python program to:
* Train a Gradient Boosting Regressor on the California Housing dataset
* Evaluate performance using R-squared score


In [None]:
# Import necessary libraries
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# 1. Load the dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Train the Gradient Boosting Regressor
gb_reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_reg.fit(X_train, y_train)

# 4. Predict and Evaluate
y_pred = gb_reg.predict(X_test)
r2 = r2_score(y_test, y_pred)

print(f"Gradient Boosting Regressor R-squared Score: {r2:.4f}")


## **Question 8**
Write a Python program to:
* Train an XGBoost Classifier on the Breast Cancer dataset
* Tune the learning rate using GridSearchCV
* Print the best parameters and accuracy


In [None]:
# Install XGBoost if not already installed
!pip install xgboost

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# 1. Load Data
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Initialize XGBoost Classifier
# eval_metric='logloss' is set explicitly. The deprecated use_label_encoder
# argument is omitted because recent XGBoost releases no longer use it.
xgb_model = xgb.XGBClassifier(eval_metric='logloss', random_state=42)

# 3. Define Hyperparameter Grid
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3]
}

# 4. Setup GridSearchCV
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, scoring='accuracy', cv=3, verbose=1)

# 5. Train with Grid Search
grid_search.fit(X_train, y_train)

# 6. Print Best Parameters and Accuracy
print("-" * 30)
print(f"Best Learning Rate: {grid_search.best_params_['learning_rate']}")
print(f"Best CV Accuracy: {grid_search.best_score_:.4f}")

# Optional: Test set accuracy with best model
best_model = grid_search.best_estimator_
test_acc = best_model.score(X_test, y_test)
print(f"Test Set Accuracy: {test_acc:.4f}")


## **Question 9**
Write a Python program to:
* Train a CatBoost Classifier
* Plot the confusion matrix using seaborn


In [None]:
# Install CatBoost
!pip install catboost

from catboost import CatBoostClassifier
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# 1. Load Data (Using Breast Cancer dataset as generic example)
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Train CatBoost Classifier
# verbose=0 suppresses the training output logs
cat_model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=5, verbose=0, random_state=42)
cat_model.fit(X_train, y_train)

# 3. Predict
y_pred = cat_model.predict(X_test)

# 4. Plot Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=data.target_names, yticklabels=data.target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix - CatBoost Classifier')
plt.show()


## **Question 10**
You're working for a FinTech company trying to predict loan default using customer demographics and transaction behavior. The dataset is imbalanced, contains missing values, and has both numeric and categorical features.

Describe your step-by-step data science pipeline using boosting techniques.


**Pipeline Description:**

1.  **Data Preprocessing:**
    * **Missing Values:** XGBoost, LightGBM, and CatBoost can handle missing numeric values natively, so imputation is optional. If imputing, use Median imputation for numerical features (robust to outliers) and a 'Missing' category or Mode imputation for categorical features.
    * **Categorical Encoding:** Since we will likely use Boosting, we can use Target Encoding (if using XGBoost/LightGBM) or leave them as is if using CatBoost.
    * **Imbalance Handling:** Apply SMOTE (Synthetic Minority Over-sampling Technique) to the training data only (never to validation or test data), or adjust `scale_pos_weight` (in XGBoost) to penalize errors on the minority class more heavily.

2.  **Model Choice: CatBoost or XGBoost**
    * **Selection:** I would choose **CatBoost** primarily because the dataset contains categorical features (demographics) and CatBoost handles these natively and efficiently while limiting target leakage.

3.  **Hyperparameter Tuning:**
    * Tune `learning_rate`, `depth` (tree depth), and `l2_leaf_reg` (regularization).
    * Crucially, tune the `class_weights` or `scale_pos_weight` parameter to handle the loan default imbalance.

4.  **Evaluation Metrics:**
    * **ROC-AUC:** To measure the model's ability to distinguish between defaulters and non-defaulters across thresholds.
    * **Precision-Recall AUC (PR-AUC):** Since the positive class (default) is rare, PR-AUC is more informative than ROC-AUC.
    * **F1-Score:** To balance precision and recall.

5.  **Business Benefit:**
    * The model allows the company to **minimize risk** by identifying high-risk applicants before approval.
    * It allows for **Dynamic Pricing**: offering lower interest rates to low-risk customers and adjusting terms for higher-risk ones.
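
The evaluation metrics from step 4 can be computed with scikit-learn as in the sketch below. The labels and scores are made-up values standing in for a model's predicted default probabilities, and average precision is used here as the usual summary of the precision-recall curve.

```python
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

# Hypothetical labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.1, 0.4, 0.2, 0.6, 0.3, 0.8, 0.5]

roc_auc = roc_auc_score(y_true, y_score)            # ranking quality across thresholds
pr_auc = average_precision_score(y_true, y_score)   # focuses on the rare positive class

# F1 needs hard labels, so a decision threshold must be chosen first.
threshold = 0.5
y_pred = [int(score >= threshold) for score in y_score]
f1 = f1_score(y_true, y_pred)

print(f"ROC-AUC: {roc_auc:.4f}")
print(f"PR-AUC (average precision): {pr_auc:.4f}")
print(f"F1 at threshold {threshold}: {f1:.4f}")
```

In this small example a single high-scoring non-defaulter lowers the PR-AUC much more than the ROC-AUC, which is the reason PR-AUC is often the more telling metric when defaults are rare.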


In [None]:
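# Skeleton pipeline implementation (runs on mock data)
# This code demonstrates the logic described above using a Pipeline structure.
# Note: it uses XGBoost with one-hot encoding for illustration; CatBoost, the
# model chosen above, could take the categorical columns directly via cat_features.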

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# --- MOCK DATA CREATION (For demonstration purposes) ---
# Creating a dummy dataframe to represent FinTech data (seeded for reproducibility)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'Income': rng.integers(20000, 100000, 100).astype(float),  # float so NaN can be stored
    'Credit_Score': rng.integers(300, 850, 100),
    'Employment_Type': rng.choice(['Salaried', 'Self-Employed', 'Business'], 100),
    'Loan_Default': rng.permutation(np.repeat([0, 1], [90, 10]))  # Imbalanced: exactly 10% defaults
})
# Introduce missing values
df.loc[0:5, 'Income'] = np.nan

X = df.drop('Loan_Default', axis=1)
y = df['Loan_Default']

# --- PIPELINE START ---

# 1. Define Preprocessing for Numeric and Categorical columns
numeric_features = ['Income', 'Credit_Score']
categorical_features = ['Employment_Type']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# 2. Split first (stratified, so train and test keep the same default rate)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# 3. Define the Model (XGBoost with imbalance handling)
# scale_pos_weight = count(negative) / count(positive), computed on the training set only
n_pos = int((y_train == 1).sum())
n_neg = int((y_train == 0).sum())
clf = XGBClassifier(scale_pos_weight=n_neg / max(n_pos, 1),
                    eval_metric='logloss', random_state=42)

# 4. Create the full Pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('classifier', clf)])

# 5. Train and Evaluate
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

print("Pipeline Steps Completed: Preprocessing -> Encoding -> XGBoost")
print("Classification Report:")
print(classification_report(y_test, y_pred, zero_division=0))
