# **Boosting Techniques | Assignment**

Question 1: What is Boosting in Machine Learning? Explain how it improves weak
learners.
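- Boosting is an ensemble learning technique that combines many weak learners (models that perform only slightly better than random guessing) into a single strong model. Unlike bagging, which trains its models independently, boosting trains them sequentially: each new model concentrates on the errors made by the models before it, either by upweighting misclassified samples or by fitting the remaining residuals. The final prediction is a weighted combination of all learners, which mainly reduces bias while keeping each individual learner simple.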
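To make the "each model corrects the previous errors" idea concrete, here is a minimal sketch of sequential boosting with one-split regression stumps. The toy data, the `fit_stump` helper and the learning rate of 0.5 are assumptions chosen for illustration, not part of the assignment.

```python
# Toy 1-D regression data (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]


def fit_stump(xs, targets):
    """Weak learner: the single-split stump with the lowest squared error."""
    best = None
    for split in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= split]
        right = [t for x, t in zip(xs, targets) if x > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean


def mse(preds):
    """Mean squared error of the current ensemble predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


learning_rate = 0.5
preds = [sum(ys) / len(ys)] * len(ys)   # start from the mean prediction
history = [mse(preds)]

for _ in range(20):
    residuals = [y - p for y, p in zip(ys, preds)]   # errors of the ensemble so far
    stump = fit_stump(xs, residuals)                 # next weak learner targets those errors
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    history.append(mse(preds))

print("MSE with 0 stumps: ", round(history[0], 4))
print("MSE with 1 stump:  ", round(history[1], 4))
print("MSE with 20 stumps:", round(history[-1], 4))
```

Each stump on its own is a poor model, but the training error keeps falling as stumps are added because every new stump is fitted to what the ensemble still gets wrong.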

Question 2: What is the difference between AdaBoost and Gradient Boosting in terms
of how models are trained?
- AdaBoost (Adaptive Boosting)

  AdaBoost trains weak learners sequentially and steers each one toward the samples the previous learners misclassified.

  It does this by adjusting sample weights after each iteration:

  - Misclassified samples: weight increases
  - Correctly classified samples: relative weight decreases once the weights are renormalized

  The next learner is trained on this reweighted dataset.

  The final prediction is a weighted vote in which learners with a lower weighted error receive a larger say.

- Gradient Boosting

  Gradient Boosting also trains models sequentially, but it leaves the sample weights alone. Instead, each new learner is fitted to the negative gradient of the loss with respect to the current predictions (for squared error, this is simply the residual).

  This amounts to gradient descent in function space on a differentiable loss function (e.g., MSE, log-loss).

  Each new weak learner is added with a small step size (the learning rate), so the loss is reduced step by step.
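The difference between the two training signals can be sketched in a few lines. This uses the textbook binary AdaBoost update with labels in {-1, +1}; library implementations (for example scikit-learn's) differ in constants, and all values below are made up for illustration.

```python
import math

# Labels and one weak learner's predictions, both in {-1, +1} (made-up values).
y_true = [+1, +1, -1, -1, +1, -1, +1, -1]
y_pred = [+1, -1, -1, -1, +1, +1, +1, -1]   # samples 1 and 5 are misclassified

n = len(y_true)
weights = [1.0 / n] * n                      # AdaBoost starts from uniform weights

# Weighted error of this learner under the current sample weights.
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# The learner's vote in the final ensemble: lower error gives a larger alpha.
alpha = 0.5 * math.log((1.0 - error) / error)

# Scale misclassified samples up and correct ones down, then renormalize.
unnormalized = [w * math.exp(-alpha * t * p)
                for w, t, p in zip(weights, y_true, y_pred)]
total = sum(unnormalized)
new_weights = [w / total for w in unnormalized]

# Gradient boosting, by contrast, leaves the sample weights alone and hands the
# next learner the negative gradient of the loss; for squared error that is
# the residual y - F(x).
targets = [3.0, 5.0, 8.0]
current_predictions = [2.5, 5.5, 6.0]
pseudo_residuals = [t - c for t, c in zip(targets, current_predictions)]

print("alpha:", round(alpha, 4))
print("new weights:", [round(w, 4) for w in new_weights])
print("pseudo-residuals:", pseudo_residuals)
```

In the AdaBoost half the next learner sees the same labels with different weights; in the gradient boosting half it sees the same samples with different targets.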

Question 3: How does regularization help in XGBoost?
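- XGBoost adds an explicit regularization term to its training objective, which is one of the main reasons it often generalizes better than an unregularized gradient boosting implementation. The term penalizes the number of leaves in each tree (`gamma`) and the size of the leaf weights (L2 via `reg_lambda`, L1 via `reg_alpha`), so a split is kept only when its loss reduction outweighs the added complexity. Together with shrinkage (`learning_rate`) and row/column subsampling, this controls model complexity and helps prevent overfitting.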
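The effect of these penalties can be seen in the closed-form leaf value and split gain of XGBoost's second-order objective. The sketch below is a simplified illustration: the function names are hypothetical, `G` and `H` stand for the summed gradients and Hessians of the samples in a leaf, and the L1 term is left out of the gain for brevity.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value given summed gradients G and Hessians H in the leaf."""
    # L1 soft-thresholding pulls small gradient sums to exactly zero.
    if G > reg_alpha:
        G_shrunk = G - reg_alpha
    elif G < -reg_alpha:
        G_shrunk = G + reg_alpha
    else:
        G_shrunk = 0.0
    # L2 enlarges the denominator, shrinking the leaf value toward zero.
    return -G_shrunk / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split; the split is kept only if this is positive."""
    def score(G, H):
        return G * G / (H + reg_lambda)

    parent = score(G_left + G_right, H_left + H_right)
    children = score(G_left, H_left) + score(G_right, H_right)
    return 0.5 * (children - parent) - gamma


# Same leaf statistics, increasing regularization (made-up numbers).
print(leaf_weight(-4.0, 3.0, reg_lambda=0.0))                  # no penalty
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0))                  # L2 only
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=1.0))   # L2 and L1

# The same candidate split is accepted without gamma and rejected with gamma=3.
print(split_gain(-4.0, 3.0, 2.0, 2.0, gamma=0.0))
print(split_gain(-4.0, 3.0, 2.0, 2.0, gamma=3.0))
```

Larger `reg_lambda` and `reg_alpha` shrink every leaf's contribution, while `gamma` acts as a minimum gain a split must clear, which prunes splits that only fit noise.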

Question 4: Why is CatBoost considered efficient for handling categorical data?
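- CatBoost is considered efficient for categorical data because it handles categorical features natively, without requiring manual one-hot or label encoding. It converts each category into a number using ordered target statistics: a row's encoding is computed only from the target values of rows that precede it in a random permutation, which avoids leaking the row's own label into its features. Low-cardinality features can still be one-hot encoded internally, and CatBoost can also build combinations of categorical features automatically.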
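A simplified sketch of the ordered target statistic idea is shown below. It assumes the rows are already in random order and uses a single pass, whereas CatBoost itself works with several permutations; the function name, the `prior_weight` parameter and the toy data are hypothetical.

```python
def ordered_target_encode(categories, targets, prior, prior_weight=1.0):
    """Encode each row using only the targets of rows that come before it."""
    sums = {}     # running sum of targets seen so far, per category
    counts = {}   # running count of rows seen so far, per category
    encoded = []
    for cat, target in zip(categories, targets):
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean of the earlier targets for this category.
        encoded.append((s + prior_weight * prior) / (c + prior_weight))
        # Update the running statistics only after encoding the current row,
        # so a row's own label never leaks into its own feature value.
        sums[cat] = s + target
        counts[cat] = c + 1
    return encoded


cities = ["A", "B", "A", "A", "B"]          # a categorical feature
defaulted = [1, 0, 0, 1, 1]                 # binary target
prior = sum(defaulted) / len(defaulted)     # global default rate

encoded = ordered_target_encode(cities, defaulted, prior)
print([round(v, 4) for v in encoded])
```

The first occurrence of each category falls back to the prior, and later occurrences move toward that category's observed target rate as more history accumulates.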

Question 5: What are some real-world applications where boosting techniques are
preferred over bagging methods?

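- Boosting is preferred over bagging when the main problem is bias rather than variance: the data are structured (tabular), the patterns are complex, and simple models underfit. Because each learner corrects the errors of the ensemble so far, boosting is a common choice for credit scoring and loan-default prediction, fraud detection, click-through-rate prediction, search ranking, and customer churn prediction. Bagging methods such as Random Forest remain the safer choice when labels are very noisy, since boosting can overfit to mislabeled samples.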



Question 6: Write a Python program to:
● Train an AdaBoost Classifier on the Breast Cancer dataset
● Print the model accuracy

In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Load Dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into Train and Test Sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize AdaBoost Classifier
model = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=42)

# Train the Model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)

print("AdaBoost Classifier Accuracy:", accuracy)


AdaBoost Classifier Accuracy: 0.9736842105263158


Question 7: Write a Python program to:
● Train a Gradient Boosting Regressor on the California Housing dataset
● Evaluate performance using R-squared score


In [3]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Load California Housing Dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Initialize Model
model = GradientBoostingRegressor(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

# Train Model
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate R² Score
r2 = r2_score(y_test, y_pred)

print("Gradient Boosting Regressor R-squared Score:", r2)


Gradient Boosting Regressor R-squared Score: 0.8004451261281281


Question 8: Write a Python program to:
● Train an XGBoost Classifier on the Breast Cancer dataset
● Tune the learning rate using GridSearchCV
● Print the best parameters and accuracy


In [4]:
from xgboost import XGBClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load Dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# XGBoost Classifier
# use_label_encoder is omitted: recent XGBoost versions ignore it and emit a warning
model = XGBClassifier(
    eval_metric='logloss',
    random_state=42
)

# Hyperparameter Grid (tuning only learning rate)
param_grid = {
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.2]
}

# GridSearchCV
grid = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Train Model
grid.fit(X_train, y_train)

# Best Parameters
print("Best Parameters:", grid.best_params_)

# Predict using the best model
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate Accuracy
accuracy = accuracy_score(y_test, y_pred)
print("XGBoost Classifier Accuracy:", accuracy)



Best Parameters: {'learning_rate': 0.2}
XGBoost Classifier Accuracy: 0.956140350877193


Question 9: Write a Python program to:
● Train a CatBoost Classifier
● Plot the confusion matrix using seaborn

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train CatBoost Classifier
model = CatBoostClassifier(iterations=200, learning_rate=0.1, random_seed=42, verbose=0)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)

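# Plot matrix with class names on both axes (rows: true labels, columns: predictions)
plt.figure(figsize=(6, 4))
sns.heatmap(cm, annot=True, cmap="Blues", fmt="d",
            xticklabels=data.target_names, yticklabels=data.target_names)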
plt.title("Confusion Matrix - CatBoost Classifier")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()


Question 10: You're working for a FinTech company trying to predict loan default using
customer demographics and transaction behavior.
The dataset is imbalanced, contains missing values, and has both numeric and
categorical features.
Describe your step-by-step data science pipeline using boosting techniques:
● Data preprocessing & handling missing/categorical values
● Choice between AdaBoost, XGBoost, or CatBoost
● Hyperparameter tuning strategy
● Evaluation metrics you'd choose and why
● How the business would benefit from your model


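- The pipeline, as outlined in the cell below:
  - Data preprocessing: impute missing numeric values with the median, fill missing categorical values with a "<MISSING>" label, and engineer features from the transaction behavior. Categorical columns are passed to the model as they are, without one-hot encoding.
  - Model choice: CatBoost, because it consumes categorical columns directly (`cat_features`) and can reweight the classes for the imbalance (`auto_class_weights='Balanced'`).
  - Hyperparameter tuning: use a small learning rate (0.03) with a large iteration budget (2000) and let early stopping on a validation set choose the number of trees; `depth` and `l2_leaf_reg` are the next parameters to tune.
  - Evaluation metrics: AUC on the validation set during training and average precision (PR-AUC) on the test set, since accuracy is misleading when defaults are rare.
  - Business benefit: choosing the decision threshold by business cost lets the company trade off missed defaults against good customers who are turned away.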

In [None]:
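from catboost import CatBoostClassifier, Pool
from sklearn.metrics import average_precision_score


# Hypothetical helper: the loan dataset is not part of this notebook, so the
# already-split data and the categorical column list are passed in as arguments.
def train_default_model(X_train, y_train, X_val, y_val, X_test, y_test, cat_features):
    """Fit CatBoost on preprocessed loan data; return model, test probabilities, PR-AUC.

    Expects preprocessing to be done already: numeric gaps imputed with medians,
    categorical gaps filled with "<MISSING>", engineered features added.
    cat_features lists the categorical column indices (or names).
    """
    model = CatBoostClassifier(
        iterations=2000,
        learning_rate=0.03,
        depth=6,
        l2_leaf_reg=3,
        eval_metric='AUC',
        random_seed=42,
        early_stopping_rounds=100,      # stop once validation AUC stalls for 100 rounds
        auto_class_weights='Balanced'   # upweight the rare default class
    )

    train_pool = Pool(X_train, y_train, cat_features=cat_features)
    val_pool = Pool(X_val, y_val, cat_features=cat_features)
    test_pool = Pool(X_test, cat_features=cat_features)

    model.fit(train_pool, eval_set=val_pool)
    probs = model.predict_proba(test_pool)[:, 1]   # predicted probability of default

    # Average precision (PR-AUC) is more informative than accuracy on imbalanced data
    pr_auc = average_precision_score(y_test, probs)
    # Next step: compute the business cost for candidate thresholds
    return model, probs, pr_auc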
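The "business cost for thresholds" step can be sketched as follows. The labels, probabilities and the two cost figures are made-up assumptions (a missed default is taken to cost ten times as much as rejecting a good customer); in practice they would come from the test set and from the business.

```python
def expected_cost(y_true, probs, threshold, cost_fn, cost_fp):
    """Total cost if every applicant scoring at or above threshold is flagged."""
    cost = 0.0
    for label, p in zip(y_true, probs):
        flagged = p >= threshold
        if label == 1 and not flagged:
            cost += cost_fn      # missed default
        elif label == 0 and flagged:
            cost += cost_fp      # good customer turned away
    return cost


# Hypothetical test labels (1 = default) and predicted default probabilities.
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
probs = [0.05, 0.20, 0.35, 0.40, 0.55, 0.10, 0.30, 0.80, 0.15, 0.60]

COST_FN = 10.0   # assumed cost of approving a loan that defaults
COST_FP = 1.0    # assumed cost of rejecting a customer who would have repaid

# Scan thresholds 0.05, 0.10, ..., 0.95 and keep the cheapest one.
thresholds = [i / 20 for i in range(1, 20)]
best_threshold = min(
    thresholds,
    key=lambda t: expected_cost(y_true, probs, t, COST_FN, COST_FP),
)

print("Best threshold:", best_threshold)
print("Cost at best threshold:",
      expected_cost(y_true, probs, best_threshold, COST_FN, COST_FP))
print("Cost at 0.5:", expected_cost(y_true, probs, 0.5, COST_FN, COST_FP))
```

Because missed defaults are assumed to be much more expensive than false alarms, the cheapest threshold lands well below the default of 0.5.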
