# Assignment

Q.1.  What is Boosting in Machine Learning? Explain how it improves weak
learners.

Answer ->

**Boosting :**
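Boosting in Machine Learning is an ensemble learning technique that trains multiple weak learners one after another and combines them into a strong learner with improved predictive performance. A weak learner is a simple model, usually a shallow decision tree, that performs only slightly better than random guessing on its own.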

**How Boosting Improves Weak Learners :**

Boosting improves weak learners by:

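- Focusing on mistakes: Each new model is trained with extra emphasis on the examples that earlier models got wrong.

- Weighted combination: The final prediction is a weighted combination of all models; in AdaBoost, for example, more accurate models receive higher weights.

- Sequential learning: Models are trained one after another, so each learner contributes a small correction, and together they form a highly accurate model.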

As a result, boosting can convert multiple weak models into one strong model with much better accuracy.
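
The three points above can be illustrated with a minimal, from-scratch sketch of AdaBoost-style boosting on a one-feature toy dataset. The helper names (`stump_predict`, `fit_stump`, `adaboost_fit`, `adaboost_predict`, `accuracy`) and the toy data are assumptions made for illustration only; this is not how scikit-learn implements `AdaBoostClassifier`.

```python
import math

def stump_predict(x, threshold, polarity):
    """One-feature decision stump: returns polarity if x > threshold, else -polarity."""
    return polarity if x > threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost_fit(xs, ys, n_rounds):
    """Train n_rounds stumps sequentially; labels must be +1 or -1."""
    n = len(xs)
    weights = [1.0 / n] * n          # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # more accurate stump -> larger vote
        ensemble.append((alpha, threshold, polarity))
        # Increase the weights of misclassified samples, decrease the rest
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

def accuracy(ensemble, xs, ys):
    return sum(adaboost_predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)

# Toy data that no single stump can classify perfectly
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

print("1 stump :", accuracy(adaboost_fit(xs, ys, 1), xs, ys))
print("3 stumps:", accuracy(adaboost_fit(xs, ys, 3), xs, ys))
```

On this toy data the best single stump still misclassifies some points, while three boosted stumps separate the training set, because each later stump concentrates on the points the earlier ones missed.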

Q.2. What is the difference between AdaBoost and Gradient Boosting in terms
of how models are trained?

Answer ->

**AdaBoost :**

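- It focuses on re-weighting data points: after each round, the weights of misclassified samples are increased
- It uses these sample weights to make the next model pay more attention to difficult examples
- It is less flexible, because the classic algorithm corresponds to minimizing one specific loss (the exponential loss)
- Commonly used for classification (regression variants such as AdaBoost.R2 exist but are less common)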

**Gradient Boosting :**

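- It focuses on minimizing a loss function: each new model is fitted to the negative gradient of the loss with respect to the current predictions (for squared-error loss, these are the ordinary residuals)
- The negative gradient is the direction in which the loss decreases fastest, so each new model corrects the errors of the current ensemble
- It is more flexible, because it works with any differentiable loss function
- Commonly used for both classification and regression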
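
To contrast with AdaBoost's sample re-weighting, here is a minimal sketch of gradient boosting for squared-error loss, where the negative gradient is simply the residual. The function names (`fit_regression_stump`, `gradient_boost_fit`) and the toy data are illustrative assumptions, not scikit-learn's implementation.

```python
def fit_regression_stump(xs, targets):
    """Find the split on x that minimizes squared error.
    Returns (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # drop the max so the right side is never empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)

def gradient_boost_fit(xs, ys, n_rounds, learning_rate):
    """Fit stumps to residuals; returns (base_value, stumps, mse_history)."""
    base = sum(ys) / len(ys)                 # initial model: predict the mean
    predictions = [base] * len(xs)
    stumps = []
    mse_history = [mean_squared_error(ys, predictions)]
    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual y - F(x)
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_regression_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        # Take a small step in the direction of the fitted residuals
        predictions = [p + learning_rate * (left_value if x <= threshold else right_value)
                       for x, p in zip(xs, predictions)]
        mse_history.append(mean_squared_error(ys, predictions))
    return base, stumps, mse_history

xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
base, stumps, mse_history = gradient_boost_fit(xs, ys, n_rounds=20, learning_rate=0.5)
print("MSE before boosting:", mse_history[0])
print("MSE after 20 rounds:", mse_history[-1])
```

No sample weights appear anywhere: every round sees all points equally, and the correction comes entirely from the residual targets.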

Q.3. How does regularization help in XGBoost?

Answer ->

Regularization in XGBoost helps control model complexity and prevent overfitting by adding penalty terms to the objective function. It includes L1 (alpha) and L2 (lambda) regularization on leaf weights, which shrink the leaf values toward zero (L1 can set small ones exactly to zero), and gamma (γ), the minimum loss reduction required to make a split, which penalizes adding more leaves to a tree. These penalties discourage overly complex trees and extreme leaf values, giving smoother, more generalizable predictions. By trading a small amount of training accuracy for simplicity, regularization improves XGBoost's ability to generalize to unseen or noisy data in real-world applications.
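
The effect of the three penalties can be seen in the closed-form leaf weight and split gain. The sketch below follows the form given in the XGBoost paper, with L1 applied as a soft threshold on the gradient sum; the function names are hypothetical and the code is an illustration of the formulas, not XGBoost's internal implementation. `grad_sum` and `hess_sum` stand for the sums of first and second derivatives of the loss over the samples in a leaf.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha (the effect of the L1 penalty)."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0

def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value: larger lambda or alpha pulls it toward zero."""
    return -soft_threshold(grad_sum, reg_alpha) / (hess_sum + reg_lambda)

def leaf_score(grad_sum, hess_sum, reg_lambda, reg_alpha):
    return soft_threshold(grad_sum, reg_alpha) ** 2 / (hess_sum + reg_lambda)

def split_gain(grad_left, hess_left, grad_right, hess_right,
               reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Loss reduction from a split, minus the gamma cost of the extra leaf."""
    parent = leaf_score(grad_left + grad_right, hess_left + hess_right,
                        reg_lambda, reg_alpha)
    children = (leaf_score(grad_left, hess_left, reg_lambda, reg_alpha)
                + leaf_score(grad_right, hess_right, reg_lambda, reg_alpha))
    return 0.5 * (children - parent) - gamma

# Same leaf statistics, increasing regularization
print(leaf_weight(-4.0, 3.0, reg_lambda=0.0))                 # no penalty
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0))                 # L2 shrinks the weight
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=5.0))  # L1 zeroes it out

# The same split is worthwhile with gamma=0 but rejected with gamma=5
print(split_gain(-4.0, 3.0, 4.0, 3.0, gamma=0.0))
print(split_gain(-4.0, 3.0, 4.0, 3.0, gamma=5.0))
```

A split is only worth making when its gain stays positive after subtracting gamma, which is how a larger gamma leads to fewer leaves.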

Q.4. Why is CatBoost considered efficient for handling categorical data?

Answer ->

**CatBoost** is efficient for handling **categorical data** because it has a **built-in mechanism** to process categorical features directly, without requiring manual preprocessing such as one-hot or label encoding. It uses **ordered target statistics** (an ordered form of target encoding): each categorical value is converted to a number based on the target values of rows that come earlier in a random permutation of the data, which prevents **target leakage**. This lets CatBoost learn meaningful relationships from categorical features with less risk of overfitting. It also handles **high-cardinality** features (those with many unique categories) without the blow-up in columns that one-hot encoding causes, which keeps **training time** and **memory usage** manageable and often makes it a strong choice among gradient boosting methods for categorical data.
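
The idea behind ordered target statistics can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function name `ordered_target_encode` is hypothetical, the rows are taken in their given order rather than a random permutation, and CatBoost's real implementation differs in detail (for example, it uses several permutations).

```python
def ordered_target_encode(categories, targets, prior, prior_weight=1.0):
    """Encode each row's category using only the targets of EARLIER rows.

    encoded = (sum of earlier targets for this category + prior_weight * prior)
              / (count of earlier rows with this category + prior_weight)
    """
    sums = {}
    counts = {}
    encoded = []
    for category, target in zip(categories, targets):
        earlier_sum = sums.get(category, 0.0)
        earlier_count = counts.get(category, 0)
        encoded.append((earlier_sum + prior_weight * prior)
                       / (earlier_count + prior_weight))
        # Only now does the current row's target become visible to later rows
        sums[category] = earlier_sum + target
        counts[category] = earlier_count + 1
    return encoded

categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 0, 1, 1]
print(ordered_target_encode(categories, targets, prior=0.5))
# [0.5, 0.5, 0.75, 0.5, 0.25]
```

Because a row's own target never enters its encoding, the encoded feature cannot leak the label it is supposed to help predict.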



Q.5.  What are some real-world applications where boosting techniques are
preferred over bagging methods?

Answer ->

Boosting techniques are preferred over bagging methods in real-world applications where high accuracy, complex decision boundaries, and minimizing bias are crucial. Unlike bagging (which reduces variance), boosting focuses on reducing bias by combining weak learners sequentially to correct previous errors.

Here are some key applications :     
- Credit Scoring and Loan Default Prediction (Finance)
- Fraud Detection (Banking & E-commerce)
- Customer Churn Prediction (Telecom & Marketing)
- Medical Diagnosis and Disease Prediction (Healthcare)
- Click-Through Rate (CTR) Prediction (Online Advertising)
- Stock Market Forecasting and Risk Management

Q.6. Datasets:

● Use sklearn.datasets.load_breast_cancer() for classification tasks.

● Use sklearn.datasets.fetch_california_housing() for regression
tasks.

Question : Write a Python program to:

● Train an AdaBoost Classifier on the Breast Cancer dataset

● Print the model accuracy


In [1]:
# Answer ->>

# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the AdaBoost Classifier
model = AdaBoostClassifier(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print model accuracy
print("AdaBoost Classifier Accuracy:", accuracy)


AdaBoost Classifier Accuracy: 0.9736842105263158


Q.7. Write a Python program to:

● Train a Gradient Boosting Regressor on the California Housing dataset

● Evaluate performance using R-squared score

In [2]:
# Answer ->>

# Import required libraries
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Load California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize Gradient Boosting Regressor
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)

# Train the model
gbr.fit(X_train, y_train)

# Make predictions
y_pred = gbr.predict(X_test)

# Evaluate performance using R-squared score
r2 = r2_score(y_test, y_pred)
print("Gradient Boosting Regressor R-squared Score:", r2)


Gradient Boosting Regressor R-squared Score: 0.7756446042829697


Q.8. Write a Python program to:

● Train an XGBoost Classifier on the Breast Cancer dataset

● Tune the learning rate using GridSearchCV

● Print the best parameters and accuracy

In [6]:
# Answer ->>

# Import required libraries
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize XGBoost Classifier
xgb = XGBClassifier(eval_metric='logloss', random_state=42)

# Define hyperparameter grid for learning rate
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3]
}

# Initialize GridSearchCV
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Get best parameters
best_params = grid_search.best_params_

# Predict on test set using the best model
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)

# Print results
print("Best Parameters:", best_params)
print("Test Set Accuracy:", accuracy)


Best Parameters: {'learning_rate': 0.2}
Test Set Accuracy: 0.956140350877193


Q.9. Write a Python program to:

● Train a CatBoost Classifier

● Plot the confusion matrix using seaborn

In [None]:
# Answer ->>

# Import required libraries
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize CatBoost Classifier
model = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=3, verbose=0, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix using seaborn
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=data.target_names, yticklabels=data.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - CatBoost Classifier')
plt.show()


Q.10. You're working for a FinTech company trying to predict loan default using
customer demographics and transaction behavior.
The dataset is imbalanced, contains missing values, and has both numeric and
categorical features.

Describe your step-by-step data science pipeline using boosting techniques:

● Data preprocessing & handling missing/categorical values

● Choice between AdaBoost, XGBoost, or CatBoost

● Hyperparameter tuning strategy

● Evaluation metrics you'd choose and why

● How the business would benefit from your model

Answer ->

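Data Cleaning: Handle missing values. XGBoost and CatBoost can work with missing numeric values directly; otherwise impute them (for example, the median for numeric features and a separate "Missing" category for categorical features).

Feature Processing: Encode categorical variables. CatBoost accepts categorical features directly, while XGBoost and AdaBoost typically need them encoded first (for example, one-hot or target encoding). Scaling is generally not needed for tree-based boosting models.

Imbalance Handling: Class weighting (for example, `scale_pos_weight` in XGBoost) or resampling. Resampling should be applied to the training data only, so that the test data keeps the real class distribution.

Model Selection: CatBoost/XGBoost for robust boosting. CatBoost is a natural first choice here because the data mixes numeric and categorical features; XGBoost is a strong alternative once the categorical features are encoded. AdaBoost is less suitable because it is more sensitive to noisy samples and outliers.

Hyperparameter Tuning: CV-based search (stratified k-fold, so each fold keeps the class ratio) over learning rate, depth, number of trees, and regularization, combined with early stopping on a validation set.

Model Evaluation: ROC-AUC, Precision-Recall, F1-score. Accuracy is misleading on imbalanced data, because a model that predicts "no default" for every customer would still score highly. Recall on the default class matters most when a missed default is costlier than a wrongly rejected loan.

Business Impact: Reduced default, better loan allocation, and regulatory insight.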
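
Two of the steps above, class weighting and metric choice, can be made concrete with a small sketch. The helper names and the synthetic labels are assumptions for illustration; the negative-to-positive ratio is the value the XGBoost documentation suggests as a starting point for `scale_pos_weight`.

```python
def scale_pos_weight(labels):
    """Ratio of negative to positive samples (1 = default, 0 = no default)."""
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return negatives / positives

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (default) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Synthetic, imbalanced labels: 95 non-defaulters followed by 5 defaulters
y_true = [0] * 95 + [1] * 5

# A useless model that never predicts default
always_no_default = [0] * 100

# A model that flags 4 of the 5 defaulters, at the cost of 4 false alarms
useful_model = [1] * 4 + [0] * 91 + [1] * 4 + [0] * 1

print("scale_pos_weight:", scale_pos_weight(y_true))
print("Always 'no default':", accuracy(y_true, always_no_default),
      precision_recall_f1(y_true, always_no_default))
print("Useful model:", accuracy(y_true, useful_model),
      precision_recall_f1(y_true, useful_model))
```

The model that never predicts default scores the same accuracy as the useful one, but its recall and F1 for the default class are zero, which is why precision, recall, and F1 are the better guides here.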