Q1. What is Boosting? How does it improve weak learners?

Ans. Boosting is an ensemble technique that builds a strong model by combining many weak learners (typically shallow decision trees). Models are trained sequentially: each new learner concentrates on the samples the previous ones got wrong, either by reweighting the data or by fitting the negative gradient of a loss. The final prediction is a weighted sum (regression) or weighted vote (classification) of all learners. Because each learner corrects the errors of the ensemble so far, boosting mainly reduces bias; shrinkage and shallow trees keep variance in check, which improves generalization.
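As a rough illustration of the weighted vote described above, the sketch below combines three hypothetical weak learners, each wrong on a different sample. The predictions and learner weights are made-up values chosen for illustration, not output from a trained model.

```python
# Labels are encoded as -1 / +1 so a weighted vote is just the sign of a sum.
y_true = [+1, +1, -1, -1, +1, -1]

# Hypothetical weak-learner outputs: each one misclassifies a different sample.
weak_predictions = [
    [-1, +1, -1, -1, +1, -1],  # wrong on sample 0
    [+1, +1, +1, -1, +1, -1],  # wrong on sample 2
    [+1, +1, -1, -1, -1, -1],  # wrong on sample 4
]
# Hypothetical learner weights (larger = more trusted).
alphas = [0.8, 0.6, 0.7]


def weighted_vote(predictions, weights):
    """Return the sign of the weighted sum of weak-learner votes per sample."""
    n_samples = len(predictions[0])
    combined = []
    for i in range(n_samples):
        score = sum(w * p[i] for w, p in zip(weights, predictions))
        combined.append(1 if score >= 0 else -1)
    return combined


def accuracy(y, y_hat):
    """Fraction of positions where the two label lists agree."""
    return sum(a == b for a, b in zip(y, y_hat)) / len(y)


ensemble = weighted_vote(weak_predictions, alphas)
individual_acc = [accuracy(y_true, p) for p in weak_predictions]
ensemble_acc = accuracy(y_true, ensemble)
print("Individual accuracies:", individual_acc)
print("Ensemble accuracy:", ensemble_acc)
```

Each learner makes one mistake, but on every sample the two correct learners outvote the wrong one, so the combined vote is correct everywhere in this toy case.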

Q2. What is the difference between AdaBoost and Gradient Boosting in terms of
how models are trained?


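Ans. AdaBoost: reweights the training samples each round based on classification errors, so the next stump/tree pays more attention to misclassified points. Each learner also gets a weight in the final vote that grows as its weighted error shrinks. This reweighting scheme is equivalent to stagewise minimization of an exponential loss.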
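To make the reweighting step concrete, here is a minimal sketch of one round of the classic binary (discrete) AdaBoost update with labels in {-1, +1}. The stump predictions are hypothetical, and library implementations such as scikit-learn's use variants of this rule.

```python
import math

# Toy setup: 5 samples with uniform starting weights; labels are -1 / +1.
y_true = [+1, +1, -1, -1, +1]
# Hypothetical stump output for this round: wrong on the last sample only.
y_pred = [+1, +1, -1, -1, -1]
weights = [1 / len(y_true)] * len(y_true)

# Weighted error of the stump under the current sample weights.
err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# Learner weight: a smaller error gives a larger alpha (more say in the vote).
alpha = 0.5 * math.log((1 - err) / err)

# Reweight: t * p is +1 when correct and -1 when wrong, so misclassified
# samples are multiplied by exp(+alpha) and correct ones by exp(-alpha).
new_weights = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]  # renormalize to sum to 1

print("weighted error:", err)
print("alpha:", round(alpha, 4))
print("new weights:", [round(w, 4) for w in new_weights])
```

After renormalization the misclassified sample carries half of the total weight, which is what forces the next learner to attend to it.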

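Gradient Boosting: keeps the sample weights fixed and instead fits each new tree to the negative gradient of a specified loss (the pseudo-residuals; for squared error these are the ordinary residuals, for classification the loss is typically log loss/deviance). It is a general gradient-descent framework in function space, where the learning rate and tree parameters control the size of each step.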
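The residual-fitting loop can be sketched in plain Python as follows. This is a simplified illustration that assumes the loss (1/2)(y - f)^2, hand-written depth-1 stumps on a single feature, and made-up data; it is not how scikit-learn implements GradientBoostingRegressor.

```python
# Toy 1-D regression data (made up for illustration).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]


def mean(values):
    return sum(values) / len(values)


def fit_stump(x, targets):
    """Fit a depth-1 regression tree: pick the threshold minimizing squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so both sides are non-empty
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = (sum((t - left_value) ** 2 for t in left)
               + sum((t - right_value) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda xi: left_value if xi <= threshold else right_value


def mse(targets, preds):
    return mean([(t - p) ** 2 for t, p in zip(targets, preds)])


learning_rate = 0.5
n_rounds = 20

# Round 0: the best constant prediction under squared error is the mean.
base_value = mean(y)
preds = [base_value] * len(y)
stumps = []
mse_history = [mse(y, preds)]

for _ in range(n_rounds):
    # For (1/2)(y - f)^2 the negative gradient is the residual y - f.
    residuals = [t - p for t, p in zip(y, preds)]
    stump = fit_stump(X, residuals)
    stumps.append(stump)
    # Take a shrunken step in the direction the stump suggests.
    preds = [p + learning_rate * stump(xi) for p, xi in zip(preds, X)]
    mse_history.append(mse(y, preds))


def predict(xi):
    """Sum the base value and all shrunken stump outputs."""
    return base_value + learning_rate * sum(s(xi) for s in stumps)


print("Training MSE, first vs last round:", mse_history[0], mse_history[-1])
```

Unlike the AdaBoost update, no sample weights change here: every round sees the same data, and only the targets (the residuals) move.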

Q3. How does regularization help in XGBoost?

Ans. XGBoost regularizes at multiple levels:

● L1/L2 penalties on leaf weights (reg_alpha, reg_lambda) shrink leaf values and control model complexity.
● Shrinkage (learning_rate) scales each tree's contribution so the model takes smaller, safer steps.
● Subsampling of rows/columns (subsample, colsample_bytree) decorrelates trees and cuts variance.
● Tree constraints (max_depth, min_child_weight, gamma) limit tree size and reject splits whose gain is too small.

Together these prevent overfitting and improve out-of-sample performance and training stability.
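The effect of the penalties on a single leaf and a single split can be sketched with the closed-form expressions from the XGBoost documentation, where G and H are the sums of first and second derivatives of the loss over the samples in a node. The numbers below are hypothetical, the function names are illustrative rather than part of the xgboost API, and the library's internals may differ in constant factors.

```python
def soft_threshold(g, alpha):
    """Shrink the gradient sum toward zero by alpha (effect of the L1 penalty)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(g, h, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value for gradient sum g and hessian sum h."""
    return -soft_threshold(g, reg_alpha) / (h + reg_lambda)


def leaf_score(g, h, reg_lambda=1.0, reg_alpha=0.0):
    """Loss reduction a single leaf can achieve (higher is better)."""
    return soft_threshold(g, reg_alpha) ** 2 / (h + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right,
               reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Gain of splitting a node into two leaves, minus the split penalty gamma."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda, reg_alpha)
    children = (leaf_score(g_left, h_left, reg_lambda, reg_alpha)
                + leaf_score(g_right, h_right, reg_lambda, reg_alpha))
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient/hessian sums for the two sides of a candidate split.
g_left, h_left = -4.0, 3.0
g_right, h_right = 2.0, 2.0

w_unregularized = leaf_weight(g_left, h_left, reg_lambda=0.0)
w_l2 = leaf_weight(g_left, h_left, reg_lambda=1.0)
w_l1_l2 = leaf_weight(g_left, h_left, reg_lambda=1.0, reg_alpha=1.0)

gain_no_gamma = split_gain(g_left, h_left, g_right, h_right, gamma=0.0)
gain_with_gamma = split_gain(g_left, h_left, g_right, h_right, gamma=3.0)

print("Leaf weight (no penalty, L2, L1+L2):", w_unregularized, w_l2, w_l1_l2)
print("Split gain (gamma=0, gamma=3):", gain_no_gamma, gain_with_gamma)
```

Each added penalty pulls the leaf weight closer to zero, and a large enough gamma turns a worthwhile split into one with negative gain, which is how weak splits get rejected.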

Q4. Why is CatBoost considered efficient for handling categorical data?

Ans. CatBoost uses ordered target statistics and ordered boosting: each row's category is encoded using only the targets of rows that come before it in a random permutation, which avoids target leakage. It encodes categorical features internally, including high-cardinality ones, so no manual one-hot or label encoding is needed. It also supports missing values natively and uses symmetric (oblivious) trees for fast, robust training, so it often requires less preprocessing than other boosters.
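The idea behind ordered target statistics can be sketched in a few lines of plain Python. This is a simplified illustration with made-up data and a hypothetical helper name; CatBoost itself uses several permutations and additional details not shown here.

```python
import random


def ordered_target_statistic(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each row using only the targets of rows that precede it in a random order."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)

    sums, counts = {}, {}  # running target sum / row count per category
    encoded = [None] * len(categories)
    for i in order:
        cat = categories[i]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean of earlier targets; row i's own target is not used here.
        encoded[i] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now does row i become "history" for later rows.
        sums[cat] = s + targets[i]
        counts[cat] = c + 1
    return encoded


# Toy data (made up): a city column and a binary target.
cities = ["paris", "rome", "paris", "oslo", "rome", "paris"]
clicked = [1, 0, 1, 0, 1, 0]
prior = sum(clicked) / len(clicked)

encoded = ordered_target_statistic(cities, clicked, prior)
for city, value in zip(cities, encoded):
    print(city, round(value, 3))
```

Because a row's own label never enters its encoding, the encoded feature cannot simply memorize the target, which is the leakage that a plain per-category mean would introduce.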

Q5. What are some real-world applications where boosting techniques are
preferred over bagging methods?


Ans.
● Credit default and fraud detection (imbalanced data, heterogeneous features)
● Click-through rate prediction / ad ranking
● Medical diagnosis (tabular clinical features)
● Customer churn / propensity modeling
● Tabular regression and forecasting (e.g., house prices, demand)

Boosters shine on structured/tabular data with complex, non-linear interactions and mixed feature types, where reducing bias matters more than the variance reduction that bagging provides.

Q6. Write a Python program to:
● Train an AdaBoost Classifier on the Breast Cancer dataset
● Print the model accuracy
(Include your Python code and output in the code box below.)


In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import AdaBoostClassifier

data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = AdaBoostClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))


Q7. Write a Python program to:
● Train a Gradient Boosting Regressor on the California Housing dataset
● Evaluate performance using R-squared score
(Include your Python code and output in the code box below.)


In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

data = fetch_california_housing()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train, y_train)
y_pred = gbr.predict(X_test)
print("R^2:", r2_score(y_test, y_pred))


Q8. Write a Python program to:
● Train an XGBoost Classifier on the Breast Cancer dataset
● Tune the learning rate using GridSearchCV
● Print the best parameters and accuracy
(Include your Python code and output in the code box below.)

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

xgb = XGBClassifier(
    n_estimators=200,
    max_depth=3,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    eval_metric='logloss',
    random_state=42,
)

grid = GridSearchCV(
    estimator=xgb,
    param_grid={'learning_rate': [0.01, 0.05, 0.1, 0.2]},
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid.fit(X_tr, y_tr)
best = grid.best_estimator_
print("Best params:", grid.best_params_)
print("Accuracy:", accuracy_score(y_te, best.predict(X_te)))


Q9. Write a Python program to:
● Train a CatBoost Classifier
● Plot the confusion matrix using seaborn
(Include your Python code and output in the code box below.)


In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from catboost import CatBoostClassifier
import seaborn as sns
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

model = CatBoostClassifier(
    depth=6, learning_rate=0.1, iterations=300,
    loss_function='Logloss', verbose=0, random_seed=42
)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

cm = confusion_matrix(y_te, pred)
# In this dataset, class 0 is malignant and class 1 is benign.
labels = ['malignant', 'benign']
sns.heatmap(cm, annot=True, fmt='d', xticklabels=labels, yticklabels=labels)
plt.title("CatBoost Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.tight_layout()
plt.show()
