Q1. What is Boosting in Machine Learning? Explain how it improves weak learners.

Answer:
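Boosting is an ensemble learning technique that combines many weak learners (models only slightly better than random guessing, such as one-level decision trees) into a single strong learner. The models are trained sequentially, and each new model concentrates on the examples the previous models got wrong. The final prediction is a weighted combination of all the models, which mainly reduces bias and so improves predictive accuracy.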
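
As a rough illustration of a weak learner being improved, the sketch below compares a single decision stump with an AdaBoost ensemble on the same train/test split. The dataset, split, and variable names (stump_acc, boosted_acc) are choices made for this example only, and the exact scores will vary with the scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# A single weak learner: a decision tree with one split (a "stump")
stump = DecisionTreeClassifier(max_depth=1, random_state=42)
stump.fit(X_train, y_train)
stump_acc = stump.score(X_test, y_test)

# AdaBoost combines 100 learners; its default base learner is also a depth-1 tree
boosted = AdaBoostClassifier(n_estimators=100, random_state=42)
boosted.fit(X_train, y_train)
boosted_acc = boosted.score(X_test, y_test)

print(f"Single stump accuracy: {stump_acc:.4f}")
print(f"Boosted stumps accuracy: {boosted_acc:.4f}")
```

On this split the boosted ensemble should score higher than the lone stump, which is the bias reduction described above.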

Q2. What is the difference between AdaBoost and Gradient Boosting in terms of how models are trained?

Answer:
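Both methods train models sequentially, but they differ in how each new model is directed at the remaining errors. AdaBoost increases the weights of misclassified samples after every round, so the next learner concentrates on them, and it gives each learner a vote proportional to its accuracy. Gradient Boosting fits each new model to the negative gradient of a differentiable loss function with respect to the current predictions (the pseudo-residuals), which amounts to gradient descent in function space. AdaBoost can be viewed as the special case that uses an exponential loss.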
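
The two training schemes can be sketched in a few lines each. This is a simplified illustration, not the scikit-learn implementation: adaboost_reweight and gradient_boost_squared_loss are hypothetical helper names, the AdaBoost step assumes labels in {-1, +1} and a weighted error strictly between 0 and 1, and the gradient boosting loop assumes squared loss, for which the negative gradient is simply the residual.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def adaboost_reweight(weights, y_true, y_pred):
    """One AdaBoost round for labels in {-1, +1}; returns (new_weights, alpha)."""
    miss = y_true != y_pred
    err = weights[miss].sum() / weights.sum()      # weighted error of this learner
    alpha = 0.5 * np.log((1.0 - err) / err)        # the learner's vote strength
    # Misclassified samples (y_true * y_pred == -1) are scaled up, the rest down
    new_weights = weights * np.exp(-alpha * y_true * y_pred)
    return new_weights / new_weights.sum(), alpha


def gradient_boost_squared_loss(X, y, n_rounds=50, learning_rate=0.1):
    """Fit shallow trees to residuals; returns (trees, training predictions)."""
    pred = np.full(len(y), y.mean())               # start from a constant model
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred                       # negative gradient of 0.5 * (y - pred)^2
        tree = DecisionTreeRegressor(max_depth=2, random_state=0)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, pred


# AdaBoost: five equally weighted samples, the last one misclassified
w = np.full(5, 0.2)
y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])
new_w, alpha = adaboost_reweight(w, y_true, y_pred)
print("Updated sample weights:", new_w)

# Gradient boosting: noisy sine curve (synthetic data for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.1, size=200)
_, fitted = gradient_boost_squared_loss(X_demo, y_demo)
mse_before = np.mean((y_demo - y_demo.mean()) ** 2)
mse_after = np.mean((y_demo - fitted) ** 2)
print(f"Training MSE before: {mse_before:.4f}, after: {mse_after:.4f}")
```

In the AdaBoost step the single misclassified sample ends up carrying half of the total weight, which is what forces the next learner to attend to it. In the gradient boosting loop no sample weights change at all; the target itself is replaced by the current residuals.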

Q3. How does regularization help in XGBoost?

Answer:
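XGBoost adds a regularization term to its training objective, which penalizes complex trees and so helps prevent overfitting. The L1 penalty (reg_alpha) and L2 penalty (reg_lambda) shrink the leaf weights, and gamma sets the minimum loss reduction a split must achieve before it is kept. Together with limits such as max_depth, these controls keep individual trees small and their predictions conservative.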
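
The effect of these penalties can be seen in the closed-form leaf weight and split gain that follow from the commonly cited form of XGBoost's regularized objective. The sketch below is plain Python written for illustration, not the library's code: G and H stand for the sums of first and second derivatives of the loss over the samples in a leaf, and all function names are hypothetical.

```python
def soft_threshold(g, reg_alpha):
    """Shrink g toward zero by reg_alpha (the effect of the L1 penalty)."""
    if g > reg_alpha:
        return g - reg_alpha
    if g < -reg_alpha:
        return g + reg_alpha
    return 0.0


def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value; a larger reg_lambda pulls it toward zero."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)


def leaf_score(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Loss reduction contributed by one leaf, up to a factor of 1/2."""
    return soft_threshold(G, reg_alpha) ** 2 / (H + reg_lambda)


def split_gain(GL, HL, GR, HR, reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Gain from splitting a node into left/right children, minus the gamma penalty."""
    children = (leaf_score(GL, HL, reg_lambda, reg_alpha)
                + leaf_score(GR, HR, reg_lambda, reg_alpha))
    parent = leaf_score(GL + GR, HL + HR, reg_lambda, reg_alpha)
    return 0.5 * (children - parent) - gamma


# The same leaf statistics under increasing regularization
print(leaf_weight(-10.0, 5.0, reg_lambda=0.0))                 # no penalty
print(leaf_weight(-10.0, 5.0, reg_lambda=5.0))                 # L2 only
print(leaf_weight(-10.0, 5.0, reg_lambda=5.0, reg_alpha=2.0))  # L1 and L2

# A split that is worthwhile without gamma can be rejected once gamma is large
print(split_gain(-8.0, 3.0, 2.0, 2.0, gamma=0.0))
print(split_gain(-8.0, 3.0, 2.0, 2.0, gamma=6.0))
```

A split with non-positive gain is not worth making, so raising gamma prunes marginal splits, while reg_lambda and reg_alpha reduce how far any single leaf can move the prediction.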

Q4. Why is CatBoost considered efficient for handling categorical data?

Answer:
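CatBoost handles categorical features internally using ordered target statistics: each row's category is encoded from the target values of the rows that precede it in a random permutation, never from its own target. This removes the need for manual one-hot encoding, keeps high-cardinality features compact, and reduces the target leakage (and resulting overfitting) that ordinary target encoding suffers from.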
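
The idea behind ordered target statistics can be sketched in plain Python. This is a simplified illustration under stated assumptions, not CatBoost's implementation: the function name ordered_target_stats and the prior and prior_weight parameters are hypothetical, the rows are assumed to be already in the permutation order, and the real library averages over several random permutations.

```python
from collections import defaultdict


def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    """Encode each row using only the targets of earlier rows in the same category."""
    sums = defaultdict(float)   # running sum of targets seen so far, per category
    counts = defaultdict(int)   # running number of rows seen so far, per category
    encoded = []
    for cat, target in zip(categories, targets):
        # Smoothed mean of the history; the row's own target is not used yet
        value = (sums[cat] + prior_weight * prior) / (counts[cat] + prior_weight)
        encoded.append(value)
        # Only now does the current row become part of the history
        sums[cat] += target
        counts[cat] += 1
    return encoded


categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 0, 1, 1]
print(ordered_target_stats(categories, targets))
# → [0.5, 0.5, 0.75, 0.5, 0.25]
```

The first occurrence of each category falls back to the prior because it has no history, and no row ever sees its own label, which is the property that limits leakage.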

Q5. What are some real-world applications where boosting techniques are preferred over bagging methods?

Answer:
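Boosting is preferred in credit scoring, fraud detection, medical diagnosis, and recommendation or ranking systems. These are mostly structured (tabular) problems where high predictive accuracy is required and where reducing bias matters more than reducing variance. Bagging methods such as Random Forest remain a good choice when the data is very noisy or when models need to be trained in parallel with little tuning.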

Q6. Train an AdaBoost Classifier on the Breast Cancer dataset and print accuracy.

Answer:

In [None]:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

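# Load the data and hold out 20% for testing
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Boost 100 weak learners (decision stumps by default)
model = AdaBoostClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate on the held-out test set
pred = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, pred):.4f}")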


Q7. Train a Gradient Boosting Regressor on the California Housing dataset and evaluate the R² score.

Answer:

In [None]:

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

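# Load the data (downloaded on first use) and hold out 20% for testing
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Default settings: 100 trees, learning rate 0.1, squared-error loss
gbr = GradientBoostingRegressor(random_state=42)
gbr.fit(X_train, y_train)

# Evaluate on the held-out test set
pred = gbr.predict(X_test)
print(f"R² score: {r2_score(y_test, pred):.4f}")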


Q8. Train an XGBoost Classifier, tune learning rate using GridSearchCV, and print best parameters and accuracy.

Answer:

In [None]:

# Install if needed:
# !pip install xgboost

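from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Candidate learning rates to search over
params = {
    'learning_rate': [0.01, 0.1, 0.2]
}

# use_label_encoder is omitted: it is deprecated and ignored by recent XGBoost releases
xgb = XGBClassifier(eval_metric='logloss', random_state=42)
grid = GridSearchCV(xgb, params, cv=3, scoring='accuracy')
grid.fit(X_train, y_train)

# Score the refit best model on held-out data, not on the data used for tuning
best_model = grid.best_estimator_
print("Best parameters:", grid.best_params_)
print(f"Test accuracy: {accuracy_score(y_test, best_model.predict(X_test)):.4f}")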


Q9. Train a CatBoost Classifier and plot the confusion matrix.

Answer:

In [None]:

# Install if needed:
# !pip install catboost

from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

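# Fix the seed so the result is reproducible
model = CatBoostClassifier(verbose=0, random_seed=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
cm = confusion_matrix(y_test, pred)

# In this dataset, label 0 is malignant and label 1 is benign
labels = ['malignant', 'benign']
sns.heatmap(cm, annot=True, fmt='d', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('CatBoost confusion matrix')
plt.show()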


Q10. FinTech Loan Default Prediction using Boosting Techniques

Answer:
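1. Data Preprocessing: Impute missing values (for example, the median for numeric features and a separate "missing" category for categorical ones) and encode categorical features if the chosen library does not handle them natively. Scaling numeric data is optional, because tree-based boosting is insensitive to monotonic rescaling of features.
2. Model Choice: CatBoost is preferred because it handles categorical features natively and tolerates missing numeric values, which reduces the amount of preprocessing required.
3. Hyperparameter Tuning: Use GridSearchCV or Bayesian optimization with cross-validation to tune the learning rate, tree depth, and number of trees.
4. Evaluation Metrics: Use ROC-AUC, Precision, Recall, and F1-score. Defaults are usually a small minority of loans, so plain accuracy can look high even for a model that never predicts a default.
5. Business Benefit: Reduces default risk, improves loan approval accuracy, and increases profitability.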
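
The steps above can be sketched end to end as follows. Everything in this example is an assumption made for illustration: the loan data is synthetic, the column names (income, loan_amount, credit_score, employment_type) are hypothetical, and scikit-learn's GradientBoostingClassifier stands in for CatBoost so that the sketch needs no extra installation. Because the stand-in does not handle categorical features or missing values itself, those are imputed and one-hot encoded explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic, hypothetical loan data (not a real dataset)
rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "income": rng.lognormal(10.5, 0.5, n),
    "loan_amount": rng.lognormal(9.5, 0.6, n),
    "credit_score": rng.normal(650, 80, n),
    "employment_type": rng.choice(
        ["salaried", "self_employed", "unemployed"], n, p=[0.6, 0.3, 0.1]
    ),
})

# Default probability rises with low credit score, high debt ratio, unemployment
logit = (-2.0
         + 0.015 * (650 - df["credit_score"])
         + df["loan_amount"] / df["income"]
         + (df["employment_type"] == "unemployed"))
prob = (1.0 / (1.0 + np.exp(-logit))).to_numpy()
y = (rng.random(n) < prob).astype(int)

# Knock out about 5% of two columns to simulate missing values
df.loc[rng.random(n) < 0.05, "income"] = np.nan
df.loc[rng.random(n) < 0.05, "employment_type"] = np.nan

numeric_cols = ["income", "loan_amount", "credit_score"]
categorical_cols = ["employment_type"]

# Step 1: impute and encode (no scaling; trees do not need it)
preprocess = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Step 2: boosting model (stand-in for CatBoost)
model = Pipeline([
    ("preprocess", preprocess),
    ("gb", GradientBoostingClassifier(random_state=42)),
])

# A stratified split keeps the default rate similar in train and test
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)
model.fit(X_train, y_train)

# Step 4: metrics suited to imbalanced classes
proba = model.predict_proba(X_test)[:, 1]
pred = model.predict(X_test)
metrics = {
    "roc_auc": roc_auc_score(y_test, proba),
    "precision": precision_score(y_test, pred, zero_division=0),
    "recall": recall_score(y_test, pred, zero_division=0),
    "f1": f1_score(y_test, pred, zero_division=0),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Step 3 is left out to keep the sketch short; wrapping the pipeline in GridSearchCV with a parameter such as gb__learning_rate would add it. In practice the 0.5 decision threshold used by predict would also be tuned to the relative cost of a missed default versus a rejected good loan.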