In [None]:
# 1. What is Boosting in Machine Learning?

Boosting is an ensemble learning technique that combines multiple weak learners (usually shallow decision trees) into a strong learner. Models are trained sequentially, and each new model tries to correct the errors made by the ensemble built so far. Boosting primarily reduces bias; combined with shrinkage and subsampling, it can also reduce variance.

In [None]:
# 2. How does Boosting differ from Bagging?

Both are ensemble methods, but Boosting trains models sequentially, with each model focusing on the mistakes of the previous ones. Bagging trains models independently on bootstrap samples of the data and then aggregates their predictions. Bagging mainly reduces variance by averaging predictions, while Boosting mainly reduces bias by improving the ensemble step by step.
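The contrast can be seen by giving both methods the same weak learner. The sketch below is illustrative only: the dataset, the choice of decision stumps, and the 50-estimator setting are assumptions, and the resulting scores depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption made for illustration)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# A deliberately weak, high-bias base learner
stump = DecisionTreeClassifier(max_depth=1)

# Bagging: stumps trained independently on bootstrap samples, then voted
bagging = BaggingClassifier(stump, n_estimators=50, random_state=42)
# Boosting: stumps trained one after another on reweighted data
boosting = AdaBoostClassifier(stump, n_estimators=50, random_state=42)

scores = {}
for name, model in [("Bagging", bagging), ("AdaBoost", boosting)]:
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name} accuracy with stumps: {scores[name]:.2f}")
```

Because averaging many stumps cannot remove the bias of a single stump, bagging tends to gain little here, whereas boosting can combine stumps into a more flexible model.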

In [None]:
# 3. What is the key idea behind AdaBoost?

AdaBoost (Adaptive Boosting) is a popular Boosting algorithm that adjusts the weights of misclassified instances, forcing the next weak learner to focus more on these difficult cases. The idea is to adaptively combine weak models into a strong one.

In [None]:
# 4. Explain the working of AdaBoost with an example.

In AdaBoost, a base learner (e.g., a decision stump) is first trained on the dataset with all instances weighted equally. Misclassified instances are then given higher weights, and the next learner is trained on the reweighted data. The process is repeated, and each learner is assigned a vote weight based on its weighted accuracy. For example, in a binary classification task, the first stump might separate only the easy points; later stumps concentrate on the points that earlier stumps misclassified, and the final prediction is the weighted vote of all stumps.
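The reweighting loop can be sketched from scratch. This is a minimal version of discrete AdaBoost for labels in {-1, +1}, not scikit-learn's internal implementation; the dataset, the number of rounds, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=500, n_features=10, random_state=42)
y = 2 * y01 - 1  # relabel the classes as -1 / +1

n_rounds = 20                     # assumed number of boosting rounds
w = np.full(len(y), 1 / len(y))   # start with uniform instance weights
stumps, alphas = [], []

for _ in range(n_rounds):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    pred = stump.predict(X)

    # Weighted error of this stump (w sums to 1), clipped to avoid log(0)
    err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this stump

    # Increase weights of misclassified points, decrease the rest
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()

    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the weighted vote of all stumps
ensemble_pred = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
first_acc = np.mean(stumps[0].predict(X) == y)
ensemble_acc = np.mean(ensemble_pred == y)
print(f"First stump training accuracy: {first_acc:.2f}")
print(f"Ensemble training accuracy:    {ensemble_acc:.2f}")
```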

In [None]:
# 5. What is Gradient Boosting, and how is it different from AdaBoost?

Gradient Boosting builds models sequentially like AdaBoost, but instead of reweighting instances, it uses the gradient of the loss function to identify errors. Each new model is fitted to the negative gradient of the loss with respect to the current predictions (the pseudo-residuals), which amounts to gradient descent in function space. For squared-error loss, the pseudo-residuals are simply the residuals (the difference between actual and predicted values).
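For squared-error loss the procedure reduces to repeatedly fitting a small tree to the current residuals. The following is a minimal sketch under that assumption; the learning rate, tree depth, and number of rounds are illustrative choices, not values taken from any library default.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=0.1, random_state=42)

learning_rate = 0.1  # assumed shrinkage factor
n_rounds = 50        # assumed number of trees

pred = np.full(len(y), y.mean())          # initial model: a constant
mse_history = [np.mean((y - pred) ** 2)]
trees = []

for _ in range(n_rounds):
    residuals = y - pred                  # negative gradient of 0.5 * (y - pred)^2
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)                # each tree models what is still unexplained
    pred += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - pred) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.2f}")
print(f"Training MSE after {n_rounds} trees: {mse_history[-1]:.2f}")
```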

In [None]:
# 6. What is the loss function in Gradient Boosting?

The loss function in Gradient Boosting varies depending on the task. For regression, it's typically Mean Squared Error (MSE), while for classification, it is usually Log Loss (cross-entropy). The model minimizes this loss by adding, at each stage, a new tree fitted to the negative gradient of the loss with respect to the current predictions.
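A quick numerical check makes the role of the gradient concrete. Assuming the squared error is written as 0.5 * (y - F)^2 and the log loss is taken with respect to the raw score F (the log-odds), the negative gradients are y - F and y - sigmoid(F) respectively. The helper names and sample values below are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_error(y, f):
    # Half squared error, so that the negative gradient is exactly y - f
    return 0.5 * (y - f) ** 2

def log_loss_raw(y, f):
    # Binary log loss as a function of the raw score f (log-odds)
    p = sigmoid(f)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def numeric_neg_gradient(loss, y, f, eps=1e-6):
    # Central finite difference of the loss with respect to the prediction f
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

y_reg, f_reg = 3.0, 2.2    # regression target and current prediction
y_clf, f_clf = 1.0, -0.4   # class label and current raw score

neg_grad_mse = numeric_neg_gradient(squared_error, y_reg, f_reg)
neg_grad_logloss = numeric_neg_gradient(log_loss_raw, y_clf, f_clf)

print(f"MSE:      numeric {neg_grad_mse:.4f} vs. y - F = {y_reg - f_reg:.4f}")
print(f"Log loss: numeric {neg_grad_logloss:.4f} vs. y - p = {y_clf - sigmoid(f_clf):.4f}")
```

In both cases the target for the next tree is "label minus current prediction", measured on the scale appropriate to the loss.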

In [None]:
# 7. How does XGBoost improve over traditional Gradient Boosting?

XGBoost (Extreme Gradient Boosting) introduces several optimizations, including:

Regularization: Prevents overfitting by controlling model complexity.

Tree Pruning: Removes splits whose loss reduction falls below a threshold (the `gamma` parameter).

Parallelization: Speeds up computation by parallelizing tasks.

Handling Missing Data: Handles missing values natively by learning, at each split, a default direction for them to follow.

In [None]:
# 8. What is the difference between XGBoost and CatBoost?

Both are advanced boosting algorithms. XGBoost is highly efficient for numerical and sparse data, whereas CatBoost is designed to handle categorical features directly, without preprocessing such as one-hot encoding. CatBoost's ordered target statistics also help reduce overfitting on high-cardinality categorical features.

In [None]:
# 9. What are some real-world applications of Boosting techniques?

Boosting techniques are used in a wide range of applications, such as:

Fraud detection: Identifying fraudulent transactions.

Marketing: Predicting customer churn or preferences.

Finance: Credit scoring and risk analysis.

Healthcare: Disease prediction and patient diagnosis.

In [None]:
# 10. How does regularization help in XGBoost?

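Regularization in XGBoost penalizes complex trees, which helps prevent overfitting. The penalties apply to the leaf weights of each tree: L1 regularization (`reg_alpha`) can shrink leaf weights to exactly zero, while L2 regularization (`reg_lambda`) shrinks large leaf weights smoothly toward zero. A further penalty (`gamma`) sets the minimum loss reduction a split must achieve to be kept.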
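The effect of the L2 and `gamma` penalties can be sketched with the regularized split-gain formula from the XGBoost documentation, where G and H are the sums of first and second derivatives of the loss over the rows in a node. The function names and the gradient sums below are hypothetical, and the L1 term is omitted to keep the sketch short.

```python
def leaf_weight(G, H, reg_lambda):
    # Optimal leaf value under an L2 penalty on leaf weights
    return -G / (H + reg_lambda)

def leaf_score(G, H, reg_lambda):
    # Quality of a leaf; larger means a bigger loss reduction
    return G * G / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    # Loss reduction from splitting a node into two children, minus the gamma penalty
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    children = leaf_score(G_left, H_left, reg_lambda) + leaf_score(G_right, H_right, reg_lambda)
    return 0.5 * (children - parent) - gamma

# Hypothetical gradient sums for a candidate split
# (for squared error, g_i = prediction - target and h_i = 1)
G_left, H_left = -4.0, 4.0
G_right, H_right = 2.0, 2.0

gain_unreg = split_gain(G_left, H_left, G_right, H_right, reg_lambda=0.0)
gain_l2 = split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0)
gain_l2_gamma = split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=2.5)

print(f"Gain without regularization: {gain_unreg:.3f}")
print(f"Gain with lambda=1:          {gain_l2:.3f}")
print(f"Gain with lambda=1, gamma=2.5: {gain_l2_gamma:.3f}")
print(f"Left leaf weight, lambda=0: {leaf_weight(G_left, H_left, 0.0):.2f}")
print(f"Left leaf weight, lambda=1: {leaf_weight(G_left, H_left, 1.0):.2f}")
```

In this example the L2 penalty lowers both the gain and the leaf weight, and a large enough `gamma` makes the gain negative, so the split would not be kept.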

In [None]:
# 11. What are some hyperparameters to tune in Gradient Boosting models?

Key hyperparameters include:

Learning Rate: Scales the contribution of each tree; smaller values usually require more trees.

Number of Trees: Defines how many weak learners are used.

Max Depth: Limits the complexity of trees.

Subsample: Fraction of training rows sampled to fit each tree.

Min Samples Split/Leaf: Minimum number of samples required to split a node, and minimum number that must remain in each leaf.
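In scikit-learn's `GradientBoostingClassifier`, these hyperparameters map directly onto constructor arguments. The values below are arbitrary starting points chosen for illustration, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gb_clf = GradientBoostingClassifier(
    learning_rate=0.05,     # Learning Rate
    n_estimators=200,       # Number of Trees
    max_depth=3,            # Max Depth
    subsample=0.8,          # Subsample: each tree sees 80% of the training rows
    min_samples_split=10,   # Min Samples Split
    min_samples_leaf=5,     # Min Samples Leaf
    random_state=42,
)
gb_clf.fit(X_train, y_train)

test_accuracy = gb_clf.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```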

In [None]:
# 12. What is the concept of Feature Importance in Boosting?

Feature Importance refers to the contribution of each feature in making predictions. In Boosting models, it is commonly computed from how often a feature is used to split a node (split count) or from the total reduction in loss or impurity that those splits achieve (gain).
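As a small illustration, scikit-learn exposes impurity-based importances through `feature_importances_`, normalized to sum to 1. The sketch below assumes a synthetic dataset in which, because `shuffle=False`, the three informative features occupy columns 0 to 2 and the remaining columns are noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# shuffle=False keeps the 3 informative features in columns 0-2; the rest are noise
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=42,
)

gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_clf.fit(X, y)

importance = gb_clf.feature_importances_
informative_share = importance[:3].sum()

# List features from most to least important
for i in np.argsort(importance)[::-1]:
    print(f"Feature {i}: {importance[i]:.3f}")
print(f"Share of importance on informative features: {informative_share:.2f}")
```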

In [None]:
# 13. Why is CatBoost efficient for categorical data?

CatBoost is efficient for categorical data because it uses Ordered Target Encoding: each row's category is encoded with target statistics computed only from rows that precede it in a random permutation, so a row's own label never leaks into its encoding. CatBoost also handles categorical variables automatically, without explicit encoding by the user, which simplifies work with real-world datasets.
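The idea behind ordered target encoding can be sketched in a few lines. This is a simplified illustration, not CatBoost's actual implementation (which, among other things, uses several permutations); the function name, the `prior_weight` parameter, and the toy data are hypothetical.

```python
import numpy as np

def ordered_target_encode(categories, targets, prior, prior_weight=1.0, order=None, seed=0):
    """Encode each row using only the targets of rows that come earlier in `order`."""
    n = len(categories)
    if order is None:
        order = np.random.default_rng(seed).permutation(n)  # random row order
    sums, counts = {}, {}
    encoded = np.empty(n, dtype=float)
    for idx in order:
        cat = categories[idx]
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # Smoothed mean of the targets seen so far for this category
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now does this row's own target become visible to later rows
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded

categories = ["red", "blue", "red", "red", "blue", "green"]
targets = [1, 0, 1, 0, 1, 1]
prior = sum(targets) / len(targets)  # global target mean used as the prior

# The natural row order is used here so the result is easy to follow by hand
encoded = ordered_target_encode(categories, targets, prior, order=range(len(targets)))
for cat, value in zip(categories, encoded):
    print(f"{cat:>5}: {value:.3f}")
```

The first occurrence of each category receives the prior, and later occurrences are pulled toward the mean of the targets seen before them.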

In [None]:
# Practical

In [None]:
# 14. Train an AdaBoost Classifier on a sample dataset and print model accuracy

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train AdaBoost Classifier
adb_clf = AdaBoostClassifier(n_estimators=50, random_state=42)
adb_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = adb_clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")


In [None]:
# 15. Train an AdaBoost Regressor and evaluate performance using Mean Absolute Error (MAE)

from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Create a regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train AdaBoost Regressor
adb_reg = AdaBoostRegressor(n_estimators=50, random_state=42)
adb_reg.fit(X_train, y_train)

# Predict and evaluate
y_pred = adb_reg.predict(X_test)
print(f"Mean Absolute Error (MAE): {mean_absolute_error(y_test, y_pred):.2f}")


In [None]:
# 16. Train a Gradient Boosting Classifier on the Breast Cancer dataset and print feature importance

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
import numpy as np

# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_clf.fit(X, y)

# Print feature importance
importance = gb_clf.feature_importances_
for i, v in enumerate(importance):
    print(f"Feature: {data.feature_names[i]}, Importance: {v:.2f}")


In [None]:
# 17. Train a Gradient Boosting Regressor and evaluate using R-Squared Score

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Create a regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gradient Boosting Regressor
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr.fit(X_train, y_train)

# Predict and evaluate
y_pred = gbr.predict(X_test)
print(f"R-Squared Score: {r2_score(y_test, y_pred):.2f}")


In [None]:
# 18. Train an XGBoost Classifier on a dataset and compare accuracy with Gradient Boosting

from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train XGBoost Classifier
xgb_clf = XGBClassifier(n_estimators=100, random_state=42)
xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_test)

# Train Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_clf.fit(X_train, y_train)
y_pred_gb = gb_clf.predict(X_test)

# Compare accuracy
print(f"XGBoost Accuracy: {accuracy_score(y_test, y_pred_xgb):.2f}")
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, y_pred_gb):.2f}")


In [None]:
# 19. Train a CatBoost Classifier and evaluate using F1-Score

from catboost import CatBoostClassifier
from sklearn.metrics import f1_score
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train CatBoost Classifier
cat_clf = CatBoostClassifier(iterations=100, random_state=42, verbose=0)
cat_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = cat_clf.predict(X_test)
print(f"F1-Score: {f1_score(y_test, y_pred):.2f}")


In [None]:
# 20. Train an XGBoost Regressor and evaluate using Mean Squared Error (MSE)

from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create a regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train XGBoost Regressor
xgb_reg = XGBRegressor(n_estimators=100, random_state=42)
xgb_reg.fit(X_train, y_train)

# Predict and evaluate
y_pred = xgb_reg.predict(X_test)
print(f"Mean Squared Error (MSE): {mean_squared_error(y_test, y_pred):.2f}")


In [None]:
# 21. Train an AdaBoost Classifier and Visualize Feature Importance

import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train AdaBoost Classifier
adb_clf = AdaBoostClassifier(n_estimators=50, random_state=42)
adb_clf.fit(X_train, y_train)

# Visualize feature importance
importance = adb_clf.feature_importances_
plt.bar(range(len(importance)), importance)
plt.title('Feature Importance - AdaBoost Classifier')
plt.show()


In [None]:
# 22. Train a Gradient Boosting Regressor and Plot Learning Curves

import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import learning_curve
from sklearn.datasets import make_regression

# Create a dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)

# Train Gradient Boosting Regressor
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Plot learning curves
train_sizes, train_scores, test_scores = learning_curve(gbr, X, y, train_sizes=[0.1, 0.33, 0.55, 0.78, 1.0], cv=5)
plt.plot(train_sizes, train_scores.mean(axis=1), label='Train Score')
plt.plot(train_sizes, test_scores.mean(axis=1), label='Cross-Validation Score')
plt.title('Learning Curves - Gradient Boosting Regressor')
plt.xlabel('Training Set Size')
plt.ylabel('Score')
plt.legend()
plt.show()


In [None]:
# 23. Train an XGBoost Classifier and Visualize Feature Importance

import xgboost as xgb
import matplotlib.pyplot as plt
from xgboost import plot_importance
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train XGBoost Classifier
xgb_clf = xgb.XGBClassifier(n_estimators=100, random_state=42)
xgb_clf.fit(X_train, y_train)

# Visualize feature importance
plot_importance(xgb_clf)
plt.title('Feature Importance - XGBoost Classifier')
plt.show()


In [None]:
# 24. Train a CatBoost Classifier and Plot the Confusion Matrix

import matplotlib.pyplot as plt
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Create a dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train CatBoost Classifier
cat_clf = CatBoostClassifier(iterations=100, random_state=42, verbose=0)
cat_clf.fit(X_train, y_train)

# Predict and plot confusion matrix
y_pred = cat_clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot()
plt.title('Confusion Matrix - CatBoost Classifier')
plt.show()


In [None]:
# 25. Train an AdaBoost Classifier with Different Numbers of Estimators and Compare Accuracy

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train and evaluate AdaBoost Classifier with different numbers of estimators
for n_estimators in [10, 50, 100]:
    adb_clf = AdaBoostClassifier(n_estimators=n_estimators, random_state=42)
    adb_clf.fit(X_train, y_train)
    y_pred = adb_clf.predict(X_test)
    print(f"Accuracy with {n_estimators} estimators: {accuracy_score(y_test, y_pred):.2f}")


In [None]:
# 26. Train a Gradient Boosting Classifier and Visualize the ROC Curve

import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

# Create a dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_clf.fit(X_train, y_train)

# Predict probabilities
y_prob = gb_clf.predict_proba(X_test)[:, 1]

# Plot ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f'ROC Curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.title('ROC Curve - Gradient Boosting Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()


In [None]:
# 27. Train an XGBoost Regressor and Tune the Learning Rate using GridSearchCV

from sklearn.model_selection import GridSearchCV
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create a dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Define parameter grid for learning rate
param_grid = {'learning_rate': [0.01, 0.1, 0.2]}

# Train XGBoost Regressor with GridSearchCV
xgb_reg = xgb.XGBRegressor(n_estimators=100, random_state=42)
grid_search = GridSearchCV(xgb_reg, param_grid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(X_train, y_train)

# Best model and evaluation
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(f"Best Learning Rate: {grid_search.best_params_['learning_rate']}")
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.2f}")


In [None]:
# 28. Train a CatBoost Classifier on an Imbalanced Dataset and Compare Performance with Class Weighting

from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train CatBoost Classifier without class weights
cat_clf = CatBoostClassifier(iterations=100, random_state=42, verbose=0)
cat_clf.fit(X_train, y_train)
y_pred = cat_clf.predict(X_test)
print(f"F1-Score without Class Weights: {f1_score(y_test, y_pred):.2f}")

# Train CatBoost Classifier with class weights
cat_clf_weighted = CatBoostClassifier(iterations=100, class_weights=[1, 10], random_state=42, verbose=0)
cat_clf_weighted.fit(X_train, y_train)
y_pred_weighted = cat_clf_weighted.predict(X_test)
print(f"F1-Score with Class Weights: {f1_score(y_test, y_pred_weighted):.2f}")


In [None]:
# 29. Train an AdaBoost Classifier and Analyze the Effect of Different Learning Rates

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train and evaluate AdaBoost Classifier with different learning rates
for lr in [0.01, 0.1, 1]:
    adb_clf = AdaBoostClassifier(n_estimators=50, learning_rate=lr, random_state=42)
    adb_clf.fit(X_train, y_train)
    y_pred = adb_clf.predict(X_test)
    print(f"Accuracy with learning rate {lr}: {accuracy_score(y_test, y_pred):.2f}")


In [None]:
# 30. Train an XGBoost Classifier for Multi-Class Classification and Evaluate using Log-Loss

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

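# Create a multi-class dataset
# (n_informative is raised because make_classification requires n_classes * n_clusters_per_class <= 2**n_informative)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=4, n_classes=3, random_state=42)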
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train XGBoost Classifier
# The number of classes is inferred from y_train, so num_class does not need to be passed
xgb_clf = xgb.XGBClassifier(n_estimators=100, objective='multi:softprob', random_state=42)
xgb_clf.fit(X_train, y_train)

# Predict probabilities and evaluate using log-loss
y_prob = xgb_clf.predict_proba(X_test)
print(f"Log-Loss: {log_loss(y_test, y_prob):.2f}")
