                               Boosting Techniques

1.What is Boosting in Machine Learning?

. Boosting in machine learning is an ensemble technique that combines multiple weak learners, typically decision trees, sequentially to create a stronger overall model by focusing more on the errors of previous models.

2.How does Boosting differ from Bagging?

. Boosting builds models sequentially, each correcting the errors of the previous one, while bagging builds models independently in parallel and combines their results by averaging or voting.
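The contrast above can be tried side by side. The following is a minimal sketch, assuming scikit-learn is available; the dataset sizes and the choice of 50 estimators are illustrative assumptions, not values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Small synthetic dataset (sizes chosen only for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Bagging: each tree is fit independently on its own bootstrap sample
bagging = BaggingClassifier(n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)

# Boosting: each weak learner is fit after, and depends on, the previous ones
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
boosting_acc = boosting.score(X_test, y_test)
print(f"Bagging accuracy:  {bagging_acc:.4f}")
print(f"Boosting accuracy: {boosting_acc:.4f}")
```

Which ensemble scores higher depends on the data, so the two numbers should be read as a comparison on this toy problem only.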

3.What is the key idea behind AdaBoost?

. The key idea behind AdaBoost is to iteratively adjust the weights of training samples, giving more focus to misclassified examples so that subsequent weak learners improve on the hardest cases.

4.Explain the working of AdaBoost with an example.

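. AdaBoost works by training weak learners sequentially. Initially, all data points have equal weights. After each learner is trained, AdaBoost increases the weights of misclassified points so the next learner focuses more on them. This process repeats, and the final model combines all learners, each weighted by a coefficient that grows as its weighted error shrinks. For example, with 10 points weighted 0.1 each, a first stump that misclassifies 3 points has a weighted error of 0.3. In the classic formulation its coefficient is 0.5 * ln(0.7 / 0.3), about 0.42, and after renormalization each misclassified point has weight 1/6 (about 0.167) while each correctly classified point has weight 1/14 (about 0.071).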
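The reweighting step described above can be sketched for a single round. This is a minimal illustration of the classic discrete AdaBoost update, assuming a toy setup of 10 points where the weak learner misclassifies the first 3; library implementations such as scikit-learn use variants of this formula.

```python
import numpy as np

# Toy setup (assumed for illustration): 10 points, first 3 misclassified
n_points = 10
weights = np.full(n_points, 1.0 / n_points)
misclassified = np.array([True] * 3 + [False] * 7)

# Weighted error of the weak learner under the current weights
error = weights[misclassified].sum()

# Coefficient (vote) of this learner in the final ensemble
alpha = 0.5 * np.log((1.0 - error) / error)

# Raise the weights of misclassified points, lower the rest, then renormalize
weights = weights * np.exp(np.where(misclassified, alpha, -alpha))
weights /= weights.sum()

print("Weighted error:", error)
print("Learner coefficient alpha:", alpha)
print("Updated weights:", weights)
```

After this update the misclassified points carry half of the total weight, which is why the next weak learner is pushed toward them.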

5.What is Gradient Boosting, and how is it different from AdaBoost?

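. Gradient Boosting builds models sequentially by performing gradient descent on a loss function in the space of predictions: each new model is fit to the negative gradient of the loss with respect to the current predictions, which for squared error is exactly the residual of the previous ensemble. Unlike AdaBoost, which reweights training samples after each round, Gradient Boosting keeps the samples unweighted and changes the targets that each new learner is trained on.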
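To make the residual-fitting idea concrete, here is a minimal from-scratch sketch for squared-error regression. It assumes scikit-learn's DecisionTreeRegressor as the weak learner; the synthetic sine data, learning rate, and number of rounds are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction; the mean minimizes squared error
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # For the loss 0.5 * (y - F)^2, the negative gradient w.r.t. F is y - F
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Take a small step in the direction suggested by the new tree
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.4f}")
print(f"Training MSE after {n_rounds} rounds: {final_mse:.4f}")
```

To predict on new data, the same sum would be rebuilt: the initial mean plus the learning rate times each stored tree's prediction.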

6.What is the loss function in Gradient Boosting?

. The loss function in Gradient Boosting measures how well the model’s predictions match the actual values, guiding the model to minimize errors during training by fitting new learners to the negative gradients of this loss.
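As a worked illustration of "fitting to the negative gradients", the update can be written out as follows. The symbols (F for the current ensemble, h for the new learner, nu for the learning rate) are notation assumed here, not taken from this notebook.

```latex
% Pseudo-residual for sample i at boosting round m
r_{im} = -\left[ \frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)} \right]_{F = F_{m-1}}

% Special case: squared error, where the pseudo-residual is the ordinary residual
L(y, F) = \tfrac{1}{2}\,(y - F)^2
\quad\Longrightarrow\quad
r_{im} = y_i - F_{m-1}(x_i)

% The new learner h_m is fit to the r_{im}, then added with learning rate \nu
F_m(x) = F_{m-1}(x) + \nu\, h_m(x)
```

Other losses (for example log loss for classification) change only the formula for the pseudo-residuals; the fit-and-add step stays the same.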

7.How does XGBoost improve over traditional Gradient Boosting?

. XGBoost improves on traditional Gradient Boosting by adding L1 and L2 regularization to the objective, handling missing values natively by learning a default split direction, parallelizing split finding within each tree, and optimizing speed and memory use for better performance and scalability.

8.What is the difference between XGBoost and CatBoost?

. XGBoost focuses on speed and regularization, while CatBoost is designed to handle categorical features automatically and reduce prediction shift, making it easier to use with categorical data.

9.What are some real-world applications of Boosting techniques?

. Boosting techniques are used in fraud detection, spam filtering, customer churn prediction, image recognition, and recommendation systems.

10.How does regularization help in XGBoost?

. Regularization in XGBoost helps prevent overfitting by penalizing complex models, encouraging simpler trees and improving generalization to new data.
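The penalty on complex models can be written as an objective with an explicit complexity term. The form below follows the one given in the XGBoost documentation; T is the number of leaves in a tree and w_j is the weight of leaf j.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\bigl(y_i, \hat{y}_i\bigr) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma\, T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^{2}
```

In the Python API these terms correspond to the gamma and reg_lambda parameters, and an additional L1 penalty on leaf weights is available through reg_alpha. Larger values make extra leaves and large leaf weights more costly, which pushes the model toward simpler trees.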

11.What are some hyperparameters to tune in Gradient Boosting models?

. Key hyperparameters in Gradient Boosting include:

Learning rate: controls step size in updates

Number of estimators: total trees to build

Max depth: limits tree depth

Subsample: fraction of data used per tree

Min samples split/leaf: controls node splitting

Loss function: defines error to minimize
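The hyperparameters listed above map directly onto constructor arguments in scikit-learn. The sketch below sets them explicitly on a GradientBoostingClassifier; the specific values are illustrative assumptions rather than recommended settings, and the loss function is left at its default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Small synthetic dataset (sizes chosen only for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingClassifier(
    learning_rate=0.05,     # step size applied to each tree's contribution
    n_estimators=200,       # total number of trees to build
    max_depth=3,            # maximum depth of each tree
    subsample=0.8,          # fraction of training rows sampled per tree
    min_samples_split=10,   # minimum samples required to split a node
    min_samples_leaf=5,     # minimum samples required in each leaf
    random_state=0,
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```

A lower learning rate usually needs more estimators to reach the same fit, so these two are typically tuned together.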

12.What is the concept of Feature Importance in Boosting?

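. Feature importance in Boosting shows how much each feature contributes to the model's predictions, commonly measured by how often a feature is used in splits or by how much those splits reduce the loss, helping identify which features are most influential in decision-making.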

13.Why is CatBoost efficient for categorical data?

. CatBoost is efficient for categorical data because it automatically handles categorical features using a technique called ordered target statistics, reducing the need for manual preprocessing and preventing overfitting.
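The idea behind ordered target statistics can be sketched in plain Python. This is a simplified illustration, not CatBoost's actual implementation: the helper name ordered_target_statistics, the smoothing parameter a, and the use of the global target mean as the prior are assumptions, and the row order stands in for the random permutation CatBoost would draw.

```python
def ordered_target_statistics(categories, targets, prior, a=1.0):
    """Encode each row's category using only the targets of earlier rows."""
    sums, counts = {}, {}
    encoded = []
    for category, target in zip(categories, targets):
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        # Smoothed mean of the targets seen so far for this category
        encoded.append((seen_sum + a * prior) / (seen_count + a))
        # Only now add the current row, so it never sees its own target
        sums[category] = seen_sum + target
        counts[category] = seen_count + 1
    return encoded


categories = ["red", "blue", "red", "red", "blue"]
targets = [1, 0, 0, 1, 1]
prior = sum(targets) / len(targets)  # global mean, used as a simple prior

encoded = ordered_target_statistics(categories, targets, prior)
print(encoded)
```

Because each row is encoded without looking at its own label, the encoding leaks less target information than a plain per-category mean, which is the overfitting safeguard mentioned above.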


                             Practical

14.Train an AdaBoost Classifier on a sample dataset and print model accuracy.

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train AdaBoost classifier
model = AdaBoostClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict and evaluate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Model Accuracy:", accuracy)


15.Train an AdaBoost Regressor and evaluate performance using Mean Absolute Error (MAE)

In [None]:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Create a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.3, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train AdaBoost regressor
regressor = AdaBoostRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)

# Predict and evaluate using Mean Absolute Error
y_pred = regressor.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Absolute Error (MAE):", mae)


16.Train a Gradient Boosting Classifier on the Breast Cancer dataset and print feature importance

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import pandas as pd
import matplotlib.pyplot as plt

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Gradient Boosting Classifier
model = GradientBoostingClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Get and display feature importances
importances = model.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print(feature_importance_df)

# Optional: Plot feature importances
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'])
plt.xlabel("Feature Importance")
plt.ylabel("Features")
plt.title("Gradient Boosting Classifier - Feature Importance")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()


17.Train a Gradient Boosting Regressor and evaluate using R-Squared Score

In [None]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Create a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.3, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Gradient Boosting Regressor
regressor = GradientBoostingRegressor(n_estimators=100, random_state=42)
regressor.fit(X_train, y_train)

# Predict and evaluate using R-Squared Score
y_pred = regressor.predict(X_test)
r2 = r2_score(y_test, y_pred)
print("R-Squared Score:", r2)


18.Train an XGBoost Classifier on a dataset and compare accuracy with Gradient Boosting

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Create a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)
gb_model.fit(X_train, y_train)
gb_preds = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_preds)

# Train XGBoost Classifier
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)
xgb_model.fit(X_train, y_train)
xgb_preds = xgb_model.predict(X_test)
xgb_accuracy = accuracy_score(y_test, xgb_preds)

# Print accuracies
print(f"Gradient Boosting Accuracy: {gb_accuracy:.4f}")
print(f"XGBoost Accuracy: {xgb_accuracy:.4f}")


19.Train a CatBoost Classifier and evaluate using F1-Score

In [None]:
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Create a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train CatBoost Classifier
model = CatBoostClassifier(iterations=100, random_seed=42, verbose=0)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate using F1-Score
f1 = f1_score(y_test, y_pred)
print("F1-Score:", f1)


20.Train an XGBoost Regressor and evaluate using Mean Squared Error (MSE)

In [None]:
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.3, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train XGBoost Regressor
model = XGBRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate using Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error (MSE):", mse)


21.Train an AdaBoost Classifier and visualize feature importance

In [None]:
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

# Create a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost classifier
model = AdaBoostClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Get feature importances
importances = model.feature_importances_
features = np.arange(len(importances))

# Plot feature importances
plt.figure(figsize=(10, 6))
plt.barh(features, importances, color='skyblue')
plt.xlabel('Feature Importance')
plt.ylabel('Feature Index')
plt.title('AdaBoost Classifier Feature Importance')
plt.gca().invert_yaxis()
plt.show()


22.Train a Gradient Boosting Regressor and plot learning curves.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.3, random_state=42)

# Split into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gradient Boosting Regressor with staged predictions
model = GradientBoostingRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Track training and validation error at each stage
train_errors = []
val_errors = []

for y_train_pred in model.staged_predict(X_train):
    train_errors.append(mean_squared_error(y_train, y_train_pred))
for y_val_pred in model.staged_predict(X_val):
    val_errors.append(mean_squared_error(y_val, y_val_pred))

# Plot learning curves
plt.figure(figsize=(10,6))
plt.plot(train_errors, label='Training MSE')
plt.plot(val_errors, label='Validation MSE')
plt.xlabel('Number of Trees')
plt.ylabel('Mean Squared Error')
plt.title('Gradient Boosting Regressor Learning Curves')
plt.legend()
plt.show()


23.Train an XGBoost Classifier and visualize feature importance.

In [None]:
import matplotlib.pyplot as plt
from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Create a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost Classifier
model = XGBClassifier(eval_metric='logloss', random_state=42)
model.fit(X_train, y_train)

# Plot feature importance on an explicit Axes so the figure size is respected
fig, ax = plt.subplots(figsize=(10, 8))
plot_importance(model, ax=ax, max_num_features=20, importance_type='weight', show_values=False)
ax.set_title('XGBoost Feature Importance')
plt.show()


24.Train a CatBoost Classifier and plot the confusion matrix.

In [None]:
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

# Create a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train CatBoost Classifier
model = CatBoostClassifier(iterations=100, random_seed=42, verbose=0)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap=plt.cm.Blues)
plt.title("CatBoost Classifier Confusion Matrix")
plt.show()


25.Train an AdaBoost Classifier with different numbers of estimators and compare accuracy.


In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Create a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Different numbers of estimators to try
estimators_list = [10, 50, 100, 150, 200]
accuracies = []

for n_estimators in estimators_list:
    model = AdaBoostClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracies.append(accuracy)
    print(f"Estimators: {n_estimators} - Accuracy: {accuracy:.4f}")

# Plot accuracy vs number of estimators
plt.figure(figsize=(8,5))
plt.plot(estimators_list, accuracies, marker='o')
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('AdaBoost Accuracy vs Number of Estimators')
plt.grid(True)
plt.show()


26.Train a Gradient Boosting Classifier and visualize the ROC curve.


In [None]:
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

# Create a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gradient Boosting Classifier
model = GradientBoostingClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict probabilities for positive class
y_scores = model.predict_proba(X_test)[:, 1]

# Compute ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_scores)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.4f})')
plt.plot([0,1], [0,1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Gradient Boosting Classifier ROC Curve')
plt.legend(loc="lower right")
plt.grid(True)
plt.show()


27.Train an XGBoost Regressor and tune the learning rate using GridSearchCV

In [None]:
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

# Create sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.3, random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize XGBoost Regressor
xgb = XGBRegressor(n_estimators=100, random_state=42)

# Define parameter grid for learning rate
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2]
}

# Setup GridSearchCV
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid,
                           cv=3, scoring='neg_mean_squared_error', verbose=1)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best learning rate
best_lr = grid_search.best_params_['learning_rate']
print(f"Best learning rate: {best_lr}")

# Evaluate on test set with best estimator
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE with best learning rate: {mse:.4f}")


28.Train a CatBoost Classifier on an imbalanced dataset and compare performance with class weighting.


In [None]:
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import numpy as np

# Create an imbalanced classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           n_redundant=5, weights=[0.9, 0.1], flip_y=0,
                           random_state=42)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train CatBoost without class weights
model_no_weight = CatBoostClassifier(iterations=100, random_seed=42, verbose=0)
model_no_weight.fit(X_train, y_train)
y_pred_no_weight = model_no_weight.predict(X_test)
f1_no_weight = f1_score(y_test, y_pred_no_weight)

# Compute class weights manually (inverse of class frequency)
classes, counts = np.unique(y_train, return_counts=True)
class_weights = {cls: 1.0/count for cls, count in zip(classes, counts)}

# Map sample weights based on class weights
sample_weights = np.array([class_weights[label] for label in y_train])

# Train CatBoost with class weights via sample weights
model_weighted = CatBoostClassifier(iterations=100, random_seed=42, verbose=0)
model_weighted.fit(X_train, y_train, sample_weight=sample_weights)
y_pred_weighted = model_weighted.predict(X_test)
f1_weighted = f1_score(y_test, y_pred_weighted)

print(f"F1-Score without class weighting: {f1_no_weight:.4f}")
print(f"F1-Score with class weighting: {f1_weighted:.4f}")


29.Train an AdaBoost Classifier and analyze the effect of different learning rates.


In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Create a sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Different learning rates to test
learning_rates = [0.01, 0.05, 0.1, 0.2, 0.5, 1.0]
accuracies = []

for lr in learning_rates:
    model = AdaBoostClassifier(n_estimators=100, learning_rate=lr, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    print(f"Learning Rate: {lr} - Accuracy: {acc:.4f}")

# Plot accuracy vs learning rate
plt.figure(figsize=(8,5))
plt.plot(learning_rates, accuracies, marker='o')
plt.xlabel('Learning Rate')
plt.ylabel('Accuracy')
plt.title('Effect of Learning Rate on AdaBoost Accuracy')
plt.grid(True)
plt.show()


30.Train an XGBoost Classifier for multi-class classification and evaluate using log-loss.


In [None]:
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# Create a sample multi-class dataset (e.g., 3 classes)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, n_classes=3, random_state=42)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train XGBoost Classifier for multi-class
model = XGBClassifier(objective='multi:softprob', num_class=3,
                      eval_metric='mlogloss', random_state=42)
model.fit(X_train, y_train)

# Predict probabilities on test set
y_proba = model.predict_proba(X_test)

# Calculate log-loss
logloss = log_loss(y_test, y_proba)
print(f"Multi-class Log-Loss: {logloss:.4f}")
