
In [None]:
1. What is Boosting in Machine Learning?

Boosting is an ensemble learning technique that combines multiple weak learners (models that perform slightly better than random guessing) into a strong learner. It works by sequentially training models, with each model focusing on correcting the errors made by the previous ones. The final prediction is a weighted combination of the predictions from all the models.

In [None]:
2. How does Boosting differ from Bagging?

Bagging (Bootstrap Aggregating):
Trains multiple independent models in parallel, each on a different bootstrap sample (random sampling with replacement) of the training data.
Reduces variance and overfitting by averaging or voting the predictions of the individual models.
Examples: Random Forest.
Boosting:
Trains models sequentially, with each model learning from the mistakes of the previous ones.
Focuses on reducing bias and improving the overall accuracy of the model.
Assigns weights to the training instances, giving more weight to misclassified instances in subsequent iterations.
Examples: AdaBoost, Gradient Boosting, XGBoost, CatBoost.
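
As a rough illustration of the contrast above, the sketch below trains one bagging ensemble and one boosting ensemble on the same synthetic data. The dataset sizes, the choice of 50 estimators, and the variable names are assumptions made for this example, and the resulting scores say nothing general about which family is better.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (sizes chosen only for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Bagging: independent trees, each fit on its own bootstrap sample
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Boosting: weak learners fit one after another on re-weighted data
boosting = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
boosting_acc = boosting.score(X_test, y_test)
print(f"Bagging accuracy:  {bagging_acc:.3f}")
print(f"Boosting accuracy: {boosting_acc:.3f}")
```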

In [None]:
3. What is the key idea behind AdaBoost?

AdaBoost (Adaptive Boosting) focuses on instances that are difficult to classify. It assigns weights to each training instance, initially equal. After each weak learner is trained, the weights of misclassified instances are increased, while the weights of correctly classified instances are decreased. This forces subsequent models to pay more attention to the difficult instances.

In [None]:
4. Explain the working of AdaBoost with an example.

Let's say we want to classify points as either red or blue:

Initialize weights: Assign equal weights to all data points.
Train a weak learner: Train a decision stump (a simple decision tree with one level) on the weighted data.
Calculate error: Determine the weighted error of the weak learner.
Calculate learner weight: Assign a weight to the learner based on its error (higher accuracy = higher weight).
Update instance weights: Increase the weights of misclassified instances and decrease the weights of correctly classified instances.
Repeat: Return to the "Train a weak learner" step and fit another weak learner on the re-weighted data, for a fixed number of rounds.
Combine learners: Combine the predictions of all weak learners, weighted by their learner weights.
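
The steps above can be sketched from scratch with NumPy. This is a minimal illustration of discrete AdaBoost with decision stumps, not the scikit-learn implementation; the helper names (`fit_stump`, `adaboost_fit`, and so on) and the six-point toy dataset are assumptions made for this example, with +1 standing for "red" and -1 for "blue".

```python
import numpy as np

def fit_stump(X, y, w):
    """Search all (feature, threshold, polarity) stumps for the lowest weighted error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(X[:, j] <= t, -polarity, polarity)
                err = np.sum(w[pred != y])
                if err < best_err:
                    best, best_err = (j, t, polarity), err
    return best, best_err

def stump_predict(stump, X):
    j, t, polarity = stump
    return np.where(X[:, j] <= t, -polarity, polarity)

def adaboost_fit(X, y, n_rounds=3):
    w = np.full(len(y), 1.0 / len(y))              # initialize equal weights
    ensemble = []
    for _ in range(n_rounds):
        stump, err = fit_stump(X, y, w)            # train weak learner, get weighted error
        err = np.clip(err, 1e-10, 1 - 1e-10)       # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)      # learner weight: lower error, larger alpha
        pred = stump_predict(stump, X)
        w = w * np.exp(-alpha * y * pred)          # raise weights of mistakes, lower the rest
        w = w / w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(alpha * stump_predict(stump, X) for alpha, stump in ensemble)
    return np.sign(score)                          # weighted vote of all stumps

# Toy data: no single threshold separates the classes
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])

single = adaboost_predict(adaboost_fit(X, y, n_rounds=1), X)
boosted = adaboost_predict(adaboost_fit(X, y, n_rounds=3), X)
print("Accuracy with 1 stump: ", np.mean(single == y))
print("Accuracy with 3 stumps:", np.mean(boosted == y))
```

On this toy set a single stump has to misclassify some points, while the weighted vote of three stumps fits the training data.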

In [None]:
5. What is Gradient Boosting, and how is it different from AdaBoost?

AdaBoost: Updates instance weights based on misclassifications.
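Gradient Boosting: Fits each new model to the negative gradient of the loss with respect to the current predictions (the pseudo-residuals). For squared-error loss these are exactly the residual errors, i.e. the differences between the actual and predicted values. The procedure amounts to gradient descent on the loss function, carried out in function space.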
Key Difference: Gradient Boosting is more flexible as it can optimize arbitrary differentiable loss functions, while AdaBoost typically uses exponential loss.
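
To make the residual-fitting idea concrete, here is a minimal sketch of gradient boosting for squared-error loss, built from shallow scikit-learn regression trees. The noisy sine data, the learning rate of 0.1, the 50 rounds, and the variable names are all assumptions made for this example; library implementations add many refinements on top of this loop.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy sine curve as illustrative data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())       # start from a constant model
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residuals = y - prediction               # negative gradient of 0.5 * (y - F)^2
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # small step toward the residuals
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after 50 rounds: {mse_history[-1]:.4f}")
```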

In [None]:
6. What is the loss function in Gradient Boosting?

The loss function in Gradient Boosting measures the difference between the predicted and actual values. Common loss functions include:

Mean Squared Error (MSE): For regression problems.
Cross-Entropy (Log Loss): For classification problems.
Huber Loss: Robust to outliers.
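
The three losses listed above can be written in a few lines of NumPy. This is a sketch for intuition only; the function names, the `delta` default, and the sample values are assumptions made for this example.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def huber(y_true, y_pred, delta=1.0):
    """Quadratic for small errors, linear for errors larger than delta."""
    r = np.abs(y_true - y_pred)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.mean(np.where(r <= delta, quadratic, linear))

# Regression targets with one large outlier (100.0)
y_true = np.array([1.0, 2.0, 3.0, 100.0])
y_pred = np.array([1.5, 2.0, 2.5, 4.0])
print("MSE:  ", mse(y_true, y_pred))      # dominated by the outlier
print("Huber:", huber(y_true, y_pred))    # grows only linearly with the outlier

# Binary labels with predicted probabilities for class 1
labels = np.array([1, 0])
probs = np.array([0.9, 0.2])
print("Log loss:", binary_log_loss(labels, probs))
```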

In [None]:
7. How does XGBoost improve over traditional Gradient Boosting?

XGBoost (Extreme Gradient Boosting) introduces several improvements:

Regularization: Prevents overfitting by adding penalty terms to the loss function.
Tree Pruning: Grows trees up to a maximum depth and then prunes back branches that do not improve performance.
Parallel Processing: Supports parallel computation for faster training.
Handling Missing Values: Can automatically learn the best direction to go when features are missing.
Cross-Validation: Provides a built-in cross-validation routine (xgb.cv) that reports evaluation metrics at each boosting iteration.

In [None]:
8. What is the difference between XGBoost and CatBoost?

XGBoost: Primarily designed for numerical features; categorical features are traditionally encoded (for example, one-hot or label encoding) before training, although recent releases add optional native categorical support.
CatBoost: Specifically designed to handle categorical features directly, without the need for extensive preprocessing. It uses an algorithm called Ordered Boosting to reduce the prediction shift caused by target leakage.

In [None]:
9. What are some real-world applications of Boosting techniques?

Search Engine Ranking: Learning to rank search results.
Fraud Detection: Identifying fraudulent transactions.
Medical Diagnosis: Predicting disease risk and diagnosis.
Image Recognition: Object detection and image classification.
Natural Language Processing: Sentiment analysis and text classification.

In [None]:
10. How does regularization help in XGBoost?

Regularization in XGBoost helps prevent overfitting by adding penalty terms to the loss function. This discourages the model from becoming too complex and fitting the noise in the training data. Common regularization techniques include:

L1 Regularization (Lasso, reg_alpha): Adds a penalty proportional to the absolute value of the leaf weights.
L2 Regularization (Ridge, reg_lambda): Adds a penalty proportional to the square of the leaf weights.
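
The effect of these penalties on a single leaf can be sketched in plain Python. For a leaf whose samples have gradient sum G and Hessian sum H, minimizing G*w + 0.5*(H + lambda)*w^2 + alpha*|w| gives the closed-form weight used below. The function names and the numbers are assumptions made for this example; `reg_lambda` and `reg_alpha` mirror the XGBoost parameter names, but this is not the library's code.

```python
def soft_threshold(g, alpha):
    """Shrink g toward zero by alpha (this is the effect of the L1 penalty)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0

def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight for gradient sum G and Hessian sum H."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)

G, H = -4.0, 3.0  # illustrative gradient statistics for one leaf
print(leaf_weight(G, H, reg_lambda=0.0, reg_alpha=0.0))  # no regularization
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0))  # L2 shrinks the weight
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=2.0))  # L1 shrinks it further
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=5.0))  # large L1 zeroes it out
```

Larger `reg_lambda` scales the weight down smoothly, while `reg_alpha` can push it all the way to zero.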

In [None]:
11. What are some hyperparameters to tune in Gradient Boosting models?

Number of Estimators (n_estimators): The number of boosting stages.
Learning Rate (learning_rate): Controls the contribution of each tree to the final prediction.
Maximum Depth (max_depth): Limits the maximum depth of each tree.
Minimum Samples Split (min_samples_split): The minimum number of samples required to split an internal node (scikit-learn parameter name).
Subsample (subsample): The fraction of training samples drawn, without replacement, for fitting each tree.
Colsample_bytree (colsample_bytree): The fraction of features sampled for fitting each tree (XGBoost and LightGBM parameter name).
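
As a sketch of how these knobs appear in code, the example below sets them on scikit-learn's GradientBoostingClassifier. The specific values are arbitrary assumptions, not recommendations. Note that scikit-learn has no `colsample_bytree`; its closest analogue, `max_features`, samples features per split rather than per tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = GradientBoostingClassifier(
    n_estimators=100,        # number of boosting stages
    learning_rate=0.1,       # shrinks each tree's contribution
    max_depth=3,             # depth limit per tree
    min_samples_split=10,    # minimum samples needed to split a node
    subsample=0.8,           # fraction of rows used per tree
    max_features=0.8,        # fraction of features considered at each split
    random_state=0,
)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```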

In [None]:
12. What is the concept of Feature Importance in Boosting?

Feature Importance in Boosting refers to the measure of how much each feature contributes to the model's predictions. It helps identify the most relevant features and understand the underlying relationships in the data. Common methods for calculating feature importance include:

Gain: Measures the average improvement in the training objective (loss reduction) brought by the splits that use the feature.
Cover: Measures the relative number of observations related to the feature.
Frequency: Measures the relative number of times a feature is used in the trees.
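
The "Frequency" measure can be computed by hand from a fitted scikit-learn model by counting how often each feature is used in a split. This is a sketch under the assumption that reading each tree's `tree_.feature` array is an acceptable way to get split counts; the model size and variable names are chosen only for illustration. The model's own `feature_importances_` is impurity-based (closer to gain), so the two rankings need not agree.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

data = load_breast_cancer()
model = GradientBoostingClassifier(n_estimators=50, random_state=0)
model.fit(data.data, data.target)

n_features = data.data.shape[1]
split_counts = np.zeros(n_features)
for tree in model.estimators_.ravel():
    used = tree.tree_.feature
    used = used[used >= 0]                   # negative entries mark leaf nodes
    split_counts += np.bincount(used, minlength=n_features)

frequency = split_counts / split_counts.sum()

top_by_frequency = data.feature_names[np.argmax(frequency)]
top_by_impurity = data.feature_names[np.argmax(model.feature_importances_)]
print("Most frequently split feature:", top_by_frequency)
print("Top impurity-based feature:   ", top_by_impurity)
```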

In [None]:
13. Why is CatBoost efficient for categorical data?

CatBoost is efficient for categorical data because:

Ordered Boosting: Reduces prediction shift caused by target leakage by using a permutation-driven approach.
Handling High Cardinality: Can handle features with a large number of unique categories effectively.
Built-in Categorical Feature Support: Does not require explicit encoding of categorical features.
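
The permutation-driven idea can be illustrated with a simplified ordered target statistic: each row's category is encoded using only the targets of rows that came before it, so a row never sees its own label. This is a plain-Python sketch, not CatBoost's implementation; the function name, the smoothing parameter `a`, and the toy data are assumptions, and the given row order stands in for the random permutations CatBoost uses.

```python
def ordered_target_stats(categories, targets, prior, a=1.0):
    """Encode each category from the targets of earlier rows only."""
    sums, counts = {}, {}
    encoded = []
    for cat, target in zip(categories, targets):
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        encoded.append((s + a * prior) / (c + a))   # smoothed mean of past targets
        sums[cat] = s + target                      # update only after encoding the row
        counts[cat] = c + 1
    return encoded

categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 1, 0, 1]
prior = sum(targets) / len(targets)   # overall target mean, 0.6

encoded = ordered_target_stats(categories, targets, prior)
for cat, value in zip(categories, encoded):
    print(cat, round(value, 4))
```

The first occurrence of each category falls back to the prior, and later occurrences blend the prior with the running mean of earlier targets.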

In [None]:
#PRACTICAL_QUESTIONS:

In [None]:
#14. Train an AdaBoost Classifier on a sample dataset and print model accuracy.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost Classifier
model = AdaBoostClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions and calculate accuracy
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"AdaBoost Classifier Accuracy: {accuracy}")

In [None]:
#15. Train an AdaBoost Regressor and evaluate performance using Mean Absolute Error (MAE).
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Create a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost Regressor
model = AdaBoostRegressor(random_state=42)
model.fit(X_train, y_train)

# Make predictions and calculate MAE
y_pred = model.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)

print(f"AdaBoost Regressor MAE: {mae}")

In [None]:
#16. Train a Gradient Boosting Classifier on the Breast Cancer dataset and print feature importance.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gradient Boosting Classifier
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# Print feature importance
feature_importance = model.feature_importances_
for i, importance in enumerate(feature_importance):
    print(f"Feature {data.feature_names[i]}: {importance}")

In [None]:
#17. Train a Gradient Boosting Regressor and evaluate using R-Squared Score.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Create a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gradient Boosting Regressor
model = GradientBoostingRegressor(random_state=42)
model.fit(X_train, y_train)

# Make predictions and calculate R-Squared
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)

print(f"Gradient Boosting Regressor R-Squared: {r2}")

In [None]:
#18. Train an XGBoost Classifier on a dataset and compare accuracy with Gradient Boosting.
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost Classifier
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)

# Train Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(random_state=42)
gb_model.fit(X_train, y_train)

# Make predictions and calculate accuracy
xgb_pred = xgb_model.predict(X_test)
gb_pred = gb_model.predict(X_test)

xgb_accuracy = accuracy_score(y_test, xgb_pred)
gb_accuracy = accuracy_score(y_test, gb_pred)

print(f"XGBoost Classifier Accuracy: {xgb_accuracy}")
print(f"Gradient Boosting Classifier Accuracy: {gb_accuracy}")

In [None]:
#19. Train a CatBoost Classifier and evaluate using F1-Score.
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train CatBoost Classifier
model = CatBoostClassifier(random_state=42, verbose=0) # verbose=0 to suppress output
model.fit(X_train, y_train)

# Make predictions and calculate F1-Score
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred)

print(f"CatBoost Classifier F1-Score: {f1}")

In [None]:
#20. Train an XGBoost Regressor and evaluate using Mean Squared Error (MSE).
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Create a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost Regressor
xgb_model = xgb.XGBRegressor(random_state=42)
xgb_model.fit(X_train, y_train)

# Make predictions and calculate MSE
y_pred = xgb_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

print(f"XGBoost Regressor MSE: {mse}")

In [None]:
#21. Train an AdaBoost Classifier and visualize feature importance.
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost Classifier
model = AdaBoostClassifier(random_state=42)
model.fit(X_train, y_train)

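# Visualize feature importance
feature_importance = model.feature_importances_
plt.figure(figsize=(10, 6))
plt.bar(data.feature_names, feature_importance)
plt.xticks(rotation=90)
plt.ylabel('Importance')
plt.title('AdaBoost Feature Importance')
plt.tight_layout()  # keep the rotated tick labels inside the figure
plt.show()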

In [None]:
#22. Train a Gradient Boosting Regressor and plot learning curves.
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import learning_curve
import numpy as np

# Create a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, random_state=42)

# Train Gradient Boosting Regressor
model = GradientBoostingRegressor(random_state=42)

# Plot learning curves
train_sizes, train_scores, test_scores = learning_curve(model, X, y, cv=5, scoring='neg_mean_squared_error')

train_scores_mean = -np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = -np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)

plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_scores_mean, label='Training MSE')
plt.plot(train_sizes, test_scores_mean, label='Cross-validation MSE')
plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1)
plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1)
plt.xlabel('Training examples')
plt.ylabel('MSE')
plt.legend(loc='best')
plt.title('Learning Curves')
plt.show()

In [None]:
#23. Train an XGBoost Classifier and visualize feature importance.
import xgboost as xgb
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost Classifier
xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)

# Visualize feature importance
xgb.plot_importance(xgb_model)
plt.show()

In [None]:
#24. Train a CatBoost Classifier and plot the confusion matrix.
import matplotlib.pyplot as plt
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import seaborn as sns

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train CatBoost Classifier
model = CatBoostClassifier(random_state=42, verbose=0)
model.fit(X_train, y_train)

# Make predictions and calculate confusion matrix
y_pred = model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix')
plt.show()

In [None]:
#25. Train an AdaBoost Classifier with different numbers of estimators and compare accuracy.
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost Classifier with different n_estimators
n_estimators_list = [50, 100, 200, 300]
accuracy_list = []

for n_estimators in n_estimators_list:
    model = AdaBoostClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_list.append(accuracy)

# Plot the results
plt.figure(figsize=(8, 6))
plt.plot(n_estimators_list, accuracy_list, marker='o')
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('AdaBoost Accuracy vs. Number of Estimators')
plt.xticks(n_estimators_list)
plt.grid(True)
plt.show()

In [None]:
#26. Train a Gradient Boosting Classifier and visualize the ROC curve.
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Gradient Boosting Classifier
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# Calculate ROC curve and AUC
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = roc_auc_score(y_test, y_prob)

# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.show()

In [None]:
#27. Train an XGBoost Regressor and tune the learning rate using GridSearchCV.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV

# Create a sample regression dataset
X, y = make_regression(n_samples=1000, n_features=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost Regressor and tune learning rate
model = xgb.XGBRegressor(random_state=42)
param_grid = {'learning_rate': [0.01, 0.1, 0.2, 0.3]}
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

print(f"Best learning rate: {grid_search.best_params_['learning_rate']}")
print(f"Best score: {-grid_search.best_score_}")


In [None]:
#28. Train a CatBoost Classifier on an imbalanced dataset and compare performance with class weighting.
from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

# Create an imbalanced dataset
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train CatBoost Classifier without class weights
model_no_weights = CatBoostClassifier(random_state=42, verbose=0)
model_no_weights.fit(X_train, y_train)
y_pred_no_weights = model_no_weights.predict(X_test)
accuracy_no_weights = accuracy_score(y_test, y_pred_no_weights)
f1_no_weights = f1_score(y_test, y_pred_no_weights)

# Calculate per-sample weights that balance the two classes
from sklearn.utils import class_weight
class_weights = class_weight.compute_sample_weight('balanced', y_train)

# Train CatBoost Classifier with class weights
model_with_weights = CatBoostClassifier(random_state=42, verbose=0)
model_with_weights.fit(X_train, y_train, sample_weight=class_weights)
y_pred_with_weights = model_with_weights.predict(X_test)
accuracy_with_weights = accuracy_score(y_test, y_pred_with_weights)
f1_with_weights = f1_score(y_test, y_pred_with_weights)

print(f"Accuracy without class weights: {accuracy_no_weights}")
print(f"F1-score without class weights: {f1_no_weights}")
print(f"Accuracy with class weights: {accuracy_with_weights}")
print(f"F1-score with class weights: {f1_with_weights}")

In [None]:
#29. Train an AdaBoost Classifier and analyze the effect of different learning rates.
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost Classifier with different learning rates
learning_rates = [0.01, 0.1, 0.5, 1.0]  # assumed values; the original list was lost
accuracy_list = []

for lr in learning_rates:
    model = AdaBoostClassifier(learning_rate=lr, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_list.append(accuracy)

# Plot the results
plt.figure(figsize=(8, 6))
plt.plot(learning_rates, accuracy_list, marker='o')
plt.xlabel('Learning Rate')
plt.ylabel('Accuracy')
plt.title('AdaBoost Accuracy vs. Learning Rate')
plt.xticks(learning_rates)
plt.grid(True)
plt.show()

In [None]:
#30. Train an XGBoost Classifier for multi-class classification and evaluate using log-loss.

import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# Load the Iris dataset (multi-class)
iris = load_iris()
X, y = iris.data, iris.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train XGBoost Classifier for multi-class classification
model = xgb.XGBClassifier(objective='multi:softprob', random_state=42)  # Use multi:softprob for multi-class
model.fit(X_train, y_train)

# Make predictions (probabilities)
y_prob = model.predict_proba(X_test)

# Evaluate using log-loss
logloss = log_loss(y_test, y_prob)

print(f"Log-loss: {logloss}")