In [None]:
                                                               #Theoretical

In [None]:
#1.  What is Boosting in Machine Learning?
'''**Boosting** is an ensemble learning technique in machine learning that combines multiple **weak learners** (typically decision trees) to create a **strong learner** with improved performance.

### Key Idea:

Boosting trains models sequentially. Each new model focuses on correcting the errors made by the previous ones, either by giving more weight to misclassified examples (as in AdaBoost) or by fitting the remaining errors directly (as in Gradient Boosting).

### How it works (in simple steps):

1. Start with a weak model.
2. Evaluate errors and increase focus (weight) on misclassified points.
3. Train the next model to correct those mistakes.
4. Repeat and combine the predictions (usually by weighted voting or averaging).

### Popular Boosting Algorithms:

* **AdaBoost (Adaptive Boosting)**
* **Gradient Boosting**
* **XGBoost**
* **LightGBM**
* **CatBoost**

### Benefits:

* Improves accuracy.
* Reduces bias.
* Works well with structured/tabular data.

### Use cases:

* Classification and regression problems.
* Widely used in competitions and real-world applications like fraud detection and customer churn prediction.
'''

In [None]:
#2.  How does Boosting differ from Bagging?
'''**Boosting** and **Bagging** are both ensemble learning methods, but they work in different ways:

### Bagging:

* Stands for **Bootstrap Aggregating**.
* Trains multiple models **independently and in parallel** on different random subsets of the data (with replacement).
* All models have equal weight, and their predictions are combined, usually by majority voting or averaging.
* It mainly helps to **reduce variance** and prevent overfitting.
* A popular example is **Random Forest**.

### Boosting:

* Trains models **sequentially**, where each new model tries to correct the errors made by the previous ones.
* It gives more importance to misclassified points so that future models can learn from them.
* The predictions of all models are combined using **weighted voting** or summation.
* Boosting mainly helps to **reduce bias** (and can also reduce variance), often achieving higher accuracy, but it is more sensitive to noisy data and outliers than bagging.
* Examples include **AdaBoost**, **Gradient Boosting**, and **XGBoost**.

### In Summary:

* Bagging builds models in parallel and focuses on variance reduction.
* Boosting builds models sequentially and focuses on learning from mistakes to reduce bias.
'''

In [None]:
#3.  What is the key idea behind AdaBoost?
'''The **key idea behind AdaBoost (Adaptive Boosting)** is to combine multiple **weak learners** (typically shallow decision trees) in a **sequential** manner to form a **strong classifier**.

### Here's how AdaBoost works:

1. **Start Simple**: Train a weak learner (e.g., a decision stump) on the original data.
2. **Focus on Mistakes**: Increase the weights of the misclassified samples so the next learner focuses more on them.
3. **Repeat**: Continue adding learners, each correcting the errors of the previous ones.
4. **Combine**: Final prediction is made by a **weighted vote** of all the weak learners, where more accurate models get more say.
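
The four steps above can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: it assumes labels in {-1, +1}, a single numeric feature, and decision stumps as weak learners. The helper names (`best_stump`, `adaboost_fit`, `adaboost_predict`) and the toy data are made up for this example.

```python
import math

def stump_predict(x, threshold, polarity):
    # A decision stump on one feature: one side votes +polarity, the other -polarity.
    return polarity if x <= threshold else -polarity

def best_stump(xs, ys, weights):
    # Try every threshold and polarity; keep the one with the lowest weighted error.
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost_fit(xs, ys, n_rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n                      # start with equal weights
    ensemble = []
    for _ in range(n_rounds):
        err, threshold, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # more accurate stump, larger say
        ensemble.append((alpha, threshold, polarity))
        # Raise the weights of mistakes and lower the weights of correct points.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]   # renormalise so weights sum to 1
    return ensemble

def adaboost_predict(ensemble, x):
    # Weighted vote of all stumps.
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data (an assumption for illustration): no single threshold separates it.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost_fit(xs, ys, n_rounds=3)
predictions = [adaboost_predict(ensemble, x) for x in xs]
print(predictions == ys)
```

On this toy data the best single stump still misclassifies 3 of the 10 points, while the weighted vote of three stumps classifies all of them correctly.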

### Why it works:

AdaBoost adapts by giving more importance to examples that are hard to classify, leading to better overall performance.

It’s effective in reducing **bias** and improving **accuracy**, especially on datasets where some examples are harder to predict correctly.
'''

In [None]:
#4.  Explain the working of AdaBoost with an example.
'''### Working of AdaBoost – Explained with an Example

Let’s say we want to classify emails as **spam** or **not spam** using AdaBoost.

---

### Step-by-step Example:

#### Step 1: Assign Equal Weights

Start with equal weights for all training examples.
Suppose we have 5 emails:

* 3 are **not spam**
* 2 are **spam**
  Each gets a weight of 1/5 = 0.2.

#### Step 2: Train a Weak Learner

Train a weak classifier (like a small decision stump).
Say it misclassifies 2 emails.

#### Step 3: Calculate Error

Calculate the **weighted error** (sum of weights of misclassified samples).
Each of the 2 misclassified emails has weight 0.2, so the error = 0.2 + 0.2 = 0.4

#### Step 4: Calculate Classifier Weight

Compute how much say this classifier gets in the final decision:

$$
alpha = (1/2) * log((1 - error) / error)
$$

#### Step 5: Update Sample Weights

* Increase weights for **misclassified** examples (they become more important).
* Decrease weights for **correctly classified** examples.

#### Step 6: Repeat

Train a new weak learner on the **updated weights**.
Repeat steps 2–5 for several rounds.
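
The arithmetic of Steps 3 to 5 for the five-email example can be checked with a few lines of Python. This sketch assumes the natural logarithm in the alpha formula and the standard AdaBoost update (multiply by exp(+alpha) for mistakes, exp(-alpha) for correct points, then renormalise). Which two emails are misclassified is an arbitrary choice made for illustration.

```python
import math

weights = [0.2] * 5                                  # Step 1: five emails, equal weights
misclassified = [True, True, False, False, False]    # assumption: the stump got emails 1 and 2 wrong

# Step 3: weighted error = sum of the weights of the misclassified emails
error = sum(w for w, wrong in zip(weights, misclassified) if wrong)

# Step 4: classifier weight (natural log)
alpha = 0.5 * math.log((1 - error) / error)

# Step 5: scale weights up for mistakes, down for correct emails, then renormalise
updated = [w * math.exp(alpha if wrong else -alpha)
           for w, wrong in zip(weights, misclassified)]
total = sum(updated)
updated = [w / total for w in updated]

print(round(error, 3), round(alpha, 3))   # 0.4 0.203
print([round(w, 3) for w in updated])     # [0.25, 0.25, 0.167, 0.167, 0.167]
```

After the update, the two misclassified emails together carry half of the total weight, which is why the next weak learner is pushed to get them right.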

---

### Final Prediction

All weak learners vote on the final class. Their votes are **weighted** by their accuracy (i.e., their α values). The final prediction is the class with the highest total weighted vote.

---

### Summary

AdaBoost improves performance by:

* Focusing more on **hard-to-classify** examples.
* Combining many weak learners into a **strong ensemble model**.

'''


In [None]:
#5.  What is Gradient Boosting, and how is it different from AdaBoost?
'''### What is Gradient Boosting?

**Gradient Boosting** is an ensemble technique where models are trained **sequentially**, and each new model tries to **correct the errors** made by the previous ones by minimizing a **loss function** using **gradient descent**.

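Instead of adjusting sample weights like in AdaBoost, Gradient Boosting fits each new model to the **pseudo-residuals**, i.e., the negative gradient of the loss with respect to the current predictions. For squared-error loss these are simply the **residual errors** (differences between actual and predicted values).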
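
A minimal sketch of this residual-fitting loop for squared-error loss, in plain Python. It assumes a single numeric feature and regression stumps as weak learners; the function names, the toy data, and the learning rate of 0.5 are illustrative choices, not taken from any library.

```python
def fit_stump(xs, residuals):
    # Regression stump: pick the threshold with the lowest squared error,
    # predicting the mean residual on each side of the split.
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def gradient_boost_fit(xs, ys, n_rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)                      # initial constant prediction
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, history

def gradient_boost_predict(base, stumps, x, learning_rate=0.5):
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)

# Toy data (an assumption for illustration)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]
base, stumps, history = gradient_boost_fit(xs, ys)
print(history[0], history[-1])   # training MSE after the first and the last round
```

Each round shrinks the training error a little, because every stump is fitted to what the current ensemble still gets wrong.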

---

### How it differs from AdaBoost:

1. **Error Handling**:

   * **AdaBoost** focuses on misclassified examples by adjusting their weights.
   * **Gradient Boosting** fits new models to the residuals of previous models using gradients.

2. **Loss Function**:

   * **AdaBoost** mainly uses exponential loss.
   * **Gradient Boosting** can optimize different loss functions (e.g., squared loss, log loss), making it more flexible.

3. **Weighting**:

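   * **AdaBoost** reweights the training examples each round and weights each learner by its accuracy (its α value).
   * **Gradient Boosting** does not reweight examples; each new model's contribution is scaled by a **learning rate** (shrinkage).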

---

### Summary:

* **Both** build models sequentially and combine them to make strong predictions.
* **AdaBoost** adjusts sample weights to focus on hard examples.
* **Gradient Boosting** uses gradient descent to reduce prediction errors.

'''

In [None]:
#6.  What is the loss function in Gradient Boosting?
'''In **Gradient Boosting**, the **loss function** measures how well the model's predictions match the actual values. The boosting process works by minimizing this loss function step-by-step using gradient descent.

### Common Loss Functions:

#### For Regression:

* **Mean Squared Error (MSE)**

  $$
 Loss = (1/n) * Σ (y_i - y_pred_i)^2
  $$
* **Mean Absolute Error (MAE)**

#### For Binary Classification:

* **Log Loss (Cross-Entropy Loss)**

  $$
  Loss = -[y * log(p) + (1 - y) * log(1 - p)]
  $$

#### For Multiclass Classification:

* **Multiclass Log Loss (Softmax + Cross-Entropy)**

---

### Key Idea:

Gradient Boosting doesn't assume a specific loss function—it uses the **gradient (slope) of the loss** to improve the model at each stage. You can customize the loss function based on the problem (regression or classification).

So, the loss function in Gradient Boosting depends on the task, and the algorithm uses its gradient to build better models iteratively.
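
To make the role of the gradient concrete, the sketch below checks numerically that the negative gradient of each loss is the familiar error term. Two assumptions are made for convenience: squared loss is written with a factor of 1/2, and log loss is differentiated with respect to the raw score (log-odds) `f`, with `p = sigmoid(f)`. The function names are illustrative.

```python
import math

def squared_loss(y, f):
    # Squared error with a 1/2 factor so the gradient has no stray constant.
    return 0.5 * (y - f) ** 2

def log_loss(y, f):
    # Binary cross-entropy, where f is the raw score and p = sigmoid(f).
    p = 1 / (1 + math.exp(-f))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def numeric_negative_gradient(loss, y, f, eps=1e-6):
    # Central finite difference of the loss with respect to the prediction f.
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

# Regression: the negative gradient equals the residual y - f.
y, f = 3.0, 2.2
print(numeric_negative_gradient(squared_loss, y, f), y - f)

# Binary classification: the negative gradient equals y - p.
y, f = 1, 0.3
p = 1 / (1 + math.exp(-f))
print(numeric_negative_gradient(log_loss, y, f), y - p)
```

In both cases the two printed values agree up to numerical precision, so "fit the next tree to the negative gradient" reduces to "fit the next tree to the current errors".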
'''

In [None]:
#7.  How does XGBoost improve over traditional Gradient Boosting?
'''**XGBoost (Extreme Gradient Boosting)** is an optimized version of traditional Gradient Boosting that offers better performance and efficiency through several enhancements.

### Key Improvements of XGBoost over Traditional Gradient Boosting:

1. **Regularization**

   * Adds **L1 (Lasso)** and **L2 (Ridge)** regularization to the objective function.
   * Helps prevent **overfitting**. Traditional Gradient Boosting has no explicit penalty term in its objective and relies on shrinkage and subsampling instead.

2. **Tree Pruning (Post-Pruning)**

   * Grows each tree up to `max_depth` and then prunes back splits whose gain falls below a threshold (`gamma`), i.e., it prunes **after** growing.
   * Stopping at the first unhelpful split (pre-pruning) can miss a weak split that enables a strong split deeper in the tree.

3. **Handling Missing Values**

   * XGBoost **automatically learns** the best direction to handle missing data during training.
   * No need for manual imputation.

4. **Parallelization**

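   * Parallelizes the work **within each tree** (e.g., evaluating candidate splits across features), so trees are built faster.
   * The trees themselves are still added sequentially, as in traditional Gradient Boosting, which typically lacks this within-tree parallelism.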

5. **Weighted Quantile Sketch**

   * Efficient method for finding the best split points on large datasets with high precision.

6. **Sparsity Aware**

   * Efficient handling of **sparse data** (e.g., data with many zeros or missing values).

7. **Cross-validation Built-in**

   * XGBoost includes built-in **cross-validation** functionality.

8. **Scalability**

   * Designed to work efficiently with large datasets and supports distributed computing.

---

### Summary:

XGBoost is faster, more regularized, and more robust than traditional Gradient Boosting, making it one of the most popular and effective machine learning algorithms today.
'''

In [None]:
#8. What is the difference between XGBoost and CatBoost?
'''Here’s how **XGBoost** and **CatBoost** differ:

---

### 1. **Handling Categorical Features:**

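* **XGBoost:** Traditionally requires manual preprocessing of categorical variables (e.g., one-hot encoding); recent versions also offer native categorical support through the `enable_categorical` option.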
* **CatBoost:** Natively supports categorical features without explicit encoding, using advanced techniques like **ordered target statistics** to handle them efficiently.

---

### 2. **Algorithm and Implementation:**

* **XGBoost:** Uses a traditional gradient boosting approach with second-order gradients and regularization.
* **CatBoost:** Uses **ordered boosting** to reduce prediction shift and bias, improving accuracy and stability, especially on small datasets.

---

### 3. **Speed and Efficiency:**

* **XGBoost:** Fast and efficient with support for parallel and distributed computing.
* **CatBoost:** Also fast with GPU support; often faster on datasets with many categorical variables due to native handling.

---

### 4. **Dealing with Overfitting:**

* **XGBoost:** Uses L1/L2 regularization and tree pruning to prevent overfitting.
* **CatBoost:** Incorporates techniques like **ordered boosting** and **random permutations** to reduce overfitting and prediction bias.

---

### 5. **Ease of Use:**

* **XGBoost:** Powerful but requires more feature engineering for categorical data.
* **CatBoost:** Easier to use when working with categorical data, with less preprocessing needed.

---

### Summary:

* Use **CatBoost** when you have many categorical features and want an easy, powerful model with less preprocessing.
* Use **XGBoost** for general-purpose gradient boosting with fine control over model parameters, especially for numeric-heavy data.

'''


In [None]:
#9.  What are some real-world applications of Boosting techniques?
'''Boosting techniques are widely used across various real-world applications because of their strong predictive performance. Here are some common areas where boosting shines:

1. **Fraud Detection**

   * Detecting fraudulent transactions in banking and finance by identifying subtle patterns in transaction data.

2. **Customer Churn Prediction**

   * Predicting if a customer is likely to stop using a service, helping businesses take preventive actions.

3. **Credit Scoring**

   * Assessing the creditworthiness of loan applicants to minimize financial risks.

4. **Medical Diagnosis**

   * Classifying medical images or patient data to detect diseases like cancer, diabetes, or heart conditions.

5. **Spam Email Detection**

   * Filtering out unwanted or malicious emails from users’ inboxes.

6. **Click-Through Rate (CTR) Prediction**

   * Predicting how likely users are to click on online ads, important in digital marketing.

7. **Recommendation Systems**

   * Improving product or content recommendations based on user behavior.

8. **Image and Speech Recognition**

   * Enhancing the accuracy of identifying objects in images or understanding spoken language.

9. **Sentiment Analysis**

   * Analyzing text data (like reviews or social media posts) to determine the sentiment or opinion expressed.

---

Boosting’s ability to combine weak learners and focus on hard examples makes it very effective for these complex prediction tasks.
'''

In [None]:
#10.  How does regularization help in XGBoost?
'''Regularization in **XGBoost** helps by **preventing overfitting** and improving the model's generalization to unseen data. Here’s how it works:

1. **Types of Regularization in XGBoost:**

   * **L1 regularization (Lasso, `reg_alpha`):** Adds a penalty proportional to the absolute value of the leaf weights. This encourages sparsity by pushing small leaf weights to exactly zero.
   * **L2 regularization (Ridge, `reg_lambda`):** Adds a penalty proportional to the square of the leaf weights, discouraging large weights and smoothing the model.
   * **Tree complexity penalty (`gamma`):** Adds a fixed cost per leaf, so a split is kept only if its gain exceeds this cost.

2. **Why it helps:**

   * By penalizing complex models (with large or many leaf weights), regularization reduces the risk of fitting noise in the training data.
   * It encourages simpler, more robust trees that perform better on new data.
   * Controls the complexity of each tree, improving stability and reducing variance.

3. **In XGBoost’s Objective Function:**

   * Regularization terms are added to the loss function, so the training process tries to minimize both the prediction error and the complexity of the model.
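
The effect of these penalties can be seen in the closed-form leaf weight and split gain from the XGBoost paper. The sketch below is a simplified illustration, not the library's internal code: `G` and `H` stand for the sums of first- and second-order gradients of the samples in a leaf, and the function names are made up here.

```python
def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    # L1 (alpha) soft-thresholds the gradient sum; L2 (lambda) shrinks the result.
    if G > reg_alpha:
        g = G - reg_alpha
    elif G < -reg_alpha:
        g = G + reg_alpha
    else:
        g = 0.0
    return -g / (H + reg_lambda)

def split_gain(GL, HL, GR, HR, reg_lambda=1.0, gamma=0.0):
    # Gain of splitting a node into left/right children (L1 term omitted for brevity).
    def score(G, H):
        return G * G / (H + reg_lambda)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma

# Same leaf statistics, increasingly strong regularization:
print(leaf_weight(-4.0, 3.0, reg_lambda=0.0))                 # 1.333... (no penalty)
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0))                 # 1.0      (L2 shrinks the weight)
print(leaf_weight(-4.0, 3.0, reg_lambda=1.0, reg_alpha=1.0))  # 0.75     (L1 shrinks it further)
print(leaf_weight(-0.5, 3.0, reg_lambda=1.0, reg_alpha=1.0) == 0)  # True (small signal zeroed out)

# A split that looks useful without gamma is rejected once gamma exceeds its gain:
print(split_gain(-4.0, 3.0, 2.0, 2.0))             # about 2.333
print(split_gain(-4.0, 3.0, 2.0, 2.0, gamma=3.0))  # negative, so the split would be pruned
```

Larger `reg_lambda` and `reg_alpha` pull leaf weights toward zero, and a larger `gamma` removes splits that do not reduce the loss enough to justify an extra leaf.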

---

**Summary:**
Regularization in XGBoost balances fitting the data well and keeping the model simple, which leads to better performance and less overfitting.
'''

In [None]:
#11. What are some hyperparameters to tune in Gradient Boosting models?
'''Here are some important hyperparameters you can tune in **Gradient Boosting models** to improve performance:

1. **n_estimators**

   * Number of boosting rounds (trees). More trees can improve accuracy but may cause overfitting.

2. **learning_rate (shrinkage)**

   * Step size for each tree’s contribution. Smaller values require more trees but can lead to better generalization.

3. **max_depth**

   * Maximum depth of each tree. Controls model complexity; deeper trees can capture more patterns but may overfit.

4. **min_samples_split**

   * Minimum number of samples required to split a node. Helps prevent creating nodes that are too specific.

5. **min_samples_leaf**

   * Minimum number of samples required at a leaf node. Prevents leaves with very few samples.

6. **subsample**

   * Fraction of training samples used for building each tree. Using less than 1.0 adds randomness and helps prevent overfitting.

7. **max_features**

   * Number of features to consider when looking for the best split. Smaller values reduce overfitting and speed up training.

8. **loss**

   * Loss function to optimize (e.g., `log_loss` for classification, called `deviance` in older scikit-learn versions, and `squared_error` for regression).

'''

In [None]:
#12. What is the concept of Feature Importance in Boosting?
'''**Feature Importance** in boosting models refers to a way of measuring how much each feature contributes to the predictive power of the model.

### Key points about Feature Importance in Boosting:

* **What it shows:**
  It ranks features based on their impact on the model’s predictions.

* **How it’s calculated:**
  Common methods include:

  * **Gain:** Measures the improvement in accuracy (reduction in loss) brought by a feature when it’s used for splitting. Higher gain means more important feature.
  * **Frequency (or Weight):** Counts how many times a feature is used to split across all trees. More splits imply higher importance.
  * **Cover:** Measures the number of samples affected by splits on the feature (in XGBoost, weighted by their second-order gradients), reflecting how broadly it influences the data.

* **Why it matters:**

  * Helps identify which features the model relies on most.
  * Assists in feature selection and dimensionality reduction.
  * Improves interpretability and trust in the model.
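
The three measures can rank the same features differently. The sketch below computes them from a hand-written list of splits; the split records and feature names are hypothetical, and gain and cover are averaged per split, which is how XGBoost reports its `gain` and `cover` importance types.

```python
from collections import defaultdict

# Hypothetical splits collected from all trees:
# (feature used, loss reduction of the split, samples reaching the node)
splits = [
    ("age", 12.0, 100),
    ("income", 30.0, 100),
    ("age", 3.0, 40),
    ("age", 1.0, 25),
    ("tenure", 8.0, 60),
]

def feature_importance(splits):
    weight = defaultdict(int)
    gain = defaultdict(float)
    cover = defaultdict(float)
    for feature, loss_reduction, n_samples in splits:
        weight[feature] += 1              # frequency: how often the feature is used
        gain[feature] += loss_reduction   # total loss reduction
        cover[feature] += n_samples       # total samples affected
    return {
        f: {
            "weight": weight[f],
            "gain": gain[f] / weight[f],     # average gain per split
            "cover": cover[f] / weight[f],   # average cover per split
        }
        for f in weight
    }

scores = feature_importance(splits)
top_by_weight = max(scores, key=lambda f: scores[f]["weight"])
top_by_gain = max(scores, key=lambda f: scores[f]["gain"])
print(top_by_weight, top_by_gain)   # age income
```

Here `age` is used most often, but `income` gives the largest average loss reduction, so it is worth checking more than one importance type before dropping features.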

---

### Summary:

Feature Importance in boosting shows the relative contribution of each input variable, helping you understand and optimize your model better.
'''

In [None]:
#13.  Why is CatBoost efficient for categorical data?
'''CatBoost is efficient for categorical data because it **handles categorical features natively** without needing manual preprocessing like one-hot encoding. Here's why:

1. **Ordered Target Statistics:**
   CatBoost converts categorical features into numerical values using statistics (like average target value) calculated in an **ordered** manner that avoids data leakage. This means it carefully uses only past data to encode current rows during training.

2. **No Need for Manual Encoding:**
   You don’t have to do one-hot or label encoding beforehand, which saves time and avoids the curse of dimensionality caused by one-hot encoding many categories.

3. **Reduces Overfitting:**
   By using a special permutation-driven approach to calculate these statistics, CatBoost reduces bias and overfitting compared to naive target encoding.

4. **Efficient Handling of High Cardinality:**
   Works well even when categorical features have many unique values (high cardinality), which is challenging for other algorithms.
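
The idea behind ordered target statistics can be sketched in a few lines. This is a simplified illustration, not CatBoost's actual implementation: it uses a single fixed row order (CatBoost uses random permutations), and the `prior` and `prior_weight` parameters are names chosen for this example.

```python
def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0):
    # Encode each row using only the targets of EARLIER rows with the same category,
    # so a row's own label never leaks into its encoding.
    sums, counts = {}, {}
    encoded = []
    for category, target in zip(categories, targets):
        seen_sum = sums.get(category, 0.0)
        seen_count = counts.get(category, 0)
        encoded.append((seen_sum + prior * prior_weight) / (seen_count + prior_weight))
        # Only after encoding the row do we add its target to the running statistics.
        sums[category] = seen_sum + target
        counts[category] = seen_count + 1
    return encoded

categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 1, 0, 1]
encoded = ordered_target_stats(categories, targets)
print([round(v, 3) for v in encoded])   # [0.5, 0.5, 0.75, 0.833, 0.25]
```

The first occurrence of each category falls back to the prior, and later rows see a running average of earlier targets. A naive target encoding would instead use every row of the category, including the row being encoded.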

---

**In short:** CatBoost’s smart and leakage-free encoding method for categorical variables makes it both accurate and efficient on datasets with categorical data.
'''

In [None]:
                                                                    #Practical

In [None]:
#14.  Train an AdaBoost Classifier on a sample dataset and print model accuracy.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Load sample dataset
data = load_iris()
X = data.data
y = data.target

# For simplicity, convert it into a binary classification problem
# (e.g., class 0 vs classes 1 and 2)
y_binary = (y == 0).astype(int)

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.3, random_state=42)

# Initialize AdaBoost Classifier
model = AdaBoostClassifier(n_estimators=50, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")


In [None]:
#15.  Train an AdaBoost Regressor and evaluate performance using Mean Absolute Error (MAE).

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_absolute_error

# Load sample regression dataset
# Note: load_boston was removed in scikit-learn 1.2, so California Housing is used instead
data = fetch_california_housing()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize AdaBoost Regressor
model = AdaBoostRegressor(n_estimators=50, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.3f}")


In [None]:
#16.  Train a Gradient Boosting Classifier on the Breast Cancer dataset and print feature importance.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train Gradient Boosting Classifier
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

# Get feature importances
importances = model.feature_importances_

# Create a DataFrame for better visualization
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

print("Feature Importances:")
print(feature_importance_df)


In [None]:
#17.  Train a Gradient Boosting Regressor and evaluate using R-Squared Score.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Load sample regression dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Gradient Boosting Regressor
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Calculate R-squared score
r2 = r2_score(y_test, y_pred)
print(f"R-squared Score: {r2:.3f}")


In [None]:
#18.  Train an XGBoost Classifier on a dataset and compare accuracy with Gradient Boosting.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

# Load Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb_model.fit(X_train, y_train)
gb_pred = gb_model.predict(X_test)
gb_accuracy = accuracy_score(y_test, gb_pred)

# Initialize XGBoost Classifier
xgb_model = XGBClassifier(eval_metric='logloss', random_state=42)  # use_label_encoder is deprecated and no longer needed
xgb_model.fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_test)
xgb_accuracy = accuracy_score(y_test, xgb_pred)

print(f"Gradient Boosting Accuracy: {gb_accuracy:.3f}")
print(f"XGBoost Accuracy: {xgb_accuracy:.3f}")


In [None]:
#19.  Train a CatBoost Classifier and evaluate using F1-Score.

from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize CatBoost Classifier
model = CatBoostClassifier(verbose=0, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate F1-Score
f1 = f1_score(y_test, y_pred)
print(f"F1-Score: {f1:.3f}")


In [None]:
#20.  Train an XGBoost Regressor and evaluate using Mean Squared Error (MSE).

from xgboost import XGBRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load regression dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize XGBoost Regressor
model = XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42)

# Train the model
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.3f}")


In [None]:
#21. Train an AdaBoost Classifier and visualize feature importance.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target
feature_names = data.feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train AdaBoost Classifier
model = AdaBoostClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Get feature importances and sort them in descending order
importances = model.feature_importances_
order = importances.argsort()[::-1]

# Plot feature importances
plt.figure(figsize=(10,6))
plt.barh(feature_names[order], importances[order])
plt.xlabel('Feature Importance')
plt.title('Feature Importance from AdaBoost Classifier')
plt.gca().invert_yaxis()  # Highest importance on top
plt.tight_layout()
plt.show()



In [None]:
#22. Train a Gradient Boosting Regressor and plot learning curves.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize model with a large number of estimators
model = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1, random_state=42)
model.fit(X_train, y_train)

# Arrays to store training and validation errors
train_errors = []
val_errors = []

# Calculate error for each stage of boosting
for y_train_pred, y_val_pred in zip(model.staged_predict(X_train), model.staged_predict(X_val)):
    train_errors.append(mean_squared_error(y_train, y_train_pred))
    val_errors.append(mean_squared_error(y_val, y_val_pred))

# Plot learning curves
plt.figure(figsize=(10,6))
plt.plot(train_errors, label='Training MSE')
plt.plot(val_errors, label='Validation MSE')
plt.xlabel('Number of Trees')
plt.ylabel('Mean Squared Error')
plt.title('Learning Curves for Gradient Boosting Regressor')
plt.legend()
plt.show()


In [None]:
#23. Train an XGBoost Classifier and visualize feature importance.

import matplotlib.pyplot as plt
from xgboost import XGBClassifier, plot_importance
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train XGBoost Classifier
model = XGBClassifier(eval_metric='logloss', random_state=42)
model.fit(X_train, y_train)

# Plot feature importance on an explicit axes so no empty extra figure is created
# (features are labelled f0, f1, ... because the model was trained on a NumPy array)
fig, ax = plt.subplots(figsize=(10, 8))
plot_importance(model, ax=ax, max_num_features=15, importance_type='weight', show_values=False)
ax.set_title('Feature Importance - XGBoost Classifier')
plt.show()


In [None]:
#24. Train a CatBoost Classifier and plot the confusion matrix.

import matplotlib.pyplot as plt
from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train CatBoost Classifier
model = CatBoostClassifier(verbose=0, random_state=42)
model.fit(X_train, y_train)

# Predict on test data
y_pred = model.predict(X_test)

# Compute confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Plot confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=data.target_names)
disp.plot(cmap=plt.cm.Blues)
plt.title('Confusion Matrix - CatBoost Classifier')
plt.show()


In [None]:
#25. Train an AdaBoost Classifier with different numbers of estimators and compare accuracy.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Different numbers of estimators to try
estimators = [10, 50, 100, 200, 300]

accuracies = []

for n in estimators:
    model = AdaBoostClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    print(f"Estimators: {n} — Accuracy: {acc:.3f}")

# Plotting the results
plt.figure(figsize=(8,5))
plt.plot(estimators, accuracies, marker='o')
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('AdaBoost Accuracy vs Number of Estimators')
plt.grid(True)
plt.show()


In [None]:
#26. Train a Gradient Boosting Classifier and visualize the ROC curve.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gradient Boosting Classifier
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict probabilities for the positive class
y_probs = model.predict_proba(X_test)[:, 1]

# Compute ROC curve and AUC
fpr, tpr, thresholds = roc_curve(y_test, y_probs)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(8,6))
plt.plot(fpr, tpr, label=f'Gradient Boosting (AUC = {roc_auc:.3f})')
plt.plot([0,1], [0,1], 'k--', label='Random Guess')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Gradient Boosting Classifier')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()


In [None]:
#27. Train an XGBoost Regressor and tune the learning rate using GridSearchCV.

from xgboost import XGBRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

# Load dataset
data = fetch_california_housing()
X = data.data
y = data.target

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize XGBoost Regressor
xgb = XGBRegressor(n_estimators=100, random_state=42)

# Define parameter grid for learning rate
param_grid = {
    'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3]
}

# Setup GridSearchCV
grid_search = GridSearchCV(estimator=xgb, param_grid=param_grid, cv=3, scoring='neg_mean_squared_error', verbose=1)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best learning rate:", grid_search.best_params_['learning_rate'])
print("Best CV MSE:", -grid_search.best_score_)

# Evaluate on test set using best estimator
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
test_mse = mean_squared_error(y_test, y_pred)
print(f"Test MSE with best learning rate: {test_mse:.4f}")


In [None]:
#28. Train a CatBoost Classifier on an imbalanced dataset and compare performance with class weighting.

from catboost import CatBoostClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
import numpy as np

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Create an imbalanced dataset by downsampling class 1 (benign, originally the majority class)
class_0_indices = np.where(y == 0)[0]
class_1_indices = np.where(y == 1)[0]

# Keep all class 0, reduce class 1 to 20% to simulate imbalance
np.random.seed(42)
reduced_class_1_indices = np.random.choice(class_1_indices, size=int(0.2 * len(class_1_indices)), replace=False)

# Combine indices
imbalanced_indices = np.concatenate([class_0_indices, reduced_class_1_indices])

X_imbalanced = X[imbalanced_indices]
y_imbalanced = y[imbalanced_indices]

# Split data (stratified so both splits keep the same class ratio)
X_train, X_test, y_train, y_test = train_test_split(X_imbalanced, y_imbalanced, test_size=0.3, random_state=42, stratify=y_imbalanced)

# Train CatBoost without class weights
model_no_weights = CatBoostClassifier(verbose=0, random_state=42)
model_no_weights.fit(X_train, y_train)
y_pred_no_weights = model_no_weights.predict(X_test)

# Calculate balanced class weights (inversely proportional to class frequency)
from sklearn.utils.class_weight import compute_class_weight
classes = np.unique(y_train)
class_weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
# Cast NumPy scalars to plain Python types before passing them to CatBoost
class_weights_dict = {int(c): float(w) for c, w in zip(classes, class_weights)}

# Train CatBoost with class weights
model_with_weights = CatBoostClassifier(verbose=0, class_weights=class_weights_dict, random_state=42)
model_with_weights.fit(X_train, y_train)
y_pred_with_weights = model_with_weights.predict(X_test)

# Print results
print("Without class weights:")
print(classification_report(y_test, y_pred_no_weights))
print("Accuracy:", accuracy_score(y_test, y_pred_no_weights))

print("\nWith class weights:")
print(classification_report(y_test, y_pred_with_weights))
print("Accuracy:", accuracy_score(y_test, y_pred_with_weights))


In [None]:
#29. Train an AdaBoost Classifier and analyze the effect of different learning rates.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Different learning rates to try
learning_rates = [0.01, 0.05, 0.1, 0.5, 1, 1.5, 2]

accuracies = []

for lr in learning_rates:
    model = AdaBoostClassifier(learning_rate=lr, n_estimators=50, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    accuracies.append(acc)
    print(f"Learning rate: {lr} — Accuracy: {acc:.3f}")

# Plot accuracy vs learning rate
plt.figure(figsize=(8,5))
plt.plot(learning_rates, accuracies, marker='o')
plt.xlabel('Learning Rate')
plt.ylabel('Accuracy')
plt.title('Effect of Learning Rate on AdaBoost Accuracy')
plt.grid(True)
plt.show()


In [None]:
#30. Train an XGBoost Classifier for multi-class classification and evaluate using log-loss.

from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss

# Load multi-class dataset (Iris)
data = load_iris()
X = data.data
y = data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train XGBoost Classifier for multi-class
# (the scikit-learn wrapper infers the number of classes from y, so num_class is not passed)
model = XGBClassifier(objective='multi:softprob', eval_metric='mlogloss', random_state=42)
model.fit(X_train, y_train)

# Predict probabilities on test set
y_pred_proba = model.predict_proba(X_test)

# Calculate log loss
loss = log_loss(y_test, y_pred_proba)
print(f"Log-Loss on test set: {loss:.4f}")
