
Theoretical Questions

1. **What is Boosting in Machine Learning?**  
   Boosting is an ensemble learning technique that combines multiple weak learners (usually decision trees) to create a strong learner. It sequentially trains models so that each new model corrects the errors of the previous ones, improving overall prediction accuracy.

2. **How does Boosting differ from Bagging?**  
   - **Boosting:** Models are trained sequentially, with each new model focusing on the errors of the previous ones. It primarily reduces bias.  
   - **Bagging:** Models are trained independently, and can be trained in parallel, on bootstrap samples of the data. It primarily reduces variance (e.g., Random Forest).
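
The contrast can be sketched with scikit-learn's two ensemble classes. This is a minimal illustration on a synthetic dataset; the dataset sizes and `n_estimators=25` are assumptions chosen only to keep the run fast. Both classes default to decision-tree base learners (full trees for bagging, depth-1 stumps for AdaBoost).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumed sizes, for illustration only)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Bagging: each tree is fit independently on its own bootstrap sample
bagging = BaggingClassifier(n_estimators=25, random_state=0).fit(X_train, y_train)

# Boosting: stumps are fit one after another, each on reweighted samples
boosting = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)
boosting_acc = boosting.score(X_test, y_test)
print("Bagging accuracy:", bagging_acc)
print("Boosting accuracy:", boosting_acc)
```

Only the boosted model exposes per-learner vote weights (`boosting.estimator_weights_`), because only there does each learner's contribution depend on how the earlier ones performed.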

3. **What is the key idea behind AdaBoost?**  
   AdaBoost (Adaptive Boosting) increases the weights of misclassified samples and decreases the weights of correctly classified ones after each iteration, so the next weak learner concentrates on the hard cases. The weak learners are then combined by weighted majority voting, where more accurate learners get a larger vote.

4. **Explain the working of AdaBoost with an example.**  
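   - Start with equal weights for all samples.  
   - Train a weak learner (e.g., a decision stump) on the weighted data.  
   - Compute the learner's weighted error and give it a vote weight that grows as the error shrinks.  
   - Increase the weights of misclassified samples and decrease the weights of correctly classified ones.  
   - Train the next weak learner with the updated weights.  
   - Repeat, then combine all learners by weighted vote to form the strong model.  
   Example: In spam classification, if the first stump misclassifies some emails, those emails get larger weights, so the second stump concentrates on them.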
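
The loop above can be sketched by hand. This is a minimal version of discrete AdaBoost with labels mapped to {-1, +1}, not scikit-learn's internal implementation; the synthetic dataset, `n_rounds`, and the helper `predict` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=5, random_state=0)
y = 2 * y01 - 1  # map labels {0, 1} to {-1, +1}

n_rounds = 10
weights = np.full(len(y), 1.0 / len(y))  # start with equal sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Train a decision stump on the currently weighted samples
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted error, clipped to keep the logarithm finite
    err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this stump

    # Misclassified samples (y * pred == -1) are scaled up, the rest down
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)

def predict(X_new):
    """Weighted vote of all stumps (hypothetical helper for this sketch)."""
    scores = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)

train_accuracy = np.mean(predict(X) == y)
print("Training accuracy:", train_accuracy)
```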

5. **What is Gradient Boosting, and how is it different from AdaBoost?**  
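   - **Gradient Boosting** minimizes a loss function by fitting each new weak learner to the negative gradient of the loss with respect to the current predictions. For squared error this gradient is simply the residual, so the procedure amounts to gradient descent in function space.  
   - **Difference:** AdaBoost reweights the training samples between rounds, whereas Gradient Boosting keeps the samples fixed and fits each learner to the gradient of an arbitrary differentiable loss.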
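
For squared error, the residual-fitting idea can be sketched in a few lines. This is a simplified illustration, not the scikit-learn implementation; the toy sine data, `learning_rate`, and `n_rounds` are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_rounds = 50

# Initial model: predict the mean of the targets everywhere
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Take a small step in the direction the tree predicts
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print("MSE before boosting:", initial_mse)
print("MSE after boosting: ", final_mse)
```

Note that the sample weights never change here; only the regression targets (the residuals) do, which is the main mechanical difference from AdaBoost.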

6. **What is the loss function in Gradient Boosting?**  
   Gradient Boosting works with any differentiable loss function; the usual choice depends on the task:  
   - **Regression:** Mean Squared Error (MSE) or Mean Absolute Error (MAE).  
   - **Classification:** Log loss, also called cross-entropy loss (binary or multinomial).
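
As a rough illustration, the losses and the negative gradients that each boosting round would fit can be computed with NumPy. The function names and the tiny arrays are assumptions made for this sketch; for log loss the gradient is taken with respect to the raw score (log-odds), which is the usual formulation.

```python
import numpy as np

def mse(y, pred):
    return np.mean((y - pred) ** 2)

def mae(y, pred):
    return np.mean(np.abs(y - pred))

def log_loss(y, prob):
    return -np.mean(y * np.log(prob) + (1 - y) * np.log(1 - prob))

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

# Regression example
y_reg = np.array([3.0, 5.0])
pred_reg = np.array([2.0, 7.0])
print(mse(y_reg, pred_reg))  # 2.5
print(mae(y_reg, pred_reg))  # 1.5

# Negative gradients (pseudo-residuals) that the next tree would be fit to
squared_error_residuals = y_reg - pred_reg            # [ 1., -2.]
absolute_error_residuals = np.sign(y_reg - pred_reg)  # [ 1., -1.]

# Binary classification: raw scores of 0 correspond to probability 0.5
y_clf = np.array([1.0, 0.0])
raw_score = np.array([0.0, 0.0])
prob = sigmoid(raw_score)
clf_loss = log_loss(y_clf, prob)      # ln(2), about 0.693
log_loss_residuals = y_clf - prob     # [ 0.5, -0.5]
```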

7. **How does XGBoost improve over traditional Gradient Boosting?**  
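   XGBoost (Extreme Gradient Boosting) offers:  
   - Regularization (L1 and L2 penalties on leaf weights) to prevent overfitting.  
   - Parallelized split finding within each tree for faster training (the trees themselves are still built sequentially).  
   - Built-in handling of missing values by learning a default direction at each split.  
   - Tree pruning, which removes splits whose gain is too small.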

8. **What is the difference between XGBoost and CatBoost?**  
   - **XGBoost:** Optimized for speed, supports missing values natively, and prunes splits whose gain falls below a threshold (`gamma`).  
   - **CatBoost:** Specialized for categorical data, using ordered boosting and ordered target statistics to prevent target leakage.

9. **What are some real-world applications of Boosting techniques?**  
   - Fraud detection (e.g., banking).  
   - Customer churn prediction.  
   - Medical diagnosis.  
   - Spam detection.  
   - Recommendation systems.

10. **How does regularization help in XGBoost?**  
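    XGBoost adds L1 (`reg_alpha`) and L2 (`reg_lambda`) penalties on the leaf weights, plus a per-leaf penalty (`gamma`), to its training objective. This helps by:  
    - Preventing overfitting, because large leaf weights are shrunk toward zero.  
    - Reducing model complexity, because splits with little gain are not kept.  
    - Improving generalization to unseen data.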
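
The shrinking effect can be seen from the closed-form leaf weight in XGBoost's second-order objective: with gradient sum G and Hessian sum H in a leaf, the optimal weight is `-soft_threshold(G, alpha) / (H + lambda)`. The sketch below evaluates that formula directly; the function names and the values of G and H are assumptions for illustration, and this is not a call into the XGBoost library.

```python
def soft_threshold(g, alpha):
    """Shrink g toward zero by alpha (the effect of the L1 penalty)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0

def leaf_weight(grad_sum, hess_sum, reg_lambda=0.0, reg_alpha=0.0):
    """Optimal leaf weight under L1/L2 regularization (illustrative helper)."""
    return -soft_threshold(grad_sum, reg_alpha) / (hess_sum + reg_lambda)

G, H = -10.0, 4.0  # assumed gradient and Hessian sums for one leaf

print(leaf_weight(G, H))                                # 2.5 (no regularization)
print(leaf_weight(G, H, reg_lambda=1.0))                # 2.0 (L2 shrinks the weight)
print(leaf_weight(G, H, reg_lambda=1.0, reg_alpha=5.0)) # 1.0 (L1 shrinks it further)
```

A large enough `reg_alpha` drives the weight to exactly zero, which is why L1 tends to produce sparser models than L2.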

11. **What are some hyperparameters to tune in Gradient Boosting models?**  
    - Learning rate.  
    - Number of estimators.  
    - Maximum depth of trees.  
    - Subsample ratio.  
    - Minimum child weight.
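
A small grid search shows how these can be tuned together. This is a sketch with scikit-learn's `GradientBoostingClassifier`; the grid values and dataset are assumptions kept small so the search runs quickly. "Minimum child weight" is XGBoost's `min_child_weight`; scikit-learn has no parameter of that name (`min_samples_leaf` plays a similar role), so it is left out here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Small illustrative grid: 2 * 2 * 2 * 2 = 16 candidates
param_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "subsample": [0.8, 1.0],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

Learning rate and number of estimators trade off against each other, so they are usually searched jointly rather than one at a time.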

12. **What is the concept of Feature Importance in Boosting?**  
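    Feature importance scores each feature by how much it contributes to the model's splits (for example, the total loss reduction from splits on that feature), which gives a ranking of the most influential variables. It helps in:  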
    - Feature selection.  
    - Understanding model decisions.  
    - Reducing dimensionality.
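
As a sketch of feature selection driven by importances, scikit-learn's `SelectFromModel` can keep only the features whose importance is at or above the mean. The synthetic dataset (4 informative features out of 20) and the `"mean"` threshold are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# 20 features, only 4 of which carry signal (assumed setup)
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=4, n_redundant=0, random_state=0
)

model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = model.feature_importances_  # normalized to sum to 1

# Keep features whose importance is at least the mean importance
selector = SelectFromModel(model, prefit=True, threshold="mean")
X_reduced = selector.transform(X)

print("Original number of features:", X.shape[1])
print("Features kept:", X_reduced.shape[1])
```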

13. **Why is CatBoost efficient for categorical data?**  
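    - Uses ordered boosting and ordered target statistics, so a sample's encoding never depends on its own label (this prevents target leakage).  
    - Handles categorical variables internally, without manual preprocessing such as one-hot encoding.  
    - Supports GPU training for speed.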
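
The idea behind ordered target statistics can be sketched in plain Python: samples are visited in a random order, and each one is encoded using only the targets of same-category samples seen before it. This is a simplified illustration, not CatBoost's implementation (which uses several permutations and its own prior handling); the function name, `prior_weight`, and the toy data are assumptions.

```python
import numpy as np

def ordered_target_encode(categories, targets, prior, prior_weight=1.0, seed=0):
    """Encode each category value from earlier samples only (illustrative helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))
    target_sums, counts = {}, {}
    encoded = np.empty(len(categories))
    for idx in order:
        cat = categories[idx]
        seen_sum = target_sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Smoothed mean of targets seen so far; the current target is excluded
        encoded[idx] = (seen_sum + prior_weight * prior) / (seen_count + prior_weight)
        # Only now is the current sample added to the running statistics
        target_sums[cat] = seen_sum + targets[idx]
        counts[cat] = seen_count + 1
    return encoded

categories = ["a", "b", "a", "a", "b"]
targets = [1, 0, 1, 0, 1]
prior = sum(targets) / len(targets)  # global target mean, 0.6

encoded = ordered_target_encode(categories, targets, prior)
print(encoded)
```

The first sample visited in each category has no history, so it receives the prior itself. A plain (non-ordered) target mean would instead include each sample's own label in its encoding, which is the leakage that ordering avoids.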



Practical Questions

## 14. **Train an AdaBoost Classifier on a sample dataset and print model accuracy**
```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create sample dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost Classifier
clf = AdaBoostClassifier(n_estimators=50, random_state=42)
clf.fit(X_train, y_train)

# Predict and print accuracy
y_pred = clf.predict(X_test)
print("AdaBoost Classifier Accuracy:", accuracy_score(y_test, y_pred))
```
**Expected Output (approximate; with the fixed `random_state` values the result is reproducible for a given scikit-learn version):**
```
AdaBoost Classifier Accuracy: ~0.85 - 0.90
```

---

## 15. **Train an AdaBoost Regressor and evaluate performance using Mean Absolute Error (MAE)**
```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Create regression dataset
X, y = make_regression(n_samples=1000, n_features=10, noise=0.1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train AdaBoost Regressor
regressor = AdaBoostRegressor(n_estimators=50, random_state=42)
regressor.fit(X_train, y_train)

# Predict and evaluate MAE
y_pred = regressor.predict(X_test)
print("Mean Absolute Error (MAE):", mean_absolute_error(y_test, y_pred))
```
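**Expected Output (the exact value depends on the library version):**
```
Mean Absolute Error (MAE): <value>
```
MAE is reported in the units of `y`. The targets from `make_regression` span hundreds of units here, so the error is expected to be far above the noise level of 0.1.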

---

## 16. **Train a Gradient Boosting Classifier on the Breast Cancer dataset and print feature importance**
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Train Gradient Boosting Classifier
gbc = GradientBoostingClassifier(n_estimators=100, random_state=42)
gbc.fit(X, y)

# Print feature importance
feature_importance = pd.DataFrame({'Feature': data.feature_names, 'Importance': gbc.feature_importances_})
print(feature_importance.sort_values(by="Importance", ascending=False))
```
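**Expected Output:**  
A table of all 30 features sorted by importance. The rows below only illustrate the layout; the actual ranking and values depend on the fitted model. The index column keeps each feature's original position, so it is not sequential after sorting:
```
            Feature  Importance
20     worst radius        0.20
1      mean texture        0.12
22  worst perimeter        0.10
...
```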

---

## 17. **Train a Gradient Boosting Regressor and evaluate using R-Squared Score**
```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

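# Train Gradient Boosting Regressor on the regression split from question 15
# (question 16 reassigns X and y, but X_train/X_test/y_train/y_test are unchanged)
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbr.fit(X_train, y_train)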

# Predict and evaluate R^2 score
y_pred = gbr.predict(X_test)
print("R-Squared Score:", r2_score(y_test, y_pred))
```
**Expected Output:**
```
R-Squared Score: ~0.85 - 0.95
```

---

## 18. **Train an XGBoost Classifier on a dataset and compare accuracy with Gradient Boosting**
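```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Classification split under its own names (X_train/y_train hold the regression data from question 15)
Xc, yc = make_classification(n_samples=1000, n_features=20, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size=0.2, random_state=42)

# Train both classifiers on the same split
xgb = XGBClassifier(n_estimators=100, eval_metric='logloss', random_state=42).fit(Xc_train, yc_train)
gbc_cmp = GradientBoostingClassifier(n_estimators=100, random_state=42).fit(Xc_train, yc_train)

# Predict and compare accuracy
print("XGBoost Classifier Accuracy:", accuracy_score(yc_test, xgb.predict(Xc_test)))
print("Gradient Boosting Classifier Accuracy:", accuracy_score(yc_test, gbc_cmp.predict(Xc_test)))
```
**Expected Output (format only; exact values depend on the library versions):**
```
XGBoost Classifier Accuracy: <value>
Gradient Boosting Classifier Accuracy: <value>
```
Both models are scored on the same test split, so the two accuracies can be compared directly.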

---

## 19. **Train a CatBoost Classifier and evaluate using F1-Score**
```python
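from catboost import CatBoostClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Classification split under its own names (X_train/y_train hold the regression data from question 15)
Xc, yc = make_classification(n_samples=1000, n_features=20, random_state=42)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(Xc, yc, test_size=0.2, random_state=42)

# Train CatBoost Classifier
cbc = CatBoostClassifier(iterations=100, verbose=0, random_seed=42)
cbc.fit(Xc_train, yc_train)

# Predict and compute F1-score
y_pred_cbc = cbc.predict(Xc_test)
print("F1-Score:", f1_score(yc_test, y_pred_cbc))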
```
**Expected Output:**
```
F1-Score: ~0.85 - 0.90
```

---

## 20. **Train an XGBoost Regressor and evaluate using Mean Squared Error (MSE)**
```python
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# Train XGBoost Regressor on the regression split from question 15
xgb_reg = XGBRegressor(n_estimators=100, random_state=42)
xgb_reg.fit(X_train, y_train)

# Predict and evaluate MSE
y_pred_xgb_reg = xgb_reg.predict(X_test)
print("Mean Squared Error (MSE):", mean_squared_error(y_test, y_pred_xgb_reg))
```
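**Expected Output (the exact value depends on the library version):**
```
Mean Squared Error (MSE): <value>
```
MSE is reported in squared units of `y`. Because the `make_regression` targets span hundreds of units, the value can be far larger than the noise variance.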

---
