<a href="https://colab.research.google.com/github/TusharGwal/Machine-Learning/blob/main/Model_Selection_and_Boosting/xg_boost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# XGBoost

Let’s break down **XGBoost** — one of the most powerful and widely used algorithms in machine learning.

---

## 🧠 What is XGBoost?

> **XGBoost** stands for **Extreme Gradient Boosting**.
> It’s an optimized, scalable implementation of **Gradient Boosted Decision Trees**.

✅ It's known for:

* High **accuracy**
* Great **speed**
* Built-in **regularization** to reduce overfitting

---

## 🏗️ How XGBoost Works (in simple steps):

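1. **Start with an initial weak model** (for example, predicting the mean of the target).
2. At each step, **build a new decision tree** that fits the errors (residuals, or more generally the gradients of the loss) of the current ensemble.
3. **Add the new tree to the ensemble**, scaled by the learning rate, so that each tree corrects the mistakes left by the previous ones.
4. Final prediction = initial prediction + sum of all trees’ scaled outputs.

This is **boosting**: a sequential method that mainly reduces bias by correcting earlier errors.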
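
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration of plain gradient boosting with squared error and one-split trees ("stumps"), not XGBoost's actual implementation, which also uses second-order gradients and regularization. The function names (`fit_stump`, `boost`, `predict`) and the synthetic sine data are assumptions made for this example.

```python
import numpy as np

def fit_stump(x, residuals):
    """Fit a depth-1 regression tree: one threshold, one value per side."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is never empty
        left = x <= threshold
        left_value = residuals[left].mean()
        right_value = residuals[~left].mean()
        fitted = np.where(left, left_value, right_value)
        error = np.sum((residuals - fitted) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    base = y.mean()                                   # step 1: initial weak model
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - prediction                    # step 2: errors of the current model
        stump = fit_stump(x, residuals)               # new tree fits those errors
        prediction += learning_rate * predict_stump(stump, x)  # step 3: add scaled tree
        stumps.append(stump)
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    # step 4: initial prediction plus the sum of all scaled tree outputs
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

# Synthetic 1-D regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

model = boost(x, y)
base, _, stumps = model
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - predict(model, x)) ** 2)
print(f"MSE with mean only: {mse_start:.3f}, after boosting: {mse_end:.3f}")
```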

---

## 🧮 Why It’s Better Than Basic Gradient Boosting:

| Feature                  | XGBoost Enhancements                                                    |
| ------------------------ | ----------------------------------------------------------------------- |
| **Speed**                | Parallelized split finding within each tree (trees are still sequential) |
| **Accuracy**             | L2 (`lambda`) and L1 (`alpha`) regularization on leaf weights           |
| **Flexibility**          | Supports regression, classification, ranking                            |
| **Handles Missing Data** | Yes: learns a default split direction for missing values (NaN)          |
| **Early Stopping**       | Stops training when a validation metric stops improving                 |
| **Tree Pruning**         | Prunes splits whose loss reduction is too small (`gamma`)               |

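To make the regularization and pruning entries concrete, here is a small sketch of the leaf-weight and split-gain formulas described in the XGBoost documentation, where `G` and `H` are the sums of first- and second-order gradients in a node. The function names are hypothetical, the L1 term (`alpha`) is left out for brevity, and the numbers are made up for illustration.

```python
def leaf_weight(grad_sum, hess_sum, reg_lambda=1.0):
    """Optimal leaf value with L2 regularization: -G / (H + lambda)."""
    return -grad_sum / (hess_sum + reg_lambda)

def leaf_score(grad_sum, hess_sum, reg_lambda=1.0):
    """Quality of a leaf: G^2 / (H + lambda). Larger lambda shrinks it."""
    return grad_sum ** 2 / (hess_sum + reg_lambda)

def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from a split, minus the per-split penalty gamma."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda)
    children = (leaf_score(g_left, h_left, reg_lambda)
                + leaf_score(g_right, h_right, reg_lambda))
    return 0.5 * (children - parent) - gamma

# Squared-error example: gradient = prediction - target, hessian = 1 per row.
# Left child: 4 rows with gradient sum -4; right child: 3 rows with gradient sum +3.
gain = split_gain(-4.0, 4.0, 3.0, 3.0, reg_lambda=1.0, gamma=0.0)
gain_penalized = split_gain(-4.0, 4.0, 3.0, 3.0, reg_lambda=1.0, gamma=3.0)

print(f"gain without gamma: {gain:.4f}")          # 2.6625
print(f"gain with gamma=3:  {gain_penalized:.4f}")  # negative, so the split is pruned
print(f"left leaf weight:   {leaf_weight(-4.0, 4.0):.2f}")
```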
---

## ✅ Real-World Use Cases

| Domain       | Example Use Case                      |
| ------------ | ------------------------------------- |
| Finance      | Credit scoring, fraud detection       |
| Healthcare   | Disease risk prediction               |
| Retail       | Customer churn, demand forecasting    |
| Competitions | Used in many Kaggle-winning solutions |

---

## 🔧 Code Example (Classification)

```python
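from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Example data (an assumption for illustration; substitute your own X and y)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train with default hyperparameters
model = XGBClassifier()
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))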
```

---

## ⚙️ Key Hyperparameters

| Parameter       | Description                                                                 |
| --------------- | --------------------------------------------------------------------------- |
| `n_estimators`  | Number of boosting rounds (trees)                                           |
| `learning_rate` | Shrinkage applied to each tree's contribution (smaller values need more trees) |
| `max_depth`     | Maximum depth of each tree                                                  |
| `subsample`     | Fraction of training rows sampled for each tree                             |
| `gamma`         | Minimum loss reduction required to make a split                             |

---

## 🧠 Summary

| Feature                   | Description                              |
| ------------------------- | ---------------------------------------- |
| Type                      | Ensemble (Boosted Trees)                 |
| Strength                  | Fast, accurate, handles missing data     |
| Use Case                  | Structured/tabular data problems         |
| Compared to Random Forest | Usually more accurate but slower to tune |

---


## Importing the libraries

In [25]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [26]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

In [27]:
from sklearn.preprocessing import LabelEncoder
# XGBClassifier expects integer class labels 0, 1, ..., n_classes - 1
le = LabelEncoder()
y = le.fit_transform(y)


## Splitting the dataset into the Training set and Test set

In [28]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Training XGBoost on the Training set

In [29]:
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, y_train)

## Making the Confusion Matrix

In [30]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[85  2]
 [ 1 49]]


0.9781021897810219

## Applying k-Fold Cross Validation

In [31]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 96.71 %
Standard Deviation: 2.28 %
