<a href="https://colab.research.google.com/github/ReyhaneTaj/ML_Algorithms/blob/main/XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# XGBoost (eXtreme Gradient Boosting)

XGBoost is a powerful and efficient implementation of the gradient boosting algorithm, widely used for regression, classification, and ranking problems. It is known for its speed, performance, and flexibility.

## Key Features of XGBoost

### 1. Gradient Boosting Framework
XGBoost is built on the gradient boosting framework, where models are trained sequentially to correct errors made by previous models. Each model in the sequence improves the overall performance by learning from the residuals (errors) of the preceding models.

### 2. Regularization
XGBoost adds L1 (Lasso) and L2 (Ridge) penalties on the leaf weights to its training objective, which discourages overly complex trees and improves the model's generalization to unseen data. The penalties are controlled by the `alpha` and `lambda` parameters (`reg_alpha` and `reg_lambda` in the scikit-learn API).
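For reference, the regularized objective can be written as below. The notation follows the XGBoost documentation and is introduced here rather than taken from this notebook: $l$ is the loss, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, and $w$ is its vector of leaf weights.

$$
\mathcal{L} = \sum_{i} l\left(y_i, \hat{y}_i\right) + \sum_{k} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert_2^2 + \alpha \lVert w \rVert_1
$$

Here $\lambda$ and $\alpha$ are the L2 and L1 strengths, and $\gamma$ is the per-leaf penalty that drives pruning.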

### 3. Parallel Processing
XGBoost parallelizes training across CPU threads, which can significantly reduce training time. Boosting itself remains sequential, since each tree depends on the trees before it; the parallelism happens inside the construction of each tree, where candidate splits are evaluated across features concurrently.

### 4. Handling Missing Values
XGBoost handles missing values natively: at each split it learns a default direction for rows whose feature value is missing, so the model can be trained and used for prediction without a separate imputation step. This is particularly useful in real-world scenarios where datasets are often incomplete.

### 5. Tree Pruning
XGBoost limits tree complexity with two controls: `max_depth` caps how deep a tree can grow, and `gamma` (also called `min_split_loss`) sets the minimum loss reduction a split must achieve to be kept. Splits that fall short of this threshold are pruned, which reduces overfitting and keeps trees small.

### 6. Cross-Validation and Early Stopping
XGBoost includes built-in cross-validation and early stopping features. Cross-validation helps in evaluating model performance, while early stopping prevents overfitting by halting the training process when the model's performance stops improving.

### 7. Scalability
XGBoost is highly scalable and can be distributed across a cluster of machines, making it suitable for training on large datasets.

## Applications of XGBoost
- **Classification Tasks:** Predicting binary or multi-class outcomes, such as spam detection or customer churn prediction.
- **Regression Tasks:** Estimating continuous values, like predicting house prices or stock market trends.
- **Ranking Problems:** Used in recommendation systems and search engines to rank items or pages.

## Step-by-Step Process of XGBoost

### 1. Data Preprocessing
- **Collect Data:** Start by gathering and organizing the dataset.
- **Handle Missing Values:** XGBoost can internally handle missing values, but preprocessing might still be required.
- **Feature Engineering:** Create and modify features to improve model performance.
- **Split Data:** Divide the data into training and testing (or validation) sets.

### 2. Model Initialization
- **Define Objective Function:** Choose an objective function (e.g., `binary:logistic` for classification or `reg:squarederror` for regression).
- **Set Hyperparameters:** Configure model parameters like `n_estimators`, `max_depth`, `learning_rate`, `subsample`, `colsample_bytree`, `lambda`, and `alpha` (the last two are named `reg_lambda` and `reg_alpha` in the scikit-learn API).

### 3. Boosting Process
- **Initialize Base Model:** Start with an initial prediction (e.g., mean value for regression).
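- **Compute Pseudo-Residuals:** Calculate the gradient of the loss with respect to the current predictions (XGBoost also uses the second derivative, the Hessian). For squared-error loss, the negative gradient is simply the residual between observed and predicted values.
- **Fit a New Tree on Residuals:** Train a tree to predict these residuals.
- **Update Model:** Add the new tree's predictions, scaled by the learning rate, to the previous predictions.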
- **Repeat:** Continue this process for the specified number of boosting rounds or until early stopping criteria are met.
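The loop above can be sketched from scratch for squared-error loss. This is a simplified first-order illustration built on scikit-learn's `DecisionTreeRegressor`, not XGBoost's actual implementation, which additionally uses second-order gradients and regularization. The data and settings are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=300)

learning_rate = 0.1
n_rounds = 50

# Initialize with a constant prediction: the mean of the targets.
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_rounds):
    # For squared-error loss the negative gradient is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add the new tree's output, scaled by the learning rate.
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Each round fits a small tree to what the current ensemble still gets wrong, so the training error shrinks as rounds accumulate.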

### 4. Regularization and Pruning
- **Apply Regularization:** Use L1 or L2 regularization to prevent overfitting.
- **Tree Pruning:** Control splits using parameters like `max_depth` and `gamma` to avoid unnecessary complexity.

### 5. Prediction
- **Make Predictions:** Use the trained model to predict outcomes on the test set.
- **Thresholding:** For classification tasks, apply a threshold to convert probabilities into class labels.

### 6. Model Evaluation
- **Evaluate Performance:** Assess the model using metrics like Accuracy, Precision, Recall, F1-Score, Mean Squared Error (MSE), or Area Under the Curve (AUC).
- **Cross-Validation:** Perform cross-validation to obtain a robust estimate of model performance.

### 7. Tuning and Optimization
- **Hyperparameter Tuning:** Optimize hyperparameters using techniques like grid search or random search.
- **Early Stopping:** Monitor validation performance and stop training when performance plateaus.

### 8. Deployment
- **Final Model Training:** Retrain the model on the entire dataset with the best hyperparameters.
- **Deploy the Model:** Implement the model in a production environment for making real-time predictions.

### 9. Monitoring and Maintenance
- **Monitor Model Performance:** Continuously track the model's performance in production to detect any degradation.
- **Retraining:** Regularly retrain the model as new data becomes available to ensure it remains accurate and reliable.

In this example, we'll use XGBoost to predict house prices using the California Housing dataset. This is a regression problem where the target variable is the median house value in different California districts.

```python
# XGBoost for California Housing Price Prediction

# Step 1: Import Libraries
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 2: Load and Preprocess the Data

# Load the California housing dataset
# (target: median house value in units of $100,000)
data = fetch_california_housing()
X, y = data.data, data.target

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features, fitting the scaler on the training split only so
# that no test-set statistics leak into training. Tree-based models are
# insensitive to feature scaling, so this step is optional for XGBoost.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 3: Train the XGBoost Model

# Initialize the XGBoost regressor
model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=1000, max_depth=4, learning_rate=0.1)

# Train the model
model.fit(X_train, y_train)

# Step 4: Make Predictions and Evaluate the Model

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f'RMSE: {rmse:.2f}')
```

```
RMSE: 0.45
```


## Model Evaluation

An RMSE (Root Mean Squared Error) of **0.45** means that the predictions made by the XGBoost model typically deviate from the actual house prices by about 0.45 units of the target variable. Because RMSE squares the errors before averaging, large errors count more heavily than they would in a plain average of absolute errors.

Given that the California Housing dataset's target variable represents the median house value in units of $100,000, this RMSE corresponds to a typical prediction error of about **$45,000**.

### Interpretation

- **Low RMSE**: A lower RMSE value indicates a better fit of the model to the data, meaning the predictions are closer to the actual values.
- **High RMSE**: Conversely, a higher RMSE suggests that the model's predictions are less accurate.

In this context, an RMSE of 0.45 is a reasonable result, but whether it is "good" depends on the specific use case and acceptable error margin in predicting house prices.

### Improving the Model

If you want to reduce the RMSE further, consider the following:

- **Hyperparameter Tuning**: Adjust hyperparameters such as `max_depth`, `learning_rate`, `n_estimators`, `subsample`, etc.
- **Feature Engineering**: Add, remove, or transform features to better capture the underlying patterns.
- **Cross-Validation**: Use cross-validation to ensure the model generalizes well to unseen data.
- **Regularization**: Experiment with the regularization parameters (`alpha`, `lambda`) to prevent overfitting.

The goal is to balance model complexity and performance to achieve the best possible prediction accuracy.
