<h1 align=center> XGBoost Classification In Depth </h1>

![xgboostc.png](attachment:xgboostc.png)

- XGBoost is short for Extreme Gradient Boosting
- Supervised learning algorithm
- Used for regression and classification problems
- Does not require feature scaling
- Largely insensitive to outliers in the input features
- Handles missing values by default
- Decision-tree-based ensemble ML algorithm
- Sequential ensemble learning (each new tree corrects the errors of the trees before it)

### How It Works:

![Flow-chart-of-XGBoost.png](attachment:Flow-chart-of-XGBoost.png)

Here's a more detailed look at how XGBoost works:

1. **Initial Prediction:** XGBoost starts by making a simple prediction on the training data, often using the average of the target variable.
2. **Error Calculation:** It then calculates the residuals, which are the differences between the actual values and the predicted values in the training data (observed minus predicted). Essentially, these residuals represent the errors in the initial prediction.
3. **Building the First Decision Tree:** XGBoost builds the first decision tree in the ensemble. This tree focuses on learning these residuals, aiming to minimize the overall error. To do this, the algorithm finds the best split points in the features that will reduce the errors the most.
4. **Subsequent Trees and Error Correction:** Here's where the magic of gradient boosting happens. XGBoost doesn't discard the previous tree. Instead, it uses the residuals again, but this time for the predictions made by the entire ensemble so far (including the first tree). The new tree specifically targets these remaining errors, further improving the model's accuracy.
5. **Minimizing Loss Function:** Throughout the process, XGBoost optimizes a loss function. This function mathematically measures how well the model's predictions match the actual values. By minimizing the loss function, XGBoost ensures the ensemble is on the right track to make accurate predictions.
6. **Regularization for Complexity Control:** XGBoost incorporates L1 and L2 regularization penalties in the loss function. These penalize models that are too complex, helping to prevent overfitting. Imagine a decision tree with too many branches; it might memorize the training data too well but fail to generalize to new, unseen data. Regularization helps avoid such overly complex models.
7. **Stopping Criteria:** XGBoost adds trees until a stopping criterion is met. This could be a maximum number of trees, a minimum improvement in the loss function, or reaching a certain level of accuracy.
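
The loop described in steps 1 to 7 can be sketched in plain Python. This is only a minimal illustration, assuming squared-error loss and one-split trees (stumps); it leaves out regularization and the second-order terms that XGBoost actually uses. The names `fit_stump` and `boost` and the toy data are hypothetical.

```python
# Minimal gradient-boosting sketch (squared error, one-split "stumps").
# Illustrative only: no regularization, not XGBoost's real implementation.

def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals."""
    xs = sorted(set(x))
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate threshold: midpoint of neighbours
        left = [r for xi, r in zip(x, residuals) if xi < t]
        right = [r for xi, r in zip(x, residuals) if xi >= t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda xi: lv if xi < t else rv


def boost(x, y, n_trees=50, learning_rate=0.3):
    base = sum(y) / len(y)                       # step 1: initial prediction
    preds = [base] * len(y)
    trees = []
    for _ in range(n_trees):                     # step 7: stop after n_trees
        residuals = [yi - pi for yi, pi in zip(y, preds)]  # step 2: errors
        tree = fit_stump(x, residuals)           # step 3: fit a tree to them
        trees.append(tree)
        preds = [pi + learning_rate * tree(xi)   # step 4: add scaled correction
                 for xi, pi in zip(x, preds)]
    return base, trees, preds


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
base, trees, preds = boost(x, y)
mse_start = sum((yi - base) ** 2 for yi in y) / len(y)
mse_end = sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)
print(f"MSE before boosting: {mse_start:.3f}, after: {mse_end:.3f}")
```

Each round fits a new tree to whatever error the ensemble still makes, so the training error shrinks as trees are added.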

### XGBoost Classification Practical Example

- Below is our data:

| Drug Dosage (mg) | Drug Effectiveness |
| --- | --- |
| 2 | 0 |
| 8 | 1 |
| 12 | 1 |
| 18 | 0 |

**Step1: Make Initial Prediction**

- The prediction can be anything, but by default, it is 0.5, regardless of whether you are using XGBoost for regression or classification

**Step2: We Calculate The Residuals (Observed-Predicted)**

- For this example Predicted Value = 0.5, so residuals are listed below

| Drug Dosage (mg) | Drug Effectiveness | Residuals |
| --- | --- | --- |
| 2 | 0 | -0.5 |
| 8 | 1 | 0.5 |
| 12 | 1 | 0.5 |
| 18 | 0 | -0.5 |
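
As a quick check, the residuals in the table can be reproduced with a few lines of Python. The variable names are chosen here for illustration only.

```python
# Toy data from the table above.
dosage = [2, 8, 12, 18]
effectiveness = [0, 1, 1, 0]      # observed labels

initial_prediction = 0.5          # same starting probability for every row

# Residual = observed - predicted
residuals = [obs - initial_prediction for obs in effectiveness]
print(residuals)  # [-0.5, 0.5, 0.5, -0.5]
```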

**Step3: Build The Tree In Three Sub-Steps**

1. Part(a): we try different threshold values to construct the tree
2. Part(b): we calculate the similarity score for each candidate split
3. Part(c): we compute the gain of each candidate split and select the threshold with the largest gain

**First(1) The First Branch In The Tree:**

**Part(a).1: Build The XGBoost Tree**

- First, we select the threshold Dosage<15, which gives the tree below
- `Note:` We got Dosage<15 by taking the average of the last two dosages ((12+18)/2 = 15)

![xgc1.png](attachment:xgc1.png)

**Part(b).1: Calculate Similarity Score**

- For the classification problem, we use the formula below to compute the similarity score

$$
Similarity = \frac {(\sum Residuals)^2}{\sum[Previous\;Probability * (1-Previous\;Probability)]+\lambda}
$$

- The residuals are summed first and the sum is then squared
- lambda is the regularization parameter; for this example we set lambda=0

$$
Similarity\; Score\; For\; Dosage<15:\\ root\_S = \frac {(-0.5+0.5+0.5-0.5)^2}{(0.5*(1-0.5))+(0.5*(1-0.5))+(0.5*(1-0.5))+(0.5*(1-0.5))+0} = 0\\ left\_S = \frac {(-0.5+0.5+0.5)^2}{(0.5*(1-0.5))+(0.5*(1-0.5))+(0.5*(1-0.5))+0} = 0.33 \\ right\_S = \frac {(-0.5)^2}{(0.5*(1-0.5))+0} = 1
$$

![xgc2.png](attachment:xgc2.png)
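
The similarity score is easy to check numerically. Below is a small sketch; the helper name `similarity` is hypothetical and simply evaluates the classification formula above, with the left leaf holding dosages 2, 8, and 12 and the right leaf holding dosage 18.

```python
def similarity(residuals, prev_probs, lam=0.0):
    """(sum of residuals)^2 / (sum of p * (1 - p) + lambda)."""
    numerator = sum(residuals) ** 2
    denominator = sum(p * (1 - p) for p in prev_probs) + lam
    return numerator / denominator

# Split Dosage < 15: rows 2, 8, 12 go left, row 18 goes right.
root_s = similarity([-0.5, 0.5, 0.5, -0.5], [0.5] * 4)
left_s = similarity([-0.5, 0.5, 0.5], [0.5] * 3)
right_s = similarity([-0.5], [0.5])
print(round(root_s, 2), round(left_s, 2), round(right_s, 2))  # 0.0 0.33 1.0
```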

**Part(c).1: Calculate Gain (left_s + right_s  -  root_s)**

- We selected Dosage<15 for the first branch in the tree, so here we compute its Gain
- Gain for Dosage<15 = 0.33 + 1 - 0 = 1.33
- To choose the threshold, we repeat part(a), part(b), and part(c) for the other candidate thresholds and select the one with the largest Gain
- `Note:` We also tried Dosage<5 and Dosage<10 as thresholds for the first branch. Dosage<10 has a Gain of 0, and Dosage<5 has a Gain of 1.33, the same as Dosage<15 because the residuals are symmetric. No threshold beats 1.33, so we use Dosage<15 for the root. A larger Gain means the split is better at separating the residuals into clusters of similar values.
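
The threshold search can be sketched as a loop over the midpoints between adjacent dosages. This is an illustrative helper, not XGBoost's actual split finder; the name `split_gains` is hypothetical, and the code assumes all dosage values are distinct.

```python
def similarity(residuals, prev_probs, lam=0.0):
    return sum(residuals) ** 2 / (sum(p * (1 - p) for p in prev_probs) + lam)

def split_gains(dosage, residuals, prev_probs, lam=0.0):
    """Return {threshold: gain} for every midpoint between adjacent dosages.

    Assumes distinct dosage values, so both sides of each split are non-empty.
    """
    rows = sorted(zip(dosage, residuals, prev_probs))
    root = similarity([r for _, r, _ in rows], [p for _, _, p in rows], lam)
    gains = {}
    for (d1, _, _), (d2, _, _) in zip(rows, rows[1:]):
        t = (d1 + d2) / 2
        left = [(r, p) for d, r, p in rows if d < t]
        right = [(r, p) for d, r, p in rows if d >= t]
        left_s = similarity([r for r, _ in left], [p for _, p in left], lam)
        right_s = similarity([r for r, _ in right], [p for _, p in right], lam)
        gains[t] = left_s + right_s - root   # Gain = left + right - root
    return gains

gains = split_gains([2, 8, 12, 18], [-0.5, 0.5, 0.5, -0.5], [0.5] * 4)
for t, g in gains.items():
    print(f"Dosage < {t:g}: gain = {g:.2f}")
# Dosage < 5: gain = 1.33
# Dosage < 10: gain = 0.00
# Dosage < 15: gain = 1.33
```

The same function can be called on the rows of any node to evaluate its candidate splits.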

**Second(2) We check the leaves on the left and on the right to see if we can split them further:**

- The right leaf has only one residual, so we cannot split it further
- However, we can split the 3 residuals in the left leaf into two leaves
- So, we repeat the same steps as above

**Part(a).2: Build Tree**

- First, we take the average of the first two observations, 2 and 8, which gives the threshold 5
- Then, we take the average of the last two observations in this leaf, 8 and 12, which gives the threshold 10
- We compute the Gain for both; the one with the largest Gain will be selected

`Note:` The Gain for Dosage<10 was 0.66, while Dosage<5 had the largest Gain (2.66). For simplicity, we only show the steps for Dosage<5.

- We split the left leaf using the threshold Dosage<5
- Below is our tree

![xgc3.png](attachment:xgc3.png)

**Part(b).2: Calculate Similarity Score**

$$
Similarity\; Score\; For\; Dosage<5 (left branch):\\root\_S = \frac {(-0.5+ 0.5+0.5)^2}{(0.5*(1-0.5))+(0.5*(1-0.5))+ (0.5*(1-0.5))+ 0} = 0.33\\ left\_S = \frac {(-0.5)^2}{(0.5*(1-0.5))+0} = 1 \\ right\_S = \frac {(0.5+0.5)^2}{(0.5*(1-0.5))+(0.5*(1-0.5))+0}=2
$$

![xgc4.png](attachment:xgc4.png)

**Part(c).2: Calculate Gain**

- Gain for threshold Dosage<5 = 1 + 2 - 0.33 = 2.66
- Of the thresholds we tried for this branch, Dosage<5 has the largest Gain, so we select it as the threshold for this branch

`Note`: For this example, we limited the tree depth to two levels for simplicity. So, we will not split the leaves any further.

### **Cover**

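- When using XGBoost for classification, the minimum number of residuals allowed in a leaf is controlled by a quantity called cover
- Cover is the denominator of the similarity score, minus lambda
- The default value for the minimum cover is 1 (the `min_child_weight` parameter)
- With the default, none of the leaves in this example would be allowed, because each residual contributes only 0.5 * (1 - 0.5) = 0.25 to the cover. So, for this example, we set the minimum cover to 0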

$$
Cover = \sum[Previous\;Probability * (1-Previous\;Probability)]
$$
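
To see why the minimum cover matters here, the sketch below computes the cover of each leaf in our tree and checks it against a minimum of 1 and of 0. The helper `cover` and the leaf labels are hypothetical names used only for this illustration.

```python
def cover(prev_probs):
    """Sum of p * (1 - p) over the residuals in a leaf."""
    return sum(p * (1 - p) for p in prev_probs)

# Leaves of the example tree; every previous probability is 0.5.
leaves = {
    "Dosage<5": [0.5],              # one residual
    "5<=Dosage<15": [0.5, 0.5],     # two residuals
    "Dosage>=15": [0.5],            # one residual
}

allowed = {}
for min_cover in (1, 0):   # 1 is the default, 0 is what this example uses
    allowed[min_cover] = [name for name, probs in leaves.items()
                          if cover(probs) >= min_cover]
    print(f"minimum cover {min_cover}: leaves allowed = {allowed[min_cover]}")
```

With a minimum cover of 1 no leaf qualifies (the covers are 0.25, 0.5, and 0.25), while a minimum cover of 0 allows all three.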

**Step4: Prune The Tree** 

- In XGBoost, pruning is a technique used to reduce the complexity of the model by eliminating parts of the decision trees that do not contribute significantly to the model’s performance. Pruning helps prevent overfitting and enhances the model’s generalization to unseen data
- We prune an XGBoost tree based on its Gain values
- We start by picking a number, for example, 2. XGBoost calls this number gamma
- We then calculate the difference between the Gain of the lowest branch in the tree and gamma (Gain - gamma)
- If the difference is negative, we remove the branch; otherwise, we keep it

$$
(Gain-\gamma)\\ 2.66-2=0.66
$$

- We do not prune the branch because the result is a positive number

`Note:` If we instead set gamma=3, the whole tree would be removed: 2.66 - 3 is negative, so the lower branch is pruned, and 1.33 - 3 is also negative, so the root is pruned as well, leaving only the initial prediction. We would then construct the tree again with different settings, for example a different lambda value (in the above example lambda was 0)
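
The pruning rule can be written as a one-line check. The sketch below applies it bottom-up to the two splits of our tree for gamma=2 and gamma=3; `keep_branch` is a hypothetical helper, and the rule is the simplified one described in this section.

```python
def keep_branch(gain, gamma):
    """Prune when Gain - gamma is negative; keep the branch otherwise."""
    return gain - gamma >= 0

lower_gain = 1 + 2 - 1 / 3   # Gain of the Dosage<5 split (about 2.66)
root_gain = 1 / 3 + 1 - 0    # Gain of the Dosage<15 split (about 1.33)

for gamma in (2, 3):
    if keep_branch(lower_gain, gamma):
        # The lowest branch survives, so everything above it stays too.
        print(f"gamma={gamma}: keep the Dosage<5 branch")
    elif keep_branch(root_gain, gamma):
        print(f"gamma={gamma}: prune Dosage<5, keep the root split")
    else:
        print(f"gamma={gamma}: prune the whole tree")
# gamma=2: keep the Dosage<5 branch
# gamma=3: prune the whole tree
```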

- For now, regardless of lambda and gamma, let's assume that this is the tree we are working with:

![xgc3.png](attachment:xgc3.png)

**Step5: Compute Output**

$$
Output\;Value = \frac {\sum (Residuals)}{\sum[Previous\;Probability * (1-Previous\;Probability)]+\lambda}
$$

$$
Output\_leaf\_left= \frac{-0.5} {0.5 *(1-0.5)+0} = -2\\ Output\_leaf\_mid= \frac{0.5+0.5} {0.5 *(1-0.5)+ 0.5 *(1-0.5) +0} = 2 \\ Output\_leaf\_right= \frac{-0.5} {0.5 *(1-0.5)+0} = -2
$$
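
The three output values can be verified with a short sketch. The helper `leaf_output` is a hypothetical name; it evaluates the output-value formula above with lambda=0.

```python
def leaf_output(residuals, prev_probs, lam=0.0):
    """sum of residuals / (sum of p * (1 - p) + lambda)."""
    return sum(residuals) / (sum(p * (1 - p) for p in prev_probs) + lam)

left = leaf_output([-0.5], [0.5])            # Dosage < 5
mid = leaf_output([0.5, 0.5], [0.5, 0.5])    # 5 <= Dosage < 15
right = leaf_output([-0.5], [0.5])           # Dosage >= 15
print(left, mid, right)  # -2.0 2.0 -2.0
```

Unlike the similarity score, the numerator here is not squared, so the output keeps the sign of the residuals.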

- With these output values, our first tree is complete!



**Step6: New Prediction**

$$
New\_Prediction = init\_Prediction + learning\_rate * First\_Tree
$$

- Just like other boosting methods, XGBoost for classification starts with the initial prediction
- However, just like with regular (unextreme) Gradient Boost for classification, we need to convert this probability (init_prediction) to a log(odds) value

$$
\frac{p}{1-p} = odds\\ log(\frac{p}{1-p}) = log(odds)\\ p=0.5\\ log(\frac{0.5}{1-0.5}) = log(1) = 0 
$$

- The learning rate is eta, and the default value is 0.3
- The new predicted value for the observation Dosage=2:

$$
log(odds) Prediction = 0 + 0.3*(-2) = -0.6
$$

- To convert a log(odds) value into probability, we plug it into the Logistic Function:

$$
Probability = \frac {e^{log(odds)}}{1 + e^{log(odds)}} \\ Probability = \frac {e^{-0.6}}{1 + e^{-0.6}} = 0.35
$$

- So the new predicted probability for Dosage=2 is 0.35, and the new residual is 0 - 0.35 = -0.35
- We repeat the same process for the other observations; the new predictions and residuals are shown in the table below

| Drug Dosage (mg) | Drug Effectiveness | Residuals | New Prediction (Probability) | New Residuals |
| --- | --- | --- | --- | --- |
| 2 | 0 | -0.5 | 0.35 | -0.35 |
| 8 | 1 | 0.5 | 0.65 | 0.35 |
| 12 | 1 | 0.5 | 0.65 | 0.35 |
| 18 | 0 | -0.5 | 0.35 | -0.35 |

- Next, we build another tree on the new residuals and update the predictions. We keep building trees until the residuals are very small or we have reached the maximum number of trees
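
The whole of Step 6 can be reproduced in a few lines: convert the initial probability to log(odds), add the scaled leaf output, and convert back with the logistic function. This is a sketch for the toy data only; the helper and variable names are hypothetical.

```python
import math

def to_log_odds(p):
    return math.log(p / (1 - p))

def to_probability(log_odds):
    return math.exp(log_odds) / (1 + math.exp(log_odds))

eta = 0.3                                              # learning rate
leaf_outputs = {2: -2.0, 8: 2.0, 12: 2.0, 18: -2.0}    # dosage -> output of its leaf
observed = {2: 0, 8: 1, 12: 1, 18: 0}

initial_log_odds = to_log_odds(0.5)                    # log(1) = 0
new_probs = {dose: to_probability(initial_log_odds + eta * out)
             for dose, out in leaf_outputs.items()}

for dose, p in new_probs.items():
    residual = observed[dose] - p                      # observed - predicted
    print(f"Dosage {dose}: probability = {p:.2f}, new residual = {residual:+.2f}")
```

The new residuals (about 0.35 in absolute value) are smaller than the initial ones (0.5), which is the improvement contributed by the first tree.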

### **Pros**

- **High Accuracy:** XGBoost’s ensemble approach of combining many trees often gives higher accuracy than a single model such as one decision tree
- **Scalability:** It’s optimized for handling large datasets and can run efficiently on systems with parallel processing capabilities
- **Flexibility:** XGBoost is a versatile tool that can be used for various tasks, including regression, classification, and ranking problems
- **Interpretability:** While not as easily interpretable as simpler models, XGBoost provides insights into feature importance. This helps understand which factors significantly impact the predictions

### **Cons**

- **Complexity:** Compared to simpler models, XGBoost can be more complex to understand and fine-tune
- **Potential for Overfitting:** While it has built-in regularization, XGBoost can still overfit if not tuned properly. Careful selection of hyperparameters is crucial
- **Memory Usage:** The tree-based structure of XGBoost can consume a significant amount of memory, especially when dealing with large datasets
- **Not Ideal for Every Kind of Data:** On some very high-dimensional or sparse datasets, XGBoost might not perform as well as other algorithms

Resources:

- [StatQuest With Josh Starmer](https://www.youtube.com/watch?v=8b1JEDvenQU&ab_channel=StatQuestwithJoshStarmer)

```python
import xgboost as xgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the XGBClassifier
model = xgb.XGBClassifier()

# Train the model
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
```

```
Accuracy: 100.00%
```
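
The quantities used in the hand calculation correspond to keyword arguments of `XGBClassifier`. The dictionary below is a hedged sketch of that mapping using the values from the walkthrough; the name `walkthrough_params` is hypothetical, and a real model fitted with these settings may not match the hand calculation exactly, since the library uses its own split-finding procedure.

```python
# Walkthrough symbols mapped to XGBClassifier keyword arguments.
walkthrough_params = {
    "n_estimators": 1,        # a single tree, as in the worked example
    "max_depth": 2,           # the example stopped at two levels
    "learning_rate": 0.3,     # eta
    "reg_lambda": 0.0,        # lambda in the similarity and output formulas
    "gamma": 2.0,             # pruning threshold compared against Gain
    "min_child_weight": 0.0,  # minimum cover
    "base_score": 0.5,        # initial predicted probability
}

# Usage (requires xgboost): model = xgb.XGBClassifier(**walkthrough_params)
for name, value in walkthrough_params.items():
    print(f"{name} = {value}")
```

Tuning these values (for example raising `gamma`, `reg_lambda`, or `min_child_weight`) makes the trees more conservative.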
