# Introduction to XGBoost in Python  
  
This is just a notebook taking tutorials and information from multiple websites to familiarise myself with XGBoost machine learning algorithms.  
  
### What is XGBoost?  
 - XGBoost is a machine learning algorithm that belongs to the _ensemble_ learning category  
 - specifically the gradient boosting group  
 - it utilises decision trees as base learners and employs regularisation techniques to enhance model generalisation  
 - boosting, one family of ensemble methods, builds a sequence of models in which each one corrects the mistakes of the models before it  
 - key components: 
   - _decision trees_  
   - _objective functions_  
   - _learning tasks_  
 - features: variables used to predict the target variable  
 - important to know which feature has more predictive power  


-----
   - **decision trees**
     - are supervised learning algorithms
     - can create both classification and regression models  
     - look like flow charts: start at the _root node_, which asks a specific question about the data
     - then leads to branches holding potential answers  
     - branches lead to _decision (internal) nodes_, which ask more questions that lead to more outcomes  
     - this continues until the data reaches the _terminal node_ (leaf)  
     - benefits: lay out the problem and all possible outcomes; analyse the possible consequences of a decision; can predict outcomes for future data  
     - types: classification trees; regression trees
   - **regularization** 
     - reduces overfitting in machine learning models  
     - methods: _lasso_; _ridge_; _elastic net_  
     - _lambda_ - the regularisation strength; a crucial parameter, usually chosen by cross-validation, that balances fitting the training data against keeping the model simple  
     - how does it work:
       - adds a penalty term to the standard loss function  
       - encourages the model to keep parameters small  
       - modifies the loss function (Regularised Loss = Original Loss + lambda × Penalty)
     - _lasso_: penalty is the sum of the absolute values of the parameters  
     - _ridge_: penalty is the sum of the squares of the parameters  
     - why: prevent overfitting; improve model generalisation (through less overfitting); handle multicollinearity; feature selection (lasso); improve robustness to noise; accept a small increase in bias in exchange for lower variance; aid in convergence  
----- 
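
To make the penalty terms from the regularisation notes concrete, here is a minimal sketch in plain Python. The function names (`regularised_loss`, `lasso_penalty`, `ridge_penalty`) and the toy numbers are my own assumptions for illustration, and mean squared error stands in for the "original loss".

```python
def mse(y_true, y_pred):
    """Mean squared error: the 'original loss' in this sketch."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def lasso_penalty(params):
    """L1 penalty: sum of the absolute values of the parameters."""
    return sum(abs(w) for w in params)


def ridge_penalty(params):
    """L2 penalty: sum of the squares of the parameters."""
    return sum(w ** 2 for w in params)


def regularised_loss(y_true, y_pred, params, lam, penalty):
    """Regularised Loss = Original Loss + lambda x Penalty."""
    return mse(y_true, y_pred) + lam * penalty(params)


# Hypothetical toy values, chosen only for illustration.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
params = [0.5, -2.0, 0.0]

lasso_loss = regularised_loss(y_true, y_pred, params, lam=0.1, penalty=lasso_penalty)
ridge_loss = regularised_loss(y_true, y_pred, params, lam=0.1, penalty=ridge_penalty)
print(lasso_loss, ridge_loss)
```

The ridge penalty punishes the single large parameter (-2.0) much harder than the lasso penalty does, which is why the two losses differ even though the fit is identical.
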

- XGBoost builds a predictive model by combining the predictions of multiple individual models in an iterative manner  
- The algorithm works by sequentially adding weak learners to the ensemble  
  - each new learner focusing on correcting the errors made by existing ones  
- It uses a gradient descent optimisation technique to minimise a predefined loss function during training  
- **Bagging**  
  - sample randomly, with replacement, from the training data to obtain a subset for each model  
  - reduces variance by averaging predictions from models trained on different subsets of data  
- **Boosting**  
  - trees built sequentially so each subsequent tree aims to reduce the errors of the previous tree  
  - each tree learns from its predecessors by fitting the residual errors they leave behind  
  - base learners in boosting are weak learners - bias is high, predictive power is low  
  - makes use of trees with fewer splits  
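
A minimal sketch of the bagging idea, using only the standard library. To keep it short I assume each "model" is just the mean of its bootstrap sample; the names `bootstrap_sample` and `bagged_mean` and the data are hypothetical.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data, with replacement."""
    return [rng.choice(data) for _ in data]


def bagged_mean(data, n_models, seed=0):
    """Average the predictions of n_models 'models'.

    Each 'model' here is simply the mean of one bootstrap sample.
    """
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        predictions.append(sum(sample) / len(sample))
    return sum(predictions) / len(predictions), predictions


# Hypothetical toy data.
data = [2.0, 4.0, 4.0, 5.0, 9.0, 12.0]
bagged, individual = bagged_mean(data, n_models=200)
print(bagged, min(individual), max(individual))
```

The individual predictions vary from sample to sample, while their average is steadier; that averaging is the variance reduction bagging relies on. Boosting differs in that the models are built one after another rather than independently.
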



## Gradient Boosting Framework  
https://medium.com/@cristianleo120/the-math-behind-xgboost-3068c78aad9d  
  
1. Initialisation  
 - start with a base model: simple prediction for all instances (e.g. mean for regression, mode for classification)  
 - initial prediction: algorithm will iteratively improve on it  
  
2. Iterative improvement  
 - sequentially add weak learners
 - residuals/error calculation: calculate after each tree is added  

3. Gradient Descent Step  
 - compute negative gradient: the negative gradient of the loss function shows how each prediction should change to reduce the loss  
 - fit new model to gradient  

4. Update model with Learning Rate  
 - apply learning rate: predictions of new tree scaled by parameter known as _learning rate_ (_shrinkage factor_)  
 - update the model  
 - learning rate controls how fast the model learns and helps prevent overfitting  

5. Regularisation  
 - control model complexity: regularise the model
 - e.g. set maximum depth for trees, minimum samples for a leaf, number of trees  

6. Stopping Criteria  
 - determine when to stop: either after a fixed number of trees has been added, or when the improvement drops below a threshold  
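
The six steps above can be sketched as a small from-scratch loop. This is not XGBoost itself, only a plain-Python illustration under my own assumptions: squared-error loss, a single feature, one-split trees (stumps) as weak learners, and hypothetical names such as `fit_stump` and `gradient_boost`.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree on a single feature by least squares."""
    mean_all = sum(targets) / len(targets)
    best = None
    # Candidate thresholds: every distinct x except the largest,
    # so both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x identical: fall back to a constant prediction
        return lambda x: mean_all
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def gradient_boost(xs, ys, n_trees=50, learning_rate=0.1, tol=1e-6):
    # 1. Initialisation: a constant prediction (the mean, for squared error).
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    trees = []
    prev_loss = mse(ys, preds)
    for _ in range(n_trees):  # 2. Iterative improvement
        # 3. For the loss 1/2 * (y - F)^2 the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)  # 5. Regularisation: depth is fixed at 1
        # 4. Update the model, scaled by the learning rate.
        preds = [p + learning_rate * tree(x) for x, p in zip(xs, preds)]
        trees.append(tree)
        # 6. Stopping criterion: improvement below a threshold.
        loss = mse(ys, preds)
        if prev_loss - loss < tol:
            break
        prev_loss = loss

    def predict(x):
        return base + learning_rate * sum(tree(x) for tree in trees)

    return predict, len(trees)


# Hypothetical toy data: a noisy step function.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
predict, n_used = gradient_boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(x) for x in xs])
print(n_used, baseline_mse, boosted_mse)
```
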



## Initialisation: Starting Model  
  
 - this step involves creating a starting model that provides a baseline prediction for all instances in the dataset  
 - regression tasks: initial model often predicts the mean  
 - classification tasks: initial model often predicts the mode or the log odds of the positive class  
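
A small sketch of the two baseline predictions mentioned above. The function names and data are hypothetical, and the log-odds version assumes binary 0/1 labels with both classes present; a given library's own default starting value may differ.

```python
import math


def initial_prediction_regression(ys):
    """Baseline for regression: the mean of the targets."""
    return sum(ys) / len(ys)


def initial_prediction_log_odds(labels):
    """Baseline for binary classification: log odds of the positive class."""
    p = sum(labels) / len(labels)
    return math.log(p / (1 - p))


def sigmoid(z):
    """Convert a log-odds score back to a probability."""
    return 1 / (1 + math.exp(-z))


# Hypothetical toy data.
f0_reg = initial_prediction_regression([3.0, 5.0, 10.0])
f0_clf = initial_prediction_log_odds([1, 1, 1, 0])
print(f0_reg, f0_clf, sigmoid(f0_clf))
```

Passing the log odds through the sigmoid recovers the share of positive labels, so every instance starts with the same predicted probability.
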


## Iterative Improvement  
  
 - the algorithm successively adds weak learners (usually decision trees) to the ensemble  
 - targets the shortcomings of the existing combined model  
 - each new learner focuses on the errors or residuals  
 - **How**  
   - _calculate residuals/error_  
     - regression residuals = differences between observed and predicted values  
     - classification residuals = negative gradient of the loss function with respect to the current model's predictions (pseudo-residuals)  
   - _fit a weak learner to residuals_  
     - the new learner is trained on the residuals  
     - aims to predict residuals for each instance  
   - _compute outcome of the learner_  
     - regression = predicted residual  
     - classification = a value on the scale of the model's raw score (e.g. log odds) that is added to the current model to reduce the loss  
   - _update model_  
     - model is updated by adding the scaled output of the new learner to the existing prediction  
   - _repeat steps_  
     - repeated for specified number of iterations, or until convergence criterion is met  
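
For classification, the "residual" the next learner is fitted to is less obvious than in regression. A sketch, assuming binary 0/1 labels, log loss, and raw scores on the log-odds scale (all names here are hypothetical): the negative gradient works out to `y - p`, and the code checks that against a numerical derivative.

```python
import math


def sigmoid(z):
    return 1 / (1 + math.exp(-z))


def log_loss(y, score):
    """Log loss for one instance, with the prediction given as a raw log-odds score."""
    p = sigmoid(score)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def pseudo_residual(y, score):
    """Negative gradient of the log loss with respect to the raw score: y - p."""
    return y - sigmoid(score)


def numerical_negative_gradient(y, score, eps=1e-6):
    """Central-difference estimate of the negative gradient, used as a check."""
    return -(log_loss(y, score + eps) - log_loss(y, score - eps)) / (2 * eps)


# Hypothetical toy data: every instance starts at a raw score of 0 (p = 0.5).
labels = [1, 0, 1]
scores = [0.0, 0.0, 0.0]
residuals = [pseudo_residual(y, s) for y, s in zip(labels, scores)]
print(residuals)
```

Positive labels get a positive pseudo-residual and negative labels a negative one, so a learner fitted to these values pushes each raw score towards its label.
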

  
## Gradient Descent Step  

 - gradient boosting minimises the loss function by moving in the direction of the steepest descent as defined by the negative gradient  
 - occurs for each stage in the boosting process  
 - _calculate gradient_  
 - _fit a weak learner to the negative gradients_  
 - _determine step size (learning rate)_ - how much the model is adjusted by  
 - _update the model_  
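
Written out, one stage of the process above looks like the following. The notation is my own assumption rather than anything from the linked article: $F_{m-1}$ is the current model, $h_m$ the new weak learner, $\nu$ the learning rate, and $L$ the loss.

$$
g_i = \left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}},
\qquad
h_m \text{ is fitted to } \{(x_i, -g_i)\}_{i=1}^{n},
\qquad
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
$$

For squared error, $L(y, F) = \tfrac{1}{2}(y - F)^2$, the negative gradient is the ordinary residual:

$$
-g_i = y_i - F_{m-1}(x_i)
$$

This is why fitting to residuals and fitting to negative gradients are the same thing in the regression case.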
  

## Update Model with Learning Rate  
  
 - predictions from newly added learner are incorporated into existing model  
 - _apply learning rate_  
 - _combine weak learner's predictions with current model_  
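
To see what the learning rate does to the update, here is an idealised sketch. It assumes, unrealistically, that each new learner predicts the current residual exactly, so the only thing slowing convergence is the learning rate; `rounds_to_converge` and its defaults are hypothetical.

```python
def rounds_to_converge(learning_rate, initial_residual=1.0, tol=0.01, max_rounds=10_000):
    """Count boosting rounds until the residual for one instance falls to tol or below.

    Update rule: new_prediction = prediction + learning_rate * learner_output,
    where the (idealised) learner outputs the current residual exactly.
    """
    residual = initial_residual
    rounds = 0
    while abs(residual) > tol and rounds < max_rounds:
        learner_output = residual  # idealised weak learner
        residual -= learning_rate * learner_output
        rounds += 1
    return rounds


for lr in (1.0, 0.5, 0.1):
    print(lr, rounds_to_converge(lr))
```

Each round shrinks the residual by a factor of `1 - learning_rate`, so smaller rates need more trees to reach the same fit. With real, imperfect learners the smaller steps are what helps guard against overfitting.
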

  
## Regularisation  
 - adds constraints or penalties to the model  
 - controls model complexity  
 - helps make it less sensitive to noise in the training data  
 - _tree constraints_:  
   - max depth - limit depth of each tree; deeper trees can capture more complex patterns but might overfit  
   - min samples per leaf - prevents model from learning rules that are too specific to the training data  
   - max leaves  
 - _shrinkage/learning rate_  
   - the learning rate _v_ (also written ν, or `eta` in XGBoost) scales each tree's contribution; when _v_ is less than 1, each tree contributes less and learning slows  
   - helps generalisation  
 - _subsampling of data (stochastic gradient boosting)_  
   - feature subsampling - using a random subset of features for fitting each tree (as in random forests)  
   - data subsampling - using a random subset of the training data to fit each tree (similar in spirit to bagging)  
 - _penalty on leaf weights_  
   - LASSO (L1) - adds a penalty on the absolute value of the leaf weights; can shrink some weights to exactly 0  
   - RIDGE (L2) - adds a penalty on the square of the leaf weights; discourages large weights but does not typically produce weights of exactly 0  
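
A sketch of how the two penalties act on a single leaf weight. It follows the leaf-weight formula from the XGBoost formulation as I understand it (sum of gradients `G`, sum of second derivatives `H`, L2 strength `reg_lambda`, L1 strength `reg_alpha`); treat it as an assumption rather than a description of the library's exact internals. The helper names and numbers are hypothetical.

```python
def soft_threshold(g, alpha):
    """Shrink g towards zero by alpha; values within [-alpha, alpha] become 0."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0


def leaf_weight(grad_sum, hess_sum, reg_lambda=0.0, reg_alpha=0.0):
    """Regularised leaf weight: -soft_threshold(G, alpha) / (H + lambda)."""
    return -soft_threshold(grad_sum, reg_alpha) / (hess_sum + reg_lambda)


# Hypothetical leaf: 4 instances under squared error, residuals summing to 4,
# so the gradient sum is -4 and the hessian sum is 4.
G, H = -4.0, 4.0
print(leaf_weight(G, H))                  # no penalty
print(leaf_weight(G, H, reg_lambda=4.0))  # ridge: smaller, but not zero
print(leaf_weight(G, H, reg_alpha=1.0))   # lasso: shifted towards zero
print(leaf_weight(G, H, reg_alpha=5.0))   # lasso: strong enough to zero the weight
```

Ridge divides by a larger denominator, so the weight shrinks smoothly; lasso subtracts from the numerator, so a weak enough signal is cut to exactly zero.
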