# N Burning XGBoost FAQs Answered to Use the Library Like a Pro
## Master the nitty-gritty about XGBoost
![](images/unsplash.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://unsplash.com/@haithemfrd_off?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Haithem Ferdi</a>
        on 
        <a href='https://unsplash.com/s/photos/boost?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText'>Unsplash.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

## Setup

```python
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import xgboost as xgb
from matplotlib import rcParams
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.preprocessing import OneHotEncoder

rcParams["font.size"] = 15
```

```python
iris = sns.load_dataset("iris").dropna()
penguins = sns.load_dataset("penguins").dropna()
```

```python
i_input, i_target = iris.drop("species", axis=1), iris[["species"]]
p_input, p_target = penguins.drop("body_mass_g", axis=1), penguins[["body_mass_g"]]
p_input = pd.get_dummies(p_input)
```

```python
X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    i_input, i_target, test_size=0.2, random_state=1121218
)

X_train_p, X_test_p, y_train_p, y_test_p = train_test_split(
    p_input, p_target, test_size=0.2, random_state=1121218
)
```

## Motivation

## 1. Which API should I choose - Scikit-learn or the core learning API?

Many people have already answered this question, but I will briefly state my own answer because most of the other questions depend on it.

XGBoost in Python has two APIs: the Scikit-learn compatible API (estimators follow the familiar `fit/predict` pattern) and the core XGBoost-native API (a global `train` function whose objective can be changed to switch between regression and classification).

The majority of the Python community, including Kagglers and myself, uses the Scikit-learn API.

Using the Sklearn API lets you freely integrate XGBoost estimators into your familiar workflow. The benefits include (but are not limited to) the ability to pass XGBoost estimators into [Sklearn pipelines](https://towardsdatascience.com/how-to-use-sklearn-pipelines-for-ridiculously-neat-code-a61ab66ca90d?source=your_stories_page-------------------------------------), a more efficient cross-validation workflow, and avoiding the hassle of learning a new API.

We will also see some nuances in XGBoost functionality that tip the balance towards the Sklearn API even further.

## 2. How Do I Completely Control the Randomness in XGBoost?

> From here on, references to XGBoost algorithms mainly mean the Sklearn-compatible XGBRegressor and XGBClassifier (or similar) estimators.

The estimators have the `random_state` parameter similar to Sklearn estimators (the alternative `seed` parameter has been deprecated but still works). However, running XGBoost with default parameters will yield identical results even with different seeds. 

The reason for this behavior is that XGBoost introduces randomness only when `subsample` or one of the parameters with the `colsample_*` prefix is set below its default value of 1. As the names suggest, these parameters rely on [random sampling](https://towardsdatascience.com/why-bootstrap-sampling-is-the-badass-tool-of-probabilistic-thinking-5d8c7343fb67?source=your_stories_page-------------------------------------) to combat overfitting.

Therefore, `random_state` only matters when you use these hyperparameters; set it so that the same seed gives the same results across runs.

When using XGBoost together with other Sklearn transformers or estimators that have their own `random_state`, you should pass a seed both to XGBoost and to the other classes for reproducibility.

## 3. What are objectives in XGBoost and how to specify them for different tasks?

Both regression and classification come in several variants. They differ in the objective (loss) function being minimized and in the target distributions they can work with.

For example, regression can be performed with loss functions such as RMSE (Root Mean Squared Error), RMSLE (Root Mean Squared Log Error), or Huber error. Sklearn spreads such variants across different regressors, but in XGBoost they are all packed into the XGBRegressor estimator.

You can switch between the loss functions and supported distributions with the `objective` parameter. It accepts special code strings provided by XGBoost. The most common ones are:

- `reg:squarederror`
- `reg:squaredlogerror`
- `reg:gamma`
- `reg:tweedie`

Similarly, classification objectives change based on their underlying loss function. These objectives start with either the `binary:` or the `multi:` prefix, depending on the number of classes in the target.

There are many other objective types; I will leave it to you to explore the rest and find the details in the [documentation](https://xgboost.readthedocs.io/en/latest/parameter.html#learning-task-parameters).

> Note that specifying the correct objective gets rid of that unbelievably annoying warning you get when fitting XGB classifiers.