## XGBoost

XGBoost is a gradient boosting library that has been a dominant choice in applied machine learning and in Kaggle competitions on structured (tabular) data. Its implementation of gradient-boosted decision trees is designed for speed and performance.
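
To give a feel for what "gradient boosted decision trees" means, here is a minimal sketch of the basic boosting loop for a regression problem: each small tree is fitted to the errors left by the trees before it. This is a simplified illustration built on scikit-learn's `DecisionTreeRegressor` and made-up data, not XGBoost's actual algorithm, which adds regularization, second-order gradient information and many performance optimizations.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up 1-D regression data, used purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

learning_rate = 0.3
prediction = np.full_like(y, y.mean())          # start from a constant guess
initial_mse = np.mean((y - prediction) ** 2)

for _ in range(20):
    residuals = y - prediction                  # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)   # add a damped correction

final_mse = np.mean((y - prediction) ** 2)
print("MSE before boosting:", initial_mse)
print("MSE after boosting: ", final_mse)
```

The training error shrinks as trees are added, which is the core idea that XGBoost builds on.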

**To install XGBoost on your system**

Type the command below in your command prompt, or in the Anaconda Prompt if you are using an Anaconda environment:
    
    pip install xgboost
    
**Note:** If you don't have pip installed, refer to [this](https://www.liquidweb.com/kb/install-pip-windows/).
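
To confirm that the installation worked, you can check from Python whether the package can be found. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is made up for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if `package_name` can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# Prints True once `pip install xgboost` has succeeded in the active environment.
print("xgboost installed:", is_installed("xgboost"))
```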
    
Now that XGBoost is installed, let's look at how we can implement it.

For this, we will use the well-known Iris dataset.

**The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/index.php).
It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.**

![](extras/iris.jpg)

The three Iris species are:

- Iris-Setosa
- Iris-Versicolor
- Iris-Virginica


The dataset we will use is taken from Kaggle. You can download it [here](https://www.kaggle.com/uciml/iris).

### Goal:

To predict which of the three species an iris flower belongs to, based on four features:

   - Sepal length
   - Sepal width
   - Petal length
   - Petal width

#### Let's import the basic libraries first

In [13]:
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

#### Data Collection

In [9]:
# Load the CSV file (expected in the working directory) into a DataFrame

data = pd.read_csv('Iris.csv')

#### Let's look at the data

In [10]:
data.head()

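|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |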


As you can see, there are four features (sepal length, sepal width, petal length and petal width) that can be used to predict the species of the iris flower.

Our target variable is the **'Species'** column.
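
One thing worth noticing in the preview is the `Id` column: it is a row number rather than a measurement. If the rows of the file happen to be ordered by species, a model could learn the species from `Id` alone, so it is usually safer to select the four measurement columns explicitly. The sketch below shows that selection on the five preview rows, typed in by hand; the names `preview`, `feature_columns`, `features` and `target` are illustrative and not part of the notebook.

```python
import pandas as pd

# The five rows shown by data.head(), re-typed by hand for illustration.
preview = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0],
    "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.6],
    "PetalLengthCm": [1.4, 1.4, 1.3, 1.5, 1.4],
    "PetalWidthCm": [0.2, 0.2, 0.2, 0.2, 0.2],
    "Species": ["Iris-setosa"] * 5,
})

# Keep only the four measurements; Id is deliberately left out.
feature_columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
features = preview[feature_columns]
target = preview["Species"]

print(features.columns.tolist())
```

Applied to the full dataset, the same selection could be used for the model inputs. Scores obtained that way may differ from the ones reported in this notebook.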

##### Let's implement a simple XGBoost Classifier model and try to predict the flower species.

First we need to split the dataset into a training set and a test set. We will use the train_test_split function from scikit-learn.

In [21]:
from sklearn.model_selection import train_test_split

inputs = data.drop('Species', axis=1)  # Features (independent variables)
targets = data['Species']              # Target variable

x_train, x_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.2, random_state=42)

# Let's break this down a bit:
#
# x_train and x_test hold the input features, split in the ratio 80:20, i.e. 80% of
# the rows (120 of 150) go to the training set and 20% (30 rows) to the test set.
#
# y_train and y_test hold the matching target values, split in the same ratio.

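The 80:20 split can be checked on a stand-in array with the same size as Iris (150 rows, 50 per class). This is a small sketch with made-up numbers rather than the real data. The `stratify` argument is an addition that the notebook itself does not use; it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 150 rows with 4 columns, and 50 labels for each of 3 classes.
demo_inputs = np.arange(150 * 4).reshape(150, 4)
demo_targets = np.repeat([0, 1, 2], 50)

x_tr, x_te, y_tr, y_te = train_test_split(
    demo_inputs, demo_targets,
    test_size=0.2,          # 20% of the rows go to the test set
    random_state=42,        # fixed seed so the split is reproducible
    stratify=demo_targets,  # assumption: keep class balance in both parts
)

print(x_tr.shape, x_te.shape)   # (120, 4) (30, 4)
print(np.bincount(y_te))        # [10 10 10]
```
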
### XGBoost model

In [22]:
from xgboost import XGBClassifier

xg = XGBClassifier()  # Create an instance of the XGBoost classifier

xg.fit(x_train, y_train)  # Train the model on the training set

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

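These are the parameter values the model was trained with. Because we passed no arguments to XGBClassifier(), they are the library defaults, except for the objective, which XGBoost sets to 'multi:softprob' automatically when there are more than two classes.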
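
The model above was fitted directly on string labels such as 'Iris-setosa', and the output shown appears to come from an older XGBoost release. Newer releases may reject string labels and expect integer class labels starting at 0 (treat this as an assumption about your installed version). If you run into such an error, one common workaround is scikit-learn's `LabelEncoder`, sketched here on a few hand-written labels:

```python
from sklearn.preprocessing import LabelEncoder

# A few example labels, written by hand for illustration.
species = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(species)       # strings -> integers (alphabetical order)
decoded = encoder.inverse_transform(encoded)   # integers -> original strings

print(encoded)          # [0 1 2 0]
print(list(decoded))
```

The encoded labels would be passed to `fit`, and `inverse_transform` turns the integer predictions back into species names.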

Now let's predict on the test data.

In [23]:
# Predict on the test data and store the result in y_pred so that we can compare it
# with the actual test targets, i.e. y_test
y_pred = xg.predict(x_test)

In [24]:
y_pred

array(['Iris-versicolor', 'Iris-setosa', 'Iris-virginica',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa',
       'Iris-versicolor', 'Iris-virginica', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-virginica', 'Iris-setosa', 'Iris-setosa',
       'Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-virginica',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-virginica',
       'Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-virginica',
       'Iris-virginica', 'Iris-virginica', 'Iris-virginica',
       'Iris-virginica', 'Iris-setosa', 'Iris-setosa'], dtype=object)

##### The model has predicted a species for every sample in the test data. Let's compare the predictions with the actual test targets.

We will use the accuracy score to evaluate the performance of our model.

In [34]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)

print('Accuracy : ', round(accuracy * 100, 2))

Accuracy :  100.0
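
Accuracy is simply the fraction of predictions that match the true labels. A confusion matrix additionally shows which classes get mixed up. The following sketch uses a handful of made-up labels (not the notebook's results) to show both with scikit-learn:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration: one virginica is mistaken for versicolor.
true_labels = ["setosa", "versicolor", "virginica", "virginica"]
predicted_labels = ["setosa", "versicolor", "versicolor", "virginica"]

demo_accuracy = accuracy_score(true_labels, predicted_labels)

# Rows are true classes, columns are predicted classes, in the order given by `labels`.
demo_matrix = confusion_matrix(
    true_labels, predicted_labels,
    labels=["setosa", "versicolor", "virginica"],
)

print(demo_accuracy)   # 0.75 (3 of 4 correct)
print(demo_matrix)
```

A perfect classifier, like the one in this notebook on its test set, has non-zero entries only on the diagonal of the matrix.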


#### Conclusion:

Our XGBoost model performed very well, classifying all 30 test samples correctly for an accuracy of 100%. Keep in mind that a test set of this size gives only a rough estimate of performance.

This is just a basic model, and the default parameters will not necessarily give optimal performance on more complex datasets.

Try optimizing the model on more complex datasets using hyperparameter tuning, and assess the performance improvements as well.