# Classification Example with XGBClassifier in Python

XGBoost stands for eXtreme Gradient Boosting, a boosting algorithm built on gradient-boosted decision trees. One of its main differences from classic gradient boosting is that it adds explicit regularization (penalties on tree complexity and leaf weights) to the training objective to reduce overfitting.
The xgboost package is an open-source library that implements machine learning algorithms under the gradient boosting framework.
Its xgboost.XGBClassifier class is a scikit-learn API compatible class for classification.
In this post, we'll briefly learn how to classify the iris data with XGBClassifier in Python. We'll use the xgboost library, which you may need to install (for example, with pip install xgboost) if it is not available on your machine. The tutorial covers:

1. Preparing data
2. Defining the model
3. Predicting test data
4. Video tutorial
5. Source code listing


In [9]:
# We'll start by loading the required libraries.

from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import accuracy_score

## Preparing data

In this tutorial, we'll use the iris dataset as the classification data. First, we'll separate data into x and y parts.

In [4]:
# loading data
iris = load_iris()
x, y = iris.data, iris.target
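
Before splitting, it can help to look at what was loaded. The following is a minimal sketch that only inspects the dataset; the use of numpy's bincount to count samples per class is an addition for illustration and is not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
x, y = iris.data, iris.target

# x holds the measurements, y holds the class labels encoded as 0, 1, 2
print(x.shape)             # (150, 4)
print(iris.feature_names)  # names of the four measurement columns
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']

# Number of samples in each class
class_counts = np.bincount(y)
print(class_counts)        # [50 50 50]
```

The three classes are balanced, so plain accuracy is a reasonable first metric for this dataset.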

Then we'll split them into train and test parts. Here, we'll extract 15 percent of the dataset as test data.

In [5]:
# Split the data into train and test parts (15 percent for testing)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.15)
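
Because train_test_split shuffles the rows randomly, every run of the cell above produces a different split and slightly different scores. The sketch below shows one way to make the split repeatable and class-balanced. The random_state=42 and stratify=y arguments are illustrative assumptions and are not used in the original code.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

x, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify keeps the class proportions
# roughly equal in the train and test parts (both are illustrative choices)
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.15, random_state=42, stratify=y
)

print(xtrain.shape, xtest.shape)  # (127, 4) (23, 4)
```

With 150 samples, a 15 percent test share is rounded up to 23 test rows, which leaves 127 rows for training.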

## Defining the model

We've already imported the XGBClassifier class from the xgboost library above. Now we can define the classifier model.

In [6]:
# Create a model 
xgbc = XGBClassifier()
print(xgbc)

XGBClassifier(base_score=None, booster=None, colsample_bylevel=None,
              colsample_bynode=None, colsample_bytree=None, gamma=None,
              gpu_id=None, importance_type='gain', interaction_constraints=None,
              learning_rate=None, max_delta_step=None, max_depth=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              objective='binary:logistic', random_state=None, reg_alpha=None,
              reg_lambda=None, scale_pos_weight=None, subsample=None,
              tree_method=None, validate_parameters=None, verbosity=None)


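You can change the classifier's parameters to suit the characteristics of your dataset. Here, we've created it with the default parameter values; a value of None in the printed model means that the library default is filled in when the model is fitted.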
Next, we'll fit the model on the training data.

In [7]:
# Fit the model on the training data
xgbc.fit(xtrain, ytrain)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [11]:
# Next, we'll check the model accuracy on the test data.
y_pred = xgbc.predict(xtest)
accuracy_score(ytest, y_pred)

0.9565217391304348

## Predicting test data

Finally, we'll predict the test data and check the predictions with a confusion matrix.

In [12]:
# Predict the test data and build the confusion matrix
ypred = xgbc.predict(xtest)
cm = confusion_matrix(ytest, ypred)

print(cm)

[[ 5  0  0]
 [ 0 10  1]
 [ 0  0  7]]
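
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so the diagonal counts the correct predictions. As a small worked example, the sketch below recomputes the accuracy and the per-class recall and precision directly from the matrix printed above. It uses numpy, which the original code does not import.

```python
import numpy as np

# Confusion matrix copied from the output above
cm = np.array([[5, 0, 0],
               [0, 10, 1],
               [0, 0, 7]])

# Accuracy: correct predictions (diagonal) over all test samples
accuracy = np.trace(cm) / cm.sum()
print(f"{accuracy:.4f}")  # 0.9565

# Recall per class: correct predictions over the true count (row sums)
recall = np.diag(cm) / cm.sum(axis=1)

# Precision per class: correct predictions over the predicted count (column sums)
precision = np.diag(cm) / cm.sum(axis=0)

print(np.round(recall, 3))
print(np.round(precision, 3))
```

The result, 22 correct out of 23, matches the accuracy score reported earlier. The single error is one class 1 sample that was predicted as class 2.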
