# Python Package Introduction

This notebook gives a basic walkthrough of the XGBoost Python package. Many more real-world examples appear in the other example notebooks.

## Install XGBoost

To install XGBoost, follow the instructions in the [Installation Guide](https://xgboost.readthedocs.io/en/latest/build.html).

To verify your installation, run the following in Python:
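
A minimal check, using only the standard library, might look like the sketch below. It reports whether the `xgboost` package can be found and, if so, which version is installed. The helper name `xgboost_version` is made up for this illustration.

```python
import importlib.util


def xgboost_version():
    """Return the installed XGBoost version string, or None if it is not installed."""
    # find_spec looks the package up without importing it
    if importlib.util.find_spec("xgboost") is None:
        return None
    import xgboost
    return xgboost.__version__


version = xgboost_version()
print("XGBoost version:", version if version else "not installed")
```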

## Data Interface

**The XGBoost Python module is able to load data from:**

- LibSVM text format file
- Comma-separated values (CSV) file
- NumPy 2D array
- SciPy 2D sparse array
- Pandas data frame, and
- XGBoost binary buffer file.

### Basic Input Format

XGBoost currently supports two text formats for ingesting data: LibSVM and CSV. The rest of this section gives a brief overview of the LibSVM format; for a description of the CSV format, see the Wikipedia article on comma-separated values.

For training, XGBoost takes an instance file with the format as below:

`train.txt`

Each line represents a single instance. In the first line (`1 101:1.2 102:0.03`), `1` is the instance label, `101` and `102` are feature indices, and `1.2` and `0.03` are the corresponding feature values. In the binary classification case, `1` indicates a positive sample and `0` a negative sample. Labels may also be probability values in [0, 1], indicating the probability that the instance is positive.
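
The layout described above can be sketched with plain Python. In this illustration only the first line comes from the description in the text; the other rows and the helper `parse_libsvm_line` are made up, and XGBoost itself is not involved.

```python
import os
import tempfile

# First line matches the description in the text; the remaining lines are
# made-up rows added only for illustration.
lines = [
    "1 101:1.2 102:0.03",
    "0 1:2.1 10001:300 10002:400",
    "0 0:1.3 1:0.3",
]


def parse_libsvm_line(line):
    """Split one LibSVM line into (label, {feature_index: feature_value})."""
    label, *pairs = line.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features


# Write the instance file, then read its first line back
path = os.path.join(tempfile.mkdtemp(), "train.txt")
with open(path, "w") as f:
    f.write("\n".join(lines) + "\n")

with open(path) as f:
    label, features = parse_libsvm_line(f.readline())

print(label, features)
```

Because only non-zero features are listed, the format stays compact for sparse data.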

The data is stored in a **DMatrix** object.

**DMatrix** is an internal data structure used by XGBoost that is optimized for both memory efficiency and training speed. You can construct a DMatrix from NumPy arrays or pandas DataFrames, among other sources.

- To load a LibSVM text file or an XGBoost binary buffer file into DMatrix:

- To load a CSV file into DMatrix:

In [1]:
import xgboost as xgb
import numpy as np
import pandas as pd

### To load a NumPy array into DMatrix:

In [5]:
data = np.random.rand(5, 10)  # 5 entities, each contains 10 features
label = np.random.randint(2, size=5)  # binary target
dtrain = xgb.DMatrix(data, label=label)

In [6]:
data

array([[0.48516736, 0.06622999, 0.16797126, 0.55295515, 0.18496159,
        0.15718123, 0.5569322 , 0.623564  , 0.58523372, 0.16918317],
       [0.46716857, 0.65638034, 0.18316988, 0.32296578, 0.02028441,
        0.28975191, 0.94222454, 0.63358342, 0.52584004, 0.35256441],
       [0.88565107, 0.85465098, 0.84833395, 0.35387303, 0.25754147,
        0.06719269, 0.8298736 , 0.96227229, 0.80613462, 0.38857751],
       [0.05226646, 0.46454811, 0.34781382, 0.70249846, 0.52963296,
        0.97339461, 0.12700059, 0.03400443, 0.62372425, 0.22378784],
       [0.97121446, 0.07955303, 0.97512642, 0.62818599, 0.07342065,
        0.80914954, 0.10231779, 0.53521168, 0.77181652, 0.07095929]])

In [11]:
label

array([1, 0, 0, 0, 0])

In [12]:
dtrain

<xgboost.core.DMatrix at 0x25d83055f98>

**Note: Categorical features not supported**

XGBoost does not support categorical features; if your data contains categorical features, load it as a NumPy array first and then perform one-hot encoding.
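
One possible way to do the one-hot encoding is sketched below. It uses `pandas.get_dummies` rather than a hand-written NumPy encoding, and the toy columns `size` and `color` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one categorical column
df = pd.DataFrame({
    "size": [1.0, 2.5, 3.0, 0.5],
    "color": ["red", "green", "red", "blue"],
})

# Replace "color" with one 0/1 indicator column per category
encoded = pd.get_dummies(df, columns=["color"], dtype=float)

# A purely numeric 2D array, suitable as input to a DMatrix constructor
features = encoded.to_numpy()
print(list(encoded.columns))
print(features.shape)
```

Each row ends up with exactly one indicator set to 1.0 among the `color_*` columns.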

**Note: Use Pandas to load CSV files with headers**

Currently, the DMLC data parser cannot parse CSV files with headers. Use Pandas (see below) to read CSV files with headers.

### To load a Pandas data frame into DMatrix:

In [18]:
data = pd.DataFrame(np.arange(12).reshape((4,3)), columns=['a', 'b', 'c'])
label = pd.DataFrame(np.random.randint(2, size=4))
dtrain = xgb.DMatrix(data, label=label)

In [19]:
data

   a   b   c
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11


In [20]:
label

   0
0  0
1  1
2  1
3  1


In [21]:
dtrain

<xgboost.core.DMatrix at 0x25d8306e320>

**Saving a DMatrix into an XGBoost binary file will make loading faster:**

In [23]:
dtrain.save_binary('train.buffer')  # writes train.buffer to the current working directory

**The value that marks missing entries can be specified in the DMatrix constructor (here, every -999.0 is treated as missing):**

In [24]:
dtrain = xgb.DMatrix(data, label=label, missing=-999.0)

**Weights can be set when needed:**

In [25]:
w = np.random.rand(len(data))  # one weight per row of `data` (4 rows here)
dtrain = xgb.DMatrix(data, label=label, missing=-999.0, weight=w)

In [27]:
w  # per-instance weights: a larger weight makes that row count more in the training loss

array([0.29319483, 0.44319105, 0.65289691, 0.75436332])

**Note:** When performing ranking tasks, the number of weights should equal the number of groups.

## Setting Parameters

XGBoost can use either a list of pairs or a dictionary to set parameters. For instance:

- Booster parameters:

- You can also specify multiple eval metrics:

- Specify a validation set to watch performance:
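
The two ways of writing parameters can be sketched in plain Python as below. The specific values and the metric names `auc` and `logloss` are only illustrative choices, not recommended settings.

```python
# Dictionary form: booster parameters keyed by name
param = {"max_depth": 2, "eta": 1, "objective": "binary:logistic"}
param["nthread"] = 4
param["eval_metric"] = "auc"

# Several evaluation metrics: pass a list under the same key
param["eval_metric"] = ["auc", "logloss"]

# List-of-pairs form: the same key may appear more than once
plst = [
    ("max_depth", 2),
    ("eta", 1),
    ("objective", "binary:logistic"),
    ("eval_metric", "auc"),
    ("eval_metric", "logloss"),
]

print(param)
print(plst)
```

A validation watch list is likewise a plain list, made of `(DMatrix, name)` pairs, that is passed to `xgboost.train` through its `evals` argument.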

## Training


Training a model requires a parameter list and data set.

After training, the model can be saved.

The model and its feature map can also be dumped to a text file.

A saved model can be loaded as follows:

Methods such as `update` and `boost` on `xgboost.Booster` are designed for internal use only. The wrapper function `xgboost.train` performs some pre-configuration, including setting up caches and some other parameters.

## Early Stopping

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in evals. If there’s more than one, it will use the last.

The model will train until the validation score stops improving. The validation error needs to decrease at least once every `early_stopping_rounds` rounds for training to continue.

If early stopping occurs, the model will have three additional fields: `bst.best_score`, `bst.best_iteration` and `bst.best_ntree_limit`. Note that `xgboost.train()` will return a model from the last iteration, not the best one.

This works with both metrics to minimize (RMSE, log loss, etc.) and metrics to maximize (MAP, NDCG, AUC). Note that if you specify more than one evaluation metric, the last one in `param['eval_metric']` is used for early stopping.

## Prediction

A model that has been trained or loaded can perform predictions on data sets.

If early stopping is enabled during training, you can get predictions from the best iteration with `bst.best_ntree_limit`:

## Plotting

You can use the plotting module to plot feature importance and the output tree.

To plot importance, use `xgboost.plot_importance()`. This function requires **matplotlib** to be installed.

To plot the output tree via matplotlib, use `xgboost.plot_tree()`, specifying the ordinal number of the target tree. This function requires **graphviz and matplotlib**.

When you use Jupyter Notebook, you can use the `xgboost.to_graphviz()` function, which converts the target tree to a **graphviz** instance. The graphviz instance is automatically rendered in Jupyter Notebook.

In [2]:
import xgboost as xgb
# read in data
dtrain = xgb.DMatrix('demo/data/agaricus.txt.train')
dtest = xgb.DMatrix('demo/data/agaricus.txt.test')
# specify parameters via map or dict
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
num_round = 2
bst = xgb.train(param, dtrain, num_round)
# make prediction
preds = bst.predict(dtest)

[05:17:13] 6513x127 matrix with 143286 entries loaded from demo/data/agaricus.txt.train
[05:17:13] 1611x127 matrix with 35442 entries loaded from demo/data/agaricus.txt.test


In [3]:
preds


array([0.28583017, 0.9239239 , 0.28583017, ..., 0.9239239 , 0.05169873,
       0.9239239 ], dtype=float32)

In [4]:
preds.shape

(1611,)