# XGBoost
This notebook works through the XGBoost code: setting parameters, loading data, cross-validating, training, and saving a model.

In [2]:
%pip install pandas numpy xgboost sklearn

Collecting xgboost
  Downloading xgboost-1.3.3-py3-none-win_amd64.whl (95.2 MB)
Collecting sklearn
  Using cached sklearn-0.0.tar.gz (1.1 kB)
Collecting scipy
  Downloading scipy-1.6.2-cp38-cp38-win_amd64.whl (32.7 MB)
Collecting scikit-learn
  Downloading scikit_learn-0.24.1-cp38-cp38-win_amd64.whl (6.9 MB)
Collecting joblib>=0.11
  Using cached joblib-1.0.1-py3-none-any.whl (303 kB)
Collecting threadpoolctl>=2.0.0
  Using cached threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Using legacy 'setup.py install' for sklearn, since package 'wheel' is not installed.
Installing collected packages: scipy, xgboost, joblib, threadpoolctl, scikit-learn, sklearn
    Running setup.py install for sklearn: started
    Running setup.py install for sklearn: finished with status 'done'
Successfully installed joblib-1.0.1 scikit-learn-0.24.1 scipy-1.6.2 sklearn-0.0 threadpoolctl-2.1.0 xgboost-1.3.3
Note: you may need to restart the kernel to use updated packages.
You should consider upgrading via the 'C:\U

In [5]:
import xgboost as xgb
from sklearn import metrics
import pandas as pd

## Set up all variables needed for XGBoost
The variables below are needed for XGBoost. A note on the variables:
- eta = learning rate
- num_class = 20, as there are 20 possible MIC values
- objective = "multi:softprob". The prediction is the probability of a datapoint belonging to each MIC value (requires num_class)
- early_stopping_rounds = used in cross-validation to stop early when the loss has not decreased for this number of rounds
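The early-stopping rule in the last bullet can be illustrated without XGBoost. The sketch below is a simplified stand-in, not XGBoost's own implementation; the function name `stopping_round` and the loss values are hypothetical.

```python
def stopping_round(losses, patience):
    """Return the index of the round at which training would stop.

    Training stops once `patience` rounds have passed since the best
    (lowest) loss; if that never happens, the last round is returned.
    """
    best_loss = float("inf")
    best_round = -1
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            return i
    return len(losses) - 1


# Hypothetical per-round validation losses.
losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.66, 0.64]
stop = stopping_round(losses, patience=3)
print(stop)
```

With these made-up losses the best value is reached at round 2, so with a patience of 3 the loop ends before the later improvement at round 6 is ever seen. That is the trade-off a small `early_stopping_rounds` makes.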

In [25]:
num_folds = 1
objective = "multi:softprob"
num_classes = 20    # 20 possible MICs (not including NaN which is -1)
max_depth = 5
eta = 0.2
early_stopping_rounds = 10

## Load data
Load the preprocessed training/testing data.

In [15]:
form_3 = pd.read_csv("form_3.csv", index_col=0)
labels = pd.read_csv("labels.csv", index_col=0)

In [16]:
form_3.head()

Name,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,10,...,Unnamed: 278,Unnamed: 279,280,Unnamed: 281,Unnamed: 282,Unnamed: 283,Unnamed: 284,Unnamed: 285,Unnamed: 286,Unnamed: 287
1,12,1,18,9,1,10,4,9,9,15,...,0,0,10,9,6,8,17,5,1,22
2,12,1,18,9,1,10,4,9,23,24,...,24,24,24,24,24,24,24,24,24,24


In [10]:
labels.head()

Name,Antibiotic_1,Antibiotic_2,Antibiotic_3,Antibiotic_4,Antibiotic_5,Antibiotic_6,Antibiotic_7,Antibiotic_8,Antibiotic_9,Antibiotic_10,Antibiotic_11
1,13,-1,5,7,9,-1,5,6,15,11,11
2,13,-1,8,-1,12,14,11,7,-1,-1,13
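In the labels above, -1 marks a missing MIC value, while a multi-class objective expects labels in the range 0 to num_class - 1. One way to handle this, sketched below, is to drop the rows with a missing label for the antibiotic being modeled. The helper `drop_missing` is hypothetical, and the toy frames only copy a few values from the rows shown above.

```python
import pandas as pd


def drop_missing(features, labels, antibiotic, missing=-1):
    """Keep only the rows whose label for `antibiotic` is not `missing`."""
    keep = labels[antibiotic] != missing
    return features.loc[keep], labels.loc[keep, antibiotic]


# Toy frames; label values are taken from the two rows displayed above.
features = pd.DataFrame({"f1": [12, 12], "f2": [1, 1]}, index=[1, 2])
labels = pd.DataFrame(
    {"Antibiotic_4": [7, -1], "Antibiotic_5": [9, 12]}, index=[1, 2]
)

X4, y4 = drop_missing(features, labels, "Antibiotic_4")  # row 2 is dropped
X5, y5 = drop_missing(features, labels, "Antibiotic_5")  # nothing is dropped
print(len(X4), len(X5))
```

Filtering per antibiotic, rather than once for the whole table, keeps as many rows as possible for each model.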


## Cross Validation
This next section creates the needed DMatrix and performs cross-validation to see how well XGBoost can predict. There are only 2 datapoints, so the result does not tell us much, but it at least makes sure the code works.

In [29]:
# Each model is for a different antibiotic, so we will only try with antibiotic 5 for right now (No NaN values and 2 different MIC values).
train_dmatrix = xgb.DMatrix(form_3, label=pd.DataFrame(labels['Antibiotic_5']))

# Parameters used for training
params = {'max_depth': max_depth, 'eta': eta, 'objective': objective, 'num_class': num_classes}

# Cross-validation (nfold must be 2 since we only have 2 datapoints; num_folds is currently set to 1)
cv_results = xgb.cv(
        params=params,
        dtrain=train_dmatrix,
        nfold=num_folds,
        early_stopping_rounds=early_stopping_rounds,
        feval=metrics.f1_score,
        maximize=True
    )

ValueError: need at least one array to concatenate
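The `ValueError` above is the message NumPy raises when asked to concatenate an empty list, which is likely what happens when the folds are built with `nfold=1`: there is no other fold left to train on. Separately, `feval` expects a callable that takes the predictions and a DMatrix and returns a `(name, value)` pair, whereas `metrics.f1_score` takes `(y_true, y_pred)`. A minimal sketch of a wrapper follows. The names `macro_f1_eval` and `LabelsOnly` are hypothetical, and macro averaging is an assumption, not something the notebook specifies.

```python
import numpy as np
from sklearn import metrics


def macro_f1_eval(preds, dtrain):
    """Custom metric in the (predictions, DMatrix) -> (name, value) form."""
    y_true = dtrain.get_label().astype(int)
    # One row per datapoint, one column per class. The argmax is the same
    # whether the scores are raw margins or softmax probabilities.
    scores = np.asarray(preds).reshape(len(y_true), -1)
    y_pred = scores.argmax(axis=1)
    return "macro_f1", metrics.f1_score(y_true, y_pred, average="macro")


class LabelsOnly:
    """Stand-in for a DMatrix that exposes only get_label(), for a quick check."""

    def __init__(self, labels):
        self._labels = np.asarray(labels, dtype=np.float32)

    def get_label(self):
        return self._labels


# Made-up scores for 3 datapoints and 2 classes; the last row is misclassified.
scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
name, value = macro_f1_eval(scores, LabelsOnly([0, 1, 1]))
print(name, round(value, 3))
```

The wrapper could then be passed to `xgb.cv` as `feval=macro_f1_eval` together with `nfold=2` and `maximize=True`. With only two datapoints each fold trains on a single sample, so the score would check the plumbing rather than the model.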

In [27]:
model = xgb.train(params=params, dtrain=train_dmatrix, num_boost_round=10)



In [28]:
model.predict(train_dmatrix)

array([[0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05,
        0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05],
       [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05,
        0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]],
      dtype=float32)
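Each row of the output above holds one probability per class, and every entry is 0.05 = 1/20, so the model assigns equal probability to all 20 classes and has not learned to separate them. The sketch below shows how such rows could be turned into a predicted class index and how to check for this uniform case. The array is rebuilt by hand here, not taken from the model.

```python
import numpy as np

num_classes = 20

# Same values as the prediction output above: 2 datapoints, 20 classes.
probs = np.full((2, num_classes), 0.05, dtype=np.float32)

predicted_class = probs.argmax(axis=1)  # index of the most probable class
confidence = probs.max(axis=1)          # probability of that class
is_uniform = np.allclose(probs, 1.0 / num_classes)

print(predicted_class, confidence, is_uniform)
```

Note that `argmax` returns the first index on ties, so a uniform row always yields class 0. The result is a class index, not an MIC value; mapping it back depends on how the labels were encoded during preprocessing.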

## Saving the model
Now save the model for later use.

### Save as binary
The next line saves the model as a binary file that can be loaded later and used for prediction.

In [30]:
# Save the model in a form that can be loaded and used later
model.save_model('xgboost.model')

### Save as text
Save the model in a text format for later inspection. This dump cannot be loaded back in and used for prediction.

Use with [Xgbfi](https://github.com/Far0n/xgbfi) to create an image for easier reading.

In [31]:
model.dump_model('xgboost.txt')