## XGBoost Machine Learning and Predictive Analytics
An example of using gradient boosted decision trees for machine learning. It also uses IPython widgets to add interactive functionality.

Adapted from http://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/

In [35]:
# Install 'xgboost' for Python if it is not already installed. The --user option installs it into your local user site-packages.

# %%sh
# pip install xgboost --user

In [36]:
%%sh
conda list


# packages in environment at /opt/anaconda3:
#
alabaster                 0.7.7                    py35_0  
anaconda-client           1.4.0                    py35_0  
anaconda                  custom                   py35_0  
anaconda-navigator        1.1.0                    py35_0  
argcomplete               1.0.0                    py35_1  
astropy                   1.1.2               np110py35_0  
babel                     2.2.0                    py35_0  
beautifulsoup4            4.4.1                    py35_0  
bitarray                  0.8.1                    py35_0  
blaze                     0.9.1                    py35_0  
bokeh                     0.11.1                   py35_0  
boto                      2.39.0                   py35_0  
bottleneck                1.0.0               np110py35_0  
cairo                     1.12.18                       6  
cffi                      1.5.2                    py35_0  
chest                     0.2.3                    py

Using Anaconda Cloud api site https://api.anaconda.org


In [37]:
# %%sh
# pip install ipywidgets --user
# jupyter nbextension enable --py widgetsnbextension

In [38]:
# Update the system path so that the locally installed xgboost module can be found.
# Note: change this path to reflect your own user directory.

import sys 
import os
sys.path.append(os.path.abspath("/home/ad.edap-cluster.com/dsmith04/.local/lib/python3.5/site-packages"))

In [39]:
# Import classes and functions we will use for analysis

import numpy as np
import xgboost
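try:
    from sklearn import model_selection as cross_validation  # scikit-learn 0.18 and later
except ImportError:
    from sklearn import cross_validation  # older releases that still ship this module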
from sklearn.metrics import accuracy_score

In [40]:
# Import widget functionality

from ipywidgets import interact

### Decision Tree

http://xgboost.readthedocs.io/en/latest/model.html

![Tree](https://raw.githubusercontent.com/dmlc/web-data/master/xgboost/model/cart.png)

We will be using the Pima Indians Diabetes dataset from the UC Irvine Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
For this notebook, I downloaded the data file and uploaded it to the same Jupyter folder as the notebook.

In [41]:
# Load the data from the local copy of the comma-separated file

dataset = np.loadtxt("pima-indians-diabetes.data.txt", delimiter=",")
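
As a quick sanity check on the loading step, the sketch below runs `np.loadtxt` on an in-memory stand-in for the file. The three rows are hypothetical values that only follow the nine-column, comma-separated layout described in this notebook; they are not taken from the real data file.

```python
import io

import numpy as np

# Hypothetical rows in the same layout: 8 input columns followed by the outcome
sample_text = io.StringIO(
    "2,120,70,30,0,31.5,0.450,29,0\n"
    "5,160,80,35,90,36.2,0.720,47,1\n"
    "1,95,64,22,60,24.8,0.210,23,0\n"
)

# loadtxt accepts file-like objects as well as file names
sample = np.loadtxt(sample_text, delimiter=",")

print(sample.shape)  # (3, 9): 3 rows, 9 columns
print(sample.dtype)  # float64: every column is parsed as a float
```

Checking `dataset.shape` in the same way after loading the real file is a cheap way to confirm that all nine columns were parsed.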

### Dataset:
https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data

### Observation Values:
1. Number of times pregnant 
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test 
3. Diastolic blood pressure (mm Hg) 
4. Triceps skin fold thickness (mm) 
5. 2-Hour serum insulin (mu U/ml) 
6. Body mass index (weight in kg/(height in m)^2) 
7. Diabetes pedigree function 
8. Age (years) 

### Output Field
* Diabetes Outcome (Binary 1/0)

In [42]:
# The first eight columns are the input variables; the last column is the outcome.
# We must separate the columns (attributes or features) of the dataset into input patterns (X) 
# and output patterns (Y). We can do this by specifying the column indices with NumPy slicing.

# split data into X (observations) and Y (outcome)
X = dataset[:,0:8]
Y = dataset[:,8]
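
To make the column slicing concrete, here is a minimal sketch on a small stand-in array. The array `demo` and its values are made up for illustration; only the slicing expressions mirror the cell above.

```python
import numpy as np

# Stand-in array with 4 rows and 9 columns (values 0..35), not real data
demo = np.arange(36, dtype=float).reshape(4, 9)

# Columns 0 through 7 are the inputs; the stop index 8 is excluded
X_demo = demo[:, 0:8]
# Column 8 is the outcome; indexing with a single integer drops that axis
Y_demo = demo[:, 8]

print(X_demo.shape)  # (4, 8)
print(Y_demo.shape)  # (4,)

# Equivalent slices that do not hard-code the column count
assert np.array_equal(X_demo, demo[:, :-1])
assert np.array_equal(Y_demo, demo[:, -1])
```

The negative-index form is handy if the number of feature columns changes, since the outcome is always the last column.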

In [43]:
# split data into training and test sets
seed = 7
test_size = 0.25
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, Y, test_size=test_size, random_state=seed)
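
The effect of `test_size` and `random_state` can be checked on synthetic data. This sketch assumes scikit-learn 0.18 or later, where `train_test_split` is available from `sklearn.model_selection`, and it assumes 768 rows purely for illustration; the arrays are placeholders, not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 768 rows is an assumed size, 8 feature columns as above
X_demo = np.arange(768 * 8, dtype=float).reshape(768, 8)
y_demo = np.arange(768) % 2

# test_size=0.25 holds out a quarter of the rows for testing
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=7)
print(Xa_train.shape, Xa_test.shape)  # (576, 8) (192, 8)

# Reusing the same random_state reproduces exactly the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=7)
print(np.array_equal(Xa_test, Xb_test))  # True
```

Fixing the seed keeps the reported accuracy comparable between runs, since the same rows end up in the test set each time.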

In [44]:
# Train the model - in Jupyter this will also print the model parameters
model = xgboost.XGBClassifier()
model.fit(X_train, y_train)

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)

In [45]:
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

In [46]:
# evaluate model's predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 78.12%
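
The accuracy figure is the fraction of test rows whose predicted label matches the true label. A minimal sketch of that calculation with NumPy, using made-up label arrays rather than the notebook's results:

```python
import numpy as np

# Made-up labels for illustration only
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Element-wise comparison gives True where the prediction is correct;
# the mean of that boolean array is the fraction of correct predictions
demo_accuracy = np.mean(y_true_demo == y_pred_demo)

print("Accuracy: %.2f%%" % (demo_accuracy * 100.0))  # Accuracy: 75.00%
```

Here 6 of the 8 predictions match, which gives 75%.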


Compare this accuracy estimate with the published results for the same dataset at http://www.is.umk.pl/projects/datasets.html#Diabetes

In [47]:
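def ipredict(npreg,gluc,bp,skin,insulin,bmi,diab,age):
    # Build a 2-row input array: row 0 holds the supplied feature values,
    # row 1 is an all-zero padding row whose prediction is discarded
    my_vals=np.array([[npreg,gluc,bp,skin,insulin,bmi,diab,age],[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]])
    # Keep only the prediction for the first row
    modresult=model.predict(my_vals)[0]
    if modresult > 0:
        result="Model Prediction: Diabetes"
    else:
        result="Model Prediction: No Diabetes"
    return result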

In [24]:
my_pred = ipredict(2,70,72,35,0,33.6,0.627,36)
print(my_pred)

Model Prediction: No Diabetes


In [26]:
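# Float tuples give float sliders, so BMI and the pedigree function can take fractional values
interact(ipredict, npreg=(0,6), gluc=(0,200), bp=(22,122), skin=(0,46), insulin=(0,60), bmi=(0.0,64.0,0.1), diab=(0.0,2.0,0.01), age=(21,81));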

'Model Prediction: Diabetes'