# Model Training with XGBoost

Train an XGBoost classifier that predicts whether a patient has diabetes, based on medical features.

### 1. Import the required libraries and packages.

In [1]:
# Install XGBoost in the running kernel; a custom image that already includes it avoids this step
%pip install xgboost

Note: you may need to restart the kernel to use updated packages.


In [2]:
from typing import List, Dict

import pandas as pd

from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

### 2. Load the data into a pandas DataFrame.

In [3]:
data = pd.read_csv('./data/diabetes.csv')

### 3. Preprocess the data.

Split the data into a feature DataFrame (`X`) and a target Series (`y`).

In [4]:
X = data.drop('Outcome', axis=1)
y = data['Outcome']

Inspect the first rows of both objects.

In [5]:
X.head()

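|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 |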


In [6]:
y.head()

0    1
1    0
2    1
3    0
4    1
Name: Outcome, dtype: int64

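Divide the data into training and test sets.

The `train_test_split` function from scikit-learn splits the data into random training and test subsets. Here, `test_size=0.2` holds out 20% of the rows for testing, and `random_state=0` makes the split reproducible.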

In [7]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=0
)

print(f"Number of samples in training set: {X_train.shape[0]}")
print(f"Number of samples in test set: {X_test.shape[0]}")

Number of samples in training set: 614
Number of samples in test set: 154
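
The test set holds 154 of the 768 rows because scikit-learn rounds the test size up (0.2 × 768 = 153.6). The split above is purely random, so the share of positive cases can differ slightly between the two subsets. As a minimal sketch, assuming class balance matters for this data, the `stratify` argument of `train_test_split` keeps the label proportions nearly equal in both subsets. The synthetic features and the 35% positive rate below are made up for illustration and are not taken from the diabetes data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 768 rows, 8 features,
# with a hypothetical positive rate of about 35%.
rng = np.random.default_rng(0)
n_rows = 768
X_demo = rng.normal(size=(n_rows, 8))
y_demo = (rng.random(n_rows) < 0.35).astype(int)

# stratify=y_demo asks for matching class proportions in both subsets.
X_strat_train, X_strat_test, y_strat_train, y_strat_test = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=0,
    stratify=y_demo,
)

print(f"Train/test sizes: {len(X_strat_train)}/{len(X_strat_test)}")
print(f"Positive share, train: {y_strat_train.mean():.3f}")
print(f"Positive share, test:  {y_strat_test.mean():.3f}")
```

In the notebook itself, the equivalent change would be passing `stratify=y` to the existing call.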


### 4. Create and train the model.

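Create an instance of `XGBClassifier`, the gradient-boosted tree classifier from XGBoost that follows the scikit-learn estimator interface, and train it with the training set. The hyperparameters used here are the number of boosting rounds (`n_estimators`), the shrinkage applied to each new tree's contribution (`learning_rate`), and the minimum loss reduction required to split a leaf further (`gamma`).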

In [8]:
# Instantiate the model with hyperparameters
model = XGBClassifier(
    n_estimators=300,
    learning_rate=0.03,
    gamma=0.1
)

# Train the model
model.fit(X_train, y_train)

### 5. Evaluate the model metrics.

After the model is trained, evaluate the model against the test set.

In [9]:
# Compute the model predictions for the test data: y_predicted
y_predicted = model.predict(X_test)

# Compare the predicted values for the test set (y_predicted)
# against the expected values (y_test)
print("Classification report:")
print(classification_report(y_test, y_predicted))

Classification report:
              precision    recall  f1-score   support

           0       0.90      0.85      0.88       107
           1       0.70      0.79      0.74        47

    accuracy                           0.83       154
   macro avg       0.80      0.82      0.81       154
weighted avg       0.84      0.83      0.83       154



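The trained model reaches an accuracy of 83% on the 154 test samples. It handles the no-diabetes class (`0`) better than the diabetes class (`1`), for which precision is 0.70 and recall is 0.79.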
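
To make the report easier to read, its metrics can be recomputed by hand. The sketch below uses the confusion counts that the rounded report implies (91 true negatives, 16 false positives, 10 false negatives, 37 true positives). These counts are inferred from the precision, recall, and support columns; the notebook does not print them.

```python
# Confusion counts inferred from the rounded classification report
# (an assumption for illustration; the notebook does not print them).
tn, fp = 91, 16   # actual class 0, support 107
fn, tp = 10, 37   # actual class 1, support 47


def precision_recall_f1(true_pos: int, false_pos: int, false_neg: int):
    """Return (precision, recall, F1) for one class."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # Equivalent to the harmonic mean of precision and recall.
    f1 = 2 * true_pos / (2 * true_pos + false_pos + false_neg)
    return precision, recall, f1


# Class 1 (diabetes): the diabetes cases are the positives.
p1, r1, f1_1 = precision_recall_f1(tp, fp, fn)
# Class 0 (no diabetes): swap the roles of the two classes.
p0, r0, f1_0 = precision_recall_f1(tn, fn, fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1_0:.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Read this way, the 0.79 recall for class `1` means that 10 of the 47 patients with diabetes in the test set were missed.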

You can improve the score by retraining the model after further feature engineering or by tuning the model hyperparameters.
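
One common way to tune hyperparameters is a cross-validated grid search. The following is a minimal sketch on synthetic data, not the notebook's method: it uses scikit-learn's `GradientBoostingClassifier` as a stand-in so that the example runs without XGBoost installed, and the grid values are arbitrary. Because `XGBClassifier` follows the scikit-learn estimator interface, it can be passed to `GridSearchCV` in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (hypothetical, for illustration only).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, n_informative=4, random_state=0
)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# Arbitrary example grid; sensible ranges depend on the data.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.03, 0.1],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),  # stand-in for XGBClassifier
    param_grid=param_grid,
    scoring="accuracy",
    cv=3,
)
# Tune on the training set only, so the test set stays unseen.
search.fit(X_syn_train, y_syn_train)

print("Best parameters:", search.best_params_)
print(f"Cross-validated accuracy: {search.best_score_:.3f}")
print(f"Held-out test accuracy: {search.score(X_syn_test, y_syn_test):.3f}")
```

Keeping the test set out of the search matters: a score measured on data that also guided the tuning tends to be optimistic.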

### 6. Test the model with sample cases.

Test the model with data from two patients: one patient with diabetes and one patient without diabetes.

In [10]:
# Tuple for textual display of prediction
classes = ('No diabetes', 'Diabetes')


def predict(patients: List[Dict]):
    inputs = pd.DataFrame(patients)
    predictions = model.predict(inputs)
    return [classes[p] for p in predictions]


diabetes_patient = {
    "Pregnancies": 6.0,
    "Glucose": 110.0,
    "BloodPressure": 65.0,
    "SkinThickness": 15.0,
    "Insulin": 1.0,
    "BMI": 45.7,
    "DiabetesPedigreeFunction": 0.627,
    "Age": 50
}

no_diabetes_patient = {
    "Pregnancies": 0,
    "Glucose": 88.0,
    "BloodPressure": 60.0,
    "SkinThickness": 35.0,
    "Insulin": 1.0,
    "BMI": 45.7,
    "DiabetesPedigreeFunction": 0.27,
    "Age": 20
}


predictions = predict([diabetes_patient, no_diabetes_patient])
print(predictions)

['Diabetes', 'No diabetes']
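
The `predict` helper builds a DataFrame straight from the dictionaries, so it relies on every record using the same keys as the training columns. Below is a minimal sketch of a stricter input builder, assuming the feature order shown by `X.head()`. `FEATURE_ORDER` and `to_model_input` are hypothetical names, not part of the notebook.

```python
from typing import Dict, List

import pandas as pd

# Feature columns in the order shown by X.head() (assumed training order).
FEATURE_ORDER = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]


def to_model_input(patients: List[Dict]) -> pd.DataFrame:
    """Build a DataFrame whose columns match FEATURE_ORDER exactly."""
    frame = pd.DataFrame(patients)
    missing = [name for name in FEATURE_ORDER if name not in frame.columns]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # Select and reorder the columns; any extra keys are dropped.
    return frame[FEATURE_ORDER]


# Keys deliberately out of order to show the reordering.
sample = {
    "Age": 20, "BMI": 45.7, "Glucose": 88.0, "Pregnancies": 0,
    "BloodPressure": 60.0, "SkinThickness": 35.0, "Insulin": 1.0,
    "DiabetesPedigreeFunction": 0.27,
}
inputs = to_model_input([sample])
print(list(inputs.columns))
```

With such a helper, the body of `predict` could call `model.predict(to_model_input(patients))`, so a record with a missing or misspelled key fails early with a clear message.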
