## Model Training with Scikit-Learn

You are ready to create a diabetes model that predicts whether a patient has diabetes based on medical readings.

Import the required libraries and packages.

In [1]:
import pandas as pd

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

Load the data into a pandas DataFrame.

In [2]:
data_file_name = './data/diabetes.csv'
data = pd.read_csv(data_file_name)

Split the data into the features (`X`) and the target variable (`y`).

In [3]:
X = data.drop('Outcome', axis=1)
y = data['Outcome']

Inspect the features and the target.

In [4]:
X.head(4)

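|   | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 |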


In [5]:
y.head(4)

0    1
1    0
2    1
3    0
Name: Outcome, dtype: int64
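
When inspecting the target, it can also help to look at how the classes are distributed, since accuracy is harder to interpret on imbalanced data. The sketch below uses a made-up `toy_outcome` series (not the real `Outcome` values); in the notebook the same call would presumably be applied to `y`.

```python
import pandas as pd

# Toy target column for illustration only (not the real Outcome values)
toy_outcome = pd.Series([1, 0, 1, 0, 0, 0, 1, 0], name="Outcome")

# Fraction of rows per class, sorted by frequency
class_share = toy_outcome.value_counts(normalize=True)
print(class_share)
```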

Standardize the features so that each column has zero mean and unit variance.

In [6]:
from sklearn import preprocessing

feature_scaler = preprocessing.StandardScaler()
X = feature_scaler.fit_transform(X)
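
To make the scaling step concrete, here is a minimal sketch of what `StandardScaler` computes: for each column, `z = (x - mean) / std`, with the population standard deviation. The `toy` matrix is made up for illustration and is not taken from the diabetes data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (made-up values, two columns on very different scales)
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(toy)

# Same computation by hand: subtract the column mean, divide by the
# column standard deviation (NumPy's default ddof=0 matches StandardScaler)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))  # True
print(scaled.mean(axis=0))          # approximately 0 for each column
print(scaled.std(axis=0))           # approximately 1 for each column
```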

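Divide the data into training and test datasets.
Use the `train_test_split` function of Scikit-learn to split the dataset into random train and test subsets. Here, `test_size=0.2` holds out 20% of the rows for testing, and `random_state=0` makes the split reproducible.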

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(f"Number of samples in training set: {X_train.shape[0]}")
print(f"Number of samples in test set: {X_test.shape[0]}")

Number of samples in training set: 614
Number of samples in test set: 154
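
Because `train_test_split` shuffles rows at random, the class proportions can differ slightly between the two subsets. If that matters, the `stratify` argument keeps them equal. This is a minimal sketch on synthetic data; `X_demo` and `y_demo` are made-up names and values, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels for illustration only: 70 negatives, 30 positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 70 + [1] * 30)

# stratify=y_demo preserves the 70/30 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(f"Positive share in training set: {y_tr.mean():.2f}")
print(f"Positive share in test set: {y_te.mean():.2f}")
```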


## Create the Model

Create an instance of a model and train the model with the training set.

In [8]:
# Instantiate the model
model = LogisticRegression()

# Train the model
model.fit(X_train, y_train)
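
One caveat about the steps so far: the scaler was fitted on all rows before the split, so statistics from the test rows influence the scaling. A possible alternative, shown here as a sketch on synthetic data rather than as this notebook's method, is to chain the scaler and the model in a `Pipeline` and fit it on the training subset only. The `X_demo`, `y_demo`, and `pipeline` names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (8 features, binary target)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on the training rows only, then applies
# the same transformation whenever it predicts or scores
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_tr, y_tr)

test_accuracy = pipeline.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

With this design, new samples can be passed to `pipeline.predict` directly, without a separate call to the scaler.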

Once the model is trained, evaluate its accuracy by comparing its predictions on the test dataset with the true labels.

In [9]:
# Predict the labels of the test data
y_predicted = model.predict(X_test)

# Print the accuracy and the classification report
print("Classification accuracy on the test set:", accuracy_score(y_test, y_predicted))
print("Classification report:")
print(classification_report(y_test, y_predicted))

Classification accuracy on the test set: 0.8246753246753247
Classification report:
              precision    recall  f1-score   support

           0       0.84      0.92      0.88       107
           1       0.76      0.62      0.68        47

    accuracy                           0.82       154
   macro avg       0.80      0.77      0.78       154
weighted avg       0.82      0.82      0.82       154



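The trained model has an accuracy of about 82% on the test set. The report also shows that recall for the diabetes class (0.62) is noticeably lower than for the non-diabetes class (0.92).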
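
The classification report summarizes precision and recall per class; a confusion matrix shows the raw counts behind those figures. The sketch below uses short, made-up label lists (`toy_true`, `toy_pred`) rather than the notebook's results; in the notebook the same call would presumably take `y_test` and `y_predicted`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(toy_true, toy_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"Recall for class 1: {tp / (tp + fn):.2f}")
print(f"Precision for class 1: {tp / (tp + fp):.2f}")
```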

You can improve the score by retraining the model with better data or more features, or by tuning the hyperparameters.
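
As one way to tune hyperparameters, a cross-validated grid search over the regularization strength `C` of `LogisticRegression` could look like the sketch below. The data is synthetic and the candidate values in `param_grid` are arbitrary assumptions, not recommendations for the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Candidate values for the inverse regularization strength (arbitrary choices)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation, selecting the value with the best mean accuracy
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```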

## Test the Model with Sample Cases
Test the model with data from two patients: one patient with diabetes and one patient without diabetes.

In [10]:
# Dict for textual display of prediction
classes = {0: 'No diabetes', 1: 'Diabetes'}


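def predict(patients):
    # Build a DataFrame so the columns match the feature names used in training
    inputs = pd.DataFrame(patients)
    # Apply the scaler that was fitted earlier
    inputs = feature_scaler.transform(inputs)
    predictions = model.predict(inputs)
    # Map numeric class labels to readable text
    return [classes[prediction] for prediction in predictions]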


diabetes_patient = {
    "Pregnancies": 6.0,
    "Glucose": 110.0,
    "BloodPressure": 65.0,
    "SkinThickness": 15.0,
    "Insulin": 1.0,
    "BMI": 45.7,
    "DiabetesPedigreeFunction": 0.627,
    "Age": 50
}

no_diabetes_patient = {
    "Pregnancies": 0,
    "Glucose": 88.0,
    "BloodPressure": 60.0,
    "SkinThickness": 35.0,
    "Insulin": 1.0,
    "BMI": 45.7,
    "DiabetesPedigreeFunction": 0.27,
    "Age": 20
}


predictions = predict([diabetes_patient, no_diabetes_patient])
print(predictions)

['Diabetes', 'No diabetes']
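
The helper above returns only the predicted label. When it is useful to see how confident the model is, `predict_proba` returns one probability per class. This is a minimal sketch on synthetic data; `demo_model` and `X_demo` are illustrative names, and in the notebook the call would presumably be made on `model` with scaled inputs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the diabetes data
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
demo_model = LogisticRegression().fit(X_demo, y_demo)

# One row per sample, one column per class (ordered as in demo_model.classes_)
probabilities = demo_model.predict_proba(X_demo[:2])
for row in probabilities:
    print(f"P(class 0) = {row[0]:.2f}, P(class 1) = {row[1]:.2f}")
```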
