## Import libraries

In [1]:
import pandas as pd
import numpy as np

## Load dataset

In [6]:
df = pd.read_csv('pima_indian_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
Number  of times pregnant     768 non-null int64
Plasma glucose                768 non-null int64
Diastolic blood pressure      768 non-null int64
Triceps skinfold thickness    768 non-null int64
serum insulin                 768 non-null int64
Body mass index               768 non-null float64
Diabetes pedigree function    768 non-null float64
Age                           768 non-null int64
Class                         768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [7]:
df.head()

Unnamed: 0,Number of times pregnant,Plasma glucose,Diastolic blood pressure,Triceps skinfold thickness,serum insulin,Body mass index,Diabetes pedigree function,Age,Class
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


## Split off the features and the target variable

In [8]:
features = [x for x in df.columns if x != 'Class']

X =  df[features]
y = df['Class']

The input variables are the number of pregnancies, plasma glucose, diastolic blood pressure, triceps skinfold thickness, serum insulin level, body mass index, diabetes pedigree function, and age.

The target is Class (1 = patient with diabetes, 0 = healthy patient).

## Train/Test sets split

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=20)
                                                            
print(f'Training samples: {X_train.shape[0]}')
print(f'Test samples: {X_test.shape[0]}')

Training samples: 614
Test samples: 154
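
The 614/154 split follows from applying the 80/20 ratio to 768 rows. As a quick check, and assuming scikit-learn rounds a fractional test size up to the next whole sample (my understanding of `train_test_split`, not something shown in this notebook), the arithmetic can be sketched as:

```python
import math

n_samples = 768      # rows reported by df.info()
test_size = 0.20     # fraction passed to train_test_split

# Assumption: the test count is rounded up and the remainder goes to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f'Training samples: {n_train}')
print(f'Test samples: {n_test}')
```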


## Create a pipeline for feature preprocessing and logistic regression

In [11]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler 
from sklearn.linear_model import LogisticRegression

modeling_pipeline = Pipeline([('data_prepocessing', StandardScaler()), ('logreg', LogisticRegression())])
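
The `StandardScaler` step rescales each feature to zero mean and unit variance, using statistics learned from the training data only. A minimal pure-Python sketch of that idea for a single column follows; the values and the helper names `fit_scaler` and `transform` are hypothetical, chosen only for illustration.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        std = 1.0  # guard against constant columns (avoids division by zero)
    return mean, std

def transform(column, mean, std):
    """Standardize values using previously learned statistics."""
    return [(x - mean) / std for x in column]

train_col = [80.0, 100.0, 120.0, 140.0]  # hypothetical training values
test_col = [110.0, 150.0]                # hypothetical test values

mean, std = fit_scaler(train_col)        # statistics come from the training data only
train_scaled = transform(train_col, mean, std)
test_scaled = transform(test_col, mean, std)

print(train_scaled)
print(test_scaled)
```

Wrapping the scaler in the pipeline means this fit-on-train, apply-to-test pattern is also respected inside each cross-validation fold.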

## Run a baseline model

In [12]:
from sklearn.metrics import classification_report

base = modeling_pipeline.fit(X_train, y_train)
base_p = base.predict(X_test)

print(classification_report(y_test, base_p))

              precision    recall  f1-score   support

           0       0.78      0.86      0.82       101
           1       0.67      0.55      0.60        53

    accuracy                           0.75       154
   macro avg       0.73      0.70      0.71       154
weighted avg       0.75      0.75      0.75       154



## Logistic Regression model with grid search cross-validation using 10 folds

In [14]:
from sklearn.model_selection import GridSearchCV

# Search 5 regularization strengths and 2 class-weight settings in the grid
param_grid = [
    {'logreg__class_weight': [None, 'balanced'], 'logreg__C': [0.01, 0.1, 1, 10, 100]}
]

# recall is the metric used to select the model
# grid search cross-validation using 10 folds, so cv=10
gcv_results = GridSearchCV(estimator=modeling_pipeline, param_grid=param_grid, scoring='recall', cv=10, refit=True)
gcv_results = gcv_results.fit(X_train, y_train)

# best model
print(gcv_results.best_estimator_)

Pipeline(memory=None,
         steps=[('data_prepocessing',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('logreg',
                 LogisticRegression(C=10, class_weight='balanced', dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='auto', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='lbfgs', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False)
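
To make the size of the search concrete: the grid crosses 5 values of `C` with 2 class-weight settings, and each candidate is evaluated on 10 folds. The sketch below enumerates the candidates with `itertools.product`; it only mirrors how the combinations are counted and does not fit any model.

```python
from itertools import product

param_grid = {
    'logreg__class_weight': [None, 'balanced'],
    'logreg__C': [0.01, 0.1, 1, 10, 100],
}
n_folds = 10

# Every combination of one value per parameter is one candidate model
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

# Each candidate is fit once per fold during cross-validation
n_cv_fits = len(candidates) * n_folds

print(f'Candidates: {len(candidates)}')
print(f'Cross-validation fits: {n_cv_fits}')
```

With `refit=True`, one additional fit of the best candidate on the full training set follows the cross-validated fits.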


I picked recall as the selection metric because the goal is to predict whether or not a patient has diabetes. Recall tells us, of all the patients who are actually diabetic, what fraction we labeled correctly. Low recall is also the expensive failure in this case, since it means diabetic patients are being missed.
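
For reference, recall and precision for the positive class can be computed directly from the predictions. Here is a minimal sketch with made-up labels; the data and the helper `precision_recall` are hypothetical and not taken from the dataset.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical example: 4 diabetic patients (1) and 6 healthy patients (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f'precision={precision:.2f}, recall={recall:.2f}')
```

In this toy case, half of the diabetic patients are found (recall), and two of the three patients flagged as diabetic really are (precision).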

## Determine how this performs on the test set

In [15]:
y_testp = gcv_results.predict(X_test)

print(classification_report(y_test, y_testp))

              precision    recall  f1-score   support

           0       0.82      0.74      0.78       101
           1       0.58      0.68      0.63        53

    accuracy                           0.72       154
   macro avg       0.70      0.71      0.70       154
weighted avg       0.73      0.72      0.73       154



The tuned model correctly identified 68% of the patients with diabetes in the test set.

Compared to the baseline, recall for the diabetic class improved from 0.55 to 0.68, while precision dropped from 0.67 to 0.58. Overall accuracy also fell slightly, from 0.75 to 0.72. In other words, the tuned model catches more of the patients with diabetes, at the cost of more false positives.
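
To put the trade-off in terms of patients, the rounded figures in the two reports can be converted back into approximate counts. This is a back-of-the-envelope sketch: the counts are inferred from metrics rounded to two decimals, not read from a confusion matrix in the notebook, and the helper `approx_counts` is hypothetical.

```python
def approx_counts(precision, recall, support):
    """Estimate true and false positives for the positive class."""
    tp = round(recall * support)            # diabetic patients caught
    predicted_pos = round(tp / precision)   # everyone flagged as diabetic
    fp = predicted_pos - tp                 # healthy patients flagged by mistake
    return tp, fp

support = 53  # diabetic patients in the test set (class 1 support)

base_tp, base_fp = approx_counts(precision=0.67, recall=0.55, support=support)
tuned_tp, tuned_fp = approx_counts(precision=0.58, recall=0.68, support=support)

print(f'Baseline: about {base_tp} true positives, {base_fp} false positives')
print(f'Tuned:    about {tuned_tp} true positives, {tuned_fp} false positives')
```

Under these rounding assumptions, the tuned model catches roughly 7 more diabetic patients while raising roughly 12 more false alarms.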