In [2]:
import statsmodels.formula.api as smf
import statsmodels.api as sm
import pandas as pd
import numpy as np

In [3]:
diabetes = pd.read_csv('diabetes-dataset.csv')
diabetes.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,2,138,62,35,0,33.6,0.127,47,1
1,0,84,82,31,125,38.2,0.233,23,0
2,0,145,0,0,0,44.2,0.63,31,1
3,0,135,68,42,250,42.3,0.365,24,1
4,1,139,62,41,480,40.7,0.536,21,0


## Training and testing sets

When developing a model, we do not train on all of our data. Instead, we divide the data we possess into two sets: the training set and the testing set. A common rule of thumb is to use 80% of the data for training and 20% for testing. Other ratios are used depending on the circumstances, but this is a good starting point. The examples in the training set should be picked at random to avoid bias; selecting them by a fixed rule (for example, taking the first 80% of the rows) is not good practice, because the resulting training set may not represent the whole data.

In [4]:
np.random.seed(1337) 
number_of_rows = diabetes.shape[0]
index_train = np.random.choice(range(number_of_rows), int(0.8 * number_of_rows), replace=False)
index_test = np.asarray(list(set(range(number_of_rows)) - set(index_train)))
train_set = diabetes.iloc[index_train] 
test_set = diabetes.iloc[index_test] 
print(train_set.shape)
print(test_set.shape)

(1600, 9)
(400, 9)
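The split above can also be sketched with a single random permutation of the row positions, which yields the training and testing indices in one step. This is a minimal sketch, not the notebook's method: `number_of_rows = 2000` is a stand-in for `diabetes.shape[0]` (it matches the 1600 + 400 rows printed above), and `train_fraction` is a name introduced here for illustration. Because it uses NumPy's newer `default_rng` generator, it selects different rows than the `np.random.seed` version even with the same seed.

```python
import numpy as np

# Stand-in for diabetes.shape[0]; an assumption for this sketch.
number_of_rows = 2000
train_fraction = 0.8  # hypothetical name for the 80/20 rule of thumb

# Seeded generator so the split is reproducible.
rng = np.random.default_rng(1337)

# Every row position appears exactly once, in random order.
shuffled = rng.permutation(number_of_rows)

n_train = int(train_fraction * number_of_rows)
index_train = shuffled[:n_train]   # first 80% of the shuffled positions
index_test = shuffled[n_train:]    # remaining 20%

print(len(index_train), len(index_test))
```

With the notebook's data, `diabetes.iloc[index_train]` and `diabetes.iloc[index_test]` would then give the two sets, and the two index arrays cannot overlap because they are slices of the same permutation.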


In [5]:
train_set.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
967,2,81,72,15,76,30.1,0.547,25,0
628,5,128,80,0,0,34.6,0.144,45,0
577,2,118,80,0,0,42.9,0.693,21,1
89,1,107,68,19,0,26.5,0.165,24,0
1171,3,102,74,0,0,29.5,0.121,32,0


Now we create the training and testing sets for the $Y$ variable.

In [6]:
y_train = train_set['Outcome']
train_set = train_set.drop(columns='Outcome')
y_test = test_set['Outcome']
test_set = test_set.drop(columns='Outcome')

In [7]:
y_train

967     0
628     0
577     1
89      0
1171    0
       ..
4       0
1972    0
1920    1
1947    0
647     1
Name: Outcome, Length: 1600, dtype: int64

## Training the model

With the `statsmodels` library, it is very easy to train the logit model:

In [8]:
logistic_model = sm.Logit(y_train, train_set).fit()
logistic_model.summary()

Optimization terminated successfully.
         Current function value: 0.604441
         Iterations 5


Dep. Variable:,Outcome,No. Observations:,1600.0
Model:,Logit,Df Residuals:,1592.0
Method:,MLE,Df Model:,7.0
Date:,"Sun, 01 Jan 2023",Pseudo R-squ.:,0.05647
Time:,00:19:12,Log-Likelihood:,-967.11
converged:,True,LL-Null:,-1025.0
Covariance Type:,nonrobust,LLR p-value:,5.808e-22

,coef,std err,z,P>|z|,[0.025,0.975]
Pregnancies,0.1352,0.021,6.541,0.000,0.095,0.176
Glucose,0.0137,0.002,7.312,0.000,0.010,0.017
BloodPressure,-0.0311,0.003,-9.478,0.000,-0.038,-0.025
SkinThickness,0.0037,0.004,0.898,0.369,-0.004,0.012
Insulin,2.743e-05,0.001,0.049,0.961,-0.001,0.001
BMI,-0.0076,0.007,-1.067,0.286,-0.022,0.006
DiabetesPedigreeFunction,0.0943,0.169,0.560,0.576,-0.236,0.425
Age,-0.0125,0.006,-2.115,0.034,-0.024,-0.001
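
To make the coefficient table concrete, the sketch below computes a predicted probability by hand. A logit model maps the linear score $x^\top \beta$ to a probability through $p = 1 / (1 + e^{-x^\top \beta})$. Since no constant column was added to `train_set`, the fitted model has no intercept term. The coefficients are copied from the summary above, so they are rounded, and the input is the first row shown by `diabetes.head()`. The names `coefficients`, `patient`, and the 0.5 decision threshold are assumptions made for illustration.

```python
import math

# Rounded coefficients copied from the summary table (no intercept was fitted).
coefficients = {
    "Pregnancies": 0.1352,
    "Glucose": 0.0137,
    "BloodPressure": -0.0311,
    "SkinThickness": 0.0037,
    "Insulin": 2.743e-05,
    "BMI": -0.0076,
    "DiabetesPedigreeFunction": 0.0943,
    "Age": -0.0125,
}

# Predictor values of the first row displayed by diabetes.head().
patient = {
    "Pregnancies": 2,
    "Glucose": 138,
    "BloodPressure": 62,
    "SkinThickness": 35,
    "Insulin": 0,
    "BMI": 33.6,
    "DiabetesPedigreeFunction": 0.127,
    "Age": 47,
}

# Linear score: sum of coefficient * value over all predictors.
linear_score = sum(coefficients[name] * patient[name] for name in coefficients)

# Logistic (sigmoid) function turns the score into a probability.
probability = 1.0 / (1.0 + math.exp(-linear_score))

# Assumed decision rule: predict class 1 when the probability is at least 0.5.
predicted_class = int(probability >= 0.5)

print(round(probability, 3), predicted_class)
```

Because the coefficients are rounded, the result can differ slightly from what `logistic_model.predict` returns for the same row.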


For more information, see *Johnson R.A., Wichern D.W., "Applied Multivariate Statistical Analysis", 6th ed., Pearson, 2013*.

In [9]:
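import pickle
# Use a context manager so the file is flushed and closed after writing
with open('model.pkl', 'wb') as model_file:
    pickle.dump(logistic_model, model_file)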
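
A saved model is only useful if it can be read back, so here is a minimal sketch of the pickle round trip. To keep it runnable on its own, it pickles a small dictionary as a stand-in for the fitted model and writes to a temporary directory; `stand_in_model`, `path`, and `restored` are names introduced for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; an assumption so the sketch runs by itself.
stand_in_model = {"Pregnancies": 0.1352, "Glucose": 0.0137}

# Write into a temporary directory instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# Serialize the object to disk.
with open(path, "wb") as model_file:
    pickle.dump(stand_in_model, model_file)

# Read it back; the restored object is an equal copy of the original.
with open(path, "rb") as model_file:
    restored = pickle.load(model_file)

print(restored == stand_in_model)
```

In the notebook, the same pattern would load `model.pkl` with `pickle.load`, and the restored results object could then be used for prediction, for example `restored.predict(test_set)`. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.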