# Diabetes Prediction using Logistic Regression

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [2]:
df = pd.read_csv('diabetes_dataset.csv')

In [3]:
df.head()

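| | year | gender | age | location | race:AfricanAmerican | race:Asian | race:Caucasian | race:Hispanic | race:Other | hypertension | heart_disease | smoking_history | bmi | hbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020 | Female | 32.0 | Alabama | 0 | 0 | 0 | 0 | 1 | 0 | 0 | never | 27.32 | 5.0 | 100 | 0 |
| 1 | 2015 | Female | 29.0 | Alabama | 0 | 1 | 0 | 0 | 0 | 0 | 0 | never | 19.95 | 5.0 | 90 | 0 |
| 2 | 2015 | Male | 18.0 | Alabama | 0 | 0 | 0 | 0 | 1 | 0 | 0 | never | 23.76 | 4.8 | 160 | 0 |
| 3 | 2015 | Male | 41.0 | Alabama | 0 | 0 | 1 | 0 | 0 | 0 | 0 | never | 27.32 | 4.0 | 159 | 0 |
| 4 | 2016 | Female | 52.0 | Alabama | 1 | 0 | 0 | 0 | 0 | 0 | 0 | never | 23.75 | 6.5 | 90 | 0 |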


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 16 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   year                  100000 non-null  int64  
 1   gender                100000 non-null  object 
 2   age                   100000 non-null  float64
 3   location              100000 non-null  object 
 4   race:AfricanAmerican  100000 non-null  int64  
 5   race:Asian            100000 non-null  int64  
 6   race:Caucasian        100000 non-null  int64  
 7   race:Hispanic         100000 non-null  int64  
 8   race:Other            100000 non-null  int64  
 9   hypertension          100000 non-null  int64  
 10  heart_disease         100000 non-null  int64  
 11  smoking_history       100000 non-null  object 
 12  bmi                   100000 non-null  float64
 13  hbA1c_level           100000 non-null  float64
 14  blood_glucose_level   100000 non-null  int64  
 15  d

## Machine Learning Analysis

Before training the model, we preprocess the data so that it is in a form the algorithm can use. Logistic regression cannot work directly with categorical variables, especially when they are stored as strings, so we apply one-hot encoding, a common preprocessing technique. It replaces each nominal feature with one new column per unique value, and each new column holds a binary indicator of whether the row has that value.

In [5]:
df = pd.get_dummies(df, columns=["gender", "location", "smoking_history"]) # Perform one-hot encoding
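
To make the effect of one-hot encoding concrete, here is a minimal sketch on a three-row toy frame. The frame `toy` and its values are made up for illustration and are not taken from the dataset; `dtype=int` is an assumption added so the indicators print as 0/1, since recent pandas versions default to booleans.

```python
import pandas as pd

# Hypothetical toy data with a single nominal column.
toy = pd.DataFrame({
    "age": [32.0, 29.0, 18.0],
    "gender": ["Female", "Female", "Male"],
})

# Each unique value of "gender" becomes its own indicator column.
# dtype=int requests 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["gender"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

The original `gender` column is removed and replaced by `gender_Female` and `gender_Male`, which mirrors the `gender_*`, `location_*` and `smoking_history_*` columns in the features list of this notebook.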

In [6]:
print("Features List:\n")
print(df.columns.tolist())

print(f"\nFeatures Count: {df.columns.size}")

Features List:

['year', 'age', 'race:AfricanAmerican', 'race:Asian', 'race:Caucasian', 'race:Hispanic', 'race:Other', 'hypertension', 'heart_disease', 'bmi', 'hbA1c_level', 'blood_glucose_level', 'diabetes', 'gender_Female', 'gender_Male', 'gender_Other', 'location_Alabama', 'location_Alaska', 'location_Arizona', 'location_Arkansas', 'location_California', 'location_Colorado', 'location_Connecticut', 'location_Delaware', 'location_District of Columbia', 'location_Florida', 'location_Georgia', 'location_Guam', 'location_Hawaii', 'location_Idaho', 'location_Illinois', 'location_Indiana', 'location_Iowa', 'location_Kansas', 'location_Kentucky', 'location_Louisiana', 'location_Maine', 'location_Maryland', 'location_Massachusetts', 'location_Michigan', 'location_Minnesota', 'location_Mississippi', 'location_Missouri', 'location_Montana', 'location_Nebraska', 'location_Nevada', 'location_New Hampshire', 'location_New Jersey', 'location_New Mexico', 'location_New York', 'location_North Carol

We split our dataset into a training set and a test set, which lets us evaluate the model later on data it has not seen during training.

In [9]:
train_set, test_set = train_test_split(df, test_size=0.2) # Using an 8:2 ratio for the train-test split

print(f"Training Set: {len(train_set)} samples (80%)")
print(f"Test Set: {len(test_set)} samples (20%)")

Training Set: 80000 samples (80%)
Test Set: 20000 samples (20%)
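
The split above is random, so the rows in each set change from run to run. As a hedged sketch, the snippet below shows how `train_test_split` can be made reproducible with `random_state` and how `stratify` keeps the class ratio equal in both sets. The toy frame and the names `train_part` and `test_part` are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 10 of them positive.
toy = pd.DataFrame({
    "bmi": [float(20 + i % 15) for i in range(100)],
    "diabetes": [1 if i % 10 == 0 else 0 for i in range(100)],
})

train_part, test_part = train_test_split(
    toy,
    test_size=0.2,             # same 8:2 ratio as in the notebook
    random_state=0,            # fixed seed so the split is repeatable
    stratify=toy["diabetes"],  # preserve the class ratio in both parts
)

print(len(train_part), len(test_part))
print(train_part["diabetes"].mean(), test_part["diabetes"].mean())
```

Whether stratification is appropriate here depends on how balanced the `diabetes` column is, which this notebook does not inspect.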


Define the input variables and target variable for training and testing

In [10]:
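# Drop the target by name: a positional slice such as iloc[:, 0:75] would also
# pick up 'diabetes' (position 12 in the column list) and leak it into the inputs
train_X = train_set.drop(columns=["diabetes"]).to_numpy(dtype=float) # Input variables
train_y = train_set["diabetes"].to_numpy() # Target variable

test_X = test_set.drop(columns=["diabetes"]).to_numpy(dtype=float) # Input variables
test_y = test_set["diabetes"].to_numpy() # Target variable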
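
Because the features list printed earlier places `diabetes` at position 12, between the original columns and the one-hot columns, choosing inputs by name is safer than choosing them by position. The following is a minimal sketch of that idea on made-up rows; `toy`, `target` and `feature_columns` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical rows; the target sits in the middle of the frame,
# as it does after one-hot encoding in this notebook.
toy = pd.DataFrame({
    "age": [32.0, 29.0, 52.0, 41.0],
    "bmi": [27.32, 19.95, 23.75, 27.32],
    "diabetes": [0, 0, 1, 0],
    "gender_Female": [1, 1, 1, 0],
})

target = "diabetes"
# Every column except the target is treated as an input feature.
feature_columns = [c for c in toy.columns if c != target]

X = toy[feature_columns].to_numpy(dtype=float)
y = toy[target].to_numpy()

# Guard against target leakage before any training happens.
assert target not in feature_columns
print(feature_columns, X.shape, y.shape)
```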

Define the logistic regression model

In [44]:
model = LogisticRegression(random_state=0, max_iter=2000) # Configured to ensure convergence when training
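
Raising `max_iter` is one way to reach convergence. Another commonly used option, which is not what this notebook does, is to standardize the features first so that the solver needs fewer iterations. The sketch below assumes synthetic data and an illustrative name, `scaled_model`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (made up for illustration).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(40, 15, 200),   # an age-like column
    rng.normal(140, 40, 200),  # a glucose-like column
])
y = (X[:, 1] > 140).astype(int)  # synthetic label, not a medical rule

# StandardScaler rescales each column to zero mean and unit variance
# before the logistic regression is fitted.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=0, max_iter=2000),
)
scaled_model.fit(X, y)

# n_iter_ reports how many solver iterations were actually used.
n_iter = scaled_model.named_steps["logisticregression"].n_iter_
print(n_iter)
```

Putting the scaler inside the pipeline means it is fitted on the training data only and then reused unchanged when predicting.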

Train the model on the training inputs and their target labels

In [45]:
model.fit(train_X, train_y) # Fit model to training data

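Evaluate the model using our test set. The 100.0% accuracy recorded below was obtained with the `diabetes` column included among the inputs (target leakage), so it does not reflect real predictive performance; expect a lower figure once the target is excluded from the features.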

In [46]:
test_accuracy = model.score(test_X, test_y)

print(f"Test Accuracy: {test_accuracy * 100}%")

Test Accuracy: 100.0%
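
Accuracy summarizes performance in a single number. A confusion matrix together with precision and recall shows which class the errors fall on. The sketch below uses hand-written labels (`y_true` and `y_pred` are hypothetical, not model output from this notebook) with functions from `sklearn.metrics`.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: one positive case is missed by the predictions.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_pred = [0, 0, 0, 0, 1, 0, 0, 1]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # of predicted positives, how many are right
recall = recall_score(y_true, y_pred)        # of actual positives, how many are found
cm = confusion_matrix(y_true, y_pred)        # rows: actual class, columns: predicted class

print(f"Accuracy: {accuracy:.3f}, Precision: {precision:.3f}, Recall: {recall:.3f}")
print(cm)
```

In the notebook, the same calls could be applied to `test_y` and `model.predict(test_X)`.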


Perform predictions using sample data

In [28]:
# model.predict([Insert sample data here...])
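
A new sample has to be encoded the same way as the training data and have its columns in the same order before it is passed to `predict`. The following is a minimal sketch under stated assumptions: the training rows, `demo_model` and the sample values are made up, and `reindex` is used to fill in indicator columns that are absent from the sample.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical, already one-hot-encoded training data.
train = pd.DataFrame({
    "age": [25.0, 30.0, 45.0, 50.0, 60.0, 65.0],
    "blood_glucose_level": [90, 100, 130, 160, 200, 220],
    "gender_Female": [1, 0, 1, 0, 1, 0],
    "gender_Male": [0, 1, 0, 1, 0, 1],
    "diabetes": [0, 0, 0, 1, 1, 1],
})
feature_columns = [c for c in train.columns if c != "diabetes"]

demo_model = LogisticRegression(random_state=0, max_iter=2000)
demo_model.fit(
    train[feature_columns].to_numpy(dtype=float),
    train["diabetes"].to_numpy(),
)

# A raw sample, still holding the categorical value as a string.
sample = pd.DataFrame([{"age": 55.0, "blood_glucose_level": 210, "gender": "Female"}])
sample = pd.get_dummies(sample, columns=["gender"], dtype=int)
# Align with the training columns; indicators missing from the sample become 0.
sample = sample.reindex(columns=feature_columns, fill_value=0)

sample_X = sample.to_numpy(dtype=float)
prediction = demo_model.predict(sample_X)           # predicted class label
probabilities = demo_model.predict_proba(sample_X)  # probability for each class

print(prediction, probabilities)
```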