This notebook predicts diabetes from a dataset of 100,000 people. The data includes details such as age, gender, and medical history. Only 8.5% of the people have diabetes, so the dataset is imbalanced, and choosing the right evaluation metric is key. Let's dive in!

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
# Reading the data
df=pd.read_csv('/kaggle/input/100000-diabetes-clinical-dataset/diabetes_dataset.csv')
df.head()

Unnamed: 0,year,gender,age,location,race:AfricanAmerican,race:Asian,race:Caucasian,race:Hispanic,race:Other,hypertension,heart_disease,smoking_history,bmi,hbA1c_level,blood_glucose_level,diabetes
0,2020,Female,32.0,Alabama,0,0,0,0,1,0,0,never,27.32,5.0,100,0
1,2015,Female,29.0,Alabama,0,1,0,0,0,0,0,never,19.95,5.0,90,0
2,2015,Male,18.0,Alabama,0,0,0,0,1,0,0,never,23.76,4.8,160,0
3,2015,Male,41.0,Alabama,0,0,1,0,0,0,0,never,27.32,4.0,159,0
4,2016,Female,52.0,Alabama,1,0,0,0,0,0,0,never,23.75,6.5,90,0


In [3]:
# Understanding the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 16 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   year                  100000 non-null  int64  
 1   gender                100000 non-null  object 
 2   age                   100000 non-null  float64
 3   location              100000 non-null  object 
 4   race:AfricanAmerican  100000 non-null  int64  
 5   race:Asian            100000 non-null  int64  
 6   race:Caucasian        100000 non-null  int64  
 7   race:Hispanic         100000 non-null  int64  
 8   race:Other            100000 non-null  int64  
 9   hypertension          100000 non-null  int64  
 10  heart_disease         100000 non-null  int64  
 11  smoking_history       100000 non-null  object 
 12  bmi                   100000 non-null  float64
 13  hbA1c_level           100000 non-null  float64
 14  blood_glucose_level   100000 non-null  int64  
 15  d

In [4]:
# Checking for missing values
df.isna().sum()

year                    0
gender                  0
age                     0
location                0
race:AfricanAmerican    0
race:Asian              0
race:Caucasian          0
race:Hispanic           0
race:Other              0
hypertension            0
heart_disease           0
smoking_history         0
bmi                     0
hbA1c_level             0
blood_glucose_level     0
diabetes                0
dtype: int64

No missing values found.

In [5]:
# Differentiating between object and non-object features
num_columns = []
non_num_columns = []
for column in df.columns:
  if df[column].dtypes == 'O':
    non_num_columns.append(column)
  else:
    num_columns.append(column)
print(num_columns)
print(non_num_columns)



['year', 'age', 'race:AfricanAmerican', 'race:Asian', 'race:Caucasian', 'race:Hispanic', 'race:Other', 'hypertension', 'heart_disease', 'bmi', 'hbA1c_level', 'blood_glucose_level', 'diabetes']
['gender', 'location', 'smoking_history']
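The same split can likely be obtained without an explicit loop by using `DataFrame.select_dtypes`. A minimal sketch on a toy frame (the frame `toy` and its values are made up for illustration and are not part of the dataset):

```python
import pandas as pd

# Toy frame standing in for df; values are illustrative only
toy = pd.DataFrame({
    "age": [32.0, 29.0],
    "gender": ["Female", "Male"],
    "bmi": [27.32, 19.95],
    "smoking_history": ["never", "current"],
})

# Columns with a numeric dtype vs. everything else
num_columns = toy.select_dtypes(include="number").columns.tolist()
non_num_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(num_columns)
print(non_num_columns)
```

Selecting by "number" rather than comparing against `'O'` also keeps working if string columns are stored with a dedicated string dtype.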


In [6]:
# Looking through all the categories in categorical features
for feature in df[non_num_columns].columns:
  print(df[feature].value_counts(sort=True))

gender
Female    58552
Male      41430
Other        18
Name: count, dtype: int64
location
Iowa                    2038
Nebraska                2038
Kentucky                2038
Hawaii                  2038
Florida                 2037
Minnesota               2037
New Jersey              2037
Arkansas                2037
Delaware                2036
Kansas                  2036
Michigan                2036
Massachusetts           2036
Maine                   2036
District of Columbia    2036
Louisiana               2036
Georgia                 2036
Oregon                  2036
Pennsylvania            2036
Alabama                 2036
Illinois                2036
Rhode Island            2035
Colorado                2035
Maryland                2035
New York                2035
Connecticut             2035
Mississippi             2035
Missouri                2035
Alaska                  2035
North Carolina          2035
New Hampshire           2035
North Dakota            2035
South Dakot

No malformed categories found (spelling mistakes, stray symbols, etc.).
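Eyeballing `value_counts()` works for a handful of categories; for longer lists, a small helper can flag labels that differ only by case or surrounding whitespace. This is a sketch, and `suspicious_categories` is a hypothetical helper name, not something used elsewhere in this notebook:

```python
import pandas as pd

def suspicious_categories(series: pd.Series) -> dict:
    """Group raw labels that collapse to the same value once case and
    surrounding whitespace are ignored; return only groups with >1 spelling."""
    groups = {}
    for raw in series.dropna().unique():
        key = str(raw).strip().lower()
        groups.setdefault(key, []).append(raw)
    return {k: v for k, v in groups.items() if len(v) > 1}

# Toy example with a deliberately inconsistent label
toy = pd.Series(["never", "Never ", "current", "former"])
print(suspicious_categories(toy))
```

An empty dictionary means no such near-duplicates were found; genuine misspellings would still need a manual look.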

In [7]:
# Checking the percentage of people having diabetes vs people not having diabetes
df['diabetes'].value_counts(normalize=True)

diabetes
0    0.915
1    0.085
Name: proportion, dtype: float64

Here, we check the balance of our target variable: whether a person has diabetes or not. Because the dataset is imbalanced (91.5% vs. 8.5%), accuracy is not a reliable metric on its own; a model that always predicts "no diabetes" would already score 91.5%. The F1 score of the positive class is a better fit, so it's important to check this early on.
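To make the point about accuracy concrete, here is a small sketch with synthetic labels that mimic the 8.5% positive rate (the arrays are made up; no model from this notebook is involved):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Synthetic labels with the same 8.5% positive rate as the dataset
y_true = np.array([1] * 85 + [0] * 915)

# A "model" that always predicts the majority class (no diabetes)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)
print(f"accuracy={acc:.3f} recall={rec:.3f} f1={f1:.3f}")
```

The accuracy looks respectable even though not a single diabetic case is found, while recall and F1 for the positive class both drop to zero.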

In [8]:
# Splitting X and y. We're predicting whether a person has diabetes based on the independent features.
X = df.drop('diabetes', axis=1)
y = df['diabetes']
X.head()

Unnamed: 0,year,gender,age,location,race:AfricanAmerican,race:Asian,race:Caucasian,race:Hispanic,race:Other,hypertension,heart_disease,smoking_history,bmi,hbA1c_level,blood_glucose_level
0,2020,Female,32.0,Alabama,0,0,0,0,1,0,0,never,27.32,5.0,100
1,2015,Female,29.0,Alabama,0,1,0,0,0,0,0,never,19.95,5.0,90
2,2015,Male,18.0,Alabama,0,0,0,0,1,0,0,never,23.76,4.8,160
3,2015,Male,41.0,Alabama,0,0,1,0,0,0,0,never,27.32,4.0,159
4,2016,Female,52.0,Alabama,1,0,0,0,0,0,0,never,23.75,6.5,90


In [9]:
y.head()

0    0
1    0
2    0
3    0
4    0
Name: diabetes, dtype: int64

Before doing any feature engineering, we split the data into training and test sets to avoid data leakage. This way, we assess our models on unseen data and get an honest estimate of how well they generalize. We hold out 10% of the data as the test set, since 10,000 rows should be sufficient to evaluate the models.

In [10]:
# Splitting the data - Train Set and Test Set before feature engineering
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.1, random_state=13)
# info() prints its summary itself and returns None, so no print() wrapper is needed
X_train.info()
X_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 90000 entries, 91768 to 98642
Data columns (total 15 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   year                  90000 non-null  int64  
 1   gender                90000 non-null  object 
 2   age                   90000 non-null  float64
 3   location              90000 non-null  object 
 4   race:AfricanAmerican  90000 non-null  int64  
 5   race:Asian            90000 non-null  int64  
 6   race:Caucasian        90000 non-null  int64  
 7   race:Hispanic         90000 non-null  int64  
 8   race:Other            90000 non-null  int64  
 9   hypertension          90000 non-null  int64  
 10  heart_disease         90000 non-null  int64  
 11  smoking_history       90000 non-null  object 
 12  bmi                   90000 non-null  float64
 13  hbA1c_level           90000 non-null  float64
 14  blood_glucose_level   90000 non-null  int64  
dtypes: float64(3), int64
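The split above is purely random. With only 8.5% positives, one option worth considering is passing `stratify=y` so that both parts keep the same class ratio. A minimal sketch on synthetic data (`X_demo` and `y_demo` are stand-ins, not the notebook's variables):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10,000 rows, 8.5% positives (mirrors the class balance above)
rng = np.random.default_rng(13)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.array([1] * 850 + [0] * 9_150)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=13, stratify=y_demo
)

# With stratify, both parts keep (almost exactly) the original positive rate
print(y_tr.mean(), y_te.mean())
```

Note that switching the real split to a stratified one would change the train/test rows and therefore the numbers reported further down.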

In [11]:
df.columns

Index(['year', 'gender', 'age', 'location', 'race:AfricanAmerican',
       'race:Asian', 'race:Caucasian', 'race:Hispanic', 'race:Other',
       'hypertension', 'heart_disease', 'smoking_history', 'bmi',
       'hbA1c_level', 'blood_glucose_level', 'diabetes'],
      dtype='object')

In [12]:
import warnings
warnings.filterwarnings('ignore')

For preprocessing, we scale the numerical features and one-hot encode the categorical ones. Scaling puts all numeric features on a comparable scale, which matters for distance-based and regularized models such as KNN, SVC, and logistic regression. One-hot encoding is used because the categories have no inherent order. Note that OHE increases the dimensionality: the 15 input columns become 73 after encoding.

In [13]:
# OneHotEncoding categorical features, Scaling numeric features
# Using one hot encoding as there's no inherent order in the categories
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
columntransformer = ColumnTransformer([
        ('scale', StandardScaler(), ['year', 'age', 'race:AfricanAmerican',
                                     'race:Asian', 'race:Caucasian', 'race:Hispanic', 'race:Other',
                                     'hypertension', 'heart_disease', 'bmi',
                                     'hbA1c_level', 'blood_glucose_level']),
        ('ohe', OneHotEncoder(drop='first'), ['gender', 'smoking_history', 'location'])
    ]
)

# Fit the transformer on the training data only, then apply it to both sets
X_train_transformed = (columntransformer.fit_transform(X_train))
X_test_transformed = (columntransformer.transform(X_test))


In [14]:
# Converting the transformed matrix into a dataframe to improve readability
newcolnames=columntransformer.get_feature_names_out()
X_train_transformed=pd.DataFrame(X_train_transformed.toarray(), columns=newcolnames)
X_test_transformed=pd.DataFrame(X_test_transformed.toarray(), columns=newcolnames)
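One caveat: `.toarray()` exists only on SciPy sparse matrices, and `ColumnTransformer` returns a dense array when the combined output is denser than its `sparse_threshold` (0.3 by default). A slightly more defensive conversion could look like the sketch below, where `to_frame` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd
from scipy import sparse

def to_frame(matrix, columns) -> pd.DataFrame:
    """Build a DataFrame from transformer output, dense or sparse."""
    if sparse.issparse(matrix):
        matrix = matrix.toarray()
    return pd.DataFrame(matrix, columns=columns)

# Works for both kinds of input
dense = np.array([[1.0, 0.0], [0.0, 1.0]])
sparse_m = sparse.csr_matrix(dense)
print(to_frame(dense, ["a", "b"]).equals(to_frame(sparse_m, ["a", "b"])))
```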

In [15]:
X_test_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 73 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   scale__year                         10000 non-null  float64
 1   scale__age                          10000 non-null  float64
 2   scale__race:AfricanAmerican         10000 non-null  float64
 3   scale__race:Asian                   10000 non-null  float64
 4   scale__race:Caucasian               10000 non-null  float64
 5   scale__race:Hispanic                10000 non-null  float64
 6   scale__race:Other                   10000 non-null  float64
 7   scale__hypertension                 10000 non-null  float64
 8   scale__heart_disease                10000 non-null  float64
 9   scale__bmi                          10000 non-null  float64
 10  scale__hbA1c_level                  10000 non-null  float64
 11  scale__blood_glucose_level          10000 

We try out several classification models with their default hyperparameters to see which one performs best. This gives us a baseline to compare against once we start tuning.

Hyperparameter tuning follows in the cells below.

In [16]:
# Trying out different models with default hyperparameters (only max_iter is raised for Logistic Regression)
models={
    'Logistic Regression':LogisticRegression(max_iter=500),
    'SVC':SVC(),
    'Naive Bayes': GaussianNB(),
    'KNN':KNeighborsClassifier(),
    'Decision Tree':DecisionTreeClassifier(),
    'Random Forest':RandomForestClassifier(),
}

for model_name, model in models.items():
  print(model_name)
  model.fit(X_train_transformed, y_train)
  pred=model.predict(X_test_transformed)
  print(confusion_matrix(y_test, pred))
  print(classification_report(y_test, pred))
  print('------------------------------')

Logistic Regression
[[9105   61]
 [ 289  545]]
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      9166
           1       0.90      0.65      0.76       834

    accuracy                           0.96     10000
   macro avg       0.93      0.82      0.87     10000
weighted avg       0.96      0.96      0.96     10000

------------------------------
SVC
[[9148   18]
 [ 316  518]]
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      9166
           1       0.97      0.62      0.76       834

    accuracy                           0.97     10000
   macro avg       0.97      0.81      0.87     10000
weighted avg       0.97      0.97      0.96     10000

------------------------------
Naive Bayes
[[5235 3931]
 [  48  786]]
              precision    recall  f1-score   support

           0       0.99      0.57      0.72      9166
           1       0.17      0.94      0.28       834

   

Without hyperparameter tuning:
* Naive Bayes has the fewest false negatives (48 people with diabetes predicted as not having it), but it also has by far the most false positives (3,931), and its accuracy is the lowest.
* KNN has a low F1 score.
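The precision, recall, and F1 values in the reports can be recomputed directly from the printed confusion matrices, which helps when comparing models side by side. A small sketch, where `positive_class_metrics` is a hypothetical helper and the matrices are copied from the output above:

```python
import numpy as np

def positive_class_metrics(cm):
    """Precision, recall and F1 for class 1 from a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]] (scikit-learn's convention)."""
    (tn, fp), (fn, tp) = np.asarray(cm)
    precision = float(tp / (tp + fp))
    recall = float(tp / (tp + fn))
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

# Confusion matrices copied from the output above
results = {
    "Logistic Regression": [[9105, 61], [289, 545]],
    "SVC": [[9148, 18], [316, 518]],
    "Naive Bayes": [[5235, 3931], [48, 786]],
}
for name, cm in results.items():
    print(name, positive_class_metrics(cm))
```

For Naive Bayes this reproduces the pattern described in the bullets: very high recall paired with very low precision.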


Now we tune hyperparameters for the different algorithms to improve on the baselines. Note: we're using GridSearchCV, and the dataset is large and high-dimensional, so this can take quite some time to run. GridSearchCV scores candidates by accuracy unless a `scoring` argument is passed, so the `best_score_` values printed below are cross-validated accuracies. I didn't explore all the hyperparameters given the cost, but there is certainly room to improve the models further.
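Since F1 is the metric of interest here, one option is to have the grid search rank candidates by F1 as well. The sketch below shows the idea on a small synthetic problem; the data, the grid over `C`, and the fold setup are assumptions for illustration and not the settings used in the cells that follow:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Small synthetic, imbalanced problem (about 10% positives) as a stand-in
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)

search = GridSearchCV(
    LogisticRegression(max_iter=700),
    param_grid={"C": [0.1, 1, 10]},
    scoring="f1",  # rank candidates by F1 of the positive class
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

On the full dataset, passing `n_jobs=-1` to `GridSearchCV` runs the fits in parallel on all available cores, which can shorten the wait considerably.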

In [17]:
# Logistic Regression with hyperparameter tuning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
param={
    'penalty': ['l1', 'l2'],
    'solver':['liblinear','saga'],
    'max_iter':[700],
    'random_state':[42]
}
lr=GridSearchCV(LogisticRegression(), param, cv=3)
lr.fit(X_train_transformed, y_train)
pred=lr.predict(X_test_transformed)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
print(lr.best_params_, lr.best_score_)


[[9104   62]
 [ 289  545]]
              precision    recall  f1-score   support

           0       0.97      0.99      0.98      9166
           1       0.90      0.65      0.76       834

    accuracy                           0.96     10000
   macro avg       0.93      0.82      0.87     10000
weighted avg       0.96      0.96      0.96     10000

{'max_iter': 700, 'penalty': 'l1', 'random_state': 42, 'solver': 'liblinear'} 0.9596


In [18]:
# Hyperparameter tuning for Naive Bayes
param={
'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]
}
nb=GridSearchCV(GaussianNB(), param, cv=2)
nb.fit(X_train_transformed, y_train)
pred=nb.predict(X_test_transformed)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
print(nb.best_params_, nb.best_score_)

[[7938 1228]
 [ 202  632]]
              precision    recall  f1-score   support

           0       0.98      0.87      0.92      9166
           1       0.34      0.76      0.47       834

    accuracy                           0.86     10000
   macro avg       0.66      0.81      0.69     10000
weighted avg       0.92      0.86      0.88     10000

{'var_smoothing': 1e-05} 0.8383111111111111


In [19]:
# Hyperparameter tuning for KNN
param={
'n_neighbors': [3, 5, 7, 9, 11, 15, 20],
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan', 'minkowski']
}
knn=GridSearchCV(KNeighborsClassifier(), param, cv=2)
knn.fit(X_train_transformed, y_train)
pred=knn.predict(X_test_transformed)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
print(knn.best_params_, knn.best_score_)

[[9139   27]
 [ 368  466]]
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      9166
           1       0.95      0.56      0.70       834

    accuracy                           0.96     10000
   macro avg       0.95      0.78      0.84     10000
weighted avg       0.96      0.96      0.96     10000

{'metric': 'euclidean', 'n_neighbors': 11, 'weights': 'uniform'} 0.9560777777777778


In [20]:
# Hyperparameter tuning for SVC
param={
    'kernel': ['linear', 'rbf'],
    'C': [0.1, 1, 10]
}
svc=GridSearchCV(SVC(), param, cv=2)
svc.fit(X_train_transformed, y_train)
pred=svc.predict(X_test_transformed)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
print(svc.best_params_, svc.best_score_)


[[9128   38]
 [ 278  556]]
              precision    recall  f1-score   support

           0       0.97      1.00      0.98      9166
           1       0.94      0.67      0.78       834

    accuracy                           0.97     10000
   macro avg       0.95      0.83      0.88     10000
weighted avg       0.97      0.97      0.97     10000

{'C': 10, 'kernel': 'rbf'} 0.9621888888888889


In [21]:
# Hyperparameter tuning for RandomForest
param={
'n_estimators':[100, 300, 500],
'max_depth':[20, 30,40],
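# Note: 'auto' is deprecated since scikit-learn 1.1 and removed in 1.3; for classifiers it meant 'sqrt'
'max_features': ['auto', 'sqrt'],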
'random_state':[42]
}
rf=GridSearchCV(RandomForestClassifier(), param, cv=2)
rf.fit(X_train_transformed, y_train)
pred=rf.predict(X_test_transformed)
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))
print(rf.best_params_, rf.best_score_)

[[9164    2]
 [ 240  594]]
              precision    recall  f1-score   support

           0       0.97      1.00      0.99      9166
           1       1.00      0.71      0.83       834

    accuracy                           0.98     10000
   macro avg       0.99      0.86      0.91     10000
weighted avg       0.98      0.98      0.97     10000

{'max_depth': 30, 'max_features': 'auto', 'n_estimators': 300, 'random_state': 42} 0.9715444444444444


The Random Forest model delivered the best overall results among the tuned models, with the highest F1 score for the diabetes class (0.83), which is the important metric for an imbalanced dataset. Its precision is close to 1.00 (only 2 false positives), and its recall of 0.71 is the highest among the tuned models apart from Naive Bayes, which reaches 0.76 but at a precision of just 0.34. Recall is still the weak spot: 240 of the 834 people with diabetes in the test set are missed.

Boosting algorithms might improve the model further; that's something to try next.