# Predicting Diabetes
using Random Forests and Logistic Regression


In [2]:
import pandas as pd
import numpy as np

## The Data
This section reads in the data and runs a few basic checks.
The dataset can be found on Kaggle, <a href="https://www.kaggle.com/houcembenmansour/predict-diabetes-based-on-diagnostic-measures" target="_blank">here</a>.

In [3]:
df = pd.read_csv('diabetes.csv')

In [4]:
df.head()

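|   | patient_number | cholesterol | glucose | hdl_chol | chol_hdl_ratio | age | gender | height | weight | bmi | systolic_bp | diastolic_bp | waist | hip | waist_hip_ratio | diabetes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 193 | 77 | 49 | 3,9 | 19 | female | 61 | 119 | 22,5 | 118 | 70 | 32 | 38 | 0,84 | No diabetes |
| 1 | 2 | 146 | 79 | 41 | 3,6 | 19 | female | 60 | 135 | 26,4 | 108 | 58 | 33 | 40 | 0,83 | No diabetes |
| 2 | 3 | 217 | 75 | 54 | 4 | 20 | female | 67 | 187 | 29,3 | 110 | 72 | 40 | 45 | 0,89 | No diabetes |
| 3 | 4 | 226 | 97 | 70 | 3,2 | 20 | female | 64 | 114 | 19,6 | 122 | 64 | 31 | 39 | 0,79 | No diabetes |
| 4 | 5 | 164 | 91 | 67 | 2,4 | 20 | female | 70 | 141 | 20,2 | 122 | 86 | 32 | 39 | 0,82 | No diabetes |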


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 390 entries, 0 to 389
Data columns (total 16 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   patient_number   390 non-null    int64 
 1   cholesterol      390 non-null    int64 
 2   glucose          390 non-null    int64 
 3   hdl_chol         390 non-null    int64 
 4   chol_hdl_ratio   390 non-null    object
 5   age              390 non-null    int64 
 6   gender           390 non-null    object
 7   height           390 non-null    int64 
 8   weight           390 non-null    int64 
 9   bmi              390 non-null    object
 10  systolic_bp      390 non-null    int64 
 11  diastolic_bp     390 non-null    int64 
 12  waist            390 non-null    int64 
 13  hip              390 non-null    int64 
 14  waist_hip_ratio  390 non-null    object
 15  diabetes         390 non-null    object
dtypes: int64(11), object(5)
memory usage: 48.9+ KB


In [6]:
df.isnull().sum()

patient_number     0
cholesterol        0
glucose            0
hdl_chol           0
chol_hdl_ratio     0
age                0
gender             0
height             0
weight             0
bmi                0
systolic_bp        0
diastolic_bp       0
waist              0
hip                0
waist_hip_ratio    0
diabetes           0
dtype: int64

## Adjusting Datatypes
Several columns are stored as objects (strings) when they should be numeric. We'll start with chol_hdl_ratio, bmi, and waist_hip_ratio.
These columns hold decimal values, but they are written with a comma instead of a period as the decimal separator, so pandas reads them as strings.


In [7]:
df['chol_hdl_ratio'] = df['chol_hdl_ratio'].str.replace(',', '.').astype(float)
df['bmi'] = df['bmi'].str.replace(',', '.').astype(float)
df['waist_hip_ratio'] = df['waist_hip_ratio'].str.replace(',', '.').astype(float)
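The same conversion can be wrapped in a small helper so it is applied uniformly to every affected column. This is only a sketch: `comma_decimals_to_float` and the `sample` frame are hypothetical names introduced here, and the sample values are the first two rows of the dataset. pandas' `read_csv` also has a `decimal` parameter that may handle this at load time, depending on how the file quotes these fields.

```python
import pandas as pd

def comma_decimals_to_float(frame, columns):
    """Return a copy of `frame` with comma-decimal string columns parsed as floats."""
    out = frame.copy()
    for col in columns:
        # regex=False treats ',' as a literal character rather than a pattern
        out[col] = pd.to_numeric(out[col].str.replace(',', '.', regex=False))
    return out

# Tiny stand-in for the real data (hypothetical frame, first two rows only)
sample = pd.DataFrame({
    'chol_hdl_ratio': ['3,9', '3,6'],
    'bmi': ['22,5', '26,4'],
    'waist_hip_ratio': ['0,84', '0,83'],
})

converted = comma_decimals_to_float(sample, ['chol_hdl_ratio', 'bmi', 'waist_hip_ratio'])
print(converted.dtypes)
```

Using `pd.to_numeric` rather than `astype(float)` is a design choice, not a requirement: both raise on a value that cannot be parsed, so a malformed entry will not slip through silently.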

Next are gender and diabetes. Each takes one of two values but is currently stored as a string, so we encode them as 0/1 integers (male = 1 and Diabetes = 1).

In [8]:
df['gender'] = (df['gender']=='male').astype(int)
df['diabetes'] = (df['diabetes']=='Diabetes').astype(int)
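One caveat with an equality comparison is that any unexpected label (a typo, different capitalisation, a missing value) is silently encoded as 0. A stricter variant, sketched below, maps the labels explicitly and fails on anything it does not recognise. The helper name `encode_binary` and the small series are hypothetical; only the four label strings come from the dataset.

```python
import pandas as pd

def encode_binary(series, mapping):
    """Map labels to integers, failing loudly on any label missing from `mapping`."""
    encoded = series.map(mapping)
    if encoded.isna().any():
        # Collect the offending labels so the error message is actionable
        unexpected = sorted(series[encoded.isna()].astype(str).unique())
        raise ValueError(f"unexpected labels: {unexpected}")
    return encoded.astype(int)

# Small stand-in series using the labels that appear in the dataset
gender = pd.Series(['female', 'male', 'female'])
diabetes = pd.Series(['No diabetes', 'Diabetes', 'No diabetes'])

gender_encoded = encode_binary(gender, {'female': 0, 'male': 1})
diabetes_encoded = encode_binary(diabetes, {'No diabetes': 0, 'Diabetes': 1})
print(gender_encoded.tolist(), diabetes_encoded.tolist())
```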

In [9]:
df.head(2)

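|   | patient_number | cholesterol | glucose | hdl_chol | chol_hdl_ratio | age | gender | height | weight | bmi | systolic_bp | diastolic_bp | waist | hip | waist_hip_ratio | diabetes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 193 | 77 | 49 | 3.9 | 19 | 0 | 61 | 119 | 22.5 | 118 | 70 | 32 | 38 | 0.84 | 0 |
| 1 | 2 | 146 | 79 | 41 | 3.6 | 19 | 0 | 60 | 135 | 26.4 | 108 | 58 | 33 | 40 | 0.83 | 0 |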


The patient_number column is only a row identifier with no diagnostic meaning, so we drop it to keep the models from fitting to it.

In [10]:
df.drop('patient_number', axis=1, inplace=True)

Before we move on to the Random Forest and Logistic Regression models, we should look at the balance of the target class.

In [11]:
df['diabetes'].value_counts()

0    330
1     60
Name: diabetes, dtype: int64

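Note that the classes are imbalanced: 330 of the 390 patients (about 85%) have no diabetes. A model that always predicts "no diabetes" would already reach about 85% accuracy on the full dataset, so accuracy alone can be misleading here, and the models will likely have a harder time identifying the diabetes cases. Precision and recall on the diabetes class are the numbers to watch.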
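To put a number on the imbalance, the class counts can be turned into a majority-class baseline, that is, the accuracy of a model that ignores the features entirely. The sketch below uses only the two counts printed above; the variable names are illustrative.

```python
# Class counts reported by value_counts() above
counts = {0: 330, 1: 60}

total = sum(counts.values())
# Accuracy obtained by always predicting the most frequent class
majority_baseline = max(counts.values()) / total
# Fraction of patients labelled as diabetic
minority_share = counts[1] / total

print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
print(f"share of diabetes cases: {minority_share:.1%}")
```

Any model accuracy reported later is best read against this baseline rather than against 0%.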


## Splitting the data

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X = df.drop('diabetes', axis=1)
y = df['diabetes']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
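With only 60 positive cases, a purely random split can leave the train and test sets with noticeably different class ratios. `train_test_split` accepts a `stratify` argument that keeps the ratio roughly equal in both parts. The sketch below uses synthetic stand-in arrays (`X_demo`, `y_demo`) with the same class counts as the dataset; it is an alternative to the split used in this notebook, not a description of it, and the results shown later were produced without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the dataset (330 vs 60)
y_demo = np.array([0] * 330 + [1] * 60)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# stratify=y_demo keeps the positive rate close to the overall rate in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print("overall positive rate:", round(y_demo.mean(), 3))
print("train positive rate:  ", round(y_tr.mean(), 3))
print("test positive rate:   ", round(y_te.mean(), 3))
```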

## Random Forest 

In [14]:
from sklearn.ensemble import RandomForestClassifier

In [15]:
rfc = RandomForestClassifier(n_estimators=500)

In [16]:
rfc.fit(X_train, y_train)

RandomForestClassifier(n_estimators=500)

In [17]:
rfc_pred = rfc.predict(X_test)

In [18]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [19]:
print(confusion_matrix(y_test, rfc_pred))
print('\n')
print(classification_report(y_test, rfc_pred))

[[100   4]
 [  8  17]]


              precision    recall  f1-score   support

           0       0.93      0.96      0.94       104
           1       0.81      0.68      0.74        25

    accuracy                           0.91       129
   macro avg       0.87      0.82      0.84       129
weighted avg       0.90      0.91      0.90       129
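The figures in the classification report follow directly from the confusion matrix. As a check, the sketch below recomputes the diabetes-class metrics by hand from the matrix printed above; `binary_metrics` is a hypothetical helper written for this illustration.

```python
def binary_metrics(cm):
    """Compute accuracy plus positive-class precision, recall and F1 from a 2x2 matrix.

    `cm` follows scikit-learn's layout: rows are true labels, columns are
    predictions, so cm = [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)   # of the predicted positives, how many are real
    recall = tp / (tp + fn)      # of the real positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Random forest confusion matrix from the output above
rf_metrics = binary_metrics([[100, 4], [8, 17]])
print({name: round(value, 2) for name, value in rf_metrics.items()})
```

The recall of 17/25 means 8 of the 25 diabetic patients in the test set were missed, which the 91% accuracy figure does not make obvious.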



## Logistic Regression

In [20]:
from sklearn.linear_model import LogisticRegression

In [21]:
logmodel = LogisticRegression(max_iter=10000)

In [22]:
logmodel.fit(X_train, y_train)

LogisticRegression(max_iter=10000)
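The large `max_iter` value suggests the solver may need many iterations on the raw features; that is an inference, not something the notebook states. A common remedy is to standardise the features before fitting, which usually lets the solver converge within its default budget. The sketch below shows the pattern on synthetic stand-in data generated with `make_classification`; in the notebook, `X_train` and `y_train` would take the place of `X_demo` and `y_demo`, and the effect on the real data is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 390 rows, 14 features, roughly 85/15 class split
X_demo, y_demo = make_classification(
    n_samples=390, n_features=14, weights=[0.85, 0.15], random_state=42
)

# The scaler is fitted inside the pipeline, so it only ever sees the data passed to fit()
scaled_logmodel = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logmodel.fit(X_demo, y_demo)

# make_pipeline names each step after its lowercased class name
n_iter = scaled_logmodel.named_steps['logisticregression'].n_iter_
print("solver iterations used:", n_iter[0])
```

Keeping the scaler inside the pipeline also means the test set is transformed with statistics learned from the training set only.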

In [23]:
Log_Pred = logmodel.predict(X_test)

In [24]:
print(confusion_matrix(y_test, Log_Pred))
print('\n')
print(classification_report(y_test, Log_Pred))

[[100   4]
 [ 11  14]]


              precision    recall  f1-score   support

           0       0.90      0.96      0.93       104
           1       0.78      0.56      0.65        25

    accuracy                           0.88       129
   macro avg       0.84      0.76      0.79       129
weighted avg       0.88      0.88      0.88       129



## Conclusion

In [25]:
print('Random Forest accuracy: {}%'.format(accuracy_score(y_test, rfc_pred)*100))
print('Logistic Regression accuracy: {}%'.format(accuracy_score(y_test, Log_Pred)*100))

Random Forest accuracy: 90.69767441860465%
Logistic Regression accuracy: 88.37209302325581%


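Random forest is slightly more accurate on this split (90.7% versus 88.4%), though the two are close. The difference comes entirely from the diabetes class: both models misclassify 4 of the 104 non-diabetic patients, but random forest identifies 17 of the 25 diabetic patients (recall 0.68) while logistic regression identifies 14 (recall 0.56). For context, always predicting "no diabetes" would score 104/129, about 80.6%, on this test set.
Both models could likely be improved with a more balanced dataset. As mentioned before, this dataset is skewed quite heavily towards "no diabetes".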
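Short of collecting more data, one way to compensate for the imbalance is to reweight the classes during training. Both `LogisticRegression` and `RandomForestClassifier` accept `class_weight='balanced'`. The sketch below compares diabetes-class recall with and without reweighting on synthetic stand-in data; it illustrates the mechanism only, and whether it helps on the real dataset has not been tested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly the same size and class ratio as the dataset
X_demo, y_demo = make_classification(
    n_samples=390, n_features=14, weights=[0.85, 0.15], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

recalls = {}
for label, weight in [('unweighted', None), ('balanced', 'balanced')]:
    # class_weight='balanced' scales each class inversely to its frequency
    model = LogisticRegression(max_iter=10000, class_weight=weight)
    model.fit(X_tr, y_tr)
    recalls[label] = recall_score(y_te, model.predict(X_te))

print(recalls)
```

Reweighting typically trades some precision for recall, so the full classification report is worth comparing rather than recall alone.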