Diabetes Prediction Model


Data Description


The Behavioral Risk Factor Surveillance System (BRFSS) is a health-related telephone survey conducted annually by the CDC. Each year, the survey collects responses from over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the use of preventative services. It has been conducted every year since 1984. For this project, a CSV of the 2015 dataset available on Kaggle was used. The original dataset contains responses from 441,455 individuals and has 330 features. These features are either questions asked directly of participants or variables calculated from individual participant responses.
This dataset, diabetes_012_health_indicators_BRFSS2015.csv, is a cleaned dataset of 253,680 survey responses to the CDC's BRFSS 2015. It has 21 feature variables, and its classes are imbalanced. The target variable Diabetes_012 has 3 classes:
0 is for No Diabetes or only during pregnancy,
1 is for Prediabetes, and
2 is for Diabetes.
Research in the field has identified the following as important risk factors for diabetes and other chronic illnesses like heart disease (not in strict order of importance):
Blood Pressure (high)
Cholesterol (high)
Smoking
Diabetes
Obesity
Age
Sex
Race
Diet
Exercise
Alcohol Consumption
BMI
Household Income
Marital Status
Sleep
Time since last checkup
Education
Health Care Coverage
Mental Health
Here, Diabetes_012 is our target variable, and we will try to predict a person's diabetes status (no diabetes, prediabetes, or diabetes) from the 21 features using the Random Forest algorithm, an ensemble method based on bagging.
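To make the idea of bagging concrete, here is a minimal sketch in plain Python. It is not the notebook's model: the data is a hypothetical one-feature toy set generated at random (not BRFSS), and `fit_stump` and `bagged_predict` are illustrative helper names. Each weak model is fitted on a bootstrap sample (rows drawn with replacement), and the ensemble predicts by majority vote. A Random Forest additionally considers a random subset of features at each split, which this sketch leaves out.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Return the threshold that classifies the most rows of `sample` correctly.

    Each row is (value, label); the stump predicts 1 when value >= threshold.
    """
    best_threshold, best_correct = None, -1
    for threshold, _ in sample:
        correct = sum((value >= threshold) == bool(label) for value, label in sample)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def bagged_predict(thresholds, value):
    """Majority vote over the individual stump predictions."""
    votes = [int(value >= t) for t in thresholds]
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)
# Hypothetical toy data (not BRFSS): one BMI-like feature, two classes.
data = [(rng.gauss(25, 3), 0) for _ in range(50)]
data += [(rng.gauss(33, 3), 1) for _ in range(50)]

thresholds = []
for _ in range(25):
    # Bootstrap: draw len(data) rows with replacement, then fit one weak model.
    bootstrap = [rng.choice(data) for _ in data]
    thresholds.append(fit_stump(bootstrap))

low = bagged_predict(thresholds, 22.0)
high = bagged_predict(thresholds, 38.0)
print("prediction for 22.0:", low)
print("prediction for 38.0:", high)
```

Averaging many models fitted on different bootstrap samples tends to reduce variance compared with a single model, which is the motivation for using an ensemble here.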




Conclusion


Accuracy: The accuracy of the classifier on the validation set is approximately 0.848, or 84.8%. This metric measures the overall correctness of the classifier's predictions.
Classification Report: The classification report provides additional metrics such as precision, recall, and F1-score for each class. Here's a breakdown:
For Class 0 (Non-diabetic):
Precision is approximately 0.86, which means that among the instances predicted as Class 0, about 86% actually belong to Class 0.
Recall is approximately 0.98, which means that among all the actual Class 0 instances, about 98% were correctly identified.
The F1-score is the harmonic mean of precision and recall.
For Class 1 (Pre-diabetic):
Precision is 0.0, which means that no instance was correctly predicted as Class 1; in this run the model predicted no instances as Class 1 at all. This might indicate an issue with class imbalance or data quality.
Recall is 0.0, indicating that none of the actual Class 1 instances were identified.
For Class 2 (Diabetic):
Precision is approximately 0.56, indicating that about 56% of the instances predicted as Class 2 actually belong to Class 2.
Recall is approximately 0.16, indicating that only about 16% of the actual Class 2 instances were identified.
Confusion Matrix: The confusion matrix provides a detailed breakdown of true positives, true negatives, false positives, and false negatives for each class. It allows you to see how the classifier's predictions match the actual classes.
For Class 0: There are 41,884 true positives (correctly predicted) and 820 false negatives.
For Class 1: There are 0 true positives, 0 false positives (the model never predicts this class), and 941 false negatives.
For Class 2: There are 1,132 true positives and 5,959 false negatives.
Because precision is below 1.0 for Classes 0 and 2, both classes also receive false positives: instances of other classes that were predicted as Class 0 or Class 2.
The "UndefinedMetricWarning" indicates that precision and F1-score are ill-defined for Class 1 because the model predicted no samples for this class. The validation set does contain Class 1 instances (941), so the most likely cause is the severe class imbalance rather than missing data.
Overall, the model appears to perform well for Class 0 but fails on Class 1 and identifies only a small share of Class 2 (recall of about 0.16), possibly due to data imbalance or data quality issues for those classes. Further analysis and data preprocessing may be needed to improve performance on the minority classes.
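The recall and accuracy figures above can be recomputed from the true-positive and false-negative counts quoted in the text. The sketch below does that in plain Python; the helper names `recall` and `f1` are illustrative, and the precision of 0.86 passed to `f1` is the rounded value from the text, so the resulting F1 is only approximate.

```python
# True-positive and false-negative counts per class, as quoted in the text above.
counts = {
    0: {"tp": 41_884, "fn": 820},
    1: {"tp": 0, "fn": 941},
    2: {"tp": 1_132, "fn": 5_959},
}


def recall(tp, fn):
    """Recall = TP / (TP + FN); defined as 0.0 when the class has no samples."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(precision, recall_value):
    """F1 is the harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall_value
    return 2 * precision * recall_value / total if total else 0.0


recalls = {label: recall(**c) for label, c in counts.items()}

# Every sample is either a true positive or a false negative of its own class,
# so the TP + FN totals add up to the size of the validation set.
total_samples = sum(c["tp"] + c["fn"] for c in counts.values())
accuracy = sum(c["tp"] for c in counts.values()) / total_samples

for label, value in recalls.items():
    print(f"class {label}: recall = {value:.2f}")
print(f"accuracy = {accuracy:.3f}")
print(f"approximate F1 for class 0 = {f1(0.86, recalls[0]):.2f}")
```

Working from the counts this way makes it easy to check that the quoted recall values and the overall accuracy are consistent with each other.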

In [1]:
#import the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [2]:
#reading the file "diabetes_012_health_indicators_BRFSS2015.csv" into a variable named dataframe
dataframe=pd.read_csv("diabetes_012_health_indicators_BRFSS2015.csv")
dataframe

Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253675,0.0,1.0,1.0,1.0,45.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,5.0,0.0,1.0,5.0,6.0,7.0
253676,2.0,1.0,1.0,1.0,18.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,4.0,0.0,0.0,1.0,0.0,11.0,2.0,4.0
253677,0.0,0.0,0.0,1.0,28.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0,5.0,2.0
253678,0.0,1.0,0.0,1.0,23.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,7.0,5.0,1.0


In [3]:
#looking for any null values, if present
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Diabetes_012          253680 non-null  float64
 1   HighBP                253680 non-null  float64
 2   HighChol              253680 non-null  float64
 3   CholCheck             253680 non-null  float64
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  float64
 6   Stroke                253680 non-null  float64
 7   HeartDiseaseorAttack  253680 non-null  float64
 8   PhysActivity          253680 non-null  float64
 9   Fruits                253680 non-null  float64
 10  Veggies               253680 non-null  float64
 11  HvyAlcoholConsump     253680 non-null  float64
 12  AnyHealthcare         253680 non-null  float64
 13  NoDocbcCost           253680 non-null  float64
 14  GenHlth               253680 non-null  float64
 15  

In [4]:
#analyzing the data by correlating "Diabetes_012" with all other features
correlations=dataframe.corr()['Diabetes_012']
correlations

Diabetes_012            1.000000
HighBP                  0.271596
HighChol                0.209085
CholCheck               0.067546
BMI                     0.224379
Smoker                  0.062914
Stroke                  0.107179
HeartDiseaseorAttack    0.180272
PhysActivity           -0.121947
Fruits                 -0.042192
Veggies                -0.058972
HvyAlcoholConsump      -0.057882
AnyHealthcare           0.015410
NoDocbcCost             0.035436
GenHlth                 0.302587
MentHlth                0.073507
PhysHlth                0.176287
DiffWalk                0.224239
Sex                     0.031040
Age                     0.185026
Education              -0.130517
Income                 -0.171483
Name: Diabetes_012, dtype: float64
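When reading the correlations, it can help to rank features by absolute value, so that negative relationships such as Income or Education are not overlooked. The sketch below does this in plain Python on a subset of the values copied from the output above; the dictionary is an illustrative stand-in for the `correlations` Series.

```python
# Pearson correlations with Diabetes_012, copied from the output above (subset).
correlations = {
    "HighBP": 0.271596,
    "HighChol": 0.209085,
    "BMI": 0.224379,
    "PhysActivity": -0.121947,
    "GenHlth": 0.302587,
    "DiffWalk": 0.224239,
    "Age": 0.185026,
    "Education": -0.130517,
    "Income": -0.171483,
    "AnyHealthcare": 0.015410,
}

# Rank by absolute value so strong negative relationships are not overlooked.
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

for name, value in ranked:
    print(f"{name:<14}{value:+.3f}")
```

On the full pandas Series, `correlations.drop('Diabetes_012').abs().sort_values(ascending=False)` should give the same ordering. Since the target is a three-level code rather than a continuous quantity, these coefficients are best treated as a rough guide.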

In [5]:
#Analyzing number of values in the target column
c_0= dataframe[dataframe['Diabetes_012']==0]
print("Number of zeros in this dataset: ",len(c_0))

c_1= dataframe[dataframe['Diabetes_012']==1]
print("Number of ones in this dataset: ",len(c_1))

c_2= dataframe[dataframe['Diabetes_012']==2]
print("Number of twos in this dataset: ",len(c_2))

Number of zeros in this dataset:  213703
Number of ones in this dataset:  4631
Number of twos in this dataset:  35346
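To quantify the imbalance, the counts printed above can be turned into class shares. This is a small sketch in plain Python using only those three numbers; the variable names are illustrative.

```python
# Class counts copied from the output above.
class_counts = {0: 213_703, 1: 4_631, 2: 35_346}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}

for label, share in shares.items():
    print(f"class {label}: {class_counts[label]:>7} rows ({share:.1%})")

# Ratio of the largest class to the smallest one.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())
print(f"largest / smallest class: {imbalance_ratio:.1f}x")
```

Class 1 (prediabetes) accounts for under 2% of the rows, which is why a classifier trained on the raw data can ignore it almost entirely.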


In [6]:
#dropping the duplicate values
dataframe.drop_duplicates(inplace=True)

In [7]:
#splitting the data into features and target variable
X=dataframe.drop('Diabetes_012',axis=1)
y=dataframe['Diabetes_012']

In [8]:
#splitting data into training and testing sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)

In [9]:
# Initialize the RandomForestClassifier with default parameters
random_forest_model=RandomForestClassifier()

# Fit the classifier to the training data
random_forest_model.fit(X_train,y_train)

In [10]:
# Use the trained classifier to make predictions on the testing data (X_test)
y_pred=random_forest_model.predict(X_test)

In [11]:
#Finding the accuracy score before oversampling the data
print("Accuracy Score before oversampling:",accuracy_score(y_test, y_pred))

Accuracy Score before oversampling: 0.8243793110951542


In [12]:
#getting the classification report for data before oversampling it
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

         0.0       0.85      0.96      0.90     38116
         1.0       0.00      0.00      0.00       906
         2.0       0.45      0.19      0.27      6935

    accuracy                           0.82     45957
   macro avg       0.43      0.38      0.39     45957
weighted avg       0.77      0.82      0.79     45957
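With classes this imbalanced, accuracy on its own can be misleading. The sketch below, in plain Python, uses only the numbers from the report above to compare the model's accuracy with a trivial baseline that always predicts the majority class, and to reproduce the macro averages. The variable names are illustrative.

```python
# Per-class support and scores copied from the classification report above.
support = {0: 38_116, 1: 906, 2: 6_935}
precision = {0: 0.85, 1: 0.00, 2: 0.45}
recall = {0: 0.96, 1: 0.00, 2: 0.19}
model_accuracy = 0.8243793110951542

total = sum(support.values())

# Accuracy of a trivial classifier that always predicts the majority class.
baseline_accuracy = max(support.values()) / total

# Macro average: unweighted mean over classes, so each class counts equally.
macro_precision = sum(precision.values()) / len(precision)
macro_recall = sum(recall.values()) / len(recall)

print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"model accuracy:                   {model_accuracy:.3f}")
print(f"macro precision: {macro_precision:.2f}, macro recall: {macro_recall:.2f}")
```

On these numbers the always-predict-0 baseline is slightly more accurate than the model, so the macro averages and per-class recall are the more informative figures here.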



In [13]:
# Initialize the SMOTE oversampler
oversampler = SMOTE()

# Use the fit_resample method to apply SMOTE oversampling to the input features (X) and target variable (y)
X_new, y_new = oversampler.fit_resample(X, y)

In [14]:
#splitting data into training and testing sets after doing oversampling
X_train,X_test,y_train,y_test=train_test_split(X_new,y_new,test_size=0.2,random_state=42)

In [15]:
# Initialize the RandomForestClassifier with default parameters
new_random_forest_model=RandomForestClassifier()

# Fit the classifier to the training data
new_random_forest_model.fit(X_train,y_train)

In [16]:
# Use the trained classifier to make predictions on the testing data (X_test)
new_y_pred=new_random_forest_model.predict(X_test)

In [17]:
#Finding the accuracy score after oversampling the data
print("Accuracy Score after oversampling:",accuracy_score(y_test, new_y_pred))

Accuracy Score after oversampling: 0.9207510106723492


In [18]:
#Classification report after doing oversampling
c1=classification_report(y_test,new_y_pred)
print(c1)

              precision    recall  f1-score   support

         0.0       0.86      0.95      0.90     37783
         1.0       0.98      0.96      0.97     38258
         2.0       0.93      0.85      0.89     37992

    accuracy                           0.92    114033
   macro avg       0.92      0.92      0.92    114033
weighted avg       0.92      0.92      0.92    114033

