# Diabetes Prediction
### <i>Logistic Regression</i>

## About the Dataset
The dataset used for this model comes from Kaggle: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset/data. It is derived from the Behavioral Risk Factor Surveillance System (BRFSS), a health-related telephone survey that the CDC has conducted every year since 1984. Each year, the survey collects responses from over 400,000 Americans on health-related risk behaviors, chronic health conditions, and the use of preventative services. This project uses the CSV available on Kaggle for the 2015 survey. The original 2015 dataset contains responses from 441,455 individuals and has 330 features, which are either questions asked directly of participants or variables calculated from individual participant responses.
The dataset used here is a cleaned and filtered version of that original, prepared for developing prediction algorithms.

## Variable Description
<table>
    <thead>
        <tr>
            <th>Index</th>
            <th>Column Name</th>
            <th>Variable</th>
            <th>Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>0</td>
            <td>Diabetes_012</td>
            <td>Diabetes</td>
            <td>0 = no diabetes 1 = prediabetes 2 = diabetes</td>
        </tr>
        <tr>
            <td>1</td>
            <td>HighBP</td>
            <td>High Blood Pressure</td>
            <td>0 = no high BP 1 = high BP</td>
        </tr>
        <tr>
            <td>2</td>
            <td>HighChol</td>
            <td>High Cholesterol</td>
            <td>0 = no high cholesterol 1 = high cholesterol</td>
        </tr>
        <tr>
            <td>3</td>
            <td>CholCheck</td>
            <td>Cholesterol Check</td>
            <td>0 = no cholesterol check in 5 years 1 = yes cholesterol check in 5 years</td>
        </tr>
        <tr>
            <td>4</td>
            <td>BMI</td>
            <td>Body Mass Index</td>
            <td></td>
        </tr>
        <tr>
            <td>5</td>
            <td>Smoker</td>
            <td>Smoking Status</td>
            <td>Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no 1 = yes</td>
        </tr>
        <tr>
            <td>6</td>
            <td>Stroke</td>
            <td>Stroke History</td>
            <td>(Ever told) you had a stroke. 0 = no 1 = yes</td>
        </tr>
        <tr>
            <td>7</td>
            <td>HeartDiseaseorAttack</td>
            <td>Heart Disease or Heart Attack</td>
            <td>Coronary heart disease (CHD) or myocardial infarction (MI) 0 = no 1 = yes</td>
        </tr>
        <tr>
            <td>8</td>
            <td>PhysActivity</td>
            <td>Physical Activity</td>
            <td>Physical activity in past 30 days - not including job 0 = no 1 = yes</td>
        </tr>
        <tr>
            <td>9</td>
            <td>Fruits</td>
            <td>Fruit Consumption</td>
            <td>Consume Fruit 1 or more times per day 0 = no 1 = yes</td>
        </tr>
        <tr>
            <td>10</td>
            <td>Veggies</td>
            <td>Vegetable Consumption</td>
            <td>Consume Vegetables 1 or more times per day 0 = no 1 = yes</td>
        </tr>
        <tr>
            <td>11</td>
            <td>HvyAlcoholConsump</td>
            <td>Heavy Alcohol Consumption</td>
            <td>Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) 0 = no 1 = yes</td>
        </tr>
        <tr>
            <td>12</td>
            <td>AnyHealthcare</td>
            <td>Healthcare Coverage</td>
            <td>Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no 1 = yes</td>
        </tr>
        <tr>
            <td>13</td>
            <td>NoDocbcCost</td>
            <td>Doctor Visit Cost Barrier</td>
            <td>Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no 1 = yes</td>
        </tr>
        <tr>
            <td>14</td>
            <td>GenHlth</td>
            <td>General Health Rating</td>
            <td>Would you say that in general your health is: scale 1-5 1 = excellent 2 = very good 3 = good 4 = fair 5 = poor</td>
        </tr>
        <tr>
            <td>15</td>
            <td>MentHlth</td>
            <td>Mental Health Rating</td>
            <td>Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good?</td>
        </tr>
        <tr>
            <td>16</td>
            <td>PhysHlth</td>
            <td>Physical Health Rating</td>
            <td>Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?</td>
        </tr>
        <tr>
            <td>17</td>
            <td>DiffWalk</td>
            <td>Walking Difficulty</td>
            <td>Do you have serious difficulty walking or climbing stairs? 0 = no 1 = yes</td>
        </tr>
        <tr>
            <td>18</td>
            <td>Sex</td>
            <td>Sex</td>
            <td>0 = female 1 = male</td>
        </tr>
        <tr>
            <td>19</td>
            <td>Age</td>
            <td>Age</td>
            <td>13-level age category 1 = 18-24 9 = 60-64 13 = 80 or older</td>
        </tr>
        <tr>
            <td>20</td>
            <td>Education</td>
            <td>Education Level</td>
            <td>Education level scale 1-6 1 = Never attended school or only kindergarten 2 = Grades 1 through 8</td>
        </tr>
        <tr>
            <td>21</td>
            <td>Income</td>
            <td>Income</td>
            <td>Income scale 1-8 1 = less than $10,000 5 = less than $35,000 8 = $75,000 or more</td>
        </tr>
    </tbody>
</table>

## Importing Dependencies

In [2]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Importing the CSV File into a pandas DataFrame

In [3]:
data = pd.read_csv("diabetes_data.csv")
data.head()

Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,1.0,1.0,40.0,1.0,0.0,0.0,0.0,0.0,...,1.0,0.0,5.0,18.0,15.0,1.0,0.0,9.0,4.0,3.0
1,0.0,0.0,0.0,0.0,25.0,1.0,0.0,0.0,1.0,0.0,...,0.0,1.0,3.0,0.0,0.0,0.0,0.0,7.0,6.0,1.0
2,0.0,1.0,1.0,1.0,28.0,0.0,0.0,0.0,0.0,1.0,...,1.0,1.0,5.0,30.0,30.0,1.0,0.0,9.0,4.0,8.0
3,0.0,1.0,0.0,1.0,27.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,11.0,3.0,6.0
4,0.0,1.0,1.0,1.0,24.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,3.0,0.0,0.0,0.0,11.0,5.0,4.0


In [3]:
data.shape

(253680, 22)

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Diabetes_012          253680 non-null  float64
 1   HighBP                253680 non-null  float64
 2   HighChol              253680 non-null  float64
 3   CholCheck             253680 non-null  float64
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  float64
 6   Stroke                253680 non-null  float64
 7   HeartDiseaseorAttack  253680 non-null  float64
 8   PhysActivity          253680 non-null  float64
 9   Fruits                253680 non-null  float64
 10  Veggies               253680 non-null  float64
 11  HvyAlcoholConsump     253680 non-null  float64
 12  AnyHealthcare         253680 non-null  float64
 13  NoDocbcCost           253680 non-null  float64
 14  GenHlth               253680 non-null  float64
 15  

In [6]:
data.Income.unique()

array([3., 1., 8., 6., 4., 7., 2., 5.])

In [7]:
data.Income.info()

<class 'pandas.core.series.Series'>
RangeIndex: 253680 entries, 0 to 253679
Series name: Income
Non-Null Count   Dtype  
--------------   -----  
253680 non-null  float64
dtypes: float64(1)
memory usage: 1.9 MB


In [20]:
data.Diabetes_012.unique()

array([0., 2., 1.])

In [4]:
# Changing the datatype of Diabetes_012 column into category
data["Diabetes_012"] = data['Diabetes_012'].astype("category", copy = False)
data["Diabetes_012"].info()

<class 'pandas.core.series.Series'>
RangeIndex: 253680 entries, 0 to 253679
Series name: Diabetes_012
Non-Null Count   Dtype   
--------------   -----   
253680 non-null  category
dtypes: category(1)
memory usage: 248.0 KB


In [21]:
# Recoding the target as binary: 1 = diabetes (original value 2),
# 0 = no diabetes or prediabetes (original values 0 and 1)
data.Diabetes_012 = [1 if value == 2.0 else 0 for value in data.Diabetes_012]

In [25]:
data["Diabetes_012"].unique()

[0, 1]
Categories (2, int64): [0, 1]
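
The list comprehension above visits every row in Python. As a minimal sketch, assuming the column holds plain floats, the same recoding can be written as a vectorized comparison. The series `target` below is a made-up stand-in for the real column, not data from the survey.

```python
import pandas as pd

# Toy stand-in for the Diabetes_012 column (hypothetical values, not the real data)
target = pd.Series([0.0, 2.0, 1.0, 2.0, 0.0], name="Diabetes_012")

# True where the original label is 2 (diabetes); no diabetes (0) and
# prediabetes (1) both become 0 after the cast to integers
binary_target = (target == 2.0).astype(int)

print(binary_target.tolist())  # [0, 1, 0, 1, 0]
```

Note that this merges prediabetes into the negative class, exactly as the notebook's recoding does.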

## Splitting Features and Target

In [26]:
X = data.drop(['Diabetes_012'], axis = 1)
y = data['Diabetes_012']

## Data Standardization / Normalization

In [27]:
# Creating a Scaler Object
scaler = StandardScaler()

# Fitting the scaler to the dataset and transforming the features
X_scaled = scaler.fit_transform(X)

In [28]:
print(X_scaled)

[[ 1.15368814  1.16525449  0.19692156 ...  0.31690008 -1.06559465
  -1.4744874 ]
 [-0.86678537 -0.85818163 -5.07816412 ... -0.33793279  0.96327159
  -2.44013754]
 [ 1.15368814  1.16525449  0.19692156 ...  0.31690008 -1.06559465
   0.93963796]
 ...
 [-0.86678537 -0.85818163  0.19692156 ... -1.97501498 -0.05116153
  -1.95731247]
 [ 1.15368814 -0.85818163  0.19692156 ... -0.33793279 -0.05116153
  -2.44013754]
 [ 1.15368814  1.16525449  0.19692156 ...  0.31690008  0.96327159
  -1.95731247]]
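
For reference, `StandardScaler` replaces each value x in a column with (x - mean) / std, where mean and std are computed per column. The sketch below checks this against a manual computation on a small made-up matrix; `X_demo` is illustrative and not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix with two columns on different scales
X_demo = np.array([[1.0, 40.0],
                   [0.0, 25.0],
                   [1.0, 28.0],
                   [0.0, 27.0]])

# StandardScaler subtracts each column's mean and divides by its
# population standard deviation (ddof=0, the NumPy default)
scaled_by_sklearn = StandardScaler().fit_transform(X_demo)
scaled_by_hand = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled_by_sklearn, scaled_by_hand))
```

After scaling, every column has mean 0 and standard deviation 1, which is why binary columns in the output above take only two distinct values.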


## Splitting the Dataset into Train and Test Data

In [29]:
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size = 0.30, random_state=42)

In [43]:
X_train.shape

(177576, 21)

In [44]:
X_test.shape

(76104, 21)
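
In this notebook the scaler is fitted on the full dataset before the split, so the test rows contribute to the means and standard deviations used for scaling. A common alternative, sketched here under the assumption that synthetic data from `make_classification` is an acceptable stand-in for the survey data, is to split first and let a pipeline fit the scaler on the training rows only. All `*_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 21 features, like the notebook's feature matrix
X_demo, y_demo = make_classification(n_samples=1000, n_features=21, random_state=42)

# Split first so the test rows never influence the scaler statistics
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42
)

# fit() learns the scaling parameters from the training rows only;
# score() reuses those parameters to transform the test rows
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train_demo, y_train_demo)

test_accuracy = model.score(X_test_demo, y_test_demo)
print(f"Test accuracy: {test_accuracy:.2f}")
```

With a dataset as large as this one the effect on the reported metrics is likely small, but the pipeline form keeps the evaluation free of information from the test set.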

## Building the Model

In [30]:
# Creating the logistic model
lr = LogisticRegression()

# Training the model on training data
lr.fit(X_train, y_train)

# Predict the target variable using test predictors
y_predict = lr.predict(X_test)

In [31]:
y_predict

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

## Testing the Accuracy of the Model

In [32]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_predict)
print(f"Accuracy: {accuracy: .2f}")

Accuracy:  0.87


The model predicts diabetes from the survey features with an accuracy of 87% on the held-out test set.
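
To put the accuracy figure in context, it helps to compare it with a trivial model that always predicts class 0. The short calculation below uses the class counts from the `support` column of this notebook's classification report (65,605 and 10,499); the variable names are illustrative.

```python
# Class counts in the test set, copied from the "support" column of the report
support_no_diabetes = 65605
support_diabetes = 10499

total = support_no_diabetes + support_diabetes

# Accuracy of a trivial model that always predicts class 0
baseline_accuracy = support_no_diabetes / total

print(f"Baseline accuracy: {baseline_accuracy:.3f}")  # Baseline accuracy: 0.862
```

Because about 86% of the test samples belong to class 0, the 87% accuracy is only slightly above this baseline, so per-class precision and recall are more informative here.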

## Summary of Classification Precision

In [33]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.88      0.98      0.93     65605
           1       0.54      0.16      0.24     10499

    accuracy                           0.87     76104
   macro avg       0.71      0.57      0.59     76104
weighted avg       0.83      0.87      0.83     76104
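
The report shows a recall of only 0.16 for class 1, meaning most diabetic cases are missed. One option that is often tried for imbalanced classes is `class_weight="balanced"` in `LogisticRegression`. The sketch below compares it with the default on synthetic imbalanced data; the data, the names, and the `max_iter` value are assumptions for illustration, and the results do not describe the survey dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 86% class 0), not the survey data
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=21, weights=[0.86, 0.14], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, stratify=y_demo, random_state=42
)

# Same model twice: default weights versus weights inversely
# proportional to class frequency
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))

# Rows are true classes, columns are predicted classes
cm_balanced = confusion_matrix(y_te, balanced.predict(X_te))
print(cm_balanced)
print(f"Recall for class 1: plain={recall_plain:.2f}, balanced={recall_balanced:.2f}")
```

Balanced weights usually raise recall for the minority class at the cost of precision and overall accuracy, so the right trade-off depends on how costly a missed case is compared with a false alarm.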

