# Introduction

*This project has an educational purpose: it lets me apply and deepen my understanding of logistic regression and machine learning.*

The goal of this project is to use logistic regression and machine learning to extract information about the relationships between features, focusing on how much income influences the probability of developing diabetes, given the amount of data available.
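
As a rough illustration of what "the probability of developing diabetes" means in a logistic regression, the sketch below evaluates a single-feature logistic model on the income code. The weight and bias are hypothetical values chosen only for illustration; they are not fitted to the dataset, and the function names are assumptions of this sketch.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def diabetes_probability(income_code: int, weight: float, bias: float) -> float:
    """Single-feature logistic model: P(diabetes = 1 | income) = sigmoid(w * income + b)."""
    return sigmoid(weight * income_code + bias)


# Hypothetical parameters, NOT estimated from the data.
w, b = -0.2, 0.9

# Income is coded from 1 to 8 in the dataset; print the modelled probability for a few codes.
for income_code in (1, 4, 8):
    print(income_code, round(diabetes_probability(income_code, w, b), 3))
```

With a negative weight, the modelled probability falls as the income code rises; whether the real data behaves this way is what the rest of the project investigates.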

# Basic imports

In [1]:
import torch
import numpy as np
import matplotlib.pyplot as mp
import pandas as pd
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Data

This dataset is based on the Behavioral Risk Factor Surveillance System (BRFSS), a major survey conducted by the Centers for Disease Control and Prevention (CDC) in the United States. The survey collects general health data from all states and territories of the country.

In [2]:
# https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv
# https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf
df = pd.read_csv("archive/diabetes_binary_5050split_health_indicators_BRFSS2015.csv")

In [3]:
df

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,0.0,...,1.0,0.0,3.0,5.0,30.0,0.0,1.0,4.0,6.0,8.0
1,0.0,1.0,1.0,1.0,26.0,1.0,1.0,0.0,0.0,1.0,...,1.0,0.0,3.0,0.0,0.0,0.0,1.0,12.0,6.0,8.0
2,0.0,0.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,1.0,...,1.0,0.0,1.0,0.0,10.0,0.0,1.0,13.0,6.0,8.0
3,0.0,1.0,1.0,1.0,28.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,3.0,0.0,3.0,0.0,1.0,11.0,6.0,8.0
4,0.0,0.0,0.0,1.0,29.0,1.0,0.0,0.0,1.0,1.0,...,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
70687,1.0,0.0,1.0,1.0,37.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,4.0,0.0,0.0,0.0,0.0,6.0,4.0,1.0
70688,1.0,0.0,1.0,1.0,29.0,1.0,0.0,1.0,0.0,1.0,...,1.0,0.0,2.0,0.0,0.0,1.0,1.0,10.0,3.0,6.0
70689,1.0,1.0,1.0,1.0,25.0,0.0,0.0,1.0,0.0,1.0,...,1.0,0.0,5.0,15.0,0.0,1.0,0.0,13.0,6.0,4.0
70690,1.0,1.0,1.0,1.0,18.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,4.0,0.0,0.0,1.0,0.0,11.0,2.0,4.0


**Table description**

| Name | Description | Values |
| :-------- | :-------- | :-------- |
| Diabetes_binary | Dummy variable indicating whether someone has diabetes | 0 = no diabetes; 1 = pre-diabetes or diabetes |
| HighBP | Dummy variable indicating whether someone has high blood pressure | 0 = no; 1 = yes |
| HighChol | Dummy variable indicating whether someone has high cholesterol | 0 = no; 1 = yes |
| BMI | Body mass index | Numeric variable |
| Smoker | Dummy variable indicating whether someone has smoked at least 100 cigarettes in their entire life | 0 = no; 1 = yes |
| Stroke | Dummy variable indicating whether someone has had a stroke | 0 = no; 1 = yes |
| HeartDiseaseorAttack | Dummy variable indicating whether someone has had heart disease or a heart attack | 0 = no; 1 = yes |
| PhysActivity | Dummy variable indicating whether someone did any physical activity in the past 30 days | 0 = no; 1 = yes |
| Fruits | Dummy variable indicating whether someone eats fruit at least once per day | 0 = no; 1 = yes |
| Veggies | Dummy variable indicating whether someone eats vegetables at least once per day | 0 = no; 1 = yes |
| HvyAlcoholConsump | Heavy drinker: adult men having more than 14 drinks per week, adult women having more than 7 drinks per week | 0 = no; 1 = yes |
| AnyHealthcare | Has any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. | 0 = no; 1 = yes |
| NoDocbcCost | Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? | 0 = no; 1 = yes |
| GenHlth | Would you say that in general your health is... | Scale 1-5: 1 = excellent; 2 = very good; 3 = good; 4 = fair; 5 = poor |
| MentHlth | Days of poor mental health in the past 30 days | 0-30 days |
| PhysHlth | Days of physical illness or injury in the past 30 days | 0-30 days |
| DiffWalk | Dummy variable indicating whether someone has difficulty walking or climbing stairs | 0 = no; 1 = yes |
| Sex | Variable indicating someone's sex | 0 = female; 1 = male |
| Age | Variable indicating which age class someone is in | `_AGEG5YR`, see [codebook](https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf) |
| Education | Variable indicating the highest level of education someone completed | `EDUCA`, see [codebook](https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf) |
| Income | Variable indicating someone's income level | `INCOME2`, see [codebook](https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf) |
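
Because the ordinal columns store numeric codes, it can help to translate them into readable labels before plotting or inspecting them. The following is a minimal sketch for the `GenHlth` column, using the 1-5 scale listed in the table; the `toy` frame and the `GENHLTH_LABELS` name are stand-ins invented for this example, not part of the original notebook.

```python
import pandas as pd

# Labels for the general-health scale, as listed in the table above.
GENHLTH_LABELS = {1: "excellent", 2: "very good", 3: "good", 4: "fair", 5: "poor"}

# Toy frame standing in for `df`; the values are made up for illustration.
toy = pd.DataFrame({"GenHlth": [3.0, 1.0, 5.0, 2.0]})

# The CSV stores the codes as floats, so cast to int before mapping to labels.
toy["GenHlth_label"] = toy["GenHlth"].astype(int).map(GENHLTH_LABELS)
print(toy)
```

The same pattern would apply to `Age`, `Education`, and `Income`, with the label dictionaries taken from the codebook.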

In [4]:
# Check whether there are any null values
df.isnull().mean()

Diabetes_binary         0.0
HighBP                  0.0
HighChol                0.0
CholCheck               0.0
BMI                     0.0
Smoker                  0.0
Stroke                  0.0
HeartDiseaseorAttack    0.0
PhysActivity            0.0
Fruits                  0.0
Veggies                 0.0
HvyAlcoholConsump       0.0
AnyHealthcare           0.0
NoDocbcCost             0.0
GenHlth                 0.0
MentHlth                0.0
PhysHlth                0.0
DiffWalk                0.0
Sex                     0.0
Age                     0.0
Education               0.0
Income                  0.0
dtype: float64
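
A null check does not catch codes that fall outside the ranges documented in the table. The sketch below shows one possible way to count out-of-range values with `Series.between`; the `toy` frame and the `expected_ranges` dictionary are assumptions made for this example, and one invalid `GenHlth` value is planted on purpose.

```python
import pandas as pd

# Toy frame standing in for `df`; GenHlth deliberately contains an invalid code (7.0).
toy = pd.DataFrame({
    "HighBP": [0.0, 1.0, 1.0],
    "GenHlth": [1.0, 7.0, 3.0],
    "MentHlth": [0.0, 30.0, 5.0],
})

# Inclusive (min, max) ranges, taken from the table description.
expected_ranges = {"HighBP": (0, 1), "GenHlth": (1, 5), "MentHlth": (0, 30)}

# Count, per column, the values that fall outside the documented range.
out_of_range = {
    column: int((~toy[column].between(low, high)).sum())
    for column, (low, high) in expected_ranges.items()
}
print(out_of_range)
```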

## Distributions

In [44]:
dummy_columns = ["Diabetes_binary", "HighBP", "HighChol", "Smoker", "Stroke", "HeartDiseaseorAttack",
                 "PhysActivity", "Fruits", "Veggies", "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost",
                 "DiffWalk", "Sex"]

dummy_df = df[dummy_columns]

# Percentage of rows in which each dummy variable is nonzero (i.e. equal to 1)
for column in dummy_df.columns:
    percentage = (np.count_nonzero(dummy_df[column]) / dummy_df[column].count()) * 100
    print(f"{column}: {percentage:.2f}%")

Diabetes_binary: 50.00%
HighBP: 56.35%
HighChol: 52.57%
Smoker: 47.53%
Stroke: 6.22%
HeartDiseaseorAttack: 14.78%
PhysActivity: 70.30%
Fruits: 61.18%
Veggies: 78.88%
HvyAlcoholConsump: 4.27%
AnyHealthcare: 95.50%
NoDocbcCost: 9.39%
DiffWalk: 25.27%
Sex: 45.70%
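
The same percentages can also be computed without an explicit loop. The sketch below compares the loop-based calculation with a vectorized alternative on a small made-up frame (`toy` is an assumption of this example, not the real data): for 0/1 columns, the share of nonzero values is simply the column mean of the "not equal to zero" mask.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `dummy_df`; the values are made up for illustration.
toy = pd.DataFrame({
    "HighBP": [1.0, 0.0, 1.0, 1.0],
    "Stroke": [0.0, 0.0, 0.0, 1.0],
})

# Loop-based version: nonzero count divided by the number of non-null rows.
loop_result = {
    column: np.count_nonzero(toy[column]) / toy[column].count() * 100
    for column in toy.columns
}

# Vectorized version: mean of the boolean "nonzero" mask, per column.
vectorized = toy.ne(0).mean() * 100

print(loop_result)
print(vectorized)
```

Both versions agree as long as the columns contain no missing values, which the null check earlier suggests is the case here.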
