# DIABETES

**Which health and circumstantial factors most influence cases of diabetes and pre-diabetes?**

https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset


### Data Exploration

In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


In [5]:
# Read the diabetes_012 file
file_path = './diabetes_012_health_indicators_BRFSS2015.csv'
diabetes_data = pd.read_csv(file_path)

# Basic information about the dataset
# (info() prints its report directly and returns None, so no print() is needed)
diabetes_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Diabetes_012          253680 non-null  float64
 1   HighBP                253680 non-null  float64
 2   HighChol              253680 non-null  float64
 3   CholCheck             253680 non-null  float64
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  float64
 6   Stroke                253680 non-null  float64
 7   HeartDiseaseorAttack  253680 non-null  float64
 8   PhysActivity          253680 non-null  float64
 9   Fruits                253680 non-null  float64
 10  Veggies               253680 non-null  float64
 11  HvyAlcoholConsump     253680 non-null  float64
 12  AnyHealthcare         253680 non-null  float64
 13  NoDocbcCost           253680 non-null  float64
 14  GenHlth               253680 non-null  float64
 15  

In [6]:
# Display the first 5 rows of the dataset
print(diabetes_data.head())

   Diabetes_012  HighBP  HighChol  CholCheck   BMI  Smoker  Stroke  \
0           0.0     1.0       1.0        1.0  40.0     1.0     0.0   
1           0.0     0.0       0.0        0.0  25.0     1.0     0.0   
2           0.0     1.0       1.0        1.0  28.0     0.0     0.0   
3           0.0     1.0       0.0        1.0  27.0     0.0     0.0   
4           0.0     1.0       1.0        1.0  24.0     0.0     0.0   

   HeartDiseaseorAttack  PhysActivity  Fruits  ...  AnyHealthcare  \
0                   0.0           0.0     0.0  ...            1.0   
1                   0.0           1.0     0.0  ...            0.0   
2                   0.0           0.0     1.0  ...            1.0   
3                   0.0           1.0     1.0  ...            1.0   
4                   0.0           1.0     1.0  ...            1.0   

   NoDocbcCost  GenHlth  MentHlth  PhysHlth  DiffWalk  Sex   Age  Education  \
0          0.0      5.0      18.0      15.0       1.0  0.0   9.0        4.0   
1     

In [7]:
# Get summary statistics of the dataset
print(diabetes_data.describe())

        Diabetes_012         HighBP       HighChol      CholCheck  \
count  253680.000000  253680.000000  253680.000000  253680.000000   
mean        0.296921       0.429001       0.424121       0.962670   
std         0.698160       0.494934       0.494210       0.189571   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000       0.000000       0.000000       1.000000   
50%         0.000000       0.000000       0.000000       1.000000   
75%         0.000000       1.000000       1.000000       1.000000   
max         2.000000       1.000000       1.000000       1.000000   

                 BMI         Smoker         Stroke  HeartDiseaseorAttack  \
count  253680.000000  253680.000000  253680.000000         253680.000000   
mean       28.382364       0.443169       0.040571              0.094186   
std         6.608694       0.496761       0.197294              0.292087   
min        12.000000       0.000000       0.000000              0.000000  

----------------------

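First we have the variable **Diabetes_012**, with a mean of 0.296921. Because the variable takes three codes, this mean is not a proportion: it equals the share of code 1 plus twice the share of code 2. The 75th percentile is 0, so at least 75% of respondents have no diabetes and at most 25% have diabetes or pre-diabetes. The maximum of 2 is consistent with the coding 0 (no diabetes), 1 (pre-diabetes), 2 (diabetes).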
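
The per-class shares can be read directly with `value_counts(normalize=True)`. The sketch below is illustrative only: `class_shares` is a hypothetical helper and `toy_target` is made-up data standing in for `diabetes_data['Diabetes_012']`.

```python
import pandas as pd


def class_shares(target: pd.Series) -> pd.Series:
    """Return the fraction of rows in each class, ordered by class code."""
    return target.value_counts(normalize=True).sort_index()


# Made-up stand-in for diabetes_data['Diabetes_012'] (not real survey values).
toy_target = pd.Series([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 2.0])

shares = class_shares(toy_target)

# Share of rows coded 1 (pre-diabetes) or 2 (diabetes).
any_diabetes = shares[shares.index > 0].sum()

# For a 0/1/2 code, the mean equals share(1) + 2 * share(2),
# so it is generally larger than the share of affected rows.
weighted_mean = sum(code * share for code, share in shares.items())

print(shares)
print(f"share coded 1 or 2: {any_diabetes:.2f}")
print(f"mean of the codes:  {toy_target.mean():.2f}")
```

On the real data the call would be `class_shares(diabetes_data['Diabetes_012'])`.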

Next come the binary variables, which take only the values 0 and 1. The share of respondents coded 1 is shown in parentheses:
- **HighBP** (42.9% positive)
- **HighChol** (42.4% positive)
- **CholCheck** (96.3% positive)
- **Smoker** (44.3% positive)
- **Stroke** (4.1% positive)
- **HeartDiseaseorAttack** (9.4% positive)
- **PhysActivity** (75.7% positive)
- **AnyHealthcare** (95.1% positive)
- **NoDocbcCost** (8.4% positive)
- **DiffWalk** (16.8% positive)
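
For a 0/1 column, the mean is exactly the fraction of rows equal to 1, which is how the percentages above can be derived from `describe()`. A minimal sketch, assuming a hypothetical helper `positive_rates` and a made-up `toy` frame in place of `diabetes_data`:

```python
import pandas as pd


def positive_rates(df: pd.DataFrame) -> pd.Series:
    """Percentage of rows equal to 1 for every column that only holds 0/1."""
    # A column counts as binary when all of its non-missing values are 0 or 1.
    is_binary = df.apply(lambda col: col.dropna().isin([0, 1]).all())
    binary_cols = is_binary[is_binary].index
    return (df[binary_cols].mean() * 100).round(1)


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "HighBP": [1.0, 0.0, 1.0, 0.0],
    "Stroke": [0.0, 0.0, 0.0, 1.0],
    "BMI": [40.0, 25.0, 28.0, 27.0],  # not binary, so it is skipped
})

rates = positive_rates(toy)
print(rates)
```

Calling `positive_rates(diabetes_data)` would also report any binary column that is not in the list above.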

On the other hand, **BMI** (body mass index) is a continuous variable with a mean of 28.38.
Its range (12 to 98) is very wide, so we should check it for potential outliers.
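
One common way to screen a variable like BMI is the interquartile-range (IQR) rule. The sketch below is only one possible check, not necessarily the method used later in this notebook. The helper `iqr_outlier_mask`, the factor `k=1.5`, and the `toy_bmi` values are all assumptions made for illustration.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up BMI values for illustration only.
toy_bmi = pd.Series([24.0, 25.0, 27.0, 28.0, 29.0, 31.0, 33.0, 98.0])

mask = iqr_outlier_mask(toy_bmi)
print(toy_bmi[mask])
```

A flagged value is not automatically an error: an extreme BMI can be a genuine observation, so the flagged rows should be inspected before anything is dropped.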

The variable **Age** has a mean of 8.03, which suggests that it is coded into categories rather than measured in years.
Indeed, **Age** takes values from 1 to 13, where category 1 covers individuals aged 18 to 24.
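
Readable labels make the age categories easier to interpret in plots and tables. Only category 1 (18 to 24) is stated here. The other labels in this sketch are an assumption based on the five-year age grouping used by BRFSS, with 13 as "80+", and should be verified against the dataset's codebook. `AGE_LABELS` and `label_age` are hypothetical names.

```python
import pandas as pd

# Category 1 is stated in the text above.
AGE_LABELS = {1: "18-24"}
# ASSUMPTION: categories 2..12 are five-year bands starting at 25.
AGE_LABELS.update(
    {code: f"{25 + 5 * (code - 2)}-{29 + 5 * (code - 2)}" for code in range(2, 13)}
)
# ASSUMPTION: category 13 is the open-ended top band.
AGE_LABELS[13] = "80+"


def label_age(age_codes: pd.Series) -> pd.Series:
    """Map numeric age categories to labels (expects no missing values)."""
    return age_codes.astype(int).map(AGE_LABELS)


# Made-up category codes for illustration only.
toy_age = pd.Series([9.0, 7.0, 1.0, 13.0])
labels = label_age(toy_age)
print(labels.tolist())
```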

**Education** and **Income** also appear to be categorical variables, with ranges from 1 to 6 and 1 to 8, respectively.

Finally, there are several health metrics: **GenHlth** ranges from 1 to 5 (mean of 2.51), while **MentHlth** and **PhysHlth** both range from 0 to 30, with means of 3.18 and 4.24 respectively.


--------------

In [8]:
# Identify the number of unique values for categorical variables
categorical_columns = ['Diabetes_012', 'HighBP', 'HighChol', 'CholCheck', 'Smoker', 'Stroke', 
                       'HeartDiseaseorAttack', 'PhysActivity', 'AnyHealthcare', 'NoDocbcCost', 
                       'DiffWalk', 'Education', 'Income', 'GenHlth', 'Age']

unique_values = diabetes_data[categorical_columns].nunique()
print(unique_values)

Diabetes_012             3
HighBP                   2
HighChol                 2
CholCheck                2
Smoker                   2
Stroke                   2
HeartDiseaseorAttack     2
PhysActivity             2
AnyHealthcare            2
NoDocbcCost              2
DiffWalk                 2
Education                6
Income                   8
GenHlth                  5
Age                     13
dtype: int64


In [9]:
# Check for missing values in the dataset
missing_values = diabetes_data.isnull().sum()

# Filter out columns with no missing values
missing_values = missing_values[missing_values > 0]

# Display the missing values and their distribution
print(missing_values)

Series([], dtype: int64)


----

The empty result means that no column contains null values. Missing answers could still be hidden as coded values that would show up as outliers.

We checked the data source and saw that any respondent who answered "don't know" was removed from the dataset.
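
One way to confirm that no coded non-answers remain is to count values outside the ranges reported in the summary above. A minimal sketch, assuming those ranges are the valid ones. `EXPECTED_RANGES` and `out_of_range_counts` are hypothetical names, and `toy` is made-up data.

```python
import pandas as pd

# Valid ranges as reported in the summary statistics discussion above.
EXPECTED_RANGES = {
    "Age": (1, 13),
    "Education": (1, 6),
    "Income": (1, 8),
    "GenHlth": (1, 5),
    "MentHlth": (0, 30),
    "PhysHlth": (0, 30),
}


def out_of_range_counts(df: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count rows per column that fall outside the inclusive (low, high) range."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:
            # between() is inclusive on both ends; NaN counts as out of range.
            counts[col] = int((~df[col].between(low, high)).sum())
    return pd.Series(counts, dtype="int64")


# Made-up rows for illustration; 14 and 99 play the role of stray codes.
toy = pd.DataFrame({
    "Age": [9.0, 14.0, 1.0],
    "Income": [3.0, 8.0, 99.0],
    "GenHlth": [5.0, 3.0, 2.0],
})

violations = out_of_range_counts(toy, EXPECTED_RANGES)
print(violations)
```

If `out_of_range_counts(diabetes_data, EXPECTED_RANGES)` returns all zeros, the categorical columns hold no values outside their documented ranges.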

---------

### Anomaly Detection