# Getting to Know the Dataset

In [1]:
import pandas as pd

from src.config import ORIGINAL_DATA, PROCESSED_DATA

df_diabetes = pd.read_csv(ORIGINAL_DATA, compression='zip')


In [2]:
with pd.option_context("display.max_columns", None):
    display(df_diabetes.head())

Unnamed: 0,Diabetes_binary,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,0.0,1.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,3.0,5.0,30.0,0.0,1.0,4.0,6.0,8.0
1,0.0,1.0,1.0,1.0,26.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,3.0,0.0,0.0,0.0,1.0,12.0,6.0,8.0
2,0.0,0.0,0.0,1.0,26.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,10.0,0.0,1.0,13.0,6.0,8.0
3,0.0,1.0,1.0,1.0,28.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,3.0,0.0,3.0,0.0,1.0,11.0,6.0,8.0
4,0.0,0.0,0.0,1.0,29.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0,0.0,8.0,5.0,8.0


For clarity, some columns are renamed:

In [3]:
df_diabetes.rename(columns={'Diabetes_binary': 'Diabetes', 'MentHlth': 'DaysMentalHealth', 'PhysHlth': 'DaysPhysicalHealth', 'GenHlth': 'GeneralHealth'}, inplace=True)

In [4]:
# Display a summary of the DataFrame
df_diabetes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Diabetes              70692 non-null  float64
 1   HighBP                70692 non-null  float64
 2   HighChol              70692 non-null  float64
 3   CholCheck             70692 non-null  float64
 4   BMI                   70692 non-null  float64
 5   Smoker                70692 non-null  float64
 6   Stroke                70692 non-null  float64
 7   HeartDiseaseorAttack  70692 non-null  float64
 8   PhysActivity          70692 non-null  float64
 9   Fruits                70692 non-null  float64
 10  Veggies               70692 non-null  float64
 11  HvyAlcoholConsump     70692 non-null  float64
 12  AnyHealthcare         70692 non-null  float64
 13  NoDocbcCost           70692 non-null  float64
 14  GeneralHealth         70692 non-null  float64
 15  DaysMentalHealth   

The summary shows no missing values, so no imputation is needed. Every column is also stored as `float64`, including those that hold binary flags or category codes.
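
As a complement to reading the `info()` output, missing values and dtypes can also be checked programmatically. The following is a minimal sketch on a small, hypothetical stand-in DataFrame (`toy`, with made-up values and one deliberately missing entry), not on the real dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_diabetes (hypothetical values, not the real data)
toy = pd.DataFrame(
    {
        "Diabetes": [0.0, 1.0, 0.0, 1.0],
        "BMI": [26.0, 31.0, np.nan, 28.0],  # one deliberately missing value
        "HighBP": [1.0, 1.0, 0.0, 0.0],
    }
)

# Count missing values per column; all zeros would mean a complete dataset
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Check whether every column shares the float64 dtype
all_float64 = (toy.dtypes == "float64").all()
print(all_float64)
```

On the real data, the same two checks would be applied to `df_diabetes`; a total of zero from `isna().sum()` would match what `info()` reports.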

In [5]:
# Count the number of unique values in each column to identify binary/categorical variables
df_diabetes.nunique()

Diabetes                 2
HighBP                   2
HighChol                 2
CholCheck                2
BMI                     80
Smoker                   2
Stroke                   2
HeartDiseaseorAttack     2
PhysActivity             2
Fruits                   2
Veggies                  2
HvyAlcoholConsump        2
AnyHealthcare            2
NoDocbcCost              2
GeneralHealth            5
DaysMentalHealth        31
DaysPhysicalHealth      31
DiffWalk                 2
Sex                      2
Age                     13
Education                6
Income                   8
dtype: int64

In this dataset, some columns represent **binary variables**, meaning they have only two possible values (e.g., 0 or 1). For example:

- `Diabetes` (originally `Diabetes_binary`): 0 = no diabetes, 1 = diabetes  
- `HighBP`: 0 = no high blood pressure, 1 = high blood pressure  
- `Smoker`: 0 = non-smoker, 1 = smoker  

These binary variables are often used to represent the presence or absence of a condition, making them useful for classification models.

Other columns are **non-binary variables**, which can have multiple possible values or be continuous. For example:

- `BMI`: Body Mass Index (numeric, 80 distinct values)  
- `Age`: Age group (ordinal variable with 13 groups)  
- `GeneralHealth` (originally `GenHlth`): General health (ordinal variable with values from 1 to 5)  

Recognizing these distinctions is important because:

1. **Modeling Approach**: Binary variables may need different handling from continuous or ordinal variables in machine learning algorithms. For example, a binary variable such as `Diabetes` can serve as the target of a classification task, whereas a continuous target calls for regression.

2. **Data Preprocessing**: Understanding the type of each column helps in data preprocessing, such as encoding categorical variables, scaling numerical variables, or handling missing values appropriately.

3. **Feature Engineering**: Knowing which columns are binary, ordinal, or continuous helps in creating new features, selecting important variables, and improving model performance.

Identifying the type of each column up front helps ensure that the data is processed correctly and used effectively for analysis or predictive modeling.
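
One rough way to get a first grouping of columns is by their number of unique values. The sketch below uses a hypothetical helper, `group_columns_by_cardinality`, and an arbitrary threshold (`max_levels=5`), both of which are assumptions made for illustration. Cardinality alone cannot tell an ordinal variable from a discrete numeric one, so the data dictionary remains the final reference.

```python
import pandas as pd

# Toy stand-in with made-up values (not the real data)
toy = pd.DataFrame(
    {
        "HighBP": [0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
        "GeneralHealth": [1.0, 3.0, 5.0, 2.0, 4.0, 3.0],
        "BMI": [26.0, 31.0, 22.0, 28.0, 40.0, 35.0],
    }
)


def group_columns_by_cardinality(df, max_levels=5):
    """Hypothetical helper: split columns by their number of unique values."""
    counts = df.nunique()
    return {
        "binary": counts[counts == 2].index.tolist(),
        "few_levels": counts[(counts > 2) & (counts <= max_levels)].index.tolist(),
        "many_values": counts[counts > max_levels].index.tolist(),
    }


groups = group_columns_by_cardinality(toy)
print(groups)
```

With a threshold this low, columns such as `Age` (13 groups) would land in the last bucket on the real data, which is why the result should be treated as a starting point only.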


In [6]:
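# Compute the unique-value counts once, then keep the columns with exactly two values
unique_counts = df_diabetes.nunique()
binary_columns = unique_counts[unique_counts == 2].index.tolist()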

binary_columns

['Diabetes',
 'HighBP',
 'HighChol',
 'CholCheck',
 'Smoker',
 'Stroke',
 'HeartDiseaseorAttack',
 'PhysActivity',
 'Fruits',
 'Veggies',
 'HvyAlcoholConsump',
 'AnyHealthcare',
 'NoDocbcCost',
 'DiffWalk',
 'Sex']

In [7]:
# Create a copy of the original DataFrame to apply preprocessing steps without altering the original data
df_diabetes_processed = df_diabetes.copy()


### Converting Binary Columns to Categorical with Meaningful Labels
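The loop below iterates over all binary columns in `df_diabetes_processed` and converts them to categorical data with descriptive labels: `Sex` becomes `Female`/`Male`, and every other binary column becomes `No`/`Yes`. `rename_categories` assigns labels by position, so 0 maps to the first label and 1 to the second.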


In [8]:
for column in binary_columns:
    if column != "Sex":
        df_diabetes_processed[column] = pd.Categorical(df_diabetes_processed[column]).rename_categories(["No", "Yes"])
    else:
        df_diabetes_processed[column] = pd.Categorical(df_diabetes_processed[column]).rename_categories(["Female", "Male"])

In [9]:
# Checking the applied transformations
with pd.option_context("display.max_columns", None):
    display(df_diabetes_processed.head())

Unnamed: 0,Diabetes,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GeneralHealth,DaysMentalHealth,DaysPhysicalHealth,DiffWalk,Sex,Age,Education,Income
0,No,Yes,No,Yes,26.0,No,No,No,Yes,No,Yes,No,Yes,No,3.0,5.0,30.0,No,Male,4.0,6.0,8.0
1,No,Yes,Yes,Yes,26.0,Yes,Yes,No,No,Yes,No,No,Yes,No,3.0,0.0,0.0,No,Male,12.0,6.0,8.0
2,No,No,No,Yes,26.0,No,No,No,Yes,Yes,Yes,No,Yes,No,1.0,0.0,10.0,No,Male,13.0,6.0,8.0
3,No,Yes,Yes,Yes,28.0,Yes,No,No,Yes,Yes,Yes,No,Yes,No,3.0,0.0,3.0,No,Male,11.0,6.0,8.0
4,No,No,No,Yes,29.0,Yes,No,No,Yes,Yes,Yes,No,Yes,No,2.0,0.0,0.0,No,Female,8.0,5.0,8.0


In [10]:
# Also check the DataFrame summary
df_diabetes_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   Diabetes              70692 non-null  category
 1   HighBP                70692 non-null  category
 2   HighChol              70692 non-null  category
 3   CholCheck             70692 non-null  category
 4   BMI                   70692 non-null  float64 
 5   Smoker                70692 non-null  category
 6   Stroke                70692 non-null  category
 7   HeartDiseaseorAttack  70692 non-null  category
 8   PhysActivity          70692 non-null  category
 9   Fruits                70692 non-null  category
 10  Veggies               70692 non-null  category
 11  HvyAlcoholConsump     70692 non-null  category
 12  AnyHealthcare         70692 non-null  category
 13  NoDocbcCost           70692 non-null  category
 14  GeneralHealth         70692 non-null  float64 
 15  Da

Note that all binary columns are now of the categorical type.
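
Because `rename_categories` with a list assigns labels by position, the result depends on the sorted order of the original categories. A dict makes the value-to-label mapping explicit. The following is a minimal sketch on a made-up `Smoker` series (hypothetical values), comparing both forms:

```python
import pandas as pd

# Made-up values standing in for one binary column
toy = pd.Series([0.0, 1.0, 1.0, 0.0], name="Smoker")

# List form: labels are assigned by position. pd.Categorical sorts the
# inferred categories, so 0.0 becomes "No" and 1.0 becomes "Yes".
by_position = pd.Categorical(toy).rename_categories(["No", "Yes"])

# Dict form: each original value is mapped to its label explicitly
by_mapping = pd.Categorical(toy).rename_categories({0.0: "No", 1.0: "Yes"})

print(list(by_position))
print(list(by_mapping))
```

Both forms give the same result here; the dict form is simply harder to get wrong if a column ever contains unexpected values.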

### Adjusting Non-Binary Categories to Match the Data Dictionary

In [11]:
# Convert "GeneralHealth" to an ordered categorical variable 
df_diabetes_processed["GeneralHealth"] = pd.Categorical(
    df_diabetes_processed["GeneralHealth"],
    ordered=True
).rename_categories(["Excellent", "Very good", "Good", "Acceptable", "Poor"])

df_diabetes_processed["GeneralHealth"].head()

0         Good
1         Good
2    Excellent
3         Good
4    Very good
Name: GeneralHealth, dtype: category
Categories (5, object): ['Excellent' < 'Very good' < 'Good' < 'Acceptable' < 'Poor']
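
Declaring the categories as ordered is what makes comparisons and `min`/`max` meaningful. The sketch below rebuilds a small `GeneralHealth`-like series from made-up codes (hypothetical values) to show this. Since the scale runs from "Excellent" to "Poor", a "greater" category means worse self-reported health.

```python
import pandas as pd

labels = ["Excellent", "Very good", "Good", "Acceptable", "Poor"]

# Made-up codes 1-5, one per category
toy = pd.Series([3.0, 1.0, 5.0, 2.0, 4.0])
health = pd.Series(pd.Categorical(toy, ordered=True).rename_categories(labels))

# Ordered categoricals support comparisons against a category label
worse_than_good = health > "Good"
print(worse_than_good.tolist())

# min and max follow the category order, not alphabetical order
print(health.min(), health.max())
```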

In [12]:
# After verifying the transformation for "GeneralHealth", apply the same process
# to the other ordinal variables: "Age", "Education", and "Income"

df_diabetes_processed["Age"] = pd.Categorical(
    df_diabetes_processed["Age"],
    ordered=True
).rename_categories(
    [
        "18-24",
        "25-29",
        "30-34",
        "35-39",
        "40-44",
        "45-49",
        "50-54",
        "55-59",
        "60-64",
        "65-69",
        "70-74",
        "75-79",
        "80+",
    ]
)

df_diabetes_processed["Education"] = pd.Categorical(
    df_diabetes_processed["Education"],
    ordered=True
).rename_categories(
    [
        "Never attended school",
        "Elementary school",
        "Inc. high school",
        "High school",
        "Inc. undergrad. or technical course",
        "Undergrad. +",
    ]
)


df_diabetes_processed["Income"] = pd.Categorical(
    df_diabetes_processed["Income"],
    ordered=True
).rename_categories(
    [
        "< $10.000",
        "$10.000-$14.999",
        "$15.000-$19.999",
        "$20.000-$24.999",
        "$25.000-$34.999",
        "$35.000-$49.999",
        "$50.000-$74.999",
        "$75.000+",
    ]
)

In [13]:
# Check the DataFrame summary after the transformations
df_diabetes_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   Diabetes              70692 non-null  category
 1   HighBP                70692 non-null  category
 2   HighChol              70692 non-null  category
 3   CholCheck             70692 non-null  category
 4   BMI                   70692 non-null  float64 
 5   Smoker                70692 non-null  category
 6   Stroke                70692 non-null  category
 7   HeartDiseaseorAttack  70692 non-null  category
 8   PhysActivity          70692 non-null  category
 9   Fruits                70692 non-null  category
 10  Veggies               70692 non-null  category
 11  HvyAlcoholConsump     70692 non-null  category
 12  AnyHealthcare         70692 non-null  category
 13  NoDocbcCost           70692 non-null  category
 14  GeneralHealth         70692 non-null  category
 15  Da

In [14]:
# Checking the dataframe
with pd.option_context("display.max_columns", None):
    display(df_diabetes_processed.head())

Unnamed: 0,Diabetes,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GeneralHealth,DaysMentalHealth,DaysPhysicalHealth,DiffWalk,Sex,Age,Education,Income
0,No,Yes,No,Yes,26.0,No,No,No,Yes,No,Yes,No,Yes,No,Good,5.0,30.0,No,Male,35-39,Undergrad. +,$75.000+
1,No,Yes,Yes,Yes,26.0,Yes,Yes,No,No,Yes,No,No,Yes,No,Good,0.0,0.0,No,Male,75-79,Undergrad. +,$75.000+
2,No,No,No,Yes,26.0,No,No,No,Yes,Yes,Yes,No,Yes,No,Excellent,0.0,10.0,No,Male,80+,Undergrad. +,$75.000+
3,No,Yes,Yes,Yes,28.0,Yes,No,No,Yes,Yes,Yes,No,Yes,No,Good,0.0,3.0,No,Male,70-74,Undergrad. +,$75.000+
4,No,No,No,Yes,29.0,Yes,No,No,Yes,Yes,Yes,No,Yes,No,Very good,0.0,0.0,No,Female,55-59,Inc. undergrad. or technical course,$75.000+


In [15]:
# Generate summary statistics for numerical columns in the processed dataset
df_diabetes_processed.describe()

Unnamed: 0,BMI,DaysMentalHealth,DaysPhysicalHealth
count,70692.0,70692.0,70692.0
mean,29.856985,3.752037,5.810417
std,7.113954,8.155627,10.062261
min,12.0,0.0,0.0
25%,25.0,0.0,0.0
50%,29.0,0.0,0.0
75%,33.0,2.0,6.0
max,98.0,30.0,30.0


In [16]:
# Identify numerical columns 
numerical_columns = df_diabetes_processed.select_dtypes(include="number").columns.tolist()

# Downcast numerical columns to smaller integer types where possible, reducing memory usage
for column in numerical_columns:
    df_diabetes_processed[column] = pd.to_numeric(
        df_diabetes_processed[column],
        downcast="integer"
    )

# Verify changes in data types and memory usage
df_diabetes_processed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   Diabetes              70692 non-null  category
 1   HighBP                70692 non-null  category
 2   HighChol              70692 non-null  category
 3   CholCheck             70692 non-null  category
 4   BMI                   70692 non-null  int8    
 5   Smoker                70692 non-null  category
 6   Stroke                70692 non-null  category
 7   HeartDiseaseorAttack  70692 non-null  category
 8   PhysActivity          70692 non-null  category
 9   Fruits                70692 non-null  category
 10  Veggies               70692 non-null  category
 11  HvyAlcoholConsump     70692 non-null  category
 12  AnyHealthcare         70692 non-null  category
 13  NoDocbcCost           70692 non-null  category
 14  GeneralHealth         70692 non-null  category
 15  Da

### Importance of Memory Optimization
Although the reduction in memory usage from 2.9 MB to 1.5 MB may seem small in absolute terms, it cuts the footprint by **almost half**. **Reducing memory usage is a key step**, especially when working with larger datasets.

#### **Why this is important:**
- **Efficient Memory Usage**: Lower memory usage can speed up computations and makes storage more efficient, which matters when scaling up or running multiple analyses simultaneously.
- **No Loss of Information**: The downcasting did not result in any loss of information. Specifically:
  - **`BMI`** values are stored as whole numbers in this dataset, so converting them from floats to integers doesn't lose any precision.
  - **`DaysMentalHealth`** and **`DaysPhysicalHealth`** are day counts and therefore inherently integers, so they were downcast without any risk of data loss.

With these adjustments, we've reduced the dataset's memory footprint without sacrificing its integrity.
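
Both claims, the memory saving and the absence of information loss, can be checked directly. The following is a minimal sketch on a small toy DataFrame with made-up values (not the real data), using `memory_usage(deep=True)` to compare sizes before and after downcasting:

```python
import pandas as pd

# Toy stand-in with made-up whole-number values stored as floats
toy = pd.DataFrame(
    {
        "BMI": [26.0, 28.0, 29.0, 98.0],
        "DaysMentalHealth": [5.0, 0.0, 0.0, 30.0],
    }
)

# Confirm that every value is a whole number before converting
is_whole = bool((toy % 1 == 0).all().all())

# Measure memory in bytes before and after downcasting each column
before = toy.memory_usage(deep=True).sum()
downcast = toy.apply(pd.to_numeric, downcast="integer")
after = downcast.memory_usage(deep=True).sum()

# Confirm that the values themselves are unchanged
values_equal = bool((toy == downcast).all().all())

print(is_whole, values_equal)
print(downcast.dtypes.tolist(), before, after)
```

The exact byte counts depend on the pandas version and the index, so the comparison between `before` and `after` is more informative than either number alone.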


In [17]:
# Verify that the statistical summary remains the same after memory optimization
df_diabetes_processed.describe()

Unnamed: 0,BMI,DaysMentalHealth,DaysPhysicalHealth
count,70692.0,70692.0,70692.0
mean,29.856985,3.752037,5.810417
std,7.113954,8.155627,10.062261
min,12.0,0.0,0.0
25%,25.0,0.0,0.0
50%,29.0,0.0,0.0
75%,33.0,2.0,6.0
max,98.0,30.0,30.0


In [18]:
# Generate summary statistics for categorical columns to understand their distribution
df_diabetes_processed.describe(exclude="number")

Unnamed: 0,Diabetes,HighBP,HighChol,CholCheck,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,Veggies,HvyAlcoholConsump,AnyHealthcare,NoDocbcCost,GeneralHealth,DiffWalk,Sex,Age,Education,Income
count,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692,70692
unique,2,2,2,2,2,2,2,2,2,2,2,2,2,5,2,2,13,6,8
top,No,Yes,Yes,Yes,No,No,No,Yes,Yes,Yes,No,Yes,No,Good,No,Female,65-69,Undergrad. +,$75.000+
freq,35346,39832,37163,68943,37094,66297,60243,49699,43249,55760,67672,67508,64053,23427,52826,38386,10856,26020,20646


In [19]:
# Save the processed DataFrame to a Parquet file for efficient storage and future use
df_diabetes_processed.to_parquet(PROCESSED_DATA, index=False)