# Assignment 2: Preprocessing

## Table of Contents
1. [Notebook Summary](#notebook-summary)
2. [Import the Dataset](#import-the-dataset)
3. [Add the Target Variable](#add-the-target-variable)
4. [Drop Unused Variables and Drop Observations with too Many Missing Values](#4-drop-unused-variables-and-drop-observations-with-too-many-missing-values)
5. [Impute Missing Values](#impute-missing-values)
6. [Data Dictionary](#data-dictionary)
7. [Export all Dataframes as Parquet and Lazy Read all Parquet Files](#export-all-dataframes-as-parquet-and-lazy-read-all-parquet-files)





## 1. Notebook Summary

### Dataframe Dictionary Key

| Dataframe Name       | Description |
|----------------------|-------------|
| `df_original`        | No modifications to the data |
| `df`                 | Added column `HadHeartDisease`. It is missing when no target variable is `"Yes"` and at least one target variable is missing. |
| `df_heart_drop_null` | Dropped all observations where `HadHeartDisease` is missing |
| `df_heart_drop`      | `df_heart_drop_null` with `HadHeartAttack`, `HadStroke`, `HadAngina`, and `BMI` dropped |
| `df_heart_drop_##`   | `df_heart_drop` with observations dropped when their number of missing values exceeds the threshold `##` |
| `df*_imp`            | Imputed version of each of the above dataframes |

## 2. Import the Dataset

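Parquet is a columnar storage format that is generally faster to read than CSV, so the CSV is converted to Parquet once and then read lazily from the Parquet file.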

In [17]:
from asgmnt_2_tools import lazydict_to_parquet
import polars as pl # Lazy Dataframe Manipulation 
import pandas as pd # Dataframe

data_path = "../../Data/" # Relative Path to Data Folder.
drive_path = "../../Data/GoogleDrive/" # Relative Path to GoogleDrive Folder.


In [18]:
df_dict = dict() # Dictionary of preprocessing lazyframes
df_dict_imp = dict() # Dictionary of imputed lazyframes
df_dict_all = dict() #Dictionary of all lazyframes


df = pl.scan_csv(data_path + "heart_2022_with_nans.csv") # Read CSV
df.collect().write_parquet(data_path + "heart_2022_with_nans.parquet") # Write as Parquet
df = pl.scan_parquet(data_path + "heart_2022_with_nans.parquet") # Lazy Read Parquet

df_dict['df_original'] = df #Add dataframe with no modifications to the set.
df.fetch(5) # Lazy print head

State,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadAngina,HadStroke,HadAsthma,HadSkinCancer,HadCOPD,HadDepressiveDisorder,HadKidneyDisease,HadArthritis,HadDiabetes,DeafOrHardOfHearing,BlindOrVisionDifficulty,DifficultyConcentrating,DifficultyWalking,DifficultyDressingBathing,DifficultyErrands,SmokerStatus,ECigaretteUsage,ChestScan,RaceEthnicityCategory,AgeCategory,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos
str,str,str,f64,f64,str,str,f64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,f64,f64,f64,str,str,str,str,str,str,str
"""Alabama""","""Female""","""Very good""",0.0,0.0,"""Within past ye…","""No""",8.0,,"""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Not at all (ri…","""No""","""White only, No…","""Age 80 or olde…",,,,"""No""","""No""","""Yes""","""No""","""Yes, received …","""No""","""No"""
"""Alabama""","""Female""","""Excellent""",0.0,0.0,,"""No""",6.0,,"""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""No""","""White only, No…","""Age 80 or olde…",1.6,68.04,26.57,"""No""","""No""","""No""","""No""","""No, did not re…","""No""","""No"""
"""Alabama""","""Female""","""Very good""",2.0,3.0,"""Within past ye…","""Yes""",5.0,,"""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""No""","""White only, No…","""Age 55 to 59""",1.57,63.5,25.61,"""No""","""No""","""No""","""No""",,"""No""","""Yes"""
"""Alabama""","""Female""","""Excellent""",0.0,0.0,"""Within past ye…","""Yes""",7.0,,"""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Current smoker…","""Never used e-c…","""Yes""","""White only, No…",,1.65,63.5,23.3,"""No""","""No""","""Yes""","""Yes""","""No, did not re…","""No""","""No"""
"""Alabama""","""Female""","""Fair""",2.0,0.0,"""Within past ye…","""Yes""",9.0,,"""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""Yes""","""White only, No…","""Age 40 to 44""",1.57,53.98,21.77,"""Yes""","""No""","""No""","""Yes""","""No, did not re…","""No""","""No"""


## 3. Add the Target Variable

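Set `HadHeartDisease` to `"Yes"` if at least one of the following adverse cardiovascular events is `"Yes"`: `HadHeartAttack`, `HadStroke`, `HadAngina`. If none is `"Yes"` and at least one is missing, `HadHeartDisease` is left missing; otherwise it is `"No"`.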
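
The rule, including how it treats missing answers, can be checked on plain Python values before reading the Polars cell that follows. This is only a sketch: `had_heart_disease` is a hypothetical helper introduced for illustration, and `None` stands in for a missing survey answer.

```python
from typing import Optional


def had_heart_disease(attack: Optional[str], stroke: Optional[str],
                      angina: Optional[str]) -> Optional[str]:
    """Combine three Yes/No answers; None stands for a missing answer."""
    answers = (attack, stroke, angina)
    if any(a == "Yes" for a in answers):
        return "Yes"  # One confirmed event is enough, even if others are missing
    if any(a is None for a in answers):
        return None   # No confirmed event, and a missing answer leaves it unknown
    return "No"       # All three were answered and none is "Yes"


# Made-up answer triples: (HadHeartAttack, HadStroke, HadAngina)
examples = [
    ("No", "No", "No"),
    ("Yes", None, "No"),
    ("No", None, "No"),
]
for row in examples:
    print(row, "->", had_heart_disease(*row))
```

Note that a `"Yes"` takes priority over a missing value, so an observation with one confirmed event is kept as `"Yes"` even when another answer is absent.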

In [19]:

df = df.with_columns(
    pl.when(
        (pl.col("HadHeartAttack") == "Yes") |
        (pl.col("HadStroke") == "Yes") |
        (pl.col("HadAngina") == "Yes")
    )
    .then(pl.lit("Yes"))
    .otherwise(
        pl.when(
            pl.col("HadHeartAttack").is_null() |
            pl.col("HadStroke").is_null() |
            pl.col("HadAngina").is_null()
        )
        .then(None)
        .otherwise(pl.lit("No"))
    )
    .alias("HadHeartDisease")
)

df.fetch(5)

State,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadAngina,HadStroke,HadAsthma,HadSkinCancer,HadCOPD,HadDepressiveDisorder,HadKidneyDisease,HadArthritis,HadDiabetes,DeafOrHardOfHearing,BlindOrVisionDifficulty,DifficultyConcentrating,DifficultyWalking,DifficultyDressingBathing,DifficultyErrands,SmokerStatus,ECigaretteUsage,ChestScan,RaceEthnicityCategory,AgeCategory,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,HadHeartDisease
str,str,str,f64,f64,str,str,f64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,f64,f64,f64,str,str,str,str,str,str,str,str
"""Alabama""","""Female""","""Very good""",0.0,0.0,"""Within past ye…","""No""",8.0,,"""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Not at all (ri…","""No""","""White only, No…","""Age 80 or olde…",,,,"""No""","""No""","""Yes""","""No""","""Yes, received …","""No""","""No""","""No"""
"""Alabama""","""Female""","""Excellent""",0.0,0.0,,"""No""",6.0,,"""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""No""","""White only, No…","""Age 80 or olde…",1.6,68.04,26.57,"""No""","""No""","""No""","""No""","""No, did not re…","""No""","""No""","""No"""
"""Alabama""","""Female""","""Very good""",2.0,3.0,"""Within past ye…","""Yes""",5.0,,"""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""No""","""White only, No…","""Age 55 to 59""",1.57,63.5,25.61,"""No""","""No""","""No""","""No""",,"""No""","""Yes""","""No"""
"""Alabama""","""Female""","""Excellent""",0.0,0.0,"""Within past ye…","""Yes""",7.0,,"""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Current smoker…","""Never used e-c…","""Yes""","""White only, No…",,1.65,63.5,23.3,"""No""","""No""","""Yes""","""Yes""","""No, did not re…","""No""","""No""","""No"""
"""Alabama""","""Female""","""Fair""",2.0,0.0,"""Within past ye…","""Yes""",9.0,,"""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""Yes""","""White only, No…","""Age 40 to 44""",1.57,53.98,21.77,"""Yes""","""No""","""No""","""Yes""","""No, did not re…","""No""","""No""","""No"""


## 4. Drop Unused Variables and Drop Observations With Too Many Missing Values

1. Create `df_heart_drop_null`, where observations are dropped if `HadHeartDisease` is missing.

2. Create `df_heart_drop` by dropping `HadHeartAttack`, `HadAngina`, and `HadStroke`, because `HadHeartDisease` replaces them.

3. Drop `BMI` because `HeightInMeters` and `WeightInKilograms` were chosen instead.

4. From `df_heart_drop`, make multiple dataframes that drop survey participants based on the number of missing responses.

5. Collect the dataframes and return the length of each entry.


In [20]:
# Drop observations where HadHeartDisease is missing
df_heart_drop_null = df.drop_nulls(
    subset=["HadHeartDisease"]
)
df_dict["df_heart_drop_null"] = df_heart_drop_null


# Drop HadHeartAttack, HadStroke, HadAngina, BMI
df_heart_drop = df_heart_drop_null.drop(
    ['HadHeartAttack', 'HadAngina', 'HadStroke', 'BMI']
)

In [21]:
df_dict.update({"df" : df, "df_heart_drop" : df_heart_drop}) # Dictionary of lazy dataframes
thresholds = [0, 1, 3, 5, 10, 20, 40] # List of thresholds

# If number of missing is > threshold, drop the observation.
for threshold in thresholds:
    df_name = f"df_heart_drop_{threshold:02}"
    # Filter observations if # of null values is greater than threshold.
    df_filter = df_heart_drop.filter(
        pl.sum_horizontal(pl.all().is_null()) <= threshold
    )
    df_dict[df_name] = df_filter # Add lazy frame to dictionary
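
To see how the threshold changes the number of observations kept, the same filter can be mimicked on a few made-up rows in plain Python. This is a sketch under stated assumptions: `toy_rows`, `missing_count`, and `filter_by_threshold` are hypothetical names with invented values, and the Polars cell above remains the actual implementation.

```python
# Toy survey rows; None marks a missing response (values are made up).
toy_rows = [
    {"Sex": "Female", "SleepHours": 8.0, "SmokerStatus": "Never smoked"},
    {"Sex": "Male", "SleepHours": None, "SmokerStatus": "Never smoked"},
    {"Sex": None, "SleepHours": None, "SmokerStatus": "Former smoker"},
    {"Sex": None, "SleepHours": None, "SmokerStatus": None},
]


def missing_count(row):
    """Number of missing responses in one observation."""
    return sum(value is None for value in row.values())


def filter_by_threshold(rows, threshold):
    """Keep a row when its number of missing responses is <= threshold."""
    return [row for row in rows if missing_count(row) <= threshold]


# Length of each filtered set, keyed the same way as the notebook's dataframes
toy_lengths = {
    f"df_heart_drop_{t:02}": len(filter_by_threshold(toy_rows, t))
    for t in [0, 1, 3]
}
print(toy_lengths)
# {'df_heart_drop_00': 1, 'df_heart_drop_01': 2, 'df_heart_drop_03': 4}
```

A threshold of 0 keeps only complete observations, and larger thresholds keep progressively more of the data.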


## 5. Impute Missing Values

1. Show column types

2. Impute float and integer values by median.

3. Impute string values by mode.

Note: Imputation is applied to every dataframe in `df_dict`, and the imputed copies are stored in `df_dict_imp`.


In [22]:
### Imputation
def impute_df(df):
    df = df.collect() # Collect because iteration is needed.
    for col_name, dtype in zip(df.columns, df.dtypes):
        ## Impute string using the mode
        if dtype == pl.Utf8:
            # mode() returns a Series that holds several values on ties,
            # so sort it and take the first entry as a scalar fill value.
            modes = df[col_name].drop_nulls().mode().sort()
            if len(modes) > 0: # Skip columns that are entirely null
                df = df.with_columns(df[col_name].fill_null(modes[0]))
        ## Impute float and integer using the median
        elif dtype == pl.Float64 or dtype == pl.Int64:
            median_value = df[col_name].median()
            df = df.with_columns(df[col_name].fill_null(median_value))
        ## Warning catch: if type is not a string, float, or integer.
        else:
            print("Unexpected type:", dtype)
    return df.lazy() # Convert back to a lazy frame

### Impute every dataframe in df_dict
# To show the number of missing values for each column:
# impute_df(df_heart_drop).null_count()
for key, value in df_dict.items():
    df_name = f"{key}_imp"
    df_dict_imp[df_name] = impute_df(value)
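
As a worked example of the two imputation rules, the following sketch fills a single column in plain Python using the standard `statistics` module. `impute_column` is a hypothetical helper with made-up inputs, and breaking ties between modes alphabetically is an assumption of this sketch.

```python
from statistics import median, multimode


def impute_column(values):
    """Fill None with the mode (strings) or the median (numbers)."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # Nothing to learn a fill value from
    if all(isinstance(v, str) for v in present):
        fill = sorted(multimode(present))[0]  # Ties broken alphabetically
    else:
        fill = median(present)
    return [fill if v is None else v for v in values]


print(impute_column([7.0, None, 5.0, 9.0]))       # [7.0, 7.0, 5.0, 9.0]
print(impute_column(["No", "Yes", None, "No"]))  # ['No', 'Yes', 'No', 'No']
```

The median is used for numeric columns because it is less sensitive to extreme values than the mean, while the mode is the natural choice for categorical answers.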


## 6. Data Dictionary

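All lazy dataframes are saved to the data dictionary. `df_original` has 40 columns, `df` and `df_heart_drop_null` have 41 (the original columns plus `HadHeartDisease`), and every dataframe derived from `df_heart_drop` has 37. The imputed versions keep the same column counts.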

[Dictionary Key](#1-notebook-summary)

In [23]:
df_dict_all.update(df_dict)

df_dict_all.update(df_dict_imp)

for key in df_dict_all.keys():
    print(key)


df_original
df_heart_drop_null
df
df_heart_drop
df_heart_drop_00
df_heart_drop_01
df_heart_drop_03
df_heart_drop_05
df_heart_drop_10
df_heart_drop_20
df_heart_drop_40
df_original_imp
df_heart_drop_null_imp
df_imp
df_heart_drop_imp
df_heart_drop_00_imp
df_heart_drop_01_imp
df_heart_drop_03_imp
df_heart_drop_05_imp
df_heart_drop_10_imp
df_heart_drop_20_imp
df_heart_drop_40_imp


## 7. Export all Dataframes as Parquet and Lazy Read all Parquet Files

Save to Google Drive. Create a shortcut to your decisiontreebruining
from the shared drive in your personal drive, then install Google Drive
for Desktop and create a symbolic link using `ln -s <source path> <linked folder path>`.
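
Before exporting, it may help to confirm that the linked folder actually resolves. The following is a minimal sketch using only the standard library; `describe_link` is a hypothetical helper and is not part of `asgmnt_2_tools`.

```python
from pathlib import Path


def describe_link(path):
    """Report whether a path is a usable folder, a symbolic link, or missing."""
    p = Path(path)
    if p.is_symlink() and not p.exists():
        return "broken symlink"  # The link exists but its target does not
    if not p.exists():
        return "missing"
    if not p.is_dir():
        return "not a directory"
    return "symlink to directory" if p.is_symlink() else "directory"


# Same relative location as drive_path in the import cell
print(describe_link("../../Data/GoogleDrive/"))
```

A result of `"broken symlink"` usually means the source path given to `ln -s` no longer exists, for example because Google Drive is not mounted.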

In [24]:
lazydict_to_parquet(df_dict_all, drive_path)