# Preprocessing for Neural Network
### Finalized 5/9/24

## Summary

Multiple training datasets were created from the BRFSS Kaggle dataset, one for each combination of null-threshold filtering and target-variable balancing listed below.

|Operation| Parameters|
|---|---|
|Null-Threshold Filtering|0, 1, 3, 5, 10, 20, 40|
|Random Over Sampling (n:y)| 1:1, 1:2, 1:3, 1:5, 1:7 |
|Random Under Sampling (n:y)| 1:1, 2:1, 3:1, 5:1, 7:1 |
|Total Combinations| 70|
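
The total of 70 follows from 7 thresholds times 10 balancing settings (5 over-sampling plus 5 under-sampling ratios). A minimal sketch that enumerates the combinations, using the same naming pattern as the dictionary keys built later in this notebook (without the `_X_train` / `_y_train` suffix):

```python
from itertools import product

thresholds = [0, 1, 3, 5, 10, 20, 40]
ratios = [1, 2, 3, 5, 7]

# Over-sampling names use 1:ratio, under-sampling names use ratio:1
methods = ([f"Over_Sample_1:{r}" for r in ratios]
           + [f"Under_Sample_{r}:1" for r in ratios])

# One name per (balancing method, null threshold) pair
combinations = [f"{m}_threshold_{t:02}" for m, t in product(methods, thresholds)]
print(len(combinations))
```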

## Table of Contents

- [Notebook Setup](#Notebook-Setup)
- [Read in Parquet](#Read-in-Parquet)
- [Encode Features](#Encode-Features)
- [Balance Target Features](#Balance-Target-Features)
- [Results](#Results)
- [Save as Parquet](#Save-as-Parquet)

## Notebook Setup

The main helper functions used here are defined in [assignment_3_tools.py](./assignment_3_tools.py).

Running this notebook requires __100 GB of RAM/swap__ until it is optimized to read, encode, and balance the datasets iteratively.
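
One possible shape for that optimization is to handle a single threshold at a time and write each balanced dataset to disk before moving on, so that only one dataset is held in memory. The sketch below is illustrative only: `process_iteratively` and its `load`, `encode`, `balance`, and `save` arguments are hypothetical stand-ins, not functions from `assignment_3_tools.py`.

```python
def process_iteratively(thresholds, ratios, load, encode, balance, save):
    """Process one threshold at a time instead of holding every dataset in memory."""
    for threshold in thresholds:
        df = load(threshold)                      # read a single dataset
        X_train, X_test, y_train, y_test = encode(df)
        del df                                    # drop the raw frame before balancing
        for ratio in ratios:
            X_bal, y_bal = balance(X_train, y_train, ratio)
            # Persist immediately so the balanced copy can be garbage collected
            save(f"threshold_{threshold:02}_ratio_{ratio}", X_bal, y_bal)


# Toy stand-ins (plain lists) to show the call order; real callables would
# wrap the notebook's own read, encode, balance, and write helpers.
saved = []
process_iteratively(
    thresholds=[0, 1],
    ratios=[1, 2],
    load=lambda t: [t] * 4,
    encode=lambda df: (df, df, df, df),
    balance=lambda X, y, r: (X * r, y * r),
    save=lambda name, X, y: saved.append(name),
)
print(saved)
```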

In [1]:
from assignment_3_tools import (
    parquet_to_dict, dict_to_parquet, encode_cvd_var,
    rand_under_sample, rand_over_sample, balanced_dict,
    pd_dict_to_parquet)
import polars as pl
import pandas as pd
import seaborn as sns
from great_tables import GT, md, html, from_column, style, loc

## Read in Parquet

Read in all null-threshold datasets.

In [2]:
%%time
pq_jar = parquet_to_dict('../../Data/GoogleDrive/') #lazy read all parquet

thresholds = [0, 1, 3, 5, 10, 20, 40]
all_heart_drop = dict()

# Select all drop threshold datasets and convert to pandas
for threshold in thresholds:
    df_name = f"df_heart_drop_{threshold:02}_imp"
    all_heart_drop[df_name] = pq_jar[df_name].collect().to_pandas()

CPU times: user 4.58 s, sys: 1.73 s, total: 6.31 s
Wall time: 2.61 s


## Encode Features

Encode all null-threshold datasets. The encoding operations in [`encode_cvd_var()`](./assignment_3_tools.py), which also returns the train/test splits, were applied to every null-threshold dataset.

In [3]:
%%time
split_encoded = dict()
# Encode all drop threshold datasets
for key, value in all_heart_drop.items():
    X_train = f"{key}_X_train"
    X_test = f"{key}_X_test"
    y_train = f"{key}_y_train"
    y_test = f"{key}_y_test"
    (split_encoded[X_train], 
     split_encoded[X_test], 
     split_encoded[y_train], 
     split_encoded[y_test]) = encode_cvd_var(value)

CPU times: user 11.5 s, sys: 2.52 s, total: 14 s
Wall time: 14 s
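
The internals of `encode_cvd_var()` are not shown in this notebook, only that it returns `X_train`, `X_test`, `y_train`, and `y_test`. For readers without the tools module, the following is a rough sketch of what an encode-and-split step of this kind could look like. The function name, the one-hot encoding choice, the 80/20 split, and the toy column names are all assumptions, not the project's actual implementation.

```python
import pandas as pd


def encode_and_split_sketch(df, target, test_frac=0.2, seed=0):
    """Hypothetical stand-in: one-hot encode features, then split train/test."""
    y = df[target]
    # get_dummies encodes object/category columns and leaves numeric ones as-is
    X = pd.get_dummies(df.drop(columns=[target]))
    test_idx = X.sample(frac=test_frac, random_state=seed).index
    train_idx = X.index.difference(test_idx)
    return X.loc[train_idx], X.loc[test_idx], y.loc[train_idx], y.loc[test_idx]


# Toy data with made-up columns, only to show the returned shapes
toy = pd.DataFrame({
    "age": [34, 51, 47, 62, 29, 58, 40, 66, 37, 55],
    "sex": ["F", "M", "F", "M", "F", "M", "F", "M", "F", "M"],
    "cvd": [0, 1, 0, 1, 0, 0, 0, 1, 0, 1],
})
X_tr, X_te, y_tr, y_te = encode_and_split_sketch(toy, target="cvd")
print(X_tr.shape, X_te.shape)
```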


## Balance Target Features

Balance the training split of every null-threshold dataset, using random over-sampling and random under-sampling at each ratio.

In [4]:
%%time
ratios = [1,2,3,5,7]
balanced_X_trains = dict()
balanced_y_trains = dict()

for key in all_heart_drop:
    X_train = split_encoded[f"{key}_X_train"]
    y_train = split_encoded[f"{key}_y_train"]
    for ratio in ratios:
        (balanced_X_trains[f"Over_Sample_1:{ratio}_threshold_{key[-6:-4]}_X_train"], 
         balanced_y_trains[f"Over_Sample_1:{ratio}_threshold_{key[-6:-4]}_y_train"]) = rand_over_sample(X_train, y_train, ratio)
        # Under-sampling uses rand_under_sample (assumed to take the same arguments as rand_over_sample)
        (balanced_X_trains[f"Under_Sample_{ratio}:1_threshold_{key[-6:-4]}_X_train"], 
         balanced_y_trains[f"Under_Sample_{ratio}:1_threshold_{key[-6:-4]}_y_train"]) = rand_under_sample(X_train, y_train, ratio)

CPU times: user 45 s, sys: 42.3 s, total: 1min 27s
Wall time: 1min 27s
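
To make the ratios concrete, here is a minimal pandas sketch of random over- and under-sampling. It assumes the semantics suggested by the n:y notation in the summary table: over-sampling at 1:r resamples the minority class with replacement up to r times the majority count, and under-sampling at r:1 keeps r majority rows per minority row. The functions `over_sample_sketch` and `under_sample_sketch` are hypothetical; the actual `rand_over_sample` and `rand_under_sample` in `assignment_3_tools.py` may behave differently.

```python
import pandas as pd


def over_sample_sketch(X, y, ratio, seed=0):
    """Resample the minority class with replacement to ratio x the majority size."""
    counts = y.value_counts()
    majority, minority = counts.idxmax(), counts.idxmin()
    n_target = int(counts.loc[majority] * ratio)
    minority_idx = y[y == minority].sample(
        n=n_target, replace=True, random_state=seed).index
    keep = y[y == majority].index.append(minority_idx)
    return X.loc[keep], y.loc[keep]


def under_sample_sketch(X, y, ratio, seed=0):
    """Keep every minority row and ratio x as many randomly chosen majority rows."""
    counts = y.value_counts()
    majority, minority = counts.idxmax(), counts.idxmin()
    n_target = int(counts.loc[minority] * ratio)  # must not exceed the majority count
    majority_idx = y[y == majority].sample(
        n=n_target, replace=False, random_state=seed).index
    keep = majority_idx.append(y[y == minority].index)
    return X.loc[keep], y.loc[keep]


# Toy data: 12 negatives and 2 positives
X = pd.DataFrame({"feature": range(14)})
y = pd.Series([0] * 12 + [1] * 2, name="target")

X_over, y_over = over_sample_sketch(X, y, ratio=2)     # 12 negatives, 24 positives
X_under, y_under = under_sample_sketch(X, y, ratio=3)  # 6 negatives, 2 positives
```

Both sketches rely on `X` and `y` sharing a unique index, since rows are selected by label.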


## Results

- [x] Make a table of over/under sampled datasets
- [ ] Add MLP performance

In [5]:
%%time
Balance_Method = list()
No_CVD = list()
Yes_CVD = list()
Total_Obs = list()

for key, value in balanced_y_trains.items():
    counts = value.value_counts()  # compute the class counts once per dataset
    Balance_Method.append(key[:-8])  # strip the "_y_train" suffix
    No_CVD.append(counts[0])
    Yes_CVD.append(counts[1])
    Total_Obs.append(counts[0] + counts[1])

tbl_balanced = pd.DataFrame({
    'Balance_Method': Balance_Method,
    'No_CVD': No_CVD,
    'Yes_CVD': Yes_CVD,
    'Total_Obs': Total_Obs})

GT(tbl_balanced.sort_values(by='Balance_Method', ascending=True))

CPU times: user 2.17 s, sys: 180 ms, total: 2.35 s
Wall time: 3.47 s


|Balance_Method|No_CVD|Yes_CVD|Total_Obs|
|---|---|---|---|
|Over_Sample_1:1_threshold_00|164339|164339|328678|
|Over_Sample_1:1_threshold_01|226824|226824|453648|
|Over_Sample_1:1_threshold_03|255918|255918|511836|
|Over_Sample_1:1_threshold_05|260572|260572|521144|
|Over_Sample_1:1_threshold_10|272572|272572|545144|
|Over_Sample_1:1_threshold_20|290589|290589|581178|
|Over_Sample_1:1_threshold_40|290772|290772|581544|
|Over_Sample_1:2_threshold_00|328678|164339|493017|
|Over_Sample_1:2_threshold_01|453648|226824|680472|
|Over_Sample_1:2_threshold_03|511836|255918|767754|
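
Each `Balance_Method` value packs the sampling method, ratio, and null threshold into one string. For downstream filtering it can help to split these apart. The sketch below does this with a regular expression; `parse_balance_method` is a hypothetical helper, not part of `assignment_3_tools.py`.

```python
import re

# Matches names such as "Over_Sample_1:2_threshold_03"
KEY_PATTERN = re.compile(r"^(Over|Under)_Sample_(\d+):(\d+)_threshold_(\d{2})$")


def parse_balance_method(name):
    """Split a Balance_Method string into method, ratio, and threshold."""
    match = KEY_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized Balance_Method: {name!r}")
    method, left, right, threshold = match.groups()
    return {"method": method, "ratio": f"{left}:{right}", "threshold": int(threshold)}


parsed = parse_balance_method("Over_Sample_1:2_threshold_03")
print(parsed)
```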


## Save as Parquet

In [8]:
%%time
pd_dict_to_parquet(balanced_X_trains, "../../Data/GoogleDrive/Encoded_Data/")
pd_dict_to_parquet(balanced_y_trains, "../../Data/GoogleDrive/Encoded_Data/")
pd_dict_to_parquet(split_encoded, "../../Data/GoogleDrive/Encoded_Data/")
tbl_balanced.to_parquet("../../Data/GoogleDrive/tbl_of_balanced.parquet")

CPU times: user 1min 51s, sys: 1min 1s, total: 2min 53s
Wall time: 2min 16s
