# Preprocessing for Neural Network
### Finalized 5/9/24

## Summary

Multiple datasets were created to document combinations of null-threshold filtering and target variable balancing applied to the BRFSS Kaggle dataset.

| Operation | Parameters | Notes | Filename |
|---|---|---|---|
|Null-Threshold Filtering|0, 1, 3, 5, 10, 20, 40| 0 = No missing and 40 = all missing |---|
|Random Over Sampling (n:y)| 1:1, 1:2, 1:3, 1:5, 1:7 | Randomly select CVD observations with replacement |---|
|Random Under Sampling (n:y)| 1:1, 2:1, 3:1, 5:1, 7:1 | Randomly remove Healthy observations |---|
|Baseline test sets| 7 (14 files) | Null-threshold datasets without over/under sampling | `df_heart_drop_[threshold]_imp_[X/y]_test`|
|Baseline training sets| 7 (14 files) | Null-threshold datasets without over/under sampling | `df_heart_drop_[threshold]_imp_[X/y]_train`|
|Preprocessing training sets| 70 (140 files) | Null-threshold * (Under + Over) sampling | `[Over/Under]_Sample_[ratio]_threshold_[threshold]_[X/y]_train`|
|Total datasets| 84 (168 files) | All datasets are imputed and encoded but not standardized.| All files are in parquet format in `./Data/GoogleDrive/Encoded_Data/`|
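
The counts in the table can be checked by enumerating the file names. The snippet below is a minimal sketch: it assumes the name patterns from the table combined with the zero-padded threshold and `1:{ratio}` / `{ratio}:1` formats used in the cells further down, and the helper names `baseline_names` and `balanced_names` are hypothetical.

```python
from itertools import product

THRESHOLDS = [0, 1, 3, 5, 10, 20, 40]   # null-threshold parameters from the table
RATIOS = [1, 2, 3, 5, 7]                # over/under sampling ratios from the table
PARTS = ("X", "y")                      # each dataset is stored as an X file and a y file


def baseline_names():
    """Names of the unbalanced train/test files (hypothetical helper)."""
    return [
        f"df_heart_drop_{t:02}_imp_{part}_{split}"
        for t, split, part in product(THRESHOLDS, ("train", "test"), PARTS)
    ]


def balanced_names():
    """Names of the over/under sampled training files (hypothetical helper)."""
    names = []
    for t, r, part in product(THRESHOLDS, RATIOS, PARTS):
        names.append(f"Over_Sample_1:{r}_threshold_{t:02}_{part}_train")
        names.append(f"Under_Sample_{r}:1_threshold_{t:02}_{part}_train")
    return names


all_files = baseline_names() + balanced_names()
print(len(all_files), len(all_files) // 2)  # 168 files, 84 datasets
```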

## Table of Contents

- [Notebook Setup](#Notebook-Setup)
- [Read in Parquet](#Read-in-Parquet)
- [Encode Features](#Encode-Features)
- [Balance Target Features](#Balance-Target-Features)
- [Results](#Results)
- [Save as Parquet](#Save-as-Parquet)

## Notebook Setup

The main helper functions used in this notebook are defined in [assignment_3_tools.py](./assignment_3_tools.py).

Running this notebook requires __100 GB of RAM/swap__ until it is optimized to read, encode, and balance the datasets iteratively.
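
One way the iterative approach could look is to finish each dataset completely before loading the next, so only one dataset is in memory at a time. This is a rough sketch, not the notebook's implementation: `process_one_at_a_time` and its `load`, `transform`, and `save` callbacks are hypothetical placeholders for the reading, encoding/balancing, and parquet-writing steps.

```python
def process_one_at_a_time(names, load, transform, save):
    """Load, transform, and save each dataset before touching the next one.

    `load(name)` returns a dataset, `transform(data)` returns the processed
    dataset, and `save(name, data)` persists it. All three are placeholders.
    """
    finished = []
    for name in names:
        data = load(name)          # only this dataset is held in memory
        result = transform(data)
        save(name, result)         # persist before moving on
        finished.append(name)
        del data, result           # drop references so memory can be reclaimed
    return finished
```

With this shape, peak memory is bounded by the largest single dataset plus its transformed copy rather than by the whole collection.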

In [4]:
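from app_tools import (
    parquet_to_dict,
    pickle_to_dict,
    encode_cvd_var,
    load_and_transform_new_data,
    # Used by later cells; assumed to be exported by app_tools as well
    rand_over_sample,
    pd_dict_to_parquet,
)
import polars as pl
import pandas as pd
from great_tables import GT  # assumed source of the GT table renderer used in Results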

## Read in Parquet

Read in all null-threshold datasets

In [5]:
%%time
pq_jar = parquet_to_dict('../../Data/GoogleDrive/Assignment_2_Archive/') #lazy read all parquet

thresholds = [20]
all_heart_drop = dict()

# Select all drop threshold datasets and convert to pandas
for threshold in thresholds:
    df_name = f"df_heart_drop_{threshold:02}_imp"
    all_heart_drop[df_name] = pq_jar[df_name].collect().to_pandas()

CPU times: user 485 ms, sys: 278 ms, total: 763 ms
Wall time: 407 ms


## Encode Features

Encode all null-threshold datasets with [`encode_cvd_var()`](./assignment_3_tools.py), which returns the encoded train and test splits for each dataset.

In [6]:
%%time
split_encoded = dict()
# Encode all null-threshold datasets
for key, value in all_heart_drop.items():
    X_train = f"{key}_X_train"
    X_test = f"{key}_X_test"
    y_train = f"{key}_y_train"
    y_test = f"{key}_y_test"
    (split_encoded[X_train], 
     split_encoded[X_test], 
     split_encoded[y_train], 
     split_encoded[y_test]) = encode_cvd_var(value)

CPU times: user 2.14 s, sys: 386 ms, total: 2.52 s
Wall time: 2.63 s
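
For readers without access to the helper, the general shape of an encode-then-split step can be sketched as follows. This is not `encode_cvd_var()` itself: the function name, the target column name `CVD`, the 20% test fraction, and the use of one-hot encoding are all assumptions made for illustration. Only the return order (`X_train`, `X_test`, `y_train`, `y_test`) is taken from the cell above.

```python
import pandas as pd


def encode_and_split_sketch(df, target="CVD", test_frac=0.2, seed=0):
    """One-hot encode the features and split rows into train and test sets.

    Assumes `df` has a unique index and a target column named `target`.
    """
    X = pd.get_dummies(df.drop(columns=[target]), dtype="int8")
    y = df[target]
    # Draw the test rows at random; everything else is training data
    test_idx = X.sample(frac=test_frac, random_state=seed).index
    X_test, y_test = X.loc[test_idx], y.loc[test_idx]
    X_train, y_train = X.drop(index=test_idx), y.drop(index=test_idx)
    return X_train, X_test, y_train, y_test
```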


## Balance Target Features

Balance all null-threshold datasets.

In [9]:
%%time
ratios = [1]
balanced_X_trains = dict()
balanced_y_trains = dict()

# Apply Balancing to all null-threshold datasets
for key in all_heart_drop:
    X_train = split_encoded[f"{key}_X_train"]
    y_train = split_encoded[f"{key}_y_train"]
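    threshold = key[-6:-4]  # two-digit threshold taken from "df_heart_drop_XX_imp"
    for ratio in ratios:
        (balanced_X_trains[f"Over_Sample_1:{ratio}_threshold_{threshold}_X_train"], 
         balanced_y_trains[f"Over_Sample_1:{ratio}_threshold_{threshold}_y_train"]) = rand_over_sample(X_train, y_train, ratio)
        # NOTE: this branch also calls rand_over_sample, so the "Under_Sample" entries
        # are over-sampled datasets (see the identical counts in Results).
        (balanced_X_trains[f"Under_Sample_{ratio}:1_threshold_{threshold}_X_train"], 
         balanced_y_trains[f"Under_Sample_{ratio}:1_threshold_{threshold}_y_train"]) = rand_over_sample(X_train, y_train, ratio)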

CPU times: user 1.24 s, sys: 554 ms, total: 1.8 s
Wall time: 1.85 s
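
The two balancing strategies from the Summary can be sketched with NumPy and pandas as below. These functions are hypothetical stand-ins, not the notebook's helpers. They assume the target is coded 0 = no CVD and 1 = CVD, that the index is unique, that over-sampling at 1:`ratio` draws `ratio` CVD rows per healthy row, and that under-sampling at `ratio`:1 keeps `ratio` healthy rows per CVD row.

```python
import numpy as np
import pandas as pd


def over_sample_sketch(X, y, ratio=1, seed=0):
    """Resample CVD rows with replacement until no:yes is 1:ratio (assumed meaning)."""
    rng = np.random.default_rng(seed)
    no_idx = y.index[y == 0].to_numpy()
    yes_idx = y.index[y == 1].to_numpy()
    drawn = rng.choice(yes_idx, size=ratio * len(no_idx), replace=True)
    keep = np.concatenate([no_idx, drawn])
    return X.loc[keep].reset_index(drop=True), y.loc[keep].reset_index(drop=True)


def under_sample_sketch(X, y, ratio=1, seed=0):
    """Randomly drop healthy rows until no:yes is ratio:1 (assumed meaning)."""
    rng = np.random.default_rng(seed)
    no_idx = y.index[y == 0].to_numpy()
    yes_idx = y.index[y == 1].to_numpy()
    n_keep = min(len(no_idx), ratio * len(yes_idx))  # cannot keep more than exist
    kept_no = rng.choice(no_idx, size=n_keep, replace=False)
    keep = np.concatenate([kept_no, yes_idx])
    return X.loc[keep].reset_index(drop=True), y.loc[keep].reset_index(drop=True)
```

Under these assumptions, over-sampling grows the training set while under-sampling shrinks it, so the two methods should produce different row counts at the same ratio.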


## Results

- [x] Make a table of over/under sampled datasets
- [ ] Add MLP performance

In [10]:
%%time
Balance_Method = list()
No_CVD = list()
Yes_CVD = list()
Total_Obs = list()

for key, value in balanced_y_trains.items():
    Balance_Method.append(key[:-8])
    No_CVD.append(value.value_counts()[0])
    Yes_CVD.append(value.value_counts()[1])
    Total_Obs.append(value.value_counts()[1] + value.value_counts()[0])

tbl_balanced = pd.DataFrame({
    'Balance_Method': Balance_Method,
    'No_CVD': No_CVD,
    'Yes_CVD': Yes_CVD,
    'Total_Obs': Total_Obs})

GT(tbl_balanced.sort_values(by='Balance_Method', ascending=True))

CPU times: user 54 ms, sys: 8.63 ms, total: 62.7 ms
Wall time: 66 ms


| Balance_Method | No_CVD | Yes_CVD | Total_Obs |
|---|---|---|---|
| Over_Sample_1:1_threshold_20 | 290589 | 290589 | 581178 |
| Under_Sample_1:1_threshold_20 | 290589 | 290589 | 581178 |
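
The `Balance_Method` labels pack the method, ratio, and threshold into one string, which is awkward to sort or filter on. A small parser, sketched here under the assumption that every label follows the `[Over/Under]_Sample_[n]:[y]_threshold_[threshold]` pattern seen above, splits them into separate fields. The name `parse_balance_label` is hypothetical.

```python
import re

# Pattern for labels such as "Over_Sample_1:1_threshold_20"
LABEL_PATTERN = re.compile(r"^(Over|Under)_Sample_(\d+):(\d+)_threshold_(\d+)$")


def parse_balance_label(label):
    """Split a balance label into method, no:yes ratio parts, and threshold."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognized balance label: {label!r}")
    method, no_part, yes_part, threshold = match.groups()
    return {
        "method": method,
        "no": int(no_part),
        "yes": int(yes_part),
        "threshold": int(threshold),
    }


# The Results cell derives the label by stripping the 8-character "_y_train" suffix
label = "Over_Sample_1:1_threshold_20_y_train"[:-8]
print(parse_balance_label(label))
```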


## Save as Parquet

In [7]:
%%time
pd_dict_to_parquet(balanced_X_trains, "../../Data/GoogleDrive/Encoded_Data/")
pd_dict_to_parquet(balanced_y_trains, "../../Data/GoogleDrive/Encoded_Data/")
pd_dict_to_parquet(split_encoded, "../../Data/GoogleDrive/Encoded_Data/")
tbl_balanced.to_parquet("../../Data/GoogleDrive/tbl_of_balanced.parquet")

CPU times: user 1min 52s, sys: 1min 2s, total: 2min 55s
Wall time: 2min 27s
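
A dictionary-to-parquet writer along the lines of `pd_dict_to_parquet` could look like the sketch below. It is an assumption about the helper's behavior, not its actual code: the function name `dict_to_parquet_sketch` is hypothetical, and it assumes each dictionary key becomes the file name and that `y` values stored as a pandas `Series` are converted to a one-column frame first, since `Series` has no `to_parquet` method.

```python
from pathlib import Path

import pandas as pd


def dict_to_parquet_sketch(frames, out_dir):
    """Write every entry of `frames` to `<out_dir>/<key>.parquet` and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, obj in frames.items():
        if isinstance(obj, pd.Series):
            obj = obj.to_frame()       # Series must be wrapped before writing
        path = out_dir / f"{name}.parquet"
        obj.to_parquet(path)           # needs a parquet engine such as pyarrow
        written.append(path)
    return written
```

Note that the dataset keys contain a colon (for example `Over_Sample_1:1_...`), which is not a valid file name character on every file system.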
