# Figures and Tables

## Table of Contents
1. [Read all parquets from Google Drive](#1-read-all-parquets-from-google-drive)
2. HadHeartDisease Summarization
3. Missing Threshold Representation
4. Todo

In [1]:
from asgmnt_2_tools import lazy_read_parquet
from great_tables import GT, md, html # Dataframe Formatting
import numpy as np # Array
import pandas as pd # Dataframe
import polars as pl # Lazyframe
import seaborn as sns # Plots
import sidetable as stb # Frequency/Missing Dataframe

## 1. Read All Parquets From Google Drive

| Dataframe Name   | Description |
|------------------|-------------|
| df_original      | No modifications to the data. |
| df               | Added column `HadHeartDisease`. If a target variable is missing, `HadHeartDisease` is missing. |
| df_heart_drop    | Drops all observations where a target variable is missing. Dropped `HadHeartAttack`, `HadStroke`, `HadAngina`, and `BMI`. |
| df_heart_drop_## | Drops observations whose number of missing values exceeds the ## threshold. |
| `df*_imp`        | Imputed version of each of the above dataframes. |

In [2]:
path_drive = "../../Data/GoogleDrive"

dict_lazy = dict(sorted(lazy_read_parquet(path_drive).items()))

## 2. Frequency of Adverse Cardiovascular Events

Describes how `HadHeartDisease` summarizes `HadHeartAttack`, `HadStroke`, and `HadAngina`.
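
The notebook does not show how `HadHeartDisease` is derived, so the following is only a minimal sketch of one plausible rule. The function name `had_heart_disease` is hypothetical, and the handling of missing values (any missing event makes the summary missing) is an assumption taken from the description of `df` in Section 1; the real pipeline may treat missing values differently.

```python
from typing import Optional


def had_heart_disease(
    attack: Optional[str], stroke: Optional[str], angina: Optional[str]
) -> Optional[str]:
    """Combine three Yes/No event flags into one summary flag (illustrative rule)."""
    events = (attack, stroke, angina)
    # Assumption: if any event is missing (None), the summary is missing too.
    if any(event is None for event in events):
        return None
    # Otherwise the summary is "Yes" when at least one event is "Yes".
    return "Yes" if any(event == "Yes" for event in events) else "No"
```

Under this rule, a respondent with only angina is still flagged as having heart disease, while a respondent with three "No" answers is not.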

To save the table, install selenium using `pip install -U selenium`, then
download chromedriver and put it into `/usr/local/bin`.

In [3]:
# Use SideTable to create frequency dataframe
tbl_1 = (
    dict_lazy["df"]
    .select(["HadHeartAttack", "HadStroke", "HadAngina", "HadHeartDisease"])
    .collect() # Lazyframe to dataframe
    .to_pandas() # Polars to Pandas
    # Using sidetable, generate a frequency table
    .stb.freq(
        ["HadHeartAttack", "HadStroke", "HadAngina", "HadHeartDisease"]
    )
    .drop(columns=['cumulative_count', 'cumulative_percent']) # Drop Cumulative Columns
    .assign(percent = lambda df: df['percent'] / 100) # Rescale percent (0-100) to a proportion (0-1) for fmt_percent
)
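
For readers unfamiliar with sidetable, the frequency table built above amounts to counting each distinct combination of the four columns and dividing by the total. The sketch below shows that idea with the standard library only; the rows are toy data (not from the dataset), and `freq_table` is a hypothetical helper, not part of sidetable.

```python
from collections import Counter

# Toy rows: (HadHeartAttack, HadStroke, HadAngina, HadHeartDisease).
rows = [
    ("No", "No", "No", "No"),
    ("No", "No", "No", "No"),
    ("No", "No", "Yes", "Yes"),
    ("Yes", "No", "No", "Yes"),
]


def freq_table(rows):
    """Count each distinct combination and its share of all rows."""
    counts = Counter(rows)
    total = sum(counts.values())
    # most_common() orders combinations from most to least frequent.
    return [
        {"combination": combo, "count": n, "percent": n / total}
        for combo, n in counts.most_common()
    ]


table = freq_table(rows)
```

Here `percent` is stored as a proportion between 0 and 1, which is the scale a percent formatter typically expects.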



In [4]:
tbl_1_cols = ["HadHeartAttack", "HadStroke", "HadAngina", "HadHeartDisease"]

# Transform all specified columns to nullify non-'Yes' values
tbl_1_tot = (
    dict_lazy["df"]
    .select(tbl_1_cols)
    .with_columns(
        [
            pl.when(pl.col(colname) == "Yes")
            .then(pl.lit(1))
            .otherwise(pl.lit(None))
            .alias(colname)
            for colname in tbl_1_cols
        ]
    )
    .count()
    .collect()
    .to_pandas()
)

In [5]:
# Count observations where all three events are "No"
tbl_1_no = (
    dict_lazy["df"]
    .select(tbl_1_cols)
    .with_columns(
        pl.when(
            (pl.col('HadHeartAttack') == "No") &
            (pl.col('HadStroke') == "No") &
            (pl.col('HadAngina') == "No")
        )
        .then(1)
        .otherwise(None)
        .alias('No_Sum')
    )
    .count() # Non-null count per column
    .sum()
    .collect()
)


pd.options.display.float_format = '{:,}'.format


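# Sum of the per-column "Yes" counts across the four columns
tbl_1_tot['count'] = tbl_1_tot.iloc[0].sum()
# iloc[0] now also contains the 'count' column added above, so each sum below is twice that value
tbl_1_tot['percent'] = tbl_1_tot.iloc[0].sum() / (tbl_1_no['No_Sum'][0] + tbl_1_tot.iloc[0].sum())
tbl_1_tot[' '] = 'Total Positive' # Row label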



In [6]:
tbl_1 = pd.concat([tbl_1, tbl_1_tot], ignore_index=True)
tbl_1 = tbl_1.reindex(columns=[' ', "HadHeartAttack", "HadStroke", "HadAngina", "HadHeartDisease", 'count', 'percent'])

In [7]:

# Use GreatTables to format the dataframe
tbl_event_freq = (GT(tbl_1)
    .tab_header(title="Adverse Cardiovascular Event Frequency",
                subtitle = "Summarizing Adverse Cardiovascular Events with Heart Disease") # Title
    # Rename columns
    .cols_label(
        count = html("Count"),
        percent = html("Percent"),
        HadHeartAttack = html("Heart Attack"),
        HadStroke = html("Stroke"),
        HadAngina = html("Angina"),
        HadHeartDisease = html("Heart Disease")
    )
    .cols_align(align="center") # Body Alignment
    .fmt_percent(columns="percent") # Column Formatting
    .fmt_number(columns="count", decimals=0)
    .fmt_number(columns=tbl_1_cols, rows=8, decimals=0)
    # Footnote
    .tab_source_note(
        source_note="Heart Disease is positive if at least one of Heart Attack, Stroke, or Angina was positive."
    )
)


# tbl_event_freq.save(file="./figures/tbl_event_freq.png")
tbl_event_freq

Adverse Cardiovascular Event Frequency
Summarizing Adverse Cardiovascular Events with Heart Disease

|                | Heart Attack | Stroke | Angina | Heart Disease | Count  | Percent |
|----------------|--------------|--------|--------|---------------|--------|---------|
|                | No           | No     | No     | No            | 387696 | 88.61%  |
|                | No           | No     | Yes    | Yes           | 12438  | 2.84%   |
|                | No           | Yes    | No     | Yes           | 11939  | 2.73%   |
|                | Yes          | No     | No     | Yes           | 9789   | 2.24%   |
|                | Yes          | No     | Yes    | Yes           | 9259   | 2.12%   |
|                | Yes          | Yes    | Yes    | Yes           | 2568   | 0.59%   |
|                | Yes          | Yes    | No     | Yes           | 2091   | 0.48%   |
|                | No           | Yes    | Yes    | Yes           | 1730   | 0.40%   |
| Total Positive | 25108        | 19239  | 26551  | 52256         | 123154 | 38.85%  |


## HadHeartDisease Summarization

In [15]:
tbl_row = [] # to tabulate df rows

# Each dataframe whose name ends in a two-digit threshold has already had observations
# with more missing values than that threshold dropped; record its remaining row count.
for key, value in dict_lazy.items():
    if key[-2:].isdigit():
        threshold = int(key[-2:])
        tbl_row.append(
            {"Missing Threshold": threshold,
             "Number of Rows": value.select(pl.len()).collect().item()
             })

tbl_row = pd.DataFrame(tbl_row)

(GT(tbl_row)
    .tab_header(title="Observations Dropped",
                subtitle = "Summarizing Adverse Cardiovascular Events with Heart Disease") # Title
    .cols_align(align="center") # Body Alignment
    # Footnote
    .tab_source_note(
        source_note="Heart Disease is positive if at least one of Heart Attack, Stroke, or Angina was positive."
    )
)

Observations Dropped
Summarizing Adverse Cardiovascular Events with Heart Disease

| Missing Threshold | Number of Rows |
|-------------------|----------------|
| 0                 | 247367         |
| 1                 | 341472         |
| 3                 | 385494         |
| 5                 | 392561         |
| 10                | 410633         |
| 20                | 437240         |
| 40                | 437510         |

Heart Disease is positive if at least one of Heart Attack, Stroke, or Angina was positive.
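
The thresholded dataframes are built elsewhere, so the filter itself is not shown in this notebook. As a rough sketch of the idea, assuming the threshold is a count of missing values per observation (as the code comment in the cell above suggests), a row is kept only when its number of missing values does not exceed the threshold. The helper `drop_above_threshold` and the toy rows are hypothetical.

```python
def drop_above_threshold(rows, threshold):
    """Keep rows whose number of missing (None) values is <= threshold."""
    return [
        row for row in rows
        if sum(value is None for value in row.values()) <= threshold
    ]


# Toy observations with 0, 1, 2, and 3 missing values respectively.
rows = [
    {"BMI": 27.1, "HadStroke": "No", "HadAngina": "No"},
    {"BMI": None, "HadStroke": "No", "HadAngina": "Yes"},
    {"BMI": None, "HadStroke": None, "HadAngina": "No"},
    {"BMI": None, "HadStroke": None, "HadAngina": None},
]

# Same shape as the table above: one entry per threshold.
summary = [
    {"Missing Threshold": t, "Number of Rows": len(drop_above_threshold(rows, t))}
    for t in (0, 1, 3)
]
```

As in the table above, a looser threshold keeps more rows, so the row count can only grow as the threshold increases.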


## Missing Threshold Representation

## Todo

1. Table comparing the number of observations of `HadHeartAttack`, `HadStroke`, `HadAngina`, and the summarized column `HadHeartDisease`.
    a. Provide an example using a rudimentary model (logistic regression with elastic net regularization)

2. Table showing the number of observations after setting a missing threshold.
    a. Provide an example using a rudimentary model (logistic regression with elastic net regularization)

3. Table comparing the dataframes with and without imputation.