# Figures and Tables

## Table of Contents
1. [Read all parquets from Google Drive](#1-read-all-parquets-from-google-drive)
2. [Summarizing Adverse Cardiovascular Events with Heart Disease](#2-summarizing-adverse-cardiovascular-events-with-heart-disease)
3. [Missing Threshold Performance Table](#3-missing-threshold-performance-table)
4. [Table comparing using imputation](#4-table-comparing-using-imputation)

In [1]:
from asgmnt_2_tools import lazy_read_parquet
from great_tables import GT, md, html # Dataframe Formatting
import numpy as np # Array
import pandas as pd # Dataframe
import polars as pl # Lazyframe
import seaborn as sns # Plots
import sidetable as stb # Frequency/Missing Dataframe

## 1. Read All Parquets From Google Drive

| Dataframe Name | Description                        |
|----------------|------------------------------------|
| df_original    | No modifications to the data       |
| df             | Adds the column `HadHeartDisease`. If a target variable is missing, `HadHeartDisease` is missing.|
| df_heart_drop  | Drop all observations where `HadHeartDisease` is missing.|
| df_heart_drop_## | Drop observations that pass the ## threshold for missing values |
| df*_imp        | Imputed version of all the above dataframes |

In [2]:
path_drive = "../../Data/GoogleDrive"

dict_lazy = dict(sorted(lazy_read_parquet(path_drive).items()))

## 2. Summarizing Adverse Cardiovascular Events with Heart Disease

This section describes how `HadHeartDisease` summarizes `HadHeartAttack`, `HadStroke`, and `HadAngina`.
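As a minimal sketch of that summary rule in plain Python: the table footnote in this section states that Heart Disease is positive if Heart Attack, Stroke, or Angina is positive. The function name `had_heart_disease` is hypothetical, and the handling of missing values (a "Yes" takes precedence, otherwise any missing value makes the result missing) is an assumption that appears consistent with the counts reported in this section, not a confirmed description of how the column was built.

```python
from typing import Optional


def had_heart_disease(
    attack: Optional[str], stroke: Optional[str], angina: Optional[str]
) -> Optional[str]:
    """Combine three event flags into one summary flag (hypothetical helper)."""
    events = (attack, stroke, angina)
    # Positive if any of the three events is positive.
    if "Yes" in events:
        return "Yes"
    # Negative only when all three events are known to be negative.
    if all(event == "No" for event in events):
        return "No"
    # Assumption: no "Yes" and at least one missing value gives a missing result.
    return None


print(had_heart_disease("No", "Yes", "No"))
print(had_heart_disease("No", "No", "No"))
print(had_heart_disease("No", None, "No"))
```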

To save the table as an image, install Selenium with `pip install -U selenium`, then
download ChromeDriver and place it in `/usr/local/bin`.
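Before saving a table, it may help to confirm that the driver is discoverable. The following is a minimal sketch using only the standard library; the helper name `chromedriver_on_path` is hypothetical, and whether a separately installed driver is required at all may depend on the Selenium version in use.

```python
import shutil


def chromedriver_on_path() -> bool:
    """Return True if an executable named 'chromedriver' is found on PATH."""
    return shutil.which("chromedriver") is not None


if not chromedriver_on_path():
    print("chromedriver was not found on PATH; saving the table as a PNG may fail.")
```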

In [3]:
# Use SideTable to create frequency dataframe
tbl_1 = (
    dict_lazy["df_heart_drop_null"]
    .select(["HadHeartAttack", "HadStroke", "HadAngina", "HadHeartDisease"])
    .collect() # Lazyframe to dataframe
    .to_pandas() # Polars to Pandas
    # Using sidetable, generate a frequency table
    .stb.freq(
        ["HadHeartAttack", "HadStroke", "HadAngina", "HadHeartDisease"]
    )
    .drop(columns=['cumulative_count', 'cumulative_percent']) # Drop cumulative columns
    .assign(percent = lambda df: df['percent'] / 100) # Convert percent (0-100) to a proportion (0-1)
)



In [4]:
tbl_1_cols = ["HadHeartAttack", "HadStroke", "HadAngina", "HadHeartDisease"]

# Count the 'Yes' responses per column: map 'Yes' to 1 and everything else to 0, then sum
tbl_1_tot = (
    dict_lazy["df_heart_drop_null"]
    .select(tbl_1_cols)
    .with_columns(
        [
            pl.when(pl.col(colname) == "Yes")
            .then(pl.lit(1))
            .otherwise(pl.lit(0))
            .alias(colname)
            for colname in tbl_1_cols
        ]
    )
    .sum()
    .collect()
    .to_pandas()
)
tbl_1_tot

|   | HadHeartAttack | HadStroke | HadAngina | HadHeartDisease |
|---|----------------|-----------|-----------|-----------------|
| 0 | 25108          | 19239     | 26551     | 52256           |


In [5]:
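# Count observations where all three events are "No"
tbl_1_no = (
    dict_lazy["df_heart_drop_null"]
    .select(tbl_1_cols)
    .with_columns(
        pl.when(
            (pl.col('HadHeartAttack') == "No") &
            (pl.col('HadStroke') == "No") &
            (pl.col('HadAngina') == "No")
        )
        .then(1)
        .otherwise(0)
        .alias('No_Sum')
    )
    .sum() # The string columns have no sum and come back as None
    .collect()
    .to_pandas()
)
print(tbl_1_no)

# Show thousands separators when displaying floats
pd.options.display.float_format = '{:,}'.format

# Label the totals row that is appended to the frequency table
tbl_1_tot[' '] = 'Total Positive'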



  HadHeartAttack HadStroke HadAngina HadHeartDisease  No_Sum
0           None      None      None            None  387696


In [6]:
tbl_1 = pd.concat([tbl_1, tbl_1_tot], ignore_index=True)
tbl_1 = tbl_1.reindex(columns=[' ', "HadHeartAttack", "HadStroke", "HadAngina", "HadHeartDisease", 'count', 'percent'])

In [7]:

# Use GreatTables to format the dataframe
tbl_event_freq = (GT(tbl_1)
    .tab_header(title="Adverse Cardiovascular Event Frequency",
                subtitle = "Summarizing Adverse Cardiovascular Events with Heart Disease") # Title
    # Rename columns
    .cols_label(
        count = html("Count"),
        percent = html("Percent"),
        HadHeartAttack = html("Heart Attack"),
        HadStroke = html("Stroke"),
        HadAngina = html("Angina"),
        HadHeartDisease = html("Heart Disease")
    )
    .cols_align(align="center") # Body Alignment
    .fmt_percent(columns="percent", decimals=1) # Column Formatting
    .fmt_number(columns="count", rows= list(range(8)), decimals=0)
    .fmt_number(columns=tbl_1_cols, rows = 8, decimals=0)
    # Footnote
    .tab_source_note(
        source_note="Heart Disease was positive if Heart Attack, Stroke, or Angina were positive."
    )
)


tbl_event_freq.save(file="./figures/tbl_event_freq.png")
tbl_event_freq

**Adverse Cardiovascular Event Frequency**
Summarizing Adverse Cardiovascular Events with Heart Disease

|                | Heart Attack | Stroke | Angina | Heart Disease | Count  | Percent |
|----------------|--------------|--------|--------|---------------|--------|---------|
|                | No           | No     | No     | No            | 387696 | 88.6%   |
|                | No           | No     | Yes    | Yes           | 12438  | 2.8%    |
|                | No           | Yes    | No     | Yes           | 11939  | 2.7%    |
|                | Yes          | No     | No     | Yes           | 9789   | 2.2%    |
|                | Yes          | No     | Yes    | Yes           | 9259   | 2.1%    |
|                | Yes          | Yes    | Yes    | Yes           | 2568   | 0.6%    |
|                | Yes          | Yes    | No     | Yes           | 2091   | 0.5%    |
|                | No           | Yes    | Yes    | Yes           | 1730   | 0.4%    |
| Total Positive | 25108        | 19239  | 26551  | 52256         |        |         |
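The Percent column above can be checked by hand from the Count column. The short sketch below copies the eight counts from the table output and recomputes each share of their sum; the variable names are illustrative only.

```python
# Counts for the eight event combinations, copied from the table output above.
counts = [387696, 12438, 11939, 9789, 9259, 2568, 2091, 1730]

total = sum(counts)
# Share of each combination, as a percentage rounded to one decimal place.
percents = [round(100 * count / total, 1) for count in counts]
# Rows in which at least one event is "Yes" (every row except the first).
positive_rows = total - counts[0]

print(total)          # 437510
print(percents)       # [88.6, 2.8, 2.7, 2.2, 2.1, 0.6, 0.5, 0.4]
print(positive_rows)  # 49814
```

The recomputed percentages match the table. The 49,814 positive rows counted here are fewer than the 52,256 reported in the Total Positive row; one possible explanation, which the source does not confirm, is that the frequency table omits observations where one of the event columns is missing.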


## 3. Missing Threshold Performance Table

In [8]:
tbl_2 = [] # One row per threshold dataframe

# Dataframes whose names end in two digits were built with that threshold:
# if the number of missing values is > threshold, the observation was dropped.
for key, value in dict_lazy.items():
    if key[-2:].isdigit():
        threshold = int(key[-2:])
        # Replace the two extreme thresholds with descriptive labels
        if threshold == 0:
            threshold = "Drop All Missing Observations"
        elif threshold == 40:
            threshold = "Keep All Missing Observations"
        tbl_2.append(
            {"Threshold": threshold,
             "Number of Rows": value.select(pl.len()).collect().item()
             })

tbl_2 = pd.DataFrame(tbl_2)

tbl_2 = (GT(tbl_2)
    .tab_header(title="Missing Value Threshold Performance",
                subtitle = "Metrics from Logistic Regressions Were Used to Measure Performance") # Title
    .cols_align(align="center") # Body Alignment
)

tbl_2.save(file="./figures/tbl_threshold.png")
tbl_2

**Missing Value Threshold Performance**
Metrics from Logistic Regressions Were Used to Measure Performance

| Threshold                     | Number of Rows |
|-------------------------------|----------------|
| Drop All Missing Observations | 248265         |
| 1                             | 342961         |
| 3                             | 387487         |
| 5                             | 394700         |
| 10                            | 412908         |
| 20                            | 439673         |
| Keep All Missing Observations | 439952         |
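The rule stated in the code comment for this section (drop an observation when its number of missing values is greater than the threshold) can be illustrated with a small pure-Python sketch. The helper names and the toy rows are hypothetical; the actual threshold dataframes were prepared elsewhere and are only read here.

```python
from typing import List, Optional, Sequence

Row = Sequence[Optional[object]]


def count_missing(row: Row) -> int:
    """Number of missing (None) values in one observation."""
    return sum(value is None for value in row)


def drop_above_threshold(rows: List[Row], threshold: int) -> List[Row]:
    """Keep observations whose missing-value count does not exceed the threshold."""
    return [row for row in rows if count_missing(row) <= threshold]


# Toy data: three observations with 0, 1, and 3 missing values.
rows = [
    ("Yes", "No", 31.2),
    ("No", None, 27.5),
    (None, None, None),
]

kept_counts = {t: len(drop_above_threshold(rows, t)) for t in (0, 1, 3)}
print(kept_counts)  # {0: 1, 1: 2, 3: 3}
```

With a threshold of 0, only complete observations survive, which corresponds to the "Drop All Missing Observations" label. A threshold at least as large as the number of columns keeps every observation.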


## 4. Table Comparing Using Imputation