## Assignment 2: Exploratory Data Analysis

William "Taylor" Martinez

#### Notes and Setup

`dplyr` verbs and their `polars` equivalents (see the [polars migration guide](https://docs.pola.rs/user-guide/migration/pandas/#column-assignment)):

| Operation            | Syntax                                                     |
|----------------------|------------------------------------------------------------|
| read (lazy)          | `pl.scan_csv()` or `pl.scan_parquet()`                     |
| collect              | `df.collect()`                                             |
| select               | `df.select("col_name1", "col_name2")`                      |
| filter               | `df.filter(pl.col("col_name") < 10)`                       |
| missing              | `null`                                                     |
| mutate               | `df.with_columns(new_col_name = pl.col("col_name") * 10)`  |
| mutate (conditional) | `df.with_columns(pl.when(pl.col("c") == 2).then(pl.col("b")).otherwise(pl.col("a")).alias("a"))` |

`csv` vs `parquet`:
Parquet was chosen over `csv` because it takes up less space, is stored in a columnar format, and offers better query performance ([Medium](https://medium.com/@dinesh1.chopra/unveiling-the-battle-apache-parquet-vs-csv-exploring-the-pros-and-cons-of-data-formats-b6bfd8e43107)).








In [1]:
import polars as pl

In [2]:
# Convert to parquet and read in the data
data_path = "../../Data/"
df = pl.scan_csv(data_path + "heart_2022_with_nans.csv")
df.collect().write_parquet(data_path + "heart_2022_with_nans.parquet")
df = pl.scan_parquet(data_path + "heart_2022_with_nans.parquet")

### EDA Task 1: Create `HadHeartDisease` column

1. Set `HadHeartDisease` to `True` if the survey participant reported having at least one of the following adverse cardiovascular events: heart attack (`HadHeartAttack`), stroke (`HadStroke`), or angina (`HadAngina`). The cell below encodes the flag as `"Yes"`/`"No"` strings, matching the other survey columns.

In [3]:
df = df.with_columns(
    pl.when(
        (pl.col("HadHeartAttack") == "Yes") |
        (pl.col("HadStroke") == "Yes") |
        (pl.col("HadAngina") == "Yes")
    )
    .then(pl.lit("Yes"))
    .otherwise(pl.lit("No"))
    .alias("HadHeartDisease")
)

# Preview the first 10 rows (LazyFrame.fetch is deprecated in recent polars releases)
df.head(10).collect()

State,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadAngina,HadStroke,HadAsthma,HadSkinCancer,HadCOPD,HadDepressiveDisorder,HadKidneyDisease,HadArthritis,HadDiabetes,DeafOrHardOfHearing,BlindOrVisionDifficulty,DifficultyConcentrating,DifficultyWalking,DifficultyDressingBathing,DifficultyErrands,SmokerStatus,ECigaretteUsage,ChestScan,RaceEthnicityCategory,AgeCategory,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,HadHeartDisease
str,str,str,f64,f64,str,str,f64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,f64,f64,f64,str,str,str,str,str,str,str,str
"""Alabama""","""Female""","""Very good""",0.0,0.0,"""Within past ye…","""No""",8.0,,"""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Not at all (ri…","""No""","""White only, No…","""Age 80 or olde…",,,,"""No""","""No""","""Yes""","""No""","""Yes, received …","""No""","""No""","""No"""
"""Alabama""","""Female""","""Excellent""",0.0,0.0,,"""No""",6.0,,"""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""No""","""White only, No…","""Age 80 or olde…",1.6,68.04,26.57,"""No""","""No""","""No""","""No""","""No, did not re…","""No""","""No""","""No"""
"""Alabama""","""Female""","""Very good""",2.0,3.0,"""Within past ye…","""Yes""",5.0,,"""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""No""","""White only, No…","""Age 55 to 59""",1.57,63.5,25.61,"""No""","""No""","""No""","""No""",,"""No""","""Yes""","""No"""
"""Alabama""","""Female""","""Excellent""",0.0,0.0,"""Within past ye…","""Yes""",7.0,,"""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Current smoker…","""Never used e-c…","""Yes""","""White only, No…",,1.65,63.5,23.3,"""No""","""No""","""Yes""","""Yes""","""No, did not re…","""No""","""No""","""No"""
"""Alabama""","""Female""","""Fair""",2.0,0.0,"""Within past ye…","""Yes""",9.0,,"""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""Yes""","""White only, No…","""Age 40 to 44""",1.57,53.98,21.77,"""Yes""","""No""","""No""","""Yes""","""No, did not re…","""No""","""No""","""No"""
"""Alabama""","""Male""","""Poor""",1.0,0.0,"""Within past ye…","""No""",7.0,,"""Yes""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""No""","""White only, No…","""Age 80 or olde…",1.8,84.82,26.08,"""No""","""No""","""No""","""Yes""","""No, did not re…","""No""","""No""","""Yes"""
"""Alabama""","""Female""","""Very good""",0.0,0.0,"""Within past ye…","""Yes""",7.0,,"""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Former smoker""","""Never used e-c…","""No""","""Black only, No…","""Age 80 or olde…",1.65,62.6,22.96,"""Yes""","""No""","""No""","""No""","""No, did not re…","""No""","""No""","""No"""
"""Alabama""","""Female""","""Good""",0.0,0.0,"""Within past ye…","""No""",8.0,,"""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""Yes""","""White only, No…","""Age 80 or olde…",1.63,73.48,27.81,"""No""","""No""","""Yes""","""Yes""","""Yes, received …","""No""","""No""","""No"""
"""Alabama""","""Female""","""Good""",0.0,0.0,"""Within past ye…","""Yes""",6.0,,"""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""Yes""","""No""","""No""","""Yes""","""No""","""Yes""","""No""","""No""","""Former smoker""","""Not at all (ri…",,"""White only, No…","""Age 75 to 79""",1.7,,,"""No""","""Yes""","""No""","""No""","""Yes, received …","""No""","""No""","""No"""
"""Alabama""","""Female""","""Good""",1.0,0.0,"""Within past ye…","""Yes""",7.0,,"""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Yes""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…",,"""White only, No…","""Age 70 to 74""",1.68,81.65,29.05,"""Yes""",,"""Yes""","""Yes""","""No, did not re…","""No""","""No""","""No"""


### EDA Task 2: Drop Observations With Too Many Missing Values

1. Create `df_heart_drop`, dropping participants whose response for heart attack (`HadHeartAttack`), stroke (`HadStroke`), or angina (`HadAngina`) is missing.

2. From `df_heart_drop`, make multiple dataframes that drop survey participants based on the number of missing responses.

3. Collect the dataframes and report the number of rows in each.


In [7]:
df_heart_drop = df.drop_nulls(subset=["HadHeartAttack", "HadStroke", "HadAngina"])

# Preview the filtered frame (the original cell previewed `df` by mistake)
df_heart_drop.head(5).collect()

State,Sex,GeneralHealth,PhysicalHealthDays,MentalHealthDays,LastCheckupTime,PhysicalActivities,SleepHours,RemovedTeeth,HadHeartAttack,HadAngina,HadStroke,HadAsthma,HadSkinCancer,HadCOPD,HadDepressiveDisorder,HadKidneyDisease,HadArthritis,HadDiabetes,DeafOrHardOfHearing,BlindOrVisionDifficulty,DifficultyConcentrating,DifficultyWalking,DifficultyDressingBathing,DifficultyErrands,SmokerStatus,ECigaretteUsage,ChestScan,RaceEthnicityCategory,AgeCategory,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers,HIVTesting,FluVaxLast12,PneumoVaxEver,TetanusLast10Tdap,HighRiskLastYear,CovidPos,HadHeartDisease
str,str,str,f64,f64,str,str,f64,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,str,f64,f64,f64,str,str,str,str,str,str,str,str
"""Alabama""","""Female""","""Very good""",0.0,0.0,"""Within past ye…","""No""",8.0,,"""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Not at all (ri…","""No""","""White only, No…","""Age 80 or olde…",,,,"""No""","""No""","""Yes""","""No""","""Yes, received …","""No""","""No""","""No"""
"""Alabama""","""Female""","""Excellent""",0.0,0.0,,"""No""",6.0,,"""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""No""","""White only, No…","""Age 80 or olde…",1.6,68.04,26.57,"""No""","""No""","""No""","""No""","""No, did not re…","""No""","""No""","""No"""
"""Alabama""","""Female""","""Very good""",2.0,3.0,"""Within past ye…","""Yes""",5.0,,"""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""No""","""White only, No…","""Age 55 to 59""",1.57,63.5,25.61,"""No""","""No""","""No""","""No""",,"""No""","""Yes""","""No"""
"""Alabama""","""Female""","""Excellent""",0.0,0.0,"""Within past ye…","""Yes""",7.0,,"""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""Yes""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Current smoker…","""Never used e-c…","""Yes""","""White only, No…",,1.65,63.5,23.3,"""No""","""No""","""Yes""","""Yes""","""No, did not re…","""No""","""No""","""No"""
"""Alabama""","""Female""","""Fair""",2.0,0.0,"""Within past ye…","""Yes""",9.0,,"""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""No""","""Never smoked""","""Never used e-c…","""Yes""","""White only, No…","""Age 40 to 44""",1.57,53.98,21.77,"""Yes""","""No""","""No""","""Yes""","""No, did not re…","""No""","""No""","""No"""


In [25]:
# Keep participants with at most k missing responses (k = 0, 1, 3, 5, 10, 20, 40)
df_heart_drop_00 = df_heart_drop.filter(pl.sum_horizontal(pl.all().is_null()) <= 0)
df_heart_drop_01 = df_heart_drop.filter(pl.sum_horizontal(pl.all().is_null()) <= 1)
df_heart_drop_03 = df_heart_drop.filter(pl.sum_horizontal(pl.all().is_null()) <= 3)
df_heart_drop_05 = df_heart_drop.filter(pl.sum_horizontal(pl.all().is_null()) <= 5)
df_heart_drop_10 = df_heart_drop.filter(pl.sum_horizontal(pl.all().is_null()) <= 10)
df_heart_drop_20 = df_heart_drop.filter(pl.sum_horizontal(pl.all().is_null()) <= 20)
df_heart_drop_40 = df_heart_drop.filter(pl.sum_horizontal(pl.all().is_null()) <= 40)

In [31]:
# Number of participants remaining at each missing-value threshold
print(df_heart_drop_00.select(pl.len()).collect().item())
print(df_heart_drop_01.select(pl.len()).collect().item())
print(df_heart_drop_03.select(pl.len()).collect().item())
print(df_heart_drop_05.select(pl.len()).collect().item())
print(df_heart_drop_10.select(pl.len()).collect().item())
print(df_heart_drop_20.select(pl.len()).collect().item())
print(df_heart_drop_40.select(pl.len()).collect().item())

246022
331181
381718
391725
410245
436507
437510
