# US Holidays and Special Events

- New Year's Day
- Martin Luther King Jr. Day
- Super Bowl
- Presidents' Day
- Memorial Day
- Independence Day
- Labor Day
- Columbus Day
- Veterans Day
- Thanksgiving
- Christmas
- New Year's Eve

Sources: 
- https://www.usa.gov/holidays
- https://www.timeanddate.com/holidays/us/2015


## Setup environment

In [0]:
blob_container = "261storagecontainer"  
storage_account = "261storage" 
secret_scope = "261_team_6_1_spring24_scope"  
secret_key = "team_6_1_key"  
team_blob_url = f"wasbs://{blob_container}@{storage_account}.blob.core.windows.net" 


# blob storage is mounted here.
mids261_mount_path = "/mnt/mids-w261"

# SAS Token: Grant the team limited access to Azure Storage resources
spark.conf.set(
    f"fs.azure.sas.{blob_container}.{storage_account}.blob.core.windows.net",
    dbutils.secrets.get(scope=secret_scope, key=secret_key),
)

# see what's in the blob storage root folder
# display(dbutils.fs.ls(f"{team_blob_url}"))

# Base directory of the mounted course data (DBFS path to the mount above)
data_BASE_DIR = "dbfs:/mnt/mids-w261/"
# display(dbutils.fs.ls(f"{data_BASE_DIR}"))

## Import libraries

In [0]:
# Standard data and plotting libraries
import pandas as pd
import matplotlib.pyplot as plt
import pyspark.sql.functions as F
import seaborn as sns

# Spark column types, available for boolean-flag sanity checks
from pyspark.sql.types import BooleanType, ArrayType


## Load file
- Uploaded the .csv file from Google Drive to our blob via File --> Add data
- Queried the resulting table using SQL
- Saved the query result as a Spark DataFrame
- Saved the DataFrame as a Parquet file

In [0]:
%sql
SELECT * FROM `hive_metastore`.`default`.`events_2015_2019_sheet_1`;


In [0]:
# _sqldf holds the result of the preceding %sql cell as a Spark DataFrame
events_2015_2019 = _sqldf
events_2015_2019.display()

In [0]:
# Generate one row per date between start_date and end_date (inclusive) for each event
date_range = F.expr("sequence(to_date(start_date), to_date(end_date), interval 1 day)")
exploded_dates = events_2015_2019.select("*", F.explode(date_range).alias("date"))

# Show the exploded dates DataFrame
exploded_dates.display()
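To make the expansion step concrete, here is a minimal plain-Python sketch of what the `sequence` and `explode` combination produces for a single row: one `(event, date)` pair per day, with both endpoints included. The helper name `expand_dates` and the sample date window are hypothetical and chosen only for illustration; they are not taken from the events table.

```python
from datetime import date, timedelta


def expand_dates(event, start, end):
    """Return one (event, date) pair per day from start to end, inclusive."""
    n_days = (end - start).days
    return [(event, start + timedelta(days=i)) for i in range(n_days + 1)]


# Hypothetical row: a five-day window around Independence Day 2015.
rows = expand_dates("Independence Day", date(2015, 7, 2), date(2015, 7, 6))
for event, day in rows:
    print(event, day.isoformat())
```

A window whose start and end dates are equal yields exactly one row, which is the behavior to expect for single-day events.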

In [0]:
# Keep only the event name and the expanded date
events_2015_2019 = exploded_dates.select("event", "date")
events_2015_2019.display()

## Sanity checks
1. Check that each event has 25 rows (5 dates per year, for 5 years)
2. Check the distinct values of `event` for possible typos


In [0]:
# Count the number of rows for each event
case_counts = events_2015_2019.groupBy("event").count()
case_counts.display()
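Both sanity checks can be read off the per-event counts: a misspelled event name shows up as an extra event with very few rows, and it also leaves the correctly spelled event short of 25. The sketch below illustrates this in plain Python with made-up data; `unexpected_counts` and the sample list are hypothetical and not part of the notebook.

```python
from collections import Counter

EXPECTED_ROWS = 25  # rows per event, as stated in the sanity checks above


def unexpected_counts(events, expected=EXPECTED_ROWS):
    """Return {event: count} for every event whose row count differs from expected."""
    counts = Counter(events)
    return {event: n for event, n in counts.items() if n != expected}


# Hypothetical data: one Thanksgiving row was entered with a typo.
sample = ["Christmas"] * 25 + ["Thanksgiving"] * 24 + ["Thanksgving"] * 1
problems = unexpected_counts(sample)
print(problems)
```

An empty result means every event has the expected number of rows and no stray spellings were counted as separate events.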

## Save Parquet file

In [0]:
# Save as a Parquet file in the team blob storage, replacing any existing copy
events_2015_2019.write.mode("overwrite").parquet(f"{team_blob_url}/5y_events")

In [0]:
# Reload the saved Parquet file to confirm the write succeeded
events_2015_2019 = spark.read.parquet(f"{team_blob_url}/5y_events")
events_2015_2019.display()