## Minnesota State COVID Response Analysis
This notebook contains the work to identify associations between the Minnesota state governmental response and the COVID-19 case count throughout the pandemic.


## Data Cleanup
As with most data mining projects, we first need to clean the given data file so that it contains only what the analysis needs. The "all-states-history.csv" file is a dataset of U.S. COVID-19 cases and deaths dating from the start of the pandemic to 11/29/20 and was sourced from [The Covid Tracking Project](https://covidtracking.com/data). We are analyzing three periods within this timeline:

- Early Breakout (Early March -> May)
- Summer (June -> August)
- Fall/Present (September -> Late November)

We will divide the data into three frames, one per period.

To relate the case counts to state policy actions, we will merge in 'state-policies.csv', a dataset from the [Oxford Covid-19 Government Response Tracker](https://github.com/OxCGRT/covid-policy-tracker) GitHub repository.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import squarify
import seaborn as sns

Initializing the dataframes

In [None]:
# COVID tracking project data
covid_data = pd.read_csv('all-states-history.csv')

# state policy data
policy_data = pd.read_csv('state-policies.csv')

Cleaning up Covid data to only include Minnesota instances and the appropriate attributes

In [None]:
#isolating the columns we need
columns_to_show = ['date','state','death','deathConfirmed','deathIncrease','hospitalized','hospitalizedIncrease','negative'
                   ,'negativeIncrease','positive','positiveIncrease','totalTestResults','totalTestResultsIncrease']

#isolating only for MN data and putting in order March->November
covid_clean_data = covid_data[covid_data['state'] == 'MN']
covid_clean_data = covid_clean_data[columns_to_show]
covid_clean_data = covid_clean_data.iloc[::-1]

#reindexing for weekly processing 
covid_clean_data['date'] = covid_clean_data['date'].astype('datetime64[ns]')
covid_clean_data = covid_clean_data.set_index('date')

# isolating the columns that need to be summed when converting to weekly index
columns_to_sum = covid_clean_data[['deathIncrease','hospitalizedIncrease','negativeIncrease','positiveIncrease','totalTestResultsIncrease']]
weekly_data = columns_to_sum.resample('W', label='right', closed='right').sum()
weekly_data = weekly_data.reset_index()

# converting remaining non-sum (cumulative) columns to weekly index
# bfill() takes the value reported on each week-ending date (Resampler.backfill() was removed in pandas 2.0)
remaining_cols = covid_clean_data[['state','death','deathConfirmed','hospitalized', 'negative','positive','totalTestResults']]
remaining_cols = remaining_cols.resample('W').bfill().reset_index()

#merging and resetting the dataframe column order to be more clear
covid_clean_data = pd.merge(remaining_cols, weekly_data, on='date').fillna(0)
covid_clean_data = covid_clean_data[['date','state','death','deathIncrease','deathConfirmed','hospitalized', 'hospitalizedIncrease','negative',
                        'negativeIncrease','positive', 'positiveIncrease','totalTestResults','totalTestResultsIncrease']]
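
The two weekly aggregation rules used above can be checked on a small example. The sketch below uses made-up numbers (not values from the dataset): daily increase columns are summed over each week, while cumulative columns take the value reported on the week-ending Sunday.

```python
import pandas as pd

# Toy daily data: 14 days starting on a Monday (values are illustrative only)
days = pd.date_range("2020-03-02", periods=14, freq="D")
toy = pd.DataFrame(
    {
        "positiveIncrease": [1] * 14,      # new cases per day
        "positive": list(range(1, 15)),    # running total of cases
    },
    index=days,
)

# Daily increments are summed over each week (weeks end on Sunday)
weekly_increase = toy[["positiveIncrease"]].resample("W").sum()

# Cumulative totals take the value reported on the week-ending date
weekly_total = toy[["positive"]].resample("W").bfill()

weekly = weekly_total.join(weekly_increase)
print(weekly)
```

Summing a cumulative column such as `positive` would overcount, which is why the two groups of columns are resampled separately before being merged back together.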

Cleaning up state policy dataframe:

In [None]:
#isolating data only about the current state of interest, Minnesota
policy_clean_data = policy_data[policy_data['RegionName'] == 'Minnesota']
#deleting rows whose dates are outside of the scope of this project
policy_clean_data = policy_clean_data.iloc[60:] #delete the first 60 rows due to their January - February dates
policy_clean_data = policy_clean_data.iloc[:-3,] #as well as the last 3 rows due to their December dates

#declaring and extracting columns of interest from the original dataset
columns_of_interest = ['RegionName', 'Jurisdiction', 'Date', 'C1_School closing', 'C2_Workplace closing', 
                       'C3_Cancel public events', 'C6_Stay at home requirements', 
                       'C7_Restrictions on internal movement', 'C8_International travel controls', 
                       'H1_Public information campaigns', 'H2_Testing policy', 'H3_Contact tracing', 
                       'H4_Emergency investment in healthcare', 'H5_Investment in vaccines', 
                       'H6_Facial Coverings', 'M1_Wildcard']
policy_clean_data = policy_clean_data[columns_of_interest]

# reformatting the Date column from YYYYMMDD integers to date objects
policy_clean_data = policy_clean_data.reset_index(drop = True)
# vectorized conversion; avoids chained assignment such as df['Date'][i] = ...,
# which pandas may apply to a temporary copy instead of the dataframe itself
policy_clean_data['Date'] = pd.to_datetime(policy_clean_data['Date'].astype(str), format="%Y%m%d").dt.date
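
One way the weekly case data and the daily policy indicators could be joined is a left merge on the date. This is only a sketch on made-up rows, not the notebook's actual merge: the frame names `toy_covid` and `toy_policy` are hypothetical, and it assumes the policy value in force on each week-ending date is the one of interest.

```python
import pandas as pd

# Hypothetical weekly case rows (dates are week-ending Sundays)
toy_covid = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-08", "2020-03-15", "2020-03-22"]),
    "positiveIncrease": [0, 35, 134],
})

# Hypothetical daily policy rows with YYYYMMDD integer dates
toy_policy = pd.DataFrame({
    "Date": [20200308, 20200309, 20200315, 20200322],
    "C1_School closing": [0.0, 0.0, 1.0, 3.0],
})
toy_policy["Date"] = pd.to_datetime(toy_policy["Date"].astype(str), format="%Y%m%d")

# Keep every weekly case row and attach the policy values for the same date
merged = pd.merge(
    toy_covid, toy_policy, how="left", left_on="date", right_on="Date"
).drop(columns="Date")
print(merged)
```

A left merge keeps all weekly rows even when a policy row is missing for that date, so gaps show up as NaN rather than as silently dropped weeks.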


Splitting the cleaned COVID data into the three periods:

In [None]:
## Breaking down clean data into each period (weekly rows are in chronological order, earliest first)

early_breakout_data = covid_clean_data[0:13]

summer_data = covid_clean_data[13:26]

fall_data = covid_clean_data[26:]

early_breakout_data.head(13)

bins = pd.cut(early_breakout_data['positiveIncrease'],4)

print(bins.shape)
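
Slicing by row position relies on each period spanning a fixed number of weekly rows. A date-based split is one alternative. The sketch below uses a hypothetical helper `split_periods` and assumed cutoff dates (June 1 and September 1, 2020) chosen to match the period list at the top of the notebook.

```python
import pandas as pd

def split_periods(df, summer_start="2020-06-01", fall_start="2020-09-01"):
    """Split a frame with a datetime 'date' column into three period frames."""
    summer_start = pd.Timestamp(summer_start)
    fall_start = pd.Timestamp(fall_start)
    early = df[df["date"] < summer_start]
    summer = df[(df["date"] >= summer_start) & (df["date"] < fall_start)]
    fall = df[df["date"] >= fall_start]
    return early, summer, fall

# Toy frame holding only the week-ending Sundays covered by the dataset
toy = pd.DataFrame({"date": pd.date_range("2020-03-08", "2020-11-29", freq="W")})
early, summer, fall = split_periods(toy)
print(len(early), len(summer), len(fall))
```

With explicit dates, the split stays correct if rows are added or removed at either end of the dataset.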


## Analysis

Important MN Stats:

- Population (mn.gov estimate): 5,680,337
- Land Area (estimate): 79,610.08 sq. mi.
- Population Density: 71.35 people/sq. mi.

Since we are performing a market basket analysis using the Apriori algorithm, we will need to discretize the data. To do so, we've implemented a function 'discretize_data':

In [None]:
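# arr is the dataframe (first two columns must be 'date' and 'state')
# k is the number of equal-width bins (pd.cut splits each column's value range evenly)
# returns one 0/1 indicator column per (attribute, bin) pair that occurs in the data
def discretize_data(arr, k):
    out = pd.DataFrame({'date': arr['date']})
    out['state'] = arr['state']
    cols = arr.columns[2:]
    for i in cols:
        # interval that each row's value falls into, and the matching bin number 0..k-1
        bin_range = pd.cut(arr[i], k)
        bins = bin_range.cat.codes
        for j in range(k):
            for row in arr.index:
                if bins.loc[row] == j:
                    # look up the interval by row label so frames whose index does not start at 0 also work
                    out.loc[row, i + " bin " + str(bin_range.loc[row])] = 1
    out = out.fillna(0)
    return out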
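
Note that `pd.cut` produces equal-width bins, so the bins can hold very different numbers of rows when values are skewed, as weekly case counts often are. If equal-frequency bins are wanted instead, `pd.qcut` is the pandas counterpart. A small comparison on made-up values:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Equal-width: the range [1, 100] is split into 4 intervals of the same length
equal_width = pd.cut(values, 4)

# Equal-frequency: cut points are quartiles, so each bin holds about the same count
equal_freq = pd.qcut(values, 4)

width_counts = equal_width.value_counts(sort=False).tolist()
freq_counts = equal_freq.value_counts(sort=False).tolist()
print(width_counts)
print(freq_counts)
```

Here a single outlier leaves two of the equal-width bins empty, while the quartile-based bins stay balanced.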

Early Breakout Analysis:

In [None]:
early_break_disc = discretize_data(early_breakout_data,4)
early_break_disc.head(10)
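
The 0/1 indicator columns produced by `discretize_data` are the form that Apriori-style mining expects: each week is a transaction and each column is an item. As a rough sketch of the first step, the code below counts support for single items and for pairs on a made-up indicator frame. The helper `itemset_support`, the column names, and the 0.5 threshold are illustrative assumptions, and a full Apriori run would continue to larger itemsets.

```python
from itertools import combinations

import pandas as pd

def itemset_support(onehot, min_support):
    """Return {itemset: support} for frequent 1- and 2-itemsets in a 0/1 frame."""
    onehot = onehot.astype(bool)
    n = len(onehot)
    frequent = {}
    # Level 1: fraction of rows in which each item is present
    for col in onehot.columns:
        support = onehot[col].sum() / n
        if support >= min_support:
            frequent[(col,)] = support
    # Level 2: only pair up items that were frequent on their own (Apriori pruning)
    singles = [items[0] for items in frequent]
    for a, b in combinations(singles, 2):
        support = (onehot[a] & onehot[b]).sum() / n
        if support >= min_support:
            frequent[(a, b)] = support
    return frequent

# Made-up indicator frame: one row per week, one column per item
toy = pd.DataFrame({
    "cases high": [1, 1, 0, 1],
    "school closed": [1, 1, 0, 0],
    "deaths high": [0, 0, 1, 0],
})
result = itemset_support(toy, min_support=0.5)
print(result)
```

Before passing the notebook's discretized frame to a function like this, the `date` and `state` columns would need to be dropped so that only indicator columns remain.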