# Data Cleaning

This notebook cleans, filters, and aggregates the data to focus on the county I'm investigating: Clark County in Nevada (the greater Las Vegas area). All the data used in this notebook can be found in data/\*/\*.csv. Links to these data sources are in the README.md.

In [37]:
# Imports
import pandas as pd

## Load

> We load the compliance, mandate, and cases data into dataframes for manipulation, cleaning, and aggregation.

In [38]:
compliance_df = pd.read_csv('data/compliance/mask-use-by-county.csv')
compliance_df

Unnamed: 0,COUNTYFP,NEVER,RARELY,SOMETIMES,FREQUENTLY,ALWAYS
0,1001,0.053,0.074,0.134,0.295,0.444
1,1003,0.083,0.059,0.098,0.323,0.436
2,1005,0.067,0.121,0.120,0.201,0.491
3,1007,0.020,0.034,0.096,0.278,0.572
4,1009,0.053,0.114,0.180,0.194,0.459
...,...,...,...,...,...,...
3137,56037,0.061,0.295,0.230,0.146,0.268
3138,56039,0.095,0.157,0.160,0.247,0.340
3139,56041,0.098,0.278,0.154,0.207,0.264
3140,56043,0.204,0.155,0.069,0.285,0.287


In [39]:
mandate_df = pd.read_csv('data/mandate/U.S._State_and_Territorial_Public_Mask_Mandates_From_April_10__2020_through_August_15__2021_by_County_by_Day.csv')
mandate_df

Unnamed: 0,State_Tribe_Territory,County_Name,FIPS_State,FIPS_County,date,order_code,Face_Masks_Required_in_Public,Source_of_Action,URL,Citation
0,AL,Autauga County,1,1,4/10/2020,2,,,,
1,AL,Autauga County,1,1,4/11/2020,2,,,,
2,AL,Autauga County,1,1,4/12/2020,2,,,,
3,AL,Autauga County,1,1,4/13/2020,2,,,,
4,AL,Autauga County,1,1,4/14/2020,2,,,,
...,...,...,...,...,...,...,...,...,...,...
1593864,VI,St. Thomas Island,78,30,8/11/2021,1,Yes,Official,,"V.I. Twenty-Seventh Supp. Exec. Order (Aug. 6,..."
1593865,VI,St. Thomas Island,78,30,8/12/2021,1,Yes,Official,,"V.I. Twenty-Seventh Supp. Exec. Order (Aug. 6,..."
1593866,VI,St. Thomas Island,78,30,8/13/2021,1,Yes,Official,,"V.I. Twenty-Seventh Supp. Exec. Order (Aug. 6,..."
1593867,VI,St. Thomas Island,78,30,8/14/2021,1,Yes,Official,,"V.I. Twenty-Seventh Supp. Exec. Order (Aug. 6,..."


In [40]:
cases_df = pd.read_csv('data/case/RAW_us_confirmed_cases.csv')
cases_df

Unnamed: 0,Province_State,Admin2,UID,iso2,iso3,code3,FIPS,Country_Region,Lat,Long_,...,10/17/22,10/18/22,10/19/22,10/20/22,10/21/22,10/22/22,10/23/22,10/24/22,10/25/22,10/26/22
0,Alabama,Autauga,84001001,US,USA,840,1001.0,US,32.539527,-86.644082,...,18452,18452,18452,18480,18480,18480,18480,18480,18480,18480
1,Alabama,Baldwin,84001003,US,USA,840,1003.0,US,30.727750,-87.722071,...,65819,65819,65819,65895,65895,65895,65895,65895,65895,65895
2,Alabama,Barbour,84001005,US,USA,840,1005.0,US,31.868263,-85.387129,...,6910,6910,6910,6926,6926,6926,6926,6926,6926,6926
3,Alabama,Bibb,84001007,US,USA,840,1007.0,US,32.996421,-87.125115,...,7547,7547,7547,7560,7560,7560,7560,7560,7560,7560
4,Alabama,Blount,84001009,US,USA,840,1009.0,US,33.982109,-86.567906,...,17256,17256,17256,17286,17286,17286,17286,17286,17286,17286
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3337,Wyoming,Teton,84056039,US,USA,840,56039.0,US,43.935225,-110.589080,...,11771,11800,11800,11800,11800,11800,11800,11800,11814,11814
3338,Wyoming,Uinta,84056041,US,USA,840,56041.0,US,41.287818,-110.547578,...,6163,6176,6176,6176,6176,6176,6176,6176,6186,6186
3339,Wyoming,Unassigned,84090056,US,USA,840,90056.0,US,0.000000,0.000000,...,0,0,0,0,0,0,0,0,0,0
3340,Wyoming,Washakie,84056043,US,USA,840,56043.0,US,43.904516,-107.680187,...,2669,2669,2669,2669,2669,2669,2669,2669,2672,2672


## Filter

We filter each dataset down to Clark County, Nevada. Below we do a bit of column renaming and subsetting to get the data needed from each of the three dataframes (cases, mandate, and compliance).

In [41]:
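# Select the Clark County row, transpose so each date becomes a row, and skip the 11 non-date rows.
# 1815 is the row label of Clark County in cases_df; after the transpose it is the column holding the counts.
nevada_cases_df = cases_df[(cases_df['Province_State'] == 'Nevada') & (cases_df['Admin2'] == 'Clark')].T.iloc[11:].reset_index().rename(columns={'index':'date', 1815:'confirmed_cases'})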
nevada_cases_df

Unnamed: 0,date,confirmed_cases
0,1/22/20,0
1,1/23/20,0
2,1/24/20,0
3,1/25/20,0
4,1/26/20,0
...,...,...
1004,10/22/22,643510
1005,10/23/22,643510
1006,10/24/22,643510
1007,10/25/22,643510
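
The transpose-and-slice in the cell above depends on two hard-coded values: the position where the date columns start (11) and the row label of Clark County (1815). One way to avoid both is `DataFrame.melt`. This is a sketch only: it runs on a toy table with made-up counts, and the `id_cols` list is an assumption that would need to name every non-date column in the real file.

```python
import pandas as pd

# Toy stand-in for the wide case table: one row per county, one column per day.
# The counts are invented for illustration and are not taken from the real data.
wide = pd.DataFrame({
    "Province_State": ["Nevada", "Nevada"],
    "Admin2": ["Clark", "Washoe"],
    "1/22/20": [0, 0],
    "1/23/20": [1, 0],
    "1/24/20": [3, 2],
})

# Assumption: these are all of the non-date columns in the toy table.
id_cols = ["Province_State", "Admin2"]

clark = wide[(wide["Province_State"] == "Nevada") & (wide["Admin2"] == "Clark")]

# melt turns every non-id column into one (date, confirmed_cases) row,
# so neither a positional slice nor a row label is needed.
long_df = (
    clark.melt(id_vars=id_cols, var_name="date", value_name="confirmed_cases")
    [["date", "confirmed_cases"]]
    .reset_index(drop=True)
)
print(long_df)
```

The result has the same two-column shape as `nevada_cases_df`, with `date` still stored as strings.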


In [42]:
nevada_mandate_df = mandate_df[(mandate_df['County_Name'] == 'Clark County') & (mandate_df['State_Tribe_Territory'] == 'NV')].reset_index()
nevada_mandate_df

Unnamed: 0,index,State_Tribe_Territory,County_Name,FIPS_State,FIPS_County,date,order_code,Face_Masks_Required_in_Public,Source_of_Action,URL,Citation
0,464821,NV,Clark County,32,3,5/17/2020,2,,,,
1,464822,NV,Clark County,32,3,5/18/2020,2,,,,
2,464850,NV,Clark County,32,3,5/19/2020,2,,,,
3,464851,NV,Clark County,32,3,5/20/2020,2,,,,
4,464879,NV,Clark County,32,3,5/21/2020,2,,,,
...,...,...,...,...,...,...,...,...,...,...,...
488,1257043,NV,Clark County,32,3,8/11/2021,1,Yes,Official,,Nev. Task Force Press Release (Mask guidance) ...
489,1257044,NV,Clark County,32,3,8/12/2021,1,Yes,Official,,Nev. Task Force Press Release (Mask guidance) ...
490,1257045,NV,Clark County,32,3,8/13/2021,1,Yes,Official,,"Nev. Task Force Press Release (Aug. 10, 2021) ..."
491,1257046,NV,Clark County,32,3,8/14/2021,1,Yes,Official,,"Nev. Task Force Press Release (Aug. 10, 2021) ..."


In [43]:
nevada_compliance_df = compliance_df[compliance_df['COUNTYFP'] == 32003].reset_index()
nevada_compliance_df

Unnamed: 0,index,COUNTYFP,NEVER,RARELY,SOMETIMES,FREQUENTLY,ALWAYS
0,1748,32003,0.027,0.032,0.054,0.145,0.742
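
The compliance table identifies counties with a single combined code (`COUNTYFP`), while the mandate table stores `FIPS_State` and `FIPS_County` separately. As a small sketch of how the two relate, the helper below (a hypothetical name, not part of the notebook) builds the combined code from its two parts, using values that appear in the tables above.

```python
def to_county_fips(state_fips: int, county_fips: int) -> int:
    """Combine a state FIPS code and a 3-digit county FIPS code into one county code."""
    return state_fips * 1000 + county_fips


print(to_county_fips(32, 3))  # 32003, Clark County, NV
print(to_county_fips(1, 1))   # 1001, Autauga County, AL
```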


## Data Cleaning

> The cells below clean the data in three steps:
1. Replace any NAs in `Face_Masks_Required_in_Public`
2. Convert `Face_Masks_Required_in_Public` to a Yes/No indicator and `date` from str to datetime
3. Remove columns that are not needed (URL, County_Name, State_Tribe_Territory, etc.)

In [44]:
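# Only an explicit 'Yes' counts as a mandate; missing values and any other entry become 'No'
nevada_mandate_df['Face_Masks_Required_in_Public'] = nevada_mandate_df['Face_Masks_Required_in_Public'].apply(lambda x: 'Yes' if x == 'Yes' else 'No')
# Drop identifier and link columns that are not needed downstream
nevada_mandate_df.drop(columns=['index', 'URL', 'County_Name', 'State_Tribe_Territory', 'FIPS_State', 'FIPS_County'], inplace=True)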
nevada_mandate_df['date'] = pd.to_datetime(nevada_mandate_df['date'])
nevada_mandate_df['COUNTYFP'] = 32003
nevada_mandate_df

Unnamed: 0,date,order_code,Face_Masks_Required_in_Public,Source_of_Action,Citation,COUNTYFP
0,2020-05-17,2,No,,,32003
1,2020-05-18,2,No,,,32003
2,2020-05-19,2,No,,,32003
3,2020-05-20,2,No,,,32003
4,2020-05-21,2,No,,,32003
...,...,...,...,...,...,...
488,2021-08-11,1,Yes,Official,Nev. Task Force Press Release (Mask guidance) ...,32003
489,2021-08-12,1,Yes,Official,Nev. Task Force Press Release (Mask guidance) ...,32003
490,2021-08-13,1,Yes,Official,"Nev. Task Force Press Release (Aug. 10, 2021) ...",32003
491,2021-08-14,1,Yes,Official,"Nev. Task Force Press Release (Aug. 10, 2021) ...",32003
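
The cleaning cell above stores the mandate flag as the strings `Yes` and `No`. If a true boolean is more convenient for plotting or arithmetic, a vectorized comparison is one option. This is a minimal sketch on a toy series, assuming the raw column holds `Yes`, `No`, or missing values (only `Yes` and missing values are visible in the output above).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw column; the mix of values is an assumption.
raw = pd.Series([np.nan, "Yes", "No", "Yes"], name="Face_Masks_Required_in_Public")

# eq() compares element-wise; missing values compare as False,
# so NaN and 'No' both end up as "no mandate".
mask_required = raw.eq("Yes")

# The same information as Yes/No labels, if strings are preferred.
as_label = mask_required.map({True: "Yes", False: "No"})
print(pd.DataFrame({"raw": raw, "bool": mask_required, "label": as_label}))
```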


In [45]:
nevada_cases_df['date'] = pd.to_datetime(nevada_cases_df['date'])
nevada_cases_df

Unnamed: 0,date,confirmed_cases
0,2020-01-22,0
1,2020-01-23,0
2,2020-01-24,0
3,2020-01-25,0
4,2020-01-26,0
...,...,...
1004,2022-10-22,643510
1005,2022-10-23,643510
1006,2022-10-24,643510
1007,2022-10-25,643510
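
Both date columns parse correctly with pandas' default inference, as the output above shows. Passing an explicit `format` is an optional extra that makes the expected layout visible and raises an error if a value does not match. The sketch below uses a few date strings copied from the tables above; the format strings are my reading of those layouts.

```python
import pandas as pd

case_dates = pd.Series(["1/22/20", "1/23/20", "10/25/22"])  # two-digit year
mandate_dates = pd.Series(["5/17/2020", "8/14/2021"])       # four-digit year

# %y is a two-digit year, %Y a four-digit year; month and day may be unpadded.
parsed_cases = pd.to_datetime(case_dates, format="%m/%d/%y")
parsed_mandate = pd.to_datetime(mandate_dates, format="%m/%d/%Y")

print(parsed_cases)
print(parsed_mandate)
```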


## Aggregating Data

> The cell below joins all three dataframes (cases, mandate, and compliance) into one for visualization. We write this data out as a CSV to `data/produced_data/nevada_covid_data.csv`.

In [46]:
nevada_covid_df = nevada_mandate_df.merge(nevada_cases_df, on='date', how='left').merge(nevada_compliance_df, on='COUNTYFP')
nevada_covid_df

Unnamed: 0,date,order_code,Face_Masks_Required_in_Public,Source_of_Action,Citation,COUNTYFP,confirmed_cases,index,NEVER,RARELY,SOMETIMES,FREQUENTLY,ALWAYS
0,2020-05-17,2,No,,,32003,5366,1748,0.027,0.032,0.054,0.145,0.742
1,2020-05-18,2,No,,,32003,5463,1748,0.027,0.032,0.054,0.145,0.742
2,2020-05-19,2,No,,,32003,5463,1748,0.027,0.032,0.054,0.145,0.742
3,2020-05-20,2,No,,,32003,5650,1748,0.027,0.032,0.054,0.145,0.742
4,2020-05-21,2,No,,,32003,5734,1748,0.027,0.032,0.054,0.145,0.742
...,...,...,...,...,...,...,...,...,...,...,...,...,...
488,2021-08-11,1,Yes,Official,Nev. Task Force Press Release (Mask guidance) ...,32003,289746,1748,0.027,0.032,0.054,0.145,0.742
489,2021-08-12,1,Yes,Official,Nev. Task Force Press Release (Mask guidance) ...,32003,290632,1748,0.027,0.032,0.054,0.145,0.742
490,2021-08-13,1,Yes,Official,"Nev. Task Force Press Release (Aug. 10, 2021) ...",32003,291502,1748,0.027,0.032,0.054,0.145,0.742
491,2021-08-14,1,Yes,Official,"Nev. Task Force Press Release (Aug. 10, 2021) ...",32003,291502,1748,0.027,0.032,0.054,0.145,0.742
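
A left join keeps every mandate row, but it also silently leaves `confirmed_cases` empty for any date missing from the cases table. The sketch below shows one way to check the join, using the `validate` and `indicator` options of `DataFrame.merge`. The frames are toy stand-ins with a deliberately missing date, and the column name `cases_match` is a hypothetical choice.

```python
import pandas as pd

# Toy frames; the third mandate date has no matching cases row on purpose.
mandate = pd.DataFrame({
    "date": pd.to_datetime(["2020-05-17", "2020-05-18", "2020-05-19"]),
    "Face_Masks_Required_in_Public": ["No", "No", "No"],
    "COUNTYFP": [32003, 32003, 32003],
})
cases = pd.DataFrame({
    "date": pd.to_datetime(["2020-05-17", "2020-05-18"]),
    "confirmed_cases": [5366, 5463],
})
compliance = pd.DataFrame({"COUNTYFP": [32003], "ALWAYS": [0.742]})

merged = (
    mandate
    # validate raises MergeError if either side has duplicate dates;
    # indicator records whether each row found a match in cases.
    .merge(cases, on="date", how="left", validate="one_to_one", indicator="cases_match")
    .merge(compliance, on="COUNTYFP", how="left", validate="many_to_one")
)

# The joins should neither add nor drop mandate rows.
assert len(merged) == len(mandate)

unmatched = merged[merged["cases_match"] == "left_only"]
print(unmatched[["date", "confirmed_cases"]])
```

On the real data, an empty `unmatched` frame would indicate that every mandate date has a case count.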


In [47]:
nevada_covid_df.to_csv('data/produced_data/nevada_covid_data.csv', index=False)
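
`to_csv` does not create missing directories, and a CSV stores dates as plain text. The sketch below shows one way to handle both: create the output folder first, then pass `parse_dates` when reading the file back. It writes a toy frame to a temporary directory (an assumption made so the example does not touch the real `data/` folder).

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2020-05-17", "2020-05-18"]),
    "confirmed_cases": [5366, 5463],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "produced_data" / "nevada_covid_data.csv"
    # Create the folder if it does not exist yet; to_csv will not do this.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)

    # Dates come back as strings unless parse_dates is given.
    restored = pd.read_csv(out_path, parse_dates=["date"])

print(restored.dtypes)
```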