## COVID Data Cleanup 
* In this notebook, I clean and organize the raw COVID data from the CDC and subset it for use in my final project.


In [1]:
import pandas as pd 

In [2]:
covid_time_df = pd.read_csv('../data/data_raw/United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv')

### Initial data exploration
Steps
* Load in dataframe
* How many rows/columns are there?
* Define column labels using the metadata
* What does each row represent? 
* Is there missing data?

In [3]:
covid_time_df.columns

Index(['submission_date', 'state', 'tot_cases', 'conf_cases', 'prob_cases',
       'new_case', 'pnew_case', 'tot_death', 'conf_death', 'prob_death',
       'new_death', 'pnew_death', 'created_at', 'consent_cases',
       'consent_deaths'],
      dtype='object')

In [4]:
covid_time_df.shape

(17760, 15)

#### Metadata from the CDC defining column names:
##### Column Name - Description - Type
* submission_date	- Date of counts - Date & Time
* state	- Jurisdiction - Plain Text
* tot_cases	- Total number of cases - Number
* conf_cases	- Total confirmed cases - Number
* prob_cases	- Total probable cases - Number
* new_case	- Number of new cases - Number
* pnew_case	- Number of new probable cases - Number
* tot_death	- Total number of deaths - Number
* conf_death	- Total number of confirmed deaths - Number
* prob_death	- Total number of probable deaths - Number
* new_death	- Number of new deaths - Number
* pnew_death	- Number of new probable deaths - Number
* created_at	- Date and time record was created - Date & Time
* consent_cases	- If Agree, then confirmed and probable cases are included. If Not Agree, then only total cases are included. - Plain Text
* consent_deaths	- If Agree, then confirmed and probable deaths are included. If Not Agree, then only total deaths are included. - Plain Text
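The consent columns hint at why some confirmed/probable counts are empty. As a minimal sketch, the four rows below are copied from the sample output in this notebook (MA, CT, CA, HI) and used to measure missingness per column and cross-tabulate it against `consent_cases`. The names `toy`, `missing_share`, and `consent_vs_missing` are illustrative only.

```python
import numpy as np
import pandas as pd

# Four rows taken from the sample output in this notebook.
toy = pd.DataFrame({
    "state": ["MA", "CT", "CA", "HI"],
    "tot_cases": [121707, 45715, 880724, 1190],
    "conf_cases": [112969.0, 43763.0, np.nan, np.nan],
    "consent_cases": ["Agree", "Agree", "Not agree", "Not agree"],
})

# Share of missing values in each column (0.0 = complete, 1.0 = all missing).
missing_share = toy.isna().mean()

# Label each row by whether conf_cases is present, then compare with consent.
status = (
    toy["conf_cases"].isna()
    .map({True: "missing", False: "present"})
    .rename("conf_cases_status")
)
consent_vs_missing = pd.crosstab(toy["consent_cases"], status)

print(missing_share)
print(consent_vs_missing)
```

On the full dataset the relationship may be less clean than in these four rows. For example, some early zero-count rows have blank confirmed counts even where consent is `Agree`.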


In [5]:
covid_time_df.sample(10)

,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,created_at,consent_cases,consent_deaths
8968,04/19/2020,LA,23928,,,348,0.0,1296,1296.0,0.0,29,0.0,04/19/2020 04:22:39 PM,Not agree,Agree
593,01/23/2020,AZ,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Agree,Agree
2754,04/21/2020,NM,2072,,,101,0.0,65,,,7,0.0,04/21/2020 04:22:39 PM,,Not agree
5257,09/03/2020,NYC,236060,231290.0,4770.0,221,-1.0,23716,19073.0,4643.0,6,-1.0,09/04/2020 01:36:16 PM,Agree,Agree
2139,03/29/2020,WY,87,,,3,,0,,,0,,03/28/2020 04:22:39 PM,Agree,Agree
8562,10/22/2020,CA,880724,,,2940,0.0,17189,,,162,0.0,10/23/2020 01:44:31 PM,Not agree,Not agree
4642,08/11/2020,MA,121707,112969.0,8738.0,392,96.0,8751,8529.0,222.0,10,0.0,08/12/2020 01:50:14 PM,Agree,Agree
1334,06/20/2020,CT,45715,43763.0,1952.0,158,5.0,4251,3394.0,857.0,13,4.0,06/21/2020 02:32:57 PM,Agree,Agree
5650,02/17/2020,AL,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Agree,Agree
15271,07/15/2020,HI,1190,,,32,0.0,22,,,0,0.0,07/16/2020 02:22:21 PM,Not agree,Not agree


To get a better idea of what each row represents, I'm going to look at one particular state:

In [6]:
rowfilter = covid_time_df['state']== "NJ"
covid_NJ = covid_time_df[rowfilter]
covid_NJ

,submission_date,state,tot_cases,conf_cases,prob_cases,new_case,pnew_case,tot_death,conf_death,prob_death,new_death,pnew_death,created_at,consent_cases,consent_deaths
8584,01/22/2020,NJ,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Not agree,Agree
8585,01/23/2020,NJ,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Not agree,Agree
8586,01/24/2020,NJ,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Not agree,Agree
8587,01/25/2020,NJ,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Not agree,Agree
8588,01/26/2020,NJ,0,,,0,,0,,,0,,03/26/2020 04:22:39 PM,Not agree,Agree
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8875,11/08/2020,NJ,254595,,,2013,0.0,16429,14629.0,1800.0,4,0.0,11/09/2020 02:38:54 PM,Not agree,Agree
8876,11/09/2020,NJ,256653,,,2058,0.0,16440,14640.0,1800.0,11,0.0,11/10/2020 02:45:45 PM,Not agree,Agree
8877,11/10/2020,NJ,260430,,,3777,0.0,16461,14661.0,1800.0,21,0.0,11/11/2020 03:14:03 PM,Not agree,Agree
8878,11/11/2020,NJ,263495,,,3065,0.0,16476,14676.0,1800.0,15,0.0,11/12/2020 02:53:58 PM,Not agree,Agree


Filtering on a single state shows that each row is a daily update for one jurisdiction, which is far more detailed than I need for my project.
Many values are missing in some columns (e.g., confirmed cases).
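One way to confirm the "one row per jurisdiction per day" reading is to check for duplicate (state, date) pairs and compare the row count with the number of jurisdictions times the number of distinct dates. The sketch below runs on a small made-up frame with the same column names. `toy` and the derived variable names are assumptions for illustration.

```python
import pandas as pd

# Invented rows in the same shape as the CDC data (two jurisdictions, two days).
toy = pd.DataFrame({
    "submission_date": ["01/22/2020", "01/23/2020", "01/22/2020", "01/23/2020"],
    "state": ["NJ", "NJ", "AZ", "AZ"],
    "tot_cases": [0, 0, 0, 0],
})

# No (state, date) pair should appear more than once.
n_duplicates = int(toy.duplicated(subset=["state", "submission_date"]).sum())

# With complete daily coverage, rows == jurisdictions x distinct dates.
n_states = toy["state"].nunique()
n_dates = toy["submission_date"].nunique()
is_full_grid = len(toy) == n_states * n_dates

print(n_duplicates, n_states, n_dates, is_full_grid)
```

Applied to the real frame, 17760 rows would be consistent with 296 daily rows for each of 60 jurisdictions, if every jurisdiction covers the same dates.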

### How many COVID cases are there in the US? 
##### This data has two dimensions to consider, space and time: I want to see COVID totals by state and month.

I'm not really interested in the breakdown of cases/deaths into probable and confirmed; I just need to see the growth of the pandemic over time in broad strokes. To keep the dataframes consistent, I only want to keep the states (not territories or cities), and I want to order the data chronologically.

Steps
* Subset columns 
* Filter out territories 
* Reformat submission_date to datetime 
* Aggregate daily data to monthly data

Step 1: Subset columns

In [7]:
cols_to_use = ['submission_date', 'state', 'new_case','tot_cases', 'tot_death', 'new_death']
covid_time_df2 = covid_time_df[cols_to_use]
covid_time_df2.sample(10)

,submission_date,state,new_case,tot_cases,tot_death,new_death
17228,03/22/2020,FSM,0,0,0,0
4649,08/18/2020,MA,221,124063,8848,6
5526,08/07/2020,OH,1204,98675,3652,34
9971,08/12/2020,GA,3565,226153,4456,105
17442,10/22/2020,FSM,0,0,0,0
9257,04/12/2020,ID,51,1458,27,0
10030,10/10/2020,GA,1237,330269,7393,45
10948,11/09/2020,NY,1988,259805,9371,18
6848,03/02/2020,IL,1,4,0,0
735,06/13/2020,AZ,1540,34458,1183,39


Step 2: Filter out territories/cities

In [8]:
states = covid_time_df2['state'].unique()
print(states)

['CO' 'FL' 'AZ' 'SC' 'CT' 'NE' 'KY' 'WY' 'IA' 'NM' 'ND' 'WA' 'RMI' 'TN'
 'AS' 'MA' 'PA' 'NYC' 'OH' 'AL' 'VA' 'MI' 'MS' 'IL' 'WI' 'PR' 'OK' 'TX'
 'CA' 'NJ' 'LA' 'ID' 'NV' 'GA' 'IN' 'MD' 'NY' 'AR' 'MN' 'OR' 'WV' 'UT'
 'MO' 'DE' 'SD' 'RI' 'KS' 'NH' 'ME' 'DC' 'MT' 'HI' 'NC' 'AK' 'GU' 'VT'
 'VI' 'MP' 'FSM' 'PW']


In [9]:
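# Non-state jurisdictions to drop. 'VI' (U.S. Virgin Islands) belongs here; 'AK' (Alaska) is a state and is kept.
non_states = ['RMI', 'AS', 'NYC', 'PR', 'DC', 'VI', 'GU', 'MP', 'FSM', 'PW']
row_filter = covid_time_df2['state'].isin(non_states)
covid_time_df2 = covid_time_df2[~row_filter].copy()  # ~ negates a boolean mask; .copy() avoids SettingWithCopyWarning later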
states = covid_time_df2['state'].unique()
len(states)

50
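A count of 50 alone does not prove that the right 50 jurisdictions were kept. A possible alternative, sketched below, is to keep only rows whose code appears in an explicit list of the 50 state postal abbreviations and then inspect what was dropped. `US_STATES`, `toy`, `states_only`, and `dropped` are illustrative names, and the `tot_cases` values are invented.

```python
import pandas as pd

# The 50 U.S. state postal abbreviations (no DC, territories, or NYC).
US_STATES = [
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
]

# Invented rows mixing states with non-state jurisdictions.
toy = pd.DataFrame({
    "state": ["AK", "NJ", "VI", "NYC", "DC", "PR"],
    "tot_cases": [1, 2, 3, 4, 5, 6],
})

# Keep only rows whose jurisdiction is one of the 50 states.
states_only = toy[toy["state"].isin(US_STATES)].copy()

# List what was removed so the filter can be reviewed by eye.
dropped = sorted(set(toy["state"]) - set(US_STATES))

print(states_only["state"].tolist())
print(dropped)
```

An allowlist fails safe: a jurisdiction code that was not anticipated is dropped and shows up in `dropped`, rather than silently staying in the data.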

Step 3: Reformat to datetime, aggregate daily data to monthly data

In [10]:
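# Dates are stored as MM/DD/YYYY strings; parse them and use them as the index
covid_time_df2['date_dt'] = pd.to_datetime(covid_time_df2['submission_date'], format='%m/%d/%Y')
covid_time_df2 = covid_time_df2.set_index('date_dt')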
covid_time_df2.sample(5)

date_dt,submission_date,state,new_case,tot_cases,tot_death,new_death
2020-06-27,06/27/2020,WI,559,30227,784,11
2020-07-20,07/20/2020,DE,227,13746,525,2
2020-07-31,07/31/2020,MN,724,55188,1646,6
2020-02-08,02/08/2020,IL,0,2,0,0
2020-03-12,03/12/2020,MI,1,3,0,0
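Setting a datetime index does not by itself put the rows in chronological order. The sketch below uses three rows taken from the sample output above, parses the `MM/DD/YYYY` strings with an explicit format, and sorts the index. The name `toy` is illustrative only.

```python
import pandas as pd

# Three rows from the sample output, deliberately out of date order.
toy = pd.DataFrame({
    "submission_date": ["03/12/2020", "02/08/2020", "06/27/2020"],
    "state": ["MI", "IL", "WI"],
    "tot_cases": [3, 2, 30227],
})

# An explicit format avoids any ambiguity between month-first and day-first dates.
toy["date_dt"] = pd.to_datetime(toy["submission_date"], format="%m/%d/%Y")

# Use the parsed dates as the index and order the rows chronologically.
toy = toy.set_index("date_dt").sort_index()

print(toy)
```

Resampling by month does not need a sorted index, but sorting makes the daily frame easier to read and to slice by date range.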


In [11]:
covid_bymonth_df = covid_time_df2.groupby('state').resample('M')

In [12]:
covid_bymonth_df

<pandas.core.resample.DatetimeIndexResamplerGroupby object at 0x7f6eba6c4f98>

In [13]:
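# Group by state, then resample the datetime index into calendar months.
# Note: pandas 2.2+ deprecates the 'M' alias in favor of 'ME' (month end).
covid_bymonth_df = covid_time_df2.groupby('state').resample('M')
d = {'new_case': covid_bymonth_df['new_case'].sum(),    # new cases reported during the month
     'tot_cases': covid_bymonth_df['tot_cases'].max(),  # highest cumulative count reached in the month
     'new_death': covid_bymonth_df['new_death'].sum(),  # new deaths reported during the month
     'tot_death': covid_bymonth_df['tot_death'].max()}  # highest cumulative death count in the month
covid_bymonth_df = pd.DataFrame(data=d)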

`covid_bymonth_df` now holds, for each state and month, the new cases and deaths reported that month and the cumulative totals reached by that month. For example, Alabama had 0 new cases in February and 999 new cases in March, for a cumulative total of 999 cases.
The dataframe is now cleaned, organized, and ready for the analysis notebook.
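The monthly roll-up can be sanity-checked on a small made-up frame: if the daily data are consistent, the running sum of monthly `new_case` should match the monthly `tot_cases`. This sketch groups by calendar month with `dt.to_period('M')` and named aggregation, which is a swapped-in alternative to the `resample` call used in this notebook. `toy`, `month`, and `monthly` are illustrative names, and all values are invented.

```python
import pandas as pd

# Invented daily rows for two states spanning a month boundary.
toy = pd.DataFrame({
    "submission_date": ["01/30/2020", "01/31/2020", "02/01/2020", "02/02/2020"] * 2,
    "state": ["AL"] * 4 + ["NJ"] * 4,
    "new_case": [1, 2, 3, 4, 0, 5, 0, 1],
})
# Build a cumulative total per state so the toy data is internally consistent.
toy["tot_cases"] = toy.groupby("state")["new_case"].cumsum()
toy["date_dt"] = pd.to_datetime(toy["submission_date"], format="%m/%d/%Y")

# Calendar month of each row, used as a grouping key alongside the state.
month = toy["date_dt"].dt.to_period("M").rename("month")

monthly = toy.groupby(["state", month]).agg(
    new_case=("new_case", "sum"),    # new cases reported within the month
    tot_cases=("tot_cases", "max"),  # highest cumulative count reached in the month
)

# Check: running sum of monthly new cases equals the monthly cumulative total.
running = monthly.groupby(level="state")["new_case"].cumsum()
is_consistent = bool((running == monthly["tot_cases"]).all())

print(monthly)
print(is_consistent)
```

On real data this check can fail where a jurisdiction revised its counts downward, so a mismatch is a prompt to look closer rather than proof of an error in the aggregation.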

In [15]:
covid_bymonth_df.to_csv('../data/data_clean/covid_clean_dated.csv',index=True)
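Because the frame is written with its (state, date) index, the analysis notebook has to rebuild that index when loading the file. The sketch below shows one way to do that round trip, using an in-memory buffer in place of the real file path. The two rows echo the Alabama example above, and `monthly`, `buffer`, and `restored` are illustrative names.

```python
import io

import pandas as pd

# Two monthly rows echoing the Alabama example (February: 0, March: 999).
index = pd.MultiIndex.from_tuples(
    [("AL", pd.Timestamp("2020-02-29")), ("AL", pd.Timestamp("2020-03-31"))],
    names=["state", "date_dt"],
)
monthly = pd.DataFrame({"new_case": [0, 999], "tot_cases": [0, 999]}, index=index)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
monthly.to_csv(buffer, index=True)
buffer.seek(0)

# Rebuild the two-level index and parse the date level back into datetimes.
restored = pd.read_csv(buffer, index_col=["state", "date_dt"], parse_dates=["date_dt"])

print(restored)
```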