# State Covid Data in the United States

In this notebook we pull in coronavirus data from The Atlantic's Covid Tracking Project. The data are provided at state-level resolution, with many data elements for each state.

### Table of Contents

- [Notebook Setup](#1---Notebook-Setup)
- [Import Data](#2---Import-Data)
- [Data Cleaning](#3---Data-Cleaning)
- [Columns](#4---Columns)


---

## 1 - Notebook Setup

First, let's switch to the project root directory so that the custom modules can be imported.

In [1]:
pwd

'/Users/DanOvadia/Projects/covid-hotspots/notebooks'

In [2]:
cd ..

/Users/DanOvadia/Projects/covid-hotspots


### Import Libraries

In [3]:
import pandas as pd
import requests
import time

### Import Custom Modules

In [4]:
from modules import data_processing

%load_ext autoreload
%autoreload 1
%aimport modules.data_processing

--------

## 2 - Import Data

Import the data directly from the Covid Tracking Project API.

In [5]:
STATES_URL = "https://covidtracking.com/api/states/daily"
with requests.get(STATES_URL) as response:
    COVID_STATES_DF = pd.DataFrame(response.json())
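
As a rough illustration of what `pd.DataFrame(response.json())` does with a list of JSON records, the sketch below builds a frame from a hand-written list. The records and their values are invented for illustration and are not taken from the API. The point is that a key missing from some records becomes `NaN`, which is one reason count columns can arrive as `float64`.

```python
import pandas as pd

# Hypothetical records shaped like a list-of-dicts JSON response.
# The values are invented for illustration only.
sample_records = [
    {"date": 20200304, "state": "AK", "positive": 0},
    {"date": 20200305, "state": "AK"},  # no "positive" key in this record
]

sample_df = pd.DataFrame(sample_records)

# A key absent from a record becomes NaN, so the column is upcast to float64.
print(sample_df.dtypes)
print(sample_df["positive"].isnull().sum())
```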

### Assess raw data

Let's take a look at the shape and at the column datatypes.

In [6]:
COVID_STATES_DF.shape

(10346, 54)

In [7]:
COVID_STATES_DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10346 entries, 0 to 10345
Data columns (total 54 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   date                         10346 non-null  int64  
 1   state                        10346 non-null  object 
 2   positive                     10250 non-null  float64
 3   negative                     10109 non-null  float64
 4   pending                      1219 non-null   float64
 5   hospitalizedCurrently        7578 non-null   float64
 6   hospitalizedCumulative       5740 non-null   float64
 7   inIcuCurrently               4125 non-null   float64
 8   inIcuCumulative              1605 non-null   float64
 9   onVentilatorCurrently        3529 non-null   float64
 10  onVentilatorCumulative       566 non-null    float64
 11  recovered                    6806 non-null   float64
 12  dataQualityGrade             10207 non-null  object 
 13  lastUpdateEt    

------

## 3 - Data Cleaning

Convert the `date` column from an integer in YYYYMMDD form to datetime.

In [8]:
# Convert date column from YYYYMMDD to datetime format
COVID_STATES_DF['date'] = pd.to_datetime(COVID_STATES_DF['date'], format='%Y%m%d')
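
A minimal sketch of the same conversion on a few made-up values may help. Here the integers are cast to strings first, which is an assumption of this sketch rather than something the notebook does, so that the `%Y%m%d` format is applied to text explicitly.

```python
import pandas as pd

# Hypothetical integer dates in YYYYMMDD form, like the raw `date` column.
raw_dates = pd.Series([20200304, 20200305, 20200306])

# Cast to string so the format string is matched against text.
parsed = pd.to_datetime(raw_dates.astype(str), format="%Y%m%d")

print(parsed.iloc[0])
print(parsed.dt.day.tolist())
```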

Check the first date that each state reported.

In [9]:
# Group by state, grab date column, and grab the minimum date for each state.
FIRST_DATE_BY_STATE = COVID_STATES_DF.groupby(by='state')['date'].min()
FIRST_DATE_BY_STATE.sort_values()

state
MA   2020-01-22
WA   2020-01-22
NJ   2020-02-10
VA   2020-02-27
MI   2020-03-01
RI   2020-03-01
NH   2020-03-04
NY   2020-03-04
NC   2020-03-04
OR   2020-03-04
IL   2020-03-04
TX   2020-03-04
SC   2020-03-04
FL   2020-03-04
CA   2020-03-04
AZ   2020-03-04
WI   2020-03-04
GA   2020-03-04
NE   2020-03-05
DC   2020-03-05
CO   2020-03-05
TN   2020-03-05
NV   2020-03-05
MD   2020-03-05
OH   2020-03-05
NM   2020-03-06
PA   2020-03-06
AK   2020-03-06
MN   2020-03-06
KS   2020-03-06
IN   2020-03-06
IA   2020-03-06
DE   2020-03-06
VT   2020-03-06
AR   2020-03-06
WV   2020-03-06
KY   2020-03-06
SD   2020-03-07
UT   2020-03-07
WY   2020-03-07
ND   2020-03-07
MT   2020-03-07
MS   2020-03-07
MO   2020-03-07
ME   2020-03-07
LA   2020-03-07
ID   2020-03-07
HI   2020-03-07
CT   2020-03-07
AL   2020-03-07
OK   2020-03-07
PR   2020-03-16
GU   2020-03-16
VI   2020-03-16
AS   2020-03-16
MP   2020-03-16
Name: date, dtype: datetime64[ns]

Wow, good job MA and WA; well done being prepared and responding preemptively. Meanwhile, Puerto Rico, Guam, the U.S. Virgin Islands, American Samoa, and the Northern Mariana Islands were late to the game.
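
The latest starters can also be picked out programmatically instead of being read off the sorted output. The sketch below does this on a toy frame; the state codes `AA` and `BB` and their dates are hypothetical.

```python
import pandas as pd

# Hypothetical rows: two states with different first reporting dates.
toy_df = pd.DataFrame({
    "state": ["AA", "AA", "BB", "BB"],
    "date": pd.to_datetime(
        ["2020-03-05", "2020-03-04", "2020-03-16", "2020-03-17"]
    ),
})

# Earliest date per state, as in FIRST_DATE_BY_STATE.
first_dates = toy_df.groupby("state")["date"].min()

# States whose first date equals the latest first date overall.
late_starters = first_dates[first_dates == first_dates.max()].index.tolist()
print(late_starters)
```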

Next, let's check each state for consistency: are any dates missing between the state's first reporting date and today?

First, let's do this for a single state.

In [10]:
# Grab today's date
TODAY = time.strftime('%Y-%m-%d')

# Create a state mask for a sample state
STATE_MASK = (COVID_STATES_DF['state'] == 'AK')

# Grab only the date column for a state
STATE_SERIES = COVID_STATES_DF[STATE_MASK]['date']

# Create a date range from the state's first date to today,
#  then take its difference against the state's date series
pd.date_range(start=FIRST_DATE_BY_STATE['AK'],end=TODAY).difference(STATE_SERIES)

DatetimeIndex(['2020-09-05'], dtype='datetime64[ns]', freq=None)
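
To see how the range difference behaves in isolation, here is a small sketch on made-up dates with one deliberate gap. The dates are hypothetical and chosen only to make the missing day obvious.

```python
import pandas as pd

# Hypothetical reporting dates for one state, with 2020-03-03 absent.
reported = pd.DatetimeIndex(["2020-03-01", "2020-03-02", "2020-03-04"])

# Every calendar day from the first reported date to a chosen end date.
expected = pd.date_range(start=reported.min(), end="2020-03-04")

# Days in the expected range that were never reported.
missing = expected.difference(reported)
print(missing)
```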

Now let's create a function that loops over all states and prints the results, to see whether we need to dig any further.

In [11]:
def checks_dates_by_state(df):
    """Print the number of dates missing for each state, from its first report to today."""

    # First reporting date for each state
    state_first_series = df.groupby(by='state')['date'].min()

    # Today's date is the end of every state's expected range
    today = time.strftime('%Y-%m-%d')

    # Loop over each state, the series' index
    for state in state_first_series.index:

        # Subset the dataframe to only this state, and only pull date
        state_series = df[df['state'] == state]['date']

        # All calendar days from the state's first date to today
        expected_dates = pd.date_range(start=state_first_series[state], end=today)

        # Count the dates in the expected range that the state did not report
        n_missing_dates = len(expected_dates.difference(state_series))

        # Print output
        print(f"{state} is missing {n_missing_dates}")

checks_dates_by_state(COVID_STATES_DF)

AK is missing 1
AL is missing 1
AR is missing 1
AS is missing 1
AZ is missing 1
CA is missing 1
CO is missing 1
CT is missing 1
DC is missing 1
DE is missing 1
FL is missing 1
GA is missing 1
GU is missing 1
HI is missing 1
IA is missing 1
ID is missing 1
IL is missing 1
IN is missing 1
KS is missing 1
KY is missing 1
LA is missing 1
MA is missing 1
MD is missing 1
ME is missing 1
MI is missing 1
MN is missing 1
MO is missing 1
MP is missing 1
MS is missing 1
MT is missing 1
NC is missing 1
ND is missing 1
NE is missing 1
NH is missing 1
NJ is missing 1
NM is missing 1
NV is missing 1
NY is missing 1
OH is missing 1
OK is missing 1
OR is missing 1
PA is missing 1
PR is missing 1
RI is missing 1
SC is missing 1
SD is missing 1
TN is missing 1
TX is missing 1
UT is missing 1
VA is missing 1
VI is missing 1
VT is missing 1
WA is missing 1
WI is missing 1
WV is missing 1
WY is missing 1


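Every state is missing exactly one date. In the AK example above, that date was 2020-09-05, the final day of the range, which suggests it is simply today's date, not yet published when the notebook was run. If the same holds for the other states, then no state has gaps between its first reporting date and its most recent one. This does not necessarily mean that every column has a value on every date.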
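
One way to check which dates are missing, rather than only how many, is to return them per state. The sketch below is a hypothetical helper (`missing_dates_by_state` is not part of the notebook or its modules) run on toy data in which no state has a row for the end date.

```python
import pandas as pd

def missing_dates_by_state(df, end):
    """Map each state to the dates absent between its first report and `end`."""
    result = {}
    for state, dates in df.groupby("state")["date"]:
        expected = pd.date_range(start=dates.min(), end=end)
        result[state] = expected.difference(pd.DatetimeIndex(dates))
    return result

# Hypothetical data: neither state has a row for the end date.
toy_df = pd.DataFrame({
    "state": ["AA", "AA", "BB"],
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-02"]),
})
missing = missing_dates_by_state(toy_df, end="2020-03-03")

# True when every state is missing only the end date itself.
end_date = pd.Timestamp("2020-03-03")
only_end_missing = all(list(m) == [end_date] for m in missing.values())
print(only_end_missing)
```

If a check like this returned `True` on the real data with `end` set to today, the single missing date per state would be confirmed as today's unpublished row.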

-----

## 4 - Columns

In [12]:
# Subset features
FEATURES = [
    'date',
    'state',
    'positive',
    'death',
    'hospitalized',
    'positiveIncrease',
    'deathIncrease',
    'total'
]
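
A list like `FEATURES` is typically used to subset the dataframe. The sketch below shows one possible way to do that defensively and then measure how sparse the selected columns are. The toy frame and the shortened `features` list are hypothetical stand-ins for `COVID_STATES_DF` and `FEATURES`.

```python
import pandas as pd

# Hypothetical frame with a few of the columns named in FEATURES.
toy_df = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-04", "2020-03-05"]),
    "state": ["AA", "AA"],
    "positive": [1.0, None],
    "pending": [None, None],
})
features = ["date", "state", "positive"]

# Fail early if a requested column is absent.
absent = [col for col in features if col not in toy_df.columns]
if absent:
    raise KeyError(f"Columns not found: {absent}")

subset_df = toy_df[features]

# Fraction of null values per selected column.
null_share = subset_df.isnull().mean()
print(null_share)
```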

In [13]:
COVID_STATES_DF.isnull().sum()

date                               0
state                              0
positive                          96
negative                         237
pending                         9127
hospitalizedCurrently           2768
hospitalizedCumulative          4606
inIcuCurrently                  6221
inIcuCumulative                 8741
onVentilatorCurrently           6817
onVentilatorCumulative          9780
recovered                       3540
dataQualityGrade                 139
lastUpdateEt                     139
dateModified                     424
checkTimeEt                      424
death                            749
hospitalized                    4606
dateChecked                      424
totalTestsViral                 4648
positiveTestsViral              8523
negativeTestsViral              8784
positiveCasesViral              3000
deathConfirmed                  6491
deathProbable                   7666
totalTestEncountersViral        9265
totalTestsPeopleViral           6821
t