# NYTimes COVID-19 Dataset
---
The NYTimes provides a dataset containing cases, deaths, state, fips (county code), and dates that tracks the pandemic as it evolves over time. The raw data can be used directly to see the number of cases in a location over time, but it may be handy to create features from it to perform time series forecasting of cases or deaths.

## Evolving influenza data from 2018 - 2019

* The flu data from https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html can provide some information as to how COVID-19 may spread during the winter season. The Spanish Flu is considered closer to COVID-19 in mortality and contagiousness, as indicated by https://www.informationisbeautiful.net/visualizations/the-microbescope-infectious-diseases-in-context/. However, Spanish Flu data is scarce, so the seasonal flu data will have to suffice as a guideline.
* As the data overall is incomplete and projections are only made for a few months, the best I can do is normalize the data within each state, average the normalized values across states for each week, and use the result as a guideline.

In [1]:
import pandas as pd #Dataframes for data
flu_df = pd.read_csv("../data/raw/ILINet.csv")
flu_df['ILITOTAL'] = flu_df['ILITOTAL'].astype(float)
flu_df['WEEK'] = pd.to_datetime(flu_df['WEEK'], format='%m/%d/%Y')
flu_df.info()
flu_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2891 entries, 0 to 2890
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   REGION    2891 non-null   object        
 1   WEEK      2891 non-null   datetime64[ns]
 2   ILITOTAL  2891 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 67.9+ KB


Unnamed: 0,REGION,WEEK,ILITOTAL
0,Alabama,2018-01-20,5145.0
1,Alaska,2018-01-20,199.0
2,Arizona,2018-01-20,1056.0
3,Arkansas,2018-01-20,419.0
4,California,2018-01-20,1963.0


In [2]:
states = flu_df.REGION.unique()

for state in states:
    sub_df = flu_df[flu_df["REGION"] == state].drop(columns=['REGION','WEEK'])
    sub_df = ( sub_df - sub_df.min() ) / ( sub_df.max() - sub_df.min() )
    flu_df.update(sub_df)
    
flu_df.head()

Unnamed: 0,REGION,WEEK,ILITOTAL
0,Alabama,2018-01-20,0.973891
1,Alaska,2018-01-20,0.318359
2,Arizona,2018-01-20,1.0
3,Arkansas,2018-01-20,0.654459
4,California,2018-01-20,0.800393
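
The per-state normalization above, together with the weekly averaging that follows, can also be written without explicit loops. The block below is a minimal sketch on made-up toy data, not the notebook's actual cell; the names `toy_flu`, `min_max`, and `ILITOTAL_NORM` are assumptions introduced for illustration.

```python
import pandas as pd

# Toy stand-in for the ILINet data; the values are invented for illustration.
toy_flu = pd.DataFrame({
    "REGION": ["A", "A", "A", "B", "B", "B"],
    "WEEK": pd.to_datetime(["2018-01-20", "2018-01-27", "2018-02-03"] * 2),
    "ILITOTAL": [10.0, 30.0, 20.0, 100.0, 200.0, 400.0],
})

def min_max(series):
    """Scale a series to the [0, 1] range (hypothetical helper)."""
    return (series - series.min()) / (series.max() - series.min())

# Normalize ILITOTAL within each region.
toy_flu["ILITOTAL_NORM"] = toy_flu.groupby("REGION")["ILITOTAL"].transform(min_max)

# Average the normalized values across regions for each week.
weekly_mean = toy_flu.groupby("WEEK")["ILITOTAL_NORM"].mean()
print(weekly_mean)
```

Note that a region whose values are all identical would divide by zero here and produce NaN, so such regions would need separate handling.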


* Now that the ILITOTAL values have been normalized by state, they will be averaged across states for each week. This makes the data more applicable to the whole dataset since, for example, Florida's data is missing.

In [3]:
weeks = flu_df.WEEK.unique()
temp_dict = {}

for week in weeks:
    sub_df = flu_df[flu_df['WEEK'] == week].drop(columns=['REGION','WEEK'])
    temp_dict[week] = sub_df.mean()
    
ili_norm = pd.DataFrame.from_dict(temp_dict,orient='index')
ili_norm.reset_index(inplace=True)
ili_norm = ili_norm.rename(columns = {'index':'dates'})
print(ili_norm)

        dates  ILITOTAL
0  2018-01-20  0.582319
1  2018-01-27  0.694342
2  2018-02-03  0.763799
3  2018-02-10  0.772556
4  2018-02-17  0.683107
5  2018-02-24  0.494634
6  2018-03-03  0.375379
7  2018-03-10  0.299736
8  2018-03-17  0.252403
9  2018-03-24  0.229393
10 2018-03-31  0.192248
11 2018-04-07  0.175035
12 2018-04-14  0.146973
13 2018-04-21  0.122915
14 2018-04-28  0.115158
15 2018-05-05  0.095298
16 2018-05-12  0.077972
17 2018-05-19  0.066162
18 2018-05-26  0.061652
19 2018-06-02  0.048488
20 2018-06-09  0.041093
21 2018-06-16  0.025758
22 2018-06-23  0.029080
23 2018-06-30  0.026398
24 2018-07-07  0.021421
25 2018-07-14  0.023417
26 2018-07-21  0.016825
27 2018-07-28  0.015261
28 2018-08-04  0.013522
29 2018-08-11  0.019205
30 2018-08-18  0.021029
31 2018-08-25  0.029590
32 2018-09-01  0.039981
33 2018-09-08  0.051239
34 2018-09-15  0.058043
35 2018-09-22  0.061276
36 2018-09-29  0.067154
37 2018-10-06  0.134738
38 2018-10-13  0.144780
39 2018-10-20  0.148313
40 2018-10-27  0

* The normalized values provide weekly data, but the cases in the NYTimes dataset are daily. Linear interpolation will be used here to fill in the missing daily values between the weekly points.

In [4]:
ili_norm.index = ili_norm['dates']
del ili_norm['dates']
ili_norm = ili_norm.resample('D').mean()
ili_norm['ILITOTAL'] = ili_norm['ILITOTAL'].interpolate()
ili_norm.reset_index(inplace=True)
ili_norm.info()
ili_norm['dates'] = ili_norm['dates'] + pd.Timedelta(days=731) #Move it over 2 years for training

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 405 entries, 0 to 404
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   dates     405 non-null    datetime64[ns]
 1   ILITOTAL  405 non-null    float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 6.5 KB


In [5]:
ili_norm.tail()

Unnamed: 0,dates,ILITOTAL
400,2021-02-24,0.640351
401,2021-02-25,0.63153
402,2021-02-26,0.622709
403,2021-02-27,0.613888
404,2021-02-28,0.605066
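
The weekly-to-daily step and the date shift can be checked on a small example. This is a minimal sketch with invented values; `weekly`, `daily`, and `shifted` are illustrative names, and only the 731-day offset is taken from the cell above.

```python
import pandas as pd

# Three weekly observations with made-up values.
weekly = pd.DataFrame(
    {"ILITOTAL": [0.0, 0.7, 1.4]},
    index=pd.to_datetime(["2018-01-20", "2018-01-27", "2018-02-03"]),
)

# Upsample to daily frequency; the days between weekly points become NaN.
daily = weekly.resample("D").mean()

# Fill the gaps by linear interpolation (the default method).
daily["ILITOTAL"] = daily["ILITOTAL"].interpolate()

# Shift the dates by 731 days so 2018-01-20 lines up with 2020-01-21.
shifted = daily.copy()
shifted.index = shifted.index + pd.Timedelta(days=731)
print(shifted.head())
```

Three weekly points expand to 15 daily rows, and each interpolated day rises by one seventh of the weekly difference.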


* The overall trend appears to be what we are looking for. I'll read in the data from the NYTimes dataset and try to fit the existing case data to the normalized values in ili_norm.

## NYTimes Data Preparation
This section reads in the NYTimes data and cleans it up to match the flu data generated above, so that the two can be used together for time series forecasting.

In [6]:
df = pd.read_csv('../data/raw/covid-19-data/us-counties.csv')
df = df[df['fips'] < 60000]
df.info()
print(df.shape)
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 681089 entries, 0 to 703073
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   date    681089 non-null  object 
 1   county  681089 non-null  object 
 2   state   681089 non-null  object 
 3   fips    681089 non-null  float64
 4   cases   681089 non-null  int64  
 5   deaths  681089 non-null  int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 36.4+ MB
(681089, 6)


Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0
1,2020-01-22,Snohomish,Washington,53061.0,1,0
2,2020-01-23,Snohomish,Washington,53061.0,1,0
3,2020-01-24,Cook,Illinois,17031.0,1,0
4,2020-01-24,Snohomish,Washington,53061.0,1,0


* Here I need to convert the date column to datetime and drop the county, state, and deaths columns.

In [7]:
df = df.drop(columns=['county','state','deaths'])
df['date'] = pd.to_datetime(df['date'])
#df.date = df.date.dt.isocalendar().week
df.info()
print(df.shape)
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 681089 entries, 0 to 703073
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   date    681089 non-null  datetime64[ns]
 1   fips    681089 non-null  float64       
 2   cases   681089 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 20.8 MB
(681089, 3)


Unnamed: 0,date,fips,cases
0,2020-01-21,53061.0,1
1,2020-01-22,53061.0,1
2,2020-01-23,53061.0,1
3,2020-01-24,17031.0,1
4,2020-01-24,53061.0,1


* The dates can be utilized as is with the already interpolated ili_norm data

In [8]:
print(ili_norm['dates'])

0     2020-01-21
1     2020-01-22
2     2020-01-23
3     2020-01-24
4     2020-01-25
         ...    
400   2021-02-24
401   2021-02-25
402   2021-02-26
403   2021-02-27
404   2021-02-28
Name: dates, Length: 405, dtype: datetime64[ns]


* Fortunately, the earlier cleaning of the flu data left its first week at January 20th, so after the 731-day shift the interpolated series starts on 2020-01-21, the first date in the NYTimes data, and runs through 2021-02-28. The two datasets can therefore be matched by date for the time series forecasting.

## Adjacency Data
Prior processing generated a list of counties (by fips), each paired with its adjacent counties (also by fips). Further processing then converted this list into the sum of cases in a county and its adjacent counties, per county per date. The scripts directory explains how this csv file is generated. The adjacency data is read in below.

In [9]:
adj_df = pd.read_csv("../data/processed/adj_sum.csv")
adj_df = adj_df.drop(columns='Unnamed: 0')
adj_df.info()
print(adj_df.shape)
print(adj_df)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1095005 entries, 0 to 1095004
Data columns (total 3 columns):
 #   Column     Non-Null Count    Dtype  
---  ------     --------------    -----  
 0   fips       1095005 non-null  float64
 1   date       1095005 non-null  object 
 2   adj_cases  1095005 non-null  int64  
dtypes: float64(1), int64(1), object(1)
memory usage: 25.1+ MB
(1095005, 3)
            fips        date  adj_cases
0        53061.0  2020-01-21         39
1        53061.0  2020-01-22          4
2        53061.0  2020-01-23          1
3        53061.0  2020-01-24          2
4        53061.0  2020-01-25          0
...          ...         ...        ...
1095000   2230.0  2021-02-24         43
1095001   2230.0  2021-02-25         43
1095002   2230.0  2021-02-26         43
1095003   2230.0  2021-02-27         43
1095004   2230.0  2021-02-28         42

[1095005 rows x 3 columns]
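
The script that builds adj_sum.csv is not shown here, so the block below is only a guess at how such a per-county, per-date neighbor sum could be computed. The `adjacency` mapping, the toy case counts, and all variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical adjacency list: each county's fips mapped to its neighbors.
adjacency = {1: [2, 3], 2: [1], 3: [1]}

# Made-up case counts for three counties on two dates.
toy_cases = pd.DataFrame({
    "date": ["2020-03-01"] * 3 + ["2020-03-02"] * 3,
    "fips": [1, 2, 3, 1, 2, 3],
    "cases": [5, 1, 0, 8, 2, 1],
})

# Expand the adjacency list into (fips, neighbor) pairs,
# including each county itself so its own cases are counted.
pairs = pd.DataFrame(
    [(f, n) for f, nbrs in adjacency.items() for n in [f] + nbrs],
    columns=["fips", "neighbor"],
)

# Attach each neighbor's case counts, then sum per county and date.
merged = pairs.merge(toy_cases.rename(columns={"fips": "neighbor"}), on="neighbor")
adj_sum = (
    merged.groupby(["fips", "date"], as_index=False)["cases"]
    .sum()
    .rename(columns={"cases": "adj_cases"})
)
print(adj_sum)
```

Including the county itself in `pairs` follows the description of the sum as covering "a county and adjacent counties"; dropping `[f] +` would count neighbors only.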


* The simplest use of the adjacency data is to make a feature that represents the sum of the number of cases surrounding each county. This should help model possible outside influence on case development in each county.
* As each county needs its own processing, this will be a messy set of for loops that generates a separate dataframe with adjacency and flu features for each county, which is then used in the time series forecasting.

In [None]:
from sklearn.tree import DecisionTreeRegressor #Decision Tree Regressor for modeling
import datetime as dt #To Ordinal for training

failed_fips = []
model_dt = DecisionTreeRegressor(random_state=0)
exp_df = pd.DataFrame(columns=['date','fips','cases'])

for fips in df.fips.unique():
    try:
        #Piecing the separate dataframes together
        #adj_cases
        sub_df = df[df['fips'] == fips]
        temp = adj_df[adj_df['fips'] == fips]
        temp = temp.loc[temp['date'].isin(sub_df['date'].astype(str).unique()),'adj_cases'].tolist()
        sub_df.insert(3,'adj_cases',temp,True)
        sub_df = sub_df.dropna()

        #ili_norm
        temp = ili_norm.loc[ili_norm['dates'].isin(sub_df['date'].astype(str).unique()),'ILITOTAL'].tolist()
        sub_df.insert(3,'ili_norm',temp,True)
    
   
        #Modeling
        y = sub_df['cases']
        x = sub_df.drop(columns=['cases','fips'])
        x['year'] = x['date'].dt.year
        x['month'] = x['date'].dt.month
        x['day'] = x['date'].dt.day
        x = x.drop(columns=['date'])
        x['adj_cases'] = ( x['adj_cases'] - x['adj_cases'].min() ) / ( x['adj_cases'].max() - x['adj_cases'].min() )
        model_dt.fit(x, y)

        #Predicting
        predict_df = pd.date_range(max(df['date']) + pd.Timedelta(days=1), max(df['date']) + pd.Timedelta(days=115)).to_frame(index=False,name='date')
        temp = adj_df[adj_df['fips'] == fips].copy()
        temp['adj_cases'] = ( temp['adj_cases'] - temp['adj_cases'].min() ) / ( temp['adj_cases'].max() - temp['adj_cases'].min() )

        temp = temp.loc[temp['date'].isin(predict_df['date'].astype(str).unique()),'adj_cases'].tolist()
        predict_df.insert(1,'adj_cases',temp,True)
        temp = ili_norm.loc[ili_norm['dates'].isin(predict_df['date'].astype(str).unique()),'ILITOTAL'].tolist()
        predict_df.insert(2,'ili_norm',temp,True)
        predict_df['year'] = predict_df['date'].dt.year
        predict_df['month'] = predict_df['date'].dt.month
        predict_df['day'] = predict_df['date'].dt.day
        predict_df = predict_df.drop(columns=['date'])
        predicted = model_dt.predict(predict_df)
        predicted = predicted + y.tail(1).iloc[0]
        
        #Setup for export
        app_df = pd.DataFrame(predicted).join(predict_df).rename(columns = {0:'cases'})
        app_df['date'] = pd.to_datetime(predict_df[['year', 'month', 'day']])
        app_df = app_df.drop(columns=['year','month','day'])
        app_df['fips'] = fips
        app_df = app_df[['date', 'fips', 'cases']]
        #DataFrame.append was removed in pandas 2.0, so concatenate instead
        exp_df = pd.concat([exp_df, df[df['fips'] == fips], app_df])
    except Exception: #Record counties that could not be modeled
        failed_fips.append(fips)
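
The loop above lines up features by filtering with `isin` and inserting the resulting lists, which depends on every source having the same rows in the same order. One alternative is to join on keys instead. The block below is a minimal sketch for a single county using invented toy frames (`toy_cases`, `toy_adj`, `toy_ili` are assumptions); it only builds the feature matrix and does not reproduce the modeling step.

```python
import pandas as pd

# Toy stand-ins for df, adj_df, and ili_norm; all values are made up.
toy_cases = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    "fips": [1.0, 1.0, 1.0],
    "cases": [5, 8, 13],
})
toy_adj = pd.DataFrame({
    "fips": [1.0, 1.0, 1.0],
    "date": ["2020-03-01", "2020-03-02", "2020-03-03"],  # strings, as read from csv
    "adj_cases": [6, 11, 20],
})
toy_ili = pd.DataFrame({
    "dates": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    "ILITOTAL": [0.30, 0.28, 0.26],
})

# Give both date columns the same dtype before joining.
toy_adj["date"] = pd.to_datetime(toy_adj["date"])
ili = toy_ili.rename(columns={"dates": "date", "ILITOTAL": "ili_norm"})

# Join on keys so rows are aligned by value rather than by position.
features = toy_cases.merge(toy_adj, on=["fips", "date"], how="inner")
features = features.merge(ili, on="date", how="inner")

# Split the date into numeric parts, as the loop above does.
features["year"] = features["date"].dt.year
features["month"] = features["date"].dt.month
features["day"] = features["date"].dt.day

x = features.drop(columns=["cases", "fips", "date"])
y = features["cases"]
print(x)
```

With an inner join, dates missing from any source are dropped rather than raising a length mismatch, so it is worth comparing `len(features)` against the original row count.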

## Output
* With the dataframe of actual and predicted cases generated, I'll simply write it to a csv so the data can be reused.

In [None]:
exp_df.to_csv('../data/processed/dtr_results.csv',index=False)