# NYTimes COVID-19 Dataset
---
NYTimes provides a dataset containing cases, deaths, state, fips (county code), and dates for the cases as the pandemic evolves over time. This information can be used as presented in raw data for the number of cases in a location over time but it may be handy to create features from this data to performing some time series forecasting of cases or deaths.

## Evolving influenza data from 2018 - 2019

* The flu data from https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html can provide some information as to how COVID-19 may spread during the winter season. The Spanish Flu is considered to be more like the COVID-19 in mortality and contagiousness as indicated by https://www.informationisbeautiful.net/visualizations/the-microbescope-infectious-diseases-in-context/. However, Spanish Flu data is scarce and so the flu data will have to suffice as a guideline.
* As the data overall is incomplete and projections are only made for a few months, the best I can do is normalize the data and then average for each state and attempt to use that as a guideline

In [1]:
import pandas as pd #Dataframes for data
flu_df = pd.read_csv("../data/raw/ILINet.csv")
flu_df['ILITOTAL'] = flu_df['ILITOTAL'].astype(float)
flu_df['WEEK'] = pd.to_datetime(flu_df['WEEK'], format='%m/%d/%Y')
flu_df.info()
flu_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2842 entries, 0 to 2841
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   REGION    2842 non-null   object        
 1   WEEK      2842 non-null   datetime64[ns]
 2   ILITOTAL  2842 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 66.7+ KB


Unnamed: 0,REGION,WEEK,ILITOTAL
0,Alabama,2018-01-27,5136.0
1,Alaska,2018-01-27,233.0
2,Arizona,2018-01-27,923.0
3,Arkansas,2018-01-27,434.0
4,California,2018-01-27,2038.0


In [2]:
states = flu_df.REGION.unique()

for state in states:
    sub_df = flu_df[flu_df["REGION"] == state].drop(['REGION','WEEK'],1)
    sub_df = ( sub_df - sub_df.min() ) / ( sub_df.max() - sub_df.min() )
    flu_df.update(sub_df)
    
flu_df.head()

Unnamed: 0,REGION,WEEK,ILITOTAL
0,Alabama,2018-01-27,0.972124
1,Alaska,2018-01-27,0.384766
2,Arizona,2018-01-27,0.91472
3,Arkansas,2018-01-27,0.678344
4,California,2018-01-27,0.837266


* Now that the ILITOTAL's have been normalized by state, the weeks will be averaged to make the data more appliable to the whole dataset as, for example, Florida's data is missing

In [3]:
weeks = flu_df.WEEK.unique()
temp_dict = {}

for week in weeks:
    sub_df = flu_df[flu_df['WEEK'] == week].drop(['REGION','WEEK'],1)
    temp_dict[week] = sub_df.mean()
    
ili_norm = pd.DataFrame.from_dict(temp_dict,orient='index')
ili_norm.reset_index(inplace=True)
ili_norm = ili_norm.rename(columns = {'index':'dates'})
ili_norm['dates'] = pd.to_datetime(ili_norm['dates']).dt.isocalendar().week
print(ili_norm)

    dates  ILITOTAL
0       4  0.698070
1       5  0.767384
2       6  0.776034
3       7  0.686260
4       8  0.497011
5       9  0.377323
6      10  0.301002
7      11  0.253757
8      12  0.230812
9      13  0.193447
10     14  0.176076
11     15  0.147973
12     16  0.123764
13     17  0.115875
14     18  0.095979
15     19  0.078544
16     20  0.066655
17     21  0.062135
18     22  0.048924
19     23  0.041295
20     24  0.025884
21     25  0.029443
22     26  0.026678
23     27  0.021607
24     28  0.023612
25     29  0.016900
26     30  0.015492
27     31  0.013575
28     32  0.019436
29     33  0.021192
30     34  0.030010
31     35  0.040322
32     36  0.051637
33     37  0.058462
34     38  0.061611
35     39  0.067424
36     40  0.135182
37     41  0.145239
38     42  0.148789
39     43  0.165876
40     44  0.181562
41     45  0.183691
42     46  0.185782
43     47  0.176321
44     48  0.223590
45     49  0.223211
46     50  0.264184
47     51  0.301791
48     52  0.348000


* The overall trend appears to be what we are looking for. I'll read in the data from NYTimes dataset and try to fit the already existing data to the normalized values presented in ili_norm

## NYTimes Data Preparation
Reading in the NYTimes data and cleaning it up to match the flu data that was generated from above to perform time series forecasting

In [4]:
df = pd.read_csv('../data/raw/covid-19-data/us-counties.csv')
df.info()
print(df.shape)
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 703074 entries, 0 to 703073
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   date    703074 non-null  object 
 1   county  703074 non-null  object 
 2   state   703074 non-null  object 
 3   fips    696354 non-null  float64
 4   cases   703074 non-null  int64  
 5   deaths  703074 non-null  int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 32.2+ MB
(703074, 6)


Unnamed: 0,date,county,state,fips,cases,deaths
0,2020-01-21,Snohomish,Washington,53061.0,1,0
1,2020-01-22,Snohomish,Washington,53061.0,1,0
2,2020-01-23,Snohomish,Washington,53061.0,1,0
3,2020-01-24,Cook,Illinois,17031.0,1,0
4,2020-01-24,Snohomish,Washington,53061.0,1,0


* Here I need to convert the date column to datetime and drop the county/state/deaths column

In [5]:
df = df.drop(['county','state','deaths'],1)
df['date'] = pd.to_datetime(df['date'])
df.date = df.date.dt.isocalendar().week
df.info()
print(df.shape)
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 703074 entries, 0 to 703073
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   date    703074 non-null  UInt32 
 1   fips    696354 non-null  float64
 2   cases   703074 non-null  int64  
dtypes: UInt32(1), float64(1), int64(1)
memory usage: 14.1 MB
(703074, 3)


Unnamed: 0,date,fips,cases
0,4,53061.0,1
1,4,53061.0,1
2,4,53061.0,1
3,4,17031.0,1
4,4,53061.0,1


* From here it looks like to make use of the generated flu data I'll need to bin the NYTimes data by weeks, this is also a good chance to further trim the flu data to match the NYTimes data.

In [6]:
print(ili_norm['dates'])

0      4
1      5
2      6
3      7
4      8
5      9
6     10
7     11
8     12
9     13
10    14
11    15
12    16
13    17
14    18
15    19
16    20
17    21
18    22
19    23
20    24
21    25
22    26
23    27
24    28
25    29
26    30
27    31
28    32
29    33
30    34
31    35
32    36
33    37
34    38
35    39
36    40
37    41
38    42
39    43
40    44
41    45
42    46
43    47
44    48
45    49
46    50
47    51
48    52
49     1
50     2
51     3
52     4
53     5
54     6
55     7
56     8
57     9
Name: dates, dtype: UInt32


* Fortunately it looks like prior cleaning of the flu data had left the starting week to 4, matching the NYTimes data already. This can be used to bin the data for the time series forecasting

## Adjacency Data
Prior processing allowed for the generation of the list of counties represented in fips with their adjacent county also represented in fips. The explanation of this csv file being generated is present in the scripts directory. Reading in the adjacency data.

In [7]:
adj_df = pd.read_csv("../data/processed/adjacency_list.csv")
adj_df.info()
print(adj_df.shape)
adj_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3140 entries, 0 to 3139
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   county      3140 non-null   int64  
 1   adjacent0   3140 non-null   int64  
 2   adjacent1   3137 non-null   float64
 3   adjacent2   3119 non-null   float64
 4   adjacent3   3087 non-null   float64
 5   adjacent4   3017 non-null   float64
 6   adjacent5   2759 non-null   float64
 7   adjacent6   2086 non-null   float64
 8   adjacent7   983 non-null    float64
 9   adjacent8   297 non-null    float64
 10  adjacent9   69 non-null     float64
 11  adjacent10  17 non-null     float64
 12  adjacent11  5 non-null      float64
 13  adjacent12  2 non-null      float64
 14  adjacent13  2 non-null      float64
 15  adjacent14  1 non-null      float64
dtypes: float64(14), int64(2)
memory usage: 392.6 KB
(3140, 16)


Unnamed: 0,county,adjacent0,adjacent1,adjacent2,adjacent3,adjacent4,adjacent5,adjacent6,adjacent7,adjacent8,adjacent9,adjacent10,adjacent11,adjacent12,adjacent13,adjacent14
0,1001,1001,1021.0,1047.0,1051.0,1085.0,1101.0,,,,,,,,,
1,1003,1003,1025.0,1053.0,1097.0,1099.0,1129.0,12033.0,,,,,,,,
2,1005,1005,1011.0,1045.0,1067.0,1109.0,1113.0,13061.0,13239.0,13259.0,,,,,,
3,1007,1007,1021.0,1065.0,1073.0,1105.0,1117.0,1125.0,,,,,,,,
4,1009,1009,1043.0,1055.0,1073.0,1095.0,1115.0,1127.0,,,,,,,,


* The simplest use of the adjacency data is to make a feature that represents the sum of the number of cases surrounding each county. This should help model possible outside influence for case development in each county
* As each county will need its own processing, it will be a messy set of for loops that generated separate dataframes and adjacency data that is then used in the time series forecasting

In [52]:
from sklearn.model_selection import train_test_split #Train Test Split
from sklearn.tree import DecisionTreeRegressor #Decision Tree Regressor for modeling
from sklearn.metrics import r2_score #R2_Score function
R2_Scores = {}
failed_fips = []
model_dt = DecisionTreeRegressor(random_state=0)

for fips in df.fips.unique():
    #Generates adjacency sums and appends to a new sub dataframe with cases and normalized flu data
    adjacent = adj_df.loc[adj_df.county == fips, ].values.flatten().tolist()
    adjacent = [x for x in adjacent if str(x) != 'nan']
    sub_df = df[df.fips.isin(adjacent)]
    sub_df = sub_df.groupby('date')['cases'].sum().to_frame()
    sub_df.insert(1,'flu_norm', ili_norm['ILITOTAL'], True)
    sub_df = sub_df.rename(columns = {'cases':'adj_cases'})
    sub_df.insert(0,'cases',df[df['fips']==1001].groupby('date').sum().drop('fips',1), True)
    sub_df.reset_index(inplace=True)
    sub_df = sub_df.dropna()
    
    #Modeling
    try:
        target = sub_df['cases']
        x_train, x_test, y_train, y_test = train_test_split(sub_df.drop(['cases'],1).values, target.values, test_size=0.20, random_state=0)
        model_dt.fit(x_train, y_train)
        predicted_x_train = model_dt.predict(x_test)
        R2_Scores[fips] = r2_score(y_test, predicted_x_train)
    except:
        failed_fips.append(fips)

In [53]:
for key in R2_Scores:
    print(str(key) + ": " + str(R2_Scores[key]))

53061.0: 0.980668732861924
17031.0: 0.8265954920119946
6059.0: 0.9795317122024797
4013.0: 0.9799160053525284
6037.0: 0.9795317122024797
6085.0: 0.8281168058214876
25025.0: 0.7347871489734512
6075.0: 0.9810530260119728
55025.0: 0.9799160053525284
6073.0: 0.9799160053525284
48029.0: 0.9795317122024797
31055.0: 0.8018959231861377
6023.0: 0.9799160053525284
6067.0: 0.9810530260119728
6095.0: 0.9810530260119728
53063.0: 0.9548321433766228
49035.0: 0.9548321433766228
6041.0: 0.9810530260119728
6055.0: 0.8269797851620433
6097.0: 0.9810530260119728
41067.0: 0.9795317122024797
53033.0: 0.9795317122024797
6001.0: 0.9810530260119728
12057.0: 0.9795317122024797
12081.0: 0.9799160053525284
6061.0: 0.9799160053525284
6081.0: 0.8281168058214876
13121.0: 0.8281168058214876
25021.0: 0.7519144633131302
33009.0: 0.9548321433766228
53071.0: 0.9795317122024797
6013.0: 0.9810530260119728
37183.0: 0.827732512671439
34003.0: 0.8265954920119946
36119.0: 0.8265954920119946
48157.0: 0.827732512671439
53007.0: 0.