### Covidcast -- Google search data

In [14]:
# !pip install covidcast

Collecting covidcast
  Downloading covidcast-0.1.0-py3-none-any.whl (9.8 MB)
Collecting delphi-epidata>=0.0.7
  Downloading delphi_epidata-0.0.7-py3-none-any.whl (5.7 kB)
Collecting imageio-ffmpeg
  Downloading imageio_ffmpeg-0.4.2-py3-none-macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (22.5 MB)
Collecting descartes
  Downloading descartes-1.1.0-py3-none-any.whl (5.8 kB)
Installing collected packages: delphi-epidata, imageio-ffmpeg, descartes, covidcast
Successfully installed covidcast-0.1.0 delphi-epidata-0.0.7 descartes-1.1.0 imageio-ffmpeg-0.4.2


In [16]:
import pandas as pd
import numpy as np
from datetime import date
import covidcast
import pickle

#### data_dict = https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/ght.html
Google search data, provided to Delphi by Google Health Trends.
- The signal estimates the volume of COVID-related searches in a given location on a given day.
- It is measured in arbitrary units (its scale is meaningless); larger numbers represent more COVID-related searches.
- It reflects overall searcher interest in a set of COVID-19 related terms about anosmia (loss of smell or taste), which emerged as a symptom of the coronavirus. The specific terms are:
>“why cant i smell or taste”  OR  “loss of smell”  OR  “loss of taste”
- The values reported by the API are unitless and pre-normalized for population size; i.e., the time series obtained for New York and Wyoming are directly comparable.
- The smoothed signal is produced using the following strategy. For each date, a local linear regression is fit, using a Gaussian kernel, with only data on or before that date. (This is equivalent to using a negative half-normal distribution as the kernel.) The bandwidth is chosen such that most of the kernel weight is placed on the preceding seven days. The estimate for the date is the local linear regression’s prediction for that date.
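
The smoothing strategy described in the last bullet can be sketched roughly as follows. This is an illustration only, not Delphi's implementation: the function name `smooth_one_sided` and the `bandwidth` value (kernel standard deviation in days) are assumptions made for this example.

```python
import numpy as np

def smooth_one_sided(values, bandwidth=7.0):
    """Local linear regression with a left-sided Gaussian kernel.

    `bandwidth` is a hypothetical choice (kernel standard deviation in days);
    the exact value used upstream is not stated in this notebook.
    """
    values = np.asarray(values, dtype=float)
    smoothed = np.empty_like(values)
    for t in range(len(values)):
        if t == 0:
            # A single point cannot define a line; keep the raw value.
            smoothed[t] = values[0]
            continue
        # Only data on or before day t enters the fit.
        x = np.arange(t + 1, dtype=float)
        y = values[: t + 1]
        # Gaussian weights centred on day t (negative half-normal shape).
        w = np.exp(-0.5 * ((x - t) / bandwidth) ** 2)
        # Weighted least squares for y ~ a + b * (x - t);
        # the intercept a is the regression's prediction at day t.
        X = np.column_stack([np.ones_like(x), x - t])
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
        smoothed[t] = coef[0]
    return smoothed

# A perfectly linear series should pass through unchanged.
linear = 2.0 * np.arange(10) + 1.0
print(smooth_one_sided(linear))
```

Because only past data is used, the estimate for a date never changes when later days are added, at the cost of lagging behind sharp turns in the raw series.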



Difference between `time_value` and `issue` (both are dates):
- `time_value` is the date the searches were made.
- `issue` is the date the data was collected/published by Google. Collection started in May (May 6) and was sporadic until late July (July 15). Since July, query data has been published daily.
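
As a small check on how the two dates relate, the `lag` column should equal the number of days from `time_value` to `issue`. The sketch below uses a one-row frame built from the dates shown in the first rows of this dataset; the column name `lag_check` is made up for the example.

```python
import pandas as pd

# One row mirroring the earliest records in this dataset.
row = pd.DataFrame({
    "time_value": pd.to_datetime(["2020-02-01"]),
    "issue": pd.to_datetime(["2020-05-06"]),
})

# Days elapsed between the search date and the publication date.
row["lag_check"] = (row["issue"] - row["time_value"]).dt.days
print(row)
```

For these dates the difference is 95 days, which matches the `lag` value reported for the 2020-02-01 rows.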


**TO USE:**

field name  |  dtype  |  Description  |
---  |  ---  |  ---  |
geo_value  |  object  |  two-letter state code (saved as upper case)  |
time_value  |  datetime  |  date the query was made by the end user  |
direction  |  object  |  values -1, 0, 1 indicating a decrease, no substantial change, or an increase in query volume; about 70% of non-null values are 0  |
value  |  float  |  the "score" assigned by Google to quantify the amount of search activity, normalized to population size for the given area  |
lag  |  integer  |  number of days between `time_value` (search date) and `issue` (publication date)  |

**TO DROP**

field name  |  dtype  |  Description  |
---  |  ---  |  ---  |
signal  |  object  |  all observations are smoothed search as defined above  |
issue  |  datetime  |  date the tabulated data was published by Google  |
stderr  |  --  |  null  |
sample_size  |  --  |  null  |
geo_type  |  object  |  indicates the geo aggregation represented (state, county, etc.)  |
data_source  |  object  |  all from Google Health Trends (ght)  |
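
One way to sanity-check the drop list is to look for columns that carry at most one distinct non-null value. The sketch below runs on a small hypothetical frame, `sample`, assembled from the first rows shown in this notebook; on the real data the same expression could be applied to `data`. Note that `issue` is dropped for a different reason (it varies, but is not needed once `lag` is kept), so it is not part of this check.

```python
import pandas as pd

# Hypothetical three-row sample with the same columns as the raw download
# (issue, lag, direction and time_value omitted for brevity).
sample = pd.DataFrame({
    "geo_value": ["ak", "al", "ar"],
    "signal": ["smoothed_search"] * 3,
    "value": [0.0, 2.016856, 3.961135],
    "stderr": [None, None, None],
    "sample_size": [None, None, None],
    "geo_type": ["state"] * 3,
    "data_source": ["ght"] * 3,
})

# A column is uninformative if it has at most one distinct non-null value.
uninformative = [
    col for col in sample.columns if sample[col].nunique(dropna=True) <= 1
]
print(uninformative)
```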



In [4]:
# Raw data was saved to a pickle file (see below). Uncomment to re-download it from the API.

# google = covidcast.signal("ght", "smoothed_search",
#                         date(2020, 2, 1), date(2020, 10, 26),
#                         "state")

In [5]:
google.shape

NameError: name 'google' is not defined

In [6]:
google.head(5)

NameError: name 'google' is not defined

In [38]:
# google.to_pickle('./data/google_raw.pkl')    # uncomment to save a fresh download

In [40]:
data = pd.read_pickle('./data/google_raw.pkl')
data.head()

,geo_value,signal,time_value,direction,issue,lag,value,stderr,sample_size,geo_type,data_source
0,ak,smoothed_search,2020-02-01,,2020-05-06,95,0.0,,,state,ght
1,al,smoothed_search,2020-02-01,,2020-05-06,95,2.016856,,,state,ght
2,ar,smoothed_search,2020-02-01,,2020-05-06,95,3.961135,,,state,ght
3,az,smoothed_search,2020-02-01,,2020-05-06,95,1.732458,,,state,ght
4,ca,smoothed_search,2020-02-01,,2020-05-06,95,4.639261,,,state,ght


In [17]:
# other settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [10]:
data.shape

(13317, 11)

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13317 entries, 0 to 50
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   geo_value    13317 non-null  object        
 1   signal       13317 non-null  object        
 2   time_value   13317 non-null  datetime64[ns]
 3   direction    8173 non-null   object        
 4   issue        13317 non-null  datetime64[ns]
 5   lag          13317 non-null  int64         
 6   value        13317 non-null  float64       
 7   stderr       0 non-null      object        
 8   sample_size  0 non-null      object        
 9   geo_type     13317 non-null  object        
 10  data_source  13317 non-null  object        
dtypes: datetime64[ns](2), float64(1), int64(1), object(7)
memory usage: 1.2+ MB


In [48]:
# set time_value (datetime) as the index
data.set_index('time_value', inplace=True)

In [49]:
data.head()

time_value,geo_value,signal,direction,issue,lag,value,stderr,sample_size,geo_type,data_source
2020-02-01,AK,smoothed_search,,2020-05-06,95,0.0,,,state,ght
2020-02-01,AL,smoothed_search,,2020-05-06,95,2.016856,,,state,ght
2020-02-01,AR,smoothed_search,,2020-05-06,95,3.961135,,,state,ght
2020-02-01,AZ,smoothed_search,,2020-05-06,95,1.732458,,,state,ght
2020-02-01,CA,smoothed_search,,2020-05-06,95,4.639261,,,state,ght


In [55]:
#explore
data['direction'].value_counts(ascending=False, normalize=True)

 0    0.698030
 1    0.160039
-1    0.141931
Name: direction, dtype: float64
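
Note that `value_counts(normalize=True)` ignores missing values by default, so the shares above are fractions of the 8173 non-null `direction` entries, not of all 13317 rows. A minimal sketch of the difference, using a made-up six-element series rather than the real column:

```python
import numpy as np
import pandas as pd

# Toy series: four known directions and two missing values.
direction = pd.Series([0, 0, 1, -1, np.nan, np.nan], name="direction")

# Default: shares are computed over non-null entries only.
share_non_null = direction.value_counts(normalize=True)

# dropna=False: missing values get their own share of the total.
share_all = direction.value_counts(normalize=True, dropna=False)

print(share_non_null)
print(share_all)
```

Applied to this dataset, passing `dropna=False` would show that roughly 39% of rows have no `direction` at all, which puts the zeros at roughly 43% of all rows.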

In [42]:
# uppercase the state codes in ['geo_value']
# uncomment to re-run if the column needs to be revised

# data['geo_value'] = data['geo_value'].str.upper()
# data['geo_value']

0     AK
1     AL
2     AR
3     AZ
4     CA
5     CO
6     CT
7     DC
8     DE
9     FL
10    GA
11    HI
12    IA
13    ID
14    IL
15    IN
16    KS
17    KY
18    LA
19    MA
20    MD
21    ME
22    MI
23    MN
24    MO
25    MS
26    MT
27    NC
28    ND
29    NE
30    NH
31    NJ
32    NM
33    NV
34    NY
35    OH
36    OK
37    OR
38    PA
39    RI
40    SC
41    SD
42    TN
43    TX
44    UT
45    VA
46    VT
47    WA
48    WI
49    WV
50    WY

In [58]:
#remove several columns to reduce size

df_google = data.drop(columns=['signal', 'issue', 'stderr', 'sample_size', 'geo_type', 'data_source'])
df_google.head()

time_value,geo_value,direction,lag,value
2020-02-01,AK,,95,0.0
2020-02-01,AL,,95,2.016856
2020-02-01,AR,,95,3.961135
2020-02-01,AZ,,95,1.732458
2020-02-01,CA,,95,4.639261


In [64]:
df_google.to_csv('./data/data_state_detail/google_clean.csv')
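
Unlike the pickle file, a CSV does not preserve the datetime index, so it has to be restored explicitly when reading the file back. The sketch below demonstrates this with an in-memory buffer and a hypothetical two-row frame named `clean`; for the real file, the same `read_csv` arguments would be used with the path above.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for df_google.
clean = pd.DataFrame(
    {
        "geo_value": ["AK", "AL"],
        "direction": [None, None],
        "lag": [95, 95],
        "value": [0.0, 2.016856],
    },
    index=pd.DatetimeIndex(["2020-02-01", "2020-02-01"], name="time_value"),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
clean.to_csv(buffer)
buffer.seek(0)

# Restore time_value as a parsed datetime index.
restored = pd.read_csv(buffer, index_col="time_value", parse_dates=["time_value"])
print(restored.index.dtype)
```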

In [60]:
df_google.to_pickle('./data/google_clean.pkl')

In [62]:
google = pd.read_pickle('./data/google_clean.pkl')
google.head()

time_value,geo_value,direction,lag,value
2020-02-01,AK,,95,0.0
2020-02-01,AL,,95,2.016856
2020-02-01,AR,,95,3.961135
2020-02-01,AZ,,95,1.732458
2020-02-01,CA,,95,4.639261
