# COVID-19 MIDS Collaboration 

## Data Sourcing: COVID Tracking Project 

This Jupyter Notebook reads in raw data as csv files from a website and exports them as [pickle files for faster loading](https://medium.com/better-programming/load-fast-load-big-with-compressed-pickles-5f311584507e). 

This code was adapted from a script provided to us by Professor Kevin Crook of the Berkeley MIDS program during our W205 (Data Engineering) class. 

### Data sources

US COVID-19 data (historical, at state level) from the COVID tracking project: https://covidtracking.com/api

Note: the COVID Tracking Project website is new and its file formats have changed several times. They should eventually settle down, but until then the parsing in this notebook may need to change.
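Because the file format may change, it can help to check the CSV header before parsing. The following is a minimal sketch, not part of the original script: the `REQUIRED_COLUMNS` set and the `missing_columns` helper are hypothetical names, and the sample text reuses a few values from the data shown later in this notebook.

```python
import csv
import io

# Hypothetical subset of the columns this notebook relies on downstream.
REQUIRED_COLUMNS = {"date", "state", "fips", "positive", "negative"}


def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return sorted(set(required) - set(header))


# Small stand-in for r.text; the real download has many more columns.
sample = "date,state,positive,negative,fips\n20201020,AK,12432,534093,2\n"
print(missing_columns(sample))
```

An empty list means every required column is present; a non-empty list names the columns whose parsing would need attention.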

### Set up environment 

In [1]:
# Import packages
import pandas as pd
import io
import requests
import pickle 

### Retrieve data

In [2]:
# get data at URL - this URL is for the state historical data, updated daily at 4pm ET
r = requests.get("https://covidtracking.com/api/v1/states/daily.csv")

In [3]:
# check HTTP request status
r.status_code

200

In [4]:
# just show the first 2000 characters, the text is really long otherwise
r.text[0:2000]

'date,state,positive,probableCases,negative,pending,totalTestResultsSource,totalTestResults,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,onVentilatorCumulative,recovered,dataQualityGrade,lastUpdateEt,dateModified,checkTimeEt,death,hospitalized,dateChecked,totalTestsViral,positiveTestsViral,negativeTestsViral,positiveCasesViral,deathConfirmed,deathProbable,totalTestEncountersViral,totalTestsPeopleViral,totalTestsAntibody,positiveTestsAntibody,negativeTestsAntibody,totalTestsPeopleAntibody,positiveTestsPeopleAntibody,negativeTestsPeopleAntibody,totalTestsPeopleAntigen,positiveTestsPeopleAntigen,totalTestsAntigen,positiveTestsAntigen,fips,positiveIncrease,negativeIncrease,total,totalTestResultsIncrease,posNeg,deathIncrease,hospitalizedIncrease,hash,commercialScore,negativeRegularScore,negativeScore,positiveScore,score,grade\n20201020,AK,12432,,534093,,totalTestsViral,546525,38,,,,10,,6681,A,10/20/2020 03:59,2020-10-20T03:59:00Z,10/19 23

In [5]:
# load into a Pandas dataframe
covid_df = pd.read_csv(io.StringIO(r.text)).add_prefix('o_') 

covid_df

Unnamed: 0,o_date,o_state,o_positive,o_probableCases,o_negative,o_pending,o_totalTestResultsSource,o_totalTestResults,o_hospitalizedCurrently,o_hospitalizedCumulative,...,o_posNeg,o_deathIncrease,o_hospitalizedIncrease,o_hash,o_commercialScore,o_negativeRegularScore,o_negativeScore,o_positiveScore,o_score,o_grade
0,20201020,AK,12432.0,,534093.0,,totalTestsViral,546525.0,38.0,,...,546525,0,0,c0e44ad8d86fcc1e637484d963ee4b2a44cca93c,0,0,0,0,0,
1,20201020,AL,174528.0,21512.0,1112559.0,,totalTestsViral,1265575.0,846.0,19081.0,...,1287087,16,0,1fbf9e94e20368495944d9b840b76b5d3299a657,0,0,0,0,0,
2,20201020,AR,100441.0,6023.0,1137234.0,,totalTestsViral,1231652.0,627.0,6428.0,...,1237675,14,67,696a543f35e5a8e1498d893e4b20a56bae36154a,0,0,0,0,0,
3,20201020,AS,0.0,,1616.0,,totalTestsViral,1616.0,,,...,1616,0,0,3321ebf66fbc98011373f6dcb01d358a9c68c1ac,0,0,0,0,0,
4,20201020,AZ,232937.0,5267.0,1419675.0,,totalTestsPeopleViral,1647345.0,777.0,20719.0,...,1652612,7,71,4be9a6265b0b8e6d09e2cb6bff66e9c7690990f1,0,0,0,0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12928,20200124,WA,0.0,,0.0,,totalTestEncountersViral,0.0,,,...,0,0,0,82726df68eb97c98a4a6ee792349e547023147d5,0,0,0,0,0,
12929,20200123,MA,,,,,totalTestsViral,2.0,,,...,0,0,0,76bc987d054b119a4e05a4a43742249c0b0568b6,0,0,0,0,0,
12930,20200123,WA,0.0,,0.0,,totalTestEncountersViral,0.0,,,...,0,0,0,1c2229c239ffad5e5fdd9d76c641dc9166caf6ba,0,0,0,0,0,
12931,20200122,MA,,,,,totalTestsViral,1.0,,,...,0,0,0,01f5dcd6631859503ef1b62d81d49e41d12fc1bd,0,0,0,0,0,


#### Data are also available as JSON

The "flat" CSV loaded into Pandas is much easier to work with, but here is an example of downloading JSON data and loading it into Python dictionaries.

Because the parsed JSON can be very large, this example uses the smallest JSON file, which holds the current US values.

In [6]:
r = requests.get("https://covidtracking.com/api/v1/us/current.json")

In [7]:
r.status_code

200

In [8]:
r.json()

[{'date': 20201020,
  'states': 56,
  'positive': 8232285,
  'negative': 109948664,
  'pending': 8014,
  'hospitalizedCurrently': 39230,
  'hospitalizedCumulative': 438965,
  'inIcuCurrently': 8178,
  'inIcuCumulative': 22662,
  'onVentilatorCurrently': 1889,
  'onVentilatorCumulative': 2593,
  'recovered': 3295148,
  'dateChecked': '2020-10-20T00:00:00Z',
  'death': 212678,
  'hospitalized': 438965,
  'totalTestResults': 126940105,
  'lastModified': '2020-10-20T00:00:00Z',
  'total': 0,
  'posNeg': 0,
  'deathIncrease': 832,
  'hospitalizedIncrease': 2148,
  'negativeIncrease': 753597,
  'positiveIncrease': 60582,
  'totalTestResultsIncrease': 906932,
  'hash': '6d6c20790057879700bcfe70bb3bb69fc191dfea'}]
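As the output above shows, the parsed payload is a list holding a single dictionary, so individual values are reached by indexing the list first. The sketch below illustrates this with the standard `json` module on a trimmed copy of the payload (only four of its keys); it assumes the structure stays as shown.

```python
import json

# Trimmed copy of the payload shown above (only a few of its keys).
payload = '[{"date": 20201020, "states": 56, "positive": 8232285, "death": 212678}]'

records = json.loads(payload)  # a list with one dictionary
current = records[0]           # the dictionary of current US values

print(current["positive"])
print(current["death"])
```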

### Convert datatypes

#### Count records

In [9]:
# count rows and columns
covid_df.shape

(12933, 55)

#### Check and convert datatypes

We have:

* o_date: date the cases are from; should be converted from int64 to a date object
* o_dateChecked: date when data were validated; should be converted from string to a date object
* o_hash: unique ID for every data update; string 
* o_state: state the cases are from; string; could be converted to category 
* o_fips: state fips code; should be converted to category; can be used to join to other datasets 
* o_positive/ negative/ pending/ hospitalizedCurrently/ inIcuCurrently/ inIcuCumulative/ onVentilatorCurrently/ onVentilatorCumulative/ recovered/ death/ hospitalized/ total: COVID-19 counts; should be converted from float to integer 
* o_totalTestResults/ posNeg: test results; posNeg is deprecated and was renamed to totalTestResults; should be converted from float to integer 
* o_deathIncrease/ hospitalizedIncrease/ negativeIncrease/ positiveIncrease/ totalTestResultsIncrease: increase in COVID-19 counts since the previous date; should be stored as integers


In [10]:
# check data types 
covid_df.dtypes

o_date                             int64
o_state                           object
o_positive                       float64
o_probableCases                  float64
o_negative                       float64
o_pending                        float64
o_totalTestResultsSource          object
o_totalTestResults               float64
o_hospitalizedCurrently          float64
o_hospitalizedCumulative         float64
o_inIcuCurrently                 float64
o_inIcuCumulative                float64
o_onVentilatorCurrently          float64
o_onVentilatorCumulative         float64
o_recovered                      float64
o_dataQualityGrade                object
o_lastUpdateEt                    object
o_dateModified                    object
o_checkTimeEt                     object
o_death                          float64
o_hospitalized                   float64
o_dateChecked                     object
o_totalTestsViral                float64
o_positiveTestsViral             float64
o_negativeTestsV

In [11]:
covid_df.head()

Unnamed: 0,o_date,o_state,o_positive,o_probableCases,o_negative,o_pending,o_totalTestResultsSource,o_totalTestResults,o_hospitalizedCurrently,o_hospitalizedCumulative,...,o_posNeg,o_deathIncrease,o_hospitalizedIncrease,o_hash,o_commercialScore,o_negativeRegularScore,o_negativeScore,o_positiveScore,o_score,o_grade
0,20201020,AK,12432.0,,534093.0,,totalTestsViral,546525.0,38.0,,...,546525,0,0,c0e44ad8d86fcc1e637484d963ee4b2a44cca93c,0,0,0,0,0,
1,20201020,AL,174528.0,21512.0,1112559.0,,totalTestsViral,1265575.0,846.0,19081.0,...,1287087,16,0,1fbf9e94e20368495944d9b840b76b5d3299a657,0,0,0,0,0,
2,20201020,AR,100441.0,6023.0,1137234.0,,totalTestsViral,1231652.0,627.0,6428.0,...,1237675,14,67,696a543f35e5a8e1498d893e4b20a56bae36154a,0,0,0,0,0,
3,20201020,AS,0.0,,1616.0,,totalTestsViral,1616.0,,,...,1616,0,0,3321ebf66fbc98011373f6dcb01d358a9c68c1ac,0,0,0,0,0,
4,20201020,AZ,232937.0,5267.0,1419675.0,,totalTestsPeopleViral,1647345.0,777.0,20719.0,...,1652612,7,71,4be9a6265b0b8e6d09e2cb6bff66e9c7690990f1,0,0,0,0,0,


##### Integer/String --> Date columns

In [12]:
# create new version of column as a datetime object - with ymd
covid_df["date"] = pd.to_datetime(covid_df["o_date"], format='%Y%m%d')

# create new version of column as a datetime object - with ymdhms
covid_df["dateChecked"] = pd.to_datetime(covid_df["o_dateChecked"])
# check conversion 
covid_df[["o_date", "date", "o_dateChecked", "dateChecked"]].head()

Unnamed: 0,o_date,date,o_dateChecked,dateChecked
0,20201020,2020-10-20,2020-10-20T03:59:00Z,2020-10-20 03:59:00+00:00
1,20201020,2020-10-20,2020-10-20T11:00:00Z,2020-10-20 11:00:00+00:00
2,20201020,2020-10-20,2020-10-20T00:00:00Z,2020-10-20 00:00:00+00:00
3,20201020,2020-10-20,2020-10-01T00:00:00Z,2020-10-01 00:00:00+00:00
4,20201020,2020-10-20,2020-10-20T00:00:00Z,2020-10-20 00:00:00+00:00
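To make the two date formats concrete, here is a small sketch using only the standard library on single values taken from the output above. It is an illustration of the formats, not a replacement for `pd.to_datetime`, and the variable names are chosen for this example.

```python
from datetime import datetime

# o_date is an integer such as 20201020: year, month, day without separators.
day = datetime.strptime(str(20201020), "%Y%m%d")

# o_dateChecked is an ISO 8601 string; the trailing "Z" means UTC.
checked = datetime.fromisoformat("2020-10-20T03:59:00Z".replace("Z", "+00:00"))

print(day.date())
print(checked)
```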


##### String --> Categorical columns

In [13]:
# identify which columns to convert 
str_to_category = ["o_state", "o_fips"]
# create new column names
str_to_category_new = list(map(lambda x: x.replace('o_', ''), str_to_category))
# add new converted columns
covid_df[str_to_category_new] = covid_df[str_to_category].apply(lambda x: x.astype('category'))
# check conversion
covid_df[str_to_category + str_to_category_new]


Unnamed: 0,o_state,o_fips,state,fips
0,AK,2,AK,2
1,AL,1,AL,1
2,AR,5,AR,5
3,AS,60,AS,60
4,AZ,4,AZ,4
...,...,...,...,...
12928,WA,53,WA,53
12929,MA,25,MA,25
12930,WA,53,WA,53
12931,MA,25,MA,25


In [14]:
# check conversion
covid_df[str_to_category + str_to_category_new].dtypes

o_state      object
o_fips        int64
state      category
fips       category
dtype: object

##### Float --> Integer columns

In [15]:
# identify which columns to convert 
to_int = list(covid_df.select_dtypes(include = ["float64"]).columns)
# create new column names
converted_to_int = list(map(lambda x: x.replace('o_', ''), to_int))
# add new converted columns
covid_df[converted_to_int] = covid_df[to_int].apply(lambda x: x.astype('Int64'))
# check conversion
covid_df[to_int].head()


Unnamed: 0,o_positive,o_probableCases,o_negative,o_pending,o_totalTestResults,o_hospitalizedCurrently,o_hospitalizedCumulative,o_inIcuCurrently,o_inIcuCumulative,o_onVentilatorCurrently,...,o_positiveTestsAntibody,o_negativeTestsAntibody,o_totalTestsPeopleAntibody,o_positiveTestsPeopleAntibody,o_negativeTestsPeopleAntibody,o_totalTestsPeopleAntigen,o_positiveTestsPeopleAntigen,o_totalTestsAntigen,o_positiveTestsAntigen,o_grade
0,12432.0,,534093.0,,546525.0,38.0,,,,10.0,...,,,,,,,,,,
1,174528.0,21512.0,1112559.0,,1265575.0,846.0,19081.0,,1954.0,,...,,,61855.0,,,,,,,
2,100441.0,6023.0,1137234.0,,1231652.0,627.0,6428.0,257.0,,96.0,...,,,,,,37538.0,6672.0,21856.0,3300.0,
3,0.0,,1616.0,,1616.0,,,,,,...,,,,,,,,,,
4,232937.0,5267.0,1419675.0,,1647345.0,777.0,20719.0,170.0,,91.0,...,,,,,,,,,,


In [16]:
# copy the columns not handled by the conversions above (as of 2020-10-20) to unprefixed names
missingcol = ['negativeScore', 'score', 'lastUpdateEt', 'negativeIncrease', 'hash', 'negativeRegularScore', 'totalTestResultsIncrease', 'deathIncrease', 'posNeg', 'checkTimeEt', 'commercialScore', 'dateModified', 'hospitalizedIncrease', 'positiveScore', 'positiveIncrease', 'total', 'totalTestResultsSource', 'dataQualityGrade']
# the original versions of these columns carry the 'o_' prefix
missingcol_with_o = ['o_' + x for x in missingcol]
covid_df[missingcol] = covid_df[missingcol_with_o]

covid_df

Unnamed: 0,o_date,o_state,o_positive,o_probableCases,o_negative,o_pending,o_totalTestResultsSource,o_totalTestResults,o_hospitalizedCurrently,o_hospitalizedCumulative,...,posNeg,checkTimeEt,commercialScore,dateModified,hospitalizedIncrease,positiveScore,positiveIncrease,total,totalTestResultsSource,dataQualityGrade
0,20201020,AK,12432.0,,534093.0,,totalTestsViral,546525.0,38.0,,...,546525,10/19 23:59,0,2020-10-20T03:59:00Z,0,0,212,546525,totalTestsViral,A
1,20201020,AL,174528.0,21512.0,1112559.0,,totalTestsViral,1265575.0,846.0,19081.0,...,1287087,10/20 07:00,0,2020-10-20T11:00:00Z,0,0,1043,1287087,totalTestsViral,A
2,20201020,AR,100441.0,6023.0,1137234.0,,totalTestsViral,1231652.0,627.0,6428.0,...,1237675,10/19 20:00,0,2020-10-20T00:00:00Z,67,0,844,1237675,totalTestsViral,A+
3,20201020,AS,0.0,,1616.0,,totalTestsViral,1616.0,,,...,1616,09/30 20:00,0,2020-10-01T00:00:00Z,0,0,0,1616,totalTestsViral,D
4,20201020,AZ,232937.0,5267.0,1419675.0,,totalTestsPeopleViral,1647345.0,777.0,20719.0,...,1652612,10/19 20:00,0,2020-10-20T00:00:00Z,71,0,1040,1652612,totalTestsPeopleViral,A+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12928,20200124,WA,0.0,,0.0,,totalTestEncountersViral,0.0,,,...,0,,0,,0,0,0,0,totalTestEncountersViral,
12929,20200123,MA,,,,,totalTestsViral,2.0,,,...,0,,0,,0,0,0,0,totalTestsViral,
12930,20200123,WA,0.0,,0.0,,totalTestEncountersViral,0.0,,,...,0,,0,,0,0,0,0,totalTestEncountersViral,
12931,20200122,MA,,,,,totalTestsViral,1.0,,,...,0,,0,,0,0,0,0,totalTestsViral,


In [17]:
# check conversion
covid_df[converted_to_int].head()

Unnamed: 0,positive,probableCases,negative,pending,totalTestResults,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,onVentilatorCurrently,...,positiveTestsAntibody,negativeTestsAntibody,totalTestsPeopleAntibody,positiveTestsPeopleAntibody,negativeTestsPeopleAntibody,totalTestsPeopleAntigen,positiveTestsPeopleAntigen,totalTestsAntigen,positiveTestsAntigen,grade
0,12432,,534093,,546525,38.0,,,,10.0,...,,,,,,,,,,
1,174528,21512.0,1112559,,1265575,846.0,19081.0,,1954.0,,...,,,61855.0,,,,,,,
2,100441,6023.0,1137234,,1231652,627.0,6428.0,257.0,,96.0,...,,,,,,37538.0,6672.0,21856.0,3300.0,
3,0,,1616,,1616,,,,,,...,,,,,,,,,,
4,232937,5267.0,1419675,,1647345,777.0,20719.0,170.0,,91.0,...,,,,,,,,,,


In [18]:
# check conversion
covid_df[to_int + converted_to_int].dtypes

o_positive                    float64
o_probableCases               float64
o_negative                    float64
o_pending                     float64
o_totalTestResults            float64
                               ...   
totalTestsPeopleAntigen         Int64
positiveTestsPeopleAntigen      Int64
totalTestsAntigen               Int64
positiveTestsAntigen            Int64
grade                           Int64
Length: 66, dtype: object
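The conversion above uses the pandas nullable `'Int64'` dtype (capital "I") rather than NumPy's `'int64'`, because the count columns contain missing values. The following toy sketch, with a made-up three-value series, shows the difference under that assumption.

```python
import numpy as np
import pandas as pd

# Toy float column with one missing value, similar in shape to o_positive.
counts = pd.Series([12432.0, np.nan, 0.0])

# NumPy's int64 cannot represent NaN, so this cast fails.
try:
    counts.astype("int64")
    plain_int_failed = False
except ValueError:
    plain_int_failed = True

# The nullable Int64 dtype keeps the missing value as <NA>.
nullable = counts.astype("Int64")

print(plain_int_failed)
print(nullable.dtype, nullable.isna().sum())
```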

### Check missingness

We have a lot of missing data in the columns that track COVID-related counts, but none in the ID columns (date, state, fips). Given the difficulty of retrieving these data ([as documented on the COVID Tracking Project website](https://covidtracking.com/data)), some missing data in the count columns is to be expected.

These missingness counts are a further validation that the datatype conversions did not introduce additional NAs. 

In [19]:
covid_df.isnull().sum(axis = 0)

o_date                       0
o_state                      0
o_positive                 110
o_probableCases           8974
o_negative                 248
                          ... 
positiveScore                0
positiveIncrease             0
total                        0
totalTestResultsSource       0
dataQualityGrade          1193
Length: 110, dtype: int64
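One way to make this validation explicit is to compare the missing-value counts of an original column and its converted counterpart directly. The sketch below does so on a toy dataframe; the column pair mirrors `o_positive` and `positive`, but the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for one original column and its converted version.
df = pd.DataFrame({"o_positive": [12432.0, np.nan, 0.0]})
df["positive"] = df["o_positive"].astype("Int64")

# The conversion is safe in this respect if both columns have the same number of NAs.
na_before = int(df["o_positive"].isna().sum())
na_after = int(df["positive"].isna().sum())
same_missing = na_before == na_after

print(na_before, na_after, same_missing)
```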

### Generate summary statistics

The summary statistics of original and converted columns are another method for validating the column type conversion. We can see that the conversion did not appear to change summary statistics of the data so from now on we will use the converted columns. 

In [20]:
# display summary statistics of original columns
covid_df[covid_df.columns[pd.Series(covid_df.columns).str.startswith('o_')]].describe(include = 'all')

Unnamed: 0,o_date,o_state,o_positive,o_probableCases,o_negative,o_pending,o_totalTestResultsSource,o_totalTestResults,o_hospitalizedCurrently,o_hospitalizedCumulative,...,o_posNeg,o_deathIncrease,o_hospitalizedIncrease,o_hash,o_commercialScore,o_negativeRegularScore,o_negativeScore,o_positiveScore,o_score,o_grade
count,12933.0,12933,12823.0,3959.0,12685.0,1449.0,12933,12903.0,10023.0,7442.0,...,12933.0,12933.0,12933.0,12933,12933.0,12933.0,12933.0,12933.0,12933.0,0.0
unique,,56,,,,,4,,,,...,,,,12933,,,,,,
top,,WA,,,,,posNeg,,,,...,,,,3a7925b9487d0193b24876c142d4441b4dfe0178,,,,,,
freq,,273,,,,,7559,,,,...,,,,1,,,,,,
mean,20200650.0,,59252.118927,3300.621116,688806.7,1309.36784,,770068.8,850.216801,7246.263639,...,734346.5,16.444599,33.941158,,0.0,0.0,0.0,0.0,0.0,
std,220.3379,,114199.84294,4389.12398,1458547.0,5221.300266,,1567902.0,1593.418321,14730.659311,...,1549145.0,45.024185,224.593734,,0.0,0.0,0.0,0.0,0.0,
min,20200120.0,,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,...,0.0,-213.0,-4124.0,,0.0,0.0,0.0,0.0,0.0,
25%,20200430.0,,1846.5,468.0,34494.0,21.0,,33378.0,106.0,581.25,...,31985.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,
50%,20200630.0,,15839.0,1807.0,205353.0,176.0,,237803.0,379.0,2460.0,...,215875.0,4.0,0.0,,0.0,0.0,0.0,0.0,0.0,
75%,20200820.0,,66558.0,4327.5,698554.0,583.0,,791708.0,878.0,7600.25,...,749502.0,14.0,24.0,,0.0,0.0,0.0,0.0,0.0,


In [21]:
# list all column names (an empty prefix matches every column)
list(covid_df[covid_df.columns[pd.Series(covid_df.columns).str.startswith('')]].columns)

['o_date',
 'o_state',
 'o_positive',
 'o_probableCases',
 'o_negative',
 'o_pending',
 'o_totalTestResultsSource',
 'o_totalTestResults',
 'o_hospitalizedCurrently',
 'o_hospitalizedCumulative',
 'o_inIcuCurrently',
 'o_inIcuCumulative',
 'o_onVentilatorCurrently',
 'o_onVentilatorCumulative',
 'o_recovered',
 'o_dataQualityGrade',
 'o_lastUpdateEt',
 'o_dateModified',
 'o_checkTimeEt',
 'o_death',
 'o_hospitalized',
 'o_dateChecked',
 'o_totalTestsViral',
 'o_positiveTestsViral',
 'o_negativeTestsViral',
 'o_positiveCasesViral',
 'o_deathConfirmed',
 'o_deathProbable',
 'o_totalTestEncountersViral',
 'o_totalTestsPeopleViral',
 'o_totalTestsAntibody',
 'o_positiveTestsAntibody',
 'o_negativeTestsAntibody',
 'o_totalTestsPeopleAntibody',
 'o_positiveTestsPeopleAntibody',
 'o_negativeTestsPeopleAntibody',
 'o_totalTestsPeopleAntigen',
 'o_positiveTestsPeopleAntigen',
 'o_totalTestsAntigen',
 'o_positiveTestsAntigen',
 'o_fips',
 'o_positiveIncrease',
 'o_negativeIncrease',
 'o_total',


In [22]:
# list only the original columns (names starting with 'o_')
list(covid_df[covid_df.columns[pd.Series(covid_df.columns).str.startswith('o_')]].columns)

['o_date',
 'o_state',
 'o_positive',
 'o_probableCases',
 'o_negative',
 'o_pending',
 'o_totalTestResultsSource',
 'o_totalTestResults',
 'o_hospitalizedCurrently',
 'o_hospitalizedCumulative',
 'o_inIcuCurrently',
 'o_inIcuCumulative',
 'o_onVentilatorCurrently',
 'o_onVentilatorCumulative',
 'o_recovered',
 'o_dataQualityGrade',
 'o_lastUpdateEt',
 'o_dateModified',
 'o_checkTimeEt',
 'o_death',
 'o_hospitalized',
 'o_dateChecked',
 'o_totalTestsViral',
 'o_positiveTestsViral',
 'o_negativeTestsViral',
 'o_positiveCasesViral',
 'o_deathConfirmed',
 'o_deathProbable',
 'o_totalTestEncountersViral',
 'o_totalTestsPeopleViral',
 'o_totalTestsAntibody',
 'o_positiveTestsAntibody',
 'o_negativeTestsAntibody',
 'o_totalTestsPeopleAntibody',
 'o_positiveTestsPeopleAntibody',
 'o_negativeTestsPeopleAntibody',
 'o_totalTestsPeopleAntigen',
 'o_positiveTestsPeopleAntigen',
 'o_totalTestsAntigen',
 'o_positiveTestsAntigen',
 'o_fips',
 'o_positiveIncrease',
 'o_negativeIncrease',
 'o_total',


In [23]:
covid_df

Unnamed: 0,o_date,o_state,o_positive,o_probableCases,o_negative,o_pending,o_totalTestResultsSource,o_totalTestResults,o_hospitalizedCurrently,o_hospitalizedCumulative,...,posNeg,checkTimeEt,commercialScore,dateModified,hospitalizedIncrease,positiveScore,positiveIncrease,total,totalTestResultsSource,dataQualityGrade
0,20201020,AK,12432.0,,534093.0,,totalTestsViral,546525.0,38.0,,...,546525,10/19 23:59,0,2020-10-20T03:59:00Z,0,0,212,546525,totalTestsViral,A
1,20201020,AL,174528.0,21512.0,1112559.0,,totalTestsViral,1265575.0,846.0,19081.0,...,1287087,10/20 07:00,0,2020-10-20T11:00:00Z,0,0,1043,1287087,totalTestsViral,A
2,20201020,AR,100441.0,6023.0,1137234.0,,totalTestsViral,1231652.0,627.0,6428.0,...,1237675,10/19 20:00,0,2020-10-20T00:00:00Z,67,0,844,1237675,totalTestsViral,A+
3,20201020,AS,0.0,,1616.0,,totalTestsViral,1616.0,,,...,1616,09/30 20:00,0,2020-10-01T00:00:00Z,0,0,0,1616,totalTestsViral,D
4,20201020,AZ,232937.0,5267.0,1419675.0,,totalTestsPeopleViral,1647345.0,777.0,20719.0,...,1652612,10/19 20:00,0,2020-10-20T00:00:00Z,71,0,1040,1652612,totalTestsPeopleViral,A+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12928,20200124,WA,0.0,,0.0,,totalTestEncountersViral,0.0,,,...,0,,0,,0,0,0,0,totalTestEncountersViral,
12929,20200123,MA,,,,,totalTestsViral,2.0,,,...,0,,0,,0,0,0,0,totalTestsViral,
12930,20200123,WA,0.0,,0.0,,totalTestEncountersViral,0.0,,,...,0,,0,,0,0,0,0,totalTestEncountersViral,
12931,20200122,MA,,,,,totalTestsViral,1.0,,,...,0,,0,,0,0,0,0,totalTestsViral,


In [24]:
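# select the unprefixed version of every original column (redone in the next cell, which keeps o_hash)
converted_covid_df = covid_df[list(
    map(lambda x: x.replace('o_', ''), 
        list(covid_df[covid_df.columns[pd.Series(covid_df.columns).str.startswith('o_')]].columns)))]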

In [25]:
# extract converted columns & unconverted hash column
converted_covid_df = covid_df[list(
    map(lambda x: x.replace('o_', '') if x != 'o_hash' else x, 
        list(covid_df[covid_df.columns[pd.Series(covid_df.columns).str.startswith('o_')]].columns)))]
         
# show summary statistics
converted_covid_df.describe(include = 'all')

Unnamed: 0,date,state,positive,probableCases,negative,pending,totalTestResultsSource,totalTestResults,hospitalizedCurrently,hospitalizedCumulative,...,posNeg,deathIncrease,hospitalizedIncrease,o_hash,commercialScore,negativeRegularScore,negativeScore,positiveScore,score,grade
count,12933,12933,12823.0,3959.0,12685.0,1449.0,12933,12903.0,10023.0,7442.0,...,12933.0,12933.0,12933.0,12933,12933.0,12933.0,12933.0,12933.0,12933.0,0.0
unique,273,56,,,,,4,,,,...,,,,12933,,,,,,
top,2020-09-01 00:00:00,WA,,,,,posNeg,,,,...,,,,3a7925b9487d0193b24876c142d4441b4dfe0178,,,,,,
freq,56,273,,,,,7559,,,,...,,,,1,,,,,,
first,2020-01-22 00:00:00,,,,,,,,,,...,,,,,,,,,,
last,2020-10-20 00:00:00,,,,,,,,,,...,,,,,,,,,,
mean,,,59252.118927,3300.621116,688806.7,1309.36784,,770068.8,850.216801,7246.263639,...,734346.5,16.444599,33.941158,,0.0,0.0,0.0,0.0,0.0,
std,,,114199.84294,4389.12398,1458547.0,5221.300266,,1567902.0,1593.418321,14730.659311,...,1549145.0,45.024185,224.593734,,0.0,0.0,0.0,0.0,0.0,
min,,,0.0,0.0,0.0,0.0,,0.0,0.0,0.0,...,0.0,-213.0,-4124.0,,0.0,0.0,0.0,0.0,0.0,
25%,,,1846.5,468.0,34494.0,21.0,,33378.0,106.0,581.25,...,31985.0,0.0,0.0,,0.0,0.0,0.0,0.0,0.0,


#### Check duplicates

There are no perfectly duplicated rows. 

In [26]:
sum(converted_covid_df.duplicated())

0

There are no duplications for the primary keys: as expected, there appears to be one row for every day, for every state (whether identified via state name or fips code).

In [27]:
# check duplication w/ date and state name
converted_covid_df[converted_covid_df.duplicated(subset=['date','state'], keep=False)]

Unnamed: 0,date,state,positive,probableCases,negative,pending,totalTestResultsSource,totalTestResults,hospitalizedCurrently,hospitalizedCumulative,...,posNeg,deathIncrease,hospitalizedIncrease,o_hash,commercialScore,negativeRegularScore,negativeScore,positiveScore,score,grade


In [28]:
# check duplication w/ date and state fips code
converted_covid_df[converted_covid_df.duplicated(subset=['date','fips'], keep=False)]

Unnamed: 0,date,state,positive,probableCases,negative,pending,totalTestResultsSource,totalTestResults,hospitalizedCurrently,hospitalizedCumulative,...,posNeg,deathIncrease,hospitalizedIncrease,o_hash,commercialScore,negativeRegularScore,negativeScore,positiveScore,score,grade
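The empty results above mean that no (date, state) or (date, fips) pair occurs more than once. To show what a violation would look like, here is a toy sketch with a deliberately duplicated key; the data are made up for illustration.

```python
import pandas as pd

# Toy data in which the key (2020-10-20, AK) appears twice.
toy = pd.DataFrame({
    "date": ["2020-10-20", "2020-10-20", "2020-10-20"],
    "state": ["AK", "AL", "AK"],
})

# keep=False flags every row that shares its key with another row.
dupes = toy[toy.duplicated(subset=["date", "state"], keep=False)]

print(len(dupes))
print(list(dupes.index))
```

With `keep=False`, both copies of a duplicated key are returned, which makes it easier to inspect how the conflicting rows differ.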


#### Recount records

In [29]:
# count rows 
covid_df.shape[0] == converted_covid_df.shape[0]

True

### Pickle data 

In [30]:
# write the converted dataframe to disk; the with-block closes the file handle
with open("../Data_pkl/covid19/covidtrackingproject_df.pkl", "wb") as f:
    pickle.dump(converted_covid_df, f)
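A practical benefit of pickling is that the converted datatypes survive the round trip, which would not be the case with a plain CSV export. The sketch below checks this on a toy dataframe written to a temporary directory; the file name `example_df.pkl` is hypothetical and the values are illustrative.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe using the same kinds of dtypes as the converted columns.
df = pd.DataFrame({
    "state": pd.Series(["AK", "AL"], dtype="category"),
    "positive": pd.Series([12432, None], dtype="Int64"),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_df.pkl")  # hypothetical file name
    df.to_pickle(path)
    restored = pd.read_pickle(path)

# The category and nullable-integer dtypes are preserved.
print(restored.dtypes)
```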
