This repository has been archived by the owner on Mar 10, 2023. It is now read-only.

New US file #1527

Open
DavidViral opened this issue Mar 25, 2020 · 13 comments

Comments

@DavidViral

Where is it? You stopped supporting the database to offer a new "clean" database with more US data fields. Where is it? We have had zero state-level data since the weekend.

@jfhirsch

Second this problem. Please give an update.

@ei-JoanneT

I've been pulling state-level data from the daily reports and joining it with the time_series data since 3/23. It was a lot of work because daily_report and time_series use different formats, but it could be a solution if you need the state-level data urgently.

@DavidViral
Author

DavidViral commented Mar 25, 2020 via email

@jfhirsch

> I've been pulling state-level data from the daily reports and joining it with the time_series data since 3/23. It was a lot of work because daily_report and time_series use different formats, but it could be a solution if you need the state-level data urgently.

Thanks. I saw the daily_report (most likely) includes the same data, but it seems much more efficient overall if the updated time series are provided centrally (as they were a few days ago). One other question: It looks like the daily_report is at US county level - have you confirmed that the sum of the county data in the daily_report gives the same total as state-level data in the time_series?
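
One way to check the county-versus-state question is to sum the county rows per state and compare the result against a state-level reference. The following is a minimal sketch, not anyone's actual pipeline: the field names `Province_State`, `Country_Region`, and `Confirmed` are assumed from the 3/23 and later daily reports, the helper names are hypothetical, and the sample numbers are made up.

```python
from collections import defaultdict

def state_totals(rows, value_field="Confirmed"):
    """Sum county-level US rows into per-state totals.

    `rows` is an iterable of dicts, e.g. from csv.DictReader.
    The field names are assumptions about the daily-report layout.
    """
    totals = defaultdict(int)
    for row in rows:
        if row.get("Country_Region") != "US":
            continue  # only US rows are rolled up
        totals[row["Province_State"]] += int(row.get(value_field) or 0)
    return dict(totals)

def compare_totals(county_totals, reference_totals):
    """Return {state: (county_sum, reference)} for every state that differs."""
    return {
        state: (county_totals.get(state), reference_totals.get(state))
        for state in set(county_totals) | set(reference_totals)
        if county_totals.get(state) != reference_totals.get(state)
    }

# Made-up sample rows, for illustration only.
sample = [
    {"Province_State": "Kansas", "Country_Region": "US", "Confirmed": "3"},
    {"Province_State": "Kansas", "Country_Region": "US", "Confirmed": "4"},
    {"Province_State": "Hubei", "Country_Region": "China", "Confirmed": "10"},
]
totals = state_totals(sample)
mismatches = compare_totals(totals, {"Kansas": 8})
```

An empty `mismatches` dict would mean the county sums agree with the reference; any entry shows both numbers side by side.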

@DavidAWest

DavidAWest commented Mar 25, 2020

The daily report has inconsistent data in the Province/State field. Also, in the 3/23 and later datasets the field is now called Province_State. In some cases this field contains "City, ST"; in others it has the name of the state in long form. It will require a fair amount of cleanup before it can be used.

The datasets also have Country/Region problems where the names of countries were changed. Mainland China is also listed as China. Taiwan appears as "Taiwan", as "Taipei and environs", and in at least one other representation.

Exercise caution when using this data due to the many inconsistencies.
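
As a rough illustration of the cleanup described above, here is a minimal sketch that accepts either column name, maps the country aliases mentioned in this comment, and expands a "City, ST" value to the long state name. The lookup tables are deliberately incomplete, the function name is hypothetical, and the sample city name in the usage line is made up.

```python
# Illustrative lookup tables only; a real cleanup would need the full lists.
STATE_ABBREVIATIONS = {"KS": "Kansas", "CA": "California", "NY": "New York"}
COUNTRY_ALIASES = {"Mainland China": "China", "Taipei and environs": "Taiwan"}

def normalize_row(row):
    """Return (country, state) for one daily-report row given as a dict."""
    # The column was renamed in the 3/23 files, so accept either header.
    state = (row.get("Province_State") or row.get("Province/State") or "").strip()
    country = (row.get("Country_Region") or row.get("Country/Region") or "").strip()
    country = COUNTRY_ALIASES.get(country, country)
    # "City, ST" form: keep only the state, expanded to its long name.
    # Unknown abbreviations are left untouched rather than guessed.
    if "," in state:
        abbreviation = state.rsplit(",", 1)[1].strip()
        state = STATE_ABBREVIATIONS.get(abbreviation, state)
    return country, state

example = normalize_row({"Province/State": "Some City, KS", "Country/Region": "US"})
```

Leaving unrecognized values unchanged keeps the remaining inconsistencies visible instead of silently merging them.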

I've created a Jupyter notebook in Python 3.8 that reads all of these daily files and creates a single CSV. It does NOT do the data cleansing, yet. If you want it, it is attached.

Note - there was a small coding error that has been fixed. The error caused the 3/22 dataset to be read and appended 3 times but the 3/23 and 3/24 datasets were not read.
I also added a field to represent the date of the file. The "last update" date in the files is not consistent with the data files.
Read JHU Daily Files for COVID-19.zip
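
For readers who cannot open the attachment, the general idea can be sketched with the standard library alone. This is not the notebook's code: the function name and the `file_date` column are hypothetical, and the `MM-DD-YYYY.csv` naming of the daily files is an assumption. Iterating over the files actually found on disk means each file is read exactly once.

```python
import csv
from datetime import datetime
from pathlib import Path

def combine_daily_files(folder, output_path):
    """Append every daily CSV in `folder` into one CSV with a file_date column.

    Returns the number of data rows written.
    """
    rows, fieldnames = [], ["file_date"]
    for path in sorted(Path(folder).glob("*.csv")):
        try:
            # Assumed naming convention: MM-DD-YYYY.csv
            file_date = datetime.strptime(path.stem, "%m-%d-%Y").date().isoformat()
        except ValueError:
            continue  # skip anything that is not a daily report
        with path.open(newline="", encoding="utf-8-sig") as handle:
            for row in csv.DictReader(handle):
                row["file_date"] = file_date
                rows.append(row)
                # Headers changed over time, so collect the union of columns.
                for name in row:
                    if name is not None and name not in fieldnames:
                        fieldnames.append(name)
    rows.sort(key=lambda r: r["file_date"])
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=fieldnames, restval="", extrasaction="ignore"
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Columns missing from a given day's file are written as empty strings, so old and new header names end up side by side and still need the cleanup step.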

@ei-JoanneT

> Thanks. I saw the daily_report (most likely) includes the same data, but it seems much more efficient overall if the updated time series are provided centrally (as they were a few days ago). One other question: It looks like the daily_report is at US county level - have you confirmed that the sum of the county data in the daily_report gives the same total as state-level data in the time_series?

I totally agree about the efficiency issue. I had to do it because there was an urgent need, and I could not wait for a more reasonable data structure to make my day easier.

I have been using the time_series_global since it was updated over the weekend, and it does not contain any state-level data so I cannot really check.

Have they also updated the old time_series files? I thought they had already moved to the _global files, so I am gathering everything for the US (both state and county level) from the daily reports.

@jasonlally

Hello! Chief Data Officer of San Francisco here. We really need state- and county-level data so we can inform our response in San Francisco and commingle datasets with other local data.

We can build our own pipeline of daily reports to timeseries, but I want to assess if this is necessary or if the US file is imminent. Any signal about when this is coming would be helpful.

@DavidAWest

@jasonlally The encoding of counties in this data is inconsistent over time and I wouldn't expect it to be fixed any time soon.

I found another source that has California county-level data here: https://coronavirus.1point3acres.com/#map

You might consider using that as your source.

@jasonlally

Yeah, we are using that one at the moment. We'll continue to use it for now, but we are eagerly awaiting comprehensive datasets from Johns Hopkins.

Thanks for warning us away from trying to build a pipeline off the dailies.

@irenevanwoerden

Please let us know where to get US state (or county) information. I have been unable to find this.

@davidbau

For anybody doing USA by-state time series visualization in JavaScript, here is an example of doing this aggregation on a webpage. It automatically loads and sums as many individual days as needed, starting 3/23, so it can hold us over until the new time series appears.

https://github.com/davidbau/covid-19-chart/blob/6c78a748abe2bb940ff7cb07238d09810167ee17/index.html#L167-L300

Although the data structure is not commented in detail, this is what's used for the visualization in https://covid19chart.org/, so you can see the raw merged data structure by using the JS debugger on that site and inspecting the "csse" variable there. You can do time series rollups by executing a function call like this (to get a time series for Kansas, for example):

rollup_ts(csse['confirmed'], (s) => s.endsWith('KS') || s == 'Kansas', (c) => c == 'US')

Maybe somebody will find it helpful.
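
For those working in Python rather than in the browser, the same predicate-based rollup can be sketched as follows. This is only an analogue of the call above, not a port of the site's code: the `(date, country, state, count)` record layout is an assumption and not the structure of the `csse` variable, and the sample numbers are made up.

```python
from collections import defaultdict

def rollup_ts(records, state_matches, country_matches):
    """Sum counts per date over records whose state and country both match.

    `records` is an iterable of (date, country, state, count) tuples.
    This layout is assumed for illustration only.
    """
    series = defaultdict(int)
    for date, country, state, count in records:
        if country_matches(country) and state_matches(state):
            series[date] += count
    return dict(sorted(series.items()))

# Made-up records, for illustration only.
records = [
    ("2020-03-23", "US", "Some County, KS", 2),
    ("2020-03-23", "US", "Kansas", 5),
    ("2020-03-23", "US", "California", 9),
    ("2020-03-24", "US", "Kansas", 8),
]
kansas = rollup_ts(
    records,
    lambda s: s.endswith("KS") or s == "Kansas",
    lambda c: c == "US",
)
```

Passing predicates instead of fixed names lets one call cover both the "City, ST" and the long-form state encodings.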

@davidbau

Duplicate of #1534

@luvdata

luvdata commented Mar 25, 2020

davidbau, an outstanding site. Thanks.


8 participants