Adjustments to Time Series Data #590

Open
CSSEGISandData opened this issue Mar 12, 2020 · 23 comments

Comments

Owner

@CSSEGISandData CSSEGISandData commented Mar 12, 2020

We have removed the US county-level values from March 10 to the present to address double counting against the US state-level data.

@tmeacham tmeacham commented Mar 12, 2020

Is it safe to assume the county cases were rolled up to the state level prior to 3/10, so that the time series doesn't graph as if hundreds of cases suddenly appeared on 3/10 on trend lines all over the internet?

Update: Turns out, not safe :). Looks like the U.S. suddenly had 893 cases on 3/10.

image

@tmeacham tmeacham commented Mar 12, 2020

Deleting the values will help with the duplication. In a perfect world, the county data would have been rolled up to the new state-level standard for dates prior to 3/10, and the county-level rows would be deleted from the dataset entirely, as they are no longer tracked that way. At least moving forward the US data should reflect accurate trends.
Currently, however, a casual observer who doesn't know about the data collection errors in the dataset could be forgiven for thinking the US went from 0 to over 800 cases in one day.

image

@AndroidDev77 AndroidDev77 commented Mar 12, 2020

Why didn't you just stop populating the county values? That way it wouldn't double count as you start entering data at the state level, but you would still retain the outbreak region history.

@Itelina Itelina commented Mar 12, 2020

Are you guys planning to repopulate the county-level values? I found those to be super helpful; they really drive home what this means for our local communities.

@tmeacham tmeacham commented Mar 12, 2020

@Itelina
See #382. They are no longer tracking at the county level. As such, it is best to delete those rows entirely.

@Itelina Itelina commented Mar 12, 2020

@tmeacham TY!

@Jacoble1 Jacoble1 commented Mar 12, 2020

> @Itelina
> see #382 They are no longer tracking at the county level. As such it is best to delete those rows entirely.

In specific states such as NY, WA & CA, it would be wise to have county-level data labels & groupings in the more highly populated regions (such as NYC's 5 counties & the surrounding suburban clusters in NYS like Rockland County, Nassau County, etc.), if that's doable.
The fact that infection is increasing at a rate that makes it difficult to report at the county level in certain areas is the very reason those specific areas need county-based data groups.

@aatishb aatishb commented Mar 12, 2020

Thank you! This is very helpful.

@aatishb aatishb commented Mar 12, 2020

FWIW, for my own analysis, I'm going back and looking for US entries with a comma in 'Province/State', and replacing those values with the appropriate state, in order to backfill the empty state level data for dates prior to March 10.
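
The backfill described above could be sketched roughly as follows. This is a minimal illustration, not the commenter's actual code: the `ABBR_TO_STATE` table, the `backfill_states` function, the `(name, counts)` row layout, and the `cutoff` index are all assumptions made for the example, and it assumes the state-level rows are empty before the cutoff date.

```python
# Hypothetical sample table; a real one would cover every state and territory.
ABBR_TO_STATE = {"WA": "Washington", "CA": "California", "NY": "New York"}


def backfill_states(rows, cutoff):
    """Fill state-level series before `cutoff` with summed county counts.

    rows:   list of (province_state, counts) pairs for US entries, where
            counts is a list of cumulative confirmed cases, one per date.
    cutoff: index of the first date column with state-level reporting
            (the index is an assumption the caller must supply).
    """
    state_series = {}
    county_sums = {}
    for name, counts in rows:
        if "," in name:
            # Entries such as "King County, WA" carry the state after the comma.
            abbr = name.rsplit(",", 1)[1].strip()
            state = ABBR_TO_STATE.get(abbr)
            if state is None:
                continue  # suffix not in the table; leave for manual review
            sums = county_sums.setdefault(state, [0] * len(counts))
            for i, value in enumerate(counts):
                sums[i] += value
        else:
            state_series[name] = list(counts)

    for state, sums in county_sums.items():
        series = state_series.setdefault(state, [0] * len(sums))
        # Assumes state rows are empty before the cutoff, so overwriting is safe.
        series[:cutoff] = sums[:cutoff]
    return state_series
```

For example, with a Washington row that is zero before the cutoff and two county rows that are zero after it, the returned Washington series carries the summed county counts up to the cutoff and the state's own counts afterwards.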

@cscollett cscollett commented Mar 12, 2020

@aatishb I extract the state abbreviation with the regex [A-Z]{2}$ and then use a State Abbreviation/Name lookup table (https://raw.githubusercontent.com/aruljohn/us-states/master/states.csv) to match the state name. DC needs an exception.

I just completed my multi-metro-area Jupyter notebook, which relies on the county data. I'm really bummed it's going away.
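
A rough Python sketch of the regex-plus-lookup approach described above. The `ABBR_TO_STATE` subset and the `state_from_field` helper are hypothetical names for illustration; in practice the table could be loaded from the CSV linked above, whose column layout is not assumed here. The DC special case assumes the entry is written as "Washington, D.C.", which the pattern [A-Z]{2}$ would not match because of the periods.

```python
import re

# Hypothetical subset of an abbreviation-to-name lookup table.
ABBR_TO_STATE = {
    "NC": "North Carolina",
    "WA": "Washington",
    "DC": "District of Columbia",
}

# Two capital letters at the very end of the field, e.g. "King County, WA".
STATE_SUFFIX = re.compile(r"[A-Z]{2}$")


def state_from_field(value):
    """Return the full state name for a Province/State value when it can be derived."""
    value = value.strip()
    if value.endswith("D.C."):
        # Assumed exception: the dotted form is not caught by the regex.
        return ABBR_TO_STATE["DC"]
    match = STATE_SUFFIX.search(value)
    if match:
        # Fall back to the original value if the abbreviation is unknown.
        return ABBR_TO_STATE.get(match.group(), value)
    return value
```

Values that are already a spelled-out state name pass through unchanged, so the helper can be applied to every US row.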

@kamermans kamermans commented Mar 13, 2020

If someone is looking for a way to handle this in Golang, here's how I'm doing it:
https://gist.github.com/kamermans/397488317c75b23414100d7e1316e96f

@grandave99 grandave99 commented Mar 14, 2020

It seems that the Confirmed count for Italy did not change (12462 Confirmed) between 11 March and 12 March, but the data released by WHO (https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports) show a change.

@bbdundar bbdundar commented Mar 14, 2020

Deleting the county level values in a state like California where these figures are reported at the county level doesn't make sense.

How were those values collected before? Manually?

@kendonB kendonB commented Mar 14, 2020

Is there an alternative source anyone knows about for finer locations of cases in the USA?

@kendonB kendonB commented Mar 14, 2020

@CSSEGISandData are there any other examples of these sorts of changes in other countries? Specifically, are there any other countries for which the spatial unit of the data changes at a point in time?

@dawenx dawenx commented Mar 14, 2020

@kendonB We are committed to providing county-level statistics for the US. Check my profile or #7.

@PCastleton PCastleton commented Mar 14, 2020

It would be better to drop the state-level data and allow users to roll up county-level data to states or the country. Now we're just losing granularity of spread.

@lesham lesham commented Mar 15, 2020

Are the 0's due to different aggregation, or is it possible that (some) geographical entities are reporting only new cases on the day, and not "total cases to date"? There seems to be some confusion in various parts of the dataset.
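
One way to probe this question is to rely on the fact that a "total cases to date" series can never decrease. The sketch below flags every position where a row's count drops; the `find_drops` name and the list-of-integers input are assumptions for illustration. A drop does not say which explanation is right, only that the row is not a clean cumulative series.

```python
def find_drops(counts):
    """Return the indices where a supposedly cumulative series decreases."""
    return [i for i in range(1, len(counts)) if counts[i] < counts[i - 1]]
```

A row that goes from a nonzero count to 0 is flagged at the index of the first 0, while a genuine cumulative series returns an empty list.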

@PaulIPS PaulIPS commented Mar 15, 2020

It would be great to see the county data fixed for the US. Statewide data doesn't really show the local impact, especially in large states.

Has anyone found another data set with correct county data?

@PySimpleGUI PySimpleGUI commented Mar 15, 2020

CHOOSE 1 way to represent the USA data please.

Question - is the intention to list every county of every state if they have a case?

If so, the table will get large, since there are 3,000+ counties in the USA.

If not, what are the rules for listing a county versus showing it under a state total?

I'm struggling to parse the State field because, well, it's not just the state. Sometimes it's the state (spelled out), sometimes a county and a state abbreviation, and sometimes a city & state. It seems to be a "free form" field where anything goes.

If county-level reporting is to be included, shouldn't it be in another column, or at least be formatted following a rigid rule so it can be parsed?

Why can't the state abbreviation always be used? Example - getting the total for "North Carolina" means looking for both "North Carolina" and "NC" as both are used.
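
As a stopgap on the consumer side, the free-form field can be split into a separate locality and a normalized state name. The sketch below is only an illustration: `ABBR_TO_STATE`, `split_location`, `total_for_state`, and the `(field, count)` row layout are all hypothetical, and the total assumes the matching rows do not overlap for the same date (otherwise it double counts).

```python
# Hypothetical subset of an abbreviation-to-name lookup table.
ABBR_TO_STATE = {"NC": "North Carolina", "WA": "Washington"}


def split_location(field):
    """Split a free-form value into (locality, state).

    locality is None when the field holds only a state name.
    """
    if "," in field:
        locality, suffix = (part.strip() for part in field.rsplit(",", 1))
        # Unknown suffixes are kept as-is rather than guessed.
        return locality, ABBR_TO_STATE.get(suffix, suffix)
    return None, field.strip()


def total_for_state(rows, state):
    """Sum counts for every row whose normalized state matches `state`."""
    return sum(count for field, count in rows if split_location(field)[1] == state)
```

With this, "North Carolina" and "Wake County, NC" both resolve to the same state name, so a single lookup covers both spellings.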

This data is going to get more complex. If it's this difficult to parse already, it will be unusable when the data grows 100-fold.

I'm using the data to create grids of graphs for easy comparison. Selecting which countries / states is difficult if the data is not consistently formatted.

image

@tomquisel tomquisel commented Mar 16, 2020

I found another source for US county-level data and started tracking it historically at the covid19-data repo. I hope it's a good substitute.

@Jacoble1 Jacoble1 commented Mar 17, 2020

> I found another source for US county-level data and started tracking it historically at the covid19-data repo. I hope it's a good substitute.

It's definitely helpful! Except for one thing... New York City is broken into 5 boroughs, each of which is its own county:
Manhattan - New York County
Brooklyn - Kings County
Queens - Queens County
Staten Island - Richmond County
Bronx - Bronx County

An increased number in Staten Island is significant in trends showing a correlation to Brooklyn or NJ (migration patterns) in the same manner that Bronx County versus New York County could determine if the trend of infection is truly localized or attributed to commuters. If that makes any sense from a non-expert like myself....

@MaryELennon MaryELennon commented Mar 23, 2020

Hey All! It looks like the last comment in here is from 6 days ago. In this window of time, the dashboard went from showing cases only at the state level back to showing county-level data. Why is this? Is this county-level information reliable and complete? I am trying to decide if this is something I can use, and in the context of the above chain I am becoming quite confused. I will note that the related Tableau dash website, with its data.world hub, is still showing state-level and not county-level information.
