INCORRECT DATA FOR New York Fatalities #2257

Open
lmirny opened this issue Apr 17, 2020 · 30 comments

@lmirny lmirny commented Apr 17, 2020

The correct number of fatalities today is 12192
https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Fatalities?%3Aembed=yes&%3Atoolbar=no&%3Atabs=n

@bvlaicu bvlaicu commented Apr 17, 2020

The inflation is probably due to the new "Probable deaths" number added by NYC:
https://www1.nyc.gov/site/doh/covid/covid-19-data.page

@JChristensen JChristensen commented Apr 17, 2020

As anyone who understands measurement and data analysis knows, changing the definition midstream greatly reduces the value of the data for making informed decisions. We have all this fancy analysis, but it's of little value without good operational definitions for data collection that are held consistent. I certainly hope that Johns Hopkins is pushing with all their might to eliminate poor measurement decisions like this and maintain data quality.
See also #2247

@CalvinParis CalvinParis commented Apr 17, 2020

Amen brother.

@paolinic03 paolinic03 commented Apr 17, 2020

@JChristensen I was saying the same thing yesterday. I am still trying to figure out the rationale for now including "probable" cases in the death count, which drastically increases the total. I understand that, unfortunately, we need to account for everyone who has died, but do we have a strategy? Cuomo seemed to gloss over this when he spoke yesterday, and I wonder whether we have a solid plan at all.

@paolinic03 paolinic03 commented Apr 17, 2020

Now, given the new definition, I think we should stay consistent with the original definition but add another measure called "Possible deaths" for complete transparency and consistency.

  1. Confirmed deaths (tested positive and confirmed in hospital/ healthcare facilities)
  2. Possible deaths (never tested and died at home/ outside healthcare system)

Adding them together all of a sudden is a really poor decision.
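
As a rough illustration of keeping the two measures apart, here is a minimal Python sketch. The `DeathCount` class and its field names are hypothetical, and the numbers are toy values, not real data. The point is only that the combined figure is derived on demand rather than stored in place of the two parts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeathCount:
    """Deaths for one place and date, with the two definitions kept apart."""

    confirmed: int  # tested positive, confirmed in a healthcare facility
    possible: int   # never tested, died outside the healthcare system

    @property
    def combined(self) -> int:
        # Derived on demand, so the two parts are never lost.
        return self.confirmed + self.possible


# Toy numbers for illustration only, not real data.
day = DeathCount(confirmed=100, possible=40)
print(day.confirmed, day.possible, day.combined)
```

Anyone who wants the original definition reads `confirmed`; anyone who wants the broader one reads `combined`.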

@JChristensen JChristensen commented Apr 17, 2020

Obviously the "true number" can vary wildly depending on the definition. To me, the actual definition is less important than it is to hold it strictly constant. Given that, I can perhaps believe that the reported data has some fairly constant proportional relationship to the "true number". At this point it is at least as important to understand trends as it is to know the true numbers. Without strictly enforced measurement definitions, this is impossible. Categorizing data like this is a very difficult task under the best conditions so I know this is a situation that requires an extraordinary amount of rigor. Any change to the measurement definition and the noise can easily overwhelm the signal.

@paolinic03 paolinic03 commented Apr 17, 2020

New York has been tracking "possible cases" since March 11th and only now decided to add them to the total count, so they have already been categorizing confirmed vs. possible. The problem is that not every state defines and tracks deaths this way, so in a time series that includes state and county data, you can't just add totals under NY's definition to totals under the original definition still used by other states. We either need a standardized approach for the data set or the ability to categorize totals by definition. Otherwise, we are aggregating and making decisions on mixed definitions, which is obviously causing confusion in analysis and among media outlets.

@paulavery1951 paulavery1951 commented Apr 17, 2020

I had been wondering what happened to the US daily deaths (calculated from the JHU cumulative data) which jumped from 2494 on 4/15 to 4591 on 4/16. Drilling down, I discovered that the entire effect comes from the changed reporting from the NYC area, which went up by ~2500 (NY state also moves up by ~2500). Without this change, US daily deaths for 4/16 is ~2000.

Here are the last 10 days of US deaths calculated from the data I downloaded from the JHU site last night. The 4/16 daily count is much larger than the rest.

name,date,cumulative,daily
US,2020/04/07,12722,1939
US,2020/04/08,14695,1973
US,2020/04/09,16478,1783
US,2020/04/10,18586,2108
US,2020/04/11,20463,1877
US,2020/04/12,22020,1557
US,2020/04/13,23529,1509
US,2020/04/14,25832,2303
US,2020/04/15,28326,2494
US,2020/04/16,32917,4591 <=====
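
For anyone who wants to reproduce the daily column, here is a minimal Python sketch that differences the cumulative values from the table above. The spike rule (more than 1.5 times the median daily count) is an arbitrary threshold chosen for illustration, not something JHU uses.

```python
from statistics import median

# Cumulative US deaths, copied from the table above.
cumulative = {
    "2020/04/07": 12722,
    "2020/04/08": 14695,
    "2020/04/09": 16478,
    "2020/04/10": 18586,
    "2020/04/11": 20463,
    "2020/04/12": 22020,
    "2020/04/13": 23529,
    "2020/04/14": 25832,
    "2020/04/15": 28326,
    "2020/04/16": 32917,
}

# Daily deaths are the difference between consecutive cumulative values.
# The first date has no predecessor here, so it gets no daily value.
dates = list(cumulative)
daily = {
    today: cumulative[today] - cumulative[yesterday]
    for yesterday, today in zip(dates, dates[1:])
}

# Assumed rule: flag any day above 1.5x the median daily count.
threshold = 1.5 * median(daily.values())
spikes = [date for date, count in daily.items() if count > threshold]
print(spikes)
```

With these values only 2020/04/16 is flagged.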

@paolinic03 paolinic03 commented Apr 17, 2020

Yes, we know, read above...

@paulavery1951 paulavery1951 commented Apr 17, 2020

I did. I am commenting on its effect on the national death statistics.

@Schiffasaurus Schiffasaurus commented Apr 18, 2020

Is there any way to apply the deaths per day in NYC to the day they actually occurred? If all these "probable COVID deaths" occurred on April 16th, what about all the "probable deaths" from the prior dates? That would skyrocket the figures nationally and the "curve".

@onepalone onepalone commented Apr 18, 2020

I also agree with this suggestion: it is important to report all cases (possible and confirmed), since that gives a fairer picture of the actual mortality rate.
So if somebody is able to get in touch directly with JHU, this is one alternative that would work for everybody.
As @paolinic03 said:

  1. Confirmed deaths (tested positive and confirmed in hospital/ healthcare facilities)
  2. Possible deaths (never tested and died at home/ outside healthcare system)

@paolinic03 paolinic03 commented Apr 18, 2020

> Is there any way to apply the deaths per day in NYC to the day they actually occurred? If all these "probable COVID deaths" occurred on April 16th, what about all the "probable deaths" from the prior dates? That would skyrocket the figures nationally and the "curve".

That’s the thing: they didn’t all occur on April 16th. They added up all probable deaths from March 11th on and appended that amount to the confirmed deaths on April 16th as one big number.
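
If a breakdown of probable deaths by date of death were available, moving the lump back to the right days would be mechanical. Below is a minimal Python sketch of that idea. The function name `backdate_lump` is hypothetical and the counts are toy values, not real data. It assumes the whole lump was reported on a single date.

```python
def backdate_lump(daily, lump_date, by_date_of_death):
    """Return a copy of `daily` with a one-day lump moved to the dates
    on which the deaths occurred.

    daily:            date -> deaths as reported (lump included on lump_date)
    lump_date:        the date on which the whole backlog was reported
    by_date_of_death: date -> backlog deaths that occurred on that date
    """
    lump = sum(by_date_of_death.values())
    if daily.get(lump_date, 0) < lump:
        raise ValueError("lump is larger than the count reported on lump_date")
    adjusted = dict(daily)
    adjusted[lump_date] -= lump  # take the backlog out of the reporting date
    for date, count in by_date_of_death.items():
        adjusted[date] = adjusted.get(date, 0) + count  # put it where it occurred
    return adjusted


# Toy numbers for illustration only, not real data.
reported = {"2020-04-14": 10, "2020-04-15": 12, "2020-04-16": 50}
probable_by_day = {"2020-04-14": 8, "2020-04-15": 11, "2020-04-16": 16}

adjusted = backdate_lump(reported, "2020-04-16", probable_by_day)
print(adjusted)
```

The total over all dates is unchanged; only the shape of the curve changes.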

@rbracco rbracco commented Apr 18, 2020

This is a huge problem. Changing your methodology for part of your dataset when doing a time-series analysis is a very bad idea. It is exacerbated by the fact that there are hundreds of analysts downstream using this data to draw conclusions that will no longer be valid.

I have been trying to raise this issue with JHU with no success. Has anyone succeeded in reaching those working on the project?

@CalvinParis CalvinParis commented Apr 18, 2020

Looks like they are roughly double counting probable deaths, which appear to have always been included but not broken out. 8,448 confirmed + 4,264 probable = 12,712 vs. 17,131, an overcount of 4,419.

Cases: 126,368
Hospitalized*: 33,079
Confirmed deaths: 8,448
Probable deaths: 4,264
Updated: April 18, 2:00 p.m.

@gbigliardi gbigliardi commented Apr 19, 2020

This problem is blocking a lot of applications downstream. JHU, please resolve it: it is BLOCKING all downstream study of the New York situation.

@paolinic03 paolinic03 commented Apr 19, 2020

Makes you wonder...

@onepalone onepalone commented Apr 19, 2020

As was suggested before, why don't you guys stick with the NY Times data for the US? It is more accurate and doesn't 'yet' include probable deaths...

@rbracco rbracco commented Apr 19, 2020

I for one am going to switch to an alternate API, but it is essential that this is fixed as many downstream sources are unaware. See the following images showing this error propagating into top newspapers across the US. Note that the two things reported below never actually happened!

Wall Street Journal:
image

Washington Post:
image

@paolinic03 paolinic03 commented Apr 19, 2020

Yep, sure does look like it is “improving” now right? Artificially create an apex so it looks like we are on the down swing.

@paulavery1951 paulavery1951 commented Apr 19, 2020

Unfortunately, the NYT site is showing 0 deaths for 4/17 and 4/18. I'm guessing they must be investigating, but others here might know better.

On the other hand, the number of daily cases seems normal, i.e. somewhat spiky but nothing crazy.

@CalvinParis CalvinParis commented Apr 19, 2020

I'm pretty sure that they are double counting probable deaths, which appear to have always been included in the total, just not broken out. 8,811 confirmed + 4,429 probable = 13,240 vs. 17,671, an overcount of 4,431.

Cases: 129,788
Hospitalized*: 34,602
Confirmed deaths: 8,811
Probable deaths: 4,429
Updated: April 19, 1:30 p.m.
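
The arithmetic for the April 18 and April 19 snapshots can be checked with a few lines of Python. This is only a sketch: the `Snapshot` class is a hypothetical helper, and the inputs are the figures quoted in this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    label: str
    confirmed: int
    probable: int
    reported_total: int  # the larger total quoted in this thread

    @property
    def expected_total(self) -> int:
        return self.confirmed + self.probable

    @property
    def overcount(self) -> int:
        return self.reported_total - self.expected_total


snapshots = [
    Snapshot("April 18", confirmed=8_448, probable=4_264, reported_total=17_131),
    Snapshot("April 19", confirmed=8_811, probable=4_429, reported_total=17_671),
]

for s in snapshots:
    # A ratio near 1 means the overcount is about one extra copy of the
    # probable deaths, which is what double counting would produce.
    ratio = s.overcount / s.probable
    print(f"{s.label}: expected {s.expected_total}, "
          f"overcount {s.overcount}, overcount/probable {ratio:.2f}")
```

On both days the overcount is within about 4% of the probable count, which is consistent with the probable deaths being added twice, though it does not prove it.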

@JChristensen JChristensen commented Apr 19, 2020

> I for one am going to switch to an alternate API...

@rbracco if you're aware of a data source that tracks confirmed and probable deaths separately, would you share please?

@JChristensen JChristensen commented Apr 19, 2020

To be fair, the inclusion of "probable" deaths seems to be the result of a CDC recommendation. I for one have no idea what input JHU may have had on that, if any. I also don't know how data collection procedures are communicated, or whether the CDC recommendation included instructions to keep the statistics separate.

@rbracco rbracco commented Apr 19, 2020

@JChristensen I would recommend https://covidtracking.com/api but they don't list probable deaths. NYC lists the data separately as reported here (no API): https://www1.nyc.gov/assets/doh/downloads/pdf/imm/covid-19-deaths-confirmed-probable-daily-04192020.pdf
It appears there is no great solution yet, but I'm sure there will be within a week.

The primary issue as I see it isn't that they started including probable deaths, but that they appear to have done so erroneously. It looks like they labeled probable deaths from many past dates as having occurred on April 16th. This ruins any attempt at trend analysis. Also, the numbers have been completely inaccurate since then.

image

New deaths per day in NYC (the source relies on JH CSSE)
image

@cpyic cpyic commented Apr 19, 2020

Not sure whether anyone has noticed, but the time series file has many entries, across multiple states, marked as "unassigned" and "out of". The problem might be related to these entries, which belong to neither a general FIPS code nor a specific city. This could be a valid explanation if they have data from multiple sources, though I am not familiar with how those numbers are compiled together.

If we can safely assume that the total number of cases is the sum by state from the time series, then the summed deaths for NY state come to 17671 on 4/18.

For some reason, a death update was coded under FIPS=90036. If this interpretation is correct, then folks might benefit from using the time series file, since if they have captured all sources, they should all be there. In the time series the total seems to be correct if you add this "unassigned" FIPS=90036 entry to the NY state total. Hope this helps.

image
image
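
To make the point about "unassigned" rows concrete, here is a minimal Python sketch of summing a time series file by state with and without them. The column names (`FIPS`, `Admin2`, `Province_State`, and a date column such as `4/18/20`) are my assumption about the file layout. The rows are toy values, not real counts, and `state_totals` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Toy rows for illustration only; the counts are made up.
SAMPLE = """FIPS,Admin2,Province_State,4/18/20
36001,Albany,New York,10
36061,New York,New York,200
90036,Unassigned,New York,50
34003,Bergen,New Jersey,30
"""


def is_unassigned(row):
    # Assumed convention: such rows are labelled "Unassigned" or "Out of ...".
    admin2 = row["Admin2"]
    return admin2 == "Unassigned" or admin2.startswith("Out of")


def state_totals(csv_text, date_column, include_unassigned=True):
    """Sum one date column by state, optionally skipping unassigned rows."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not include_unassigned and is_unassigned(row):
            continue
        totals[row["Province_State"]] += int(row[date_column])
    return dict(totals)


with_unassigned = state_totals(SAMPLE, "4/18/20")
county_only = state_totals(SAMPLE, "4/18/20", include_unassigned=False)
print(with_unassigned["New York"], county_only["New York"])
```

Comparing the two totals for a state shows how much of its count sits in rows that are not tied to any county.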

@rbracco rbracco commented May 9, 2020

Good news, this particular error was fixed via the commit listed below. Thanks to all who helped track this down.

0fa65c7

@brad255 brad255 commented May 16, 2020

For 5/15, JHU reports 27,878 while NYS DOH reports 22,478, so JHU is higher by 5,400. Any views on this difference?

@cpyic cpyic commented May 16, 2020

Could it be "probable deaths" from NYC reported separately?
https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/ has reported 27k as well.

@brad255 brad255 commented May 17, 2020

The official DOH page doesn't include probable deaths, which would seem to be a possible explanation. https://covid19tracker.health.ny.gov/views/NYS-COVID19-Tracker/NYSDOHCOVID-19Tracker-Fatalities?%3Aembed=yes&%3Atabs=n&%3Atoolbar=no
