County Data - Allow community creation, editing, and verification of this data for #558

Open
ciscorucinski opened this issue Mar 12, 2020 · 81 comments

@ciscorucinski ciscorucinski commented Mar 12, 2020

📌 Ongoing Information 📌

Website: Corona Data Scraper
Download data and view sources

GitHub: Corona Data Scraper
Help write scraping rules. See Readme

Google Doc: COVID-19 Community Data Collection
Public + comment access: Comment information and sources
Help us acquire valid, official data sources on all levels: County, State, Country

Slack: COVID Atlas
First join, then go to the COVID Atlas Slack

Background

It is clear that the team at @CSSEGISandData cannot accommodate and scale with the huge influx of new cases within the US. Therefore, it is perfectly reasonable that they abandoned the county-level reporting of cases. I think when people look at the decision with an unbiased and open mind, they will see that this was the right balance to be as helpful as possible. With that said, it is sad to see the county-level information abandoned completely. It is very helpful!

I remember seeing a +3 increase in Wisconsin and was wondering exactly where those cases were located, and I had to search for and read a few articles to verify. But this chart could have provided that detail to me very fast!

But again, the current processes cannot scale to the number of new cases. So we have to change the processes if we want to bring this back, and the sooner the better.

Suggestion

So, I suggest some new ability to let the community, who care deeply about this information, help @CSSEGISandData get information that is as accurate as possible. You know what information needs to be recorded for each new case, and that baton can be passed on to us to find, report, and verify (with verification probably being the biggest aspect of this effort).

I have seen a lot of people report new data as Issues and this new tool would be the preferred method to report those cases. Maybe they would have to provide an article link. A number of people could verify that information along with location information. This verification could go through multiple steps if needed, but @CSSEGISandData would have the final say in including the data after their own review and verification process.

Ideas

The following are some ideas of how the processes could work.

  • Stack Overflow's Triage Queue: Quickly move information to other needed areas
  • Stack Overflow's Review Queue: Review and verification of information
  • Allow people to point out incorrect data and add it to the queues
  • Multiple people should review and verify each datapoint.
  • etc...
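
As a rough illustration of the review-queue idea (a sketch only, not an existing tool; the function names, the two-reviewer threshold, and the example datapoint are all made up for this example), each submitted datapoint could stay pending until enough distinct people have verified it:

```javascript
// Hypothetical review queue. Every name here is illustrative.
const REQUIRED_VERIFICATIONS = 2; // assumed threshold, not a project rule

function createQueue() {
  return { items: new Map(), nextId: 1 };
}

function getItem(queue, id) {
  const item = queue.items.get(id);
  if (!item) throw new Error(`Unknown datapoint ${id}`);
  return item;
}

// Submitting requires a source link so reviewers have something to check.
function submit(queue, datapoint) {
  if (!datapoint.sourceUrl) throw new Error('A source URL is required');
  const id = queue.nextId++;
  queue.items.set(id, { datapoint, verifiedBy: new Set(), status: 'pending' });
  return id;
}

// Each reviewer counts once; the Set ignores repeat votes.
function verify(queue, id, reviewer) {
  const item = getItem(queue, id);
  item.verifiedBy.add(reviewer);
  if (item.verifiedBy.size >= REQUIRED_VERIFICATIONS) item.status = 'verified';
  return item.status;
}

// Pointing out incorrect data sends the datapoint back for a fresh review.
function flag(queue, id) {
  const item = getItem(queue, id);
  item.verifiedBy.clear();
  item.status = 'pending';
  return item.status;
}
```

The final say would still sit with @CSSEGISandData: a "verified" status here only means the datapoint is ready for their own review.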

Other Benefits

An extremely positive benefit of this approach is that other countries could start providing their own more-localized data, and @CSSEGISandData could entrust a "country-representative" (CDC, or a respected university in said country) to do the review and verification of that country's more localized data.

@ciscorucinski ciscorucinski commented Mar 12, 2020

The key to this is to get this up as soon as possible, and the planning phase is going to take a while. So, community members (especially those with the needed background), how would you create a system that could handle these needs?

@CSSEGISandData would you support something like this? Would you use it? Could you help the community develop this? Help through the planning stage would be the best as we know you have a lot to do.

Let's not flood this with comments. Show support via emoji...

❤️ - Can help develop this project
👀 - Can help retrieve, enter, and verify data
🚀 - Can help retrieve, enter, and verify data in a different country

@zdavatz zdavatz commented Mar 12, 2020

I am also interested in doing the same for the county levels (Kantone) in Switzerland. I am willing to help. Zürich already put up its county data on https://opendata.swiss/dataset/covid_19-fallzahlen-kanton-zuerich

@ciscorucinski ciscorucinski commented Mar 12, 2020

...without overloading the curators of the original repo.

@ecam85 Yes, I have made all of my suggestions with this idea in mind, and I've mentioned several times their capacity and the work they already have to do. That is clear.

Could this be done via forks (and maybe pull requests)?

I think this might have to be a new tool. Using GitHub workflows for this verification process doesn't seem to be the right tool for the job, I think. This tool would help @CSSEGISandData get all the needed data in a way that they can feel confident in the accuracy of the data without having to put in a lot of time. So it's kind of an "entrance" into entering data in the CSV files.

One thing that I thought of when thinking of this idea was Stack Overflow's triage and review queues (as an idea to build off of).

@ciscorucinski ciscorucinski commented Mar 12, 2020

how would you create a system that could handle these needs?

@ecam85 in my 1st comment, I posted this. I have some ideas but that is it. There are people way smarter than me, so I am asking the community that would participate in this effort. If we can't get this going, then the overall effort probably would fail anyway.

Let's not flood this with minutiae. This is the planning stage. Please re-read if needed and let's get some ideas going...

@zdavatz zdavatz commented Mar 12, 2020

Ok, thanks to https://github.com/daenuprobst we now have all the Swiss data in a CSV file. He is grabbing the data via the BAG Twitter feed and then publishing it on GitHub: https://github.com/daenuprobst/covid19-cases-switzerland

@nognkantoor nognkantoor commented Mar 13, 2020

Where/How do you obtain the data at the county level? Especially since there are several instances of the same individual being claimed by two or more county health departments and recorded as being in their county.

@saralioness saralioness commented Mar 13, 2020

@ciscorucinski If we are triaging and verifying before commits anyway, I would suggest getting a group together in a shared Google Drive/Sheet. It works extremely well even for large enterprises to collaborate and it would be the quickest to get up and running. I agree with having ambassadors who are assigned to specific regions to focus on case monitoring instead of trying to compile data for the entire world. I'm in the DC area, so I can take the US East Coast if we go that route. I do think that having the data at the county level is extremely valuable and am willing to pitch in on this effort.

@zdavatz zdavatz commented Mar 13, 2020

What is also really important is the data you collect: Age, Gender, date/place of first contact.

@longsyntax longsyntax commented Mar 13, 2020

I agree. The additional insight provided by the county-level statistics is invaluable, especially for the folks highly vulnerable to COVID-19. I'm based out of the tri-state area but I'm willing to take up curation of this data for any of the US states.

The map on the CDC website has hyperlinks to each state's Department of Health website, which usually houses these county-level stats (some states like CA require you to visit the individual county's website for the stats):
https://www.cdc.gov/coronavirus/2019-ncov/cases-in-us.html#reporting-cases

I'm sure once a few of us get this up and running, more people will reach out to collaborate and share responsibility.

@becare-rocket becare-rocket commented Mar 13, 2020

With JHU's permission, we can fork the database and then maintain it as a community. It's kind of a waste of time. I suggest contacting JHU to make a donation and asking them to use the donation to maintain whatever level of detail you need. If each user gave, say, $50, and there appear to be at least 300 users, that is $15K. They probably need about 3x that. I think they need 2 to 3 full-time people until the pandemic peaks, spread out over time zones to the extent it is practical (i.e. a 4 AM to noon shift, a noon to 8 PM shift, and an 8 PM to midnight shift). Or, another government or non-profit body can agree to collaborate and provide someone in their time zone to maintain the data. It's around a 3-person job per day full time, including the weekends.

@becare-rocket becare-rocket commented Mar 13, 2020

Just adding to this, Wikipedia maintains essentially the same data using community maintenance. It is up to date and pretty decent, but it is not in a time series format. If they can do it, it can be done by a GitHub fork.

@lazd lazd commented Mar 13, 2020

@ciscorucinski I think the answer lies in scraping official sources, rather than fielding reports from news articles. What do you think of the following?

  1. Let's begin by compiling a list of sources: a CSV or a Google Doc with each county in a given state (or the state itself, if it has a webpage with all counties listed) and a source webpage.

  2. After that, we'll have to write and maintain scraper rules to pull the data from each of the websites. That's happening here https://github.com/lazd/coronadatascraper

  3. Finally, we can combine these into a single repository that pulls this data on a regular schedule, reports errors to the right person if it fails, and publishes the data when complete. That's also happening here, but it's not automatic yet https://github.com/lazd/coronadatascraper

On the pages I've checked so far, it seems only positive cases and deaths are reported, so we won't be able to get recovered (or, consequently, active) counts...

I'm going to start on this by trying to get sources for California together.

Edit: I've gathered all the resources I was using. Here's what I've got so far: https://github.com/lazd/coronavirus-data-sources .
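
To make the three steps above concrete, here is a minimal sketch. It is not code from coronadatascraper: the source entry, its `scrape` rule, and `runScrapers` are hypothetical, and downloading the pages is left out (the `bodies` argument stands in for already-downloaded HTML keyed by URL).

```javascript
// Step 1: a list of sources. Each entry pairs a location with a source page.
// The entry below is made up for illustration.
const sources = [
  {
    county: 'Example County',
    state: 'CA',
    country: 'USA',
    url: 'https://example.org/coronavirus',
    // Step 2: a site-specific rule. This one assumes the page contains
    // text like "Confirmed cases: 21".
    scrape(body) {
      const match = /confirmed cases:\s*(\d+)/i.exec(body);
      if (!match) throw new Error('case count not found');
      return { cases: Number(match[1]) };
    },
  },
];

// Step 3: run every rule, collecting results and per-source errors so that
// one changed website does not stop the whole run.
function runScrapers(sourceList, bodies) {
  const results = [];
  const errors = [];
  for (const { scrape, ...location } of sourceList) {
    try {
      const body = bodies[location.url];
      if (body === undefined) throw new Error('page was not downloaded');
      results.push({ ...scrape(body), ...location });
    } catch (err) {
      errors.push({ url: location.url, message: err.message });
    }
  }
  return { results, errors };
}
```

The `errors` list is what would be reported to whoever maintains the failing rule, while `results` is what would be published.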

@ciscorucinski ciscorucinski commented Mar 13, 2020

@lazd Doing a county-by-county scraping effort is going to be a lot, and there is no guarantee that the layout and format will stay the same for each county. Also, we can't get the county website until they get their 1st case.

I was looking at Wisconsin's Department of Health Services and they provide a list of news releases. Does the state of California have a similar list in one place? That seems more reasonable for sources if that is available, no?

Edit: Forgot to add link...
https://www.dhs.wisconsin.gov/outbreaks/index.htm

@ciscorucinski ciscorucinski commented Mar 13, 2020

@ciscorucinski If we are triaging and verifying before commits anyway, I would suggest getting a group together in a shared Google Drive/Sheet. It works extremely well even for large enterprises to collaborate and it would be the quickest to get up and running.

I agree. A Google Sheet could be created to handle this. With specific roles and data protection in place, it could be opened to many. But what's a good format/layout for the Sheet?

@zdavatz zdavatz commented Mar 13, 2020

@longsyntax longsyntax commented Mar 13, 2020

I went through and identified source URLs for a few states whose data is all in one place for all their counties. Unfortunately I don't have the scraping expertise - but I'm more than happy to help with anything else I can do.

With regard to identifying sources, let's figure out how best to parcel this out so we aren't duplicating efforts.

@lazd lazd commented Mar 13, 2020

Doing a county-by-county scraping effort is going to be a lot, and there is no guarantee that the layout and format will stay the same for each county. Also, we can't get the county website until they get their 1st case.

@ciscorucinski all true. I started work on a scraper that basically has a custom function that gets run against the body of the website: https://github.com/lazd/coronadatascraper/blob/master/scrapers.js

I've written a few scraper functions already, and it produces something like this:

[
  {
    cases: 21,
    deaths: 0,
    county: 'San Francisco County',
    state: 'CA',
    country: 'USA',
    url: 'https://www.sfdph.org/dph/alerts/coronavirus.asp'
  },
  {
    cases: 20,
    deaths: 0,
    county: 'San Mateo County',
    state: 'CA',
    country: 'USA',
    url: 'https://www.smchealth.org/coronavirus'
  },
  {
    cases: 3,
    county: 'Sonoma County',
    state: 'CA',
    country: 'USA',
    url: 'https://socoemergency.org/emergency/novel-coronavirus/novel-coronavirus-in-sonoma-county/'
  },
  {
    cases: 7,
    county: 'Santa Cruz County',
    state: 'CA',
    country: 'USA',
    url: 'http://www.santacruzhealth.org/HSAHome/HSADivisions/PublicHealth/CommunicableDiseaseControl/Coronavirus.aspx'
  }
]

Like you said, it will not be consistent; it'll have to be done on a case-by-case basis, and maintained if the county changes their website. This may not be sustainable, but it's the best shot we have.

I am going to add a headless Chrome browser for sites that need JavaScript, and will work out a way to capture the states from @longsyntax's list that have aggregate data.
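
Since the records will not be consistent (the Sonoma County entry above has no deaths field), a small validation pass could catch malformed scraper output before it is published. This is only a sketch under assumptions: `validateRecord` and the list of required fields are hypothetical and not part of scrapers.js.

```javascript
// Fields assumed to be mandatory, based on the sample output above.
const REQUIRED_FIELDS = ['cases', 'county', 'state', 'country', 'url'];

// Returns a list of problems; an empty list means the record looks usable.
function validateRecord(record) {
  const problems = [];
  for (const field of REQUIRED_FIELDS) {
    if (record[field] === undefined) problems.push(`missing ${field}`);
  }
  for (const field of ['cases', 'deaths']) {
    const value = record[field];
    // deaths is optional: a missing value means "not reported", not zero.
    if (value !== undefined && (!Number.isInteger(value) || value < 0)) {
      problems.push(`${field} must be a non-negative integer`);
    }
  }
  return problems;
}
```

A scraper that picks up the count as text, for example `cases: '7'`, would be flagged here instead of slipping into the published data.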

@ciscorucinski ciscorucinski commented Mar 13, 2020

@longsyntax it seems like state data for counties falls into 3 categories: available in a single webpage, available through links within a single webpage, and not aggregated.

My Wisconsin link above would be the 2nd case, where extra effort would need to be made.

@DavidGeeraerts DavidGeeraerts commented Mar 13, 2020

Everyone needs to bug their State Health Departments to use standard (best practice) HTML tags, specifically the TABLE tag, so that we don't all have to come up with one-off scrapers for all these sites.
I'm bugging WDOH.

@DavidGeeraerts DavidGeeraerts commented Mar 13, 2020

@ciscorucinski Seems a Slack instance would be super helpful if there's a coordinated effort to get County level data.

@lazd lazd commented Mar 13, 2020

@DavidGeeraerts yes, let's get one up and running... Or maybe Discord, since it'll keep our chat history (unless someone at Slack wants to give us a free instance?)

@ciscorucinski ciscorucinski commented Mar 13, 2020

It's late by me. Here is a Google Sheet that can be expanded on. It's publicly editable for now.

https://docs.google.com/spreadsheets/d/1T2cSvWvUvurnOuNFj2AMPGLpuR2yVs3-jdd_urfWU4c/edit?usp=sharing

@DavidGeeraerts DavidGeeraerts commented Mar 13, 2020

Slack instance has been created, see ticket 658

@lazd lazd commented Mar 13, 2020

@ciscorucinski nice work. I think this will be much easier than trying to work in Git, especially with people making contributions from all over.

I created one for county sources and added the data I have so far: https://docs.google.com/spreadsheets/d/1T2cSvWvUvurnOuNFj2AMPGLpuR2yVs3-jdd_urfWU4c/edit#gid=1477768381

@ciscorucinski ciscorucinski commented Mar 13, 2020

Feel free to modify it as you see fit. This was just a quick setup with the data above.

Right now, the doc is freely open and anyone can edit. Should I put some restrictions in place, and add via email?

@lazd lazd commented Mar 13, 2020

@ciscorucinski I think it's fine to be open for now. I've run out of data sources to scrape and need more web resources for counties across America. I currently have data scraped for 51 counties (see coronadatascraper).

@ciscorucinski ciscorucinski commented Mar 13, 2020

@CSSEGISandData Can you pin this issue?

You are able to PIN 3 important issues in the Issues tab. This community-driven effort might be a good candidate for pinning. I say this because it is already buried several pages into the results (page 4). So new people will have a hard time finding it.

CC @saralioness

https://help.github.com/en/github/managing-your-work-on-github/pinning-an-issue-to-your-repository

@adanecito adanecito commented Mar 16, 2020

Thanks for the prompt response. I am not using the time series one. The one I use, for example, does not have the sum or source URL. I can always change the file parsing code to accommodate those differences. Usually the file is named month-day-year, or mm-dd-yyyy. So I can adapt quickly, but I am not sure about others. Also, the more data you add, the bigger the file or the map will get, which means more load and higher response times as more people hit that map.
Many Thanks,
-Tony
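
For the file naming mentioned above, a sketch of turning an mm-dd-yyyy file name into an ISO date could look like this. The `.csv` extension and the `parseReportDate` name are assumptions made for the example.

```javascript
// Parses a daily report file name such as "03-16-2020.csv" into "2020-03-16".
// Returns null when the name does not match or the date is impossible.
function parseReportDate(filename) {
  const match = /^(\d{2})-(\d{2})-(\d{4})\.csv$/.exec(filename);
  if (!match) return null;
  const month = Number(match[1]);
  const day = Number(match[2]);
  const year = Number(match[3]);
  const date = new Date(Date.UTC(year, month - 1, day));
  // Date rolls impossible values over (02-30 becomes 03-01), so check that
  // the month and day survived the round trip.
  if (date.getUTCMonth() !== month - 1 || date.getUTCDate() !== day) return null;
  return date.toISOString().slice(0, 10);
}
```

With this, `parseReportDate('03-16-2020.csv')` gives `'2020-03-16'`, and a time series file name that does not follow the pattern gives `null`, so parsing code can skip it.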

@greg-minshall greg-minshall commented Mar 16, 2020

@lazd sorry, the scraping is probably not where i can be of help.

@lazd lazd commented Mar 16, 2020

Very true @adanecito. Please report any issues with the data or output format, missing columns, etc to https://github.com/lazd/coronadatascraper/issues

@adanecito adanecito commented Mar 16, 2020

Thanks Larry. To report issues I need to know the requirements. For example, what should the fields be for what is being attempted? From looking at the source I know what data should be there, but are extra fields expected? Is the requirement to match the daily reports?

@tomquisel tomquisel commented Mar 16, 2020

I love this project and plan to contribute! I notice that US county data is still fairly sparse. As a stop-gap, CSBS.org has been doing a great job of keeping a US county-level map up to date. You can find daily CSVs here.

@adanecito adanecito commented Mar 16, 2020

@tomquisel the report should not show counties where no infected people were reported; that is what you might be seeing. But as Larry said, file an issue, hopefully with proof for verification/validation. That could be something you do right now or when you are off work. If a county has people cured, it should report that even if it currently has no one affected. The same goes for deaths with no one infected. At least that is the way I see it.

@adanecito adanecito commented Mar 17, 2020

Ok, I got yesterday's data to work. Here is what I am up to.
[Image: latestvirustest]

I did learn some things about the data, mostly what I said earlier.
Go JHU!!!

@Jord-Holt Jord-Holt commented Mar 18, 2020

Hello! I would like to contribute to this effort. I can assist with writing scrapers if needed. I'm a little unsure on where to start here and how much has already been done. That said I have been following the situation in my own state (Kentucky) quite closely and would be happy to seek out information and potentially contribute a scraper if needed. Otherwise I am happy to take on tasks elsewhere. Just need a bit of guidance to get started. Thank you!

@lazd lazd commented Mar 18, 2020

Hey @DatJord please take a look at our lists of sources: https://blog.lazd.net/coronadatascraper/#sources

Join our Slack and coordinate with us, and check out the contributing section for information on getting started with a scraper, and a link to a doc of websites that need to be scraped: https://github.com/lazd/coronadatascraper/#contributing

@codewarrior2000 codewarrior2000 commented Mar 18, 2020

@longsyntax
Those are great links to state-level statistics. I do want to point out that https://www.health.ny.gov/diseases/communicable/coronavirus/ lumped all the data from Richmond, Manhattan, Kings, Queens, and Bronx counties into just "New York City".

@JKSenthil JKSenthil commented Mar 18, 2020

I would love to contribute as well! I have started my own project https://github.com/JKSenthil/covid19-spread-tracker to visualize COVID-19 cases county by county. I scrape this information from https://coronavirus.1point3acres.com/en, and used the Wayback Machine to get historical data.

@JKSenthil JKSenthil commented Mar 19, 2020

I actually developed an API for county data: https://github.com/JKSenthil/coronavirus-county-api. Feel free to use!

@PaulMansour PaulMansour commented Mar 19, 2020

Am I right in surmising that the CDC does not get individual death reports from county coroners as they occur? Could the CDC possibly be that unprepared for a pandemic? Or do they have the info and refuse to share it?

@zdavatz zdavatz commented Mar 19, 2020

Believe it or not, in Switzerland every single positive Covid19 case is being sent by Fax to the central government. They can't keep up with the Faxes coming in!

@PaulMansour PaulMansour commented Mar 19, 2020

I would have thought that if the CDC had one job at all, it would be to have a system for pandemic reporting where county coroners could just log on and report the death.

@tomquisel tomquisel commented Mar 19, 2020

I think this is close to what you're looking for with death reports. It includes a flu-related deaths column.

@PaulMansour PaulMansour commented Mar 19, 2020

Those are aggregate numbers, and don't have anything with respect to coronavirus. I think if what I wanted was out there, this entire issue would not have been raised, because it is precisely about the difficulties of collecting the data I want.

@croixchristenson croixchristenson commented Mar 20, 2020

Does anyone have Iowa or Minnesota data?

@Jord-Holt Jord-Holt commented Mar 20, 2020

Also on the hunt for Kentucky data. Thought I found an option but it ended up being a dead end.

@DavidGeeraerts DavidGeeraerts commented Mar 20, 2020

@croixchristenson I emailed Iowa and they are looking into it:

Good morning,

Thank you for this information and for your recommendation. I have forwarded it to our leadership working within the State Emergency Operations Center for their review and consideration.

Sincerely,
Kelsey Feller

@croixchristenson croixchristenson commented Mar 21, 2020

Question: for places where automated collection is challenging, is there a place or way to submit data for states at the county level manually? I've been doing this for MN and IA the past few days.

Thanks David! If you need historical IA data I can share it too.

Anyone else see the NY Times is doing county data now?
https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html

@burnout87 burnout87 commented Mar 21, 2020

I would also be extremely happy to contribute. I can offer my help and skills with coding and system design

@curran curran commented Mar 23, 2020

I'm curious, where is the county data coming from right now in the visualization here?
https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

Is there a way to access this data? Is it coming from @JKSenthil 's API? Thank you.

@adanecito adanecito commented Mar 23, 2020

I have been experimenting using the snapshot data. I am not sure when it gets updated, I am thinking sometime each day? I would think using an API is good but more dependencies and liability.

[Image: newyork]

@JKSenthil JKSenthil commented Mar 23, 2020

I'm curious, where is the county data coming from right now in the visualization here?
https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6

Is there a way to access this data? Is it coming from @JKSenthil 's API? Thank you.

@curran Skimming through their sources, it seems they are scraping the county data from https://coronavirus.1point3acres.com/en. You can use the https://github.com/ExpDev07/coronavirus-tracker-api API, which retrieves county data from CSBS, where they use their own data alongside 1point3acres's data to provide county data (e.g. https://coronavirus-tracker-api.herokuapp.com/v2/locations?source=csbs).

@vbisbest vbisbest commented Apr 5, 2020

Is there any county data that includes a timeline (historical data)?
