County Data - Allow community creation, editing, and verification of this data for #558
Comments
|
The key to this is to get this up as soon as possible, and the planning phase is going to take a while. So community members (especially those with the needed background) how would you create a system that could handle these needs? @CSSEGISandData would you support something like this? Would you use it? Could you help the community develop this? Help through the planning stage would be the best as we know you have a lot to do. Let's not flood this with comments. Show support via emoji...
|
|
I am also interested in doing the same for the county levels (Kantone) in Switzerland. I am willing to help. Zürich already put up its county data on https://opendata.swiss/dataset/covid_19-fallzahlen-kanton-zuerich |
@ecam85 Yes, I have made all of my suggestions with this idea in mind, and I've stated several times about their ability and work that needs to be done already. That is clear.
I think this might have to be a new tool. GitHub workflows don't seem to be the right tool for this verification process. This tool would help @CSSEGISandData get all the needed data in a way that lets them feel confident in its accuracy without having to put in a lot of time. So it's kind of an "entrance" into entering data in the CSV files. One thing I thought of when coming up with this idea was Stack Overflow's triage and review queues (as an idea to build off of). |
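To make the review-queue idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `CaseReport` fields, the `triage` helper, and the threshold of two independent confirmations are hypothetical and were not agreed on anywhere in this thread. It only shows the shape of "community verifies first, maintainers get the final say".

```python
from dataclasses import dataclass, field

# Assumed threshold: how many independent reviewers must confirm a report
# before it is handed to the maintainers for final review.
REQUIRED_CONFIRMATIONS = 2


@dataclass
class CaseReport:
    """A community-submitted report (hypothetical fields)."""
    county: str
    state: str
    cases: int
    source_url: str
    submitted_by: str
    confirmed_by: set = field(default_factory=set)

    def confirm(self, reviewer: str) -> None:
        # A submitter cannot verify their own report.
        if reviewer != self.submitted_by:
            self.confirmed_by.add(reviewer)

    @property
    def ready_for_maintainers(self) -> bool:
        return len(self.confirmed_by) >= REQUIRED_CONFIRMATIONS


def triage(reports):
    """Split reports into (still in community review, ready for final sign-off)."""
    pending = [r for r in reports if not r.ready_for_maintainers]
    ready = [r for r in reports if r.ready_for_maintainers]
    return pending, ready
```

Only the reports in the second list would ever reach the maintainers, which is the point: their time goes to a final check, not to first-pass verification.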
@ecam85 in my 1st comment, I posted this. I have some ideas but that is it. There are people way smarter than me, so I am asking the community that would participate in this effort. If we can't get this going, then the overall effort would probably fail anyway. Let's not flood this with minutiae. This is the planning stage. Please re-read if needed and let's get some ideas going.... |
|
Ok, thanks to https://github.com/daenuprobst we now have all the Swiss data in a CSV file. He is grabbing the data via the BAG Twitter feed and then publishes the data on GitHub: https://github.com/daenuprobst/covid19-cases-switzerland |
|
Where/how do you obtain the data at the county level? Especially since there are several instances of the same individual being claimed by two or more county health departments and recorded as being in their county. |
|
@ciscorucinski If we are triaging and verifying before commits anyway, I would suggest getting a group together in a shared google drive/sheet. It works extremely well even for large enterprises to collaborate and it would be the quickest to get up and running. I agree with having ambassadors that are assigned to specific regions to focus on case monitoring instead of trying to compile data for the entire world. I'm in the DC area, I can take the US East Coast if we go that route. I do think that having the data at the county level is extremely valuable and am willing to pitch in on this effort. |
|
What is also really important is the data you collect: Age, Gender, date/place of first contact. |
|
I agree. The additional insight provided by the county-level statistics is invaluable - especially for the folks highly vulnerable to COVID-19. I'm based out of the tri-state area but I'm willing to take up curation of this data for any of the US states. The map on the CDC website has hyperlinks to each state's Department of Health website, which usually houses these county-level stats (some states like CA require you to visit the individual county's website for the stats). I'm sure once a few of us get this up and running, more people will reach out to collaborate and share responsibility. |
|
We can, with JHU's permission, fork the database and then community-maintain it. It's kind of a waste of time. I suggest contacting JHU to make a donation and asking them to use the donation to maintain whatever level of detail you need. If each user gave, say, $50, and there appear to be at least 300 users, that is $15K. They probably need about 3x that. I think they need 2 to 3 full-time people until the pandemic peaks, spread out over time zones to the extent it is practical (i.e. a 4 AM to noon shift, a noon to 8 PM shift, and an 8 PM to midnight shift). Or, another government or non-profit body can agree to collaborate and provide someone in their time zone to maintain the data. It's around a 3 person job per day full time, including the weekends. |
|
Just adding to this, Wikipedia maintains essentially the same data using community maintenance. It is up to date and pretty decent, but it is not in a time series format. If they can do it, it can be done by a GitHub fork. |
|
@ciscorucinski I think the answer lies in scraping official sources, rather than fielding reports from news articles. What do you think of the following?
On the pages I've checked so far, it seems only positive and deaths are reported, so we won't be able to get recovered (or consequently, active)... I'm going to start on this by trying to get sources for California together. Edit: I've gathered all the resources I was using. Here's what I've got so far: https://github.com/lazd/coronavirus-data-sources . |
|
@lazd Doing a county-by-county scraping effort is going to be a lot and there is no guarantee that layout and format will stay the same for each county. Also, we can't get the county website until they get their 1st case. I was looking at Wisconsin's Department of Health Services and they provide a list of new releases. Does the state of California have a similar list in one place? That seems more reasonable for sources if that is available, no? Edit: Forgot to add link...
I agree. A Google Sheet could be created to handle this. With specific roles and data protection in place, it could be opened up to many. But what's a good format/layout for the Sheet? |
|
this is how murtaman in UK does it: https://docs.google.com/spreadsheets/d/1eTKeK9vRxgw0KhvKxPCaDrfaHnxQP-n9TsLzsEymviY/edit#gid=0 - personally I like the layout. |
|
I went through and identified source URLs for a few states whose data is all in one place for all their counties. Unfortunately I don't have the scraping expertise - but I'm more than happy to help with anything else I can do. With regard to identifying sources, let's figure out how best to parse this out so we aren't duplicating efforts.
|
@ciscorucinski all true. I started work on a scraper that basically has a custom function that gets run against the body of the website: https://github.com/lazd/coronadatascraper/blob/master/scrapers.js I've written a few scraper functions already, and it produces something like this:
Like you said, it will not be consistent; it'll have to be done on a case-by-case basis, and maintained if the county changes their website. This may not be sustainable, but it's the best shot we have. I am going to add a Chrome headless browser for sites that need JavaScript, and will work out a way to capture the states from @longsyntax's list that have aggregate data. |
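For anyone trying to picture the "one custom function per site" approach, here is a rough sketch in Python. It is not the code from coronadatascraper (that project is JavaScript); the `SCRAPERS` list, the `parse_example_county` function, the example URL, and the page wording it matches are all made up for illustration. The fetch step is passed in as a function so the parsing can be exercised without touching the network.

```python
import re


def parse_example_county(body: str) -> dict:
    # Hypothetical page layout: the text contains "Confirmed cases: 12"
    # and "Deaths: 1". A real county page will need its own pattern.
    cases = re.search(r"Confirmed cases:\s*(\d+)", body)
    deaths = re.search(r"Deaths:\s*(\d+)", body)
    return {
        "cases": int(cases.group(1)) if cases else None,
        "deaths": int(deaths.group(1)) if deaths else None,
    }


# One entry per source: where to look and which function understands that page.
SCRAPERS = [
    {
        "county": "Example County",
        "state": "CA",
        "url": "https://example.org/covid19",  # placeholder URL
        "scraper": parse_example_county,
    },
]


def run_scrapers(fetch, scrapers=SCRAPERS):
    """Run each site-specific function against the fetched page body.

    `fetch` is any callable taking a URL and returning the page text.
    A failing scraper is recorded as an error instead of stopping the run,
    since a county can change its website at any time.
    """
    results = []
    for entry in scrapers:
        record = {k: entry[k] for k in ("county", "state", "url")}
        try:
            record.update(entry["scraper"](fetch(entry["url"])))
        except Exception as err:
            record["error"] = str(err)
        results.append(record)
    return results
```

Keeping the per-site logic in small functions means a layout change only breaks one entry, and the error field makes it obvious which one needs maintenance.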
|
@longsyntax it seems like state data for counties falls into 3 categories: available in a single webpage, available through links within a single webpage, and not aggregated. My Wisconsin link above would be the 2nd case, where extra effort would need to be made |
|
Everyone needs to bug their State Health Departments to use standard (best practice) HTML tags, specifically the TABLE tag, so that we don't all have to come up with one-off scrapers for all these sites. |
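To show why a plain TABLE helps, here is a small sketch using only Python's standard library `html.parser`. The `TableParser` class and `table_to_records` helper are names I made up, and it assumes the simplest case: one table whose first row is the header. With that assumption the same code works for any site, no per-county rules needed.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row, self._cell = [], None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row, self._cell = None, None


def table_to_records(html: str) -> list:
    """Turn a header row plus data rows into a list of dicts."""
    parser = TableParser()
    parser.feed(html)
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]
```

Values come back as strings, so converting counts to integers is left to the caller.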
|
@ciscorucinski Seems a Slack instance would be super helpful if there's a coordinated effort to get County level data. |
|
@DavidGeeraerts yes, let's get one up and running... Or maybe Discord, since it'll keep our chat history (unless someone at Slack wants to give us a free instance?) |
|
It's late by me. Here is a Google Sheets that can be expanded on. It's public with editability for now. https://docs.google.com/spreadsheets/d/1T2cSvWvUvurnOuNFj2AMPGLpuR2yVs3-jdd_urfWU4c/edit?usp=sharing |
|
Slack instance has been created, see ticket 658 |
|
@ciscorucinski nice work. I think this will be much easier than trying to work in Git, especially with people making contributions from all over. I created one for county sources and added the data I have so far: https://docs.google.com/spreadsheets/d/1T2cSvWvUvurnOuNFj2AMPGLpuR2yVs3-jdd_urfWU4c/edit#gid=1477768381 |
|
Feel free to modify it as you see fit. This was just a quick setup with the data above. Right now, the doc is freely open and anyone can edit. Should I put some restrictions in place, and add via email? |
|
@ciscorucinski I think it's fine to be open for now. I've run out of data sources to scrape and need more web resources for counties across America. I currently have data scraped for 51 counties (see coronadatascraper). |
|
@CSSEGISandData Can you pin this issue? You are able to PIN 3 important issues in the Issues tab. This community-driven effort might be a good candidate for pinning. I say this because it is already buried several pages into the results (page 4). So new people will have a hard time finding it. CC @saralioness https://help.github.com/en/github/managing-your-work-on-github/pinning-an-issue-to-your-repository |
|
Thanks for the prompt response. I am not using the time series one. The one I use, for example, does not have the sum or source URL. I can always change the file parsing code to accommodate those differences. Usually the file is named month-day-year (mm-dd-yyyy), so I can adapt quickly, but I'm not sure about others. Also, the more data you add, the bigger the file or the map will get, which means more load and higher response times as more people hit that map. |
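A sketch of what "adapting the file parsing code" could look like in Python, using only the standard library. The column names `Sum` and `Source URL` are guesses based on the wording above, not the real header names, and `report_date` and `read_rows` are hypothetical helpers. The idea is to read the date from the mm-dd-yyyy file name and to fill in any missing optional columns so downstream code sees the same fields either way.

```python
import csv
import io
from datetime import date, datetime


def report_date(filename: str) -> date:
    """Read the report date from a name like '03-14-2020.csv' (mm-dd-yyyy)."""
    stem = filename.rsplit(".", 1)[0]
    return datetime.strptime(stem, "%m-%d-%Y").date()


def read_rows(csv_text: str, optional=("Sum", "Source URL")) -> list:
    """Parse CSV text and add any missing optional column as an empty string.

    The names in `optional` are assumptions for illustration only.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name in optional:
            row.setdefault(name, "")
        rows.append(row)
    return rows
```

Files that already carry the extra columns pass through unchanged, so one reader can handle both layouts.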
|
@lazd sorry, the scraping is probably not where i can be of help. |
|
Very true @adanecito. Please report any issues with the data or output format, missing columns, etc to https://github.com/lazd/coronadatascraper/issues |
|
Thanks Larry. To report issues I need to know the requirements. For example what should the fields be for what is being attempted? I know for the data what should be there by looking at the source but are extra fields expected to be there? Is the requirement to match the daily reports? |
|
@tomquisel the report should not show counties where no infected people were reported; that is what you might be seeing. But as Larry said, file an issue, hopefully with proof for verification/validation. That could be something you do right now or when you are off work. If they have people cured then they should report it even if they currently do not have anyone affected. Same goes for deaths and no infected. At least that is the way I see it. |
|
Hello! I would like to contribute to this effort. I can assist with writing scrapers if needed. I'm a little unsure on where to start here and how much has already been done. That said I have been following the situation in my own state (Kentucky) quite closely and would be happy to seek out information and potentially contribute a scraper if needed. Otherwise I am happy to take on tasks elsewhere. Just need a bit of guidance to get started. Thank you! |
|
Hey @DatJord please take a look at our lists of sources: https://blog.lazd.net/coronadatascraper/#sources Join our Slack and coordinate with us, and check out the contributing section for information on getting started with a scraper, and a link to a doc of websites that need to be scraped: https://github.com/lazd/coronadatascraper/#contributing |
|
@longsyntax |
|
I would love to contribute as well! I have started my own project https://github.com/JKSenthil/covid19-spread-tracker to visualize COVID-19 cases county by county. I scrape this information from https://coronavirus.1point3acres.com/en, and used the Wayback Machine to get historical data. |
|
I actually developed an API for county data: https://github.com/JKSenthil/coronavirus-county-api. Feel free to use! |
|
Am I right in surmising that the CDC does not get individual death reports from county coroners as they occur? Could the CDC possibly be that unprepared for a pandemic? Or do they have the info and refuse to share it? |
|
Believe it or not, in Switzerland every single positive Covid19 case is being sent by Fax to the central government. They can't keep up with the Faxes coming in! |
|
I would have thought that if the CDC had one job at all, it would be to have a system for pandemic reporting where county coroners could just log on and report the death. |
|
I think this is close to what you're looking for with death reports. It includes a flu-related deaths column. |
|
Those are aggregate numbers, and don't have anything with respect to coronavirus. I think if what I wanted was out there, this entire issue would not have been raised, because it is precisely about the difficulties of collecting the data I want. |
|
Does anyone have Iowa or Minnesota data? |
|
Also on the hunt for Kentucky data. Thought I found an option but it ended up being a dead end. |
|
@croixchristenson I emailed Iowa and they are looking into it:
|
|
Question: for places where automated collection is challenging, is there a place or way to submit data for states at the county level manually? I've been doing this for MN and IA the past few days. Thanks David! If you need historical IA data I can share it too. Has anyone else seen that the NY Times is doing county data now?
|
|
I would also be extremely happy to contribute. I can offer my help and skills with coding and system design |
|
I'm curious, where is the county data coming from right now in the visualization here? Is there a way to access this data? Is it coming from @JKSenthil 's API? Thank you. |
@curran Skimming through their sources, it seems they are scraping the county data from https://coronavirus.1point3acres.com/en. You can use the https://github.com/ExpDev07/coronavirus-tracker-api API, which retrieves county data from CSBS, where they use their own data alongside 1point3acres's data to provide county data (e.g. https://coronavirus-tracker-api.herokuapp.com/v2/locations?source=csbs). |
|
Is there any county data that includes a timeline (historical data)? |



Website: Corona Data Scraper
Download data and view sources
GitHub: Corona Data Scraper
Help write scraping rules. See Readme
Google Doc: COVID-19 Community Data Collection
Public + comment access: Comment information and sources
Help us acquire valid, official data sources on all levels: County, State, Country
Slack: COVID Atlas
First Join, then go to the COVID Atlas Slack
Background
It is clear that the team at @CSSEGISandData cannot accommodate and scale with the huge influx of new cases within the US. Therefore, it is perfectly reasonable that they abandoned the county-level reporting of cases. I think when people look at the decision with an unbiased and open mind, they will see that this was the right balance to be as helpful as possible. With that said, it is sad to see the county-level information be abandoned completely. It is very helpful!
I remember seeing a +3 increase in Wisconsin and was wondering exactly where those cases were located, and I had to search for and read a few articles to verify. But this chart could have provided that detail to me very fast!
But again, the current processes cannot scale to the number of new cases. So we have to change the processes if we want to bring this back, and the sooner the better.
Suggestion
So, I suggest some new ability that lets the community, who deeply care about this information, help @CSSEGISandData get information that is as accurate as possible. You know what type of information needs to be registered for each new case, and that baton can be passed on to us to find, report, and verify (with verification probably being the biggest aspect of this effort).
I have seen a lot of people report new data as Issues, and this new tool would be the preferred method to report those cases. Maybe they would have to provide an article link. A number of people could verify that information along with location information. This verification could go through multiple steps if needed, but @CSSEGISandData would have the final say in including the data after their own review and verification process.
Ideas
The following are some ideas of how the processes could work.
Other Benefits
An extremely positive benefit of this approach is that other countries could start providing their own more-localized data, and @CSSEGISandData could entrust a "country-representative" (CDC, or a respected university in said country) to do the review and verification of that country's more localized data.