Scraper of Mobility reports v 2.0 #6

Closed
ActiveConclusion opened this issue Apr 17, 2020 · 14 comments

@ActiveConclusion
Owner

Google recently published its mobility reports as time series in CSV format. You can download the file from their website.
That means a PDF parser is no longer needed, so I plan to change the concept of this repository.
Here is what I propose to implement:

  1. Archive the PDF parser as part of the great history of this repository.
  2. Automatically download all available files (including PDFs) from the Google and Apple sites into this repository. The Google reports pose no problems, but the Apple website parser needs to be rewritten because my ad-hoc solution unfortunately does not work.
  3. Make one summary file from Google and Apple reports of the following structure:
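
     | country | sub_region_1 | sub_region_2 | date | retail | grocery_and_pharmacy | parks | transit_stations | workplaces | residential | walking | driving | transit |
     | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
     | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

  4. Make a simple visualization app for this data (for example, using the Bokeh library).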
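
The summary structure in point 3 could be assembled roughly as follows. This is only a minimal sketch using the standard library: it assumes the Google rows have already been renamed to the short column names in the table, and that the Apple rows are in a wide layout with `region`, `transportation_type`, and one column per ISO date. Those input layouts and the helper names `apple_wide_to_long` and `build_summary` are assumptions for illustration, not the repository's actual code. Only country-level rows are matched here.

```python
from collections import defaultdict

SUMMARY_COLUMNS = [
    "country", "sub_region_1", "sub_region_2", "date",
    "retail", "grocery_and_pharmacy", "parks", "transit_stations",
    "workplaces", "residential", "walking", "driving", "transit",
]
APPLE_METRICS = ("walking", "driving", "transit")
# Assumed non-date columns in the Apple file; everything else is treated as a date.
APPLE_META_KEYS = {"geo_type", "region", "transportation_type"}


def apple_wide_to_long(apple_rows):
    """Reshape wide Apple rows into {(region, date): {metric: value}}."""
    long_form = defaultdict(dict)
    for row in apple_rows:
        metric = row["transportation_type"]
        if metric not in APPLE_METRICS:
            continue
        for key, value in row.items():
            if key in APPLE_META_KEYS:
                continue
            long_form[(row["region"], key)][metric] = value
    return long_form


def build_summary(google_rows, apple_rows):
    """Return one dict per Google row, with Apple metrics added where available."""
    apple = apple_wide_to_long(apple_rows)
    summary = []
    for g in google_rows:
        # Start from the Google values; missing columns become empty strings.
        record = {col: g.get(col, "") for col in SUMMARY_COLUMNS}
        # Only country-level rows are matched by name in this sketch.
        if not g.get("sub_region_1") and not g.get("sub_region_2"):
            record.update(apple.get((g["country"], g["date"]), {}))
        summary.append(record)
    return summary
```

The resulting list of dicts can be written out with `csv.DictWriter(f, fieldnames=SUMMARY_COLUMNS)`.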

Feel free to offer your suggestions here.
Thank you!

@ActiveConclusion ActiveConclusion added the enhancement New feature or request label Apr 17, 2020
@ladew222

Great! I will pull this into balefire.info for the USA. FYI, the color shading issue has been resolved. I am using D3 and D3Plus for my visuals. The drawback is that my visuals are obviously USA-only. I can look into doing a global version, since it would mostly involve ignoring or reducing the data merges.

@ladew222

I got the Google data into the system. It is pretty interesting. I still need to sit down with it at some point, but there are some pretty telling Pearson coefficients correlating with it. I also see that confirmed cases per 10k are higher when mobility is higher. What I really should do is assess the log of that value two weeks later to see if there is a correlation there. The graphs suggest that is the case. I will put a screenshot below showing 4/11 and one plot of AK. FYI, the university is doing a short article on the tool next week, so I hope to get more info on our data out there.

[Screenshot: Screen Shot 2020-04-17 at 10 39 23 PM]
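
The lagged check described above (mobility against the log of cases two weeks later) might look something like the sketch below. It treats both inputs as daily series for a single region, which is an assumption; the same `pearson` helper would also work on per-county values taken from two dates 14 days apart. The function names are illustrative and are not taken from the balefire.info code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


def lagged_log_correlation(mobility, cases_per_10k, lag_days=14):
    """Correlate mobility on day t with log(cases per 10k) on day t + lag_days.

    Days with zero or negative case counts are skipped because the
    logarithm is undefined there.
    """
    pairs = [
        (m, math.log(c))
        for m, c in zip(mobility, cases_per_10k[lag_days:])
        if c > 0
    ]
    if len(pairs) < 2:
        raise ValueError("not enough overlapping days after applying the lag")
    xs, ys = zip(*pairs)
    return pearson(xs, ys)
```

A coefficient from this kind of check only indicates association; it says nothing about causation, and daily series are strongly autocorrelated, so the value should be read with caution.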

@ActiveConclusion
Owner Author

@ladew222 Cool! I hope to compile a summary file from the Google and Apple reports in the next 2-3 days.

@ladew222

Wow cool.

@ActiveConclusion
Owner Author

ActiveConclusion commented Apr 21, 2020

I've recently made a couple of updates, so here is a summary of what's been done:

  1. Everything related to the PDF parser is now in the "scraper v 1.0" directory.
  2. The Apple report is now downloaded to the repository automatically every day. There is a small problem with the Google data, though: the CSV download works, but the PDF download is currently disabled because the structure of the Google webpage has changed significantly. I don't think that's a critical problem.
  3. The summary reports from Google and Apple data that I mentioned above are now generated automatically. They are available here. A few points should be noted:
  • The matching of subregions from Google data with cities from Apple data needs further improvement. Currently, they are matched exactly as they appear in the original data.
  • The U.S. data is a serious problem because it is quite heterogeneous. So far, the cities are in the "sub_region_1" column. I think it is probably better to remove the detailed county breakdown for the US from the summary report.
  • The baseline for Apple data should be adjusted to a longer period that overlaps with the Google baseline period (e.g. January 13 to February 6). This is a rough approach, but I think it would be better than using January 13 alone as the baseline.
  4. Google Sheets are now updated automatically.
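
The baseline adjustment suggested for the Apple data could be sketched as below. It assumes the Apple series is an index where the value on January 13 equals 100, and it re-expresses every value as a percent change from the mean over a chosen window, so the numbers are closer in meaning to Google's "percent change from baseline". Using a plain mean over the window is a simplification chosen for this sketch, and `rebase_to_window` is a hypothetical helper name.

```python
from datetime import date
from statistics import mean


def rebase_to_window(series, start, end):
    """Convert an index series to percent change from its mean over [start, end].

    `series` maps datetime.date -> index value (e.g. 100.0 on the reference day).
    """
    window = [value for day, value in series.items() if start <= day <= end]
    if not window:
        raise ValueError("no observations inside the baseline window")
    baseline = mean(window)
    return {day: (value / baseline - 1.0) * 100.0 for day, value in series.items()}


# Example with made-up numbers: baseline window January 13 to February 6, 2020.
example = {
    date(2020, 1, 13): 100.0,
    date(2020, 1, 14): 120.0,
    date(2020, 2, 10): 55.0,
}
rebased = rebase_to_window(example, date(2020, 1, 13), date(2020, 2, 6))
```

A variant that computes a separate baseline per day of the week would smooth out weekday and weekend differences, at the cost of a little more bookkeeping.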

We also need to think about the design of the data visualization app, which should provide simple answers about the mobility situation in a particular region.

@ladew222

Cool. Here is the choropleth of residential mobility as it is now, if you haven't seen it.
[Screenshot: Screen Shot 2020-04-20 at 9 01 17 PM]

@ActiveConclusion
Owner Author

Wow, looks nice! But I couldn't reproduce this picture in your dashboard :( I got this instead:
[Screenshot: balefire]

Maybe, I didn't press some button or checkbox?

@ladew222

My fault. It looks like Google does not have significant enough data for that metric. Retail and Recreation has the fuller map.

@ActiveConclusion
Owner Author

Got it, thanks! I suggest adding a breakdown by state; it would let us see the picture across the whole United States.

@ladew222

ladew222 commented Apr 21, 2020 via email

@ActiveConclusion
Owner Author

Are you thinking about a map by states?

Yes

@ActiveConclusion
Owner Author

Last week's Update Digest:

  1. The problem with downloading Google PDF reports is fixed (I fixed it a week ago, I just didn't write about it here).
  2. Apple has added more regions/cities to their report. The main problem is that cities and subregions come without country names, but I have already fixed this (it was a challenging issue for me).
  3. With the addition of the new Apple data, there are now huge problems with merging the reports, the scale of which I have not even assessed yet.

@ActiveConclusion
Owner Author

Latest updates:

  • Recently, the Google Sheets document with detailed US data crashed because there was too much data for one sheet. My apologies to everyone who used this spreadsheet as a source; for now, you can use the CSV version of this report. I may reformat the Google Sheets document (e.g. split the states into tabs); otherwise, unfortunately, it will have to be abandoned.

  • Merging of the Apple and Google reports has significantly improved. I have finally made a matching table of Apple and Google subregions. I've also split the summary report into several reports:

    1. Report by regions (without US counties)
    2. Report by countries (only totals)
    3. Report for the US only

    If you see errors in the matching table, please open an issue right away.

  • Also, I think it's a good idea to add a geo-type column to the Google data (as in the Apple report).
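
For readers wondering how a matching table of Apple and Google subregions might be applied, here is a minimal sketch. The table layout (a dict keyed by country and Apple region name) and the function name `apply_matching_table` are assumptions for illustration; the repository's real matching table may be structured differently. The example entry is made up to show the mechanics.

```python
def apply_matching_table(apple_rows, matching):
    """Attach a Google-style sub_region_1 to each Apple row using a lookup table.

    `matching` maps (country, apple_region) -> Google sub_region_1 name.
    Rows without an entry are returned separately so they can be reviewed
    instead of being silently dropped or mismatched.
    """
    matched, unmatched = [], []
    for row in apple_rows:
        key = (row["country"], row["region"])
        if key in matching:
            matched.append({**row, "sub_region_1": matching[key]})
        else:
            unmatched.append(row)
    return matched, unmatched


# Illustrative entry only: an Apple city mapped to the Google subregion containing it.
example_matching = {("Germany", "Munich"): "Bavaria"}
example_rows = [
    {"country": "Germany", "region": "Munich", "driving": "61.2"},
    {"country": "Germany", "region": "Atlantis", "driving": "0.0"},
]
matched, unmatched = apply_matching_table(example_rows, example_matching)
```

Keeping the unmatched rows visible makes it easier to spot the kind of matching errors that the issue asks people to report.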

@ActiveConclusion
Owner Author

I haven't written anything here in a while, but I should have. So, point by point:

  • Until today, the parser had been running automatically for a long time without my intervention. But strange things are happening on the Apple website today, so I expect problems tonight.
  • I haven't fixed the problem with Google Sheets for the US yet, but there has been some progress.
  • Lately, I've been actively processing the OpenSky COVID-19 Flight Dataset. I hope to put my results in a separate repository within 2-3 days. The main problem is that I do not understand the quality of the data in this dataset or how to evaluate it. But it is what it is. If all goes well, I will also add this data to the merged reports.
