Skip to content
This repository has been archived by the owner on Mar 10, 2023. It is now read-only.

[SOLUTION] - Pandas + Python - Pulls daily data for Recoveries @worldmeters.info and transforms in @JHU standards #1642

Open
r-lomba opened this issue Mar 26, 2020 · 18 comments

Comments

@r-lomba
Copy link

r-lomba commented Mar 26, 2020

For Pandas + Python users, I propose the following code snippet to seamlessly integrate daily data from worldometers.info straight inside your legacy code generating stuff from recoveries stats @jhu (as we all know, this data stream is now discontinued and we are left on our own, so to say)

IMPORTANT:

  • No Regional level is managed here. Just Country level. Fix accordingly if you need Regions
  • You need to run this everyday, and everyday you incrementally save a new version of you very own CSV file for Recoveries
  • Have fun :)
#############################################################
##
## PRELIMINARY OPERATIONS
##
#############################################################

# IMPORT LIBRARIES
import pandas as pd
import numpy as np
import dateutil
import time
import copy
import requests
from datetime import datetime, timedelta, date



#############################################################
##
## HARDCODED DICTIONARIES - FIXES TO COUNTRIES THAT ARE NAMED
## DIFFERENTLY IN WORLDOMETER STANDARDS THAN JHU STANDARDS
## HERE WE RENAME COUNTRIES IN RECOVERED DATAFRAME FROM WORLDOMETER
## (ON THE LEFT HAND SIDE) TO MAKE THE COUNTRY NAMES MATCH WITH
## THOSE USED AT JHU (ON THE RIGHT HAND SIDE)
##
## PLEASE NOTE THAT FOREIGN TERRITORIES OF MAIN COUNTRIES ARE
## RENAMED AS THE MAIN COUNTRY ITSELF: THIS BECAUSE WE SUPPOSE STATS
## ARE AGGREGATED AT COUNTRY LEVEL. YMMV. EVENTUALLY FIX ACCORDINGLY
##
#############################################################

# DEFINES COUNTRY NAMES FIXES DICTIONARY
countries_fixes_dict = {'CAR': 'Central African Republic',
                        'Congo': 'Congo (Brazzaville)',
                        'DRC': 'Congo (Kinshasa)',
                        'Diamond Princess': 'Cruise Ship',
                        'Ivory Coast': 'Cote d\'Ivoire',
                        'S. Korea': 'Korea, South',
                        'St. Vincent Grenadines': 'Saint Vincent and the Grenadines',
                        'Taiwan': 'Taiwan*',
                        'UAE': 'United Arab Emirates',
                        'UK': 'United Kingdom',
                        'USA': 'US',
                        'Vatican City': 'Holy See',
                        'Aruba': 'Netherlands',
                        'Bermuda': 'United Kingdom',
                        'Cayman Islands': 'United Kingdom',
                        'Channel Islands': 'United Kingdom',
                        'Curaçao': 'Netherlands',
                        'Faeroe Islands': 'Denmark',
                        'French Polynesia': 'France',
                        'French Guiana': 'France',
                        'Gibraltar': 'United Kingdom',
                        'Greenland': 'Denmark',
                        'Guadeloupe': 'France',
                        'Guam': 'US',
                        'Hong Kong': 'China',
                        'Isle of Man': 'United Kingdom',
                        'Macao': 'China',
                        'Martinique': 'France',
                        'Mayotte': 'France',
                        'Montserrat': 'United Kingdom',
                        'New Caledonia': 'France',
                        'Puerto Rico': 'US',
                        'Réunion': 'France',
                        'Saint Martin': 'France',
                        'Sint Maarten': 'Netherlands',
                        'St. Barth': 'France',
                        'Turks and Caicos': 'United Kingdom',
                        'U.S. Virgin Islands': 'US',
                        'Bahamas': 'Bahamas, The',
                        'Gambia': 'Gambia, The'
                        }



#############################################################
##
## READS WORLDOMETER STATS
##
## WEB PAGES ARE COMPLEX AND CAN CONTAIN SEVERAL TABLES. THIS IS WHY READ_HTML RETURNS
## A LIST OF TABLES. IN THIS CASE THE TABLES WE NEED ARE THE FIRST IN THE LIST
## (HENCE THE INDEXING [0]) CONTAINING THE WORLDOMETER STATS FROM TODAY, AND THE
## SECOND ONE (HENCE THE INDEXING [1]) CONTAINING THE WORLDOMETER STATS FROM YESERDAY.
## THE OPTION "header=[0]" MEANS THAT JUST THE FIRST ROW OF
## THE TABLE MAKE UP THE HEADER OTHERWISE IT COULD BE E.G. "header=[0,1]" AND IN
## THIS CASE IT WOULD MEAN THE HEADER IS MADE UP FROM THE FIRST 2 LINES OF
## THE SCRAPED TABLE. PLEASE NOTE THAT THE LAST CALL IN THE FOLLOWING CODE LINES
## IS ".iloc[:-1]" AS TO DISCARD THE LAST LINE OF THE SCRAPED TABLE: THAT WOULD
## CONTAIN GRANDTOTALS THAT WE'RE NOT INTERESTED IN
##
#############################################################

# PULLS TODAY'S AND YESTERDAY'S DATA FROM WORLDOMETER SITE
url = 'https://www.worldometers.info/coronavirus/'

header = {
  "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
  "X-Requested-With": "XMLHttpRequest"
}

r = requests.get(url, headers=header)

worldometer_stats_today_orig_DF = pd.read_html(r.text, header=[0], index_col=0)[0].iloc[:-1]
worldometer_stats_yesterday_orig_DF = pd.read_html(r.text, header=[0], index_col=0)[1].iloc[:-1]



#############################################################
##
## A FEW VARIABLE DEFINITIONS
##
#############################################################

# CURRENT AND YESTERDAY DATE AND TIME
now = datetime.now()
yesterday = date.today() - timedelta(days=1)

# BUILDS TRAILER STRINGS TO APPEND TO THE FILE NAMES WE WILL USE AND TO USE TO FIX COLUMN NAMES
# IN THE NEW RECOVERIES DATAFRAME THAT WE'LL INTEGRATE IN JHU DATA
date_time_trailer = now.strftime("%Y_%m_%d")

# IF YOU USE WINDOWS, SWITCH COMMENT ON THE NEXT TWO LINES OF CODE
# MISMATCH IN DATE FORMAT AS DOCUMENTED IN:
# https://stackoverflow.com/questions/10807164/python-time-formatting-different-in-windows
date_time_columnname = yesterday.strftime("%-m/%-d/%Y")
#date_time_columnname = yesterday.strftime("%m/%d/%Y")



#############################################################
##
## PUTTING TOGETHER ALL THE PIECES
##
#############################################################

# RENAMES NEW DATAFRAMES INDEX TO MATCH JHU DATAFRAME STANDARDS
worldometer_stats_today_orig_DF.index.names = ['Country/Region']
worldometer_stats_yesterday_orig_DF.index.names = ['Country/Region']

# FIXES DATATYPES AND COLUMN NAMES ON THE NEW RECOVERED STATS DATAFRAME TO MATCH JHU DATAFRAME STANDARDS
# HERE, WE JUST USE "TODAY" DATAFRAME BECAUSE WE SUPPOSE WE ARE GENERATING OUR STATS ON THE "NEW DAY"
# SO WE HAVE TO LOOK AT NEW ACTIVE VALUES SINCE YESTERDAY
worldometer_recovered_orig_DF = worldometer_stats_today_orig_DF['TotalRecovered'].copy().to_frame()
worldometer_recovered_orig_DF['TotalRecovered'] = worldometer_recovered_orig_DF['TotalRecovered'].fillna(0).astype(np.int64)

# RENAMES NEW DATAFRAME RECOVERED COLUMN NAME TO MATCH JHU DATAFRAME STANDARDS
# USING THE "date_time_yesterday_columnname" VARIABLE WE PREVIOUSLY INITIALIZED
# REMEMBER WE WANT TO CREATE A NEW COLUMN TO HSTACK TO RECOVERED INCREMENTAL CSV
# AS YESTERDAY'S STATS/COLUMN NAME!
worldometer_recovered_orig_DF = worldometer_recovered_orig_DF.rename(columns={'TotalRecovered': date_time_columnname})

# RENAMES COUNTRY NAMES INSIDE THE INDEX OF OUR NEW DATAFRAME TO MATCH JHU DATAFRAME STANDARDS
# USING THE FIX INDEX DICTIONARY "countries_fixes_dict" THAT WE PREVIOUSLY INITIALIZED
worldometer_recovered_orig_DF = worldometer_recovered_orig_DF.rename(index = countries_fixes_dict)

# AS NOW WE'LL HAVE MULTIPLE RECORDS FOR MANY COUNTRIES HAVING FOREIGN TERRITORIES (WE RENAMED
# ALL THE TERRITORIES AS THE MAIN COUNTRIES) WE GROUP AND SUM THE DATAFRAME AT COUNTRY LEVEL
# WE ALSO CHANGE NAME TO THIS DATAFRAME BECAUSE THIS IS THE GROUPED BY COUNTRY VERSION OF THE
# ORIGINAL WORLDCOUNTER DATAFRAME
worldometer_recovered_DF = worldometer_recovered_orig_DF.groupby(['Country/Region']).sum()

# SAVES THE "worldometer_recovered_DF" DATAFRAME WITH A PROPER NAME ENRICHED WITH DATE AND TIME
# SO THAT WE HAVE A VERSION OF THE ORIGINAL DATA
worldometer_recovered_DF.to_csv(r'./worldometer_recovered_' + date_time_trailer + '.csv', index = True)


# LET'S PEEK INSIDE THE GENERATED DATAFRAME. YOU CAN NOW JOIN/HSTACK IT WITH THE DATAFRAME
# CONTAINING THE LATEST DATA AVAILABLE @JHU WITH REGARDS TO RECOVERIES! WATCH OUT: HERE WE
# AGGREGATED EVERYTHING AT COUNTRY LEVEL. YMMV. EVENTUALLY FIX ACCORDINGLY
worldometer_recovered_DF.head()

OUTPUT:
Output

@mainadwitiya
Copy link

noiceeeeeeeeeeeeeeeeeeee

@ablanch5
Copy link

ablanch5 commented Mar 27, 2020

it seems you're missing
import numpy as np.

Also line 110:
date_time_columnname = yesterday.strftime("%-m/%-d/%Y")
was throwing me an error.
I took away the dashes and it worked:
date_time_columnname = yesterday.strftime("%m/%d/%Y")

I'm now correctly getting worldometer_recovered_2020_03_26.csv correctly in my folder

P.S. I also had to pip install lxml

@r-lomba
Copy link
Author

r-lomba commented Mar 27, 2020

@ablanch5 sorry, I forgot a few imports. I modified the code above, can you try now?

@r-lomba
Copy link
Author

r-lomba commented Mar 27, 2020

@ablanch5 the dashes are to avoid padding in the date e.g. NOT "03/25/2020" but instead "3/25/2020" so I used the dashes to be compliant with JHU... Maybe your python version does not support that format?

@ablanch5
Copy link

I'm using python 3.7.6. When I look at the csv file I'm getting 3/25/2020 like you wanted. Could it be a Linux/Windows thing? I don't know

@r-lomba
Copy link
Author

r-lomba commented Mar 27, 2020

@ablanch5 I don't know. I use Linux. Glad it worked, anyway! Enjoy :)

@ablanch5
Copy link

I think its a windows/linux thing:
https://stackoverflow.com/questions/10807164/python-time-formatting-different-in-windows

Nice work!

@efra-mx
Copy link

efra-mx commented Mar 28, 2020

Something like this would fix the date format issue:

import os
...
...
# MISMATCH IN DATE FORMAT AS DOCUMENTED IN:
# https://stackoverflow.com/questions/10807164/python-time-formatting-different-in-windows
if os.name == 'nt':
    date_time_columnname = yesterday.strftime("%m/%d/%Y")
else:
    date_time_columnname = yesterday.strftime("%-m/%-d/%Y")

@r-lomba
Copy link
Author

r-lomba commented Mar 28, 2020

@efra-mx have you tested it? If you have and you sure it works, I'll update the code snippet. Otherwise if not tested, let's just leave your comment here for inspiration for future readers.

In the meanwhile I've updated the code because since today if you don't "pretend to be a browser" they give you a 403 error. Now the problem is fixed (I don't get why they try to protect a public HTML table on a public page, in times of pandemic... Nevertheless, for now it still works with the updated code)

@efra-mx
Copy link

efra-mx commented Mar 28, 2020

@r-lomba I don't have a windows machine myself.
But you can check:
https://stackoverflow.com/questions/1325581/how-do-i-check-if-im-running-on-windows-in-python
https://stackoverflow.com/questions/1854/what-os-am-i-running-on

Good job by the way.

@efra-mx
Copy link

efra-mx commented Mar 28, 2020

I wanted to share a refactored version of the code done by @r-lomba.
It is wrapped in a class
It will be easier to use.

r = CovidReader()
r = fetch()
print(r.recovered_by_country())

I hope you find it helpful

import os
import pandas as pd
import numpy as np
import dateutil
import time
import copy
from datetime import datetime, timedelta, date
import requests


#############################################################
##
## HARDCODED DICTIONARIES - FIXES TO COUNTRIES THAT ARE NAMED
## DIFFERENTLY IN WORLDOMETER STANDARDS THAN JHU STANDARDS
## HERE WE RENAME COUNTRIES IN RECOVERED DATAFRAME FROM WORLDOMETER
## (ON THE LEFT HAND SIDE) TO MAKE THE COUNTRY NAMES MATCH WITH
## THOSE USED AT JHU (ON THE RIGHT HAND SIDE)
##
## PLEASE NOTE THAT FOREIGN TERRITORIES OF MAIN COUNTRIES ARE
## RENAMED AS THE MAIN COUNTRY ITSELF: THIS BECAUSE WE SUPPOSE STATS
## ARE AGGREGATED AT COUNTRY LEVEL. YMMV. EVENTUALLY FIX ACCORDINGLY
##
#############################################################

# DEFINES COUNTRY NAMES FIXES DICTIONARY
countries_fixes_dict = {'CAR': 'Central African Republic',
                        'Congo': 'Congo (Brazzaville)',
                        'DRC': 'Congo (Kinshasa)',
                        'Diamond Princess': 'Cruise Ship',
                        'Ivory Coast': 'Cote d\'Ivoire',
                        'S. Korea': 'Korea, South',
                        'St. Vincent Grenadines': 'Saint Vincent and the Grenadines',
                        'Taiwan': 'Taiwan*',
                        'UAE': 'United Arab Emirates',
                        'UK': 'United Kingdom',
                        'USA': 'US',
                        'Vatican City': 'Holy See',
                        'Aruba': 'Netherlands',
                        'Bermuda': 'United Kingdom',
                        'Cayman Islands': 'United Kingdom',
                        'Channel Islands': 'United Kingdom',
                        'Curaçao': 'Netherlands',
                        'Faeroe Islands': 'Denmark',
                        'French Polynesia': 'France',
                        'French Guiana': 'France',
                        'Gibraltar': 'United Kingdom',
                        'Greenland': 'Denmark',
                        'Guadeloupe': 'France',
                        'Guam': 'US',
                        'Hong Kong': 'China',
                        'Isle of Man': 'United Kingdom',
                        'Macao': 'China',
                        'Martinique': 'France',
                        'Mayotte': 'France',
                        'Montserrat': 'United Kingdom',
                        'New Caledonia': 'France',
                        'Puerto Rico': 'US',
                        'Réunion': 'France',
                        'Saint Martin': 'France',
                        'Sint Maarten': 'Netherlands',
                        'St. Barth': 'France',
                        'Turks and Caicos': 'United Kingdom',
                        'U.S. Virgin Islands': 'US',
                        'Bahamas': 'Bahamas, The',
                        'Gambia': 'Gambia, The'
                        }


#############################################################
##
## READS WORLDOMETER STATS
##
## WEB PAGES ARE COMPLEX AND CAN CONTAIN SEVERAL TABLES. THIS IS WHY READ_HTML RETURNS
## A LIST OF TABLES. IN THIS CASE THE TABLES WE NEED ARE THE FIRST IN THE LIST
## (HENCE THE INDEXING [0]) CONTAINING THE WORLDOMETER STATS FROM TODAY, AND THE
## SECOND ONE (HENCE THE INDEXING [1]) CONTAINING THE WORLDOMETER STATS FROM YESERDAY.
## THE OPTION "header=[0]" MEANS THAT JUST THE FIRST ROW OF
## THE TABLE MAKE UP THE HEADER OTHERWISE IT COULD BE E.G. "header=[0,1]" AND IN
## THIS CASE IT WOULD MEAN THE HEADER IS MADE UP FROM THE FIRST 2 LINES OF
## THE SCRAPED TABLE. PLEASE NOTE THAT THE LAST CALL IN THE FOLLOWING CODE LINES
## IS ".iloc[:-1]" AS TO DISCARD THE LAST LINE OF THE SCRAPED TABLE: THAT WOULD
## CONTAIN GRANDTOTALS THAT WE'RE NOT INTERESTED IN
##
#############################################################
class CovidReader:
    url = "https://www.worldometers.info/coronavirus/"
    header = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest"
    }

    def __init__(self):
        self._now = datetime.now()
        self._yesterday = date.today() - timedelta(days=1)
        self._today_df = None
        self._yesterday_df = None
        self._recovered_df = None

    def now(self):
        return self._now

    def yesterday(self):
        return self._yesterday

    def today_df(self):
        return self._today_df

    def yesterday_df(self):
        return self._yesterday_df

    def recovered_df(self):
        return self._recovered_df

    def fetch(self):
        self._fetch_data()
        self._fix_data()

    # PULLS TODAY'S AND YESTERDAY'S DATA FROM WORLDOMETER SITE
    def _fetch_data(self):
        self._now = datetime.now()
        self._yesterday = date.today() - timedelta(days=1)

        r = requests.get(CovidReader.url, headers=CovidReader.header)
        self._today_df = pd.read_html(r.text, header=[0], index_col=0)[0].iloc[:-1]
        self._yesterday_df = pd.read_html(r.text, header=[0], index_col=0)[1].iloc[:-1]

    def _fix_data(self):
        
        # RENAMES NEW DATAFRAMES INDEX TO MATCH JHU DATAFRAME STANDARDS
        self._today_df.index.names = ['Country/Region']
        self._yesterday_df.index.names = ['Country/Region']

        # FIXES DATATYPES AND COLUMN NAMES ON THE NEW RECOVERED STATS DATAFRAME TO MATCH JHU DATAFRAME STANDARDS
        # HERE, WE JUST USE "TODAY" DATAFRAME BECAUSE WE SUPPOSE WE ARE GENERATING OUR STATS ON THE "NEW DAY"
        # SO WE HAVE TO LOOK AT NEW ACTIVE VALUES SINCE YESTERDAY
        self._recovered_df = self._today_df['TotalRecovered'].copy().to_frame()
        self._recovered_df['TotalRecovered'] = self._recovered_df['TotalRecovered'].fillna(0).astype(np.int64)

        # RENAMES NEW DATAFRAME RECOVERED COLUMN NAME TO MATCH JHU DATAFRAME STANDARDS
        # USING THE "date_time_yesterday_columnname" VARIABLE WE PREVIOUSLY INITIALIZED
        # REMEMBER WE WANT TO CREATE A NEW COLUMN TO HSTACK TO RECOVERED INCREMENTAL CSV
        # AS YESTERDAY'S STATS/COLUMN NAME!
        #
        # MISMATCH IN DATE FORMAT AS DOCUMENTED IN:
        # https://stackoverflow.com/questions/10807164/python-time-formatting-different-in-windows
        if os.name == 'nt':
            date_time_columnname = self._yesterday.strftime("%m/%d/%Y")
        else:
            date_time_columnname = self._yesterday.strftime("%-m/%-d/%Y")
        self._recovered_df = self._recovered_df.rename(columns={'TotalRecovered': date_time_columnname})

        # RENAMES COUNTRY NAMES INSIDE THE INDEX OF OUR NEW DATAFRAME TO MATCH JHU DATAFRAME STANDARDS
        # USING THE FIX INDEX DICTIONARY "countries_fixes_dict" THAT WE PREVIOUSLY INITIALIZED
        self._recovered_df = self._recovered_df.rename(index = countries_fixes_dict)


    # AS NOW WE'LL HAVE MULTIPLE RECORDS FOR MANY COUNTRIES HAVING FOREIGN TERRITORIES (WE RENAMED
    # ALL THE TERRITORIES AS THE MAIN COUNTRIES) WE GROUP AND SUM THE DATAFRAME AT COUNTRY LEVEL
    # WE ALSO CHANGE NAME TO THIS DATAFRAME BECAUSE THIS IS THE GROUPED BY COUNTRY VERSION OF THE
    # ORIGINAL WORLDCOUNTER DATAFRAME
    def recovered_by_country(self):
        df = self._recovered_df.groupby(['Country/Region']).sum()
        return df

    # SAVES THE "_recovered_df" DATAFRAME WITH A PROPER NAME ENRICHED WITH DATE AND TIME
    # SO THAT WE HAVE A VERSION OF THE ORIGINAL DATA
    def save_recovered(self, basename=r'./covid_recovered_', **kwargs):
        # BUILDS TRAILER STRINGS TO APPEND TO THE FILE NAMES WE WILL USE AND TO USE TO FIX COLUMN NAMES
        # IN THE NEW RECOVERIES DATAFRAME THAT WE'LL INTEGRATE IN JHU DATA
        date_time_trailer = self._now.strftime("%Y_%m_%d")
        self._recovered_df.to_csv(basename + date_time_trailer + '.csv', **kwargs)

@r-lomba
Copy link
Author

r-lomba commented Mar 28, 2020

Brilliant, @efra-mx !! Thanks a lot for this :)

@marco-cheung
Copy link

Since 'Diamond Princess' cases are seperated from 'Japan' in Worldometers table, may I know how to group Diamond Princess cases into Japan? Thanks

image

@efra-mx
Copy link

efra-mx commented Apr 9, 2020 via email

@marco-cheung
Copy link

marco-cheung commented Apr 9, 2020

@efra-mx Thanks for follow-up. Can group them using this function after converting object data fields into float. However, the original column 'Continent' is missing after grouping. May I know how to add back into dataframe? Thanks!

image

@efra-mx
Copy link

efra-mx commented Apr 10, 2020 via email

@marco-cheung
Copy link

Thanks @efra-mx! That works for me and sounds straightforward to understand your code with comments!

@Seelova
Copy link

Seelova commented Oct 22, 2020

I have this error running the code
ValueError: No tables found matching pattern '.+'

How to fix that?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants