# COVID-19 Data Crawl on Wikipedia
Data on confirmed cases, recoveries, and deaths from COVID-19 in developing countries is fragmented and not always published in consistent or machine-friendly formats. In many cases only the latest figures are available, so changes over time cannot be tracked.

This small Python script crawls historical data from Wikipedia and exports it to an Excel file so that HAPRI's partners in each country can double-check it. Data from this crawl serves only as a reference.

This raw data will be cleaned with STATA in the next step.

For technical issues and further questions, please email Định Nguyễn, Research Associate, DinhNX@ueh.edu.vn.

## Import relevant packages

In [1]:
import os
import pandas as pd
from openpyxl import load_workbook

## Set working directory

In [2]:
os.chdir(r'D:\Dropbox (Vo Tat Thang)\[0][Master]Database\(1)Library_of_Data\Web-Scrap')  # Provide the path here (raw string so backslashes are not treated as escapes)
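
The export step writes into a `DATARAW` subfolder of the working directory. A minimal sketch, assuming that folder may not exist yet, of creating it up front; the helper name `ensure_output_dir` is hypothetical and not part of the original script.

```python
from pathlib import Path


def ensure_output_dir(base_dir, subfolder="DATARAW"):
    """Create base_dir/subfolder if it is missing and return its path."""
    out_dir = Path(base_dir) / subfolder
    # exist_ok=True avoids an error when the folder is already there
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir
```

Calling `ensure_output_dir(Path.cwd())` right after `os.chdir(...)` would create the folder inside the working directory set above.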

## Set up lists

In [3]:
country_name = ["Cambodia",
                "Laos"]
country_url = ['https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Cambodia', 
               'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Laos']
country_table = ['3', '1']
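
The three lists are matched by position, and `zip` silently stops at the shortest one, so a missing entry would drop a country without any error. A small sketch, assuming one wants that check made explicit; `build_targets` is a hypothetical helper, not part of the original script.

```python
def build_targets(names, urls, tables):
    """Combine the three parallel lists into one list of dicts, one per country."""
    if not (len(names) == len(urls) == len(tables)):
        raise ValueError(
            "country_name, country_url and country_table must have the same length"
        )
    return [
        {"name": n, "url": u, "table": int(t)}  # table index stored as an int
        for n, u, t in zip(names, urls, tables)
    ]


targets = build_targets(
    ["Cambodia", "Laos"],
    ["https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Cambodia",
     "https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Laos"],
    ["3", "1"],
)
```

Keeping each country's name, URL, and table index in one record also makes it harder to edit one list and forget the others.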

## Begin crawling

In [4]:
out_path = os.path.join("DATARAW", 'COVID19_uncleaned.xlsx')

# Write one sheet per country into a single workbook.
# The context manager saves and closes the file on exit, which replaces
# writer.save() and assigning writer.book (both removed in pandas 2.0).
with pd.ExcelWriter(out_path, engine='openpyxl') as writer:
    for url, name, table in zip(country_url, country_name, country_table):

        print("Crawling: {}".format(name))

        # read_html returns every table on the page; pick one by position
        df = pd.read_html(url, header=0)[int(table)]

        df.to_excel(writer, sheet_name=name, index=False)

Crawling: Cambodia
Crawling: Laos
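
The table positions in `country_table` depend on the current layout of each Wikipedia page, so an edit to the page can shift them. A hedged sketch of choosing a table by its column headers instead; `pick_table` is a hypothetical helper, and the two small DataFrames only stand in for the list that `pd.read_html` returns.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required_columns."""
    required = set(required_columns)
    for candidate in tables:
        # Column labels are not always strings, so normalise before comparing
        if required.issubset({str(c) for c in candidate.columns}):
            return candidate
    raise LookupError("No table with columns: {}".format(sorted(required)))


# Stand-in for the result of pd.read_html(url, header=0)
tables = [
    pd.DataFrame({"Date": ["24 March 2020"], "Case": [1]}),
    pd.DataFrame({"Province": ["Kep"], "Cases": [4]}),
]
df = pick_table(tables, ["Province", "Cases"])
```

Raising an error when no table matches makes a changed page layout visible immediately, rather than exporting the wrong table.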


## Check crawl accuracy

In [5]:
book = load_workbook(os.path.join("DATARAW", 'COVID19_uncleaned.xlsx'))
book.sheetnames

['Cambodia', 'Laos']

In [6]:
df = pd.read_excel(os.path.join("DATARAW", 'COVID19_uncleaned.xlsx'), sheet_name='Laos')
print(df.to_markdown())

|    |   Case | Date          |   Age | Gender   | Nationality      | Location      |   Treatment facility | Previous country been to   | Status     | Note                                                                      |   Source |
|---:|-------:|:--------------|------:|:---------|:-----------------|:--------------|---------------------:|:---------------------------|:-----------|:--------------------------------------------------------------------------|---------:|
|  0 |      1 | 24 March 2020 |    28 | Male     | Laos             | Vientiane     |                  nan | Thailand                   | In patient | nan                                                                       |      nan |
|  1 |      2 | 24 March 2020 |    36 | Female   | Laos             | Vientiane     |                  nan | No                         | In patient | Tour guide                                                                |      nan |
|  2 |      3 | 25 March 2020 |    26 | Male  

In [7]:
df = pd.read_excel(os.path.join("DATARAW", 'COVID19_uncleaned.xlsx'), sheet_name='Cambodia')
print(df.to_markdown())

|    | Province         |   Cases |   Recoveries |
|---:|:-----------------|--------:|-------------:|
|  0 | Banteay Meanchey |       4 |            3 |
|  1 | Battambang       |       8 |            8 |
|  2 | Kampong Cham     |      16 |           16 |
|  3 | Kampong Chhnang  |       3 |            2 |
|  4 | Kampot           |       2 |            2 |
|  5 | Kandal           |       2 |            2 |
|  6 | Kep              |       4 |            4 |
|  7 | Koh Kong         |       2 |            2 |
|  8 | Phnom Penh       |      28 |           28 |
|  9 | Preah Vihear     |       2 |            2 |
| 10 | Siem Reap        |       7 |            7 |
| 11 | Sihanoukville    |      40 |           39 |
| 12 | Tboung Khmum     |       4 |            4 |
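
Beyond reading the printed table by eye, a simple consistency rule can flag suspicious rows before the data is passed on for cleaning. A minimal sketch, assuming the columns are named `Cases` and `Recoveries` with no hidden characters; `inconsistent_rows` is a hypothetical helper, and the sample uses three rows copied from the output above.

```python
import pandas as pd


def inconsistent_rows(df, cases_col="Cases", recovered_col="Recoveries"):
    """Return rows where counts are missing, negative, or recoveries exceed cases."""
    # Non-numeric cells become NaN so they are flagged instead of raising
    cases = pd.to_numeric(df[cases_col], errors="coerce")
    recovered = pd.to_numeric(df[recovered_col], errors="coerce")
    bad = (
        cases.isna() | recovered.isna()
        | (cases < 0) | (recovered < 0)
        | (recovered > cases)
    )
    return df[bad]


sample = pd.DataFrame({
    "Province": ["Banteay Meanchey", "Kampong Chhnang", "Sihanoukville"],
    "Cases": [4, 3, 40],
    "Recoveries": [3, 2, 39],
})
problems = inconsistent_rows(sample)
print(len(problems))  # 0: none of the three sample rows is flagged
```

A check like this does not confirm that the figures are correct; it only catches values that cannot be right, such as more recoveries than cases.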


## Reference
1. https://realpython.com/openpyxl-excel-spreadsheets-python/
2. https://www.journaldev.com/33306/pandas-read_excel-reading-excel-file-in-python