# COVID-19 Data Crawl on Wikipedia
Data on the numbers of confirmed cases, recoveries, and deaths from COVID-19 in developing countries are fragmented and not always provided in consistent or machine-friendly formats. In many cases only the latest numbers are available, so it is not possible to look at changes over time.

This small Python script crawls historical data from Wikipedia and exports it to an Excel file for HAPRI's partners in the respective countries to double-check. Data from this crawl serve only as a reference.

The raw data will be cleaned with Stata in the next step.

For technical issues and further questions, please email Định Nguyễn, DinhNX@ueh.edu.vn.

## Import relevant packages

In [1]:
import os
import pandas as pd
from openpyxl import load_workbook

## Set working directory

In [2]:
os.chdir(r'C:\Users\NXDin\Dropbox (Vo Tat Thang)\[0][Master]Database\(1)Library_of_Data\Web-Scrap') # Provide the path here
#os.chdir(r'D:\Dropbox (Vo Tat Thang)\[0][Master]Database\(1)Library_of_Data\Web-Scrap') # Alternative path on another machine (raw string keeps the backslashes literal)
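
The crawl cell further down writes into a `DATARAW` subfolder of the working directory, so that folder has to exist before the first sheet is saved. A minimal sketch, assuming it is acceptable to create the folder when it is missing; the helper name `ensure_output_dir` is hypothetical and not part of the original script.

```python
import os

def ensure_output_dir(base_dir, folder="DATARAW"):
    """Create `folder` inside `base_dir` if it is missing and return its path."""
    out_dir = os.path.join(base_dir, folder)
    os.makedirs(out_dir, exist_ok=True)  # no error if the folder already exists
    return out_dir
```

Calling `ensure_output_dir(os.getcwd())` right after `os.chdir(...)` would be enough for this notebook.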

## Set up lists

In [3]:
country_name = ["Cambodia",
                "Laos",
                "Thailand"
               ]
                
country_url = ['https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Cambodia', 
               'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Laos',
               'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Thailand'
              ]
# Zero-based position of the target table among the tables pd.read_html finds on each page
country_table = [6,  # Cambodia: detailed confirmed and recovered cases
                 1,  # Laos: detailed confirmed and recovered cases
                 11  # Thailand: death cases
                ]
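
The three lists are matched by position, and `zip` stops silently at the shortest list, so a missing entry would drop a country without raising an error. A minimal sketch of a length check that bundles each country into one record; `CrawlTarget` and `build_targets` are hypothetical names, not part of the original script.

```python
from typing import List, NamedTuple

class CrawlTarget(NamedTuple):
    name: str
    url: str
    table: int  # zero-based position of the table on the page

def build_targets(names, urls, tables) -> List[CrawlTarget]:
    """Pair up the three lists, refusing lists of different lengths."""
    if not (len(names) == len(urls) == len(tables)):
        raise ValueError(
            "Lists differ in length: {} names, {} URLs, {} table indices".format(
                len(names), len(urls), len(tables)))
    # int() accepts both '6' and 6, so either style of index list works
    return [CrawlTarget(n, u, int(t)) for n, u, t in zip(names, urls, tables)]
```

With the lists above, `build_targets(country_name, country_url, country_table)` would return three records that can be looped over in place of the `zip` call.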

## Begin crawling

In [4]:
out_path = os.path.join("DATARAW", 'COVID19_uncleaned.xlsx')

for i, (url, name, table) in enumerate(zip(country_url, country_name, country_table)):

    # Parse every HTML table on the page and keep the one at the given position
    df = pd.read_html(url, header=0)[int(table)]

    print("Crawling: {}".format(name))

    # Create a new Excel file for the first sheet, then append each later sheet to it
    mode = 'w' if i == 0 else 'a'

    # The context manager saves and closes the workbook on exit
    # (writer.save() and assigning writer.book no longer work in pandas 2.x)
    with pd.ExcelWriter(out_path, engine='openpyxl', mode=mode) as writer:
        df.to_excel(writer, sheet_name=name, index=False)

Crawling: Cambodia
Crawling: Laos
Crawling: Thailand


## Check crawl accuracy

In [5]:
book = load_workbook(os.path.join("DATARAW", 'COVID19_uncleaned.xlsx'))
book.sheetnames

['Cambodia', 'Laos', 'Thailand']
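
Comparing the sheet names by eye works for three countries; a small helper makes the same check explicit. A minimal sketch, where `missing_sheets` is a hypothetical helper that takes the list returned by `book.sheetnames` and the expected country names.

```python
def missing_sheets(sheetnames, expected):
    """Return the expected sheet names that are absent from the workbook."""
    present = set(sheetnames)
    return [name for name in expected if name not in present]
```

`missing_sheets(book.sheetnames, country_name)` returns an empty list when every country was written to the file.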

In [6]:
df = pd.read_excel(os.path.join("DATARAW", 'COVID19_uncleaned.xlsx'), sheet_name='Laos', nrows=10)
display(df)

Unnamed: 0,vteCOVID-19 pandemic,vteCOVID-19 pandemic.1
0,SARS-CoV-2 (virus) COVID-19 (disease),SARS-CoV-2 (virus) COVID-19 (disease)
1,Timeline Pre-pandemic Crimson Contagion Exerci...,Timeline Pre-pandemic Crimson Contagion Exerci...
2,Timeline,Timeline
3,Pre-pandemic Crimson Contagion Exercise Cygnus...,Pre-pandemic Crimson Contagion Exercise Cygnus...
4,LocationsAfrica Algeria Angola Benin Botswana ...,LocationsAfrica Algeria Angola Benin Botswana ...
5,Locations,Locations
6,Africa Algeria Angola Benin Botswana Burkina F...,Africa Algeria Angola Benin Botswana Burkina F...
7,Africa,Algeria Angola Benin Botswana Burkina Faso Bur...
8,Asia,Central/North Kazakhstan Kyrgyzstan Russia Nor...
9,Central/North,Kazakhstan Kyrgyzstan Russia North Asia Tajiki...
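
The preview above appears to show a Wikipedia navigation template rather than case data, which suggests that a positional table index can point at the wrong table when a page is laid out differently or is edited. One possible safeguard, sketched here under that assumption, is to select the table by its column names instead of its position. `pick_table` and the column names in the usage note are hypothetical.

```python
def pick_table(tables, required_columns):
    """Return the first table whose header contains every required column.

    `tables` is a list of DataFrames, such as the one pd.read_html returns.
    """
    required = set(required_columns)
    for candidate in tables:
        if required.issubset(str(col) for col in candidate.columns):
            return candidate
    raise ValueError("No table has columns: {}".format(sorted(required)))
```

In the crawl loop this could be called as `pick_table(pd.read_html(url, header=0), ["Case", "Date"])`, with the column names checked against each page first. `pd.read_html` also accepts a `match` argument that keeps only tables containing a given text or regular expression.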


In [7]:
df = pd.read_excel(os.path.join("DATARAW", 'COVID19_uncleaned.xlsx'), sheet_name='Cambodia', nrows=10)
display(df)

Unnamed: 0,Case,Date,Age,Gen⁠der,National⁠ity,Detection location,Treatment facility,Previous country been to,Status,Note,Source,Unnamed: 11
0,1,27 January 2020,60,Male,China,Sihanoukville,Preah Si⁠hanouk Referral Hospital,China,Dis⁠charged (10 February),Arrived from Wuhan on 23 January with his family.,[8],
1,2,7 March 2020,38,Male,Cambodia,Siem Reap,Siem Reap Referral Hospital,No,Dis⁠charged (30 March),To have person-to-person spread from his emplo...,[18],
2,3,10 March 2020,65,Female,United Kingdom,Kampong Cham,Kampong Cham Pro⁠vincial Hospital,Vietnam,Dis⁠charged (22 March),Case 3-⁠5 were passengers of Viking Cruise Jou...,[20],
3,3,10 March 2020,65,Female,United Kingdom,Kampong Cham,Royal Ph⁠nom Penh Hospital,Vietnam,Dis⁠charged (22 March),Case 3-⁠5 were passengers of Viking Cruise Jou...,[20],
4,4,12 March 2020,73,Male,United Kingdom,Kampong Cham,Khmer-Soviet Friendship Hospital,Vietnam,Dis⁠charged (29 March),Case 3-⁠5 were passengers of Viking Cruise Jou...,[24],
5,5,12 March 2020,69,Female,United Kingdom,Kampong Cham,Khmer-Soviet Friendship Hospital,Vietnam,Dis⁠charged (29 March),Case 3-⁠5 were passengers of Viking Cruise Jou...,[24],
6,6,13 March 2020,49,Male,Canada,Phnom Penh,Khmer-Soviet Friendship Hospital,Thailand,Discharged (18 April),"A staff of Canadian International School, Koh ...",[27],
7,7,13 March 2020,33,Male,Belgium,Phnom Penh,Khmer-Soviet Friendship Hospital,Undisclosed,Dis⁠charged (2 April),Identity requested to be concealed.,[27],
8,8,15 March 2020,35,Male,France,Singapore,Khmer-Soviet Friendship Hospital,Singapore,Dis⁠charged (27 April),Arrived from Singapore on 14 March. Possibly i...,[34][36],
9,9,15 March 2020,4 months,Male,France,Phnom Penh,National Pediatric Hospital,Singapore,Dis⁠charged (3 April),Child of case 8. Spread from his father.,[34][36],


In [8]:
df = pd.read_excel(os.path.join("DATARAW", 'COVID19_uncleaned.xlsx'), sheet_name='Thailand', nrows=10)
display(df)

Unnamed: 0,Cases Order,Age,Gender,Province,Nationality,Hospital admitted to,Been to other country,Status,Occupation,Note
0,1,35,Male,Samut Prakan,Thailand,"Bamrasnaradura Infectious Diseases Institute, ...",No,Died on 29 February,Product Consultant at King Power store Sivaree...,Patient had been in contact with many tourists...
1,2,70,Male,Bangkok,Thailand,"Bamrasnaradura Infectious Diseases Institute, ...",No,Died on 23 March,Private car driver,Patient also had tuberculosis.[67][68]
2,3,79,Male,Bangkok,Thailand,"Bamrasnaradura Infectious Diseases Institute, ...",No,Died on 23 March,Muay Thai pundit,Patient had other ailments and severe symptoms...
3,4,45,Male,Bangkok,Thailand,Undisclosed Hospital,No,Died on 23 March,Security guard in Thonglor Pub,"Patient had diabetes and obesity, Admit to hos..."
4,5,50,Male,Narathiwat,Thailand,"Su-ngai Kolok Hospital, Narathiwat Province",Malaysia,Died on 27 March,,The first Thai to die from participation in a ...
5,6,55,Female,Bangkok,Thailand,Undisclosed Hospital,No,Died on 28 March,,Patient had diabetes and hyperlipidemia. patie...
6,7,68,Male,Nonthaburi,Thailand,"Nonthavej Hospital, Nonthaburi Province",No,Died on 28 March,,Patient visited Lumpinee Boxing Stadium on Mar...
7,8,54,Male,Yala,Thailand,Undisclosed Hospital,Malaysia,Died on 29 March,Merchant,"Patient had been in Malaysia on 12 March, admi..."
8,9,56,Female,Bangkok,Thailand,Undisclosed Hospital,No,Died on 29 March,,No underlying disease.
9,10,48,Male,Maha Sarakham,Thailand,"Maha Sarakham Hospital, Maha Sarakam",No,Died on 30 March,Musician,"Patient had diabetes, Intestinal cancer and He..."


## References
1. https://realpython.com/openpyxl-excel-spreadsheets-python/
2. https://www.journaldev.com/33306/pandas-read_excel-reading-excel-file-in-python