# I/ COVID-19 Data Crawl on Wikipedia
Data on confirmed cases, recoveries, and deaths from COVID-19 in developing countries is fragmented and not always provided in consistent or machine-friendly formats. In many cases only the latest numbers are available, so it is not possible to look at changes over time.

This small Python script crawls historical data from Wikipedia and exports it to an Excel file so that HAPRI's partners in each country can double-check it. Data from this crawl serves only as a reference.

This raw data will be cleaned with STATA in the next step.

For technical issues and further questions, please email Định Nguyễn, DinhNX@ueh.edu.vn.

## Import relevant packages

In [1]:
import os
import requests
import pandas as pd
from openpyxl import load_workbook

## Set working directory

In [2]:
os.chdir(r'C:\Users\NXDin\Dropbox (Vo Tat Thang)\[0][Master]Database\(1)Library_of_Data\Web-Scrap') # Provide the path here
#os.chdir('D:\Dropbox (Vo Tat Thang)\[0][Master]Database\(1)Library_of_Data\Web-Scrap') # Provide the path here
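
The export cells in this notebook write into a `DATARAW` subfolder of the working directory, and the write is expected to fail if that folder is missing. A minimal sketch of creating it up front, assuming that is acceptable for your setup; `ensure_output_dir` is a hypothetical helper name, and the demonstration runs in a temporary directory rather than the real Dropbox path.

```python
import os
import tempfile


def ensure_output_dir(base_dir, folder="DATARAW"):
    """Create base_dir/folder if it is missing and return its path."""
    out_dir = os.path.join(base_dir, folder)
    os.makedirs(out_dir, exist_ok=True)  # no error if it already exists
    return out_dir


# Demonstration in a throwaway directory instead of the real working directory
with tempfile.TemporaryDirectory() as tmp:
    out_dir = ensure_output_dir(tmp)
    created = os.path.isdir(out_dir)
    # Calling it a second time is harmless
    same_dir = ensure_output_dir(tmp) == out_dir
```

In the notebook itself, the call would be `ensure_output_dir(os.getcwd())` right after `os.chdir(...)`.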

## Set up lists

In [3]:
country_name = ["Cambodia",
                "Laos",
                "Thailand"
               ]
                
country_url = ['https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Cambodia', 
               'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Laos',
               'https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Thailand'
              ]
# Position (0-based) of the wanted table in the list returned by pd.read_html
country_table = ['6', #Cambodia: Detailed confirmed and recovered cases
                 '1', #Laos: Detailed confirmed and recovered cases
                 '15' #Thailand: Death cases
                ]
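
Hard-coded table positions stop pointing at the right table as soon as a table is added to or removed from the Wikipedia page. One possible alternative, sketched here under assumptions, is to choose the table by its column names. `pick_table` is a hypothetical helper, and the `tables` list is a small stand-in for what `pd.read_html(url, header=0)` would return.

```python
import pandas as pd


def pick_table(tables, required_columns):
    """Return the first DataFrame whose columns include all required names."""
    required = set(required_columns)
    for candidate in tables:
        if required.issubset(set(map(str, candidate.columns))):
            return candidate
    raise ValueError("No table with columns: {}".format(sorted(required)))


# Stand-in for the list returned by pd.read_html(url, header=0)
tables = [
    pd.DataFrame({"Date": ["24 March 2020"], "Total": [2]}),
    pd.DataFrame({"Case": [1], "Date": ["24 March 2020"], "Status": ["Discharged"]}),
]

case_table = pick_table(tables, ["Case", "Status"])
```

Raising an error when nothing matches makes a changed page layout visible immediately instead of silently exporting the wrong table.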

## Begin crawling

In [4]:
# Write every country's table to its own sheet in a single workbook.
# The context manager saves and closes the file on exit.
with pd.ExcelWriter(os.path.join("DATARAW", 'COVID19_uncleaned.xlsx'), engine = 'openpyxl') as writer:
    for url, name, table in zip(country_url, country_name, country_table):

        print("Crawling: {}".format(name))

        # read_html returns every table on the page; pick the wanted one by position
        df = pd.read_html(url, header=0)[int(table)]
        df.to_excel(writer, sheet_name = name, index = False)

Crawling: Cambodia
Crawling: Laos
Crawling: Thailand


## Check crawl accuracy

In [5]:
book = load_workbook(os.path.join("DATARAW", 'COVID19_uncleaned.xlsx'))
book.sheetnames

['Cambodia', 'Laos', 'Thailand']

In [6]:
df = pd.read_excel(os.path.join("DATARAW", 'COVID19_uncleaned.xlsx'), sheet_name='Laos', nrows=10)
display(df)

Unnamed: 0,Case,Date,Age,Gender,Nationality,Location,Treatment facility,Previous country been to,Status,Note,Source,Unnamed: 11
0,1,24 March 2020,28,Male,Laos,Vientiane,,Thailand,Discharged,,,
1,2,24 March 2020,36,Female,Laos,Vientiane,,No,Discharged,Tour guide,,
2,3,25 March 2020,26,Male,Laos,Vientiane,,Europe,Discharged,Close contact with case 1,,
3,4,26 March 2020,42,Male,Laos,Luang Prabang,,No,Discharged,Driver of same tour group as case 2.,,
4,5,26 March 2020,42,Male,Laos,Luang Prabang,,No,Discharged,,,
5,6,26 March 2020,41,Male,Laos,Vientiane,,No,Discharged,In patient,Close contact with case 3,
6,7,28 March 2020,50,Female,Laos,Luang Prabang,,No,Discharged,In patient,Wife of case 5,
7,8,28 March 2020,18,Male,Laos,Vientiane,,No,Discharged,In patient,Close contact with case 3,
8,9,29 March 2020,22,Female,Laos,Vientiane,,Thailand,Discharged,In patient,Visited her relative in Bangkok,
9,10,25 March 2020,21,Female,Laos,Vientiane,,No,Discharged,In patient,Close contact with case 8,


In [7]:
df = pd.read_excel(os.path.join("DATARAW", 'COVID19_uncleaned.xlsx'), sheet_name='Cambodia', nrows=10)
display(df)

Unnamed: 0,Case,Date confirmed,Age,Gender,Nationality,Detection location,Treatment facility,Previous countr(y/ies) been to,Status,Note,Source
0,1,27 January 2020,60,Male,China,Sihanoukville,Preah Sihanouk Referral Hospital,China,Discharged (10 February),Arrived from Wuhan on 23 January with his family.,[8]
1,2,7 March 2020,38,Male,Cambodia,Siem Reap,Siem Reap Referral Hospital,No,Discharged (30 March),To have person-to-person spread from his emplo...,[19][92]
2,3,10 March 2020,65,Female,United Kingdom,Kampong Cham,Kampong Cham Provincial Hospital,Vietnam,Discharged (22 March),Case 3-5 were passengers of Viking Cruise Jou...,[20]
3,3,10 March 2020,65,Female,United Kingdom,Kampong Cham,Royal Phnom Penh Hospital,Vietnam,Discharged (22 March),Case 3-5 were passengers of Viking Cruise Jou...,[20]
4,4,12 March 2020,73,Male,United Kingdom,Kampong Cham,Khmer-Soviet Friendship Hospital,Vietnam,Discharged (29 March),Case 3-5 were passengers of Viking Cruise Jou...,[24]
5,5,12 March 2020,69,Female,United Kingdom,Kampong Cham,Khmer-Soviet Friendship Hospital,Vietnam,Discharged (29 March),Case 3-5 were passengers of Viking Cruise Jou...,[24]
6,6,13 March 2020,49,Male,Canada,Phnom Penh,Khmer-Soviet Friendship Hospital,Thailand,Discharged (18 April),"A staff of Canadian International School, Koh ...",[27]
7,7,13 March 2020,33,Male,Belgium,Phnom Penh,Khmer-Soviet Friendship Hospital,Undisclosed,Discharged (2 April),Identity requested to be concealed.,[27]
8,8,15 March 2020,35,Male,France,Singapore,Khmer-Soviet Friendship Hospital,Singapore,Discharged (27 April),Arrived from Singapore on 14 March. Possibly i...,[34][36]
9,9,15 March 2020,4 months,Male,France,Phnom Penh,National Pediatric Hospital,Singapore,Discharged (3 April),Child of case 8. Spread from his father.,[34][36]
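
In the Cambodia preview, case 3 occupies two rows that differ only in the treatment facility, so a plain row count overstates the number of cases. A minimal sketch of flagging such repeats before the data goes to the partners; the small DataFrame is a stand-in shaped like the preview, not the full sheet.

```python
import pandas as pd

# Small stand-in for the Cambodia sheet, shaped like the preview above
df = pd.DataFrame({
    "Case": [1, 2, 3, 3, 4],
    "Treatment facility": [
        "Preah Sihanouk Referral Hospital",
        "Siem Reap Referral Hospital",
        "Kampong Cham Provincial Hospital",
        "Royal Phnom Penh Hospital",
        "Khmer-Soviet Friendship Hospital",
    ],
})

# keep=False marks every row of a repeated case, not only the later copies
repeated = df[df.duplicated(subset="Case", keep=False)]
repeated_cases = sorted(repeated["Case"].unique().tolist())
print(repeated_cases)
```

Whether the repeated rows should be merged or kept is a cleaning decision; the check only makes them visible.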


In [8]:
df = pd.read_excel(os.path.join("DATARAW", 'COVID19_uncleaned.xlsx'), sheet_name='Thailand', nrows=10)
display(df)

Unnamed: 0,Cases Order,Age,Gender,Province of Detection,Nationality,Place of Isolation,Travel History,Status,Occupation,Note
0,1,35,Male,Samut Prakan,Thailand,"Bamrasnaradura Infectious Diseases Institute, ...",No,Died on 29 February,Product Consultant at King Power store Sivaree...,"Case 29 patient, had been in contact with many..."
1,2,70,Male,Bangkok,Thailand,"Bamrasnaradura Infectious Diseases Institute, ...",No,Died on 23 March,Private car driver,"Case 25 patient, also had tuberculosis.[80][81]"
2,3,79,Male,Bangkok,Thailand,"Bamrasnaradura Infectious Diseases Institute, ...",No,Died on 23 March,Muay Thai pundit,"Case 158 patient, had other ailments and sever..."
3,4,45,Male,Bangkok,Thailand,Undisclosed Hospital,No,Died on 23 March,Security guard in Thonglor Pub,"Case 198 patient, had diabetes and obesity, Ad..."
4,5,50,Male,Narathiwat,Thailand,"Su-ngai Kolok Hospital, Narathiwat Province",Malaysia,Died on 27 March,,The first Thai to die from participation in a ...
5,6,55,Female,Bangkok,Thailand,Undisclosed Hospital,No,Died on 28 March,,Patient had diabetes and hyperlipidemia. patie...
6,7,68,Male,Nonthaburi,Thailand,"Nonthavej Hospital, Nonthaburi Province",No,Died on 28 March,,Patient visited Lumpinee Boxing Stadium on Mar...
7,8,54,Male,Yala,Thailand,Undisclosed Hospital,Malaysia,Died on 29 March,Merchant,"Patient had been in Malaysia on 12 March, admi..."
8,9,56,Female,Bangkok,Thailand,Undisclosed Hospital,No,Died on 29 March,,No underlying disease.
9,10,48,Male,Maha Sarakham,Thailand,"Maha Sarakham Hospital, Maha Sarakam",No,Died on 30 March,Musician,"Patient had diabetes, Intestinal cancer and He..."
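
The `Status` column of the Thailand sheet stores the date of death as free text such as "Died on 29 February". If a quick check in Python is wanted before the STATA cleaning step, the date could be pulled out roughly as follows. This is a sketch under two assumptions: every death in the table happened in 2020, and month names are in English.

```python
import pandas as pd

# Stand-in for the Status column shown in the preview above
status = pd.Series(["Died on 29 February", "Died on 23 March", "Died on 30 March"])

# Capture the "day month" part of each status text
day_month = status.str.extract(r"Died on (\d{1,2} [A-Za-z]+)", expand=False)

# Assumption: all deaths in this table happened in 2020
death_date = pd.to_datetime(day_month + " 2020", format="%d %B %Y")
```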


# II/ Import COVID-19 case data from Thailand's Department of Disease Control

The result of this script is the raw COVID-19 case data from Thailand's Department of Disease Control. The data will be cleaned in STATA in the next step.

Data is available at https://data.go.th/en/dataset/covid-19-daily

## Set up URL

In [9]:
data_url = 'https://covid19.th-stat.com/api/open/cases'

## Get data from the server and convert to JSON format

In [10]:
response = requests.get(data_url)
response_json = response.json()
print('Last update: {}'.format(response_json['UpdateDate']))

Last update: 20/08/2020
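
The request above has no timeout and does not check the HTTP status, so a stalled or failing server would surface only as a hang or a JSON decoding error. A minimal sketch of a more defensive download follows. `fetch_json` is a hypothetical helper, the 30-second timeout is an arbitrary choice, and the fake response exists only so the example runs offline.

```python
import requests


def fetch_json(url, get=requests.get, timeout=30):
    """Download url and return the decoded JSON body.

    Raises requests.HTTPError on a 4xx/5xx answer instead of passing an
    error page on to the JSON parser.
    """
    response = get(url, timeout=timeout)
    response.raise_for_status()
    return response.json()


class _FakeResponse:
    """Offline stand-in for requests.Response, for demonstration only."""

    def raise_for_status(self):
        pass

    def json(self):
        return {"UpdateDate": "20/08/2020", "Data": []}


def _fake_get(url, timeout):
    return _FakeResponse()


payload = fetch_json("https://covid19.th-stat.com/api/open/cases", get=_fake_get)
print("Last update: {}".format(payload["UpdateDate"]))
```

In the notebook, `fetch_json(data_url)` without the `get` argument would perform the real request.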


## Extract data and save to Excel format

In [11]:
data_json = response_json['Data']

In [12]:
df1 = pd.DataFrame(data_json)

# The context manager saves and closes the workbook on exit
with pd.ExcelWriter(os.path.join("DATARAW", 'COVID19_Thailand.xlsx'), engine = 'openpyxl') as writer:
    df1.to_excel(writer, index = False)

## Check crawl accuracy

In [13]:
df = pd.read_excel(os.path.join("DATARAW", 'COVID19_Thailand.xlsx'), nrows=10)
display(df)

Unnamed: 0,ConfirmDate,No,Age,Gender,GenderEn,Nation,NationEn,Province,ProvinceId,District,ProvinceEn,Detail,StatQuarantine
0,2020-08-20 00:00:00,3389,33,ชาย,Male,Thailand,,กรุงเทพมหานคร,1,ราชเทวี,Bangkok,,1
1,2020-08-20 00:00:00,3388,46,ชาย,Male,Thailand,,กรุงเทพมหานคร,1,,Bangkok,,1
2,2020-08-20 00:00:00,3387,38,ชาย,Male,Thailand,,กรุงเทพมหานคร,1,ราชเทวี,Bangkok,,1
3,2020-08-20 00:00:00,3386,28,ชาย,Male,Thailand,,กรุงเทพมหานคร,1,ราชเทวี,Bangkok,,1
4,2020-08-20 00:00:00,3385,51,ชาย,Male,Thailand,,ชลบุรี,9,บางละมุง,Chonburi,,1
5,2020-08-20 00:00:00,3384,51,ชาย,Male,Thailand,,ชลบุรี,9,บางละมุง,Chonburi,,1
6,2020-08-20 00:00:00,3383,21,หญิง,Female,Thailand,,ชลบุรี,9,บางละมุง,Chonburi,,1
7,2020-08-19 00:00:00,3382,37,หญิง,Female,Thailand,,กรุงเทพมหานคร,1,ประเวศ,Bangkok,,1
8,2020-08-18 00:00:00,3381,62,ชาย,Male,Thailand,,กรุงเทพมหานคร,1,ประเวศ,Bangkok,,1
9,2020-08-18 00:00:00,3380,49,หญิง,Female,Thailand,,กรุงเทพมหานคร,1,ปทุมวัน,Bangkok,,1
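
Because this file holds one row per confirmed case, daily counts can be derived by counting rows per `ConfirmDate`, which gives the view over time that the latest-number sources lack. A minimal sketch on a five-row stand-in for the sheet; the real cleaning is still meant to happen in STATA.

```python
import pandas as pd

# Five-row stand-in for the Thailand sheet, using the columns shown above
df = pd.DataFrame({
    "ConfirmDate": ["2020-08-20 00:00:00", "2020-08-20 00:00:00",
                    "2020-08-19 00:00:00", "2020-08-18 00:00:00",
                    "2020-08-18 00:00:00"],
    "No": [3389, 3388, 3382, 3381, 3380],
})

df["ConfirmDate"] = pd.to_datetime(df["ConfirmDate"])

# One row per case, so counting rows per date gives new cases per day
daily = df.groupby("ConfirmDate").size().sort_index()

# Running total within this sample only
cumulative = daily.cumsum()
```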


# III/ Import COVID-19 case data from covid19japan.com

The result of this script is the raw COVID-19 cases data for Japan. The data will be cleaned in STATA in the next step.

Read more about the data sources at https://github.com/NXDinh/covid19japan-data#data-sources and https://stopcovid19.metro.tokyo.lg.jp/en/

Data is available at https://github.com/reustle/covid19japan-data

## Set up URL

In [14]:
data_url = 'https://data.covid19japan.com/patient_data/latest.json'

## Get data from the server and convert to JSON format

In [15]:
response = requests.get(data_url)
response_json = response.json()

## Extract data and save to Excel format

In [16]:
data_json = response_json

In [17]:
df1 = pd.DataFrame(data_json)

# The context manager saves and closes the workbook on exit
with pd.ExcelWriter(os.path.join("DATARAW", 'COVID19_Japan.xlsx'), engine = 'openpyxl') as writer:
    df1.to_excel(writer, index = False)

## Check crawl accuracy

In [None]:
df = pd.read_excel(os.path.join("DATARAW", 'COVID19_Japan.xlsx'), nrows=10)
display(df)

# IV/ Import COVID-19 case data from covid19india.org

Data is available at https://api.covid19india.org/

## Set up URL

data_url = ['https://api.covid19india.org/raw_data1.json',
            'https://api.covid19india.org/raw_data2.json',
            'https://api.covid19india.org/raw_data3.json',
            'https://api.covid19india.org/raw_data4.json',
            'https://api.covid19india.org/raw_data5.json',
            'https://api.covid19india.org/raw_data6.json',
            'https://api.covid19india.org/raw_data7.json',
            'https://api.covid19india.org/raw_data8.json',
            'https://api.covid19india.org/raw_data9.json',
            'https://api.covid19india.org/raw_data10.json',
            'https://api.covid19india.org/raw_data11.json',
            'https://api.covid19india.org/raw_data12.json'
           ]
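
Since the twelve file names differ only in their number, the same list could also be generated rather than typed out, which makes it easier to extend if more files are published. A small sketch, assuming the numbering stays contiguous from 1 to 12:

```python
base_url = "https://api.covid19india.org/raw_data{}.json"

# Files are numbered 1 to 12, matching the hand-written list
data_url = [base_url.format(n) for n in range(1, 13)]
```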

## Get data from the server and convert to JSON format

# Write each JSON file to its own numbered sheet in a single workbook.
# The context manager saves and closes the file on exit.
with pd.ExcelWriter(os.path.join("DATARAW", 'COVID19_India.xlsx'), engine = 'openpyxl') as writer:
    for i, url in enumerate(data_url, start=1):

        print("Crawling: {}".format(url))

        response = requests.get(url)
        response_json = response.json()
        data_json = response_json['raw_data']

        df1 = pd.DataFrame(data_json)
        df1.to_excel(writer, sheet_name = str(i), index = False)

## Check crawl accuracy

df = pd.read_excel(os.path.join("DATARAW", 'COVID19_India.xlsx'), sheet_name='4', nrows=10)
display(df)
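
Checking one sheet at a time does not show whether the twelve sheets line up with each other. One way to inspect them together is to stack them into a single DataFrame while recording which sheet each row came from. This is a sketch: the `sheets` dictionary stands in for `pd.read_excel(path, sheet_name=None)`, which returns a dict of DataFrames keyed by sheet name, and the column names are placeholders rather than the real API fields.

```python
import pandas as pd

# Stand-in for pd.read_excel(path, sheet_name=None).
# Column names here are placeholders, not the real API fields.
sheets = {
    "1": pd.DataFrame({"record_id": [1, 2], "state": ["A", "B"]}),
    "2": pd.DataFrame({"record_id": [3], "state": ["C"]}),
}

# Tag each row with its sheet name so it can be traced back to its source file
combined = pd.concat(
    [frame.assign(sheet=name) for name, frame in sheets.items()],
    ignore_index=True,
)
```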