In [1]:
require("httr")
require("rvest")

library(httr)
library(rvest)
     


Loading required package: httr

Loading required package: rvest



# TASK 1: Get a COVID-19 pandemic Wiki page using HTTP request

In [None]:
get_wiki_covid19_page <- function() {
  # Our target COVID-19 wiki page URL is: https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country
  # It has two parts:
    # 1) base URL: `https://en.wikipedia.org/w/index.php`
    # 2) URL parameter: `title=Template:COVID-19_testing_by_country`, separated from the base URL by a question mark `?`
    
  # Wiki page base (here an archived Wayback Machine snapshot of the page)
  wiki_base_url <- "http://web.archive.org/web/20221025155918/https://en.wikipedia.org/w/index.php"
  
  # You will need to create a List which has an element called `title` to specify which page you want to get from Wiki
  # in our case, it will be `Template:COVID-19_testing_by_country`
  url_param <- list(title = "Template:COVID-19_testing_by_country")
  
  wiki_response <- GET(wiki_base_url, query = url_param)
  return(wiki_response)
}
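
To see how the base URL and the `title` parameter combine into the final request URL, here is a minimal sketch using only base R. It assumes that `utils::URLencode()` with `reserved = TRUE` is an acceptable stand-in for the percent-encoding that `GET()` applies to `query` values. The names `base_url`, `page_title`, `encoded_title`, and `full_url` are illustrative and not part of the original notebook.

```r
# Illustrative sketch: assemble the request URL by hand (no network access needed).
# Assumption: URLencode(reserved = TRUE) approximates how httr encodes query values.
base_url <- "https://en.wikipedia.org/w/index.php"
page_title <- "Template:COVID-19_testing_by_country"

# Percent-encode the parameter value; the colon becomes %3A
encoded_title <- utils::URLencode(page_title, reserved = TRUE)

# Join the base URL and the parameter with a question mark
full_url <- paste0(base_url, "?title=", encoded_title)
print(full_url)
```

The encoded colon (`%3A`) is the same form that appears in the `Response [...]` line printed by the notebook.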



In [8]:

response <- get_wiki_covid19_page()
print(response)

Response [http://web.archive.org/web/20221025155918/https://en.wikipedia.org/w/index.php?title=Template%3ACOVID-19_testing_by_country]
  Date: 2023-07-27 22:44
  Status: 200
  Content-Type: text/html; charset=UTF-8
  Size: 464 kB
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head><script type="text/javascript" src="/_static/js/bundle-playback.js?v=1W...
<script type="text/javascript" src="/_static/js/wombat.js?v=txqj7nKC" charset...
<script type="text/javascript">
  __wm.init("http://web.archive.org/web");
  __wm.wombat("https://en.wikipedia.org/w/index.php?title=Template:COVID-19_t...
	      "1666713558");
</script>
<link rel="stylesheet" type="text/css" href="/_static/css/banner-styles.css?v...
...


# TASK 2: Extract COVID-19 testing data table from the wiki HTML page

In [13]:
# Get the root html node from the http response in task 1 
root_node <- read_html(response)

# Get the table node from the root html node
table_node <- html_node(root_node, "table")

# Read the table node and convert it into a data frame, and print the data frame for review
covid19_df1 <- html_table(table_node)
covid19_df <- as.data.frame(covid19_df1)
covid19_df

Country or region,Date[a],Tested,Units[b],Confirmed(cases),"Confirmed /tested,%","Tested /population,%","Confirmed /population,%",Ref.
<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
Afghanistan,17 Dec 2020,154767,samples,49621,32.1,0.40,0.13,[1]
Albania,18 Feb 2021,428654,samples,96838,22.6,15.0,3.4,[2]
Algeria,2 Nov 2020,230553,samples,58574,25.4,0.53,0.13,[3][4]
Andorra,23 Feb 2022,300307,samples,37958,12.6,387,49.0,[5]
Angola,2 Feb 2021,399228,samples,20981,5.3,1.3,0.067,[6]
Antigua and Barbuda,6 Mar 2021,15268,samples,832,5.4,15.9,0.86,[7]
Argentina,16 Apr 2022,35716069,samples,9060495,25.4,78.3,20.0,[8]
Armenia,29 May 2022,3099602,samples,422963,13.6,105,14.3,[9]
Australia,9 Sep 2022,78548492,samples,10112229,12.9,313,40.3,[10]
Austria,21 Oct 2022,199625374,samples,5392347,2.7,2242,60.6,[11]
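
The extraction steps can be tried offline on a small inline HTML string. The sketch below assumes the `rvest` package is installed; the HTML content, the country names, and the `toy_*` variable names are made up for illustration. It also shows that `html_node()` returns only the first matching table, while `html_nodes()` returns all of them.

```r
# Illustrative sketch: parse a small inline HTML document instead of the live wiki page.
library(rvest)

# Hypothetical page with two tables; the values are invented for this example
toy_html <- '<html><body>
<table>
  <tr><th>Country</th><th>Tested</th></tr>
  <tr><td>Freedonia</td><td>1,000</td></tr>
  <tr><td>Borduria</td><td>2,500</td></tr>
</table>
<table>
  <tr><th>Other</th></tr>
  <tr><td>x</td></tr>
</table>
</body></html>'

toy_root <- read_html(toy_html)

# html_nodes() returns every matching node; html_node() returns only the first one
all_tables <- html_nodes(toy_root, "table")
first_table <- html_node(toy_root, "table")

# Convert the first table into a plain data frame
toy_df <- as.data.frame(html_table(first_table))
print(length(all_tables))
print(toy_df)
```

If the table of interest were not the first one on a page, selecting from `all_tables` by position would be one way to reach it.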


# TASK 3: Pre-process and export the extracted data frame
The goal of task 3 is to pre-process the extracted data frame from the previous step, and export it as a csv file

In [14]:
# Print the summary of the data frame
summary(covid19_df)

 Country or region    Date[a]             Tested            Units[b]        
 Length:173         Length:173         Length:173         Length:173        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 Confirmed(cases)   Confirmed /tested,% Tested /population,%
 Length:173         Length:173          Length:173          
 Class :character   Class :character    Class :character    
 Mode  :character   Mode  :character    Mode  :character    
 Confirmed /population,%     Ref.          
 Length:173              Length:173        
 Class :character        Class :character  
 Mode  :character        Mode  :character  

In [None]:
As you can see from the summary, the column names are a little difficult to understand and some column data types are 
not correct. For example, the `Tested` column shows as `character`. 
As such, the data frame read from the HTML table needs some pre-processing, such as removing irrelevant columns,
renaming columns, and converting columns into proper data types.

In [21]:
preprocess_covid_data_frame <- function(data_frame) {
    
    # shape <- dim(data_frame)

    # Remove the World row
    # data_frame<-data_frame[!(data_frame$`Country.or.region`=="World"),]
    # Remove the last row
    data_frame <- data_frame[1:172, ]
    
    # We don't need the Units and Ref columns, so they can be removed
    data_frame["Ref."] <- NULL
    data_frame["Units[b]"] <- NULL
    
    # Renaming the columns
    names(data_frame) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")
    
    # Convert column data types
    data_frame$country <- as.factor(data_frame$country)
    data_frame$date <- as.factor(data_frame$date)
    data_frame$tested <- as.numeric(gsub(",","",data_frame$tested))
    data_frame$confirmed <- as.numeric(gsub(",","",data_frame$confirmed))
    data_frame$confirmed.tested.ratio <- as.numeric(gsub(",","",data_frame$confirmed.tested.ratio))
    data_frame$tested.population.ratio <- as.numeric(gsub(",","",data_frame$tested.population.ratio))
    data_frame$confirmed.population.ratio <- as.numeric(gsub(",","",data_frame$confirmed.population.ratio))

    return(data_frame)
}
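
The fixed range `1:172` only works while the table has exactly 173 rows. A possible alternative, sketched below on a toy data frame, is to drop the summary row by its label. This assumes the unwanted row is labelled `World`, which the notebook output does not confirm; the toy values and the name `toy` are invented for illustration.

```r
# Illustrative sketch: remove a summary row by name instead of by position.
# Assumption: the summary row is labelled "World" in the country column.
toy <- data.frame(
  country = c("Freedonia", "Borduria", "World"),
  tested = c("1,000", "2,500", "3,500"),
  stringsAsFactors = FALSE
)

# Keep every row whose country is not "World"
toy <- toy[toy$country != "World", ]

# Strip thousands separators before converting to numeric
toy$tested <- as.numeric(gsub(",", "", toy$tested))
print(toy)
```

Filtering by label keeps working if rows are added to or removed from the source table, whereas a positional range would silently drop or keep the wrong row.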



In [25]:
# call `preprocess_covid_data_frame` function and assign it to a new data frame

preprocessed_df <- preprocess_covid_data_frame(covid19_df)

preprocessed_df

Unnamed: 0_level_0,country,date,tested,confirmed,confirmed.tested.ratio,tested.population.ratio,confirmed.population.ratio
Unnamed: 0_level_1,<fct>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1,Afghanistan,17 Dec 2020,154767,49621,32.10,0.40,0.1300
2,Albania,18 Feb 2021,428654,96838,22.60,15.00,3.4000
3,Algeria,2 Nov 2020,230553,58574,25.40,0.53,0.1300
4,Andorra,23 Feb 2022,300307,37958,12.60,387.00,49.0000
5,Angola,2 Feb 2021,399228,20981,5.30,1.30,0.0670
6,Antigua and Barbuda,6 Mar 2021,15268,832,5.40,15.90,0.8600
7,Argentina,16 Apr 2022,35716069,9060495,25.40,78.30,20.0000
8,Armenia,29 May 2022,3099602,422963,13.60,105.00,14.3000
9,Australia,9 Sep 2022,78548492,10112229,12.90,313.00,40.3000
10,Austria,21 Oct 2022,199625374,5392347,2.70,2242.00,60.6000


In [26]:
# Print the summary of the processed data frame again
summary(preprocessed_df)

                country             date         tested         
 Afghanistan        :  1   21 Oct 2022: 13   Min.   :     3880  
 Albania            :  1   20 Oct 2022:  5   1st Qu.:   512037  
 Algeria            :  1   1 Mar 2021 :  3   Median :  3029859  
 Andorra            :  1   15 Oct 2022:  3   Mean   : 31057082  
 Angola             :  1   16 Oct 2022:  3   3rd Qu.: 11867328  
 Antigua and Barbuda:  1   23 Jul 2021:  3   Max.   :929349291  
 (Other)            :166   (Other)    :142                      
   confirmed        confirmed.tested.ratio tested.population.ratio
 Min.   :       0   Min.   :  0.00         Min.   :   0.0065      
 1st Qu.:   37802   1st Qu.:  5.00         1st Qu.:   9.4250      
 Median :  281196   Median : 10.05         Median :  46.9500      
 Mean   : 2467072   Mean   : 12.15         Mean   : 172.5734      
 3rd Qu.: 1249614   3rd Qu.: 15.25         3rd Qu.: 152.5000      
 Max.   :90749469   Max.   :185.30         Max.   :3098.0000      
           

In [None]:
After pre-processing, you can see that the columns and column names are simplified, and the column types are 
converted into the correct types.


In [27]:
# Export the data frame to a csv file
write.csv(preprocessed_df, "covid_data.csv")

In [28]:
# Check the file path of the exported csv
# Get the working directory
wd <- getwd()
# Build the full path of the exported file
file_path <- file.path(wd, "covid_data.csv")
# Print the path and confirm that the file exists
print(file_path)
file.exists(file_path)

[1] "C:/Users/nihar/covid_data.csv"
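
Note that `write.csv()` writes row names by default, and `read.csv()` then reads them back as an extra first column named `X`. The sketch below illustrates this round trip on a toy data frame written to temporary files; the data and the names `toy`, `path_with`, and `path_without` are invented for illustration.

```r
# Illustrative sketch: effect of row names on a CSV round trip.
toy <- data.frame(country = c("Freedonia", "Borduria"), confirmed = c(10, 20))

path_with <- tempfile(fileext = ".csv")
path_without <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an unnamed first column
write.csv(toy, path_with)
# row.names = FALSE writes only the data columns
write.csv(toy, path_without, row.names = FALSE)

cols_with <- names(read.csv(path_with))
cols_without <- names(read.csv(path_without))
print(cols_with)
print(cols_without)
```

Because of the extra `X` column, positional references into a data frame read back from such a file are shifted by one, so selecting columns by name is the safer choice.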


# TASK 4: Get a subset of the extracted data frame

The goal of task 4 is to get the 5th to 10th rows from the data frame with only `country` and `confirmed` columns selected


In [32]:
# Read covid_data_frame_csv from the csv file
df <- read.csv("covid_data.csv")

# Get the 5th to 10th rows, keeping only the "country" and "confirmed" columns
df[5:10, c("country","confirmed")]

Unnamed: 0_level_0,country,confirmed
Unnamed: 0_level_1,<chr>,<int>
5,Angola,20981
6,Antigua and Barbuda,832
7,Argentina,9060495
8,Armenia,422963
9,Australia,10112229
10,Austria,5392347


# TASK 5: Calculate worldwide COVID testing positive ratio

The goal of task 5 is to get the total confirmed and tested cases worldwide, and try to figure the overall 
positive ratio using `confirmed cases / tested cases`

In [40]:
# Get the total confirmed cases worldwide
Total_confirmed  <- sum(df$confirmed)
Total_confirmed


In [41]:
# Get the total tested cases worldwide
Total_tested <- sum(df$tested)
Total_tested

In [43]:

# Get the positive ratio (confirmed / tested)
positive_ratio <- Total_confirmed/Total_tested
positive_ratio
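
If any value in either column were missing, `sum()` would return `NA` and the ratio would be `NA` as well. Whether the real table contains missing values is not shown in the notebook, so the sketch below uses an invented toy data frame. It restricts both sums to rows where both values are present, so numerator and denominator cover the same countries.

```r
# Illustrative sketch: compute the ratio over complete rows only.
toy <- data.frame(
  country = c("Freedonia", "Borduria", "Sylvania"),
  confirmed = c(10, NA, 30),
  tested = c(100, 200, 300),
  stringsAsFactors = FALSE
)

# A plain sum propagates the missing value
naive_total <- sum(toy$confirmed)

# Keep only rows where both confirmed and tested are available
ok <- complete.cases(toy[, c("confirmed", "tested")])
toy_ratio <- sum(toy$confirmed[ok]) / sum(toy$tested[ok])
print(toy_ratio)
```

Using `na.rm = TRUE` on each sum separately would also avoid the `NA`, but it could mix countries that are present in one column and missing in the other.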

# TASK 6: Get a list of countries that reported their testing data 

The goal of task 6 is to get a catalog or sorted list of countries that have reported their COVID-19 testing data


In [46]:
# Get the `country` column

country_col <- df$country

# Check its class 
class(country_col)

# Convert the country column into character so that you can easily sort them
country_col <- as.character(country_col)



In [47]:
# Sort the countries AtoZ

sorted_countries <- sort(country_col)

In [48]:
# Sort the countries ZtoA
sorted_countries_desc <- sort(country_col, decreasing = TRUE )

# Print the sorted ZtoA list
print(sorted_countries_desc)

  [1] "Zimbabwe"               "Zambia"                 "Vietnam"               
  [4] "Venezuela"              "Uzbekistan"             "Uruguay"               
  [7] "United States"          "United Kingdom"         "United Arab Emirates"  
 [10] "Ukraine"                "Uganda"                 "Turkey"                
 [13] "Tunisia"                "Trinidad and Tobago"    "Togo"                  
 [16] "Thailand"               "Tanzania"               "Taiwan[m]"             
 [19] "Switzerland[l]"         "Sweden"                 "Sudan"                 
 [22] "Sri Lanka"              "Spain"                  "South Sudan"           
 [25] "South Korea"            "South Africa"           "Slovenia"              
 [28] "Slovakia"               "Singapore"              "Serbia"                
 [31] "Senegal"                "Saudi Arabia"           "San Marino"            
 [34] "Saint Vincent"          "Saint Lucia"            "Saint Kitts and Nevis" 
 [37] "Rwanda"              

# TASK 7: Identify country names with a specific pattern

The goal of task 7 is to use a regular expression to find any countries whose names start with `United`

In [51]:
# Use a regular expression `United.+` to find matches


matches <- grep("United.+", df$country)
matches



In [54]:
# Print the matched country names

for (i in matches) 
{print(df$country[i])}

[1] "United Arab Emirates"
[1] "United Kingdom"
[1] "United States"
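
The pattern `United.+` matches `United` anywhere in a name, not only at the start. For the country list above the result is the same, but a leading `^` makes the "starts with" condition explicit. The sketch below compares both patterns on an invented vector (`toy_names`, including the made-up entry "Not United Land") and uses `value = TRUE` to return the names directly instead of their positions.

```r
# Illustrative sketch: unanchored versus anchored pattern.
toy_names <- c("United Kingdom", "United States", "Not United Land")

# Matches "United" followed by at least one character, anywhere in the string
unanchored <- grep("United.+", toy_names, value = TRUE)

# "^" anchors the match to the start of the string
anchored <- grep("^United", toy_names, value = TRUE)

print(unanchored)
print(anchored)
```

With `value = TRUE`, the loop over match indices is not needed to print the names.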


# TASK 8: Pick two countries you are interested in, and then review their testing data

The goal of task 8 is to compare the COVID-19 test data between two countries. You will need to select two rows 
from the data frame, and select the `country`, `confirmed`, and `confirmed.population.ratio` columns

In [59]:
# Select a subset (should be only one row) of data frame based on a selected country name and columns

US_cases <- df[df$country == 'United States', c('confirmed','country','confirmed.population.ratio')] 
US_cases



Unnamed: 0_level_0,confirmed,country,confirmed.population.ratio
Unnamed: 0_level_1,<int>,<chr>,<dbl>
166,90749469,United States,27.4


In [60]:
# Select a subset (should be only one row) of data frame based on a selected country name and columns

Japan_cases <- df[df$country == 'Japan', c('confirmed','country','confirmed.population.ratio')] 
Japan_cases

Unnamed: 0_level_0,confirmed,country,confirmed.population.ratio
Unnamed: 0_level_1,<int>,<chr>,<dbl>
82,432773,Japan,0.34


# TASK 9: Compare which one of the selected countries has a larger ratio of confirmed cases to population

The goal of task 9 is to find out which of the countries you selected before has the larger ratio of confirmed cases
to population, which may indicate that the country has a higher COVID-19 infection risk


In [65]:
# Use if-else statement

if (US_cases$confirmed.population.ratio > Japan_cases$confirmed.population.ratio) {
    print("United States has a larger ratio of COVID confirmed cases than Japan")
} else {
    print("Japan has a larger ratio of COVID confirmed cases than United States")
}


[1] "United States has a larger ratio of COVID confirmed cases than Japan"


# TASK 10: Find countries with a confirmed to population ratio less than a threshold

The goal of task 10 is to find out which countries have a confirmed to population ratio of less than 1%, which may
indicate that the risk in those countries is relatively low

In [79]:
# Get a subset of any countries with `confirmed.population.ratio` less than the threshold
 
low_risk <- df[df$confirmed.population.ratio < 1, ]

low_risk_countries <- low_risk$country
low_risk_countries
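
When filtering rows of a data frame with a logical condition, the condition goes before the comma inside `[ , ]`; without the comma, R treats the condition as a column selector. A minimal sketch on an invented toy data frame (the names `toy` and `toy_low_risk` and all values are illustrative):

```r
# Illustrative sketch: row filtering with a logical condition.
toy <- data.frame(
  country = c("Freedonia", "Borduria", "Sylvania"),
  confirmed.population.ratio = c(0.5, 12, 0.9),
  stringsAsFactors = FALSE
)

# Condition before the comma selects rows; the empty slot after it keeps all columns
toy_low_risk <- toy[toy$confirmed.population.ratio < 1, ]
print(toy_low_risk$country)
```

The threshold of 1 corresponds to 1% here because the ratio columns in this data set are already expressed in percent.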