

<h1>Analysis of Global COVID-19 Pandemic Data</h1>





## Overview:


In [None]:
# This lab requires the 'httr' and 'rvest' packages, which are pre-loaded in this lab environment.
# If you are working in a local RStudio, uncomment the lines below to install the packages.

#install.packages("httr")
#install.packages("rvest")

In [None]:
library(httr)
library(rvest)

Note: if you cannot load the libraries above, use `install.packages()` to install them first.


## TASK 1: Get a `COVID-19 pandemic` Wiki page using HTTP request


First, let's write a function that uses an HTTP request to get a public COVID-19 Wiki page.

Before you write the function, you can open the public page at https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country in a web browser.

The goal of task 1 is to get the HTML page using an HTTP request (`httr` library).
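
To see how a base URL and a query parameter combine into one request URL, here is a minimal base R sketch that assembles the URL by hand. It only illustrates the idea behind passing a query list to an HTTP client; `build_query_url` is a hypothetical helper written for this example and is not part of `httr`.

```r
# Hypothetical helper: join a base URL and a named list of query parameters.
build_query_url <- function(base_url, query) {
  # Percent-encode each value, including reserved characters such as ":"
  encoded <- vapply(query,
                    function(v) utils::URLencode(v, reserved = TRUE),
                    character(1))
  # Build "name=value" pairs and join them with "&"
  pairs <- paste(names(query), encoded, sep = "=")
  paste0(base_url, "?", paste(pairs, collapse = "&"))
}

url <- build_query_url(
  "https://en.wikipedia.org/w/index.php",
  list(title = "Template:COVID-19_testing_by_country")
)
print(url)
```

The colon in the page title is percent-encoded as `%3A`, which is why the URL printed with a response can look slightly different from the one typed into a browser.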


In [None]:

get_wiki_covid19_page <- function() {

  # Our target COVID-19 wiki page URL is: https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country
  # It has two parts:
    # 1) base URL: `https://en.wikipedia.org/w/index.php`
    # 2) URL parameter: `title=Template:COVID-19_testing_by_country`, separated from the base by a question mark `?`

  # Wiki page base
  wiki_base_url <- "https://en.wikipedia.org/w/index.php"

  # Create a list with an element called `title` to specify which Wiki page to get;
  # in our case, it is `Template:COVID-19_testing_by_country`
  query_list <- list(title = "Template:COVID-19_testing_by_country")

  # Use the `GET` function in the httr library with a `url` argument and a `query` argument to get an HTTP response
  response <- GET(url = wiki_base_url, query = query_list)

  # Use the `return` function to return the response
  return(response)

}




Call the `get_wiki_covid19_page` function to get an HTTP response containing the target HTML page.


In [None]:
# Call the get_wiki_covid19_page function and print the response
response <- get_wiki_covid19_page()
print(response)

Response [https://en.wikipedia.org/w/index.php?title=Template%3ACOVID-19_testing_by_country]
  Date: 2025-06-16 16:57
  Status: 200
  Content-Type: text/html; charset=UTF-8
  Size: 451 kB
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-fea...
<head>
<meta charset="UTF-8">
<title>Template:COVID-19 testing by country - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-heade...
RLSTATE={"ext.globalCssJs.user.styles":"ready","site.styles":"ready","user.st...
<script>(RLQ=window.RLQ||[]).push(function(){mw.loader.impl(function(){return...
}];});});</script>
<link rel="stylesheet" href="/w/load.php?lang=en&amp;modules=ext.cite.styles%...
...


## TASK 2: Extract COVID-19 testing data table from the wiki HTML page


On the COVID-19 testing wiki page, you should see a data table (a `<table>` node) containing COVID-19 testing data by country:

<a href="https://cognitiveclass.ai/">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/M5_Final/images/covid-19-by-country.png" width="400" align="center">
</a>

Note that the numbers you see on your page may differ from those above because the pandemic was still ongoing when this notebook was created.

The goal of task 2 is to extract the data table above and convert it into a data frame.


Now use the `read_html` function in the rvest library to get the root HTML node from the response.


In [None]:
# Get the root html node from the http response in task 1
root_node <- read_html(response)


Get the tables in the HTML root node using the `html_elements` function.


In [None]:
# Get all <table> nodes from the root HTML node
table_nodes <- html_elements(root_node, "table")

# Print the table nodes that were found
print(table_nodes)

{xml_nodeset (4)}
[1] <table class="box-Update plainlinks ombox ombox-content ambox-Update" rol ...
[2] <table class="wikitable plainrowheaders sortable collapsible autocollapse ...
[3] <table class="plainlinks ombox mbox-small ombox-notice" role="presentatio ...
[4] <table class="wikitable mw-templatedata-doc-params">\n<caption><p class=" ...


Read the target table from the multiple tables in `table_nodes` using the `html_table` function, which returns it as a data frame (a tibble). Wrap the result in `as.data.frame` if you need a plain data frame.

_Hint: use the table node at index 2 (for example, `table_nodes[[2]]`)._
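
The difference between single and double brackets matters when picking one table out of a node set. The sketch below uses a small made-up HTML snippet (not the real Wiki page) to show what `html_table` returns in each case; the variable names are illustrative only.

```r
library(rvest)

# Toy page with two tables; the markup is made up for illustration.
html <- "<html><body>
  <table><tr><th>note</th></tr><tr><td>banner</td></tr></table>
  <table><tr><th>country</th><th>tested</th></tr>
         <tr><td>Albania</td><td>428,654</td></tr></table>
</body></html>"

tables <- html_elements(read_html(html), "table")

# Single brackets keep a node set, so html_table() returns a list of tibbles
table_list <- html_table(tables[2])
# Double brackets pick one node, so html_table() returns a single tibble
one_table <- html_table(tables[[2]])

print(class(table_list))
print(one_table)
```

With single brackets you would still need `table_list[[1]]` to reach the data frame, so indexing the node set with `[[2]]` saves a step.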


In [None]:
# Read the table node and convert it into a data frame, and print the data frame for review
covid_df <- html_table(table_nodes[[2]], fill = TRUE)

print(covid_df)


# A tibble: 173 × 9
   `Country or region` `Date[a]`   Tested      `Units[b]` `Confirmed(cases)`
   <chr>               <chr>       <chr>       <chr>      <chr>             
 1 Afghanistan         17 Dec 2020 154,767     samples    49,621            
 2 Albania             18 Feb 2021 428,654     samples    96,838            
 3 Algeria             2 Nov 2020  230,553     samples    58,574            
 4 Andorra             23 Feb 2022 300,307     samples    37,958            
 5 Angola              2 Feb 2021  399,228     samples    20,981            
 6 Antigua and Barbuda 6 Mar 2021  15,268      samples    832               
 7 Argentina           16 Apr 2022 35,716,069  samples    9,060,495         
 8 Armenia             29 May 2022 3,099,602   samples    422,963           
 9 Australia   

## TASK 3: Pre-process and export the extracted data frame

The goal of task 3 is to pre-process the data frame extracted in the previous step and export it as a csv file.


Let's get a summary of the data frame


In [None]:
# Print the summary of the data frame
summary(covid_df)


 Country or region    Date[a]             Tested            Units[b]        
 Length:173         Length:173         Length:173         Length:173        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 Confirmed(cases)   Confirmed /tested,% Tested /population,%
 Length:173         Length:173          Length:173          
 Class :character   Class :character    Class :character    
 Mode  :character   Mode  :character    Mode  :character    
 Confirmed /population,%     Ref.          
 Length:173              Length:173        
 Class :character        Class :character  
 Mode  :character        Mode  :character  

As you can see from the summary, the column names are a little difficult to understand and some column data types are not correct. For example, the `Tested` column is shown as `character`.

As such, the data frame read from the HTML table needs some pre-processing, such as removing irrelevant columns, renaming columns, and converting columns into proper data types.


We have prepared a pre-processing function for you to convert the data frame, but you can also try to write one yourself.
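
As a small illustration of the type conversion step, the sketch below takes a few count strings copied from the table printed earlier and turns them into numbers. It is a stand-alone example on a short vector, not the full pre-processing.

```r
# Values copied from the `Tested` and `Confirmed(cases)` columns printed above
raw_counts <- c("154,767", "35,716,069", "832")

# Remove the thousands separators, then convert the text to numbers
clean_counts <- as.numeric(gsub(",", "", raw_counts, fixed = TRUE))

print(class(raw_counts))
print(class(clean_counts))
print(clean_counts)
```

Calling `as.numeric` directly on a string such as `"154,767"` would give `NA` with a warning, which is why the commas are removed first.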


In [None]:
preprocess_covid_data_frame <- function(data_frame) {

    # Remove the World row
    data_frame <- data_frame[!(data_frame$`Country or region` == "World"), ]
    # Remove the last row (keeps rows 1 to 172; the count is hard-coded for the table size at scraping time)
    data_frame <- data_frame[1:172, ]

    # We don't need the Units and Ref columns, so they can be removed
    data_frame["Ref."] <- NULL
    data_frame["Units[b]"] <- NULL

    # Rename the columns
    names(data_frame) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")

    # Convert column data types (strip thousands separators before converting to numeric)
    data_frame$country <- as.factor(data_frame$country)
    data_frame$date <- as.factor(data_frame$date)
    data_frame$tested <- as.numeric(gsub(",", "", data_frame$tested))
    data_frame$confirmed <- as.numeric(gsub(",", "", data_frame$confirmed))
    data_frame$confirmed.tested.ratio <- as.numeric(gsub(",", "", data_frame$confirmed.tested.ratio))
    data_frame$tested.population.ratio <- as.numeric(gsub(",", "", data_frame$tested.population.ratio))
    data_frame$confirmed.population.ratio <- as.numeric(gsub(",", "", data_frame$confirmed.population.ratio))

    return(data_frame)
}


Call the `preprocess_covid_data_frame` function


In [None]:
# call `preprocess_covid_data_frame` function and assign it to a new data frame
new_df <- preprocess_covid_data_frame(covid_df)

Get the summary of the processed data frame again


In [None]:
# Print the summary of the processed data frame again
print(summary(new_df))

                country             date         tested         
 Afghanistan        :  1   2 Feb 2023 :  6   Min.   :     3880  
 Albania            :  1   1 Feb 2023 :  4   1st Qu.:   512037  
 Algeria            :  1   31 Jan 2023:  4   Median :  3029859  
 Andorra            :  1   1 Mar 2021 :  3   Mean   : 31377219  
 Angola             :  1   23 Jul 2021:  3   3rd Qu.: 12386725  
 Antigua and Barbuda:  1   29 Jan 2023:  3   Max.   :929349291  
 (Other)            :166   (Other)    :149                      
   confirmed        confirmed.tested.ratio tested.population.ratio
 Min.   :       0   Min.   : 0.00          Min.   :   0.0065      
 1st Qu.:   37839   1st Qu.: 5.00          1st Qu.:   9.4750      
 Median :  281196   Median :10.05          Median :  46.9500      
 Mean   : 2508340   Mean   :11.25          Mean   : 175.5043      
 3rd Qu.: 1278105   3rd Qu.:15.25          3rd Qu.: 156.5000      
 Max.   :90749469   Max.   :46.80          Max.   :3223.0000      
           

After pre-processing, you can see that the columns and column names are simplified, and the column types are converted into the correct types.


The data frame has the following columns:

- **country** - The name of the country
- **date** - Reported date
- **tested** - Total tested cases by the reported date
- **confirmed** - Total confirmed cases by the reported date
- **confirmed.tested.ratio** - The ratio of confirmed cases to tested cases, in percent
- **tested.population.ratio** - The ratio of tested cases to the population of the country, in percent
- **confirmed.population.ratio** - The ratio of confirmed cases to the population of the country, in percent


OK, we can call the `write.csv()` function to save the data frame to a csv file.
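
Before exporting the full data frame, a quick round trip on a tiny stand-in can confirm that writing and reading back behave as expected. This is a minimal sketch: `toy_df` holds two rows copied from the table printed earlier, and it writes to a temporary file so nothing in the working directory is touched.

```r
# Small stand-in data frame; values taken from rows printed earlier
toy_df <- data.frame(country = c("Angola", "Armenia"),
                     confirmed = c(20981, 422963))

# Write to a temporary file instead of the working directory
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_df, csv_path, row.names = FALSE)

# Read it back and compare with the original
round_trip <- read.csv(csv_path)
print(file.exists(csv_path))
print(round_trip)
```

Setting `row.names = FALSE` keeps the row index out of the file, so the csv has exactly the same columns as the data frame.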


In [None]:
# Export the data frame to a csv file
write.csv(new_df, "new_df.csv", row.names = FALSE)

Note that in IBM Watson Studio, there is no traditional "hard disk" associated with an R workspace.

Even if you call the `write.csv()` function to save the data frame as a csv file, it won't be shown in the IBM Cloud Object Storage asset UI automatically.

However, you can still check whether `new_df.csv` exists using the following code snippet:


In [None]:
# Get working directory
wd <- getwd()
print(wd)
# Build the path of the exported file
file_path <- file.path(wd, "new_df.csv")
print(file_path)
# Check that the file exists
file.exists(file_path)

[1] "/content"
[1] "/content/new_df.csv"


**Optional Step**: If you have difficulty finishing the web scraping tasks above, you can still continue with the next tasks by downloading a provided csv file from here:


In [None]:
## Download a sample csv file
# covid_csv_file <- download.file("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-RP0101EN-Coursera/v2/dataset/covid.csv", destfile="covid.csv")
# covid_data_frame_csv <- read.csv("covid.csv", header=TRUE, sep=",")

## TASK 4: Get a subset of the extracted data frame

The goal of task 4 is to get the 5th to 10th rows of the data frame, with only the `country` and `confirmed` columns selected.
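
Row and column selection in R follows the pattern `df[rows, columns]`. The sketch below shows the pattern on a four-row stand-in, using values copied from the table printed earlier; `toy_df` is only for illustration.

```r
# Stand-in data frame with values copied from the scraped table
toy_df <- data.frame(
  country   = c("Angola", "Antigua and Barbuda", "Argentina", "Armenia"),
  confirmed = c(20981, 832, 9060495, 422963),
  tested    = c(399228, 15268, 35716069, 3099602)
)

# df[rows, columns]: rows 2 to 3, and only the two named columns
subset_df <- toy_df[2:3, c("country", "confirmed")]
print(subset_df)
```

The row range and the column names are independent, so either can be changed without touching the other.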


In [None]:
# Use the processed data frame `new_df` (or `covid_data_frame_csv` if you loaded the provided csv file)

# Get the 5th to 10th rows, with the two columns "country" and "confirmed"
new_df[5:10, c("country", "confirmed")]


country,confirmed
<fct>,<dbl>
Angola,20981
Antigua and Barbuda,832
Argentina,9060495
Armenia,422963
Australia,10112229
Austria,5789991


## TASK 5: Calculate worldwide COVID testing positive ratio

The goal of task 5 is to get the total confirmed and tested cases worldwide, and to calculate the overall positive ratio as `confirmed cases / tested cases`.
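
One detail worth checking when summing columns is how missing values are handled. The sketch below uses two real values from the table printed earlier plus one made-up `NA` entry to show why `na.rm = TRUE` is useful; the vectors are illustrative and not the full data.

```r
# Two values from the scraped table, plus a made-up missing entry
confirmed <- c(20981, 832, NA)
tested <- c(399228, 15268)

# Without na.rm, a single NA makes the whole sum NA
sum_with_na <- sum(confirmed)
print(sum_with_na)

# With na.rm = TRUE, missing values are skipped
total_confirmed <- sum(confirmed, na.rm = TRUE)
total_tested <- sum(tested, na.rm = TRUE)

# Positive ratio for this toy subset only
positive_ratio <- total_confirmed / total_tested
print(positive_ratio)
```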


In [None]:
# Get the total confirmed cases worldwide
total_confirmed <- sum(new_df$confirmed, na.rm = TRUE)
print(total_confirmed)

# Get the total tested cases worldwide
total_tested <-sum(new_df$tested, na.rm = TRUE)
print(total_tested)

# Get the positive ratio (confirmed / tested)
positive_ratio <- total_confirmed/total_tested
print(positive_ratio)



[1] 431434555
[1] 5396881644
[1] 0.07994145


## TASK 6: Get a country list which reported their testing data

The goal of task 6 is to get a sorted list of the countries that have reported their COVID-19 testing data.
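
As a minimal sketch of the sorting idea, the example below converts a small factor to character and sorts it in both directions. The three country names come from the scraped table; the variable names are illustrative.

```r
# A small factor, similar in type to a factor column of country names
countries <- factor(c("Armenia", "Albania", "Angola"))

# Convert to character, then sort alphabetically
country_names <- as.character(countries)
sorted_az <- sort(country_names)
sorted_za <- sort(country_names, decreasing = TRUE)

print(sorted_az)
print(sorted_za)

# order() returns the positions instead, which is what you need to reorder whole rows
print(order(country_names))
```

`sort` returns the sorted values themselves, while `order` returns row positions, so `order` is the one to use inside `df[ , ]` when the other columns must follow the same ordering.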


In [None]:
# Get the `country` column
new_df$country
# Check its class (should be Factor)
class(new_df$country)

# Convert the country column into character so that you can easily sort them
new_df$country <- as.character(new_df$country)
# Sort the countries AtoZ
new_df_az <- new_df[order(new_df$country), ]

# Sort the countries ZtoA
new_df_za <- new_df[order(new_df$country,decreasing = TRUE),]

# Print the sorted ZtoA list
print(new_df_za$country)

  [1] "Zimbabwe"               "Zambia"                 "Vietnam"               
  [4] "Venezuela"              "Uzbekistan"             "Uruguay"               
  [7] "United States"          "United Kingdom"         "United Arab Emirates"  
 [10] "Ukraine"                "Uganda"                 "Turkey"                
 [13] "Tunisia"                "Trinidad and Tobago"    "Togo"                  
 [16] "Thailand"               "Tanzania"               "Taiwan[m]"             
 [19] "Switzerland[l]"         "Sweden"                 "Sudan"                 
 [22] "Sri Lanka"              "Spain"                  "South Sudan"           
 [25] "South Korea"            "South Africa"           "Slovenia"              
 [28] "Slovakia"               "Singapore"              "Serbia"                
 [31] "Senegal"                "Saudi Arabia"           "San Marino"            
 [34] "Saint Vincent"          "Saint Lucia"            "Saint Kitts and Nevis" 
 [37] "Rwanda"              

## TASK 7: Identify countries names with a specific pattern

The goal of task 7 is to use a regular expression to find any countries whose names start with `United`.
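
When matching names that start with a given word, it helps to know that a pattern without the `^` anchor matches anywhere in the string. The sketch below contrasts the two on a short vector; "The United Provinces" is a made-up entry added only to show the difference.

```r
# Two real names from the table and one made-up name for contrast
names_vec <- c("United States", "United Kingdom", "The United Provinces", "Uruguay")

# Without an anchor, the pattern matches "United" anywhere in the name
unanchored <- grep("United.+", names_vec, value = TRUE)

# With ^, the name must begin with "United"
anchored <- grep("^United.+", names_vec, value = TRUE)

print(unanchored)
print(anchored)
```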


In [None]:
# Use the regular expression `^United.+` to find names that start with "United"
matched_names <- grep("^United.+", new_df$country, value = TRUE)

# Print the matched country names
print(matched_names)

[1] "United Arab Emirates" "United Kingdom"       "United States"       


## TASK 8: Pick two countries you are interested in, and then review their testing data

The goal of task 8 is to compare the COVID-19 test data between two countries. You will need to select two rows from the data frame, and select the `country`, `confirmed`, and `confirmed.population.ratio` columns.


In [None]:
# Select a subset (one row per selected country) of the data frame based on the selected country names and columns
selected_countries <- c("Albania", "Algeria")
selected_row <- new_df[new_df$country %in% selected_countries,
                       c("country", "confirmed", "confirmed.population.ratio")]
print(selected_row)


# A tibble: 2 × 3
  country confirmed confirmed.population.ratio
  <chr>       <dbl>                      <dbl>
1 Albania     96838                       3.4 
2 Algeria     58574                       0.13


## TASK 9: Compare which one of the selected countries has a larger ratio of confirmed cases to population

The goal of task 9 is to find out which of the countries you selected has the larger ratio of confirmed cases to population, which may indicate that the country has a higher COVID-19 infection risk.
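
For a quick comparison of a handful of countries, a named numeric vector with `which.max` is a compact alternative to an if-else chain. This sketch uses the two ratios shown in the task 8 output; the vector name is illustrative.

```r
# Ratios (in percent) taken from the task 8 output
ratios <- c(Albania = 3.4, Algeria = 0.13)

# which.max() returns the position of the largest value
highest <- names(ratios)[which.max(ratios)]
print(highest)
```

Note that `which.max` returns only the first maximum, so it does not report ties the way an explicit if-else can.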


In [None]:
# Extract the ratio for Albania and Algeria as plain numeric values
albania_ratio <- new_df$confirmed.population.ratio[new_df$country == "Albania"]
algeria_ratio <- new_df$confirmed.population.ratio[new_df$country == "Algeria"]
# Use if-else statement
if (albania_ratio > algeria_ratio) {
  print("Albania has a higher confirmed-to-population ratio.")
} else if (albania_ratio < algeria_ratio) {
  print("Algeria has a higher confirmed-to-population ratio.")
} else {
  print("Both countries have the same confirmed-to-population ratio.")
}

[1] "Albania has a higher confirmed-to-population ratio."


## TASK 10: Find countries with a confirmed-to-population ratio less than a threshold

The goal of task 10 is to find out which countries have a confirmed-to-population ratio of less than 1%, which may indicate that the risk in those countries is relatively low.
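
Threshold filters behave differently when the column contains missing values. The sketch below uses three ratios that appear in the outputs of this notebook plus one made-up row with `NA` ("Nowhere" is hypothetical) to show the difference between a plain logical filter and one wrapped in `which()`.

```r
# Three real ratios from this notebook and one made-up row with a missing value
toy_df <- data.frame(
  country = c("Albania", "Algeria", "Angola", "Nowhere"),
  confirmed.population.ratio = c(3.4, 0.13, 0.067, NA)
)
threshold <- 1

# A plain logical filter keeps an all-NA row for every NA in the condition
naive <- toy_df[toy_df$confirmed.population.ratio < threshold, ]

# which() keeps only the positions where the condition is TRUE
safe <- toy_df[which(toy_df$confirmed.population.ratio < threshold), ]

print(nrow(naive))
print(safe)
```

If the ratio column has no missing values, both forms return the same rows.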


In [None]:
# Get a subset of any countries with `confirmed.population.ratio` less than the threshold

low_ratio_df <- new_df[new_df$confirmed.population.ratio < 1,
                       c("country", "confirmed.population.ratio")]

# View the result
print(low_ratio_df)

# A tibble: 53 × 2
   country             confirmed.population.ratio
   <chr>                                    <dbl>
 1 Afghanistan                             0.13  
 2 Algeria                                 0.13  
 3 Angola                                  0.067 
 4 Antigua and Barbuda                     0.86  
 5 Bangladesh                              0.7   
 6 Benin                                   0.067 
 7 Brunei                                  0.074 
 8 Burkina Faso                            0.058 
 9 Burundi                                 0.0074
10 Cambodia                                0.48  
# ℹ 43 more rows
