# **Analysis of Global COVID-19 Pandemic Data**

This project is part of [Introduction to R Programming for Data Science](https://www.coursera.org/learn/introducton-r-programming-data-science/).

In [1]:
install.packages("curl")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [2]:
library(curl)

Using libcurl 7.81.0 with OpenSSL/3.0.2



In [3]:
install.packages("httr")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [4]:
library(httr)


Attaching package: ‘httr’


The following object is masked from ‘package:curl’:

    handle_reset




In [6]:
install.packages("rvest")

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [7]:
library(rvest)

### TASK 1: Get a COVID-19 pandemic Wiki page using HTTP request

Send an HTTP GET request to the Wikipedia page to get an HTTP response containing the target HTML page.

In [13]:
# Send a GET request to the wiki page URL and print the response
covid19_url <- "https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country"
response <- GET(covid19_url)
response

Response [https://en.wikipedia.org/w/index.php?title=Template:COVID-19_testing_by_country]
  Date: 2024-07-18 05:47
  Status: 200
  Content-Type: text/html; charset=UTF-8
  Size: 451 kB
<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-fea...
<head>
<meta charset="UTF-8">
<title>Template:COVID-19 testing by country - Wikipedia</title>
<script>(function(){var className="client-js vector-feature-language-in-heade...
"",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","M...
"CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","...
,"CS1 Lithuanian-language sources (lt)","CS1 Malagasy-language sources (mg)",...
"wgRelevantArticleId":63303421,"wgIsProbablyEditable":false,"wgRelevantPageIs...
...
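
The request above can also be wrapped in a small helper so that the URL, the query parameter, and the error check live in one place. The following is a minimal sketch, not the notebook's own code: the helper name `get_wiki_covid19_page` and its body are assumptions, and it relies only on `httr::GET` and `httr::stop_for_status`.

```r
# Hypothetical helper (name and body are assumptions for illustration)
get_wiki_covid19_page <- function() {
  wiki_base_url <- "https://en.wikipedia.org/w/index.php"
  # httr URL-encodes the query list and appends it to the base URL
  wiki_params <- list(title = "Template:COVID-19_testing_by_country")
  response <- httr::GET(wiki_base_url, query = wiki_params)
  # Turn HTTP error statuses (4xx/5xx) into R errors instead of failing later
  httr::stop_for_status(response)
  response
}
```

Calling `response <- get_wiki_covid19_page()` would then stand in for the two lines that build the URL and call `GET` directly.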


### TASK 2: Extract COVID-19 testing data table from the wiki HTML page


The COVID-19 testing wiki page contains a data table (a `<table>` node) with COVID-19 testing data by country.

The goal of task 2 is to extract that data table and convert it into a data frame.

Now use the `read_html` function from the rvest library to get the root HTML node from the response.

In [35]:
# Get the root html node from the http response in task 1
wiki_node <- read_html(response)
wiki_node

{html_document}
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-clientpref-1 vector-feature-main-menu-pinned-disabled vector-feature-limited-width-clientpref-1 vector-feature-limited-width-content-enabled vector-feature-custom-font-size-clientpref-1 vector-feature-appearance-enabled vector-feature-appearance-pinned-clientpref-1 vector-feature-night-mode-enabled skin-theme-clientpref-day vector-toc-available" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="skin--responsive skin-vector skin-vector-search-vue mediawik ...

In [38]:
# Get the table node from the root html node
covid19_table_node <- html_nodes(wiki_node, "table")
covid19_table_node

{xml_nodeset (4)}
[1] <table class="box-Update plainlinks ombox ombox-content ambox-Update" rol ...
[2] <table class="wikitable plainrowheaders sortable collapsible autocollapse ...
[3] <table class="plainlinks ombox mbox-small ombox-notice" role="presentatio ...
[4] <table class="wikitable mw-templatedata-doc-params">\n<caption><p class=" ...
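
The node set above holds four tables, and picking one by position breaks as soon as the page gains or loses a box. A CSS class selector is one possible alternative: in the output above, only the testing table carries both the `wikitable` and `sortable` classes. The sketch below shows the idea on a toy page (the HTML and its values are made up for illustration), using `minimal_html`, `html_element`, and `html_table` from rvest.

```r
library(rvest)

# Toy page standing in for the wiki HTML: two tables, one of them a sortable wikitable
page <- minimal_html('
  <table class="ombox"><tr><td>Notice box</td></tr></table>
  <table class="wikitable sortable">
    <tr><th>Country</th><th>Tested</th></tr>
    <tr><td>Afghanistan</td><td>154,767</td></tr>
    <tr><td>Albania</td><td>428,654</td></tr>
  </table>
')

# html_element() returns the first node matching the CSS selector
data_table <- html_element(page, "table.wikitable.sortable")
toy_frame <- as.data.frame(html_table(data_table))
toy_frame
```

Numbers with thousands separators come back as character strings, which is why a later pre-processing step has to strip the commas before converting them.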

In [39]:
# Read the second table node (the testing data), convert it into a data frame, and print the first rows for review
covid19_data_frame <- as.data.frame(html_table(covid19_table_node[2]))
head(covid19_data_frame)

| | Country.or.region | Date.a. | Tested | Units.b. | Confirmed.cases. | Confirmed..tested.. | Tested..population.. | Confirmed..population.. | Ref. |
|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | Afghanistan | 17 Dec 2020 | 154767 | samples | 49621 | 32.1 | 0.4 | 0.13 | [1] |
| 2 | Albania | 18 Feb 2021 | 428654 | samples | 96838 | 22.6 | 15.0 | 3.4 | [2] |
| 3 | Algeria | 2 Nov 2020 | 230553 | samples | 58574 | 25.4 | 0.53 | 0.13 | [3][4] |
| 4 | Andorra | 23 Feb 2022 | 300307 | samples | 37958 | 12.6 | 387.0 | 49.0 | [5] |
| 5 | Angola | 2 Feb 2021 | 399228 | samples | 20981 | 5.3 | 1.3 | 0.067 | [6] |
| 6 | Antigua and Barbuda | 6 Mar 2021 | 15268 | samples | 832 | 5.4 | 15.9 | 0.86 | [7] |


### TASK 3: Pre-process and export the extracted data frame

Pre-process the extracted data frame from the previous step and export it as a CSV file.

In [40]:
# Print the summary of the data frame
summary(covid19_data_frame)

 Country.or.region    Date.a.             Tested            Units.b.        
 Length:173         Length:173         Length:173         Length:173        
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
 Confirmed.cases.   Confirmed..tested.. Tested..population..
 Length:173         Length:173          Length:173          
 Class :character   Class :character    Class :character    
 Mode  :character   Mode  :character    Mode  :character    
 Confirmed..population..     Ref.          
 Length:173              Length:173        
 Class :character        Class :character  
 Mode  :character        Mode  :character  

In [42]:
preprocess_covid_data_frame <- function(data_frame) {

    # Remove the World row
    data_frame <- data_frame[!(data_frame$Country.or.region == "World"), ]
    # Remove the last row (the row index 172 is hard-coded for this snapshot of the table)
    data_frame <- data_frame[1:172, ]

    # The Units and Ref columns are not needed, so remove them
    data_frame["Ref."] <- NULL
    data_frame["Units.b."] <- NULL

    # Rename the columns
    names(data_frame) <- c("country", "date", "tested", "confirmed", "confirmed.tested.ratio", "tested.population.ratio", "confirmed.population.ratio")

    # Convert column data types (strip thousands separators before numeric conversion)
    data_frame$country <- as.factor(data_frame$country)
    data_frame$date <- as.factor(data_frame$date)
    data_frame$tested <- as.numeric(gsub(",", "", data_frame$tested))
    data_frame$confirmed <- as.numeric(gsub(",", "", data_frame$confirmed))
    data_frame$confirmed.tested.ratio <- as.numeric(gsub(",", "", data_frame$confirmed.tested.ratio))
    data_frame$tested.population.ratio <- as.numeric(gsub(",", "", data_frame$tested.population.ratio))
    data_frame$confirmed.population.ratio <- as.numeric(gsub(",", "", data_frame$confirmed.population.ratio))

    return(data_frame)
}
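
The numeric conversions in the function all follow the same pattern: remove the thousands separators with `gsub`, then call `as.numeric`. The sketch below isolates that step on toy strings (the values, including the unparseable `"n/a"` cell, are illustrative assumptions) and shows how to count cells that failed to convert.

```r
# Toy strings standing in for scraped cells (illustrative values)
tested_raw <- c("154,767", "428,654", "n/a")

# Drop thousands separators, then convert; unparseable cells become NA
tested_num <- suppressWarnings(as.numeric(gsub(",", "", tested_raw, fixed = TRUE)))
tested_num

# Count the cells that could not be converted
sum(is.na(tested_num))
```

Checking the `NA` count after conversion is a cheap way to notice footnote markers or placeholder text hiding in a numeric column.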

In [43]:
# call `preprocess_covid_data_frame` function and assign it to a new data frame
wiki_covid19_data_frame <- preprocess_covid_data_frame(covid19_data_frame)
wiki_covid19_data_frame

| | country | date | tested | confirmed | confirmed.tested.ratio | tested.population.ratio | confirmed.population.ratio |
|---|---|---|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | Afghanistan | 17 Dec 2020 | 154767 | 49621 | 32.10 | 0.40 | 0.1300 |
| 2 | Albania | 18 Feb 2021 | 428654 | 96838 | 22.60 | 15.00 | 3.4000 |
| 3 | Algeria | 2 Nov 2020 | 230553 | 58574 | 25.40 | 0.53 | 0.1300 |
| 4 | Andorra | 23 Feb 2022 | 300307 | 37958 | 12.60 | 387.00 | 49.0000 |
| 5 | Angola | 2 Feb 2021 | 399228 | 20981 | 5.30 | 1.30 | 0.0670 |
| 6 | Antigua and Barbuda | 6 Mar 2021 | 15268 | 832 | 5.40 | 15.90 | 0.8600 |
| 7 | Argentina | 16 Apr 2022 | 35716069 | 9060495 | 25.40 | 78.30 | 20.0000 |
| 8 | Armenia | 29 May 2022 | 3099602 | 422963 | 13.60 | 105.00 | 14.3000 |
| 9 | Australia | 9 Sep 2022 | 78548492 | 10112229 | 12.90 | 313.00 | 40.3000 |
| 10 | Austria | 1 Feb 2023 | 205817752 | 5789991 | 2.80 | 2312.00 | 65.0000 |


In [44]:
# Print the summary of the processed data frame again
summary(wiki_covid19_data_frame)

                country             date         tested         
 Afghanistan        :  1   2 Feb 2023 :  6   Min.   :     3880  
 Albania            :  1   1 Feb 2023 :  4   1st Qu.:   512037  
 Algeria            :  1   31 Jan 2023:  4   Median :  3029859  
 Andorra            :  1   1 Mar 2021 :  3   Mean   : 31377219  
 Angola             :  1   23 Jul 2021:  3   3rd Qu.: 12386725  
 Antigua and Barbuda:  1   29 Jan 2023:  3   Max.   :929349291  
 (Other)            :166   (Other)    :149                      
   confirmed        confirmed.tested.ratio tested.population.ratio
 Min.   :       0   Min.   : 0.00          Min.   :   0.006       
 1st Qu.:   37839   1st Qu.: 5.00          1st Qu.:   9.475       
 Median :  281196   Median :10.05          Median :  46.950       
 Mean   : 2508340   Mean   :11.25          Mean   : 175.504       
 3rd Qu.: 1278105   3rd Qu.:15.25          3rd Qu.: 156.500       
 Max.   :90749469   Max.   :46.80          Max.   :3223.000       
           

In [45]:
# Export the data frame to a csv file
write.csv(wiki_covid19_data_frame, file = "covid.csv", row.names = FALSE)

In [46]:
# Get the working directory
wd <- getwd()
# Build the path of the exported file
file_path <- file.path(wd, "covid.csv")
# Print the file path and check that the file exists
print(file_path)
file.exists(file_path)

[1] "/content/covid.csv"


### TASK 4: Get a subset of the extracted data frame

Get the 5th to 10th rows from the data frame with only country and confirmed columns selected

In [52]:
# Read covid_data_frame_csv from the csv file
covid_data <- read.csv("covid.csv", header=TRUE, sep=",")
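
Column types do not survive a CSV round trip unchanged, which explains why the data read back here shows different classes than the processed data frame did before export. The sketch below demonstrates this with a two-row toy frame written to a temporary file; it assumes R 4.0 or later, where `read.csv` leaves strings as character vectors by default.

```r
# Toy frame with a factor column and a whole-number double column
toy <- data.frame(country = factor(c("Angola", "Austria")),
                  confirmed = c(20981, 5789991))

# Round trip through a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, file = csv_path, row.names = FALSE)
toy_back <- read.csv(csv_path)

class(toy$country)         # factor before export
class(toy_back$country)    # character after import (R >= 4.0 default)
class(toy_back$confirmed)  # whole numbers are read back as integer
```

If factor columns are needed after the import, they have to be converted again with `as.factor`, or requested through the `stringsAsFactors` or `colClasses` arguments of `read.csv`.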

In [53]:
# Get the 5th to 10th rows, keeping only the "country" and "confirmed" columns
covid_data[ 5:10, c( "country", "confirmed") ]

| | country | confirmed |
|---|---|---|
| | `<chr>` | `<int>` |
| 5 | Angola | 20981 |
| 6 | Antigua and Barbuda | 832 |
| 7 | Argentina | 9060495 |
| 8 | Armenia | 422963 |
| 9 | Australia | 10112229 |
| 10 | Austria | 5789991 |


### TASK 5: Calculate worldwide COVID testing positive ratio

Get the total confirmed and tested cases worldwide, and compute the overall positive ratio as `confirmed cases / tested cases`.

In [54]:
# Get the total confirmed cases worldwide
tot_confirmed <- sum(covid_data[,'confirmed'])
tot_confirmed
# Get the total tested cases worldwide
tot_tested <- sum(covid_data[,'tested'])
tot_tested
# Get the positive ratio (confirmed / tested)
positive_ratio <- tot_confirmed/tot_tested
round(positive_ratio,2)
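
`sum` returns `NA` as soon as one value is missing, so the totals above are only meaningful if both columns are complete. The sketch below shows one way to guard against that on toy data (the three rows are made-up values): rows missing either figure are dropped from both totals, so the numerator and denominator cover the same countries.

```r
# Toy data: the third row has no confirmed count
toy <- data.frame(confirmed = c(50, 30, NA),
                  tested    = c(500, 300, 200))

# Keep only rows where both figures are present
complete <- toy[complete.cases(toy[, c("confirmed", "tested")]), ]

# Positive ratio over the complete rows only: (50 + 30) / (500 + 300)
toy_ratio <- sum(complete$confirmed) / sum(complete$tested)
round(toy_ratio, 2)
```

Using `na.rm = TRUE` on each `sum` separately would instead count the 200 tests of the incomplete row without its cases and bias the ratio downward.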

### TASK 6: Get a country list which reported their testing data

Get a sorted list of the countries that have reported their COVID-19 testing data.

In [55]:
# Get the `country` column
covid_data[,'country']
# Check its class (character here, because read.csv in R >= 4.0 does not convert strings to factors)
class(covid_data$country)
# Convert the country column into character so that you can easily sort them
covid_data$country <- as.character(covid_data$country)
class(covid_data$country)
# Sort the countries AtoZ
sort(covid_data$country)
# Sort the countries ZtoA
ztoa_country <- sort(covid_data$country, decreasing=TRUE)
# Print the sorted ZtoA list
print(ztoa_country)

  [1] "Zimbabwe"               "Zambia"                 "Vietnam"               
  [4] "Venezuela"              "Uzbekistan"             "Uruguay"               
  [7] "United States"          "United Kingdom"         "United Arab Emirates"  
 [10] "Ukraine"                "Uganda"                 "Turkey"                
 [13] "Tunisia"                "Trinidad and Tobago"    "Togo"                  
 [16] "Thailand"               "Tanzania"               "Taiwan[m]"             
 [19] "Switzerland[l]"         "Sweden"                 "Sudan"                 
 [22] "Sri Lanka"              "Spain"                  "South Sudan"           
 [25] "South Korea"            "South Africa"           "Slovenia"              
 [28] "Slovakia"               "Singapore"              "Serbia"                
 [31] "Senegal"                "Saudi Arabia"           "San Marino"            
 [34] "Saint Vincent"          "Saint Lucia"            "Saint Kitts and Nevis" 
 [37] "Rwanda"              
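
Some names in the sorted output still carry Wikipedia footnote markers, such as `Taiwan[m]` and `Switzerland[l]`. A possible clean-up, sketched here on three names taken from that output, removes a trailing bracketed lowercase marker with `sub` before sorting. The pattern is an assumption about how the markers look and may need adjusting for other footnote styles.

```r
# Names copied from the output above, two of them with footnote markers
countries <- c("Taiwan[m]", "Switzerland[l]", "Sweden")

# Remove a trailing marker of the form [letters]
clean_countries <- sub("\\[[a-z]+\\]$", "", countries)

# Sort the cleaned names A to Z
sort(clean_countries)
```

Cleaning the names also matters for exact-match lookups: a filter on `"Taiwan"` would not find a row stored as `"Taiwan[m]"`.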

### TASK 7: Identify country names with a specific pattern

Use a regular expression to find all countries whose names start with "United".

In [57]:
# Use a regular expression `United.+` to find matches
country_matches <- regexpr('United.+', covid_data$country)

# Print the matched country names
regmatches(covid_data$country, country_matches)
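
`regexpr` returns the match position for every element (`-1` where nothing matches), and `regmatches` then keeps only the matched substrings. For a plain "starts with" filter, `grep` with `value = TRUE` gives the same names in one call. The sketch below uses a handful of names from the country list; the `^` anchor is an addition that restricts the match to the start of the string.

```r
# A few names taken from the country list
countries <- c("Ukraine", "United Arab Emirates", "United Kingdom",
               "United States", "Uruguay")

# Return the elements whose name starts with "United"
united <- grep("^United", countries, value = TRUE)
united
```

Without the anchor, a pattern such as `United.+` would also match a name that merely contains "United" somewhere in the middle.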

### TASK 8: Pick two countries you are interested in, and then review their testing data

To compare the COVID-19 testing data of two countries, select their two rows from the data frame and keep the country, tested, confirmed, and confirmed.population.ratio columns.

In [58]:
# Select a subset (should be only one row) of data frame based on a selected country name and columns
india <- covid_data[covid_data$country=='India',c('country','tested','confirmed','confirmed.population.ratio')]
india
# Select a subset (should be only one row) of data frame based on a selected country name and columns
usa <- covid_data[covid_data$country=='United States',c('country','tested','confirmed','confirmed.population.ratio')]
usa

| | country | tested | confirmed | confirmed.population.ratio |
|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<int>` | `<dbl>` |
| 73 | India | 866177937 | 43585554 | 31.7 |


| | country | tested | confirmed | confirmed.population.ratio |
|---|---|---|---|---|
| | `<chr>` | `<dbl>` | `<int>` | `<dbl>` |
| 166 | United States | 929349291 | 90749469 | 27.4 |


In [59]:
# Did India run more tests than the USA?
india$tested > usa$tested
# Did India report more confirmed cases than the USA?
india$confirmed > usa$confirmed

### TASK 9: Compare which one of the selected countries has a larger ratio of confirmed cases to population

Find out which of the two selected countries has the larger ratio of confirmed cases to population, which may indicate a higher COVID-19 infection risk in that country.

In [60]:
# Use if-else statement
if (usa$confirmed.population.ratio > india$confirmed.population.ratio) {
   print('USA has higher covid-19 infection risk')
} else {
   print('India has higher covid-19 infection risk')
}

[1] "India has higher covid-19 infection risk"
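
The `if`/`else` comparison assumes each country subset holds exactly one row; if a country name does not match, the subset is empty and `if` stops with an "argument is of length zero" error. An alternative sketch, using the two ratios shown in the TASK 8 output, keeps both countries in one data frame and picks the larger ratio with `which.max`.

```r
# The two ratios reported in the TASK 8 output
pair <- data.frame(country = c("India", "United States"),
                   confirmed.population.ratio = c(31.7, 27.4),
                   stringsAsFactors = FALSE)

# which.max() returns the index of the largest ratio
higher <- pair$country[which.max(pair$confirmed.population.ratio)]
paste(higher, "has higher covid-19 infection risk")
```

The same two lines work unchanged for more than two countries, since `which.max` scans the whole column.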


### TASK 10: Find countries with a confirmed-to-population ratio less than a threshold

Find out which countries have a confirmed-to-population ratio below 1%, which may indicate that the risk in those countries is relatively low.

In [62]:
# Get a subset of any countries with `confirmed.population.ratio` less than the threshold
new_df <- covid_data[(covid_data$`confirmed.population.ratio` < 1), ]
new_df

| | country | date | tested | confirmed | confirmed.tested.ratio | tested.population.ratio | confirmed.population.ratio |
|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | Afghanistan | 17 Dec 2020 | 154767 | 49621 | 32.1 | 0.4 | 0.13 |
| 3 | Algeria | 2 Nov 2020 | 230553 | 58574 | 25.4 | 0.53 | 0.13 |
| 5 | Angola | 2 Feb 2021 | 399228 | 20981 | 5.3 | 1.3 | 0.067 |
| 6 | Antigua and Barbuda | 6 Mar 2021 | 15268 | 832 | 5.4 | 15.9 | 0.86 |
| 14 | Bangladesh | 24 Jul 2021 | 7417714 | 1151644 | 15.5 | 4.5 | 0.7 |
| 19 | Benin | 4 May 2021 | 595112 | 7884 | 1.3 | 5.1 | 0.067 |
| 25 | Brunei | 2 Aug 2021 | 153804 | 338 | 0.22 | 33.5 | 0.074 |
| 27 | Burkina Faso | 4 Mar 2021 | 158777 | 12123 | 7.6 | 0.76 | 0.058 |
| 28 | Burundi | 5 Jan 2021 | 90019 | 884 | 0.98 | 0.76 | 0.0074 |
| 29 | Cambodia | 1 Aug 2021 | 1812706 | 77914 | 4.3 | 11.2 | 0.48 |
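
Filtering with a bare logical condition returns an all-`NA` row for every country whose ratio is missing. Wrapping the condition in `which` is one way to drop those rows. The sketch below uses four ratios from the processed data frame plus one hypothetical row ("Nowhere") with a missing ratio, and keeps the threshold in a variable so it can be changed in one place.

```r
# Four ratios from the data above plus a hypothetical row with a missing value
toy <- data.frame(country = c("Afghanistan", "Albania", "Algeria", "Andorra", "Nowhere"),
                  confirmed.population.ratio = c(0.13, 3.4, 0.13, 49.0, NA),
                  stringsAsFactors = FALSE)

threshold <- 1

# which() keeps only the indices where the comparison is TRUE, skipping NA
low_risk <- toy[which(toy$confirmed.population.ratio < threshold), ]
low_risk$country
```

`subset(toy, confirmed.population.ratio < threshold)` behaves the same way with respect to missing values.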
