# Introduction

A pneumonia of unknown cause, detected in Wuhan, China, was first reported internationally on 31 December 2019. Today we know the virus responsible as SARS-CoV-2, a coronavirus. COVID-19, which stands for COronaVIrus Disease 2019, is the disease caused by this virus. Since then, the world has been engaged in the fight against this pandemic. Several measures, such as social distancing, have been taken to "flatten the curve", and unfortunately, many lives have been lost.

In solidarity against this unprecedented global crisis, several organizations have shared datasets that allow various kinds of analysis aimed at understanding the pandemic.

# Objective

The objective of this analysis is to identify the countries with the highest number of positive cases in relation to the number of tests conducted.

# Data

You can find the data [here](https://dq-content.s3.amazonaws.com/505/covid19.csv).

# Data Exploration

In [None]:
install.packages('readr')

Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [None]:
library(readr)
#note: read.csv() is the base R reader; readr's equivalent is read_csv()
covid_data <- read.csv("covid19.csv")


To understand the data, we begin by observing its dimensions, structure, and column names, as well as obtaining a general summary of the data.

In [None]:
#get the dimension of the data
dimensions <- dim(covid_data)
dimensions


10903 14

In [None]:
#check out the column names of the data
column_names <- colnames(covid_data)
column_names


In [None]:
#showing the first few rows of the data
head_data <- head(covid_data)
head_data


,Date,Continent_Name,Two_Letter_Country_Code,Country_Region,Province_State,positive,hospitalized,recovered,death,total_tested,active,hospitalizedCurr,daily_tested,daily_positive
,<chr>,<chr>,<chr>,<chr>,<chr>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1,2020-01-20,Asia,KR,South Korea,All States,1,0,0,0,4,0,0,0,0
2,2020-01-22,North America,US,United States,All States,1,0,0,0,1,0,0,0,0
3,2020-01-22,North America,US,United States,Washington,1,0,0,0,1,0,0,0,0
4,2020-01-23,North America,US,United States,All States,1,0,0,0,1,0,0,0,0
5,2020-01-23,North America,US,United States,Washington,1,0,0,0,1,0,0,0,0
6,2020-01-24,Asia,KR,South Korea,All States,2,0,0,0,27,0,0,5,0


In [None]:
#Display a summary of the data using the glimpse() function
library(tibble)

#glimpse() prints the summary itself, so its result does not need to be stored and printed again
glimpse(covid_data)


Rows: 10,903
Columns: 14
$ Date                    <chr> "2020-01-20", "2020-01-22", "2020-01-22", "202…
$ Continent_Name          <chr> "Asia", "North America", "North America", "Nor…
$ Two_Letter_Country_Code <chr> "KR", "US", "US", "US", "US", "KR", "US", "US"…
$ Country_Region          <chr> "South Korea", "United States", "United States…
$ Province_State          <chr> "All States", "All States", "Washington", "All…
$ positive                <int> 1, 1, 1, 1, 1, 2, 1, 1, 4, 0, 3, 0, 0, 0, 0, 1…
$ hospitalized            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ recovered               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ death                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ total_tested            <int> 4, 1, 1, 1, 1, 27, 1, 1, 0, 0, 0,


# Filtering Relevant Rows

In [None]:
library(dplyr)


In [None]:
#filter the data to select rows with "All States" in the Province_State column, and remove the column
filtered_data <- covid_data %>%
  filter(Province_State == "All States") %>%
  select(-Province_State)


After filtering, the "Province_State" column contains only the value "All States", so it can be removed without losing any information.
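One way to check this is to look at the distinct values left in the column once the filter has been applied. The sketch below uses a small made-up data frame (`toy_data` and its values are hypothetical, not taken from the real dataset) that only mimics the layout of `covid19.csv`.

```r
library(dplyr)

# Hypothetical toy data mimicking the layout of covid19.csv (values are made up)
toy_data <- data.frame(
  Country_Region = c("United States", "United States", "South Korea"),
  Province_State = c("All States", "Washington", "All States"),
  daily_tested   = c(10L, 4L, 5L)
)

# Keep only the country-level rows
toy_filtered <- toy_data %>%
  filter(Province_State == "All States")

# After filtering, the column holds a single distinct value,
# so dropping it loses no information
remaining_values <- unique(toy_filtered$Province_State)
remaining_values

toy_filtered <- toy_filtered %>%
  select(-Province_State)
```

Running the same `unique()` check on `covid_data` after the filter step should likewise return a single value.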

# Selecting Relevant Columns

In [None]:
selected_columns <- filtered_data %>%
  select(Date, Country_Region, active, hospitalizedCurr, daily_tested, daily_positive)

head_selected_columns <- head(selected_columns)
head_selected_columns


,Date,Country_Region,active,hospitalizedCurr,daily_tested,daily_positive
,<chr>,<chr>,<int>,<int>,<int>,<int>
1,2020-01-20,South Korea,0,0,0,0
2,2020-01-22,United States,0,0,0,0
3,2020-01-23,United States,0,0,0,0
4,2020-01-24,South Korea,0,0,5,0
5,2020-01-24,United States,0,0,0,0
6,2020-01-25,Australia,0,0,0,0


# Extracting the Top Ten Countries by Number of Tests

In [None]:
summary_data <- selected_columns %>%
  group_by(Country_Region) %>%
  summarise(tested = sum(daily_tested),
            positive = sum(daily_positive),
            active = sum(active),
            hospitalized = sum(hospitalizedCurr)) %>%
  arrange(desc(tested))

summary_data


Country_Region,tested,positive,active,hospitalized
<chr>,<int>,<int>,<int>,<int>
United States,17282363,1877179,0,0
Russia,10542266,406368,6924890,0
Italy,4091291,251710,6202214,1699003
India,3692851,60959,0,0
Turkey,2031192,163941,2980960,0
Canada,1654779,90873,56454,0
United Kingdom,1473672,166909,0,0
Australia,1252900,7200,134586,6655
Peru,976790,59497,0,0
Poland,928256,23987,538203,0


In [None]:
top_10_countries <- head(summary_data, 10)

top_10_countries


Country_Region,tested,positive,active,hospitalized
<chr>,<int>,<int>,<int>,<int>
United States,17282363,1877179,0,0
Russia,10542266,406368,6924890,0
Italy,4091291,251710,6202214,1699003
India,3692851,60959,0,0
Turkey,2031192,163941,2980960,0
Canada,1654779,90873,56454,0
United Kingdom,1473672,166909,0,0
Australia,1252900,7200,134586,6655
Peru,976790,59497,0,0
Poland,928256,23987,538203,0
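The aggregate, sort, and take-the-first-rows steps can also be combined into a single pipeline. The following is only a sketch on made-up data (`toy_daily` and its values are hypothetical), and it assumes dplyr 1.0.0 or later, which provides `slice_max()`.

```r
library(dplyr)

# Hypothetical daily records for three countries (values are made up)
toy_daily <- data.frame(
  Country_Region = c("A", "A", "B", "B", "C"),
  daily_tested   = c(10L, 20L, 5L, 5L, 50L),
  daily_positive = c(1L, 2L, 0L, 1L, 5L)
)

# Sum the daily counts per country, then keep the two countries
# with the largest number of tests
toy_top_2 <- toy_daily %>%
  group_by(Country_Region) %>%
  summarise(tested = sum(daily_tested),
            positive = sum(daily_positive)) %>%
  slice_max(tested, n = 2)

toy_top_2
```

Note that `slice_max()` keeps ties by default, so it can return more than `n` rows when several countries share the same total.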


# Identifying the Highest Ratio of Positive Cases to Tested Cases

In [None]:
#creating vectors
countries <- top_10_countries$Country_Region
tested_cases <- top_10_countries$tested
positive_cases <- top_10_countries$positive
active_cases <- top_10_countries$active
hospitalized_cases <- top_10_countries$hospitalized


In [None]:
#naming the vectors using names() function
names(positive_cases) <- countries
names(tested_cases) <- countries
names(active_cases) <- countries
names(hospitalized_cases) <- countries


In [None]:
#identifying the ratio of positive cases to tested cases
positive_cases

total_positive_cases <- sum(positive_cases)
total_positive_cases

average_positive_cases <- mean(positive_cases)
average_positive_cases

positive_cases_ratio <- positive_cases / total_positive_cases
positive_cases_ratio

positive_cases_ratio_to_tested <- positive_cases / tested_cases
positive_cases_ratio_to_tested
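The countries with the highest ratios can be read off by eye, but they can also be extracted programmatically. Here is a minimal sketch that sorts the ratio vector and keeps the first three entries; the totals are copied from the `top_10_countries` table above so that the block runs on its own, and `ratio` and `top_3_ratio` are names introduced only for this illustration.

```r
# Totals copied from the top_10_countries table above
countries <- c("United States", "Russia", "Italy", "India", "Turkey",
               "Canada", "United Kingdom", "Australia", "Peru", "Poland")
tested_cases <- c(17282363, 10542266, 4091291, 3692851, 2031192,
                  1654779, 1473672, 1252900, 976790, 928256)
positive_cases <- c(1877179, 406368, 251710, 60959, 163941,
                    90873, 166909, 7200, 59497, 23987)
names(tested_cases) <- countries
names(positive_cases) <- countries

# Element-wise division keeps the country names
ratio <- positive_cases / tested_cases

# Sort from highest to lowest ratio and keep the first three
top_3_ratio <- head(sort(ratio, decreasing = TRUE), 3)
round(top_3_ratio, 2)
```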


In [None]:
#creating a vector with the top three countries with the highest ratio of positive cases to tested cases
#(values taken from positive_cases_ratio_to_tested, rounded to two decimal places)
top_3_positive_tested_countries <- c("United Kingdom" = 0.11, "United States" = 0.11, "Turkey" = 0.08)


# Organizing Relevant Information

In [None]:
#creating vectors
united_kingdom <- c(0.11, 1473672, 166909, 0, 0)
united_states <- c(0.11, 17282363, 1877179, 0, 0)
turkey <- c(0.08, 2031192, 163941, 2980960, 0)


In [None]:
#creating a matrix
covid_matrix <- rbind(united_kingdom, united_states, turkey)


In [None]:
#rename the columns of the matrix (colnames() is needed here; names() does not set matrix column names)
colnames(covid_matrix) <- c("Ratio", "Tested", "Positive", "Active", "Hospitalized")

covid_matrix


,Ratio,Tested,Positive,Active,Hospitalized
united_kingdom,0.11,1473672,166909,0,0
united_states,0.11,17282363,1877179,0,0
turkey,0.08,2031192,163941,2980960,0
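For a matrix built with `rbind()`, column labels are set with `colnames()` and row labels with `rownames()`, whereas `names()` attaches names to individual elements. A minimal sketch with a made-up matrix (`toy_matrix` and its values are hypothetical) shows the labelled result and how it can be indexed:

```r
# Hypothetical 2 x 3 matrix, unrelated to the real figures
row_a <- c(0.5, 10, 5)
row_b <- c(0.2, 50, 10)

# rbind() uses the argument names as row names
toy_matrix <- rbind(row_a, row_b)

# Label the columns
colnames(toy_matrix) <- c("Ratio", "Tested", "Positive")

# Cells can now be addressed by row and column label
toy_matrix["row_b", "Tested"]
```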


# Summary

In [None]:
question <- "Which countries have had the highest number of positive cases against the number of tests?"



In [None]:
answer <- c("Countries with the highest positive cases to tested cases ratio" = top_3_positive_tested_countries)



In [None]:
#creating a list of the dataframes used
dataset_list <- list(
  original_data = covid_data,
  filtered_data = filtered_data,
  selected_columns = selected_columns,
  top_10_countries = top_10_countries
)


In [None]:
#creating a list of matrices used
matrices_list <- list(covid_matrix)

#creating a list of vectors used
vector_list <- list(column_names, countries)

#creating a data structure that contains the lists above
data_structure_list <- list("dataframe" = dataset_list, "matrix" = matrices_list, "vector" = vector_list)



In [None]:
covid_analysis_list <- list(question, answer, data_structure_list)
covid_analysis_list[[2]]
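Accessing list elements by position works, but it becomes hard to read as the list grows. As a sketch only, the hypothetical miniature below (`toy_analysis` and its contents are made up) names the top-level elements so they can be retrieved by name as well as by position.

```r
# Hypothetical miniature of a nested analysis list (contents are made up)
toy_question <- "Which country has the highest ratio?"
toy_answer <- c("Country A" = 0.5)
toy_structures <- list(vector = list(c("x", "y")))

toy_analysis <- list(question = toy_question,
                     answer = toy_answer,
                     data = toy_structures)

# Access by position
toy_analysis[[2]]

# Same element, accessed by name
toy_analysis$answer

# Drill into the nested lists
toy_analysis$data$vector[[1]]
```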


# Conclusion

From our analysis, among the ten countries that conducted the most tests, the United Kingdom, the United States, and Turkey have the highest ratio of positive cases to tests conducted.