### Investigating COVID-19 Virus Trends

#### Guided Project Introduction

The purpose of this Guided Project is to build our skills and understanding of the data analysis workflow by evaluating the COVID-19 situation through this dataset.

We will try to answer the question: Which countries have reported the highest number of positive cases in relation to the number of tests conducted?

#### Understanding the Data

In [None]:
%load_ext rpy2.ipython

In [None]:
%%R
library("tidyverse")

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [None]:
%%R
# Downloading
covid_df <- read.csv("https://raw.githubusercontent.com/Sat0ri1/MojeRep_PG/main/covid19.csv")

In [None]:
%%R
# Printing dimension
dim(covid_df)

[1] 10903    14


In [None]:
%%R
# Printing col names
colnames(covid_df)

 [1] "Date"                    "Continent_Name"         
 [3] "Two_Letter_Country_Code" "Country_Region"         
 [5] "Province_State"          "positive"               
 [7] "hospitalized"            "recovered"              
 [9] "death"                   "total_tested"           
[11] "active"                  "hospitalizedCurr"       
[13] "daily_tested"            "daily_positive"         


In [None]:
%%R
vector_cols <- colnames(covid_df)
typeof(vector_cols)

[1] "character"


In [None]:
%%R
head(covid_df)

        Date Continent_Name Two_Letter_Country_Code Country_Region
1 2020-01-20           Asia                      KR    South Korea
2 2020-01-22  North America                      US  United States
3 2020-01-22  North America                      US  United States
4 2020-01-23  North America                      US  United States
5 2020-01-23  North America                      US  United States
6 2020-01-24           Asia                      KR    South Korea
  Province_State positive hospitalized recovered death total_tested active
1     All States        1            0         0     0            4      0
2     All States        1            0         0     0            1      0
3     Washington        1            0         0     0            1      0
4     All States        1            0         0     0            1      0
5     Washington        1            0         0     0            1      0
6     All States        2            0         0     0           27      0
  hosp

In [None]:
%%R
library("tibble")

In [None]:
%%R
glimpse(covid_df)

Rows: 10,903
Columns: 14
$ Date                    <chr> "2020-01-20", "2020-01-22", "2020-01-22", "202…
$ Continent_Name          <chr> "Asia", "North America", "North America", "Nor…
$ Two_Letter_Country_Code <chr> "KR", "US", "US", "US", "US", "KR", "US", "US"…
$ Country_Region          <chr> "South Korea", "United States", "United States…
$ Province_State          <chr> "All States", "All States", "Washington", "All…
$ positive                <int> 1, 1, 1, 1, 1, 2, 1, 1, 4, 0, 3, 0, 0, 0, 0, 1…
$ hospitalized            <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ recovered               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ death                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ total_tested            <int> 4, 1, 1, 1, 1, 27, 1, 1, 0, 0, 0, 0, 0, 0, 0, …
$ active                  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ hospitalizedCurr        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ daily_tested 

We can see that the dataset contains 14 columns and 10,903 rows.

The `vector_cols` variable is a character vector: `typeof()` returns `"character"`.
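As a small illustration of what `typeof()` reports, the sketch below compares it with `class()` and `is.vector()` on the column names of a data frame. `toy_df` and `toy_cols` are hypothetical names introduced only for this example and are not part of the project data.

```r
# Hypothetical toy data frame, used only for illustration
toy_df <- data.frame(Date = "2020-01-20", positive = 1L)

toy_cols <- colnames(toy_df)

typeof(toy_cols)     # storage type of the elements: "character"
class(toy_cols)      # also "character" for a plain character vector
is.vector(toy_cols)  # TRUE: colnames() returns a vector, not a list
length(toy_cols)     # one entry per column
typeof(toy_df)       # a data frame is stored as a "list" of columns
```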


The `glimpse()` function lists every column of the dataset with its type and as many of its first values as fit on a single line.

#### Isolating the Rows We Need

The `Province_State` column mixes geographical data from different levels: country level and state/province level. Because mixing these levels in one analysis would make the totals unreliable, we keep only the country-level rows, which are marked `All States`.
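To make the problem concrete, here is a minimal sketch with made-up numbers. It assumes that an `All States` row already aggregates the state-level rows of the same country and day. The data frame `toy_levels` and its values are hypothetical and do not come from `covid19.csv`.

```r
# Hypothetical numbers for one country on one day (not taken from covid19.csv)
toy_levels <- data.frame(
  Country_Region = c("United States", "United States", "United States"),
  Province_State = c("All States", "Washington", "New York"),
  daily_positive = c(10L, 4L, 6L)
)

# Summing every row counts the state-level cases twice
total_all_rows <- sum(toy_levels$daily_positive)   # 20

# Keeping only the country-level rows gives the intended total
country_level <- toy_levels[toy_levels$Province_State == "All States", ]
total_country <- sum(country_level$daily_positive) # 10
```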

In [None]:
%%R
# Filtering rows: keep only the country-level records
covid_df_all_states <- covid_df %>% filter(Province_State == "All States")
head(covid_df_all_states)

        Date Continent_Name Two_Letter_Country_Code Country_Region
1 2020-01-20           Asia                      KR    South Korea
2 2020-01-22  North America                      US  United States
3 2020-01-23  North America                      US  United States
4 2020-01-24           Asia                      KR    South Korea
5 2020-01-24  North America                      US  United States
6 2020-01-25        Oceania                      AU      Australia
  Province_State positive hospitalized recovered death total_tested active
1     All States        1            0         0     0            4      0
2     All States        1            0         0     0            1      0
3     All States        1            0         0     0            1      0
4     All States        2            0         0     0           27      0
5     All States        1            0         0     0            1      0
6     All States        4            0         0     0            0      0
  hosp

We can safely remove the `Province_State` column because, after filtering, it contains only the value `All States`.

In [None]:
%%R
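covid_df_all_states <- covid_df_all_states %>% select(-Province_State)  # now constant: "All States"
covid_df <- covid_df %>% select(-Province_State)  # also dropped from the full data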


#### Isolating the Columns We Need

In [None]:
%%R
head(covid_df)

        Date Continent_Name Two_Letter_Country_Code Country_Region positive
1 2020-01-20           Asia                      KR    South Korea        1
2 2020-01-22  North America                      US  United States        1
3 2020-01-22  North America                      US  United States        1
4 2020-01-23  North America                      US  United States        1
5 2020-01-23  North America                      US  United States        1
6 2020-01-24           Asia                      KR    South Korea        2
  hospitalized recovered death total_tested active hospitalizedCurr
1            0         0     0            4      0                0
2            0         0     0            1      0                0
3            0         0     0            1      0                0
4            0         0     0            1      0                0
5            0         0     0            1      0                0
6            0         0     0           27      0          

Looking at the columns again, we can see two kinds of data: cumulative counts and daily counts. The two should not be mixed in one analysis, so we work with only one of them. We pick the daily data and separate it from the cumulative columns.
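The difference between the two kinds of columns can be sketched with a few made-up numbers. The vector `daily_positive_toy` below is hypothetical. The sketch assumes a cumulative column is simply the running total of the matching daily column.

```r
# Hypothetical daily counts for one country over four days
daily_positive_toy <- c(1L, 0L, 3L, 2L)

# A cumulative column is the running total of the daily column
cumulative_positive_toy <- cumsum(daily_positive_toy)
cumulative_positive_toy           # 1 1 4 6

# Summing the daily values gives the total over the period
sum(daily_positive_toy)           # 6

# Summing the cumulative values counts early cases repeatedly
sum(cumulative_positive_toy)      # 12

# The last cumulative value already is the total
tail(cumulative_positive_toy, 1)  # 6
```

This is why the later `sum()` calls are applied to the daily columns rather than to the cumulative ones.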

In [None]:
%%R
# Extracting daily data to covid_df_all_states_daily variable
covid_df_all_states_daily <- covid_df_all_states %>% select(Date, Country_Region, active, hospitalizedCurr, daily_tested, daily_positive)

#### Extracting the Top Ten Countries by Number of Tests

In this step we sum the daily figures for each country and extract the ten countries that conducted the most tests.

In [None]:
%%R
# Summing the daily values per country, then ordering by total tests (highest first)
covid_df_all_states_daily_sum <- covid_df_all_states_daily %>%
  group_by(Country_Region) %>%
  summarise(tested = sum(daily_tested),
            positive = sum(daily_positive),
            active = sum(active),
            hospitalized = sum(hospitalizedCurr)) %>%
  arrange(desc(tested))

covid_df_all_states_daily_sum

# A tibble: 108 × 5
   Country_Region   tested positive  active hospitalized
   <chr>             <int>    <int>   <int>        <int>
 1 United States  17282363  1877179       0            0
 2 Russia         10542266   406368 6924890            0
 3 Italy           4091291   251710 6202214      1699003
 4 India           3692851    60959       0            0
 5 Turkey          2031192   163941 2980960            0
 6 Canada          1654779    90873   56454            0
 7 United Kingdom  1473672   166909       0            0
 8 Australia       1252900     7200  134586         6655
 9 Peru             976790    59497       0            0
10 Poland           928256    23987  538203            0
# ℹ 98 more rows
# ℹ Use `print(n = ...)` to see more rows


In [None]:
%%R
covid_top_10 <- head(covid_df_all_states_daily_sum, 10)
covid_top_10

# A tibble: 10 × 5
   Country_Region   tested positive  active hospitalized
   <chr>             <int>    <int>   <int>        <int>
 1 United States  17282363  1877179       0            0
 2 Russia         10542266   406368 6924890            0
 3 Italy           4091291   251710 6202214      1699003
 4 India           3692851    60959       0            0
 5 Turkey          2031192   163941 2980960            0
 6 Canada          1654779    90873   56454            0
 7 United Kingdom  1473672   166909       0            0
 8 Australia       1252900     7200  134586         6655
 9 Peru             976790    59497       0            0
10 Poland           928256    23987  538203            0
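`head()` works here because the table is already sorted by `tested`. As an alternative sketch, dplyr's `slice_max()` picks the top rows without relying on a prior `arrange()`. The data frame `toy_sum` is hypothetical, although its three rows reuse figures from the table above.

```r
library(dplyr)

# Hypothetical three-row summary, deliberately left unsorted
toy_sum <- data.frame(
  Country_Region = c("Poland", "United States", "Russia"),
  tested = c(928256, 17282363, 10542266)
)

# Keep the two rows with the largest `tested`, highest first
top_2 <- toy_sum %>% slice_max(tested, n = 2)
top_2$Country_Region
```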


#### Identifying the Highest Positive Against Tested Cases

**Which countries have had the highest number of positive cases against the number of tests?**

We will try to answer this question because it gives a more representative view of the trends than raw counts do. The totals above mainly suggest a positive correlation between the number of tests and the number of positive cases, so the ratio of the two is more informative.

In [None]:
%%R
countries <- covid_top_10$Country_Region
tested_cases <- covid_top_10$tested
positive_cases <- covid_top_10$positive
active_cases <- covid_top_10$active
hospitalized_cases <-  covid_top_10$hospitalized

In [None]:
%%R
names(tested_cases) <- countries
names(positive_cases) <- countries
names(active_cases) <- countries
names(hospitalized_cases) <- countries

In [None]:
%%R
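# Ratio of positive cases to tests for each of the top ten countries
cos <- positive_cases / tested_cases
# Sort from highest to lowest ratio and keep the first three
cos <- sort(cos, decreasing = TRUE)
positive_tested_top_3 <- cos[1:3]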

In [None]:
%%R
positive_tested_top_3

United Kingdom  United States         Turkey 
    0.11326062     0.10861819     0.08071172 


Among the ten countries that conducted the most tests, the UK, the US and Turkey have the highest ratios of positive cases to tests.
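The same ranking can also be computed without leaving the data frame. The sketch below is one possible alternative, not the notebook's method: it rebuilds the `tested` and `positive` totals from the top-ten table shown earlier in a hypothetical data frame `top_10_toy`, then uses dplyr's `mutate()` and `arrange()`.

```r
library(dplyr)

# Totals copied from the top-ten table shown earlier
top_10_toy <- data.frame(
  Country_Region = c("United States", "Russia", "Italy", "India", "Turkey",
                     "Canada", "United Kingdom", "Australia", "Peru", "Poland"),
  tested = c(17282363, 10542266, 4091291, 3692851, 2031192,
             1654779, 1473672, 1252900, 976790, 928256),
  positive = c(1877179, 406368, 251710, 60959, 163941,
               90873, 166909, 7200, 59497, 23987)
)

# Add the positive-to-tested ratio as a column and rank by it
ratio_top_3 <- top_10_toy %>%
  mutate(ratio = positive / tested) %>%
  arrange(desc(ratio)) %>%
  head(3)

ratio_top_3$Country_Region
```

Keeping the ratio as a column means the country name, the totals and the ratio stay on the same row, so nothing has to be retyped later.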

#### Keeping Relevant Information

Our goal is to keep all the available information for the three countries with the highest ratio of positive cases to tests carried out.

In [None]:
%%R
# Creating vectors: ratio (truncated to two decimals), tested, positive, active, hospitalized
united_kingdom <- c(0.11, 1473672, 166909, 0, 0)
united_states  <- c(0.10, 17282363, 1877179, 0, 0)
turkey <- c(0.08, 2031192, 163941, 2980960, 0)

# Creating matrix: one row per country
covid_mat <- rbind(united_kingdom, united_states, turkey)
covid_mat

               [,1]     [,2]    [,3]    [,4] [,5]
united_kingdom 0.11  1473672  166909       0    0
united_states  0.10 17282363 1877179       0    0
turkey         0.08  2031192  163941 2980960    0


In [None]:
%%R
# Renaming matrix columns
colnames(covid_mat) <- c("Ratio", "tested", "positive", "active", "hospitalized")

In [None]:
%%R
covid_mat

               Ratio   tested positive  active hospitalized
united_kingdom  0.11  1473672   166909       0            0
united_states   0.10 17282363  1877179       0            0
turkey          0.08  2031192   163941 2980960            0
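Typing the numbers by hand is easy to get wrong, so a similar matrix can also be assembled from vectors. The following is a minimal sketch under that idea. The names `countries_toy`, `tested_toy`, `positive_toy` and `ratio_mat` are hypothetical, the totals are copied from the top-ten table, and the ratio is left unrounded, so it differs from the two-decimal values above.

```r
# Totals for the three countries, copied from the top-ten table
countries_toy <- c("United Kingdom", "United States", "Turkey")
tested_toy    <- c(1473672, 17282363, 2031192)
positive_toy  <- c(166909, 1877179, 163941)

# cbind() turns each vector into a column; argument names become column names
ratio_mat <- cbind(Ratio = positive_toy / tested_toy,
                   tested = tested_toy,
                   positive = positive_toy)
rownames(ratio_mat) <- countries_toy

dim(ratio_mat)                 # 3 rows, 3 columns
typeof(ratio_mat)              # "double": every cell of a matrix shares one type
ratio_mat["Turkey", "tested"]  # look up a cell by row and column name
```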


#### Putting It All Together

Our goal here is to put all our answers and datasets together. Since a list can contain several types of objects, we are able to store all the data of our project together. This allows us to have a global view from a single variable and the ability to export our results for other uses.

In [None]:
%%R
question <- "Which countries have had the highest number of positive cases against the number of tests?"

In [None]:
%%R
answer <- c("Positive tested cases" = positive_tested_top_3)
answer

Positive tested cases.United Kingdom  Positive tested cases.United States 
                          0.11326062                           0.10861819 
        Positive tested cases.Turkey 
                          0.08071172 


In [None]:
%%R
list1 <- list(head(covid_df))
list2 <- list(covid_df_all_states_daily_sum)
list3 <- list(positive_tested_top_3)

In [None]:
%%R
list1 <- list(
  pierwotna = covid_df,              # "original": all rows of the dataset
  stany = covid_df_all_states,       # "states": country-level rows only
  dzien = covid_df_all_states_daily, # "day": daily columns
  top10 = covid_top_10
)

In [None]:
%%R
mat <- list(covid_mat)
vector <- list(vector_cols, countries)

In [None]:
%%R
data_structure_list <- list("df" = list1, "mat" = mat, "v" = vector)

In [None]:
%%R
covid_analysis_list <- list(question, answer, data_structure_list)

In [None]:
%%R
covid_analysis_list[[2]]

Positive tested cases.United Kingdom  Positive tested cases.United States 
                          0.11326062                           0.10861819 
        Positive tested cases.Turkey 
                          0.08071172 
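Positional access such as `[[2]]` works, but it requires remembering the order of the elements. As a hedged sketch of an alternative, naming the list elements lets them be retrieved by name. `named_list`, `question_toy` and `answer_toy` are hypothetical stand-ins for the project objects. The ratios are copied from the output above.

```r
# Hypothetical stand-ins for the project objects
question_toy <- "Which countries have had the highest number of positive cases against the number of tests?"
answer_toy <- c("United Kingdom" = 0.11326062,
                "United States" = 0.10861819,
                "Turkey" = 0.08071172)

# Naming the elements makes them reachable by name as well as by position
named_list <- list(question = question_toy, answer = answer_toy)

named_list$answer            # same element as named_list[[2]]
names(named_list$answer)[1]  # "United Kingdom"
```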


Based on this analysis, the United Kingdom, the United States, and Turkey had the highest ratio of positive cases to tests among the ten countries that conducted the most tests. With that knowledge, we can look into the possible reasons behind these ratios, which could help in choosing better strategies to limit the spread of COVID-19 and other viruses in future pandemics.