# Covid-19 Testing Importance

## Introduction

I believe that testing is one of the most crucial parts of dealing with a viral epidemic. Testing helps us identify and isolate positive cases. The more tests are performed, the sooner positive cases can be isolated and kept from coming into contact with others, **slowing the rate of transmission**.  
  
**Date this notebook was written: 19/3/2020**  
  
**Disclaimer**: *I do not in any way want to point the finger at the governments or people of any country. It is not possible to know the reasons and circumstances that led to a lack of testing. My only goal is to see whether the data suggests that testing plays a major role in this specific epidemic.*

In [123]:
import requests
import re

import pandas as pd
import numpy as np

from bs4 import BeautifulSoup

## Data sources

The first thing we want to find is **how many tests** have been performed in each country. The data will be extracted from this page: https://ourworldindata.org/coronavirus-testing-source-data . Note that the date of the last report differs from country to country. This will be taken into account.

In [124]:
response = requests.get('https://ourworldindata.org/coronavirus-testing-source-data')
soup = BeautifulSoup(response.content, "html.parser")  # name the parser explicitly

table_soup = soup.find("div", {"class": "tableContainer"}).find_all("tr")[1:] # skip the headers

# RegEx to extract the country name, number of tests, and date (raw string, so the backslashes reach the regex engine unchanged)
reg = re.compile(r"<tr><td>(.+)</td><td>([\d,]+)</td><td>([\w\s]+)</td><td>.*</td><td>.*</td></tr>")
testing_data = [reg.match(str(row)).groups() for row in table_soup]

# Create the dataframe
testing_df = pd.DataFrame(testing_data, columns=["Country", "Tests", "Date"])
testing_df['Tests'] = testing_df['Tests'].str.replace(',','').astype(int)    # Transform the x,xxx string to integers
testing_df['Date'] = pd.to_datetime(testing_df['Date'])    # Cast Date from string to date type

# Set country name as index of dataframe
testing_df = testing_df.set_index('Country', drop=False)

# For Canada and Australia the source also lists specific regions.
# Since this notebook examines the country-level response, we consolidate them.
# Australia: add the Australian Capital Territory count to the New South Wales row
testing_df.at['Australia – New South Wales', 'Tests'] += testing_df.at['Australia – Government of the Australian Capital Territory', 'Tests']
testing_df = testing_df.drop('Australia – Government of the Australian Capital Territory')

# Canada: drop the regional rows and keep the country-level row
testing_df = testing_df[~testing_df.Country.str.contains("Canada –")]

# The US has two trackers; I keep the most recent one (at least at the time of writing)
testing_df = testing_df.drop('United States – CDC samples tested')

testing_df = testing_df.drop("Hong Kong")

# We also rename some countries to match the country names used in the Johns Hopkins dataset
testing_df = testing_df.rename({
    "Australia – New South Wales": "Australia",
    "China – Guangdong": "China",
    "United States – COVID-Tracking project": "US",
    "Czech Republic": "Czechia",
    "South Korea": "Korea, South",
    "Taiwan": "Taiwan*"
})
testing_df = testing_df.drop('Country', axis=1)

display(testing_df)

| Country | Tests | Date |
|---|---|---|
| Armenia | 694 | 2020-03-16 |
| Australia | 31635 | 2020-03-17 |
| Austria | 10278 | 2020-03-17 |
| Bahrain | 13553 | 2020-03-17 |
| Belarus | 16000 | 2020-03-16 |
| Belgium | 4225 | 2020-03-11 |
| Brazil | 2927 | 2020-03-13 |
| Canada | 38482 | 2020-03-17 |
| China | 320000 | 2020-02-24 |
| Colombia | 2571 | 2020-03-17 |
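
To make the row pattern concrete, here is a minimal sketch that applies it to a single hand-written row. The sample markup, the `16 Mar 2020` date format, and the names `sample_row` and `row_pattern` are assumptions for illustration; only the Armenia figures come from the table above.

```python
import re

# Hypothetical sample row: the real page's markup and date format may differ.
sample_row = (
    "<tr><td>Armenia</td><td>694</td><td>16 Mar 2020</td>"
    "<td>source</td><td>remarks</td></tr>"
)

# Country, comma-grouped test count, date, then two cells that are ignored.
row_pattern = re.compile(
    r"<tr><td>(.+)</td><td>([\d,]+)</td><td>([\w\s]+)</td><td>.*</td><td>.*</td></tr>"
)

match = row_pattern.match(sample_row)
if match is None:
    # Fail loudly instead of hitting an AttributeError on None
    raise ValueError("row does not have the expected five-cell layout")

country, tests, date = match.groups()
tests = int(tests.replace(",", ""))  # strip thousands separators before casting
print(country, tests, date)
```

Checking for `None` before calling `.groups()` is a small safeguard: if the page layout changes, the failure points at the offending row rather than at the list comprehension as a whole.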


Next, we need the confirmed cases, recovered cases, and deaths for each country above.  
The dataset used is the [Johns Hopkins one](https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv).
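
These time series files are in wide format: one row per region and one column per date, with labels such as `3/17/20`. As a rough sketch of how regional rows can be totalled per country and then read on a given date, here is a made-up miniature table (`Country X`, `Region A`, and all numbers are hypothetical):

```python
import pandas as pd

# Hypothetical miniature of the wide-format file; the values are invented.
sample = pd.DataFrame({
    "Province/State": ["Region A", "Region B", None],
    "Country/Region": ["Country X", "Country X", "Country Y"],
    "Lat": [0.0, 1.0, 2.0],
    "Long": [0.0, 1.0, 2.0],
    "3/16/20": [10, 5, 7],
    "3/17/20": [12, 6, 9],
})

# Keep only the country key and the date columns, then total the regions per country.
by_country = (
    sample.drop(columns=["Province/State", "Lat", "Long"])
          .groupby("Country/Region")
          .sum()
)

# Read one country's count on the date of its last testing report.
count_on_date = by_country.at["Country X", "3/17/20"]
print(count_on_date)
```

Dropping the non-numeric `Province/State` column before summing keeps the aggregation restricted to the per-date counts.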


In [129]:
def transform_date_to_str(date_str):
    # Convert e.g. '2020-03-07 00:00:00' to the non-padded 'm/d/yy' column label '3/7/20'
    date = pd.Timestamp(date_str)
    return str(date.month) + "/" + str(date.day) + "/" + date.strftime("%y")

def read_john_hopkins_dataset(url, column_name):
    john_hopkins_df = pd.read_csv(url, index_col='Country/Region')
    # Keep only the per-date count columns
    john_hopkins_df = john_hopkins_df.drop(columns=['Province/State', 'Lat', 'Long'], errors='ignore')

    # Sum the counts for countries that are reported by region, e.g. US and China
    john_hopkins_df = john_hopkins_df.groupby(level='Country/Region').sum()

    testing_df[column_name] = np.nan
    for index, row in testing_df.iterrows():
        if index == 'Palestine':
            # Hard-coded because Palestine is not included in the Johns Hopkins dataset.
            # Note that the same value is written for whichever column_name is being filled.
            testing_df.at[index, column_name] = 39
        else:
            date_label = transform_date_to_str(str(row['Date']))
            testing_df.at[index, column_name] = john_hopkins_df.at[index, date_label]

read_john_hopkins_dataset('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv',
                          "Confirmed")

read_john_hopkins_dataset('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv',
                          "Recovered")   

read_john_hopkins_dataset('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv',
                          "Deaths")

display(testing_df)

| Country | Tests | Date | Confirmed | Recovered | Deaths |
|---|---|---|---|---|---|
| Armenia | 694 | 2020-03-16 | 52.0 | 0.0 | 0.0 |
| Australia | 31635 | 2020-03-17 | 452.0 | 23.0 | 5.0 |
| Austria | 10278 | 2020-03-17 | 1332.0 | 1.0 | 3.0 |
| Bahrain | 13553 | 2020-03-17 | 228.0 | 81.0 | 1.0 |
| Belarus | 16000 | 2020-03-16 | 36.0 | 3.0 | 0.0 |
| Belgium | 4225 | 2020-03-11 | 314.0 | 1.0 | 3.0 |
| Brazil | 2927 | 2020-03-13 | 151.0 | 0.0 | 0.0 |
| Canada | 38482 | 2020-03-17 | 478.0 | 9.0 | 5.0 |
| China | 320000 | 2020-02-24 | 77241.0 | 25015.0 | 2595.0 |
| Colombia | 2571 | 2020-03-17 | 65.0 | 1.0 | 0.0 |
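
Each value above is read from the column whose label matches the country's own report date, so the lookup depends on turning a timestamp into the non-padded `m/d/yy` label. A quick standalone check of that label format, using only the standard library (the helper name `to_column_label` is hypothetical and not part of the notebook):

```python
from datetime import datetime

def to_column_label(timestamp_text):
    """Turn a 'YYYY-MM-DD HH:MM:SS' string into a non-padded 'm/d/yy' label."""
    parsed = datetime.strptime(timestamp_text, "%Y-%m-%d %H:%M:%S")
    # month and day are ints, so they carry no leading zero
    return f"{parsed.month}/{parsed.day}/{parsed.strftime('%y')}"

single_digit_day = to_column_label("2020-03-07 00:00:00")
two_digit_day = to_column_label("2020-02-24 00:00:00")
print(single_digit_day, two_digit_day)
```

Building the label from integer `month` and `day` fields avoids slicing the string by position, which only works while the month has a single digit.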
