# Task 1
Over the last couple of years we have been bombarded by a constant stream of COVID-19 related numbers: infections detected,
tests performed, ICU admissions, and so on. Unsurprisingly, this generated a lot of interest in consolidating such data to model
and predict how specific measures could affect outcomes, in order to minimize suffering and limit economic costs. It also translated into
many efforts to visualize and communicate these numbers and predictions to the public and decision makers.

*Dataset*

In order to have a meaningful conversation around data, and to give you a chance to briefly showcase your technical
competences, we chose to focus on a dataset covering world-wide testing for COVID-19.
The data is openly available from the European Centre for Disease Prevention and Control here:
https://www.ecdc.europa.eu/en/publications-data/covid-19-testing
Please use the attached CSV file as your data source for the following tasks. However, you are encouraged to visit the website
above to read about how the data is collected and consult the Data dictionary provided by ECDC here:
https://www.ecdc.europa.eu/sites/default/files/documents/covid-19-variable-dictionary-and-disclaimer-weekly-testing-data.pdf

*Goal*

Convey the development of national testing, at monthly time granularity, for the following countries: Denmark, Germany, Italy,
Spain and Sweden.

## Import libraries

In [1]:
#Import libraries
import pandas as pd
import datetime as dt
from pandas import Series
from math import pi
from bokeh.models.widgets import Panel, Tabs
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, FactorRange, Legend
from bokeh.plotting import figure
from bokeh import palettes


## Utils

In [2]:

class utils(object):
  '''
  Wraps the utility functions used throughout the notebook.

  There is no __init__ because the class is only used as a namespace.

  functions:
    datetime_formating
    create_bokeh_bars
  '''
  @staticmethod
  def datetime_formating(timestamp: str) -> pd.Timestamp:
    '''
    Converts a string in the format yyyy-Www to the first day of the month
    that contains the Monday of that week.
    The same parsing could also be done with a regular expression.

    Note: %W numbers weeks from the first Monday of the year, which is not
    the same as ISO 8601 week numbering.

    arg:
          timestamp: string containing the week in the format yyyy-Www

    output:
          pandas Timestamp set to the first day of the month
    '''
    # Parse the Monday of the week ('-1' selects weekday 1 for %w).
    monday = dt.datetime.strptime(timestamp + '-1', "%Y-W%W-%w")
    # The two offsets snap the date to the first day of its month.
    return monday - pd.offsets.MonthEnd(0) - pd.offsets.MonthBegin(1)





  @staticmethod
  def create_bokeh_bars(df: pd.DataFrame, category: str):
    '''
    This function generates an interactive bar diagram using bokeh specific for this dataset.
    
    arg:
          df: Dataset parsed with Pandas
          category: The column or category of interest to analyse.

    output:
          p: Bokeh figure object
    '''
    # A list (not a set) keeps the country order, and therefore the colors, stable between runs.
    focuscountries = ['Denmark', 'Germany', 'Italy', 'Spain', 'Sweden']

    df_grouped = pd.pivot_table(df, index="timestamp",
                                columns="country", values=category, aggfunc='sum')

    # pivot_table turns the timestamp column into the index.
    # Bokeh needs it as a regular column, so copy it into 'year_month'.
    df_grouped['year_month'] = df_grouped.index.values
    
    source = ColumnDataSource(df_grouped)

    p = figure(x_range=list(df_grouped.index), 
              plot_height=400, plot_width=800, title= category,
              x_axis_label='year_month', y_axis_label= category)


    color = palettes.Category10[len(focuscountries)]
    bar ={} # to store vbars
    items = []
    ### here we will do a for loop:
    for indx,i in enumerate(focuscountries):
        bar[i] = p.vbar(x='year_month', 
                        source=source,
                        top=i,
                        width=0.9,
                        muted=True, 
                        muted_alpha=0.005,
                        color=color[indx],
                        fill_alpha=0.7,
                        line_alpha=0.7)
        items.append((i, [bar[i]]))
        
    legend = Legend(items=items)
    p.xaxis.major_label_orientation = pi/3
    p.add_layout(legend, 'left')
    p.legend.click_policy = 'mute'
    p.legend.label_text_font_size='7pt'
    return p
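
The `year_week` values follow the `yyyy-Www` pattern. If they are ISO 8601 weeks (an assumption worth checking against the ECDC data dictionary), the `%W` directive used in `datetime_formating` can disagree with them in some years, for example 2020, because `%W` counts weeks from the first Monday of the year. The following is a minimal sketch of an ISO-based alternative. The helper name `iso_week_to_month` is hypothetical, and assigning each week to the month of its Thursday is a choice made here, not the notebook's (the notebook anchors on Monday).

```python
import datetime as dt


def iso_week_to_month(year_week: str) -> str:
    """Map 'yyyy-Www' (assumed to be an ISO 8601 week) to 'yyyy-mm'.

    The week is assigned to the month containing its Thursday, the day
    ISO 8601 uses to decide which year a week belongs to.
    """
    # %G = ISO year, %V = ISO week number, %u = ISO weekday (4 = Thursday)
    thursday = dt.datetime.strptime(year_week + "-4", "%G-W%V-%u")
    return thursday.strftime("%Y-%m")


print(iso_week_to_month("2020-W01"))  # 2020-01
print(iso_week_to_month("2020-W53"))  # 2020-12
```

With this mapping, week 53 of 2020 stays in December 2020 instead of spilling into the following year.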



## Import Dataset

In [3]:
url = 'https://opendata.ecdc.europa.eu/covid19/testing/csv/data.csv'
df = pd.read_csv(url)

## Data Exploration

First, to get an overview of the dataset, a summary of its structure is generated.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4440 entries, 0 to 4439
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   country              4440 non-null   object 
 1   country_code         4440 non-null   object 
 2   year_week            4440 non-null   object 
 3   level                4440 non-null   object 
 4   region               4440 non-null   object 
 5   region_name          4440 non-null   object 
 6   new_cases            4011 non-null   float64
 7   tests_done           4012 non-null   float64
 8   population           4440 non-null   int64  
 9   testing_rate         4012 non-null   float64
 10  positivity_rate      3984 non-null   float64
 11  testing_data_source  4012 non-null   object 
dtypes: float64(4), int64(1), object(7)
memory usage: 416.4+ KB


The summary shows the shape of the dataset, the column names, the non-null counts and the dtypes.

The non-null counts show that several columns contain null values. However, further analysis is needed to understand the nature of these nulls.

The `object` columns most likely hold strings, since pandas stores strings under the NumPy object dtype by default.

In [5]:
df.head(10)

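|   | country | country_code | year_week | level | region | region_name | new_cases | tests_done | population | testing_rate | positivity_rate | testing_data_source |
|---|---------|--------------|-----------|-------|--------|-------------|-----------|------------|------------|--------------|-----------------|---------------------|
| 0 | Austria | AT | 2020-W01 | national | AT | Austria | NaN | NaN | 8932664 | NaN | NaN | NaN |
| 1 | Austria | AT | 2020-W02 | national | AT | Austria | NaN | NaN | 8932664 | NaN | NaN | NaN |
| 2 | Austria | AT | 2020-W03 | national | AT | Austria | NaN | NaN | 8932664 | NaN | NaN | NaN |
| 3 | Austria | AT | 2020-W04 | national | AT | Austria | NaN | NaN | 8932664 | NaN | NaN | NaN |
| 4 | Austria | AT | 2020-W05 | national | AT | Austria | NaN | NaN | 8932664 | NaN | NaN | NaN |
| 5 | Austria | AT | 2020-W06 | national | AT | Austria | NaN | NaN | 8932664 | NaN | NaN | NaN |
| 6 | Austria | AT | 2020-W07 | national | AT | Austria | NaN | NaN | 8932664 | NaN | NaN | NaN |
| 7 | Austria | AT | 2020-W08 | national | AT | Austria | NaN | NaN | 8932664 | NaN | NaN | NaN |
| 8 | Austria | AT | 2020-W09 | national | AT | Austria | NaN | NaN | 8932664 | NaN | NaN | NaN |
| 9 | Austria | AT | 2020-W10 | national | AT | Austria | NaN | NaN | 8932664 | NaN | NaN | NaN |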


From the first rows, the hypothesis is that the null values are not missing data but true zeros.
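
One way to probe this hypothesis is to check whether the nulls are confined to the weeks before a country's first reported value. The sketch below uses a small invented frame with the same column names; the values are illustrative only, not ECDC data, and the helper name `nulls_only_before_first_report` is hypothetical.

```python
import pandas as pd

# Illustrative data only: same column names as the dataset, invented values.
sample = pd.DataFrame({
    "country": ["A"] * 4 + ["B"] * 4,
    "year_week": ["2020-W01", "2020-W02", "2020-W03", "2020-W04"] * 2,
    "tests_done": [None, None, 120.0, 300.0, None, 50.0, None, 80.0],
})


def nulls_only_before_first_report(group: pd.DataFrame) -> bool:
    """True if every null in tests_done precedes the first non-null value."""
    # 1 where a value was reported, 0 where it is null, in week order.
    reported = group.sort_values("year_week")["tests_done"].notna().astype(int)
    # cummax() stays at 1 once reporting has started; a later null breaks equality.
    return bool((reported == reported.cummax()).all())


leading_only = {
    country: nulls_only_before_first_report(group)
    for country, group in sample.groupby("country")
}
print(leading_only)  # {'A': True, 'B': False}
```

A country that returns `False` has gaps after reporting started, which would be harder to read as true zeros.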

## Data Processing

In [6]:
# Drop columns that are redundant or not relevant for the study
df.drop(['region_name', 'level', 'region', 'country_code', 'testing_data_source'],
        axis=1, inplace=True)

# testing_rate and positivity_rate are kept for now:
# the dataset is small, so dropping them is not needed.

#df.drop(['testing_rate'], axis=1, inplace = True)
#df.drop(['positivity_rate'], axis=1, inplace = True)

In [7]:
## Filter the countries of interest to reduce the data
# Denmark, Germany, Italy, Spain and Sweden.
focus_countries = ['Denmark', 'Germany', 'Italy', 'Spain', 'Sweden']
# .copy() avoids pandas' SettingWithCopyWarning when columns are modified later
df = df[df['country'].isin(focus_countries)].copy()

In [8]:
## Convert the weekly timestamp to year-month, since the analysis uses monthly granularity

# Convert "year_week" into a 'yyyy-mm' string
df['year_week'] = df['year_week'].apply(lambda x: str(utils.datetime_formating(x).to_period('M')))

# Rename column year_week to timestamp.
df.rename(columns={'year_week': 'timestamp'}, inplace=True)

In [9]:
# Population is constant per country, so take the first value of each group
population_dict = {country: int(pop)
                   for country, pop in df.groupby('country')['population'].first().items()}

In [10]:
df.drop(['population'], axis=1, inplace = True)

In [11]:
# Rescale testing_rate (reported per 100 000 population) to percent of population
df['testing_rate'] = df['testing_rate'] / 1000

In [12]:
df=df.groupby(['timestamp','country']).sum().reset_index()
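
Summing works for counts such as `new_cases` and `tests_done`, but a sum of weekly `positivity_rate` values is not a monthly positivity rate. One option, sketched here on invented numbers, is to recompute the ratio from the monthly totals using the dataset's own definition (100 x new cases / tests done). The frame names `weekly` and `monthly` are hypothetical.

```python
import pandas as pd

# Invented weekly values for one country and one month (illustrative only).
weekly = pd.DataFrame({
    "timestamp": ["2020-11"] * 4,
    "country": ["X"] * 4,
    "new_cases": [100.0, 200.0, 300.0, 400.0],
    "tests_done": [1000.0, 2000.0, 2000.0, 5000.0],
})
weekly["positivity_rate"] = 100 * weekly["new_cases"] / weekly["tests_done"]

# Sum the counts per month, then derive the ratio from the totals.
monthly = (
    weekly.groupby(["timestamp", "country"])[["new_cases", "tests_done"]]
    .sum(min_count=1)  # keep NaN when a month has no reported value at all
    .reset_index()
)
monthly["positivity_rate"] = 100 * monthly["new_cases"] / monthly["tests_done"]

print(monthly["positivity_rate"].iloc[0])  # 10.0
print(weekly["positivity_rate"].sum())     # 43.0
```

In this example the summed weekly rates give 43, while the rate derived from the monthly totals is 10 percent.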

In [13]:
output_notebook() # open the bokeh viz on the notebook.

## Data Analysis and Visualization

In [14]:
from bokeh.models import Range1d

In [17]:
x = list(population_dict.values())
factors = list(population_dict.keys())

fig1 = figure(title="Population", toolbar_location=None,
              tools="hover", tooltips="@x",
              y_range=factors, x_range=[0, 105000000], x_axis_label='population',
              plot_width=800, plot_height=200)

# Lollipop chart: a segment from 0 to the population value, capped with a circle
fig1.segment(0, factors, x, factors, line_width=2, line_color="#3182bd")
fig1.circle(x, factors, size=15, fill_color="#9ecae1", line_color="#3182bd", line_width=3)

fig1.xgrid.grid_line_color = None
fig1.ygrid.grid_line_color = None

In [18]:
p1 = utils.create_bokeh_bars(df, 'new_cases' )
tab1 = Panel(child=p1, title="new_cases")

p2 = utils.create_bokeh_bars(df, 'tests_done')
tab2 = Panel(child=p2, title="tests_done")

p3 = utils.create_bokeh_bars(df, 'testing_rate')
tab3 = Panel(child=p3, title="testing_rate")

p4 = utils.create_bokeh_bars(df, 'positivity_rate')
tab4 = Panel(child=p4, title="positivity_rate")

# Scatter plot of monthly tests done against new cases
p5 = figure(width=400, height=400, title="Correlation New Cases - Tests Done",
            x_axis_label='tests_done', y_axis_label='new_cases')
p5.circle(df['tests_done'], df['new_cases'], size=20, color="navy", alpha=0.5)
tab5 = Panel(child=p5, title="Correlation New Cases - Tests Done")


tabs = Tabs(tabs=[tab1, tab2, tab3, tab4, tab5])
show(fig1)
show(tabs)

### Categories:
* **new_cases (*Numeric*)**: Number of new confirmed cases

* **tests_done (*Numeric*)**: Number of tests done

* **testing_rate (*Numeric*)**: Testing rate (%): 100 x tests_done / population

* **positivity_rate (*Numeric*)**: Weekly test positivity (%): 100 x number of new confirmed cases / number of tests done per week

* **Correlation (*Numeric*)**: Correlation between new cases and tests done
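
The correlation entry can be made concrete with pandas. The sketch below computes the correlation between `new_cases` and `tests_done` per country on invented values. Using the Pearson coefficient is an assumption, since the notebook does not say which coefficient is intended.

```python
import pandas as pd

# Invented monthly values (illustrative only, not ECDC data).
monthly = pd.DataFrame({
    "country": ["X"] * 4 + ["Y"] * 4,
    "new_cases": [10.0, 20.0, 30.0, 40.0, 40.0, 30.0, 20.0, 10.0],
    "tests_done": [100.0, 200.0, 300.0, 400.0, 100.0, 200.0, 300.0, 400.0],
})

# Series.corr defaults to the Pearson coefficient.
correlation = {
    country: group["new_cases"].corr(group["tests_done"])
    for country, group in monthly.groupby("country")
}
print(correlation)
```

In this invented data, country X rises in lockstep with testing and country Y moves in the opposite direction, so the coefficients come out close to 1 and -1 respectively.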

#### Possible categories:
* Positive_population: new cases / population. Not included, since it would be heavily biased
* Positivity rate / population: biased

#### Analysis
* Tests done and new cases are closely related, so it is difficult to see which country was more affected by the disease.

* Germany and Italy seem to have responded faster with testing. However, the pattern may also indicate in which countries the virus spread first. Another possible explanation is how each country handled the first stages of COVID-19.

* The positivity rate shows a very different pattern in the first stages. Tests were not widely performed, and mainly citizens with clear symptoms or at risk were tested. This pattern is very noticeable in the first months for Sweden, where the positivity rate is very high while the number of tests performed is minimal.

* The positivity rate of Denmark is also low compared with the other countries. Based on external information from several countries, including Denmark, this could be because a larger group of citizens could get tested, with less restriction based on symptoms.

* The number of tests relative to population size in Denmark is impressive; however, the effects and benefits of such a measure are not reflected in this data. A low number of deaths would be a fairer indicator of a successful strategy when comparing countries whose health systems are at the same level of development.

* The testing rate makes it easier to see when the disease stopped being considered a risk in a country, through the sharp decline in the last months.