# Data Cleaning

## 1. Introduction

- Pending tasks:
1. Write the intro
2. Delta function for the grouped columns
3. Fill the gaps
4. GeoJSON Lat and Lon

## 2. Importing Libraries

In [527]:
import pandas as pd
import numpy as np
from datetime import datetime

## 3. Building `df` from the `CoreCode` data in `data_core/`

### 3.1. Preparing the `confirmed_global.csv` dataset

In [528]:
#url_confirmed_global = "https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series%2Ftime_series_covid19_confirmed_global.csv&filename=time_series_covid19_confirmed_global.csv"
#df1 = pd.read_csv(url_confirmed_global)
df1 = pd.read_csv('data_core/confirmed_global.csv')

In [529]:
# The analysis is done per country, not per province, so the 'Province/State' column is dropped. The 'Lat' and 'Long' columns
# are dropped now and merged back into the final dataframe later, because the coordinates would be distorted by the 'groupby'.

df1 = df1.drop(['Province/State', 'Lat', 'Long'], axis=1)

In [530]:
# With those columns removed, group the rows by country, summing all the cases
# that were previously split by 'Province/State'.

# First check that some country names do appear several times
print(df1["Country/Region"].value_counts().to_string())

China                               34
Canada                              16
United Kingdom                      12
France                              12
Australia                            8
Netherlands                          5
Denmark                              3
New Zealand                          2
Panama                               1
Niger                                1
Nigeria                              1
North Macedonia                      1
Norway                               1
Oman                                 1
Pakistan                             1
Palau                                1
Peru                                 1
Papua New Guinea                     1
Paraguay                             1
Philippines                          1
Poland                               1
Portugal                             1
Qatar                                1
Romania                              1
Russia                               1
Rwanda                   

In [531]:
df1.loc[df1["Country/Region"] == "Austria"]

Unnamed: 0,Country/Region,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,1/28/20,1/29/20,1/30/20,...,12/19/21,12/20/21,12/21/21,12/22/21,12/23/21,12/24/21,12/25/21,12/26/21,12/27/21,12/28/21
16,Austria,0,0,0,0,0,0,0,0,0,...,1249641,1251433,1253961,1256230,1258377,1260751,1262836,1264553,1266103,1268519


In [532]:
# After the groupby the cases are aggregated correctly: the sum of the case column for a given day
# equals the value in that country's row of df1 for the same day
df1 = df1.groupby(['Country/Region']).sum().reset_index()
print(df1.loc[df1["Country/Region"] == "Austria"].sum())

Country/Region    Austria
1/22/20                 0
1/23/20                 0
1/24/20                 0
1/25/20                 0
                   ...   
12/24/21          1260751
12/25/21          1262836
12/26/21          1264553
12/27/21          1266103
12/28/21          1268519
Length: 708, dtype: object
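A country that had a single row before the groupby cannot show whether the aggregation worked, so a stricter check is to compare the per-date totals before and after grouping. The sketch below uses a small hypothetical table (the numbers are invented for illustration), not the real `confirmed_global.csv`.

```python
import pandas as pd

# Hypothetical wide table: two provinces for one country, a single row for another
raw = pd.DataFrame({
    "Country/Region": ["Canada", "Canada", "Austria"],
    "1/22/20": [1, 2, 0],
    "1/23/20": [3, 4, 5],
})

grouped = raw.groupby("Country/Region").sum().reset_index()

date_cols = ["1/22/20", "1/23/20"]
# Grouping by country must leave the total per date unchanged
totals_match = raw[date_cols].sum().equals(grouped[date_cols].sum())
print(totals_match)
```

The same comparison could be run on `df1` by keeping a copy of it before the `groupby` call.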


In [533]:
# Each country now appears only once.
print(df1["Country/Region"].value_counts().to_string())

Afghanistan                         1
Namibia                             1
Netherlands                         1
New Zealand                         1
Nicaragua                           1
Niger                               1
Nigeria                             1
North Macedonia                     1
Norway                              1
Oman                                1
Pakistan                            1
Palau                               1
Panama                              1
Papua New Guinea                    1
Paraguay                            1
Peru                                1
Philippines                         1
Poland                              1
Portugal                            1
Qatar                               1
Romania                             1
Russia                              1
Rwanda                              1
Nepal                               1
Mozambique                          1
Albania                             1
Morocco     

<div align="center">
We confirm that the groupby completed successfully
</div>

In [534]:
# Reshape the date columns into rows and build a 'Date-Country' key for each row

# Turn the date columns into rows with `melt` and create a unique identifier,
# (date)+(country name), so the other tables can be merged correctly by day and country
df1 = df1.melt(id_vars=["Country/Region"],
        var_name="Date",
        value_name="Confirmed")

# Create the column used as the unique identifier for the merge
df1['Date-Country'] = df1['Date'] + df1['Country/Region']

# Display the dataframe indexed by this column (the result is not assigned,
# so 'Date-Country' stays a regular column of df1 for the merge)
df1.set_index('Date-Country')

Unnamed: 0_level_0,Country/Region,Date,Confirmed
Date-Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1/22/20Afghanistan,Afghanistan,1/22/20,0
1/22/20Albania,Albania,1/22/20,0
1/22/20Algeria,Algeria,1/22/20,0
1/22/20Andorra,Andorra,1/22/20,0
1/22/20Angola,Angola,1/22/20,0
...,...,...,...
12/28/21Vietnam,Vietnam,12/28/21,1680985
12/28/21West Bank and Gaza,West Bank and Gaza,12/28/21,469452
12/28/21Yemen,Yemen,12/28/21,10123
12/28/21Zambia,Zambia,12/28/21,238383
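The wide-to-long reshape is easier to follow on a tiny table. This is a minimal sketch with hypothetical values showing how `melt` turns each date column into rows and how the `Date-Country` key is then built.

```python
import pandas as pd

# Hypothetical wide table with one column per date
wide = pd.DataFrame({
    "Country/Region": ["Austria", "Spain"],
    "1/22/20": [0, 0],
    "1/23/20": [1, 2],
})

# One row per (country, date) pair
long = wide.melt(id_vars=["Country/Region"], var_name="Date", value_name="Confirmed")

# Concatenated key, unique per day and country
long["Date-Country"] = long["Date"] + long["Country/Region"]
print(long)
```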


### 3.2. Preparing the `deaths_global.csv` dataset

We repeat the same process as above for the `deaths_global.csv` dataset

In [535]:
#url_deaths_global = "https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series%2Ftime_series_covid19_deaths_global.csv&filename=time_series_covid19_deaths_global.csv"
#df2 = pd.read_csv(url_deaths_global)
df2 = pd.read_csv('data_core/deaths_global.csv')

df2 = df2.drop(['Province/State'], axis=1)
df2 = df2.drop(['Lat'], axis=1)
df2 = df2.drop(['Long'], axis=1)
df2 = df2.groupby(['Country/Region']).sum().reset_index()
df2 = df2.melt(id_vars=["Country/Region"], 
        var_name="Date", 
        value_name="Deaths")
df2['Date-Country'] = df2['Date'] + df2['Country/Region']

df2.set_index('Date-Country')

Unnamed: 0_level_0,Country/Region,Date,Deaths
Date-Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1/22/20Afghanistan,Afghanistan,1/22/20,0
1/22/20Albania,Albania,1/22/20,0
1/22/20Algeria,Algeria,1/22/20,0
1/22/20Andorra,Andorra,1/22/20,0
1/22/20Angola,Angola,1/22/20,0
...,...,...,...
12/28/21Vietnam,Vietnam,12/28/21,31632
12/28/21West Bank and Gaza,West Bank and Gaza,12/28/21,4912
12/28/21Yemen,Yemen,12/28/21,1984
12/28/21Zambia,Zambia,12/28/21,3716


### 3.3. Preparing the `recovered_global.csv` dataset

We repeat the same process as above for the `recovered_global.csv` dataset

In [536]:
#url_recovered_global = "https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series%2Ftime_series_covid19_recovered_global.csv&filename=time_series_covid19_recovered_global.csv"
#df3 = pd.read_csv(url_recovered_global)
df3 = pd.read_csv('data_core/recovered_global.csv')


df3 = df3.drop(['Province/State'], axis=1)
df3 = df3.drop(['Lat'], axis=1)
df3 = df3.drop(['Long'], axis=1)
df3 = df3.groupby(['Country/Region']).sum().reset_index()
df3 = df3.melt(id_vars=["Country/Region"], 
        var_name="Date", 
        value_name="Recovered")
df3['Date-Country'] = df3['Date'] + df3['Country/Region']
df3.set_index('Date-Country')

Unnamed: 0_level_0,Country/Region,Date,Recovered
Date-Country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1/22/20Afghanistan,Afghanistan,1/22/20,0
1/22/20Albania,Albania,1/22/20,0
1/22/20Algeria,Algeria,1/22/20,0
1/22/20Andorra,Andorra,1/22/20,0
1/22/20Angola,Angola,1/22/20,0
...,...,...,...
12/28/21Vietnam,Vietnam,12/28/21,0
12/28/21West Bank and Gaza,West Bank and Gaza,12/28/21,0
12/28/21Yemen,Yemen,12/28/21,0
12/28/21Zambia,Zambia,12/28/21,0


### 3.4. Merging the dataframes `df1`, `df2` and `df3` into a single `df`

In [537]:
# First merge: join df1 and df2 on 'Date-Country'
df = pd.merge(df1, df2, how='left', on='Date-Country')

# Second merge: join the previous result and df3 on 'Date-Country'
df = pd.merge(df, df3, how='left', on='Date-Country')


In [538]:
# Drop the duplicated columns
df = df.drop(['Date-Country','Country/Region_y','Date_y', 'Country/Region_x','Date_x'], axis=1)

# Reorder the columns
df = df[['Country/Region','Date','Confirmed','Deaths','Recovered']]
df = df.rename(columns={'Country/Region':'Country'})
df

Unnamed: 0,Country,Date,Confirmed,Deaths,Recovered
0,Afghanistan,1/22/20,0,0,0
1,Albania,1/22/20,0,0,0
2,Algeria,1/22/20,0,0,0
3,Andorra,1/22/20,0,0,0
4,Angola,1/22/20,0,0,0
...,...,...,...,...,...
138567,Vietnam,12/28/21,1680985,31632,0
138568,West Bank and Gaza,12/28/21,469452,4912,0
138569,Yemen,12/28/21,10123,1984,0
138570,Zambia,12/28/21,238383,3716,0
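As an alternative to the concatenated `Date-Country` key, pandas can merge on several columns at once, which avoids the `_x`/`_y` suffix columns that later have to be dropped. The sketch below uses hypothetical three-column frames shaped like `df1`, `df2` and `df3` after `melt`; it is one possible approach, not the one used in this notebook.

```python
import pandas as pd

# Hypothetical long-format frames, shaped like df1, df2 and df3 after melt
confirmed = pd.DataFrame({"Country/Region": ["Austria", "Spain"],
                          "Date": ["1/23/20", "1/23/20"],
                          "Confirmed": [1, 2]})
deaths = pd.DataFrame({"Country/Region": ["Austria", "Spain"],
                       "Date": ["1/23/20", "1/23/20"],
                       "Deaths": [0, 1]})
recovered = pd.DataFrame({"Country/Region": ["Austria", "Spain"],
                          "Date": ["1/23/20", "1/23/20"],
                          "Recovered": [0, 0]})

# Merging on both key columns keeps a single copy of each key
keys = ["Country/Region", "Date"]
combined = (confirmed
            .merge(deaths, how="left", on=keys)
            .merge(recovered, how="left", on=keys)
            .rename(columns={"Country/Region": "Country"}))
print(combined)
```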


## 4. Adding geographic and population data to `df`

### 4.1. Adding the `'Lat'` and `'Long'` columns to the dataframe `df`

In [539]:
df4 = pd.read_csv("data_extra/concap.csv")
df4 = df4.drop(['CapitalName'], axis=1)
df4 = df4.drop_duplicates()

filter_continent = df4['ContinentName'] == 'Europe'
df4 = df4[filter_continent]

df4 = df4.rename(columns={'CountryName':'Country',
                          'CapitalLatitude':'Lat', 
                          'CapitalLongitude':'Long', 
                          'CountryCode':'geoId',
                          'ContinentName':'continentExp'})
df4.head(3)


Unnamed: 0,Country,Lat,Long,geoId,continentExp
4,Aland Islands,60.116667,19.9,AX,Europe
10,Albania,41.316667,19.816667,AL,Europe
13,Andorra,42.5,1.516667,AD,Europe


In [540]:
df = pd.merge(df, df4 , how='left', on='Country')
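Because this merge joins on the country name, any spelling difference between the two sources silently produces missing coordinates, and the Europe filter that follows then drops the country. A minimal sketch of how such mismatches could be listed with the `indicator` option of `merge`; the names below are hypothetical examples, not mismatches observed in the real files.

```python
import pandas as pd

# Hypothetical country lists with one spelling difference
cases = pd.DataFrame({"Country": ["Austria", "Czechia", "Spain"]})
capitals = pd.DataFrame({"Country": ["Austria", "Czech Republic", "Spain"],
                         "geoId": ["AT", "CZ", "ES"]})

# indicator=True adds a '_merge' column saying where each row was found
check = cases.merge(capitals, how="left", on="Country", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "Country"].tolist()
print(unmatched)
```

Any names returned this way could be harmonised with a rename mapping before the merge.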

### 4.2. Filtering `df` by `'continentExp'`: `'Europe'`

To take advantage of the `data_extra` datasets, which cover only Europe, and to keep the analysis focused, I filter the dataframe to remove all non-European countries.

In [541]:
# Filter the dataframe (df) to European countries
filter_europe = df['continentExp'] == 'Europe'
df = df[filter_europe]
df

Unnamed: 0,Country,Date,Confirmed,Deaths,Recovered,Lat,Long,geoId,continentExp
1,Albania,1/22/20,0,0,0,41.316667,19.816667,AL,Europe
3,Andorra,1/22/20,0,0,0,42.500000,1.516667,AD,Europe
7,Armenia,1/22/20,0,0,0,40.166667,44.500000,AM,Europe
9,Austria,1/22/20,0,0,0,48.200000,16.366667,AT,Europe
10,Azerbaijan,1/22/20,0,0,0,40.383333,49.866667,AZ,Europe
...,...,...,...,...,...,...,...,...,...
138545,Sweden,12/28/21,1294560,15286,0,59.333333,18.050000,SE,Europe
138546,Switzerland,12/28/21,1276956,12152,0,46.916667,7.466667,CH,Europe
138557,Turkey,12/28/21,9367369,81917,0,39.933333,32.866667,TR,Europe
138560,Ukraine,12/28/21,3828336,101212,0,50.433333,30.516667,UA,Europe


## 5. Modifying the index of `df` and creating the `'Year'`, `'Week'` and `'Day'` columns

### 5.1. Changing data types

In [542]:
df.dtypes

Country          object
Date             object
Confirmed         int64
Deaths            int64
Recovered         int64
Lat             float64
Long            float64
geoId            object
continentExp     object
dtype: object

In [543]:
# Convert Date to datetime type
df['Date'] = pd.to_datetime(df.Date)
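Two details can make this conversion more predictable: passing an explicit `format` so that strings such as `1/22/20` are always read as month/day/year, and taking a `.copy()` of the filtered frame so that the assignment does not act on a slice of the original. The following is a sketch on hypothetical rows; the names `df_all` and `europe` are illustrative.

```python
import pandas as pd

# Hypothetical frame before the Europe filter
df_all = pd.DataFrame({
    "Country": ["Austria", "Austria", "Japan"],
    "Date": ["1/22/20", "12/28/21", "1/22/20"],
    "continentExp": ["Europe", "Europe", None],
})

# .copy() makes the filtered frame independent of df_all
europe = df_all[df_all["continentExp"] == "Europe"].copy()

# Explicit month/day/two-digit-year format
europe["Date"] = pd.to_datetime(europe["Date"], format="%m/%d/%y")
print(europe.dtypes)
```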

### 5.2. Extracting new columns from `'Date'`

In [544]:
# Declare helper variables
y = df['Date'].dt
x = df['Date'].dt.isocalendar().week.apply(np.int64)

# Create the new columns with type int64
df['Year'] = y.year
df['Month'] = y.month
df['Week'] = x
df['Week-Copy'] = x
df['Day'] = y.day

# Check that the columns were indeed created as int64
df.dtypes

Country                 object
Date            datetime64[ns]
Confirmed                int64
Deaths                   int64
Recovered                int64
Lat                    float64
Long                   float64
geoId                   object
continentExp            object
Year                     int64
Month                    int64
Week                     int64
Week-Copy                int64
Day                      int64
dtype: object

In [546]:
# To merge with the datasets that hold information at 'year'-'week' level,
# weeks below 10 need a leading '0'.
# Because it will be combined with '-', this column must be of type 'object'.

def str_fixer(value):
    if int(value) < 10:
        return f'0{value}'
    else:
        return str(value)

df["Week-Copy"] = df["Week-Copy"].apply(str_fixer)


In [547]:
# Check that it works correctly
df["Week-Copy"].unique()

array(['04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14',
       '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25',
       '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36',
       '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47',
       '48', '49', '50', '51', '52', '53', '01', '02', '03'], dtype=object)
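The same zero padding can also be done without a helper function, using the vectorised string method `str.zfill`. A minimal sketch on a few hypothetical week numbers:

```python
import pandas as pd

# Hypothetical ISO week numbers
weeks = pd.Series([4, 9, 10, 53])

# Convert to string and left-pad with zeros to width 2
padded = weeks.astype(str).str.zfill(2)
print(padded.tolist())
```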

In [548]:
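# Create the ids used to merge the remaining datasets.
# ISO week numbers belong to the ISO year, which differs from the calendar year around
# New Year (e.g. 2021-01-01 falls in 2020-W53), so the ISO year is used for these keys.
iso_year = df["Date"].dt.isocalendar().year.astype(str)
df["Year-Week"] = iso_year + "-" + df["Week-Copy"]
df["Year-Week-Copy"] = iso_year + "-W" + df["Week-Copy"]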
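Year-week keys are sensitive to which year is paired with the ISO week number. The sketch below compares a key built from the calendar year with one built from the ISO year for three dates around New Year 2021; the variable names are illustrative.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-12-31", "2021-01-01", "2021-01-04"]))
iso = dates.dt.isocalendar()
week = iso["week"].astype(str).str.zfill(2)

# Key built from the calendar year and the ISO week
calendar_key = dates.dt.year.astype(str) + "-W" + week

# Key built from the ISO year and the ISO week
iso_key = iso["year"].astype(str) + "-W" + week

print(calendar_key.tolist())
print(iso_key.tolist())
```

For 2021-01-01 the calendar-year key reads `2021-W53`, while the ISO key reads `2020-W53`, which is the form a dataset keyed by ISO year-week (such as the `YearWeekISO` column used later) would contain.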

In [549]:
df.head(5)

Unnamed: 0,Country,Date,Confirmed,Deaths,Recovered,Lat,Long,geoId,continentExp,Year,Month,Week,Week-Copy,Day,Year-Week,Year-Week-Copy
1,Albania,2020-01-22,0,0,0,41.316667,19.816667,AL,Europe,2020,1,4,4,22,2020-04,2020-W04
3,Andorra,2020-01-22,0,0,0,42.5,1.516667,AD,Europe,2020,1,4,4,22,2020-04,2020-W04
7,Armenia,2020-01-22,0,0,0,40.166667,44.5,AM,Europe,2020,1,4,4,22,2020-04,2020-W04
9,Austria,2020-01-22,0,0,0,48.2,16.366667,AT,Europe,2020,1,4,4,22,2020-04,2020-W04
10,Azerbaijan,2020-01-22,0,0,0,40.383333,49.866667,AZ,Europe,2020,1,4,4,22,2020-04,2020-W04


In [550]:
df.dtypes

Country                   object
Date              datetime64[ns]
Confirmed                  int64
Deaths                     int64
Recovered                  int64
Lat                      float64
Long                     float64
geoId                     object
continentExp              object
Year                       int64
Month                      int64
Week                       int64
Week-Copy                 object
Day                        int64
Year-Week                 object
Year-Week-Copy            object
dtype: object

## 6. Joining the datasets `ICU_hospital.csv`, `test_rate.csv` and `vaccine_tracker.csv` with `df`

### 6.1. Creating the `id` in `df` for merging with the `data_extra` datasets

In [551]:
# Drop the leftover columns
df = df.drop(['continentExp'], axis=1)

# Reorder the columns
df = df[['Date','Country','geoId','Lat','Long','Year','Month','Week','Day','Confirmed','Deaths',
         'Recovered','Week-Copy','Year-Week','Year-Week-Copy']]

# Define the columns used to merge with the other datasets
df['id-merge'] = df['geoId'] + df['Year-Week-Copy']
# Format the date as YYYY-MM-DD: str() on a Timestamp appends ' 00:00:00',
# so the key would not match the 'country' + 'date' key built for ICU_hospital.csv
df['id-merge-country-date'] = df['Country'] + df['Date'].dt.strftime('%Y-%m-%d')

Index(['Country', 'Date', 'Confirmed', 'Deaths', 'Recovered', 'Lat', 'Long',
       'geoId', 'continentExp', 'Year', 'Month', 'Week', 'Week-Copy', 'Day',
       'Year-Week', 'Year-Week-Copy'],
      dtype='object')
Index(['Date', 'Country', 'geoId', 'Lat', 'Long', 'Year', 'Month', 'Week',
       'Day', 'Confirmed', 'Deaths', 'Recovered', 'Week-Copy', 'Year-Week',
       'Year-Week-Copy', 'id-merge', 'id-merge-country-date'],
      dtype='object')
Date                     datetime64[ns]
Country                          object
geoId                            object
Lat                             float64
Long                            float64
Year                              int64
Month                             int64
Week                              int64
Day                               int64
Confirmed                         int64
Deaths                            int64
Recovered                         int64
Week-Copy                        object
Year-Week                    
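When a merge key is built from a datetime column, the exact string form of the date matters. A small sketch of the difference between `apply(str)` and `dt.strftime` on a single date (the country name is only a placeholder prefix):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-04-01"]))

# str() on a Timestamp includes the time of day
via_str = "Austria" + dates.apply(str)

# strftime gives full control over the date format
via_strftime = "Austria" + dates.dt.strftime("%Y-%m-%d")

print(via_str.iloc[0])
print(via_strftime.iloc[0])
```

Only the second form matches keys such as `Austria2020-04-01`, which are built from plain `YYYY-MM-DD` text.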

### 6.2. Preparing the `ICU_hospital.csv` dataset

In [552]:
#url_UCI = "https://opendata.ecdc.europa.eu/covid19/hospitalicuadmissionrates/csv/data.csv"
#df_ex3 = pd.read_csv(url_UCI)
df_ex3 = pd.read_csv('data_extra/ICU_hospital.csv')

In [553]:
df_ex3.head(10)

Unnamed: 0,country,indicator,date,year_week,value,source,url
0,Austria,Daily hospital occupancy,2020-04-01,2020-W14,856.0,Country_Website,https://covid19-dashboard.ages.at/dashboard_Ho...
1,Austria,Daily hospital occupancy,2020-04-02,2020-W14,823.0,Country_Website,https://covid19-dashboard.ages.at/dashboard_Ho...
2,Austria,Daily hospital occupancy,2020-04-03,2020-W14,829.0,Country_Website,https://covid19-dashboard.ages.at/dashboard_Ho...
3,Austria,Daily hospital occupancy,2020-04-04,2020-W14,826.0,Country_Website,https://covid19-dashboard.ages.at/dashboard_Ho...
4,Austria,Daily hospital occupancy,2020-04-05,2020-W14,712.0,Country_Website,https://covid19-dashboard.ages.at/dashboard_Ho...
5,Austria,Daily hospital occupancy,2020-04-06,2020-W15,824.0,Country_Website,https://covid19-dashboard.ages.at/dashboard_Ho...
6,Austria,Daily hospital occupancy,2020-04-07,2020-W15,857.0,Country_Website,https://covid19-dashboard.ages.at/dashboard_Ho...
7,Austria,Daily hospital occupancy,2020-04-08,2020-W15,829.0,Country_Website,https://covid19-dashboard.ages.at/dashboard_Ho...
8,Austria,Daily hospital occupancy,2020-04-09,2020-W15,820.0,Country_Website,https://covid19-dashboard.ages.at/dashboard_Ho...
9,Austria,Daily hospital occupancy,2020-04-10,2020-W15,771.0,Country_Website,https://covid19-dashboard.ages.at/dashboard_Ho...


In [554]:
print((df_ex3.isnull().sum()/len(df_ex3))*100)

country      0.000000
indicator    0.000000
date         0.000000
year_week    0.000000
value        0.000000
source       0.000000
url          9.862196
dtype: float64


In [555]:
df_ex3['indicator'].unique()

array(['Daily hospital occupancy', 'Daily ICU occupancy',
       'Weekly new hospital admissions per 100k',
       'Weekly new ICU admissions per 100k'], dtype=object)

In [556]:
# Split the dataset by indicator (only the two daily occupancy indicators are used below) and then join the pieces column-wise into one
df_ex3['id-merge-country-date'] = df_ex3['country']+df_ex3['date']
df_ex3 = df_ex3.drop(['year_week', 'source', 'url'], axis=1)

In [557]:
# Hospital_Occupancy

filter_uci = df_ex3['indicator'] == 'Daily hospital occupancy'

# Define the dataframe
df_uci_1 = df_ex3[filter_uci]
df_uci_1 = df_uci_1.rename(columns={'value':'Hospital_Occupancy'})
df_uci_1 = df_uci_1.drop(['indicator','date','country'], axis=1)

df_uci_1

Unnamed: 0,Hospital_Occupancy,id-merge-country-date
0,856.0,Austria2020-04-01
1,823.0,Austria2020-04-02
2,829.0,Austria2020-04-03
3,826.0,Austria2020-04-04
4,712.0,Austria2020-04-05
...,...,...
30733,496.0,Sweden2021-12-15
30734,526.0,Sweden2021-12-16
30735,524.0,Sweden2021-12-17
30736,513.0,Sweden2021-12-18


In [558]:
# ICU occupancy

filter_uci = df_ex3['indicator'] == 'Daily ICU occupancy'

# Define the dataframe
df_uci_2 = df_ex3[filter_uci]
df_uci_2 = df_uci_2.rename(columns={'value':'ICU_Occupancy'})
df_uci_2 = df_uci_2.drop(['indicator','date','country'], axis=1)

df_uci_2

Unnamed: 0,ICU_Occupancy,id-merge-country-date
628,215.0,Austria2020-04-01
629,219.0,Austria2020-04-02
630,245.0,Austria2020-04-03
631,245.0,Austria2020-04-04
632,244.0,Austria2020-04-05
...,...,...
31395,75.0,Sweden2021-12-15
31396,70.0,Sweden2021-12-16
31397,68.0,Sweden2021-12-17
31398,72.0,Sweden2021-12-18


In [559]:
df_ex3 = df_ex3.drop(['indicator','value','country'], axis=1)
df_ex3

Unnamed: 0,date,id-merge-country-date
0,2020-04-01,Austria2020-04-01
1,2020-04-02,Austria2020-04-02
2,2020-04-03,Austria2020-04-03
3,2020-04-04,Austria2020-04-04
4,2020-04-05,Austria2020-04-05
...,...,...
31489,2021-11-21,Sweden2021-11-21
31490,2021-11-28,Sweden2021-11-28
31491,2021-12-05,Sweden2021-12-05
31492,2021-12-12,Sweden2021-12-12


In [560]:
df_ex3 = pd.merge(df_ex3, df_uci_1 , how='left', on='id-merge-country-date')
df_ex3 = pd.merge(df_ex3, df_uci_2 , how='left', on='id-merge-country-date')
df_ex3 = df_ex3.drop(columns=['date'])
df_ex3

Unnamed: 0,id-merge-country-date,Hospital_Occupancy,ICU_Occupancy
0,Austria2020-04-01,856.0,215.0
1,Austria2020-04-02,823.0,219.0
2,Austria2020-04-03,829.0,245.0
3,Austria2020-04-04,826.0,245.0
4,Austria2020-04-05,712.0,244.0
...,...,...,...
31489,Sweden2021-11-21,272.0,31.0
31490,Sweden2021-11-28,293.0,29.0
31491,Sweden2021-12-05,344.0,46.0
31492,Sweden2021-12-12,451.0,53.0
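The split-by-indicator and re-merge steps can also be expressed as a single pivot, which yields exactly one row per country and date. This is a sketch on four rows taken from the Austria values shown above; the names `icu`, `daily` and `wide` are illustrative.

```python
import pandas as pd

# Long-format sample: one row per (country, date, indicator)
icu = pd.DataFrame({
    "country": ["Austria"] * 4,
    "indicator": ["Daily hospital occupancy", "Daily ICU occupancy",
                  "Daily hospital occupancy", "Daily ICU occupancy"],
    "date": ["2020-04-01", "2020-04-01", "2020-04-02", "2020-04-02"],
    "value": [856.0, 215.0, 823.0, 219.0],
})

# Keep the two daily indicators and spread them into columns
daily = icu[icu["indicator"].isin(["Daily hospital occupancy", "Daily ICU occupancy"])]
wide = daily.pivot(index=["country", "date"], columns="indicator", values="value").reset_index()
wide = wide.rename(columns={"Daily hospital occupancy": "Hospital_Occupancy",
                            "Daily ICU occupancy": "ICU_Occupancy"})
wide.columns.name = None

# Same key format as the notebook: country name followed by the date
wide["id-merge-country-date"] = wide["country"] + wide["date"]
print(wide)
```

Unlike the merged `df_ex3`, the pivoted table has a unique key per row, so a later left merge cannot multiply rows.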


### 6.3. Preparing the `test_rate.csv` dataset

In [561]:
#url_test_rate = "https://opendata.ecdc.europa.eu/covid19/testing/csv/data.csv"
#df_ex1 = pd.read_csv(url_test_rate)
df_ex1 = pd.read_csv('data_extra/test_rate.csv')

In [562]:
df_ex1

Unnamed: 0,country,country_code,year_week,level,region,region_name,new_cases,tests_done,population,testing_rate,positivity_rate,testing_data_source
0,Austria,AT,2020-W15,national,AT,Austria,1838,12339,8901064.0,138.623877,14.895859,Manual webscraping
1,Austria,AT,2020-W16,national,AT,Austria,684,58488,8901064.0,657.089984,1.169471,Manual webscraping
2,Austria,AT,2020-W17,national,AT,Austria,448,33443,8901064.0,375.719128,1.339593,Manual webscraping
3,Austria,AT,2020-W18,national,AT,Austria,312,26598,8901064.0,298.818209,1.173021,Country website
4,Austria,AT,2020-W19,national,AT,Austria,264,42153,8901064.0,473.572598,0.626290,Country website
...,...,...,...,...,...,...,...,...,...,...,...,...
11750,Sweden,SE,2021-W46,national,SE,Sweden,7095,123920,10327589.0,1199.892831,5.725468,TESSy
11751,Sweden,SE,2021-W47,national,SE,Sweden,11916,226289,10327589.0,2191.111594,5.265833,TESSy
11752,Sweden,SE,2021-W48,national,SE,Sweden,13802,273987,10327589.0,2652.961887,5.037465,TESSy
11753,Sweden,SE,2021-W49,national,SE,Sweden,18659,335956,10327589.0,3252.995447,5.554001,TESSy


In [563]:
df_ex1['level'].unique()

array(['national', 'subnational'], dtype=object)

In [564]:
# The dataset holds data at national and subnational level. Keep only the national rows and discard
# the per-province lines, since our analysis is at national level in Europe.

filter_national = df_ex1['level'] == 'national'
df_ex1 = df_ex1[filter_national]

In [565]:
# Drop the columns that are not needed

df_ex1 = df_ex1.drop(['region_name', 'new_cases', 'testing_data_source', 'level', 'region'], axis=1)

In [566]:
df_ex1['id-merge'] = df_ex1['country_code'] + df_ex1['year_week']
df_ex1 = df_ex1.drop(['year_week', 'country_code', 'country'], axis=1)
df_ex1['population'] = df_ex1['population'].astype(int)

df_ex1

Unnamed: 0,tests_done,population,testing_rate,positivity_rate,id-merge
0,12339,8901064,138.623877,14.895859,AT2020-W15
1,58488,8901064,657.089984,1.169471,AT2020-W16
2,33443,8901064,375.719128,1.339593,AT2020-W17
3,26598,8901064,298.818209,1.173021,AT2020-W18
4,42153,8901064,473.572598,0.626290,AT2020-W19
...,...,...,...,...,...
11750,123920,10327589,1199.892831,5.725468,SE2021-W46
11751,226289,10327589,2191.111594,5.265833,SE2021-W47
11752,273987,10327589,2652.961887,5.037465,SE2021-W48
11753,335956,10327589,3252.995447,5.554001,SE2021-W49


### 6.4. Preparing the `vaccine_tracker.csv` dataset

In [567]:
#url_vaccine_tracker = "https://opendata.ecdc.europa.eu/covid19/vaccine_tracker/csv/data.csv"
#df_ex2 = pd.read_csv(url_vaccine_tracker)
df_ex2 = pd.read_csv('data_extra/vaccine_tracker.csv')


In [568]:
df_ex2['TargetGroup'].unique()

array(['ALL', 'Age0_4', 'Age10_14', 'Age15_17', 'Age18_24', 'Age25_49',
       'Age50_59', 'Age5_9', 'Age60_69', 'Age70_79', 'Age80+', 'Age<18',
       'AgeUNK', 'HCW', 'LTCF', '1_Age60+', '1_Age<60'], dtype=object)

In [569]:
# Keep only the individual age brackets by excluding the aggregate and special target groups
excluded_groups = ['1_Age<60', '1_Age60+', 'LTCF', 'HCW', 'AgeUNK', 'Age<18', 'ALL']
df_ex2 = df_ex2[~df_ex2['TargetGroup'].isin(excluded_groups)]


df_ex2['TargetGroup'].unique()

array(['Age0_4', 'Age10_14', 'Age15_17', 'Age18_24', 'Age25_49',
       'Age50_59', 'Age5_9', 'Age60_69', 'Age70_79', 'Age80+'],
      dtype=object)

In [570]:
df_ex2 = df_ex2.drop(['Denominator', 'DoseAdditional1', 'UnknownDose','Population','Vaccine'], axis=1)

df_ex2 = df_ex2.drop(['TargetGroup'], axis=1)

df_ex2['id-merge'] = df_ex2['ReportingCountry'] + df_ex2['YearWeekISO']


df_ex2 = df_ex2.drop(['YearWeekISO', 'ReportingCountry', 'Region'], axis=1)

df_ex2 = df_ex2.groupby(['id-merge']).sum().reset_index()

df_ex2

Unnamed: 0,id-merge,NumberDosesReceived,NumberDosesExported,FirstDose,FirstDoseRefused,SecondDose
0,AT2020-W53,614250.0,0.0,5249,0.0,0
1,AT2021-W01,614250.0,0.0,26205,0.0,0
2,AT2021-W02,686250.0,0.0,85006,0.0,399
3,AT2021-W03,585000.0,0.0,93304,0.0,4572
4,AT2021-W04,549900.0,0.0,31525,0.0,17538
...,...,...,...,...,...,...
1378,SK2021-W47,0.0,0.0,53204,0.0,8188
1379,SK2021-W48,0.0,0.0,28858,0.0,8425
1380,SK2021-W49,0.0,0.0,22157,0.0,15976
1381,SK2021-W50,0.0,0.0,18650,0.0,23346
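The reason for removing the aggregate target groups before summing is that they overlap with the age brackets, so a plain sum would count the same doses more than once. A minimal sketch with hypothetical dose counts:

```python
import pandas as pd

# Hypothetical rows for one country and week; 'ALL' and 'HCW' overlap with the age brackets
vacc = pd.DataFrame({
    "ReportingCountry": ["AT", "AT", "AT", "AT"],
    "YearWeekISO": ["2021-W01"] * 4,
    "TargetGroup": ["ALL", "Age18_24", "Age25_49", "HCW"],
    "FirstDose": [30, 10, 20, 5],
})

# Summing every row counts the same doses several times
naive_total = vacc["FirstDose"].sum()

# Keep only the age brackets, then aggregate per country and week
excluded = ["ALL", "HCW"]
age_only = vacc[~vacc["TargetGroup"].isin(excluded)]
weekly = age_only.groupby(["ReportingCountry", "YearWeekISO"], as_index=False)["FirstDose"].sum()

print(naive_total, weekly["FirstDose"].iloc[0])
```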


### 6.5. Merge: enriching `df` with the datasets `test_rate.csv`, `vaccine_tracker.csv` and `ICU_hospital.csv`

In [571]:
df.columns

Index(['Date', 'Country', 'geoId', 'Lat', 'Long', 'Year', 'Month', 'Week',
       'Day', 'Confirmed', 'Deaths', 'Recovered', 'Week-Copy', 'Year-Week',
       'Year-Week-Copy', 'id-merge', 'id-merge-country-date'],
      dtype='object')

In [572]:
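df = pd.merge(df, df_ex1 , how='left', on='id-merge')
df = pd.merge(df, df_ex2 , how='left', on='id-merge')
# df_ex3 repeats a key once per indicator row, so keep one row per key to avoid duplicating rows of df
df_ex3 = df_ex3.drop_duplicates(subset='id-merge-country-date')
df = pd.merge(df, df_ex3 , how='left', on='id-merge-country-date')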

In [573]:
df.columns

Index(['Date', 'Country', 'geoId', 'Lat', 'Long', 'Year', 'Month', 'Week',
       'Day', 'Confirmed', 'Deaths', 'Recovered', 'Week-Copy', 'Year-Week',
       'Year-Week-Copy', 'id-merge', 'id-merge-country-date', 'tests_done',
       'population', 'testing_rate', 'positivity_rate', 'NumberDosesReceived',
       'NumberDosesExported', 'FirstDose', 'FirstDoseRefused', 'SecondDose',
       'Hospital_Occupancy', 'ICU_Occupancy'],
      dtype='object')

In [574]:
df

Unnamed: 0,Date,Country,geoId,Lat,Long,Year,Month,Week,Day,Confirmed,...,population,testing_rate,positivity_rate,NumberDosesReceived,NumberDosesExported,FirstDose,FirstDoseRefused,SecondDose,Hospital_Occupancy,ICU_Occupancy
0,2020-01-22,Albania,AL,41.316667,19.816667,2020,1,4,22,0,...,,,,,,,,,,
1,2020-01-22,Andorra,AD,42.500000,1.516667,2020,1,4,22,0,...,,,,,,,,,,
2,2020-01-22,Armenia,AM,40.166667,44.500000,2020,1,4,22,0,...,,,,,,,,,,
3,2020-01-22,Austria,AT,48.200000,16.366667,2020,1,4,22,0,...,,,,,,,,,,
4,2020-01-22,Azerbaijan,AZ,40.383333,49.866667,2020,1,4,22,0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33224,2021-12-28,Sweden,SE,59.333333,18.050000,2021,12,52,28,1294560,...,,,,,,,,,,
33225,2021-12-28,Switzerland,CH,46.916667,7.466667,2021,12,52,28,1276956,...,,,,,,,,,,
33226,2021-12-28,Turkey,TR,39.933333,32.866667,2021,12,52,28,9367369,...,,,,,,,,,,
33227,2021-12-28,Ukraine,UA,50.433333,30.516667,2021,12,52,28,3828336,...,,,,,,,,,,
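A left merge only keeps the row count of `df` unchanged when each key appears at most once in the right-hand table. The `validate` argument of `pd.merge` can enforce this. The sketch below uses hypothetical frames (`left` and `right` are illustrative names) with a deliberately duplicated key.

```python
import pandas as pd
from pandas.errors import MergeError

# Hypothetical left table: one row per country and day
left = pd.DataFrame({
    "id-merge-country-date": ["Austria2020-04-01", "Austria2020-04-02"],
    "Confirmed": [100, 120],
})

# Hypothetical right table with a duplicated key
right = pd.DataFrame({
    "id-merge-country-date": ["Austria2020-04-01", "Austria2020-04-01"],
    "Hospital_Occupancy": [856.0, 856.0],
})

key = "id-merge-country-date"
try:
    pd.merge(left, right, how="left", on=key, validate="many_to_one")
    duplicate_keys_detected = False
except MergeError:
    # Raised because the right-hand keys are not unique
    duplicate_keys_detected = True

# After deduplicating, the merge passes validation and keeps the row count
merged = pd.merge(left, right.drop_duplicates(subset=key), how="left", on=key,
                  validate="many_to_one")
print(duplicate_keys_detected, len(merged))
```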


## 7. Final dataframe `df`

### 7.1. Column cleanup and sorting of `df`

In [575]:
df = df.drop(['id-merge', 'Year-Week-Copy', 'Week-Copy', 'geoId','id-merge-country-date'], axis=1)
df = df.rename(columns={'NumberDosesReceived':'DosesReceived',
                        'NumberDosesExported':'DosesExported'})

df.columns

Index(['Date', 'Country', 'Lat', 'Long', 'Year', 'Month', 'Week', 'Day',
       'Confirmed', 'Deaths', 'Recovered', 'Year-Week', 'tests_done',
       'population', 'testing_rate', 'positivity_rate', 'DosesReceived',
       'DosesExported', 'FirstDose', 'FirstDoseRefused', 'SecondDose',
       'Hospital_Occupancy', 'ICU_Occupancy'],
      dtype='object')

In [576]:
df = df.sort_values(['Date'], ascending=[True])
#filter_Confirmed_0 = df['Confirmed'] != 0
#df = df[filter_Confirmed_0]
# TODO: compute the delta between consecutive values of each series by subtracting the previous one.
df

Unnamed: 0,Date,Country,Lat,Long,Year,Month,Week,Day,Confirmed,Deaths,...,population,testing_rate,positivity_rate,DosesReceived,DosesExported,FirstDose,FirstDoseRefused,SecondDose,Hospital_Occupancy,ICU_Occupancy
0,2020-01-22,Albania,41.316667,19.816667,2020,1,4,22,0,0,...,,,,,,,,,,
26,2020-01-22,Luxembourg,49.600000,6.116667,2020,1,4,22,0,0,...,,,,,,,,,,
27,2020-01-22,Malta,35.883333,14.500000,2020,1,4,22,0,0,...,,,,,,,,,,
28,2020-01-22,Moldova,47.000000,28.850000,2020,1,4,22,0,0,...,,,,,,,,,,
29,2020-01-22,Monaco,43.733333,7.416667,2020,1,4,22,0,0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33200,2021-12-28,Hungary,47.500000,19.083333,2021,12,52,28,1246689,38894,...,,,,,,,,,,
33201,2021-12-28,Iceland,64.150000,-21.950000,2021,12,52,28,25314,37,...,,,,,,,,,,
33202,2021-12-28,Ireland,53.316667,-6.233333,2021,12,52,28,731467,5890,...,,,,,,,,,,
33204,2021-12-28,Kosovo,42.666667,21.166667,2021,12,52,28,161356,2990,...,,,,,,,,,,
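For the pending delta task, one possible approach is a per-country `diff` on the cumulative columns. The sketch below uses hypothetical cumulative counts; the column name `New_Confirmed` is an assumption for illustration and does not exist in the notebook.

```python
import pandas as pd

# Hypothetical cumulative counts for two countries
cum = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"] * 2),
    "Country": ["Austria"] * 3 + ["Spain"] * 3,
    "Confirmed": [10, 14, 18, 50, 80, 120],
})

# diff() must run within each country, in date order
cum = cum.sort_values(["Country", "Date"])
daily = cum.groupby("Country")["Confirmed"].diff()

# The first day of each country has no previous value; use the cumulative count itself
cum["New_Confirmed"] = daily.fillna(cum["Confirmed"]).astype("int64")
print(cum)
```

Sorting by `Date` alone, as in the cell above, would interleave the countries, so the grouping step is what keeps each difference within one country.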


### 7.3. Filling null values

In [577]:

#print((df.isnull().sum()/len(df))*100)
#df.dtypes
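How the gaps should be filled is still open. One option, shown here only as a sketch, is to measure the share of nulls per column and forward-fill within each country, so that a value is carried forward until the next report. Whether forward filling suits each column is an assumption that would need checking; the data below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with gaps
df_demo = pd.DataFrame({
    "Country": ["Austria"] * 3 + ["Sweden"] * 3,
    "Date": pd.to_datetime(["2020-04-01", "2020-04-02", "2020-04-03"] * 2),
    "Hospital_Occupancy": [np.nan, 856.0, np.nan, 500.0, np.nan, np.nan],
})

# Percentage of nulls per column before filling
null_share_before = df_demo.isnull().sum() / len(df_demo) * 100

# Forward-fill inside each country so values never leak across countries
df_demo = df_demo.sort_values(["Country", "Date"])
df_demo["Hospital_Occupancy"] = df_demo.groupby("Country")["Hospital_Occupancy"].ffill()

print(null_share_before)
print(df_demo)
```

Leading gaps, such as the first Austrian row here, stay empty because there is no earlier value to carry forward.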

### 7.4. Converting the `'Lat'` and `'Long'` columns to GeoJSON
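This step is not implemented yet. A minimal sketch of one way to do it with the standard `json` module is shown below: each row becomes a GeoJSON `Point` feature. GeoJSON orders coordinates as longitude first, then latitude. The helper name `to_feature_collection` is hypothetical, and the two rows reuse the capital coordinates shown in section 4.1.

```python
import json
import pandas as pd

# Sample rows with the capital coordinates shown in section 4.1
points = pd.DataFrame({
    "Country": ["Albania", "Andorra"],
    "Lat": [41.316667, 42.5],
    "Long": [19.816667, 1.516667],
})

def to_feature_collection(frame):
    """Build a GeoJSON FeatureCollection with one Point feature per row."""
    features = [
        {
            "type": "Feature",
            # GeoJSON coordinate order is [longitude, latitude]
            "geometry": {"type": "Point", "coordinates": [row.Long, row.Lat]},
            "properties": {"Country": row.Country},
        }
        for row in frame.itertuples(index=False)
    ]
    return {"type": "FeatureCollection", "features": features}

geojson = to_feature_collection(points)
print(json.dumps(geojson, indent=2))
```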

## 8. Exporting `df` to `.csv`

In [578]:
#df.to_csv('df.csv')