# Overview
This notebook covers the Spain data for Trello card https://trello.com/c/y4Nv52JN.

## Discovery
Spain data was initially found on https://en.wikipedia.org/wiki/2020_coronavirus_pandemic_in_Spain. Following the links from that page led to these sources:
- https://www.rtve.es/noticias/20200329/mapa-del-coronavirus-espana/2004681.shtml
- https://www.mscbs.gob.es/profesionales/saludPublica/ccayes/alertasActual/nCov-China/situacionActual.htm
- https://covid19.isciii.es/

## Read Data

In [235]:
import pandas as pd
from datetime import datetime

In [236]:
url = "https://covid19.isciii.es/resources/serie_historica_acumulados.csv"
df = pd.read_csv(url, encoding="latin_1")
df.shape


(761, 7)

In [237]:
df[df['CCAA Codigo ISO']=='AN'].head(10)

Unnamed: 0,CCAA Codigo ISO,Fecha,Casos,Hospitalizados,UCI,Fallecidos,Recuperados
0,AN,20/02/2020,,,,,
19,AN,21/02/2020,,,,,
38,AN,22/02/2020,,,,,
57,AN,23/02/2020,,,,,
76,AN,24/02/2020,,,,,
95,AN,25/02/2020,,,,,
114,AN,26/02/2020,1.0,,,,
133,AN,27/02/2020,6.0,,,,
152,AN,28/02/2020,8.0,,,,
171,AN,29/02/2020,12.0,,,,


## Model Data

### Column Headers

In [238]:
df.rename(columns={
    'CCAA Codigo ISO': 'ccaa-iso-code',
    'Fecha': 'date',
    'Casos ': 'cases',  # the source header carries a trailing space
    'Hospitalizados': 'hospitalized',
    'UCI': 'uci',
    'Fallecidos': 'deceased',
    'Recuperados': 'recovered',
}, inplace=True)
print(df.head())

  ccaa-iso-code        date  cases  hospitalized  uci  deceased  recovered
0            AN  20/02/2020    NaN           NaN  NaN       NaN        NaN
1            AR  20/02/2020    NaN           NaN  NaN       NaN        NaN
2            AS  20/02/2020    NaN           NaN  NaN       NaN        NaN
3            IB  20/02/2020    1.0           NaN  NaN       NaN        NaN
4            CN  20/02/2020    1.0           NaN  NaN       NaN        NaN


### Replace NaN with 0

In [239]:
# fill all count columns in one assignment; calling fillna(inplace=True)
# on a single selected column may not update df in newer pandas versions
count_cols = ['cases', 'hospitalized', 'uci', 'deceased', 'recovered']
df[count_cols] = df[count_cols].fillna(0)

df.head()

Unnamed: 0,ccaa-iso-code,date,cases,hospitalized,uci,deceased,recovered
0,AN,20/02/2020,0.0,0.0,0.0,0.0,0.0
1,AR,20/02/2020,0.0,0.0,0.0,0.0,0.0
2,AS,20/02/2020,0.0,0.0,0.0,0.0,0.0
3,IB,20/02/2020,1.0,0.0,0.0,0.0,0.0
4,CN,20/02/2020,1.0,0.0,0.0,0.0,0.0
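The counts display as `1.0` rather than `1` because a column containing NaN is stored as float. A small sketch of two ways to get integer counts follows, assuming every reported value is a whole number. The `counts` frame is invented sample data, not taken from the feed.

```python
import pandas as pd

# Hypothetical counts with gaps, stored as float because of the NaNs.
counts = pd.DataFrame({
    "cases": [float("nan"), 1.0, 6.0],
    "uci": [float("nan"), float("nan"), 2.0],
})

# Option 1: treat missing as zero, then cast to plain integers.
filled = counts.fillna(0).astype(int)

# Option 2: keep "not reported" distinct from zero with the nullable Int64 dtype.
nullable = counts.astype("Int64")

print(filled)
print(nullable)
```

Option 1 matches what this notebook does. Option 2 is an alternative if the difference between "zero" and "not reported" matters downstream.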


### Update date 
Convert the dd/mm/yyyy strings in `date` to datetimes, to comply with the data model at https://coronawhy.github.io/task-geo/data_model.html
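Because the source strings are day-first, a value such as 01/03/2020 means 1 March, not 3 January. A minimal sketch, using made-up sample values, of parsing with an explicit format so ambiguous dates are not read month-first:

```python
import pandas as pd

# Hypothetical day-first strings spanning a month boundary.
raw_dates = pd.Series(["29/02/2020", "01/03/2020", "02/03/2020"])

# An explicit format removes any guessing about day/month order.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

# ISO 8601 strings, in case a plain-text date column is needed.
iso_dates = parsed.dt.strftime("%Y-%m-%d")
print(iso_dates.tolist())
```

A quick check that the parsed series is increasing (`parsed.is_monotonic_increasing`) is a cheap way to catch swapped day and month values in a chronologically ordered feed.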

In [240]:
# source dates are day-first (dd/mm/yyyy); an explicit format keeps
# ambiguous values such as 01/03/2020 as 1 March instead of 3 January
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')
df.head()

Unnamed: 0,ccaa-iso-code,date,cases,hospitalized,uci,deceased,recovered
0,AN,2020-02-20,0.0,0.0,0.0,0.0,0.0
1,AR,2020-02-20,0.0,0.0,0.0,0.0,0.0
2,AS,2020-02-20,0.0,0.0,0.0,0.0,0.0
3,IB,2020-02-20,1.0,0.0,0.0,0.0,0.0
4,CN,2020-02-20,1.0,0.0,0.0,0.0,0.0


### Undo cumulative sums

In [241]:
# show current values for AN; note that they are cumulative
print(df[df['ccaa-iso-code']=='AN'].head(30))

# create a copy of the dataframe, without date
unrolled_df = df.copy()
unrolled_df.drop(['date'], axis=1, inplace=True)

# unroll (i.e. undo the cumulative values) per region; the first row of each
# region has no predecessor, so diff() yields NaN there and we keep the original value
unrolled_df = unrolled_df.groupby('ccaa-iso-code').diff().fillna(unrolled_df)

# add back ccaa-iso-code, date columns
unrolled_df = pd.concat([df[['ccaa-iso-code', 'date']], unrolled_df], axis=1)

# show the unrolled_df; values are no longer cumulative
print(unrolled_df[unrolled_df['ccaa-iso-code']=='AN'].head(30))

    ccaa-iso-code       date   cases  hospitalized   uci  deceased  recovered
0              AN 2020-02-20     0.0           0.0   0.0       0.0        0.0
19             AN 2020-02-21     0.0           0.0   0.0       0.0        0.0
38             AN 2020-02-22     0.0           0.0   0.0       0.0        0.0
57             AN 2020-02-23     0.0           0.0   0.0       0.0        0.0
76             AN 2020-02-24     0.0           0.0   0.0       0.0        0.0
95             AN 2020-02-25     0.0           0.0   0.0       0.0        0.0
114            AN 2020-02-26     1.0           0.0   0.0       0.0        0.0
133            AN 2020-02-27     6.0           0.0   0.0       0.0        0.0
152            AN 2020-02-28     8.0           0.0   0.0       0.0        0.0
171            AN 2020-02-29    12.0           0.0   0.0       0.0        0.0
190            AN 2020-03-01    12.0           0.0   0.0       0.0        0.0
209            AN 2020-03-02    12.0           0.0   0.0       0
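The de-cumulation step can be checked on a small frame. The sketch below uses invented numbers and hypothetical column names (`region`, `new_cases`), and assumes rows should be ordered by date within each region before differencing. It computes per-region daily increments and then confirms that a cumulative sum rebuilds the original series.

```python
import pandas as pd

# Hypothetical cumulative counts for two regions, interleaved by date.
toy = pd.DataFrame({
    "region": ["AN", "AR", "AN", "AR", "AN", "AR"],
    "date": pd.to_datetime([
        "2020-02-26", "2020-02-26",
        "2020-02-27", "2020-02-27",
        "2020-02-28", "2020-02-28",
    ]),
    "cases": [1.0, 0.0, 6.0, 2.0, 8.0, 5.0],
})

# Differencing only makes sense in chronological order within each region.
toy = toy.sort_values(["region", "date"])

# diff() gives NaN on each region's first row; fall back to the cumulative
# value there, since it is the increment from an implicit zero.
daily = toy.groupby("region")["cases"].diff()
toy["new_cases"] = daily.fillna(toy["cases"])

# Round trip: summing the increments per region should rebuild the input.
rebuilt = toy.groupby("region")["new_cases"].cumsum()
print(toy)
```

A negative increment in real data would indicate that the source revised a cumulative total downward, which is worth flagging rather than silently keeping.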

### Insert iso name

In [242]:
unrolled_df.insert(0, 'iso-3166-1-alpha-3', 'ESP')  # ISO 3166-1 alpha-3 code for Spain
unrolled_df.head()

Unnamed: 0,iso-3166-1-alpha-3,ccaa-iso-code,date,cases,hospitalized,uci,deceased,recovered
0,ESP,AN,2020-02-20,0.0,0.0,0.0,0.0,0.0
1,ESP,AR,2020-02-20,0.0,0.0,0.0,0.0,0.0
2,ESP,AS,2020-02-20,0.0,0.0,0.0,0.0,0.0
3,ESP,IB,2020-02-20,1.0,0.0,0.0,0.0,0.0
4,ESP,CN,2020-02-20,1.0,0.0,0.0,0.0,0.0


### <font color='red'>TODO</font>
1. Find correct latitude and longitude for regions
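1. Understand what uci is (likely "Unidad de Cuidados Intensivos", i.e. intensive care unit; confirm against the source documentation)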