## Tratamento de dados

Para podermos obter resultados fiáveis, é preciso haver um tratamento dos dados que vem do dataset.
Pegando dos dados processados anteriormente feito inicialmente, começamos por extrai-los do ficheiro para podermos manipulá-los.

In [1]:
import pandas as pd

covid_data = pd.read_csv('covid_19_clean_complete.csv')
covid_data

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
0,,Afghanistan,33.000000,65.000000,1/22/20,0,0,0
1,,Albania,41.153300,20.168300,1/22/20,0,0,0
2,,Algeria,28.033900,1.659600,1/22/20,0,0,0
3,,Andorra,42.506300,1.521800,1/22/20,0,0,0
4,,Angola,-11.202700,17.873900,1/22/20,0,0,0
...,...,...,...,...,...,...,...,...
27451,,Western Sahara,24.215500,-12.885800,5/4/20,6,0,5
27452,,Sao Tome and Principe,0.186360,6.613081,5/4/20,23,3,4
27453,,Yemen,15.552727,48.516388,5/4/20,12,2,0
27454,,Comoros,-11.645500,43.333300,5/4/20,3,0,0


De seguida, após uma análise, verificou-se a presença de entrada respeitantes a navios que nalgum momento tiveram casos de Covid-19 e não estão portanto associados a nenhum país particular.
Além disso, consideramos que estes dados iriam criar ruído, pelo que optamos por ignorá-los e remover dos dados em análise.

In [2]:
covid_data = covid_data.drop(covid_data[covid_data['Province/State']=='Grand Princess'].index)
covid_data = covid_data.drop(covid_data[covid_data['Province/State']=='Diamond Princess'].index)
covid_data = covid_data.drop(covid_data[covid_data['Country/Region']=='Diamond Princess'].index)
covid_data = covid_data.drop(covid_data[covid_data['Country/Region']=='MS Zaandam'].index)
covid_data = covid_data.reset_index()
del covid_data['index']
covid_data

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed,Deaths,Recovered
0,,Afghanistan,33.000000,65.000000,1/22/20,0,0,0
1,,Albania,41.153300,20.168300,1/22/20,0,0,0
2,,Algeria,28.033900,1.659600,1/22/20,0,0,0
3,,Andorra,42.506300,1.521800,1/22/20,0,0,0
4,,Angola,-11.202700,17.873900,1/22/20,0,0,0
...,...,...,...,...,...,...,...,...
27035,,Western Sahara,24.215500,-12.885800,5/4/20,6,0,5
27036,,Sao Tome and Principe,0.186360,6.613081,5/4/20,23,3,4
27037,,Yemen,15.552727,48.516388,5/4/20,12,2,0
27038,,Comoros,-11.645500,43.333300,5/4/20,3,0,0


A maioria das entradas da coluna **Province/State** tem valores nulos, pelo que procedemos a eliminá-los. Além disso, como ter uma entrada para uma região e país não é muito relevante, optamos por agregar as duas informações numa só coluna denominada de **Local**.

In [3]:
import numpy as np

covid_data['Province/State'] = covid_data.replace(np.nan, '', regex=True)
cols = ['Province/State', 'Country/Region']
covid_data['Local'] = covid_data[cols].apply(lambda row: ' / '.join(row.values.astype(str)) if row.values[0] != '' else ''.join(row.values.astype(str)), axis=1)
del covid_data['Province/State']
del covid_data['Country/Region']
covid_data

Unnamed: 0,Lat,Long,Date,Confirmed,Deaths,Recovered,Local
0,33.000000,65.000000,1/22/20,0,0,0,Afghanistan
1,41.153300,20.168300,1/22/20,0,0,0,Albania
2,28.033900,1.659600,1/22/20,0,0,0,Algeria
3,42.506300,1.521800,1/22/20,0,0,0,Andorra
4,-11.202700,17.873900,1/22/20,0,0,0,Angola
...,...,...,...,...,...,...,...
27035,24.215500,-12.885800,5/4/20,6,0,5,Western Sahara
27036,0.186360,6.613081,5/4/20,23,3,4,Sao Tome and Principe
27037,15.552727,48.516388,5/4/20,12,2,0,Yemen
27038,-11.645500,43.333300,5/4/20,3,0,0,Comoros


De seguida, vamos converter as datas em contagem de dias desde o início do dataset (22 de janeiro de 2020)

In [4]:
covid_data['Date'] = pd.to_datetime(covid_data['Date'],format='%m/%d/%y')
covid_data['Date'] -= pd.to_datetime("2020-01-22")
covid_data['Date'] /= np.timedelta64(1,'D')
covid_data = covid_data.rename(columns  = {'Date':'Days Passed'})
covid_data

Unnamed: 0,Lat,Long,Days Passed,Confirmed,Deaths,Recovered,Local
0,33.000000,65.000000,0.0,0,0,0,Afghanistan
1,41.153300,20.168300,0.0,0,0,0,Albania
2,28.033900,1.659600,0.0,0,0,0,Algeria
3,42.506300,1.521800,0.0,0,0,0,Andorra
4,-11.202700,17.873900,0.0,0,0,0,Angola
...,...,...,...,...,...,...,...
27035,24.215500,-12.885800,103.0,6,0,5,Western Sahara
27036,0.186360,6.613081,103.0,23,3,4,Sao Tome and Principe
27037,15.552727,48.516388,103.0,12,2,0,Yemen
27038,-11.645500,43.333300,103.0,3,0,0,Comoros


Por fim, adicionar as colunas da contagem do dia anterior. Este passo é um pouco mais longo tendo em conta o numero de linhas e a procura pelo valor anterior.

In [5]:
covid_data['Conf. Prev.'] = covid_data.apply(lambda row: 
                                                      covid_data[(covid_data['Local'] == row['Local']) & (covid_data['Days Passed'] == row['Days Passed']-1)]['Confirmed'].item()
                                                      if row['Days Passed'] > 0 else 0,axis=1)
covid_data['Deaths Prev.'] = covid_data.apply(lambda row: 
                                                      covid_data[(covid_data['Local'] == row['Local']) & (covid_data['Days Passed'] == row['Days Passed']-1)]['Deaths'].item()
                                                      if row['Days Passed'] > 0 else 0,axis=1)
covid_data['Recov. Prev.'] = covid_data.apply(lambda row: 
                                                      covid_data[(covid_data['Local'] == row['Local']) & (covid_data['Days Passed'] == row['Days Passed']-1)]['Recovered'].item()
                                                      if row['Days Passed'] > 0 else 0,axis=1)
covid_data

Unnamed: 0,Lat,Long,Days Passed,Confirmed,Deaths,Recovered,Local,Conf. Prev.,Deaths Prev.,Recov. Prev.
0,33.000000,65.000000,0.0,0,0,0,Afghanistan,0,0,0
1,41.153300,20.168300,0.0,0,0,0,Albania,0,0,0
2,28.033900,1.659600,0.0,0,0,0,Algeria,0,0,0
3,42.506300,1.521800,0.0,0,0,0,Andorra,0,0,0
4,-11.202700,17.873900,0.0,0,0,0,Angola,0,0,0
...,...,...,...,...,...,...,...,...,...,...
27035,24.215500,-12.885800,103.0,6,0,5,Western Sahara,6,0,5
27036,0.186360,6.613081,103.0,23,3,4,Sao Tome and Principe,16,1,4
27037,15.552727,48.516388,103.0,12,2,0,Yemen,10,2,0
27038,-11.645500,43.333300,103.0,3,0,0,Comoros,3,0,0


In [6]:
covid_data['Long']

0        65.000000
1        20.168300
2         1.659600
3         1.521800
4        17.873900
           ...    
27035   -12.885800
27036     6.613081
27037    48.516388
27038    43.333300
27039    71.276093
Name: Long, Length: 27040, dtype: float64

In [7]:
with pd.ExcelWriter('covid_19_distance.xlsx') as writer:
    covid_data.to_excel(writer)

## SVR

**Support Vector Regression** utiliza conceitos semelhantes aos aplicados no algoritmo de Support Vector Machine para aplicações em métodos de regressão.

Deste modo serão expostos alguns conceitos teóricos fundamentais para que se possa compreender este algoritmo:
-  **Hyper Plane** : Uma linha de separação que irá ajudar a prever os valores em causa.
-   **Boundary line** : Margens da hyperplane que separam os valores existentes.
-  **Support vectors** : Os pontos mais perto da "boundary line". A distância dos pontos é a minima.
    
Assim, neste algoritmo o objetivo é considerar os valores dentro da margem, ou seja aqueles com menor erro entre eles. Assim o objetivo é descobrir um hyper plain que mais se aproxime aos valores existentes, ou seja, que se tenha uma distância minima a um maior número destes pontos.

Desta forma a aplicação deste algoritmo vai ser feita auxiliando-nos da implementação presente na ferramenta scikit-learn. 

### Support Vector Regression

#### Parâmetros da pesquisa

Serão refer

   * **kernel** : rbf, sigmod foi implementado o default uma vez que a obtenção de resultados era muito superior às restantes opções
   * **epsilonfloat**: 0.1 e 0.2 - o valor associado ao valor da distância entre a "boundary line" e o "hyper plane"
   * **cache size**: 500 - permite diminuir o tempo de execução 
   * **C** : 1 uma vez que os dados com que trabalhamos não tem muito ruído

In [8]:
from sklearn.svm import SVR

In [9]:
covid_data

Unnamed: 0,Lat,Long,Days Passed,Confirmed,Deaths,Recovered,Local,Conf. Prev.,Deaths Prev.,Recov. Prev.
0,33.000000,65.000000,0.0,0,0,0,Afghanistan,0,0,0
1,41.153300,20.168300,0.0,0,0,0,Albania,0,0,0
2,28.033900,1.659600,0.0,0,0,0,Algeria,0,0,0
3,42.506300,1.521800,0.0,0,0,0,Andorra,0,0,0
4,-11.202700,17.873900,0.0,0,0,0,Angola,0,0,0
...,...,...,...,...,...,...,...,...,...,...
27035,24.215500,-12.885800,103.0,6,0,5,Western Sahara,6,0,5
27036,0.186360,6.613081,103.0,23,3,4,Sao Tome and Principe,16,1,4
27037,15.552727,48.516388,103.0,12,2,0,Yemen,10,2,0
27038,-11.645500,43.333300,103.0,3,0,0,Comoros,3,0,0


De forma semelhante ao algoritmo anterior procedemos à criação de sets para podermos treinar o nosso modelo, e por fim testá-lo.

In [11]:
#colunas em que vamos basear as previsões
x_columns = ['Lat','Long','Days Passed', 'Conf. Prev.','Deaths Prev.','Recov. Prev.']
#colunas que queremos prever
y_columns = ['Confirmed','Deaths','Recovered']

In [12]:
from sklearn.model_selection import train_test_split


#criar set de treino e teste
x_train, x_test, y_train, y_test = train_test_split(covid_data[x_columns], covid_data[y_columns], test_size=0.0096)


O uso de apenas este algoritmo não possibilita a previsão para mais do que um output.
Assim, foi preciso auxiliarmo-nos num "wrapper" - MultiOutputRegressor - de forma a conseguirmos
contornar este problema. Esta classe irá criar uma instância do modelo para cada um dos outputs do problema.

Contudo como referido, é criado um modelo separado para cada output. Consequentemente, esta
solução não consegue garantir a dependência entre as várias varíáveis, ou seja, assume que
os outputs são totalmente independentes uns dos outros. 

No contexto deste problema os outputs tem relação entre si uma vez que o maior número de casos confirmados 
implicam um maior número de mortes e recuperados. Apesar disto, decidimos avançar na mesma com este algoritmo uma
vez que a dependência entre os vários outputs poderá não ser assim tão significativa para os resultados finais.

In [13]:
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing

In [14]:
scaler = StandardScaler()
scaler.fit(x_train)

x_train = scaler.transform(x_train)
unscaled_test = x_test.join(y_test)
x_test = scaler.transform(x_test)


## RBF with epsilon=0.1

In [15]:
regressor_rbf = SVR(kernel='rbf',cache_size=500)
wrapper_rbf = MultiOutputRegressor(regressor_rbf)

In [16]:
wrapper_rbf.fit(x_train,y_train)
predictions_rbf = wrapper_rbf.predict(x_test)

In [22]:
predictions_rbf = pd.DataFrame(data=predictions_rbf,columns=['Confirmed Prediction','Deaths Prediction','Recovered Prediction'])
predictions_rbf['Local'] = unscaled_test.apply(lambda row: covid_data.loc[(covid_data['Lat'] == row['Lat']) & (covid_data['Long'] == row['Long']),'Local'].iloc[0],axis=1).tolist()
predictions_rbf['Confirmed Prediction'] = predictions_rbf['Confirmed Prediction'].apply(np.ceil)
predictions_rbf['Deaths Prediction'] = predictions_rbf['Deaths Prediction'].apply(np.ceil)
predictions_rbf['Recovered Prediction'] = predictions_rbf['Recovered Prediction'].apply(np.ceil)
predictions_rbf['Days Passed'] = unscaled_test['Days Passed'].tolist()

predictions_rbf['Confirmed Actual'] = unscaled_test['Confirmed'].tolist()
predictions_rbf['Deaths Actual'] = unscaled_test['Deaths'].tolist()
predictions_rbf['Recovered Actual'] = unscaled_test['Recovered'].tolist()

predictions_rbf['Confirmed Diff'] = (unscaled_test['Confirmed'] - predictions_rbf['Confirmed Prediction'].values).tolist()
predictions_rbf['Deaths Diff'] = (unscaled_test['Deaths'] - predictions_rbf['Deaths Prediction'].values).tolist()
predictions_rbf['Recovered Diff'] = (unscaled_test['Recovered'] - predictions_rbf['Recovered Prediction'].values).tolist()

predictions_rbf = predictions_rbf[['Days Passed','Local', 'Confirmed Prediction', 'Confirmed Diff','Deaths Prediction', 'Deaths Diff','Recovered Prediction','Recovered Diff']]
predictions_rbf

Unnamed: 0,Days Passed,Local,Confirmed Prediction,Confirmed Diff,Deaths Prediction,Deaths Diff,Recovered Prediction,Recovered Diff
0,4.0,Cuba,10.0,-10.0,1.0,-1.0,-0.0,0.0
1,62.0,Equatorial Guinea,23.0,-14.0,-0.0,0.0,-0.0,0.0
2,102.0,Zimbabwe,124.0,-90.0,5.0,-1.0,40.0,-35.0
3,0.0,South Sudan,8.0,-8.0,1.0,-1.0,1.0,-1.0
4,42.0,Sudan,10.0,-10.0,1.0,-1.0,1.0,-1.0
...,...,...,...,...,...,...,...,...
255,87.0,Saint Lucia,104.0,-89.0,3.0,-3.0,15.0,-4.0
256,23.0,Ontario / Canada,-8.0,11.0,-0.0,0.0,-0.0,0.0
257,100.0,Moldova,547.0,3433.0,53.0,69.0,170.0,1102.0
258,22.0,Djibouti,-4.0,4.0,1.0,-1.0,1.0,-1.0


In [23]:
with pd.ExcelWriter('results_svr_rbf.xlsx') as writer:
    predictions_rbf.to_excel(writer)

## Sigmoid with epsilon=0.1


In [18]:
regressor_s = SVR(kernel='sigmoid', cache_size=500)
wrapper_s = MultiOutputRegressor(regressor_s)

In [19]:
wrapper_s.fit(x_train,train[y_columns])
predictions_s = wrapper_s.predict(x_test)

In [20]:
predictions_s = pd.DataFrame(data=predictions_s,columns=['Confirmed Prediction','Deaths Prediction','Recovered Prediction'])
predictions_s['Local'] = test['Local'].tolist()
predictions_s['Confirmed Prediction'] = predictions_s['Confirmed Prediction'].apply(np.ceil)
predictions_s['Deaths Prediction'] = predictions_s['Deaths Prediction'].apply(np.ceil)
predictions_s['Recovered Prediction'] = v['Recovered Prediction'].apply(np.ceil)
predictions_s['Days Passed'] = test['Days Passed'].tolist()

predictions_s['Confirmed Actual'] = test['Confirmed'].tolist()
predictions_s['Deaths Actual'] = test['Deaths'].tolist()
predictions_s['Recovered Actual'] = test['Recovered'].tolist()

predictions_s['Confirmed Diff'] = (unscaled_test['Confirmed'] - predictions_s['Confirmed Prediction'].values).tolist()
predictions_s['Deaths Diff'] = (unscaled_test['Deaths'] - predictions_s['Deaths Prediction'].values).tolist()
predictions_s['Recovered Diff'] = (unscaled_test['Recovered'] - predictions_s['Recovered Prediction'].values).tolist()

predictions_s = predictions_s[['Days Passed','Local', 'Confirmed Prediction', 'Confirmed Diff','Deaths Prediction', 'Deaths Diff','Recovered Prediction','Recovered Diff']]
predictions_s

Unnamed: 0,Days Passed,Local,Confirmed Prediction,Confirmed Actual,Deaths Prediction,Deaths Actual,Recovered Prediction,Recovered Actual
0,87.0,Victoria / Australia,-95.0,1319,-670.0,14,-343.0,1172
1,46.0,Congo (Brazzaville),97.0,0,54.0,0,50.0,0
2,41.0,Yunnan / China,122.0,174,41.0,2,77.0,0
3,11.0,Saint Lucia,-120.0,0,-94.0,0,-41.0,0
4,13.0,Serbia,-31.0,0,15.0,0,-46.0,0
...,...,...,...,...,...,...,...,...
255,21.0,Hunan / China,98.0,946,8.0,2,101.0,304
256,51.0,Qatar,111.0,320,19.0,0,50.0,0
257,66.0,United Arab Emirates,164.0,468,29.0,2,78.0,52
258,41.0,Portugal,33.0,2,-26.0,0,-1.0,0


In [21]:
with pd.ExcelWriter('results_svr_s.xlsx') as writer:
    predictions_s.to_excel(writer)

## Sigmoid with epsilon=0.2

In [22]:
regressor_s2 = SVR(kernel='sigmoid', epsilon=0.2, cache_size = 500)
wrapper_s2 = MultiOutputRegressor(regressor_s2)

In [23]:
wrapper_s2.fit(x_train,train[y_columns])
predictions_s2 = wrapper_s2.predict(x_test)

In [24]:
predictions_s2 = pd.DataFrame(data=predictions_s2,columns=['Confirmed Prediction','Deaths Prediction','Recovered Prediction'])
predictions_s2['Local'] = test['Local'].tolist()
predictions_s2['Confirmed Prediction'] = predictions_s2['Confirmed Prediction'].apply(np.ceil)
predictions_s2['Deaths Prediction'] = predictions_s2['Deaths Prediction'].apply(np.ceil)
predictions_s2['Recovered Prediction'] = predictions_s2['Recovered Prediction'].apply(np.ceil)
predictions_s2['Days Passed'] = test['Days Passed'].tolist()

predictions_s2['Confirmed Actual'] = test['Confirmed'].tolist()
predictions_s2['Deaths Actual'] = test['Deaths'].tolist()
predictions_s2['Recovered Actual'] = test['Recovered'].tolist()

predictions_s2['Confirmed Diff'] = (unscaled_test['Confirmed'] - predictions_s2['Confirmed Prediction'].values).tolist()
predictions_s2['Deaths Diff'] = (unscaled_test['Deaths'] - predictions_s2['Deaths Prediction'].values).tolist()
predictions_s2['Recovered Diff'] = (unscaled_test['Recovered'] - predictions_s2['Recovered Prediction'].values).tolist()

predictions_s2 = predictions_s2[['Days Passed','Local', 'Confirmed Prediction', 'Confirmed Diff','Deaths Prediction', 'Deaths Diff','Recovered Prediction','Recovered Diff']]
predictions_s2

Unnamed: 0,Days Passed,Local,Confirmed Prediction,Confirmed Actual,Deaths Prediction,Deaths Actual,Recovered Prediction,Recovered Actual
0,87.0,Victoria / Australia,-95.0,1319,-670.0,14,-344.0,1172
1,46.0,Congo (Brazzaville),97.0,0,54.0,0,50.0,0
2,41.0,Yunnan / China,122.0,174,41.0,2,77.0,0
3,11.0,Saint Lucia,-119.0,0,-93.0,0,-41.0,0
4,13.0,Serbia,-31.0,0,15.0,0,-46.0,0
...,...,...,...,...,...,...,...,...
255,21.0,Hunan / China,98.0,946,8.0,2,101.0,304
256,51.0,Qatar,111.0,320,19.0,0,50.0,0
257,66.0,United Arab Emirates,164.0,468,29.0,2,78.0,52
258,41.0,Portugal,33.0,2,-26.0,0,-1.0,0


In [25]:
with pd.ExcelWriter('results_svr_s2.xlsx') as writer:
    predictions_s2.to_excel(writer)

## Rbf com epsilon=0.2

In [28]:
regressor_rbf2 = SVR(kernel='rbf', epsilon=0.5,cache_size=500)
wrapper_rbf2 = MultiOutputRegressor(regressor_rbf2)

In [29]:
wrapper_rbf2.fit(x_train,train[y_columns])
predictions_rbf2 = wrapper_rbf2.predict(x_test)

In [30]:
predictions_rbf2 = pd.DataFrame(data=predictions_rbf2,columns=['Confirmed Prediction','Deaths Prediction','Recovered Prediction'])
predictions_rbf2['Local'] = test['Local'].tolist()
predictions_rbf2['Confirmed Prediction'] = predictions_rbf2['Confirmed Prediction'].apply(np.ceil)
predictions_rbf2['Deaths Prediction'] = predictions_rbf2['Deaths Prediction'].apply(np.ceil)
predictions_rbf2['Recovered Prediction'] = predictions_rbf2['Recovered Prediction'].apply(np.ceil)
predictions_rbf2['Days Passed'] = test['Days Passed'].tolist()

predictions_rbf2['Confirmed Actual'] = test['Confirmed'].tolist()
predictions_rbf2['Deaths Actual'] = test['Deaths'].tolist()
predictions_rbf2['Recovered Actual'] = test['Recovered'].tolist()

predictions_rbf2['Confirmed Diff'] = (unscaled_test['Confirmed'] - predictions_rbf2['Confirmed Prediction'].values).tolist()
predictions_rbf2['Deaths Diff'] = (unscaled_test['Deaths'] - predictions_rbf2['Deaths Prediction'].values).tolist()
predictions_rbf2['Recovered Diff'] = (unscaled_test['Recovered'] - predictions_rbf2['Recovered Prediction'].values).tolist()

predictions_rbf2 = predictions_rbf2[['Days Passed','Local', 'Confirmed Prediction', 'Confirmed Diff','Deaths Prediction', 'Deaths Diff','Recovered Prediction','Recovered Diff']]
predictions_rbf2

Unnamed: 0,Days Passed,Local,Confirmed Prediction,Confirmed Diff,Deaths Prediction,Deaths Diff,Recovered Prediction,Recovered Diff
0,7.0,Gambia,1.0,-1.0,-0.0,0.0,-0.0,0.0
1,85.0,Burma,234.0,-149.0,2.0,2.0,49.0,-47.0
2,18.0,Libya,-0.0,0.0,-0.0,0.0,-0.0,0.0
3,64.0,Afghanistan,82.0,12.0,1.0,3.0,6.0,-4.0
4,17.0,Taiwan*,-0.0,17.0,-0.0,0.0,-0.0,1.0
...,...,...,...,...,...,...,...,...
255,35.0,Laos,2.0,-2.0,1.0,-1.0,1.0,-1.0
256,69.0,Turks and Caicos Islands / United Kingdom,111.0,-106.0,1.0,-1.0,12.0,-12.0
257,21.0,Spain,-0.0,2.0,-0.0,0.0,1.0,-1.0
258,25.0,Chongqing / China,5.0,546.0,3.0,2.0,6.0,201.0


In [31]:
with pd.ExcelWriter('results_svr_rbf2.xlsx') as writer:
    predictions_rbf2.to_excel(writer)