# COVID-19 Regression Models
-----------------

## Authors

* Filipe Ferreira, Student, up201706086@fe.up.pt
* Mark Meehan, Student,     up201704581@fe.up.pt
* Sofia Lajes, Student,     up201704066@fe.up.pt

### Affiliations

[![feup_logo](images/logo_cores_oficiais.jpg)](http://www.fe.up.pt)

Faculdade de Engenharia, Universidade do Porto Rua Dr. Roberto Frias, 4200-465 Porto, Portugal


# Abstract

The outbreak of the novel coronavirus disease (COVID-19) has been one of the worst outbreaks in human history, leaving thousands unemployed and bringing economies to a halt. Countries have responded to the pandemic in different ways, some imposing quarantines and closing their borders earlier than others. The purpose of this paper is to explore how Machine Learning models, more specifically Supervised Learning models, can be trained to predict the number of cases, deaths and recoveries in each country.

**Keywords:** Coronavirus, outbreak, COVID-19, Data Science, Machine Learning, Regression, Data Mining

## Introduction

Given the health risk posed by the current pandemic, it seemed relevant and important to study the outbreak and develop a better understanding of the challenges that lie ahead.

Therefore, we chose to study COVID-19 and develop regression models for its near-future evolution in many countries, especially those most severely affected by the outbreak. To predict the number of confirmed cases, deaths and recovered patients in each country, we developed a set of models and analysed their scores and accuracy on the current data. These models were a Neural Network, K-Nearest Neighbours and a Support Vector Machine.

The same input features were used for all of these models: Days since January 2020, Population Density, Urban Population percentage, and the country's percentage of the total population.

# Imports and Dependencies

In [None]:
%matplotlib inline
%reload_ext watermark

from datetime import datetime
from datetime import timedelta

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sb
from IPython.display import display

import sklearn as sk
import sklearn.neural_network as sknn
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# KNN
from sklearn.neighbors import KNeighborsRegressor

# SVM
from sklearn.svm import SVR


# Data

In this section we load the two datasets we use, the COVID-19 dataset and the country population dataset, clean the data and add some new columns.

The datasets we use are:

* https://www.kaggle.com/imdevskp/corona-virus-report for the COVID-19 dataset

* https://www.kaggle.com/tanuprabhu/population-by-country-2020 for population information per country


## COVID-19 Dataset

This dataset is updated daily in the Kaggle repository; to check how it was gathered and cleaned, see the link above or the references. This notebook was made locally and comes with a script to download the datasets from Kaggle.

In [None]:
# Loading datasets

full_table = pd.read_csv('datasets/covid_19_clean_complete.csv', 
                          na_values=['NaN'],
                          parse_dates=['Date'])

# Adding Active cases column
full_table['Active'] = (full_table['Confirmed'] - full_table['Deaths'] - full_table['Recovered']).apply(lambda x: x if x >= 0 else 0)

# filling missing values
full_table[['Province/State']] = full_table[['Province/State']].fillna('')
full_table[['Confirmed','Deaths','Recovered','Active']] = full_table[['Confirmed','Deaths','Recovered','Active']].fillna(0)

full_table.sample(6)


## Population Dataset


In [None]:
pop_table = pd.read_csv('datasets/population_by_country_2020.csv',
                        na_values=['N.A.'])


# Selecting the Country, Population, Population Density and Urban Population columns
pop_table = pop_table.iloc[:,[0,1,4,9]]



# Renaming columns
pop_table.columns = ['Country/Region', 'Population', 'Population Density (P/Km²)','Urban Population %']

# Most entries with a missing urban population in this dataset are 100% urban as of 2020, so we fill missing values with '100 %'
pop_table[['Urban Population %']] = pop_table[['Urban Population %']].fillna('100 %')
pop_table['Urban Population %'] = pop_table['Urban Population %'].map(lambda x: int(x.split(' ')[0]))


In [None]:
pop_table.info()
pop_table.isna().sum()

## Removing ship data

The dataset also includes data from several ships that had COVID-19 outbreaks. Since we only need the information per country, we remove these rows from the dataset.
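As a minimal sketch of this filtering step, the example below applies the same idea to a small made-up table. The frame `toy` and its values are assumptions for illustration only, and checking every ship name in both columns is a slight generalization of what the notebook cell does.

```python
import pandas as pd

# Toy frame standing in for full_table; the rows are invented for illustration.
toy = pd.DataFrame({
    'Province/State': ['Grand Princess', '', 'Diamond Princess', ''],
    'Country/Region': ['Canada', 'Portugal', 'Canada', 'MS Zaandam'],
    'Confirmed': [13, 100, 1, 9],
})

ship_names = ['Grand Princess', 'Diamond Princess', 'MS Zaandam']
pattern = '|'.join(ship_names)  # regular expression matching any ship name

# A row is a ship row if either column mentions one of the ship names.
is_ship = (toy['Province/State'].str.contains(pattern)
           | toy['Country/Region'].str.contains(pattern))

countries_only = toy[~is_ship]
print(countries_only)
```

Building a single boolean mask and negating it keeps the ship rows available for inspection while the main table stays country-only.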

In [None]:
# ship rows
ship_rows = (
    full_table['Province/State'].str.contains('Grand Princess')
    | full_table['Province/State'].str.contains('Diamond Princess')
    | full_table['Country/Region'].str.contains('Diamond Princess')
    | full_table['Country/Region'].str.contains('MS Zaandam')
)

# ship
ship = full_table[ship_rows]

# dropping ship rows 
full_table = full_table[~(ship_rows)]

ship.sample(6)


## Fixing country names


### Fixing mismatched names between datasets

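Here we manually map the country names that differ between the two datasets so that the join works. The keys are the names used in the population dataset and the values are the names used in the COVID-19 dataset.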


In [None]:
fix_name_only = {
    'Sao Tome & Principe': 'Sao Tome and Principe',
    "Côte d'Ivoire": "Cote d'Ivoire",
    "United States": "US",
    "Czech Republic (Czechia)": 'Czechia',
    'Myanmar': 'Burma',
    'Taiwan': 'Taiwan*',
    'Saint Kitts & Nevis': 'Saint Kitts and Nevis',
    'Macao' : 'Macau'
}

for original,new in fix_name_only.items():
    full_table.loc[full_table['Country/Region'] == new, 'Country/Region'] = original
    full_table.loc[full_table['Province/State'] == new, 'Province/State'] = original

missing_countries = set(full_table['Country/Region']).difference(set(pop_table['Country/Region']))

# # print(sorted(pop_table['Country/Region'].unique()))
# if len(missing_countries) != 0:
#     print(missing_countries)


### Replacing Country/Region with Province/State

The population dataset has entries for autonomous regions, for example Greenland. Here we rewrite the Country/Region column with the Province/State name so we can easily join the population dataset. For example, Greenland exists in the population dataset so what we do is replace Denmark (the Country column of Greenland) with Greenland.
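The replacement described above can be sketched on a tiny example. The two frames below are made up for illustration; only the Greenland and Denmark case comes from the text.

```python
import pandas as pd

# Invented stand-ins for the COVID-19 table and the population table.
covid = pd.DataFrame({
    'Province/State': ['Greenland', '', 'Faroe Islands'],
    'Country/Region': ['Denmark', 'Denmark', 'Denmark'],
})
population = pd.DataFrame({'Country/Region': ['Denmark', 'Greenland']})

# Provinces that have their own row in the population table.
known = set(covid['Province/State']) & set(population['Country/Region'])

# Promote those provinces to the Country/Region column.
mask = covid['Province/State'].isin(known)
covid.loc[mask, 'Country/Region'] = covid.loc[mask, 'Province/State']

print(covid)
```

In this sketch, Greenland becomes its own country entry, while the Faroe Islands row stays under Denmark because the toy population table has no entry for it.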

In [None]:

province_set = set(full_table['Province/State']).intersection(set(pop_table['Country/Region']))

no_data = set(['Saint Vincent and the Grenadines','Kosovo','Congo','West Bank and Gaza'])

for province in province_set:
    if province in no_data:
        continue
    full_table.loc[ full_table['Province/State'] == province,'Country/Region'] = province 



In [None]:
# Check for null values
full_table.isna().sum()

## Grouping data

Here we are grouping data by Date and Country so we can add population and cases per million afterwards.

### Group by Country

In [None]:
# Select the value columns with a list (double brackets); tuple-style selection is not supported by recent pandas versions
full_grouped = full_table.groupby(['Country/Region','Lat','Long','Date'])[['Confirmed','Deaths','Recovered','Active']].sum().reset_index()
full_grouped_nolat = full_table.groupby(['Country/Region','Date'])[['Confirmed','Deaths','Recovered','Active']].sum().reset_index()

full_grouped.head()

### Adding population
In this section, we merge both datasets by Country/Region.
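Because the default merge is an inner join, any country whose name does not match in both tables is silently dropped. The sketch below shows one way to spot such rows using the `indicator` option of `pd.merge`. The frames and population figures are rough, made-up values for illustration.

```python
import pandas as pd

# Invented stand-ins: 'Burma' and 'Myanmar' deliberately do not match.
cases = pd.DataFrame({'Country/Region': ['Portugal', 'Burma', 'Spain'],
                      'Confirmed': [10, 20, 30]})
population = pd.DataFrame({'Country/Region': ['Portugal', 'Myanmar', 'Spain'],
                           'Population': [10_000_000, 54_000_000, 47_000_000]})

# Inner join (the default): the unmatched country disappears.
inner = pd.merge(cases, population, on=['Country/Region'])

# Left join with an indicator column: unmatched rows are flagged instead.
checked = pd.merge(cases, population, on=['Country/Region'],
                   how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Country/Region'].tolist()

print(len(inner), unmatched)
```

Running a check like this before the real merge can confirm that the name fixes covered every country.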

In [None]:
full_grouped = pd.merge(full_grouped,pop_table,on=['Country/Region'])
full_grouped_nolat = pd.merge(full_grouped_nolat,pop_table,on=['Country/Region'])

full_grouped

### Calculating new cases per day

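To calculate the number of new cases, deaths and recoveries per day, we take the difference between consecutive days of the cumulative totals of each country. The first day of each country has no previous day, so its value is set to 0.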
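Daily values can be derived from cumulative totals by differencing within each country. The sketch below uses a grouped `diff`, which is an alternative to the mask-based approach in the cell that follows, not the notebook's own code. The data is invented.

```python
import pandas as pd

# Invented cumulative totals for two countries over three days.
cumulative = pd.DataFrame({
    'Country/Region': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03'] * 2),
    'Confirmed': [1, 4, 9, 100, 100, 130],
})

# Differences must be taken in date order within each country.
cumulative = cumulative.sort_values(['Country/Region', 'Date'])

# diff() restarts for every group, so the first day of each country is NaN
# and is filled with 0 here.
cumulative['New cases'] = (cumulative.groupby('Country/Region')['Confirmed']
                           .diff().fillna(0).astype('int64'))

print(cumulative)
```

Grouping before differencing avoids subtracting the last value of one country from the first value of the next.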

In [None]:
# Dataframe with latitude and longitude
temp = full_grouped.groupby(['Country/Region', 'Date'])[['Confirmed', 'Deaths', 'Recovered']]
temp = temp.sum().diff().reset_index()

mask = temp['Country/Region'] != temp['Country/Region'].shift(1)

temp.loc[mask, 'Confirmed'] = np.nan
temp.loc[mask, 'Deaths'] = np.nan
temp.loc[mask, 'Recovered'] = np.nan


temp.columns = ['Country/Region', 'Date','New cases', 'New deaths', 'New recovered']


full_grouped = pd.merge(full_grouped,temp, on=['Country/Region', 'Date'])

full_grouped = full_grouped.fillna(0)

full_grouped[['New cases','New deaths','New recovered']] = full_grouped[['New cases','New deaths','New recovered']].astype('int64')

#############################################################################################################################
# # Dataset with no lat and long

temp = full_grouped_nolat.groupby(['Country/Region', 'Date' ])[['Confirmed', 'Deaths', 'Recovered']]
temp = temp.sum().diff().reset_index()

mask = temp['Country/Region'] != temp['Country/Region'].shift(1)

temp.loc[mask, 'Confirmed'] = np.nan
temp.loc[mask, 'Deaths'] = np.nan
temp.loc[mask, 'Recovered'] = np.nan

temp.columns = ['Country/Region', 'Date', 'New cases', 'New deaths', 'New recovered']

full_grouped_nolat = pd.merge(full_grouped_nolat,temp, on=['Country/Region', 'Date'])

full_grouped_nolat = full_grouped_nolat.fillna(0)

full_grouped_nolat[['New cases','New deaths','New recovered']] = full_grouped_nolat[['New cases','New deaths','New recovered']].astype('int64')


### World Data

In [None]:
world_data = full_grouped.groupby(['Date'])[['Confirmed','Deaths','Recovered','Active','Population']].sum().reset_index()
world_data.loc[world_data['Date'] == world_data['Date'].max()]
world_data.head()
world_pop_total = world_data['Population'].max()

In [None]:
# Check information on types and null values
full_grouped.info()
full_grouped.loc[full_grouped['Urban Population %'].isnull()]['Country/Region'].unique()


In [None]:
full_grouped.sample(6)

### Calculating Cases per Million of People




In [None]:

def calc_permillion(df):
    df['Confirmed per million'] = round((df['Confirmed'] / df['Population']) * 1000000)
    df['Deaths per million']    = round((df['Deaths'] / df['Population']) * 1000000)
    df['Recovered per million'] = round((df['Recovered'] / df['Population']) * 1000000)
    df['Active per million']    = round((df['Active'] / df['Population']) * 1000000)
    return df

def calc_permillion_world():
    world_data['Confirmed per million'] = round((world_data['Confirmed'] / world_data['Population']) * 1000000)
    world_data['Deaths per million']    = round((world_data['Deaths'] / world_data['Population']) * 1000000)
    world_data['Recovered per million'] = round((world_data['Recovered'] / world_data['Population']) * 1000000)
    world_data['Active per million']    = round((world_data['Active'] / world_data['Population']) * 1000000)



per_million = calc_permillion(full_grouped)
per_million_nolat = calc_permillion(full_grouped_nolat)

per_million.head()

### Value Truncation

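Since the COVID-19 dataset only has data from the 22nd of January 2020 onwards, from here on we express Date as a number of days counted from a fixed reference date. Note that the helper functions below count from the 20th of January 2020, so the first day in the dataset has the value 2.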

In [None]:
per_million['Date'].min()

This function does that:

In [None]:
def daysSinceJan(d):
    return d.toordinal() - datetime(2020,1,20).toordinal()

def revertdaysSince(d):
    return datetime(2020,1,20) + timedelta(days=d)
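The same conversion can also be done on a whole column at once. The sketch below is a vectorized alternative, assuming the same reference date as `daysSinceJan`. The names `REFERENCE`, `dates`, `days` and `restored` are introduced here for illustration only.

```python
from datetime import datetime, timedelta

import pandas as pd

REFERENCE = datetime(2020, 1, 20)  # same reference date as daysSinceJan

dates = pd.Series(pd.to_datetime(['2020-01-22', '2020-02-01', '2020-03-01']))

# Subtracting a timestamp from a datetime column yields timedeltas;
# .dt.days extracts the whole number of days.
days = (dates - pd.Timestamp(REFERENCE)).dt.days

# Converting back adds the day offset to the reference date.
restored = [REFERENCE + timedelta(days=int(d)) for d in days]

print(days.tolist())
```

Keeping the reference date in one named constant makes it harder for the forward and reverse conversions to drift apart.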

In [None]:
per_million_nolat.info()

In [None]:
per_million['Days Since Jan'] = per_million['Date'].map(daysSinceJan)

per_million_nolat['Days Since Jan'] = per_million_nolat['Date'].map(daysSinceJan)

In [None]:
per_million =  per_million.sort_values(['Date','Country/Region'],ascending=[True, True])
per_million.head()

In [None]:
per_million_nolat.isna().sum()

### Adding population percentage

In [None]:
per_million['Pop %'] = per_million['Population'].apply(lambda x: x/world_pop_total * 100)
per_million_nolat['Pop %'] = per_million_nolat['Population'].apply(lambda x: x/world_pop_total * 100)

In [None]:
full_grouped.head()

In [None]:
per_million_nolat

In [None]:
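def removeChinaData(df):
    # Return the rows of df that do not belong to China
    return df[df['Country/Region'] != 'China']

full_grouped = removeChinaData(full_grouped)
full_grouped_nolat = removeChinaData(full_grouped_nolat)
per_million_nolat = removeChinaData(per_million_nolat)
per_million = removeChinaData(per_million)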

In [None]:
per_million.loc[per_million['Country/Region'] == 'Portugal']

# Model Training

We decided to train three different models provided by the scikit-learn API: MLPRegressor (Multi-layer Perceptron), KNeighborsRegressor (K-Nearest Neighbours) and SVR (Support Vector Regression). Each of the following sections explains the inputs and outputs of one algorithm and how it is used.

## Inputs and outputs

The helper functions below plot the data, build the static input features for a given country, and plot a model's predictions for that country:

In [None]:
def print_graph(world_viz,things = ['Confirmed','Deaths','Recovered'],label='C/D/R',x = 'Date',title='Title',limit_x=None):
    sb.set()

    dd = world_viz.melt([x],var_name=label, value_name='Cases',value_vars=things)


    chart = sb.relplot(x=x,y='Cases',hue=label,data=dd,kind='line')

    chart.set(title=title)
    chart.fig.autofmt_xdate()
    if limit_x is not None:
        plt.axvline(limit_x,0,1,linewidth=1, color='r',linestyle='--')

    plt.show()

In [None]:
def getPredInp(country):
    return [per_million.loc[per_million['Country/Region'] == country]['Population Density (P/Km²)'].max(),
    per_million.loc[per_million['Country/Region'] == country]['Urban Population %'].max(),
    per_million.loc[per_million['Country/Region'] == country]['Pop %'].max()]

# per_million.loc[per_million['Country/Region'] == 'Portugal'].groupby(['Lat','Long']).size()

In [None]:
def print_predict(model,country,minprev=0,offset=10,maxprev=daysSinceJan(per_million_nolat['Date'].max()),print_real=True,title="Future Prediction"):

    
    country_pop = per_million_nolat.loc[per_million_nolat['Country/Region'] == country]['Population'].max()
    ip = []

    for dat in range(minprev,maxprev+offset+1): # Predict from minprev to maxprev+offset days since the reference date
        ip.append([dat, *getPredInp(country)]) # Static features: Pop Density, Urban Pop % and Pop %

    out = model.predict(ip)

    nl = []

    for i,o in zip(ip,out):
        nl.append([*i,*o])

    futurepredict = pd.DataFrame(nl,columns=['Date','Population Density (P/Km²)','Urban Population %',"Pop %",'Confirmed','Deaths','Recovered'])

    futurepredict['Confirmed'] = futurepredict['Confirmed'].map(lambda x: round((x/1000000) * country_pop))
    futurepredict['Recovered'] = futurepredict['Recovered'].map(lambda x: round((x/1000000) * country_pop))
    futurepredict['Deaths'] = futurepredict['Deaths'].map(lambda x: round((x/1000000) * country_pop) )

    futurepredict['Date'] = futurepredict['Date'].map(revertdaysSince) 

    if print_real:
        print_graph(per_million_nolat.loc[per_million_nolat['Country/Region'] == country],['Confirmed','Deaths','Recovered'], title='Real Data')
    print_graph(futurepredict,['Confirmed','Deaths','Recovered'],x='Date', title=title,limit_x=revertdaysSince(maxprev))



## MLPRegressor

MLPRegressor is scikit-learn's Multi-layer Perceptron (MLP) regressor. The MLP is very sensitive to feature scaling, so we tried to normalize the input data as much as we could. For example, instead of the population of the country, we used its percentage of the world's population (more precisely, of the sum of the populations of all countries in the dataset). The activation function that we found to work best for this dataset is the rectified linear unit (ReLU).
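One common way to handle feature scaling is to put a `StandardScaler` in front of the network inside a `Pipeline`. The sketch below illustrates that option on synthetic data. It is not the configuration used in this notebook, and the data, layer size and random seed are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: column 0 plays the role of 'Days Since Jan',
# column 1 is a feature that is constant within one country.
X = np.column_stack([np.arange(60, dtype=float), np.full(60, 111.0)])
y = 0.05 * X[:, 0] ** 2 + rng.normal(0, 1, 60)

model = Pipeline([
    ("scl", StandardScaler()),  # zero mean, unit variance per feature
    ("mlp", MLPRegressor(hidden_layer_sizes=(16,), max_iter=500, random_state=0)),
])
model.fit(X, y)

# The scaler is fitted inside the pipeline and reused at prediction time.
scaled = model.named_steps["scl"].transform(X)
pred = model.predict(X[:5])
print(scaled[:, 0].mean(), scaled[:, 0].std(), pred.shape)
```

Placing the scaler inside the pipeline means the statistics learned on the training data are applied unchanged to any later input.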

In [None]:
def nn_by_country_train(
    country,
    input=['Days Since Jan','Population Density (P/Km²)','Urban Population %','Pop %'],
    out= ['Confirmed per million','Deaths per million','Recovered per million'],
    hidden_layer_sizes=(100,100,100,60,60,60)
):
    df = per_million_nolat.loc[ per_million_nolat['Country/Region'] == country]
    X = df[input].values
    Y = df[out].values
    nn = sknn.MLPRegressor(
        hidden_layer_sizes=hidden_layer_sizes,
        activation='relu',
        solver='adam',
        alpha=0.00001,
        batch_size='auto',
        max_iter=2000,
        n_iter_no_change=500)

    X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.30,shuffle=True)
    nn.fit(X_train,y_train) # Train the model
    y_pred = nn.predict(X_test)

    nnr2 = nn.score(X_test,y_test) # Calculate R² for the model
    nnmae = mean_absolute_error(y_test,y_pred) # Mean Absolute Error
    nnmse = mean_squared_error(y_test,y_pred) # Mean Squared Error

    print(f"MLPRegressor scores for {country}:")
    print("R2:",nnr2)
    print("MAE:",nnmae)
    print("MSE:",nnmse)

    return nn

We use Mexico as an example here, but you can change it to any other country. Here we can see how the number of nodes per layer and the number of layers affect the ability of the MLP regressor to fit the real data. During training, the model minimizes the squared error. In the printed scores, R2 is the coefficient of determination (R²), MAE is the Mean Absolute Error and MSE is the Mean Squared Error.
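To make the three scores concrete, the sketch below computes them by hand on four made-up values and compares them with the scikit-learn functions. The numbers are invented for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 5.0])

errors = y_true - y_pred

mae_manual = np.mean(np.abs(errors))   # average absolute error
mse_manual = np.mean(errors ** 2)      # average squared error

# R² compares the squared error with that of always predicting the mean.
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual, mse_manual, r2_manual)
print(mean_absolute_error(y_true, y_pred),
      mean_squared_error(y_true, y_pred),
      r2_score(y_true, y_pred))
```

An R² of 1 means a perfect fit, 0 means no better than predicting the mean, and negative values mean worse than that.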

In [None]:
country = 'Mexico'

nn  = nn_by_country_train(country)
nn_10_1 = nn_by_country_train(country,hidden_layer_sizes=(10,))
nn_10_2 = nn_by_country_train(country,hidden_layer_sizes=(10,10))
nn_10_3 = nn_by_country_train(country,hidden_layer_sizes=(10,10,10))
nn_10_4 = nn_by_country_train(country,hidden_layer_sizes=(10,10,10,10,10,10,10,10))
nn_100_1 = nn_by_country_train(country,hidden_layer_sizes=(100,))
nn_100_2 = nn_by_country_train(country,hidden_layer_sizes=(100,100))
nn_100_3 = nn_by_country_train(country,hidden_layer_sizes=(100,100,100))
nn_100_4 = nn_by_country_train(country,hidden_layer_sizes=(100,100,100,100,100,100,100))

The offset is the number of days to predict beyond the most recent date in the dataset. From these plots we can draw some conclusions:

1. The number of hidden layers influences the shape of the regression the MLP produces: the more layers it has, the better it can follow the curve. In our runs, networks with one or two hidden layers produced almost straight lines, while networks with more than two layers were able to follow curves.
2. After a certain amount of time, depending on the country, the prediction always tends towards a straight line.
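The second observation is consistent with how ReLU networks behave in general: their output is piecewise linear in the input, so past the last "kink" the prediction continues as a straight line. The sketch below shows this on a tiny hand-written network. The function `tiny_relu_net` and its weights are hypothetical and chosen by hand, not taken from a trained model.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: max(z, 0) applied element-wise."""
    return np.maximum(z, 0.0)

def tiny_relu_net(x):
    """One hidden layer with two ReLU units; kinks at x = 1 and x = 3."""
    return relu(x - 1.0) - 2.0 * relu(x - 3.0) + 0.5

# Inside the region with kinks the output bends.
inside = tiny_relu_net(np.array([0.0, 2.0, 4.0]))

# Beyond the last kink every unit is active, so the output is a straight line
# and its second difference is zero.
outside = tiny_relu_net(np.array([10.0, 11.0, 12.0, 13.0]))
second_diff = np.diff(outside, n=2)

print(inside, outside, second_diff)
```

A trained network has many more kinks, but a forecast far enough beyond the training dates eventually passes all of them.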

In [None]:
print_predict(nn,country,offset=120,print_real=True,title="MLPRegressor Prediction")
print_predict(nn_10_1,country,offset=120,print_real=False, title="MLPRegressor Prediction (10)")
print_predict(nn_10_2,country,offset=120,print_real=False, title="MLPRegressor Prediction (10,10)")
print_predict(nn_10_3,country,offset=120,print_real=False, title="MLPRegressor Prediction (10,10,10)")
print_predict(nn_10_4,country,offset=120,print_real=False, title="MLPRegressor Prediction (10,10,10,10,10,10,10,10)")
print_predict(nn_100_1,country,offset=120,print_real=False,title="MLPRegressor Prediction (100)")
print_predict(nn_100_2,country,offset=120,print_real=False,title="MLPRegressor Prediction (100,100)")
print_predict(nn_100_3,country,offset=120,print_real=False,title="MLPRegressor Prediction (100,100,100)")
print_predict(nn_100_4,country,offset=120,print_real=False,title="MLPRegressor Prediction (100,100,100,100,100,100,100)")

## K-Nearest Neighbours

This section demonstrates how the K-Nearest Neighbours algorithm is applied to the regression problem. To make the demonstration easier to follow, each step is explained before the code that implements it.

### Parameterization and application of KNN regression algorithm

Based on the dataset to be analysed, it is necessary to find the set of parameters that best fits the current problem, in order to improve the efficiency and accuracy of the analysis.
This is achieved with a grid search which, through the class's fit function, runs the algorithm several times to find the best combination of parameters, as well as the best score, training the model in the process.

Note that we use a Pipeline with a StandardScaler and a MultiOutputRegressor. The StandardScaler standardizes the input features, which matters for a distance-based method such as KNN. The MultiOutputRegressor fits one regressor per target column, which lets us predict our three outputs ('Confirmed per million', 'Deaths per million' and 'Recovered per million') with a single estimator object passed to GridSearchCV.

In the following code block, a parameter grid is defined with the parameters and the possible values of each one. These correspond to the parameters of KNeighborsRegressor, a scikit-learn class that implements regression based on k-nearest neighbors.
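The long parameter names in the grid follow from how scikit-learn addresses nested estimators: the pipeline step name, then `estimator` for the regressor wrapped by `MultiOutputRegressor`, then the parameter itself, joined by double underscores. The sketch below lists the names that a pipeline like the one used here exposes. The variable names are introduced for illustration.

```python
from sklearn.multioutput import MultiOutputRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scl", StandardScaler()),
    ("knn", MultiOutputRegressor(KNeighborsRegressor())),
])

# get_params() returns every tunable parameter, including nested ones.
knn_params = sorted(name for name in pipe.get_params()
                    if name.startswith("knn__estimator__"))

for name in knn_params:
    print(name)
```

Any name printed this way can be used as a key in the `param_grid` passed to `GridSearchCV`.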

To reduce the risk of overfitting during parameter selection, we use K-Fold cross-validation with 10 folds.

Once fitting is concluded, we take the best estimator, evaluate it on the held-out test set and print its R², MAE and MSE. This estimator is returned.

In [None]:
def knn_by_country_train(
    country,
    input=['Days Since Jan','Population Density (P/Km²)','Urban Population %','Pop %'],
    out= ['Confirmed per million','Deaths per million','Recovered per million'],
):
    df = per_million_nolat.loc[ per_million_nolat['Country/Region'] == country]
    X = df[input].values
    Y = df[out].values

    pipe_knn = Pipeline(
                [
                    ("scl", StandardScaler()),
                    ("knn", MultiOutputRegressor(KNeighborsRegressor()))
                ]
            )

    parameter_grid = {'knn__estimator__n_neighbors': [3,5,7,9,11],
                    'knn__estimator__weights': ['uniform','distance'],
                    'knn__estimator__p' : [1,2,3],
                    'knn__estimator__leaf_size' : [20,25,30,35,40,300]
                    }

    cross_validation = KFold(n_splits=10)

    grid_search = GridSearchCV(estimator=pipe_knn,
                            param_grid=parameter_grid,
                            cv=cross_validation,
                            n_jobs=-1,
                            return_train_score=True
                            )

    X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.30,shuffle=True)

    grid_search.fit(X_train, Y_train)
    knn = grid_search.best_estimator_

    Y_pred = knn.predict(X_test)

    score = knn.score(X_test,Y_test) # Calculate R² for the model
    mae = mean_absolute_error(Y_test,Y_pred) # Mean Absolute Error
    mse = mean_squared_error(Y_test,Y_pred) # Mean Squared Error

    print(f"KNN scores for {country}:")
    print("R2:",score)
    print("MAE:",mae)
    print("MSE:",mse)

    return knn

####  Calling the algorithm

In [None]:
country = 'Mexico'
knn = knn_by_country_train(country)

In [None]:
print_predict(knn,country,offset=30)
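When reading the KNN forecast, it may help to keep in mind how a nearest-neighbour regressor behaves outside its training range: every future date has the same nearest neighbours (the last training days), so the prediction stays flat. The sketch below shows this on made-up linear data. It is an illustration of the general behaviour, not an analysis of the results in this notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Invented training data: days 0..9 with a value that grows by 2 per day.
X_train = np.arange(10, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

knn_demo = KNeighborsRegressor(n_neighbors=3).fit(X_train, y_train)

# For any day beyond the training range, the 3 nearest neighbours are
# days 7, 8 and 9, so the prediction is always their average.
future = knn_demo.predict(np.array([[20.0], [100.0], [1000.0]]))

print(future)
```

The flat line appears regardless of how steeply the training data was rising, since the model only averages values it has already seen.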

## SVM - Support Vector Machine

To predict the number of COVID-19 cases, recovered patients and deaths, we can also use Support Vector Regression (SVR). The SVR takes the date, together with the static country features, as input and outputs the three quantities mentioned above for a specific country.
To find the best parameters for the regression at hand, we performed a grid search over a wide range of values for several parameters (C, degree, epsilon, gamma and coef0). We also experimented with different kernels for the regressor, and found different results for each one.
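Not every parameter in the grid affects every kernel: in scikit-learn's `SVR`, `degree` is only used by the polynomial kernel, `gamma` by the RBF, polynomial and sigmoid kernels, and `coef0` by the polynomial and sigmoid kernels. The sketch below counts the combinations of the full grid and of a per-kernel variant using `ParameterGrid`. The per-kernel grid and the `reg__estimator__kernel` entry are a suggestion for illustration, not the grid used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

C = [0.1, 1, 10, 100, 1000, 10000]
degree = [2, 3, 4, 5, 6, 7]
epsilon = [0.1, 0.01, 0.001, 0.0001, 0.00001]
gamma = [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1]
coef0 = [1e-6, 1e-4, 1e-2, 1e-1, 0]

# The same grid for every kernel.
full_grid = {
    "reg__estimator__C": C,
    "reg__estimator__degree": degree,
    "reg__estimator__epsilon": epsilon,
    "reg__estimator__gamma": gamma,
    "reg__estimator__coef0": coef0,
}

# One sub-grid per kernel, listing only the parameters that kernel uses.
linear_grid = {"reg__estimator__kernel": ["linear"],
               "reg__estimator__C": C,
               "reg__estimator__epsilon": epsilon}
rbf_grid = {"reg__estimator__kernel": ["rbf"],
            "reg__estimator__C": C,
            "reg__estimator__epsilon": epsilon,
            "reg__estimator__gamma": gamma}
poly_grid = {"reg__estimator__kernel": ["poly"], **full_grid}

n_full = len(ParameterGrid(full_grid))
n_linear = len(ParameterGrid(linear_grid))
n_rbf = len(ParameterGrid(rbf_grid))
n_all_kernels = len(ParameterGrid([linear_grid, rbf_grid, poly_grid]))

print(n_full, n_linear, n_rbf, n_all_kernels)
```

With a fixed linear kernel, the full grid repeats each distinct model many times, so trimming the grid per kernel can shorten the search considerably. `GridSearchCV` accepts such a list of dictionaries as `param_grid`.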

In [None]:
def make_prediction(
    country,
    kernel,
    input=['Days Since Jan','Population Density (P/Km²)','Urban Population %','Pop %'],
    out= ['Confirmed per million','Deaths per million','Recovered per million'],
    ):

    df = per_million_nolat.loc[ per_million_nolat['Country/Region'] == country]
    X = df[input].values
    Y = df[out].values

    X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.30,shuffle=True)

    pipe_svr = Pipeline(
        [
            ("scl", StandardScaler()),
            ("reg", MultiOutputRegressor(SVR(kernel=kernel)))
        ]
    )

    grid_param_svr = {
    "reg__estimator__C": [0.1, 1, 10, 100, 1000, 10000],
    "reg__estimator__degree": [2,3,4,5,6,7],
    "reg__estimator__epsilon": [0.1, 0.01, 0.001, 0.0001, 0.00001],
    "reg__estimator__gamma": [1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1],
    "reg__estimator__coef0": [1e-6, 1e-4, 1e-2, 1e-1, 0]
    }

    gs_svr = (
        GridSearchCV(
            estimator=pipe_svr,
            param_grid=grid_param_svr,
            cv=2,
            scoring = "neg_mean_squared_error",
            n_jobs = -1
            )
        )

    gs_svr = gs_svr.fit(X_train, y_train)
    gs_svr = gs_svr.best_estimator_

    y_pred = gs_svr.predict(X_test)

    r2 = gs_svr.score(X_test,y_test) # Calculate R² for the model
    mae = mean_absolute_error(y_test,y_pred) # Mean Absolute Error
    mse = mean_squared_error(y_test,y_pred) # Mean Squared Error

    print(f"SVM {kernel} scores for {country}:")
    print("R2:", r2)
    print("MAE:", mae)
    print("MSE:", mse)

    return gs_svr

In [None]:
def print_prediction_instance(kernels, future_days, countries):
    for kernel in kernels:
        for country_svn in countries:
            gs_svr = make_prediction(country_svn, kernel)
            print_predict(gs_svr,country_svn,offset=future_days, title="SVR with " + kernel + " kernel for " + str(future_days) + " days")

In [None]:
countries = ["Portugal", "Sweden", "United States", "Singapore", "Spain", "Italy"]
kernels = ['poly', 'linear', 'rbf']
future_days = 60

# This function prints predictions using all mentioned kernels, for the countries above. For this report, we will only print the results from Mexico.
# print_prediction_instance(kernels, future_days, countries)

print_prediction_instance(kernels, future_days, ["Mexico"])

## Results for the SVM

- Polynomial kernel

This kernel seems to be a good fit for the problem at hand, and reaches very respectable scores. Since the growth of reported cases was exponential in the early weeks of the pandemic, the regression produced by this kernel follows the same rapidly growing tendency, which accounts for the good scores it reaches. However, if the number of cases continues to drop or even stabilizes, as it has in the past few weeks in many countries, this kernel may lose its relevance for COVID-19 predictions in these specific countries.

- Linear kernel

The scores of the regression (for most countries) show that this kernel is not the most appropriate for the given problem: since the growth was exponential in the first few weeks of the pandemic, a linear fit scores poorly. However, the graphs suggest that if the number of COVID-19 cases grows linearly in the near future, as the most recent data indicates for many countries, this kernel may become a better solution and achieve better scores. We therefore still consider it a good fit for the problem, since the number of reported cases has generally been dropping significantly in recent times.

- RBF kernel

This kernel seems to be the most suitable for predicting COVID-19 cases, at least as far as scores go. Reaching close to perfect scores in some countries, it seems to fit the problem at hand really well. However, the graphs show that it struggles when the number of reported cases drops. We think this is explained by the shortage of input data for the regressor. In the future, we could incorporate new datasets to address this issue, or study in more depth how to work around it, as well as improve the predictions of all the regressions.


Overall, we conclude that the most fitting kernel for the regression depends on the country we are making predictions for. If the number of reported cases, deaths and recovered patients is, overall, increasing exponentially, the polynomial kernel is the most fitting. If, however, they are growing at a more or less constant rate, the linear kernel might be the best fit. Finally, if they have been decreasing, the RBF kernel will most probably be the best fit, since it reaches the best scores. However, the issue with this kernel described above must be taken into account.

In [None]:
countries = ["Portugal", "Sweden", "Spain","United States","Italy","Japan","Mexico"]

models = dict()

for country in countries:
    models[country] = []
    models[country].append(nn_by_country_train(country))
    models[country].append(knn_by_country_train(country))
    models[country].append(make_prediction(country, "rbf"))
    models[country].append(make_prediction(country, "linear"))
    models[country].append(make_prediction(country, "poly"))

# Comparing models

## Portugal

In [None]:
model = models["Portugal"]
print_predict(model[0], "Portugal", offset=50, title="MLPRegressor prediction")
print_predict(model[1], "Portugal", offset=50, title="KNN prediction", print_real=False)
print_predict(model[2], "Portugal", offset=50, title="SVR prediction (rbf kernel)", print_real=False)
print_predict(model[3], "Portugal", offset=50, title="SVR prediction (linear kernel)", print_real=False)
print_predict(model[4], "Portugal", offset=50, title="SVR prediction (poly kernel)", print_real=False)

## Sweden

In [None]:
model = models["Sweden"]
print_predict(model[0], "Sweden", offset=50, title="MLPRegressor prediction")
print_predict(model[1], "Sweden", offset=50, title="KNN prediction", print_real=False)
print_predict(model[2], "Sweden", offset=50, title="SVR prediction (rbf kernel)", print_real=False)
print_predict(model[3], "Sweden", offset=50, title="SVR prediction (linear kernel)", print_real=False)
print_predict(model[4], "Sweden", offset=50, title="SVR prediction (poly kernel)", print_real=False)

## Spain

In [None]:
model = models["Spain"]
print_predict(model[0], "Spain", offset=50, title="MLPRegressor prediction")
print_predict(model[1], "Spain", offset=50, title="KNN prediction", print_real=False)
print_predict(model[2], "Spain", offset=50, title="SVR prediction (rbf kernel)", print_real=False)
print_predict(model[3], "Spain", offset=50, title="SVR prediction (linear kernel)", print_real=False)
print_predict(model[4], "Spain", offset=50, title="SVR prediction (poly kernel)", print_real=False)

## United States

In [None]:
model = models["United States"]
print_predict(model[0], "United States", offset=50, title="MLPRegressor prediction")
print_predict(model[1], "United States", offset=50, title="KNN prediction", print_real=False)
print_predict(model[2], "United States", offset=50, title="SVR prediction (rbf kernel)", print_real=False)
print_predict(model[3], "United States", offset=50, title="SVR prediction (linear kernel)", print_real=False)
print_predict(model[4], "United States", offset=50, title="SVR prediction (poly kernel)", print_real=False)

## Italy

In [None]:
model = models["Italy"]
print_predict(model[0], "Italy", offset=50, title="MLPRegressor prediction")
print_predict(model[1], "Italy", offset=50, title="KNN prediction", print_real=False)
print_predict(model[2], "Italy", offset=50, title="SVR prediction (rbf kernel)", print_real=False)
print_predict(model[3], "Italy", offset=50, title="SVR prediction (linear kernel)", print_real=False)
print_predict(model[4], "Italy", offset=50, title="SVR prediction (poly kernel)", print_real=False)

## Japan

In [None]:
model = models["Japan"]
print_predict(model[0], "Japan", offset=50, title="MLPRegressor prediction")
print_predict(model[1], "Japan", offset=50, title="KNN prediction", print_real=False)
print_predict(model[2], "Japan", offset=50, title="SVR prediction (rbf kernel)", print_real=False)
print_predict(model[3], "Japan", offset=50, title="SVR prediction (linear kernel)", print_real=False)
print_predict(model[4], "Japan", offset=50, title="SVR prediction (poly kernel)", print_real=False)

## Mexico

In [None]:
model = models["Mexico"]
print_predict(model[0], "Mexico", offset=50, title="MLPRegressor prediction")
print_predict(model[1], "Mexico", offset=50, title="KNN prediction", print_real=False)
print_predict(model[2], "Mexico", offset=50, title="SVR prediction (rbf kernel)", print_real=False)
print_predict(model[3], "Mexico", offset=50, title="SVR prediction (linear kernel)", print_real=False)
print_predict(model[4], "Mexico", offset=50, title="SVR prediction (poly kernel)", print_real=False)

# Conclusion

By analysing the results, we conclude that some of them were very good, while others were quite poor. Comparing the developed models, KNN seems to be the weakest and least fitting of the three, since the results it achieved were neither accurate nor plausible. However, both the MLPRegressor and the SVR reached results that we are quite happy with.