## Clinic 4: UK Visitors timeseries || COVID-19

In this clinic we explore a dataset of the monthly total number of visits to the UK by overseas residents (in thousands) from January 1980 to December 2018 (inclusive). Below, we first import the dataset and then list what is required of you. For reference, see the lecture notebook (uploaded to the portal).

Alternatively, you can skip this part and go to the Alt-Clinic (see below). You need to attempt one of the two tasks to get a grade for this clinic.

### Load the data into Pandas DataFrame

In [1]:
import IPython.core.display
import matplotlib

def apply_styles():
    matplotlib.rcParams['font.size'] = 12
    matplotlib.rcParams['figure.figsize'] = (18, 6)
    matplotlib.rcParams['lines.linewidth'] = 1

apply_styles()

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style="ticks")
import warnings
warnings.filterwarnings('ignore')



In [3]:
df = pd.read_csv("GMAA-010119.csv", header=None, skiprows=7, parse_dates=[0], names=['period', 'value'])
df['value'] = df['value'].astype(int)  # astype returns a new Series, so assign it back

In [4]:
df.head(5)

|   | period | value |
|---|--------|-------|
| 0 | 1980-01-01 | 739 |
| 1 | 1980-02-01 | 602 |
| 2 | 1980-03-01 | 740 |
| 3 | 1980-04-01 | 1028 |
| 4 | 1980-05-01 | 1088 |


In [5]:
max_date = df.period.max()
min_date = df.period.min()

num_of_actual_points = df.index.shape[0]
num_of_expected_points = (max_date.year - min_date.year) * 12 + max_date.month - min_date.month + 1

print("Date range: {} - {}".format(min_date.strftime("%d.%m.%Y"), max_date.strftime("%d.%m.%Y")))
print("Number of data points: {} of expected {}".format(num_of_actual_points, num_of_expected_points))


Date range: 01.01.1980 - 01.12.2019
Number of data points: 480 of expected 480


## Requirements

Assume that you work for a travel agency in the UK. Your goal is to prepare a report for your boss (supported by numbers/models, plots and explanations) that looks into the data on UK visitors. Your report/analysis should contain the following:

* Proper visualizations of the timeseries
* Identification of relevant events (e.g. financial or health events) that might drive trends or sudden changes in the timeseries
* Highlight these events by zooming in on the corresponding periods
* What is the best time to visit the UK?
* Report on the seasonality and the trend of the timeseries
* Check your timeseries for stationarity and correct if necessary
* Based on your observations, try to fit an ARIMA model for predicting the UK visitors.
* Use this guide (http://people.duke.edu/~rnau/arimrule.htm) for indicative rules on which model fits best
* Use the data up to and including December 2014 for training, and the data from January 2015 onward for testing your performance
* Show your predictions in a plot
* Evaluate your models using MAE, MSE and R-squared
* Download the recent data from https://www.ons.gov.uk/peoplepopulationandcommunity/leisureandtourism/timeseries/gmaa/ott (you can use the whole 2019) and persuade your boss that your model is awesome!
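
The train/test split and the three evaluation metrics from the list above could be sketched as follows. This is only a minimal illustration: the series is synthetic (it stands in for `df` and does not use the real visitor counts), the names `series`, `evaluate` and `forecast` are made up for this example, and the forecast is a simple seasonal-naive baseline (repeat the last training year) rather than an ARIMA model.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series standing in for the visitor counts (illustrative values only)
period = pd.date_range("1980-01-01", "2018-12-01", freq="MS")
rng = np.random.default_rng(0)
months = np.arange(len(period))
value = 1000 + 4 * months + 300 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 50, len(period))
series = pd.Series(value, index=period)

# Train on everything before 2015, test on 2015 onward
train = series[series.index < "2015-01-01"]
test = series[series.index >= "2015-01-01"]

# Placeholder forecast: repeat the last observed training year (seasonal-naive baseline)
last_year = train.iloc[-12:].to_numpy()
forecast = np.resize(last_year, len(test))  # np.resize tiles the 12 values to the test length


def evaluate(y_true, y_pred):
    """Return MAE, MSE and R-squared for a forecast."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "MSE": mse, "R2": r2}


scores = evaluate(test, forecast)
print(scores)
```

A baseline like this is useful as a reference point: an ARIMA model that cannot beat "same month last year" on these metrics is probably not worth showing to your boss.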

## Alt-Clinic (COVID-19 data)

Please continue reading if you (alternatively) would like to work with the recent COVID-19 data. The data (per country) are provided **daily** via this GitHub repository https://github.com/CSSEGISandData/COVID-19.

You are going to follow a similar approach as above, but adapted using the following steps:

* Start by exploring the plot of the COVID-19 cases (in total) and justify the exponential rate of increase
* Proper visualizations of the timeseries
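
One possible way to back up the "exponential" claim with a number is to fit a straight line to the logarithm of the case counts: if growth is exponential, the log-counts are roughly linear in time, and the slope gives the daily growth rate. The sketch below uses synthetic counts (not the real data), and the variable names are chosen for this example only.

```python
import numpy as np

# Synthetic cumulative counts (illustrative, not real data): 100 cases growing 20% per day
days = np.arange(14)
cases = 100 * 1.2 ** days

# Fit log(cases) = intercept + rate * day; a good linear fit on the log scale
# is what exponential growth looks like
rate, intercept = np.polyfit(days, np.log(cases), 1)
doubling_time = np.log(2) / rate

print(f"daily growth factor: {np.exp(rate):.2f}")
print(f"doubling time: {doubling_time:.1f} days")
```

On real data the fit will only hold over a limited window, so it is worth plotting the counts on a logarithmic y-axis and checking where the line starts to bend.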

One interesting task is to compare how new infection cases appear in different countries and whether these timeseries evolve in a similar way. For example, can we assume that infections in the Netherlands will evolve the same way as in China or Italy?

* Provide commentary on the problems that arise from the selection bias introduced by the different ways in which countries handle reporting. If you see fit, do some research into how countries test and how they report their cases

* Provide commentary on (and possibly a solution to) the "start day problem" of the timeseries: the outbreak in China was already under way in January, in Italy it started in late February, and in the Netherlands the start is not yet clearly determined. Define your "day 0" properly: some people take day 0 to be the day the country reported its first infection, others the day the cumulative number of reported infections reached a threshold such as 50, or you might think of something else. The point is to have a proper basis for comparing the timeseries of different countries.

* Pick some countries (incl. Netherlands) and visualize the results on this "shifted"/"normalized" time-scale

* Report preliminary results on the visual inspection of the plots
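
As a rough illustration of the "day 0" idea above, the sketch below shifts each country's cumulative series so that its first row is the first day it reached a given threshold. It assumes a DataFrame with one column per country, indexed by date; the function name `align_to_day_zero`, the country labels and all numbers are made up for this example.

```python
import pandas as pd


def align_to_day_zero(cumulative: pd.DataFrame, threshold: int = 50) -> pd.DataFrame:
    """Re-index each country's series by days since it first reached `threshold` cases."""
    aligned = {}
    for country in cumulative.columns:
        series = cumulative[country]
        reached = series >= threshold
        if not reached.any():
            continue  # this country has not reached the threshold yet
        start = reached.idxmax()  # first date on which the threshold is reached
        aligned[country] = series.loc[start:].reset_index(drop=True)
    # Series have different lengths; pandas pads the shorter ones with NaN
    return pd.DataFrame(aligned).rename_axis("days_since_day_0")


# Synthetic cumulative counts (illustrative, not real data)
dates = pd.date_range("2020-02-20", periods=6, freq="D")
cumulative = pd.DataFrame(
    {
        "Country A": [40, 60, 90, 130, 200, 310],
        "Country B": [0, 3, 20, 55, 80, 120],
        "Country C": [0, 0, 1, 2, 5, 9],
    },
    index=dates,
)

aligned = align_to_day_zero(cumulative, threshold=50)
print(aligned)
```

Plotting `aligned` (ideally with a logarithmic y-axis) puts all countries on the same shifted time-scale. The threshold of 50 is only one of the conventions mentioned above, so it is worth checking how sensitive your comparison is to that choice.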

And now for the most challenging task: let's try to predict! The real challenge here is that we don't have enough data points to build reliable train/test splits, so you have to think of a workaround (e.g. fit a model for a country that is already deep into the outbreak, like China, and then see how that model performs on other countries).

Here are some steps to follow:

* Try to fit an ARIMA model for predicting the evolution of the infected cases worldwide
* Try to fit an ARIMA model for a specific country (e.g. China).
* Use this guide (http://people.duke.edu/~rnau/arimrule.htm) as indicative rules for which model fits the best
* As the situation evolves and since the clinic is delivered in one week, you can use "real" world data for testing! Make sure to download the data every day and check your model prediction.
* Show your predictions in a plot
* Evaluate your models using MAE, MSE and R-squared

Since these are extraordinary conditions, let me (us) know how we can help you further with this, or if you have any other cool ideas.

Stay healthy and enjoy!