## Flu trends dataset: Understanding your data and applying the ARIMA model

This notebook contains the tasks for analyzing the flu trends dataset and applying an ARIMA model to it. The dataset contains Google Trends data for various flu-related search terms as well as the weekly number of office visits for the flu (the FluVisits column). You can find the dataset in data/flu-trends.csv. Install the required packages and complete the notebook.

Throughout the seminar, we will use the following splits for training, validation, and testing. Make sure to keep the test set unseen until the final evaluation to avoid information leakage:
- Training set: 2009-2013
- Validation set: 2014
- Test set: 2015-2016
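
A minimal sketch of this split, assuming the data has been reduced to a single sorted `DatetimeIndex` (for the MultiIndex used later in this notebook, `droplevel("end_date")` would be one way to get there). The series below is synthetic and only stands in for the real FluVisits column.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: one value per week, indexed by week start.
idx = pd.date_range("2009-06-29", "2016-12-26", freq="W-MON", name="start_date")
rng = np.random.default_rng(0)
flu = pd.Series(rng.integers(100, 3000, size=len(idx)), index=idx, name="FluVisits")

# Partial-string slicing on a sorted DatetimeIndex selects whole years (inclusive).
train = flu.loc["2009":"2013"]
val = flu.loc["2014"]
test = flu.loc["2015":"2016"]

print(len(train), len(val), len(test))
```

Splitting by date rather than at random keeps the temporal order intact, which is what prevents future values from leaking into training.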

Required Python packages: pandas, numpy, matplotlib, scipy, scikit-learn, statsmodels, pmdarima

In [3]:
# Install required packages
%pip install pandas numpy matplotlib scipy scikit-learn statsmodels pmdarima

In [7]:
# imports
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import scipy.stats as stats
import statsmodels.api as sm

import pmdarima as pm


## Task 1: Data Exploration

**Task 1.1:** Load the dataset and display the first few rows to understand the structure of the data. Check the data types of each column and convert the week column to a more usable datetime format.

In [18]:
filepath = os.path.join('..', 'data', 'flu-trends.csv')

In [None]:
# Load the raw CSV; the "Week" column holds strings of the form "start/end".
flu_data_testing = pd.read_csv(filepath)

# Split "Week" into its two dates and parse them as datetimes.
flu_data_testing[["start_date", "end_date"]] = flu_data_testing["Week"].str.split("/", expand=True)
flu_data_testing["start_date"] = pd.to_datetime(flu_data_testing["start_date"])
flu_data_testing["end_date"]   = pd.to_datetime(flu_data_testing["end_date"])

# We decided to keep the start and end dates as a MultiIndex: this keeps all
# the information while still making the data easy to index by date.
df = flu_data_testing.set_index(["start_date", "end_date"])
df = df.drop(columns=["Week"])
df.head(10)

| start_date | end_date | AInfluenza | AcuteBronchitis | BodyTemperature | BraunThermoscan | BreakAFever | Bronchitis | ChestCold | ColdAndFlu | ColdOrFlu | ColdVersusFlu | ... | TreatingTheFlu | TreatmentForFlu | TreatmentForTheFlu | Tussin | Tussionex | TypeAInfluenza | UpperRespiratory | WalkingPneumonia | WhatToDoIfYouHaveTheFlu | FluVisits |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2009-06-29 | 2009-07-05 | 36 | 20 | 43 | 27 | 11 | 22 | 16 | 7 | 3 | 8 | ... | 16 | 6 | 13 | 25 | 34 | 30 | 25 | 24 | 15 | 180 |
| 2009-07-06 | 2009-07-12 | 25 | 19 | 40 | 31 | 10 | 21 | 11 | 6 | 8 | 8 | ... | 16 | 7 | 8 | 27 | 32 | 27 | 28 | 29 | 9 | 115 |
| 2009-07-13 | 2009-07-19 | 24 | 30 | 45 | 20 | 12 | 20 | 20 | 6 | 6 | 8 | ... | 16 | 6 | 9 | 24 | 28 | 25 | 25 | 25 | 9 | 132 |
| 2009-07-20 | 2009-07-26 | 23 | 19 | 40 | 15 | 10 | 19 | 12 | 7 | 10 | 15 | ... | 8 | 5 | 12 | 21 | 26 | 26 | 29 | 24 | 13 | 109 |
| 2009-07-27 | 2009-08-02 | 27 | 21 | 44 | 20 | 11 | 19 | 17 | 8 | 10 | 15 | ... | 8 | 8 | 12 | 33 | 29 | 21 | 27 | 30 | 9 | 120 |
| 2009-08-03 | 2009-08-09 | 28 | 23 | 39 | 8 | 6 | 18 | 13 | 8 | 7 | 8 | ... | 8 | 9 | 16 | 18 | 30 | 13 | 26 | 26 | 17 | 115 |
| 2009-08-10 | 2009-08-16 | 29 | 22 | 41 | 35 | 10 | 17 | 15 | 7 | 10 | 8 | ... | 8 | 11 | 10 | 28 | 32 | 21 | 17 | 25 | 13 | 123 |
| 2009-08-17 | 2009-08-23 | 29 | 20 | 43 | 32 | 9 | 20 | 20 | 10 | 13 | 22 | ... | 8 | 9 | 10 | 24 | 29 | 13 | 23 | 27 | 14 | 205 |
| 2009-08-24 | 2009-08-30 | 27 | 19 | 52 | 27 | 7 | 22 | 16 | 13 | 16 | 19 | ... | 15 | 13 | 18 | 26 | 43 | 29 | 27 | 26 | 17 | 454 |
| 2009-08-31 | 2009-09-06 | 34 | 28 | 59 | 38 | 12 | 25 | 18 | 20 | 19 | 12 | ... | 25 | 20 | 34 | 35 | 38 | 38 | 37 | 32 | 14 | 628 |


**Task 1.2:** Are there any missing values in the dataset? If so, how would you handle them?
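
One possible way to approach this check, shown on a tiny hand-made frame with two artificial gaps (the values are illustrative, not a claim about the real dataset). Besides NaN cells, it also looks for weeks that are missing from the index altogether. Time-based interpolation is only one of several reasonable ways to fill gaps, and it is an assumption here, not the required answer.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with two artificial gaps in FluVisits.
idx = pd.date_range("2009-06-29", periods=8, freq="W-MON", name="start_date")
df = pd.DataFrame(
    {
        "FluVisits": [180, 115, np.nan, 109, 120, 115, np.nan, 205],
        "Bronchitis": [22, 21, 20, 19, 19, 18, 17, 20],
    },
    index=idx,
    dtype=float,
)

# 1) Missing cells per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# 2) Weeks that are absent from the index entirely.
expected_weeks = pd.date_range(df.index.min(), df.index.max(), freq="W-MON")
missing_weeks = expected_weeks.difference(df.index)
print("Missing weeks:", list(missing_weeks))

# 3) One option for filling gaps: linear interpolation along the time axis.
filled = df.interpolate(method="time")
print(filled["FluVisits"])
```

Note that interpolation uses the value after a gap, so when it is applied near a split boundary it should be done within each split to avoid leaking information.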

**Task 1.3:** Provide summary statistics for the dataset.

**Task 1.4:** Plot the weekly FluVisits over time to visualize trends and patterns. Compare with a few of the search trends. What do you observe?
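
A sketch of such a comparison plot, using synthetic data with a yearly cycle in place of the real frame. `Bronchitis` is one of the search-term columns in the dataset; which terms to compare is up to you. A second y-axis is used because search interest and visit counts live on very different scales.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic weekly data with a yearly cycle; replace with the real DataFrame.
idx = pd.date_range("2009-06-29", "2016-12-26", freq="W-MON", name="start_date")
rng = np.random.default_rng(0)
season = 1 + np.cos(2 * np.pi * (idx.dayofyear.to_numpy() - 30) / 365.25)
df = pd.DataFrame(
    {
        "FluVisits": 300 + 800 * season + rng.normal(0, 100, len(idx)),
        "Bronchitis": 15 + 20 * season + rng.normal(0, 4, len(idx)),
    },
    index=idx,
)

fig, ax_visits = plt.subplots(figsize=(10, 4))
ax_visits.plot(df.index, df["FluVisits"], color="tab:blue")
ax_visits.set_xlabel("Week")
ax_visits.set_ylabel("FluVisits", color="tab:blue")

# Second y-axis so the search term stays readable despite its smaller scale.
ax_trend = ax_visits.twinx()
ax_trend.plot(df.index, df["Bronchitis"], color="tab:orange", alpha=0.7)
ax_trend.set_ylabel("Bronchitis (search interest)", color="tab:orange")

ax_visits.set_title("Weekly FluVisits vs. one search term")
fig.tight_layout()

# A single number to accompany the visual impression.
correlation = df["FluVisits"].corr(df["Bronchitis"])
print(f"Pearson correlation: {correlation:.2f}")
```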

**Task 1.5:** Group the data by month and calculate the average number of flu visits for each month. Create a bar plot to visualize the monthly averages, which will help identify any seasonal patterns. Also check how the number of cases develops over the years.
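
The grouping could be sketched as follows, again on a synthetic series with a winter peak. The helper columns `year` and `month` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic weekly series with a winter peak; replace with the real FluVisits.
idx = pd.date_range("2009-06-29", "2016-12-26", freq="W-MON", name="start_date")
rng = np.random.default_rng(0)
season = 1 + np.cos(2 * np.pi * (idx.dayofyear.to_numpy() - 30) / 365.25)
frame = pd.DataFrame({"FluVisits": 300 + 800 * season + rng.normal(0, 100, len(idx))}, index=idx)

# Helper columns derived from the index.
frame["year"] = frame.index.year
frame["month"] = frame.index.month

# Average over all years for each calendar month.
monthly_mean = frame.groupby("month")["FluVisits"].mean()

# Month-by-year table to see how the level changes across years.
by_year = frame.pivot_table(index="month", columns="year", values="FluVisits", aggfunc="mean")

fig, ax = plt.subplots(figsize=(8, 4))
monthly_mean.plot(kind="bar", ax=ax)
ax.set_xlabel("Month")
ax.set_ylabel("Mean weekly FluVisits")
ax.set_title("Average FluVisits per calendar month")
fig.tight_layout()

print(by_year.round(0))
```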


## Task 2: Stationarity and Autocorrelation

**Task 2.1:** Apply the Augmented Dickey-Fuller test to check for stationarity. Explain the results and the implications for time series modeling.

**Task 2.2:** Investigate autocorrelation by plotting the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF) for the FluVisits column. Determine appropriate values for the AR (p) and MA (q) parameters of the ARIMA model based on these plots.

> Hint: The statsmodels package provides useful functions for this purpose.  
> If you want to read more about the ACF and PACF, you can check this [link](https://machinelearningmastery.com/gentle-introduction-autocorrelation-partial-autocorrelation/).

Explain your visualizations and findings. Specifically, comment on the trends, seasonality, and autocorrelation. Explain why it is important to do these checks before applying the ARIMA model.

**Task 2.3:** Create lag plots for the FluVisits column to visualize the relationship between the current value and its past values. Comment on the patterns you observe in the lag plots.
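
Lag plots can be drawn with `pandas.plotting.lag_plot`. The lags chosen below (1, 4, 26, and 52 weeks) are an assumption meant to contrast short-term dependence with half-year and full-year seasonality; the data is synthetic.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import lag_plot

# Synthetic weekly series with a yearly cycle; replace with the real FluVisits.
idx = pd.date_range("2009-06-29", "2016-12-26", freq="W-MON", name="start_date")
rng = np.random.default_rng(0)
season = 1 + np.cos(2 * np.pi * (idx.dayofyear.to_numpy() - 30) / 365.25)
flu = pd.Series(300 + 800 * season + rng.normal(0, 100, len(idx)), index=idx, name="FluVisits")

lags = [1, 4, 26, 52]
fig, axes = plt.subplots(1, len(lags), figsize=(16, 4))
for ax, lag in zip(axes, lags):
    # Scatter of y(t) against y(t + lag); a tight diagonal means strong dependence.
    lag_plot(flu, lag=lag, ax=ax)
    ax.set_title(f"lag = {lag} weeks (r = {flu.autocorr(lag=lag):.2f})")
fig.tight_layout()
```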

## Task 3: Application of the AR(I)MA Model

The ARIMA (AutoRegressive Integrated Moving Average) model is a popular time series forecasting model that combines three components:

1. **AR (AutoRegressive)**: Uses the dependent relationship between an observation and some number of lagged observations.
2. **I (Integrated)**: Represents the differencing of raw observations to allow the time series to become stationary.
3. **MA (Moving Average)**: Uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.

ARIMA is denoted as ARIMA(p,d,q), where:
- p: The order of the autoregressive term
- d: The number of differencing steps required to make the time series stationary
- q: The order of the moving average term
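
Written out with the backshift operator $B$ (defined by $B y_t = y_{t-1}$), an ARIMA(p,d,q) model for a series $y_t$ is commonly stated as follows. The symbols $\phi_i$, $\theta_j$, $c$, and $\varepsilon_t$ are introduced here for illustration, and sign conventions for the MA terms vary between textbooks and libraries.

$$
\left(1 - \sum_{i=1}^{p} \phi_i B^i\right) (1 - B)^d \, y_t = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right) \varepsilon_t
$$

Here $\phi_i$ are the AR coefficients, $\theta_j$ the MA coefficients, $c$ a constant, and $\varepsilon_t$ white-noise errors. For $d = 1$, the factor $(1 - B)$ simply replaces $y_t$ by the first difference $y_t - y_{t-1}$.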

For further information, you can refer to e.g. this [tutorial](https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/). 

Many Python packages provide implementations of the ARIMA model; in this seminar, we will use the [pmdarima](pmdarima) package. It provides an easy-to-use interface for ARIMA models, and its auto_arima function searches for suitable order parameters automatically.

**Task 3.1:** Prepare your data for the ARIMA model. In the upcoming days, you will use the years 2015-2016 for testing and therefore you should also get the predictions for these years with the ARIMA model. Prepare the data accordingly. What input is expected by the model?
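
A sketch of one way to prepare the data, assuming the MultiIndex layout from Task 1.1. A plain ARIMA model works on a single one-dimensional series, so only the FluVisits column is kept. Whether the validation year is merged into the fitting data for the final test predictions is a design choice; the variable `y_fit` below assumes that it is. The values are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic frame with the same (start_date, end_date) MultiIndex as in Task 1.1.
start = pd.date_range("2009-06-29", "2016-12-26", freq="W-MON")
end = start + pd.Timedelta(days=6)
index = pd.MultiIndex.from_arrays([start, end], names=["start_date", "end_date"])
rng = np.random.default_rng(0)
df = pd.DataFrame({"FluVisits": rng.integers(100, 3000, size=len(index))}, index=index)

# Univariate target: one float value per week, indexed by the week's start date.
y = df["FluVisits"].droplevel("end_date").astype(float).sort_index()

y_train = y.loc["2009":"2013"]
y_val = y.loc["2014"]
y_test = y.loc["2015":"2016"]

# Data used to fit the final model before predicting the test years.
y_fit = pd.concat([y_train, y_val])

print(y_fit.index.min().date(), "to", y_fit.index.max().date(), "->", len(y_fit), "weeks")
print(y_test.index.min().date(), "to", y_test.index.max().date(), "->", len(y_test), "weeks")
```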

**Task 3.2:** Use the auto_arima function from the pmdarima package to find the best parameters for the ARIMA model.

**Task 3.3:** Fit the ARIMA model with the best parameters and make predictions for the test set (2015-2016). Evaluate the model using suitable metrics.

**Task 3.4:** Plot the actual vs. predicted values for the test set to visualize the model's performance.
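
A minimal plotting sketch. The "predicted" series here is just the synthetic actual series plus noise, standing in for real model output.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic test-period data; replace both series with your real values.
idx = pd.date_range("2015-01-05", "2016-12-26", freq="W-MON")
rng = np.random.default_rng(0)
season = np.cos(2 * np.pi * (idx.dayofyear.to_numpy() - 30) / 365.25)
y_test = pd.Series(1000 + 800 * season, index=idx, name="actual")
y_pred = pd.Series(y_test.to_numpy() + rng.normal(0, 150, len(idx)), index=idx, name="predicted")

fig, ax = plt.subplots(figsize=(10, 4))
ax.plot(y_test.index, y_test, label="Actual FluVisits", color="tab:blue")
ax.plot(y_pred.index, y_pred, label="Predicted", color="tab:red", linestyle="--")
ax.set_xlabel("Week")
ax.set_ylabel("FluVisits")
ax.set_title("Actual vs. predicted FluVisits on the test set")
ax.legend()
fig.tight_layout()
```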

## Optional tasks

- Include search terms as covariates (exogenous features) in the model. Experiment with different combinations of search terms and evaluate the model's performance.
- Apply Facebook's Prophet model to the same dataset and compare its performance with the ARIMA model.