## Google Trends Data

Google Trends is a useful source of time series data. Google records the popularity of search terms over time, and we can use this to gauge regional or global interest in a topic.

Note: You can also add comparisons if you are interested in multivariate time series analysis. 

### Interest in Climate Change 

Here we consider searches for 'climate change'. By looking at global Google searches, we can see whether interest in climate change has increased over the last 5 years, and forecast whether this interest will keep increasing.

You can see the source of this data here: https://trends.google.com/trends/explore?date=today%205-y&q=climate%20change

In particular, we will explore AR and ARMA models for this Google Trends data, specifically searches for 'climate change' over the last 5 years worldwide. We will then explore methods for testing the model fits and the forecasts.

In [None]:
import re
import pandas as pd
from datetime import datetime
import numpy as np
import matplotlib.pyplot as plt

from statsmodels.api import tsa
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from statsmodels.tsa.ar_model import ar_select_order
from statsmodels.graphics.tsaplots import plot_acf

%matplotlib inline

## 1) Load and transform the data

Load the data into a `pandas.DataFrame` object. You can use the `pd.read_csv` function with the parameter `skiprows` to ignore the first row. 

Then:

1) Make the `Week` column of datetime format

2) Set the `Week` as the index column

3) Convert the dataframe to a pandas series by selecting the one remaining column `climate change: (Worldwide)`.
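
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: the CSV text is a made-up stand-in for a Google Trends export (a title row followed by the header), and the variable names `raw_csv` and `series` are hypothetical. With the real file you would pass its path to `pd.read_csv` instead of the in-memory text.

```python
import io
import pandas as pd

# Hypothetical stand-in for a Google Trends export: a title row,
# then the header and a few weekly values (made-up numbers).
raw_csv = """Category: All categories
Week,climate change: (Worldwide)
2020-01-05,40
2020-01-12,42
2020-01-19,45
2020-01-26,43
"""

# skiprows=1 drops the title row, so the next line is read as the header
df = pd.read_csv(io.StringIO(raw_csv), skiprows=1)

# 1) parse the Week column as datetimes
df["Week"] = pd.to_datetime(df["Week"])

# 2) use Week as the index
df = df.set_index("Week")

# 3) select the one remaining column to obtain a pandas Series
series = df["climate change: (Worldwide)"]
```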

In [None]:
# Your code here...


In [None]:
# For our model building later, we attach a data frequency to the times. This is not essential but avoids warnings
df.index = pd.DatetimeIndex(
    df.index.values,
    freq=df.index.inferred_freq
)

time_series = df

## 2) Visualise the data with a plot

In [None]:
# Your code here...


## 3) Fit the data

### Task:

1) Using `ar_select_order()` select the optimal AR parameter according to the AIC. 

2) Then fit the time series model using `tsa.AutoReg()` and the optimal lag found

3) Form a prediction starting at the optimal lag value. Hint: use the parameter `start`.

4) Plot the resulting prediction and original time series

In [None]:
# Your code here...


### Residual diagnostics

Using `time_series[optlag:].values.flatten()` as the original time series (this should align with your prediction), define the residuals from the AR model fit.

Plot these residuals and their ACF in order to assess whether this model is adequate.

**Recall:** We want the residuals to have zero mean and be uncorrelated

In [None]:
# Your code here...


## 4) Forecasting

**Note**: forecasting with time series data is **tricky**, and basic methods usually do not give very good results (especially on realistic data). ARMA models are nice because they are simple, but do not expect fantastic performance. (On the other hand, predicting the future is hard, who would have thought!)

* Separate the time series into a training set and a test set formed of the last 40 points. 
  - To do this you can index `time_series` with `[:-40]` and `[-40:]`
* Fit an AR model on the training data and try to find the optimal lag using the `BIC` criterion, an alternative to the AIC which also accounts for the sample size. 
  - Use the `ar_select_order` function on your training data and use the parameters `maxlag` and `ic`
  - Then use the `AutoReg` function with your training data and use the `lags` parameter to specify the optimal lags found
* Predict and show the prediction on the original time series. Did it do a good job? 
  - For this step you will find the `.fit()` and `.predict()` functions useful
  - Note that for the prediction, we want to keep only the indices `[-len(test):]`, as these form our forecast
* Compute the MAE
  - The function `mean_absolute_error` will let you do this

In [None]:
# Your code here...


Repeat for an ARMA process, using `arma_order_select_ic` (available as `tsa.arma_order_select_ic`) to find the optimal parameters. How does the MAE compare to that of the AR process?

Note: in order to run an ARMA model using `statsmodels`, you can use `tsa.ARIMA` and set the `d` in the order as 0.

In [None]:
# Your code here...


## 5) Cross-Validation for Time Series Forecasting

Recall that a forecast must not use future observations. Below we define a function that takes a certain number of observations (ordered chronologically) as training data and uses the remainder as testing data. We will also visualize the predictions :)



In [None]:
def Forecasting_crossValidation( time_series, training_size, optlag ):
    """
    Given a pd.Series, we train an autoregressive model.
    The number of training observations is specified by the variable training_size
    """
    train = time_series[:training_size]
    test = time_series[training_size:]
    ar = tsa.AutoReg(train, lags=optlag)
    ar_result = ar.fit()
    prediction = ar_result.predict(end=time_series.index[-1])[-len(test):]
    # You could also use:
    #prediction = ar_result.predict(end=len(time_series)-1)[-len(test):]
    
    # compute the MAE:
    mae = mean_absolute_error(time_series.values[training_size:], prediction)
    print('Mean absolute error: ' + str(mae))
    
    # plot results as well:
    plt.plot(time_series.values, '-o', label='true')
    plt.plot(range(training_size, len(time_series)), prediction, 
         '-o', label='out of sample prediction')
    plt.legend();
    
    return prediction
    

In [None]:
Forecasting_crossValidation( time_series, int(len(time_series)*0.85), 3)

#### Exercise:
Compute the average mean absolute error (MAE) for all possible forecasts (using a minimum of 30 training points). 

Hint: 

1) Copy the `Forecasting_crossValidation` function, name the copy `Forecasting_CV`, and adapt it so that it:
  - has an input parameter allowing you to specify the lags to use for the AR model
  - returns the MAE
  Note: consider whether it needs to output plots or not
  
2) Loop through the training sizes from length `30` to `len(time_series)-30`. In each loop do the following:
  - Select the optimal AR parameter according to the AIC using `ar_select_order` and just the training data
  - Use the function `Forecasting_CV` with your time series, current training size and selected lag to get a MAE
  - Append this MAE to a list which stores all of the errors from each training size

3) Print out the mean of your MAE list to get the average MAE for all possible forecasts


In [None]:
# Your code here...
