[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](http://colab.research.google.com/github/Jonas-Metz-verovis/verovis_Coding_Challenge/blob/main/04_Time_Series_Forecasting.ipynb)

# Introduction - Coding Challenge #4 - Time Series Analysis & Forecasting 

**Today's coding challenge focuses on time series analysis and time series forecasting. In the first part of the challenge, you will demonstrate your theoretical knowledge. In the second part, you will train a model, evaluate it, and forecast the subsequent 17 periods. The data set contains the quarterly earnings per share of the company Johnson&Johnson.**

**The main goal of the coding challenge is to develop a model that can predict the subsequent 17 quarterly earnings per share of Johnson&Johnson.**

**The Challenge will be scored based on:**

1.  The Prediction Model's Test Mean Absolute Percentage Error (MAPE) Score
1.  The verbal Explanations for specific Processing/Modeling Choices
1.  The Readability and Transferability of the submitted Code
1.  The Documentation of the submitted Code
1.  Optional (not scored): Explanation of the Model's learned Relationships (e.g. through the Feature Importances)

General Machine Learning Project Checklist (**Focus of this Challenge**) by [Aurélien Géron](https://github.com/ageron/handson-ml)

1. Frame the Problem and look at the Big Picture
1. Get the Data
1. Explore the Data to gain Insights
1. Prepare the Data to better expose the underlying Data Patterns to the used Machine Learning Algorithms
1. **Explore many different Models and short-list the best ones**
1. Fine-tune your Models and combine them into a great Solution
1. Present your Solution
1. **Launch, monitor, and maintain your Model/Service**

**INFO:** Instead of working with [Google Colab](https://colab.research.google.com/), which is recommended because you can get started right away, or [Databricks](https://adb-7072220306909809.9.azuredatabricks.net/?o=7072220306909809), which is recommended if you want to collaborate in real-time, you can also work with your own Development Environment (e.g. [Visual Studio Code](https://code.visualstudio.com/)), by using [Git](https://git-scm.com/) to clone the [verovis Coding Challenge GitHub Repository](https://github.com/Jonas-Metz-verovis/verovis_Coding_Challenge) and collaborate e.g. by using [Microsoft Visual Studio Live Share](https://marketplace.visualstudio.com/items?itemName=MS-vsliveshare.vsliveshare-pack)



# Documentation and Support

#### The following Resources might be useful to complete this Challenge:

1.  [AR-Process (statsmodels)](https://www.statsmodels.org/stable/generated/statsmodels.tsa.ar_model.AutoReg.html#statsmodels.tsa.ar_model.AutoReg)
1.  [Forecasting: Principles and Practice (AR-Process)](https://otexts.com/fpp2/AR.html)
1.  [Forecasting: Principles and Practice (MA-Process)](https://otexts.com/fpp2/MA.html)
1.  [ARIMA-Process (statsmodels)](https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima.model.ARIMA.html#statsmodels.tsa.arima.model.ARIMA)
1.  [Forecasting: Principles and Practice (ARIMA-Process)](https://otexts.com/fpp2/arima.html)
1.  [Simple Exponential Smoothing (statsmodels)](https://www.statsmodels.org/stable/generated/statsmodels.tsa.holtwinters.SimpleExpSmoothing.html#statsmodels.tsa.holtwinters.SimpleExpSmoothing)
1.  [Forecasting: Principles and Practice (Simple Exponential Smoothing)](https://otexts.com/fpp2/ses.html)
1.  [Holt-Winters (statsmodels)](https://www.statsmodels.org/stable/generated/statsmodels.tsa.holtwinters.Holt.html#statsmodels.tsa.holtwinters.Holt)
1.  [Forecasting: Principles and Practice (Holt-Winters)](https://otexts.com/fpp2/holt-winters.html)

<hr>

1.  [Pandas Documentation](https://pandas.pydata.org/docs/reference/index.html#api)
1.  [Numpy Documentation](https://numpy.org/doc/stable/)
1.  [Scikit-Learn Documentation](https://scikit-learn.org/stable/modules/classes.html)
1.  [Seaborn Documentation](https://seaborn.pydata.org/api.html)
1.  [SHAP Documentation](https://shap.readthedocs.io/en/latest/api.html)
1.  [Pandas Data Wrangling Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
1.  [TowardsDataScience: Data Cleansing](https://towardsdatascience.com/data-cleaning-in-python-the-ultimate-guide-2020-c63b88bf0a0d)

#### If you don't know how to find a Solution to a given Problem, it often works well if one just "googles the problem". Great Sources are:

1.  [TowardsDataScience](https://towardsdatascience.com/)
1.  [StackOverflow](https://stackoverflow.com/)
1.  [Machine Learning Mastery](https://machinelearningmastery.com/start-here/)
1.  [Python-Kurs.eu](https://www.python-kurs.eu/python3_kurs.php)
1.  [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)
1.  [The Hitchhiker's Guide to Python](https://docs.python-guide.org/)
1.  [Overview of Data Science YouTube Channels](https://towardsdatascience.com/top-20-youtube-channels-for-data-science-in-2020-2ef4fb0d3d5)
1.  [Introduction to Machine Learning with Python](https://github.com/amueller/introduction_to_ml_with_python) / [Buy the Book](https://www.amazon.de/Introduction-Machine-Learning-Python-Scientists/dp/1449369413)
1.  [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/ElemStatLearn/printings/ESLII_print12_toc.pdf)
1.  [Bayesian Reasoning and Machine Learning](http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/200620.pdf)
1.  [Deep Learning](https://www.deeplearningbook.org/)
1.  [Hyndman/Athanasopoulos, Forecasting: Principles and Practice](https://otexts.com/fpp2/)

#### This Challenge was created by [Tim Fritzsche](mailto:tfritzsche@verovis.com), [Jonas Metz](mailto:jmetz@verovis.com) and [Marcel Fynn Froboese](mailto:mfroboese@verovis.com). Please contact us anytime if you have any Questions! :-)


# Global Flags and Variables
Please use the given RANDOM_STATE for all your Models etc.

In [None]:
import os

# TODO: Check the Notebook on DataBricks

RANDOM_STATE = 42

# # TODO: Please choose a Team Name!
# TEAM_NAME = 'AdminTeam'

# DATABRICKS_INSTANCE = "https://adb-7072220306909809.9.azuredatabricks.net"
# DATABRICKS_ORGANISATION = "7072220306909809"
# DATABRICKS_BASE_DIRECTORY = os.path.join ("/dbfs/FileStore", TEAM_NAME)

# MODELS = os.path.join (DATABRICKS_BASE_DIRECTORY, "Models")

# SAVE_MODEL = True
# SAVE_PIPELINE = True

# Databricks Specifics

[Databricks Filestore Documentation](https://docs.databricks.com/data/filestore.html)

In [None]:
# TODO: Create the config.py file and make sure that it is accessible on Databricks itself

# dbutils.fs.rm ("/FileStore/" + TEAM_NAME, recurse=True)
# dbutils.fs.mkdirs("/FileStore/" + TEAM_NAME + "/Models")
# dbutils.fs.ls("/FileStore/" + TEAM_NAME)

# Imports

### Info (Google Colab)

If you are working in Google Colab, you can install necessary (and not already installed) Packages by running e.g.

```
!pip install shap
```

### Info (Databricks)

If you are working in [Databricks](https://docs.databricks.com/libraries/notebooks-python-libraries.html), you can install necessary (and not already installed) Packages by running e.g. this Command in the first Cell of your Notebook (the Kernel will reset after the Package has been installed):

```
%pip install shap
```

In [None]:
# General Imports
import os 
import config
from config import Load_Data
from config import Generate_Process
from config import plot_process
import pandas as pd 
import numpy as np
from tqdm import tqdm_notebook

# Scikit-Learn
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.ensemble import RandomForestRegressor

# TimeSeries
from statsmodels.tsa.stattools import adfuller
from statsmodels.regression.linear_model import yule_walker
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import pacf, acf
from statsmodels.tsa.statespace.sarimax import SARIMAX
from itertools import product


# Graphics
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (16, 14)

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Helper Functions

In [None]:
def optimize_ARIMA(endog, ps, ds, qs, return_order=False):
    """
        Fit one model per (p, d, q) combination and rank the fits by AIC.

        endog:          the observed variable
        ps:             iterable with the candidate orders of parameter p
        ds:             number of integrations (differences) of the time series, d
        qs:             iterable with the candidate orders of parameter q
        return_order:   if True, also return the order with the lowest AIC

        Returns a DataFrame with the columns '(p, d, q)' and 'AIC', sorted by AIC.
    """
    # Create all possible (p, d, q) combinations of the given order parameters
    order_list = [(p, ds, q) for p, q in product(ps, qs)]

    results = []

    # BUG: tqdm might not work on Databricks
    for order in tqdm_notebook(order_list):
        try:
            model = SARIMAX(endog, order=order, simple_differencing=False).fit(disp=False)
        except Exception:
            # Skip orders for which the model cannot be fitted
            continue

        # Choose AIC as decision metric
        # TODO: Add one or two metrics
        results.append([order, model.aic])

    result_df = pd.DataFrame(results, columns=['(p, d, q)', 'AIC'])
    # Sort in ascending order, lower AIC is better
    result_df = result_df.sort_values(by='AIC', ascending=True).reset_index(drop=True)

    if return_order:
        best_order = result_df.iloc[0]['(p, d, q)']
        return result_df, best_order

    return result_df


def metrics_dataframe(data, ps, ds, qs, n_splits=4):
    """
        Return the key metrics calculated on each fold of a time series cross-validation.

        MSE, RMSE and MAPE compare the forecast with the test fold (out-of-sample).
        MAE, SSE and AIC are taken from the fitted model and are therefore in-sample.
    """
    metric_results = {}

    tscv = TimeSeriesSplit(n_splits=n_splits)
    for fold, (train_index, test_index) in enumerate(tscv.split(data['values'])):
        # Get the train and test data by position
        train = data['values'].iloc[train_index]
        test = data['values'].iloc[test_index]

        # Use optimize_ARIMA to find the best order (lowest AIC) on this train set,
        # then fit a temporary model with that order.
        result_df, best_order = optimize_ARIMA(endog=train, ps=ps, ds=ds, qs=qs, return_order=True)
        temp_model = SARIMAX(train, order=best_order, simple_differencing=False)
        results = temp_model.fit(disp=False)

        # Generate a pseudo forecast to compare the actual values with the forecast values.
        # The length of the forecast must equal the length of the test fold.
        temp_forecast = results.forecast(len(test))
        # To check that the date indices match, join on the index:
        # dfs = pd.DataFrame(temp_forecast).join(pd.DataFrame(test))

        mse = mean_squared_error(y_true=test.values, y_pred=temp_forecast.values)
        metric_results['Fold_' + str(fold)] = {
            'MSE': mse,
            'RMSE': np.sqrt(mse),  # root of the MSE, not its square
            'MAE': results.mae,
            'MAPE': mean_absolute_percentage_error(y_true=test.values, y_pred=temp_forecast.values),
            'SSE': results.sse,
            'AIC': results.aic,
            'Date_Range_Train': (train.index[0], train.index[-1]),
            'Date_Range_Test': (test.index[0], test.index[-1]),
            'length_train': int(len(train_index)),
            'length_test': int(len(test_index))
            }

    # Calculate the average of the numeric metrics over the folds
    df_metric = pd.DataFrame(metric_results)
    metric_rows = ['MSE', 'RMSE', 'MAE', 'MAPE', 'SSE', 'AIC']
    df_metric['Fold_Mean'] = df_metric.loc[metric_rows].astype(float).mean(axis=1)

    return df_metric



# Data Loading

In [None]:
link_ar_process = 'https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/TimeSeries/Generated_Process/AR_PROCESS.csv'
link_ma_process = 'https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/TimeSeries/Generated_Process/MA_PROCESS.csv'
link_arma_process = 'https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/TimeSeries/Generated_Process/ARMA_PROCESS.csv'


AR_PROCESS = pd.read_csv(link_ar_process, sep=';')
MA_PROCESS = pd.read_csv(link_ma_process, sep=';')
ARMA_PROCESS = pd.read_csv(link_arma_process, sep=';')

# Time Series Analysis (Theory)

## TASK I: 
Your first task is to visualize the AR, MA and ARMA processes. Furthermore, try to determine the parameter order of each process, i.e. AR($p$), MA($q$) and ARMA($p,q$).
Please use the cells below for your answer.
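
As orientation for reading autocorrelation plots, the sketch below computes a sample autocorrelation function by hand on a simulated series. The AR(1) coefficient of 0.7, the series length and the function name `sample_acf` are assumptions chosen for illustration and are unrelated to the challenge data. The *plot_acf* and *plot_pacf* functions imported above draw the same quantity (and its partial counterpart) directly.

```python
import numpy as np


def sample_acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )


# Hypothetical AR(1) series, used only for illustration
rng = np.random.default_rng(42)
phi = 0.7
n = 5000
noise = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + noise[t]

acf_values = sample_acf(y, max_lag=5)
print(np.round(acf_values, 2))
```

For an AR(1) process the autocorrelation decays roughly geometrically with the lag, so the printed values should shrink steadily from lag 0 onwards.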

In [None]:
# Insert your Code here !!!


In [None]:
# Insert your Code here !!!

In [None]:
# Insert your Code here !!!

### (Task I) Answer :
.
.
.
.
.


## TASK II:
You have gained your first information for forecasting the values of a time series. But not all information lies in the visualization. To deepen your understanding of how an *AR(p)* and an *MA(q)* process work, write down the mathematical formula of each process with respect to your findings in *Task I*.

The general formulas for an *AR(p)* and an *MA(q)* process are given by:

$$\text{AR(p)} := \quad y_t = c + \sum_{i=1}^{p} \ \Phi_i \ y_{t-i} + \epsilon_t$$

<br>
and 

$$\text{MA(q)} := \quad y_t = \mu + \epsilon_t + \Theta_1 \epsilon_{t-1} + \Theta_2 \epsilon_{t-2} + ... + \Theta_q \epsilon_{t-q}$$

## (Task II) Answer:
.
.
.
.
.


## Task III:
As you can see, an AR(2) process takes the last two observations into account ($y_{t-1}$ and $y_{t-2}$), based on the order we assume. An MA(2) process instead takes the past errors into account ($\epsilon_{t-1}$ and $\epsilon_{t-2}$). The two processes depend on the coefficients $\Phi$ and $\Theta$, respectively. The question now is how to determine these coefficients. For the AR process this is a regression of $y_t$ on its own lags; for the MA process the errors are not observed, so the coefficients are estimated numerically. The task now is to determine both sets of coefficients with the help of the *statsmodels* Python package.

<br>

__Steps:__

1.  Define the *order* argument (the sequence of the parameters is *p,d,q*).
1. Fit the model and save it in the variable *AR_MODEL* and *MA_MODEL*, respectively.
1. Print your model summary with the method *<your_fitted_model>.summary()*.
1. Use your mathematical formulation from *Task II* and insert the estimated parameters. 

__INFO: For now, set your *d* parameter to zero.__
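
To illustrate why the AR coefficients can be seen as a regression problem, the sketch below simulates a hypothetical AR(2) series and recovers its coefficients with ordinary least squares on the lagged values. The true coefficients (0.5 and 0.3) and the sample size are assumptions for illustration. This is not how *statsmodels* estimates the model internally, and it does not carry over to MA terms, whose errors are unobserved.

```python
import numpy as np

# Hypothetical stationary AR(2) process, used only for illustration
rng = np.random.default_rng(42)
true_phi = np.array([0.5, 0.3])
n = 5000
eps = rng.normal(size=n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = true_phi[0] * y[t - 1] + true_phi[1] * y[t - 2] + eps[t]

# Regress y_t on an intercept, y_{t-1} and y_{t-2}
X = np.column_stack([np.ones(n - 2), y[1:-1], y[:-2]])
target = y[2:]
coef, *_ = np.linalg.lstsq(X, target, rcond=None)
c_hat, phi1_hat, phi2_hat = coef

print(f"c = {c_hat:.3f}, phi_1 = {phi1_hat:.3f}, phi_2 = {phi2_hat:.3f}")
```

With a series of this length the estimates should land close to the coefficients used in the simulation.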

## (Task III) Answer:

In [None]:
# Modelling AR(2)-Process

# 1. Define your order here: [order=(p, d, q)]
# order = (p, d, q)

# 2. Fit and save your model here:
# AR_MODEL = ARIMA(<Process>, order= <order>, enforce_stationarity=False).fit()

# 3. Print your model summary here:
# print(AR_MODEL.summary())

\# 4. Mathematical formulation:

$$\text{AR(2)} := \quad y_t = 0.3149 y_{t-1} + 0.4699 y_{t-2}$$ 

In [None]:
# Modelling MA(2)-Process

# 1. Define your order here: [order=(p, d, q)]
# order = ...

# 2. Fit and save your model here:
# MA_MODEL = ARIMA(..., ..., enforce_stationarity=False).fit()

# 3. Print your model summary here:
# print(MA_MODEL.summary())

\# 4. Mathematical formulation:

$$\text{MA(2)} := \quad y_t = $$ 

# Time Series Analysis (Practice)

The goal of this part is to deliver a model capable of forecasting the earnings per share of the company *Johnson&Johnson*. The model's performance will be measured on an out-of-sample file using the mean absolute percentage error (MAPE).

## Task IV:
For the first task of the practical part, access the data and answer the following questions:

1. How many rows does the file contain?
2. What time span does the data cover?
3. At what frequency are the values recorded?

In [None]:
# Data Loading
link_train_set = 'https://raw.githubusercontent.com/Jonas-Metz-verovis/verovis_Coding_Challenge/main/Data/TimeSeries/train_jj.csv'
data = pd.read_csv(link_train_set, sep=';')

## (Task IV) Answer:
1. The data contains ... rows in total.
2. The first value is given by the date ... and the last by the date ....
3. The frequency is ... .

## Task V:
The next task refers to the theoretical part of the time series analysis. Visualize the data over its entire length and use the *ACF* and *PACF* to determine the parameter order (*p,d,q*). For guidance, answer the following questions:

1.  Does the plot of the time series show any pattern such as *trend* or *seasonality*? If yes, what kind of pattern is it?
1.  What do you notice in the *ACF* and *PACF* plots?


In [None]:
# Plot the general time series, acf and pacf here:


## (Task V) Answer:
.
.
.
.
.

## Task VI:
One of the most elementary requirements of the models used here is that the time series is *stationary*. Your next task is to discuss in your team why the shown time series is most likely a non-stationary process. In addition to the verbal explanation, use an appropriate test statistic to back up your assessment.

__INFO: Please give the underlying test hypothesis ($H_0 \quad \text{vs.} \quad  H_1$) for your test-statistic__

## (Task VI) Answer:

In [None]:
# Quantify - stationarity


## Task VII:
Based on your discussion of the stationarity of the time series and your quantified results, transform the given time series accordingly. Then repeat the questions from *Task V* on the transformed time series. Did you gain any additional information through the transformation process?

__INFO: Store the additional transformation in the given dataframe.__

In [None]:
# Since the given time series is a non-stationary process, we need to transform the data. The data increases exponentially over time, so the first transformation is the natural logarithm. Furthermore, first-order differencing is used to obtain a stationary process. Note that our d-parameter increases by 1.


In [None]:
# Quantify - stationarity


## (Task VII) Answer:
After the transformations, the time series is a stationary process and can be used to build a model. Unfortunately, both autocorrelation plots show the same characteristics as before, so we cannot use them to define the parameters.
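
The transformation chain of natural logarithm followed by a first difference can be sketched on a synthetic series as follows. The series below (growth of 5% per period, starting at 1.5) is an assumption chosen for illustration and has nothing to do with the challenge data.

```python
import numpy as np

# Hypothetical series that grows exponentially by about 5% per period
t = np.arange(40)
series = 1.5 * np.exp(0.05 * t)

# Step 1: the natural logarithm turns exponential growth into a linear trend
log_series = np.log(series)

# Step 2: the first difference removes the linear trend (one value is lost)
log_diff = np.diff(log_series)

print(len(series), len(log_diff))
print(np.round(log_diff[:3], 4))
```

On a pandas column the same chain is usually written as `np.log(column).diff()`, which keeps the original length but leaves a `NaN` in the first row that should be dropped before running a stationarity test.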

# Model Selection

## Task VIII:
*"For the next tasks it is up to you. If you are already comfortable with time series analysis, jump to the bottom of the notebook and give it a try.
Otherwise, follow the tasks in the given order :)"*

The following steps must be fulfilled:
- Model Selection
- In-Sample Evaluation
    - MAPE; RMSE; AIC
- Out-of-Sample Evaluation
- Forecast the next 17 periods (according to the frequency)
- Calculation of the metric on the given Out-of-Sample Test-Set
    - MAPE

<br>

The next goal is to determine the right parameters for the given time series. As we already know, the *ACF* and *PACF* do not help very much, but the plots can at least restrict the range of likely parameters.
For the next task, refer to the "stationary" *ACF* and *PACF* plots and choose a rough range of parameters. Then calculate the Akaike Information Criterion (AIC) for each parameter combination. Choose your model accordingly and explain your decision.

__INFO: Set your range between 0 and 10 for the respective parameter__ 
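
As a small illustration of the search space and the selection criterion, the sketch below builds every *(p, d, q)* combination for given ranges and defines the AIC as $2k - 2\ln L$, where $k$ is the number of estimated parameters and $L$ the maximized likelihood. The function names and the example numbers are assumptions; fitted *statsmodels* results expose the AIC directly via their `aic` attribute.

```python
from itertools import product


def build_order_grid(ps, d, qs):
    """Return every (p, d, q) tuple for the given p and q candidates."""
    return [(p, d, q) for p, q in product(ps, qs)]


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: lower values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical small grid, chosen only for illustration
orders = build_order_grid(range(0, 3), 1, range(0, 3))
print(len(orders), orders[:4])

# Hypothetical log-likelihood of -100 with 3 estimated parameters
print(aic(log_likelihood=-100.0, n_params=3))
```

The penalty term $2k$ is what keeps the criterion from always preferring the largest model in the grid.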

## (Task VIII) Answer:

In [None]:
# Define your model selection here


In [None]:
# Choose the best parameter combination based on the Akaike Information Criterion (AIC)


In [None]:
# Choose the best parameter order from your results and fit the model


# Model Evaluation (In-Sample)

## Task IX:
After you have successfully selected your model, it is interesting to know how good your model is. Your next task is to make an in-sample prediction and then determine the following metrics:

- MSE 
- RMSE
- MAPE

__INFO: You don't need to save the metrics, just use the print()-function__
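
For reference, the three metrics can be written out with NumPy as in the sketch below. The function name and the toy numbers are assumptions for illustration. The MAPE is returned as a fraction rather than a percentage, which matches the convention of *mean_absolute_percentage_error* in scikit-learn.

```python
import numpy as np


def forecast_metrics(y_true, y_pred):
    """Return MSE, RMSE and MAPE (as a fraction) for two equally long series."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),  # square root of the MSE
        "MAPE": np.mean(np.abs(errors / y_true)),
    }


# Hypothetical toy values, chosen only for illustration
metrics = forecast_metrics([100, 200, 400], [110, 180, 400])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that the MAPE is undefined when an actual value is zero, so it only suits series that stay away from zero.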

## (Task IX) Answer:

In [None]:
# How good is our model? Calculate standard measurements like MSE; RMSE and MAPE
# 1. Get the prediction (In-Sample) y_hat


# 2. Combine actual values with predictions in a DataFrame


In [None]:
# 3. Print out the Measurements


# Cross Validation (*pseudo* Out-of-Sample)

## Task X:
In the last task, you evaluated your model on the entire time series available. However, how good is your model on unknown data? In the next task, you have to determine your metrics based on different time windows. 

__INFO: Use the function "TimeSeriesSplit" from the Sklearn package__.
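
To show what kind of folds an expanding-window split produces, the sketch below reimplements the idea in plain Python. It follows the default behaviour of *TimeSeriesSplit* as far as I understand it (equally sized test folds at the end of the series, with a training window that grows from fold to fold). The function name is hypothetical, and the scikit-learn class should be used in the actual solution.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with a growing training window."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Too many splits for the number of samples.")
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        train_indices = list(range(start))
        test_indices = list(range(start, start + test_size))
        yield train_indices, test_indices


# Hypothetical series of 10 observations split into 4 folds
splits = list(expanding_window_splits(n_samples=10, n_splits=4))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(fold, train_idx, test_idx)
```

Each test fold lies strictly after its training window, so no future observation leaks into the model fit.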

## (Task X) Answer:

In [None]:
# Compute the Cross Validation here



# Forecasting 

## Task XII:
To see your *best_model* in action, forecast the next 17 periods. Afterwards, save all information in a DataFrame called "TimeSeriesPredictions". The DataFrame includes the columns: values (actual values), predictions (In-Sample) and Forecast (Out-of-Sample). Additionally, visualize your data together with the forecast values.
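
If the model was fitted on a manually log-transformed and differenced series, its forecasts are on that transformed scale and have to be mapped back before they can be compared with the actual values. The sketch below shows this back-transformation under that assumption; all numbers are hypothetical. If the model was instead fitted on the log series with *d = 1*, the differencing is undone by the model itself and only the exponential is needed.

```python
import numpy as np

# Hypothetical inputs, chosen only for illustration
last_log_value = np.log(10.0)                   # log of the last observed value
forecast_log_diffs = np.array([0.1, 0.1, 0.1])  # forecasts of the differenced log series

# Undo the differencing: accumulate the changes on top of the last known log value
forecast_logs = last_log_value + np.cumsum(forecast_log_diffs)

# Undo the logarithm to return to the original scale
forecast_levels = np.exp(forecast_logs)

print(np.round(forecast_levels, 3))
```

The order matters: the cumulative sum has to be applied before the exponential, mirroring the original transformation in reverse.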

## (Task XII) Answer:

In [None]:
# Forecast the next 17 periods (OOS)
# n_forecast = 17
# forecast_series = ...

In [None]:
# TimeSeriesPredictions = pd.concat([data, pd.DataFrame(forecast_series)])
# TimeSeriesPredictions = TimeSeriesPredictions.rename(columns={'predicted_mean': 'Forecast'})

In [None]:
# Visualize your data


# Data Saving

## Info (Google Colab)

If you are working in Google Colab, you can save the Results to your Google Drive by running

```
from google.colab import drive
drive.mount("/content/drive")
```

You will be requested to authenticate with your Google Account.

The Path to your Google Colab Notebooks Folder will be "/content/drive/My Drive/Colab Notebooks".

The Commands can then use this Path:

```
os.makedirs ("/content/drive/My Drive/Colab Notebooks/Results", exist_ok=True)
TimeSeriesPredictions.to_csv ("/content/drive/My Drive/Colab Notebooks/Results/TimeSeriesPredictions.csv", index=False)
```

## Task XIII:
Save a DataFrame which contains the actual Live Targets as well as the corresponding Live Predictions to a CSV-File.
Please write the CSV-File in a way which can be read by a German Microsoft Excel without any necessary Modifications and submit the CSV-File together with your Solution Notebook.
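
A German Microsoft Excel typically expects a semicolon as the column separator and a comma as the decimal mark. The sketch below writes such a file into an in-memory buffer with the standard library to show what the content looks like. The column names and values are made up for illustration. With pandas, the `sep=";"` and `decimal=","` arguments of `to_csv` produce the same layout.

```python
import csv
import io

# Hypothetical rows (date, actual value, forecast), chosen only for illustration
rows = [("2000-01-01", 1.0, 1.25), ("2000-04-01", 2.5, 2.4)]


def german_number(value):
    """Format a float with four decimals and a decimal comma."""
    return f"{value:.4f}".replace(".", ",")


buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerow(["date", "values", "Forecast"])
for date, actual, forecast in rows:
    writer.writerow([date, german_number(actual), german_number(forecast)])

csv_text = buffer.getvalue()
print(csv_text)
```

When writing to disk, Excel often detects UTF-8 more reliably if the file starts with a byte order mark, which Python provides through the `"utf-8-sig"` encoding.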

In [None]:
# TimeSeriesPredictions.to_csv (os.path.join (DATABRICKS_BASE_DIRECTORY, TEAM_NAME + "_TimeSeriesPredictions.csv"), sep=";", decimal=",", header=True, index=False, encoding="utf-8", float_format="%.4f")
# print ("The Predictions have been successfully saved to a CSV-File, you can download them from:")
# print (DATABRICKS_INSTANCE + "/files/" + TEAM_NAME + "/" + TEAM_NAME + "_TimeSeriesPredictions.csv" + "?o=" + DATABRICKS_ORGANISATION)

## Task XIV: 
Save your Model to a joblib Pickle Dump File, which can be loaded during Inference. Please submit this File together with your Solution Notebook.

In [None]:
# import datetime as dt
# from joblib import dump
# model_name = TEAM_NAME + "_" + dt.datetime.now().strftime("%Y%m%d_%H%M%S") + "_Coding_Challenge_04.joblib"
# dump(best_model, os.path.join("/dbfs/FileStore/" + TEAM_NAME, model_name))
# print ("The fitted Model has been successfully saved, you can download it from:")
# print (DATABRICKS_INSTANCE + "/files/" + TEAM_NAME + "/" + model_name + "?o=" + DATABRICKS_ORGANISATION)