# Before you start:
- Read the README.md file
- Comment as much as you can and use the resources (README.md file)
- Happy learning!

In [None]:
# Import numpy and pandas
import numpy as np 
import pandas as pd

# Challenge 1 - Loading and Evaluating The Data

In this lab, we will look at a dataset of sensor data from a cellular phone. The phone was carried in the subject's pocket for a few minutes while they walked around.

To load the data, run the code below.

In [None]:
# Run this code:
sensor = pd.read_csv('../sub_1.csv')
sensor.drop(columns=['Unnamed: 0'], inplace=True)

Examine the data using the `head` function.

In [None]:
# Your code here:
sensor.head()

Check whether there is any missing data. If there is, remove the rows that contain it.

In [None]:
# Your code here:
sensor.isnull().sum() 
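
If the check above reports any missing values, the affected rows can be dropped with `dropna`. A minimal sketch on a small made-up frame (the column names and values are illustrative, not taken from `sub_1.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [0.1, 0.2, np.nan, 0.4],
})

missing_per_column = toy.isnull().sum()  # count of NaN values per column
cleaned = toy.dropna()                   # keep only rows without any NaN

print(missing_per_column)
print(cleaned.shape)
```

For the lab data the equivalent call would be `sensor = sensor.dropna()`.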

How many rows and columns are in our data?

In [None]:
# Your code here:
sensor.shape 

To perform time series analysis on the data, we must change the index from a range index to a time series index. In the cell below, create a time series index using the `pd.date_range` function. Create a time series index starting at 1/1/2018 00:00:00 and ending at 1/1/2018 00:29:10. The number of periods is equal to the number of rows in `sensor`. The frequency should be set to `infer`.

In [None]:
# Your code here:
date_index = pd.date_range(start='1/1/2018 00:00:00', end='1/1/2018 00:29:10', periods=sensor.shape[0]) 

Assign the time series index to the dataframe's index.

In [None]:
# Your code here:
sensor.index = pd.DatetimeIndex(date_index, freq='infer') 
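
To see how `pd.date_range` behaves when `start`, `end` and `periods` are all given, here is a small sketch on a toy range of 10 timestamps (the toy end time is an assumption chosen so the spacing comes out at one second):

```python
import pandas as pd

# Toy example: 10 timestamps spread evenly over 9 seconds
toy_index = pd.date_range(start="1/1/2018 00:00:00",
                          end="1/1/2018 00:00:09",
                          periods=10)

# With start, end and periods all given, the spacing is computed rather than
# declared, so the frequency is inferred afterwards.
toy_index = pd.DatetimeIndex(toy_index, freq="infer")

step = toy_index[1] - toy_index[0]
print(step)            # spacing between consecutive timestamps
print(toy_index.freq)  # frequency pandas inferred, if any
```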

Our next step is to decompose the time series and evaluate the patterns in the data. Load the `statsmodels.api` submodule and plot the decomposed plot of `userAcceleration.x`. Set `freq=60` in the `seasonal_decompose` function. Your graph should look like the one below.

![time series decomposition](../images/tsa_decompose.png)

In [None]:
# Your code here:
'''
L.S. Very good!
'''

import statsmodels.api as sm

# Note: in statsmodels 0.11 and later this argument is named `period` instead of `freq`
res = sm.tsa.seasonal_decompose(sensor['userAcceleration.x'], freq=60)
resplot = res.plot()

Plot the decomposed time series of `rotationRate.x` also with a frequency of 60.

In [None]:
# Your code here:
'''
L.S. Well done!
'''

res = sm.tsa.seasonal_decompose(sensor['rotationRate.x'], freq=60)
resplot = res.plot() 
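
To make the panels of a decomposition plot less of a black box, the following is a simplified sketch of a classical additive decomposition (observed = trend + seasonal + residual) using only NumPy and pandas. It is not the statsmodels implementation; the function name `additive_decompose` and the synthetic series are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def additive_decompose(series, period):
    """Classical additive decomposition: observed = trend + seasonal + resid."""
    # Centered moving average; for an even period the two end weights are halved
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    values = series.to_numpy(dtype=float)
    trend = np.full(len(values), np.nan)  # undefined at both edges
    trend[half:len(values) - half] = np.convolve(values, weights, mode="valid")
    # Seasonal part: mean detrended value at each position in the cycle,
    # shifted so that it averages to zero over one period
    detrended = values - trend
    position = np.arange(len(values)) % period
    cycle = np.array([np.nanmean(detrended[position == p]) for p in range(period)])
    cycle -= cycle.mean()
    seasonal = cycle[position]
    resid = values - trend - seasonal
    return pd.DataFrame({"observed": values, "trend": trend,
                         "seasonal": seasonal, "resid": resid},
                        index=series.index)

# Synthetic series: a linear trend plus a sine wave that repeats every 60 steps
t = np.arange(240)
demo = pd.Series(0.01 * t + np.sin(2 * np.pi * t / 60))
parts = additive_decompose(demo, period=60)
print(parts.iloc[98:103])
```

On this synthetic input the trend column recovers the straight line, the seasonal column recovers the sine wave, and the residual is numerically zero wherever the trend is defined.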

# Challenge 2 - Modelling the Data

To model our data, we should first check a few assumptions. Let's start with a lag plot (`lag_plot`) to detect any autocorrelation. Do this for `userAcceleration.x`.

In [None]:
# Your code here:
from pandas.plotting import lag_plot

lag_plot(sensor['userAcceleration.x']);

Create a lag plot for `rotationRate.x`.

In [None]:
# Your code here:
'''
L.S. Good!
'''

lag_plot(sensor['rotationRate.x']); 

What are your conclusions from both visualizations?

In [None]:
# Your conclusions here:
# Both plots show points along the diagonal, so both series are autocorrelated (each value is correlated with the previous one)
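
The visual impression from a lag plot can be backed by a number: the lag-1 autocorrelation, which is the correlation between the two axes of the plot. A minimal sketch on synthetic data (the white-noise and AR(1) series below are made up for illustration and are not the sensor data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# White noise: consecutive values are unrelated
noise = pd.Series(rng.normal(size=n))

# Hypothetical AR(1) series: each value is 0.9 times the previous one plus noise
shocks = rng.normal(size=n)
ar_values = np.zeros(n)
for i in range(1, n):
    ar_values[i] = 0.9 * ar_values[i - 1] + shocks[i]
ar_series = pd.Series(ar_values)

# Lag-1 autocorrelation: correlation between y(t) and y(t+1)
noise_r = noise.autocorr(lag=1)
ar_r = ar_series.autocorr(lag=1)
print(round(noise_r, 2), round(ar_r, 2))
```

A value near zero corresponds to a shapeless cloud in the lag plot, while a value near one corresponds to points hugging the diagonal. The same check on the lab data would be `sensor['userAcceleration.x'].autocorr(lag=1)`.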

The next step will be to test both variables for stationarity. Perform the Augmented Dickey Fuller test on both variables below.

In [None]:
# Your code here:
'''
L.S. Good!
'''

from statsmodels.tsa.stattools import adfuller

print('for userAcceleration.x:', adfuller(sensor['userAcceleration.x'])[1]) 
print('for rotationRate.x:', adfuller(sensor['rotationRate.x'])[1]) 

What are your conclusions from this test?

In [None]:
# Your conclusions here:
# Both p-values are below 0.05, so we reject the null hypothesis of a unit root: both series are stationary
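
For reference, the regression behind the test can be written out as a sketch. This assumes the default `regression='c'` setting of `adfuller` (a constant and no trend term); the symbols are introduced here for illustration, and $p$ is the number of lagged differences included in the regression:

$$
\Delta y_t = \alpha + \gamma\, y_{t-1} + \sum_{i=1}^{p} \delta_i\, \Delta y_{t-i} + \varepsilon_t
$$

The null hypothesis is $\gamma = 0$ (a unit root, so the series is non-stationary) against the alternative $\gamma < 0$. Index `[1]` of the tuple returned by `adfuller` is the p-value, which is why a value below 0.05 is read as evidence of stationarity.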

Finally, we'll create an ARMA model for `userAcceleration.x`. Load the `ARMA` function from `statsmodels`. The order of the model is (2, 1). Split the data into training and test sets: use the last 10 observations as the test set and all other observations as the training set.
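
As a sketch of what the order (2, 1) means, assuming the usual ARMA notation (the symbols below are not from the lab text), the model has two autoregressive terms and one moving-average term:

$$
y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \varepsilon_t + \theta_1 \varepsilon_{t-1}
$$

Here $y_t$ is the observed value at time $t$, $\varepsilon_t$ is the white-noise error, and $c$, $\phi_1$, $\phi_2$, $\theta_1$ are the parameters estimated when the model is fitted.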

In [None]:
# Your code here:

'''
L.S. Correct!
'''

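# Note: statsmodels 0.13+ removed ARMA; there, statsmodels.tsa.arima.model.ARIMA with order=(2, 0, 1) is the equivalent
from statsmodels.tsa.arima_model import ARMA

# split the data into train and test sets (the last 10 observations are the test set)
train, test = sensor['userAcceleration.x'][:-10], sensor['userAcceleration.x'][-10:]

# fit the model on the training set only, so the test observations stay unseen
model = ARMA(train, order=(2, 1))
model_fit = model.fit(disp=False)

# forecast the 10 test observations that follow the training set
predictions = model_fit.predict(start=len(train), end=len(train)+len(test)-1, dynamic=False)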
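
The split and the `start`/`end` arguments of the prediction can be checked in isolation. A minimal sketch on a made-up series of 25 values (the series and the name `n_test` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical series of 25 observations standing in for the sensor column
series = pd.Series(range(25), dtype=float)

n_test = 10
train, test = series[:-n_test], series[-n_test:]

# Positions of the test observations, counted from the start of the series.
# A model fitted on `train` forecasts them as steps len(train) .. len(series) - 1.
first_test_pos = len(train)
last_test_pos = len(train) + len(test) - 1
print(len(train), len(test), first_test_pos, last_test_pos)
```

The two sets do not overlap, and together they cover the whole series.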

To compare our predictions with the observed data, we can compute the RMSE (Root Mean Squared Error) from the submodule `statsmodels.tools.eval_measures`. You can read more about this function [here](https://www.statsmodels.org/dev/generated/statsmodels.tools.eval_measures.rmse.html). Compute the RMSE for the last 10 rows of the data by comparing the observed and predicted data for the `userAcceleration.x` column.

In [None]:
# Your code here:
from statsmodels.tools.eval_measures import rmse

# build a DataFrame holding the observed values and the predictions side by side
compare_data = pd.DataFrame({'observed':sensor['userAcceleration.x'][-10:], 'predicted':predictions})

# RMSE between the observed and predicted columns
rmse(compare_data.observed, compare_data.predicted) # lower is better; RMSE is in the units of the variable,
                                                    # so judge it against the variable's range
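
To make the metric concrete, here is a minimal hand-written version of RMSE applied to made-up numbers. The function name `rmse_manual` and the values are illustrative assumptions; the lab itself uses `rmse` from `statsmodels.tools.eval_measures`.

```python
import numpy as np

def rmse_manual(observed, predicted):
    """Root mean squared error: square root of the mean squared difference."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

# Made-up values: three perfect predictions and one that is off by 2
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
score = rmse_manual(observed, predicted)
print(score)  # → 1.0
```

The squared errors are 0, 0, 0 and 4, their mean is 1, and the square root of 1 is 1.0. Because the errors are squared before averaging, one large miss weighs more than several small ones.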