# Assignment 4: Linear Regression

For this assignment, you may use `numpy`, `pandas`, and packages from the Python standard library.

List your team members (name, matriculation number, course of study) in the following cell:

* Your names here

## Task 2: Regression for Time-Series Prediction

Make sure the provided `lake.dat` file is in the same directory as the Jupyter notebook.

You can find the original data including references here: [DaISy - Data of a simulation of the western basin of Lake Erie](https://homes.esat.kuleuven.be/~smc/daisy/daisydata.html).

In [1]:
import matplotlib.pyplot as plt
import numpy as np

In [2]:
def load_lake_data():
    with open('lake.dat') as f:
        data = np.loadtxt(f)
    X = data[:, 6:11]
    Y = data[:, 23:25]
    return X, Y

def train_test_split(X, Y, train_fraction):
    cutoff_idx = int(train_fraction * X.shape[0])
    X_train, Y_train = X[:cutoff_idx], Y[:cutoff_idx]
    X_test, Y_test = X[cutoff_idx:], Y[cutoff_idx:]

    return X_train, Y_train, X_test, Y_test


X_train, Y_train, X_test, Y_test = train_test_split(*load_lake_data(), train_fraction=0.7)
print(f'Train samples: {X_train.shape[0]}, Test samples: {X_test.shape[0]}')

Train samples: 39, Test samples: 18


Training data is stored in `X_train` and `Y_train`. Test data is stored in `X_test` and `Y_test`.
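
Note that `train_test_split` cuts the series chronologically: the first `train_fraction` of the rows become the training set and the remainder the test set, without shuffling. A minimal sketch of the cutoff arithmetic, using a made-up array `toy` with 57 entries (the total of the 39 + 18 samples printed above) rather than the real data:

```python
import numpy as np

toy = np.arange(57)  # hypothetical stand-in for 57 time steps
cutoff_idx = int(0.7 * toy.shape[0])  # int() truncates 39.9 down to 39
train, test = toy[:cutoff_idx], toy[cutoff_idx:]

# Every training index precedes every test index, so no future data leaks into training.
print(train.shape[0], test.shape[0])
```

Because `int()` truncates rather than rounds, the training set can be slightly smaller than `train_fraction` of the data.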

1\) Create a plot that shows all inputs and outputs of the training dataset over time. For each variable a separate subplot should be created. Time should be displayed along the x-axis, while the value of each variable should be displayed along the y-axis.

*Hint:* Use `plt.subplots(n, sharex=True)` to create `n` plots that are synchronized along the x-axis.
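
As a rough illustration of the hint, the following sketch stacks three subplots that share one time axis. The array `demo` and the labels are invented for the example; it is not the lake data and not a solution to the task.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
demo = rng.normal(size=(20, 3))  # made-up series: 20 time steps, 3 variables
t = np.arange(demo.shape[0])     # time index for the x-axis

# One row of axes per variable; sharex=True keeps the x-axes synchronized.
fig, axes = plt.subplots(demo.shape[1], sharex=True, figsize=(6, 6))
for i, ax in enumerate(axes):
    ax.plot(t, demo[:, i])
    ax.set_ylabel(f"var {i}")
axes[-1].set_xlabel("time step")  # label only the bottom axis
fig.tight_layout()
```

With `sharex=True`, zooming or panning one subplot moves all of them, which makes it easier to compare variables at the same time step.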

In [3]:
# Your code here

2\) The `make_features` function takes a matrix of inputs `X` and a matrix of outputs `Y`, as well as the arguments `feature_type` and `lag`.
It computes the features used as input for a linear regression.


`X` has the shape `(T, 5)` and `Y` has the shape `(T, 2)` where `T` is a number of time steps.
`feature_type` is a string, either `lin` or `quad`, that selects whether only linear features or also quadratic features are computed. `lag` is a boolean that determines whether the lag variables for `t-1` should be included.


If `lag` is `False`, the returned feature matrix has shape `(T, D)`, where `D` depends on the choice of `feature_type`.
For `feature_type=lin` and `lag=False`, `D` equals `5+2+1` (five inputs, two outputs, and one bias term); for `feature_type=lin` and `lag=True`, `D` equals `2*(5+2)+1`.
If `lag` is `True`, the returned feature matrix has shape `(T-1, D)`, since the additional lag variable prevents the creation of a feature for time step $t=0$.
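
The shapes stated above can be checked on placeholder data. This sketch repeats the column stacking of the linear case on zero-filled arrays with a made-up `T = 6`; the variable names `phi_no_lag` and `phi_lag` are chosen for illustration only.

```python
import numpy as np

T = 6  # made-up number of time steps
X = np.zeros((T, 5))
Y = np.zeros((T, 2))

# lag=False: bias column, x_t, y_t
phi_no_lag = np.hstack((np.ones((T, 1)), X, Y))

# lag=True: bias column, x_t, x_{t-1}, y_t, y_{t-1}; the first time step is dropped
phi_lag = np.hstack((np.ones((T - 1, 1)), X[1:], X[:-1], Y[1:], Y[:-1]))

print(phi_no_lag.shape)  # (6, 8)
print(phi_lag.shape)     # (5, 15)
```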


Extend the `make_features` function to also compute quadratic features (`feature_type='quad'`) for both `lag=False` and `lag=True`.
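
The text does not pin down whether "quadratic" means squared terms only or also cross terms between variables. As one possible reading, the sketch below assumes the latter and forms every product $z_i z_j$ with $i \le j$ for a small made-up matrix. The helper `quadratic_terms` is hypothetical and not part of the provided code.

```python
import numpy as np

def quadratic_terms(Z):
    """Return all products z_i * z_j with i <= j for each row of Z (hypothetical helper)."""
    i, j = np.triu_indices(Z.shape[1])  # index pairs of the upper triangle, diagonal included
    return Z[:, i] * Z[:, j]

Z = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
Q = quadratic_terms(Z)
print(Q.shape)  # (2, 6)
print(Q[0])     # [1. 2. 3. 4. 6. 9.]
```

For `d` base columns this produces `d*(d+1)/2` quadratic columns, so the feature dimension grows quickly when lag variables are included.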


In [4]:
def make_features(X, Y, feature_type='lin', lag=False):
    if feature_type == 'lin':
        if lag:
            Phi = np.hstack((
                np.ones((X.shape[0]-1, 1)), # bias/intercept term
                X[1:], # x_t
                X[:-1], # x_{t-1}
                Y[1:], # y_t
                Y[:-1] # y_{t-1}
            ))
        else:
            Phi = np.hstack((
                np.ones((X.shape[0], 1)), # bias/intercept term
                X, # x_t
                Y # y_t
            ))
    elif feature_type == 'quad':
        # Your code here
        raise NotImplementedError()
    else:
        raise ValueError(f'feature_type is not implemented for {feature_type}')

    return Phi
