A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says `YOUR CODE HERE`. Do **not** write your answer anywhere else. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_ → _Restart & Run All_ to restart the kernel and run all cells.

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 3. Multiple Variable Linear Regression

In this problem, we will use statsmodels to fit a multiple variable linear regression model that predicts `AirTime` from `Distance` and `DepDelay`.

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error

# tools for testing
from nose.tools import assert_is_instance, assert_almost_equal
from numpy.testing import assert_array_almost_equal

In [Problem 1](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week5/assignments/Problem_1_Seaborn_Linear_Regression.ipynb) and [Problem 2](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week5/assignments/Problem_2_Statsmodels_Linear_Regression.ipynb), we used all of our data to fit a model and made predictions on the same data. Often, we want to perform [cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) (see the [Introduction to Regression](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week5/notebooks/intro2regress.ipynb) notebook). In cross-validation, we use a training set to fit a model and a separate test set to quantify the quality of that fit.
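The train/test idea can be sketched on made-up numbers before turning to the flight data. The snippet below is only an illustration: the arrays `x_all` and `y_all`, the 10/4 split, and the use of `numpy.polyfit` are assumptions chosen for brevity and are not part of the assignment.

```python
import numpy as np

# Hypothetical toy data: roughly y = 2x + 1 with a small alternating wiggle.
x_all = np.arange(14, dtype=float)
y_all = 2.0 * x_all + 1.0 + np.where(x_all % 2 == 0, 0.5, -0.5)

# Fit only on the first ten points; hold out the last four for testing.
x_fit, y_fit = x_all[:10], y_all[:10]
x_held, y_held = x_all[10:], y_all[10:]

# polyfit returns coefficients from the highest degree down: slope, then intercept.
slope, intercept = np.polyfit(x_fit, y_fit, deg=1)

# Root mean squared error on the data used for fitting and on the held-out data.
rmse_fit = np.sqrt(np.mean((slope * x_fit + intercept - y_fit) ** 2))
rmse_held = np.sqrt(np.mean((slope * x_held + intercept - y_held) ** 2))

print("training RMSE:", rmse_fit)
print("held-out RMSE:", rmse_held)
```

The held-out error is the more honest estimate of how the fit behaves on new data, because those points played no part in choosing the slope and intercept.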

Suppose, in addition to the data from 20 flights (which we will use as training data to fit a model), we are given data from 10 more flights (which we will use as testing data to quantify the quality of the fit).

```python
>>> print(data_test)
```
```
   DepDelay  Distance
0        -5       361
1         0       588
2        -6      2565
3        11      2454
4        23       228
5        -2      1182
6        -1      1605
7        -7       477
8        -2       505
9         2       228
```

```python
>>> print(air_time_true)
```
```
   AirTime
0       64
1       95
2      296
3      328
4       56
5      151
6      187
7       71
8       76
9       56
```

In [None]:
data_train = pd.DataFrame(
    {"Distance": [
        361, 569, 588, 1172, 2565, 861, 665, 787, 228, 197,
        978, 1745, 1605, 373, 156, 209, 505, 224, 282, 862
    ],
    "DepDelay": [
        -3, 10, -2, 10, -11, 0, 17, -24, 1, 75,
        -16, -26, -10, -17, -11, -4, -4, 66, -8, 41
    ],
    "AirTime": [
        60, 84, 95, 182, 337, 119, 87, 103, 55, 47,
        127, 215, 213, 59, 31, 57, 88, 42, 45, 102
    ]}
)

data_test = pd.DataFrame(
    {"Distance": [
        361, 588, 2565, 2454, 228, 1182, 1605, 477, 505, 228
    ],
    "DepDelay": [
        -5, 0, -6, 11, 23, -2, -1, -7, -2, 2
    ]}
)

air_time_true = pd.DataFrame(
    {"AirTime": [
        64, 95, 296, 328, 56, 151, 187, 71, 76, 56
    ]}
)

In [None]:
print(data_test)

In [None]:
print(air_time_true)

## Use Statsmodels to fit a multiple variable linear regression model

- Write a function named `fit_multiple_variable_linear_regression()` that fits an ordinary least squares (OLS) model to predict `AirTime` from `Distance` and `DepDelay`.

Notes:

- The function takes two arguments. The first argument `df_train` is a `pandas.DataFrame` with three columns: `AirTime`, `Distance`, and `DepDelay`. The second argument `df_test` is a `pandas.DataFrame` with two columns, `Distance` and `DepDelay`.

- It then uses `statsmodels.formula.api.ols()` to fit an OLS model that maps the `Distance` and `DepDelay` features of **df_train** to the `AirTime` labels. (In other words, if `Distance` is $x_1$, `DepDelay` is $x_2$, and `AirTime` is the label $y$, we want to fit a linear regression function $y=f(x_1,x_2)$ and predict `AirTime` from `Distance` and `DepDelay`.)

- Finally, it uses the model to predict `AirTime` from the `Distance` and `DepDelay` columns of **df_test** and returns the predictions as a `numpy.ndarray`. (In contrast to Problems 1 and 2, we don't make predictions on the same data set. We use `df_train` to fit the model, and then make predictions on `df_test`.)

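- Use the formula interface to express a linear combination of the two independent variables, e.g. `y ~ x1 + x2`. A formula written this way includes an intercept term by default; keep it, since the expected values in the tests correspond to a model with an intercept.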
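As a rough illustration of the formula interface, here is a minimal sketch on made-up data rather than the flight data. The names `toy_train`, `toy_new`, `x1`, `x2`, and `y` are hypothetical; the data are constructed so that $y = 1 + 2x_1 + 3x_2$ holds exactly.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data (not the flight data): y = 1 + 2*x1 + 3*x2 exactly.
toy_train = pd.DataFrame({
    "x1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "x2": [1.0, 0.0, 2.0, 1.0, 3.0, 0.0],
})
toy_train["y"] = 1.0 + 2.0 * toy_train["x1"] + 3.0 * toy_train["x2"]

# New rows that were not used for fitting.
toy_new = pd.DataFrame({"x1": [6.0, 7.0], "x2": [1.0, 2.0]})

# Response on the left of "~", predictors on the right.
# An intercept is added automatically.
toy_fit = smf.ols(formula="y ~ x1 + x2", data=toy_train).fit()

# predict() looks up the predictor columns by name in the new DataFrame;
# np.asarray() converts the result to a plain ndarray.
toy_pred = np.asarray(toy_fit.predict(toy_new))

print(toy_fit.params)
print(toy_pred)
```

Because the toy data are exactly linear, the fitted parameters should come out close to 1, 2, and 3 (up to floating-point error), with the intercept listed under the name `Intercept`.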

In [None]:
def fit_multiple_variable_linear_regression(df_train, df_test):
    """
    Trains OLS on the columns in "df_train" and makes a prediction for "df_test".
    Returns the predicted "AirTime" values.
    
    Parameters
    ----------
    df_train: A pandas.DataFrame. Should have "AirTime", "Distance", and "DepDelay" columns.
    df_test: A pandas.DataFrame. Should have "Distance" and "DepDelay" columns.
    
    Returns
    -------
    A numpy array
    """

    # YOUR CODE HERE
    
    return result

In [None]:
air_time_pred = fit_multiple_variable_linear_regression(data_train, data_test)
print(air_time_pred)

In [None]:
assert_is_instance(air_time_pred, np.ndarray)
assert_array_almost_equal(
    air_time_pred,
    [60.49457846, 88.10094501, 327.75462849, 314.56045885, 44.80502873,
     160.10305484, 211.41530694, 74.53035392, 78.00429255, 44.47565995]
)

# test a trivial case
df_train1 = pd.DataFrame(
    {"Distance": np.arange(50),
     "DepDelay": np.arange(50) - 1,
     "AirTime": np.arange(50) + 1}
)
df_test1 = pd.DataFrame(
    {"Distance": np.arange(50) + 1,
     "DepDelay": np.arange(50)}
)
y_pred1 = fit_multiple_variable_linear_regression(df_train1, df_test1)
assert_array_almost_equal(y_pred1, np.arange(50) + 2)

## Compute the root mean squared error

- Write a function named `compute_root_mean_squared_error()` that computes the root mean squared error between two numpy arrays.

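- You can either take the square root of `sklearn.metrics.mean_squared_error()`, or use NumPy functions to calculate the following, where $\hat{y}_i$ is the predicted value, $y_i$ is the true value, and $N$ is the number of observations: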

$$
\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{N}(\hat{y}_i - y_i)^2}{N}}
$$

- In our case, we have
```python
>>> rmse = compute_root_mean_squared_error(air_time_true, air_time_pred)
>>> print("Root mean squared error is {0:.1f} minutes.".format(rmse))
```
```
Root mean squared error is 14.8 minutes.
```
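So, when our model's predictions are compared to the ground truth, the typical prediction error (in the root-mean-square sense) is about 14.8 minutes. This does not mean that every individual prediction is off by exactly 14.8 minutes.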
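To make the formula concrete, here is a small worked sketch on made-up numbers using NumPy only. The helper name `rmse_sketch` and the arrays `toy_true` and `toy_hat` are hypothetical. The sketch also assumes the inputs may differ in shape, as they do here, where the true values are a single-column table and the predictions are a one-dimensional array, so both are flattened first.

```python
import numpy as np

def rmse_sketch(y_true, y_pred):
    """Root mean squared error between two array-likes of equal length."""
    # Flatten to 1-D so an (N, 1) column and an (N,) array line up element by
    # element instead of broadcasting to an (N, N) matrix.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

toy_true = np.array([[3.0], [5.0], [7.0]])  # shape (3, 1), like a one-column table
toy_hat = np.array([2.0, 5.0, 9.0])         # shape (3,)

# Errors are -1, 0, 2; squared errors are 1, 0, 4; their mean is 5/3.
toy_rmse = rmse_sketch(toy_true, toy_hat)   # sqrt(5/3), about 1.291
print(toy_rmse)
```

Note that the single error of 2 contributes four times as much to the sum as the error of 1, so RMSE penalizes large misses more heavily than a plain average of absolute errors would.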

In [None]:
def compute_root_mean_squared_error(y_true, y_pred):
    """
    Computes the root mean squared error.
    
    Parameters
    ----------
    y_true: A numpy array or pandas.DataFrame. Ground truth (correct) target values.
    y_pred: A numpy array. Estimated target values.

    Returns
    -------
    A float (for example, a numpy.float64).
    """
    
    # YOUR CODE HERE
    
    return result

In [None]:
rmse = compute_root_mean_squared_error(air_time_true, air_time_pred)
print("Root mean squared error is {0:.1f} minutes.".format(rmse))

In [None]:
# np.float was removed in NumPy 1.24; numpy.float64 is a subclass of the built-in float
assert_is_instance(rmse, float)
assert_almost_equal(rmse, 14.8406665601)

# test a trivial case
y_true1 = np.arange(50)
y_pred1 = y_true1 + 1
assert_almost_equal(compute_root_mean_squared_error(y_true1, y_pred1), 1.)