A few things you should keep in mind when working on assignments:

1. Fill in every place that says `YOUR CODE HERE`. Do **not** write your answer anywhere else. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_ → _Restart & Run All_ to restart the kernel and run all cells.

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 2. Linear Regression Using Statsmodels

In this problem, we will use Statsmodels to fit a linear regression model that predicts `AirTime` from `Distance`.

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf

# tools for testing
from nose.tools import assert_equal, assert_is_not, assert_is_instance
from numpy.testing import assert_array_equal, assert_array_almost_equal

We use the same data set that we used in [Problem 1](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week5/assignments/Problem_1_Seaborn_Linear_Regression.ipynb). However, in this problem, we will use `statsmodels`, not `seaborn`.

In [None]:
data = pd.DataFrame(
    {"AirTime": [60, 84, 95, 182, 337, 119, 87, 103, 55, 47,
        127, 215, 213, 59, 31, 57, 88, 42, 45, 102],
     "Distance": [361, 569, 588, 1172, 2565, 861, 665, 787, 228, 197,
        978, 1745, 1605, 373, 156, 209, 505, 224, 282, 862]}
)

In [None]:
print(data)

## Use Statsmodels to fit a linear regression model

- Write a function named `fit_statsmodels_linear_regression()` that fits an ordinary least squares (OLS) model of `AirTime` on `Distance`.

Notes:

- The function takes one argument, a `pandas.DataFrame` with two columns, `AirTime` and `Distance`.

- It then uses `statsmodels.formula.api.ols()` to fit an OLS model that maps the `Distance` feature to the `AirTime` label. (In other words, if `Distance` is the $x$ feature and `AirTime` is the $y$ label, we want to fit a linear regression function $y=f(x)$ and predict `AirTime` from `Distance`.)

- Finally, we use the model to predict `AirTime` from `Distance` and return the predictions as a `numpy.ndarray`. We use the same data set to fit the model **and** make predictions. (In [Problem 3](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week5/assignments/Problem_3_Multiple_Variable_Linear_Regression.ipynb), we will use a "training set" to fit a model and make predictions on a different "test set".)

- Use the formula interface to specify a linear regression with one independent variable, e.g. `y ~ x`. (A formula of this form includes an intercept term by default, which is what the tests below expect.)
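
As a rough illustration of the formula interface described in the notes above, here is a minimal sketch on a hypothetical toy data set. The names `toy`, `with_intercept`, `no_intercept`, and `toy_pred` are made up for this example and are not part of the assignment. It also shows how a formula behaves with and without an intercept term.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data lying exactly on the line y = 2x + 1.
toy = pd.DataFrame({"x": np.arange(10.0)})
toy["y"] = 2.0 * toy["x"] + 1.0

# "y ~ x" adds an intercept automatically; "y ~ x - 1" removes it.
with_intercept = smf.ols(formula="y ~ x", data=toy).fit()
no_intercept = smf.ols(formula="y ~ x - 1", data=toy).fit()

print(with_intercept.params)  # has both an "Intercept" and an "x" entry
print(no_intercept.params)    # has only an "x" entry

# predict() returns a pandas Series here; np.asarray() converts it to an array.
toy_pred = np.asarray(with_intercept.predict(toy))
```

Because the toy data are exactly linear, the predictions from the model with an intercept should reproduce `toy["y"]` up to floating-point error.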

In [None]:
def fit_statsmodels_linear_regression(df):
    """
    Trains an OLS that predicts "AirTime" from "Distance".
    Returns the predicted "AirTime" values.
    
    Parameters
    ----------
    df: A pandas.DataFrame. Should have "AirTime" and "Distance" columns.
    
    Returns
    -------
    A numpy array
    """

    # YOUR CODE HERE
    
    return result

In [None]:
y_pred = fit_statsmodels_linear_regression(data)
print(y_pred)

In [None]:
assert_array_almost_equal(
    y_pred,
    np.array([
        60.72739639, 85.90348963, 88.20322891, 158.88995222, 327.49715352,
        121.24685128, 97.52322496, 112.28997195, 44.62922139, 40.87701519,
        135.40840372, 228.24524751, 211.29980014, 62.17986331, 35.91441989,
        42.32948211, 78.1569994, 44.14506576, 51.16532252, 121.36789019
    ])
    )

# test more cases
df1 = pd.DataFrame({
    "AirTime": np.arange(100),
    "Distance": np.arange(100)
    })
y_test = fit_statsmodels_linear_regression(df1)

assert_array_almost_equal(y_test, df1["Distance"])

## Plot the linear regression model

Plot the model we learned with `fit_statsmodels_linear_regression()`. Your plot should have both a scatter plot of `AirTime` vs. `Distance` and a line that represents the linear regression model.

- Use [seaborn.regplot](http://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.regplot.html) to write a function named `plot_statsmodels_linear_regression()` that creates a scatter plot with `Distance` on the $x$-axis and `AirTime` on the $y$-axis. The function should also draw the fitted linear regression line on the same plot.

Here is an example plot. (You don't have to make your plot look exactly like this example. If your plot looks visually OK, and if the test code cell doesn't produce any errors, your solution is correct.)

![](statsmodels_linear_regression.png)

Hints:

- The function takes two arguments: `df` and `y`. `df` is a `pandas.DataFrame` and is used for creating the scatter plot. `y` is a NumPy array that we obtained from `fit_statsmodels_linear_regression()`, and we use `y` for plotting the linear relationship (i.e., a straight line with `Distance` on the $x$-axis and `y` on the $y$-axis). For example,

```python
>>> y_pred = fit_statsmodels_linear_regression(data)
>>> ax = plot_statsmodels_linear_regression(data, y_pred)
```

- By default, `seaborn.regplot()` will fit and plot a **seaborn** linear regression model. Use the [fit_reg](https://stanford.edu/~mwaskom/software/seaborn/generated/seaborn.regplot.html) parameter to turn this off.

- The function should return a [matplotlib.axes.Axes](http://matplotlib.org/users/artists.html) instance. Note that `seaborn.regplot()` returns a matplotlib Axes instance, so you can assign the return value of `seaborn.regplot()` to a variable named `ax` and return it.

- Your plot should also have a title and labels for the $x$ and $y$ axes. To do this, use one or more of the following: [ax.set()](http://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes.set), [ax.set_title()](http://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes.set_title), [ax.set_xlabel()](http://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes.set_xlabel), or [ax.set_ylabel()](http://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes.set_ylabel).

- If you are not sure how to do this, there is an example of using `seaborn.regplot()` in the [Introduction to Regression](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week5/notebooks/intro2regress.ipynb) notebook.
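
The hints above can be sketched on a hypothetical toy data set as follows. The names `toy`, `toy_line`, and `toy_ax`, the labels, and the hand-picked line are all assumptions made for illustration. The line is not a fitted model, and this is not meant as the assignment's solution.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical toy data and a hand-picked line (not a fitted model).
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2.1, 3.9, 6.2, 7.8, 10.1]})
toy_line = 2.0 * toy["x"].values

# Start from a fresh Axes so nothing from earlier plots is reused.
_, fresh_ax = plt.subplots()

# fit_reg=False draws only the scatter points; regplot returns the Axes.
toy_ax = sns.regplot(x="x", y="y", data=toy, fit_reg=False, ax=fresh_ax)

# Draw our own line on the same Axes.
toy_ax.plot(toy["x"], toy_line, color="red")

# Set the title and both axis labels in one call.
toy_ax.set(
    title="Toy example",
    xlabel="x (arbitrary units)",
    ylabel="y (arbitrary units)",
)
```

With `fit_reg=False`, the Axes should contain one scatter collection from `regplot()` and one line from `plot()`, which is the structure the test cell below checks for.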

In [None]:
def plot_statsmodels_linear_regression(df, y):
    """
    Plots the following:
    1. A scatter plot of the "Distance" column on the x-axis
       and the "AirTime" column on the y-axis,
    2. A straight line with "Distance" on the x-axis
       and the values of "y" on the y-axis.
              
    Parameters
    ----------
    df: A pandas.DataFrame. Should have columns named "AirTime" and "Distance".
    y: A NumPy array. The y values of the linear regression model.
    
    Returns
    -------
    A matplotlib.axes.Axes object
    """
    
    # YOUR CODE HERE

    return ax

In [None]:
ax = plot_statsmodels_linear_regression(data, y_pred)

In [None]:
assert_is_instance(
    ax, mpl.axes.Axes,
    msg="Your function should return a matplotlib.axes.Axes object."
)
assert_equal(len(ax.lines), 1)
assert_equal(
    len(ax.collections), 1,
    msg="You should turn off Seaborn regression."
)
# Compare by value (!=) rather than by identity.
assert len(ax.title.get_text()) != 0, "Your plot doesn't have a title."
# By default, seaborn labels each axis with the column name.
assert ax.xaxis.get_label_text() != "Distance", \
    "Change the x-axis label to something more descriptive."
assert ax.yaxis.get_label_text() != "AirTime", \
    "Change the y-axis label to something more descriptive."
    
x_scatter, y_scatter = ax.collections[0].get_offsets().T
assert_array_equal(x_scatter, data["Distance"])
assert_array_equal(y_scatter, data["AirTime"])

line = ax.get_lines()[0]
x_line = line.get_xdata()
y_line = line.get_ydata()
assert_array_equal(x_line, data["Distance"])
assert_array_almost_equal(y_line, y_pred)

# If your function can only plot the `data` DataFrame defined above and
# cannot handle other data sets, the following test will fail.
df1 = pd.DataFrame({
    "AirTime": np.random.randint(100, size=100),
    "Distance": np.random.randint(100, size=100)
    })
y_pred1 = fit_statsmodels_linear_regression(df1)
ax1 = plot_statsmodels_linear_regression(df1, y_pred1)
x1data, y1data = ax1.collections[0].get_offsets().T
assert_array_equal(x1data, df1["Distance"].values)
assert_array_equal(y1data, df1["AirTime"].values)

line1 = ax1.get_lines()[0]
x_line1 = line1.get_xdata()
y_line1 = line1.get_ydata()
assert_array_equal(x_line1, df1["Distance"])
assert_array_almost_equal(y_line1, y_pred1)

plt.close()