A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run All_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 1. Logistic Regression.

In this problem, we fit a logistic regression model that takes the day of the week and the departure delay as input, and predicts whether a flight arrives on time or not.

In [None]:
import numpy as np
import pandas as pd

from patsy import dmatrices
from sklearn.linear_model import LogisticRegression
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from nose.tools import assert_is_instance, assert_equal, assert_almost_equal
from numpy.testing import assert_array_equal, assert_array_almost_equal
from pandas.testing import assert_index_equal

We use the same [airline on-time performance data](http://stat-computing.org/dataexpo/2009/) from the lessons. You can find the descriptions [here](http://stat-computing.org/dataexpo/2009/) and [here](http://stat-computing.org/dataexpo/2009/the-data.html). We use four columns: `DayOfWeek`, `ArrDelay`, `DepDelay`, and `Origin`.

In [None]:
filename = "/home/data_scientist/data/2001.csv"
usecols = (3, 14, 15, 17)
names = ["DayOfWeek", "ArrDelay", "DepDelay", "Origin"]

all_data = pd.read_csv(filename, header=0, na_values=["NA"], usecols=usecols, names=names)

We perform some data pre-processing, similarly to the [Introduction to Logistic Regression](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week6/notebooks/intro2lr.ipynb) notebook.

To simplify the computations, we first extract only those flights that depart from Willard Airport (CMI). After this, we drop all rows that have missing values ("`NA`") in any of the columns.

We next create a categorical column, `ArrLate` (_arrival late_), that is zero if the flight arrived less than or equal to 5 minutes after the scheduled arrival time, or one if it arrived more than 5 minutes after the scheduled time. We will use this column as the target label to train our logistic regressor.

Furthermore, to save memory, we drop the columns that we no longer need: the origin airport and the arrival delay.

Finally, we reset the indices so that the first row corresponds to index 0, the second row to index 1, and so on.

In [None]:
local = all_data[all_data["Origin"] == "CMI"].dropna()

local["ArrLate"] = (local["ArrDelay"] > 5).astype(int)

local = local.drop(["Origin", "ArrDelay"], axis=1)

local = local.reset_index(drop=True)

Let's print the first 10 rows of the resulting data frame and check what it looks like.

```python
>>> print(local.head(10))
```
```
   DayOfWeek  DepDelay  ArrLate
0          1      15.0        1
1          2      -5.0        1
2          3      52.0        1
3          4      12.0        0
4          5       0.0        0
5          7     152.0        1
6          1      51.0        1
7          2       3.0        0
8          3      -7.0        0
9          4      14.0        0
```

In [None]:
print(local.head(10))

## Split the data frame and convert to category variables

We will use the scikit-learn library to perform logistic regression on `DepDelay` and `DayOfWeek` to predict `ArrLate`. To fit a logistic regression model, we need to convert `DayOfWeek` into a categorical variable. Thus, in the following code cell,

- Use the formula interface to construct `X` (a `pandas` DataFrame) and `y` (a numpy array) for use in `sklearn`.
- Use the `dmatrices()` function in the `patsy` library, which supports a formula interface (as we used with the `statsmodels` library).
- Turn `DayOfWeek` into a category variable (by wrapping it in the `C()` notation). Do _not_ turn the `DepDelay` column into a category variable.
- The return value `y` needs to be a one-dimensional array for scikit-learn. (Note: this part of the code is already provided for you.)
- If you are not sure how to do this, there is an example in the [Introduction to Logistic Regression](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week6/notebooks/intro2lr.ipynb) notebook.

In the end, we should have

```python
>>> X, y = convert_df_using_patsy_formula(local)
>>> print(X.head())
```
```
   Intercept  C(DayOfWeek)[T.2]  C(DayOfWeek)[T.3]  C(DayOfWeek)[T.4]  \
0        1.0                0.0                0.0                0.0   
1        1.0                1.0                0.0                0.0   
2        1.0                0.0                1.0                0.0   
3        1.0                0.0                0.0                1.0   
4        1.0                0.0                0.0                0.0   

   C(DayOfWeek)[T.5]  C(DayOfWeek)[T.6]  C(DayOfWeek)[T.7]  DepDelay  
0                0.0                0.0                0.0      15.0  
1                0.0                0.0                0.0      -5.0  
2                0.0                0.0                0.0      52.0  
3                0.0                0.0                0.0      12.0  
4                1.0                0.0                0.0       0.0  
```
```python
>>> print(y)
```
```
[1 1 1 ..., 1 0 1]
```
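The indicator columns in the expected output follow treatment (dummy) coding: the first level (`DayOfWeek` equal to 1) is the reference level and gets no column of its own, while every other level gets a 0/1 column. The sketch below illustrates that layout on a small made-up data frame. It deliberately uses `pd.get_dummies()` from pandas as a stand-in rather than `patsy`, and the names `toy`, `Day`, and `Delay` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame, not the flight data.
toy = pd.DataFrame({"Day": [1, 2, 3, 1], "Delay": [4.0, -2.0, 10.0, 0.0]})

# One 0/1 indicator column per level; drop_first=True removes the
# reference level (Day == 1), as treatment coding does.
indicators = pd.get_dummies(toy["Day"], prefix="Day", drop_first=True, dtype=float)

# Keep the numeric column unchanged and add a constant intercept column.
design = pd.concat([indicators, toy[["Delay"]]], axis=1)
design.insert(0, "Intercept", 1.0)

print(design)
```

A row whose `Day` is 1 has zeros in every indicator column, which is why rows 0 and 4 of the expected `X` above (flights on day 1 and day 5) differ only in the `C(DayOfWeek)[T.5]` column and in `DepDelay`.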

In [None]:
def convert_df_using_patsy_formula(df):
    """
    Uses the patsy formula interface to build the feature matrix X and the label array y.
    
    Parameters
    ---------
    df: A pandas data frame with columns:
        "DayOfWeek", "DepDelay", and "ArrLate".

    Returns
    -------
    X: A pandas data frame with columns:
       "DepDelay", "Intercept", "C(DayOfWeek)[T.2]",
       "C(DayOfWeek)[T.3]", "C(DayOfWeek)[T.4]",
       "C(DayOfWeek)[T.5]", "C(DayOfWeek)[T.6]",
       and "C(DayOfWeek)[T.7]"
    y: A 1-D numpy array. Same as the "ArrLate" column of "df".
    """
    
    y, X = dmatrices(
    # YOUR CODE HERE
    )
    # y needs to be a 1D array for scikit learn
    y = np.ravel(y).astype(int)
    
    return X, y

In [None]:
X, y = convert_df_using_patsy_formula(local)

In [None]:
print(X.head())

In [None]:
assert_is_instance(X, pd.DataFrame)
assert_is_instance(y, np.ndarray)

assert_equal(len(local), len(X))
assert_equal(len(local), len(y))

columns = [
    'C(DayOfWeek)[T.2]', 'C(DayOfWeek)[T.3]', 'C(DayOfWeek)[T.4]',
    'C(DayOfWeek)[T.5]', 'C(DayOfWeek)[T.6]', 'C(DayOfWeek)[T.7]',
    'DepDelay', 'Intercept'
]
assert_equal(set(X.columns), set(columns))
assert_array_almost_equal(local.DepDelay.values, X.DepDelay.values)
assert_array_almost_equal(X.Intercept.values, [1.0] * len(local))
for i in [2, 3, 4, 5, 6, 7]:
    assert_index_equal(
        X[X["C(DayOfWeek)[T.{}]".format(i)] == 1.0].index,
        local[local.DayOfWeek == i].index
    )
assert_array_equal(local.ArrLate.values, y)

To evaluate how well our regressor will perform on new, unseen data, we train on one subset of the data and test the fitted regressor on a separate, held-out subset. So, we split our data into a training sample and a testing sample by using the `train_test_split()` function in scikit-learn. Specifically, in this example, we use 75% of the data for training and 25% of the data for testing. Note that by providing an integer to the optional parameter `random_state`, the `train_test_split()` function becomes deterministic, and we get the same split every time we run it.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
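As a quick check of the two claims above (the 75/25 proportions and the repeatable split), the following sketch splits a small made-up array twice with the same `random_state`. The arrays `features` and `labels` are hypothetical, and the import path `sklearn.model_selection` assumes a current scikit-learn release.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with 2 features each.
features = np.arange(40).reshape(20, 2)
labels = np.arange(20) % 2

# Split twice with the same integer seed.
first = train_test_split(features, labels, test_size=0.25, random_state=0)
second = train_test_split(features, labels, test_size=0.25, random_state=0)

# Every returned piece (train/test features and labels) should match.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))

# Sizes of the training and test feature arrays.
n_train, n_test = len(first[0]), len(first[1])
print(same_split, n_train, n_test)
```

With 20 samples and `test_size=0.25`, 5 samples go to the test set and 15 to the training set.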

## Fit a logistic regression model

In the following code cell, 

- Write a function named `fit_logistic_regression_model()`.

- Use the `LogisticRegression` class in scikit-learn to train a logistic regression model on the training set.

- Use the model to predict `ArrLate` from the `DepDelay` and `DayOfWeek` columns of the test set.

- Finally, return the predicted values as a one-dimensional numpy array.

- Note that `fit_logistic_regression_model()` takes 4 arguments, but it is not necessary that you use all 4 arguments. You should decide which arguments are needed and which are not.

When we use this function on the training and test sets that we created in the previous code cell, we get an accuracy of 88.7 %.

```python
>>> y_pred = fit_logistic_regression_model(X_train, X_test, y_train, y_test)
>>> accuracy = accuracy_score(y_test, y_pred)
>>> print("accuracy = {0:3.1f} %.".format(100.0 * accuracy))
```
```
accuracy = 88.7 %.
```

In [None]:
def fit_logistic_regression_model(X_train, X_test, y_train, y_test):
    """
    Fits a logistic regression model and returns the predicted values of "ArrLate".
    
    Parameters
    ---------
    X_train: A pandas data frame. The features of the training set.
    X_test: A pandas data frame. The features of the test set.
    y_train: A numpy array. The labels of the training set.
    y_test: A numpy array. The labels of the test set.
    
    Returns
    -------
    A 1-D numpy array. Predicted values of "ArrLate".
    """
    
    # YOUR CODE HERE
    
    return result

In [None]:
y_pred = fit_logistic_regression_model(X_train, X_test, y_train, y_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy = {0:3.1f} %.".format(100.0 * accuracy))

In [None]:
assert_is_instance(y_pred, np.ndarray)
assert_equal(len(y_pred), len(X_test))
assert_array_equal(
    np.where(y_pred != y_test)[0],
    [  5,   6,  12,  24,  26,  31,  38,  39,  46,  53,  60,  61,  62,
      64,  78,  83, 103, 110, 114, 128, 142, 156, 159, 167, 196, 205,
     208, 213, 219, 229, 233, 236, 250, 251, 252, 261, 280, 297, 304,
     312, 338, 349, 376, 384, 392, 400, 408, 412]
)
assert_almost_equal(accuracy, 0.8867924528301887)
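The two final assertions are consistent with each other: accuracy is the fraction of test labels predicted correctly, and the list above contains 48 mismatched indices. The sketch below reproduces the arithmetic with numpy. The test-set size of 424 is an assumption inferred from those two numbers (it is not stated in the notebook), and the positions of the wrong predictions are made up.

```python
import numpy as np

n_test = 424   # assumed test-set size, inferred from the expected accuracy
n_wrong = 48   # number of mismatched indices listed in the assertion

# Hypothetical labels: all zeros, with the first 48 predictions flipped.
y_true = np.zeros(n_test, dtype=int)
y_hat = y_true.copy()
y_hat[:n_wrong] = 1

# Accuracy is the mean of the element-wise "prediction matches label" test.
accuracy = np.mean(y_true == y_hat)
print("accuracy = {0:3.1f} %.".format(100.0 * accuracy))
```

Under that assumption, 376 of 424 predictions are correct, which rounds to the 88.7 % quoted earlier.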