A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run all_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 2. Nearest Neighbors.

In this problem, we fit a $k$-nearest neighbors (kNN) model that takes the day of the week and the departure delay as input and predicts whether a flight arrives on time or not.

In [None]:
import numpy as np
import pandas as pd

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from nose.tools import assert_is_instance, assert_equal, assert_almost_equal
from numpy.testing import assert_array_equal, assert_array_almost_equal
from pandas.testing import assert_index_equal

We use the same [airline on-time performance data](http://stat-computing.org/dataexpo/2009/) from the lessons. You can find the descriptions [here](http://stat-computing.org/dataexpo/2009/). We use 4 columns: `DayOfWeek`, `ArrDelay`, `DepDelay`, and `Origin`.

In [None]:
filename = "/home/data_scientist/data/2001.csv"
usecols = (3, 14, 15, 17)
names = ["DayOfWeek", "ArrDelay", "DepDelay", "Origin"]

all_data = pd.read_csv(filename, header=0, na_values=["NA"], usecols=usecols, names=names)

We perform some data pre-processing, similarly to the [Introduction to Logistic Regression](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week6/notebooks/intro2lr.ipynb) notebook.

To simplify the computations, we first extract only those flights that depart from Willard airport (CMI). After this, we drop all rows that have missing values ("`NA`") in any of the columns.

We next create a binary column, _arrival late_ (`ArrLate`), that is one if the flight arrived more than 5 minutes after the scheduled arrival time and zero otherwise. We will use this column as the label when training our kNN classifier.

Furthermore, to save memory, we drop the columns that we no longer need: the origin airport and arrival delay columns.

Finally, we reset the indices so that the first row corresponds to index 0, the second row to index 1, and so on.

In [None]:
local = all_data[all_data["Origin"] == "CMI"].dropna()

local["ArrLate"] = (local["ArrDelay"] > 5).astype(int)

local = local.drop(["Origin", "ArrDelay"], axis=1)

local = local.reset_index(drop=True)

Let's print out the first 10 rows of the resulting data frame.

```python
>>> print(local.head(10))
```
```
   DayOfWeek  DepDelay  ArrLate
0          1      15.0        1
1          2      -5.0        1
2          3      52.0        1
3          4      12.0        0
4          5       0.0        0
5          7     152.0        1
6          1      51.0        1
7          2       3.0        0
8          3      -7.0        0
9          4      14.0        0
```

In [None]:
print(local.head(10))

To evaluate how well our classifier will perform on new, unseen data, we train it on a subset of the data and test it on the remaining, unseen _test_ data. So, we split our data into a _training_ sample and a _testing_ sample by using the `train_test_split` function in scikit-learn. Specifically, in this example, we use 75% of the data for training and 25% of the data for testing.

In [None]:
# Evaluate the model by splitting into train and test sets
X = local.drop("ArrLate", axis=1)
y = np.ravel(local["ArrLate"])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

Note that the two columns we want to use for training have different scales: `DayOfWeek` ranges from 1 to 7, while `DepDelay` ranges from -19 to 224.

In [None]:
print(X_train.min())

In [None]:
print(X_train.max())

Before we apply the machine learning technique, we need to put the features on a common scale. Otherwise, a feature with a large scale will dominate a feature with a smaller scale when the kNN algorithm computes the distance between neighbors. One way to do this is to rescale the range of each feature to $[0, 1]$:

$$x' = \frac{x - \text{min}(x)}{\text{max}(x)-\text{min}(x)}$$

where $x$ is an original value and $x'$ is the normalized value. The `normalize()` function in the following code cell takes a data frame and returns a data frame with all columns rescaled to the range $[0, 1]$.

In [None]:
def normalize(df):
    
    result = df.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))
    
    return result
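As a quick sanity check of the formula, here is a minimal sketch that applies the same column-wise rescaling to a small toy data frame. The values in `toy` are made up for illustration and are not taken from the airline data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen so the result is easy to verify by hand.
toy = pd.DataFrame({
    "DayOfWeek": [1, 4, 7],
    "DepDelay": [-10.0, 0.0, 30.0],
})

# Same column-wise min-max formula that normalize() uses.
toy_scaled = toy.apply(lambda x: (x - np.min(x)) / (np.max(x) - np.min(x)))

print(toy_scaled)
```

Each column is rescaled independently, so the smallest value in every column maps to 0 and the largest maps to 1.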

In [None]:
X_train_normalized = normalize(X_train)
X_test_normalized = normalize(X_test)

In [None]:
print(X_train_normalized.min())

In [None]:
print(X_train_normalized.max())
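To see why rescaling matters for a distance-based method, the following sketch compares Euclidean distances between three hypothetical flights before and after rescaling. The flights are invented for illustration, and the ranges used for rescaling are the approximate ranges quoted earlier (`DayOfWeek` from 1 to 7, `DepDelay` from -19 to 224).

```python
import numpy as np

# Hypothetical flights as (DayOfWeek, DepDelay); illustrative values only.
a = np.array([1.0, 0.0])    # Monday, on time
b = np.array([7.0, 0.0])    # Sunday, on time
c = np.array([1.0, 20.0])   # Monday, 20 minutes late

# Raw distances: a 20-minute delay outweighs a six-day difference.
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)

# Assumed feature ranges, taken from the text above.
lo = np.array([1.0, -19.0])
hi = np.array([7.0, 224.0])

def rescale(v):
    """Min-max rescale a (DayOfWeek, DepDelay) pair to [0, 1]."""
    return (v - lo) / (hi - lo)

# Rescaled distances: both features now contribute on a comparable scale.
scaled_ab = np.linalg.norm(rescale(a) - rescale(b))
scaled_ac = np.linalg.norm(rescale(a) - rescale(c))

print(raw_ab, raw_ac)
print(scaled_ab, scaled_ac)
```

Before rescaling, flight `c` is farther from `a` than flight `b` is. After rescaling, the order is reversed, which changes which points count as nearest neighbors.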

## Train a k-Nearest Neighbors model

Now that we have normalized the training and test sets, we are finally ready to apply the $k$-Nearest Neighbors algorithm.

- Write a function named `fit_knn_and_predict()` that fits a $k$-Nearest Neighbors model using [KNeighborsClassifier()](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) in scikit-learn.
- Note that the function takes 4 arguments, but it is not necessary that you use all 4 arguments in the function. You should decide which arguments are needed and which are not.
- If you read the [sklearn.neighbors.KNeighborsClassifier documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) carefully, you will see that there are many optional parameters that you can use in `KNeighborsClassifier()`, e.g., `n_neighbors`, `weights`, etc. Use default values for all optional parameters. That is, do not set any optional parameters.
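For intuition about what the classifier does, here is a minimal from-scratch sketch of majority-vote kNN in NumPy. It is not the scikit-learn implementation and not the solution to this problem. The helper `knn_predict_sketch` and the toy data are hypothetical, and ties are broken in favor of the smallest label, which is an assumption of this sketch.

```python
import numpy as np

def knn_predict_sketch(train_X, train_y, query_X, k=5):
    """Hypothetical helper: majority-vote kNN with Euclidean distance."""
    train_X = np.asarray(train_X, dtype=float)
    train_y = np.asarray(train_y)
    query_X = np.asarray(query_X, dtype=float)

    predictions = []
    for q in query_X:
        # Euclidean distance from the query point to every training point.
        distances = np.sqrt(((train_X - q) ** 2).sum(axis=1))
        # Indices of the k closest training points.
        nearest = np.argsort(distances)[:k]
        # Majority vote among the labels of those neighbors.
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)

# Made-up toy data: two well-separated clusters.
toy_X = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.0],
         [0.9, 1.0], [1.0, 0.9], [0.8, 0.8]]
toy_y = [0, 0, 0, 1, 1, 1]

toy_pred = knn_predict_sketch(toy_X, toy_y, [[0.05, 0.05], [0.95, 0.95]], k=3)
print(toy_pred)
```

With its default settings, `KNeighborsClassifier()` uses 5 neighbors, uniform weights, and Euclidean distance, so the sketch's default of `k=5` mirrors that choice.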

In [None]:
def fit_knn_and_predict(X_train, X_test, y_train, y_test):
    """
    Fits a kNN classifier on the training data.
    Returns the predicted values on the test data.
    
    Parameters
    ---------
    X_train: A pandas data frame. The features of the training set.
    X_test: A pandas data frame. The features of the test set.
    y_train: A numpy array. The labels of the training set.
    y_test: A numpy array. The labels of the test set.
    
    Returns
    -------
    A 1-D numpy array. 
    """
    
    # YOUR CODE HERE
    
    return result

In [None]:
y_pred = fit_knn_and_predict(X_train_normalized, X_test_normalized, y_train, y_test)
accuracy = accuracy_score(y_test, y_pred)
print("accuracy = {0:3.1f} %".format(100.0 * accuracy))
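The accuracy reported above is the fraction of test labels that the model predicts correctly. A minimal sketch with made-up labels (the `_toy` arrays are hypothetical) shows the same computation by hand, along with the indices of the misclassified samples.

```python
import numpy as np

# Hypothetical true and predicted labels, for illustration only.
y_true_toy = np.array([1, 0, 1, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 1])

# Accuracy is the mean of the element-wise matches.
accuracy_toy = np.mean(y_true_toy == y_pred_toy)

# Positions where the prediction disagrees with the true label.
wrong_toy = np.where(y_pred_toy != y_true_toy)[0]

print("accuracy = {0:3.1f} %".format(100.0 * accuracy_toy))
print(wrong_toy)
```

Three of the five toy predictions match, so the accuracy is 0.6, and the mismatches are at positions 2 and 4.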

In [None]:
assert_is_instance(y_pred, np.ndarray)
assert_equal(len(y_pred), len(y_test))
assert_array_equal(
    np.where(y_pred != y_test)[0],
    [  6,  12,  24,  25,  26,  27,  29,  31,  35,  37,  38,  39,  41,
        46,  48,  61,  62,  64,  83,  86,  93,  99, 101, 108, 110, 114,
       117, 128, 142, 143, 159, 162, 167, 168, 170, 172, 181, 185, 189,
       194, 196, 207, 208, 213, 219, 221, 229, 233, 236, 237, 239, 240,
       242, 246, 247, 250, 251, 252, 254, 271, 280, 297, 304, 312, 317,
       319, 322, 344, 345, 349, 362, 376, 383, 387, 392, 400, 405, 409,
       412, 415, 417]
)
assert_almost_equal(accuracy, 0.80896226415094341)