A few things you should keep in mind when working on assignments:

1. Make sure you fill in any place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. Go to the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run All_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 3. Support Vector Machine.

In this problem, we fit a Support Vector Machine (SVM) model that takes the day of the week and departure delays as input and predicts whether a flight is on time or not.

In [None]:
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from nose.tools import assert_is_instance, assert_equal, assert_almost_equal
from numpy.testing import assert_array_equal, assert_array_almost_equal
from pandas.testing import assert_index_equal

We use the same [airline on-time performance data](http://stat-computing.org/dataexpo/2009/) from the lessons. You can find the descriptions [here](http://stat-computing.org/dataexpo/2009/). We use 4 columns: `DayOfWeek`, `ArrDelay`, `DepDelay`, and `Origin`.

In [None]:
filename = "/home/data_scientist/data/2001.csv"
usecols = (3, 14, 15, 17)
names = ["DayOfWeek", "ArrDelay", "DepDelay", "Origin"]

all_data = pd.read_csv(filename, header=0, na_values=["NA"], usecols=usecols, names=names)

We perform the same data pre-processing as we performed in [Problem 1](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week6/assignments/Problem_1_Logistic_Regression.ipynb) and [Problem 2](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week6/assignments/Problem_2_Nearest_Neighbors.ipynb).

To simplify the computations, we first extract only those flights that depart from Willard airport (CMI). After this, we drop all rows that have missing values ("`NA`") in any of the columns.

We next create a categorical column, _arrival late_, that is zero if the flight arrived 5 minutes or less after the scheduled arrival time, or one if it arrived more than 5 minutes after the scheduled time. We will use this column as the label to train our SVM classifier.

Finally, to save memory, we drop the columns that we no longer need: the origin airport and arrival delay columns.

In [None]:
local = all_data[all_data["Origin"] == "CMI"].dropna()

local["ArrLate"] = (local["ArrDelay"] > 5).astype(int)

local = local.drop(["Origin", "ArrDelay"], axis=1)

Let's print out the first 10 rows of the resulting data frame.

```python
>>> print(local.head(10))
```
```
        DayOfWeek  DepDelay  ArrLate
365879          1      15.0        1
365880          2      -5.0        1
365881          3      52.0        1
365882          4      12.0        0
365883          5       0.0        0
365884          7     152.0        1
365885          1      51.0        1
365886          2       3.0        0
365887          3      -7.0        0
365888          4      14.0        0

```

In [None]:
print(local.head(10))

In the previous problems, we split our data into a training set and a test set by using the `train_test_split()` function in scikit-learn. In this problem, we will use a [validation set](https://en.wikipedia.org/wiki/Test_set#Validation_set) in addition to the training and test sets.

We split `local` into a training set (used for training our model), a validation set (used for determining _hyperparameters_, such as the number of neighbors in a kNN classifier or the kernel to be used in an SVM classifier), and a test set (used for evaluating our model's final performance). We can do this by calling the `train_test_split()` function twice with different `test_size` values to split `local` into training:validation:test = 60:20:20.
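
The 60:20:20 arithmetic can be checked on a small array before applying it to the real data. The following is a minimal sketch, assuming a scikit-learn version that provides `sklearn.model_selection`; the names `X_toy`, `y_toy`, and the other `*_tv`, `*_tr`, `*_va`, `*_te` variables are hypothetical stand-ins, not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples with 2 features each (hypothetical stand-in for `local`).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 2

# First split: hold out 20% of all samples as the test set.
X_tv, X_te, y_tv, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=0
)
# Second split: 25% of the remaining 80% becomes the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=0
)

sizes = (len(X_tr), len(X_va), len(X_te))
print(sizes)  # (60, 20, 20)
```

Because the second `test_size` is a fraction of what is left after the first split, 0.25 (not 0.20) is what yields a validation set equal to 20% of the original data.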

In [None]:
X = local.drop("ArrLate", axis=1)
y = np.ravel(local["ArrLate"])

# training + validation = 80%, test = 20%
X_train_valid, X_test, y_train_valid, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
# training = 80% * 0.75 = 60%, validation = 80% * 0.25 = 20%
X_train, X_valid, y_train, y_valid = train_test_split(X_train_valid, y_train_valid, test_size=0.25, random_state=0)

In [Problem 2](https://github.com/UI-DataScience/accy571-fa16/blob/master/Week6/assignments/Problem_2_Nearest_Neighbors.ipynb), we saw that the columns we want to use for training have different scales, so we scaled each column to the [0, 1] range. For SVM, we will use a different scheme and scale each feature to the [-1, 1] range by dividing it by its largest absolute value.
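
To make the scaling rule concrete, here is a minimal sketch on a hypothetical three-row frame (the values in `toy` are made up for illustration). Each column is divided by the largest absolute value it contains, so every scaled value lies in [-1, 1].

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column names as the assignment data.
toy = pd.DataFrame({
    "DayOfWeek": [1, 4, 7],
    "DepDelay": [-10.0, 20.0, 40.0],
})

# Divide each column by its own largest absolute value.
scaled = toy.apply(lambda col: col / np.max(np.abs(col)))

print(scaled["DepDelay"].tolist())  # [-0.25, 0.5, 1.0]
```

Note that a column reaches -1 only if its largest-magnitude entry is negative; here the minimum of `DepDelay` after scaling is -0.25.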

In [None]:
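def standardize(df):
    """
    Scales each column of a data frame to the [-1, 1] range
    by dividing it by its largest absolute value.
    """
    result = df.apply(lambda x: x / np.max(np.abs(x)))
    
    return result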

In [None]:
X_train_standardized = standardize(X_train)
X_valid_standardized = standardize(X_valid)

In [None]:
print(X_train_standardized.min())

In [None]:
print(X_train_standardized.max())

## Train a Support Vector Machine model

Now that we have standardized the training and validation sets, we are ready to apply the SVM algorithm.

- Write a function named `fit_svm_and_predict()` that fits an SVM classifier using [SVC()](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) in scikit-learn.
- Note that the function takes 5 arguments. **You must use `kernel`**, but it is not necessary that you use all of the other 4 arguments (`X_train`, `X_test`, `y_train`, and `y_test`). You should decide which arguments are needed and which are not.
- If you read the [sklearn.svm.SVC documentation](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), there are many optional parameters that you can use in `SVC()`, e.g., `C`, `kernel`, `gamma`, etc. We will only use the `kernel` parameter in this problem. Use default values for all optional parameters except `kernel`.

In [None]:
def fit_svm_and_predict(X_train, X_test, y_train, y_test, kernel):
    """
    Fits an SVM classifier on the training set.
    Returns the predicted values on the test set.
    
    Parameters
    ---------
    X_train: A pandas data frame. The features of the training set.
    X_test: A pandas data frame. The features of the test set.
    y_train: A numpy array. The labels of the training set.
    y_test: A numpy array. The labels of the test set.
    kernel: A string. Specifies the kernel type to be used in the algorithm.

    Returns
    -------
    A 1-D numpy array. 
    """
    
    # YOUR CODE HERE
    
    return result

In [None]:
y_pred_linear = fit_svm_and_predict(
    X_train_standardized, X_valid_standardized, y_train, y_valid,
    kernel="linear"
)
print("linear kernel accuracy = {0:3.1f} %".format(100.0 * accuracy_score(y_valid, y_pred_linear)))

In [None]:
y_pred_rbf = fit_svm_and_predict(
    X_train_standardized, X_valid_standardized, y_train, y_valid,
    kernel="rbf"
)
print("rbf kernel accuracy = {0:3.1f} %".format(100.0 * accuracy_score(y_valid, y_pred_rbf)))

In [None]:
y_pred_poly = fit_svm_and_predict(
    X_train_standardized, X_valid_standardized, y_train, y_valid,
    kernel="poly"
)
print("poly kernel accuracy = {0:3.1f} %".format(100.0 * accuracy_score(y_valid, y_pred_poly)))

The linear kernel yields the greatest accuracy, so we choose `kernel="linear"` as the optimal model for our data set. Consider what we have done here. We used the validation set to determine the optimal value for a hyperparameter, `kernel="linear"`. We did not use the test set to arrive at this model (a model here simply means a particular set of hyperparameters).
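
The selection procedure described above can be sketched as a loop over candidate kernels. This is only an illustration on synthetic data, assuming a scikit-learn version that provides `sklearn.model_selection`; the data from `make_classification` and the names `X_syn`, `y_syn`, `scores`, and `best_kernel` are hypothetical and do not reproduce the flight data or its accuracies.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic two-feature, two-class data (hypothetical stand-in for the flight data).
X_syn, y_syn = make_classification(
    n_samples=300, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

# Fit one classifier per candidate kernel and score it on the validation set only.
scores = {}
for kernel in ("linear", "rbf", "poly"):
    model = SVC(kernel=kernel)
    model.fit(X_tr, y_tr)
    scores[kernel] = accuracy_score(y_va, model.predict(X_va))

# The hyperparameter value with the highest validation accuracy is selected.
best_kernel = max(scores, key=scores.get)
print(scores, best_kernel)
```

The test set plays no part in this loop, so its accuracy remains an unbiased estimate of how the chosen kernel performs on unseen data.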

Having decided on our model, we can use both the training set and the validation set for training, and then use the test set to evaluate the performance.

```python
>>> accuracy_final = accuracy_score(y_test, y_pred_final)
>>> print("test set accuracy = {0:3.1f} %".format(100.0 * accuracy_final))
```
```
test set accuracy = 87.3 %
```

In [None]:
X_train_valid_standardized = standardize(X_train_valid)
X_test_standardized = standardize(X_test)

y_pred_final = fit_svm_and_predict(
    X_train_valid_standardized, X_test_standardized, y_train_valid, y_test,
    kernel="linear"
)

accuracy_final = accuracy_score(y_test, y_pred_final)
print("test set accuracy = {0:3.1f} %".format(100.0 * accuracy_final))

In [None]:
assert_is_instance(y_pred_final, np.ndarray)
assert_equal(len(y_pred_final), len(y_test))
assert_array_equal(
    np.where(y_pred_final != y_test)[0],
    [  5,   6,  12,  24,  26,  31,  38,  39,  41,  46,  53,  61,  62,
       64,  78,  83, 101, 103, 110, 114, 128, 142, 143, 159, 167, 168,
       196, 208, 213, 219, 221, 229, 233, 236, 250, 251, 252, 280, 297,
       304, 312, 322, 338]
)
assert_almost_equal(accuracy_final, 0.87315634218289084)