# KNN Classification

In the KNN tutorial, we devised a simple classification problem involving daily changes in VIX levels and daily SPY returns.  In particular, we used k-nearest neighbors to classify a given daily return as a gain or a loss by analyzing changes in the VIX from *the same day*.  We found the prediction accuracy to be a little over 80%, which is quite strong.

In this exercise, we extend that analysis to try to predict whether *the following day* will be a gain or a loss by looking at VIX changes from the current day.  As you will see, there is very little predictive power in this methodology.

#### 1) Import the packages that you think you will need.

In [1]:
import pandas as pd




#### 2) Read in the data from `vix_knn.csv` and assign it to a variable called `df_vix`.

In [2]:
# Read the data from 'vix_knn.csv'
df_vix = pd.read_csv('/Users/yuanhanlim/Desktop/DS & ML/05_k_nearest_neighbors/vix_knn.csv')

# Display the first few rows of the dataframe to verify the data
df_vix.head()

Unnamed: 0,trade_date,vix_009,vix_030,vix_090,vix_180,spy_ret
0,2011-01-03,,,,,0.010338
1,2011-01-04,0.02,-0.23,-0.01,-0.21,-0.000551
2,2011-01-05,-0.49,-0.36,-0.56,-0.41,0.005198
3,2011-01-06,0.14,0.38,0.3,0.09,-0.001959
4,2011-01-07,-0.7,-0.26,-0.06,0.05,-0.001962


#### 3) Notice that the first row of `df_vix` contains `NaN` values, so remove the first row.

In [3]:
# Remove the first row since it contains NaN values
df_vix = df_vix.dropna().reset_index(drop=True)

# Display the cleaned dataframe to confirm
df_vix

Unnamed: 0,trade_date,vix_009,vix_030,vix_090,vix_180,spy_ret
0,2011-01-04,0.02,-0.23,-0.01,-0.21,-0.000551
1,2011-01-05,-0.49,-0.36,-0.56,-0.41,0.005198
2,2011-01-06,0.14,0.38,0.30,0.09,-0.001959
3,2011-01-07,-0.70,-0.26,-0.06,0.05,-0.001962
4,2011-01-10,0.80,0.40,0.19,0.01,-0.001259
...,...,...,...,...,...,...
2006,2018-12-24,8.66,5.96,2.61,1.50,-0.026423
2007,2018-12-26,-7.69,-5.66,-3.15,-1.99,0.050525
2008,2018-12-27,-0.83,-0.45,0.20,-0.14,0.007677
2009,2018-12-28,-2.86,-1.62,-0.57,-0.28,-0.001290


#### 4) Add a column to `df_vix` called `spy_label_1`.  These will be the labels that we are trying to predict, and they will be a function of the *next day* return. If it is a loss, the column will contain an 'L'; otherwise it will contain a 'G'.

In [4]:
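# Label a return as a gain ('G') or a loss ('L').
# Note: an exactly-zero return or a NaN falls through both tests and returns None,
# so those rows have a missing label and are removed by the dropna() in step 5.
def label_return(x):
    if x > 0:
        return 'G'
    if x < 0:
        return 'L'

# Shift 'spy_ret' up one row so each label is based on the next day's return
df_vix['spy_label_1'] = df_vix['spy_ret'].shift(-1).apply(label_return)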
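
As a small illustration of how `shift(-1)` lines up each row with the following day's return, here is a minimal sketch on made-up numbers. The `toy` dataframe and the `label_next` helper are hypothetical and are not part of the exercise data; this helper follows the wording of the step and treats anything that is not a loss as 'G', while keeping the missing value in the final row missing.

```python
import pandas as pd

# Hypothetical toy returns, not taken from vix_knn.csv
toy = pd.DataFrame({"spy_ret": [0.01, -0.02, 0.03, -0.01]})

# shift(-1) moves every value up one row, so row i holds the return of row i+1;
# the final row has no "next day" and becomes NaN
toy["next_ret"] = toy["spy_ret"].shift(-1)

def label_next(x):
    # Keep a missing next-day return missing so the row can be dropped later
    if pd.isna(x):
        return None
    return "L" if x < 0 else "G"

toy["spy_label_1"] = toy["next_ret"].apply(label_next)
print(toy)
```

The first row is labeled 'L' because the return on the following row is negative, and the last row has no label at all.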

#### 5) Notice that in the final row of `df_vix`, the `spy_label_1` column contains a `NaN` value.  Remove the final row from `df_vix`.

In [5]:
df_vix = df_vix.dropna(subset=['spy_label_1'])

#### 6) Import the `KNeighborsClassifier` constructor, as well as the `scale` function and the `train_test_split` function.

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

#### 7) Select all four VIX term structure points as your feature set and call it `X`.  Also, isolate the labels you want to predict and call them `y`.

In [7]:

# Select the feature set (VIX term structure points)
X = df_vix[['vix_009', 'vix_030', 'vix_090', 'vix_180']]

# Isolate the labels to predict (spy_label_1)
y = df_vix['spy_label_1']

#### 8) Use the `scale()` function to normalize the feature set; call the normalized features `Xs`.

In [8]:
# Normalize the feature set
Xs = scale(X)
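
To make concrete what `scale()` does, the sketch below applies it to a tiny made-up array (`X_toy` is a placeholder, not the VIX features). Each column is shifted and rescaled to have mean 0 and standard deviation 1, which keeps any one feature from dominating the distance calculation that KNN relies on.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical features on very different scales
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0]])

# Standardize each column: subtract its mean, divide by its standard deviation
Xs_toy = scale(X_toy)

col_means = Xs_toy.mean(axis=0)
col_stds = Xs_toy.std(axis=0)
print(col_means, col_stds)
```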

#### 9) Use `train_test_split()` to generate a training set and a hold out set.  Use the canonical variable names `X_train`, `X_test`, `y_train`, `y_test`.  Set the size of the test set to 20%.

In [9]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=42)

# Display the shapes of the training and testing sets to verify
(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

((1600, 4), (401, 4), (1600,), (401,))
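
The 1600/401 split follows from how `train_test_split` rounds: with 2,001 rows and `test_size=0.2`, the test set receives ceil(0.2 × 2001) = 401 rows and the remaining 1,600 go to training. A minimal sketch on placeholder arrays (`X_demo` and `y_demo` are stand-ins built with `np.arange`, not the VIX data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 2001  # same row count as the shapes shown above

# Placeholder features and labels, used only to check the split sizes
X_demo = np.arange(n_rows * 4).reshape(n_rows, 4)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```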

#### 10) Instantiate a KNN classifier with a hyperparameter of 10, and fit the model to the training set.

In [10]:
# Instantiate the KNN classifier with k=10
knn_classifier = KNeighborsClassifier(n_neighbors=10)

# Fit the model to the training set
knn_classifier.fit(X_train, y_train)

#### 11) Check the in-sample accuracy score of the model.

In [11]:
# Check the in-sample accuracy score of the model
in_sample_accuracy = knn_classifier.score(X_train, y_train)

# Display the in-sample accuracy
in_sample_accuracy


0.631875

#### 12) Check the out-of-sample accuracy score using the test set.

In [12]:
# Check the out-of-sample accuracy score of the model
out_sample_accuracy = knn_classifier.score(X_test, y_test)

# Display the out-of-sample accuracy
out_sample_accuracy

0.5336658354114713
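
To put the two scores in context, it can help to see how KNN behaves on data that contains no signal at all. The sketch below is an illustration on synthetic data only: the features are random noise and the labels are coin flips (`X_noise`, `y_noise`, and `knn_noise` are hypothetical names, and nothing here uses the VIX data). Even so, the in-sample score tends to sit noticeably above 50%, because each training point is its own nearest neighbor and so contributes one of the 10 votes toward its own label. The out-of-sample score has no such advantage and stays near a coin flip.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic data: pure-noise features and coin-flip labels,
# so there is nothing for the model to learn (an assumption for illustration)
X_noise = rng.normal(size=(2000, 4))
y_noise = rng.choice(["G", "L"], size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.2, random_state=42
)

# Same hyperparameter as in the exercise
knn_noise = KNeighborsClassifier(n_neighbors=10).fit(X_tr, y_tr)

in_acc = knn_noise.score(X_tr, y_tr)    # inflated: each point votes for itself
out_acc = knn_noise.score(X_te, y_te)   # honest estimate on unseen rows
print(f"in-sample: {in_acc:.3f}, out-of-sample: {out_acc:.3f}")
```

A gap between in-sample and out-of-sample accuracy of this kind is therefore not evidence of predictive power by itself; the out-of-sample score is the one to compare against 50%.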