# Lab 4 - Classification

In this lab we will use linear discriminant analysis (LDA) and k-nearest neighbours (kNN) for classification, and evaluate the models using accuracy and confusion matrices.

- [Load Datasets](#Load-Datasets)
- [Linear Discriminant Analysis (LDA)](#lda)
- [k-Nearest Neighbours (kNN)](#knn)
- [Lab Assignment](#Lab-Assignment)


In [4]:
# %load ../standard_import.txt
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

%matplotlib inline
# The style was named 'seaborn-white' before matplotlib 3.6; use that name on older versions.
plt.style.use('seaborn-v0_8-white')

## Load Datasets
First we will work with some historical stock market data. The following command assumes that the CSV files are stored in the notebook directory. Modify the path if your datasets are stored elsewhere.

In [6]:
smarket = pd.read_csv('Smarket.csv')
smarket.head()

|   | Unnamed: 0 | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2001 | 0.381 | -0.192 | -2.624 | -1.055 | 5.01 | 1.1913 | 0.959 | Up |
| 1 | 2 | 2001 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 | 1.032 | Up |
| 2 | 3 | 2001 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 | -0.623 | Down |
| 3 | 4 | 2001 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.276 | 0.614 | Up |
| 4 | 5 | 2001 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.2057 | 0.213 | Up |
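
The `Unnamed: 0` column in the output above looks like a row index that was saved into the CSV. A minimal sketch, assuming the first column of your copy of `Smarket.csv` is such a saved index, is to pass `index_col=0` to `pd.read_csv`. The inline sample below is a stand-in for the file and reuses the first rows shown above; `sample_csv` and `smarket_sample` are illustrative names.

```python
import io
import pandas as pd

# Stand-in for Smarket.csv: the first rows shown above, with the saved
# row index as an unnamed first column (an assumption about the file layout).
sample_csv = io.StringIO(
    ",Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction\n"
    "1,2001,0.381,-0.192,-2.624,-1.055,5.01,1.1913,0.959,Up\n"
    "2,2001,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up\n"
    "3,2001,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down\n"
)

# index_col=0 uses the first column as the index instead of
# creating an 'Unnamed: 0' data column.
smarket_sample = pd.read_csv(sample_csv, index_col=0)
print(smarket_sample.columns.tolist())
```

The rest of the lab selects columns by name, so it works either way; this only keeps the extra column out of the table.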


## Linear Discriminant Analysis
<a id='lda'></a>

First let's try using LDA to predict the direction of the stock market (Up or Down). We will use the features Lag1, Lag2, and Volume.

For the training set, we'll use all of the data from before 2005.

### Training Data

In [7]:
smark_train = smarket[smarket['Year'] < 2005]

X_train = smark_train[['Lag1', 'Lag2', 'Volume']]
y_train = smark_train['Direction']

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)

### Testing Data

We will use the data from 2005 as our test data. 

After making predictions on the test data, we will look at the confusion matrix, and calculate accuracy.

In [8]:
smark_test = smarket[smarket['Year'] == 2005]

X_test = smark_test[['Lag1', 'Lag2', 'Volume']]
y_test = smark_test['Direction']

preds = lda.predict(X_test)

conf = confusion_matrix(y_test, preds, labels=lda.classes_)
print('Confusion matrix:\n')
print(lda.classes_)
print(conf)
acc = accuracy_score(y_test, preds)
print('\nThe accuracy is: ', acc)

Confusion matrix:

['Down' 'Up']
[[ 79  32]
 [100  41]]

The accuracy is:  0.47619047619047616
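
To make the link between the two outputs explicit, the sketch below recomputes the accuracy from the confusion matrix printed above. It also derives two quantities the lab does not print, added here only for illustration: the recall for each class and the accuracy of always predicting the most common test class.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are
# predicted classes, both in the order ['Down', 'Up'].
conf = np.array([[79, 32],
                 [100, 41]])

total = conf.sum()                                   # number of test days
accuracy = np.trace(conf) / total                    # correct predictions lie on the diagonal
recall_per_class = np.diag(conf) / conf.sum(axis=1)  # fraction of each true class recovered
majority_baseline = conf.sum(axis=1).max() / total   # always predict the most common class

print(f"accuracy          = {accuracy:.4f}")
print(f"recall (Down, Up) = {recall_per_class.round(4)}")
print(f"majority baseline = {majority_baseline:.4f}")
```

The diagonal gives (79 + 41) / 252, which matches the accuracy printed by `accuracy_score`. Because 141 of the 252 test days are `Up`, always predicting `Up` would score higher on this test set than the fitted model does.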


## k-Nearest Neighbours
<a id='knn'></a>

Now let's try the same steps, but with kNN.

### Training

In [9]:
knn = KNeighborsClassifier(5)
knn.fit(X_train, y_train)

### Testing and Evaluation

In [10]:
knn_preds = knn.predict(X_test)

conf = confusion_matrix(y_test, knn_preds, labels=knn.classes_)
print('Confusion matrix:\n')
print(knn.classes_)
print(conf)
acc = accuracy_score(y_test, knn_preds)
print('\nThe accuracy is: ', acc)

Confusion matrix:

['Down' 'Up']
[[52 59]
 [72 69]]

The accuracy is:  0.4801587301587302
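
The LDA and kNN evaluation cells repeat the same steps, so they could be wrapped in a small helper. The sketch below is one possible way to do that; `evaluate` is a hypothetical name, and the data is a synthetic stand-in (not real market data) so that the block runs on its own.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier


def evaluate(model, X_test, y_test):
    """Return (confusion matrix, accuracy) for a fitted classifier."""
    preds = model.predict(X_test)
    conf = confusion_matrix(y_test, preds, labels=model.classes_)
    return conf, accuracy_score(y_test, preds)


# Small synthetic stand-in for the Smarket features (not real market data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.where(X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0, "Up", "Down")

# Fit on the first 150 rows and evaluate on the remaining 50.
demo_knn = KNeighborsClassifier(n_neighbors=5).fit(X_demo[:150], y_demo[:150])
demo_conf, demo_acc = evaluate(demo_knn, X_demo[150:], y_demo[150:])
print(demo_knn.classes_)
print(demo_conf)
print("accuracy:", demo_acc)
```

With the models fitted earlier, the same helper could be called as `evaluate(lda, X_test, y_test)` or `evaluate(knn, X_test, y_test)`.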


## Lab Assignment

Now you will carry out the same basic steps on a dataset containing information about US colleges. Carry out the following steps:

* Read in the College dataset.
* Use the first 400 rows as training data, and the remainder (377 rows) as test data.
* Use Private (indicating whether a university is private) as your outcome variable, and Enroll (the number of enrolled students) and Outstate (the out-of-state tuition) as your features.
* Train a kNN model with k=5 and make predictions on the test set. 
* Show the accuracy and the confusion matrix.
* Try a few different values of k and see how the accuracy changes.
* Train an LDA model and make predictions on the test set.
* Show the accuracy and confusion matrix for the LDA predictions.
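
Two of the steps above, splitting by row position and trying several values of k, can be sketched as follows. This is only an illustration on synthetic data: the column names `x1`, `x2`, and `label`, the 60/40 split, and the list of k values are all assumptions, not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the column names x1, x2, label are hypothetical.
rng = np.random.default_rng(1)
n = 100
demo = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
demo["label"] = np.where(demo["x1"] + demo["x2"] > 0, "Yes", "No")

# Positional split: first 60 rows for training, the remainder for testing.
train, test = demo.iloc[:60], demo.iloc[60:]
features = ["x1", "x2"]

# Fit one model per k and record its test accuracy.
accuracy_by_k = {}
for k in (1, 3, 5, 11, 21):
    model = KNeighborsClassifier(n_neighbors=k).fit(train[features], train["label"])
    accuracy_by_k[k] = accuracy_score(test["label"], model.predict(test[features]))

for k, acc in accuracy_by_k.items():
    print(f"k={k:>2}: accuracy={acc:.3f}")
```

`iloc` slices by position rather than by index label, which is what "the first 400 rows" calls for regardless of how the DataFrame is indexed.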


You will now carry out the same basic steps as the previous lab, on the College dataset, but this time using logistic regression. Carry out the following steps:

* Read in the College dataset.
* Use the first 400 rows as training data, and the remainder (377 rows) as test data.
* Use Private (indicating whether a university is private) as your outcome variable, and Enroll (the number of enrolled students) and Outstate (the out-of-state tuition) as your features.
* Train a logistic regression model and make predictions on the test set. 
* Show the accuracy and the confusion matrix on the test set.
* Show precision and recall on the test set when the probability threshold is 0.5.
* Show precision and recall on the test set when the probability threshold is 0.45. 
* Show the AUC score on the test set. 
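
The threshold and AUC steps above rely on predicted probabilities rather than predicted labels. The following is a minimal sketch of that mechanism on synthetic data with 0/1 labels; the data, the variable names, and the 200/100 split are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Synthetic binary data standing in for the real features (hypothetical).
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.8, size=300) > 0).astype(int)
X_tr, X_te, y_tr, y_te = X[:200], X[200:], y[:200], y[200:]

clf = LogisticRegression().fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of clf.classes_[1], here class 1.
proba = clf.predict_proba(X_te)[:, 1]

# Turn probabilities into labels at each threshold, then score them.
results = {}
for threshold in (0.5, 0.45):
    preds = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_te, preds), recall_score(y_te, preds))
    print(f"threshold={threshold}: precision={results[threshold][0]:.3f}, "
          f"recall={results[threshold][1]:.3f}")

# AUC is computed from the probabilities, not from thresholded labels.
auc = roc_auc_score(y_te, proba)
print(f"AUC={auc:.3f}")
```

Lowering the threshold can only add positive predictions, so recall cannot decrease, while precision may move in either direction. If the outcome column holds strings rather than 0/1, `precision_score` and `recall_score` need the `pos_label` argument set to the class treated as positive.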