# 4.6.3 Linear Discriminant Analysis

Load modules and data

In [165]:
from scipy import stats
import pandas as pd
import seaborn as sns
import scipy as sp
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf
%matplotlib inline
plt.style.use('seaborn-white')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8-white'
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

Smarket = pd.read_csv('Data/Smarket.csv', usecols=range(1, 10), parse_dates=True)

Now we will perform LDA on the Smarket data. In Python, we can fit an LDA model using the LinearDiscriminantAnalysis class, which is part of the sklearn library. We fit the model on the observations before 2005 only, using Lag1 and Lag2 as predictors.

In [146]:
# Training set: every observation before 2005
train = Smarket.Year < 2005
x_train = Smarket.loc[train, ['Lag1', 'Lag2']]
y_train = Smarket.loc[train, 'Direction']

lda = LinearDiscriminantAnalysis(solver='svd')
lda.fit(x_train, y_train);

### Prior probabilities of groups:

In [145]:
print("Down: %f" % lda.priors_[0])
print("Up: %f" % lda.priors_[1])

Down: 0.491984
Up: 0.508016


The LDA output indicates prior probabilities of ${\hat{\pi}}_1 = 0.492$ and ${\hat{\pi}}_2 = 0.508$; in other words,
49.2% of the training observations correspond to days during which the
market went down.
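When no priors are passed in, the priors that LinearDiscriminantAnalysis reports are the class proportions in the training labels. A minimal sketch of that computation with NumPy, using a small made-up label vector (hypothetical, not the Smarket data):

```python
import numpy as np

# Hypothetical training labels; the real ones come from Smarket['Direction'].
y = np.array(['Down', 'Up', 'Up', 'Down', 'Up', 'Up', 'Down', 'Up'])

# Count each class and divide by the number of observations.
classes, counts = np.unique(y, return_counts=True)
priors = counts / counts.sum()

for name, prior in zip(classes, priors):
    print("%s: %f" % (name, prior))
```

`np.unique` returns the classes in sorted order, which is why 'Down' comes before 'Up' here, as it does in `lda.priors_`.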

### Group means:

The group means give the average of each predictor within each class, and are used by LDA as estimates of $\mu_k$. These suggest that there is a tendency for the previous 2 days’ returns to be negative on days when the market increases, and a tendency for the previous 2 days’ returns to be positive on days when the market declines.

In [147]:
pd.DataFrame(lda.means_,['Down', 'Up'],['Lag1','Lag2'])

           Lag1       Lag2
Down    0.04279   0.033894
Up    -0.039546  -0.031325
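The group means are plain per-class column averages. As a rough sketch of what is being estimated, here is the same computation by hand on a few made-up rows (the array `X` and labels `y` are hypothetical, not the Smarket data):

```python
import numpy as np

# Hypothetical predictors (columns: Lag1, Lag2) and labels, for illustration only.
X = np.array([[ 0.4, -0.2],
              [-0.6,  0.1],
              [ 0.2,  0.5],
              [-0.3, -0.4],
              [ 0.1,  0.3],
              [-0.5,  0.2]])
y = np.array(['Down', 'Up', 'Down', 'Up', 'Down', 'Up'])

# One row of means per class, in sorted class order.
classes = np.unique(y)
means = np.vstack([X[y == c].mean(axis=0) for c in classes])

print(classes)
print(means)
```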


In [148]:
lda.coef_

array([[-0.05544078, -0.0443452 ]])

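These are not the same coefficients as in the book. A likely explanation, stated here as an assumption rather than something checked in this notebook: `coef_` holds the weights of scikit-learn's linear decision function, whereas R's `lda()` reports the coefficients of linear discriminants, which point in the same direction but are scaled differently. The closer scikit-learn counterpart is probably `lda.scalings_`.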

The coefficients output gives the linear combination of Lag1 and Lag2 that is used to form the LDA decision rule.
If $-0.0554 \cdot \text{Lag1} - 0.0443 \cdot \text{Lag2}$ is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will
predict a market decline.
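To make the decision rule concrete, here is a from-scratch sketch of two-class LDA on synthetic data. It assumes the textbook formulas (class means plus a pooled within-class covariance) and is not taken from scikit-learn's implementation; the data, the names `w`, `b`, `v` and the helper `lda_predict` are all hypothetical. It also shows that the same direction can be reported at two different scales without changing which side of the boundary a point falls on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data with a shared covariance (hypothetical, not Smarket).
cov = np.array([[1.5, 0.3],
                [0.3, 1.2]])
X_down = rng.multivariate_normal([0.5, 0.4], cov, size=200)
X_up = rng.multivariate_normal([-0.5, -0.4], cov, size=300)

n_down, n_up = len(X_down), len(X_up)
mu_down, mu_up = X_down.mean(axis=0), X_up.mean(axis=0)

# Pooled within-class covariance, with n - K in the denominator (K = 2 classes).
resid = np.vstack([X_down - mu_down, X_up - mu_up])
sigma = resid.T @ resid / (n_down + n_up - 2)

# Weights and intercept of the linear score for "Up" versus "Down".
w = np.linalg.solve(sigma, mu_up - mu_down)
b = -0.5 * (mu_up + mu_down) @ w + np.log(n_up / n_down)

# Same direction, rescaled so the discriminant has unit within-class variance.
v = w / np.sqrt(w @ sigma @ w)

def lda_predict(X):
    """Return 'Up' where the linear score is positive, 'Down' otherwise."""
    return np.where(X @ w + b > 0, 'Up', 'Down')

print("w =", w)
print("v =", v)
print(lda_predict(np.array([[-1.0, -1.0], [1.0, 1.0]])))
```

Here `w` and `v` differ only by a positive scale factor, so a large value of either linear combination points toward the same class.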

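The predict() method returns an array of LDA’s predictions about the movement of the market on the test data (the 2005 observations). The confusion matrix below is transposed, so rows are predicted classes and columns are true classes: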

In [186]:
# Test set: the 2005 observations
test = Smarket.Year >= 2005
x_test = Smarket.loc[test, ['Lag1', 'Lag2']]
y_test = Smarket.loc[test, 'Direction']
predict = lda.predict(x_test)
# Transposed so that rows are predictions and columns are the true classes
pd.DataFrame(confusion_matrix(y_test, predict).T, ['Down', 'Up'], ['Down', 'Up'])

      Down   Up
Down    35   35
Up      76  106


Compared with Section 4.6.2, the LDA and logistic regression predictions are almost identical.

Fraction of test days predicted correctly (the overall accuracy):

In [173]:
(35+106.0)/(35+35+76+106)

0.5595238095238095
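The same number can be read off the confusion matrix without typing the counts by hand: correct predictions lie on the diagonal. A small sketch using the counts from the table above:

```python
import numpy as np

# Confusion matrix from above: rows are predictions, columns are true classes.
cm = np.array([[35, 35],
               [76, 106]])

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(accuracy)
```

In the notebook itself, `(predict == y_test).mean()` should give the same value.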

In [177]:
print(classification_report(y_test, predict, digits=3))

             precision    recall  f1-score   support

       Down      0.500     0.315     0.387       111
         Up      0.582     0.752     0.656       141

avg / total      0.546     0.560     0.538       252
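The per-class figures in the report follow from the confusion matrix shown earlier. As a check, this sketch recomputes precision, recall and F1 from those four counts (rows are predictions, columns are true classes, matching the transposed table):

```python
import numpy as np

labels = ['Down', 'Up']
cm = np.array([[35, 35],     # predicted Down: 35 truly Down, 35 truly Up
               [76, 106]])   # predicted Up:   76 truly Down, 106 truly Up

tp = np.diag(cm)                   # correct predictions per class
precision = tp / cm.sum(axis=1)    # divide by the number predicted as the class
recall = tp / cm.sum(axis=0)       # divide by the number truly in the class
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=0)

for i, name in enumerate(labels):
    print("%-5s precision=%.3f recall=%.3f f1=%.3f support=%d"
          % (name, precision[i], recall[i], f1[i], support[i]))
```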



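Applying a 50% threshold to the posterior probabilities allows us to recreate
the predictions made by predict(): 70 days are classified as Down and 182 as Up, matching the row totals of the confusion matrix.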

In [182]:
pred_p = lda.predict_proba(x_test)

In [235]:
print(sum(pred_p[:,0]>=0.5))
print(sum(pred_p[:,0]<0.5))

70
182


Notice that the first column of the posterior probabilities returned by predict_proba() (index 0) corresponds to
the probability that the market will $\underline{\text{decrease}}$:

In [223]:
pred_p[0:20,0].T

array([ 0.49017925,  0.4792185 ,  0.46681848,  0.47400107,  0.49278766,
        0.49385615,  0.49510156,  0.4872861 ,  0.49070135,  0.48440262,
        0.49069628,  0.51199885,  0.48951523,  0.47067612,  0.47445929,
        0.47995834,  0.49357753,  0.50308938,  0.49788061,  0.48863309])

In [224]:
predict[0:20]

array(['Up', 'Up', 'Up', 'Up', 'Up', 'Up', 'Up', 'Up', 'Up', 'Up', 'Up',
       'Down', 'Up', 'Up', 'Up', 'Up', 'Up', 'Down', 'Up', 'Up'], 
      dtype='|S4')
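The link between the probabilities and the labels can be checked directly. This sketch copies the 20 posterior probabilities of "Down" printed above and applies the 50% rule; the names `p_down`, `threshold` and `labels` are introduced here for illustration only:

```python
import numpy as np

# First 20 posterior probabilities of "Down", copied from the output above.
p_down = np.array([0.49017925, 0.4792185 , 0.46681848, 0.47400107, 0.49278766,
                   0.49385615, 0.49510156, 0.4872861 , 0.49070135, 0.48440262,
                   0.49069628, 0.51199885, 0.48951523, 0.47067612, 0.47445929,
                   0.47995834, 0.49357753, 0.50308938, 0.49788061, 0.48863309])

# Predict "Down" when its posterior reaches the threshold, otherwise "Up".
threshold = 0.5
labels = np.where(p_down >= threshold, 'Down', 'Up')

print(labels)
print("Down at positions:", np.flatnonzero(labels == 'Down'))
```

Only the two days whose probability of "Down" exceeds 0.5 (positions 11 and 17, counting from 0) are labelled 'Down'. Changing `threshold` gives the rule for any other cutoff.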

If we wanted to use a posterior probability threshold other than 50% in
order to make predictions, then we could easily do so. For instance, suppose
that we wish to predict a market decrease only if we are very certain that the
market will indeed decrease on that day—say, if the posterior probability
is at least 90%:

In [236]:
print(sum(pred_p[:,0]>=0.9))  # "at least 90%"

0


No days in 2005 meet that threshold! In fact, the greatest posterior probability
of decrease in all of 2005 was 52.02%:

In [237]:
max(pred_p[:,0])

0.52023495053561553