# AccelerateAI - LDA and QDA

We will use the Stock Market (Smarket) dataset as an example to understand the fundamentals of LDA.

* This dataset consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005.
* For each date, it records the percentage returns for each of the 5 previous trading days, ```Lag1``` through ```Lag5```.
* ```Volume``` is the number of shares traded on the previous day, in billions.
* ```Today``` is the percentage return on the date in question.
* ```Direction``` indicates whether the market was Up or Down on this date.
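
To make the column layout concrete, here is a minimal sketch of how lag columns of this kind can be derived from a single return series with pandas. The returns below are synthetic random numbers, not Smarket data, and the rule "Up when ```Today``` is positive" is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily percentage returns (illustrative only, not the Smarket data).
rng = np.random.default_rng(0)
returns = pd.Series(rng.normal(0, 1, size=10).round(3), name="Today")

frame = pd.DataFrame({"Today": returns})
for k in range(1, 6):
    # Lag k is the return observed k trading days earlier.
    frame[f"Lag{k}"] = returns.shift(k)

# Assumption: a day counts as "Up" when its return is positive.
frame["Direction"] = np.where(frame["Today"] > 0, "Up", "Down")

# The first 5 days have incomplete lags, so they are dropped.
frame = frame.dropna().reset_index(drop=True)
print(frame)
```

Each remaining row pairs one day's return with the five returns that preceded it, which is the shape of the table loaded below.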

We will also fit QDA (Quadratic Discriminant Analysis) on the same dataset.

## Linear Discriminant Analysis

### 1. Load Libraries and Import Dataset

In [1]:
import pandas as pd
import numpy as np

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.metrics import confusion_matrix, classification_report, precision_score

%matplotlib inline

In [2]:
df = pd.read_csv('./smarket.csv', usecols=range(1,10), index_col=0, parse_dates=True)

df.sample(6)

Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
2003-01-01,-0.155,0.091,2.238,-0.991,0.103,1.4889,-1.524,Down
2003-01-01,0.184,-1.411,0.1,-1.524,-0.155,1.3883,-0.827,Down
2002-01-01,0.775,0.82,1.716,-0.556,0.97,1.3541,0.914,Up
2001-01-01,-1.627,-1.637,-0.343,0.605,-0.554,1.1987,1.608,Up
2001-01-01,1.035,-1.825,-0.684,0.615,1.171,1.3757,-0.066,Down
2001-01-01,0.812,-1.998,4.368,-0.29,-3.439,1.0628,2.707,Up


In [3]:
df.shape

(1250, 8)

### 2. Split Data and Fit the Model

We will perform LDA on the Smarket data. We can fit an LDA model using the ```LinearDiscriminantAnalysis``` class, which is part of the ```discriminant_analysis``` module of scikit-learn. We'll fit the model using only the observations before 2005, and then test the model on the data from 2005.

We consider two predictors here, ```Lag1``` and ```Lag2```.

In [4]:
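# Train on observations up to the end of 2004; test on 2005 onwards.
X_train = df.loc[:'2004', ['Lag1', 'Lag2']]
y_train = df.loc[:'2004', 'Direction']

X_test = df.loc['2005':, ['Lag1', 'Lag2']]
y_test = df.loc['2005':, 'Direction']

# Fit LDA with default settings (priors are estimated from the training data).
lda = LinearDiscriminantAnalysis()
model = lda.fit(X_train, y_train)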


### 3. Interpretation

In [5]:
print(model.priors_)

[0.49198397 0.50801603]


* The LDA output indicates prior probabilities of ```π-hat_1=0.492``` and ```π-hat_2=0.508```, listed in the order ```Down```, ```Up```.
* This indicates that 49.2% of the training observations correspond to days during which the market went down, and 50.8% to days during which it went up.
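
With ```priors=None```, the estimated priors are simply the class proportions in the training labels. The sketch below illustrates this. The label counts (491 Down, 507 Up out of 998) are reconstructed from the printed priors and the 1250 − 252 training rows rather than read from the file, and the predictors are random placeholders, so treat it as an illustration rather than a rerun of the notebook.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumed label counts, reconstructed from the printed priors (not read from smarket.csv).
y_demo = np.array(["Down"] * 491 + ["Up"] * 507)

# Placeholder predictors: the priors do not depend on them.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(y_demo.size, 2))

lda_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)

# Class proportions computed directly from the labels.
classes, counts = np.unique(y_demo, return_counts=True)
proportions = counts / counts.sum()

print(classes)            # ['Down' 'Up']
print(proportions)
print(lda_demo.priors_)
```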

In [6]:
print(model.means_)

[[ 0.04279022  0.03389409]
 [-0.03954635 -0.03132544]]


* The output above gives the group means: the average of each predictor within each class (rows ordered ```Down```, ```Up```; columns ```Lag1```, ```Lag2```). LDA uses them as estimates of ```μ_k```.
* These suggest that the previous two days' returns tend to be negative on days when the market increases, and positive on days when the market declines.
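
The group means can be cross-checked with a plain pandas ```groupby```. The following is a small sketch on synthetic data (random predictors and labels, not the Smarket values), assuming the same column names as in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic stand-in for the training data (illustrative only).
rng = np.random.default_rng(1)
n = 200
X_demo = pd.DataFrame(rng.normal(size=(n, 2)), columns=["Lag1", "Lag2"])
y_demo = pd.Series(rng.choice(["Down", "Up"], size=n), name="Direction")

fitted = LinearDiscriminantAnalysis().fit(X_demo, y_demo)

# Per-class average of each predictor; groupby sorts the labels (Down, Up),
# which matches the order of fitted.classes_.
group_means = X_demo.groupby(y_demo).mean()

print(group_means)
print(fitted.means_)
```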

In [7]:
print(model.coef_)

[[-0.05544078 -0.0443452 ]]


* ```coef_``` holds the weights of the linear combination of ```Lag1``` and ```Lag2``` that forms the LDA decision rule.

* If ```−0.0554×Lag1−0.0443×Lag2``` is large, then the LDA classifier will predict a market increase, and if it is small, then the LDA classifier will predict a market decline. More precisely, scikit-learn adds ```intercept_``` to this combination and predicts ```Up``` (the second class) when the result is positive.
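
As a worked example of this rule, the sketch below scores a few hypothetical (```Lag1```, ```Lag2```) pairs using the coefficients printed above and the intercept printed further down in this notebook (```model.intercept_```). The helper names ```lda_score``` and ```lda_label``` are made up for illustration, and the input pairs are not taken from the dataset.

```python
import numpy as np

# Values copied from the printed model.coef_ and model.intercept_ outputs.
coef = np.array([-0.05544078, -0.0443452])
intercept = 0.03221375

def lda_score(lag1, lag2):
    """Linear decision score: weights times predictors plus intercept."""
    return coef[0] * lag1 + coef[1] * lag2 + intercept

def lda_label(lag1, lag2):
    """A positive score maps to the second class ('Up'), otherwise 'Down'."""
    return "Up" if lda_score(lag1, lag2) > 0 else "Down"

# Hypothetical inputs: two down days, two flat days, two up days.
examples = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0)]
for lag1, lag2 in examples:
    print(lag1, lag2, round(lda_score(lag1, lag2), 4), lda_label(lag1, lag2))
# -1.0 -1.0 0.132 Up
# 0.0 0.0 0.0322 Up
# 1.0 1.0 -0.0676 Down
```

Because both weights are negative, the score rises after negative returns, which matches the pattern seen in the group means.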


The predict() method returns a NumPy array of LDA’s predictions about the movement of the market on the test data.

In [8]:
pred=model.predict(X_test)
print(np.unique(pred, return_counts=True))

(array(['Down', 'Up'], dtype='<U4'), array([ 70, 182], dtype=int64))


* The model assigned 70 observations to the "Down" class, and 182 observations to the "Up" class. Let's check the confusion matrix to see how this model is performing.
* To do so, we compare the predicted class (```pred```) to the true class (```y_test```).

In [9]:
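# pred is passed first, so rows are predicted classes and columns are true classes
# (the transpose of scikit-learn's usual confusion_matrix(y_true, y_pred) layout).
print(confusion_matrix(pred, y_test))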
print(classification_report(y_test, pred, digits=3))

[[ 35  35]
 [ 76 106]]
              precision    recall  f1-score   support

        Down      0.500     0.315     0.387       111
          Up      0.582     0.752     0.656       141

    accuracy                          0.560       252
   macro avg      0.541     0.534     0.522       252
weighted avg      0.546     0.560     0.538       252
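
The figures in the report can be recomputed by hand from the confusion matrix printed above. This is a small sketch using only those four counts; it assumes the layout produced by ```confusion_matrix(pred, y_test)```, i.e. rows are predicted classes and columns are true classes.

```python
import numpy as np

# Printed LDA confusion matrix: rows = predicted (Down, Up), columns = true (Down, Up).
cm = np.array([[35, 35],
               [76, 106]])

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all test days
precision = np.diag(cm) / cm.sum(axis=1)     # correct / everything predicted as that class
recall = np.diag(cm) / cm.sum(axis=0)        # correct / everything truly in that class

print(round(accuracy, 3))       # 0.56
print(np.round(precision, 3))   # [0.5   0.582]
print(np.round(recall, 3))      # [0.315 0.752]
```

The row sums (70 and 182) are the prediction counts shown earlier, and the column sums (111 and 141) are the ```support``` values in the report.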



In [10]:
model.classes_

array(['Down', 'Up'], dtype='<U4')

In [11]:
model.feature_names_in_

array(['Lag1', 'Lag2'], dtype=object)

In [12]:
model.get_params()

{'covariance_estimator': None,
 'n_components': None,
 'priors': None,
 'shrinkage': None,
 'solver': 'svd',
 'store_covariance': False,
 'tol': 0.0001}

In [13]:
model.intercept_

array([0.03221375])

## Quadratic Discriminant Analysis

We will use the same Stock Market dataset, with the same time-based train/test split, to fit a second model using QDA (Quadratic Discriminant Analysis) and try to understand the outcome.

In [14]:
qda = QuadraticDiscriminantAnalysis()
model2 = qda.fit(X_train, y_train)
print(model2.priors_)
print(model2.means_)

[0.49198397 0.50801603]
[[ 0.04279022  0.03389409]
 [-0.03954635 -0.03132544]]


* The output contains the priors and the group means, which are identical to those of LDA. However, it does not contain the coefficients of the linear discriminants, because the QDA classifier involves a quadratic, rather than a linear, function of the predictors.
* The predict() method works in exactly the same fashion as for LDA.
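
The quadratic form arises because QDA estimates a separate covariance matrix for each class. The sketch below shows this on synthetic two-class data (not the Smarket data), using the ```store_covariance=True``` option so that the fitted per-class covariances can be inspected; the class means and covariances used to generate the data are arbitrary choices.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

# Synthetic classes with deliberately different covariance structure (illustrative only).
rng = np.random.default_rng(2)
X_down = rng.multivariate_normal([0.5, 0.5], [[1.0, 0.6], [0.6, 1.0]], size=300)
X_up = rng.multivariate_normal([-0.5, -0.5], [[1.0, -0.6], [-0.6, 1.0]], size=300)
X_demo = np.vstack([X_down, X_up])
y_demo = np.array(["Down"] * 300 + ["Up"] * 300)

qda_demo = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X_demo, y_demo)

# One covariance matrix per class, in the order of qda_demo.classes_.
for label, cov in zip(qda_demo.classes_, qda_demo.covariance_):
    print(label)
    print(np.round(cov, 3))

# Each should match the sample covariance of that class's rows.
sample_covs = [np.cov(X_down, rowvar=False), np.cov(X_up, rowvar=False)]
```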

In [15]:
pred2=model2.predict(X_test)
print(np.unique(pred2, return_counts=True))
# As before, rows are predicted classes and columns are true classes.
print(confusion_matrix(pred2, y_test))
print(classification_report(y_test, pred2, digits=3))

(array(['Down', 'Up'], dtype=object), array([ 50, 202], dtype=int64))
[[ 30  20]
 [ 81 121]]
              precision    recall  f1-score   support

        Down      0.600     0.270     0.373       111
          Up      0.599     0.858     0.706       141

    accuracy                          0.599       252
   macro avg      0.600     0.564     0.539       252
weighted avg      0.599     0.599     0.559       252



Interestingly, the QDA predictions are accurate almost 60% of the time (59.9%), even though the 2005 data was not used to fit the model. This level of accuracy is impressive for stock market data, which is known to be hard to model accurately.

This suggests that the quadratic form assumed by QDA may capture the true relationship more accurately than the linear forms assumed by LDA (56.0% on the same test set) and logistic regression. However, this method’s performance should be evaluated on a larger test set before betting that this approach will consistently beat the market.
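
To see why a larger test set matters, here is a rough back-of-the-envelope sketch of the uncertainty in an accuracy measured on 252 days. It uses a normal-approximation interval, which assumes the test days are independent (a simplification for market data), and the helper name ```accuracy_interval``` is made up for illustration. The counts come from the confusion matrices and ```support``` values printed above.

```python
import math

n_test = 252
qda_correct = 30 + 121        # diagonal of the printed QDA confusion matrix
lda_correct = 35 + 106        # diagonal of the printed LDA confusion matrix
always_up_correct = 141       # number of true "Up" days in the test set

def accuracy_interval(correct, n, z=1.96):
    """Approximate 95% interval for an accuracy, using the normal approximation."""
    p = correct / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width

qda_acc, qda_low, qda_high = accuracy_interval(qda_correct, n_test)
baseline_acc = always_up_correct / n_test

print(round(qda_acc, 3), round(qda_low, 3), round(qda_high, 3))
print(round(lda_correct / n_test, 3), round(baseline_acc, 3))
```

Under these assumptions the interval around the QDA accuracy is roughly six percentage points wide on each side, and it contains the accuracy of a trivial rule that always predicts ```Up```.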

In [16]:
model2.get_params()

{'priors': None, 'reg_param': 0.0, 'store_covariance': False, 'tol': 0.0001}

In [17]:
model2.classes_

array(['Down', 'Up'], dtype=object)

In [18]:
# rotations_ holds, for each class k, an array of shape (n_features, n_k),
# with n_k = min(n_features, number of samples in class k).
# Each array is the rotation of that class's Gaussian distribution, i.e. its principal axes.
model2.rotations_

[array([[ 0.57172606,  0.82044458],
        [-0.82044458,  0.57172606]]),
 array([[-0.84630247, -0.53270267],
        [ 0.53270267, -0.84630247]])]