# **Lab: Logistic Regression, LDA, QDA and KNN**

## **The Stock Market Data**

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import style

import warnings
warnings.filterwarnings('ignore')

style.use('fivethirtyeight')

from sklearn import linear_model
from sklearn.metrics import confusion_matrix
import statsmodels.api as sm

%matplotlib inline


In [2]:
data = pd.read_csv('/content/drive/My Drive/Repos/Git/Machine-Learning/An Introduction to Statistical Learning/Dataset/Smarket.csv')

In [3]:
data.isnull().any()

Unnamed: 0    False
Year          False
Lag1          False
Lag2          False
Lag3          False
Lag4          False
Lag5          False
Volume        False
Today         False
Direction     False
dtype: bool

In [4]:
data.drop(columns='Unnamed: 0',inplace=True) # Drop the index column that was saved in the CSV

In [5]:
data.head(3)

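| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 0.381 | -0.192 | -2.624 | -1.055 | 5.01 | 1.1913 | 0.959 | Up |
| 1 | 2001 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 | 1.032 | Up |
| 2 | 2001 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 | -0.623 | Down |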


**Dataset info:**

This data set consists of percentage returns for the S&P 500 stock index over 1,250 days, from the beginning of 2001 until the end of 2005. For each date, we have recorded the percentage returns for each of the five previous trading days, `Lag1` through `Lag5`. We have also recorded `Volume` (the number of shares traded on the previous day, in billions), `Today` (the percentage return on the date in question) and `Direction` (whether the market was Up or Down on this date).

In [6]:
data.describe() # Gives a Statistical Description of all the Numerical Features 

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today |
|---|---|---|---|---|---|---|---|---|
| count | 1250.0 | 1250.0 | 1250.0 | 1250.0 | 1250.0 | 1250.0 | 1250.0 | 1250.0 |
| mean | 2003.016 | 0.003834 | 0.003919 | 0.001716 | 0.001636 | 0.00561 | 1.478305 | 0.003138 |
| std | 1.409018 | 1.136299 | 1.13628 | 1.138703 | 1.138774 | 1.14755 | 0.360357 | 1.136334 |
| min | 2001.0 | -4.922 | -4.922 | -4.922 | -4.922 | -4.922 | 0.35607 | -4.922 |
| 25% | 2002.0 | -0.6395 | -0.6395 | -0.64 | -0.64 | -0.64 | 1.2574 | -0.6395 |
| 50% | 2003.0 | 0.039 | 0.039 | 0.0385 | 0.0385 | 0.0385 | 1.42295 | 0.0385 |
| 75% | 2004.0 | 0.59675 | 0.59675 | 0.59675 | 0.59675 | 0.597 | 1.641675 | 0.59675 |
| max | 2005.0 | 5.733 | 5.733 | 5.733 | 5.733 | 5.733 | 3.15247 | 5.733 |


In [7]:
data.corr(numeric_only=True)
# Produces the correlation matrix between the numerical features.
# numeric_only=True (pandas >= 1.5) skips the non-numeric Direction column.

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today |
|---|---|---|---|---|---|---|---|---|
| Year | 1.0 | 0.0297 | 0.030596 | 0.033195 | 0.035689 | 0.029788 | 0.539006 | 0.030095 |
| Lag1 | 0.0297 | 1.0 | -0.026294 | -0.010803 | -0.002986 | -0.005675 | 0.04091 | -0.026155 |
| Lag2 | 0.030596 | -0.026294 | 1.0 | -0.025897 | -0.010854 | -0.003558 | -0.043383 | -0.01025 |
| Lag3 | 0.033195 | -0.010803 | -0.025897 | 1.0 | -0.024051 | -0.018808 | -0.041824 | -0.002448 |
| Lag4 | 0.035689 | -0.002986 | -0.010854 | -0.024051 | 1.0 | -0.027084 | -0.048414 | -0.0069 |
| Lag5 | 0.029788 | -0.005675 | -0.003558 | -0.018808 | -0.027084 | 1.0 | -0.022002 | -0.03486 |
| Volume | 0.539006 | 0.04091 | -0.043383 | -0.041824 | -0.048414 | -0.022002 | 1.0 | 0.014592 |
| Today | 0.030095 | -0.026155 | -0.01025 | -0.002448 | -0.0069 | -0.03486 | 0.014592 | 1.0 |


---
**Observation:**
- The correlations between the lag variables and today's return are close to zero.
- The only substantial correlation is between `Year` and `Volume`.
- In other words, the volume of shares traded increased from 2001 to 2005.
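
Each entry of the matrix above is a Pearson correlation coefficient. As a rough illustration of what `corr()` computes for one pair of columns, the sketch below evaluates it in plain Python on the `Lag1` and `Today` values of the first five rows of the data set. Five rows are far too few to say anything about the full data, and the helper name `pearson` is ours, not a pandas function.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Lag1 and Today for the first five rows of the data set.
lag1 = [0.381, 0.959, 1.032, -0.623, 0.614]
today = [0.959, 1.032, -0.623, 0.614, 0.213]

r = pearson(lag1, today)
print(f"Correlation on these five rows: {r:.3f}")
```

A column correlated with itself gives exactly 1, which is why the diagonal of the matrix is all ones.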

---

### **Applying Logistic Regression**

Now, we will fit a logistic regression model in order to predict `Direction` using `Lag1` through `Lag5` and `Volume`. The `GLM()` function fits *generalized linear models*, a class of models that includes logistic regression. We must pass the argument `family=sm.families.Binomial()` in order to run a logistic regression rather than some other type of generalized linear model.
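
For reference, logistic regression models the probability of the class coded as 1 as the logistic (sigmoid) function of a linear combination of the predictors. The sketch below shows that mapping in plain Python. The function names and the coefficient values are made up for illustration and are not the fitted values from this lab.

```python
import math

def logistic(z):
    """Map a linear predictor z (the log-odds) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(intercept, coefs, x):
    """P(y = 1 | x) for a logistic regression with the given coefficients."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return logistic(z)

# Hypothetical coefficients for two predictors (illustration only).
intercept = 0.1
coefs = [-0.05, 0.02]
x = [1.0, -0.5]

p = predict_probability(intercept, coefs, x)
print(f"P(y = 1 | x) = {p:.4f}")
```

A linear predictor of zero corresponds to a probability of exactly 0.5, so small coefficients, as in this data set, produce fitted probabilities close to one half.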

#### **Using Statsmodels**

In [8]:
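X = data[['Lag1','Lag2','Lag3','Lag4','Lag5','Volume']] # Exogenous Variables
# factorize() codes labels in order of first appearance: Up -> 0, Down -> 1,
# so the fitted model estimates P(Direction = Down).
y = data['Direction'].factorize()[0] # Response Variable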

In [9]:
X = sm.add_constant(X) # Adding Constant for intercept value

In [10]:
sm_model = sm.GLM(y,X,family=sm.families.Binomial()).fit()
print(sm_model.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                      y   No. Observations:                 1250
Model:                            GLM   Df Residuals:                     1243
Model Family:                Binomial   Df Model:                            6
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -863.79
Date:                Wed, 24 Jun 2020   Deviance:                       1727.6
Time:                        16:00:50   Pearson chi2:                 1.25e+03
No. Iterations:                     4                                         
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.1260      0.241      0.523      0.6

---
**Observation:**
- The smallest p-value here is associated with `Lag1`.
- The negative coefficients for the predictor suggests that if the

In [11]:
sm_model.fittedvalues[:10]

0    0.492916
1    0.518532
2    0.518861
3    0.484778
4    0.489219
5    0.493044
6    0.507349
7    0.490771
8    0.482386
9    0.511162
dtype: float64
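
When reading these fitted values, it helps to recall how the response was encoded. `factorize()` assigns integer codes in order of first appearance, and the first row of this data set is `Up`, so `Up` becomes 0 and `Down` becomes 1. The fitted values are therefore probabilities of `Down`. The plain-Python sketch below mimics that encoding (the helper `encode_by_first_appearance` is hypothetical, not a pandas function) and thresholds the first five fitted values at 0.5.

```python
def encode_by_first_appearance(labels):
    """Assign integer codes in order of first appearance, like factorize()."""
    codes, mapping = [], {}
    for label in labels:
        if label not in mapping:
            mapping[label] = len(mapping)
        codes.append(mapping[label])
    return codes, mapping

# Direction for the first five rows of the data set.
direction = ['Up', 'Up', 'Down', 'Up', 'Up']
codes, mapping = encode_by_first_appearance(direction)
print(mapping)  # {'Up': 0, 'Down': 1}

# First five fitted values from the cell above: P(y = 1) = P(Down).
fitted = [0.492916, 0.518532, 0.518861, 0.484778, 0.489219]
predicted = ['Down' if p > 0.5 else 'Up' for p in fitted]
print(predicted)
```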

#### **Using Sklearn**

In [12]:
data['Direction_Encoded'] = (data['Direction']=='Up').astype(int) # Up -> 1, Down -> 0
data.head()

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction | Direction_Encoded |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2001 | 0.381 | -0.192 | -2.624 | -1.055 | 5.01 | 1.1913 | 0.959 | Up | 1 |
| 1 | 2001 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 | 1.032 | Up | 1 |
| 2 | 2001 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 | -0.623 | Down | 0 |
| 3 | 2001 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.276 | 0.614 | Up | 1 |
| 4 | 2001 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.2057 | 0.213 | Up | 1 |


In [13]:
X = data[['Lag1','Lag2','Lag3','Lag4','Lag5','Volume']] # Exogenous Variables
X = X.values.reshape(-1,6)
y = data['Direction_Encoded']

In [14]:
sk_model = linear_model.LogisticRegression(solver='newton-cg').fit(X,y)

In [15]:
model_prob = sk_model.predict_proba(X)

In [16]:
model_prob[:10,0] # Column 0 is P(Direction_Encoded = 0), i.e. the probability of Down

array([0.4926563 , 0.51825501, 0.51870192, 0.48465051, 0.4890086 ,
       0.4929353 , 0.50725072, 0.49072098, 0.48216684, 0.51090539])
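
`predict_proba` returns one column per class, in the order given by the fitted model's `classes_` attribute. With `Direction_Encoded` taking the values 0 (`Down`) and 1 (`Up`), column 0 holds P(Down) and column 1 holds P(Up), so the values printed above are probabilities of `Down`. The sketch below imitates this layout in plain Python, without calling scikit-learn, using the first four probabilities above (rounded). The variable names are ours.

```python
# Class order as scikit-learn would report it in `classes_` for 0/1 labels.
classes = [0, 1]  # 0 = Down, 1 = Up

# First four P(Down) values from the output above, rounded to 4 decimals.
p_down = [0.4927, 0.5183, 0.5187, 0.4847]

# Each row of predict_proba sums to 1, so P(Up) = 1 - P(Down).
proba = [[p, 1.0 - p] for p in p_down]

# Pick the column that belongs to class 1 (Up) before thresholding.
up_col = classes.index(1)
predicted = ['Up' if row[up_col] > 0.5 else 'Down' for row in proba]
print(predicted)
```

Looking up the column through the class order, rather than hard-coding an index, guards against thresholding the probability of the wrong class.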

In [17]:
# Creating a confusion matrix with clear labels.
# Column 1 of predict_proba is P(class 1) = P(Up); column 0 is P(Down).
conf_mat = pd.DataFrame({'Actual Direction':np.where(y==1,'Up','Down'),
                         'Predicted Direction':np.where(model_prob[:,1]>0.5,'Up','Down')})
conf_mat = conf_mat.groupby(['Actual Direction','Predicted Direction']).size().unstack('Predicted Direction')
conf_mat

| Actual Direction | Predicted Down | Predicted Up |
|---|---|---|
| Down | 144 | 458 |
| Up | 141 | 507 |


In [18]:
y_pred = [1 if i>0.5 else 0 for i in model_prob[:,1]]

In [19]:
# Sklearn's confusion_matrix gives the same result (rows: actual, columns: predicted).
confusion_matrix(y,y_pred)

array([[144, 458],
       [141, 507]])

**Confusion Matrix**
- The diagonal entries of the confusion matrix are the correct predictions.
- The off-diagonal entries of the confusion matrix are the incorrect predictions.

**Note:** Here our main interest is the overall fraction of correct classifications.

In [20]:
print(f'Accuracy of the Model:{((conf_mat.iloc[0,0]+conf_mat.iloc[1,1])/(conf_mat.iloc[:,0].sum()+conf_mat.iloc[:,1].sum()))*100:.2f}%')

Accuracy of the Model:52.08%
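
The accuracy computed above is the sum of the diagonal of the confusion matrix divided by the total count. A minimal plain-Python sketch of that calculation, using made-up counts rather than the lab's results (the helper name `accuracy_from_confusion` is hypothetical):

```python
def accuracy_from_confusion(matrix):
    """Fraction of correct predictions: diagonal sum over total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Made-up counts: rows are actual classes, columns are predicted classes.
example = [[50, 10],
           [5, 35]]

acc = accuracy_from_confusion(example)
print(f"Accuracy: {acc * 100:.2f}%")
```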


**Observation:**
- This figure is not a reliable estimate of the model's accuracy, since it was computed on the same data the model was trained on. To assess the model properly, we hold out the 2005 observations as a test set.

In [21]:
train_data = data[data.Year<2005]
test_data = data[data.Year==2005]

In [22]:
X_train = train_data[['Lag1','Lag2','Lag3','Lag4','Lag5','Volume']].values.reshape(-1,6)
y_train = train_data['Direction_Encoded'].values.reshape(-1,1)
X_test = test_data[['Lag1','Lag2','Lag3','Lag4','Lag5','Volume']].values.reshape(-1,6)
y_test = test_data['Direction_Encoded'].values.reshape(-1,1)

In [23]:
tr_model = linear_model.LogisticRegression(solver='newton-cg').fit(X_train,y_train)

In [24]:
y_prob_pred = tr_model.predict_proba(X_test)

In [25]:
# Column 1 of predict_proba holds P(Up) (class 1); column 0 holds P(Down).
y_pred = [1 if i>0.5 else 0 for i in y_prob_pred[:,1]]

In [26]:
y_test.shape

(252, 1)

In [27]:
confusion_matrix(y_test,y_pred)

array([[74, 37],
       [93, 48]])

In [28]:
print(f'Accuracy:{((74+48)/252)*100:.2f}%')

Accuracy:48.41%
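
One way to put the test accuracy in context, which is not part of the original lab, is to compare it with a naive rule that always predicts the more frequent class. The row sums of the confusion matrix above give the actual class counts in the 2005 test set (111 `Down`, 141 `Up`). A small sketch, with a hypothetical helper name:

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of always predicting the most frequent class."""
    return max(class_counts.values()) / sum(class_counts.values())

# Actual class counts in the test set (row sums of the confusion matrix).
test_counts = {'Down': 111, 'Up': 141}

baseline = majority_baseline_accuracy(test_counts)
print(f"Always predicting the majority class: {baseline * 100:.2f}%")
```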


### **Linear Discriminant Analysis**