# 4.6.2 Logistic Regression

#### Load modules and data

In [31]:
from scipy import stats
import pandas as pd
import seaborn as sns
import scipy as sp
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
from sklearn.preprocessing import scale
import sklearn.linear_model as skl_lm
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
import statsmodels.formula.api as smf
%matplotlib inline
plt.style.use('seaborn-white')


Smarket = pd.read_csv('Data/Smarket.csv', usecols = range(1,10))

In this lab, we will fit a logistic regression model in order to predict Direction using Lag1 through Lag5 and Volume. We'll build our model using the glm() function.

The glm() function fits generalized linear models, a class of models that includes logistic regression. The syntax of the glm() function is similar to that of ols(), except that we must pass in the argument family=sm.families.Binomial() in order to run a logistic regression rather than some other type of generalized linear model.

In [32]:
res = smf.glm('Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume', Smarket, family=sm.families.Binomial()).fit()
res.summary().tables[1]

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | 0.1260 | 0.241 | 0.523 | 0.601 | -0.346 0.598 |
| Lag1 | 0.0731 | 0.050 | 1.457 | 0.145 | -0.025 0.171 |
| Lag2 | 0.0423 | 0.050 | 0.845 | 0.398 | -0.056 0.140 |
| Lag3 | -0.0111 | 0.050 | -0.222 | 0.824 | -0.109 0.087 |
| Lag4 | -0.0094 | 0.050 | -0.187 | 0.851 | -0.107 0.089 |
| Lag5 | -0.0103 | 0.050 | -0.208 | 0.835 | -0.107 0.087 |
| Volume | -0.1354 | 0.158 | -0.855 | 0.392 | -0.446 0.175 |


The signs are reversed relative to the book because Up and Down are defined the opposite way round in the glm function. Below, Direction is recoded so that the glm result matches the book.

In [34]:
Smarket = pd.read_csv('Data/Smarket.csv', usecols = range(1,10))
# Recode Direction as a numeric 0/1 response with Up = 1, so that the
# Binomial GLM models P(Up) as in the book.
Smarket['Direction'] = (Smarket['Direction'].str.lower() == 'up').astype(int)
res = smf.glm('Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume', Smarket, family=sm.families.Binomial()).fit()
res.summary().tables[1]

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Intercept | -0.1260 | 0.241 | -0.523 | 0.601 | -0.598 0.346 |
| Lag1 | -0.0731 | 0.050 | -1.457 | 0.145 | -0.171 0.025 |
| Lag2 | -0.0423 | 0.050 | -0.845 | 0.398 | -0.140 0.056 |
| Lag3 | 0.0111 | 0.050 | 0.222 | 0.824 | -0.087 0.109 |
| Lag4 | 0.0094 | 0.050 | 0.187 | 0.851 | -0.089 0.107 |
| Lag5 | 0.0103 | 0.050 | 0.208 | 0.835 | -0.087 0.107 |
| Volume | 0.1354 | 0.158 | 0.855 | 0.392 | -0.175 0.446 |
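
The mirror-image relationship between the two coefficient tables follows from the logistic form itself. As a short derivation (the symbol $\eta$ for the linear predictor is notation introduced here, not taken from the lab):

$$
\eta = \beta_0 + \beta_1\,\mathrm{Lag1} + \dots + \beta_5\,\mathrm{Lag5} + \beta_6\,\mathrm{Volume}, \qquad
P(\mathrm{Up}) = \frac{1}{1 + e^{-\eta}}, \qquad
P(\mathrm{Down}) = 1 - P(\mathrm{Up}) = \frac{1}{1 + e^{\eta}}.
$$

Modelling Down instead of Up therefore amounts to replacing $\eta$ by $-\eta$: every coefficient and z-statistic changes sign, while the standard errors and p-values stay the same, which is the pattern seen in the two tables.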


The smallest p-value here is associated with Lag1. The negative coefficient
for this predictor suggests that if the market had a positive return yesterday,
then it is less likely to go up today. However, at a value of 0.145, the p-value
is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.

In [4]:
res.params

Intercept   -0.126000
Lag1        -0.073074
Lag2        -0.042301
Lag3         0.011085
Lag4         0.009359
Lag5         0.010313
Volume       0.135441
dtype: float64

The predict() function can be used to predict the probability that the market will go up, given values of the predictors. Called with no arguments, it returns the fitted probabilities for the training data.

In [5]:
predictions = res.predict()
print(predictions[0:10])

[ 0.50708413  0.48146788  0.48113883  0.51522236  0.51078116  0.50695646
  0.49265087  0.50922916  0.51761353  0.48883778]
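
These probabilities come from applying the logistic function to the fitted linear predictor. As a minimal sketch, the snippet below recomputes one such probability by hand from the coefficients printed by res.params. The predictor values in `day` are hypothetical, chosen only for illustration, and `predict_up` is a helper name introduced here; it is not part of statsmodels.

```python
import math

# Coefficients copied from the res.params output above (model for P(Up)).
params = {
    "Intercept": -0.126000,
    "Lag1": -0.073074,
    "Lag2": -0.042301,
    "Lag3": 0.011085,
    "Lag4": 0.009359,
    "Lag5": 0.010313,
    "Volume": 0.135441,
}

def predict_up(day, params):
    """Return P(Up) for one day: logistic function of the linear predictor."""
    eta = params["Intercept"] + sum(params[name] * value for name, value in day.items())
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical predictor values (NOT taken from Smarket.csv).
day = {"Lag1": 0.5, "Lag2": -0.2, "Lag3": 0.0, "Lag4": 0.1, "Lag5": -0.3, "Volume": 1.4}
p_up = predict_up(day, params)
print(round(p_up, 4))
```

Because every coefficient is small, the result stays close to 0.5 for any plausible day, which is consistent with the narrow range of fitted probabilities printed above.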


In order to make a prediction as to whether the market will go up or
down on a particular day, we must convert these predicted probabilities
into class labels, Up or Down. The following command creates a list
of class predictions based on whether the predicted probability of a market
increase is greater than 0.5: if it is, the day is labelled Up; otherwise it is labelled Down.

In [6]:
predictions_nominal = [ "Up" if x > 0.5 else "Down" for x in predictions]

A confusion matrix is made to determine how many observations were correctly or incorrectly classified.

In [7]:
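from sklearn.metrics import confusion_matrix, classification_report
# Reload the data to restore the original "Up"/"Down" string labels.
Smarket = pd.read_csv('Data/Smarket.csv', usecols=range(1,10))
# confusion_matrix returns one row per true label; unpack the Down and Up rows.
cm1, cm2 = confusion_matrix(Smarket["Direction"], predictions_nominal)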

In [8]:
# Tabulate with predicted labels as rows and true labels as columns.
tab = pd.DataFrame({"": ["Down", "Up"], "Down": cm1, "Up": cm2})
print(tab.to_csv(sep='\t', index=False))

	Down	Up
Down	145	141
Up	457	507



The diagonal elements of the confusion matrix indicate correct predictions,
while the off-diagonals represent incorrect predictions. Hence our model
correctly predicted that the market would go up on 507 days and that
it would go down on 145 days, for a total of 507 + 145 = 652 correct
predictions. Dividing this total by the number of observations gives the fraction of
days for which the prediction was correct. In this case, logistic regression
correctly predicted the movement of the market 52.2% of the time.

In [9]:
(507+145)/1250.0

0.5216
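
The same figure can be read off the confusion matrix without typing the counts into the formula by hand. A minimal sketch, assuming the counts are copied from the table above into a nested dictionary (the name `table` and its layout are choices made here for illustration):

```python
# Counts from the table above: outer key = predicted label, inner key = true label.
table = {
    "Down": {"Down": 145, "Up": 141},   # days predicted Down
    "Up":   {"Down": 457, "Up": 507},   # days predicted Up
}

correct = sum(table[label][label] for label in table)      # diagonal cells
total = sum(sum(row.values()) for row in table.values())   # all cells
accuracy = correct / total
print(correct, total, accuracy)
```

Summing the diagonal and dividing by the grand total generalises to any number of classes, which the hard-coded expression does not.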

In [10]:
print(classification_report(Smarket["Direction"], predictions_nominal, digits=3))

             precision    recall  f1-score   support

       Down      0.507     0.241     0.327       602
         Up      0.526     0.782     0.629       648

avg / total      0.517     0.522     0.483      1250
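
The per-class figures in this report can be recovered from the confusion matrix counts. The sketch below does so by hand; `class_metrics` and the `(predicted, actual)` key layout are names introduced here for illustration, not scikit-learn API.

```python
# Counts from the training confusion matrix above, keyed by (predicted, actual).
counts = {("Down", "Down"): 145, ("Down", "Up"): 141,
          ("Up", "Down"): 457, ("Up", "Up"): 507}

def class_metrics(label, counts):
    """Precision, recall and F1 for one class from (predicted, actual) counts."""
    tp = counts[(label, label)]
    predicted = sum(n for (pred, _), n in counts.items() if pred == label)
    actual = sum(n for (_, act), n in counts.items() if act == label)
    precision = tp / predicted   # share of predictions of this class that were right
    recall = tp / actual         # share of true members of this class that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for label in ("Down", "Up"):
    precision, recall, f1 = class_metrics(label, counts)
    print(label, round(precision, 3), round(recall, 3), round(f1, 3))
```

The low recall for Down reflects how rarely the model predicts a decline: only 286 of the 1250 days are labelled Down.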



At first glance, it appears that the logistic regression model is working a little better than random guessing. However, this result is misleading because we trained and tested the model on the same set of 1,250 observations. In other words, 100 − 52.2 = 47.8% is the __training error rate__.
The training error rate is often overly optimistic: it tends to underestimate the test error rate.

In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model’s performance not on the data that we used to fit the model, but rather on days in the future for which the market’s movements are unknown.

To implement this strategy, we will first create a vector corresponding to the observations from 2001 through 2004. We will then use this vector to create a held out data set of observations from 2005.

We now fit a logistic regression model using only the subset of the observations that correspond to dates before 2005.

In [11]:
x_train = Smarket[Smarket.Year < 2005] # Data from 2001-2004
model = smf.glm('Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume', data=x_train, family=sm.families.Binomial())
result = model.fit()

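We then obtain predicted probabilities for each of the days in our test set,
that is, for the days in 2005. Because Direction holds string labels here, the fitted model returns the probability of Down rather than Up (the reversed coding noted earlier).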

Notice that we have trained and tested our model on two completely separate data sets: training was performed using only the dates before 2005, and testing was performed using only the dates in 2005. Finally, we compute the predictions for 2005 and compare them to the actual movements of the market over that time period.

In [12]:
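x_test = Smarket[Smarket.Year >= 2005] # Data from 2005
predictions = result.predict(x_test)
# NOTE: Direction is a string column here, so predict() returns P(Down).
# Labelling p > 0.5 as "Up" therefore swaps the labels; the refit below uses '<'.
# The comparison is left as '>' so the code matches the recorded output.
predictions_nominal = ["Up" if x > 0.5 else "Down" for x in predictions]
cm1, cm2 = confusion_matrix(x_test["Direction"], predictions_nominal)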

In [13]:
# Tabulate with predicted labels as rows and true labels as columns.
tab = pd.DataFrame({"": ["Down", "Up"], "Down": cm1, "Up": cm2})
print(tab.to_csv(sep='\t', index=False))

	Down	Up
Down	34	44
Up	77	97



In [14]:
(77+44.)/sum(cm1+cm2)

0.4801587301587302

In [15]:
print(classification_report(x_test["Direction"], predictions_nominal, digits=3))

             precision    recall  f1-score   support

       Down      0.436     0.306     0.360       111
         Up      0.557     0.688     0.616       141

avg / total      0.504     0.520     0.503       252



The results are rather disappointing: the test error
rate is 52%, which is worse than random guessing! (The labels above were assigned with '>' on what are
probabilities of Down, so the 0.520 accuracy in the classification report is really the error rate; the
fraction of correct predictions is (77+44)/252 = 48%.) Of course this result
is not all that surprising, given that one would not generally expect to be
able to use previous days’ returns to predict future market performance.
(After all, if it were possible to do so, then the authors of this book [along with your professor] would probably
be out striking it rich rather than teaching statistics.)

We recall that the logistic regression model had very underwhelming p-values
associated with all of the predictors, and that the smallest p-value,
though not very small, corresponded to Lag1. Perhaps by removing the
variables that appear not to be helpful in predicting Direction, we can
obtain a more effective model. After all, using predictors that have no
relationship with the response tends to cause a deterioration in the test
error rate (since such predictors cause an increase in variance without a
corresponding decrease in bias), and so removing such predictors may in
turn yield an improvement. 

Below, we refit the logistic regression using just Lag1 and Lag2, which seemed to have the highest predictive power in the original logistic regression model.

In [105]:
model = smf.glm('Direction ~ Lag1 + Lag2', data=x_train, family=sm.families.Binomial())
result = model.fit()
predictions = result.predict(x_test)
predictions_nominal = ["Up" if x < 0.5 else "Down" for x in predictions] # predict() gives P(Down) here, so p < 0.5 means Up
cm1,cm2 = confusion_matrix(x_test["Direction"], predictions_nominal)

In [19]:
# Tabulate with predicted labels as rows and true labels as columns.
tab = pd.DataFrame({"": ["Down", "Up"], "Down": cm1, "Up": cm2})
print(tab.to_csv(sep='\t', index=False))

	Down	Up
Down	35	35
Up	76	106



In [20]:
(106+35.)/sum(cm1+cm2)

0.5595238095238095

In [21]:
print(classification_report(x_test["Direction"], predictions_nominal, digits=3))

             precision    recall  f1-score   support

       Down      0.500     0.315     0.387       111
         Up      0.582     0.752     0.656       141

avg / total      0.546     0.560     0.538       252



Now the results appear to be more promising: 56% of the daily movements
have been correctly predicted. The confusion matrix suggests that on days
when logistic regression predicts that the market will decline, it is only
correct 50% (35/(35+35)) of the time. However, on days when it predicts an increase in
the market, it has a 58% (106/(76+106)) accuracy rate.

This suggests a possible trading strategy of buying on days when the model predicts an increasing market, and avoiding trades on days when a decrease is predicted. Of course one would need to investigate more carefully whether this small improvement was real or just due to random chance.
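
One rough way to probe whether the 58% figure could be due to chance is to compare it with the share of 2005 days that were Up anyway (141 of 252, from the support column above). The sketch below computes an exact binomial tail probability. It assumes days are independent and that "always predict Up" is the right baseline, both simplifications made here for illustration; `binom_upper_tail` is a helper name introduced for this example, and the result is a sanity check rather than a proper backtest.

```python
from math import comb

# Test-set counts from the Lag1 + Lag2 confusion matrix above.
predicted_up_days = 76 + 106      # days on which the model predicted Up
correct_up_days = 106             # of those, days the market actually rose
base_rate = 141 / 252             # share of all 2005 days that were Up

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

hit_rate = correct_up_days / predicted_up_days
p_value = binom_upper_tail(correct_up_days, predicted_up_days, base_rate)
print(round(hit_rate, 3), round(base_rate, 3), round(p_value, 2))
```

A tail probability well above conventional significance levels would mean that a hit rate this high is unremarkable under the baseline, which supports the caution expressed above.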

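Finally, we predict Direction on a day when Lag1 and Lag2 equal 1.2 and 1.1, respectively, and on a day when they equal 1.5 and −0.8. The model returns the probability of Down, so the probability of Up is one minus each value.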

In [106]:
print(result.predict(pd.DataFrame([[1.2,1.1],[1.5,-0.8]], columns = ["Lag1","Lag2"])))

[ 0.52085376  0.50390613]


In [108]:
1-0.52085376

0.47914623999999995

In [109]:
1-0.50390613

0.49609387000000005
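
The two subtractions above can be folded into one step that also assigns class labels. A minimal sketch, assuming (as the cells above do) that the values returned by predict() are probabilities of Down; the variable names are chosen here for illustration.

```python
# Output of result.predict(...) above, treated as P(Down) for each of the two days.
p_down = [0.52085376, 0.50390613]

p_up = [1 - p for p in p_down]                          # complement gives P(Up)
labels = ["Up" if p > 0.5 else "Down" for p in p_up]    # same 0.5 threshold as before
print([round(p, 3) for p in p_up], labels)
```

Both days fall just below the 0.5 threshold for Up, so the model predicts a decline in each case, though only marginally.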