# Assignment
## Multiple logistic regression to predict the “ups” and “downs” in the stock market
### Possible solution 

The data set we will use was taken from "An Introduction to Statistical Learning with Applications in R", G. James, D. Witten, T. Hastie and R. Tibshirani, (Springer, New York, 2013), with the permission announced in http://faculty.marshall.usc.edu/gareth-james/ISL/.

The purpose is to fit a logistic regression model that predicts Direction using Lag1 through Lag5 and Volume. We build the model with the glm() function, which is part of the formula submodule of statsmodels (statsmodels.formula.api).

In [2]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
from sklearn.metrics import confusion_matrix, classification_report

## (1) Import the Smarket data from ISLR

In [3]:
df = pd.read_csv('Smarket.csv', index_col=0, parse_dates=True)
df.head()

Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
1,2001,0.381,-0.192,-2.624,-1.055,5.01,1.1913,0.959,Up
2,2001,0.959,0.381,-0.192,-2.624,-1.055,1.2965,1.032,Up
3,2001,1.032,0.959,0.381,-0.192,-2.624,1.4112,-0.623,Down
4,2001,-0.623,1.032,0.959,0.381,-0.192,1.276,0.614,Up
5,2001,0.614,-0.623,1.032,0.959,0.381,1.2057,0.213,Up


## (2) The Model

The glm() function fits generalized linear models, a class of models that includes logistic regression. Its syntax is similar to that of R's lm() (ols() in statsmodels), except that we must pass the argument family=sm.families.Binomial() in order to tell Python to run a logistic regression rather than some other type of generalized linear model.

In [4]:
formula = 'Direction ~ Lag1+Lag2+Lag3+Lag4+Lag5+Volume'

In [5]:
model = smf.glm(formula = formula, data=df, family=sm.families.Binomial())
result = model.fit()
print(result.summary())

                          Generalized Linear Model Regression Results                           
Dep. Variable:     ['Direction[Down]', 'Direction[Up]']   No. Observations:                 1250
Model:                                              GLM   Df Residuals:                     1243
Model Family:                                  Binomial   Df Model:                            6
Link Function:                                    logit   Scale:                          1.0000
Method:                                            IRLS   Log-Likelihood:                -863.79
Date:                                  Wed, 09 Sep 2020   Deviance:                       1727.6
Time:                                          07:58:26   Pearson chi2:                 1.25e+03
No. Iterations:                                       4                                         
Covariance Type:                              nonrobust                                         
                 coef    std e

The smallest p-value here is associated with Lag1. Because this model estimates the probability of Down (see the encoding check further below), the positive coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today. However, at a value of 0.145, the p-value is still relatively large, and so there is no clear evidence of a real association between Lag1 and Direction.

We use the .params attribute in order to access just the coefficients for this fitted model. Similarly, we can use .pvalues to get the p-values for the coefficients, and .model.endog_names to get the response (endogenous or dependent) variables.

In [7]:
print("Coefficients")
print(result.params)
print()
print("p-Values")
print(result.pvalues)
print()
print("Dependent variables")
print(result.model.endog_names)

Coefficients
Intercept    0.126000
Lag1         0.073074
Lag2         0.042301
Lag3        -0.011085
Lag4        -0.009359
Lag5        -0.010313
Volume      -0.135441
dtype: float64

p-Values
Intercept    0.600700
Lag1         0.145232
Lag2         0.398352
Lag3         0.824334
Lag4         0.851445
Lag5         0.834998
Volume       0.392404
dtype: float64

Dependent variables
['Direction[Down]', 'Direction[Up]']


The dependent variable has been converted from nominal into two dummy variables: ['Direction[Down]', 'Direction[Up]'].

The predict() function can be used to predict the probability that the market will go down, given values of the predictors. If no data set is supplied to the predict() function, then the probabilities are computed for the training data that was used to fit the logistic regression model.

In [8]:
predictions = result.predict()
print(predictions[0:10])

[0.49291587 0.51853212 0.51886117 0.48477764 0.48921884 0.49304354
 0.50734913 0.49077084 0.48238647 0.51116222]


Here we have printed only the first ten probabilities. Note: these values correspond to the probability of the market going down, rather than up. If we print the model's encoding of the response values alongside the original nominal response, we see that Python has created a dummy variable with a 1 for Down.

In [9]:
# print(np.column_stack((df.as_matrix(columns = ["Direction"]).flatten(),result.model.endog))) # as_matrix is not available any longer, we use to_numpy().

print(np.column_stack((df['Direction'].to_numpy().flatten(),result.model.endog)))

[['Up' 0.0]
 ['Up' 0.0]
 ['Down' 1.0]
 ...
 ['Up' 0.0]
 ['Down' 1.0]
 ['Down' 1.0]]
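
The first probability printed earlier can be reproduced by hand, which shows what predict() computes. The sketch below plugs the fitted coefficients and the first row of the data (both copied from the outputs above) into the logistic function. The variable names are ours and are not part of statsmodels.

```python
import math

# Coefficients copied from result.params above.
coef = {
    "Intercept": 0.126000,
    "Lag1": 0.073074,
    "Lag2": 0.042301,
    "Lag3": -0.011085,
    "Lag4": -0.009359,
    "Lag5": -0.010313,
    "Volume": -0.135441,
}

# Predictor values of the first observation, copied from df.head().
row = {"Lag1": 0.381, "Lag2": -0.192, "Lag3": -2.624,
       "Lag4": -1.055, "Lag5": 5.01, "Volume": 1.1913}

# Linear predictor: intercept plus the sum of coefficient * value.
eta = coef["Intercept"] + sum(coef[name] * value for name, value in row.items())

# The logistic (inverse logit) function maps eta to a probability of Down.
p_down = 1.0 / (1.0 + math.exp(-eta))

print(round(p_down, 4))  # 0.4929, matching the first predicted value
```

Since the linear predictor is slightly negative, the probability of Down falls just below 0.5.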


In order to make a prediction as to whether the market will go up or down on a particular day, we must convert these predicted probabilities into class labels, Up or Down. The following list comprehension creates a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5.

In [10]:
predictions_nominal = [ "Up" if x < 0.5 else "Down" for x in predictions]

This transforms to Up all of the elements for which the predicted probability of a market increase exceeds 0.5 (i.e. probability of a decrease is below 0.5). Given these predictions, the confusion\_matrix() function can be used to produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified.

In [12]:
print(confusion_matrix(df["Direction"],predictions_nominal))

[[145 457]
 [141 507]]


The diagonal elements of the confusion matrix indicate correct predictions, while the off-diagonals represent incorrect predictions. Hence our model correctly predicted that the market would go up on 507 days and that it would go down on 145 days, for a total of 507 + 145 = 652 correct predictions. The mean() function can be used to compute the fraction of days for which the prediction was correct. In this case, logistic regression correctly predicted the movement of the market 52.2% of the time. This is confirmed by checking the output of the classification\_report() function.

In [13]:
print(classification_report(df["Direction"],predictions_nominal,digits = 3))

              precision    recall  f1-score   support

        Down      0.507     0.241     0.327       602
          Up      0.526     0.782     0.629       648

    accuracy                          0.522      1250
   macro avg      0.516     0.512     0.478      1250
weighted avg      0.517     0.522     0.483      1250
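
The accuracy in the report can be recovered directly from the confusion matrix, since its diagonal holds the correct predictions. A minimal sketch, with the matrix values copied from the output above:

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true Down, Up; columns: predicted Down, Up).
cm = np.array([[145, 457],
               [141, 507]])

correct = np.trace(cm)          # 145 + 507 correct predictions
accuracy = correct / cm.sum()   # fraction of all 1250 days

print(correct, round(accuracy, 3))  # 652 0.522
```

Comparing predictions_nominal with df["Direction"] element by element and taking the mean of the result should give the same fraction.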



At first glance, it appears that the logistic regression model is working a little better than random guessing. But remember, this result is misleading because we trained and tested the model on the same set of 1,250 observations. In other words, 100 − 52.2 = 47.8% is the training error rate. As we have seen previously, the training error rate is often overly optimistic: it tends to underestimate the test error rate.

 In order to better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data. This will yield a more realistic error rate, in the sense that in practice we will be interested in our model’s performance not on the data that we used to fit the model, but rather on days in the future for which the market’s movements are unknown.

We will first separate the observations from 2001 through 2004, which form the training set, from the observations from 2005, which form the held-out test set. Here the two subsets are read from pre-split CSV files.

In [14]:
## The date-based slices below did not work well in my Jupyter notebook,
## but you can try them in your own Python environment.
##x_train = df[:'2004'][:]
##y_train = df[:'2004']['Direction']

##x_test = df['2005':][:]
##y_test = df['2005':]['Direction']

# Training set: observations from 2001 through 2004 (pre-split file).
x_train = pd.read_csv('Smarket2001-2004.csv', index_col=0, parse_dates=True)
y_train1 = x_train.Direction

# Test set: observations from 2005 (pre-split file).
x_test1 = pd.read_csv('Smarket2005.csv', index_col=0, parse_dates=True)
y_test = x_test1.Direction
# Drop the response so that only the predictors are passed to predict().
x_test = x_test1.drop(columns = 'Direction')
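
If the pre-split files are not available, the same split could probably be obtained from the full data frame by filtering on the Year column. The sketch below uses a tiny made-up frame named toy (the Year values and labels are invented purely for illustration) in place of the real df:

```python
import pandas as pd

# Toy stand-in for the Smarket frame; the values are illustrative only.
toy = pd.DataFrame({
    "Year": [2003, 2004, 2004, 2005, 2005],
    "Lag1": [0.381, 0.959, 1.032, -0.623, 0.614],
    "Direction": ["Up", "Up", "Down", "Up", "Up"],
})

train = toy[toy["Year"] < 2005]    # 2001 through 2004 in the real data
test = toy[toy["Year"] == 2005]    # held-out year

print(len(train), len(test))  # 3 2
```

A boolean mask on Year does not depend on how the index was parsed, which may be why it is more robust than slicing the index by date strings.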

We now fit a logistic regression model using only the observations that correspond to dates before 2005, by passing the training set as the data argument. We then obtain predicted probabilities of the stock market going down for each of the days in our test set, that is, for the days in 2005.

In [15]:
model1 = smf.glm(formula = formula, data = x_train, family = sm.families.Binomial())
result1 = model1.fit()

Notice that we have trained and tested our model on two completely separate data sets: training was performed using only the dates before 2005, and testing was performed using only the dates in 2005. Finally, we compute the predictions for 2005 and compare them to the actual movements of the market over that time period.

In [16]:
predictions1 = result1.predict(x_test)
predictions_nominal1 = [ "Up" if x < 0.5 else "Down" for x in predictions1]
print(classification_report(y_test, predictions_nominal1, digits = 3))

              precision    recall  f1-score   support

        Down      0.443     0.694     0.540       111
          Up      0.564     0.312     0.402       141

    accuracy                          0.480       252
   macro avg      0.503     0.503     0.471       252
weighted avg      0.511     0.480     0.463       252



The results are rather disappointing: the test error rate (1 - accuracy) is 52%, which is worse than random guessing! Of course this result is not all that surprising, given that one would not generally expect to be able to use previous days’ returns to predict future market performance. (After all, if it were possible to do so, then the authors of this book [along with your professor] would probably be out striking it rich rather than teaching statistics.)
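
The 52% figure follows from the accuracy row of the report, and the accuracy itself can be recovered from the per-class recall and support. A short check, with the numbers copied from the report above:

```python
# Recall and support per class, copied from the classification report.
recall = {"Down": 0.694, "Up": 0.312}
support = {"Down": 111, "Up": 141}

# Recall * support gives the number of correctly classified days per class.
correct = sum(round(recall[c] * support[c]) for c in recall)
total = sum(support.values())

accuracy = correct / total
print(correct, total, round(1 - accuracy, 2))  # 121 252 0.52
```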

We recall that the logistic regression model had very underwhelming p-values associated with all of the predictors, and that the smallest p-value, though not very small, corresponded to Lag1. Perhaps by removing the variables that appear not to be helpful in predicting Direction, we can obtain a more effective model. After all, using predictors that have no relationship with the response tends to cause a deterioration in the test error rate (since such predictors cause an increase in variance without a corresponding decrease in bias), and so removing such predictors may in turn yield an improvement.

In the space below, refit a logistic regression using just Lag1 and Lag2, which seemed to have the highest predictive power in the original logistic regression model.

## (3) Trying other models...