# Stock Market Movement Prediction Using the Weekly S&P 500 Data in R

This dataset is named **Weekly** and captures weekly percentage returns for the S&P 500 stock index between 1990 and 2010.

This dataset has 1089 observations and 9 variables (8 predictors and 1 target).

The predictor variables are: 

- Year -- The year that the observation was recorded.
- Lag1 -- Percentage return for previous week.
- Lag2 -- Percentage return for 2 weeks previous.
- Lag3 -- Percentage return for 3 weeks previous.
- Lag4 -- Percentage return for 4 weeks previous.
- Lag5 -- Percentage return for 5 weeks previous.
- Volume -- Volume of shares traded (average number of daily shares traded in billions).
- Today -- Percentage return for this week.

The target variable is:

- Direction -- A factor with levels Down and Up indicating whether the market had a negative or positive return in a given week.
   
The source for this problem is: James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R, New York, NY: Springer-Verlag. ISBN: 978-1461471370. Chapter 4, p. 171, applied exercise number 10. <a href="http://www.StatLearning.com" target="_blank">Visit the Book Website</a>.


# Using An R Essentials Environment In Jupyter

One approach to quickly working with R in JupyterLab is to install the R essentials in the current Jupyter environment.

From a terminal window (aka command line window), execute the following:

       conda install -c r r-essentials

**Note**: this is done only once on your computer.

In [1]:
# set knitr options
knitr::opts_chunk$set(echo = TRUE)

In [3]:
# the dataset is contained in the ISLR library 
# if not already installed, install ISLR
install.packages("ISLR")
# load the ISLR library

library("ISLR")


Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
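
Calling `install.packages()` in a cell reinstalls the package every time the notebook runs. A minimal sketch of a guarded alternative follows; `ensure_package` is a hypothetical helper name introduced here for illustration and is not part of the original notebook.

```r
# Hypothetical helper (not from the original notebook): install a package
# only when it is not already available, then attach it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with base R, so this call never triggers an installation
ok <- ensure_package("stats")
```

In this notebook the call would presumably be `ensure_package("ISLR")`.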


In [4]:
# examine the structure of the dataset
str(Weekly)


'data.frame':	1089 obs. of  9 variables:
 $ Year     : num  1990 1990 1990 1990 1990 1990 1990 1990 1990 1990 ...
 $ Lag1     : num  0.816 -0.27 -2.576 3.514 0.712 ...
 $ Lag2     : num  1.572 0.816 -0.27 -2.576 3.514 ...
 $ Lag3     : num  -3.936 1.572 0.816 -0.27 -2.576 ...
 $ Lag4     : num  -0.229 -3.936 1.572 0.816 -0.27 ...
 $ Lag5     : num  -3.484 -0.229 -3.936 1.572 0.816 ...
 $ Volume   : num  0.155 0.149 0.16 0.162 0.154 ...
 $ Today    : num  -0.27 -2.576 3.514 0.712 1.178 ...
 $ Direction: Factor w/ 2 levels "Down","Up": 1 1 2 2 2 1 2 2 2 1 ...


(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

In [5]:
# check for correlation between the numeric variables (column 9, Direction, is qualitative and is excluded)
cor(Weekly[, -9])

| | Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today |
|---|---|---|---|---|---|---|---|---|
| Year | 1.0 | -0.032289274 | -0.03339001 | -0.03000649 | -0.031127923 | -0.030519101 | 0.84194162 | -0.032459894 |
| Lag1 | -0.03228927 | 1.0 | -0.07485305 | 0.05863568 | -0.071273876 | -0.008183096 | -0.06495131 | -0.075031842 |
| Lag2 | -0.03339001 | -0.074853051 | 1.0 | -0.07572091 | 0.058381535 | -0.072499482 | -0.08551314 | 0.059166717 |
| Lag3 | -0.03000649 | 0.058635682 | -0.07572091 | 1.0 | -0.075395865 | 0.060657175 | -0.06928771 | -0.071243639 |
| Lag4 | -0.03112792 | -0.071273876 | 0.05838153 | -0.07539587 | 1.0 | -0.075675027 | -0.06107462 | -0.007825873 |
| Lag5 | -0.0305191 | -0.008183096 | -0.07249948 | 0.06065717 | -0.075675027 | 1.0 | -0.05851741 | 0.011012698 |
| Volume | 0.84194162 | -0.064951313 | -0.08551314 | -0.06928771 | -0.061074617 | -0.058517414 | 1.0 | -0.033077783 |
| Today | -0.03245989 | -0.075031842 | 0.05916672 | -0.07124364 | -0.007825873 | 0.011012698 | -0.03307778 | 1.0 |


The correlation matrix shows that Volume and Year are highly correlated (r ≈ 0.84).
No other pair of variables shows a substantial correlation; all remaining coefficients are below 0.09 in absolute value.
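
Scanning a correlation matrix by eye gets tedious as the number of variables grows. The sketch below lists the variable pairs whose absolute correlation exceeds a cutoff. The helper name `strong_pairs` and the cutoff of 0.5 are assumptions made for illustration; the small matrix reuses three rows of the output above, rounded to four decimals.

```r
# Hypothetical helper: return variable pairs with |correlation| above a cutoff
strong_pairs <- function(cor_matrix, cutoff = 0.5) {
  # upper.tri() keeps each pair once and drops the diagonal
  idx <- which(abs(cor_matrix) > cutoff & upper.tri(cor_matrix), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cor_matrix)[idx[, "row"]],
    var2 = colnames(cor_matrix)[idx[, "col"]],
    r = cor_matrix[idx],
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Subset of the correlation matrix shown above (values rounded)
vars <- c("Year", "Lag1", "Volume")
cor_subset <- matrix(
  c( 1.0000, -0.0323,  0.8419,
    -0.0323,  1.0000, -0.0650,
     0.8419, -0.0650,  1.0000),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

pairs_found <- strong_pairs(cor_subset)
print(pairs_found)
```

On the full data the call would presumably be `strong_pairs(cor(Weekly[, -9]))`.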

In [None]:
# Attach Weekly data set 
attach(Weekly)

In [None]:
# plot the Volume
plot(Volume)

The plot shows that Volume trends upward over time: the average number of shares traded daily increased from 1990 to 2010.
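
One way to back up the visual impression with numbers is to average Volume within each year. The following is a minimal sketch on a toy data frame with made-up values (not taken from Weekly), using base R's `tapply()`.

```r
# Toy data with invented values, standing in for Weekly$Year and Weekly$Volume
toy <- data.frame(
  Year   = c(1990, 1990, 1991, 1991),
  Volume = c(0.10, 0.20, 0.30, 0.50)
)

# Mean volume per year; the result is named by year
yearly_volume <- tapply(toy$Volume, toy$Year, mean)
print(yearly_volume)
```

Applied to the real data, `tapply(Weekly$Volume, Weekly$Year, mean)` should give one mean per year from 1990 to 2010.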

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

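The glm() function fits generalized linear models, a class of models that includes logistic regression when family = "binomial". The formula Direction ~ . - Today uses every column except Today as a predictor, so Year enters the model alongside the five lag variables and Volume.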


In [None]:
# build the model
glm.fit = glm(Direction~.-Today,data=Weekly,family="binomial")

In [None]:
# summary of the logistic model
summary(glm.fit)

Lag2 is a statistically significant predictor: its p-value of 0.0275 is below the conventional 0.05 threshold, so an association this strong would be unlikely to arise by chance alone.
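
Rather than reading p-values off the printed summary, they can be pulled out of the coefficient table programmatically. The sketch below uses simulated data (the variables `x1`, `x2`, `y` and the 0.05 cutoff are assumptions for illustration, not part of the Weekly analysis) to show the pattern.

```r
# Simulated data: y depends on x1 but not on x2
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- rbinom(n, 1, plogis(2 * sim$x1))

fit <- glm(y ~ x1 + x2, data = sim, family = binomial)

# coef(summary(fit)) is a matrix with one row per coefficient
coef_table <- coef(summary(fit))
p_values <- coef_table[, "Pr(>|z|)"]

# Names of the coefficients with p-value below 0.05
significant <- names(p_values)[p_values < 0.05]
print(significant)
```

The same two lines applied to `glm.fit` would list the Weekly predictors that fall below the chosen threshold.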

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

In [None]:
# predict the probability that the market goes up; type = "response" returns probabilities of the form P(Y = 1 | X)
glm.probs = predict(glm.fit , type ="response")

In [None]:
# create a vector of class predictions based on whether the predicted probability 
# of a market increase is greater than or less than 0.5.
# start with a vector of "Down" labels, one per fitted probability
glm.pred = rep("Down" ,length(glm.probs))

In [None]:
# transform to Up all of the elements for which the predicted probability 
# of a market increase exceeds 0.5
glm.pred[glm.probs>0.5]="Up"

# produce a confusion matrix in order to determine how many observations were correctly or incorrectly classified
table(glm.pred,Weekly$Direction)

In [None]:
# fraction of weeks misclassified (56 and 558 are the correctly classified counts)
1-((56+558)/length(glm.probs))

In [None]:
# fraction of weeks for which the prediction was correct
mean(glm.pred==Direction)

The logistic regression misclassified 47 weeks that were actually Up as Down and 428 weeks that were actually Down as Up. The overall misclassification rate, which is the training error rate, is therefore high (43.6%). For the Up class alone the error is small: only 47 of the 605 Up weeks are misclassified.

The Down class is a different story. There are 428 + 56 = 484 Down weeks in total, and 428 of them are misclassified as Up, so the model predicts Up for most weeks regardless of the true direction.
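
The arithmetic behind these figures can be checked directly from the four counts quoted above. This is a minimal sketch that rebuilds the confusion matrix by hand (rows are predictions, columns are actual directions) instead of reading it from `table()`.

```r
# Counts quoted in the text: columns are actual classes, rows are predictions
cm <- matrix(
  c(56, 428,   # actual Down: predicted Down, predicted Up
    47, 558),  # actual Up:   predicted Down, predicted Up
  nrow = 2,
  dimnames = list(predicted = c("Down", "Up"), actual = c("Down", "Up"))
)

# Overall accuracy: correct predictions sit on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
error_rate <- 1 - accuracy

# Per-class accuracy: correct predictions divided by the actual class totals
class_accuracy <- diag(cm) / colSums(cm)

print(round(c(accuracy = accuracy, error = error_rate), 4))
print(round(class_accuracy, 4))
```

Using `sum(diag(cm))` avoids hard-coding the diagonal counts, so the same lines work for any 2 x 2 confusion matrix with this layout.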

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

In [None]:
# create a vector corresponding to the observations with year less than 2009 (from 1990 to 2008)
train = (Year<2009)

In [None]:
# held-out observations from 2009 and 2010

Weekly.20092010 = Weekly[!train,]

Direction.20092010 = Direction [!train]

dim(Weekly.20092010)

In [None]:
# logistic regression model with Lag2 as the only predictor

glm.fit1 = glm(Direction~Lag2,data=Weekly,family=binomial,subset=train)

In [None]:
# explore the model
summary(glm.fit1)

In [None]:
# compute probabilities for test/validation data
glm.probabilities = predict(glm.fit1, Weekly.20092010, type="response")

In [None]:
# initially, set all predictions to Down
glm.predictions = rep("Down", length(glm.probabilities))

In [None]:
# if the probability > 0.5, set prediction to Up
glm.predictions[glm.probabilities > 0.5] = "Up"

In [None]:
# compute confusion matrix
confusion_matrix = table(glm.predictions, Direction.20092010)

In [None]:
# print confusion matrix
confusion_matrix

In [None]:
# compute Down accuracy
confusion_matrix[1,1]/(confusion_matrix[1,1] + confusion_matrix[2,1])

In [None]:
# compute Up accuracy
confusion_matrix[2,2] / (confusion_matrix[1,2] + confusion_matrix[2,2])

In [None]:
# mean of correct predictions
mean(glm.predictions == Direction.20092010)

In [None]:
# mean of incorrect predictions
mean(glm.predictions != Direction.20092010)

The logistic model classifies 62.5% of the test weeks correctly, for a test error rate of 37.5%. For weeks when the market goes up, the model is right 91.8% of the time; for weeks when the market goes down, it is right only 20.9% of the time.
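
The thresholding steps used in (c) and (d) (fill a vector with "Down", then overwrite with "Up" where the probability exceeds 0.5) can be collapsed into one function. The sketch below is one possible way to do it; `classify` is a hypothetical helper name, and the example probabilities are invented.

```r
# Hypothetical helper: turn probabilities into a two-level factor
classify <- function(probs, cutoff = 0.5, levels = c("Down", "Up")) {
  # Probabilities strictly above the cutoff map to the second level
  factor(ifelse(probs > cutoff, levels[2], levels[1]), levels = levels)
}

example_probs <- c(0.42, 0.50, 0.61)
example_pred <- classify(example_probs)
print(example_pred)
```

Returning a factor with both levels fixed means `table()` always produces a 2 x 2 confusion matrix, even when a model never predicts one of the classes, so indexing such as `confusion_matrix[2,1]` does not fail.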

(e) Repeat (d) using linear discriminant analysis (LDA).

In [None]:
# LDA is in MASS library
library (MASS)

In [None]:
# build the LDA model using training data
lda.model = lda(Direction~Lag2, data = Weekly, subset = train)

In [None]:
# print the model
lda.model

In [None]:
# plot the model
plot(lda.model)

In [None]:
# make predictions on the test/validation data
lda.predictions = predict(lda.model, Weekly.20092010)

In [None]:
# classes predicted by the model
lda.class = lda.predictions$class

In [None]:
# compute confusion matrix
confusion_matrix = table(lda.class,Direction.20092010)

In [None]:
# print confusion matrix
confusion_matrix

In [None]:
# compute Down accuracy
confusion_matrix[1,1]/(confusion_matrix[1,1] + confusion_matrix[2,1])

In [None]:
# compute Up accuracy
confusion_matrix[2,2] / (confusion_matrix[1,2] + confusion_matrix[2,2])

In [None]:
# mean of correct predictions
mean(lda.class == Direction.20092010)

In [None]:
# mean of incorrect predictions
mean(lda.class != Direction.20092010)

The confusion matrices obtained for LDA and logistic regression are identical, so LDA has the same test accuracy as logistic regression.
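
When computing accuracy for LDA, keep in mind that `predict()` on an `lda` fit returns a list rather than a vector of labels, so the comparison has to use its `class` component. The sketch below illustrates this on simulated data; the data frame `sim` and its columns are assumptions made for the example.

```r
library(MASS)

# Simulated two-class data with one predictor
set.seed(1)
n <- 100
sim <- data.frame(
  x = c(rnorm(n, mean = -1), rnorm(n, mean = 1)),
  y = factor(rep(c("Down", "Up"), each = n))
)

fit <- lda(y ~ x, data = sim)
pred <- predict(fit, sim)

# The prediction object is a list with three components
print(names(pred))

# Accuracy is computed from the class component, not from the list itself
accuracy <- mean(pred$class == sim$y)
print(accuracy)
```

The `posterior` component holds the class probabilities, which is useful when a cutoff other than 0.5 is of interest.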

(f) Repeat (d) using quadratic discriminant analysis (QDA).

In [None]:

# QDA is also in the MASS library
library (MASS)

In [None]:
# build QDA model using the training data
qda.model = qda(Direction~Lag2, data = Weekly, subset = train)

In [None]:
# explore the QDA model
qda.model

In [None]:
# make predictions using QDA
qda.predictions = predict(qda.model, Weekly.20092010)

In [None]:
# compute confusion matrix
confusion_matrix = table(qda.predictions$class, Direction.20092010)

In [None]:
# print confusion matrix
confusion_matrix

In [None]:
# compute Down accuracy
confusion_matrix[1,1] / (confusion_matrix[1,1] + confusion_matrix[2,1])

In [None]:
# compute Up accuracy
confusion_matrix[2,2] / (confusion_matrix[1,2] + confusion_matrix[2,2])

In [None]:
# mean of correct predictions
mean(qda.predictions$class == Direction.20092010)

In [None]:
# mean of incorrect predictions
mean(qda.predictions$class != Direction.20092010)

With QDA, 58.65% of the test predictions are correct, for a test error rate of 41.35%. For weeks when the market goes up, the model is right 100% of the time; for weeks when the market goes down, it is right 0% of the time. In other words, QDA predicts Up for every week in the test set.
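
A useful reference point for these numbers is the accuracy of a rule that always predicts the most common class. The sketch below computes it for a test set with 43 Down weeks and 61 Up weeks. These counts are not printed anywhere in the notebook; they are inferred from the percentages reported for parts (d) and (f) (for example 56/61 ≈ 91.80% and 9/43 ≈ 20.93%), so treat them as an assumption.

```r
# Test-set directions, with class counts inferred from the reported percentages
test_direction <- factor(
  rep(c("Down", "Up"), times = c(43, 61)),
  levels = c("Down", "Up")
)

class_counts <- table(test_direction)

# The majority class and the accuracy of always predicting it
majority_class <- names(class_counts)[which.max(class_counts)]
baseline_accuracy <- max(class_counts) / sum(class_counts)

print(majority_class)
print(round(baseline_accuracy, 4))
```

Under these counts the baseline comes out at 61/104, the same fraction as the QDA accuracy, which is what one would expect from a model that predicts Up for every test week.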

(g) Repeat (d) using kNN with k = 1.

In [None]:

# knn() is in the class library
library (class)

In [None]:

# training predictor (Lag2) as a one-column matrix
train.X = as.matrix(Lag2[train])

In [None]:

# test predictor (Lag2) as a one-column matrix
test.X = as.matrix(Lag2[!train])

In [None]:

# class labels for the training observations
train.Direction = Direction[train]

In [None]:
# for reproducibility of results
set.seed(1)

In [None]:
# make predictions using kNN
knn.predictions = knn(train.X, test.X, train.Direction, k = 1)

In [None]:
# compute confusion matrix
confusion_matrix = table(knn.predictions, Direction.20092010)

In [None]:
# print confusion matrix
confusion_matrix

In [None]:
# compute Down accuracy
confusion_matrix[1,1] / (confusion_matrix[1,1] + confusion_matrix[2,1])

In [None]:
# compute Up accuracy
confusion_matrix[2,2] / (confusion_matrix[1,2] + confusion_matrix[2,2])

In [None]:
# mean of correct predictions
mean(knn.predictions == Direction.20092010)

In [None]:
# mean of incorrect predictions
mean(knn.predictions != Direction.20092010)

With kNN (k = 1), 50% of the test predictions are correct, so the test error rate is also 50%. For weeks when the market goes up, the model is right 50.8% of the time; for weeks when the market goes down, it is right 48.8% of the time.
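
Because k = 1 is only one choice, it can help to compare several values of k in one pass. The sketch below does this on simulated data with `knn()` from the class package; the data, the variable names, and the candidate values of k are all assumptions made for illustration.

```r
library(class)

# Simulated one-predictor training and test sets
set.seed(1)
n <- 100
train_x <- matrix(c(rnorm(n, -1), rnorm(n, 1)), ncol = 1)
train_y <- factor(rep(c("Down", "Up"), each = n))
test_x  <- matrix(c(rnorm(n, -1), rnorm(n, 1)), ncol = 1)
test_y  <- factor(rep(c("Down", "Up"), each = n))

# Test accuracy for each candidate k
k_values <- c(1, 5, 20)
accuracy_by_k <- sapply(k_values, function(k) {
  pred <- knn(train_x, test_x, train_y, k = k)
  mean(pred == test_y)
})
names(accuracy_by_k) <- paste0("k=", k_values)

print(accuracy_by_k)
```

Picking k by its accuracy on the held-out data tends to give an optimistic estimate, so a separate validation split or cross-validation is usually the safer way to choose it.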

(h) Which of these methods appears to provide the best results on this data?

Logistic regression and LDA appear to provide the best results on this data, with the lowest test error rate (37.5%) of the four methods.

(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for k in the kNN classifier.
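
The models that follow use the formula term `Lag2:Lag1`. In R's formula syntax, `:` adds only the product term, whereas `*` adds both main effects plus the product. The sketch below makes the difference visible by inspecting the design-matrix columns for a toy data frame with invented values.

```r
# Toy data with invented values
toy <- data.frame(Lag1 = c(1, -2, 3), Lag2 = c(0.5, 1.5, -1))

# `:` keeps only the interaction (product) column
interaction_only <- colnames(model.matrix(~ Lag2:Lag1, data = toy))

# `*` expands to Lag2 + Lag1 + Lag2:Lag1
with_main_effects <- colnames(model.matrix(~ Lag2 * Lag1, data = toy))

print(interaction_only)
print(with_main_effects)
```

So `Direction ~ Lag2:Lag1` fits a model with a single predictor, the product of the two lags; `Direction ~ Lag2 * Lag1` would be the variant that also keeps each lag on its own.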

In [None]:
# logistic regression with interaction of Lag2 and Lag1

# build the model
glm.model.2 = glm(Direction ~ Lag2:Lag1, data=Weekly, family=binomial, subset=train)

In [None]:
# make predictions using glm.model.2
glm.probabilities.2 = predict(glm.model.2, Weekly.20092010, type="response")

In [None]:
# initially, all predictions are set to Down
glm.predictions.2 = rep("Down", length(glm.probabilities.2))

In [None]:
# if the probability > 0.5, set prediction value to Up
glm.predictions.2[glm.probabilities.2 > 0.5] = "Up"

In [None]:
# print glm predictions
glm.predictions.2

In [None]:
# compute confusion matrix
confusion_matrix = table(glm.predictions.2, Direction.20092010)

In [None]:
# print confusion matrix
confusion_matrix

In [None]:
# compute Down accuracy
confusion_matrix[1,1] / (confusion_matrix[1,1] + confusion_matrix[2,1])

In [None]:
# compute Up accuracy
confusion_matrix[2,2] / (confusion_matrix[1,2] + confusion_matrix[2,2])

In [None]:
# mean of correct predictions
mean(glm.predictions.2 == Direction.20092010)

In [None]:
# mean of incorrect predictions
mean(glm.predictions.2 != Direction.20092010)

In [None]:
# LDA with Lag2 and Lag1 interaction

# load library
library (MASS)

In [None]:
# build the model
lda.model.1 = lda(Direction~Lag2:Lag1, data=Weekly, subset=train)

In [None]:
# explore the LDA model
lda.model.1

In [None]:
# plot the LDA model
plot(lda.model.1)

In [None]:
# predict with LDA model
lda.prediction.1 = predict(lda.model.1, Weekly.20092010)

In [None]:
# isolate the predicted classes
lda.class = lda.prediction.1$class

In [None]:
# compute confusion matrix
confusion_matrix = table(lda.class, Direction.20092010)

In [None]:
# print confusion matrix
confusion_matrix

In [None]:
# compute Down accuracy
confusion_matrix[1,1] / (confusion_matrix[1,1] + confusion_matrix[2,1])

In [None]:
# compute Up accuracy
confusion_matrix[2,2] / (confusion_matrix[1,2] + confusion_matrix[2,2])

In [None]:
# mean of correct predictions
mean(lda.class == Direction.20092010)

In [None]:
# mean of incorrect predictions
mean(lda.class != Direction.20092010)

In [None]:
# QDA with sqrt(abs(Lag2))

# load library
library (MASS)

In [None]:
# build the QDA model with training data
qda.model.1 = qda(Direction~Lag2+sqrt(abs(Lag2)), data=Weekly, subset=train)

In [None]:
# explore the model
qda.model.1

In [None]:
# make predictions on the test/validation data
qda.predictions.1 = predict(qda.model.1, Weekly.20092010)

In [None]:
# compute confusion matrix
confusion_matrix = table(qda.predictions.1$class, Direction.20092010)

In [None]:
# print confusion matrix
confusion_matrix

In [None]:
# compute Down accuracy
confusion_matrix[1,1] / (confusion_matrix[1,1] + confusion_matrix[2,1])

In [None]:
# compute Up accuracy
confusion_matrix[2,2] / (confusion_matrix[1,2] + confusion_matrix[2,2])

In [None]:
# mean of correct predictions
mean(qda.predictions.1$class == Direction.20092010)

In [None]:
# mean of incorrect predictions
mean(qda.predictions.1$class != Direction.20092010)

In [None]:
# kNN for k = 20 

# load library
library (class)

In [None]:
# isolate training data
train.X = as.matrix(Lag2[train])

In [None]:
# isolate test data
test.X = as.matrix(Lag2[!train])

In [None]:

# isolate training labels
train.Direction = Direction[train]

In [None]:
# for experimental reproducibility
set.seed(1)

In [None]:
# make predictions
knn.predictions = knn(train.X, test.X, train.Direction, k = 20)

In [None]:
# compute confusion matrix
confusion_matrix = table(knn.predictions, Direction.20092010)

In [None]:
# print confusion matrix
confusion_matrix

In [None]:
# compute Down accuracy
confusion_matrix[1,1] / (confusion_matrix[1,1] + confusion_matrix[2,1])

In [None]:
# compute Up accuracy
confusion_matrix[2,2] / (confusion_matrix[1,2] + confusion_matrix[2,2])

In [None]:
# mean of correct predictions
mean(knn.predictions == Direction.20092010)

In [None]:
# mean of incorrect predictions
mean(knn.predictions != Direction.20092010)