# Assignment 3

## Imports
It is good practice to put all (or most) imports at the top of a notebook so that other users can see which packages need to be installed. Many Python projects also include a file listing the required packages, typically `requirements.txt` for pip or `environment.yml` for Anaconda.

First, let's import all the necessary modules using the `import` statement. For this exercise, we will continue to use pandas, numpy, and matplotlib. We will also be using statsmodels to conduct statistical analysis.

To learn more about these packages, you can read through the documentation:  
https://pandas.pydata.org/  
https://numpy.org/  
https://matplotlib.org/  
https://www.statsmodels.org/stable/index.html  

In [96]:
# Let's import the relevant Python packages here
# Feel free to import any other packages for this project

# Data Wrangling
import pandas as pd
import numpy as np

# Statistics
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Plotting
import matplotlib.pyplot as plt

%matplotlib inline

## Data
This dataset contains 1,089 weekly stock market percentage returns for 21 years, from the beginning of 1990 to the end of 2010.

| Column | Description |
|:-|:-|
|Year | The year that the observation was recorded|
|Lag1 | Percentage return for the previous week|
|Lag2 | Percentage return for 2 weeks previous|
|Lag3 | Percentage return for 3 weeks previous|
|Lag4 | Percentage return for 4 weeks previous|
|Lag5 | Percentage return for 5 weeks previous|
|Volume | Volume of shares traded (average number of daily shares traded in billions)|
|Today | Percentage return for this week|
|Direction | A factor with levels Down and Up indicating whether the market had a positive or negative return on a given week|

Once again, we will begin by loading the dataset using pandas.

In [97]:
weekly = pd.read_csv("Weekly.csv")
weekly.head()

Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
0,1990,0.816,1.572,-3.936,-0.229,-3.484,0.154976,-0.27,Down
1,1990,-0.27,0.816,1.572,-3.936,-0.229,0.148574,-2.576,Down
2,1990,-2.576,-0.27,0.816,1.572,-3.936,0.159837,3.514,Up
3,1990,3.514,-2.576,-0.27,0.816,1.572,0.16163,0.712,Up
4,1990,0.712,3.514,-2.576,-0.27,0.816,0.153728,1.178,Up


1. First, transform our `Direction` variable into a numerical feature that is equal to 1 if `Direction = Up` and 0 otherwise.

In [98]:
# your code here
weekly['Direction'] = weekly['Direction'].map({'Up': 1, 'Down': 0})
weekly.head()

Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction
0,1990,0.816,1.572,-3.936,-0.229,-3.484,0.154976,-0.27,0
1,1990,-0.27,0.816,1.572,-3.936,-0.229,0.148574,-2.576,0
2,1990,-2.576,-0.27,0.816,1.572,-3.936,0.159837,3.514,1
3,1990,3.514,-2.576,-0.27,0.816,1.572,0.16163,0.712,1
4,1990,0.712,3.514,-2.576,-0.27,0.816,0.153728,1.178,1


You may now want to produce several numerical and graphical summaries of the `Weekly` data and check for patterns. (Hint: see if you can find a correlation between `Year` and `Volume`.)

In [99]:
# your code here
print("\nBasic Descriptive Statistics:")
print(weekly.describe())               

print("\nCorrelation between Year and Volume:\n")
print(weekly[['Year','Volume']].corr())

print("\nSummary of Direction:\n")
print(weekly['Direction'].value_counts())


Basic Descriptive Statistics:
              Year         Lag1         Lag2         Lag3         Lag4  \
count  1089.000000  1089.000000  1089.000000  1089.000000  1089.000000   
mean   2000.048669     0.150585     0.151079     0.147205     0.145818   
std       6.033182     2.357013     2.357254     2.360502     2.360279   
min    1990.000000   -18.195000   -18.195000   -18.195000   -18.195000   
25%    1995.000000    -1.154000    -1.154000    -1.158000    -1.158000   
50%    2000.000000     0.241000     0.241000     0.241000     0.238000   
75%    2005.000000     1.405000     1.409000     1.409000     1.409000   
max    2010.000000    12.026000    12.026000    12.026000    12.026000   

              Lag5       Volume        Today    Direction  
count  1089.000000  1089.000000  1089.000000  1089.000000  
mean      0.139893     1.574618     0.149899     0.555556  
std       2.361285     1.686636     2.356927     0.497132  
min     -18.195000     0.087465   -18.195000     0.000000  
25
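
The hint about `Year` and `Volume` can be explored with a per-year average alongside the correlation. The sketch below uses a small made-up frame (`toy`, not the real `Weekly.csv`) so that it runs on its own; on the actual data one would presumably substitute `weekly` for `toy`.

```python
import pandas as pd

# Toy stand-in for the Weekly data; these values are made up for illustration
toy = pd.DataFrame({
    "Year":   [1990, 1990, 1991, 1991, 1992, 1992],
    "Volume": [0.15, 0.16, 0.20, 0.22, 0.30, 0.34],
})

# Average weekly volume per year: a steadily rising sequence suggests a time trend
volume_by_year = toy.groupby("Year")["Volume"].mean()
print(volume_by_year)

# Pearson correlation between Year and Volume (a single number in [-1, 1])
year_volume_corr = toy["Year"].corr(toy["Volume"])
print(round(year_volume_corr, 3))
```

A correlation close to 1 together with rising yearly means would indicate that trading volume grew over the sample period.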

2. Use the full dataset to perform logistic regression with `Direction` as the response and the 5 `Lag` variables as predictors.

In [100]:
# your code here
#raise NotImplementedError
# Predictors are the five lag variables; the response is the 0/1 Direction column

X = weekly[['Lag1','Lag2','Lag3', 'Lag4', 'Lag5']]
y = weekly['Direction']

#from sklearn.linear_model import LogisticRegression
#logreg = LogisticRegression()
#logreg.fit(X,y)
#print("Model Coefficients:", logreg.coef_)

X = sm.add_constant(X)
logit = sm.Logit(y,X)
result = logit.fit()
print(result.summary())

Optimization terminated successfully.
         Current function value: 0.682615
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:              Direction   No. Observations:                 1089
Model:                          Logit   Df Residuals:                     1083
Method:                           MLE   Df Model:                            5
Date:                Fri, 15 Nov 2024   Pseudo R-squ.:                0.006327
Time:                        04:46:55   Log-Likelihood:                -743.37
converged:                       True   LL-Null:                       -748.10
Covariance Type:            nonrobust   LLR p-value:                   0.09186
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.2303      0.062      3.712      0.000       0.109       0.352
Lag1          -0.0401      0.

How many variables are statistically significant? Store your result in a variable named `num_significant`. (Hint: use the summary() function)

In [101]:
# your code here

#Lag2           0.0602      0.027      2.249      0.025       0.008       0.113
# The Lag2 p-value is < 0.05, so it is statistically significant (the other p-values are much larger than 0.05)
num_significant = 1 # Save your solution in this variable

In [102]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert num_significant == 1, "Wrong value for num_significant, try again.."

Save any variables that are statistically significant in a list named `var_significant`.

In [103]:
# your code here
var_significant = ['Lag2'] # Save your solution in this variable
#raise NotImplementedError

In [104]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert len(var_significant) == 1, "There should be num_significant entries in the list"
assert var_significant[0] == 'Lag2', "That is not the correct variable, try again.."
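
To make the `Lag2` estimate more concrete, a logistic regression coefficient can be exponentiated to give an odds ratio. The sketch below only reuses the rounded numbers quoted in the solution comment above (coefficient 0.0602, 95% interval 0.008 to 0.113); it does not refit the model.

```python
import numpy as np

# Rounded values copied from the Lag2 row of the summary quoted above
lag2_coef = 0.0602
lag2_ci = (0.008, 0.113)

# exp(coefficient) is the multiplicative change in the odds of "Up"
# for a one-unit (one percentage point) increase in Lag2
odds_ratio = np.exp(lag2_coef)
odds_ratio_ci = tuple(np.exp(bound) for bound in lag2_ci)

print(f"odds ratio: {odds_ratio:.3f}")
print(f"95% interval: ({odds_ratio_ci[0]:.3f}, {odds_ratio_ci[1]:.3f})")
```

An odds ratio slightly above 1 whose interval excludes 1 is consistent with a small positive and statistically significant effect, holding the other lags fixed.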

3. Compute the overall fraction of correct predictions. Store your result in a variable named `fraction_correct`.

In [105]:
# your code here

y_pred = result.predict(X)
#print(y_pred)
y_pred = (y_pred > 0.5).astype(int)

correct_predictions = (y == y_pred).sum()
total               = len(y)
fraction_correct = correct_predictions/total # Save your solution in this variable
print(fraction_correct)

0.5629017447199265


In [106]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert np.allclose(fraction_correct, 0.5629017447199265, .001), "Incorrect result, try again.. (hint: use the predict() function)"
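
The fraction of correct predictions is the diagonal of the confusion matrix divided by the total count, and it is easier to judge next to a naive baseline that always predicts the more common class. The sketch below illustrates both on made-up labels and probabilities (`y_true` and `probs` are hypothetical, not taken from the fitted model). For reference, the mean of `Direction` in the descriptive statistics is about 0.556, so always predicting "Up" would already be right roughly 55.6% of the time on the full data.

```python
import numpy as np
import pandas as pd

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
probs = np.array([0.62, 0.55, 0.48, 0.71, 0.44, 0.58, 0.51, 0.53])
y_hat = (probs > 0.5).astype(int)

# Rows are the actual classes, columns are the predicted classes
confusion = pd.crosstab(
    pd.Series(y_true, name="Actual"),
    pd.Series(y_hat, name="Predicted"),
)
print(confusion)

# Correct predictions sit on the diagonal of the confusion matrix
counts = confusion.to_numpy()
accuracy = np.trace(counts) / counts.sum()

# Baseline: always predict whichever class is more frequent
baseline = max(y_true.mean(), 1 - y_true.mean())
print(accuracy, baseline)
```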

5. Now fit the logistic regression model using a training data period from 1990 to 2007, with `Lag2` as the only predictor.

Compute the overall fraction of correct predictions for the held out data (that is, the data from 2008, 2009 and 2010) and store it in a variable named `fraction_correct_test`. 

In [107]:
### We have split the data for you
train = weekly[weekly['Year'] <= 2007]
test = weekly[weekly['Year'] > 2007]
#fraction_correct_test = None 

# your code here
X_train = train[['Lag2']] # only Lag2 as predictor
y_train = train['Direction']

X_test = test[['Lag2']]
y_test = test['Direction']

# Note: no sm.add_constant here, so this model has no intercept term
logit_train = sm.Logit(y_train,X_train)
result      = logit_train.fit()
#print(result.summary())

y_pred_test = result.predict(X_test)
y_pred_test = (y_pred_test > 0.5).astype(int)

correct_predictions_test = (y_test == y_pred_test).sum()
total_test               = len(y_test)
fraction_correct_test    = correct_predictions_test/total_test # Save your solution in this variable

#raise NotImplementedError
fraction_correct_test

Optimization terminated successfully.
         Current function value: 0.691471
         Iterations 4


0.5512820512820513

In [108]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert np.allclose(fraction_correct_test, 0.5512820512820513, .001), "Incorrect result, try again.. (hint: use the predict() function)"
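
The training cell above passes `train[['Lag2']]` straight to `sm.Logit` without `sm.add_constant`, so the fitted model has a slope but no intercept. One consequence, sketched below with plain NumPy, is that thresholding at 0.5 reduces to checking the sign of `Lag2` (when the slope is positive). The coefficients `beta` and `beta0` are hypothetical placeholders, not the fitted estimates; the `Lag2` values are the first five rows shown earlier.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Lag2 values from the first five rows of the dataset
lag2 = np.array([1.572, 0.816, -0.27, -2.576, 3.514])

beta = 0.06   # hypothetical positive slope
beta0 = 0.2   # hypothetical intercept

# Without an intercept, P(Up) > 0.5 exactly when beta * Lag2 > 0
pred_no_intercept = (sigmoid(beta * lag2) > 0.5).astype(int)

# With an intercept, the decision boundary shifts to Lag2 > -beta0 / beta
pred_with_intercept = (sigmoid(beta0 + beta * lag2) > 0.5).astype(int)

print(pred_no_intercept)
print(pred_with_intercept)
```

Under these assumed values the two models disagree on the weeks with a small negative `Lag2`, which is why including or omitting the constant can change the held-out accuracy.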

Now, we want to develop an investment strategy in which we buy if the returns are greater than
$0.5\%$ and sell otherwise.

6. Create a response variable named `Response` such that

$$
\text{Response}_i = \begin{cases}
1 & \text{if } \text{Today}_i > 0.5 \\
0 & \text{otherwise}
\end{cases}
$$

In [109]:
# your code here
#raise NotImplementedError
weekly['Response'] = (weekly['Today'] > 0.50).astype(int)
weekly.head()

Unnamed: 0,Year,Lag1,Lag2,Lag3,Lag4,Lag5,Volume,Today,Direction,Response
0,1990,0.816,1.572,-3.936,-0.229,-3.484,0.154976,-0.27,0,0
1,1990,-0.27,0.816,1.572,-3.936,-0.229,0.148574,-2.576,0,0
2,1990,-2.576,-0.27,0.816,1.572,-3.936,0.159837,3.514,1,1
3,1990,3.514,-2.576,-0.27,0.816,1.572,0.16163,0.712,1,1
4,1990,0.712,3.514,-2.576,-0.27,0.816,0.153728,1.178,1,1
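
Because the definition uses a strict inequality, a week whose return is exactly 0.5 gets `Response = 0`. A minimal check of that boundary case, using a small hand-built series (the value 0.5 is added purely for illustration; the others appear in the rows above):

```python
import pandas as pd

# -0.27, 0.712 and 3.514 appear in the data above; 0.5 is an illustrative boundary value
today = pd.Series([-0.27, 0.5, 0.712, 3.514])

# Strictly greater than 0.5 maps to 1, everything else (including 0.5) maps to 0
response = (today > 0.5).astype(int)
print(response.tolist())
```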


7. Fit a logistic regression model to predict `Response` using a training data period from 1990 to 2008, with the five lag variables and `Volume` as predictors.

In [110]:
### We have split the data for you
train = weekly[weekly['Year'] <= 2008]
test = weekly[weekly['Year'] > 2008]

X_train = train[['Lag1','Lag2','Lag3','Lag4','Lag5','Volume']] 
X_train = sm.add_constant(X_train)
y_train = train['Response']

X_test = test[['Lag1','Lag2','Lag3','Lag4','Lag5','Volume']] 
X_test = sm.add_constant(X_test)
y_test = test['Response']

logit_train = sm.Logit(y_train,X_train)
result      = logit_train.fit()

result.summary()

Optimization terminated successfully.
         Current function value: 0.681276
         Iterations 4


Dep. Variable:               Response   No. Observations:                985.0
Model:                          Logit   Df Residuals:                    978.0
Method:                           MLE   Df Model:                          6.0
Date:                Fri, 15 Nov 2024   Pseudo R-squ.:                0.008988
Time:                        04:46:58   Log-Likelihood:                -671.06
converged:                       True   LL-Null:                       -677.14
Covariance Type:            nonrobust   LLR p-value:                   0.05825
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0780      0.094     -0.831      0.406      -0.262       0.106
Lag1          -0.0737      0.029     -2.500      0.012      -0.132      -0.016
Lag2           0.0293      0.030      0.985      0.325      -0.029       0.088
Lag3          -0.0147      0.029     -0.503      0.615      -0.072       0.043
Lag4          -0.0246      0.029     -0.840      0.401      -0.082       0.033
Lag5          -0.0314      0.029     -1.072      0.284      -0.089       0.026
Volume        -0.1039      0.055     -1.886      0.059      -0.212       0.004


How many variables are statistically significant? Store your result in a variable named `num_significant_B`. (Hint: use the summary() function)

In [111]:
# your code here
# Only "Lag1" is statistically significant as p value 0.012 < 0.05!
num_significant_B = 1 # Save your solution in this variable

#raise NotImplementedError

In [112]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert num_significant_B == 1, "Wrong value for num_significant_B, try again.."

Save any variables that are statistically significant in a list named `var_significant_B`.

In [113]:
# your code here
var_significant_B = ['Lag1'] # Save your solution in this variable

#raise NotImplementedError

In [114]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert len(var_significant_B) == 1, "There should be num_significant_B entries in the list"
assert var_significant_B[0] == 'Lag1', "That is not the correct variable, try again.."
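
The answers above were read off the summary table by eye. The same selection can be sketched programmatically; the p-values below are the rounded `P>|z|` entries copied from that table, and the 0.05 cutoff (`alpha`) is the conventional threshold used in the solution. With a fitted statsmodels result, the unrounded values are available as the Series `result.pvalues`.

```python
import pandas as pd

# Rounded P>|z| values copied from the regression summary above
pvalues = pd.Series({
    "const": 0.406, "Lag1": 0.012, "Lag2": 0.325, "Lag3": 0.615,
    "Lag4": 0.401, "Lag5": 0.284, "Volume": 0.059,
})

alpha = 0.05  # significance level

# Exclude the intercept, then keep predictors whose p-value is below alpha
predictors = pvalues.drop("const")
significant = predictors[predictors < alpha].index.tolist()

print(len(significant), significant)
```

Note that `Volume` (p = 0.059) narrowly misses the 5% threshold, so the count depends on the chosen significance level.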

Compute the overall fraction of correct predictions for the held out data (that is, the data
from 2009 and 2010). Store this value in a variable named `fraction_correct_B`.

In [115]:
# your code here

y_pred_test = result.predict(X_test)
y_pred_test = (y_pred_test > 0.5).astype(int)

correct_predictions_test = (y_test == y_pred_test).sum()
total_test               = len(y_test)
fraction_correct_B       = correct_predictions_test/total_test # Save your solution in this variable

fraction_correct_B
#raise NotImplementedError

0.5

In [116]:
##########################
### TEST YOUR SOLUTION ###
##########################

assert np.allclose(fraction_correct_B, 0.5, .001), "Incorrect result, try again.. (hint: use the predict() function)"
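
The threshold-and-compare steps are repeated in three cells of this notebook, so they could be collected in a small helper. The function name `fraction_correct_from_probs` and the demo inputs are hypothetical, introduced here only as a sketch.

```python
import numpy as np

def fraction_correct_from_probs(probs, y_true, threshold=0.5):
    """Share of observations whose thresholded probability matches the 0/1 label."""
    probs = np.asarray(probs, dtype=float)
    y_true = np.asarray(y_true)
    y_hat = (probs > threshold).astype(int)
    return float((y_hat == y_true).mean())

# Made-up probabilities and labels, for illustration only
demo_probs = [0.61, 0.47, 0.52, 0.38]
demo_truth = [1, 0, 0, 1]
print(fraction_correct_from_probs(demo_probs, demo_truth))
```

In this notebook it would be called as, for example, `fraction_correct_from_probs(result.predict(X_test), y_test)`.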