Q1. Use the full data set to perform a logistic regression with Direction
as the response and the five lag variables plus Volume as predictors. Use the
summary function to print the results. Do any of the predictors appear to be
statistically significant? If so, which ones?

In [2]:
# Upload file
from google.colab import files
uploaded = files.upload()

# Import libraries
import pandas as pd
import statsmodels.formula.api as smf

# Load and preview the dataset
df = pd.read_csv("StockMarket.csv")  # Adjust filename if different
print(df.head())

# Convert 'Direction' to binary (Up = 1, Down = 0)
df['Direction'] = df['Direction'].map({'Up': 1, 'Down': 0})

# Fit logistic regression model with Lag1–Lag5 and Volume as predictors
logit_model = smf.logit("Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume", data=df).fit()

# Print the summary
print(logit_model.summary())

Saving StockMarket.csv to StockMarket (1).csv
   Year   Lag1   Lag2   Lag3   Lag4   Lag5    Volume  Today Direction
0  1990  0.816  1.572 -3.936 -0.229 -3.484  0.154976 -0.270      Down
1  1990 -0.270  0.816  1.572 -3.936 -0.229  0.148574 -2.576      Down
2  1990 -2.576 -0.270  0.816  1.572 -3.936  0.159837  3.514        Up
3  1990  3.514 -2.576 -0.270  0.816  1.572  0.161630  0.712        Up
4  1990  0.712  3.514 -2.576 -0.270  0.816  0.153728  1.178        Up
Optimization terminated successfully.
         Current function value: 0.682441
         Iterations 4
                           Logit Regression Results                           
Dep. Variable:              Direction   No. Observations:                 1089
Model:                          Logit   Df Residuals:                     1082
Method:                           MLE   Df Model:                            6
Date:                Thu, 03 Apr 2025   Pseudo R-squ.:                0.006580
Time:                        21:23:06
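
The coefficient table of the summary is not shown above; the Q3 cell further down treats Lag2 as the significant predictor. As a minimal sketch of how significant predictors could be listed programmatically rather than read off the table, the fitted results object exposes a `pvalues` Series. The synthetic data, the column names `x_signal` and `x_noise`, and the 0.05 cutoff are assumptions for illustration, not part of the assignment.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (hypothetical): only x_signal drives the outcome.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "x_signal": rng.normal(size=n),
    "x_noise": rng.normal(size=n),
})
p_up = 1.0 / (1.0 + np.exp(-2.0 * demo["x_signal"]))
demo["y"] = (rng.random(n) < p_up).astype(int)

# Fit quietly and pull the p-values from the results object.
demo_model = smf.logit("y ~ x_signal + x_noise", data=demo).fit(disp=0)
pvals = demo_model.pvalues.drop("Intercept")

# Keep predictors whose p-value falls below the assumed 0.05 cutoff.
significant = pvals[pvals < 0.05].index.tolist()
print(pvals.round(4))
print("Significant at 0.05:", significant)
```

Applying the same filter to `logit_model.pvalues` would list the significant predictors for the model fitted in this cell.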

Q2. Compute the confusion matrix (threshold = 0.5) and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression in this case.

In [3]:
# Import required libraries
from sklearn.metrics import confusion_matrix, accuracy_score

# Predict probabilities from the model
pred_probs = logit_model.predict(df)

# Convert probabilities to binary predictions using threshold = 0.5
pred_class = (pred_probs >= 0.5).astype(int)

# Actual values
true_class = df['Direction']

# Compute confusion matrix
cm = confusion_matrix(true_class, pred_class)
print("Confusion Matrix:\n", cm)

# Calculate accuracy (overall fraction of correct predictions)
accuracy = accuracy_score(true_class, pred_class)
print(f"\nAccuracy: {accuracy:.3f}")

Confusion Matrix:
 [[ 54 430]
 [ 48 557]]

Accuracy: 0.561


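In the matrix, rows are actual classes and columns are predicted classes (0 = Down, 1 = Up). The model correctly predicted 557 "Up" weeks (true positives) and 54 "Down" weeks (true negatives). It predicted "Up" for 430 weeks in which the market actually went down (false positives) and "Down" for 48 weeks in which it actually went up (false negatives). The model therefore predicts "Up" for 987 of the 1089 weeks: it catches about 92% of the Up weeks (557/605) but only about 11% of the Down weeks (54/484), so most of its mistakes are false positives. The overall accuracy of 56.1% is barely above the 55.6% obtained by always predicting "Up" (605/1089), and because it is computed on the same data used to fit the model, it is likely optimistic.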
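
The rates implied by the confusion matrix can be checked with a few lines of arithmetic. This is a small sketch that copies the four counts from the printed output; the variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix printed above
# (rows = actual, columns = predicted; 0 = Down, 1 = Up).
tn, fp = 54, 430
fn, tp = 48, 557

total = tn + fp + fn + tp
acc = (tp + tn) / total            # overall fraction correct
sensitivity = tp / (tp + fn)       # share of actual Up weeks predicted Up
specificity = tn / (tn + fp)       # share of actual Down weeks predicted Down
always_up = (tp + fn) / total      # accuracy of always predicting Up

print(f"Accuracy:           {acc:.3f}")          # 0.561
print(f"Sensitivity (Up):   {sensitivity:.3f}")  # 0.921
print(f"Specificity (Down): {specificity:.3f}")  # 0.112
print(f"Always-Up baseline: {always_up:.3f}")    # 0.556
```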

Q3. Now, using the Year column, split the dataset into training and testing sets as follows: the training set contains all observations from 1990 to 2008, and the test set contains all observations from 2009 and 2010.

Fit the logistic regression model using the training set to predict Direction including only the significant predictors identified in Q1. Make predictions on the test set (threshold = 0.5) and report the accuracy and the AUC of your model.


In [4]:
# Split the data into training and test sets
train_df = df[df['Year'] <= 2008]
test_df = df[df['Year'] >= 2009]

# Fit logistic regression using only Lag2 (significant predictor)
model_lag2 = smf.logit("Direction ~ Lag2", data=train_df).fit()

# Predict probabilities on the test set
test_probs = model_lag2.predict(test_df)

# Convert to binary predictions (threshold = 0.5)
test_preds = (test_probs >= 0.5).astype(int)

# Evaluate performance
from sklearn.metrics import accuracy_score, roc_auc_score

test_actual = test_df['Direction']
accuracy = accuracy_score(test_actual, test_preds)
auc = roc_auc_score(test_actual, test_probs)

# Print results
print(f"Accuracy on Test Set: {accuracy:.3f}")
print(f"AUC on Test Set: {auc:.3f}")

Optimization terminated successfully.
         Current function value: 0.685555
         Iterations 4
Accuracy on Test Set: 0.625
AUC on Test Set: 0.546


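Accuracy = 0.625 → The model correctly predicted the market direction in 62.5% of test weeks, so it performs better than random guessing (50%) in terms of classification accuracy. A fairer comparison is the accuracy of always predicting the more frequent class in the test set, which this cell does not report.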
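
One way to put the accuracy in context is to compare it with a majority-class baseline. The helper below is a hypothetical sketch (the function name and the toy labels are not from the notebook) showing how that baseline could be computed.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Hypothetical labels (1 = Up, 0 = Down), not the actual test set.
toy_labels = [1, 1, 0, 1, 0, 1, 0, 0, 1, 1]
print(majority_baseline_accuracy(toy_labels))  # 0.6 (six of ten labels are 1)
```

In the notebook, calling `majority_baseline_accuracy(test_actual)` would give the figure to compare against 0.625.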

AUC (Area Under the ROC Curve) = 0.546 → The AUC is only slightly above 0.5, the value expected from uninformative scores, suggesting very limited ability to distinguish between "Up" and "Down" weeks. Equivalently, a randomly chosen Up week receives a higher predicted probability than a randomly chosen Down week only about 54.6% of the time.
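
The ranking interpretation of AUC can be made concrete with a direct pairwise count. This is an illustrative sketch on toy data; `pairwise_auc` is a hypothetical helper written for this example, not the routine `roc_auc_score` uses internally.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example: three of the four (Up, Down) pairs are ordered correctly.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```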

Using only Lag2 as a predictor provides modest predictive power; additional features or other models may improve performance.

Q4. Can a multiple logistic regression model (with several predictors)
lead to lower (test set) accuracy compared to a simple logistic regression model?

Yes. A multiple logistic regression model can have lower test set accuracy than a simpler model. Each additional predictor adds a coefficient that must be estimated from the same training data, so irrelevant or highly collinear predictors increase the variance of the fit: the model can match the training data more closely (overfitting) while performing worse on unseen data.
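
A small simulation can illustrate the effect. The sketch below is built on assumptions rather than the StockMarket data: it generates one informative predictor plus 30 pure-noise predictors, fits a one-predictor and a full model on a small training set, and compares train and test accuracy. The logistic fit is a plain NumPy gradient-ascent stand-in for `smf.logit`, and all names and settings are hypothetical.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Unregularized logistic regression via gradient ascent on the mean log-likelihood."""
    Xb = np.column_stack([np.ones(len(X)), X])   # prepend an intercept column
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ beta)))
        beta += lr * (Xb.T @ (y - p)) / len(y)
    return beta

def accuracy(beta, X, y):
    """Fraction correct at threshold 0.5 (same as linear predictor >= 0)."""
    Xb = np.column_stack([np.ones(len(X)), X])
    preds = (Xb @ beta >= 0).astype(int)
    return float((preds == y).mean())

# Simulated data (assumption): only column 0 carries signal.
rng = np.random.default_rng(0)
n_train, n_test, n_noise = 100, 2000, 30
X = rng.normal(size=(n_train + n_test, 1 + n_noise))
p_up = 1.0 / (1.0 + np.exp(-0.5 * X[:, 0]))
y = (rng.random(n_train + n_test) < p_up).astype(int)
X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

beta_simple = fit_logistic(X_tr[:, :1], y_tr)   # informative predictor only
beta_full = fit_logistic(X_tr, y_tr)            # informative + noise predictors

results = {
    "simple": (accuracy(beta_simple, X_tr[:, :1], y_tr),
               accuracy(beta_simple, X_te[:, :1], y_te)),
    "full": (accuracy(beta_full, X_tr, y_tr),
             accuracy(beta_full, X_te, y_te)),
}
for name, (train_acc, test_acc) in results.items():
    print(f"{name:>6}: train accuracy {train_acc:.3f}, test accuracy {test_acc:.3f}")
```

In a setup like this one would expect the full model to score higher on the training set and lower on the test set than the simple model, although the exact numbers depend on the random seed and sample sizes.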