# Problem 2: Predicting Stock Returns

We aim to predict the direction of next month's return ($y_{t+1}^i$) using the current month's characteristics ($X_t^i$).

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load data
df = pd.read_csv('Data_Pred_Return.csv')
df['dates'] = pd.to_datetime(df['dates'])
df = df.sort_values(['cusip', 'dates'])

# Inspect data
df.head()



Unnamed: 0.1,Unnamed: 0,dates,cusip,Price,MV,M2B,S2A,SD2A,LD2A,PE,Sales,RET
48449,190814,2009-11-30,125581,28.99,,0.657499,0.015832,0.125728,0.904575,3.73613,745.0,
48450,190815,2009-12-31,125581,27.61,5522000.0,0.657499,0.015832,0.125728,0.904575,3.73613,745.0,
48451,190816,2010-01-29,125581,31.82,6364000.0,0.913891,0.04396,0.09682,0.940486,56.463768,1753.2,0.152481
48452,190817,2010-02-26,125581,36.43,7287311.48,0.913891,0.04396,0.09682,0.940486,56.463768,1753.2,0.144877
48453,190818,2010-03-31,125581,38.96,7793402.56,0.913891,0.04396,0.09682,0.940486,56.463768,1753.2,0.069448


## Part (a): Feature Construction

We need to select features that are stationary to avoid spurious regression results. 

**Variable Analysis:**
- **Non-Stationary (Trends):** `Price`, `MV` (Market Value), `Sales`. These variables tend to grow over time and should not be used directly in levels. We will transform them into growth rates (percentage change) to make them stationary.
- **Stationary (Ratios/Returns):** `M2B` (Market-to-Book), `S2A` (Sales-to-Asset), `SD2A` (Short-term Debt-to-Asset), `LD2A` (Long-term Debt-to-Asset), `PE` (Price-to-Earnings), `RET` (Returns).

**Selected Features:**
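We use the provided financial ratios, the current month's return (momentum), and the growth rates of the non-stationary variables as predictors. Note that `Price_Growth` and `MV_Growth` are close to `RET` by construction (price growth equals the return whenever there are no dividends or other distributions), so these three predictors are highly collinear.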

- `M2B`
- `S2A`
- `SD2A`
- `LD2A`
- `PE`
- `RET` (Current month return $r_t$)
- `Price_Growth`
- `MV_Growth`
- `Sales_Growth`

In [19]:
# Transform non-stationary variables
# We use pct_change() within each group to get the growth rate
df['Price_Growth'] = df.groupby('cusip')['Price'].pct_change()
df['MV_Growth'] = df.groupby('cusip')['MV'].pct_change()
df['Sales_Growth'] = df.groupby('cusip')['Sales'].pct_change()

# Define predictors
features = ['M2B', 'S2A', 'SD2A', 'LD2A', 'PE', 'RET', 'Price_Growth', 'MV_Growth', 'Sales_Growth']

# Create Target: y_{t+1} = 1 if r_{t+1} > 0 else 0
# We group by cusip and shift the 'RET' column UP by 1 (-1) to get the next month's return aligned with current row
df['Next_RET'] = df.groupby('cusip')['RET'].shift(-1)

# Create binary target (a missing Next_RET compares as False, i.e. 0 here; those rows are dropped below)
df['Target'] = (df['Next_RET'] > 0).astype(int)

# Replace infinity with NaN
df.replace([np.inf, -np.inf], np.nan, inplace=True)

# Remove rows with NaN (last month for each stock has no next return, and first month for growth rates)
df_clean = df.dropna(subset=['Next_RET'] + features).copy()

# Verify shape
print(f"Original shape: {df.shape}, Clean shape: {df_clean.shape}")
df_clean.head()

Original shape: (49032, 17), Clean shape: (41795, 17)




Unnamed: 0.1,Unnamed: 0,dates,cusip,Price,MV,M2B,S2A,SD2A,LD2A,PE,Sales,RET,Price_Growth,MV_Growth,Sales_Growth,Next_RET,Target
48451,190816,2010-01-29,125581,31.82,6364000.0,0.913891,0.04396,0.09682,0.940486,56.463768,1753.2,0.152481,0.152481,0.152481,1.353289,0.144877,1
48452,190817,2010-02-26,125581,36.43,7287311.48,0.913891,0.04396,0.09682,0.940486,56.463768,1753.2,0.144877,0.144877,0.145084,0.0,0.069448,1
48453,190818,2010-03-31,125581,38.96,7793402.56,0.913891,0.04396,0.09682,0.940486,56.463768,1753.2,0.069448,0.069448,0.069448,0.0,0.042094,1
48454,190819,2010-04-30,125581,40.6,8121542.8,0.785494,0.049133,0.093724,0.962398,42.325,1780.7,0.042094,0.042094,0.042105,0.015686,-0.093842,0
48455,190820,2010-05-28,125581,36.79,7359434.81,0.785494,0.049133,0.093724,0.962398,42.325,1780.7,-0.093842,-0.093842,-0.093838,0.0,-0.079641,0
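
The alignment of features and target is easiest to check on a small example. The following is a minimal sketch on a hypothetical two-stock panel (the frame `toy` and the tickers `"A"` and `"B"` are made up for illustration, not taken from the data set). It shows that the group-wise `pct_change()` and `shift(-1)` never mix values across stocks, and that the last month of each stock has no target.

```python
import numpy as np
import pandas as pd

# Hypothetical toy panel: two stocks ("A", "B"), three months each
toy = pd.DataFrame({
    "cusip": ["A", "A", "A", "B", "B", "B"],
    "dates": pd.to_datetime(["2010-01-29", "2010-02-26", "2010-03-31"] * 2),
    "Price": [10.0, 11.0, 9.9, 20.0, 19.0, 19.0],
    "RET":   [np.nan, 0.10, -0.10, np.nan, -0.05, 0.00],
}).sort_values(["cusip", "dates"])

# Growth rate within each stock: (P_t - P_{t-1}) / P_{t-1}
toy["Price_Growth"] = toy.groupby("cusip")["Price"].pct_change()

# Next month's return, shifted within each stock so values never cross stocks
toy["Next_RET"] = toy.groupby("cusip")["RET"].shift(-1)

# Keep the label missing where Next_RET is missing; a bare (NaN > 0) would give 0
toy["Target"] = (toy["Next_RET"] > 0).astype(int).where(toy["Next_RET"].notna())

print(toy)
```

In the notebook the same effect is obtained by dropping rows with a missing `Next_RET` after the target is built.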


## Part (b): Logistic Regression

We split the data into a training sample (1989-2011) and a test sample (after 2011, i.e., 2012-2016).

In [20]:
# Split Data
train_mask = (df_clean['dates'].dt.year >= 1989) & (df_clean['dates'].dt.year <= 2011)
test_mask = (df_clean['dates'].dt.year > 2011)

X_train = df_clean.loc[train_mask, features]
y_train = df_clean.loc[train_mask, 'Target']

X_test = df_clean.loc[test_mask, features]
y_test = df_clean.loc[test_mask, 'Target']

print(f"Train samples: {len(X_train)}, Test samples: {len(X_test)}")

# Standardize features (Important for Logistic Regression and Regularization)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit an unpenalized logistic regression; the problem asks for the standard model first.
# penalty=None requires scikit-learn >= 1.2; older versions expect penalty='none' instead.
logit_model = LogisticRegression(penalty=None, max_iter=1000)
try:
    logit_model.fit(X_train_scaled, y_train)
except (ValueError, TypeError):
    # Fallback for older scikit-learn: a very large C makes the default L2 penalty negligible
    logit_model = LogisticRegression(C=1e9, max_iter=1000)
    logit_model.fit(X_train_scaled, y_train)

print("Logistic Regression Coefficients:")
print(pd.Series(logit_model.coef_[0], index=features))

Train samples: 29092, Test samples: 12703
Logistic Regression Coefficients:
M2B             0.104969
S2A             0.008219
SD2A           -0.002737
LD2A           -0.003609
PE              0.004068
RET            -0.045293
Price_Growth    0.026254
MV_Growth       0.030867
Sales_Growth   -0.003483
dtype: float64


## Part (c): Ridge and LASSO with Time-Series Cross Validation

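We use `LogisticRegressionCV`, which accepts a cross-validation splitter through its `cv` argument, and pass it a `TimeSeriesSplit`. With `Cs=10` the penalty strength is chosen from a grid of 10 values of `C`. Note that `TimeSeriesSplit` forms its folds by row position, so they are chronological only when the rows are ordered by date. Here `df_clean` is sorted by `cusip` and then `dates`, so the training rows would need to be re-sorted by `dates` for the folds to respect time order.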
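
How row order affects the folds can be checked on a small example. The following is a minimal sketch on a hypothetical panel of three stocks and eight months (the frame `panel` and the helper `folds_are_chronological` are illustrative names, not part of the assignment). It tests whether every training date in a fold precedes every validation date.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical toy panel: 3 stocks x 8 monthly dates
dates = pd.date_range("2010-01-01", periods=8, freq="MS")
panel = pd.DataFrame(
    [(c, d) for c in ["A", "B", "C"] for d in dates],
    columns=["cusip", "dates"],
)

def folds_are_chronological(frame, n_splits=3):
    """True if, in every fold, all training dates precede all validation dates."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    d = frame["dates"].to_numpy()
    return all(d[tr].max() < d[va].min() for tr, va in tscv.split(frame))

# Sorted by (cusip, dates): folds cut across stocks, not across time
by_stock = panel.sort_values(["cusip", "dates"]).reset_index(drop=True)
print(folds_are_chronological(by_stock))

# Sorted by date first: each fold validates on later months than it trains on
by_date = panel.sort_values(["dates", "cusip"]).reset_index(drop=True)
print(folds_are_chronological(by_date))
```

In this toy panel the fold boundaries happen to fall between months. On real data a single month can straddle a boundary, so a strict check like this one may be too demanding. Applying the idea here would mean sorting `df_clean` by `dates` before building `X_train` and `y_train`; the outputs reported in this notebook were produced without that step.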

In [21]:
# Define TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

# Ridge (L2 Penalty)
print("Fitting Ridge (L2)...")
ridge_cv = LogisticRegressionCV(
    Cs=10, 
    penalty='l2', 
    cv=tscv, 
    solver='lbfgs', 
    max_iter=1000,
    random_state=42
)
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best C for Ridge: {ridge_cv.C_[0]}")

# LASSO (L1 Penalty)
print("Fitting LASSO (L1)...")
lasso_cv = LogisticRegressionCV(
    Cs=10, 
    penalty='l1', 
    cv=tscv, 
    solver='liblinear', 
    max_iter=1000,
    random_state=42
)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best C for LASSO: {lasso_cv.C_[0]}")

print("\nRidge Coefficients:")
print(pd.Series(ridge_cv.coef_[0], index=features))

print("\nLASSO Coefficients:")
print(pd.Series(lasso_cv.coef_[0], index=features))

Fitting Ridge (L2)...
Best C for Ridge: 0.000774263682681127
Fitting LASSO (L1)...
Best C for LASSO: 0.046415888336127774

Ridge Coefficients:
M2B             0.078217
S2A             0.006966
SD2A           -0.002480
LD2A           -0.003028
PE              0.005493
RET            -0.027132
Price_Growth    0.018073
MV_Growth       0.021840
Sales_Growth   -0.002850
dtype: float64

LASSO Coefficients:
M2B             0.099209
S2A             0.004079
SD2A            0.000000
LD2A            0.000000
PE              0.000654
RET            -0.033995
Price_Growth    0.018667
MV_Growth       0.025723
Sales_Growth   -0.000485
dtype: float64


## Part (d): Confusion Matrix and Error Rates

We construct the confusion matrix for the **test sample** (Post 2011) with cutoff $\bar{p} = 0.5$.
We compute:
- **Type I Error Rate (False Positive Rate):** Proportion of actual 0s classified as 1. $P(\hat{y}=1 | y=0)$.
- **Type II Error Rate (False Negative Rate):** Proportion of actual 1s classified as 0. $P(\hat{y}=0 | y=1)$.
- **Overall Error Rate:** Proportion of incorrect predictions.
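
In terms of the confusion-matrix counts, with $y=1$ as the positive class and $TN$, $FP$, $FN$, $TP$ denoting true negatives, false positives, false negatives, and true positives, these three rates can be written as:

$$
\text{Type I} = \frac{FP}{FP + TN}, \qquad
\text{Type II} = \frac{FN}{FN + TP}, \qquad
\text{Overall} = \frac{FP + FN}{TN + FP + FN + TP}
$$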

In [22]:
def compute_metrics(model, X, y_true, model_name="Model"):
    y_pred = model.predict(X)
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    
    # Error Rates
    type1_error = fp / (tn + fp) if (tn + fp) > 0 else 0
    type2_error = fn / (fn + tp) if (fn + tp) > 0 else 0
    overall_error = (fp + fn) / (tn + fp + fn + tp)
    
    print(f"--- {model_name} ---")
    print("Confusion Matrix:")
    print(cm)
    print(f"Type I Error Rate (FP Rate): {type1_error:.4f}")
    print(f"Type II Error Rate (FN Rate): {type2_error:.4f}")
    print(f"Overall Error Rate: {overall_error:.4f}")
    print("-" * 30)

# Evaluate models on Test Set
compute_metrics(logit_model, X_test_scaled, y_test, "Standard Logit")
compute_metrics(ridge_cv, X_test_scaled, y_test, "Ridge Logit")
compute_metrics(lasso_cv, X_test_scaled, y_test, "LASSO Logit")

--- Standard Logit ---
Confusion Matrix:
[[   0 5363]
 [   0 7340]]
Type I Error Rate (FP Rate): 1.0000
Type II Error Rate (FN Rate): 0.0000
Overall Error Rate: 0.4222
------------------------------
--- Ridge Logit ---
Confusion Matrix:
[[   0 5363]
 [   0 7340]]
Type I Error Rate (FP Rate): 1.0000
Type II Error Rate (FN Rate): 0.0000
Overall Error Rate: 0.4222
------------------------------
--- LASSO Logit ---
Confusion Matrix:
[[   0 5363]
 [   0 7340]]
Type I Error Rate (FP Rate): 1.0000
Type II Error Rate (FN Rate): 0.0000
Overall Error Rate: 0.4222
------------------------------


## Part (e): Random Forest

We implement a Random Forest model and compare its performance.

In [23]:
from sklearn.ensemble import RandomForestClassifier

# Fit Random Forest
# Using a reasonable number of estimators and random_state for reproducibility
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train_scaled, y_train)

# Evaluate Random Forest
compute_metrics(rf_model, X_test_scaled, y_test, "Random Forest")

# Feature Importance
importances = pd.Series(rf_model.feature_importances_, index=features).sort_values(ascending=False)
print("Feature Importances:")
print(importances)

--- Random Forest ---
Confusion Matrix:
[[1640 3723]
 [2108 5232]]
Type I Error Rate (FP Rate): 0.6942
Type II Error Rate (FN Rate): 0.2872
Overall Error Rate: 0.4590
------------------------------
Feature Importances:
MV_Growth       0.127728
PE              0.124103
M2B             0.123992
RET             0.122849
Price_Growth    0.122179
S2A             0.120274
LD2A            0.101579
SD2A            0.090095
Sales_Growth    0.067201
dtype: float64


## Part (f): Feed-Forward Neural Network (Optional)

We implement a simple feed-forward neural network (MLPClassifier) and compare its performance.

In [24]:
from sklearn.neural_network import MLPClassifier

# Fit Neural Network
# Simple architecture: one hidden layer with 100 neurons (default)
nn_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
nn_model.fit(X_train_scaled, y_train)

# Evaluate Neural Network
compute_metrics(nn_model, X_test_scaled, y_test, "Neural Network")

--- Neural Network ---
Confusion Matrix:
[[ 115 5248]
 [ 133 7207]]
Type I Error Rate (FP Rate): 0.9786
Type II Error Rate (FN Rate): 0.0181
Overall Error Rate: 0.4236
------------------------------
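
To compare the models side by side, the error rates can be recomputed from the confusion matrices printed above. The following is a minimal sketch: the counts are copied from the test-sample outputs, while the dictionary `reported` and the helper `error_rates` are illustrative names. The last lines add the error rate of a benchmark that always predicts a positive return.

```python
import numpy as np
import pandas as pd

# Confusion matrices copied from the test-sample outputs above
# (rows = actual 0/1, columns = predicted 0/1, as in sklearn's confusion_matrix)
reported = {
    "Standard Logit": [[0, 5363], [0, 7340]],
    "Ridge Logit":    [[0, 5363], [0, 7340]],
    "LASSO Logit":    [[0, 5363], [0, 7340]],
    "Random Forest":  [[1640, 3723], [2108, 5232]],
    "Neural Network": [[115, 5248], [133, 7207]],
}

def error_rates(cm):
    """Type I, Type II, and overall error rates from a 2x2 confusion matrix."""
    tn, fp, fn, tp = np.asarray(cm).ravel()
    return {
        "Type I": fp / (tn + fp),
        "Type II": fn / (fn + tp),
        "Overall": (fp + fn) / (tn + fp + fn + tp),
    }

summary = pd.DataFrame(
    {name: error_rates(cm) for name, cm in reported.items()}
).T.round(4)
print(summary)

# Benchmark: always predicting 1 misclassifies exactly the actual 0s
always_up_error = 5363 / (5363 + 7340)
print(round(always_up_error, 4))
```

The three logistic models classify every test observation as 1, so their overall error rate equals that of the always-positive benchmark (0.4222). On the reported counts, neither the Random Forest (0.4590) nor the neural network (0.4236) improves on that benchmark.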
