# Big Data in Finance II: Group Assignment

#### Question 1. In the data used by Gu, Kelly and Xiu (RFS 2019 – provided in class), use a similar procedure to theirs to predict stock returns with neural networks. Start by finding a suitable baseline configuration, and use a validation procedure to pick optimal hyperparameters for three neural network models: one with 2 hidden layers, one with 3 hidden layers, and one with 4 hidden layers.

Import the packages and data

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization, Dense, Dropout, Input
from tensorflow.keras.optimizers import Adam, SGD
from tensorflow.keras.callbacks import EarlyStopping
import optuna

# fix random state
random_state = 42

panel = pd.read_pickle('returns_chars_panel.pkl') 
macro = pd.read_pickle('macro_timeseries.pkl')



Process the data

In [16]:
# combine micro and macro data
df = pd.merge(panel,macro,on='date',how='left',suffixes=['','_macro']) 

# features + targets 
X = df.drop(columns=['ret','excess_ret','rfree','permno','date']) # everything except return info and IDs
y = df['excess_ret'] 

# make 30 years of training data
date = df['date']
training = (date <= '2006-03')  # boolean mask selecting the training rows
X_train, y_train = X.loc[training].values, y.loc[training].values 

# make test data
test = (date > '2006-03') 
X_test, y_test = X.loc[test].values, y.loc[test].values 

In [20]:
len(X_train) / len(X_test)

3.896265065762339
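
The notebook imports `StandardScaler` but never scales the inputs, and neural networks are usually easier to train on standardized features. Below is a minimal NumPy sketch of one way to do this, assuming the features are numeric and that scaling statistics should come from the training rows only, so no test information leaks in. The helper names `fit_standardizer` and `apply_standardizer` and the toy arrays are hypothetical and not part of the assignment code.

```python
import numpy as np

def fit_standardizer(X_train):
    """Return per-column mean and std computed on the training rows only."""
    mean = np.nanmean(X_train, axis=0)
    std = np.nanstd(X_train, axis=0)
    std = np.where(std > 0, std, 1.0)  # leave constant columns unscaled
    return mean, std

def apply_standardizer(X, mean, std):
    """Scale any split with the training statistics (no look-ahead)."""
    return (X - mean) / std

# Toy data standing in for X_train / X_test (hypothetical values)
rng = np.random.default_rng(0)
X_tr_demo = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))
X_te_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))

mean, std = fit_standardizer(X_tr_demo)
X_tr_scaled = apply_standardizer(X_tr_demo, mean, std)
X_te_scaled = apply_standardizer(X_te_demo, mean, std)
```

One side effect worth noting: after standardization, setting a predictor to zero (as in Question 3) amounts to setting it to its training-sample mean.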

Train the NN with optuna

In [None]:

def create_model(trial, num_layers):
    neurons_per_layer = trial.suggest_categorical('neurons_per_layer', [32, 64, 128, 256])
    activation = trial.suggest_categorical('activation', ['relu', 'tanh'])
    optimizer = trial.suggest_categorical('optimizer', ['adam', 'sgd'])
    learning_rate = trial.suggest_categorical('learning_rate', [0.001, 0.0001])
    batch_size = trial.suggest_categorical('batch_size', [32, 64, 128])

    model = Sequential()
    model.add(Input(shape=(X_train.shape[1],)))
    model.add(Dense(neurons_per_layer, activation=activation))
    model.add(BatchNormalization())
    
    for _ in range(num_layers - 1):
        model.add(Dense(neurons_per_layer, activation=activation))
        model.add(BatchNormalization())
        model.add(Dropout(0.2))

    model.add(Dense(1))  # Output layer for regression

    if optimizer == 'adam':
        opt = Adam(learning_rate=learning_rate)
    else:
        opt = SGD(learning_rate=learning_rate)

    model.compile(optimizer=opt, loss='mean_squared_error', metrics=['mae'])
    return model, batch_size

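def objective(trial, num_layers):
    # K-Fold Cross Validation
    kf = KFold(n_splits=5, shuffle=True, random_state=random_state)
    val_scores = []

    for train_index, val_index in kf.split(X_train):
        X_tr, X_val = X_train[train_index], X_train[val_index]
        y_tr, y_val = y_train[train_index], y_train[val_index]

        # Build a fresh model for every fold so that weights learned on one
        # fold do not leak into the next. Optuna returns the same value when
        # a parameter name is suggested again within the same trial.
        model, batch_size = create_model(trial, num_layers)

        # With epochs=2 the patience of 5 is never reached, so early stopping
        # only becomes active once the number of epochs is raised.
        es = EarlyStopping(monitor='val_loss', mode='min', verbose=0, patience=5)
        model.fit(X_tr, y_tr, epochs=2, batch_size=batch_size, validation_data=(X_val, y_val), callbacks=[es], verbose=0)

        val_loss, val_mae = model.evaluate(X_val, y_val, verbose=0)
        val_scores.append(val_mae)

    # Mean validation MAE across folds is the quantity Optuna minimizes
    return np.mean(val_scores)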

# Create a study for each number of layers and optimize
studies = {}
num_layers_options = [2, 3, 4]

for num_layers in num_layers_options:
    study = optuna.create_study(direction='minimize')
    study.optimize(lambda trial: objective(trial, num_layers), n_trials=5)
    studies[num_layers] = study
    print(f'Best hyperparameters for {num_layers} layers: {study.best_params}')
    print(f'Validation MAE for {num_layers} layers: {study.best_value}')



[I 2024-05-17 14:21:07,490] A new study created in memory with name: no-name-7818be5c-3c89-48b5-8e82-822402ce9dbe
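
The shuffled `KFold` used in the objective mixes observations from different months, so a fold can be validated on dates that precede its training data. Gu, Kelly and Xiu instead keep the temporal ordering of the sample when validating. The following is a rough sketch of a date-ordered alternative, assuming one date per row is available (for example `date.loc[training].values`). The function `expanding_window_folds` is a hypothetical helper, not something from the notebook or from scikit-learn.

```python
import numpy as np

def expanding_window_folds(dates, n_splits=3):
    """Yield (train_idx, val_idx) pairs in which every validation date is
    later than every training date. `dates` holds one entry per row."""
    dates = np.asarray(dates)
    unique_dates = np.unique(dates)  # sorted ascending
    blocks = np.array_split(unique_dates, n_splits + 1)
    for k in range(1, n_splits + 1):
        train_dates = np.concatenate(blocks[:k])  # all blocks before block k
        val_dates = blocks[k]                     # the next block in time
        train_idx = np.flatnonzero(np.isin(dates, train_dates))
        val_idx = np.flatnonzero(np.isin(dates, val_dates))
        yield train_idx, val_idx

# Toy panel: 8 months with 3 stocks per month (hypothetical data)
demo_dates = np.repeat(np.arange('2000-01', '2000-09', dtype='datetime64[M]'), 3)
folds = list(expanding_window_folds(demo_dates, n_splits=3))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

The generator yields index pairs in the same form as `kf.split(X_train)`, so it could be swapped into the fold loop of `objective` with few other changes.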


Write the analysis here.

#### Question 2. Use test data to get an idea of the out-of-sample performance of each model. Convert the standard MSE metric for out-of-sample performance to the “R2 out of sample” metric that was discussed in class. Compare your results to those in Gu-Kelly-Xiu and comment on the differences.

In [None]:
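# Define a function to calculate the out-of-sample R squared
def calculate_r2_out_of_sample(y_true, y_pred):
    # Flatten both inputs: Keras returns predictions with shape (n, 1), and
    # subtracting that from a length-n vector would broadcast to an (n, n) matrix.
    y_true = np.ravel(y_true)
    y_pred = np.ravel(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2_out_of_sample = 1 - (ss_res / ss_tot)
    return r2_out_of_sample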

best_r2 = -np.inf
best_num_layers = None
best_model = None

for num_layers, study in studies.items():
    best_params = study.best_params
    batch_size = best_params['batch_size']

    # Rebuild the network from the best trial; the stored trial returns its
    # recorded hyperparameter values for each suggest call.
    final_model, _ = create_model(study.best_trial, num_layers)

    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)
    final_model.fit(X_train, y_train, epochs=5, batch_size=batch_size, validation_split=0.2, callbacks=[es], verbose=1)

    # Flatten the (n, 1) Keras output so it lines up with the 1-D y_test
    y_pred = final_model.predict(X_test).ravel()
    test_loss, test_mae = final_model.evaluate(X_test, y_test)
    r2_out_of_sample = calculate_r2_out_of_sample(y_test, y_pred)
    print(f'Test MAE for {num_layers} layers: {test_mae}')
    print(f'R² out of sample for {num_layers} layers: {r2_out_of_sample}')

    # Check if this model is the best one
    if r2_out_of_sample > best_r2:
        best_r2 = r2_out_of_sample
        best_num_layers = num_layers
        best_model = final_model
    
print(f'Best R² out of sample is {best_r2} for the model with {best_num_layers} layers')
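
For the comparison with Gu-Kelly-Xiu, note that their out-of-sample R² does not demean the denominator: it is 1 - Σ(r - r̂)² / Σ r², so the benchmark is a forecast of zero rather than the sample mean. Whether this matches the metric discussed in class is an assumption here. A small sketch of that variant, with the hypothetical name `r2_oos_zero_benchmark`:

```python
import numpy as np

def r2_oos_zero_benchmark(y_true, y_pred):
    """Out-of-sample R^2 against a zero forecast (denominator not demeaned)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()  # also handles (n, 1) input
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum(y_true ** 2)
    return 1.0 - ss_res / ss_tot

# Toy excess returns (hypothetical values)
y_demo = np.array([0.02, -0.01, 0.03, -0.04])
r2_zero_forecast = r2_oos_zero_benchmark(y_demo, np.zeros(4))               # 0.0
r2_perfect_forecast = r2_oos_zero_benchmark(y_demo, y_demo.reshape(-1, 1))  # 1.0
print(r2_zero_forecast, r2_perfect_forecast)
```

Because the historical mean of individual stock returns is a noisy benchmark, this version is typically lower than the demeaned one when average returns in the test window are positive, which matters when comparing numbers across definitions.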

Write the analysis here.

#### Question 3. Pick the model that performs best out of sample, and interpret its output with the following analysis of variable importance:
 
#### a. First, for all stock characteristics, get variable importance by setting one predictor at a time to zero and finding the decrease in out-of-sample R2. Show a table of the 10 most important variables according to this measure, and give an economic interpretation.

In [None]:
# Baseline out-of-sample R^2
y_pred_baseline = best_model.predict(X_test)
r2_baseline = calculate_r2_out_of_sample(y_test, y_pred_baseline)

# Set one variable at a time to zero and record the drop in out-of-sample R^2
variable_importance = {}
for i in range(X_test.shape[1]):
    X_test_zeroed = X_test.copy()
    X_test_zeroed[:, i] = 0
    y_pred_zeroed = best_model.predict(X_test_zeroed)
    r2_zeroed = calculate_r2_out_of_sample(y_test, y_pred_zeroed)
    r2_decrease = r2_baseline - r2_zeroed
    # Key by column name so the table can be interpreted economically
    variable_importance[X.columns[i]] = r2_decrease

# Keep the 10 most important variables
important_variables = sorted(variable_importance.items(), key=lambda item: item[1], reverse=True)[:10]
important_variables_df = pd.DataFrame(important_variables, columns=['Variable', 'Decrease in R^2'])

print(important_variables_df)
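
The zeroing procedure does not depend on Keras, so it can be checked on a toy model where the right answer is known. Below is a rough sketch, assuming any callable `predict_fn` that maps a feature matrix to predictions. The helper names, the `share` column (decreases rescaled to sum to one) and the toy coefficients are illustrative assumptions rather than part of the assignment code.

```python
import numpy as np
import pandas as pd

def r2_oos(y_true, y_pred):
    """Demeaned out-of-sample R^2 on flattened inputs."""
    y_true, y_pred = np.ravel(y_true), np.ravel(y_pred)
    return 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

def zeroing_importance(predict_fn, X, y, names):
    """Drop in R^2 when each column is set to zero, largest first."""
    base = r2_oos(y, predict_fn(X))
    drops = {}
    for j, name in enumerate(names):
        X_zeroed = X.copy()
        X_zeroed[:, j] = 0.0
        drops[name] = base - r2_oos(y, predict_fn(X_zeroed))
    table = pd.Series(drops, name='r2_decrease').sort_values(ascending=False).to_frame()
    # Rescale so the decreases sum to one (only meaningful if the sum is positive)
    table['share'] = table['r2_decrease'] / table['r2_decrease'].sum()
    return table

# Toy linear "model": x0 matters most, x2 not at all (hypothetical data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(500, 3))
beta = np.array([2.0, 0.5, 0.0])
y_demo = X_demo @ beta + 0.1 * rng.normal(size=500)

table = zeroing_importance(lambda X: X @ beta, X_demo, y_demo, ['x0', 'x1', 'x2'])
print(table)
```

With a fitted network, `predict_fn` could be something like `lambda X: best_model.predict(X, verbose=0)`, and `names` could be `X.columns`.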

#### b. Second, get a measure of the joint importance of all our “macro predictors” (i.e., those taken from Welch and Goyal 2008), by setting them all to zero and finding the decrease in out-of-sample R2. Comment on how important macroeconomic variables are relative to stock characteristics in predicting returns.

In [None]:
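# Integer column positions of the macro predictors in X_test: every feature
# that does not come from the stock-level panel (assumes `panel` holds only
# stock-level columns, so the remaining merged columns are the macro series)
macro_predictors_indices = [i for i, col in enumerate(X.columns) if col not in panel.columns]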

X_test_macro_zeroed = X_test.copy()
X_test_macro_zeroed[:, macro_predictors_indices] = 0
y_pred_macro_zeroed = best_model.predict(X_test_macro_zeroed)
r2_macro_zeroed = calculate_r2_out_of_sample(y_test, y_pred_macro_zeroed)
r2_macro_decrease = r2_baseline - r2_macro_zeroed

print(f'Decrease in R^2 when macro predictors are set to zero: {r2_macro_decrease}')


#### c. Repeat the two steps above, but by using a measure of the sensitivity of predictions to each input variable, as outlined in the lectures.

In [None]:
# Compute the sensitivity of the predictions to each variable
sensitivity = {}
epsilon = 1e-5  # small perturbation

for i in range(X_test.shape[1]):
    X_test_perturbed = X_test.copy()
    X_test_perturbed[:, i] += epsilon
    y_pred_perturbed = best_model.predict(X_test_perturbed)
    sensitivity[i] = np.mean(np.abs(y_pred_perturbed - y_pred_baseline))

# Keep the 10 most sensitive variables
sensitive_variables = sorted(sensitivity.items(), key=lambda item: item[1], reverse=True)[:10]
sensitive_variables_df = pd.DataFrame(sensitive_variables, columns=['Variable Index', 'Sensitivity'])

print(sensitive_variables_df)
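
Part c also asks for the joint macro measure, which the cell above does not yet compute. One possible sketch follows, assuming sensitivity is measured as a finite-difference approximation to the mean absolute partial derivative, and that a group's sensitivity is the sum over its columns. Both choices are assumptions and may differ from the definition given in the lectures. The function name and toy model are hypothetical.

```python
import numpy as np

def finite_difference_sensitivity(predict_fn, X, epsilon=1e-3):
    """Mean absolute change in the prediction per unit change in each column."""
    base = np.ravel(predict_fn(X))
    sens = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        X_shift = X.copy()
        X_shift[:, j] += epsilon
        shifted = np.ravel(predict_fn(X_shift))
        sens[j] = np.mean(np.abs(shifted - base)) / epsilon  # divide out the step size
    return sens

# Toy linear "model" with known slopes (hypothetical data)
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 4))
beta = np.array([2.0, -0.5, 0.0, 1.0])

sens = finite_difference_sensitivity(lambda X: X @ beta, X_demo)
macro_cols_demo = [1, 3]                    # stand-in for the macro column positions
joint_macro_sensitivity = sens[macro_cols_demo].sum()
print(sens, joint_macro_sensitivity)
```

Since Keras models usually compute in float32, a step as small as `1e-5` can be close to the numerical resolution of the predictions, so a somewhat larger `epsilon` may give more stable rankings.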


Write the analysis here.

#### Question 4. Fit a penalised linear model (LASSO) to the same data, using validation data to pick the best penalty (e.g., you can use the “sklearn” package in Python to do this easily). Compare its test data performance to the neural network.

In [None]:
from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score

# Fit LASSO model with cross-validation
lasso = LassoCV(cv=5, random_state=random_state).fit(X_train, y_train)

# Predict on test set
y_pred_lasso = lasso.predict(X_test)

# Calculate MSE and R^2
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)

print(f'LASSO MSE: {mse_lasso}')
print(f'LASSO R^2: {r2_lasso}')
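
`LassoCV(cv=5)` selects the penalty with unshuffled 5-fold cross-validation, while the question mentions validation data. If a single held-out validation block is preferred (for example the last years of the training window), the penalty can be chosen with an explicit loop. This is a minimal sketch under that assumption. The function `fit_lasso_with_validation`, the alpha grid and the toy data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_lasso_with_validation(X_tr, y_tr, X_val, y_val, alphas):
    """Fit one Lasso per penalty and keep the one with the lowest validation MSE."""
    best_alpha, best_mse, best_model = None, np.inf, None
    for alpha in alphas:
        model = Lasso(alpha=alpha, max_iter=10000).fit(X_tr, y_tr)
        mse = np.mean((y_val - model.predict(X_val)) ** 2)
        if mse < best_mse:
            best_alpha, best_mse, best_model = alpha, mse, model
    return best_alpha, best_mse, best_model

# Toy data with two relevant predictors out of five (hypothetical)
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo @ np.array([1.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=300)

# Earlier rows train the model, later rows validate it
X_tr_demo, X_val_demo = X_demo[:200], X_demo[200:]
y_tr_demo, y_val_demo = y_demo[:200], y_demo[200:]

alphas = np.logspace(-4, 0, 9)
best_alpha, best_mse, best_model = fit_lasso_with_validation(
    X_tr_demo, y_tr_demo, X_val_demo, y_val_demo, alphas)
print(best_alpha, best_mse)
```

The selected model can then be scored on the test set with the same R² function used for the neural networks, so that the two are compared on an identical metric.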


Write the analysis here.

#### Question 5. Suppose somebody tells you to collect 10 more micro or macro variables that can predict returns and are not in our current dataset. How would you choose those variables, based on the intuitions you have gained in this project?

Write the analysis here.