# Fourth Assignment - FINTECH 540 - Machine Learning for FinTech - Dimensionality Reduction with Autoencoders

In this assignment, you will attempt to replicate the S&P 500 index using the price series of some of its constituents. This task involves applying machine learning techniques, specifically neural networks, to select a subset of companies that tracks the index value well. You may also explore Principal Component Analysis (PCA) as a benchmark, though this is not mandatory. The primary objective is to achieve satisfactory performance on the test set (out-of-sample). For background on this exercise, refer to notebook 14 of our class material.

## Dataset Overview

- **Asset Close Prices**: 360 stocks + S&P500 Index (Last column of both files) from 2000 to 2023.
- **Format**: Divided into train and test sets (2 files are provided). Do not split again into train and test. Split only the train set if you want to obtain a validation set for hyperparameter selection.
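
Because the rows are time-ordered prices, one way to carve a validation set out of the train file is to hold out its most recent rows rather than shuffling. The following is a minimal sketch of that idea; the helper name `chronological_split` and the 20% default are illustrative assumptions, not part of the assignment.

```python
def chronological_split(rows, val_fraction=0.2):
    """Split a time-ordered sequence into (fit, validation) parts without shuffling."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    n_val = int(round(len(rows) * val_fraction))  # number of most recent rows held out
    cut = len(rows) - n_val
    return rows[:cut], rows[cut:]


# Toy stand-in for the rows of the train file (oldest first).
toy_rows = list(range(10))
fit_rows, val_rows = chronological_split(toy_rows, val_fraction=0.2)
```

The same slicing works on a NumPy array of prices; the test file stays untouched throughout.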


## Task and General Hints

In this assignment, you are tasked with building an unsupervised learning model on equity data. Your primary goal is an accurate out-of-sample reproduction of the given index (S&P 500) using a subset of its constituents, evaluated with the metric described in the grading rubric below.

To guide you through this process, consider breaking down your tasks into the following three phases:

**Preprocessing**
The dataset is already free of inconsistencies, missing values, or outliers. 
- **Data Splitting**: The dataset is partitioned, and two files are provided to you.

**Model Selection**
- This notebook focuses on using neural networks for index replication. You can experiment with the different neural network architectures (autoencoders) we have seen in class. Feel free to compare the performance against a PCA methodology. 
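
For the optional PCA comparison, one possible benchmark is to reconstruct the price matrix from its leading principal components and rank columns by how well each is reconstructed, in the same spirit as ranking by autoencoder reconstruction error. The sketch below computes PCA via NumPy's SVD on synthetic data; the function name, the toy data, and the choice of one component are assumptions made purely for illustration.

```python
import numpy as np


def pca_reconstruction_error(X, n_components):
    """Per-column mean squared error of a rank-`n_components` PCA reconstruction."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T
    X_hat = Xc @ V @ V.T + mean  # project onto the leading directions and map back
    return np.mean((X - X_hat) ** 2, axis=0)


rng = np.random.default_rng(0)
common = rng.normal(size=(200, 1))
toy_prices = np.hstack([
    common + 0.01 * rng.normal(size=(200, 3)),  # three columns driven by one shared factor
    rng.normal(size=(200, 1)),                  # one idiosyncratic column
])
errors = pca_reconstruction_error(toy_prices, n_components=1)
ranking = np.argsort(errors)  # best-reconstructed columns first
```

On this toy matrix the idiosyncratic column is reconstructed worst, so it lands last in `ranking`.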

**Model Tuning and Evaluation**
- Once you've selected a model, fine-tune its parameters to achieve good out-of-sample index tracking. You must also choose the number of companies used to reproduce the index dynamics.
- You may adjust parameters manually or construct a routine to fit several models with different hyperparameters. 
- Evaluate your final model using the function provided at the end of the notebook, taking care to follow the indicated naming convention.
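
A routine for fitting several models can be as simple as a loop over a parameter grid. The sketch below assumes a user-supplied callable, here the hypothetical `fit_and_score`, that trains a model with the given hyperparameters and returns a validation score where higher is better. The toy scoring function only stands in for real training.

```python
from itertools import product


def grid_search(param_grid, fit_and_score):
    """Try every combination in `param_grid`; return the best (params, score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = fit_and_score(**params)  # higher is better, measured on validation data
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical stand-in for "train an autoencoder and score it on a validation split".
def toy_fit_and_score(encoding_dim, n):
    return -abs(encoding_dim - 16) - 0.1 * n


best_params, best_score = grid_search(
    {"encoding_dim": [8, 16, 32], "n": [4, 10, 20]}, toy_fit_and_score
)
```

In a real run, the score should come from a validation split of the train file, never from the test file.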

**Note**: Parameter choices and tuning are up to you, but should be made thoughtfully. Carefully study the documentation of the neural network models and refer to the Jupyter Notebooks we used in class to see which parameters you can fine-tune.

**IMPORTANT REMARK**: 
You must use the test set solely as data the model has never seen before. The results on that part of the dataset determine your grade.

Remember to set the seed when training and instantiating your model. You can use either Keras (Tensorflow) or Pytorch for this task, and you must make your results fully reproducible for grading. Double-check that you have correctly set the seed before diving into the coding part.

- [Setting the seed in Keras](https://www.tensorflow.org/api_docs/python/tf/keras/utils/set_random_seed)
- [Setting the seed in Pytorch](https://pytorch.org/docs/stable/notes/randomness.html)
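
As a rough illustration of the linked pages, a single helper can seed every random number generator in play. The sketch below is one possible arrangement: the name `set_global_seed` and its `framework` argument are assumptions, while `tf.keras.utils.set_random_seed` and `torch.manual_seed` are the calls described in the two links. The chosen framework is imported only on request, so the helper also runs where neither is installed.

```python
import random


def set_global_seed(seed, framework=None):
    """Seed Python's RNG, NumPy if installed, and optionally Keras or PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    if framework == "keras":
        import tensorflow as tf
        tf.keras.utils.set_random_seed(seed)  # seeds Python, NumPy and TensorFlow
    elif framework == "pytorch":
        import torch
        torch.manual_seed(seed)


# Re-seeding with the same value reproduces the same draw.
set_global_seed(41)
first_draw = random.random()
set_global_seed(41)
second_draw = random.random()
```

Calling the helper once at the top of the notebook, before the model is instantiated, keeps weight initialization and batch shuffling tied to the same seed.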

# Grading Rubric

Your grade for this assignment will be determined by a composite score that considers both the **normalized Root Mean Squared Error (RMSE)** on the test set and the **efficiency of your index reconstruction**. The formula for your final grade is as follows:

$$ \text{Final Score} = \text{Weighted RMSE Score} + \text{Weighted Efficiency Score} $$

Multiplied by 100 and rounded up, this gives a grade out of 100, with grades potentially curved before release.

**Components of the Grading Rubric**

1. **RMSE Score:**
   - Calculated as:
     $$ \text{Normalized RMSE} = 1 - \left( \frac{\text{RMSE}}{\text{MAX\_POSSIBLE\_RMSE}} \right) $$
   - `MAX_POSSIBLE_RMSE` is set as the standard deviation of the target variable.
   - RMSE measures how close your constructed index is to the actual index.
   - This component contributes 70% to your final score.

2. **Efficiency Score:**
   - Calculated as:
     $$ \text{Efficiency Score} = 1 - \left( \frac{\text{Number of Companies Used}}{\text{Total Number of Companies}} \right) $$
   - Encourages models that use fewer companies for index replication. The more companies you use, the more costly it would be to construct a portfolio that tracks the S&P 500, so the fewer, the better.
   - This component contributes 30% to your final score.

The final score is a weighted sum of the RMSE and Efficiency scores:

$$ \text{Final Score} = (\text{Weight RMSE} \times \text{Normalized RMSE}) + (\text{Weight Efficiency} \times \text{Efficiency Score}) $$

Where:
- `Weight RMSE` = 0.7 
- `Weight Efficiency` = 0.3

The final grade will be:

$$ \text{Grade} = \lceil \text{Final Score} \times 100 \rceil $$

Rounded up to the nearest whole number.
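
To make the arithmetic concrete, the sketch below evaluates the formulas above for made-up inputs: an RMSE equal to 10% of the target's standard deviation and 4 of the 360 companies. The function name `composite_grade` and the example numbers are hypothetical; only the weights and the total of 360 come from this notebook.

```python
import math


def composite_grade(rmse, target_std, n_used, n_total=360,
                    weight_rmse=0.7, weight_efficiency=0.3):
    """Apply the rubric: weighted normalized RMSE plus weighted efficiency, times 100, rounded up."""
    normalized_rmse = 1 - rmse / target_std
    efficiency = 1 - n_used / n_total
    final_score = weight_rmse * normalized_rmse + weight_efficiency * efficiency
    return math.ceil(final_score * 100)


# Hypothetical inputs: 0.7 * 0.9 + 0.3 * (1 - 4/360) is about 0.9267, which rounds up to 93.
example_grade = composite_grade(rmse=0.1, target_std=1.0, n_used=4)
```

Since the efficiency term is worth at most 30 points, adding companies only pays off if it lowers the RMSE by enough to offset the lost efficiency.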


In [7]:
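import random as python_random

import numpy as np
import pandas as pd
import tensorflow as tf
from keras.layers import Dense, Input
from keras.models import Model
from sklearn.preprocessing import MinMaxScaler

# Fix every source of randomness so the run is reproducible.
np.random.seed(41)
python_random.seed(41)
tf.random.set_seed(41)

# Load the provided train/test files and drop the date column.
train_data = pd.read_csv('sp_train.csv')
test_data = pd.read_csv('sp_test.csv')
train_data = train_data.drop(columns=['Date'])
test_data = test_data.drop(columns=['Date'])

# Target series: the S&P column of each file.
index_train = train_data['S&P'].values
index_test = test_data['S&P'].values

# Scale all columns to [0, 1] using statistics from the train set only.
# Note: train_data still contains the 'S&P' column here, so the index
# itself is one of the autoencoder inputs.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(train_data)
X_test = scaler.transform(test_data)

# Scale the target with its own scaler and flatten back to 1-D arrays.
index_scaler = MinMaxScaler()
index_train = index_scaler.fit_transform(index_train.reshape(-1, 1))
index_test = index_scaler.transform(index_test.reshape(-1, 1))
index_train = index_train.reshape(-1)
index_test = index_test.reshape(-1)

# Single-hidden-layer autoencoder: inputs -> 16-unit bottleneck -> inputs.
input_layer = Input(shape=(X_train.shape[1],))
encoding_dim = 16
encoded = Dense(encoding_dim, activation='relu')(input_layer)
decoded = Dense(X_train.shape[1], activation='linear')(encoded)
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer='adam', loss='mean_squared_error')

# Train the network to reconstruct its own input.
autoencoder.fit(X_train, X_train,
                epochs=100,
                batch_size=75)

# Reconstructions for the train and test sets.
X_predicted = autoencoder.predict(X_train)
test_predicted = autoencoder.predict(X_test)

# Mean squared reconstruction error of each column on the train set.
error_train = np.mean(np.abs(X_train - X_predicted)**2, axis=0)

# Column indices sorted from best to worst reconstructed.
ind = np.argsort(error_train)
sort_error = error_train[ind]

# Number of columns used to track the index.
n = 4

# Reconstructed series of the n best-reconstructed columns.
portfolio_train = X_predicted[:, ind[:n]]
portfolio_test = test_predicted[:, ind[:n]]

# Equal-weighted average of the selected series, in and out of sample.
tracked_index_insample = np.mean(portfolio_train, axis=1)

tracked_index_outofsample = np.mean(portfolio_test, axis=1)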




**Instructions to let the next code cell run:**

Before running the cell below, ensure the following:

- The target variable must be named exactly `index_test`, and the out-of-sample prediction variable must be named `tracked_index_outofsample`. Store the number of companies used to reconstruct the index dynamics in a variable called `n`. The evaluation function relies on this naming convention to compute the final grade. Make sure that both `index_test` and `tracked_index_outofsample` are numpy arrays of shape (1158,), where 1158 is the length of the out-of-sample set. If the shapes differ from this, the following cell will not run correctly.

By adhering to these naming conventions, the grading cell can compute the final score without any issues.
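
A quick pre-flight check can catch shape problems before the grading cell runs. This matters because NumPy broadcasts a `(1158, 1)` array against a `(1158,)` array into a `(1158, 1158)` matrix instead of raising an error. The helper below is a hypothetical sketch, not part of the provided grading code, and its bound of 1 to 360 on `n` is an assumption based on the number of stocks in the dataset.

```python
import numpy as np


def check_submission(y_true, y_pred, n_companies, expected_len=1158):
    """Raise early if the grading inputs do not follow the required convention."""
    for name, arr in (("index_test", y_true), ("tracked_index_outofsample", y_pred)):
        if not isinstance(arr, np.ndarray):
            raise TypeError(f"{name} must be a numpy array, got {type(arr).__name__}")
        if arr.shape != (expected_len,):
            raise ValueError(f"{name} has shape {arr.shape}, expected ({expected_len},)")
    if not 1 <= n_companies <= 360:  # assumed bound: 360 stocks in the dataset
        raise ValueError("n must be between 1 and 360")
    return True


# A column vector of shape (1158, 1) fails the check until it is flattened.
column = np.zeros((1158, 1))
flat = column.ravel()
ok = check_submission(flat, np.zeros(1158), n_companies=4)
```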

In [10]:
import numpy as np
import math

def evaluate_index_performance(y_test, y_pred, num_companies_used, total_companies=360, weight_rmse=0.7, weight_efficiency=0.3):
    """
    Function to evaluate the performance of the reconstructed index.
    
    :param y_test: Actual index values (out of sample)
    :param y_pred: Predicted index values using a subset of companies
    :param num_companies_used: Number of companies used for the reconstruction
    :param total_companies: Total number of companies in the dataset (default 360)
    :param weight_rmse: Weight for the normalized RMSE score (default 0.7)
    :param weight_efficiency: Weight for the efficiency score (default 0.3)
    :return: A composite score combining normalized RMSE and efficiency
    """
    # Calculate RMSE and normalized RMSE score
    rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
    max_possible_rmse = y_test.std()
    rmse_score = 1 - rmse / max_possible_rmse

    # Calculate efficiency score
    efficiency_score = 1 - (num_companies_used / total_companies)

    # Calculate final grade
    final_score = weight_rmse * rmse_score + weight_efficiency * efficiency_score

    return final_score

grade = evaluate_index_performance(index_test, tracked_index_outofsample, n)
print(f'The grade for this assignment is {math.ceil(grade * 100):.2f}')


The grade for this assignment is 94.00
