# First Assignment - FINTECH 540 - Machine Learning for FinTech

In this assignment, you will gain hands-on experience applying linear models to financial market data. Specifically, you will work with time series prices of the 30 constituents of the *Dow Jones Industrial Average (DJIA)* Index. The dataset covers the period from June $2^{nd}$, 2017, through June $2^{nd}$, 2023. The price series of *DIA*, the ETF that tracks the DJIA index, is also provided. The dataset is uploaded on Sakai in the same place where you found this notebook.

You will work through three consecutive tasks. Each task builds on the previous one, so in general you can only attempt a task once you have solved the one before it. You can obtain at most 100 points for this home assignment. The tasks are briefly summarized below, and you can find the relevant prompt in each subsection of this notebook:
- Build descriptive linear models (CAPM) for all the index constituents (*20 points*).
- Select a subset of constituents and fit a predictive linear model to forecast the index value (*40 points*).
- Repeat the linear modeling exercise using bootstrapped returns (*40 points*).

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
import math
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelBinarizer

## Task 1 - Build descriptive linear models (CAPM) for all the index constituents (*20 points*)

The Capital Asset Pricing Model (CAPM) is represented as:

$$R_i - R_f =   \beta_i (R_m - R_f) + e_i$$

Where:
- $R_i$ is the return of the asset or security $i$.
- $R_f$ is the risk-free rate, representing the return on a risk-free investment.
- $\beta_i$ is the beta of the asset $i$, which measures its sensitivity to market movements.
- $R_m$ is the market portfolio's return (the index).
- $e_i$ is the error term or residual representing unexplained variation in the asset's return.
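
As a rough illustration of what fitting this equation means, the sketch below simulates excess returns with a known beta and recovers it with a least-squares regression through the origin. All numbers here (the seed, `true_beta`, the volatilities) are made up for the example and are not taken from the assignment dataset.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, used only for this illustration

n_days = 2000
true_beta = 1.1  # hypothetical sensitivity chosen for the example

market_excess = rng.normal(0.0, 0.01, size=n_days)  # synthetic R_m - R_f
noise = rng.normal(0.0, 0.005, size=n_days)         # synthetic e_i
asset_excess = true_beta * market_excess + noise    # synthetic R_i - R_f

# Least-squares slope for a regression without intercept:
# beta_hat = sum(x * y) / sum(x * x)
beta_hat = float(market_excess @ asset_excess / (market_excess @ market_excess))

print(f"true beta: {true_beta:.2f}, estimated beta: {beta_hat:.3f}")
```

The estimate should land close to the simulated beta; with real data, the same slope is what an OLS fit without a constant term reports.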

The CAPM equation helps estimate the return of an asset based on its risk relative to the market and the risk-free rate. You can calculate the daily risk-free rate by using the following formula.

$$ r_{\text{daily}} = \left(1 + r_{\text{annual}}\right)^{\frac{1}{365}} - 1 $$

Where:
- $r_{\text{daily}}$ is the daily yield. It represents the expected daily return on investment.
- $ r_{\text{annual}} $ is the annual yield. It represents the expected annual return on investment.
- The formula assumes daily compounding, meaning the investment's return is compounded every day over a year (365 days). This makes the risk-free rate compatible with a model based on daily returns.

For this task, you can use an annual yield of *5.482%* per the annualized U.S. 3-month Treasury Bill yield.
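
A quick way to sanity-check the conversion is to compound the daily yield back over 365 days and confirm that the annual yield is recovered. The snippet below is a minimal sketch of that round trip; the variable name `recovered_annual` is introduced here for illustration only.

```python
annual_yield = 0.05482  # annual yield given in the prompt

# Convert to a daily yield under daily compounding over 365 days
daily_yield = (1 + annual_yield) ** (1 / 365) - 1

# Compound the daily yield back to an annual figure as a consistency check
recovered_annual = (1 + daily_yield) ** 365 - 1

print(f"daily yield:      {daily_yield:.8f}")
print(f"recovered annual: {recovered_annual:.5f}")
```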

To solve this part of the homework, you have to:
- Compute the daily yield from the annualized yield provided in the prompt.
- Prepare the data to fit the CAPM for each company in the DJIA index described above.
- Fit the CAPM for each company and check the estimated sensitivity to market movements.
- Select the subset of stocks whose estimated sensitivity to market movements (beta) lies between 0.85 and 1.15. Before including a symbol, ensure the estimated sensitivity is statistically significant. Store the symbols in a Python list before moving to the next task.

Before performing the CAPM modeling, remember to split the dataset into a training set and a test set and use only the training set to perform Task 1. Use *2022-01-01* as a cutoff date. Ensure the cutoff date is included in the test set and not in the train set.
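
The boundary handling can be sketched on a toy series: a strict `<` keeps the cutoff date out of the training set, while `>=` puts it in the test set. The dates and values below are invented for the example, and the toy series contains the cutoff date itself purely to show the boundary.

```python
import pandas as pd

dates = pd.to_datetime(["2021-12-30", "2021-12-31", "2022-01-01", "2022-01-03"])
toy = pd.Series([0.01, -0.02, 0.00, 0.03], index=dates, name="toy_return")

cutoff = pd.Timestamp("2022-01-01")

train = toy[toy.index < cutoff]   # strictly before the cutoff
test = toy[toy.index >= cutoff]   # cutoff date and everything after it

print("train dates:", list(train.index.strftime("%Y-%m-%d")))
print("test dates: ", list(test.index.strftime("%Y-%m-%d")))
```

If the index holds ISO-formatted date strings rather than timestamps, the same comparisons still order correctly, because `YYYY-MM-DD` strings sort chronologically.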

**Motivation behind the task**

Fitting individual CAPM models allows for a detailed assessment of each stock's risk profile. CAPM provides a systematic way to quantify the sensitivity of each stock's returns to market movements, as measured by the beta coefficient. This individual assessment is valuable because different stocks may exhibit varying levels of market sensitivity.

Selecting stocks based on their beta values is usually a risk-based approach to portfolio construction. By choosing stocks with higher (lower) beta values, you are essentially selecting those that tend to exhibit greater (lower) price volatility in response to market fluctuations. This can be seen as a deliberate strategy to include riskier (safer) assets in the portfolio.

This task will set the basis for selecting a subset of index constituents to be used for a predictive model. 

**Grading Criteria**

- **Data Preparation (10 points)**: Points will be awarded for preparing the data appropriately for the modeling task.

- **CAPM Model Fitting (10 points)**: Points will be awarded based on the correctness and completeness of the CAPM models, including accurate significance evaluation and the subset of stock selection based on the beta estimations.

In [2]:
df = pd.read_csv('data/dows_daily.csv',index_col=0)

In [3]:
df

Date,DIA,GS.N,NKE.N,CSCO.OQ,JPM.N,DIS.N,INTC.OQ,MRK.N,CVX.N,AXP.N,...,PG.N,IBM.N,MMM.N,AAPL.OQ,WMT.N,CAT.N,AMGN.OQ,V.N,TRV.N,BA.N
2017-06-02,211.91,213.31,52.98,31.98,82.64,107.18,36.32,62.427347,103.11,78.49,...,88.59,145.232686,206.70,38.8625,79.62,105.95,159.15,96.15,125.15,190.23
2017-06-05,211.86,213.99,53.01,31.76,82.79,106.52,36.34,62.045937,103.19,78.97,...,88.74,145.576545,206.22,38.4825,80.26,105.20,160.22,96.55,125.38,188.95
2017-06-06,211.37,214.53,52.48,31.56,82.96,105.50,36.13,61.664526,104.17,78.85,...,88.80,145.538339,205.41,38.6125,78.93,104.55,159.53,95.79,124.03,186.75
2017-06-07,211.72,215.78,53.23,31.61,83.91,105.92,36.26,61.082876,103.77,79.81,...,88.77,144.210661,205.01,38.8425,79.15,103.51,161.66,96.09,123.50,188.10
2017-06-08,211.86,218.76,53.20,31.61,84.95,104.32,36.48,60.262843,104.00,79.95,...,87.85,145.280444,205.94,38.7475,78.93,105.01,162.65,96.09,123.78,189.93
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023-05-26,330.84,332.01,107.51,49.86,136.94,88.29,29.00,111.070000,154.08,157.24,...,145.40,128.890000,96.94,175.4300,146.42,211.80,216.93,225.01,172.29,203.63
2023-05-30,330.52,330.83,106.52,50.17,137.46,87.82,29.99,109.170000,153.12,158.01,...,143.18,129.480000,96.06,177.3000,146.06,209.90,218.53,221.64,173.29,204.69
2023-05-31,329.52,323.90,105.26,49.67,135.71,87.96,31.44,110.410000,150.62,158.56,...,142.50,128.590000,93.31,177.2500,146.87,205.75,220.65,221.03,169.24,205.70
2023-06-01,330.94,316.40,103.63,49.74,137.58,88.59,31.13,110.930000,152.16,162.72,...,143.96,129.820000,94.28,180.0900,147.41,209.07,214.27,226.50,171.30,207.96


In [4]:
# 3-month T-bill annual yield
annual_yield = 0.05482
daily_yield = ((1 + annual_yield) ** (1 / 365)) - 1
print(f'Daily Yield: {daily_yield:.8f}')

Daily Yield: 0.00014623


In [5]:
# Specify the time cutoff date as a string 
time_cutoff_date = '2022-01-01'

# Calculate the excess daily returns for each stock by subtracting the estimated daily yield
returns = df.pct_change().iloc[:, 1:].dropna() - daily_yield

# Calculate market returns (DIA index)
market_returns = df.pct_change().iloc[:, 0].dropna() - daily_yield

# Calculate beta for each stock by fitting CAPM
beta_values = {}
p_values = {}
for stock_symbol in returns.columns:
    stock_returns = returns[stock_symbol]
    
    # Split the returns data based on the time cutoff date
    stock_returns_train = stock_returns[stock_returns.index < time_cutoff_date]
    market_returns_train = market_returns[market_returns.index < time_cutoff_date]


    # Fit the OLS model; the CAPM equation above has no intercept, so no constant is added
    model = sm.OLS(stock_returns_train, market_returns_train).fit()
    # Beta is the coefficient of the market returns variable
    beta = model.params.iloc[0]
    # Get the p-value for the beta coefficient
    p_value = model.pvalues.iloc[0]
    # Store beta and pvalues
    beta_values[stock_symbol] = beta
    p_values[stock_symbol] = p_value

# Set thresholds for beta
upper_threshold = 1.15
lower_threshold = 0.85
significance_level = 0.05 
# Select stocks with beta values within the threshold range
selected_stocks = [symbol for symbol, beta in beta_values.items() if lower_threshold <= beta <= upper_threshold and p_values[symbol] < significance_level]
print("Selected Stocks based on significant Betas:")
print(selected_stocks)

Selected Stocks based on significant Betas:
['NKE.N', 'CSCO.OQ', 'DIS.N', 'INTC.OQ', 'HD.N', 'UNH.N', 'MSFT.OQ', 'HON.OQ', 'CRM.N', 'IBM.N', 'MMM.N', 'AAPL.OQ', 'CAT.N', 'V.N', 'TRV.N']


## Task 2 - Select a subset of constituents and fit a predictive linear model to forecast the index value (*40 points*)

In this task, you will apply linear predictive modeling techniques to forecast the value of the DIA ETF on the DJIA index using the subset of its constituents you selected in the previous task. The goal is to build a predictive linear model that accurately estimates the future index return based on the historical data of selected constituent stocks. Note that to perform this predictive task, you have to prepare the data accordingly. Don't use the excess returns with respect to a daily risk-free rate for this task, but use the plain returns instead.

The predictive linear regression equation to estimate the dependent variable $Y$ at time $t+1$ is represented as:

$$ Y_{t+1} = \beta_0 + \beta_1 X_{1,t} + \beta_2 X_{2,t} + \ldots + \beta_k X_{k,t} + \varepsilon_{t} $$

In this equation:

- $Y_{t+1}$ represents the dependent variable at time $t+1$ that we want to predict. Note that the dependent variable is real-valued.
- $\beta_0$ is the intercept or constant term.
- $\beta_1, \beta_2, \ldots, \beta_k$ are the $k$ coefficients for the independent variables $X_{1,t}, X_{2,t}, \ldots, X_{k,t}$ at time $t$. You can assume $k$ to be the number of stocks selected in the previous task. Note that the regressors are real-valued.
- $\varepsilon_{t}$ represents the error term at time $t$, capturing unexplained variation or noise in the dependent variable at that specific time.
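
To make the time indexing concrete, the sketch below shows one way to pair features at time $t$ with a target at time $t+1$ using `shift(-1)`. The column names and values are invented for the example.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "stock_ret": [0.01, 0.02, 0.03, 0.04],  # made-up regressor values X_t
        "index_ret": [0.10, 0.20, 0.30, 0.40],  # made-up index returns Y_t
    },
    index=pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07"]),
)

# shift(-1) moves each value one row up, so row t now holds Y_{t+1}
toy["index_ret_next"] = toy["index_ret"].shift(-1)

# The last row has no next-day value and is dropped
aligned = toy.dropna()

print(aligned)
```

Each remaining row pairs a regressor observed on one day with the index return of the following day, which is the layout the equation describes.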

Before performing the linear regression modeling, remember to split the dataset into a training set and a test set. Use *2022-01-01* as a cutoff date, the same way you did in the previous task. Make sure the cutoff date will be included in the test set and not in the train set.

Assess the performance of your predictive model using an appropriate evaluation metric for a regression problem like this one. Evaluate the model on the test set to ensure its predictive accuracy out-of-sample.
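
One common choice for a real-valued target is the root mean squared error (RMSE). An RMSE is easier to interpret next to a naive baseline, for example always predicting the mean return of the training set. The sketch below uses invented numbers; the `rmse` helper and the baseline are illustrative assumptions, not a required part of the solution.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up values standing in for train targets, test targets and model output
y_train = np.array([0.004, -0.002, 0.001, 0.003])
y_test = np.array([0.010, -0.010, 0.020])
y_pred = np.array([0.005, -0.002, 0.010])

# Naive baseline: predict the training-set mean for every test observation
baseline_pred = np.full_like(y_test, y_train.mean())

print(f"model RMSE:    {rmse(y_test, y_pred):.6f}")
print(f"baseline RMSE: {rmse(y_test, baseline_pred):.6f}")
```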

**Grading Criteria**

- **Data Preparation (15 points)**: Points will be awarded for preparing the data appropriately for the modeling task.

- **Predictive Regression Model Building (20 points)**: Points will be awarded based on the correctness and completeness of the regression model built using selected stocks' returns and the index return.

- **Model Evaluation (5 points)**: Points will be awarded based on the proper choice of evaluation metric.

In [6]:
# Extract the selected stock data and the index returns from price dataframe 
returns = df.pct_change().loc[:, ['DIA'] + selected_stocks].dropna()

# Shift the index return back by one day so each row pairs today's features with tomorrow's index return
returns['DIA_shifted'] = returns['DIA'].shift(-1)
returns.dropna(inplace=True)  # Remove the last row with NaN target from the whole df

# Define the features (selected stock returns) and target (future Dow Jones return)
X = returns[selected_stocks]
X = sm.add_constant(X)

y = returns['DIA_shifted']

# Split the data based on the time cutoff
X_train = X[X.index < time_cutoff_date]
y_train = y[y.index < time_cutoff_date]
X_test = X[X.index >= time_cutoff_date]
y_test = y[y.index >= time_cutoff_date]

# Create a linear regression model using statsmodels OLS
model = sm.OLS(y_train, X_train).fit()

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)
print(f'Mean Squared Error: {mse:.6f}')
print(f'Root Mean Squared Error: {rmse:.6f}')

Mean Squared Error: 0.000146
Root Mean Squared Error: 0.012065


## Task 3 - Augment the Dataset with Bootstrapped Alphas and Fit again the Linear Predictive Models (40 points)

In this task, we explore the concept of bootstrapped alphas and their role in predictive modeling. Bootstrapped alphas are used as proxy trading signals for real alphas that can be practically obtained. These signals are correlated with future returns and can play the role of good predictors in the predictive modeling process. Don't use the excess returns with respect to a daily risk-free rate for this task; use the plain returns instead when you calculate the bootstrapped alphas.

We define bootstrapped alphas $\alpha_{i,t}$ as per the formula below:

$$\alpha_{i,t} := \rho_{\text{boot}} r_{i,t+1} + \sqrt{1 - \rho_{\text{boot}}^{2}} z_{i,t}$$

where:
- $r_{i,t+1}$ represents the next period return of the traded security $i$, which is given to you.
- $z_{i,t} \sim \mathcal{N}(0,\sigma^{2}_{i})$ is a randomly drawn scalar associated with each company $i$; it is not given, so you have to sample it. Because you draw one noise series for each company used as a regressor, make sure the sampled vectors are independent of each other. The companies are the same ones you used in the previous task, that is, those you selected by fitting the CAPM in Task 1.
- $\sigma^{2}_{i}$ is an estimate of the true conditional variance of the security $i$, which you have to calculate based on the given returns. Note that you have to calculate those variances on the train set only. Use the same cutoff applied in the previous task to define the training set.
- $\rho_{\text{boot}} \in [-1,1]$ is a correlation coefficient, which you have to set equal to 0.25.

In this setting, the parameter $\rho_{\text{boot}}$ artificially regulates the strength of the trading signal you create. We remark that regressing the bootstrapped alpha $\alpha_{i,t}$ on the future returns $r_{i,t+1}$ results in an $R^2$ equal to $\rho_{\text{boot}}^2$.
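
A short derivation of this remark, under the assumptions that $z_{i,t}$ is independent of $r_{i,t+1}$ and that both have the same variance $\sigma^{2}_{i}$ (in practice $\sigma^{2}_{i}$ is estimated, so the result holds only approximately):

$$
\begin{aligned}
\operatorname{Cov}(\alpha_{i,t}, r_{i,t+1}) &= \rho_{\text{boot}} \operatorname{Var}(r_{i,t+1}) = \rho_{\text{boot}} \sigma^{2}_{i}, \\
\operatorname{Var}(\alpha_{i,t}) &= \rho_{\text{boot}}^{2} \sigma^{2}_{i} + \left(1 - \rho_{\text{boot}}^{2}\right) \sigma^{2}_{i} = \sigma^{2}_{i}, \\
\operatorname{Corr}(\alpha_{i,t}, r_{i,t+1}) &= \frac{\rho_{\text{boot}} \sigma^{2}_{i}}{\sqrt{\sigma^{2}_{i}}\sqrt{\sigma^{2}_{i}}} = \rho_{\text{boot}}.
\end{aligned}
$$

For a regression with a single regressor and an intercept, $R^2$ is the squared correlation, which gives $R^2 = \rho_{\text{boot}}^{2}$.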

The equation above formalizes the calculation of the bootstrapped alpha for a single security, while you will have more than one security. Try to make your calculations as efficient as possible by computing them simultaneously. This is possible by using operations between pandas DataFrames. Remember that $z_{i,t} \sim \mathcal{N}(0,\sigma^{2}_{i})$ can be calculated as $z_{i,t} = \sqrt{\sigma^{2}_{i}}u_{i,t}$ where $u_{i,t} \sim \mathcal{N}(0,1)$. 
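
The vectorized calculation can be sketched on synthetic data as follows. Everything in this example is hypothetical: the symbols `AAA`, `BBB`, `CCC`, their volatilities, and the seed (the assignment itself requires seed 42 and the real returns). The final line checks that the realized correlation between each alpha and its own future return is close to $\rho_{\text{boot}}$.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # illustration only; not the seed required by the task
rho = 0.25
n_obs = 20000

symbols = ["AAA", "BBB", "CCC"]  # hypothetical tickers
vols = pd.Series([0.010, 0.015, 0.020], index=symbols)  # made-up daily volatilities

# Synthetic stand-in for the next-period returns r_{i,t+1}
future_returns = pd.DataFrame(rng.normal(size=(n_obs, len(symbols))), columns=symbols) * vols

# Per-security standard deviation, i.e. sqrt(sigma_i^2)
sigma = future_returns.std()

# u ~ N(0, 1), drawn independently for every security and date
u = pd.DataFrame(rng.normal(size=future_returns.shape), columns=symbols)

# DataFrame * Series aligns on column labels, scaling each column by its own sigma
z = u * sigma

# All alphas computed in one expression
alphas = rho * future_returns + np.sqrt(1 - rho**2) * z

# Column-wise correlation between each alpha and its own future return
realized_corr = alphas.corrwith(future_returns)
print(realized_corr.round(3))
```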

Once you calculate the bootstrapped alphas, repeat the linear predictive forecasting exercise as in the previous task. This time you will use the bootstrapped alphas as predictors. The target in the equation below stays the same as in the previous task (the future returns of DIA); only the predictors change, from the current returns of the constituents to the bootstrapped alphas you have calculated.

$$ Y_{t+1} = \beta_0 + \beta_1 X_{1,t} + \beta_2 X_{2,t} + \ldots + \beta_k X_{k,t} + \varepsilon_{t} $$

To ensure reproducibility, please set the random seed to 42. Don't use another seed, and remember to set it. Failing to follow these guidelines will result in point deductions.
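
The effect of fixing the seed can be seen in a minimal sketch: two generators created with the same seed produce identical draws, while a different seed does not. The seed 7 below is arbitrary and used only for contrast.

```python
import numpy as np

draws_a = np.random.RandomState(42).normal(size=5)
draws_b = np.random.RandomState(42).normal(size=5)
draws_c = np.random.RandomState(7).normal(size=5)  # arbitrary different seed

print("same seed, same draws:     ", np.array_equal(draws_a, draws_b))
print("different seed, same draws:", np.array_equal(draws_a, draws_c))
```

Note that the order and shape of the sampling calls also matter: drawing the same numbers in a different order from the same generator leads to different values per security.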

**Motivation behind the task**

In the dynamic and complex world of financial markets, predictive modeling is a potent tool to decipher underlying patterns and trends that govern security prices. Coming up with good predictors for a certain set of assets is a complicated task that is not necessarily the purpose of this assignment. The concept of bootstrapped alphas, as delineated in this exercise, is a method to engineer artificial trading signals that can potentially enhance the predictive power of financial models. It is equivalent to assuming that we have a way to predict the future returns of the index constituents. The alpha bootstrap equation shows why we are talking about future returns: each alpha is built from the next-period return $r_{i,t+1}$.

The utilization of bootstrapped alphas is grounded in the mathematical formulation provided, where the alpha ($\alpha_{i,t}$) for a security $i$ at time $t$ is constructed using a combination of the next period return of the security ($r_{i,t+1}$) and a stochastic component ($z_{i,t}$) drawn from a normal distribution. This formulation allows for the incorporation of both deterministic and random elements, thereby mimicking the inherent uncertainty and volatility observed in financial markets.

By setting the correlation coefficient ($\rho_{\text{boot}}$) to 0.25, we are essentially moderating the influence of the artificial trading signal, ensuring that it does not overwhelmingly dictate the behavior of the bootstrapped alphas. This parameter, therefore, serves as a tuning knob, allowing us to control the strength of the trading signal and, consequently, its predictive power. However, you have to keep this parameter fixed for this exercise, as indicated by the prompt.

The subsequent step of employing these bootstrapped alphas as predictors in a linear predictive forecasting model is an exercise to highlight how well one can expect to forecast index returns, given a good way to predict future returns for the constituents. By replacing the current returns of the constituents with the calculated bootstrapped alphas, we are essentially enhancing the model with artificially generated yet statistically grounded signals that can potentially unveil deeper insights into the market dynamics.

**Grading Criteria**

- **Data Preparation (30 points)**: Points will be awarded for preparing the data appropriately for the modeling task.

- **Predictive Regression Model Building (10 points)**: Points will be awarded based on the correctness and completeness of the regression model built using selected stocks' bootstrapped alpha and the index return.

In [7]:
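rng = np.random.RandomState(42)
# Per-stock return variances, estimated on the training period only
conditional_variances = returns.loc[returns.index < time_cutoff_date, selected_stocks].var()
# Draw independent standard normals and scale each column by that stock's standard deviation
random_vectors = pd.DataFrame(
    rng.normal(size=returns[selected_stocks].shape),
    index=returns.index,
    columns=conditional_variances.index,
) * np.sqrt(conditional_variances)
rho = 0.25
# Compute the bootstrapped alphas: rho * next-day return + sqrt(1 - rho^2) * noise
bootstrapped_returns = rho * returns[selected_stocks].shift(-1) + np.sqrt(1 - rho**2) * random_vectors
bootstrapped_returns.dropna(inplace=True)  # the last row has no next-day return

# Attach the target (next-day DIA return); the assignment aligns on the date index
bootstrapped_returns['DIA_shifted'] = returns['DIA_shifted']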



# Define the features (bootstrapped alphas) and target (future DIA return)
# Note: unlike Task 2, no constant term is added to X here
X = bootstrapped_returns[selected_stocks]

y = bootstrapped_returns['DIA_shifted']

# Split the data based on the time cutoff
X_train = X[X.index < time_cutoff_date]
y_train = y[y.index < time_cutoff_date]
X_test = X[X.index >= time_cutoff_date]
y_test = y[y.index >= time_cutoff_date]

# Create a linear regression model using statsmodels OLS
model = sm.OLS(y_train, X_train).fit()

# Make predictions on the test data
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)
print(f'Mean Squared Error: {mse:.6f}')
print(f'Root Mean Squared Error: {rmse:.6f}')

Mean Squared Error: 0.000084
Root Mean Squared Error: 0.009146
