Name: Lorenzo Ausiello

Date: 02/05/2024

In [1]:
import numpy as np
import yfinance
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# Set seed of random number generator
CWID = 20021869 
personal = CWID % 10000
np.random.seed(personal)

## Question 1 (10pt)

### Question 1.1 
Use the `yfinance` package (or other method of your choice) to obtain the daily adjusted close prices for `SPY` and `IEF`.  You should have at least 5 years of data for both assets. Do **not** include any data after January 1, 2023.  You should inspect the dates for your data to make sure you are including everything appropriately.  Create a binary variable indicating whether the `SPY` return is above the `IEF` return on each day. Create a data frame (or array) of the daily log returns for both assets along with the lagged returns (at least 3 lags) and your binary class variable.  Use the `print` command to display your data.

In [10]:
import yfinance as yf
from datetime import datetime
import pandas as pd

# Download historical data for SPY and IEF
start_date = datetime(2017, 1, 1)
end_date = datetime(2023, 1, 1)

myData = yf.download(['SPY', 'IEF'], start=start_date, end=end_date)
SPY = myData['Adj Close']['SPY']
IEF = myData['Adj Close']['IEF']

# Calculate daily log returns
rSPY = (np.log(SPY) - np.log(SPY.shift(1))).dropna()
rIEF = (np.log(IEF) - np.log(IEF.shift(1))).dropna()

df = pd.DataFrame({'SPY':rSPY, 'IEF': rIEF})

for i in range(1, 4):
    df[f"SPY_lag{i}"] = df["SPY"].shift(i)
    df[f"IEF_lag{i}"] = df["IEF"].shift(i)
    
df= df.dropna()

df['SPY>IEF'] = (df['SPY'] > df['IEF']).astype(int)

print(df)

[*********************100%***********************]  2 of 2 completed
                 SPY       IEF  SPY_lag1  IEF_lag1  SPY_lag2  IEF_lag2  \
Date                                                                     
2017-01-09 -0.003306  0.003799  0.003571 -0.004558 -0.000795  0.006463   
2017-01-10  0.000000 -0.000474 -0.003306  0.003799  0.003571 -0.004558   
2017-01-11  0.002822  0.001137  0.000000 -0.000474 -0.003306  0.003799   
2017-01-12 -0.002513  0.000569  0.002822  0.001137  0.000000 -0.000474   
2017-01-13  0.002293 -0.002180 -0.002513  0.000569  0.002822  0.001137   
...              ...       ...       ...       ...       ...       ...   
2022-12-23  0.005736 -0.004537 -0.014369 -0.000309  0.014842  0.001235   
2022-12-27 -0.003951 -0.008407  0.005736 -0.004537 -0.014369 -0.000309   
2022-12-28 -0.012506 -0.002400 -0.003951 -0.008407  0.005736 -0.004537   
2022-12-29  0.017840  0.004898 -0.012506 -0.002400 -0.003951 -0.008407   
2022-12-30 -0.002638 -0.004167  0.017840  0

### Question 1.2
Split your data into training and testing sets (80% training and 20% test). This split should be done so that the causal relationship is kept consistent (i.e., split data at a specific time).

Run a logistic regression of the binary variable (of `SPY` returns greater than `IEF` returns) as a function of the lagged returns (at least 2 lags) for both stocks.
This should be of the form (assuming 3 lags) of $p_{t} = [1 + \exp(-[\beta_0 + \beta_{SPY,1} r_{SPY,t-1} + \beta_{SPY,2} r_{SPY,t-2} + \beta_{SPY,3} r_{SPY,t-3} + \beta_{IEF,1} r_{IEF,t-1} + \beta_{IEF,2} r_{IEF,t-2} + \beta_{IEF,3} r_{IEF,t-3}])]^{-1}$.
Evaluate the performance of this model by printing the confusion matrix and accuracy on the test data.
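
As a minimal sketch of the formula above, the probability $p_t$ can be evaluated by hand for one day. The coefficient values and the helper name `logistic_probability` are hypothetical, chosen only for illustration; they are not fitted values from this notebook.

```python
import math

def logistic_probability(intercept, coefs, lagged_returns):
    """Return p_t = 1 / (1 + exp(-(beta_0 + sum_k beta_k * x_k)))."""
    if len(coefs) != len(lagged_returns):
        raise ValueError("coefs and lagged_returns must have the same length")
    linear_index = intercept + sum(b * x for b, x in zip(coefs, lagged_returns))
    return 1.0 / (1.0 + math.exp(-linear_index))

# Hypothetical coefficients (NOT fitted values), ordered as
# SPY lags 1-3 followed by IEF lags 1-3.
example_coefs = [0.5, -0.2, 0.1, -0.3, 0.0, 0.2]
# Hypothetical lagged log returns in the same order.
example_lags = [0.003, -0.001, 0.002, 0.001, -0.002, 0.0005]

p_example = logistic_probability(0.1, example_coefs, example_lags)
# Class 1 (SPY return above IEF return) is predicted when p_t > 0.5.
predicted_class = int(p_example > 0.5)
```

Because daily log returns are small, the linear index is dominated by the intercept unless the coefficients are large, which is one reason a fitted model of this form can end up predicting the same class on every day.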

In [20]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.preprocessing import StandardScaler
# Get the length of the DataFrame
data_length = len(df)

# Calculate the number of samples for each set
train_size = int(data_length * 0.8)
test_size = data_length - train_size

# Split into training and testing sets based on index
train_data = df.iloc[:train_size]
test_data = df.iloc[train_size:]

# Define features and target variable for training and testing sets
X_train = train_data[['SPY_lag1', 'SPY_lag2', 'SPY_lag3', 'IEF_lag1', 'IEF_lag2', 'IEF_lag3']]
y_train = train_data['SPY>IEF']

X_test = test_data[['SPY_lag1', 'SPY_lag2', 'SPY_lag3', 'IEF_lag1', 'IEF_lag2', 'IEF_lag3']]
y_test = test_data['SPY>IEF']

# Fit logistic regression model on the (unscaled) lagged returns
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Predict on the test data
y_pred = log_reg.predict(X_test)

# Evaluate the model
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

print("Confusion Matrix:")
print(conf_matrix)
print("Accuracy:", accuracy)

Confusion Matrix:
[[  0 146]
 [  0 156]]
Accuracy: 0.5165562913907285


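<font color='red'>Solution:</font> The test accuracy is 51.7%. The model predicts class 1 for every test observation, so this equals the share of class 1 days in the test set (156 of 302).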
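
The confusion matrix printed above can be summarized with a few ratios. The sketch below is illustrative only; the helper name `summarize_confusion` is an assumption, and the matrix is copied from the output of the cell above, using the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
def summarize_confusion(matrix):
    """Summarize a 2x2 confusion matrix.

    matrix[i][j] is the number of observations with true class i
    that were predicted as class j.
    """
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        # Share of correct predictions.
        "accuracy": (tn + tp) / total,
        # Accuracy of always predicting the more frequent true class.
        "majority_baseline": max(tn + fp, fn + tp) / total,
        # Share of observations predicted as class 1.
        "predicted_positive_share": (fp + tp) / total,
    }

# Confusion matrix of the logistic regression on the test data.
logit_summary = summarize_confusion([[0, 146], [0, 156]])
print(logit_summary)
```

When the accuracy equals the majority baseline and every observation is predicted as class 1, the model is not discriminating between days at all.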

## Question 2 (30pt)

### Question 2.1
Using the same data, train/test split ratio, and consider the same classification problem as in Question 1.2.
Create a feed-forward neural network with a single hidden layer (8 hidden nodes) densely connected to the inputs.
You may choose any activation functions you wish.

In [16]:
import tensorflow as tf
from tensorflow import keras

model = keras.Sequential()

# Hidden layer
model.add(keras.layers.Dense(units=8, input_dim=6, activation= 'relu'))

# Output layer
model.add(keras.layers.Dense(1, activation = "sigmoid"))

model.compile(optimizer=keras.optimizers.Adam(), loss='binary_crossentropy', metrics=['accuracy'])

model.summary()



Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 8)                 56        
                                                                 
 dense_1 (Dense)             (None, 1)                 9         
                                                                 
Total params: 65 (260.00 Byte)
Trainable params: 65 (260.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
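
The parameter counts in the summary above can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A small sketch, with the helper name `dense_params` introduced here only for illustration:

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a dense layer: weights plus biases."""
    return n_inputs * n_units + n_units

# Hidden layer: 6 lagged returns feeding 8 hidden nodes.
hidden_params = dense_params(6, 8)
# Output layer: 8 hidden nodes feeding 1 sigmoid unit.
output_params = dense_params(8, 1)
total_params = hidden_params + output_params
print(hidden_params, output_params, total_params)
```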


### Question 2.2
Train this neural network on the training data.  
Evaluate the performance of this model by printing the confusion matrix and accuracy on the test data.

In [17]:
model.fit(X_train, y_train, epochs=500, verbose = 0)
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

y_pred_prob = model.predict(X_test)
y_pred_binary = (y_pred_prob > 0.5).astype(int)
conf_matrix = confusion_matrix(y_test, y_pred_binary)
print(conf_matrix)



Test Loss: 0.7014026045799255
Test Accuracy: 0.4867549538612366
[[ 20 126]
 [ 29 127]]


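<font color='red'>Solution:</font> the test accuracy (48.7%) is slightly lower than that of the logistic regression (51.7%), but the network predicts both classes rather than only class 1.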

### Question 2.3
Using the same train/test split and consider the same classification problem as in Question 1.2.
Train and test another feed-forward neural network of your own design.

In [21]:
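# Build a new network; calling model.add on the Question 2.1 model would
# stack these layers on top of its already-trained sigmoid output
model = keras.Sequential()

# First hidden layer with 32 units and ReLU activation
model.add(keras.layers.Dense(units=32, input_dim=6, activation='relu'))

# Second hidden layer with 16 units and ReLU activation
model.add(keras.layers.Dense(units=16, activation='relu'))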

# Output layer with 1 unit
model.add(keras.layers.Dense(1, activation="sigmoid"))

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model on the training data
model.fit(X_train, y_train, epochs=500, verbose = 0)

# Make predictions on the test data
y_pred_prob = model.predict(X_test)
y_pred_binary = (y_pred_prob > 0.5).astype(int)

# Confusion matrix and accuracy
conf_matrix = confusion_matrix(y_test, y_pred_binary)
accuracy = accuracy_score(y_test, y_pred_binary)

print(conf_matrix)
print(accuracy)


[[ 32 114]
 [ 35 121]]
0.5066225165562914


<font color='red'>Solution:</font> in this case, the test accuracy (50.7%) is slightly higher than that of the single-hidden-layer network (48.7%), but it is still close to 50%.

## Question 3 (30pt)

### Question 3.1
Using the same data, train/test split ratio, and consider the same classification problem as in Question 1.2.
Create a time dilation neural network with a single convolutional layer (filter size of 8, kernel size of 3, dilation size of 1) densely connected to the inputs.
You may choose any activation functions you wish.

*Hint:* The CNN can reference earlier lags on its own without feeding explicit memory inputs as was needed for the Question 2.
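
One way to read the hint is that the network receives a short window of consecutive daily returns, with time along one axis and the two assets as channels, rather than separate lag columns. The sketch below illustrates that layout under stated assumptions: the helper name `make_windows`, the window length, and the toy return values are hypothetical, and this is not the input layout used in the cell that follows.

```python
def make_windows(spy_returns, ief_returns, window=3):
    """Build rolling windows of past returns and next-day class labels.

    windows[k] holds `window` consecutive days of [r_SPY, r_IEF], ending the
    day before targets[k]; targets[k] is 1 when SPY beats IEF on that day.
    """
    if len(spy_returns) != len(ief_returns):
        raise ValueError("return series must have the same length")
    windows, targets = [], []
    for t in range(window, len(spy_returns)):
        windows.append(
            [[spy_returns[s], ief_returns[s]] for s in range(t - window, t)]
        )
        targets.append(int(spy_returns[t] > ief_returns[t]))
    return windows, targets

# Toy log returns, for illustration only.
toy_spy = [0.01, -0.02, 0.03, 0.00, 0.01]
toy_ief = [0.00, 0.01, -0.01, 0.02, -0.01]
toy_windows, toy_targets = make_windows(toy_spy, toy_ief, window=3)
```

Converting `toy_windows` with `np.asarray` would give an array of shape `(samples, 3, 2)`, which is the `(timesteps, channels)` layout a `Conv1D` layer with `input_shape=(3, 2)` expects.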

In [26]:
from tensorflow.keras.layers import Conv1D, Dense, Flatten
TDNN = keras.Sequential()

TDNN.add(keras.layers.Conv1D(filters=8, kernel_size=3, padding='causal', dilation_rate=1, activation='relu', input_shape=(2, 1)))
TDNN.add(keras.layers.Flatten())
TDNN.add(keras.layers.Dense(1, activation='sigmoid'))

TDNN.compile(optimizer=keras.optimizers.Adam(), loss='binary_crossentropy', metrics=['accuracy'])

X_train1 = train_data[['SPY_lag1', 'IEF_lag1']].values
X_test1 = test_data[['SPY_lag1', 'IEF_lag1']].values

# Reshape data for Conv1D input
X_train_reshaped = X_train1.reshape((X_train1.shape[0], X_train1.shape[1], 1))
X_test_reshaped = X_test1.reshape((X_test1.shape[0], X_test1.shape[1], 1))
TDNN.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv1d_1 (Conv1D)           (None, 2, 8)              32        
                                                                 
 flatten_1 (Flatten)         (None, 16)                0         
                                                                 
 dense_6 (Dense)             (None, 1)                 17        
                                                                 
Total params: 49 (196.00 Byte)
Trainable params: 49 (196.00 Byte)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
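
The parameter counts in this summary follow from the layer shapes: a 1D convolution has `kernel_size * in_channels * filters` weights plus one bias per filter, and causal padding keeps the sequence length at 2, so the flattened vector has 2 * 8 entries. A short check, with helper names introduced here only for illustration:

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Trainable parameters in a Conv1D layer: kernel weights plus biases."""
    return kernel_size * in_channels * filters + filters

def dense_params(n_inputs, n_units):
    """Trainable parameters in a dense layer: weights plus biases."""
    return n_inputs * n_units + n_units

timesteps, in_channels, filters = 2, 1, 8
conv_count = conv1d_params(3, in_channels, filters)
# Causal padding preserves the number of time steps, so Flatten outputs
# timesteps * filters values.
flattened_size = timesteps * filters
dense_count = dense_params(flattened_size, 1)
print(conv_count, flattened_size, dense_count, conv_count + dense_count)
```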


### Question 3.2
Train this neural network on the training data.
Evaluate the performance of this model by printing the confusion matrix and accuracy on the test data.

In [27]:
TDNN.fit(X_train_reshaped, y_train, epochs=500, verbose=0)

test_loss, test_accuracy = TDNN.evaluate(X_test_reshaped, y_test, verbose=0)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

# Predictions on the test set
y_pred_prob = TDNN.predict(X_test_reshaped)
y_pred_binary = (y_pred_prob > 0.5).astype(int)

# Confusion matrix and accuracy
conf_matrix = confusion_matrix(y_test, y_pred_binary)
accuracy = accuracy_score(y_test, y_pred_binary)

print(conf_matrix)
print(accuracy)

Test Loss: 0.6981911063194275
Test Accuracy: 0.5298013091087341
[[  9 137]
 [  5 151]]
0.5298013245033113


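<font color='red'>Solution:</font> in this case, the test accuracy (53.0%) is slightly higher than that of the previous models, but it is still close to 50%, and the network predicts class 1 for almost all test observations.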

### Question 3.3
Using the same train/test split and consider the same classification problem as in Question 1.2. Train and test another convolutional neural network of your own design.

In [32]:
TDNN1 = keras.Sequential()

TDNN1.add(keras.layers.Conv1D(filters=32, kernel_size=3, padding='causal', dilation_rate=1, activation='relu', input_shape=(2, 1)))
TDNN1.add(keras.layers.Conv1D(filters=16, kernel_size=3, padding='causal', dilation_rate=1, activation='relu'))
TDNN1.add(keras.layers.Flatten())
TDNN1.add(keras.layers.Dense(1, activation='sigmoid'))

TDNN1.compile(optimizer=keras.optimizers.Adam(), loss='binary_crossentropy', metrics=['accuracy'])

TDNN1.summary()
         
TDNN1.fit(X_train_reshaped, y_train, epochs=500, verbose=0)

test_loss, test_accuracy = TDNN1.evaluate(X_test_reshaped, y_test, verbose=0)
print("Test Loss:", test_loss)
print("Test Accuracy:", test_accuracy)

# Predictions on the test set
y_pred_prob = TDNN1.predict(X_test_reshaped)
y_pred_binary = (y_pred_prob > 0.5).astype(int)

# Confusion matrix and accuracy
conf_matrix = confusion_matrix(y_test, y_pred_binary)
accuracy = accuracy_score(y_test, y_pred_binary)

print(conf_matrix)
print(accuracy)

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 conv1d_6 (Conv1D)           (None, 2, 32)             128       
                                                                 
 conv1d_7 (Conv1D)           (None, 2, 16)             1552      
                                                                 
 flatten_4 (Flatten)         (None, 32)                0         
                                                                 
 dense_9 (Dense)             (None, 1)                 33        
                                                                 
Total params: 1713 (6.69 KB)
Trainable params: 1713 (6.69 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
Test Loss: 0.7020733952522278
Test Accuracy: 0.49668875336647034
[[ 12 134]
 [ 18 138]]
0.4966887417218543


<font color='red'>Solution:</font> the test accuracy (49.7%) is lower than that of the single-layer TDNN (53.0%).

## Question 4 (30pt)

### Question 4.1
Consider the same classification problem as in Question 1.2.
Of the methods considered in this assignment, which would you recommend in practice?
Explain briefly (1 paragraph) why you choose this fit. 

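<font color='red'>Solution:</font> I would recommend the TDNN from Question 3.1. Its test accuracy (53.0%) is the highest of the models considered, slightly above the logistic regression (51.7%). Unlike the logistic regression, which predicts class 1 for every test observation, it also predicts class 0 on some days. The advantage is small, however: every model has an accuracy close to 50%, so the evidence in favor of any one method is weak.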

### Question 4.2
Recreate your data set using data from January 1, 2023 through December 31, 2023.
Using the method you would implement in practice, invest in the asset (``SPY`` or ``IEF``) depending on your predictions.
Print the returns your portfolio would obtain from following this strategy. Comment on how this portfolio compares with the ``SPY`` and ``IEF`` returns and risks.

In [31]:
start_date = datetime(2023, 1, 1)
end_date = datetime(2023, 12, 31)

myData = yf.download(['SPY', 'IEF'], start=start_date, end=end_date)
SPY = myData['Adj Close']['SPY']
IEF = myData['Adj Close']['IEF']

# Calculate daily log returns
rSPY = (np.log(SPY) - np.log(SPY.shift(1))).dropna()
rIEF = (np.log(IEF) - np.log(IEF.shift(1))).dropna()

df = pd.DataFrame({'SPY': rSPY, 'IEF': rIEF})

for i in range(1, 4):
    df[f"SPY_lag{i}"] = df["SPY"].shift(i)
    df[f"IEF_lag{i}"] = df["IEF"].shift(i)

df = df.dropna()

df['SPY>IEF'] = (df['SPY'] > df['IEF']).astype(int)

# Define features and target variable for the 2023 prediction period
X_predict = df[['SPY_lag1', 'IEF_lag1']].values
y = df['SPY>IEF']

X_predict_reshaped = X_predict.reshape((X_predict.shape[0], X_predict.shape[1], 1))

# Make predictions with the TDNN trained in Question 3.2
y_pred_prob = TDNN.predict(X_predict_reshaped)
y_pred_binary = (y_pred_prob > 0.5).astype(int)

# Calculate portfolio returns
df['Predictions'] = y_pred_binary

df['Portfolio_Returns'] = np.where(df['Predictions'] == 1, df['SPY'], df['IEF'])
print(df)

[*********************100%***********************]  2 of 2 completed
                 SPY       IEF  SPY_lag1  IEF_lag1  SPY_lag2  IEF_lag2  \
Date                                                                     
2023-01-09 -0.000567  0.002538  0.022673  0.012787 -0.011479 -0.001440   
2023-01-10  0.006988 -0.006306 -0.000567  0.002538  0.022673  0.012787   
2023-01-11  0.012569  0.006407  0.006988 -0.006306 -0.000567  0.002538   
2023-01-12  0.003634  0.008882  0.012569  0.006407  0.006988 -0.006306   
2023-01-13  0.003872 -0.005340  0.003634  0.008882  0.012569  0.006407   
...              ...       ...       ...       ...       ...       ...   
2023-12-22  0.002008 -0.000935  0.009437 -0.001557 -0.013954  0.004886   
2023-12-26  0.004214  0.000624  0.002008 -0.000935  0.009437 -0.001557   
2023-12-27  0.001806  0.007762  0.004214  0.000624  0.002008 -0.000935   
2023-12-28  0.000378 -0.003822  0.001806  0.007762  0.004214  0.000624   
2023-12-29 -0.002899 -0.002487  0.000378 -0

In [39]:
mean_p = df['Portfolio_Returns'].mean()*252
sd_p = df['Portfolio_Returns'].std()*252**0.5

print(mean_p)
print(sd_p)

# SPY
mean_s = df['SPY'].mean()*252
sd_s = df['SPY'].std()*252**0.5
print('Spy mean :', mean_s)
print('Spy sd:', sd_s)

# IEF
mean = df['IEF'].mean()*252
sd = df['IEF'].std()*252**0.5
print('Ief mean :', mean)
print('Ief sd:', sd)

# Sharpe-style ratio of the portfolio, using the IEF mean return as the benchmark
sharpe_ratio = (mean_p-mean)/sd_p
print('sharpe ratio portfolio:', sharpe_ratio)

0.27785877165972805
0.12823682216769253
Spy mean : 0.22315317205683186
Spy sd: 0.1289527803460459
Ief mean : 0.009232127777839577
Ief sd: 0.09318130785619633
sharpe ratio portfolio: 2.094769968103321


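<font color='red'>Solution:</font> this method invests mainly in `SPY`, but it has a higher annualized mean log return (27.8% vs. 22.3%) and a slightly lower annualized standard deviation (12.8% vs. 12.9%) than `SPY`. Compared with `IEF` (mean 0.9%, standard deviation 9.3%), it earns much more at the cost of higher volatility. Its Sharpe ratio, computed with the `IEF` mean return as the benchmark, is 2.09, versus about 1.66 for `SPY` computed the same way.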
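
To make the risk comparison concrete, the sketch below recomputes the Sharpe-style ratio from the annualized figures printed above and adds a maximum drawdown helper as one further risk measure. The helper names `sharpe_like` and `max_drawdown` are assumptions introduced here, the drawdown is shown only on toy log returns, and using the `IEF` mean as the benchmark follows the cell above rather than a true risk-free rate.

```python
import math

def sharpe_like(mean_ann, benchmark_mean_ann, sd_ann):
    """Excess annualized mean return over a benchmark, per unit of volatility."""
    return (mean_ann - benchmark_mean_ann) / sd_ann

def max_drawdown(log_returns):
    """Largest peak-to-trough loss of cumulative wealth, as a positive fraction."""
    cumulative, peak, worst = 0.0, 0.0, 0.0
    for r in log_returns:
        cumulative += r
        peak = max(peak, cumulative)
        # Wealth relative to its running peak is exp(cumulative - peak).
        worst = max(worst, 1.0 - math.exp(cumulative - peak))
    return worst

# Annualized values copied from the printed output above.
ief_mean = 0.009232127777839577
portfolio_sharpe = sharpe_like(0.27785877165972805, ief_mean, 0.12823682216769253)
spy_sharpe = sharpe_like(0.22315317205683186, ief_mean, 0.1289527803460459)

# Toy log returns, for illustration only.
toy_drawdown = max_drawdown([0.01, -0.02, 0.015, -0.005])
print(portfolio_sharpe, spy_sharpe, toy_drawdown)
```

In the notebook, `max_drawdown(df['Portfolio_Returns'])` could be compared with the same measure for `df['SPY']` and `df['IEF']`.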