<a href="https://colab.research.google.com/github/Jandsy/ml_finance_imperial/blob/main/Coursework/CourseWork.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **<center>Machine Learning and Finance </center>**


## <center> CourseWork 2024 - StatArb </center>


# Libraries

In [1]:
#!pip install yfinance


In [2]:
import requests 
from bs4 import BeautifulSoup

import yfinance as yf
import pandas as pd
import statsmodels.api as sm
import numpy as np

from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam


In this coursework, you will delve into and replicate selected elements of the research detailed in the paper **[End-to-End Policy Learning of a Statistical Arbitrage Autoencoder Architecture](https://arxiv.org/pdf/2402.08233.pdf)**. **However, we will not reproduce the entire study.**

## Overview

This study redefines Statistical Arbitrage (StatArb) by combining Autoencoder architectures and policy learning to generate trading strategies. Traditionally, StatArb involves finding the mean of a synthetic asset through classical or PCA-based methods before developing a mean reversion strategy. However, this paper proposes a data-driven approach using an Autoencoder trained on US stock returns, integrated into a neural network representing portfolio trading policies to output portfolio allocations directly.


## Coursework Goal

This coursework will replicate these results, providing hands-on experience in implementing and evaluating this innovative end-to-end policy learning Autoencoder within financial trading strategies.

## Outline

- [Data Preparation and Exploration](#Data-Preparation-and-Exploration)
- [Fama French Analysis](#Fama-French-Analysis)
- [PCA Analysis](#PCA-Analysis)
- [Ornstein Uhlenbeck](#Ornstein-Uhlenbeck)
- [Autoencoder Analysis](#Autoencoder-Analysis)



**Description:**
The Coursework is graded on a 100 point scale and is divided into five  parts. Below is the mark distribution for each question:

| **Problem**  | **Question**          | **Number of Marks** |
|--------------|-----------------------|---------------------|
| **Part A**   | Question 1            | 4                   |
|              | Question 2            | 1                   |
|              | Question 3            | 3                   |
|              | Question 4            | 3                   |
|              | Question 5            | 1                   |
|              | Question 6            | 3                   |
|**Part  B**    | Question 7           | 1                   |
|              | Question 8            | 5                   |
|              | Question 9            | 4                   |
|              | Question 10           | 5                   |
|              | Question 11           | 2                   |
|              | Question 12           | 3                   |
|**Part  C**    | Question 13          | 3                   |
|              | Question 14           | 1                   |
|              | Question 15           | 3                   |
|              | Question 16           | 2                   |
|              | Question 17           | 7                   |
|              | Question 18           | 6                   |
|              | Question 19           | 3                   |
|  **Part  D** | Question 20           | 3                   |
|              | Question 21           | 5                   |
|              | Question 22           | 2                   |
|  **Part  E** | Question 23           | 2                   |
|              | Question 24           | 1                   |
|              | Question 25           | 3                   |
|              | Question 26           | 10                  |
|              | Question 27           | 1                   |
|              | Question 28           | 3                   |
|              | Question 29           | 3                   |
|              | Question 30           | 7                   |




Please read the questions carefully and do your best. Good luck!

## Objectives



## 1. Data Preparation and Exploration
Collect, clean, and prepare US stock return data for analysis.

## 2. Fama French Analysis
Utilize Fama French Factors to isolate the idiosyncratic components of stock returns, differentiating them from market-wide effects. This analysis helps in understanding the unique characteristics of individual stocks relative to broader market trends.

## 3. PCA Analysis
Employ Principal Component Analysis (PCA) to identify hidden structures and reduce dimensionality in the data. This method helps in extracting significant patterns that might be obscured in high-dimensional datasets.

## 4. Ornstein-Uhlenbeck Process
Analyze mean-reverting behavior in stock prices using the Ornstein-Uhlenbeck process. This stochastic process is useful for modeling and forecasting based on the assumption that prices will revert to a long-term mean.

## 5. Building a Basic Autoencoder Model
Construct and train a standard Autoencoder to extract residual idiosyncratic risk.








# Data Preparation and Exploration


---
<font color=green>Q1: (4 Marks)</font>
<br><font color='green'>
Write a Python function that accepts a URL parameter and retrieves the NASDAQ-100 companies and their ticker symbols by scraping the relevant Wikipedia page using **[Requests](https://pypi.org/project/requests/)** and **[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)**. Your function should return the data as a list of tuples, with each tuple containing the company name and its ticker symbol. Then, call your function with the appropriate Wikipedia page URL and print the data in a 'Company: Ticker' format.

</font>

---


In [3]:
def get_nasdaq_100(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    table = soup.find('table', {'class': 'wikitable sortable'})
    rows = table.find_all('tr')[1:]  # Skip the header row

    companies = []
    for row in rows:
        cols = row.find_all('td')
        company = cols[0].text.strip()
        ticker = cols[1].text.strip()
        companies.append((company, ticker))

    return companies

url = 'https://en.wikipedia.org/wiki/NASDAQ-100'
companies = get_nasdaq_100(url)

# Print the data in a 'Company: Ticker' format
for company, ticker in companies:
    print(f'Company: {company}, Ticker: {ticker}')

Company: Adobe Inc., Ticker: ADBE
Company: ADP, Ticker: ADP
Company: Airbnb, Ticker: ABNB
Company: Alphabet Inc. (Class A), Ticker: GOOGL
Company: Alphabet Inc. (Class C), Ticker: GOOG
Company: Amazon, Ticker: AMZN
Company: Advanced Micro Devices Inc., Ticker: AMD
Company: American Electric Power, Ticker: AEP
Company: Amgen, Ticker: AMGN
Company: Analog Devices, Ticker: ADI
Company: Ansys, Ticker: ANSS
Company: Apple Inc., Ticker: AAPL
Company: Applied Materials, Ticker: AMAT
Company: ASML Holding, Ticker: ASML
Company: AstraZeneca, Ticker: AZN
Company: Atlassian, Ticker: TEAM
Company: Autodesk, Ticker: ADSK
Company: Baker Hughes, Ticker: BKR
Company: Biogen, Ticker: BIIB
Company: Booking Holdings, Ticker: BKNG
Company: Broadcom Inc., Ticker: AVGO
Company: Cadence Design Systems, Ticker: CDNS
Company: CDW Corporation, Ticker: CDW
Company: Charter Communications, Ticker: CHTR
Company: Cintas, Ticker: CTAS
Company: Cisco, Ticker: CSCO
Company: Coca-Cola Europacific Partners, Ticker: CCEP

---
Q2: (1 Mark)

Given a list of tuples representing NASDAQ-100 companies (where each tuple contains a company name and its ticker symbol), write a Python script to extract all ticker symbols into a separate list called `tickers_list`.

---


In [4]:
tickers_list = [ticker for company, ticker in companies]

In [5]:
print(tickers_list)

['ADBE', 'ADP', 'ABNB', 'GOOGL', 'GOOG', 'AMZN', 'AMD', 'AEP', 'AMGN', 'ADI', 'ANSS', 'AAPL', 'AMAT', 'ASML', 'AZN', 'TEAM', 'ADSK', 'BKR', 'BIIB', 'BKNG', 'AVGO', 'CDNS', 'CDW', 'CHTR', 'CTAS', 'CSCO', 'CCEP', 'CTSH', 'CMCSA', 'CEG', 'CPRT', 'CSGP', 'COST', 'CRWD', 'CSX', 'DDOG', 'DXCM', 'FANG', 'DLTR', 'DASH', 'EA', 'EXC', 'FAST', 'FTNT', 'GEHC', 'GILD', 'GFS', 'HON', 'IDXX', 'ILMN', 'INTC', 'INTU', 'ISRG', 'KDP', 'KLAC', 'KHC', 'LRCX', 'LIN', 'LULU', 'MAR', 'MRVL', 'MELI', 'META', 'MCHP', 'MU', 'MSFT', 'MRNA', 'MDLZ', 'MDB', 'MNST', 'NFLX', 'NVDA', 'NXPI', 'ORLY', 'ODFL', 'ON', 'PCAR', 'PANW', 'PAYX', 'PYPL', 'PDD', 'PEP', 'QCOM', 'REGN', 'ROP', 'ROST', 'SIRI', 'SBUX', 'SNPS', 'TTWO', 'TMUS', 'TSLA', 'TXN', 'TTD', 'VRSK', 'VRTX', 'WBA', 'WBD', 'WDAY', 'XEL', 'ZS']


---
Q3: (3 Marks)

Using **[yfinance](https://pypi.org/project/yfinance/)** library, write a Python script that accepts a list of stock ticker symbols. For each symbol, download the adjusted closing price data, store it in a dictionary with the ticker symbol as the key, and then convert the final dictionary into a Pandas DataFrame. Handle any errors encountered during data retrieval by printing a message indicating which symbol failed

---

In [6]:
def get_adjusted_close_prices(tickers):
    data_dict = {}
    for ticker in tickers:
        try:
            data = yf.download(ticker, start='2000-01-01', end='2024-05-21')
            data_dict[ticker] = data['Adj Close']
        except Exception as e:
            print(f"Failed to download data for {ticker}. Reason: {e}")

    df = pd.DataFrame(data_dict)
    return df

df = get_adjusted_close_prices(tickers_list)

[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%**********************]  1 of 1 completed
[*********************100%%*******

In [7]:
df

Unnamed: 0_level_0,ADBE,ADP,ABNB,GOOGL,GOOG,AMZN,AMD,AEP,AMGN,ADI,...,TSLA,TXN,TTD,VRSK,VRTX,WBA,WBD,WDAY,XEL,ZS
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2000-01-03,16.274668,24.781523,,,,4.468750,15.500000,10.773265,44.925064,28.438288,...,,32.968723,,,18.781250,17.076210,,,6.977996,
2000-01-04,14.909400,24.781523,,,,4.096875,14.625000,10.901770,41.489864,26.999626,...,,31.566664,,,17.281250,16.440990,,,7.138673,
2000-01-05,15.204174,24.543238,,,,3.487500,15.000000,11.308707,42.917480,27.393770,...,,30.805525,,,17.000000,16.627815,,,7.414120,
2000-01-06,15.328290,24.870888,,,,3.278125,16.000000,11.372968,43.631290,26.644886,...,,29.964283,,,16.750000,16.142063,,,7.345256,
2000-01-07,16.072987,25.436808,,,,3.478125,16.250000,11.522893,48.538685,27.393770,...,,30.124514,,,18.218750,16.553085,,,7.345256,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-05-14,475.950012,245.500000,146.699997,170.339996,171.929993,187.070007,153.160004,90.790001,309.213806,211.940002,...,177.550003,191.130005,86.180000,246.929993,428.589996,18.097662,8.56,246.880005,55.560001,176.820007
2024-05-15,485.350006,246.619995,145.800003,172.509995,173.880005,185.990005,159.669998,91.970001,316.790009,215.750000,...,173.990005,195.529999,90.250000,247.839996,437.489990,17.643988,8.20,251.309998,55.790001,181.130005
2024-05-16,482.880005,250.059998,147.190002,174.179993,175.429993,183.630005,162.619995,92.540001,314.720001,214.119995,...,174.839996,194.970001,93.190002,251.479996,440.640015,18.087799,8.23,256.570007,55.849998,179.309998
2024-05-17,483.429993,252.330002,145.660004,176.059998,177.289993,184.699997,164.470001,92.669998,312.470001,214.080002,...,177.460007,195.020004,94.779999,251.619995,445.209991,17.930000,8.05,257.929993,55.520000,178.860001


---
<font color=green>Q4: (3 Marks)</font>
<br><font color='green'>
Write a Python script to analyze stock data stored in a dictionary `stock_data` (where each key is a stock ticker symbol, and each value is a Pandas Series of adjusted closing prices). The script should:
1. Convert the dictionary into a DataFrame.
2. Calculate the daily returns for each stock.
3. Identify columns (ticker symbols) with at least 2000 non-NaN values in their daily returns.
4. Create a new DataFrame that only includes these filtered ticker symbols.
5. Remove any remaining rows with NaN values in this new DataFrame.
</font>

---

In [8]:
def analyze_stock_data(stock_data):

    stock_df = pd.DataFrame(stock_data)
    daily_returns = stock_df.pct_change()
    valid_columns = daily_returns.columns[daily_returns.count() >= 2000]
    filtered_df = daily_returns[valid_columns]
    cleaned_df = filtered_df.dropna()

    return cleaned_df

stock_data = df.to_dict(orient='series')
cleaned_data = analyze_stock_data(stock_data)
cleaned_data

Unnamed: 0_level_0,ADBE,ADP,GOOGL,GOOG,AMZN,AMD,AEP,AMGN,ADI,ANSS,...,TTWO,TMUS,TSLA,TXN,VRSK,VRTX,WBA,WBD,WDAY,XEL
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2015-12-10,-0.006699,0.006004,-0.003292,-0.002861,-0.003715,0.042553,-0.024355,0.011274,0.008826,0.004968,...,-0.004777,0.007485,0.011358,0.002995,-0.002616,0.010279,0.000960,-0.002109,0.005037,-0.013889
2015-12-11,0.027653,-0.024924,-0.012657,-0.014130,-0.033473,-0.036735,-0.005831,-0.028309,-0.003849,-0.013951,...,-0.011575,-0.009356,-0.044259,-0.012647,-0.015995,-0.034709,-0.020499,-0.035928,-0.057630,0.004311
2015-12-14,0.020127,0.015241,0.016151,0.012045,0.027744,-0.008475,-0.000367,0.019078,-0.001932,0.003899,...,0.005141,0.014444,0.007188,0.000178,0.014390,-0.016491,0.010402,-0.033613,0.000253,0.010017
2015-12-15,0.008149,0.010283,-0.003213,-0.005844,0.001110,0.008547,0.027503,0.028524,-0.004752,0.009544,...,0.023018,0.044907,0.011483,0.023657,0.013924,0.013656,-0.005450,0.002646,-0.000380,0.007934
2015-12-16,0.016380,0.008541,0.021708,0.019761,0.026008,0.076271,0.017309,0.012053,0.017330,0.008354,...,0.006389,0.025419,0.060699,0.009036,0.021117,0.010658,0.031664,0.024887,0.029378,0.023896
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-05-14,-0.014821,-0.009282,0.007095,0.006027,0.002680,0.017269,-0.007976,0.009596,0.017084,-0.007130,...,0.007016,-0.005755,0.032928,0.017623,0.002395,-0.003117,0.012693,0.021480,-0.000809,-0.004836
2024-05-15,0.019750,0.004562,0.012739,0.011342,-0.005773,0.042505,0.012997,0.024502,0.017977,0.012337,...,0.021523,0.001662,-0.020051,0.023021,0.003685,0.020766,-0.025068,-0.042056,0.017944,0.004140
2024-05-16,-0.005089,0.013949,0.009681,0.008914,-0.012689,0.018476,0.006198,-0.006534,-0.007555,-0.007124,...,-0.013506,0.005532,0.004885,-0.002864,0.014687,0.007200,0.025154,0.003659,0.020930,0.001075
2024-05-17,0.001139,0.009078,0.010793,0.010603,0.005827,0.011376,0.001405,-0.007149,-0.000187,0.000550,...,0.012048,0.002568,0.014985,0.000256,0.000557,0.010371,-0.008724,-0.021871,0.005301,-0.005909


---
<font color=green>Q5: (1 Mark)</font>
<br><font color='green'>
Download the dataset named `df_filtered_nasdaq_100` from the GitHub repository of the course.
</font>

---

In [9]:
df_filtered = pd.read_csv('C:/Users/42275/Downloads/df_filtered_nasdaq_100.csv')
df_filtered['Date'] =  pd.to_datetime(df_filtered['Date'])
df_filtered

Unnamed: 0,Date,ADBE,ADP,GOOGL,GOOG,AMZN,AMD,AEP,AMGN,ADI,...,TTWO,TMUS,TSLA,TXN,VRSK,VRTX,WBA,WBD,WDAY,XEL
0,2015-12-10,-0.006699,0.006004,-0.003292,-0.002861,-0.003715,0.042553,-0.024356,0.011274,0.008826,...,-0.004777,0.007485,0.011358,0.002996,-0.002616,0.010279,0.000960,-0.002109,0.005037,-0.013889
1,2015-12-11,0.027653,-0.024924,-0.012657,-0.014130,-0.033473,-0.036735,-0.005831,-0.028308,-0.003850,...,-0.011575,-0.009356,-0.044259,-0.012648,-0.015996,-0.034709,-0.020499,-0.035928,-0.057630,0.004311
2,2015-12-14,0.020127,0.015241,0.016151,0.012045,0.027744,-0.008475,-0.000366,0.019078,-0.001932,...,0.005141,0.014444,0.007188,0.000178,0.014390,-0.016491,0.010402,-0.033613,0.000253,0.010017
3,2015-12-15,0.008149,0.010283,-0.003213,-0.005844,0.001110,0.008547,0.027503,0.028525,-0.004752,...,0.023018,0.044907,0.011483,0.023657,0.013923,0.013656,-0.005450,0.002646,-0.000380,0.007934
4,2015-12-16,0.016380,0.008541,0.021708,0.019761,0.026008,0.076271,0.017309,0.012053,0.017330,...,0.006389,0.025419,0.060699,0.009036,0.021117,0.010658,0.031664,0.024887,0.029378,0.023897
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2113,2024-05-06,0.015241,0.003514,0.005142,0.004971,0.013372,0.034396,0.002370,-0.037939,0.018484,...,0.016863,-0.013548,0.019703,0.015427,0.019087,0.003540,-0.030881,-0.001255,-0.022949,0.002028
2114,2024-05-07,-0.002674,0.009805,0.018739,0.018548,0.000318,-0.008666,0.011936,0.002738,0.001230,...,-0.000067,-0.001109,-0.037616,0.012752,0.021252,0.019230,0.005214,-0.023869,-0.001921,0.012141
2115,2024-05-08,-0.008471,-0.008894,-0.010920,-0.010521,-0.004026,-0.005245,0.007900,0.023343,0.006337,...,-0.015910,0.003946,-0.017378,0.007007,-0.009838,0.020915,-0.006916,0.003861,0.000802,-0.001636
2116,2024-05-09,-0.011166,0.009097,0.003424,0.002454,0.007979,-0.008007,0.004085,0.018060,-0.000342,...,-0.001987,0.011361,-0.015739,0.007448,0.001676,0.000406,0.001161,0.030769,-0.014702,0.005644


---
<font color=green>Q6: (3 Marks) </font>
<br><font color='green'>
Conduct an in-depth analysis of the `df_filtered_nasdaq_100` dataset from GitHub. Answer the following questions:
- Which stock had the best performance over the entire period?
- What is the average daily return of 'AAPL'?
- What is the worst daily return? Provide the stock name and the date it occurred.
</font>

---

In [10]:
# 1. Which stock had the best performance over the entire period?
# Calculate the total return for each stock
total_returns = (df_filtered.iloc[:,1:]+1).prod()-1
best_performance_stock = total_returns.idxmax()
best_performance_value = total_returns.max()
print("Stock with the best performance over the entire period:", best_performance_stock, ", Best performance value:", best_performance_value)

# 2. What is the average daily return of 'AAPL'?
average_daily_return_aapl = df_filtered['AAPL'].dropna().mean()
print("Average daily return of AAPL:", average_daily_return_aapl)

# 3. What is the worst daily return? Provide the stock name and the date it occurred.
worst_daily_return = df_filtered.iloc[:,1:].min().min()
worst_daily_return_stock = df_filtered.iloc[:,1:].min().idxmin()
worst_daily_return_date = df_filtered[df_filtered[worst_daily_return_stock]==worst_daily_return].index[0]
print("Worst daily return:", worst_daily_return, ", Stock name:", worst_daily_return_stock, ", Date:", worst_daily_return_date)

Stock with the best performance over the entire period: NVDA , Best performance value: 111.58842785050534
Average daily return of AAPL: 0.0010849409515136207
Worst daily return: -0.4464579394678258 , Stock name: FANG , Date: 1066


# Fama French Analysis

The Fama-French five-factor model is an extension of the classic three-factor model used in finance to describe stock returns. It is designed to better capture the risk associated with stocks and explain differences in returns. This model includes the following factors:

1. **Market Risk (MKT)**: The excess return of the market over the risk-free rate. It captures the overall market's premium.
2. **Size (SMB, "Small Minus Big")**: The performance of small-cap stocks relative to large-cap stocks.
3. **Value (HML, "High Minus Low")**: The performance of stocks with high book-to-market values relative to those with low book-to-market values.
4. **Profitability (RMW, "Robust Minus Weak")**: The difference in returns between companies with robust (high) and weak (low) profitability.
5. **Investment (CMA, "Conservative Minus Aggressive")**: The difference in returns between companies that invest conservatively and those that invest aggressively.

## Additional Factor

6. **Momentum (MOM)**: This factor represents the tendency of stocks that have performed well in the past to continue performing well, and the reverse for stocks that have performed poorly.

### Mathematical Representation

The return of a stock $R_i^t$ at time $t$ can be modeled as follows :

$$
R_i^t - R_f^t = \alpha_i^t + \beta_{i,MKT}^t(R_M^t - R_f^t) + \beta_{i,SMB}^t \cdot SMB^t + \beta_{i,HML}^t \cdot HML^t + \beta_{i,RMW}^t \cdot RMW^t + \beta_{i,CMA}^t \cdot CMA^t + \beta_{i,MOM}^t \cdot MOM^t + \epsilon_i^t
$$

Where:
- $ R_i^t $ is the return of stock $i$ at time $t$
- $R_f^t $is the risk-free rate at time $t$
- $ R_M^t $ is the market return at time $t$
- $\alpha_i^t $ is the abnormal return or alpha of stock $ i $ at time $t$
- $\beta^t $ coefficients represent the sensitivity of the stock returns to each factor at time $t$
- $\epsilon_i^t $ is the error term or idiosyncratic risk unique to stock $ i $ at time $t$

This model is particularly useful for identifying which factors significantly impact stock returns and for constructing a diversified portfolio that is optimized for given risk preferences.




---
<font color=green>Q7: (1 Mark) </font>
<br><font color='green'>
Download the `fama_french_dataset` from the course's GitHub account.
</font>

---

In [11]:
fama_french_data = pd.read_csv("C:/Users/42275/Downloads/fama_french_dataset.csv")
fama_french_data.columns.values[0]='Date'
fama_french_data['Date'] = pd.to_datetime(fama_french_data['Date'])
# fama_french_data.set_index('Date', inplace=True)
fama_french_data

Unnamed: 0,Date,Mkt-RF,SMB,HML,RMW,CMA,RF,Mom
0,1963-07-01,-0.67,0.02,-0.35,0.03,0.13,0.012,-0.21
1,1963-07-02,0.79,-0.28,0.28,-0.08,-0.21,0.012,0.42
2,1963-07-03,0.63,-0.18,-0.10,0.13,-0.25,0.012,0.41
3,1963-07-05,0.40,0.09,-0.28,0.07,-0.30,0.012,0.07
4,1963-07-08,-0.63,0.07,-0.20,-0.27,0.06,0.012,-0.45
...,...,...,...,...,...,...,...,...
15285,2024-03-22,-0.23,-0.98,-0.53,0.29,-0.37,0.021,0.43
15286,2024-03-25,-0.26,-0.10,0.88,-0.22,-0.17,0.021,-0.34
15287,2024-03-26,-0.26,0.10,-0.13,-0.50,0.23,0.021,0.09
15288,2024-03-27,0.88,1.29,0.91,-0.14,0.58,0.021,-1.34


---
<font color=green>Q8: (5 Marks)</font>
<br><font color='green'>

Write a Python function called `get_sub_df_ticker(ticker, date, df_filtered, length_history)` that extracts a historical sub-dataframe for a given `ticker` from `df_filtered`. The function should use `length_history` to determine the number of trading days to include, ending at the specified `date`. Return the sub-dataframe for the specified `ticker`.
</font>

---


In [12]:
def get_sub_df_ticker(ticker, date, df_filtered, length_history):
    df_filtered['Date'] = pd.to_datetime(df_filtered['Date'])
    date = pd.to_datetime(date)
    
    end_date_index = df_filtered.index[df_filtered['Date'] == date].tolist()[0]
    start_date_index = max(0, end_date_index - length_history +1)
    
    sub_ticker_df= df_filtered.iloc[start_date_index:end_date_index+1].copy()
    
    return sub_ticker_df[['Date', ticker]]

---
<font color=green>Q9: (4 Marks)</font>
<br><font color='green'>
Create a Python function named `df_ticker_with_fama_french(ticker, date, df_filtered, length_history, fama_french_data)` that uses `get_sub_df_ticker` to extract historical data for a specific `ticker`. Incorporate the Fama-French factors from `fama_french_data` into the extracted sub-dataframe. Adjust the ticker's returns by subtracting the risk-free rate ('RF') and add other relevant Fama-French factors ('Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA', and 'Mom'). Return the resulting sub-dataframe.
</font>

---

In [13]:
def df_ticker_with_fama_french(ticker,date, df_filtered, length_history, fama_french_data):
    #Get sub-dateframe of history data for a specific sticker
    sub_ticker_df = get_sub_df_ticker(ticker, date, df_filtered, length_history)
    #Transfer date form
    fama_french_data['Date'] = pd.to_datetime(fama_french_data['Date'])
    #Incorporate the Fama-French factors from fama_french_data into the extracted sub-dataframe
    merge_df = pd.merge(sub_ticker_df, fama_french_data, on='Date')
    #Adjust ticker's returns by subtracting RF
    merge_df[ticker] = merge_df[ticker] - merge_df['RF']
    
    columns = ['Date',ticker,'Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA', 'Mom']
    
    sub_df = merge_df[columns].copy()

    return sub_df
    

---
<font color=green>Q10: (5 Marks) </font>
<br><font color='green'>
Write a Python function named `extract_beta_fama_french` to perform a rolling regression analysis for a given stock at a specific time point using the Fama-French model. The function should accept the following parameters:

- `ticker`: A string indicating the stock symbol.
- `date`: A string specifying the date for the analysis.
- `length_history`: An integer representing the number of days of historical data to include.
- `df_filtered`: A pandas DataFrame (assumed to be derived from question 5) containing filtered stock data.
- `fama_french_data`: A pandas DataFrame (assumed to be from question 7) that includes Fama-French factors.

Utilize the `statsmodels.api` library to conduct the regression.
</font>

---

In [14]:
def extract_beta_fama_french(ticker, date, length_history, df_filtered, fama_french_data):
    sub_df = df_ticker_with_fama_french(ticker,date, df_filtered, length_history, fama_french_data)
    
    rolling_results =[]
    
    window = length_history
    
    for start in range(len(sub_df) - window + 1):
        end = start+window
        y = sub_df[ticker].iloc[start:end]
        X = sub_df[['Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA', 'Mom']].iloc[start:end]
        X = sm.add_constant(X)

        model = sm.OLS(y,X).fit()
        
        return model.summary()
        
        
        #params = model.params
        #rolling_results.append(params)
        
        

    #model_summary = pd.DataFrame(rolling_results, index=sub_df['Date'].iloc[window-1:])
    
    #return model_summary

    

In [15]:
print(df_filtered.columns)
print(fama_french_data.columns)

Index(['Date', 'ADBE', 'ADP', 'GOOGL', 'GOOG', 'AMZN', 'AMD', 'AEP', 'AMGN',
       'ADI', 'ANSS', 'AAPL', 'AMAT', 'ASML', 'AZN', 'TEAM', 'ADSK', 'BKR',
       'BIIB', 'BKNG', 'AVGO', 'CDNS', 'CDW', 'CHTR', 'CTAS', 'CSCO', 'CCEP',
       'CTSH', 'CMCSA', 'CPRT', 'CSGP', 'COST', 'CSX', 'DXCM', 'FANG', 'DLTR',
       'EA', 'EXC', 'FAST', 'FTNT', 'GILD', 'HON', 'IDXX', 'ILMN', 'INTC',
       'INTU', 'ISRG', 'KDP', 'KLAC', 'KHC', 'LRCX', 'LIN', 'LULU', 'MAR',
       'MRVL', 'MELI', 'META', 'MCHP', 'MU', 'MSFT', 'MDLZ', 'MNST', 'NFLX',
       'NVDA', 'NXPI', 'ORLY', 'ODFL', 'ON', 'PCAR', 'PANW', 'PAYX', 'PYPL',
       'PEP', 'QCOM', 'REGN', 'ROP', 'ROST', 'SIRI', 'SBUX', 'SNPS', 'TTWO',
       'TMUS', 'TSLA', 'TXN', 'VRSK', 'VRTX', 'WBA', 'WBD', 'WDAY', 'XEL'],
      dtype='object')
Index(['Date', 'Mkt-RF', 'SMB', 'HML', 'RMW', 'CMA', 'RF', 'Mom'], dtype='object')


---
<font color=green>Q11: (2 Marks) </font>
<br><font color='green'>
Apply the `extract_beta_fama_french` function to the stock symbol 'AAPL' for the date '2024-03-28', using a historical data length of 252 days. Ensure that the `df_filtered` and `fama_french_data` DataFrames are correctly prepared and available in your environment before executing this function. The parameters for the function call are set as follows:

- **Ticker**: 'AAPL'
- **Date**: '2024-03-28'
- **Length of History**: 252 days
</font>

---



In [16]:
results = extract_beta_fama_french('AAPL','2024-03-28',252,df_filtered, fama_french_data)
print(results)

                            OLS Regression Results                            
Dep. Variable:                   AAPL   R-squared:                       0.475
Model:                            OLS   Adj. R-squared:                  0.462
Method:                 Least Squares   F-statistic:                     36.96
Date:                Mon, 27 May 2024   Prob (F-statistic):           9.04e-32
Time:                        15:24:37   Log-Likelihood:                 827.28
No. Observations:                 252   AIC:                            -1641.
Df Residuals:                     245   BIC:                            -1616.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0207      0.001    -35.204      0.0

---
<font color=green>Q12: (2 Marks)</font>
<br><font color='green'>
Once the `extract_beta_fama_french` function has been applied to 'AAPL' with the specified parameters, the next step is to analyze the regression summary to identify which Fama-French factor explains the most variance in 'AAPL' returns during the specified period.

Follow these steps to perform the analysis:

1. **Review the Summary**: Examine the regression output, focusing on the coefficients and their statistical significance (p-values).
2. **Identify Key Factor**: Determine which factor has the highest absolute coefficient value and is statistically significant (typically p < 0.05). This factor can be considered as having the strongest influence on 'AAPL' returns for the period.

</font>

---

In [17]:
##Analysis

**Write your answers here:**

# PCA Analysis


In literature, another method exists for extracting residuals for each stock, utilizing the PCA approach to identify hidden factors in the data. Let's describe this method.

The return of a stock $R_i^t$ at time $t$ can be modeled as follows :

$$
R_i^t  = \sum_{j=1}^m\beta_{i,j}^t F_j^t  + \epsilon_i^t
$$

Where:
- $ R_i^t $ is the return of stock $i$ at time $t$
- $m$ is the number of factors selected from PCA
-  $ F_j^t $ is the $j$-th hidden factor constructed from PCA at time $t$
- $\beta_{i,j}^t $ are the coefficients representing the sensitivity of the stock returns to each hidden factor.
- $\epsilon_i^t $  is the residual term for stock $i$ at time $t$, representing the portion of the return not explained by the PCA factors.

### Representation of Stock Return Data

Consider the return data for $N$ stocks over $T$ periods, represented by the matrix $R$ of size $T \times N$:

$$
R = \left[
\begin{array}{cccc}
R_1^T & R_2^T & \cdots & R_N^T \\
R_1^{T-1} & R_2^{T-1} & \cdots & R_N^{T-1} \\
\vdots & \vdots & \ddots & \vdots \\
R_1^1 & R_2^1 & \cdots & R_N^1 \\
\end{array}
\right]
$$

Each element $R_i^k$ of the matrix represents the return of stock $i$ at time $k$ and is defined as:

$$
R_i^k = \frac{S_{i,k} - S_{i, k-1}}{S_{i, k-1}}, \quad k=1,\cdots, T, \quad i=1,\cdots,N
$$

where $S_{i,k}$ denotes the adjusted close price of stock $i$ at time $k$.

### Standardization of Returns

To adjust for varying volatilities across stocks, we standardize the returns as follows:

$$
Z_i^t = \frac{R_i^t - \mu_i}{\sigma_i}
$$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of returns for stock $i$ over the period $[t-T, t]$, respectively.

### Empirical Correlation Matrix

The empirical correlation matrix $C$ is computed from the standardized returns:

$$
C = \frac{1}{T-1} Z^T Z
$$

where $Z^T$ is the transpose of matrix $Z$.

### Singular Value Decomposition (SVD)

We apply Singular Value Decomposition to the correlation matrix $C$:

$$
C = U \Sigma V^T
$$

Here, $U$ and $V$ are orthogonal matrices representing the left and right singular vectors, respectively, and $\Sigma$ is a diagonal matrix containing the singular values, which are the square roots of the eigenvalues.

### Construction of Hidden Factors

For each of the top $m$ components, we construct the selected hidden factors as follows:

$$
F_j^t = \sum_{i=1}^N \frac{\lambda_{i,j}}{\sigma_i} R_i^t
$$

where $\lambda_{i,j}$ is the $i$-th component of the $j$-th eigenvector (ranked by eigenvalue magnitude).


---
<font color=green>Q13 (3 Marks):

For the specified period from March 29, 2023 ('2023-03-29'), to March 28, 2024 ('2024-03-28'), generate the matrix $Z$ by standardizing the stock returns using the DataFrame `df_filtered_new`
</font>

---


In [18]:
df_filtered_new = df_filtered.copy()
mask = (df_filtered['Date'] >= '2023-03-29') & (df_filtered['Date'] <= '2024-03-28')
sub_df_filtered_new = df_filtered_new.loc[mask]
sub_df_filtered_new.set_index('Date', inplace=True)

# standardize the daily returns
daily_returns_std = (sub_df_filtered_new - sub_df_filtered_new.mean()) / sub_df_filtered_new.std()

Z = daily_returns_std
Z

Unnamed: 0_level_0,ADBE,ADP,GOOGL,GOOG,AMZN,AMD,AEP,AMGN,ADI,ANSS,...,TTWO,TMUS,TSLA,TXN,VRSK,VRTX,WBA,WBD,WDAY,XEL
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2023-03-29,0.658925,2.189521,0.106285,0.207821,1.503570,0.444433,1.020467,0.727398,1.860839,0.356414,...,0.517876,0.615242,0.814986,1.344650,1.105776,0.127319,0.497115,0.429736,2.277783,1.269711
2023-03-30,0.273051,-0.221306,-0.389329,-0.434766,0.787034,0.526996,0.277291,0.076716,1.630406,0.936497,...,-0.109884,0.439972,0.233493,1.187056,0.239468,-0.525980,0.691243,0.446100,0.389915,0.430715
2023-03-31,0.360570,1.136308,1.540732,1.439604,0.531735,-0.056439,0.475365,0.008639,0.935425,1.044069,...,1.337544,0.117646,2.058867,0.640237,0.325258,0.538943,-0.008816,0.565842,1.611597,0.597545
2023-04-03,-0.713053,-2.259591,0.252735,0.407399,-0.591892,-0.600163,-0.086824,0.759731,-0.328378,-0.555499,...,-0.377687,1.191774,-2.030038,-0.684867,-0.214460,0.183344,1.205798,-0.547975,-0.634285,0.121519
2023-04-04,0.560730,-1.137430,0.099652,0.013877,0.658632,-0.342217,0.223111,0.872403,-0.399698,-0.127907,...,1.434972,-0.347689,-0.377655,-1.394525,-0.538698,-0.487155,0.553153,0.755053,-0.537767,1.011197
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-03-22,-1.146803,-0.516681,1.151431,1.085070,0.074915,0.081847,-0.150701,-0.278013,-0.555224,0.126867,...,0.046870,-0.246035,-0.386620,-0.054031,-0.476030,-0.091837,-0.421327,-0.946797,0.106149,-0.002851
2024-03-25,0.659349,-1.221008,-0.372489,-0.341074,0.109668,-0.292706,-0.084892,1.184710,-0.959285,-0.282029,...,-2.575803,0.240999,0.343241,-0.651297,-1.151639,-0.024341,0.166127,0.118796,-0.427841,0.321648
2024-03-26,-0.032708,0.234343,0.131651,0.109342,-0.556130,-0.244717,-0.377797,0.183364,-0.577464,0.317533,...,0.150958,-0.070195,0.960796,-1.177043,-0.384593,0.306386,-0.206326,-0.246683,0.237594,-0.878121
2024-03-27,-0.363721,1.052056,-0.024166,-0.010591,0.315892,0.224881,2.192442,1.128111,1.411127,-0.309662,...,0.034693,0.474269,0.396878,1.991081,0.941800,-0.265792,1.179504,1.004419,-0.793726,2.193537


---
<font color=green>Q14: (1 Mark) </font>
<br><font color='green'>
Download the `Z_matrix` matrix from the course's GitHub account.
</font>

---

In [19]:
Z_matrix = pd.read_csv("C:/Users/42275/Downloads/Z_matrix.csv")
Z_matrix

Unnamed: 0,Date,ADBE,ADP,GOOGL,GOOG,AMZN,AMD,AEP,AMGN,ADI,...,TTWO,TMUS,TSLA,TXN,VRSK,VRTX,WBA,WBD,WDAY,XEL
0,2023-03-29,0.658925,2.189521,0.106285,0.207821,1.503570,0.444433,1.020467,0.727398,1.860839,...,0.517876,0.615242,0.814986,1.344650,1.105776,0.127319,0.497115,0.429736,2.277783,1.269711
1,2023-03-30,0.273051,-0.221306,-0.389329,-0.434766,0.787034,0.526996,0.277291,0.076716,1.630406,...,-0.109884,0.439972,0.233493,1.187056,0.239468,-0.525980,0.691243,0.446100,0.389915,0.430715
2,2023-03-31,0.360570,1.136308,1.540732,1.439604,0.531735,-0.056439,0.475365,0.008639,0.935425,...,1.337544,0.117646,2.058867,0.640237,0.325258,0.538943,-0.008816,0.565842,1.611597,0.597545
3,2023-04-03,-0.713053,-2.259591,0.252735,0.407399,-0.591892,-0.600163,-0.086824,0.759731,-0.328378,...,-0.377687,1.191774,-2.030038,-0.684867,-0.214460,0.183344,1.205798,-0.547975,-0.634285,0.121519
4,2023-04-04,0.560730,-1.137430,0.099652,0.013877,0.658632,-0.342217,0.223111,0.872403,-0.399698,...,1.434972,-0.347689,-0.377655,-1.394525,-0.538698,-0.487155,0.553153,0.755053,-0.537767,1.011197
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
247,2024-03-22,-1.146803,-0.516681,1.151431,1.085070,0.074915,0.081847,-0.150701,-0.278013,-0.555224,...,0.046870,-0.246035,-0.386620,-0.054031,-0.476030,-0.091837,-0.421327,-0.946797,0.106149,-0.002851
248,2024-03-25,0.659349,-1.221008,-0.372489,-0.341074,0.109668,-0.292706,-0.084892,1.184710,-0.959285,...,-2.575803,0.240999,0.343241,-0.651297,-1.151639,-0.024341,0.166127,0.118796,-0.427841,0.321648
249,2024-03-26,-0.032708,0.234343,0.131651,0.109342,-0.556130,-0.244717,-0.377797,0.183364,-0.577464,...,0.150958,-0.070195,0.960796,-1.177043,-0.384593,0.306386,-0.206326,-0.246683,0.237594,-0.878121
250,2024-03-27,-0.363721,1.052056,-0.024166,-0.010591,0.315892,0.224881,2.192442,1.128111,1.411127,...,0.034693,0.474269,0.396878,1.991081,0.941800,-0.265792,1.179504,1.004419,-0.793726,2.193537


---
<font color=green>Q15: (3 Marks) </font>
<br><font color='green'>
For the specified period from March 29, 2023 ('2023-03-29'), to March 28, 2024 ('2024-03-28'), compute the correlation matrix
$C$ using the matrix `Z_matrix`.
</font>

---

In [20]:
Z_matrix['Date'] = pd.to_datetime(Z_matrix['Date'])
Z_matrix_filtered = Z_matrix[(Z_matrix['Date'] >= '2023-03-29') & (Z_matrix['Date'] <= '2024-03-28')]
Z_matrix_filtered.set_index('Date', inplace=True)

correlation_matrix = Z_matrix_filtered.T.dot(Z_matrix_filtered) / len(Z_matrix_filtered)
correlation_matrix

Unnamed: 0,ADBE,ADP,GOOGL,GOOG,AMZN,AMD,AEP,AMGN,ADI,ANSS,...,TTWO,TMUS,TSLA,TXN,VRSK,VRTX,WBA,WBD,WDAY,XEL
ADBE,0.996032,0.217646,0.396311,0.399011,0.461649,0.442270,-0.035824,0.197992,0.320713,0.385946,...,0.256908,0.101761,0.267796,0.325301,0.170899,0.164106,0.033820,0.099445,0.416451,0.019030
ADP,0.217646,0.996032,0.293046,0.297655,0.167538,0.045702,0.227550,0.213960,0.278498,0.237409,...,0.289159,0.113532,0.177421,0.296772,0.323967,0.176070,0.141804,0.243017,0.319563,0.164029
GOOGL,0.396311,0.293046,0.996032,0.993457,0.519130,0.369632,-0.006776,0.118466,0.221371,0.291127,...,0.237273,0.086329,0.266878,0.191425,0.177913,0.141881,0.052500,0.041905,0.287989,0.025599
GOOG,0.399011,0.297655,0.993457,0.996032,0.523540,0.370093,-0.004021,0.117827,0.222822,0.293374,...,0.241150,0.091093,0.267050,0.197258,0.179395,0.145610,0.060581,0.045335,0.292410,0.026288
AMZN,0.461649,0.167538,0.519130,0.523540,0.996032,0.461212,-0.010806,0.123254,0.289717,0.340685,...,0.221464,0.119824,0.302164,0.298311,0.143753,0.104546,0.017855,0.162291,0.402155,-0.058637
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
VRTX,0.164106,0.176070,0.141881,0.145610,0.104546,0.039383,0.238909,0.280641,0.109751,0.141557,...,0.180092,0.138632,0.143870,0.197471,0.250864,0.996032,0.158492,0.062477,0.101447,0.183637
WBA,0.033820,0.141804,0.052500,0.060581,0.017855,0.002618,0.308488,0.213849,0.208078,0.096429,...,0.114732,0.063286,0.167827,0.198835,0.038218,0.158492,0.996032,0.360098,0.010812,0.194066
WBD,0.099445,0.243017,0.041905,0.045335,0.162291,0.092365,0.324172,0.219468,0.309448,0.095538,...,0.128511,0.083980,0.281145,0.354039,0.002978,0.062477,0.360098,0.996032,0.160222,0.183107
WDAY,0.416451,0.319563,0.287989,0.292410,0.402155,0.333259,0.017589,0.067826,0.314174,0.380842,...,0.292783,0.142082,0.276642,0.344788,0.194811,0.101447,0.010812,0.160222,0.996032,-0.019234


---
<font color=green>Q16: (2 Marks) </font>
<br><font color='green'>
Refind the correlation matrix from the from March 29, 2023 ('2023-03-29'), to March 28, 2024 ('2024-03-28') using pandas correlation matrix method.
</font>

---

In [21]:
Z_matrix_filtered = Z_matrix[(Z_matrix['Date'] >= '2023-03-29') & (Z_matrix['Date'] <= '2024-03-28')]
Z_matrix_filtered.set_index('Date', inplace=True)
C = Z_matrix_filtered.corr()
C

Unnamed: 0,ADBE,ADP,GOOGL,GOOG,AMZN,AMD,AEP,AMGN,ADI,ANSS,...,TTWO,TMUS,TSLA,TXN,VRSK,VRTX,WBA,WBD,WDAY,XEL
ADBE,1.000000,0.218513,0.397890,0.400601,0.463488,0.444032,-0.035967,0.198781,0.321991,0.387483,...,0.257931,0.102167,0.268863,0.326597,0.171580,0.164760,0.033955,0.099841,0.418110,0.019105
ADP,0.218513,1.000000,0.294213,0.298841,0.168206,0.045884,0.228457,0.214813,0.279607,0.238355,...,0.290311,0.113985,0.178128,0.297954,0.325258,0.176771,0.142369,0.243986,0.320836,0.164682
GOOGL,0.397890,0.294213,1.000000,0.997415,0.521199,0.371105,-0.006803,0.118938,0.222252,0.292286,...,0.238219,0.086673,0.267941,0.192188,0.178622,0.142447,0.052710,0.042072,0.289137,0.025701
GOOG,0.400601,0.298841,0.997415,1.000000,0.525626,0.371568,-0.004037,0.118296,0.223710,0.294542,...,0.242111,0.091456,0.268114,0.198044,0.180110,0.146190,0.060822,0.045516,0.293575,0.026392
AMZN,0.463488,0.168206,0.521199,0.525626,1.000000,0.463049,-0.010849,0.123745,0.290872,0.342042,...,0.222346,0.120301,0.303368,0.299500,0.144325,0.104962,0.017926,0.162937,0.403757,-0.058870
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
VRTX,0.164760,0.176771,0.142447,0.146190,0.104962,0.039540,0.239861,0.281759,0.110189,0.142121,...,0.180810,0.139184,0.144443,0.198258,0.251863,1.000000,0.159124,0.062726,0.101851,0.184369
WBA,0.033955,0.142369,0.052710,0.060822,0.017926,0.002629,0.309717,0.214701,0.208907,0.096813,...,0.115189,0.063538,0.168495,0.199627,0.038371,0.159124,1.000000,0.361533,0.010855,0.194839
WBD,0.099841,0.243986,0.042072,0.045516,0.162937,0.092733,0.325463,0.220342,0.310681,0.095919,...,0.129023,0.084315,0.282265,0.355450,0.002990,0.062726,0.361533,1.000000,0.160860,0.183837
WDAY,0.418110,0.320836,0.289137,0.293575,0.403757,0.334587,0.017659,0.068097,0.315426,0.382360,...,0.293949,0.142648,0.277744,0.346161,0.195588,0.101851,0.010855,0.160860,1.000000,-0.019310


---
<font color=green>Q17: (7 Marks) </font>
<br><font color='green'>
Conduct Singular Value Decomposition on the correlation matrix $C$. Follow these steps:


1.   **Perform SVD**: Decompose the matrix $C$ into its singular values and vectors.
2.   **Rank Eigenvalues**: Sort the resulting singular values (often squared to compare to eigenvalues) in descending order.
3. **Select Components**: Extract the first 20 components based on the largest singular values.
4. **Variance Explained**: Print the variance explained by the first 20 Components and dimensions of differents matrix that you created.

</font>

---

In [22]:
# SVD
U, S, Vt = np.linalg.svd(C)

# Sort the resulting singular values (often squared to compare to eigenvalues) in descending order.
idx = S.argsort()[::-1]
S = S[idx]
U = U[:, idx]
Vt = Vt[idx, :]
V = Vt.T

# Extract the first 20 components based on the largest singular values.
U = U[:, :20]
V = V[:, :20]
S = S[:20]

# Print the variance explained by the first 20 Components and dimensions of differents matrix that you created.
explained_variance = S / np.sum(S)
print("Explained Variance:", explained_variance)

Explained Variance: [0.36655445 0.12157645 0.0648746  0.04573254 0.04267564 0.03391664
 0.03079454 0.02918996 0.02884525 0.02670705 0.02513509 0.02398557
 0.02323357 0.02193761 0.02075082 0.0197569  0.01941274 0.01901935
 0.0180899  0.01781136]


---
<font color=green>Q18: (6 Marks) </font>
<br><font color='green'>
Extract the 20 hidden factors in a matrix F. Check that shape of F is $(252,20)$
</font>

</font>

---

In [23]:
# Extract the 20 hidden factors in a matrix F. Check that shape of F is (252,20)
F = Z_matrix_filtered.dot(U)
print("Shape of F:", F.shape)

Shape of F: (252, 20)


---
<font color=green>Q19: (3 Marks) </font>
<br><font color='green'>
Perform the Regression Analysis of 'AAPL' for the date '2024-03-28', using a historical data length of 252 days using previous $F$ Matrix. Compare the R-squared from the ones obtained at Q11.
</font>

</font>

---

In [24]:
y = sub_df_filtered_new['AAPL']

X = F

target_date = '2024-03-28'
train_length = 252
X_train = X.loc[:target_date].iloc[:-1]
X_test = X.loc[target_date].values.reshape(1, -1)
y_train = y.loc[:target_date].iloc[:-1]
y_test = y.loc[target_date]

model = sm.OLS(y_train, X_train).fit()
y_pred = model.predict(X_test)

r_squared = model.rsquared
print("R-squared:", r_squared)


R-squared: 0.6166763584635265


In [25]:
pca_result = model.summary()
print(pca_result)

                                 OLS Regression Results                                
Dep. Variable:                   AAPL   R-squared (uncentered):                   0.617
Model:                            OLS   Adj. R-squared (uncentered):              0.583
Method:                 Least Squares   F-statistic:                              18.58
Date:                Mon, 27 May 2024   Prob (F-statistic):                    1.59e-37
Time:                        15:24:37   Log-Likelihood:                          869.80
No. Observations:                 251   AIC:                                     -1700.
Df Residuals:                     231   BIC:                                     -1629.
Df Model:                          20                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

# Ornstein Uhlenbeck

The Ornstein-Uhlenbeck process is defined by the following stochastic differential equation (SDE):

$$ dX_t = \theta (\mu - X_t) dt + \sigma dW_t $$

where:

- **$ X_t $**: The value of the process at time $ t $.
- **$ \mu $**: The long-term mean (equilibrium level) to which the process reverts.
- **$ \theta $**: The speed of reversion or the rate at which the process returns to the mean.
- **$ \sigma $**: The volatility (standard deviation), representing the magnitude of random fluctuations.
- **$ W_t $**: A Wiener process or Brownian motion that adds stochastic (random) noise.

This equation describes a process where the variable $ X_t $ moves towards the mean $ \mu $ at a rate determined by $ \theta $, with random noise added by $ \sigma dW_t $.

---
<font color=green>Q20: (3 Marks) </font>
<br><font color='green'>
In the context of mean reversion, which quantity should be modeled using an Ornstein-Uhlenbeck process?
</font>

---

**Write your answers here:**

1. Interest rates

In financial modeling, the Ornstein-Uhlenbeck process can be used to model interest rates. For example, the Vasicek model is a popular model that assumes the short-term interest rate follows an OU process. Interest rates tend to revert to a long-term mean over time, which aligns well with the mean-reverting nature of the OU process.

2. Volatility

The volatility of asset prices can be modeled as an OU process. This approach assumes that volatility reverts to a long-term average level over time, capturing the cyclical nature of market volatility.

3. Exchange rates

Some exchange rate pairs exhibit mean reversion characteristics in the short term, meaning that they tend to revert to a long-term average level after experiencing external shocks. This behavior may result from a combination of factors such as:

Central Bank Interventions: Central banks may intervene in the foreign exchange market or adjust interest rates to stabilize the exchange rate, causing it to revert to a target level.

Economic Fundamentals: Changes in economic fundamentals, such as inflation rates, economic growth rates, and international trade conditions, can drive exchange rates back to levels consistent with these fundamentals.

Market Sentiment: Investor and trader expectations and market sentiment can influence exchange rate fluctuations, leading to a reversion to the long-term mean.

Arbitrage Mechanisms: If exchange rates deviate significantly from the mean, arbitrageurs may engage in buying low and selling high, which helps push the exchange rate back towards the mean.

4. Asset prices in pairs trading

In pairs trading, the spread between two correlated asset prices is often modeled using an OU process. The idea is that while the prices of two assets might move together, the difference (or spread) between them tends to revert to a mean value over time. This mean-reverting behavior makes the OU process a suitable model for the spread.




---
<font color=green>Q21: (5 Marks) </font>
<br><font color='green'>
Explain how the parameters $ \theta $ and $ \sigma $ can be determined using the following equations. Also, detail the underlying assumptions:
$$ E[X] = \mu $$
$$ \text{Var}[X] = \frac{\sigma^2}{2\theta} $$
</font>

---

**Write your answers here:**

---
<font color=green>Q22: (2 Marks) </font>
<br><font color='green'>
Create a function named `extract_s_scores` which computes 's scores' for the last element in a list of floating-point numbers. This function calculates the scores using the following formula $ \text{s scores} = \frac{X_T - \mu}{\sigma} $ where `list_xi` represents a list containing a sequence of floating-point numbers $(X_0, \cdots, X_T)$.

</font>

---

In [26]:
## Insert your code here
def extract_s_scores(list_xi):
    array_xi = np.array(list_xi)
    mu = np.mean(array_xi)
    sigma = np.std(array_xi)
    
    XT = array_xi[-1]
    s_score = (XT - mu) / sigma
    
    return s_score


# Autoencoder Analysis

Autoencoders are neural networks used for unsupervised learning, particularly for dimensionality reduction and feature extraction. Training an autoencoder on the $Z_i$ matrix aims to identify hidden factors capturing the intrinsic structures in financial data.

### Architecture
- **Encoder**: Compresses input data into a smaller latent space representation.
  - *Input Layer*: Matches the number of features in the $Z_i$ matrix.
  - *Hidden Layers*: Compress data through progressively smaller layers.
  - *Latent Space*: Encodes the data into hidden factors.
- **Decoder**: Reconstructs input data from the latent space.
  - *Hidden Layers*: Gradually expand to the original dimension.
  - *Output Layer*: Matches the input layer to recreate the original matrix.

### Training
The autoencoder is trained by minimizing reconstruction loss, usually mean squared error (MSE), between the input $Z_i$ matrix and the decoder's output.

### Hidden Factors Extraction
After training, the encoder's latent space provides the most important underlying patterns in the stock returns.

---
<font color=green>Q23: (2 Marks) </font>
<br><font color='green'>
Modify the standardized returns matrix `Z_matrix` to reduce the influence of extreme outliers on model trainingby ensuring that all values in the matrix `Z_matrix` do not exceed 3 standard deviations from the mean. Specifically, cap these values at the interval $-3, 3]$. Store the adjusted values in a new matrix, `Z_hat`.
</font>

----

In [27]:
Z_matrix.head()

Unnamed: 0,Date,ADBE,ADP,GOOGL,GOOG,AMZN,AMD,AEP,AMGN,ADI,...,TTWO,TMUS,TSLA,TXN,VRSK,VRTX,WBA,WBD,WDAY,XEL
0,2023-03-29,0.658925,2.189521,0.106285,0.207821,1.50357,0.444433,1.020467,0.727398,1.860839,...,0.517876,0.615242,0.814986,1.34465,1.105776,0.127319,0.497115,0.429736,2.277783,1.269711
1,2023-03-30,0.273051,-0.221306,-0.389329,-0.434766,0.787034,0.526996,0.277291,0.076716,1.630406,...,-0.109884,0.439972,0.233493,1.187056,0.239468,-0.52598,0.691243,0.4461,0.389915,0.430715
2,2023-03-31,0.36057,1.136308,1.540732,1.439604,0.531735,-0.056439,0.475365,0.008639,0.935425,...,1.337544,0.117646,2.058867,0.640237,0.325258,0.538943,-0.008816,0.565842,1.611597,0.597545
3,2023-04-03,-0.713053,-2.259591,0.252735,0.407399,-0.591892,-0.600163,-0.086824,0.759731,-0.328378,...,-0.377687,1.191774,-2.030038,-0.684867,-0.21446,0.183344,1.205798,-0.547975,-0.634285,0.121519
4,2023-04-04,0.56073,-1.13743,0.099652,0.013877,0.658632,-0.342217,0.223111,0.872403,-0.399698,...,1.434972,-0.347689,-0.377655,-1.394525,-0.538698,-0.487155,0.553153,0.755053,-0.537767,1.011197


In [28]:
numeric_cols = Z_matrix.select_dtypes(include=[np.number]).columns
non_numeric_cols = Z_matrix.select_dtypes(exclude=[np.number]).columns

Z_hat = Z_matrix.copy()

Z_hat_numeric = Z_matrix[numeric_cols].apply(lambda x: np.clip(x, -3, 3))

Z_hat_numeric = Z_hat_numeric.astype(np.float32)

Z_hat = pd.concat([Z_hat_numeric, Z_matrix[non_numeric_cols]], axis=1)

Z_hat = Z_hat.set_index("Date")

print(Z_hat.head())



                ADBE       ADP     GOOGL      GOOG      AMZN       AMD  \
Date                                                                     
2023-03-29  0.658925  2.189521  0.106285  0.207821  1.503570  0.444433   
2023-03-30  0.273051 -0.221306 -0.389329 -0.434766  0.787034  0.526996   
2023-03-31  0.360570  1.136308  1.540732  1.439604  0.531735 -0.056439   
2023-04-03 -0.713053 -2.259591  0.252735  0.407399 -0.591892 -0.600163   
2023-04-04  0.560730 -1.137430  0.099652  0.013877  0.658632 -0.342217   

                 AEP      AMGN       ADI      ANSS  ...      TTWO      TMUS  \
Date                                                ...                       
2023-03-29  1.020467  0.727398  1.860839  0.356414  ...  0.517876  0.615242   
2023-03-30  0.277291  0.076716  1.630406  0.936497  ... -0.109884  0.439972   
2023-03-31  0.475365  0.008639  0.935425  1.044069  ...  1.337544  0.117646   
2023-04-03 -0.086824  0.759731 -0.328378 -0.555499  ... -0.377687  1.191774   
2023-04

---
<font color=green>Q24: (1 Marks) </font>
<br><font color='green'>
Fetch the `Z_hat` data from GitHub, and we'll proceed with it now.
</font>



In [42]:
## Insert your code here
Z_hat = pd.read_csv("C:/Users/42275/Downloads/Z_hat.csv")
Z_hat

Unnamed: 0,Date,ADBE,ADP,GOOGL,GOOG,AMZN,AMD,AEP,AMGN,ADI,...,TTWO,TMUS,TSLA,TXN,VRSK,VRTX,WBA,WBD,WDAY,XEL
0,2023-03-29,0.658925,2.189521,0.106285,0.207821,1.503570,0.444433,1.020467,0.727398,1.860839,...,0.517876,0.615242,0.814986,1.344650,1.105776,0.127319,0.497115,0.429736,2.277783,1.269711
1,2023-03-30,0.273051,-0.221306,-0.389329,-0.434766,0.787034,0.526996,0.277291,0.076716,1.630406,...,-0.109884,0.439972,0.233493,1.187056,0.239468,-0.525980,0.691243,0.446100,0.389915,0.430715
2,2023-03-31,0.360570,1.136308,1.540732,1.439604,0.531735,-0.056439,0.475365,0.008639,0.935425,...,1.337544,0.117646,2.058867,0.640237,0.325258,0.538943,-0.008816,0.565842,1.611597,0.597545
3,2023-04-03,-0.713053,-2.259591,0.252735,0.407399,-0.591892,-0.600163,-0.086824,0.759731,-0.328378,...,-0.377687,1.191774,-2.030038,-0.684867,-0.214460,0.183344,1.205798,-0.547975,-0.634285,0.121519
4,2023-04-04,0.560730,-1.137430,0.099652,0.013877,0.658632,-0.342217,0.223111,0.872403,-0.399698,...,1.434972,-0.347689,-0.377655,-1.394525,-0.538698,-0.487155,0.553153,0.755053,-0.537767,1.011197
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
247,2024-03-22,-1.146803,-0.516681,1.151431,1.085070,0.074915,0.081847,-0.150701,-0.278013,-0.555224,...,0.046870,-0.246035,-0.386620,-0.054031,-0.476030,-0.091837,-0.421327,-0.946797,0.106149,-0.002851
248,2024-03-25,0.659349,-1.221008,-0.372489,-0.341074,0.109668,-0.292706,-0.084892,1.184710,-0.959285,...,-2.575803,0.240999,0.343241,-0.651297,-1.151639,-0.024341,0.166127,0.118796,-0.427841,0.321648
249,2024-03-26,-0.032708,0.234343,0.131651,0.109342,-0.556130,-0.244717,-0.377797,0.183364,-0.577464,...,0.150958,-0.070195,0.960796,-1.177043,-0.384593,0.306386,-0.206326,-0.246683,0.237594,-0.878121
250,2024-03-27,-0.363721,1.052056,-0.024166,-0.010591,0.315892,0.224881,2.192442,1.128111,1.411127,...,0.034693,0.474269,0.396878,1.991081,0.941800,-0.265792,1.179504,1.004419,-0.793726,2.193537


---
<font color=green>Q25: (3 Marks) </font>
<br><font color='green'>
Segment the standardized and capped returns matrix $\hat{Z}$ into two subsets for model training and testing. Precisly Allocate 70% of the data in $\hat{Z}$ to the training set $ \hat{Z}_{train} $ and Allocate the remaining 30% to the testing set $\hat{Z}_{test}$. Treat each stock within $\hat{Z}$ as an individual sample, by flattening temporal dependencies.
</font>



In [43]:
Z_hat.head()
Z_hat.set_index('Date', inplace=True)

In [52]:
num_columns = Z_hat.shape[1]

num_train = int(0.7 * num_columns)
num_test = num_columns - num_train

Z_train_values = Z_hat.iloc[:, :num_train]
Z_test_values = Z_hat.iloc[:, num_train:]

In [53]:
print(Z_train_values.shape)
print(Z_test_values.shape)

(252, 62)
(252, 27)


In [31]:
## Insert your code here
num_stocks = Z_hat.shape[1]
num_timepoints = Z_hat.shape[0]

Z_flattened = Z_hat.values.flatten()
Z_flattened_df = pd.DataFrame(Z_flattened.reshape(num_timepoints, num_stocks))

print(Z_flattened_df.head())

         0         1         2         3         4         5         6   \
0  0.658925  2.189521  0.106285  0.207821  1.503570  0.444433  1.020467   
1  0.273051 -0.221306 -0.389329 -0.434766  0.787034  0.526996  0.277291   
2  0.360570  1.136308  1.540732  1.439604  0.531735 -0.056439  0.475365   
3 -0.713053 -2.259591  0.252735  0.407399 -0.591892 -0.600163 -0.086824   
4  0.560730 -1.137430  0.099652  0.013877  0.658632 -0.342217  0.223111   

         7         8         9   ...        79        80        81        82  \
0  0.727398  1.860839  0.356414  ...  0.517876  0.615242  0.814986  1.344650   
1  0.076716  1.630406  0.936497  ... -0.109884  0.439972  0.233493  1.187056   
2  0.008639  0.935425  1.044069  ...  1.337544  0.117646  2.058867  0.640237   
3  0.759731 -0.328378 -0.555499  ... -0.377687  1.191774 -2.030038 -0.684867   
4  0.872403 -0.399698 -0.127907  ...  1.434972 -0.347689 -0.377655 -1.394525   

         83        84        85        86        87        88  
0  1

In [32]:
num_samples = Z_flattened_df.shape[0]
num_train = int(0.7 * num_samples)
num_test = num_samples - num_train

Z_flattened_shuffled = Z_flattened_df.sample(frac=1, random_state=42).reset_index(drop=True)

Z_train = Z_flattened_shuffled.iloc[:num_train]
Z_test = Z_flattened_shuffled.iloc[num_train:]

Z_train_values = Z_train.values.astype(np.float32)
Z_test_values = Z_test.values.astype(np.float32)

In [40]:
print(Z_train_values)
print(Z_test_values)

[[ 0.60538846 -0.08655936  0.5458483  ...  0.279244   -0.6195559
   0.5205008 ]
 [-0.24196579  1.0931555   2.104351   ...  0.71422184 -0.36500913
   0.1867584 ]
 [-0.1909823   0.04277211  0.24551223 ... -1.191706    0.3487746
   0.3638528 ]
 ...
 [-0.01398022 -1.7401919  -0.17699751 ... -0.22337161  0.70906764
  -0.09777372]
 [ 0.16617802  1.2301244   0.25460672 ...  0.7971345   0.92362463
   0.4074212 ]
 [-0.73792624 -0.8065016   0.45396045 ...  0.5877471  -2.1547284
   0.317724  ]]
[[ 1.262453    0.48849383  1.2352233  ...  0.5974299   1.319831
  -0.12866126]
 [-1.9811777  -0.81500125 -0.39912876 ...  0.15988344 -1.2877108
  -3.        ]
 [ 2.262857   -0.07646707 -0.2730693  ...  2.1655514   0.18502048
  -0.14914261]
 ...
 [ 0.12073871  0.01908077 -0.0881118  ...  0.06746316  0.45066357
  -0.18729164]
 [-0.76516426  0.83624583 -0.079596   ...  1.8343492   0.27980286
   2.545433  ]
 [-1.688042   -0.43952355 -1.2431799  ...  0.04555993 -1.3000882
  -0.5889864 ]]


---
<font color=green>Q26: (10 Marks) </font>
<br><font color='green'>
Please create an autoencoder following the instructions provided in  **[End-to-End Policy Learning of a Statistical Arbitrage Autoencoder Architecture](https://arxiv.org/pdf/2402.08233.pdf)**, Use the model 'Variant 2' in Table 1.
</font>

---

In [68]:
# Define the autoencoder model
input_dim = Z_train_values.shape[0]
encoding_dim = 20 

input_layer = Input(shape=(input_dim,))

encoder = Dense(encoding_dim, activation="tanh", use_bias=True)(input_layer)

decoder = Dense(input_dim, activation="tanh", use_bias=True)(encoder)

autoencoder = Model(inputs=input_layer, outputs=decoder)
autoencoder.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')


In [80]:
autoencoder.fit(Z_train_values.T, Z_train_values.T,
                epochs=20,
                batch_size=32,
                shuffle=True,
                validation_split=0.2)


Epoch 1/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 203ms/step - loss: 0.3623 - val_loss: 0.5064
Epoch 2/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 82ms/step - loss: 0.3620 - val_loss: 0.5063
Epoch 3/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 68ms/step - loss: 0.3631 - val_loss: 0.5063
Epoch 4/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 84ms/step - loss: 0.3610 - val_loss: 0.5062
Epoch 5/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 75ms/step - loss: 0.3593 - val_loss: 0.5061
Epoch 6/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 82ms/step - loss: 0.3610 - val_loss: 0.5060
Epoch 7/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 80ms/step - loss: 0.3614 - val_loss: 0.5059
Epoch 8/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 80ms/step - loss: 0.3612 - val_loss: 0.5058
Epoch 9/20
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m 

<keras.src.callbacks.history.History at 0x2c2744fe790>

---
<font color=green>Q27 (1 Mark) :

Display all the parameters of the deep neural network.
</font>

---

In [81]:
## Insert your code here
autoencoder.summary()

---
<font color=green>Q28: (3 Marks) </font>
<br><font color='green'>
Train your model using the Adam optimizer for 20 epochs with a batch size equal to 8 and validation split to 20%. Specify the loss function you've chosen.
</font>


In [82]:
## Insert your code here
model_2 = autoencoder.fit(Z_train_values.T, Z_train_values.T,
                          epochs=20,
                          batch_size=8,
                          shuffle=True,
                          validation_split=0.2)

autoencoder.summary()


Epoch 1/20
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 38ms/step - loss: 0.3509 - val_loss: 0.5048
Epoch 2/20
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 14ms/step - loss: 0.3538 - val_loss: 0.5046
Epoch 3/20
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 19ms/step - loss: 0.3539 - val_loss: 0.5046
Epoch 4/20
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 18ms/step - loss: 0.3589 - val_loss: 0.5045
Epoch 5/20
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 15ms/step - loss: 0.3561 - val_loss: 0.5044
Epoch 6/20
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step - loss: 0.3490 - val_loss: 0.5041
Epoch 7/20
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 13ms/step - loss: 0.3492 - val_loss: 0.5038
Epoch 8/20
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 17ms/step - loss: 0.3534 - val_loss: 0.5037
Epoch 9/20
[1m7/7[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [

---
<font color=green>Q29: (3 Marks) </font>
<br><font color='green'>
Predict using the testing set and extract the residuals based on the methodology described in **[End-to-End Policy Learning of a Statistical Arbitrage Autoencoder Architecture](https://arxiv.org/pdf/2402.08233.pdf)**.
for 'NVDA' stock.
</font>

---

In [70]:
Z_test_values

Unnamed: 0_level_0,NVDA,NXPI,ORLY,ODFL,ON,PCAR,PANW,PAYX,PYPL,PEP,...,TTWO,TMUS,TSLA,TXN,VRSK,VRTX,WBA,WBD,WDAY,XEL
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
2023-03-29,0.552528,1.624397,0.213439,-0.170982,1.514933,0.360310,0.181513,3.000000,0.837246,0.703151,...,0.517876,0.615242,0.814986,1.344650,1.105776,0.127319,0.497115,0.429736,2.277783,1.269711
2023-03-30,0.318767,0.662701,0.823726,0.282329,0.847784,-0.336688,0.170956,-1.728764,0.117209,0.084364,...,-0.109884,0.439972,0.233493,1.187056,0.239468,-0.525980,0.691243,0.446100,0.389915,0.430715
2023-03-31,0.305373,1.283113,0.623793,1.062075,0.199890,0.870319,0.996037,0.669330,0.856903,0.828273,...,1.337544,0.117646,2.058867,0.640237,0.325258,0.538943,-0.008816,0.565842,1.611597,0.597545
2023-04-03,0.048987,-1.321406,1.729910,-1.083890,-0.509833,-0.356025,-0.639367,-2.274582,-0.350731,0.106076,...,-0.377687,1.191774,-2.030038,-0.684867,-0.214460,0.183344,1.205798,-0.547975,-0.634285,0.121519
2023-04-04,-0.794750,-1.734429,-0.439353,-0.926025,-1.149815,-2.553395,-0.029813,-1.055964,-0.015461,-0.372367,...,1.434972,-0.347689,-0.377655,-1.394525,-0.538698,-0.487155,0.553153,0.755053,-0.537767,1.011197
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-03-22,0.871467,-0.456547,0.217545,-0.508751,-0.516168,-0.602836,-0.167693,-1.004071,-0.850055,-0.262669,...,0.046870,-0.246035,-0.386620,-0.054031,-0.476030,-0.091837,-0.421327,-0.946797,0.106149,-0.002851
2024-03-25,0.075687,-0.922165,-2.251583,-0.783420,-0.542341,-0.405590,-0.268180,-1.267835,0.781072,0.339683,...,-2.575803,0.240999,0.343241,-0.651297,-1.151639,-0.024341,0.166127,0.118796,-0.427841,0.321648
2024-03-26,-1.043244,-0.358961,-0.272465,-0.273948,-0.317910,-0.431217,0.120918,0.354143,0.417953,0.070754,...,0.150958,-0.070195,0.960796,-1.177043,-0.384593,0.306386,-0.206326,-0.246683,0.237594,-0.878121
2024-03-27,-1.018786,1.279462,-0.053309,-0.715514,1.293988,0.722008,-0.585240,1.220609,-0.060711,0.492854,...,0.034693,0.474269,0.396878,1.991081,0.941800,-0.265792,1.179504,1.004419,-0.793726,2.193537


In [83]:
## Insert your code here
predictions = autoencoder.predict(Z_test_values.T)

actual_values = Z_test_values
predicted_values = predictions.T



[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 68ms/step


In [84]:
residuals = actual_values - predicted_values
print(residuals['NVDA'])

Date
2023-03-29   -0.021578
2023-03-30   -0.123023
2023-03-31    0.195271
2023-04-03    0.744047
2023-04-04   -1.153541
                ...   
2024-03-22    1.107585
2024-03-25    0.146311
2024-03-26   -0.570402
2024-03-27   -0.515333
2024-03-28   -0.072059
Name: NVDA, Length: 252, dtype: float64


<font color=green>Q30: (7 Marks) </font>
<br><font color='green'>
By reading carrefully the paper **[End-to-End Policy Learning of a Statistical Arbitrage Autoencoder Architecture](https://arxiv.org/pdf/2402.08233.pdf)**, answers the following question:
1. **Summarize the Key Actions**: Highlight the main experiments and methodologies employed by the authors in Section 5.
2. **Reproduction Steps**: Detail the necessary steps required to replicate the authors' approach based on the descriptions provided in the paper.
3. **Proposed Improvement**: Suggest one potential enhancement to the methodology that could potentially increase the effectiveness or efficiency of the model.



**Write your answers here:**








