# Midterm 2

## FINM 36700 - 2024

### UChicago Financial Mathematics

* Mark Hendricks
* hendricks@uchicago.edu

# Instructions

## Please note the following:

Points
* The exam is 100 points.
* You have 120 minutes to complete the exam.
* For every minute late you submit the exam, you will lose one point.


Submission
* You will upload your solution to the `Midterm 2` assignment on Canvas, where you downloaded this. 
* Be sure to **submit** on Canvas, not just **save** on Canvas.
* Your submission should be readable, so that the graders can understand your answers.
* Your submission should **include all code used in your analysis, in a file format from which the code can be executed.** 

Rules
* The exam is open-material, closed-communication.
* You do not need to cite material from the course github repo - you are welcome to use the code posted there without citation.

Advice
* If you find any question to be unclear, state your interpretation and proceed. We will only answer questions of interpretation if there is a typo, error, etc.
* The exam will be graded for partial credit.

## Data

**All data files are found in the class github repo, in the `data` folder.**

This exam makes use of the following data files:
* `midterm_2_data.xlsx`

This file contains the following sheets:
- for Section 2:
    * `sector stocks excess returns` - MONTHLY excess returns for 49 sector stocks
    * `factors excess returns` - MONTHLY excess returns of AQR factor model from Homework 5
- for Section 3:
    * `factors excess returns` - MONTHLY excess returns of AQR factor model from Homework 5

In [1]:
import pandas as pd
import numpy as np
from tabulate import tabulate
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
import scipy.stats as stats
from scipy.stats import norm


In [2]:
df_sector = pd.read_excel('../data/midterm_2_data.xlsx', sheet_name='sector excess returns').set_index('date')
df_factors = pd.read_excel('../data/midterm_2_data.xlsx', sheet_name='factors excess returns').set_index('date')

In [3]:
df_sector.head()

date,Agric,Food,Soda,Beer,Smoke,Toys,Fun,Books,Hshld,Clths,...,Boxes,Trans,Whlsl,Rtail,Meals,Banks,Insur,RlEst,Fin,Other
1980-01-01,-0.0076,0.0285,0.0084,0.1009,-0.0143,0.1002,0.0362,0.0323,0.0048,0.0059,...,0.0158,0.0875,0.0465,-0.0126,0.043,-0.0283,0.0258,0.0768,0.0308,0.0669
1980-02-01,0.0105,-0.0608,-0.0966,-0.0322,-0.0569,-0.0323,-0.0521,-0.08,-0.0555,-0.0167,...,-0.0079,-0.0541,-0.0346,-0.0639,-0.0652,-0.0854,-0.0959,-0.0347,-0.0282,-0.0274
1980-03-01,-0.2224,-0.1119,-0.0167,-0.1469,-0.0193,-0.1271,-0.0826,-0.1237,-0.0566,-0.0668,...,-0.0819,-0.1509,-0.1098,-0.0906,-0.1449,-0.056,-0.088,-0.2451,-0.1254,-0.1726
1980-04-01,0.0449,0.0766,0.0232,0.0321,0.083,-0.0529,0.0783,0.0153,0.0304,0.0115,...,0.042,-0.0103,-0.0312,0.0353,0.0542,0.0728,0.053,0.0977,0.0447,0.0769
1980-05-01,0.0632,0.0793,0.0457,0.0863,0.0815,0.0509,0.0324,0.0886,0.056,0.0098,...,0.0564,0.1063,0.1142,0.0877,0.1134,0.0578,0.0557,0.0915,0.0844,0.0685


In [4]:
df_factors.head()

date,MKT,HML,RMW,UMD
1980-01-01,0.0551,0.0175,-0.017,0.0755
1980-02-01,-0.0122,0.0061,0.0004,0.0788
1980-03-01,-0.129,-0.0101,0.0146,-0.0955
1980-04-01,0.0397,0.0106,-0.021,-0.0043
1980-05-01,0.0526,0.0038,0.0034,-0.0112


## Scoring

| Problem | Points |
|---------|--------|
| 1       | 25     |
| 2       | 40     |
| 3       | 35     |

### Each numbered question is worth 5 points unless otherwise specified.

# 1. Short Answer

#### No Data Needed

These problems do not require any data file. Rather, analyze them conceptually. 

### 1.1.

Historically, which pricing factor among the ones we studied has shown a considerable decrease in importance?

- The Size factor (small minus large) from the Fama-French model has shown a considerable decrease in importance over time.
- The size effect was recognized early: research and the case studies we used in class, including those on DFA, indicated that small-cap stocks historically earned excess returns over large-cap stocks, likely due to higher risk or market inefficiencies.
- In more recent data, the size factor's relevance has diminished for several reasons:
    - Market maturity
    - Investor behavior
    - Empirical evidence: its premium and significance have weakened in recent samples (e.g., HW 4 and HW 5).
- This has led some investors and academics to reconsider the weight, or even the inclusion, of the size factor in multifactor models.

### 1.2.

True or False: For a given factor model and a set of test assets, the addition of one more factor to that model will surely decrease the cross-sectional MAE. 

True or False: For a given factor model and a set of test assets, the addition of one more factor to that model will surely decrease the time-series MAE. 

Along with stating T/F, explain your reasoning for the two statements.

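- 1 - False:
    - In the cross-sectional regression, an extra factor adds a regressor, so the in-sample sum of squared errors cannot increase. That guarantee applies to squared errors, not absolute errors, so the MAE is not certain to fall.
    - If the new factor's betas explain no additional cross-sectional variation in mean returns, the fit is unchanged and the MAE does not decrease.
    - Out of sample, an unnecessary factor mostly fits noise and can increase the MAE.
- 2 - False:
    - The time-series MAE is the mean absolute alpha. Adding a factor to the time-series regressions raises the R-squared, but each intercept can move toward or away from zero.
    - An irrelevant factor adds estimation error to the betas and alphas.
    - More complex models are not inherently better and can perform worse out-of-sample.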
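
The cross-sectional point can be illustrated with a minimal sketch on synthetic data. Everything here is an assumption made for illustration: the betas, the mean returns, and the helper `cross_section_fit` are hypothetical, and the cross-sectional regression is run with an intercept. The sketch shows that adding a regressor cannot raise the in-sample sum of squared errors, and prints the MAE alongside it, for which OLS gives no such guarantee.

```python
import numpy as np

rng = np.random.default_rng(0)
n_assets = 25

# Hypothetical cross-section: betas on two factors and mean excess returns.
beta_1 = rng.normal(1.0, 0.3, n_assets)
beta_2 = rng.normal(0.0, 0.5, n_assets)  # betas on the candidate extra factor
mean_ret = 0.005 * beta_1 + rng.normal(0.0, 0.002, n_assets)


def cross_section_fit(betas, y):
    """OLS of mean returns on betas (with intercept); returns the residuals."""
    X = np.column_stack([np.ones(len(y)), betas])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef


res_small = cross_section_fit(beta_1, mean_ret)
res_big = cross_section_fit(np.column_stack([beta_1, beta_2]), mean_ret)

sse_small, sse_big = np.sum(res_small**2), np.sum(res_big**2)
mae_small, mae_big = np.mean(np.abs(res_small)), np.mean(np.abs(res_big))

print(f"SSE: {sse_small:.3e} -> {sse_big:.3e}")  # cannot increase in-sample
print(f"MAE: {mae_small:.3e} -> {mae_big:.3e}")  # usually falls, not guaranteed
```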

### 1.3.

Consider the scenario in which you are helping two people with investments.

* The young person has a 50 year investment horizon.
* The elderly person has a 10 year investment horizon.
* Both individuals have the same portfolio holdings.

State who has the more certain cumulative return and explain your reasoning.

- The elderly person has the more certain cumulative return over their 10-year horizon.
- With the same holdings, the volatility of the cumulative (log) return grows with the square root of the horizon, so it is larger over 50 years than over 10, even though the volatility of the average annualized return falls.
- The average annual return becomes more predictable over long horizons, but the cumulative return becomes less certain because the per-period shocks accumulate.
- A longer horizon leaves more time to encounter adverse market conditions and to diverge from the expected outcome.
- Shorter horizons offer greater certainty: the elderly person faces less cumulative risk over 10 years, so their cumulative return is more predictable.
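
A small numerical sketch of this argument, assuming iid log returns and a hypothetical annual vol of 16% (the number and the helper `horizon_vols` are illustrative, not taken from the exam data):

```python
import math


def horizon_vols(annual_vol, years):
    """Return (vol of the cumulative log return, vol of the average annual log return)."""
    cumulative_vol = math.sqrt(years) * annual_vol
    average_vol = annual_vol / math.sqrt(years)
    return cumulative_vol, average_vol


annual_vol = 0.16  # hypothetical annual vol of the shared portfolio

for years in (10, 50):
    cum_vol, avg_vol = horizon_vols(annual_vol, years)
    print(f"{years}-year horizon: cumulative vol = {cum_vol:.3f}, average vol = {avg_vol:.4f}")
```

Under these assumptions the 50-year cumulative vol is larger than the 10-year one by a factor of $\sqrt{5}$, while the vol of the average return is smaller by the same factor.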

### 1.4.

Suppose we find that the 10-year bond yield works well as a new pricing factor, along with `MKT`.

Consider two ways of building this new factor.
1. Directly use the index of 10-year yields, `YLD`
1. Construct a Fama-French style portfolio of equities, `FFYLD`. (Rank all the stocks by their correlation to bond yield changes, and go long the highest ranked and short the lowest ranked.)

Could you test the model with `YLD` and the model with `FFYLD` in the exact same ways? Explain.

- No, `YLD` and `FFYLD` require different testing approaches.
- `YLD` is a yield index, not a traded excess return. Its sample mean is not a factor premium, so the time-series test (checking that the alphas are zero) is not valid. It must be tested in the cross-section, where the premium is estimated by regressing average returns on betas.
- `FFYLD` is the excess return of a long-short portfolio, so it fits the standard multifactor tests: both the time-series and the cross-sectional test apply.
- In short, the model with `FFYLD` can be tested both ways, while the model with `YLD` can only be tested cross-sectionally.

### 1.5.

Suppose we implement a momentum strategy on cryptocurrencies rather than US stocks.

Conceptually speaking, but specific to the context of our course discussion, how would the risk profile differ from the momentum strategy of US equities?

- A momentum strategy on cryptocurrencies would carry a much higher risk profile than one on US equities, due to:
    - High volatility: cryptocurrency prices fluctuate far more drastically, amplifying momentum risks.
    - Low liquidity: thin market depth means trades can heavily move prices, increasing execution risk.
    - Regulatory and structural risks: less oversight and the potential for price manipulation heighten systemic risk (e.g., from the Barnstable case).
- In essence, crypto momentum strategies face amplified volatility, liquidity, and structural challenges compared to US equities.

***

# 2. Pricing and Tangency Portfolio

You work in a hedge fund that believes that the AQR 4-Factor Model (present in Homework 5) is the perfect pricing model for stocks.

$$
\mathbb{E} \left[ \tilde{r}^i \right] = \beta^{i,\text{MKT}} \mathbb{E} \left[ \tilde{f}_{\text{MKT}} \right] + \beta^{i,\text{HML}} \mathbb{E} \left[ \tilde{f}_{\text{HML}} \right] + \beta^{i,\text{RMW}} \mathbb{E} \left[ \tilde{f}_{\text{RMW}} \right] + \beta^{i,\text{UMD}} \mathbb{E} \left[ \tilde{f}_{\text{UMD}} \right]
$$

The factors are available in the sheet `factors excess returns`.

The hedge fund invests in sector-tracking ETFs available in the sheet `sectors excess returns`. You are to allocate into these sectors according to a mean-variance optimization with...

* regularization: the off-diagonal elements of the covariance matrix are divided by 2.
* modeled risk premia: expected excess returns given by the factor model rather than just using the historic sample averages.

You are to train the portfolio and test out-of-sample. The timeframes should be:
* Training timeframe: Jan-2018 to Dec-2022.
* Testing timeframe: Jan-2023 to most recent observation.

In [5]:
train_start, train_end = '2018-01-01', '2022-12-01'
test_start, test_end = '2023-01-01', min(df_factors.index.max(), df_sector.index.max())
df_sector_train = df_sector.loc[train_start:train_end]
df_factors_train = df_factors.loc[train_start:train_end]
df_sector_test = df_sector.loc[test_start:test_end]
df_factors_test = df_factors.loc[test_start:test_end]

### 2.1.
(8pts)

Calculate the model-implied expected excess returns of every asset.

The time-series estimations should...
* NOT include an intercept. (You assume the model holds perfectly.)
* use data from the `training` timeframe.

With the time-series estimates, use the `training` timeframe's sample average of the factors as the factor premia. Together, this will give you the model-implied risk premia, which we label as
$$
\lambda_i := \mathbb{E}[\tilde{r}_i]
$$

* Store $\lambda_i$ and $\boldsymbol{\beta}^i$ for each asset.
* Print $\lambda_i$ for `Agric`, `Food`, `Soda`

In [6]:
# Note: The dataset has an additional space after Food and Soda in column names
factor_premia = df_factors_train.mean()

expected_excess_returns = {}
betas = {}

for asset in df_sector_train.columns:
    X = df_factors_train[['MKT', 'HML', 'RMW', 'UMD']]
    y = df_sector_train[asset]

    model = sm.OLS(y, X).fit()

    betas[asset] = model.params

    lambda_i = model.params @ factor_premia.values
    expected_excess_returns[asset] = lambda_i

for asset in ['Agric', 'Food ', 'Soda ']:
    print(f"{asset}: \nλ = {expected_excess_returns[asset]}, \nβ = {betas[asset]}\n")

Agric: 
λ = 0.003655102106916498, 
β = MKT    0.832362
HML    0.556541
RMW   -0.502093
UMD    0.038972
dtype: float64

Food : 
λ = 0.005454267868380435, 
β = MKT    0.524509
HML    0.205452
RMW    0.309711
UMD   -0.003572
dtype: float64

Soda : 
λ = 0.007336244651963076, 
β = MKT    0.540240
HML    0.179127
RMW    0.638443
UMD    0.013688
dtype: float64
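
As an alternative to looping over assets, the no-intercept regressions can be solved in one least-squares call. This is a minimal sketch on synthetic data: the dimensions, the arrays `F`, `R`, and `true_B`, and the random inputs are all hypothetical stand-ins for the training-sample factors and sector returns.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K, N = 60, 4, 3  # months, factors, assets (all hypothetical)

F = rng.normal(0.005, 0.03, size=(T, K))             # synthetic factor excess returns
true_B = rng.normal(0.5, 0.3, size=(K, N))           # synthetic true betas
R = F @ true_B + rng.normal(0.0, 0.01, size=(T, N))  # synthetic asset excess returns

# No-intercept OLS for all assets at once: column i of B_hat holds asset i's betas.
B_hat, *_ = np.linalg.lstsq(F, R, rcond=None)

# Model-implied premia: lambda_i = beta_i' E[f], using the sample factor means.
lambdas = B_hat.T @ F.mean(axis=0)

print(B_hat.shape, lambdas.shape)
```

Each column of `B_hat` matches what a separate no-intercept regression for that asset would return, so the result should agree with the loop above when applied to the same data.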



### 2.2.

Use the expected excess returns derived from (2.1) with the **regularized** covariance matrix to calculate the weights of the tangency portfolio.

- Use the covariance matrix only for `training` timeframe.
- Calculate and store the vector of weights for all the assets.
- Return the weights of the tangency portfolio for `Agric`, `Food`, `Soda`.

$$
\boldsymbol{w}_{t} = \dfrac{\tilde{\Sigma}^{-1} \boldsymbol{\lambda}}{\boldsymbol{1}' \tilde{\Sigma}^{-1} \boldsymbol{\lambda}}
$$

where $\tilde{\Sigma}$ is the regularized covariance matrix and $\tilde{\Sigma}^{-1}$ is its inverse.

In [7]:
# Note: Using the training sample for the calculations.
lambda_vector = np.array(list(expected_excess_returns.values()))

cov_matrix = df_sector_train.cov()
diag_cov = np.diag(np.diag(cov_matrix)) 
reg_cov = (cov_matrix + diag_cov) / 2  
inv_reg_cov = np.linalg.inv(reg_cov)

ones = np.ones(len(lambda_vector))
scaling_factor_reg = 1 / (ones.T @ inv_reg_cov @ lambda_vector)
reg_weights = scaling_factor_reg * (inv_reg_cov @ lambda_vector)

assets = df_sector_train.columns
weights_dict = dict(zip(assets, reg_weights))

for asset in ['Agric', 'Food ', 'Soda ']:
    print(f"{asset}: {weights_dict[asset]}")

Agric: -0.030722716601690035
Food : 0.01532022454483564
Soda : 0.13294447809892707


### 2.3.

Evaluate the performance of this allocation in the `testing` period. Report the **annualized**
- mean
- vol
- Sharpe

In [8]:
def calculate_univariate_statistics(df, annual_factor=12):
    """Return annualized mean, vol, and Sharpe for each column of excess returns."""
    if isinstance(df, pd.Series):
        df = df.to_frame()

    mean_return = df.mean() * annual_factor
    volatility = df.std() * np.sqrt(annual_factor)
    sharpe_ratio = mean_return / volatility

    stats_df = pd.DataFrame({
        'mean': mean_return,
        'vol': volatility,
        'sharpe': sharpe_ratio
    })

    return stats_df.T


# Out-of-sample returns of the factor-model tangency weights from 2.2
test_returns = df_sector_test @ reg_weights
calculate_univariate_statistics(test_returns)

,0
mean,0.181176
vol,0.119549
sharpe,1.515494


### 2.4.

(7pts)

Construct the same tangency portfolio as in `2.2` but with one change:
* replace the risk premia of the assets, (denoted $\lambda_i$) with the sample averages of the excess returns from the `training` set.

So instead of using $\lambda_i$ suggested by the factor model (as in `2.1-2.3`) you're using sample averages for $\lambda_i$.

- Return the weights of the tangency portfolio for `Agric`, `Food`, `Soda`.

Evaluate the performance of this allocation in the `testing` period. Report the **annualized**
- mean
- vol
- Sharpe

In [9]:
sample_excess_returns = df_sector.loc[train_start:train_end].mean()

cov_matrix = df_sector.loc[train_start:train_end].cov()
diag_cov = np.diag(np.diag(cov_matrix))
reg_cov = (cov_matrix + diag_cov) / 2
inv_reg_cov = np.linalg.inv(reg_cov)

ones = np.ones(len(sample_excess_returns))
scaling_factor = 1 / (ones.T @ inv_reg_cov @ sample_excess_returns.values)
reg_weights = scaling_factor * (inv_reg_cov @ sample_excess_returns.values)

assets = df_sector.columns
weights_dict = dict(zip(assets, reg_weights))

for asset in ['Agric', 'Food ', 'Soda ']:
    print(f"{asset}: {weights_dict[asset]}")

Agric: 0.14408986452043415
Food : -0.06980957545009533
Soda : 0.32267987966472


In [10]:
df_sector_test = df_sector.loc[test_start:test_end]
portfolio_returns = df_sector_test @ reg_weights

calculate_univariate_statistics(portfolio_returns)

,0
mean,0.176801
vol,0.15301
sharpe,1.155487


### 2.5.

Which allocation performed better in the `testing` period: the allocation based on premia from the factor model or from the sample averages?

Why might this be?

- The factor-model-based allocation performed better in the testing period: a higher Sharpe ratio (1.52 vs. 1.16) and lower volatility (12.0% vs. 15.3%), with a similar mean (18.1% vs. 17.7%).
- Sample averages over a 5-year training window are noisy estimates of expected returns, and mean-variance optimization is highly sensitive to those inputs.
- The factor model imposes structure: every asset's premium comes from four betas and four factor premia, which reduces estimation error => more stable weights and performance out-of-sample.

### 2.6.
Suppose you now want to build a tangency portfolio solely from the factors, without using the sector ETFs.

- Calculate the weights of the tangency portfolio using `training` data for the factors.
- Again, regularize the covariance matrix of factor returns by dividing off-diagonal elements by 2.

Report, in the `testing` period, the factor-based tangency stats **annualized**...
- mean
- vol
- Sharpe


In [11]:
cov_matrix_factors = df_factors_train.cov()
diag_cov_factors = np.diag(np.diag(cov_matrix_factors))
reg_cov_factors = (cov_matrix_factors + diag_cov_factors) / 2
inv_reg_cov_factors = np.linalg.inv(reg_cov_factors)

ones = np.ones(len(factor_premia))
scaling_factor = 1 / (ones.T @ inv_reg_cov_factors @ factor_premia)
factor_weights = scaling_factor * (inv_reg_cov_factors @ factor_premia)

# df_factors_test (built in In [5]) already covers the full testing period.
portfolio_returns = df_factors_test @ factor_weights

calculate_univariate_statistics(portfolio_returns, 12)

,0
mean,0.062376
vol,0.058191
sharpe,1.071918


### 2.7.

Based on the hedge fund's beliefs, would you prefer to use the ETF-based tangency or the factor-based tangency portfolio? Explain your reasoning. Note that you should answer based on broad principles and not on the particular estimation results.

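- Based on the hedge fund's beliefs, the factor-based tangency portfolio is preferable.
- If the 4-factor model prices all assets perfectly, the tangency portfolio is spanned by the factors: the sector ETFs earn premia only through their factor betas and have no alpha.
- The ETFs therefore add idiosyncratic risk and estimation error (49 premia and a 49x49 covariance matrix) without adding expected return beyond their factor exposure.
- The ETF-based portfolio would be preferred only if the factors could not be traded directly or if the fund doubted the model.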

***

# 3. Long-Run Returns

For this question, use only the sheet `factors excess returns`.

Suppose we want to measure the long run returns of various pricing factors.

### 3.1.

Turn the data into log returns.
- Display the first 5 rows of the data.

Using these log returns, report the **annualized**
* mean
* vol
* Sharpe

### 3.2.

Consider 15-year cumulative log excess returns. Following the assumptions and modeling of Lecture 6, report the following 15-year stats:
- mean
- vol
- Sharpe

How do they compare to the estimated stats (1-year horizon) in `3.1`? 

In [12]:
# 3.1
log_returns = np.log1p(df_factors)
log_returns.head()

date,MKT,HML,RMW,UMD
1980-01-01,0.053636,0.017349,-0.017146,0.072786
1980-02-01,-0.012275,0.006081,0.0004,0.075849
1980-03-01,-0.138113,-0.010151,0.014494,-0.100373
1980-04-01,0.038932,0.010544,-0.021224,-0.004309
1980-05-01,0.051263,0.003793,0.003394,-0.011263


In [13]:
summary_annualized = calculate_univariate_statistics(log_returns, 12)
summary_annualized

,MKT,HML,RMW,UMD
mean,0.073549,0.01977,0.04354,0.050095
vol,0.158841,0.109782,0.083573,0.160433
sharpe,0.463033,0.180081,0.520979,0.312248


In [14]:
# 3.2
# Note: '15-year' is interpreted as scaling the annualized full-sample estimates to a 15-year horizon (mean x 15, vol x sqrt(15)), not as estimating from a particular 15-year window of the data.
annual_mean = summary_annualized.loc['mean']
annual_vol = summary_annualized.loc['vol']

mean_15yr = 15 * annual_mean
vol_15yr = np.sqrt(15) * annual_vol
sharpe_15yr = mean_15yr / vol_15yr

summary_cum15_corrected = pd.DataFrame({
    'mean': mean_15yr,
    'vol': vol_15yr,
    'sharpe': sharpe_15yr
})
summary_cum15_corrected
# Reference GPT Prompt: What does it exactly mean by "15-year cumulative log excess returns"? How do I calculate it?

,mean,vol,sharpe
MKT,1.103228,0.615188,1.793318
HML,0.296544,0.425182,0.697452
RMW,0.653096,0.323676,2.017743
UMD,0.751422,0.621353,1.209332


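- Under this scaling the mean grows by a factor of 15 and the vol by $\sqrt{15}$, so every factor's Sharpe ratio is $\sqrt{15} \approx 3.87$ times its 1-year value.
- All four cumulative means are positive. `MKT` has the highest (1.10) and `HML` the lowest (0.30); `RMW` has the highest 15-year Sharpe (2.02), followed by `MKT` (1.79), `UMD` (1.21), and `HML` (0.70).
- The ranking of the factors is unchanged from `3.1`; only the horizon changes. Profitability looks best on a risk-adjusted basis, and value looks weakest.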

### 3.3.

What is the probability that momentum factor has a negative mean excess return over the next 
* single period?
* 15 years?

In [15]:
def prob(mu, sigma, h):
    return norm.cdf(-np.sqrt(h) * mu / sigma)

mu_monthly = annual_mean['UMD'] / 12
sigma_monthly = annual_vol['UMD'] / np.sqrt(12)
prob_single = norm.cdf(-mu_monthly / sigma_monthly)

mu_15yr = 15 * annual_mean['UMD']
sigma_15yr = np.sqrt(15) * annual_vol['UMD']
prob_15yr = norm.cdf(-mu_15yr / sigma_15yr)

print(f"Probability of negative return (single period): {prob_single}")
print(f"Probability of negative return (15-year period): {prob_15yr}")

Probability of negative return (single period): 0.46408867614246724
Probability of negative return (15-year period): 0.11326774683663715


### 3.4.

Recall from the case that momentum has been underperforming since 2009. 

Using data from 2009 to present, what is the probability that momentum *outperforms* the market factor over the next
* period?
* 15 years?

In [16]:
log_returns_2009 = log_returns.loc['2009-01-01':]

annual_mean_2009 = log_returns_2009.mean() * 12
annual_vol_2009 = log_returns_2009.std() * np.sqrt(12)

# Monthly mean of the spread (UMD - MKT)
mu_diff_monthly = (annual_mean_2009['UMD'] - annual_mean_2009['MKT']) / 12
# Assumption: this spread vol treats UMD and MKT as uncorrelated (it omits the -2*cov term).
sigma_diff_monthly = np.sqrt((annual_vol_2009['UMD']**2 + annual_vol_2009['MKT']**2) / 12)
prob_single = norm.cdf(mu_diff_monthly / sigma_diff_monthly)

# 15-year horizon: the mean scales by 15 and the vol by sqrt(15)
mu_diff_15yr = 15 * (annual_mean_2009['UMD'] - annual_mean_2009['MKT'])
sigma_diff_15yr = np.sqrt(15) * np.sqrt(annual_vol_2009['UMD']**2 + annual_vol_2009['MKT']**2)
prob_15yr = norm.cdf(mu_diff_15yr / sigma_diff_15yr)

print(f"Probability of UMD outperforming MKT (single period): {prob_single}")
print(f"Probability of UMD outperforming MKT (15-year period): {prob_15yr}")
# Reference: HW 6 4.4

Probability of UMD outperforming MKT (single period): 0.4239310918100311
Probability of UMD outperforming MKT (15-year period): 0.0050280337865116044


### 3.5.
Conceptually, why is there such a discrepancy between this probability for 1 period vs. 15 years?

What assumption about the log-returns are we making when we use this technique to estimate underperformance?

- Over a single period, the volatility is large relative to the expected return, so randomness overshadows the mean and a negative outcome is nearly as likely as a positive one.
- Over 15 years, the expected return accumulates linearly with the horizon, while the volatility grows only with the square root of the horizon.
- The ratio of mean to vol therefore grows with $\sqrt{h}$ => the cumulative expected gain outpaces the growth in uncertainty, and the probability of underperformance shrinks over longer horizons.


- We are assuming that log returns are independent and identically distributed (iid) normal.
- This allows us to scale the mean linearly with time and the standard deviation with the square root of time, and to use the normal distribution to compute probabilities at any horizon.

### 3.6.

Using your previous answers, explain what is meant by time diversification.

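- Time diversification => over longer horizons, volatility matters less relative to the expected return: the Sharpe ratio of the cumulative return grows with $\sqrt{h}$.
- Short-term returns are dominated by noise; over time the expected gains accumulate faster than the volatility. In `3.3` the probability of a negative `UMD` return falls from about 46% for one period to about 11% over 15 years.
- This does not make the cumulative return itself more certain: its volatility still grows with the horizon. What shrinks is the volatility of the average return.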

### 3.7.

Is the probability that `HML` and `UMD` both have negative cumulative returns over the next year higher or lower than the probability that `HML` and `MKT` both have negative cumulative returns over the next year?

Answer conceptually, but specifically. (No need to calculate the specific probabilities.)

- The probability that `HML` and `MKT` both have negative cumulative returns over the next year is higher than that for `HML` and `UMD`; equivalently, the `HML`/`UMD` probability is lower.
- The joint probability depends on the correlation between the two factors: the more negatively correlated they are, the less likely both are negative together.
- Value and momentum (`HML` and `UMD`) tend to be negatively correlated: when one does poorly, the other often does well.
- `HML` and `MKT` are less negatively correlated, so they are more likely to fall together.
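
The role of correlation can be checked with a small simulation. This is a sketch under stated assumptions: annual log returns are bivariate normal, the means and vols are hypothetical and held fixed across the two cases so that only the correlation differs, and `prob_both_negative` is an illustrative helper.

```python
import numpy as np


def prob_both_negative(mu, sigma, rho, n_sims=200_000, seed=0):
    """Monte Carlo estimate of P(both returns < 0) under a bivariate normal."""
    rng = np.random.default_rng(seed)
    cov = [[sigma[0] ** 2, rho * sigma[0] * sigma[1]],
           [rho * sigma[0] * sigma[1], sigma[1] ** 2]]
    draws = rng.multivariate_normal(mu, cov, size=n_sims)
    return np.mean((draws < 0).all(axis=1))


mu = [0.02, 0.05]     # hypothetical annual means
sigma = [0.11, 0.16]  # hypothetical annual vols

p_neg_corr = prob_both_negative(mu, sigma, rho=-0.5)
p_zero_corr = prob_both_negative(mu, sigma, rho=0.0)
print(f"rho=-0.5: {p_neg_corr:.3f}, rho=0.0: {p_zero_corr:.3f}")
```

With identical means and vols, the negatively correlated pair has the lower probability of a joint loss, which is the mechanism behind the answer above.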

***