# Data-driven Finance: Unobservable Factors & PCA

In classical finance, we often assume certain factors drive stock returns. Here, we relax this assumption and let **data itself discover the factors**.  

We are entering the realm of **unsupervised learning**, where the goal is to find structure in data without pre-labeled outcomes.  

The first step is to collect daily closing prices for a set of stocks (here: ten large-cap U.S. tickers) and prepare the data for analysis.



In [1]:
import yfinance as yf
import pandas as pd
import numpy as np


tickers = ['AAPL', 'MSFT', 'JPM', 'GOOGL', 'AMZN', 'NVDA', 'TSLA', 'XOM', 'UNH', 'V']

# Download all data
data_full = yf.download(tickers, start="2020-01-01", end="2025-01-01")

# Extract only the closing prices for each ticker
data = data_full['Close']

# Display the first few rows
print(data.head())


[*********************100%***********************]  10 of 10 completed

Ticker           AAPL       AMZN      GOOGL         JPM        MSFT      NVDA  \
Date                                                                            
2020-01-02  72.538506  94.900497  67.965233  119.573387  152.791122  5.971411   
2020-01-03  71.833282  93.748497  67.609680  117.995430  150.888596  5.875832   
2020-01-06  72.405685  95.143997  69.411758  117.901596  151.278656  5.900472   
2020-01-07  72.065147  95.343002  69.277679  115.897194  149.899307  5.971910   
2020-01-08  73.224396  94.598503  69.770775  116.801308  152.286957  5.983109   

Ticker           TSLA         UNH           V        XOM  
Date                                                      
2020-01-02  28.684000  267.026367  183.549088  54.131081  
2020-01-03  29.534000  264.324158  182.089310  53.695881  
2020-01-06  30.102667  266.159119  181.695557  54.108170  
2020-01-07  31.270666  264.552399  181.215332  53.665344  
2020-01-08  32.809334  270.130249  184.317383  52.856064  





# Compute Daily Log Returns and Standardize

Let $P_t$ be the price at time $t$. Daily **returns** are:

$$
R_t = \frac{P_t - P_{t-1}}{P_{t-1}}
$$

We then compute **log returns** to handle compounding:

$$
r_t = \log(1 + R_t)
$$
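
One practical reason to prefer log returns is that they add across periods: the sum of daily log returns equals the log of the total growth factor. The snippet below is a minimal sketch of that property; the price values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical closing prices for one stock over four days (illustrative values only)
prices = np.array([100.0, 102.0, 99.0, 105.0])

simple_returns = prices[1:] / prices[:-1] - 1   # R_t
log_returns = np.log(1 + simple_returns)        # r_t = log(1 + R_t)

# Summing the daily log returns gives the log of the total growth factor
total_log_return = log_returns.sum()
total_growth = prices[-1] / prices[0]

print(total_log_return, np.log(total_growth))
```

Both printed numbers should agree up to floating-point error, which is what makes multi-period aggregation of log returns a simple sum.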

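Finally, we **standardize** each stock's log returns to zero mean and unit variance:

$$
X_{t,i} = \frac{r_{t,i} - \bar{r}_i}{\sigma_i}
$$

where $\bar{r}_i$ and $\sigma_i$ are the sample mean and standard deviation of stock $i$'s log returns. This gives us the **data matrix $X$** used for PCA.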



In [2]:
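# Simple daily returns R_t; the first row has no previous price, so it is NaN and gets dropped
returns = data.pct_change().dropna()

# Log returns r_t = log(1 + R_t)
log_returns = np.log(1 + returns)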


In [6]:
X = (log_returns - log_returns.mean()) / log_returns.std()
print(X.head())

Ticker          AAPL      AMZN     GOOGL       JPM      MSFT      NVDA  \
Date                                                                     
2020-01-03 -0.539238 -0.568605 -0.295657 -0.676152 -0.693416 -0.552081   
2020-01-06  0.348745  0.622869  1.244276 -0.065173  0.092513  0.050673   
2020-01-07 -0.285713  0.062809 -0.134021 -0.865108 -0.518137  0.283520   
2020-01-08  0.751008 -0.375507  0.306527  0.353906  0.780113 -0.017882   
2020-01-09  1.004795  0.181927  0.470081  0.152043  0.603938  0.250541   

Ticker          TSLA       UNH         V       XOM  
Date                                                
2020-01-03  0.642066 -0.564564 -0.479498 -0.396846  
2020-01-06  0.402025  0.340039 -0.147786  0.328872  
2020-01-07  0.852087 -0.346678 -0.175243 -0.403563  
2020-01-08  1.088214  1.078299  0.942967 -0.725646  
2020-01-09 -0.575597 -0.327545  0.369190  0.327832  


# Covariance Matrix of Standardized Returns

Once we have the standardized returns \(X\), we compute the **covariance matrix** \(C\):

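$$
C = \frac{X^\top X}{N - 1}
$$

Where:

- \(X\) is the standardized returns matrix (size \(N \times P\))
- \(N\) is the number of observations (days) and \(P\) is the number of stocks
- \(C\) is a \(P \times P\) matrix representing **pairwise correlations** between stocks

The \(N - 1\) denominator is the sample-covariance convention used by default in both `np.cov` and the pandas `.std()` call in the standardization step, which is why the diagonal of \(C\) comes out as exactly 1.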

The covariance matrix captures the linear relationships between stocks. A covariance of 0 means no linear correlation, while positive/negative values indicate positive/negative co-movement.

**Note:** Since \(X\) is standardized, the covariance matrix is equivalent to the **correlation matrix**.

This matrix is not diagonal, meaning the stock returns are **correlated**.
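
The claim that the covariance of standardized data is just the correlation matrix can be checked numerically. The sketch below uses synthetic data rather than the downloaded prices; the names `r`, `common`, and `C_manual` are illustrative, and it assumes the sample convention with an `N - 1` denominator (the default in `np.cov` and in pandas `.std()`).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "returns": N days x P stocks, with a shared component to induce correlation
N, P = 500, 4
common = rng.normal(size=(N, 1))
r = 0.7 * common + rng.normal(size=(N, P))

# Standardize each column (ddof=1 matches the pandas .std() default)
X = (r - r.mean(axis=0)) / r.std(axis=0, ddof=1)

# Covariance of the standardized data, computed by hand with an N - 1 denominator
C_manual = X.T @ X / (N - 1)

# Reference: correlation matrix of the raw (unstandardized) returns
corr = np.corrcoef(r, rowvar=False)

print(np.allclose(C_manual, corr))           # covariance of X matches correlation of r
print(np.allclose(np.diag(C_manual), 1.0))   # unit variances on the diagonal
```

Dividing by `N` instead would scale every entry by `(N - 1) / N`, so the diagonal would be slightly below 1 when the columns were standardized with `ddof=1`.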


In [4]:
C = np.cov(X.T)

# Option 1: Just print the covariance matrix
# print(C)

# Option 2: Convert to DataFrame for nicer display.
# Label rows/columns with the column order of X (yfinance returns the tickers
# sorted alphabetically), not with the original `tickers` list, whose order
# differs and would attach the wrong names to the values.
labels = list(X.columns)
C_df = pd.DataFrame(C, index=labels, columns=labels)
print(C_df)


           AAPL      AMZN     GOOGL       JPM      MSFT      NVDA      TSLA  \
AAPL   1.000000  0.594779  0.652854  0.414321  0.750913  0.614656  0.492477   
AMZN   0.594779  1.000000  0.649973  0.269223  0.680171  0.586771  0.436566   
GOOGL  0.652854  0.649973  1.000000  0.406429  0.747925  0.600683  0.406230   
JPM    0.414321  0.269223  0.406429  1.000000  0.424330  0.332255  0.282216   
MSFT   0.750913  0.680171  0.747925  0.424330  1.000000  0.688373  0.456088   
NVDA   0.614656  0.586771  0.600683  0.332255  0.688373  1.000000  0.476880   
TSLA   0.492477  0.436566  0.406230  0.282216  0.456088  0.476880  1.000000   
UNH    0.422114  0.213523  0.347109  0.457495  0.435819  0.292680  0.198554   
V      0.585935  0.409001  0.548355  0.618969  0.609752  0.480090  0.361437   
XOM    0.288141  0.135746  0.263653  0.572316  0.248586  0.187483  0.152204   

            UNH         V       XOM  
AAPL   0.422114  0.585935  0.288141  
AMZN   0.213523  0.409001  0.135746  
GOOGL  0.347109 

# Linear Transformation with PCA

We transform our standardized returns \(X\) into **uncorrelated components**:

$$
Z = X V
$$

Where:

- \(V \in \mathbb{R}^{P \times P}\) is an **orthogonal matrix** whose columns are the eigenvectors of the covariance matrix \(C\).
- \(Z \in \mathbb{R}^{N \times P}\) is the **encoded data** (principal components).

**Key points:**

1. Using all \(P\) components (\(K = P\)):

$$
\text{Cov}(Z) = V^\top C V = \Lambda
$$

where \(\Lambda\) is diagonal. This means the columns of \(Z\) are **uncorrelated**.

2. Each column of \(V\) corresponds to the **weights of an eigen-portfolio**, simplifying correlated stock analysis.

3. In practice, `sklearn` PCA computes \(V\) automatically as `pca.components_.T`.
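
The key points above can be sketched by hand with an eigendecomposition. This is an illustrative NumPy version on synthetic data, not a description of how `sklearn` computes PCA internally; the variable names are assumptions made for the example. Eigenvector signs are arbitrary, so individual columns of `V` may differ in sign from the output of another PCA routine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic standardized data standing in for X (N days x P stocks)
N, P = 1000, 5
raw = rng.normal(size=(N, 1)) + 0.8 * rng.normal(size=(N, P))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

C = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]            # reorder by descending variance
eigvals, V = eigvals[order], eigvecs[:, order]

Z = X @ V                                    # principal component scores

cov_Z = np.cov(Z, rowvar=False)
print(np.allclose(cov_Z, np.diag(eigvals)))  # Cov(Z) is diagonal, with the eigenvalues on it
print(np.allclose(V.T @ V, np.eye(P)))       # V is orthogonal
print(V[:, 0])                               # weights of the first eigen-portfolio
```

Because the synthetic columns share one common driver, the first column of `V` loads on all of them with the same sign, which is the pattern usually read as a "market" factor.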


In [5]:
from sklearn.decomposition import PCA

# Perform PCA on standardized returns
pca = PCA()
Z = pca.fit_transform(X)

# Optional: view the first few rows of Z
print("First few rows of Z (principal components):")
print(Z[:5])


First few rows of Z (principal components):
[[-1.35071541e+00 -4.21528095e-01  7.69828485e-01  4.38871326e-01
   5.35842049e-02 -2.22244211e-01  2.09000534e-01  2.61734397e-01
  -3.21674511e-01 -8.60373811e-02]
 [ 1.01499688e+00 -3.34440836e-01 -7.28811713e-02 -1.83810734e-01
   6.48581632e-01 -5.23168278e-01 -5.86550437e-02  5.83797116e-01
  -2.67826979e-01 -4.34639825e-01]
 [-4.91161357e-01 -9.15346494e-01  6.76668140e-01  4.86840562e-01
   1.85056489e-01  1.69149528e-01  2.87887240e-02 -1.14374872e-01
  -5.38625194e-01 -3.15805565e-01]
 [ 1.39359533e+00  4.46676653e-05 -1.29715198e-01  1.53544853e+00
  -7.20468181e-01 -3.82850729e-01  4.43060469e-01  2.40673737e-01
  -1.75711560e-03  2.43716426e-01]
 [ 9.18600083e-01 -7.59563201e-02 -4.48610793e-01 -7.70502713e-01
  -1.07006168e-01  1.64178698e-01  7.01519786e-01  1.16302526e-02
   4.70622840e-01 -3.15866290e-03]]
