# Data-driven Finance: Unobservable Factors & PCA

In classical finance, we often assume certain factors drive stock returns. Here, we relax this assumption and let **data itself discover the factors**.  

We are entering the realm of **unsupervised learning**, where the goal is to find structure in data without pre-labeled outcomes.  

The first step is to collect daily stock prices for a set of stocks (here: ten large-cap U.S. tickers) and prepare the data for analysis.



In [1]:
import yfinance as yf
import pandas as pd
import numpy as np


tickers = ['AAPL', 'MSFT', 'JPM', 'GOOGL', 'AMZN', 'NVDA', 'TSLA', 'XOM', 'UNH', 'V']

# Download all data
data_full = yf.download(tickers, start="2020-01-01", end="2025-01-01")

# Extract only the closing prices for each ticker
# (columns come back in alphabetical order, not in the order of `tickers`)
data = data_full['Close']

# Display the first few rows
print(data.head())


[*********************100%***********************]  10 of 10 completed

Ticker           AAPL       AMZN      GOOGL         JPM        MSFT      NVDA  \
Date                                                                            
2020-01-02  72.538528  94.900497  67.965225  119.573364  152.791122  5.971409   
2020-01-03  71.833282  93.748497  67.609688  117.995422  150.888580  5.875831   
2020-01-06  72.405663  95.143997  69.411758  117.901611  151.278625  5.900473   
2020-01-07  72.065147  95.343002  69.277687  115.897202  149.899307  5.971909   
2020-01-08  73.224373  94.598503  69.770775  116.801323  152.286972  5.983109   

Ticker           TSLA         UNH           V        XOM  
Date                                                      
2020-01-02  28.684000  267.026398  183.549072  54.131065  
2020-01-03  29.534000  264.324219  182.089294  53.695881  
2020-01-06  30.102667  266.159119  181.695526  54.108170  
2020-01-07  31.270666  264.552338  181.215363  53.665352  
2020-01-08  32.809334  270.130249  184.317383  52.856056  





# Compute Daily Log Returns and Standardize

Let $P_t$ be the price at time $t$. Daily **returns** are:

$$
R_t = \frac{P_t - P_{t-1}}{P_{t-1}}
$$

We then compute **log returns**, which add up across time and therefore handle compounding naturally:

$$
r_t = \log(1 + R_t)
$$

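Finally, we **standardize** each stock's log returns to zero mean and unit variance:

$$
X_{t,i} = \frac{r_{t,i} - \bar{r}_i}{\sigma_i}
$$

where $\bar{r}_i$ and $\sigma_i$ are the sample mean and standard deviation of stock $i$'s log returns. This gives us the **data matrix $X$** used for PCA.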
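The three formulas above can be traced on a tiny example. This is only a sketch: the `prices` array is made up for illustration and is not part of the downloaded data.

```python
import numpy as np

# Hypothetical closing prices: rows are days, columns are two made-up stocks.
prices = np.array([
    [100.0, 50.0],
    [102.0, 49.0],
    [101.0, 51.0],
    [105.0, 52.0],
    [103.0, 50.5],
])

# Simple returns R_t = (P_t - P_{t-1}) / P_{t-1}
simple_returns = prices[1:] / prices[:-1] - 1.0

# Log returns r_t = log(1 + R_t), which equals log(P_t / P_{t-1})
log_returns = np.log1p(simple_returns)

# Log returns add up over time: their sum is the log of the total price ratio.
total_log_return = log_returns.sum(axis=0)

# Standardize each column (ddof=1 matches the pandas default for .std()).
X = (log_returns - log_returns.mean(axis=0)) / log_returns.std(axis=0, ddof=1)

print(total_log_return)
print(X.mean(axis=0), X.std(axis=0, ddof=1))
```

Each column of `X` should come out with mean 0 and sample standard deviation 1, up to floating-point error.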



In [2]:
# Simple daily returns R_t; the first row is NaN and is dropped
returns = data.pct_change().dropna()

# Log returns r_t = log(1 + R_t)
log_returns = np.log(1 + returns)



In [3]:
# Standardize each column; pandas .std() uses the sample standard deviation (ddof=1)
X = (log_returns - log_returns.mean()) / log_returns.std()
print(X.head())

Ticker          AAPL      AMZN     GOOGL       JPM      MSFT      NVDA  \
Date                                                                     
2020-01-03 -0.539254 -0.568605 -0.295646 -0.676145 -0.693421 -0.552074   
2020-01-06  0.348729  0.622869  1.244270 -0.065164  0.092508  0.050683   
2020-01-07 -0.285697  0.062809 -0.134016 -0.865111 -0.518127  0.283506   
2020-01-08  0.750992 -0.375507  0.306521  0.353909  0.780118 -0.017873   
2020-01-09  1.004805  0.181927  0.470081  0.152053  0.603937  0.250539   

Ticker          TSLA       UNH         V       XOM  
Date                                                
2020-01-03  0.642066 -0.564558 -0.479498 -0.396833  
2020-01-06  0.402025  0.340027 -0.147790  0.328872  
2020-01-07  0.852087 -0.346690 -0.175224 -0.403556  
2020-01-08  1.088214  1.078311  0.942957 -0.725660  
2020-01-09 -0.575597 -0.327545  0.369204  0.327835  


# Covariance Matrix of Standardized Returns

Once we have the standardized returns $X$, we compute the **covariance matrix** $C$:

$$
C = \frac{X^\top X}{N - 1}
$$

Where:

- $X$ is the standardized returns matrix (size $N \times P$)
- $N$ is the number of observations (days) and $P$ is the number of stocks
- $C$ is a $P \times P$ matrix whose entries are the **pairwise correlations** between stocks

The $N - 1$ denominator is the sample-covariance convention that `np.cov` and the pandas `.std()` call used for standardization both apply by default.

The covariance matrix captures the linear relationships between stocks. An entry of 0 means no linear relationship, while positive/negative values indicate positive/negative co-movement.

**Note:** Since $X$ is standardized, the covariance matrix is equivalent to the **correlation matrix**, so its diagonal entries are all 1.

The off-diagonal entries are not zero, meaning the stock returns are **correlated**.
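The note on covariance versus correlation can be checked numerically. The sketch below uses synthetic data (the `returns` array and its one-factor structure are assumptions for illustration) and the $N - 1$ normalisation that `np.cov` applies by default.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "returns": 500 days, 4 assets sharing one common driver (made up).
N, P = 500, 4
common = rng.normal(size=(N, 1))
returns = 0.7 * common + rng.normal(size=(N, P))

# Standardize columns with the sample standard deviation (ddof=1).
X = (returns - returns.mean(axis=0)) / returns.std(axis=0, ddof=1)

# Covariance of the standardized data, written out explicitly.
C = X.T @ X / (N - 1)

# Correlation matrix of the raw returns, for comparison.
corr = np.corrcoef(returns, rowvar=False)

print(np.allclose(C, corr))          # covariance of X matches the correlation matrix
print(np.allclose(np.diag(C), 1.0))  # unit variances on the diagonal
```

Both checks should print `True` when the standardization and the covariance use the same degrees-of-freedom convention.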


In [4]:
# np.cov treats rows as variables by default, hence the transpose
C = np.cov(X.T)

# Option 1: Just print the covariance matrix
# print(C)

# Option 2: Convert to DataFrame for nicer display.
# Label with the actual column order of X (alphabetical), not the `tickers`
# list, whose order differs and would mislabel the rows and columns.
labels = list(X.columns)
C_df = pd.DataFrame(C, index=labels, columns=labels)
print(C_df)


           AAPL      AMZN     GOOGL       JPM      MSFT      NVDA      TSLA  \
AAPL   1.000000  0.594780  0.652853  0.414320  0.750913  0.614656  0.492477   
AMZN   0.594780  1.000000  0.649973  0.269223  0.680171  0.586771  0.436566   
GOOGL  0.652853  0.649973  1.000000  0.406429  0.747925  0.600683  0.406230   
JPM    0.414320  0.269223  0.406429  1.000000  0.424329  0.332254  0.282216   
MSFT   0.750913  0.680171  0.747925  0.424329  1.000000  0.688373  0.456088   
NVDA   0.614656  0.586771  0.600683  0.332254  0.688373  1.000000  0.476880   
TSLA   0.492477  0.436566  0.406230  0.282216  0.456088  0.476880  1.000000   
UNH    0.422115  0.213523  0.347109  0.457494  0.435818  0.292680  0.198554   
V      0.585935  0.409002  0.548355  0.618969  0.609752  0.480090  0.361438   
XOM    0.288141  0.135746  0.263653  0.572316  0.248586  0.187483  0.152203   

            UNH         V       XOM  
AAPL   0.422115  0.585935  0.288141  
AMZN   0.213523  0.409002  0.135746  
GOOGL  0.347109 

## Step 3: Linear Transformation with PCA

We transform our standardized returns $X$ into **uncorrelated components**:

$$
Z = X V
$$

Where:

- $V \in \mathbb{R}^{P \times P}$ is an **orthogonal matrix** whose columns are the eigenvectors of the covariance matrix $C$.
- $Z \in \mathbb{R}^{N \times P}$ is the **encoded data** (principal components).

**Key points:**

1. Using all $P$ components ($K = P$):

$$
\text{Cov}(Z) = V^\top C V = \Lambda
$$

where $\Lambda$ is the diagonal matrix of eigenvalues. This means the columns of $Z$ are **uncorrelated**, and the variance of the $k$-th column is the $k$-th eigenvalue.

2. Each column of $V$ holds the **weights of an eigen-portfolio**: a fixed combination of the stocks whose standardized return series is the corresponding column of $Z$.

3. In practice, `sklearn` PCA computes $V$ automatically as `pca.components_.T` (each column is determined only up to its sign).
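For readers who want the intermediate steps behind key point 1, here is a short derivation. It assumes the columns of $X$ have zero mean (true after standardization) and writes $n$ for whichever normalising constant is used in the definition of $C$, so that $C = X^\top X / n$:

$$
\text{Cov}(Z) = \frac{Z^\top Z}{n} = \frac{(XV)^\top (XV)}{n} = V^\top \left( \frac{X^\top X}{n} \right) V = V^\top C V = V^\top V \Lambda = \Lambda
$$

The last two equalities use the eigenvalue equation $C V = V \Lambda$ and the orthogonality of $V$, that is, $V^\top V = I$.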


In [5]:

# 4️⃣ Eigen decomposition of the covariance matrix C
# Note: np.linalg.eig does not return the eigenvalues in sorted order
eigvals, eigvecs = np.linalg.eig(C)

print("Eigenvalues (λ):")
print(eigvals)
print("\nEigenvectors (V):")
print(eigvecs)
print("-" * 50)



Eigenvalues (λ):
[5.15437097 1.46216594 0.70299713 0.62666795 0.19945142 0.30988342
 0.31663166 0.37287494 0.44281106 0.41214551]

Eigenvectors (V):
[[ 0.36889682  0.1190009  -0.07078767 -0.08281593 -0.31269849 -0.637389
   0.07186285  0.56759897 -0.08053497 -0.00455202]
 [ 0.31876322  0.34336019 -0.05397264  0.29289352 -0.13067042  0.01834067
   0.55039581 -0.40287558 -0.25072635 -0.38245917]
 [ 0.35852268  0.16355684 -0.18025339  0.26764537 -0.28071634  0.32032592
  -0.70387731  0.0377444  -0.02614883 -0.24824037]
 [ 0.28374773 -0.45399241  0.19914088  0.13744464  0.00143486 -0.45027632
  -0.15859486 -0.46851809  0.43473836 -0.13572105]
 [ 0.38577666  0.17027283 -0.21183069  0.04354857  0.86828673 -0.07304315
  -0.07272471  0.10173396  0.01816803  0.01374451]
 [ 0.33484582  0.24920423  0.0267386   0.01837176 -0.16832759  0.05806182
   0.00756963 -0.2783889   0.05066099  0.8443302 ]
 [ 0.26182106  0.21465699  0.74840219 -0.50595079  0.05849156  0.14051276
  -0.09607155 -0.00629218 -0.
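As the eigenvalue output shows, `np.linalg.eig` makes no promise about ordering. A possible alternative, sketched here on a small made-up symmetric matrix, is `np.linalg.eigh`, which is intended for symmetric matrices and returns real eigenvalues in ascending order. The sign convention at the end is one arbitrary choice among several, not something the notebook itself uses.

```python
import numpy as np

# A small symmetric correlation-like matrix (made-up numbers).
C = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])

# eigh is meant for symmetric matrices: real eigenvalues, ascending order.
eigvals, eigvecs = np.linalg.eigh(C)

# Reverse so the largest eigenvalue (and its eigenvector) comes first.
eigvals = eigvals[::-1]
eigvecs = eigvecs[:, ::-1]

# Eigenvectors are only defined up to sign; one possible convention is to
# flip each column so that its entries sum to a non-negative number.
signs = np.sign(eigvecs.sum(axis=0))
signs[signs == 0] = 1.0
eigvecs = eigvecs * signs

print(eigvals)
print(eigvecs)
```

Fixing a sign convention makes loadings comparable across runs and libraries, since a flipped eigenvector describes the same component.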

In [6]:
# 5️⃣ Construct diagonal matrix of eigenvalues
Λ = np.diag(eigvals)
print("Diagonal matrix Λ (of eigenvalues):")
print(Λ)
print("-" * 50)

Diagonal matrix Λ (of eigenvalues):
[[5.15437097 0.         0.         0.         0.         0.
  0.         0.         0.         0.        ]
 [0.         1.46216594 0.         0.         0.         0.
  0.         0.         0.         0.        ]
 [0.         0.         0.70299713 0.         0.         0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.62666795 0.         0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.19945142 0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.30988342
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.31663166 0.         0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.37287494 0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.         0.         0.44281106 0.   

In [7]:
# Sort eigenvalues (and the matching eigenvector columns) in descending order
idx = np.argsort(eigvals)[::-1]
eigvals = eigvals[idx]
eigvecs = eigvecs[:, idx]
# Label rows with the actual column order of X, not the `tickers` list
eigvecs_df = pd.DataFrame(eigvecs, index=list(X.columns), columns=[f'PC{i+1}' for i in range(len(eigvals))])
print("Eigenvectors:\n", eigvecs_df)
print(sum(eigvals))  # should equal number of features (10)

Eigenvectors:
             PC1       PC2       PC3       PC4       PC5       PC6       PC7  \
AAPL   0.368897  0.119001 -0.070788 -0.082816 -0.080535 -0.004552  0.567599   
AMZN   0.318763  0.343360 -0.053973  0.292894 -0.250726 -0.382459 -0.402876   
GOOGL  0.358523  0.163557 -0.180253  0.267645 -0.026149 -0.248240  0.037744   
JPM    0.283748 -0.453992  0.199141  0.137445  0.434738 -0.135721 -0.468518   
MSFT   0.385777  0.170273 -0.211831  0.043549  0.018168  0.013745  0.101734   
NVDA   0.334846  0.249204  0.026739  0.018372  0.050661  0.844330 -0.278389   
TSLA   0.261821  0.214657  0.748402 -0.505951 -0.080477 -0.174267 -0.006292   
UNH    0.251315 -0.357691 -0.476105 -0.646569 -0.301322 -0.048097 -0.239824   
V      0.343934 -0.244344 -0.036332 -0.008307  0.548044 -0.036950  0.345647   
XOM    0.205690 -0.557012  0.293664  0.374819 -0.583480  0.162882  0.172371   

            PC8       PC9      PC10  
AAPL   0.071863 -0.637389 -0.312698  
AMZN   0.550396  0.018341 -0.130670  
GO
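Because the eigenvalues of a correlation matrix sum to the number of features, each eigenvalue divided by that sum is the share of total variance carried by its component. The sketch below reuses the ten eigenvalues printed in the eigen-decomposition output earlier, sorted in descending order; the variable names are illustrative.

```python
import numpy as np

# Eigenvalues copied from the notebook output, sorted in descending order.
eigvals = np.array([
    5.15437097, 1.46216594, 0.70299713, 0.62666795, 0.44281106,
    0.41214551, 0.37287494, 0.31663166, 0.30988342, 0.19945142,
])

# Share of total variance per component, and the running total.
explained_ratio = eigvals / eigvals.sum()
cumulative = np.cumsum(explained_ratio)

for k, (ratio, cum) in enumerate(zip(explained_ratio, cumulative), start=1):
    print(f"PC{k}: {ratio:6.1%} of variance, {cum:6.1%} cumulative")
```

With these numbers the first component carries roughly half of the total variance and the first two together about two thirds.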

In [8]:
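# Project the standardized returns onto the eigenvectors: Z = X V.
# Wrapping in np.array keeps Z as a plain NumPy array rather than a DataFrame.
Z = np.array(X @ eigvecs)
print("First few rows of Z (principal components):")
print(Z[:5])

# The date index of X is reused later to label the rows of Z
print(X.index)
print(type(X.index))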



First few rows of Z (principal components):
[[-1.35071060e+00  4.21516797e-01  7.69829148e-01 -4.38868204e-01
  -5.35892499e-02 -2.22239569e-01  2.08986239e-01 -2.61756215e-01
   3.21679264e-01 -8.60408402e-02]
 [ 1.01498819e+00  3.34440723e-01 -7.28701076e-02  1.83819865e-01
  -6.48568513e-01 -5.23164427e-01 -5.86754981e-02 -5.83808639e-01
   2.67806702e-01 -4.34638376e-01]
 [-4.91149940e-01  9.15345342e-01  6.76667948e-01 -4.86832327e-01
  -1.85053782e-01  1.69137158e-01  2.88159358e-02  1.14359663e-01
   5.38628744e-01 -3.15800438e-01]
 [ 1.39359039e+00 -3.92241542e-05 -1.29726154e-01 -1.53546014e+00
   7.20474305e-01 -3.82836858e-01  4.43036981e-01 -2.40677546e-01
   1.75140075e-03  2.43724212e-01]
 [ 9.18611193e-01  7.59464722e-02 -4.48606621e-01  7.70505170e-01
   1.07013352e-01  1.64173020e-01  7.01527814e-01 -1.16101360e-02
  -4.70628076e-01 -3.16317530e-03]]
DatetimeIndex(['2020-01-03', '2020-01-06', '2020-01-07', '2020-01-08',
               '2020-01-09', '2020-01-10', '2020-

In [9]:
Z_df = pd.DataFrame(Z, index=X.index, columns=[f'PC{i+1}' for i in range(len(eigvals))])
print(Z_df.head())

                 PC1       PC2       PC3       PC4       PC5       PC6  \
Date                                                                     
2020-01-03 -1.350711  0.421517  0.769829 -0.438868 -0.053589 -0.222240   
2020-01-06  1.014988  0.334441 -0.072870  0.183820 -0.648569 -0.523164   
2020-01-07 -0.491150  0.915345  0.676668 -0.486832 -0.185054  0.169137   
2020-01-08  1.393590 -0.000039 -0.129726 -1.535460  0.720474 -0.382837   
2020-01-09  0.918611  0.075946 -0.448607  0.770505  0.107013  0.164173   

                 PC7       PC8       PC9      PC10  
Date                                                
2020-01-03  0.208986 -0.261756  0.321679 -0.086041  
2020-01-06 -0.058675 -0.583809  0.267807 -0.434638  
2020-01-07  0.028816  0.114360  0.538629 -0.315800  
2020-01-08  0.443037 -0.240678  0.001751  0.243724  
2020-01-09  0.701528 -0.011610 -0.470628 -0.003163  


In [10]:
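# Covariance of the principal components; it should be (numerically) diagonal,
# with the sorted eigenvalues on the diagonal
N = Z.shape[0]  # number of samples
C_manual = (Z.T @ Z) / (N - 1)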
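The cell above computes the covariance of the components but does not inspect it. A minimal sketch of such a check follows, run on synthetic standardized data rather than the notebook's `X` (the `raw` array is an assumption for illustration).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic standardized data standing in for X (made-up numbers).
N, P = 400, 5
raw = rng.normal(size=(N, 1)) + rng.normal(size=(N, P))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Eigen decomposition of the covariance matrix, sorted in descending order.
C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal components and their covariance matrix.
Z = X @ eigvecs
cov_Z = Z.T @ Z / (N - 1)

# Off-diagonal entries should vanish; the diagonal should hold the eigenvalues.
off_diagonal = cov_Z - np.diag(np.diag(cov_Z))
print(np.abs(off_diagonal).max())
print(np.allclose(np.diag(cov_Z), eigvals))
```

The same two checks could be applied to `C_manual` and the sorted `eigvals` from the notebook.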


In [11]:
# V is orthogonal, so X = Z V^T recovers the standardized returns exactly
X_reconstructed = Z @ eigvecs.T


In [12]:
print(X_reconstructed)

[[-0.53925381 -0.56860451 -0.29564606 ... -0.56455809 -0.47949793
  -0.3968327 ]
 [ 0.34872906  0.62286864  1.24427033 ...  0.34002702 -0.14779039
   0.32887208]
 [-0.28569723  0.06280925 -0.13401568 ... -0.34668959 -0.17522387
  -0.40355649]
 ...
 [-0.71782439 -0.67577408 -0.7535233  ... -0.14642848 -0.42542398
  -0.02854607]
 [-0.71889975 -0.51548686 -0.42604644 ... -0.25396158 -0.62674323
  -0.33741491]
 [-0.4044984  -0.41210498 -0.53731532 ... -0.22878209  0.10739838
   0.7591223 ]]
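The reconstruction above uses all $P$ components, so it reproduces $X$ exactly. As a sketch of what happens when only the first $K < P$ components are kept, the example below uses synthetic data; `K`, `raw`, and the other names are assumptions for illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic standardized data standing in for X (made-up numbers).
N, P, K = 300, 6, 2
raw = rng.normal(size=(N, 1)) + 0.5 * rng.normal(size=(N, P))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Eigenvectors of the covariance matrix, sorted by descending eigenvalue.
eigvals, V = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

Z = X @ V

# Full reconstruction: exact, because V is orthogonal.
X_full = Z @ V.T

# Rank-K reconstruction: keep only the first K components.
X_K = Z[:, :K] @ V[:, :K].T

# The variance lost equals the sum of the discarded eigenvalues.
residual_variance = np.sum((X - X_K) ** 2) / (N - 1)
print(np.allclose(X_full, X))
print(residual_variance, eigvals[K:].sum())
```

The two numbers on the last line should agree, which is why the size of the discarded eigenvalues is a natural guide when choosing $K$.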
