In [None]:
# === Environment Setup ===
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import load_digits
from IPython.display import display, Markdown

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 14, 'figure.figsize': (12, 8), 'figure.dpi': 150})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg): display(Markdown(f"<div class='alert alert-info'>📝 {msg}</div>"))
def sec(title): print(f'\n{80*"="}\n| {title.upper()} |\n{80*"="}')

note("Environment initialized for Dimensionality Reduction.")

# Chapter 7.2: Dimensionality Reduction with PCA

---

### Table of Contents

1.  [**Introduction: The Curse of Dimensionality and Its Cures**](#intro)
2.  [**Principal Component Analysis (PCA)**](#pca-intro)
    - [The Formal Definition of PCA](#formal-pca)
    - [The Connection to Singular Value Decomposition (SVD)](#svd)
3.  [**Implementing PCA**](#implementation)
    - [PCA from Scratch: The Eigenvalue Decomposition](#pca-scratch)
    - [Code: A 2D Example from Scratch](#code-2d)
    - [Explained Variance and the Scree Plot](#scree)
    - [Comparison with Scikit-learn](#sklearn-pca)
4.  [**Applications of PCA**](#applications)
    - [Application 1: Visualizing High-Dimensional Data](#viz)
    - [Application 2: Building Statistical Factor Models in Finance](#factors)
5.  [**A Brief Look at Non-Linear Techniques**](#nonlinear)
    - [t-SNE for Visualization](#tsne)
    - [Autoencoders for Non-Linear Feature Extraction](#autoencoders)
6.  [**Exercises**](#exercises)
7.  [**Summary and Key Takeaways**](#summary)

<a id='intro'></a>
## 1. Introduction: The Curse of Dimensionality and Its Cures

Many modern economic datasets are **high-dimensional**, meaning they have a large number of features ($p$) relative to the number of observations ($n$). This presents several challenges, collectively known as the **curse of dimensionality**:

- **Computational Cost:** The resources required to store and process the data grow rapidly with $p$.
- **Statistical Sparsity:** In high dimensions, the data becomes very sparse. The volume of the space grows exponentially with the number of dimensions, so the available data points become increasingly isolated. This makes it difficult for algorithms to find meaningful patterns.
- **Multicollinearity:** Many features are often highly correlated with each other. This redundancy can make it difficult to disentangle their individual effects and leads to unstable estimates in classical models like OLS.
- **Overfitting:** With more features than observations ($p > n$), classical models are not identified, and flexible models are highly prone to overfitting the noise in the training data.
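
The sparsity point can be made concrete with a small simulation. The sketch below is illustrative only: the sample size, the uniform distribution, and the "relative contrast" measure are assumptions chosen for the demonstration, not part of the chapter's datasets. It draws random points in $p$ dimensions and compares the nearest and farthest neighbor of one reference point.

```python
import numpy as np

rng = np.random.default_rng(0)
n_points = 200  # illustrative sample size (an assumption)

contrast = {}
for p in (2, 10, 100, 1000):
    X = rng.uniform(size=(n_points, p))
    # Euclidean distances from the first point to every other point
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    # Relative contrast: how much farther the farthest point is than the nearest
    contrast[p] = (dists.max() - dists.min()) / dists.min()
    print(f"p={p:>4}: relative contrast = {contrast[p]:.2f}")
```

As $p$ grows, the contrast shrinks: every point ends up at roughly the same distance from every other point, which is one reason distance-based methods struggle in high dimensions.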

**Dimensionality reduction** is a suite of techniques designed to address this curse. The core idea is to transform the data from a high-dimensional space to a lower-dimensional one while retaining as much of the original, essential information as possible. This is useful for:

- **Visualization:** Reducing data to 2 or 3 dimensions allows us to create insightful plots.
- **Noise Reduction:** By focusing on the directions of high variance, we can filter out noise and improve the performance of downstream models.
- **Feature Engineering:** The lower-dimensional components can serve as powerful, de-correlated features for subsequent supervised learning tasks.

<a id='pca-intro'></a>
## 2. Principal Component Analysis (PCA)
PCA is the most widely used dimensionality reduction technique. Its goal is to find a new set of orthogonal axes, called **principal components**, that align with the directions of maximum variance in the data. The coordinates of the data along these axes (the component scores) are mutually uncorrelated.

The **first principal component** is the direction in space along which the data varies the most. The **second principal component** is the direction, orthogonal to the first, that captures the next highest amount of variance, and so on. 

By projecting the original data onto the first few principal components, we can create a lower-dimensional representation that captures the bulk of the information (i.e., the variance) in the original data. Mathematically, the principal components are the **eigenvectors** of the covariance matrix of the data, and the corresponding **eigenvalues** measure the amount of variance explained by each component.

<a id='formal-pca'></a>
### 2.1 The Formal Definition of PCA

Formally, PCA seeks to find a set of $k$ linear combinations of the original $p$ features, $Z_m = \sum_{j=1}^p \phi_{jm} X_j$, that capture the most variance. 

The **first principal component** ($Z_1$) is the linear combination of the features that has the maximum possible variance, subject to the constraint that the loading vector $\phi_1$ has unit length (i.e., $\sum_{j=1}^p \phi_{j1}^2 = 1$). Assuming each feature has been centered to have mean zero, so that the average squared score equals the sample variance, it solves:
$$ \max_{\phi_{11},...,\phi_{p1}} \left\{ \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^p \phi_{j1} x_{ij} \right)^2 \right\} \quad \text{subject to} \quad \sum_{j=1}^p \phi_{j1}^2 = 1 $$
The **second principal component** ($Z_2$) is the linear combination with the second-highest variance, subject to being uncorrelated with (orthogonal to) the first principal component. This process continues until $p$ components are found. The solution to this optimization problem is found via the eigendecomposition of the covariance matrix.
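
A quick numerical check of this optimization problem may help. The sketch below uses synthetic three-feature data (the covariance values are arbitrary assumptions) and evaluates the objective $\frac{1}{n}\sum_i (\sum_j \phi_{j1} x_{ij})^2$ for the leading eigenvector and for a large batch of random unit-length directions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[3.0, 1.2, 0.5],
                                 [1.2, 2.0, 0.3],
                                 [0.5, 0.3, 1.0]], size=500)
Xc = X - X.mean(axis=0)                   # center the columns

S = Xc.T @ Xc / len(Xc)                   # covariance with the 1/n convention used above
eig_vals, eig_vecs = np.linalg.eigh(S)    # eigh returns eigenvalues in ascending order
phi_1 = eig_vecs[:, -1]                   # loading vector of the first component

def projected_variance(phi):
    """Objective from the text: (1/n) * sum_i (sum_j phi_j x_ij)^2."""
    return np.mean((Xc @ phi) ** 2)

# Evaluate the same objective for many random unit-length directions
random_dirs = rng.normal(size=(5000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
best_random = np.mean((Xc @ random_dirs.T) ** 2, axis=0).max()

print(f"variance along phi_1     : {projected_variance(phi_1):.4f}")
print(f"largest eigenvalue       : {eig_vals[-1]:.4f}")
print(f"best of 5,000 random dirs: {best_random:.4f}")
```

The variance along $\phi_1$ equals the largest eigenvalue, and no random direction exceeds it, which is what the constrained maximization predicts.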

<a id='svd'></a>
### 2.2 The Connection to Singular Value Decomposition (SVD)

While the eigendecomposition of the covariance matrix provides the classical definition of PCA, in modern numerical computation, PCA is almost always performed using the **Singular Value Decomposition (SVD)** of the data matrix $X$. SVD is a more general and numerically stable factorization.

The SVD of an $n \times p$ data matrix $X$ is given by:
$$ X = U \Sigma V^T $$
Where:
- $U$ is an $n \times n$ orthogonal matrix (the left singular vectors).
- $\Sigma$ is an $n \times p$ rectangular diagonal matrix of singular values.
- $V$ is a $p \times p$ orthogonal matrix whose columns are the **right singular vectors**.

The crucial connection is that, for a centered data matrix $X$, the right singular vectors in $V$ **are the principal component directions**: they are the eigenvectors of $X^TX$ and therefore of the sample covariance matrix $\frac{1}{n-1}X^TX$. The singular values $\sigma_j$ on the diagonal of $\Sigma$ are tied to the eigenvalues of the covariance matrix by $\lambda_j = \sigma_j^2 / (n-1)$, so they can be used to calculate the explained variance. Using SVD avoids explicitly forming $X^TX$, a step that squares the condition number of the problem and can lose numerical precision.
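
The equivalence of the two routes can be checked directly. The following is a minimal sketch on synthetic data (the mixing matrix is an arbitrary assumption used only to create correlated columns); it compares the eigendecomposition of the sample covariance matrix with the thin SVD of the centered data.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))  # correlated columns
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

# Route 1: eigendecomposition of the sample covariance matrix
cov = Xc.T @ Xc / (n - 1)
eig_vals, eig_vecs = np.linalg.eigh(cov)
eig_vals, eig_vecs = eig_vals[::-1], eig_vecs[:, ::-1]   # descending order

# Route 2: thin SVD of the centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vals = s**2 / (n - 1)                                # lambda_j = sigma_j^2 / (n - 1)

# Directions agree up to sign, so compare absolute inner products column by column
alignment = np.abs(np.sum(eig_vecs * Vt.T, axis=0))

print("eigenvalues     :", np.round(eig_vals, 4))
print("sigma^2 / (n-1) :", np.round(svd_vals, 4))
print("|<v_eig, v_svd>|:", np.round(alignment, 6))
```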

<a id='implementation'></a>
## 3. Implementing PCA

<a id='pca-scratch'></a>
### 3.1 PCA from Scratch: The Eigenvalue Decomposition
To understand what PCA is doing, we will implement it from scratch using NumPy. The process is as follows:
1.  **Standardize the data:** PCA is sensitive to the scale of the features, so we must first standardize the data to have a mean of 0 and a standard deviation of 1.
2.  **Compute the covariance matrix:** This matrix captures the pairwise covariances between all features.
3.  **Perform eigenvalue decomposition:** We compute the eigenvectors and eigenvalues of the covariance matrix.
4.  **Sort components:** We sort the eigenvectors in descending order of their corresponding eigenvalues. The eigenvector with the largest eigenvalue is the first principal component.
5.  **Project the data:** We create a projection matrix from the top $k$ eigenvectors and project the original data onto this new, lower-dimensional space.
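
Step 1 deserves a brief illustration before the main example. The sketch below uses two hypothetical features, `income_usd` and `rate_pct` (names and magnitudes are assumptions invented for this demonstration), that are related but measured on very different scales. Without standardization, the first component simply follows the feature with the larger units.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 500
# Two hypothetical, positively related features on very different scales
income_usd = rng.normal(50_000, 15_000, size=n)
rate_pct = 3.0 + 0.00004 * (income_usd - 50_000) + rng.normal(0, 0.5, size=n)
X = np.column_stack([income_usd, rate_pct])

# First principal direction with and without standardization
pc1_raw = PCA(n_components=1).fit(X).components_[0]
pc1_std = PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_[0]

print("PC1 loadings, raw data         :", np.round(np.abs(pc1_raw), 4))
print("PC1 loadings, standardized data:", np.round(np.abs(pc1_std), 4))
```

On the raw data nearly all of the weight falls on `income_usd`; after standardization both features receive equal weight, reflecting their correlation rather than their units.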

<a id='code-2d'></a>

In [None]:
sec("PCA from Scratch: A 2D Example")

# Generate some correlated 2D data
rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.8], [0.8, 1]], size=200)

# 1. Standardize the data
X_std = StandardScaler().fit_transform(X)

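# 2. Compute the covariance matrix
#    (np.cov divides by n-1 while StandardScaler divides by n,
#    so the diagonal is n/(n-1) rather than exactly 1)
cov_mat = np.cov(X_std.T)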
note("Covariance Matrix:")
print(cov_mat)

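# 3. Eigenvalue decomposition
#    (eigh is the routine for symmetric matrices; it returns real
#    eigenvalues in ascending order, which step 4 re-sorts)
eig_vals, eig_vecs = np.linalg.eigh(cov_mat)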
note("Eigenvalues (Variance explained):")
print(eig_vals)
note("Eigenvectors (Principal Components):")
print(eig_vecs)

# 4. Sort eigenvectors by eigenvalues
sort_indices = np.argsort(eig_vals)[::-1]
eig_vals_sorted = eig_vals[sort_indices]
eig_vecs_sorted = eig_vecs[:, sort_indices]

# 5. Project data onto the first principal component
W = eig_vecs_sorted[:, :1]
X_projected = X_std @ W


In [None]:
sec("Visualizing the Principal Components")

plt.figure(figsize=(10, 10))
plt.scatter(X_std[:, 0], X_std[:, 1], alpha=0.7, label='Standardized Data')

# Plot the eigenvectors (principal components)
for i in range(len(eig_vals)):
    # Scale each unit eigenvector to two standard deviations (2 * sqrt(eigenvalue)) for visibility
    vec = eig_vecs_sorted[:, i] * np.sqrt(eig_vals_sorted[i]) * 2
    plt.quiver(0, 0, vec[0], vec[1], angles='xy', scale_units='xy', scale=1, color='r', width=0.01, label=f'PC{i+1}')

plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Principal Components of the Data')
plt.axis('equal')
plt.legend()
plt.show()

note("PC1 is the direction of maximum variance. PC2 is orthogonal to PC1 and captures the remaining variance.")

<a id='scree'></a>
### 3.2 Explained Variance and the Scree Plot
How many principal components should we keep? A standard tool to help answer this is the **scree plot**, which plots the variance explained by each principal component (i.e., the eigenvalues). We look for an "elbow" in the plot, which suggests that subsequent components are capturing diminishing amounts of information (noise).

In [None]:
sec("Scree Plot")

explained_variance_ratio = eig_vals_sorted / np.sum(eig_vals_sorted)
cumulative_explained_variance = np.cumsum(explained_variance_ratio)

plt.figure(figsize=(10, 7))
plt.bar(range(1, len(eig_vals_sorted) + 1), explained_variance_ratio, alpha=0.7, align='center', label='Individual explained variance')
plt.step(range(1, len(eig_vals_sorted) + 1), cumulative_explained_variance, where='mid', label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.title('Scree Plot')
plt.legend(loc='best')
plt.show()

note(f"The first principal component explains {explained_variance_ratio[0]:.2%} of the total variance.")

<a id='sklearn-pca'></a>
### 3.3 Comparison with Scikit-learn
Now we validate our from-scratch implementation against the highly optimized `PCA` class from `scikit-learn`. The results should be identical (up to an arbitrary sign flip in the eigenvectors).
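
Because the sign of each eigenvector is arbitrary, a direct element-wise comparison can fail even when both implementations agree. One way to handle this, sketched below with a hypothetical helper `align_signs` (not part of NumPy or scikit-learn), is to flip each from-scratch component so that it points the same way as its scikit-learn counterpart before comparing.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.multivariate_normal(mean=[0, 0], cov=[[1, 0.8], [0.8, 1]], size=200)
Xc = X - X.mean(axis=0)

# From-scratch loadings (as columns), sorted by descending eigenvalue
vals, vecs = np.linalg.eigh(np.cov(Xc.T))
vecs = vecs[:, ::-1]

sk_vecs = PCA(n_components=2).fit(Xc).components_.T   # columns = components

def align_signs(A, B):
    """Flip columns of A so each has a non-negative inner product with the same column of B."""
    signs = np.sign(np.sum(A * B, axis=0))
    signs[signs == 0] = 1.0
    return A * signs

vecs_aligned = align_signs(vecs, sk_vecs)
print("match after sign alignment:", np.allclose(vecs_aligned, sk_vecs))
```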

In [None]:
sec("Validating with Scikit-learn")

pca = PCA(n_components=2)
X_projected_sklearn = pca.fit_transform(X_std)

note("From-Scratch Eigenvectors:")
print(eig_vecs_sorted)
note("Sklearn Principal Components:")
print(pca.components_.T)

note("From-Scratch Explained Variance Ratio:")
print(explained_variance_ratio)
note("Sklearn Explained Variance Ratio:")
print(pca.explained_variance_ratio_)

note("The results agree up to a possible sign flip in each eigenvector, confirming our understanding of the underlying mechanics.")

<a id='applications'></a>
## 4. Applications of PCA

<a id='viz'></a>
### Application 1: Visualizing High-Dimensional Data
A powerful application of PCA is to visualize high-dimensional data. The handwritten digits dataset consists of 8x8 pixel images, meaning each image is a point in a 64-dimensional space. We can use PCA to project this data down to 2 dimensions to see if the different digit classes form distinct clusters.

In [None]:
sec("Visualizing Digits with PCA")

digits = load_digits()
X_digits = digits.data
y_digits = digits.target

# Project from 64 dimensions to 2
pca_digits = PCA(n_components=2)
X_proj_digits = pca_digits.fit_transform(X_digits)

plt.figure(figsize=(12, 9))
plt.scatter(X_proj_digits[:, 0], X_proj_digits[:, 1], c=y_digits, edgecolor='none', alpha=0.7,
            cmap=plt.get_cmap('jet', 10))
plt.colorbar(label='digit label', ticks=range(10))
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Handwritten Digits')
plt.show()

note("Even in just two dimensions, several digit classes form recognizable clusters, although neighboring classes overlap noticeably. This shows how PCA can uncover some of the underlying structure in high-dimensional data.")

<a id='factors'></a>
### Application 2: Building Statistical Factor Models in Finance

A major application of PCA in quantitative finance is the construction of **statistical factor models** for asset returns. The idea is that the returns of a large number of stocks are driven by a smaller number of unobserved, underlying risk factors (e.g., a 'market' factor, an 'industry' factor, a 'momentum' factor).

PCA can be used to extract these latent factors directly from the covariance matrix of asset returns. The principal components of the returns data represent portfolios of stocks whose returns are uncorrelated and capture the main sources of systematic risk in the market. The first few components often correspond to well-known economic factors.
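
Before turning to real prices, it can help to see the idea on data where the true factor is known. The sketch below simulates returns from a hypothetical one-factor model, $r_{it} = \beta_i f_t + \varepsilon_{it}$; the factor volatility, the range of the betas, and the noise level are all assumptions chosen for illustration. It then checks whether the first principal component recovers the simulated factor.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n_days, n_assets = 2000, 10

# Hypothetical one-factor model: r_it = beta_i * f_t + eps_it
market = rng.normal(0, 0.01, size=n_days)              # latent "market" factor
betas = rng.uniform(0.6, 1.4, size=n_assets)           # assumed factor exposures
noise = rng.normal(0, 0.01, size=(n_days, n_assets))   # idiosyncratic returns
returns = market[:, None] * betas + noise

returns_std = StandardScaler().fit_transform(returns)
pca = PCA().fit(returns_std)
pc1_scores = pca.transform(returns_std)[:, 0]

loadings = pca.components_[0]
same_sign = bool(np.all(loadings > 0) or np.all(loadings < 0))
corr_with_market = np.corrcoef(pc1_scores, market)[0, 1]

print("PC1 loadings share one sign:", same_sign)
print(f"|corr(PC1 scores, true factor)| = {abs(corr_with_market):.3f}")
print(f"PC1 share of variance = {pca.explained_variance_ratio_[0]:.2%}")
```

In this controlled setting the first component loads on every asset with the same sign and tracks the simulated factor closely. With real data the true factors are unobserved, so the interpretation of each component remains a judgment call.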

Let's apply this to a portfolio of US stocks.

In [None]:
sec("Case Study: PCA for Financial Factor Modeling")

# We need a library to download stock data
try:
    import yfinance as yf
    note("Using yfinance to download stock data.")
    
    # 1. Download data for a portfolio of stocks
    tickers = ['AAPL', 'MSFT', 'JPM', 'XOM', 'JNJ', 'AMZN', 'GOOGL', 'PG', 'CVX', 'WMT']
    stock_data = yf.download(tickers, start='2010-01-01', end='2022-12-31', progress=False, auto_adjust=True)['Close']
    
    # 2. Calculate daily returns
    returns = stock_data.pct_change().dropna()
    
    # 3. Apply PCA
    # Standardize the returns before applying PCA
    returns_std = StandardScaler().fit_transform(returns)
    pca_finance = PCA()
    pca_finance.fit(returns_std)
    
    # 4. Analyze the results with a Scree Plot
    explained_variance = pca_finance.explained_variance_ratio_
    cumulative_variance = np.cumsum(explained_variance)
    
    plt.figure(figsize=(12, 8))
    plt.bar(range(1, len(explained_variance) + 1), explained_variance, alpha=0.7, align='center', label='Individual Factor Variance')
    plt.step(range(1, len(explained_variance) + 1), cumulative_variance, where='mid', label='Cumulative Factor Variance')
    plt.title('Variance Explained by Statistical Factors (PCA)')
    plt.xlabel('Principal Component (Factor)')
    plt.ylabel('Explained Variance Ratio')
    plt.legend()
    plt.show()
    
    note(f"The first principal component (the 'market factor') explains {explained_variance[0]:.2%} of the total variance in the returns of these 10 stocks. The first 3 components explain {cumulative_variance[2]:.2%}.")
    
    # 5. Interpret the first factor's loadings
    # The downloaded columns may not follow the order of `tickers`,
    # so label the loadings with the DataFrame's own column names
    pc1_loadings = pd.Series(pca_finance.components_[0], index=returns.columns)
    note("Loadings on the First Principal Component:")
    display(pc1_loadings.sort_values())
    note("If the loadings on the first PC all share the same sign and are of similar magnitude, the first factor behaves like a weighted average of all stocks in the portfolio: it is the 'market' factor. Because the overall sign of a component is arbitrary, an extreme PC1 score in one direction corresponds to a day when the whole portfolio rose, and an extreme score in the other direction to a day when it fell.")

except ImportError:
    note("Skipping financial example: 'yfinance' is not installed. You can install it with 'pip install yfinance'.")
except Exception as e:
    note(f"Could not download or process stock data. Skipping example. Error: {e}")

<a id='nonlinear'></a>
## 5. A Brief Look at Non-Linear Techniques

PCA's main limitation is that it is a **linear** projection method. If the data lies on a complex, curved manifold, PCA may fail to find a meaningful low-dimensional representation. For these cases, non-linear dimensionality reduction techniques are required.

<a id='tsne'></a>
### t-SNE for Visualization
**t-Distributed Stochastic Neighbor Embedding (t-SNE)** is a popular technique specifically for visualizing high-dimensional data. Unlike PCA, which tries to preserve large pairwise distances (global structure), t-SNE focuses on preserving small pairwise distances, meaning it tries to keep points that are close in the high-dimensional space close in the low-dimensional map. It is excellent at revealing local cluster structures, but it is poorly suited to tasks other than visualization: distances between clusters and the apparent sizes of clusters in the map are not reliable, and the standard algorithm does not learn a mapping that can be applied to new observations.

<a id='autoencoders'></a>
### Autoencoders for Non-Linear Feature Extraction
An **autoencoder** is a type of neural network used for unsupervised learning. It consists of two parts:
1.  An **encoder** that compresses the input data into a lower-dimensional **bottleneck layer**.
2.  A **decoder** that tries to reconstruct the original input from the compressed representation in the bottleneck.

The network is trained to minimize the reconstruction error. Once trained, the output of the bottleneck layer provides a non-linear, compressed representation of the original data, which can be used as features for other tasks. A linear autoencoder is closely related to PCA.
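
The link between a linear autoencoder and PCA can be sketched without training a network. By the Eckart-Young theorem, the best rank-$k$ linear reconstruction in squared error is obtained by projecting onto the top $k$ principal directions, so PCA can be written as an encoder and a decoder that share one weight matrix. The code below is only an illustration of that view (the bottleneck width `k = 10` and the function names `encode` and `decode` are assumptions); it is not a trained autoencoder.

```python
import numpy as np
from sklearn.datasets import load_digits

X = load_digits().data
mean = X.mean(axis=0)
Xc = X - mean

# Right singular vectors of the centered data are the principal directions
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 10                      # bottleneck width (illustrative choice)
W = Vt[:k].T                # 64 x k weight matrix shared by encoder and decoder

def encode(x):
    return (x - mean) @ W           # 64 -> k

def decode(z):
    return z @ W.T + mean           # k -> 64

X_hat = decode(encode(X))
recon_sse = np.sum((X - X_hat) ** 2)
discarded = np.sum(s[k:] ** 2)      # energy in the dropped singular values

# A random orthonormal k-dimensional bottleneck, for comparison
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(64, k)))
random_sse = np.sum((Xc - Xc @ Q @ Q.T) ** 2)

print(f"PCA bottleneck SSE   : {recon_sse:,.0f}")
print(f"sum of dropped s_j^2 : {discarded:,.0f}")
print(f"random bottleneck SSE: {random_sse:,.0f}")
```

The reconstruction error of the PCA bottleneck equals the sum of the squared singular values that were dropped, and it is lower than that of the random linear bottleneck. A non-linear autoencoder replaces `encode` and `decode` with learned non-linear maps.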

In [None]:
sec("Visualizing Digits with t-SNE")

# t-SNE can be slow on large datasets, but is fine for the digits data
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne_digits = tsne.fit_transform(X_digits)

plt.figure(figsize=(12, 9))
plt.scatter(X_tsne_digits[:, 0], X_tsne_digits[:, 1], c=y_digits, edgecolor='none', alpha=0.8,
            cmap=plt.get_cmap('jet', 10))
plt.colorbar(label='digit label', ticks=range(10))
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.title('t-SNE of Handwritten Digits')
plt.show()

note("Compare this to the PCA plot. t-SNE produces much more distinct and well-separated clusters for each digit. This is because it is a non-linear method designed to preserve local neighborhood structures, which often makes it preferable to PCA when the sole goal is to visualize cluster structure.")

<a id='exercises'></a>
## 6. Exercises

1.  **Interpreting Components:** For the handwritten digits example, what might the first principal component represent? (Hint: you can reshape `pca_digits.components_[0]` back into an 8x8 image and visualize it with `plt.imshow`).
2.  **Choosing the Number of Components:** How many principal components would you need to keep to explain 95% of the variance in the digits dataset? Use a scree plot to determine this.
3.  **PCA for Pre-processing:** PCA is often used as a pre-processing step for other ML models. Explain why using the principal components as features in a regression model, instead of the original highly correlated features, can lead to a more stable and interpretable model.
4.  **Limitations of PCA:** PCA is a linear technique. Can you think of a data structure (e.g., a shape in 2D) where PCA would fail to find a good low-dimensional representation? (Hint: Think of non-linear manifolds).

<a id='summary'></a>
## 7. Summary and Key Takeaways

This chapter introduced dimensionality reduction, a critical set of techniques for handling high-dimensional data.

**Key Concepts**:
- **Curse of Dimensionality**: High-dimensional spaces are vast, sparse, and prone to issues of multicollinearity and overfitting, making pattern recognition and prediction difficult.
- **Principal Component Analysis (PCA)**: PCA is a linear technique that finds a new, lower-dimensional set of orthogonal axes (principal components) that capture the maximum amount of variance in the original data. It is a powerful tool for visualization, noise reduction, and creating de-correlated features.
- **Mechanics of PCA**: The principal components are the eigenvectors of the data's covariance matrix, and the corresponding eigenvalues measure the variance explained by each component. A more numerically stable way to find them is via the Singular Value Decomposition (SVD) of the data matrix.
- **Scree Plot**: This plot of the explained variance per component is the primary tool for deciding how many principal components to retain.
- **Economic Applications**: A key application in economics and finance is the construction of statistical factor models, where the principal components of asset returns are interpreted as underlying risk factors.
- **Beyond Linearity**: While PCA is powerful, it is a linear method. For data with complex non-linear structures, techniques like t-SNE (for visualization) or Autoencoders (for feature extraction) may be more appropriate.

### Solutions to Exercises

---

**1. Interpreting Components:**
When you visualize the first principal component, you see a blurry, composite image that looks like a prototype for a '0' or a '6'. This component captures the pixels that vary the most across all digits—often the contrast between the center of the image and the outer edges. It represents the most fundamental dimension of variation among the different digit shapes.

---

**2. Choosing the Number of Components:**
You would first fit PCA to the full dataset (`PCA().fit(X_digits)`) and then plot the cumulative sum of `explained_variance_ratio_`. On the unscaled pixel data you should find that roughly 29 principal components are needed to explain 95% of the variance. This means you can reduce the dimensionality from 64 to about 29 (a reduction of more than half) while still retaining almost all of the variance.
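
One way to read the answer off numerically rather than from the plot is sketched below. The variable names `pca_full` and `k_95` are illustrative, and the data is the unscaled digits matrix used in the visualization example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X_digits = load_digits().data
pca_full = PCA().fit(X_digits)          # keep all 64 components

cumulative = np.cumsum(pca_full.explained_variance_ratio_)
# argmax returns the first index where the condition holds; +1 converts it to a count
k_95 = int(np.argmax(cumulative >= 0.95)) + 1

print(f"components needed for 95% of the variance: {k_95}")
```

Scikit-learn also accepts a fraction directly, as in `PCA(n_components=0.95)`, which selects the number of components for you.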

---

**3. PCA for Pre-processing:**
Using the principal components as regressors has two main benefits:
a. **No Multicollinearity:** By construction, the principal component scores are mutually uncorrelated. Regressing on them removes the multicollinearity among the regressors, and once the low-variance components are dropped the coefficient estimates become stable.
b. **Dimensionality Reduction:** By using only the first $k$ components that explain most of the variance, we can run a regression with far fewer features than the original model. This reduces the risk of overfitting and can lead to better out-of-sample performance, especially if the omitted components were mostly capturing noise.
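
Both points can be seen on a small synthetic example. The sketch below builds three hypothetical regressors that are near-copies of one another (the noise level is an assumption) and compares the correlation structure and the conditioning of the design matrix before and after PCA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n = 400
base = rng.normal(size=n)
# Three hypothetical regressors that are near-copies of one another
X = np.column_stack([base + rng.normal(0, 0.05, size=n) for _ in range(3)])
X_std = StandardScaler().fit_transform(X)

scores = PCA().fit_transform(X_std)       # principal component scores

corr_X = np.corrcoef(X_std, rowvar=False)
corr_Z = np.corrcoef(scores, rowvar=False)
max_corr_X = np.abs(corr_X - np.eye(3)).max()
max_corr_Z = np.abs(corr_Z - np.eye(3)).max()

print(f"max |off-diagonal corr|, original features: {max_corr_X:.3f}")
print(f"max |off-diagonal corr|, PC scores        : {max_corr_Z:.3f}")
print(f"condition number, original features: {np.linalg.cond(X_std):.1f}")
print(f"condition number, first PC only    : {np.linalg.cond(scores[:, :1]):.1f}")
```

The scores are uncorrelated, but the conditioning only improves once the low-variance components are discarded, which is why benefit (b) matters as much as benefit (a).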

---

**4. Limitations of PCA:**
PCA would fail on data that lies on a non-linear manifold, such as a 'Swiss roll' or two intertwined crescent moons. PCA would try to draw a single straight line (PC1) through the data, which would completely fail to capture the underlying structure. It would project points that are far apart on the manifold to be close together in the lower-dimensional space. This is where non-linear techniques like t-SNE or Isomap are necessary.
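
A simpler shape that shows the same failure is a pair of concentric circles. In the sketch below (the `make_circles` settings and the helper `best_threshold_accuracy` are assumptions chosen for illustration), a one-dimensional PCA projection is compared with a one-dimensional non-linear summary, the distance from the origin, by asking how well a single cut-off on each can separate the inner ring from the outer ring.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA

X, y = make_circles(n_samples=600, factor=0.3, noise=0.03, random_state=0)

# Linear 1-D summary: projection onto the first principal component
z_pca = PCA(n_components=1).fit_transform(X)[:, 0]
# Non-linear 1-D summary suited to this shape: distance from the origin
z_radius = np.linalg.norm(X, axis=1)

def best_threshold_accuracy(z, y):
    """Accuracy of the best single cut-off on a 1-D feature (either orientation)."""
    best = 0.0
    for t in np.unique(z):
        acc = np.mean((z > t) == y)
        best = max(best, acc, 1 - acc)
    return best

acc_pca = best_threshold_accuracy(z_pca, y)
acc_radius = best_threshold_accuracy(z_radius, y)
print(f"best 1-D threshold accuracy on PC1   : {acc_pca:.2f}")
print(f"best 1-D threshold accuracy on radius: {acc_radius:.2f}")
```

The radius separates the two rings almost perfectly, while no cut-off on the linear projection comes close, because the projection collapses points from both rings onto the same interval.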