# L05: PCA & t-SNE
## Dimensionality Reduction for Visualization and Preprocessing

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Digital-AI-Finance/Methods_and_Algorithms/blob/main/notebooks/L05_pca_tsne.ipynb)

**Methods and Algorithms -- MSc Data Science**

---

### Learning Objectives

By the end of this notebook, you will be able to:

1. Apply PCA for dimensionality reduction and feature extraction
2. Interpret variance explained and choose the number of components
3. Use t-SNE for visualization of high-dimensional data
4. Compare linear (PCA) vs non-linear (t-SNE) methods

### Finance Application: Portfolio Risk Decomposition

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.datasets import make_blobs
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Plotting style
plt.rcParams.update({'font.size': 12, 'figure.figsize': (10, 6)})

print("Libraries imported successfully!")

## Part 1: Theory Recap

### Principal Component Analysis (PCA)

PCA finds orthogonal directions of maximum variance:

**Covariance Matrix** (for mean-centered $X$ with $n$ rows):
$$\Sigma = \frac{1}{n-1} X^T X$$

**Eigendecomposition:**
$$\Sigma v = \lambda v$$

where $v$ is an eigenvector (principal direction) and $\lambda$ is its eigenvalue (the variance of the data along $v$)

**Variance Explained:**
$$\text{Explained Variance}_i = \frac{\lambda_i}{\sum_j \lambda_j}$$
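
As a quick numerical check of the formulas above, the following sketch builds a small toy dataset (an assumption made purely for illustration, as are all variable names) and confirms that each eigenvalue of $\Sigma$ equals the sample variance of the data projected onto the corresponding eigenvector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (illustrative assumption): 200 samples, 3 correlated features
A = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.2, 0.3, 0.5]])
X_toy = rng.standard_normal((200, 3)) @ A

# The covariance formula assumes mean-centered data
Xc = X_toy - X_toy.mean(axis=0)
Sigma = Xc.T @ Xc / (Xc.shape[0] - 1)

# eigh returns ascending eigenvalues for symmetric matrices; reverse to descending
eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Sample variance of the projections onto each eigenvector
proj_var = (Xc @ eigvecs).var(axis=0, ddof=1)
ratio = eigvals / eigvals.sum()

print("Eigenvalues:             ", np.round(eigvals, 4))
print("Projection variances:    ", np.round(proj_var, 4))
print("Explained variance ratio:", np.round(ratio, 4))
```

The first two printed rows should agree up to floating-point error, which is why eigenvalues can be read directly as "variance captured" by each direction.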

### t-SNE

t-SNE preserves local neighborhood structure through probability matching:

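**High-dimensional similarity (Gaussian):**
$$p_{j|i} = \frac{\exp(-||x_i - x_j||^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-||x_i - x_k||^2 / 2\sigma_i^2)}$$

Each bandwidth $\sigma_i$ is chosen so that the perplexity of $p_{\cdot|i}$ matches the user-specified value, and the conditionals are symmetrized into the joint probabilities used in the objective: $p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$.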

**Low-dimensional similarity (t-distribution):**
$$q_{ij} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k \neq l} (1 + ||y_k - y_l||^2)^{-1}}$$

**Objective: Minimize KL divergence**
$$KL(P||Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
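
The three formulas can be evaluated directly on a tiny dataset. The sketch below is a simplified illustration, not scikit-learn's implementation: it assumes a single shared bandwidth `sigma` instead of a per-point $\sigma_i$ tuned to a perplexity, uses the standard symmetrization $p_{ij} = (p_{j|i} + p_{i|j}) / 2n$, and all function names are hypothetical. It compares the KL divergence of a layout that keeps two groups apart with that of a random layout.

```python
import numpy as np

def conditional_p(X, sigma=1.0):
    """p_{j|i} with one shared bandwidth sigma (simplifying assumption)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)             # exclude k = i
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)

def joint_q(Y):
    """q_{ij} from the heavy-tailed kernel (1 + ||y_i - y_j||^2)^(-1)."""
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + sq_dists)
    np.fill_diagonal(W, 0.0)                      # exclude k = l
    return W / W.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), summed over pairs with p_ij > 0."""
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps)))

rng = np.random.default_rng(0)
# Two well-separated groups of 10 points in 5 dimensions (toy data)
X_hi = np.vstack([rng.normal(0, 0.3, (10, 5)), rng.normal(3, 0.3, (10, 5))])
n = X_hi.shape[0]

P_cond = conditional_p(X_hi, sigma=1.0)
P_joint = (P_cond + P_cond.T) / (2 * n)           # symmetrized joint probabilities

Y_good = X_hi[:, :2]                # layout that keeps the two groups apart
Y_bad = rng.normal(0, 1, (n, 2))    # random layout

kl_good = kl_divergence(P_joint, joint_q(Y_good))
kl_bad = kl_divergence(P_joint, joint_q(Y_bad))
print(f"KL (structure-preserving layout): {kl_good:.3f}")
print(f"KL (random layout):               {kl_bad:.3f}")
```

t-SNE runs gradient descent on the low-dimensional coordinates to push this KL value down, so a layout that keeps neighbors together scores lower than a random one.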

## Part 2: Generate Synthetic Portfolio Data

In [None]:
# Generate synthetic portfolio returns data
def generate_portfolio_data(n_assets=50, n_periods=252, n_factors=5):
    """
    Generate synthetic asset returns with latent factor structure.
    
    Parameters:
    - n_assets: Number of assets
    - n_periods: Number of time periods (trading days)
    - n_factors: Number of latent factors
    
    Returns:
    - returns: Asset returns matrix (n_periods x n_assets)
    - factor_loadings: True factor loadings matrix
    - sectors: Sector labels for each asset
    """
    np.random.seed(42)
    
    # Create latent factors (with the defaults, factor i is tied to sector i by the loadings below)
    factors = np.random.randn(n_periods, n_factors) * 0.02  # Daily factor returns
    
    # Create sector structure (5 equally sized sectors; assumes n_assets is divisible by 5)
    sectors = np.repeat(['Tech', 'Finance', 'Healthcare', 'Energy', 'Consumer'], n_assets // 5)
    
    # Factor loadings with sector structure
    factor_loadings = np.random.randn(n_assets, n_factors) * 0.5
    
    # Add sector-specific loading on one factor
    for i, sector in enumerate(['Tech', 'Finance', 'Healthcare', 'Energy', 'Consumer']):
        mask = sectors == sector
        factor_loadings[mask, i % n_factors] += 1.0  # Higher loading on sector factor
    
    # Generate returns: R = F * B' + epsilon
    returns = factors @ factor_loadings.T + np.random.randn(n_periods, n_assets) * 0.01
    
    return returns, factor_loadings, sectors

# Generate data
returns, true_loadings, sectors = generate_portfolio_data()

# Create asset names
asset_names = [f"{s[:3]}{i+1}" for i, s in enumerate(sectors)]

# Create DataFrame
df_returns = pd.DataFrame(returns, columns=asset_names)

print(f"Portfolio data shape: {df_returns.shape}")
print(f"Number of assets: {df_returns.shape[1]}")
print(f"Number of periods: {df_returns.shape[0]}")
print(f"\nSectors: {np.unique(sectors)}")
print(f"\nFirst few rows:")
df_returns.head()

## Part 3: Implementing PCA from Scratch

In [None]:
def pca_from_scratch(X, n_components=None):
    """
    Implement PCA from scratch using NumPy.
    
    Parameters:
    - X: Data matrix (n_samples x n_features)
    - n_components: Number of components to keep (default: all)
    
    Returns:
    - components: Principal components (eigenvectors)
    - explained_variance: Variance explained by each component
    - explained_variance_ratio: Proportion of variance explained
    - transformed: Data projected onto principal components
    """
    # Step 1: Center the data (subtract mean)
    X_centered = X - np.mean(X, axis=0)
    
    # Step 2: Compute covariance matrix
    n_samples = X.shape[0]
    cov_matrix = (X_centered.T @ X_centered) / (n_samples - 1)
    
    # Step 3: Compute eigenvalues and eigenvectors
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
    
    # Step 4: Sort by eigenvalue (descending)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    
    # Step 5: Select top n_components
    if n_components is None:
        n_components = X.shape[1]
    
    components = eigenvectors[:, :n_components]
    explained_variance = eigenvalues[:n_components]
    explained_variance_ratio = explained_variance / np.sum(eigenvalues)
    
    # Step 6: Transform data
    transformed = X_centered @ components
    
    return components, explained_variance, explained_variance_ratio, transformed

# Apply PCA from scratch
X = df_returns.values
components, var_explained, var_ratio, X_pca = pca_from_scratch(X, n_components=10)

print("PCA from Scratch Results:")
print(f"Components shape: {components.shape}")
print(f"\nVariance explained by each component:")
for i, (var, ratio) in enumerate(zip(var_explained[:5], var_ratio[:5])):
    print(f"  PC{i+1}: {var:.6f} ({ratio*100:.2f}%)")
print(f"\nCumulative variance (first 5 components): {np.sum(var_ratio[:5])*100:.2f}%")

## Part 4: PCA with scikit-learn

In [None]:
# Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Fit PCA
pca = PCA()
X_pca_sklearn = pca.fit_transform(X_scaled)

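# Compare with our implementation on the same standardized data
# (the Part 3 results used the raw returns, so they are not directly comparable)
_, _, var_ratio_scaled, _ = pca_from_scratch(X_scaled, n_components=10)

print("Comparison of variance explained ratios (standardized data):")
print(f"{'Component':<12} {'From Scratch':>15} {'scikit-learn':>15}")
print("-" * 45)
for i in range(5):
    print(f"PC{i+1:<10} {var_ratio_scaled[i]*100:>14.2f}% {pca.explained_variance_ratio_[i]*100:>14.2f}%")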

In [None]:
# Scree Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Individual variance
ax1 = axes[0]
ax1.bar(range(1, 16), pca.explained_variance_ratio_[:15] * 100, color='steelblue', alpha=0.7)
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Variance Explained (%)')
ax1.set_title('Scree Plot')
ax1.set_xticks(range(1, 16))

# Cumulative variance
ax2 = axes[1]
cumulative_var = np.cumsum(pca.explained_variance_ratio_[:15]) * 100
ax2.plot(range(1, 16), cumulative_var, 'o-', color='steelblue', linewidth=2, markersize=8)
ax2.axhline(y=80, color='red', linestyle='--', label='80% threshold')
ax2.axhline(y=95, color='orange', linestyle='--', label='95% threshold')
ax2.set_xlabel('Number of Components')
ax2.set_ylabel('Cumulative Variance Explained (%)')
ax2.set_title('Cumulative Variance Explained')
ax2.legend()
ax2.set_xticks(range(1, 16))
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Determine optimal number of components
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_80 = np.argmax(cumvar >= 0.80) + 1
n_95 = np.argmax(cumvar >= 0.95) + 1
print(f"\nComponents needed for 80% variance: {n_80}")
print(f"Components needed for 95% variance: {n_95}")
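
scikit-learn can apply the same cumulative-variance rule itself: when `n_components` is a float between 0 and 1 (with `svd_solver="full"`), `PCA` keeps the smallest number of components whose explained variance exceeds that share. The sketch below checks this against the manual `argmax` rule on toy data; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy data (illustrative assumption): 3 latent factors driving 20 features
latent = rng.standard_normal((300, 3))
mixing = rng.standard_normal((3, 20))
X_demo = latent @ mixing + 0.1 * rng.standard_normal((300, 20))
X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Manual rule: first k whose cumulative variance ratio reaches 95%
cumvar_demo = np.cumsum(PCA().fit(X_demo_scaled).explained_variance_ratio_)
k_manual = int(np.argmax(cumvar_demo >= 0.95)) + 1

# Built-in rule: a float n_components is read as a target variance share
pca_95 = PCA(n_components=0.95, svd_solver="full").fit(X_demo_scaled)
k_auto = pca_95.n_components_

print(f"Manual rule:   k = {k_manual}")
print(f"PCA(0.95):     k = {k_auto}")
print(f"Variance kept: {pca_95.explained_variance_ratio_.sum() * 100:.2f}%")
```

The built-in form is convenient inside preprocessing code because the number of components adapts to the data instead of being hard-coded.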

In [None]:
# Visualize assets in PC space (transpose data: assets as samples)
# For asset-level analysis, we treat each asset as a sample with 252 features (daily returns)
X_assets = df_returns.values.T  # Shape: (50 assets, 252 periods)
X_assets_scaled = StandardScaler().fit_transform(X_assets)

# Apply PCA to assets
pca_assets = PCA()
X_pca_assets = pca_assets.fit_transform(X_assets_scaled)

fig, ax = plt.subplots(figsize=(10, 8))

# Color by sector
sector_colors = {'Tech': '#0066CC', 'Finance': '#2CA02C', 'Healthcare': '#D62728', 
                 'Energy': '#FF7F0E', 'Consumer': '#9467BD'}
colors = [sector_colors[s] for s in sectors]

scatter = ax.scatter(X_pca_assets[:, 0], X_pca_assets[:, 1], 
                     c=colors, alpha=0.7, s=100)

# Add asset labels
for i, name in enumerate(asset_names):
    ax.annotate(name, (X_pca_assets[i, 0], X_pca_assets[i, 1]), 
                fontsize=7, alpha=0.7)

ax.set_xlabel(f'PC1 ({pca_assets.explained_variance_ratio_[0]*100:.1f}%)')
ax.set_ylabel(f'PC2 ({pca_assets.explained_variance_ratio_[1]*100:.1f}%)')
ax.set_title('Asset Relationships: First Two Principal Components')
ax.grid(True, alpha=0.3)

# Add legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=c, label=s) for s, c in sector_colors.items()]
ax.legend(handles=legend_elements, loc='upper right')

plt.tight_layout()
plt.show()

print("\nNote: Each point is an asset. Assets that cluster together have similar return patterns.")

## Part 5: PCA Loadings Interpretation

In [None]:
# Analyze factor loadings for PC1 and PC2 using time-series PCA
# Here we use the original orientation: time periods as samples, assets as features
# This gives us asset loadings on each principal component

# Get loadings from the time-series PCA (pca fitted on X_scaled)
loadings = pca.components_[:3].T  # First 3 components, shape: (50 assets, 3)
loadings_df = pd.DataFrame(loadings, 
                           columns=['PC1', 'PC2', 'PC3'],
                           index=asset_names)
loadings_df['Sector'] = sectors

# Average loadings by sector
sector_loadings = loadings_df.groupby('Sector').mean()
print("Average Factor Loadings by Sector (Time-Series PCA):")
print(sector_loadings.round(3))

print("\nInterpretation:")
print("- These loadings show how much each asset contributes to each principal component")
print("- PC1 loadings represent exposure to the first latent factor (often market risk)")

In [None]:
# Visualize loadings
fig, ax = plt.subplots(figsize=(12, 6))

x = np.arange(len(sector_loadings.index))
width = 0.25

bars1 = ax.bar(x - width, sector_loadings['PC1'], width, label='PC1', color='steelblue')
bars2 = ax.bar(x, sector_loadings['PC2'], width, label='PC2', color='coral')
bars3 = ax.bar(x + width, sector_loadings['PC3'], width, label='PC3', color='forestgreen')

ax.set_xlabel('Sector')
ax.set_ylabel('Average Loading')
ax.set_title('Average PCA Loadings by Sector')
ax.set_xticks(x)
ax.set_xticklabels(sector_loadings.index)
ax.legend()
ax.axhline(y=0, color='black', linestyle='-', linewidth=0.5)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- In real equity data, PC1 often represents the 'market factor' (all sectors move together)")
print("- In this synthetic data the factors are sector-driven, so the leading PCs capture combinations of sector factors")

## Part 6: Reconstruction Error

In [None]:
# Calculate reconstruction error for different numbers of components
n_components_range = range(1, 21)
reconstruction_errors = []

for k in n_components_range:
    pca_k = PCA(n_components=k)
    X_reduced = pca_k.fit_transform(X_scaled)
    X_reconstructed = pca_k.inverse_transform(X_reduced)
    error = np.mean((X_scaled - X_reconstructed) ** 2)
    reconstruction_errors.append(error)

# Plot
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(n_components_range, reconstruction_errors, 'o-', linewidth=2, markersize=8, color='steelblue')
ax.set_xlabel('Number of Components')
ax.set_ylabel('Mean Squared Reconstruction Error')
ax.set_title('Reconstruction Error vs. Number of Components')
ax.grid(True, alpha=0.3)

# Mark the elbow (set by eye; the data were generated with 5 latent factors,
# so the error is expected to flatten after k=5)
elbow_k = 5
ax.axvline(x=elbow_k, color='red', linestyle='--', label=f'Elbow at k={elbow_k}')
ax.legend()

plt.tight_layout()
plt.show()

print(f"\nReconstruction error with 5 components: {reconstruction_errors[4]:.6f}")
print(f"Reconstruction error with 10 components: {reconstruction_errors[9]:.6f}")
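
The reconstruction error is tied directly to the discarded eigenvalues. For centered data with $n$ rows and $p$ columns, the mean squared error after keeping $k$ components is $\text{MSE}_k = \frac{n-1}{np}\sum_{j>k}\lambda_j$, and for data scaled with `StandardScaler` this reduces to one minus the cumulative explained variance ratio. The sketch below verifies both identities on toy data (the data and names are illustrative assumptions).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy correlated data (illustrative assumption), then standardized
X_demo = rng.standard_normal((100, 8)) @ rng.standard_normal((8, 8))
X_demo = StandardScaler().fit_transform(X_demo)
n, p = X_demo.shape

pca_full = PCA(svd_solver="full").fit(X_demo)
eigvals = pca_full.explained_variance_                 # lambda_1 >= ... >= lambda_p
cum_ratio = np.cumsum(pca_full.explained_variance_ratio_)

results = {}
for k in (2, 4, 6):
    pca_k = PCA(n_components=k, svd_solver="full").fit(X_demo)
    X_rec = pca_k.inverse_transform(pca_k.transform(X_demo))
    mse_direct = np.mean((X_demo - X_rec) ** 2)
    # Sum of the discarded eigenvalues, rescaled from (n-1) to n*p
    mse_from_eigs = (n - 1) / (n * p) * eigvals[k:].sum()
    results[k] = (mse_direct, mse_from_eigs, 1.0 - cum_ratio[k - 1])
    print(f"k={k}: direct={mse_direct:.6f}  from eigenvalues={mse_from_eigs:.6f}  "
          f"1 - cumulative ratio={1.0 - cum_ratio[k - 1]:.6f}")
```

This is why the reconstruction-error curve and the cumulative-variance curve carry the same information: one is the mirror image of the other.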

## Part 7: t-SNE for Visualization

In [None]:
# Generate clustered data for t-SNE demonstration
X_clusters, y_clusters = make_blobs(n_samples=300, n_features=50, 
                                     centers=5, cluster_std=2.0,
                                     random_state=42)

# Standardize
X_clusters_scaled = StandardScaler().fit_transform(X_clusters)

print(f"Cluster data shape: {X_clusters.shape}")
print(f"Number of clusters: {len(np.unique(y_clusters))}")

In [None]:
# Compare perplexity values
perplexities = [5, 30, 50, 100]

fig, axes = plt.subplots(2, 2, figsize=(12, 10))
axes = axes.ravel()

colors = plt.cm.Set1(np.linspace(0, 1, 5))

for idx, perp in enumerate(perplexities):
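    # Fit t-SNE with the default 1000 iterations. The iteration argument is omitted
    # because its name differs across scikit-learn versions (n_iter, later max_iter).
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42)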
    X_tsne = tsne.fit_transform(X_clusters_scaled)
    
    # Plot
    ax = axes[idx]
    for i in range(5):
        mask = y_clusters == i
        ax.scatter(X_tsne[mask, 0], X_tsne[mask, 1], 
                   c=[colors[i]], label=f'Cluster {i+1}', s=30, alpha=0.7)
    
    ax.set_title(f'Perplexity = {perp}')
    ax.set_xlabel('t-SNE 1')
    ax.set_ylabel('t-SNE 2')
    ax.grid(True, alpha=0.3)

axes[0].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.suptitle('t-SNE: Effect of Perplexity', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\nPerplexity Guidelines:")
print("- Low (5-10): Focus on very local structure")
print("- Medium (30-50): Balanced view (scikit-learn's default is 30)")
print("- High (100+): More global structure (if dataset is large)")
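
Perplexity can be read as an effective number of neighbors: it equals $2^{H(P_i)}$, where $H(P_i)$ is the entropy (in bits) of the conditional distribution $p_{\cdot|i}$. For each point, t-SNE searches for the bandwidth $\sigma_i$ that hits the requested perplexity. The sketch below illustrates that search with a plain bisection for a single point; the function names are hypothetical and this is not scikit-learn's internal routine.

```python
import numpy as np

def row_perplexity(sq_dists_i, sigma):
    """Perplexity 2**H of p_{.|i}, given squared distances from point i to all others."""
    logits = -sq_dists_i / (2.0 * sigma ** 2)
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    nonzero = p > 0
    entropy_bits = -np.sum(p[nonzero] * np.log2(p[nonzero]))
    return 2.0 ** entropy_bits

def find_sigma(sq_dists_i, target_perplexity, n_steps=60):
    """Bisection on sigma; perplexity increases monotonically with sigma."""
    lo, hi = 1e-6, 1e6
    for _ in range(n_steps):
        mid = np.sqrt(lo * hi)                  # geometric midpoint
        if row_perplexity(sq_dists_i, mid) < target_perplexity:
            lo = mid
        else:
            hi = mid
    return np.sqrt(lo * hi)

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 10))         # toy data (illustrative assumption)
sq_dists_0 = np.sum((X_demo[1:] - X_demo[0]) ** 2, axis=1)   # from point 0 to the rest

sigmas = {perp: find_sigma(sq_dists_0, perp) for perp in (5, 30, 100)}
for perp, sigma in sigmas.items():
    print(f"perplexity={perp:>3}: sigma={sigma:.3f}, "
          f"achieved={row_perplexity(sq_dists_0, sigma):.2f}")
```

A larger perplexity forces a wider Gaussian around each point, which is why high values emphasize more global structure.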

## Part 8: PCA vs t-SNE Comparison

In [None]:
# Apply PCA to cluster data
pca_cluster = PCA(n_components=2)
X_pca_cluster = pca_cluster.fit_transform(X_clusters_scaled)

# Apply t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne_cluster = tsne.fit_transform(X_clusters_scaled)

# Compare visualizations
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

colors = plt.cm.Set1(np.linspace(0, 1, 5))

# PCA
ax1 = axes[0]
for i in range(5):
    mask = y_clusters == i
    ax1.scatter(X_pca_cluster[mask, 0], X_pca_cluster[mask, 1],
                c=[colors[i]], label=f'Cluster {i+1}', s=50, alpha=0.7)
ax1.set_xlabel(f'PC1 ({pca_cluster.explained_variance_ratio_[0]*100:.1f}%)')
ax1.set_ylabel(f'PC2 ({pca_cluster.explained_variance_ratio_[1]*100:.1f}%)')
ax1.set_title('PCA Projection')
ax1.legend()
ax1.grid(True, alpha=0.3)

# t-SNE
ax2 = axes[1]
for i in range(5):
    mask = y_clusters == i
    ax2.scatter(X_tsne_cluster[mask, 0], X_tsne_cluster[mask, 1],
                c=[colors[i]], label=f'Cluster {i+1}', s=50, alpha=0.7)
ax2.set_xlabel('t-SNE 1')
ax2.set_ylabel('t-SNE 2')
ax2.set_title('t-SNE Projection')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.suptitle('PCA vs t-SNE: Cluster Visualization', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
# Summary comparison
comparison = {
    'Aspect': ['Type', 'Speed', 'Deterministic', 'Preserves', 'Reversible', 'Use for ML', 'Visualization'],
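    'PCA': ['Linear', 'Fast O(np^2)', 'Yes', 'Global variance', 'Yes (lossy if components dropped)', 'Yes (preprocessing)', 'Okay'],
    't-SNE': ['Non-linear', 'Slow O(n^2) exact, O(n log n) Barnes-Hut', 'Only with fixed seed', 'Local neighbors', 'No', 'No (no transform for new data)', 'Excellent']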
}

comparison_df = pd.DataFrame(comparison)
print("\n" + "="*60)
print("PCA vs t-SNE Comparison")
print("="*60)
print(comparison_df.to_string(index=False))
print("="*60)
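
The "Reversible" and "Use for ML" rows can be checked in a few lines. A fitted `PCA` projects unseen rows with `transform` and maps them back with `inverse_transform`, whereas scikit-learn's `TSNE` only offers `fit_transform` and cannot embed new points into an existing map. The sketch uses toy data and illustrative variable names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.standard_normal((100, 10))    # toy training data (illustrative assumption)
X_new = rng.standard_normal((5, 10))        # unseen rows

pca_demo = PCA(n_components=3).fit(X_train)
Z_new = pca_demo.transform(X_new)                  # project unseen rows
X_new_back = pca_demo.inverse_transform(Z_new)     # map back; lossy because 7 components were dropped

tsne_has_transform = hasattr(TSNE(), "transform")

print(f"PCA projection of new rows: {Z_new.shape}")
print(f"Reconstruction error (MSE): {np.mean((X_new - X_new_back) ** 2):.4f}")
print(f"TSNE has a transform method: {tsne_has_transform}")
```

This is the practical reason PCA fits into train/test workflows while t-SNE is kept for one-off visualization.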

## Part 9: Best Practice - PCA then t-SNE

In [None]:
import time

# Generate larger dataset
X_large, y_large = make_blobs(n_samples=1000, n_features=200, 
                               centers=5, random_state=42)
X_large_scaled = StandardScaler().fit_transform(X_large)

# Direct t-SNE
start = time.time()
tsne_direct = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne_direct = tsne_direct.fit_transform(X_large_scaled)
time_direct = time.time() - start

# PCA + t-SNE
start = time.time()
pca_pre = PCA(n_components=50)  # Reduce to 50 dimensions first
X_pca_pre = pca_pre.fit_transform(X_large_scaled)
tsne_after_pca = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne_pca = tsne_after_pca.fit_transform(X_pca_pre)
time_pca_tsne = time.time() - start

print(f"Direct t-SNE time: {time_direct:.2f}s")
print(f"PCA (50 dims) + t-SNE time: {time_pca_tsne:.2f}s")
print(f"Speedup: {time_direct/time_pca_tsne:.2f}x")

In [None]:
# Compare results
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

colors = plt.cm.Set1(np.linspace(0, 1, 5))

# Direct t-SNE
ax1 = axes[0]
for i in range(5):
    mask = y_large == i
    ax1.scatter(X_tsne_direct[mask, 0], X_tsne_direct[mask, 1],
                c=[colors[i]], label=f'Cluster {i+1}', s=20, alpha=0.6)
ax1.set_title(f'Direct t-SNE\n({time_direct:.2f}s)')
ax1.set_xlabel('t-SNE 1')
ax1.set_ylabel('t-SNE 2')
ax1.legend()
ax1.grid(True, alpha=0.3)

# PCA + t-SNE
ax2 = axes[1]
for i in range(5):
    mask = y_large == i
    ax2.scatter(X_tsne_pca[mask, 0], X_tsne_pca[mask, 1],
                c=[colors[i]], label=f'Cluster {i+1}', s=20, alpha=0.6)
ax2.set_title(f'PCA (50 dims) + t-SNE\n({time_pca_tsne:.2f}s)')
ax2.set_xlabel('t-SNE 1')
ax2.set_ylabel('t-SNE 2')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.suptitle('Comparison: Direct t-SNE vs PCA + t-SNE', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("\nBest Practice:")
print("1. Standardize data")
print("2. Apply PCA to reduce to 30-50 dimensions")
print("3. Apply t-SNE for 2D visualization")
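
The three steps above can be wrapped in a small helper. `pca_then_tsne` is a hypothetical function written for this notebook, not a scikit-learn API; it adds two guards that are easy to forget: PCA cannot return more components than `min(n_samples, n_features)`, and scikit-learn requires `perplexity` to be smaller than the number of samples.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

def pca_then_tsne(X, n_pca=50, perplexity=30, random_state=42):
    """Standardize -> PCA -> t-SNE (hypothetical helper for illustration)."""
    X_std = StandardScaler().fit_transform(X)
    # PCA cannot return more components than min(n_samples, n_features)
    n_pca = min(n_pca, *X_std.shape)
    X_red = PCA(n_components=n_pca, random_state=random_state).fit_transform(X_std)
    # scikit-learn requires perplexity < n_samples
    perplexity = min(perplexity, X_std.shape[0] - 1)
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=random_state)
    return tsne.fit_transform(X_red)

# Small toy example (illustrative assumption) so the cell runs quickly
X_demo, y_demo = make_blobs(n_samples=120, n_features=60, centers=3, random_state=0)
embedding = pca_then_tsne(X_demo, n_pca=20)
print(f"Embedding shape: {embedding.shape}")
```

Whether the PCA step saves time depends on the dataset size and the t-SNE method used, so it is worth timing both variants as in the cell above.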

## Exercises

### Exercise 1: Choose Number of Components
Using the portfolio returns data, determine the optimal number of components using:
1. 90% variance threshold
2. Kaiser criterion (eigenvalues > 1)

In [None]:
# Exercise 1 Solution: Choose Number of Components

# 1. 90% variance threshold
cumvar = np.cumsum(pca.explained_variance_ratio_)
var90_k = np.argmax(cumvar >= 0.90) + 1
print(f"90% variance threshold: k = {var90_k}")

# 2. Kaiser criterion: eigenvalues > 1 (meaningful only for standardized data,
#    where each original feature has a variance of about 1)
eigenvalues = pca.explained_variance_
kaiser_k = np.sum(eigenvalues > 1)
print(f"Kaiser criterion (eigenvalues > 1): k = {kaiser_k}")

# 3. Visual comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Eigenvalues with Kaiser threshold
ax1 = axes[0]
ax1.bar(range(1, min(21, len(eigenvalues)+1)), eigenvalues[:20], color='steelblue', alpha=0.7)
ax1.axhline(y=1, color='red', linestyle='--', linewidth=2, label='Kaiser threshold (=1)')
ax1.set_xlabel('Principal Component')
ax1.set_ylabel('Eigenvalue')
ax1.set_title('Kaiser Criterion')
ax1.legend()
ax1.set_xlim(0, 21)

# Cumulative variance with 90% threshold
ax2 = axes[1]
ax2.plot(range(1, min(21, len(cumvar)+1)), cumvar[:20] * 100, 'o-', color='steelblue', linewidth=2, markersize=6)
ax2.axhline(y=90, color='red', linestyle='--', linewidth=2, label='90% threshold')
ax2.axvline(x=var90_k, color='green', linestyle=':', linewidth=2, label=f'k={var90_k}')
ax2.set_xlabel('Number of Components')
ax2.set_ylabel('Cumulative Variance (%)')
ax2.set_title('Variance Explained')
ax2.legend()
ax2.set_xlim(0, 21)
ax2.set_ylim(0, 105)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nConclusion:")
print(f"  - Kaiser criterion suggests k = {kaiser_k} components")
print(f"  - 90% variance requires k = {var90_k} components")
print(f"  - For preprocessing, choose based on downstream task requirements")

### Exercise 2: t-SNE Stability
Run t-SNE 3 times with different random seeds. How different are the results?

In [None]:
# Exercise 2 Solution: t-SNE Stability

# Run t-SNE with different random seeds
seeds = [0, 42, 123]
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

colors = plt.cm.Set1(np.linspace(0, 1, 5))

for ax, seed in zip(axes, seeds):
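    # Default of 1000 iterations; the argument is omitted because its name
    # differs across scikit-learn versions (n_iter, later max_iter)
    tsne = TSNE(n_components=2, perplexity=30, random_state=seed)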
    X_tsne = tsne.fit_transform(X_clusters_scaled)
    
    for i in range(5):
        mask = y_clusters == i
        ax.scatter(X_tsne[mask, 0], X_tsne[mask, 1], 
                   c=[colors[i]], label=f'Cluster {i+1}', s=40, alpha=0.7)
    
    ax.set_title(f't-SNE (seed={seed})')
    ax.set_xlabel('t-SNE 1')
    ax.set_ylabel('t-SNE 2')
    ax.grid(True, alpha=0.3)

axes[0].legend(loc='upper right')
plt.suptitle('t-SNE Stability: Different Random Seeds', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Key Observations:")
print("1. Cluster STRUCTURE is preserved across runs (5 distinct groups)")
print("2. Exact POSITIONS and ORIENTATIONS can differ between runs")
print("3. Cluster SIZES may appear different (not meaningful!)")
print("4. DISTANCES between clusters vary (also not meaningful!)")
print("\nBest Practice: Run t-SNE multiple times and look for consistent patterns.")
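
Visual comparison across seeds can be backed by a number. `sklearn.manifold.trustworthiness` scores, between 0 and 1, how well an embedding keeps each point's original nearest neighbors. The sketch below computes it for two seeds on toy data; `init="random"` is set explicitly (an assumption of this example) so that the seed changes the starting layout.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

# Small toy dataset (illustrative assumption) so the cell runs quickly
X_demo, _ = make_blobs(n_samples=150, n_features=20, centers=3, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

scores = {}
for seed in (0, 1):
    tsne_demo = TSNE(n_components=2, perplexity=30, init="random", random_state=seed)
    embedding = tsne_demo.fit_transform(X_demo)
    # Share of local neighborhoods preserved by the embedding (1.0 is best)
    scores[seed] = trustworthiness(X_demo, embedding, n_neighbors=10)

for seed, score in scores.items():
    print(f"seed={seed}: trustworthiness = {score:.3f}")
```

If the scores are similar across seeds, the local structure is stable even when the plots look rotated or rearranged.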

### Exercise 3: Interpret PCA Loadings
For the portfolio data, which sector has the highest loading on PC1? What does this mean economically?

In [None]:
# Exercise 3 Solution: Interpret PCA Loadings

# Get PC1 loadings (first row of components matrix)
pc1_loadings = pca.components_[0]

# Create DataFrame with loadings and sector info
loadings_analysis = pd.DataFrame({
    'Asset': asset_names,
    'PC1_Loading': pc1_loadings,
    'Abs_Loading': np.abs(pc1_loadings),
    'Sector': sectors
})

# Calculate sector average loadings
sector_avg = loadings_analysis.groupby('Sector')['PC1_Loading'].agg(['mean', 'std']).round(4)
sector_avg = sector_avg.sort_values('mean', ascending=False)

print("Average PC1 Loading by Sector:")
print(sector_avg)
print()

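# Find sector with highest loading
# Note: the sign of a principal component is arbitrary, so compare magnitudes
# (Abs_Loading) if the signed sector averages come out negative
top_sector = sector_avg['mean'].idxmax()
print(f"Sector with highest PC1 loading: {top_sector}")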

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))
colors = ['#0066CC' if s == top_sector else '#808080' for s in sector_avg.index]
bars = ax.barh(range(len(sector_avg)), sector_avg['mean'].values, color=colors, alpha=0.7)
ax.set_yticks(range(len(sector_avg)))
ax.set_yticklabels(sector_avg.index)
ax.set_xlabel('Average PC1 Loading')
ax.set_title('Sector Contribution to Principal Component 1')
ax.axvline(x=0, color='black', linewidth=0.5)
ax.grid(True, alpha=0.3, axis='x')

# Error bars
ax.errorbar(sector_avg['mean'].values, range(len(sector_avg)), 
            xerr=sector_avg['std'].values, fmt='none', color='black', capsize=3)

plt.tight_layout()
plt.show()

print(f"\nEconomic Interpretation:")
print(f"PC1 represents the dominant factor in portfolio returns.")
print(f"The '{top_sector}' sector has the highest loading on PC1.")
print(f"This suggests that {top_sector} stocks are most representative of the")
print(f"dominant common risk factor captured by PC1.")
print(f"\nIn factor model terms: a large PC1 loading means high sensitivity (beta) to that factor.")
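
One caveat when ranking sectors by signed loadings is that the sign of a principal component is arbitrary: $v$ and $-v$ are both valid eigenvectors. The sketch below, on toy data with illustrative names, flips the sign of PC1 and shows that the reconstruction and the explained variance are unchanged, so only the magnitude and relative signs of loadings carry meaning.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 6)) @ rng.standard_normal((6, 6))   # toy data

pca_demo = PCA(n_components=2).fit(X_demo)
Xc = X_demo - pca_demo.mean_
V = pca_demo.components_              # shape (2, 6), one row per component

V_flipped = V.copy()
V_flipped[0] *= -1                    # flip the sign of PC1

# Project and reconstruct with the original and the flipped basis
recon = (Xc @ V.T) @ V
recon_flipped = (Xc @ V_flipped.T) @ V_flipped
same_reconstruction = np.allclose(recon, recon_flipped)

var_pc1 = np.var(Xc @ V[0], ddof=1)
var_pc1_flipped = np.var(Xc @ V_flipped[0], ddof=1)

print(f"Same reconstruction after sign flip: {same_reconstruction}")
print(f"PC1 variance: {var_pc1:.4f} vs flipped: {var_pc1_flipped:.4f}")
```

For a sector ranking that does not depend on this convention, the `Abs_Loading` column is the safer quantity to average.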

## Summary

### Key Takeaways

1. **PCA** is a linear dimensionality reduction technique:
   - Finds orthogonal directions of maximum variance
   - Fast, deterministic, and invertible (exactly only when all components are kept)
   - Use for preprocessing and feature extraction

2. **t-SNE** is a non-linear technique for visualization:
   - Preserves local neighborhood structure
   - Excellent for finding clusters
   - Stochastic (results depend on the random seed); cluster sizes and inter-cluster distances are NOT meaningful

3. **Best Practice Pipeline**:
   - Standardize -> PCA (30-50 dims) -> t-SNE (2D)

4. **Choosing Components**:
   - 80-95% cumulative variance threshold
   - Scree plot elbow method
   - Kaiser criterion (eigenvalue > 1)

### Next Steps
- Explore UMAP as a modern alternative to t-SNE
- Apply PCA for noise reduction in time series
- Use dimensionality reduction for ML preprocessing