# **Analysis of Estimation Errors in Portfolio Optimization**

This notebook analyzes the impact of estimation errors in means, variances, and covariances on portfolio optimization results. We'll use historical data to create a base portfolio and then simulate various types of estimation errors to understand their effects on portfolio performance.


In [None]:
import sys
import os
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from numpy.random import default_rng
import logging

project_root = Path.cwd().parent
sys.path.append(str(project_root))

# Import custom modules
from src.data_management import DataManager
from src.portfolio_optimizer import PortfolioParameters, create_base_portfolio
from src.error_analysis import ErrorAnalysisConfig, run_error_analysis
from src.visualization import PortfolioVisualizer

In [9]:
sns.set_style("whitegrid")  # seaborn is already imported in the first cell
%matplotlib inline

## 1. Data Collection and Processing
First, we'll define our universe of stocks (DJIA constituents) and randomly select 10 of them.


In [10]:
djia_constituents = [
    'AAPL',  # Apple
    'AMGN',  # Amgen
    'AXP',   # American Express
    'BA',    # Boeing
    'CAT',   # Caterpillar
    'CRM',   # Salesforce
    'CSCO',  # Cisco
    'CVX',   # Chevron
    'DIS',   # Disney
    'DOW',   # Dow Inc.
    'GS',    # Goldman Sachs
    'HD',    # Home Depot
    'HON',   # Honeywell
    'IBM',   # IBM
    'INTC',  # Intel
    'JNJ',   # Johnson & Johnson
    'JPM',   # JPMorgan Chase
    'KO',    # Coca-Cola
    'MCD',   # McDonald's
    'MMM',   # 3M
    'MRK',   # Merck
    'MSFT',  # Microsoft
    'NKE',   # Nike
    'PG',    # Procter & Gamble
    'TRV',   # Travelers
    'UNH',   # UnitedHealth
    'V',     # Visa
    'VZ',    # Verizon
    'WBA',   # Walgreens Boots Alliance
    'WMT',   # Walmart
]

# Set random seed for reproducibility
rng = default_rng(42)

# Randomly select 10 stocks
selected_symbols = sorted(rng.choice(djia_constituents, size=10, replace=False))

print("Selected Stocks:")
for symbol in selected_symbols:
    print(f"- {symbol}")


Selected Stocks:
- AMGN
- AXP
- CRM
- GS
- JNJ
- KO
- MMM
- NKE
- TRV
- WMT


In [11]:
# Initialize DataManager
data_manager = DataManager(
    symbols=selected_symbols,
    start_date='2014-01-01',  # Using 10 years of data
    end_date='2023-12-31',
    data_dir='../data'
)

# Process data
statistics = data_manager.process_all()

# Print basic statistics
print("\nPortfolio Components Statistics:")
for i, symbol in enumerate(selected_symbols):
    print(f"\n{symbol}:")
    print(f"  Expected Monthly Return: {statistics['expected_returns'][i]:.4%}")
    print(f"  Monthly Volatility: {np.sqrt(statistics['covariance_matrix'][i,i]):.4%}")

# Print data range
print("\nData Range:")
print(f"First data point: {data_manager.prices.index[0].strftime('%Y-%m-%d')}")
print(f"Last data point: {data_manager.prices.index[-1].strftime('%Y-%m-%d')}")

2024-11-29 19:57:45,701 - src.data_management - INFO - Initialized DataManager with 10 symbols
2024-11-29 19:57:45,755 - src.data_management - INFO - Data loaded successfully for 10 symbols



Portfolio Components Statistics:

AMGN:
  Expected Monthly Return: 2.2277%
  Monthly Volatility: 7.9929%

AXP:
  Expected Monthly Return: 2.5788%
  Monthly Volatility: 8.9089%

CRM:
  Expected Monthly Return: 1.1137%
  Monthly Volatility: 5.0717%

GS:
  Expected Monthly Return: 1.8014%
  Monthly Volatility: 6.8561%

JNJ:
  Expected Monthly Return: 0.6672%
  Monthly Volatility: 4.5168%

KO:
  Expected Monthly Return: 1.6570%
  Monthly Volatility: 7.1277%

MMM:
  Expected Monthly Return: 2.1688%
  Monthly Volatility: 9.7777%

NKE:
  Expected Monthly Return: 2.1990%
  Monthly Volatility: 6.3406%

TRV:
  Expected Monthly Return: 5.7350%
  Monthly Volatility: 13.5089%

WMT:
  Expected Monthly Return: 1.8031%
  Monthly Volatility: 5.7076%

Data Range:
First data point: 2014-12-01
Last data point: 2024-11-27
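The internals of `DataManager.process_all` are not shown in this notebook. As a rough sketch of how monthly expected returns and a covariance matrix could be estimated from daily prices, assuming simple month-over-month returns on month-end prices (the function name `monthly_statistics` and the synthetic tickers are hypothetical):

```python
import numpy as np
import pandas as pd

def monthly_statistics(prices: pd.DataFrame) -> dict:
    """Estimate monthly mean returns and their covariance from daily prices."""
    # Keep the last available price in each calendar month.
    month_end = prices.groupby(prices.index.to_period("M")).last()
    # Simple month-over-month returns; the first month has no predecessor.
    returns = (month_end / month_end.shift(1) - 1.0).dropna()
    return {
        "expected_returns": returns.mean().to_numpy(),
        "covariance_matrix": returns.cov().to_numpy(),
    }

# Synthetic daily prices for two made-up tickers (illustration only).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2023-01-02", periods=260)
daily = rng.normal(0.0005, 0.01, size=(len(dates), 2))
demo_prices = pd.DataFrame(
    100.0 * np.cumprod(1.0 + daily, axis=0), index=dates, columns=["AAA", "BBB"]
)
demo_stats = monthly_statistics(demo_prices)
print(demo_stats["expected_returns"].shape, demo_stats["covariance_matrix"].shape)
```

The dictionary keys mirror the ones used above (`expected_returns`, `covariance_matrix`), but whether the real module uses simple or log returns, or month-end prices at all, is an assumption.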


## 2. Base Portfolio Optimization

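Next, we'll build the base optimal portfolio, treating the historical estimates as if they were the true parameters. Every perturbed portfolio in the later sections is compared against this one.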


In [12]:
# Create base optimal portfolio
risk_tolerance = 50  # Moderate risk tolerance
optimal_weights, base_optimizer = create_base_portfolio(
    expected_returns=statistics['expected_returns'],
    covariance_matrix=statistics['covariance_matrix'],
    risk_tolerance=risk_tolerance
)

# Print base portfolio characteristics
print("\nBase Portfolio Allocation:")
selected_weights = [(symbol, weight) for symbol, weight in zip(selected_symbols, optimal_weights) 
                   if weight > 0.01]  # Only show positions > 1%
selected_weights.sort(key=lambda x: x[1], reverse=True)

for symbol, weight in selected_weights:
    print(f"{symbol}: {weight:.2%}")

expected_return = np.dot(optimal_weights, statistics['expected_returns'])
portfolio_risk = np.sqrt(optimal_weights @ statistics['covariance_matrix'] @ optimal_weights)

print(f"\nPortfolio Characteristics:")
print(f"Expected Monthly Return: {expected_return:.2%}")
print(f"Monthly Portfolio Risk: {portfolio_risk:.2%}")
print(f"Active Positions (>1%): {len(selected_weights)}")
print(f"Annualized Sharpe Ratio (Rf=0): {(expected_return * 12) / (portfolio_risk * np.sqrt(12)):.2f}")


2024-11-29 19:57:47,196 - src.portfolio_optimizer - DEBUG - Starting optimization



Base Portfolio Allocation:
TRV: 40.00%
AXP: 28.07%
WMT: 24.78%
NKE: 7.15%

Portfolio Characteristics:
Expected Monthly Return: 3.62%
Monthly Portfolio Risk: 7.70%
Active Positions (>1%): 4
Annualized Sharpe Ratio (Rf=0): 1.63
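The code for `create_base_portfolio` lives in `src.portfolio_optimizer` and is not shown here. To make the role of `risk_tolerance` concrete, here is a minimal sketch that assumes a mean-variance utility of the form `w @ mu - (w @ sigma @ w) / t` with only a full-investment constraint. Both the utility form and the helper names are assumptions, and the real optimizer almost certainly adds bounds on the weights, which this closed form ignores.

```python
import numpy as np

def mean_variance_weights(mu, sigma, risk_tolerance):
    """Maximize w@mu - (w@sigma@w)/t subject to sum(w) == 1 (no bounds)."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    ones = np.ones_like(mu)
    inv_mu = np.linalg.solve(sigma, mu)      # sigma^{-1} mu
    inv_ones = np.linalg.solve(sigma, ones)  # sigma^{-1} 1
    # Lagrange multiplier that enforces the budget constraint.
    lam = (ones @ inv_mu - 2.0 / risk_tolerance) / (ones @ inv_ones)
    return 0.5 * risk_tolerance * (inv_mu - lam * inv_ones)

def certainty_equivalent(w, mu, sigma, risk_tolerance):
    """Utility of a portfolio: expected return minus the risk penalty."""
    return w @ mu - (w @ sigma @ w) / risk_tolerance

# Made-up inputs for three assets (illustration only).
mu_demo = np.array([0.010, 0.020, 0.015])
sigma_demo = np.array([
    [0.0040, 0.0010, 0.0005],
    [0.0010, 0.0090, 0.0015],
    [0.0005, 0.0015, 0.0060],
])
w_demo = mean_variance_weights(mu_demo, sigma_demo, risk_tolerance=50)
ce_demo = certainty_equivalent(w_demo, mu_demo, sigma_demo, 50)
print(w_demo.sum())
```

A larger `risk_tolerance` shrinks the variance penalty, so the solution tilts toward high-return assets. Without bounds the weights can be negative or exceed 100%, whereas the allocation printed above shows no short positions and a largest weight of exactly 40.00%, which hints at weight limits in the actual optimizer.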


## 3. Error Analysis Configuration

We'll configure the error analysis parameters to study how different types and magnitudes of estimation errors affect portfolio performance.


In [15]:
config = ErrorAnalysisConfig(
    n_iterations=50,                                 # Keep this unchanged
    error_magnitudes=np.array([0.05, 0.20]),        # Extreme magnitudes only
    risk_tolerances=np.array([25, 75]),             # Extreme tolerances only
    batch_size=25,                                  # Tuned batch size
    show_progress=True,
    n_jobs=min(4, os.cpu_count() or 1)             # At most 4 workers
)

# Enable more verbose logging to see where the run is getting stuck
logging.getLogger('src.error_analysis').setLevel(logging.DEBUG)
logging.getLogger('src.portfolio_optimizer').setLevel(logging.DEBUG)

print("Starting diagnostic run...")
results = run_error_analysis(
    expected_returns=statistics['expected_returns'],
    covariance_matrix=statistics['covariance_matrix'],
    config=config
)

Starting diagnostic run...


Running simulations:   0%|          | 0/600 [00:00<?, ?sim/s]

2024-11-29 19:59:17,289 - src.error_analysis - INFO - 
Starting error analysis:
- Total simulations: 600
- Parallel jobs: 4
- Batch size: 25

2024-11-29 19:59:42,766 - src.error_analysis - INFO - 
Simulation completed:
- Successful simulations: 598
- Failed simulations: 2
- Total time: 25.5 seconds
- Average speed: 23.6 sim/s

2024-11-29 19:59:42,833 - src.error_analysis - INFO - 
Analysis completed successfully. Computing final statistics...
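The log reports 600 simulations, which matches 50 iterations × 2 error magnitudes × 2 risk tolerances × 3 error types (means, variances, covariances). How `run_error_analysis` perturbs the inputs is not shown. One common scheme, sketched here purely as an assumption, scales each parameter by `1 + k * z` with `z` drawn from a standard normal, and measures the damage as a relative loss in certainty equivalent (which is what we take "CEL" to mean). All function names below are hypothetical.

```python
import numpy as np

def perturb_means(mu, k, rng):
    """Scale each expected return by (1 + k*z), z ~ N(0, 1)."""
    return mu * (1.0 + k * rng.standard_normal(mu.shape))

def perturb_variances(sigma, k, rng):
    """Scale only the diagonal (variances); covariances stay as they are."""
    out = sigma.copy()
    idx = np.diag_indices_from(out)
    out[idx] = out[idx] * (1.0 + k * rng.standard_normal(out.shape[0]))
    return out

def perturb_covariances(sigma, k, rng):
    """Scale off-diagonal entries symmetrically; variances stay as they are."""
    n = sigma.shape[0]
    z = np.triu(rng.standard_normal((n, n)), 1)  # strictly upper triangle
    factor = 1.0 + k * (z + z.T)                 # diagonal factor is exactly 1
    return sigma * factor

def cash_equivalent_loss(ce_optimal, ce_suboptimal):
    """Relative utility lost by holding the portfolio built from noisy inputs."""
    return (ce_optimal - ce_suboptimal) / ce_optimal

# Made-up inputs (illustration only).
rng = np.random.default_rng(42)
mu_true = np.array([0.010, 0.020, 0.015])
sigma_true = np.array([
    [0.0040, 0.0010, 0.0005],
    [0.0010, 0.0090, 0.0015],
    [0.0005, 0.0015, 0.0060],
])
mu_noisy = perturb_means(mu_true, 0.20, rng)
var_noisy = perturb_variances(sigma_true, 0.20, rng)
cov_noisy = perturb_covariances(sigma_true, 0.20, rng)
```

In such a scheme, the optimizer would be run on the noisy inputs and the resulting weights evaluated with the true `mu_true` and `sigma_true`. Perturbed matrices are not guaranteed to remain positive definite, so any implementation along these lines needs to handle optimizations that fail.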


## 4. Visualization of Results
Let's create various visualizations to better understand the impact of estimation errors.


In [16]:
# Initialize visualizer
visualizer = PortfolioVisualizer(figsize=(12, 8))

# Create and save all visualizations
visualizer.create_analysis_dashboard(results, output_dir='../figures')

# Display key plots inline
visualizer.plot_cel_heatmap(results)
plt.show()

visualizer.plot_cel_confidence_bands(results)
plt.show()

visualizer.plot_risk_return_scatter(results)
plt.show()

2024-11-29 19:59:52,520 - src.visualization - INFO - Generating cel_heatmap.png...
2024-11-29 19:59:55,531 - src.visualization - INFO - Successfully generated cel_heatmap.png
2024-11-29 19:59:55,533 - src.visualization - INFO - Generating cel_boxplots.png...
2024-11-29 19:59:55,654 - matplotlib.category - INFO - Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
2024-11-29 19:59:55,686 - matplotlib.category - INFO - Using categorical units to plot a list of strings that are all parsable as floats or dates. If these strings should be plotted as numbers, cast to the appropriate data type before plotting.
2024-11-29 19:59:56,863 - src.visualization - INFO - Successfully generated cel_boxplots.png
2024-11-29 19:59:56,864 - src.visualization - INFO - Generating weight_differences.png...
2024-11-29 19:59:56,911 - src.visualization - ERROR - Error generati

RuntimeError: Failed to generate weight_differences.png: Data must be 1-dimensional, got ndarray of shape (12, 1) instead
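The wrapped message, "Data must be 1-dimensional, got ndarray of shape (12, 1) instead", is what pandas reports when a column vector is passed where a 1-D array is expected. The plotting code in `src.visualization` is not shown, so the following is only a sketch of the likely fix, using a stand-in array: flatten the `(12, 1)` values before handing them to pandas.

```python
import numpy as np
import pandas as pd

# Stand-in for the weight-difference values: a (12, 1) column vector.
values = np.arange(12, dtype=float).reshape(12, 1)

message = None
try:
    pd.Series(values)  # recent pandas versions reject 2-D input here
except ValueError as exc:
    message = str(exc)

# Flatten to shape (12,) so pandas accepts it.
flat = np.asarray(values).ravel()
series = pd.Series(flat)
print(message)
print(series.shape)
```

The 12 rows are consistent with 3 error types × 2 magnitudes × 2 risk tolerances, so the offending array is probably one aggregated column selected in a way that keeps it two-dimensional, for example `df[['col']].values` rather than `df['col'].values`.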

## 5. Analysis of Results
Let's analyze the key findings from our error analysis:


In [17]:
# Calculate detailed summary statistics
summary_stats = pd.DataFrame({
    'Error Type': results.index.get_level_values('error_type'),
    'Error Magnitude': results.index.get_level_values('error_magnitude'),
    'Risk Tolerance': results.index.get_level_values('risk_tolerance'),
    'Mean CEL': results[('cel', 'mean')],
    'Max CEL': results[('cel', 'max')],
    'Mean Weight Diff': results[('mean_weight_diff', 'mean')],
    'Return Difference': results[('suboptimal_return', 'mean')] - results[('optimal_return', 'mean')],
    'Risk Difference': results[('suboptimal_risk', 'mean')] - results[('optimal_risk', 'mean')]
}).round(4)

# Group by error type and magnitude
grouped_stats = summary_stats.groupby(['Error Type', 'Error Magnitude']).mean()
print("\nDetailed Summary Statistics:")
print(grouped_stats)

KeyError: ('suboptimal_return', 'mean')
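The `KeyError` means `results` has no `('suboptimal_return', 'mean')` column, so the aggregated frame does not contain every metric this cell expects. Before building the summary, it helps to list which `(metric, statistic)` pairs actually exist. The sketch below uses a small stand-in frame, since the real column set of `results` is unknown; `results_demo` and `required` are hypothetical names.

```python
import pandas as pd

# Stand-in for `results`: MultiIndex columns of (metric, statistic).
columns = pd.MultiIndex.from_tuples([
    ("cel", "mean"),
    ("cel", "max"),
    ("mean_weight_diff", "mean"),
])
results_demo = pd.DataFrame(
    [[0.01, 0.05, 0.10], [0.02, 0.08, 0.20]], columns=columns
)

# Columns the summary cell wants to read.
required = [
    ("cel", "mean"),
    ("suboptimal_return", "mean"),
    ("optimal_return", "mean"),
]
missing = [col for col in required if col not in results_demo.columns]

print("Available columns:", list(results_demo.columns))
print("Missing columns:", missing)
```

Running the same check against the real `results` (for instance `print(results.columns.tolist())`) shows whether the return and risk metrics were never aggregated or are simply stored under different names.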