---
## 1. Theoretical Introduction

### 1.1 Why Random Forest for Imputation?

**Random Forest** is an ensemble method that averages the predictions of many decision trees:

$$\hat{y} = \frac{1}{B}\sum_{b=1}^{B} T_b(x)$$

where:
- $B$ = number of trees (`n_estimators`)
- $T_b(x)$ = prediction of the $b$-th tree
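The averaging formula can be checked directly against scikit-learn. The sketch below uses synthetic data from `make_regression` as a stand-in for the real features (an assumption made only for illustration) and compares the manual average over `estimators_` with the output of `predict`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data stands in for the OHLCV features (illustrative only).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Manual average over the B fitted trees: (1/B) * sum_b T_b(x)
tree_preds = np.stack([tree.predict(X_demo) for tree in forest.estimators_])
manual_avg = tree_preds.mean(axis=0)

forest_pred = forest.predict(X_demo)
print(np.allclose(manual_avg, forest_pred))
```

Here `tree_preds` has one row per tree, so its column-wise mean is exactly the $\frac{1}{B}\sum_b T_b(x)$ term from the formula.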

#### Advantages for imputation:

| Property | Why it matters |
|----------|----------------|
| **Non-linear relationships** | Captures complex dependencies between OHLCV data and fundamentals |
| **Robustness to outliers** | Financial data contains extremes (NVDA +200%) |
| **Multi-output support** | Predicts all 14 metrics in a single call |
| **Feature importance** | Interpretability: shows which features matter most |
| **No normalization needed** | Trees do not require feature scaling |

### 1.2 Multi-Output Regression

Instead of managing 14 separate models by hand, we train a single **MultiOutputRegressor**, which fits one Random Forest per target behind a common `fit`/`predict` interface:

$$f: \mathbb{R}^{19} \rightarrow \mathbb{R}^{14}$$

**Input (19 features):**
- OHLCV: open, high, low, close, volume
- Technical: returns, volatility, rsi, macd, sma, ema, ...

**Output (14 targets):**
- Valuation: PE, PB, PS, EV_EBITDA
- Profitability: ROE, ROA, margins (3)
- Financial health: Debt/Equity, Current/Quick Ratio (3)
- Growth: Revenue/Earnings Growth (2)
---
## 2. Environment Setup

In [None]:
# Installation (for Colab)
!pip install pandas numpy scikit-learn joblib matplotlib seaborn tqdm -q

print("✓ Libraries installed")

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
import os
import joblib

# Scikit-learn
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

warnings.filterwarnings('ignore')
np.random.seed(42)

print("✓ Libraries loaded")

In [None]:
# Mount Google Drive
try:
    from google.colab import drive
    drive.mount('/content/drive')
    DRIVE_PATH = '/content/drive/MyDrive/MachineLearning'
    RUNNING_ON_COLAB = True
    print(f"✓ Google Drive mounted: {DRIVE_PATH}")
except ImportError:
    # google.colab is only importable inside Colab
    DRIVE_PATH = '.'
    RUNNING_ON_COLAB = False
    print("ℹ️ Local environment")

# Paths
DATA_PATH = f"{DRIVE_PATH}/data"
MODEL_PATH = f"{DRIVE_PATH}/models"
os.makedirs(MODEL_PATH, exist_ok=True)

---
## 3. Loading the Data

In [None]:
# Load OHLCV data from Notebook 01
ohlcv_path = f"{DATA_PATH}/ohlcv/all_sectors_ohlcv_10y.csv"
ohlcv_df = pd.read_csv(ohlcv_path, parse_dates=['date'])

print(f"📈 OHLCV Data:")
print(f"   Records: {len(ohlcv_df):,}")
print(f"   Tickers: {ohlcv_df['ticker'].nunique()}")
print(f"   Period: {ohlcv_df['date'].min().strftime('%Y-%m')} → {ohlcv_df['date'].max().strftime('%Y-%m')}")

In [None]:
# Load fundamental data
fund_path = f"{DATA_PATH}/fundamentals/all_sectors_fundamentals.csv"
fundamentals_df = pd.read_csv(fund_path, parse_dates=['date'])

print(f"\n📊 Fundamental Data:")
print(f"   Records: {len(fundamentals_df)}")
print(f"   Tickers: {fundamentals_df['ticker'].nunique()}")

# Show the first records
display(fundamentals_df.head())

---
## 4. Preparing the Training Data

### 4.1 Strategy

1. **Merge** OHLCV data with fundamentals by ticker
2. **Forward-fill** fundamentals to build a larger dataset
3. **Select features** (OHLCV + technical indicators) and **targets** (14 fundamentals)
4. **Drop rows** containing NaN values
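Steps 1 and 2 can also be expressed as an as-of join. The sketch below is not the helper this notebook uses (`prepare_training_data` broadcasts a single fundamentals record per ticker); it is an illustrative alternative using `pd.merge_asof`, with a made-up ticker `AAA` and made-up values, showing how each price row would receive the most recent report dated on or before it.

```python
import pandas as pd

# Toy monthly prices for one hypothetical ticker.
prices = pd.DataFrame({
    "ticker": ["AAA"] * 4,
    "date": pd.to_datetime(["2024-01-31", "2024-02-29", "2024-03-31", "2024-04-30"]),
    "close": [10.0, 11.0, 12.0, 13.0],
})

# Two fundamental reports, dated mid-February and mid-April (made-up values).
reports = pd.DataFrame({
    "ticker": ["AAA", "AAA"],
    "date": pd.to_datetime(["2024-02-15", "2024-04-15"]),
    "PE": [20.0, 25.0],
})

# As-of join: each price row receives the latest report dated on or before it.
# Both frames must be sorted by the `on` key.
merged = pd.merge_asof(
    prices.sort_values("date"),
    reports.sort_values("date"),
    on="date",
    by="ticker",
    direction="backward",
)
print(merged)
```

The January row has no earlier report, so its `PE` stays NaN and would be removed in step 4; later rows carry the last known value forward.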

In [None]:
# Define features and targets

# OHLCV + technical indicators (features)
FEATURE_COLS = [
    'open', 'high', 'low', 'close', 'volume',
    'returns', 'volatility_12m', 'rsi_14',
    'macd', 'macd_signal', 'macd_hist',
    'sma_3', 'sma_6', 'sma_12',
    'ema_3', 'ema_6', 'ema_12',
    'volume_change', 'price_momentum'
]

# Fundamental metrics (targets)
TARGET_COLS = [
    'PE', 'PB', 'PS', 'EV_EBITDA',
    'ROE', 'ROA', 'Profit_Margin', 'Operating_Margin', 'Gross_Margin',
    'Debt_to_Equity', 'Current_Ratio', 'Quick_Ratio',
    'Revenue_Growth_YoY', 'Earnings_Growth_YoY'
]

print(f"📊 Features: {len(FEATURE_COLS)}")
for f in FEATURE_COLS:
    print(f"   • {f}")

print(f"\n🎯 Targets: {len(TARGET_COLS)}")
for t in TARGET_COLS:
    print(f"   • {t}")

In [None]:
def prepare_training_data(ohlcv: pd.DataFrame, fundamentals: pd.DataFrame) -> pd.DataFrame:
    """
    Build the training data by joining OHLCV rows with fundamentals.

    Strategy:
    1. For each ticker, take the first fundamentals record found
    2. Broadcast it to every OHLCV row of that ticker (a simplified
       stand-in for forward-fill that enlarges the dataset)
    """
    merged_data = []

    for ticker in ohlcv['ticker'].unique():
        # OHLCV rows for this ticker, in chronological order
        ticker_ohlcv = ohlcv[ohlcv['ticker'] == ticker].copy()
        ticker_ohlcv = ticker_ohlcv.sort_values('date')

        # Fundamentals for this ticker
        ticker_fund = fundamentals[fundamentals['ticker'] == ticker]

        if ticker_fund.empty:
            continue

        # Attach the fundamentals to all OHLCV rows (forward-fill simulation)
        for col in TARGET_COLS:
            if col in ticker_fund.columns:
                ticker_ohlcv[col] = ticker_fund[col].values[0]
            else:
                ticker_ohlcv[col] = np.nan

        merged_data.append(ticker_ohlcv)

    result = pd.concat(merged_data, ignore_index=True)
    return result

# Prepare the data
print("🔄 Preparing training data...")
training_df = prepare_training_data(ohlcv_df, fundamentals_df)

print(f"\n✓ Merged dataset: {len(training_df):,} records")

In [None]:
# Drop rows with NaN in features or targets
print("🧹 Cleaning data...")

# Before cleaning
print(f"   Before cleaning: {len(training_df):,} records")

# Check which columns are available
available_features = [f for f in FEATURE_COLS if f in training_df.columns]
available_targets = [t for t in TARGET_COLS if t in training_df.columns]

print(f"   Available features: {len(available_features)}/{len(FEATURE_COLS)}")
print(f"   Available targets: {len(available_targets)}/{len(TARGET_COLS)}")

# Drop NaN
all_cols = available_features + available_targets
clean_df = training_df.dropna(subset=all_cols)

print(f"   After cleaning: {len(clean_df):,} records")
print(f"   Removed: {len(training_df) - len(clean_df):,} records")

In [None]:
# Prepare X (features) and y (targets)
X = clean_df[available_features].values
y = clean_df[available_targets].values

print(f"\n📊 Final dataset:")
print(f"   X shape: {X.shape} (samples × features)")
print(f"   y shape: {y.shape} (samples × targets)")

---
## 5. Train/Test Split

### 5.1 Chronological Split

For financial data we **NEVER** use a random split, because it lets information from the future leak into training (data leakage).

We use a **chronological split**:
- **Training**: 2015-2023 (about 80%)
- **Test**: 2024-2025 (about 20%)
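For panel data with several tickers, one way to guarantee a chronological split is to cut on a date rather than on a row position. The following is a minimal sketch on a toy panel; the tickers `AAA` and `BBB`, the dates, and the 80% threshold applied to unique dates are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy panel: 2 hypothetical tickers x 10 monthly dates, stored ticker by ticker.
dates = pd.date_range("2024-01-01", periods=10, freq="MS")
panel = pd.DataFrame({
    "ticker": ["AAA"] * 10 + ["BBB"] * 10,
    "date": list(dates) * 2,
    "close": np.arange(20, dtype=float),
})

# Pick the cutoff from the sorted unique dates so that every ticker is split
# at the same point in time.
unique_dates = panel["date"].drop_duplicates().sort_values().reset_index(drop=True)
cutoff = unique_dates.iloc[int(len(unique_dates) * 0.8)]  # first date of the test period

train = panel[panel["date"] < cutoff]
test = panel[panel["date"] >= cutoff]

print(f"cutoff={cutoff.date()}  train={len(train)}  test={len(test)}")
```

Because the cut is made on the date column, both tickers appear in both sets and no training row is later than any test row.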

In [None]:
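# Chronological split
# prepare_training_data concatenates the rows ticker by ticker, so we sort by date
# first; a split by position on the unsorted data would separate tickers, not periods

sorted_df = clean_df.sort_values('date')
X = sorted_df[available_features].values
y = sorted_df[available_targets].values

split_idx = int(len(X) * 0.8)

X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

print(f"📊 Train/Test Split:")
print(f"   Training: {len(X_train):,} samples ({len(X_train)/len(X)*100:.1f}%)")
print(f"   Test: {len(X_test):,} samples ({len(X_test)/len(X)*100:.1f}%)")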

In [None]:
# Standardize features (for consistency, even though RF does not need it)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✓ Features standardized")

---
## 6. Training the Random Forest Model

### 6.1 Hyperparameters

| Parameter | Value | Explanation |
|-----------|-------|-------------|
| `n_estimators` | 200 | Number of trees; more trees give more stable predictions |
| `max_depth` | 15 | Maximum tree depth; limits overfitting |
| `min_samples_split` | 5 | Minimum samples required to split a node |
| `min_samples_leaf` | 2 | Minimum samples in a leaf |
| `n_jobs` | -1 | Parallelize across all CPU cores |

In [None]:
# Model configuration
RF_PARAMS = {
    'n_estimators': 200,
    'max_depth': 15,
    'min_samples_split': 5,
    'min_samples_leaf': 2,
    'random_state': 42,
    'n_jobs': -1
}

print("🌲 Random Forest Configuration:")
for param, value in RF_PARAMS.items():
    print(f"   {param}: {value}")

In [None]:
%%time

# Train the model
print("🚀 Training Multi-Output Random Forest...")
print(f"   Targets: {len(available_targets)}")
print(f"   Training samples: {len(X_train):,}")
print()

# Create and fit the model
base_rf = RandomForestRegressor(**RF_PARAMS)
model = MultiOutputRegressor(base_rf)

model.fit(X_train_scaled, y_train)

print("\n✅ Model trained!")

---
## 7. Model Evaluation

### 7.1 Metrics

For each fundamental metric we compute:

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| **MAE** | $\frac{1}{n}\sum\lvert y_i - \hat{y}_i\rvert$ | Mean absolute error |
| **RMSE** | $\sqrt{\frac{1}{n}\sum(y_i - \hat{y}_i)^2}$ | Penalizes large errors |
| **R²** | $1 - \frac{SS_{res}}{SS_{tot}}$ | Share of variance explained (at most 1; negative on held-out data when the model does worse than predicting the mean) |
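As a worked example of the three formulas, the sketch below computes them by hand with NumPy on four made-up values and can be compared against the scikit-learn functions imported earlier. The arrays are illustrative only and are not results from this model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up values for illustration only.
y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_hat = np.array([11.0, 12.0, 13.0, 18.0])

residuals = y_true - y_hat
mae = np.mean(np.abs(residuals))                 # (1/n) * sum |y_i - y_hat_i|
rmse = np.sqrt(np.mean(residuals ** 2))          # sqrt((1/n) * sum (y_i - y_hat_i)^2)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")

# A prediction that is worse than the mean gives a negative R².
y_bad = y_true[::-1]
r2_bad = r2_score(y_true, y_bad)
print(f"R2 of reversed prediction: {r2_bad:.3f}")
```

With these numbers the residuals are $(-1, 0, 1, -2)$, so MAE is 1.0, RMSE is $\sqrt{1.5}$, and R² is $1 - 6/20 = 0.7$; the reversed prediction yields $1 - 80/20 = -3$.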

In [None]:
# Predict on the test set
y_pred = model.predict(X_test_scaled)

print(f"✓ Predictions complete: {y_pred.shape}")

In [None]:
# Evaluate each target
results = []

print("📊 MODEL EVALUATION")
print("="*70)
print(f"{'Target':<25} {'MAE':>10} {'RMSE':>10} {'R²':>10}")
print("-"*70)

for i, target in enumerate(available_targets):
    y_true_i = y_test[:, i]
    y_pred_i = y_pred[:, i]
    
    mae = mean_absolute_error(y_true_i, y_pred_i)
    rmse = np.sqrt(mean_squared_error(y_true_i, y_pred_i))
    r2 = r2_score(y_true_i, y_pred_i)
    
    results.append({
        'Target': target,
        'MAE': mae,
        'RMSE': rmse,
        'R2': r2
    })
    
    # Color-code R²
    r2_color = '🟢' if r2 > 0.5 else '🟡' if r2 > 0.2 else '🔴'
    print(f"{target:<25} {mae:>10.3f} {rmse:>10.3f} {r2:>9.3f} {r2_color}")

print("-"*70)

# Average metrics
results_df = pd.DataFrame(results)
print(f"{'AVERAGE':<25} {results_df['MAE'].mean():>10.3f} {results_df['RMSE'].mean():>10.3f} {results_df['R2'].mean():>9.3f}")

print("\n🟢 R² > 0.5: Good prediction")
print("🟡 R² 0.2-0.5: Moderate prediction")
print("🔴 R² < 0.2: Weak prediction")

In [None]:
# Visualize the results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. R² for each target
ax1 = axes[0, 0]
colors = ['green' if r > 0.5 else 'orange' if r > 0.2 else 'red' 
          for r in results_df['R2']]
bars = ax1.barh(results_df['Target'], results_df['R2'], color=colors)
ax1.axvline(0.5, color='green', linestyle='--', alpha=0.5, label='Good (0.5)')
ax1.axvline(0.2, color='orange', linestyle='--', alpha=0.5, label='Moderate (0.2)')
ax1.set_xlabel('R² Score')
ax1.set_title('R² Score for Each Fundamental Metric', fontweight='bold')
ax1.legend(loc='lower right')
# Extend the axis to the left if any R² is negative, so no bar is clipped
ax1.set_xlim(min(-0.1, results_df['R2'].min()), 1.0)

# 2. MAE vs RMSE
ax2 = axes[0, 1]
x = np.arange(len(results_df))
width = 0.35
ax2.bar(x - width/2, results_df['MAE'], width, label='MAE', color='steelblue')
ax2.bar(x + width/2, results_df['RMSE'], width, label='RMSE', color='coral')
ax2.set_xticks(x)
ax2.set_xticklabels(results_df['Target'], rotation=45, ha='right')
ax2.set_ylabel('Error')
ax2.set_title('MAE vs RMSE', fontweight='bold')
ax2.legend()

# 3. Scatter plot: Actual vs Predicted (for the best target)
best_idx = results_df['R2'].idxmax()
best_target = results_df.loc[best_idx, 'Target']
ax3 = axes[1, 0]
ax3.scatter(y_test[:, best_idx], y_pred[:, best_idx], alpha=0.5, s=20)
ax3.plot([y_test[:, best_idx].min(), y_test[:, best_idx].max()],
         [y_test[:, best_idx].min(), y_test[:, best_idx].max()],
         'r--', linewidth=2, label='Ideal')
ax3.set_xlabel('Actual value')
ax3.set_ylabel('Predicted value')
ax3.set_title(f'Actual vs Predicted: {best_target} (R²={results_df.loc[best_idx, "R2"]:.3f})', fontweight='bold')
ax3.legend()

# 4. Distribution of R²
ax4 = axes[1, 1]
ax4.hist(results_df['R2'], bins=10, color='steelblue', edgecolor='black', alpha=0.7)
ax4.axvline(results_df['R2'].mean(), color='red', linestyle='--', 
            label=f'Mean: {results_df["R2"].mean():.3f}')
ax4.set_xlabel('R² Score')
ax4.set_ylabel('Number of targets')
ax4.set_title('Distribution of R² Score', fontweight='bold')
ax4.legend()

plt.tight_layout()
plt.savefig(f"{DATA_PATH}/fundamental_predictor_evaluation.png", dpi=150, bbox_inches='tight')
plt.show()

print(f"\n💾 Plot saved: {DATA_PATH}/fundamental_predictor_evaluation.png")

---
## 8. Feature Importance

Which OHLCV features matter most for predicting the fundamentals?
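Because `MultiOutputRegressor` holds one forest per target, a single importance ranking requires aggregating the per-target vectors. The sketch below illustrates this on toy data in which each target depends on exactly one feature by construction (an assumption made so the expected ranking is obvious); each per-target vector sums to 1, so their mean does as well.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

rng = np.random.default_rng(1)
X_imp = rng.normal(size=(400, 3))

# By construction: target 0 depends only on feature 0, target 1 only on feature 2.
Y_imp = np.column_stack([3.0 * X_imp[:, 0], -2.0 * X_imp[:, 2]])

mo = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=20, random_state=1)
).fit(X_imp, Y_imp)

# One importance vector per target, stacked into shape (n_targets, n_features).
per_target = np.array([est.feature_importances_ for est in mo.estimators_])
averaged = per_target.mean(axis=0)

print(per_target.round(3))
print(averaged.round(3))
```

Averaging hides which target a feature is important for: here feature 1 is irrelevant to both targets, while features 0 and 2 each matter for only one of them, so it can be worth inspecting `per_target` row by row as well.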

In [None]:
# Aggregate feature importance across all targets
importances = []

for estimator in model.estimators_:
    importances.append(estimator.feature_importances_)

# Average across all targets
avg_importance = np.mean(importances, axis=0)

# DataFrame
importance_df = pd.DataFrame({
    'Feature': available_features,
    'Importance': avg_importance
}).sort_values('Importance', ascending=False)

print("📊 FEATURE IMPORTANCE (average across all targets)")
print("="*50)
for _, row in importance_df.iterrows():
    bar = '█' * int(row['Importance'] * 50)
    print(f"{row['Feature']:<20} {row['Importance']:.3f} {bar}")

In [None]:
# Visualize feature importance
fig, ax = plt.subplots(figsize=(10, 8))

colors = plt.cm.viridis(np.linspace(0, 0.8, len(importance_df)))
bars = ax.barh(importance_df['Feature'], importance_df['Importance'], color=colors)

ax.set_xlabel('Importance')
ax.set_title('Feature Importance for Predicting Fundamentals', fontsize=14, fontweight='bold')
ax.invert_yaxis()

# Annotations
for bar, val in zip(bars, importance_df['Importance']):
    ax.text(val + 0.005, bar.get_y() + bar.get_height()/2, 
            f'{val:.3f}', va='center', fontsize=9)

plt.tight_layout()
plt.savefig(f"{DATA_PATH}/feature_importance.png", dpi=150, bbox_inches='tight')
plt.show()

print(f"\n💾 Plot saved: {DATA_PATH}/feature_importance.png")

---
## 9. Saving the Model

In [None]:
# Save the model and the scaler
model_path = f"{MODEL_PATH}/fundamental_predictor.pkl"
scaler_path = f"{MODEL_PATH}/feature_scaler.pkl"

joblib.dump(model, model_path)
joblib.dump(scaler, scaler_path)

print(f"💾 Model saved: {model_path}")
print(f"💾 Scaler saved: {scaler_path}")

In [None]:
# Save the model metadata
metadata = {
    'features': available_features,
    'targets': available_targets,
    'rf_params': RF_PARAMS,
    'training_samples': len(X_train),
    'test_samples': len(X_test),
    'avg_r2': results_df['R2'].mean(),
    'avg_mae': results_df['MAE'].mean(),
    'created': datetime.now().isoformat()
}

# Save as JSON
import json
metadata_path = f"{MODEL_PATH}/fundamental_predictor_metadata.json"
with open(metadata_path, 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"💾 Metadata saved: {metadata_path}")

In [None]:
# Save the evaluation results
results_path = f"{DATA_PATH}/fundamental_predictor_results.csv"
results_df.to_csv(results_path, index=False)

print(f"💾 Results saved: {results_path}")

---
## 10. Summary and Next Steps

### ✅ Completed:

| Task | Status |
|------|--------|
| Training data preparation | ✅ |
| Multi-Output RF training | ✅ |
| Evaluation (MAE, RMSE, R²) | ✅ |
| Feature importance analysis | ✅ |
| Model saved | ✅ |

### 📁 Files created:

| File | Description |
|------|-------------|
| `models/fundamental_predictor.pkl` | Trained RF model |
| `models/feature_scaler.pkl` | StandardScaler for features |
| `models/fundamental_predictor_metadata.json` | Model metadata |
| `data/fundamental_predictor_results.csv` | Evaluation metrics |

### ➡️ Next notebook:

**Notebook 03: Filling in Historical Data**
- Use the trained model to impute 2015-2024
- Validate the predicted values
- Build the complete 10-year dataset
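As a rough preview of the first step, the sketch below shows the save, load, and predict round trip with `joblib`. It trains a tiny stand-in model on random data and writes to a temporary directory; the shapes, file names, and variable names are assumptions for illustration, not the contents of Notebook 03.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Stand-in training data (hypothetical shapes: 4 features, 3 targets).
X_fit = rng.normal(size=(100, 4))
Y_fit = rng.normal(size=(100, 3))

demo_scaler = StandardScaler().fit(X_fit)
demo_model = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=10, random_state=42)
).fit(demo_scaler.transform(X_fit), Y_fit)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Persist both artifacts, then load them back as a later notebook would.
    joblib.dump(demo_model, os.path.join(tmp_dir, "model.pkl"))
    joblib.dump(demo_scaler, os.path.join(tmp_dir, "scaler.pkl"))

    loaded_model = joblib.load(os.path.join(tmp_dir, "model.pkl"))
    loaded_scaler = joblib.load(os.path.join(tmp_dir, "scaler.pkl"))

# Rows whose targets are missing: scale with the saved scaler, then predict.
X_new = rng.normal(size=(5, 4))
imputed = loaded_model.predict(loaded_scaler.transform(X_new))
print(imputed.shape)
```

The scaler has to be the one fitted on the training data; refitting it on the rows being imputed would shift the inputs relative to what the model saw during training.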

In [None]:
# Final summary
print("="*70)
print("📊 NOTEBOOK 02 - SUMMARY")
print("="*70)

print(f"\n🌲 Model: Multi-Output Random Forest")
print(f"   • Trees: {RF_PARAMS['n_estimators']}")
print(f"   • Depth: {RF_PARAMS['max_depth']}")

print(f"\n📊 Results:")
print(f"   • Average R²: {results_df['R2'].mean():.3f}")
print(f"   • Average MAE: {results_df['MAE'].mean():.3f}")
print(f"   • Best target: {best_target} (R²={results_df.loc[best_idx, 'R2']:.3f})")

print(f"\n🔝 Top 3 most important features:")
for i, (_, row) in enumerate(importance_df.head(3).iterrows()):
    print(f"   {i+1}. {row['Feature']}: {row['Importance']:.3f}")

print(f"\n✅ Model ready for Notebook 03!")