# 📊 Stock Price Prediction - Part 1: Data & AI Model

## PHASES 2-3: Downloading Fundamental Data and Training the AI Model

**Author:** Bc. Jan Dub  
**Date:** October 2025  
**Google Colab Ready** ✅

---

### 🎯 Goal of this notebook:

1. **PHASE 2:** Download fundamental metrics (P/E, ROE, etc.) for the 2024-2025 period
2. **PHASE 3:** Train a Random Forest model that predicts fundamentals from OHLCV data

### 📝 What we will do:

- Load OHLCV data from Google Drive
- Download fundamental data with yfinance
- Merge OHLCV + fundamentals
- Train a multi-output Random Forest
- Evaluate the model (MAE, RMSE, R²)
- Analyze feature importance
- Save the model for later use

## 📦 1. Installing and Importing Libraries

In [None]:
# Install the required libraries (in Colab)
!pip install -q yfinance scikit-learn joblib matplotlib seaborn

In [None]:
# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import time
from datetime import datetime

# Machine Learning
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from joblib import dump, load

# yfinance for fundamental data
import yfinance as yf

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ Libraries loaded")

## ⚙️ 2. Configuration

In [None]:
# === CONFIGURATION ===

# Date range for the fundamental data
START_DATE = "2024-01-01"
END_DATE = "2025-10-31"

# Features from the OHLCV data
OHLCV_FEATURES = [
    'open', 'high', 'low', 'close', 'volume',
    'volatility', 'returns',
    'rsi_14', 'macd', 'macd_signal', 'macd_hist',
    'sma_3', 'sma_6', 'sma_12',
    'ema_3', 'ema_6', 'ema_12',
    'volume_change'
]

# Target fundamental metrics
FUNDAMENTAL_TARGETS = [
    'PE', 'PB', 'PS', 'EV_EBITDA',  # Valuation
    'ROE', 'ROA', 'Profit_Margin', 'Operating_Margin', 'Gross_Margin',  # Profitability
    'Debt_to_Equity', 'Current_Ratio', 'Quick_Ratio',  # Financial health
    'Revenue_Growth_YoY', 'Earnings_Growth_YoY'  # Growth
]

# Random Forest hyperparameters
RF_PARAMS = {
    'n_estimators': 100,
    'max_depth': 20,
    'min_samples_split': 5,
    'min_samples_leaf': 2,
    'random_state': 42,
    'n_jobs': -1
}

print("✅ Configuration set")
print(f"   • Period: {START_DATE} → {END_DATE}")
print(f"   • OHLCV features: {len(OHLCV_FEATURES)}")
print(f"   • Fundamental targets: {len(FUNDAMENTAL_TARGETS)}")
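
The feature list above assumes the CSV already contains these columns; the notebook does not show how they were computed. The following is a minimal sketch of how a few of them could be derived with pandas. It is an assumption, not the author's pipeline: the window lengths are read off the column names, and the toy frame is purely illustrative.

```python
import pandas as pd

# Toy price/volume history standing in for one ticker (illustrative values)
toy = pd.DataFrame({
    "close": [10.0, 11.0, 12.0, 11.0, 13.0],
    "volume": [100.0, 120.0, 90.0, 90.0, 180.0],
})

# Period-over-period relative change of the close
toy["returns"] = toy["close"].pct_change()

# Simple moving average over the last 3 rows (NaN until 3 rows exist)
toy["sma_3"] = toy["close"].rolling(3).mean()

# Exponential moving average with span 3 (recursive form, alpha = 2 / (span + 1))
toy["ema_3"] = toy["close"].ewm(span=3, adjust=False).mean()

# Period-over-period relative change of the volume
toy["volume_change"] = toy["volume"].pct_change()

print(toy)
```

The first row of `returns` and `volume_change`, and the first two rows of `sma_3`, are NaN; such rows are later removed by the `dropna` step in the merge cell.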

## 📂 3. Loading the OHLCV Data

### Google Drive Upload:
1. Upload the file `all_sectors_full_10y.csv` to Google Drive
2. Mount Google Drive in Colab
3. Run the following cell

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Path to the data (adjust to your folder structure)
OHLCV_PATH = '/content/drive/MyDrive/StockPrediction/all_sectors_full_10y.csv'

print("✅ Google Drive mounted")

In [None]:
# Load the OHLCV data
print("📂 Loading OHLCV data...")

ohlcv_df = pd.read_csv(OHLCV_PATH)
ohlcv_df['date'] = pd.to_datetime(ohlcv_df['date'])

print(f"✅ Loaded {len(ohlcv_df):,} rows")
print(f"   • Period: {ohlcv_df['date'].min()} → {ohlcv_df['date'].max()}")
print(f"   • Tickers: {ohlcv_df['ticker'].nunique()}")
# dropna/astype(str) keep the join from failing on missing or non-string sectors
print(f"   • Sectors: {', '.join(ohlcv_df['sector'].dropna().astype(str).unique())}")

# Preview the data
display(ohlcv_df.head())
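
Later cells rely on the CSV having `date`, `ticker`, `sector` and every name in `OHLCV_FEATURES`. A small sketch of a schema check that could run right after loading follows; `missing_columns` is a hypothetical helper, not part of the original notebook, and the toy frame stands in for `ohlcv_df`.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df, in the given order."""
    return [col for col in required if col not in df.columns]

# Toy frame standing in for ohlcv_df (illustrative only)
toy = pd.DataFrame({"date": ["2024-01-31"], "ticker": ["AAA"], "close": [10.0]})
required = ["date", "ticker", "sector", "close", "volume"]

missing = missing_columns(toy, required)
if missing:
    print(f"Missing columns: {missing}")
```

In the notebook, the same call with `['date', 'ticker', 'sector'] + OHLCV_FEATURES` would surface a mismatched file before the slow download step starts.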

## 📥 4. PHASE 2: Downloading Fundamental Data

We download fundamental metrics with yfinance for the 2024-2025 period.

In [None]:
# Helper functions for downloading fundamentals

def safe_get_info(ticker, key, default=None):
    """Safely read a value from ticker.info."""
    try:
        info = ticker.info
        return info.get(key, default)
    except Exception:
        return default

def get_col(df, candidates):
    """Return the first column from candidates that exists in df (all-NaN series if none)."""
    for col in candidates:
        if col in df.columns:
            return df[col]
    return pd.Series(np.nan, index=df.index)

def calculate_quarterly_fundamentals(ticker_str):
    """Download quarterly statements and compute fundamental metrics."""
    try:
        ticker = yf.Ticker(ticker_str)

        # Basic info: a current snapshot, reused for every quarter below
        market_cap = safe_get_info(ticker, 'marketCap')
        shares_outstanding = safe_get_info(ticker, 'sharesOutstanding')

        # Quarterly statements (report dates become the index after transposing)
        financials = ticker.quarterly_financials.T
        balance_sheet = ticker.quarterly_balance_sheet.T

        if financials.empty or balance_sheet.empty:
            return pd.DataFrame()

        # Align the indexes
        financials.index = pd.to_datetime(financials.index)
        balance_sheet.index = pd.to_datetime(balance_sheet.index)

        # Merge and sort oldest-first so rolling windows and pct_change look
        # backward in time. The date filter is applied at the end, so TTM sums
        # and YoY growth can still use quarters before START_DATE.
        df = financials.join(balance_sheet, how='outer', rsuffix='_bs').sort_index()

        # Extract the line items
        total_revenue = get_col(df, ['Total Revenue', 'TotalRevenue'])
        net_income = get_col(df, ['Net Income', 'NetIncome'])
        ebitda = get_col(df, ['EBITDA', 'Ebitda'])
        operating_income = get_col(df, ['Operating Income', 'OperatingIncome'])
        gross_profit = get_col(df, ['Gross Profit', 'GrossProfit'])

        total_equity = get_col(df, ['Total Stockholder Equity', 'Stockholders Equity'])
        total_assets = get_col(df, ['Total Assets', 'TotalAssets'])
        total_debt = get_col(df, ['Total Debt', 'Long Term Debt'])
        current_assets = get_col(df, ['Current Assets', 'CurrentAssets'])
        current_liabilities = get_col(df, ['Current Liabilities', 'CurrentLiabilities'])
        cash = get_col(df, ['Cash And Cash Equivalents', 'Cash'])
        inventory = get_col(df, ['Inventory']).fillna(0)  # no inventory line: treat as 0

        # TTM (trailing twelve months). With min_periods=1 the earliest quarters
        # sum fewer than four quarters, so they understate a true TTM value.
        revenue_ttm = total_revenue.rolling(4, min_periods=1).sum()
        net_income_ttm = net_income.rolling(4, min_periods=1).sum()
        ebitda_ttm = ebitda.rolling(4, min_periods=1).sum()
        operating_income_ttm = operating_income.rolling(4, min_periods=1).sum()
        gross_profit_ttm = gross_profit.rolling(4, min_periods=1).sum()

        # Computed metrics
        result = pd.DataFrame(index=df.index)
        result['ticker'] = ticker_str

        # Valuation
        if market_cap and shares_outstanding:
            approx_price = market_cap / shares_outstanding
            eps_ttm = net_income_ttm / shares_outstanding
            result['PE'] = approx_price / eps_ttm.replace(0, np.nan)
            result['PS'] = market_cap / revenue_ttm.replace(0, np.nan)
        else:
            result['PE'] = np.nan
            result['PS'] = np.nan

        result['PB'] = (market_cap if market_cap else np.nan) / total_equity.replace(0, np.nan)

        # EV/EBITDA
        if market_cap:
            ev = market_cap + total_debt.fillna(0) - cash.fillna(0)
            result['EV_EBITDA'] = ev / ebitda_ttm.replace(0, np.nan)
        else:
            result['EV_EBITDA'] = np.nan

        # Profitability
        result['ROE'] = net_income_ttm / total_equity.replace(0, np.nan)
        result['ROA'] = net_income_ttm / total_assets.replace(0, np.nan)
        result['Profit_Margin'] = net_income_ttm / revenue_ttm.replace(0, np.nan)
        result['Operating_Margin'] = operating_income_ttm / revenue_ttm.replace(0, np.nan)
        result['Gross_Margin'] = gross_profit_ttm / revenue_ttm.replace(0, np.nan)

        # Financial health
        result['Debt_to_Equity'] = total_debt / total_equity.replace(0, np.nan)
        result['Current_Ratio'] = current_assets / current_liabilities.replace(0, np.nan)
        result['Quick_Ratio'] = (current_assets - inventory) / current_liabilities.replace(0, np.nan)

        # Growth (compares with the value four quarters earlier)
        result['Revenue_Growth_YoY'] = revenue_ttm.pct_change(periods=4)
        result['Earnings_Growth_YoY'] = net_income_ttm.pct_change(periods=4)

        # Keep only the configured period
        result = result[(result.index >= START_DATE) & (result.index <= END_DATE)]

        return result

    except Exception as e:
        print(f"  ❌ {ticker_str}: {e}")
        return pd.DataFrame()

print("✅ Fundamentals download functions ready")
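
The TTM and year-over-year steps are easy to get subtly wrong, so here is a worked toy example, assuming made-up quarterly revenue figures. It uses a strict four-quarter window (`min_periods=4`, unlike the cell above) to show how much history the growth metrics need: a YoY change of a TTM sum is only defined from the eighth quarter onward.

```python
import pandas as pd

# Made-up quarterly revenue, deliberately listed newest-first
revenue = pd.Series(
    [170.0, 160.0, 150.0, 140.0, 130.0, 120.0, 110.0, 100.0],
    index=pd.to_datetime([
        "2025-06-30", "2025-03-31", "2024-12-31", "2024-09-30",
        "2024-06-30", "2024-03-31", "2023-12-31", "2023-09-30",
    ]),
)

# Sort oldest-first so each window covers the current and three preceding quarters
revenue = revenue.sort_index()

# Strict trailing-twelve-month sum: NaN until four quarters are available
ttm = revenue.rolling(4, min_periods=4).sum()

# Year-over-year growth: compare each TTM value with the one four quarters earlier
yoy = ttm.pct_change(periods=4)

print(pd.DataFrame({"revenue": revenue, "ttm": ttm, "yoy": yoy}))
```

Only the last row has a growth value here (620 / 460 - 1, about 0.348). If the data provider returns fewer quarters than that, the growth columns stay NaN, and any later `dropna` over all targets removes those rows.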

In [None]:
# Download fundamental data for all tickers

print("📥 Downloading fundamental data...")
print(f"   Period: {START_DATE} → {END_DATE}")

tickers = ohlcv_df['ticker'].unique()
print(f"   Tickers: {len(tickers)}")

all_fundamentals = []
errors = 0

for i, ticker in enumerate(tickers, 1):
    if i % 10 == 0:
        print(f"   [{i}/{len(tickers)}] {ticker}...")

    df = calculate_quarterly_fundamentals(ticker)

    if not df.empty:
        # Add the sector
        sector = ohlcv_df[ohlcv_df['ticker'] == ticker]['sector'].iloc[0]
        df['sector'] = sector
        all_fundamentals.append(df)
    else:
        errors += 1

    time.sleep(0.3)  # Rate limiting

# Combine all data
if all_fundamentals:
    fundamentals_df = pd.concat(all_fundamentals, ignore_index=False)
    # Name the index explicitly so the column is 'date' whatever the index was called
    fundamentals_df = fundamentals_df.rename_axis('date').reset_index()

    print(f"\n✅ Downloaded {len(fundamentals_df)} rows")
    print(f"   • Successful tickers: {len(tickers) - errors}")
    print(f"   • Errors (failed or empty): {errors}")

    display(fundamentals_df.head())
else:
    print("❌ No data downloaded")
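
The loop above throttles requests with a fixed sleep but does not retry a failed ticker. A minimal sketch of a retry wrapper with exponential backoff follows; `with_retries` and `flaky` are hypothetical names introduced for illustration. It only helps if the wrapped function raises on failure, whereas `calculate_quarterly_fundamentals` currently catches exceptions and returns an empty frame.

```python
import time

def with_retries(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); after an exception wait base_delay * 2**n and retry, up to `attempts` calls."""
    for n in range(attempts):
        try:
            return func()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** n)

# Stand-in for a network call that fails twice and then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

# Record the delays instead of really sleeping, to keep the demo instant
delays = []
result = with_retries(flaky, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter is a design choice for testability; in the notebook the default `time.sleep` would be used.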

## 🔗 5. Merging OHLCV and Fundamental Data

In [None]:
# Merge OHLCV and fundamental data

print("🔗 Merging OHLCV and fundamental data...")

# Per ticker: attach the latest fundamentals dated on or before each OHLCV date
merged_parts = []

for ticker in fundamentals_df['ticker'].unique():
    # OHLCV for the ticker
    ohlcv_ticker = ohlcv_df[ohlcv_df['ticker'] == ticker].sort_values('date')

    # Fundamentals for the ticker
    fund_ticker = fundamentals_df[fundamentals_df['ticker'] == ticker].sort_values('date')

    # As-of merge instead of an exact-date join plus forward-fill: quarter-end
    # dates that are missing from the OHLCV calendar (e.g. weekends) would be
    # dropped by an exact left join and never forward-filled
    merged = pd.merge_asof(
        ohlcv_ticker,
        fund_ticker[['date'] + FUNDAMENTAL_TARGETS],
        on='date',
        direction='backward'
    )
    merged_parts.append(merged)

merged_df = pd.concat(merged_parts, ignore_index=True)

# Keep only the period for which we have fundamentals
merged_df = merged_df[merged_df['date'] >= START_DATE].copy()

# Drop rows with missing values
merged_df = merged_df.dropna(subset=OHLCV_FEATURES + FUNDAMENTAL_TARGETS)

print(f"✅ Merged: {len(merged_df):,} rows")
print(f"   • Tickers: {merged_df['ticker'].nunique()}")
print(f"   • Period: {merged_df['date'].min()} → {merged_df['date'].max()}")

display(merged_df.head())
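
Quarter-end report dates often fall on non-trading days, so matching them to price rows by exact date can silently lose them. The toy example below sketches an as-of alignment with `pandas.merge_asof`: each price row receives the most recent fundamentals dated on or before it. The frames and values are illustrative only.

```python
import pandas as pd

prices = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-28", "2024-04-01", "2024-06-28", "2024-07-01"]),
    "close": [10.0, 11.0, 12.0, 13.0],
})

# Both quarter ends below are Sundays, so neither date appears in `prices`
fundamentals = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-31", "2024-06-30"]),
    "ROE": [0.10, 0.12],
})

# Both inputs must be sorted by the key; 'backward' picks the last row at or before each date
aligned = pd.merge_asof(
    prices.sort_values("date"),
    fundamentals.sort_values("date"),
    on="date",
    direction="backward",
)
print(aligned)
```

The first price row predates any report and gets NaN; the two rows between the reports get 0.10, and the last row gets 0.12. Note that this aligns on the fiscal period end, not on the date the figures were actually published.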

## 🤖 6. PHASE 3: Training the AI Model

We train a multi-output Random Forest model.

In [None]:
# Prepare the training data

print("🔧 Preparing training data...")

X = merged_df[OHLCV_FEATURES].copy()
y = merged_df[FUNDAMENTAL_TARGETS].copy()

# Replace infinite values with NaN
X = X.replace([np.inf, -np.inf], np.nan)
y = y.replace([np.inf, -np.inf], np.nan)

# Drop rows with NaN
valid_mask = ~(X.isna().any(axis=1) | y.isna().any(axis=1))
X = X[valid_mask]
y = y[valid_mask]

print(f"✓ Valid samples: {len(X):,}")

# Train/test split (80/20)
# Note: rows are shuffled across tickers and dates. Because fundamentals are
# carried forward between reports, neighboring rows of one ticker share
# identical targets and can land on both sides of the split, which makes the
# test metrics optimistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

print(f"  • Train: {len(X_train):,} samples")
print(f"  • Test: {len(X_test):,} samples")

# Standardization (fit on the training set only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ Data ready for training")
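
A random row-level split can place rows from the same ticker, with identical carried-forward targets, in both train and test. One way to probe how much that flatters the metrics is a group-aware split that holds out whole tickers. The sketch below uses scikit-learn's `GroupShuffleSplit` on toy arrays; it is an alternative to compare against, not the split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 3 rows for each of 4 tickers
groups = np.array(["AAA"] * 3 + ["BBB"] * 3 + ["CCC"] * 3 + ["DDD"] * 3)
X = np.arange(24, dtype=float).reshape(12, 2)
y = np.arange(12, dtype=float)

# test_size is a fraction of the groups (tickers), not of the rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

train_tickers = set(groups[train_idx])
test_tickers = set(groups[test_idx])

print("train:", sorted(train_tickers))
print("test: ", sorted(test_tickers))
print("overlap:", train_tickers & test_tickers)
```

In the notebook, `groups` would be the `ticker` column of `merged_df` restricted to the same valid rows as `X`. A time-based cutoff is another option when the goal is forecasting future periods.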

In [None]:
# Train the model

print("🤖 Training the Random Forest model...")
print(f"   Parameters: {RF_PARAMS}")

start_time = time.time()

# One independent forest is fitted per target
model = MultiOutputRegressor(
    RandomForestRegressor(**RF_PARAMS)
)

model.fit(X_train_scaled, y_train.values)

elapsed = time.time() - start_time

print(f"✅ Training finished in {elapsed:.1f}s")
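
`MultiOutputRegressor` fits one separate forest per target, which is what later makes per-target feature importances available through `model.estimators_`. `RandomForestRegressor` also accepts a two-dimensional `y` directly and then grows a single forest whose trees predict all targets at once. The sketch below contrasts the two on synthetic data; the sizes and hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic data: 4 features, 2 targets
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 2] - X[:, 3]])

# Wrapper: one independent forest per target column
wrapped = MultiOutputRegressor(
    RandomForestRegressor(n_estimators=20, random_state=42)
).fit(X, Y)

# Native multi-output: one forest, every tree predicts both targets
native = RandomForestRegressor(n_estimators=20, random_state=42).fit(X, Y)

print(len(wrapped.estimators_))          # number of per-target forests
print(len(native.estimators_))           # number of trees in the single forest
print(wrapped.predict(X[:3]).shape, native.predict(X[:3]).shape)
```

The wrapper costs roughly one forest per target in training time and model size, in exchange for per-target importances; the native form yields only one shared `feature_importances_` vector.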

## 📊 7. Model Evaluation

In [None]:
# Evaluate on the test data

print("📊 Evaluating the model...\n")

y_pred = model.predict(X_test_scaled)

# Metrics for each target
results = []

for i, target in enumerate(FUNDAMENTAL_TARGETS):
    y_true_i = y_test.values[:, i]
    y_pred_i = y_pred[:, i]

    mae = mean_absolute_error(y_true_i, y_pred_i)
    rmse = np.sqrt(mean_squared_error(y_true_i, y_pred_i))
    r2 = r2_score(y_true_i, y_pred_i)

    # Relative MAE: MAE as a percentage of the mean absolute true value
    mean_val = np.abs(y_true_i).mean()
    mae_pct = (mae / mean_val * 100) if mean_val > 0 else np.nan

    results.append({
        'target': target,
        'mae': mae,
        'rmse': rmse,
        'r2': r2,
        'mae_pct': mae_pct
    })

results_df = pd.DataFrame(results)

print("\n📈 RESULTS BY METRIC:\n")
display(results_df.style.format({
    'mae': '{:.3f}',
    'rmse': '{:.3f}',
    'r2': '{:.3f}',
    'mae_pct': '{:.1f}%'
}))

print("\n📊 OVERALL AVERAGE:")
print(f"   • MAE: {results_df['mae'].mean():.3f}")
print(f"   • MAE%: {results_df['mae_pct'].mean():.1f}%")
print(f"   • RMSE: {results_df['rmse'].mean():.3f}")
print(f"   • R²: {results_df['r2'].mean():.3f}")

# Verdict
avg_mae_pct = results_df['mae_pct'].mean()
if avg_mae_pct < 15:
    print("\n✨ Excellent! The model reached the target accuracy (<15% MAE)")
elif avg_mae_pct < 20:
    print("\n👍 Good! The model is usable (15-20% MAE)")
else:
    print("\n⚠️ The model has a higher error (>20% MAE)")
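
The relative MAE above divides by the mean absolute true value, so targets that sit near zero (growth rates, margins) can report very large percentages for small absolute errors. A worked toy example, with made-up numbers and a hypothetical helper `relative_mae_pct` that mirrors the formula in the cell:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

def relative_mae_pct(y_true, y_pred):
    """MAE divided by the mean absolute true value, in percent (NaN if that mean is 0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    scale = np.abs(y_true).mean()
    if scale == 0:
        return np.nan
    return mean_absolute_error(y_true, y_pred) / scale * 100

# A ratio on the scale of tens: absolute errors of 2, 3 and 1
pe_true, pe_pred = [20.0, 30.0, 25.0], [22.0, 27.0, 26.0]

# A growth rate near zero: absolute errors of only 0.02 each
growth_true, growth_pred = [0.02, -0.01, 0.03], [0.04, 0.01, 0.01]

pe_pct = relative_mae_pct(pe_true, pe_pred)
growth_pct = relative_mae_pct(growth_true, growth_pred)
print(f"PE: {pe_pct:.1f}%  growth: {growth_pct:.1f}%")
```

The first target scores 8% and the second 100%, even though the second has much smaller absolute errors. An unweighted mean of MAE% across all targets can therefore be dominated by the near-zero metrics, so it is worth reading the per-target table alongside the average.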

In [None]:
# Plot relative MAE by metric

fig, ax = plt.subplots(figsize=(14, 6))

results_sorted = results_df.sort_values('mae_pct')
colors = ['green' if x < 15 else 'orange' if x < 20 else 'red' for x in results_sorted['mae_pct']]

ax.barh(results_sorted['target'], results_sorted['mae_pct'], color=colors, alpha=0.7)
ax.axvline(x=15, color='green', linestyle='--', linewidth=2, alpha=0.5, label='Target: 15%')
ax.set_xlabel('MAE (%)', fontsize=12)
ax.set_title('Relative MAE for each fundamental metric', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

## 🔍 8. Feature Importance Analysis

In [None]:
# Extract feature importance

print("🔍 Analyzing feature importance...\n")

importance_data = []

# model.estimators_ holds one fitted forest per target, in FUNDAMENTAL_TARGETS order
for i, estimator in enumerate(model.estimators_):
    target = FUNDAMENTAL_TARGETS[i]
    importances = estimator.feature_importances_

    for j, feature in enumerate(OHLCV_FEATURES):
        importance_data.append({
            'target': target,
            'feature': feature,
            'importance': importances[j]
        })

importance_df = pd.DataFrame(importance_data)

# Top 5 features for each target
print("TOP 5 FEATURES FOR EACH TARGET:\n")
for target in FUNDAMENTAL_TARGETS[:5]:  # First 5 targets only, to save space
    print(f"\n{target}:")
    target_imp = importance_df[importance_df['target'] == target].sort_values('importance', ascending=False)
    display(target_imp.head(5))

In [None]:
# Plot overall feature importance

fig, ax = plt.subplots(figsize=(12, 8))

# Mean importance across all targets
avg_importance = importance_df.groupby('feature')['importance'].mean().sort_values(ascending=True)

avg_importance.plot(kind='barh', ax=ax, color='steelblue', alpha=0.8)
ax.set_xlabel('Mean importance', fontsize=12)
ax.set_title('Overall feature importance (mean across all fundamentals)', fontsize=14, fontweight='bold')
ax.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()
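
The `feature_importances_` values used above are impurity-based and computed from the training data. As a cross-check, permutation importance measures how much a held-out score drops when one feature is shuffled. The sketch below uses `sklearn.inspection.permutation_importance` on synthetic single-target data; the data and settings are illustrative assumptions, not results from this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data in which only feature 0 carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=300)

# Fit on the first 200 rows, evaluate importance on the remaining 100
forest = RandomForestRegressor(n_estimators=50, random_state=42).fit(X[:200], y[:200])
result = permutation_importance(
    forest, X[200:], y[200:], n_repeats=5, random_state=42
)

# Mean drop in the default score (R²) per feature, averaged over the repeats
for idx, drop in enumerate(result.importances_mean):
    print(f"feature {idx}: {drop:.3f}")
```

For the notebook's model, the same call could be applied to each forest in `model.estimators_` with the matching column of `y_test`, using `X_test_scaled` as the evaluation set.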

## 💾 9. Saving the Model and Results

In [None]:
# Save the model and scaler to Google Drive

from joblib import dump

print("💾 Saving the model and results...")

# Paths (adjust to your folder structure)
MODEL_PATH = '/content/drive/MyDrive/StockPrediction/fundamental_predictor.pkl'
SCALER_PATH = '/content/drive/MyDrive/StockPrediction/feature_scaler.pkl'
METRICS_PATH = '/content/drive/MyDrive/StockPrediction/fundamental_metrics.csv'
IMPORTANCE_PATH = '/content/drive/MyDrive/StockPrediction/feature_importance.csv'

# Save
dump(model, MODEL_PATH)
dump(scaler, SCALER_PATH)
results_df.to_csv(METRICS_PATH, index=False)
importance_df.to_csv(IMPORTANCE_PATH, index=False)

print("✅ Saved:")
print(f"   • Model: {MODEL_PATH}")
print(f"   • Scaler: {SCALER_PATH}")
print(f"   • Metrics: {METRICS_PATH}")
print(f"   • Feature Importance: {IMPORTANCE_PATH}")
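
To reuse the saved artifacts, both files have to be loaded and the scaler applied before predicting, with features in the same column order as at training time. A self-contained round-trip sketch follows, using a temporary directory and synthetic data in place of the Google Drive paths and the real features (both are assumptions for illustration).

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data: 3 features, 2 targets
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = np.column_stack([X[:, 0], X[:, 1] + X[:, 2]])

scaler = StandardScaler().fit(X)
model = MultiOutputRegressor(RandomForestRegressor(n_estimators=10, random_state=42))
model.fit(scaler.transform(X), Y)

# Save both artifacts, then load them back as a later notebook would
with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    scaler_path = os.path.join(tmp, "scaler.pkl")
    dump(model, model_path)
    dump(scaler, scaler_path)
    loaded_model = load(model_path)
    loaded_scaler = load(scaler_path)

# Apply the loaded scaler first, then predict with the loaded model
new_rows = rng.normal(size=(5, 3))
pred = loaded_model.predict(loaded_scaler.transform(new_rows))
print(pred.shape)
```

Pickled scikit-learn objects are generally tied to the library version that wrote them, so it helps to record the installed scikit-learn version next to the saved files.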

## 🎉 10. Summary

### ✅ What we did:

1. ✅ Loaded OHLCV data from Google Drive
2. ✅ Downloaded fundamental metrics with yfinance (2024-2025)
3. ✅ Merged OHLCV + fundamentals
4. ✅ Trained a multi-output Random Forest model
5. ✅ Evaluated the model (MAE, RMSE, R²)
6. ✅ Analyzed feature importance
7. ✅ Saved the model for later use

### 📊 Results:

- Average relative error (MAE%): printed by the evaluation cell in section 7 (`avg_mae_pct`)
- Average R² score: printed by the same cell (`results_df['r2'].mean()`)

### 🔜 Next steps:

Open the **Part 2 Notebook** to:
- Backfill historical data (2015-2024)
- Train the price-prediction model
- Run the final analysis and visualizations