# 1. Fundamental Analysis Model Tuning

This notebook tunes the model that predicts missing fundamental data (e.g., P/E, ROE) from historical price data.

## Goal
Find the best hyperparameters for the regressor using **grid search** with **cross-validation**.

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
import joblib

%matplotlib inline

## 1. Load Data

In [None]:
# Paths
DATA_DIR = "../data/fundamentals"
OHLCV_DIR = "../data_10y"

# Load Fundamentals (Target)
fund_df = pd.read_csv(os.path.join(DATA_DIR, "all_sectors_fundamentals.csv"))
fund_df['date'] = pd.to_datetime(fund_df['date'])

# Load OHLCV (Features)
ohlcv_df = pd.read_csv(os.path.join(OHLCV_DIR, "all_sectors_full_10y.csv"))
ohlcv_df['date'] = pd.to_datetime(ohlcv_df['date'])

print(f"Fundamentals: {fund_df.shape}")
print(f"OHLCV: {ohlcv_df.shape}")

## 2. Prepare Training Data
Merge the OHLCV features with the fundamental targets over the period where both are available (approximately the last 1.5 years).

In [None]:
# As-of merge on date within each ticker: each price row takes the most recent
# fundamentals record on or before its date (fundamentals are quarterly).
# Simplified for demonstration; in production, use the exact matching logic from the scripts.
merged = pd.merge_asof(
    ohlcv_df.sort_values('date'),
    fund_df.sort_values('date'),
    on='date',
    by='ticker',
    direction='backward',
    tolerance=pd.Timedelta(days=90)  # Only accept fundamentals at most 90 days old (about one quarter)
)

# Drop rows with missing targets
target_col = 'PE' # Example target
data = merged.dropna(subset=[target_col])

print(f"Training samples: {len(data)}")
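
The matching rule used by the as-of merge above (`direction='backward'` with a 90-day tolerance) can be illustrated without pandas. The following is a minimal pure-Python sketch of the same semantics for a single ticker, not the pandas implementation. The helper name `asof_backward` and all dates are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from datetime import date, timedelta


def asof_backward(query_dates, record_dates, tolerance_days=90):
    """For each query date, return the latest record date on or before it,
    provided it is no more than `tolerance_days` old; otherwise None.

    `record_dates` must be sorted in ascending order.
    """
    tolerance = timedelta(days=tolerance_days)
    matches = []
    for query in query_dates:
        idx = bisect_right(record_dates, query) - 1  # last record <= query
        if idx >= 0 and query - record_dates[idx] <= tolerance:
            matches.append(record_dates[idx])
        else:
            matches.append(None)
    return matches


# Hypothetical quarterly report dates and price dates for one ticker
report_dates = [date(2024, 3, 31), date(2024, 6, 30)]
price_dates = [
    date(2024, 3, 15),   # before the first report: no match
    date(2024, 4, 10),   # 10 days after the Q1 report
    date(2024, 6, 30),   # exact match with the Q2 report
    date(2024, 10, 15),  # 107 days after the Q2 report: outside tolerance
]

matched = asof_backward(price_dates, report_dates)
for price_date, report_date in zip(price_dates, matched):
    print(price_date, "->", report_date)
```

Price rows that receive no match end up with a missing target, which is why the cell above drops rows where `target_col` is null.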

## 3. Grid Search with Cross Validation
We will tune a Random Forest Regressor to predict the P/E ratio, scoring each hyperparameter combination by its 5-fold cross-validated mean squared error.

In [None]:
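# Features
features = ['close', 'volume', 'rsi_14', 'macd', 'volatility']
# Drop rows with missing feature values (e.g., indicator warm-up periods)
model_data = data.dropna(subset=features)
X = model_data[features]
y = model_data[target_col]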

# Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(random_state=42))
])

# Parameter Grid
param_grid = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__max_depth': [None, 10, 20],
    'regressor__min_samples_split': [2, 5]
}

# Grid Search
grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print("Starting Grid Search...")
grid_search.fit(X, y)
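
To estimate how long the search will run, it helps to count the model fits it performs. The sketch below enumerates the same parameter grid with `itertools.product` and multiplies by the number of folds. It assumes the default `refit=True`, under which `GridSearchCV` fits the best candidate once more on the full data set.

```python
from itertools import product

# Same grid and fold count as in the cell above
param_grid = {
    'regressor__n_estimators': [50, 100, 200],
    'regressor__max_depth': [None, 10, 20],
    'regressor__min_samples_split': [2, 5],
}
n_splits = 5

# Every combination of one value per parameter is one candidate
candidates = [
    dict(zip(param_grid, values))
    for values in product(*param_grid.values())
]
n_candidates = len(candidates)
cv_fits = n_candidates * n_splits
total_fits = cv_fits + 1  # plus one final refit on all data (refit=True)

print(f"Candidates: {n_candidates}")
print(f"Cross-validation fits: {cv_fits}")
print(f"Total fits including refit: {total_fits}")
```

Each value added to any list in the grid multiplies the number of candidates, so the fit count grows quickly as the grid is extended.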

## 4. Results Analysis

In [None]:
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best MSE: {-grid_search.best_score_:.4f}")

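# Visualize Results
results_df = pd.DataFrame(grid_search.cv_results_)
# Convert the negated score back to a positive MSE, and cast max_depth to a
# string so that max_depth=None is plotted as its own category
results_df['mean_mse'] = -results_df['mean_test_score']
results_df['n_estimators'] = results_df['param_regressor__n_estimators'].astype(int)
results_df['max_depth'] = results_df['param_regressor__max_depth'].astype(str)
plt.figure(figsize=(10, 6))
sns.lineplot(data=results_df, x='n_estimators', y='mean_mse', hue='max_depth')
plt.title('Grid Search Results: n_estimators vs MSE')
plt.ylabel('Mean CV MSE (Lower is Better)')
plt.show()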
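
The sign flip in `-grid_search.best_score_` follows from scikit-learn's convention that higher scores are better, so the `neg_mean_squared_error` scorer reports the MSE with a negative sign. The sketch below reproduces that arithmetic in plain Python and also derives the RMSE, which is in the same units as the target. The helper `mse` and the sample values are hypothetical, not results from this notebook.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical P/E values and predictions, for illustration only
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 33.0]

score = -mse(y_true, y_pred)   # what a "neg MSE" scorer would report
best_mse = -score              # flip the sign back to a positive MSE
rmse = math.sqrt(best_mse)     # same units as the target (P/E points)

print(f"Score (negative MSE): {score:.4f}")
print(f"MSE: {best_mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```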