# DSE Unified Multi-Company LSTM Predictor

This notebook trains a **single, unified LSTM model** to predict future stock prices for all companies in a dataset. Instead of training one model per company, this approach allows the model to learn general market patterns from all available data, which can lead to better generalization and performance.

### Key Changes from the Per-Scrip Approach:

1.  **Global Scaler**: A single `MinMaxScaler` is fit on the training data of **all companies**. This ensures that data from different stocks (with varying price ranges) is normalized consistently.
2.  **Company Embeddings**: To distinguish between companies, each `Scrip` is assigned a unique integer ID. An `Embedding` layer in the model learns a unique vector representation for each company, capturing its specific characteristics.
3.  **Multi-Input Model**: The Keras Functional API is used to build a model that accepts two inputs:
    - The time-series data (e.g., 60 days of OHLCV).
    - The integer ID of the company (`Scrip`).
4.  **Centralized Artifacts**: Only one model file and one scaler file are saved for each forecast horizon (e.g., 1, 3, 7 days).

## 1) Imports & Reproducibility

In [2]:
import os
import json
import joblib
import numpy as np
import pandas as pd

from sklearn.preprocessing import MinMaxScaler
from typing import Tuple, Dict, Any, List

# TensorFlow / Keras
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Reproducibility
def set_global_seed(seed: int = 42):
    import random
    random.seed(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)

set_global_seed(42)

print("TensorFlow Version:", tf.__version__)

2025-08-23 12:29:16.406384: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


TensorFlow Version: 2.20.0


## 2) Configuration

In [2]:
# === Required: set your CSV path here ===
CSV_PATH = "./dataset/merged_stock_data.csv"  # <- CHANGE THIS to your dataset path

# === Columns ===
DATE_COLUMN = "Date"
SCRIP_COLUMN = "Scrip"
TARGET_COLUMN = "Close"   # The column we want to predict

# === Features ===
# Using multivariate features is highly recommended for a unified model.
# The first feature should be the one you want to predict (TARGET_COLUMN).
FEATURE_COLS = ["Close", "Open", "High", "Low", "Volume"]

# === Sequence & Horizons ===
SEQ_LEN = 60
HORIZONS = [1, 3, 7] # Train a model for each forecast horizon

# === Data Splits ===
TRAIN_RATIO = 0.8
VAL_RATIO = 0.1 # 10% for validation, 10% for testing

# === Training Hyperparameters ===
MAX_EPOCHS = 50 # Reduced epochs as the dataset is much larger
BATCH_SIZE = 256 # Increased batch size for larger dataset
LEARNING_RATE = 1e-3
PATIENCE = 5  # Early stopping patience

# === Save Locations ===
SAVE_DIR = "./artifacts_unified"
os.makedirs(SAVE_DIR, exist_ok=True)

print("Artifacts will be saved to:", os.path.abspath(SAVE_DIR))

Artifacts will be saved to: /mnt/Work/projects/stock_cast/predictor/artifacts_unified


## 3) Data Loading & Preparation

This is the most modified section. We now perform the following steps:
1.  Load all data and create a mapping from `Scrip` names to integer IDs.
2.  Split the entire dataset by date into train, validation, and test sets.
3.  Fit a **single, global `MinMaxScaler`** on the `FEATURE_COLS` of the **training set only**.
4.  Group the data by `Scrip` and build sequences for each company separately to avoid creating windows that span across different stocks.
5.  Concatenate all sequences into final `X` (features), `X_scrip` (company IDs), and `y` (targets) arrays.

In [3]:
def load_and_preprocess_data(csv_path: str) -> Tuple[pd.DataFrame, Dict[str, int]]:
    """Loads data, sorts it, and creates a scrip-to-ID mapping."""
    df = pd.read_csv(csv_path, parse_dates=[DATE_COLUMN])
    df = df.sort_values([SCRIP_COLUMN, DATE_COLUMN]).reset_index(drop=True)
    df = df.dropna(subset=FEATURE_COLS)

    # Create Scrip to Integer ID mapping
    scrips = df[SCRIP_COLUMN].unique()
    scrip_to_id = {scrip: i for i, scrip in enumerate(scrips)}
    df['ScripID'] = df[SCRIP_COLUMN].map(scrip_to_id)

    return df, scrip_to_id

def build_all_sequences(df: pd.DataFrame, scrip_col: str, feature_cols: List[str], seq_len: int, n_ahead: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    """Builds sequences for all scrips and concatenates them."""
    all_X, all_X_scrip, all_y = [], [], []

    # Group by scrip and build sequences for each one
    for scrip_id, group in df.groupby('ScripID'):
        values = group[feature_cols].values
        T = len(values)

        if T < seq_len + n_ahead:
            continue # Skip scrips with not enough data

        for i in range(T - seq_len - n_ahead + 1):
            window = values[i : i + seq_len]
            target = values[i + seq_len : i + seq_len + n_ahead, 0] # Target is the first feature_col (Close)

            all_X.append(window)
            all_y.append(target)
            all_X_scrip.append(scrip_id)

    return np.array(all_X), np.array(all_X_scrip), np.array(all_y)

# --- Main Data Preparation Flow ---
df_all, scrip_to_id = load_and_preprocess_data(CSV_PATH)
n_scrips = len(scrip_to_id)
print(f"Loaded data for {n_scrips} companies.")

# Split data chronologically
n = len(df_all)
train_end = int(n * TRAIN_RATIO)
val_end = int(n * (TRAIN_RATIO + VAL_RATIO))

df_train = df_all.iloc[:train_end]
df_val = df_all.iloc[train_end:val_end]
df_test = df_all.iloc[val_end:]

# Fit ONE scaler on the training data only
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(df_train[FEATURE_COLS])

# Scale all datasets
df_train.loc[:, FEATURE_COLS] = scaler.transform(df_train[FEATURE_COLS])
df_val.loc[:, FEATURE_COLS] = scaler.transform(df_val[FEATURE_COLS])
df_test.loc[:, FEATURE_COLS] = scaler.transform(df_test[FEATURE_COLS])

print(f"Training set size: {len(df_train)}")
print(f"Validation set size: {len(df_val)}")
print(f"Test set size: {len(df_test)}")

Loaded data for 464 companies.
Training set size: 1105256
Validation set size: 138157
Test set size: 138158


## 4) Unified Model Builder

We use the Keras Functional API to create a model with two input branches:
1.  **Time Series Input**: A standard LSTM path to process the sequence of price/volume data.
2.  **Scrip ID Input**: An `Embedding` layer that learns a dense vector for each company. This vector acts as a unique 'signature' for the stock.

These two paths are then concatenated before making the final prediction, allowing the model to combine general market patterns (from the LSTM) with company-specific characteristics (from the Embedding).

In [4]:
def build_unified_model(seq_len: int, n_features: int, n_scrips: int, n_ahead: int, lr: float = LEARNING_RATE) -> keras.Model:
    """Builds a multi-input Keras model with LSTM and Embedding layers."""
    # Input for time-series data
    ts_input = layers.Input(shape=(seq_len, n_features), name='ts_input')

    # Input for the scrip ID
    scrip_input = layers.Input(shape=(1,), name='scrip_input')

    # --- Branch 1: LSTM for time-series processing ---
    x1 = layers.LSTM(128, return_sequences=True)(ts_input)
    x1 = layers.Dropout(0.3)(x1)
    x1 = layers.LSTM(64)(x1)
    x1 = layers.Dropout(0.3)(x1)

    # --- Branch 2: Embedding for scrip identity ---
    # Embedding dimension can be tuned. A common heuristic is sqrt(n_scrips).
    embedding_dim = int(np.sqrt(n_scrips))
    x2 = layers.Embedding(input_dim=n_scrips, output_dim=embedding_dim, name='embedding')(scrip_input)
    x2 = layers.Flatten()(x2) # Flatten the embedding output

    # --- Concatenate branches ---
    concatenated = layers.concatenate([x1, x2], name='concatenation')

    # --- Output layer ---
    output = layers.Dense(64, activation='relu')(concatenated)
    output = layers.Dense(n_ahead, name='output')(output)

    # --- Build and compile model ---
    model = keras.Model(inputs=[ts_input, scrip_input], outputs=output)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss="mse", metrics=["mae"])

    return model

def get_callbacks():
    es = keras.callbacks.EarlyStopping(monitor="val_loss", patience=PATIENCE, restore_best_weights=True, verbose=1)
    rlrop = keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.2, patience=max(2, PATIENCE//2), min_lr=1e-6, verbose=1)
    return [es, rlrop]

## 5) Training & Evaluation Loop

The main loop now iterates through the `HORIZONS` (1, 3, 7 days). For each horizon:
1.  Builds the sequences for all datasets (train, val, test).
2.  Builds and trains the unified model.
3.  Saves the trained model, the global scaler, and the scrip-to-ID mapping.
4.  Evaluates the final model on the test set.

In [5]:
from math import sqrt

def rmse(y_true, y_pred):
    return sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred, eps=1e-8):
    return np.mean(np.abs((y_true - y_pred) / np.maximum(np.abs(y_true), eps))) * 100.0

def inverse_transform_target(arr: np.ndarray, scaler: MinMaxScaler, n_features: int) -> np.ndarray:
    """Inverse transforms only the target column."""
    # Create a dummy array of the original feature shape, filled with zeros
    dummy_array = np.zeros((len(arr), n_features))
    # Place the scaled target data into the first column
    dummy_array[:, 0] = arr.ravel()
    # Inverse transform the entire dummy array
    unscaled_array = scaler.inverse_transform(dummy_array)
    # Return only the first column (the unscaled target)
    return unscaled_array[:, 0]

# --- Main Training Orchestration ---
for n_ahead in HORIZONS:
    print(f"\n{'='*50}")
    print(f"Training model for N_AHEAD = {n_ahead}")
    print(f"{'='*50}\n")

    # 1. Build sequences for this horizon
    print("Building sequences...")
    X_train, X_scrip_train, y_train = build_all_sequences(df_train, SCRIP_COLUMN, FEATURE_COLS, SEQ_LEN, n_ahead)
    X_val, X_scrip_val, y_val = build_all_sequences(df_val, SCRIP_COLUMN, FEATURE_COLS, SEQ_LEN, n_ahead)
    X_test, X_scrip_test, y_test = build_all_sequences(df_test, SCRIP_COLUMN, FEATURE_COLS, SEQ_LEN, n_ahead)
    print(f"Train sequences: {X_train.shape[0]}, Val sequences: {X_val.shape[0]}, Test sequences: {X_test.shape[0]}")

    # 2. Build the model
    n_features = len(FEATURE_COLS)
    model = build_unified_model(SEQ_LEN, n_features, n_scrips, n_ahead)
    model.summary()

    # 3. Train the model
    print("\nTraining model...")
    history = model.fit(
        [X_train, X_scrip_train],
        y_train,
        epochs=MAX_EPOCHS,
        batch_size=BATCH_SIZE,
        validation_data=([X_val, X_scrip_val], y_val),
        callbacks=get_callbacks(),
        shuffle=True, # Shuffle the combined dataset
    )

    # 4. Evaluate on the test set
    print("\nEvaluating on test data...")
    y_pred_scaled = model.predict([X_test, X_scrip_test])

    # Inverse transform for metrics calculation (per step in horizon)
    rmse_list, mape_list = [], []
    for step in range(n_ahead):
        y_true_step = inverse_transform_target(y_test[:, step], scaler, n_features)
        y_pred_step = inverse_transform_target(y_pred_scaled[:, step], scaler, n_features)
        rmse_list.append(rmse(y_true_step, y_pred_step))
        mape_list.append(mape(y_true_step, y_pred_step))

    print(f"\n--- Test Metrics for N_AHEAD = {n_ahead} ---")
    print(f"Mean RMSE across horizon: {np.mean(rmse_list):.4f}")
    print(f"Mean MAPE across horizon: {np.mean(mape_list):.4f}%")
    print(f"RMSE per step: {[round(x, 4) for x in rmse_list]}")
    print(f"MAPE per step: {[round(x, 4) for x in mape_list]}")

    # 5. Save artifacts
    print("\nSaving artifacts...")
    model_path = os.path.join(SAVE_DIR, f"unified_lstm_nahead{n_ahead}.keras")
    scaler_path = os.path.join(SAVE_DIR, "global_scaler.bin")
    scrip_map_path = os.path.join(SAVE_DIR, "scrip_to_id.json")

    model.save(model_path)
    joblib.dump(scaler, scaler_path)
    with open(scrip_map_path, 'w') as f:
        json.dump(scrip_to_id, f)

    print(f"Model saved to: {model_path}")
    print(f"Scaler saved to: {scaler_path}")
    print(f"Scrip map saved to: {scrip_map_path}")


Training model for N_AHEAD = 1

Building sequences...
Train sequences: 1085381, Val sequences: 135155, Test sequences: 135195


2025-08-20 12:33:33.348843: E external/local_xla/xla/stream_executor/cuda/cuda_platform.cc:51] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)



Training model...
Epoch 1/50
[1m4240/4240[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m351s[0m 82ms/step - loss: 2.8961e-07 - mae: 2.3552e-04 - val_loss: 5.2772e-06 - val_mae: 0.0018 - learning_rate: 0.0010
Epoch 2/50
[1m4240/4240[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m360s[0m 85ms/step - loss: 3.1455e-08 - mae: 1.2587e-04 - val_loss: 1.2946e-06 - val_mae: 8.8843e-04 - learning_rate: 0.0010
Epoch 3/50
[1m4240/4240[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 81ms/step - loss: 1.5761e-08 - mae: 8.6692e-05
Epoch 3: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.
[1m4240/4240[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m359s[0m 85ms/step - loss: 1.4221e-08 - mae: 8.1024e-05 - val_loss: 3.3651e-07 - val_mae: 4.5568e-04 - learning_rate: 0.0010
Epoch 4/50
[1m4240/4240[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m360s[0m 85ms/step - loss: 5.0254e-09 - mae: 3.7283e-05 - val_loss: 3.4457e-07 - val_mae: 4.5857e-04 - learning_rate: 2.0000e-04
E


Training model...
Epoch 1/50
[1m4238/4238[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m375s[0m 88ms/step - loss: 4.0926e-07 - mae: 1.7893e-04 - val_loss: 2.0817e-05 - val_mae: 0.0036 - learning_rate: 0.0010
Epoch 2/50
[1m4238/4238[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m372s[0m 88ms/step - loss: 3.7033e-08 - mae: 1.3208e-04 - val_loss: 8.0957e-06 - val_mae: 0.0023 - learning_rate: 0.0010
Epoch 3/50
[1m4237/4238[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 84ms/step - loss: 2.4364e-08 - mae: 1.0757e-04
Epoch 3: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.
[1m4238/4238[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m370s[0m 87ms/step - loss: 2.1590e-08 - mae: 1.0050e-04 - val_loss: 3.6789e-06 - val_mae: 0.0015 - learning_rate: 0.0010
Epoch 4/50
[1m4238/4238[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m372s[0m 88ms/step - loss: 7.6703e-09 - mae: 4.5806e-05 - val_loss: 3.6919e-06 - val_mae: 0.0015 - learning_rate: 2.0000e-04
Epoch 5/50
[


Training model...
Epoch 1/50
[1m4233/4233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m382s[0m 90ms/step - loss: 3.2756e-07 - mae: 1.3653e-04 - val_loss: 6.5699e-06 - val_mae: 0.0018 - learning_rate: 0.0010
Epoch 2/50
[1m4233/4233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m379s[0m 90ms/step - loss: 3.9347e-08 - mae: 9.2256e-05 - val_loss: 1.8426e-06 - val_mae: 7.7084e-04 - learning_rate: 0.0010
Epoch 3/50
[1m4232/4233[0m [32m━━━━━━━━━━━━━━━━━━━[0m[37m━[0m [1m0s[0m 85ms/step - loss: 3.4678e-08 - mae: 8.8392e-05
Epoch 3: ReduceLROnPlateau reducing learning rate to 0.00020000000949949026.
[1m4233/4233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m375s[0m 89ms/step - loss: 3.3846e-08 - mae: 8.7110e-05 - val_loss: 1.1289e-07 - val_mae: 1.6821e-04 - learning_rate: 0.0010
Epoch 4/50
[1m4233/4233[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m376s[0m 89ms/step - loss: 1.8068e-08 - mae: 6.5475e-05 - val_loss: 1.7266e-07 - val_mae: 2.0402e-04 - learning_rate: 2.0000e-04
E

## 6) Example: How to Load and Predict for a Single Company

This shows the new workflow for making a prediction:
1.  Load the unified model, the **global** scaler, and the scrip-to-ID map.
2.  Select a company and get its historical data.
3.  Scale the data using the global scaler.
4.  Get the company's integer ID from the map.
5.  Feed both the scaled data and the ID into the model to get a prediction.

In [4]:
# --- Example Prediction ---
EXAMPLE_SCRIP = "ACI" # Choose a company from your dataset
HORIZON_TO_PREDICT = 7 # Choose which trained model to use (1, 3, or 7)

print(f"--- Running prediction for {EXAMPLE_SCRIP} with {HORIZON_TO_PREDICT}-day model ---")

# 1. Load artifacts
loaded_model = keras.models.load_model(os.path.join(SAVE_DIR, f"unified_lstm_nahead{HORIZON_TO_PREDICT}.keras"))
loaded_scaler = joblib.load(os.path.join(SAVE_DIR, "global_scaler.bin"))
with open(os.path.join(SAVE_DIR, "scrip_to_id.json"), 'r') as f:
    loaded_scrip_map = json.load(f)

# 2. Get the last SEQ_LEN days of data for the chosen scrip
# In a real application, you would fetch this from your database or a new CSV
scrip_df = df_all[df_all[SCRIP_COLUMN] == EXAMPLE_SCRIP].tail(SEQ_LEN)

if len(scrip_df) < SEQ_LEN:
    print(f"Error: Not enough data for {EXAMPLE_SCRIP} to make a prediction (need {SEQ_LEN}, have {len(scrip_df)}).")
else:
    # 3. Scale the features using the GLOBAL scaler
    scaled_features = loaded_scaler.transform(scrip_df[FEATURE_COLS])

    # 4. Get the scrip ID
    scrip_id = loaded_scrip_map.get(EXAMPLE_SCRIP)
    if scrip_id is None:
        print(f"Error: Scrip '{EXAMPLE_SCRIP}' not found in the training data.")
    else:
        # 5. Reshape inputs for the model
        X_pred = scaled_features.reshape(1, SEQ_LEN, len(FEATURE_COLS))
        X_scrip_pred = np.array([scrip_id]).reshape(1, 1)

        # 6. Predict
        pred_scaled = loaded_model.predict([X_pred, X_scrip_pred]).ravel()

        # 7. Inverse transform the prediction
        # We need to do this for each step of the horizon
        final_predictions = []
        for pred_val in pred_scaled:
            unscaled_pred = inverse_transform_target(np.array([pred_val]), loaded_scaler, len(FEATURE_COLS))
            final_predictions.append(unscaled_pred[0])

        print(f"\nPredicted closing prices for the next {HORIZON_TO_PREDICT} days:")
        for i, val in enumerate(final_predictions):
            print(f"  Day +{i+1}: {val:.2f}")

--- Running prediction for ACI with 7-day model ---


NameError: name 'SAVE_DIR' is not defined