# Steps Followed in This Assignment Attempt
> All experiments were done on Dataset-1, requests_opened_external (due to time constraints).
> From here on, Dataset-1 is requests_opened_external and Dataset-2 is requests_closed_external.
> 1) **Cleaning and Reframing Dataset-1**: Dataset-1 contained several Volume entries per date (one per time stamp). After normalizing the column names (Date, Time, Volume -> date, time, volume), the volumes were summed across times for each date. The date format, mm/dd/yyyy, also differed from the yyyy-mm-dd format of Dataset-2, so it was converted. In short, Dataset-1 was brought into the format of Dataset-2 (the reference dataset). Some days were missing; comparison with the calendar showed that most of them were weekends and a few were weekdays (suggesting voluntary or business holidays). These missing days were not used in fine-tuning the LLMs; the reason is given in the Feature Engineering and Modelling section.

> 2) **Feature Engineering and Modelling**:
> a) Built several features: lags, rolling-window statistics, EMAs, calendar features (including gap-day features that capture holidays) and Fourier terms (over the trading index and the year), plus a standardized volume history. The target is z = (ln(1 + y_target) - mu) / sigma, where y_target = next_day_volume and mu, sigma are the mean and standard deviation of ln(1 + y_target). So the target variable is now z.
> b) Selected the top 16 features by correlation between each feature and the target variable, using Pearson coefficients -> Feature_Group_1
> c) Selected the top 16 features using XGBoost -> Feature_Group_2

> 3) **Model Architecture and Training**: LoRA was used for parameter-efficient training. It reduced the trainable parameters to about 1.13% of the total for TinyLlama (as reported by `print_trainable_parameters()` in the training log) and about 0.5% for Qwen 2.5 7B. An Nvidia A100 (40 GB) was used.
> a) TinyLlama 1.1B Chat and Qwen 2.5 7B were fine-tuned.
> b) Before moving to Qwen 2.5 7B, three experiments were performed on TinyLlama 1.1B: i) fine-tuning on Feature_Group_1 for 3 epochs; ii) training a numerical value head for 5 epochs on top of the TinyLlama fine-tuned on Feature_Group_1 (3 epochs); iii) fine-tuning on Feature_Group_2 for 5 epochs.
> c) Qwen 2.5 7B was then fine-tuned on Feature_Group_1 for 3 epochs.

> 4) **Model Evaluation**: RMSE, MSE, MAE and MAPE were used to assess the trained models. These values were compared against naive baselines: seasonal naive (lag = 5 days) and moving average (window = 7 days). TinyLlama trained on Feature_Group_2 for 5 epochs performed best, and Qwen 2.5 7B trained on Feature_Group_1 for 3 epochs came very close to it.
>5) **Inference**: My hypothesis is that if Qwen is trained on Feature_Group_2 or Feature_Group_1 for 5 epochs, it could easily outperform TinyLlama.

>6) **Production**: Both models can be exported for production once the LoRA adapters are included (in one case where I did not use the LoRA adapters, my predictions blew up to inf).
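
The naive baselines named in step 4 are not shown in this part of the notebook. A minimal sketch of what they could look like follows, assuming the series contains trading days only (so a lag of 5 is roughly one trading week). The function names `seasonal_naive`, `moving_average` and `score` and the toy volumes are illustrative assumptions, not the code used for the reported comparison.

```python
import math

def seasonal_naive(series, lag=5):
    """Predict each value with the one observed `lag` trading days earlier."""
    return [series[t - lag] for t in range(lag, len(series))]

def moving_average(series, window=7):
    """Predict each value with the mean of the previous `window` values."""
    return [sum(series[t - window:t]) / window for t in range(window, len(series))]

def score(y_true, y_pred):
    """MAE, MSE, RMSE and MAPE (in percent) for paired actuals and forecasts."""
    errors = [p - a for a, p in zip(y_true, y_pred)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mape = 100.0 * sum(abs(e) / max(abs(a), 1e-9) for a, e in zip(y_true, errors)) / n
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "MAPE%": mape}

# Toy daily volumes (made up for illustration, not taken from the dataset)
volumes = [100.0, 120.0, 90.0, 110.0, 130.0, 105.0, 125.0, 95.0, 115.0, 135.0]

# Each baseline is scored only on the days for which it can produce a forecast
sn_scores = score(volumes[5:], seasonal_naive(volumes, lag=5))
ma_scores = score(volumes[7:], moving_average(volumes, window=7))
print("seasonal naive:", sn_scores)
print("moving average:", ma_scores)
```

For a fair comparison with the fine-tuned models, the baselines would need to be scored on the same validation rows and in the same original volume units.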


**PS**:
> a) Due to a bitsandbytes dependency problem, 4-bit/8-bit quantization was not done; the current setup runs in bf16 precision.
>b) I accidentally overwrote the TinyLlama run on Feature_Group_1 (3 epochs) with the TinyLlama run on Feature_Group_2 (5 epochs); it is attached in a separate notebook.
>c) The selection of the top-K features by Pearson coefficient happens only in the training code.

*Please Forgive Any Inconvenience Caused*.

Cleaning of Dataset-1 (requests_opened_external.csv)

In [None]:
import pandas as pd
import numpy as np

# ==== CONFIG ====
PATH_IN  = "/content/requests_opened_external.csv"
PATH_OUT = "/content/cleaned_requests_opened_daily.csv"

# ==== LOAD ====
df = pd.read_csv(PATH_IN)

# 1) normalize column titles to lower case
df.columns = df.columns.str.strip().str.lower()
date_col, volume_col = "date", "volume"

# 2) parse mm/dd/yyyy -> yyyy-mm-dd
df[date_col] = pd.to_datetime(df[date_col], format="%m/%d/%Y", errors="coerce")
bad_dates = df[date_col].isna().sum()
if bad_dates:
    print(f"[WARN] Dropping {bad_dates} rows with invalid dates.")
df = df.dropna(subset=[date_col])
df["date"] = df[date_col].dt.strftime("%Y-%m-%d")

# 3) ensure volume is numeric
df[volume_col] = pd.to_numeric(df[volume_col], errors="coerce")
nan_vols = df[volume_col].isna().sum()
if nan_vols:
    print(f"[WARN] {nan_vols} rows have non-numeric '{volume_col}'. Treating as 0 for checks.")
vol_before = df[volume_col].fillna(0)

# 4) sanity check before aggregation
non_integer_before = int(((vol_before.astype(float) % 1) != 0).sum())
dtype_before = df[volume_col].dtype
print(f"Before aggregation → dtype={dtype_before}, non-integer rows={non_integer_before}")

# 5) aggregate: sum across times per date
daily = df.groupby("date", as_index=False)[volume_col].sum(min_count=1)

# ---- POST-AGG CHECK ----

daily_vol = pd.to_numeric(daily[volume_col], errors="coerce").fillna(0).astype(float)
non_integer_after = int(((daily_vol % 1) != 0).sum())
dtype_after = daily[volume_col].dtype
print(f"After aggregation  → dtype={dtype_after}, non-integer daily rows={non_integer_after}")

# Enforce integer output (round as a guard, then cast)
daily["volume"] = daily_vol.round().astype("int64")
daily = daily.sort_values("date")

# SAVE
daily.to_csv(PATH_OUT, index=False)
print(f"Saved daily aggregates to: {PATH_OUT}")
print(daily.head(10))


Before aggregation → dtype=int64, non-integer rows=0
After aggregation  → dtype=int64, non-integer daily rows=0
Saved daily aggregates to: /content/cleaned_requests_opened_daily.csv
         date    volume
0  1998-01-02  10139741
1  1998-01-05  19720368
2  1998-01-06  14116033
3  1998-01-07  16859831
4  1998-01-08  15860442
5  1998-01-09  28586578
6  1998-01-12  24067349
7  1998-01-13  13279676
8  1998-01-14  16616568
9  1998-01-15  24826372
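
The calendar comparison described in step 1 (missing days split into weekends and weekdays) can be sketched as below. This is a standard-library illustration rather than the check actually run for the assignment; the helper name `split_missing_days` is hypothetical. The January dates are copied from the printed head above, and the February dates from the head of the feature table printed in the feature-engineering output.

```python
from datetime import date, timedelta

def split_missing_days(observed):
    """Return (weekend_gaps, weekday_gaps) between the first and last observed date."""
    days = sorted(date.fromisoformat(d) for d in observed)
    present = set(days)
    weekend_gaps, weekday_gaps = [], []
    cur = days[0]
    while cur <= days[-1]:
        if cur not in present:
            # weekday() is 0 for Monday ... 6 for Sunday
            (weekend_gaps if cur.weekday() >= 5 else weekday_gaps).append(cur)
        cur += timedelta(days=1)
    return weekend_gaps, weekday_gaps

# Dates from the printed head of the cleaned daily file
january = ["1998-01-02", "1998-01-05", "1998-01-06", "1998-01-07", "1998-01-08",
           "1998-01-09", "1998-01-12", "1998-01-13", "1998-01-14", "1998-01-15"]
jan_weekend, jan_weekday = split_missing_days(january)
print(len(jan_weekend), "weekend gaps,", len(jan_weekday), "weekday gaps")

# Dates from the head of the feature table (feature-engineering output)
february = ["1998-02-11", "1998-02-12", "1998-02-13", "1998-02-17", "1998-02-18"]
feb_weekend, feb_weekday = split_missing_days(february)
print("weekday gaps:", [d.isoformat() for d in feb_weekday])
```

A weekday that shows up in `weekday_gaps` is a candidate business holiday, which is the kind of gap the `gap_days_prev` / `gap_days_next` features are meant to capture.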


# Feature Engineering and Modelling

In [None]:
import pandas as pd
import numpy as np
import json

INPUT_CSV   = "/content/cleaned_requests_opened_daily.csv"
OUTPUT_CSV  = "/content/features_trading_only.csv"
SCALER_JSON = "/content/features_trading_only_scaler.json"

DATE_COL = "date"
VOL_COL  = "volume"

ROLL_WINDOWS = [7, 14, 28]
EMA_SPANS    = [7, 28]
FOURIER_K    = 3

df = pd.read_csv(INPUT_CSV)
if DATE_COL not in df.columns or VOL_COL not in df.columns:
    raise ValueError(f"Input must contain columns: '{DATE_COL}', '{VOL_COL}'")

df[DATE_COL] = pd.to_datetime(df[DATE_COL], errors="coerce")
df = df.dropna(subset=[DATE_COL]).sort_values(DATE_COL).reset_index(drop=True)


df = df.replace([np.inf, -np.inf], np.nan).ffill().bfill()
df[VOL_COL] = df[VOL_COL].astype(float)

# CALENDAR FEATURES
df["dow"]    = df[DATE_COL].dt.weekday
df["dom"]    = df[DATE_COL].dt.day
df["week"]   = df[DATE_COL].dt.isocalendar().week.astype(int)
df["month"]  = df[DATE_COL].dt.month
df["quarter"]= df[DATE_COL].dt.quarter
df["year"]   = df[DATE_COL].dt.year

df["is_month_start"]   = df[DATE_COL].dt.is_month_start.astype(int)
df["is_month_end"]     = df[DATE_COL].dt.is_month_end.astype(int)
df["is_quarter_start"] = df[DATE_COL].dt.is_quarter_start.astype(int)
df["is_quarter_end"]   = df[DATE_COL].dt.is_quarter_end.astype(int)
df["is_year_start"]    = df[DATE_COL].dt.is_year_start.astype(int)
df["is_year_end"]      = df[DATE_COL].dt.is_year_end.astype(int)

# gaps between trading days
df["gap_days_prev"] = (df[DATE_COL] - df[DATE_COL].shift(1)).dt.days
df["gap_days_next"] = (df[DATE_COL].shift(-1) - df[DATE_COL]).dt.days

# ROLLING / EMA / LAGS / MOMENTUM
v = df[VOL_COL]

for w in ROLL_WINDOWS:
    df[f"roll{w}_mean"] = v.rolling(w).mean()
    df[f"roll{w}_sum"]  = v.rolling(w).sum()
    df[f"roll{w}_std"]  = v.rolling(w).std()
    df[f"roll{w}_min"]  = v.rolling(w).min()
    df[f"roll{w}_max"]  = v.rolling(w).max()

for span in EMA_SPANS:
    df[f"ema{span}"] = v.ewm(span=span, adjust=False).mean()

df["lag1"] = v.shift(1)
df["lag2"] = v.shift(2)
df["lag5"] = v.shift(5)

df["ret1"] = (v - v.shift(1)) / (v.shift(1).abs() + 1.0)
df["ret7"] = (v - v.shift(7)) / (v.shift(7).abs() + 1.0)

# FOURIER FEATURES
# (A) Trading-index seasonality
t = np.arange(len(df))
for k in range(1, FOURIER_K+1):
    df[f"fourier_t_sin{k}"] = np.sin(2*np.pi*k*t/len(df))
    df[f"fourier_t_cos{k}"] = np.cos(2*np.pi*k*t/len(df))

# (B) Annual seasonality by year-fraction
day_of_year = df[DATE_COL].dt.dayofyear.astype(float)
year_len = df[DATE_COL].dt.is_leap_year.map({True:366, False:365}).astype(float)
yf = day_of_year / year_len
for k in range(1, FOURIER_K+1):
    df[f"fourier_y_sin{k}"] = np.sin(2*np.pi*k*yf)
    df[f"fourier_y_cos{k}"] = np.cos(2*np.pi*k*yf)

# TARGETS
# Next trading day volume
df["y_trading"] = df[VOL_COL].shift(-1)

# log1p of the target (standardized into z_target further below)
df["y_log1p"] = np.log1p(df["y_trading"])

# drop last row
df = df.dropna(subset=["y_trading", "y_log1p"]).reset_index(drop=True)

mu = float(df["y_log1p"].mean())
sigma = float(df["y_log1p"].std(ddof=0)) if df["y_log1p"].std(ddof=0) > 0 else 1.0
df["z_target"] = (df["y_log1p"] - mu) / sigma

#FINAL CLEANUP
num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
df[num_cols] = df[num_cols].replace([np.inf, -np.inf], np.nan)
df = df.dropna().reset_index(drop=True)

# SAVE
df.to_csv(OUTPUT_CSV, index=False)
with open(SCALER_JSON, "w") as f:
    json.dump({"y_log1p_mean": mu, "y_log1p_std": sigma, "note": "y = expm1(z*sigma + mu)"}, f, indent=2)

print(f"Saved features: {OUTPUT_CSV}")
print(f"Saved scaler  : {SCALER_JSON}")
print(df[[DATE_COL, VOL_COL, "y_trading", "y_log1p", "z_target"]].head(), "\n")
print("Rows:", len(df), "| Cols:", len(df.columns))

Saved features: /content/features_trading_only.csv
Saved scaler  : /content/features_trading_only_scaler.json
        date      volume   y_trading    y_log1p  z_target
0 1998-02-11  15819189.0  13524081.0  16.419983  0.920051
1 1998-02-12  13524081.0   8694402.0  15.978190  0.337016
2 1998-02-13   8694402.0  14912102.0  16.517684  1.048988
3 1998-02-17  14912102.0  12788824.0  16.364082  0.846279
4 1998-02-18  12788824.0  16986901.0  16.647953  1.220905 

Rows: 6890 | Cols: 53
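
The target transform used above is z = (log1p(y) - mu) / sigma with the natural logarithm, and the scaler JSON records the inverse, y = expm1(z*sigma + mu). A small round-trip sketch, assuming only the five `y_trading` values from the printed head (so `mu` and `sigma` here differ from the values saved in the scaler file); the helper names `fit_scaler`, `to_z` and `from_z` are illustrative:

```python
import math

def fit_scaler(targets):
    """Mean and population std (ddof=0) of log1p(target), as in the cell above."""
    logs = [math.log1p(y) for y in targets]
    mu = sum(logs) / len(logs)
    var = sum((v - mu) ** 2 for v in logs) / len(logs)
    sigma = math.sqrt(var) or 1.0  # fall back to 1.0 for a constant series
    return mu, sigma

def to_z(y, mu, sigma):
    """Forward transform: raw volume -> standardized log volume."""
    return (math.log1p(y) - mu) / sigma

def from_z(z, mu, sigma):
    """Inverse transform: standardized log volume -> raw volume."""
    return math.expm1(z * sigma + mu)

# y_trading values from the printed head; the scaler is refit on this tiny sample
y_next = [13524081.0, 8694402.0, 14912102.0, 12788824.0, 16986901.0]
mu, sigma = fit_scaler(y_next)
z_values = [to_z(y, mu, sigma) for y in y_next]
restored = [from_z(z, mu, sigma) for z in z_values]
print("z:", [round(z, 4) for z in z_values])
print("restored:", [round(y) for y in restored])
```

Because the inverse goes through `expm1`, a small error in z becomes a multiplicative error in volume, which is worth keeping in mind when reading MAE and RMSE in original units.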


Model Architecture and Training

In [None]:

!pip -q install -U "transformers>=4.46.1" "datasets==2.20.0" "accelerate>=0.34.0" \
                      "peft==0.13.0" "scikit-learn" "einops" "evaluate"


In [None]:
import transformers
print(transformers.__version__)

4.55.4


# Model Training and Evaluation

TinyLlama 1.1B parameters, training with top-16 features selected by XGBoost, 5 epochs

In [22]:

import os, re, math, json, random
import numpy as np
import pandas as pd
from dataclasses import dataclass
from typing import List, Dict, Optional

import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from transformers.trainer_utils import set_seed
from peft import LoraConfig, get_peft_model
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler

#  CONFIG
CONFIG = {
    "feature_files": ["/content/features_trading_only_2.csv"],
    "date_col": "date",
    "vol_col": "volume",
    "label_col": "z_target",
    "context_len": 64,
    "max_features": 16,
    "model_name": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "seed": 42,
    "train_frac": 0.9,
    "epochs": 5,
    "lr": 1e-4,
    "train_bs": 2,
    "grad_accum": 16,
    "max_length": 1024,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "output_dir": "/content/tinyllama_ts_lora",
    "bf16": True,
}

set_seed(CONFIG["seed"])

# LOAD & MERGE FEATURES
def load_and_merge(paths: List[str], date_col: str):
    dfs = []
    for p in paths:
        df = pd.read_csv(p)
        df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
        df = df.dropna(subset=[date_col])
        dfs.append(df)
    all_df = pd.concat(dfs, axis=0, ignore_index=True).sort_values(date_col).reset_index(drop=True)
    all_df = all_df.replace([np.inf, -np.inf], np.nan).ffill().bfill()
    return all_df

df = load_and_merge(CONFIG["feature_files"], CONFIG["date_col"])
print("Data shape:", df.shape)
print(df[[CONFIG["date_col"], CONFIG["vol_col"], "y_trading", "y_log1p", CONFIG["label_col"]]].head())

# SELECTING TOP-K FEATURES
EXCLUDE_COLS = {CONFIG["date_col"], CONFIG["label_col"], CONFIG["vol_col"], "y_trading", "y_log1p"}
num = df.select_dtypes(include=[np.number]).copy()
std = num.std(numeric_only=True)
non_const = std[std > 0].index.tolist()
num = num[non_const]
cand = [c for c in num.columns if c not in EXCLUDE_COLS]
if not cand:
    raise ValueError("No candidate numeric features found after exclusions.")
corr = num[cand].corrwith(num[CONFIG["label_col"]]).abs().replace([np.inf, -np.inf], np.nan).fillna(0.0)
feature_cols = corr.sort_values(ascending=False).index.tolist()[:CONFIG["max_features"]]
print("Selected feature columns:", feature_cols)

# BUILD z_history SERIES
# History from standardized log1p(volume) to stabilize scale
vol = df[CONFIG["vol_col"]].astype(float).values.reshape(-1,1)
z_hist_scaler = StandardScaler()
z_hist_series = z_hist_scaler.fit_transform(np.log1p(vol)).reshape(-1)

#  BUILD WINDOWS
def make_windows(df: pd.DataFrame, z_hist: np.ndarray, ctx: int, feat_cols: List[str], label_col: str):
    X, Y = [], []
    n = len(df)
    for t in range(ctx, n):
        hist = z_hist[t-ctx:t].tolist()
        feats = df.iloc[t][feat_cols].to_dict()
        y = float(df.iloc[t][label_col])
        X.append((hist, feats))
        Y.append(y)
    return X, np.array(Y, dtype=np.float32)

X_raw, Y = make_windows(df, z_hist_series, CONFIG["context_len"], feature_cols, CONFIG["label_col"])
print("Total samples:", len(X_raw))

# TRAIN / VAL SPLIT
N = len(X_raw)
cut = int(N * CONFIG["train_frac"])
train_idx = np.arange(0, cut)
val_idx = np.arange(cut, N)

def ex_to_text(hist, feats, y_z):
    hist_str = ", ".join(f"{x:.4f}" for x in hist)
    feats_str = ", ".join(f"{k}={float(v):.4f}" for k,v in feats.items()) if feats else "none"
    prompt = f"z_hist[{len(hist)}]:{hist_str}\nfeat:{feats_str}\nnext_z:"
    target = f"{y_z:.5f}\n"
    return {"prompt": prompt, "target": target}

train_text = [ex_to_text(*X_raw[i], Y[i]) for i in train_idx]
val_text   = [ex_to_text(*X_raw[i], Y[i]) for i in val_idx]

# TOKENIZER / MODEL
model_name = CONFIG["model_name"]
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

torch_dtype = (
    torch.bfloat16
    if (CONFIG["bf16"] and torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8)
    else torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch_dtype,
    device_map="auto",
    attn_implementation="sdpa",   # no extra installs
)
model.config.use_cache = False
model.gradient_checkpointing_enable()

# LoRA targets for TinyLlama blocks
lora_cfg = LoraConfig(
    r=CONFIG["lora_r"], lora_alpha=CONFIG["lora_alpha"], lora_dropout=CONFIG["lora_dropout"],
    bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# DATASET
class TxtDS(Dataset):
    def __init__(self, examples, tok, max_len=1024):
        self.ex = examples; self.tok = tok; self.max_len = max_len
    def __len__(self): return len(self.ex)
    def __getitem__(self, i):
        e = self.ex[i]
        p_ids = self.tok(e["prompt"], add_special_tokens=False)["input_ids"]
        t_ids = self.tok(e["target"], add_special_tokens=False)["input_ids"]
        ids = p_ids + t_ids
        if len(ids) > self.max_len:
            overflow = len(ids) - self.max_len
            keep_p = max(0, len(p_ids) - overflow)
            # explicit start index: p_ids[-0:] would keep the whole prompt when keep_p == 0
            ids = p_ids[len(p_ids) - keep_p:] + t_ids
        p_len = min(len(p_ids), len(ids) - len(t_ids))
        labels = [-100]*p_len + ids[p_len:]
        attn = [1]*len(ids)
        return {
            "input_ids": torch.tensor(ids, dtype=torch.long),
            "attention_mask": torch.tensor(attn, dtype=torch.long),
            "labels": torch.tensor(labels, dtype=torch.long),
        }

def pad_batch(batch, pad_id):
    mx = max(len(b["input_ids"]) for b in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for b in batch:
        pad_n = mx - len(b["input_ids"])
        out["input_ids"].append(torch.cat([b["input_ids"], torch.full((pad_n,), pad_id, dtype=torch.long)]))
        out["attention_mask"].append(torch.cat([b["attention_mask"], torch.zeros(pad_n, dtype=torch.long)]))
        out["labels"].append(torch.cat([b["labels"], torch.full((pad_n,), -100, dtype=torch.long)]))
    return {k: torch.stack(v) for k,v in out.items()}

def collate_fn(features):
    return pad_batch(features, tokenizer.pad_token_id)

train_ds = TxtDS(train_text, tokenizer, CONFIG["max_length"])
val_ds   = TxtDS(val_text, tokenizer, CONFIG["max_length"])

# TRAIN
args = TrainingArguments(
    output_dir=CONFIG["output_dir"],
    num_train_epochs=CONFIG["epochs"],
    per_device_train_batch_size=CONFIG["train_bs"],
    per_device_eval_batch_size=CONFIG["train_bs"],
    gradient_accumulation_steps=CONFIG["grad_accum"],
    learning_rate=CONFIG["lr"],
    weight_decay=CONFIG["weight_decay"],
    warmup_ratio=CONFIG["warmup_ratio"],
    logging_steps=50,
    save_strategy="epoch",
    lr_scheduler_type="cosine",
    bf16=(torch_dtype==torch.bfloat16),
    fp16=(torch_dtype==torch.float16),
    dataloader_num_workers=2,
    report_to="none",
    do_eval=False
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=collate_fn,
    tokenizer=tokenizer,
)
trainer.train()

#EVAL (invert to original units)

def load_scaler_json(paths):
    base = os.path.dirname(paths[0])
    cand = os.path.join(base, "features_trading_only_scaler_2.json")
    if os.path.exists(cand):
        with open(cand, "r") as f:
            s = json.load(f)
        return float(s["y_log1p_mean"]), float(s["y_log1p_std"])
    mu = float(df["y_log1p"].mean()) if "y_log1p" in df.columns else 0.0
    sigma = float(df["y_log1p"].std(ddof=0)) if "y_log1p" in df.columns and df["y_log1p"].std(ddof=0)>0 else 1.0
    return mu, sigma

mu, sigma = load_scaler_json(CONFIG["feature_files"])

def number_from_text(text: str) -> Optional[float]:
    m = re.search(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", text)
    return float(m.group(0)) if m else None

def evaluate(model, tok, val_examples, mu, sigma, max_new_tokens=12):
    model.eval()
    preds_z, trues_z = [], []
    for ex in val_examples:
        ids = tok(ex["prompt"], return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(
                **ids, max_new_tokens=max_new_tokens, do_sample=False,
                pad_token_id=tok.pad_token_id, eos_token_id=tok.eos_token_id
            )
        gen = tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
        z_hat = number_from_text(gen)
        if z_hat is None: continue
        preds_z.append(z_hat)
        trues_z.append(float(re.findall(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", ex["target"])[0]))
    if not preds_z:
        return {"val_MAE": float("nan"), "val_RMSE": float("nan")}
    preds_z = np.array(preds_z); trues_z = np.array(trues_z)
    y_pred = np.expm1(preds_z * sigma + mu)
    y_true = np.expm1(trues_z * sigma + mu)
    return {
        "val_MAE": mean_absolute_error(y_true, y_pred),
        "val_RMSE": math.sqrt(mean_squared_error(y_true, y_pred)),
    }

val_pairs = [{"prompt": e["prompt"], "target": e["target"]} for e in val_text]
metrics = evaluate(model, tokenizer, val_pairs, mu, sigma)
print("Validation metrics:", metrics)

#  INFERENCE
def forecast_next(raw_recent_volumes: List[float], last_feat_row: Dict[str,float], mu: float, sigma: float, k_decimals=5) -> float:
    z_hist = z_hist_scaler.transform(np.log1p(np.array(raw_recent_volumes).reshape(-1,1))).reshape(-1)
    hist_str = ", ".join(f"{z:.4f}" for z in z_hist[-CONFIG["context_len"]:])
    feats_str = ", ".join(f"{k}={float(v):.4f}" for k,v in last_feat_row.items()) if last_feat_row else "none"
    prompt = f"z_hist[{len(z_hist[-CONFIG['context_len']:])}]:{hist_str}\nfeat:{feats_str}\nnext_z:"
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=12, do_sample=False,
                             pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id)
    gen = tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    z_hat = number_from_text(gen)
    if z_hat is None: raise RuntimeError("Model did not return a numeric answer.")
    return float(round(np.expm1(z_hat * sigma + mu), k_decimals))

print("TinyLlama LoRA fine-tune done. Use forecast_next(...) for inference.")


Data shape: (6890, 53)
        date    volume   y_trading    y_log1p  z_target
0 1998-02-11  15819189  13524081.0  16.419983  0.920051
1 1998-02-12  13524081   8694402.0  15.978190  0.337016
2 1998-02-13   8694402  14912102.0  16.517684  1.048988
3 1998-02-17  14912102  12788824.0  16.364082  0.846279
4 1998-02-18  12788824  16986901.0  16.647953  1.220905
Selected feature columns: ['ema28', 'ema7', 'roll14_mean', 'roll14_sum', 'roll28_sum', 'roll28_mean', 'roll7_mean', 'roll7_sum', 'year', 'roll7_min', 'roll14_min', 'roll28_min', 'lag1', 'lag2', 'roll7_max', 'fourier_t_sin1']
Total samples: 6826
trainable params: 12,615,680 || all params: 1,112,664,064 || trainable%: 1.1338




Step,Training Loss
50,1.8128
100,1.3571
150,1.3496
200,1.3428
250,1.3412
300,1.3365
350,1.3455
400,1.3386
450,1.3305
500,1.3328


Validation metrics: {'val_MAE': 761753.6787408533, 'val_RMSE': 1541993.293025926}
TinyLlama LoRA fine-tune done. Use forecast_next(...) for inference.
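
The trainable-parameter count in the log above can be cross-checked by hand: a LoRA adapter of rank r on a linear layer with input size d_in and output size d_out adds r * (d_in + d_out) weights. The sketch below assumes TinyLlama-1.1B's published dimensions (hidden size 2048, MLP size 5632, 22 layers, 4 key/value heads of dimension 64). These are not printed anywhere in this notebook, so treat them as assumptions and confirm them against `model.config`.

```python
# Assumed TinyLlama-1.1B dimensions (verify against model.config)
HIDDEN = 2048
INTERMEDIATE = 5632
N_LAYERS = 22
KV_DIM = 4 * 64          # 4 key/value heads x head dimension 64
RANK = 16                # CONFIG["lora_r"] in the training cell
ALL_PARAMS = 1_112_664_064  # "all params" from the training log

# (d_in, d_out) of every module listed in target_modules
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(d_in, d_out, rank):
    """Weights in the two low-rank matrices A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in shapes.values())
total = per_layer * N_LAYERS
pct = 100.0 * total / ALL_PARAMS
print(f"per layer: {per_layer:,} | total: {total:,} | trainable%: {pct:.4f}")
```

Under these assumptions the total reproduces the 12,615,680 trainable parameters and the 1.1338% reported by `print_trainable_parameters()`.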


Training a numeric value head on top of the TinyLlama model fine-tuned for 3 epochs.

In [12]:
import os, re, math, json, numpy as np, pandas as pd, torch, torch.nn as nn, torch.nn.functional as F
from torch.utils.data import Dataset
from sklearn.metrics import mean_absolute_error, mean_squared_error
from transformers import TrainingArguments, Trainer

# ---- config ----
FEATURE_FILE = "/content/features_trading_only_2.csv"
DATE_COL, VOL_COL, LABEL_COL = "date", "volume", "z_target"
CTX, MAX_FEATURES, TRAIN_FRAC, MAX_LEN, EPOCHS = 64, 16, 0.9, 1024, 5
LR, BATCH = 3e-4, 8
DEVICE = next(model.parameters()).device

# ---- reload features & splits ----
df = pd.read_csv(FEATURE_FILE)
df[DATE_COL] = pd.to_datetime(df[DATE_COL], errors="coerce")
df = df.dropna(subset=[DATE_COL]).sort_values(DATE_COL).reset_index(drop=True)
df = df.replace([np.inf, -np.inf], np.nan).ffill().bfill()

EXCL = {DATE_COL, LABEL_COL, VOL_COL, "y_trading", "y_log1p"}
num = df.select_dtypes(include=[np.number]).copy()
std = num.std(numeric_only=True)
cand = [c for c in num.columns if c not in EXCL and std[c] > 0]
corr = num[cand].corrwith(num[LABEL_COL]).abs().fillna(0.0)
feature_cols = corr.sort_values(ascending=False).index.tolist()[:MAX_FEATURES]

from sklearn.preprocessing import StandardScaler
vol = df[VOL_COL].astype(float).values.reshape(-1,1)
z_hist_scaler = StandardScaler()
z_hist_series = z_hist_scaler.fit_transform(np.log1p(vol)).reshape(-1)

def make_windows(df, z_hist, ctx, feat_cols, label_col):
    X, Y = [], []
    for t in range(ctx, len(df)):
        X.append((z_hist[t-ctx:t].tolist(), df.iloc[t][feat_cols].to_dict()))
        Y.append(float(df.iloc[t][label_col]))
    return X, np.array(Y, dtype=np.float32)

X_raw, Y = make_windows(df, z_hist_series, CTX, feature_cols, LABEL_COL)
cut = int(len(X_raw)*TRAIN_FRAC)

def ex_to_prompt(hist, feats):
    hist_str  = ", ".join(f"{x:.4f}" for x in hist)
    feats_str = ", ".join(f"{k}={float(v):.4f}" for k,v in feats.items()) if feats else "none"
    return f"z_hist[{len(hist)}]:{hist_str}\nfeat:{feats_str}\nnext_z:"

train_prompts = [ex_to_prompt(*X_raw[i]) for i in range(cut)]
val_prompts   = [ex_to_prompt(*X_raw[i]) for i in range(cut, len(X_raw))]

# scaler to invert
mu, sigma = float(df["y_log1p"].mean()), float(df["y_log1p"].std(ddof=0))
y_tr_z, y_va_z = Y[:cut], Y[cut:]
y_tr, y_val = np.expm1(y_tr_z*sigma+mu), np.expm1(y_va_z*sigma+mu)

# ---- dataset ----
class TxtRegDS(Dataset):
    def __init__(self, prompts, targets, tok, max_len=1024):
        self.prompts, self.targets, self.tok, self.max_len = prompts, targets, tok, max_len
    def __len__(self): return len(self.prompts)
    def __getitem__(self, i):
        x = self.tok(self.prompts[i], add_special_tokens=False, truncation=True, max_length=self.max_len)
        return {
            "input_ids": torch.tensor(x["input_ids"], dtype=torch.long),
            "attention_mask": torch.tensor(x.get("attention_mask", [1]*len(x["input_ids"])), dtype=torch.long),
            "labels": torch.tensor(self.targets[i], dtype=torch.float32),
        }

train_ds = TxtRegDS(train_prompts, y_tr_z, tokenizer, MAX_LEN)
val_ds   = TxtRegDS(val_prompts,   y_va_z, tokenizer, MAX_LEN)

def pad_batch(batch, pad_id):
    mx = max(len(b["input_ids"]) for b in batch)
    return {
        "input_ids": torch.stack([torch.cat([b["input_ids"], torch.full((mx-len(b["input_ids"]),), pad_id)]) for b in batch]),
        "attention_mask": torch.stack([torch.cat([b["attention_mask"], torch.zeros(mx-len(b["attention_mask"]), dtype=torch.long)]) for b in batch]),
        "labels": torch.stack([b["labels"] for b in batch]),
    }

# ---- regression wrapper ----
class LLMRegressor(nn.Module):
    def __init__(self, base, dropout=0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters(): p.requires_grad = False
        hidden = self.base.config.hidden_size
        self.head = nn.Sequential(nn.Dropout(dropout), nn.Linear(hidden, 1))
    def forward(self, input_ids, attention_mask=None, labels=None):
        out = self.base(input_ids=input_ids, attention_mask=attention_mask,
                        output_hidden_states=True, use_cache=False)
        h = out.hidden_states[-1]
        idx = attention_mask.sum(dim=1) - 1
        last_h = h[torch.arange(h.size(0), device=h.device), idx]
        pred = self.head(last_h).squeeze(-1)
        loss = F.mse_loss(pred, labels) if labels is not None else None
        return {"loss": loss, "logits": pred.unsqueeze(-1)}

reg_model = LLMRegressor(model).to(DEVICE)

# ---- trainer ----
args = TrainingArguments(
    output_dir="/content/tinyllama_value_head",
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH,
    per_device_eval_batch_size=BATCH,
    learning_rate=LR,
    logging_steps=50,
    save_strategy="no",
    report_to="none",
    do_eval=True,
    bf16=(model.dtype==torch.bfloat16),
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    z_pred, z_true = logits.squeeze(-1), labels
    y_pred = np.expm1(z_pred*sigma+mu); y_true = np.expm1(z_true*sigma+mu)
    return {
        "MAE": mean_absolute_error(y_true,y_pred),
        "RMSE": math.sqrt(mean_squared_error(y_true,y_pred)),
        "MAPE%": float(np.mean(np.abs((y_true-y_pred)/np.clip(y_true,1e-9,None)))*100)
    }

trainer = Trainer(
    model=reg_model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=lambda b: pad_batch(b, tokenizer.pad_token_id),
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

trainer.train()
metrics = trainer.evaluate()
print("Value-head metrics (volumes):", metrics)




Step,Training Loss
50,0.5697
100,0.3146
150,0.3673
200,0.408
250,0.3536
300,0.3205
350,0.2959
400,0.3588
450,0.3233
500,0.3079


Value-head metrics (volumes): {'eval_loss': 0.511233925819397, 'eval_MAE': 1113996.75, 'eval_RMSE': 1550466.1081468372, 'eval_MAPE%': 60.99566650390625, 'eval_runtime': 9.9154, 'eval_samples_per_second': 68.883, 'eval_steps_per_second': 8.673, 'epoch': 5.0}
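
The value head above reads the hidden state at index `attention_mask.sum(dim=1) - 1`. That picks the last real token only when padding is on the right, which is what `pad_batch` produces. A plain-Python sketch of that assumption, using made-up masks and hypothetical helper names:

```python
def last_token_index(attention_mask):
    """Index used by the value head: number of real tokens minus one."""
    return sum(attention_mask) - 1

def true_last_index(attention_mask):
    """Position of the last real (mask == 1) token, wherever the padding sits."""
    return max(i for i, m in enumerate(attention_mask) if m == 1)

right_padded = [1, 1, 1, 0, 0]   # layout produced by pad_batch
left_padded = [0, 0, 1, 1, 1]    # layout some generation setups use instead

for name, mask in [("right", right_padded), ("left", left_padded)]:
    print(name, "head index:", last_token_index(mask), "| true last:", true_last_index(mask))
```

If the collator were ever switched to left padding, the head would read a hidden state from the middle of the sequence, so the pooling index would have to change with it.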


In [23]:
# ---- Auto-find and load the LoRA adapters required for evaluation ----
BASE_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
SEARCH_ROOTS = ["/content/tinyllama_ts_lora", "/content"]

import os, time, torch, json
from typing import Optional, Tuple
from transformers import AutoTokenizer, AutoModelForCausalLM
try:
    from peft import AutoPeftModelForCausalLM
    PEFT_OK = True
except Exception:
    PEFT_OK = False

def find_latest_adapter(root_dirs) -> Optional[str]:
    """
    Return dir containing BOTH adapter_model.* and adapter_config.json with the newest mtime.
    Looks recursively under the given roots (handles trainer's checkpoint-* subdirs).
    """
    best_dir, best_mtime = None, -1
    for root in root_dirs:
        if not os.path.exists(root):
            continue
        for cur, _, files in os.walk(root):
            has_cfg = "adapter_config.json" in files
            has_ad = any(f.startswith("adapter_model.") for f in files)
            if has_cfg and has_ad:
                ad_files = [os.path.join(cur, f) for f in files if f.startswith("adapter_model.")]
                mtime = max(os.path.getmtime(p) for p in ad_files)
                if mtime > best_mtime:
                    best_mtime, best_dir = mtime, cur
    return best_dir

def find_merged_model(root_dirs) -> Optional[str]:
    """Return dir containing a full merged model (model.safetensors / pytorch_model.*)."""
    for root in root_dirs:
        if not os.path.exists(root):
            continue
        for cur, _, files in os.walk(root):
            has_full = any(f in files for f in ["model.safetensors","pytorch_model.bin","pytorch_model.safetensors"])
            if has_full:
                return cur
    return None

def ensure_adapter_config(dirpath: str, base_model: str):
    """If adapter_config.json missing, write a minimal one that matches your LoRA training params."""
    cfg_path = os.path.join(dirpath, "adapter_config.json")
    if os.path.exists(cfg_path):
        return
    lora_cfg = {
        "base_model_name_or_path": base_model,
        "peft_type": "LORA",
        "task_type": "CAUSAL_LM",
        "r": 16, "lora_alpha": 32, "lora_dropout": 0.05,
        "bias": "none", "inference_mode": False,
        "target_modules": ["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]
    }
    with open(cfg_path, "w") as f: json.dump(lora_cfg, f)
    print(f"[fix] wrote missing adapter_config.json at: {cfg_path}")

def load_ready_model_and_tokenizer(base_model: str, roots) -> Tuple[AutoModelForCausalLM, AutoTokenizer, str]:

    tok = AutoTokenizer.from_pretrained(base_model, use_fast=True, legacy=False)
    if tok.pad_token is None: tok.pad_token = tok.eos_token
    dtype = torch.bfloat16 if (torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8) else torch.float16


    adir = find_latest_adapter(roots)
    if adir and PEFT_OK:
        ensure_adapter_config(adir, base_model)
        try:
            mdl = AutoPeftModelForCausalLM.from_pretrained(adir, torch_dtype=dtype, device_map="auto")
            mdl = mdl.merge_and_unload()
            mdl.config.use_cache = False
            print(f"[ok] loaded & merged LoRA adapters from: {adir}")
            return mdl, tok, "adapters_merged"
        except Exception as e:
            print(f"[warn] adapter load failed at {adir}: {e}")

    # Fallback: a fully merged checkpoint found under the search roots.
    mdir = find_merged_model(roots)
    if mdir:
        mdl = AutoModelForCausalLM.from_pretrained(mdir, torch_dtype=dtype, device_map="auto")
        mdl.config.use_cache = False
        print(f"[ok] loaded full merged model from: {mdir}")
        return mdl, tok, "merged_full"

    # Last resort: the untuned base model.
    mdl = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=dtype, device_map="auto")
    mdl.config.use_cache = False
    print("[WARN] adapters/merged not found — using BASE ONLY.")
    return mdl, tok, "base_only"

model, tokenizer, mode = load_ready_model_and_tokenizer(BASE_MODEL, SEARCH_ROOTS)

# Sanity check:
print("mode:", mode)
print("device:", next(model.parameters()).device)

[ok] loaded & merged LoRA adapters from: /content/tinyllama_ts_lora/checkpoint-960
mode: adapters_merged
device: cuda:0
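
The adapter settings written by `ensure_adapter_config` (rank 16 on all seven projection matrices) fix how many parameters LoRA trains. As a rough cross-check, the sketch below counts them by hand. The layer dimensions are assumptions taken from the public TinyLlama-1.1B configuration and are not stated anywhere in this notebook; under those assumptions the count agrees with the `trainable params: 12,615,680` line printed by the fine-tuning cell.

```python
# Assumed TinyLlama-1.1B dimensions (not stated in this notebook).
HIDDEN = 2048          # hidden size
INTERMEDIATE = 5632    # MLP inner size
LAYERS = 22            # number of decoder blocks
KV_DIM = 4 * 64        # 4 key/value heads of size 64 (grouped-query attention)
R = 16                 # LoRA rank, as in ensure_adapter_config

# (in_features, out_features) of each targeted projection in one decoder block.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_f, out_f, r):
    # LoRA adds two matrices per wrapped linear layer: A (r x in) and B (out x r).
    return r * (in_f + out_f)

per_block = sum(lora_params(i, o, R) for i, o in shapes.values())
total = per_block * LAYERS
print(f"per block: {per_block:,}  total: {total:,}")
print(f"share of 1,112,664,064 params: {100 * total / 1_112_664_064:.4f}%")
```

Because the count scales linearly with the rank, halving `r` to 8 would halve the trainable parameters under the same assumptions.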


TinyLlama Model Evaluation-1

Performance of TinyLlama trained on Feature_Group_1 for 3 epochs

In [10]:
FEATURE_FILE = "/content/features_trading_only_2.csv"
DATE_COL, VOL_COL, LABEL_COL = "date", "volume", "z_target"
CTX, MAX_FEATURES, TRAIN_FRAC, MAX_LEN = 64, 16, 0.9, 1024

import os, re, math, json, numpy as np, pandas as pd, torch
from typing import List
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
from transformers import LogitsProcessor

# ---------- Load & prep features ----------
df = pd.read_csv(FEATURE_FILE)
df[DATE_COL] = pd.to_datetime(df[DATE_COL], errors="coerce")
df = df.dropna(subset=[DATE_COL]).sort_values(DATE_COL).reset_index(drop=True)
df = df.replace([np.inf, -np.inf], np.nan).ffill().bfill()

# Top-K by absolute Pearson (keep consistent with your training)
EXCL = {DATE_COL, LABEL_COL, VOL_COL, "y_trading", "y_log1p"}
num = df.select_dtypes(include=[np.number]).copy()
std = num.std(numeric_only=True)
non_const = std[std > 0].index.tolist()
num = num[non_const]
cand = [c for c in num.columns if c not in EXCL]
if not cand: raise ValueError("No candidate numeric features found.")
corr = num[cand].corrwith(num[LABEL_COL]).abs().replace([np.inf,-np.inf], np.nan).fillna(0.0)
feature_cols = corr.sort_values(ascending=False).index.tolist()[:MAX_FEATURES]

# z-history from log1p(volume)
vol = df[VOL_COL].astype(float).values.reshape(-1,1)
z_hist_scaler = StandardScaler()
z_hist_series = z_hist_scaler.fit_transform(np.log1p(vol)).reshape(-1)

def make_windows(df, z_hist, ctx, feat_cols, label_col):
    X, Y = [], []
    for t in range(ctx, len(df)):
        X.append((z_hist[t-ctx:t].tolist(), df.iloc[t][feat_cols].to_dict()))
        Y.append(float(df.iloc[t][label_col]))
    return X, np.array(Y, dtype=np.float32)

X_raw, Y = make_windows(df, z_hist_series, CTX, feature_cols, LABEL_COL)
cut = int(len(X_raw) * TRAIN_FRAC)
val_text = []
for hist, feats in X_raw[cut:]:
    hist_str  = ", ".join(f"{x:.4f}" for x in hist)
    feats_str = ", ".join(f"{k}={float(v):.4f}" for k,v in feats.items()) if feats else "none"
    prompt = f"z_hist[{len(hist)}]:{hist_str}\nfeat:{feats_str}\nnext_z:"
    val_text.append({"prompt": prompt})

# scaler to invert to original units
def load_scaler_json(feature_path: str):
    base = os.path.dirname(feature_path)
    cand = os.path.join(base, "features_trading_only_scaler_2.json")
    if os.path.exists(cand):
        with open(cand, "r") as f:
            s = json.load(f)
        return float(s["y_log1p_mean"]), float(s["y_log1p_std"])
    mu = float(df["y_log1p"].mean()) if "y_log1p" in df.columns else 0.0
    std = df["y_log1p"].std(ddof=0) if "y_log1p" in df.columns else 1.0
    return mu, (float(std) if std and std > 0 else 1.0)

mu, sigma = load_scaler_json(FEATURE_FILE)

# y_val (ground truth) in original units
y_va_z = Y[cut:]
y_val  = np.expm1(y_va_z * sigma + mu)

# ---------- Numeric-constrained decoding ----------
class DigitsOnly(LogitsProcessor):
    def __init__(self, tok, device):
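        # Only tokens made up entirely of these characters stay selectable. EOS is not
        # among them, so generation always runs for the full max_new_tokens.
        allowed_chars = set("0123456789-+.eE \n")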
        ids = []
        for i in range(tok.vocab_size):
            s = tok.decode([i])
            if s and set(s).issubset(allowed_chars):
                ids.append(i)
        self.allowed_ids = torch.tensor(ids, device=device)
    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0
        return scores + mask

digits_only = DigitsOnly(tokenizer, device=next(model.parameters()).device)
num_pat = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")
def number_from_text(s):
    m = num_pat.search(s)
    return float(m.group(0)) if m else None

# ---------- Predict TinyLlama on validation ----------
model.eval()
preds_z = []
for ex in val_text:
    ids = tokenizer(ex["prompt"], return_tensors="pt", truncation=True, max_length=MAX_LEN).to(next(model.parameters()).device)
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=12, do_sample=False,
                             logits_processor=[digits_only],
                             pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id)
    gen = tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    z_hat = number_from_text(gen)
    preds_z.append(np.nan if z_hat is None else z_hat)

preds_z = np.array(preds_z, dtype=float)
mask = ~np.isnan(preds_z)
if mask.sum() == 0:
    raise RuntimeError("Model returned no numeric outputs. Check prompts/decoding.")
y_pred = np.expm1(preds_z[mask] * sigma + mu)
y_true = y_val[mask]

# ---------- Metrics ----------
def metrics(y_true, y_pred):
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = math.sqrt(mean_squared_error(y_true, y_pred))
    mape = float(np.mean(np.abs((y_true - y_pred) / np.clip(y_true, 1e-9, None))) * 100)
    return {"MAE": mae, "RMSE": rmse, "MAPE%": mape}

print(f"Aligned eval samples: {len(y_true)} / {len(y_val)}")
print("TinyLlama (text→z→volume):", metrics(y_true, y_pred))

# ---------- Naive baselines on same span ----------
y_all = np.expm1(Y * sigma + mu)
tail_len = len(y_true)
truth_tail = y_all[-tail_len:]

def seasonal_naive(series, season=5):
    yhat = np.roll(series, season); yhat[:season] = series[:season]; return yhat
def moving_avg(series, k=7):
    s = pd.Series(series)
    return s.rolling(k, min_periods=1).mean().shift(1).bfill().to_numpy()

sn = seasonal_naive(np.r_[y_all[:-tail_len], truth_tail])[-tail_len:]
ma = moving_avg(np.r_[y_all[:-tail_len], truth_tail])[-tail_len:]
print("Seasonal naive:", metrics(truth_tail, sn))
print("Moving average:", metrics(truth_tail, ma))

# ---------- Preview few predictions ----------
for i in range(min(10, len(y_true))):
    print(f"{i:02d} | true={y_true[i]:.2f}  pred={y_pred[i]:.2f}")

Aligned eval samples: 683 / 683
TinyLlama (text→z→volume): {'MAE': 769220.187365776, 'RMSE': 1419827.828691292, 'MAPE%': 35.59028644550582}
Seasonal naive: {'MAE': 1035122.125, 'RMSE': 1966850.8488850903, 'MAPE%': 43.81184768676758}
Moving average: {'MAE': 783198.1818787911, 'RMSE': 1475405.823909972, 'MAPE%': 32.859275970257514}
00 | true=2165886.50  pred=2899729.81
01 | true=3933762.75  pred=3013983.48
02 | true=2430223.25  pred=2679824.59
03 | true=3374707.00  pred=2901927.90
04 | true=3410402.75  pred=2749887.98
05 | true=3193953.75  pred=3130365.99
06 | true=11314851.00  pred=3106547.08
07 | true=5218058.50  pred=3376760.99
08 | true=3779608.00  pred=3376581.88
09 | true=4183481.25  pred=3615025.68
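
The cell above converts every prediction back to raw volume with `np.expm1(z * sigma + mu)`. The sketch below walks through that inverse on three rows copied from the data preview printed by the fine-tuning cell later in this notebook. The saved scaler file is not shown here, so `mu` and `sigma` are estimated from two of those rows and should be read as approximations, not the stored values.

```python
import math

# (y_trading, y_log1p, z_target) rows copied from the data preview printed by
# the fine-tuning cell; mu and sigma below are estimates, not the saved scaler.
rows = [
    (13524081.0, 16.419983, 0.920051),
    (8694402.0, 15.978190, 0.337016),
    (14912102.0, 16.517684, 1.048988),
]

# y_log1p = mu + sigma * z, so two rows give two equations in (mu, sigma).
(_, l0, z0), (_, l1, z1) = rows[0], rows[1]
sigma = (l0 - l1) / (z0 - z1)
mu = l0 - sigma * z0

def z_to_volume(z, mu, sigma):
    """Invert the standardised log1p target back to raw volume."""
    return math.expm1(z * sigma + mu)

# Check the estimate on a row that was not used to fit mu and sigma.
y_true, _, z2 = rows[2]
y_back = z_to_volume(z2, mu, sigma)
print(f"mu~{mu:.4f}  sigma~{sigma:.4f}")
print(f"row 2: recovered={y_back:.0f}  expected={y_true:.0f}")
```

Since the target lives in log space, a fixed error in `z` becomes a multiplicative error in volume: with the estimated `sigma`, an error of 0.1 in `z` corresponds to roughly 8% in volume.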


TinyLlama Model Evaluation-2

Performance of TinyLlama trained on **Feature_Group_1** for 5 epochs

In [24]:
FEATURE_FILE = "/content/features_trading_only_2.csv"
DATE_COL, VOL_COL, LABEL_COL = "date", "volume", "z_target"
CTX, MAX_FEATURES, TRAIN_FRAC, MAX_LEN = 64, 16, 0.9, 1024

import os, re, math, json, numpy as np, pandas as pd, torch
from typing import List
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
from transformers import LogitsProcessor

# ---------- Loading & prep features ----------
df = pd.read_csv(FEATURE_FILE)
df[DATE_COL] = pd.to_datetime(df[DATE_COL], errors="coerce")
df = df.dropna(subset=[DATE_COL]).sort_values(DATE_COL).reset_index(drop=True)
df = df.replace([np.inf, -np.inf], np.nan).ffill().bfill()

# Top-K by absolute Pearson (keep consistent with your training)
EXCL = {DATE_COL, LABEL_COL, VOL_COL, "y_trading", "y_log1p"}
num = df.select_dtypes(include=[np.number]).copy()
std = num.std(numeric_only=True)
non_const = std[std > 0].index.tolist()
num = num[non_const]
cand = [c for c in num.columns if c not in EXCL]
if not cand: raise ValueError("No candidate numeric features found.")
corr = num[cand].corrwith(num[LABEL_COL]).abs().replace([np.inf,-np.inf], np.nan).fillna(0.0)
feature_cols = corr.sort_values(ascending=False).index.tolist()[:MAX_FEATURES]

# z-history from log1p(volume)
vol = df[VOL_COL].astype(float).values.reshape(-1,1)
z_hist_scaler = StandardScaler()
z_hist_series = z_hist_scaler.fit_transform(np.log1p(vol)).reshape(-1)

def make_windows(df, z_hist, ctx, feat_cols, label_col):
    X, Y = [], []
    for t in range(ctx, len(df)):
        X.append((z_hist[t-ctx:t].tolist(), df.iloc[t][feat_cols].to_dict()))
        Y.append(float(df.iloc[t][label_col]))
    return X, np.array(Y, dtype=np.float32)

X_raw, Y = make_windows(df, z_hist_series, CTX, feature_cols, LABEL_COL)
cut = int(len(X_raw) * TRAIN_FRAC)
val_text = []
for hist, feats in X_raw[cut:]:
    hist_str  = ", ".join(f"{x:.4f}" for x in hist)
    feats_str = ", ".join(f"{k}={float(v):.4f}" for k,v in feats.items()) if feats else "none"
    prompt = f"z_hist[{len(hist)}]:{hist_str}\nfeat:{feats_str}\nnext_z:"
    val_text.append({"prompt": prompt})

# scaler to invert to original units
def load_scaler_json(feature_path: str):
    base = os.path.dirname(feature_path)
    cand = os.path.join(base, "features_trading_only_scaler_2.json")
    if os.path.exists(cand):
        with open(cand, "r") as f:
            s = json.load(f)
        return float(s["y_log1p_mean"]), float(s["y_log1p_std"])
    mu = float(df["y_log1p"].mean()) if "y_log1p" in df.columns else 0.0
    std = df["y_log1p"].std(ddof=0) if "y_log1p" in df.columns else 1.0
    return mu, (float(std) if std and std > 0 else 1.0)

mu, sigma = load_scaler_json(FEATURE_FILE)

# y_val (ground truth) in original units
y_va_z = Y[cut:]
y_val  = np.expm1(y_va_z * sigma + mu)

# ---------- Numeric-constrained decoding ----------
class DigitsOnly(LogitsProcessor):
    def __init__(self, tok, device):
        allowed_chars = set("0123456789-+.eE \n")
        ids = []
        for i in range(tok.vocab_size):
            s = tok.decode([i])
            if s and set(s).issubset(allowed_chars):
                ids.append(i)
        self.allowed_ids = torch.tensor(ids, device=device)
    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0
        return scores + mask

digits_only = DigitsOnly(tokenizer, device=next(model.parameters()).device)
num_pat = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")
def number_from_text(s):
    m = num_pat.search(s)
    return float(m.group(0)) if m else None

# ---------- Predict TinyLlama on validation ----------
model.eval()
preds_z = []
for ex in val_text:
    ids = tokenizer(ex["prompt"], return_tensors="pt", truncation=True, max_length=MAX_LEN).to(next(model.parameters()).device)
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=12, do_sample=False,
                             logits_processor=[digits_only],
                             pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id)
    gen = tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    z_hat = number_from_text(gen)
    preds_z.append(np.nan if z_hat is None else z_hat)

preds_z = np.array(preds_z, dtype=float)
mask = ~np.isnan(preds_z)
if mask.sum() == 0:
    raise RuntimeError("Model returned no numeric outputs. Check prompts/decoding.")
y_pred = np.expm1(preds_z[mask] * sigma + mu)
y_true = y_val[mask]

# ---------- Metrics ----------
def metrics(y_true, y_pred):
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = math.sqrt(mean_squared_error(y_true, y_pred))
    mape = float(np.mean(np.abs((y_true - y_pred) / np.clip(y_true, 1e-9, None))) * 100)
    return {"MAE": mae, "RMSE": rmse, "MAPE%": mape}

print(f"Aligned eval samples: {len(y_true)} / {len(y_val)}")
print("TinyLlama (text→z→volume):", metrics(y_true, y_pred))

# ---------- Naive baselines on same span ----------

y_all = np.expm1(Y * sigma + mu)
tail_len = len(y_true)
truth_tail = y_all[-tail_len:]

def seasonal_naive(series, season=5):
    yhat = np.roll(series, season); yhat[:season] = series[:season]; return yhat
def moving_avg(series, k=7):
    s = pd.Series(series)
    return s.rolling(k, min_periods=1).mean().shift(1).bfill().to_numpy()

sn = seasonal_naive(np.r_[y_all[:-tail_len], truth_tail])[-tail_len:]
ma = moving_avg(np.r_[y_all[:-tail_len], truth_tail])[-tail_len:]
print("Seasonal naive:", metrics(truth_tail, sn))
print("Moving average:", metrics(truth_tail, ma))

# ---------- Preview few predictions ----------
for i in range(min(10, len(y_true))):
    print(f"{i:02d} | true={y_true[i]:.2f}  pred={y_pred[i]:.2f}")

Aligned eval samples: 683 / 683
TinyLlama (text→z→volume): {'MAE': 728973.1755427595, 'RMSE': 1413633.286961244, 'MAPE%': 31.674799226962445}
Seasonal naive: {'MAE': 1035122.125, 'RMSE': 1966850.8488850903, 'MAPE%': 43.81184768676758}
Moving average: {'MAE': 783198.1818787911, 'RMSE': 1475405.823909972, 'MAPE%': 32.859275970257514}
00 | true=2165886.50  pred=2899685.87
01 | true=3933762.75  pred=3153865.62
02 | true=2430223.25  pred=2751972.49
03 | true=3374707.00  pred=3130057.64
04 | true=3410402.75  pred=3152455.93
05 | true=3193953.75  pred=3376709.81
06 | true=11314851.00  pred=2886401.20
07 | true=5218058.50  pred=3623252.84
08 | true=3779608.00  pred=4216177.85
09 | true=4183481.25  pred=3642550.04
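
To put the two evaluation outputs side by side, the sketch below expresses each run's metrics as a percent change relative to the moving-average baseline. The numbers are copied (rounded) from the outputs above; the helper name `relative_change` is introduced here only for illustration.

```python
# Metrics copied (rounded) from the Evaluation-1 and Evaluation-2 outputs.
runs = {
    "Evaluation-1": {"MAE": 769220.19, "RMSE": 1419827.83, "MAPE%": 35.59},
    "Evaluation-2": {"MAE": 728973.18, "RMSE": 1413633.29, "MAPE%": 31.67},
}
# Moving-average baseline (k=7), identical in both outputs.
baseline = {"MAE": 783198.18, "RMSE": 1475405.82, "MAPE%": 32.86}

def relative_change(model, base):
    """Percent change of each metric relative to the baseline (negative is better)."""
    return {k: 100.0 * (model[k] - base[k]) / base[k] for k in base}

changes = {name: relative_change(m, baseline) for name, m in runs.items()}
for name, ch in changes.items():
    print(name, {k: f"{v:+.1f}%" for k, v in ch.items()})
```

Read this way, Evaluation-1 improves on the moving average in MAE and RMSE but not in MAPE, while Evaluation-2 improves on all three. The margins are single-digit percentages, so the comparison is worth repeating on more than one validation split before drawing conclusions.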


Performance of TinyLlama trained on Feature_Group_1 for 3 epochs, with the numerical value head then trained for 5 epochs

In [16]:
import os, json, math, numpy as np, pandas as pd, torch
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler

# -------- Config --------
FEATURE_FILE = "/content/features_trading_only_2.csv"
DATE_COL, VOL_COL, LABEL_COL = "date", "volume", "z_target"
CTX, MAX_FEATURES, TRAIN_FRAC, MAX_LEN = 64, 16, 0.9, 1024
DEVICE = next(reg_model.parameters()).device

# -------- Data & split --------
df = pd.read_csv(FEATURE_FILE)
df[DATE_COL] = pd.to_datetime(df[DATE_COL], errors="coerce")
df = df.dropna(subset=[DATE_COL]).sort_values(DATE_COL).reset_index(drop=True)
df = df.replace([np.inf, -np.inf], np.nan).ffill().bfill()

EXCL = {DATE_COL, LABEL_COL, VOL_COL, "y_trading", "y_log1p"}
num = df.select_dtypes(include=[np.number]).copy()
std = num.std(numeric_only=True)
cand = [c for c in num.columns if c not in EXCL and std[c] > 0]
corr = num[cand].corrwith(num[LABEL_COL]).abs().fillna(0.0)
feature_cols = corr.sort_values(ascending=False).index.tolist()[:MAX_FEATURES]

# z-history from log1p(volume)
vol = df[VOL_COL].astype(float).values.reshape(-1,1)
z_hist_scaler = StandardScaler()
z_hist_series = z_hist_scaler.fit_transform(np.log1p(vol)).reshape(-1)

def make_windows(df, z_hist, ctx, feat_cols, label_col):
    X, Y = [], []
    for t in range(ctx, len(df)):
        X.append((z_hist[t-ctx:t].tolist(), df.iloc[t][feat_cols].to_dict()))
        Y.append(float(df.iloc[t][label_col]))
    return X, np.array(Y, dtype=np.float32)

X_raw, Y = make_windows(df, z_hist_series, CTX, feature_cols, LABEL_COL)
cut = int(len(X_raw) * TRAIN_FRAC)

def to_prompt(hist, feats):
    hist_str = ", ".join(f"{x:.4f}" for x in hist)
    feats_str = ", ".join(f"{k}={float(v):.4f}" for k,v in feats.items()) if feats else "none"
    return f"z_hist[{len(hist)}]:{hist_str}\nfeat:{feats_str}\nnext_z:"

val_prompts = [to_prompt(*X_raw[i]) for i in range(cut, len(X_raw))]

# scaler to invert to original units
def load_scaler_json(path):
    base = os.path.dirname(path)
    cand = os.path.join(base, "features_trading_only_scaler_2.json")
    if os.path.exists(cand):
        with open(cand, "r") as f: s = json.load(f)
        return float(s["y_log1p_mean"]), float(s["y_log1p_std"])
    mu = float(df["y_log1p"].mean()) if "y_log1p" in df.columns else 0.0
    sig = df["y_log1p"].std(ddof=0) if "y_log1p" in df.columns else 1.0
    return mu, (float(sig) if sig and sig > 0 else 1.0)

mu, sigma = load_scaler_json(FEATURE_FILE)
y_va_z = Y[cut:]
y_val  = np.expm1(y_va_z * sigma + mu)

# -------- LLM + value head predictions --------
reg_model.eval()
z_pred = []
with torch.no_grad():
    for p in val_prompts:
        tok = tokenizer(p, add_special_tokens=False, truncation=True, max_length=MAX_LEN, return_tensors="pt")
        tok = {k: v.to(DEVICE) for k,v in tok.items()}
        out = reg_model(**tok)
        z_pred.append(out["logits"].squeeze(-1).squeeze(0).detach().cpu().item())

z_pred = np.array(z_pred, dtype=float)
y_pred_llm = np.expm1(z_pred * sigma + mu)

# -------- Naive baselines --------
y_all = np.expm1(Y * sigma + mu)
tail_len = len(y_pred_llm)
truth_tail = y_all[-tail_len:]

def seasonal_naive(series, season=5):
    yhat = np.roll(series, season); yhat[:season] = series[:season]; return yhat

def moving_avg(series, k=7):
    s = pd.Series(series)
    return s.rolling(k, min_periods=1).mean().shift(1).bfill().to_numpy()

sn = seasonal_naive(np.r_[y_all[:-tail_len], truth_tail])[-tail_len:]
ma = moving_avg(np.r_[y_all[:-tail_len], truth_tail])[-tail_len:]

# -------- Metrics --------
def metrics(y_true, y_pred):
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = math.sqrt(mean_squared_error(y_true, y_pred))
    mape = float(np.mean(np.abs((y_true - y_pred) / np.clip(y_true, 1e-9, None))) * 100)
    return mae, rmse, mape

rows = []
rows.append(("LLM + Value Head",) + metrics(y_val, y_pred_llm))
rows.append(("Seasonal Naive (lag=5)",) + metrics(y_val, sn))
rows.append(("Moving Average (k=7)",) + metrics(y_val, ma))

df_res = pd.DataFrame(rows, columns=["Model","MAE","RMSE","MAPE%"])
print(df_res.to_string(index=False))

# -------- Top-10 absolute-error misses --------
err = np.abs(y_pred_llm - y_val)
idx = np.argsort(-err)[:10]
print("\nTop-10 misses (LLM + Value Head):")
for j in idx:
    print(f"{j:03d} | true={y_val[j]:.0f}  pred={y_pred_llm[j]:.0f}  abs_err={err[j]:.0f}")


                 Model          MAE         RMSE     MAPE%
      LLM + Value Head 1.119315e+06 1.557513e+06 61.201292
Seasonal Naive (lag=5) 1.035122e+06 1.966851e+06 43.811848
  Moving Average (k=7) 7.831982e+05 1.475406e+06 32.859276

Top-10 misses (LLM + Value Head):
322 | true=22640160  pred=4638328  abs_err=18001832
072 | true=13930385  pred=4345909  abs_err=9584476
006 | true=11314851  pred=4384672  abs_err=6930179
634 | true=9685454  pred=2787773  abs_err=6897681
385 | true=11448904  pred=4672778  abs_err=6776126
576 | true=9816966  pred=3860656  abs_err=5956310
192 | true=7737242  pred=2461880  abs_err=5275362
235 | true=7526871  pred=2804326  abs_err=4722545
261 | true=7557361  pred=3028664  abs_err=4528697
130 | true=7017421  pred=2596604  abs_err=4420817
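
The two baselines in the table are easier to interpret on a small example. The sketch below re-implements them in plain Python, as an approximation of the NumPy/pandas versions used in the cells above, and runs them on a short hypothetical series with a single spike.

```python
def seasonal_naive(series, season=5):
    """Predict each point with the value `season` steps earlier.

    The first `season` points have no earlier value, so they are copied
    through unchanged, mirroring the np.roll-based version in the notebook.
    """
    return [series[t] if t < season else series[t - season] for t in range(len(series))]

def moving_avg(series, k=7):
    """Predict each point with the mean of up to `k` preceding values.

    The first point has no history and is back-filled from the second
    prediction, as rolling(...).mean().shift(1).bfill() does. A one-element
    series falls back to its own value here.
    """
    preds = []
    for t in range(len(series)):
        window = series[max(0, t - k):t]
        preds.append(sum(window) / len(window) if window else None)
    if preds and preds[0] is None:
        preds[0] = preds[1] if len(preds) > 1 else series[0]
    return preds

# Hypothetical series with one spike at index 4.
toy = [10.0, 12.0, 11.0, 13.0, 40.0, 12.0, 11.0, 14.0]
sn = seasonal_naive(toy, season=5)
ma = moving_avg(toy, k=3)
print("seasonal naive:", sn)
print("moving average:", [round(v, 2) for v in ma])
```

Neither baseline anticipates the spike, and the moving average stays inflated for the next `k` steps. The top-10 misses listed above are all high-volume days that were under-predicted, which is the kind of day where this effect would show up.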


Using XGBoost to select the top-k features

In [17]:
!pip install -q xgboost

In [18]:


import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

# ---- config ----
FEATURE_FILE = "/content/features_trading_only_2.csv"
DATE_COL, VOL_COL, LABEL_COL = "date", "volume", "z_target"
K = 16  # top-k features to keep

# ---- load features ----
df = pd.read_csv(FEATURE_FILE)
df[DATE_COL] = pd.to_datetime(df[DATE_COL], errors="coerce")
df = df.dropna(subset=[DATE_COL]).sort_values(DATE_COL).reset_index(drop=True)

# numeric features only
exclude = {DATE_COL, LABEL_COL, VOL_COL, "y_trading", "y_log1p"}
num = df.select_dtypes(include=[np.number]).copy()
cand = [c for c in num.columns if c not in exclude]

X = num[cand].fillna(0).values
y = df[LABEL_COL].values

# ---- train/val split ----
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, shuffle=False)

# ---- fit XGBoost regressor ----
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=cand)
dval   = xgb.DMatrix(X_val,   label=y_val,   feature_names=cand)

params = {
    "objective": "reg:squarederror",
    "eval_metric": "rmse",
    "max_depth": 4,
    "eta": 0.1,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "seed": 42
}

bst = xgb.train(params, dtrain, num_boost_round=200, evals=[(dval,"val")],
                early_stopping_rounds=20, verbose_eval=False)

# ---- feature importance ----
imp_gain = bst.get_score(importance_type="gain")
imp_sorted = sorted(imp_gain.items(), key=lambda x: x[1], reverse=True)

print(f"\nTop-{K} features by XGBoost importance:")
top_features = [f for f,_ in imp_sorted[:K]]
for rank,(feat,score) in enumerate(imp_sorted[:K],1):
    print(f"{rank:02d}. {feat:25s} gain={score:.4f}")

# Keep the selected feature names for later use.
feature_cols_xgb = top_features



Top-16 features by XGBoost importance:
01. year                      gain=91.3184
02. ema7                      gain=79.9638
03. fourier_t_sin1            gain=70.3679
04. ema28                     gain=59.0522
05. roll7_sum                 gain=35.2179
06. roll7_mean                gain=27.8385
07. roll14_mean               gain=9.5542
08. roll7_min                 gain=8.9750
09. lag1                      gain=6.5701
10. ret7                      gain=4.4075
11. ret1                      gain=3.8452
12. dom                       gain=2.9399
13. week                      gain=2.6481
14. month                     gain=2.4693
15. gap_days_next             gain=2.4107
16. fourier_y_sin1            gain=2.1528
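
The gain values above span almost two orders of magnitude, so it helps to look at them as shares. The sketch below normalises the 16 printed gains and accumulates them. The shares are relative to these 16 features only, not to the full candidate set, and XGBoost's `gain` importance is an average gain per split, so this is a rough heuristic.

```python
# Gain values copied from the XGBoost output above.
gains = {
    "year": 91.3184, "ema7": 79.9638, "fourier_t_sin1": 70.3679, "ema28": 59.0522,
    "roll7_sum": 35.2179, "roll7_mean": 27.8385, "roll14_mean": 9.5542,
    "roll7_min": 8.9750, "lag1": 6.5701, "ret7": 4.4075, "ret1": 3.8452,
    "dom": 2.9399, "week": 2.6481, "month": 2.4693, "gap_days_next": 2.4107,
    "fourier_y_sin1": 2.1528,
}

total = sum(gains.values())  # total over the 16 listed features only
cumulative = 0.0
shares = []  # (rank, name, share, cumulative share)
for rank, (name, g) in enumerate(gains.items(), 1):
    cumulative += g
    shares.append((rank, name, g / total, cumulative / total))
    print(f"{rank:02d}. {name:16s} share={g / total:6.1%}  cumulative={cumulative / total:6.1%}")
```

The first four features account for a little under three quarters of the listed gain. Note also that `roll7_sum` and `roll7_mean` differ only by a constant factor when both are computed over the same full 7-row window, so they probably carry the same information in the prompt.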


Fine-tuning TinyLlama on Feature_Group_2 for 3 epochs

In [19]:
import os, re, math, json, random
import numpy as np
import pandas as pd
from dataclasses import dataclass
from typing import List, Dict, Optional

import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from transformers.trainer_utils import set_seed
from peft import LoraConfig, get_peft_model
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler

# ============================  CONFIG  ============================
CONFIG = {
    "feature_files": ["/content/features_trading_only_2.csv"],
    "date_col": "date",
    "vol_col": "volume",
    "label_col": "z_target",
    "context_len": 64,
    "max_features": 16,
    "model_name": "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    "seed": 42,
    "train_frac": 0.9,
    "epochs": 3,
    "lr": 1e-4,
    "train_bs": 2,
    "grad_accum": 16,
    "max_length": 1024,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "output_dir": "/content/tinyllama_ts_lora",
    "bf16": True,
}

set_seed(CONFIG["seed"])

# ======================  LOAD & MERGE FEATURES  ==================
def load_and_merge(paths: List[str], date_col: str):
    dfs = []
    for p in paths:
        df = pd.read_csv(p)
        df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
        df = df.dropna(subset=[date_col])
        dfs.append(df)
    all_df = pd.concat(dfs, axis=0, ignore_index=True).sort_values(date_col).reset_index(drop=True)
    all_df = all_df.replace([np.inf, -np.inf], np.nan).ffill().bfill()
    return all_df

df = load_and_merge(CONFIG["feature_files"], CONFIG["date_col"])
print("Data shape:", df.shape)
print(df[[CONFIG["date_col"], CONFIG["vol_col"], "y_trading", "y_log1p", CONFIG["label_col"]]].head())

# =====================  SELECT TOP-K FEATURES  ====================
EXCLUDE_COLS = {CONFIG["date_col"], CONFIG["label_col"], CONFIG["vol_col"], "y_trading", "y_log1p"}
num = df.select_dtypes(include=[np.number]).copy()
std = num.std(numeric_only=True)
non_const = std[std > 0].index.tolist()
num = num[non_const]
cand = [c for c in num.columns if c not in EXCLUDE_COLS]
if not cand:
    raise ValueError("No candidate numeric features found after exclusions.")
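# Feature_Group_1 (Pearson) selection, disabled for this run:
# corr = num[cand].corrwith(num[CONFIG["label_col"]]).abs().replace([np.inf, -np.inf], np.nan).fillna(0.0)
# feature_cols = corr.sort_values(ascending=False).index.tolist()[:CONFIG["max_features"]]
# Feature_Group_2: the top-16 features from the XGBoost gain ranking above.
feature_cols = [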
    "year","ema7","fourier_t_sin1","ema28","roll7_sum","roll7_mean",
    "roll14_mean","roll7_min","lag1","ret7","ret1",
    "dom","week","month","gap_days_next","fourier_y_sin1"
]
print("Selected feature columns:", feature_cols)

# ====================  BUILD z_history SERIES  ====================

vol = df[CONFIG["vol_col"]].astype(float).values.reshape(-1,1)
z_hist_scaler = StandardScaler()
z_hist_series = z_hist_scaler.fit_transform(np.log1p(vol)).reshape(-1)

# ====================  BUILD WINDOWS  ====================
def make_windows(df: pd.DataFrame, z_hist: np.ndarray, ctx: int, feat_cols: List[str], label_col: str):
    X, Y = [], []
    n = len(df)
    for t in range(ctx, n):
        hist = z_hist[t-ctx:t].tolist()
        feats = df.iloc[t][feat_cols].to_dict()
        y = float(df.iloc[t][label_col])
        X.append((hist, feats))
        Y.append(y)
    return X, np.array(Y, dtype=np.float32)

X_raw, Y = make_windows(df, z_hist_series, CONFIG["context_len"], feature_cols, CONFIG["label_col"])
print("Total samples:", len(X_raw))

# ===================  TRAIN / VAL SPLIT  ==========================
N = len(X_raw)
cut = int(N * CONFIG["train_frac"])
train_idx = np.arange(0, cut)
val_idx = np.arange(cut, N)

def ex_to_text(hist, feats, y_z):
    hist_str = ", ".join(f"{x:.4f}" for x in hist)
    feats_str = ", ".join(f"{k}={float(v):.4f}" for k,v in feats.items()) if feats else "none"
    prompt = f"z_hist[{len(hist)}]:{hist_str}\nfeat:{feats_str}\nnext_z:"
    target = f"{y_z:.5f}\n"
    return {"prompt": prompt, "target": target}

train_text = [ex_to_text(*X_raw[i], Y[i]) for i in train_idx]
val_text   = [ex_to_text(*X_raw[i], Y[i]) for i in val_idx]

# ===================  TOKENIZER / MODEL  =========================
model_name = CONFIG["model_name"]
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

torch_dtype = (
    torch.bfloat16
    if (CONFIG["bf16"] and torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8)
    else torch.float16
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch_dtype,
    device_map="auto",
    attn_implementation="sdpa",   # no extra installs
)
model.config.use_cache = False
model.gradient_checkpointing_enable()

# LoRA targets for TinyLlama blocks
lora_cfg = LoraConfig(
    r=CONFIG["lora_r"], lora_alpha=CONFIG["lora_alpha"], lora_dropout=CONFIG["lora_dropout"],
    bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# ===================  DATASET  ========================
class TxtDS(Dataset):
    def __init__(self, examples, tok, max_len=1024):
        self.ex = examples; self.tok = tok; self.max_len = max_len
    def __len__(self): return len(self.ex)
    def __getitem__(self, i):
        e = self.ex[i]
        p_ids = self.tok(e["prompt"], add_special_tokens=False)["input_ids"]
        t_ids = self.tok(e["target"], add_special_tokens=False)["input_ids"]
        ids = p_ids + t_ids
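        if len(ids) > self.max_len:
            # Left-truncate the prompt so the target is kept whole.
            # Slice from an explicit start: p_ids[-0:] would return the full list.
            overflow = len(ids) - self.max_len
            keep_p = max(0, len(p_ids) - overflow)
            ids = p_ids[len(p_ids) - keep_p:] + t_ids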
        p_len = min(len(p_ids), len(ids) - len(t_ids))
        labels = [-100]*p_len + ids[p_len:]
        attn = [1]*len(ids)
        return {
            "input_ids": torch.tensor(ids, dtype=torch.long),
            "attention_mask": torch.tensor(attn, dtype=torch.long),
            "labels": torch.tensor(labels, dtype=torch.long),
        }

def pad_batch(batch, pad_id):
    mx = max(len(b["input_ids"]) for b in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for b in batch:
        pad_n = mx - len(b["input_ids"])
        out["input_ids"].append(torch.cat([b["input_ids"], torch.full((pad_n,), pad_id, dtype=torch.long)]))
        out["attention_mask"].append(torch.cat([b["attention_mask"], torch.zeros(pad_n, dtype=torch.long)]))
        out["labels"].append(torch.cat([b["labels"], torch.full((pad_n,), -100, dtype=torch.long)]))
    return {k: torch.stack(v) for k,v in out.items()}

def collate_fn(features):
    return pad_batch(features, tokenizer.pad_token_id)

train_ds = TxtDS(train_text, tokenizer, CONFIG["max_length"])
val_ds   = TxtDS(val_text, tokenizer, CONFIG["max_length"])

# ===================  TRAIN  ==========================
args = TrainingArguments(
    output_dir=CONFIG["output_dir"],
    num_train_epochs=CONFIG["epochs"],
    per_device_train_batch_size=CONFIG["train_bs"],
    per_device_eval_batch_size=CONFIG["train_bs"],
    gradient_accumulation_steps=CONFIG["grad_accum"],
    learning_rate=CONFIG["lr"],
    weight_decay=CONFIG["weight_decay"],
    warmup_ratio=CONFIG["warmup_ratio"],
    logging_steps=50,
    save_strategy="epoch",
    lr_scheduler_type="cosine",
    bf16=(torch_dtype==torch.bfloat16),
    fp16=(torch_dtype==torch.float16),
    dataloader_num_workers=2,
    report_to="none",
    do_eval=False
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=collate_fn,
    tokenizer=tokenizer,
)
trainer.train()

# ===================  EVAL  ==========================

def load_scaler_json(paths):
    base = os.path.dirname(paths[0])
    cand = os.path.join(base, "features_trading_only_scaler_2.json")
    if os.path.exists(cand):
        with open(cand, "r") as f:
            s = json.load(f)
        return float(s["y_log1p_mean"]), float(s["y_log1p_std"])
    mu = float(df["y_log1p"].mean()) if "y_log1p" in df.columns else 0.0
    sigma = float(df["y_log1p"].std(ddof=0)) if "y_log1p" in df.columns and df["y_log1p"].std(ddof=0)>0 else 1.0
    return mu, sigma

mu, sigma = load_scaler_json(CONFIG["feature_files"])

def number_from_text(text: str) -> Optional[float]:
    m = re.search(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", text)
    return float(m.group(0)) if m else None

def evaluate(model, tok, val_examples, mu, sigma, max_new_tokens=12):
    model.eval()
    preds_z, trues_z = [], []
    for ex in val_examples:
        ids = tok(ex["prompt"], return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(
                **ids, max_new_tokens=max_new_tokens, do_sample=False,
                pad_token_id=tok.pad_token_id, eos_token_id=tok.eos_token_id
            )
        gen = tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
        z_hat = number_from_text(gen)
        if z_hat is None: continue
        preds_z.append(z_hat)
        trues_z.append(float(re.findall(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", ex["target"])[0]))
    if not preds_z:
        return {"val_MAE": float("nan"), "val_RMSE": float("nan")}
    preds_z = np.array(preds_z); trues_z = np.array(trues_z)
    y_pred = np.expm1(preds_z * sigma + mu)
    y_true = np.expm1(trues_z * sigma + mu)
    return {
        "val_MAE": mean_absolute_error(y_true, y_pred),
        "val_RMSE": math.sqrt(mean_squared_error(y_true, y_pred)),
    }

val_pairs = [{"prompt": e["prompt"], "target": e["target"]} for e in val_text]
metrics = evaluate(model, tokenizer, val_pairs, mu, sigma)
print("Validation metrics:", metrics)

# ===================  INFERENCE  ==========================
def forecast_next(raw_recent_volumes: List[float], last_feat_row: Dict[str,float], mu: float, sigma: float, k_decimals=5) -> float:
    z_hist = z_hist_scaler.transform(np.log1p(np.array(raw_recent_volumes).reshape(-1,1))).reshape(-1)
    hist_str = ", ".join(f"{z:.4f}" for z in z_hist[-CONFIG["context_len"]:])
    feats_str = ", ".join(f"{k}={float(v):.4f}" for k,v in last_feat_row.items()) if last_feat_row else "none"
    prompt = f"z_hist[{len(z_hist[-CONFIG['context_len']:])}]:{hist_str}\nfeat:{feats_str}\nnext_z:"
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=12, do_sample=False,
                             pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id)
    gen = tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    z_hat = number_from_text(gen)
    if z_hat is None: raise RuntimeError("Model did not return a numeric answer.")
    return float(round(np.expm1(z_hat * sigma + mu), k_decimals))

print("TinyLlama LoRA fine-tune done. Use forecast_next(...) for inference.")


Data shape: (6890, 53)
        date    volume   y_trading    y_log1p  z_target
0 1998-02-11  15819189  13524081.0  16.419983  0.920051
1 1998-02-12  13524081   8694402.0  15.978190  0.337016
2 1998-02-13   8694402  14912102.0  16.517684  1.048988
3 1998-02-17  14912102  12788824.0  16.364082  0.846279
4 1998-02-18  12788824  16986901.0  16.647953  1.220905
Selected feature columns: ['year', 'ema7', 'fourier_t_sin1', 'ema28', 'roll7_sum', 'roll7_mean', 'roll14_mean', 'roll7_min', 'lag1', 'ret7', 'ret1', 'dom', 'week', 'month', 'gap_days_next', 'fourier_y_sin1']
Total samples: 6826
trainable params: 12,615,680 || all params: 1,112,664,064 || trainable%: 1.1338




Step,Training Loss
50,1.7541
100,1.3543
150,1.3463
200,1.3398
250,1.3372
300,1.3306
350,1.3396
400,1.3348
450,1.3247
500,1.3259


Validation metrics: {'val_MAE': 713283.7563821452, 'val_RMSE': 1367959.9324602024}
TinyLlama LoRA fine-tune done. Use forecast_next(...) for inference.
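
The evaluation above decodes a short string, takes the first number in it as the standardized target z, and maps it back to a volume with expm1(z * sigma + mu). A minimal sketch of that chain follows, using only the standard library. The values of mu and sigma here are not read from the scaler JSON; they are estimates solved from the first two rows of the data preview printed above (y_log1p and z_target), so treat them as assumptions. The generated string is also hypothetical.

```python
import math
import re

# Same pattern as number_from_text in the notebook.
NUM_PAT = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")

def parse_first_number(text):
    """Return the first numeric literal in text, or None if there is none."""
    m = NUM_PAT.search(text)
    return float(m.group(0)) if m else None

# Assumed scaler: solved from two preview rows (y_log1p, z_target), not from the JSON file.
row_a = (16.419983, 0.920051)
row_b = (15.978190, 0.337016)
sigma_est = (row_a[0] - row_b[0]) / (row_a[1] - row_b[1])
mu_est = row_a[0] - row_a[1] * sigma_est

def z_to_volume(z, mu, sigma):
    """Undo the standardization, then undo log1p."""
    return math.expm1(z * sigma + mu)

generated = "0.920051\n"  # hypothetical model output for the first preview row
z_hat = parse_first_number(generated)
volume_hat = z_to_volume(z_hat, mu_est, sigma_est)
print(f"sigma~{sigma_est:.4f}  mu~{mu_est:.4f}  volume~{volume_hat:,.0f}")
```

With these estimates the recovered volume should land close to the y_trading value of 13524081.0 in the first preview row. Small differences are expected because the preview prints only six decimals.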


TinyLlama Model Evaluation 3

Performance of TinyLlama trained on Feature_Group_2 for 3 epochs

In [20]:

FEATURE_FILE = "/content/features_trading_only_2.csv"
DATE_COL, VOL_COL, LABEL_COL = "date", "volume", "z_target"
CTX, MAX_FEATURES, TRAIN_FRAC, MAX_LEN = 64, 16, 0.9, 1024

import os, re, math, json, numpy as np, pandas as pd, torch
from typing import List
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
from transformers import LogitsProcessor

# ---------- Load & prep features ----------
df = pd.read_csv(FEATURE_FILE)
df[DATE_COL] = pd.to_datetime(df[DATE_COL], errors="coerce")
df = df.dropna(subset=[DATE_COL]).sort_values(DATE_COL).reset_index(drop=True)
df = df.replace([np.inf, -np.inf], np.nan).ffill().bfill()

# Top-K by absolute Pearson (keep consistent with your training)
EXCL = {DATE_COL, LABEL_COL, VOL_COL, "y_trading", "y_log1p"}
num = df.select_dtypes(include=[np.number]).copy()
std = num.std(numeric_only=True)
non_const = std[std > 0].index.tolist()
num = num[non_const]
cand = [c for c in num.columns if c not in EXCL]
if not cand: raise ValueError("No candidate numeric features found.")
corr = num[cand].corrwith(num[LABEL_COL]).abs().replace([np.inf,-np.inf], np.nan).fillna(0.0)
feature_cols = corr.sort_values(ascending=False).index.tolist()[:MAX_FEATURES]

# z-history from log1p(volume)
vol = df[VOL_COL].astype(float).values.reshape(-1,1)
z_hist_scaler = StandardScaler()
z_hist_series = z_hist_scaler.fit_transform(np.log1p(vol)).reshape(-1)

def make_windows(df, z_hist, ctx, feat_cols, label_col):
    X, Y = [], []
    for t in range(ctx, len(df)):
        X.append((z_hist[t-ctx:t].tolist(), df.iloc[t][feat_cols].to_dict()))
        Y.append(float(df.iloc[t][label_col]))
    return X, np.array(Y, dtype=np.float32)

X_raw, Y = make_windows(df, z_hist_series, CTX, feature_cols, LABEL_COL)
cut = int(len(X_raw) * TRAIN_FRAC)
val_text = []
for hist, feats in X_raw[cut:]:
    hist_str  = ", ".join(f"{x:.4f}" for x in hist)
    feats_str = ", ".join(f"{k}={float(v):.4f}" for k,v in feats.items()) if feats else "none"
    prompt = f"z_hist[{len(hist)}]:{hist_str}\nfeat:{feats_str}\nnext_z:"
    val_text.append({"prompt": prompt})

# scaler to invert to original units
def load_scaler_json(feature_path: str):
    base = os.path.dirname(feature_path)
    cand = os.path.join(base, "features_trading_only_scaler_2.json")
    if os.path.exists(cand):
        with open(cand, "r") as f:
            s = json.load(f)
        return float(s["y_log1p_mean"]), float(s["y_log1p_std"])
    mu = float(df["y_log1p"].mean()) if "y_log1p" in df.columns else 0.0
    std = df["y_log1p"].std(ddof=0) if "y_log1p" in df.columns else 1.0
    return mu, (float(std) if std and std > 0 else 1.0)

mu, sigma = load_scaler_json(FEATURE_FILE)

# y_val (ground truth) in original units
y_va_z = Y[cut:]
y_val  = np.expm1(y_va_z * sigma + mu)

# ---------- Numeric-constrained decoding ----------
class DigitsOnly(LogitsProcessor):
    def __init__(self, tok, device):
        allowed_chars = set("0123456789-+.eE \n")
        ids = []
        for i in range(tok.vocab_size):
            s = tok.decode([i])
            if s and set(s).issubset(allowed_chars):
                ids.append(i)
        self.allowed_ids = torch.tensor(ids, device=device)
    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0
        return scores + mask

digits_only = DigitsOnly(tokenizer, device=next(model.parameters()).device)
num_pat = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")
def number_from_text(s):
    m = num_pat.search(s)
    return float(m.group(0)) if m else None

# ---------- Predict TinyLlama on validation ----------
model.eval()
preds_z = []
for ex in val_text:
    ids = tokenizer(ex["prompt"], return_tensors="pt", truncation=True, max_length=MAX_LEN).to(next(model.parameters()).device)
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=12, do_sample=False,
                             logits_processor=[digits_only],
                             pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id)
    gen = tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    z_hat = number_from_text(gen)
    preds_z.append(np.nan if z_hat is None else z_hat)

preds_z = np.array(preds_z, dtype=float)
mask = ~np.isnan(preds_z)
if mask.sum() == 0:
    raise RuntimeError("Model returned no numeric outputs. Check prompts/decoding.")
y_pred = np.expm1(preds_z[mask] * sigma + mu)
y_true = y_val[mask]

# ---------- Metrics ----------
def metrics(y_true, y_pred):
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = math.sqrt(mean_squared_error(y_true, y_pred))
    mape = float(np.mean(np.abs((y_true - y_pred) / np.clip(y_true, 1e-9, None))) * 100)
    return {"MAE": mae, "RMSE": rmse, "MAPE%": mape}

print(f"Aligned eval samples: {len(y_true)} / {len(y_val)}")
print("TinyLlama (text→z→volume):", metrics(y_true, y_pred))

# ---------- Naive baselines on same span ----------

y_all = np.expm1(Y * sigma + mu)
tail_len = len(y_true)
truth_tail = y_all[-tail_len:]

def seasonal_naive(series, season=5):
    yhat = np.roll(series, season); yhat[:season] = series[:season]; return yhat
def moving_avg(series, k=7):
    s = pd.Series(series)
    return s.rolling(k, min_periods=1).mean().shift(1).bfill().to_numpy()

sn = seasonal_naive(np.r_[y_all[:-tail_len], truth_tail])[-tail_len:]
ma = moving_avg(np.r_[y_all[:-tail_len], truth_tail])[-tail_len:]
print("Seasonal naive:", metrics(truth_tail, sn))
print("Moving average:", metrics(truth_tail, ma))

# ---------- Preview few predictions ----------
for i in range(min(10, len(y_true))):
    print(f"{i:02d} | true={y_true[i]:.2f}  pred={y_pred[i]:.2f}")

Aligned eval samples: 683 / 683
TinyLlama (text→z→volume): {'MAE': 793397.1208467651, 'RMSE': 1434803.3611238997, 'MAPE%': 37.41941244499241}
Seasonal naive: {'MAE': 1035122.125, 'RMSE': 1966850.8488850903, 'MAPE%': 43.81184768676758}
Moving average: {'MAE': 783198.1818787911, 'RMSE': 1475405.823909972, 'MAPE%': 32.859275970257514}
00 | true=2165886.50  pred=2900389.06
01 | true=3933762.75  pred=2886554.31
02 | true=2430223.25  pred=3127971.16
03 | true=3374707.00  pred=3127994.86
04 | true=3410402.75  pred=2885001.76
05 | true=3193953.75  pred=3127805.25
06 | true=11314851.00  pred=2901927.90
07 | true=5218058.50  pred=3376786.58
08 | true=3779608.00  pred=3402445.45
09 | true=4183481.25  pred=3402393.89
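
The two baselines printed above are what the fine-tuned model has to beat: a seasonal naive forecast that repeats the value from `season` steps earlier, and a moving average of the preceding `k` values. The sketch below restates both in plain Python on a made-up series so the indexing is easy to follow. The toy series and the `mae` helper are illustrative and not taken from the notebook.

```python
def seasonal_naive(series, season=5):
    """Forecast each point with the value `season` steps earlier.
    The first `season` points have no history and are copied as-is, as in the notebook."""
    return [series[t - season] if t >= season else series[t] for t in range(len(series))]

def moving_avg(series, k=7):
    """Forecast each point with the mean of up to k preceding values.
    The first point has no history, so it is back-filled with the first value."""
    out = []
    for t in range(len(series)):
        window = series[max(0, t - k):t] or series[:1]
        out.append(sum(window) / len(window))
    return out

def mae(y_true, y_pred):
    """Mean absolute error over paired values."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

toy = [10, 12, 14, 16, 18, 20, 22]  # made-up series
sn = seasonal_naive(toy, season=5)
ma = moving_avg(toy, k=3)
print("seasonal naive:", sn, "MAE:", round(mae(toy, sn), 3))
print("moving average:", ma, "MAE:", round(mae(toy, ma), 3))
```

The warm-up points of the seasonal naive forecast copy the truth and therefore score zero error. In the notebook both baselines are sliced to the validation tail, so those warm-up points do not enter the reported metrics.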


Qwen 2.5 7B Training

In [1]:
# ============================  SETUP  ============================
import os, re, math, json, random
import numpy as np
import pandas as pd
from typing import List, Dict, Optional
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, LogitsProcessor
from transformers.trainer_utils import set_seed
from peft import LoraConfig, get_peft_model
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
from accelerate.utils import set_module_tensor_to_device


torch.backends.cuda.matmul.allow_tf32 = True
torch.set_float32_matmul_precision("high")

# ============================  CONFIG  ============================
CONFIG = {
    "feature_files": ["/content/features_trading_only_2.csv"],
    "date_col": "date",
    "vol_col": "volume",
    "label_col": "z_target",
    "context_len": 64,
    "max_features": 16,
    "model_name": "Qwen/Qwen2.5-7B-Instruct",
    "seed": 42,
    "train_frac": 0.9,
    "epochs": 3,
    "lr": 1e-4,
    "train_bs": 1,
    "grad_accum": 32,
    "max_length": 2048,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "lora_r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "output_dir": "/content/qwen25_7b_ts_lora",
    "bf16": True,
}
set_seed(CONFIG["seed"])

# ======================  LOAD & MERGE FEATURES  ==================
def load_and_merge(paths: List[str], date_col: str):
    dfs = []
    for p in paths:
        df = pd.read_csv(p)
        df[date_col] = pd.to_datetime(df[date_col], errors="coerce")
        df = df.dropna(subset=[date_col])
        dfs.append(df)
    all_df = pd.concat(dfs, axis=0, ignore_index=True).sort_values(date_col).reset_index(drop=True)
    return all_df.replace([np.inf, -np.inf], np.nan).ffill().bfill()

df = load_and_merge(CONFIG["feature_files"], CONFIG["date_col"])

# =====================  SELECT TOP-K FEATURES  ====================
EXCL = {CONFIG["date_col"], CONFIG["label_col"], CONFIG["vol_col"], "y_trading", "y_log1p"}
num = df.select_dtypes(include=[np.number]).copy()
std = num.std(numeric_only=True)
non_const = std[std > 0].index.tolist()
num = num[non_const]
cand = [c for c in num.columns if c not in EXCL]
corr = num[cand].corrwith(num[CONFIG["label_col"]]).abs().replace([np.inf,-np.inf], np.nan).fillna(0.0)
feature_cols = corr.sort_values(ascending=False).index.tolist()[:CONFIG["max_features"]]


vol = df[CONFIG["vol_col"]].astype(float).values.reshape(-1,1)
z_hist_scaler = StandardScaler()
z_hist_series = z_hist_scaler.fit_transform(np.log1p(vol)).reshape(-1)

def make_windows(df, z_hist, ctx, feat_cols, label_col):
    X, Y = [], []
    for t in range(ctx, len(df)):
        X.append((z_hist[t-ctx:t].tolist(), df.iloc[t][feat_cols].to_dict()))
        Y.append(float(df.iloc[t][label_col]))
    return X, np.array(Y, dtype=np.float32)

X_raw, Y = make_windows(df, z_hist_series, CONFIG["context_len"], feature_cols, CONFIG["label_col"])
N = len(X_raw); cut = int(N * CONFIG["train_frac"])

def ex_to_text(hist, feats, y_z):
    hs = ", ".join(f"{x:.4f}" for x in hist)
    fs = ", ".join(f"{k}={float(v):.4f}" for k,v in feats.items()) if feats else "none"
    return {"prompt": f"z_hist[{len(hist)}]:{hs}\nfeat:{fs}\nnext_z:", "target": f"{y_z:.5f}\n"}

train_text = [ex_to_text(*X_raw[i], Y[i]) for i in range(cut)]
val_text   = [ex_to_text(*X_raw[i], Y[i]) for i in range(cut, N)]

#   TOKENIZER
tokenizer = AutoTokenizer.from_pretrained(CONFIG["model_name"], use_fast=True, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

dtype = torch.bfloat16 if (CONFIG["bf16"] and torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8) else torch.float16

OFFLOAD_DIR = os.path.join(CONFIG["output_dir"], "offload")
os.makedirs(OFFLOAD_DIR, exist_ok=True)

model = AutoModelForCausalLM.from_pretrained(
    CONFIG["model_name"],
    torch_dtype=dtype,
    device_map="auto",
    attn_implementation="sdpa",
    trust_remote_code=True,
    low_cpu_mem_usage=False,
    offload_state_dict=True,
    offload_folder=OFFLOAD_DIR,
)
model.config.use_cache = False
model.gradient_checkpointing_enable()

lora_cfg = LoraConfig(
    r=CONFIG["lora_r"], lora_alpha=CONFIG["lora_alpha"], lora_dropout=CONFIG["lora_dropout"],
    bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()


target_device = next(model.parameters()).device
for name, param in list(model.named_parameters()) + list(model.named_buffers()):
    if hasattr(param, "device") and str(param.device) == "meta":
        set_module_tensor_to_device(model, name, device=target_device)

#   DATASET
class TxtDS(Dataset):
    def __init__(self, examples, tok, max_len=2048):
        self.ex, self.tok, self.max_len = examples, tok, max_len
    def __len__(self): return len(self.ex)
    def __getitem__(self, i):
        e = self.ex[i]
        p_ids = self.tok(e["prompt"], add_special_tokens=False)["input_ids"]
        t_ids = self.tok(e["target"], add_special_tokens=False)["input_ids"]
        ids = p_ids + t_ids
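        if len(ids) > self.max_len:
            # Drop the oldest prompt tokens so the target always fits.
            overflow = len(ids) - self.max_len
            keep_p = max(0, len(p_ids) - overflow)
            # Slice from the front index: p_ids[-0:] would keep the whole prompt when keep_p == 0.
            ids = p_ids[len(p_ids) - keep_p:] + t_ids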
        p_len = min(len(p_ids), len(ids) - len(t_ids))
        labels = [-100]*p_len + ids[p_len:]
        attn = [1]*len(ids)
        return {
            "input_ids": torch.tensor(ids, dtype=torch.long),
            "attention_mask": torch.tensor(attn, dtype=torch.long),
            "labels": torch.tensor(labels, dtype=torch.long),
        }

def pad_batch(batch, pad_id):
    mx = max(len(b["input_ids"]) for b in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for b in batch:
        pad_n = mx - len(b["input_ids"])
        out["input_ids"].append(torch.cat([b["input_ids"], torch.full((pad_n,), pad_id, dtype=torch.long)]))
        out["attention_mask"].append(torch.cat([b["attention_mask"], torch.zeros(pad_n, dtype=torch.long)]))
        out["labels"].append(torch.cat([b["labels"], torch.full((pad_n,), -100, dtype=torch.long)]))
    return {k: torch.stack(v) for k,v in out.items()}

train_ds = TxtDS(train_text, tokenizer, CONFIG["max_length"])
val_ds   = TxtDS(val_text,   tokenizer, CONFIG["max_length"])

#   TRAIN
args = TrainingArguments(
    output_dir=CONFIG["output_dir"],
    num_train_epochs=CONFIG["epochs"],
    per_device_train_batch_size=CONFIG["train_bs"],
    gradient_accumulation_steps=CONFIG["grad_accum"],
    learning_rate=CONFIG["lr"],
    weight_decay=CONFIG["weight_decay"],
    warmup_ratio=CONFIG["warmup_ratio"],
    lr_scheduler_type="cosine",
    save_strategy="epoch",
    logging_steps=50,
    report_to="none",
    bf16=(dtype==torch.bfloat16),
    fp16=(dtype==torch.float16),
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    eval_dataset=val_ds,  # not used during training; kept for consistency
    data_collator=lambda b: pad_batch(b, tokenizer.pad_token_id),
    tokenizer=tokenizer,
)
trainer.train()

#  EVAL HELPERS
class DigitsOnly(LogitsProcessor):
    def __init__(self, tok, device):
        allowed = set("0123456789-+.eE \n")
        ids = []
        for i in range(tok.vocab_size):
            s = tok.decode([i])
            if s and set(s).issubset(allowed): ids.append(i)
        self.allowed_ids = torch.tensor(ids, device=device)
    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0
        return scores + mask

digits_only = DigitsOnly(tokenizer, device=next(model.parameters()).device)

def number_from_text(s: str) -> Optional[float]:
    m = re.search(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", s)
    return float(m.group(0)) if m else None

def load_scaler_json(paths: List[str]):
    base = os.path.dirname(paths[0])
    cand = os.path.join(base, "features_trading_only_scaler_2.json")
    if os.path.exists(cand):
        with open(cand, "r") as f:
            s = json.load(f)
        return float(s["y_log1p_mean"]), float(s["y_log1p_std"])
    mu = float(df["y_log1p"].mean()) if "y_log1p" in df.columns else 0.0
    sigma = float(df["y_log1p"].std(ddof=0)) if "y_log1p" in df.columns and df["y_log1p"].std(ddof=0)>0 else 1.0
    return mu, sigma

mu, sigma = load_scaler_json(CONFIG["feature_files"])

def evaluate(model, tok, val_examples, mu, sigma, max_new_tokens=16):
    model.eval()
    preds_z, trues_z = [], []
    for ex in val_examples:
        ids = tok(ex["prompt"], return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model.generate(
                **ids, max_new_tokens=max_new_tokens, do_sample=False,
                logits_processor=[digits_only],
                pad_token_id=tok.pad_token_id, eos_token_id=tok.eos_token_id,
            )
        gen = tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
        z_hat = number_from_text(gen)
        if z_hat is None: continue
        preds_z.append(z_hat)
        # pull true z from the text target string
        z_true = float(re.findall(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?", ex["target"])[0])
        trues_z.append(z_true)
    if not preds_z:
        return {"val_MAE": float("nan"), "val_RMSE": float("nan")}
    preds_z = np.array(preds_z); trues_z = np.array(trues_z)
    y_pred = np.expm1(preds_z * sigma + mu)
    y_true = np.expm1(trues_z * sigma + mu)
    return {
        "val_MAE": mean_absolute_error(y_true, y_pred),
        "val_RMSE": math.sqrt(mean_squared_error(y_true, y_pred)),
    }

val_pairs = [{"prompt": e["prompt"], "target": e["target"]} for e in val_text]
metrics = evaluate(model, tokenizer, val_pairs, mu, sigma)
print("Validation metrics:", metrics)

# ===================  INFERENCE HELPER  ==========================
def forecast_next(raw_recent_volumes: List[float], last_feat_row: Dict[str,float], mu: float, sigma: float, k_decimals=5) -> float:
    z_hist = z_hist_scaler.transform(np.log1p(np.array(raw_recent_volumes).reshape(-1,1))).reshape(-1)
    hist_str = ", ".join(f"{z:.4f}" for z in z_hist[-CONFIG["context_len"]:])
    feats_str = ", ".join(f"{k}={float(v):.4f}" for k,v in last_feat_row.items()) if last_feat_row else "none"
    prompt = f"z_hist[{len(z_hist[-CONFIG['context_len']:])}]:{hist_str}\nfeat:{feats_str}\nnext_z:"
    ids = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **ids, max_new_tokens=16, do_sample=False,
            logits_processor=[digits_only],
            pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id
        )
    gen = tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    z_hat = number_from_text(gen)
    if z_hat is None:
        raise RuntimeError("Model did not return a numeric answer.")
    y_hat = np.expm1(z_hat * sigma + mu)
    return float(round(y_hat, k_decimals))

print("Qwen2.5-7B LoRA: training done, metrics printed, forecast_next(...) ready.")


trainable params: 40,370,176 || all params: 7,655,986,688 || trainable%: 0.5273




Step,Training Loss
50,1.9639
100,1.4441
150,1.4366
200,1.4351
250,1.4352
300,1.4237
350,1.4332
400,1.4325
450,1.4191
500,1.4216


The following generation flags are not valid and may be ignored: ['temperature', 'top_p', 'top_k']. Set `TRANSFORMERS_VERBOSITY=info` for more details.

Validation metrics: {'val_MAE': 744230.5754974667, 'val_RMSE': 1416663.6189370814}
Qwen2.5-7B LoRA: training done, metrics printed, forecast_next(...) ready.
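
The `trainable params` lines printed for both runs can be reproduced by hand: a LoRA adapter of rank r on a linear layer with d_in inputs and d_out outputs adds r * (d_in + d_out) weights. The sketch below does that arithmetic for the seven target modules listed in `lora_cfg`. The layer sizes are assumptions taken from the public Qwen2.5-7B and TinyLlama-1.1B model configurations (they do not appear in this notebook), and rank 16 for the TinyLlama run is likewise an assumption.

```python
def lora_param_count(hidden, intermediate, kv_dim, n_layers, r):
    """Count LoRA weights when adapting q/k/v/o and gate/up/down projections.
    Each adapted layer adds r * (d_in + d_out) parameters (the A and B matrices)."""
    shapes = {
        "q_proj": (hidden, hidden),
        "k_proj": (hidden, kv_dim),
        "v_proj": (hidden, kv_dim),
        "o_proj": (hidden, hidden),
        "gate_proj": (hidden, intermediate),
        "up_proj": (hidden, intermediate),
        "down_proj": (intermediate, hidden),
    }
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * n_layers

# Assumed architecture sizes (from the public model configs, not from this notebook).
qwen = lora_param_count(hidden=3584, intermediate=18944, kv_dim=512, n_layers=28, r=16)
tiny = lora_param_count(hidden=2048, intermediate=5632, kv_dim=256, n_layers=22, r=16)

# "all params" totals are the ones printed in the training logs.
print(f"Qwen2.5-7B: {qwen:,} trainable, {100 * qwen / 7_655_986_688:.4f}%")
print(f"TinyLlama:  {tiny:,} trainable, {100 * tiny / 1_112_664_064:.4f}%")
```

Under these assumptions the counts come out to 40,370,176 (0.5273%) for Qwen and 12,615,680 (1.1338%) for TinyLlama, matching the logs: the larger model trains more adapter weights but a smaller fraction of its total.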


#Example of Bad Performance:

By mistake, I ran inference on the fine-tuned Qwen model without loading the LoRA adapters, and the predictions shot up far above the true volumes.

In [13]:
MODEL_PATH = "/content/qwen25_7b_ts_lora"

import os, warnings, re, math, json
os.environ["TRANSFORMERS_VERBOSITY"] = "error"
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

try:
    from transformers.utils import logging as hf_logging
    hf_logging.set_verbosity_error()
except Exception:
    pass

import numpy as np, pandas as pd, torch
from typing import List
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessor


if "model" not in globals() or "tokenizer" not in globals():
    if not MODEL_PATH:
        raise RuntimeError("Set MODEL_PATH to your fine-tuned Qwen 2.5-7B checkpoint or load model/tokenizer beforehand.")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else None,
        device_map="auto",
        trust_remote_code=True,
    )
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

device = next(model.parameters()).device
torch.set_grad_enabled(False)
model.eval()

# ----------------- Config -----------------
FEATURE_FILE = "/content/features_trading_only_2.csv"
DATE_COL, VOL_COL, LABEL_COL = "date", "volume", "z_target"
CTX, MAX_FEATURES, TRAIN_FRAC, MAX_LEN = 64, 16, 0.9, 1024

# ----------------- Data prep -----------------
df = pd.read_csv(FEATURE_FILE)
df[DATE_COL] = pd.to_datetime(df[DATE_COL], errors="coerce")
df = df.dropna(subset=[DATE_COL]).sort_values(DATE_COL).reset_index(drop=True)
df = df.replace([np.inf, -np.inf], np.nan).ffill().bfill()

# Top-K by |Pearson| vs label (exclude leakage)
EXCL = {DATE_COL, LABEL_COL, VOL_COL, "y_trading", "y_log1p"}
num = df.select_dtypes(include=[np.number]).copy()
std = num.std(numeric_only=True)
non_const = std[std > 0].index.tolist()
num = num[non_const]
cand = [c for c in num.columns if c not in EXCL and c != LABEL_COL]
if not cand:
    raise ValueError("No candidate numeric features found.")
corr = num[cand].corrwith(num[LABEL_COL]).abs().replace([np.inf,-np.inf], np.nan).fillna(0.0)
feature_cols = corr.sort_values(ascending=False).index.tolist()[:MAX_FEATURES]

# z-history from log1p(volume)
vol = df[VOL_COL].astype(float).values.reshape(-1,1)
z_hist_scaler = StandardScaler()
z_hist_series = z_hist_scaler.fit_transform(np.log1p(vol)).reshape(-1)

def make_windows(df, z_hist, ctx, feat_cols, label_col):
    X, Y = [], []
    for t in range(ctx, len(df)):
        X.append((z_hist[t-ctx:t].tolist(), df.iloc[t][feat_cols].to_dict()))
        Y.append(float(df.iloc[t][label_col]))
    return X, np.array(Y, dtype=np.float32)

X_raw, Y = make_windows(df, z_hist_series, CTX, feature_cols, LABEL_COL)
cut = int(len(X_raw) * TRAIN_FRAC)
val_text = []
for hist, feats in X_raw[cut:]:
    hist_str  = ", ".join(f"{x:.4f}" for x in hist)
    feats_str = ", ".join(f"{k}={float(v):.4f}" for k,v in feats.items()) if feats else "none"
    prompt = f"z_hist[{len(hist)}]:{hist_str}\nfeat:{feats_str}\nnext_z:"
    val_text.append({"prompt": prompt})

# ----------------- Scaler for inversion -----------------
def load_scaler_json(feature_path: str):
    base = os.path.dirname(feature_path)
    cand = os.path.join(base, "features_trading_only_scaler_2.json")
    if os.path.exists(cand):
        try:
            with open(cand, "r") as f:
                s = json.load(f)
            return float(s["y_log1p_mean"]), float(s["y_log1p_std"])
        except Exception:
            pass
    if "y_log1p" in df.columns:
        mu = float(df["y_log1p"].mean())
        sd = float(df["y_log1p"].std(ddof=0))
        return mu, (sd if sd and sd > 0 else 1.0)
    return 0.0, 1.0

mu, sigma = load_scaler_json(FEATURE_FILE)
sigma = 1.0 if (not np.isfinite(sigma) or sigma == 0) else sigma

# y_val (ground truth)
y_va_z = Y[cut:]
y_val  = np.expm1(y_va_z * sigma + mu)

# Numeric-constrained decoding
class DigitsOnly(LogitsProcessor):
    def __init__(self, tok, device):
        allowed = set("0123456789-+.eE \n")
        ids = []
        for i in range(tok.vocab_size):
            try:
                s = tok.decode([i])
            except Exception:
                continue
            if s and set(s).issubset(allowed):
                ids.append(i)
        if not ids:
            raise RuntimeError("DigitsOnly: no allowed token ids found for this tokenizer.")
        self.allowed_ids = torch.tensor(ids, device=device)
    def __call__(self, input_ids, scores):
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0
        return scores + mask

digits_only = DigitsOnly(tokenizer, device=device)
num_pat = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")
def number_from_text(s):
    m = num_pat.search(s)
    return float(m.group(0)) if m else np.nan

# ----------------- Predict z -----------------
preds_z = []
for ex in val_text:
    ids = tokenizer(ex["prompt"], return_tensors="pt", truncation=True, max_length=MAX_LEN).to(device)
    out = model.generate(
        **ids,
        max_new_tokens=12,
        do_sample=False,
        logits_processor=[digits_only],
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    gen = tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    preds_z.append(number_from_text(gen))

preds_z = np.array(preds_z, dtype=float)

# ----------------- Overflow-safe inversion -----------------

y_log1p_series = np.log1p(df[VOL_COL].astype(float).clip(lower=0))
q_lo, q_hi = np.nanpercentile(y_log1p_series, [0.1, 99.9])
margin = 0.5 * sigma
y_log1p_min = max(-50.0, float(q_lo - margin))
y_log1p_max = min( 50.0, float(q_hi + margin))
z_min = (y_log1p_min - mu) / sigma
z_max = (y_log1p_max - mu) / sigma

mask = np.isfinite(preds_z)
if mask.sum() == 0:
    raise RuntimeError("Model returned no numeric outputs. Check prompts/decoding/template.")
z_hat = np.clip(preds_z[mask], z_min, z_max)

y_pred = np.expm1(np.clip(z_hat * sigma + mu, y_log1p_min, y_log1p_max))
y_true = y_val[mask]
finite = np.isfinite(y_pred) & np.isfinite(y_true)
y_pred = y_pred[finite]
y_true = y_true[finite]

# ----------------- Metrics -----------------
def metrics(y_true, y_pred):
    if len(y_true) == 0:
        return {"MAE": np.nan, "RMSE": np.nan, "MAPE%": np.nan}
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = math.sqrt(mean_squared_error(y_true, y_pred))
    mape = float(np.mean(np.abs((y_true - y_pred) / np.clip(y_true, 1e-9, None))) * 100)
    return {"MAE": float(mae), "RMSE": float(rmse), "MAPE%": mape}

print(f"Aligned eval samples (finite): {len(y_true)} / {len(y_val)}")
print("Qwen2.5-7B (text→z→volume):", metrics(y_true, y_pred))

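# Naive baselines on the last len(y_true) validation points.
# Caveat: this matches y_true sample-for-sample only when no predictions were dropped above.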
y_all = np.expm1(Y * sigma + mu)
tail_len = len(y_true)
truth_tail = y_all[-tail_len:]

def seasonal_naive(series, season=5):
    yhat = np.roll(series, season); yhat[:season] = series[:season]; return yhat
def moving_avg(series, k=7):
    s = pd.Series(series)
    return s.rolling(k, min_periods=1).mean().shift(1).bfill().to_numpy()

sn = seasonal_naive(np.r_[y_all[:-tail_len], truth_tail])[-tail_len:]
ma = moving_avg(np.r_[y_all[:-tail_len], truth_tail])[-tail_len:]
print("Seasonal naive:", metrics(truth_tail, sn))
print("Moving average:", metrics(truth_tail, ma))

# ----------------- Preview few predictions -----------------
for i in range(min(10, len(y_true))):
    print(f"{i:02d} | true={y_true[i]:.2f}  pred={y_pred[i]:.2f}")


Aligned eval samples (finite): 672 / 683
Qwen2.5-7B (text→z→volume): {'MAE': 38030821.67282869, 'RMSE': 48829463.199295476, 'MAPE%': 2071.6677831837596}
Seasonal naive: {'MAE': 1029828.875, 'RMSE': 1958129.1232275772, 'MAPE%': 44.14181137084961}
Moving average: {'MAE': 773136.3305830145, 'RMSE': 1449779.1263772284, 'MAPE%': 32.98392204428371}
00 | true=2165886.50  pred=69165194.28
01 | true=3933762.75  pred=69165194.28
02 | true=2430223.25  pred=69165194.28
03 | true=3374707.00  pred=6734927.09
04 | true=3410402.75  pred=3673372.34
05 | true=3193953.75  pred=6734927.09
06 | true=11314851.00  pred=69165194.28
07 | true=5218058.50  pred=69165194.28
08 | true=3779608.00  pred=13208479.05
09 | true=4183481.25  pred=10611672.60
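
Several predictions in the preview above share the exact value 69165194.28. That is what one would expect if the raw z predictions were far out of range and were pinned to the upper clip bound before `expm1`, although the notebook does not print the bound itself. The sketch below reproduces the effect in plain Python. The `mu`, `sigma`, and bound values are placeholders chosen for illustration, not the ones computed in the cell.

```python
import math

def clipped_inverse(z_values, mu, sigma, log_min, log_max):
    """Map z predictions to volumes, clamping in log1p space so expm1 cannot overflow."""
    out = []
    for z in z_values:
        log_v = z * sigma + mu
        log_v = min(max(log_v, log_min), log_max)  # clamp to [log_min, log_max]
        out.append(math.expm1(log_v))
    return out

# Placeholder scaler and bounds (assumptions for illustration only).
mu, sigma = 15.7, 0.76
log_min, log_max = 12.0, 18.0

z_preds = [0.5, 12.0, 250.0, -40.0]  # two wildly large values, one wildly small
volumes = clipped_inverse(z_preds, mu, sigma, log_min, log_max)
ceiling = math.expm1(log_max)
for z, v in zip(z_preds, volumes):
    flag = " (at ceiling)" if v == ceiling else ""
    print(f"z={z:>7.1f} -> volume={v:,.2f}{flag}")
```

Clipping keeps the metrics finite but hides how far off the raw outputs were, so a run of identical predictions at the ceiling is worth treating as a hint that the wrong weights were loaded.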


Finding and Loading LoRA adapters

In [1]:
# ---- Auto-find and load the LoRA adapters
BASE_MODEL = "Qwen/Qwen2.5-7B-Instruct"
SEARCH_ROOTS = ["/content/qwen25_7b_ts_lora", "/content"]  # add more roots if needed
device = "cuda:0"

import os, time, torch, json
from typing import Optional, Tuple
from transformers import AutoTokenizer, AutoModelForCausalLM
try:
    from peft import AutoPeftModelForCausalLM
    PEFT_OK = True
except Exception:
    PEFT_OK = False

def find_latest_adapter(root_dirs) -> Optional[str]:
    """
    Return dir containing BOTH adapter_model.* and adapter_config.json with the newest mtime.
    Looks recursively under the given roots (handles trainer's checkpoint-* subdirs).
    """
    best_dir, best_mtime = None, -1
    for root in root_dirs:
        if not os.path.exists(root):
            continue
        for cur, _, files in os.walk(root):
            has_cfg = "adapter_config.json" in files
            has_ad = any(f.startswith("adapter_model.") for f in files)
            if has_cfg and has_ad:
                ad_files = [os.path.join(cur, f) for f in files if f.startswith("adapter_model.")]
                mtime = max(os.path.getmtime(p) for p in ad_files)
                if mtime > best_mtime:
                    best_mtime, best_dir = mtime, cur
    return best_dir

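def find_merged_model(root_dirs) -> Optional[str]:
    """Return the first dir containing full model weights (single-file or sharded)."""
    # Sharded checkpoints (typical for a 7B model) ship an index file
    # instead of a single model.safetensors / pytorch_model.bin.
    full_names = {
        "model.safetensors", "model.safetensors.index.json",
        "pytorch_model.bin", "pytorch_model.bin.index.json",
        "pytorch_model.safetensors",
    }
    for root in root_dirs:
        if not os.path.exists(root):
            continue
        for cur, _, files in os.walk(root):
            if full_names.intersection(files):
                return cur
    return None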

def ensure_adapter_config(dirpath: str, base_model: str):
    """If adapter_config.json missing, write a minimal one that matches your LoRA training params."""
    cfg_path = os.path.join(dirpath, "adapter_config.json")
    if os.path.exists(cfg_path):
        return
    lora_cfg = {
        "base_model_name_or_path": base_model,
        "peft_type": "LORA",
        "task_type": "CAUSAL_LM",
        "r": 16, "lora_alpha": 32, "lora_dropout": 0.05,
        "bias": "none", "inference_mode": False,
        "target_modules": ["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"]
    }
    with open(cfg_path, "w") as f:
        json.dump(lora_cfg, f)
    print(f"[fix] wrote missing adapter_config.json at: {cfg_path}")

def load_ready_model_and_tokenizer(base_model: str, roots) -> Tuple[AutoModelForCausalLM, AutoTokenizer, str]:
    # tokenizer ALWAYS from base
    tok = AutoTokenizer.from_pretrained(base_model, use_fast=True, legacy=False)
    if tok.pad_token is None: tok.pad_token = tok.eos_token
    dtype = torch.bfloat16 if (torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8) else torch.float16

    # 1) Try adapters (newest)
    adir = find_latest_adapter(roots)
    if adir and PEFT_OK:
        ensure_adapter_config(adir, base_model)
        try:
            mdl = AutoPeftModelForCausalLM.from_pretrained(adir, torch_dtype=dtype, device_map="auto")
            mdl = mdl.merge_and_unload()
            mdl.config.use_cache = False
            print(f"[ok] loaded & merged LoRA adapters from: {adir}")
            return mdl, tok, "adapters_merged"
        except Exception as e:
            print(f"[warn] adapter load failed at {adir}: {e}")

    # 2) Try full merged checkpoint
    mdir = find_merged_model(roots)
    if mdir:
        mdl = AutoModelForCausalLM.from_pretrained(mdir, torch_dtype=dtype, device_map="auto")
        mdl.config.use_cache = False
        print(f"[ok] loaded full merged model from: {mdir}")
        return mdl, tok, "merged_full"

    # 3) Fallback to base
    mdl = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=dtype, device_map="auto")
    mdl.config.use_cache = False
    print("[WARN] adapters/merged not found — using BASE ONLY.")
    return mdl, tok, "base_only"

model, tokenizer, mode = load_ready_model_and_tokenizer(BASE_MODEL, SEARCH_ROOTS)

# Quick sanity print:
print("mode:", mode)
print("device:", next(model.parameters()).device)

[ok] loaded & merged LoRA adapters from: /content/qwen25_7b_ts_lora/checkpoint-576
mode: adapters_merged
device: cuda:0
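
The log above shows `checkpoint-576` being picked. As a minimal, self-contained check of the newest-mtime rule behind that choice, the sketch below builds two fake checkpoint folders in a temporary directory and applies the same selection logic. `newest_adapter_dir` and the checkpoint names here are hypothetical stand-ins, not part of the notebook.

```python
import os
import tempfile
from typing import Optional

def newest_adapter_dir(root: str) -> Optional[str]:
    """Same selection rule as find_latest_adapter, restricted to one root."""
    best_dir, best_mtime = None, -1.0
    for cur, _, files in os.walk(root):
        weights = [f for f in files if f.startswith("adapter_model.")]
        if "adapter_config.json" in files and weights:
            mtime = max(os.path.getmtime(os.path.join(cur, f)) for f in weights)
            if mtime > best_mtime:
                best_dir, best_mtime = cur, mtime
    return best_dir

with tempfile.TemporaryDirectory() as root:
    # Two fake checkpoints with explicit modification times (seconds since epoch).
    for name, mtime in [("checkpoint-100", 1_000), ("checkpoint-200", 2_000)]:
        ckpt = os.path.join(root, name)
        os.makedirs(ckpt)
        for fname in ("adapter_config.json", "adapter_model.safetensors"):
            path = os.path.join(ckpt, fname)
            open(path, "w").close()
            os.utime(path, (mtime, mtime))
    picked = os.path.basename(newest_adapter_dir(root))
    print(picked)  # checkpoint-200
```

The rule keys on file modification time rather than the step number in the folder name, so copying checkpoints between locations could change which one counts as newest.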


Performance of Qwen with the LoRA adapters merged in

In [2]:
MODEL_PATH = "/content/qwen25_7b_ts_lora"

import os, warnings, re, math, json
os.environ["TRANSFORMERS_VERBOSITY"] = "error"
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

try:
    from transformers.utils import logging as hf_logging
    hf_logging.set_verbosity_error()
except Exception:
    pass

import numpy as np, pandas as pd, torch
from typing import List
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
from transformers import AutoModelForCausalLM, AutoTokenizer, LogitsProcessor

# Load Qwen if not provided
if "model" not in globals() or "tokenizer" not in globals():
    if not MODEL_PATH:
        raise RuntimeError("Set MODEL_PATH to your fine-tuned Qwen 2.5-7B checkpoint or load model/tokenizer beforehand.")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else None,
        device_map="auto",
        trust_remote_code=True,
    )
# Ensure pad/eos are usable (Qwen often lacks pad_token by default)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

device = next(model.parameters()).device
torch.set_grad_enabled(False)
model.eval()

# Config
FEATURE_FILE = "/content/features_trading_only_2.csv"
DATE_COL, VOL_COL, LABEL_COL = "date", "volume", "z_target"
CTX, MAX_FEATURES, TRAIN_FRAC, MAX_LEN = 64, 16, 0.9, 1024

# Data prep
df = pd.read_csv(FEATURE_FILE)
df[DATE_COL] = pd.to_datetime(df[DATE_COL], errors="coerce")
df = df.dropna(subset=[DATE_COL]).sort_values(DATE_COL).reset_index(drop=True)
df = df.replace([np.inf, -np.inf], np.nan).ffill().bfill()

# Top-K by |Pearson| vs label
EXCL = {DATE_COL, LABEL_COL, VOL_COL, "y_trading", "y_log1p"}
num = df.select_dtypes(include=[np.number]).copy()
std = num.std(numeric_only=True)
non_const = std[std > 0].index.tolist()
num = num[non_const]
cand = [c for c in num.columns if c not in EXCL and c != LABEL_COL]
if not cand:
    raise ValueError("No candidate numeric features found.")
corr = num[cand].corrwith(num[LABEL_COL]).abs().replace([np.inf,-np.inf], np.nan).fillna(0.0)
feature_cols = corr.sort_values(ascending=False).index.tolist()[:MAX_FEATURES]

# z-history from log1p(volume)
vol = df[VOL_COL].astype(float).values.reshape(-1,1)
z_hist_scaler = StandardScaler()
z_hist_series = z_hist_scaler.fit_transform(np.log1p(vol)).reshape(-1)

def make_windows(df, z_hist, ctx, feat_cols, label_col):
    X, Y = [], []
    for t in range(ctx, len(df)):
        X.append((z_hist[t-ctx:t].tolist(), df.iloc[t][feat_cols].to_dict()))
        Y.append(float(df.iloc[t][label_col]))
    return X, np.array(Y, dtype=np.float32)

X_raw, Y = make_windows(df, z_hist_series, CTX, feature_cols, LABEL_COL)
cut = int(len(X_raw) * TRAIN_FRAC)
val_text = []
for hist, feats in X_raw[cut:]:
    hist_str  = ", ".join(f"{x:.4f}" for x in hist)
    feats_str = ", ".join(f"{k}={float(v):.4f}" for k,v in feats.items()) if feats else "none"
    prompt = f"z_hist[{len(hist)}]:{hist_str}\nfeat:{feats_str}\nnext_z:"
    val_text.append({"prompt": prompt})

# Scaler for inversion
def load_scaler_json(feature_path: str):
    base = os.path.dirname(feature_path)
    cand = os.path.join(base, "features_trading_only_scaler_2.json")
    if os.path.exists(cand):
        try:
            with open(cand, "r") as f:
                s = json.load(f)
            return float(s["y_log1p_mean"]), float(s["y_log1p_std"])
        except Exception:
            pass
    if "y_log1p" in df.columns:
        mu = float(df["y_log1p"].mean())
        sd = float(df["y_log1p"].std(ddof=0))
        return mu, (sd if sd and sd > 0 else 1.0)
    return 0.0, 1.0

mu, sigma = load_scaler_json(FEATURE_FILE)
sigma = 1.0 if (not np.isfinite(sigma) or sigma == 0) else sigma

# y_val (ground truth)
y_va_z = Y[cut:]
y_val  = np.expm1(y_va_z * sigma + mu)


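class DigitsOnly(LogitsProcessor):
    """Restrict generation to tokens made only of numeric characters."""
    def __init__(self, tok, device):
        allowed = set("0123456789-+.eE \n")
        ids = []
        # Scan the base vocabulary once and keep tokens whose decoded text
        # consists solely of allowed characters.
        for i in range(tok.vocab_size):
            try:
                s = tok.decode([i])
            except Exception:
                continue
            if s and set(s).issubset(allowed):
                ids.append(i)
        if not ids:
            raise RuntimeError("DigitsOnly: no allowed token ids found for this tokenizer.")
        # Note: special tokens (including EOS) typically do not pass this filter,
        # so generation runs for the full max_new_tokens.
        self.allowed_ids = torch.tensor(ids, device=device)
    def __call__(self, input_ids, scores):
        # Add -inf to every disallowed token so it can never be selected.
        mask = torch.full_like(scores, float("-inf"))
        mask[:, self.allowed_ids] = 0
        return scores + mask

digits_only = DigitsOnly(tokenizer, device=device)
# First signed decimal, with optional exponent, in the generated text.
num_pat = re.compile(r"[-+]?\d*\.?\d+(?:[eE][-+]?\d+)?")
def number_from_text(s):
    """Return the first number found in s, or NaN if there is none."""
    m = num_pat.search(s)
    return float(m.group(0)) if m else np.nan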

#  Predict z
preds_z = []
for ex in val_text:
    ids = tokenizer(ex["prompt"], return_tensors="pt", truncation=True, max_length=MAX_LEN).to(device)
    out = model.generate(
        **ids,
        max_new_tokens=12,
        do_sample=False,
        logits_processor=[digits_only],
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    gen = tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)
    preds_z.append(number_from_text(gen))

preds_z = np.array(preds_z, dtype=float)

# Overflow-safe inversion

y_log1p_series = np.log1p(df[VOL_COL].astype(float).clip(lower=0))
q_lo, q_hi = np.nanpercentile(y_log1p_series, [0.1, 99.9])
margin = 0.5 * sigma
y_log1p_min = max(-50.0, float(q_lo - margin))
y_log1p_max = min( 50.0, float(q_hi + margin))
z_min = (y_log1p_min - mu) / sigma
z_max = (y_log1p_max - mu) / sigma

mask = np.isfinite(preds_z)
if mask.sum() == 0:
    raise RuntimeError("Model returned no numeric outputs. Check prompts/decoding/template.")
z_hat = np.clip(preds_z[mask], z_min, z_max)

y_pred = np.expm1(np.clip(z_hat * sigma + mu, y_log1p_min, y_log1p_max))
y_true = y_val[mask]
finite = np.isfinite(y_pred) & np.isfinite(y_true)
y_pred = y_pred[finite]
y_true = y_true[finite]

#  Metrics
def metrics(y_true, y_pred):
    if len(y_true) == 0:
        return {"MAE": np.nan, "RMSE": np.nan, "MAPE%": np.nan}
    mae  = mean_absolute_error(y_true, y_pred)
    rmse = math.sqrt(mean_squared_error(y_true, y_pred))
    mape = float(np.mean(np.abs((y_true - y_pred) / np.clip(y_true, 1e-9, None))) * 100)
    return {"MAE": float(mae), "RMSE": float(rmse), "MAPE%": mape}

print(f"Aligned eval samples (finite): {len(y_true)} / {len(y_val)}")
print("Qwen2.5-7B (text→z→volume):", metrics(y_true, y_pred))

#  Naive baselines on same span
y_all = np.expm1(Y * sigma + mu)
tail_len = len(y_true)
truth_tail = y_all[-tail_len:]

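def seasonal_naive(series, season=5):
    # Forecast each point with the value `season` steps earlier
    # (5 = one trading week, assuming trading-day rows); the first
    # `season` points fall back to themselves.
    yhat = np.roll(series, season)
    yhat[:season] = series[:season]
    return yhat

def moving_avg(series, k=7):
    # Mean of the previous k values; shift(1) keeps the current value out of its own forecast.
    s = pd.Series(series)
    return s.rolling(k, min_periods=1).mean().shift(1).bfill().to_numpy()

# Baselines are computed over the full series and scored on the same tail as the model.
sn = seasonal_naive(y_all)[-tail_len:]
ma = moving_avg(y_all)[-tail_len:]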
print("Seasonal naive:", metrics(truth_tail, sn))
print("Moving average:", metrics(truth_tail, ma))

#  Preview few predictions
for i in range(min(10, len(y_true))):
    print(f"{i:02d} | true={y_true[i]:.2f}  pred={y_pred[i]:.2f}")


Aligned eval samples (finite): 683 / 683
Qwen2.5-7B (text→z→volume): {'MAE': 735187.8427327155, 'RMSE': 1409614.814201706, 'MAPE%': 32.24847408891829}
Seasonal naive: {'MAE': 1035122.125, 'RMSE': 1966850.8488850903, 'MAPE%': 43.81184768676758}
Moving average: {'MAE': 783198.1818787911, 'RMSE': 1475405.823909972, 'MAPE%': 32.859275970257514}
00 | true=2165886.50  pred=3156519.44
01 | true=3933762.75  pred=3645311.22
02 | true=2430223.25  pred=3130555.75
03 | true=3374707.00  pred=3645366.46
04 | true=3410402.75  pred=3534069.10
05 | true=3193953.75  pred=3670256.16
06 | true=11314851.00  pred=3154391.42
07 | true=5218058.50  pred=3130555.75
08 | true=3779608.00  pred=4274142.85
09 | true=4183481.25  pred=3642853.67
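
To put the metrics above in perspective, the following sketch computes the relative error reduction of the fine-tuned Qwen run against each baseline. It uses only the numbers printed above; the helper `relative_reduction` is a hypothetical name introduced for this illustration.

```python
# Metrics copied from the printed output above.
qwen = {"MAE": 735187.8427327155, "RMSE": 1409614.814201706, "MAPE%": 32.24847408891829}
baselines = {
    "Seasonal naive": {"MAE": 1035122.125, "RMSE": 1966850.8488850903, "MAPE%": 43.81184768676758},
    "Moving average": {"MAE": 783198.1818787911, "RMSE": 1475405.823909972, "MAPE%": 32.859275970257514},
}

def relative_reduction(model_err: float, baseline_err: float) -> float:
    """Percentage by which the model's error is lower than the baseline's."""
    return 100.0 * (baseline_err - model_err) / baseline_err

reductions = {
    name: {k: relative_reduction(qwen[k], ref[k]) for k in qwen}
    for name, ref in baselines.items()
}
for name, red in reductions.items():
    print(name, {k: round(v, 2) for k, v in red.items()})
```

On these numbers the adapter-merged model beats the moving average by roughly 6% in MAE and under 2% in MAPE, so the margin over the stronger baseline is small. Preview row 06 (true 11.3M, predicted 3.15M) suggests that volume spikes are still missed.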
