# Notebook 4: SVM

In this notebook, our goal is to build a simple but solid machine-learning baseline that we can later compare with more advanced deep-learning models.
We use an SVM classifier to predict the direction of future returns (1-day, 1-week, and 1-month horizons) based on a set of technical features.

## Table of Contents

1. [Imports, Constants, and Utility Functions](#1-imports-constants-and-utility-functions)
2. [SVM Training and Testing on the Same Asset](#2-svm-training-and-testing-on-the-same-asset)
3. [Generalization Test: Train on One Asset, Test on Another](#3-generalization-test-train-on-one-asset-test-on-another)

## 1. Imports, Constants, and Utility Functions
In the first part of the notebook, we import the libraries and redefine the constants and utility functions introduced in the Feature Engineering and Data Preprocessing notebooks. All of them are used by our machine-learning model, the SVM.

In [18]:
import numpy as np
import pandas as pd
import warnings
import os
warnings.filterwarnings('ignore')

from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.svm import SVC
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    f1_score,
    roc_auc_score
)

import pickle



SEED = 42
np.random.seed(SEED)


PROCESSED_DIR = './data_new/data_processed/'
SEQUENCES_DIR = './data_new/sequences/'

ASSETS = {
    'AAPL': {'name': 'Apple Inc.', 'type': 'stock'},
    'AMZN': {'name': 'Amazon.com Inc.', 'type': 'stock'},
    'NVDA': {'name': 'NVIDIA Corporation', 'type': 'stock'},
    'SPY': {'name': 'S&P 500 ETF', 'type': 'stock'},
    'BTC-USD': {'name': 'Bitcoin', 'type': 'crypto'}
}

TARGET_HORIZONS = ['1day', '1week', '1month']


TRAIN_RATIO = 0.7
VAL_RATIO   = 0.10
TEST_RATIO  = 0.20


In [19]:
def get_feature_target_columns(df):
   
    exclude_cols = ['open', 'high', 'low', 'close', 'volume']
    target_cols = [c for c in df.columns if c.startswith('target_')]
    feature_cols = [
        c for c in df.columns
        if c not in exclude_cols
        and c not in target_cols
        and pd.api.types.is_numeric_dtype(df[c])
    ]
    return feature_cols, target_cols

def time_series_split(df, train_ratio=TRAIN_RATIO, val_ratio=VAL_RATIO, test_ratio=TEST_RATIO):
    
    assert abs(train_ratio + val_ratio + test_ratio - 1.0) < 1e-6, "Ratios must sum to 1.0"

    n = len(df)
    train_end = int(n * train_ratio)
    val_end   = int(n * (train_ratio + val_ratio))

    train_df = df.iloc[:train_end]
    val_df   = df.iloc[train_end:val_end]
    test_df  = df.iloc[val_end:]

    return train_df, val_df, test_df

def scale_features(train_df, val_df, test_df, feature_cols, scaler_type='standard'):
    
    if scaler_type == 'standard':
        scaler = StandardScaler()
    elif scaler_type == 'minmax':
        scaler = MinMaxScaler()
    elif scaler_type == 'robust':
        scaler = RobustScaler()
    else:
        raise ValueError(f"Unknown scaler type: {scaler_type}")


    scaler.fit(train_df[feature_cols])

    train_scaled = train_df.copy()
    val_scaled   = val_df.copy()
    test_scaled  = test_df.copy()

    train_scaled[feature_cols] = scaler.transform(train_df[feature_cols])
    val_scaled[feature_cols]   = scaler.transform(val_df[feature_cols])
    test_scaled[feature_cols]  = scaler.transform(test_df[feature_cols])

    return train_scaled, val_scaled, test_scaled, scaler





In [20]:
PROCESSED_DIR = '../data_new/data_processed/'
SEQUENCES_DIR = '../data_new/sequences/'

def load_cleaned_data(processed_dir=PROCESSED_DIR):
    data = {}
    for asset in ASSETS.keys():
        filename = f'{processed_dir}{asset}_features_cleaned.csv'
        print("Trying to read:", filename)  
        df = pd.read_csv(filename, index_col=0, parse_dates=True)
        data[asset] = df
        print(f"{asset}: shape={df.shape}, range={df.index.min()} → {df.index.max()}")
    return data

data = load_cleaned_data()


Trying to read: ../data_new/data_processed/AAPL_features_cleaned.csv
AAPL: shape=(6303, 47), range=2000-10-16 00:00:00 → 2025-11-06 00:00:00
Trying to read: ../data_new/data_processed/AMZN_features_cleaned.csv
AMZN: shape=(6303, 47), range=2000-10-16 00:00:00 → 2025-11-06 00:00:00
Trying to read: ../data_new/data_processed/NVDA_features_cleaned.csv
NVDA: shape=(6303, 47), range=2000-10-16 00:00:00 → 2025-11-06 00:00:00
Trying to read: ../data_new/data_processed/SPY_features_cleaned.csv
SPY: shape=(6303, 48), range=2000-10-16 00:00:00 → 2025-11-06 00:00:00
Trying to read: ../data_new/data_processed/BTC-USD_features_cleaned.csv
BTC-USD: shape=(3872, 47), range=2015-04-04 00:00:00 → 2025-11-08 00:00:00


The class imbalance weights were computed during the sequence-generation step.
Here we reload them so that the SVM can use them during training.

In [21]:
def load_class_weights(sequences_dir=SEQUENCES_DIR):
    
    filename = f'{sequences_dir}class_weights.pkl'
    with open(filename, 'rb') as f:
        cw = pickle.load(f)

    
    cleaned = {}
    for asset, hdict in cw.items():
        cleaned[asset] = {}
        for horizon, w in hdict.items():
            cleaned[asset][horizon] = {int(k): float(v) for k, v in w.items()}
    return cleaned

class_weights_dict = load_class_weights()
print("Example class weights for AAPL-1day:", class_weights_dict['AAPL']['1day'])

Example class weights for AAPL-1day: {0: 1.0384252710985384, 1: 0.9643169877408057}
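
The sequence-generation step is not shown in this notebook, so the exact recipe behind these weights is an assumption here. The printed AAPL values are consistent with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, because their reciprocals sum to 2. A minimal sketch of that heuristic follows; the helper name `balanced_class_weights` and the label counts are hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {int(c): n_samples / (n_classes * k) for c, k in sorted(counts.items())}

# Hypothetical label counts, chosen only because they reproduce the
# AAPL 1-day weights printed above; the real counts are not shown here.
labels = [0] * 2121 + [1] * 2284
weights = balanced_class_weights(labels)
print(weights)
```

With this heuristic the rarer class receives the larger weight, which is why class 0 is weighted slightly above 1 for every horizon printed in this notebook.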


## 2. SVM Training and Testing on the Same Asset
In this section, we train a separate SVM classifier for each asset and each prediction horizon.

The idea is simple:
- use only one asset at a time (no cross-asset information here)
- predict the binary label `target_<horizon>` (e.g. `target_1day`)
- tune the SVM hyperparameters with a time-series cross-validation
- evaluate the final model on the test set of the same asset

This gives us a first baseline of “how far we can go” when we only use past information from the same asset.
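
To make the time-series cross-validation concrete, here is a pure-Python sketch of an expanding-window split. It is meant to mirror the default behavior of scikit-learn's `TimeSeriesSplit` (equal-sized validation blocks, no gap), which is an assumption of this sketch; the function name `expanding_window_splits` is hypothetical.

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) with training always before testing."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train_idx = range(0, test_start)                       # all rows before the block
        test_idx = range(test_start, test_start + test_size)   # the next block in time
        yield train_idx, test_idx

# Toy example with 12 time-ordered rows
folds = list(expanding_window_splits(n_samples=12, n_splits=5))
for train_idx, test_idx in folds:
    print(f"train=[0..{train_idx[-1]}]  validate={list(test_idx)}")
```

Each fold validates on a block that lies strictly after its training window, so no future information leaks into the hyperparameter search.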

### 2.1 Function: training SVM for one asset & horizon
The function below:
- selects features and the right target column
- splits the data into train / validation / test (in time order)
- scales the features
- merges train + validation to run a `GridSearchCV` with `TimeSeriesSplit`
- trains the final SVM using the best hyperparameters
- evaluates performance on the test set (accuracy, F1, ROC AUC, confusion matrix)
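
As a quick check of the chronological split, the sketch below recomputes the split boundaries with the same integer arithmetic as `time_series_split`. It assumes that `dropna()` removes no rows from the 6303-row AAPL frame; the helper name `split_bounds` is hypothetical.

```python
TRAIN_RATIO, VAL_RATIO = 0.7, 0.10

def split_bounds(n_rows, train_ratio=TRAIN_RATIO, val_ratio=VAL_RATIO):
    """Return (train_end, val_end) row positions for a chronological split."""
    train_end = int(n_rows * train_ratio)
    val_end = int(n_rows * (train_ratio + val_ratio))
    return train_end, val_end

n_rows = 6303  # AAPL row count reported when the data was loaded
train_end, val_end = split_bounds(n_rows)
sizes = {
    "train": train_end,
    "val": val_end - train_end,
    "test": n_rows - val_end,
}
print(sizes)
```

Under that assumption the test split has 1261 rows, which matches the total support shown in the classification reports.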

In [11]:
### SVM + Grid Search
def train_svm_for_asset_horizon(
    data,
    class_weights_dict,
    asset='AAPL',
    horizon='1day',
    scaler_type='standard'
):
    """
    train one SVM for a specific asset + horizon (with GridSearchCV + TimeSeriesSplit)
    return: best_model, metrics_dict
    """
    assert asset in data, f"Unknown asset: {asset}"
    assert horizon in TARGET_HORIZONS, f"Unknown horizon: {horizon}"

    df = data[asset].dropna().copy()


    feature_cols, target_cols = get_feature_target_columns(df)
    target_col = f'target_{horizon}'
    if target_col not in target_cols:
        raise ValueError(f"{asset} does not have target column {target_col}")


    train_df, val_df, test_df = time_series_split(df)


    train_scaled, val_scaled, test_scaled, scaler = scale_features(
        train_df, val_df, test_df, feature_cols, scaler_type=scaler_type
    )


    X_train = train_scaled[feature_cols].values
    y_train = train_scaled[target_col].values

    X_val   = val_scaled[feature_cols].values
    y_val   = val_scaled[target_col].values

    X_test  = test_scaled[feature_cols].values
    y_test  = test_scaled[target_col].values

    X_train_full = np.vstack([X_train, X_val])
    y_train_full = np.concatenate([y_train, y_val])

   
    cw = class_weights_dict[asset][horizon]
    print(f"\n=== Training SVM for {asset} - {horizon} ===")
    print("Class weights used:", cw)

    base_svm = SVC(
        kernel='rbf',
        class_weight=cw,
        probability=False,  
        random_state=SEED
    )


    param_grid = {
        'C': [0.1, 1, 10, 100],
        'gamma': ['scale', 0.01, 0.001]
    }

 
    tscv = TimeSeriesSplit(n_splits=5)

    grid = GridSearchCV(
        estimator=base_svm,
        param_grid=param_grid,
        cv=tscv,
        scoring='f1',   
        n_jobs=-1,
        verbose=1
    )

    grid.fit(X_train_full, y_train_full)

    print("Best params:", grid.best_params_)
    print("Best CV f1:", grid.best_score_)

    
    best_svm = grid.best_estimator_
    y_pred = best_svm.predict(X_test)


    acc = accuracy_score(y_test, y_pred)
    f1  = f1_score(y_test, y_pred)
    cm  = confusion_matrix(y_test, y_pred)

    
    y_score = best_svm.decision_function(X_test) 
    try:
        auc = roc_auc_score(y_test, y_score)
    except ValueError:
        auc = float('nan')

    print("\n=== Test set performance ===")
    print("Accuracy:", acc)
    print("F1-score:", f1)
    print("ROC AUC:", auc)
    print("\nConfusion matrix (rows=true, cols=pred):")
    print(cm)
    print("\nClassification report:")
    print(classification_report(y_test, y_pred, digits=4))

   
    metrics = {
    'accuracy': acc,
    'f1': f1,
    'auc': auc,
    'confusion_matrix': cm,
    'best_params': grid.best_params_,
    'scaler': scaler,
    'feature_cols': feature_cols,
    'target_col': target_col
    }
    return best_svm, metrics



In [13]:
results = {}

for asset in ASSETS.keys():
    for horizon in TARGET_HORIZONS:
        try:
            model, metrics = train_svm_for_asset_horizon(
                data,
                class_weights_dict,
                asset=asset,
                horizon=horizon,
                scaler_type='standard'
            )
            results[(asset, horizon)] = {
                'model': model,
                'metrics': metrics
            }
        except Exception as e:
            print(f"[ERROR] {asset} - {horizon}: {e}")



=== Training SVM for AAPL - 1day ===
Class weights used: {0: 1.0384252710985384, 1: 0.9643169877408057}
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best params: {'C': 100, 'gamma': 0.001}
Best CV f1: 0.47361255641079286

=== Test set performance ===
Accuracy: 0.5035685963521015
F1-score: 0.4685908319185059
ROC AUC: 0.5088802367389811

Confusion matrix (rows=true, cols=pred):
[[359 233]
 [393 276]]

Classification report:
              precision    recall  f1-score   support

           0     0.4774    0.6064    0.5342       592
           1     0.5422    0.4126    0.4686       669

    accuracy                         0.5036      1261
   macro avg     0.5098    0.5095    0.5014      1261
weighted avg     0.5118    0.5036    0.4994      1261


=== Training SVM for AAPL - 1week ===
Class weights used: {0: 1.1611022787493375, 1: 0.8781563126252505}
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best params: {'C': 100, 'gamma': 0.01}
Best CV f1: 0.5107878195

Best params: {'C': 100, 'gamma': 'scale'}
Best CV f1: 0.6368895644545187

=== Test set performance ===
Accuracy: 0.44885011895321175
F1-score: 0.4981949458483755
ROC AUC: 0.45726022930411037

Confusion matrix (rows=true, cols=pred):
[[221 177]
 [518 345]]

Classification report:
              precision    recall  f1-score   support

           0     0.2991    0.5553    0.3887       398
           1     0.6609    0.3998    0.4982       863

    accuracy                         0.4489      1261
   macro avg     0.4800    0.4775    0.4435      1261
weighted avg     0.5467    0.4489    0.4636      1261


=== Training SVM for BTC-USD - 1day ===
Class weights used: {0: 1.0881642512077294, 1: 0.9250513347022588}
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best params: {'C': 0.1, 'gamma': 0.001}
Best CV f1: 0.5397933702392044

=== Test set performance ===
Accuracy: 0.47870967741935483
F1-score: 0.0
ROC AUC: 0.5435203223826426

Confusion matrix (rows=true, cols=pred):
[[371   0

### 2.2 Results : train and test on the same asset

In [14]:
def results_to_table(results):
    rows = []

    for (asset, horizon), content in results.items():
        m = content['metrics']

        rows.append({
            "Asset": asset,
            "Horizon": horizon,
            "Accuracy": m.get('accuracy', None),
            "F1-score": m.get('f1', None),
            "AUC": m.get('auc', None),
            "Best Params": m.get('best_params', None)
        })

    df = pd.DataFrame(rows)
    return df

df_results = results_to_table(results)
df_results



| | Asset | Horizon | Accuracy | F1-score | AUC | Best Params |
|---|---|---|---|---|---|---|
| 0 | AAPL | 1day | 0.503569 | 0.468591 | 0.50888 | {'C': 100, 'gamma': 0.001} |
| 1 | AAPL | 1week | 0.477399 | 0.360815 | 0.510974 | {'C': 100, 'gamma': 0.01} |
| 2 | AAPL | 1month | 0.594766 | 0.7292 | 0.550624 | {'C': 10, 'gamma': 'scale'} |
| 3 | AMZN | 1day | 0.49881 | 0.544012 | 0.507467 | {'C': 10, 'gamma': 0.01} |
| 4 | AMZN | 1week | 0.490087 | 0.506523 | 0.509521 | {'C': 100, 'gamma': 0.01} |
| 5 | AMZN | 1month | 0.486915 | 0.403687 | 0.515941 | {'C': 100, 'gamma': 'scale'} |
| 6 | NVDA | 1day | 0.539255 | 0.693404 | 0.502962 | {'C': 1, 'gamma': 'scale'} |
| 7 | NVDA | 1week | 0.571768 | 0.717573 | 0.494013 | {'C': 10, 'gamma': 'scale'} |
| 8 | NVDA | 1month | 0.610626 | 0.752395 | 0.526872 | {'C': 100, 'gamma': 'scale'} |
| 9 | SPY | 1day | 0.505155 | 0.567268 | 0.494368 | {'C': 100, 'gamma': 'scale'} |


- Overall, for most asset–horizon pairs the accuracy is around 0.48–0.55 and the AUC is close to 0.50. At daily and weekly horizons, the SVM is therefore at best only slightly better than a random long/short decision. The market is very noisy at those frequencies, and an SVM on technical features is clearly not the best method for predicting the direction of a stock.
- Monthly horizon on some equities: for AAPL and especially NVDA, the 1-month F1-scores are much higher (around 0.73–0.75), with accuracies of 0.59–0.61. This suggests that at a longer horizon the direction of the move is somewhat easier to capture, at least for these stocks. The model seems to exploit more persistent trends in monthly returns than in daily moves. The corresponding AUCs (0.55 and 0.53) remain modest, however, so part of the high F1 may simply reflect how often the model predicts the positive class.
- NVDA is a special case: it shows the best F1 at every horizon (0.69–0.75). One possible explanation is that its trend/volatility structure in the sample period is more “directional”, so the model benefits more from technical features there than for SPY or AAPL. Its AUCs stay close to 0.50, though, and the NVDA 1-day report printed in Section 3 shows a recall of 0.9648 for class 1 against 0.0397 for class 0, so at that horizon the high F1 largely comes from predicting “up” almost all the time.
- BTC-USD: collapse of F1. For BTC-USD at the 1-day and 1-week horizons, F1 = 0 even though the AUC is > 0.54. This means the classifier predicts the same class for (almost) every sample. For the 1-day model, an accuracy of 0.4787 combined with F1 = 0 implies that no test sample is predicted as class 1, so recall for class 1 is zero. The fact that the AUC is still above 0.5 shows that the ranking of the decision scores contains some signal, but the default decision threshold does not work well in such a volatile setting.
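
The BTC-USD case (F1 = 0 with AUC above 0.5) can be reproduced on a toy example. The sketch below uses made-up decision scores, not values from the notebook, and the helper names `f1_at_threshold` and `rank_auc` are hypothetical. It assumes the usual binary `SVC` convention that a sample is predicted as class 1 when its decision score is positive.

```python
def f1_at_threshold(scores, labels, threshold=0.0):
    """F1 of class 1 when predicting 1 for every score above the threshold."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def rank_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores: all negative, but positives tend to score higher
scores = [-0.9, -0.7, -0.6, -0.4, -0.3, -0.2]
labels = [0, 0, 1, 0, 1, 1]

f1_default = f1_at_threshold(scores, labels, threshold=0.0)   # nothing predicted as 1
auc = rank_auc(scores, labels)                                # ranking is still informative
f1_shifted = f1_at_threshold(scores, labels, threshold=-0.5)  # threshold moved into the score range
print(f1_default, round(auc, 3), round(f1_shifted, 3))
```

If a shifted threshold were used in practice, it would have to be chosen on the validation split rather than on the test set, otherwise the test metrics would be optimistic.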

## 3. Generalization Test: Train on One Asset, Test on Another
In the previous section, each model was trained and tested on the same asset.
Here we keep exactly the same training procedure (SVM + time-series grid search), but we ask a different question:
If we train a model on asset A, does it still work when we apply it to asset B?

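To do that, we extend the SVM function with an extra argument `test_asset`.
The function always trains on `asset`, but it can evaluate:
- once on the held-out test split of the training asset (same as before),
- and optionally on a different asset (`test_asset`), using the same scaler and the same SVM.

This gives us a direct comparison between “same-asset” performance and “cross-asset” performance, with one caveat: the cross-asset evaluation runs on the full history of `test_asset`, whereas the same-asset evaluation uses only the final test split.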
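
"Using the same scaler" means that the mean and standard deviation are estimated on the training asset only and then applied unchanged to the other asset. A minimal sketch with toy values follows; the numbers and helper names are illustrative, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import fmean, pstdev

def fit_standardizer(values):
    """Return (mean, std) estimated from the training asset only."""
    return fmean(values), pstdev(values)

def apply_standardizer(values, mean, std):
    """Scale values with statistics that were fitted elsewhere."""
    return [(v - mean) / std for v in values]

train_feature = [30.0, 50.0, 70.0]   # toy feature values for the training asset
other_feature = [40.0, 60.0, 80.0]   # toy feature values for the test asset

mean, std = fit_standardizer(train_feature)
scaled_other = apply_standardizer(other_feature, mean, std)
print(scaled_other)
```

The scaled values of the second asset are no longer centered on zero. This kind of distribution shift is one reason a model trained on one asset may behave differently on another.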

### 3.1 Function: SVM with optional cross-asset test

In [22]:
### Here we extend the function above with an option to test how well the model generalizes to another asset
def train_svm_for_asset_horizon_generalization(
    data,
    class_weights_dict,
    asset='AAPL',          
    horizon='1day',
    scaler_type='standard',
    test_asset=None        # We add a new parameter here (test with another asset)
):
    """
    Train one SVM for a specific asset + horizon (with GridSearchCV + TimeSeriesSplit).
    If test_asset differs from the training asset, also run a cross-asset test on test_asset.
    Returns: best_model, metrics_dict
    """
    assert asset in data, f"Unknown asset: {asset}"
    assert horizon in TARGET_HORIZONS, f"Unknown horizon: {horizon}"

    if test_asset is None:
        test_asset = asset
    assert test_asset in data, f"Unknown test_asset: {test_asset}"

    # ========= TRAIN With One Asset (same as the function above) =========
    df = data[asset].dropna().copy()

    feature_cols, target_cols = get_feature_target_columns(df)
    target_col = f'target_{horizon}'
    if target_col not in target_cols:
        raise ValueError(f"{asset} does not have target column {target_col}")

    train_df, val_df, test_df = time_series_split(df)

    train_scaled, val_scaled, test_scaled, scaler = scale_features(
        train_df, val_df, test_df, feature_cols, scaler_type=scaler_type
    )

    X_train = train_scaled[feature_cols].values
    y_train = train_scaled[target_col].values

    X_val   = val_scaled[feature_cols].values
    y_val   = val_scaled[target_col].values

    X_test  = test_scaled[feature_cols].values
    y_test  = test_scaled[target_col].values

    # train + val
    X_train_full = np.vstack([X_train, X_val])
    y_train_full = np.concatenate([y_train, y_val])

    # ========= GridSearch / train =========
    cw = class_weights_dict[asset][horizon]
    print(f"\n=== Training SVM for {asset} (train) - horizon {horizon} ===")
    print("Class weights used:", cw)

    base_svm = SVC(
        kernel='rbf',
        class_weight=cw,
        probability=False,
        random_state=SEED
    )

    param_grid = {
        'C': [0.1, 1, 10, 100],
        'gamma': ['scale', 0.01, 0.001]
    }

    tscv = TimeSeriesSplit(n_splits=5)

    grid = GridSearchCV(
        estimator=base_svm,
        param_grid=param_grid,
        cv=tscv,
        scoring='f1',
        n_jobs=-1,
        verbose=1
    )

    grid.fit(X_train_full, y_train_full)

    print("Best params:", grid.best_params_)
    print("Best CV f1:", grid.best_score_)

    best_svm = grid.best_estimator_

    # ========= TEST On The Same Asset  =========
    print(f"\n=== Test set performance on {asset} (in-sample asset) ===")
    y_pred = best_svm.predict(X_test)

    acc = accuracy_score(y_test, y_pred)
    f1  = f1_score(y_test, y_pred)
    cm  = confusion_matrix(y_test, y_pred)

    y_score = best_svm.decision_function(X_test)
    try:
        auc = roc_auc_score(y_test, y_score)
    except ValueError:
        auc = float('nan')

    print("Accuracy:", acc)
    print("F1-score:", f1)
    print("ROC AUC:", auc)
    print("\nConfusion matrix (rows=true, cols=pred):")
    print(cm)
    print("\nClassification report:")
    print(classification_report(y_test, y_pred, digits=4))

    metrics = {
        'train_asset': asset,
        'test_asset': asset,   
        'accuracy': acc,
        'f1': f1,
        'auc': auc,
        'confusion_matrix': cm,
        'best_params': grid.best_params_,
        'scaler': scaler,
        'feature_cols': feature_cols,
        'target_col': target_col
    }

    # ========= TEST GENERALIZATION , test with another asset (updated part) =========
    if test_asset != asset:
        print(f"\n=== Cross-asset test: train on {asset}, test on {test_asset} ===")
        df_test_asset = data[test_asset].dropna().copy()

        feat_cols_test, target_cols_test = get_feature_target_columns(df_test_asset)
        if target_col not in target_cols_test:
            raise ValueError(f"{test_asset} does not have target column {target_col}")

        
        if list(feat_cols_test) != list(feature_cols):
            print("[Warning] feature columns differ between assets.")
            print("Train features :", feature_cols)
            print("Test  features :", feat_cols_test)

        
        df_test_asset_scaled = df_test_asset.copy()
        df_test_asset_scaled[feature_cols] = scaler.transform(
            df_test_asset_scaled[feature_cols]
        )

        X_test_x = df_test_asset_scaled[feature_cols].values
        y_test_x = df_test_asset_scaled[target_col].values

        y_pred_x = best_svm.predict(X_test_x)
        acc_x = accuracy_score(y_test_x, y_pred_x)
        f1_x  = f1_score(y_test_x, y_pred_x)
        cm_x  = confusion_matrix(y_test_x, y_pred_x)

        try:
            y_score_x = best_svm.decision_function(X_test_x)
            auc_x = roc_auc_score(y_test_x, y_score_x)
        except ValueError:
            auc_x = float('nan')

        print("Cross-asset Accuracy:", acc_x)
        print("Cross-asset F1-score:", f1_x)
        print("Cross-asset ROC AUC:", auc_x)
        print("\nCross-asset confusion matrix (rows=true, cols=pred):")
        print(cm_x)
        print("\nCross-asset classification report:")
        print(classification_report(y_test_x, y_pred_x, digits=4))

        metrics['cross_asset'] = {
            'train_asset': asset,
            'test_asset': test_asset,
            'accuracy': acc_x,
            'f1': f1_x,
            'auc': auc_x,
            'confusion_matrix': cm_x
        }

    return best_svm, metrics


### 3.2 Results of Cross-Asset Test

Here we restrict the cross-asset tests to the equity names only.
We leave out BTC for two reasons:
- it is structurally very different from the stocks and the index,
- and its available history is shorter compared to the others.

We then loop over all (train asset, test asset, horizon) combinations and store:
- the same-asset metrics (train asset tested on itself)
- the cross-asset metrics (train asset tested on a different asset)
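
The set of runs produced by this loop can be enumerated up front. The sketch below only lists the combinations and does not train anything; the variable names are illustrative.

```python
from itertools import permutations, product

equities = ['AAPL', 'AMZN', 'NVDA', 'SPY']
horizons = ['1day', '1week', '1month']

# Every ordered (train, test) pair with train != test, crossed with every horizon
runs = [
    (train_asset, test_asset, horizon)
    for (train_asset, test_asset), horizon in product(permutations(equities, 2), horizons)
]
print(len(runs))
print(runs[0])
```

Since the fitted model depends only on the training asset and the horizon, the same grid search is repeated once per test asset. Caching the 12 distinct (train asset, horizon) models would avoid that repetition.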

In [24]:
# We exclude BTC from the cross-asset tests because it is not correlated enough with the other assets.
# Its history is also shorter: the BTC data starts in 2015, while the other assets start in 2000.

NON_BTC_ASSETS = [a for a in ASSETS.keys() if a != 'BTC-USD']
print("Non-BTC assets used for cross-asset:", NON_BTC_ASSETS)



cross_results = {}

for train_asset in NON_BTC_ASSETS:
    for test_asset in NON_BTC_ASSETS:
        # We do only cross-asset here
        if train_asset == test_asset:
            continue
        
        for horizon in TARGET_HORIZONS:
            try:
                model, metrics = train_svm_for_asset_horizon_generalization(
                    data=data,              
                    class_weights_dict=class_weights_dict,
                    asset=train_asset,             
                    horizon=horizon,
                    scaler_type='standard',
                    test_asset=test_asset          
                )
                
                cross_metrics = metrics.get('cross_asset', None)
                if cross_metrics is None:
                    print(f"[WARN] No cross_asset metrics for {train_asset}->{test_asset}, {horizon}")
                
                cross_results[(train_asset, test_asset, horizon)] = {
                    'model': model,
                    'metrics_same_asset': {        
                        'accuracy': metrics['accuracy'],
                        'f1': metrics['f1'],
                        'auc': metrics['auc'],
                        'confusion_matrix': metrics['confusion_matrix'],
                    },
                    'metrics_cross_asset': cross_metrics  
                }

            except Exception as e:
                print(f"[ERROR] train={train_asset}, test={test_asset}, horizon={horizon}: {e}")


Non-BTC assets used for cross-asset: ['AAPL', 'AMZN', 'NVDA', 'SPY']

=== Training SVM for AAPL (train) - horizon 1day ===
Class weights used: {0: 1.0384252710985384, 1: 0.9643169877408057}
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best params: {'C': 100, 'gamma': 0.001}
Best CV f1: 0.47361255641079286

=== Test set performance on AAPL (in-sample asset) ===
Accuracy: 0.5035685963521015
F1-score: 0.4685908319185059
ROC AUC: 0.5088802367389811

Confusion matrix (rows=true, cols=pred):
[[359 233]
 [393 276]]

Classification report:
              precision    recall  f1-score   support

           0     0.4774    0.6064    0.5342       592
           1     0.5422    0.4126    0.4686       669

    accuracy                         0.5036      1261
   macro avg     0.5098    0.5095    0.5014      1261
weighted avg     0.5118    0.5036    0.4994      1261


=== Cross-asset test: train on AAPL, test on AMZN ===
Cross-asset Accuracy: 0.5237188640330002
Cross-asset F1-score: 0

Best params: {'C': 100, 'gamma': 0.001}
Best CV f1: 0.47361255641079286

=== Test set performance on AAPL (in-sample asset) ===
Accuracy: 0.5035685963521015
F1-score: 0.4685908319185059
ROC AUC: 0.5088802367389811

Confusion matrix (rows=true, cols=pred):
[[359 233]
 [393 276]]

Classification report:
              precision    recall  f1-score   support

           0     0.4774    0.6064    0.5342       592
           1     0.5422    0.4126    0.4686       669

    accuracy                         0.5036      1261
   macro avg     0.5098    0.5095    0.5014      1261
weighted avg     0.5118    0.5036    0.4994      1261


=== Cross-asset test: train on AAPL, test on SPY ===
Train features : ['dividends', 'stock splits', 'close_to_open_ratio', 'high_to_low_ratio', 'price_range', 'volume_change', 'volume_sma_20', 'volume_ratio', 'rsi_14', 'macd', 'macd_diff', 'roc_12', 'close_to_sma20', 'close_to_sma50', 'bb_low', 'bb_width', 'bb_position', 'adx', 'atr_14', 'atr_ratio', 'hist_vol_50', '

Accuracy: 0.49881046788263284
F1-score: 0.5440115440115441
ROC AUC: 0.5074668197289187

Confusion matrix (rows=true, cols=pred):
[[252 361]
 [271 377]]

Classification report:
              precision    recall  f1-score   support

           0     0.4818    0.4111    0.4437       613
           1     0.5108    0.5818    0.5440       648

    accuracy                         0.4988      1261
   macro avg     0.4963    0.4964    0.4938      1261
weighted avg     0.4967    0.4988    0.4952      1261


=== Cross-asset test: train on AMZN, test on AAPL ===
Cross-asset Accuracy: 0.48357924797715374
Cross-asset F1-score: 0.2919295192516859
Cross-asset ROC AUC: 0.48769777577190787

Cross-asset confusion matrix (rows=true, cols=pred):
[[2377  622]
 [2633  671]]

Cross-asset classification report:
              precision    recall  f1-score   support

           0     0.4745    0.7926    0.5936      2999
           1     0.5189    0.2031    0.2919      3304

    accuracy                         

Best params: {'C': 10, 'gamma': 0.01}
Best CV f1: 0.5023470462653862

=== Test set performance on AMZN (in-sample asset) ===
Accuracy: 0.49881046788263284
F1-score: 0.5440115440115441
ROC AUC: 0.5074668197289187

Confusion matrix (rows=true, cols=pred):
[[252 361]
 [271 377]]

Classification report:
              precision    recall  f1-score   support

           0     0.4818    0.4111    0.4437       613
           1     0.5108    0.5818    0.5440       648

    accuracy                         0.4988      1261
   macro avg     0.4963    0.4964    0.4938      1261
weighted avg     0.4967    0.4988    0.4952      1261


=== Cross-asset test: train on AMZN, test on SPY ===
Train features : ['dividends', 'stock splits', 'close_to_open_ratio', 'high_to_low_ratio', 'price_range', 'volume_change', 'volume_sma_20', 'volume_ratio', 'rsi_14', 'macd', 'macd_diff', 'roc_12', 'close_to_sma20', 'close_to_sma50', 'bb_low', 'bb_width', 'bb_position', 'adx', 'atr_14', 'atr_ratio', 'hist_vol_50', 'ob

Accuracy: 0.5392545598731165
F1-score: 0.6934036939313983
ROC AUC: 0.5029621753000152

Confusion matrix (rows=true, cols=pred):
[[ 23 557]
 [ 24 657]]

Classification report:
              precision    recall  f1-score   support

           0     0.4894    0.0397    0.0734       580
           1     0.5412    0.9648    0.6934       681

    accuracy                         0.5393      1261
   macro avg     0.5153    0.5022    0.3834      1261
weighted avg     0.5173    0.5393    0.4082      1261


=== Cross-asset test: train on NVDA, test on AAPL ===
Cross-asset Accuracy: 0.5280025384737427
Cross-asset F1-score: 0.6486358804771466
Cross-asset ROC AUC: 0.5203250256138647

Cross-asset confusion matrix (rows=true, cols=pred):
[[ 582 2417]
 [ 558 2746]]

Cross-asset classification report:
              precision    recall  f1-score   support

           0     0.5105    0.1941    0.2812      2999
           1     0.5319    0.8311    0.6486      3304

    accuracy                         0.5

Best params: {'C': 1, 'gamma': 'scale'}
Best CV f1: 0.5389574487610063

=== Test set performance on NVDA (in-sample asset) ===
Accuracy: 0.5392545598731165
F1-score: 0.6934036939313983
ROC AUC: 0.5029621753000152

Confusion matrix (rows=true, cols=pred):
[[ 23 557]
 [ 24 657]]

Classification report:
              precision    recall  f1-score   support

           0     0.4894    0.0397    0.0734       580
           1     0.5412    0.9648    0.6934       681

    accuracy                         0.5393      1261
   macro avg     0.5153    0.5022    0.3834      1261
weighted avg     0.5173    0.5393    0.4082      1261


=== Cross-asset test: train on NVDA, test on SPY ===
Train features : ['dividends', 'stock splits', 'close_to_open_ratio', 'high_to_low_ratio', 'price_range', 'volume_change', 'volume_sma_20', 'volume_ratio', 'rsi_14', 'macd', 'macd_diff', 'roc_12', 'close_to_sma20', 'close_to_sma50', 'bb_low', 'bb_width', 'bb_position', 'adx', 'atr_14', 'atr_ratio', 'hist_vol_50', 'o

Accuracy: 0.5051546391752577
F1-score: 0.5672676837725382
ROC AUC: 0.4943678773572934

Confusion matrix (rows=true, cols=pred):
[[228 343]
 [281 409]]

Classification report:
              precision    recall  f1-score   support

           0     0.4479    0.3993    0.4222       571
           1     0.5439    0.5928    0.5673       690

    accuracy                         0.5052      1261
   macro avg     0.4959    0.4960    0.4947      1261
weighted avg     0.5004    0.5052    0.5016      1261


=== Cross-asset test: train on SPY, test on AAPL ===
Train features : ['dividends', 'stock splits', 'capital gains', 'close_to_open_ratio', 'high_to_low_ratio', 'price_range', 'volume_change', 'volume_sma_20', 'volume_ratio', 'rsi_14', 'macd', 'macd_diff', 'roc_12', 'close_to_sma20', 'close_to_sma50', 'bb_low', 'bb_width', 'bb_position', 'adx', 'atr_14', 'atr_ratio', 'hist_vol_50', 'obv', 'close_to_vwap', 'returns_lag_1', 'returns_lag_3', 'returns_lag_6', 'returns_lag_12', 'returns_lag_24', '

Best params: {'C': 100, 'gamma': 'scale'}
Best CV f1: 0.6073790917488046

=== Test set performance on SPY (in-sample asset) ===
Accuracy: 0.46708961141950833
F1-score: 0.4700315457413249
ROC AUC: 0.49603777078318834

Confusion matrix (rows=true, cols=pred):
[[291 200]
 [472 298]]

Classification report:
              precision    recall  f1-score   support

           0     0.3814    0.5927    0.4641       491
           1     0.5984    0.3870    0.4700       770

    accuracy                         0.4671      1261
   macro avg     0.4899    0.4898    0.4671      1261
weighted avg     0.5139    0.4671    0.4677      1261


=== Cross-asset test: train on SPY, test on AMZN ===
Train features : ['dividends', 'stock splits', 'capital gains', 'close_to_open_ratio', 'high_to_low_ratio', 'price_range', 'volume_change', 'volume_sma_20', 'volume_ratio', 'rsi_14', 'macd', 'macd_diff', 'roc_12', 'close_to_sma20', 'close_to_sma50', 'bb_low', 'bb_width', 'bb_position', 'adx', 'atr_14', 'atr_ratio

Best params: {'C': 100, 'gamma': 'scale'}
Best CV f1: 0.6368895644545187

=== Test set performance on SPY (in-sample asset) ===
Accuracy: 0.44885011895321175
F1-score: 0.4981949458483755
ROC AUC: 0.45726022930411037

Confusion matrix (rows=true, cols=pred):
[[221 177]
 [518 345]]

Classification report:
              precision    recall  f1-score   support

           0     0.2991    0.5553    0.3887       398
           1     0.6609    0.3998    0.4982       863

    accuracy                         0.4489      1261
   macro avg     0.4800    0.4775    0.4435      1261
weighted avg     0.5467    0.4489    0.4636      1261


=== Cross-asset test: train on SPY, test on NVDA ===
Train features : ['dividends', 'stock splits', 'capital gains', 'close_to_open_ratio', 'high_to_low_ratio', 'price_range', 'volume_change', 'volume_sma_20', 'volume_ratio', 'rsi_14', 'macd', 'macd_diff', 'roc_12', 'close_to_sma20', 'close_to_sma50', 'bb_low', 'bb_width', 'bb_position', 'adx', 'atr_14', 'atr_ratio

### Why cross-asset testing fails when training on SPY

Cross-asset evaluation breaks specifically for models trained on **SPY** because the SPY dataset contains an additional feature column that the other assets do **not** have:

- `capital gains`  ← **exists only in SPY**

During training, the SVM learns from SPY’s full feature set (including `capital gains`).  
During cross-asset testing, however, the code applies the same `feature_cols` to AAPL / AMZN / NVDA. Since these assets do **not** contain the `capital gains` column, looking up `feature_cols` on the test asset fails, so no cross-asset metrics are produced for the SPY-trained models.
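
One possible workaround, sketched here under assumptions, is to restrict `feature_cols` to the columns that exist in both the training and the test asset before fitting. The helper name `common_feature_columns` and the toy DataFrames are hypothetical and not part of the notebook; only the column names are taken from the feature list printed above.

```python
import pandas as pd

def common_feature_columns(train_df, test_df, feature_cols):
    """Split feature_cols into columns present in both frames and columns missing from either."""
    shared = [c for c in feature_cols
              if c in train_df.columns and c in test_df.columns]
    dropped = [c for c in feature_cols if c not in shared]
    return shared, dropped

# Toy frames standing in for the processed SPY and AMZN data (hypothetical values)
spy_df = pd.DataFrame({
    "capital gains": [0.0, 0.0],
    "rsi_14": [55.1, 48.3],
    "macd": [0.2, -0.1],
})
amzn_df = pd.DataFrame({
    "rsi_14": [61.0, 42.7],
    "macd": [0.5, -0.3],
})

feature_cols = ["capital gains", "rsi_14", "macd"]
shared, dropped = common_feature_columns(spy_df, amzn_df, feature_cols)
print("shared :", shared)
print("dropped:", dropped)

# Selecting only the shared columns works on both assets
X_train = spy_df[shared]
X_test = amzn_df[shared]
```

Note that the scaler and the SVM would also have to be fitted on the shared columns only. A model already trained on SPY's full feature set expects the same number of input features at prediction time.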



In [31]:
def results_to_table_cross(cross_results):
    """Flatten the cross-asset results dict into one DataFrame row per
    (train asset, test asset, horizon) combination."""
    rows = []

    for (train_asset, test_asset, horizon), content in cross_results.items():
        # Fall back to empty dicts so a row is still created when metrics are missing
        same = content.get('metrics_same_asset', {})
        cross = content.get('metrics_cross_asset', {})

        rows.append({
            "Train Asset": train_asset,
            "Test Asset": test_asset,
            "Horizon": horizon,

            # Cross-asset metrics (model evaluated on a different asset)
            "Cross Accuracy": cross.get('accuracy'),
            "Cross F1-score": cross.get('f1'),
            "Cross AUC": cross.get('auc'),

            # Same-asset metrics (model evaluated on the training asset's test set)
            "Same Accuracy": same.get('accuracy'),
            "Same F1-score": same.get('f1'),
            "Same AUC": same.get('auc'),
        })

    return pd.DataFrame(rows)


In [32]:
df_cross = results_to_table_cross(cross_results)
df_cross

|   | Train Asset | Test Asset | Horizon | Cross Accuracy | Cross F1-score | Cross AUC | Same Accuracy | Same F1-score | Same AUC |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AAPL | AMZN | 1day | 0.523719 | 0.560726 | 0.536879 | 0.503569 | 0.468591 | 0.50888 |
| 1 | AAPL | AMZN | 1week | 0.521656 | 0.555506 | 0.528618 | 0.477399 | 0.360815 | 0.510974 |
| 2 | AAPL | AMZN | 1month | 0.548152 | 0.657528 | 0.52157 | 0.594766 | 0.7292 | 0.550624 |
| 3 | AAPL | NVDA | 1day | 0.5142 | 0.547844 | 0.51733 | 0.503569 | 0.468591 | 0.50888 |
| 4 | AAPL | NVDA | 1week | 0.515945 | 0.575838 | 0.508468 | 0.477399 | 0.360815 | 0.510974 |
| 5 | AAPL | NVDA | 1month | 0.580359 | 0.677478 | 0.554306 | 0.594766 | 0.7292 | 0.550624 |
| 6 | AAPL | SPY | 1day | 0.520387 | 0.601135 | 0.506335 | 0.503569 | 0.468591 | 0.50888 |
| 7 | AAPL | SPY | 1week | 0.499762 | 0.579432 | 0.470123 | 0.477399 | 0.360815 | 0.510974 |
| 8 | AAPL | SPY | 1month | 0.628748 | 0.76416 | 0.481863 | 0.594766 | 0.7292 | 0.550624 |
| 9 | AMZN | AAPL | 1day | 0.483579 | 0.29193 | 0.487698 | 0.49881 | 0.544012 | 0.507467 |


## Key Findings and Analysis

### Overall Model Performance

The SVM models showed moderate success in predicting market direction across different time horizons. Looking at the results, we can see some clear patterns emerging. The daily predictions remain challenging with most F1-scores hovering between 0.29 and 0.65. Weekly predictions showed slightly better consistency, while monthly forecasts produced the strongest results overall.

### Cross-Asset Generalization Performance

One of the most interesting findings is how well these models transfer between different assets. When we train on one stock and test on another, the performance drop isn't catastrophic. In many cases, the cross-asset F1-scores stay within 0.05-0.10 of the same-asset baseline. This tells us something important: the technical indicators we're using capture patterns that exist across multiple markets, not just individual stocks.

The best generalization results came from NVDA-trained models. When we trained on NVIDIA data and tested on other assets, the F1-scores consistently exceeded 0.65 for most combinations, with some reaching above 0.70. The 1-month horizon showed particularly strong transfers, with NVDA→SPY achieving an F1-score of 0.788. This makes sense because NVDA's strong trending behavior during the training period likely helped the model learn robust directional patterns.

On the flip side, AMZN-trained models struggled the most with generalization. Training on Amazon and testing elsewhere produced some of the weakest results, particularly at the daily horizon where F1-scores dropped as low as 0.24. This suggests Amazon's price movements have more idiosyncratic characteristics that don't transfer well to other stocks.

### Horizon-Specific Observations

**1-day predictions**: These were tough across the board. Cross-asset F1-scores ranged widely, from roughly 0.29 to 0.71. The market noise at daily frequencies makes it hard for any model to consistently predict direction, whether testing on the same asset or a different one.

**1-week predictions**: Performance improved somewhat here. We see more stability in the cross-asset results, with most F1-scores landing between 0.50 and 0.68. The weekly timeframe seems to filter out some of the daily noise while still maintaining enough signal for the model to work with.

**1-month predictions**: This is where the SVM really shines. Monthly predictions showed the strongest performance overall, with several cross-asset combinations achieving F1-scores above 0.75. The longer horizon gives trends more time to develop, making them easier to identify using technical features.

### Practical Takeaways

These results suggest that SVMs can be useful for medium-term market prediction, especially at monthly horizons. However, you shouldn't expect magic. The models work best when:
- You're predicting over longer timeframes (1 month vs 1 day)
- You're working with assets that have strong trending characteristics

Even then, expect cross-asset predictions to be somewhat weaker than same-asset predictions in most cases.

The fact that NVDA models generalize well is noteworthy. If you had to pick one asset to train on and apply elsewhere, NVDA from this period would be your best bet. But realistically, for production use, you'd want to train individual models for each asset you're trading.

The relatively small performance gap between same-asset and cross-asset testing (often less than 0.10 in F1-score) is encouraging. It means the features we engineered aren't overfitting to individual stock quirks. They're capturing broader market dynamics that exist across different securities.

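That said, the AUC scores hovering around 0.50-0.55 for many combinations remind us to stay humble. A high F1-score paired with an AUC near 0.5 (for example AAPL→SPY at 1 month: F1 0.764, AUC 0.482) is consistent with a model that mostly predicts the majority "up" class rather than one that ranks days well. We're extracting some signal from the noise, but we're not exactly predicting the future with high confidence. These models might be useful as one input in a larger trading system, but they're not standalone money-printing machines.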
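
To put an F1-score in context, it can help to compare it with a naive classifier that predicts "up" for every sample. The sketch below does this for the first SPY confusion matrix printed earlier in this section (491 down samples, 770 up samples). The helper name `always_up_f1` is hypothetical and not part of the notebook.

```python
import numpy as np

def always_up_f1(n_down, n_up):
    """F1 of class 1 for a classifier that predicts 1 for every sample.

    Recall is 1 by construction; precision equals the share of up samples.
    """
    precision = n_up / (n_up + n_down)
    recall = 1.0
    return 2 * precision * recall / (precision + recall)

# Confusion matrix from the SPY test output (rows = true, cols = predicted)
cm = np.array([[291, 200],
               [472, 298]])
tn, fp = cm[0]
fn, tp = cm[1]

# F1 of class 1 for the fitted SVM, recomputed from the matrix
model_f1 = 2 * tp / (2 * tp + fp + fn)

# F1 of the naive "always up" baseline on the same test set
baseline_f1 = always_up_f1(n_down=tn + fp, n_up=fn + tp)

print(f"model F1     : {model_f1:.4f}")
print(f"always-up F1 : {baseline_f1:.4f}")
```

When the positive class is the majority, this baseline can already score a high F1, so an F1 value is easier to interpret next to the baseline and the AUC than on its own.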

In [None]:
# Display summary statistics for cross-asset generalization
print("\n" + "="*70)
print("CROSS-ASSET GENERALIZATION SUMMARY")
print("="*70)

# Calculate average performance by horizon
print("\nAverage Cross-Asset F1-Score by Horizon:")
for horizon in TARGET_HORIZONS:
    df_horizon = df_cross[df_cross["Horizon"] == horizon]
    avg_cross_f1 = df_horizon["Cross F1-score"].mean()
    avg_same_f1 = df_horizon["Same F1-score"].mean()
    print(f"  {horizon}:")
    print(f"    Cross-asset: {avg_cross_f1:.4f}")
    print(f"    Same-asset:  {avg_same_f1:.4f}")

# Top 5 best generalization cases
print("\n" + "="*70)
print("Top 5 Best Cross-Asset Generalizations (by F1-score):")
print("="*70)
top_5 = df_cross.nlargest(5, "Cross F1-score")[["Train Asset", "Test Asset", "Horizon", "Cross F1-score", "Cross Accuracy"]]
print(top_5.to_string(index=False))

# Top 5 worst generalization cases
print("\n" + "="*70)
print("Top 5 Worst Cross-Asset Generalizations (by F1-score):")
print("="*70)
bottom_5 = df_cross.nsmallest(5, "Cross F1-score")[["Train Asset", "Test Asset", "Horizon", "Cross F1-score", "Cross Accuracy"]]
print(bottom_5.to_string(index=False))

# Calculate performance degradation (same vs cross)
df_cross["F1_degradation"] = df_cross["Same F1-score"] - df_cross["Cross F1-score"]
print("\n" + "="*70)
print("Average F1-Score Degradation (Same-Asset vs Cross-Asset):")
print("="*70)
for horizon in TARGET_HORIZONS:
    df_horizon = df_cross[df_cross["Horizon"] == horizon]
    avg_degradation = df_horizon["F1_degradation"].mean()
    print(f"  {horizon}: {avg_degradation:.4f}")


### Notes on the columns in `df_cross`

The first group of metric columns (Cross Accuracy, Cross F1-score, Cross AUC) refers to the **cross-asset results**.  
These values show how a model trained on one asset behaves when it is tested on a *different* asset.  
They are useful for seeing whether the patterns learned from the training asset transfer to another market.

The next set of columns (Same Accuracy, Same F1-score, Same AUC) comes from the **test set of the training asset itself**.  
These give a baseline of how well the model performs when the training and testing data follow the same distribution.

In brief:

- **Cross metrics** → how well the model generalizes across assets.  
- **Same metrics** → how well the model performs on its own asset.


### Interpretation of the results
We can observe that cross-asset performance is often very close to the same-asset baseline. For many train/test pairs, the cross accuracy and F1-score stay within a small range of the original same-asset values.
This suggests that the technical features we used (momentum indicators, volatility measures, rolling signals, etc.) capture behaviours that are not entirely specific to a single stock. In that sense, the model generalizes across assets to a reasonable degree.
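
As a rough check of how close "very close" is, the sketch below computes the absolute F1 gap for the ten `df_cross` rows displayed earlier and counts how many fall within 0.10. The name `df_visible` and the `THRESHOLD` constant are illustrative assumptions. The values are copied from the displayed rows only, so the share on the full table may differ.

```python
import pandas as pd

# F1 columns copied from the ten df_cross rows displayed above
df_visible = pd.DataFrame({
    "Train Asset": ["AAPL"] * 9 + ["AMZN"],
    "Test Asset": ["AMZN"] * 3 + ["NVDA"] * 3 + ["SPY"] * 3 + ["AAPL"],
    "Horizon": ["1day", "1week", "1month"] * 3 + ["1day"],
    "Cross F1-score": [0.560726, 0.555506, 0.657528,
                       0.547844, 0.575838, 0.677478,
                       0.601135, 0.579432, 0.764160,
                       0.291930],
    "Same F1-score": [0.468591, 0.360815, 0.729200,
                      0.468591, 0.360815, 0.729200,
                      0.468591, 0.360815, 0.729200,
                      0.544012],
})

THRESHOLD = 0.10  # assumed cut-off for "close to the same-asset baseline"

# Absolute gap, since the cross-asset score can be above or below the baseline
df_visible["abs_f1_gap"] = (
    df_visible["Cross F1-score"] - df_visible["Same F1-score"]
).abs()

n_close = int((df_visible["abs_f1_gap"] <= THRESHOLD).sum())
print(f"{n_close} of {len(df_visible)} displayed pairs are within {THRESHOLD:.2f} F1")
print(df_visible.groupby("Horizon")["abs_f1_gap"].mean())
```

Running the same computation on the full `df_cross` would show whether the pattern holds beyond the displayed rows.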