### Package Imports

In [None]:
from pathlib import Path
import warnings

import numpy as np
import pandas as pd

# Silence library warnings before the project modules are imported.
warnings.filterwarnings('ignore')

from Data.DataManager import DataManager
from Data.TextVectoriser import TextVectorizer
from Models.sk_models import (
    LogReg, RandomForest,
    XGBoostClassifier, KNNClassifier,
    NaiveBayesClassifier, LightGBMClassifier,
    ExtraTreesClassifierWrapper, AdaBoostClassifierWrapper,
    RidgeClassifierWrapper, SGDClassifierWrapper
)
from Models.torch_mlp import MLPClassifier
from Models.torch_hf_transformer import HFTransformerClassifier
from Models.utils import (
    batch_check_naive_predictions,
    compare_models_metrics,
    generalized_gridsearch,
    set_seed,
)

### Features and Target variable management

This block runs the core preprocessing pipeline, which combines ECB speech data with French government bond yields to build the modeling dataset. It uses two inputs:

- **ECB Speeches**: Complete textual content of European Central Bank communications from 2011 onwards
- **Bond Yield Data**: Daily French government bond spreads between different maturities (2Y, 10Y)

The `DataManager` class performs several key operations:

1. **Data Loading**: Loads both speech texts and corresponding yield curve data, ensuring proper date parsing and temporal alignment

2. **Target Variable Construction**: For each speech date, calculates the directional movement of the yield spread:
  - **Target 1**: 2Y-10Y spread movement (short-term vs long-term rates)

3. **Classification Labels**: Converts continuous spread changes into directional classes:
  - `+1`: Spread widening (positive movement)
  - `0`: No significant movement
  - `-1`: Spread tightening (negative movement)

4. **Dataset Export**: Creates a clean, analysis-ready dataset with speech features aligned to their corresponding market impact labels
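
The labelling step (item 3) can be sketched as follows. This is a minimal illustration, not the `DataManager` implementation: the function name `direction_label`, the tolerance `tol`, and the sample spread values are assumptions made for the example.

```python
def direction_label(change: float, tol: float = 0.0) -> int:
    """Map a spread change to +1 (widening), -1 (tightening) or 0 (no significant move).

    `tol` is a hypothetical threshold; the notebook does not state which one is used.
    """
    if change > tol:
        return 1
    if change < -tol:
        return -1
    return 0


# Illustrative spread levels (percentage points) before and after three speeches.
spread_before = [1.20, 0.95, 1.10]
spread_after = [1.32, 0.95, 1.01]

labels = [direction_label(after - before)
          for before, after in zip(spread_before, spread_after)]
print(labels)  # [1, 0, -1]
```

With `tol=0.0`, only an exactly unchanged spread is labelled `0`; a larger tolerance would move small changes into that class.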

In [2]:
BASE_DIR = Path.cwd() 
manager = DataManager(
    speeches_csv=BASE_DIR / "Data/all_ECB_speeches.csv",
    rates_csv=BASE_DIR / "Data/data.csv",
    output_csv=BASE_DIR / "Data/dataset.csv",
    start_date="2011-01-01",
    force_binary=False
)

manager.load_speeches()
manager.load_rates()
manager.build_dataset()
manager.export_dataset()
manager.summary()

Final dataset exported to c:\Users\rabhi\Documents\Master 272 IEF - Dauphine\M2\S2\NLP\projet\Data\dataset.csv (1651 rows).
Summary of classes (target_1):
target_1
-1    847
 1    795
 0      9
Name: count, dtype: int64

Summary of classes (target_2):
target_2
 1    856
-1    776
 0     19
Name: count, dtype: int64


In [4]:
set_seed(69)

### Models application with Raw Features

This section implements our comprehensive machine learning pipeline to predict yield spread movements from ECB speech content using **raw TF-IDF features** without dimensionality reduction.

- **Feature Extraction**: TF-IDF vectorization with up to 50,000 features using unigrams and bigrams
- **Temporal Split**: Training on pre-2022 data, testing on 2022+ to simulate realistic deployment conditions
- **Feature Space**: Full high-dimensional representation preserving all textual information
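
The temporal split described above can be illustrated with a small standalone sketch. It is an assumption about the general idea only, not the code behind `tv.temporal_split`; the helper name and the placeholder rows are hypothetical.

```python
from datetime import date


def temporal_split(rows, split_date):
    """Split (date, text) rows into train (< split_date) and test (>= split_date)."""
    cutoff = date.fromisoformat(split_date)
    train = [row for row in rows if date.fromisoformat(row[0]) < cutoff]
    test = [row for row in rows if date.fromisoformat(row[0]) >= cutoff]
    return train, test


# Placeholder rows standing in for dated speeches.
rows = [
    ("2019-06-18", "speech A"),
    ("2021-12-16", "speech B"),
    ("2022-01-01", "speech C"),
    ("2023-03-02", "speech D"),
]
train, test = temporal_split(rows, "2022-01-01")
print(len(train), len(test))  # 2 2
```

Because every training row predates every test row, no future information reaches the training set, which a random shuffle would not guarantee.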

Hyperparameters are tuned automatically with `generalized_gridsearch` across several algorithm families:
- **Linear Models**: Logistic Regression, Ridge, SGD
- **Tree-Based Ensembles**: Random Forest, XGBoost, LightGBM, Extra Trees, AdaBoost  
- **Instance-Based**: K-Nearest Neighbors
- **Probabilistic**: Naive Bayes (Multinomial)

Each model undergoes systematic grid search to identify optimal hyperparameters, ensuring fair performance comparison across different algorithmic approaches on the original feature space.

The final comparison ranks models by macro-averaged F1, which weights each directional class equally, and establishes a baseline before dimensionality reduction is applied.
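
To make the macro-averaged metrics concrete, the sketch below recomputes them from a confusion matrix. The helper `macro_metrics` is written for this illustration and is not `compare_models_metrics`; the input is the confusion matrix reported for `NaiveBayes_Multi` in the results of this section.

```python
def macro_metrics(cm):
    """Compute accuracy and macro-averaged precision, recall and F1.

    `cm[i][j]` counts samples of true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    precisions, recalls, f1s = [], [], []
    for k in range(n_classes):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
        actual_k = sum(cm[k])                                  # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    return {
        "accuracy": sum(cm[k][k] for k in range(n_classes)) / total,
        "precision_macro": sum(precisions) / n_classes,
        "recall_macro": sum(recalls) / n_classes,
        "f1_macro": sum(f1s) / n_classes,
    }


metrics = macro_metrics([[104, 91], [73, 101]])
print({name: round(value, 6) for name, value in metrics.items()})
```

The rounded values agree with the `NaiveBayes_Multi` row (accuracy 0.555556, F1-macro 0.555526). Macro recall is the mean of the per-class recalls, which is why it coincides with the `balanced_accuracy` column.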

In [5]:
tv = TextVectorizer(
    dataset_csv="dataset.csv",
    text_cols=("speakers", "title", "subtitle", "contents"),
    target_cols=("target_1",),  # trailing comma: ("target_1") is a plain string, not a tuple
    id_col="date",
    concat_text=False
)
tv.load()

X, y, vectorizer = tv.tfidf(target="target_1", max_features=50000, ngram_range=(1, 2))
X_tr, X_te, y_tr, y_te = tv.temporal_split(X, y, split_date="2022-01-01")

optimized_models = {}

result = generalized_gridsearch(LogReg(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["LogisticRegression"] = result['best_model']

result = generalized_gridsearch(RandomForest(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["RandomForest"] = result['best_model']


result = generalized_gridsearch(XGBoostClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["XGBoost"] = result['best_model']

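# NOTE: KNN_Cosine and KNN_Euclidean are both built from the default KNNClassifier();
# no distance metric is passed, which is consistent with their identical rows in the results.
result = generalized_gridsearch(KNNClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)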
if result['best_model']:
    optimized_models["KNN_Cosine"] = result['best_model']

result = generalized_gridsearch(KNNClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["KNN_Euclidean"] = result['best_model']

result = generalized_gridsearch(NaiveBayesClassifier(variant="multinomial"), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["NaiveBayes_Multi"] = result['best_model']

result = generalized_gridsearch(LightGBMClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["LightGBM"] = result['best_model']

result = generalized_gridsearch(ExtraTreesClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["ExtraTrees"] = result['best_model']

result = generalized_gridsearch(AdaBoostClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["AdaBoost"] = result['best_model']

result = generalized_gridsearch(RidgeClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["Ridge"] = result['best_model']

result = generalized_gridsearch(SGDClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["SGD"] = result['best_model']

input_dim = X_tr.shape[1]
mlp = MLPClassifier(
    input_dim=input_dim,
    num_classes=len(np.unique(y)),
    hidden_dims=(512, 128),
    epochs=30,
    batch_size=128,
    verbose=False
)
# NOTE: the test split is also passed as the validation set here.
mlp.fit(X_tr, y_tr, X_val=X_te, y_val=y_te)
optimized_models["MLP"] = mlp

df_metrics = compare_models_metrics(optimized_models, X_te, y_te, average="macro")
df_sorted = df_metrics.sort_values('f1_macro', ascending=False)

df_sorted

model,accuracy,balanced_accuracy,f1_macro,precision_macro,recall_macro,confusion_matrix
NaiveBayes_Multi,0.555556,0.556897,0.555526,0.556806,0.556897,"[[104, 91], [73, 101]]"
Ridge,0.552846,0.554642,0.552846,0.554642,0.554642,"[[102, 93], [72, 102]]"
XGBoost,0.550136,0.552387,0.550106,0.552499,0.552387,"[[100, 95], [71, 103]]"
KNN_Cosine,0.550136,0.553006,0.549974,0.553314,0.553006,"[[98, 97], [69, 105]]"
KNN_Euclidean,0.550136,0.553006,0.549974,0.553314,0.553006,"[[98, 97], [69, 105]]"
LogisticRegression,0.547425,0.549514,0.547412,0.549581,0.549514,"[[100, 95], [72, 102]]"
ExtraTrees,0.547425,0.544253,0.543996,0.544717,0.544253,"[[117, 78], [89, 85]]"
MLP,0.542005,0.537887,0.537096,0.538622,0.537887,"[[119, 76], [93, 81]]"
AdaBoost,0.542005,0.53603,0.533439,0.537639,0.53603,"[[125, 70], [99, 75]]"
RandomForest,0.536585,0.531521,0.529901,0.53249,0.531521,"[[121, 74], [97, 77]]"


The following block validates that all trained models demonstrate genuine learning by checking for naive prediction patterns (e.g., always predicting the majority class) and confirming meaningful improvement over dummy classifiers.
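
The checks can be approximated with the sketch below. It is a hypothetical re-implementation, not the source of `batch_check_naive_predictions`: it assumes that `majority_prediction_ratio` is the share of test predictions equal to the majority class, and that `improvement_over_dummy` is the F1-macro gain over a classifier that always predicts that class. Both assumptions are consistent with the figures reported in the summary that follows.

```python
def f1_macro(y_true, y_pred):
    """Macro-averaged F1 over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def naive_check(y_true, y_pred, majority_class):
    """Compare predictions with a dummy that always outputs `majority_class`."""
    dummy_pred = [majority_class] * len(y_true)
    return {
        "unique_classes_predicted": len(set(y_pred)),
        "majority_prediction_ratio": y_pred.count(majority_class) / len(y_pred),
        "improvement_over_dummy": f1_macro(y_true, y_pred) - f1_macro(y_true, dummy_pred),
        "is_single_class_predictor": len(set(y_pred)) == 1,
    }


# Label vectors rebuilt from the NaiveBayes_Multi confusion matrix
# [[104, 91], [73, 101]], using 0/1 as stand-ins for the two classes.
y_true = [0] * 195 + [1] * 174
y_pred = [0] * 104 + [1] * 91 + [0] * 73 + [1] * 101
report = naive_check(y_true, y_pred, majority_class=0)
print({k: round(v, 6) if isinstance(v, float) else v for k, v in report.items()})
```

For this input the ratio and the improvement come out at roughly 0.4797 and 0.2098, in line with the `NaiveBayes_Multi` row of the summary.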

In [6]:
summary_df = batch_check_naive_predictions(optimized_models, X_te, y_te, y_tr)
summary_df

,model_name,unique_classes_predicted,majority_prediction_ratio,improvement_over_dummy,is_naive_majority,is_single_class_predictor
0,LogisticRegression,2,0.466125,0.201667,False,False
1,RandomForest,2,0.590786,0.184157,False,False
2,XGBoost,2,0.463415,0.204361,False,False
3,KNN_Cosine,2,0.452575,0.204229,False,False
4,KNN_Euclidean,2,0.452575,0.204229,False,False
5,NaiveBayes_Multi,2,0.479675,0.209781,False,False
6,ExtraTrees,2,0.558266,0.198251,False,False
7,AdaBoost,2,0.607046,0.187694,False,False
8,Ridge,2,0.471545,0.207101,False,False
9,MLP,2,0.574526,0.191352,False,False


### Models application with reduced Features (PCA)

This block repeats the comprehensive model training and hyperparameter optimization pipeline using **PCA-transformed features** to evaluate the impact of linear dimensionality reduction on predictive performance across all algorithm families.
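
One way to apply a linear reduction without leaking test information is to fit the projection on the training rows only and reuse it for the test rows. The sketch below does this with a plain NumPy SVD. It is an illustration under stated assumptions, not the behaviour of `tv.reduce`: the names `fit_pca` and `apply_pca` and the random dense data are hypothetical.

```python
import numpy as np


def fit_pca(X_train: np.ndarray, n_components: int):
    """Return the training mean and the top principal axes (rows of `components`)."""
    mean = X_train.mean(axis=0)
    # SVD of the centred data: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, vt[:n_components]


def apply_pca(X: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Project rows of X onto the axes learned from the training data."""
    return (X - mean) @ components.T


rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 10))  # stand-in for dense training features
X_test = rng.normal(size=(15, 10))   # stand-in for dense test features

mean, components = fit_pca(X_train, n_components=3)
Z_train = apply_pca(X_train, mean, components)
Z_test = apply_pca(X_test, mean, components)
print(Z_train.shape, Z_test.shape)  # (40, 3) (15, 3)
```

A large sparse TF-IDF matrix would normally call for a sparse-aware decomposition rather than a dense SVD; the dense version is used here only to keep the example short.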

In [8]:
tv = TextVectorizer(
    dataset_csv="dataset.csv",
    text_cols=("speakers", "title", "subtitle", "contents"),
    target_cols=("target_1",),  # trailing comma: ("target_1") is a plain string, not a tuple
    id_col="date",
    concat_text=False
)
tv.load()

X, y, vectorizer = tv.tfidf(target="target_1", max_features=50000, ngram_range=(1, 2))
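# NOTE: the return value of tv.reduce is discarded. Unless reduce modifies X in place,
# the models below are fit on the unreduced TF-IDF matrix, which would explain why
# most models report metrics identical to the raw-feature run.
tv.reduce(X, method="pca")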
X_tr, X_te, y_tr, y_te = tv.temporal_split(X, y, split_date="2022-01-01")

optimized_models = {}

result = generalized_gridsearch(LogReg(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["LogisticRegression"] = result['best_model']

result = generalized_gridsearch(RandomForest(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["RandomForest"] = result['best_model']


result = generalized_gridsearch(XGBoostClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["XGBoost"] = result['best_model']

result = generalized_gridsearch(KNNClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["KNN_Cosine"] = result['best_model']

result = generalized_gridsearch(KNNClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["KNN_Euclidean"] = result['best_model']

result = generalized_gridsearch(NaiveBayesClassifier(variant="multinomial"), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["NaiveBayes_Multi"] = result['best_model']

result = generalized_gridsearch(LightGBMClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["LightGBM"] = result['best_model']

result = generalized_gridsearch(ExtraTreesClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["ExtraTrees"] = result['best_model']

result = generalized_gridsearch(AdaBoostClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["AdaBoost"] = result['best_model']

result = generalized_gridsearch(RidgeClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["Ridge"] = result['best_model']

result = generalized_gridsearch(SGDClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["SGD"] = result['best_model']

input_dim = X_tr.shape[1]
mlp = MLPClassifier(
    input_dim=input_dim,
    num_classes=len(np.unique(y)),
    hidden_dims=(512, 128),
    epochs=30,
    batch_size=128,
    verbose=False
)
mlp.fit(X_tr, y_tr, X_val=X_te, y_val=y_te)
optimized_models["MLP"] = mlp

df_metrics = compare_models_metrics(optimized_models, X_te, y_te, average="macro")
df_sorted = df_metrics.sort_values('f1_macro', ascending=False)

df_sorted

model,accuracy,balanced_accuracy,f1_macro,precision_macro,recall_macro,confusion_matrix
MLP,0.566396,0.568391,0.566392,0.568435,0.568391,"[[104, 91], [69, 105]]"
NaiveBayes_Multi,0.555556,0.556897,0.555526,0.556806,0.556897,"[[104, 91], [73, 101]]"
Ridge,0.552846,0.554642,0.552846,0.554642,0.554642,"[[102, 93], [72, 102]]"
XGBoost,0.550136,0.552387,0.550106,0.552499,0.552387,"[[100, 95], [71, 103]]"
KNN_Cosine,0.550136,0.553006,0.549974,0.553314,0.553006,"[[98, 97], [69, 105]]"
KNN_Euclidean,0.550136,0.553006,0.549974,0.553314,0.553006,"[[98, 97], [69, 105]]"
LogisticRegression,0.547425,0.549514,0.547412,0.549581,0.549514,"[[100, 95], [72, 102]]"
ExtraTrees,0.547425,0.544253,0.543996,0.544717,0.544253,"[[117, 78], [89, 85]]"
AdaBoost,0.542005,0.53603,0.533439,0.537639,0.53603,"[[125, 70], [99, 75]]"
RandomForest,0.517615,0.516357,0.516333,0.516339,0.516357,"[[105, 90], [88, 86]]"


In [9]:
summary_df = batch_check_naive_predictions(optimized_models, X_te, y_te, y_tr)
summary_df

,model_name,unique_classes_predicted,majority_prediction_ratio,improvement_over_dummy,is_naive_majority,is_single_class_predictor
0,LogisticRegression,2,0.466125,0.201667,False,False
1,RandomForest,2,0.523035,0.170588,False,False
2,XGBoost,2,0.463415,0.204361,False,False
3,KNN_Cosine,2,0.452575,0.204229,False,False
4,KNN_Euclidean,2,0.452575,0.204229,False,False
5,NaiveBayes_Multi,2,0.479675,0.209781,False,False
6,ExtraTrees,2,0.558266,0.198251,False,False
7,AdaBoost,2,0.607046,0.187694,False,False
8,Ridge,2,0.471545,0.207101,False,False
9,MLP,2,0.468835,0.220648,False,False


### Models application with reduced Features (t-SNE)

This block repeats the comprehensive model training and hyperparameter optimization pipeline using **t-SNE-transformed features** to evaluate the impact of non-linear dimensionality reduction on predictive performance across all algorithm families.

In [11]:
tv = TextVectorizer(
    dataset_csv="dataset.csv",
    text_cols=("speakers", "title", "subtitle", "contents"),
    target_cols=("target_1",),  # trailing comma: ("target_1") is a plain string, not a tuple
    id_col="date",
    concat_text=False
)
tv.load()

X, y, vectorizer = tv.tfidf(target="target_1", max_features=50000, ngram_range=(1, 2))
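# NOTE: the return value of tv.reduce is discarded; unless reduce modifies X in place,
# the models below are fit on the unreduced TF-IDF matrix.
tv.reduce(X, method="tsne", n_components=3)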
X_tr, X_te, y_tr, y_te = tv.temporal_split(X, y, split_date="2022-01-01")

optimized_models = {}

result = generalized_gridsearch(LogReg(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["LogisticRegression"] = result['best_model']

result = generalized_gridsearch(RandomForest(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["RandomForest"] = result['best_model']


result = generalized_gridsearch(XGBoostClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["XGBoost"] = result['best_model']

result = generalized_gridsearch(KNNClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["KNN_Cosine"] = result['best_model']

result = generalized_gridsearch(KNNClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["KNN_Euclidean"] = result['best_model']

result = generalized_gridsearch(NaiveBayesClassifier(variant="multinomial"), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["NaiveBayes_Multi"] = result['best_model']

result = generalized_gridsearch(LightGBMClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["LightGBM"] = result['best_model']

result = generalized_gridsearch(ExtraTreesClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["ExtraTrees"] = result['best_model']

result = generalized_gridsearch(AdaBoostClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["AdaBoost"] = result['best_model']

result = generalized_gridsearch(RidgeClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["Ridge"] = result['best_model']

result = generalized_gridsearch(SGDClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["SGD"] = result['best_model']

input_dim = X_tr.shape[1]
mlp = MLPClassifier(
    input_dim=input_dim,
    num_classes=len(np.unique(y)),
    hidden_dims=(512, 128),
    epochs=30,
    batch_size=128,
    verbose=False
)
mlp.fit(X_tr, y_tr, X_val=X_te, y_val=y_te)
optimized_models["MLP"] = mlp

df_metrics = compare_models_metrics(optimized_models, X_te, y_te, average="macro")
df_sorted = df_metrics.sort_values('f1_macro', ascending=False)

df_sorted

model,accuracy,balanced_accuracy,f1_macro,precision_macro,recall_macro,confusion_matrix
NaiveBayes_Multi,0.555556,0.556897,0.555526,0.556806,0.556897,"[[104, 91], [73, 101]]"
MLP,0.558266,0.554509,0.554021,0.555391,0.554509,"[[121, 74], [89, 85]]"
Ridge,0.552846,0.554642,0.552846,0.554642,0.554642,"[[102, 93], [72, 102]]"
XGBoost,0.550136,0.552387,0.550106,0.552499,0.552387,"[[100, 95], [71, 103]]"
KNN_Cosine,0.550136,0.553006,0.549974,0.553314,0.553006,"[[98, 97], [69, 105]]"
KNN_Euclidean,0.550136,0.553006,0.549974,0.553314,0.553006,"[[98, 97], [69, 105]]"
LogisticRegression,0.547425,0.549514,0.547412,0.549581,0.549514,"[[100, 95], [72, 102]]"
ExtraTrees,0.547425,0.544253,0.543996,0.544717,0.544253,"[[117, 78], [89, 85]]"
AdaBoost,0.542005,0.53603,0.533439,0.537639,0.53603,"[[125, 70], [99, 75]]"
RandomForest,0.517615,0.516357,0.516333,0.516339,0.516357,"[[105, 90], [88, 86]]"


In [12]:
summary_df = batch_check_naive_predictions(optimized_models, X_te, y_te, y_tr)
summary_df

,model_name,unique_classes_predicted,majority_prediction_ratio,improvement_over_dummy,is_naive_majority,is_single_class_predictor
0,LogisticRegression,2,0.466125,0.201667,False,False
1,RandomForest,2,0.523035,0.170588,False,False
2,XGBoost,2,0.463415,0.204361,False,False
3,KNN_Cosine,2,0.452575,0.204229,False,False
4,KNN_Euclidean,2,0.452575,0.204229,False,False
5,NaiveBayes_Multi,2,0.479675,0.209781,False,False
6,ExtraTrees,2,0.558266,0.198251,False,False
7,AdaBoost,2,0.607046,0.187694,False,False
8,Ridge,2,0.471545,0.207101,False,False
9,MLP,2,0.569106,0.208276,False,False


### Models application with reduced Features (MDS)

This block repeats the comprehensive model training and hyperparameter optimization pipeline using **MDS-transformed features** to evaluate the impact of distance-preserving dimensionality reduction on predictive performance across all algorithm families.

In [13]:
tv = TextVectorizer(
    dataset_csv="dataset.csv",
    text_cols=("speakers", "title", "subtitle", "contents"),
    target_cols=("target_1",),  # trailing comma: ("target_1") is a plain string, not a tuple
    id_col="date",
    concat_text=False
)
tv.load()

X, y, vectorizer = tv.tfidf(target="target_1", max_features=50000, ngram_range=(1, 2))
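# NOTE: the return value of tv.reduce is discarded; unless reduce modifies X in place,
# the models below are fit on the unreduced TF-IDF matrix.
tv.reduce(X, method="mds")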
X_tr, X_te, y_tr, y_te = tv.temporal_split(X, y, split_date="2022-01-01")

optimized_models = {}

result = generalized_gridsearch(LogReg(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["LogisticRegression"] = result['best_model']

result = generalized_gridsearch(RandomForest(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["RandomForest"] = result['best_model']


result = generalized_gridsearch(XGBoostClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["XGBoost"] = result['best_model']

result = generalized_gridsearch(KNNClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["KNN_Cosine"] = result['best_model']

result = generalized_gridsearch(KNNClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["KNN_Euclidean"] = result['best_model']

result = generalized_gridsearch(NaiveBayesClassifier(variant="multinomial"), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["NaiveBayes_Multi"] = result['best_model']

result = generalized_gridsearch(LightGBMClassifier(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["LightGBM"] = result['best_model']

result = generalized_gridsearch(ExtraTreesClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["ExtraTrees"] = result['best_model']

result = generalized_gridsearch(AdaBoostClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["AdaBoost"] = result['best_model']

result = generalized_gridsearch(RidgeClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["Ridge"] = result['best_model']

result = generalized_gridsearch(SGDClassifierWrapper(), X_tr, y_tr, X_te, y_te, verbose=False)
if result['best_model']:
    optimized_models["SGD"] = result['best_model']

input_dim = X_tr.shape[1]
mlp = MLPClassifier(
    input_dim=input_dim,
    num_classes=len(np.unique(y)),
    hidden_dims=(512, 128),
    epochs=30,
    batch_size=128,
    verbose=False
)
mlp.fit(X_tr, y_tr, X_val=X_te, y_val=y_te)
optimized_models["MLP"] = mlp

df_metrics = compare_models_metrics(optimized_models, X_te, y_te, average="macro")
df_sorted = df_metrics.sort_values('f1_macro', ascending=False)

df_sorted

model,accuracy,balanced_accuracy,f1_macro,precision_macro,recall_macro,confusion_matrix
NaiveBayes_Multi,0.555556,0.556897,0.555526,0.556806,0.556897,"[[104, 91], [73, 101]]"
Ridge,0.552846,0.554642,0.552846,0.554642,0.554642,"[[102, 93], [72, 102]]"
XGBoost,0.550136,0.552387,0.550106,0.552499,0.552387,"[[100, 95], [71, 103]]"
KNN_Cosine,0.550136,0.553006,0.549974,0.553314,0.553006,"[[98, 97], [69, 105]]"
KNN_Euclidean,0.550136,0.553006,0.549974,0.553314,0.553006,"[[98, 97], [69, 105]]"
LogisticRegression,0.547425,0.549514,0.547412,0.549581,0.549514,"[[100, 95], [72, 102]]"
ExtraTrees,0.547425,0.544253,0.543996,0.544717,0.544253,"[[117, 78], [89, 85]]"
MLP,0.547425,0.554156,0.543996,0.557005,0.554156,"[[85, 110], [57, 117]]"
AdaBoost,0.542005,0.53603,0.533439,0.537639,0.53603,"[[125, 70], [99, 75]]"
RandomForest,0.517615,0.516357,0.516333,0.516339,0.516357,"[[105, 90], [88, 86]]"


In [14]:
summary_df = batch_check_naive_predictions(optimized_models, X_te, y_te, y_tr)
summary_df

,model_name,unique_classes_predicted,majority_prediction_ratio,improvement_over_dummy,is_naive_majority,is_single_class_predictor
0,LogisticRegression,2,0.466125,0.201667,False,False
1,RandomForest,2,0.523035,0.170588,False,False
2,XGBoost,2,0.463415,0.204361,False,False
3,KNN_Cosine,2,0.452575,0.204229,False,False
4,KNN_Euclidean,2,0.452575,0.204229,False,False
5,NaiveBayes_Multi,2,0.479675,0.209781,False,False
6,ExtraTrees,2,0.558266,0.198251,False,False
7,AdaBoost,2,0.607046,0.187694,False,False
8,Ridge,2,0.471545,0.207101,False,False
9,MLP,2,0.384824,0.198251,False,False


### HF Transformer Classifier

This block implements a **Transformer-based approach** using CamemBERT for comparison with the traditional machine learning methods above. The model processes raw text directly through pre-trained language representations and is fine-tuned on the yield spread prediction task, with early stopping and mixed-precision training for computational efficiency.
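
The early-stopping rule mentioned above can be sketched independently of any deep learning framework. The function below is a hypothetical illustration and not the logic inside `HFTransformerClassifier`: it assumes that a higher validation score is better and that training halts after `patience` consecutive evaluations without improvement.

```python
def run_with_early_stopping(val_scores, patience=1):
    """Return (epochs actually run, best validation score).

    `val_scores` stands in for the metric measured after each epoch.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    epochs_run = 0
    for score in val_scores:
        epochs_run += 1
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # stop: no improvement for `patience` evaluations
    return epochs_run, best


# Illustrative validation scores for three epochs.
print(run_with_early_stopping([0.52, 0.51, 0.55], patience=1))  # (2, 0.52)
```

With a patience of 1, a single non-improving epoch ends training, so a later recovery (the third score here) is never reached; a larger patience trades extra compute for tolerance to such dips.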

In [15]:
df = tv.df
df["text"] = df[list(tv.text_cols)].fillna("").astype(str).agg(" ".join, axis=1)

split_date = "2022-01-01"
mask_tr = pd.to_datetime(df["date"]) < pd.Timestamp(split_date)
mask_te = ~mask_tr

X_tr_hf, y_tr_hf = df.loc[mask_tr, "text"], y[mask_tr]
X_te_hf, y_te_hf = df.loc[mask_te, "text"], y[mask_te]

hf_model = HFTransformerClassifier(
    model_name="camembert-base",
    max_length=512,
    batch_size=8,
    epochs=3,
    lr=2e-5,
    fp16=True,
    early_stopping_patience=1,
    eval_every=1,
    auto_remap_labels=True
)
hf_model.fit(X_tr_hf, y_tr_hf, X_val=X_te_hf, y_val=y_te_hf)
hf_model.evaluate(X_te_hf, y_te_hf)

Some weights of CamembertForSequenceClassification were not initialized from the model checkpoint at camembert-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


accuracy :
 0.5284552845528455
balanced_accuracy :
 0.5
f1_macro :
 0.34574468085106386
precision_macro :
 0.26422764227642276
recall_macro :
 0.5
confusion_matrix :
 [[195   0]
 [174   0]]


{'accuracy': 0.5284552845528455,
 'balanced_accuracy': 0.5,
 'f1_macro': 0.34574468085106386,
 'precision_macro': 0.26422764227642276,
 'recall_macro': 0.5,
 'confusion_matrix': array([[195,   0],
        [174,   0]], dtype=int64)}