Program 3: Hybrid Router (Keyword → Encoder), written to be symmetric with Programs 1 & 2 and result-log compatible (so we can compare across routers).

We will not retrain anything. We reuse:
the keyword router from Program 1
the trained MiniLM + classifier from Program 2

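🎯 Hybrid Routing Logic

Decision rule:

Run Keyword Router
If:
the router returns a route, i.e. the best-scoring category has at least 2 keyword hits
→ accept keyword decision
Else:
→ fallback to Encoder Router

Note: as implemented in Steps 3 and 4, the keyword decision is accepted for any category (not only quantitative), and the gate is the hit count (≥ 2) rather than a separate KEYWORD_CONF_THRESHOLD.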

This preserves:
speed for explicit numericals
semantic power for ambiguous cases
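
The decision rule above can be sketched as a small function that takes the two routers and the acceptance gate as arguments. This is a minimal illustration, not the notebook's own code: `hybrid_route`, `accept_any_route`, `make_quant_gate`, and the toy routers are hypothetical names, and the 0.6 threshold is an arbitrary example value. The default gate accepts any route other than "unrouted"; the stricter gate accepts only quantitative routes above a confidence threshold.

```python
from typing import Callable, Tuple

RouterResult = Tuple[str, float]  # (route, confidence)


def accept_any_route(route: str, conf: float) -> bool:
    """Accept the keyword decision whenever the keyword router produced a route."""
    return route != "unrouted"


def make_quant_gate(threshold: float) -> Callable[[str, float], bool]:
    """Stricter gate: accept only quantitative routes at or above `threshold`."""
    def gate(route: str, conf: float) -> bool:
        return route == "quantitative" and conf >= threshold
    return gate


def hybrid_route(
    question: str,
    keyword_fn: Callable[[str], RouterResult],
    encoder_fn: Callable[[str], RouterResult],
    accept_fn: Callable[[str, float], bool] = accept_any_route,
) -> Tuple[str, float, str]:
    """Return (route, confidence, router_used) using keyword first, encoder as fallback."""
    k_route, k_conf = keyword_fn(question)
    if accept_fn(k_route, k_conf):
        return k_route, k_conf, "keyword"
    e_route, e_conf = encoder_fn(question)
    return e_route, e_conf, "encoder"


# Toy stand-ins for the real routers (illustrative only).
def toy_keyword(q: str) -> RouterResult:
    return ("quantitative", 0.4) if "calculate" in q.lower() else ("unrouted", 0.0)


def toy_encoder(q: str) -> RouterResult:
    return ("conceptual", 0.7)


print(hybrid_route("Calculate the NPV", toy_keyword, toy_encoder))
print(hybrid_route("Calculate the NPV", toy_keyword, toy_encoder, make_quant_gate(0.6)))
```

Passing the gate as an argument makes it easy to compare the two acceptance policies on the same test set without touching either router.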

This program:
✔ Uses the SAME MiniLM encoder & classifier weights as Program 2
✔ Uses the SAME keyword router dictionary as Program 1

In [1]:
# ================================================================
# 📘 Program 3: Hybrid Router (Keyword → Encoder)
# ================================================================
# Purpose:
# - Hybrid classifier: keyword router first, encoder router as fallback
# - Encoder = MiniLM-L6 + Logistic Regression (from Program 2)
# - Store predictions and summary in Google Drive
# - Fully reproducible with fixed seeds
# ================================================================

# -----------------------------
# 📌 Step 0: Setup & Reproducibility
# -----------------------------
import random
import numpy as np
import pandas as pd
from pathlib import Path

# Reproducibility
RANDOM_SEED = 42
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# Mount Drive
from google.colab import drive
drive.mount('/content/drive')

BASE_DIR = Path("/content/drive/MyDrive/FinGuardSDG")
DATA_DIR = BASE_DIR / "data" / "splits"
RESULTS_DIR = BASE_DIR / "results" / "hybrid"
MODELS_DIR = BASE_DIR / "models"

RESULTS_DIR.mkdir(parents=True, exist_ok=True)
(MODELS_DIR / "hybrid").mkdir(parents=True, exist_ok=True)

print("Hybrid router setup complete.")
print("BASE_DIR:", BASE_DIR)


Mounted at /content/drive
Hybrid router setup complete.
BASE_DIR: /content/drive/MyDrive/FinGuardSDG


In [2]:
# -----------------------------
# 📌 Step 1: Load Test Set + Encoder + Classifier
# -----------------------------
TEST_PATH = DATA_DIR / "FinGuard_SDG_test.csv"
test_df = pd.read_csv(TEST_PATH)

print("Loaded test set:", test_df.shape)
display(test_df.head())

# Load encoder
from sentence_transformers import SentenceTransformer
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
encoder = SentenceTransformer(MODEL_NAME)

# Load classifier from Program 2
import joblib
clf = joblib.load(str(MODELS_DIR / "encoder" / "encoder_classifier.joblib"))

print("MiniLM encoder + trained classifier loaded.")


Loaded test set: (174, 7)


,id,category,subcategory,question_text,answer_text,difficulty,source
0,Q-TVM-054,quantitative,time_value_of_money,"An investment of ₹1,50,000 earns 9% annually. ...","The future value is ₹2,73,832.14.",1,template
1,Q-EQ-047,quantitative,equity_valuation,A firm trades at a premium despite lower curre...,Investors expect future earnings growth.,2,literature-inspired
2,C-RR-011,conceptual,risk_return_theory,Why are risky assets expected to outperform ri...,Investors demand compensation for risk exposure.,1,literature-inspired
3,C-RR-020,conceptual,risk_return_theory,What limitation does variance have as a risk m...,It treats upside and downside deviations equally.,2,literature-inspired
4,Q-TVM-051,quantitative,time_value_of_money,"An annuity pays ₹48,000 annually for 9 years. ...","The present value is ₹2,87,184.93.",2,template


MiniLM encoder + trained classifier loaded.


In [3]:
# -----------------------------
# 📌 Step 2: Compute Test Embeddings
# -----------------------------
test_embeddings = encoder.encode(
    test_df["question_text"].tolist(),
    batch_size=32,
    convert_to_numpy=True,
    show_progress_bar=True
)

print("Embedding shape:", test_embeddings.shape)


Embedding shape: (174, 384)


In [4]:
# -----------------------------
# 📌 Step 3: Keyword Router Rule Set
# -----------------------------
keyword_map = {
    "quantitative": [
        "calculate", "compute", "variance", "covariance", "beta",
        "duration", "convexity", "return", "yield", "volatility",
        "npv", "present value", "future value", "discount rate", "formula",
        "rate", "annuity", "cash flow"
    ],
    "advisory": [
        "should", "recommend", "suggest", "advisor", "portfolio",
        "risk tolerance", "investor", "allocation", "suitability"
    ],
    "conceptual": [
        "why", "explain", "describe", "define", "conceptually",
        "theory", "principle", "interpretation"
    ],
    "esg": [
        "esg", "environment", "governance", "sustainability",
        "carbon", "emissions", "stewardship", "renewable"
    ]
}

def keyword_router(q):
    q = q.lower()
    scores = {cat: 0 for cat in keyword_map}
    matched = {}

    for cat, kws in keyword_map.items():
        hits = [kw for kw in kws if kw in q]
        if hits:
            scores[cat] = len(hits)
            matched[cat] = hits

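    # Ties resolve to the first category in keyword_map order (quantitative first).
    best_cat = max(scores, key=scores.get)
    if scores[best_cat] >= 2:
        # Confidence = hits / 5, capped at 1.0
        return best_cat, min(scores[best_cat] / 5.0, 1.0), matched
    return "unrouted", 0.0, matched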
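
The router above matches keywords with a plain substring test (`kw in q`), so a short keyword such as "rate" also fires inside longer words like "corporate" or "generate". The sketch below shows one possible alternative, whole-word matching with a regular expression. It is an assumption for illustration, not what this notebook ran: `whole_word_hits` is a hypothetical helper, and switching to it would change the reported results (for example, it would no longer match "rates" or "returns").

```python
import re


def whole_word_hits(question, keywords):
    """Return the keywords that occur in `question` as whole words or phrases."""
    q = question.lower()
    hits = []
    for kw in keywords:
        # \b anchors the match at word boundaries; re.escape guards special characters.
        pattern = r"\b" + re.escape(kw) + r"\b"
        if re.search(pattern, q):
            hits.append(kw)
    return hits


quant_kws = ["rate", "return", "yield"]
question = "How do corporate bonds generate income?"

# Substring matching, as in keyword_router: "rate" fires inside "corporate".
substring_hits = [kw for kw in quant_kws if kw in question.lower()]

# Whole-word matching: no keyword appears as a standalone word.
boundary_hits = whole_word_hits(question, quant_kws)

print("substring:", substring_hits)
print("whole-word:", boundary_hits)
```

Substring matching is more forgiving of plurals and inflections, while whole-word matching avoids accidental hits; which trade-off suits the test set is an empirical question.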


In [5]:
# -----------------------------
# 📌 Step 4: Hybrid Routing (Keyword First → Encoder Fallback)
# -----------------------------
hybrid_rows = []

# enumerate() yields the positional index that lines up with test_embeddings,
# independent of the DataFrame's index labels.
for pos, (_, row) in enumerate(test_df.iterrows()):
    q = row["question_text"]

    # 1️⃣ Try keyword routing
    k_pred, k_conf, k_hits = keyword_router(q)

    if k_pred != "unrouted":
        final_pred = k_pred
        router_used = "keyword"
        confidence = k_conf

    else:
        # 2️⃣ Encoder fallback: classify the precomputed embedding for this row
        v = test_embeddings[pos].reshape(1, -1)
        e_pred = clf.predict(v)[0]
        e_prob = clf.predict_proba(v).max()

        final_pred = e_pred
        router_used = "encoder"
        confidence = float(e_prob)

    hybrid_rows.append({
        "id": row["id"],
        "question_text": q,
        "true_category": row["category"],
        "predicted_category": final_pred,
        "confidence": confidence,
        "router_used": router_used,
        "keyword_hits": k_hits
    })

hybrid_df = pd.DataFrame(hybrid_rows)
hybrid_df.head()


,id,question_text,true_category,predicted_category,confidence,router_used,keyword_hits
0,Q-TVM-054,"An investment of ₹1,50,000 earns 9% annually. ...",quantitative,quantitative,0.964027,encoder,{}
1,Q-EQ-047,A firm trades at a premium despite lower curre...,quantitative,quantitative,0.751454,encoder,{'advisory': ['suggest']}
2,C-RR-011,Why are risky assets expected to outperform ri...,conceptual,conceptual,0.625763,encoder,{'conceptual': ['why']}
3,C-RR-020,What limitation does variance have as a risk m...,conceptual,quantitative,0.312273,encoder,{'quantitative': ['variance']}
4,Q-TVM-051,"An annuity pays ₹48,000 annually for 9 years. ...",quantitative,quantitative,0.8,keyword,"{'quantitative': ['present value', 'discount r..."


In [6]:
# -----------------------------
# 📌 Step 5: Evaluate Hybrid Router
# -----------------------------
from sklearn.metrics import classification_report, accuracy_score

y_true = hybrid_df["true_category"].tolist()
y_pred = hybrid_df["predicted_category"].tolist()

acc = accuracy_score(y_true, y_pred)

print("Hybrid Router Accuracy:", acc)
print("\nClassification Report:\n")
print(classification_report(y_true, y_pred, zero_division=0))


Hybrid Router Accuracy: 0.8793103448275862

Classification Report:

              precision    recall  f1-score   support

    advisory       0.93      0.95      0.94        39
  conceptual       0.74      0.81      0.77        36
         esg       0.86      0.91      0.88        33
quantitative       0.95      0.86      0.90        66

    accuracy                           0.88       174
   macro avg       0.87      0.88      0.87       174
weighted avg       0.88      0.88      0.88       174
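
The report above pools both routing paths. To see how each path contributes, accuracy can be broken down by the `router_used` column that Step 4 stores in `hybrid_df`. The sketch below runs on a tiny made-up DataFrame (not results from this run); `accuracy_by_router` is a hypothetical helper name.

```python
import pandas as pd


def accuracy_by_router(df):
    """Return row count and accuracy for each value of `router_used`."""
    correct = df["true_category"] == df["predicted_category"]
    return (
        df.assign(correct=correct)
          .groupby("router_used")["correct"]
          .agg(n="count", accuracy="mean")
    )


# Toy rows with the same column names as hybrid_df (illustrative values only).
toy_df = pd.DataFrame({
    "true_category": ["quantitative", "quantitative", "conceptual", "esg"],
    "predicted_category": ["quantitative", "conceptual", "conceptual", "esg"],
    "router_used": ["keyword", "keyword", "encoder", "encoder"],
})

breakdown = accuracy_by_router(toy_df)
print(breakdown)
```

Calling `accuracy_by_router(hybrid_df)` in the notebook would show whether errors come mostly from keyword decisions or from the encoder fallback.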



In [7]:
# -----------------------------
# 📌 Step 6: Save Predictions CSV
# -----------------------------
pred_path = RESULTS_DIR / "hybrid_router_predictions.csv"
hybrid_df.to_csv(pred_path, index=False)

print("Saved predictions to:", pred_path)


Saved predictions to: /content/drive/MyDrive/FinGuardSDG/results/hybrid/hybrid_router_predictions.csv


In [8]:
# -----------------------------
# 📌 Step 7: Save Summary JSON
# -----------------------------
from sklearn.metrics import classification_report

report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

summary = {
    "router": "hybrid",
    "accuracy": acc,
    "macro_f1": report["macro avg"]["f1-score"],
    "router_usage": hybrid_df["router_used"].value_counts().to_dict(),
    "per_class": {
        k: v for k, v in report.items()
        if k in ["quantitative", "advisory", "conceptual", "esg"]
    }
}

import json
summary_path = RESULTS_DIR / "hybrid_router_summary.json"
with open(summary_path, "w") as f:
    json.dump(summary, f, indent=2)

print("Saved summary to:", summary_path)


Saved summary to: /content/drive/MyDrive/FinGuardSDG/results/hybrid/hybrid_router_summary.json


In [9]:
# -----------------------------
# 📌 Step 8: Save Hybrid Router Config
# -----------------------------
config_path = MODELS_DIR / "hybrid" / "hybrid_router_config.json"

config = {
    "seed": RANDOM_SEED,
    "encoder_model": MODEL_NAME,
    "classifier_path": str(MODELS_DIR / "encoder" / "encoder_classifier.joblib"),
    "keyword_threshold": 2,
    "keyword_map": keyword_map
}

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

print("Saved hybrid config to:", config_path)


Saved hybrid config to: /content/drive/MyDrive/FinGuardSDG/models/hybrid/hybrid_router_config.json
