# LumaFin - XGBoost Reranker Training

This notebook trains the XGBoost reranker, which rescores the category candidates retrieved from the FAISS index with the aim of improving category prediction accuracy.

**What this notebook does:**
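1. Loads the training data and the fine-tuned embedding model (falling back to the base model if it is missing)
2. Builds a FAISS index for retrieval
3. Generates per-category training features for the reranker
4. Trains an XGBoost classifier on those features and calibrates its probabilities
5. Evaluates the model and saves it together with the FAISS index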

**Runtime:** GPU helpful but not required (CPU OK)
**Time:** ~15-30 minutes

## Step 1: Setup

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
!pip install -q sentence-transformers xgboost faiss-cpu pandas numpy scikit-learn tqdm

## Step 2: Load Data and Models

In [None]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
import os

# Load training data
train_file = '/content/drive/MyDrive/LumaFin/data/train.csv'
test_file = '/content/drive/MyDrive/LumaFin/data/test.csv'

df_train = pd.read_csv(train_file)
df_test = pd.read_csv(test_file)

print(f"✅ Training: {len(df_train)} examples")
print(f"✅ Test: {len(df_test)} examples")

In [None]:
# Load fine-tuned embedding model (or fallback to base)
finetuned_path = '/content/drive/MyDrive/LumaFin/models/lumafin-lacft-v1.0'
base_model_name = 'sentence-transformers/all-MiniLM-L6-v2'

if os.path.exists(finetuned_path):
    print(f"✅ Loading fine-tuned model from {finetuned_path}")
    model = SentenceTransformer(finetuned_path)
else:
    print(f"⚠️ Fine-tuned model not found, using base model: {base_model_name}")
    model = SentenceTransformer(base_model_name)

print(f"✅ Model loaded. Embedding dim: {model.get_sentence_embedding_dimension()}")

## Step 3: Build FAISS Index

In [None]:
import faiss
from tqdm import tqdm

# Create text representations
def create_text(row):
    desc = row.get('description', '')
    return f"{row['merchant']} {desc} ${row['amount']:.2f}"

print("Creating embeddings for FAISS index...")
train_texts = [create_text(row) for _, row in df_train.iterrows()]
train_categories = df_train['category'].tolist()

# Encode in batches
batch_size = 256
embeddings = []
for i in tqdm(range(0, len(train_texts), batch_size)):
    batch = train_texts[i:i+batch_size]
    batch_emb = model.encode(batch, show_progress_bar=False)
    embeddings.append(batch_emb)

train_embeddings = np.vstack(embeddings).astype('float32')
print(f"✅ Created {len(train_embeddings)} embeddings")
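To make the text format concrete, here is a minimal sketch of what `create_text` produces for a single transaction. The sample row is hypothetical, and a plain dict stands in for a pandas row.

```python
# Hypothetical sample transaction; a dict stands in for a DataFrame row.
sample_row = {"merchant": "Starbucks", "description": "coffee", "amount": 4.5}

def create_text(row):
    # Same format as the cell above: "<merchant> <description> $<amount>"
    desc = row.get("description", "")
    return f"{row['merchant']} {desc} ${row['amount']:.2f}"

sample_text = create_text(sample_row)
print(sample_text)  # Starbucks coffee $4.50
```

The amount is always rendered with two decimals, so `4.5` and `4.50` map to the same text and therefore the same embedding.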

In [None]:
# Normalize embeddings for cosine similarity
faiss.normalize_L2(train_embeddings)

# Build FAISS index
dimension = train_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner Product (cosine after normalization)
index.add(train_embeddings)

print(f"✅ FAISS index built with {index.ntotal} vectors")
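The index uses inner product on L2-normalized vectors, which is equivalent to cosine similarity. The sketch below checks that identity on two made-up vectors using only the standard library; the helper names are illustrative and not part of FAISS.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length, as faiss.normalize_L2 does for each row."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors, made up for illustration.
a = [3.0, 4.0]
b = [4.0, 3.0]

inner_after_norm = dot(l2_normalize(a), l2_normalize(b))
print(round(inner_after_norm, 4), round(cosine(a, b), 4))  # 0.96 0.96
```

This is why the query embeddings in the later cells are also normalized before searching: an unnormalized query would make the scores depend on the vector length rather than only on direction.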

## Step 4: Generate Training Features for Reranker

In [None]:
from collections import Counter

def extract_reranker_features(query_text, candidates_with_scores, true_category, all_categories):
    """
    Extract features for each candidate category.
    Returns a feature matrix and labels.
    """
    # Aggregate scores by category
    category_scores = {cat: [] for cat in all_categories}
    for cat, score in candidates_with_scores:
        category_scores[cat].append(score)
    
    features = []
    labels = []
    
    for cat in all_categories:
        scores = category_scores[cat]
        
        # Feature engineering (7 features)
        feat = [
            len(scores),  # count
            sum(scores) if scores else 0,  # sum
            max(scores) if scores else 0,  # max
            np.mean(scores) if scores else 0,  # mean
            min(scores) if scores else 0,  # min
            len(scores) / len(candidates_with_scores) if candidates_with_scores else 0,  # vote fraction
            0  # amount diff proxy (simplified, set to 0 for now)
        ]
        features.append(feat)
        labels.append(1 if cat == true_category else 0)
    
    return np.array(features), np.array(labels)

print("✅ Feature extraction function ready")
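As a worked example of the aggregation above, the sketch below computes the first six features for a hypothetical retrieval result of four neighbors. It is a simplified stand-in for `extract_reranker_features`: it omits the labels and the constant seventh feature, and the helper name `category_features` is made up for illustration.

```python
def category_features(candidates, all_categories):
    """Aggregate retrieved (category, score) pairs into per-category features."""
    rows = {}
    for cat in all_categories:
        scores = [s for c, s in candidates if c == cat]
        n = len(scores)
        rows[cat] = [
            n,                                          # count
            sum(scores),                                # sum (0 when empty)
            max(scores) if scores else 0,               # max
            sum(scores) / n if n else 0,                # mean
            min(scores) if scores else 0,               # min
            n / len(candidates) if candidates else 0,   # vote fraction
        ]
    return rows

# Hypothetical retrieval result: four neighbors covering two of three categories.
candidates = [("Food", 0.75), ("Food", 0.5), ("Transport", 0.5), ("Food", 0.25)]
features = category_features(candidates, ["Food", "Shopping", "Transport"])
for cat, feat in features.items():
    print(cat, feat)
# Food [3, 1.5, 0.75, 0.5, 0.25, 0.75]
# Shopping [0, 0, 0, 0, 0, 0.0]
# Transport [1, 0.5, 0.5, 0.5, 0.5, 0.25]
```

Categories with no retrieved neighbors still get a row of zeros, so every query contributes exactly one row per category.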

In [None]:
# Get all unique categories
all_categories = sorted(df_train['category'].unique())
print(f"Categories: {all_categories}")
num_categories = len(all_categories)

# Use subset for faster training (adjust as needed)
train_subset = df_train.sample(n=min(5000, len(df_train)), random_state=42)
print(f"\n✅ Using {len(train_subset)} examples for training")

In [None]:
print("Generating training features...")
X_train_list = []
y_train_list = []

k = 20  # retrieve top-20 candidates

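for idx, row in tqdm(train_subset.iterrows(), total=len(train_subset)):
    query_text = create_text(row)
    query_emb = model.encode([query_text])[0].astype('float32')
    query_emb = query_emb / np.linalg.norm(query_emb)  # normalize
    
    # Search FAISS for k+1 neighbors: every training query is itself in the index,
    # so its own entry must be dropped to avoid leaking the true label into the features
    scores, indices = index.search(np.array([query_emb]), k + 1)
    
    # Get candidates with scores, skipping the query's own row.
    # Assumes df_train keeps the default RangeIndex from read_csv, so idx is the
    # row's position in the FAISS index.
    candidates = [(train_categories[i], scores[0][j])
                  for j, i in enumerate(indices[0]) if i != idx][:k]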
    
    # Extract features
    X_feat, y_feat = extract_reranker_features(
        query_text, candidates, row['category'], all_categories
    )
    
    X_train_list.append(X_feat)
    y_train_list.append(y_feat)

# Stack all features
X_train = np.vstack(X_train_list)
y_train = np.hstack(y_train_list)

print(f"\n✅ Feature matrix: {X_train.shape}")
print(f"✅ Labels: {y_train.shape}")
print(f"✅ Positive samples: {y_train.sum()} / {len(y_train)} ({100*y_train.mean():.1f}%)")
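A quick sanity check on the printed numbers: each query contributes one row per category and exactly one of those rows is positive, so the shape and the positive rate follow from simple arithmetic. The sketch below assumes 10 categories purely for illustration; the real count comes from `num_categories`.

```python
def expected_reranker_shape(n_queries, n_categories, n_features=7):
    """Expected feature-matrix shape and positive rate for the reranker data."""
    rows = n_queries * n_categories      # one row per (query, category) pair
    positive_rate = 1 / n_categories     # exactly one true category per query
    return (rows, n_features), positive_rate

# 10 categories is an assumed value, not taken from the dataset.
shape, rate = expected_reranker_shape(n_queries=5000, n_categories=10)
print(shape, f"{rate:.1%}")  # (50000, 7) 10.0%
```

If the printed positive rate differs from `1 / num_categories`, some rows have a true category that is missing from `all_categories`.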

## Step 5: Train XGBoost Classifier

In [None]:
import xgboost as xgb
from sklearn.calibration import CalibratedClassifierCV

# XGBoost parameters
params = {
    'n_estimators': 200,
    'max_depth': 4,
    'learning_rate': 0.07,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'random_state': 42,
    'tree_method': 'hist',  # faster training
}

print("Training XGBoost classifier...")
xgb_model = xgb.XGBClassifier(**params)
xgb_model.fit(X_train, y_train, verbose=True)

print("\n✅ XGBoost training complete")

In [None]:
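# Calibrate probabilities with Platt scaling.
# With cv=3, CalibratedClassifierCV refits clones of xgb_model on each fold,
# so the calibrated model is separate from the xgb_model trained above.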
print("Calibrating probabilities...")
calibrated_model = CalibratedClassifierCV(xgb_model, method='sigmoid', cv=3)
calibrated_model.fit(X_train, y_train)

print("✅ Calibration complete")
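For intuition, Platt scaling (`method='sigmoid'`) fits a logistic curve `p = sigmoid(a * score + b)` on top of the classifier's raw scores. The following is a minimal standard-library sketch of that idea using plain gradient descent on made-up scores and labels; scikit-learn's own fitting routine differs in detail, so treat this as an illustration rather than its implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(scores, labels, a, b):
    """Mean negative log-likelihood of labels under p = sigmoid(a * s + b)."""
    eps = 1e-12
    total = 0.0
    for s, y in zip(scores, labels):
        p = min(max(sigmoid(a * s + b), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(scores)

def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit the slope a and offset b by gradient descent on the log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        errors = [sigmoid(a * s + b) - y for s, y in zip(scores, labels)]
        grad_a = sum(e * s for e, s in zip(errors, scores)) / n
        grad_b = sum(errors) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

# Made-up raw scores and labels; deliberately not separable so the fit stays finite.
scores = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
loss_before = log_loss(scores, labels, 1.0, 0.0)
loss_after = log_loss(scores, labels, a, b)
print(f"log loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Only two parameters are learned, which keeps the calibration step cheap and hard to overfit compared with the underlying tree model.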

## Step 6: Evaluate on Test Set

In [None]:
from sklearn.metrics import classification_report, f1_score

print("Evaluating on test set...")
test_subset = df_test.sample(n=min(1000, len(df_test)), random_state=42)

X_test_list = []
y_test_list = []
test_true_categories = []

for idx, row in tqdm(test_subset.iterrows(), total=len(test_subset)):
    query_text = create_text(row)
    query_emb = model.encode([query_text])[0].astype('float32')
    query_emb = query_emb / np.linalg.norm(query_emb)
    
    scores, indices = index.search(np.array([query_emb]), k)
    candidates = [(train_categories[i], scores[0][j]) for j, i in enumerate(indices[0])]
    
    X_feat, y_feat = extract_reranker_features(
        query_text, candidates, row['category'], all_categories
    )
    
    X_test_list.append(X_feat)
    y_test_list.append(y_feat)
    test_true_categories.append(row['category'])

X_test = np.vstack(X_test_list)
y_test = np.hstack(y_test_list)

print(f"\n✅ Test features: {X_test.shape}")

In [None]:
# Predict
y_pred = calibrated_model.predict(X_test)
y_pred_proba = calibrated_model.predict_proba(X_test)[:, 1]

# Metrics
print("\n📊 Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))

f1 = f1_score(y_test, y_pred)
print(f"\n✅ F1 Score: {f1:.3f}")

In [None]:
# Category-level accuracy
print("\n📊 Category-Level Performance:")
correct = 0
total = len(test_subset)

for i, (idx, data) in enumerate(test_subset.iterrows()):
    true_cat = data['category']
    
    # Get predictions for this example
    start_idx = i * num_categories
    end_idx = start_idx + num_categories
    example_probs = y_pred_proba[start_idx:end_idx]
    
    # Predicted category
    pred_idx = np.argmax(example_probs)
    pred_cat = all_categories[pred_idx]
    
    if pred_cat == true_cat:
        correct += 1

accuracy = correct / total
print(f"\n✅ Reranker Accuracy: {accuracy:.1%} ({correct}/{total})")
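The accuracy loop relies on the flat probability vector being laid out as consecutive blocks of `num_categories` entries, one block per test example. The sketch below shows that grouping on hypothetical probabilities; the helper name `predict_categories` is illustrative only.

```python
def predict_categories(flat_probs, all_categories):
    """Split a flat probability list into per-example blocks and take the argmax."""
    n = len(all_categories)
    if len(flat_probs) % n != 0:
        raise ValueError("expected one probability per (example, category) pair")
    predictions = []
    for start in range(0, len(flat_probs), n):
        block = flat_probs[start:start + n]
        best = max(range(n), key=block.__getitem__)  # index of the highest probability
        predictions.append(all_categories[best])
    return predictions

categories = ["Food", "Shopping", "Transport"]
flat_probs = [0.8, 0.1, 0.3,   # example 0
              0.2, 0.1, 0.6]   # example 1
true_categories = ["Food", "Shopping"]  # hypothetical ground truth

preds = predict_categories(flat_probs, categories)
accuracy = sum(p == t for p, t in zip(preds, true_categories)) / len(preds)
print(preds, accuracy)  # ['Food', 'Transport'] 0.5
```

Note that the per-category probabilities within a block need not sum to 1, because each row is scored independently as a binary decision; only their ranking matters here.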

## Step 7: Save Model

In [None]:
import pickle

# Save to Google Drive
model_path = '/content/drive/MyDrive/LumaFin/models/xgb_reranker.pkl'

with open(model_path, 'wb') as f:
    pickle.dump(calibrated_model, f)

print(f"✅ Model saved to: {model_path}")

# Also save the uncalibrated XGBoost model in XGBoost's JSON format
xgb_path = '/content/drive/MyDrive/LumaFin/models/xgb_reranker.json'
xgb_model.save_model(xgb_path)
print(f"✅ XGBoost model saved to: {xgb_path}")

In [None]:
# Save FAISS index and metadata
faiss_path = '/content/drive/MyDrive/LumaFin/models/faiss_index.bin'
metadata_path = '/content/drive/MyDrive/LumaFin/models/faiss_metadata.pkl'

faiss.write_index(index, faiss_path)
print(f"✅ FAISS index saved to: {faiss_path}")

metadata = {
    'categories': train_categories,
    'texts': train_texts,
    'all_categories': all_categories
}

with open(metadata_path, 'wb') as f:
    pickle.dump(metadata, f)

print(f"✅ Metadata saved to: {metadata_path}")

## ✅ Training Complete!

Your XGBoost reranker and FAISS index are ready!

**Saved files:**
- `xgb_reranker.pkl` - Calibrated XGBoost classifier
- `xgb_reranker.json` - Uncalibrated XGBoost model in JSON format
- `faiss_index.bin` - FAISS vector index
- `faiss_metadata.pkl` - Category and text metadata

### Next Steps:
1. **Run notebook 04_evaluate_pipeline.ipynb** to evaluate the complete system

### To use these models in your local repository:
1. Download all files from `/content/drive/MyDrive/LumaFin/models/`
2. Place them in your local `models/` directory
3. Update `.env` file with appropriate paths
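As a rough sketch of the local loading step, the snippet below reads `faiss_metadata.pkl` from a models directory. The environment variable name `LUMAFIN_MODELS_DIR` and the helper `load_metadata` are assumptions made for illustration and are not defined by the project; a temporary directory stands in for `models/` so the example runs on its own.

```python
import os
import pickle
import tempfile
from pathlib import Path

def load_metadata(models_dir=None):
    """Load faiss_metadata.pkl from a models directory.

    LUMAFIN_MODELS_DIR is a hypothetical variable name used only in this sketch.
    """
    models_dir = Path(models_dir or os.environ.get("LUMAFIN_MODELS_DIR", "models"))
    with open(models_dir / "faiss_metadata.pkl", "rb") as f:
        return pickle.load(f)

# Round-trip demo: write a tiny metadata file with the same keys, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    demo = {
        "categories": ["Food"],
        "texts": ["Cafe latte $4.50"],
        "all_categories": ["Food"],
    }
    with open(Path(tmp) / "faiss_metadata.pkl", "wb") as f:
        pickle.dump(demo, f)
    loaded = load_metadata(tmp)

print(sorted(loaded))  # ['all_categories', 'categories', 'texts']
```

The other artifacts can be loaded in the same spirit: `faiss.read_index` for `faiss_index.bin` and `pickle.load` for `xgb_reranker.pkl`, with matching library versions installed locally.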