# LumaFin - Label-Aware Contrastive Fine-Tuning (L-A CFT)

This notebook fine-tunes the sentence transformer embedding model using contrastive learning.

**What this notebook does:**
1. Loads training data from Google Drive
2. Creates contrastive triplets (anchor, positive, negative)
3. Fine-tunes a sentence-transformers model with a triplet loss
4. Saves the fine-tuned model to Google Drive

**Runtime:** GPU Required (T4 or better recommended)
**Time:** ~30-60 minutes for full training

**⚠️ IMPORTANT:** Make sure to enable GPU runtime:  
Runtime → Change runtime type → Hardware accelerator → GPU

## Step 1: Mount Google Drive and Setup

In [None]:
from google.colab import drive
import torch

drive.mount('/content/drive')

# Check GPU availability
if torch.cuda.is_available():
    print(f"✅ GPU Available: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
else:
    print("⚠️ WARNING: No GPU detected. Training will be VERY slow.")
    print("   Please enable GPU: Runtime → Change runtime type → GPU")

## Step 2: Install Dependencies

In [None]:
!pip install -q sentence-transformers transformers torch pandas numpy scikit-learn tqdm

## Step 3: Load Training Data

In [None]:
import pandas as pd
import os

# Load from Google Drive
train_file = '/content/drive/MyDrive/LumaFin/data/train.csv'

if not os.path.exists(train_file):
    print("❌ Training data not found. Please run notebook 01 first.")
else:
    df_train = pd.read_csv(train_file)
    print(f"✅ Loaded {len(df_train)} training examples")
    print(f"\nCategory distribution:")
    print(df_train['category'].value_counts())
    print(f"\nSample:")
    print(df_train.head())
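
The later cells read the `merchant`, `amount`, and `category` columns (and `description` if present), so it can help to check the CSV header before building triplets. The following is a minimal sketch using only the standard library; the `missing_columns` helper and the inline sample data are hypothetical and not part of the LumaFin code.

```python
import csv
import io

# Columns the triplet-building step reads; `description` is optional there.
REQUIRED_COLUMNS = ("merchant", "amount", "category")


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical sample standing in for train.csv; its header lacks `category`.
sample = io.StringIO("merchant,description,amount\nStarbucks,coffee,5.50\n")
print(missing_columns(sample))
```

With a real file, the same helper could be called as `missing_columns(open(train_file, newline=''))` and training stopped if the returned list is not empty.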

## Step 4: Create Contrastive Dataset

In [None]:
import random
from collections import defaultdict
import numpy as np

# Group examples by category
category_to_examples = defaultdict(list)
for _, row in df_train.iterrows():
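    # Skip missing descriptions so NaN does not become the literal text "nan"
    description = row.get('description', '')
    description = '' if pd.isna(description) else str(description).strip()
    parts = [str(row['merchant']).strip(), description, f"${row['amount']:.2f}"]
    text = " ".join(part for part in parts if part)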
    category_to_examples[row['category']].append(text)

categories = list(category_to_examples.keys())
print(f"✅ Grouped into {len(categories)} categories")
for cat in categories:
    print(f"  {cat}: {len(category_to_examples[cat])} examples")
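
The model only sees the formatted text, so queries at inference time should be built with the same template as the training examples (merchant, description, amount). One way to keep the two in sync is a small shared helper. This is a sketch under that assumption; `format_transaction` is a hypothetical name, not an existing LumaFin function.

```python
import math


def format_transaction(merchant, description, amount):
    """Build the 'merchant description $amount' text used for embedding.

    Empty, None, or NaN descriptions are skipped so they do not leave
    a literal 'nan' or a double space in the text.
    """
    parts = [str(merchant).strip()]
    is_nan = isinstance(description, float) and math.isnan(description)
    if description is not None and not is_nan and str(description).strip():
        parts.append(str(description).strip())
    parts.append(f"${float(amount):.2f}")
    return " ".join(parts)


print(format_transaction("Starbucks", "coffee", 5.5))  # Starbucks coffee $5.50
print(format_transaction("Uber", float("nan"), 15))    # Uber $15.00
```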

In [None]:
def create_triplets(category_to_examples, num_triplets_per_example=3):
    """Create anchor-positive-negative triplets for contrastive learning."""
    triplets = []
    
    for category, examples in category_to_examples.items():
        if len(examples) < 2:
            continue  # Skip categories with too few examples

        # Categories that can supply a negative example
        negative_categories = [c for c in category_to_examples if c != category]
        if not negative_categories:
            continue  # A negative needs at least one other category

        for anchor in examples:
            # Positive: same category, different example (computed once per anchor)
            positive_candidates = [e for e in examples if e != anchor]
            if not positive_candidates:
                continue

            for _ in range(num_triplets_per_example):
                positive = random.choice(positive_candidates)

                # Negative: random example from a different category
                negative_category = random.choice(negative_categories)
                negative = random.choice(category_to_examples[negative_category])

                triplets.append((anchor, positive, negative))
    
    return triplets

print("Creating triplets...")
triplets = create_triplets(category_to_examples, num_triplets_per_example=2)
print(f"✅ Created {len(triplets)} triplets")

# Show sample
if triplets:
    anchor, positive, negative = triplets[0]
    print(f"\nSample triplet:")
    print(f"  Anchor:   {anchor}")
    print(f"  Positive: {positive}")
    print(f"  Negative: {negative}")
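
Because the triplets are drawn with `random`, each run produces a different training set. A seeded sampler makes runs repeatable, and carrying the category labels along makes it easy to check the invariant that anchor and positive share a category while the negative does not. The sketch below uses made-up toy data; `sample_labeled_triplets` is a hypothetical variant, not the function defined above.

```python
import random

# Toy data standing in for category_to_examples; the values are made up.
toy = {
    "Food": ["Starbucks coffee $5.50", "Dunkin donuts $3.25"],
    "Transport": ["Uber ride $15.00", "Lyft ride $12.40"],
}


def sample_labeled_triplets(groups, per_anchor=2, seed=42):
    """Sample (anchor, positive, negative) triplets plus their category labels."""
    rng = random.Random(seed)  # local RNG: repeatable, leaves global state alone
    out = []
    for cat, examples in groups.items():
        others = [c for c in groups if c != cat and groups[c]]
        if len(examples) < 2 or not others:
            continue
        for i, anchor in enumerate(examples):
            # Exclude the anchor by position rather than by value.
            positives = examples[:i] + examples[i + 1:]
            for _ in range(per_anchor):
                neg_cat = rng.choice(others)
                triplet = (anchor, rng.choice(positives), rng.choice(groups[neg_cat]))
                out.append((triplet, (cat, cat, neg_cat)))
    return out


labeled = sample_labeled_triplets(toy)
# Invariant: anchor and positive share a label, the negative's label differs.
assert all(a == p != n for _, (a, p, n) in labeled)
print(len(labeled))  # 2 categories x 2 anchors x 2 triplets per anchor = 8
```

Note that this variant excludes the anchor by position, so an exact duplicate text can still be picked as the positive; the notebook's function excludes by value instead.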

## Step 5: Load Base Model and Setup Training

In [None]:
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load base model
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
print(f"Loading model: {model_name}")
model = SentenceTransformer(model_name)
print(f"✅ Model loaded. Embedding dimension: {model.get_sentence_embedding_dimension()}")

In [None]:
# Convert triplets to InputExamples
print("Converting to InputExamples...")
train_examples = []
for anchor, positive, negative in triplets:
    train_examples.append(InputExample(texts=[anchor, positive, negative]))

print(f"✅ Created {len(train_examples)} training examples")

In [None]:
# Create DataLoader
batch_size = 16  # Adjust based on GPU memory
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)

print(f"✅ DataLoader created with batch size {batch_size}")
print(f"   Total batches: {len(train_dataloader)}")
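
The batch count printed above, and the step counts that Step 6 derives from it, follow from simple arithmetic. The sketch below reproduces that arithmetic; `training_schedule` is a hypothetical helper and the 10,000-triplet dataset size is an invented example, not a measured value.

```python
import math


def training_schedule(num_examples, batch_size, num_epochs, warmup_fraction=0.1):
    """Return (batches_per_epoch, total_steps, warmup_steps)."""
    # DataLoader keeps the final partial batch by default (drop_last=False).
    batches_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = batches_per_epoch * num_epochs
    warmup_steps = int(total_steps * warmup_fraction)
    return batches_per_epoch, total_steps, warmup_steps


# Hypothetical size: 10,000 triplets with the notebook's batch size and epochs.
print(training_schedule(10_000, batch_size=16, num_epochs=3))  # (625, 1875, 187)
```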

## Step 6: Configure Training

In [None]:
# Training configuration
num_epochs = 3  # Increase for better results (3-5 epochs recommended)
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)  # 10% warmup

# Use TripletLoss for contrastive learning
train_loss = losses.TripletLoss(model=model)

print(f"✅ Training configuration:")
print(f"   Epochs: {num_epochs}")
print(f"   Batch size: {batch_size}")
print(f"   Warmup steps: {warmup_steps}")
print(f"   Total training steps: {len(train_dataloader) * num_epochs}")
print(f"   Loss function: TripletLoss")
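
A triplet loss pushes the anchor closer to the positive than to the negative by at least a margin: `max(d(anchor, positive) - d(anchor, negative) + margin, 0)`. The cell above relies on the library defaults for `losses.TripletLoss`, which in sentence-transformers are documented as Euclidean distance with `triplet_margin=5`; check your installed version. The sketch below computes the same formula on made-up 2-D vectors and is an illustration, not the library's implementation.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=5.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


# Made-up 2-D embeddings for illustration.
anchor, positive = [0.0, 0.0], [3.0, 4.0]  # distance 5
near_negative = [0.0, 6.0]                 # distance 6
far_negative = [0.0, 12.0]                 # distance 12

print(triplet_loss(anchor, positive, near_negative))  # 5 - 6 + 5 = 4.0
print(triplet_loss(anchor, positive, far_negative))   # 5 - 12 + 5 < 0, so 0.0
```

If the model normalizes its embeddings to unit length, Euclidean distances cannot exceed 2, so a margin of 5 can never be fully satisfied and the loss stays above zero. In that case a smaller `triplet_margin` may be worth trying.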

## Step 7: Train the Model

In [None]:
import time

# Output path
output_path = '/content/drive/MyDrive/LumaFin/models/lumafin-lacft-v1.0'

print(f"🚀 Starting training...\n")
start_time = time.time()

# Train
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path=output_path,
    show_progress_bar=True,
    save_best_model=True,  # No effect without an evaluator; the final model is saved to output_path
)

training_time = time.time() - start_time
print(f"\n✅ Training complete! Time: {training_time/60:.1f} minutes")
print(f"✅ Model saved to: {output_path}")

## Step 8: Test the Fine-Tuned Model

In [None]:
# Load the fine-tuned model
finetuned_model = SentenceTransformer(output_path)

# Test queries
test_queries = [
    "Starbucks coffee $5.50",
    "Uber ride $15.00",
    "Netflix subscription $15.99",
    "Walmart groceries $45.30",
    "Doctor visit $120.00"
]

print("Testing embeddings on sample queries:\n")
for query in test_queries:
    embedding = finetuned_model.encode(query)
    print(f"✅ {query}")
    print(f"   Embedding shape: {embedding.shape}, Norm: {np.linalg.norm(embedding):.3f}\n")

## Step 9: Compare Base vs Fine-Tuned

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Load base model
base_model = SentenceTransformer(model_name)

# Test similarity
query1 = "Starbucks coffee"
query2 = "Dunkin donuts"  # Same category
query3 = "Uber ride"  # Different category

# Base model
base_emb1 = base_model.encode([query1])
base_emb2 = base_model.encode([query2])
base_emb3 = base_model.encode([query3])

# Fine-tuned model
ft_emb1 = finetuned_model.encode([query1])
ft_emb2 = finetuned_model.encode([query2])
ft_emb3 = finetuned_model.encode([query3])

print("Similarity Comparison:\n")
print(f"Query 1: {query1}")
print(f"Query 2: {query2} (same category - should be HIGH)")
print(f"Query 3: {query3} (different category - should be LOW)\n")

print("BASE MODEL:")
print(f"  Similarity(Q1, Q2): {cosine_similarity(base_emb1, base_emb2)[0][0]:.3f}")
print(f"  Similarity(Q1, Q3): {cosine_similarity(base_emb1, base_emb3)[0][0]:.3f}\n")

print("FINE-TUNED MODEL:")
print(f"  Similarity(Q1, Q2): {cosine_similarity(ft_emb1, ft_emb2)[0][0]:.3f}")
print(f"  Similarity(Q1, Q3): {cosine_similarity(ft_emb1, ft_emb3)[0][0]:.3f}\n")

print("✅ Fine-tuned model should show higher similarity for same-category pairs!")
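
One number that summarizes this comparison is the gap between the two similarities, sim(Q1, Q2) minus sim(Q1, Q3). If fine-tuning helped, the gap should be larger for the fine-tuned model than for the base model. The sketch below computes it in plain Python on made-up 2-D vectors; `separation_gap` is a hypothetical helper, and a single query triple is only an anecdotal check, not an evaluation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def separation_gap(query, same_category, other_category):
    """Similarity to the same-category text minus similarity to the other one."""
    return cosine(query, same_category) - cosine(query, other_category)


# Made-up 2-D vectors standing in for the embeddings of Q1, Q2, and Q3.
q1, q2, q3 = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]
print(separation_gap(q1, q2, q3))  # 1.0 - 0.0 = 1.0
```

With the embeddings from the cell above, the two gaps would be `separation_gap(base_emb1[0], base_emb2[0], base_emb3[0])` and the same call on the `ft_emb*` arrays.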

## ✅ Training Complete!

Your fine-tuned embedding model is ready and saved to:
```
/content/drive/MyDrive/LumaFin/models/lumafin-lacft-v1.0
```

### Next Steps:
1. **Run notebook 03_train_reranker.ipynb** to train the XGBoost reranker
2. **Run notebook 04_evaluate_pipeline.ipynb** to test the complete system

### To use this model in your local repository:
1. Download the model folder from Google Drive
2. Place it in `models/embeddings/lumafin-lacft-v1.0`
3. Update `.env` file:
   ```
   MODEL_PATH=models/embeddings/lumafin-lacft-v1.0
   ```
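
In local code, the model path can then be read from the environment instead of being hard-coded. The sketch below assumes the application has already loaded `.env` into the process environment (for example with a dotenv loader), since Python does not do this on its own. `resolve_model_path` and `load_embedding_model` are hypothetical names, not existing LumaFin functions.

```python
import os
from pathlib import Path

# Default mirrors the path suggested in the steps above.
DEFAULT_MODEL_PATH = "models/embeddings/lumafin-lacft-v1.0"


def resolve_model_path(env=os.environ):
    """Return the model directory from MODEL_PATH, falling back to the default."""
    return Path(env.get("MODEL_PATH") or DEFAULT_MODEL_PATH)


def load_embedding_model(env=os.environ):
    """Load the fine-tuned model; requires sentence-transformers to be installed."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    path = resolve_model_path(env)
    if not path.is_dir():
        raise FileNotFoundError(f"Model directory not found: {path}")
    return SentenceTransformer(str(path))


print(resolve_model_path({"MODEL_PATH": "models/embeddings/lumafin-lacft-v1.0"}))
```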