# Text Classification with LLMs - Modern Approaches
## Interview Preparation Notebook for Senior Applied AI Scientist (Retail Banking)

---

**Goal**: Demonstrate mastery of LLM-based text classification including fine-tuning, zero-shot, few-shot, and prompt engineering approaches.

**Interview Signal**: This notebook shows you understand the full spectrum of LLM classification approaches and can make informed decisions about when to use each.

## 1. Business Context (Banking Lens)

### Why LLMs for Classification Now?

| Challenge | Traditional Solution | LLM Solution |
|-----------|---------------------|---------------|
| Cold start (no labels) | Can't train | Zero-shot works |
| New categories | Retrain entire model | Update prompt |
| Nuanced decisions | Rule-based fallback | Handles nuance |
| Multi-lingual | Separate models | One model |

### Banking Use Cases

1. **Zero-shot complaint routing** - No labeled data? LLM can route based on category descriptions
2. **Few-shot fraud detection** - 10 examples of new fraud patterns → instant classifier
3. **Fine-tuned sentiment** - BERT fine-tuned on banking language for customer satisfaction

## 2. Problem Definition

### LLM Classification Approaches

| Approach | Training Data | Latency | Accuracy | Cost |
|----------|--------------|---------|----------|------|
| **Fine-tuned BERT** | 1K+ examples | 50-200ms | 92-96% | $0.0001/doc |
| **Zero-shot (BART-MNLI)** | 0 examples | 200-500ms | 75-85% | $0.001/doc |
| **Few-shot (GPT)** | 5-20 examples | 500ms-2s | 80-90% | $0.005-0.02/doc |
| **SetFit** | 8-64 examples | 50-100ms | 85-92% | $0.0001/doc |

In [None]:
# Install required packages
# !pip install transformers torch scikit-learn pandas numpy

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Sample banking complaints for classification
sample_complaints = [
    {"text": "The mobile app crashes every time I try to deposit a check", "label": "Tech_Support"},
    {"text": "I was charged a $35 overdraft fee even though I had money in savings", "label": "Fee_Dispute"},
    {"text": "The customer service representative was extremely rude to me", "label": "Service_Quality"},
    {"text": "Someone made unauthorized purchases on my credit card", "label": "Fraud"},
    {"text": "What is the interest rate on your high-yield savings account?", "label": "Product_Inquiry"},
]

print(f"Sample complaints: {len(sample_complaints)}")

## 3-4. LLM Approach Selection & Implementation

### 4.1 Zero-Shot Classification

In [None]:
# Zero-shot classification (kept in a string so the notebook runs without downloading the model)
'''
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

labels = ["Technical Support", "Fee Dispute", "Service Quality", "Fraud", "Product Inquiry"]

text = "The mobile app crashes every time I try to deposit a check"
result = classifier(text, labels)

print(f"Predicted: {result['labels'][0]} ({result['scores'][0]:.2%})")
'''

def simulate_zero_shot(text, labels):
    """Simulated zero-shot for demonstration."""
    # Simple keyword-based simulation
    keywords = {
        "Technical Support": ["app", "crash", "login", "error", "website"],
        "Fee Dispute": ["fee", "charge", "overdraft", "refund"],
        "Service Quality": ["rude", "wait", "service", "representative"],
        "Fraud": ["unauthorized", "fraud", "stolen", "scam"],
        "Product Inquiry": ["rate", "interest", "account", "what is"]
    }
    
    scores = {}
    text_lower = text.lower()
    for label in labels:
        matches = sum(1 for kw in keywords.get(label, []) if kw in text_lower)
        scores[label] = matches
    
    total = sum(scores.values()) + 0.001
    return {k: v/total for k, v in sorted(scores.items(), key=lambda x: -x[1])}

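# Test: map the human-readable zero-shot labels back to the dataset's label codes
labels = ["Technical Support", "Fee Dispute", "Service Quality", "Fraud", "Product Inquiry"]
codes = ["Tech_Support", "Fee_Dispute", "Service_Quality", "Fraud", "Product_Inquiry"]
label_to_code = dict(zip(labels, codes))
for complaint in sample_complaints[:3]:
    result = simulate_zero_shot(complaint['text'], labels)
    top_label = next(iter(result))  # result is sorted by descending score
    print(f"Text: {complaint['text'][:50]}...")
    print(f"Predicted: {label_to_code[top_label]}, True: {complaint['label']}\n")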

### 4.2 Few-Shot with GPT

In [None]:
def create_few_shot_prompt(text, examples, categories):
    """Create few-shot classification prompt."""
    
    prompt = f"""Classify the following customer complaint into one of these categories:
{', '.join(categories)}

Examples:
"""
    
    for ex in examples:
        prompt += f'Complaint: "{ex["text"]}"\nCategory: {ex["label"]}\n\n'
    
    prompt += f'Complaint: "{text}"\nCategory:'
    
    return prompt

# Example prompt
few_shot_examples = sample_complaints[:3]
new_complaint = "I can't access my account online and the password reset isn't working"

prompt = create_few_shot_prompt(
    new_complaint, 
    few_shot_examples,
    ["Tech_Support", "Fee_Dispute", "Service_Quality", "Fraud", "Product_Inquiry"]
)

print("FEW-SHOT PROMPT")
print("=" * 50)
print(prompt)

### 4.3 Fine-tuned BERT (Pseudocode)

In [None]:
'''
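# Fine-tuning BERT for classification
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments

# Assumed inputs (not defined in this notebook): train_texts / eval_texts are
# lists of strings; train_labels / eval_labels are lists of integer class ids (0-4).

class CustomDataset(torch.utils.data.Dataset):
    """Pairs tokenizer output with labels in the dict format Trainer expects."""
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(values[idx]) for key, values in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Load pre-trained BERT
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=5  # Number of categories
)

# Prepare training and evaluation data
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
train_dataset = CustomDataset(train_encodings, train_labels)
eval_encodings = tokenizer(eval_texts, truncation=True, padding=True)
eval_dataset = CustomDataset(eval_encodings, eval_labels)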

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="./logs",
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()
'''

print("Fine-tuning pseudocode shown above.")
print("Requires: transformers, torch, GPU recommended")

## 5-6. Evaluation & Production

### Same Metrics as Traditional
- Accuracy, Precision, Recall, F1
- Class imbalance handling
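
As a minimal sketch of these metrics, assuming predictions and gold labels are plain lists of label strings, per-class precision, recall, and F1 can be computed without any library. The `y_true` and `y_pred` values are made up for illustration, and `per_class_metrics` and `macro_f1` are hypothetical helper names. Macro-averaging weights every class equally, which is one common way to stop a rare class such as Fraud from being hidden behind a high overall accuracy.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed the true label
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

def macro_f1(metrics):
    """Unweighted mean of per-class F1 scores."""
    return sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

# Made-up example: Fraud is the rare class and one of its two cases is missed
y_true = ["Fee_Dispute"] * 8 + ["Fraud"] * 2
y_pred = ["Fee_Dispute"] * 8 + ["Fee_Dispute", "Fraud"]

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"Accuracy: {accuracy:.2f}")
print(f"Macro F1: {macro_f1(metrics):.2f}")
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this toy data, accuracy stays high while Fraud recall is only 50%, which is the gap that per-class reporting is meant to expose.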

### Additional LLM Considerations
- **Prompt sensitivity**: Same model, different prompts → different results
- **Calibration**: LLM confidence vs actual accuracy
- **Consistency**: Same input, different runs → same output?
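
Two of these considerations can be measured directly. The sketch below assumes each prediction comes with a confidence in [0, 1] and a flag for whether it was correct. It computes expected calibration error (ECE) with equal-width bins, plus a simple agreement rate over repeated runs on one input. The function names and all numbers are illustrative, not results from a real model.

```python
from collections import Counter

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece

def consistency_rate(predictions):
    """Fraction of repeated runs that agree with the most common prediction."""
    most_common_count = Counter(predictions).most_common(1)[0][1]
    return most_common_count / len(predictions)

# Made-up case: the model reports 90% confidence but is right only half the time
confidences = [0.9, 0.9, 0.9, 0.9]
correct = [True, False, True, False]
ece = expected_calibration_error(confidences, correct)

# Made-up repeated runs on the same complaint
runs = ["Fraud", "Fraud", "Fee_Dispute", "Fraud"]
rate = consistency_rate(runs)

print(f"ECE: {ece:.2f}")
print(f"Consistency: {rate:.2f}")
```

An ECE near zero means reported confidence tracks actual accuracy. A consistency rate below 1.0 means the same input produced different labels across runs.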

## 7. Production Readiness Checklist

```
API-BASED (GPT-4, Claude)
[ ] Rate limiting and retry logic
[ ] Cost monitoring and alerts
[ ] PII masking before API call
[ ] Fallback to local model if API down
[ ] Prompt versioning and A/B testing
[ ] Response validation (is output a valid category?)

SELF-HOSTED (Fine-tuned BERT)
[ ] Model serving infrastructure (TorchServe, Triton)
[ ] Batch processing for throughput
[ ] Model versioning and rollback
[ ] Latency monitoring (p50, p99)
[ ] GPU utilization optimization
```
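
The "retry logic" and "fallback to local model" items can be sketched as a single wrapper. This is a minimal illustration under stated assumptions: `call_api` and `local_model` are hypothetical callables that take text and return a label, and no real API client is used. A production version would catch the client's specific error types rather than a bare `Exception`.

```python
import time

def classify_with_fallback(text, call_api, local_model,
                           max_retries=3, base_delay=0.5, sleep=time.sleep):
    """Try the API with exponential backoff, then fall back to a local model."""
    for attempt in range(max_retries):
        try:
            return call_api(text), "api"
        except Exception:  # assumption: any exception is treated as retryable
            if attempt < max_retries - 1:
                sleep(base_delay * 2 ** attempt)  # 0.5s, then 1s, ...
    return local_model(text), "local"

# Hypothetical stand-ins for demonstration only
def flaky_api(text):
    raise TimeoutError("API unavailable")

def keyword_model(text):
    return "Fraud" if "unauthorized" in text.lower() else "Product_Inquiry"

label, source = classify_with_fallback(
    "Someone made unauthorized purchases on my credit card",
    flaky_api,
    keyword_model,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(f"{label} (served by: {source})")
```

Returning the source alongside the label makes it possible to monitor how often the fallback path is used.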

## 8. Traditional vs LLM Comparison

| Dimension | Traditional (LR/SVM) | Fine-tuned BERT | Zero-shot | Few-shot GPT |
|-----------|---------------------|-----------------|-----------|---------------|
| **Accuracy** | 85-92% | 92-96% | 75-85% | 80-90% |
| **Training data** | 1000+ | 1K+ | 0 | 5-20 |
| **Latency** | <10ms | 50-200ms | 200-500ms | 500ms-2s |
| **Cost/prediction** | ~$0 | $0.0001 | $0.001 | $0.005-0.02 |
| **New categories** | Retrain | Retrain | Update labels | Update prompt |
| **Explainability** | High | Low | Medium | Medium |

## 9. Advanced Techniques

### Chain-of-Thought Classification
```python
prompt = """Classify this complaint step by step:
1. What is the customer's main issue?
2. What banking product is involved?
3. What emotion is expressed?
4. Based on the above, the category is:"""
```
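
The prompt above leaves out the complaint text and does not constrain the final answer. One possible way to close both gaps is sketched below. The helper names `build_cot_prompt` and `parse_category` are hypothetical, the final-line `Category: <name>` convention is an assumption, and `sample_response` is hand-written rather than real model output.

```python
import re

CATEGORIES = ["Tech_Support", "Fee_Dispute", "Service_Quality", "Fraud", "Product_Inquiry"]

def build_cot_prompt(complaint, categories):
    """Chain-of-thought prompt that includes the complaint and a fixed answer format."""
    return (
        "Classify this complaint step by step:\n"
        "1. What is the customer's main issue?\n"
        "2. What banking product is involved?\n"
        "3. What emotion is expressed?\n"
        f"4. Based on the above, choose one of: {', '.join(categories)}\n"
        "End with a final line of the form 'Category: <name>'.\n\n"
        f'Complaint: "{complaint}"'
    )

def parse_category(response, categories):
    """Return the category from the last 'Category:' line, or None if invalid."""
    matches = re.findall(r"^Category:[ \t]*(\S+)[ \t]*$", response, flags=re.MULTILINE)
    if matches and matches[-1] in categories:
        return matches[-1]
    return None

prompt = build_cot_prompt("I was charged a $35 overdraft fee", CATEGORIES)

# Hand-written stand-in for a model response
sample_response = (
    "1. The customer disputes a fee.\n"
    "2. Checking account.\n"
    "3. Frustration.\n"
    "Category: Fee_Dispute"
)
print(parse_category(sample_response, CATEGORIES))
```

Keeping the reasoning free-form but the last line machine-readable lets the intermediate steps be logged for review while only the validated category flows downstream.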

### SetFit (Few-shot with sentence transformers)
- 8-64 examples needed
- Fine-tunes embedding model
- No prompting, fast inference

## 10. Interview Soundbites

**On Approach Selection:**
> "My decision tree: Have 1000+ labeled examples? Fine-tune BERT. Have 10-50 examples? Try SetFit. Zero examples but know the categories? Zero-shot. Need to add new categories dynamically? Few-shot prompting."
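
The rule of thumb above can be written as a small function. This is a sketch, not a definitive policy: the quote leaves gaps (for example 51-999 examples), so treating 10-999 examples as SetFit territory, treating fewer than 10 as zero-shot, and checking for dynamic categories first are all assumptions. `choose_approach` is a hypothetical name.

```python
def choose_approach(n_labeled, categories_change_often=False):
    """Map the rule of thumb to an approach name (thresholds are assumptions)."""
    if categories_change_often:
        return "few-shot prompting"  # new categories only need a prompt update
    if n_labeled >= 1000:
        return "fine-tuned BERT"
    if n_labeled >= 10:              # assumption: covers the unstated 51-999 range
        return "SetFit"
    return "zero-shot"               # assumption: under 10 examples counts as none

for n in (0, 30, 5000):
    print(n, "->", choose_approach(n))
print("dynamic categories ->", choose_approach(5000, categories_change_often=True))
```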

**On Cost vs Accuracy:**
> "At 1M classifications/month, GPT-4 costs ~$15K (about $0.015 per document). Fine-tuned BERT costs ~$100 in GPU time. Even if GPT-4 bought a 5% accuracy gain, would that be worth 150x the cost? Usually not in banking, where we have enough data to fine-tune."
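
The arithmetic behind this soundbite is easy to check. The per-document prices below are assumptions: $0.015 is the rate implied by ~$15K for 1M documents, and $0.0001 is the fine-tuned BERT figure from the approach table.

```python
monthly_volume = 1_000_000
gpt4_cost_per_doc = 0.015   # assumed: implied by ~$15K per 1M documents
bert_cost_per_doc = 0.0001  # fine-tuned BERT figure from the approach table

gpt4_monthly = monthly_volume * gpt4_cost_per_doc
bert_monthly = monthly_volume * bert_cost_per_doc
ratio = gpt4_monthly / bert_monthly

print(f"GPT-4:           ${gpt4_monthly:,.0f}/month")
print(f"Fine-tuned BERT: ${bert_monthly:,.0f}/month")
print(f"Cost ratio:      {ratio:.0f}x")
```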

**On Zero-shot Limitations:**
> "Zero-shot sounds magical but it's really pattern matching on label names. 'Technical Support' works because BART learned what those words mean. If I use 'Category A' and 'Category B', it fails."
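
To see why label names matter, it helps to look at what an NLI-based zero-shot classifier actually scores: each label is slotted into a hypothesis sentence, and the model judges whether the complaint entails it. The sketch below only builds those sentences and runs no model. The template mirrors the default of the Hugging Face zero-shot pipeline, and `build_hypotheses` is a hypothetical helper.

```python
def build_hypotheses(labels, template="This example is {}."):
    """Return the hypothesis sentence an NLI model would score for each label."""
    return {label: template.format(label) for label in labels}

descriptive = build_hypotheses(["Technical Support", "Fee Dispute"])
opaque = build_hypotheses(["Category A", "Category B"])

for label, hypothesis in {**descriptive, **opaque}.items():
    print(f"{label!r:22} -> {hypothesis}")
```

The descriptive hypotheses carry meaning the model can compare against the complaint. The opaque ones give it nothing to match on.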

**On Production:**
> "I never deploy raw LLM output to production. Always validate: Is the output one of my valid categories? If not, route to human review. LLMs can generate creative but invalid responses."
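
A minimal sketch of that validation step, assuming the model returns a single label as free text. `validate_output` is a hypothetical helper, and the normalization rules (trimming whitespace, quotes, and trailing periods, then matching case-insensitively) are assumptions about what counts as a harmless formatting difference.

```python
VALID_CATEGORIES = {"Tech_Support", "Fee_Dispute", "Service_Quality", "Fraud", "Product_Inquiry"}

def validate_output(raw_output, valid_categories=VALID_CATEGORIES):
    """Return (category, needs_human_review) for a raw model response."""
    cleaned = raw_output.strip().strip('."\'')
    lookup = {category.lower(): category for category in valid_categories}
    category = lookup.get(cleaned.lower())
    if category is None:
        return None, True  # invalid or creative answer: route to human review
    return category, False

print(validate_output(" fraud.\n"))
print(validate_output("Billing Problem"))
```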

---

**Q: When would you NOT use an LLM for classification?**
> When volume is high (millions/day), full explainability is required, labeled training data is plentiful, or latency budgets are strict (<50ms). Traditional models win on all of these dimensions.

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════╗
║                    NOTEBOOK SUMMARY                               ║
╠══════════════════════════════════════════════════════════════════╣
║  Task: Text Classification with LLMs                             ║
║  Approaches: Fine-tuned BERT, Zero-shot, Few-shot                ║
║  Banking Use: Cold start classification, dynamic categories      ║
║                                                                  ║
║  Key Takeaways:                                                  ║
║  1. Zero-shot: No data needed, but lower accuracy                ║
║  2. Few-shot: 5-20 examples, quick to deploy                     ║
║  3. Fine-tuned: Best accuracy, needs data + GPU                  ║
║  4. SetFit: Sweet spot (few examples, fast inference)            ║
║  5. Always validate LLM output is a valid category               ║
╚══════════════════════════════════════════════════════════════════╝
""")