# Text Classification - Traditional NLP
## Interview Preparation Notebook for Senior Applied AI Scientist (Retail Banking)

---

**Goal**: Demonstrate mastery of supervised text classification using traditional ML approaches, with emphasis on production deployment and banking-specific considerations.

**Interview Signal**: This notebook shows you can build interpretable, scalable classification systems that meet regulatory requirements.

## 1. Business Context (Banking Lens)

### Why Text Classification Exists in Retail Banking

Text classification is arguably the most deployed NLP task in banking. Every piece of unstructured text needs to be routed, prioritized, or categorized.

| Use Case | Input | Output | Business Impact |
|----------|-------|--------|----------------|
| **Complaint Routing** | Customer complaint | Department (fraud, billing, service) | Reduce resolution time by 40% |
| **Fraud Alert Triage** | Transaction alert text | Priority (high/medium/low) | Focus analysts on real threats |
| **Sentiment Analysis** | Survey response | Positive/Negative/Neutral | Track NPS drivers |
| **Email Intent Detection** | Inbound email | Intent (inquiry, complaint, request) | Auto-response eligibility |
| **Document Classification** | Uploaded document | Type (ID, statement, proof of address) | Automate KYC workflows |

### The Business Problem

> "We receive 50,000 customer emails per day. How do we route them to the right team without making customers wait?"

**Without classification**: Manual triage by generalist agents → 24-48 hour routing delay  
**With classification**: Instant routing to specialist queues → <1 hour first response

### Real Banking Example: Complaint Classification

**Input**: "I was charged $35 for an overdraft but I had money in my savings account. This is unfair and I want a refund immediately."

**Classification Task**: 
- Primary Category: `Fee Dispute`
- Product: `Checking Account`
- Urgency: `High` (refund request)
- Sentiment: `Negative`

### Interview Framing

```
"Text classification in banking isn't just about accuracy - it's about building trust. 
When we misroute a fraud complaint to the billing team, we're not just creating 
inefficiency - we're potentially leaving a customer vulnerable. That's why I focus 
on high recall for high-stakes categories, even if it means more manual review."
```

## 2. Problem Definition

### Task Type: Supervised Learning (Classification)

| Aspect | Description |
|--------|-------------|
| **Learning Type** | Supervised (requires labeled training data) |
| **Input** | Single text document |
| **Output** | Class label + probability score |
| **Variants** | Binary, Multi-class, Multi-label |

### Mathematical Formulation

Given document $d$ represented as feature vector $\mathbf{x}$, predict class $y$.

**Generative view (Naive Bayes)**: apply Bayes' rule,

$$P(y|\mathbf{x}) = \frac{P(\mathbf{x}|y)P(y)}{P(\mathbf{x})}$$

**Discriminative view (linear models such as Logistic Regression and SVM)**: find $\mathbf{w}$ such that

$$\hat{y} = \arg\max_y \ \mathbf{w}^T \phi(\mathbf{x}, y)$$
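
To make the Bayes-rule formulation concrete, the sketch below scores a toy message under a multinomial Naive Bayes model. The two classes, the three-word vocabulary, and every probability are invented for illustration. The "naive" step is treating words as conditionally independent given the class, so that $P(\mathbf{x}|y)$ factorizes into a product over words.

```python
import math

# Hypothetical priors and per-class word probabilities (made up for illustration).
priors = {"Fee_Dispute": 0.3, "Tech_Support": 0.7}
word_probs = {
    "Fee_Dispute": {"fee": 0.6, "app": 0.1, "refund": 0.3},
    "Tech_Support": {"fee": 0.1, "app": 0.8, "refund": 0.1},
}

def posterior(tokens):
    """Return P(y | tokens) via Bayes' rule with the naive independence assumption."""
    # log P(y) + sum of log P(word | y): the numerator of Bayes' rule in log space.
    log_joint = {
        label: math.log(priors[label])
        + sum(math.log(word_probs[label][t]) for t in tokens)
        for label in priors
    }
    # Normalize with a max-shift for numerical stability; the sum plays the role of P(x).
    max_log = max(log_joint.values())
    evidence = sum(math.exp(v - max_log) for v in log_joint.values())
    return {label: math.exp(v - max_log) / evidence for label, v in log_joint.items()}

probs = posterior(["fee", "refund"])
prediction = max(probs, key=probs.get)
print(probs, prediction)
```

Working in log space mirrors what library implementations do to avoid underflow when documents contain many words.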

### Why Traditional Approaches Before LLMs

1. **Interpretable**: Logistic regression coefficients show which words drive predictions
2. **Fast inference**: typically <10ms per document on CPU, versus roughly 50-200ms for a fine-tuned transformer
3. **Low resource**: Runs on CPUs, no GPU required
4. **Audit-friendly**: Deterministic, reproducible results
5. **Works with limited data**: on the order of 1,000 labeled examples per class is often sufficient

### Classification Types in Banking

| Type | Example | Challenge |
|------|---------|----------|
| **Binary** | Fraud vs Not Fraud | Extreme class imbalance (99.9% not fraud) |
| **Multi-class** | Route to 1 of 10 departments | Overlapping categories |
| **Multi-label** | Tag with multiple products | One complaint → multiple issues |
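
The practical difference between the multi-class and multi-label variants is the shape of the target. The sketch below uses hypothetical product tags and department names (not taken from any real dataset) to show one class index per complaint versus one 0/1 indicator per tag.

```python
# Hypothetical product tags; each complaint may carry several, or none.
products = ["Checking", "Credit_Card", "Mortgage"]
complaint_tags = [
    {"Checking"},
    {"Checking", "Credit_Card"},
    set(),
]

# Multi-label target: one 0/1 column per product; a row may contain several 1s.
y_multilabel = [[int(p in tags) for p in products] for tags in complaint_tags]

# Multi-class target: exactly one class index per complaint.
departments = ["Fraud", "Billing", "Service"]
y_multiclass = [departments.index(d) for d in ["Billing", "Fraud", "Service"]]

print(y_multilabel)
print(y_multiclass)
```

A multi-label problem is commonly handled as one binary classifier per column, which is why per-tag thresholds and per-tag metrics become relevant.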

## 3. Dataset

### Public Dataset: Consumer Finance Complaints (CFPB)

The target dataset is the CFPB Consumer Complaint Database: real complaints about financial products and services submitted to the US Consumer Financial Protection Bureau.

**Why this is the ideal banking dataset**:
- Real customer language about financial products
- Pre-labeled with product categories
- Contains complaint narratives (not just metadata)
- Public and frequently updated

This demo substitutes the 20 Newsgroups corpus, with topics remapped to four banking-style categories, as a stand-in; a production system would train on CFPB data.

In [None]:
# Install required packages
# !pip install scikit-learn nltk pandas numpy matplotlib seaborn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    classification_report, confusion_matrix, accuracy_score,
    precision_recall_fscore_support, roc_auc_score, roc_curve
)
from sklearn.pipeline import Pipeline
from sklearn.calibration import CalibratedClassifierCV
import warnings
warnings.filterwarnings('ignore')

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("Libraries loaded successfully")

In [None]:
# Load dataset - 20 Newsgroups topics used as stand-ins for banking categories
# (the text is not banking text; only the label structure is simulated):
# comp.* -> Technical Support (app/website issues)
# talk.politics.misc, talk.religion.misc -> Policy Complaints (regulatory, fees)
# rec.* -> General Service Issues
# sci.* -> Product Inquiries

categories = [
    'comp.sys.mac.hardware',  # Tech support
    'comp.windows.x',         # Tech support
    'rec.autos',              # Service issues
    'rec.sport.baseball',     # Service issues
    'sci.electronics',        # Product inquiries
    'sci.med',                # Product inquiries
    'talk.politics.misc',     # Policy complaints
    'talk.religion.misc',     # Policy complaints
]

# Simplified to 4 categories for banking simulation
category_mapping = {
    'comp.sys.mac.hardware': 'Tech_Support',
    'comp.windows.x': 'Tech_Support',
    'rec.autos': 'Service_Issues',
    'rec.sport.baseball': 'Service_Issues',
    'sci.electronics': 'Product_Inquiry',
    'sci.med': 'Product_Inquiry',
    'talk.politics.misc': 'Policy_Complaint',
    'talk.religion.misc': 'Policy_Complaint',
}

# Load data
newsgroups_train = fetch_20newsgroups(
    subset='train',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
    random_state=RANDOM_STATE
)

newsgroups_test = fetch_20newsgroups(
    subset='test',
    categories=categories,
    remove=('headers', 'footers', 'quotes'),
    random_state=RANDOM_STATE
)

# Map to banking categories
banking_categories = ['Tech_Support', 'Service_Issues', 'Product_Inquiry', 'Policy_Complaint']

def map_to_banking_label(original_label, target_names):
    original_category = target_names[original_label]
    banking_label = category_mapping[original_category]
    return banking_categories.index(banking_label)

X_train = newsgroups_train.data
y_train = np.array([map_to_banking_label(y, newsgroups_train.target_names) 
                    for y in newsgroups_train.target])

X_test = newsgroups_test.data
y_test = np.array([map_to_banking_label(y, newsgroups_test.target_names) 
                   for y in newsgroups_test.target])

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"\nBanking Categories: {banking_categories}")

In [None]:
# Class distribution - critical for banking (often imbalanced)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

for ax, (y, title) in zip(axes, [(y_train, 'Training'), (y_test, 'Test')]):
    counts = np.bincount(y)
    ax.bar(banking_categories, counts)
    ax.set_title(f'{title} Set Class Distribution')
    ax.set_ylabel('Count')
    ax.tick_params(axis='x', rotation=45)
    
    # Add count labels
    for i, count in enumerate(counts):
        ax.text(i, count + 10, str(count), ha='center')

plt.tight_layout()
plt.show()

# Check for imbalance
train_counts = np.bincount(y_train)
imbalance_ratio = max(train_counts) / min(train_counts)
print(f"\nImbalance ratio (max/min): {imbalance_ratio:.2f}")
print("Note: In real banking, fraud detection has 100:1 or worse imbalance")
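
The models later in this notebook pass `class_weight='balanced'`. As a rough illustration of what that setting does, the sketch below reproduces the heuristic scikit-learn documents for it, `n_samples / (n_classes * count_of_class)`, on a hypothetical label vector with a 1% positive rate.

```python
import numpy as np

# Hypothetical label vector: 990 non-fraud (0) and 10 fraud (1) examples.
y = np.array([0] * 990 + [1] * 10)

counts = np.bincount(y)
n_samples, n_classes = len(y), len(counts)

# Heuristic behind class_weight='balanced': rarer classes get larger weights.
weights = n_samples / (n_classes * counts)
print({cls: round(float(w), 3) for cls, w in enumerate(weights)})
```

Each fraud example ends up weighted about 100 times more than a non-fraud example, which is the inverse of the class frequency ratio.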

In [None]:
# Sample documents
print("SAMPLE DOCUMENTS BY CATEGORY")
print("=" * 60)

for cat_idx, cat_name in enumerate(banking_categories):
    sample_idx = np.where(y_train == cat_idx)[0][0]
    print(f"\n[{cat_name}]")
    print(f"{X_train[sample_idx][:300]}...")
    print("-" * 40)

## 4. Traditional NLP Pipeline

### 4.1 Text Cleaning

In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('punkt_tab', quiet=True)

class TextClassificationPreprocessor:
    """
    Preprocessor optimized for text classification.
    
    Key differences from topic modeling:
    - Keep negation stopwords ("not satisfied" vs "satisfied" matters)
    - Lowercase all text (case is not preserved)
    - Replace banking-specific patterns (amounts, account-like numbers) with placeholder tokens
    """
    
    def __init__(self, 
                 remove_numbers=True,
                 preserve_negations=True,
                 min_word_length=2):
        self.remove_numbers = remove_numbers
        self.preserve_negations = preserve_negations
        self.min_word_length = min_word_length
        self.lemmatizer = WordNetLemmatizer()
        
        # Modified stopwords - keep negations for sentiment/classification
        self.stop_words = set(stopwords.words('english'))
        if preserve_negations:
            negations = {'not', 'no', 'never', 'neither', 'nobody', 'nothing', 
                        'nowhere', 'cannot', "can't", "won't", "don't", "doesn't",
                        "didn't", "isn't", "aren't", "wasn't", "weren't"}
            self.stop_words -= negations
    
    def preprocess(self, text):
        """Full preprocessing pipeline."""
        # Lowercase
        text = text.lower()
        
        # Replace email addresses with a placeholder token
        text = re.sub(r'\S+@\S+', ' EMAIL ', text)
        
        # Replace URLs with a placeholder token
        text = re.sub(r'http\S+|www\S+', ' URL ', text)
        
        # Handle money amounts (banking-specific)
        text = re.sub(r'\$[\d,]+\.?\d*', ' MONEY ', text)
        
        # Handle account-like numbers
        text = re.sub(r'\b\d{4,}\b', ' ACCTNUM ', text)
        
        # Remove remaining numbers
        if self.remove_numbers:
            text = re.sub(r'\d+', '', text)
        
        # Remove punctuation (keep apostrophes for contractions)
        text = re.sub(r"[^a-zA-Z'\s]", ' ', text)
        
        # Tokenize
        tokens = word_tokenize(text)
        
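        # word_tokenize splits contractions ("don't" -> "do", "n't"), so the
        # contracted forms in the negation set never match a token. Map "n't"
        # to "not" so the negation survives filtering and vectorization.
        tokens = ['not' if token == "n't" else token for token in tokens]
        
        # Filter and lemmatize
        processed = [
            self.lemmatizer.lemmatize(token)
            for token in tokens
            if token not in self.stop_words
            and len(token) >= self.min_word_length
        ]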
        
        return ' '.join(processed)

preprocessor = TextClassificationPreprocessor(preserve_negations=True)

# Demonstrate
sample = "I was charged $35.00 for overdraft on account 123456789. I am NOT satisfied with this!"
print(f"Original: {sample}")
print(f"Processed: {preprocessor.preprocess(sample)}")

In [None]:
# Preprocess all data
print("Preprocessing training data...")
X_train_processed = [preprocessor.preprocess(doc) for doc in X_train]

print("Preprocessing test data...")
X_test_processed = [preprocessor.preprocess(doc) for doc in X_test]

print(f"\nProcessed {len(X_train_processed)} training documents")
print(f"Processed {len(X_test_processed)} test documents")

### 4.2 Stemming vs. Lemmatization for Classification

**For Classification**: Either can work, but consider:

| Aspect | Stemming | Lemmatization |
|--------|----------|---------------|
| **Speed** | Faster | Slower |
| **Feature interpretability** | Lower ("studi") | Higher ("study") |
| **Vocabulary reduction** | More aggressive | Less aggressive |
| **Best for** | High-volume, less interpretability needed | When explaining to stakeholders |

**My choice for banking**: Lemmatization, because:
1. We often need to explain why a complaint was classified a certain way
2. Feature coefficients should be readable to compliance teams

### 4.3 Feature Engineering

In [None]:
# Compare different vectorization approaches

# 1. Bag of Words (CountVectorizer)
bow_vectorizer = CountVectorizer(
    max_features=10000,
    ngram_range=(1, 1),  # Unigrams only
    min_df=2,
    max_df=0.95
)

# 2. TF-IDF (standard for classification)
tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1, 2),  # Unigrams + bigrams
    min_df=2,
    max_df=0.95,
    sublinear_tf=True,  # Apply log scaling to term frequencies
)

# 3. TF-IDF with a wider n-gram range (up to trigrams)
tfidf_bigram_vectorizer = TfidfVectorizer(
    max_features=15000,
    ngram_range=(1, 3),  # Up to trigrams
    min_df=3,
    max_df=0.90,
    sublinear_tf=True,
)

# Fit and compare
X_train_bow = bow_vectorizer.fit_transform(X_train_processed)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_processed)
X_train_tfidf_bigram = tfidf_bigram_vectorizer.fit_transform(X_train_processed)

print("Feature Matrix Shapes:")
print(f"  BoW (unigrams): {X_train_bow.shape}")
print(f"  TF-IDF (uni+bi): {X_train_tfidf.shape}")
print(f"  TF-IDF (uni+bi+tri): {X_train_tfidf_bigram.shape}")
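
The `TfidfVectorizer` instances above rely on inverse document frequency to down-weight common terms. As a worked example, the sketch below computes the smoothed IDF that scikit-learn uses by default, `ln((1 + n) / (1 + df)) + 1`, on a four-document toy corpus invented for illustration.

```python
import math

# Toy corpus (invented): four short "complaints".
docs = [
    "the overdraft fee was unfair",
    "the app crashed on login",
    "the card was declined at the store",
    "the branch was closed",
]

def smoothed_idf(term, documents):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df counts documents containing term."""
    n = len(documents)
    df = sum(term in doc.split() for doc in documents)
    return math.log((1 + n) / (1 + df)) + 1

idf_the = smoothed_idf("the", docs)
idf_overdraft = smoothed_idf("overdraft", docs)
print(f"idf('the') = {idf_the:.3f}, idf('overdraft') = {idf_overdraft:.3f}")
```

A term that appears in every document bottoms out at an IDF of 1, while a term seen in a single document scores higher. IDF is computed without labels, so it rewards rarity rather than class separation as such.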

In [None]:
# Why TF-IDF over BoW for classification?
print("""
TF-IDF vs Bag-of-Words for Classification:

BoW: Raw word counts
  - "the" appears 10 times → feature value = 10
  - Doesn't account for word importance

TF-IDF: Term Frequency × Inverse Document Frequency  
  - "the" appears in every doc → low IDF → low weight
  - "overdraft" appears in few docs (mostly fee complaints) → high IDF → high weight
  
For classification, TF-IDF typically performs better because
rare, topic-specific words are upweighted. IDF is computed without labels,
so it rewards rarity rather than class separation directly.
""")

### 4.4 Model Choice

Traditional classifiers for text:

| Model | Strengths | Weaknesses | Best For |
|-------|-----------|------------|----------|
| **Naive Bayes** | Fast, works well with small data | Assumes feature independence | Baseline, quick iteration |
| **Logistic Regression** | Interpretable, probabilistic | Linear decision boundary | Production, when explainability needed |
| **SVM (Linear)** | Excellent for high-dim sparse data | No native probability estimates (needs calibration) | When raw accuracy matters more than calibrated scores |
| **Random Forest** | Handles non-linearity | Slower, harder to interpret | When features have interactions |

## 5. Model Training & Inference

In [None]:
# Transform test data
X_test_tfidf = tfidf_vectorizer.transform(X_test_processed)

# Train multiple models for comparison
models = {
    'Naive Bayes (Multinomial)': MultinomialNB(alpha=0.1),
    'Naive Bayes (Complement)': ComplementNB(alpha=0.1),  # Better for imbalanced data
    'Logistic Regression': LogisticRegression(
        max_iter=1000, 
        random_state=RANDOM_STATE,
        class_weight='balanced',  # Handle imbalance
        C=1.0
    ),
    'Linear SVM': LinearSVC(
        random_state=RANDOM_STATE,
        class_weight='balanced',
        max_iter=2000
    ),
}

# Train and evaluate each model
results = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    model.fit(X_train_tfidf, y_train)
    
    # Predict
    y_pred = model.predict(X_test_tfidf)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(y_test, y_pred, average='weighted')
    
    results[name] = {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'model': model,
        'predictions': y_pred
    }
    
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  F1 (weighted): {f1:.4f}")

In [None]:
# Compare results
results_df = pd.DataFrame({
    name: {'Accuracy': r['accuracy'], 'Precision': r['precision'], 
           'Recall': r['recall'], 'F1': r['f1']}
    for name, r in results.items()
}).T

print("\nMODEL COMPARISON")
print("=" * 60)
print(results_df.round(4))

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))
results_df.plot(kind='bar', ax=ax)
ax.set_ylabel('Score')
ax.set_title('Model Performance Comparison')
ax.legend(loc='lower right')
ax.set_ylim([0.7, 1.0])
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

In [None]:
# Detailed analysis of best model (Logistic Regression for interpretability)
best_model_name = 'Logistic Regression'
best_model = results[best_model_name]['model']
y_pred_best = results[best_model_name]['predictions']

print(f"\nDETAILED CLASSIFICATION REPORT: {best_model_name}")
print("=" * 60)
print(classification_report(y_test, y_pred_best, target_names=banking_categories))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_best)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=banking_categories,
            yticklabels=banking_categories)
plt.title(f'Confusion Matrix - {best_model_name}')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()

# Normalize for better interpretation
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

plt.figure(figsize=(8, 6))
sns.heatmap(cm_normalized, annot=True, fmt='.2%', cmap='Blues',
            xticklabels=banking_categories,
            yticklabels=banking_categories)
plt.title(f'Normalized Confusion Matrix - {best_model_name}')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()

In [None]:
# Feature importance - what words drive each class?
def get_top_features(model, vectorizer, class_names, top_n=10):
    """Extract most important features per class from logistic regression."""
    feature_names = vectorizer.get_feature_names_out()
    
    for i, class_name in enumerate(class_names):
        if hasattr(model, 'coef_'):
            # Get coefficients for this class
            coefficients = model.coef_[i]
            
            # Top positive features (indicate this class)
            top_positive_idx = coefficients.argsort()[-top_n:][::-1]
            top_positive = [(feature_names[j], coefficients[j]) for j in top_positive_idx]
            
            # Top negative features (indicate NOT this class)
            top_negative_idx = coefficients.argsort()[:top_n]
            top_negative = [(feature_names[j], coefficients[j]) for j in top_negative_idx]
            
            print(f"\n[{class_name}]")
            print(f"  Top POSITIVE indicators: {[f[0] for f in top_positive]}")
            print(f"  Top NEGATIVE indicators: {[f[0] for f in top_negative]}")

print("FEATURE IMPORTANCE BY CLASS")
print("=" * 60)
get_top_features(best_model, tfidf_vectorizer, banking_categories, top_n=8)

In [None]:
# Inference function for production
def classify_complaint(text, model, vectorizer, preprocessor, class_names, threshold=0.5):
    """
    Classify a customer complaint with confidence score.
    
    Returns:
        dict with predicted class, confidence, and all class probabilities
    """
    # Preprocess
    processed = preprocessor.preprocess(text)
    
    # Vectorize
    features = vectorizer.transform([processed])
    
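    # Predict
    if hasattr(model, 'predict_proba'):
        probabilities = model.predict_proba(features)[0]
    else:
        # Models such as LinearSVC expose only decision_function (multi-class case).
        decision = model.decision_function(features)[0]
        # Softmax over margins, shifted by the max for numerical stability.
        # These are uncalibrated pseudo-probabilities, not true confidences.
        exp_scores = np.exp(decision - np.max(decision))
        probabilities = exp_scores / exp_scores.sum()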
    
    predicted_class = np.argmax(probabilities)
    confidence = probabilities[predicted_class]
    
    # Flag low confidence predictions
    needs_review = confidence < threshold
    
    return {
        'predicted_class': class_names[predicted_class],
        'confidence': confidence,
        'needs_human_review': needs_review,
        'all_probabilities': dict(zip(class_names, probabilities))
    }

# Test inference
test_complaints = [
    "The mobile app keeps crashing when I try to check my balance. Very frustrating!",
    "I was charged a $35 fee that I don't understand. Please explain your fee policy.",
    "How do I set up automatic bill pay for my credit card?",
    "The customer service representative was very rude to me on the phone."
]

print("INFERENCE EXAMPLES")
print("=" * 60)

for complaint in test_complaints:
    result = classify_complaint(
        complaint, best_model, tfidf_vectorizer, 
        preprocessor, banking_categories, threshold=0.6
    )
    print(f"\nInput: {complaint[:80]}...")
    print(f"Predicted: {result['predicted_class']} (confidence: {result['confidence']:.2%})")
    if result['needs_human_review']:
        print("  ⚠️  LOW CONFIDENCE - Route to human review")
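
The `classify_complaint` function falls back to a softmax over SVM margins when `predict_proba` is missing. An alternative, sketched here on invented toy data, is to wrap the SVM in `CalibratedClassifierCV` so that it produces calibrated probabilities directly. The tiny dataset and the choice of `cv=3` are assumptions made only to keep the example runnable.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy data (invented); three examples per class so 3-fold calibration can stratify.
texts = [
    "app crashes on login", "mobile app login error", "website will not load",
    "charged overdraft fee", "unfair fee charged twice", "refund the late fee",
]
labels = [0, 0, 0, 1, 1, 1]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)

# Wrap the SVM so it exposes predict_proba via sigmoid (Platt) calibration.
calibrated_svm = CalibratedClassifierCV(LinearSVC(random_state=42), method="sigmoid", cv=3)
calibrated_svm.fit(X, labels)

proba = calibrated_svm.predict_proba(vec.transform(["the app will not load"]))[0]
print(proba)
```

With realistic data volumes the calibration folds would be far larger; a confidence threshold for human review is only meaningful if the scores behind it are reasonably calibrated.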

## 6. Evaluation Strategy

### Why Accuracy is NOT Always the Right Metric

**Scenario**: Fraud detection with 99.9% non-fraud transactions
- Model predicts "not fraud" for everything → 99.9% accuracy
- But catches 0% of actual fraud → useless

### Choosing the Right Metric

| Metric | Formula | Use When |
|--------|---------|----------|
| **Accuracy** | (TP+TN)/(TP+TN+FP+FN) | Balanced classes |
| **Precision** | TP/(TP+FP) | Cost of false positives is high (fraud alerts) |
| **Recall** | TP/(TP+FN) | Cost of false negatives is high (missing fraud) |
| **F1** | 2×(P×R)/(P+R) | Balance precision and recall |
| **AUC-ROC** | Area under ROC curve | Ranking quality, threshold-agnostic |
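
To tie the formulas back to the fraud scenario, the sketch below plugs in hypothetical confusion counts for 100,000 transactions with a 0.1% fraud rate, scored by a model that always predicts "not fraud". The counts are illustrative, not measured.

```python
# Hypothetical counts: 100 fraud cases, all missed; 99,900 legitimate, all passed.
tp, fp, fn, tn = 0, 0, 100, 99_900

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn) if (tp + fn) else 0.0
# Precision is undefined when nothing is predicted positive; report 0.0 here by convention.
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.3f} precision={precision:.1f} recall={recall:.1f} f1={f1:.1f}")
```

Accuracy looks excellent while recall and F1 are zero, which is the reason per-class recall is reported separately in the cells that follow.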

In [None]:
# Per-class metrics - critical for banking
print("PER-CLASS METRICS ANALYSIS")
print("=" * 60)

precision, recall, f1, support = precision_recall_fscore_support(
    y_test, y_pred_best, average=None
)

metrics_df = pd.DataFrame({
    'Category': banking_categories,
    'Precision': precision,
    'Recall': recall,
    'F1': f1,
    'Support': support
}).set_index('Category')

print(metrics_df.round(4))

# Highlight potential issues
print("\nPOTENTIAL ISSUES:")
for cat, row in metrics_df.iterrows():
    if row['Recall'] < 0.8:
        print(f"  ⚠️  {cat}: Low recall ({row['Recall']:.2%}) - missing {100-row['Recall']*100:.1f}% of actual cases")
    if row['Precision'] < 0.8:
        print(f"  ⚠️  {cat}: Low precision ({row['Precision']:.2%}) - {100-row['Precision']*100:.1f}% of predictions for this class are wrong")

In [None]:
# Cross-validation for robust estimates
from sklearn.model_selection import StratifiedKFold

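# Create pipeline for cross-validation. The vectorizer is refit inside each fold,
# so no vocabulary or IDF statistics leak from the validation fold.
# Settings mirror tfidf_vectorizer above so the estimate describes the same model.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2),
                              min_df=2, max_df=0.95, sublinear_tf=True)),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced', random_state=RANDOM_STATE))
])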

# 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
cv_scores = cross_val_score(pipeline, X_train_processed, y_train, cv=cv, scoring='f1_weighted')

print("CROSS-VALIDATION RESULTS")
print("=" * 40)
print(f"F1 Scores: {cv_scores.round(4)}")
print(f"Mean: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")

# The +/- 2*std spread across folds is a rough stability indicator, not a formal
# confidence interval; production performance can still differ under data drift.

## 7. Production Readiness Checklist

```
DATA QUALITY
[ ] PII detection and masking before classification
[ ] Input validation (min/max length, language detection)
[ ] Handle encoding issues (UTF-8 normalization)
[ ] Logging of raw input for debugging (separate PII-safe store)

MODEL ARTIFACTS
[ ] Serialized model with version number
[ ] Vectorizer saved (vocabulary must match)
[ ] Preprocessing pipeline saved (identical transforms)
[ ] Model card documenting training data, metrics, limitations

INFERENCE PIPELINE
[ ] Latency benchmarks (p50 < 50ms, p99 < 200ms for real-time)
[ ] Batch inference for nightly processing
[ ] Confidence threshold for human escalation
[ ] Fallback behavior when model is unavailable

HANDLING CLASS IMBALANCE
[ ] Class weights in training
[ ] Threshold tuning per class
[ ] Separate high-recall model for critical classes (fraud)
[ ] Stratified sampling in train/test splits

MONITORING & DRIFT
[ ] Prediction distribution monitoring (shift detection)
[ ] Confidence score distribution tracking
[ ] Human override rate tracking
[ ] Feature drift detection (new words appearing)

GOVERNANCE (BANKING-SPECIFIC)
[ ] Model risk assessment (SR 11-7)
[ ] Bias testing across demographic segments
[ ] Audit trail for all predictions
[ ] Explainability for individual predictions
[ ] Periodic revalidation schedule

FAILURE MODES
[ ] What if input is empty or very short?
[ ] What if input is in wrong language?
[ ] What if input contains adversarial content?
[ ] What happens with previously unseen vocabulary?
```
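
One way to approach the "prediction distribution monitoring" item is the population stability index (PSI), which compares the share of predictions per class against a baseline. The sketch below is a minimal version under stated assumptions: the baseline and weekly shares are invented, and the `eps` floor is an arbitrary choice to avoid taking the log of zero.

```python
import math

def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two class-share distributions given as lists of proportions."""
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # floor to avoid log(0) for empty classes
        psi += (a - e) * math.log(a / e)
    return psi

# Hypothetical shares of predictions per category: validation baseline vs. last week.
baseline = [0.25, 0.25, 0.25, 0.25]
last_week = [0.40, 0.20, 0.20, 0.20]

psi_same = population_stability_index(baseline, baseline)
psi_shift = population_stability_index(baseline, last_week)
print(f"PSI unchanged: {psi_same:.4f}, PSI shifted: {psi_shift:.4f}")
```

The alerting threshold is a judgment call for the owning team; the value only says that the mix of predicted classes has moved, not why.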

In [None]:
# Production model wrapper with safety checks
import joblib
from datetime import datetime

class ProductionClassifier:
    """
    Production-ready text classifier with safety checks and logging.
    """
    
    def __init__(self, model, vectorizer, preprocessor, class_names,
                 confidence_threshold=0.6, min_text_length=10):
        self.model = model
        self.vectorizer = vectorizer
        self.preprocessor = preprocessor
        self.class_names = class_names
        self.confidence_threshold = confidence_threshold
        self.min_text_length = min_text_length
        self.prediction_count = 0
        self.low_confidence_count = 0
    
    def predict(self, text):
        """Make prediction with safety checks."""
        self.prediction_count += 1
        
        # Input validation
        if not text or len(text.strip()) < self.min_text_length:
            return {
                'status': 'ERROR',
                'error': 'Input too short',
                'predicted_class': None,
                'route_to_human': True
            }
        
        try:
            # Preprocess
            processed = self.preprocessor.preprocess(text)
            
            if len(processed.split()) < 3:
                return {
                    'status': 'ERROR',
                    'error': 'Insufficient content after preprocessing',
                    'predicted_class': None,
                    'route_to_human': True
                }
            
            # Vectorize and predict
            features = self.vectorizer.transform([processed])
            probabilities = self.model.predict_proba(features)[0]
            
            predicted_idx = np.argmax(probabilities)
            confidence = probabilities[predicted_idx]
            
            # Check confidence
            route_to_human = confidence < self.confidence_threshold
            if route_to_human:
                self.low_confidence_count += 1
            
            return {
                'status': 'SUCCESS',
                'predicted_class': self.class_names[predicted_idx],
                'confidence': float(confidence),
                'route_to_human': route_to_human,
                'all_probabilities': {name: float(prob) 
                                     for name, prob in zip(self.class_names, probabilities)},
                'timestamp': datetime.now().isoformat()
            }
            
        except Exception as e:
            return {
                'status': 'ERROR',
                'error': str(e),
                'predicted_class': None,
                'route_to_human': True
            }
    
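    def get_metrics(self):
        """Return operational metrics.

        Note: both rates below count only low-confidence predictions. Inputs
        rejected with status 'ERROR' are also routed to a human but are not
        counted, so the true escalation rate is at least this high.
        """
        return {
            'total_predictions': self.prediction_count,
            'low_confidence_rate': self.low_confidence_count / max(1, self.prediction_count),
            'human_escalation_rate': self.low_confidence_count / max(1, self.prediction_count)
        }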

# Initialize production classifier
prod_classifier = ProductionClassifier(
    model=best_model,
    vectorizer=tfidf_vectorizer,
    preprocessor=preprocessor,
    class_names=banking_categories,
    confidence_threshold=0.6
)

# Test production classifier
print("PRODUCTION CLASSIFIER TEST")
print("=" * 50)

test_inputs = [
    "The ATM ate my card and I need it back immediately!",
    "",  # Empty input
    "Hi",  # Too short
    "I have a question about the new savings account interest rates and how they compare to competitors."
]

for text in test_inputs:
    result = prod_classifier.predict(text)
    print(f"\nInput: '{text[:50]}{'...' if len(text) > 50 else ''}'")
    print(f"Status: {result['status']}")
    if result['status'] == 'SUCCESS':
        print(f"Prediction: {result['predicted_class']} ({result['confidence']:.2%})")
        print(f"Route to human: {result['route_to_human']}")
    else:
        print(f"Error: {result.get('error', 'Unknown')}")
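
The checklist asks for a serialized model, a matching vectorizer, and a version number. A minimal sketch of one way to bundle these with `joblib` follows; the artifact layout, file name, and version string are hypothetical, and the toy pipeline stands in for the trained model.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy pipeline (invented data); vectorizer and model travel together in one object.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(
    ["app crashes on login", "mobile app error", "charged overdraft fee", "unfair fee"],
    [0, 0, 1, 1],
)

# Hypothetical artifact layout: the pipeline plus the metadata needed to interpret it.
artifact = {
    "pipeline": pipeline,
    "class_names": ["Tech_Support", "Policy_Complaint"],
    "model_version": "0.1.0",  # hypothetical version string
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "complaint_classifier.joblib")
    joblib.dump(artifact, path)
    restored = joblib.load(path)

sample = ["the app shows an error"]
same_prediction = bool((restored["pipeline"].predict(sample) == pipeline.predict(sample)).all())
print(restored["model_version"], same_prediction)
```

Because `joblib` relies on pickling, any custom class in the bundle (such as a preprocessor) has to be importable under the same module path when the artifact is loaded.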

## 8. Modern LLM-Based Approach

### How would we solve text classification with LLMs today?

**Option 1: Zero-Shot Classification**
```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

result = classifier(
    "The mobile app crashes when I check my balance",
    candidate_labels=["Tech Support", "Service Issues", "Product Inquiry", "Policy Complaint"]
)
```

**Option 2: Few-Shot with GPT**
```python
prompt = """
Classify the following customer complaint into one of these categories:
- Tech_Support: Issues with mobile app, website, or digital banking
- Service_Issues: Problems with customer service or branch experience
- Product_Inquiry: Questions about products, rates, or features
- Policy_Complaint: Concerns about fees, policies, or terms

Examples:
"The app keeps logging me out" -> Tech_Support
"The teller was rude" -> Service_Issues
"What's the interest rate?" -> Product_Inquiry
"Why was I charged $35?" -> Policy_Complaint

Complaint: {user_input}
Category:
"""
```

**Option 3: Fine-Tuned BERT**
```python
from transformers import BertForSequenceClassification, Trainer

model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased', 
    num_labels=4
)

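# Fine-tune on labeled banking data.
# train_dataset is assumed to be a tokenized, labeled dataset prepared elsewhere.
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    # ... remaining Trainer arguments elided
)
trainer.train()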
```
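
With prompt-based options, the model returns free text rather than a class index, so the response has to be mapped back to a known label before routing. The helper below is a hypothetical sketch of that step; the function name and matching rules are assumptions, and anything it cannot match is returned as `None` so the message can go to human review.

```python
VALID_LABELS = ["Tech_Support", "Service_Issues", "Product_Inquiry", "Policy_Complaint"]

def parse_llm_label(raw_response, valid_labels=VALID_LABELS):
    """Map free-text model output to a known label, or None if it cannot be matched."""
    # Normalize case, surrounding punctuation, and the space/underscore difference.
    cleaned = raw_response.strip().strip('."\'').lower().replace(" ", "_")
    for label in valid_labels:
        if cleaned == label.lower():
            return label
    # Otherwise accept the response only if exactly one label is mentioned in it.
    mentioned = [label for label in valid_labels if label.lower() in cleaned]
    return mentioned[0] if len(mentioned) == 1 else None

print(parse_llm_label("  Tech Support. "))
print(parse_llm_label("Category: Policy_Complaint"))
print(parse_llm_label("I am not sure"))
```

Returning `None` instead of guessing keeps the failure visible, which matches the human-escalation pattern used by the traditional classifier earlier in the notebook.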

In [None]:
# Prompt construction for LLM classification (the API call itself is omitted; it would require credentials)

def create_classification_prompt(text, categories, examples=None):
    """
    Create a prompt for LLM-based classification.
    
    In banking production:
    - PII must be masked before sending to external API
    - Use an institution-approved provider (e.g. Azure OpenAI), not a consumer API
    - Log prompts for audit trail
    """
    
    category_descriptions = {
        'Tech_Support': 'Issues with mobile app, website, ATM, or digital banking platforms',
        'Service_Issues': 'Problems with customer service quality, wait times, or staff behavior',
        'Product_Inquiry': 'Questions about account features, interest rates, or product offerings',
        'Policy_Complaint': 'Concerns about fees, terms and conditions, or bank policies'
    }
    
    prompt = f"""You are a customer service routing assistant for a retail bank.

Task: Classify the following customer message into exactly one category.

Categories:
"""
    
    for cat in categories:
        prompt += f"- {cat}: {category_descriptions.get(cat, 'No description')}\n"
    
    if examples:
        prompt += "\nExamples:\n"
        for ex_text, ex_label in examples:
            prompt += f'"{ex_text}" -> {ex_label}\n'
    
    prompt += f"""
Customer Message: "{text}"

Respond with only the category name, nothing else.

Category:"""
    
    return prompt

# Example
example_prompt = create_classification_prompt(
    "The mobile app crashes every time I try to deposit a check",
    banking_categories,
    examples=[
        ("Website won't load", "Tech_Support"),
        ("Charged wrong fee", "Policy_Complaint")
    ]
)

print("EXAMPLE LLM CLASSIFICATION PROMPT")
print("=" * 50)
print(example_prompt)
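
The docstring above notes that PII must be masked before a prompt leaves the bank. The sketch below shows the general shape of a regex-based masking step. The three patterns are illustrative assumptions only and would miss many real-world formats, so a production system would need a vetted PII detector.

```python
import re

# Illustrative patterns only; order matters because earlier patterns run first.
PII_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{8,17}\b"), "[ACCOUNT_NUMBER]"),
]

def mask_pii(text):
    """Replace matches of each pattern with a placeholder before the text is sent out."""
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

masked = mask_pii("Account 123456789 was charged; contact me at jane.doe@example.com")
print(masked)
```

Masking would typically run before `create_classification_prompt`, so that both the prompt and any audit log of it contain only placeholders.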

## 9. Traditional vs LLM Decision Matrix

| Dimension | Traditional (LR/SVM) | LLM (Fine-tuned BERT) | LLM (Zero-shot GPT) |
|-----------|---------------------|----------------------|--------------------|
| **Accuracy** | 85-92% | 92-96% | 80-90% |
| **Latency** | <10ms | 50-200ms | 500ms-2s |
| **Cost per prediction** | ~$0 | $0.0001-0.001 | $0.001-0.01 |
| **Training data needed** | 1,000+ per class | 100+ per class | 0-10 examples |
| **Explainability** | High (coefficients) | Low (black box) | Medium (can ask why) |
| **Cold start** | Needs labels | Needs labels | Works immediately |
| **New classes** | Retrain required | Retrain required | Just update prompt |
| **Compliance** | Easy (on-premise) | Medium (self-hosted) | Complex (API/data residency) |
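
To make the cost row concrete, here is a small worked calculation that multiplies the per-prediction ranges from the table by a monthly volume. The volume of 1 million predictions per month is a hypothetical figure, and the traditional model's "~$0" is treated as exactly zero for simplicity.

```python
# Per-prediction cost ranges in USD, taken from the table above.
# "~$0" for the traditional model is approximated as 0 (an assumption).
COST_PER_PREDICTION = {
    "Traditional (LR/SVM)": (0.0, 0.0),
    "Fine-tuned BERT": (0.0001, 0.001),
    "Zero-shot GPT": (0.001, 0.01),
}

MONTHLY_VOLUME = 1_000_000  # hypothetical volume


def monthly_cost_range(predictions_per_month, cost_range):
    """Return the (low, high) monthly cost for a per-prediction cost range."""
    low, high = cost_range
    return predictions_per_month * low, predictions_per_month * high


for name, cost_range in COST_PER_PREDICTION.items():
    low, high = monthly_cost_range(MONTHLY_VOLUME, cost_range)
    print(f"{name:<22} ${low:>8,.0f} - ${high:>8,.0f} per month")
```

At this volume the zero-shot range works out to $1,000 to $10,000 per month, before any costs for PII masking, logging, or retries.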

### When to Use Each Approach

**Use Traditional (Logistic Regression/SVM)**:
- High volume (millions of predictions/day)
- Strict latency requirements (<50ms)
- Need full explainability for regulators
- Stable categories that don't change often
- Sufficient labeled data available

**Use Fine-tuned BERT**:
- Accuracy is critical
- Have GPU infrastructure
- Medium volume with latency budget
- Can self-host (data privacy concerns)

**Use Zero-shot LLM**:
- New category needs to be added quickly
- No labeled data available
- Low volume, high-value decisions
- Exploratory/prototyping phase

## 10. Interview Soundbites

### Ready-to-Say Statements

**On Algorithm Choice:**
> "For complaint classification at scale, I'd start with Logistic Regression on TF-IDF features. It's not sexy, but it gives me interpretable coefficients I can show to compliance, sub-10ms latency, and around 90% accuracy. I'd only move to BERT if a gain of roughly 5 percentage points justifies 10x or more latency and the added GPU infrastructure cost."

**On Evaluation:**
> "Accuracy can be misleading with imbalanced classes. For fraud detection with 0.1% fraud rate, a model that predicts 'not fraud' for everything gets 99.9% accuracy but is useless. I focus on recall for the minority class and use precision-recall curves, not ROC curves, for imbalanced problems."
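
The arithmetic behind the 99.9% figure can be checked in a few lines. This sketch uses a synthetic label vector with a 0.1% fraud rate and a "model" that always predicts the majority class; the sample size is an arbitrary choice for illustration.

```python
# Synthetic labels: 100 fraud cases out of 100,000 (0.1% fraud rate).
n_total = 100_000
n_fraud = 100
y_true = [1] * n_fraud + [0] * (n_total - n_fraud)

# A degenerate classifier that predicts "not fraud" for everything.
y_pred = [0] * n_total

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / n_total
recall = true_positives / n_fraud

print(f"accuracy = {accuracy:.3f}")  # 0.999
print(f"fraud recall = {recall:.3f}")  # 0.000
```

Accuracy rewards the model for the 99,900 easy negatives, while recall on the fraud class exposes that it never catches a single case.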

**On Feature Engineering:**
> "TF-IDF with n-grams is still competitive with transformers for many classification tasks, especially when you have clean, well-structured text. The key is good preprocessing: keeping negations for sentiment, normalizing domain-specific patterns like account numbers, and removing stopwords carefully, because standard stopword lists include negations like 'not' and 'no'."

**On Class Imbalance:**
> "For imbalanced classes in banking, I use a combination of class weights during training, stratified cross-validation, and threshold tuning per class at inference. For critical categories like fraud, I might train a separate high-recall model that errs on the side of caution."
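
The three techniques named above (class weights, stratified cross-validation, and threshold tuning) can be sketched together for a single binary slice of the problem. The synthetic dataset, the 5% positive rate, and the 0.90 recall target are assumptions chosen for illustration; in a multi-class setting the same threshold search would be repeated per class on one-vs-rest probabilities.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced data standing in for "critical category vs. everything else".
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=0
)

# 1) Class weights: errors on the rare class cost more during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# 2) Stratified CV: every fold keeps the same class ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]

# 3) Threshold tuning: highest threshold that still meets the recall target.
target_recall = 0.90
precision, recall, thresholds = precision_recall_curve(y, proba)
# precision/recall have one more entry than thresholds; drop the final point.
meets_target = recall[:-1] >= target_recall
chosen_threshold = thresholds[meets_target].max()

y_pred = (proba >= chosen_threshold).astype(int)
achieved_recall = y_pred[y == 1].mean()
print(f"threshold = {chosen_threshold:.3f}, recall = {achieved_recall:.3f}")
```

Choosing the threshold on out-of-fold probabilities rather than training-set probabilities avoids an optimistic estimate of the recall the model will deliver in production.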

**On When NOT to Use LLMs:**
> "Zero-shot classification sounds appealing, but at 1 million complaints per month and $0.001-0.01 per prediction, the API costs alone would be roughly $1K-10K per month. And I can't send PII to external APIs without masking, which might remove important context. Traditional models give me 95% of the performance at 0.1% of the cost."

**On Production Failures:**
> "Text classifiers fail silently when the vocabulary drifts. When a new product launches with new terminology, a TF-IDF vectorizer simply ignores words it never saw in training, so the model falls back on whatever generic terms remain. We need vocabulary monitoring, such as tracking the out-of-vocabulary rate, and regular retraining cycles. I recommend monthly for fast-moving domains like banking."
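
A simple form of vocabulary monitoring is to track the share of incoming tokens that the training vocabulary has never seen. The sketch below uses a toy tokenizer, made-up messages, and a hypothetical 5% alert threshold; in practice the vocabulary would come from the fitted vectorizer and the threshold would be calibrated on historical traffic.

```python
import re


def tokenize(text):
    """Toy tokenizer: lowercase alphanumeric runs."""
    return re.findall(r"[a-z0-9']+", text.lower())


def oov_rate(texts, vocabulary):
    """Fraction of tokens in `texts` that are absent from `vocabulary`."""
    tokens = [token for text in texts for token in tokenize(text)]
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)


# Vocabulary learned at training time (toy example).
train_texts = [
    "my card was declined at the atm",
    "the app crashes when i login",
]
vocabulary = {token for text in train_texts for token in tokenize(text)}

# New traffic mentioning a hypothetical product launched after training.
new_texts = ["the paylater wallet crashes when i login"]
rate = oov_rate(new_texts, vocabulary)

ALERT_THRESHOLD = 0.05  # hypothetical alerting threshold
print(f"OOV rate = {rate:.2%}, alert = {rate > ALERT_THRESHOLD}")
```

Here two of the seven tokens ("paylater" and "wallet") are unknown, so the rate is about 29% and the alert fires.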

**On Explainability:**
> "In banking, I need to explain every routing decision. With Logistic Regression, I can say 'this complaint was routed to Tech Support because it contains the words app, crash, and login with high positive coefficients for that class.' With BERT, I have to rely on post-hoc explanations such as attention weights, LIME, or SHAP, which are approximations and are much harder to defend to auditors."

---

### Common Interview Questions

**Q: How do you handle class imbalance?**
> Multiple strategies: class weights in the loss function, oversampling minority class (SMOTE for tabular, not recommended for text), undersampling majority, threshold adjustment at inference, or ensemble of models with different class balances.

**Q: Why Logistic Regression over Naive Bayes?**
> Naive Bayes assumes features are conditionally independent given the class, which is violated in text (word co-occurrences matter). Logistic Regression is discriminative: it directly models P(y|x) without the independence assumption. In practice, LR often outperforms NB by a few points on text classification when there is enough training data, though NB can be competitive on very small datasets.

**Q: How do you choose between multi-class and multi-label?**
> Multi-class: exactly one label per document (complaint routing - goes to one queue). Multi-label: multiple labels possible (complaint tagging - can be about both 'fees' AND 'service'). The latter requires different loss functions (binary cross-entropy per label) and evaluation (subset accuracy, hamming loss).
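
The two multi-label metrics named above behave quite differently, which a small hand-built example makes visible. The label names and the prediction matrix are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

label_names = ["fees", "service", "fraud"]

# Rows are complaints, columns are labels (1 = label applies).
y_true = np.array([
    [1, 0, 0],
    [1, 1, 0],  # about both fees AND service
    [0, 0, 1],
    [0, 1, 0],
])
y_pred = np.array([
    [1, 0, 0],
    [1, 0, 0],  # missed "service"
    [0, 0, 1],
    [0, 1, 1],  # spurious "fraud"
])

# On indicator matrices, accuracy_score is subset accuracy:
# a row counts only if every label matches.
subset_acc = accuracy_score(y_true, y_pred)

# Hamming loss is the fraction of individual label cells that are wrong.
h_loss = hamming_loss(y_true, y_pred)

print(f"subset accuracy = {subset_acc:.3f}")  # 0.500
print(f"hamming loss    = {h_loss:.3f}")  # 0.167
```

Two of four rows match exactly, so subset accuracy is 0.5, while only 2 of 12 label cells are wrong, so Hamming loss is about 0.167. Subset accuracy is the stricter metric and tends to look pessimistic as the number of labels grows.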

In [None]:
print("""
╔══════════════════════════════════════════════════════════════════╗
║                    NOTEBOOK SUMMARY                               ║
╠══════════════════════════════════════════════════════════════════╣
║  Task: Text Classification                                       ║
║  Approach: Traditional NLP (Logistic Regression, TF-IDF)         ║
║  Banking Use: Complaint routing, fraud triage                    ║
║                                                                  ║
║  Key Takeaways:                                                  ║
║  1. TF-IDF + Logistic Regression is production-ready baseline    ║
║  2. Class imbalance requires careful metric selection            ║
║  3. Explainability (coefficients) crucial for banking            ║
║  4. Confidence thresholds for human escalation                   ║
║  5. Traditional beats LLMs on cost at scale                      ║
╚══════════════════════════════════════════════════════════════════╝
""")