# Testing Semantic Cache Embeddings

Semantic caching avoids redundant LLM calls by recognizing when two queries mean the same thing. The challenge is distinguishing "same question, different words" from "related topic, different question."

**What we're testing:** Redis released LangCache embedding models fine-tuned specifically for semantic caching. Can they outperform general-purpose embeddings at detecting equivalent queries?

**The scenario:** User A asks "luxury home with pool." User B asks "upscale property with swimming pool." Should these hit the same cache? What about "luxury home with pool" vs "affordable starter home"?

## What is Semantic Caching?

Traditional caching relies on exact key matching. Semantic caching instead compares query embeddings, so it can recognize equivalent queries even when they are worded differently.

The tradeoff: maximize cache hits for equivalent queries while avoiding false positives (returning wrong cached responses for queries that look similar but have different intent).
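
To make the lookup concrete, here is a minimal sketch of what a semantic cache does on each request. It is a toy, not how Redis LangCache is implemented: the `SemanticCache` class, the `embed_fn` parameter, and the hand-written `toy_vectors` are all assumptions made for illustration, so the sketch runs without downloading a model.

```python
import numpy as np

class SemanticCache:
    """Toy in-memory semantic cache (illustrative, not a production design)."""

    def __init__(self, embed_fn, threshold=0.85):
        self.embed_fn = embed_fn    # hypothetical: maps a string to a 1-D vector
        self.threshold = threshold  # minimum cosine similarity for a cache hit
        self.vectors = []           # unit-normalized embeddings of cached queries
        self.responses = []         # cached responses, aligned with self.vectors

    def _unit(self, text):
        v = np.asarray(self.embed_fn(text), dtype=float)
        return v / np.linalg.norm(v)

    def get(self, query):
        """Return the response of the most similar cached query, or None."""
        if not self.vectors:
            return None
        # Dot product of unit vectors equals cosine similarity.
        sims = np.stack(self.vectors) @ self._unit(query)
        best = int(np.argmax(sims))
        return self.responses[best] if sims[best] >= self.threshold else None

    def put(self, query, response):
        self.vectors.append(self._unit(query))
        self.responses.append(response)


# Hand-written stand-in embeddings (assumption: a real model would produce these).
toy_vectors = {
    "luxury home with pool": [0.9, 0.1, 0.0],
    "upscale property with swimming pool": [0.85, 0.2, 0.05],
    "affordable starter home": [0.1, 0.9, 0.2],
}

cache = SemanticCache(embed_fn=toy_vectors.__getitem__, threshold=0.85)
cache.put("luxury home with pool", "cached answer A")

hit = cache.get("upscale property with swimming pool")  # similar enough: reuses the answer
miss = cache.get("affordable starter home")             # too different: returns None
```

In a real deployment `embed_fn` would wrap an embedding model and the linear scan would be replaced by a vector index; the threshold check is the part the rest of this notebook evaluates.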

## Setup

In [None]:
# Install dependencies (uncomment if needed)
# !pip install sentence-transformers numpy matplotlib

In [None]:
import numpy as np
import time
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt

def cosine_similarity(a, b):
    """Compute cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

## Test Data: Real Estate Query Pairs

We need pairs of queries where we know whether they should match (cache hit) or not.

**Positive pairs:** Different phrasings of the same search intent  
**Negative pairs:** Queries that look similar but have different intent

In [None]:
# Positive pairs - these SHOULD match (same intent, different words)
positive_pairs = [
    ("waterfront property with private dock", "lakefront home with boat dock"),
    ("cozy cottage near the beach", "small beach house with charm"),
    ("modern downtown condo with city views", "contemporary urban apartment overlooking the city"),
    ("family home with large backyard", "house with big yard for kids"),
    ("luxury home with pool", "upscale property with swimming pool"),
    ("quiet neighborhood for retirees", "peaceful area good for retirement"),
    ("home with home office space", "house with dedicated workspace"),
    ("pet-friendly apartment", "condo that allows dogs"),
    ("fixer-upper with potential", "needs work but good bones"),
    ("move-in ready home", "turnkey property no repairs needed"),
    ("open floor plan with natural light", "bright and airy layout"),
    ("close to good schools", "in a top school district"),
    ("investment property with rental income", "income-generating real estate"),
    ("historic home with character", "older house with original features"),
    ("energy efficient home with solar", "eco-friendly house with solar panels"),
]

# Negative pairs - these should NOT match (different intent)
negative_pairs = [
    ("waterfront property with private dock", "downtown condo near restaurants"),
    ("cozy cottage near the beach", "mountain cabin with ski access"),
    ("luxury home with pool", "affordable starter home"),
    ("quiet neighborhood for retirees", "vibrant area with nightlife"),
    ("family home with large backyard", "studio apartment downtown"),
    ("fixer-upper with potential", "brand new construction"),
    ("pet-friendly apartment", "no pets allowed policy"),
    ("close to good schools", "adult community 55+"),
    ("historic home with character", "modern minimalist new build"),
    ("investment property with rental income", "forever home to raise family"),
]

print(f"Positive pairs (should match): {len(positive_pairs)}")
print(f"Negative pairs (should NOT match): {len(negative_pairs)}")

## Load the Models

We're comparing:
1. **redis/langcache-embed-v3-small** - Fine-tuned specifically for semantic caching
2. **all-MiniLM-L6-v2** - General-purpose sentence embeddings

Here's what makes this interesting: LangCache is actually fine-tuned *from* MiniLM-L6. Same base model, same size (22.6M params, 384 dimensions), different training objective. The LangCache model was trained on sentence-pair data (anchor, positive, negative examples) to better separate "same intent" from "related but different" queries.
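
For intuition about what training on (anchor, positive, negative) examples optimizes, here is a sketch of a triplet margin loss on cosine similarity. This is one common objective for triplet data, not necessarily the loss Redis used; the function name, the `margin=0.2` value, and the 2-D vectors are assumptions chosen for illustration.

```python
import numpy as np

def _cos(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive beats the negative by at least `margin`.

    Illustrative only: `margin` is a made-up hyperparameter value.
    """
    gap = _cos(anchor, positive) - _cos(anchor, negative)
    return max(0.0, margin - gap)

anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])       # same direction as the anchor
easy_negative = np.array([0.0, 1.0])  # orthogonal to the anchor
hard_negative = np.array([1.0, 0.1])  # almost the same direction as the anchor

easy_loss = triplet_margin_loss(anchor, positive, easy_negative)
hard_loss = triplet_margin_loss(anchor, positive, hard_negative)
```

The easy negative contributes no loss, while the hard negative (a "related but different" query) still does. Training on such cases is what pushes look-alike queries apart.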

In [None]:
print("Loading Redis LangCache model...")
langcache_model = SentenceTransformer("redis/langcache-embed-v3-small")
print(f"  Embedding dimension: {langcache_model.get_sentence_embedding_dimension()}")

print("\nLoading baseline MiniLM model...")
baseline_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(f"  Embedding dimension: {baseline_model.get_sentence_embedding_dimension()}")

## Compute Similarities

For each model, we'll compute the cosine similarity between all query pairs.

In [None]:
def compute_similarities(model, positive_pairs, negative_pairs):
    """Compute similarities for all pairs using a given model."""
    # Collect all unique texts
    all_texts = []
    for p1, p2 in positive_pairs + negative_pairs:
        all_texts.extend([p1, p2])
    all_texts = list(dict.fromkeys(all_texts))  # de-duplicate, keeping first-seen order
    
    # Generate embeddings
    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    embeddings = model.encode(all_texts, show_progress_bar=False)
    embed_time = time.perf_counter() - start
    
    # Create lookup
    text_to_emb = {t: e for t, e in zip(all_texts, embeddings)}
    
    # Compute similarities
    pos_sims = [cosine_similarity(text_to_emb[p1], text_to_emb[p2]) for p1, p2 in positive_pairs]
    neg_sims = [cosine_similarity(text_to_emb[p1], text_to_emb[p2]) for p1, p2 in negative_pairs]
    
    return pos_sims, neg_sims, embed_time

# Run for both models
print("Computing similarities with LangCache model...")
lc_pos, lc_neg, lc_time = compute_similarities(langcache_model, positive_pairs, negative_pairs)

print("Computing similarities with baseline model...")
bl_pos, bl_neg, bl_time = compute_similarities(baseline_model, positive_pairs, negative_pairs)

## Results: Similarity Distributions

A good caching model should:
- Give HIGH similarity scores to positive pairs (cache hits)
- Give LOW similarity scores to negative pairs (avoid false hits)
- Have clear SEPARATION between the two distributions
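
Separation can also be measured without picking a threshold. The sketch below adds two measures that the cells in this notebook do not compute: the fraction of (positive, negative) combinations where the positive pair scores higher (a rank-based, AUC-style score), and the worst-case margin between the weakest positive and the strongest negative. The function names and the `toy_pos` / `toy_neg` scores are made up for illustration and are not results from either model.

```python
import numpy as np

def pairwise_separation(pos_sims, neg_sims):
    """Fraction of (positive, negative) combinations ranked correctly.

    1.0 means every positive pair outscores every negative pair;
    0.5 is what random scores would give. Ties count as half.
    """
    pos = np.asarray(pos_sims, dtype=float)[:, None]
    neg = np.asarray(neg_sims, dtype=float)[None, :]
    wins = (pos > neg).sum()
    ties = (pos == neg).sum()
    return float((wins + 0.5 * ties) / (pos.size * neg.size))

def worst_case_margin(pos_sims, neg_sims):
    """Lowest positive score minus highest negative score.

    A positive value means some threshold separates the two groups perfectly.
    """
    return float(np.min(pos_sims) - np.max(neg_sims))

# Made-up scores, only to show the calculation.
toy_pos = [0.91, 0.88, 0.74]
toy_neg = [0.80, 0.42]

toy_separation = pairwise_separation(toy_pos, toy_neg)  # 5 of 6 combinations ranked correctly
toy_margin = worst_case_margin(toy_pos, toy_neg)        # negative: the groups overlap
```

Once the similarities are computed, the same functions can be called on `lc_pos` / `lc_neg` and `bl_pos` / `bl_neg`.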

In [None]:
print("="*60)
print("SIMILARITY DISTRIBUTION COMPARISON")
print("="*60)

print("\nRedis LangCache (cache-optimized):")
print(f"  Positive pairs - Mean: {np.mean(lc_pos):.4f}, Min: {np.min(lc_pos):.4f}, Max: {np.max(lc_pos):.4f}")
print(f"  Negative pairs - Mean: {np.mean(lc_neg):.4f}, Min: {np.min(lc_neg):.4f}, Max: {np.max(lc_neg):.4f}")
print(f"  Separation gap: {np.mean(lc_pos) - np.mean(lc_neg):.4f}")

print("\nBaseline MiniLM (general-purpose):")
print(f"  Positive pairs - Mean: {np.mean(bl_pos):.4f}, Min: {np.min(bl_pos):.4f}, Max: {np.max(bl_pos):.4f}")
print(f"  Negative pairs - Mean: {np.mean(bl_neg):.4f}, Min: {np.min(bl_neg):.4f}, Max: {np.max(bl_neg):.4f}")
print(f"  Separation gap: {np.mean(bl_pos) - np.mean(bl_neg):.4f}")

In [None]:
# Visualize the distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# LangCache
axes[0].hist(lc_pos, bins=15, alpha=0.7, label='Positive (should match)', color='green')
axes[0].hist(lc_neg, bins=15, alpha=0.7, label='Negative (should NOT match)', color='red')
axes[0].axvline(x=0.85, color='black', linestyle='--', label='Threshold 0.85')
axes[0].set_xlabel('Cosine Similarity')
axes[0].set_ylabel('Count')
axes[0].set_title('Redis LangCache (cache-optimized)')
axes[0].legend()
axes[0].set_xlim(0, 1)

# Baseline
axes[1].hist(bl_pos, bins=15, alpha=0.7, label='Positive (should match)', color='green')
axes[1].hist(bl_neg, bins=15, alpha=0.7, label='Negative (should NOT match)', color='red')
axes[1].axvline(x=0.85, color='black', linestyle='--', label='Threshold 0.85')
axes[1].set_xlabel('Cosine Similarity')
axes[1].set_ylabel('Count')
axes[1].set_title('Baseline MiniLM (general-purpose)')
axes[1].legend()
axes[1].set_xlim(0, 1)

plt.tight_layout()
plt.savefig('similarity_distributions.png', dpi=150)
plt.show()

## Precision/Recall at Different Thresholds

The similarity threshold determines when we consider two queries equivalent:
- **Too low (0.7):** More cache hits, but also more false positives
- **Too high (0.95):** Fewer false positives, but miss valid cache hits

Let's find the sweet spot for each model.

In [None]:
def evaluate_threshold(pos_sims, neg_sims, threshold):
    """Evaluate precision/recall at a given threshold."""
    tp = sum(1 for s in pos_sims if s >= threshold)  # Correct cache hits
    fn = sum(1 for s in pos_sims if s < threshold)   # Missed cache hits
    fp = sum(1 for s in neg_sims if s >= threshold)  # False cache hits
    tn = sum(1 for s in neg_sims if s < threshold)   # Correct rejections
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    return precision, recall, f1

thresholds = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]

print("="*70)
print("THRESHOLD COMPARISON")
print("="*70)
print(f"{'Threshold':<12} {'LangCache':<28} {'Baseline':<28}")
print("-"*70)

lc_results = []
bl_results = []

for t in thresholds:
    lc_p, lc_r, lc_f1 = evaluate_threshold(lc_pos, lc_neg, t)
    bl_p, bl_r, bl_f1 = evaluate_threshold(bl_pos, bl_neg, t)
    
    lc_results.append((lc_p, lc_r, lc_f1))
    bl_results.append((bl_p, bl_r, bl_f1))
    
    print(f"{t:<12.2f} P:{lc_p:.2f} R:{lc_r:.2f} F1:{lc_f1:.2f}    P:{bl_p:.2f} R:{bl_r:.2f} F1:{bl_f1:.2f}")

In [None]:
# Plot F1 scores across thresholds
plt.figure(figsize=(10, 6))

plt.plot(thresholds, [r[2] for r in lc_results], 'o-', label='Redis LangCache', linewidth=2, markersize=8)
plt.plot(thresholds, [r[2] for r in bl_results], 's-', label='Baseline MiniLM', linewidth=2, markersize=8)

plt.xlabel('Similarity Threshold', fontsize=12)
plt.ylabel('F1 Score', fontsize=12)
plt.title('Cache Hit Detection: F1 Score vs Threshold', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.ylim(0, 1.05)

plt.tight_layout()
plt.savefig('f1_comparison.png', dpi=150)
plt.show()

## Hardest Cases Analysis

Which pairs are causing problems? Let's look at:
- **Lowest positive similarities:** Cache misses we shouldn't have
- **Highest negative similarities:** False cache hits we need to avoid

In [None]:
# Combine pairs with their similarities
lc_pos_with_sims = list(zip(positive_pairs, lc_pos))
lc_neg_with_sims = list(zip(negative_pairs, lc_neg))

print("LANGCACHE - Hardest Positive Pairs (lowest similarity, risk of cache miss):")
for (p1, p2), sim in sorted(lc_pos_with_sims, key=lambda x: x[1])[:3]:
    print(f"  {sim:.4f}: '{p1}' vs '{p2}'")

print("\nLANGCACHE - Riskiest Negative Pairs (highest similarity, risk of false hit):")
for (p1, p2), sim in sorted(lc_neg_with_sims, key=lambda x: x[1], reverse=True)[:3]:
    print(f"  {sim:.4f}: '{p1}' vs '{p2}'")

In [None]:
bl_pos_with_sims = list(zip(positive_pairs, bl_pos))
bl_neg_with_sims = list(zip(negative_pairs, bl_neg))

print("BASELINE - Hardest Positive Pairs (lowest similarity, risk of cache miss):")
for (p1, p2), sim in sorted(bl_pos_with_sims, key=lambda x: x[1])[:3]:
    print(f"  {sim:.4f}: '{p1}' vs '{p2}'")

print("\nBASELINE - Riskiest Negative Pairs (highest similarity, risk of false hit):")
for (p1, p2), sim in sorted(bl_neg_with_sims, key=lambda x: x[1], reverse=True)[:3]:
    print(f"  {sim:.4f}: '{p1}' vs '{p2}'")

## Summary

Let's summarize what we found.

In [None]:
# Find optimal thresholds
lc_best_idx = np.argmax([r[2] for r in lc_results])
bl_best_idx = np.argmax([r[2] for r in bl_results])

print("="*60)
print("EXPERIMENT SUMMARY")
print("="*60)

print("\nRedis LangCache (cache-optimized):")
print(f"  Best threshold: {thresholds[lc_best_idx]}")
print(f"  Best F1 score: {lc_results[lc_best_idx][2]:.3f}")
print(f"  Separation gap: {np.mean(lc_pos) - np.mean(lc_neg):.4f}")
print(f"  Embedding time: {lc_time:.3f}s")

print("\nBaseline MiniLM (general-purpose):")
print(f"  Best threshold: {thresholds[bl_best_idx]}")
print(f"  Best F1 score: {bl_results[bl_best_idx][2]:.3f}")
print(f"  Separation gap: {np.mean(bl_pos) - np.mean(bl_neg):.4f}")
print(f"  Embedding time: {bl_time:.3f}s")

print("\n" + "="*60)
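lc_best_f1 = lc_results[lc_best_idx][2]
bl_best_f1 = bl_results[bl_best_idx][2]
diff = abs(lc_best_f1 - bl_best_f1)
if np.isclose(lc_best_f1, bl_best_f1):
    # Avoid declaring a winner when both models reach the same best F1
    print("Result: tie (both models reach the same best F1)")
else:
    winner = "LangCache" if lc_best_f1 > bl_best_f1 else "Baseline"
    print(f"Winner: {winner} (by {diff:.3f} F1 points)")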
print("="*60)

## Conclusions

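On this real estate query dataset, read the results through three lenses:

1. **Separation gap** (mean positive similarity minus mean negative similarity) tells us how far apart "should match" and "should not match" pairs sit on average. It does not show whether the two distributions overlap, so check the histograms as well.
2. **F1 score** balances precision (avoiding false cache hits) and recall (not missing valid cache hits)
3. **Threshold selection** matters: the right threshold depends on your cost of false positives vs false negatives

With only 15 positive and 10 negative pairs, a single pair changes recall by about 0.07, so treat small F1 differences as noise rather than evidence.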

### Where Both Models Struggled

Real estate jargon tripped up both models. "Move-in ready home" vs "turnkey property no repairs needed" scored low on both, even though these mean the same thing in real estate. The models learned general paraphrase patterns but not domain-specific vocabulary.

### For Production Use

- If false cache hits are expensive (wrong answers): prioritize **precision** and use a higher threshold
- If cache misses are expensive (unnecessary LLM calls): prioritize **recall** and use a lower threshold
- Remember that you still pay for embedding generation and vector search on every cache lookup
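
One way to turn the first two bullets into a number is to weight each error type by what it costs you and pick the threshold with the lowest total. The sketch below is a hedged illustration: `pick_threshold`, the `fp_cost` / `fn_cost` weights, and the example scores are assumptions, not values measured in this notebook.

```python
def pick_threshold(pos_sims, neg_sims, thresholds, fp_cost=5.0, fn_cost=1.0):
    """Return (threshold, total_cost) minimizing a weighted error count.

    fp_cost: assumed cost of one false cache hit (a wrong answer served)
    fn_cost: assumed cost of one missed cache hit (an extra LLM call)
    On ties, the first (lowest) threshold in `thresholds` wins.
    """
    best_t, best_cost = None, float("inf")
    for t in thresholds:
        fp = sum(1 for s in neg_sims if s >= t)  # negatives wrongly served from cache
        fn = sum(1 for s in pos_sims if s < t)   # positives that missed the cache
        cost = fp_cost * fp + fn_cost * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Made-up similarity scores, only to show the calculation.
example_pos = [0.92, 0.88, 0.81, 0.76]
example_neg = [0.83, 0.60, 0.55]
example_grid = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95]

# Wrong answers cost 5x an extra LLM call: the higher threshold wins.
strict = pick_threshold(example_pos, example_neg, example_grid, fp_cost=5.0, fn_cost=1.0)

# Extra LLM calls cost 5x a wrong answer: the lower threshold wins.
lenient = pick_threshold(example_pos, example_neg, example_grid, fp_cost=1.0, fn_cost=5.0)
```

With these example scores, `strict` selects 0.85 and `lenient` selects 0.70. The same function can be applied to `lc_pos` / `lc_neg` or `bl_pos` / `bl_neg` with cost weights that reflect your application.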

### Open Source Stack

The embedding models are Apache 2.0 licensed (separate from Redis's managed LangCache service):
- [redis/langcache-embed-v3-small](https://huggingface.co/redis/langcache-embed-v3-small)
- Run locally with sentence-transformers
- Store embeddings in any vector database (Redis Stack, pgvector, etc.)