# Debug Drill: The Wrong Distance

**Scenario:**
A colleague built a support ticket similarity system to find duplicate tickets.

"I'm using Euclidean distance like we learned in school!" they say.

But the results are terrible: completely unrelated tickets are matched.

**Your Task:**
1. Run the similarity search and see the bad results
2. Diagnose why Euclidean distance fails for text
3. Fix it with the right metric
4. Write a 3-bullet postmortem

---

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity
from sklearn.neighbors import NearestNeighbors

np.random.seed(42)

In [None]:
# Sample support tickets with clear topics
tickets = [
    # Password/Login issues (0-3)
    "I can't log in to my account",
    "Password reset not working please help",
    "Unable to access my account after password change",
    "Login credentials not accepted",
    
    # Shipping issues (4-7)
    "Where is my order it's been two weeks",
    "Package tracking shows delivered but I never received it",
    "Shipping is taking way too long",
    "My delivery is late when will it arrive",
    
    # Refund issues (8-11)
    "I want a refund for this product",
    "How do I return this item and get my money back",
    "Request for refund - product not as described",
    "Cancel my order and refund please",
]

# Expected groupings: 0-3 (password), 4-7 (shipping), 8-11 (refund)
topics = ['password'] * 4 + ['shipping'] * 4 + ['refund'] * 4

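# Create TF-IDF embeddings.
# norm=None keeps the raw TF-IDF weights, so vector length varies with the ticket.
# The default (norm='l2') scales every vector to unit length, which gives
# Euclidean and cosine the same ranking and would hide the bug in this drill.
vectorizer = TfidfVectorizer(stop_words='english', norm=None)
embeddings = vectorizer.fit_transform(tickets).toarray()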

print(f"Tickets: {len(tickets)}")
print(f"Embedding dimensions: {embeddings.shape[1]}")

In [None]:
# ===== COLLEAGUE'S CODE (BUG: WRONG METRIC) =====

# Using Euclidean distance (wrong for sparse high-dim text!)
nn_euclidean = NearestNeighbors(n_neighbors=3, metric='euclidean')  # <-- BUG!
nn_euclidean.fit(embeddings)

def search_euclidean(query):
    """Find similar tickets using Euclidean distance."""
    query_emb = vectorizer.transform([query]).toarray()
    distances, indices = nn_euclidean.kneighbors(query_emb)
    
    print(f"Query: '{query}'")
    print("Top matches (Euclidean):")
    for dist, idx in zip(distances[0], indices[0]):
        topic = topics[idx]
        print(f"  [{dist:.3f}] [{topic}] {tickets[idx]}")
    print()

print("=== Colleague's Results (Euclidean Distance) ===")
print()
search_euclidean("password problem")
search_euclidean("shipping delay")
search_euclidean("want refund")

---

## Your Investigation

### Step 1: Why Euclidean fails for text

In [None]:
print("=== Why Euclidean Distance Fails for Text ===")
print()
print("Problem 1: Sparse vectors")
print(f"  Average non-zero elements: {(embeddings > 0).sum(axis=1).mean():.1f} out of {embeddings.shape[1]}")
print("  Tickets with no shared terms still get a finite distance, driven only by how long each one is")
print()
print("Problem 2: Vector length varies")
norms = np.linalg.norm(embeddings, axis=1)
print(f"  Vector norms range: {norms.min():.3f} to {norms.max():.3f}")
print("  Unnormalized vectors of longer documents have larger norms → appear more different")
print("  (If every norm is 1.000, the vectors are L2-normalized and Euclidean ranks exactly like cosine)")
print()
print("Problem 3: High dimensionality")
print(f"  {embeddings.shape[1]} dimensions → curse of dimensionality")
print("  Distances tend to concentrate, so nearest and farthest neighbors look alike")
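
A caveat to the length argument above is worth checking numerically. When vectors are scaled to unit length, squared Euclidean distance and cosine similarity are tied by ‖a − b‖² = 2 − 2·cos(a, b), so both metrics rank neighbors the same way. The sketch below confirms the identity on made-up toy vectors (not the ticket embeddings); the names `unit`, `a`, and `b` are illustrative only.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit L2 length."""
    return v / np.linalg.norm(v)

# Toy vectors (illustrative values, not taken from the tickets)
a = unit(np.array([3.0, 0.0, 4.0]))
b = unit(np.array([0.0, 5.0, 12.0]))

cos_ab = float(a @ b)
sq_euclid = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(f"cosine similarity: {cos_ab:.4f}")
print(f"squared Euclidean: {sq_euclid:.4f}")
print(f"2 - 2 * cosine:    {2 - 2 * cos_ab:.4f}")
```

The last two printed values agree, which is why the choice of metric only changes the ranking when vector lengths differ.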

In [None]:
# Compare distance distributions
euc_distances = euclidean_distances(embeddings)
cos_similarities = cosine_similarity(embeddings)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Euclidean distances
ax1 = axes[0]
im1 = ax1.imshow(euc_distances, cmap='RdYlGn_r', vmin=0)
ax1.set_title('Euclidean Distances (lower = more similar)')
ax1.set_xticks(range(12))
ax1.set_yticks(range(12))
plt.colorbar(im1, ax=ax1)

# Cosine similarities
ax2 = axes[1]
im2 = ax2.imshow(cos_similarities, cmap='RdYlGn', vmin=0, vmax=1)
ax2.set_title('Cosine Similarities (higher = more similar)')
ax2.set_xticks(range(12))
ax2.set_yticks(range(12))
plt.colorbar(im2, ax=ax2)

plt.tight_layout()
plt.show()

print("💡 Cosine similarity should show 4x4 blocks along the diagonal (one per topic)!")
print("   Euclidean distances don't show this structure as clearly.")
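
Eyeballing a heatmap can be backed up with a number. One possible check, sketched here on a made-up 4x4 matrix, compares the mean similarity within a topic to the mean similarity across topics. The helper `block_contrast` and the toy values are assumptions for illustration, not part of the original drill.

```python
import numpy as np

def block_contrast(sim, labels):
    """Return (mean within-group, mean cross-group) similarity, ignoring the diagonal."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]       # True where both items share a label
    off_diag = ~np.eye(len(labels), dtype=bool)     # drop self-similarity entries
    within = sim[same & off_diag].mean()
    across = sim[~same].mean()
    return float(within), float(across)

# Toy similarity matrix with two groups of two items (illustrative values)
toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
])
toy_labels = ["a", "a", "b", "b"]

within, across = block_contrast(toy_sim, toy_labels)
print(f"within-group mean: {within:.2f}, cross-group mean: {across:.2f}")
```

In this notebook, `block_contrast(cos_similarities, topics)` would apply the same check to the ticket matrix; a large gap between the two means suggests the metric separates the topics.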

### Step 2: TODO - Fix with cosine similarity

In [None]:
# TODO: Use cosine similarity instead

# Uncomment and complete:

# nn_cosine = NearestNeighbors(n_neighbors=3, metric='cosine')  # Fixed!
# nn_cosine.fit(embeddings)
# 
# def search_cosine(query):
#     """Find similar tickets using cosine distance."""
#     query_emb = vectorizer.transform([query]).toarray()
#     distances, indices = nn_cosine.kneighbors(query_emb)
#     
#     print(f"Query: '{query}'")
#     print("Top matches (Cosine):")
#     for dist, idx in zip(distances[0], indices[0]):
#         similarity = 1 - dist  # Convert distance to similarity
#         topic = topics[idx]
#         print(f"  [{similarity:.3f}] [{topic}] {tickets[idx]}")
#     print()
# 
# print("=== Fixed Results (Cosine Similarity) ===")
# print()
# search_cosine("password problem")
# search_cosine("shipping delay")
# search_cosine("want refund")

In [None]:
# TODO: Compare retrieval quality

# Uncomment:

# def evaluate_retrieval(nn_model, metric_name):
#     """Check if nearest neighbors are from the same topic."""
#     correct = 0
#     total = 0
#     
#     for i in range(len(tickets)):
#         distances, indices = nn_model.kneighbors(embeddings[i:i+1])
#         # Check neighbors (excluding self)
#         for idx in indices[0][1:]:  # Skip first (self)
#             total += 1
#             if topics[idx] == topics[i]:
#                 correct += 1
#     
#     accuracy = correct / total
#     print(f"{metric_name}: {accuracy:.1%} neighbors from same topic")
#     return accuracy
# 
# print("=== Retrieval Quality ===")
# acc_euclidean = evaluate_retrieval(nn_euclidean, "Euclidean")
# acc_cosine = evaluate_retrieval(nn_cosine, "Cosine")

In [None]:
# ============================================
# SELF-CHECK
# ============================================

# Uncomment:

# assert acc_cosine > acc_euclidean, "Cosine should outperform Euclidean for text"
# assert acc_cosine > 0.7, "Cosine should get most neighbors correct"
# 
# print("✓ Metric fixed!")
# print(f"✓ Euclidean accuracy: {acc_euclidean:.1%}")
# print(f"✓ Cosine accuracy: {acc_cosine:.1%}")
# print(f"✓ Improvement: {acc_cosine - acc_euclidean:+.1%}")

### Step 3: Write your postmortem

In [None]:
postmortem = """
## Postmortem: The Wrong Distance

### What happened:
- (Your answer: What was the observed problem with retrieval quality?)

### Root cause:
- (Your answer: Why does Euclidean distance fail for text embeddings?)

### How to prevent:
- (Your answer: What metric should we use for text similarity?)

"""

print(postmortem)

---

## ✅ Drill Complete!

**Key lessons:**

1. **Cosine similarity is the standard for text.** It measures the angle between vectors, ignoring magnitude.

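2. **Euclidean distance is a poor fit for unnormalized sparse, high-dimensional vectors.** It mixes direction with magnitude, and distances concentrate as dimensionality grows. On L2-normalized vectors it ranks neighbors exactly like cosine.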

3. **Vector length doesn't indicate semantic content.** A long document isn't necessarily more relevant.

4. **Always visualize your similarity matrix** to verify the metric captures the structure you expect.
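
The magnitude point in lessons 1 and 3 can be seen in a minimal sketch with hypothetical term-count vectors. The vectors and helper names below are made up for illustration: `long_doc` is the same wording as `short_doc` repeated three times, while `other_doc` shares no terms with it.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return float(np.linalg.norm(u - v))

# Hypothetical term counts over a 4-word vocabulary
short_doc = np.array([1.0, 2.0, 0.0, 0.0])
long_doc = 3 * short_doc                      # same content, three times as long
other_doc = np.array([0.0, 0.0, 1.0, 1.0])    # unrelated, but short

print("same content :", cosine_sim(short_doc, long_doc), euclidean(short_doc, long_doc))
print("unrelated    :", cosine_sim(short_doc, other_doc), euclidean(short_doc, other_doc))
```

Cosine scores the repeated document as a near-perfect match and the unrelated one as zero, while Euclidean distance places the unrelated short document closer than the repeated one.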

---

## Similarity Metric Guide

| Data Type | Recommended Metric | Why |
|-----------|-------------------|-----|
| Text/NLP | **Cosine** | Sparse, high-dim data; metric ignores document length |
| Images (CNN) | Cosine | Normalized embeddings |
| Dense embeddings | Cosine or Euclidean | Both work, test empirically |
| Geographic | Haversine (lat/lon) or Euclidean (projected coordinates) | Physical distance matters |
| Binary features | Jaccard | Set intersection/union |
| Mixed features | Gower | Handles different types |
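
Two of the less familiar metrics in the table are short enough to sketch in plain Python. This is a minimal illustration, assuming a spherical Earth with a mean radius of 6371 km for Haversine; the function names and sample inputs are made up for the example.

```python
import math

def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| for two sets; treated as 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(h))

# Two feature sets sharing 2 of 4 distinct features
features_a = {"wifi", "gps", "nfc"}
features_b = {"wifi", "nfc", "5g"}
print(jaccard_similarity(features_a, features_b))

# A quarter of the way around the equator
print(round(haversine_km(0.0, 0.0, 0.0, 90.0), 1))
```

Jaccard compares which features are present rather than their values, and Haversine accounts for the curvature that plain Euclidean distance on latitude and longitude ignores.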