# ANSWER KEY: Debug Drill 08 - Bad Similarity Search

**Bugs:**
1. Using `CountVectorizer` instead of `TfidfVectorizer` (no term weighting)
2. Using `euclidean_distances` instead of `cosine_similarity` (affected by document length)

**Key Lesson:** For text similarity, use TF-IDF + cosine similarity.

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity
import warnings
warnings.filterwarnings('ignore')

# Sample support tickets
tickets = pd.DataFrame({
    'ticket_id': range(1, 11),
    'ticket_text': [
        "I want a refund for my order",
        "How do I return this product and get my money back",
        "Package never arrived, very frustrated",
        "Where is my order? Still waiting for delivery",
        "Can I change my shipping address please",
        "Need to update my billing information",
        "Product arrived damaged, want replacement",
        "How to cancel my subscription",
        "Charged twice for same order, need refund",
        "Great product, just wanted to say thanks"
    ]
})

## The Bug (Colleague's Code)

In [None]:
# ===== BUGGY CODE =====

# Bug 1: Using CountVectorizer (no weighting)
vectorizer_buggy = CountVectorizer(max_features=100)
ticket_vectors_buggy = vectorizer_buggy.fit_transform(tickets['ticket_text'])

# Bug 2: Using Euclidean distance (affected by length)
query = "I need my money back"
query_vector_buggy = vectorizer_buggy.transform([query])

distances = euclidean_distances(query_vector_buggy, ticket_vectors_buggy)[0]
most_similar_idx = np.argmin(distances)  # Smallest distance = most similar

print("Buggy search for: 'I need my money back'")
print(f"\nBest match (Euclidean): {tickets.iloc[most_similar_idx]['ticket_text']}")
print("\nTop 3 matches:")
for idx in np.argsort(distances)[:3]:
    print(f"  {distances[idx]:.2f}: {tickets.iloc[idx]['ticket_text']}")

## Why This Is Wrong

**Problem 1: CountVectorizer**
- Only counts word occurrences
- Common words like "the" and "my" carry the same weight as distinctive words like "refund"
- Has no notion of how informative a term is across the corpus

**Problem 2: Euclidean Distance**
- Affected by document LENGTH
- Longer documents have larger vectors → larger distances
- A short, unrelated document can rank as "similar" simply because its count vector is small
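
The length effect can be reproduced with a toy example. This is a sketch with made-up documents (not the drill's tickets): one document repeats the query's wording five times, another is short and unrelated.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

query = "refund my order"
docs = [
    "refund my order",        # same wording as the query
    "refund my order " * 5,   # same wording, repeated five times
    "thanks",                 # unrelated, but very short
]

vec = CountVectorizer()
doc_vectors = vec.fit_transform(docs)
query_vector = vec.transform([query])

dist = euclidean_distances(query_vector, doc_vectors)[0]
sim = cosine_similarity(query_vector, doc_vectors)[0]

# Print both scores side by side for each document
for text, d, s in zip(docs, dist, sim):
    print(f"euclidean={d:.2f}  cosine={s:.2f}  {text[:30]!r}")
```

On raw counts, Euclidean distance places the unrelated "thanks" closer to the query than the repeated refund text, while cosine similarity scores the repeated text the same as the exact match.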

**Why TF-IDF + Cosine works:**
- TF-IDF downweights common words, highlights distinctive terms
- Cosine measures ANGLE between vectors, not magnitude
- Two documents about "refunds" point in a similar direction regardless of length
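
One way to see that the angle is what matters: `TfidfVectorizer` scales every row to unit length by default (`norm='l2'`), and for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine`. The sketch below checks that identity on a three-ticket subset; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

docs = [
    "I want a refund for my order",
    "How do I return this product and get my money back",
    "Package never arrived, very frustrated",
]

vec = TfidfVectorizer()  # rows are L2-normalised by default
doc_vectors = vec.fit_transform(docs)
query_vector = vec.transform(["I need my money back"])

sim = cosine_similarity(query_vector, doc_vectors)[0]
dist = euclidean_distances(query_vector, doc_vectors)[0]

# For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
identity_holds = np.allclose(dist ** 2, 2 - 2 * sim)
same_ranking = list(np.argsort(dist)) == list(np.argsort(-sim))
print("identity holds:", identity_holds)
print("same ranking:  ", same_ranking)
```

So the length bias in the buggy code comes from applying Euclidean distance to unnormalised counts. Once vectors are scaled to unit length, both measures rank documents the same way, and cosine similarity applies that normalisation regardless of how the vectors were built.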

## The Fix

In [None]:
# ===== FIXED CODE =====

# Fix 1: Use TfidfVectorizer
tfidf = TfidfVectorizer(
    max_features=500,
    stop_words='english',  # Remove common words
    ngram_range=(1, 2)     # Add bigrams (built after stop-word removal)
)
ticket_vectors_fixed = tfidf.fit_transform(tickets['ticket_text'])

# Fix 2: Use cosine_similarity
query = "I need my money back"
query_vector_fixed = tfidf.transform([query])

similarities = cosine_similarity(query_vector_fixed, ticket_vectors_fixed)[0]
most_similar_idx = np.argmax(similarities)  # Highest similarity = best match

print("Fixed search for: 'I need my money back'")
print(f"\nBest match (Cosine): {tickets.iloc[most_similar_idx]['ticket_text']}")
print("\nTop 3 matches:")
for idx in np.argsort(similarities)[::-1][:3]:
    print(f"  {similarities[idx]:.3f}: {tickets.iloc[idx]['ticket_text']}")

In [None]:
# Compare results side by side
print("\n" + "="*60)
print("COMPARISON: Buggy vs Fixed")
print("="*60)

queries = [
    "I need my money back",
    "package not delivered",
    "cancel my account"
]

for q in queries:
    # Buggy
    q_buggy = vectorizer_buggy.transform([q])
    dist = euclidean_distances(q_buggy, ticket_vectors_buggy)[0]
    buggy_match = tickets.iloc[np.argmin(dist)]['ticket_text']
    
    # Fixed
    q_fixed = tfidf.transform([q])
    sim = cosine_similarity(q_fixed, ticket_vectors_fixed)[0]
    fixed_match = tickets.iloc[np.argmax(sim)]['ticket_text']
    
    print(f"\nQuery: '{q}'")
    print(f"  Buggy: {buggy_match[:50]}...")
    print(f"  Fixed: {fixed_match[:50]}...")

In [None]:
# Show what TF-IDF learned
print("\nTop terms by TF-IDF weight (for 'refund' ticket):")
refund_vec = tfidf.transform(["I want a refund for my order"])
feature_names = tfidf.get_feature_names_out()
weights = refund_vec.toarray()[0]
top_indices = np.argsort(weights)[::-1][:5]
for idx in top_indices:
    if weights[idx] > 0:
        print(f"  {feature_names[idx]}: {weights[idx]:.3f}")

In [None]:
# Self-check
# Query about refund should match refund-related tickets
q_test = tfidf.transform(["refund money back"])
sim_test = cosine_similarity(q_test, ticket_vectors_fixed)[0]
best_match = tickets.iloc[np.argmax(sim_test)]['ticket_text'].lower()

assert 'refund' in best_match or 'money back' in best_match, "Should match refund ticket"
print("\nPASS: Similarity search returns relevant results!")

## Quick Reference

| Component | Wrong | Right | Why |
|-----------|-------|-------|-----|
| Vectorizer | CountVectorizer | TfidfVectorizer | TF-IDF weights by importance |
| Distance | Euclidean | Cosine Similarity | Cosine ignores document length |
| Preprocessing | None | stop_words='english' | Remove noise words |
| N-grams | (1,1) only | (1,2) | Capture two-word phrases |

## Completed Postmortem

### What happened:
- Colleague's similarity search returned irrelevant tickets
- "I need my money back" was matching unrelated tickets instead of refund requests

### Root cause:
- CountVectorizer doesn't weight terms by importance ("the" counts as much as "refund")
- Euclidean distance on raw counts is driven by document length rather than content

### How to prevent:
- Default to TF-IDF + Cosine for text similarity
- Test with known similar pairs to validate search quality
- Use `stop_words` and `ngram_range` for better matching
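
The "known similar pairs" check could look something like the sketch below. The `known_pairs` list is a hypothetical hand-labelled set made up for illustration; a real check would use pairs drawn from actual tickets.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tickets_text = [
    "I want a refund for my order",
    "Package never arrived, very frustrated",
    "How to cancel my subscription",
]

# Hypothetical labels: (query, index of the ticket it should retrieve)
known_pairs = [
    ("refund for my order", 0),
    ("my package never arrived", 1),
    ("cancel subscription", 2),
]

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
ticket_vectors = vectorizer.fit_transform(tickets_text)

# Count how often the top-ranked ticket is the expected one
hits = 0
for query, expected_idx in known_pairs:
    scores = cosine_similarity(vectorizer.transform([query]), ticket_vectors)[0]
    hits += int(np.argmax(scores) == expected_idx)

accuracy = hits / len(known_pairs)
print(f"top-1 accuracy on known pairs: {accuracy:.2f}")
```

Running a check like this whenever the vectorizer settings change gives an early signal if search quality regresses.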