# Debug Drill 08: Bad Similarity Search

**Symptom:** Your colleague built a ticket similarity search. When support searches for "refund request", the top result is about "shipping delay". The search seems broken.

**Your task:** Find the bug, fix the search, and write a postmortem.

**Time:** 15 minutes

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load tickets
tickets = pd.read_csv('https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/data/streamcart_tickets.csv')
print(f"Loaded {len(tickets)} tickets")
print(tickets['ticket_text'].head())

In [None]:
# ===== COLLEAGUE'S CODE (CONTAINS BUGS) =====

# Bug 1: Using raw counts instead of TF-IDF
vectorizer = CountVectorizer(max_features=100)  # Should use TfidfVectorizer
ticket_vectors = vectorizer.fit_transform(tickets['ticket_text'])

def search_tickets_buggy(query, top_k=5):
    query_vector = vectorizer.transform([query])
    
    # Bug 2: Using Euclidean distance instead of cosine similarity
    distances = euclidean_distances(query_vector, ticket_vectors)[0]
    
    # Bug 3: Taking largest distances (should be smallest, or use similarity)
    top_indices = np.argsort(distances)[-top_k:][::-1]  # WRONG: gets largest
    
    return tickets.iloc[top_indices][['ticket_text', 'category']]

# Test the buggy search
print("Search: 'refund request'")
print(search_tickets_buggy("refund request"))

## Your Investigation

**Q1:** Identify at least 2 of the 3 bugs in the code above.

In [None]:
# TODO: List the bugs you found
# Bug 1: 
# Bug 2: 
# Bug 3: 

**Q2:** Why is TF-IDF better than raw counts for similarity search?

In [None]:
# TODO: Your explanation
# TF-IDF is better because...

## Fix the Bug

**Q3:** Build a correct similarity search.

In [None]:
# TODO: Fix all the bugs

# Fix 1: Use TF-IDF
tfidf = TfidfVectorizer(
    max_features=500,
    stop_words='english',
    ngram_range=(1, 2)
)
ticket_vectors_fixed = tfidf.fit_transform(tickets['ticket_text'])

def search_tickets_fixed(query, top_k=5):
    query_vector = tfidf.transform([query])
    
    # Fix 2: Use cosine similarity
    similarities = cosine_similarity(query_vector, ticket_vectors_fixed)[0]
    
    # Fix 3: Get highest similarities
    top_indices = np.argsort(similarities)[-top_k:][::-1]
    
    results = tickets.iloc[top_indices][['ticket_text', 'category']].copy()
    results['similarity'] = similarities[top_indices]
    return results

# Test the fixed search
print("Search: 'refund request'")
print(search_tickets_fixed("refund request"))
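
To see why Fix 2 matters, here is a minimal standalone sketch in plain Python rather than scikit-learn. The two toy tickets and the helper names (`count_vector`, `euclidean`, `cosine`) are made up for illustration and are not from the StreamCart data. With raw counts, Euclidean distance penalizes a long, relevant ticket for its length, while cosine similarity compares only direction.

```python
import math
from collections import Counter

def count_vector(text, vocab):
    """Raw term counts for `text` over a fixed vocabulary (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy tickets (illustrative only)
long_relevant = "refund request refund request refund request refund request"
short_irrelevant = "shipping delay"
query = "refund request"

vocab = sorted(set((long_relevant + " " + short_irrelevant).split()))
q = count_vector(query, vocab)
doc_vectors = {
    "long_relevant": count_vector(long_relevant, vocab),
    "short_irrelevant": count_vector(short_irrelevant, vocab),
}

for name, vec in doc_vectors.items():
    print(f"{name:17s} euclidean={euclidean(q, vec):.3f}  cosine={cosine(q, vec):.3f}")
```

In this toy case the irrelevant ticket is closer in Euclidean distance simply because it is short, yet only the relevant ticket has nonzero cosine similarity with the query.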

In [None]:
# Test more queries
print("\n" + "="*50)
print("Search: 'shipping delay'")
print(search_tickets_fixed("shipping delay"))

print("\n" + "="*50)
print("Search: 'cancel subscription'")
print(search_tickets_fixed("cancel subscription"))
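
For Q2, a small sketch of the inverse document frequency weight may help. It assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, which is the documented default for scikit-learn's `TfidfVectorizer` (`smooth_idf=True`). The four toy tickets and the `smoothed_idf` helper are made up for illustration.

```python
import math

# Toy tickets (illustrative only, not from the StreamCart data)
toy_tickets = [
    "please help with my refund",
    "please help shipping is late",
    "please help cancel my subscription",
    "please help app keeps crashing",
]

def smoothed_idf(term, docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df counts docs containing `term`."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc.split())
    return math.log((1 + n) / (1 + df)) + 1

for term in ["please", "help", "my", "refund"]:
    print(f"{term:8s} idf={smoothed_idf(term, toy_tickets):.3f}")
```

Words that appear in every ticket get the minimum weight, while a rare, topical word like "refund" is weighted more heavily. Raw counts treat all of these words equally, so filler words can dominate the match.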

## Self-Check

In [None]:
# Verify fix
refund_results = search_tickets_fixed("refund request", top_k=3)

# At least one result should be about billing/refunds
has_billing = any('billing' in str(cat).lower() or 'refund' in str(text).lower()
                  for cat, text in zip(refund_results['category'], refund_results['ticket_text']))

assert has_billing, "At least one top result should be about billing/refunds"
assert refund_results['similarity'].iloc[0] > 0.1, "Top result should have decent similarity"
print("PASS: Search returns relevant results!")

## Postmortem

Write 3 bullets:
1. **Root cause:** 
2. **How we detected it:** 
3. **Prevention for next time:** 
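
One possible prevention, sketched under assumptions: make the sort direction explicit and check it with a tiny test. The `top_k_indices` helper and the toy scores below are hypothetical and not part of the notebook. The same idea could be applied to `search_tickets_fixed`.

```python
def top_k_indices(scores, k, higher_is_better=True):
    """Return indices of the k best scores, best first (hypothetical helper)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=higher_is_better)
    return order[:k]

# Toy scores for 4 documents against one query (illustrative values)
similarities = [0.10, 0.85, 0.00, 0.40]  # higher = more similar
distances = [2.1, 0.3, 3.0, 1.2]         # lower = more similar

best_by_similarity = top_k_indices(similarities, k=2, higher_is_better=True)
best_by_distance = top_k_indices(distances, k=2, higher_is_better=False)

# Both metrics should agree on which documents are the closest matches.
assert best_by_similarity == best_by_distance
print(best_by_similarity)
```

Naming the direction in the function signature makes it harder to sort a distance as if it were a similarity, which is the mistake behind Bug 3.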