# 🔍 FAISS Vector Search Testing

This notebook demonstrates how to:
1. Load a pre-built FAISS index
2. Perform semantic search queries
3. Understand similarity scores
4. Explore search results by category
5. Compare different query types

**Prerequisites:** Run `04_faiss_index_creation.ipynb` first to generate the FAISS index.

## 1️⃣ Install & Import Libraries

In [None]:
# Install required packages
!pip install faiss-cpu sentence-transformers pandas numpy matplotlib seaborn -q

In [None]:
import faiss
import pickle
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
import matplotlib.pyplot as plt
import seaborn as sns

print("✅ All libraries imported successfully!")

## 2️⃣ Load FAISS Index and Metadata

In [None]:
# Load FAISS index
print("🔄 Loading FAISS index...")
index = faiss.read_index('sentences.faiss')
print(f"✅ FAISS index loaded! Contains {index.ntotal} vectors")

# Load metadata
print("\n🔄 Loading metadata...")
with open('sentences_metadata.pkl', 'rb') as f:
    metadata = pickle.load(f)

sentences = metadata['sentences']
categories = metadata['categories']
model_name = metadata['model_name']

print("✅ Metadata loaded!")
print(f"📐 Model: {model_name}")
print(f"📊 Sentences: {len(sentences)}")
print(f"📂 Categories: {sorted(set(categories))}")
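
Search results are mapped back to text by position, so the metadata lists need to line up with the index. The following is a minimal sketch of a sanity check, assuming the metadata holds one sentence and one category per indexed vector. `validate_metadata` is a hypothetical helper, not part of FAISS, and the toy lists stand in for `index.ntotal`, `sentences`, and `categories`.

```python
def validate_metadata(n_vectors, sentences, categories):
    """Raise ValueError if the metadata lists do not line up with the index size."""
    if len(sentences) != n_vectors:
        raise ValueError(
            f"index holds {n_vectors} vectors but metadata has {len(sentences)} sentences"
        )
    if len(categories) != len(sentences):
        raise ValueError(
            f"{len(sentences)} sentences but {len(categories)} categories"
        )
    return True


# Toy data for illustration only; in the notebook you would pass
# index.ntotal, sentences, and categories instead.
toy_sentences = ["Cats purr.", "Stocks rose.", "Rust is fast."]
toy_categories = ["animals", "finance", "technology"]
print(validate_metadata(3, toy_sentences, toy_categories))
```

Running a check like this right after loading catches a stale or mismatched metadata file before it shows up as wrong sentences in the search output.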

In [None]:
# Load the same embedding model used for indexing
print("🔄 Loading embedding model...")
model = SentenceTransformer(model_name)
print(f"✅ Model '{model_name}' loaded!")

## 3️⃣ Create Search Function

In [None]:
def search(query: str, k: int = 5, show_results: bool = True):
    """
    Search for similar sentences in the FAISS index.
    
    Args:
        query: The search query text
        k: Number of results to return
        show_results: Whether to print results
    
    Returns:
        List of (sentence, category, score) tuples
    """
    # Generate embedding for query
    query_embedding = model.encode([query], convert_to_numpy=True)
    faiss.normalize_L2(query_embedding)
    
    # Search
    distances, indices = index.search(query_embedding, k)
    
    results = []
    
    if show_results:
        print(f"\n🔍 Query: '{query}'")
        print(f"{'='*70}")
    
    for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
        # FAISS pads the result with index -1 when fewer than k neighbours exist
        if idx < 0:
            continue
        sentence = sentences[idx]
        category = categories[idx]
        score = float(dist)
        results.append((sentence, category, score))
        
        if show_results:
            # Color code based on score
            if score >= 0.5:
                emoji = "🟢"  # High relevance
            elif score >= 0.3:
                emoji = "🟡"  # Medium relevance
            else:
                emoji = "🔴"  # Low relevance
            
            print(f"\n{emoji} Rank {i+1} | Score: {score:.4f} | Category: {category}")
            print(f"   {sentence}")
    
    return results
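
The `search` function normalizes the query with `faiss.normalize_L2` before searching. The sketch below shows why that matters, assuming the index was built over L2-normalized vectors with an inner-product metric (the indexing notebook is not shown here, so that is an assumption). Under that assumption the inner product of two unit vectors equals their cosine similarity. The example uses only NumPy, and `l2_normalize` and `cosine_similarity` are illustrative helpers rather than FAISS functions.

```python
import numpy as np


def l2_normalize(vectors):
    """Return a copy of `vectors` with each row scaled to unit L2 norm."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms


def cosine_similarity(a, b):
    """Cosine similarity computed directly from its definition."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Random toy vectors standing in for sentence embeddings
rng = np.random.default_rng(0)
docs = rng.normal(size=(4, 8)).astype("float32")
query = rng.normal(size=(1, 8)).astype("float32")

# Inner product of normalized vectors, as an inner-product index would compute
inner_products = (l2_normalize(docs) @ l2_normalize(query).T).ravel()

# Cosine similarity computed on the raw, unnormalized vectors
cosines = np.array([cosine_similarity(d, query[0]) for d in docs])

print(np.allclose(inner_products, cosines, atol=1e-5))
```

Because the scores are cosine similarities under this assumption, they can in principle be negative: two vectors pointing in opposite directions score -1.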

## 4️⃣ Test Search Queries

Let's test different types of queries to understand how semantic search works.

### 🔬 Test 1: Technology Query

In [None]:
results = search("How do I build AI applications?")

### 🏥 Test 2: Health Query

In [None]:
results = search("How can I improve my wellbeing?")

### 💰 Test 3: Finance Query

In [None]:
results = search("How do I save money for the future?")

### 🌍 Test 4: Travel Query

In [None]:
results = search("What are famous tourist attractions in Asia?")

### 🔬 Test 5: Abstract/Conceptual Query

In [None]:
results = search("How does the universe work?")

## 5️⃣ Interactive Search

Try your own queries!

In [None]:
# 🎯 Enter your own query here!
my_query = "What should I eat for breakfast?"

results = search(my_query, k=10)

## 6️⃣ Category Distribution Analysis

Let's analyze which categories appear most frequently in search results.

In [None]:
def analyze_search(query: str, k: int = 20):
    """
    Analyze the category distribution of search results.
    """
    results = search(query, k=k, show_results=False)
    
    # Count categories
    category_counts = {}
    category_scores = {}
    
    for sentence, category, score in results:
        category_counts[category] = category_counts.get(category, 0) + 1
        if category not in category_scores:
            category_scores[category] = []
        category_scores[category].append(score)
    
    # Calculate average scores
    avg_scores = {cat: np.mean(scores) for cat, scores in category_scores.items()}
    
    # Create visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Category counts
    categories_sorted = sorted(category_counts.keys(), key=lambda x: category_counts[x], reverse=True)
    counts = [category_counts[c] for c in categories_sorted]
    
    colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(categories_sorted)))
    axes[0].barh(categories_sorted, counts, color=colors)
    axes[0].set_xlabel('Count')
    axes[0].set_title(f'Category Distribution (Top {k} results)')
    axes[0].invert_yaxis()
    
    # Plot 2: Average scores
    scores = [avg_scores[c] for c in categories_sorted]
    axes[1].barh(categories_sorted, scores, color=colors)
    axes[1].set_xlabel('Average Similarity Score')
    axes[1].set_title('Average Score by Category')
    axes[1].invert_yaxis()
    axes[1].set_xlim(0, 1)
    
    plt.suptitle(f"Query: '{query}'", fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    return category_counts, avg_scores

In [None]:
# Analyze different queries
counts, scores = analyze_search("How do computers learn from data?")

In [None]:
counts, scores = analyze_search("I want to visit beautiful places")

In [None]:
counts, scores = analyze_search("Tell me about wild animals in nature")
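
If you only need the numbers and not the plots, the same per-category tally can be sketched without matplotlib. The example below assumes results in the `(sentence, category, score)` tuple format returned by `search`. `summarize_by_category` is a hypothetical helper, and the toy results are made up for illustration.

```python
from collections import defaultdict


def summarize_by_category(results):
    """Return {category: (count, mean_score)} for (sentence, category, score) tuples."""
    grouped = defaultdict(list)
    for _sentence, category, score in results:
        grouped[category].append(score)
    return {cat: (len(scores), sum(scores) / len(scores)) for cat, scores in grouped.items()}


# Made-up results for illustration; real ones come from search(..., show_results=False)
toy_results = [
    ("Neural networks learn from examples.", "technology", 0.62),
    ("GPUs speed up model training.", "technology", 0.48),
    ("Regular exercise improves mood.", "health", 0.21),
]

summary = summarize_by_category(toy_results)
for category, (count, mean_score) in summary.items():
    print(f"{category}: {count} result(s), mean score {mean_score:.2f}")
```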

## 7️⃣ Score Distribution Visualization

In [None]:
def visualize_score_distribution(query: str):
    """
    Visualize the similarity score distribution across all documents.
    """
    # Get all results
    results = search(query, k=len(sentences), show_results=False)
    
    # Create DataFrame
    df = pd.DataFrame(results, columns=['sentence', 'category', 'score'])
    
    # Create visualization
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Plot 1: Score histogram
    axes[0].hist(df['score'], bins=30, edgecolor='white', color='steelblue', alpha=0.7)
    axes[0].axvline(x=df['score'].mean(), color='red', linestyle='--', label=f'Mean: {df["score"].mean():.3f}')
    axes[0].axvline(x=df['score'].median(), color='orange', linestyle='--', label=f'Median: {df["score"].median():.3f}')
    axes[0].set_xlabel('Similarity Score')
    axes[0].set_ylabel('Count')
    axes[0].set_title('Score Distribution')
    axes[0].legend()
    
    # Plot 2: Box plot by category
    category_order = df.groupby('category')['score'].mean().sort_values(ascending=False).index
    
    palette = sns.color_palette("viridis", len(category_order))
    sns.boxplot(data=df, x='score', y='category', order=category_order, palette=palette, ax=axes[1])
    axes[1].set_xlabel('Similarity Score')
    axes[1].set_ylabel('Category')
    axes[1].set_title('Score Distribution by Category')
    
    plt.suptitle(f"Query: '{query}'", fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Print top categories
    print("\n📊 Category Rankings (by average score):")
    for i, cat in enumerate(category_order[:5]):
        avg = df[df['category'] == cat]['score'].mean()
        print(f"  {i+1}. {cat}: {avg:.4f}")

In [None]:
visualize_score_distribution("What is the best exercise for staying fit?")

In [None]:
visualize_score_distribution("Coffee and food recipes")

## 8️⃣ Compare Similar Queries

See how different phrasings of the same question affect results.

In [None]:
def compare_queries(queries: list, k: int = 5):
    """
    Compare search results for multiple queries.
    """
    all_results = {}
    
    for query in queries:
        results = search(query, k=k, show_results=False)
        all_results[query] = results
    
    # Display comparison
    print("\n" + "="*80)
    print("📊 QUERY COMPARISON")
    print("="*80)
    
    for query in queries:
        print(f"\n🔍 Query: '{query}'")
        print("-"*60)
        for i, (sentence, category, score) in enumerate(all_results[query]):
            print(f"  {i+1}. [{category:10}] (Score: {score:.3f}) {sentence[:60]}...")
    
    # Find common results
    sets = [set(r[0] for r in results) for results in all_results.values()]
    common = sets[0].intersection(*sets[1:])
    
    if common:
        print(f"\n✨ Common results across all queries ({len(common)}):")
        for sentence in common:
            print(f"  • {sentence[:70]}...")

In [None]:
# Compare similar queries about AI/ML
compare_queries([
    "How does AI work?",
    "What is machine learning?",
    "Tell me about artificial intelligence and neural networks"
])

In [None]:
# Compare similar queries about money
compare_queries([
    "How to become rich?",
    "Investment strategies for beginners",
    "Financial planning and savings"
])
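
Beyond listing the sentences every query shares, you can put a number on how much two phrasings agree. A minimal sketch, assuming each query's results have been reduced to a list of retrieved sentences: the Jaccard index divides the size of the intersection by the size of the union. `jaccard` and `pairwise_overlap` are hypothetical helpers, and the short strings stand in for retrieved sentences.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result sets are treated as identical
    return len(a & b) / len(a | b)


def pairwise_overlap(results_by_query):
    """Map each pair of queries to the Jaccard index of their retrieved sentences."""
    return {
        (q1, q2): jaccard(results_by_query[q1], results_by_query[q2])
        for q1, q2 in combinations(results_by_query, 2)
    }


# Placeholder sentence IDs for illustration only
toy = {
    "How does AI work?": ["s1", "s2", "s3"],
    "What is machine learning?": ["s2", "s3", "s4"],
}
for (q1, q2), overlap in pairwise_overlap(toy).items():
    print(f"{q1!r} vs {q2!r}: {overlap:.2f}")
```

With the toy data the two queries share two of four distinct sentences, so the overlap is 0.50. A value near 1 means the phrasings retrieve almost the same set.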

## 9️⃣ Threshold-Based Filtering

In production, you often want to filter results by a minimum similarity score.

In [None]:
def search_with_threshold(query: str, threshold: float = 0.3, max_results: int = 10):
    """
    Search with a minimum similarity threshold.
    """
    results = search(query, k=max_results, show_results=False)
    filtered = [(s, c, score) for s, c, score in results if score >= threshold]
    
    print(f"\n🔍 Query: '{query}'")
    print(f"📏 Threshold: {threshold}")
    print(f"📊 Results: {len(filtered)} / {len(results)} passed threshold\n")
    
    if filtered:
        for i, (sentence, category, score) in enumerate(filtered):
            print(f"  {i+1}. [Score: {score:.4f}] [{category}]")
            print(f"     {sentence}\n")
    else:
        print("  ⚠️ No results above threshold!")
    
    return filtered

In [None]:
# Test with different thresholds
results = search_with_threshold("Python programming language", threshold=0.3)

In [None]:
# Higher threshold = fewer but more relevant results
results = search_with_threshold("Python programming language", threshold=0.5)

In [None]:
# Test with an unrelated query
results = search_with_threshold("asdfghjkl random text", threshold=0.3)

## 🎓 Summary

In this notebook, we explored:

1. ✅ **Loading FAISS Index** - Fast loading of a pre-built vector index
2. ✅ **Semantic Search** - Finding similar sentences based on meaning, not keywords
3. ✅ **Score Interpretation** - Reading cosine similarity scores (higher means more similar; the theoretical range is -1 to 1)
4. ✅ **Category Analysis** - Visualizing which categories match queries
5. ✅ **Query Comparison** - How different phrasings affect results
6. ✅ **Threshold Filtering** - Dropping results below a minimum similarity score

## 💡 Key Insights

| Concept | Description |
|---------|-------------|
| **Semantic Search** | Finds results by meaning, not exact keywords |
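| **Similarity Score** | Higher = more similar. Cosine similarity ranges from -1 to 1, assuming the index compares L2-normalized vectors by inner product |
| **Thresholds** | 0.3-0.5 is a reasonable starting range for balancing precision and recall; tune it for your model and data |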
| **Query Phrasing** | Similar questions can get slightly different results |
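
The threshold trade-off in the table can be made concrete with a small labelled sample. The sketch below assumes you have hand-labelled a few results as relevant or not. The scores and labels are invented for illustration, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(scored, threshold):
    """Compute (precision, recall) for (score, is_relevant) pairs kept at `threshold`."""
    kept = [relevant for score, relevant in scored if score >= threshold]
    total_relevant = sum(relevant for _score, relevant in scored)
    true_positives = sum(kept)
    # Convention used here: with nothing kept (or nothing relevant), report 1.0
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_relevant if total_relevant else 1.0
    return precision, recall


# Invented (score, is_relevant) pairs, not measurements from the notebook's index
toy_scored = [
    (0.71, True),
    (0.55, True),
    (0.42, False),
    (0.35, True),
    (0.28, False),
    (0.12, False),
]

for threshold in (0.3, 0.5):
    precision, recall = precision_recall(toy_scored, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy sample, raising the threshold from 0.3 to 0.5 lifts precision from 0.75 to 1.00 and drops recall from 1.00 to 0.67. Repeating the exercise on your own labelled queries is one way to choose a threshold.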

## 🚀 Next Steps for RAG Systems

To build a complete RAG system, you would:

1. **Retrieve** - Use FAISS to find top-K relevant documents
2. **Augment** - Add retrieved documents to your LLM prompt
3. **Generate** - Use an LLM (GPT, Gemini, etc.) to generate a response

```python
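# Example RAG pseudocode (`llm` below stands in for whichever LLM client you use)
user_query = "How does machine learning work?"

# Step 1: Retrieve (show_results=False keeps the retrieval step from printing)
results = search(user_query, k=3, show_results=False)
context = "\n".join(sentence for sentence, _category, _score in results)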

# Step 2: Augment
prompt = f"""
Based on the following context, answer the question.

Context:
{context}

Question: {user_query}
"""

# Step 3: Generate (placeholder call; substitute your LLM client's own method)
response = llm.generate(prompt)
```
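
The augment step can also be wrapped in a small function that reuses the threshold idea from earlier. This is a sketch under stated assumptions, not a definitive prompt format. `build_prompt` and its `min_score` parameter are hypothetical, the results follow the `(sentence, category, score)` format returned by `search`, and the toy results are invented.

```python
def build_prompt(question, results, min_score=0.3):
    """Assemble a prompt from (sentence, category, score) tuples.

    Passages scoring below `min_score` are left out; the rest are numbered
    so that an answer can refer back to them.
    """
    passages = [sentence for sentence, _category, score in results if score >= min_score]
    if passages:
        context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    else:
        context = "(no relevant context found)"
    return (
        "Based on the following context, answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
    )


# Invented results for illustration; real ones come from search(..., show_results=False)
toy_results = [
    ("Machine learning models learn patterns from data.", "technology", 0.64),
    ("Neural networks are trained with gradient descent.", "technology", 0.47),
    ("Paris is known for the Eiffel Tower.", "travel", 0.08),
]
print(build_prompt("How does machine learning work?", toy_results))
```

Filtering before prompting keeps low-scoring passages, such as the travel sentence in the toy data, out of the context the LLM sees.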