# Semantic Code Search Pipeline
This notebook implements an offline semantic code search system for the Flask repository.

## Features:
- Parse Python files using tree-sitter
- Extract functions and classes with metadata
- Generate embeddings using sentence-transformers (Jina v2 code embeddings)
- Store in ChromaDB with **cosine similarity** for semantic retrieval
- Search with natural language queries

## Importing

In [3]:
import os
import sys
from typing import List, Dict, Tuple
from pathlib import Path

# Import utils functions
sys.path.append(os.getcwd())
from utils import find_repo_root, list_python_files

import chromadb
from sentence_transformers import SentenceTransformer
from tree_sitter_languages import get_parser

## Configuration and Setup

In [4]:
# Configuration
FLASK_REPO_PATH = "../flask"  # Adjust to your Flask repo location
CHROMA_DB_PATH = "../data/chroma_db"
COLLECTION_NAME = "flask_code"
EMBEDDING_MODEL = "jinaai/jina-embeddings-v2-base-code"

# Initialize parser (get_parser returns a parser already configured for Python)
parser = get_parser("python")

# Load embedding model
print("Loading embedding model...")
try:
    embedding_model = SentenceTransformer(EMBEDDING_MODEL, trust_remote_code=True)
    print("Model loaded successfully!")
except Exception as e:
    print(f"Error loading embedding model: {e}")



Loading embedding model...


A new version of the following files was downloaded from https://huggingface.co/jinaai/jina-bert-v2-qk-post-norm:
- modeling_bert.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


Model loaded successfully!


In [1]:
import torch; print(torch.__version__)
import transformers; print(transformers.__version__)
import sentence_transformers; print(sentence_transformers.__version__)

2.2.0+cpu


  from .autonotebook import tqdm as notebook_tqdm


4.57.3
5.2.0


In [5]:
# Test the embedding model with two very different texts
text1 = "This is a function to handle HTTP requests in Flask."
text2 = "The quick brown fox jumps over the lazy dog."

embeddings = embedding_model.encode([text1, text2])

# print("Embedding for text1:", embeddings[0])
# print("Embedding for text2:", embeddings[1])

# Print summary statistics for comparison
import numpy as np
print("L2 norm text1:", np.linalg.norm(embeddings[0]))
print("L2 norm text2:", np.linalg.norm(embeddings[1]))
print("Cosine similarity:", np.dot(embeddings[0], embeddings[1]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])))

L2 norm text1: 14.371183
L2 norm text2: 13.882114
Cosine similarity: 0.09143908


In [17]:
from sentence_transformers import SentenceTransformer
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedding_model.encode(["This is a test.", "Completely different text."])
print("Cosine similarity:", np.dot(embeddings[0], embeddings[1]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])))

Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`


Cosine similarity: 0.25478408


In [20]:
# from sentence_transformers import SentenceTransformer
# embedding_model = SentenceTransformer("Qodo/Qodo-Embed-1-1.5B", trust_remote_code=True)
# embeddings = embedding_model.encode(["This is a test.", "Completely different text."])
# import numpy as np
# print("Cosine similarity:", np.dot(embeddings[0], embeddings[1]) / (np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])))

## Code Parsing and Chunking

In [6]:
def extract_code_chunks(file_path: str) -> List[Dict]:
    """Extract functions and classes from a Python file using tree-sitter."""
    chunks = []
    
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            code = f.read()
        
        tree = parser.parse(bytes(code, "utf8"))
        root_node = tree.root_node
        
        def traverse(node, depth=0):
            # Extract function definitions
            if node.type == 'function_definition':
                name_node = node.child_by_field_name('name')
                if name_node:
                    func_name = code[name_node.start_byte:name_node.end_byte]
                    func_code = code[node.start_byte:node.end_byte]
                    
                    # Extract docstring if present
                    docstring = ""
                    body = node.child_by_field_name('body')
                    if body and body.child_count > 0:
                        first_child = body.children[0]
                        if first_child.type == 'expression_statement':
                            expr = first_child.children[0]
                            if expr.type == 'string':
                                docstring = code[expr.start_byte:expr.end_byte].strip('"""').strip("'''").strip()
                    
                    chunks.append({
                        'type': 'function',
                        'name': func_name,
                        'code': func_code,
                        'docstring': docstring,
                        'file_path': file_path,
                        'start_line': node.start_point[0] + 1,
                        'end_line': node.end_point[0] + 1,
                    })
            
            # Extract class definitions
            elif node.type == 'class_definition':
                name_node = node.child_by_field_name('name')
                if name_node:
                    class_name = code[name_node.start_byte:name_node.end_byte]
                    class_code = code[node.start_byte:node.end_byte]
                    
                    # Extract class docstring
                    docstring = ""
                    body = node.child_by_field_name('body')
                    if body and body.child_count > 0:
                        first_child = body.children[0]
                        if first_child.type == 'expression_statement':
                            expr = first_child.children[0]
                            if expr.type == 'string':
                                docstring = code[expr.start_byte:expr.end_byte].strip('"""').strip("'''").strip()
                    
                    # Limit class code to avoid huge chunks
                    if len(class_code) > 2000:
                        class_code = class_code[:2000] + "\n    # ... (truncated)"
                    
                    chunks.append({
                        'type': 'class',
                        'name': class_name,
                        'code': class_code,
                        'docstring': docstring,
                        'file_path': file_path,
                        'start_line': node.start_point[0] + 1,
                        'end_line': node.end_point[0] + 1,
                    })
            
            # Recursively traverse children
            for child in node.children:
                traverse(child, depth + 1)
        
        traverse(root_node)
    except Exception as e:
        print(f"Error parsing {file_path}: {e}")
    
    return chunks

In [7]:
def create_searchable_text(chunk: Dict) -> str:
    """Create searchable text with prioritized metadata and limited code for better embeddings."""
    parts = []
    
    # 1. Prioritize docstring (most semantic information)
    if chunk['docstring']:
        parts.append(f"Documentation: {chunk['docstring']}")
    
    # 2. Add type and name (critical identifiers)
    parts.append(f"{chunk['type']}: {chunk['name']}")
    
    # 3. Add file path for context
    file_name = chunk['file_path'].split('/')[-1] if '/' in chunk['file_path'] else chunk['file_path'].split('\\')[-1]
    parts.append(f"File: {file_name}")
    
    # 4. Add limited code (first 400 chars to avoid dilution)
    code_snippet = chunk['code'][:400]
    parts.append(f"Code:\n{code_snippet}")
    
    return "\n\n".join(parts)

## Indexing Pipeline

In [8]:
def index_repository(repo_path: str, force_reindex: bool = False):
    """Index all Python files in the repository."""
    
    # Initialize ChromaDB
    os.makedirs(CHROMA_DB_PATH, exist_ok=True)
    client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
    
    # Get or create collection with COSINE similarity (critical for semantic search)
    try:
        if force_reindex:
            client.delete_collection(name=COLLECTION_NAME)
            print("Deleted existing collection for reindexing.")
    except:
        pass
    
    collection = client.get_or_create_collection(
        name=COLLECTION_NAME,
        metadata={
            "description": "Flask repository code chunks",
            "hnsw:space": "cosine"  # Use cosine similarity instead of L2
        }
    )
    
    # Check if already indexed
    if collection.count() > 0 and not force_reindex:
        print(f"Repository already indexed with {collection.count()} chunks.")
        print("Note: If indexed before the distance metric fix, run with force_reindex=True")
        return collection
    
    # Get all Python files
    print(f"Finding Python files in {repo_path}...")
    py_files = list_python_files(repo_path)
    print(f"Found {len(py_files)} Python files.")
    
    # Extract and index chunks
    all_chunks = []
    for i, file_path in enumerate(py_files):
        if i % 10 == 0:
            print(f"Processing file {i+1}/{len(py_files)}...")
        
        chunks = extract_code_chunks(file_path)
        all_chunks.extend(chunks)
    
    print(f"Extracted {len(all_chunks)} code chunks.")
    
    if not all_chunks:
        print("No code chunks found!")
        return collection
    
    # Generate embeddings in batches
    print("Generating embeddings...")
    batch_size = 32
    indexed_count = 0
    
    for i in range(0, len(all_chunks), batch_size):
        batch = all_chunks[i:i+batch_size]
        texts = [create_searchable_text(chunk) for chunk in batch]
        
        # Generate embeddings
        embeddings = embedding_model.encode(texts, show_progress_bar=False)
        
        # Prepare unique IDs (add index to prevent collisions)
        ids = [f"{indexed_count + j}:{chunk['file_path']}:{chunk['name']}:{chunk['start_line']}" 
               for j, chunk in enumerate(batch)]
        
        metadatas = [{
            'type': chunk['type'],
            'name': chunk['name'],
            'file_path': chunk['file_path'],
            'start_line': chunk['start_line'],
            'end_line': chunk['end_line'],
            'docstring': chunk['docstring'][:500] if chunk['docstring'] else "",
        } for chunk in batch]
        
        # Store the full code (not truncated) for better context
        documents = [chunk['code'] for chunk in batch]
        
        # Add to collection
        collection.add(
            ids=ids,
            embeddings=embeddings.tolist(),
            metadatas=metadatas,
            documents=documents
        )
        
        indexed_count += len(batch)
        
        if indexed_count % 100 == 0 or indexed_count == len(all_chunks):
            print(f"Indexed {indexed_count}/{len(all_chunks)} chunks...")
    
    print(f"✓ Indexing complete! Total chunks: {collection.count()}")
    return collection

## Search Implementation

In [9]:
def search_code(query: str, top_k: int = 5, apply_filter: bool = False) -> List[Dict]:
    """Search for code chunks matching the query with optional keyword filtering."""
    
    # Connect to ChromaDB
    client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
    
    try:
        collection = client.get_collection(name=COLLECTION_NAME)
    except:
        print("Collection not found. Please index the repository first.")
        return []
    
    # Generate query embedding
    query_embedding = embedding_model.encode([query])[0]
    
    # Retrieve more results if filtering (to have candidates)
    retrieve_count = top_k * 3 if apply_filter else top_k
    
    # Search (using cosine distance if collection was created correctly)
    results = collection.query(
        query_embeddings=[query_embedding.tolist()],
        n_results=retrieve_count,
        include=['metadatas', 'documents', 'distances']
    )
    
    # Format results
    formatted_results = []
    if results['ids'] and results['ids'][0]:
        for i in range(len(results['ids'][0])):
            # For cosine distance: similarity = 1 - distance (distance is already in [0, 2])
            # Lower distance = higher similarity
            distance = results['distances'][0][i]
            similarity = 1 - distance  # Cosine similarity from cosine distance
            
            formatted_results.append({
                'id': results['ids'][0][i],
                'type': results['metadatas'][0][i]['type'],
                'name': results['metadatas'][0][i]['name'],
                'file_path': results['metadatas'][0][i]['file_path'],
                'start_line': results['metadatas'][0][i]['start_line'],
                'end_line': results['metadatas'][0][i]['end_line'],
                'docstring': results['metadatas'][0][i]['docstring'],
                'code': results['documents'][0][i],
                'distance': distance,
                'similarity': similarity
            })
    
    # Apply keyword filtering if requested
    if apply_filter and formatted_results:
        formatted_results = filter_results_by_keywords(formatted_results, query)
        formatted_results = formatted_results[:top_k]  # Limit to top_k after filtering
    
    return formatted_results


def filter_results_by_keywords(results: List[Dict], query: str) -> List[Dict]:
    """Filter and re-rank results by keyword presence in name, docstring, and code."""
    # Extract keywords from query (simple tokenization)
    keywords = set(query.lower().split())
    # Remove common stop words
    stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'from', 'is', 'are', 'was', 'were', 'how', 'does', 'do'}
    keywords = keywords - stop_words
    
    scored_results = []
    for result in results:
        # Create searchable text from result
        search_text = f"{result['name']} {result.get('docstring', '')} {result['code']}".lower()
        
        # Count keyword matches
        keyword_score = sum(1 for keyword in keywords if keyword in search_text)
        
        # Boost if keyword in name (very important)
        name_score = sum(2 for keyword in keywords if keyword in result['name'].lower())
        
        # Boost if keyword in docstring
        doc_score = sum(1.5 for keyword in keywords if keyword in result.get('docstring', '').lower())
        
        total_score = keyword_score + name_score + doc_score
        
        # Combine with semantic similarity (weighted)
        combined_score = result['similarity'] * 0.6 + (total_score / max(len(keywords), 1)) * 0.4
        
        scored_results.append((combined_score, result))
    
    # Sort by combined score
    scored_results.sort(key=lambda x: x[0], reverse=True)
    
    # Return filtered results (only those with at least one keyword match)
    filtered = [result for score, result in scored_results if score > 0]
    
    return filtered if filtered else [result for _, result in scored_results]

In [10]:
def pretty_print_results(results: List[Dict]):
    """Pretty print search results for CLI."""
    if not results:
        print("No results found.")
        return
    
    print(f"Found {len(results)} results:")
    
    for i, result in enumerate(results, 1):
        print(f"{i}. {result['type'].upper()}: {result['name']}")
        print(f"   File: {result['file_path']}:{result['start_line']}-{result['end_line']}")
        print(f"   Similarity: {result['similarity']:.4f} (distance: {result['distance']:.4f})")
        
        if result['docstring']:
            # Show first 150 chars of docstring
            doc_preview = result['docstring'][:150].replace('\n', ' ')
            print(f"   Doc: {doc_preview}{'...' if len(result['docstring']) > 150 else ''}")
        
        print(f"   Code Preview:")
        code_lines = result['code'].split('\n')[:8]  # Show first 8 lines
        for line in code_lines:
            if line.strip():  # Skip empty lines
                print(f"      {line[:100]}")  # Limit line length
        
        total_lines = len(result['code'].split('\n'))
        if total_lines > 8:
            print(f"      ... ({total_lines - 8} more lines)")
        print()  # Blank line between results

In [11]:
# Index the repository
# Set force_reindex=True if you indexed before the cosine similarity fix
collection = index_repository(FLASK_REPO_PATH, force_reindex=True)

Deleted existing collection for reindexing.
Finding Python files in ../flask...
Found 34 Python files.
Processing file 1/34...
Processing file 11/34...
Processing file 21/34...
Processing file 31/34...
Extracted 471 code chunks.
Generating embeddings...
Indexed 471/471 chunks...
✓ Indexing complete! Total chunks: 471


In [12]:
# Diagnostic: Check collection metadata and sample entries
client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
try:
    collection = client.get_collection(name=COLLECTION_NAME)
    print(f"Collection: {collection.name}")
    print(f"Total chunks: {collection.count()}")
    print(f"Metadata: {collection.metadata}")
    
    # Verify cosine similarity is enabled
    if collection.metadata.get('hnsw:space') == 'cosine':
        print("Cosine similarity is ENABLED - semantic search will work correctly!")
    else:
        print("WARNING: Collection is not using cosine similarity!")
    
    # Get a few samples
    sample = collection.get(limit=3, include=['metadatas', 'documents', 'embeddings'])
    print(f"\nSample entries:")
    for i in range(min(3, len(sample['ids']))):
        print(f"\n{i+1}. ID: {sample['ids'][i]}")
        print(f"   Type: {sample['metadatas'][i]['type']}, Name: {sample['metadatas'][i]['name']}")
        print(f"   File: {sample['metadatas'][i]['file_path']}")
        
        # Check embedding dimension
        if sample['embeddings'] is not None and len(sample['embeddings']) > i:
            print(f"   Embedding dim: {len(sample['embeddings'][i])}")
        else:
            print(f"   Embedding dim: N/A")
        
        # Show code snippet
        code_preview = sample['documents'][i][:100].replace('\n', ' ')
        print(f"   Code: {code_preview}...")
        
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()

Collection: flask_code
Total chunks: 471
Metadata: {'hnsw:space': 'cosine', 'description': 'Flask repository code chunks'}
Cosine similarity is ENABLED - semantic search will work correctly!

Sample entries:

1. ID: 0:../flask\examples\celery\src\task_app\tasks.py:add:8
   Type: function, Name: add
   File: ../flask\examples\celery\src\task_app\tasks.py
   Embedding dim: 768
   Code: def add(a: int, b: int) -> int:     return a + b...

2. ID: 1:../flask\examples\celery\src\task_app\tasks.py:block:13
   Type: function, Name: block
   File: ../flask\examples\celery\src\task_app\tasks.py
   Embedding dim: 768
   Code: def block() -> None:     time.sleep(5)...

3. ID: 2:../flask\examples\celery\src\task_app\tasks.py:process:18
   Type: function, Name: process
   File: ../flask\examples\celery\src\task_app\tasks.py
   Embedding dim: 768
   Code: def process(self: Task, total: int) -> object:     for i in range(total):         self.update_state(...


## Embedding Diagnostics

Check embedding model output and verify normalization/diversity.

In [13]:
import numpy as np

# Check Embedding Model Output
print("Check Embedding Model Output\n")

# Generate embeddings for different types of text
sample_texts = [
    "function: route decorator for Flask",
    "function: handle HTTP request",
    "class: Blueprint for Flask apps",
    "function: error handling exception",
    "simple hello world function"
]

print("Generating embeddings for sample texts...")
sample_embeddings = embedding_model.encode(sample_texts)

for i, (text, emb) in enumerate(zip(sample_texts, sample_embeddings)):
    print(f"\n{i+1}. Text: '{text}'")
    print(f"   Embedding shape: {emb.shape}")
    print(f"   Embedding mean: {np.mean(emb):.6f}")
    print(f"   Embedding std: {np.std(emb):.6f}")
    print(f"   Embedding min: {np.min(emb):.6f}")
    print(f"   Embedding max: {np.max(emb):.6f}")
    print(f"   L2 norm: {np.linalg.norm(emb):.6f}")
    print(f"   Non-zero elements: {np.count_nonzero(emb)}/{len(emb)}")

# Check pairwise similarities
print("\n\n=== Pairwise Cosine Similarities ===")
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(sample_embeddings)
print("\nSimilarity matrix:")
for i in range(len(sample_texts)):
    for j in range(len(sample_texts)):
        if i < j:
            print(f"  '{sample_texts[i][:30]}...' <-> '{sample_texts[j][:30]}...': {similarities[i][j]:.4f}")

Check Embedding Model Output

Generating embeddings for sample texts...

1. Text: 'function: route decorator for Flask'
   Embedding shape: (768,)
   Embedding mean: 0.002063
   Embedding std: 0.548010
   Embedding min: -1.431866
   Embedding max: 1.641745
   L2 norm: 15.187018
   Non-zero elements: 768/768

2. Text: 'function: handle HTTP request'
   Embedding shape: (768,)
   Embedding mean: -0.001454
   Embedding std: 0.519145
   Embedding min: -1.617335
   Embedding max: 1.912307
   L2 norm: 14.387020
   Non-zero elements: 768/768

3. Text: 'class: Blueprint for Flask apps'
   Embedding shape: (768,)
   Embedding mean: -0.001948
   Embedding std: 0.501446
   Embedding min: -1.396184
   Embedding max: 2.108231
   L2 norm: 13.896589
   Non-zero elements: 768/768

4. Text: 'function: error handling exception'
   Embedding shape: (768,)
   Embedding mean: 0.001313
   Embedding std: 0.533178
   Embedding min: -1.307905
   Embedding max: 1.519243
   L2 norm: 14.775918
   Non-zero element

In [14]:
# Verify Embedding Normalization and Diversity
print("Verify Embedding Normalization and Diversity\n")

# Get sample embeddings from indexed collection
client = chromadb.PersistentClient(path=CHROMA_DB_PATH)
try:
    collection = client.get_collection(name=COLLECTION_NAME)
    sample = collection.get(limit=50, include=['embeddings', 'metadatas'])
    
    if sample['embeddings'] is not None and len(sample['embeddings']) > 0:
        embeddings_array = np.array(sample['embeddings'])
        
        print(f"Analyzed {len(embeddings_array)} embeddings from the collection\n")
        
        # Check norms (should be consistent for normalized embeddings)
        norms = np.linalg.norm(embeddings_array, axis=1)
        print(f"L2 Norms:")
        print(f"  Mean: {np.mean(norms):.6f}")
        print(f"  Std: {np.std(norms):.6f}")
        print(f"  Min: {np.min(norms):.6f}")
        print(f"  Max: {np.max(norms):.6f}")
        
        # Check diversity (pairwise distances)
        from sklearn.metrics.pairwise import cosine_distances
        distances = cosine_distances(embeddings_array[:20])  # Sample 20 for speed
        
        # Get upper triangle (unique pairs)
        upper_triangle = distances[np.triu_indices_from(distances, k=1)]
        
        print(f"\nPairwise Cosine Distances (sample of 20):")
        print(f"  Mean: {np.mean(upper_triangle):.6f}")
        print(f"  Std: {np.std(upper_triangle):.6f}")
        print(f"  Min: {np.min(upper_triangle):.6f}")
        print(f"  Max: {np.max(upper_triangle):.6f}")
        
        # Check for suspiciously similar embeddings
        very_similar = np.sum(upper_triangle < 0.01)
        print(f"\n  Very similar pairs (distance < 0.01): {very_similar}/{len(upper_triangle)}")
        
        if very_similar > len(upper_triangle) * 0.1:
            print("  WARNING: Many embeddings are suspiciously similar!")
        else:
            print("  Embeddings show good diversity")
            
    else:
        print("No embeddings found in collection")
        
except Exception as e:
    print(f"Error: {e}")

Verify Embedding Normalization and Diversity

Analyzed 50 embeddings from the collection

L2 Norms:
  Mean: 13.255367
  Std: 0.423480
  Min: 12.052321
  Max: 14.031509

Pairwise Cosine Distances (sample of 20):
  Mean: 0.566770
  Std: 0.195600
  Min: 0.107651
  Max: 0.903741

  Very similar pairs (distance < 0.01): 0/190
  Embeddings show good diversity


## Verify Search Quality

Test with a specific technical query to verify semantic relevance.

In [15]:
# Test query to verify semantic search is working correctly
test_query = "add route decorator"
print(f"Test Query: '{test_query}'\n")
test_results = search_code(test_query, top_k=3)
pretty_print_results(test_results)

# Check diversity - results should have different names
if test_results:
    names = [r['name'] for r in test_results]
    unique_names = set(names)
    print(f"\nResult Diversity: {len(unique_names)}/{len(names)} unique functions/classes")
    if len(unique_names) < len(names):
        print("Warning: Duplicate results detected!")

Test Query: 'add route decorator'

Found 3 results:
1. FUNCTION: decorator
   File: ../flask\src\flask\sansio\scaffold.py:360-363
   Similarity: 0.8196 (distance: 0.1804)
   Code Preview:
      def decorator(f: T_route) -> T_route:
                  endpoint = options.pop("endpoint", None)
                  self.add_url_rule(rule, endpoint, f, **options)
                  return f

2. FUNCTION: decorator
   File: ../flask\src\flask\sansio\scaffold.py:453-455
   Similarity: 0.6787 (distance: 0.3213)
   Code Preview:
      def decorator(f: F) -> F:
                  self.view_functions[endpoint] = f
                  return f

3. FUNCTION: route
   File: ../flask\src\flask\sansio\scaffold.py:336-365
   Similarity: 0.6569 (distance: 0.3431)
   Doc: Decorate a view function to register it with the given URL         rule and options. Calls :meth:`add_url_rule`, which has more         details about ...
   Code Preview:
      def route(self, rule: str, **options: t.Any) -> t.Callable[[T_route

## Post-Filtering Comparison

Compare search results with and without keyword-based post-filtering.

In [16]:
# Post-Filtering Demonstration
print("Post-Filtering Comparison\n")

test_query = "add route decorator"
print(f"Query: '{test_query}'\n")

# Get results WITHOUT filtering
print("--- WITHOUT Post-Filtering ---")
results_no_filter = search_code(test_query, top_k=5, apply_filter=False)
for i, result in enumerate(results_no_filter, 1):
    print(f"{i}. {result['type'].upper()}: {result['name']}")
    print(f"   Similarity: {result['similarity']:.4f}")
    if result['docstring']:
        print(f"   Doc: {result['docstring'][:80]}...")
    print()

# Get results WITH filtering
print("\n--- WITH Post-Filtering (Keyword-Based Re-ranking) ---")
results_with_filter = search_code(test_query, top_k=5, apply_filter=True)
for i, result in enumerate(results_with_filter, 1):
    print(f"{i}. {result['type'].upper()}: {result['name']}")
    print(f"   Similarity: {result['similarity']:.4f}")
    if result['docstring']:
        print(f"   Doc: {result['docstring'][:80]}...")
    print()

# Compare diversity
print("\n--- Comparison ---")
names_no_filter = [r['name'] for r in results_no_filter]
names_with_filter = [r['name'] for r in results_with_filter]

print(f"Unique results without filter: {len(set(names_no_filter))}/{len(names_no_filter)}")
print(f"Unique results with filter: {len(set(names_with_filter))}/{len(names_with_filter)}")

# Check if results changed
changed = sum(1 for i in range(min(len(names_no_filter), len(names_with_filter))) 
              if names_no_filter[i] != names_with_filter[i])
print(f"Results changed: {changed}/{min(len(names_no_filter), len(names_with_filter))} positions")

Post-Filtering Comparison

Query: 'add route decorator'

--- WITHOUT Post-Filtering ---
1. FUNCTION: decorator
   Similarity: 0.8196

2. FUNCTION: decorator
   Similarity: 0.6787

3. FUNCTION: route
   Similarity: 0.6569
   Doc: Decorate a view function to register it with the given URL
        rule and opti...

4. FUNCTION: _method_route
   Similarity: 0.6385

5. FUNCTION: endpoint
   Similarity: 0.6252
   Doc: Decorate a view function to register it for the given
        endpoint. Used if ...


--- WITH Post-Filtering (Keyword-Based Re-ranking) ---
1. FUNCTION: add_url_rule
   Similarity: 0.5765
   Doc: Register a rule for routing incoming requests and building
        URLs. The :me...

2. FUNCTION: route
   Similarity: 0.6569
   Doc: Decorate a view function to register it with the given URL
        rule and opti...

3. FUNCTION: decorator
   Similarity: 0.8196

4. FUNCTION: _method_route
   Similarity: 0.6385

5. FUNCTION: decorator
   Similarity: 0.5943


--- Comparison ---
Unique

## Similarity Score Distribution Analysis

Visualize and analyze the distribution of similarity scores for sample queries.

In [None]:
# Analyze Similarity Score Distribution
print("=== Similarity Score Distribution Analysis ===\n")

# Test with multiple queries
test_queries = [
    "add route decorator",
    "handle HTTP request",
    "template rendering",
    "error handling",
    "database connection"
]

all_similarities = []
query_stats = []

for query in test_queries:
    results = search_code(query, top_k=20)
    if results:
        similarities = [r['similarity'] for r in results]
        all_similarities.extend(similarities)
        
        print(f"Query: '{query}'")
        print(f"  Results: {len(results)}")
        print(f"  Similarity range: [{min(similarities):.4f}, {max(similarities):.4f}]")
        print(f"  Mean: {np.mean(similarities):.4f}")
        print(f"  Std: {np.std(similarities):.4f}")
        
        # Show top 3 and bottom 3
        print(f"  Top 3: {', '.join([f'{s:.4f}' for s in similarities[:3]])}")
        print(f"  Bottom 3: {', '.join([f'{s:.4f}' for s in similarities[-3:]])}")
        print()
        
        query_stats.append({
            'query': query,
            'mean': np.mean(similarities),
            'std': np.std(similarities),
            'range': max(similarities) - min(similarities)
        })

# Overall statistics
if all_similarities:
    print(f"\n=== Overall Statistics (all {len(all_similarities)} results) ===")
    print(f"Mean similarity: {np.mean(all_similarities):.4f}")
    print(f"Std similarity: {np.std(all_similarities):.4f}")
    print(f"Min similarity: {np.min(all_similarities):.4f}")
    print(f"Max similarity: {np.max(all_similarities):.4f}")
    
    # Distribution bins
    bins = [0, 0.2, 0.4, 0.6, 0.8, 1.0]
    hist, _ = np.histogram(all_similarities, bins=bins)
    print(f"\nDistribution:")
    for i in range(len(bins)-1):
        print(f"  [{bins[i]:.1f}-{bins[i+1]:.1f}]: {hist[i]} results ({hist[i]/len(all_similarities)*100:.1f}%)")
    
    # Check for saturation
    if np.mean(all_similarities) > 0.95:
        print("\nWARNING: Similarities are saturated near 1.0!")
        print("   This suggests the model is not differentiating well between chunks.")
    elif np.std(all_similarities) < 0.05:
        print("\nWARNING: Very low variance in similarities!")
        print("   Results may not be well-ranked.")
    else:
        print("\n✓ Similarity distribution looks reasonable")

=== Similarity Score Distribution Analysis ===

Query: 'add route decorator'
  Results: 20
  Similarity range: [0.5239, 0.8196]
  Mean: 0.5992
  Std: 0.0630
  Top 3: 0.8196, 0.6787, 0.6569
  Bottom 3: 0.5487, 0.5348, 0.5239

Query: 'handle HTTP request'
  Results: 20
  Similarity range: [0.2643, 0.5321]
  Mean: 0.3269
  Std: 0.0695
  Top 3: 0.5321, 0.4264, 0.4036
  Bottom 3: 0.2683, 0.2675, 0.2643

Query: 'template rendering'
  Results: 20
  Similarity range: [0.3883, 0.6051]
  Mean: 0.4327
  Std: 0.0542
  Top 3: 0.6051, 0.5135, 0.5109
  Bottom 3: 0.3936, 0.3908, 0.3883

Query: 'error handling'
  Results: 20
  Similarity range: [0.2632, 0.4777]
  Mean: 0.3256
  Std: 0.0512
  Top 3: 0.4777, 0.4057, 0.3898
  Bottom 3: 0.2740, 0.2725, 0.2632

Query: 'database connection'
  Results: 20
  Similarity range: [0.1732, 0.5466]
  Mean: 0.2703
  Std: 0.0990
  Top 3: 0.5466, 0.4227, 0.3944
  Bottom 3: 0.1740, 0.1739, 0.1732


=== Overall Statistics (all 100 results) ===
Mean similarity: 0.3909
Std

## Example Searches
Try different queries to test the semantic search.

In [18]:
# Example 1: Routing
query = "how does Flask handle routing"
print(f"Query: {query}")
results = search_code(query, top_k=5)
pretty_print_results(results)

Query: how does Flask handle routing
Found 5 results:
1. FUNCTION: routes_command
   File: ../flask\src\flask\cli.py:1061-1107
   Similarity: 0.6058 (distance: 0.3942)
   Doc: Show all registered routes with endpoints and methods.
   Code Preview:
      def routes_command(sort: str, all_methods: bool) -> None:
          """Show all registered routes with endpoints and methods."""
          rules = list(current_app.url_map.iter_rules())
          if not rules:
              click.echo("No routes were registered.")
              return
      ... (39 more lines)

2. FUNCTION: index
   File: ../flask\examples\celery\src\task_app\__init__.py:20-21
   Similarity: 0.5737 (distance: 0.4263)
   Code Preview:
      def index() -> str:
              return render_template("index.html")

3. FUNCTION: match_request
   File: ../flask\src\flask\ctx.py:398-407
   Similarity: 0.5336 (distance: 0.4664)
   Doc: Apply routing to the current request, storing either the matched         endpoint and args, or

In [19]:
# Example 2: Request handling
query = "request context and session management"
print(f"Query: {query}")
results = search_code(query, top_k=5)
pretty_print_results(results)

Query: request context and session management
Found 5 results:
1. FUNCTION: session
   File: ../flask\src\flask\ctx.py:381-396
   Similarity: 0.5481 (distance: 0.4519)
   Doc: The session object associated with this context. Accessed through         :data:`.session`. Only available in request contexts, otherwise raises      ...
   Code Preview:
      def session(self) -> SessionMixin:
              """The session object associated with this context. Accessed through
              :data:`.session`. Only available in request contexts, otherwise raises
              :exc:`RuntimeError`.
              """
              if self._request is None:
                  raise RuntimeError("There is no request in this context.")
      ... (8 more lines)

2. FUNCTION: open_session
   File: ../flask\src\flask\sessions.py:337-349
   Similarity: 0.5346 (distance: 0.4654)
   Code Preview:
      def open_session(self, app: Flask, request: Request) -> SecureCookieSession | None:
              s = self.get

In [25]:
# Example 3: Template rendering
query = "template rendering and Jinja integration"
print(f"Query: {query}")
results = search_code(query, top_k=5)
pretty_print_results(results)

Query: template rendering and Jinja integration
Found 5 results:
1. FUNCTION: _render
   File: ../flask\src\flask\templating.py:122-132
   Similarity: 0.6581 (distance: 0.3419)
   Code Preview:
      def _render(ctx: AppContext, template: Template, context: dict[str, t.Any]) -> str:
          app = ctx.app
          app.update_template_context(ctx, context)
          before_render_template.send(
              app, _async_wrapper=app.ensure_sync, template=template, context=context
          )
          rv = template.render(context)
          template_rendered.send(
      ... (3 more lines)

2. CLASS: DispatchingJinjaLoader
   File: ../flask\src\flask\templating.py:48-119
   Similarity: 0.6288 (distance: 0.3712)
   Doc: A loader that looks for templates in the application and all     the blueprint folders.
   Code Preview:
      class DispatchingJinjaLoader(BaseLoader):
          """A loader that looks for templates in the application and all
          the blueprint folders.
          ""

In [21]:
# Example 4: Error handling
query = "error handling and exception management"
print(f"Query: {query}")
results = search_code(query, top_k=5)
pretty_print_results(results)

Query: error handling and exception management
Found 5 results:
1. FUNCTION: on_json_loading_failed
   File: ../flask\src\flask\wrappers.py:212-219
   Similarity: 0.4673 (distance: 0.5327)
   Code Preview:
      def on_json_loading_failed(self, e: ValueError | None) -> t.Any:
              try:
                  return super().on_json_loading_failed(e)
              except BadRequest as ebr:
                  if current_app and current_app.debug:
                      raise
                  raise BadRequest() from ebr

2. FUNCTION: __exit__
   File: ../flask\src\flask\ctx.py:486-492
   Similarity: 0.3990 (distance: 0.6010)
   Code Preview:
      def __exit__(
              self,
              exc_type: type[BaseException] | None,
              exc_value: BaseException | None,
              tb: TracebackType | None,
          ) -> None:
              self.pop(exc_value)

3. CLASS: UnexpectedUnicodeError
   File: ../flask\src\flask\debughelpers.py:17-20
   Similarity: 0.3956 (distance: 

In [24]:
# Example 4: Error handling
query = "error handling and exception management"
print(f"Query: {query}")
results = search_code(query, top_k=5, apply_filter=True)
pretty_print_results(results)

Query: error handling and exception management
Found 5 results:
1. FUNCTION: handle_user_exception
   File: ../flask\src\flask\app.py:864-894
   Similarity: 0.3868 (distance: 0.6132)
   Doc: This method is called whenever an exception occurs that         should be handled. A special case is :class:`~werkzeug         .exceptions.HTTPExcepti...
   Code Preview:
      def handle_user_exception(
              self, ctx: AppContext, e: Exception
          ) -> HTTPException | ft.ResponseReturnValue:
              """This method is called whenever an exception occurs that
              should be handled. A special case is :class:`~werkzeug
              .exceptions.HTTPException` which is forwarded to the
              :meth:`handle_http_exception` method. This function will either
              return a response value or reraise the exception with the same
      ... (23 more lines)

2. FUNCTION: handle_http_exception
   File: ../flask\src\flask\app.py:829-862
   Similarity: 0.3406 (distance

## Custom Search
Run your own queries here.

In [26]:
# Custom search - modify the query below
custom_query = "database connection pooling in Flask"
results = search_code(custom_query, top_k=10, apply_filter=True)
pretty_print_results(results)

Found 10 results:
1. FUNCTION: get_db
   File: ../flask\examples\tutorial\flaskr\db.py:9-20
   Similarity: 0.6040 (distance: 0.3960)
   Doc: Connect to the application's configured database. The connection     is unique for each request and will be reused if this is called     again.
   Code Preview:
      def get_db():
          """Connect to the application's configured database. The connection
          is unique for each request and will be reused if this is called
          again.
          """
          if "db" not in g:
              g.db = sqlite3.connect(
                  current_app.config["DATABASE"], detect_types=sqlite3.PARSE_DECLTYPES
      ... (4 more lines)

2. FUNCTION: close_db
   File: ../flask\examples\tutorial\flaskr\db.py:23-30
   Similarity: 0.5232 (distance: 0.4768)
   Doc: If this request connected to the database, close the     connection.
   Code Preview:
      def close_db(e=None):
          """If this request connected to the database, close the
          