# Hybrid Search & Smart Flashcard Generator

In this workshop, you'll learn how modern search engines combine **lexical search** (keyword matching) with **semantic search** (meaning-based) to deliver better results. Then you'll build a **Smart Flashcard Generator** that uses this technology!

**What you'll learn:**
1. How BM25 (lexical search) works - the algorithm behind Elasticsearch
2. How semantic search uses neural embeddings to understand meaning
3. Why neither approach alone is perfect
4. How to combine them with **hybrid search**
5. **RAG (Retrieval-Augmented Generation)** - combining search with LLMs
6. Build a working **flashcard generator** you can use for studying!

**The Pipeline:**
```
Your Notes → Hybrid Search (retrieve relevant content) → LLM (generate Q&A) → Flashcards (Anki export!)
```


## ⚡ Pre-workshop Setup (Do This Before the Workshop!)

To make the most of our limited time, please run these setup steps **before** the workshop. This downloads the required models (~9GB total) which can take 10-15 minutes.

### Step 1: Install packages (~2 min)
Run the cell below to install all required packages.

### Step 2: Download models (~10-15 min)
Run the second cell to pre-download the AI models. This only needs to be done once - the models are cached for future use.

### Step 3: Verify setup
If both cells complete without errors, you're ready for the workshop! 🎉

In [None]:
# Step 1: Install packages
!pip install requests beautifulsoup4 rank_bm25 sentence-transformers faiss-cpu networkx pyvis gradio tqdm numpy transformers accelerate bitsandbytes genanki pypdf -q
print("✅ Packages installed!")

In [None]:
# Step 2: Pre-download models (run this before the workshop!)
print("Downloading models... This takes 10-15 minutes on first run.")
print("=" * 50)

# Download sentence-transformers model (~90MB)
print("\n1/2: Downloading embedding model...")
from sentence_transformers import SentenceTransformer
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
print("✅ Embedding model ready!")

# Download Qwen3-8B (~8GB)
print("\n2/2: Downloading Qwen3-8B (this is the big one, ~8GB)...")
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

# Detect device and download with appropriate settings
if torch.cuda.is_available():
    print("   CUDA detected - downloading for GPU...")
    from transformers import BitsAndBytesConfig
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", quantization_config=bnb_config, device_map="auto")
elif torch.backends.mps.is_available():
    print("   Apple Silicon detected - downloading for MPS...")
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.float16, device_map="auto")
else:
    print("   CPU mode - downloading (will be slow during workshop)...")
    model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", device_map="auto")

print("✅ Qwen3-8B ready!")

print("\n" + "=" * 50)
print("🎉 All set! You're ready for the workshop!")
print("=" * 50)

# Clean up memory
del model, tokenizer, embed_model
import gc
gc.collect()

## The Search Problem

Why is finding relevant content so hard?

**Scenario 1:** You search for "Singapore economy"
- A great article titled "GDP Growth in the Lion City" exists
- But it doesn't contain the exact words "Singapore" or "economy"!
- Traditional keyword search would miss it completely

**Scenario 2:** You search for "BM25 algorithm"
- You want technical documentation with that exact term
- A semantic search might return general "search algorithm" articles
- Exact keyword matching would be more helpful here

**The solution?** Combine both approaches!
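
Scenario 1 can be made concrete with a minimal sketch. The `keyword_overlap` helper below is hypothetical (it is not part of the workshop code); it simply collects the lowercase words a query and a title have in common, which is roughly what a keyword matcher has to work with.

```python
import re

def keyword_overlap(query: str, text: str) -> set[str]:
    """Return the lowercase words that appear in both query and text."""
    def words(s: str) -> set[str]:
        return set(re.findall(r"\w+", s.lower()))
    return words(query) & words(text)

title = "GDP Growth in the Lion City"

# No shared words, so pure keyword matching has nothing to score
print(keyword_overlap("Singapore economy", title))

# Shared words exist only when the query reuses the title's vocabulary
print(keyword_overlap("Lion City GDP", title))
```

The first call returns an empty set even though the article is exactly what the searcher wants, which is the gap semantic search is meant to close.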

### Dependencies

| Package | Purpose |
|---------|---------|
| `requests` | Make HTTP requests to fetch web pages |
| `beautifulsoup4` | Parse HTML and extract content/links |
| `rank_bm25` | BM25 algorithm implementation |
| `sentence-transformers` | Generate text embeddings for semantic search |
| `faiss-cpu` | Fast similarity search in vector space |
| `networkx` | Build and analyze link graphs |
| `pyvis` | Interactive graph visualizations (double-click to open URLs!) |
| `gradio` | Build web interfaces for our app |
| `tqdm` | Show progress bars during crawling |
| `numpy` | Numerical operations for embeddings |
| `transformers` | HuggingFace library for loading LLMs |
| `accelerate` | Efficient model loading across devices |
| `bitsandbytes` | 4-bit quantization (reduces memory 4x) |
| `genanki` | Create Anki flashcard decks (.apkg) |
| `pypdf` | Extract text from PDF files |

In [None]:
# Install required packages
!pip install requests beautifulsoup4 rank_bm25 sentence-transformers transformers accelerate bitsandbytes faiss-cpu networkx pyvis gradio tqdm numpy genanki pypdf -q

## Part 2: Building Our Document Collection

Before we can search, we need documents to search through! We'll:
1. Crawl Wikipedia pages to build a corpus
2. Allow adding custom documents

### What is a Web Crawler?

A **web crawler** (also called a spider or bot) is a program that automatically browses the web to collect information. Here's how it works:

1. **Start with a seed URL** - Give it a starting webpage (e.g., a Wikipedia article)
2. **Download the page** - Fetch the HTML content
3. **Extract links** - Find all links to other pages
4. **Follow the links** - Visit those pages and repeat
5. **Store the content** - Save the text for later use (like searching!)

```
🌐 Start URL → 📄 Download Page → 🔗 Find Links → 🔄 Repeat
                      ↓
                 💾 Store Content
```

**Why do search engines need crawlers?** Search engines like Google can't search the entire internet in real time. Instead, they use crawlers to discover and download billions of pages *in advance*, storing them in a massive index. When you search, you're actually searching this pre-built index, not the live web!

In this workshop, we'll build a simple crawler to collect Wikipedia articles for our search experiments.
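
To give a feel for what a "pre-built index" can look like, here is a minimal sketch of an inverted index, one common way to organize such data. The page IDs and texts are made up for illustration, and `build_inverted_index` is a hypothetical helper, not something used later in the workshop.

```python
from collections import defaultdict

def build_inverted_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase word to the set of page IDs that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

# Hypothetical toy pages standing in for crawled content
pages = {
    "page1": "web crawlers download pages",
    "page2": "search engines rank pages",
}

index = build_inverted_index(pages)
print(sorted(index["pages"]))     # ['page1', 'page2']
print(sorted(index["crawlers"]))  # ['page1']
```

Looking up a word is now a dictionary access instead of a scan over every page, which is why the crawling and indexing work is done ahead of time.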

### Ethical Web Crawling

When crawling websites, we must be respectful:
- **Respect `robots.txt`** - Check what's allowed to crawl
- **Identify yourself** - Use a proper User-Agent header
- **Rate limiting** - Don't overwhelm servers

### What is robots.txt?

Every website can have a `robots.txt` file at its root that tells crawlers what they're allowed to access. Let's fetch some real examples!
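
Before touching a live site, it can help to see how the standard library interprets the rules. The sketch below feeds a made-up `robots.txt` (the rules and the `example.com` URLs are assumptions for illustration) to `urllib.robotparser.RobotFileParser`, the same parser our crawler uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration (not from a real site)
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Rules are checked in order; the first matching rule decides
print(rp.can_fetch("*", "https://example.com/articles/1"))    # True
print(rp.can_fetch("*", "https://example.com/private/data"))  # False
```

Parsing from a string like this needs no network access, so it is a convenient way to experiment with rules offline.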

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse, unquote
from urllib.robotparser import RobotFileParser
from collections import deque
import time
from tqdm import tqdm

# Configuration
USER_AGENT = "HybridSearchWorkshop/1.0 (Educational)"
REQUEST_DELAY = 0.5  # seconds between requests

class WebCrawler:
    """
    General-purpose web crawler that respects robots.txt.
    
    This base class can crawl any website. Subclass it to add 
    site-specific logic (like WikipediaCrawler below).
    
    Usage:
        crawler = WebCrawler()
        data = crawler.crawl("https://example.com", max_pages=10)
    """
    
    def __init__(self, user_agent: str = USER_AGENT, delay: float = REQUEST_DELAY) -> None:
        self.user_agent = user_agent
        self.delay = delay
        self.session = requests.Session()
        self.session.headers["User-Agent"] = user_agent
        self.robots_cache = {}  # Cache robots parsers per domain
    
    def get_domain(self, url: str) -> str:
        """Extract the domain from a URL."""
        parsed = urlparse(url)
        return f"{parsed.scheme}://{parsed.netloc}"
    
    def can_fetch(self, url: str) -> bool:
        """Check if URL is allowed by robots.txt."""
        domain = self.get_domain(url)
        if domain not in self.robots_cache:
            try:
                rp = RobotFileParser()
                robots_url = f"{domain}/robots.txt"
                robots_txt = self.session.get(robots_url, timeout=5).text
                rp.parse(robots_txt.splitlines())
                self.robots_cache[domain] = rp
            except Exception:
                # If we can't fetch robots.txt, assume allowed
                return True
        return self.robots_cache[domain].can_fetch("*", url)
    
    def get(self, url: str, **kwargs) -> requests.Response:
        """Make a GET request with proper headers."""
        kwargs.setdefault('timeout', 10)
        return self.session.get(url, **kwargs)
    
    def get_title(self, url: str, soup: BeautifulSoup) -> str:
        """Extract page title. Override in subclass for custom logic."""
        title_tag = soup.find('title')
        return title_tag.get_text(strip=True) if title_tag else url
    
    def extract_text(self, soup: BeautifulSoup) -> str:
        """Extract main text content. Override in subclass for custom logic."""
        # Remove script and style elements
        for tag in soup(['script', 'style', 'nav', 'footer', 'header']):
            tag.decompose()
        return soup.get_text(separator=' ', strip=True)[:5000]
    
    def is_valid_link(self, href: str, base_url: str) -> bool:
        """Check if a link should be followed. Override in subclass."""
        if not href:
            return False
        # Skip anchors, javascript, mailto, etc.
        if href.startswith(('#', 'javascript:', 'mailto:', 'tel:')):
            return False
        return True
    
    def extract_links(self, soup: BeautifulSoup, base_url: str) -> list[str]:
        """Extract all valid links from the page."""
        links = []
        for a in soup.find_all('a', href=True):
            href = a['href']
            if self.is_valid_link(href, base_url):
                # Convert relative URLs to absolute
                full_url = urljoin(base_url, href)
                # Only follow links on the same domain
                if self.get_domain(full_url) == self.get_domain(base_url):
                    links.append(full_url)
        return list(dict.fromkeys(links))  # Remove duplicates, preserve order
    
    def crawl_page(self, url: str) -> dict | None:
        """Crawl a single page and extract title, text, and links."""
        if not self.can_fetch(url):
            print(f"[robots.txt blocked] {url}")
            return None
        
        try:
            response = self.get(url)
            if response.status_code != 200:
                print(f"[HTTP {response.status_code}] {url}")
                return None
            
            soup = BeautifulSoup(response.text, 'html.parser')
            
            return {
                'url': url,
                'title': self.get_title(url, soup),
                'text': self.extract_text(soup),
                'links': self.extract_links(soup, url)
            }
        
        except Exception as e:
            print(f"[Error] {url}: {e}")
            return None
    
    def crawl(self, seed_url: str, max_pages: int = 20) -> dict[str, dict]:
        """
        Crawl starting from seed_url using Breadth-First Search (BFS).
        
        Args:
            seed_url: Starting URL
            max_pages: Maximum number of pages to crawl
        
        Returns:
            Dictionary mapping URL -> page data
        """
        crawled = {}
        visited = set()
        queue = deque([seed_url])
        
        pbar = tqdm(total=max_pages, desc="Crawling")
        
        while queue and len(crawled) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            
            visited.add(url)
            result = self.crawl_page(url)
            
            if result:
                crawled[url] = result
                pbar.update(1)
                pbar.set_description(f"Crawling: {result['title'][:25]}...")
                
                # Add new links to the queue (BFS)
                for link in result['links']:
                    if link not in visited:
                        queue.append(link)
            
            time.sleep(self.delay)
        
        pbar.close()
        print(f"\nCrawled {len(crawled)} pages!")
        return crawled

    def show_robots_txt(self, url: str, max_rules: int = 10) -> None:
        """Fetch and display important rules from a site's robots.txt."""
        try:
            response = self.get(f"{self.get_domain(url)}/robots.txt")
            
            print(f"\n{'='*50}")
            print(f"📄 {self.get_domain(url)}/robots.txt")
            print('='*50)
            
            # Filter to show only important lines
            important_prefixes = ('user-agent:', 'disallow:', 'allow:', 'sitemap:')
            rules_shown = 0
            
            for line in response.text.split('\n'):
                line_lower = line.lower().strip()
                if any(line_lower.startswith(prefix) for prefix in important_prefixes):
                    print(line.strip())
                    rules_shown += 1
                    if rules_shown >= max_rules:
                        print("...")
                        break
        except Exception as e:
            print(f"Error: {e}")

print("WebCrawler base class ready!")

In [None]:
# Let's look at real robots.txt files using our crawler!
crawler = WebCrawler()

# Wikipedia - generally permissive
crawler.show_robots_txt("https://en.wikipedia.org", max_rules=8)

# Google - much more restrictive
crawler.show_robots_txt("https://www.google.com", max_rules=12)

In [None]:
# Now let's test our crawler's can_fetch() method
test_crawler = WebCrawler()

print("Testing if our crawler can access these URLs:\n")

# Wikipedia - should be allowed
wiki_url = "https://en.wikipedia.org/wiki/Search_engine"
print(f"Wikipedia article: {test_crawler.can_fetch(wiki_url)}")

# Google search - should be blocked for generic bots
google_url = "https://www.google.com/search?q=test"
print(f"Google search page: {test_crawler.can_fetch(google_url)}")

In [None]:
class WikipediaCrawler(WebCrawler):
    """
    Wikipedia-specific crawler that extends WebCrawler.
    
    Customizations:
    - Extracts clean titles from Wikipedia URLs
    - Only follows links to Wikipedia articles (skips Special:, File:, etc.)
    - Extracts text from the main content area only
    
    Usage:
        crawler = WikipediaCrawler()
        data = crawler.crawl("https://en.wikipedia.org/wiki/Python", max_pages=20)
    """
    
    # Wikipedia namespaces to skip
    SKIP_NAMESPACES = [
        'File:', 'Category:', 'Help:', 'Template:', 'Wikipedia:', 
        'Special:', 'Talk:', 'Portal:', 'Module:', 'Draft:', 'MediaWiki:'
    ]
    
    def get_title(self, url: str, soup: BeautifulSoup | None = None) -> str:
        """Extract a clean title from a Wikipedia URL."""
        path = urlparse(url).path
        title = path.split('/')[-1]
        return unquote(title).replace('_', ' ')
    
    def is_valid_link(self, href: str, base_url: str) -> bool:
        """Only follow links to Wikipedia articles."""
        if not href:
            return False
        if not href.startswith('/wiki/'):
            return False
        # Skip special namespaces
        for ns in self.SKIP_NAMESPACES:
            if href.startswith(f'/wiki/{ns}'):
                return False
        # Skip anchor links
        if '#' in href:
            return False
        return True
    
    def extract_text(self, soup: BeautifulSoup) -> str:
        """Extract text from Wikipedia's main content area."""
        content_div = soup.find('div', {'id': 'mw-content-text'})
        if not content_div:
            return ""
        
        # Remove non-content elements
        for tag in content_div(['script', 'style', 'sup', 'table', 'nav']):
            tag.decompose()
        
        return content_div.get_text(separator=' ', strip=True)[:5000]
    
    def extract_links(self, soup: BeautifulSoup, base_url: str) -> list[str]:
        """Extract links from Wikipedia's main content area."""
        content_div = soup.find('div', {'id': 'mw-content-text'})
        if not content_div:
            return []
        
        links = []
        for a in content_div.find_all('a', href=True):
            href = a['href']
            if self.is_valid_link(href, base_url):
                full_url = urljoin('https://en.wikipedia.org', href)
                if full_url != base_url:
                    links.append(full_url)
        
        return list(dict.fromkeys(links))

# Create crawler instances
crawler = WikipediaCrawler()  # For Wikipedia (used in this workshop)

print("WikipediaCrawler ready!")
print("\nYou can also use WebCrawler for other websites:")
print("  generic_crawler = WebCrawler()")
print("  data = generic_crawler.crawl('https://any-website.com', max_pages=10)")

In [None]:
# The crawl() method is now part of the WebCrawler class
# These helper functions are kept for backwards compatibility

def crawl_wikipedia_page(url: str) -> dict | None:
    """Crawl a single Wikipedia page. (Uses WikipediaCrawler internally)"""
    return crawler.crawl_page(url)

def crawl_wikipedia(seed_url: str, max_pages: int = 30) -> dict[str, dict]:
    """Crawl Wikipedia starting from seed_url. (Uses WikipediaCrawler internally)"""
    return crawler.crawl(seed_url, max_pages=max_pages)

### How BFS Works

Our crawler uses **BFS (Breadth-First Search)** - exploring pages "level by level":

```
              🌐 Seed Page
             /     |     \
          📄 A    📄 B    📄 C      ← Level 1 (crawled first)
          /  \      |
       📄 D  📄 E  📄 F            ← Level 2 (crawled second)
```

**Why BFS?** Pages closer to the seed are usually more relevant. BFS finds them first before wandering off into unrelated topics.
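
The level-by-level order is easy to check on a toy graph. This sketch mirrors the diagram above with an in-memory dictionary instead of real web pages; the page names and the `bfs_order` helper are illustrative only.

```python
from collections import deque

# Toy link graph mirroring the diagram (page names are illustrative)
links = {
    "Seed": ["A", "B", "C"],
    "A": ["D", "E"],
    "B": ["F"],
    "C": [], "D": [], "E": [], "F": [],
}

def bfs_order(start: str) -> list[str]:
    """Return pages in the order a BFS crawler would visit them."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue:
        page = queue.popleft()      # take the oldest page in the queue
        order.append(page)
        for nxt in links[page]:
            if nxt not in visited:  # never queue the same page twice
                visited.add(nxt)
                queue.append(nxt)
    return order

print(bfs_order("Seed"))  # ['Seed', 'A', 'B', 'C', 'D', 'E', 'F']
```

Swapping `popleft()` for `pop()` would turn this into depth-first search, which dives down one chain of links before returning to the seed's other neighbors.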

In [None]:
# Crawl Wikipedia starting from "Search engine"
# With a 0.5 s delay between requests, 100 pages take at least ~50 seconds plus download time
SEED_URL = "https://en.wikipedia.org/wiki/Search_engine"
MAX_PAGES = 100

crawled_data = crawl_wikipedia(SEED_URL, max_pages=MAX_PAGES)

# Show what we crawled
print("\nPages crawled:")
for url, data in list(crawled_data.items())[:10]:
    print(f"  - {data['title']}")
if len(crawled_data) > 10:
    print(f"  ... and {len(crawled_data) - 10} more")

In [None]:
# 💾 Optional: Save crawled data to avoid re-crawling
import json

def save_crawled_data(data: dict, filename: str = "crawled_data.json") -> None:
    """Save crawled data to a JSON file."""
    with open(filename, 'w') as f:
        json.dump(data, f, indent=2)
    print(f"✅ Saved {len(data)} pages to {filename}")

# Uncomment to save:
# save_crawled_data(crawled_data)

In [None]:
# 📂 Optional: Load previously crawled data (skip crawling)
import json

def load_crawled_data(filename: str = "crawled_data.json") -> dict:
    """Load crawled data from a JSON file."""
    with open(filename, 'r') as f:
        data = json.load(f)
    print(f"✅ Loaded {len(data)} pages from {filename}")
    return data

# Uncomment to load instead of crawling:
# crawled_data = load_crawled_data()

### Visualizing the Link Graph

Let's see how our crawled pages are connected! We'll build a **link graph** where:
- **Nodes** = Wikipedia pages
- **Edges** = Links between pages

This is exactly how search engines model the web. You can **double-click any node** to open that Wikipedia page!

In [None]:
import networkx as nx
from pyvis.network import Network
import os

def build_link_graph(crawled_data: dict) -> "nx.DiGraph":
    """Build a directed graph from crawled data."""
    G = nx.DiGraph()
    crawled_urls = set(crawled_data.keys())
    
    # Add nodes
    for url, data in crawled_data.items():
        G.add_node(url, title=data['title'])
    
    # Add edges (only between crawled pages)
    for url, data in crawled_data.items():
        for link in data['links']:
            if link in crawled_urls:
                G.add_edge(url, link)
    
    return G

def visualize_link_graph(G: "nx.DiGraph", crawled_data: dict, filename: str = "link_graph.html") -> str:
    """Create an interactive visualization - double-click nodes to open URLs!"""
    net = Network(height="500px", width="100%", directed=True, notebook=False)
    net.barnes_hut(gravity=-3000, central_gravity=0.3, spring_length=200)
    
    # Calculate in-degrees for node sizing
    in_degrees = dict(G.in_degree())
    max_deg = max(in_degrees.values()) if in_degrees else 1
    
    for node in G.nodes():
        title = G.nodes[node].get('title', node)
        size = 10 + (in_degrees.get(node, 0) / max_deg) * 30
        hover = f"{title}\n{in_degrees.get(node, 0)} backlinks\nDouble-click to open"
        net.add_node(node, label=title[:20], title=hover, size=size)
    
    for source, target in G.edges():
        net.add_edge(source, target)
    
    filepath = os.path.abspath(filename)
    net.save_graph(filepath)
    
    # Add double-click handler to open URLs
    with open(filepath, 'r') as f:
        html = f.read()
    
    click_script = """
    <script>
    network.on("doubleClick", function(params) {
        if (params.nodes.length > 0) {
            var nodeId = params.nodes[0];
            window.open(nodeId, '_blank');
        }
    });
    </script>
    </body>
    """
    html = html.replace("</body>", click_script)
    
    with open(filepath, 'w') as f:
        f.write(html)
    
    return filepath

# Build and visualize the link graph
link_graph = build_link_graph(crawled_data)
print(f"Graph: {link_graph.number_of_nodes()} pages, {link_graph.number_of_edges()} links")

graph_file = visualize_link_graph(link_graph, crawled_data)
print(f"\nGraph saved to: {graph_file}")
print("Open this file in your browser, or it will appear in the Gradio app later!")

### Adding Custom Documents

Besides Wikipedia pages, you can add your own documents to the corpus:
- **Text documents** - Paste or type content directly
- **PDF files** - Extract text from PDF documents

This is useful for:
- Testing search with your own notes or study materials
- Adding domain-specific content not found on Wikipedia
- Experimenting with how different document types affect search quality

In [None]:
from pypdf import PdfReader

# Our document store - combines crawled pages with custom documents
documents = {}

# Add crawled pages to documents
for url, data in crawled_data.items():
    documents[url] = {
        'title': data['title'],
        'text': data['text'],
        'source': 'wikipedia'
    }

def add_custom_document(title: str, text: str, source: str = "custom", url: str | None = None) -> str:
    """Add a custom text document to our corpus."""
    doc_id = f"custom_{len([k for k in documents if k.startswith('custom_')])}" 
    documents[doc_id] = {
        'title': title,
        'text': text,
        'source': source,
        'url': url
    }
    print(f"Added document: {title}")
    return doc_id

def process_pdf(file_path: str) -> str:
    """Extract text from a PDF file."""
    try:
        reader = PdfReader(file_path)
        text = ""
        for page in reader.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
        return text.strip()
    except Exception as e:
        return f"Error reading PDF: {str(e)}"

def add_pdf_document(file_path: str, title: str | None = None) -> str:
    """Add a PDF document to our corpus.
    
    Args:
        file_path: Path to the PDF file
        title: Optional title (defaults to filename if not provided)
    
    Returns:
        The document ID, or error message if extraction failed
    """
    text = process_pdf(file_path)
    if text.startswith("Error"):
        print(text)
        return text
    
    # Use filename as title if not provided
    if title is None:
        import os
        title = os.path.splitext(os.path.basename(file_path))[0]
    
    doc_id = add_custom_document(title, text, source="pdf", url=file_path)
    print(f"  Extracted {len(text):,} characters from PDF")
    return doc_id

# Example 1: Add custom text documents
add_custom_document(
    "GDP Growth in the Lion City",
    "The Lion City has seen remarkable economic expansion in recent years. Financial services and technology sectors have driven significant growth in gross domestic product. The city-state continues to attract global investment."
)

add_custom_document(
    "How Automobiles Changed Transportation", 
    "The motor vehicle revolutionized how people travel. Machines powered by internal combustion engines replaced horse-drawn carriages. Modern automobiles feature advanced safety systems and increasingly electric powertrains."
)

# Example 2: Add a PDF document about BM25
# This PDF wasn't in our Wikipedia crawl, so it adds new knowledge to our corpus!
add_pdf_document("bm25_intro.pdf", "BM25 Algorithm Introduction")

print(f"\nTotal documents in corpus: {len(documents)}")

## Part 3: Lexical Search with BM25

### What is Lexical (Keyword) Search?

Lexical search finds documents by matching exact words. The key concepts are:

**Term Frequency (TF):** How often does the search term appear in a document?
- "apple" appears 5 times → higher score

**Inverse Document Frequency (IDF):** How rare is the term across all documents?
- "the" appears everywhere → low IDF (not useful for ranking)
- "BM25" is rare → high IDF (very useful for ranking)

**Document Length Normalization:** Longer documents naturally contain more words, so we adjust for length.
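
These two quantities can be computed by hand on a tiny corpus. The sketch below uses a made-up three-document corpus and the textbook `log(N / df)` form of IDF; this is an assumption chosen for readability, and BM25 implementations typically use a smoothed variant.

```python
import math

# Hypothetical three-document corpus, already tokenized
docs = [
    ["the", "apple", "is", "red", "apple"],
    ["the", "bm25", "algorithm"],
    ["the", "search", "engine"],
]

def term_frequency(term: str, doc: list[str]) -> int:
    """Count how often term appears in one document."""
    return doc.count(term)

def inverse_document_frequency(term: str, docs: list[list[str]]) -> float:
    """Textbook IDF: log(N / number of documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing) if containing else 0.0

print(term_frequency("apple", docs[0]))          # 2
print(inverse_document_frequency("the", docs))   # 0.0 (appears in every document)
print(inverse_document_frequency("bm25", docs))  # log(3), about 1.0986
```

A word found in every document gets an IDF of zero, so it contributes nothing to ranking, while a word found in only one document gets the largest weight.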

### BM25 - "Best Match 25"

BM25 is the evolution of TF-IDF, and is the algorithm behind:
- **Elasticsearch** (default ranking)
- **Apache Lucene/Solr**
- **Many production search systems**

The formula (simplified, shown for a single query term; a document's score is the sum over all query terms):
```
score = IDF * (TF * (k1 + 1)) / (TF + k1 * (1 - b + b * docLength/avgDocLength))
```

Where:
- `k1` controls term frequency saturation (typically 1.2-2.0)
- `b` controls document length normalization (typically 0.75)

Don't worry about the math - the `rank_bm25` library handles this for us!
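
For readers who do want to see the arithmetic, the simplified formula above translates almost line for line into Python. This is a sketch for building intuition, not the `rank_bm25` implementation; the function name `bm25_term_score` is hypothetical, and `k1=1.5`, `b=0.75` are illustrative values within the typical ranges listed above.

```python
def bm25_term_score(tf: float, idf: float, doc_len: int, avg_doc_len: float,
                    k1: float = 1.5, b: float = 0.75) -> float:
    """Score one query term against one document using the simplified formula."""
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)

# Saturation: each extra occurrence adds less than the previous one
for tf in (1, 2, 5, 20):
    score = bm25_term_score(tf, idf=1.0, doc_len=100, avg_doc_len=100)
    print(tf, round(score, 3))

# Length normalization: same tf, but the longer document scores lower
short_doc = bm25_term_score(tf=2, idf=1.0, doc_len=50, avg_doc_len=100)
long_doc = bm25_term_score(tf=2, idf=1.0, doc_len=400, avg_doc_len=100)
print(short_doc > long_doc)  # True
```

With these settings the score can never exceed `idf * (k1 + 1)`, no matter how many times the term repeats. That ceiling is what "term frequency saturation" refers to.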

In [None]:
from rank_bm25 import BM25Okapi
import re

def tokenize(text: str) -> list[str]:
    """Simple tokenization: lowercase and split on non-alphanumeric."""
    return re.findall(r'\w+', text.lower())

# Prepare documents for BM25
doc_ids = list(documents.keys())
doc_texts = [documents[doc_id]['text'] for doc_id in doc_ids]
doc_titles = [documents[doc_id]['title'] for doc_id in doc_ids]

# Tokenize all documents
tokenized_docs = [tokenize(text) for text in doc_texts]

# Build BM25 index
bm25_index = BM25Okapi(tokenized_docs)

print(f"BM25 index built with {len(doc_ids)} documents!")
print(f"Average document length: {sum(len(d) for d in tokenized_docs) / len(tokenized_docs):.0f} tokens")

In [None]:
def bm25_search(query: str, top_k: int = 5) -> list[dict]:
    """Search using BM25 and return the top_k results as dicts (doc_id, title, score, text_preview)."""
    tokenized_query = tokenize(query)
    scores = bm25_index.get_scores(tokenized_query)
    
    # Get top k results
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    
    results = []
    for idx in top_indices:
        results.append({
            'doc_id': doc_ids[idx],
            'title': doc_titles[idx],
            'score': scores[idx],
            'text_preview': doc_texts[idx][:200] + '...'
        })
    
    return results

In [None]:
# Test BM25 search
test_queries = ["search engine", "web crawler", "information retrieval", "bm25"]

for query in test_queries:
    print(f"\nQuery: '{query}'")
    print("-" * 50)
    results = bm25_search(query, top_k=3)
    for i, r in enumerate(results, 1):
        print(f"  {i}. {r['title']} (score: {r['score']:.2f})")

### BM25 Strengths & Weaknesses

**Strengths:**
- Exact keyword matches rank highly
- Rare/specific terms get boosted (great for technical searches)
- Fast - no GPU or complex models needed
- Interpretable - you know exactly WHY a document matched

**Weaknesses:**
- "Singapore economy" won't find "Lion City GDP"
- Typos break searches ("serch engne" → no results)
- No understanding of synonyms ("car" won't match "automobile")
- Word order doesn't matter ("dog bites man" = "man bites dog")
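
The last two weaknesses follow directly from treating text as a bag of words. A minimal sketch, assuming the same lowercase-and-split tokenization used for our BM25 index (`bag_of_words` is a hypothetical helper):

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase, split on non-word characters, and count tokens."""
    return Counter(re.findall(r"\w+", text.lower()))

# Word order is discarded, so these two sentences look identical
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True

# Synonyms share no tokens, so there is nothing to match on
print(bag_of_words("car") & bag_of_words("automobile"))  # Counter()
```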

In [None]:
# Demo: Where BM25 fails
print("=== BM25 Failure Cases ===")

# Semantic mismatch - no keyword overlap
print("\n1. Semantic mismatch: 'Singapore economy'")
print("   Looking for our custom doc 'GDP Growth in the Lion City'...")
results = bm25_search("Singapore economy", top_k=3)
found = any("Lion City" in r['title'] for r in results)
if found:
    print("   Found it!")
else:
    print("   NOT FOUND - BM25 misses it because 'Singapore' and 'economy' aren't in the doc!")
print("   Top results instead:", [r['title'][:30] for r in results])

# Synonyms - "car" vs "automobile"
print("\n2. Synonyms: 'car vehicle' (looking for 'Automobiles' doc)")
results = bm25_search("car vehicle", top_k=3)
found = any("Automobile" in r['title'] for r in results)
if found:
    print("   Found it!")
else:
    print("   NOT FOUND - BM25 doesn't know 'car' = 'automobile'!")
print("   Top results instead:", [r['title'][:30] for r in results])

# Paraphrase - different words, same meaning
print("\n3. Paraphrase: 'city-state financial growth' (looking for 'Lion City GDP' doc)")
results = bm25_search("city-state financial growth", top_k=3)
found = any("Lion City" in r['title'] for r in results)
if found:
    print("   Found it!")
else:
    print("   NOT FOUND - BM25 needs exact keywords!")
print("   Top results:", [r['title'][:30] for r in results])

In [None]:
# Demo: Where BM25 excels
print("=== BM25 Success Cases ===")

# Our BM25 PDF document
print("\n1. Exact technical term: 'BM25 term frequency saturation'")
print("   (This exact phrase is in our bm25_intro.pdf)")
results = bm25_search("BM25 term frequency saturation", top_k=3)
for i, r in enumerate(results, 1):
    marker = " <-- From our PDF!" if "BM25" in r['title'] else ""
    print(f"   {i}. {r['title']} (score: {r['score']:.2f}){marker}")

# Specific rare terms
print("\n2. Rare technical term: 'PageRank algorithm'")
results = bm25_search("PageRank algorithm", top_k=3)
for i, r in enumerate(results, 1):
    print(f"   {i}. {r['title']} (score: {r['score']:.2f})")

# Named entities
print("\n3. Specific product name: 'Elasticsearch'")
results = bm25_search("Elasticsearch", top_k=3)
for i, r in enumerate(results, 1):
    print(f"   {i}. {r['title']} (score: {r['score']:.2f})")

print("\n💡 BM25 excels when you know the exact keywords in the documents!")

## Part 4: Semantic Search with Embeddings

### What is Semantic Search?

Semantic search finds documents based on **meaning**, not just keywords.

The key insight: We can convert text into **vectors** (lists of numbers) where:
- Similar meanings → similar vectors
- Different meanings → different vectors

This is done using **neural network models** trained on massive amounts of text.

### How Embeddings Work

```
"Search engine"         → [0.12, -0.45, 0.78, ..., 0.33]    (384 numbers)
"Information retrieval" → [0.14, -0.42, 0.75, ..., 0.31]    (similar!)
"Chocolate cake"        → [-0.67, 0.23, -0.11, ..., -0.89]  (very different)
```

**Measuring similarity:**
- **Cosine similarity:** cosine of the angle between vectors (1 = same direction, 0 = unrelated, -1 = opposite)
- **L2 distance (Euclidean):** straight-line distance (0 = identical)

We use **FAISS** (Facebook AI Similarity Search) to efficiently find similar vectors among millions.
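
Both measures are short enough to write out by hand. The sketch below uses plain Python and made-up 3-dimensional vectors (the first three values from the illustration above) rather than real 384-dimensional embeddings, so the numbers are for intuition only.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_distance(a: list[float], b: list[float]) -> float:
    """Straight-line (Euclidean) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up 3-dimensional vectors; real embeddings here have 384 dimensions
search_engine = [0.12, -0.45, 0.78]
info_retrieval = [0.14, -0.42, 0.75]
chocolate_cake = [-0.67, 0.23, -0.11]

# The related pair has the higher cosine similarity...
print(cosine_similarity(search_engine, info_retrieval) >
      cosine_similarity(search_engine, chocolate_cake))  # True

# ...and the smaller L2 distance
print(l2_distance(search_engine, info_retrieval) <
      l2_distance(search_engine, chocolate_cake))        # True
```

The two measures point in opposite directions: a *higher* cosine similarity and a *lower* L2 distance both mean "more alike".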

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss

# Load embedding model
# all-MiniLM-L6-v2 is small (~90MB) and fast, but still effective
print("Loading embedding model...")
embed_model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded!")

# Test it
test_sentences = ["Search engine", "Information retrieval", "Chocolate cake"]
test_embeddings = embed_model.encode(test_sentences)

print(f"\nEmbedding dimension: {test_embeddings.shape[1]}")
print(f"Sample embedding (first 10 values): {test_embeddings[0][:10]}")

In [None]:
# Build FAISS index for semantic search
print("Encoding all documents...")

# Encode all document texts
doc_embeddings = embed_model.encode(doc_texts, show_progress_bar=True)
doc_embeddings = np.array(doc_embeddings).astype('float32')

# Build FAISS index
dimension = doc_embeddings.shape[1]
faiss_index = faiss.IndexFlatL2(dimension)  # L2 distance
faiss_index.add(doc_embeddings)

print(f"\nFAISS index built with {faiss_index.ntotal} vectors!")

In [None]:
def semantic_search(query: str, top_k: int = 5) -> list[dict]:
    """Search using semantic similarity."""
    # Encode query
    query_embedding = embed_model.encode([query]).astype('float32')
    
    # Search FAISS index
    distances, indices = faiss_index.search(query_embedding, top_k)
    
    results = []
    for i, idx in enumerate(indices[0]):
        # IndexFlatL2 returns *squared* L2 distances; map them to a (0, 1] score (higher = more similar)
        similarity = 1 / (1 + distances[0][i])
        results.append({
            'doc_id': doc_ids[idx],
            'title': doc_titles[idx],
            'score': similarity,
            'text_preview': doc_texts[idx][:200] + '...'
        })
    
    return results

In [None]:
# Test semantic search
test_queries = ["finding information on the internet", "how websites get ranked", "storing and retrieving data"]

for query in test_queries:
    print(f"\nQuery: '{query}'")
    print("-" * 50)
    results = semantic_search(query, top_k=3)
    for i, r in enumerate(results, 1):
        print(f"  {i}. {r['title']} (similarity: {r['score']:.3f})")

### Semantic Search Strengths & Weaknesses

**Strengths:**
- Understands synonyms ("car" finds "automobile")
- Handles paraphrasing ("how to find info" matches "information retrieval")
- Cross-lingual search possible with multilingual models
- Works well for conceptual/exploratory queries

**Weaknesses:**
- May miss exact keyword matches (looking for "BM25" might return general search articles)
- Less interpretable (hard to explain WHY something matched)
- Requires embedding model (more compute, ~100-500ms per query)
- Can struggle with rare proper nouns and technical terms

In [None]:
# Demo: Where Semantic Search shines
print("=== Semantic Search Success Cases ===")

# Same queries that BM25 failed - semantic succeeds!
print("\n1. Finding 'Lion City' doc with 'Singapore economy':")
print("   (Remember: BM25 failed because 'Singapore'/'economy' aren't in the doc)")
results = semantic_search("Singapore economy", top_k=3)
for i, r in enumerate(results, 1):
    marker = " <-- FOUND IT!" if "Lion City" in r['title'] else ""
    print(f"   {i}. {r['title']} (similarity: {r['score']:.3f}){marker}")

# Synonyms
print("\n2. Finding 'Automobiles' doc with 'car vehicle':")
print("   (BM25 failed because 'car'/'vehicle' aren't in the doc - it uses 'motor vehicle', 'automobile')")
results = semantic_search("car vehicle", top_k=3)
for i, r in enumerate(results, 1):
    marker = " <-- FOUND IT!" if "Automobile" in r['title'] else ""
    print(f"   {i}. {r['title']} (similarity: {r['score']:.3f}){marker}")

# Intent/meaning based
print("\n3. Intent-based: 'how websites get discovered and indexed'")
results = semantic_search("how websites get discovered and indexed", top_k=3)
for i, r in enumerate(results, 1):
    print(f"   {i}. {r['title']} (similarity: {r['score']:.3f})")

print("\n💡 Semantic search understands meaning, synonyms, and intent!")

In [None]:
# Demo: Where Semantic Search can struggle
print("=== Semantic Search Challenges ===")

# Very specific technical strings
print("\n1. Very specific phrase: 'k1 parameter 1.2 to 2.0'")
print("   (This exact text is in our BM25 PDF)")
bm25_results = bm25_search("k1 parameter 1.2 to 2.0", top_k=1)
sem_results = semantic_search("k1 parameter 1.2 to 2.0", top_k=1)
print(f"   BM25 top result: {bm25_results[0]['title']}")
print(f"   Semantic top result: {sem_results[0]['title']}")

# Code/technical jargon
print("\n2. Technical acronym: 'TF-IDF'")
bm25_results = bm25_search("TF-IDF", top_k=3)
sem_results = semantic_search("TF-IDF", top_k=3)
print(f"   BM25: {[r['title'][:25] for r in bm25_results]}")
print(f"   Semantic: {[r['title'][:25] for r in sem_results]}")

print("\n💡 For exact technical terms and specific strings, BM25 is often more precise!")
print("   This is why hybrid search combines both approaches.")

## Part 5: Hybrid Search - Best of Both Worlds

### Why Hybrid?

Neither BM25 nor semantic search is perfect:

| Query Type | BM25 | Semantic | Winner |
|------------|------|----------|--------|
| Exact term: "BM25 algorithm" | Great | OK | BM25 |
| Conceptual: "finding stuff online" | Poor | Great | Semantic |
| Mixed: "PageRank for SEO" | Good | Good | Tie |

**Hybrid search combines both** to get:
- Exact keyword matching when needed
- Semantic understanding for conceptual queries

**Who uses hybrid search?**
- Elasticsearch (8.0+)
- Vespa
- Weaviate


### Approach 1: Reciprocal Rank Fusion (RRF)

RRF is simple but effective. The idea:
1. Run both BM25 and semantic search
2. For each document, calculate: `score = 1 / (k + rank)`
3. Sum scores from both methods
4. Re-rank by combined score

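The `k` parameter (typically 60) dampens the advantage of top-ranked documents: a larger `k` flattens the score difference between ranks, so no single ranking dominates the result.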

**Why RRF works:**
- Documents that rank high in BOTH methods get boosted
- No need to normalize different score scales
- Only one hyperparameter (`k`), and the common default of 60 rarely needs tuning
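
To make the arithmetic concrete, here is a small worked sketch on two hypothetical rankings. The document IDs and the `rrf_scores` helper are made up for illustration, and ranks start at 1.

```python
def rrf_scores(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Sum 1 / (k + rank) over every ranking a document appears in (rank starts at 1)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

# Hypothetical top-3 lists from each method
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
semantic_ranking = ["doc_c", "doc_a", "doc_d"]

scores = rrf_scores([bm25_ranking, semantic_ranking])
fused = sorted(scores, key=scores.get, reverse=True)
for doc_id in fused:
    print(f"{doc_id}: {scores[doc_id]:.5f}")
```

Here `doc_a` (ranks 1 and 2) scores `1/61 + 1/62` and `doc_c` (ranks 3 and 1) scores `1/63 + 1/61`, so both land above `doc_b` and `doc_d`, which each appear in only one list.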

In [None]:
def reciprocal_rank_fusion(bm25_results: list[dict], semantic_results: list[dict], k: int = 60) -> list[dict]:
    """
    Combine BM25 and semantic results using Reciprocal Rank Fusion.
    
    Args:
        bm25_results: List of results from BM25 search
        semantic_results: List of results from semantic search
        k: RRF constant (typically 60)
    
    Returns:
        Combined results sorted by RRF score
    """
    rrf_scores = {}
    doc_info = {}  # Store document info for results
    
    # Add BM25 contributions
    for rank, result in enumerate(bm25_results):
        doc_id = result['doc_id']
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank + 1)
        doc_info[doc_id] = result
    
    # Add semantic contributions
    for rank, result in enumerate(semantic_results):
        doc_id = result['doc_id']
        rrf_scores[doc_id] = rrf_scores.get(doc_id, 0) + 1 / (k + rank + 1)
        if doc_id not in doc_info:
            doc_info[doc_id] = result
    
    # Sort by RRF score
    sorted_docs = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    
    results = []
    for doc_id, score in sorted_docs:
        results.append({
            'doc_id': doc_id,
            'title': doc_info[doc_id]['title'],
            'score': score,
            'text_preview': doc_info[doc_id]['text_preview']
        })
    
    return results

### Approach 2: Weighted Score Combination

Another approach is to directly combine scores:

```
final_score = alpha * normalize(bm25_score) + (1 - alpha) * normalize(semantic_score)
```

Where `alpha` controls the balance:
- `alpha = 1.0` → pure BM25
- `alpha = 0.0` → pure semantic
- `alpha = 0.5` → equal weight

**Challenge:** BM25 and semantic scores are on different scales, so we need to normalize them.
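
A minimal sketch of the idea on made-up scores: the raw numbers, document IDs, and the `min_max` helper are assumptions for illustration only. Min-max normalization rescales each score list to the 0-1 range before the weighted sum.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to the 0-1 range; if all scores are equal, map them to 1.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

# Hypothetical raw scores on very different scales
bm25_raw = {"doc_a": 12.4, "doc_b": 7.1, "doc_c": 1.8}
semantic_raw = {"doc_a": 0.41, "doc_c": 0.62, "doc_d": 0.55}

bm25_norm = min_max(bm25_raw)
semantic_norm = min_max(semantic_raw)

alpha = 0.5  # equal weight for both methods
combined = {
    doc: alpha * bm25_norm.get(doc, 0.0) + (1 - alpha) * semantic_norm.get(doc, 0.0)
    for doc in set(bm25_norm) | set(semantic_norm)
}
for doc, score in sorted(combined.items(), key=lambda item: item[1], reverse=True):
    print(f"{doc}: {score:.3f}")
```

Two side effects are worth noticing: the lowest-scoring document in each list is mapped to exactly 0, and a document returned by only one method contributes 0 for the other.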

In [None]:
def normalize_scores(results: list[dict]) -> list[dict]:
    """Add a min-max 'normalized_score' (0-1 range) to each result, in place."""
    if not results:
        return results
    
    scores = [r['score'] for r in results]
    min_score, max_score = min(scores), max(scores)
    
    if max_score == min_score:
        # All scores are the same
        for r in results:
            r['normalized_score'] = 1.0
    else:
        for r in results:
            r['normalized_score'] = (r['score'] - min_score) / (max_score - min_score)
    
    return results

def weighted_hybrid_search(query: str, alpha: float = 0.5, top_k: int = 10) -> list[dict]:
    """
    Combine BM25 and semantic search with weighted scores.
    
    Args:
        query: Search query
        alpha: Weight for BM25 (1-alpha for semantic)
        top_k: Number of results
    
    Returns:
        Combined results sorted by weighted score
    """
    # Get results from both methods
    bm25_results = normalize_scores(bm25_search(query, top_k=top_k))
    semantic_results = normalize_scores(semantic_search(query, top_k=top_k))
    
    # Build score dictionaries
    bm25_scores = {r['doc_id']: r['normalized_score'] for r in bm25_results}
    semantic_scores = {r['doc_id']: r['normalized_score'] for r in semantic_results}
    doc_info = {r['doc_id']: r for r in bm25_results + semantic_results}
    
    # Get all unique doc_ids
    all_docs = set(bm25_scores.keys()) | set(semantic_scores.keys())
    
    # Calculate weighted scores
    combined_scores = {}
    for doc_id in all_docs:
        bm25_s = bm25_scores.get(doc_id, 0)
        semantic_s = semantic_scores.get(doc_id, 0)
        combined_scores[doc_id] = alpha * bm25_s + (1 - alpha) * semantic_s
    
    # Sort and return
    sorted_docs = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    
    results = []
    for doc_id, score in sorted_docs:
        results.append({
            'doc_id': doc_id,
            'title': doc_info[doc_id]['title'],
            'score': score,
            'text_preview': doc_info[doc_id]['text_preview']
        })
    
    return results

In [None]:
def hybrid_search(query: str, method: str = 'rrf', alpha: float = 0.5, top_k: int = 5) -> list[dict]:
    """
    Unified hybrid search function.
    
    Args:
        query: Search query
        method: 'rrf' for Reciprocal Rank Fusion, 'weighted' for weighted combination
        alpha: Weight for BM25 (only used with 'weighted' method)
        top_k: Number of results
    """
    if method == 'rrf':
        bm25_results = bm25_search(query, top_k=top_k*2)  # Get more for fusion
        semantic_results = semantic_search(query, top_k=top_k*2)
        return reciprocal_rank_fusion(bm25_results, semantic_results)[:top_k]
    else:
        return weighted_hybrid_search(query, alpha=alpha, top_k=top_k)

In [None]:
# Compare all three approaches
def compare_search_methods(query: str, comment: str = "") -> None:
    """Compare BM25, Semantic, and Hybrid search results."""
    print(f"\nQuery: '{query}'")
    if comment:
        print(f"  ({comment})")
    print("=" * 70)
    
    bm25_results = bm25_search(query, top_k=3)
    semantic_results = semantic_search(query, top_k=3)
    hybrid_results = hybrid_search(query, method='rrf', top_k=3)
    
    print(f"{'Rank':<5} {'BM25':<25} {'Semantic':<25} {'Hybrid (RRF)':<25}")
    print("-" * 70)
    
    def shorten(results: list[dict], i: int) -> str:
        """Return the i-th title truncated to fit its column, or '-' if there is no i-th result."""
        if i >= len(results):
            return "-"
        title = results[i]['title']
        return title[:22] + ".." if len(title) > 22 else title

    for i in range(3):
        print(f"{i+1:<5} {shorten(bm25_results, i):<25} {shorten(semantic_results, i):<25} {shorten(hybrid_results, i):<25}")

# Test queries that clearly show different behaviors
print("=" * 70)
print("COMPARING SEARCH METHODS")
print("=" * 70)

# Query 1: Both work well (baseline)
compare_search_methods("search engine", "Both methods should work - exact keywords + clear intent")

# Query 2: Semantic wins - no keyword overlap
compare_search_methods("Singapore economy", "Semantic wins: 'Lion City GDP' doc has no 'Singapore'/'economy' keywords")

# Query 3: BM25 wins - exact technical terms from our PDF
compare_search_methods("BM25 term frequency", "BM25 wins: exact technical phrase from our PDF")

# Query 4: Semantic wins - synonyms/paraphrase  
compare_search_methods("car vehicle travel", "Semantic wins: 'Automobiles' doc uses different words")

In [None]:
print("\n" + "=" * 70)
print("CASE STUDY: Why Hybrid Search Wins")
print("=" * 70)

# Example where each method has partial success
query = "Singapore economy growth"
print(f"\nQuery: '{query}'")
print("\nWe have a document 'GDP Growth in the Lion City' about Singapore's economy.")
print("Challenge: It doesn't contain 'Singapore' or 'economy' as keywords!\n")

bm25_results = bm25_search(query, top_k=5)
semantic_results = semantic_search(query, top_k=5)
hybrid_results = hybrid_search(query, method='rrf', top_k=5)

# Check where Lion City appears in each
def find_position(results, keyword):
    for i, r in enumerate(results):
        if keyword in r['title']:
            return i + 1
    return None

lion_bm25 = find_position(bm25_results, "Lion City")
lion_sem = find_position(semantic_results, "Lion City")  
lion_hyb = find_position(hybrid_results, "Lion City")

print("BM25 results:")
for i, r in enumerate(bm25_results[:3], 1):
    marker = " <-- Target doc!" if "Lion City" in r['title'] else ""
    print(f"  {i}. {r['title']}{marker}")
print(f"  → Lion City doc position: {lion_bm25 if lion_bm25 else 'Not in top 5'}")

print("\nSemantic results:")
for i, r in enumerate(semantic_results[:3], 1):
    marker = " <-- Target doc!" if "Lion City" in r['title'] else ""
    print(f"  {i}. {r['title']}{marker}")
print(f"  → Lion City doc position: {lion_sem if lion_sem else 'Not in top 5'}")

print("\nHybrid (RRF) results:")
for i, r in enumerate(hybrid_results[:3], 1):
    marker = " <-- Target doc!" if "Lion City" in r['title'] else ""
    print(f"  {i}. {r['title']}{marker}")
print(f"  → Lion City doc position: {lion_hyb if lion_hyb else 'Not in top 5'}")

print("\n💡 Key Insight:")
print("   - BM25 struggles: at most 'growth' can match, since 'Singapore' and 'economy' are not in the doc")
print("   - Semantic understands 'Singapore' = 'Lion City' conceptually")
print("   - Hybrid benefits from semantic's understanding")

## Part 6: When to Use What - Decision Framework

### Quick Decision Guide

| Use Case | Recommended | Why |
|----------|-------------|-----|
| Exact product SKU/ID lookup | **BM25** | Need exact matches |
| "Find similar articles" | **Semantic** | Conceptual similarity |
| General content search | **Hybrid** | Best of both worlds |
| Low latency required (<10ms) | **BM25** | No embedding computation |
| Multilingual content | **Semantic** | Models understand multiple languages |
| Technical documentation | **BM25** | Exact term matching important |
| E-commerce product search | **Hybrid** | Users use various terms |
| Legal/medical search | **Hybrid** | Precision + understanding |

### Real-World Examples

**Google Search:**
- Uses hybrid approach: BERT for semantic understanding + traditional signals (links, keywords)
- Introduced "neural matching" in 2018, then BERT in 2019

**Elasticsearch:**
- Default: BM25
- 8.0+: Added kNN search for semantic/vector search
- Hybrid: combine a `knn` clause with a regular `query` in the same search request (later 8.x releases added built-in RRF)

- Sophisticated hybrid search across billions of pages
- Keyword filters + relevance scoring
- Finds content "about" topics even with different wording

**ChatGPT Retrieval (RAG):**
- Primarily semantic search on embeddings
- Some implementations add BM25 for keyword grounding

**How hybrid search powers it:**

1. **Keyword Matching (BM25-like):**
   - Exact phrase matches
   - "In title" filter
   - Specific word requirements

2. **Relevance Scoring (Semantic-like):**
   - Finds content "about" a topic
   - Understands related concepts
   - Ranks by topical relevance

3. **Authority Signals (Beyond search):**
   - Domain Rating
   - Traffic estimates
   - Social shares

**Why SEO professionals need both:**
- Find exact competitor articles (BM25 - specific keywords)
- Discover related content opportunities (Semantic - topical exploration)
- Content gap analysis requires understanding BOTH

In [None]:
# Interactive comparison with different alpha values
print("Effect of alpha on weighted hybrid search")
print("(alpha=1.0 is pure BM25, alpha=0.0 is pure Semantic)")
print("=" * 60)

query = "finding information online"
print(f"\nQuery: '{query}'\n")

for alpha in [1.0, 0.7, 0.5, 0.3, 0.0]:
    results = hybrid_search(query, method='weighted', alpha=alpha, top_k=3)
    print(f"alpha={alpha}: {results[0]['title'][:40]}...")

## Part 7: Document Chunking for RAG

Now let's build something practical: a **Smart Flashcard Generator**!

To do this, we need to understand **RAG (Retrieval-Augmented Generation)**:
1. **Retrieve** relevant content using our hybrid search
2. **Augment** the LLM prompt with this context
3. **Generate** flashcards using an LLM

### Why Chunking Matters

Before we can use our documents with an LLM, we need to **chunk** them into smaller pieces:

- **LLMs have context limits** - we can't send entire Wikipedia pages
- **Smaller chunks = more precise retrieval** - find the exact relevant paragraph
- **Overlapping chunks** prevent losing context at chunk boundaries

Think of it like this: Instead of searching for "which book has info about X", we search for "which paragraph explains X".
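
To see what overlap does at a chunk boundary, here is a minimal sketch using fixed-size character windows. This is a simpler scheme than the sentence-based chunking used in this notebook, and the `sliding_window_chunks` helper and its default sizes are assumptions chosen for illustration.

```python
def sliding_window_chunks(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Fixed-size character windows; each window starts with the last `overlap` characters of the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining text is already covered by the previous window
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + size] for start in range(0, last_start, step)]

sample = "abcdefghij" * 7  # 70 characters
windows = sliding_window_chunks(sample, size=40, overlap=10)
for i, window in enumerate(windows, 1):
    print(f"Window {i} ({len(window)} chars): {window}")
```

The last 10 characters of each window reappear at the start of the next, so a fact that straddles a boundary still shows up whole in at least one chunk. Splitting on characters can cut words in half, which is one reason to prefer sentence boundaries.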

In [None]:
# NOTE: There are more sophisticated chunking methods (e.g., LangChain's RecursiveCharacterTextSplitter,
# semantic chunking, etc.) but we'll keep it simple with sentence-based chunking.

def chunk_text(text: str, target_size: int = 1000, overlap_sentences: int = 2) -> list[str]:
    """
    Split text into chunks at sentence boundaries.
    
    Args:
        text: The text to chunk
        target_size: Target size of each chunk (in characters)
        overlap_sentences: Number of sentences to overlap between chunks
    
    Returns:
        List of text chunks
    """
    # Simple sentence splitting (handles . ! ?)
    import re
    sentences = re.split(r'(?<=[.!?])\s+', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    
    if not sentences:
        return [text] if text.strip() else []
    
    chunks = []
    current_chunk = []
    current_size = 0
    
    for sentence in sentences:
        # If adding this sentence exceeds target and we have content, start new chunk
        if current_size + len(sentence) > target_size and current_chunk:
            chunks.append(' '.join(current_chunk))
            # Keep last N sentences for overlap (context continuity)
            current_chunk = current_chunk[-overlap_sentences:] if overlap_sentences else []
            current_size = sum(len(s) for s in current_chunk)
        
        current_chunk.append(sentence)
        current_size += len(sentence)
    
    # Don't forget the last chunk
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks

# Test the chunking function
test_text = "This is the first sentence. Here comes the second one! And what about the third? Finally, the fourth sentence arrives."
test_chunks = chunk_text(test_text, target_size=60, overlap_sentences=1)
print("Test chunks (with sentence boundaries):")
for i, chunk in enumerate(test_chunks):
    print(f"  Chunk {i+1}: '{chunk}'")


In [None]:
# Chunk all documents and track metadata
chunks = []
chunk_metadata = []

for doc_id, doc in documents.items():
    doc_chunks = chunk_text(doc['text'], target_size=1000, overlap_sentences=2)
    for i, chunk in enumerate(doc_chunks):
        chunks.append(chunk)
        chunk_metadata.append({
            'doc_id': doc_id,
            'title': doc['title'],
            'chunk_idx': i,
            'source': doc.get('source', 'unknown'),
            'url': doc.get('url') or (doc_id if doc_id.startswith('http') else None)
        })

print(f"Created {len(chunks)} chunks from {len(documents)} documents")
print(f"Average chunk length: {sum(len(c) for c in chunks) / len(chunks):.0f} characters")

# Show example chunks from one document
print(f"\nExample chunks from '{chunk_metadata[0]['title']}':")
for i, chunk in enumerate(chunks[:3]):
    print(f"  Chunk {i+1}: {chunk[:80]}...")

In [None]:
# Rebuild BM25 index on chunks
tokenized_chunks = [tokenize(chunk) for chunk in chunks]
chunk_bm25_index = BM25Okapi(tokenized_chunks)
print(f"BM25 index rebuilt with {len(chunks)} chunks")

# Rebuild FAISS index on chunks
print("Encoding chunks for semantic search...")
chunk_embeddings = embed_model.encode(chunks, show_progress_bar=True)
chunk_embeddings = np.array(chunk_embeddings).astype('float32')

chunk_faiss_index = faiss.IndexFlatL2(chunk_embeddings.shape[1])
chunk_faiss_index.add(chunk_embeddings)
print(f"FAISS index rebuilt with {chunk_faiss_index.ntotal} chunk vectors")

In [None]:
def bm25_search_chunks(query: str, top_k: int = 5) -> list[dict]:
    """Search chunks using BM25."""
    tokenized_query = tokenize(query)
    scores = chunk_bm25_index.get_scores(tokenized_query)
    top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    
    results = []
    for idx in top_indices:
        results.append({
            'chunk_idx': idx,
            'text': chunks[idx],
            'score': scores[idx],
            'title': chunk_metadata[idx]['title'],
            'doc_id': chunk_metadata[idx]['doc_id']
        })
    return results

def semantic_search_chunks(query: str, top_k: int = 5) -> list[dict]:
    """Search chunks using semantic similarity."""
    query_embedding = embed_model.encode([query]).astype('float32')
    distances, indices = chunk_faiss_index.search(query_embedding, top_k)
    
    results = []
    for i, idx in enumerate(indices[0]):
        similarity = 1 / (1 + distances[0][i])
        results.append({
            'chunk_idx': int(idx),  # FAISS returns numpy ints; cast so keys match the BM25 results
            'text': chunks[idx],
            'score': similarity,
            'title': chunk_metadata[idx]['title'],
            'doc_id': chunk_metadata[idx]['doc_id']
        })
    return results

def hybrid_search_chunks(query: str, top_k: int = 5, k: int = 60) -> list[dict]:
    """Hybrid search on chunks using RRF."""
    bm25_results = bm25_search_chunks(query, top_k=top_k*2)
    semantic_results = semantic_search_chunks(query, top_k=top_k*2)
    
    rrf_scores = {}
    chunk_info = {}
    
    for rank, result in enumerate(bm25_results):
        idx = result['chunk_idx']
        rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (k + rank + 1)
        chunk_info[idx] = result
    
    for rank, result in enumerate(semantic_results):
        idx = result['chunk_idx']
        rrf_scores[idx] = rrf_scores.get(idx, 0) + 1 / (k + rank + 1)
        if idx not in chunk_info:
            chunk_info[idx] = result
    
    sorted_chunks = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
    
    results = []
    for idx, score in sorted_chunks:
        result = chunk_info[idx].copy()
        result['score'] = score
        results.append(result)
    
    return results

# Test chunk search
print("Testing chunk-based hybrid search for 'search engine ranking':")
results = hybrid_search_chunks("search engine ranking", top_k=3)
for i, r in enumerate(results, 1):
    print(f"\n{i}. From '{r['title']}':")
    print(f"   {r['text'][:150]}...")

## Part 8: LLM-Powered Flashcard Generation

**RAG (Retrieval-Augmented Generation)** is a technique that combines:
1. **Retrieval** - Find relevant context using search (our hybrid search!)
2. **Augmentation** - Add this context to the LLM prompt
3. **Generation** - LLM generates output using the context

This is how tools like ChatGPT with browsing, Perplexity, and AI writing assistants work!
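
The augmentation step is mostly string assembly. A minimal sketch, assuming a hypothetical `build_rag_prompt` helper and made-up chunk texts (neither is part of this notebook's pipeline):

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str], max_chars: int = 1500) -> str:
    """Number the retrieved chunks, trim them to a character budget, and place them before the question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1))
    context = context[:max_chars]  # crude guard against overly long prompts
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Made-up chunks standing in for hybrid search results
chunks_found = [
    "A web crawler systematically browses the web to discover pages.",
    "Crawled pages are stored in an index so they can be searched quickly.",
]
prompt = build_rag_prompt("What does a web crawler do?", chunks_found)
print(prompt)
```

Numbering the chunks makes it easier to trace an answer back to its source; the character cap is a stand-in for a proper token budget.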

### Our Model: Qwen3-8B

We'll use **Qwen3-8B**, a capable open-weight LLM released in 2025:
- **High quality** - Competitive with considerably larger models on many public benchmarks
- **Runs locally** - No API keys needed
- **Works everywhere** - Colab (GPU) and Mac (Apple Silicon)

**Memory optimization:**
- On Colab: Uses 4-bit quantization (roughly 4GB of VRAM for the weights, plus runtime overhead), which fits on the free tier
- On Mac M4: Uses half precision (float16) with unified memory

⚠️ **First run downloads the model (~8GB). This takes a few minutes.**

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Detect environment
def get_device_config() -> dict:
    """Detect the best device and configuration for the current environment."""
    if torch.cuda.is_available():
        # Colab or NVIDIA GPU - use 4-bit quantization
        print("🖥️  NVIDIA GPU detected - using 4-bit quantization")
        from transformers import BitsAndBytesConfig
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.float16
        )
        return {"quantization_config": bnb_config, "device_map": "auto"}
    
    elif torch.backends.mps.is_available():
        # Mac Apple Silicon
        print("🍎 Apple Silicon detected - using MPS")
        return {"device_map": "auto", "torch_dtype": torch.float16}
    
    else:
        # CPU fallback
        print("💻 CPU mode - this will be slow")
        return {"device_map": "auto", "torch_dtype": torch.float32}

# Load the model
MODEL_NAME = "Qwen/Qwen3-8B"

print(f"Loading {MODEL_NAME}...")
print("(First run downloads ~8GB - this takes a few minutes)")
print()

device_config = get_device_config()

llm_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
llm_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    **device_config
)

print()
print("✅ Model loaded successfully!")

In [None]:
def generate_text(prompt: str, max_new_tokens: int = 200) -> str:
    """Generate text using Qwen3."""
    messages = [{"role": "user", "content": prompt}]
    
    text = llm_tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False  # Disable thinking mode for faster responses
    )
    
    inputs = llm_tokenizer([text], return_tensors="pt").to(llm_model.device)
    
    outputs = llm_model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        pad_token_id=llm_tokenizer.eos_token_id
    )
    
    # Decode only the new tokens
    generated = outputs[0][inputs.input_ids.shape[1]:]
    return llm_tokenizer.decode(generated, skip_special_tokens=True).strip()

# Test the model
print("Testing the model...")
response = generate_text("What is a search engine? Answer in one sentence.")
print(f"Response: {response}")

In [None]:
def generate_flashcard(topic: str, context: str, source_title: str | None = None, 
                       source_url: str | None = None, previous_questions: list[str] | None = None) -> dict:
    """Generate a study flashcard using Qwen3."""
    
    # Build the prompt
    avoid_section = ""
    if previous_questions:
        avoid_section = f"""
IMPORTANT: Do NOT ask about these topics (already covered):
{chr(10).join(f'- {q[:80]}' for q in previous_questions[-5:])}

Ask about something DIFFERENT from the above."""

    prompt = f"""Create a study flashcard about "{topic}" based on this context.

Context: {context[:1500]}
{avoid_section}
Create ONE flashcard with:
- A clear question that tests understanding of a key concept
- A complete answer (1-3 sentences)

Format your response EXACTLY like this:
Question: [your question]
Answer: [your answer]"""

    response = generate_text(prompt, max_new_tokens=250)
    
    # Parse the response
    question = ""
    answer = ""
    
    # Look for Question: and Answer: markers
    if "Question:" in response and "Answer:" in response:
        parts = response.split("Answer:")
        question_part = parts[0]
        answer_part = parts[1] if len(parts) > 1 else ""
        
        # Extract question
        if "Question:" in question_part:
            question = question_part.split("Question:")[-1].strip()
        
        # Extract answer
        answer = answer_part.strip()
        # Clean up: stop at double newlines or additional markers
        for stop in ["\n\n", "\nQuestion:", "\nNote:", "\n---"]:
            if stop in answer:
                answer = answer.split(stop)[0].strip()
    
    # Fallback parsing
    if not question or not answer:
        if "?" in response:
            idx = response.rindex("?")
            question = response[:idx+1].strip()
            answer = response[idx+1:].strip()
    
    # Build the card
    card = {
        'question': question if question else "(Generation failed - please retry)",
        'answer': answer if answer else "(No answer generated)",
        'topic': topic
    }
    
    # Include source with URL if available
    if source_url:
        card['source'] = source_url  # Direct link
        card['source_title'] = source_title
    else:
        card['source'] = source_title or 'Custom'
        card['source_title'] = source_title
    
    return card

# Test flashcard generation
test_context = """A search engine is a software system designed to carry out web searches. 
They search the World Wide Web in a systematic way for particular information specified in a textual web search query. 
The search results are generally presented in a line of results, often referred to as search engine results pages (SERPs).
Google is the most widely used search engine, processing over 8.5 billion searches per day."""

print("Generating test flashcard...")
card = generate_flashcard("search engines", test_context, "Wikipedia", "https://en.wikipedia.org/wiki/Search_engine")
print(f"\n✅ Generated flashcard:")
print(f"   Q: {card['question']}")
print(f"   A: {card['answer']}")
print(f"   Source: {card['source']}")

In [None]:
def generate_flashcards_for_topic(topic: str, num_cards: int = 5) -> list[dict]:
    """
    Full RAG pipeline: Retrieve relevant chunks → Generate flashcards.
    
    Args:
        topic: The topic to generate flashcards for
        num_cards: Number of flashcards to generate
    
    Returns:
        List of flashcard dictionaries
    """
    print(f"Generating {num_cards} flashcards for '{topic}'...")
    
    # Step 1: Retrieve relevant chunks using hybrid search
    retrieved_chunks = hybrid_search_chunks(topic, top_k=num_cards * 3)
    print(f"  Retrieved {len(retrieved_chunks)} relevant chunks")
    
    # Step 2: Generate flashcard from each unique chunk
    flashcards = []
    seen_sources = set()  # Avoid duplicate sources
    previous_questions = []  # Track questions to avoid duplicates
    
    for chunk in retrieved_chunks:
        if len(flashcards) >= num_cards:
            break
        
        # Skip if we already have a card from nearby chunks in same document
        doc_key = (chunk['doc_id'], chunk.get('chunk_idx', 0) // 3)
        if doc_key in seen_sources:
            continue
        seen_sources.add(doc_key)
        
        # Generate flashcard, passing previous questions for diversity
        print(f"  Generating card from '{chunk['title']}'...")
        card = generate_flashcard(
            topic, 
            chunk['text'], 
            chunk['title'], 
            chunk_metadata[chunk['chunk_idx']].get('url'),  # search results carry no 'url'; look it up in the metadata
            previous_questions=previous_questions
        )
        
        # Track this question for diversity
        if card['question'] and not card['question'].startswith("("):
            previous_questions.append(card['question'])
        
        flashcards.append(card)
    
    print(f"Generated {len(flashcards)} flashcards!")
    return flashcards

In [None]:
# Test the full RAG pipeline!
topic = "web crawler"
flashcards = generate_flashcards_for_topic(topic, num_cards=3)

print("\n" + "=" * 60)
print(f"FLASHCARDS FOR: {topic}")
print("=" * 60)

for i, card in enumerate(flashcards, 1):
    print(f"\n📝 Card {i}")
    print(f"   Q: {card['question']}")
    print(f"   A: {card['answer']}")
    print(f"   Source: {card.get('source_title', 'Unknown')}")
    if card.get('source', '').startswith('http'):
        print(f"   URL: {card['source']}")

### Tips for Better Flashcards

**Topic selection:**
- More specific topics = better cards ("PageRank algorithm" vs "search")
- Use the same terminology as your source documents

**Quality depends on:**
- Quality of source documents (garbage in = garbage out)
- How well the hybrid search retrieves relevant content
- The LLM's ability to extract key facts

**After generation:**
- Review and edit cards before studying
- Remove duplicates or low-quality cards
- Add your own cards for topics not well covered

**Pro tip:** The flashcards work best when studying content you've already added to the corpus!
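
The "remove duplicates" step can be partly automated. The sketch below is one possible approach, not part of the workshop pipeline: it assumes each card is a dict with a `question` key (as in the cards generated above) and uses the standard library's `difflib.SequenceMatcher` to drop cards whose question is nearly identical to one already kept. The helper name `drop_near_duplicates` and the 0.85 threshold are illustrative assumptions; adjust the threshold for your own deck.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(cards: list[dict], threshold: float = 0.85) -> list[dict]:
    """Keep the first card from each group of near-identical questions."""
    kept = []
    for card in cards:
        question = card["question"].strip().lower()
        # Compare against every card we have already decided to keep
        is_duplicate = any(
            SequenceMatcher(None, question, k["question"].strip().lower()).ratio() >= threshold
            for k in kept
        )
        if not is_duplicate:
            kept.append(card)
    return kept


# Hypothetical sample cards, for illustration only
sample_cards = [
    {"question": "What is a web crawler?", "answer": "A program that browses the web to index pages."},
    {"question": "What is a web crawler ?", "answer": "A bot that downloads pages for indexing."},
    {"question": "What does PageRank measure?", "answer": "The importance of a page based on its links."},
]
unique_cards = drop_near_duplicates(sample_cards)
print(f"Kept {len(unique_cards)} of {len(sample_cards)} cards")
```

String similarity only catches questions that are worded almost the same; two cards that ask the same thing in different words still need a manual review.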

In [None]:
def generate_flashcards_batch(topics: list[str], cards_per_topic: int = 3) -> list[dict]:
    """Generate flashcards for multiple topics."""
    all_cards = []
    
    for topic in topics:
        cards = generate_flashcards_for_topic(topic, num_cards=cards_per_topic)
        all_cards.extend(cards)
        print()  # Add spacing between topics
    
    return all_cards

# Generate cards for multiple topics
study_topics = ["search engine", "web indexing", "Google"]
all_flashcards = generate_flashcards_batch(study_topics, cards_per_topic=2)

print(f"\n{'='*60}")
print(f"Generated {len(all_flashcards)} total flashcards across {len(study_topics)} topics!")

## Part 9: Export to Anki & CSV

Now let's make our flashcards actually useful! We'll export them to:

1. **CSV** - Universal format that most flashcard apps can import
2. **Anki (.apkg)** - Direct import into [Anki](https://apps.ankiweb.net/), the most popular flashcard app
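
One reason CSV works well here is that Python's `csv` module quotes fields for you, so questions and answers containing commas, quotes, or line breaks survive a round trip. The following is a small self-contained check of that behavior using an in-memory buffer; the sample card is made up for illustration.

```python
import csv
import io

# Hypothetical card with characters that would break naive comma-joining
card = {
    "question": 'What does "BM25" rank, and how?',
    "answer": "Documents, by term frequency\nand document length.",
}

# Write to an in-memory buffer instead of a file (newline="" as the csv docs recommend)
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Question", "Answer"])
writer.writerow([card["question"], card["answer"]])

# Read it back and confirm nothing was lost
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(rows[0]["Question"] == card["question"], rows[0]["Answer"] == card["answer"])
```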

### Why Anki?

Anki uses **spaced repetition**: it schedules each card for review shortly before you are likely to forget it. Research on the spacing effect consistently finds this far more efficient for long-term memorization than cramming!

- Free and open source on desktop and Android (the iOS app is a paid purchase)
- Available on all major platforms (desktop, mobile, web)
- Used by medical students, language learners, and programmers worldwide
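
To make "just before you'd forget them" concrete, here is a minimal sketch of an SM-2-style scheduler, the algorithm family that Anki's original scheduler was based on. It is a simplification for illustration only, not Anki's actual implementation: the function name `next_review` is made up, and the first intervals (1 and 6 days), the starting ease of 2.5, and the minimum ease of 1.3 follow the classic SM-2 description.

```python
def next_review(interval_days: int, ease: float, repetitions: int, quality: int):
    """Return (new_interval_days, new_ease, new_repetitions) after one review.

    quality is a self-rating from 0 (total blank) to 5 (perfect recall).
    """
    if quality < 3:
        # Failed recall: start over tomorrow, keep the ease unchanged
        return 1, ease, 0

    if repetitions == 0:
        new_interval = 1
    elif repetitions == 1:
        new_interval = 6
    else:
        # Each successful review multiplies the gap by the ease factor
        new_interval = round(interval_days * ease)

    # Easy answers raise the ease, hesitant answers lower it (never below 1.3)
    new_ease = max(1.3, ease + 0.1 - (5 - quality) * (0.08 + (5 - quality) * 0.02))
    return new_interval, new_ease, repetitions + 1


# Simulate a card that is always recalled perfectly (quality 5)
interval, ease, reps = 0, 2.5, 0
for review in range(1, 5):
    interval, ease, reps = next_review(interval, ease, reps, quality=5)
    print(f"Review {review}: next review in {interval} day(s), ease {ease:.2f}")
```

The gaps grow roughly geometrically, which is why a well-known card costs only a handful of reviews per year while a forgotten card returns the next day.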

In [None]:
import csv

def export_to_csv(flashcards: list[dict], filename: str = "flashcards.csv") -> str:
    """
    Export flashcards to CSV format.
    
    CSV can be imported into most flashcard apps including Anki, Quizlet, etc.
    """
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Question', 'Answer', 'Topic', 'Source', 'URL'])
        
        for card in flashcards:
            writer.writerow([
                card['question'],
                card['answer'],
                card.get('topic', ''),
                card.get('source_title', card.get('source', '')),
                card.get('source', '') if str(card.get('source', '')).startswith('http') else ''
            ])
    
    print(f"✅ Exported {len(flashcards)} cards to {filename}")
    return filename

# Export our flashcards to CSV
csv_file = export_to_csv(all_flashcards, "study_flashcards.csv")
print(f"\nCSV file created! You can import this into Quizlet, Anki, or any flashcard app.")

In [None]:
import genanki
import random

# Define the Anki note model (card template)
FLASHCARD_MODEL = genanki.Model(
    1483920567,  # Fixed model ID, generated once with random.randrange(1 << 30, 1 << 31); genanki recommends hardcoding it
    'Simple Q&A',
    fields=[
        {'name': 'Question'},
        {'name': 'Answer'},
        {'name': 'Topic'},
        {'name': 'Source'},
    ],
    templates=[
        {
            'name': 'Card 1',
            'qfmt': '''
                <div style="font-size: 20px; text-align: center;">
                    {{Question}}
                </div>
                <div style="font-size: 12px; color: #666; margin-top: 20px;">
                    Topic: {{Topic}}
                </div>
            ''',
            'afmt': '''
                {{FrontSide}}
                <hr>
                <div style="font-size: 18px; text-align: center;">
                    {{Answer}}
                </div>
                <div style="font-size: 10px; color: #999; margin-top: 20px;">
                    Source: {{Source}}
                </div>
            ''',
        },
    ],
    css='''
        .card {
            font-family: Arial, sans-serif;
            background-color: #fafafa;
            padding: 20px;
        }
    '''
)

def export_to_anki(flashcards: list[dict], filename: str = "flashcards.apkg", deck_name: str = "Study Deck") -> str:
    """
    Export flashcards to Anki format (.apkg).
    
    The .apkg file can be directly imported into Anki.
    """
    # Create a new deck
    deck = genanki.Deck(
        random.randrange(1 << 30, 1 << 31),  # Unique deck ID
        deck_name
    )
    
    # Add each flashcard as a note
    for card in flashcards:
        note = genanki.Note(
            model=FLASHCARD_MODEL,
            fields=[
                card['question'],
                card['answer'],
                card.get('topic', ''),
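                # Prefer the readable title (as in the CSV export); never pass None to genanki
                card.get('source_title') or card.get('source') or ''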
            ]
        )
        deck.add_note(note)
    
    # Create the package and save
    package = genanki.Package(deck)
    package.write_to_file(filename)
    
    print(f"✅ Exported {len(flashcards)} cards to {filename}")
    return filename

In [None]:
# Export to Anki format
anki_file = export_to_anki(all_flashcards, "search_study_deck.apkg", "Search Engines Study Deck")

print("\n📁 Files created:")
print(f"   - {csv_file} (for Quizlet, Google Sheets, etc.)")
print(f"   - {anki_file} (for Anki app)")

In [None]:
# process_pdf() and add_pdf_document() are defined in Part 2
# Here we add a function to rebuild indices after adding new documents

def add_notes_to_corpus(title: str, text: str) -> str:
    """Add user notes to the document corpus and rebuild indices."""
    global documents, chunks, chunk_metadata, chunk_bm25_index, chunk_embeddings, chunk_faiss_index
    global doc_ids, doc_texts, doc_titles, tokenized_docs, bm25_index, doc_embeddings, faiss_index
    
    # Add to documents
    doc_id = add_custom_document(title, text, source="user_notes")
    
    # Rebuild document-level indices
    doc_ids = list(documents.keys())
    doc_texts = [documents[d]['text'] for d in doc_ids]
    doc_titles = [documents[d]['title'] for d in doc_ids]
    tokenized_docs = [tokenize(text) for text in doc_texts]
    bm25_index = BM25Okapi(tokenized_docs)
    doc_embeddings = embed_model.encode(doc_texts).astype('float32')
    faiss_index = faiss.IndexFlatL2(doc_embeddings.shape[1])
    faiss_index.add(doc_embeddings)
    
    # Rebuild chunk-level indices
    chunks = []
    chunk_metadata = []
    for doc_id, doc in documents.items():
        doc_chunks = chunk_text(doc['text'], target_size=1000, overlap_sentences=2)
        for i, chunk in enumerate(doc_chunks):
            chunks.append(chunk)
            chunk_metadata.append({
                'doc_id': doc_id,
                'title': doc['title'],
                'chunk_idx': i,
                'source': doc.get('source', 'unknown'),
                'url': doc.get('url') or (doc_id if doc_id.startswith('http') else None)
            })
    
    tokenized_chunks = [tokenize(chunk) for chunk in chunks]
    chunk_bm25_index = BM25Okapi(tokenized_chunks)
    chunk_embeddings = embed_model.encode(chunks).astype('float32')
    chunk_faiss_index = faiss.IndexFlatL2(chunk_embeddings.shape[1])
    chunk_faiss_index.add(chunk_embeddings)
    
    return f"Added '{title}' ({len(text)} chars) to corpus. Now have {len(documents)} documents, {len(chunks)} chunks."

In [None]:
# App state for generated flashcards
generated_cards = []

def generate_cards_ui(topics_text: str, num_cards: int) -> tuple[str, list]:
    """Generate flashcards from UI input."""
    global generated_cards
    
    if not topics_text.strip():
        return "Please enter at least one topic.", []
    
    # Parse topics (comma or newline separated)
    topics = [t.strip() for t in topics_text.replace('\n', ',').split(',') if t.strip()]
    
    # Generate flashcards
    generated_cards = generate_flashcards_batch(topics, cards_per_topic=int(num_cards))
    
    # Format for display
    rows = []
    for i, card in enumerate(generated_cards, 1):
        source = card.get('source') or ''  # guard against a missing or None source
        rows.append([
            i, card['question'], card['answer'], card.get('topic', ''),
            card.get('source_title', 'Custom'),
            source if source.startswith('http') else ''
        ])
    
    return f"Generated {len(generated_cards)} flashcards for {len(topics)} topic(s)!", rows

def export_csv_ui() -> str:
    """Export current flashcards to CSV."""
    if not generated_cards:
        return "No flashcards to export. Generate some first!"
    filename = export_to_csv(generated_cards, "flashcards_export.csv")
    return f"Exported to {filename}"

def export_anki_ui(deck_name: str) -> str:
    """Export current flashcards to Anki."""
    if not generated_cards:
        return "No flashcards to export. Generate some first!"
    filename = export_to_anki(generated_cards, "flashcards_export.apkg", deck_name or "Study Deck")
    return f"Exported to {filename}"

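def upload_pdf_ui(file, title: str) -> tuple[str, str]:
    """Handle PDF upload: extract the text and add it to the corpus."""
    if file is None:
        return "Please upload a PDF file.", ""
    
    # Depending on the Gradio version, `file` is a path string or an object with .name
    pdf_path = file if isinstance(file, str) else file.name
    text = process_pdf(pdf_path)
    if text.startswith("Error"):
        return text, ""
    
    # Index the extracted text so the PDF actually becomes searchable
    status = add_notes_to_corpus((title or "").strip() or "Uploaded PDF", text)
    preview = text[:2000] + ("..." if len(text) > 2000 else "")
    return status, preview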

def upload_text_ui(title: str, text: str) -> str:
    """Add text notes to corpus."""
    if not title.strip() or not text.strip():
        return "Please provide both title and text."
    
    result = add_notes_to_corpus(title, text)
    return result

In [None]:
# Build the complete Gradio app
with gr.Blocks(title="Smart Flashcard Generator", theme=gr.themes.Soft()) as flashcard_app:
    gr.Markdown("# 🎓 Smart Flashcard Generator")
    gr.Markdown("Generate study flashcards from your notes using AI + Hybrid Search")
    
    with gr.Tabs():
        # Tab 1: Generate Flashcards
        with gr.Tab("📝 Generate Flashcards"):
            gr.Markdown("### Generate Flashcards for Topics")
            gr.Markdown("Enter topics (comma or newline separated) to generate flashcards from the document corpus.")
            
            with gr.Row():
                with gr.Column(scale=2):
                    topics_input = gr.Textbox(
                        label="Topics to Study",
                        placeholder="search engine, web crawler, PageRank",
                        lines=3
                    )
                with gr.Column(scale=1):
                    num_cards_slider = gr.Slider(
                        minimum=1, maximum=5, value=3, step=1,
                        label="Cards per Topic"
                    )
            
            generate_btn = gr.Button("🚀 Generate Flashcards", variant="primary")
            gen_status = gr.Textbox(label="Status", interactive=False)
            
            flashcards_table = gr.Dataframe(
                headers=["#", "Question", "Answer", "Topic", "Source", "URL"],
                interactive=False,
                wrap=True
            )
            
            gr.Markdown("### Export")
            with gr.Row():
                export_csv_btn = gr.Button("📄 Export to CSV")
                deck_name_input = gr.Textbox(label="Anki Deck Name", value="Study Deck", scale=2)
                export_anki_btn = gr.Button("📚 Export to Anki")
            export_status = gr.Textbox(label="Export Status", interactive=False)
            
            generate_btn.click(
                fn=generate_cards_ui,
                inputs=[topics_input, num_cards_slider],
                outputs=[gen_status, flashcards_table]
            )
            export_csv_btn.click(fn=export_csv_ui, outputs=export_status)
            export_anki_btn.click(fn=export_anki_ui, inputs=deck_name_input, outputs=export_status)
        
        # Tab 2: Upload Notes
        with gr.Tab("📤 Upload Notes"):
            gr.Markdown("### Add Your Own Notes")
            gr.Markdown("Upload a PDF or paste text to add to the searchable corpus.")
            
            with gr.Row():
                with gr.Column():
                    gr.Markdown("**Upload PDF**")
                    pdf_file = gr.File(label="Select PDF", file_types=[".pdf"])
                    pdf_title = gr.Textbox(label="Document Title", placeholder="My Study Notes")
                    pdf_upload_btn = gr.Button("Process PDF")
                    pdf_status = gr.Textbox(label="Status", interactive=False)
                    pdf_preview = gr.Textbox(label="Preview", lines=10, interactive=False)
                
                with gr.Column():
                    gr.Markdown("**Or Paste Text**")
                    text_title = gr.Textbox(label="Document Title", placeholder="Lecture Notes - Week 1")
                    text_content = gr.Textbox(label="Text Content", lines=10, placeholder="Paste your notes here...")
                    text_add_btn = gr.Button("Add to Corpus")
                    text_status = gr.Textbox(label="Status", interactive=False)
            
            pdf_upload_btn.click(
                fn=upload_pdf_ui,
                inputs=[pdf_file, pdf_title],
                outputs=[pdf_status, pdf_preview]
            )
            text_add_btn.click(
                fn=upload_text_ui,
                inputs=[text_title, text_content],
                outputs=text_status
            )
        
        # Tab 3: Search Comparison (kept from before)
        with gr.Tab("🔍 Compare Search"):
            gr.Markdown("### Compare BM25 vs Semantic vs Hybrid")
            compare_query = gr.Textbox(label="Search Query", placeholder="Try: 'Singapore economy'")
            compare_btn = gr.Button("Compare Methods")
            
            with gr.Row():
                with gr.Column():
                    gr.Markdown("**BM25**")
                    bm25_results = gr.Dataframe(headers=["Rank", "Title", "Score"], interactive=False)
                with gr.Column():
                    gr.Markdown("**Semantic**")
                    semantic_results = gr.Dataframe(headers=["Rank", "Title", "Score"], interactive=False)
                with gr.Column():
                    gr.Markdown("**Hybrid**")
                    hybrid_results = gr.Dataframe(headers=["Rank", "Title", "Score"], interactive=False)
            
            compare_btn.click(
                fn=compare_all_methods,
                inputs=compare_query,
                outputs=[bm25_results, semantic_results, hybrid_results]
            )
    
    gr.Markdown("""
    ---
    ### How This Works
    1. **Upload** your notes (PDF or text) or use the pre-loaded Wikipedia pages
    2. **Enter topics** you want to study
    3. **Hybrid Search** finds relevant content (BM25 + Semantic)
    4. **LLM** generates flashcards from the retrieved context
    5. **Export** to Anki or CSV for spaced repetition study!
    
    """)

In [None]:
# Launch the complete Flashcard Generator app!
flashcard_app.launch(share=False)

## Summary & Further Reading

### What You Learned

1. **BM25 (Lexical Search)**
   - Matches exact keywords using TF-IDF principles
   - Fast, interpretable, great for specific terms
   - Fails on synonyms and conceptual queries

2. **Semantic Search**
   - Uses neural embeddings to understand meaning
   - Handles synonyms, paraphrasing, conceptual queries
   - Can miss exact keyword matches

3. **Hybrid Search**
   - Combines both approaches so each covers the other's blind spots
   - RRF (Reciprocal Rank Fusion): simple, rank-based, works well without tuning
   - Weighted: adjustable balance via the alpha parameter
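
As a compact recap, the two fusion strategies can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the exact code used earlier in the notebook: the function names are made up, `k=60` is the constant commonly used with RRF, min-max normalization is one of several ways to put BM25 and semantic scores on the same scale, and the convention that `alpha` weights the semantic side is an assumption.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    """Score each document by summing 1 / (k + rank) over all rankings."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so different scoring scales are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fusion(bm25_scores: dict[str, float],
                    semantic_scores: dict[str, float],
                    alpha: float = 0.5) -> dict[str, float]:
    """alpha=1.0 is pure semantic, alpha=0.0 is pure BM25 (assumed convention)."""
    bm25_n, sem_n = min_max(bm25_scores), min_max(semantic_scores)
    doc_ids = set(bm25_n) | set(sem_n)
    return {d: alpha * sem_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0) for d in doc_ids}


# Hypothetical results for three documents
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
semantic_ranking = ["doc_b", "doc_c", "doc_a"]
rrf = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
print("RRF winner:", max(rrf, key=rrf.get))

bm25_scores = {"doc_a": 8.0, "doc_b": 6.0, "doc_c": 2.0}
semantic_scores = {"doc_a": 0.2, "doc_b": 0.9, "doc_c": 0.7}
for alpha in (0.0, 0.5, 1.0):
    fused = weighted_fusion(bm25_scores, semantic_scores, alpha)
    print(f"alpha={alpha}: winner is {max(fused, key=fused.get)}")
```

RRF only looks at rank positions, so it needs no score normalization; the weighted variant uses the raw scores and therefore depends on how they are normalized and on the choice of alpha.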

### Further Reading

- [Practical BM25 - Part 2: The BM25 Algorithm and its Variables](https://www.elastic.co/blog/practical-bm25-part-2-the-bm25-algorithm-and-its-variables)
- [Sentence-BERT Paper](https://arxiv.org/abs/1908.10084)
- [Reciprocal Rank Fusion Paper](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf)
- [Elasticsearch: Combining BM25 and kNN](https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html)
