# Semantic Search: Baseline vs Fine-Tuning Comparison (Hard Corpus)

**Goal:** Compare semantic search quality before and after fine-tuning across three approaches, using a more challenging corpus and harder queries:

1. **Baseline**: Embedding search with FAISS (no fine-tuning)
2. **Contrastive Fine-Tuning**: Improve embeddings directly using contrastive learning
3. **LoRA Reranking**: Train a lightweight LoRA adapter on a small LLM to rerank FAISS results

We'll evaluate each approach using standard information retrieval metrics:
- **Hit@K**: Did we find at least one relevant document in the top K results?
- **MRR (Mean Reciprocal Rank)**: How high is the first relevant result ranked?
- **nDCG (Normalized Discounted Cumulative Gain)**: How well are relevant documents ranked overall?
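
As a rough illustration of the three metrics listed above, the sketch below computes them for a single query under binary relevance (a document is either relevant or not). The function names and the example ranking are hypothetical and are not taken from the notebook's evaluation code. MRR is the mean of the per-query reciprocal ranks.

```python
import math

def hit_at_k(ranked_ids, relevant_ids, k):
    """Return 1.0 if any of the top-k ranked ids is relevant, else 0.0."""
    return float(any(doc_id in relevant_ids for doc_id in ranked_ids[:k]))

def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant id (ranks start at 1), or 0.0 if none appears."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG@k: DCG of this ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical example: the only relevant document (id 9) is ranked second.
ranked = [4, 9, 1]
relevant = {9}

print(hit_at_k(ranked, relevant, k=1))     # 0.0: the top result is not relevant
print(hit_at_k(ranked, relevant, k=2))     # 1.0: a relevant result appears in the top 2
print(reciprocal_rank(ranked, relevant))   # 0.5: first relevant result is at rank 2
print(ndcg_at_k(ranked, relevant, k=3))    # 1 / log2(3), roughly 0.63
```

Averaging each function's output over all queries gives the corpus-level Hit@K, MRR, and nDCG@K.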

Additionally, we'll test how each approach affects **downstream LLM performance** by using retrieved documents as context for question answering.

## Theoretical Background

### What is Semantic Search?

Traditional keyword search matches exact words, but **semantic search** understands the *meaning* behind queries and documents. For example:
- Query: "Who discovered the first antibiotic?"
- Relevant document: "Alexander Fleming found penicillin in 1928"

Notice that the words "discovered," "first," and "antibiotic" don't appear exactly in the document, but the meaning matches.
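
The mismatch can be checked directly. The sketch below uses a hypothetical `tokenize` helper (not part of the notebook) to show that a pure keyword match finds no shared words between this query and its relevant document.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

query = "Who discovered the first antibiotic?"
document = "Alexander Fleming found penicillin in 1928"

# Words that appear in both the query and the document
shared_words = tokenize(query) & tokenize(document)
print(shared_words)  # set(): no word in common, so exact keyword matching scores zero
```

A keyword-only system would therefore rank this document no higher than an unrelated one, which is the gap semantic search is meant to close.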

### How Does Semantic Search Work?

**Step 1: Embedding**
- Convert text into dense vectors (arrays of numbers) that capture semantic meaning
- Similar meanings → similar vectors (close in vector space)
- Example: "dog" and "puppy" have similar embeddings, but "dog" and "car" don't

**Step 2: Indexing**
- Store all document embeddings in a searchable index (we use FAISS)
- FAISS enables fast similarity search over millions of vectors

**Step 3: Retrieval**
- Embed the query using the same model
- Find documents with embeddings most similar to the query embedding
- Similarity is typically measured using cosine similarity or dot product
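
The three steps above can be sketched in a few lines. This is a minimal illustration only: the 3-dimensional vectors are made up by hand rather than produced by an embedding model, and a plain dictionary stands in for the FAISS index. The helper names `cosine_similarity` and `search` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, index, top_k=2):
    """Score every indexed vector against the query and return the top_k names."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Step 1 and 2: made-up "embeddings" stored in a toy index.
# Real embedding models produce hundreds of dimensions.
index = {
    "puppy": [0.9, 0.8, 0.1],
    "car": [0.1, 0.2, 0.9],
    "kitten": [0.7, 0.9, 0.2],
}

# Step 3: a stand-in for the embedding of the query "dog"
query_vec = [1.0, 0.7, 0.0]
print(search(query_vec, index, top_k=2))  # ['puppy', 'kitten']
```

This brute-force loop compares the query with every vector, which is fine for a handful of documents. FAISS performs the same nearest-neighbour lookup with optimized index structures; an inner-product index over L2-normalized vectors yields the same ordering as cosine similarity.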

## Import Required Libraries

We'll use Python libraries for data handling, modeling, and evaluation. For the interactive app, we'll use Gradio (or Streamlit as an alternative).

In [None]:
# Install required packages if not already installed
!pip install -q gradio faiss-cpu scikit-learn pandas numpy tqdm

import numpy as np
import pandas as pd
import faiss
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm
import gradio as gr
import random
import re

## Prepare a Harder Text Corpus

We'll use a more challenging, longer, and ambiguous synthetic corpus to make retrieval and classification harder. You can also swap this for a real dataset (e.g., 20 Newsgroups) if desired.

In [None]:
# Create a harder, longer, and more ambiguous synthetic corpus
corpus = [
    "The quick brown fox jumps over the lazy dog near the river bank, while the sun sets behind the mountains.",
    "A bank can be a financial institution or the side of a river, depending on the context of the sentence.",
    "The astronomer observed the stars through the telescope, but the bank's interest rates were also rising.",
    "The chef prepared a delicious meal with rare spices, while the banker discussed loans with a client.",
    "A bat can refer to a flying mammal or a piece of sports equipment used in cricket or baseball.",
    "The cricket match was interrupted by a sudden downpour, and the players took shelter in the pavilion.",
    "The lawyer presented the case in court, but the judge was more interested in the evidence.",
    "The artist painted a beautiful landscape, capturing the essence of the countryside in spring.",
    "The computer crashed during the presentation, causing the speaker to lose all unsaved work.",
    "The patient was prescribed a new medication, but the pharmacist warned about possible side effects.",
    # Add more ambiguous and context-rich sentences
    "The coach gave a pep talk before the game, but the team was still nervous about the final.",
    "The engineer designed a bridge that could withstand earthquakes and heavy traffic.",
    "The gardener planted roses and tulips, hoping for a colorful bloom in the summer.",
    "The pilot announced a delay due to bad weather, and the passengers groaned in frustration.",
    "The musician composed a symphony inspired by the sounds of the city at night.",
    "The teacher explained the concept of gravity using simple experiments in the classroom.",
    "The detective solved the mystery by finding a crucial clue at the crime scene.",
    "The doctor recommended regular exercise and a balanced diet for better health.",
    "The programmer debugged the code, fixing a bug that caused the app to crash.",
    "The historian wrote a book about ancient civilizations and their cultural impact."
]

# Create challenging queries that require deeper understanding
queries = [
    "What is the meaning of the word 'bank' in different contexts?",
    "Describe a scenario where a bat is not an animal.",
    "How can a computer failure affect a public event?",
    "What are the possible side effects of new drugs?",
    "Explain how a bridge can be made earthquake-resistant.",
    "How does gravity influence classroom experiments?",
    "What is the role of a judge in a legal case?",
    "How do musicians find inspiration for their work?",
    "What are the responsibilities of a pharmacist?",
    "How do weather conditions impact air travel?"
]

# Create ground truth for evaluation (index of relevant corpus sentence for each query)
relevant_indices = [1, 4, 8, 9, 11, 15, 6, 14, 9, 13]

## Explore and Preprocess the Corpus

Let's inspect the corpus and queries, then preprocess the text (lowercasing, removing punctuation, etc.).

In [None]:
# Inspect the corpus and queries
print("Sample corpus sentences:")
for i, sent in enumerate(corpus[:5]):
    print(f"{i}: {sent}")

print("\nSample queries:")
for i, q in enumerate(queries[:5]):
    print(f"{i}: {q}")

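# Simple text preprocessing: lowercase, replace punctuation with spaces, collapse whitespace
def preprocess(text):
    text = text.lower()
    # Substitute a space (not an empty string) so "earthquake-resistant" does not fuse into one token
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()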

corpus_clean = [preprocess(s) for s in corpus]
queries_clean = [preprocess(q) for q in queries]

## Build and Train a Text Classification Model

We'll vectorize the text using TF-IDF and train a logistic regression classifier to predict which corpus sentence is relevant to a given query. This is a simplification for demonstration; in practice, you would use semantic embeddings and more advanced models.

In [None]:
# Vectorize corpus and queries
vectorizer = TfidfVectorizer()
corpus_vecs = vectorizer.fit_transform(corpus_clean)
queries_vecs = vectorizer.transform(queries_clean)

# For demonstration, train a classifier to map queries to relevant corpus indices
X = queries_vecs
y = relevant_indices
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

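# Evaluate on the training queries themselves (too few samples for a held-out split),
# so these scores are optimistic and only confirm that the model fits the data
preds = clf.predict(X)
print("Classification report (query to relevant corpus sentence):")
# zero_division=0 avoids warnings for labels the model never predicts
print(classification_report(y, preds, zero_division=0))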
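
Because the classifier is trained and scored on the same ten queries, its report says little about retrieval quality. A retrieval-style alternative is to rank corpus sentences by TF-IDF cosine similarity and check whether the labelled sentence comes first. The sketch below assumes a small toy corpus (`toy_corpus`, `toy_queries`, and `toy_relevant` are hypothetical stand-ins, not the notebook's data) and is meant as an illustration rather than the notebook's method.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data standing in for the notebook's corpus and queries
toy_corpus = [
    "the engineer designed a bridge",
    "the pilot announced a delay due to bad weather",
    "the musician composed a symphony",
]
toy_queries = ["how does weather delay a pilot", "who composed the symphony"]
toy_relevant = [1, 2]  # index of the relevant corpus sentence for each query

toy_vectorizer = TfidfVectorizer()
doc_vecs = toy_vectorizer.fit_transform(toy_corpus)  # vocabulary is learned from the corpus only
query_vecs = toy_vectorizer.transform(toy_queries)   # queries reuse that vocabulary

sims = cosine_similarity(query_vecs, doc_vecs)       # shape: (n_queries, n_docs)
ranking = np.argsort(-sims, axis=1)                  # per query, document indices from best to worst

# Hit@1: fraction of queries whose top-ranked document is the labelled one
hit_at_1 = float(np.mean(ranking[:, 0] == np.array(toy_relevant)))
print(ranking[:, 0].tolist(), hit_at_1)              # [1, 2] 1.0
```

Unlike the classifier, this ranking needs no training labels and works on any corpus, although it still matches on shared words rather than meaning.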

## Create an Interactive App to Load Custom Datasets

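We'll build a Gradio app that lets you paste your own corpus and queries, preprocesses them, and returns the best-matching corpus sentence for each query. The classifier trained above predicts indices into the original corpus, so its labels are meaningless for a different corpus; the app instead ranks the custom sentences by TF-IDF cosine similarity, which is still a lexical rather than a semantic match.

In [None]:
# Gradio app for custom corpus input and prediction
def predict_custom(queries, custom_corpus):
    # Nothing to rank if either input is empty
    if not queries or not custom_corpus:
        return []
    # Preprocess
    custom_corpus_clean = [preprocess(s) for s in custom_corpus]
    queries_clean = [preprocess(q) for q in queries]
    # Fit a fresh vectorizer so the vocabulary covers the custom corpus
    custom_vectorizer = TfidfVectorizer()
    custom_corpus_vecs = custom_vectorizer.fit_transform(custom_corpus_clean)
    queries_vecs = custom_vectorizer.transform(queries_clean)
    # TF-IDF rows are L2-normalized by default, so the dot product equals cosine similarity
    sims = (queries_vecs @ custom_corpus_vecs.T).toarray()
    best = sims.argmax(axis=1)
    # Return the most similar sentence for each query, or "No match" if nothing overlaps
    results = [custom_corpus[b] if sims[i, b] > 0 else "No match" for i, b in enumerate(best)]
    return results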

def gradio_interface():
    with gr.Blocks() as demo:
        gr.Markdown("# Custom Semantic Search Demo\nPaste your own corpus and enter queries to see predictions.")
        corpus_input = gr.Textbox(label="Custom Corpus (one sentence per line)", lines=10)
        queries_input = gr.Textbox(label="Queries (one per line)", lines=5)
        output = gr.Textbox(label="Predicted Relevant Sentences", lines=10)
        def run_app(corpus_text, queries_text):
            custom_corpus = [s.strip() for s in corpus_text.split("\n") if s.strip()]
            queries = [q.strip() for q in queries_text.split("\n") if q.strip()]
            results = predict_custom(queries, custom_corpus)
            return "\n".join(results)
        btn = gr.Button("Predict")
        btn.click(run_app, inputs=[corpus_input, queries_input], outputs=output)
    return demo

demo = gradio_interface()
# Launch the app locally (share=False keeps it off a public URL); comment this line out to skip the UI
demo.launch(share=False)

## Test the App with a Custom Dataset

Let's exercise the app's prediction function directly with a sample custom corpus and queries.

In [None]:
# Example: Test the app with a sample custom corpus and queries
sample_corpus = [
    "The volcano erupted, sending ash into the sky.",
    "A chef uses a knife to chop vegetables.",
    "The scientist published a paper on climate change.",
    "The athlete won a gold medal at the Olympics."
]
sample_queries = [
    "Who studies the environment?",
    "What tool does a chef use?",
    "Describe a natural disaster involving ash."
]

results = predict_custom(sample_queries, sample_corpus)
for q, r in zip(sample_queries, results):
    print(f"Query: {q}\nPredicted relevant: {r}\n")