## Semantic Search with SBERT and FAISS

This stage enables semantic search over court rulings by converting text into embeddings using SBERT and indexing them with FAISS. The goal is to retrieve rulings based on meaning — not just keyword matches. For example, a query like *“default judgment in breach of contract”* should return relevant results even if they use different legal language.

---

### Why Not Keyword Search?

Legal text is inconsistent in structure and phrasing. Keyword search often misses important rulings simply because different words are used to describe the same idea. Embedding-based search solves this by comparing meaning, not surface-level word overlap.
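To make the gap concrete, the sketch below measures plain word overlap (Jaccard similarity) between the example query and a differently worded description of the same situation. The ruling sentence and the `tokenize` and `token_overlap` helpers are illustrative assumptions, not part of the project code.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

query = "default judgment in breach of contract"
# Hypothetical sentence describing a default judgment without using the term
ruling = "the court granted relief after the defendant failed to answer the complaint"

print(token_overlap(query, ruling))  # 0.0: the two strings share no words
```

A keyword matcher scores this pair as unrelated, while an embedding model can still place the two sentences near each other.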

---

### Why I Chose SBERT

I used Sentence-BERT (SBERT) to generate embeddings because it’s designed for semantic similarity tasks.

- It’s tuned specifically for comparing sentences using cosine similarity.
- It performs well on short- and medium-length texts without needing fine-tuning or labeled training data. Rulings longer than the model’s input limit are truncated, so only their opening passage is embedded.
- I chose the `all-MiniLM-L6-v2` variant, which is lightweight and fast — ideal for prototyping.

This allowed me to create high-quality, reusable embeddings with minimal setup and no infrastructure overhead.
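One caveat for long rulings: `all-MiniLM-L6-v2` truncates input beyond 256 word pieces by default, so encoding a full ruling only captures its beginning. A common workaround, sketched here as an assumption and not something this notebook does, is to split each ruling into overlapping word windows and embed the windows separately. The `chunk_words` helper and its window sizes are hypothetical.

```python
def chunk_words(text: str, size: int = 150, overlap: int = 30) -> list[str]:
    """Split text into overlapping windows of at most `size` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

# Synthetic 400-word document used only to show the window sizes
sample = " ".join(f"w{i}" for i in range(400))
chunks = chunk_words(sample)
print([len(c.split()) for c in chunks])  # [150, 150, 150, 40]
```

Each chunk could then be passed to the encoder, with a parallel list that maps chunks back to their source file. The 150-word window is a rough guess at staying under the token limit, not a tuned value.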

---

### Why I Didn't Use Other Embedding Methods

I considered traditional and modern alternatives but ruled them out based on project needs:

- Vanilla BERT’s `[CLS]` token isn’t reliable for measuring sentence similarity.
- TF-IDF and bag-of-words methods don’t capture context or semantics — they only compare surface words.
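- Doc2Vec has to be trained on the corpus itself, which is unreliable for a collection this small, and it has largely been superseded by transformer-based encoders.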
- OpenAI or LLM-based embeddings offer great quality but require API access, introduce cost, and aren’t reproducible offline.
- Custom fine-tuning wasn’t needed — SBERT was strong enough for the task.

---

### Why I Chose FAISS

Once the embeddings were generated, I used FAISS (Facebook AI Similarity Search) to build a local index for fast retrieval.

- It handles thousands of vectors efficiently and works well even without approximation.
- It runs entirely offline — no cloud services or extra setup.
- It integrates naturally with SBERT, especially when using cosine similarity.
- I used `IndexFlatIP` with normalized embeddings to get exact top-k results using cosine-based search.

This setup gave me full control, speed, and transparency — ideal for a lightweight prototype.
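The reason an inner-product index can serve cosine search is that, for unit-length vectors, the two quantities are identical. The NumPy check below illustrates this with random stand-in vectors; the 384 dimensions match the model, but the data is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=384)
b = rng.normal(size=384)

# Cosine similarity computed from the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product after scaling both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
inner = a_unit @ b_unit

print(np.isclose(cosine, inner))  # True
```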

---

### Why I Didn't Use Other Indexing Tools

I looked at other vector search tools but decided they weren’t the right fit:

- Annoy is simple, but it only offers approximate search, and it supports cosine similarity indirectly through its angular distance metric.
- HNSWlib has great performance but comes with extra memory usage and complexity I didn’t need.
- ScaNN is optimized for very large datasets, which adds setup overhead without clear benefit here.
- Hosted or server-based vector databases like Pinecone or Qdrant are powerful, but too heavy for a local, reproducible project.

---

### Summary

SBERT and FAISS work well together to enable fast, accurate, and explainable semantic search — all without relying on cloud infrastructure or fine-tuning. This combination keeps the system lightweight, interpretable, and easy to adapt for real-world legal and compliance review workflows.


## Step 1: Import Libraries and Load Clean Texts

```python
import os

import faiss
import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer

# Folder containing the cleaned ruling texts, one .txt file per ruling
folder_path = "data/clean"

texts = []
file_names = []

# Sort the directory listing so document order is reproducible across runs
for file_name in sorted(os.listdir(folder_path)):
    if file_name.endswith(".txt"):
        with open(os.path.join(folder_path, file_name), "r", encoding="utf-8") as f:
            texts.append(f.read())
            file_names.append(file_name)

print(f"Loaded {len(texts)} rulings.")
```

Loaded 74 rulings.


## Step 2: Generate SBERT Embeddings

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# L2-normalize each embedding so that inner product equals cosine similarity
embeddings = model.encode(
    texts,
    show_progress_bar=True,
    normalize_embeddings=True,
)
```
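Since the index relies on unit-length vectors, a quick sanity check on the encoder output can catch a missing normalization step early. The `check_embeddings` helper below is a hypothetical addition, shown on random stand-in data; in the notebook it would be called as `check_embeddings(embeddings)`.

```python
import numpy as np

def check_embeddings(emb: np.ndarray, expected_dim: int = 384, tol: float = 1e-3) -> None:
    """Raise ValueError if emb is not a 2-D array of unit-length rows."""
    if emb.ndim != 2 or emb.shape[1] != expected_dim:
        raise ValueError(f"unexpected shape {emb.shape}")
    norms = np.linalg.norm(emb, axis=1)
    if not np.allclose(norms, 1.0, atol=tol):
        raise ValueError("rows are not L2-normalized")

# Stand-in for the real embeddings: 74 random rows scaled to unit length
demo = np.random.default_rng(0).normal(size=(74, 384)).astype("float32")
demo = demo / np.linalg.norm(demo, axis=1, keepdims=True)

check_embeddings(demo)  # passes silently
```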



## Step 3: Build the FAISS Index

```python
import faiss
import numpy as np

dimension = embeddings.shape[1]  # 384 for all-MiniLM-L6-v2

# Flat (exact) inner-product index; with unit-length vectors the scores are cosine similarities
index = faiss.IndexFlatIP(dimension)

# FAISS expects a contiguous float32 matrix
index.add(np.ascontiguousarray(embeddings, dtype="float32"))
print(f"Indexed {index.ntotal} documents.")
```

Indexed 74 documents.
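Because the embeddings are meant to be reusable, it can help to persist them together with the file names so the index can be rebuilt without re-encoding. The sketch below uses only NumPy and JSON. The helper names and output file names are assumptions, and placeholder data stands in for the real `embeddings` and `file_names`.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

def save_embeddings(out_dir, embeddings, file_names):
    """Write the embedding matrix and the matching file names to out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    np.save(out_dir / "embeddings.npy", np.asarray(embeddings, dtype="float32"))
    (out_dir / "file_names.json").write_text(json.dumps(list(file_names)), encoding="utf-8")

def load_embeddings(out_dir):
    """Read back the (embeddings, file_names) pair written by save_embeddings."""
    out_dir = Path(out_dir)
    embeddings = np.load(out_dir / "embeddings.npy")
    file_names = json.loads((out_dir / "file_names.json").read_text(encoding="utf-8"))
    return embeddings, file_names

# Round trip with placeholder data in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    fake = np.eye(3, 384, dtype="float32")
    save_embeddings(tmp, fake, ["a.txt", "b.txt", "c.txt"])
    restored, names = load_embeddings(tmp)

print(restored.shape, names)
```

FAISS can also serialize the index itself with `faiss.write_index` and `faiss.read_index`; keeping the raw matrix has the advantage that a different index type can be tried later.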


## Step 4: Implement the Search Function

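```python
def search(query, top_k):
    """Return the top_k rulings most similar to the query as a DataFrame."""
    # FAISS pads results with index -1 when top_k exceeds the number of documents
    top_k = min(top_k, index.ntotal)
    query_embedding = model.encode([query], normalize_embeddings=True)
    scores, indices = index.search(np.asarray(query_embedding, dtype="float32"), top_k)

    results = []
    for idx, score in zip(indices[0], scores[0]):
        text = texts[idx]
        results.append({
            "score": round(float(score), 4),
            "file": file_names[idx],
            # Short preview; add an ellipsis only when the text was cut
            "text": text[:500] + ("..." if len(text) > 500 else ""),
        })

    return pd.DataFrame(results)
```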
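With only 74 documents, the exact search is easy to cross-check. The function below is a NumPy reference for the same computation (score every document by inner product, then keep the best `top_k`), written as a sketch on toy vectors. `brute_force_search` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def brute_force_search(query_vec, doc_matrix, top_k):
    """Return (scores, indices) of the top_k rows of doc_matrix by inner product."""
    scores = doc_matrix @ query_vec        # one similarity per document
    top_k = min(top_k, len(scores))        # never ask for more rows than exist
    order = np.argsort(-scores)[:top_k]    # highest score first
    return scores[order], order

# Toy data: three unit vectors in 2-D
docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype="float32")
query = np.array([1.0, 0.0], dtype="float32")

scores, indices = brute_force_search(query, docs, top_k=5)
print(indices.tolist())  # [0, 2, 1]
```

Running this against the real embedding matrix and comparing the indices with the FAISS output would be one way to confirm the index is wired up correctly.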

## Step 5: Run a Test Query

```python
# Show full cell contents so the text previews are not cut off
pd.set_option("display.max_colwidth", None)

results = search("default judgment in breach of contract", 5)
results
```

| | score | file | text |
|---|---|---|---|
| 0 | 0.4728 | clean_opinion_34.txt | green v forster & garbus, llp (2025 ny slip op 02324)\ngreen v forster & garbus, llp\n2025 ny slip op 02324\ndecided on april 23, 2025\nappellate division, second department\npublished by new york state law reporting bureau pursuant to judiciary law § 431.\nthis opinion is uncorrected and subject to revision before publication in the official reports.\ndecided on april 23, 2025\nsupreme court of the state of new york\nappellate division, second judicial department\ncheryl e. chambers, j.p.\npaul wo... |
| 1 | 0.4611 | clean_opinion_1.txt | pantanilla v yuson (2025 ny slip op 02597)\npantanilla v yuson\n2025 ny slip op 02597\ndecided on april 30, 2025\nappellate division, second department\npublished by new york state law reporting bureau pursuant to judiciary law § 431.\nthis opinion is uncorrected and subject to revision before publication in the official reports.\ndecided on april 30, 2025\nsupreme court of the state of new york\nappellate division, second judicial department\nmark c. dillon, j.p.\nrobert j. miller\nhelen voutsinas\nlourdes ... |
| 2 | 0.4004 | clean_opinion_51.txt | us bank n.a. v pane (2025 ny slip op 02619)\nus bank n.a. v pane\n2025 ny slip op 02619\ndecided on april 30, 2025\nappellate division, second department\npublished by new york state law reporting bureau pursuant to judiciary law § 431.\nthis opinion is uncorrected and subject to revision before publication in the official reports.\ndecided on april 30, 2025\nsupreme court of the state of new york\nappellate division, second judicial department\nangela g. iannacci, j.p.\npaul wooten\nbarry e. warhit\ncarl j.... |
| 3 | 0.384 | clean_opinion_22.txt | fg&n trust v 165 hous. corp. (2025 ny slip op 02129)\nfg&n trust v 165 hous. corp.\n2025 ny slip op 02129\ndecided on april 10, 2025\nappellate division, first department\npublished by new york state law reporting bureau pursuant to judiciary law § 431.\nthis opinion is uncorrected and subject to revision before publication in the official reports.\ndecided and entered: april 10, 2025\nbefore: kern, j.p., kennedy, gesmer, pitt-burke, o'neill levy, jj. \nindex no. 158976/23\|appeal no. 4093\|case no... |
| 4 | 0.3791 | clean_opinion_21.txt | rauch v rauch (2025 ny slip op 02802)\nrauch v rauch\n2025 ny slip op 02802\ndecided on may 7, 2025\nappellate division, second department\npublished by new york state law reporting bureau pursuant to judiciary law § 431.\nthis opinion is uncorrected and subject to revision before publication in the official reports.\ndecided on may 7, 2025\nsupreme court of the state of new york\nappellate division, second judicial department\nangela g. iannacci, j.p.\nwilliam g. ford\nhelen voutsinas\njames p. mccormack, j... |
