## **03 - Building Index**

#### **Setup**

In [1]:
import sys
from pathlib import Path

CWD = Path.cwd().resolve()
ROOT = CWD if (CWD / "src").exists() else CWD.parent
if str(ROOT) not in sys.path:
    sys.path.append(str(ROOT))

from src.ingest.build_index import build_index

In [2]:
CHUNKS = ROOT / "data" / "chunks" / "chunks.jsonl"
INDEX_DIR = ROOT / "data" / "index"

print("ROOT:", ROOT)
print("CHUNKS exists?", CHUNKS.exists())
print("INDEX_DIR:", INDEX_DIR)

ROOT: D:\IIT BBS\Job Resources\Business Optima\pdf-agent
CHUNKS exists? True
INDEX_DIR: D:\IIT BBS\Job Resources\Business Optima\pdf-agent\data\index


#### **Peek chunks (raw JSONL; heading_path is still a list here)**

In [3]:
import json, itertools
assert CHUNKS.exists(), "chunks.jsonl not found; run the 01/02 notebooks first."
# Use a context manager so the file handle is closed after peeking at 10 records.
with open(CHUNKS, "r", encoding="utf-8") as f:
    for line in itertools.islice(f, 10):
        rec = json.loads(line)
        print(rec["id"], rec["metadata"]["block_type"], rec["metadata"]["heading_path"][:2])

title17-h-1 heading ['Copyright Law United States Copyri']
title17-2 para ['Copyright Law United States Copyri']
title17-3 para ['Copyright Law United States Copyri']
title17-h-4 heading ['Copyright Law United States Copyri', 'Copyright Law of the United States']
title17-5 para ['Copyright Law United States Copyri', 'Copyright Law of the United States']
title17-6 para ['Copyright Law United States Copyri', 'Copyright Law of the United States']
title17-7 para ['Copyright Law United States Copyri', 'Copyright Law of the United States']
title17-h-8 heading ['Copyright Law United States Copyri', 'dedication']
title17-9 para ['Copyright Law United States Copyri', 'dedication']
title17-h-10 heading ['Copyright Law United States Copyri', 'Preface']


#### **Embed model dim**

In [4]:
from sentence_transformers import SentenceTransformer
st = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cpu")
print("bge-base dim:", st.get_sentence_embedding_dimension())


bge-base dim: 768


#### **Build index (set reset=True if re-running)**

In [None]:
collection_name = build_index(
    CHUNKS,
    persist=INDEX_DIR,
    collection=None,                     # defaults to doc_id
    embed_model="BAAI/bge-base-en-v1.5", # CPU-friendly
    batch_size=64,
    bge_use_prompt=True,
    reset=False,                         # True for a clean rebuild
)
collection_name


Indexed 64/2683
Indexed 128/2683
Indexed 192/2683
Indexed 256/2683
Indexed 320/2683
Indexed 384/2683
Indexed 448/2683
Indexed 512/2683
Indexed 576/2683
Indexed 640/2683
Indexed 704/2683
Indexed 768/2683
Indexed 832/2683
Indexed 896/2683
Indexed 960/2683
Indexed 1024/2683
Indexed 1088/2683
Indexed 1152/2683
Indexed 1216/2683
Indexed 1280/2683
Indexed 1344/2683
Indexed 1408/2683
Indexed 1472/2683
Indexed 1536/2683
Indexed 1600/2683
Indexed 1664/2683
Indexed 1728/2683
Indexed 1792/2683
Indexed 1856/2683
Indexed 1920/2683
Indexed 1984/2683
Indexed 2048/2683
Indexed 2112/2683
Indexed 2176/2683
Indexed 2240/2683
Indexed 2304/2683
Indexed 2368/2683
Indexed 2432/2683
Indexed 2496/2683
Indexed 2560/2683
Indexed 2624/2683
Indexed 2683/2683
[OK] Chroma collection 'title17' built at D:\IIT BBS\Job Resources\Business Optima\pdf-agent\data\index
[INFO] Embed model: BAAI/bge-base-en-v1.5 | BGE passage prompt: True


'title17'
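
The progress lines above advance in steps of `batch_size=64` and end on a short final batch. The sketch below reproduces that counting pattern with a plain batching generator. It is an illustration only: `batched` and `progress_counts` are hypothetical names, and the real `build_index` may batch differently.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def progress_counts(total: int, batch_size: int) -> List[int]:
    """Return the running 'Indexed n/total' counts after each batch."""
    done = 0
    counts: List[int] = []
    for batch in batched(range(total), batch_size):
        done += len(batch)  # the last batch may be shorter than batch_size
        counts.append(done)
    return counts


counts = progress_counts(2683, 64)
print(len(counts), counts[0], counts[-1])  # 42 64 2683
```

With 2683 chunks and a batch size of 64, this gives 41 full batches plus one batch of 59, which matches the 42 progress lines in the output.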

#### **Verify index count**

In [6]:
import chromadb
client = chromadb.PersistentClient(path=str(INDEX_DIR))
coll = client.get_collection(collection_name)
print("Indexed documents:", coll.count())

Indexed documents: 2683


#### **Retrieval sanity check (+ distances)**

In [7]:
from sentence_transformers import SentenceTransformer
st = SentenceTransformer("BAAI/bge-base-en-v1.5", device="cpu")

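def embed_query(q: str):
    # NOTE: "query: " is the E5-style prefix. The BGE v1.5 model card instead suggests
    # "Represent this sentence for searching relevant passages: " for short queries.
    # The prefix is left unchanged here so the distances below stay reproducible.
    q = "query: " + q.strip()
    return st.encode([q], normalize_embeddings=True, show_progress_bar=False)[0].tolist()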

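def last2_headings(meta: dict):
    # In the index, heading_path is a ">"-joined string (not a list as in the JSONL).
    hp = (meta or {}).get("heading_path") or ""
    parts = [p.strip() for p in hp.split(">") if p.strip()]
    return parts[-2:]  # slicing already handles paths with fewer than two levels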

q = "What does §107 say about fair use?"
qe = embed_query(q)

res = coll.query(query_embeddings=[qe], n_results=8, include=["documents","metadatas","distances"])
for doc, meta, dist in zip(res["documents"][0], res["metadatas"][0], res["distances"][0]):
    print(f"\n[dist={dist:.3f} | p.{meta.get('page_start')}] {last2_headings(meta)}")
    print(doc[:280], "…")



[dist=0.412 | p.40] ['Copyright Law United States Copyri', '§ 107 · Limitations on exclusive rights: Fair use 40']
- (1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
- (2) the nature of the copyrighted work;
- (3) the amount and substantiality of the portion used in relation to the copyrighted work as a wh …

[dist=0.433 | p.403] ['Copyright Law United States Copyri', 'Sec. 103 · Other Rights Not Affected.']
- (b) Fair Use. -The amendments made by this title shall not affect the fair use, under section 107 of title 17, United States Code, of a genuine certificate, licensing document, registration card, similar labeling component, or documentation or packaging described in paragraph ( …

[dist=0.440 | p.40] ['Copyright Law United States Copyri', '§ 107 · Limitations on exclusive rights: Fair use 40']
Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work, includi

#### **Negative control (out-of-scope)**

In [None]:
q = "How to bake sourdough bread?"
qe = embed_query(q)
res = coll.query(query_embeddings=[qe], n_results=5, include=["distances"])
print("distances:", [round(d, 3) for d in res["distances"][0]])

distances: [0.935, 0.942, 0.948, 0.954, 0.957]
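
The gap between the in-scope distances (about 0.41 and up) and the out-of-scope ones (about 0.94) can be turned into a simple gate that refuses to answer when nothing relevant is retrieved. This is a minimal sketch: `in_scope` is a hypothetical helper, and the `0.7` cutoff is an assumption picked between the two observed ranges. It would need tuning on more queries before being relied on.

```python
from typing import Sequence


def in_scope(distances: Sequence[float], max_distance: float = 0.7) -> bool:
    """Return True if the closest hit is within `max_distance`.

    `max_distance` is a hypothetical cutoff; smaller distance means a closer match.
    """
    if not distances:
        return False  # no hits at all counts as out of scope
    return min(distances) <= max_distance


fair_use = [0.412, 0.433, 0.440]                 # top hits from the fair-use query
sourdough = [0.935, 0.942, 0.948, 0.954, 0.957]  # negative-control distances

print(in_scope(fair_use), in_scope(sourdough))  # True False
```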


#### **Quick duplicate detector**

In [9]:
import json, hashlib, collections
histo = collections.Counter()
dups = []
with open(CHUNKS, "r", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        t = " ".join(rec["text"].split())
        h = hashlib.md5(t.encode("utf-8")).hexdigest()
        histo[h] += 1
        if histo[h] == 2:
            dups.append((rec["id"], t[:140]))
print("Unique hash count:", len(histo), "| total:", sum(histo.values()))
print("Potential duplicates:", len(dups))
dups[:5]


Unique hash count: 2572 | total: 2683
Potential duplicates: 17


[('title17-235',
  '(I) if the performance is by audio means only, the performance is communicated by means of a total of not more than 6 loudspeakers, of which'),
 ('title17-236',
  '(II) if the performance or display is by audiovisual means, any visual portion of the performance or display is communicated by means of a t'),
 ('title17-496', '-'),
 ('title17-1002', '| | section | page |'),
 ('title17-h-1721', 'section')]
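
Note that the cell above appends to `dups` only when a hash is seen for the second time, so "Potential duplicates: 17" counts distinct repeated texts. The number of redundant copies is the total minus the unique hashes: 2683 - 2572 = 111. The sketch below reports both figures separately; `duplicate_stats` is a hypothetical helper that uses the same whitespace normalization and MD5 hashing as the cell.

```python
import hashlib
from collections import Counter
from typing import Iterable, Tuple


def duplicate_stats(texts: Iterable[str]) -> Tuple[int, int, int]:
    """Return (total, distinct_duplicated_texts, redundant_copies)."""
    counts = Counter(
        # Collapse whitespace before hashing, as in the detector cell.
        hashlib.md5(" ".join(t.split()).encode("utf-8")).hexdigest()
        for t in texts
    )
    total = sum(counts.values())
    duplicated = sum(1 for c in counts.values() if c > 1)  # texts seen 2+ times
    redundant = total - len(counts)                        # copies beyond the first
    return total, duplicated, redundant


print(duplicate_stats(["a  b", "a b", "a b", "-", "-", "unique"]))  # (6, 2, 3)
```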

#### **ToC leakage probe**

In [10]:
import json, re, itertools
toc_like = []
with open(CHUNKS, "r", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        txt = rec["text"]
        if txt.count("|") >= 3 or re.search(r"(?:\.\s){3,}", txt):
            toc_like.append(rec)
print("TOC-like chunks:", len(toc_like))
for r in itertools.islice(toc_like, 5):
    print(r["id"], "→", r["text"][:120], "…")

TOC-like chunks: 43
title17-41 → | chapter 1 | Subject Matter and Scope of Copyright ... | . . 1 | …
title17-42 → |-------------|--------------------------------------------------------------------------------------------------------- …
title17-43 → t of 1998 ... | 371 |
| appendix c | The Copyright Royalty and Distribution Reform Act of 2004 ... | 377 |
| appendix d  …
title17-44 → | appendix i | The STELA Reauthorization Act of 2014 ... | 411 | …
title17-45 → |--------------|-------------------------------------------------------------------------------------------------------- …


## **Observations**

**1) Indexed documents:**
- `coll.count()` returns 2683, which matches the 2683 chunks read from `chunks.jsonl`.
- Coverage is complete: every chunk was indexed.

**2) Retrieval sanity (fair use §107):**
- Top hits come directly from §107 (p.40) with distances 0.41–0.47, plus closely related sections (§108 etc.) and an appendix that mentions fair use.
- That’s exactly what I want: the target section plus nearby legal cross-references.

**3) Negative control (“sourdough”)**
- Distances are 0.935–0.957, far above the 0.41–0.47 range of the fair-use hits.
- Great separation: an out-of-scope query shows little semantic bleed/noise.

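**4) Duplicates: 17 repeated texts, 111 redundant chunks / 2683 ≈ 4.1%**
- The detector records each repeated text once, so 17 is the number of distinct texts that occur more than once (17 / 2683 ≈ 0.63%). The number of redundant copies is the total minus the unique hashes: 2683 - 2572 = 111.
- The samples are mostly tiny fragments (e.g., single dash, short “section” header, TOC stub), although two are substantive clauses (title17-235, title17-236).
- That’s likely harmless at this rate. Can clean further later if duplicates ever show up in results.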

**5) TOC-like chunks: 43 / 2683 ≈ 1.6%**
- Not large. They’re mostly tables/piped rows.
- They should rarely surface for substantive queries (the fair-use query didn’t return any).