**Author - Amjad Ali**

In [1]:
!nvidia-smi

Tue Feb  3 04:05:59 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   37C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
#bootstrap
from google.colab import drive
import os
import yaml

drive.mount("/content/drive")

REPO_URL = "https://github.com/AmjadKudsi/Meta_llama-chatbot.git"
REPO_DIR = "/content/rag-chatbot"

if not os.path.exists(REPO_DIR):
    !git clone {REPO_URL} {REPO_DIR}
else:
    %cd {REPO_DIR}
    !git pull

%cd {REPO_DIR}

# install dependencies from repo
!pip -q install -r requirements.txt

# load config
CONFIG_PATH = os.path.join(REPO_DIR, "configs", "app.yaml")

with open(CONFIG_PATH, "r") as f:
    cfg = yaml.safe_load(f)

paths = cfg["paths"]
DOCS_DIR = paths["docs_dir"]
INDEX_DIR = paths["index_dir"]
EVAL_RUNS_DIR = paths["eval_runs_dir"]
TRACES_DIR = paths["traces_dir"]

rag_cfg = cfg.get("rag", {})
TOP_K = rag_cfg.get("top_k", 6)
CHUNK_SIZE = rag_cfg.get("chunk_size", 900)
CHUNK_OVERLAP = rag_cfg.get("chunk_overlap", 150)

print("DOCS_DIR:", DOCS_DIR)
print("INDEX_DIR:", INDEX_DIR)
print("TOP_K:", TOP_K)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/rag-chatbot
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (1/1), done.
remote: Total 3 (delta 2), reused 3 (delta 2), pack-reused 0 (from 0)
Unpacking objects: 100% (3/3), 264 bytes | 88.00 KiB/s, done.
From https://github.com/AmjadKudsi/Meta_llama-chatbot
   77a3f1c..eac3bd4  main       -> origin/main
Updating 77a3f1c..eac3bd4
Fast-forward
 requirements.txt | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
/content/rag-chatbot
DOCS_DIR: /content/drive/MyDrive/rag-chatbot/raw_docs
INDEX_DIR: /content/drive/MyDrive/rag-chatbot/artifacts/indexes
TOP_K: 6


In [3]:
# sanity check
import glob, os
pdfs = glob.glob(DOCS_DIR + "/*.pdf")
print("PDFs found:", len(pdfs))
print("Example:", os.path.basename(pdfs[0]) if pdfs else "None")

PDFs found: 10
Example: tax01.pdf


> **SimpleDirectoryReader** is the simplest way to load data from local files into LlamaIndex.



In [4]:
# ingestion with doc_manifest enrichment

import os
import pandas as pd
from llama_index.core import SimpleDirectoryReader

MANIFEST_PATH = "/content/rag-chatbot/data/samples/doc_manifest.csv"
manifest = pd.read_csv(MANIFEST_PATH)  # load the manifest CSV into a DataFrame

manifest_map = {
    r["file_name"]: {
        "source_id": r["source_id"],
        "source_title": r["source_title"],
        "irs_publication": str(r["irs_publication"]),
        "tax_year": str(r["tax_year"]),
        "notes": r.get("notes", ""),
    }
    for _, r in manifest.iterrows()
}

def file_metadata(file_path: str) -> dict:
    fn = os.path.basename(file_path)
    extra = manifest_map.get(fn, {})
    return {
        "source_file": fn,
        "source_path": file_path,
        "domain": "IRS individual tax documents",
        **extra,
    }

reader = SimpleDirectoryReader(
    input_dir=DOCS_DIR,
    file_metadata=file_metadata,  # callback that returns extra metadata for each file
)

documents = reader.load_data()
print("Documents loaded:", len(documents))
print("Sample metadata:", documents[0].metadata if documents else None)
print("Sample text preview:", documents[0].text[:])


Documents loaded: 469
Sample metadata: {'page_label': '1', 'file_name': 'tax01.pdf', 'source_file': 'tax01.pdf', 'source_path': '/content/drive/MyDrive/rag-chatbot/raw_docs/tax01.pdf', 'domain': 'IRS individual tax documents', 'source_id': 'irs_pub17_2024', 'source_title': 'Your Federal Income Tax for Individuals (Publication 17)', 'irs_publication': '17', 'tax_year': '2024', 'notes': 'For use in preparing 2024 Returns'}
Sample text preview: Userid: CPM Schema: tipx Leadpct: 100% Pt. size: 8  Draft  Ok to Print
AH XSL/XML Fileid: … ication-17/2024/b/xml/cycle02/source (Init. & Date) _______
Page 1 of 143  6:16 - 23-Jan-2025
The type and rule above prints on all proofs including departmental reproduction proofs. MUST be removed before printing.
TAX GUIDE
2024
Get forms and other information faster and easier at:
• IRS.gov (English) 
• IRS.gov/Spanish (Español) 
• IRS.gov/Chinese (中文) 
• IRS.gov/Korean (한국어) 
• IRS.gov/Russian (Pусский) 
• IRS.gov/Vietnamese (Tiếng Việt) 
Publication 17 

> **SimpleDirectoryReader** automatically attaches a metadata dictionary to each Document object.<br>
> The PDFs were split into per-page Document objects (e.g., a 143-page PDF becomes about 143 items). This helps with citations because every piece is already tied to a page.

The preview also includes lines like “The type and rule above prints on all proofs ....” and internal print proof headers. This is noise and will hurt retrieval and answers. <br>
**A quick cleaning step is required to remove boilerplate lines that appear on every page.**
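
Because every page-level Document carries its own file and page metadata, a citation string can be derived directly from that dictionary. The sketch below is a hypothetical helper (`format_citation` is not part of LlamaIndex or this repo), and it assumes the metadata keys shown in the sample output above.

```python
def format_citation(meta: dict) -> str:
    """Build a short citation from page-level metadata (hypothetical helper)."""
    # Prefer the human-readable title, fall back to the file name.
    title = meta.get("source_title") or meta.get("file_name") or "Unknown source"
    parts = [title]
    if meta.get("tax_year"):
        parts.append(f"tax year {meta['tax_year']}")
    if meta.get("page_label"):
        parts.append(f"p. {meta['page_label']}")
    return ", ".join(parts)


# Keys and values copied from the sample metadata printed above.
sample_meta = {
    "file_name": "tax01.pdf",
    "page_label": "1",
    "source_title": "Your Federal Income Tax for Individuals (Publication 17)",
    "tax_year": "2024",
}
print(format_citation(sample_meta))
```

In the notebook this could be called as `format_citation(documents[0].metadata)`; missing keys simply drop out of the string.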

In [5]:
# per page stats

import re
import pandas as pd

def to_int(x, default=-1):
    try:
        return int(str(x).strip())
    except ValueError:
        # non-numeric page labels fall back to the default
        return default

rows = []
for i, d in enumerate(documents):
    meta = d.metadata or {}
    file_name = meta.get("file_name") or meta.get("source_file") or "UNKNOWN"
    page_label = meta.get("page_label")
    text = d.text or ""

    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_lines = len(lines)
    n_chars = len(text)
    n_words = len(re.findall(r"\w+", text))
    n_unique_lines = len(set(lines)) if n_lines else 0
    unique_line_ratio = (n_unique_lines / n_lines) if n_lines else 0.0

    rows.append({
        "doc_idx": i,
        "file_name": file_name,
        "page_label": str(page_label) if page_label is not None else "",
        "page_num": to_int(page_label),
        "n_chars": n_chars,
        "n_words": n_words,
        "n_lines": n_lines,
        "n_unique_lines": n_unique_lines,
        "unique_line_ratio": unique_line_ratio,
    })

df_pages = pd.DataFrame(rows).sort_values(["file_name", "page_num", "doc_idx"]).reset_index(drop=True)

print("Total page-docs:", len(df_pages))
print("PDF files:", df_pages["file_name"].nunique())
display(df_pages.head(10))

# Per PDF summary
df_pdf_summary = (
    df_pages
    .groupby("file_name", as_index=False)
    .agg(
        pages=("doc_idx", "count"),
        chars_mean=("n_chars", "mean"),
        chars_median=("n_chars", "median"),
        chars_min=("n_chars", "min"),
        chars_max=("n_chars", "max"),
        words_mean=("n_words", "mean"),
        unique_line_ratio_mean=("unique_line_ratio", "mean"),
        pct_very_short=("n_chars", lambda s: float((s < 200).mean()) * 100.0),
        pct_short=("n_chars", lambda s: float((s < 600).mean()) * 100.0),
    )
    .sort_values("pages", ascending=False)
)

display(df_pdf_summary)

Total page-docs: 469
PDF files: 10


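| | doc_idx | file_name | page_label | page_num | n_chars | n_words | n_lines | n_unique_lines | unique_line_ratio |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tax01.pdf | 1 | 1 | 752 | 124 | 21 | 21 | 1.0 |
| 1 | 2 | tax01.pdf | 1 | 1 | 8877 | 1523 | 276 | 273 | 0.98913 |
| 2 | 1 | tax01.pdf | 2 | 2 | 2547 | 352 | 64 | 64 | 1.0 |
| 3 | 3 | tax01.pdf | 2 | 2 | 8964 | 1556 | 273 | 269 | 0.985348 |
| 4 | 4 | tax01.pdf | 3 | 3 | 8648 | 1517 | 263 | 263 | 1.0 |
| 5 | 5 | tax01.pdf | 4 | 4 | 4025 | 678 | 110 | 108 | 0.981818 |
| 6 | 6 | tax01.pdf | 5 | 5 | 891 | 151 | 23 | 23 | 1.0 |
| 7 | 7 | tax01.pdf | 6 | 6 | 6098 | 1074 | 148 | 148 | 1.0 |
| 8 | 8 | tax01.pdf | 7 | 7 | 6967 | 1277 | 159 | 157 | 0.987421 |
| 9 | 9 | tax01.pdf | 8 | 8 | 7844 | 1427 | 186 | 186 | 1.0 |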


| | file_name | pages | chars_mean | chars_median | chars_min | chars_max | words_mean | unique_line_ratio_mean | pct_very_short | pct_short |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | tax01.pdf | 143 | 7073.741259 | 7745.0 | 752 | 10169 | 1305.538462 | 0.96333 | 0.0 | 0.0 |
| 9 | tax10.pdf | 79 | 4684.873418 | 4972.0 | 578 | 6915 | 775.177215 | 0.971552 | 0.0 | 1.265823 |
| 5 | tax06.pdf | 45 | 4871.044444 | 4329.0 | 1604 | 8513 | 840.688889 | 0.956346 | 0.0 | 0.0 |
| 8 | tax09.pdf | 42 | 5263.333333 | 5399.5 | 416 | 8476 | 1195.452381 | 0.960142 | 0.0 | 2.380952 |
| 1 | tax02.pdf | 30 | 6863.033333 | 7850.0 | 1188 | 10013 | 1223.3 | 0.981415 | 0.0 | 0.0 |
| 4 | tax05.pdf | 30 | 4980.833333 | 5274.0 | 2675 | 6624 | 861.533333 | 0.990426 | 0.0 | 0.0 |
| 6 | tax07.pdf | 27 | 4635.333333 | 5142.0 | 886 | 6723 | 801.333333 | 0.990295 | 0.0 | 0.0 |
| 2 | tax03.pdf | 27 | 4598.074074 | 4683.0 | 3005 | 5827 | 780.111111 | 0.975744 | 0.0 | 0.0 |
| 7 | tax08.pdf | 26 | 7529.192308 | 7811.5 | 2635 | 9225 | 1251.230769 | 0.988647 | 0.0 | 0.0 |
| 3 | tax04.pdf | 20 | 4582.95 | 5298.0 | 1145 | 5790 | 812.05 | 0.966264 | 0.0 | 0.0 |


> 'pct_very_short' and 'pct_short' help you find pages that are likely low value or extraction issues.
> unique_line_ratio helps spot pages that are repetitive boilerplate.

There are no "very short" pages: `pct_very_short` is 0 for every PDF, and `pct_short` is nonzero only for tax10.pdf (about 1.3%) and tax09.pdf (about 2.4%).<br>
**This suggests text extraction is working.**
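
To see which individual pages fall under a length cutoff, rather than only the per-PDF percentage, a small filter is enough. This is a minimal sketch: `short_pages` is a hypothetical helper, the rows below are toy data (not values from this corpus), and the 600-character cutoff simply mirrors the one used for `pct_short`.

```python
SHORT_THRESHOLD = 600  # same cutoff the per-PDF summary uses for pct_short


def short_pages(page_rows, threshold=SHORT_THRESHOLD):
    """Return (file_name, page_label, n_chars) for pages below the cutoff, shortest first."""
    flagged = [
        (r["file_name"], r["page_label"], r["n_chars"])
        for r in page_rows
        if r["n_chars"] < threshold
    ]
    return sorted(flagged, key=lambda item: item[2])


# Toy rows for illustration only; file names and counts are made up.
toy_rows = [
    {"file_name": "a.pdf", "page_label": "1", "n_chars": 752},
    {"file_name": "a.pdf", "page_label": "2", "n_chars": 120},
    {"file_name": "b.pdf", "page_label": "7", "n_chars": 590},
]
print(short_pages(toy_rows))
```

With the real data, the rows could come from `df_pages.to_dict("records")`, and each flagged page can then be opened for manual inspection.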

In [6]:
# inspection report per PDF, for boilerplate discovery

import re
from collections import Counter, defaultdict

def normalize_line(line: str) -> str:
    return re.sub(r"\s+", " ", line.strip())

# Group page docs by file_name
by_file = defaultdict(list)
for d in documents:
    fn = (d.metadata or {}).get("file_name") or (d.metadata or {}).get("source_file") or "UNKNOWN"
    by_file[fn].append(d)

def per_file_repeated_lines(fn: str, doc_list, top_n: int = 30, frac_threshold: float = 0.80):
    # Count lines across pages, but only once per page
    counts = Counter()
    total_pages = len(doc_list)
    for d in doc_list:
        seen = set()
        for raw in (d.text or "").splitlines():
            line = normalize_line(raw)
            if not line or len(line) < 10:
                continue
            if line in seen:
                continue
            seen.add(line)
            counts[line] += 1

    top = counts.most_common(top_n)
    candidates = [(line, c) for line, c in counts.items() if c / total_pages >= frac_threshold]

    print("\n" + "=" * 90)
    print("FILE:", fn, "| pages:", total_pages)
    print(f"Top {top_n} repeated lines (count | line):")
    for line, c in top:
        print(f"{c:4d} | {line}")

    print(f"\nCandidates appearing on >= {int(frac_threshold*100)}% of pages in {fn}: {len(candidates)}")
    # Show a few, highest counts first
    for line, c in sorted(candidates, key=lambda x: -x[1])[:25]:
        print(f"{c:4d} | {line}")

# Run report for each PDF
for fn in sorted(by_file.keys()):
    per_file_repeated_lines(fn, by_file[fn], top_n=25, frac_threshold=0.85)



FILE: tax01.pdf | pages: 143
Top 25 repeated lines (count | line):
 143 | The type and rule above prints on all proofs including departmental reproduction proofs. MUST be removed before printing.
  15 | Introduction
  13 | Useful Items
  13 | Y ou may want to see:
  12 | Publication
  12 | Single Married
  12 | Your tax is—
  12 | If line 15
  12 | income) is—
  12 | And you are—
  12 | * This column must also be used by a qualifying surviving spouse.
  11 | (Continued)
  11 | 2024 Tax Table — Continued
  10 | Form (and Instructions)
   9 | information.
   9 | For these and other useful items, go to IRS.gov/
   8 | more information.
   7 | formation.
   7 | Married filing separately 23
   7 | arrangements (IRAs))
   6 | for more information.
   6 | Dependents
   6 | 525 T axable and Nontaxable Income
   6 | Life insurance proceeds when
   5 | (Form 1040).

Candidates appearing on >= 85% of pages in tax01.pdf: 1
 143 | The type and rule above prints on all proofs including departmental

**"The type and rule above prints on all proofs including departmental reproduction proofs. MUST be removed before printing."**<br>
That is clearly print proof boilerplate, and it is safe to remove.

In [7]:
# targeted manual review set

import numpy as np

def pick_first_mid_last(df_one_pdf: pd.DataFrame):
    df_one_pdf = df_one_pdf.sort_values(["page_num", "doc_idx"])
    idxs = []
    if len(df_one_pdf) >= 1:
        idxs.append(int(df_one_pdf.iloc[0]["doc_idx"]))
    if len(df_one_pdf) >= 3:
        idxs.append(int(df_one_pdf.iloc[len(df_one_pdf)//2]["doc_idx"]))
    if len(df_one_pdf) >= 2:
        idxs.append(int(df_one_pdf.iloc[-1]["doc_idx"]))
    return idxs

review_idxs = set()

# First, middle, last for each PDF
for fn, g in df_pages.groupby("file_name"):
    for idx in pick_first_mid_last(g):
        review_idxs.add(idx)

# Outliers: shortest pages
shortest = df_pages.nsmallest(10, "n_chars")["doc_idx"].tolist()
for idx in shortest:
    review_idxs.add(int(idx))

# Outliers: most repetitive pages (low unique_line_ratio)
most_repetitive = df_pages.nsmallest(10, "unique_line_ratio")["doc_idx"].tolist()
for idx in most_repetitive:
    review_idxs.add(int(idx))

# Random sample (reproducible)
rng = np.random.default_rng(42)
random_idxs = rng.choice(df_pages["doc_idx"].values, size=min(20, len(df_pages)), replace=False)
for idx in random_idxs:
    review_idxs.add(int(idx))

review_list = sorted(review_idxs)
print("Total pages selected for manual review:", len(review_list))
print("First 30 doc indices:", review_list[:30])

def show_doc(doc_idx: int, n_chars_preview: int = 1200):
    d = documents[doc_idx]
    meta = d.metadata or {}
    print("doc_idx:", doc_idx)
    print("file_name:", meta.get("file_name") or meta.get("source_file"))
    print("page_label:", meta.get("page_label"))
    print("n_chars:", len(d.text or ""))
    print("n_lines:", len([ln for ln in (d.text or "").splitlines() if ln.strip()]))
    print("\nTEXT PREVIEW:\n")
    print((d.text or "")[:n_chars_preview])

# Show the first few review samples
for idx in review_list[:5]:
    print("\n" + "="*80 + "\n")
    show_doc(idx)

Total pages selected for manual review: 63
First 30 doc indices: [0, 6, 39, 40, 43, 59, 66, 71, 92, 125, 126, 142, 143, 158, 167, 172, 173, 186, 196, 198, 199, 200, 204, 210, 211, 217, 219, 220, 235, 239]


doc_idx: 0
file_name: tax01.pdf
page_label: 1
n_chars: 752
n_lines: 21

TEXT PREVIEW:

Userid: CPM Schema: tipx Leadpct: 100% Pt. size: 8  Draft  Ok to Print
AH XSL/XML Fileid: … ication-17/2024/b/xml/cycle02/source (Init. & Date) _______
Page 1 of 143  6:16 - 23-Jan-2025
The type and rule above prints on all proofs including departmental reproduction proofs. MUST be removed before printing.
TAX GUIDE
2024
Get forms and other information faster and easier at:
• IRS.gov (English) 
• IRS.gov/Spanish (Español) 
• IRS.gov/Chinese (中文) 
• IRS.gov/Korean (한국어) 
• IRS.gov/Russian (Pусский) 
• IRS.gov/Vietnamese (Tiếng Việt) 
Publication 17 (2024)  Catalog Number 10311G
Jan 22, 2025 Department of the T reasury  Internal Revenue Service  www.irs.gov
Your Federal
Income Tax
For Individuals
Pu

**The "Draft  Ok to Print" proof header is still visible on this page, so the first pages need closer scrutiny.**

In [8]:
# first-page inspection across all PDFs
import re
from collections import Counter, defaultdict

def normalize_line(line: str) -> str:
    return re.sub(r"\s+", " ", line.strip())

first_pages = [d for d in documents if str((d.metadata or {}).get("page_label", "")).strip() == "1"]

c = Counter()
examples = defaultdict(list)

for d in first_pages:
    seen = set()
    meta = d.metadata or {}
    fn = meta.get("file_name") or meta.get("source_file") or "UNKNOWN"

    for raw in (d.text or "").splitlines():
        line = normalize_line(raw)
        if not line or len(line) < 10:
            continue
        if line in seen:
            continue
        seen.add(line)

        c[line] += 1
        if len(examples[line]) < 3:
            examples[line].append({"file_name": fn, "page_label": meta.get("page_label")})

top = c.most_common(60)
print("First-page repeated lines across PDFs (count | line):")
for line, cnt in top:
    print(f"{cnt:3d} | {line}")


First-page repeated lines across PDFs (count | line):
 11 | The type and rule above prints on all proofs including departmental reproduction proofs. MUST be removed before printing.
 10 | Get forms and other information faster and easier at:
 10 | • IRS.gov (English)
 10 | • IRS.gov/Spanish (Español)
 10 | • IRS.gov/Chinese (中文)
 10 | • IRS.gov/Korean (한국어)
 10 | • IRS.gov/Russian (Pусский)
 10 | • IRS.gov/Vietnamese (Tiếng Việt)
  9 | For use in preparing
  6 | 2024 Returns
  6 | Userid: CPM Schema: tipx Leadpct: 100% Pt. size: 10 Draft Ok to Print
  5 | Future Developments
  4 | Userid: CPM Schema: tipx Leadpct: 100% Pt. size: 8 Draft Ok to Print
  4 | What's New
  4 | For the latest information about developments related to
  4 | Photographs of missing children. The IRS is a proud
  4 | partner with the National Center for Missing & Exploited
  4 | Children® (NCMEC). Photographs of missing children se-
  4 | lected by the Center may appear in this publication on pa-
  4 | ges that w

**Other artifacts like XSL/XML Fileid and timestamped Page X of Y exist, but some of those are specific to certain PDFs.**

In [9]:
# boilerplate cleaning

import re
from llama_index.core import Document

def normalize_line(line: str) -> str:
    return re.sub(r"\s+", " ", line.strip())

# 1) Exact line removals, proven by frequency analysis
REMOVE_EXACT = {
    "The type and rule above prints on all proofs including departmental reproduction proofs. MUST be removed before printing."
}
REMOVE_EXACT_N = {normalize_line(x) for x in REMOVE_EXACT}

# 2) Prefix removals for production artifacts (covers  Userid, XSL/XML Fileid, etc)
REMOVE_PREFIXES = [
    "Userid:",
    "AH XSL/XML Fileid:",
]

# 3) Signature rule for the timestamped "Page X of Y ..." proof line
# We keep this as a structural rule, not a broad regex.
DATE_SIG = re.compile(r"\b\d{1,2}-[A-Za-z]{3}-\d{4}\b")  # 23-Jan-2025
TIME_SIG = re.compile(r"\b\d{1,2}:\d{2}\b")             # 6:16

def is_proof_page_line(norm: str) -> bool:
    if not norm.startswith("Page "):
        return False
    if " of " not in norm:
        return False
    if not TIME_SIG.search(norm):
        return False
    if not DATE_SIG.search(norm):
        return False
    return True

cleaned_documents = []
removed = {"exact": 0, "prefix": 0, "page_sig": 0}

for d in documents:
    meta = d.metadata or {}
    new_lines = []

    for raw in (d.text or "").splitlines():
        norm = normalize_line(raw)
        if not norm:
            continue

        if norm in REMOVE_EXACT_N:
            removed["exact"] += 1
            continue

        if any(norm.startswith(pfx) for pfx in REMOVE_PREFIXES):
            removed["prefix"] += 1
            continue

        if is_proof_page_line(norm):
            removed["page_sig"] += 1
            continue

        new_lines.append(raw)

    cleaned_documents.append(Document(text="\n".join(new_lines).strip(), metadata=meta))

print("Pages:", len(documents), "Cleaned pages:", len(cleaned_documents))
print("Lines removed:", removed)
print("Preview:\n", cleaned_documents[0].text[:] if cleaned_documents else "None")

Pages: 469 Cleaned pages: 469
Lines removed: {'exact': 469, 'prefix': 20, 'page_sig': 469}
Preview:
 TAX GUIDE
2024
Get forms and other information faster and easier at:
• IRS.gov (English) 
• IRS.gov/Spanish (Español) 
• IRS.gov/Chinese (中文) 
• IRS.gov/Korean (한국어) 
• IRS.gov/Russian (Pусский) 
• IRS.gov/Vietnamese (Tiếng Việt) 
Publication 17 (2024)  Catalog Number 10311G
Jan 22, 2025 Department of the T reasury  Internal Revenue Service  www.irs.gov
Your Federal
Income Tax
For Individuals
Publication 17
For use in preparing
2024 Returns


In [10]:
# post cleaning check

def show_first_page(fn: str):
    for d in cleaned_documents:
        m = d.metadata or {}
        if (m.get("file_name") == fn) and (str(m.get("page_label")) == "1"):
            print("FILE:", fn, "PAGE:", m.get("page_label"))
            print(d.text[:600])
            return
    print("Not found:", fn)

show_first_page("tax01.pdf")
show_first_page("tax04.pdf")

FILE: tax01.pdf PAGE: 1
TAX GUIDE
2024
Get forms and other information faster and easier at:
• IRS.gov (English) 
• IRS.gov/Spanish (Español) 
• IRS.gov/Chinese (中文) 
• IRS.gov/Korean (한국어) 
• IRS.gov/Russian (Pусский) 
• IRS.gov/Vietnamese (Tiếng Việt) 
Publication 17 (2024)  Catalog Number 10311G
Jan 22, 2025 Department of the T reasury  Internal Revenue Service  www.irs.gov
Your Federal
Income Tax
For Individuals
Publication 17
For use in preparing
2024 Returns
FILE: tax04.pdf PAGE: 1
Publication 503
Child and
Dependent
Care Expenses
For use in preparing
2025 Returns
Get forms and other information faster and easier at:
• IRS.gov (English) 
• IRS.gov/Spanish (Español) 
• IRS.gov/Chinese (中文) 
• IRS.gov/Korean (한국어) 
• IRS.gov/Russian (Pусский) 
• IRS.gov/Vietnamese (Tiếng Việt) 
Future Developments
For the latest information about developments related to 
Pub. 503, such as legislation enacted after it was 
published, go to IRS.gov/Pub503.
What’s New
Trump account and new Form 4547. 

✅ **Cleaning complete (evidence-based).**<br>Removed the global print-proof boilerplate line, the per-file "Userid" and "AH XSL/XML Fileid" production lines, and the timestamped "Page X of Y" proof headers, while preserving provenance metadata (source_title, source_id, file_name, page_label) for reliable citations.
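
A quick regression check can confirm that none of the removed artifacts survive in the cleaned text. The sketch below is self-contained and runs on toy strings; `residual_boilerplate` is a hypothetical helper, and `PAGE_SIG` is a stricter single regex written for this check, not the structural rule used in the cleaning cell.

```python
import re

# Artifacts targeted by the cleaning pass, restated here so the check stands alone.
PROOF_LINE = "The type and rule above prints on all proofs"
PREFIXES = ("Userid:", "AH XSL/XML Fileid:")
# Assumed shape of the proof header, e.g. "Page 1 of 143 6:16 - 23-Jan-2025".
PAGE_SIG = re.compile(
    r"^Page \d+ of \d+\s+\d{1,2}:\d{2}\s*-\s*\d{1,2}-[A-Za-z]{3}-\d{4}"
)


def residual_boilerplate(texts):
    """Return (page_index, line) pairs for any artifact line still present."""
    hits = []
    for page_idx, text in enumerate(texts):
        for line in text.splitlines():
            norm = re.sub(r"\s+", " ", line.strip())
            if (
                norm.startswith(PROOF_LINE)
                or norm.startswith(PREFIXES)
                or PAGE_SIG.match(norm)
            ):
                hits.append((page_idx, norm))
    return hits


# Toy pages: the first is clean, the second still carries two artifacts.
clean_page = "TAX GUIDE\n2024\nSee Page 3 of this guide"
dirty_page = "Userid: CPM Schema: tipx\nPage 1 of 143  6:16 - 23-Jan-2025\nTAX GUIDE"
print(residual_boilerplate([clean_page, dirty_page]))
```

In the notebook, `residual_boilerplate(d.text for d in cleaned_documents)` would be expected to return an empty list if the cleaning pass caught everything these patterns cover.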


In [11]:
# chunking

from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
)

nodes = splitter.get_nodes_from_documents(cleaned_documents)

print("Nodes created:", len(nodes))
if nodes:
    print("Sample node metadata keys:", sorted(list(nodes[0].metadata.keys()))[:20])
    print("Sample node metadata:", {k: nodes[0].metadata.get(k) for k in ["source_title", "source_id", "file_name", "page_label"] if k in nodes[0].metadata})
    print("Sample node text preview:", nodes[0].get_content()[:])

Nodes created: 1595
Sample node metadata keys: ['domain', 'file_name', 'irs_publication', 'notes', 'page_label', 'source_file', 'source_id', 'source_path', 'source_title', 'tax_year']
Sample node metadata: {'source_title': 'Your Federal Income Tax for Individuals (Publication 17)', 'source_id': 'irs_pub17_2024', 'file_name': 'tax01.pdf', 'page_label': '1'}
Sample node text preview: TAX GUIDE
2024
Get forms and other information faster and easier at:
• IRS.gov (English) 
• IRS.gov/Spanish (Español) 
• IRS.gov/Chinese (中文) 
• IRS.gov/Korean (한국어) 
• IRS.gov/Russian (Pусский) 
• IRS.gov/Vietnamese (Tiếng Việt) 
Publication 17 (2024)  Catalog Number 10311G
Jan 22, 2025 Department of the T reasury  Internal Revenue Service  www.irs.gov
Your Federal
Income Tax
For Individuals
Publication 17
For use in preparing
2024 Returns


**Moving on to Indexing:**

LlamaIndex has a first-party FAISS vector store integration (`FaissVectorStore`).<br>
The FAISS index dimension must match the embedding model's output dimension, so it is set explicitly when the FAISS index is created.<br><br>
RAG retrieval works by turning text into vectors and then running a similarity search over them. LlamaIndex needs an embedding model to create those vectors.<br><br>

Hence:<br>
Embeddings: BAAI/bge-small-en-v1.5 (384 dims)<br>
FAISS index: IndexFlatIP (inner product), which ranks by cosine similarity when the embeddings are L2-normalized.
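
The pairing of an inner-product index with normalized embeddings rests on a simple identity: for unit-length vectors, the inner product equals the cosine similarity. The sketch below checks this in plain Python on two toy vectors. Whether the embedding wrapper actually returns unit-length vectors is an assumption worth verifying for the model in use.

```python
import math


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )


# Toy 2-d vectors; real embeddings here would have 384 dimensions.
a = [3.0, 4.0]
b = [4.0, 3.0]

ip_raw = inner_product(a, b)                               # depends on vector length
ip_unit = inner_product(l2_normalize(a), l2_normalize(b))  # length-independent

print("raw inner product:", ip_raw)
print("inner product after normalization:", ip_unit)
print("cosine similarity:", cosine(a, b))
```

Without normalization, longer vectors would score higher regardless of direction, which is why the raw inner product is not a drop-in replacement for cosine similarity.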

In [12]:
# embedding model

import faiss
print("faiss version:", faiss.__version__)
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.vector_stores.faiss import FaissVectorStore
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Free local embedding model (384-dim)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# FAISS index: exact search (good baseline, very strong quality)
DIM = 384
faiss_index = faiss.IndexFlatIP(DIM)

vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Build index from nodes and persist
index = VectorStoreIndex(nodes, storage_context=storage_context)
index.storage_context.persist(persist_dir=INDEX_DIR)

print("FAISS index built and persisted to:", INDEX_DIR)

faiss version: 1.13.2


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


FAISS index built and persisted to: /content/drive/MyDrive/rag-chatbot/artifacts/indexes


**This follows the LlamaIndex FAISS vector store pattern and uses standard persistence via storage_context.persist.**

In [13]:
from llama_index.core import load_index_from_storage

vector_store_reloaded = FaissVectorStore.from_persist_dir(INDEX_DIR)
storage_context_reloaded = StorageContext.from_defaults(
    vector_store=vector_store_reloaded,
    persist_dir=INDEX_DIR,
)

reloaded_index = load_index_from_storage(storage_context=storage_context_reloaded)
print("Reloaded FAISS index OK.")

Reloaded FAISS index OK.


The current `IndexFlatIP` performs exact (brute-force) nearest-neighbor search, which gives the best retrieval quality at the cost of scanning every vector.<br>
For much larger corpora (hundreds of thousands to millions of chunks), we could switch to an approximate FAISS index such as IVF, HNSW, or PQ. IVF and PQ require a training step, and all three need more parameter tuning.
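
To make the exact-versus-approximate trade-off concrete, here is a toy, pure-Python illustration of the idea behind an IVF-style index: vectors are grouped under centroids, and a query scans only the `nprobe` closest groups. This is not the FAISS implementation or API. The centroids are hand-picked (FAISS would learn them during training), and all names and vectors are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_inverted_lists(vectors, centroids):
    """Assign each vector id to its best-matching centroid by inner product."""
    lists = [[] for _ in centroids]
    for vec_id, vec in enumerate(vectors):
        best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        lists[best].append(vec_id)
    return lists


def ivf_search(query, vectors, centroids, lists, k=1, nprobe=1):
    """Scan only the nprobe best-matching clusters instead of every vector."""
    ranked = sorted(
        range(len(centroids)), key=lambda c: dot(query, centroids[c]), reverse=True
    )
    candidates = [vec_id for c in ranked[:nprobe] for vec_id in lists[c]]
    candidates.sort(key=lambda vec_id: dot(query, vectors[vec_id]), reverse=True)
    return candidates[:k]


# Toy unit vectors and hand-picked centroids.
vectors = [[1.0, 0.0], [0.96, 0.28], [0.0, 1.0], [0.6, 0.8]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
lists = build_inverted_lists(vectors, centroids)

query = [0.8, 0.6]
print("nprobe=1:", ivf_search(query, vectors, centroids, lists, nprobe=1))
print("nprobe=2:", ivf_search(query, vectors, centroids, lists, nprobe=2))
```

With `nprobe=1` the search skips the cluster holding the true best match, while probing both clusters recovers the exact answer. That is the recall-for-speed trade that approximate indexes expose through their tuning parameters.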

## Notebook 01 summary

- Ingested 10 IRS PDF publications from Google Drive and enriched every page-level Document with a `doc_manifest` mapping (human-readable `source_title`, stable `source_id`, `tax_year`, and provenance paths).
- Profiled extraction quality (per-page and per-PDF stats) and audited boilerplate using corpus-wide, per-PDF, and first-page frequency reports.
- Applied an evidence-based cleaning pass to remove print-proof artifacts while preserving citation-critical metadata (`file_name`, `page_label`, `source_title`, `source_id`).
- Chunked cleaned pages with `SentenceSplitter` using config-driven parameters (`chunk_size=700`, `chunk_overlap=120`) to produce citation-ready nodes.
- Built and persisted a FAISS-backed `VectorStoreIndex` with HuggingFace embeddings for fast similarity search, and validated persistence by reloading the index from `INDEX_DIR`.
