### Introduction

This notebook reads the manifest and uses it to embed all core documents and their available citations into a FAISS vector database, using LangChain and granite-embedding:30m. Pull the embedding model with Ollama before running the cells:

```
ollama pull granite-embedding:30m
``` 

### Importing and Paths

Change the ROOT path as needed. It should point to the main knowledge pack directory.
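One way to avoid editing the path on every machine is to read it from an environment variable and check it before running anything else. This is a minimal sketch, not part of the original notebook: the variable name `KIOSK_PACK_ROOT` and the helpers `resolve_pack_root` and `check_pack_root` are hypothetical.

```python
import os
from pathlib import Path
from typing import List


def resolve_pack_root(default: str, env_var: str = "KIOSK_PACK_ROOT") -> Path:
    """Return the pack root from env_var if it is set and non-empty, else default.

    KIOSK_PACK_ROOT is a hypothetical variable name chosen for this sketch.
    """
    return Path(os.environ.get(env_var) or default).expanduser()


def check_pack_root(root: Path) -> List[str]:
    """Return a list of problems; an empty list means the root looks usable."""
    problems = []
    if not root.is_dir():
        problems.append(f"pack root is not a directory: {root}")
    elif not (root / "manifest.yaml").is_file():
        problems.append(f"manifest.yaml not found under: {root}")
    return problems
```

With these in place, `ROOT = resolve_pack_root("/path/to/first_aid_pack_demo_v2")` followed by `check_pack_root(ROOT)` would report a wrong path before any embedding work starts.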

In [17]:
# --- A. Imports & config ---
from pathlib import Path
import json, hashlib, uuid, yaml
from typing import List, Dict
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

# Paths (adapt for your pack root)
ROOT = Path("/Users/ktejwani/Personal CS Projects/Summer 2025/Offline AI Kiosk/Offline-AI-Kiosk/first_aid_pack_demo_v2")
MANIFEST = ROOT / "manifest.yaml"
print(ROOT)
print(MANIFEST)

/Users/ktejwani/Personal CS Projects/Summer 2025/Offline AI Kiosk/Offline-AI-Kiosk/first_aid_pack_demo_v2
/Users/ktejwani/Personal CS Projects/Summer 2025/Offline AI Kiosk/Offline-AI-Kiosk/first_aid_pack_demo_v2/manifest.yaml


### Parsing YAML, Embedding Documents, and Creating Vector Store

NOTES:
1. The cell below creates a new directory inside the knowledge pack:
- Example: first_aid_pack_demo_v2/vector_db/text/faiss_index <br>
This directory holds the FAISS store (index.faiss) and the index pickle file (index.pkl).

2. embeddings.jsonl, index.bin, and meta.json are overwritten at the paths listed under precomputed_indices in the manifest (the code comments below place them under first_aid_pack_demo_v2/vector_db/text).
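Because these files are overwritten without warning, it can help to list what already exists before running the cell. The sketch below uses only the standard library. `existing_outputs` and `CANDIDATE_OUTPUTS` are hypothetical names, and the listed paths are assumptions based on the notes and code comments in this notebook; the real locations come from the manifest.

```python
from pathlib import Path
from typing import Iterable, List

# Assumed output locations relative to the pack root (not read from the manifest).
CANDIDATE_OUTPUTS = [
    "vector_db/text/embeddings.jsonl",
    "vector_db/text/meta.json",
    "vector_db/text/index.bin",
    "vector_db/text/faiss_index/index.faiss",
    "vector_db/text/faiss_index/index.pkl",
]


def existing_outputs(root: Path, relative_paths: Iterable[str]) -> List[Path]:
    """Return the output files that already exist under root and would be overwritten."""
    return [root / rel for rel in relative_paths if (root / rel).is_file()]
```

Calling `existing_outputs(ROOT, CANDIDATE_OUTPUTS)` before the build cell returns the files that the run would replace, so they can be copied elsewhere first if needed.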

In [None]:

with open(MANIFEST, "r", encoding="utf-8") as f:
    manifest = yaml.safe_load(f)

# Choose model from manifest
embed_model_name = manifest["embedding_config"]["text"]["model"]  # Ollama model tag, e.g. "granite-embedding:30m"
normalize = bool(manifest["embedding_config"]["text"].get("normalize", True))
max_tokens   = int(manifest["embedding_config"]["text"]["chunking"]["max_tokens"])
overlap_toks = int(manifest["embedding_config"]["text"]["chunking"]["overlap_tokens"])

# --- A.1 Embeddings (Ollama + Granite) ---
# Note: `normalize` is read above but only recorded in meta.json; nothing here rescales the vectors.
# LangChain's FAISS wrapper defaults to L2 (Euclidean) distance; for unit-length
# vectors, L2 ranking and cosine ranking agree.
emb = OllamaEmbeddings(model=embed_model_name)  # served by local Ollama

# --- A.2 Chunking (approx "semantic+fixed") ---
# We approximate "semantic+fixed" by:
#   1) small paragraphs/sentences splits (heuristics), then
#   2) fixed-size merge with overlap
# This is a pragmatic compromise without extra libs.
splitter = RecursiveCharacterTextSplitter(
    # Sensible boundaries; tweak as needed
    separators=["\n\n", "\n", "। ", ". ", "?", "!", " "],
    chunk_size=2000,     # characters, used as a rough proxy for tokens (no tokenizer here)
    chunk_overlap=250,   # characters; max_tokens/overlap_toks from the manifest are not applied
    length_function=len
)

def file_bytes(path: Path) -> bytes:
    return path.read_bytes()

def hash_bytes(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

# --- A.3 Build LangChain Documents with METADATA ---
docs: List[Document] = []
pack_name   = manifest["name"]
pack_ver    = manifest["version"]
pack_locales= manifest["locales"]

# Build a quick citation lookup
citations = {c["id"]: c for c in manifest.get("citations", [])}

for topic in manifest["index_of_topics"]:
    topic_id = topic["id"]
    for fmeta in topic["core_files"]:
        fpath = ROOT / fmeta["path"]
        if not fpath.exists():
            # skip missing dummy files gracefully
            continue
        raw = fpath.read_text(encoding="utf-8")
        chunks = splitter.split_text(raw)
        # Derive a locale label from the path: the third path component when the path contains /hi_en/, else "en"
        locale = fmeta["path"].split("/")[2] if "/hi_en/" in fmeta["path"] else "en"

        # Expand citation IDs -> full objs
        c_full = [citations[cid] for cid in fmeta.get("citations", []) if cid in citations]

        # Make one Document per chunk with rich metadata
        for i, text in enumerate(chunks):
            docs.append(Document(
                page_content=text,
                metadata={
                    "pack_name": pack_name,
                    "pack_version": pack_ver,
                    "topic_id": topic_id,
                    "file_id": fmeta["id"],
                    "path": str(fmeta["path"]),
                    "media_type": fmeta["media_type"],
                    "locale": locale,
                    "citations": c_full,    # keep full objects for traceability
                    "chunk_index": i,
                    # Stable ID for your own bookkeeping
                    "chunk_id": f"{fmeta['id']}::chunk::{i}",
                }
            ))

print(f"Prepared {len(docs)} chunks")

# --- A.4 Create FAISS & persist ---
# Note: FAISS persists two artifacts: "index.faiss" and "index.pkl" (docstore+index_to_docstore_id)
# We'll also export jsonl embeddings/meta to match your manifest's 'precomputed_indices'.
faiss_dir = ROOT / "vector_db" / "text" / "faiss_index"
faiss_dir.mkdir(parents=True, exist_ok=True)

vs = FAISS.from_documents(docs, emb)
vs.save_local(str(faiss_dir))  # writes index.faiss + index.pkl

# --- A.5 (Optional) Export JSONL embeddings + meta to align with manifest paths ---
# This performs a forward pass to dump raw vectors & metadata for audit/portability.
# Note: it re-embeds; for huge corpora you'd capture vectors in one pass.
embeddings_path = ROOT / manifest["precomputed_indices"]["text"]["embeddings"]      # vector_db/text/embeddings.jsonl
meta_path       = ROOT / manifest["precomputed_indices"]["text"]["meta"]            # vector_db/text/meta.json
index_bin_path  = ROOT / manifest["precomputed_indices"]["text"]["index"]           # vector_db/text/index.bin

embeddings_path.parent.mkdir(parents=True, exist_ok=True)

# Pull documents back out (the FAISS docstore keeps your metadata)
# vs.docstore._dict is a mapping of doc_id -> Document (internal but stable enough for export)
records = []
for doc_id, doc in vs.docstore._dict.items():
    vec = emb.embed_query(doc.page_content)  # one more embed for export
    rec = {
        "id": doc_id,
        "embedding": vec,
        "metadata": doc.metadata,
        "text": doc.page_content
    }
    records.append(rec)

with open(embeddings_path, "w", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

with open(meta_path, "w", encoding="utf-8") as f:
    json.dump({
        "model": embed_model_name,
        "dim": manifest["embedding_config"]["text"]["dim"],
        "normalize": normalize,
        "count": len(records),
        "pack": {"name": pack_name, "version": pack_ver, "locales": pack_locales}
    }, f, ensure_ascii=False, indent=2)

# Optionally copy the FAISS binary to the path your manifest expects:
# (FAISS writes 'index.faiss' -> we copy as 'index.bin' to match your field)
import shutil
shutil.copyfile(faiss_dir / "index.faiss", index_bin_path)
print("Saved FAISS + JSONL export")


Prepared 11 chunks
Saved FAISS + JSONL export
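A quick sanity check on the export can catch a mismatch between the JSONL file and meta.json. The sketch below uses only the standard library. `summarize_export` is a hypothetical helper, and it assumes each JSONL line carries the `embedding` field written by the cell above. It also reports whether the vectors are unit length: the build cell records `normalize` in meta.json but does not normalize vectors itself, so whether they are unit length depends on what the embedding backend returns.

```python
import json
import math
from pathlib import Path


def summarize_export(embeddings_path: Path, tol: float = 1e-3) -> dict:
    """Summarize a JSONL embeddings export: record count, vector dims, unit-norm check."""
    count = 0
    dims = set()
    all_unit = True
    with open(embeddings_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            vec = json.loads(line)["embedding"]
            dims.add(len(vec))
            norm = math.sqrt(sum(x * x for x in vec))
            if abs(norm - 1.0) > tol:
                all_unit = False
            count += 1
    return {"count": count, "dims": sorted(dims), "unit_norm": all_unit and count > 0}
```

Comparing the returned `count` and `dims` with the `count` and `dim` fields in meta.json, and `unit_norm` with `normalize`, shows whether the metadata describes the vectors that were actually written.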


### Testing it out

In [None]:
# Typical retriever usage
retriever = vs.as_retriever(search_kwargs={"k": 4})  # `vs` is the FAISS store built above
query = "What to do for bleeding?"
hits = retriever.invoke(query)

for i, d in enumerate(hits, 1):
    print(f"\n[{i}]")
    print("Topic:", d.metadata["topic_id"])
    print("File:", d.metadata["file_id"])
    print("Locale:", d.metadata["locale"])
    print("Citations:", [c["title"] for c in d.metadata.get("citations", [])])
    print("Chunk text:")
    print(d.page_content[:300], "..." if len(d.page_content) > 300 else "")

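# Filter to a topic or locale. Passing `filter` to similarity_search avoids relying on
# retriever.invoke forwarding extra keyword arguments, which varies across LangChain versions.
hits = vs.similarity_search("tourniquet steps", k=4, filter={"topic_id": "bleed-control", "locale": "en"})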



[1]
Topic: bleed-control
File: guide-bleed-overview
Locale: hi_en
Citations: ['WHO Basic Emergency Care (B.E.C.)']
Chunk text:
# Severe Bleeding Control
Severe bleeding can quickly become life-threatening if not controlled.  
Apply firm direct pressure with a clean cloth or sterile gauze.  
If bleeding soaks through, add more cloths without removing the first.  
Elevate the injured limb if possible while maintaining pressur ...

[2]
Topic: fracture-splint
File: guide-fracture-overview
Locale: hi_en
Citations: ['WHO Basic Emergency Care (B.E.C.)']
Chunk text:
# Fracture & Splinting
Fractures may present with pain, swelling, deformity, and inability to use the limb.  
Immobilize the joint above and below the suspected fracture.  
Use local materials such as bamboo sticks, boards, or rolled newspapers as splints.  
Pad the splints with cloth before tying t ...

[3]
Topic: snakebite
File: guide-snakebite-donts
Locale: hi_en
Citations: ['WHO SEARO – Snakebite Management']
Chunk text:
# Sna
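The metadata filter in the test cell can be pictured as a post-filter over ranked hits: a hit is kept only when every requested key matches its metadata. The sketch below illustrates that idea in plain Python. It is not LangChain's implementation; `matches`, `post_filter`, and the toy `ranked_hits` are hypothetical, with values that mirror the metadata fields shown in the output above.

```python
from typing import Any, Dict, List


def matches(metadata: Dict[str, Any], wanted: Dict[str, Any]) -> bool:
    """True when every key in wanted is present in metadata with an equal value."""
    return all(key in metadata and metadata[key] == value for key, value in wanted.items())


def post_filter(ranked: List[Dict[str, Any]], wanted: Dict[str, Any], k: int) -> List[Dict[str, Any]]:
    """Keep the first k ranked hits whose metadata matches wanted, preserving rank order."""
    return [hit for hit in ranked if matches(hit["metadata"], wanted)][:k]


# Toy ranked hits, best first.
ranked_hits = [
    {"chunk_id": "guide-bleed-overview::chunk::0",
     "metadata": {"topic_id": "bleed-control", "locale": "hi_en"}},
    {"chunk_id": "guide-fracture-overview::chunk::0",
     "metadata": {"topic_id": "fracture-splint", "locale": "hi_en"}},
    {"chunk_id": "guide-snakebite-donts::chunk::0",
     "metadata": {"topic_id": "snakebite", "locale": "hi_en"}},
]
```

For this toy data, `post_filter(ranked_hits, {"topic_id": "bleed-control"}, k=4)` keeps only the first hit, while adding `"locale": "en"` to the filter returns an empty list because every toy hit is labeled `hi_en`. An empty result from a filtered query is therefore worth checking against the locale labels actually stored in the metadata.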
