# FACTR_02_KB_Ingest_lang.ipynb (updated)

This notebook builds the unified FACTR knowledge base (KB) from normalised sources:

1. Load `MANIFEST.json` produced by `FACTR_01_Primary_Source_Import_Helper.ipynb`.
2. Merge all `_normalized/*.jsonl` files into `KB_passages.jsonl`.
3. Encode passages into `KB_embeddings.npy` + `KB_embeddings.meta.json`.
4. Build FAISS indexes for:
   - `KB_all.faiss` (all sources),
   - `KB_islam.faiss` (Islam-only),
   - `KB_christian.faiss` (Christianity-only).
5. Provide a `kb_search()` helper for interactive testing.

Confirm that a CUDA GPU is available to the runtime before encoding:

In [3]:
import torch
print("CUDA available:", torch.cuda.is_available())


CUDA available: True


## 🔹 Code cell 1 – Config & load MANIFEST.json

In [4]:
# Step 1 — Configuration and MANIFEST loading

import os, json
from pathlib import Path

# Base directories (adjust ROOT if your Drive path is different)
ROOT = "/content/drive/MyDrive/FATCR"
DATA_DIR = f"{ROOT}/data/processed"
os.makedirs(DATA_DIR, exist_ok=True)

RAW_ROOT = Path(ROOT) / "data/raw/kb"
NORM_DIR = RAW_ROOT / "_normalized"

KB_PASS = Path(DATA_DIR) / "KB_passages.jsonl"
KB_EMB  = Path(DATA_DIR) / "KB_embeddings.npy"
KB_META = Path(DATA_DIR) / "KB_embeddings.meta.json"

manifest_path = NORM_DIR / "MANIFEST.json"
manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

print("Using manifest:", manifest_path)
print("Sources:", len(manifest))
for m in manifest:
    print(" -", m["name"], "lang=", m.get("lang"), "genre=", m.get("genre"))


Using manifest: /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/MANIFEST.json
Sources: 8
 - quran_en lang= en genre= scripture
 - quran_ar lang= ar genre= scripture
 - bible_web_en lang= en genre= scripture
 - hadith_9books_en lang= en genre= hadith
 - tafsir_ibn_kathir_en lang= en genre= tafsir
 - tafsir_al_qurtubi_ar lang= ar genre= tafsir
 - christian_commentaries_patristic lang= en genre= commentary
 - christian_creeds lang= en genre= creed
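
The merge step in the next cell reads `entry["name"]` and `entry["path"]` directly and treats `tradition`, `genre`, `lang` and `collection` as optional. A minimal sketch of a pre-flight check built on that observation follows. The helper `check_manifest` and the two sample entries (including the `collection` value and the deliberately incomplete second entry) are hypothetical and are not part of the notebook.

```python
# Keys the merge step reads with entry["..."] (required) versus entry.get("...") (optional).
REQUIRED_KEYS = ("name", "path")
OPTIONAL_KEYS = ("tradition", "genre", "lang", "collection")

def check_manifest(entries):
    """Return a list of human-readable problems; an empty list means no issues were found."""
    problems = []
    seen_names = set()
    for i, entry in enumerate(entries):
        label = entry.get("name", f"<entry {i}>")
        for key in REQUIRED_KEYS:
            if not entry.get(key):
                problems.append(f"{label}: missing required key {key!r}")
        for key in OPTIONAL_KEYS:
            if entry.get(key) is None:
                problems.append(f"{label}: optional key {key!r} is unset")
        if label in seen_names:
            problems.append(f"{label}: duplicate source name")
        seen_names.add(label)
    return problems

# Hypothetical entries for illustration only; the second one is deliberately incomplete.
sample_manifest = [
    {"name": "quran_en", "path": "data/raw/kb/_normalized/quran_en.jsonl",
     "tradition": "Islam", "genre": "scripture", "lang": "en", "collection": "quran"},
    {"name": "christian_creeds", "lang": "en", "genre": "creed"},
]
sample_problems = check_manifest(sample_manifest)
for problem in sample_problems:
    print("[manifest]", problem)
```

An unset `tradition` matters most here, because the subset indexes built later select rows by that field.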


## 🔹 Code cell 2 – Merge all normalised files → KB_passages.jsonl

In [5]:
# Step 2 — Merge all normalised files into KB_passages.jsonl

ROOT_PATH = Path(ROOT)

def build_kb_passages(manifest, out_path: Path):
    total = 0
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", encoding="utf-8") as out_f:
        for entry in manifest:
            src_path = ROOT_PATH / entry["path"]
            print(f"[KB] Ingesting {entry['name']} from {src_path}")
            with src_path.open("r", encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    rec = json.loads(line)
                    # Attach high-level metadata if missing
                    rec.setdefault("tradition", entry.get("tradition"))
                    rec.setdefault("genre", entry.get("genre"))
                    rec.setdefault("lang", entry.get("lang"))
                    rec.setdefault("source_name", entry["name"])
                    rec.setdefault("source_collection", entry.get("collection"))
                    rec.setdefault("source_file", entry["path"])
                    json.dump(rec, out_f, ensure_ascii=False)
                    out_f.write("\n")
                    total += 1
    print(f"[KB] Wrote {total} passages to {out_path}")
    return total

_ = build_kb_passages(manifest, KB_PASS)


[KB] Ingesting quran_en from /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/quran_en.jsonl
[KB] Ingesting quran_ar from /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/quran_ar.jsonl
[KB] Ingesting bible_web_en from /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/bible_web_en.jsonl
[KB] Ingesting hadith_9books_en from /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/hadith_9books_en.jsonl
[KB] Ingesting tafsir_ibn_kathir_en from /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/tafsir_ibn_kathir_en.jsonl
[KB] Ingesting tafsir_al_qurtubi_ar from /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/tafsir_al_qurtubi_ar.jsonl
[KB] Ingesting christian_commentaries_patristic from /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/christian_commentaries_patristic.jsonl
[KB] Ingesting christian_creeds from /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/christian_creeds.jsonl
[KB] Wrote 162712 passages to /content/drive/MyDrive/FATCR/data/processed/KB_passages.js
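
The merge relies on `dict.setdefault`, so values already present on a passage record take precedence over the manifest entry. One consequence worth knowing: a key that is present but set to `None` is left as `None`. The sketch below isolates that behaviour. `attach_defaults` and the sample record are hypothetical stand-ins for the loop body above.

```python
def attach_defaults(rec, entry):
    """Copy source-level metadata onto a passage record without overwriting existing keys."""
    for rec_key, entry_key in [
        ("tradition", "tradition"),
        ("genre", "genre"),
        ("lang", "lang"),
        ("source_name", "name"),
        ("source_collection", "collection"),
        ("source_file", "path"),
    ]:
        rec.setdefault(rec_key, entry.get(entry_key))
    return rec

# Hypothetical manifest entry and passage record, for illustration only.
entry = {
    "name": "bible_web_en",
    "path": "data/raw/kb/_normalized/bible_web_en.jsonl",
    "tradition": "Christianity",
    "genre": "scripture",
    "lang": "en",
}
rec = {"text": "sample passage", "genre": "commentary", "lang": None}

merged = attach_defaults(dict(rec), entry)
print(merged["genre"])      # record-level value is kept
print(merged["lang"])       # an explicit None is not replaced by the manifest value
print(merged["tradition"])  # missing key is filled from the manifest
```

If explicit `None` values should also fall back to the manifest, the loop would need an `if rec.get(key) is None` test instead of `setdefault`.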

## 🔹 Code cell 3 – Build embeddings (KB_embeddings.npy + meta)

In [5]:
# Step 3 — Build sentence embeddings for all KB passages

!pip install -q sentence-transformers

import numpy as np
import torch  # used below to pick the device; imported here so the cell also runs after a resume
from sentence_transformers import SentenceTransformer

MODEL_NAME = "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
model = SentenceTransformer(MODEL_NAME, device="cuda" if torch.cuda.is_available() else "cpu")


texts = []
with KB_PASS.open("r", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        texts.append(rec["text"])

print("Passages to embed:", len(texts))

# Reuse the model loaded above; re-instantiating it here would discard the explicit device choice.
emb = model.encode(
    texts,
    batch_size=64,
    show_progress_bar=True,
    convert_to_numpy=True,
    normalize_embeddings=True,  # cosine similarity via inner product
)
emb = emb.astype("float32")

np.save(KB_EMB, emb)

meta = {
    "model_name": MODEL_NAME,
    "count": int(len(texts)),
    "dim": int(emb.shape[1]),
    "normalized": True,
}
KB_META.write_text(json.dumps(meta, indent=2, ensure_ascii=False), encoding="utf-8")

print("Saved embeddings to:", KB_EMB)
print("Saved embedding meta to:", KB_META)


Passages to embed: 162712


Saved embeddings to: /content/drive/MyDrive/FATCR/data/processed/KB_embeddings.npy
Saved embedding meta to: /content/drive/MyDrive/FATCR/data/processed/KB_embeddings.meta.json
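
Because the embeddings are saved with `normalize_embeddings=True`, an inner product between two of them equals their cosine similarity. This is why the next cell can use an inner-product index. The sketch below checks that identity on random stand-in vectors and shows one way to sanity-check a saved array against its meta file. `check_embeddings` and the toy data are assumptions for illustration, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(4, 8)).astype("float32")  # stand-in for unnormalised embeddings
norms = np.linalg.norm(vecs, axis=1, keepdims=True)
unit = vecs / norms                               # what normalize_embeddings=True yields

# Cosine similarity on the raw vectors versus plain inner product on the unit vectors.
cosine = (vecs @ vecs.T) / (norms @ norms.T)
inner = unit @ unit.T
same = bool(np.allclose(cosine, inner, atol=1e-5))
print("inner product == cosine:", same)

def check_embeddings(emb, meta, atol=1e-3):
    """Compare an embedding matrix with the fields written to the meta JSON."""
    problems = []
    if emb.shape[0] != meta["count"]:
        problems.append(f"count mismatch: {emb.shape[0]} vs {meta['count']}")
    if emb.shape[1] != meta["dim"]:
        problems.append(f"dim mismatch: {emb.shape[1]} vs {meta['dim']}")
    if meta.get("normalized"):
        row_norms = np.linalg.norm(emb, axis=1)
        if not np.allclose(row_norms, 1.0, atol=atol):
            problems.append("rows are not unit length")
    return problems

toy_meta = {"count": 4, "dim": 8, "normalized": True}
print(check_embeddings(unit, toy_meta))
print(check_embeddings(vecs, toy_meta))
```

In the notebook the same check could be run as `check_embeddings(np.load(KB_EMB), meta)` after this cell.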


## 🔹 Code cell 4 – Build FAISS indexes + LAST_KB.json

In [6]:
# Step 4 — Build FAISS indexes for ALL, ISLAM-only, and CHRISTIAN-only

!pip install -q faiss-cpu

import faiss
from datetime import datetime, timezone
import numpy as np

# Load rows + embeddings
rows = [json.loads(l) for l in KB_PASS.open("r", encoding="utf-8") if l.strip()]
emb = np.load(str(KB_EMB)).astype("float32")
meta = json.loads(KB_META.read_text(encoding="utf-8"))

assert len(rows) == emb.shape[0], f"rows/emb size mismatch: {len(rows)} vs {emb.shape[0]}"

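was_normalized = bool(meta.get("normalized", True))
if was_normalized:
    # Re-normalising unit vectors is effectively a no-op; it is a cheap safeguard before inner-product search.
    faiss.normalize_L2(emb)

d = emb.shape[1]

def index_factory():
    # Inner product on unit vectors equals cosine similarity; otherwise fall back to L2 distance.
    return faiss.IndexFlatIP(d) if was_normalized else faiss.IndexFlatL2(d)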

def build_subset(name, mask):
    mask = np.asarray(mask, dtype=bool)
    idxs = np.where(mask)[0]
    sub_emb = emb[idxs]
    idx = index_factory()
    idx.add(sub_emb)

    faiss_path = Path(DATA_DIR) / f"{name}.faiss"
    map_path   = Path(DATA_DIR) / f"{name}.map.jsonl"

    faiss.write_index(idx, str(faiss_path))

    with map_path.open("w", encoding="utf-8") as f:
        for pos, src_idx in enumerate(idxs):
            json.dump({"pos": int(pos), "kb_row": int(src_idx)}, f)
            f.write("\n")

    print(f"✓ {name}: {len(idxs)} vectors → {faiss_path}")
    return faiss_path, map_path

all_mask    = np.ones(len(rows), dtype=bool)
islam_mask  = np.array([r.get("tradition") == "Islam"        for r in rows], dtype=bool)
christ_mask = np.array([r.get("tradition") == "Christianity" for r in rows], dtype=bool)

fa_all,    map_all    = build_subset("KB_all",       all_mask)
fa_islam,  map_islam  = build_subset("KB_islam",     islam_mask)
fa_christ, map_christ = build_subset("KB_christian", christ_mask)

LAST_KB = {
    "time": datetime.now(timezone.utc).isoformat().replace("+00:00", "Z"),
    "dim": d,
    "normalized": was_normalized,
    "model": meta.get("model_name"),
    "artefacts": {
        "all":       {"faiss": os.path.relpath(str(fa_all),   ROOT), "map": os.path.relpath(str(map_all),   ROOT)},
        "islam":     {"faiss": os.path.relpath(str(fa_islam), ROOT), "map": os.path.relpath(str(map_islam), ROOT)},
        "christian": {"faiss": os.path.relpath(str(fa_christ),ROOT), "map": os.path.relpath(str(map_christ),ROOT)},
        "passages":  os.path.relpath(str(KB_PASS), ROOT),
    },
}

LAST_KB_PATH = Path(DATA_DIR) / "LAST_KB.json"
LAST_KB_PATH.write_text(json.dumps(LAST_KB, indent=2, ensure_ascii=False), encoding="utf-8")
print("✓ Updated", LAST_KB_PATH)


✓ KB_all: 162712 vectors → /content/drive/MyDrive/FATCR/data/processed/KB_all.faiss
✓ KB_islam: 100186 vectors → /content/drive/MyDrive/FATCR/data/processed/KB_islam.faiss
✓ KB_christian: 62526 vectors → /content/drive/MyDrive/FATCR/data/processed/KB_christian.faiss
✓ Updated /content/drive/MyDrive/FATCR/data/processed/LAST_KB.json
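
The subset counts above add up (100186 + 62526 = 162712), which suggests every passage carries exactly one of the two tradition labels. Each subset index numbers its vectors from zero, so the `.map.jsonl` file is what translates a hit position back to a row in `KB_passages.jsonl`. The sketch below illustrates that round trip with a NumPy inner product standing in for the flat FAISS index. The four toy rows and their `ref` values are hypothetical.

```python
import numpy as np

# Toy stand-ins for the KB rows and their unit-length embeddings.
rows = [
    {"ref": "a", "tradition": "Islam"},
    {"ref": "b", "tradition": "Christianity"},
    {"ref": "c", "tradition": "Islam"},
    {"ref": "d", "tradition": "Christianity"},
]
emb = np.eye(4, dtype="float32")

islam_mask = np.array([r["tradition"] == "Islam" for r in rows])
christ_mask = np.array([r["tradition"] == "Christianity" for r in rows])

# Every row should fall into exactly one subset.
is_partition = bool(np.all(islam_mask ^ christ_mask))

# Position in the subset index -> row in the full KB (what the .map.jsonl file stores).
kb_rows = np.where(christ_mask)[0]
sub_emb = emb[kb_rows]

# Brute-force inner-product search over the subset, as a flat inner-product index would do.
query = emb[3]
scores = sub_emb @ query
pos = int(np.argmax(scores))
hit = rows[int(kb_rows[pos])]

print("partition:", is_partition)
print("subset position:", pos, "-> kb_row:", int(kb_rows[pos]), "-> ref:", hit["ref"])
```

Indexing `rows` with the subset position directly would return the wrong passage for any subset other than `all`.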


## 🔹 Code cell 5 – kb_search() helper + smoke tests

In [9]:
# Step 5 — kb_search() helper for interactive testing

import faiss
import numpy as np

# Reload rows & metadata (in case the notebook was resumed)
rows = [json.loads(l) for l in KB_PASS.open("r", encoding="utf-8") if l.strip()]
meta = json.loads(KB_META.read_text(encoding="utf-8"))
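# Rebuild the path here so this cell does not depend on the index-building cell having run.
LAST_KB_PATH = Path(DATA_DIR) / "LAST_KB.json"
LAST_KB = json.loads(LAST_KB_PATH.read_text(encoding="utf-8"))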

from sentence_transformers import SentenceTransformer
model = SentenceTransformer(meta["model_name"])
was_normalized = bool(meta.get("normalized", True))

INDEX_CACHE = {}

def _load_index(subset: str = "all"):
    subset = subset.lower()
    if subset not in INDEX_CACHE:
        art = LAST_KB["artefacts"][subset]
        faiss_path = Path(ROOT) / art["faiss"]
        map_path   = Path(ROOT) / art["map"]
        idx = faiss.read_index(str(faiss_path))

        mapping = []
        with map_path.open("r", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue
                d = json.loads(line)
                mapping.append(d["kb_row"])
        mapping = np.array(mapping, dtype=int)
        INDEX_CACHE[subset] = (idx, mapping)
    return INDEX_CACHE[subset]

def kb_search(query: str, top_k: int = 5, subset: str = "all"):
    """Search the KB for a natural-language query.

    subset: 'all', 'islam', or 'christian'.
    """
    idx, mapping = _load_index(subset)
    q_emb = model.encode(
        [query],
        convert_to_numpy=True,
        normalize_embeddings=was_normalized
    ).astype("float32")
    if was_normalized:
        faiss.normalize_L2(q_emb)

    D, I = idx.search(q_emb, top_k)
    print(f"\nQuery: {query!r}  | subset='{subset}'")
    for rank, (score, pos) in enumerate(zip(D[0], I[0]), start=1):
        if pos == -1:
            continue
        row_idx = int(mapping[pos])
        rec = rows[row_idx]
        txt = rec["text"].replace("\n", " ")
        snippet = (txt[:160] + "…") if len(txt) > 160 else txt
        print(f"{rank:2d}. score={float(score):.4f} "
              f"[{rec.get('tradition','?')}/{rec.get('genre','?')}] "
              f"{rec.get('ref', rec.get('group_key',''))}")
        print("    ", snippet)

# Example smoke tests (you can comment these out later)
kb_search("cup of blessing which we bless", top_k=5, subset="all")
kb_search("one God in Trinity and Trinity in unity", top_k=5, subset="christian")



Query: 'cup of blessing which we bless'  | subset='all'
 1. score=0.7423 [Islam/scripture] Qur'an 54:35
     A blessing from Us. Thus We reward the thankful.
 2. score=0.6961 [Islam/scripture] Qur'an 76:5
     But the righteous will drink from a cup whose mixture is aroma.
 3. score=0.6636 [Islam/scripture] Qur'an 78:36
     A reward from your Lord, a fitting gift.
 4. score=0.6532 [Christianity/commentary] 1 Corinthians 10:16
     And adds, "The cup of blessing which we bless, is it not the communion of the blood of Christ? ". But if this indeed do not attain salvation, then neither did t…
 5. score=0.6429 [Islam/scripture] Qur'an 93:11
     But proclaim the blessings of your Lord.

Query: 'one God in Trinity and Trinity in unity'  | subset='christian'
 1. score=0.8073 [Christianity/creed] Athanasian Creed (Quicumque vult) (Latin Western creed (c. 5th–6th C)) ¶2
     Now the universal faith is this: that we worship one God in Trinity, and Trinity in unity, neither confusing the perso