Cell 1 — Title & overview (Markdown)
# Primary Source Import Helper v3.3 (adapter-based)

**Goal:** ingest many different *raw* sources (JSONL / CSV / TXT / directory JSON) and normalize them into a **single, consistent JSONL schema** that your KB index notebook can ingest.

### Target record schema (per passage)

```json
{
  "tradition": "Islam" | "Christianity",
  "genre": "scripture" | "hadith" | "tafsir" | "commentary" | "creed" | "...",
  "source": "Quran (EN: semarketir/quranjson)" | "World English Bible" | "spa5k/tafsir_api" | "...",
  "collection": "Quran" | "World English Bible" | "Sahih al-Bukhari" | "Tafsīr Ibn Kathīr" | "Tafsīr al-Jalālayn" | "..." ,
  "book": "Al-Fatiha" | "Genesis" | "Bukhari" | "Ibn Kathir" | "...",
  "chapter": 1,
  "verse": 1,
  "number": null,
  "grade": null,
  "lang": "en" | "ar",
  "ref": "Friendly ref e.g. “Qur’an 2:255”, “John 3:16”, “Bukhari 1:1”",
  "text": "Passage text…",
  "group_key": "Stable key used to align parallels (e.g., quran-2-255, bible-John-3-16, hadith:bukhari:1:1)"
}

### Workflow

Drop raw files under data/raw/kb/<tradition>/<collection>/..., or point to git-cloned folders.

Add a config in SOURCES (below) with kind = "jsonl" | "csv" | "txt" | "bible_dir" | "hadith_dir".

Run Convert all → normalized JSONLs in _normalized/.

(Optional) Run the Tafsīr fetcher to pull per-āyah commentary into normalized JSONL.

Run your KB ingest notebook to rebuild the FAISS KB.

In [2]:

# Cell 2 — Paths & artefacts

import os, json, re, csv, glob, time
from pathlib import Path

# --- Project root (adjust if needed) ---
ROOT = os.environ.get("FACTR_ROOT", "/content/drive/MyDrive/FATCR")
DATA_DIR = f"{ROOT}/data"
RAW_ROOT = f"{DATA_DIR}/raw/kb"
NORM_DIR = f"{RAW_ROOT}/_normalized"
os.makedirs(NORM_DIR, exist_ok=True)

print("ROOT:", ROOT)
print("RAW_ROOT:", RAW_ROOT)
print("NORM_DIR:", NORM_DIR)


ROOT: /content/drive/MyDrive/FATCR
RAW_ROOT: /content/drive/MyDrive/FATCR/data/raw/kb
NORM_DIR: /content/drive/MyDrive/FATCR/data/raw/kb/_normalized


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## Cell 3 — Small helpers (writer, cleaning, group keys)

In [3]:
def _clean(s):
    import re
    return re.sub(r"\s+", " ", str(s)).strip()

def quran_key(chapter:int, verse:int):
    return f"quran-{int(chapter)}-{int(verse)}"

def bible_key(book:str, chapter:int, verse:int):
    import re
    b = re.sub(r"\s+", "_", str(book).strip())
    return f"bible-{b}-{int(chapter)}-{int(verse)}"

def hadith_key(collection:str, book, number):
    import re
    c = re.sub(r"\s+", "_", str(collection).lower())
    return f"hadith:{c}:{book}:{number}"

def write_jsonl_line(path, rec: dict):
    """Strict writer that enforces required keys (including 'collection')."""
    required = {"tradition","genre","source","collection","lang","book","chapter","verse","number","grade","text","ref","group_key"}
    missing = required - set(rec.keys())
    if missing:
        raise ValueError(f"Record missing required keys: {sorted(missing)}")
    if not rec.get("collection"):
        raise ValueError("Record has empty 'collection'. Adapter must set it.")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")


## Cell 4 — Adapters: JSONL pass-through / CSV / TXT folder

In [4]:
def convert_jsonl(in_path:str, out_path:str, fixed:dict, keymap:dict, group_key_fn, post=None):
    wrote = 0
    with open(out_path, "w", encoding="utf-8"): pass
    with open(in_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip(): continue
            raw = json.loads(line)
            rec = {
                "tradition": fixed["tradition"],
                "genre": fixed["genre"],
                "source": fixed["source"],
                "collection": fixed["collection"],
                "lang": fixed["lang"],
                "book": raw.get(keymap.get("book","book")),
                "chapter": int(raw.get(keymap.get("chapter","chapter"), 1)),
                "verse": int(raw.get(keymap.get("verse","verse"), 1)),
                "number": raw.get("number"),
                "grade": raw.get("grade"),
                "text": _clean(raw.get(keymap.get("text","text"), "")),
                "ref": raw.get(keymap.get("ref","ref")) or "",
                "group_key": ""
            }
            rec["group_key"] = group_key_fn(rec, raw)
            if post: rec = post(rec, raw)
            write_jsonl_line(out_path, rec); wrote += 1
    print(f"✅ Wrote {out_path} | rows: {wrote}")

def convert_csv(in_path:str, out_path:str, fixed:dict, colmap:dict, group_key_fn, post=None):
    wrote = 0
    with open(out_path, "w", encoding="utf-8"): pass
    with open(in_path, encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for raw in reader:
            rec = {
                "tradition": fixed["tradition"],
                "genre": fixed["genre"],
                "source": fixed["source"],
                "collection": fixed["collection"],
                "lang": fixed["lang"],
                "book": raw.get(colmap.get("book","book")),
                "chapter": int(raw.get(colmap.get("chapter","chapter"), 1)),
                "verse": int(raw.get(colmap.get("verse","verse"), 1)),
                "number": raw.get(colmap.get("number","number")),
                "grade": raw.get(colmap.get("grade","grade")),
                "text": _clean(raw.get(colmap.get("text","text"), "")),
                "ref": raw.get(colmap.get("ref","ref")) or "",
                "group_key": ""
            }
            rec["group_key"] = group_key_fn(rec, raw)
            if post: rec = post(rec, raw)
            write_jsonl_line(out_path, rec); wrote += 1
    print(f"✅ Wrote {out_path} | rows: {wrote}")

def convert_txt_folder(in_dir:str, out_path:str, fixed:dict, lang="en", per_file_book=None, post=None):
    wrote = 0
    with open(out_path, "w", encoding="utf-8"): pass
    for p in sorted(Path(in_dir).glob("*.txt")):
        book = per_file_book(p) if per_file_book else p.stem
        with open(p, encoding="utf-8") as fh:
            for i, raw_line in enumerate(fh, 1):
                t = _clean(raw_line)
                if not t: continue
                rec = {
                    "tradition": fixed["tradition"],
                    "genre": fixed["genre"],
                    "source": fixed["source"],
                    "collection": fixed["collection"],
                    "lang": lang,
                    "book": book,
                    "chapter": 1,
                    "verse": i,
                    "number": None,
                    "grade": None,
                    "text": t,
                    "ref": f"{book} line {i}",
                    "group_key": f"txt-{book}-{i}"
                }
                if post: rec = post(rec, None)
                write_jsonl_line(out_path, rec); wrote += 1
    print(f"✅ Wrote {out_path} | rows: {wrote}")


## Cell 5 — Bible directory adapter (WEB JSON books)

In [5]:
def _iter_bible_from_obj(obj):
    """
    Tries common shapes:
    A) {"book":"Genesis","chapters":[{"chapter":1,"verses":[{"verse":1,"text":"..."}]}]}
    B) {"chapters": [{"1": {"1":"text", "2":"text", ...}}, ...]}
    C) {"1": {"1":"text", ...}, "2": {...}}  # direct map chapter -> verse map
    """
    # A or B
    if isinstance(obj, dict) and "chapters" in obj:
        chs = obj["chapters"]
        if isinstance(chs, list):
            for ch in chs:
                cnum = ch.get("chapter")
                verses = ch.get("verses")
                if isinstance(verses, list):
                    for it in verses:
                        v = it.get("verse")
                        txt = it.get("text") or it.get("content") or ""
                        if cnum and v:
                            yield int(cnum), int(v), _clean(txt)
                elif isinstance(verses, dict):
                    for vk, tv in verses.items():
                        try: v = int(vk)
                        except: continue
                        txt = tv if isinstance(tv, str) else (tv.get("text") if isinstance(tv, dict) else str(tv))
                        yield int(cnum), v, _clean(txt)
        elif isinstance(chs, dict):
            for ck, vdict in chs.items():
                try: c = int(ck)
                except: continue
                if isinstance(vdict, dict):
                    for vk, tv in vdict.items():
                        try: v = int(vk)
                        except: continue
                        txt = tv if isinstance(tv, str) else (tv.get("text") if isinstance(tv, dict) else str(tv))
                        yield c, v, _clean(txt)
    # C
    elif isinstance(obj, dict):
        for ck, vdict in obj.items():
            try: c = int(ck)
            except: continue
            if isinstance(vdict, dict):
                for vk, tv in vdict.items():
                    try: v = int(vk)
                    except: continue
                    txt = tv if isinstance(tv, str) else (tv.get("text") if isinstance(tv, dict) else str(tv))
                    yield c, v, _clean(txt)

def convert_bible_dir(in_dir:str, out_path:str, fixed:dict, post=None):
    wrote = 0
    with open(out_path, "w", encoding="utf-8"): pass
    for p in sorted(Path(in_dir).glob("*.json")):
        try:
            obj = json.loads(Path(p).read_text(encoding="utf-8"))
        except Exception as e:
            print("Skipping", p, "(", e, ")")
            continue
        book = obj.get("book") or p.stem
        for c, v, txt in _iter_bible_from_obj(obj):
            rec = {
                "tradition": fixed["tradition"],
                "genre": fixed["genre"],
                "source": fixed["source"],
                "collection": fixed["collection"],
                "lang": fixed["lang"],
                "book": book,
                "chapter": int(c),
                "verse": int(v),
                "number": None,
                "grade": None,
                "text": txt,
                "ref": f"{book} {c}:{v}",
                "group_key": bible_key(book, c, v)
            }
            if post: rec = post(rec, None)
            write_jsonl_line(out_path, rec); wrote += 1
        print(f"{Path(p).name}: +{wrote}")
    print(f"✅ Wrote {out_path} | rows: {wrote}")


## Cell 6 — Hadith directory adapter (AhmedBaset/hadith-json)

In [6]:
def _extract_hadith_text(d):
    # Best-effort: prefer English, fall back to other fields
    for k in ("english","hadith_en","hadith","text_en","text"):
        if k in d and d[k]:
            return _clean(d[k])
    if "text" in d and isinstance(d["text"], dict):
        for k in ("en","english","arabic"):
            if k in d["text"] and d["text"][k]:
                return _clean(d["text"][k])
    return ""

def convert_hadith_dir(in_dir:str, out_path:str, fixed:dict, collection_name_mapper=None):
    """
    Expects JSON files containing lists of hadith dicts for each book/collection.
    We try to read: book, number/hadithNo, grade when present.
    """
    wrote = 0
    with open(out_path, "w", encoding="utf-8"): pass

    for p in sorted(Path(in_dir).glob("*.json")):
        try:
            arr = json.loads(Path(p).read_text(encoding="utf-8"))
        except Exception as e:
            print("Skipping", p, "(", e, ")")
            continue

        # Derive collection from filename (or mapper)
        coll = p.stem
        if collection_name_mapper:
            coll = collection_name_mapper(coll)

        for d in arr if isinstance(arr, list) else []:
            book = d.get("book") or d.get("book_number") or 1
            number = d.get("hadithnumber") or d.get("hadithNo") or d.get("number") or d.get("id")
            grade = d.get("grade") or d.get("grading") or (d.get("metadata",{}).get("grade") if isinstance(d.get("metadata"), dict) else None)
            text = _extract_hadith_text(d)

            try: book_i = int(book)
            except: book_i = 1
            try: num_i = int(number) if number is not None else None
            except: num_i = None

            rec = {
                "tradition": fixed["tradition"],
                "genre": "hadith",
                "source": fixed["source"],
                "collection": coll or fixed["collection"],
                "lang": fixed["lang"],
                "book": str(book),
                "chapter": 1,
                "verse": 1,
                "number": num_i,
                "grade": grade,
                "text": text,
                "ref": f"{coll} {book}:{num_i}" if num_i else f"{coll} {book}",
                "group_key": hadith_key(coll, book_i, num_i or 0)
            }
            write_jsonl_line(out_path, rec); wrote += 1
        print(f"{Path(p).name}: +{wrote}")
    print(f"✅ Wrote {out_path} | rows: {wrote}")


## Cell 7 — Tafsīr fetcher (spa5k/tafsir_api → normalized JSONL)

In [None]:
# --- Improved Tafsir fetcher: tries multiple URL layouts & JSON shapes ---
import requests, time

SURAH_LENGTHS = {
  1:7,2:286,3:200,4:176,5:120,6:165,7:206,8:75,9:129,10:109,11:123,12:111,13:43,14:52,15:99,16:128,17:111,18:110,19:98,20:135,
  21:112,22:78,23:118,24:64,25:77,26:227,27:93,28:88,29:69,30:60,31:34,32:30,33:73,34:54,35:45,36:83,37:182,38:88,39:75,40:85,
  41:54,42:53,43:89,44:59,45:37,46:35,47:38,48:29,49:18,50:45,51:60,52:49,53:62,54:55,55:78,56:96,57:29,58:22,59:24,60:13,61:14,
  62:11,63:11,64:18,65:12,66:12,67:30,68:52,69:52,70:44,71:28,72:28,73:20,74:56,75:40,76:31,77:50,78:40,79:46,80:42,81:29,82:19,
  83:36,84:25,85:22,86:17,87:19,88:26,89:30,90:20,91:15,92:21,93:11,94:8,95:8,96:19,97:5,98:8,99:8,100:11,101:11,102:8,103:3,
  104:9,105:5,106:4,107:7,108:3,109:6,110:3,111:5,112:4,113:5,114:6
}

TAFSIR_EDITIONS = {
    # keep these codes; we’ll discover layout automatically
    "en-tafsir-ibn-kathir": {"collection":"Tafsīr Ibn Kathīr", "slug":"en-tafsir-ibn-kathir"},
    "en-al-jalalayn":      {"collection":"Tafsīr al-Jalālayn", "slug":"en-al-jalalayn"},
}

def normalize_record(rec: dict) -> dict:
    required = {"tradition","genre","source","collection","lang","book",
                "chapter","verse","number","grade","text","ref","group_key"}
    for k in required:
        rec.setdefault(k, None)
    return rec

def _candidate_urls(base, slug, s, a):
    # try most common layouts (both raw GitHub and jsDelivr)
    return [
        f"{base}/{slug}/{s}/{a}.json",         # per-ayah JSON
        f"{base}/{slug}/{s}.json",             # per-surah JSON
        f"{base}/en/{slug}/{s}/{a}.json",      # extra en/ level
        f"{base}/en/{slug}/{s}.json",
        # jsDelivr fallbacks (comment out if you keep seeing 403s)
        f"https://cdn.jsdelivr.net/gh/spa5k/tafsir_api@main/tafsir/{slug}/{s}/{a}.json",
        f"https://cdn.jsdelivr.net/gh/spa5k/tafsir_api@main/tafsir/{slug}/{s}.json",
        f"https://cdn.jsdelivr.net/gh/spa5k/tafsir_api@main/tafsir/en/{slug}/{s}/{a}.json",
        f"https://cdn.jsdelivr.net/gh/spa5k/tafsir_api@main/tafsir/en/{slug}/{s}.json",
    ]

def _extract_ayah_text_from_obj(obj, ayah_number):
    """
    Given a JSON object, pull text for one ayah.
    Handles:
      - direct string
      - {'text': '...'}
      - {'ayahs': {'1': '...', '2': '...'}} or list-like
    """
    if isinstance(obj, str):
        return obj
    if isinstance(obj, dict):
        # direct text
        if "text" in obj and isinstance(obj["text"], (str,)):
            return obj["text"]
        # nested ayahs map/dict
        if "ayahs" in obj:
            ayahs = obj["ayahs"]
            if isinstance(ayahs, dict):
                v = ayahs.get(str(ayah_number)) or ayahs.get(ayah_number)
                if isinstance(v, str):
                    return v
                if isinstance(v, dict) and "text" in v:
                    return v["text"]
            elif isinstance(ayahs, list):
                idx = ayah_number-1
                if 0 <= idx < len(ayahs):
                    v = ayahs[idx]
                    if isinstance(v, str):
                        return v
                    if isinstance(v, dict) and "text" in v:
                        return v["text"]
        # maybe chapter dict: {'1': '...', '2': '...'}
        v = obj.get(str(ayah_number))
        if isinstance(v, str):
            return v
        if isinstance(v, dict) and "text" in v:
            return v["text"]
    return None

def fetch_tafsir_edition(
    code: str,
    out_path: str,
    fixed_meta: dict,
    start_surah=1,
    end_surah=114,
    sleep=0.0,
    verbose_probe=True
):
    """
    Tries several directory layouts; supports per-ayah JSON and per-surah JSON.
    Uses raw.githubusercontent.com first (usually fewer 403s).
    """
    if code not in TAFSIR_EDITIONS:
        raise ValueError(f"Unknown tafsir edition code: {code}")
    info = TAFSIR_EDITIONS[code]
    collection = info["collection"]
    slug = info["slug"]

    with open(out_path, "w", encoding="utf-8"):
        pass

    base_raw = "https://raw.githubusercontent.com/spa5k/tafsir_api/main/tafsir"

    wrote = 0
    chosen_layout = None  # remember which URL pattern worked first

    for s in range(start_surah, end_surah+1):
        length = SURAH_LENGTHS.get(s, 0)
        if length <= 0:
            if s == start_surah:
                print(f"{code} s{s}: unknown length — skipping")
            continue

        # If we haven’t identified a working layout yet, probe ayah 1
        if chosen_layout is None:
            for url in _candidate_urls(base_raw, slug, s, 1):
                try:
                    r = requests.get(url, timeout=15)
                except requests.exceptions.RequestException:
                    continue
                if r.status_code == 200:
                    chosen_layout = url.replace(str(s) + ("/1.json" if url.endswith("/1.json") else ".json"), "{surah}{tail}")
                    if verbose_probe:
                        print(f"{code}: using layout like -> {url}")
                    break
            if chosen_layout is None:
                # couldn't find a layout; continue to next surah
                print(f"{code} s{s}: no working layout (404s); continuing")
                continue

        # Now fetch content using the chosen pattern
        # If the chosen URL was per-ayah, we keep per-ayah; if it was per-surah,
        # we reuse that pattern and then pick ayahs from the blob.
        is_per_ayah = chosen_layout.endswith("{surah}/1.json{tail}") or "/{tail}" in chosen_layout

        # Build surah JSON if per-surah
        surah_blob = None
        if not is_per_ayah:
            surah_url = chosen_layout.format(surah=str(s), tail="")
            try:
                rs = requests.get(surah_url, timeout=20)
                if rs.status_code == 200:
                    surah_blob = rs.json()
                else:
                    if s == start_surah:
                        print(f"{code} s{s}: per-surah fetch got {rs.status_code}")
                    # skip this surah
                    continue
            except Exception as e:
                if s == start_surah:
                    print(f"{code} s{s}: per-surah error {e}")
                continue

        # Iterate ayahs
        for a in range(1, length+1):
            text = None
            if is_per_ayah:
                ayah_url = chosen_layout.format(surah=str(s), tail="").replace("1.json", f"{a}.json")
                try:
                    ra = requests.get(ayah_url, timeout=15)
                    if ra.status_code == 200:
                        obj = ra.json()
                        # Accept 'text' directly or dict with 'text'
                        if isinstance(obj, str):
                            text = obj
                        elif isinstance(obj, dict):
                            text = obj.get("text")
                            if text is None:
                                # some files might wrap in {'data': {...}}
                                if "data" in obj and isinstance(obj["data"], dict):
                                    text = obj["data"].get("text")
                    else:
                        # If some ayah files are missing, we just skip
                        continue
                except Exception:
                    continue
            else:
                # per-surah blob
                text = _extract_ayah_text_from_obj(surah_blob, a)

            if not text:
                continue

            rec = {
                "tradition": fixed_meta.get("tradition", "Islam"),
                "genre": "tafsir",
                "source": fixed_meta.get("source", "spa5k/tafsir_api"),
                "collection": collection,
                "lang": fixed_meta.get("lang", "en"),
                "book": collection,
                "chapter": s,
                "verse": a,
                "number": None,
                "grade": None,
                "text": _clean(text),
                "ref": f"{collection} on Qur’an {s}:{a}",
                "group_key": quran_key(s, a)
            }
            rec = normalize_record(rec)
            write_jsonl_line(out_path, rec)
            wrote += 1

        if sleep:
            time.sleep(sleep)

    print(f"✅ Wrote {out_path} | rows: {wrote}")


Cell 8 — Configure SOURCES (toggle by uncommenting)bold text

In [7]:
# Enable/disable by (un)commenting. Paths assume you've placed raw data under RAW_ROOT.

SOURCES = [
    # Example: Qur'an EN pass-through JSONL (already normalized elsewhere)
    {
      "name": "quran_en",
      "kind": "jsonl",
      "in":  f"{RAW_ROOT}/Islam/Quran/quran_en_semarketir.jsonl",
      "out": f"{NORM_DIR}/quran_en.jsonl",
      "fixed": {
          "tradition":"Islam","genre":"scripture","source":"Quran (EN: semarketir/quranjson)",
          "collection":"Quran","lang":"en"
      },
      "keymap": {"book":"book","chapter":"chapter","verse":"verse","text":"text","ref":"ref"},
      "group_key_fn": lambda rec, raw: quran_key(rec["chapter"], rec["verse"]),
    },

    # Example: Qur'an AR pass-through JSONL
    {
      "name": "quran_ar",
      "kind": "jsonl",
      "in":  f"{RAW_ROOT}/Islam/Quran/quran_ar_semarketir.jsonl",
      "out": f"{NORM_DIR}/quran_ar.jsonl",
      "fixed": {
          "tradition":"Islam","genre":"scripture","source":"Quran (AR)",
          "collection":"Quran","lang":"ar"
      },
      "keymap": {"book":"book","chapter":"chapter","verse":"verse","text":"text","ref":"ref"},
      "group_key_fn": lambda rec, raw: quran_key(rec["chapter"], rec["verse"]),
    },

    # Example: Bible (WEB) book JSONs in a directory
    {
      "name": "bible_web_en_dir",
      "kind": "bible_dir",
      "in":  f"{RAW_ROOT}/Christianity/Bible/WEB",  # folder of *.json from TehShrike/world-english-bible
      "out": f"{NORM_DIR}/bible_web_en.jsonl",
      "fixed": {
          "tradition":"Christianity","genre":"scripture","source":"World English Bible",
          "collection":"World English Bible","lang":"en"
      }
    },

    # Example: Hadith directory (AhmedBaset/hadith-json) - set path to the_9_books
    {
      "name": "hadith_the9_dir",
      "kind": "hadith_dir",
      "in":  f"{RAW_ROOT}/Islam/Hadith/the_9_books",  # folder of collection files
      "out": f"{NORM_DIR}/hadith_9books_en.jsonl",
      "fixed": {
          "tradition":"Islam","genre":"hadith","source":"AhmedBaset/hadith-json",
          "collection":"", "lang":"en"
      }
    },
]
print("Configured sources:", [s["name"] for s in SOURCES])

Configured sources: ['quran_en', 'quran_ar', 'bible_web_en_dir', 'hadith_the9_dir']


Cell 7a — Build WEB JSON (run once, before your current “Cell 7b”)

Place this right before your existing WEB importer cell.

A) World English Bible (WEB) → bible_web_en.jsonl (handles JSON or USFM/TXT)

Put this as a single new cell. It scans the WEB repo for *.json and *.usfm/*.txt, parses either format, and writes a unified JSONL.
It also reads your existing KJV JSONL to copy the same group_key book slugs, so WEB and KJV align perfectly.

In [None]:
# Cell — Bible (WEB) adapter (robust) → normalized JSONL
# Handles:
#  (A) {"book": "...", "chapters": {"1": {"1":"...", ...}, "2": {...}}}
#  (B) {"book": "...", "chapters": [{"chapter":1, "verses": {... or [...] }}, ...]}
#  (C) [ {"chapter":1,"verse":1,"text":"..."}, {"chapter":1,"verse":2,"text":"..."} , ... ]

import os, json, re
from pathlib import Path

# ---- Paths (keep consistent) ----
ROOT = "/content/drive/MyDrive/FATCR"
RAW_ROOT = f"{ROOT}/data/raw/kb"
NORM_DIR = f"{RAW_ROOT}/_normalized"
WEB_IN   = f"{RAW_ROOT}/Christianity/Bible/web/json"   # where 1chronicles.json etc live
WEB_OUT  = f"{NORM_DIR}/bible_web_en.jsonl"
os.makedirs(NORM_DIR, exist_ok=True)

# ---- Helpers ----
def _clean(s: str) -> str:
    return re.sub(r"\s+", " ", str(s or "")).strip()

def _strip_tags(s: str) -> str:
    return re.sub(r"<[^>]+>", "", s or "")

def _int_or_none(x):
    try:
        return int(x)
    except:
        return None

def _friendly_book_from_stem(stem: str) -> str:
    st = stem.lower().replace("-", "")
    m = re.match(r"^([123])?(.*)$", st)
    num, rest = m.groups() if m else (None, st)
    words = re.findall(r"[a-z]+", rest)
    title = " ".join(w.capitalize() for w in words).strip() or stem
    return (num + " " + title).strip() if num else title

def _emit(rows, book, ch, v, txt, fixed_meta):
    rows.append({
        **fixed_meta,
        "book": book,
        "chapter": int(ch),
        "verse": int(v),
        "number": None,
        "grade": None,
        "text": _clean(txt),
        "ref": f"{book} {ch}:{v}",
        "group_key": f"bible-{book}-{int(ch)}-{int(v)}",
    })

fixed_meta_web = {
    "tradition": "Christianity",
    "genre": "scripture",
    "source": "World English Bible (TehShrike/world-english-bible)",
    "collection": "Bible",
    "lang": "en",
}

pdir = Path(WEB_IN)
assert pdir.is_dir(), f"WEB folder not found: {WEB_IN}"

files = sorted([p for p in pdir.glob("*.json") if p.is_file()], key=lambda p: p.name.lower())
print(f"WEB: scanning {len(files)} JSON files in {WEB_IN}")

rows = []
for i, f in enumerate(files, start=1):
    try:
        obj = json.loads(f.read_text(encoding="utf-8"))
    except Exception as e:
        print(f"  {i:2d}/{len(files)} {f.name}: JSON parse error → {e}")
        continue

    # Determine book name
    book = None
    if isinstance(obj, dict):
        book = obj.get("book")
    if not book:
        book = _friendly_book_from_stem(f.stem)

    added = 0

    # ---------- Shape A: chapters as dict of dicts ----------
    if isinstance(obj, dict) and isinstance(obj.get("chapters"), dict):
        for ch_k, ch_val in obj["chapters"].items():
            ch = _int_or_none(ch_k)
            if ch is None:
                continue
            # verses may be dict {"1": "In the beginning..."} or list ["v1","v2",...]
            if isinstance(ch_val, dict):
                for v_k, v_val in ch_val.items():
                    v = _int_or_none(v_k)
                    if v is None:
                        continue
                    if isinstance(v_val, str):
                        txt = v_val
                    elif isinstance(v_val, dict):
                        txt = v_val.get("text") or v_val.get("content") or json.dumps(v_val, ensure_ascii=False)
                    else:
                        txt = str(v_val)
                    _emit(rows, book, ch, v, _strip_tags(txt), fixed_meta_web); added += 1
            elif isinstance(ch_val, list):
                for idx, v_val in enumerate(ch_val, start=1):
                    txt = v_val if isinstance(v_val, str) else (v_val.get("text") if isinstance(v_val, dict) else str(v_val))
                    _emit(rows, book, ch, idx, _strip_tags(txt), fixed_meta_web); added += 1

    # ---------- Shape B: chapters as list of chapter objects ----------
    elif isinstance(obj, dict) and isinstance(obj.get("chapters"), list):
        for ch_obj in obj["chapters"]:
            if not isinstance(ch_obj, dict):
                continue
            ch = _int_or_none(ch_obj.get("chapter") or ch_obj.get("chapterNumber"))
            if ch is None:
                continue
            verses = ch_obj.get("verses") or ch_obj.get("data") or ch_obj.get("items")
            if isinstance(verses, dict):
                for v_k, v_val in verses.items():
                    v = _int_or_none(v_k)
                    if v is None:
                        continue
                    txt = v_val if isinstance(v_val, str) else (v_val.get("text") if isinstance(v_val, dict) else str(v_val))
                    _emit(rows, book, ch, v, _strip_tags(txt), fixed_meta_web); added += 1
            elif isinstance(verses, list):
                for idx, v_val in enumerate(verses, start=1):
                    txt = v_val if isinstance(v_val, str) else (v_val.get("text") if isinstance(v_val, dict) else str(v_val))
                    _emit(rows, book, ch, idx, _strip_tags(txt), fixed_meta_web); added += 1

    # ---------- Shape C: top-level list of verse objects ----------
    elif isinstance(obj, list):
        for rec in obj:
            if not isinstance(rec, dict):
                continue
            ch = _int_or_none(rec.get("chapter") or rec.get("chapterNumber") or rec.get("c"))
            v  = _int_or_none(rec.get("verse")   or rec.get("verseNumber")   or rec.get("v"))
            if ch is None or v is None:
                continue
            txt = rec.get("text") or rec.get("content") or rec.get("t") or json.dumps({k:rec[k] for k in rec if k not in ("chapter","verse")}, ensure_ascii=False)
            _emit(rows, book, ch, v, _strip_tags(txt), fixed_meta_web); added += 1

    print(f"  {i:2d}/{len(files)} {f.name}: +{added}")

# ---- Write JSONL ----
with open(WEB_OUT, "w", encoding="utf-8") as fo:
    for r in rows:
        fo.write(json.dumps(r, ensure_ascii=False) + "\n")
print(f"\n✅ Wrote {WEB_OUT} | rows: {len(rows)}")

# Quick preview
try:
    with open(WEB_OUT, "r", encoding="utf-8") as f:
        for _ in range(3):
            line = f.readline().strip()
            if not line: break
            print("Sample:", line[:180], "…")
except Exception as e:
    print("Preview error:", e)


WEB: scanning 66 JSON files in /content/drive/MyDrive/FATCR/data/raw/kb/Christianity/Bible/web/json
   1/66 1chronicles.json: +985
   2/66 1corinthians.json: +444
   3/66 1john.json: +108
   4/66 1kings.json: +856
   5/66 1peter.json: +123
   6/66 1samuel.json: +910
   7/66 1thessalonians.json: +90
   8/66 1timothy.json: +120
   9/66 2chronicles.json: +838
  10/66 2corinthians.json: +268
  11/66 2john.json: +13
  12/66 2kings.json: +787
  13/66 2peter.json: +61
  14/66 2samuel.json: +849
  15/66 2thessalonians.json: +47
  16/66 2timothy.json: +91
  17/66 3john.json: +14
  18/66 acts.json: +1088
  19/66 amos.json: +420
  20/66 colossians.json: +95
  21/66 daniel.json: +406
  22/66 deuteronomy.json: +1164
  23/66 ecclesiastes.json: +298
  24/66 ephesians.json: +158
  25/66 esther.json: +172
  26/66 exodus.json: +1264
  27/66 ezekiel.json: +1682
  28/66 ezra.json: +295
  29/66 galatians.json: +152
  30/66 genesis.json: +1716
  31/66 habakkuk.json: +102
  32/66 haggai.json: +40
  33/66 heb

1) Compact WEB to one row per verse

In [None]:
# Cell — Compact Bible (WEB) to one row per (book, chapter, verse)

import os, json
from collections import defaultdict

ROOT = "/content/drive/MyDrive/FATCR"
RAW_ROOT = f"{ROOT}/data/raw/kb"
NORM_DIR = f"{RAW_ROOT}/_normalized"

INP = f"{NORM_DIR}/bible_web_en.jsonl"
OUT = f"{NORM_DIR}/bible_web_en_compact.jsonl"

def clean(s):
    return " ".join((s or "").split())

buckets = defaultdict(list)
meta_example = None

with open(INP, "r", encoding="utf-8") as f:
    for line in f:
        r = json.loads(line)
        # keep a sample of meta fields
        if meta_example is None:
            meta_example = {k: r.get(k) for k in ("tradition","genre","source","collection","lang")}
        key = (r["book"], int(r["chapter"]), int(r["verse"]))
        txt = clean(r.get("text",""))
        if txt:
            buckets[key].append(txt)

rows = []
for (book,ch,v), parts in buckets.items():
    text = clean(" ".join(parts))
    if not text:
        continue
    rows.append({
        **(meta_example or {}),
        "book": book,
        "chapter": ch,
        "verse": v,
        "number": None,
        "grade": None,
        "text": text,
        "ref": f"{book} {ch}:{v}",
        "group_key": f"bible-{book}-{ch}-{v}",
    })

rows.sort(key=lambda r: (r["book"], r["chapter"], r["verse"]))

with open(OUT, "w", encoding="utf-8") as fo:
    for r in rows:
        fo.write(json.dumps(r, ensure_ascii=False) + "\n")

print(f"✅ Compacted → {OUT} | rows: {len(rows)}")


✅ Compacted → /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/bible_web_en_compact.jsonl | rows: 31216


B) English “Nine Books” → hadith_9books_en.jsonl (only if you have an English source)

Your current source (AhmedBaset/hadith-json) is Arabic. That’s why hadith_9books_en.jsonl is empty. If you provide a compatible English set (same per-book JSON shape, or a parallel folder containing English text), the adapter below will pick it up.

Put this as a new cell. It searches for any “en” folders under your nine-books tree. If found, it writes hadith_9books_en.jsonl. If not, it prints a clear message and leaves the file alone.

In [None]:
import json, itertools, glob, os
for p in sorted(glob.glob(f"{NORM_DIR}/hadith_9books_*.jsonl")):
    size = os.path.getsize(p)
    with open(p, "r", encoding="utf-8") as f:
        first = [json.loads(x) for x in itertools.islice(f, 3)]
    print(os.path.basename(p), "| bytes:", size, "| sample refs:", [r["ref"] for r in first])


hadith_9books_ar.jsonl | bytes: 77500273 | sample refs: ['Sunan Abi Dawud ?:?', 'Sunan Abi Dawud ?:?', 'Sunan Abi Dawud ?:?']
hadith_9books_en.jsonl | bytes: 27884965 | sample refs: ['Sunan Abi Dawud ?:?', 'Sunan Abi Dawud ?:?', 'Sunan Abi Dawud ?:?']


In [None]:
# Cell 11 (15-11-2025)— Hadith 9 Books (AR+EN) → _normalized/
# Purpose: Read AhmedBaset/hadith-json (by_book + by_chapter), extract AR + EN,
#          normalise to JSONL with stable (chapter, number) when available.

import os, json, re, hashlib, collections, itertools
from pathlib import Path

# ---- Paths -------------------------------------------------------------------
ROOT      = "/content/drive/MyDrive/FATCR"
RAW_ROOT  = f"{ROOT}/data/raw/kb"
NORM_DIR  = f"{RAW_ROOT}/_normalized"
os.makedirs(NORM_DIR, exist_ok=True)

# IMPORTANT: point to the *repo root*, not a subfolder
H_ROOT = "/content/drive/MyDrive/FATCR/data/raw/kb/Islam/Hadith/hadith-json"

# Force the canonical /db tree (this repo keeps the JSON there)
H_DB = os.path.join(H_ROOT, "db")
if not os.path.isdir(H_DB):
    raise RuntimeError(f"Expected hadith-json /db at: {H_DB}")
BASE = H_DB

BY_BOOK_DIRS    = [os.path.join(BASE, "by_book",    "the_9_books")]
BY_CHAPTER_DIRS = [os.path.join(BASE, "by_chapter", "the_9_books")]

print("Hadith root:", H_ROOT)
print("by_book dirs:", BY_BOOK_DIRS)
print("by_chapter dirs:", BY_CHAPTER_DIRS)

# ---- Config / helpers --------------------------------------------------------
BOOK_SLUGS = ["abudawud","ahmed","bukhari","darimi","ibnmajah","malik","muslim","nasai","tirmidhi"]
BOOK_NAME  = {
    "abudawud":"Sunan Abi Dawud",
    "ahmed":"Musnad Ahmad",
    "bukhari":"Sahih al-Bukhari",
    "darimi":"Sunan al-Darimi",
    "ibnmajah":"Sunan Ibn Majah",
    "malik":"Muwatta Malik",
    "muslim":"Sahih Muslim",
    "nasai":"Sunan al-Nasa'i",
    "tirmidhi":"Jamiʿ al-Tirmidhi",
}

# ⛔ No English dataset for Dārimī in your mirror — skip EN rows for this slug
EN_DISABLED = {"darimi"}

def _clean(s):
    return re.sub(r"\s+", " ", s).strip() if isinstance(s, str) else ""

def _as_int(x):
    if x is None: return None
    if isinstance(x, int): return x
    m = re.search(r"\d+", str(x))
    return int(m.group()) if m else None

def _hash_text(t):  # de-dup helper
    return hashlib.md5(t.encode("utf-8")).hexdigest()

def _norm_record(lang, slug, chapter, number, text, grade=None):
    return {
        "tradition": "Islam",
        "genre": "hadith",
        "collection": "Nine Books",
        "source": "AhmedBaset/hadith-json",
        "lang": lang,
        "book": BOOK_NAME.get(slug, slug),
        "chapter": int(chapter) if chapter is not None else None,
        "number": int(number) if number is not None else None,
        "grade": grade,
        "text": _clean(text),
        "ref": f"{BOOK_NAME.get(slug, slug)} {chapter if chapter is not None else '?'}:{number if number is not None else '?'}",
        "group_key": f"hadith-{slug}-{chapter if chapter is not None else 'X'}-{number if number is not None else 'Y'}",
    }

# ---- file readers ------------------------------------------------------------

def _iter_by_book(slug):
    """Yield a list of hadith dicts from by_book/<slug>.json (handles common shapes)."""
    paths = [os.path.join(d, f"{slug}.json") for d in BY_BOOK_DIRS]
    for p in paths:
        if os.path.isfile(p):
            try:
                with open(p, "r", encoding="utf-8") as f:
                    obj = json.load(f)
            except Exception:
                continue
            if isinstance(obj, list):
                return obj
            if isinstance(obj, dict):
                for k in ("hadiths","data","items"):
                    if k in obj and isinstance(obj[k], list):
                        return obj[k]
    return []

def _iter_by_chapter(slug):
    """Yield (chapter_no, list_of_items) from by_chapter/<slug>/<N>.json (handles common shapes)."""
    for base_dir in BY_CHAPTER_DIRS:
        slug_dir = os.path.join(base_dir, slug)
        if not os.path.isdir(slug_dir):
            continue
        entries = sorted(os.scandir(slug_dir), key=lambda d: _as_int(d.name) or float('inf'))
        for entry in entries:
            if not entry.is_file() or not entry.name.endswith(".json"):
                continue
            ch = _as_int(Path(entry.name).stem)
            try:
                with open(entry.path, "r", encoding="utf-8") as f:
                    obj = json.load(f)
            except Exception:
                continue
            if isinstance(obj, list):
                yield ch, obj
            elif isinstance(obj, dict):
                for k in ("hadiths","data","items"):
                    if k in obj and isinstance(obj[k], list):
                        yield ch, obj[k]
                        break

# ---- the key extractor (handles english as an object) ------------------------
def _extract_texts(item):
    # ---- Arabic ----
    ar = ""
    for k in ("arabic","ar","text_ar","textArabic","arabic_text","hadithArabic","hadith_ar","matn"):
        if isinstance(item.get(k), str) and item[k].strip():
            ar = _clean(item[k]); break
    if not ar and isinstance(item.get("text"), dict):
        for k in ("ar","arabic","matn"):
            v = item["text"].get(k)
            if isinstance(v, str) and v.strip():
                ar = _clean(v); break

    # ---- English ----
    en = ""
    eng = item.get("english")

    def _coerce_text(x):
        if isinstance(x, str):
            return _clean(x)
        if isinstance(x, list):
            parts = [str(t).strip() for t in x if isinstance(t, (str, int, float)) and str(t).strip()]
            return _clean(" ".join(parts)) if parts else ""
        if isinstance(x, dict):
            parts = [str(v).strip() for v in x.values() if isinstance(v, (str, int, float)) and str(v).strip()]
            return _clean(" ".join(parts)) if parts else ""
        return ""

    if isinstance(eng, dict):
        txt = ""
        for k in ("text","body","hadith","content","value","translation"):
            if k in eng:
                txt = _coerce_text(eng[k])
                if txt:
                    break
        narr = _clean(eng.get("narrator"))
        if txt:
            en = f"Narrated {narr}: {txt}" if narr and not txt.lower().startswith(("narrated","reported")) else txt

    elif isinstance(eng, str) and eng.strip():
        en = _clean(eng)

    elif isinstance(eng, list) and eng:
        cand = eng[0] if isinstance(eng[0], dict) else None
        if isinstance(cand, dict):
            txt = ""
            for k in ("text","body","hadith","content","value","translation"):
                if k in cand:
                    txt = _coerce_text(cand[k])
                    if txt:
                        break
            narr = _clean(cand.get("narrator"))
            if txt:
                en = f"Narrated {narr}: {txt}" if narr and not txt.lower().startswith(("narrated","reported")) else txt

    if not en:
        for k in ("en","text_en","textEnglish","english_text","hadithEnglish","hadith_en","translation"):
            v = item.get(k)
            if isinstance(v, (str, list, dict)):
                en = _coerce_text(v)
                if en:
                    break
    if not en and isinstance(item.get("text"), dict):
        v = item["text"].get("en") or item["text"].get("english")
        if isinstance(v, (str, list, dict)):
            en = _coerce_text(v)

    # narrator-only fallback
    if not en:
        narr = ""
        if isinstance(eng, dict):
            narr = _clean(eng.get("narrator"))
        elif isinstance(eng, list) and eng and isinstance(eng[0], dict):
            narr = _clean(eng[0].get("narrator"))
        if narr:
            en = f"Narrated {narr}"

    return ar, en

def _extract_numbers(item, default_ch=None, idx=None):
    """Try to obtain (chapter, number) from item; fall back to defaults."""
    ch = None
    for k in ("chapter","chapter_no","chapterNumber","chapter_id","bookNumber","book_id"):
        if k in item:
            ch = _as_int(item[k]);
            if ch is not None: break
    if ch is None:
        ch = default_ch

    num = None
    for k in ("hadithnumber","hadith_no","hadithNumber","number","id","hadith_id","index"):
        if k in item:
            num = _as_int(item[k]);
            if num is not None: break
    if num is None and isinstance(item.get("reference"), dict):
        for k in ("hadith","hadithNo","hadith_no"):
            if k in item["reference"]:
                num = _as_int(item["reference"][k]);
                if num is not None: break
    if num is None and idx is not None:
        num = int(idx) + 1
    return ch, num

# ---- build & write -----------------------------------------------------------

def build_nine_books_jsonl():
    ar_rows, en_rows = [], []
    seen = set()  # (lang, slug, chapter, number, hash(text))

    print("Scanning…")
    for slug in BOOK_SLUGS:
        add_ar = add_en = 0

        # Prefer richer by_chapter first
        bc_map = {}
        for ch, arr in _iter_by_chapter(slug):
            bc_map.setdefault(ch, []).extend(arr)
        for ch, arr in sorted(bc_map.items(), key=lambda kv: kv[0] if kv[0] is not None else float('inf')):
            for i, it in enumerate(arr):
                ar, en = _extract_texts(it)
                ch2, num2 = _extract_numbers(it, default_ch=ch, idx=i)
                if ar:
                    key = ("ar", slug, ch2, num2, _hash_text(ar))
                    if key not in seen:
                        ar_rows.append(_norm_record("ar", slug, ch2, num2, ar))
                        seen.add(key); add_ar += 1
                # ⛔ Skip EN rows for slugs with no English dataset (e.g., darimi)
                if en and slug not in EN_DISABLED:
                    key = ("en", slug, ch2, num2, _hash_text(en))
                    if key not in seen:
                        en_rows.append(_norm_record("en", slug, ch2, num2, en))
                        seen.add(key); add_en += 1

        # Also ingest by_book to catch anything missing
        for it in _iter_by_book(slug):
            ar, en = _extract_texts(it)
            ch2, num2 = _extract_numbers(it, default_ch=None, idx=None)
            if ar:
                key = ("ar", slug, ch2, num2, _hash_text(ar))
                if key not in seen:
                    ar_rows.append(_norm_record("ar", slug, ch2, num2, ar))
                    seen.add(key); add_ar += 1
            if en and slug not in EN_DISABLED:
                key = ("en", slug, ch2, num2, _hash_text(en))
                if key not in seen:
                    en_rows.append(_norm_record("en", slug, ch2, num2, en))
                    seen.add(key); add_en += 1

        print(f"  {slug:8s}: +{add_ar:6d} AR, +{add_en:6d} EN")

    out_ar = os.path.join(NORM_DIR, "hadith_9books_ar.jsonl")
    out_en = os.path.join(NORM_DIR, "hadith_9books_en.jsonl")
    with open(out_ar, "w", encoding="utf-8") as f:
        for r in ar_rows: f.write(json.dumps(r, ensure_ascii=False) + "\n")
    with open(out_en, "w", encoding="utf-8") as f:
        for r in en_rows: f.write(json.dumps(r, ensure_ascii=False) + "\n")

    print(f"\n✅ Wrote {out_ar} | rows: {len(ar_rows)}")
    print(f"✅ Wrote {out_en} | rows: {len(en_rows)}")

# ---- run ---------------------------------------------------------------------
build_nine_books_jsonl()

# ---- quick preview -----------------------------------------------------------
def _preview(p, n=3):
    size = os.path.getsize(p) if os.path.exists(p) else 0
    try:
        with open(p, "r", encoding="utf-8") as f:
            first = [json.loads(x) for x in itertools.islice(f, n)]
    except Exception:
        first = []
    refs  = [r.get("ref") for r in first]
    books = sorted({r.get("book") for r in first})
    print(os.path.basename(p), "| bytes:", size, "| sample refs:", refs or [], "| sample books:", books or [])

_preview(os.path.join(NORM_DIR, "hadith_9books_ar.jsonl"))
_preview(os.path.join(NORM_DIR, "hadith_9books_en.jsonl"))


Hadith root: /content/drive/MyDrive/FATCR/data/raw/kb/Islam/Hadith/hadith-json
by_book dirs: ['/content/drive/MyDrive/FATCR/data/raw/kb/Islam/Hadith/hadith-json/db/by_book/the_9_books']
by_chapter dirs: ['/content/drive/MyDrive/FATCR/data/raw/kb/Islam/Hadith/hadith-json/db/by_chapter/the_9_books']
Scanning…
  abudawud: + 10552 AR, + 10552 EN
  ahmed   : +  2748 AR, +  2718 EN
  bukhari : + 14554 AR, + 14554 EN
  darimi  : +  6812 AR, +     0 EN
  ibnmajah: +  8690 AR, +  8690 EN
  malik   : +  3720 AR, +  3946 EN
  muslim  : + 14918 AR, + 14916 EN
  nasai   : + 11536 AR, + 11536 EN
  tirmidhi: +  8106 AR, +  8106 EN

✅ Wrote /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/hadith_9books_ar.jsonl | rows: 81636
✅ Wrote /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/hadith_9books_en.jsonl | rows: 75018
hadith_9books_ar.jsonl | bytes: 105606911 | sample refs: ['Sunan Abi Dawud 1:1', 'Sunan Abi Dawud 1:2', 'Sunan Abi Dawud 1:3'] | sample books: ['Sunan Abi Dawud']
hadith_9books_en

## Christian commentaries
## *Slim version: sparse clone only the 4 Fathers*

In [None]:
%%bash
# Go to the Commentary folder under Christianity
cd "/content/drive/MyDrive/FATCR/data/raw/kb/Christianity/Commentary"

# Clone the repo if it isn't there yet
if [ ! -d "Commentaries-Database" ]; then
  git clone --filter=blob:none --sparse \
    https://github.com/HistoricalChristianFaith/Commentaries-Database.git
fi

cd Commentaries-Database

# Initialise sparse-checkout
git sparse-checkout init --cone

# Restrict to the four Fathers you want
git sparse-checkout set \
  "Irenaeus" \
  "Origen of Alexandria" \
  "John Chrysostom" \
  "Augustine of Hippo"



Cloning into 'Commentaries-Database'...
Updating index flags:  70% (36919/52032)Updating index flags:  71% (36943/52032)Updating index flags:  72% (37464/52032)Updating index flags:  73% (37984/52032)Updating index flags:  74% (38504/52032)Updating index flags:  75% (39024/52032)Updating index flags:  76% (39545/52032)Updating index flags:  77% (40065/52032)Updating index flags:  78% (40585/52032)Updating index flags:  79% (41106/52032)Updating index flags:  80% (41626/52032)Updating index flags:  81% (42146/52032)Updating index flags:  82% (42667/52032)Updating index flags:  83% (43187/52032)Updating index flags:  84% (43707/52032)Updating index flags:  85% (44228/52032)Updating index flags:  86% (44748/52032)Updating index flags:  87% (45268/52032)Updating index flags:  88% (45789/52032)Updating index flags:  89% (46309/52032)Updating index flags:  90% (46829/52032)Updating index flags:  91% (47350/52032)Updating index flags:  92% (47870/52032)Updating index fl

In [None]:
# === Patristic Christian Commentaries (Commentaries-Database) → FACTR KB ===
#
# This adapter ingests TOML files from the HistoricalChristianFaith
# Commentaries-Database repo into a unified JSONL KB file.
#
# It assumes:
#   - RAW_ROOT and NORM_ROOT are already defined
#   - bible_key(book, chapter, verse) exists and returns a canonical group_key
#   - write_jsonl_line(f_obj, record_dict) is available
#
# Upstream repo docs for format:
#   - File name:  [Father]/[Book] Chapter_Verse(-Verse...).toml
#   - Contents:   [[commentary]] with keys: quote, source_title, [source_url, append_to_author_name, time]
# See: Commentaries-Database README (File Contents Format / File Name Formats).

from pathlib import Path
import json
import re # Added import for re module, used by bible_key

# Helper functions from Cell 3, ensuring they are in scope
def _clean(s):
    # Re-import re for this specific function to ensure it's self-contained if needed
    # but it's already imported at the top of the cell for global scope.
    return re.sub(r"\s+", " ", str(s)).strip()

def bible_key(book:str, chapter:int, verse:int):
    # Re-import re for this specific function to ensure it's self-contained if needed
    # but it's already imported at the top of the cell for global scope.
    b = re.sub(r"\s+", "_", str(book).strip())
    # Corrected f-string syntax here
    return f"bible-{b}-{int(chapter)}-{int(verse)}"

def write_jsonl_line(path, rec: dict):
    """Strict writer that enforces required keys (including 'collection')."""
    required = {"tradition","genre","source","collection","lang","book","chapter","verse","number","grade","text","ref","group_key"}
    missing = required - set(rec.keys())
    if missing:
        raise ValueError(f"Record missing required keys: {sorted(missing)}")
    if not rec.get("collection"):
        raise ValueError("Record has empty 'collection'. Adapter must set it.")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Try stdlib tomllib first (Python 3.11+), otherwise fall back to rtoml (pip install rtoml)
try:
    import tomllib as _toml  # Python 3.11+
    def _load_toml_bytes(b: bytes):
        return _toml.loads(b.decode("utf-8"))
except ImportError:  # pragma: no cover
    import rtoml as _toml  # pip install rtoml
    def _load_toml_bytes(b: bytes):
        return _toml.loads(b.decode("utf-8"))


# Root where YOU have cloned / copied Commentaries-Database
# Adjust this if you keep it somewhere else.
COMMENTARIES_DB_ROOT = Path(RAW_ROOT) / "Christianity" / "Commentary" / "Commentaries-Database"
# Directory names in Commentaries-Database for our 4 core Fathers.
# IMPORTANT: make sure these match the actual folder names in your clone.
PATristic_AUTHOR_DIRS = [
    "Irenaeus",
    "Origen of Alexandria",  # or "Origen" if that’s the actual folder name
    "John Chrysostom",
    "Augustine of Hippo",
]



def parse_commentary_filename(path: Path):
    """
    Parse a Commentaries-Database TOML filename into (book, chapter:int, verse:int, span_str).

    File name formats (from the repo README):
        [Father-Name]/[Book-Name] Chapter_Verse.toml
        [Father-Name]/[Book-Name] Chapter_Verse-Verse.toml
        [Father-Name]/[Book-Name] Chapter_Verse-Chapter_Verse.toml

    Examples:
        'Matthew 23_35.toml'
        'Matthew 23_35-41.toml'
        '1 Kings 19_10-20_3.toml'

    We anchor each commentary record on the *first* verse of the span, but we keep the
    full span string (e.g. "1 Corinthians 10:16-20:3") in meta["span"] and in ref.
    """
    stem = path.stem  # e.g. '1 Corinthians 10_16' or 'Matthew 23_35-41'

    # Split book vs chapter/verse part
    try:
        book_part, ref_part = stem.rsplit(" ", 1)
    except ValueError as exc:
        raise ValueError(f"Unexpected commentary filename format: {path.name}") from exc

    # Human-readable span string, converting underscores to ':' inside the reference part
    span_str = f"{book_part} " + ref_part.replace("_", ":")

    # Anchor: only the start of the range, e.g. "10_16" from "10_16-20_3"
    range_start = ref_part.split("-", 1)[0]

    try:
        chap_str, verse_str = range_start.split("_", 1)
        chapter = int(chap_str)
        verse = int(verse_str)
    except Exception as exc:
        raise ValueError(f"Could not parse chapter/verse from {path.name}: {exc}") from exc

    return book_part, chapter, verse, span_str


def convert_commentaries_database(
    author_dirs=PATristic_AUTHOR_DIRS,
    out_path=None,
):
    """
    Walk the Commentaries-Database clone and normalise TOML commentary files
    for the given author dirs into our KB JSONL format.

    Returns: 혹은
        Path to the JSONL file produced.
    """
    if out_path is None:
        out_path = Path(NORM_DIR) / "christian_commentaries_patristic.jsonl"

    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    n_files = 0
    n_entries = 0

    # Initialize the output file to be empty, so write_jsonl_line appends correctly.
    # This replaces the `with out_path.open(...) as f_out:` block.
    with out_path.open("w", encoding="utf-8") as f_init:
        f_init.truncate(0)

    for author_dir_name in author_dirs:
        author_dir = COMMENTARIES_DB_ROOT / author_dir_name

        if not author_dir.exists():
            print(f"[COMMENTARIES][WARN] Author dir not found: {author_dir} (skipping)")
            continue

        print(f"[COMMENTARIES] Scanning {author_dir_name} …")
        for toml_path in sorted(author_dir.rglob("*.toml")):
            # Skip metadata.toml files, as they don't contain commentary and break parsing
            if toml_path.stem == "metadata":
                print(f"  [COMMENTARIES] Skipping non-commentary file: {toml_path.name}")
                continue

            n_files += 1

            data = _load_toml_bytes(toml_path.read_bytes())
            entries = data.get("commentary", [])

            # Normalise degenerate case where there is a single [commentary] table
            if isinstance(entries, dict):
                entries = [entries]

            book, chapter, verse, span_str = parse_commentary_filename(toml_path)
            group = bible_key(book, chapter, verse)

            for entry in entries:
                quote = (entry.get("quote") or "").strip()
                if not quote:
                    continue

                eff_author = author_dir_name + (entry.get("append_to_author_name") or "")

                meta = {
                    "author": eff_author,
                    "source_title": entry.get("source_title"),
                    "source_url": entry.get("source_url"),
                    "time": entry.get("time"),  # AD year or 9999 for unknown
                    "span": span_str,
                    "filename": toml_path.name,
                    "collection": "Commentaries-Database" # Explicitly set collection in meta
                }

                record = {
                    "tradition": "Christianity",
                    "genre": "commentary",
                    "source": "HistoricalChristianFaith-CommentariesDatabase",
                    "collection": meta["collection"], # Ensure collection is copied from meta
                    "lang": "en",                     # Set language to English
                    "book": book,
                    "chapter": chapter,
                    "verse": verse,
                    "number": None,                   # Set number to None as it's not applicable
                    "grade": None,                    # Set grade to None as it's not applicable
                    "text": quote,
                    "ref": span_str,
                    "group_key": group,
                    "author": eff_author,
                    # "meta": meta, # Removed, as it's not part of the standard schema
                }


                write_jsonl_line(out_path, record) # Pass the path, not the file object
                n_entries += 1

    print(f"[COMMENTARIES] Authors requested: {author_dirs}")
    print(f"[COMMENTARIES] Files scanned: {n_files}")
    print(f"[COMMENTARIES] Commentary entries written: {n_entries}")
    print(f"[COMMENTARIES] Output JSONL: {out_path}")
    return out_path


# ---- Run once to generate the KB file + update MANIFEST ----

commentary_out = convert_commentaries_database()

# manifest_path = Path(NORM_DIR) / "MANIFEST.json"
# all_sources = [] # Initialize as a list to store all source entries

# try:
#     loaded_manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
#     if isinstance(loaded_manifest, list):
#         all_sources.extend(loaded_manifest)
#     elif isinstance(loaded_manifest, dict) and "normalized_files" in loaded_manifest:
#         # Handle manifest from Cell 9 (dict with 'normalized_files' list of paths)
#         for p_str in loaded_manifest["normalized_files"]:
#             # Create a minimal structured entry for existing paths
#             all_sources.append({
#                 "name": Path(p_str).stem,
#                 "path": p_str,
#                 "tradition": "unknown", # Default placeholder
#                 "genre": "unknown",     # Default placeholder
#                 "source": "unknown",    # Default placeholder
#                 "notes": "Generated from previous manifest entries"
#             })
#     # If loaded_manifest is a dict but not the expected one from Cell 9, it's effectively ignored
# except FileNotFoundError:
#     pass # all_sources remains empty

# # Add the new commentary entry
# all_sources.append({
#     "name": "christian_commentaries_patristic",
#     "path": str(commentary_out.relative_to(Path(NORM_DIR))),
#     "tradition": "Christianity",
#     "genre": "commentary",
#     "source": "HistoricalChristianFaith-CommentariesDatabase",
#     "notes": "Patristic commentaries (Irenaeus, Origen, John Chrysostom, Augustine of Hippo) imported from Commentaries-Database"
# })

# # Deduplicate entries based on 'name' to avoid adding the same source multiple times
# seen_names = set()
# deduplicated_sources = []
# for source_entry in all_sources:
#     if source_entry["name"] not in seen_names:
#         deduplicated_sources.append(source_entry)
#         seen_names.add(source_entry["name"])

# manifest_path.write_text(json.dumps(deduplicated_sources, indent=2), encoding="utf-8") # Write the list
# print(f"[MANIFEST] Updated: {manifest_path}")


[COMMENTARIES] Scanning Irenaeus …
  [COMMENTARIES] Skipping non-commentary file: metadata.toml
[COMMENTARIES] Scanning Origen of Alexandria …
  [COMMENTARIES] Skipping non-commentary file: metadata.toml
[COMMENTARIES] Scanning John Chrysostom …
  [COMMENTARIES] Skipping non-commentary file: metadata.toml
[COMMENTARIES] Scanning Augustine of Hippo …
  [COMMENTARIES] Skipping non-commentary file: metadata.toml
[COMMENTARIES] Authors requested: ['Irenaeus', 'Origen of Alexandria', 'John Chrysostom', 'Augustine of Hippo']
[COMMENTARIES] Files scanned: 13641
[COMMENTARIES] Commentary entries written: 17639
[COMMENTARIES] Output JSONL: /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/christian_commentaries_patristic.jsonl
[MANIFEST] Updated: /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/MANIFEST.json


Quick sanity-checks / 3 below (optional but nice)

In [None]:
import json
from itertools import islice
from pathlib import Path

path = Path(NORM_DIR) / "christian_commentaries_patristic.jsonl"
print("File:", path)

with path.open("r", encoding="utf-8") as f:
    for line in islice(f, 5):
        rec = json.loads(line)
        # Modified to print 'source' field (with fallback) and other keys
        print(f"{rec.get('source', 'N/A Source')} → {rec.get('ref', 'N/A Ref')}")
        print("  text:", rec.get("text", "N/A Text")[:160], "…")
        print("  group_key:", rec.get("group_key", "N/A Group Key"))
        print()


File: /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/christian_commentaries_patristic.jsonl
HistoricalChristianFaith-CommentariesDatabase → 1 Corinthians 10:1
  text: ) is come. Wherefore let him that thinketh he standeth, take heed lest he fall." …
  group_key: bible-1_Corinthians-10-1

HistoricalChristianFaith-CommentariesDatabase → 1 Corinthians 10:11
  text: For during forty days He was learning to keep …
  group_key: bible-1_Corinthians-10-11

HistoricalChristianFaith-CommentariesDatabase → 1 Corinthians 10:16
  text: And adds, "The cup of blessing which we bless, is it not the communion of the blood of Christ? ".
But if this indeed do not attain salvation, then neither did t …
  group_key: bible-1_Corinthians-10-16

HistoricalChristianFaith-CommentariesDatabase → 1 Corinthians 10:4
  text: And as He was born of Mary in the last times, so did He also proceed from God as the First-begotten of every creature; and as He hungered, so did He satisfy …
  group_key: bible-1_Corinth

In [None]:
import json
from pathlib import Path

path = Path(NORM_DIR) / "christian_commentaries_patristic.jsonl"
with path.open("r", encoding="utf-8") as f:
    rec = json.loads(next(f))
print(rec.keys())
# Removed: print(rec["meta"])


dict_keys(['tradition', 'genre', 'source', 'collection', 'lang', 'book', 'chapter', 'verse', 'number', 'grade', 'text', 'ref', 'group_key', 'author'])


In [None]:
import json
from collections import Counter
from pathlib import Path

path = Path(NORM_DIR) / "christian_commentaries_patristic.jsonl"

author_counts = Counter()
with path.open("r", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        author_counts[rec.get("author","<unknown>")] += 1

author_counts


Counter({'Irenaeus': 721,
         'Origen of Alexandria': 2265,
         'Origen of Alexandria is referenced above by Jerome (AD 420)': 1,
         'Origen of Alexandria (as quoted by Aquinas, AD 1274)': 238,
         'John Chrysostom': 7323,
         'John Chrysostom (as quoted by Aquinas, AD 1274)': 657,
         'Augustine of Hippo': 5837,
         'Augustine of Hippo (as quoted by Aquinas, AD 1274)': 597})

Manafest fix cell - not needed for the final prod

In [None]:
# JUst a quick fix if the manifest is out of touch? Uncomment if required

# import json
# from pathlib import Path

# manifest_path = Path(NORM_DIR) / "MANIFEST.json"
# manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

# for entry in manifest:
#     if entry["name"] == "bible_web_en":
#         entry.update({
#             "tradition": "Christianity",
#             "genre": "scripture",
#             "source": "World English Bible",
#             "notes": "WEB (public domain) Bible, English",
#         })
#     elif entry["name"] == "hadith_9books_en":
#         entry.update({
#             "tradition": "Islam",
#             "genre": "hadith",
#             "source": "AhmedBaset/hadith-json (the_9_books)",
#             "notes": "Nine canonical Sunni hadith collections (English translation)",
#         })
#     elif entry["name"] == "christian_commentaries_patristic":
#         entry.update({
#             "path": "data/raw/kb/_normalized/christian_commentaries_patristic.jsonl",
#             "tradition": "Christianity",
#             "genre": "commentary",
#             "source": "HistoricalChristianFaith-CommentariesDatabase",
#         })

# manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
# print("Updated manifest written to:", manifest_path)
# print(json.dumps(manifest, indent=2))


Updated manifest written to: /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/MANIFEST.json
[
  {
    "name": "bible_web_en",
    "path": "data/raw/kb/_normalized/bible_web_en.jsonl",
    "tradition": "Christianity",
    "genre": "scripture",
    "source": "World English Bible",
    "notes": "WEB (public domain) Bible, English"
  },
  {
    "name": "hadith_9books_en",
    "path": "data/raw/kb/_normalized/hadith_9books_en.jsonl",
    "tradition": "Islam",
    "genre": "hadith",
    "source": "AhmedBaset/hadith-json (the_9_books)",
    "notes": "Nine canonical Sunni hadith collections (English translation)"
  },
  {
    "name": "christian_commentaries_patristic",
    "path": "data/raw/kb/_normalized/christian_commentaries_patristic.jsonl",
    "tradition": "Christianity",
    "genre": "commentary",
    "source": "HistoricalChristianFaith-CommentariesDatabase",
    "notes": "Patristic commentaries (Irenaeus, Origen, John Chrysostom, Augustine of Hippo) imported from Commentaries-Datab

## clone creeds repo & ingest

In [None]:
%%bash
# Go to the Commentary folder under Christianity
cd "/content/drive/MyDrive/FATCR/data/raw/kb/Christianity/Creeds"

# mkdir -p _external
# cd _external

REPO_URL="https://github.com/lukmaanviscomi/christian-creeds-kb.git"

if [ ! -d "christian-creeds-kb" ]; then
  git clone "$REPO_URL"
else
  cd christian-creeds-kb
  git pull --ff-only || echo "[GIT] Pull failed; using existing copy."
fi


Cloning into 'christian-creeds-kb'...


## Christian Creeds (TXT → FACTR KB JSONL)

In [None]:
# === Christian Creeds (TXT → FACTR KB JSONL) ===
#
# This adapter ingests a small set of core Christian creeds
# from plain-text files and normalises them into our KB schema.
#
# Files expected under:
#   RAW_ROOT / "Christianity" / "Creeds"
#
# You can rename the .txt files, but if you do, update CREEDS_SPEC
# to match.

import json
from pathlib import Path

CREEDS_DIR = Path(RAW_ROOT) / "Christianity" / "Creeds" / "christian-creeds-kb" / "txt"

CREEDS_SPEC = [
    {
        "filename": "nicene_381_en.txt",
        "book": "Nicene Creed",
        "short_name": "Niceno-Constantinopolitan Creed",
        "year": 381,
        "council": "First Council of Constantinople",
        "group_key": "creed-nicene-381",
    },
    {
        "filename": "apostles_creed_en.txt",
        "book": "Apostles' Creed",
        "short_name": "Apostles' Creed",
        "year": None,  # early Western baptismal creed
        "council": "Western baptismal creed",
        "group_key": "creed-apostles",
    },
    {
        "filename": "athanasian_creed_en.txt",
        "book": "Athanasian Creed",
        "short_name": "Athanasian Creed (Quicumque vult)",
        "year": None,
        "council": "Latin Western creed (c. 5th–6th C)",
        "group_key": "creed-athanasian",
    },
    {
        "filename": "chalcedon_definition_en.txt",
        "book": "Chalcedonian Definition",
        "short_name": "Definition of Chalcedon",
        "year": 451,
        "council": "Council of Chalcedon",
        "group_key": "creed-chalcedon-451",
    },
]


def _normalize_text(s: str) -> str:
    """Normalise line endings and trim trailing spaces."""
    return "\n".join(
        line.rstrip() for line in s.replace("\r\n", "\n").split("\n")
    ).strip()


def convert_creeds_txt(creeds_spec=CREEDS_SPEC, out_path=None):
    """
    Read creed .txt files and write one JSONL file in KB schema.

    We split each creed into paragraph-ish chunks on blank lines,
    so verses 1,2,3,... correspond to paragraphs within that creed.
    """
    if out_path is None:
        out_path = Path(NORM_DIR) / "christian_creeds.jsonl"
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    n_creeds = 0
    n_passages = 0

    with out_path.open("w", encoding="utf-8") as f_out:
        for c in creeds_spec:
            path = CREEDS_DIR / c["filename"]
            if not path.exists():
                print(f"[CREEDS][WARN] Missing file: {path}")
                continue

            raw = path.read_text(encoding="utf-8")
            text_norm = _normalize_text(raw)

            # split on blank lines into paragraphs
            paragraphs = [
                p.strip() for p in text_norm.split("\n\n") if p.strip()
            ]
            if not paragraphs:
                print(f"[CREEDS][WARN] No paragraphs found in {path}")
                continue

            n_creeds += 1
            for idx, para in enumerate(paragraphs, start=1):
                # Build a human-readable ref string
                if c["year"]:
                    ref_str = f"{c['short_name']} ({c['year']}, {c['council']}) ¶{idx}"
                else:
                    ref_str = f"{c['short_name']} ({c['council']}) ¶{idx}"

                record = {
                    "tradition": "Christianity",
                    "genre": "creed",
                    "source": "Wikisource",
                    "collection": "Ecumenical Creeds",
                    "lang": "en",
                    "book": c["book"],
                    "chapter": 1,
                    "verse": idx,
                    "number": None,
                    "grade": None,
                    "text": para,
                    "ref": ref_str,
                    "group_key": c["group_key"],
                    # extra creed-specific metadata
                    "creed_name": c["short_name"],
                    "year": c["year"],
                    "council": c["council"],
                }

                json.dump(record, f_out, ensure_ascii=False)
                f_out.write("\n")
                n_passages += 1

    print(f"[CREEDS] Creeds processed: {n_creeds}")
    print(f"[CREEDS] Paragraph passages written: {n_passages}")
    print(f"[CREEDS] Output JSONL: {out_path}")
    return out_path


# --- Run once to generate the file and update MANIFEST ---

creeds_out = convert_creeds_txt()

# manifest_path = Path(NORM_DIR) / "MANIFEST.json"
# try:
#     manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
# except FileNotFoundError:
#     manifest = []

# # Path relative to repo ROOT, to match your other entries
# rel_path = str(creeds_out.relative_to(Path(ROOT)))

# # Remove any existing entry with the same name (idempotent)
# manifest = [m for m in manifest if m.get("name") != "christian_creeds"]

# manifest.append({
#     "name": "christian_creeds",
#     "path": rel_path,  # e.g. "data/raw/kb/_normalized/christian_creeds.jsonl"
#     "tradition": "Christianity",
#     "genre": "creed",
#     "source": "Wikisource",
#     "notes": "Ecumenical Christian creeds (Nicene, Apostles, Athanasian, Chalcedon) in English.",
# })

# manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
# print(f"[MANIFEST] Updated with christian_creeds → {manifest_path}")


[CREEDS] Creeds processed: 4
[CREEDS] Paragraph passages written: 20
[CREEDS] Output JSONL: /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/christian_creeds.jsonl
[MANIFEST] Updated with christian_creeds → /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/MANIFEST.json


## Quick sanity check (optional)

In [None]:
import json
from pathlib import Path

path = Path(NORM_DIR) / "christian_creeds.jsonl"
with path.open("r", encoding="utf-8") as f:
    rec = json.loads(next(f))
print(rec.keys())
print(rec["ref"])
print(rec["text"][:160], "…")

dict_keys(['tradition', 'genre', 'source', 'collection', 'lang', 'book', 'chapter', 'verse', 'number', 'grade', 'text', 'ref', 'group_key', 'creed_name', 'year', 'council'])
Niceno-Constantinopolitan Creed (381, First Council of Constantinople) ¶1
We believe in one God,
the Father Almighty,
maker of heaven and earth,
of all things visible and invisible. …


## Optional tiny check: counts per creed
If you’re curious (and for the thesis stats), you can run:

In [None]:
import json
from collections import Counter
from pathlib import Path

path = Path(NORM_DIR) / "christian_creeds.jsonl"
counts = Counter()

with path.open("r", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        counts[rec["creed_name"]] += 1

counts


Counter({'Niceno-Constantinopolitan Creed': 4,
         "Apostles' Creed": 3,
         'Athanasian Creed (Quicumque vult)': 11,
         'Definition of Chalcedon': 2})

## Cell 9 — Convert all (writes MANIFEST.json)

In [None]:
# produced = []
# for s in SOURCES:
#     kind = s["kind"]
#     print(f"\n== Converting: {s['name']} ==")
#     if kind == "jsonl":
#         convert_jsonl(s["in"], s["out"], s["fixed"], keymap=s.get("keymap",{}),
#                       group_key_fn=s["group_key_fn"], post=s.get("post"))
#     elif kind == "csv":
#         convert_csv(s["in"], s["out"], s["fixed"], colmap=s.get("colmap",{}),
#                     group_key_fn=s["group_key_fn"], post=s.get("post"))
#     elif kind == "txt":
#         convert_txt_folder(s["in"], s["out"], s["fixed"], s.get("lang","en"),
#                            per_file_book=s.get("per_file_book"), post=s.get("post"))
#     elif kind == "bible_dir":
#         convert_bible_dir(s["in"], s["out"], s["fixed"], post=s.get("post"))
#     elif kind == "hadith_dir":
#         def _mapper(stem):
#             # customize mapping to nice collection names if needed
#             m = {"bukhari":"Sahih al-Bukhari", "muslim":"Sahih Muslim"}
#             return m.get(stem.lower(), stem)
#         fixed = dict(s["fixed"])
#         fixed.setdefault("collection","")  # adapter fills from filename if empty
#         convert_hadith_dir(s["in"], s["out"], fixed, collection_name_mapper=_mapper)
#     else:
#         print(" ! Unknown kind:", kind)
#         continue
#     produced.append(s["out"])

# # Write MANIFEST
# manifest = {
#     "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
#     "normalized_files": [os.path.relpath(p, ROOT) for p in produced]
# }
# with open(os.path.join(NORM_DIR, "MANIFEST.json"), "w", encoding="utf-8") as f:
#     json.dump(manifest, f, ensure_ascii=False, indent=2)

# print("\nWrote MANIFEST:", os.path.relpath(os.path.join(NORM_DIR, "MANIFEST.json"), ROOT))



== Converting: bible_web_en_dir ==
✅ Wrote /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/bible_web_en.jsonl | rows: 0

== Converting: hadith_the9_dir ==
✅ Wrote /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/hadith_9books_en.jsonl | rows: 0

Wrote MANIFEST: data/raw/kb/_normalized/MANIFEST.json


In [9]:
import json
from pathlib import Path
import os

# Re-define ROOT, RAW_ROOT, and NORM_DIR as Path objects
ROOT = Path(os.environ.get("FACTR_ROOT", "/content/drive/MyDrive/FATCR"))
DATA_DIR = ROOT / "data"
RAW_ROOT = DATA_DIR / "raw" / "kb"
NORM_DIR = RAW_ROOT / "_normalized"
os.makedirs(NORM_DIR, exist_ok=True)

manifest_path = NORM_DIR / "MANIFEST.json"

def rel_from_root(filename: str) -> str:
    """Return path to NORM_DIR/filename relative to the repo ROOT."""
    p = NORM_DIR / filename # This now works because NORM_DIR is a Path object
    if not p.exists():
        raise FileNotFoundError(p)
    return str(p.relative_to(ROOT))

manifest = [
    # Qur'an – EN + AR
    {
        "name": "quran_en",
        "path": rel_from_root("quran_en.jsonl"),
        "tradition": "Islam",
        "genre": "scripture",
        "collection": "Quran",
        "lang": "en",
        "source": "Quran (EN: semarketir/quranjson)",
        "notes": "English Qur'an translation from semarketir/quranjson.",
    },
    {
        "name": "quran_ar",
        "path": rel_from_root("quran_ar.jsonl"),
        "tradition": "Islam",
        "genre": "scripture",
        "collection": "Quran",
        "lang": "ar",
        "source": "Quran (Arabic text)",
        "notes": "Arabic Qur'an text aligned verse-by-verse with quran_en.",
    },

    # Bible – WEB
    {
        "name": "bible_web_en",
        "path": rel_from_root("bible_web_en.jsonl"),
        "tradition": "Christianity",
        "genre": "scripture",
        "collection": "World English Bible",
        "lang": "en",
        "source": "World English Bible",
        "notes": "Public-domain World English Bible (whole OT+NT).",
    },

    # Hadith – 9 books EN
    {
        "name": "hadith_9books_en",
        "path": rel_from_root("hadith_9books_en.jsonl"),
        "tradition": "Islam",
        "genre": "hadith",
        "collection": "Nine Books (AhmedBaset/hadith-json)",
        "lang": "en",
        "source": "AhmedBaset/hadith-json (the_9_books)",
        "notes": "Nine Sunni hadith collections in English translation.",
    },

    # Tafsir – Ibn Kathir EN + Qurtubi AR
    {
        "name": "tafsir_ibn_kathir_en",
        "path": rel_from_root("tafsir_ibn_kathir_en.jsonl"),
        "tradition": "Islam",
        "genre": "tafsir",
        "collection": "Tafsīr Ibn Kathīr",
        "lang": "en",
        "source": "spa5k/tafsir_api (local mirror)",
        "notes": "English excerpts from Tafsīr Ibn Kathīr.",
    },
    {
        "name": "tafsir_al_qurtubi_ar",
        "path": rel_from_root("tafsir_al_qurtubi_ar.jsonl"),
        "tradition": "Islam",
        "genre": "tafsir",
        "collection": "Tafsīr al-Qurtubī",
        "lang": "ar",
        "source": "spa5k/tafsir_api (local mirror)",
        "notes": "Arabic excerpts from Tafsīr al-Qurtubī.",
    },

    # Patristic commentaries
    {
        "name": "christian_commentaries_patristic",
        "path": rel_from_root("christian_commentaries_patristic.jsonl"),
        "tradition": "Christianity",
        "genre": "commentary",
        "collection": "Patristic Commentaries",
        "lang": "en",
        "source": "HistoricalChristianFaith-CommentariesDatabase",
        "notes": "Irenaeus, Origen, John Chrysostom, Augustine of Hippo (verse-indexed excerpts).",
    },

    # Christian creeds
    {
        "name": "christian_creeds",
        "path": rel_from_root("christian_creeds.jsonl"),
        "tradition": "Christianity",
        "genre": "creed",
        "collection": "Ecumenical Creeds",
        "lang": "en",
        "source": "christian-creeds-kb",
        "notes": "Apostles', Nicene (381), Athanasian, and Chalcedon creeds (paragraph-level).",
    },
]

manifest_path.write_text(
    json.dumps(manifest, indent=2, ensure_ascii=False), encoding="utf-8"
)

print(f"[MANIFEST] Wrote {len(manifest)} entries to {manifest_path}")
for m in manifest:
    print(" -", m["name"], "→", m["path"])

[MANIFEST] Wrote 8 entries to /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/MANIFEST.json
 - quran_en → data/raw/kb/_normalized/quran_en.jsonl
 - quran_ar → data/raw/kb/_normalized/quran_ar.jsonl
 - bible_web_en → data/raw/kb/_normalized/bible_web_en.jsonl
 - hadith_9books_en → data/raw/kb/_normalized/hadith_9books_en.jsonl
 - tafsir_ibn_kathir_en → data/raw/kb/_normalized/tafsir_ibn_kathir_en.jsonl
 - tafsir_al_qurtubi_ar → data/raw/kb/_normalized/tafsir_al_qurtubi_ar.jsonl
 - christian_commentaries_patristic → data/raw/kb/_normalized/christian_commentaries_patristic.jsonl
 - christian_creeds → data/raw/kb/_normalized/christian_creeds.jsonl


## check after the manifest cell (optional but helpful)

In [10]:
import json
from pathlib import Path

manifest_path = NORM_DIR / "MANIFEST.json"
manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

print("Manifest entries:")
for m in manifest:
    full_path = ROOT / m["path"]
    exists = full_path.exists()
    print(f" - {m['name']:30s} lang={m.get('lang')} genre={m.get('genre')}  exists={exists}")
    if not exists:
        print("   !!! Missing file:", full_path)


Manifest entries:
 - quran_en                       lang=en genre=scripture  exists=True
 - quran_ar                       lang=ar genre=scripture  exists=True
 - bible_web_en                   lang=en genre=scripture  exists=True
 - hadith_9books_en               lang=en genre=hadith  exists=True
 - tafsir_ibn_kathir_en           lang=en genre=tafsir  exists=True
 - tafsir_al_qurtubi_ar           lang=ar genre=tafsir  exists=True
 - christian_commentaries_patristic lang=en genre=commentary  exists=True
 - christian_creeds               lang=en genre=creed  exists=True


## Cell 9a — Mirror the tafsir repo locally (run once, or when you want to refresh)

In [None]:
# # Cell 9a — Clone/mirror spa5k/tafsir_api locally (run once or when you want to refresh)
# # This avoids flaky 403/404 issues from CDNs.

# !mkdir -p "/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors"
# %cd "/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors"

# # If you need a clean refresh, uncomment the next line:
# # !rm -rf tafsir_api

# !git clone --depth=1 https://github.com/spa5k/tafsir_api.git

# # Return to project root (optional)
# #%cd "/content/drive/MyDrive/FATCR"


/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors
Cloning into 'tafsir_api'...
remote: Enumerating objects: 141876, done.[K
remote: Counting objects: 100% (141876/141876), done.[K
remote: Compressing objects: 100% (140443/140443), done.[K
remote: Total 141876 (delta 2337), reused 140656 (delta 1306), pack-reused 0 (from 0)[K
Receiving objects: 100% (141876/141876), 196.58 MiB | 10.51 MiB/s, done.
Resolving deltas: 100% (2337/2337), done.
Updating files: 100% (140816/140816), done.


In [None]:
# Cell 9a — Clone/refresh spa5k/tafsir_api locally (idempotent)

import os, subprocess, shlex

ROOT = "/content/drive/MyDrive/FATCR"
RAW_ROOT = f"{ROOT}/data/raw/kb"
MIR_DIR = f"{RAW_ROOT}/_mirrors"
REPO_DIR = f"{MIR_DIR}/tafsir_api"

os.makedirs(MIR_DIR, exist_ok=True)

def sh(cmd, cwd=None):
    print("+", cmd)
    return subprocess.run(shlex.split(cmd), cwd=cwd, check=True)

if not os.path.isdir(REPO_DIR):
    # fresh sparse clone, shallow
    sh("git clone --depth=1 https://github.com/spa5k/tafsir_api.git", cwd=MIR_DIR)
    # enable sparse checkout (cone)
    sh("git sparse-checkout init --cone", cwd=REPO_DIR)
    # only keep `tafsir/*` (saves space & makes ls faster)
    sh("git sparse-checkout set tafsir", cwd=REPO_DIR)
else:
    # refresh existing mirror
    try:
        sh("git fetch --depth=1 origin main", cwd=REPO_DIR)
        sh("git checkout main", cwd=REPO_DIR)
        sh("git pull --ff-only", cwd=REPO_DIR)
        # re-apply sparse in case upstream changed tree structure
        sh("git sparse-checkout init --cone", cwd=REPO_DIR)
        sh("git sparse-checkout set tafsir", cwd=REPO_DIR)
    except Exception as e:
        print("⚠️ Mirror refresh failed (will still try to use existing copy):", e)

print("✅ Local tafsir mirror ready at:", REPO_DIR)


+ git fetch --depth=1 origin main
⚠️ Mirror refresh failed (will still try to use existing copy): Command '['git', 'fetch', '--depth=1', 'origin', 'main']' returned non-zero exit status 128.
✅ Local tafsir mirror ready at: /content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api


In [None]:
# Sanity check size & presence
!du -sh "/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api"
!find "/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir" -maxdepth 2 -type d | head -n 20


770M	/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api
/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir
/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/ar-tafseer-al-qurtubi
/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/ar-tafseer-al-qurtubi/1
/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/ar-tafseer-al-qurtubi/10
/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/ar-tafseer-al-qurtubi/100
/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/ar-tafseer-al-qurtubi/101
/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/ar-tafseer-al-qurtubi/102
/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/ar-tafseer-al-qurtubi/103
/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/ar-tafseer-al-qurtubi/104
/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/ar-tafseer-al-qurtubi/105
/content/drive/MyDrive/FATC

## Cell 9b — Tell the notebook to use the local mirror + discover real edition folder names

Ckeck the local mirror

In [None]:
# # inside the local mirror
# %cd /content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api
# !git ls-tree -r --name-only HEAD | grep '^tafsir/en-' | cut -d/ -f2 | sort -u


/content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api
en-al-jalalayn
en-al-qushairi-tafsir
en-asbab-al-nuzul-by-al-wahidi
en-kashani-tafsir
en-kashf-al-asrar-tafsir
en-tafisr-ibn-kathir
en-tafsir-al-tustari
en-tafsir-ibn-abbas
en-tafsir-maarif-ul-quran
en-tazkirul-quran


## Cell 10 — Tafsīr — run selected editions (optional)

In [None]:
# Cell 10 — Tafsīr (local mirror → normalized JSONL)
# Purpose: Convert selected editions in local mirror to a unified JSONL schema.

import os, json, time, re, itertools, glob
from pathlib import Path

# ---- CONFIG ----
ROOT = "/content/drive/MyDrive/FATCR"
RAW_ROOT = f"{ROOT}/data/raw/kb"
NORM_DIR = f"{RAW_ROOT}/_normalized"
LOCAL_TAFSIR_DIR = f"{RAW_ROOT}/_mirrors/tafsir_api/tafsir"
os.makedirs(NORM_DIR, exist_ok=True)

def _clean(s: str) -> str:
    return re.sub(r"\s+", " ", str(s)).strip()

def quran_key(ch:int, v:int) -> str:
    return f"quran-{int(ch)}-{int(v)}"

# Optional: read ayah counts from your normalized English Qur'an (if present)
QURAN_INDEX = {}
try:
    src = f"{RAW_ROOT}/_normalized/quran_en.jsonl"
    if os.path.exists(src):
        counts = {}
        with open(src, "r", encoding="utf-8") as f:
            for line in f:
                try:
                    r = json.loads(line)
                    s = int(r.get("chapter", 0)); v = int(r.get("verse", 0))
                    if s > 0 and v > 0:
                        counts[s] = max(counts.get(s, 0), v)
                except:
                    pass
        QURAN_INDEX = counts
except:
    pass

def ayah_count(s):
    return QURAN_INDEX.get(s)

def write_jsonl(out_path: str, rows: list):
    with open(out_path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    print(f"✅ Wrote {out_path} | rows: {len(rows)}")

# ------------------------------
# Map edition → human-readable collection name (canonical keys)
COLLECTION_NAME = {
    # EN
    "en-tafsir-ibn-kathir": "Tafsir Ibn Kathīr (EN)",
    "en-al-jalalayn": "Tafsir al-Jalalayn (EN)",
    "en-asbab-al-nuzul-by-al-wahidi": "Asbāb al-Nuzūl (al-Wāḥidī, EN)",
    "en-tafsir-al-tustari": "Tafsir al-Tustarī (EN)",
    "en-tafsir-ibn-abbas": "Tafsir Ibn ʿAbbās (EN)",
    "en-kashani-tafsir": "Tafsir al-Kāshānī (EN)",
    "en-kashf-al-asrar-tafsir": "Kashf al-Asrār (EN)",
    "en-tafsir-maarif-ul-quran": "Maʿārif al-Qurʾān (EN)",
    "en-tazkirul-quran": "Tazkirul Qurʾān (EN)",
    "en-al-qushairi-tafsir": "Qushayrī (EN)",
    # AR
    "ar-tafsir-ibn-kathir": "Tafsir Ibn Kathīr (AR)",
    "ar-tafsir-al-qurtubi": "Tafsir al-Qurṭubī (AR)",
    "ar-tafsir-al-tabari": "Tafsir al-Ṭabarī (AR)",
    "ar-tafseer-al-saddi": "Tafsir al-Suddī (AR)",
    "ar-tafseer-tanwir-al-miqbas": "Tanwīr al-Miqbās (AR)",
    "ar-tafsir-al-baghawi": "Tafsir al-Baghawī (AR)",
    "ar-tafsir-al-wasit": "Tafsir al-Wasīṭ (AR)",
    "ar-tafsir-muyassar": "Tafsir al-Muyassar (AR)",
}

# Local folder aliases (handle mis-spellings or legacy names on disk)
EDITION_ALIASES = {
    "ar-tafsir-al-qurtubi": "ar-tafseer-al-qurtubi",  # canonical -> on-disk alias
    "en-tafsir-ibn-kathir": "en-tafisr-ibn-kathir",   # if your mirror has the 'tafisr' typo
    # add more if you bump into others
}

def _resolve_local_dir(base: Path, edition: str) -> Path:
    """Return the directory to read for this edition, trying alias if needed."""
    p = base / edition
    if p.is_dir():
        return p
    alias = EDITION_ALIASES.get(edition)
    if alias:
        q = base / alias
        if q.is_dir():
            print(f"🔎 Using local alias for {edition} -> {alias}")
            return q
    return p  # caller will warn if missing

# ------------------------------
# Layout helpers (per-ayah / per-surah detection)

def iter_tafsir_ayah_files(base: Path, edition: str):
    ed = _resolve_local_dir(base, edition)
    if not ed.is_dir():
        return
    for s_dir in ed.iterdir():
        if not s_dir.is_dir():
            continue
        try:
            s = int(s_dir.name)
        except:
            continue
        for f in s_dir.iterdir():
            if f.suffix.lower() == ".json":
                try:
                    v = int(f.stem)
                except:
                    continue
                yield s, v, f

def iter_tafsir_surah_files(base: Path, edition: str):
    ed = _resolve_local_dir(base, edition)
    if not ed.is_dir():
        return
    for f in ed.iterdir():
        if f.suffix.lower() != ".json":
            continue
        m = re.match(r"^s?(\d+)\.json$", f.name)
        if not m:
            continue
        s = int(m.group(1))
        yield s, f

def normalize_record(base_meta: dict, edition: str, lang: str, sura: int, ayah: int, text: str):
    coll = COLLECTION_NAME.get(edition, edition)
    return {
        **base_meta,
        "collection": coll,
        "lang": lang,
        "book": "Qur'an",
        "chapter": int(sura),
        "verse": int(ayah),
        "number": None,
        "grade": None,
        "text": _clean(text),
        "ref": f"Qur'an {sura}:{ayah}",
        "group_key": quran_key(sura, ayah),
    }

def convert_tafsir_local(edition: str, out_path: str, fixed_meta: dict, sleep=0.0):
    base = Path(LOCAL_TAFSIR_DIR)
    ed_dir = _resolve_local_dir(base, edition)
    lang = "en" if edition.startswith("en-") else "ar"
    rows = []
    wrote = 0

    if not ed_dir.is_dir():
        print(f"⚠️ Edition folder missing locally: {edition}")
        write_jsonl(out_path, rows)
        return

    found_any = False

    # 1) per-ayah
    for s, v, p in iter_tafsir_ayah_files(base, edition):
        found_any = True
        if ayah_count(s) is not None and (v < 1 or v > ayah_count(s)):
            continue
        try:
            obj = json.loads(p.read_text(encoding="utf-8"))
        except Exception as e:
            print(f"{edition} s{s} v{v}: JSON parse error: {e}")
            continue
        txt = obj.get("text") or obj.get("tafsir") or obj.get("Tafsir") or obj.get("content") or ""
        if txt:
            rows.append(normalize_record(fixed_meta, edition, lang, s, v, txt))
            wrote += 1
        if sleep > 0: time.sleep(sleep)

    # 2) per-surah (fallback)
    if not wrote:
        for s, f in iter_tafsir_surah_files(base, edition):
            found_any = True
            try:
                obj = json.loads(f.read_text(encoding="utf-8"))
            except Exception as e:
                print(f"{edition} s{s}: JSON parse error: {e}")
                continue
            if isinstance(obj, dict):
                for vk, vv in obj.items():
                    try:
                        v = int(vk)
                    except:
                        continue
                    if ayah_count(s) is not None and (v < 1 or v > ayah_count(s)):
                        continue
                    if isinstance(vv, str):
                        txt = vv
                    elif isinstance(vv, dict):
                        txt = vv.get("text") or vv.get("tafsir") or vv.get("content") or ""
                    else:
                        txt = str(vv)
                    if txt:
                        rows.append(normalize_record(fixed_meta, edition, lang, s, v, txt))
                        wrote += 1

    if not found_any:
        print(f"⚠️ {edition}: no working layout detected (folder exists but no JSON matched)")
    write_jsonl(out_path, rows)

# ------------------------------
# ✅ Default pack only (fast & reliable)
# EN: Ibn Kathīr, Jalālayn, Asbāb al-Nuzūl
# AR: Ṭabarī, Qurṭubī, Ibn Kathīr, Muyassar
editions_to_run = [
    # # EN
    # "en-tafsir-ibn-kathir",
    # "en-al-jalalayn",
    # "en-asbab-al-nuzul-by-al-wahidi",
    # # AR
    # "ar-tafsir-al-tabari",
      "ar-tafsir-al-qurtubi",
    # "ar-tafsir-ibn-kathir",
    # "ar-tafsir-muyassar",
]

# Output filename helper (nice, readable)
def out_name_for(edition: str) -> str:
    # e.g. "en-tafsir-ibn-kathir" -> "tafsir_ibn_kathir_en.jsonl"
    lang = "en" if edition.startswith("en-") else "ar"
    stem = edition.split("-", 1)[1]  # drop "en-" / "ar-"
    stem = stem.replace("-", "_")
    return f"tafsir_{stem}_{lang}.jsonl"

# Fixed meta applied to every record (collection set per edition)
fixed_tafsir_meta = {
    "tradition": "Islam",
    "genre": "tafsir",
    "source": "spa5k/tafsir_api",
    "collection": "",  # overridden dynamically in normalize_record
    "lang": "en",      # overridden per edition in normalize_record
}

# ---- Run them
for ed in editions_to_run:
    out_path = os.path.join(NORM_DIR, out_name_for(ed))
    print(f"→ Converting {ed} → {os.path.basename(out_path)}")
    convert_tafsir_local(ed, out_path, fixed_tafsir_meta, sleep=0.0)

# ---- Quick preview (sanity)
for p in glob.glob(f"{NORM_DIR}/tafsir_*_*.jsonl"):
    try:
        with open(p, "r", encoding="utf-8") as f:
            first = next(f).strip()
        print("Preview:", os.path.basename(p), "→", first[:140], "…")
    except StopIteration:
        print("Preview:", os.path.basename(p), "→ (empty)")


→ Converting ar-tafsir-al-qurtubi → tafsir_tafsir_al_qurtubi_ar.jsonl
🔎 Using local alias for ar-tafsir-al-qurtubi -> ar-tafseer-al-qurtubi
🔎 Using local alias for ar-tafsir-al-qurtubi -> ar-tafseer-al-qurtubi
✅ Wrote /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/tafsir_tafsir_al_qurtubi_ar.jsonl | rows: 6236
Preview: tafsir_ibn_kathir_en.jsonl → {"tradition": "Islam", "genre": "tafsir", "source": "spa5k/tafsir_api (local mirror)", "collection": "Tafsir Ibn Kathīr (EN)", "lang": "en", …
Preview: tafsir_ibn_kathir_ar.jsonl → {"tradition": "Islam", "genre": "tafsir", "source": "spa5k/tafsir_api (local mirror)", "collection": "Tafsir Ibn Kathīr (AR)", "lang": "ar", …
Preview: tafsir_tafsir_ibn_kathir_en.jsonl → {"tradition": "Islam", "genre": "tafsir", "source": "spa5k/tafsir_api", "collection": "Tafsir Ibn Kathīr (EN)", "lang": "en", "book": "Qur'a …
Preview: tafsir_al_jalalayn_en.jsonl → {"tradition": "Islam", "genre": "tafsir", "source": "spa5k/tafsir_api", "collection": "Tafsir

sanity check

In [None]:
# sanity check (alias-aware)
from pathlib import Path

ROOT = "/content/drive/MyDrive/FATCR"
RAW_ROOT = f"{ROOT}/data/raw/kb"
LOCAL_TAFSIR_DIR = f"{RAW_ROOT}/_mirrors/tafsir_api/tafsir"

# keep in sync with Cell 10
EDITION_ALIASES = {
    "ar-tafsir-al-qurtubi": "ar-tafseer-al-qurtubi",
    "en-tafsir-ibn-kathir": "en-tafisr-ibn-kathir",
}

def resolve_dir(ed: str) -> Path:
    p = Path(LOCAL_TAFSIR_DIR) / ed
    if p.is_dir():
        return p
    alias = EDITION_ALIASES.get(ed)
    q = Path(LOCAL_TAFSIR_DIR) / alias if alias else None
    return q if (q and q.is_dir()) else p

editions = ["en-al-jalalayn", "en-tafsir-ibn-kathir", "ar-tafsir-ibn-kathir", "ar-tafsir-al-qurtubi"]
for ed in editions:
    p = resolve_dir(ed)
    cnt = sum(1 for _ in p.rglob("*.json")) if p.is_dir() else 0
    print(f"{ed:>22} | exists: {str(p.is_dir()):5} | files: {cnt:4} | path: {p}")


        en-al-jalalayn | exists: True  | files: 6350 | path: /content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/en-al-jalalayn
  en-tafsir-ibn-kathir | exists: True  | files: 6350 | path: /content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/en-tafsir-ibn-kathir
  ar-tafsir-ibn-kathir | exists: True  | files: 6350 | path: /content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/ar-tafsir-ibn-kathir
  ar-tafsir-al-qurtubi | exists: True  | files: 6350 | path: /content/drive/MyDrive/FATCR/data/raw/kb/_mirrors/tafsir_api/tafsir/ar-tafseer-al-qurtubi


file cleanup

In [None]:
# keep the short names, remove doubles with "tafsir_tafsir_"
cd "/content/drive/MyDrive/FATCR/data/raw/kb/_normalized"
rm -f tafsir_tafsir_*.jsonl


normalised sanity check

In [None]:
import os, glob, json, itertools
NORM = "/content/drive/MyDrive/FATCR/data/raw/kb/_normalized"

for p in sorted(glob.glob(f"{NORM}/*.jsonl")):
    size = os.path.getsize(p)
    with open(p, "r", encoding="utf-8") as f:
        head = next(itertools.islice(f, 1), "").strip()
    print(os.path.basename(p), "| bytes:", size, "| sample:", head[:140], "…")


bible_kjv_en.jsonl | bytes: 15248000 | sample: {"tradition": "Christianity", "genre": "scripture", "source": "KJV (aruljohn/Bible-kjv)", "collection": "Bible", "lang": "en", "book": "1 Ch …
bible_web_en.jsonl | bytes: 0 | sample:  …
hadith_9books_ar.jsonl | bytes: 52501348 | sample: {"tradition": "Islam", "genre": "hadith", "collection": "Nine Books", "source": "hadith-json", "lang": "ar", "book": "Abudawud", "chapter":  …
hadith_9books_en.jsonl | bytes: 0 | sample:  …
quran_ar.jsonl | bytes: 2923183 | sample: {"tradition": "Islam", "genre": "scripture", "collection": "Quran", "source": "Quran (AR)", "lang": "ar", "book": "Al-Fatiha", "chapter": 1, …
quran_en.jsonl | bytes: 2463646 | sample: {"tradition": "Islam", "genre": "scripture", "collection": "Quran", "source": "Quran (EN: semarketir/quranjson)", "lang": "en", "book": "Al- …
tafsir_al_jalalayn_en.jsonl | bytes: 3824124 | sample: {"tradition": "Islam", "genre": "tafsir", "source": "spa5k/tafsir_api", "collection": "Tafsir al-Jala

fix lost file bible web normalised

In [None]:
def safe_write_jsonl(out_path: str, rows: list):
    """Never overwrite with 0 rows. If rows==0 and an older file exists, keep the old file."""
    import os, json, tempfile, shutil
    if not rows:
        if os.path.exists(out_path) and os.path.getsize(out_path) > 0:
            print(f"⚠️  Skipping write of 0 rows to {out_path} (kept existing non-empty file).")
            return
        else:
            print(f"⚠️  Not writing {out_path}: 0 rows.")
            return
    tmp = out_path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
    shutil.move(tmp, out_path)
    print(f"✅ Wrote {out_path} | rows: {len(rows)}")


In [None]:
# --- Rebuild Bible WEB EN from TehShrike/world-english-bible ---
RAW_ROOT = "/content/drive/MyDrive/FATCR/data/raw/kb"
NORM_DIR = f"{RAW_ROOT}/_normalized"
WEB_ROOT = f"{RAW_ROOT}/Christianity/world-english-bible"  # adjust if needed

import os, glob, json
rows = []

# many WEB repos are one-file-per-book under e.g. translations/en/web/BOOK.json or similar
cands = glob.glob(os.path.join(WEB_ROOT, "**", "*.json"), recursive=True)

def friendly_book(name):
    # crude clean-up for filenames into book names
    import re
    b = os.path.splitext(os.path.basename(name))[0]
    b = b.replace("_", " ").replace("-", " ").title()
    return b

for fp in cands:
    try:
        with open(fp, "r", encoding="utf-8") as f:
            obj = json.load(f)
    except Exception:
        continue

    # accept common shapes:
    # 1) {"book":"Genesis","chapters":{"1":{"1":"In the beginning...", ...}, ...}}
    # 2) {"1":{"1":"...","2":"..."}, "2":{...}} with filename as book
    if isinstance(obj, dict) and "chapters" in obj and "book" in obj:
        book = obj["book"]
        chapters = obj["chapters"]
    elif isinstance(obj, dict) and all(k.isdigit() for k in obj.keys()):
        book = friendly_book(fp)
        chapters = obj
    else:
        continue

    for ch_s, verses in chapters.items():
        try: ch = int(ch_s)
        except: continue
        if not isinstance(verses, dict): continue
        for v_s, text in verses.items():
            try: v = int(v_s)
            except: continue
            t = str(text).strip()
            if not t: continue
            rows.append({
                "tradition":"Christianity","genre":"scripture","source":"WEB (TehShrike/world-english-bible)",
                "collection":"Bible","lang":"en","book":book,
                "chapter":ch,"verse":v,"number":None,"grade":None,
                "text":t,"ref":f"{book} {ch}:{v}","group_key":f"{book}-{ch}-{v}".lower().replace(" ","_")
            })

safe_write_jsonl(os.path.join(NORM_DIR, "bible_web_en.jsonl"), rows)


⚠️  Not writing /content/drive/MyDrive/FATCR/data/raw/kb/_normalized/bible_web_en.jsonl: 0 rows.


## Cell 11 — 1) Quick slice (see progress fast)

Run this in Cell 10 after the function definitions. It writes tiny “slice” JSONLs so you see logs immediately.

In [None]:
import json, itertools, os, glob
for p in glob.glob(f"{NORM_DIR}/tafsir_*_*.jsonl"):
    print(os.path.basename(p), "size:", os.path.getsize(p))
    with open(p, "r", encoding="utf-8") as f:
        for i, line in enumerate(itertools.islice(f, 3)):
            r = json.loads(line)
            # must exist:
            assert {"tradition","genre","collection","source","lang","book",
                    "chapter","verse","text","ref","group_key"} <= r.keys()
            print(" ", r["collection"], r["lang"], r["ref"], "→", r["text"][:50], "…")
    print()


tafsir_ibn_kathir_en.jsonl size: 41564813
  Tafsir Ibn Kathīr (EN) en Qur'an 1:1 → Introduction to Fatihah Which was revealed in Makk …
  Tafsir Ibn Kathīr (EN) en Qur'an 1:2 → The Meaning of Al-Hamd Abu Ja`far bin Jarir said,  …
  Tafsir Ibn Kathīr (EN) en Qur'an 1:3 → Allah said next, الرَّحْمَـنِ الرَّحِيمِ (Ar-Rahma …

tafsir_al_jalalayn_en.jsonl size: 3917664
  Tafsir al-Jalalayn (EN) en Qur'an 1:1 → In the Name of God the Compassionate the Merciful …
  Tafsir al-Jalalayn (EN) en Qur'an 1:2 → In the Name of God the name of a thing is that by  …
  Tafsir al-Jalalayn (EN) en Qur'an 1:3 → The Compassionate the Merciful that is to say the  …

tafsir_ibn_kathir_ar.jsonl size: 16146097
  Tafsir Ibn Kathīr (AR) ar Qur'an 1:1 → بسم الله الرحمن الرحيم سورة الفاتحة . يقال لها الف …
  Tafsir Ibn Kathīr (AR) ar Qur'an 1:2 → القراء السبعة على ضم الدال في قوله الحمد لله هو مب …
  Tafsir Ibn Kathīr (AR) ar Qur'an 1:3 → وقوله تعالى "الرحمن الرحيم" تقدم الكلام عليه في ال …

