# Building the BYOAI_LIAR Dataset
*A Practical Example of Building Your Own AI, with a little help from AI*

## Why we built a LIAR-style dataset for the book
For the chapter on scaling, we wanted a model that shows more than speed or accuracy. We wanted something that also explores how to scale trust. The classic LIAR dataset is a good inspiration. It labels short statements on a six-level truth scale: TRUE, mostly-true, half-true, barely-true, FALSE, and pants-fire. That framing fit our goals and also supports teaching. One of the authors teaches trustworthy AI, and the dataset gives a steady source of chapter-anchored multiple-choice questions.

> The LIAR dataset is a large collection of 12,836 short statements, each labeled by humans for truthfulness and collected from the fact-checking website PolitiFact.com. It was created by William Yang Wang and released in a 2017 paper titled “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection, making it a valuable resource for fake news detection research.


Our version, which we call **byoai_liar**, uses passages from the book’s chapters to generate short labeled statements. A future classifier can then learn to predict the same six labels. The result doubles as a training set and as an educational asset tied to the book’s topics.

## Planning the build with a code assistant
We decided to treat the whole exercise as a worked example of using a coding copilot. Most of the work was done with ChatGPT and Gemini. The plan had two parts:

1. **Chunk the chapters** into small windows of text that carry enough context to inspire grounded statements.  
2. **Generate labeled statements** from those windows, along with a short context phrase and a short reason for the label.

We aimed for about 2,500 to 5,000 rows so a small model like T5-small would have something meaningful to learn from. With more than 1,800 chunks available from the drafts, that target looked realistic.

## Data preparation
We took a fresh snapshot of the chapter drafts and exported each chapter as a UTF-16 plain text file. That ensured simple, repeatable inputs. We then defined the six labels as constants, based on the LIAR dataset card, and kept them front and center in every prompt and function.

Next, we added **subject tags**. A chapter title alone does not always convey the core topics. We asked the code assistant to propose three concise tags per chapter. These tags appear in the chunk file and later guide generation. For example, the deepfake chapter carries tags like media-forensics, voice-cloning, deepfake. The agentic AI chapter carries agentic-ai, planning, tools. These tags flow through the pipeline and help the generator ground contexts without long prompts.

>**Tip:** Before writing any code, you can use chat-based content analysis to define data structures, as we did with `SUBJECT_TAGS`. Here's what we did.
Create a new ChatGPT project, upload your chapter `.txt` files into its file folder, and prompt:  
*"Scan each file and extract three concise tags that best represent its subject or theme. Then output a Python constant named SUBJECT_TAGS mapping chapter numbers to those tags."*  
Review the results, refine the prompt if any tags feel off, and rerun until each chapter is well represented. This step anchors your later code generation in real content, ensuring your constants match the book’s themes rather than generic guesses.

## Listing 1: the Chunker
The chunker walks each chapter, breaks it into overlapping windows, and emits a `byoai_book_chunks.csv` with:

- `chapter`, `chapter_title`, `chunk_id`, `chunk_text`, `subject_tags`

The window size took some tuning. We tested different widths and strides, looking for enough context to inspire a claim, but short enough to keep the model focused. We added simple summary prints to show counts, token ranges, and per-chapter coverage. Those prints paid off. Each round of changes was based on facts rather than gut feel.
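
The tuning loop described above can be sketched as a small sweep. This is a minimal illustration, not the script we ran: `sweep_windows` and the sample text are hypothetical, and the sentence splitter simply reuses the punctuation rule from Listing 1.

```python
import re
from statistics import mean

def sentence_split(text):
    """Split on whitespace that follows ., ! or ? (the simple rule from Listing 1)."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

def sweep_windows(text, sizes, strides):
    """Return one stats dict per (size, stride) pair where stride <= size."""
    sents = sentence_split(text)
    results = []
    for size in sizes:
        for stride in strides:
            if stride > size:
                continue  # a stride larger than the window would skip sentences
            windows = [" ".join(sents[i:i + size])
                       for i in range(0, len(sents), stride)]
            words = [len(w.split()) for w in windows]
            results.append({
                "size": size, "stride": stride, "windows": len(windows),
                "mean_words": round(mean(words), 1),
                "min_words": min(words), "max_words": max(words),
            })
    return results

# Hypothetical sample text; in practice, pass the contents of one chapter file.
sample = " ".join(f"Sentence number {i} has six words." for i in range(1, 21))
for row in sweep_windows(sample, sizes=[3, 4, 5], strides=[2, 3, 4]):
    print(row)
```

Comparing the `windows` and `mean_words` columns across rows shows the trade-off directly: smaller strides produce more, heavily overlapping chunks, while larger windows raise the word count per chunk.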

A key decision was to produce enough chunks so the generator could sample several labels per chunk later. That matters because a single passage can inspire both supported claims and claims that go a bit too far. We want the generator to produce that contrast so the classifier has something to learn.

>**Tip:** Let your AI code assistant help you find a practical window size before finalizing the chunker.  
Upload one or two representative chapter text files and ask:  
*"Write a short script to split this text into overlapping windows and report average word counts per window for different stride and size values."*  
Run the suggested code, share the output back with the assistant, and ask:  
*"Based on these stats, what window and stride would balance context and focus for training data generation?"*  
This experiment-driven loop helps you converge on numbers that fit your chapters, rather than guessing.

## Listing 2: the single-pass Generator
Early versions tried a generator plus a separate QA reviewer. That added complexity. The final approach is simpler and more reliable for this use case:

**One prompt. One call per row. No second pass.**

The generator takes `byoai_book_chunks.csv` and a few simple constants:

- `MAX_CHUNKS` controls how many chunks to process in this run.
- `N_PER_CHUNK_TARGET` controls how many rows to attempt per chunk.
- `LABELS` and `LABEL_WEIGHTS` control the truth distribution.
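
To see what these constants imply before spending API calls, you can estimate the label mix for a planned run. The sketch below reuses the `LABELS` and `LABEL_WEIGHTS` values from Listing 2; the helper `expected_label_counts` and the 600 × 8 run size are illustrative assumptions.

```python
import random
from collections import Counter

LABELS = ["TRUE", "mostly-true", "half-true", "barely-true", "FALSE", "pants-fire"]
LABEL_WEIGHTS = [0.20, 0.18, 0.24, 0.18, 0.12, 0.08]

def expected_label_counts(max_chunks, n_per_chunk):
    """Expected rows per label if every attempt passed validation."""
    attempts = max_chunks * n_per_chunk
    return {lbl: round(w * attempts) for lbl, w in zip(LABELS, LABEL_WEIGHTS)}

expected = expected_label_counts(600, 8)

# Draw the same number of labels at random to see how far a real run can drift.
rng = random.Random(42)
drawn = Counter(rng.choices(LABELS, weights=LABEL_WEIGHTS, k=600 * 8))
for lbl in LABELS:
    print(f"{lbl:<12} expected={expected[lbl]:>5} drawn={drawn[lbl]:>5}")
```

The expected column is an upper bound, since rows that fail validation twice are skipped. The drawn column shows the ordinary sampling noise you should expect around each target.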

For each chunk, we sample a label according to the weights and create a single prompt that includes the target label, the chapter info, the subject tags, and the chunk text. The model returns exactly three fields:

- `statement` in 20 words or fewer  
- `context` in 12 words or fewer  
- `label_reason` in 8 to 28 words (the prompt asks for 8 to 18; the validator accepts up to 28)

The rules in the prompt keep things consistent: the label drives the claim, the context anchors to a specific term from the passage or tags, and the reason explains the label in natural language. We ask for varied phrasing and concrete terms such as tools, models, datasets, or metrics. We also ask for natural justification wording so we do not get repetitive openings like “The passage states…”

>**Tip:** Use your AI code assistant as a **prompt design partner**, not just a coder.  
Upload your `byoai_book_chunks.csv` file into the project, then ask:  
*"Given this dataset, help me design a single, efficient prompt that generates a labeled statement, short context, and natural justification for each chunk and target label."*  
Experiment by pasting a few sample chunks and iterating on the wording of the generation rules together.  
When the model starts returning well-balanced examples, copy that exact text into your `GEN_PROMPT` constant.  
This interactive process helps you tune the generator’s behavior before you automate it in code.

## Light validation instead of heavy post-processing
After each generation, we run quick checks:

- Word count bounds  
- No meta words like “chapter” or “this statement”  
- A short-hash dedupe on the statement

If a row fails a check, we retry once. If it fails again, we skip it. These small rules keep the output clean without slowing the pipeline.
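
The check, retry once, then skip flow can be sketched as follows. This is a simplified stand-in rather than the book's exact loop: `passes_checks`, `generate_with_retry`, and the canned replies are hypothetical, and the model call is replaced by a plain callable so the example runs offline.

```python
import hashlib
import re

META_BAN = re.compile(r"\b(chapter|book|this statement)\b", re.I)

def passes_checks(row):
    """Apply the word-count bounds and the meta-word ban."""
    stmt = row.get("statement", "")
    ctx = row.get("context", "")
    rsn = row.get("label_reason", "")
    if not (stmt and ctx and rsn):
        return False
    if any(META_BAN.search(t) for t in (stmt, ctx, rsn)):
        return False
    return (len(stmt.split()) <= 20 and len(ctx.split()) <= 12
            and 8 <= len(rsn.split()) <= 28)

def short_hash(statement, prefix_len=60):
    """Hash a lowercased prefix so repeated statements collide."""
    base = statement[:prefix_len].strip().lower()
    return hashlib.md5(base.encode("utf-8")).hexdigest()[:10]

def generate_with_retry(generate, seen, retries=1):
    """Call generate() up to retries + 1 times; return a clean row or None."""
    for _ in range(retries + 1):
        row = generate()
        if not passes_checks(row):
            continue
        key = short_hash(row["statement"])
        if key in seen:
            continue
        seen.add(key)
        return row
    return None  # every attempt failed, so the caller skips this row

# Stand-in for the model: the first reply breaks a rule, the second is clean.
REASON = "Retrieval grounding is described as reducing unsupported answers in practice."
replies = iter([
    {"statement": "This chapter explains RAG.", "context": "RAG pipelines",
     "label_reason": REASON},
    {"statement": "Retrieval grounding reduces unsupported answers.",
     "context": "RAG pipelines", "label_reason": REASON},
])
seen = set()
row = generate_with_retry(lambda: next(replies), seen)
print(row["statement"])
```

Passing the generator in as a callable keeps the checks testable without network access. In a real run, the lambda would wrap the API call for one chunk and one target label.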

## Output schema and reproducibility
The generator writes `byoai_liar.csv` with:

- `id, chunk_id, label, statement, context, label_reason, subject_tags, chapter, chapter_title`

This makes it easy to trace any statement back to its origin. The fields also give you the ingredients a future classifier or evaluator needs, including a short rationale that can support training or auditing.
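
One detail matters when tracing a row back to its passage: in Listing 1, `chunk_id` restarts at 0 for each chapter file, so the lookup key is the pair (`chapter`, `chunk_id`), not `chunk_id` alone. A minimal sketch with pandas, using small hypothetical frames in place of the two CSV files:

```python
import pandas as pd

# Hypothetical stand-ins for byoai_book_chunks.csv and byoai_liar.csv.
chunks = pd.DataFrame({
    "chapter": [5, 5, 9],
    "chunk_id": [0, 1, 0],
    "chunk_text": ["passage A", "passage B", "passage C"],
})
rows = pd.DataFrame({
    "id": [100000, 100001],
    "chapter": [9, 5],
    "chunk_id": [0, 1],
    "statement": ["claim about C", "claim about B"],
})

# Join on both keys; validate raises if a (chapter, chunk_id) pair is not unique.
traced = rows.merge(chunks, on=["chapter", "chunk_id"],
                    how="left", validate="many_to_one")
print(traced[["id", "statement", "chunk_text"]].to_string(index=False))
```

Joining on `chunk_id` alone would match row 100000 to two different passages here, which is exactly the kind of silent error the `validate` argument is meant to catch.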

## Scaling up without surprises
To reach 4,000 to 5,000 rows, pick a combination of `MAX_CHUNKS` and `N_PER_CHUNK_TARGET` that hits your target. For example, 600 chunks with 8 rows per chunk yields 4,800 attempts. After dedupe and bounds checks, the final count lands near the goal. For variety, set temperature around 0.65 to 0.7 and top-p around 0.9; Listing 2 as printed uses a slightly higher temperature of 0.75. If you want even coverage across chapters, switch the chunk selection from random to a balanced sampler.
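
One way to build a balanced sampler is a round-robin over chapters. The sketch below is an assumption about how such a sampler could look, not code from the listings; `balanced_sample` and the toy input are hypothetical.

```python
import random
from collections import defaultdict

def balanced_sample(chunks, max_chunks, seed=42):
    """Pick up to max_chunks items, cycling through chapters in order."""
    rng = random.Random(seed)
    by_chapter = defaultdict(list)
    for chunk in chunks:
        by_chapter[chunk["chapter"]].append(chunk)
    for group in by_chapter.values():
        rng.shuffle(group)  # random choice within each chapter

    picked = []
    chapters = sorted(by_chapter)
    while len(picked) < max_chunks and any(by_chapter[c] for c in chapters):
        for c in chapters:
            if by_chapter[c] and len(picked) < max_chunks:
                picked.append(by_chapter[c].pop())
    return picked

# Hypothetical toy input: chapter 0 has 2 chunks, chapters 1 and 2 have 10 each.
chunks = [{"chapter": ch, "chunk_id": i}
          for ch, n in [(0, 2), (1, 10), (2, 10)] for i in range(n)]
picked = balanced_sample(chunks, max_chunks=9)
per_chapter = {ch: sum(1 for p in picked if p["chapter"] == ch) for ch in (0, 1, 2)}
print(per_chapter)  # → {0: 2, 1: 4, 2: 3}
```

Short chapters contribute everything they have, and the remaining budget is spread across the longer ones. With purely random selection, a chapter's share would instead track its chunk count.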

## A small audit tool helps you stay honest
We added a tiny “coherence and label audit” tool that samples random rows from the final CSV and asks a model to rate:

- chapter fit  
- label fit  
- context quality  
- reason quality

It prints a few lines per row and then averages. If context quality dips, nudge the prompt to ask for a named anchor term from the passage or tags. If reasons look mechanical, remind the model to explain the relationship naturally without report-style phrases. This loop is fast and keeps the dataset from drifting.
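
The shape of such an audit loop is sketched below. The criteria names mirror the list above, but the 1-to-5 scale, the `audit` function, and `stub_rater` are assumptions made for illustration; in a real run, the rater would wrap a model call that returns one score per criterion.

```python
import random
from statistics import mean

CRITERIA = ["chapter_fit", "label_fit", "context_quality", "reason_quality"]

def audit(rows, rate_row, sample_size=5, seed=42):
    """Score a random sample of rows and return the mean per criterion."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    scores = [rate_row(r) for r in sample]
    for r, s in zip(sample, scores):
        print(r["id"], r["label"], s)
    return {c: round(mean(s[c] for s in scores), 2) for c in CRITERIA}

def stub_rater(row):
    """Placeholder for a model call; returns a fixed score per criterion."""
    return {"chapter_fit": 5, "label_fit": 4,
            "context_quality": 3, "reason_quality": 4}

# Hypothetical rows standing in for byoai_liar.csv.
rows = [{"id": 100000 + i, "label": "half-true"} for i in range(20)]
averages = audit(rows, stub_rater)
print(averages)
```

Fixing the seed makes an audit repeatable, which helps when you want to compare averages before and after a prompt change on the same sample.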

## What an AI builder learns from this pattern
**Design the target first.** The six labels and the output schema framed every decision that followed.  
**Break the work into two clear listings.** One prepares focused context. One generates structured rows.  
**Keep prompts compact and precise.** One well-designed prompt beats a long chain of fragile steps.  
**Use simple checks.** Bounds and dedupe give most of the benefits of a second pass, with a fraction of the effort.  
**Iterate on small samples.** Run ten chunks, study the rows, adjust one rule, and run again.  
**Instrument the code.** Print basic stats after each run. Simple numbers are better than guesswork.  
**Add a tiny auditor.** A lightweight, randomized audit helps you catch drift early and gives you confidence at scale.

## Where you can take it next
Once the dataset looks good, you can train a small text classifier such as T5-small or a compact encoder-based model. You can also slice the CSV per chapter or per tag to build practice sets for a class. Since each row has a short reason, you can experiment with explanation-aware training, evaluation rubrics, or teaching assistants that quiz readers on chapter topics.
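
Slicing by tag or chapter takes only a few lines of pandas. This sketch uses a small hypothetical frame in place of `byoai_liar.csv`; `rows_for_tag` is an illustrative helper that matches whole tags, so a query for `ai` does not pick up `agentic-ai`.

```python
import pandas as pd

def rows_for_tag(df, tag):
    """Return rows whose comma-separated subject_tags contain the exact tag."""
    has_tag = df["subject_tags"].fillna("").apply(
        lambda s: tag in [t.strip() for t in s.split(",")]
    )
    return df[has_tag]

# Hypothetical rows standing in for byoai_liar.csv.
df = pd.DataFrame({
    "id": [100000, 100001, 100002],
    "label": ["TRUE", "FALSE", "half-true"],
    "chapter": [9, 12, 9],
    "subject_tags": ["media-forensics,voice-cloning,deepfake",
                     "agentic-ai,planning,tools",
                     "media-forensics,voice-cloning,deepfake"],
})

deepfake_set = rows_for_tag(df, "deepfake")
print(len(deepfake_set), "rows tagged deepfake")

# One practice set per chapter, keyed by chapter number.
by_chapter = {ch: group for ch, group in df.groupby("chapter")}
print(sorted(by_chapter))
```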

That is the core pattern. Start with plain text chapters. Chunk them. Drive a single-pass generator with clear rules. Keep the checks light, the prompts short, and the iterations quick. You will end up with a dataset that is useful for training and also clear enough to support teaching and review.


### Listing 1: BYOAI Chapter Chunker → `byoai_book_chunks.csv`

This script scans all chapter `.txt` files, slices them into overlapping
sentence windows, and writes the results to `byoai_book_chunks.csv`. Each record
contains the chapter number, title, chunk ID, text window, and up to three
subject tags representing the chapter’s main themes.

Text is read as **UTF-16** (falling back to **UTF-8**), cleaned for encoding
artifacts, split into sentences, and filtered to omit short fragments.
Helper functions include:
`read_text_any()` for robust reading,
`sentence_split()` for segmentation,
and `make_windows()` for constructing rolling windows using
`WINDOW_SIZE` and `WINDOW_STRIDE`.

**Output schema:**
*chapter | chapter_title | chunk_id | chunk_text | subject_tags*

**Example:**

*5 | Deep Learning | 42 | “PyTorch introduced eager execution…” | “deep-learning,frameworks,tensors”*

The resulting `byoai_book_chunks.csv` provides consistent, context-rich samples for generation, labeling, and analysis while preserving chapter context and
subject tags.
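
To make the overlap concrete, the sketch below prints which sentence indices land in each window for the listing's settings (`WINDOW_SIZE = 4`, `WINDOW_STRIDE = 3`). The helper `window_indices` is hypothetical and tracks indices only; it uses the same slicing rule as `make_windows()`.

```python
WINDOW_SIZE = 4
WINDOW_STRIDE = 3

def window_indices(n_sentences, size, stride):
    """Sentence indices covered by each rolling window."""
    return [list(range(i, min(i + size, n_sentences)))
            for i in range(0, n_sentences, stride)]

windows = window_indices(10, WINDOW_SIZE, WINDOW_STRIDE)
for w in windows:
    print(w)
# → [0, 1, 2, 3]
# → [3, 4, 5, 6]
# → [6, 7, 8, 9]
# → [9]
```

Each window shares exactly one sentence with the next, and the final window can be a short leftover. In Listing 1, leftovers like this are typically dropped by the `MIN_CHARS` filter.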


In [None]:
# --- BYOAI Chapter Chunker v3 → byoai_book_chunks.csv ------------------------
# Scans all *.txt chapters (UTF-16 preferred; falls back to UTF-8),
# slices rolling sentence windows, and writes a lean CSV for generators.
#
# Output schema (per row):
#   chapter (int) | chapter_title (str) | chunk_id (int) | chunk_text (str)
#   subject_tags (str; comma-separated, ≤3 per chapter)

import re
import csv
import glob
import pathlib
import collections

# ===================== CONFIG =====================
CHAPTER_GLOB   = "*.txt"     # pattern for chapter text files
OUTPUT_CSV     = "byoai_book_chunks.csv"
WINDOW_SIZE    = 4           # sentences per window
WINDOW_STRIDE  = 3           # step size between windows
MIN_CHARS      = 260         # drop very short/meta fragments
# ==================================================

# Canonical chapter titles (0: Foreword, 1: Introduction, "Chapter N" → N+1)
CHAPTER_TITLES = {
     0: "Foreword – Robo Interviews Clément Delangue",
     1: "Introduction – The Gold Rush Paradox",
     2: "AI Survival Kit",
     3: "Prepping Data for AI",
     4: "Classical Machine Learning",
     5: "Deep Learning",
     6: "Neuron Building Blocks",
     7: "Generative AI",
     8: "Breaking-Securing AI",
     9: "Deepfake Defense",
    10: "AI At Scale",
    11: "AI Ethics and Governance",
    12: "Agentic AI",
    13: "Commit to Contribute",
}

# === SUBJECT TAGS (3 per chapter max) ===
# Used for guiding statement placement and downstream grouping.
SUBJECT_TAGS = {
     0: ["open-source", "community", "ai"],                 # Foreword
     1: ["ai", "open-source", "builder"],                   # Introduction
     2: ["ai", "tool-chain", "notebooks"],                  # Ch1 – Survival Kit
     3: ["data-prep", "feature-engineering", "rag"],        # Ch2 – Prepping Data
     4: ["machine-learning", "classification", "evaluation"],# Ch3 – Classical ML
     5: ["deep-learning", "frameworks", "tensors"],         # Ch4 – Deep Learning
     6: ["neural-networks", "cnn", "transformers"],         # Ch5 – Neuron Blocks
     7: ["generative-ai", "diffusion", "gans"],             # Ch6 – Gen AI
     8: ["security", "red-team", "guardrails"],             # Ch7 – Break/Secure
     9: ["media-forensics", "voice-cloning", "deepfake"],   # Ch8 – Deepfake Def.
    10: ["mlops", "scaling", "deployment"],                 # Ch9 – At Scale
    11: ["ethics", "governance", "privacy"],                # Ch10 – Ethics/Gov
    12: ["agentic-ai", "planning", "tools"],                # Ch11 – Agentic AI
    13: ["open-source", "community", "contribution"],       # Ch12 – Commit
}

def fix_mojibake(t: str) -> str:
    """Light cleanup for common encoding artifacts from word processors."""
    repl = {
        "‚Äôs": "’s", "‚Äô": "’", "‚Äú": "“", "‚Äù": "”",
        "‚Äì": "–",  "‚Äî": "—", "‚Ä¶": "…", "Â": ""
    }
    for k, v in repl.items():
        t = t.replace(k, v)
    return t

def read_text_any(path: str) -> str:
    """Read file as UTF-16 first, then UTF-8 as a fallback; normalize newlines."""
    try:
        with open(path, "r", encoding="utf-16") as f:
            txt = f.read()
    except UnicodeError:
        # Not decodable as UTF-16: retry as UTF-8, dropping undecodable bytes
        with open(path, "r", encoding="utf-8", errors="ignore") as f:
            txt = f.read()
    return fix_mojibake(txt.replace("\r\n", "\n").replace("\r", "\n"))

def sentence_split(text: str):
    """Split text into sentences using simple punctuation-based rule."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

def make_windows(text: str, n: int, stride: int):
    """Create rolling windows of n sentences with a given stride."""
    sents = sentence_split(text)
    return [" ".join(sents[i:i+n]) for i in range(0, len(sents), stride)]

def parse_chapter_from_filename(fname: str):
    """Infer (chapter_index, canonical_title) from filename text."""
    low = fname.lower()
    if "foreword" in low:
        return 0, CHAPTER_TITLES[0]
    if "introduction" in low:
        return 1, CHAPTER_TITLES[1]
    m = re.search(r"chapter\D*?(\d{1,2})", low)
    if m:
        n = int(m.group(1)) + 1
        return n, CHAPTER_TITLES.get(n, pathlib.Path(fname).stem)
    # Fallback to Introduction if naming is irregular
    return 1, CHAPTER_TITLES[1]

def tags_for_chapter(chapter: int) -> str:
    """Return comma-separated subject tags for a chapter (≤3)."""
    tags = SUBJECT_TAGS.get(chapter, ["ai"])
    return ",".join(tags[:3]) if tags else "ai"

def main():
    """Scan chapter files, window sentences, filter, and write CSV."""
    files = sorted(glob.glob(CHAPTER_GLOB))
    if not files:
        raise SystemExit(f"No files matched {CHAPTER_GLOB}")

    rows = []

    # Iterate each text file and produce rolling windows
    for path in files:
        if not path.lower().endswith(".txt"):
            continue

        # Parse chapter index and canonical title
        chapter, title = parse_chapter_from_filename(path)

        # Read raw text and skip empties
        text = read_text_any(path)
        if not text.strip():
            continue

        # Build sentence windows for this chapter
        windows = make_windows(text, WINDOW_SIZE, WINDOW_STRIDE)

        # Collect non-trivial windows with metadata and tags
        chunk_id = 0
        for w in windows:
            if len(w) < MIN_CHARS:
                continue
            rows.append({
                "chapter": chapter,
                "chapter_title": title,
                "chunk_id": chunk_id,
                "chunk_text": w,
                "subject_tags": tags_for_chapter(chapter),
            })
            chunk_id += 1

    # Write CSV with BOM for spreadsheet-friendly behavior
    with open(OUTPUT_CSV, "w", newline="", encoding="utf-8-sig") as f:
        w = csv.DictWriter(
            f,
            fieldnames=[
                "chapter", "chapter_title", "chunk_id",
                "chunk_text", "subject_tags"
            ]
        )
        w.writeheader()
        for r in rows:
            w.writerow(r)

    # Console summary to sanity-check coverage
    by_chapter = collections.Counter(r["chapter"] for r in rows)
    print(f"Wrote {len(rows)} rows to {OUTPUT_CSV}")
    print("Windows per chapter:", dict(sorted(by_chapter.items())))

main()

Wrote 1891 rows to byoai_book_chunks.csv
Windows per chapter: {0: 54, 1: 73, 2: 136, 3: 190, 4: 182, 5: 209, 6: 92, 7: 171, 8: 157, 9: 140, 10: 138, 11: 99, 12: 151, 13: 99}


### Listing 1A: BYOAI Chapter Chunk Quality Check

This script validates and summarizes the `byoai_book_chunks.csv` output from Listing 1.
It checks structural integrity, verifies column presence, and reports per-chapter
and global statistics such as word and sentence counts. It also measures
distribution balance, identifies overly short or long chunks, and prints
keyword and subject-tag coverage to assess content diversity.

Key outputs include per-chapter averages (`mean_words`, `mean_sents`), overall
word-length ranges, and tag frequency rankings. Chunks below `SHORT_LIMIT`
or above `LONG_LIMIT` are flagged, with a sample preview printed for inspection.
Together, these diagnostics ensure the dataset remains consistent, balanced,
and thematically representative for downstream text generation and analysis.

In [None]:
# === BYOAI Chapter Chunk Quality Check ========================================
# Validates the integrity, balance, and richness of 'byoai_book_chunks.csv' produced by
# the BYOAI Chapter Chunker (Listing 1). Reports stats, keyword coverage,
# and short/long chunk distributions to guide downstream generation.

import re
import pandas as pd
from collections import Counter
from shutil import get_terminal_size

# ===================== CONFIG =====================
CHUNKS_CSV  = "byoai_book_chunks.csv"
REQ_COLS    = {"chapter", "chapter_title", "chunk_id",
               "chunk_text", "subject_tags"}
SHORT_LIMIT = 50      # minimum acceptable word count per chunk
LONG_LIMIT  = 220     # threshold for overly long windows
# ==================================================

# --- Load CSV and verify columns ---
df = pd.read_csv(CHUNKS_CSV, encoding="utf-8-sig")
print(f"Loaded {len(df):,} chunks from {CHUNKS_CSV}")

missing = REQ_COLS - set(df.columns)
if missing:
    raise SystemExit(f"Missing columns: {sorted(missing)}")
if df.empty:
    raise SystemExit(f"No rows found in {CHUNKS_CSV}")

# --- Basic metrics per chunk ---
df["chunk_text"] = df["chunk_text"].astype(str)
df["word_len"]   = df["chunk_text"].apply(lambda t: len(t.split()))
df["sent_count"] = df["chunk_text"].apply(
    lambda t: len(re.findall(r"[.!?](?:\s|$)", t))
)

# --- Per-chapter summary ---
summary = (
    df.groupby(["chapter", "chapter_title"], as_index=False)
      .agg(
          chunks=("chunk_id", "count"),
          mean_words=("word_len", "mean"),
          min_words=("word_len", "min"),
          max_words=("word_len", "max"),
          mean_sents=("sent_count", "mean"),
      )
      .round(2)
)
summary["share_pct"] = (summary["chunks"] / len(df) * 100).round(1)

print("\n=== Per-Chapter Chunk Stats ===")
term_width = get_terminal_size((120, 20)).columns
print(summary.sort_values("chapter")
              .to_string(index=False, line_width=term_width))

# --- Global summary ---
print("\n=== Global Summary ===")
print(f"Mean words per chunk: {df['word_len'].mean():.1f}")
print(f"Shortest chunk: {df['word_len'].min()} words")
print(f"Longest chunk:  {df['word_len'].max()} words")
print(f"Mean sentences per chunk: {df['sent_count'].mean():.2f}")

# --- Subject tag coverage ---
def split_tags(s):
    return [t.strip() for t in str(s).split(",") if t.strip()]

tag_counts = Counter(tag for tags in df["subject_tags"].map(split_tags)
                     for tag in tags)

print("\n=== Subject Tag Coverage (top 20) ===")
for tag, count in tag_counts.most_common(20):
    print(f"{tag:<20} {count}")

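# --- Keyword occurrence overview ---
# Note: str.count matches substrings, so "ai" is also counted inside words
# such as "training"; treat these totals as rough signals, not exact counts.
KEY_TERMS = ["ai", "data", "model", "learning",
             "open", "ethics", "voice", "agent"]
totals = Counter()
for text in df["chunk_text"]:
    text_l = text.lower()
    for term in KEY_TERMS:
        totals[term] += text_l.count(term)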

print("\n=== Keyword Occurrence Totals ===")
for term, count in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{term:<12} {count}")

# --- Flag short and long chunks ---
short_df = df[df["word_len"] < SHORT_LIMIT]
long_df  = df[df["word_len"] > LONG_LIMIT]

print(f"\nChunks under {SHORT_LIMIT} words: {len(short_df)} "
      f"({len(short_df)/len(df):.1%})")
print(f"Chunks over {LONG_LIMIT} words: {len(long_df)} "
      f"({len(long_df)/len(df):.1%})")

# --- Notebook-friendly preview ---
def preview(frame, label):
    if frame.empty:
        return
    cols = ["chapter","chapter_title","chunk_id","word_len","chunk_text"]
    print(f"\nExample {label} chunks:")
    try:
        from IPython.display import display
        display(frame.sample(min(5, len(frame)))[cols])
    except Exception:
        print(frame.sample(min(3, len(frame)))[cols].to_string(index=False))

preview(short_df, "short")
preview(long_df, "long")

Loaded 1,891 chunks from byoai_book_chunks.csv

=== Per-Chapter Chunk Stats ===
 chapter                               chapter_title  chunks  mean_words  min_words  max_words  mean_sents  share_pct
       0 Foreword – Robo Interviews Clément Delangue      54       84.17         33        150        3.94        2.9
       1        Introduction – The Gold Rush Paradox      73       67.21         46        106        4.00        3.9
       2                             AI Survival Kit     136       92.84         44        284        4.00        7.2
       3                        Prepping Data for AI     190       81.64         39        231        3.99       10.0
       4                  Classical Machine Learning     182       84.37         41        246        4.00        9.6
       5                               Deep Learning     209       80.19         32        256        3.99       11.1
       6                      Neuron Building Blocks      92      109.51         14        360

Example short chunks:


Unnamed: 0,chapter,chapter_title,chunk_id,word_len,chunk_text
909,5,Deep Learning,52,32,model = tf.keras.Sequential([\n tf.keras.la...
384,12,Agentic AI,149,43,"Finally, we come back to transparency. Open-so..."
965,5,Deep Learning,108,49,This loss value is the key signal we use to im...
409,13,Commit to Contribute,23,48,"For Robby, open innovation was a memory. A rem..."
1763,10,AI At Scale,137,37,"“Liar, Liar Pants on Fire”: A New Benchmark Da..."



Example long chunks:


Unnamed: 0,chapter,chapter_title,chunk_id,word_len,chunk_text
1302,7,Generative AI,144,237,It captures the essence of how modern AI assis...
1577,9,Deepfake Defense,91,240,"These final steps bring together the data, the..."
586,3,Prepping Data for AI,101,226,import pandas as pd\n\n# Helper functions are ...
1114,6,Neuron Building Blocks,48,256,Here is an example:\n\n\nFigure 5-6 Sample out...
74,2,AI Survival Kit,74,248,"You’ve Been Selected as the Next Avenger!',\n ..."


### Listing 2: BYOAI LIAR-Style Generator → `byoai_liar.csv`

This generator transforms chapter chunks into concise, labeled statements
using the OpenAI API. Each row in the output CSV contains a short claim,
context phrase, and brief justification paired with a truthfulness label
drawn from the classic *LIAR* dataset categories (`TRUE`, `mostly-true`,
`half-true`, `barely-true`, `FALSE`, `pants-fire`). Labels are sampled using
weighted probabilities to maintain a balanced overall mix.

For every passage in `byoai_book_chunks.csv`, the script composes a compact
prompt embedding the target label, chapter metadata, and subject tags. The
model (`gpt-4o-mini`) returns a JSON response containing three fields:
`statement`, `context`, and `label_reason`. Lightweight validators enforce
length limits and reject rows that contain meta references (like “chapter” or
“book”), while small helpers tidy capitalization and end punctuation. A dedupe
guard hashes the first 60 characters of each statement so repeated
statements are skipped.

Output fields include:

*id, chunk_id, label, statement, context, label_reason,
subject_tags, chapter, chapter_title*

Together, these labeled micro-claims form a synthetic fact-checking corpus
aligned to the book’s themes. The dataset can be used to train or evaluate
classification models that reason about factual accuracy, context grounding,
and evidence alignment—mirroring LIAR’s structure while drawing from
BYOAI’s original chapter content.

In short, this script lets AI **generate a dataset about itself** —
a living demonstration of the book’s theme: *using AI to build your own AI.*

**REQUIRES:**  
`byoai_book_chunks.csv` (from Listing 1), an active OpenAI API key stored via  
`google.colab.userdata`, and the `openai` + `pandas` libraries.

In [None]:
# === BYOAI LIAR-Style Generator (single-pass; OpenAI >= 1.0) ================
# Output CSV schema:
#   id,chunk_id,label,statement,context,label_reason,subject_tags,chapter,chapter_title
#
# Purpose:
#   Generate concise, labeled statements with brief context and justification
#   directly from byoai_book_chunks.csv using a single, tight GEN prompt per row.
#
# Requirements:
#   pip install --quiet openai pandas
#   In Colab, add OPENAI_API_KEY through the Secrets panel (key icon);
#   the script reads it with google.colab.userdata.get('OPENAI_API_KEY').
# ============================================================================

import os, csv, json, re, time, random, math, hashlib
from dataclasses import dataclass, asdict
from collections import Counter, defaultdict

import pandas as pd
from google.colab import userdata
from openai import OpenAI

# ===================== CONFIG =====================
CHUNKS_CSV               = "byoai_book_chunks.csv"
OUTPUT_CSV               = "byoai_liar.csv"

# Sampling targets
MAX_CHUNKS               = 10           # how many chunks to process
N_PER_CHUNK_TARGET       = 2            # rows to generate per chunk
RETRIES_PER_ROW          = 1            # retry attempts if checks fail

# IDs and pacing
BASE_ID                  = 100000
RANDOM_SEED              = 42
PAUSE_SEC                = 0.02         # gentle pacing

# Labels & weights (global target mix)
LABELS = ["TRUE", "mostly-true", "half-true", "barely-true", "FALSE", "pants-fire"]
LABEL_WEIGHTS = [0.20, 0.18, 0.24, 0.18, 0.12, 0.08]

# Chapter-tag mapping (fallback if subject_tags column not present)
SUBJECT_TAGS = {
     0: ["open-source", "community", "ai"],                 # Foreword
     1: ["ai", "open-source", "builder"],                   # Introduction
     2: ["ai", "tool-chain", "notebooks"],                  # Ch1 – Survival Kit
     3: ["data-prep", "feature-engineering", "rag"],        # Ch2 – Prepping Data
     4: ["machine-learning", "classification", "evaluation"],# Ch3 – Classical ML
     5: ["deep-learning", "frameworks", "tensors"],         # Ch4 – Deep Learning
     6: ["neural-networks", "cnn", "transformers"],         # Ch5 – Neuron Blocks
     7: ["generative-ai", "diffusion", "gans"],             # Ch6 – Gen AI
     8: ["security", "red-team", "guardrails"],             # Ch7 – Break/Secure
     9: ["media-forensics", "voice-cloning", "deepfake"],   # Ch8 – Deepfake Def.
    10: ["mlops", "scaling", "deployment"],                 # Ch9 – At Scale
    11: ["ethics", "governance", "privacy"],                # Ch10 – Ethics/Gov
    12: ["agentic-ai", "planning", "tools"],                # Ch11 – Agentic AI
    13: ["open-source", "community", "contribution"],       # Ch12 – Commit
}

# Light validation bounds
META_BAN     = re.compile(r"\b(chapter|this chapter|book|this book|this statement)\b", re.I)
STMT_MAX_W   = 20
CTX_MAX_W    = 12
RSN_MIN_W    = 8
RSN_MAX_W    = 28

# Dedupe control
DEDUP_HASH_LEN = 60       # portion of statement to hash
# ==================================================

random.seed(RANDOM_SEED)

# ============ OpenAI client ============
api_key = userdata.get("OPENAI_API_KEY")
if not api_key:
    raise SystemExit("Missing OPENAI_API_KEY in Colab userdata.")
client = OpenAI(api_key=api_key)

# ============ CSV model ============
HEADERS = [
    "id","chunk_id","label","statement","context","label_reason",
    "subject_tags","chapter","chapter_title"
]

@dataclass
class Row:
    id: int
    chunk_id: int
    label: str
    statement: str
    context: str
    label_reason: str
    subject_tags: str
    chapter: int
    chapter_title: str

def write_csv(rows, path):
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        w = csv.writer(f)
        w.writerow(HEADERS)
        for r in rows:
            w.writerow([asdict(r)[h] for h in HEADERS])

# ============ Utilities ============
def wc(s: str) -> int:
    return len(str(s).strip().split())

def ensure_sentence(s: str) -> str:
    s = str(s or "").strip()
    if not s:
        return s
    if s[0].isalpha() and not s[0].isupper():
        s = s[0].upper() + s[1:]
    if s[-1] not in ".!?":
        s += "."
    return s

def ensure_phrase(s: str) -> str:
    # keep concise phrase/clause; period not required
    return str(s or "").strip().rstrip(".")

def safe_parse_json(txt: str):
    txt = str(txt).strip()
    i, j = txt.find("{"), txt.rfind("}") + 1
    snippet = txt[i:j] if i != -1 and j > 0 else txt
    try:
        return json.loads(snippet)
    except Exception:
        pairs = re.findall(r'"(\w+)":\s*"([^"]+)"', snippet)
        return {k: v for k, v in pairs}

def short_hash(statement: str) -> str:
    base = statement[:DEDUP_HASH_LEN].strip().lower()
    return hashlib.md5(base.encode("utf-8")).hexdigest()[:10]

def ok_row(stmt: str, ctx: str, rsn: str) -> bool:
    if not stmt or not ctx or not rsn:
        return False
    if META_BAN.search(stmt) or META_BAN.search(ctx) or META_BAN.search(rsn):
        return False
    if wc(stmt) > STMT_MAX_W:
        return False
    if wc(ctx) > CTX_MAX_W:
        return False
    n = wc(rsn)
    if n < RSN_MIN_W or n > RSN_MAX_W:
        return False
    return True

# ============ Prompt ============
GEN_PROMPT = r"""Return ONLY compact JSON with keys:
statement, context, label_reason.

Target label: {target_label}
Chapter: {chapter} – {chapter_title}
Subject tags: {subject_tags}

Rules:
- statement: ONE concise, third-person, declarative sentence (≤20 words) about the passage topic.
  It MUST be written to MATCH the Target label.
- Label guidance:
  TRUE = directly supported by passage;
  mostly-true = broadly supported, minor caveat omitted;
  half-true = mix of correct and incorrect specifics;
  barely-true = largely unsupported, notable error or overreach;
  FALSE = contradicts passage facts;
  pants-fire = extreme or implausible contradiction.
- context: ONE concise noun phrase or brief clause (≤12 words) showing where/how the claim fits.
  Include at least one specific term from the passage or tags (e.g., dataset, model, metric, tool, or concept).
- label_reason: 8–18 words explaining WHY the label fits, citing evidence, omissions, or contradictions.
  For FALSE or pants-fire, name the detail or assumption being contradicted.
  Avoid starting with phrases like “The passage…”, “The claim…”, or similar. Instead, describe the relationship or mismatch directly (e.g., “RAG supports complex use cases” or
  “Feature design improves accuracy in most cases.”).

Style:
- Write label_reason in a natural explanatory tone, not as a citation or report.
- Use varied openings and sentence structures across rows.
- Prefer concrete mechanisms, datasets, or named tools over generalities.
- Maintain a factual, neutral tone; no meta language or self-reference.

Avoid: “chapter”, “book”, “this statement”, or similar meta words.

Passage:
<<<
{passage}
>>>

JSON:
"""

# ============ Label selection with drift guard ============
def pick_label(counts: Counter, total_made: int, total_target: int) -> str:
    # Sample by weight; if the label's count exceeds 1.25x its expected count so far, resample once.
    lbl = random.choices(LABELS, weights=LABEL_WEIGHTS, k=1)[0]
    if total_made == 0:
        return lbl
    # Expected count so far = normalized weight * rows made (total_target cancels out,
    # and normalizing keeps the guard valid even if LABEL_WEIGHTS does not sum to 1).
    share = LABEL_WEIGHTS[LABELS.index(lbl)] / sum(LABEL_WEIGHTS)
    if counts[lbl] > 1.25 * share * total_made:
        return random.choices(LABELS, weights=LABEL_WEIGHTS, k=1)[0]
    return lbl

# ============ LLM call ============
def llm_generate_row(label: str, passage: str, chapter: int, chapter_title: str, tags_str: str) -> dict:
    prompt = GEN_PROMPT.format(
        target_label=label,
        chapter=chapter,
        chapter_title=chapter_title,
        subject_tags=tags_str,
        passage=passage
    )
    r = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.75,
        top_p=0.9,
    )
    return safe_parse_json(r.choices[0].message.content)

# ============ Main ============
def main():
    df = pd.read_csv(CHUNKS_CSV, encoding="utf-8-sig")

    for col in ("chapter", "chapter_title", "chunk_id", "chunk_text"):
        if col not in df.columns:
            raise SystemExit(f"chunks.csv missing required column: {col}")

    # choose which chunks to process (simple shuffle for coverage)
    idxs = list(df.index)
    random.shuffle(idxs)
    idxs = idxs[:min(MAX_CHUNKS, len(idxs))]

    total_target = len(idxs) * N_PER_CHUNK_TARGET

    next_id = BASE_ID
    rows = []
    seen_stmt = set()
    label_counts = Counter()

    print(f"Processing {len(idxs)} chunks; target rows ≈ {total_target}")

    for i, idx in enumerate(idxs, 1):
        row = df.loc[idx]
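        # NaN cells would stringify to "nan", so test for missing values explicitly
        passage = row["chunk_text"]
        if pd.isna(passage) or not str(passage).strip():
            continue
        passage = str(passage).strip()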

        chapter = int(row["chapter"])
        chapter_title = str(row["chapter_title"])
        chunk_id = int(row["chunk_id"])

        # prefer subject_tags column; fallback to map (NaN cells count as missing)
        raw_tags = row["subject_tags"] if "subject_tags" in df.columns else None
        if pd.notna(raw_tags) and str(raw_tags).strip():
            tags_str = str(raw_tags).strip()
        else:
            tags_str = ", ".join(SUBJECT_TAGS.get(chapter, []))

        # per-chunk generation
        for _ in range(N_PER_CHUNK_TARGET):
            attempts = 0
            while attempts <= RETRIES_PER_ROW:
                attempts += 1

                label = pick_label(label_counts, len(rows), total_target)
                out = llm_generate_row(label, passage, chapter, chapter_title, tags_str)

                stmt = ensure_sentence(out.get("statement", ""))
                ctx  = ensure_phrase(out.get("context", ""))
                rsn  = ensure_sentence(out.get("label_reason", ""))

                if not ok_row(stmt, ctx, rsn):
                    continue  # the while condition enforces the retry cap

                key = short_hash(stmt)
                if key in seen_stmt:
                    continue  # duplicate statement; retry if attempts remain

                # accept
                seen_stmt.add(key)
                label_counts[label] += 1
                rows.append(
                    Row(
                        id=next_id,
                        chunk_id=chunk_id,
                        label=label,
                        statement=stmt,
                        context=ctx,
                        label_reason=rsn,
                        subject_tags=tags_str,
                        chapter=chapter,
                        chapter_title=chapter_title,
                    )
                )
                next_id += 1
                time.sleep(PAUSE_SEC)
                break  # next target row for this chunk

        if i % 50 == 0:
            print(f"[progress] chunks {i}/{len(idxs)} | rows {len(rows)}")

    # write output
    write_csv(rows, OUTPUT_CSV)

    # summary
    print(f"\nWrote {len(rows)} rows to {OUTPUT_CSV}")
    print("Label counts:", label_counts)
    print("Chapters:", Counter(r.chapter for r in rows))
    print("\nSample:")
    for s in rows[:10]:
        print(f"{s.id} [{s.label}] {s.statement} :: {s.context} | tags={s.subject_tags} | ch{s.chapter} – {s.chapter_title}")

main()

Processing 10 chunks; target rows ≈ 20

Wrote 20 rows to byoai_liar.csv
Label counts: Counter({'mostly-true': 5, 'half-true': 4, 'TRUE': 4, 'FALSE': 3, 'barely-true': 2, 'pants-fire': 2})
Chapters: Counter({2: 6, 3: 4, 7: 4, 12: 2, 5: 2, 8: 2})

Sample:
100000 [barely-true] RAG cannot generate serious narratives like policy documents. :: dynamic, context-driven narratives | tags=data-prep,feature-engineering,rag | ch3 – Prepping Data for AI
100001 [FALSE] RAG cannot generate dynamic narratives for serious projects. :: RAG's capabilities in generating narratives | tags=data-prep,feature-engineering,rag | ch3 – Prepping Data for AI
100002 [half-true] Generative AI can create music, code, and videos from text prompts. :: applications of generative AI models | tags=generative-ai,diffusion,gans | ch7 – Generative AI
100003 [TRUE] Generative AI models can create music, code, and videos from text prompts. :: generative-ai applications in model architecture | tags=generative-ai,diffusion,gans 

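The sample output above includes near-identical statements (rows 100002 and 100003) that both survived deduplication. A minimal sketch of why, reusing the `short_hash` logic from the listing: the key hashes an exact, lowercased prefix, so a single extra word produces a different key. `DEDUP_HASH_LEN = 80` is an assumed value for illustration only; the listing's own constant may differ.

```python
import hashlib

DEDUP_HASH_LEN = 80  # assumed value for illustration; the listing's constant may differ

def short_hash(statement: str) -> str:
    # Same logic as the listing: hash a lowercased, stripped prefix of the statement
    base = statement[:DEDUP_HASH_LEN].strip().lower()
    return hashlib.md5(base.encode("utf-8")).hexdigest()[:10]

a = "Generative AI can create music, code, and videos from text prompts."
b = "Generative AI models can create music, code, and videos from text prompts."

print(short_hash(a) == short_hash(b))          # False: one extra word changes the key
print(short_hash(a) == short_hash(a.upper()))  # True: casing is ignored
```

If paraphrased duplicates turn out to matter for training, a fuzzier comparison would be needed; the prefix hash only guards against exact repeats.
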
### Listing 2A: Merging Extreme Supervision Samples into the BYOAI_LIAR Dataset

This listing merges the core **BYOAI_LIAR** dataset with a small, purpose-built extension file named
`byoai_liar_extremes.csv`. The added file introduces *extreme supervision* samples: short,
high-confidence statements manually generated from the book’s chapter texts to strengthen the dataset’s
coverage of clear truths and clear falsehoods.

During early experiments, most examples clustered near the middle labels (“half-true” or “barely-true”).
To help the model learn sharper boundaries between factually correct and incorrect claims, this step
adds roughly ten curated statements per chapter (≈140 in total) representing explicit **TRUE**, **FALSE**,
and **pants-fire** cases.

The script lowercases the `label`, `chapter_title`, and `subject_tags` columns, keeps only the columns
the two files share, removes duplicate statements (case-insensitive), and writes a single clean file,
`byoai_liar_merged.csv`. Because labels are lowercased, **TRUE** and **FALSE** appear as `true` and
`false` in the merged file. The added rows shift weight toward the extreme labels and give the model
more clear-cut examples of truth and falsehood before fine-tuning.



In [1]:
# === Listing 2A: Merge Extended Dataset with Main BYOAI_LIAR ===============
# Combines byoai_liar.csv and byoai_liar_extremes.csv.
# Normalizes label and text casing, aligns columns, and removes duplicates.
# ---------------------------------------------------------------------------

import pandas as pd
import os

# --- Input / Output --------------------------------------------------------
BASE_FILE = "byoai_liar.csv"
EXT_FILE  = "byoai_liar_extremes.csv"
MERGED_FILE = "byoai_liar_merged.csv"

# --- Load both datasets ----------------------------------------------------
df_base = pd.read_csv(BASE_FILE, encoding="utf-8-sig")
df_ext  = pd.read_csv(EXT_FILE, encoding="utf-8-sig")

# --- Normalize text casing -------------------------------------------------
def normalize_text_fields(df):
    for col in ["label", "chapter_title", "subject_tags"]:
        if col in df.columns:
            df[col] = df[col].astype(str).str.strip().str.lower()
    return df

df_base = normalize_text_fields(df_base)
df_ext  = normalize_text_fields(df_ext)

# --- Align columns ---------------------------------------------------------
common_cols = [c for c in df_base.columns if c in df_ext.columns]
df_ext = df_ext[common_cols]
df_base = df_base[common_cols]

# --- Combine and remove duplicates -----------------------------------------
df_merged = pd.concat([df_base, df_ext], ignore_index=True)

# Drop duplicate statements (case-insensitive)
df_merged["statement_norm"] = df_merged["statement"].astype(str).str.strip().str.lower()
df_merged = df_merged.drop_duplicates(subset="statement_norm").drop(columns="statement_norm")

# --- Sort and reset index --------------------------------------------------
df_merged = df_merged.sort_values(by=["chapter", "id"]).reset_index(drop=True)

# --- Save clean merged dataset --------------------------------------------
df_merged.to_csv(MERGED_FILE, index=False, encoding="utf-8-sig")

# --- Summary ---------------------------------------------------------------
print(f"Merged dataset saved → {MERGED_FILE}")
print(f"Base rows: {len(df_base)}, Extended rows: {len(df_ext)}, Final: {len(df_merged)}")

print("\nLabel distribution:")
print(df_merged["label"].value_counts().to_dict())

print("\nChapter distribution:")
print(df_merged["chapter"].value_counts().sort_index().to_dict())

Merged dataset saved → byoai_liar_merged.csv
Base rows: 5517, Extended rows: 140, Final: 5558

Label distribution:
{'half-true': 1338, 'true': 1122, 'barely-true': 1050, 'mostly-true': 936, 'false': 659, 'pants-fire': 453}

Chapter distribution:
{0: 380, 1: 394, 2: 560, 3: 544, 4: 615, 5: 271, 6: 502, 7: 458, 8: 409, 9: 404, 10: 287, 11: 452, 12: 280, 13: 2}

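The summary numbers above can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the printed counts (it does not reload the CSV) to show how many rows deduplication removed and what share each label holds in the merged file.

```python
# Counts copied from the printed summary above (not recomputed from the CSV)
base_rows, ext_rows, final_rows = 5517, 140, 5558
label_counts = {
    "half-true": 1338, "true": 1122, "barely-true": 1050,
    "mostly-true": 936, "false": 659, "pants-fire": 453,
}

# Rows removed by the case-insensitive statement dedup
dropped = base_rows + ext_rows - final_rows
print(f"Duplicates dropped: {dropped}")  # 99

# Label counts should add up to the final row count
assert sum(label_counts.values()) == final_rows

# Share of each label in the merged file
for label, n in label_counts.items():
    print(f"{label:12s} {n / final_rows:6.1%}")
```

Even after the merge, `half-true` remains the largest class at about 24% and `pants-fire` the smallest at about 8%, so the extremes file narrows the imbalance without removing it.
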

### Listing 2B: BYOAI LIAR Coherence & Label Audit

This script performs a lightweight quality audit of `byoai_liar.csv` using
an LLM (`gpt-4o-mini`) as the evaluator. It randomly samples dataset rows and
asks the model to rate coherence on four dimensions (**chapter_fit**,
**label_fit**, **context_quality**, and **reason_quality**), each scored from
0 to 1. The model also assigns an overall rating (`high`, `medium`, or `low`)
and suggests a brief fix, which the script prints only for rows rated `low`
or with **reason_quality** below 0.6.

Each row is evaluated through a structured JSON prompt to keep scoring
consistent and feedback concise. The script prints row-level metrics, then a
short summary of the average scores and the distribution of overall ratings.
The audit checks three things: whether generated statements align with their
chapters, whether labels are logically justified, and whether context and
reasoning remain distinct and informative. It serves as an early,
LLM-assisted sanity check on the coherence and labeling of the synthetic
BYOAI LIAR dataset.
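
When the model's `overall` field is missing or malformed, the listing falls back to the mean of the four scores, with cutoffs at 0.75 and 0.5. A minimal standalone sketch of that rule follows; `fallback_overall` is a hypothetical helper name used only for illustration, since the listing applies the same logic inline.

```python
def clamp01(x) -> float:
    """Coerce x to a float in [0, 1]; unparseable values become 0.0."""
    try:
        v = float(x)
    except (TypeError, ValueError):
        return 0.0
    return max(0.0, min(1.0, v))

def fallback_overall(scores: dict) -> str:
    """Derive high/medium/low from the mean of the four audit scores."""
    keys = ("chapter_fit", "label_fit", "context_quality", "reason_quality")
    avg = sum(clamp01(scores.get(k, 0)) for k in keys) / len(keys)
    return "high" if avg >= 0.75 else "medium" if avg >= 0.5 else "low"

print(fallback_overall({"chapter_fit": 0.8, "label_fit": 0.7,
                        "context_quality": 0.6, "reason_quality": 0.7}))  # medium
print(fallback_overall({"chapter_fit": "n/a"}))  # low
```

Scores that are missing or fail to parse count as 0.0, so a cluster of `low` ratings may point to malformed model output rather than weak rows.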

In [None]:
# === BYOAI LIAR Coherence & Label Audit (LLM-assisted, randomized) ============
# Randomly samples rows from 'byoai_liar.csv' and asks gpt-4o-mini to rate:
# - chapter_fit: does the statement/context align with the chapter’s theme?
# - label_fit: is the label logically consistent with the statement and justification?
# - reason_quality: does label_reason explain the label clearly and concisely?
# - context_quality: does context add distinct, relevant detail (not restate statement)?
# Also returns overall rating and a suggested fix when coherence is low.
# ==============================================================================

import os
import json
import time
import random
import re
import pandas as pd
from google.colab import userdata
from openai import OpenAI

# -------------------- CONFIG --------------------
CSV_PATH       = "byoai_liar.csv"
SAMPLE_CHECKS  = 100         # how many random rows to audit
MODEL          = "gpt-4o-mini"
RANDOM_SEED    = 42
SLEEP_SEC      = 0.05
# ------------------------------------------------

random.seed(RANDOM_SEED)

api_key = userdata.get("OPENAI_API_KEY")
if not api_key:
    raise SystemExit("Missing OPENAI_API_KEY in Colab userdata.")
client = OpenAI(api_key=api_key)

# Load dataset
df = pd.read_csv(CSV_PATH, encoding="utf-8-sig")
required = [
    "id","label","statement","context","label_reason",
    "subject_tags","chapter","chapter_title"
]
missing = [c for c in required if c not in df.columns]
if missing:
    raise SystemExit(f"Missing columns: {missing}")

# Sample subset
sample_df = (
    df.sample(SAMPLE_CHECKS, random_state=RANDOM_SEED)
    if SAMPLE_CHECKS < len(df)
    else df.copy()
)

# Prompt setup
SYSTEM = (
    "You are a concise dataset auditor. "
    "Return ONLY compact JSON. No preface, no commentary."
)
USER_TMPL = """Evaluate the coherence and labeling of this BYOAI dataset row.

Return JSON with keys EXACTLY:
- chapter_fit: float in [0,1]
- label_fit: float in [0,1]
- context_quality: float in [0,1]
- reason_quality: float in [0,1]
- overall: one of ["high","medium","low"]
- comments: short string (≤25 words)
- suggest_fix: short rewrite (≤28 words) improving label_reason or context

Guidelines:
- chapter_fit: statement+context align with the chapter theme ({chapter_title})?
- label_fit: does the label logically match the statement and label_reason?
- context_quality: does the context add value without repeating the statement?
- reason_quality: is label_reason clear, concise, and directly supportive of the label?
- overall: rate overall coherence (high=strong fit, medium=minor flaws, low=clear mismatch)

Row:
chapter_title: "{chapter_title}"
chapter: {chapter}
label: "{label}"
subject_tags: "{subject_tags}"
statement: "{statement}"
context: "{context}"
label_reason: "{label_reason}"

Return JSON:"""

# === Utility functions ===
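def safe_parse_json(txt: str) -> dict:
    txt = (txt or "").strip()
    i, j = txt.find("{"), txt.rfind("}") + 1
    snippet = txt[i:j] if i != -1 and j > i else txt
    try:
        out = json.loads(snippet)
        return out if isinstance(out, dict) else {}
    except Exception:
        # Fallback scrape: accept quoted strings and bare numbers, since the scores are floats
        pairs = re.findall(r'"(\w+)"\s*:\s*(?:"([^"]*)"|(-?\d+(?:\.\d+)?))', snippet)
        return {k: (s or n) for k, s, n in pairs}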

def clamp01(x):
    try:
        v = float(x)
    except Exception:
        return 0.0
    return max(0.0, min(1.0, v))

def audit_row(row):
    prompt = USER_TMPL.format(
        chapter=row["chapter"],
        chapter_title=str(row["chapter_title"]),
        label=str(row["label"]),
        subject_tags=str(row["subject_tags"]),
        statement=str(row["statement"]),
        context=str(row["context"]),
        label_reason=str(row["label_reason"]),
    )
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role":"system","content":SYSTEM},
                  {"role":"user","content":prompt}],
        temperature=0.2,
    )
    out = safe_parse_json(r.choices[0].message.content)
    # normalize
    cfit = clamp01(out.get("chapter_fit", 0))
    lfit = clamp01(out.get("label_fit", 0))
    cq   = clamp01(out.get("context_quality", 0))
    rq   = clamp01(out.get("reason_quality", 0))
    overall = str(out.get("overall","")).lower().strip()
    if overall not in {"high","medium","low"}:
        avg = (cfit + lfit + cq + rq) / 4.0
        overall = "high" if avg >= 0.75 else "medium" if avg >= 0.5 else "low"
    comments = str(out.get("comments","")).strip()
    fix = str(out.get("suggest_fix","")).strip()
    return {
        "chapter_fit": cfit,
        "label_fit": lfit,
        "context_quality": cq,
        "reason_quality": rq,
        "overall": overall,
        "comments": comments,
        "suggest_fix": fix,
    }

# === Run audit ===
results = []
print(f"Auditing {len(sample_df)} random rows from {len(df)} total...\n")
for i, (_, row) in enumerate(sample_df.iterrows(), 1):
    res = audit_row(row)
    results.append((row, res))
    print(f"[{i:02d}] id={row['id']} | ch{row['chapter']} "
          f"({row['chapter_title']}) | label={row['label']}")
    print(f"     S: {row['statement']}")
    print(f"     C: {row['context']}")
    print(f"     R: {row['label_reason']}")
    print(f"     -> fit(ch,lab,ctx,rsn)=({res['chapter_fit']:.2f},"
          f"{res['label_fit']:.2f},{res['context_quality']:.2f},"
          f"{res['reason_quality']:.2f}) | overall={res['overall']}")
    if res["overall"] == "low" or res["reason_quality"] < 0.6:
        print(f"     fix: {res['suggest_fix'] or '(no suggestion)'}")
    if res["comments"]:
        print(f"     note: {res['comments']}")
    print()
    time.sleep(SLEEP_SEC)

# === Aggregates ===
if results:
    ch  = sum(r[1]["chapter_fit"] for r in results) / len(results)
    lf  = sum(r[1]["label_fit"] for r in results) / len(results)
    cq  = sum(r[1]["context_quality"] for r in results) / len(results)
    rq  = sum(r[1]["reason_quality"] for r in results) / len(results)
    overall_counts = {}
    for _, r in results:
        overall_counts[r["overall"]] = overall_counts.get(r["overall"], 0) + 1

    print("=== Aggregate (sample) ===")
    print(f"chapter_fit avg:     {ch:.3f}")
    print(f"label_fit avg:       {lf:.3f}")
    print(f"context_quality avg: {cq:.3f}")
    print(f"reason_quality avg:  {rq:.3f}")
    print("overall ratings:     ", overall_counts)
else:
    print("No results to summarize.")

Auditing 100 random rows from 5517 total...

[01] id=1625 | ch4 (Deep Learning) | label=half-true
     S: The DataLoader efficiently manages loading and batching of dataset samples.
     C: DataLoader functionality in handling datasets
     R: While it efficiently batches data, it may not address all complexities of data handling.
     -> fit(ch,lab,ctx,rsn)=(0.80,0.70,0.60,0.70) | overall=medium
     note: Context is somewhat vague and doesn't enhance understanding.

[02] id=1884 | ch6 (Generative AI) | label=TRUE
     S: Open-source alternatives to foundation models are increasingly developed globally.
     C: open-source alternatives to foundation models
     R: Numerous initiatives aim to democratize access to generative AI technologies.
     -> fit(ch,lab,ctx,rsn)=(0.90,0.80,0.70,0.80) | overall=medium
     note: Context is somewhat repetitive; label fits but could be clearer.

[03] id=3177 | ch2 (Prepping Data for AI) | label=barely-true
     S: Most teams invest little time in d