# Lab 13 — From Tables/Text to JSONL for LLM Workloads

**Focus Area:** JSON Lines (JSONL) for large **LLM** corpora — streaming, memory efficiency, one-record-per-line, provenance, and validation

This notebook implements the complete solution for Lab 13.

## Setup - Create Required Directories

In [1]:
from pathlib import Path
for p in ['artifacts/jsonl','artifacts/samples','tools','data']:
    Path(p).mkdir(parents=True, exist_ok=True)
print("Directories created successfully")

Directories created successfully


## Part A — Build an LLM-Native Source CSV

### A1. Generate `data/corpus_llm.csv` (1,000 rows)

In [2]:
import csv, random
from datetime import datetime, timedelta
from pathlib import Path

random.seed(42)
N = 1000
TYPES = ["help_article","policy","release_note","faq"]
SECTIONS = ["Overview","Setup","Troubleshooting","FAQ"]
TAGS = ["billing","security","compliance","sso","api","governance","export","retention","privacy","rate_limits"]
LANGS = ["en","en","en","de","fr"]  # mostly English
CONF = ["public","internal"]  # governance labels
now = datetime(2025, 2, 10)

boiler = (
    "This article explains how to configure single sign-on with step-by-step instructions. "
    "Use the admin console to enable SAML and verify claim mappings. "
    "Common pitfalls include clock skew and incorrect audience URIs. "
)

rows = []
for i in range(1, N+1):
    kind = random.choice(TYPES)
    doc_id = f"DOC-{i:04d}"
    title = {
        "help_article": f"How to configure SSO (v{random.randint(1,5)}).",
        "policy": f"Data Retention Policy — Region {random.choice(['US','EU','APAC'])}",
        "release_note": f"Release 2025{random.randint(1,12):02d} — Key fixes",
        "faq": f"FAQ: {random.choice(['Exports','Rate Limits','Privacy','Billing'])}"
    }[kind]
    section = random.choice(SECTIONS)
    # Simulate occasional missing/short text and duplicates later
    body = (boiler * random.randint(1,3)) + f"Additional details about {random.choice(TAGS)} and {random.choice(TAGS)}.\n"
    tags = ",".join(random.sample(TAGS, k=random.randint(2,4)))
    created = (now - timedelta(days=random.randint(0, 240))).strftime('%Y-%m-%d')
    updated = (now - timedelta(days=random.randint(0, 30))).strftime('%Y-%m-%d')
    language = random.choice(LANGS)
    confidentiality = random.choice(CONF)
    source_url = f"https://example.local/{kind}/{doc_id.lower()}"
    rows.append({
        "doc_id": doc_id,
        "type": kind,
        "title": title,
        "section": section,
        "body_text": body.strip(),
        "tags": tags,
        "source_url": source_url,
        "created_at": created,
        "updated_at": updated,
        "language": language,
        "confidentiality": confidentiality,
    })

# Add deliberate duplicates and a few missing values
for j in range(10):
    rows.append({**rows[j], "doc_id": f"DOC-DUP-{j:02d}"})
for k in range(5):
    rows[k]["body_text"] = ""  # missing text

out_csv = Path('data/corpus_llm.csv')
with out_csv.open('w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader(); writer.writerows(rows)
print('Wrote', out_csv, 'rows=', len(rows))

Wrote data/corpus_llm.csv rows= 1010


### A2. Quick peek at the data

In [3]:
import pandas as pd
df_peek = pd.read_csv('data/corpus_llm.csv')
print(f"Total rows: {len(df_peek)}")
print(f"\nFirst 5 rows:")
df_peek.head(5)

Total rows: 1010

First 5 rows:


,doc_id,type,title,section,body_text,tags,source_url,created_at,updated_at,language,confidentiality
0,DOC-0001,help_article,How to configure SSO (v1).,Setup,,"rate_limits,export",https://example.local/help_article/doc-0001,2025-02-02,2025-02-10,en,public
1,DOC-0002,policy,Data Retention Policy — Region APAC,FAQ,,"billing,compliance,export",https://example.local/policy/doc-0002,2024-11-15,2025-02-02,en,public
2,DOC-0003,release_note,Release 202507 — Key fixes,Troubleshooting,,"retention,privacy",https://example.local/release_note/doc-0003,2025-01-10,2025-01-12,de,public
3,DOC-0004,release_note,Release 202510 — Key fixes,Overview,,"sso,security",https://example.local/release_note/doc-0004,2024-11-05,2025-02-02,de,internal
4,DOC-0005,policy,Data Retention Policy — Region EU,Overview,,"sso,compliance,retention,rate_limits",https://example.local/policy/doc-0005,2024-12-03,2025-01-12,fr,public


## Part B — Expose the CSV via a Local API

### B1. Create `tools/corpus_api.py` (FastAPI)

In [2]:
api_code = '''# tools/corpus_api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import pandas as pd
from typing import List

app = FastAPI(title="LLM Corpus API", version="1.0.0")
DF = None

class Doc(BaseModel):
    doc_id: str
    type: str
    title: str
    section: str
    body_text: str
    tags: str
    source_url: str
    created_at: str
    updated_at: str
    language: str
    confidentiality: str

@app.on_event("startup")
async def load_data():
    global DF
    DF = pd.read_csv('../data/corpus_llm.csv')
    # Convert NaN to empty strings for all string columns
    string_cols = ['body_text', 'title', 'section', 'tags', 'source_url']
    for col in string_cols:
        DF[col] = DF[col].fillna('')

@app.get('/health')
async def health():
    return {"ok": True}

@app.get('/v1/corpus', response_model=List[Doc])
async def list_docs(page: int = 1, page_size: int = 50, q: str | None = None,
                    language: str | None = None, conf: str | None = None):
    if page < 1 or page_size < 1 or page_size > 200:
        raise HTTPException(400, 'bad paging params')
    df = DF
    if q:
        mask = df['body_text'].astype(str).str.contains(q, case=False, na=False) \\
             | df['title'].astype(str).str.contains(q, case=False, na=False)
        df = df[mask]
    if language:
        df = df[df['language'] == language]
    if conf:
        df = df[df['confidentiality'] == conf]
    start = (page - 1) * page_size
    end = start + page_size
    recs = df.iloc[start:end].to_dict(orient='records')
    return recs

@app.get('/v1/corpus/{doc_id}', response_model=Doc)
async def get_doc(doc_id: str):
    row = DF.loc[DF['doc_id'] == doc_id]
    if row.empty:
        raise HTTPException(404, 'not found')
    return row.iloc[0].to_dict()
'''

with open('tools/corpus_api.py', 'w', encoding='utf-8') as f:
    f.write(api_code)
print("API file created: tools/corpus_api.py")
print("\nTo run the API, execute in a terminal from the tools folder (optionally in a virtual environment):")
print("  uvicorn corpus_api:app --reload --port 8000")
print("\nNote: The API now handles NaN values by converting them to empty strings")

API file created: tools/corpus_api.py

To run the API, execute in a terminal from the tools folder (optionally in a virtual environment):
  uvicorn corpus_api:app --reload --port 8000

Note: The API now handles NaN values by converting them to empty strings


## Part C — Ingest, Clean, Filter, De-duplicate

### C1. Load CSV and define cleaners

In [3]:
import pandas as pd
import re
import numpy as np
from datetime import datetime

raw = pd.read_csv('data/corpus_llm.csv')
print(f"Raw rows loaded: {len(raw)}")

# Basic cleaners
def normalize_ws(s) -> str:
    # pd.read_csv loads empty cells as NaN; treat those as empty text, not the string "nan"
    s = "" if pd.isna(s) else str(s)
    return re.sub(r"\s+", " ", s).strip()

def clean_row(row):
    row['title'] = normalize_ws(row.get('title',''))
    row['section'] = normalize_ws(row.get('section','')) or 'Overview'
    row['body_text'] = normalize_ws(row.get('body_text',''))
    return row

clean = raw.apply(clean_row, axis=1)
print(f"Rows after normalization: {len(clean)}")

Raw rows loaded: 1010
Rows after normalization: 1010


### C2. Governance & language filters + missing handling

In [4]:
# Keep only public + English for a typical RAG public index
flt = (clean['confidentiality'] == 'public') & (clean['language'] == 'en')
clean = clean.loc[flt].copy()
print(f"Rows after governance (public) and language (en) filtering: {len(clean)}")

# Drop rows with empty body; log counts
before = len(clean)
clean = clean[clean['body_text'].str.len() >= 30].copy()  # min length
dropped_empty = before - len(clean)
print(f'Dropped for empty/short body: {dropped_empty}')
print(f'Rows remaining: {len(clean)}')

Rows after governance (public) and language (en) filtering: 305
Dropped for empty/short body: 2
Rows remaining: 303


### C3. De-duplicate by content signature (keep latest update)

In [5]:
import hashlib

def content_key(row):
    sig = normalize_ws(row['title'] + ' ' + row['body_text']).lower()
    return hashlib.sha1(sig.encode('utf-8')).hexdigest()

clean['content_key'] = clean.apply(content_key, axis=1)
# Keep most recent updated_at per content_key
clean['updated_at'] = pd.to_datetime(clean['updated_at'])
clean.sort_values(['content_key','updated_at'], ascending=[True, False], inplace=True)

before_dedupe = len(clean)
clean = clean.drop_duplicates(subset=['content_key'], keep='first')
clean.drop(columns=['content_key'], inplace=True)
duplicates_removed = before_dedupe - len(clean)
print(f'Duplicates removed: {duplicates_removed}')
print(f'Rows after dedupe: {len(clean)}')

Duplicates removed: 8
Rows after dedupe: 295


### Summary of Cleaning Pipeline

In [6]:
print("=" * 60)
print("CLEANING PIPELINE SUMMARY")
print("=" * 60)
print(f"Starting rows: {len(raw)}")
print(f"After governance/language filter: {len(clean) + duplicates_removed + dropped_empty}")
print(f"After empty/short body removal: {len(clean) + duplicates_removed}")
print(f"After deduplication: {len(clean)}")
print(f"Total rows removed: {len(raw) - len(clean)}")
print("=" * 60)

CLEANING PIPELINE SUMMARY
Starting rows: 1010
After governance/language filter: 305
After empty/short body removal: 303
After deduplication: 295
Total rows removed: 715


### C4. Optional API ingestion (simulate service boundary)

**Note:** This section requires the API server from Part B to be running. If the server is unreachable, the cell reports the error and the lab continues with the CSV-based dataframe.

In [7]:
%pip install requests orjson

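# Requires the API from Part B running at http://127.0.0.1:8000.
# If the server is unreachable, the loop stops and the CSV-based dataframe is used instead.

import requests, time, orjson

API = 'http://127.0.0.1:8000'
page = 1; PAGE_SIZE = 100
MAX_RETRIES = 3  # bound retries so a persistent non-200 response cannot loop forever
retries = 0
api_rows = []
while True:
    try:
        r = requests.get(f"{API}/v1/corpus", params={'page': page, 'page_size': PAGE_SIZE, 'language':'en', 'conf':'public'}, timeout=5)
        if r.status_code != 200:
            retries += 1
            if retries > MAX_RETRIES:
                print(f"Giving up after {MAX_RETRIES} retries (HTTP {r.status_code})")
                break
            time.sleep(1); continue
        retries = 0  # reset after a successful page
        batch = r.json()
        if not batch:
            break  # an empty page means the corpus is exhausted
        api_rows.extend(batch)
        page += 1
    except Exception as e:
        print(f"API not available: {e}")
        break
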
if api_rows:
    api_df = pd.DataFrame(api_rows)
    print('API fetched rows:', len(api_df))
else:
    print("API ingestion skipped - using CSV-based clean dataframe")

print("API ingestion code ready")
print("For this solution, we'll continue with the CSV-based clean dataframe")


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.2[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.
API fetched rows: 305
API ingestion code ready
For this solution, we'll continue with the CSV-based clean dataframe


## Part D — Write JSONL (RAG Chunks + SFT/Eval)

### D1. Chunking helper (overlap to preserve context)

In [8]:
import re

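def split_chunks(text: str, max_chars=900, overlap=150):
    # overlap must be smaller than max_chars, otherwise the window never advances
    if not 0 <= overlap < max_chars:
        raise ValueError("overlap must satisfy 0 <= overlap < max_chars")
    text = re.sub(r"\s+", " ", str(text)).strip()
    chunks = []
    i = 0
    while i < len(text):
        end = min(i + max_chars, len(text))
        chunks.append(text[i:end])
        if end == len(text):
            break
        i = end - overlap  # step back so consecutive chunks share `overlap` characters
    return chunks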

# Test the chunking function
test_text = "This is a test sentence. " * 100
test_chunks = split_chunks(test_text, max_chars=200, overlap=50)
print(f"Test text length: {len(test_text)} characters")
print(f"Number of chunks created: {len(test_chunks)}")
print(f"First chunk length: {len(test_chunks[0])}")

Test text length: 2500 characters
Number of chunks created: 17
First chunk length: 200


### D2. CSV→RAG JSONL with provenance

In [9]:
import orjson
from datetime import timezone
from pathlib import Path

rag_path = Path('artifacts/jsonl/rag_chunks_from_csv.jsonl')
chunk_count = 0
doc_count = 0

with rag_path.open('w', encoding='utf-8') as f:
    for _, r in clean.iterrows():
        chunks = split_chunks(r['body_text'])
        doc_count += 1
        for j, ch in enumerate(chunks):
            rec = {
                'doc_id': r['doc_id'],
                'chunk_id': f"{r['doc_id']}-{j:04d}",
                'text': ch,
                'metadata': {
                    'title': r['title'], 
                    'section': r['section'], 
                    'tags': r['tags'],
                    'source_url': r['source_url'], 
                    'language': r['language'],
                    'confidentiality': r['confidentiality'], 
                    'schema_version': 'rag-chunk-v1'
                }
            }
            f.write(orjson.dumps(rec).decode() + '\n')
            chunk_count += 1

print(f"RAG JSONL created: {rag_path}")
print(f"Documents processed: {doc_count}")
print(f"Total chunks created: {chunk_count}")

RAG JSONL created: artifacts/jsonl/rag_chunks_from_csv.jsonl
Documents processed: 295
Total chunks created: 295


### D3. Build an SFT/Eval JSONL view (instruction → output)

In [10]:
import numpy as np
np.random.seed(0)

sample = clean.sample(min(120, len(clean)), random_state=7)
sft_path = Path('artifacts/jsonl/corpus_sft.jsonl')
sft_count = 0

with sft_path.open('w', encoding='utf-8') as f:
    for _, r in sample.iterrows():
        prompt = f"Summarize the key steps from: {r['title']} ({r['section']})."
        # Simple templated target; in real life, use human-authored or heuristic extraction
        target = "Key steps: enable SAML; map claims; verify time sync; check audience URI; review settings."
        obj = {
            'input': prompt, 
            'output': target,
            'metadata': {
                'doc_id': r['doc_id'], 
                'type': r['type'], 
                'lang': r['language']
            }
        }
        f.write(orjson.dumps(obj).decode() + '\n')
        sft_count += 1

print(f"SFT/Eval JSONL created: {sft_path}")
print(f"Total instruction-output pairs: {sft_count}")

SFT/Eval JSONL created: artifacts/jsonl/corpus_sft.jsonl
Total instruction-output pairs: 120


### D4. Validate JSONL & write reviewer samples

In [11]:
import json
import itertools

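def validate_jsonl(path):
    """Count total and invalid lines; a valid line parses as a JSON object."""
    bad = 0
    total = 0
    errors = []
    with open(path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f, 1):
            total += 1
            try:
                obj = json.loads(line)
                # Explicit check instead of assert, which is skipped under `python -O`
                if not isinstance(obj, dict):
                    raise TypeError(f"expected a JSON object, got {type(obj).__name__}")
            except Exception as e:
                bad += 1
                if len(errors) < 5:  # Keep first 5 errors
                    errors.append(f"Line {i}: {str(e)}")
    return total, bad, errors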

print("=" * 60)
print("JSONL VALIDATION RESULTS")
print("=" * 60)

for p in ['artifacts/jsonl/rag_chunks_from_csv.jsonl','artifacts/jsonl/corpus_sft.jsonl']:
    t, b, errs = validate_jsonl(p)
    print(f"\nFile: {p}")
    print(f"  Total lines: {t}")
    print(f"  Bad lines: {b}")
    if b > 0:
        print(f"  Sample errors:")
        for err in errs:
            print(f"    {err}")
    else:
        print(f"  ✓ All lines valid!")

print("\n" + "=" * 60)

JSONL VALIDATION RESULTS

File: artifacts/jsonl/rag_chunks_from_csv.jsonl
  Total lines: 295
  Bad lines: 0
  ✓ All lines valid!

File: artifacts/jsonl/corpus_sft.jsonl
  Total lines: 120
  Bad lines: 0
  ✓ All lines valid!
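
The validator above only confirms that each line parses as a JSON object. A possible next step is a field-level check for RAG records. The sketch below is illustrative rather than part of the lab solution: the required-field sets are assumptions derived from the record layout written in D2, and `check_rag_record` is a hypothetical helper name.

```python
import json

# Assumed required fields, based on the RAG record layout written in D2
RAG_REQUIRED = {"doc_id", "chunk_id", "text", "metadata"}
RAG_META_REQUIRED = {"source_url", "language", "confidentiality", "schema_version"}

def check_rag_record(line):
    """Return a list of problems for one JSONL line (an empty list means OK)."""
    try:
        obj = json.loads(line)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e.msg}"]
    if not isinstance(obj, dict):
        return ["top-level value is not an object"]
    problems = [f"missing field: {k}" for k in sorted(RAG_REQUIRED - set(obj))]
    meta = obj.get("metadata")
    if isinstance(meta, dict):
        problems += [f"missing metadata field: {k}"
                     for k in sorted(RAG_META_REQUIRED - set(meta))]
    elif "metadata" in obj:
        problems.append("metadata is not an object")
    if "text" in obj and not str(obj["text"]).strip():
        problems.append("empty text")
    return problems

# Hypothetical sample lines: one complete record and one incomplete record
good_line = json.dumps({
    "doc_id": "DOC-0001", "chunk_id": "DOC-0001-0000", "text": "hello",
    "metadata": {"source_url": "https://example.local/faq/doc-0001",
                 "language": "en", "confidentiality": "public",
                 "schema_version": "rag-chunk-v1"},
})
bad_line = json.dumps({"doc_id": "DOC-0002", "text": " ", "metadata": {"language": "en"}})

good_problems = check_rag_record(good_line)
bad_problems = check_rag_record(bad_line)
print(good_problems)
print(bad_problems)
```

Returning a list of problems instead of raising on the first one lets a reviewer see everything wrong with a record in a single pass.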



In [12]:
# Create small reviewer sample
sample_path = Path('artifacts/samples/jsonl_samples.jsonl')
with sample_path.open('w', encoding='utf-8') as out:
    for p in ['artifacts/jsonl/rag_chunks_from_csv.jsonl','artifacts/jsonl/corpus_sft.jsonl']:
        out.write(f"# Samples from {p}\n")
        with open(p, 'r', encoding='utf-8') as f:
            for line in itertools.islice(f, 3):
                out.write(line)
        out.write("\n")

print(f"Reviewer samples created: {sample_path}")
print("\nShowing first 3 lines from each JSONL file:")
print("=" * 60)

with sample_path.open('r', encoding='utf-8') as f:
    content = f.read()
    print(content[:2000])  # Show first 2000 chars
    if len(content) > 2000:
        print("... (truncated)")

Reviewer samples created: artifacts/samples/jsonl_samples.jsonl

Showing first 3 lines from each JSONL file:
# Samples from artifacts/jsonl/rag_chunks_from_csv.jsonl
{"doc_id":"DOC-0879","chunk_id":"DOC-0879-0000","text":"This article explains how to configure single sign-on with step-by-step instructions. Use the admin console to enable SAML and verify claim mappings. Common pitfalls include clock skew and incorrect audience URIs. This article explains how to configure single sign-on with step-by-step instructions. Use the admin console to enable SAML and verify claim mappings. Common pitfalls include clock skew and incorrect audience URIs. This article explains how to configure single sign-on with step-by-step instructions. Use the admin console to enable SAML and verify claim mappings. Common pitfalls include clock skew and incorrect audience URIs. Additional details about export and rate_limits.","metadata":{"title":"FAQ: Billing","section":"Overview","tags":"privacy,compliance,rat

## Part E — Wrap-Up

### Analysis Questions

#### 1. Three reasons JSONL is preferred vs a single JSON array for LLM corpora

1. **Streaming & Memory Efficiency**: JSONL allows line-by-line processing without loading the entire dataset into memory. This is critical for large corpora (GB-TB scale) that would otherwise cause OOM errors with a single JSON array.

2. **Append-Friendly & Incremental Updates**: New records can be appended to JSONL files without parsing/rewriting the entire file. This enables continuous ingestion pipelines and incremental updates to training data.

3. **Recovery & Parallelization**: Each line is a self-contained JSON object, making it easy to recover from partial failures (skip corrupted lines), parallelize processing with map-reduce frameworks, and distribute chunks across workers without complex parsing logic.
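
The three properties above can be illustrated with a small sketch. It assumes a throwaway demo file in a temporary directory; `iter_jsonl` and the sample records are hypothetical names introduced for illustration, not part of the lab code.

```python
import json
import tempfile
from pathlib import Path

def iter_jsonl(path):
    """Yield (line_number, record) for each valid JSON object line; skip bad lines."""
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, 1):  # one line in memory at a time
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                rec = json.loads(line)
            except json.JSONDecodeError:
                continue  # a corrupted line does not invalidate the rest of the file
            if isinstance(rec, dict):
                yield line_no, rec

# Hypothetical demo file with one truncated line in the middle
demo_path = Path(tempfile.mkdtemp()) / "demo.jsonl"
with demo_path.open("w", encoding="utf-8") as f:
    f.write('{"doc_id": "DOC-0001", "text": "first"}\n')
    f.write('{"doc_id": "DOC-0002", "text": \n')  # corrupted record
    f.write('{"doc_id": "DOC-0003", "text": "third"}\n')

# Append a new record without parsing or rewriting the existing lines
with demo_path.open("a", encoding="utf-8") as f:
    f.write(json.dumps({"doc_id": "DOC-0004", "text": "appended"}) + "\n")

recovered = [rec["doc_id"] for _, rec in iter_jsonl(demo_path)]
print(recovered)  # ['DOC-0001', 'DOC-0003', 'DOC-0004']
```

With a single JSON array, the truncated record would make the whole file unparseable, and the append would require rewriting the closing bracket.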

#### 2. Governance + Language filters applied

**Filters Applied:**
- **Governance Filter**: `confidentiality == 'public'` - Removed all internal/confidential documents to ensure only public-facing content enters the RAG index
- **Language Filter**: `language == 'en'` - Kept only English documents for a monolingual RAG system

**Stage:** Applied in Part C2, after normalization (C1) and before the empty/short-body drop (also C2) and deduplication (C3). This order is efficient because:
- Normalizing first keeps field values consistent before any filtering
- Filtering early reduces the work for subsequent steps
- Deduplicating after filtering matters because the governance/language filters change the candidate set

#### 3. Deduplication Strategy

**Method:** Content-based deduplication using SHA-1 hash of normalized `title + body_text`

**Key Used:** `content_key = sha1(normalize(title + ' ' + body_text).lower())`

**Why This Approach:**
- **Content-focused**: Using `doc_id` alone would miss records with identical content but different IDs (like our deliberate `DOC-DUP-*` entries)
- **Normalization**: Lowercasing and whitespace normalization ensure that minor formatting differences don't hide duplicates (false negatives)
- **Latest-wins**: When duplicates are found, we keep the record with the most recent `updated_at` timestamp, preserving the freshest content
- **Efficiency**: SHA-1 is fast and has negligible accidental-collision risk at this corpus size; it serves as a content fingerprint here, not a security control
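
As a rough illustration of the key and the latest-wins rule, the sketch below applies the same signature to three hypothetical records in plain Python (no pandas). The records and the `latest` dictionary are made up for the example; the lab itself performs this step with `sort_values` and `drop_duplicates`.

```python
import hashlib
import re

def normalize_ws(s):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", str(s)).strip()

def content_key(title, body_text):
    """SHA-1 of the normalized, lowercased title + body."""
    sig = normalize_ws(title + " " + body_text).lower()
    return hashlib.sha1(sig.encode("utf-8")).hexdigest()

# Hypothetical records: the first two differ only in ID, casing, spacing, and date
records = [
    {"doc_id": "DOC-0006", "title": "FAQ: Billing",
     "body_text": "Use the admin console.", "updated_at": "2025-01-12"},
    {"doc_id": "DOC-DUP-05", "title": "faq:  billing",
     "body_text": "Use the  admin console. ", "updated_at": "2025-02-02"},
    {"doc_id": "DOC-0007", "title": "FAQ: Exports",
     "body_text": "Use the admin console.", "updated_at": "2025-01-20"},
]

latest = {}
for rec in records:
    key = content_key(rec["title"], rec["body_text"])
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings
    if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
        latest[key] = rec

kept = sorted(r["doc_id"] for r in latest.values())
print(kept)  # ['DOC-0007', 'DOC-DUP-05']
```

The first two records collapse to one key, and the copy with the later `updated_at` survives. A different title produces a different key even when the body is identical.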

#### 4. Essential Metadata: RAG vs SFT

**RAG-Essential Metadata:**
- `doc_id` & `chunk_id`: Unique identification for retrieval and citation
- `title`, `section`, `tags`: Semantic filtering and ranking (e.g., "only return security-related chunks")
- `source_url`: Provenance for user citations and verification
- `language`, `confidentiality`: Runtime filtering for multi-tenant or multi-lingual systems
- `schema_version`: Schema evolution tracking

**SFT/Eval-Essential Metadata:**
- `doc_id`: Traceability back to source material
- `type`: Document type helps balance training distribution (e.g., ensure mix of help_articles, policies, FAQs)
- `lang`: Language-specific fine-tuning or eval splits
- Optional: `created_at`/`updated_at` for temporal splits (train on older data, eval on recent)

**Key Difference:** RAG needs rich filtering metadata for retrieval-time decisions, while SFT needs distribution/provenance metadata for training quality and reproducibility.
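
To make the retrieval-time use of RAG metadata concrete, here is a minimal sketch of a metadata filter over chunk records. It assumes `tags` is stored as a comma-separated string, as in the records written in D2; the `matches` helper and the sample chunks are hypothetical.

```python
def matches(chunk, *, tag=None, language=None, confidentiality=None):
    """Return True if the chunk's metadata satisfies every filter that was given."""
    meta = chunk.get("metadata", {})
    # tags are assumed to be a comma-separated string, e.g. "sso,security"
    tags = [t.strip() for t in meta.get("tags", "").split(",") if t.strip()]
    if tag is not None and tag not in tags:
        return False
    if language is not None and meta.get("language") != language:
        return False
    if confidentiality is not None and meta.get("confidentiality") != confidentiality:
        return False
    return True

# Hypothetical chunk records (text omitted for brevity)
chunks = [
    {"chunk_id": "DOC-0001-0000",
     "metadata": {"tags": "sso,security", "language": "en", "confidentiality": "public"}},
    {"chunk_id": "DOC-0002-0000",
     "metadata": {"tags": "billing,export", "language": "en", "confidentiality": "public"}},
    {"chunk_id": "DOC-0003-0000",
     "metadata": {"tags": "security,api", "language": "de", "confidentiality": "public"}},
    {"chunk_id": "DOC-0004-0000",
     "metadata": {"tags": "security", "language": "en", "confidentiality": "internal"}},
]

hits = [c["chunk_id"] for c in chunks
        if matches(c, tag="security", language="en", confidentiality="public")]
print(hits)  # ['DOC-0001-0000']
```

In a real system this filter would usually be pushed down into the vector store's metadata query rather than applied in Python after retrieval.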

## Final Outputs Summary

In [13]:
from pathlib import Path

print("=" * 60)
print("FINAL OUTPUTS SUMMARY")
print("=" * 60)

outputs = [
    ('data/corpus_llm.csv', 'Source CSV corpus'),
    ('tools/corpus_api.py', 'FastAPI corpus server'),
    ('artifacts/jsonl/rag_chunks_from_csv.jsonl', 'RAG chunks JSONL'),
    ('artifacts/jsonl/corpus_sft.jsonl', 'SFT/Eval JSONL'),
    ('artifacts/samples/jsonl_samples.jsonl', 'Reviewer samples')
]

for path, desc in outputs:
    p = Path(path)
    if p.exists():
        if p.is_file():
            size = p.stat().st_size
            print(f"✓ {desc}")
            print(f"  Path: {path}")
            print(f"  Size: {size:,} bytes")
        else:
            print(f"✓ {desc}")
            print(f"  Path: {path}")
    else:
        print(f"✗ {desc} - NOT FOUND")
        print(f"  Expected: {path}")
    print()

print("=" * 60)
print("Lab 13 Complete!")
print("=" * 60)

FINAL OUTPUTS SUMMARY
✓ Source CSV corpus
  Path: data/corpus_llm.csv
  Size: 635,990 bytes

✓ FastAPI corpus server
  Path: tools/corpus_api.py
  Size: 1,792 bytes

✓ RAG chunks JSONL
  Path: artifacts/jsonl/rag_chunks_from_csv.jsonl
  Size: 224,444 bytes

✓ SFT/Eval JSONL
  Path: artifacts/jsonl/corpus_sft.jsonl
  Size: 29,315 bytes

✓ Reviewer samples
  Path: artifacts/samples/jsonl_samples.jsonl
  Size: 3,338 bytes

Lab 13 Complete!


## Bonus: Gzip Compression (Optional)

For production use, compress JSONL files to save storage and bandwidth.

In [14]:
import gzip

# Compress the RAG chunks JSONL
rag_jsonl = 'artifacts/jsonl/rag_chunks_from_csv.jsonl'
rag_gz = 'artifacts/jsonl/rag_chunks_from_csv.jsonl.gz'

with open(rag_jsonl, 'r', encoding='utf-8') as src, \
     gzip.open(rag_gz, 'wt', encoding='utf-8') as dst:
    for line in src: 
        dst.write(line)

original_size = Path(rag_jsonl).stat().st_size
compressed_size = Path(rag_gz).stat().st_size
space_savings = (1 - compressed_size / original_size) * 100  # fraction of bytes saved

print(f"Original size: {original_size:,} bytes")
print(f"Compressed size: {compressed_size:,} bytes")
print(f"Space savings: {space_savings:.1f}%")
print(f"Compressed file: {rag_gz}")

Original size: 224,444 bytes
Compressed size: 8,793 bytes
Space savings: 96.1%
Compressed file: artifacts/jsonl/rag_chunks_from_csv.jsonl.gz
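
A compressed JSONL file can still be consumed one record at a time, because `gzip.open` in text mode decompresses as it reads. The sketch below uses a small hypothetical file in a temporary directory rather than the lab artifact; `iter_jsonl_gz` is an illustrative helper name.

```python
import gzip
import json
import tempfile
from pathlib import Path

def iter_jsonl_gz(path):
    """Stream records from a gzip-compressed JSONL file, one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Hypothetical demo file: three small chunk records
demo_gz = Path(tempfile.mkdtemp()) / "demo.jsonl.gz"
with gzip.open(demo_gz, "wt", encoding="utf-8") as f:
    for i in range(3):
        rec = {"chunk_id": f"DOC-0001-{i:04d}", "text": f"chunk {i}"}
        f.write(json.dumps(rec) + "\n")

ids = [rec["chunk_id"] for rec in iter_jsonl_gz(demo_gz)]
print(ids)  # ['DOC-0001-0000', 'DOC-0001-0001', 'DOC-0001-0002']
```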
