## How to Set the File Path and Page Range for Analysis

This notebook is designed so that **only one small section ever needs to be edited** when analyzing a new document or a different part of the same document. All other phases should remain unchanged.

Follow the steps below **before running any other cells**.

---

### Step 1: Locate **PHASE 0 — USER-DEFINED CONFIGURATION**

At the very top of the notebook is a code cell whose banner comment carries this phase name.


This is the **only place** where you should declare:
- The document being analyzed
- The page numbers to be analyzed
- The section label used for traceability

Do **not** edit any other phases unless you are intentionally changing methodology.

---

### Step 2: Declare the PDF File Path

Inside **PHASE 0**, find the line that begins with `PDF_PATH =`.

Replace `"input_document.pdf"` with the path to your file, keeping the quotation marks.

The path can take any of these forms:

- A bare filename, for a file in the same folder as the notebook
- A relative path, for a file in a subfolder
- An absolute path

The notebook will not search for files automatically. The path must be explicit.
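
As a minimal sketch of the three path forms, the values below show what `PDF_PATH` might look like. Only `input_document.pdf` comes from this notebook; the folder names are hypothetical. The helper `resolve_pdf_path` is also an assumption, not part of the notebook: it shows one way to confirm that an explicit path points at a real file before the remaining phases run.

```python
from pathlib import Path

# Illustrative values for PDF_PATH (folder names are hypothetical):
EXAMPLE_PATHS = [
    "input_document.pdf",                  # same folder as the notebook
    "data/input_document.pdf",             # subfolder (hypothetical name)
    "/home/user/docs/input_document.pdf",  # absolute path (hypothetical)
]


def resolve_pdf_path(path_str):
    """Return the absolute Path for path_str, or raise if no such file exists."""
    path = Path(path_str).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"PDF not found: {path.resolve()}")
    return path.resolve()
```

Calling `resolve_pdf_path(PDF_PATH)` directly after PHASE 0 would surface a mistyped path immediately, rather than as an error inside the extraction phase.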

---

### Step 3: Declare the Page Range to Be Analyzed

Still within **PHASE 0**, locate the page range variables `PAGE_RANGE_START` and `PAGE_RANGE_END`.

Update these values to match the pages you want analyzed.

Important notes:
- Page numbers refer to **PDF page numbers** (the position of the page in the file, counting from 1), not the page or chapter numbers printed in the document.
- The range is **inclusive**. Both the start and end pages will be analyzed.
- Pages outside this range will never be read or processed.
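
The inclusive, 1-based range maps onto zero-based page indices as sketched below. The function name `page_indices` and the `total_pages` argument are illustrative assumptions; the arithmetic mirrors the `range(PAGE_RANGE_START - 1, PAGE_RANGE_END)` loop used in Phase 1.

```python
def page_indices(start_page, end_page, total_pages):
    """Map an inclusive 1-based page range to zero-based indices."""
    if not (1 <= start_page <= end_page <= total_pages):
        raise ValueError(
            f"Invalid range {start_page}-{end_page} for {total_pages} pages."
        )
    # Subtract 1 from the start only: range() already excludes its stop
    # value, so the end page itself is included.
    return list(range(start_page - 1, end_page))


print(page_indices(3, 5, 10))  # [2, 3, 4] -> PDF pages 3, 4, and 5
print(page_indices(7, 7, 10))  # [6]       -> a single page
```

A range of 3 to 5 therefore covers three pages, not two.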





---

### Step 4: Declare a Section Label (Required)

The section label is used to tag every extracted claim for traceability.

Find the line that begins with `SECTION_LABEL =`.

Replace its value with a meaningful label, enclosed in double quotes (`""`), for the portion of the document being analyzed.


This label will appear in the claims table and downstream exports.

---

### Step 5: Save and Run the Notebook from the Top

After updating **PHASE 0**:

1. Save the notebook
2. Restart the kernel (recommended)
3. Run all cells from top to bottom

This ensures:
- No stale variables remain in memory
- Page scoping is enforced correctly
- Claims are extracted only from the declared range

---

### Important Methodological Reminder

Do not move, duplicate, or redefine the configuration variables elsewhere in the notebook.

All downstream phases **assume** that:
- The file path was declared once
- The page range was declared once
- The scope of analysis is fixed before extraction begins

This design choice is intentional and supports reproducibility, auditability, and peer review.


In [1]:
# =====================================
# PHASE 0 — USER-DEFINED CONFIGURATION
# =====================================

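PDF_PATH = "input_document.pdf"  # default named in Step 2; replace with your file path

PAGE_RANGE_START = 1  # placeholder: first PDF page to analyze (1-based, inclusive)
PAGE_RANGE_END   = 1  # placeholder: last PDF page to analyze (1-based, inclusive)

SECTION_LABEL = "REPLACE_WITH_SECTION_LABEL"  # placeholder: label attached to every claim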

EXCLUSION_KEYWORDS = [
    "notes",
    "endnotes",
    "footnotes",
    "references",
    "bibliography",
    "works cited",
    "acknowledgements"
]

MIN_SENTENCE_LENGTH = 30
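
A short pre-flight check can catch an incomplete configuration before any extraction starts. The sketch below is an assumption, not part of the notebook: `validate_config` is a hypothetical helper that returns a list of human-readable problems for the four user-defined values.

```python
def validate_config(pdf_path, start, end, label):
    """Return a list of problems found in the Phase 0 values (empty if none)."""
    problems = []

    if not isinstance(pdf_path, str) or not pdf_path.strip():
        problems.append("PDF_PATH must be a non-empty string.")

    # bool is a subclass of int, so reject it explicitly
    both_ints = all(
        isinstance(v, int) and not isinstance(v, bool) for v in (start, end)
    )
    if not both_ints:
        problems.append("PAGE_RANGE_START and PAGE_RANGE_END must be integers.")
    elif start < 1 or end < start:
        problems.append("Page range must satisfy 1 <= start <= end.")

    if not isinstance(label, str) or not label.strip():
        problems.append("SECTION_LABEL must be a non-empty string.")

    return problems


print(validate_config("input_document.pdf", 1, 3, "Chapter 1"))  # []
```

One possible use is to call `validate_config(PDF_PATH, PAGE_RANGE_START, PAGE_RANGE_END, SECTION_LABEL)` at the end of the configuration cell and raise a `ValueError` if the returned list is not empty.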


In [2]:
# =====================================
# PHASE 1 — PAGE-SCOPED TEXT EXTRACTION
# =====================================

from PyPDF2 import PdfReader

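reader = PdfReader(PDF_PATH)

# Fail early with a clear message if the declared range does not fit the document
total_pages = len(reader.pages)
if not (1 <= PAGE_RANGE_START <= PAGE_RANGE_END <= total_pages):
    raise ValueError(
        f"Page range {PAGE_RANGE_START}-{PAGE_RANGE_END} is invalid "
        f"for a document with {total_pages} pages."
    )

page_text_blocks = []

# PAGE_RANGE_* are 1-based and inclusive; reader.pages is 0-based
for page_index in range(PAGE_RANGE_START - 1, PAGE_RANGE_END):
    page = reader.pages[page_index]
    text = page.extract_text()

    # Keep only pages that yield non-empty text after stripping whitespace
    if text and text.strip():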
        page_text_blocks.append({
            "page": page_index + 1,
            "raw_text": text
        })

if not page_text_blocks:
    raise ValueError("No text extracted from specified page range.")


In [3]:
# =====================================
# PHASE 2 — TEXT NORMALIZATION
# =====================================

import re

def normalize_text(text):
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

for block in page_text_blocks:
    block["normalized_text"] = normalize_text(block["raw_text"])
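
To make the effect of normalization concrete, the sketch below applies the same `normalize_text` function to two made-up strings. The sample strings are assumptions chosen for illustration.

```python
import re


def normalize_text(text):
    # Same rule as Phase 2: collapse every whitespace run to one space
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


print(normalize_text("Dietary  Guidelines\nfor Americans\t2025 "))
# Dietary Guidelines for Americans 2025

# A word hyphenated across a line break is not rejoined:
print(normalize_text("nutri-\ntion"))
# nutri- tion
```

Line breaks, tabs, and repeated spaces all become single spaces, so paragraph structure is not preserved in `normalized_text`.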


In [4]:
# =====================================
# PHASE 3 — EXCLUSION SECTION DETECTION
# =====================================

def detect_excluded_section(text, keywords):
    text_lower = text.lower()
    return any(keyword in text_lower for keyword in keywords)

for block in page_text_blocks:
    block["is_excluded"] = detect_excluded_section(
        block["normalized_text"],
        EXCLUSION_KEYWORDS
    )
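
Because the Phase 3 check is a plain substring test over the whole page, a keyword that appears anywhere in body text excludes the entire page. The sketch below reproduces that behavior on made-up sample text and contrasts it with a stricter variant. `starts_with_keyword` is a hypothetical alternative, not the notebook's method, and it would miss headings that do not open the page.

```python
import re


def detect_excluded_section(text, keywords):
    # Same substring test as Phase 3
    text_lower = text.lower()
    return any(keyword in text_lower for keyword in keywords)


def starts_with_keyword(text, keywords):
    """Hypothetical variant: match only when the page opens with a keyword."""
    text_lower = text.lower().lstrip()
    return any(re.match(rf"{re.escape(k)}\b", text_lower) for k in keywords)


keywords = ["notes", "references"]
body_page = "The committee notes that added sugar intake remains high."
reference_page = "References 1. Smith J. Dietary patterns and health."

print(detect_excluded_section(body_page, keywords))       # True
print(starts_with_keyword(body_page, keywords))           # False
print(detect_excluded_section(reference_page, keywords))  # True
print(starts_with_keyword(reference_page, keywords))      # True
```

The Phase 6 audit log is the place to check whether pages were excluded because of an incidental keyword rather than a section heading.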


In [5]:
# =====================================
# PHASE 4 — SENTENCE-LEVEL CLAIM EXTRACTION
# =====================================

import uuid
import re

def is_visual_content(sentence):
    """
    Detect and exclude visual content markers:
    - Table indicators (Table N:, borders, pipes)
    - Chart/Graph indicators (Chart N:, Figure N:, percentage patterns)
    - Special characters associated with visual content
    - Lines that are primarily numbers or percentages
    """
    text = sentence.strip().lower()
    
    # PATTERN 1: Visual element labels
    visual_labels = [
        r'^\s*(table|tbl)\.?\s+\d+:',      # Table 1:, Tbl. 1: (trailing colon required)
        r'^\s*(figure|fig)\.?\s+\d+:',     # Figure 1:, Fig. 1:
        r'^\s*(chart|graph|diagram)\.?\s+\d+:', # Chart 1:, Graph 2:, Diagram 3:
        r'^\s*(exhibit|appendix)\.?\s+[a-z0-9]:',  # Exhibit A:, Appendix 1: (single character only)
        r'^\s*source:',                      # Source: [attribution]
    ]
    
    for pattern in visual_labels:
        if re.search(pattern, text):
            return True
    
    # PATTERN 2: Table formatting characters
    # Common table border/structure characters
    table_chars = ['│', '┤', '├', '┼', '╞', '╡', '═', '║', '┌', '┐', '└', '┘', '─', '┬', '┴']
    if any(char in sentence for char in table_chars):
        return True
    
    # PATTERN 3: Heavy use of pipes (column separators)
    if sentence.count('|') >= 2:
        return True
    
    # PATTERN 4: Percentage-heavy content (likely chart axis labels)
    # Lines with 3+ percentage signs and very few words
    percent_count = sentence.count('%')
    word_count = len(sentence.split())
    if percent_count >= 3 and word_count < 15:
        return True
    
    # PATTERN 5: Legend-style content
    # Patterns like "■ Category A  ■ Category B"
    legend_chars = ['■', '●', '▲', '▼', '◆', '★', '□', '○', '△', '▽', '◇', '☆', '█', '▓', '▒']
    if any(char in sentence for char in legend_chars):
        return True
    
    # PATTERN 6: Very short all-caps labels (often table headers)
    # Only spans of three words or fewer are excluded, so longer
    # uppercase phrases in normal text are kept
    words = sentence.split()
    if len(words) <= 3 and sentence.isupper() and len(sentence) > 3:
        return True
    
    return False

claims = []

for block in page_text_blocks:
    if block["is_excluded"]:
        continue

    sentences = re.split(
        r'(?<=[.!?])\s+',
        block["normalized_text"]
    )

    for sentence in sentences:
        sentence = sentence.strip()

        # Check minimum length
        if len(sentence) < MIN_SENTENCE_LENGTH:
            continue
        
        # Check for visual content
        if is_visual_content(sentence):
            continue

        claims.append({
            "claim_id": f"C{str(uuid.uuid4())[:8]}",
            "claim_text": sentence,
            "page": block["page"],
            "section": SECTION_LABEL,
            "excluded": False
        })

if not claims:
    raise ValueError("No claims extracted. Review page range or filters.")
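
The sentence splitter breaks after every `.`, `!`, or `?` that is followed by whitespace, so abbreviations produce short fragments that the minimum-length filter then drops. The sketch below shows this on a made-up passage. `split_and_filter` is an illustrative helper that combines the Phase 4 regex with the `MIN_SENTENCE_LENGTH` check and omits the visual-content filter.

```python
import re

SENTENCE_BOUNDARY = r'(?<=[.!?])\s+'   # same pattern as Phase 4
MIN_SENTENCE_LENGTH = 30               # same value as Phase 0


def split_and_filter(text, min_length=MIN_SENTENCE_LENGTH):
    """Split on sentence boundaries, then drop fragments below min_length."""
    sentences = [s.strip() for s in re.split(SENTENCE_BOUNDARY, text)]
    return [s for s in sentences if len(s) >= min_length]


sample = (
    "Dr. Lee reviewed the evidence on added sugars. See p. 4. "
    "The panel found consistent results across cohorts."
)

for sentence in split_and_filter(sample):
    print(sentence)
# Lee reviewed the evidence on added sugars.
# The panel found consistent results across cohorts.
```

The fragments `Dr.`, `See p.`, and `4.` fall below 30 characters and are discarded, which also means the first claim loses its leading `Dr.`.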


In [6]:
# =====================================
# PHASE 5 — CLAIMS TABLE DECLARATION
# =====================================

import pandas as pd

PHASE5_OUTPUT = pd.DataFrame(claims)

df_claims = PHASE5_OUTPUT.copy()

if "claim_text" not in df_claims.columns:
    raise RuntimeError("Claim text column missing from claims table.")


In [7]:
# =====================================
# PHASE 6 — EXCLUSION AUDIT LOG
# =====================================

exclusion_log = []

for block in page_text_blocks:
    if block["is_excluded"]:
        matched = [
            k for k in EXCLUSION_KEYWORDS
            if k in block["normalized_text"].lower()
        ]

        exclusion_log.append({
            "page": block["page"],
            "matched_keywords": matched,
            "section": SECTION_LABEL
        })

PHASE6_EXCLUSIONS = pd.DataFrame(exclusion_log)


In [8]:
# =====================================
# PHASE 7 — INITIAL CLAIM CLASSIFICATION
# =====================================

def classify_claim_type(text):
    t = text.lower()

    if any(w in t for w in ["should", "must", "ought", "policy", "recommend"]):
        return "Policy-prescriptive"

    if any(w in t for w in ["will", "would", "expected to", "likely"]):
        return "Predictive"

    if any(w in t for w in ["because", "leads to", "results in", "causes"]):
        return "Causal"

    return "Descriptive"


# Materialize claim_type (non-optional)
df_claims["claim_type"] = df_claims["claim_text"].apply(classify_claim_type)

# Enforce total classification (Appendix A.4)
if df_claims["claim_type"].isnull().any():
    raise RuntimeError(
        "Phase 7 failed: claim_type contains null values. "
        "All claims must receive a classification or be marked 'Unspecified'."
    )
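
Two properties of this classifier are easy to miss: the first matching rule wins, and keywords are matched as substrings. The sketch below demonstrates both on made-up sentences. `has_whole_word` is a hypothetical helper shown only for contrast; it is not used anywhere in the notebook.

```python
import re


def classify_claim_type(text):
    # Same rules and order as Phase 7
    t = text.lower()
    if any(w in t for w in ["should", "must", "ought", "policy", "recommend"]):
        return "Policy-prescriptive"
    if any(w in t for w in ["will", "would", "expected to", "likely"]):
        return "Predictive"
    if any(w in t for w in ["because", "leads to", "results in", "causes"]):
        return "Causal"
    return "Descriptive"


def has_whole_word(text, terms):
    """Hypothetical stricter test: match terms only at word boundaries."""
    return any(re.search(rf"\b{re.escape(t)}\b", text.lower()) for t in terms)


# Rule order: "should" is checked before "will" and "because"
print(classify_claim_type("Agencies should act because costs will rise."))
# Policy-prescriptive

# Substring match: "will" is found inside "goodwill"
print(classify_claim_type("Goodwill is recorded as an intangible asset."))
# Predictive

print(has_whole_word("Goodwill is recorded as an intangible asset.", ["will"]))
# False
```

Whole-word matching has its own cost: a term such as `recommend` would no longer match `recommends` or `recommended` unless each inflection were listed.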


In [9]:
# =====================================
# PHASE 7.5 — DISAMBIGUATION & AMBIGUITY FLAGGING
# =====================================

def detect_structural_signals(text):
    t = text.lower()

    signals = {
        "normative": any(w in t for w in ["should", "must", "ought", "important"]),
        "policy": any(w in t for w in ["policy", "regulation", "law"]),
        "causal": any(w in t for w in ["because", "leads to", "results in", "causes"]),
        "predictive": any(w in t for w in ["will", "would", "likely", "expected to"]),
        "descriptive": any(w in t for w in ["is", "are", "was", "were"])
    }

    return [k for k, v in signals.items() if v]
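
The cell above defines `detect_structural_signals` but does not apply it. As a sketch of how the signals could feed an ambiguity flag, the example below treats a claim as ambiguous when it carries more than one non-descriptive signal. The `is_ambiguous` rule is an assumption introduced here for illustration; the notebook does not state its flagging criterion.

```python
def detect_structural_signals(text):
    # Same signal definitions as Phase 7.5
    t = text.lower()
    signals = {
        "normative": any(w in t for w in ["should", "must", "ought", "important"]),
        "policy": any(w in t for w in ["policy", "regulation", "law"]),
        "causal": any(w in t for w in ["because", "leads to", "results in", "causes"]),
        "predictive": any(w in t for w in ["will", "would", "likely", "expected to"]),
        "descriptive": any(w in t for w in ["is", "are", "was", "were"]),
    }
    return [k for k, v in signals.items() if v]


def is_ambiguous(signals):
    """Hypothetical rule: more than one signal besides 'descriptive'."""
    return len([s for s in signals if s != "descriptive"]) > 1


mixed = detect_structural_signals("Agencies should act because costs will rise.")
plain = detect_structural_signals("Intake was measured in 2020.")

print(is_ambiguous(mixed))  # True
print(is_ambiguous(plain))  # False
```

The descriptive signal is left out of the count because its keywords match as substrings (`is` occurs inside `rise` and `this`), so it fires on most sentences.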


In [10]:
# =====================================
# PHASE 8 — FALSIFIABILITY ASSESSMENT
# =====================================

def assess_falsifiability(claim_text):
    text = claim_text.lower()

    if any(term in text for term in [
        "should", "must", "ought", "we believe", "it is important"
    ]):
        return "Not falsifiable"

    if any(term in text for term in [
        "may", "could", "likely", "appears to", "suggests"
    ]):
        return "Weakly falsifiable"

    return "Falsifiable"


df_claims["falsifiability"] = df_claims["claim_text"].apply(
    assess_falsifiability
)


In [11]:
# =====================================
# PHASE 8.1 — EVIDENCE TYPOLOGY
# =====================================

def assign_evidence_type(row):
    if row["falsifiability"] == "Not falsifiable":
        return None  # intentional null

    text = row["claim_text"].lower()

    if any(term in text for term in ["data", "statistics", "rates", "percent"]):
        return "Statistical"

    if any(term in text for term in ["report", "document", "record"]):
        return "Documentary"

    if any(term in text for term in ["over time", "trend", "longitudinal"]):
        return "Longitudinal"

    return "Unspecified"


df_claims["evidence_type"] = df_claims.apply(
    assign_evidence_type,
    axis=1
)


In [12]:
# =====================================
# PHASE 8.2 — REGULATORY RELEVANCE FLAG
# =====================================

def regulatory_relevance(claim_text):
    text = claim_text.lower()

    if any(term in text for term in [
        "law", "regulation", "policy", "compliance", "agency"
    ]):
        return True

    return False


df_claims["regulatory_relevance"] = df_claims["claim_text"].apply(
    regulatory_relevance
)


In [13]:
# =====================================
# PHASE 9 — QUALITY CONTROL CHECKS
# =====================================

required_columns = [
    "claim_id",
    "claim_text",
    "section",
    "page",
    "claim_type",
    "falsifiability",
    "evidence_type",
    "regulatory_relevance"
]

missing = [c for c in required_columns if c not in df_claims.columns]
if missing:
    raise RuntimeError(f"Missing required columns: {missing}")

# Confirm intentional nulls are limited to allowed fields
allowed_null_fields = ["evidence_type"]

for col in df_claims.columns:
    if col not in allowed_null_fields and df_claims[col].isnull().any():
        # Raise explicitly: assert statements are skipped under `python -O`
        raise RuntimeError(f"Unexpected nulls in {col}")


In [14]:
# =====================================
# PHASE 10 — FINAL DATASET ASSEMBLY
# =====================================

FINAL_SCHEMA = [
    "claim_id",
    "claim_text",
    "section",
    "page",
    "claim_type",
    "falsifiability",
    "evidence_type",
    "regulatory_relevance"
]

FINAL_DATASET = df_claims[FINAL_SCHEMA].copy()


In [15]:
# =====================================
# PHASE 10.7 — INTENTIONAL NULL DISCLOSURE
# =====================================

INTENTIONAL_NULL_FIELDS = {
    "evidence_type": "Not applicable or not responsibly assignable"
}


In [16]:
# =====================================
# PHASE 11 — DATASET HASHING
# =====================================

import hashlib
from datetime import datetime, timezone

dataset_json = FINAL_DATASET.to_json(
    orient="records"
)

DATASET_HASH = hashlib.sha256(
    dataset_json.encode("utf-8")
).hexdigest()

LOCK_TIMESTAMP = datetime.now(timezone.utc).isoformat()
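
A recorded SHA-256 digest lets a reviewer confirm that a serialized dataset has not changed. The sketch below uses only the standard library and a made-up one-record JSON string. `sha256_of_text` and `matches_recorded_hash` are hypothetical helpers, and the check assumes the reviewer has the exact JSON text that was hashed.

```python
import hashlib


def sha256_of_text(text):
    """Hex digest of the UTF-8 encoding of text, as in Phase 11."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def matches_recorded_hash(json_text, recorded_hash):
    return sha256_of_text(json_text) == recorded_hash


# Hypothetical record, for illustration only
original = '[{"claim_id":"C0000001","claim_text":"Example claim."}]'
recorded = sha256_of_text(original)

tampered = original.replace("Example", "Edited")

print(matches_recorded_hash(original, recorded))  # True
print(matches_recorded_hash(tampered, recorded))  # False
```

Since Phase 4 builds each `claim_id` from `uuid.uuid4()`, two runs over the same pages produce different IDs and therefore different hashes. The digest identifies one particular run's output rather than the source text.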


In [17]:
# =====================================
# PHASE 12 — DATASET EXPORT
# =====================================

FINAL_DATASET.to_csv("final_claims_dataset.csv", index=False)
FINAL_DATASET.to_json("final_claims_dataset.json", orient="records")


In [18]:
# =====================================
# PHASE 13 — EXPORT METADATA
# =====================================

EXPORT_METADATA = {
    "hash": DATASET_HASH,
    "locked_at": LOCK_TIMESTAMP,
    "source_file": PDF_PATH,
    "page_range": f"{PAGE_RANGE_START}-{PAGE_RANGE_END}",
    "section": SECTION_LABEL,
    "intentional_nulls": INTENTIONAL_NULL_FIELDS
}


In [19]:
# =====================================
# PHASE 14 — RESULTS SUMMARY
# =====================================

claim_type_counts = FINAL_DATASET["claim_type"].value_counts()
falsifiability_counts = FINAL_DATASET["falsifiability"].value_counts()

display(claim_type_counts)
display(falsifiability_counts)


claim_type
Descriptive            645
Policy-prescriptive    118
Causal                  15
Predictive              11
Name: count, dtype: int64

falsifiability
Falsifiable           689
Not falsifiable        56
Weakly falsifiable     44
Name: count, dtype: int64

# --------------------------------------
# The counts above are representative output. Actual values depend
# on the settings entered in Phase 0 (User-Defined Configuration).
# --------------------------------------

In [20]:
# =====================================
# PHASE 14.7 — REVIEWER NOTES
# =====================================

REVIEWER_NOTES = """
This dataset reflects sentence-level structural analysis only.
No claim has been modified, paraphrased, or evaluated for truth.
Intentional nulls are documented and disclosed per Appendix C.6.
"""


In [21]:
# ==================================================
# Phase 15 — Lexical Frequency (Stopword-Controlled)
# ==================================================

import re
import pandas as pd
from collections import Counter

# --- INPUT ASSERTION ---
if 'df_claims' not in globals():
    raise NameError("df_claims not found. Ensure Phase 1–7 completed successfully.")

if 'FINAL_DATASET' not in globals():
    raise NameError("FINAL_DATASET not found. Initialize before Phase 15.")

TEXT_COLUMN = "claim_text"

if TEXT_COLUMN not in df_claims.columns:
    print(f"Available columns: {df_claims.columns.tolist()}")
    raise KeyError(f"Column '{TEXT_COLUMN}' not found in df_claims.")


# --- STOPWORDS: Articles, Prepositions, and Coordination Conjunctions ---
# (Project 2025 Controlled List)
STOP_WORDS = {
    # --- Articles ---
    "a","an","the",
    
    # --- Prepositions ---
    "about","above","across","after","against","along",
    "among","around","at","before","behind","below",
    "beneath","beside","between","beyond","by","concerning",
    "considering","despite","down","during","except",
    "for","from","in","inside","into","like","near",
    "of","off","on","onto","out","outside","over",
    "past","regarding","round","since","through",
    "throughout","to","toward","under","underneath",
    "until","unto","up","upon","with","within","without",
    
    # --- Coordination Conjunctions (FANBOYS) ---
    "for","and","nor","but","or","yet","so"
}

# --- CONCATENATE CORPUS ---
corpus_text = " ".join(df_claims[TEXT_COLUMN].dropna().astype(str))

# --- CLEAN ---
corpus_text = re.sub(r"[^\w\s]", "", corpus_text.lower())
tokens = corpus_text.split()

# --- FILTER ---
filtered_tokens = [t for t in tokens if t not in STOP_WORDS]

# --- COUNT ---
word_counts = Counter(filtered_tokens)

PHASE15_WORD_DF = pd.DataFrame(
    word_counts.items(),
    columns=["term", "count"]
)

PHASE15_WORD_DF["analysis_phase"] = "Phase_15"
PHASE15_WORD_DF["analysis_type"] = "word_frequency"

PHASE15_WORD_DF = PHASE15_WORD_DF.sort_values("count", ascending=False)

# --- APPEND TO FINAL_DATASET ---
FINAL_DATASET = pd.concat(
    [FINAL_DATASET, PHASE15_WORD_DF],
    ignore_index=True
)

print("Phase 15 complete: Word frequency integrated (coordination conjunctions excluded).")
PHASE15_WORD_DF.head(20)


Phase 15 complete: Word frequency integrated (coordination conjunctions excluded).


index,term,count,analysis_phase,analysis_type
219,foods,207,Phase_15,word_frequency
2,dietary,186,Phase_15,word_frequency
36,that,150,Phase_15,word_frequency
432,is,135,Phase_15,word_frequency
211,processed,132,Phase_15,word_frequency
192,are,130,Phase_15,word_frequency
177,as,129,Phase_15,word_frequency
16,evidence,117,Phase_15,word_frequency
31,health,115,Phase_15,word_frequency
3,guidelines,110,Phase_15,word_frequency
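
The cleaning and filtering steps can be traced on a single sentence. The sketch below applies the same regex and the same kind of stopword filter to a made-up sample. The `STOP_WORDS` set here is an abbreviated subset chosen for illustration, and `tokenize` is a hypothetical wrapper around the Phase 15 steps.

```python
import re
from collections import Counter

# Abbreviated subset of the Phase 15 list, for illustration only
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in", "to"}


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return [t for t in cleaned.split() if t not in STOP_WORDS]


sample = "The Dietary Guidelines for Americans, 2025-2030, emphasize whole foods."
tokens = tokenize(sample)

print(tokens)
# ['dietary', 'guidelines', 'americans', '20252030', 'emphasize', 'whole', 'foods']

print(Counter(tokens).most_common(1))
# [('dietary', 1)]
```

Punctuation is deleted rather than replaced with a space, so a hyphenated span such as `2025-2030` collapses into one token. The controlled list covers only articles, prepositions, and coordinating conjunctions, which is why words like `that`, `is`, `are`, and `as` still appear among the most frequent terms.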


In [22]:
# ==================================================
# Phase 16 — Phrase Extraction (Bigrams + Trigrams)
# ==================================================

from collections import Counter
from itertools import islice

# --- INPUT ASSERTION ---
if 'filtered_tokens' not in globals():
    raise NameError("filtered_tokens not found. Run Phase 15 first.")

def generate_ngrams(tokens, n):
    return zip(*(islice(tokens, i, None) for i in range(n)))

# --- BIGRAMS ---
bigrams = [" ".join(bg) for bg in generate_ngrams(filtered_tokens, 2)]
bigram_counts = Counter(bigrams)

PHASE16_BIGRAM_DF = pd.DataFrame(
    bigram_counts.items(),
    columns=["term", "count"]
)

PHASE16_BIGRAM_DF["analysis_phase"] = "Phase_16"
PHASE16_BIGRAM_DF["analysis_type"] = "bigram_phrase"

# --- TRIGRAMS ---
trigrams = [" ".join(tg) for tg in generate_ngrams(filtered_tokens, 3)]
trigram_counts = Counter(trigrams)

PHASE16_TRIGRAM_DF = pd.DataFrame(
    trigram_counts.items(),
    columns=["term", "count"]
)

PHASE16_TRIGRAM_DF["analysis_phase"] = "Phase_16"
PHASE16_TRIGRAM_DF["analysis_type"] = "trigram_phrase"

# --- MERGE PHRASES ---
PHASE16_PHRASE_DF = pd.concat(
    [PHASE16_BIGRAM_DF, PHASE16_TRIGRAM_DF],
    ignore_index=True
).sort_values("count", ascending=False)

# --- APPEND TO FINAL_DATASET ---
FINAL_DATASET = pd.concat(
    [FINAL_DATASET, PHASE16_PHRASE_DF],
    ignore_index=True
)

print("Phase 16 complete: Phrase catalog integrated.")
PHASE16_PHRASE_DF.head(20)


Phase 16 complete: Phrase catalog integrated.


index,term,count,analysis_phase,analysis_type
2,dietary guidelines,98,Phase_16,bigram_phrase
3,guidelines americans,82,Phase_16,bigram_phrase
10886,dietary guidelines americans,82,Phase_16,trigram_phrase
5,2025 2030,77,Phase_16,bigram_phrase
4,americans 2025,75,Phase_16,bigram_phrase
10887,guidelines americans 2025,75,Phase_16,trigram_phrase
10888,americans 2025 2030,74,Phase_16,trigram_phrase
0,scientific foundation,71,Phase_16,bigram_phrase
1,foundation dietary,70,Phase_16,bigram_phrase
10884,scientific foundation dietary,69,Phase_16,trigram_phrase
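
The n-gram helper slides a window of length `n` over the token list by zipping the list against shifted copies of itself. The sketch below runs the same `generate_ngrams` function on a short, made-up token list.

```python
from itertools import islice


def generate_ngrams(tokens, n):
    # Same helper as Phase 16: zip the list with n shifted views of itself
    return zip(*(islice(tokens, i, None) for i in range(n)))


tokens = ["dietary", "guidelines", "americans", "2025", "2030"]

bigrams = [" ".join(bg) for bg in generate_ngrams(tokens, 2)]
trigrams = [" ".join(tg) for tg in generate_ngrams(tokens, 3)]

print(bigrams)
# ['dietary guidelines', 'guidelines americans', 'americans 2025', '2025 2030']

print(trigrams)
# ['dietary guidelines americans', 'guidelines americans 2025', 'americans 2025 2030']
```

A list of `k` tokens yields `k - n + 1` n-grams. The phrases are built from `filtered_tokens`, so stopwords are already gone, and the tokens come from one concatenated corpus, so a phrase can span the boundary between two claims.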


In [23]:
# ==================================================
# FINAL EXPORT — After Phase 15 & 16
# ==================================================

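# Re-export FINAL_DATASET with Phase 15 & 16 entries included.
# This overwrites the files written in Phase 12.
# Note: DATASET_HASH (Phase 11) was computed before Phases 15 and 16
# appended rows, so it describes the claims-only dataset, not these files.
FINAL_DATASET.to_csv("final_claims_dataset.csv", index=False)
FINAL_DATASET.to_json("final_claims_dataset.json", orient="records")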

print("✓ FINAL_DATASET re-exported with Phase 15 & 16 entries.")
print(f"  Total records in FINAL_DATASET: {len(FINAL_DATASET)}")

# Get unique phases, excluding NaN values
phases = [p for p in FINAL_DATASET['analysis_phase'].unique() if pd.notna(p)]
phases_sorted = sorted(phases)
print(f"  Analysis phases included: {phases_sorted}")
print("\nExported files:")
print("  - final_claims_dataset.csv")
print("  - final_claims_dataset.json")


✓ FINAL_DATASET re-exported with Phase 15 & 16 entries.
  Total records in FINAL_DATASET: 28238
  Analysis phases included: ['Phase_15', 'Phase_16']

Exported files:
  - final_claims_dataset.csv
  - final_claims_dataset.json


In [24]:
assert "Phase_15" in FINAL_DATASET["analysis_phase"].values
assert "Phase_16" in FINAL_DATASET["analysis_phase"].values


In [25]:
print(df_claims.columns.tolist())


['claim_id', 'claim_text', 'page', 'section', 'excluded', 'claim_type', 'falsifiability', 'evidence_type', 'regulatory_relevance']
