# Two-Stage Regulatory Rule Extraction Approach


**Stage 1: Maximal Inclusion**

- Extract every paragraph or section that could possibly be a rule, with minimal filtering.
- Output a comprehensive list of candidate rules, each with full regulatory text and metadata.


**Stage 2: Actionability Assessment & Code Generation**

- Use AI or rule-based logic to classify each candidate as actionable or not.
- For actionable items, generate code logic or automation instructions.


This approach maximizes coverage and enables more nuanced, context-aware filtering and automation in the second stage.
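
As a minimal sketch of this split, the two stages can be kept as separate functions so that the Stage 2 classifier can be swapped out without re-running extraction. The names `Candidate`, `stage1_collect`, `stage2_classify`, and `toy_is_actionable` are hypothetical, the sample paragraphs are made up, and the keyword test only stands in for a real classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Candidate:
    """Hypothetical container for one candidate rule (not the notebook's dict schema)."""
    rule_id: str
    text: str
    metadata: dict = field(default_factory=dict)
    actionable: Optional[bool] = None  # filled in by Stage 2


def stage1_collect(paragraphs: list[str], section: str) -> list[Candidate]:
    """Stage 1: keep every non-empty paragraph; no judgment about content."""
    kept = [p.strip() for p in paragraphs if p and p.strip()]
    return [
        Candidate(rule_id=f"{section}_rule_{i}", text=text, metadata={"citation": section})
        for i, text in enumerate(kept, start=1)
    ]


def stage2_classify(candidates: list[Candidate], is_actionable: Callable[[str], bool]) -> list[Candidate]:
    """Stage 2: label each candidate in place and return the actionable subset."""
    for c in candidates:
        c.actionable = is_actionable(c.text)
    return [c for c in candidates if c.actionable]


def toy_is_actionable(text: str) -> bool:
    """Toy classifier for illustration only: looks for 'shall' or 'must' as whole words."""
    padded = f" {text.lower()} "
    return " shall " in padded or " must " in padded


# Made-up sample paragraphs, not quoted from the regulation
sample = ["Purpose.", "", "The agency shall publish a notice.", "  "]
candidates = stage1_collect(sample, section="X.1")
actionable = stage2_classify(candidates, toy_is_actionable)
print(len(candidates), len(actionable))
```

Stage 1 never inspects content beyond emptiness, so a better classifier can later be applied to the same candidate list.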

# Step-by-Step: Extracting Regulatory Policy Rules as Structured JSON

This notebook will:

1. Parse the USDA NEPA regulations (7 CFR Part 1b, saved locally as `title-7.xml`) using `xmltodict`.
2. Traverse all sections and paragraphs to identify regulatory requirements.
3. Extract each requirement as a structured JSON rule, including:
   - Regulatory text
   - Metadata (citation, effective date, authority)
   - Explanation of the rule's intent
4. Output all rules as a formatted JSON list for review.

This approach aims to capture every candidate requirement, so that nothing is discarded before it can be assessed and translated into code logic.
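
One detail of step 1 is worth illustrating. When an element contains an inline child such as `<I>`, `xmltodict` stores the child under its own key and joins the surrounding text under `#text`, so the original word order is lost (the `authority` field in the Stage 1 output further down shows this with "et seq."). If order matters, the standard library's `xml.etree.ElementTree` is one alternative; it is a swapped-in technique, not what this notebook uses. The snippet below is a made-up fragment modeled on an authority note, and `element_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration; the real file is much larger
snippet = "<AUTH><PSPACE>5 U.S.C. 301; 42 U.S.C. 4321 <I>et seq.</I>; 40 CFR 1507.3.</PSPACE></AUTH>"


def element_text(elem: ET.Element) -> str:
    """Join all text inside an element in document order, collapsing whitespace."""
    return " ".join("".join(elem.itertext()).split())


root = ET.fromstring(snippet)
pspace = root.find("PSPACE")
print(element_text(pspace))
```

Here `itertext()` yields the text before, inside, and after `<I>` in sequence, so "et seq." stays next to the citation it qualifies.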

In [47]:
%pip install xmltodict
import xmltodict

# Read and parse the XML file
# Source doc: https://www.ecfr.gov/current/title-7/subtitle-A/part-1b
# Source XML: https://www.ecfr.gov/api/versioner/v1/full/2025-08-26/title-7.xml?part=1b
with open("title-7.xml", "r", encoding="utf-8", errors="replace") as file:
    xml_content = file.read()

policy_dict = xmltodict.parse(xml_content)

In [48]:
# Extract regulatory requirements from USDA NEPA XML and output as structured JSON rules
import re
import json
from collections.abc import Iterable

def extract_text_recursive(item):
    """Recursively extract all text from nested dicts/lists/strings."""
    if isinstance(item, str):
        return [item.strip()] if item.strip() else []
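    elif isinstance(item, dict):
        # xmltodict stores attributes under '@'-prefixed keys and mixed-content text
        # under '#text'; every string, dict, or list value is recursed into, so
        # attribute values are collected along with element text.
        texts = []
        for v in item.values():
            texts.extend(extract_text_recursive(v))
        return texts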
    elif isinstance(item, Iterable):
        texts = []
        for subitem in item:
            texts.extend(extract_text_recursive(subitem))
        return texts
    return []

def flatten_paragraphs(section):
    """Extracts all paragraphs from a section, handling nested structure and dicts."""
    paragraphs = []
    if 'P' in section:
        ps = section['P']
        paragraphs.extend(extract_text_recursive(ps))
    return paragraphs

def is_true_requirement(para):
    """Stage 1: Include all paragraphs as candidate rules (maximal inclusion)."""
    return isinstance(para, str) and para.strip() != ""

def extract_rules_from_section(section, section_citation, effective_date, authority):
    rules = []
    paragraphs = flatten_paragraphs(section)
    for idx, para in enumerate(paragraphs):
        if not is_true_requirement(para):
            continue
        citation_match = re.search(r'(\d+\s*U\.S\.C\.\s*\d+[a-zA-Z0-9\(\)]*)', para)
        citation = citation_match.group(0) if citation_match else section_citation
        rule = {
            "id": f"{section_citation}_rule_{idx+1}",
            "text": para.strip(),
            "metadata": {
                "citation": citation,
                "effective_date": effective_date,
                "authority": authority
            },
            "explanation": f"Candidate rule extracted from section {section_citation}."
        }
        rules.append(rule)
    return rules

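# Get top-level part and sections
part = policy_dict.get('DIV5', {})
# Note: xmltodict returns a dict (not a string) when PSPACE contains inline elements such as <I>
authority = part.get('AUTH', {}).get('PSPACE', None)
source_note = part.get('SOURCE', {}).get('PSPACE', '')
# Only ISO-formatted dates (YYYY-MM-DD) are recognized; a non-string PSPACE is skipped
date_match = re.search(r'(\d{4}-\d{2}-\d{2})', source_note) if isinstance(source_note, str) else None
effective_date = date_match.group(0) if date_match else None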

# Normalize sections to a list
div8_raw = part.get('DIV8', None)
if isinstance(div8_raw, list):
    sections = div8_raw
elif isinstance(div8_raw, dict):
    sections = [div8_raw]
else:
    sections = []

all_rules = []
for section in sections:
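    # Caution: xmltodict prefixes XML attributes with '@', so an attribute named
    # hierarchy_metadata is stored under '@hierarchy_metadata'. The plain key used
    # here only matches a child element; otherwise the '@N' fallback below applies.
    meta = section.get('hierarchy_metadata', None)
    section_citation = None
    if isinstance(meta, str):
        try:
            meta_dict = json.loads(meta)
            section_citation = meta_dict.get('citation', None)
        except (json.JSONDecodeError, AttributeError) as e:
            print(f"Failed to parse hierarchy_metadata for section {section.get('@N', '')}: {e}")
    elif isinstance(meta, dict):
        section_citation = meta.get('citation', None)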
    # Fallback: use section number or header if citation is missing
    if not section_citation:
        section_citation = section.get('@N', section.get('HEAD', 'Unknown'))
    rules = extract_rules_from_section(section, section_citation, effective_date, authority)
    all_rules.extend(rules)

print(f"Total candidate rules extracted: {len(all_rules)}")
print(json.dumps(all_rules, indent=2))

Total candidate rules extracted: 1110
[
  {
    "id": "1b.1_rule_1",
    "text": "Purpose.",
    "metadata": {
      "citation": "1b.1",
      "effective_date": null,
      "authority": {
        "I": "et seq.",
        "#text": "5 U.S.C. 301; 42 U.S.C. 4321 ; E.O. 11514, 3 CFR, 1966-1970 Comp., p. 902, as amended by E.O. 11991, 3 CFR, 1978 Comp., p. 123; E.O. 12114, 3 CFR, 1980 Comp., p. 356; 40 CFR 1507.3."
      }
    },
    "explanation": "Candidate rule extracted from section 1b.1."
  },
  {
    "id": "1b.1_rule_2",
    "text": "(a)  The purpose of this part is to outline the procedures by which the U.S. Department of Agriculture (hereinafter USDA or the Department) will integrate the National Environmental Policy Act (NEPA) into decision-making processes. Specifically, this part: describes the process by which USDA determines what actions are subject to NEPA's procedural requirements and the applicable level of NEPA review; ensures that relevant environmental information is ident

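Printing more than a thousand rules inline is hard to review. One possible alternative, sketched here with a hypothetical helper `save_and_summarize` and a made-up output filename, writes the list to disk and prints a per-section count instead. It assumes each rule is a dict whose `id` follows the `<section>_rule_<n>` pattern used above; the demo data are placeholders.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def save_and_summarize(rules: list[dict], path: Path) -> Counter:
    """Write rules as indented JSON and return the number of candidates per section."""
    path.write_text(json.dumps(rules, indent=2, ensure_ascii=False), encoding="utf-8")
    # Rule ids look like "<section>_rule_<n>", so the section is everything before "_rule_"
    return Counter(rule["id"].rsplit("_rule_", 1)[0] for rule in rules)


# Stand-in data shaped like the notebook's rules (sections and texts are placeholders)
demo_rules = [
    {"id": "X.1_rule_1", "text": "placeholder A"},
    {"id": "X.1_rule_2", "text": "placeholder B"},
    {"id": "X.2_rule_1", "text": "placeholder C"},
]
out_path = Path(tempfile.gettempdir()) / "candidate_rules_demo.json"
counts = save_and_summarize(demo_rules, out_path)
for section, n in sorted(counts.items()):
    print(f"{section}: {n} candidates")
```
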
In [49]:
# Translate regulatory text into structured, machine-readable logic for all extracted rules
import re

def generate_structured_logic(rule_text):
    """Extract actor, action, object, condition, deadline from regulatory text."""
    logic = {
        "actor": None,
        "action": None,
        "object": None,
        "condition": None,
        "deadline": None
    }
    text = rule_text.strip() if rule_text else ''
    # Actor: look for phrases like 'The [role] shall/must/required to'
    actor_match = re.search(r'\b(The|Each|A|An) ([A-Z][a-zA-Z\s\-]+?)(?: shall| must| is required| are required| will| may| should)', text)
    if actor_match:
        logic["actor"] = actor_match.group(2).strip()
    # Action: look for directive verbs (word-bounded so 'may' does not match inside 'dismay')
    action_match = re.search(r'\b(shall|must|is required to|are required to|will|may|should) ([a-zA-Z\s,]+?)(?:\.|,|;|$)', text)
    if action_match:
        logic["action"] = action_match.group(2).strip()
    # Object: first prepositional phrase found anywhere in the text
    # (the search is not anchored to the action, so the phrase may precede it)
    object_match = re.search(r'\b(for|of|to|on|regarding|concerning) ([a-zA-Z0-9\s\-]+?)(?:\.|,|;|$)', text)
    if object_match:
        logic["object"] = object_match.group(2).strip()
    # Condition: look for 'if', 'when', 'unless', 'as required by', etc.
    condition_match = re.search(r'\b(if|when|unless|as required by|in accordance with|subject to) ([^\.]+)', text)
    if condition_match:
        logic["condition"] = condition_match.group(0).strip()
    # Deadline: look for 'by [date/period]', 'annual', 'within [number] days', etc.
    # Note: 'by' also matches non-temporal uses such as 'by which'.
    deadline_match = re.search(r'\b(by|within|annual|no later than) ([^\.]+)', text)
    if deadline_match:
        logic["deadline"] = deadline_match.group(0).strip()
    # Fallbacks if not found
    for key in ("actor", "action", "object"):
        if not logic[key]:
            logic[key] = "Unspecified"
    # Attach original text for reference
    logic["original_text"] = text
    return logic

for rule in all_rules:
    rule_text = rule.get('text', '')
    rule['logic'] = generate_structured_logic(rule_text)

# Display a sample of rules with structured logic
print(json.dumps(all_rules[:5], indent=2))

[
  {
    "id": "1b.1_rule_1",
    "text": "Purpose.",
    "metadata": {
      "citation": "1b.1",
      "effective_date": null,
      "authority": {
        "I": "et seq.",
        "#text": "5 U.S.C. 301; 42 U.S.C. 4321 ; E.O. 11514, 3 CFR, 1966-1970 Comp., p. 902, as amended by E.O. 11991, 3 CFR, 1978 Comp., p. 123; E.O. 12114, 3 CFR, 1980 Comp., p. 356; 40 CFR 1507.3."
      }
    },
    "explanation": "Candidate rule extracted from section 1b.1.",
    "logic": {
      "actor": "Unspecified",
      "action": "Unspecified",
      "object": "Unspecified",
      "condition": null,
      "deadline": null,
      "original_text": "Purpose."
    }
  },
  {
    "id": "1b.1_rule_2",
    "text": "(a)  The purpose of this part is to outline the procedures by which the U.S. Department of Agriculture (hereinafter USDA or the Department) will integrate the National Environmental Policy Act (NEPA) into decision-making processes. Specifically, this part: describes the process by which USDA determin

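The `deadline` pattern above treats any `by` or `within` as temporal, so a phrase like "by which" is captured as a deadline. One possible refinement, sketched here with a hypothetical helper `find_deadline` and made-up sample sentences, accepts only phrases that contain a counted period, an explicit "no later than" clause, or a recurrence word. The patterns are assumptions and would need tuning against the actual regulatory text.

```python
import re
from typing import Optional

# Hypothetical, stricter deadline patterns (word-bounded, case-insensitive)
DEADLINE_PATTERNS = [
    re.compile(r"\bwithin\s+\d+\s+(?:calendar\s+|business\s+|working\s+)?(?:days?|months?|years?)\b", re.I),
    re.compile(r"\bno later than\s+[^.;,]+", re.I),
    re.compile(r"\b(?:annually|quarterly|monthly)\b", re.I),
]


def find_deadline(text: str) -> Optional[str]:
    """Return the earliest deadline-like phrase in the text, or None if there is none."""
    matches = [m for p in DEADLINE_PATTERNS if (m := p.search(text))]
    if not matches:
        return None
    return min(matches, key=lambda m: m.start()).group(0).strip()


# Made-up sentences for illustration, not quoted from the regulation
samples = [
    "The applicant must respond within 30 days of the notice.",
    "This part outlines the procedures by which the agency will act.",
    "Reports are due no later than March 1 of each year.",
]
for s in samples:
    print(find_deadline(s))
```
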
# Stage 2: Actionability Assessment & Code Generation


In this stage, we will:

- Review each candidate rule extracted in Stage 1.
- Use AI or rule-based logic to classify whether each rule is actionable (i.e., can be implemented as code or automation).
- For actionable rules, generate structured logic and code templates for automation.


The process can be iterated and refined to improve accuracy and coverage.
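
As one example of rule-based logic for this step, the sketch below sorts text into obligation, prohibition, permission, or descriptive based on modal verbs, checking prohibitions first so that "shall not" is not read as an obligation. The function name `classify_modality`, the category labels, and the sample sentences (apart from "Purpose.", which appears in the Stage 1 output) are assumptions for illustration; this is not the classifier used in the next cell.

```python
import re

# Ordered from most to least specific: prohibitions must be tested before obligations
MODALITY_PATTERNS = [
    ("prohibition", re.compile(r"\b(?:shall not|must not|may not|is prohibited|are prohibited)\b", re.I)),
    ("obligation", re.compile(r"\b(?:shall|must|is required to|are required to)\b", re.I)),
    ("permission", re.compile(r"\bmay\b", re.I)),
]


def classify_modality(text: str) -> str:
    """Return the first matching modality label, or 'descriptive' if none match."""
    for label, pattern in MODALITY_PATTERNS:
        if pattern.search(text):
            return label
    return "descriptive"


# Sample sentence -> label the sketch is expected to assign
expected_labels = {
    "The agency shall prepare an assessment.": "obligation",
    "An applicant may not begin work before approval.": "prohibition",
    "The official may extend the comment period.": "permission",
    "Purpose.": "descriptive",
}
for sentence in expected_labels:
    print(f"{classify_modality(sentence):12} | {sentence}")
```

A label like this could be stored alongside the boolean `actionable` flag, so that permissions and descriptive text remain available as context without being turned into code.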

In [50]:
# Stage 2: Actionability Assessment & Code Generation
import re
import json

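def is_actionable_rule(rule_text):
    """
    Heuristic: a rule is flagged as actionable if it contains any directive or procedural
    keyword (shall, must, required to, prohibited, etc.).
    The list is deliberately broad and includes common words such as 'will', 'may', and
    'review', so descriptive or explanatory passages that use them are flagged too.
    This can be replaced or augmented with an AI model for more nuanced classification.
    """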
    directive_patterns = [
        r'\bmust\b', r'\bshall\b', r'\bare required to\b', r'\bis required to\b', r'\bprohibited\b',
        r'\bforbid\b', r'\bensure\b', r'\bdirect\b', r'\bmandate\b', r'\bmay not\b', r'\bis not permitted\b',
        r'\bno person may\b', r'\bwill\b', r'\bshould\b', r'\bmay\b', r'\bcan\b', r'\bshall not\b',
        r'\bmust not\b', r'\bis not allowed\b', r'\bis allowed\b', r'\brequire\b', r'\bauthorize\b',
        r'\bapprove\b', r'\bdeny\b', r'\breview\b', r'\bsubmit\b', r'\breport\b', r'\bdetermine\b',
        r'\bspecify\b', r'\bclarify\b', r'\badd\b', r'\bremove\b', r'\brev(ise|ised)\b', r'\bpublish\b',
        r'\bprovide\b', r'\bconsider\b', r'\bapply\b', r'\badopt\b', r'\bimplement\b', r'\bestablish\b', r'\bdefine\b'
    ]
    text = rule_text.lower()
    return any(re.search(pat, text) for pat in directive_patterns)

def generate_code_template(logic):
    """
    Generate a Python code template for an actionable rule's logic.
    This is a simple example; real automation may require more context.
    """
    actor = logic.get('actor', 'Unspecified')
    action = logic.get('action', 'Unspecified')
    obj = logic.get('object', 'Unspecified')
    condition = logic.get('condition', None)
    deadline = logic.get('deadline', None)
    code_lines = [f"# Actor: {actor}", f"# Action: {action}", f"# Object: {obj}"]
    if condition:
        code_lines.append(f"# Condition: {condition}")
    if deadline:
        code_lines.append(f"# Deadline: {deadline}")
    code_lines.append("def rule_action():")
    code_lines.append(f"    # TODO: Implement logic for: {action} {obj}")
    code_lines.append("    pass\n")
    return '\n'.join(code_lines)

# Assess actionability and generate code templates
for rule in all_rules:
    rule_text = rule.get('text', '')
    logic = rule.get('logic', {})
    rule['actionable'] = is_actionable_rule(rule_text)
    if rule['actionable']:
        rule['code_template'] = generate_code_template(logic)
    else:
        rule['code_template'] = None

# Display actionable rules and their code templates
actionable_rules = [r for r in all_rules if r['actionable']]
print(f"Total actionable rules: {len(actionable_rules)}")
for r in actionable_rules[:5]:
    print(json.dumps(r, indent=2))
    print("\nCode Template:\n", r['code_template'])

Total actionable rules: 321
{
  "id": "1b.1_rule_2",
  "text": "(a)  The purpose of this part is to outline the procedures by which the U.S. Department of Agriculture (hereinafter USDA or the Department) will integrate the National Environmental Policy Act (NEPA) into decision-making processes. Specifically, this part: describes the process by which USDA determines what actions are subject to NEPA's procedural requirements and the applicable level of NEPA review; ensures that relevant environmental information is identified and considered early in the process in order to ensure informed decision making; enables USDA to conduct coordinated, consistent, predictable and timely environmental reviews; reduces unnecessary burdens and delays; and implements NEPA's mandates regarding lead and cooperating agency roles, page and time limits, and sponsor preparation of environmental assessments and environmental impact statements.",
  "metadata": {
    "citation": "1b.1",
    "effective_date": nu