# Task 0.5: Prototype Parser for Public Law 94-553

**Goal**: Build a prototype parser for a single Public Law (Copyright Act of 1976)

**Tasks**:
1. Parse law metadata (number, date, sponsors)
2. Extract section changes from legal language
3. Generate diff between old and new text

**Law Selected**: PL 94-553 - Copyright Act of 1976
- Major comprehensive revision of Title 17
- Enacted October 19, 1976
- Package ID: `PLAW-94publ553`

In [None]:
import requests
import json
import re
from datetime import datetime
from typing import Dict, List, Optional, Tuple
from xml.etree import ElementTree as ET
from difflib import unified_diff, HtmlDiff
import os

# Note: GovInfo API key should be set as environment variable
# For this prototype, we'll use DEMO_KEY for exploration
GOVINFO_API_KEY = os.getenv('GOVINFO_API_KEY', 'DEMO_KEY')
GOVINFO_BASE_URL = 'https://api.govinfo.gov'

# Package ID for Copyright Act of 1976
PACKAGE_ID = 'PLAW-94publ553'

## Step 1: Fetch Law Metadata from GovInfo API

In [None]:
def fetch_package_summary(package_id: str) -> Dict:
    """
    Fetch package summary metadata from GovInfo API.
    """
    url = f"{GOVINFO_BASE_URL}/packages/{package_id}/summary"
    params = {'api_key': GOVINFO_API_KEY}
    
    # A timeout keeps the notebook from hanging if the API does not respond
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    
    return response.json()

# Fetch metadata
print(f"Fetching metadata for {PACKAGE_ID}...")
summary = fetch_package_summary(PACKAGE_ID)

print("\n=== Law Metadata ===")
print(f"Title: {summary.get('title', 'N/A')}")
print(f"Package ID: {summary.get('packageId', 'N/A')}")
print(f"Date Issued: {summary.get('dateIssued', 'N/A')}")
print(f"Congress: {summary.get('congress', 'N/A')}")
print(f"\nFull metadata:")
print(json.dumps(summary, indent=2))

## Step 2: Download Law Text

We list the download formats that the summary advertises (HTML, PDF, XML), then fetch the HTML rendition of the law text.

In [None]:
def get_download_link(summary: Dict, format_type: str) -> Optional[str]:
    """
    Extract download link for specific format from summary.
    """
    download_links = summary.get('download', {})
    
    # Check common format keys
    format_keys = {
        'xml': ['xmlLink', 'uslmLink'],
        'html': ['htmLink', 'htmlLink'],
        'pdf': ['pdfLink'],
        'mods': ['modsLink'],
        'text': ['txtLink']
    }
    
    for key in format_keys.get(format_type, []):
        if key in download_links:
            return download_links[key]
    
    return None

# Check available formats
print("\n=== Available Download Formats ===")
download_section = summary.get('download', {})
for key, value in download_section.items():
    print(f"{key}: {value}")

# Try to get text content
html_link = get_download_link(summary, 'html')
pdf_link = get_download_link(summary, 'pdf')
xml_link = get_download_link(summary, 'xml')

print(f"\nHTML Link: {html_link}")
print(f"PDF Link: {pdf_link}")
print(f"XML Link: {xml_link}")

In [None]:
def fetch_law_text(package_id: str, format_type: str = 'htm') -> str:
    """
    Fetch law text content from GovInfo.
    """
    url = f"{GOVINFO_BASE_URL}/packages/{package_id}/{format_type}"
    params = {'api_key': GOVINFO_API_KEY}
    
    # A timeout keeps the notebook from hanging if the API does not respond
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    
    return response.text

# Fetch HTML version (most likely to be available for older laws)
print(f"\nFetching HTML content for {PACKAGE_ID}...")
try:
    law_text = fetch_law_text(PACKAGE_ID, 'htm')
    print(f"Successfully fetched {len(law_text)} characters")
    print(f"\nFirst 1000 characters:")
    print(law_text[:1000])
except Exception as e:
    print(f"Error fetching HTML: {e}")
    law_text = None
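
The pattern searches in later steps run on whatever `fetch_law_text` returns, which for the `htm` rendition is HTML markup rather than plain text. A minimal sketch of a tag-stripping step using only the standard library is shown here. The helper `html_to_text` and the sample string are hypothetical and not part of the prototype, and the shape of the real response has not been verified.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes and ignores tags, scripts, and styles."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &amp;, &quot;, etc.
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string (hypothetical helper)."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return collector.text()


# Made-up sample, not an excerpt from the real document.
sample_html = (
    "<html><body><pre>Section 5 is amended by striking "
    "&quot;A&quot;.</pre></body></html>"
)
print(html_to_text(sample_html))
```

Running the regexes on the stripped text avoids matches that straddle tags or miss quotation marks encoded as entities.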

## Step 3: Parse Law Metadata

Extract structured metadata from the law text and API response.

In [None]:
def parse_law_metadata(summary: Dict) -> Dict:
    """
    Extract and structure law metadata.
    """
    metadata = {
        'package_id': summary.get('packageId'),
        'title': summary.get('title'),
        'short_title': summary.get('shortTitle'),
        'date_issued': summary.get('dateIssued'),
        'congress': summary.get('congress'),
        'session': summary.get('session'),
        'law_number': None,  # Parsed from the package ID below
        'law_type': 'Public Law',
        'collection': summary.get('collectionCode'),
    }
    
    # Parse law number from package ID
    # Format: PLAW-{congress}publ{number}
    # `or ''` also covers the case where the key exists but its value is None
    match = re.match(r'PLAW-(\d+)publ(\d+)', summary.get('packageId') or '')
    if match:
        congress_num = match.group(1)
        law_num = match.group(2)
        metadata['law_number'] = f"{congress_num}-{law_num}"
    
    return metadata

law_metadata = parse_law_metadata(summary)

print("\n=== Parsed Law Metadata ===")
for key, value in law_metadata.items():
    print(f"{key}: {value}")
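
The task list also calls for the law's date. The sketch below shows one way to turn the `dateIssued` string into a `date` object and to build a citation string from the package ID without calling the API. It assumes `dateIssued` arrives as `YYYY-MM-DD`; the helper names and the sample dictionary are hypothetical, and the dictionary is hand-written rather than a real API response.

```python
import re
from datetime import date, datetime
from typing import Optional

PACKAGE_ID_RE = re.compile(r"^PLAW-(\d+)publ(\d+)$")


def parse_issue_date(value: Optional[str]) -> Optional[date]:
    """Parse a 'YYYY-MM-DD' string; return None if missing or malformed."""
    if not value:
        return None
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None


def public_law_citation(package_id: Optional[str]) -> Optional[str]:
    """Build 'Pub. L. 94-553' from 'PLAW-94publ553'; None if the ID does not fit."""
    match = PACKAGE_ID_RE.match(package_id or "")
    if not match:
        return None
    congress, number = match.groups()
    return f"Pub. L. {int(congress)}-{int(number)}"


# Hand-written sample, not a real API response.
sample_summary = {"packageId": "PLAW-94publ553", "dateIssued": "1976-10-19"}
print(public_law_citation(sample_summary["packageId"]))
print(parse_issue_date(sample_summary["dateIssued"]))
```

Returning `None` for malformed input, rather than raising, lets the caller decide whether a missing field should stop the run.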

## Step 4: Analyze Legal Language Patterns

Identify common amendment patterns in the law text.

In [None]:
# Common legal language patterns for amendments
AMENDMENT_PATTERNS = [
    # Pattern: "Section X is amended by..."
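    # (whitespace is only required when "of title N" follows, so forms such as
    # "Section 106(a) is amended" also match)
    r'Section\s+(\d+[A-Za-z]?)(?:\s+of title (\d+))?.*?is amended',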
    
    # Pattern: "striking 'X' and inserting 'Y'"
    r'striking\s+["\'](.+?)["\']\s+and inserting\s+["\'](.+?)["\']',
    
    # Pattern: "adding at the end the following"
    r'adding at the end(?:\s+thereof)?\s+the following',
    
    # Pattern: "Section X is repealed"
    r'Section\s+(\d+[A-Za-z]?).*?is(?:\s+hereby)?\s+repealed',
    
    # Pattern: "Title X is amended"
    r'Title\s+(\d+).*?is amended',
    
    # Pattern: "inserting after section X the following"
    r'inserting after section\s+(\d+[A-Za-z]?)\s+the following',
]

def find_amendment_patterns(text: str) -> List[Tuple[str, str]]:
    """
    Find all amendment patterns in law text.
    Returns list of (pattern_description, matched_text) tuples.
    """
    findings = []
    
    # Search for each pattern
    for i, pattern in enumerate(AMENDMENT_PATTERNS):
        matches = re.finditer(pattern, text, re.IGNORECASE | re.MULTILINE)
        for match in matches:
            findings.append((
                f"Pattern {i+1}",
                match.group(0)
            ))
    
    return findings

if law_text:
    print("\n=== Searching for Amendment Patterns ===")
    patterns_found = find_amendment_patterns(law_text)
    
    print(f"Found {len(patterns_found)} potential amendments")
    
    # Show first 10 examples
    for i, (pattern_type, match_text) in enumerate(patterns_found[:10]):
        print(f"\n{i+1}. {pattern_type}")
        print(f"   {match_text[:200]}..." if len(match_text) > 200 else f"   {match_text}")
else:
    print("No law text available for pattern analysis")
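
`find_amendment_patterns` returns only the whole matched string, although the strike-and-insert pattern already captures the struck and inserted text. The sketch below shows how those capture groups could be pulled out as structured pairs. It also accepts the older drafting form "striking out ... and inserting in lieu thereof ...". The names `StrikeInsert` and `find_strike_inserts` are hypothetical, the sample sentence is made up, and the quote styles handled are an assumption to check against the fetched text.

```python
import re
from typing import List, NamedTuple


class StrikeInsert(NamedTuple):
    struck: str
    inserted: str


# Straight and curly double quotes; which quote styles the real text uses
# is an assumption to verify against the fetched document.
_OPEN = r'["\u201c]'
_CLOSE = r'["\u201d]'
STRIKE_INSERT_RE = re.compile(
    rf"striking\s+(?:out\s+)?{_OPEN}(.+?){_CLOSE}\s+and\s+inserting\s+"
    rf"(?:in\s+lieu\s+thereof\s+)?{_OPEN}(.+?){_CLOSE}",
    re.IGNORECASE | re.DOTALL,  # DOTALL lets quoted text wrap across lines
)


def find_strike_inserts(text: str) -> List[StrikeInsert]:
    """Return every (struck, inserted) pair found in the text."""
    return [
        StrikeInsert(m.group(1), m.group(2))
        for m in STRIKE_INSERT_RE.finditer(text)
    ]


# Made-up sentence, not quoted from PL 94-553.
sample = (
    'Section 5 is amended by striking out "Register" and inserting '
    'in lieu thereof "Librarian".'
)
for edit in find_strike_inserts(sample):
    print(f"- {edit.struck!r} -> {edit.inserted!r}")
```

Single quotes are left out of the quote classes on purpose, since an apostrophe inside the quoted text would otherwise end the capture early.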

## Step 5: Extract Section Changes

Build a parser to extract which sections are being modified and how.

In [None]:
class SectionChange:
    """Represents a change to a US Code section."""
    
    def __init__(self, title: int, section: str, change_type: str, 
                 old_text: Optional[str] = None, new_text: Optional[str] = None):
        self.title = title
        self.section = section
        self.change_type = change_type  # 'amended', 'added', 'repealed'
        self.old_text = old_text
        self.new_text = new_text
    
    def __repr__(self):
        return f"SectionChange({self.title} USC § {self.section}, {self.change_type})"

def extract_section_changes(text: str, title: int = 17) -> List[SectionChange]:
    """
    Extract section changes from law text.
    For PL 94-553, we focus on Title 17 (Copyright).
    """
    changes = []
    
    # Pattern 1: Section X is amended
    amended_pattern = r'Section\s+(\d+[A-Za-z]?).*?is amended'
    for match in re.finditer(amended_pattern, text, re.IGNORECASE):
        section = match.group(1)
        changes.append(SectionChange(
            title=title,
            section=section,
            change_type='amended'
        ))
    
    # Pattern 2: Section X is repealed
    repealed_pattern = r'Section\s+(\d+[A-Za-z]?).*?is(?:\s+hereby)?\s+repealed'
    for match in re.finditer(repealed_pattern, text, re.IGNORECASE):
        section = match.group(1)
        changes.append(SectionChange(
            title=title,
            section=section,
            change_type='repealed'
        ))
    
    return changes

if law_text:
    print("\n=== Extracting Section Changes ===")
    section_changes = extract_section_changes(law_text, title=17)
    
    print(f"Found {len(section_changes)} section changes")
    
    # Group by change type
    from collections import Counter
    change_types = Counter([c.change_type for c in section_changes])
    print(f"\nBreakdown by change type:")
    for change_type, count in change_types.items():
        print(f"  {change_type}: {count}")
    
    # Show first 20 changes
    print(f"\nFirst 20 section changes:")
    for change in section_changes[:20]:
        print(f"  {change}")
else:
    print("No law text available for extraction")
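
One limitation of the patterns in `extract_section_changes` is that `.*?` can run from one section reference to a later "is amended" on the same line, which attributes the change to the wrong section. For the made-up sample below, the amended pattern would report section 5 as amended and miss section 6. The sketch shows one possible tightening: the gap between the section number and the verb may not contain another "Section N" reference. `extract_changes` is a hypothetical name and this is not a full replacement for the function above.

```python
import re
from typing import List, Tuple

# After the section number, consume characters lazily but stop if another
# "Section <number>" reference begins, so a match cannot span two references.
_GAP = r"(?:(?!\bSection\s+\d).)*?"
CHANGE_RE = re.compile(
    rf"\bSection\s+(\d+[A-Za-z]?)\b{_GAP}"
    rf"\bis(?:\s+hereby)?\s+(amended|repealed)\b",
    re.IGNORECASE | re.DOTALL,
)


def extract_changes(text: str) -> List[Tuple[str, str]]:
    """Return (section, change_type) pairs in document order."""
    return [(m.group(1), m.group(2).lower()) for m in CHANGE_RE.finditer(text)]


# Made-up sample, not quoted from PL 94-553.
sample = (
    "Section 5 is hereby repealed. "
    "Section 6 of title 17 is amended by adding a new subsection."
)
print(extract_changes(sample))
```

The trade-off is that a sentence which mentions a second section in passing before the verb is attributed to that second section, so results still need review.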

## Step 6: Fetch Original US Code Text

Generating a diff requires the section text as it stood before the law was enacted.
For PL 94-553 (1976), that means the 1975 version of Title 17, which this prototype does not fetch.

In [None]:
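# For this prototype, we demonstrate the concept with a mock example.
# In production, we'd fetch historical US Code text from the USCODE collection or OLRC.

# Mock example based on Section 106 of Title 17. The two texts are illustrative
# only and are not the actual pre- and post-enactment wording: clause (6), for
# instance, was added by a later law, not by PL 94-553.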
MOCK_OLD_TEXT = """§ 106. Exclusive rights in copyrighted works

Subject to sections 107 through 120, the owner of copyright under this title 
has the exclusive rights to do and to authorize any of the following:

(1) to reproduce the copyrighted work in copies or phonorecords;
(2) to prepare derivative works based upon the copyrighted work;
(3) to distribute copies or phonorecords of the copyrighted work to the public 
    by sale or other transfer of ownership, or by rental, lease, or lending;
"""

MOCK_NEW_TEXT = """§ 106. Exclusive rights in copyrighted works

Subject to sections 107 through 122, the owner of copyright under this title 
has the exclusive rights to do and to authorize any of the following:

(1) to reproduce the copyrighted work in copies or phonorecords;
(2) to prepare derivative works based upon the copyrighted work;
(3) to distribute copies or phonorecords of the copyrighted work to the public 
    by sale or other transfer of ownership, or by rental, lease, or lending;
(4) in the case of literary, musical, dramatic, and choreographic works, 
    pantomimes, and motion pictures and other audiovisual works, to perform 
    the copyrighted work publicly;
(5) in the case of literary, musical, dramatic, and choreographic works, 
    pantomimes, and pictorial, graphic, or sculptural works, including the 
    individual images of a motion picture or other audiovisual work, to display 
    the copyrighted work publicly;
(6) in the case of sound recordings, to perform the copyrighted work publicly 
    by means of a digital audio transmission.
"""

print("\n=== Mock Example: Section 106 Changes ===")
print("\nOld Text (mock 'before' version):")
print(MOCK_OLD_TEXT)
print("\nNew Text (mock 'after' version):")
print(MOCK_NEW_TEXT)

## Step 7: Generate Diff

Create unified diff showing changes between old and new text.

In [None]:
def generate_diff(old_text: str, new_text: str,
                  section_ref: str = "17 USC § 106") -> List[str]:
    """
    Generate unified diff between old and new section text.
    Returns the diff lines without trailing newlines.
    """
    # Split without line endings and pass lineterm='' so that header lines
    # and content lines are terminated the same way (neither has a newline).
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    
    diff = list(unified_diff(
        old_lines,
        new_lines,
        fromfile=f"{section_ref} (before PL 94-553)",
        tofile=f"{section_ref} (after PL 94-553)",
        lineterm=''
    ))
    
    return diff

# Generate diff
diff_output = generate_diff(MOCK_OLD_TEXT, MOCK_NEW_TEXT)

print("\n=== Unified Diff Output ===")
print('\n'.join(diff_output))
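
A line-level diff marks a whole line as changed even when only a cross-reference differs, as with "120" and "122" in the mock texts. The sketch below shows a word-level comparison built on `difflib.SequenceMatcher`. `word_changes` is a hypothetical helper, and splitting on whitespace is a simplifying assumption that keeps punctuation attached to words.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def word_changes(old_line: str, new_line: str) -> List[Tuple[str, str, str]]:
    """Return (operation, old_words, new_words) for each non-equal span."""
    old_words = old_line.split()
    new_words = new_line.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changes.append(
                (tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2]))
            )
    return changes


old = "Subject to sections 107 through 120, the owner of copyright under this title"
new = "Subject to sections 107 through 122, the owner of copyright under this title"
print(word_changes(old, new))
```

The operation is one of `replace`, `delete`, or `insert`, which lines up with the strike, repeal, and add vocabulary of the amendments themselves.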

In [None]:
def analyze_diff_statistics(old_text: str, new_text: str) -> Dict:
    """
    Calculate statistics about the changes.
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    
    # Drop the two file-header lines ('---' and '+++') so they are not
    # counted as a removed line and an added line.
    diff_lines = list(unified_diff(old_lines, new_lines, lineterm=''))[2:]
    
    lines_added = sum(1 for line in diff_lines if line.startswith('+'))
    lines_removed = sum(1 for line in diff_lines if line.startswith('-'))
    
    stats = {
        'old_line_count': len(old_lines),
        'new_line_count': len(new_lines),
        'lines_added': lines_added,
        'lines_removed': lines_removed,
        'lines_changed': lines_added + lines_removed,
    }
    
    return stats

# Calculate statistics
diff_stats = analyze_diff_statistics(MOCK_OLD_TEXT, MOCK_NEW_TEXT)

print("\n=== Diff Statistics ===")
for key, value in diff_stats.items():
    print(f"{key}: {value}")

## Step 8: Summary and Findings

Document what we learned from this prototype.

In [None]:
print("\n" + "="*70)
print("PROTOTYPE PARSER - KEY FINDINGS")
print("="*70)

print("\n1. LAW METADATA EXTRACTION")
print("   ✓ Can fetch law metadata from GovInfo API")
print("   ✓ Package ID format: PLAW-{congress}publ{number}")
print("   ✓ Metadata includes: title, date, congress, session")

print("\n2. LAW TEXT AVAILABILITY")
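# Report the outcome of the fetch in Step 2 rather than asserting success
print("   ✓ HTML text fetched for PL 94-553 (94th Congress)" if law_text
      else "   ⚠ HTML text could not be fetched for PL 94-553 (94th Congress)")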
print("   ⚠ XML (USLM) only available for 113th Congress forward")
print("   → For historical laws, use HTML or text formats")

print("\n3. AMENDMENT PATTERN DETECTION")
print("   ✓ Can identify common patterns:")
print("      - 'Section X is amended'")
print("      - 'striking X and inserting Y'")
print("      - 'adding at the end'")
print("      - 'Section X is repealed'")
print("   ⚠ Complex amendments may require human review")

print("\n4. SECTION CHANGE EXTRACTION")
print("   ✓ Can parse which sections are modified")
print("   ✓ Can classify change type (amended, repealed)")
print("   ⚠ 'added' sections are not detected yet")
print("   ⚠ Extracting exact text changes is challenging")

print("\n5. DIFF GENERATION")
print("   ✓ Can generate unified diffs between versions")
print("   ✓ Can calculate diff statistics (lines added/removed)")
print("   ⚠ Requires access to both old and new text")

print("\n6. CHALLENGES IDENTIFIED")
print("   • Legal language is highly variable and complex")
print("   • Some amendments span multiple sections")
print("   • Need historical US Code text for accurate diffs")
print("   • May require manual review for complex changes")

print("\n7. RECOMMENDATIONS FOR PHASE 1")
print("   1. Focus on modern laws (113th Congress+) with USLM XML")
print("   2. Build pattern library for common amendment types")
print("   3. Implement manual review workflow for complex amendments")
print("   4. Fetch US Code sections before/after to generate diffs")
print("   5. Start with well-documented laws for initial testing")

print("\n" + "="*70)

## Next Steps

1. **Task 0.6**: Build line-level parser for section structure
2. **Task 0.7**: Test on complex nested sections
3. **Task 1.10**: Build production legal language parser
4. **Task 1.11**: Implement diff generation for actual law changes