# GPT Cleaning Pipeline
This notebook uses an OpenAI GPT model (`gpt-4o-mini`) to clean the government entities and law mentions that were previously extracted by regex. The objectives are to:
1. Clean government entities and extract missing ones
2. Clean law mentions and identify self-references
3. Create proper article + law structure

The process runs in 2 phases:
- **Phase 1**: Clean entities and laws
- **Phase 2**: Create article-law mappings using cleaned data


## 📋 Processing Overview

This notebook implements an automated GPT cleaning pipeline with the following objectives:

### 🎯 **Goals:**
1. **Government Entities**: Clean and validate extracted entities, remove false positives, add missing ones
2. **Law Mentions**: Clean law references, identify self-references, standardize formats  
3. **Article-Law Structure**: Create proper mappings between articles and their corresponding laws

### 🔄 **Process Flow:**
- **Input**: Raw regex extractions from entities_extracted_complete.csv, mentions_extracted_complete.csv, identifiers_0_half.csv
- **Phase 1**: Clean entities and laws with GPT validation
- **Verification**: Save and review Phase 1 results  
- **Phase 2**: Create article-law relationships using cleaned data
- **Output**: Clean, structured CSV files ready for knowledge graph construction 


## 📊 Data Sources & Structure

### **Input Files:**
1. **`entities_extracted_complete.csv`**: Government entities and laws found by regex
   - Columns: `[row_id, entity_text, full_context, entity_label, pattern_group]`
   
2. **`mentions_extracted_complete.csv`**: Article mentions found by regex  
   - Columns: `[row_id, matched_text, context_300_chars, pattern_type]`
   
3. **`identifiers_0_half.csv`**: Original legal document text
   - Columns: `[row_id, document_name, document_section_title, text]`

### **Output Files:**
- **`gov_entities_cleaned.csv`**: Validated government entities with context
- **`law_mentions_cleaned.csv`**: Cleaned law mentions with context  
- **`article_law_mappings.csv`**: Article-to-law relationship mappings

## 🔧 Pipeline Features

### **✨ Key Capabilities:**

#### **🏛️ Government Entity Cleaning:**
- Remove vague references ("las autoridades", "el gobierno")
- Keep only concrete institutional names  
- Add missing entities not found by regex
- Preserve original context for validation

#### **⚖️ Law Mention Processing:**
- Clean and standardize law names
- Identify self-references ("esta Ley" → document name)
- Handle regulations ("su reglamento" → "REGLAMENTO DE [LAW]")
- Remove non-official mentions

#### **📑 Article-Law Mapping:**
- Map each article to its corresponding law
- Handle ranges ("artículos 22 y 23" → separate rows)
- Distinguish current document vs external references
- Use cleaned entities for context 
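
The "separate rows" behaviour described for article mentions can be illustrated with a small deterministic sketch. In this notebook the splitting is delegated to GPT through the prompt, so the helper below is not part of the pipeline: `split_article_numbers` is a hypothetical name, and support for ranges written as "10 al 12" or "9 a 11" is an assumption about how ranges appear in the source text. Suffixes such as "Bis" are not handled.

```python
import re

# Either a range ("10 al 12", "9 a 11") or a single number.
# The range alternative is tried first at each position.
_ARTICLE_TOKEN = re.compile(r"(\d+)\s+(?:al|a)\s+(\d+)|(\d+)")


def split_article_numbers(mention):
    """Return every article number found in a mention, in order of appearance."""
    numbers = []
    for match in _ARTICLE_TOKEN.finditer(mention):
        start, end, single = match.groups()
        if single is not None:
            # Plain enumeration item, e.g. the "22" in "artículos 22 y 23"
            numbers.append(single)
        elif int(start) <= int(end):
            # Ascending range: expand to one entry per article
            numbers.extend(str(n) for n in range(int(start), int(end) + 1))
        else:
            # Not a sensible range; keep both numbers as written
            numbers.extend([start, end])
    return numbers


print(split_article_numbers("artículos 22 y 23"))
print(split_article_numbers("artículos 5, 7 y 9 a 11"))
```

A helper like this could serve as a cross-check on the rows GPT returns for Phase 2, rather than as a replacement for the prompt.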

In [1]:
# Load required libraries and data
import os
import pandas as pd
import time
from datetime import datetime
from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables
load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))



In [2]:
# Load the three required CSV files


# Load entities extracted
entities_df = pd.read_csv('/Users/alexa/Projects/cdmx_kg/data/entities_extracted_complete.csv')
print(f"   ✅ Entities: {len(entities_df)} rows")

# Load article mentions extracted
mentions_df = pd.read_csv('/Users/alexa/Projects/cdmx_kg/data/mentions_extracted_complete.csv')
print(f"   ✅ Mentions: {len(mentions_df)} rows")

# Load original identifiers with full text
identifiers_df = pd.read_csv('/Users/alexa/Projects/cdmx_kg/data/identifiers_0_half.csv')
print(f"   ✅ Identifiers: {len(identifiers_df)} rows")

print(f"\n📋 Entities columns: {list(entities_df.columns)}")
print(f"📋 Mentions columns: {list(mentions_df.columns)}")
print(f"📋 Identifiers columns: {list(identifiers_df.columns)}")
print("=" * 70)


   ✅ Entities: 22316 rows
   ✅ Mentions: 3077 rows
   ✅ Identifiers: 11130 rows

📋 Entities columns: ['doc_hash', 'row_id', 'section_title', 'entity_text', 'entity_label', 'pattern_group', 'before_context', 'after_context', 'full_context', 'words_before_count', 'words_after_count']
📋 Mentions columns: ['doc_hash', 'row_id', 'pattern_type', 'document_section_title', 'matched_text', 'context_30_words', 'context_300_chars', 'start_char', 'end_char']
📋 Identifiers columns: ['row_id', 'doc_hash', 'document_name', 'document_section_title', 'text']


In [None]:
# 🛠️ STEP 2: Helper Functions
# ===================================================
# PURPOSE: Define utility functions needed for processing
# FUNCTIONS: clean_csv_output() for parsing GPT responses
# NOTE: Run this cell before the Phase 1 cells, which call clean_csv_output()
# ===================================================

def clean_csv_output(raw_output, expected_columns):
    """Clean and parse GPT's CSV response into list of lists"""
    clean_lines = []
    
    for line in raw_output.strip().splitlines():
        line = line.strip()
        
        # Skip empty lines
        if not line:
            continue
            
        # Skip markdown code blocks
        if line.startswith('```'):
            continue
            
        # Skip headers (case insensitive)
        if any(header in line.lower() for header in ['row_id', 'entity_text', 'law_mention', 'article_mention']):
            continue
            
        # Skip comments and instructions
        if line.startswith('#') or line.startswith('//') or line.startswith('OUTPUT') or line.startswith('EXAMPLE'):
            continue
            
        # Split on commas. A plain split does not understand quoting, so any
        # extra commas are assumed to belong to the last (context) field.
        parts = line.split(',')
        if len(parts) >= expected_columns:
            # Join excess parts back (for commas in context field)
            if len(parts) > expected_columns:
                parts = parts[:expected_columns-1] + [','.join(parts[expected_columns-1:])]
            clean_lines.append(parts)
        # Lines with too few columns are treated as malformed and skipped
    
    return clean_lines

print("✅ Helper functions defined successfully")
print(f"   📋 clean_csv_output() - Parses GPT CSV responses")
print("=" * 70)
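
Because the helper above splits on raw commas, a quoted context such as `"la Secretaría, en coordinación"` keeps its quote characters in the parsed field. A minimal alternative sketch using Python's standard `csv` module is shown here; `parse_csv_rows` is a hypothetical name, it assumes GPT returns one record per line, and it deliberately omits the header and comment filtering done by `clean_csv_output`.

```python
import csv


def parse_csv_rows(raw_output, expected_columns):
    """Parse GPT output line by line with csv.reader so quoted commas survive."""
    rows = []
    for line in raw_output.strip().splitlines():
        line = line.strip()
        # Skip blank lines and markdown code fences
        if not line or line.startswith("```"):
            continue
        # Parse each line on its own so a stray quote cannot swallow later lines
        parts = next(csv.reader([line]))
        if len(parts) < expected_columns:
            continue  # too few columns: treat as malformed
        head = [p.strip() for p in parts[:expected_columns - 1]]
        # Unquoted commas in the last field produce extra parts; rejoin them
        tail = ",".join(parts[expected_columns - 1:]).strip()
        rows.append(head + [tail])
    return rows


sample = (
    "```csv\n"
    'A1,SEDEMA,"la Secretaría, en coordinación"\n'
    "A1,Alcaldía Benito Juárez,en coordinación, con la Alcaldía\n"
    "not a csv line\n"
    "```"
)
for row in parse_csv_rows(sample, 3):
    print(row)
```

The trade-off is that `csv.reader` strips quote characters that wrap a whole field, which is usually what is wanted for the context column.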


In [3]:
# STEP 1: Filter and Prepare Data
print("🔍 Filtering entities by type...")

# Define entity categories (government vs law entities)
gov_entity_labels = [
    'SECRETARIA_GENERAL', 'ALCALDIA_GENERAL', 'FEDERAL_AGENCY', 'ORG_AUTONOMO_FED', 
    'PARAESTATAL_FED', 'ORG_GENERICO', 'UNIVERSITY', 'CDMX_PODER_EJECUTIVO', 
    'CDMX_PODER_LEGISLATIVO', 'CDMX_PODER_JUDICIAL', 'CDMX_ORGANOS_AUTONOMOS',
    'CDMX_ALCALDIAS', 'CDMX_ENTIDADES'
]

law_entity_labels = [
    'LAW_MENTION', 'LAW_CODE', 'REGULATION', 'NOM', 'CONSTITUTION',
    'FEDERAL_LAWS', 'CDMX_LAWS', 'LEGAL_DOCS'
]

# Create the filtered DataFrames (.copy() gives independent frames that are safe to modify later)
gov_entities = entities_df[entities_df['entity_label'].isin(gov_entity_labels)].copy()
law_entities = entities_df[entities_df['entity_label'].isin(law_entity_labels)].copy()

# Summary
print(f"📊 Total entities: {len(entities_df):,}")
print(f"🏛️  Government entities: {len(gov_entities):,} ({len(gov_entities)/len(entities_df)*100:.1f}%)")
print(f"⚖️  Law entities: {len(law_entities):,} ({len(law_entities)/len(entities_df)*100:.1f}%)")
print(f"✅ Data filtering completed!")
print("=" * 70)


🔍 Filtering entities by type...
📊 Total entities: 22,316
🏛️  Government entities: 13,948 (62.5%)
⚖️  Law entities: 4,676 (21.0%)
✅ Data filtering completed!
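
The two groups above account for 18,624 of the 22,316 rows, which leaves 3,692 entities (about 16.5%) whose `entity_label` is in neither list. A small sketch for inspecting those labels follows; `uncategorized_label_counts` is a hypothetical helper, and the `DATE` and `MONEY` labels in the toy data are invented for illustration, not taken from the real file.

```python
import pandas as pd


def uncategorized_label_counts(df, known_labels, label_col="entity_label"):
    """Count rows per label for labels that are not in known_labels."""
    mask = ~df[label_col].isin(known_labels)
    return df.loc[mask, label_col].value_counts(dropna=False)


# Toy data; with the real notebook this would be
# uncategorized_label_counts(entities_df, gov_entity_labels + law_entity_labels)
toy = pd.DataFrame(
    {"entity_label": ["UNIVERSITY", "LAW_CODE", "DATE", "DATE", "MONEY"]}
)
counts = uncategorized_label_counts(toy, ["UNIVERSITY", "LAW_CODE"])
print(counts)
```

Reviewing this list once would confirm whether the remaining labels are intentionally excluded or whether one of them belongs in `gov_entity_labels` or `law_entity_labels`.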


In [4]:
# STEP 2A: Government Entities Prompt (SIMPLIFIED)
# Creates clean, simple prompt for government entity cleaning
def create_prompt_gov_entities(row_id, document_name, section_title, text, gov_entities_for_row):
    """
    Create prompt to clean and validate government entities for a specific article
    """
    entities_text = "\n".join([
        f"- {entity['entity_text']} (Label: {entity['entity_label']}, Context: {entity['full_context'][:100]}...)"
        for _, entity in gov_entities_for_row.iterrows()
    ])
    
    return f"""You are an expert in Mexican government institutions and entities. Your task is to clean and validate government entities extracted from legal documents.

CONTEXT:
- Document: {document_name}
- Section: {section_title}
- Row ID: {row_id}

ORIGINAL TEXT:
{text}

EXTRACTED GOVERNMENT ENTITIES:
{entities_text}

TASK:
1. Check all the government entities listed above and clean their names
2. Corroborate each entity with the context in the original text
3. Look again at the text and verify that no government entity is missing
4. Remove non-institutional mentions such as:
   - Generic references: "las autoridades", "el gobierno" 
   - Descriptive phrases: "la administración pública", "los funcionarios"
   - Conceptual terms: "el sector público", "las instituciones"
   Keep only official named entities: "Secretaría de...", "Instituto de...", "Alcaldía...", etc.
5. Add any missing government entities you find in the text

RULES:
- Keep only REAL government entities (ministries, agencies, institutions, alcaldías, etc.)
- Keep entity names as they appear in the context
- Remove non-specific references and keep only concrete government institutions:
  Remove: "las autoridades", "el gobierno", "la administración pública"
  Keep: "SEDEMA", "Jefatura de Gobierno", "Instituto Nacional Electoral"
- Delete vague references that don't specify a real entity
- Add missing entities that regex might have missed
- Preserve entity names as they appear in the original text
- ALWAYS provide context: Extract the exact phrase from the text where the entity appears
- For existing regex-identified entities: Keep the original context provided
- For new entities found: Include the surrounding text where the entity is mentioned
- NO empty context fields - every row must have context

OUTPUT FORMAT (CSV only, no headers):
row_id,entity_text,context

EXAMPLE OUTPUT:
{row_id},SEDEMA,la Secretaría del Medio Ambiente (SEDEMA)
{row_id},Jefatura de Gobierno de la Ciudad de México,representantes de la Jefatura de Gobierno
{row_id},Alcaldía Benito Juárez,en coordinación con la Alcaldía Benito Juárez
"""


In [5]:
# STEP 2B: Law Mentions Prompt (SIMPLIFIED)
# Creates clean, simple prompt for law mention cleaning
def create_prompt_law_mentions(row_id, document_name, section_title, text, law_entities_for_row):
    """
    Create prompt to clean and validate law mentions for a specific article
    """
    entities_text = "\n".join([
        f"- {entity['entity_text']} (Label: {entity['entity_label']}, Context: {entity['full_context'][:100]}...)"
        for _, entity in law_entities_for_row.iterrows()
    ])
    
    return f"""You are an expert in the Mexican legal system and legislation. Your task is to clean and validate law mentions extracted from legal documents.

CONTEXT:
- Document: {document_name}
- Section: {section_title}
- Row ID: {row_id}

ORIGINAL TEXT:
{text}

EXTRACTED LAW MENTIONS:
{entities_text}

TASK:
1. Take all the law mentions identified by regex and clean them
2. Delete mentions that are NOT official law mentions
3. Identify self-reference mentions (like "esta Ley", "la presente Ley") and set the correct name: {document_name}
4. Identify codes/regulations of a law and set the correct name: "REGLAMENTO DE {document_name}"
5. Look for missing law mentions that regex might not have identified

RULES:
- Keep only REAL law mentions (Constitución, Ley, Código, Reglamento, NOM, etc.)
- Self-references → Use document name: {document_name}
- Regulations → Format as "REGLAMENTO DE {document_name}"
- Complete names when only partial names are mentioned
- Remove vague references like "ley establecerá"
- Add missing official law mentions
- ALWAYS provide context: Extract the exact phrase from the text where the law mention appears
- For existing regex-identified laws: Keep the original context provided
- For self-references: Include the original phrase (e.g., "esta Ley", "la presente Ley")
- For new laws found: Include the surrounding text where the law is mentioned
- NO empty context fields - every row must have context

OUTPUT FORMAT (CSV only, no headers):
row_id,law_mention,context

EXAMPLE OUTPUT:
{row_id},CONSTITUCIÓN POLÍTICA DE LOS ESTADOS UNIDOS MEXICANOS,artículo 27 de la Constitución Política de los Estados Unidos Mexicanos
{row_id},{document_name},conforme a lo establecido en esta Ley
{row_id},REGLAMENTO DE {document_name},su reglamento establecerá los procedimientos
"""
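
The self-reference rule in the prompt above ("esta Ley", "la presente Ley" → document name) can also be expressed as a deterministic check, which may be useful for validating GPT's answers. This is only a sketch: `resolve_self_reference` is a hypothetical helper, and the phrase list beyond the two examples given in the prompt ("el presente", "código", "reglamento") is an assumption.

```python
import re

# Phrases treated as self-references; only "esta Ley" and "la presente Ley"
# come from the prompt, the remaining variants are assumed.
_SELF_REFERENCE = re.compile(
    r"(?:esta|este|la presente|el presente)\s+(?:ley|código|reglamento)",
    re.IGNORECASE,
)


def resolve_self_reference(mention, document_name):
    """Return document_name when the whole mention is a self-reference."""
    if _SELF_REFERENCE.fullmatch(mention.strip()):
        return document_name
    return mention


print(resolve_self_reference("esta Ley", "LEY DE MOVILIDAD"))
print(resolve_self_reference("Ley General de Salud", "LEY DE MOVILIDAD"))
```

`fullmatch` is used so that a longer mention which merely contains one of these phrases is left untouched.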


In [6]:
# STEP 2C: Article-Law Mapping Prompt (SIMPLIFIED)
# Creates clean prompt for article-law relationship mapping
def create_prompt_article_law_with_cleaned_data(row_id, document_name, section_title, text, article_mentions_for_row, cleaned_gov_entities, cleaned_law_mentions):
    """
    Create prompt to map articles with their corresponding laws using cleaned entity and law data
    """
    mentions_text = "\n".join([
        f"- {mention['matched_text']} (Context: {mention['context_300_chars'][:150]}...)"
        for _, mention in article_mentions_for_row.iterrows()
    ])
    
    # Include cleaned entities and laws for better context
    cleaned_entities_text = ""
    if not cleaned_gov_entities.empty:
        cleaned_entities_text = "\nCLEANED GOVERNMENT ENTITIES:\n" + "\n".join([
            f"- {entity['entity_text']} (Context: {entity['context'][:100]}...)"
            for _, entity in cleaned_gov_entities.iterrows()
        ])
    
    cleaned_laws_text = ""
    if not cleaned_law_mentions.empty:
        cleaned_laws_text = "\nCLEANED LAW MENTIONS:\n" + "\n".join([
            f"- {law['law_mention']} (Context: {law['context'][:100]}...)"
            for _, law in cleaned_law_mentions.iterrows()
        ])
    
    return f"""You are an expert in Mexican legal citations and cross-references. Your task is to create the correct structure for article + law relationships using cleaned entity data.

CONTEXT:
- Document: {document_name}
- Section: {section_title}
- Row ID: {row_id}

ORIGINAL TEXT:
{text}

EXTRACTED ARTICLE MENTIONS:
{mentions_text}
{cleaned_entities_text}
{cleaned_laws_text}

TASK:
1. Identify all articles extracted by regex and map what law corresponds to each article
2. Read the complete text to determine which law each article belongs to
3. Use the cleaned law mentions above to help identify the correct law names
4. If the article is about the current legal document, insert the current law name: {document_name}
5. If the article is about another law, include the article number and corresponding law
6. Create one row per article mention
7. If there is a range of articles (e.g., "artículos 22 y 23"), create one row per number
8. Check the text again for missing article mentions

RULES:
- For current document references → law_mention = {document_name}
- For external law references → law_mention = full law name (use cleaned law mentions when available)
- Split ranges: "artículos 22 y 23" = two rows (22, 23)
- Use complete law names from the cleaned data or context
- Article numbers should be just the number (e.g., "22", not "artículo 22")

OUTPUT FORMAT (CSV only, no headers):
row_id,article_mention,law_mention

EXAMPLE OUTPUT:
{row_id},22,{document_name}
{row_id},23,{document_name}
{row_id},44,CONSTITUCIÓN POLÍTICA DE LOS ESTADOS UNIDOS MEXICANOS
{row_id},27,CONSTITUCIÓN POLÍTICA DE LOS ESTADOS UNIDOS MEXICANOS
"""


In [7]:
# PHASE 1A: Government Entities Cleaning
print("🏛️ Starting GOVERNMENT ENTITIES cleaning (Phase 1A)...")

# Configuration for testing/production
BATCH_SIZE = 100  # Process first 100 articles for testing
# BATCH_SIZE = None  # Uncomment for full processing

# Get articles to process
if BATCH_SIZE:
    unique_row_ids = identifiers_df['row_id'].unique()[:BATCH_SIZE]
    print(f"🧪 TESTING MODE: Processing first {BATCH_SIZE} articles")
else:
    unique_row_ids = identifiers_df['row_id'].unique()
    print(f"🏭 PRODUCTION MODE: Processing all {len(unique_row_ids)} articles")

# Initialize collectors
gov_entities_cleaned = []
processed_count = 0
errors_count = 0
skipped_count = 0
start_time = time.time()

print(f"📊 Articles to process: {len(unique_row_ids)}")
print("=" * 50)

for i, row_id in enumerate(unique_row_ids):
    processed_count += 1
    
    # Progress indicator
    if processed_count % 10 == 0:
        elapsed = time.time() - start_time
        rate = processed_count / elapsed if elapsed > 0 else 0
        eta = (len(unique_row_ids) - processed_count) / rate if rate > 0 else 0
        print(f"📈 Progress: {processed_count}/{len(unique_row_ids)} | Rate: {rate:.1f}/s | ETA: {eta:.0f}s | Errors: {errors_count}")
    
    try:
        # Get article info
        article_info = identifiers_df[identifiers_df['row_id'] == row_id].iloc[0]
        document_name = article_info['document_name']
        section_title = article_info['document_section_title']
        text = article_info['text']
        
        # Get government entities for this article (with deduplication)
        gov_entities_raw = gov_entities[gov_entities['row_id'] == row_id]
        
        if len(gov_entities_raw) == 0:
            skipped_count += 1
            continue
            
        # Deduplicate
        gov_entities_unique = gov_entities_raw.drop_duplicates(subset=['entity_text'], keep='first')
        
        # Process with GPT
        try:
            prompt = create_prompt_gov_entities(row_id, document_name, section_title, text, gov_entities_unique)
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.1
            )
            output = resp.choices[0].message.content.strip()
            cleaned_results = clean_csv_output(output, 3)
            gov_entities_cleaned.extend(cleaned_results)
            
        except Exception as e:
            print(f"    ❌ API Error for {row_id}: {e}")
            errors_count += 1
        
        # Reduced delay for faster processing
        time.sleep(0.1)
        
    except Exception as e:
        print(f"❌ Critical error for {row_id}: {e}")
        errors_count += 1

# Results
elapsed_time = time.time() - start_time
print(f"\n✅ GOVERNMENT ENTITIES CLEANING COMPLETED!")
print(f"⏱️  Total time: {elapsed_time:.1f} seconds")
print(f"📊 Processed: {processed_count} articles")
print(f"⏭️  Skipped (no entities): {skipped_count}")
print(f"❌ Errors: {errors_count}")
print(f"🏛️  Gov entities cleaned: {len(gov_entities_cleaned)}")
print(f"⚡ Rate: {processed_count/elapsed_time:.1f} articles/second")

# Create DataFrame (passing columns keeps the schema even when no rows were collected)
gov_entities_cleaned_df = pd.DataFrame(gov_entities_cleaned, columns=["row_id", "entity_text", "context"])
print(f"📋 Created DataFrame with {len(gov_entities_cleaned_df)} rows")


🏛️ Starting GOVERNMENT ENTITIES cleaning (Phase 1A)...
🧪 TESTING MODE: Processing first 100 articles
📊 Articles to process: 100
    ❌ API Error for F823AF8C_ARTCULO1: name 'clean_csv_output' is not defined
    ❌ API Error for F823AF8C_ARTCULO5: name 'clean_csv_output' is not defined
    ❌ API Error for F823AF8C_ARTCULO6: name 'clean_csv_output' is not defined


KeyboardInterrupt: 
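
The `name 'clean_csv_output' is not defined` messages above suggest the helper cell had not been executed when the loop started, so each request was still sent to the API before failing at the parsing step. One way to fail earlier is a preflight check before the loop. The sketch below uses a hypothetical helper, `missing_names`, which only inspects a namespace dictionary.

```python
def missing_names(namespace, required):
    """Return the required names that are absent or not callable in namespace."""
    return [name for name in required if not callable(namespace.get(name))]


# Toy namespace; in the notebook this would be globals()
toy_namespace = {"create_prompt_gov_entities": len, "BATCH_SIZE": 100}
missing = missing_names(
    toy_namespace, ["create_prompt_gov_entities", "clean_csv_output"]
)
print(missing)

# In a notebook cell one might stop before any API call is made:
# if missing:
#     raise NameError(f"Run the helper cells first; missing: {missing}")
```

Placed at the top of each Phase 1 cell, a check like this avoids spending API calls on responses that cannot be parsed.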

In [None]:
# 🚀 PHASE 1B: Law Mentions Cleaning
print("⚖️ Starting LAW MENTIONS cleaning (Phase 1B)...")

# Use same configuration as Phase 1A
if 'BATCH_SIZE' not in locals():
    BATCH_SIZE = 100

# Get articles to process (same as government entities)
if BATCH_SIZE:
    unique_row_ids = identifiers_df['row_id'].unique()[:BATCH_SIZE]
    print(f"🧪 TESTING MODE: Processing first {BATCH_SIZE} articles")
else:
    unique_row_ids = identifiers_df['row_id'].unique()
    print(f"🏭 PRODUCTION MODE: Processing all {len(unique_row_ids)} articles")

# Initialize collectors
law_mentions_cleaned = []
processed_count = 0
errors_count = 0
skipped_count = 0
start_time = time.time()

print(f"📊 Articles to process: {len(unique_row_ids)}")
print("=" * 50)

for i, row_id in enumerate(unique_row_ids):
    processed_count += 1
    
    # Progress indicator
    if processed_count % 10 == 0:
        elapsed = time.time() - start_time
        rate = processed_count / elapsed if elapsed > 0 else 0
        eta = (len(unique_row_ids) - processed_count) / rate if rate > 0 else 0
        print(f"📈 Progress: {processed_count}/{len(unique_row_ids)} | Rate: {rate:.1f}/s | ETA: {eta:.0f}s")
    
    try:
        # Get article info
        article_info = identifiers_df[identifiers_df['row_id'] == row_id].iloc[0]
        document_name = article_info['document_name']
        section_title = article_info['document_section_title']
        text = article_info['text']
        
        # Get law entities for this article
        law_entities_raw = law_entities[law_entities['row_id'] == row_id]
        
        if len(law_entities_raw) == 0:
            skipped_count += 1
            continue
            
        # Deduplicate
        law_entities_unique = law_entities_raw.drop_duplicates(subset=['entity_text'], keep='first')
        
        # Process with GPT
        try:
            prompt = create_prompt_law_mentions(row_id, document_name, section_title, text, law_entities_unique)
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.1
            )
            output = resp.choices[0].message.content.strip()
            cleaned_results = clean_csv_output(output, 3)
            law_mentions_cleaned.extend(cleaned_results)
            
        except Exception as e:
            print(f"    ❌ API Error for {row_id}: {e}")
            errors_count += 1
        
        # Reduced delay
        time.sleep(0.1)
        
    except Exception as e:
        print(f"❌ Critical error for {row_id}: {e}")
        errors_count += 1

# Results
elapsed_time = time.time() - start_time
print(f"\n✅ LAW MENTIONS CLEANING COMPLETED!")
print(f"⏱️  Total time: {elapsed_time:.1f} seconds")
print(f"📊 Processed: {processed_count} articles")
print(f"⏭️  Skipped (no entities): {skipped_count}")
print(f"❌ Errors: {errors_count}")
print(f"⚖️  Law mentions cleaned: {len(law_mentions_cleaned)}")
print(f"⚡ Rate: {processed_count/elapsed_time:.1f} articles/second")

# Create DataFrame (passing columns keeps the schema even when no rows were collected)
law_mentions_cleaned_df = pd.DataFrame(law_mentions_cleaned, columns=["row_id", "law_mention", "context"])
print(f"📋 Created DataFrame with {len(law_mentions_cleaned_df)} rows")
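
Both Phase 1 loops filter `identifiers_df` and the entity frame with a boolean mask on every iteration. For the full run, precomputing the lookups once may be faster. The following is a sketch under the assumption that `row_id` identifies one article; `build_lookups` is a hypothetical helper and the toy frames only mimic the relevant columns.

```python
import pandas as pd


def build_lookups(identifiers_df, entities_df):
    """Index articles by row_id and group entities by row_id in one pass."""
    # Keep the first row per row_id, matching the .iloc[0] used in the loops
    articles = identifiers_df.drop_duplicates(subset="row_id").set_index("row_id")
    groups = {
        row_id: group
        for row_id, group in entities_df.groupby("row_id", sort=False)
    }
    return articles, groups


identifiers_toy = pd.DataFrame({
    "row_id": ["A1", "A2"],
    "document_name": ["LEY X", "LEY X"],
    "document_section_title": ["Artículo 1", "Artículo 2"],
    "text": ["texto 1", "texto 2"],
})
entities_toy = pd.DataFrame({
    "row_id": ["A1", "A1"],
    "entity_text": ["SEDEMA", "Jefatura de Gobierno"],
})

articles, groups = build_lookups(identifiers_toy, entities_toy)
for row_id in articles.index:
    rows = groups.get(row_id)  # None means no entities, so the article is skipped
    print(row_id, articles.loc[row_id, "document_name"], 0 if rows is None else len(rows))
```

The loop body would then read `articles.loc[row_id]` and `groups.get(row_id)` instead of filtering the full frames each time.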


In [None]:
# 📊 COMBINE & SAVE Phase 1 Results
print("📋 Combining Phase 1A and 1B results...")

# Ensure DataFrames exist
if 'gov_entities_cleaned_df' not in locals():
    gov_entities_cleaned_df = pd.DataFrame(columns=["row_id", "entity_text", "context"])
    print("⚠️  No government entities DataFrame found")

if 'law_mentions_cleaned_df' not in locals():
    law_mentions_cleaned_df = pd.DataFrame(columns=["row_id", "law_mention", "context"])
    print("⚠️  No law mentions DataFrame found")

# Save results
if not gov_entities_cleaned_df.empty:
    gov_output = '/Users/alexa/Projects/cdmx_kg/data/gov_entities_phase1_cleaned.csv'
    gov_entities_cleaned_df.to_csv(gov_output, index=False, encoding='utf-8-sig')
    print(f"✅ Government entities saved: {gov_output}")
    print(f"   📊 {len(gov_entities_cleaned_df)} rows, {gov_entities_cleaned_df['entity_text'].nunique()} unique entities")

if not law_mentions_cleaned_df.empty:
    law_output = '/Users/alexa/Projects/cdmx_kg/data/law_mentions_phase1_cleaned.csv'
    law_mentions_cleaned_df.to_csv(law_output, index=False, encoding='utf-8-sig')
    print(f"✅ Law mentions saved: {law_output}")
    print(f"   📊 {len(law_mentions_cleaned_df)} rows, {law_mentions_cleaned_df['law_mention'].nunique()} unique laws")

# Create combined summary
total_entities = len(gov_entities_cleaned_df) + len(law_mentions_cleaned_df)
print(f"\n🎯 PHASE 1 SUMMARY:")
print(f"   🏛️  Government entities: {len(gov_entities_cleaned_df)}")
print(f"   ⚖️  Law mentions: {len(law_mentions_cleaned_df)}")
print(f"   📊 Total cleaned entities: {total_entities}")

if total_entities > 0:
    print(f"\n✅ Ready for Phase 2: Article-Law mapping!")
    print(f"📁 Input files created for Phase 2 verification")
else:
    print(f"\n⚠️  No entities found - check your data or increase BATCH_SIZE")

print("=" * 70)


In [12]:
# Helper function for deduplication
def deduplicate_entities_per_article(gov_entities_raw, law_entities_raw, row_id, processed_count):
    """
    Deduplicate entities per article to avoid redundant processing
    """
    # DEDUPLICATION: Keep only unique entities per article to avoid redundant processing
    gov_entities_unique = gov_entities_raw.drop_duplicates(subset=['entity_text'], keep='first')
    law_entities_unique = law_entities_raw.drop_duplicates(subset=['entity_text'], keep='first')
    
    # Log deduplication results (show occasionally to avoid spam)
    if len(gov_entities_raw) != len(gov_entities_unique) and processed_count % 50 == 0:
        removed_gov = len(gov_entities_raw) - len(gov_entities_unique)
        print(f"    🔄 Deduplicated {removed_gov} duplicate gov entities for {row_id}")
    
    if len(law_entities_raw) != len(law_entities_unique) and processed_count % 50 == 0:
        removed_law = len(law_entities_raw) - len(law_entities_unique)
        print(f"    🔄 Deduplicated {removed_law} duplicate law entities for {row_id}")
    
    return gov_entities_unique, law_entities_unique

print("Deduplication helper function loaded")


Deduplication helper function loaded
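
`drop_duplicates(subset=['entity_text'])` compares strings exactly, so variants that differ only in case or spacing are kept as separate entities and sent to GPT separately. If that turns out to matter, a normalized key could be used instead. This is a sketch, not the notebook's method: `drop_near_duplicate_entities` is a hypothetical helper, and case-folding plus whitespace collapsing is an assumed definition of "same entity".

```python
import pandas as pd


def drop_near_duplicate_entities(df, text_col="entity_text"):
    """Keep the first row for each entity text, ignoring case and extra spaces."""
    key = (
        df[text_col]
        .astype(str)
        .str.split()       # split on any whitespace run
        .str.join(" ")     # rejoin with single spaces
        .str.casefold()    # case-insensitive comparison key
    )
    return df.loc[~key.duplicated(keep="first")]


toy_entities = pd.DataFrame({
    "entity_text": ["SEDEMA", "sedema ", "Jefatura  de Gobierno", "Jefatura de Gobierno"]
})
deduped = drop_near_duplicate_entities(toy_entities)
print(deduped["entity_text"].tolist())
```

The original spelling of the first occurrence is preserved, which keeps the entity text aligned with its context in the source.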


In [13]:
# PROMPT 3: Article + law structure creation
def create_prompt_article_law(row_id, document_name, section_title, text, article_mentions_for_row):
    """
    Create prompt to map articles with their corresponding laws
    """
    mentions_text = "\n".join([
        f"- {mention['matched_text']} (Context: {mention['context_300_chars'][:150]}...)"
        for _, mention in article_mentions_for_row.iterrows()
    ])
    
    return f"""You are an expert in Mexican legal citations and cross-references. Your task is to create the correct structure for article + law relationships.

CONTEXT:
- Document: {document_name}
- Section: {section_title}
- Row ID: {row_id}

ORIGINAL TEXT:
{text}

EXTRACTED ARTICLE MENTIONS:
{mentions_text}

TASK:
1. Identify all articles extracted by regex and map what law corresponds to each article
2. Read the complete text to determine which law each article belongs to
3. If the article is about the current legal document, insert the current law name
4. If the article is about another law, include the article number and corresponding law
5. Create one row per article mention
6. If there is a range of articles (e.g., "artículos 22 y 23"), create one row per number
7. Check the text again for missing article mentions

RULES:
- For current document references → law_mention = current law name
- For external law references → law_mention = full law name
- Split ranges: "artículos 22 y 23" = two rows (22, 23)
- Use complete law names from the context
- Article numbers should be just the number (e.g., "22", not "artículo 22")

OUTPUT FORMAT (CSV only, no headers):
row_id,article_mention,law_mention

EXAMPLE OUTPUT:
{row_id},22,{document_name}
{row_id},23,{document_name}
{row_id},44,CONSTITUCIÓN POLÍTICA DE LOS ESTADOS UNIDOS MEXICANOS
{row_id},27,CONSTITUCIÓN POLÍTICA DE LOS ESTADOS UNIDOS MEXICANOS
"""


In [14]:
# Helper function to clean and validate GPT output
# NOTE: this redefines the clean_csv_output() helper defined earlier in the notebook
def clean_csv_output(raw_output, expected_columns):
    """Clean and validate CSV output from GPT, filtering out artifacts and malformed entries"""
    clean_lines = []
    for line in raw_output.strip().splitlines():
        line = line.strip()
        
        # Skip empty lines
        if not line:
            continue
            
        # Skip markdown code blocks
        if line.startswith('```'):
            continue
            
        # Skip headers (case insensitive)
        if any(header in line.lower() for header in ['row_id', 'entity_text', 'law_mention', 'article_mention']):
            continue
            
        # Skip lines that look like instructions or comments
        if line.startswith('#') or line.startswith('//') or line.startswith('OUTPUT') or line.startswith('EXAMPLE'):
            continue
            
        # Must contain commas for CSV format
        if ',' not in line:
            continue
            
        # Check that the line has at least the expected number of columns
        parts = line.split(',')
        if len(parts) < expected_columns:
            print(f"    Skipping malformed line (too few columns): {line[:50]}...")
            continue

        # Commas inside the last field create extra parts; rejoin them so the
        # field is not truncated at its first comma
        if len(parts) > expected_columns:
            parts = parts[:expected_columns-1] + [','.join(parts[expected_columns-1:])]

        # Clean each part: remove surrounding quotes and extra spaces
        cleaned_parts = [part.strip().strip('"').strip("'") for part in parts]
            
        # Skip if essential fields are empty (first column should not be empty)
        if not cleaned_parts[0]:
            print(f"    Skipping line with empty row_id: {line[:50]}...")
            continue
            
        clean_lines.append(cleaned_parts)
    
    return clean_lines


In [15]:
# SAFETY CHECK: Create each DataFrame if it doesn't exist (checked independently,
# so a missing law DataFrame is rebuilt even when the government one exists)
if 'gov_entities_cleaned_df' not in globals():
    print("⚠️  Creating gov_entities_cleaned_df from Phase 1 results...")
    gov_entities_cleaned_df = pd.DataFrame(
        globals().get('gov_entities_cleaned', []),
        columns=["row_id", "entity_text", "context"]
    )

if 'law_mentions_cleaned_df' not in globals():
    print("⚠️  Creating law_mentions_cleaned_df from Phase 1 results...")
    law_mentions_cleaned_df = pd.DataFrame(
        globals().get('law_mentions_cleaned', []),
        columns=["row_id", "law_mention", "context"]
    )

In [17]:
# Save intermediate results from Phase 1 for verification
print("💾 Saving Phase 1 results for verification...")

# Save government entities cleaned (Phase 1)
if not gov_entities_cleaned_df.empty:
    gov_entities_phase1_output = '/Users/alexa/Projects/cdmx_kg/data/gov_entities_phase1_cleaned.csv'
    gov_entities_cleaned_df.to_csv(gov_entities_phase1_output, index=False, encoding='utf-8-sig')
    print(f"   ✅ Phase 1 Government entities saved to: {gov_entities_phase1_output}")
    print(f"      📊 Total entities: {len(gov_entities_cleaned_df)}")
    print(f"      📊 Unique entities: {gov_entities_cleaned_df['entity_text'].nunique()}")
    
    # Show sample
    print("Sample government entities:")
    sample_gov = gov_entities_cleaned_df.head(3)
    for _, row in sample_gov.iterrows():
        print(f"{row['row_id']}: {row['entity_text']} | Context: {row['context'][:50]}...")
else:
    print("   ⚠️  No government entities to save from Phase 1")

# Save law mentions cleaned (Phase 1) 
if not law_mentions_cleaned_df.empty:
    law_mentions_phase1_output = '/Users/alexa/Projects/cdmx_kg/data/law_mentions_phase1_cleaned.csv'
    law_mentions_cleaned_df.to_csv(law_mentions_phase1_output, index=False, encoding='utf-8-sig')
    print(f" Phase 1 Law mentions saved to: {law_mentions_phase1_output}")
    print(f" Total mentions: {len(law_mentions_cleaned_df)}")
    print(f" Unique laws: {law_mentions_cleaned_df['law_mention'].nunique()}")
    
    # Show sample
    print("Sample law mentions:")
    sample_law = law_mentions_cleaned_df.head(3)
    for _, row in sample_law.iterrows():
        print(f"{row['row_id']}: {row['law_mention']} | Context: {row['context'][:50]}...")
else:
    print(" No law mentions to save from Phase 1")




💾 Saving Phase 1 results for verification...
   ⚠️  No government entities to save from Phase 1
 No law mentions to save from Phase 1


In [13]:
# PHASE 1: Entity and Law Cleaning
print("🚀 Starting GPT processing - PHASE 1: Entity and Law Cleaning...")

# Initialize result collectors for Phase 1
gov_entities_cleaned = []
law_mentions_cleaned = []

# Get unique row_ids from identifiers_df to process
unique_row_ids = identifiers_df['row_id'].unique()
total_articles = len(unique_row_ids)
print(f"📊 Total articles to process: {total_articles}")

# Processing configuration
processed_count = 0
errors_count = 0

for i, row_id in enumerate(unique_row_ids):
    processed_count += 1
    
    # Progress indicator
    if processed_count % 10 == 0 or processed_count == total_articles:
        progress = (processed_count / total_articles) * 100
        print(f"📈 Phase 1 Progress: {processed_count}/{total_articles} ({progress:.1f}%) | Errors: {errors_count}")
    
    try:
        # Get article information
        article_info = identifiers_df[identifiers_df['row_id'] == row_id].iloc[0]
        document_name = article_info['document_name']
        section_title = article_info['document_section_title']
        text = article_info['text']
        
        # Get related entities and mentions for this row_id
        #gov_entities_for_row = gov_entities[gov_entities['row_id'] == row_id]
        #law_entities_for_row = law_entities[law_entities['row_id'] == row_id]

        gov_entities_for_row_raw = gov_entities[gov_entities['row_id'] == row_id]
        law_entities_for_row_raw = law_entities[law_entities['row_id'] == row_id]
        gov_entities_for_row, law_entities_for_row = deduplicate_entities_per_article(
            gov_entities_for_row_raw, law_entities_for_row_raw, row_id, processed_count
        )
        


        # PROCESS 1: Government entities cleaning (only if entities exist)
        if len(gov_entities_for_row) > 0:
            try:
                prompt1 = create_prompt_gov_entities(row_id, document_name, section_title, text, gov_entities_for_row)
                resp1 = client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[{"role": "user", "content": prompt1}],
                    temperature=0.1
                )
                output1 = resp1.choices[0].message.content.strip()
                cleaned_gov = clean_csv_output(output1, 3)
                gov_entities_cleaned.extend(cleaned_gov)
            except Exception as e:
                print(f"    ❌ Error processing gov entities for {row_id}: {e}")
                errors_count += 1
        
        # PROCESS 2: Law mentions cleaning (only if entities exist)
        if len(law_entities_for_row) > 0:
            try:
                prompt2 = create_prompt_law_mentions(row_id, document_name, section_title, text, law_entities_for_row)
                resp2 = client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[{"role": "user", "content": prompt2}],
                    temperature=0.1
                )
                output2 = resp2.choices[0].message.content.strip()
                cleaned_laws = clean_csv_output(output2, 3)
                law_mentions_cleaned.extend(cleaned_laws)
            except Exception as e:
                print(f"    ❌ Error processing law mentions for {row_id}: {e}")
                errors_count += 1
        
        # Small delay to avoid rate limits
        time.sleep(0.2)
        
    except Exception as e:
        print(f"❌ Critical error processing {row_id}: {e}")
        errors_count += 1
        continue

print(f"\n✅ PHASE 1 completed!")
print(f"📊 Total articles processed: {processed_count}")
print(f"❌ Total errors: {errors_count}")
print(f"🏛️  Government entities cleaned: {len(gov_entities_cleaned)}")
print(f"⚖️  Law mentions cleaned: {len(law_mentions_cleaned)}")

# Convert to DataFrames for Phase 2 (an empty list still yields the expected columns)
gov_entities_cleaned_df = pd.DataFrame(gov_entities_cleaned, columns=["row_id", "entity_text", "context"])
law_mentions_cleaned_df = pd.DataFrame(law_mentions_cleaned, columns=["row_id", "law_mention", "context"])

print("=" * 70)


🚀 Starting GPT processing - PHASE 1: Entity and Law Cleaning...
📊 Total articles to process: 10726
📈 Phase 1 Progress: 10/10726 (0.1%) | Errors: 0
📈 Phase 1 Progress: 20/10726 (0.2%) | Errors: 0


KeyboardInterrupt: 
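
The run above stopped after roughly 20 of 10,726 articles, and the cleaned rows exist only in in-memory lists, so a kernel restart would lose them. One possible way to make the loop resumable is to append results to a checkpoint CSV as they arrive and skip `row_id`s already recorded. This is a minimal sketch, not part of the original notebook: `load_done_ids`, `append_rows`, and the checkpoint path are hypothetical names introduced for illustration.

```python
import csv
import os

def load_done_ids(checkpoint_path):
    """Return the set of row_ids already written to the checkpoint file (hypothetical helper)."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, newline="", encoding="utf-8") as f:
        return {row["row_id"] for row in csv.DictReader(f)}

def append_rows(checkpoint_path, rows, columns):
    """Append cleaned rows to the checkpoint, writing the header only once (hypothetical helper)."""
    write_header = not os.path.exists(checkpoint_path)
    with open(checkpoint_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(columns)
        writer.writerows(rows)
```

Inside the loop, one would call `append_rows(...)` right after `clean_csv_output(...)` and `continue` when `str(row_id)` is in the set returned by `load_done_ids(...)`. Two caveats under these assumptions: ids come back from the CSV as strings, and an article whose cleaning returns no rows is not recorded, so it would be retried on the next run.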

In [None]:
# PHASE 2: Article + Law Structure Creation using cleaned results
print("🚀 Starting GPT processing - PHASE 2: Article + Law Structure Creation...")

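# Check that both Phase 1 DataFrames exist; create an empty one for each that is missing
if 'gov_entities_cleaned_df' not in locals():
    print("⚠️  gov_entities_cleaned_df not found. Creating an empty DataFrame...")
    gov_entities_cleaned_df = pd.DataFrame(columns=["row_id", "entity_text", "context"])
if 'law_mentions_cleaned_df' not in locals():
    print("⚠️  law_mentions_cleaned_df not found. Creating an empty DataFrame...")
    law_mentions_cleaned_df = pd.DataFrame(columns=["row_id", "law_mention", "context"])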

# Initialize result collectors for Phase 2
article_law_mappings = []

# Reset counters
processed_count = 0
errors_count = 0

for i, row_id in enumerate(unique_row_ids):
    processed_count += 1
    
    # Progress indicator
    if processed_count % 10 == 0 or processed_count == total_articles:
        progress = (processed_count / total_articles) * 100
        print(f"📈 Phase 2 Progress: {processed_count}/{total_articles} ({progress:.1f}%) | Errors: {errors_count}")
    
    try:
        # Get article information
        article_info = identifiers_df[identifiers_df['row_id'] == row_id].iloc[0]
        document_name = article_info['document_name']
        section_title = article_info['document_section_title']
        text = article_info['text']
        
        # Get article mentions for this row_id
        article_mentions_for_row = mentions_df[mentions_df['row_id'] == row_id]
        
        # Get cleaned entities and laws for this row_id (from Phase 1 results)
        cleaned_gov_for_row = gov_entities_cleaned_df[gov_entities_cleaned_df['row_id'] == row_id] if not gov_entities_cleaned_df.empty else pd.DataFrame()
        cleaned_laws_for_row = law_mentions_cleaned_df[law_mentions_cleaned_df['row_id'] == row_id] if not law_mentions_cleaned_df.empty else pd.DataFrame()
        
        # PROCESS 3: Article + law structure (only if mentions exist)
        if len(article_mentions_for_row) > 0:
            try:
                # Update the prompt to include cleaned results
                prompt3 = create_prompt_article_law_with_cleaned_data(
                    row_id, document_name, section_title, text, 
                    article_mentions_for_row, cleaned_gov_for_row, cleaned_laws_for_row
                )
                resp3 = client.chat.completions.create(
                    model="gpt-4o-mini",
                    messages=[{"role": "user", "content": prompt3}],
                    temperature=0.1
                )
                output3 = resp3.choices[0].message.content.strip()
                cleaned_articles = clean_csv_output(output3, 3)
                article_law_mappings.extend(cleaned_articles)
            except Exception as e:
                print(f"    ❌ Error processing article mappings for {row_id}: {e}")
                errors_count += 1
        
        # Small delay to avoid rate limits
        time.sleep(0.2)
        
    except Exception as e:
        print(f"❌ Critical error processing {row_id}: {e}")
        errors_count += 1
        continue

print(f"\n✅ PHASE 2 completed!")
print(f"📊 Total articles processed: {processed_count}")
print(f"❌ Total errors: {errors_count}")
print(f"📑 Article-law mappings: {len(article_law_mappings)}")
print("=" * 70)
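
Both phases rely on a fixed `time.sleep(0.2)` to stay under rate limits and count any failed API call as an error without retrying. A retry with exponential backoff is one common alternative. The sketch below is an assumption, not the notebook's method: `call_with_retry` and its parameters are hypothetical, and it retries on any exception rather than on specific rate-limit errors.

```python
import random
import time

def call_with_retry(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt plus a little jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller count the error
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

In the loops above, the API call could then be wrapped as `call_with_retry(lambda: client.chat.completions.create(...))`, leaving the surrounding `try`/`except` to count only calls that fail after every attempt.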


In [None]:
# 🚀 OPTIMIZED SEPARATE PROCESSING: Government Entities Only
print("🏛️ Starting GOVERNMENT ENTITIES cleaning (Phase 1A)...")

# Configuration for testing/production
BATCH_SIZE = 100  # Process first 100 articles for testing
# BATCH_SIZE = None  # Uncomment for full processing

# Get articles to process
if BATCH_SIZE:
    unique_row_ids = identifiers_df['row_id'].unique()[:BATCH_SIZE]
    print(f"🧪 TESTING MODE: Processing first {BATCH_SIZE} articles")
else:
    unique_row_ids = identifiers_df['row_id'].unique()
    print(f"🏭 PRODUCTION MODE: Processing all {len(unique_row_ids)} articles")

# Initialize collectors
gov_entities_cleaned = []
processed_count = 0
errors_count = 0
skipped_count = 0
start_time = time.time()

print(f"📊 Articles to process: {len(unique_row_ids)}")
print("=" * 50)

for i, row_id in enumerate(unique_row_ids):
    processed_count += 1
    
    # Progress indicator
    if processed_count % 10 == 0:
        elapsed = time.time() - start_time
        rate = processed_count / elapsed if elapsed > 0 else 0
        eta = (len(unique_row_ids) - processed_count) / rate if rate > 0 else 0
        print(f"📈 Progress: {processed_count}/{len(unique_row_ids)} | Rate: {rate:.1f}/s | ETA: {eta:.0f}s | Errors: {errors_count}")
    
    try:
        # Get article info
        article_info = identifiers_df[identifiers_df['row_id'] == row_id].iloc[0]
        document_name = article_info['document_name']
        section_title = article_info['document_section_title']
        text = article_info['text']
        
        # Get government entities for this article (with deduplication)
        gov_entities_raw = gov_entities[gov_entities['row_id'] == row_id]
        
        if len(gov_entities_raw) == 0:
            skipped_count += 1
            continue
            
        # Deduplicate
        gov_entities_unique = gov_entities_raw.drop_duplicates(subset=['entity_text'], keep='first')
        
        # Process with GPT
        try:
            prompt = create_prompt_gov_entities(row_id, document_name, section_title, text, gov_entities_unique)
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.1
            )
            output = resp.choices[0].message.content.strip()
            cleaned_results = clean_csv_output(output, 3)
            gov_entities_cleaned.extend(cleaned_results)

        except Exception as e:
            print(f"    ❌ API Error for {row_id}: {e}")
            errors_count += 1
        
        # Reduced delay for faster processing
        time.sleep(0.1)  # Reduced from 0.2s
        
    except Exception as e:
        print(f"❌ Critical error for {row_id}: {e}")
        errors_count += 1

# Results
elapsed_time = time.time() - start_time
print(f"\n✅ GOVERNMENT ENTITIES CLEANING COMPLETED!")
print(f"⏱️  Total time: {elapsed_time:.1f} seconds")
print(f"📊 Processed: {processed_count} articles")
print(f"⏭️  Skipped (no entities): {skipped_count}")
print(f"❌ Errors: {errors_count}")
print(f"🏛️  Gov entities cleaned: {len(gov_entities_cleaned)}")
print(f"⚡ Rate: {processed_count/elapsed_time:.1f} articles/second")

# Create DataFrame
gov_entities_cleaned_df = pd.DataFrame(gov_entities_cleaned, columns=["row_id", "entity_text", "context"]) if gov_entities_cleaned else pd.DataFrame()
print(f"📋 Created DataFrame with {len(gov_entities_cleaned_df)} rows")


In [None]:
# 🚀 OPTIMIZED: Law Mentions Cleaning Only (Phase 1B)
print("⚖️ Starting LAW MENTIONS cleaning (Phase 1B)...")

# Configuration
BATCH_SIZE = 100  # Change to None for full processing

# Get articles to process (same as government entities)
if BATCH_SIZE:
    unique_row_ids = identifiers_df['row_id'].unique()[:BATCH_SIZE]
    print(f"🧪 TESTING MODE: Processing first {BATCH_SIZE} articles")
else:
    unique_row_ids = identifiers_df['row_id'].unique()
    print(f"🏭 PRODUCTION MODE: Processing all {len(unique_row_ids)} articles")

# Initialize collectors
law_mentions_cleaned = []
processed_count = 0
errors_count = 0
skipped_count = 0
start_time = time.time()

print(f"📊 Articles to process: {len(unique_row_ids)}")
print("=" * 50)

for i, row_id in enumerate(unique_row_ids):
    processed_count += 1
    
    # Progress indicator
    if processed_count % 10 == 0:
        elapsed = time.time() - start_time
        rate = processed_count / elapsed if elapsed > 0 else 0
        eta = (len(unique_row_ids) - processed_count) / rate if rate > 0 else 0
        print(f"📈 Progress: {processed_count}/{len(unique_row_ids)} | Rate: {rate:.1f}/s | ETA: {eta:.0f}s")
    
    try:
        # Get article info
        article_info = identifiers_df[identifiers_df['row_id'] == row_id].iloc[0]
        document_name = article_info['document_name']
        section_title = article_info['document_section_title']
        text = article_info['text']
        
        # Get law entities for this article
        law_entities_raw = law_entities[law_entities['row_id'] == row_id]
        
        if len(law_entities_raw) == 0:
            skipped_count += 1
            continue
            
        # Deduplicate
        law_entities_unique = law_entities_raw.drop_duplicates(subset=['entity_text'], keep='first')
        
        # Process with GPT
        try:
            prompt = create_prompt_law_mentions(row_id, document_name, section_title, text, law_entities_unique)
            resp = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.1
            )
            output = resp.choices[0].message.content.strip()
            cleaned_results = clean_csv_output(output, 3)
            law_mentions_cleaned.extend(cleaned_results)
            
        except Exception as e:
            print(f"    ❌ API Error for {row_id}: {e}")
            errors_count += 1
        
        # Reduced delay
        time.sleep(0.1)
        
    except Exception as e:
        print(f"❌ Critical error for {row_id}: {e}")
        errors_count += 1

# Results
elapsed_time = time.time() - start_time
print(f"\n✅ LAW MENTIONS CLEANING COMPLETED!")
print(f"⏱️  Total time: {elapsed_time:.1f} seconds")
print(f"📊 Processed: {processed_count} articles")
print(f"⏭️  Skipped (no entities): {skipped_count}")
print(f"❌ Errors: {errors_count}")
print(f"⚖️  Law mentions cleaned: {len(law_mentions_cleaned)}")
print(f"⚡ Rate: {processed_count/elapsed_time:.1f} articles/second")

# Create DataFrame
law_mentions_cleaned_df = pd.DataFrame(law_mentions_cleaned, columns=["row_id", "law_mention", "context"]) if law_mentions_cleaned else pd.DataFrame()
print(f"📋 Created DataFrame with {len(law_mentions_cleaned_df)} rows")
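
Phases 1A and 1B run the same loop and differ only in the source table and the prompt builder. One way to avoid maintaining two copies is a single function that takes those pieces as arguments. The following is a sketch under stated assumptions: `run_cleaning_pass` and its parameter names are hypothetical, and unlike the cells above it counts skipped articles separately from processed ones.

```python
def run_cleaning_pass(row_ids, get_entities, build_prompt, call_model, parse_output):
    """Run one cleaning pass over row_ids; returns (cleaned_rows, stats)."""
    cleaned = []
    stats = {"processed": 0, "skipped": 0, "errors": 0}
    for row_id in row_ids:
        entities = get_entities(row_id)
        if len(entities) == 0:
            # Nothing extracted for this article, so no model call is needed
            stats["skipped"] += 1
            continue
        try:
            output = call_model(build_prompt(row_id, entities))
            cleaned.extend(parse_output(output))
            stats["processed"] += 1
        except Exception as e:
            print(f"Error for {row_id}: {e}")
            stats["errors"] += 1
    return cleaned, stats
```

Under this design, Phase 1A would pass a `get_entities` that filters and deduplicates `gov_entities`, a `build_prompt` that wraps `create_prompt_gov_entities`, and a `parse_output` such as `lambda out: clean_csv_output(out, 3)`; Phase 1B would swap in the law equivalents.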


In [None]:
# 📊 COMBINE RESULTS and SAVE
print("📋 Combining Phase 1A and 1B results...")

# Ensure DataFrames exist
if 'gov_entities_cleaned_df' not in locals():
    gov_entities_cleaned_df = pd.DataFrame(columns=["row_id", "entity_text", "context"])
    print("⚠️  No government entities DataFrame found")

if 'law_mentions_cleaned_df' not in locals():
    law_mentions_cleaned_df = pd.DataFrame(columns=["row_id", "law_mention", "context"])
    print("⚠️  No law mentions DataFrame found")

# Save results
if not gov_entities_cleaned_df.empty:
    gov_output = '/Users/alexa/Projects/cdmx_kg/data/gov_entities_phase1_cleaned.csv'
    gov_entities_cleaned_df.to_csv(gov_output, index=False, encoding='utf-8-sig')
    print(f"✅ Government entities saved: {gov_output}")
    print(f"   📊 {len(gov_entities_cleaned_df)} rows, {gov_entities_cleaned_df['entity_text'].nunique()} unique entities")

if not law_mentions_cleaned_df.empty:
    law_output = '/Users/alexa/Projects/cdmx_kg/data/law_mentions_phase1_cleaned.csv'
    law_mentions_cleaned_df.to_csv(law_output, index=False, encoding='utf-8-sig')
    print(f"✅ Law mentions saved: {law_output}")
    print(f"   📊 {len(law_mentions_cleaned_df)} rows, {law_mentions_cleaned_df['law_mention'].nunique()} unique laws")

# Create combined summary
total_entities = len(gov_entities_cleaned_df) + len(law_mentions_cleaned_df)
print(f"\n🎯 PHASE 1 SUMMARY:")
print(f"   🏛️  Government entities: {len(gov_entities_cleaned_df)}")
print(f"   ⚖️  Law mentions: {len(law_mentions_cleaned_df)}")
print(f"   📊 Total cleaned entities: {total_entities}")

if total_entities > 0:
    print(f"\n✅ Ready for Phase 2: Article-Law mapping!")
    print(f"📁 Input files created for Phase 2 verification")
else:
    print(f"\n⚠️  No entities found - check your data or increase BATCH_SIZE")

print("=" * 70)
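
The same absolute output directory is repeated in every save cell. A small sketch of centralizing it follows; the environment variable name `CDMX_KG_DATA_DIR` and the helper `output_path` are assumptions made for illustration, with the default taken from the paths used above.

```python
import os
from pathlib import Path

# Hypothetical override: set CDMX_KG_DATA_DIR to write somewhere else
DATA_DIR = Path(os.environ.get("CDMX_KG_DATA_DIR", "/Users/alexa/Projects/cdmx_kg/data"))

def output_path(filename, data_dir=DATA_DIR):
    """Build the full path for an output file inside the data directory."""
    return Path(data_dir) / filename
```

A save call would then read `gov_entities_cleaned_df.to_csv(output_path("gov_entities_phase1_cleaned.csv"), index=False, encoding="utf-8-sig")`.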


In [None]:
# 🔍 DIAGNOSTIC: Check what happened
print("🔍 DIAGNOSTIC REPORT:")
print("=" * 50)

# Check if raw lists exist
print("📋 Raw Data Lists:")
if 'gov_entities_cleaned' in locals():
    print(f"   ✅ gov_entities_cleaned exists: {len(gov_entities_cleaned)} items")
else:
    print("   ❌ gov_entities_cleaned NOT FOUND")

if 'law_mentions_cleaned' in locals():
    print(f"   ✅ law_mentions_cleaned exists: {len(law_mentions_cleaned)} items")
else:
    print("   ❌ law_mentions_cleaned NOT FOUND")

# Check if DataFrames exist
print("\n📊 DataFrames:")
if 'gov_entities_cleaned_df' in locals():
    print(f"   ✅ gov_entities_cleaned_df exists: {len(gov_entities_cleaned_df)} rows")
else:
    print("   ❌ gov_entities_cleaned_df NOT FOUND")

if 'law_mentions_cleaned_df' in locals():
    print(f"   ✅ law_mentions_cleaned_df exists: {len(law_mentions_cleaned_df)} rows")
else:
    print("   ❌ law_mentions_cleaned_df NOT FOUND")

# Check source data
print("\n📁 Source Data:")
print(f"   🏛️  Government entities available: {len(gov_entities)}")
print(f"   ⚖️  Law entities available: {len(law_entities)}")
print(f"   📄 Articles available: {len(identifiers_df)}")

# Check if we have entities in first 100 articles
print("\n🧪 Test Sample Check (first 100 articles):")
test_articles = identifiers_df['row_id'].unique()[:100]
gov_in_sample = gov_entities[gov_entities['row_id'].isin(test_articles)]
law_in_sample = law_entities[law_entities['row_id'].isin(test_articles)]

print(f"   🏛️  Gov entities in first 100 articles: {len(gov_in_sample)}")
print(f"   ⚖️  Law entities in first 100 articles: {len(law_in_sample)}")

if len(gov_in_sample) > 0:
    print(f"   📊 Articles with gov entities: {gov_in_sample['row_id'].nunique()}")
if len(law_in_sample) > 0:
    print(f"   📊 Articles with law entities: {law_in_sample['row_id'].nunique()}")

print("\n💡 RECOMMENDATION:")
if len(gov_in_sample) == 0 and len(law_in_sample) == 0:
    print("   🎯 NO ENTITIES in first 100 articles!")
    print("   🔧 Try: Set BATCH_SIZE = 500 or BATCH_SIZE = None")
elif 'gov_entities_cleaned' not in locals():
    print("   🎯 You need to RUN Cell 21 (Government Entities Processing)")
    print("   🔧 Execute: Cell 21 → Cell 22 → Cell 23")
else:
    print("   🎯 Data exists but processing may have failed")
    print("   🔧 Check for errors in Cell 21 and 22")


In [None]:
# Save outputs with correct structure for each prompt
print("💾 Saving cleaned results...")

# Save government entities cleaned
if gov_entities_cleaned:
    gov_entities_df = pd.DataFrame(gov_entities_cleaned, columns=["row_id", "entity_text", "context"])
    gov_entities_output = '/Users/alexa/Projects/cdmx_kg/data/gov_entities_cleaned.csv'
    gov_entities_df.to_csv(gov_entities_output, index=False, encoding='utf-8-sig')
    print(f"   ✅ Government entities saved to: {gov_entities_output}")
    print(f"      📊 Total entities: {len(gov_entities_df)}")
    print(f"      📊 Unique entities: {gov_entities_df['entity_text'].nunique()}")
else:
    print("   ⚠️  No government entities to save")

# Save law mentions cleaned
if law_mentions_cleaned:
    law_mentions_df = pd.DataFrame(law_mentions_cleaned, columns=["row_id", "law_mention", "context"])
    law_mentions_output = '/Users/alexa/Projects/cdmx_kg/data/law_mentions_cleaned.csv'
    law_mentions_df.to_csv(law_mentions_output, index=False, encoding='utf-8-sig')
    print(f"   ✅ Law mentions saved to: {law_mentions_output}")
    print(f"      📊 Total mentions: {len(law_mentions_df)}")
    print(f"      📊 Unique laws: {law_mentions_df['law_mention'].nunique()}")
else:
    print("   ⚠️  No law mentions to save")

# Save article-law mappings
if article_law_mappings:
    article_law_df = pd.DataFrame(article_law_mappings, columns=["row_id", "article_mention", "law_mention"])
    article_law_output = '/Users/alexa/Projects/cdmx_kg/data/article_law_mappings.csv'
    article_law_df.to_csv(article_law_output, index=False, encoding='utf-8-sig')
    print(f"   ✅ Article-law mappings saved to: {article_law_output}")
    print(f"      📊 Total mappings: {len(article_law_df)}")
    print(f"      📊 Self-references: {len(article_law_df[article_law_df['law_mention'] == ''])}")
    print(f"      📊 External references: {len(article_law_df[article_law_df['law_mention'] != ''])}")
else:
    print("   ⚠️  No article-law mappings to save")

print("\n📋 Sample results:")
if gov_entities_cleaned:
    print("Government entities sample:")
    sample_gov = gov_entities_df.head(3)
    for _, row in sample_gov.iterrows():
        print(f"   🏛️  {row['row_id']}: {row['entity_text']}")

if law_mentions_cleaned:
    print("Law mentions sample:")
    sample_law = law_mentions_df.head(3)
    for _, row in sample_law.iterrows():
        print(f"   ⚖️  {row['row_id']}: {row['law_mention']}")

if article_law_mappings:
    print("Article-law mappings sample:")
    sample_articles = article_law_df.head(3)
    for _, row in sample_articles.iterrows():
        law_text = row['law_mention'] if row['law_mention'] else '(current document)'
        print(f"   📑 {row['row_id']}: Article {row['article_mention']} → {law_text}")

print("=" * 70)
print("🎉 GPT CLEANING PIPELINE COMPLETED!")
print("=" * 70)
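
The self-reference count above tests `law_mention == ''`, which works on the in-memory DataFrame. Once `article_law_mappings.csv` is read back with pandas defaults, empty fields become `NaN` and the same test finds nothing. The sketch below illustrates this with made-up sample rows (not pipeline output) and shows two ways to keep the count correct.

```python
import io
import pandas as pd

# Made-up sample: the first mapping has an empty law_mention (a self-reference)
csv_text = "row_id,article_mention,law_mention\na1,5,\na1,7,Ley General de Salud\n"

default_df = pd.read_csv(io.StringIO(csv_text))
strict_df = pd.read_csv(io.StringIO(csv_text), dtype=str, keep_default_na=False)

# Default parsing turns the empty field into NaN, so the equality test misses it
self_refs_default = int((default_df["law_mention"] == "").sum())
# keep_default_na=False preserves the empty string
self_refs_strict = int((strict_df["law_mention"] == "").sum())
# Alternatively, normalize NaN back to "" before comparing
self_refs_robust = int(default_df["law_mention"].fillna("").eq("").sum())
```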


In [None]:
# Create DataFrames from Phase 1 results (if they don't exist)
print("🔧 Creating/Recreating DataFrames from Phase 1 results...")

# Check if the raw lists exist first
if 'gov_entities_cleaned' in locals() and gov_entities_cleaned:
    # Convert government entities list to DataFrame
    gov_entities_cleaned_df = pd.DataFrame(
        gov_entities_cleaned, 
        columns=["row_id", "entity_text", "context"]
    )
    print(f"✅ Created gov_entities_cleaned_df with {len(gov_entities_cleaned_df)} rows")
else:
    # Create empty DataFrame if no data
    gov_entities_cleaned_df = pd.DataFrame(columns=["row_id", "entity_text", "context"])
    print("⚠️  No gov_entities_cleaned found, created empty DataFrame")

if 'law_mentions_cleaned' in locals() and law_mentions_cleaned:
    # Convert law mentions list to DataFrame  
    law_mentions_cleaned_df = pd.DataFrame(
        law_mentions_cleaned,
        columns=["row_id", "law_mention", "context"]
    )
    print(f"✅ Created law_mentions_cleaned_df with {len(law_mentions_cleaned_df)} rows")
else:
    # Create empty DataFrame if no data
    law_mentions_cleaned_df = pd.DataFrame(columns=["row_id", "law_mention", "context"])
    print("⚠️  No law_mentions_cleaned found, created empty DataFrame")

print(f"📊 DataFrames ready:")
print(f"   🏛️  Government entities: {len(gov_entities_cleaned_df)} rows")
print(f"   ⚖️  Law mentions: {len(law_mentions_cleaned_df)} rows")
