# INCEpTION Annotations Exploratory Data Analysis
## Comprehensive Analysis of Portuguese Municipal Documents

**Objective**: Provide statistical foundations and publication-ready visualizations for academic research on Named Entity Recognition (NER) systems applied to Portuguese municipal governance documents.

**Dataset**: 120 manually annotated documents from six Portuguese municipalities, annotated with the INCEpTION annotation platform.

**Analysis Focus**: 
- Entity type distributions and characteristics
- Posicionamento (voting position) patterns
- Assunto (subject) analysis
- Cross-municipality comparisons

---

In [1]:
# Jupyter configuration for optimal display
%matplotlib inline

# Verify environment
import sys
print(f"Python: {sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}")
print(f"Running in: {'Jupyter' if 'ipykernel' in sys.modules else 'Standard Python'}")
print("✅ Jupyter magic commands loaded")

Python: 3.13.7
Running in: Jupyter
✅ Jupyter magic commands loaded


## 1. Setup and Data Loading

In [2]:
# Core libraries for data manipulation and analysis
import pandas as pd
import numpy as np
import json
from pathlib import Path
from collections import Counter, defaultdict
import warnings
warnings.filterwarnings('ignore')

# Statistical analysis
from scipy import stats
from scipy.stats import chi2_contingency, kruskal, mannwhitneyu
import statsmodels.api as sm

# Visualization libraries
import matplotlib
matplotlib.use('Agg', force=True)  # Non-interactive backend; overrides %matplotlib inline, so Matplotlib figures must be saved, not shown
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# Text analysis
from wordcloud import WordCloud

# Custom utilities
import sys
sys.path.append('./utils')
from inception_parser import InceptionParser
from analysis_functions import AnnotationAnalyzer, calculate_effect_size, bootstrap_confidence_interval

# Configure plotting
plt.style.use('default')
sns.set_palette("husl")
px.defaults.template = "plotly_white"
px.defaults.color_continuous_scale = "viridis"

# Set random seed for reproducibility
np.random.seed(42)

print("✅ All libraries imported successfully")
print(f"📊 Analysis environment ready")
print(f"📅 Analysis date: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"🖼️ Matplotlib backend: {matplotlib.get_backend()}")

✅ All libraries imported successfully
📊 Analysis environment ready
📅 Analysis date: 2025-09-02 12:32:39
🖼️ Matplotlib backend: Agg


In [3]:
# Data paths
INCEPTION_DATA_DIR = Path('./inception')
RESULTS_DIR = Path('./results')
FIGURES_DIR = RESULTS_DIR / 'figures'
STATISTICS_DIR = RESULTS_DIR / 'statistics'

# Create directories
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
STATISTICS_DIR.mkdir(parents=True, exist_ok=True)

print(f"📂 Data directory: {INCEPTION_DATA_DIR}")
print(f"📊 Results will be saved to: {RESULTS_DIR}")

# Check data availability
json_files = list(INCEPTION_DATA_DIR.glob('*.json'))
print(f"📄 Found {len(json_files)} INCEpTION JSON files")

if len(json_files) == 0:
    print("⚠️ No JSON files found. Please check the data directory path.")
else:
    print("✅ Data files detected")
    print(f"📋 Sample files: {[f.name for f in json_files[:5]]}")

📂 Data directory: inception
📊 Results will be saved to: results
📄 Found 120 INCEpTION JSON files
✅ Data files detected
📋 Sample files: ['Porto_cm_029_2023-01-30.json', 'Fundao_cm_005_2023-27-03.json', 'Guimaraes_cm_016_2024-09-30.json', 'Alandroal_cm_007_2021-12-22.json', 'Covilha_cm_009_2022-05-06.json']
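
The sample names above suggest a `Municipality_cm_NNN_DATE.json` naming scheme, and at least one of them (`Fundao_cm_005_2023-27-03.json`) carries a date that cannot be read as `YYYY-MM-DD`. The following is a minimal sketch of tolerant filename parsing under that assumed scheme. `parse_inception_filename` and `FILENAME_RE` are hypothetical names introduced for illustration; they are not part of `InceptionParser`, whose actual behaviour is not shown in this notebook.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed naming scheme, inferred only from the sample file names listed above.
FILENAME_RE = re.compile(
    r"^(?P<municipality>[^_]+)_cm_(?P<number>\d+)_(?P<date>\d{4}-\d{2}-\d{2})\.json$"
)

def parse_inception_filename(name: str) -> Optional[dict]:
    """Split a file name into municipality, document number, and date (hypothetical helper)."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    # Try YYYY-MM-DD first, then fall back to YYYY-DD-MM.
    # A date such as 2022-05-06 is valid in both layouts, so the first layout wins.
    for layout in ("%Y-%m-%d", "%Y-%d-%m"):
        try:
            parsed = datetime.strptime(match["date"], layout).date()
        except ValueError:
            continue
        return {
            "municipality": match["municipality"],
            "number": int(match["number"]),
            "date": parsed,
            "date_layout": layout,  # records which layout matched, for auditing
        }
    return None

for sample in ("Porto_cm_029_2023-01-30.json", "Fundao_cm_005_2023-27-03.json"):
    print(parse_inception_filename(sample))
```

Recording `date_layout` makes it easy to list the files that needed the fallback, which is safer than silently correcting them.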


## 2. Data Parsing and Preprocessing

In [4]:
# Initialize the INCEpTION parser
print("🔄 Initializing INCEpTION parser...")
parser = InceptionParser()

# Parse all documents
print("🔄 Parsing INCEpTION annotation files...")
documents = parser.parse_directory(INCEPTION_DATA_DIR)

# Get parsing summary
summary = parser.get_parsing_summary()

print(f"\n=== PARSING RESULTS ===")
print(f"📊 Documents successfully parsed: {summary['total_documents_parsed']}")
print(f"⚠️ Parsing errors: {summary['parsing_errors']}")
print(f"🏛️ Municipalities: {len(summary['municipalities'])}")
print(f"📋 Municipality list: {', '.join(summary['municipalities'])}")
print(f"🏷️ Total entities: {summary['total_entities']:,}")
print(f"🔗 Total relations: {summary['total_relations']:,}")

if summary['parsing_errors'] > 0:
    print(f"\n⚠️ Error details:")
    for error in summary['error_details'][:5]:  # Show first 5 errors
        print(f"  - {error['file']}: {error['error']}")

print("\n✅ Data parsing completed")

🔄 Initializing INCEpTION parser...
🔄 Parsing INCEpTION annotation files...

=== PARSING RESULTS ===
📊 Documents successfully parsed: 120
⚠️ Parsing errors: 0
🏛️ Municipalities: 6
📋 Municipality list: Alandroal, Porto, Covilha, Guimaraes, Fundao, Campomaior
🏷️ Total entities: 26,435
🔗 Total relations: 8,752

✅ Data parsing completed


In [5]:
# Create analysis DataFrames
print("🔄 Creating analysis DataFrames...")

entities_df = parser.create_entity_dataframe()  # Uses parser.parsed_documents by default
relations_df = parser.create_relations_dataframe()  # Uses parser.parsed_documents by default
documents_df = parser.create_document_dataframe()  # Uses parser.parsed_documents by default

print(f"✅ DataFrames created successfully:")
print(f"   • Entities: {len(entities_df):,} rows")
print(f"   • Relations: {len(relations_df):,} rows") 
print(f"   • Documents: {len(documents_df):,} rows")

# Initialize the AnnotationAnalyzer
print("\n🔄 Initializing AnnotationAnalyzer...")
analyzer = AnnotationAnalyzer(entities_df, relations_df, documents_df)
print("✅ AnnotationAnalyzer ready for comprehensive analysis")

🔄 Creating analysis DataFrames...
✅ DataFrames created successfully:
   • Entities: 26,435 rows
   • Relations: 8,752 rows
   • Documents: 120 rows

🔄 Initializing AnnotationAnalyzer...
✅ AnnotationAnalyzer ready for comprehensive analysis


In [6]:
# Preview the data structure
print("👀 Data Preview")
print("\n=== ENTITIES DATAFRAME ===")
print(entities_df.head())
print(f"\nColumns: {list(entities_df.columns)}")

print("\n=== RELATIONS DATAFRAME ===")
print(relations_df.head())
print(f"\nColumns: {list(relations_df.columns)}")

print("\n=== DOCUMENTS DATAFRAME ===")
print(documents_df.head())
print(f"\nColumns: {list(documents_df.columns)}")

👀 Data Preview

=== ENTITIES DATAFRAME ===
                       filename municipality   document_id        date  \
0  Porto_cm_029_2023-01-30.json        Porto  Porto_cm_029  2023-01-30   
1  Porto_cm_029_2023-01-30.json        Porto  Porto_cm_029  2023-01-30   
2  Porto_cm_029_2023-01-30.json        Porto  Porto_cm_029  2023-01-30   
3  Porto_cm_029_2023-01-30.json        Porto  Porto_cm_029  2023-01-30   
4  Porto_cm_029_2023-01-30.json        Porto  Porto_cm_029  2023-01-30   

   entity_id  entity_type entity_label  begin  end  \
0        161  custom.Span    Metadados      1    5   
1        162  custom.Span    Metadados     65   86   
2        163  custom.Span    Metadados     90  110   
3        164  custom.Span    Metadados    126  159   
4        165  custom.Span    Metadados    164  210   

                                             text  ...  feature_Tema  \
0                                            29.ª  ...           NaN   
1                           30 DE JANEIRO D

## 3. Corpus Overview & Statistics

In [7]:
# Run comprehensive analysis
print("🔬 Running comprehensive statistical analysis...")

try:
    comprehensive_analysis = analyzer.run_comprehensive_analysis()
    
    # Save results using enhanced JSON serialization
    results_file = RESULTS_DIR / "comprehensive_analysis.json"
    
    # Enhanced JSON serializer for complex objects
    def make_json_serializable(obj):
        """Recursively convert objects to JSON-serializable format."""
        if isinstance(obj, dict):
            # Handle dictionary keys that might be tuples or other non-string types
            new_dict = {}
            for k, v in obj.items():
                # Convert tuple keys to strings
                if isinstance(k, tuple):
                    key = str(k)
                elif isinstance(k, (int, float, bool)):
                    key = str(k)
                else:
                    key = k
                new_dict[key] = make_json_serializable(v)
            return new_dict
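        elif isinstance(obj, (list, tuple)):
            return [make_json_serializable(item) for item in obj]
        elif isinstance(obj, set):
            # Recurse into set members too; note that set ordering is arbitrary
            return [make_json_serializable(item) for item in obj]
        elif isinstance(obj, np.ndarray):
            return make_json_serializable(obj.tolist())
        elif isinstance(obj, np.generic):
            # NumPy scalars such as np.int64 are not `int` subclasses; convert them
            # to native Python values so they are not written out as strings
            value = obj.item()
            return value if isinstance(value, (int, float, str, bool, type(None))) else str(value)
        elif isinstance(obj, (int, float, str, bool, type(None))):
            return obj
        else:
            # Fallback (DataFrames, custom classes, ...): store the string representation
            return str(obj)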
    
    # Convert to JSON-serializable format
    serializable_analysis = make_json_serializable(comprehensive_analysis)
    
    with open(results_file, 'w', encoding='utf-8') as f:
        json.dump(serializable_analysis, f, indent=2, ensure_ascii=False)
    
    print("✅ Comprehensive analysis completed successfully!")
    
    # Display basic overview
    print("\n=== ANALYSIS OVERVIEW ===")
    if 'corpus_statistics' in comprehensive_analysis:
        corpus_stats = comprehensive_analysis['corpus_statistics']
        if 'corpus_overview' in corpus_stats:
            overview = corpus_stats['corpus_overview']
            print(f"📚 Total Documents: {overview.get('total_documents', 0):,}")
            print(f"🏛️ Municipalities: {overview.get('total_municipalities', 0)}")
            print(f"📝 Total Entities: {corpus_stats.get('entity_overview', {}).get('total_entities', 0):,}")
            print(f"🔗 Total Relations: {corpus_stats.get('relation_overview', {}).get('total_relations', 0):,}")
    
    print(f"\n📊 Results saved to: {results_file}")
    
    # Extract corpus statistics for display
    corpus_stats = comprehensive_analysis['corpus_statistics']['corpus_overview']
    entity_stats = comprehensive_analysis['corpus_statistics'].get('entity_overview', {})
    relation_stats = comprehensive_analysis['corpus_statistics'].get('relation_overview', {})
    
    # Create comprehensive statistics table
    stats_data = [
        ['📄 Total Documents', f"{corpus_stats['total_documents']:,}"],
        ['🏛️ Municipalities', f"{corpus_stats['total_municipalities']}"],
        ['📊 Total Text Length', f"{corpus_stats['total_text_length']:,} characters"],
        ['🔤 Total Tokens', f"{corpus_stats['total_tokens']:,}"],
        ['🏷️ Total Entities', f"{entity_stats.get('total_entities', 0):,}"],
        ['📝 Entity Types', f"{entity_stats.get('unique_entity_types', 0)}"],
        ['🔗 Total Relations', f"{relation_stats.get('total_relations', 0):,}"],
        ['📈 Documents with Entities', f"{entity_stats.get('documents_with_entities', 0):,}"],
        ['📊 Entity Coverage', f"{entity_stats.get('entity_coverage', 0):.1%}"],
        ['📋 Avg Entities/Document', f"{entity_stats.get('avg_entities_per_document', 0):.2f}"]
    ]

    # Display as formatted table
    stats_df = pd.DataFrame(stats_data, columns=['Metric', 'Value'])
    print("\n=== CORPUS OVERVIEW STATISTICS ===")
    print(stats_df.to_string(index=False))

    # Save statistics table
    stats_df.to_csv(STATISTICS_DIR / 'corpus_overview_table.csv', index=False)
    print(f"\n💾 Statistics table saved to {STATISTICS_DIR / 'corpus_overview_table.csv'}")
    
except Exception as e:
    print(f"❌ Error during analysis: {str(e)}")
    import traceback
    traceback.print_exc()

🔬 Running comprehensive statistical analysis...
✅ Comprehensive analysis completed successfully!

=== ANALYSIS OVERVIEW ===
📚 Total Documents: 120
🏛️ Municipalities: 6
📝 Total Entities: 26,435
🔗 Total Relations: 8,752

📊 Results saved to: results/comprehensive_analysis.json

=== CORPUS OVERVIEW STATISTICS ===
                   Metric                Value
        📄 Total Documents                  120
        🏛️ Municipalities                    6
      📊 Total Text Length 7,512,427 characters
           🔤 Total Tokens            1,188,024
        🏷️ Total Entities               26,435
           📝 Entity Types                    6
        🔗 Total Relations                8,752
📈 Documents with Entities                  120
        📊 Entity Coverage               100.0%
  📋 Avg Entities/Document               220.29

💾 Statistics table saved to results/statistics/corpus_overview_table.csv
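
The raw totals in the table above are easier to compare with other corpora once they are normalised. A small sketch of a few derived ratios follows, using only the figures printed above. The `corpus` dictionary and the `ratio` helper are illustrative names, and what counts as a "token" depends on the tokenisation used by the parser, which is not shown here.

```python
# Totals copied from the corpus overview table above.
corpus = {
    "documents": 120,
    "characters": 7_512_427,
    "tokens": 1_188_024,
    "entities": 26_435,
    "relations": 8_752,
}

def ratio(numerator: str, denominator: str, scale: float = 1.0) -> float:
    """Return scale * corpus[numerator] / corpus[denominator]."""
    return scale * corpus[numerator] / corpus[denominator]

derived = {
    "tokens_per_document": ratio("tokens", "documents"),
    "characters_per_token": ratio("characters", "tokens"),
    "entities_per_document": ratio("entities", "documents"),
    "entities_per_1000_tokens": ratio("entities", "tokens", scale=1000),
    "relations_per_entity": ratio("relations", "entities"),
}

for name, value in derived.items():
    print(f"{name}: {value:.2f}")
```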


In [8]:
# Municipality distribution visualization
municipality_counts = documents_df['municipality'].value_counts()

# Create interactive bar chart
fig_muni = px.bar(
    x=municipality_counts.index,
    y=municipality_counts.values,
    labels={'x': 'Municipality', 'y': 'Number of Documents'},
    title='📊 Document Distribution by Municipality',
    color=municipality_counts.values,
    color_continuous_scale='viridis'
)

fig_muni.update_layout(
    xaxis_tickangle=-45,
    height=500,
    showlegend=False
)

fig_muni.show()

# Save figure
fig_muni.write_html(FIGURES_DIR / 'municipality_distribution.html')
fig_muni.write_image(FIGURES_DIR / 'municipality_distribution.png', width=1200, height=600)

print(f"💾 Municipality distribution chart saved")

💾 Municipality distribution chart saved


In [9]:
# Analysis of entities and relations by municipality
print("=== ENTITIES AND RELATIONS BY MUNICIPALITY ===")

if not entities_df.empty:
    # Group entities by municipality
    entities_by_municipality = entities_df.groupby('municipality').agg({
        'entity_id': 'count',  # Count of entities
        'entity_label': lambda x: x.value_counts().to_dict()  # Entity type distribution per municipality
    }).rename(columns={'entity_id': 'total_entities'})
    
    print("\n📊 ENTITY COUNTS BY MUNICIPALITY:")
    entities_by_municipality_sorted = entities_by_municipality.sort_values('total_entities', ascending=False)
    
    for municipality, row in entities_by_municipality_sorted.iterrows():
        total_entities = row['total_entities']
        print(f"\n🏛️ {municipality}: {total_entities:,} entities")
        
        # Show entity type breakdown for this municipality
        entity_types = row['entity_label']
        for entity_type, count in sorted(entity_types.items(), key=lambda x: x[1], reverse=True):
            percentage = (count / total_entities) * 100
            print(f"   • {entity_type}: {count:,} ({percentage:.1f}%)")

if not relations_df.empty:
    # Group relations by municipality
    relations_by_municipality = relations_df.groupby('municipality').agg({
        'relation_id': 'count',  # Count of relations
        'relation_type': lambda x: x.value_counts().to_dict()  # Relation type distribution per municipality
    }).rename(columns={'relation_id': 'total_relations'})
    
    print("\n\n🔗 RELATION COUNTS BY MUNICIPALITY:")
    relations_by_municipality_sorted = relations_by_municipality.sort_values('total_relations', ascending=False)
    
    for municipality, row in relations_by_municipality_sorted.iterrows():
        total_relations = row['total_relations']
        print(f"\n🏛️ {municipality}: {total_relations:,} relations")
        
        # Show relation type breakdown for this municipality
        relation_types = row['relation_type']
        for relation_type, count in sorted(relation_types.items(), key=lambda x: x[1], reverse=True):
            percentage = (count / total_relations) * 100
            print(f"   • {relation_type}: {count:,} ({percentage:.1f}%)")

# Create visualization combining entities and relations by municipality
if not entities_df.empty or not relations_df.empty:
    fig = make_subplots(
        rows=2, cols=2,
        subplot_titles=(
            'Entities by Municipality', 'Relations by Municipality',
            'Entity Density (per Document)', 'Relation Density (per Document)'
        ),
        specs=[[{'type': 'bar'}, {'type': 'bar'}],
               [{'type': 'bar'}, {'type': 'bar'}]]
    )
    
    # Entities by municipality
    if not entities_df.empty:
        entity_counts = entities_by_municipality_sorted['total_entities']
        fig.add_trace(
            go.Bar(
                x=entity_counts.index,
                y=entity_counts.values,
                name='Entities',
                marker_color='lightblue'
            ),
            row=1, col=1
        )
        
        # Calculate entity density (entities per document)
        if not documents_df.empty:
            doc_counts_by_municipality = documents_df['municipality'].value_counts()
            entity_density = entity_counts / doc_counts_by_municipality.reindex(entity_counts.index)
            
            fig.add_trace(
                go.Bar(
                    x=entity_density.index,
                    y=entity_density.values,
                    name='Entity Density',
                    marker_color='lightgreen'
                ),
                row=2, col=1
            )
    
    # Relations by municipality
    if not relations_df.empty:
        relation_counts = relations_by_municipality_sorted['total_relations']
        fig.add_trace(
            go.Bar(
                x=relation_counts.index,
                y=relation_counts.values,
                name='Relations',
                marker_color='lightcoral'
            ),
            row=1, col=2
        )
        
        # Calculate relation density (relations per document)
        if not documents_df.empty:
            doc_counts_by_municipality = documents_df['municipality'].value_counts()
            relation_density = relation_counts / doc_counts_by_municipality.reindex(relation_counts.index)
            
            fig.add_trace(
                go.Bar(
                    x=relation_density.index,
                    y=relation_density.values,
                    name='Relation Density',
                    marker_color='lightyellow'
                ),
                row=2, col=2
            )
    
    fig.update_layout(
        title_text="🏛️ Entities and Relations Distribution by Municipality",
        height=800,
        showlegend=False
    )
    
    # Rotate x-axis labels for better readability
    fig.update_xaxes(tickangle=-45)
    
    fig.show()
    
    # Save figure
    fig.write_html(FIGURES_DIR / 'entities_relations_by_municipality.html')
    fig.write_image(FIGURES_DIR / 'entities_relations_by_municipality.png', width=1400, height=800)
    
    print(f"\n💾 Municipality analysis visualization saved")

# Summary statistics
if not entities_df.empty and not relations_df.empty:
    print(f"\n📈 SUMMARY STATISTICS:")
    print(f"• Total municipalities with entities: {len(entities_by_municipality)}")
    print(f"• Total municipalities with relations: {len(relations_by_municipality)}")
    print(f"• Average entities per municipality: {entities_by_municipality['total_entities'].mean():.1f}")
    print(f"• Average relations per municipality: {relations_by_municipality['total_relations'].mean():.1f}")
    
    # Find municipalities with highest/lowest counts
    max_entities_mun = entities_by_municipality['total_entities'].idxmax()
    min_entities_mun = entities_by_municipality['total_entities'].idxmin()
    max_relations_mun = relations_by_municipality['total_relations'].idxmax()
    min_relations_mun = relations_by_municipality['total_relations'].idxmin()
    
    print(f"\n🏆 Municipality with most entities: {max_entities_mun} ({entities_by_municipality.loc[max_entities_mun, 'total_entities']:,})")
    print(f"🏆 Municipality with most relations: {max_relations_mun} ({relations_by_municipality.loc[max_relations_mun, 'total_relations']:,})")
    print(f"📉 Municipality with fewest entities: {min_entities_mun} ({entities_by_municipality.loc[min_entities_mun, 'total_entities']:,})")
    print(f"📉 Municipality with fewest relations: {min_relations_mun} ({relations_by_municipality.loc[min_relations_mun, 'total_relations']:,})")

=== ENTITIES AND RELATIONS BY MUNICIPALITY ===

📊 ENTITY COUNTS BY MUNICIPALITY:

🏛️ Covilha: 6,122 entities
   • Posicionamento: 2,201 (36.0%)
   • Assunto: 2,159 (35.3%)
   • Informação Pessoal: 1,289 (21.1%)
   • Metadados: 305 (5.0%)
   • Ordem do Dia: 168 (2.7%)

🏛️ Campomaior: 4,998 entities
   • Informação Pessoal: 2,297 (46.0%)
   • Assunto: 1,192 (23.8%)
   • Posicionamento: 1,124 (22.5%)
   • Metadados: 264 (5.3%)
   • Ordem do Dia: 116 (2.3%)
   • : 5 (0.1%)

🏛️ Guimaraes: 4,674 entities
   • Posicionamento: 1,679 (35.9%)
   • Assunto: 1,662 (35.6%)
   • Informação Pessoal: 928 (19.9%)
   • Metadados: 366 (7.8%)
   • Ordem do Dia: 39 (0.8%)

🏛️ Porto: 4,209 entities
   • Posicionamento: 1,715 (40.7%)
   • Assunto: 1,416 (33.6%)
   • Ordem do Dia: 463 (11.0%)
   • Metadados: 370 (8.8%)
   • Informação Pessoal: 244 (5.8%)
   • : 1 (0.0%)

🏛️ Alandroal: 3,916 entities
   • Assunto: 1,514 (38.7%)
   • Posicionamento: 1,306 (33.4%)
   • Informação Pessoal: 458 (11.7%)
   • Ordem 


💾 Municipality analysis visualization saved

📈 SUMMARY STATISTICS:
• Total municipalities with entities: 6
• Total municipalities with relations: 6
• Average entities per municipality: 4405.8
• Average relations per municipality: 1458.7

🏆 Municipality with most entities: Covilha (6,122)
🏆 Municipality with most relations: Covilha (2,192)
📉 Municipality with fewest entities: Fundao (2,516)
📉 Municipality with fewest relations: Fundao (751)


## 4. Entity Type Analysis

In [10]:
# Entity type distribution analysis
if not entities_df.empty:
    entity_type_counts = entities_df['entity_label'].value_counts()
    entity_type_percentages = (entity_type_counts / entity_type_counts.sum() * 100).round(2)
    
    print("=== ENTITY TYPE DISTRIBUTION ===")
    for entity_type, count in entity_type_counts.items():
        percentage = entity_type_percentages[entity_type]
        print(f"{entity_type}: {count:,} ({percentage:.1f}%)")
    
    # Create combined visualization
    fig = make_subplots(
        rows=1, cols=2,
        subplot_titles=('Entity Type Frequencies', 'Entity Type Distribution'),
        specs=[[{'type': 'bar'}, {'type': 'pie'}]]
    )
    
    # Bar chart
    fig.add_trace(
        go.Bar(x=entity_type_counts.index, y=entity_type_counts.values, name='Count'),
        row=1, col=1
    )
    
    # Pie chart
    fig.add_trace(
        go.Pie(labels=entity_type_counts.index, values=entity_type_counts.values, name='Distribution'),
        row=1, col=2
    )
    
    fig.update_layout(
        title_text="🏷️ Entity Type Analysis",
        height=500,
        showlegend=False
    )
    
    fig.show()
    
    # Save figure
    fig.write_html(FIGURES_DIR / 'entity_type_analysis.html')
    fig.write_image(FIGURES_DIR / 'entity_type_analysis.png', width=1200, height=600)
    
else:
    print("⚠️ No entities found in the dataset")

=== ENTITY TYPE DISTRIBUTION ===
Posicionamento: 8,777 (33.2%)
Assunto: 8,647 (32.7%)
Informação Pessoal: 5,760 (21.8%)
Metadados: 1,825 (6.9%)
Ordem do Dia: 1,420 (5.4%)
: 6 (0.0%)
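
The last row of the distribution shows six entities with an empty label. One way to isolate such rows before computing per-type statistics is sketched below on a toy frame. Only the `entity_label` and `municipality` column names come from the notebook; the data and the variable names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for entities_df; the values are invented.
toy_entities = pd.DataFrame({
    "municipality": ["Porto", "Porto", "Campomaior", "Campomaior", "Fundao"],
    "entity_label": ["Assunto", "", "Metadados", None, "  "],
})

# Treat missing, empty, and whitespace-only labels as blank.
is_blank = toy_entities["entity_label"].fillna("").str.strip().eq("")

# Where do the blank labels come from?
blank_by_municipality = toy_entities.loc[is_blank, "municipality"].value_counts()
print(blank_by_municipality)

# Keep only labelled entities for downstream per-type analysis.
labelled = toy_entities.loc[~is_blank]
print(labelled)
```

Whether blank labels should be dropped or sent back for re-annotation is a project decision. The sketch only makes them visible.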


In [11]:
# Entity length and characteristics analysis
if not entities_df.empty:
    # Entity length distribution
    fig_length = px.histogram(
        entities_df,
        x='length',
        nbins=30,
        title='📏 Entity Length Distribution (Characters)',
        labels={'length': 'Entity Length (characters)', 'count': 'Frequency'},
        marginal='box'  # Add box plot
    )
    
    fig_length.update_layout(height=500)
    fig_length.show()
    
    # Token count distribution
    fig_tokens = px.histogram(
        entities_df,
        x='token_count',
        nbins=20,
        title='🔤 Entity Token Count Distribution',
        labels={'token_count': 'Token Count', 'count': 'Frequency'},
        marginal='violin'  # Add violin plot
    )
    
    fig_tokens.update_layout(height=500)
    fig_tokens.show()
    
    # Length by entity type
    fig_length_by_type = px.box(
        entities_df,
        x='entity_label',
        y='length',
        title='📊 Entity Length Distribution by Type',
        labels={'entity_label': 'Entity Type', 'length': 'Length (characters)'},
        points='outliers'
    )
    
    fig_length_by_type.update_layout(
        xaxis_tickangle=-45,
        height=500
    )
    
    fig_length_by_type.show()
    
    # Save figures
    fig_length.write_html(FIGURES_DIR / 'entity_length_distribution.html')
    fig_tokens.write_html(FIGURES_DIR / 'entity_token_distribution.html')
    fig_length_by_type.write_html(FIGURES_DIR / 'entity_length_by_type.html')
    
    print("💾 Entity characteristics visualizations saved")

💾 Entity characteristics visualizations saved


In [12]:
# Cross-municipality entity analysis
if not entities_df.empty:
    # Entity distribution by municipality and type
    entity_muni_cross = pd.crosstab(entities_df['municipality'], entities_df['entity_label'])
    
    # Create heatmap
    fig_heatmap = px.imshow(
        entity_muni_cross.values,
        x=entity_muni_cross.columns,
        y=entity_muni_cross.index,
        title='🗺️ Entity Type Distribution by Municipality',
        labels={'x': 'Entity Type', 'y': 'Municipality', 'color': 'Count'},
        aspect='auto'
    )
    
    fig_heatmap.update_layout(height=600)
    fig_heatmap.show()
    
    # Entity density by municipality
    entity_density = entities_df.groupby('municipality').size() / documents_df.groupby('municipality').size()
    
    fig_density = px.bar(
        x=entity_density.index,
        y=entity_density.values,
        title='📈 Entity Density by Municipality',
        labels={'x': 'Municipality', 'y': 'Entities per Document'},
        color=entity_density.values,
        color_continuous_scale='plasma'
    )
    
    fig_density.update_layout(
        xaxis_tickangle=-45,
        height=500
    )
    
    fig_density.show()
    
    # Save figures
    fig_heatmap.write_html(FIGURES_DIR / 'entity_municipality_heatmap.html')
    fig_density.write_html(FIGURES_DIR / 'entity_density_by_municipality.html')
    
    print("💾 Cross-municipality analysis visualizations saved")

💾 Cross-municipality analysis visualizations saved
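
The heatmap shows how entity types are distributed across municipalities, but not how strong the association is. A hedged sketch of a chi-square statistic with Cramér's V as an effect size follows. It uses only the four municipalities whose counts are fully visible in the Section 3 output (Alandroal is truncated there and Fundao is not shown) and leaves out the blank label, so the result describes that subset and not the whole corpus. The helper name `chi_square_statistic` is illustrative.

```python
from math import sqrt

labels = ["Posicionamento", "Assunto", "Informação Pessoal", "Metadados", "Ordem do Dia"]

# Counts copied from the per-municipality output in Section 3 (blank labels excluded).
observed = {
    "Covilha":    [2201, 2159, 1289, 305, 168],
    "Campomaior": [1124, 1192, 2297, 264, 116],
    "Guimaraes":  [1679, 1662,  928, 366,  39],
    "Porto":      [1715, 1416,  244, 370, 463],
}

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic, grand_total

table = list(observed.values())
chi2, n = chi_square_statistic(table)
dof = (len(table) - 1) * (len(labels) - 1)

# Cramér's V rescales chi-square to the 0..1 range so tables of different size are comparable.
cramers_v = sqrt(chi2 / (n * (min(len(table), len(labels)) - 1)))

print(f"chi2 = {chi2:.1f}, dof = {dof}, n = {n}, Cramér's V = {cramers_v:.3f}")
```

In the notebook itself, `chi2_contingency` (already imported from `scipy.stats`) on `entity_muni_cross` would give the same statistic plus a p-value for all six municipalities. With counts this large the p-value says little, so the effect size is the more informative number.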


## 5. Posicionamento Analysis (Voting Behavior)

In [13]:
# Posicionamento analysis
if not relations_df.empty and 'posicionamento' in relations_df.columns:
    # Filter relations with posicionamento data
    pos_data = relations_df.dropna(subset=['posicionamento'])
    
    if not pos_data.empty:
        print("=== POSICIONAMENTO ANALYSIS ===")
        
        # Overall posicionamento distribution
        pos_counts = pos_data['posicionamento'].value_counts()
        pos_percentages = (pos_counts / pos_counts.sum() * 100).round(2)
        
        print("\nPositioning Distribution:")
        for pos, count in pos_counts.items():
            percentage = pos_percentages[pos]
            print(f"  {pos}: {count:,} ({percentage:.1f}%)")
        
        # Create visualization
        fig_pos = make_subplots(
            rows=1, cols=2,
            subplot_titles=('Posicionamento Frequencies', 'Posicionamento Distribution'),
            specs=[[{'type': 'bar'}, {'type': 'pie'}]]
        )
        
        # Bar chart
        colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd'][:len(pos_counts)]
        fig_pos.add_trace(
            go.Bar(x=pos_counts.index, y=pos_counts.values, 
                   marker_color=colors, name='Count'),
            row=1, col=1
        )
        
        # Pie chart
        fig_pos.add_trace(
            go.Pie(labels=pos_counts.index, values=pos_counts.values, 
                   marker_colors=colors, name='Distribution'),
            row=1, col=2
        )
        
        fig_pos.update_layout(
            title_text="🗳️ Posicionamento (Voting Position) Analysis",
            height=500,
            showlegend=False
        )
        
        fig_pos.show()
        
        # Save figure
        fig_pos.write_html(FIGURES_DIR / 'posicionamento_analysis.html')
        fig_pos.write_image(FIGURES_DIR / 'posicionamento_analysis.png', width=1200, height=600)
        
    else:
        print("⚠️ No posicionamento data found")
else:
    print("⚠️ No relations or posicionamento column found")

=== POSICIONAMENTO ANALYSIS ===

Positioning Distribution:
  a favor: 2,643 (65.3%)
  abstenção: 1,103 (27.2%)
  contra: 198 (4.9%)
  não presente: 105 (2.6%)


In [14]:
# Posicionamento by municipality analysis
if not relations_df.empty and 'posicionamento' in relations_df.columns:
    pos_data = relations_df.dropna(subset=['posicionamento'])
    
    if not pos_data.empty:
        # Cross-tabulation
        pos_muni_cross = pd.crosstab(pos_data['municipality'], pos_data['posicionamento'])
        
        # Normalize by municipality (percentage within each municipality)
        pos_muni_pct = pos_muni_cross.div(pos_muni_cross.sum(axis=1), axis=0) * 100
        
        # Create stacked bar chart
        fig_pos_muni = go.Figure()
        
        colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']
        for i, pos_type in enumerate(pos_muni_cross.columns):
            fig_pos_muni.add_trace(go.Bar(
                name=pos_type,
                x=pos_muni_cross.index,
                y=pos_muni_cross[pos_type],
                marker_color=colors[i % len(colors)]
            ))
        
        fig_pos_muni.update_layout(
            title='🏛️ Posicionamento Distribution by Municipality',
            xaxis_title='Municipality',
            yaxis_title='Number of Positions',
            barmode='stack',
            xaxis_tickangle=-45,
            height=600
        )
        
        fig_pos_muni.show()
        
        # Create percentage heatmap
        fig_pos_heatmap = px.imshow(
            pos_muni_pct.values,
            x=pos_muni_pct.columns,
            y=pos_muni_pct.index,
            title='📊 Posicionamento Patterns by Municipality (%)',
            labels={'x': 'Position Type', 'y': 'Municipality', 'color': 'Percentage'},
            aspect='auto',
            color_continuous_scale='RdYlBu'
        )
        
        fig_pos_heatmap.update_layout(height=500)
        fig_pos_heatmap.show()
        
        # Save figures
        fig_pos_muni.write_html(FIGURES_DIR / 'posicionamento_by_municipality.html')
        fig_pos_heatmap.write_html(FIGURES_DIR / 'posicionamento_patterns_heatmap.html')
        
        print("💾 Posicionamento by municipality analysis saved")

💾 Posicionamento by municipality analysis saved
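
The row-wise normalisation used for the heatmap (`div(..., axis=0)`) can also be requested directly from `pd.crosstab` with `normalize='index'`. The sketch below checks on invented toy data that both routes agree. Only the column names `municipality` and `posicionamento` come from the notebook.

```python
import pandas as pd

# Toy stand-in for pos_data; the rows are invented.
toy_positions = pd.DataFrame({
    "municipality": ["Porto", "Porto", "Porto", "Porto", "Fundao", "Fundao"],
    "posicionamento": ["a favor", "a favor", "a favor", "contra", "a favor", "abstenção"],
})

counts = pd.crosstab(toy_positions["municipality"], toy_positions["posicionamento"])

# Route 1: divide each row by its own total (as in the cell above).
manual_pct = counts.div(counts.sum(axis=1), axis=0) * 100

# Route 2: let crosstab normalise within each row.
builtin_pct = pd.crosstab(
    toy_positions["municipality"],
    toy_positions["posicionamento"],
    normalize="index",
) * 100

max_difference = (manual_pct - builtin_pct).abs().to_numpy().max()
print(manual_pct.round(1))
print(f"Largest difference between the two routes: {max_difference}")
```

Row percentages make municipalities with very different document counts comparable, which the stacked bar chart of raw counts does not.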


In [15]:
# Fix the posicionamento analysis visualization
print("=== POSICIONAMENTO ANALYSIS (FIXED) ===")

# Get posicionamento data
posicionamento_data = entities_df[entities_df['posicionamento'].notna()]

if not posicionamento_data.empty:
    from plotly.subplots import make_subplots
    import plotly.graph_objects as go
    import plotly.colors
    
    print("🗳️ Posicionamento (Voting Position) Analysis:")
    
    # Posicionamento distribution
    pos_counts = posicionamento_data['posicionamento'].value_counts()
    total_pos = len(posicionamento_data)
    
    print(f"\n📊 Position Distribution ({total_pos:,} total entities):")
    for pos_type, count in pos_counts.items():
        pct = (count / total_pos) * 100
        print(f"   {pos_type}: {count:,} ({pct:.1f}%)")
    
    # Define consistent color palette for all charts
    position_types = list(pos_counts.keys())
    color_palette = plotly.colors.qualitative.Set3[:len(position_types)]
    color_map = dict(zip(position_types, color_palette))
    
    # Create 2x2 subplot with proper specs for pie and bar charts
    fig_posicionamento = make_subplots(
        rows=2, cols=2,
        subplot_titles=('Position Distribution', 'Positions by Municipality', 
                       'Position Patterns Over Time', 'Position Types by Entity Count'),
        specs=[[{'type': 'pie'}, {'type': 'bar'}], 
               [{'type': 'bar'}, {'type': 'bar'}]]
    )
    
    # 1. Pie chart for position distribution with consistent colors
    fig_posicionamento.add_trace(
        go.Pie(
            labels=list(pos_counts.keys()),
            values=list(pos_counts.values),
            name="Position Types",
            marker=dict(colors=[color_map[pos] for pos in pos_counts.keys()])
        ),
        row=1, col=1
    )
    
    # 2. Positions by municipality with consistent colors
    pos_muni = posicionamento_data.groupby(['municipality', 'posicionamento']).size().unstack(fill_value=0)
    
    for pos_type in pos_muni.columns:
        fig_posicionamento.add_trace(
            go.Bar(
                x=list(pos_muni.index),
                y=list(pos_muni[pos_type]),
                name=pos_type,
                marker_color=color_map.get(pos_type, 'gray'),
                showlegend=False
            ),
            row=1, col=2
        )
    
    # 3. Position patterns over time (by year) with consistent colors
    # Fix date issues first
    posicionamento_data_fixed = posicionamento_data.copy()
    
    def extract_year_from_filename(filename):
        # Extract year from filename pattern like "Municipality_cm_XXX_YYYY-MM-DD.json"
        if not isinstance(filename, str):
            return None  # e.g. NaN for a missing filename
        for part in filename.split('_'):
            if len(part) >= 10 and '-' in part:  # Date part
                date_part = part.split('.')[0]  # Remove .json
                year = date_part.split('-')[0]
                if len(year) == 4 and year.isdigit():
                    return int(year)
        return None
    
    posicionamento_data_fixed['year_from_filename'] = posicionamento_data_fixed['filename'].apply(extract_year_from_filename)
    pos_year = posicionamento_data_fixed.dropna(subset=['year_from_filename']).groupby(['year_from_filename', 'posicionamento']).size().unstack(fill_value=0)
    
    for pos_type in pos_year.columns:
        fig_posicionamento.add_trace(
            go.Bar(
                x=list(pos_year.index),
                y=list(pos_year[pos_type]),
                name=f"{pos_type} (by year)",
                marker_color=color_map.get(pos_type, 'gray'),
                showlegend=False
            ),
            row=2, col=1
        )
    
    # 4. Entity count comparison (different color scheme since it's different data)
    entity_counts = posicionamento_data['entity_label'].value_counts()
    
    fig_posicionamento.add_trace(
        go.Bar(
            x=list(entity_counts.index),
            y=list(entity_counts.values),
            name="Entity Count",
            marker_color='lightgreen',
            showlegend=False
        ),
        row=2, col=2
    )
    
    fig_posicionamento.update_layout(
        height=800, 
        title_text="Posicionamento (Voting Position) Analysis",
        showlegend=False
    )
    fig_posicionamento.show()
    
    print(f"\n📈 Posicionamento by Municipality:")
    for municipality in pos_muni.index:
        total = pos_muni.loc[municipality].sum()
        print(f"   - {municipality}: {total} position mentions")
        
    print(f"\n📅 Posicionamento by Year:")
    for year in sorted(pos_year.index):
        total = pos_year.loc[year].sum()
        print(f"   - {int(year)}: {total} position mentions")
    
    print(f"\n🏷️ Entities with Posicionamento:")
    for entity_type, count in entity_counts.items():
        print(f"   - {entity_type}: {count} occurrences")
        
else:
    print("⚠️ No posicionamento data available for analysis")

=== POSICIONAMENTO ANALYSIS (FIXED) ===
🗳️ Posicionamento (Voting Position) Analysis:

📊 Position Distribution (8,775 total entities):
   Votante: 3,966 (45.2%)
   Votação: 2,748 (31.3%)
   Contabilização global: 1,962 (22.4%)
   Não votante: 99 (1.1%)



📈 Posicionamento by Municipality:
   - Alandroal: 1306 position mentions
   - Campomaior: 1123 position mentions
   - Covilha: 2201 position mentions
   - Fundao: 752 position mentions
   - Guimaraes: 1679 position mentions
   - Porto: 1714 position mentions

📅 Posicionamento by Year:
   - 2021: 1051 position mentions
   - 2022: 2611 position mentions
   - 2023: 2642 position mentions
   - 2024: 2471 position mentions

🏷️ Entities with Posicionamento:
   - Posicionamento: 8775 occurrences
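
The year-by-year counts above depend on recovering a year from each filename. A minimal sketch of that step using a regular expression, assuming filenames follow the `Municipality_cm_XXX_YYYY-MM-DD.json` pattern noted in the cell's comment; the function name `year_from_filename` and the sample filenames are hypothetical:

```python
import re
from typing import Optional

# Assumed pattern: an ISO-like date (YYYY-MM-DD) somewhere in the filename,
# not preceded by another digit.
_DATE_RE = re.compile(r"(?<!\d)(\d{4})-\d{2}-\d{2}")

def year_from_filename(filename: object) -> Optional[int]:
    """Return the four-digit year embedded in a filename, or None if absent."""
    if not isinstance(filename, str):
        return None  # e.g. NaN from a missing filename
    match = _DATE_RE.search(filename)
    return int(match.group(1)) if match else None

# Hypothetical filenames for illustration only.
samples = ["Covilha_cm_012_2023-05-19.json", "Porto_cm_003_2021-11-02.json", "notes.txt"]
years = [year_from_filename(name) for name in samples]
print(years)
```

Compared with splitting on underscores, the regular expression does not care where the date sits in the name, which may help if municipality names themselves contain underscores.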


## 6. Assunto Analysis (Subject Matter)

Analysis of ASSUNTO entities, restricted to individual keyword entities: entities carrying a Fronteira boundary marker are excluded so that only content-bearing subject text is counted.

**Note**: This section is the foundation for the complete dual-dimensional analysis (Fronteira-based sections and individual keywords) presented in Section 13.
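
The exclusion described above can be sketched on a toy table. This assumes, as in the notebook's cells, an `entity_label` column and a `fronteira` column that is null for keyword entities; the rows below are invented for illustration:

```python
import pandas as pd

# Toy data (hypothetical rows); column names follow the notebook's entities_df.
toy = pd.DataFrame({
    "entity_label": ["Assunto", "Assunto", "Assunto", "Posicionamento"],
    "fronteira": ["Fronteira Inicial", None, "Fronteira Final", None],
    "text": ["Ponto 1", "balancete", "aprovado", "a favor"],
})

is_assunto = toy["entity_label"].str.contains("ASSUNTO", case=False, na=False)
is_marker = toy["fronteira"].notna()

keywords = toy[is_assunto & ~is_marker]  # content-bearing ASSUNTO entities
markers = toy[is_assunto & is_marker]    # Fronteira boundary markers

print(len(keywords), len(markers))
```

Naming the two masks separately keeps the keyword view and the marker view complementary, so every ASSUNTO row lands in exactly one of them.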

In [16]:
# Assunto (subject) analysis
if not entities_df.empty:
    # Filter for ASSUNTO entities (case-insensitive) excluding Fronteira boundary markers
    # Only include ASSUNTO entities that contain actual subject matter, not organizational markers
    assunto_entities = entities_df[
        (entities_df['entity_label'].str.contains('ASSUNTO', case=False, na=False)) &
        (entities_df['fronteira'].isna())  # Exclude entities with Fronteira field
    ]
    
    if not assunto_entities.empty:
        print("=== ASSUNTO ANALYSIS ===")
        
        # Basic statistics
        print(f"\n📊 Basic Statistics:")
        print(f"   Total ASSUNTO entities: {len(assunto_entities):,}")
        print(f"   Documents with ASSUNTO: {assunto_entities['filename'].nunique():,}")
        print(f"   Average ASSUNTO per document: {len(assunto_entities) / assunto_entities['filename'].nunique():.2f}")
        
        # Length analysis
        print(f"\n📏 Length Statistics:")
        print(f"   Average length (chars): {assunto_entities['length'].mean():.2f}")
        print(f"   Average length (tokens): {assunto_entities['token_count'].mean():.2f}")
        print(f"   Length range: {assunto_entities['length'].min()}-{assunto_entities['length'].max()} chars")
        
        # Create length distribution visualization
        fig_assunto_length = px.histogram(
            assunto_entities,
            x='length',
            title='📏 ASSUNTO Entity Length Distribution',
            labels={'length': 'Length (characters)', 'count': 'Frequency'},
            marginal='box',
            nbins=20
        )
        
        fig_assunto_length.update_layout(height=500)
        fig_assunto_length.show()
        
        # Save figure
        fig_assunto_length.write_html(FIGURES_DIR / 'assunto_length_distribution.html')
        
    else:
        print("⚠️ No ASSUNTO entities found")
else:
    print("⚠️ No entities data available")

=== ASSUNTO ANALYSIS ===

📊 Basic Statistics:
   Total ASSUNTO entities: 2,888
   Documents with ASSUNTO: 120
   Average ASSUNTO per document: 24.07

📏 Length Statistics:
   Average length (chars): 96.99
   Average length (tokens): 15.38
   Length range: 7-530 chars


In [17]:

# ASSUNTO content and word frequency analysis
if not entities_df.empty:
    assunto_entities = entities_df[
        (entities_df['entity_label'].str.contains('ASSUNTO', case=False, na=False)) &
        (entities_df['fronteira'].isna())  # Exclude entities with Fronteira field
    ]
    
    if not assunto_entities.empty:
        # Get most common ASSUNTO texts
        assunto_texts = assunto_entities['text'].dropna()
        common_assuntos = assunto_texts.value_counts().head(20)
        
        print(f"\n📋 Most Common ASSUNTO Texts (Top 20):")
        for i, (text, count) in enumerate(common_assuntos.items(), 1):
            print(f"{i:2d}. '{text}' ({count} occurrences)")
        
        # Create bar chart of most common assuntos
        if len(common_assuntos) > 0:
            fig_common_assuntos = px.bar(
                x=common_assuntos.values,
                y=common_assuntos.index,
                orientation='h',
                title='📊 Most Common ASSUNTO Texts',
                labels={'x': 'Frequency', 'y': 'ASSUNTO Text'},
                color=common_assuntos.values,
                color_continuous_scale='viridis'
            )
            
            fig_common_assuntos.update_layout(
                height=600,
                yaxis={'categoryorder': 'total ascending'}
            )
            
            fig_common_assuntos.show()
            
            # Save figure
            fig_common_assuntos.write_html(FIGURES_DIR / 'common_assuntos.html')
        
        # Word frequency analysis
        # Stop words to drop (one- and two-letter entries are also caught by the length filter)
        stop_words = {'de', 'da', 'do', 'das', 'dos', 'em', 'na', 'no', 'para', 'por',
                      'com', 'uma', 'um', 'que', 'se', 'ou', 'ao', 'aos'}
        all_words = []
        for text in assunto_texts:
            if isinstance(text, str) and len(text.strip()) > 0:
                # Simple word splitting (could be enhanced with Portuguese NLP)
                words = text.lower().split()
                # Filter out very short words and common stop words
                filtered_words = [word for word in words if len(word) > 2 and word not in stop_words]
                all_words.extend(filtered_words)
        
        if all_words:
            word_freq = Counter(all_words)
            top_words = dict(word_freq.most_common(30))
            
            # The header refers to the 30 words kept for the chart; only the first 15 are printed
            print(f"\n🔤 Most Frequent Words in ASSUNTO Texts (Top 30):")
            for i, (word, count) in enumerate(list(top_words.items())[:15], 1):
                print(f"{i:2d}. '{word}': {count} times")
            
            # Create word frequency bar chart
            fig_word_freq = px.bar(
                x=list(top_words.keys()),
                y=list(top_words.values()),
                title='🔤 Most Frequent Words in ASSUNTO Entities',
                labels={'x': 'Words', 'y': 'Frequency'},
                color=list(top_words.values()),
                color_continuous_scale='plasma'
            )
            
            fig_word_freq.update_layout(
                height=500,
                xaxis_tickangle=-45
            )
            
            fig_word_freq.show()
            
            # Create word cloud if possible
            try:
                if len(top_words) > 5:
                    wordcloud = WordCloud(
                        width=800, height=400, 
                        background_color='white',
                        max_words=50,
                        colormap='viridis'
                    ).generate_from_frequencies(top_words)
                    
                    plt.figure(figsize=(12, 6))
                    plt.imshow(wordcloud, interpolation='bilinear')
                    plt.axis('off')
                    plt.title('☁️ ASSUNTO Words Cloud', fontsize=16, pad=20)
                    plt.tight_layout()
                    plt.savefig(FIGURES_DIR / 'assunto_wordcloud.png', dpi=300, bbox_inches='tight')
                    plt.show()
                    
                    print("💾 Word cloud saved")
            except Exception as e:
                print(f"⚠️ Could not create word cloud: {e}")
            
            # Save word frequency figure
            fig_word_freq.write_html(FIGURES_DIR / 'assunto_word_frequency.html')
            
        print("💾 ASSUNTO analysis visualizations saved")


📋 Most Common ASSUNTO Texts (Top 20):
 1. 'atribuição de habitação municipal' (21 occurrences)
 2. 'PRESENTE ATA EM MINUTA' (20 occurrences)
 3. 'ATA EM MINUTA' (20 occurrences)
 4. 'balancete' (15 occurrences)
 5. 'assuntos' (14 occurrences)
 6. 'assunto' (14 occurrences)
 7. 'alteração orçamental' (14 occurrences)
 8. 'PASSAGEM DE CERTIDÃO DE ISENÇÃO DE LICENÇA DE HABITABILIDADE' (10 occurrences)
 9. 'Reconhecimento da isenção de IMI e de IMT para os prédios' (9 occurrences)
10. 'alteração orçamental permutativa' (9 occurrences)
11. 'alterações orçamentais' (8 occurrences)
12. 'atas para aprovação' (7 occurrences)
13. 'atribuição de um apoio à fixação de residência em habitação própria' (5 occurrences)
14. 'REEMBOLSO DE 20% DO IMI - REGULAMENTO MUNICIPAL DE CONCESSÃO DE DIREITOS E BENEFÍCIOS AOS BOMBEIROS VOLUNTÁRIOS DO CONCELHO DE GUIMARÃES' (5 occurrences)
15. 'celebração do Protocolo de Apoio entre o Município da Covilhã e a União de Freguesias de Teixoso e Sarzedo' (4 occurrence


🔤 Most Frequent Words in ASSUNTO Texts (Top 30):
 1. 'município': 413 times
 2. 'apoio': 369 times
 3. 'entre': 339 times
 4. 'municipal': 329 times
 5. 'covilhã': 262 times
 6. 'protocolo': 251 times
 7. 'atribuição': 246 times
 8. 'n.º': 225 times
 9. 'empreitada': 207 times
10. 'aprovação': 203 times
11. 'obras': 182 times
12. 'celebração': 180 times
13. 'associação': 178 times
14. 'cedência': 169 times
15. 'freguesia': 167 times


💾 Word cloud saved
💾 ASSUNTO analysis visualizations saved
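
The word list above keeps tokens such as 'entre' and 'n.º', which suggests that whitespace splitting and the short stop-word list let some function words and abbreviations through. A minimal sketch of a slightly stricter tokenizer using only the standard library; the regular expression and the extended stop-word set are assumptions for illustration, not the notebook's method:

```python
import re
from collections import Counter

# Assumed stop-word set: the notebook's list plus 'entre'.
STOP_WORDS = {"de", "da", "do", "das", "dos", "em", "na", "no", "para", "por",
              "com", "uma", "um", "que", "se", "ou", "ao", "aos", "entre"}

# Runs of letters only (accented characters included); digits and punctuation split tokens.
WORD_RE = re.compile(r"[^\W\d_]+")

def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-letters, and drop short words and stop words."""
    words = WORD_RE.findall(text.lower())
    return [w for w in words if len(w) > 2 and w not in STOP_WORDS]

# Hypothetical sample texts in the style of the ASSUNTO entities.
texts = ["protocolo de apoio entre o municipio e a freguesia", "apoio n.º 12 de 2023"]
word_freq = Counter(word for t in texts for word in tokenize(t))
print(word_freq.most_common(3))
```

Splitting on non-letters breaks 'n.º' into fragments of one character each, which the length filter then removes, so no special case is needed for it.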


In [18]:
# ASSUNTO analysis by municipality
if not entities_df.empty:
    assunto_entities = entities_df[
        (entities_df['entity_label'].str.contains('ASSUNTO', case=False, na=False)) &
        (entities_df['fronteira'].isna())  # Exclude entities with Fronteira field
    ]
    
    if not assunto_entities.empty:
        # ASSUNTO count by municipality
        assunto_by_muni = assunto_entities.groupby('municipality').agg({
            'text': ['count', 'nunique'],
            'length': ['mean', 'std'],
            'token_count': ['mean', 'std']
        }).round(2)
        
        # Flatten column names
        assunto_by_muni.columns = ['_'.join(col).strip() for col in assunto_by_muni.columns.values]
        assunto_by_muni = assunto_by_muni.reset_index()
        
        print(f"\n📊 ASSUNTO Statistics by Municipality:")
        print(assunto_by_muni)
        
        # Create visualization
        fig_assunto_muni = make_subplots(
            rows=2, cols=2,
            subplot_titles=(
                'Total ASSUNTO Count',
                'Unique ASSUNTO Count', 
                'Average Length (chars)',
                'Average Token Count'
            )
        )
        
        # Add traces
        fig_assunto_muni.add_trace(
            go.Bar(x=assunto_by_muni['municipality'], y=assunto_by_muni['text_count'], name='Total Count'),
            row=1, col=1
        )
        
        fig_assunto_muni.add_trace(
            go.Bar(x=assunto_by_muni['municipality'], y=assunto_by_muni['text_nunique'], name='Unique Count'),
            row=1, col=2
        )
        
        fig_assunto_muni.add_trace(
            go.Bar(x=assunto_by_muni['municipality'], y=assunto_by_muni['length_mean'], name='Avg Length'),
            row=2, col=1
        )
        
        fig_assunto_muni.add_trace(
            go.Bar(x=assunto_by_muni['municipality'], y=assunto_by_muni['token_count_mean'], name='Avg Tokens'),
            row=2, col=2
        )
        
        fig_assunto_muni.update_layout(
            title_text='🏛️ ASSUNTO Analysis by Municipality',
            height=800,
            showlegend=False
        )
        
        # Rotate x-axis labels
        fig_assunto_muni.update_xaxes(tickangle=-45)
        
        fig_assunto_muni.show()
        
        # Save figure
        fig_assunto_muni.write_html(FIGURES_DIR / 'assunto_by_municipality.html')
        
        # Save statistics table
        assunto_by_muni.to_csv(STATISTICS_DIR / 'assunto_by_municipality.csv', index=False)
        
        print("💾 ASSUNTO municipality analysis saved")


📊 ASSUNTO Statistics by Municipality:
  municipality  text_count  text_nunique  length_mean  length_std  \
0    Alandroal         508           459        80.78       52.18   
1   Campomaior         398           345       103.95       66.42   
2      Covilha         721           599        91.86       51.09   
3       Fundao         235           229       104.61       47.83   
4    Guimaraes         554           499        95.19       49.47   
5        Porto         472           453       114.72       50.15   

   token_count_mean  token_count_std  
0             12.57             8.04  
1             16.86            10.69  
2             14.69             8.43  
3             16.80             7.82  
4             15.07             7.83  
5             17.83             7.56  


💾 ASSUNTO municipality analysis saved
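
One quick reading of the table above is how often subject texts repeat verbatim within a municipality. A small sketch that derives the share of distinct texts from the printed `text_count` and `text_nunique` columns; the dictionary is copied from that output, and the name `unique_share` is illustrative:

```python
# (text_count, text_nunique) per municipality, copied from the table above.
counts = {
    "Alandroal": (508, 459),
    "Campomaior": (398, 345),
    "Covilha": (721, 599),
    "Fundao": (235, 229),
    "Guimaraes": (554, 499),
    "Porto": (472, 453),
}

# Share of distinct ASSUNTO texts; values near 1 mean little verbatim repetition.
unique_share = {muni: unique / total for muni, (total, unique) in counts.items()}

for muni, share in sorted(unique_share.items(), key=lambda kv: kv[1]):
    print(f"{muni}: {share:.1%}")
```

On these figures Covilha has the lowest share of distinct texts and Fundao the highest, which may reflect recurring agenda items rather than annotation differences.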


## 8. Fronteiras Analysis (Boundaries/Limits)

Analysis of **fronteira** annotations. In this dataset the field takes two values, Fronteira Inicial and Fronteira Final, and it co-occurs only with Assunto entities, where it marks the start and end of a subject section (the boundary markers excluded in Section 6).

In [19]:
# Fronteiras analysis
fronteiras_analysis = comprehensive_analysis.get('fronteiras_analysis', {})

if 'error' not in fronteiras_analysis and fronteiras_analysis:
    print("=== FRONTEIRAS ANALYSIS ===")
    print(f"🏛️ Total Fronteira entities: {fronteiras_analysis['total_fronteira_entities']:,}")
    print(f"📄 Documents with fronteira: {fronteiras_analysis['documents_with_fronteira']:,}")
    
    # Fronteira type distribution
    type_dist = fronteiras_analysis.get('fronteira_type_distribution', {})
    if type_dist:
        print(f"\n📊 Fronteira Type Distribution:")
        for f_type, count in type_dist.items():
            percentage = fronteiras_analysis.get('fronteira_type_percentages', {}).get(f_type, 0)
            print(f"   • {f_type}: {count:,} ({percentage}%)")
    
    # Municipality distribution
    muni_dist = fronteiras_analysis.get('fronteira_by_municipality', {})
    if muni_dist:
        print(f"\n🏛️ Fronteiras by Municipality:")
        for fronteira_type, municipalities in muni_dist.items():
            print(f"   {fronteira_type}:")
            for muni, count in municipalities.items():
                print(f"      • {muni}: {count:,}")
    
    # Entity co-occurrence
    cooccurrence = fronteiras_analysis.get('fronteira_entity_cooccurrence', {})
    if cooccurrence:
        print(f"\n🏷️ Entity Co-occurrence with Fronteiras:")
        for entity_type, fronteira_types in cooccurrence.items():
            print(f"   {entity_type}:")
            for f_type, count in fronteira_types.items():
                print(f"      • {f_type}: {count:,}")
    
    # Text statistics
    text_stats = fronteiras_analysis.get('fronteira_text_statistics', {})
    if text_stats:
        print(f"\n📝 Fronteira Text Statistics:")
        print(f"   • Average length: {text_stats['avg_length_chars']:.1f} characters, {text_stats['avg_length_tokens']:.1f} tokens")
        
        length_by_type = text_stats.get('length_by_fronteira_type', {})
        if length_by_type and 'mean' in length_by_type:
            print(f"   • Length by type:")
            for f_type, avg_len in length_by_type['mean'].items():
                std_len = length_by_type.get('std', {}).get(f_type, 0)
                count = length_by_type.get('count', {}).get(f_type, 0)
                print(f"      - {f_type}: {avg_len:.1f}±{std_len:.1f} chars ({count} entities)")

else:
    print("=== FRONTEIRAS ANALYSIS ===")
    if 'error' in fronteiras_analysis:
        print(f"⚠️ Error in fronteiras analysis: {fronteiras_analysis['error']}")
    else:
        print("⚠️ No fronteiras data found or analysis empty")

=== FRONTEIRAS ANALYSIS ===
🏛️ Total Fronteira entities: 5,759
📄 Documents with fronteira: 120

📊 Fronteira Type Distribution:
   • Fronteira Inicial: 2,880 (50.01%)
   • Fronteira Final: 2,879 (49.99%)

🏛️ Fronteiras by Municipality:
   Fronteira Final:
      • Alandroal: 503
      • Campomaior: 397
      • Covilha: 719
      • Fundao: 234
      • Guimaraes: 554
      • Porto: 472
   Fronteira Inicial:
      • Alandroal: 503
      • Campomaior: 397
      • Covilha: 719
      • Fundao: 235
      • Guimaraes: 554
      • Porto: 472

🏷️ Entity Co-occurrence with Fronteiras:
   Assunto:
      • Fronteira Inicial: 2,880
      • Fronteira Final: 2,879

📝 Fronteira Text Statistics:
   • Average length: 8.4 characters, 1.1 tokens
   • Length by type:
      - Fronteira Final: 9.1±3.5 chars (2879 entities)
      - Fronteira Inicial: 7.6±3.8 chars (2880 entities)
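
If each section is opened by one Fronteira Inicial and closed by one Fronteira Final, the two counts should match within every municipality. A minimal sketch of that consistency check, using the counts printed above; the pairing assumption and the variable names are illustrative:

```python
# Counts per municipality, copied from the output above.
inicial = {"Alandroal": 503, "Campomaior": 397, "Covilha": 719,
           "Fundao": 235, "Guimaraes": 554, "Porto": 472}
final = {"Alandroal": 503, "Campomaior": 397, "Covilha": 719,
         "Fundao": 234, "Guimaraes": 554, "Porto": 472}

# Municipalities where opening and closing markers do not pair up,
# mapped to the surplus of opening markers.
unbalanced = {
    muni: inicial[muni] - final.get(muni, 0)
    for muni in inicial
    if inicial[muni] != final.get(muni, 0)
}
print(unbalanced)
```

The single surplus opening marker falls in Fundao, so that municipality's annotations are the place to look for an unclosed section.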


In [20]:
# Fronteira visualization rebuilt directly from entities_df (type distribution and per-municipality counts)
print("=== FRONTEIRAS ANALYSIS (FIXED) ===")

# Get fronteira data
fronteiras_data = entities_df[entities_df['fronteira'].notna()]

if not fronteiras_data.empty:
    print("🔍 Fronteira Analysis:")
    
    # Fronteira type distribution
    fronteira_counts = fronteiras_data['fronteira'].value_counts()
    total_fronteiras = len(fronteiras_data)
    
    print(f"\n📊 Fronteira Distribution ({total_fronteiras:,} total entities):")
    for fronteira_type, count in fronteira_counts.items():
        pct = (count / total_fronteiras) * 100
        print(f"   {fronteira_type}: {count:,} ({pct:.1f}%)")
    
    # Create visualization with corrected specs and values
    from plotly.subplots import make_subplots
    import plotly.graph_objects as go
    
    fig_fronteiras = make_subplots(
        rows=1, cols=2,
        subplot_titles=('Fronteira Type Distribution', 'Fronteira by Municipality'),
        specs=[[{'type': 'pie'}, {'type': 'bar'}]]
    )
    
    # Pie chart for distribution
    fig_fronteiras.add_trace(
        go.Pie(
            labels=list(fronteira_counts.keys()),
            values=list(fronteira_counts.values),  # Fixed: .values is a property, not method
            name="Fronteira Types"
        ),
        row=1, col=1
    )
    
    # Bar chart by municipality
    fronteira_muni = fronteiras_data.groupby(['municipality', 'fronteira']).size().unstack(fill_value=0)
    
    for fronteira_type in fronteira_muni.columns:
        fig_fronteiras.add_trace(
            go.Bar(
                x=list(fronteira_muni.index),
                y=list(fronteira_muni[fronteira_type]),
                name=fronteira_type,
                showlegend=False
            ),
            row=1, col=2
        )
    
    fig_fronteiras.update_layout(height=500, title_text="Fronteira Analysis")
    fig_fronteiras.show()
    
    print(f"\n📈 Municipalities with fronteira data: {fronteiras_data['municipality'].nunique()}")
    print(f"📈 Documents with fronteira data: {fronteiras_data['filename'].nunique()}")
    
else:
    print("⚠️ No fronteira data available for analysis")

=== FRONTEIRAS ANALYSIS (FIXED) ===
🔍 Fronteira Analysis:

📊 Fronteira Distribution (5,759 total entities):
   Fronteira Inicial: 2,880 (50.0%)
   Fronteira Final: 2,879 (50.0%)



📈 Municipalities with fronteira data: 6
📈 Documents with fronteira data: 120


## 10. Metadata Analysis (Meeting Information)

Comprehensive analysis of document metadata including meeting types, participants, presence information, political parties, and scheduling patterns.

In [21]:
# Metadata analysis
metadata_analysis = comprehensive_analysis.get('metadata_analysis', {})

if 'error' not in metadata_analysis and metadata_analysis:
    print("=== METADATA ANALYSIS ===")
    
    # Meeting type analysis
    if 'meeting_type_analysis' in metadata_analysis:
        meeting_types = metadata_analysis['meeting_type_analysis']
        print(f"\n🏛️ Meeting Type Analysis:")
        print(f"   Total with meeting type info: {meeting_types.get('total_with_meeting_type', 0):,}")
        
        types_dist = meeting_types.get('meeting_types', {})
        types_pct = meeting_types.get('meeting_type_percentages', {})
        
        for meeting_type, count in types_dist.items():
            pct = types_pct.get(meeting_type, 0)
            print(f"     {meeting_type}: {count:,} ({pct:.1f}%)")
    
    # Political party analysis
    if 'political_party_analysis' in metadata_analysis:
        party_analysis = metadata_analysis['political_party_analysis']
        print(f"\n🏛️ Political Party Analysis:")
        print(f"   Total with party info: {party_analysis.get('total_with_party_info', 0):,}")
        print(f"   Unique parties: {party_analysis.get('unique_parties', 0)}")
        
        party_dist = party_analysis.get('party_distribution', {})
        party_pct = party_analysis.get('party_percentages', {})
        
        print(f"\n   Top Political Parties:")
        for party, count in sorted(party_dist.items(), key=lambda x: x[1], reverse=True)[:10]:
            pct = party_pct.get(party, 0)
            print(f"     {party}: {count:,} ({pct:.1f}%)")
        
        # Party by municipality visualization
        if 'party_by_municipality' in party_analysis:
            party_muni_data = party_analysis['party_by_municipality']
            
            # Create heatmap for party-municipality distribution
            if party_muni_data:
                party_muni_df = pd.DataFrame(party_muni_data).fillna(0)
                
                fig_party_heatmap = px.imshow(
                    party_muni_df.values,
                    x=party_muni_df.columns,
                    y=party_muni_df.index,
                    title='🗳️ Political Party Distribution by Municipality',
                    labels={'x': 'Political Party', 'y': 'Municipality', 'color': 'Count'},
                    aspect='auto',
                    color_continuous_scale='Reds'
                )
                
                fig_party_heatmap.update_layout(height=600)
                fig_party_heatmap.show()
                
                fig_party_heatmap.write_html(FIGURES_DIR / 'political_party_by_municipality.html')
    
    # Presence analysis
    if 'presence_analysis' in metadata_analysis:
        presence_analysis = metadata_analysis['presence_analysis']
        print(f"\n👥 Presence Analysis:")
        print(f"   Total with presence info: {presence_analysis.get('total_with_presence_info', 0):,}")
        
        presence_types = presence_analysis.get('presence_types', {})
        presence_pct = presence_analysis.get('presence_percentages', {})
        
        for presence_type, count in presence_types.items():
            pct = presence_pct.get(presence_type, 0)
            print(f"     {presence_type}: {count:,} ({pct:.1f}%)")
    
    # Schedule analysis
    if 'schedule_analysis' in metadata_analysis:
        schedule_analysis = metadata_analysis['schedule_analysis']
        print(f"\n🕒 Schedule Analysis:")
        print(f"   Total with schedule info: {schedule_analysis.get('total_with_schedule', 0):,}")
        print(f"   Documents with schedule: {schedule_analysis.get('documents_with_schedule', 0):,}")
        
        schedule_patterns = schedule_analysis.get('schedule_patterns', {})
        if schedule_patterns:
            print(f"\n   Most Common Schedule Patterns:")
            for schedule, count in list(schedule_patterns.items())[:10]:
                print(f"     {schedule}: {count} times")
    
    print("💾 Metadata analysis completed")
else:
    print("⚠️ No metadata available for analysis")
    if 'error' in metadata_analysis:
        print(f"   Error: {metadata_analysis['error']}")

=== METADATA ANALYSIS ===

🏛️ Meeting Type Analysis:
   Total with meeting type info: 96
     ordinária: 90 (93.8%)
     extraordinária: 6 (6.2%)

🏛️ Political Party Analysis:
   Total with party info: 969
   Unique parties: 8

   Top Political Parties:
     PS: 441 (45.5%)
     RM: 120 (12.4%)
     PPD/PSD: 99 (10.2%)
     PPD/PSD.CDS-PP: 79 (8.2%)
     CDS-PP/PSD: 71 (7.3%)
     CDU: 60 (6.2%)
     PSD: 39 (4.0%)
     IND: 20 (2.1%)
     BE: 20 (2.1%)
     Nós, Cidadãos!: 20 (2.1%)



👥 Presence Analysis:
   Total with presence info: 977
     Presente: 917 (93.9%)
     Ausente: 36 (3.7%)
     Substituído: 24 (2.5%)

🕒 Schedule Analysis:
   Total with schedule info: 220
   Documents with schedule: 120

   Most Common Schedule Patterns:
     início: 120 times
     fim: 100 times
💾 Metadata analysis completed


In [22]:
# Metadata analysis from entities_df: participation heatmap, schedule patterns, and a document-by-year table
print("=== METADATA ANALYSIS (FULLY FIXED) ===")

# Participation analysis
participantes_data = entities_df[entities_df['participantes'].notna()]
horario_data = entities_df[entities_df['horario'].notna()]

if not participantes_data.empty or not horario_data.empty:
    from plotly.subplots import make_subplots
    import plotly.graph_objects as go
    
    # Create 2-panel layout (remove the meaningless document distribution heatmap)
    fig_metadata = make_subplots(
        rows=1, cols=2,
        subplot_titles=('Participation by Municipality', 'Schedule Patterns'),
        specs=[[{'type': 'heatmap'}, {'type': 'bar'}]]
    )
    
    # 1. Participation by municipality heatmap
    if not participantes_data.empty:
        participation_matrix = participantes_data.groupby(['municipality', 'participantes']).size().unstack(fill_value=0)
        
        fig_metadata.add_trace(
            go.Heatmap(
                z=participation_matrix.values,
                x=list(participation_matrix.columns),
                y=list(participation_matrix.index),
                colorscale='Blues',
                name='Participation',
                showscale=True
            ),
            row=1, col=1
        )
        
        print(f"👥 Participation Analysis:")
        print(f"   - Total entities with participants: {len(participantes_data):,}")
        print(f"   - Documents with participants: {participantes_data['filename'].nunique()}")
        print(f"   - Unique participants: {participantes_data['participantes'].nunique()}")
        
        print(f"\n📊 Participation by Municipality:")
        for municipality in participation_matrix.index:
            total = participation_matrix.loc[municipality].sum()
            print(f"   - {municipality}: {total} participant mentions")
    
    # 2. Schedule patterns bar chart
    if not horario_data.empty:
        schedule_counts = horario_data['horario'].value_counts().head(10)
        
        fig_metadata.add_trace(
            go.Bar(
                x=list(schedule_counts.index),
                y=list(schedule_counts.values),
                name='Schedule Frequency',
                marker_color='orange',
                showlegend=False
            ),
            row=1, col=2
        )
        
        print(f"\n🕒 Schedule Analysis:")
        print(f"   - Total entities with schedule: {len(horario_data):,}")
        print(f"   - Documents with schedule: {horario_data['filename'].nunique()}")
        print(f"   - Top schedule patterns:")
        for schedule, count in schedule_counts.head(5).items():
            print(f"     • {schedule}: {count} occurrences")
    
    fig_metadata.update_layout(
        height=600, 
        title_text="Metadata Analysis - Participation and Schedule Patterns",
        showlegend=False
    )
    fig_metadata.show()
    
    # Show document distribution as a simple table instead of problematic heatmap
    print(f"\n📅 Document Distribution Summary (Fixed Date Issues):")
    
    # Create corrected year data
    documents_df_fixed = documents_df.copy()
    
    def fix_date_format(date_str):
        if pd.isna(date_str) or '- Cópia' in str(date_str):
            return None
        try:
            parts = str(date_str).split('-')
            if len(parts) == 3:
                year, part1, part2 = parts
                if int(part1) > 12:  # day-month format, swap them
                    return f"{year}-{part2}-{part1}"
                else:
                    return date_str
        except ValueError:  # int() failed on a non-numeric field
            return None
        return date_str
    
    documents_df_fixed['date_corrected'] = documents_df_fixed['date'].apply(fix_date_format)
    documents_df_fixed['year_corrected'] = pd.to_datetime(documents_df_fixed['date_corrected'], errors='coerce').dt.year
    
    # Filter out duplicates and show clean distribution
    clean_docs = documents_df_fixed[~documents_df_fixed['filename'].str.contains('- Cópia', na=False)]
    
    doc_time_dist_clean = clean_docs.groupby(['municipality', 'year_corrected']).size().unstack(fill_value=0)
    
    print("┌─────────────┬──────┬──────┬──────┬──────┬───────┐")
    print("│Municipality │ 2021 │ 2022 │ 2023 │ 2024 │ Total │")
    print("├─────────────┼──────┼──────┼──────┼──────┼───────┤")
    
    for municipality in doc_time_dist_clean.index:
        row_data = doc_time_dist_clean.loc[municipality]
        total = row_data.sum()
        print(f"│ {municipality:<11} │  {row_data.get(2021.0, 0):>2}  │  {row_data.get(2022.0, 0):>2}  │  {row_data.get(2023.0, 0):>2}  │  {row_data.get(2024.0, 0):>2}  │  {total:>3}  │")
    
    print("└─────────────┴──────┴──────┴──────┴──────┴───────┘")
    
    total_docs = clean_docs.shape[0]
    total_with_valid_dates = clean_docs['year_corrected'].notna().sum()
    
    print(f"\n📊 Summary:")
    print(f"   - Total documents (excluding duplicates): {total_docs}")
    print(f"   - Documents with valid dates: {total_with_valid_dates}")
    print(f"   - Documents with date issues: {total_docs - total_with_valid_dates}")
        
else:
    print("⚠️ No metadata available for analysis")

=== METADATA ANALYSIS (FULLY FIXED) ===
👥 Participation Analysis:
   - Total entities with participants: 1,137
   - Documents with participants: 120
   - Unique participants: 5

📊 Participation by Municipality:
   - Alandroal: 119 participant mentions
   - Campomaior: 145 participant mentions
   - Covilha: 184 participant mentions
   - Fundao: 160 participant mentions
   - Guimaraes: 247 participant mentions
   - Porto: 282 participant mentions

🕒 Schedule Analysis:
   - Total entities with schedule: 220
   - Documents with schedule: 120
   - Top schedule patterns:
     • início: 120 occurrences
     • fim: 100 occurrences



📅 Document Distribution Summary (Fixed Date Issues):
┌─────────────┬──────┬──────┬──────┬──────┬───────┐
│Municipality │ 2021 │ 2022 │ 2023 │ 2024 │ Total │
├─────────────┼──────┼──────┼──────┼──────┼───────┤
│ Alandroal   │   2  │   5  │   6  │   6  │   19  │
│ Campomaior  │   2  │   6  │   6  │   6  │   20  │
│ Covilha     │   2  │   6  │   6  │   6  │   20  │
│ Fundao      │   2  │   6  │   6  │   5  │   19  │
│ Guimaraes   │   2  │   6  │   6  │   6  │   20  │
│ Porto       │   2  │   6  │   6  │   6  │   20  │
└─────────────┴──────┴──────┴──────┴──────┴───────┘

📊 Summary:
   - Total documents (excluding duplicates): 118
   - Documents with valid dates: 118
   - Documents with date issues: 0
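
The table above is printed with hard-coded year columns (2021 to 2024), so a document dated in any other year would be counted in the row total but not shown in a column. A possible alternative, sketched below on made-up rows, lets `pd.crosstab` derive the year columns and the totals from the data. The `docs` frame and its values are illustrative only; the column names mirror `municipality` and `date` from `documents_df`.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the corpus.
docs = pd.DataFrame({
    "municipality": ["Porto", "Porto", "Fundao", "Fundao", "Fundao"],
    "date": ["2021-03-15", "2022-01-25", "2022-06-30", "2023-01-12", None],
})

# Unparseable or missing dates become NaT and therefore a missing year.
docs["year"] = pd.to_datetime(docs["date"], errors="coerce").dt.year
valid = docs.dropna(subset=["year"])

# One row per municipality, one column per year found in the data,
# plus a "Total" row and column.
table = pd.crosstab(
    valid["municipality"],
    valid["year"].astype(int),
    margins=True,
    margins_name="Total",
)
print(table.to_string())
```

Rows with missing dates are dropped before counting, so the number of date issues still has to be reported separately, as the summary above does.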


In [23]:
# Statistical tests analysis
statistical_tests = comprehensive_analysis.get('statistical_tests', {})

print("=== STATISTICAL SIGNIFICANCE TESTS ===")

# Entity count differences by municipality
if 'entity_count_by_municipality' in statistical_tests:
    entity_test = statistical_tests['entity_count_by_municipality']
    
    if 'error' not in entity_test:
        print(f"\n🧮 Entity Count Differences by Municipality:")
        print(f"   Test: {entity_test['test']}")
        print(f"   Statistic: {entity_test['statistic']:.4f}")
        print(f"   p-value: {entity_test['p_value']:.6f}")
        print(f"   Significant: {entity_test['significant']}")
        print(f"   Interpretation: {entity_test['interpretation']}")
    else:
        print(f"   Error: {entity_test['error']}")

# Entity type distribution test
if 'entity_type_distribution' in statistical_tests:
    entity_dist_test = statistical_tests['entity_type_distribution']
    
    if 'error' not in entity_dist_test:
        print(f"\n📊 Entity Type Distribution Differences:")
        print(f"   Test: {entity_dist_test['test']}")
        print(f"   Chi-square: {entity_dist_test['chi2_statistic']:.4f}")
        print(f"   p-value: {entity_dist_test['p_value']:.6f}")
        print(f"   Degrees of freedom: {entity_dist_test['degrees_of_freedom']}")
        print(f"   Cramér's V: {entity_dist_test['cramers_v']:.4f}")
        print(f"   Effect size: {entity_dist_test['effect_size']}")
        print(f"   Significant: {entity_dist_test['significant']}")
    else:
        print(f"   Error: {entity_dist_test['error']}")

# Posicionamento distribution test
if 'posicionamento_distribution' in statistical_tests:
    pos_test = statistical_tests['posicionamento_distribution']
    
    if 'error' not in pos_test:
        print(f"\n🗳️ Posicionamento Distribution Differences:")
        print(f"   Test: {pos_test['test']}")
        print(f"   Chi-square: {pos_test['chi2_statistic']:.4f}")
        print(f"   p-value: {pos_test['p_value']:.6f}")
        print(f"   Degrees of freedom: {pos_test['degrees_of_freedom']}")
        print(f"   Cramér's V: {pos_test['cramers_v']:.4f}")
        print(f"   Effect size: {pos_test['effect_size']}")
        print(f"   Significant: {pos_test['significant']}")
    else:
        print(f"   Error: {pos_test['error']}")

print("\n✅ Statistical tests analysis completed")

=== STATISTICAL SIGNIFICANCE TESTS ===

🧮 Entity Count Differences by Municipality:
   Test: Kruskal-Wallis
   Statistic: 51.5234
   p-value: 0.000000
   Significant: True
   Interpretation: Significant differences in entity counts between municipalities

📊 Entity Type Distribution Differences:
   Test: Chi-square
   Chi-square: 3510.6418
   p-value: 0.000000
   Degrees of freedom: 25
   Cramér's V: 0.1630
   Effect size: small
   Significant: True

🗳️ Posicionamento Distribution Differences:
   Test: Chi-square
   Chi-square: 1037.4689
   p-value: 0.000000
   Degrees of freedom: 15
   Cramér's V: 0.2922
   Effect size: small
   Significant: True

✅ Statistical tests analysis completed
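
The Cramér's V values printed above come from the `comprehensive_analysis` results, whose implementation is not shown in this notebook. As a rough cross-check, the sketch below computes V directly from a contingency table with `scipy.stats.chi2_contingency` and then re-derives the reported entity-type value from the printed chi-square statistic. The names `cramers_v` and `v_from_chi2` and the toy table are illustrative. The 6 × 6 table shape is an assumption inferred from the 25 degrees of freedom, and n = 26,435 is the total entity count reported in the summary section.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v(table):
    """Cramér's V for an r x c contingency table of counts (illustrative helper)."""
    table = np.asarray(table, dtype=float)
    # correction=False: Yates' correction only affects 2x2 tables, where it
    # would shrink the statistic.
    chi2, p_value, dof, _expected = chi2_contingency(table, correction=False)
    n = table.sum()
    k = min(table.shape) - 1  # smaller table dimension minus one
    return np.sqrt(chi2 / (n * k)), chi2, p_value, dof


def v_from_chi2(chi2, n, n_rows, n_cols):
    """Recover V from an already-computed chi-square statistic."""
    return np.sqrt(chi2 / (n * (min(n_rows, n_cols) - 1)))


# Toy counts only, not taken from the corpus.
toy = [[30, 10, 5], [12, 28, 9], [6, 11, 33]]
v, chi2, p_value, dof = cramers_v(toy)
print(f"toy table: V={v:.3f}, chi2={chi2:.2f}, dof={dof}")

# Assumed shape: 6 municipalities x 6 entity types, n = 26,435 entities.
v_entities = v_from_chi2(3510.6418, 26435, 6, 6)
print(f"entity types: V={v_entities:.4f}")
```

Verbal labels for V depend on table size: Cohen's w equals V · sqrt(min(r, c) − 1), so the same V corresponds to a larger w in a wider table. The thresholds behind the "small" labels above are set inside the analysis helper and are not shown here.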


In [24]:
# Additional effect size calculations
print("\n=== EFFECT SIZE ANALYSIS ===")

if not documents_df.empty and len(documents_df['municipality'].unique()) > 1:
    # Entity count effect sizes between municipalities
    municipalities = documents_df['municipality'].unique()
    
    print(f"\n📊 Pairwise Effect Sizes (Cohen's d) for Entity Counts:")
    
    effect_sizes = []
    for i, muni1 in enumerate(municipalities):
        for muni2 in municipalities[i+1:]:
            group1 = documents_df[documents_df['municipality'] == muni1]['entity_count'].values
            group2 = documents_df[documents_df['municipality'] == muni2]['entity_count'].values
            
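            if len(group1) > 1 and len(group2) > 1:
                effect_size = calculate_effect_size(group1, group2)
                magnitude = abs(effect_size)
                # Conventional Cohen's d cut-offs: 0.2 small, 0.5 medium, 0.8 large
                if magnitude > 0.8:
                    label = 'Large'
                elif magnitude > 0.5:
                    label = 'Medium'
                elif magnitude >= 0.2:
                    label = 'Small'
                else:
                    label = 'Negligible'
                effect_sizes.append({
                    'Municipality 1': muni1,
                    'Municipality 2': muni2,
                    'Cohens d': effect_size,
                    'Effect Size': label
                })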
    
    if effect_sizes:
        effect_sizes_df = pd.DataFrame(effect_sizes)
        print(effect_sizes_df.to_string(index=False))
        
        # Save effect sizes table
        effect_sizes_df.to_csv(STATISTICS_DIR / 'effect_sizes_entity_counts.csv', index=False)
        
        print("\n💾 Effect sizes table saved")
    else:
        print("⚠️ Could not calculate effect sizes")
else:
    print("⚠️ Insufficient data for effect size calculations")


=== EFFECT SIZE ANALYSIS ===

📊 Pairwise Effect Sizes (Cohen's d) for Entity Counts:
Municipality 1 Municipality 2  Cohens d Effect Size
         Porto         Fundao  1.898484       Large
         Porto      Guimaraes -0.435542       Small
         Porto      Alandroal  0.378833       Small
         Porto        Covilha -1.165090       Large
         Porto     Campomaior -0.789704      Medium
        Fundao      Guimaraes -1.822176       Large
        Fundao      Alandroal -1.508861       Large
        Fundao        Covilha -2.096418       Large
        Fundao     Campomaior -2.210320       Large
     Guimaraes      Alandroal  0.690373      Medium
     Guimaraes        Covilha -0.796704      Medium
     Guimaraes     Campomaior -0.255702       Small
     Alandroal        Covilha -1.327469       Large
     Alandroal     Campomaior -1.049011       Large
       Covilha     Campomaior  0.632136      Medium

💾 Effect sizes table saved
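
`calculate_effect_size` is imported from the project's `analysis_functions` module and its definition is not shown here. A minimal sketch of the standard pooled-standard-deviation form of Cohen's d follows, assuming that is what the helper computes. `cohens_d` and `label_effect` are hypothetical names, and the sample values are made up.

```python
import numpy as np


def cohens_d(group1, group2):
    """Cohen's d with a pooled standard deviation (illustrative helper)."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)


def label_effect(d):
    """Map |d| to a verbal label using the conventional 0.2 / 0.5 / 0.8 cut-offs."""
    magnitude = abs(d)
    if magnitude > 0.8:
        return "Large"
    if magnitude > 0.5:
        return "Medium"
    if magnitude >= 0.2:
        return "Small"
    return "Negligible"


# Made-up entity counts for two groups of documents.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(f"d = {d:.2f} ({label_effect(d)})")
```

Under this convention a positive d means the first group has the larger mean, so the sign of each value depends on the order in which the two municipalities are paired.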


## 12. Publication-Ready Summary

In [25]:
print("===  SUMMARY ===")

# Dataset overview table
corpus_stats = comprehensive_analysis['corpus_statistics']['corpus_overview']
entity_stats = comprehensive_analysis['corpus_statistics'].get('entity_overview', {})
relation_stats = comprehensive_analysis['corpus_statistics'].get('relation_overview', {})

# Table 1: Dataset Characteristics
dataset_char_data = {
    'Characteristic': [
        'Documents',
        'Municipalities', 
        'Total Characters',
        'Total Tokens',
        'Avg Tokens/Document',
        'Total Entities',
        'Entity Types',
        'Total Relations',
        'Entity Coverage (%)',
        'Avg Entities/Document'
    ],
    'Value': [
        f"{corpus_stats['total_documents']:,}",
        f"{corpus_stats['total_municipalities']}",
        f"{corpus_stats['total_text_length']:,}",
        f"{corpus_stats['total_tokens']:,}",
        f"{corpus_stats['total_tokens'] / corpus_stats['total_documents']:.0f}",
        f"{entity_stats.get('total_entities', 0):,}",
        f"{entity_stats.get('unique_entity_types', 0)}",
        f"{relation_stats.get('total_relations', 0):,}",
        f"{entity_stats.get('entity_coverage', 0)*100:.1f}",
        f"{entity_stats.get('avg_entities_per_document', 0):.2f}"
    ]
}

dataset_char_df = pd.DataFrame(dataset_char_data)
print("\nTable 1: Dataset Characteristics")
print(dataset_char_df.to_string(index=False))

# Save as CSV and LaTeX
dataset_char_df.to_csv(STATISTICS_DIR / 'table1_dataset_characteristics.csv', index=False)
with open(STATISTICS_DIR / 'table1_dataset_characteristics.tex', 'w', encoding='utf-8') as f:
    f.write(dataset_char_df.to_latex(index=False, caption="Dataset Characteristics", label="tab:dataset_char"))

# Table 2: Entity Type Distribution
if not entities_df.empty:
    entity_type_stats = entities_df['entity_label'].value_counts()
    entity_type_pct = (entity_type_stats / entity_type_stats.sum() * 100).round(2)
    
    entity_dist_data = {
        'Entity Type': entity_type_stats.index.tolist(),
        'Count': entity_type_stats.values.tolist(),
        'Percentage': [f"{pct:.1f}%" for pct in entity_type_pct.values]
    }
    
    entity_dist_df = pd.DataFrame(entity_dist_data)
    print("\nTable 2: Entity Type Distribution")
    print(entity_dist_df.to_string(index=False))
    
    entity_dist_df.to_csv(STATISTICS_DIR / 'table2_entity_distribution.csv', index=False)
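    # Entity labels contain non-ASCII characters (e.g. "Informação Pessoal"), so fix the encoding
    with open(STATISTICS_DIR / 'table2_entity_distribution.tex', 'w', encoding='utf-8') as f: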
        f.write(entity_dist_df.to_latex(index=False, caption="Entity Type Distribution", label="tab:entity_dist"))

# Table 3: Municipality Statistics  
muni_stats_data = []
for municipality in documents_df['municipality'].unique():
    muni_docs = documents_df[documents_df['municipality'] == municipality]
    muni_entities = entities_df[entities_df['municipality'] == municipality] if not entities_df.empty else pd.DataFrame()
    n_docs = len(muni_docs)
    n_entities = len(muni_entities)
    total_tokens = muni_docs['token_count'].sum()
    
    muni_stats_data.append({
        'Municipality': municipality,
        'Documents': n_docs,
        'Total Tokens': total_tokens,
        'Total Entities': n_entities,
        'Entities/Document': n_entities / n_docs if n_docs > 0 else 0,
        # Entity density = entities per token
        'Entity Density': n_entities / total_tokens if total_tokens > 0 else 0
    })

muni_stats_df = pd.DataFrame(muni_stats_data)
muni_stats_df['Entities/Document'] = muni_stats_df['Entities/Document'].round(2)
muni_stats_df['Entity Density'] = muni_stats_df['Entity Density'].round(4)

print("\nTable 3: Municipality Statistics")
print(muni_stats_df.to_string(index=False))

muni_stats_df.to_csv(STATISTICS_DIR / 'table3_municipality_statistics.csv', index=False)
with open(STATISTICS_DIR / 'table3_municipality_statistics.tex', 'w', encoding='utf-8') as f:
    f.write(muni_stats_df.to_latex(index=False, caption="Municipality Statistics", label="tab:muni_stats"))

print("\n💾 All tables saved to LaTeX and CSV formats")

===  SUMMARY ===

Table 1: Dataset Characteristics
       Characteristic     Value
            Documents       120
       Municipalities         6
     Total Characters 7,512,427
         Total Tokens 1,188,024
  Avg Tokens/Document      9900
       Total Entities    26,435
         Entity Types         6
      Total Relations     8,752
  Entity Coverage (%)     100.0
Avg Entities/Document    220.29

Table 2: Entity Type Distribution
       Entity Type  Count Percentage
    Posicionamento   8777      33.2%
           Assunto   8647      32.7%
Informação Pessoal   5760      21.8%
         Metadados   1825       6.9%
      Ordem do Dia   1420       5.4%
                        6       0.0%

Table 3: Municipality Statistics
Municipality  Documents  Total Tokens  Total Entities  Entities/Document  Entity Density
       Porto         20        274173            4209             210.45          0.0154
      Fundao         20        278147            2516             125.80          0.0090
  