# Stylometric and Linguistic Analysis of Parliamentary Discourse

This notebook computes stylometric and linguistic features to quantify the structural properties of Portuguese parliamentary discourse.

## Methodology Overview

We conduct analysis across four main dimensions:

1. **Readability**: Using the Flesch Reading Ease score to assess discourse complexity
2. **Lexical Diversity**: Type-Token Ratio (TTR), already available in the dataset
3. **Syntactic Structure**: Part-of-Speech (POS) frequency analysis
4. **Named Entity Recognition**: Identification of people, organizations, and locations

All linguistic processing uses the spaCy library with the Portuguese language model `pt_core_news_lg`, falling back to `pt_core_news_sm` when the large model is not installed.

In [6]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import sys
import os
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Add src directory to Python path
sys.path.append('/home/igo/faculdade/poc/src')

# Import our custom stylometric analyzer
from stylometric_analysis import StylometricAnalyzer, process_dataframe

# Configure plotting
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 12

print("📚 Libraries imported successfully!")
print("🔧 Custom stylometric analyzer loaded!")

# Check if spaCy Portuguese model is available
nlp = None  # Remains None if no Portuguese model is installed
try:
    nlp = spacy.load("pt_core_news_lg")
    print("✅ spaCy Portuguese model (pt_core_news_lg) loaded successfully!")
except OSError:
    try:
        nlp = spacy.load("pt_core_news_sm")
        print("⚠️  Using pt_core_news_sm (smaller model). For better results, install pt_core_news_lg")
    except OSError:
        print("❌ No Portuguese spaCy model found. Please install with:")
        print("   python -m spacy download pt_core_news_lg")

📚 Libraries imported successfully!
🔧 Custom stylometric analyzer loaded!
⚠️  Using pt_core_news_sm (smaller model). For better results, install pt_core_news_lg


In [5]:
# Check if spaCy Portuguese model is available
try:
    import spacy
    nlp = spacy.load("pt_core_news_sm")
    print("✅ spaCy Portuguese model (pt_core_news_sm) is available")
except IOError:
    try:
        nlp = spacy.load("pt_core_news_lg")
        print("✅ spaCy Portuguese model (pt_core_news_lg) is available")
    except IOError:
        print("❌ No Portuguese spaCy model found.")
        print("   Please install with one of:")
        print("   python -m spacy download pt_core_news_sm")
        print("   python -m spacy download pt_core_news_lg")
        print("   We'll continue with a warning - some features may not work properly")

❌ No Portuguese spaCy model found.
   Please install with one of:
   python -m spacy download pt_core_news_sm
   python -m spacy download pt_core_news_lg


## 1. Load and Prepare Data

First, we'll load the existing dataset that already contains TTR (Type-Token Ratio) values and other linguistic metrics.
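
For reference, TTR is the number of distinct word forms divided by the total number of words. The sketch below is only an illustration: the function name `type_token_ratio` and the regex tokenizer are assumptions, and the tokenization used to produce the dataset's `ttr` column is not shown in this notebook, so values may differ.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct tokens / total tokens (case-insensitive); 0.0 for empty text."""
    # Assumed tokenizer: runs of Unicode letters, so accented Portuguese
    # characters are kept while digits and punctuation are dropped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


if __name__ == "__main__":
    print(type_token_ratio("O senhor presidente e o senhor deputado"))
```

Because TTR tends to fall as texts get longer, comparisons are most meaningful between responses of similar length.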

In [7]:
# Load the sentiment results data
data_path = "/home/igo/faculdade/poc/data/sentiment_results.parquet"
df = pd.read_parquet(data_path)

print(f"📊 Dataset loaded successfully!")
print(f"📈 Shape: {df.shape}")
print(f"🏷️  Columns: {list(df.columns)}")

# Check for existing linguistic metrics
existing_metrics = ['ttr', 'MLC', 'MLS', 'DCC', 'CPC', 'profundidade_media', 'profundidade_max', 'lexical_density']
available_metrics = [col for col in existing_metrics if col in df.columns]
print(f"\n✅ Already available linguistic metrics: {available_metrics}")

# Check text columns
text_columns = [col for col in df.columns if 'response' in col.lower()]
print(f"📝 Available text columns: {text_columns}")

# Display sample data
print("\n📋 Sample data:")
display_columns = ['model', 'response', 'ttr', 'sentiment_label'] + available_metrics[:3]
available_display_cols = [col for col in display_columns if col in df.columns]
print(df[available_display_cols].head())


## 2. Calculate Flesch Reading Ease Score

The Flesch Reading Ease score quantifies text readability based on average sentence length and word complexity (syllables per word). For Portuguese parliamentary discourse, we expect low scores (including negative values) indicating high complexity.
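
As a rough illustration of the idea, the sketch below uses a commonly cited Portuguese adaptation of the formula, `248.835 - 1.015 * ASL - 84.6 * ASW`, where ASL is the average sentence length in words and ASW is the average number of syllables per word. Whether `calculate_flesch_reading_ease_pt` uses these exact constants is not shown in this notebook, so treat them as an assumption. The function names and the vowel-group syllable counter are also hypothetical; the counter is a crude approximation that merges hiatuses into a single syllable.

```python
import re

# Assumption: one syllable per run of consecutive vowels (approximate for Portuguese).
VOWEL_GROUP = re.compile(r"[aeiouáàâãéêíóôõúü]+")


def count_syllables(word: str) -> int:
    """Approximate syllable count; every word counts as at least one syllable."""
    return max(1, len(VOWEL_GROUP.findall(word.lower())))


def flesch_reading_ease_pt(text: str) -> float:
    """Flesch Reading Ease with the assumed Portuguese constants."""
    # Naive sentence split on terminal punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of Unicode letters.
    words = re.findall(r"[^\W\d_]+", text)
    if not sentences or not words:
        return 0.0
    avg_sentence_length = len(words) / len(sentences)
    avg_syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return 248.835 - 1.015 * avg_sentence_length - 84.6 * avg_syllables_per_word


if __name__ == "__main__":
    print(flesch_reading_ease_pt("O gato dorme."))
    print(flesch_reading_ease_pt(
        "A institucionalização parlamentar constitucionalmente estabelecida."
    ))
```

Long sentences and polysyllabic vocabulary both subtract from the score, which is why dense formal text can drop below zero.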

In [None]:
# Initialize the stylometric analyzer
analyzer = StylometricAnalyzer()

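# Let's test the Flesch Reading Ease calculation on a sample
responses = df['response'].dropna() if 'response' in df.columns else pd.Series(dtype=str)
sample_text = responses.iloc[0] if not responses.empty else ""  # Avoid IndexError on an empty column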
print("🧪 Testing Flesch Reading Ease calculation...")
print(f"Sample text (first 200 chars): {sample_text[:200]}...")

if sample_text:
    flesch_score = analyzer.calculate_flesch_reading_ease_pt(sample_text)
    print(f"\n📊 Flesch Reading Ease Score: {flesch_score}")
    
    # Interpret the score
    if flesch_score >= 90:
        interpretation = "Very Easy"
    elif flesch_score >= 80:
        interpretation = "Easy"
    elif flesch_score >= 70:
        interpretation = "Fairly Easy"
    elif flesch_score >= 60:
        interpretation = "Standard"
    elif flesch_score >= 50:
        interpretation = "Fairly Difficult"
    elif flesch_score >= 30:
        interpretation = "Difficult"
    else:
        interpretation = "Very Difficult (Parliamentary/Legal complexity)"
    
    print(f"📝 Interpretation: {interpretation}")
else:
    print("❌ No text available for testing")

## 3. Part-of-Speech Tagging and Syntactic Analysis

We analyze grammatical composition by calculating the relative frequencies of the main grammatical categories: nouns, verbs, adjectives, and adverbs.
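
A minimal sketch of the calculation, assuming the frequencies are percentages of all non-punctuation tokens (the denominator actually used by `calculate_pos_frequencies` is not shown here). The function name `pos_frequencies` is hypothetical; the output keys mirror the ones used in the cell below.

```python
from collections import Counter
from typing import Dict, Iterable

# Universal POS tags mapped to the output keys used in this notebook.
TAG_TO_KEY = {
    "NOUN": "noun_freq",
    "VERB": "verb_freq",
    "ADJ": "adj_freq",
    "ADV": "adv_freq",
}


def pos_frequencies(tags: Iterable[str]) -> Dict[str, float]:
    """Return the percentage of tokens carrying each main POS tag."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return {key: 0.0 for key in TAG_TO_KEY.values()}
    return {key: 100.0 * counts[tag] / total for tag, key in TAG_TO_KEY.items()}


if __name__ == "__main__":
    # With a loaded spaCy pipeline `nlp`, the tags could be collected as:
    #   tags = [t.pos_ for t in nlp(text) if not t.is_punct and not t.is_space]
    example_tags = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN", "ADJ", "ADV"]
    print(pos_frequencies(example_tags))
```

Taking plain tag strings as input keeps the arithmetic independent of which spaCy model produced the tags.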

In [None]:
# Test POS frequency analysis on sample text
print("🧪 Testing POS frequency analysis...")

if sample_text:
    pos_frequencies = analyzer.calculate_pos_frequencies(sample_text)
    print(f"📊 POS Frequencies:")
    for pos_type, frequency in pos_frequencies.items():
        print(f"   {pos_type}: {frequency:.2f}%")
    
    print(f"\n🔍 Interpretation (what a high value of each frequency suggests):")
    print(f"   • Noun frequency ({pos_frequencies['noun_freq']:.1f}%) → Nominal/informational style")
    print(f"   • Verb frequency ({pos_frequencies['verb_freq']:.1f}%) → Narrative/action-oriented")
    print(f"   • Adjective frequency ({pos_frequencies['adj_freq']:.1f}%) → Descriptive/evaluative")
    print(f"   • Adverb frequency ({pos_frequencies['adv_freq']:.1f}%) → Subjective discourse")
    
    # Detailed POS analysis with spaCy
    doc = analyzer.nlp(sample_text[:500])  # Analyze first 500 chars for detailed view
    print(f"\n📝 Detailed POS analysis (first 500 chars):")
    
    pos_examples = {}
    for token in doc:
        if not token.is_punct and not token.is_space:
            if token.pos_ not in pos_examples:
                pos_examples[token.pos_] = []
            if len(pos_examples[token.pos_]) < 3:  # Show max 3 examples per POS
                pos_examples[token.pos_].append(token.text)
    
    for pos, examples in sorted(pos_examples.items()):
        print(f"   {pos}: {', '.join(examples)}")
        
else:
    print("❌ No text available for testing")

## 4. Named Entity Recognition (NER)

We identify and count mentions of people (PER), organizations (ORG), and locations (LOC, or GPE in models that use that label) to gauge the salience of different actors and themes.
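
The counting step can be sketched as a mapping from entity labels to the three count columns. The spaCy Portuguese models emit `PER`, `ORG`, `LOC`, and `MISC`; the additional `PERSON`, `GPE`, and `PLACE` labels are included only to mirror the grouping in the cell below. The function name `count_entities` is hypothetical, and how `extract_named_entities` treats `MISC` is not shown here (this sketch ignores it).

```python
from typing import Dict, Iterable

# Entity labels mapped to the count columns used in this notebook.
LABEL_TO_COLUMN = {
    "PER": "per_count",
    "PERSON": "per_count",
    "ORG": "org_count",
    "LOC": "loc_count",
    "GPE": "loc_count",
    "PLACE": "loc_count",
}


def count_entities(labels: Iterable[str]) -> Dict[str, int]:
    """Count entity mentions per category; unmapped labels (e.g. MISC) are ignored."""
    counts = {"per_count": 0, "org_count": 0, "loc_count": 0}
    for label in labels:
        column = LABEL_TO_COLUMN.get(label)
        if column is not None:
            counts[column] += 1
    return counts


if __name__ == "__main__":
    # With a loaded spaCy pipeline `nlp`, the labels could be collected as:
    #   labels = [ent.label_ for ent in nlp(text).ents]
    print(count_entities(["PER", "ORG", "LOC", "MISC", "PER", "GPE"]))
```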

In [None]:
# Test Named Entity Recognition on sample text
print("🧪 Testing Named Entity Recognition...")

if sample_text:
    entities = analyzer.extract_named_entities(sample_text)
    print(f"📊 Named Entity Counts:")
    print(f"   People (PER): {entities['per_count']}")
    print(f"   Organizations (ORG): {entities['org_count']}")
    print(f"   Locations (LOC/GPE): {entities['loc_count']}")
    
    # Show detailed entities found
    doc = analyzer.nlp(sample_text)
    print(f"\n📝 Detailed entities found:")
    
    entities_found = {"PER": [], "ORG": [], "LOC/GPE": []}
    
    for ent in doc.ents:
        if ent.label_ in ["PER", "PERSON"]:
            entities_found["PER"].append(ent.text)
        elif ent.label_ == "ORG":
            entities_found["ORG"].append(ent.text)
        elif ent.label_ in ["LOC", "GPE", "PLACE"]:
            entities_found["LOC/GPE"].append(ent.text)
    
    for ent_type, ent_list in entities_found.items():
        if ent_list:
            unique_entities = sorted(set(ent_list))[:5]  # Show max 5 unique entities, in a stable order
            print(f"   {ent_type}: {', '.join(unique_entities)}")
        else:
            print(f"   {ent_type}: No entities found")
    
    # Show all entities with labels for debugging
    all_entities = [(ent.text, ent.label_) for ent in doc.ents]
    if all_entities:
        print(f"\n🔍 All entities found (with labels): {all_entities[:10]}")
    else:
        print(f"\n🔍 No entities found in sample text")
        
else:
    print("❌ No text available for testing")

## 5. Process Full Dataset and Add New Columns

We now add the new stylometric and linguistic metrics to the dataset as columns, starting with a small test sample to gauge processing time before running on all rows.
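
The batching idea can be sketched independently of `process_dataframe`, whose internals are not shown in this notebook. The helper names `iter_batches` and `process_in_batches` and the `extract` callback are hypothetical; a batch size of 50 matches the value passed in the cell below.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def iter_batches(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(
    texts: Sequence[str],
    extract: Callable[[str], Dict[str, float]],
    batch_size: int = 50,
) -> List[Dict[str, float]]:
    """Apply `extract` to every text, one batch at a time, preserving order."""
    results: List[Dict[str, float]] = []
    for batch in iter_batches(texts, batch_size):
        results.extend(extract(text) for text in batch)
    return results


if __name__ == "__main__":
    rows = process_in_batches(
        ["um texto curto", "outro"],
        lambda text: {"n_words": float(len(text.split()))},
        batch_size=1,
    )
    print(rows)
```

Each result is a dict of metric names to values, so the list can be turned into new DataFrame columns once all batches finish.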

In [None]:
# Process dataset in smaller batches for memory efficiency
print("🚀 Starting full dataset processing...")
print(f"📊 Processing {len(df)} rows in batches...")

# Choose text column to analyze
text_column = 'response' if 'response' in df.columns else 'response_lemm'
print(f"📝 Using text column: {text_column}")

# Start with a small sample to test processing time
test_sample_size = min(100, len(df))
print(f"\n🧪 Testing with {test_sample_size} rows first...")

# Process test sample
df_test = df.head(test_sample_size).copy()
df_test_processed = process_dataframe(df_test, text_column=text_column, batch_size=50)

# Check what new columns were added
original_columns = set(df.columns)
new_columns = [col for col in df_test_processed.columns if col not in original_columns]
print(f"\n✅ New columns added: {new_columns}")

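# Show sample results
print(f"\n📋 Sample results:")
preview_columns = [col for col in ['model', text_column] if col in df_test_processed.columns]
sample_results = df_test_processed[preview_columns + new_columns].head(3).copy()
if text_column in sample_results.columns:
    # Truncate the text itself (not the column name) so the preview stays readable
    sample_results[text_column] = sample_results[text_column].astype(str).str[:50]
print(sample_results)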

In [None]:
# Process the full dataset (this may take some time)
# Uncomment the lines below when ready to process the full dataset

# print("🔄 Processing full dataset (this may take several minutes)...")
# df_full_processed = process_dataframe(df, text_column=text_column, batch_size=50)

# For now, let's work with the test sample to demonstrate the analysis
df_processed = df_test_processed.copy()
print(f"\n📊 Working with processed sample of {len(df_processed)} rows")

# Display summary statistics for new metrics
print(f"\n📈 Summary Statistics for New Stylometric Metrics:")
print("=" * 60)

for col in new_columns:
    if pd.api.types.is_numeric_dtype(df_processed[col]):
        stats = df_processed[col].describe()
        print(f"\n{col.upper().replace('_', ' ')}:")
        print(f"  Mean: {stats['mean']:.2f}")
        print(f"  Std:  {stats['std']:.2f}")
        print(f"  Min:  {stats['min']:.2f}")
        print(f"  Max:  {stats['max']:.2f}")

# Save processed sample for further analysis
output_path = "/home/igo/faculdade/poc/data/sentiment_results_sample_with_stylometric.parquet"
df_processed.to_parquet(output_path, index=False)
print(f"\n💾 Processed sample saved to: {output_path}")

## 6. Visualize Linguistic Metrics

Let's plot the distributions of the linguistic metrics and the relationships between them.

In [None]:
# Create comprehensive visualizations for stylometric metrics

# Set up the plotting environment
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Stylometric and Linguistic Analysis - Distribution of Metrics', fontsize=16, fontweight='bold')

# 1. Flesch Reading Ease Distribution
if 'flesch_reading_ease' in df_processed.columns:
    axes[0, 0].hist(df_processed['flesch_reading_ease'], bins=20, alpha=0.7, color='skyblue', edgecolor='black')
    axes[0, 0].set_title('Flesch Reading Ease Score Distribution')
    axes[0, 0].set_xlabel('Score (Lower = More Complex)')
    axes[0, 0].set_ylabel('Frequency')
    axes[0, 0].axvline(df_processed['flesch_reading_ease'].mean(), color='red', linestyle='--', label=f'Mean: {df_processed["flesch_reading_ease"].mean():.1f}')
    axes[0, 0].legend()

# 2. POS Frequencies Comparison
pos_columns = [col for col in ['noun_freq', 'verb_freq', 'adj_freq', 'adv_freq'] if col in df_processed.columns]
if pos_columns:
    pos_means = [df_processed[col].mean() for col in pos_columns]
    pos_labels = [col.replace('_freq', '').title() for col in pos_columns]
    axes[0, 1].bar(pos_labels, pos_means, color=['lightcoral', 'lightgreen', 'lightyellow', 'lightblue'])
    axes[0, 1].set_title('Average POS Frequencies (%)')
    axes[0, 1].set_ylabel('Percentage')
    for i, v in enumerate(pos_means):
        axes[0, 1].text(i, v + 0.5, f'{v:.1f}%', ha='center', va='bottom')

# 3. Named Entity Counts
entity_label_map = {'per_count': 'People (PER)', 'org_count': 'Organizations (ORG)', 'loc_count': 'Locations (LOC)'}
entity_columns = [col for col in entity_label_map if col in df_processed.columns]
if entity_columns:
    entity_totals = [df_processed[col].sum() for col in entity_columns]
    # Derive labels from the columns actually present so the lengths always match
    entity_labels = [entity_label_map[col] for col in entity_columns]
    axes[0, 2].bar(entity_labels, entity_totals, color=['salmon', 'gold', 'lightseagreen'])
    axes[0, 2].set_title('Total Named Entity Mentions')
    axes[0, 2].set_ylabel('Count')
    axes[0, 2].tick_params(axis='x', rotation=45)

# 4. TTR vs Flesch Reading Ease Correlation
if 'ttr' in df_processed.columns and 'flesch_reading_ease' in df_processed.columns:
    axes[1, 0].scatter(df_processed['ttr'], df_processed['flesch_reading_ease'], alpha=0.6)
    axes[1, 0].set_xlabel('Type-Token Ratio (TTR)')
    axes[1, 0].set_ylabel('Flesch Reading Ease')
    axes[1, 0].set_title('TTR vs Reading Ease Correlation')
    
    # Add correlation coefficient
    correlation = df_processed['ttr'].corr(df_processed['flesch_reading_ease'])
    axes[1, 0].text(0.05, 0.95, f'r = {correlation:.3f}', transform=axes[1, 0].transAxes, 
                    bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

# 5. Metrics by Model (if model column exists)
if 'model' in df_processed.columns and 'flesch_reading_ease' in df_processed.columns:
    models = df_processed['model'].unique()
    model_flesch = [df_processed[df_processed['model'] == model]['flesch_reading_ease'].mean() for model in models]
    axes[1, 1].bar(models, model_flesch, color='lightsteelblue')
    axes[1, 1].set_title('Average Reading Ease by Model')
    axes[1, 1].set_ylabel('Flesch Reading Ease')
    axes[1, 1].tick_params(axis='x', rotation=45)

# 6. Additional Metrics Distribution
additional_cols = [col for col in ['avg_word_length', 'long_words_ratio'] if col in df_processed.columns]
if additional_cols:
    col = additional_cols[0]  # Only one histogram fits in this panel
    axes[1, 2].hist(df_processed[col], bins=15, alpha=0.7, color='mediumpurple', label=col)
    axes[1, 2].set_xlabel(col.replace('_', ' ').title())
    axes[1, 2].set_ylabel('Frequency')
    axes[1, 2].set_title('Additional Metrics Distribution')
else:
    axes[1, 2].set_visible(False)  # Hide the unused panel

plt.tight_layout()
plt.show()

In [None]:
# Create a correlation matrix for all numeric metrics
numeric_columns = df_processed.select_dtypes(include=[np.number]).columns
linguistic_metrics = [col for col in numeric_columns if col in new_columns + ['ttr', 'MLC', 'MLS', 'DCC', 'CPC']]

if len(linguistic_metrics) > 1:
    plt.figure(figsize=(12, 10))
    correlation_matrix = df_processed[linguistic_metrics].corr()
    
    # Create heatmap
    mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
    sns.heatmap(correlation_matrix, mask=mask, annot=True, cmap='coolwarm', center=0,
                square=True, fmt='.2f', cbar_kws={"shrink": .8})
    
    plt.title('Correlation Matrix: Stylometric and Linguistic Metrics', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

# Summary statistics table
print("\n📊 COMPREHENSIVE METRICS SUMMARY")
print("=" * 70)

if len(linguistic_metrics) > 0:
    summary_stats = df_processed[linguistic_metrics].describe().round(2)
    print(summary_stats)
    
    # Interpretation guide
    print("\n📖 INTERPRETATION GUIDE:")
    print("-" * 40)
    print("🔸 Flesch Reading Ease: Lower scores = more complex text")
    print("🔸 TTR (Type-Token Ratio): Higher values = more lexical diversity")
    print("🔸 Noun Frequency: Higher % = nominal/informational style")
    print("🔸 Verb Frequency: Higher % = narrative/action-oriented style")
    print("🔸 Adjective/Adverb Frequency: Higher % = descriptive/evaluative style")
    print("🔸 Named Entities: Higher counts = more references to actors/places")
else:
    print("No linguistic metrics found for summary.")

## 7. Process Full Dataset (Optional)

The following cell contains code to process the complete dataset. Uncomment and run it when ready to analyze the full data.

In [None]:
# FULL DATASET PROCESSING - Uncomment when ready to process all data
# 
# # This will process the entire dataset (may take 30+ minutes depending on size)
# print("🚀 Processing full dataset - this may take significant time...")
# 
# # Process in small batches to manage memory
# df_full_processed = process_dataframe(df, text_column=text_column, batch_size=25)
# 
# # Save the complete processed dataset
# output_full_path = "/home/igo/faculdade/poc/data/sentiment_results_with_stylometric.parquet"
# df_full_processed.to_parquet(output_full_path, index=False)
# print(f"✅ Full processed dataset saved to: {output_full_path}")
# 
# # Display final summary
# new_columns_full = [col for col in df_full_processed.columns if col not in df.columns]
# print(f"📊 Complete dataset: {len(df_full_processed)} rows with {len(new_columns_full)} new metrics")

print("💡 To process the full dataset:")
print("   1. Uncomment the code above")
print("   2. Run the cell")
print("   3. Wait for processing to complete")
print("   4. The enhanced dataset will be saved with all stylometric metrics")