# Cross-Year Document Similarity Analysis  

## Overview
This module measures reporting consistency by comparing each company's 2021 and 2022 sustainability reports to detect recycled content versus genuine annual updates. It employs multiple similarity methods to identify symbolic rather than substantive reporting patterns, directly supporting the Reporting Consistency dimension.

## Multi-Method Similarity Assessment
1. **TF-IDF Document Similarity**: Measures overall document similarity based on term frequency patterns
2. **Jaccard Similarity**: Calculates shared vocabulary ratio as (common words ÷ total unique words) 
3. **Sentence-level SpaCy Analysis**: Semantic similarity using word vectors for individual sentence pairs
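
As a rough illustration of the Jaccard ratio in item 2, the sketch below uses plain whitespace tokenization on two made-up sentences. The module itself works on spaCy lemmas with stopwords and punctuation removed, so `toy_jaccard` and the sample strings are hypothetical stand-ins rather than the notebook's implementation.

```python
def toy_jaccard(text_a: str, text_b: str) -> float:
    """Shared vocabulary ratio: |A & B| / |A | B| over lowercase word sets."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    # Guard against two empty texts, mirroring the zero-union case
    return len(words_a & words_b) / len(union) if union else 0.0


report_2021 = "we reduced emissions by 15% in 2021"
report_2022 = "we reduced emissions by 15% in 2022"

# 6 shared words out of 8 unique words overall
score = toy_jaccard(report_2021, report_2022)
print(f"{score:.2f}")  # prints 0.75
```

Only the year token differs, yet the score drops to 0.75 on such short texts; on full reports a single changed word has a far smaller effect.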

## Sentence Matching Algorithm
- **Content filtering**: Only sentences with more than 10 meaningful words (alphabetic tokens, excluding stopwords and punctuation) are compared
- **Greedy matching**: Sentences are paired one-to-one, so every sentence from the shorter document receives a partner from the longer document
- **Iterative process**: Highest similarity pairs selected first, matched sentences removed, process repeats
- **Performance optimization**: The example-extraction step caps each document at its first 300 filtered sentences for computational efficiency; the main matching function applies no such cap
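
The greedy procedure above can be sketched on a precomputed similarity matrix. This is a minimal illustration, not the notebook's implementation: `greedy_match` and the toy matrix are hypothetical, and the sketch sorts all pairs once instead of rescanning the remaining sentences on every iteration, which should yield the same greedy pairing.

```python
import numpy as np


def greedy_match(sim: np.ndarray) -> list[tuple[int, int, float]]:
    """Greedy one-to-one matching: take the best remaining pair, then repeat."""
    n_rows, n_cols = sim.shape
    # Flat indices ordered by descending similarity (stable, so ties keep row-major order)
    order = np.argsort(-sim, axis=None, kind="stable")
    used_rows, used_cols, pairs = set(), set(), []
    for flat in order:
        i, j = divmod(int(flat), n_cols)
        if i in used_rows or j in used_cols:
            continue  # one of the two sentences is already matched
        pairs.append((i, j, float(sim[i, j])))
        used_rows.add(i)
        used_cols.add(j)
        if len(pairs) == min(n_rows, n_cols):
            break  # the shorter document is exhausted
    return pairs


# Rows: two sentences from one year; columns: three sentences from the other
sim = np.array([
    [0.90, 0.40, 0.10],
    [0.95, 0.30, 0.20],
])
pairs = greedy_match(sim)
print(pairs)  # [(1, 0, 0.95), (0, 1, 0.4)]
```

Row 0's best candidate (column 0) is claimed first by row 1, so row 0 falls back to column 1. Greedy matching therefore does not guarantee the globally optimal assignment, only that the strongest pairs are locked in first.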

## High Similarity Detection
- **Threshold**: Sentence pairs scoring above 0.999 (99.9% similarity) are counted as near-identical content
- **Target patterns**: Minor changes like "reduced emissions by 15% in 2021" vs. "reduced emissions by 15% in 2022"
- **Quality validation**: spaCy's default similarity is the cosine between averaged word vectors, so it tends to assign high scores to sentences with largely overlapping vocabulary, which is common within the same company context; this is why the threshold is set so strictly
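
To see why a one-token change can still score above 0.999, the sketch below mimics averaged-vector cosine similarity with hand-made 3-dimensional vectors. Everything here is an assumption for illustration: `toy_vectors` and `mean_vector_similarity` are hypothetical, and real spaCy vectors have hundreds of dimensions.

```python
import numpy as np

# Hypothetical toy "word vectors"; "2022" is deliberately placed close to "2021"
toy_vectors = {
    "reduced":   np.array([1.0, 0.0, 0.0]),
    "emissions": np.array([0.0, 1.0, 0.0]),
    "2021":      np.array([0.0, 0.0, 1.0]),
    "2022":      np.array([0.0, 0.05, 1.0]),
}


def mean_vector_similarity(tokens_a, tokens_b):
    """Cosine similarity between the averaged token vectors of two token lists."""
    vec_a = np.mean([toy_vectors[t] for t in tokens_a], axis=0)
    vec_b = np.mean([toy_vectors[t] for t in tokens_b], axis=0)
    return float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))


base = ["reduced", "emissions", "2021"]
year_swap = mean_vector_similarity(base, ["reduced", "emissions", "2022"])
reordered = mean_vector_similarity(base, ["2021", "emissions", "reduced"])

print(f"year swapped: {year_swap:.5f}")
print(f"reordered:    {reordered:.5f}")
```

Because the vectors are averaged, word order has no effect on the score, and swapping one token for a close neighbour moves it only slightly below 1.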

## Variables Produced for Communication Scoring
According to the analysis framework:
- **Cross-Year Similarity Score** → Reporting Consistency dimension
- **High-Similarity Sentence Ratio** → Reporting Consistency dimension

## Theoretical Foundation
This analysis builds on research reporting that boilerplate language is widespread in sustainability reports and correlates with ESG rating problems. It aims to distinguish legitimate annual updates from lazy content recycling, which suggests symbolic compliance rather than substantive environmental progress.

## Output Metrics
- **High similarity ratio**: (matched sentence pairs above 99.9% similarity) ÷ total matched sentence pairs
- **Average similarity**: Mean semantic similarity across all sentence pairs
- **Document-level scores**: Combined similarity metrics for comprehensive consistency assessment
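
The first two metrics reduce to simple arithmetic over the matched-pair scores. The sketch below assumes a flat list of similarity scores; `summarize_scores` and the sample values are hypothetical.

```python
import numpy as np


def summarize_scores(scores, threshold=0.999):
    """Return (high_similarity_ratio, avg_similarity) for a list of pair scores."""
    if not scores:
        return 0.0, 0.0  # no matched pairs: report zeros rather than dividing by zero
    high_count = sum(1 for s in scores if s > threshold)  # strictly above the threshold
    return high_count / len(scores), float(np.mean(scores))


sample_scores = [0.9995, 1.0, 0.95, 0.80]
ratio, average = summarize_scores(sample_scores)
print(f"High similarity ratio: {ratio:.4f}")  # 2 of 4 pairs exceed 0.999
print(f"Average similarity: {average:.4f}")
```

A score of exactly 0.999 would not count as high similarity here, because the comparison is strict.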

This analysis reveals whether companies meaningfully update their environmental narratives or rely on template-based reporting across years.

In [None]:
import spacy
from pathlib import Path
import pandas as pd
import numpy as np

# Load model
nlp = spacy.load("en_core_web_lg")

# Increase max_length to safely handle long texts
nlp.max_length = 1_500_000

In [None]:
from pathlib import Path

report_names = [ 
    "Akenerji_Elektrik_Uretim_AS",
    "Arendals_Fossekompani_ASA",
    "Atlantica_Sustainable_Infrastructure_PLC",
    "CEZ",
    "EDF",
    "EDP_Energias_de_Portugal_SA",
    "Endesa",
    "ERG_SpA",
    "Orsted",
    "Polska_Grupa_Energetyczna_PGE_SA",
    "Romande_Energie_Holding_SA",
    "Scatec_ASA",
    "Solaria_Energia_y_Medio_Ambiente_SA",
    "Terna_Energy_SA"
]

folders = {
    "2021": Path("data/NLP/Reports/Cleanest/2021"),
    "2022": Path("data/NLP/Reports/Cleanest/2022")
}

# Check availability
for name in report_names:
    file_name = f"{name}.txt"
    in_2021 = (folders["2021"] / file_name).exists()
    in_2022 = (folders["2022"] / file_name).exists()
    print(f"{file_name}: 2021: {'YES' if in_2021 else 'NO'} | 2022: {'YES' if in_2022 else 'NO'}")


In [None]:
# Dictionary to store processed docs
documents = {}

# Load and process
for version, folder_path in folders.items():
    for name in report_names:
        txt_path = folder_path / f"{name}.txt"
        try:
            with open(txt_path, "r", encoding="utf-8") as f:
                text = f.read()
            doc_key = f"{name}_{version}"
            documents[doc_key] = nlp(text)
            print(f"Processed {doc_key}")
        except Exception as e:
            print(f"Error processing {txt_path.name}: {e}")

print(f"\nTotal documents loaded: {len(documents)}")

## Doc Similarity
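
Before the full functions, a small sketch of TF-IDF cosine similarity on two made-up sentences may help. It assumes scikit-learn's default `TfidfVectorizer` settings rather than the lemmatized, n-gram configuration used in this notebook, and `toy_tfidf_similarity` is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def toy_tfidf_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between TF-IDF vectors fitted on just these two texts."""
    matrix = TfidfVectorizer().fit_transform([text_a, text_b])
    return float(cosine_similarity(matrix[0:1], matrix[1:2])[0, 0])


text_2021 = "we reduced emissions by 15 percent in 2021"
text_2022 = "we reduced emissions by 15 percent in 2022"

changed = toy_tfidf_similarity(text_2021, text_2022)
identical = toy_tfidf_similarity(text_2021, text_2021)

print(f"one token changed: {changed:.3f}")
print(f"identical texts:   {identical:.3f}")
```

When the vectorizer is fitted on only two documents, terms that appear in just one of them receive a higher IDF weight than shared terms, so a single changed word lowers the score noticeably on short texts.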

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# --- TF-IDF document-level similarity ---
def tfidf_doc_similarity(doc1, doc2):
    """
    Compute cosine similarity between two documents using TF-IDF vectorization.
    Lemmatizes and filters tokens before vectorizing full documents.
    Source: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
    """
    def clean_text(doc):
        return " ".join([
            token.lemma_.lower() 
            for token in doc 
            if not token.is_stop and not token.is_punct and token.is_alpha
        ])
    
    # Preprocess documents
    text1 = clean_text(doc1)
    text2 = clean_text(doc2)
    
    # Vectorize using TF-IDF (word- and phrase-level)
    vectorizer = TfidfVectorizer(
        max_features=4000,       # Limit vocabulary for speed
        ngram_range=(1, 3),      # Use unigrams, bigrams, trigrams
        norm='l2'                # Normalize vectors for cosine similarity
    )
    tfidf_matrix = vectorizer.fit_transform([text1, text2])
    
    # Calculate cosine similarity between the two document vectors
    similarity = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]
    return similarity

# --- Jaccard similarity ---
def jaccard_similarity(doc1, doc2):
    """
    Compute Jaccard similarity between sets of unique lemmatized words.
    Measures overlap in vocabulary between two documents.
    Source: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.jaccard_score.html
    """
    words1 = {token.lemma_.lower() for token in doc1 if not token.is_stop and not token.is_punct and token.is_alpha}
    words2 = {token.lemma_.lower() for token in doc2 if not token.is_stop and not token.is_punct and token.is_alpha}
    
    intersection = len(words1 & words2)
    union = len(words1 | words2)
    
    return intersection / union if union else 0

def spacy_sentence_similarity(doc1, doc2, threshold=0.999):
    """
    Compare sentences between two documents using spaCy's built-in similarity with greedy matching.
    
    Args:
        doc1: spaCy Doc object for 2021 document
        doc2: spaCy Doc object for 2022 document  
        threshold: Similarity threshold for high similarity ratio
    
    Returns:
        dict: Contains high_similarity_ratio, avg_similarity, matched_pairs, and similarity_scores

    Source: https://spacy.io/usage/linguistic-features#vectors-similarity
    """
    
    def filter_sentences(doc):
        """Filter sentences with more than 10 lemmatized words (excluding stopwords/punctuation)"""
        filtered_sentences = []
        
        for sent in doc.sents:
            # Count lemmatized words excluding stopwords and punctuation
            lemmatized_words = [
                token.lemma_.lower() 
                for token in sent 
                if not token.is_stop and not token.is_punct and token.is_alpha
            ]
            
            # Only include sentences with more than 10 such words
            if len(lemmatized_words) > 10:
                filtered_sentences.append(sent)
        
        return filtered_sentences
    
    # Step 1: Filter sentences from both documents
    sentences_2021 = filter_sentences(doc1)
    sentences_2022 = filter_sentences(doc2)
    
    print(f"Filtered sentences - 2021: {len(sentences_2021)}, 2022: {len(sentences_2022)}")
    
    if not sentences_2021 or not sentences_2022:
        return {
            'high_similarity_ratio': 0.0,
            'avg_similarity': 0.0,
            'matched_pairs': [],
            'similarity_scores': []
        }
    
    # Step 2 & 3: Greedy matching process
    matched_pairs = []
    similarity_scores = []
    
    # Create working copies of sentence lists
    remaining_2021 = sentences_2021.copy()
    remaining_2022 = sentences_2022.copy()
    
    while remaining_2021 and remaining_2022:
        # Scan all remaining sentence pairs and track the highest similarity score(s)
        max_similarity = -1
        best_pairs = []
        
        for i, sent_2021 in enumerate(remaining_2021):
            for j, sent_2022 in enumerate(remaining_2022):
                similarity = sent_2021.similarity(sent_2022)
                
                # Keep every pair that ties for the current maximum
                if similarity > max_similarity:
                    max_similarity = similarity
                    best_pairs = [(i, j, similarity)]
                elif similarity == max_similarity:
                    best_pairs.append((i, j, similarity))
        
        if max_similarity == -1:  # No valid similarities computed
            break
            
        # Handle ties: select pairs that refer to unique sentences
        selected_pairs = []
        used_2021_indices = set()
        used_2022_indices = set()
        
        for i, j, sim in best_pairs:
            if i not in used_2021_indices and j not in used_2022_indices:
                selected_pairs.append((i, j, sim))
                used_2021_indices.add(i)
                used_2022_indices.add(j)
        
        # Add selected pairs to results
        for i, j, sim in selected_pairs:
            matched_pairs.append((remaining_2021[i], remaining_2022[j]))
            similarity_scores.append(sim)
        
        # Remove selected sentences from remaining lists (in reverse order to maintain indices)
        indices_2021_to_remove = sorted([i for i, j, sim in selected_pairs], reverse=True)
        indices_2022_to_remove = sorted([j for i, j, sim in selected_pairs], reverse=True)
        
        for idx in indices_2021_to_remove:
            remaining_2021.pop(idx)
        for idx in indices_2022_to_remove:
            remaining_2022.pop(idx)
    
    # Step 4: Calculate high similarity ratio and average similarity
    if not similarity_scores:
        high_similarity_ratio = 0.0
        avg_similarity = 0.0
    else:
        high_similarity_count = sum(1 for score in similarity_scores if score > threshold)
        high_similarity_ratio = high_similarity_count / len(similarity_scores)
        avg_similarity = np.mean(similarity_scores)

    print(f"Matched pairs: {len(matched_pairs)}")
    print(f"High similarity pairs (>{threshold}): {sum(1 for score in similarity_scores if score > threshold)}")
    print(f"High similarity ratio: {high_similarity_ratio:.4f}")
    print(f"Average similarity: {avg_similarity:.4f}")

    return {
        'high_similarity_ratio': high_similarity_ratio,
        'avg_similarity': avg_similarity,
        'matched_pairs': matched_pairs,
        'similarity_scores': similarity_scores
    }


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import pandas as pd
import numpy as np

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")  # Nice color palette
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12

# Create output directories
results_dir = Path("data/NLP/Results")
similarity_dir = results_dir / "Similarity"
figures_dir = results_dir / "Figures" / "Similarity"

# Create directories if they don't exist
similarity_dir.mkdir(parents=True, exist_ok=True)
figures_dir.mkdir(parents=True, exist_ok=True)

# Initialize results list and similarity scores storage
results = []
company_similarity_scores = {}  # Store similarity scores for histograms

# Collect similarity results for each company
for company in report_names:
    key_2021 = f"{company}_2021"
    key_2022 = f"{company}_2022"
    
    if key_2021 in documents and key_2022 in documents:
        print(f"\nAnalyzing {company}...")

        # Retrieve SpaCy Docs
        doc1 = documents[key_2021]
        doc2 = documents[key_2022]

        # Document-level similarity scores
        print("  Computing TF-IDF similarity...")
        tfidf_score = tfidf_doc_similarity(doc1, doc2)
        
        print("  Computing Jaccard similarity...")
        jaccard_score = jaccard_similarity(doc1, doc2)
        
        # Sentence-level similarity using spaCy
        print("  Computing sentence-level similarity...")
        sentence_similarity_result = spacy_sentence_similarity(doc1, doc2)
        high_sim_ratio = sentence_similarity_result['high_similarity_ratio']
        avg_sim = sentence_similarity_result['avg_similarity']
        
        # Store similarity scores for histogram
        similarity_scores = sentence_similarity_result['similarity_scores']
        if similarity_scores:  # Only store if we have scores
            company_similarity_scores[company] = similarity_scores

        # Append results to list (updated with new metrics)
        results.append({
            'Company': company.replace("_", " "),
            'TFIDF_Doc': tfidf_score,
            'Jaccard': jaccard_score,
            'SpaCy_HighSim_Ratio': high_sim_ratio,
            'SpaCy_Avg_Similarity': avg_sim,
            'Num_Sentence_Pairs': len(similarity_scores)
        })
        
        print(f"Complete - {len(similarity_scores)} sentence pairs analyzed")

# Convert results to DataFrame for analysis or export
df_similarity = pd.DataFrame(results).round(4)

# Display results
print("\n" + "="*80)
print("SIMILARITY SUMMARY")
print("="*80)
print(df_similarity.to_string(index=False))

# Save DataFrame to Excel
excel_path = similarity_dir / "similarity_analysis_results.xlsx"
print(f"\nSaving results to Excel: {excel_path}")

with pd.ExcelWriter(excel_path, engine='openpyxl') as writer:
    # Main results sheet
    df_similarity.to_excel(writer, sheet_name='Similarity_Results', index=False)
    
    # Summary statistics sheet
    summary_stats = df_similarity.describe()
    summary_stats.to_excel(writer, sheet_name='Summary_Statistics')
    
    # Individual company sentence scores (if available)
    if company_similarity_scores:
        # Create a sheet with all similarity scores
        all_scores_data = []
        for company, scores in company_similarity_scores.items():
            for score in scores:
                all_scores_data.append({
                    'Company': company.replace("_", " "),
                    'Similarity_Score': score
                })
        
        df_all_scores = pd.DataFrame(all_scores_data)
        df_all_scores.to_excel(writer, sheet_name='All_Sentence_Similarities', index=False)

print("Excel file saved successfully!")

# Create histograms for each company
print("\nCreating histograms for similarity scores...")
print("-" * 50)

# Define colors for companies (using seaborn color palette)
colors = sns.color_palette("husl", len(company_similarity_scores))
company_colors = dict(zip(company_similarity_scores.keys(), colors))

for company, scores in company_similarity_scores.items():
    if not scores:  # Skip if no scores
        continue
        
    print(f"Creating histogram for {company.replace('_', ' ')}...")
    
    # Create figure and axis
    fig, ax = plt.subplots(figsize=(12, 8))
    
    # Create histogram
    n_bins = min(30, len(scores) // 2)  # Adaptive number of bins
    n_bins = max(10, n_bins)  # Ensure minimum 10 bins
    
    counts, bins, patches = ax.hist(
        scores, 
        bins=n_bins, 
        alpha=0.7, 
        color=company_colors[company],
        edgecolor='black',
        linewidth=0.5
    )
    
    # Customize the plot
    company_display_name = company.replace("_", " ")
    ax.set_title(f'Sentence Similarity Score Distribution\n{company_display_name}', 
                fontsize=16, fontweight='bold', pad=20)
    ax.set_xlabel('Similarity Score', fontsize=14)
    ax.set_ylabel('Frequency', fontsize=14)
    ax.grid(True, alpha=0.3)
    
    # Add statistics text box
    mean_score = np.mean(scores)
    median_score = np.median(scores)
    std_score = np.std(scores)
    min_score = np.min(scores)
    max_score = np.max(scores)
    
    stats_text = f'Statistics:\n' \
                f'Mean: {mean_score:.3f}\n' \
                f'Median: {median_score:.3f}\n' \
                f'Std Dev: {std_score:.3f}\n' \
                f'Min: {min_score:.3f}\n' \
                f'Max: {max_score:.3f}\n' \
                f'N: {len(scores)}'
    
    ax.text(0.02, 0.98, stats_text, transform=ax.transAxes, 
            verticalalignment='top', horizontalalignment='left',
            bbox=dict(boxstyle='round', facecolor='white', alpha=0.8),
            fontsize=11)
    
    # Add vertical line for mean
    ax.axvline(mean_score, color='red', linestyle='--', linewidth=2, 
               label=f'Mean: {mean_score:.3f}')
    ax.legend()
    
    # Set x-axis limits
    ax.set_xlim(0, 1)
    
    # Improve layout
    plt.tight_layout()
    
    # Save the figure
    company_dir = figures_dir / company
    company_dir.mkdir(exist_ok=True)
    
    filename = f"{company}_similarity_histogram.png"
    filepath = company_dir / filename
    
    plt.savefig(filepath, dpi=300, bbox_inches='tight', facecolor='white')
    print(f"Saved: {filepath}")
    
    # Also save as PDF for better quality
    pdf_filepath = company_dir / f"{company}_similarity_histogram.pdf"
    plt.savefig(pdf_filepath, bbox_inches='tight', facecolor='white')
    
    plt.close()  # Close the figure to free memory

# Create a combined overview plot
if len(company_similarity_scores) > 1:
    print("\nCreating combined overview plot...")
    
    fig, axes = plt.subplots(
        nrows=(len(company_similarity_scores) + 2) // 3,  # 3 columns
        ncols=3,
        figsize=(18, 6 * ((len(company_similarity_scores) + 2) // 3))
    )
    
    # Flatten axes for easy indexing (subplots returns a 1-D array for one row, 2-D otherwise)
    axes = np.atleast_1d(axes).flatten()
    
    for i, (company, scores) in enumerate(company_similarity_scores.items()):
        if i >= len(axes):
            break
            
        ax = axes[i]
        
        # Create mini histogram
        n_bins = min(20, len(scores) // 2)
        n_bins = max(8, n_bins)
        
        ax.hist(scores, bins=n_bins, alpha=0.7, 
                color=company_colors[company], edgecolor='black', linewidth=0.3)
        
        # Customize
        company_name = company.replace("_", " ")
        ax.set_title(company_name, fontsize=12, fontweight='bold')
        ax.set_xlabel('Similarity Score', fontsize=10)
        ax.set_ylabel('Frequency', fontsize=10)
        ax.grid(True, alpha=0.3)
        ax.set_xlim(0, 1)
        
        # Add mean line
        mean_score = np.mean(scores)
        ax.axvline(mean_score, color='red', linestyle='--', linewidth=1.5)
    
    # Hide unused subplots
    for j in range(i + 1, len(axes)):
        axes[j].set_visible(False)
    
    plt.suptitle('Sentence Similarity Score Distributions - All Companies', 
                 fontsize=16, fontweight='bold')
    plt.tight_layout()
    
    # Save combined plot
    combined_path = figures_dir / "combined_similarity_histograms.png"
    plt.savefig(combined_path, dpi=300, bbox_inches='tight', facecolor='white')
    print(f"Combined plot saved: {combined_path}")
    
    plt.close()

# Print final summary
print("\n" + "="*80)
print("ANALYSIS COMPLETE")
print("="*80)
print(f"Results saved to Excel: {excel_path}")
print(f"Individual histograms saved to: {figures_dir}")
print(f"Companies processed: {len(company_similarity_scores)}")
print(f"Total sentence pairs analyzed: {sum(len(scores) for scores in company_similarity_scores.values())}")

# Display final DataFrame
print("\nFinal Results Summary:")
print("-" * 50)
print(df_similarity.to_string(index=False))

In [None]:
# Show examples of sentence pairs across different similarity ranges
print("\nSENTENCE SIMILARITY ANALYSIS - EXAMPLES ACROSS RANGES")
print("=" * 75)
print("Showing sentence pairs across different similarity levels")
print("High similarity (≥0.999), Medium similarity, Low similarity")
print()

# Function to find sentence pairs in specific similarity ranges
def get_similarity_examples_by_range(company_name, doc1, doc2):
    """Extract sentence pairs across different similarity ranges"""
    
    def filter_sentences(doc):
        """Filter sentences with more than 10 lemmatized words (excluding stopwords/punctuation)"""
        filtered_sentences = []
        
        for sent in doc.sents:
            lemmatized_words = [
                token.lemma_.lower() 
                for token in sent 
                if not token.is_stop and not token.is_punct and token.is_alpha
            ]
            
            if len(lemmatized_words) > 10:
                filtered_sentences.append(sent)
        
        return filtered_sentences
    
    # Filter sentences from both documents
    sentences_2021 = filter_sentences(doc1)
    sentences_2022 = filter_sentences(doc2)
    
    if not sentences_2021 or not sentences_2022:
        return {}
    
    # Create a smaller sample if documents are very large (for performance)
    max_sentences = 300  # Limit to prevent excessive computation
    if len(sentences_2021) > max_sentences:
        sentences_2021 = sentences_2021[:max_sentences]
    if len(sentences_2022) > max_sentences:
        sentences_2022 = sentences_2022[:max_sentences]
    
    # Define similarity ranges
    similarity_ranges = {
        'high': {'min': 0.999, 'max': 0.99999999, 'pairs': []},      # 0.999 - 1.000
        'medium_high': {'min': 0.9, 'max': 0.95, 'pairs': []}, # ~0.925
        'medium': {'min': 0.85, 'max': 0.9, 'pairs': []},      # ~0.875  
        'low': {'min': 0.45, 'max': 0.55, 'pairs': []}          # ~0.5
    }
    
    print(f"Analyzing {len(sentences_2021)} sentences from 2021 vs {len(sentences_2022)} sentences from 2022...")
    
    # Compare sentence pairs and categorize by similarity
    pairs_checked = 0
    for i, sent_2021 in enumerate(sentences_2021):
        for j, sent_2022 in enumerate(sentences_2022):
            try:
                similarity = sent_2021.similarity(sent_2022)
                pairs_checked += 1
                
                # Check which range this similarity falls into
                for range_name, range_info in similarity_ranges.items():
                    if range_info['min'] <= similarity <= range_info['max']:
                        range_info['pairs'].append({
                            'similarity': similarity,
                            'sentence_2021': sent_2021.text.strip(),
                            'sentence_2022': sent_2022.text.strip()
                        })
                        break
                
                # Stop early if we have enough examples in each category
                if all(len(range_info['pairs']) >= 5 for range_info in similarity_ranges.values()):
                    break
                    
            except Exception:
                continue  # Skip if similarity computation fails
        
        # Break outer loop if we have enough examples
        if all(len(range_info['pairs']) >= 5 for range_info in similarity_ranges.values()):
            break
    
    print(f"Checked {pairs_checked} sentence pairs")
    return similarity_ranges

# Pick the company to analyze (the first alphabetically if several are loaded)
company_name = None
available_companies = set()

for doc_key in documents.keys():
    if doc_key.endswith('_2021') or doc_key.endswith('_2022'):
        company = doc_key.rsplit('_', 1)[0]  # Strip only the trailing year suffix
        available_companies.add(company)

if len(available_companies) == 1:
    company_name = list(available_companies)[0]
elif len(available_companies) > 1:
    # If multiple companies, take the first one
    company_name = sorted(list(available_companies))[0]
    print(f"Multiple companies found, analyzing: {company_name.replace('_', ' ')}")

if company_name:
    key_2021 = f"{company_name}_2021"
    key_2022 = f"{company_name}_2022"
    
    if key_2021 in documents and key_2022 in documents:
        print(f"ANALYZING COMPANY: {company_name.replace('_', ' ')}")
        print("-" * 50)
        
        doc1 = documents[key_2021]
        doc2 = documents[key_2022]
        
        # Get examples across similarity ranges
        similarity_ranges = get_similarity_examples_by_range(company_name, doc1, doc2)
        
        # Display examples for each range
        range_descriptions = {
            'high': 'HIGH SIMILARITY (0.999 - 1.000) - Near identical/identical content',
            'medium_high': 'MEDIUM-HIGH SIMILARITY (0.90 - 0.95)',
            'medium': 'MEDIUM SIMILARITY (0.85 - 0.90)', 
            'low': 'LOW SIMILARITY (0.45 - 0.55)'
        }
        
        for range_name, description in range_descriptions.items():
            range_data = similarity_ranges[range_name]
            pairs = range_data['pairs']
            
            print(f"\n{description}")
            print("=" * 75)
            
            if pairs:
                # Sort by similarity score (highest first) and show top examples
                pairs.sort(key=lambda x: x['similarity'], reverse=True)
                examples_to_show = min(3, len(pairs))  # Show up to 3 examples per range
                
                print(f"Found {len(pairs)} pairs in this range. Showing top {examples_to_show}:")
                print()
                
                for i, pair in enumerate(pairs[:examples_to_show], 1):
                    similarity = pair['similarity']
                    sent_2021 = pair['sentence_2021']
                    sent_2022 = pair['sentence_2022']
                    
                    # Truncate very long sentences for display
                    max_length = 250
                    if len(sent_2021) > max_length:
                        sent_2021 = sent_2021[:max_length] + "..."
                    if len(sent_2022) > max_length:
                        sent_2022 = sent_2022[:max_length] + "..."
                    
                    print(f"EXAMPLE {i} - SIMILARITY: {similarity:.6f}")
                    print(f"2021: {sent_2021}")
                    print(f"2022: {sent_2022}")
                    print("-" * 75)
            else:
                print("No sentence pairs found in this similarity range.")
                print("Try expanding the similarity range or checking more sentence pairs.")
                print()
        
        # Summary statistics
        total_pairs_found = sum(len(range_data['pairs']) for range_data in similarity_ranges.values())
        print("\nSUMMARY:")
        print(f"Total examples found across all ranges: {total_pairs_found}")
        for range_name, description in range_descriptions.items():
            count = len(similarity_ranges[range_name]['pairs'])
            range_min = similarity_ranges[range_name]['min']
            range_max = similarity_ranges[range_name]['max']
            print(f"  {range_name.replace('_', '-').title()}: {count} pairs ({range_min:.3f} - {range_max:.3f})")
    
    else:
        print(f"Documents not found for {company_name}")
        print(f"Available document keys: {list(documents.keys())}")

else:
    print("No company documents found in the expected format.")
    print("Expected format: CompanyName_2021, CompanyName_2022")
    print(f"Available document keys: {list(documents.keys())}")

print("\nSimilarity range analysis complete.")