# Vague and Hedge Language Analysis

## Overview
This module detects unclear communication by analyzing vague language and hedge words. It identifies wording that makes commitments less specific and more ambiguous, and it directly supports the Language Vagueness dimension of the communication assessment framework.

## Hedge Words Detection
- **Strong hedge words**: High uncertainty expressions ("may", "could", "might", "potentially", "possibly")
- **Mild hedge words**: Moderate uncertainty markers ("generally", "typically", "usually", "somewhat", "fairly")
- **Sentence-level analysis**: Tracks sentences with 3+ hedge words indicating excessive uncertainty
- **Intensity scoring**: ((strong hedge × 1.5) + (mild hedge × 1.0)) ÷ meaningful words × 100
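
As a rough illustration of the intensity formula above, the following sketch computes the score from raw counts. The function name `hedge_intensity` and the sample counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def hedge_intensity(strong_count, mild_count, meaningful_words,
                    strong_weight=1.5, mild_weight=1.0):
    """Weighted hedge words per 100 meaningful words (illustrative helper)."""
    if meaningful_words == 0:
        return 0.0  # avoid division by zero on empty documents
    weighted = strong_count * strong_weight + mild_count * mild_weight
    return weighted / meaningful_words * 100


# Hypothetical counts: 12 strong and 8 mild hedges in 2,000 meaningful words
score = hedge_intensity(12, 8, 2000)
print(round(score, 4))  # (12 * 1.5 + 8 * 1.0) / 2000 * 100 = 1.3
```

Because the denominator is the meaningful-word count rather than the raw token count, scores stay comparable between reports of different length and stopword share.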

## Vague Language Categories
- **Temporal vagueness**: "soon", "eventually", "ongoing", "long-term", "progressively"
- **Scope ambiguity**: "various", "multiple", "several", "certain", "range of"  
- **Commitment vagueness**: "working towards", "striving for", "exploring", "considering"
- **Context-dependent assessment**: Some terms only count as vague when lacking specific quantification or comparison
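
One possible reading of the context-dependent rule is sketched below: a term counts as vague only if its sentence carries no number, percentage, or explicit comparison. The cue pattern, the function name `is_vague_usage`, and the example term "significant" are assumptions made for illustration and are not taken from the module's word lists.

```python
import re

# Hypothetical cues: a digit, a percent sign, or an explicit comparison in the same sentence
QUANT_OR_COMPARISON = re.compile(r"\d|%|\bthan\b|\bcompared (?:to|with)\b", re.IGNORECASE)


def is_vague_usage(sentence, term):
    """Return True if `term` occurs in `sentence` without quantification or comparison."""
    if not re.search(rf"\b{re.escape(term)}\b", sentence, re.IGNORECASE):
        return False  # the term is not used at all
    return QUANT_OR_COMPARISON.search(sentence) is None


unquantified = is_vague_usage("We achieved a significant reduction in emissions.", "significant")
quantified = is_vague_usage("We achieved a significant reduction of 40% compared to 2020.", "significant")
print(unquantified, quantified)  # True False
```

A sentence-level check like this is deliberately coarse; a dependency-based check would tie the number to the term it actually modifies.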

## Specialized Analysis Components
1. **Commitment timeline analysis**: Percentage of commitment words with specific timelines vs. vague timeframes
2. **Context-dependent word processing**: Distinguishes vague usage from quantified/compared contexts
3. **Combined metrics**: Integrated unclear communication density across both hedge and vague categories
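
The commitment timeline idea can be sketched as follows, assuming that a four-digit year marks a specific timeline and that the temporal terms listed earlier mark a vague one. The commitment cue stems and the function name `timeline_shares` are hypothetical.

```python
import re

COMMITMENT_CUE = re.compile(r"\b(?:commit|target|aim|pledge)\w*")  # hypothetical cue stems
VAGUE_TIME_TERMS = ("soon", "eventually", "ongoing", "long-term", "progressively")
SPECIFIC_YEAR = re.compile(r"\b(?:19|20)\d{2}\b")                   # e.g. "2030"


def timeline_shares(sentences):
    """Share (%) of commitment sentences with a specific year vs. only vague timing."""
    specific = vague = total = 0
    for sentence in sentences:
        lowered = sentence.lower()
        if not COMMITMENT_CUE.search(lowered):
            continue  # not a commitment sentence
        total += 1
        if SPECIFIC_YEAR.search(lowered):
            specific += 1
        elif any(re.search(rf"\b{re.escape(term)}\b", lowered) for term in VAGUE_TIME_TERMS):
            vague += 1
    if total == 0:
        return {"specific_pct": 0.0, "vague_pct": 0.0}
    return {"specific_pct": specific / total * 100, "vague_pct": vague / total * 100}


sample = [
    "We commit to net-zero emissions by 2040.",   # specific timeline
    "We aim to phase out coal eventually.",       # vague timeline
    "Revenue grew during the year.",              # no commitment cue
    "Our target is to expand wind capacity.",     # commitment without any timeline
]
shares = timeline_shares(sample)
print(shares)
```

Commitments with neither a year nor a vague term fall into neither share, so the two percentages need not sum to 100.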

## Variables Produced for Communication Scoring
According to the analysis framework:
- **Vague and Hedge Words** → Language Vagueness dimension
- **Vague Language Intensity** → Language Vagueness dimension
- **Hedge Language Intensity** → Language Vagueness dimension

## Quality Control Features
- **Meaningful word calculation**: Excludes stopwords, punctuation, and whitespace for accurate density metrics
- **Category-specific tracking**: Separate analysis of temporal, scope, and commitment vagueness
- **Intensity weighting**: Stronger vague/hedge terms receive a 1.5× weight to reflect their greater contribution to ambiguity

## Theoretical Foundation
The analysis builds on research that distinguishes vague language from legitimate legal hedging. This distinction makes it possible to detect communication patterns that reduce accountability while preserving plausible deniability.

In [None]:
import spacy
from spacy_layout import spaCyLayout
from pathlib import Path
import pandas as pd
import numpy as np
import re
from collections import defaultdict, Counter

# Load spaCy model and configure for large documents
nlp = spacy.load("en_core_web_lg")
nlp.max_length = 1_500_000

In [None]:
from pathlib import Path

# Toggle between "test" and "actual"
MODE = "actual"   

# Define configuration based on mode
if MODE == "test":
    report_names = [ 
        "Axpo_Holding_AG", "NEOEN_SA"
    ]
    folders = {
        "2021": Path("data/NLP/Testing/Reports/Clean/2021"),
        "2022": Path("data/NLP/Testing/Reports/Clean/2022")
    }

elif MODE == "actual":
    report_names = [ 
        "Akenerji_Elektrik_Uretim_AS",
        "Arendals_Fossekompani_ASA",
        "Atlantica_Sustainable_Infrastructure_PLC",
        "CEZ",
        "EDF",
        "EDP_Energias_de_Portugal_SA",
        "Endesa",
        "ERG_SpA",
        "Orsted",
        "Polska_Grupa_Energetyczna_PGE_SA",
        "Romande_Energie_Holding_SA",
        "Scatec_ASA",
        "Solaria_Energia_y_Medio_Ambiente_SA",
        "Terna_Energy_SA"
    ]

    folders = {
        "2021": Path("data/NLP/Reports/Cleanest/2021"),
        "2022": Path("data/NLP/Reports/Cleanest/2022")
    }

else:
    raise ValueError("Invalid MODE. Use 'test' or 'actual'.")

# Check availability
for name in report_names:
    file_name = f"{name}.txt"
    in_2021 = (folders["2021"] / file_name).exists()
    in_2022 = (folders["2022"] / file_name).exists()
    print(f"{file_name}: 2021: {'YES' if in_2021 else 'NO'} | 2022: {'YES' if in_2022 else 'NO'}")


In [None]:
# Dictionary to store processed docs
documents = {}

# Load and process all documents
for version, folder_path in folders.items():
    for name in report_names:
        txt_path = folder_path / f"{name}.txt"
        try:
            with open(txt_path, "r", encoding="utf-8") as f:
                text = f.read()
            doc_key = f"{name}_{version}"
            documents[doc_key] = nlp(text)
            print(f"Processed {doc_key}")
        except Exception as e:
            print(f"Error processing {txt_path.name}: {e}")

print(f"\nTotal documents loaded: {len(documents)}")

## Hedge words

In [None]:
# Comprehensive hedge word classifications
HEDGE_WORDS = {
    "strong_hedge": [
        # Modal verbs expressing uncertainty
        "might", "could", "may", "would", "should", "ought",
        
        # Adverbs of uncertainty (context-independent)
        "possibly", "potentially", "probably", "likely", "unlikely", "perhaps",
        "maybe", "conceivably", "presumably", "supposedly", "allegedly", "apparently",
        "seemingly", "arguably", "debatably", "questionably", "tentatively",
        
        # Verbs indicating uncertainty
        "appears", "seems", "suggests", "indicates", "implies", "assumes", "believes",
        "estimates", "speculates", "suspects", "expects", "anticipates", "predicts",
        "presumes", "supposes", "imagines", "thinks", "feels", "considers",
        
        # Adjectives expressing uncertainty
        "uncertain", "unclear", "ambiguous", "doubtful", "questionable", "debatable",
        "controversial", "disputed", "alleged", "supposed", "presumed", "potential",
        "possible", "probable", "speculative", "hypothetical",
        "theoretical", "tentative", "provisional", "conditional", "contingent",
        
        # Phrases and expressions (multi-word entries; not matched by the per-token lemma lookup)
        "it appears", "it seems", "it suggests", "it indicates", "it implies",
        "one might", "one could", "we believe", "we think", "we assume", "we estimate",
        "tend to", "tends to", "inclined to", "prone to", "apt to"
    ],
    
    "mild_hedge": [
        # Frequency adverbs (context-independent)
        "often", "commonly", "regularly", "ordinarily", "customarily", 
        "habitually", "routinely", "traditionally", "predominantly", "mainly", "chiefly", 
        "primarily", "principally", "mostly", "notably", "markedly",
        
        # Degree adverbs (context-independent)
        "somewhat", "fairly", "quite", "rather", "reasonably", "moderately",
        "partially", "partly", "nearly", "almost", "virtually",
        
        # Multi-word expressions (context-independent; not matched by the per-token lemma lookup)
        "to some extent", "to a degree", "in part", "in general", "on the whole", 
        "by and large", "for the most part", "more or less", "in essence", "in principle", 
        "broadly speaking", "loosely speaking", "generally speaking", "relatively speaking",
        
        # Other qualifying expressions
        "tends to", "inclines toward", "leans toward", "appears to be", "seems to be",
        "proportionally", "correspondingly", "accordingly", "consequently", "effectively",
    ]
}

# Context-dependent hedge words that require POS and dependency analysis
CONTEXT_DEPENDENT_HEDGE_WORDS = {
    "about": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod", "amod"]},  # "about 5 turbines"
            {"pos": ["ADV"], "dep": ["nummod"]},  # "about five"
        ],
        "non_hedge_contexts": [
            {"pos": ["ADP"], "dep": ["prep"]},  # "talk about something"
            {"pos": ["ADV"], "dep": ["advmod"], "head_pos": ["VERB"]},  # "bring about change"
        ],
        "hedge_type": "mild"
    },
    
    "around": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod", "amod"]},  # "around 10 million"
            {"pos": ["ADV"], "dep": ["nummod"]},  # "around five"
        ],
        "non_hedge_contexts": [
            {"pos": ["ADP"], "dep": ["prep"]},  # "around the building"
        ],
        "hedge_type": "mild"
    },
    
    "roughly": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "siblings_has_num": True},  # "roughly 50%"
            {"pos": ["ADV"], "dep": ["amod"]},  # "roughly equivalent"
        ],
        "non_hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_lemma": ["handle", "treat", "push"]},  # "handle roughly"
        ],
        "hedge_type": "mild"
    },
    
    "approximately": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod", "amod", "nummod"]},  # Always hedge when used as adverb
        ],
        "hedge_type": "mild"
    },
    
    "generally": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "position": "sentence_start"},  # "Generally, we see..."
            {"pos": ["ADV"], "dep": ["advmod"], "head_pos": ["VERB", "ADJ"]},  # "generally accepted"
        ],
        "non_hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_lemma": ["speak", "refer"]},  # "generally speaking"
        ],
        "hedge_type": "mild"
    },
    
    "typically": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"]},  # Usually hedge when modifying verbs/adjectives
        ],
        "hedge_type": "mild"
    },
    
    "usually": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"]},  # Usually hedge when modifying verbs/adjectives
        ],
        "hedge_type": "mild"
    },
    
    "normally": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"]},  # Usually hedge when modifying verbs/adjectives
        ],
        "hedge_type": "mild"
    },
    
    "largely": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_pos": ["VERB", "ADJ"]},  # "largely responsible"
        ],
        "non_hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_lemma": ["scale", "size"]},  # "largely scaled"
        ],
        "hedge_type": "mild"
    },
    
    "substantially": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_pos": ["VERB", "ADJ"]},  # "substantially higher"
        ],
        "hedge_type": "mild"
    },
    
    "considerably": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_pos": ["VERB", "ADJ"]},  # "considerably more"
        ],
        "hedge_type": "mild"
    },
    
    "significantly": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_pos": ["VERB", "ADJ"]},  # "significantly higher"
        ],
        "hedge_type": "mild"
    },
    
    "relatively": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_pos": ["ADJ", "ADV"]},  # "relatively small"
        ],
        "non_hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_lemma": ["relate", "compare"]},  # "relatively speaking"
        ],
        "hedge_type": "mild"
    },
    
    "practically": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_pos": ["ADJ", "ADV"]},  # "practically impossible"
            {"pos": ["ADV"], "dep": ["advmod"], "head_lemma": ["eliminate", "zero", "nothing"]},  # "practically zero"
        ],
        "non_hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_pos": ["VERB"]},  # "practically implement"
        ],
        "hedge_type": "mild"
    },
    
    "essentially": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_pos": ["ADJ", "VERB"]},  # "essentially the same"
        ],
        "hedge_type": "mild"
    },
    
    "basically": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_pos": ["ADJ", "VERB"]},  # "basically correct"
        ],
        "hedge_type": "mild"
    },
    
    "fundamentally": {
        "hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_pos": ["ADJ", "VERB"]},  # "fundamentally different"
        ],
        "non_hedge_contexts": [
            {"pos": ["ADV"], "dep": ["advmod"], "head_lemma": ["based", "rooted"]},  # "fundamentally based"
        ],
        "hedge_type": "mild"
    }
}

# Combined hedge word set for quick lookup (includes context-dependent words)
ALL_HEDGE_WORDS = set(HEDGE_WORDS["strong_hedge"] + HEDGE_WORDS["mild_hedge"] + list(CONTEXT_DEPENDENT_HEDGE_WORDS.keys()))

print(f"Hedge word dictionaries loaded:")
print(f"Strong hedge words: {len(HEDGE_WORDS['strong_hedge'])}")
print(f"Mild hedge words: {len(HEDGE_WORDS['mild_hedge'])}")
print(f"Context-dependent hedge words: {len(CONTEXT_DEPENDENT_HEDGE_WORDS)}")
print(f"Total hedge words: {len(ALL_HEDGE_WORDS)}")



In [None]:
def count_meaningful_words(doc):
    """
    Count meaningful words excluding stopwords, punctuation, and whitespace.
    Uses spaCy's built-in stopword detection.
    """
    meaningful_count = 0
    
    for token in doc:
        if (not token.is_punct and 
            not token.is_space and 
            not token.is_stop and
            len(token.text.strip()) > 0):
            meaningful_count += 1
    
    return meaningful_count

def get_token_lemma_lower(token):
    """Get lowercase lemma of token for consistent matching."""
    return token.lemma_.lower().strip()

def check_context_dependent_hedge(token):
    """
    Check if a context-dependent word is being used as a hedge word based on linguistic context.
    
    Args:
        token: spaCy token object
    
    Returns:
        tuple: (is_hedge, hedge_type) or (False, None)
    """
    lemma = token.lemma_.lower()
    
    if lemma not in CONTEXT_DEPENDENT_HEDGE_WORDS:
        return False, None
    
    word_config = CONTEXT_DEPENDENT_HEDGE_WORDS[lemma]
    hedge_type = word_config["hedge_type"]
    
    # Hedge contexts are tested first, so a non-hedge context only takes effect
    # when no hedge context matched
    for context in word_config.get("hedge_contexts", []):
        if matches_context(token, context):
            return True, hedge_type
    
    # Check non-hedge contexts (if matches, it's NOT a hedge)
    for context in word_config.get("non_hedge_contexts", []):
        if matches_context(token, context):
            return False, None
    
    # If no specific context matched, default to hedge (this over-counts rather than under-counts)
    return True, hedge_type

def matches_context(token, context):
    """
    Check if a token matches a specific linguistic context.
    
    Args:
        token: spaCy token object
        context: Dictionary with context criteria
    
    Returns:
        bool: True if token matches the context
    """
    # Check POS tag
    if "pos" in context:
        if token.pos_ not in context["pos"]:
            return False
    
    # Check dependency relation
    if "dep" in context:
        if token.dep_ not in context["dep"]:
            return False
    
    # Check head word POS
    if "head_pos" in context:
        if token.head.pos_ not in context["head_pos"]:
            return False
    
    # Check head word lemma
    if "head_lemma" in context:
        if token.head.lemma_.lower() not in context["head_lemma"]:
            return False
    
    # Check if siblings contain numbers (for approximation words)
    if context.get("siblings_has_num", False):
        has_number_sibling = False
        for child in token.head.children:
            if child.pos_ == "NUM" or child.like_num:
                has_number_sibling = True
                break
        if not has_number_sibling:
            return False
    
    # Check position in sentence
    if "position" in context:
        if context["position"] == "sentence_start":
            # Check if token is within first 3 tokens of sentence
            sent_start = token.sent.start
            if token.i - sent_start > 2:
                return False
    
    return True

def is_hedge_word(lemma, hedge_type=None, token=None):
    """
    Check if a lemma is a hedge word, considering context for ambiguous words.
    
    Args:
        lemma: lowercase lemma to check
        hedge_type: 'strong', 'mild', or None (for any type)
        token: spaCy token object (needed for context-dependent words)
    
    Returns:
        tuple: (is_hedge, hedge_type) or (False, None) if not a hedge word
    """
    # First check context-dependent words if token is provided
    if token is not None and lemma in CONTEXT_DEPENDENT_HEDGE_WORDS:
        return check_context_dependent_hedge(token)
    
    # Then check regular hedge words
    if hedge_type == "strong":
        return (True, "strong") if lemma in HEDGE_WORDS["strong_hedge"] else (False, None)
    elif hedge_type == "mild":
        return (True, "mild") if lemma in HEDGE_WORDS["mild_hedge"] else (False, None)
    else:
        # Check both types
        if lemma in HEDGE_WORDS["strong_hedge"]:
            return (True, "strong")
        elif lemma in HEDGE_WORDS["mild_hedge"]:
            return (True, "mild")
        else:
            return (False, None)
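
The rule evaluation above can be tried without loading a spaCy model. The sketch below uses a stand-in token class and a trimmed copy of the "about" entry; `StubToken`, `matches`, and `classify` are hypothetical names, and the POS and dependency labels on the stub tokens are assumed for illustration rather than produced by a parser.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubToken:
    """Minimal stand-in for a spaCy token (only the attributes the rules read)."""
    lemma_: str
    pos_: str
    dep_: str
    head: Optional["StubToken"] = None


RULES = {  # trimmed copy of the "about" entry
    "hedge_contexts": [{"pos": ["ADV"], "dep": ["advmod", "amod"]}],
    "non_hedge_contexts": [{"pos": ["ADP"], "dep": ["prep"]}],
}


def matches(token, context):
    """True if the token satisfies every criterion present in the context."""
    if "pos" in context and token.pos_ not in context["pos"]:
        return False
    if "dep" in context and token.dep_ not in context["dep"]:
        return False
    if "head_pos" in context and (token.head is None or token.head.pos_ not in context["head_pos"]):
        return False
    return True


def classify(token, rules):
    """Hedge contexts are consulted first, then non-hedge contexts, then the default."""
    if any(matches(token, c) for c in rules["hedge_contexts"]):
        return True
    if any(matches(token, c) for c in rules["non_hedge_contexts"]):
        return False
    return True  # default when nothing matches, as in the notebook


number = StubToken("5", "NUM", "nummod")
verb = StubToken("talk", "VERB", "ROOT")
about_as_hedge = StubToken("about", "ADV", "advmod", head=number)  # "about 5 turbines"
about_as_prep = StubToken("about", "ADP", "prep", head=verb)       # "talk about something"

print(classify(about_as_hedge, RULES))  # True
print(classify(about_as_prep, RULES))   # False
```

Because hedge contexts win whenever they match, a broad hedge context can shadow a narrower non-hedge context for the same word.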



In [None]:
def find_hedge_words_in_document(doc, document_name):
    """
    Find all hedge words in a document with their context and sentence information.
    Uses contextual analysis for ambiguous words.
    
    Returns:
        dict: Complete hedge word analysis for the document
    """
    hedge_analysis = {
        'document_name': document_name,
        'strong_hedge_words': [],
        'mild_hedge_words': [],
        'strong_count': 0,
        'mild_count': 0,
        'total_hedge_count': 0,
        'meaningful_words': 0,
        'sentences_with_hedges': [],
        'high_hedge_sentences': []
    }
    
    # Count meaningful words in document
    hedge_analysis['meaningful_words'] = count_meaningful_words(doc)
    
    # Track sentences containing hedge words
    sentence_hedge_counts = {}
    
    # Process each token for hedge words
    for token in doc:
        if token.is_punct or token.is_space or len(token.text.strip()) < 2:
            continue
            
        lemma = get_token_lemma_lower(token)
        
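        # Check the lemma first (with contextual analysis), then fall back to the surface form:
        # several dictionary entries are inflected ("appears", "seems") and never equal a lemma
        is_hedge, hedge_type = is_hedge_word(lemma, token=token)
        if not is_hedge and lemma not in CONTEXT_DEPENDENT_HEDGE_WORDS:
            is_hedge, hedge_type = is_hedge_word(token.lower_)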
        
        if is_hedge:
            # Get sentence context
            sent_text = token.sent.text.strip()
            sent_start = token.sent.start
            
            # Track hedge words per sentence
            if sent_start not in sentence_hedge_counts:
                sentence_hedge_counts[sent_start] = {
                    'text': sent_text, 
                    'count': 0, 
                    'hedge_words': []
                }
            sentence_hedge_counts[sent_start]['count'] += 1
            sentence_hedge_counts[sent_start]['hedge_words'].append(lemma)
            
            # Create hedge word information
            hedge_info = {
                'token': token.text,
                'lemma': lemma,
                'position': token.i,
                'pos': token.pos_,
                'dep': token.dep_,
                'sentence_start': sent_start,
                'sentence_text': sent_text[:200] + "..." if len(sent_text) > 200 else sent_text,
                'context_before': doc[max(0, token.i-3):token.i].text,
                'context_after': doc[token.i+1:min(len(doc), token.i+4)].text
            }
            
            # Categorize by hedge type
            if hedge_type == "strong":
                hedge_analysis['strong_hedge_words'].append(hedge_info)
                hedge_analysis['strong_count'] += 1
            else:
                hedge_analysis['mild_hedge_words'].append(hedge_info)
                hedge_analysis['mild_count'] += 1
    
    # Calculate total hedge count
    hedge_analysis['total_hedge_count'] = hedge_analysis['strong_count'] + hedge_analysis['mild_count']
    
    # Process sentences with hedge words
    hedge_analysis['sentences_with_hedges'] = [
        {
            'sentence': info['text'],
            'hedge_count': info['count'],
            'hedge_words': info['hedge_words']
        }
        for info in sentence_hedge_counts.values()
        if info['count'] > 0
    ]
    
    # Identify high-hedge sentences (3+ hedge words)
    hedge_analysis['high_hedge_sentences'] = [
        sentence_info for sentence_info in hedge_analysis['sentences_with_hedges']
        if sentence_info['hedge_count'] >= 3
    ]
    
    return hedge_analysis
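
The high-hedge sentence rule (three or more hedge words in one sentence) can be illustrated with a simplified, dependency-free sketch. It assumes a naive regex sentence splitter and a small subset of the strong hedge list; `high_hedge_sentences` is a hypothetical helper, not the notebook's function.

```python
import re

HEDGES = {"may", "could", "might", "possibly", "potentially"}  # subset of the strong list


def high_hedge_sentences(text, threshold=3):
    """Return (sentence, hedge_count) pairs for sentences at or above the threshold."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())  # naive sentence splitter
    flagged = []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        count = sum(1 for word in words if word in HEDGES)
        if count >= threshold:
            flagged.append((sentence, count))
    return flagged


text = ("Output may possibly increase and could potentially double. "
        "We will install 40 turbines in 2025.")
flagged = high_hedge_sentences(text)
print(flagged)
```

Only the first sentence is flagged here, since it contains four hedge words and the second contains none.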



In [None]:
def calculate_hedge_densities(hedge_analysis):
    """
    Calculate hedge word density metrics and ratios.
    
    Returns:
        dict: Density calculations as percentages
    """
    meaningful_words = hedge_analysis['meaningful_words']
    strong_count = hedge_analysis['strong_count']
    mild_count = hedge_analysis['mild_count']
    total_hedge_count = hedge_analysis['total_hedge_count']
    
    # Handle zero meaningful words
    if meaningful_words == 0:
        return {
            'total_hedge_density': 0.0,
            'strong_hedge_density': 0.0,
            'mild_hedge_density': 0.0,
            'strong_vs_mild_ratio': 0.0,
            'hedge_intensity_score': 0.0,
            'sentences_with_hedges_pct': 0.0
        }
    
    # Calculate basic densities (as percentages)
    total_density = (total_hedge_count / meaningful_words) * 100
    strong_density = (strong_count / meaningful_words) * 100
    mild_density = (mild_count / meaningful_words) * 100
    
    # Calculate strong vs mild ratio
    if mild_count > 0:
        strong_vs_mild_ratio = strong_count / mild_count
    else:
        strong_vs_mild_ratio = float('inf') if strong_count > 0 else 0.0
    
    # Calculate weighted hedge intensity score (strong hedges weighted 1.5x)
    hedge_intensity_score = (((strong_count * 1.5) + (mild_count * 1)) / meaningful_words) * 100
    
    # Estimate percentage of sentences with hedges. The analysis does not store the true
    # sentence count, so approximate it as one sentence per 15 meaningful words.
    hedged_sentences = len(hedge_analysis['sentences_with_hedges'])
    estimated_total_sentences = max(hedged_sentences, meaningful_words // 15, 1)  # never zero
    sentences_with_hedges_pct = (hedged_sentences / estimated_total_sentences) * 100
    
    return {
        'total_hedge_density': round(total_density, 4),
        'strong_hedge_density': round(strong_density, 4),
        'mild_hedge_density': round(mild_density, 4),
        'strong_vs_mild_ratio': round(strong_vs_mild_ratio, 4) if strong_vs_mild_ratio != float('inf') else strong_vs_mild_ratio,
        'hedge_intensity_score': round(hedge_intensity_score, 4),
        'sentences_with_hedges_pct': round(sentences_with_hedges_pct, 2)
    }



In [None]:
def analyze_hedge_words_all_documents(documents):
    """
    Analyze hedge words across all documents and calculate density statistics.
    
    Args:
        documents: Dictionary of {doc_name: spacy_doc}
    
    Returns:
        dict: Complete hedge word analysis results
    """
    all_results = {}
    density_stats = {
        'total_documents': len(documents),
        'documents_with_high_hedging': [],
        'documents_with_low_hedging': []
    }
    
    # Lists to collect density values for statistics
    total_densities = []
    strong_densities = []
    mild_densities = []
    
    # Process each document
    for doc_name, doc in documents.items():
        # Analyze hedge words in document
        hedge_analysis = find_hedge_words_in_document(doc, doc_name)
        
        # Calculate densities
        densities = calculate_hedge_densities(hedge_analysis)
        
        # Combine results
        document_result = {**hedge_analysis, **densities}
        all_results[doc_name] = document_result
        
        # Collect density values for statistics
        total_densities.append(densities['total_hedge_density'])
        strong_densities.append(densities['strong_hedge_density'])
        mild_densities.append(densities['mild_hedge_density'])
        
        # Categorize documents by hedging level
        total_density = densities['total_hedge_density']
        if total_density > 1.0:
            density_stats['documents_with_high_hedging'].append((doc_name, total_density))
        elif total_density < 0.5:
            density_stats['documents_with_low_hedging'].append((doc_name, total_density))
    
    # Calculate density statistics across all documents
    if total_densities:
        density_stats.update({
            # Average densities
            'average_total_hedge_density': round(sum(total_densities) / len(total_densities), 4),
            'average_strong_hedge_density': round(sum(strong_densities) / len(strong_densities), 4),
            'average_mild_hedge_density': round(sum(mild_densities) / len(mild_densities), 4),
            
            # Min/Max densities
            'min_total_hedge_density': round(min(total_densities), 4),
            'max_total_hedge_density': round(max(total_densities), 4),
            'min_strong_hedge_density': round(min(strong_densities), 4),
            'max_strong_hedge_density': round(max(strong_densities), 4),
            'min_mild_hedge_density': round(min(mild_densities), 4),
            'max_mild_hedge_density': round(max(mild_densities), 4),
            
            # Density ranges
            'range_total_hedge_density': round(max(total_densities) - min(total_densities), 4),
            'range_strong_hedge_density': round(max(strong_densities) - min(strong_densities), 4),
            'range_mild_hedge_density': round(max(mild_densities) - min(mild_densities), 4)
        })
    
    # Sort documents by hedging level
    density_stats['documents_with_high_hedging'].sort(key=lambda x: x[1], reverse=True)
    density_stats['documents_with_low_hedging'].sort(key=lambda x: x[1])
    
    return {
        'document_results': all_results,
        'density_statistics': density_stats
    }



In [None]:
def create_hedge_summary_table(analysis_results):
    """Create pandas DataFrame summarizing hedge analysis results."""
    document_results = analysis_results['document_results']
    
    summary_data = []
    for doc_name, results in document_results.items():
        summary_data.append({
            'Document': doc_name,
            'Meaningful Words': results['meaningful_words'],
            'Total Hedge Words': results['total_hedge_count'],
            'Strong Hedge Words': results['strong_count'],
            'Mild Hedge Words': results['mild_count'],
            'Total Hedge Density (%)': results['total_hedge_density'],
            'Strong Hedge Density (%)': results['strong_hedge_density'],
            'Mild Hedge Density (%)': results['mild_hedge_density'],
            'Hedge Intensity Score': results['hedge_intensity_score'],
        })
    
    df = pd.DataFrame(summary_data)
    return df #.sort_values('Total Hedge Density (%)', ascending=False)

def display_hedge_analysis_results(analysis_results):
    """Display comprehensive hedge analysis results."""
    density_stats = analysis_results['density_statistics']
    
    print("HEDGE WORD ANALYSIS RESULTS")
    print("=" * 60)
    
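    # Basic statistics
    print(f"\nBASIC STATISTICS:")
    print(f"Total Documents: {density_stats['total_documents']}")
    
    # Aggregate statistics are only computed when at least one document was analyzed
    if density_stats['total_documents'] == 0:
        print("No documents analyzed; skipping density statistics.")
        return create_hedge_summary_table(analysis_results)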
    
    # Density statistics across all documents
    print(f"\nDENSITY STATISTICS ACROSS ALL DOCUMENTS:")
    print(f"Average Total Hedge Density: {density_stats['average_total_hedge_density']:.4f}%")
    print(f"Average Strong Hedge Density: {density_stats['average_strong_hedge_density']:.4f}%")
    print(f"Average Mild Hedge Density: {density_stats['average_mild_hedge_density']:.4f}%")
    
    print(f"\nDENSITY RANGES:")
    print(f"Total Hedge Density Range: {density_stats['min_total_hedge_density']:.4f}% - {density_stats['max_total_hedge_density']:.4f}% (range: {density_stats['range_total_hedge_density']:.4f}%)")
    print(f"Strong Hedge Density Range: {density_stats['min_strong_hedge_density']:.4f}% - {density_stats['max_strong_hedge_density']:.4f}% (range: {density_stats['range_strong_hedge_density']:.4f}%)")
    print(f"Mild Hedge Density Range: {density_stats['min_mild_hedge_density']:.4f}% - {density_stats['max_mild_hedge_density']:.4f}% (range: {density_stats['range_mild_hedge_density']:.4f}%)")
    
    # High and low hedging documents
    if density_stats['documents_with_high_hedging']:
        print(f"\nHIGH HEDGING DOCUMENTS (>1% density):")
        for doc_name, density in density_stats['documents_with_high_hedging'][:5]:
            print(f"  {doc_name}: {density:.4f}%")
    
    if density_stats['documents_with_low_hedging']:
        print(f"\nLOW HEDGING DOCUMENTS (<0.5% density):")
        for doc_name, density in density_stats['documents_with_low_hedging'][:5]:
            print(f"  {doc_name}: {density:.4f}%")
    
    # Summary table
    print(f"\nDOCUMENT SUMMARY TABLE:")
    summary_df = create_hedge_summary_table(analysis_results)
    print(summary_df.to_string(index=False, float_format='%.4f'))
    
    return summary_df

def show_hedge_examples(analysis_results, document_name, max_examples=5):
    """Show specific hedge word examples from a document with contextual information."""
    if document_name not in analysis_results['document_results']:
        print(f"Document '{document_name}' not found in results.")
        return
    
    results = analysis_results['document_results'][document_name]
    
    print(f"\nHEDGE WORD EXAMPLES FROM: {document_name}")
    print("=" * 50)
    
    # Strong hedge examples
    if results['strong_hedge_words']:
        print(f"\nSTRONG HEDGE WORDS ({len(results['strong_hedge_words'])} total):")
        for i, hedge in enumerate(results['strong_hedge_words'][:max_examples]):
            context_info = ""
            if hedge['lemma'] in CONTEXT_DEPENDENT_HEDGE_WORDS:
                context_info = f" [POS: {hedge['pos']}, DEP: {hedge['dep']}]"
            print(f"{i+1}. '{hedge['token']}' (lemma: {hedge['lemma']}){context_info}")
            print(f"   Context: ...{hedge['context_before']} [{hedge['token']}] {hedge['context_after']}...")
            print()
    
    # Mild hedge examples  
    if results['mild_hedge_words']:
        print(f"\nMILD HEDGE WORDS ({len(results['mild_hedge_words'])} total):")
        for i, hedge in enumerate(results['mild_hedge_words'][:max_examples]):
            context_info = ""
            if hedge['lemma'] in CONTEXT_DEPENDENT_HEDGE_WORDS:
                context_info = f" [POS: {hedge['pos']}, DEP: {hedge['dep']}]"
            print(f"{i+1}. '{hedge['token']}' (lemma: {hedge['lemma']}){context_info}")
            print(f"   Context: ...{hedge['context_before']} [{hedge['token']}] {hedge['context_after']}...")
            print()
    
    # High-hedge sentences
    if results['high_hedge_sentences']:
        print(f"\nHIGH-HEDGE SENTENCES ({len(results['high_hedge_sentences'])} total):")
        for i, sent in enumerate(results['high_hedge_sentences'][:3]):
            print(f"{i+1}. Hedge words ({sent['hedge_count']}): {', '.join(sent['hedge_words'])}")
            print(f"   Sentence: {sent['sentence'][:300]}{'...' if len(sent['sentence']) > 300 else ''}")
            print()

def find_context_dependent_examples(doc, max_true=2, max_false=2):
    """
    Find examples of context-dependent hedge words - both true hedges and false hedges.
    
    Returns:
        dict: {'true_hedges': [...], 'false_hedges': [...]}
    """
    true_examples = []
    false_examples = []
    
    for token in doc:
        if token.is_punct or token.is_space or len(token.text.strip()) < 2:
            continue
            
        lemma = get_token_lemma_lower(token)
        
        # Check if it's a context-dependent word
        if lemma in CONTEXT_DEPENDENT_HEDGE_WORDS:
            is_hedge, hedge_type = is_hedge_word(lemma, token=token)
            
            # Create example info (long sentences are truncated to 150 characters)
            sentence_text = token.sent.text.strip()
            example_info = {
                'token': token.text,
                'lemma': lemma,
                'pos': token.pos_,
                'dep': token.dep_,
                'context': (sentence_text[:150] + "...") if len(sentence_text) > 150 else sentence_text
            }
            
            if is_hedge and len(true_examples) < max_true:
                true_examples.append(example_info)
            elif not is_hedge and len(false_examples) < max_false:
                false_examples.append(example_info)
            
            # Stop if we have enough examples
            if len(true_examples) >= max_true and len(false_examples) >= max_false:
                break
    
    return {'true_hedges': true_examples, 'false_hedges': false_examples}

def show_context_dependent_examples(doc, document_name):
    """Show context-dependent hedge word examples with brief definition."""
    examples = find_context_dependent_examples(doc)
    
    if not examples['true_hedges'] and not examples['false_hedges']:
        return
    
    print(f"\nCONTEXT-DEPENDENT HEDGE WORDS:")
    print("Context-dependent words can be hedges or not depending on grammatical usage.")
    
    # True hedge examples
    if examples['true_hedges']:
        print(f"\nWords functioning as HEDGES in context:")
        for i, ex in enumerate(examples['true_hedges'], 1):
            print(f"{i}. '{ex['token']}' (POS: {ex['pos']}, DEP: {ex['dep']})")
            print(f"   Context: {ex['context']}")
            print()
    
    # False hedge examples  
    if examples['false_hedges']:
        print(f"Words NOT functioning as hedges in context:")
        for i, ex in enumerate(examples['false_hedges'], 1):
            print(f"{i}. '{ex['token']}' (POS: {ex['pos']}, DEP: {ex['dep']})")
            print(f"   Context: {ex['context']}")
            print()


In [None]:
# Run the hedge word analysis
print("Starting hedge word analysis...")
hedge_results = analyze_hedge_words_all_documents(documents)

# Show examples from most hedged document
if hedge_results:
    # Get the document with highest hedge density from all documents
    document_results = hedge_results['document_results']
    if document_results:
        # Find document with highest total hedge density
        most_hedged_doc = max(document_results.items(), key=lambda x: x[1]['total_hedge_density'])[0]
        
        # Show hedge examples from the most hedged document
        print(f"\nHEDGE LANGUAGE EXAMPLES FROM: {most_hedged_doc}")
        print("=" * 50)
        show_hedge_examples(hedge_results, most_hedged_doc, max_examples=10)
        
        # Add context-dependent examples if the document is available
        # (`documents` is already required by the analysis call above)
        if most_hedged_doc in documents:
            show_context_dependent_examples(documents[most_hedged_doc], most_hedged_doc)

# Display results
summary_table = display_hedge_analysis_results(hedge_results)

print(f"\nHedge word analysis complete.")



In [None]:
# Show examples from most and least hedged documents
density_stats = hedge_results['density_statistics']
if density_stats['documents_with_high_hedging']:
    most_hedged_doc = density_stats['documents_with_high_hedging'][0][0]
    print(f"\nEXAMPLES FROM MOST HEDGED DOCUMENT")
    print("=" * 50)
    show_hedge_examples(hedge_results, most_hedged_doc, max_examples=3)
    
    # Add context-dependent examples
    if most_hedged_doc in documents:
        show_context_dependent_examples(documents[most_hedged_doc], most_hedged_doc)

if density_stats['documents_with_low_hedging']:
    least_hedged_doc = density_stats['documents_with_low_hedging'][0][0]
    print(f"\nEXAMPLES FROM LEAST HEDGED DOCUMENT")
    print("=" * 50)
    show_hedge_examples(hedge_results, least_hedged_doc, max_examples=3)
    
    # Add context-dependent examples
    if least_hedged_doc in documents:
        show_context_dependent_examples(documents[least_hedged_doc], least_hedged_doc)

# Document rankings by different metrics
print(f"\nDOCUMENT RANKINGS BY HEDGE METRICS")
print("=" * 50)

# Top 5 most hedged documents
print("\nTOP 5 MOST HEDGED DOCUMENTS (by total density):")
top_hedged = summary_table.nlargest(5, 'Total Hedge Density (%)')
for idx, (_, row) in enumerate(top_hedged.iterrows(), 1):
    print(f"{idx}. {row['Document']}: {row['Total Hedge Density (%)']:.4f}%")

# Top 5 strongest hedging documents
print("\nTOP 5 DOCUMENTS WITH STRONGEST HEDGING:")
top_strong = summary_table.nlargest(5, 'Strong Hedge Density (%)')
for idx, (_, row) in enumerate(top_strong.iterrows(), 1):
    print(f"{idx}. {row['Document']}: {row['Strong Hedge Density (%)']:.4f}%")

# Top 5 by hedge intensity score
print("\nTOP 5 DOCUMENTS BY HEDGE INTENSITY SCORE:")
top_intensity = summary_table.nlargest(5, 'Hedge Intensity Score')
for idx, (_, row) in enumerate(top_intensity.iterrows(), 1):
    print(f"{idx}. {row['Document']}: {row['Hedge Intensity Score']:.4f}")

print(f"\nComplete hedge word analysis finished.")
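
The last ranking above orders documents by the hedge intensity score, which weights strong hedges 1.5x and mild hedges 1.0x per the formula in the overview. The sketch below works that formula through on made-up counts; the function name `hedge_intensity_score`, the document names, and the numbers are illustrative assumptions, not values from the notebook.

```python
def hedge_intensity_score(strong_count, mild_count, meaningful_words):
    """Weighted hedge intensity per 100 meaningful words (strong x1.5, mild x1.0)."""
    if meaningful_words == 0:
        return 0.0  # avoid division by zero for empty documents
    weighted = strong_count * 1.5 + mild_count * 1.0
    return round(weighted / meaningful_words * 100, 4)

# Hypothetical documents: (strong hedges, mild hedges, meaningful words)
sample_counts = {
    "report_a": (12, 20, 2000),
    "report_b": (30, 5, 2500),
    "report_c": (4, 4, 400),
}

scores = {name: hedge_intensity_score(*counts) for name, counts in sample_counts.items()}

# Rank from highest to lowest intensity
ranking = sorted(scores, key=scores.get, reverse=True)
for rank, name in enumerate(ranking, 1):
    print(f"{rank}. {name}: {scores[name]:.4f}")
```

Because the score is normalised by meaningful words, a short document with few hedges (`report_c`) can outrank a longer one with many more hedges in absolute terms.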

## Vague Language

In [None]:
# Strong Vague Language Categories - high ambiguity terms
STRONG_VAGUE_LANGUAGE = {
    'temporal_vagueness': [
        'soon', 'eventually', 'in the future', 'ongoing', 'continuously', 
        'progressively', 'increasingly', 'gradually', 'over time', 'long-term',
        'short-term', 'medium-term', 'ultimately', 'periodically', 'shortly',
        'presently', 'currently', 'lately', 'recently', 'frequently'
    ],
    'scope_ambiguity': [
        'various', 'multiple', 'several', 'numerous', 'many', 'some', 
        'certain', 'particular', 'diverse', 'wide range', 'broad spectrum',
        'variety of', 'range of', 'selection of', 'array of', 'number of'
    ],
    'commitment_vagueness': [
        # Intent and aspiration terms
        'working towards', 'striving for', 'aiming to', 'seeking to',
        'endeavoring', 'exploring', 'investigating', 'considering',
        'evaluating', 'looking into', 'planning to', 'intending to',
        'committed to', 'dedicated to', 'focused on', 'pursuing',
        'attempting', 'trying to', 'hoping to',

        # Vague action-oriented commitments
        'working on', 'developing', 'implementing', 'establishing',
        'advancing', 'driving', 'promoting', 'facilitating',
        'supporting', 'enabling', 'fostering', 'encouraging',

        # Vague progress-oriented terms
        'moving towards', 'progressing towards', 'heading towards',
        'improving', 'enhancing', 'strengthening', 'optimizing',
        'transforming',

        # Preparation-related commitments
        'preparing', 'preparing for', 'getting ready', 'setting up',

        # Future-oriented vague terms
        'will work on', 'will develop', 'will implement', 'will establish',
        'will improve', 'will enhance', 'will strengthen',
        'continue to', 'ongoing efforts', 'future plans', 'next steps'
    ],
    'degree_vagueness': [
        'significant', 'substantial', 'considerable', 'meaningful', 
        'notable', 'impressive', 'major', 'minor', 'moderate', 
        'extensive', 'comprehensive', 'robust', 'strong', 'weak',
        'dramatic', 'marked', 'pronounced', 'modest', 'limited'
    ]
    
}

# Mild Vague Language Categories - moderate ambiguity terms
MILD_VAGUE_LANGUAGE = {
    'relative_terms': [
        'better', 'improved', 'enhanced', 'upgraded', 'advanced', 
        'superior', 'increased', 'reduced', 'higher', 'lower',
        'greater', 'lesser', 'faster', 'slower', 'more', 'less',
        'newer', 'older', 'larger', 'smaller', 'wider', 'narrower'
    ],
    'general_descriptors': [
        'appropriate', 'suitable', 'relevant', 'effective', 'efficient', 
        'optimal', 'adequate', 'proper', 'reasonable', 'acceptable',
        'satisfactory', 'desirable', 'favorable', 'beneficial',
        'valuable', 'useful', 'practical', 'viable', 'feasible'
    ],
    'process_vagueness': [
        'initiatives', 'measures', 'efforts', 'activities', 'actions', 
        'approaches', 'solutions', 'methods', 'techniques', 'procedures', 
        'processes', 'operations', 'practices', 'mechanisms', 'frameworks', 'systems'
    ],
    'outcome_vagueness': [
        'positive impact', 'improvement', 'enhancement', 'optimization', 
        'advancement', 'progress', 'benefits', 'success',
        'achievement', 'development', 'growth', 'innovation',
        'transformation', 'breakthrough', 'gains'
    ]
}

# Context-dependent words that can be vague or specific based on quantification
CONTEXT_DEPENDENT_WORDS = [
    'significant', 'substantial', 'better', 'improved', 'reduced', 
    'increased', 'enhanced', 'advanced', 'superior', 'effective',
    'efficient', 'optimal', 'major', 'minor', 'considerable',
    'meaningful', 'notable', 'impressive', 'extensive', 'comprehensive',
    # Verb forms
    'improve', 'improving', 'improves', 'develop', 'developing', 'develops',
    'enhance', 'enhancing', 'enhances', 'advance', 'advancing', 'advances',
    'reduce', 'reducing', 'reduces', 'increase', 'increasing', 'increases',
    'optimize', 'optimizing', 'optimizes', 'strengthen', 'strengthening', 'strengthens',
    'expand', 'expanding', 'expands', 'grow', 'growing', 'grows',
    'transform', 'transforming', 'transforms', 'upgrade', 'upgrading', 'upgrades'
]

# Patterns indicating quantified context (making words NOT vague)
QUANTIFIED_PATTERNS = [
    r'\b\d+(\.\d+)?%',  # percentages
    r'\b\d+(\.\d+)?\s*(tonnes?|kg|g|tons?|pounds?|lbs?)',  # weights
    r'\b\d+(\.\d+)?\s*(million|billion|thousand|k)\b',  # large numbers
    # r'\b\d{4}\b',  # years
    r'\bp\s*<\s*0\.\d+',  # p-values
    r'\b\d+(\.\d+)?\s*(times?|fold)\b',  # multiples
    r'\b\d+(\.\d+)?\s*-\s*\d+(\.\d+)?%',  # ranges
    r'\b\d+(\.\d+)?\s*(dollars?|\$|euros?|€)',  # monetary amounts
]

# Comparative context patterns indicating specific rather than vague usage
COMPARATIVE_PATTERNS = [
    # Original patterns
    r'\bcompared to\b', r'\bvs\.?\b', r'\bversus\b', r'\bthan\s+\d{4}\b',
    r'\bfrom\s+\d+.*to\s+\d+', r'\bbaseline\b', r'\bprevious\s+year\b',
    r'\blast\s+year\b', r'\bprior\s+to\b', r'\bagainst\s+\d{4}\b',
    
    # Time-based comparisons (safe patterns)
    r'\bthan\s+(last|previous|prior)\s+(year|quarter|month|period)\b',
    r'\bfrom\s+(last|previous|prior)\s+(year|quarter|month|period)\b',
    r'\bover\s+the\s+(last|previous|prior)\s+\d+\s+(years?|months?|quarters?)\b',
    r'\byear[\-\s]over[\-\s]year\b', r'\bmonth[\-\s]over[\-\s]month\b',
    r'\bquarter[\-\s]over[\-\s]quarter\b', 
    r'\bsince\s+\d{4}\b', r'\bfrom\s+\d{4}\s+to\s+\d{4}\b',
    r'\bthan\s+in\s+\d{4}\b', r'\bcompared\s+with\s+\d{4}\b',
    
    # Benchmarking and targets (specific patterns)
    r'\bvs\.?\s+(target|goal|objective|benchmark)\b',
    r'\bagainst\s+(target|goal|objective|benchmark|plan)\b',
    r'\bcompared\s+to\s+(target|goal|objective|benchmark)\b',
    r'\brelative\s+to\s+(target|goal|objective|benchmark)\b',
    r'\bversus\s+(target|goal|objective|plan)\b',
    
    # Industry and peer comparisons (safe and specific)
    r'\bvs\.?\s+(industry|sector|market|peers?|competitors?)\b',
    r'\bcompared\s+(to|with)\s+(industry|sector|market|peers?|competitors?)\b',
    r'\bagainst\s+(industry|sector|market|peers?|competitors?)\b',
    r'\brelative\s+to\s+(industry|sector|market|peers?|competitors?)\b',
    r'\bversus\s+(industry|sector|market|peers?|competitors?)\b',
    r'\bbenchmarked\s+against\b', r'\bin\s+comparison\s+(to|with)\b',
    
    # Performance comparisons (specific to avoid false matches)
    r'\boutperform\w*\b', r'\bunderperform\w*\b',
    
    # Percentage and ratio comparisons (very specific)
    r'\b\d+(\.\d+)?%\s+(higher|lower|above|below)\b',
    r'\b(up|down|increased?|decreased?)\s+by\s+\d+(\.\d+)?%\b',
    r'\b\d+(\.\d+)?\s*x\s+(higher|lower|more|less)\b',
    
    # Trend indicators (safe patterns with "from")
    r'\bimproved?\s+from\b', r'\bdeclined?\s+from\b', r'\brose\s+from\b',
    r'\bfell\s+from\b', r'\bgrew\s+from\b', r'\bdropped\s+from\b',
    
    # Historical comparisons (specific context only)
    r'\bsince\s+(inception|launch|start)\b',
    r'\bhistorical\s+(average|level|performance)\b',
    r'\bbase\s+(year|period|level)\b', r'\binitial\s+(level|target)\b'
]

# Create flattened lists for easier processing
ALL_STRONG_VAGUE = []
for category, words in STRONG_VAGUE_LANGUAGE.items():
    ALL_STRONG_VAGUE.extend(words)

ALL_MILD_VAGUE = []
for category, words in MILD_VAGUE_LANGUAGE.items():
    ALL_MILD_VAGUE.extend(words)

ALL_VAGUE_WORDS = set(ALL_STRONG_VAGUE + ALL_MILD_VAGUE + CONTEXT_DEPENDENT_WORDS)

print("Vague Language Dictionaries Loaded:")
print(f"Strong vague words: {len(ALL_STRONG_VAGUE)}")
print(f"Mild vague words: {len(ALL_MILD_VAGUE)}")
print(f"Context-dependent words: {len(CONTEXT_DEPENDENT_WORDS)}")
print(f"Total unique vague words: {len(ALL_VAGUE_WORDS)}")
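
Several dictionary entries are multi-word phrases (for example 'working towards' or 'range of'). A lookup that compares one token at a time against the dictionary would not match such phrases on its own. One possible way to count them is sketched here with plain regular expressions on raw text; the helper name `count_phrase_matches`, the shortened phrase list, and the sample sentence are assumptions for illustration and are not part of the notebook.

```python
import re

def count_phrase_matches(text, phrases):
    """Count case-insensitive whole-phrase occurrences of multi-word entries."""
    counts = {}
    for phrase in phrases:
        if " " not in phrase:
            continue  # single words are left to the token-level lookup
        # Allow any run of whitespace between the words of the phrase
        pattern = r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b"
        hits = len(re.findall(pattern, text, re.IGNORECASE))
        if hits:
            counts[phrase] = hits
    return counts

# Hypothetical inputs for demonstration only
sample_phrases = ["working towards", "range of", "exploring", "over time"]
sample_text = (
    "We are working towards a range of goals and, over time, "
    "working  towards lower emissions."
)
phrase_counts = count_phrase_matches(sample_text, sample_phrases)
print(phrase_counts)
```

Overlapping entries such as 'preparing' and 'preparing for' would each be counted under this approach, so a real integration would need a rule for which match wins.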



In [None]:
def has_quantified_context(target_token, token_window=5):
    """Check if token has quantified context within specified token window in the same sentence."""
    
    # Clamp the token window to the boundaries of the containing sentence
    sent = target_token.sent
    start_idx = max(sent.start, target_token.i - token_window)
    end_idx = min(sent.end, target_token.i + token_window + 1)
    context_span = target_token.doc[start_idx:end_idx]
    
    # Use the span's own text so the original spacing is preserved;
    # joining tokens with spaces can turn "20%" into "20 %", which the patterns miss
    context_text = context_span.text
    
    # Check for quantified patterns in the context
    for pattern in QUANTIFIED_PATTERNS:
        if re.search(pattern, context_text, re.IGNORECASE):
            return True
    
    # Check for NUM tokens (excluding years 1990-2050)
    for token in context_span:
        if token.pos_ == "NUM":
            try:
                if 1990 <= int(token.text) <= 2050:
                    continue  # Skip years
            except ValueError:
                pass  # Not a simple integer (e.g. "3.5" or "three")
            
            # Any other numeral counts as quantification
            return True
    
    return False

def has_comparative_context(target_token, token_window=5):
    """Check if token has comparative context within specified token window in the same sentence."""
    
    # Clamp the token window to the boundaries of the containing sentence
    sent = target_token.sent
    start_idx = max(sent.start, target_token.i - token_window)
    end_idx = min(sent.end, target_token.i + token_window + 1)
    
    # Use the span's own text so the original spacing is preserved; joining tokens
    # with spaces can break patterns such as "year-over-year" or "20% higher"
    context_text = target_token.doc[start_idx:end_idx].text
    
    # Check for comparative patterns in the context
    for pattern in COMPARATIVE_PATTERNS:
        if re.search(pattern, context_text, re.IGNORECASE):
            return True
    
    return False

def classify_word_vagueness_token(token):
    """
    Classify word's vagueness level based on context using token object.
    Returns: 'strong_vague', 'mild_vague', 'context_quantified', 'context_compared', or 'not_vague'
    """
    word = token.text
    word_lower = word.lower()
    
    # Check if context-dependent word
    if word_lower in CONTEXT_DEPENDENT_WORDS:
        # Check for quantified or comparative context using token-based functions
        has_quant = has_quantified_context(token)
        has_comp = has_comparative_context(token)
        
        if has_quant:
            return 'context_quantified'  # Quantified takes priority when both apply
        elif has_comp:
            return 'context_compared'
        else:
            # Without context, classify as vague
            if word_lower in ALL_STRONG_VAGUE:
                return 'strong_vague'
            else:
                return 'mild_vague'
    
    # Check if degree_vagueness word used as VERB - don't count as vague
    if word_lower in STRONG_VAGUE_LANGUAGE['degree_vagueness'] and token.pos_ == "VERB":
        return 'not_vague'

    # Check regular vague categories
    if word_lower in ALL_STRONG_VAGUE:
        return 'strong_vague'
    elif word_lower in ALL_MILD_VAGUE:
        return 'mild_vague'
    else:
        return 'not_vague'
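
The classification above treats a context-dependent word as specific when a number or a comparison appears within a few tokens of it, with quantification taking priority. The simplified sketch below illustrates that idea on whitespace-split tokens rather than spaCy tokens; `classify_in_window`, the reduced pattern lists, and the sample sentences are assumptions made for illustration only.

```python
import re

# Reduced, illustrative subsets of the notebook's pattern lists
QUANT_PATTERNS = [r"\b\d+(\.\d+)?%", r"\b\d+(\.\d+)?\s*(million|billion|thousand)\b"]
COMPARE_PATTERNS = [r"\bcompared to\b", r"\bversus\b", r"\blast\s+year\b"]

def classify_in_window(sentence, target, window=5):
    """Label `target` by what appears within `window` whitespace tokens of it."""
    tokens = sentence.split()
    lowered = [t.lower().strip(".,;:") for t in tokens]
    if target not in lowered:
        return "not_found"
    idx = lowered.index(target)
    context = " ".join(tokens[max(0, idx - window): idx + window + 1])
    # Quantified context takes priority over comparative context
    if any(re.search(p, context, re.IGNORECASE) for p in QUANT_PATTERNS):
        return "context_quantified"
    if any(re.search(p, context, re.IGNORECASE) for p in COMPARE_PATTERNS):
        return "context_compared"
    return "vague"

label_quant = classify_in_window("Emissions were reduced by 20% in our plants.", "reduced")
label_comp = classify_in_window("Costs were reduced compared to the baseline.", "reduced")
label_vague = classify_in_window("We have reduced our footprint.", "reduced")
print(label_quant, label_comp, label_vague)
```

The same word, "reduced", lands in three different classes depending only on its neighbours, which is why the context-dependent list is evaluated before the plain strong/mild lookup.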


In [None]:
def find_vague_language_in_document(doc, document_name):
    """
    Find all vague language instances in a document using spaCy processing.
    Similar structure to find_hedge_words_in_document but for vague language.
    """
    vague_analysis = {
        'document_name': document_name,
        'strong_vague_words': [],
        'mild_vague_words': [],
        'context_specific_words': [],
        'quantified_context_words': [],
        'compared_context_words': [],
        'quantified_context_examples': [],
        'compared_context_examples': [],
        'vague_context_examples': [],
        'total_context_dependent_found': 0,
        'strong_count': 0,
        'mild_count': 0,
        'context_specific_count': 0,
        'quantified_context_count': 0,
        'compared_context_count': 0,  
        'total_vague_count': 0,
        'meaningful_words': 0,
        'vague_word_contexts': [],
        'category_counts': defaultdict(int)
    }
    
    # Reuse existing function for meaningful word count
    vague_analysis['meaningful_words'] = count_meaningful_words(doc)
    
    # Get document text for context analysis
    doc_text = doc.text
    
    # Process each token for vague language
    for token in doc:
        if token.is_punct or token.is_space or len(token.text.strip()) < 2:
            continue
        
        token_text = token.text
        word_position = token.idx
        
        # Check if word is potentially vague
        if token_text.lower() in ALL_VAGUE_WORDS:
            
            # Track if this is a context-dependent word (regardless of final classification)
            if token_text.lower() in CONTEXT_DEPENDENT_WORDS:
                vague_analysis['total_context_dependent_found'] += 1
            
            vagueness_type = classify_word_vagueness_token(token)
            
            # Create vague word information
            if vagueness_type != 'not_vague':
                # Create 10-token window context (within same sentence)
                token_window = 10
                sent = token.sent
                start_idx = max(sent.start, token.i - token_window)
                end_idx = min(sent.end, token.i + token_window + 1)
                
                # Span text keeps the original spacing (e.g. "20%", "end-2025"),
                # which regex checks on this context depend on
                context = doc[start_idx:end_idx].text
                
                vague_info = {
                    'word': token_text,
                    'lemma': token.lemma_.lower(),
                    'type': vagueness_type,
                    'position': word_position,
                    'context': context,
                    'sentence_text': token.sent.text.strip()[:200] + "..." if len(token.sent.text) > 200 else token.sent.text.strip()
                }
                
                # Categorize by vagueness type
                if vagueness_type == 'strong_vague':
                    # Only count truly vague words in total counts
                    vague_analysis['vague_word_contexts'].append(vague_info)
                    vague_analysis['total_vague_count'] += 1
                    vague_analysis['strong_vague_words'].append(vague_info)
                    vague_analysis['strong_count'] += 1
                    
                    # Check if this is also a context-dependent word that ended up being vague
                    if token_text.lower() in CONTEXT_DEPENDENT_WORDS:
                        vague_analysis['vague_context_examples'].append(vague_info)

                    # Find specific strong category
                    for category, words in STRONG_VAGUE_LANGUAGE.items():
                        if token_text.lower() in words:
                            vague_analysis['category_counts'][f'strong_{category}'] += 1
                            break
                
                elif vagueness_type == 'mild_vague':
                    # Only count truly vague words in total counts
                    vague_analysis['vague_word_contexts'].append(vague_info)
                    vague_analysis['total_vague_count'] += 1
                    vague_analysis['mild_vague_words'].append(vague_info)
                    vague_analysis['mild_count'] += 1
                    
                    # Check if this is also a context-dependent word that ended up being vague
                    if token_text.lower() in CONTEXT_DEPENDENT_WORDS:
                        vague_analysis['vague_context_examples'].append(vague_info)
                    
                    # Find specific mild category
                    for category, words in MILD_VAGUE_LANGUAGE.items():
                        if token_text.lower() in words:
                            vague_analysis['category_counts'][f'mild_{category}'] += 1
                            break
                
                elif vagueness_type == 'context_quantified':
                    # Track quantified context separately - these are NOT vague due to quantified context
                    vague_analysis['quantified_context_words'].append(vague_info)
                    vague_analysis['quantified_context_count'] += 1
                    vague_analysis['context_specific_words'].append(vague_info)  # Keep for backward compatibility
                    vague_analysis['context_specific_count'] += 1
                    vague_analysis['category_counts']['context_quantified'] += 1
                    vague_analysis['quantified_context_examples'].append(vague_info)
                
                elif vagueness_type == 'context_compared':
                    # Track compared context separately - these are NOT vague due to comparative context
                    vague_analysis['compared_context_words'].append(vague_info)
                    vague_analysis['compared_context_count'] += 1
                    vague_analysis['context_specific_words'].append(vague_info)  # Keep for backward compatibility
                    vague_analysis['context_specific_count'] += 1
                    vague_analysis['category_counts']['context_compared'] += 1
                    vague_analysis['compared_context_examples'].append(vague_info)
    
    return vague_analysis

def calculate_vague_language_densities(vague_analysis):
    """Calculate vague language density metrics similar to hedge word densities."""
    meaningful_words = vague_analysis['meaningful_words']
    if meaningful_words == 0:
        return {
            'total_vague_density': 0.0,
            'strong_vague_density': 0.0,
            'mild_vague_density': 0.0,
            'context_specific_density': 0.0,
            'strong_mild_ratio': 0.0,
            'vague_intensity_score': 0.0
        }
    
    strong_count = vague_analysis['strong_count']
    mild_count = vague_analysis['mild_count']
    context_count = vague_analysis['context_specific_count']
    total_vague = strong_count + mild_count
    
    # Calculate densities as percentages
    total_density = (total_vague / meaningful_words) * 100
    strong_density = (strong_count / meaningful_words) * 100
    mild_density = (mild_count / meaningful_words) * 100
    context_density = (context_count / meaningful_words) * 100
    
    # Calculate ratios
    strong_mild_ratio = strong_count / mild_count if mild_count > 0 else 0.0
    
    # Calculate weighted intensity score (strong vague weighted 1.5x)
    intensity_score = (((strong_count * 1.5) + (mild_count * 1)) / meaningful_words) * 100
    
    return {
        'total_vague_density': round(total_density, 4),
        'strong_vague_density': round(strong_density, 4),
        'mild_vague_density': round(mild_density, 4),
        'context_specific_density': round(context_density, 4),
        'strong_mild_ratio': round(strong_mild_ratio, 4),
        'vague_intensity_score': round(intensity_score, 4)
    }

def calculate_context_metrics(vague_analysis):
    """Calculate context-dependent word metrics using total found context words."""
    total_context_found = vague_analysis['total_context_dependent_found']
    quantified_count = vague_analysis['quantified_context_count']
    compared_count = vague_analysis['compared_context_count']
    
    if total_context_found == 0:
        return {
            'quantified_context_pct': 0.0,
            'compared_context_pct': 0.0,
            'total_context_pct': 0.0
        }
    
    return {
        'quantified_context_pct': round((quantified_count / total_context_found) * 100, 4),
        'compared_context_pct': round((compared_count / total_context_found) * 100, 4),
        'total_context_pct': round(((quantified_count + compared_count) / total_context_found) * 100, 4)
    }

def analyze_commitment_vagueness(vague_analysis, doc_text):
    """Analyze commitment-related vague language without concrete timelines."""
    commitment_words = STRONG_VAGUE_LANGUAGE['commitment_vagueness']
    
    total_commitment_words_found = 0
    commitment_words_with_timelines = 0
    vague_commitments = []
    commitments_with_timelines = []
    
    for context_info in vague_analysis['vague_word_contexts']:
        word = context_info['word'].lower()
        # Exact match: `word` is a single token, so substring tests add nothing
        if word in commitment_words:
            total_commitment_words_found += 1
            
            # Check for concrete timeline in context
            context = context_info['context']
            has_timeline = bool(re.search(r'\b(by|until|before|after)\s+\d{4}\b|'
                                        r'\b(january|february|march|april|may|june|'
                                        r'july|august|september|october|november|december)\s+\d{4}\b|'
                                        r'\b(jan|feb|mar|apr|may|jun|jul|aug|sep|sept|oct|nov|dec)\.?\s+\d{4}\b|'
                                        r'\b\d{1,2}\s+(months?|years?|weeks?|days?)\b|'
                                        
                                        # End of year/period references
                                        r'\bby\s+(the\s+)?end\s+of\s+\d{4}\b|'
                                        r'\bby\s+end[\-\s]\d{4}\b|'
                                        
                                        # Within timeframe references
                                        r'\bwithin\s+\d{1,2}\s+(years?|months?|quarters?)\b|'
                                        r'\bover\s+the\s+next\s+\d{1,2}\s+(years?|months?|quarters?)\b|'
                                        r'\bin\s+the\s+next\s+\d{1,2}\s+(years?|months?|quarters?)\b|'
                                        
                                        # Quarter references
                                        r'\bby\s+[Qq][1-4]\s+\d{4}\b|'
                                        r'\b[Qq][1-4]\s+\d{4}\b|'
                                        r'\bby\s+(first|second|third|fourth)\s+quarter\s+\d{4}\b|'
                                        
                                        # During/throughout references
                                        r'\bduring\s+\d{4}\b|'
                                        r'\bthroughout\s+\d{4}\b|'
                                        r'\bin\s+\d{4}\b|'
                                        
                                        # Part of year references
                                        r'\bby\s+(early|mid|late)\s+\d{4}\b|'
                                        r'\bby\s+(spring|summer|fall|autumn|winter)\s+\d{4}\b|'
                                        
                                        # Fiscal year references
                                        r'\bby\s+[Ff][Yy]\s*\d{2,4}\b|'
                                        r'\b[Ff]iscal\s+[Yy]ear\s+\d{2,4}\b|'
                                        
                                        # No later than / starting references
                                        r'\bno\s+later\s+than\s+\d{4}\b|'
                                        r'\bstarting\s+(in\s+)?\d{4}\b|'
                                        r'\bbeginning\s+(in\s+)?\d{4}\b|'
                                        
                                        # Date ranges
                                        r'\bbetween\s+\d{4}\s+and\s+\d{4}\b|'
                                        r'\bfrom\s+\d{4}\s+(through|to|until)\s+\d{4}\b|'
                                        
                                        # Specific month references
                                        r'\bby\s+(january|february|march|april|may|june|july|august|september|october|november|december)\b|'
                                        r'\bby\s+(jan|feb|mar|apr|may|jun|jul|aug|sep|sept|oct|nov|dec)\.?\b|'
                                        
                                        # Target year references
                                        r'\btarget\s+(of\s+)?\d{4}\b|'
                                        r'\bgoal\s+(of\s+)?\d{4}\b', 
                                        context, re.IGNORECASE))
            
            if has_timeline:
                commitment_words_with_timelines += 1
                commitments_with_timelines.append(context_info)
            else:
                vague_commitments.append(context_info)
    
    # Calculate ratio of commitments WITH timelines (consistent with context analysis)
    commitment_timeline_pct = ((commitment_words_with_timelines / total_commitment_words_found) * 100 
                                if total_commitment_words_found > 0 else 0.0)
    
    return {
        'total_commitment_words_found': total_commitment_words_found,
        'commitment_words_with_timelines': commitment_words_with_timelines,
        'vague_commitment_count': len(vague_commitments),
        'vague_commitment_examples': vague_commitments[:5],
        'commitments_with_timeline_examples': commitments_with_timelines[:5],
        'commitment_timeline_pct': round(commitment_timeline_pct, 4),
        'commitment_vagueness_ratio': len(vague_commitments) / len(vague_analysis['vague_word_contexts']) 
                                    if vague_analysis['vague_word_contexts'] else 0
    }
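
The timeline check above treats a commitment as concrete only when its context contains an explicit date or duration, and reports the share of commitments that do. The sketch below reproduces that calculation with a reduced subset of the patterns, compiled once; `TIMELINE_RE`, `timeline_share`, and the sample contexts are illustrative assumptions rather than the notebook's own names or data.

```python
import re

# Reduced subset of the timeline patterns, compiled once for reuse
TIMELINE_RE = re.compile(
    r"\b(by|until|before|after)\s+\d{4}\b"
    r"|\bwithin\s+\d{1,2}\s+(years?|months?|quarters?)\b"
    r"|\bby\s+[Qq][1-4]\s+\d{4}\b"
    r"|\bby\s+(the\s+)?end\s+of\s+\d{4}\b",
    re.IGNORECASE,
)

def timeline_share(contexts):
    """Return the percentage of contexts containing a concrete timeline."""
    if not contexts:
        return 0.0
    with_timeline = sum(1 for c in contexts if TIMELINE_RE.search(c))
    return round(with_timeline / len(contexts) * 100, 4)

# Hypothetical commitment contexts for demonstration only
sample_contexts = [
    "We are exploring renewable options by 2030",
    "We are considering new suppliers in the long term",
    "We are developing a recycling scheme within 3 years",
    "We are pursuing efficiency gains by the end of 2026",
]
share = timeline_share(sample_contexts)
print(share)
```

A higher share indicates that more commitment language is anchored to a date, which corresponds to `commitment_timeline_pct` in the function above.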



In [None]:
def analyze_vague_language_all_documents(documents):
    """
    Analyze vague language across all documents. 
    Similar structure to analyze_hedge_words_all_documents.
    """
    all_results = {}
    density_stats = {
        'total_documents': len(documents),
        'documents_with_high_vagueness': [],
        'documents_with_low_vagueness': []
    }
    
    # Collect density values for statistics
    total_densities = []
    strong_densities = []
    mild_densities = []
    intensity_scores = []
    
    # Process each document
    for doc_name, doc in documents.items():
        # Analyze vague language in document
        vague_analysis = find_vague_language_in_document(doc, doc_name)
        
        # Calculate densities
        densities = calculate_vague_language_densities(vague_analysis)
        
        # Calculate context metrics
        context_metrics = calculate_context_metrics(vague_analysis)
        
        # Analyze commitment vagueness
        commitment_analysis = analyze_commitment_vagueness(vague_analysis, doc.text)
        
        # Count quantified terms for comparison
        quantified_count = 0
        for pattern in QUANTIFIED_PATTERNS:
            quantified_count += len(re.findall(pattern, doc.text, re.IGNORECASE))
        
        # Combine results
        document_result = {
            **vague_analysis, 
            **densities,
            **context_metrics,
            'commitment_analysis': commitment_analysis,
            'quantified_terms': quantified_count,
        }
        all_results[doc_name] = document_result
        
        # Collect density values for statistics
        total_densities.append(densities['total_vague_density'])
        strong_densities.append(densities['strong_vague_density'])
        mild_densities.append(densities['mild_vague_density'])
        intensity_scores.append(densities['vague_intensity_score'])
        
        # Classify documents by vagueness level
        if densities['total_vague_density'] > 5.0:  # High vagueness threshold
            density_stats['documents_with_high_vagueness'].append(doc_name)
        elif densities['total_vague_density'] < 3.0:  # Low vagueness threshold
            density_stats['documents_with_low_vagueness'].append(doc_name)
    
    # Calculate cross-document statistics
    if total_densities:
        density_stats.update({
            'average_total_vague_density': round(np.mean(total_densities), 4),
            'average_strong_vague_density': round(np.mean(strong_densities), 4),
            'average_mild_vague_density': round(np.mean(mild_densities), 4),
            'average_vague_intensity_score': round(np.mean(intensity_scores), 4),
            'min_total_vague_density': round(np.min(total_densities), 4),
            'max_total_vague_density': round(np.max(total_densities), 4),
            'range_total_vague_density': round(np.max(total_densities) - np.min(total_densities), 4),
            'min_strong_vague_density': round(np.min(strong_densities), 4),
            'max_strong_vague_density': round(np.max(strong_densities), 4),
            'min_mild_vague_density': round(np.min(mild_densities), 4),
            'max_mild_vague_density': round(np.max(mild_densities), 4)
        })
    
    return {
        'document_results': all_results,
        'density_statistics': density_stats
    }


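The high/low classification in the loop above uses two strict thresholds, so a document whose density lies between 3.0 and 5.0 (inclusive) lands in neither list. A minimal sketch of that rule follows; the function name `classify_vagueness` and the label `'medium'` are assumptions made for illustration and do not appear in the notebook.

```python
def classify_vagueness(total_vague_density, high=5.0, low=3.0):
    """Mirror the thresholds used above: > high is 'high', < low is 'low'."""
    if total_vague_density > high:
        return 'high'
    if total_vague_density < low:
        return 'low'
    # Neither list in density_stats receives these documents
    return 'medium'


# Illustrative density values (percent of meaningful words), not real results
for density in (2.9, 3.0, 4.2, 5.0, 5.1):
    print(f"{density:.1f}% -> {classify_vagueness(density)}")
```

Keeping the thresholds as parameters makes it easier to test how sensitive the high and low lists are to the chosen cut-offs.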

In [None]:
def create_vague_language_summary_table(analysis_results):
    """Create pandas DataFrame summarizing vague language analysis results."""
    document_results = analysis_results['document_results']
    
    summary_data = []
    for doc_name, results in document_results.items():
        summary_data.append({
            'Document': doc_name,
            'Meaningful Words': results['meaningful_words'],
            'Total Vague Words': results['total_vague_count'],
            'Strong Vague Words': results['strong_count'],
            'Mild Vague Words': results['mild_count'],
            'Total Vague Density (%)': results['total_vague_density'],
            'Strong Vague Density (%)': results['strong_vague_density'],
            'Mild Vague Density (%)': results['mild_vague_density'],
            'Vague Intensity Score': results['vague_intensity_score'],
            'Vague Commitments': results['commitment_analysis']['vague_commitment_count'],
            'total_commitment_words_found': results['commitment_analysis']['total_commitment_words_found'],
            'Commitment (with) Timeline (%)': results['commitment_analysis']['commitment_timeline_pct'],
            'quantified_context_words': results['quantified_context_count'],
            'compared_context_words': results['compared_context_count'], 
            'total_context_words': results['total_context_dependent_found'],            
            'quantified_context (%)': results['quantified_context_pct'],
            'compared_context (%)': results['compared_context_pct'],
            'total_context (%)': results['total_context_pct']
        })
    
    # Rows keep the insertion order of document_results
    return pd.DataFrame(summary_data)

def display_vague_language_analysis_results(analysis_results):
    """Display comprehensive vague language analysis results."""
    density_stats = analysis_results['density_statistics']
    
    print("VAGUE LANGUAGE ANALYSIS RESULTS")
    print("=" * 60)
    
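    # Basic statistics
    print("\nBASIC STATISTICS:")
    print(f"Total Documents: {density_stats['total_documents']}")
    
    # Cross-document averages only exist when at least one document was analyzed
    if 'average_total_vague_density' not in density_stats:
        print("No documents analyzed; skipping density statistics.")
        return create_vague_language_summary_table(analysis_results)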
    
    # Density statistics across all documents
    print(f"\nDENSITY STATISTICS ACROSS ALL DOCUMENTS:")
    print(f"Average Total Vague Density: {density_stats['average_total_vague_density']:.4f}%")
    print(f"Average Strong Vague Density: {density_stats['average_strong_vague_density']:.4f}%")
    print(f"Average Mild Vague Density: {density_stats['average_mild_vague_density']:.4f}%")
    print(f"Average Vague Intensity Score: {density_stats['average_vague_intensity_score']:.4f}")
    
    print(f"\nDENSITY RANGES:")
    print(f"Total Vague Density Range: {density_stats['min_total_vague_density']:.4f}% - {density_stats['max_total_vague_density']:.4f}% (range: {density_stats['range_total_vague_density']:.4f}%)")
    print(f"Strong Vague Density Range: {density_stats['min_strong_vague_density']:.4f}% - {density_stats['max_strong_vague_density']:.4f}%")
    print(f"Mild Vague Density Range: {density_stats['min_mild_vague_density']:.4f}% - {density_stats['max_mild_vague_density']:.4f}%")
    
    # High and low vagueness documents
    if density_stats['documents_with_high_vagueness']:
        print(f"\nHIGH VAGUENESS DOCUMENTS (>5% density):")
        for doc in density_stats['documents_with_high_vagueness']:
            doc_density = analysis_results['document_results'][doc]['total_vague_density']
            print(f"  {doc}: {doc_density:.4f}%")
    
    if density_stats['documents_with_low_vagueness']:
        print(f"\nLOW VAGUENESS DOCUMENTS (<3% density):")
        for doc in density_stats['documents_with_low_vagueness']:
            doc_density = analysis_results['document_results'][doc]['total_vague_density']
            print(f"  {doc}: {doc_density:.4f}%")
    
    # Create and display summary table
    summary_table = create_vague_language_summary_table(analysis_results)
    print(f"\nDOCUMENT SUMMARY TABLE:")
    print(summary_table.to_string(index=False))
    
    return summary_table

def show_vague_language_examples(analysis_results, doc_name=None, max_examples=10):
    """Show examples of vague language usage from documents."""
    document_results = analysis_results['document_results']
    
    if doc_name:
        if doc_name not in document_results:
            print(f"Document '{doc_name}' not found")
            return
        docs_to_show = {doc_name: document_results[doc_name]}
    else:
        # Show examples from highest vague density document
        sorted_docs = sorted(document_results.items(), 
                           key=lambda x: x[1]['total_vague_density'], 
                           reverse=True)
        docs_to_show = {sorted_docs[0][0]: sorted_docs[0][1]} if sorted_docs else {}
    
    for doc_name, results in docs_to_show.items():
        print(f"\nVAGUE LANGUAGE EXAMPLES FROM: {doc_name}")
        print("=" * 50)
        
        # Strong vague examples
        if results['strong_vague_words']:
            print(f"\nSTRONG VAGUE WORDS ({len(results['strong_vague_words'])} total):")
            for i, vague in enumerate(results['strong_vague_words'][:max_examples]):
                print(f"{i+1}. '{vague['word']}' (lemma: {vague['lemma']})")
                print(f"   Context: {vague['context']}")
                print()
        
        # Mild vague examples
        if results['mild_vague_words']:
            print(f"\nMILD VAGUE WORDS ({len(results['mild_vague_words'])} total):")
            for i, vague in enumerate(results['mild_vague_words'][:max_examples]):
                print(f"{i+1}. '{vague['word']}' (lemma: {vague['lemma']})")
                print(f"   Context: {vague['context']}")
                print()
        
        # COMMITMENT EXAMPLES (4 examples total)
        commitment_analysis = results['commitment_analysis']
        
        # Vague commitments (without timeline) - 2 examples
        if commitment_analysis['vague_commitment_examples']:
            # Use the stored count: the examples list is truncated to 5 entries
            print(f"\nVAGUE COMMITMENTS (WITHOUT TIMELINE) - {commitment_analysis['vague_commitment_count']} total:")
            for i, commit in enumerate(commitment_analysis['vague_commitment_examples'][:2]):
                print(f"{i+1}. '{commit['word']}'")
                print(f"   Context: {commit['context']}")
                print()
        
        # Commitments with timeline - 2 examples
        if commitment_analysis.get('commitments_with_timeline_examples'):
            print(f"\nCOMMITMENTS (WITH TIMELINE) - {commitment_analysis['commitment_words_with_timelines']} total:")
            for i, commit in enumerate(commitment_analysis['commitments_with_timeline_examples'][:2]):
                print(f"{i+1}. '{commit['word']}'")
                print(f"   Context: {commit['context']}")
                print()
        
        # CONTEXT-DEPENDENT EXAMPLES (6 examples total)
        
        # Vague context words (counted as vague) - up to 2 examples
        if results.get('vague_context_examples'):
            print("\nCONTEXT WORDS (COUNTED AS VAGUE) - showing up to 2 examples:")
            for i, context in enumerate(results['vague_context_examples'][:2]):
                print(f"{i+1}. '{context['word']}' (lemma: {context['lemma']})")
                print(f"   Context: {context['context']}")
                print()
        
        # Quantified context words (not counted due to quantification) - 2 examples
        if results.get('quantified_context_examples'):
            print(f"\nCONTEXT WORDS (NOT COUNTED - QUANTIFIED) - {results['quantified_context_count']} total:")
            for i, context in enumerate(results['quantified_context_examples'][:2]):
                print(f"{i+1}. '{context['word']}' (lemma: {context['lemma']})")
                print(f"   Context: {context['context']}")
                print()
        
        # Compared context words (not counted due to comparison) - 2 examples
        if results.get('compared_context_examples'):
            print(f"\nCONTEXT WORDS (NOT COUNTED - COMPARED) - {results['compared_context_count']} total:")
            for i, context in enumerate(results['compared_context_examples'][:2]):
                print(f"{i+1}. '{context['word']}' (lemma: {context['lemma']})")
                print(f"   Context: {context['context']}")
                print()



In [None]:
def run_vague_language_analysis():
    """Run vague language analysis using existing documents variable."""
    try:
        # Check if documents variable exists (from hedge word analysis)
        if 'documents' not in globals():
            print("Error: 'documents' variable not found. Please run the hedge words analysis first.")
            return None
        
        print("Starting vague language analysis...")
        
        # Run the analysis
        vague_results = analyze_vague_language_all_documents(documents)
        
        # Display results
        summary_table = display_vague_language_analysis_results(vague_results)
        
        print(f"\nVague language analysis complete.")
        
        # Store results globally for further use
        global vague_language_results
        vague_language_results = vague_results
        
        return vague_results
        
    except Exception as e:
        print(f"Error during vague language analysis: {str(e)}")
        return None

# Run the analysis
vague_analysis_results = run_vague_language_analysis()

In [None]:
# Show examples from most vague document
if vague_analysis_results:
    show_vague_language_examples(vague_analysis_results)



In [None]:
# COMBINED HEDGE AND VAGUE WORDS ANALYSIS - CREATE DATAFRAME

def create_hedge_vague_analysis_dataframe(hedge_results, vague_results):
    """
    Create a DataFrame combining hedge and vague word analysis.
    Rows: Organizations (company-year combinations)
    Columns: All metrics from both hedge and vague word analysis plus combined metrics
    """
    data = []
    
    # Get document names from hedge results (assuming both analyses cover same documents)
    hedge_documents = hedge_results['document_results']
    vague_documents = vague_results['document_results']
    
    for doc_name in hedge_documents.keys():
        if doc_name not in vague_documents:
            print(f"Warning: {doc_name} not found in vague results, skipping...")
            continue
            
        # Extract organization and year from document name
        parts = doc_name.split('_')
        year = parts[-1]
        org_name = '_'.join(parts[:-1])
        
        # Get hedge word results
        hedge_data = hedge_documents[doc_name]
        vague_data = vague_documents[doc_name]
        
        # Basic document metrics (should be same for both analyses)
        meaningful_words = hedge_data['meaningful_words']
        
        # HEDGE WORD METRICS
        total_hedge_count = hedge_data['total_hedge_count']
        hedge_strong_count = hedge_data['strong_count']
        hedge_mild_count = hedge_data['mild_count']
        total_hedge_density = hedge_data['total_hedge_density']
        strong_hedge_density = hedge_data['strong_hedge_density']
        mild_hedge_density = hedge_data['mild_hedge_density']
        hedge_strong_mild_ratio = hedge_data['strong_vs_mild_ratio']
        hedge_intensity_score = hedge_data['hedge_intensity_score']
        high_hedge_sentences = len(hedge_data['high_hedge_sentences'])
        
        # VAGUE WORD METRICS
        total_vague_count = vague_data['total_vague_count']
        vague_strong_count = vague_data['strong_count']
        vague_mild_count = vague_data['mild_count']
        total_vague_density = vague_data['total_vague_density']
        strong_vague_density = vague_data['strong_vague_density']
        mild_vague_density = vague_data['mild_vague_density']
        vague_strong_mild_ratio = vague_data['strong_mild_ratio']
        vague_intensity_score = vague_data['vague_intensity_score']
        vague_commitments = vague_data['commitment_analysis']['vague_commitment_count']
        total_commitment_words_found = vague_data['commitment_analysis']['total_commitment_words_found']
        commitment_timeline_pct = vague_data['commitment_analysis']['commitment_timeline_pct']
        quantified_terms = vague_data['quantified_terms']
        quantified_context_words = vague_data['quantified_context_count']
        compared_context_words = vague_data['compared_context_count']
        total_context_words = vague_data['total_context_dependent_found']
        quantified_context_pct = vague_data['quantified_context_pct']
        compared_context_pct = vague_data['compared_context_pct']
        total_context_pct = vague_data['total_context_pct']
        
        # COMBINED METRICS
        total_unclear_words = total_hedge_count + total_vague_count
        # Guard against documents with no meaningful words to avoid ZeroDivisionError
        total_unclear_density = ((total_unclear_words / meaningful_words) * 100
                                 if meaningful_words > 0 else 0.0)
        hedge_vague_ratio = total_hedge_count / total_vague_count if total_vague_count > 0 else 0
        
        # Combined intensity score: ((strong_hedge × 1.5 + mild_hedge × 1 + strong_vague × 1.5 + mild_vague × 1) / meaningful_words) × 100
        weighted_unclear_count = ((hedge_strong_count * 1.5) + (hedge_mild_count * 1) +
                                  (vague_strong_count * 1.5) + (vague_mild_count * 1))
        combined_intensity_score = ((weighted_unclear_count / meaningful_words) * 100
                                    if meaningful_words > 0 else 0.0)
        
        # Combined strong/mild ratios
        total_strong = hedge_strong_count + vague_strong_count
        total_mild = hedge_mild_count + vague_mild_count
        combined_strong_mild_ratio = total_strong / total_mild if total_mild > 0 else 0
        
        row = {
            # Basic identifiers
            'organization': org_name,
            'year': int(year),
            'meaningful_words': meaningful_words,
            
            # HEDGE WORD METRICS
            'total_hedge_words': total_hedge_count,
            'hedge_strong_count': hedge_strong_count,
            'hedge_mild_count': hedge_mild_count,
            'total_hedge_density': round(total_hedge_density, 4),
            'strong_hedge_density': round(strong_hedge_density, 4),
            'mild_hedge_density': round(mild_hedge_density, 4),
            'hedge_strong_mild_ratio': round(hedge_strong_mild_ratio, 4),
            'hedge_intensity_score': round(hedge_intensity_score, 4),
            'high_hedge_sentences': high_hedge_sentences,
            
            # VAGUE WORD METRICS
            'total_vague_words': total_vague_count,
            'vague_strong_count': vague_strong_count,
            'vague_mild_count': vague_mild_count,
            'total_vague_density': round(total_vague_density, 4),
            'strong_vague_density': round(strong_vague_density, 4),
            'mild_vague_density': round(mild_vague_density, 4),
            'vague_strong_mild_ratio': round(vague_strong_mild_ratio, 4),
            'vague_intensity_score': round(vague_intensity_score, 4),
            'vague_commitments': vague_commitments,
            'total_commitment_words_found': total_commitment_words_found,
            'commitment_timeline_pct': round(commitment_timeline_pct, 4),
            'quantified_terms': quantified_terms,
            'quantified_context_words': quantified_context_words,
            'compared_context_words': compared_context_words,
            'total_context_words': total_context_words,
            'quantified_context_pct': round(quantified_context_pct, 4),
            'compared_context_pct': round(compared_context_pct, 4),
            'total_context_pct': round(total_context_pct, 4),
            
            # COMBINED METRICS
            'total_unclear_words': total_unclear_words,
            'total_unclear_density': round(total_unclear_density, 4),
            'combined_strong_mild_ratio': round(combined_strong_mild_ratio, 4),
            'hedge_vague_ratio': round(hedge_vague_ratio, 4),
            'combined_intensity_score': round(combined_intensity_score, 4)
        }
        
        data.append(row)
    
    return pd.DataFrame(data)

# Create the combined hedge and vague analysis DataFrame
import pandas as pd
import os

# Create the dataframe
try:
    hedge_vague_df = create_hedge_vague_analysis_dataframe(hedge_results, vague_analysis_results)
    
    # Sort by organization and year for better readability
    hedge_vague_df = hedge_vague_df.sort_values(['organization', 'year']).reset_index(drop=True)
    
    print("HEDGE & VAGUE WORDS COMBINED ANALYSIS DATAFRAME CREATED")
    print("="*80)
    print(f"DataFrame shape: {hedge_vague_df.shape[0]} documents (organization-year rows) × {hedge_vague_df.shape[1]} columns")
    
    # Export to Excel
    excel_path = "data/NLP/Results/Communication_Score_df_Hedge_Vague.xlsx"
    
    # Create directory if it doesn't exist
    os.makedirs(os.path.dirname(excel_path), exist_ok=True)
    
    # Export to Excel
    hedge_vague_df.to_excel(excel_path, index=False, sheet_name='Hedge_Vague_Analysis')
    
    print(f"\nExported to Excel: {excel_path}")
    print(f"Variable available as: hedge_vague_df")
    
    # Column descriptions for reference
    print(f"\nCOLUMN DESCRIPTIONS:")
    column_descriptions = {
        # Basic identifiers
        'organization': 'Organization name',
        'year': 'Report year',
        'meaningful_words': 'Total meaningful words in document',
        
        # Hedge word metrics
        'total_hedge_words': 'Total hedge words found',
        'hedge_strong_count': 'Strong hedge words count',
        'hedge_mild_count': 'Mild hedge words count', 
        'total_hedge_density': 'Overall hedge density percentage',
        'strong_hedge_density': 'Strong hedge density percentage',
        'mild_hedge_density': 'Mild hedge density percentage',
        'hedge_strong_mild_ratio': 'Ratio of strong to mild hedge words',
        'hedge_intensity_score': 'Hedge intensity score',
        'high_hedge_sentences': 'Number of sentences with 3+ hedge words',
        
        # Vague word metrics
        'total_vague_words': 'Total vague words found',
        'vague_strong_count': 'Strong vague words count',
        'vague_mild_count': 'Mild vague words count',
        'total_vague_density': 'Overall vague density percentage',
        'strong_vague_density': 'Strong vague density percentage',
        'mild_vague_density': 'Mild vague density percentage',
        'vague_strong_mild_ratio': 'Ratio of strong to mild vague words',
        'vague_intensity_score': 'Vague intensity score',
        'vague_commitments': 'Number of vague commitments',
        'total_commitment_words_found': 'Total commitment words found in document',
        'commitment_timeline_pct': 'Percentage of commitment words with specific timelines',
        'quantified_terms': 'Number of quantified terms',
        'quantified_context_words': 'Number of context words that were quantified',
        'compared_context_words': 'Number of context words that were compared', 
        'total_context_words': 'Total context-dependent words found',
        'quantified_context_pct': 'Percentage of context-dependent words that were quantified',
        'compared_context_pct': 'Percentage of context-dependent words that were compared',
        'total_context_pct': 'Percentage of context-dependent words that were quantified or compared',
        
        # Combined metrics
        'total_unclear_words': 'Combined hedge and vague words',
        'total_unclear_density': 'Combined unclear communication density',
        'combined_strong_mild_ratio': 'Combined strong to mild ratio',
        'hedge_vague_ratio': 'Ratio of hedge to vague words',
        'combined_intensity_score': 'Weighted combined intensity score'
    }
    
    # Pad to 30 characters so the longest column names still align
    for col, desc in column_descriptions.items():
        if col in hedge_vague_df.columns:
            print(f"  {col:<30}: {desc}")
    
    # Summary statistics
    print(f"\n{'='*80}")
    print("COMBINED ANALYSIS SUMMARY STATISTICS")
    print(f"{'='*80}")
    print(f"Total organizations analyzed: {hedge_vague_df['organization'].nunique()}")
    print(f"Total documents: {len(hedge_vague_df)}")
    print(f"Year range: {hedge_vague_df['year'].min()} - {hedge_vague_df['year'].max()}")
    
    print(f"\nOverall Communication Metrics (averages):")
    print(f"  Total meaningful words: {hedge_vague_df['meaningful_words'].mean():.0f}")
    print(f"  Total unclear words: {hedge_vague_df['total_unclear_words'].mean():.1f}")
    print(f"  Total unclear density: {hedge_vague_df['total_unclear_density'].mean():.2f}%")
    
    print(f"\nHedge vs Vague Breakdown (averages):")
    print(f"  Hedge words: {hedge_vague_df['total_hedge_words'].mean():.1f}")
    print(f"  Vague words: {hedge_vague_df['total_vague_words'].mean():.1f}")
    print(f"  Hedge density: {hedge_vague_df['total_hedge_density'].mean():.3f}%")
    print(f"  Vague density: {hedge_vague_df['total_vague_density'].mean():.3f}%")
    print(f"  Hedge/Vague ratio: {hedge_vague_df['hedge_vague_ratio'].mean():.2f}")
    
    print(f"\nIntensity Scores (averages):")
    print(f"  Hedge intensity: {hedge_vague_df['hedge_intensity_score'].mean():.2f}")
    print(f"  Vague intensity: {hedge_vague_df['vague_intensity_score'].mean():.2f}")
    print(f"  Combined intensity: {hedge_vague_df['combined_intensity_score'].mean():.2f}")
    
except NameError as e:
    print(f"Error: Missing required variables. Please run hedge and vague analysis first.")
    print(f"Error details: {e}")
except Exception as e:
    print(f"Error creating combined dataframe: {e}")

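The combined intensity score computed above weights strong terms by 1.5 and mild terms by 1.0, then normalizes by meaningful words. A minimal worked sketch of that arithmetic follows; the function name `combined_intensity` and the input counts are hypothetical, and the zero-word guard is an assumption about how empty documents should be handled.

```python
def combined_intensity(hedge_strong, hedge_mild, vague_strong, vague_mild, meaningful_words):
    """Weighted hedge + vague count per 100 meaningful words."""
    if meaningful_words <= 0:
        # Assumption: an empty document scores 0 instead of raising ZeroDivisionError
        return 0.0
    weighted = 1.5 * (hedge_strong + vague_strong) + 1.0 * (hedge_mild + vague_mild)
    return round(weighted / meaningful_words * 100, 4)


# Invented counts: 3 strong and 7 mild terms in a 200-word document
# weighted = 1.5 * 3 + 1.0 * 7 = 11.5, so the score is 11.5 / 200 * 100
print(combined_intensity(2, 4, 1, 3, 200))  # 5.75
print(combined_intensity(0, 0, 0, 0, 0))    # 0.0
```

Note that the score is not bounded by 100: because strong terms count 1.5 times, it can exceed the plain density computed from the same counts.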
In [None]:
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
from openpyxl.utils import get_column_letter

# Output path (same file as the export in the previous cell)
output_path = "data/NLP/Results/Communication_Score_df_Hedge_Vague.xlsx"

# Re-save the DataFrame, keeping the sheet name used in the previous cell
hedge_vague_df.to_excel(output_path, index=False, engine="openpyxl",
                        sheet_name='Hedge_Vague_Analysis')

# Load the workbook and sheet
wb = load_workbook(output_path)
ws = wb.active  # There's only one sheet since we saved just one DataFrame

# Auto-adjust column widths based on the longest string in each column
for col in ws.columns:
    max_length = 0
    col_letter = get_column_letter(col[0].column)
    for cell in col:
        if cell.value is not None:  # 0 is a valid value and must not be skipped
            max_length = max(max_length, len(str(cell.value)))
    ws.column_dimensions[col_letter].width = max_length + 3  # Add padding

# Define grey fill for alternating rows
grey_fill = PatternFill(start_color="D9D9D9", end_color="D9D9D9", fill_type="solid")

# Alternate row colors by company
prev_company = None
use_grey = False
for row in range(2, ws.max_row + 1):
    current_company = ws[f"A{row}"].value  # Column A has the company names
    if current_company != prev_company:
        use_grey = not use_grey
        prev_company = current_company

    if use_grey:
        for col in range(1, ws.max_column + 1):
            ws.cell(row=row, column=col).fill = grey_fill

# Save the final cleaned and formatted workbook
wb.save(output_path)

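The alternating shading above toggles whenever the organization in column A changes, so it relies on rows being sorted by organization (as done before the export). A minimal sketch of the same toggle logic without openpyxl follows; `grey_row_flags` and the organization names are hypothetical and only illustrate which rows would be filled.

```python
def grey_row_flags(group_keys):
    """Return one boolean per row: True where the row would get the grey fill."""
    flags = []
    prev_key = None
    use_grey = False
    for key in group_keys:
        if key != prev_key:
            # A new organization starts: flip the shading
            use_grey = not use_grey
            prev_key = key
        flags.append(use_grey)
    return flags


# Invented organization column, already sorted by organization
rows = ["Acme", "Acme", "Beta", "Gamma", "Gamma"]
print(grey_row_flags(rows))  # [True, True, False, True, True]
```

If the rows were not sorted, the same organization could appear in several separate blocks and be shaded inconsistently.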