# Green Term Context Classification Analysis

## Overview
This module classifies each environmental ("green") term along three dimensions: temporal context, quantification level, and evidence versus aspiration. The resulting variables feed the Substantiation Weakness and Temporal Orientation dimensions of the communication score.

## Three Classification Dimensions

### 1. Temporal Classification
- **Past context**: Completed actions ("achieved", "implemented", "certified")
- **Present context**: Current operations ("operates", "generates", "maintains")  
- **Future context**: Planned actions ("will install", "targeting by 2030")
- **Method**: Dependency parsing to find each term's governing verb, plus temporal marker detection within a 5-token window
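
The window-based half of this method can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: it skips the dependency parse entirely, and the marker sets and the `classify_temporal` name are assumptions made for the example.

```python
# Hypothetical marker sets; the lists actually used by the notebook are not shown here.
MARKERS = {
    "past": {"achieved", "implemented", "certified", "completed"},
    "present": {"operates", "generates", "maintains", "currently"},
    "future": {"will", "plans", "targeting", "aims"},
}

def classify_temporal(tokens, term_index, window=5):
    """Return the label of the temporal marker closest to the term, or 'unknown'.

    Only tokens within `window` positions of the term are inspected.
    On a distance tie, the leftmost marker wins.
    """
    lo = max(0, term_index - window)
    hi = min(len(tokens), term_index + window + 1)
    best_label, best_distance = "unknown", None
    for i in range(lo, hi):
        word = tokens[i].lower()
        for label, markers in MARKERS.items():
            if word in markers:
                distance = abs(i - term_index)
                if best_distance is None or distance < best_distance:
                    best_label, best_distance = label, distance
    return best_label

sentence = "We will install new solar panels by 2030".split()
print(classify_temporal(sentence, sentence.index("solar")))
```

A dependency-based version would first locate the governing verb of the green term and search around that verb, which avoids picking up markers that belong to a neighbouring clause.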

### 2. Quantification Classification  
- **Highly quantified**: Strong connection to specific numbers with units ("€50 million investment", "25% reduction")
- **Partially quantified**: Relative quantifiers without direct numbers ("doubled capacity", "increased efficiency")
- **Non-quantified**: No meaningful quantification connections
- **Method**: Token reconstruction for fragmented numbers, syntactic connection analysis
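
As a rough illustration of token reconstruction, the sketch below rejoins number fragments that tokenization has split apart (for example `1 , 500` or `25 %`) and then assigns one of the three labels with regular expressions. The unit list, the relative-quantifier set, and all function names are assumptions for this example, and the syntactic connection analysis is replaced by a simple same-string check.

```python
import re

# Hypothetical unit and quantifier vocabularies, chosen only for illustration.
NUMBER_WITH_UNIT = re.compile(
    r"(?:[€$£]\s*)?\d+(?:[.,]\d+)*\s*"
    r"(?:%|(?:percent|million|billion|mwh|gwh|mw|gw|tonnes)\b)",
    re.IGNORECASE,
)
RELATIVE_QUANTIFIERS = {"doubled", "tripled", "halved", "increased", "decreased"}

def rejoin_number_fragments(text):
    """Undo tokenizer splits such as '1 , 500' -> '1,500' and '25 %' -> '25%'."""
    # Require whitespace before the separator so ordinary prose ("2021, 30") is untouched.
    text = re.sub(r"(?<=\d)\s+([.,])\s*(?=\d)", r"\1", text)
    return re.sub(r"(?<=\d)\s+(?=%)", "", text)

def extract_quantity(text):
    """Return the first number-with-unit expression, or None if there is none."""
    match = NUMBER_WITH_UNIT.search(rejoin_number_fragments(text))
    return match.group(0) if match else None

def classify_quantification(text):
    """Label a text span as highly, partially, or non-quantified."""
    if extract_quantity(text) is not None:
        return "highly_quantified"
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & RELATIVE_QUANTIFIERS:
        return "partially_quantified"
    return "non_quantified"

print(extract_quantity("installed 1 , 500 MW of wind"))
print(classify_quantification("doubled capacity of wind farms"))
```

Rejoining matters mostly for extracting the full value: without it, `1 , 500 MW` would still count as quantified, but the captured quantity would be `500 MW`.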

### 3. Evidence vs. Aspirational Classification
- **Strong evidence**: Completed, verifiable actions ("installed 200MW", "achieved certification")
- **Moderate evidence**: Ongoing or implemented actions ("operates facilities", "generates energy")
- **Strong aspirational**: Uncertain/conditional plans ("should achieve", "might invest", "vision includes")
- **Moderate aspirational**: Committed future intentions ("will install", "plan to reduce", "strategy targets")
- **Method**: Targeted marker detection around semantic governors (main controlling verbs)

## Variables Produced for Communication Scoring
According to the analysis framework:
- **Quantified Claim Intensity** → Substantiation Weakness dimension
- **Evidence-Based Claim Intensity** → Substantiation Weakness dimension  
- **Aspirational Claim Intensity** → Substantiation Weakness dimension
- **Future Orientation Ratio** → Temporal Orientation dimension

## Intensity Scoring Method
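Each intensity is a weighted share of all green terms: ((strong count × 1.5) + (moderate count × 1.0)) ÷ total green terms × 100. Because strong classifications carry a weight of 1.5, the score reaches 150 when every term is classified as strong, so it is not a true percentage. Normalizing by the total term count allows standardized comparison across documents while emphasizing stronger classifications.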
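
The formula translates directly into a small helper. The function name and the choice to return 0.0 for a document without green terms are assumptions for this sketch; the weights are the ones stated above.

```python
def intensity_score(strong_count, moderate_count, total_green_terms,
                    strong_weight=1.5, moderate_weight=1.0):
    """Weighted share of green terms, scaled by 100.

    Returns 0.0 when there are no green terms (an assumed convention
    that avoids division by zero).
    """
    if total_green_terms <= 0:
        return 0.0
    weighted = strong_count * strong_weight + moderate_count * moderate_weight
    return 100 * weighted / total_green_terms

# 10 strong and 20 moderate classifications among 100 green terms
print(intensity_score(10, 20, 100))
```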

## Advanced Features
The module also produces a confidence score for each classification, combined cross-dimensional insights (future + aspirational, past + evidence), and a semantic governor analysis for precise context determination.

In [None]:
import spacy
from spacy_layout import spaCyLayout
from pathlib import Path
import pandas as pd
import numpy as np
import re
from collections import defaultdict, Counter

# Load spaCy model and configure for large documents
nlp = spacy.load("en_core_web_lg")
nlp.max_length = 1_500_000

In [None]:
from pathlib import Path

# Toggle between "test" and "actual"
MODE = "actual"   

# Define configuration based on mode
if MODE == "test":
    report_names = [ 
        "Axpo_Holding_AG", "NEOEN_SA"
    ]
    folders = {
        "2021": Path("data/NLP/Testing/Reports/Clean/2021"),
        "2022": Path("data/NLP/Testing/Reports/Clean/2022")
    }

elif MODE == "actual":
    report_names = [ 
        "Akenerji_Elektrik_Uretim_AS",
        "Arendals_Fossekompani_ASA",
        "Atlantica_Sustainable_Infrastructure_PLC",
        "CEZ",
        "EDF",
        "EDP_Energias_de_Portugal_SA",
        "Endesa",
        "ERG_SpA",
        "Orsted",
        "Polska_Grupa_Energetyczna_PGE_SA",
        "Romande_Energie_Holding_SA",
        "Scatec_ASA",
        "Solaria_Energia_y_Medio_Ambiente_SA",
        "Terna_Energy_SA"
    ]

    folders = {
        "2021": Path("data/NLP/Reports/Cleanest/2021"),
        "2022": Path("data/NLP/Reports/Cleanest/2022")
    }

else:
    raise ValueError("Invalid MODE. Use 'test' or 'actual'.")

# Check availability
for name in report_names:
    file_name = f"{name}.txt"
    in_2021 = (folders["2021"] / file_name).exists()
    in_2022 = (folders["2022"] / file_name).exists()
    print(f"{file_name}: 2021: {'YES' if in_2021 else 'NO'} | 2022: {'YES' if in_2022 else 'NO'}")


In [None]:
# Dictionary to store processed docs
documents = {}

# Load and process all documents
for version, folder_path in folders.items():
    for name in report_names:
        txt_path = folder_path / f"{name}.txt"
        try:
            with open(txt_path, "r", encoding="utf-8") as f:
                text = f.read()
            doc_key = f"{name}_{version}"
            documents[doc_key] = nlp(text)
            print(f"Processed {doc_key}")
        except Exception as e:
            print(f"Error processing {txt_path.name}: {e}")

print(f"\nTotal documents loaded: {len(documents)}")

## Green word frequency

In [None]:
# List of green/sustainable nouns (lemmas) - EMISSIONS FOCUSED
green_nouns = [
    "adaptation", "afforestation", "biodiversity", "biofuel", "biogas", "biomass", 
    "ccs", "ccus", "cogeneration", "decarbonisation", "decarbonization", "ecology", 
    "ecosystem", "electrification", "environment", "ess", "geothermal", "hydropower", 
    "improvement", "innovation", "mitigation", "optimization", "photovoltaic", 
    "preservation", "pv", "recycling", "reforestation", "regeneration", "renewable", 
    "renewables", "responsibility", "restoration", "solar", "sustainability", 
    "transition", "transparency", "wind"
]

# Multi-word green nouns (lemmas) - EMISSIONS FOCUSED
green_multiword_nouns = {
    "abatement": ["carbon", "co2", "co2e", "emission", "ghg", "pollution"], 
    "bond": ["climate", "green", "sustainability"], 
    "bonds": ["climate", "green", "sustainability"], 
    "capture": ["carbon", "co2", "ghg", "methane"],
    "development": ["clean", "renewable", "sustainable"],
    "economy": ["circular", "green", "hydrogen", "sustainable"],
    "energy": ["alternative", "clean", "geothermal", "hydro", "renewable", "solar", "tidal", "wind"],
    "farm": ["offshore", "solar", "wind"],
    "farms": ["offshore", "solar", "wind"],
    "finance": ["climate", "green", "sustainable"],
    "financing": ["climate", "green", "sustainable"],
    "fuel": ["alternative", "bio", "clean", "hydrogen", "synthetic"],
    "fuels": ["alternative", "bio", "clean", "hydrogen", "synthetic"], 
    "fund": ["climate", "green", "sustainability"], 
    "funds": ["climate", "green", "sustainability"], 
    "generation": ["clean", "renewable"],
    "goal": ["carbon", "climate", "emission"], 
    "goals": ["carbon", "climate", "emission"],
    "growth": ["clean", "climate", "green", "renewable", "sustainable"],
    "hydrogen": ["blue", "clean", "green", "renewable"],
    "investment": ["clean", "climate", "green", "renewable", "sustainable"],
    "investments": ["clean", "climate", "green", "renewable", "sustainable"],
    "management": ["carbon", "energy", "environmental", "waste"],
    "neutral": ["carbon", "climate", "co2", "emission"],
    "neutrality": ["carbon", "climate", "co2", "emission"],
    "panel": ["photovoltaic", "pv", "solar"],
    "panels": ["photovoltaic", "pv", "solar"],
    "plant": ["biomass", "geothermal", "hydro", "renewable", "solar", "wind"],
    "plants": ["biomass", "geothermal", "hydro", "renewable", "solar", "wind"],
    "power": ["clean", "geothermal", "hydro", "renewable", "solar", "tidal", "wind"],
    "program": ["conservation", "efficiency", "renewable", "retrofit"],
    "project": ["clean", "efficiency", "green", "renewable"],
    "reduction": ["carbon", "co2", "emission", "ghg", "waste"],
    "reductions": ["carbon", "co2", "emission", "ghg", "waste"],
    "sequestration": ["carbon", "co2", "ghg"], 
    "solution": ["clean", "climate", "green", "renewable"],
    "solutions": ["clean", "climate", "green", "renewable"],
    "source": ["clean", "geothermal", "green", "hydro", "renewable", "solar", "wind"], 
    "sources": ["clean", "geothermal", "green", "hydro", "renewable", "solar", "wind"],
    "standard": ["efficiency", "environmental", "green", "performance"],
    "standards": ["efficiency", "environmental", "green", "performance"],
    "station": ["clean", "geothermal", "green", "hydro", "renewable", "solar", "wind"],
    "stations": ["clean", "geothermal", "green", "hydro", "renewable", "solar", "wind"],
    "storage": ["carbon", "co2"],
    "technology": ["carbon", "clean", "efficiency", "green", "renewable"],
    "technologies": ["carbon", "clean", "efficiency", "green", "renewable"],
    "transition": ["climate", "energy", "green"],
    "turbine": ["offshore", "onshore", "wind"],
    "turbines": ["offshore", "onshore", "wind"],
    "zero": ["carbon", "climate", "emission", "footprint", "ghg", "net", "pollution", "waste"]
}

# List of single-word green adjectives (lemmas) - EMISSIONS FOCUSED
green_adjectives = [
    "circular", "clean", "decarbonise", "decarbonised", "decarbonising", "decarbonize", 
    "decarbonized", "decarbonizing", "durable", "ecological", "ecosystemic", "efficient", 
    "enriching", "environmental", "environmentally", "green", "hydroelectric", "innovative",
    "optimal", "proenvironmental", "recover", "recoverable", "recovered", "recyclable", 
    "recycle", "recycled", "recycling", "reforested", "refurbish", "refurbished", 
    "regenerable", "regenerate", "regenerated", "renewable", "renewables", "responsible", 
    "restore", "restored", "reusable", "reuse", "sustainable", "sustainably"
]

# Multi-word green adjectives (lemmas) - EMISSIONS FOCUSED
green_multiword_adjectives = {
    "based": ["biomass", "nature", "plant", "renewable"], 
    "bio": ["energy", "fuel", "gas", "mass"],
    "biomass": ["fired", "fueled", "powered"],
    "carbon": ["captured", "capturing", "free", "low", "lower", "negative", "neutral", "non", "sequestered", "zero"],
    "ccs": ["enabled", "equipped", "ready"], 
    "efficient": ["eco", "energy", "fuel", "high", "resource"],
    "efficiency": ["eco", "energy", "fuel", "high", "resource"],
    "electric": ["all", "geothermal", "hydro", "solar", "tidal", "wind"],
    "emission": ["free", "low", "zero"], 
    "emissions": ["free", "low", "zero"], 
    "emitting": ["non", "zero"],
    "energy": ["alternative", "clean", "efficient", "environmental", "renewable", "saved"], 
    "energies": ["alternative", "clean", "efficient", "environmental", "renewable", "saved"],
    "environmental": ["certified", "energy", "management", "responsible"],
    "free": ["carbon", "coal", "co2", "emission", "emissions", "fossil", "waste"],
    "friendly": ["climate", "eco", "environment", "environmental", "planet"], 
    "friendlier": ["carbon", "climate", "eco", "environment", "environmental", "planet"],
    "gas": ["bio", "renewable"],
    "intensity": ["low", "reduced"], 
    "mitigating": ["key"],
    "natural": ["protected"],
    "negative": ["carbon", "co2", "emission", "emissions"],
    "neutral": ["carbon", "climate", "co2", "emission"],
    "oriented": ["climate", "ecosystem", "sustainability"],
    "planting": ["forest", "tree"],
    "pollutant": ["anti", "controlling", "low", "preventing", "reduced", "zero"],
    "pollution": ["anti", "controlling", "low", "preventing", "reduced", "zero"],
    "production": ["clean", "green", "renewable", "responsible", "sustainable"],
    "proof": ["climate", "future"],
    "protected": ["environmental", "natural"],
    "reducing": ["carbon", "emission", "ghg", "pollution"],
    "related": ["climate", "environment", "sustainability"],
    "resilient": ["climate", "environment"],
    "responsible": ["climate", "eco", "environmental"],
    "saving": ["carbon", "energy", "fuel", "resource"],
    "sustainable": ["certified", "climate"],
    "zero": ["carbon", "emission", "net"]
}

# List of green verbs (lemmas) - EMISSIONS FOCUSED
green_verbs = [
    "afforeste", "afforesting", "conserve", "conserving", "decarbonize", "decarbonizing", 
    "decarbonise", "decarbonising", "electrify", "electrifying", "innovate", "innovating",
    "minimize", "minimizing", "minimise", "minimising", "mitigate", "mitigating", 
    "optimize", "optimizing", "optimise", "optimising", "preserve", "preserving", 
    "recover", "recovering", "recycle", "recycling", "remediate", "remediating", 
    "reforest", "reforesting", "regenerate", "regenerating", "restore", "restoring", 
    "reuse", "reusing", "transition", "transitioning", "upgrade", "upgrading"
]

# List of green adverbs (lemmas) - EMISSIONS FOCUSED
green_adverbs = [
    "cleanly", "consciously", "ecologically", "efficiently", "environmentally", 
    "optimally", "renewably", "responsibly", "sustainably"
]

# Multi-word green adverbs (lemmas) - EMISSIONS FOCUSED
green_multiword_adverbs = {
    "aware": ["carbon", "climate", "eco", "environmentally"],
    "based": ["nature", "plant", "renewable", "sustainability"], 
    "compliant": ["climate", "emission", "environmentally"],
    "compatible": ["climate", "eco", "environmentally"],
    "conscious": ["carbon", "climate", "eco", "environmentally"],
    "designed": ["efficiently", "environmentally", "sustainably"],
    "driven": ["climate", "renewable", "sustainability"],
    "efficient": ["carbon", "energy", "fuel", "resource"],
    "focused": ["climate", "emission", "environmental", "sustainability"],
    "friendly": ["carbon", "climate", "co2", "eco", "environmentally"],
    "managed": ["environmentally", "responsibly", "sustainably"],
    "neutral": ["carbon", "climate", "emission"],
    "operated": ["cleanly", "efficiently", "sustainably"],
    "oriented": ["climate", "environment", "renewable", "sustainability"],
    "produced": ["cleanly", "renewably", "sustainably"],
    "responsible": ["carbon", "climate", "environmentally"],
    "safe": ["climate", "eco", "environmentally"],
    "sensitive": ["carbon", "climate", "environmentally"],
    "sound": ["carbon", "climate", "environmentally"]
}

# Neutral terms that become green with proper context (lemmas) - EMISSIONS FOCUSED
neutral_nouns = [
    "co2", "co2e", "cooling", "emission", "emissions", "energy", "footprint", 
    "fuel", "ghg", "heating", "methane", "pollution", "transportation", "waste"
]

# Multi-word neutral terms that become green with context (lemmas) - EMISSIONS FOCUSED
neutral_multiword_nouns = {
    "consumption": ["coal", "electricity", "energy", "fuel", "gas", "oil", "power"],
    "economy": ["carbon"],
    "emission": ["annual", "baseline", "carbon", "co2", "co2e", "direct", "ghg", "indirect", "scope", "total"],
    "emissions": ["annual", "baseline", "carbon", "co2", "co2e", "direct", "ghg", "indirect", "scope", "total"],
    "footprint": ["carbon", "co2", "ecological", "emission", "environmental", "ghg"],
    "impact": ["carbon", "climate", "ecological", "environmental"],
    "intensity": ["carbon", "co2", "emission", "energy", "fuel", "ghg"],
    "usage": ["electricity", "energy", "fuel", "power", "resource"],
    "use": ["electricity", "energy", "fuel", "resource"]
}

# Context words that indicate positive change - EMISSIONS FOCUSED
improvement_adjectives = [
    "advanced", "best", "better", "boosted", "enhanced", "excellent", "exceptional", 
    "improved", "impressive", "optimized", "optimised", "outstanding", "positive", 
    "remarkable", "strengthened", "successful", "superior", "upgraded"
]

# Improvement verbs that indicate positive change
improvement_verbs = [
    "advance", "advancing", "boost", "boosting", "deliver", "delivering", "enhance", 
    "enhancing", "improve", "improving", "optimize", "optimizing", "optimise", 
    "optimising", "outperform", "outperforming", "strengthen", "strengthening", 
    "upgrade", "upgrading"
]
# Improvement adverbs that indicate positive manner
improvement_adverbs = [
    "effectively", "efficiently", "excellently", "meaningfully", "positively", "successfully"
]
# Dependency-based green term patterns - EMISSIONS FOCUSED
dependency_green_patterns = {
    "improvement_adjectives": {
        "head_words": ["efficiency", "generation", "performance", "process", "solution", "system", "technology"],
        "dependent_words": ["advanced", "better", "cleaner", "enhanced", "greener", "improved", "innovative", "optimized"],
        "dependency_relations": ["amod"],
        "description": "Improvement adjectives with green nouns"
    },
    "increase_positive": {
        "head_words": ["boost", "develop", "enhance", "expand", "grow", "improve", "increase", "scale"],
        "dependent_words": ["capture", "conservation", "efficiency", "recycling", "renewable", "sequestration", "sustainability"],
        "dependency_relations": ["dobj"],
        "description": "Positive action verbs with green objects"
    },
    "achieve_climate": {
        "head_words": ["achieve", "attain", "deliver", "reach", "realize"],
        "dependent_words": ["neutrality", "reduction", "transition", "zero"],
        "dependency_relations": ["dobj"],
        "description": "Achievement verbs with climate goals"
    },
    "transition_to": {
        "head_words": ["change", "migrate", "move", "shift", "switch", "transition"],
        "dependent_words": ["circular", "clean", "green", "renewable", "sustainable", "zero"],
        "dependency_relations": ["prep", "pobj"],
        "description": "Transition phrases"
    },
    "investment_in": {
        "head_words": ["allocate", "finance", "funding", "invest", "investment", "spend"],
        "dependent_words": ["carbon", "clean", "climate", "environmental", "green", "renewable", "sustainable"],
        "dependency_relations": ["prep", "pobj"],
        "description": "Investment in green areas"
    },
    "commitment_to": {
        "head_words": ["commitment", "dedication", "pledge", "promise"],
        "dependent_words": ["climate", "neutrality", "reduction", "sustainability", "zero"],
        "dependency_relations": ["prep", "pobj"],
        "description": "Commitments to climate goals"
    },
    "sustainable_adverbs": {
        "head_words": ["develop", "generate", "grow", "manage", "manufacture", "operate", "produce"],
        "dependent_words": ["cleanly", "efficiently", "environmentally", "responsibly", "sustainably"],
        "dependency_relations": ["advmod"],
        "description": "Sustainable manner of operations"
    }
}


In [None]:
# NEGATION DETECTION SYSTEM - COMPREHENSIVE WORD LISTS

# Direct negation words (alphabetically sorted)
direct_negation_words = [
    "absence", "barely", "beneath", "below", "deficit", "devoid", "empty", "exempt", 
    "excluded", "failed", "gap", "hardly", "impossible", "inadequate", "insufficient", 
    "lacking", "minus", "never", "neither", "nil", "no", "nobody", "none", "nor", 
    "not", "nothing", "nowhere", "rarely", "scarcely", "seldom", "short", "shortfall", 
    "unable", "void", "without", "zero"
]

# Negative descriptor words (alphabetically sorted)
negative_descriptor_words = [
    "absence", "absent", "barrier", "blocks", "blocked", "cancel", "cancelled", 
    "cease", "ceased", "challenge", "concern", "constraint", "constraints", 
    "decline", "decrease", "deficit", "degrade", "degraded", "deteriorate", 
    "deteriorated", "difficulties", "difficulty", "drop", "downturn", "end", 
    "ended", "fail", "failed", "failure", "fall", "gap", "halt", "halted", 
    "harm", "hinders", "hindered", "impedes", "impeded", "inability", "incapable", 
    "inadequate", "ineffective", "inefficient", "insufficient", "issues", "lack", 
    "lacking", "lacks", "limited", "limiting", "loss", "missing", "obstacle", 
    "prevents", "prevented", "problem", "problems", "reduction", "reject", "rejected", 
    "refuse", "refused", "setback", "shortage", "shortfall", "shrinkage", "stop", 
    "stopped", "struggles", "struggling", "suspend", "suspended", "threat", 
    "unable", "unsuccessful", "weakening", "worse", "worsen", "worsening"
]

# Phrasal negation patterns (alphabetically sorted)
phrasal_negation_patterns = [
    "absence of", "anything but", "are no plans", "are not", "aren't", "can not", 
    "can't", "cannot", "cease", "could not", "couldn't", "devoid of", "did not", 
    "didn't", "do not", "does not", "doesn't", "don't", "exempt from", "failed to", 
    "failing to", "far from", "free from", "had not", "hadn't", "has not", "hasn't", 
    "have not", "haven't", "inability to", "incapable of", "inadequate to", 
    "insufficient to", "instead of", "is no plan", "is not", "isn't", "lack of", 
    "may not", "might not", "must not", "mustn't", "need not", "needn't", 
    "never", "no attempt to", "no chance to", "no effort to", "no intention to", 
    "no longer", "no means to", "no opportunity to", "no plans to", "no way to", 
    "not enough", "not yet", "other than", "rather than", "scarcity of", 
    "short of", "shortage of", "should not", "shouldn't", "too few", "too few to", 
    "too little", "too little to", "unable to", "was not", "wasn't", "were not", 
    "weren't", "will not", "without", "without any", "without proper", 
    "without sufficient", "without the", "won't", "would not", "wouldn't"
]

# Negation prefixes (alphabetically sorted)
negation_prefixes = [
    "anti", "counter", "de", "dis", "false", "fake", "il", "im", "in", "ir", 
    "mis", "non", "pseudo", "un"
]

# Words that shouldn't cause negation in positive green contexts
protected_green_context_words = [
    "abate", "curb", "cut", "declining", "decrease", "decreasing", "eliminate", 
    "fewer", "less", "low", "lower", "minimal", "reduce", "reduction", "slash", "zero"
]

# Verbs that when negated should exclude green terms in their scope
reduction_negative_verbs = [
    "abate", "abating", "control", "controlling", "curb", "curbing", "cut", "cutting", 
    "decrease", "decreasing", "eliminate", "eliminating", "limit", "limiting", 
    "lower", "lowering", "minimize", "minimizing", "minimise", "minimising", 
    "reduce", "reducing", "remove", "removing", "restrict", "restricting", 
    "slash", "slashing"
]

In [None]:
def find_context_for_neutral_term(doc, neutral_token_start, neutral_token_end, all_existing_terms):
    """
    Find context words (negative/improvement) for a neutral term.
    Returns: (context_found, context_word, context_type, context_relationship)
    """
    # Get the main token of the neutral term
    main_token = doc[neutral_token_start]
    
    # For multiword terms, find the syntactic head
    if neutral_token_end > neutral_token_start:
        tokens_in_term = [doc[i] for i in range(neutral_token_start, neutral_token_end + 1)]
        for token in tokens_in_term:
            if token.head not in tokens_in_term or token.head == token:
                main_token = token
                break
    
    # Apply distance filtering to head_subtree scope
    head_subtree_all = list(main_token.head.subtree)
    head_subtree_filtered = [
        token for token in head_subtree_all 
        if (main_token.i - 8) <= token.i <= (main_token.i + 4)
    ]
    
    # Define search scopes with restricted head_subtree
    search_scopes = [
        ("subtree", list(main_token.subtree)),
        ("ancestors", list(main_token.ancestors)),
        ("head_subtree", head_subtree_filtered)
    ]
    
    # Collect context word lists
    negative_context_words = (set(negative_descriptor_words) | 
                             set(protected_green_context_words) | 
                             set(reduction_negative_verbs))
    
    improvement_context_words = (set(improvement_adjectives) | 
                                set(improvement_verbs) | 
                                set(improvement_adverbs))
    
    # Search for context in each scope
    for scope_name, scope_tokens in search_scopes:
        for token in scope_tokens:
            token_lemma = token.lemma_.lower()
            
            # Skip if this token is part of existing green terms
            if is_word_part_of_green_terms(token, all_existing_terms):
                continue
            
            # Check for negative context
            if token_lemma in negative_context_words:
                if is_context_related_to_term(token, main_token, scope_name):
                    return True, token.text, "negative", token.dep_
            
            # Check for improvement context  
            if token_lemma in improvement_context_words:
                if is_context_related_to_term(token, main_token, scope_name):
                    return True, token.text, "improvement", token.dep_
    
    return False, None, None, None

def is_context_related_to_term(context_token, neutral_token, scope_name):
    """
    Validate that context word is syntactically related to neutral term.
    """
    # Direct dependency relationship
    if context_token.head == neutral_token or neutral_token.head == context_token:
        return True
    
    # Same head (siblings)
    if context_token.head == neutral_token.head and scope_name == "head_subtree":
        return True
    
    # Ancestor relationship
    if scope_name == "ancestors":
        neutral_ancestors = list(neutral_token.ancestors)
        if context_token in neutral_ancestors[:3]:
            return True
    
    # Subtree relationship
    if scope_name == "subtree":
        if context_token in list(neutral_token.subtree):
            return True
    
    return False

def find_context_dependent_terms(doc, excluded_positions, all_existing_terms):
    """
    Find neutral terms that become green when paired with context words.
    Returns: (found_terms, context_excluded_positions)
    """
    found_terms = []
    context_excluded_positions = set()
    
    # Find single neutral terms
    for i, token in enumerate(doc):
        if i in excluded_positions or i in context_excluded_positions:
            continue
        
        lemma_lower = token.lemma_.lower()
        if lemma_lower in neutral_nouns:
            context_found, context_word, context_type, context_rel = find_context_for_neutral_term(
                doc, i, i, all_existing_terms
            )
            
            if context_found:
                found_terms.append({
                    'term': f"{context_word} {token.text}",
                    'pos': "context_dependent_noun",
                    'start_idx': i,
                    'end_idx': i,
                    'sentence': token.sent,
                    'neutral_part': token.text,
                    'context_word': context_word,
                    'context_type': context_type,
                    'context_relationship': context_rel,
                    'negated': False
                })
                context_excluded_positions.add(i)
    
    # Find multiword neutral terms
    tokens = [token.lemma_.lower() for token in doc]
    # Lowercase the raw text once instead of once per (base_word, modifier) pair
    text_lower = doc.text.lower()
    
    for base_word, modifiers in neutral_multiword_nouns.items():
        for modifier in modifiers:
            # Pattern 1: modifier-base (e.g., "carbon-footprint")
            pattern1 = f"{modifier}-{base_word}"
            # Pattern 2: modifier base (e.g., "carbon footprint")
            pattern2 = f"{modifier} {base_word}"
            
            # Search in original text for hyphenated version
            for match in re.finditer(re.escape(pattern1), text_lower):
                start_char = match.start()
                end_char = match.end()
                
                # Find corresponding token positions
                start_token_idx = None
                end_token_idx = None
                
                for j, token in enumerate(doc):
                    if token.idx <= start_char < token.idx + len(token.text):
                        start_token_idx = j
                    if token.idx < end_char <= token.idx + len(token.text):
                        end_token_idx = j
                        break
                
                if (start_token_idx is not None and end_token_idx is not None and
                    start_token_idx not in excluded_positions and start_token_idx not in context_excluded_positions):
                    
                    context_found, context_word, context_type, context_rel = find_context_for_neutral_term(
                        doc, start_token_idx, end_token_idx, all_existing_terms
                    )
                    
                    if context_found:
                        found_terms.append({
                            'term': f"{context_word} {pattern1}",
                            'pos': "context_dependent_multiword_noun",
                            'start_idx': start_token_idx,
                            'end_idx': end_token_idx,
                            'sentence': doc[start_token_idx].sent,
                            'neutral_part': pattern1,
                            'context_word': context_word,
                            'context_type': context_type,
                            'context_relationship': context_rel,
                            'negated': False
                        })
                        for idx in range(start_token_idx, end_token_idx + 1):
                            context_excluded_positions.add(idx)
            
            # Search for space-separated version
            for j in range(len(tokens) - 1):
                if (tokens[j] == modifier and tokens[j + 1] == base_word and
                    j not in excluded_positions and j not in context_excluded_positions):
                    
                    context_found, context_word, context_type, context_rel = find_context_for_neutral_term(
                        doc, j, j + 1, all_existing_terms
                    )
                    
                    if context_found:
                        found_terms.append({
                            'term': f"{context_word} {pattern2}",
                            'pos': "context_dependent_multiword_noun",
                            'start_idx': j,
                            'end_idx': j + 1,
                            'sentence': doc[j].sent,
                            'neutral_part': pattern2,
                            'context_word': context_word,
                            'context_type': context_type,
                            'context_relationship': context_rel,
                            'negated': False
                        })
                        context_excluded_positions.add(j)
                        context_excluded_positions.add(j + 1)
    
    return found_terms, context_excluded_positions

In [None]:
def is_word_part_of_green_terms(token, all_found_green_terms):
    """Check if a token is part of any existing green term."""
    for green_term in all_found_green_terms:
        green_term_span = range(green_term['start_idx'], green_term['end_idx'] + 1)
        if token.i in green_term_span:
            return True
    return False

def was_used_in_green_pattern(token, all_found_green_terms):
    """Check if this token was used to construct any green dependency pattern."""
    for green_term in all_found_green_terms:
        if green_term['pos'].startswith('dependency_'):
            green_term_span = range(green_term['start_idx'], green_term['end_idx'] + 1)
            if token.i in green_term_span:
                if token.lemma_.lower() in protected_green_context_words:
                    return True
    return False

def is_negation_related_to_term(negation_token, green_term_token, all_found_green_terms):
    """
    Determine if a negation word is actually negating the green term.
    """
    # Check if this negation word is part of any existing green term
    if is_word_part_of_green_terms(negation_token, all_found_green_terms):
        return False
    
    # Check if this negation word was used to construct any green pattern
    if was_used_in_green_pattern(negation_token, all_found_green_terms):
        return False
    
    # Check syntactic relationship using dependency parsing
    if negation_token.head == green_term_token or green_term_token.head == negation_token:
        return True
    
    # Check if they're in the same sentence and have close syntactic relationship
    if negation_token.sent == green_term_token.sent:
        negation_ancestors = [negation_token] + list(negation_token.ancestors)
        green_ancestors = [green_term_token] + list(green_term_token.ancestors)
        
        # If they share close ancestors, they're likely related
        for neg_ancestor in negation_ancestors[:3]:
            if neg_ancestor in green_ancestors[:3]:
                return True
    
    return False

def validate_phrasal_negation(sentence, phrase, green_term_token):
    """
    Validate that a phrasal negation pattern actually applies to the green term.
    """
    sentence_text = sentence.text.lower()
    phrase_start = sentence_text.find(phrase)
    
    if phrase_start == -1:
        return False
    
    # Character offset of the green term relative to the sentence start
    relative_green_start = green_term_token.idx - sentence.start_char
    phrase_end = phrase_start + len(phrase)
    
    # Accept only if the green term starts after the negation phrase
    # and fewer than 50 characters beyond its end
    if phrase_end < relative_green_start < phrase_end + 50:
        return True
    
    return False

def find_negated_reduction_verb(main_token, all_found_green_terms):
    """
    Check if the green term is the object/target of a negated reduction verb.
    Returns: (is_negated, negation_type, negation_text, scope_found)
    """
    # Get all context words that are already being used to create green terms
    used_context_words = set()
    for green_term in all_found_green_terms:
        if 'context_word' in green_term:
            context_word = green_term['context_word'].lower()
            used_context_words.add(context_word)
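            # Add a crude stem. str.rstrip removes character sets, not suffixes
            # (e.g. "need".rstrip('ed') gives "n"), so match each suffix explicitly.
            for suffix in ('ed', 'ing', 's'):
                if context_word.endswith(suffix) and len(context_word) > len(suffix):
                    used_context_words.add(context_word[:-len(suffix)])
                    break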
    
    # Get the context word for this specific green term (if it's context-dependent)
    this_term_context_word_lemma = None
    for green_term in all_found_green_terms:
        term_span = range(green_term['start_idx'], green_term['end_idx'] + 1)
        if main_token.i in term_span and 'context_word' in green_term:
            context_word = green_term['context_word']
            context_doc = nlp(context_word)
            if len(context_doc) > 0:
                this_term_context_word_lemma = context_doc[0].lemma_.lower()
            else:
                this_term_context_word_lemma = context_word.lower()
            break
    
    # Check if this green term is the object of a reduction verb
    for ancestor in main_token.ancestors:
        if ancestor.lemma_.lower() in reduction_negative_verbs:
            # Restrict the subtree scopes to the 5 tokens immediately preceding the verb
            # (the ancestors scope below is not distance-filtered)
            verb_subtree_all = list(ancestor.subtree)
            verb_subtree_filtered = [
                token for token in verb_subtree_all 
                if (ancestor.i - 5) <= token.i < ancestor.i
            ]
            
            verb_head_subtree_all = list(ancestor.head.subtree)
            verb_head_subtree_filtered = [
                token for token in verb_head_subtree_all 
                if (ancestor.i - 5) <= token.i < ancestor.i
            ]
            
            # Check the verb's subtree, ancestors, and head.subtree for negation
            verb_scopes = [
                ("verb_subtree", verb_subtree_filtered),
                ("verb_ancestors", list(ancestor.ancestors)),
                ("verb_head_subtree", verb_head_subtree_filtered)
            ]
            
            for scope_name, scope_tokens in verb_scopes:
                for token in scope_tokens:
                    if (token.dep_ == "neg" or 
                        token.lemma_.lower() in direct_negation_words):
                        if not is_word_part_of_green_terms(token, all_found_green_terms):
                            if token.lemma_.lower() not in used_context_words:
                                if this_term_context_word_lemma is None or token.lemma_.lower() != this_term_context_word_lemma:
                                    return True, f"negated_reduction_verb_{scope_name}", f"'{token.text}' negating '{ancestor.text}'", scope_name
    
    # Also check direct dependency relationships where green term is object
    if main_token.dep_ in ["dobj", "pobj", "attr"]:
        head_verb = main_token.head
        if head_verb.lemma_.lower() in reduction_negative_verbs:
            # Restrict the subtree scopes to the 5 tokens immediately preceding the verb
            # (the ancestors scope below is not distance-filtered)
            direct_verb_subtree_all = list(head_verb.subtree)
            direct_verb_subtree_filtered = [
                token for token in direct_verb_subtree_all 
                if (head_verb.i - 5) <= token.i < head_verb.i
            ]
            
            direct_verb_head_subtree_all = list(head_verb.head.subtree)
            direct_verb_head_subtree_filtered = [
                token for token in direct_verb_head_subtree_all 
                if (head_verb.i - 5) <= token.i < head_verb.i
            ]
            
            # Check if this verb is negated
            verb_scopes = [
                ("direct_verb_subtree", direct_verb_subtree_filtered),
                ("direct_verb_ancestors", list(head_verb.ancestors)),
                ("direct_verb_head_subtree", direct_verb_head_subtree_filtered)
            ]
            
            for scope_name, scope_tokens in verb_scopes:
                for token in scope_tokens:
                    if (token.dep_ == "neg" or 
                        token.lemma_.lower() in direct_negation_words):
                        if not is_word_part_of_green_terms(token, all_found_green_terms):
                            if token.lemma_.lower() not in used_context_words:
                                if this_term_context_word_lemma is None or token.lemma_.lower() != this_term_context_word_lemma:
                                    return True, f"negated_reduction_verb_{scope_name}", f"'{token.text}' negating '{head_verb.text}'", scope_name
    
    return False, None, None, None

def find_negation_in_multiple_scopes(main_token, all_found_green_terms):
    """
    Enhanced negation detection that checks subtree, ancestors, and head.subtree.
    Returns: (is_negated, negation_type, negation_text, scope_found)
    """
    # Get all context words that are already being used to create green terms
    used_context_words = set()
    for green_term in all_found_green_terms:
        if 'context_word' in green_term:
            context_word = green_term['context_word'].lower()
            used_context_words.add(context_word)
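            # Add a crude stem. str.rstrip removes character sets, not suffixes
            # (e.g. "need".rstrip('ed') gives "n"), so match each suffix explicitly.
            for suffix in ('ed', 'ing', 's'):
                if context_word.endswith(suffix) and len(context_word) > len(suffix):
                    used_context_words.add(context_word[:-len(suffix)])
                    break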
    
    # Get the context word for this specific green term (if it's context-dependent)
    this_term_context_word_lemma = None
    for green_term in all_found_green_terms:
        term_span = range(green_term['start_idx'], green_term['end_idx'] + 1)
        if main_token.i in term_span and 'context_word' in green_term:
            context_word = green_term['context_word']
            context_doc = nlp(context_word)
            if len(context_doc) > 0:
                this_term_context_word_lemma = context_doc[0].lemma_.lower()
            else:
                this_term_context_word_lemma = context_word.lower()
            break
    
    # Restrict the subtree scopes to a window from 6 tokens before to 3 tokens
    # after the term (the ancestors scope below is not distance-filtered)
    subtree_all = list(main_token.subtree)
    subtree_filtered = [
        token for token in subtree_all 
        if (main_token.i - 6) <= token.i <= (main_token.i + 3)
    ]
    
    head_subtree_all = list(main_token.head.subtree)
    head_subtree_filtered = [
        token for token in head_subtree_all 
        if (main_token.i - 6) <= token.i <= (main_token.i + 3)
    ]
    
    # Define the three scopes to check
    scopes = [
        ("subtree", subtree_filtered),
        ("ancestors", list(main_token.ancestors)),
        ("head_subtree", head_subtree_filtered)
    ]
    
    # Method 1: Check spaCy's built-in negation detection in all scopes
    for scope_name, scope_tokens in scopes:
        for token in scope_tokens:
            if token.dep_ == "neg":
                if not is_word_part_of_green_terms(token, all_found_green_terms):
                    if token.lemma_.lower() not in used_context_words:
                        if this_term_context_word_lemma is None or token.lemma_.lower() != this_term_context_word_lemma:
                            return True, "spacy_neg", token.text, scope_name
    
    # Method 2: Check for direct negation words in all scopes
    for scope_name, scope_tokens in scopes:
        for token in scope_tokens:
            if token.lemma_.lower() in direct_negation_words:
                if (not is_word_part_of_green_terms(token, all_found_green_terms) and
                    is_negation_related_to_term(token, main_token, all_found_green_terms)):
                    if token.lemma_.lower() not in used_context_words:
                        if this_term_context_word_lemma is None or token.lemma_.lower() != this_term_context_word_lemma:
                            return True, "direct_negation", token.text, scope_name
    
    # Method 3: Check for negative descriptor words in all scopes
    for scope_name, scope_tokens in scopes:
        for token in scope_tokens:
            if token.lemma_.lower() in negative_descriptor_words:
                if (not is_word_part_of_green_terms(token, all_found_green_terms) and
                    not was_used_in_green_pattern(token, all_found_green_terms) and
                    is_negation_related_to_term(token, main_token, all_found_green_terms)):
                    if token.lemma_.lower() not in used_context_words:
                        if this_term_context_word_lemma is None or token.lemma_.lower() != this_term_context_word_lemma:
                            return True, "negative_descriptor", token.text, scope_name
    
    return False, None, None, None

def detect_negation_for_term(doc, green_term_start_idx, green_term_end_idx, all_found_green_terms):
    """
    Comprehensive negation detection for a green term using multiple scopes and reduction verbs.
    Returns: (is_negated, negation_type, negation_text, scope_found)
    """
    # Get the main token of the green term (head token for multiword terms)
    main_token = doc[green_term_start_idx]
    
    # For multiword terms, try to find the head token
    if green_term_end_idx > green_term_start_idx:
        tokens_in_term = [doc[i] for i in range(green_term_start_idx, green_term_end_idx + 1)]
        for token in tokens_in_term:
            if token.head not in tokens_in_term or token.head == token:
                main_token = token
                break
    
    # Check for negated reduction verbs first
    is_negated, negation_type, negation_text, scope = find_negated_reduction_verb(main_token, all_found_green_terms)
    if is_negated:
        return True, negation_type, negation_text, scope
    
    # Enhanced method: check subtree, ancestors, and head.subtree
    is_negated, negation_type, negation_text, scope = find_negation_in_multiple_scopes(main_token, all_found_green_terms)
    if is_negated:
        return True, f"{negation_type}_{scope}", negation_text, scope
    
    # Check for prefix negation on all tokens in the term
    for token_idx in range(green_term_start_idx, green_term_end_idx + 1):
        token = doc[token_idx]
        
        if token.i >= 2:
            prev_token = doc[token.i - 1]
            if prev_token.text == "-":
                prev_prev_token = doc[token.i - 2]
                if prev_prev_token.text.lower() in negation_prefixes:
                    return True, "prefix_negation", f"{prev_prev_token.text}-", "prefix"
    
    # Check for phrasal negation patterns
    sentence = main_token.sent
    sentence_text = sentence.text.lower()
    for phrase in phrasal_negation_patterns:
        if phrase in sentence_text:
            if validate_phrasal_negation(sentence, phrase, main_token):
                return True, "phrasal_negation", phrase, "sentence"
    
    return False, None, None, None

def filter_negated_terms(all_found_terms, doc):
    """
    Filter out negated terms and return both valid and negated term lists.
    """
    valid_terms = []
    negated_terms = []
    
    for term_info in all_found_terms:
        is_negated, negation_type, negation_text, scope = detect_negation_for_term(
            doc, 
            term_info['start_idx'], 
            term_info['end_idx'], 
            all_found_terms
        )
        
        if is_negated:
            term_info['negated'] = True
            term_info['negation_type'] = negation_type
            term_info['negation_text'] = negation_text
            term_info['negation_scope'] = scope
            negated_terms.append(term_info)
        else:
            term_info['negated'] = False
            valid_terms.append(term_info)
    
    return valid_terms, negated_terms

def get_negation_statistics(negated_terms):
    """
    Generate statistics about negation patterns including scope and reduction verb information.
    """
    if not negated_terms:
        return {}
    
    negation_types = Counter()
    negated_by_pos = Counter()
    negation_scopes = Counter()
    reduction_verb_negations = Counter()
    
    for term in negated_terms:
        negation_types[term['negation_type']] += 1
        negated_by_pos[term['pos']] += 1
        if 'negation_scope' in term:
            negation_scopes[term['negation_scope']] += 1
        if 'reduction_verb' in term.get('negation_type', ''):
            reduction_verb_negations[term['negation_type']] += 1
    
    return {
        'total_negated': len(negated_terms),
        'by_type': dict(negation_types),
        'by_pos': dict(negated_by_pos),
        'by_scope': dict(negation_scopes),
        'reduction_verb_negations': dict(reduction_verb_negations),
        'examples': [(term['term'], term['negation_type'], term['negation_text'], 
                     term.get('negation_scope', 'unknown')) 
                    for term in negated_terms[:8]]
    }
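
The phrasal check in `validate_phrasal_negation` comes down to character arithmetic: the green term must start after the negation phrase ends and within a fixed window. Here is a minimal sketch of that idea on plain strings, without spaCy. The helper name `phrase_precedes_term`, the `max_gap` parameter, and the example phrase "no plans to" are assumptions for illustration; the actual contents of `phrasal_negation_patterns` are defined elsewhere in the notebook.

```python
def phrase_precedes_term(sentence_text, phrase, term_start, max_gap=50):
    """Return True if `phrase` ends before `term_start` and within `max_gap` characters.

    `term_start` is the character offset of the term inside `sentence_text`.
    Only the first occurrence of `phrase` is considered.
    """
    phrase_start = sentence_text.lower().find(phrase)
    if phrase_start == -1:
        return False
    phrase_end = phrase_start + len(phrase)
    return phrase_end < term_start < phrase_end + max_gap


# Negation phrase shortly before the term: treated as negating it
sentence = "We have no plans to invest in renewable energy."
near = phrase_precedes_term(sentence, "no plans to", sentence.find("renewable"))

# Term appears before the phrase: not treated as negated
sentence2 = "Renewable energy is growing, but we have no plans to expand."
before = phrase_precedes_term(sentence2, "no plans to", sentence2.find("Renewable"))

print(near, before)  # True False
```

A character window is a coarse proxy for syntactic scope, which may explain why the notebook runs this check last, after the dependency-based checks.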

In [None]:
def find_multiword_terms(doc, multiword_dict, pos_tag):
    """Find multiword terms in document and return positions to exclude from single word counting."""
    found_terms = []
    excluded_positions = set()
    
    # Lowercased lemmas for the space-separated match
    tokens = [token.lemma_.lower() for token in doc]
    # Lowercased raw text for the hyphenated match (computed once, not per pattern)
    text_lower = doc.text.lower()
    
    for base_word, modifiers in multiword_dict.items():
        for modifier in modifiers:
            # Pattern 1: modifier-base (e.g., "eco-friendly")
            pattern1 = f"{modifier}-{base_word}"
            # Pattern 2: modifier base (e.g., "eco friendly") 
            pattern2 = f"{modifier} {base_word}"
            
            # Search the raw text for the hyphenated version
            for match in re.finditer(re.escape(pattern1), text_lower):
                start_char = match.start()
                end_char = match.end()
                
                # Find corresponding token positions
                start_token_idx = None
                end_token_idx = None
                
                for i, token in enumerate(doc):
                    if token.idx <= start_char < token.idx + len(token.text):
                        start_token_idx = i
                    if token.idx < end_char <= token.idx + len(token.text):
                        end_token_idx = i
                        break
                
                if start_token_idx is not None and end_token_idx is not None:
                    found_terms.append({
                        'term': pattern1,
                        'pos': pos_tag,
                        'start_idx': start_token_idx,
                        'end_idx': end_token_idx,
                        'sentence': doc[start_token_idx].sent,
                        'negated': False
                    })
                    # Mark positions as excluded
                    for idx in range(start_token_idx, end_token_idx + 1):
                        excluded_positions.add(idx)
            
            # Search for space-separated version
            for i in range(len(tokens) - 1):
                if tokens[i] == modifier and tokens[i + 1] == base_word:
                    found_terms.append({
                        'term': pattern2,
                        'pos': pos_tag,
                        'start_idx': i,
                        'end_idx': i + 1,
                        'sentence': doc[i].sent,
                        'negated': False
                    })
                    # Mark positions as excluded
                    excluded_positions.add(i)
                    excluded_positions.add(i + 1)
    
    return found_terms, excluded_positions

def find_single_word_terms(doc, word_list, pos_tag, excluded_positions):
    """Find single word terms, excluding positions already counted in multiword terms."""
    found_terms = []
    
    for i, token in enumerate(doc):
        if i in excluded_positions:
            continue
            
        lemma_lower = token.lemma_.lower()
        if lemma_lower in word_list:
            # Skip "sustainability" if followed by "report" or "reporting"
            if lemma_lower == "sustainability" and i + 1 < len(doc):
                next_token = doc[i + 1]
                if next_token.lemma_.lower() in {"report", "reporting"}:
                    continue

            # Skip "PV" if it's between brackets like "(PV)"
            if lemma_lower == "pv":
                left_char = doc.text[token.idx - 1] if token.idx > 0 else ""
                right_char = doc.text[token.idx + len(token)] if token.idx + len(token) < len(doc.text) else ""
                if left_char == "(" and right_char == ")":
                    continue

            found_terms.append({
                'term': lemma_lower,
                'pos': pos_tag,
                'start_idx': i,
                'end_idx': i,
                'sentence': token.sent,
                'negated': False
            })

    return found_terms

def find_dependency_patterns(doc, excluded_positions):
    """Find dependency-based green patterns."""
    found_terms = []
    dependency_excluded_positions = set()
    
    for pattern_name, pattern_info in dependency_green_patterns.items():
        head_words = pattern_info["head_words"]
        dependent_words = pattern_info["dependent_words"]
        dependency_relations = pattern_info["dependency_relations"]
        
        for token in doc:
            # Skip if token is already counted
            if token.i in excluded_positions or token.i in dependency_excluded_positions:
                continue
                
            token_lemma = token.lemma_.lower()
            
            # Check if current token is a head word
            if token_lemma in head_words:
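                # Look for matching tokens anywhere in the head word's subtree
                # (token.subtree covers all descendants, not only direct children)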
                for child in token.subtree:
                    if (child.dep_ in dependency_relations and 
                        child.lemma_.lower() in dependent_words and
                        child.i not in excluded_positions and
                        child.i not in dependency_excluded_positions):
                        
                        # Found a match
                        term_text = f"{child.text} {token.text}"
                        found_terms.append({
                            'term': term_text,
                            'pos': f"dependency_{pattern_name}",
                            'start_idx': min(child.i, token.i),
                            'end_idx': max(child.i, token.i),
                            'sentence': token.sent,
                            'pattern': pattern_name,
                            'dependency': child.dep_,
                            'negated': False
                        })
                        # Mark both positions as used
                        dependency_excluded_positions.add(child.i)
                        dependency_excluded_positions.add(token.i)
            
            # Check if current token is a dependent word
            elif token_lemma in dependent_words:
                # Look at its head
                head = token.head
                if (token.dep_ in dependency_relations and
                    head.lemma_.lower() in head_words and
                    head.i not in excluded_positions and
                    head.i not in dependency_excluded_positions):
                    
                    # Found a match
                    term_text = f"{token.text} {head.text}"
                    found_terms.append({
                        'term': term_text,
                        'pos': f"dependency_{pattern_name}",
                        'start_idx': min(token.i, head.i),
                        'end_idx': max(token.i, head.i),
                        'sentence': token.sent,
                        'pattern': pattern_name,
                        'dependency': token.dep_,
                        'negated': False
                    })
                    # Mark both positions as used
                    dependency_excluded_positions.add(token.i)
                    dependency_excluded_positions.add(head.i)
    
    return found_terms, dependency_excluded_positions

def get_context_words(doc, start_idx, end_idx, context_size=5):
    """Extract context words around the found term."""
    # Get the sentence containing the term
    sentence = doc[start_idx].sent
    sent_start = sentence.start
    sent_end = sentence.end
    
    # Calculate context boundaries within the sentence
    context_start = max(sent_start, start_idx - context_size)
    context_end = min(sent_end, end_idx + context_size + 1)
    
    # Extract context tokens
    context_tokens = []
    for i in range(context_start, context_end):
        if i == start_idx and start_idx == end_idx:
            # Single word term - highlight it
            context_tokens.append(f"**{doc[i].text}**")
        elif i == start_idx:
            # Start of multiword term
            context_tokens.append(f"**{doc[i].text}")
        elif i == end_idx:
            # End of multiword term
            context_tokens.append(f"{doc[i].text}**")
        else:
            # Interior of a multiword term or a regular context word
            context_tokens.append(doc[i].text)
    
    return " ".join(context_tokens)

def analyze_green_terms_with_negation(doc, doc_name):
    """
    Analyze all green terms in a document with enhanced multi-scope negation detection and context-dependent terms.
    Returns: (valid_counts, valid_terms, negated_terms, original_counts)
    """
    print(f"\n{'='*60}")
    print(f"ANALYZING: {doc_name}")
    print(f"{'='*60}")
    
    all_found_terms = []
    all_excluded_positions = set()
    
    # Step 1: Find multiword terms first (all types together)
    multiword_noun_terms, excluded_noun_pos = find_multiword_terms(doc, green_multiword_nouns, "multiword_noun")
    multiword_adj_terms, excluded_adj_pos = find_multiword_terms(doc, green_multiword_adjectives, "multiword_adjective")
    multiword_adv_terms, excluded_adv_pos = find_multiword_terms(doc, green_multiword_adverbs, "multiword_adverb")
    
    # Combine all multiword terms and check for overlaps
    all_multiword_terms = multiword_noun_terms + multiword_adj_terms + multiword_adv_terms
    
    # Remove duplicates based on position overlap
    filtered_multiword_terms = []
    used_positions = set()
    
    # Sort by start position to process in order
    all_multiword_terms.sort(key=lambda x: x['start_idx'])
    
    for term in all_multiword_terms:
        # Check if this term overlaps with any already used positions
        term_positions = set(range(term['start_idx'], term['end_idx'] + 1))
        if not term_positions.intersection(used_positions):
            filtered_multiword_terms.append(term)
            used_positions.update(term_positions)
            all_excluded_positions.update(term_positions)
    
    all_found_terms.extend(filtered_multiword_terms)
    
    # Step 2: Find dependency patterns (excluding already counted positions)
    dependency_terms, dependency_excluded_pos = find_dependency_patterns(doc, all_excluded_positions)
    
    # Filter dependency terms to avoid overlap with multiword terms
    filtered_dependency_terms = []
    for dep_term in dependency_terms:
        dep_positions = set(range(dep_term['start_idx'], dep_term['end_idx'] + 1))
        if not dep_positions.intersection(all_excluded_positions):
            filtered_dependency_terms.append(dep_term)
            all_excluded_positions.update(dep_positions)
    
    all_found_terms.extend(filtered_dependency_terms)
    
    # Step 3: Find context-dependent terms (excluding already counted positions)
    context_dependent_terms, context_excluded_pos = find_context_dependent_terms(doc, all_excluded_positions, all_found_terms)
    
    # Filter context-dependent terms to avoid overlap
    filtered_context_terms = []
    for ctx_term in context_dependent_terms:
        ctx_positions = set(range(ctx_term['start_idx'], ctx_term['end_idx'] + 1))
        if not ctx_positions.intersection(all_excluded_positions):
            filtered_context_terms.append(ctx_term)
            all_excluded_positions.update(ctx_positions)
    
    all_found_terms.extend(filtered_context_terms)
    
    # Step 4: Find single word terms with position tracking to prevent double counting
    # Process in priority order: verbs > nouns > adjectives > adverbs
    
    # Priority 1: Verbs (actions are often most important)
    single_verb_terms = find_single_word_terms(doc, green_verbs, "verb", all_excluded_positions)
    for term in single_verb_terms:
        all_excluded_positions.add(term['start_idx'])
    all_found_terms.extend(single_verb_terms)
    
    # Priority 2: Nouns (concrete green concepts)
    single_noun_terms = find_single_word_terms(doc, green_nouns, "noun", all_excluded_positions)
    for term in single_noun_terms:
        all_excluded_positions.add(term['start_idx'])
    all_found_terms.extend(single_noun_terms)
    
    # Priority 3: Adjectives (descriptive green terms)
    single_adj_terms = find_single_word_terms(doc, green_adjectives, "adjective", all_excluded_positions)
    for term in single_adj_terms:
        all_excluded_positions.add(term['start_idx'])
    all_found_terms.extend(single_adj_terms)
    
    # Priority 4: Adverbs (manner of green actions)
    single_adv_terms = find_single_word_terms(doc, green_adverbs, "adverb", all_excluded_positions)
    for term in single_adv_terms:
        all_excluded_positions.add(term['start_idx'])
    all_found_terms.extend(single_adv_terms)
    
    # Step 5: Apply enhanced multi-scope negation detection
    print(f"Found {len(all_found_terms)} green terms before negation filtering...")
    valid_terms, negated_terms = filter_negated_terms(all_found_terms, doc)
    print(f"After negation filtering: {len(valid_terms)} valid terms, {len(negated_terms)} negated terms")
    
    # Count by type for original, valid, and negated terms
    original_type_counts = Counter()
    valid_type_counts = Counter()
    negated_type_counts = Counter()
    
    for term_info in all_found_terms:
        original_type_counts[term_info['pos']] += 1
    
    for term_info in valid_terms:
        valid_type_counts[term_info['pos']] += 1
    
    for term_info in negated_terms:
        negated_type_counts[term_info['pos']] += 1
    
    # Print comprehensive counts
    print(f"\nGREEN TERMS COUNTS (ORIGINAL | NEGATED | VALID):")
    print(f"Nouns: {original_type_counts['noun']} | {negated_type_counts['noun']} | {valid_type_counts['noun']}")
    print(f"Multiword Nouns: {original_type_counts['multiword_noun']} | {negated_type_counts['multiword_noun']} | {valid_type_counts['multiword_noun']}")
    print(f"Adjectives: {original_type_counts['adjective']} | {negated_type_counts['adjective']} | {valid_type_counts['adjective']}")  
    print(f"Multiword Adjectives: {original_type_counts['multiword_adjective']} | {negated_type_counts['multiword_adjective']} | {valid_type_counts['multiword_adjective']}")
    print(f"Verbs: {original_type_counts['verb']} | {negated_type_counts['verb']} | {valid_type_counts['verb']}")
    print(f"Adverbs: {original_type_counts['adverb']} | {negated_type_counts['adverb']} | {valid_type_counts['adverb']}")
    print(f"Multiword Adverbs: {original_type_counts['multiword_adverb']} | {negated_type_counts['multiword_adverb']} | {valid_type_counts['multiword_adverb']}")
    print(f"Context-Dependent Nouns: {original_type_counts['context_dependent_noun']} | {negated_type_counts['context_dependent_noun']} | {valid_type_counts['context_dependent_noun']}")
    print(f"Context-Dependent Multiword Nouns: {original_type_counts['context_dependent_multiword_noun']} | {negated_type_counts['context_dependent_multiword_noun']} | {valid_type_counts['context_dependent_multiword_noun']}")
    
    # Count dependency patterns
    original_dependency_counts = Counter()
    valid_dependency_counts = Counter()
    negated_dependency_counts = Counter()
    
    for term_info in all_found_terms:
        if term_info['pos'].startswith('dependency_'):
            original_dependency_counts[term_info['pos']] += 1
    
    for term_info in valid_terms:
        if term_info['pos'].startswith('dependency_'):
            valid_dependency_counts[term_info['pos']] += 1
            
    for term_info in negated_terms:
        if term_info['pos'].startswith('dependency_'):
            negated_dependency_counts[term_info['pos']] += 1
    
    print(f"Dependency Patterns: {sum(original_dependency_counts.values())} | {sum(negated_dependency_counts.values())} | {sum(valid_dependency_counts.values())}")
    
    for dep_type in original_dependency_counts.keys():
        pattern_name = dep_type.replace('dependency_', '')
        orig_count = original_dependency_counts[dep_type]
        neg_count = negated_dependency_counts[dep_type]
        val_count = valid_dependency_counts[dep_type]
        print(f"  {pattern_name}: {orig_count} | {neg_count} | {val_count}")
    
    print(f"TOTAL: {sum(original_type_counts.values())} | {sum(negated_type_counts.values())} | {sum(valid_type_counts.values())}")
    
    # Generate negation statistics
    negation_stats = get_negation_statistics(negated_terms)
    if negation_stats:
        print(f"\nENHANCED NEGATION ANALYSIS:")
        print(f"Total negated terms: {negation_stats['total_negated']}")
        print(f"Negation types: {negation_stats['by_type']}")
        print(f"Negation scopes: {negation_stats['by_scope']}")
        print(f"Examples of negated terms (with scope):")
        for term, neg_type, neg_text, scope in negation_stats['examples']:
            print(f"  - '{term}' (negated by: {neg_text}, type: {neg_type}, scope: {scope})")
    
    # Sort valid terms by their position in the text
    valid_terms_sorted = sorted(valid_terms, key=lambda x: x['start_idx'])
    
    # Print valid terms in order they appear in the text
    print(f"\nVALID TERMS IN TEXT ORDER:")
    print("-" * 40)
    for i, term_info in enumerate(valid_terms_sorted[:20], 1):
        # Format the term type for display
        if term_info['pos'].startswith('dependency_'):
            pattern_name = term_info['pos'].replace('dependency_', '').upper()
            term_type = f"DEPENDENCY {pattern_name} TERM"
        elif term_info['pos'].startswith('context_dependent_'):
            ctx_type = term_info.get('context_type', 'unknown').upper()
            base_type = term_info['pos'].replace('context_dependent_', '').upper().replace('_', ' ')
            term_type = f"CONTEXT-DEPENDENT {base_type} ({ctx_type})"
        else:
            term_type = term_info['pos'].upper().replace('_', ' ') + " TERM"
        
        context = get_context_words(doc, term_info['start_idx'], term_info['end_idx'])
        print(f"{i}. {term_type}: {term_info['term']}")
        
        # Add dependency info if it's a dependency pattern
        if 'dependency' in term_info:
            print(f"   (Dependency: {term_info['dependency']})")
        
        # Add context info if it's a context-dependent term
        if 'context_word' in term_info:
            print(f"   (Context: '{term_info['context_word']}' -> Neutral: '{term_info['neutral_part']}')")
        
        print(f"   Context: {context}")
        print()
    
    # If there are negated terms, show some examples
    if negated_terms:
        print(f"\nEXAMPLES OF NEGATED TERMS (EXCLUDED FROM COUNT):")
        print("-" * 40)
        negated_terms_sorted = sorted(negated_terms, key=lambda x: x['start_idx'])
        for i, term_info in enumerate(negated_terms_sorted[:10], 1):
            if term_info['pos'].startswith('dependency_'):
                pattern_name = term_info['pos'].replace('dependency_', '').upper()
                term_type = f"DEPENDENCY {pattern_name} TERM"
            elif term_info['pos'].startswith('context_dependent_'):
                ctx_type = term_info.get('context_type', 'unknown').upper()
                base_type = term_info['pos'].replace('context_dependent_', '').upper().replace('_', ' ')
                term_type = f"CONTEXT-DEPENDENT {base_type} ({ctx_type})"
            else:
                term_type = term_info['pos'].upper().replace('_', ' ') + " TERM"
            
            context = get_context_words(doc, term_info['start_idx'], term_info['end_idx'])
            print(f"{i}. {term_type}: {term_info['term']}")
            scope_info = f" in {term_info.get('negation_scope', 'unknown')} scope" if 'negation_scope' in term_info else ""
            print(f"   Negated by: {term_info['negation_text']} (type: {term_info['negation_type']}{scope_info})")
            
            # Add context info if it's a context-dependent term
            if 'context_word' in term_info:
                print(f"   (Context: '{term_info['context_word']}' -> Neutral: '{term_info['neutral_part']}')")
            
            print(f"   Context: {context}")
            print()
    
    return valid_type_counts, valid_terms, negated_terms, original_type_counts

In [None]:
all_results = {}
for doc_name, doc in documents.items():
    valid_counts, valid_terms, negated_terms, original_counts = analyze_green_terms_with_negation(doc, doc_name)
    all_results[doc_name] = {
        'valid_counts': valid_counts,
        'valid_terms': valid_terms,
        'negated_terms': negated_terms,
        'original_counts': original_counts,
        'total_tokens': len(doc),
        'total_sentences': len(list(doc.sents))
    }

# Cell 10: Print Comprehensive Summary
print(f"\n{'='*140}")
print("COMPREHENSIVE SUMMARY - GREEN TERMS ANALYSIS WITH CONTEXT-DEPENDENT TERMS + MULTI-SCOPE NEGATION")
print(f"{'='*140}")

print("\n1. ORIGINAL COUNTS (Before Negation Filtering)")
print(f"{'Document':<25} {'Nouns':<8} {'M-Nouns':<8} {'Adj':<8} {'M-Adj':<8} {'Verbs':<8} {'Adv':<8} {'M-Adv':<8} {'Ctx-N':<8} {'Ctx-MN':<8} {'Dep-Pat':<8} {'Total':<8}")
print("-" * 140)

for doc_name, results in all_results.items():
    counts = results['original_counts']
    dependency_total = sum(count for pos_type, count in counts.items() if pos_type.startswith('dependency_'))
    total = sum(counts.values())
    print(f"{doc_name:<25} {counts['noun']:<8} {counts['multiword_noun']:<8} {counts['adjective']:<8} {counts['multiword_adjective']:<8} {counts['verb']:<8} {counts['adverb']:<8} {counts['multiword_adverb']:<8} {counts['context_dependent_noun']:<8} {counts['context_dependent_multiword_noun']:<8} {dependency_total:<8} {total:<8}")

print("\n2. NEGATED TERMS (Filtered Out)")
print(f"{'Document':<25} {'Nouns':<8} {'M-Nouns':<8} {'Adj':<8} {'M-Adj':<8} {'Verbs':<8} {'Adv':<8} {'M-Adv':<8} {'Ctx-N':<8} {'Ctx-MN':<8} {'Dep-Pat':<8} {'Total':<8}")
print("-" * 140)

for doc_name, results in all_results.items():
    negated_counts = Counter()
    for term in results['negated_terms']:
        negated_counts[term['pos']] += 1
    
    dependency_total = sum(count for pos_type, count in negated_counts.items() if pos_type.startswith('dependency_'))
    total = sum(negated_counts.values())
    print(f"{doc_name:<25} {negated_counts['noun']:<8} {negated_counts['multiword_noun']:<8} {negated_counts['adjective']:<8} {negated_counts['multiword_adjective']:<8} {negated_counts['verb']:<8} {negated_counts['adverb']:<8} {negated_counts['multiword_adverb']:<8} {negated_counts['context_dependent_noun']:<8} {negated_counts['context_dependent_multiword_noun']:<8} {dependency_total:<8} {total:<8}")

print("\n3. FINAL VALID COUNTS (After Multi-Scope + Reduction Verb Negation Filtering)")
print(f"{'Document':<25} {'Nouns':<8} {'M-Nouns':<8} {'Adj':<8} {'M-Adj':<8} {'Verbs':<8} {'Adv':<8} {'M-Adv':<8} {'Ctx-N':<8} {'Ctx-MN':<8} {'Dep-Pat':<8} {'Total':<8}")
print("-" * 140)

for doc_name, results in all_results.items():
    counts = results['valid_counts']
    dependency_total = sum(count for pos_type, count in counts.items() if pos_type.startswith('dependency_'))
    total = sum(counts.values())
    print(f"{doc_name:<25} {counts['noun']:<8} {counts['multiword_noun']:<8} {counts['adjective']:<8} {counts['multiword_adjective']:<8} {counts['verb']:<8} {counts['adverb']:<8} {counts['multiword_adverb']:<8} {counts['context_dependent_noun']:<8} {counts['context_dependent_multiword_noun']:<8} {dependency_total:<8} {total:<8}")

print("\n4. CONTEXT-DEPENDENT TERMS ANALYSIS")
print("-" * 80)

# Analyze context-dependent terms by type
all_context_types = Counter()
all_context_words = Counter()
context_examples = []

for doc_name, results in all_results.items():
    doc_context_negative = 0
    doc_context_improvement = 0
    
    for term in results['valid_terms']:
        if term['pos'].startswith('context_dependent_'):
            context_type = term.get('context_type', 'unknown')
            context_word = term.get('context_word', 'unknown')
            all_context_types[context_type] += 1
            all_context_words[context_word.lower()] += 1
            
            if context_type == 'negative':
                doc_context_negative += 1
            elif context_type == 'improvement':
                doc_context_improvement += 1
            
            if len(context_examples) < 10:
                # Use .get() so a term missing these optional keys cannot raise KeyError
                context_examples.append((doc_name, term['term'], context_word, term.get('neutral_part', 'unknown'), context_type))
    
    total_context = doc_context_negative + doc_context_improvement
    if total_context > 0:
        print(f"{doc_name}: {total_context} context-dependent terms ({doc_context_negative} negative, {doc_context_improvement} improvement)")

print(f"\nOverall context type distribution: {dict(all_context_types)}")
print(f"Top context words: {dict(all_context_words.most_common(10))}")

print(f"\nExamples of context-dependent terms:")
for doc, term, context, neutral, ctx_type in context_examples:
    print(f"  - '{term}' = '{context}' + '{neutral}' ({ctx_type} context) [{doc}]")

print("\n5. NEGATION IMPACT SUMMARY")
print(f"{'Document':<25} {'Original':<10} {'Negated':<10} {'Valid':<10} {'Negation %':<12}")
print("-" * 70)

for doc_name, results in all_results.items():
    original_total = sum(results['original_counts'].values())
    negated_total = len(results['negated_terms'])
    valid_total = sum(results['valid_counts'].values())
    negation_pct = (negated_total / original_total * 100) if original_total > 0 else 0
    
    print(f"{doc_name:<25} {original_total:<10} {negated_total:<10} {valid_total:<10} {negation_pct:<11.1f}%")

print(f"\n{'='*140}")
print("ANALYSIS COMPLETE - Enhanced with Context-Dependent Terms + Multi-Scope Negation Detection")
print("Key features:")
print("1. Context-dependent terms: neutral terms that become green with negative/improvement context")
print("2. Multi-scope negation detection: subtree, ancestors, head.subtree, and sentence-level")
print("3. Reduction verb negation: detects when positive verbs are negated")
print("4. Comprehensive term classification and overlap prevention")
print(f"{'='*140}")

# Context Classification

## Temporal

In [None]:
import re
from collections import Counter, defaultdict

def get_temporal_markers_for_year(report_year):
    """
    Get temporal markers adjusted for the specific report year.
    """
    if report_year == 2021:
        return {
            'past': {
                'year_patterns': [r'\b(in|during|since|until|before)\s+(19\d{2}|200\d|201\d|2020)\b', r'\b(19\d{2}|200\d|201\d|2020)\b'],
                'relative_patterns': ['last year', 'previous year', 'previously', 'before', 'ago', 'earlier', 'historic', 'past', 'prior', 
                                      'former', 'preceding', 'until now', 'so far', 'to date', 'historically', 'last quarter', 'previous quarter', 
                                      'last month', 'previous month', 'last semester', 'previous semester', 'last period', 'previous period'],
                'specific_patterns': ['since 2020', 'since 2019', 'since 2018', 'since 2017', 'since 2016', 'since 2015', 'until 2020', 
                                      'before 2021', 'up to 2020', 'through 2020', 'by 2020', 'Q1 2020', 'Q2 2020', 'Q3 2020', 'Q4 2020', 
                                      'first quarter 2020', 'second quarter 2020', 'third quarter 2020', 'fourth quarter 2020', 'H1 2020', 
                                      'H2 2020', 'first half 2020', 'second half 2020']
            },
            'present': {
                'year_patterns': [r'\b(in|during|this|current)\s+2021\b', r'\b2021\b'],
                'relative_patterns': ['currently', 'now', 'today', 'this year', 'present', 'ongoing', 'continue', 'continues', 'at present', 
                                      'as of now', 'current', 'existing', 'active', 'in progress', 'underway', 'throughout this year', 
                                      'during this period', 'as we speak', 'at this time', 'at this point', 'right now', 'presently'],
                'specific_patterns': ['as of now', 'at present', 'to date', 'in 2021', 'during 2021', 'this year', 'current year', 'Q1 2021', 
                                      'Q2 2021', 'Q3 2021', 'Q4 2021', 'this quarter', 'current quarter', 'H1 2021', 'H2 2021', 'first half 2021', 
                                      'second half 2021', 'as of 2021', 'as of end-2021', 'year-to-date', 'YTD 2021', 'as of December 2021', 
                                      'end of 2021']
            },
            'future': {
                'year_patterns': [r'\b(by|until|before|from|after)\s+(202[2-9]|20[3-9]\d)\b', r'\b(202[2-9]|20[3-9]\d)\b'],
                'relative_patterns': ['next year', 'future', 'upcoming', 'planned', 'will', 'shall', 'intend', 'target', 'aim', 'expect', 
                                      'anticipate', 'forecast', 'project', 'outlook', 'forward', 'ahead', 'coming', 'prospective'],
                'specific_patterns': ['by 2030', 'by 2025', 'by 2026', 'by 2027', 'by 2028', 'by 2029', 'in the future', 'from 2022', 
                                      'after 2021', 'beyond 2021', 'going forward', 'by 2024', 'by 2023', 'by 2022', 'by 2035', 'by 2040', 
                                      'by 2050', 'from 2022 onwards', 'starting 2022', 'beginning 2022', 'Q1 2022', 'Q2 2022', 'Q3 2022', 
                                      'Q4 2022', 'next quarter', 'H1 2022', 'H2 2022', 'first half 2022', 'second half 2022',]
            }
        }
    
    elif report_year == 2022:
        return {
            'past': {
                'year_patterns': [r'\b(in|during|since|until|before)\s+(19\d{2}|200\d|201\d|2020|2021)\b', r'\b(19\d{2}|200\d|201\d|2020|2021)\b'],
                'relative_patterns': ['last year', 'previous year', 'previously', 'before', 'ago', 'earlier', 'historic', 'past', 'prior', 
                                      'former', 'preceding', 'until now', 'so far', 'to date', 'historically', 'last quarter', 'previous quarter', 
                                      'last month', 'previous month', 'last semester', 'previous semester', 'last period', 'previous period',],
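                'specific_patterns': ['since 2021', 'since 2020', 'since 2019', 'since 2018', 'since 2017', 'since 2016', 'since 2015', 
                                      'until 2021', 'until 2020', 'before 2022', 'before 2021', 'up to 2021', 'up to 2020', 
                                      'through 2021', 'through 2020', 'by 2021', 'by 2020', 'Q1 2021', 'Q2 2021', 'Q3 2021', 'Q4 2021', 
                                      'Q1 2020', 'Q2 2020', 'Q3 2020', 'Q4 2020', 'first quarter 2021', 'second quarter 2021', 
                                      'third quarter 2021', 'fourth quarter 2021', 'H1 2021', 'H2 2021', 'H1 2020', 'H2 2020', 
                                      'first half 2021', 'second half 2021', 'first half 2020', 'second half 2020']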
            },
            'present': {
                'year_patterns': [r'\b(in|during|this|current)\s+2022\b', r'\b2022\b'],
                'relative_patterns': ['currently', 'now', 'today', 'this year', 'present', 'ongoing', 'continue', 'continues', 'at present', 
                                      'as of now', 'current', 'existing', 'active', 'in progress', 'underway', 'throughout this year', 
                                      'during this period', 'as we speak', 'at this time', 'at this point', 'right now', 'presently'],
                'specific_patterns': ['as of now', 'at present', 'to date', 'in 2022', 'during 2022', 'this year', 'current year', 'Q1 2022', 
                                      'Q2 2022', 'Q3 2022', 'Q4 2022', 'this quarter', 'current quarter', 'H1 2022', 'H2 2022', 'first half 2022', 
                                      'second half 2022', 'as of 2022', 'as of end-2022', 'year-to-date', 'YTD 2022', 'as of December 2022', 
                                      'end of 2022']
            },
            'future': {
                'year_patterns': [r'\b(by|until|before|from|after)\s+(202[3-9]|20[3-9]\d)\b', r'\b(202[3-9]|20[3-9]\d)\b'],
                'relative_patterns': ['next year', 'future', 'upcoming', 'planned', 'will', 'shall', 'intend', 'target', 'aim', 'expect', 
                                      'anticipate', 'forecast', 'project', 'outlook', 'forward', 'ahead', 'coming', 'prospective'],
                'specific_patterns': ['by 2030', 'by 2025', 'by 2026', 'by 2027', 'by 2028', 'by 2029', 'in the future', 'from 2023', 
                                      'after 2022', 'beyond 2022', 'going forward', 'by 2024', 'by 2023', 'by 2035', 'by 2040', 
                                      'by 2050', 'from 2023 onwards', 'starting 2023', 'beginning 2023', 'Q1 2023', 'Q2 2023', 'Q3 2023', 
                                      'Q4 2023', 'next quarter', 'H1 2023', 'H2 2023', 'first half 2023', 'second half 2023']
            }
        }
    
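    else:
        # No dedicated marker set exists for this year. Warn and fall back to the
        # 2021 set (the same default year used by determine_report_year) so that
        # callers always receive a dict with 'past', 'present' and 'future' keys.
        print(f"Warning: No specific temporal markers defined for report year {report_year}. Using default (2021) patterns.")
        return get_temporal_markers_for_year(2021)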

# Enhanced auxiliary verb patterns for tense detection
AUXILIARY_PATTERNS = {
    'future': [
        'will', 'shall', 'going to', 'plan to', 'intend to', 'aim to', 'expect to',
        'hope to', 'seek to', 'strive to', 'endeavor to', 'commit to', 'pledge to',
        'target to', 'set to', 'scheduled to', 'due to', 'about to', 'prepare to',
        'ready to', 'poised to', 'bound to', 'likely to', 'expected to', 'projected to',
        'forecast to', 'anticipated to', 'destined to', 'planning to', 'intending to', 
        'would', 'could potentially', 'might eventually', 'may ultimately'
    ],
    'present_perfect': [
        'have', 'has', 'have been', 'has been', 'have done', 'has done',
        'have achieved', 'has achieved', 'have implemented', 'has implemented',
        'have established', 'has established', 'have developed', 'has developed'
    ],
    'past_perfect': [
        'had', 'had been', 'had done', 'had achieved', 'had implemented',
        'had established', 'had developed', 'had completed', 'had finished'
    ],
    'present_continuous': [
        'am', 'is', 'are', 'am being', 'is being', 'are being',
        'am doing', 'is doing', 'are doing', 'am working', 'is working', 'are working'
    ],
    'past_continuous': [
        'was', 'were', 'was being', 'were being', 'was doing', 'were doing',
        'was working', 'were working', 'was implementing', 'were implementing'
    ],
    'conditional': [
        'would', 'could', 'should', 'might', 'may', 'must',
        'would be', 'could be', 'should be', 'might be', 'may be', 'must be',
        'would need to', 'could potentially', 'should ideally', 'might require',
        'may involve', 'must ensure', 'would enable', 'could facilitate'
    ]
}

def determine_report_year(doc_name):
    """
    Determine the report year from the document name.
    """
    if '2021' in doc_name:
        return 2021
    elif '2022' in doc_name:
        return 2022
    else:
        # Try to extract year from document name
        year_match = re.search(r'(20\d{2})', doc_name)
        if year_match:
            return int(year_match.group(1))
        else:
            return 2021  # Default fallback
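
The hard-coded 2021 and 2022 marker sets above need a new branch for every additional report year. As a minimal sketch of one possible alternative, the snippet below labels each four-digit year mentioned in a sentence by comparing it with the report year. The names `YEAR_RE` and `classify_year_mentions` are hypothetical and are not part of the notebook; the sketch covers only bare year mentions, not the relative or phrase-based markers.

```python
import re

# Assumption: only four-digit years from 1900 to 2099 are of interest
YEAR_RE = re.compile(r'\b(19\d{2}|20\d{2})\b')

def classify_year_mentions(text, report_year):
    """Label each four-digit year in `text` relative to `report_year`.

    Returns a list of (year_text, category, char_offset) tuples, where
    category is 'past', 'present' or 'future'.
    """
    labels = []
    for match in YEAR_RE.finditer(text):
        year = int(match.group(1))
        if year < report_year:
            category = 'past'
        elif year == report_year:
            category = 'present'
        else:
            category = 'future'
        labels.append((match.group(1), category, match.start()))
    return labels

# Example with a report year that has no hard-coded branch
mentions = classify_year_mentions(
    "Emissions fell since 2019 and we target net zero by 2040.", 2023
)
for year_text, category, offset in mentions:
    print(year_text, category, offset)
```

The character offset is kept so that a caller could map each mention back to a token position, in the same way the marker search does for its regex matches.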

In [None]:
def find_governing_verb(green_term, doc):
    """
    Find the governing verb for a green term using dependency parsing.
    
    Args:
        green_term: Dictionary containing green term info with start_idx, end_idx, pos
        doc: spaCy doc object
    
    Returns:
        spaCy token object of governing verb or None
    """
    start_idx = green_term['start_idx']
    end_idx = green_term['end_idx']
    pos_type = green_term['pos']
    
    # If the green term itself is a verb, return its first token
    if 'verb' in pos_type.lower():
        return doc[start_idx]
    
    # For non-verb terms, find the governing verb through dependency parsing
    main_token = doc[start_idx]  # Use the first token as the main token
    
    # Strategy 1: Look for the head verb directly
    current_token = main_token
    max_depth = 10  # Prevent infinite loops
    depth = 0
    
    while current_token.head != current_token and depth < max_depth:
        if current_token.head.pos_ == 'VERB' or current_token.head.pos_ == 'AUX':
            return current_token.head
        current_token = current_token.head
        depth += 1
    
    # Strategy 2: Look for verbs in the same sentence that govern this term
    sentence = main_token.sent
    for token in sentence:
        if token.pos_ == 'VERB' or token.pos_ == 'AUX':
            # Check if this verb governs our green term through dependency relations
            for child in token.subtree:
                if start_idx <= child.i <= end_idx:
                    return token
    
    # Strategy 3: Find the root verb of the sentence
    sentence_root = main_token.sent.root
    if sentence_root.pos_ == 'VERB' or sentence_root.pos_ == 'AUX':
        return sentence_root
    
    return None

def extract_verb_tense(governing_verb):
    """
    Extract tense information from a governing verb using enhanced patterns.
    
    Args:
        governing_verb: spaCy token object of the verb
    
    Returns:
        String: 'past', 'present', 'future', or 'unclear'
    """
    if governing_verb is None:
        return 'unclear'
    
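    # A modal auxiliary attached to the verb ("will be installed", "could be reduced")
    # signals future or conditional intent even when the participle itself carries
    # Tense=Past, so check for it before reading the verb's own morphology
    if any(child.dep_ in ('aux', 'auxpass') and child.tag_ == 'MD' for child in governing_verb.children):
        return 'future'
    
    # Primary method: Use spaCy's morphological analysis
    tense_info = governing_verb.morph.get("Tense")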
    if tense_info:
        tense_value = tense_info[0].lower() if isinstance(tense_info, list) else tense_info.lower()
        if 'past' in tense_value:
            return 'past'
        elif 'pres' in tense_value:
            return 'present'
        elif 'fut' in tense_value:
            return 'future'
    
    # Secondary method: match auxiliary patterns as whole token sequences
    sentence = governing_verb.sent
    sentence_tokens = [token.text.lower() for token in sentence]
    
    for pattern_type, patterns in AUXILIARY_PATTERNS.items():
        for pattern in patterns:
            pattern_tokens = pattern.split()
            for i, sent_token in enumerate(sentence):
                # Require every word of the pattern to match consecutive tokens,
                # so 'plan to' does not fire on a bare 'plan' elsewhere in the sentence
                if sentence_tokens[i:i + len(pattern_tokens)] != pattern_tokens:
                    continue
                # Verify proximity to our verb (within 5 tokens)
                if abs(sent_token.i - governing_verb.i) <= 5:
                    if pattern_type in ['future', 'conditional']:
                        return 'future'  # Conditional often implies future intent
                    elif pattern_type in ['present_perfect', 'present_continuous']:
                        return 'present'
                    elif pattern_type in ['past_perfect', 'past_continuous']:
                        return 'past'
    
    # Enhanced tense detection based on verb form and context
    if governing_verb.tag_ in ['VBD', 'VBN']:  # Past tense, past participle
        return 'past'
    elif governing_verb.tag_ in ['VBZ', 'VBP', 'VBG']:  # Present tense forms
        return 'present'
    elif governing_verb.tag_ == 'MD':  # Modal verb
        return 'future'
    elif governing_verb.tag_ == 'VB':  # Base form - often future with auxiliary
        return 'future'
    
    return 'unclear'

In [None]:
def find_temporal_markers(green_term, governing_verb, doc, report_year):
    """
    Find relevant temporal markers near the green term or governing verb.
    
    Args:
        green_term: Dictionary containing green term info
        governing_verb: spaCy token object of governing verb
        doc: spaCy doc object
        report_year: Year of the report (2021 or 2022)
    
    Returns:
        Tuple: (temporal_category, marker_text, proximity_score)
    """
    green_start = green_term['start_idx']
    green_end = green_term['end_idx']
    verb_pos = governing_verb.i if governing_verb else green_start
    
    sentence = doc[green_start].sent
    sentence_text = sentence.text.lower()
    
    # Get temporal markers for the specific report year
    TEMPORAL_MARKERS = get_temporal_markers_for_year(report_year)
    
    # Search within 5 tokens of green term or governing verb
    search_positions = [green_start, green_end]
    if governing_verb:
        search_positions.append(verb_pos)
    
    best_match = None
    best_proximity = float('inf')
    best_category = None
    
    # Check each temporal category
    for category, patterns in TEMPORAL_MARKERS.items():
        # Check year patterns
        for pattern in patterns['year_patterns']:
            matches = re.finditer(pattern, sentence_text, re.IGNORECASE)
            for match in matches:
                # Find token position of match
                match_start_char = match.start()
                match_token_pos = None
                for token in sentence:
                    if token.idx <= sentence.start_char + match_start_char < token.idx + len(token.text):
                        match_token_pos = token.i
                        break
                
                if match_token_pos is not None:  # token index 0 is a valid position
                    # Calculate proximity to green term or governing verb
                    proximity = min(abs(match_token_pos - pos) for pos in search_positions)
                    if proximity <= 5 and proximity < best_proximity:
                        best_match = match.group()
                        best_proximity = proximity
                        best_category = category
        
        # Check relative and specific phrase patterns (identical matching logic)
        for pattern in patterns['relative_patterns'] + patterns['specific_patterns']:
            # sentence_text is lowercased, so lowercase the pattern too ('Q1 2020', 'YTD 2021');
            # word boundaries keep 'now' from matching inside 'known'
            phrase_match = re.search(r'\b' + re.escape(pattern.lower()) + r'\b', sentence_text)
            if phrase_match is None:
                continue
            
            # Find token position of the first occurrence
            pattern_start = phrase_match.start()
            match_token_pos = None
            for token in sentence:
                token_start_in_sent = token.idx - sentence.start_char
                if token_start_in_sent <= pattern_start < token_start_in_sent + len(token.text):
                    match_token_pos = token.i
                    break
            
            if match_token_pos is not None:
                proximity = min(abs(match_token_pos - pos) for pos in search_positions)
                if proximity <= 5 and proximity < best_proximity:
                    best_match = pattern
                    best_proximity = proximity
                    best_category = category
    
    # The proximity slot is None when no marker was found
    return best_category, best_match, (best_proximity if best_match else None)

def classify_temporal_context(green_term, doc, report_year):
    """
    Classify the temporal context of a green term.
    
    Args:
        green_term: Dictionary containing green term info
        doc: spaCy doc object
        report_year: Year of the report (2021 or 2022)
    
    Returns:
        String: 'past', 'present', 'future', or 'unclear'
    """
    # Step 1: Find governing verb
    governing_verb = find_governing_verb(green_term, doc)
    
    # Step 2: Extract verb tense
    verb_tense = extract_verb_tense(governing_verb)
    
    # Step 3: Find temporal markers
    temporal_category, marker_text, proximity = find_temporal_markers(green_term, governing_verb, doc, report_year)
    
    # Step 4: Combine evidence with precedence rules
    # A temporal marker within the 5-token window always takes precedence over
    # verb tense, because an explicit marker is usually the more specific evidence
    if temporal_category and proximity is not None and proximity <= 5:
        return temporal_category
    elif verb_tense != 'unclear':
        # No close temporal markers, use verb tense
        return verb_tense
    else:
        # No clear evidence
        return 'unclear'

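The precedence rule in Step 4 can be checked in isolation, without a spaCy pipeline. The following is a minimal sketch under the assumption that the three pieces of evidence have already been computed; `combine_temporal_evidence` and its `window` parameter are hypothetical names used only for illustration.

```python
def combine_temporal_evidence(verb_tense, marker_category, proximity, window=5):
    """Standalone illustration of the Step 4 precedence rule.

    verb_tense:      'past', 'present', 'future' or 'unclear'
    marker_category: category of the nearest temporal marker, or None
    proximity:       token distance to that marker, or None
    window:          maximum distance at which a marker is trusted (assumed 5)
    """
    # A marker inside the window overrides whatever the verb tense says
    if marker_category is not None and proximity is not None and proximity <= window:
        return marker_category
    # Otherwise fall back to verb tense when it is informative
    if verb_tense != 'unclear':
        return verb_tense
    return 'unclear'

# Marker two tokens away wins over a conflicting verb tense
print(combine_temporal_evidence('present', 'future', 2))
# Marker found but outside the window: verb tense is used instead
print(combine_temporal_evidence('past', 'future', 8))
# No marker and no usable tense
print(combine_temporal_evidence('unclear', None, None))
```

Keeping the rule in a small pure function like this makes it straightforward to unit-test conflict cases such as a present-tense verb next to "by 2030".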
In [None]:
def add_temporal_classification(all_results, documents):
    """
    Add temporal classification to all valid green terms across all documents.
    
    Args:
        all_results: Dictionary containing results from green term analysis
        documents: Dictionary containing spaCy doc objects
    
    Returns:
        Dictionary: Updated results with temporal classification added
    """
    print(f"\n{'='*80}")
    print("TEMPORAL CLASSIFICATION ANALYSIS")
    print(f"{'='*80}")
    
    # Initialize summary statistics
    overall_temporal_stats = Counter()
    doc_temporal_stats = {}
    
    for doc_name, results in all_results.items():
        print(f"\nProcessing temporal classification for: {doc_name}")
        doc = documents[doc_name]
        valid_terms = results['valid_terms']
        
        # Determine report year for this document
        report_year = determine_report_year(doc_name)
        print(f"  Report year determined: {report_year}")
        
        # Classify each valid term
        temporal_classifications = []
        doc_temporal_counter = Counter()
        
        for i, green_term in enumerate(valid_terms):
            temporal_class = classify_temporal_context(green_term, doc, report_year)
            
            # Add temporal classification to the green term object
            green_term['temporal_class'] = temporal_class
            green_term['report_year'] = report_year
            
            # Update statistics
            temporal_classifications.append(temporal_class)
            doc_temporal_counter[temporal_class] += 1
            overall_temporal_stats[temporal_class] += 1
        
        # Store document-level statistics
        doc_temporal_stats[doc_name] = doc_temporal_counter
        
        # Print document summary
        total_terms = len(valid_terms)
        print(f"  Temporal classification completed for {total_terms} terms:")
        for category, count in doc_temporal_counter.items():
            percentage = (count / total_terms) * 100 if total_terms > 0 else 0
            print(f"    {category.capitalize()}: {count} ({percentage:.1f}%)")
    
    # Print overall summary
    print(f"\n{'='*60}")
    print("TEMPORAL CLASSIFICATION SUMMARY")
    print(f"{'='*60}")
    
    total_terms = sum(overall_temporal_stats.values())
    print(f"Total terms classified: {total_terms}")
    print(f"\nOverall distribution:")
    for category in ['past', 'present', 'future', 'unclear']:
        count = overall_temporal_stats[category]
        percentage = (count / total_terms) * 100 if total_terms > 0 else 0
        print(f"  {category.capitalize()}: {count} ({percentage:.1f}%)")
    
    # Document-level breakdown
    print(f"\nDocument-level breakdown:")
    print(f"{'Document':<25} {'Past':<8} {'Present':<8} {'Future':<8} {'Unclear':<8} {'Total':<8}")
    print("-" * 75)
    
    for doc_name, doc_stats in doc_temporal_stats.items():
        past = doc_stats['past']
        present = doc_stats['present']
        future = doc_stats['future']
        unclear = doc_stats['unclear']
        total = sum(doc_stats.values())
        
        print(f"{doc_name:<25} {past:<8} {present:<8} {future:<8} {unclear:<8} {total:<8}")
    
    # Store temporal statistics in results
    for doc_name in all_results:
        all_results[doc_name]['temporal_stats'] = doc_temporal_stats[doc_name]
    
    return all_results

In [None]:
# APPLY TEMPORAL CLASSIFICATION TO EXISTING RESULTS

print("Applying temporal classification to all valid green terms...")

# Apply temporal classification to all documents
all_results = add_temporal_classification(all_results, documents)

print(f"\n{'='*80}")
print("TEMPORAL CLASSIFICATION INTEGRATION COMPLETE")
print(f"{'='*80}")
print("All valid green terms now include 'temporal_class' field")
print("Report year automatically detected for each document")
print("Temporal statistics added to results")
print("Ready for further analysis")

In [None]:
def analyze_temporal_patterns(all_results):
    """
    Analyze patterns in temporal classification results.
    """
    print(f"\n{'='*80}")
    print("TEMPORAL PATTERN ANALYSIS")
    print(f"{'='*80}")
    
    # Analyze by POS type
    pos_temporal_stats = defaultdict(Counter)
    
    for doc_name, results in all_results.items():
        for term in results['valid_terms']:
            if 'temporal_class' in term:
                pos_type = term['pos']
                temporal_class = term['temporal_class']
                pos_temporal_stats[pos_type][temporal_class] += 1
    
    print("\nTemporal classification by POS type:")
    print(f"{'POS Type':<30} {'Past':<8} {'Present':<8} {'Future':<8} {'Unclear':<8}")
    print("-" * 70)
    
    for pos_type, temporal_counter in pos_temporal_stats.items():
        past = temporal_counter['past']
        present = temporal_counter['present'] 
        future = temporal_counter['future']
        unclear = temporal_counter['unclear']
        print(f"{pos_type:<30} {past:<8} {present:<8} {future:<8} {unclear:<8}")
    
    return pos_temporal_stats

def print_temporal_examples(all_results, documents, max_examples=5):
    """
    Print examples of temporal classifications for verification.
    """
    print(f"\n{'='*80}")
    print("TEMPORAL CLASSIFICATION EXAMPLES")
    print(f"{'='*80}")
    
    # Collect examples by temporal class
    examples_by_class = {'past': [], 'present': [], 'future': [], 'unclear': []}
    
    for doc_name, results in all_results.items():
        doc = documents[doc_name]
        for term in results['valid_terms']:
            if 'temporal_class' in term:
                temporal_class = term['temporal_class']
                if len(examples_by_class[temporal_class]) < max_examples:
                    # Get sentence context. Keep the unstripped text for slicing:
                    # the character offsets below are relative to sentence.start_char,
                    # so stripping first would shift the highlight.
                    sentence = term['sentence']
                    sentence_text = sentence.text
                    
                    # Highlight the green term in the sentence
                    start_char = doc[term['start_idx']].idx
                    end_char = doc[term['end_idx']].idx + len(doc[term['end_idx']].text)
                    sentence_start_char = sentence.start_char
                    
                    relative_start = start_char - sentence_start_char
                    relative_end = end_char - sentence_start_char
                    
                    highlighted_sentence = (
                        sentence_text[:relative_start] + 
                        f"**{sentence_text[relative_start:relative_end]}**" + 
                        sentence_text[relative_end:]
                    ).strip()
                    
                    examples_by_class[temporal_class].append({
                        'term': term['term'],
                        'sentence': highlighted_sentence,
                        'doc': doc_name,
                        'report_year': term.get('report_year', 'Unknown')
                    })
    
    # Print examples
    for temporal_class, examples in examples_by_class.items():
        print(f"\n{temporal_class.upper()} EXAMPLES:")
        print("-" * 40)
        for i, example in enumerate(examples[:max_examples], 1):
            print(f"{i}. Term: '{example['term']}'")
            print(f"   Document: {example['doc']} (Report Year: {example['report_year']})")
            print(f"   Context: {example['sentence']}")
            print()

# Run the analysis
print("Analyzing temporal patterns...")
pos_temporal_patterns = analyze_temporal_patterns(all_results)

print("\nDisplaying temporal classification examples...")
print_temporal_examples(all_results, documents, max_examples=3)

In [None]:
print(f"\n{'='*80}")
print("TEMPORAL CLASSIFICATION VERIFICATION")
print(f"{'='*80}")

# Check a few specific examples to verify classification accuracy
for doc_name, results in all_results.items():
    print(f"\nDocument: {doc_name}")
    report_year = determine_report_year(doc_name)
    print(f"Report Year: {report_year}")
    print("-" * 40)
    
    # Show first 10 terms with their temporal classification
    for i, term in enumerate(results['valid_terms'][:10]):
        if 'temporal_class' in term:
            sentence_text = term['sentence'].text.strip()
            # Truncate long sentences for display
            if len(sentence_text) > 100:
                sentence_text = sentence_text[:100] + "..."
            
            print(f"{i+1:2d}. '{term['term']}' -> {term['temporal_class'].upper()}")
            print(f"    Context: {sentence_text}")
    
    if len(results['valid_terms']) > 10:
        print(f"    ... and {len(results['valid_terms']) - 10} more terms")
    print()

print(f"\n{'='*80}")
print("TEMPORAL CLASSIFICATION COMPLETE")
print(f"{'='*80}")
print("All valid green terms now have temporal classification based on report year.")
print("2021 Reports: 2021=present, ≤2020=past, ≥2022=future")
print("2022 Reports: 2022=present, ≤2021=past, ≥2023=future")
print("Use 'temporal_class' field to filter terms by temporal context.")
print("Example: past_terms = [t for t in valid_terms if t['temporal_class'] == 'past']")
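
The report-year rule printed above can be sketched as a small helper. This is a minimal illustration rather than the notebook's implementation: the name `classify_year_mention` is hypothetical, and the sketch looks only at an explicit year, not at verb tense or other temporal markers.

```python
def classify_year_mention(mentioned_year, report_year):
    """Classify an explicit year relative to the report year (illustrative only)."""
    if mentioned_year < report_year:
        return 'past'
    if mentioned_year > report_year:
        return 'future'
    return 'present'


# For a 2021 report: 2020 is past, 2021 is present, 2030 is future
for year in (2020, 2021, 2030):
    print(year, classify_year_mention(year, 2021))
```

The same helper covers the 2022 reports by passing `report_year=2022`.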

## Quantification

In [None]:
import re
from collections import Counter, defaultdict

# Combined currency pattern for all currencies and amounts
CURRENCY_SYMBOLS = ['$', '€', '£', '¥', '₹', 'USD', 'EUR', 'GBP', 'NOK', 'CZK', 'PLN', 'CHF', 'DKK', 'SEK', 
                   'dollars', 'dollar', 'euros', 'euro', 'pounds', 'pound', 'kroner', 'krone', 'kr',
                   'korun', 'koruna', 'Kč', 'złoty', 'złotych', 'zł', 'francs', 'franc', 'Fr']

CURRENCY_UNITS = ['million', 'M', 'billion', 'B', 'thousand', 'K', 'trillion', 'T']

# Percentage patterns
PERCENTAGE_PATTERNS = ['%', 'percent', 'per cent', 'percentage', 'pct', 'pc']

# Energy and environmental units
UNIT_PATTERNS = {
    'energy': ['MW', 'GW', 'TW', 'kW', 'MWh', 'GWh', 'TWh', 'kWh', 'Wh'],
    'emissions': ['tons', 'tonnes', 'tCO2', 'tCO2e', 'CO2', 'CO2e', 'kt', 'Mt', 'Gt'],
    'volume': ['m3', 'm³', 'cubic meters', 'litres', 'liters', 'gallons'],
    'weight': ['kg', 'tons', 'tonnes', 'pounds', 'lbs'],
    'area': ['hectares', 'km2', 'km²', 'square meters', 'm2', 'm²'],
    'distance': ['km', 'kilometers', 'miles', 'meters', 'm'],
    'time': ['years', 'months', 'days', 'hours', 'hrs']
}

# Organizational terms to exclude with decimal numbers
ORGANIZATIONAL_TERMS = [
    'policy', 'management', 'governance', 'analysis', 'methodology', 
    'approach', 'framework', 'overview', 'introduction', 'background',
    'scope', 'objectives', 'strategy', 'implementation', 'monitoring',
    'reporting', 'compliance', 'assessment', 'evaluation', 'review',
    'taxonomy', 'participations', 'stakeholder', 'engagement'
    ]

# Objects/patterns to EXCLUDE from meaningful quantifications
EXCLUDED_OBJECTS = [
    # Document structure
    'chapter', 'chapters', 'section', 'sections', 'page', 'pages', 'appendix', 'appendices',
    'table', 'tables', 'figure', 'figures', 'chart', 'charts', 'graph', 'graphs',
    'note', 'notes', 'footnote', 'footnotes', 'reference', 'references',
    
    # Organizational structure  
    'employees', 'staff', 'workers', 'people', 'personnel', 'team', 'teams',
    'office', 'offices', 'location', 'locations', 'site', 'sites', 'facility', 'facilities',
    'country', 'countries', 'region', 'regions', 'city', 'cities',
    'department', 'departments', 'division', 'divisions', 'unit', 'units',
    
    # Business entities
    'customer', 'customers', 'client', 'clients', 'supplier', 'suppliers',
    'shareholder', 'shareholders', 'stakeholder', 'stakeholders', 'investor', 'investors',
    'partner', 'partners', 'contractor', 'contractors',
    
    # Time periods (non-quantitative)
    'year', 'years', 'month', 'months', 'quarter', 'quarters', 'week', 'weeks',
    'day', 'days', 'hour', 'hours', 'minute', 'minutes',
    'years of experience', 'years experience', 'experience',
    
    # Document numbering
    'item', 'items', 'point', 'points', 'step', 'steps', 'phase', 'phases',
    'level', 'levels', 'tier', 'tiers', 'grade', 'grades', 'class', 'classes',
    
    # Financial/legal (non-environmental)
    'share', 'shares', 'stock', 'stocks', 'option', 'options',
    'contract', 'contracts', 'agreement', 'agreements',
    
    # Generic counts that aren't meaningful
    'number', 'numbers', 'amount', 'amounts', 'quantity', 'quantities',
    'total', 'sum', 'count', 'instance', 'instances', 'case', 'cases'
]

# Expanded relative quantifiers
RELATIVE_QUANTIFIERS = [
    # Multiplication
    'doubled', 'tripled', 'quadrupled', 'quintupled', 'multiplied',
    'more than doubled', 'nearly doubled', 'almost doubled', 'over doubled',
    'more than tripled', 'nearly tripled', 'almost tripled',
    'increased twofold', 'increased threefold', 'increased fourfold',
    'grew twofold', 'grew threefold', 'expanded twofold',
    
    # Division/reduction
    'halved', 'quartered', 'cut in half', 'cut by half', 'reduced by half',
    'decreased by half', 'slashed in half', 'divided by half',
    'cut by three quarters', 'reduced by three quarters',
    
    # Fractional increases/decreases
    'increased by half', 'grew by half', 'expanded by half',
    'increased by a third', 'grew by a third', 'expanded by a third',
    'decreased by a third', 'reduced by a third', 'cut by a third',
    'increased by two thirds', 'grew by two thirds',
    'decreased by two thirds', 'reduced by two thirds',
    
    # Significant changes
    'dramatically increased', 'dramatically decreased', 'dramatically reduced',
    'significantly increased', 'significantly decreased', 'significantly reduced',
    'substantially increased', 'substantially decreased', 'substantially reduced',
    'markedly increased', 'markedly decreased', 'markedly reduced',
    'considerably increased', 'considerably decreased', 'considerably reduced',
    
    # Amplification terms
    'amplified', 'magnified', 'intensified', 'escalated', 'boosted',
    'enhanced', 'strengthened', 'reinforced', 'accelerated',
    'diminished', 'weakened', 'scaled back', 'scaled down', 'minimized',
    
    # Comparative terms
    'surpassed', 'exceeded', 'outperformed', 'outdid', 'topped',
    'fell short of', 'underperformed', 'lagged behind',

    # EXPANDED LIST - Add more common relative terms
    'increased', 'decreased', 'reduced', 'improved', 'enhanced', 'boosted',
    'declined', 'dropped', 'fell', 'rose', 'grew', 'expanded',
    'contracted', 'shrank', 'diminished', 'escalated', 'accelerated',
    'slowed', 'stabilized', 'maintained', 'sustained', 'optimized',
    'maximized', 'minimized', 'elevated', 'lowered', 'raised',
    'strengthened', 'weakened', 'intensified', 'amplified', 'magnified',
    'better', 'worse', 'higher', 'lower', 'greater', 'lesser',
    'more', 'less', 'fewer', 'additional', 'extra', 'surplus',
    'deficit', 'shortfall', 'excess', 'beyond', 'below', 'above',
    'up', 'down', 'upward', 'downward', 'forward', 'backward',
    'progress', 'regress', 'advance', 'retreat', 'gain', 'loss',
    'positive', 'negative', 'favorable', 'unfavorable', 'beneficial', 'detrimental'
]

# Year patterns to exclude (but with context checking)
YEAR_PATTERNS = [
    r'\b(19|20)\d{2}\b',  # 4-digit years
    r'\b(19|20)\d{2}-(19|20)\d{2}\b',  # Year ranges
    r'\bby\s+(19|20)\d{2}\b',  # "by YEAR"
    r'\bin\s+(19|20)\d{2}\b',  # "in YEAR"
    r'\bsince\s+(19|20)\d{2}\b',  # "since YEAR"
    r'\buntil\s+(19|20)\d{2}\b'  # "until YEAR"
]
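
Some unit strings appear under more than one category in `UNIT_PATTERNS` (for example `tons` under both `emissions` and `weight`), so a single unit token can map to several categories. The sketch below builds a reverse index to make that visible. It works on a reduced copy of the dictionary, and `build_unit_index` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

# Reduced copy of UNIT_PATTERNS, for illustration only
SAMPLE_UNIT_PATTERNS = {
    'energy': ['MW', 'GW', 'MWh'],
    'emissions': ['tons', 'tonnes', 'tCO2e'],
    'weight': ['kg', 'tons', 'tonnes'],
}


def build_unit_index(unit_patterns):
    """Map each lower-cased unit string to every category that lists it."""
    index = defaultdict(list)
    for category, units in unit_patterns.items():
        for unit in units:
            index[unit.lower()].append(category)
    return dict(index)


unit_index = build_unit_index(SAMPLE_UNIT_PATTERNS)
print(unit_index['tons'])  # ['emissions', 'weight']
print(unit_index['mw'])    # ['energy']
```

A unit that maps to several categories produces one candidate per category, which is one reason overlap removal is applied afterwards.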

In [None]:
def reconstruct_currency_quantifications(doc):
    """
    Reconstruct currency quantifications using unified pattern detection.
    """
    currency_quantifications = []
    
    for i, token in enumerate(doc):
        token_text = token.text.lower()
        
        # Check if current token is a currency symbol
        if any(symbol.lower() == token_text for symbol in CURRENCY_SYMBOLS):
            # Look for numbers within 3 tokens before or after the currency symbol
            for direction in [-1, 1]:
                for offset in range(1, 4):
                    num_idx = i + (direction * offset)
                    if 0 <= num_idx < len(doc):
                        num_token = doc[num_idx]
                        
                        if num_token.pos_ == 'NUM' or num_token.like_num:
                            # Found a number, now look for unit markers
                            unit_found = None
                            quantification_tokens = []
                            
                            # Determine the range of tokens to include
                            if direction == -1:  # Number before currency
                                start_idx = num_idx
                                end_idx = i
                            else:  # Number after currency
                                start_idx = i
                                end_idx = num_idx
                            
                            # Look for unit markers after the number
                            for unit_offset in range(1, 3):
                                unit_idx = num_idx + unit_offset
                                if 0 <= unit_idx < len(doc):
                                    unit_token = doc[unit_idx]
                                    if unit_token.text in CURRENCY_UNITS:
                                        unit_found = unit_token.text
                                        end_idx = max(end_idx, unit_idx)
                                        break
                            
                            quantification_tokens = [doc[j].text for j in range(start_idx, end_idx + 1)]
                            quantification_text = ' '.join(quantification_tokens)
                            
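                            currency_quantifications.append({
                                'text': quantification_text,
                                'start_idx': start_idx,
                                'end_idx': end_idx,
                                'type': 'currency',
                                'value': num_token.text,
                                'unit': unit_found
                            })
                            # Stop at the nearest number in this direction so one
                            # currency symbol does not yield several quantifications
                            break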
    
    return currency_quantifications

def reconstruct_percentage_quantifications(doc):
    """
    Reconstruct percentage quantifications from split tokens.
    """
    percentage_quantifications = []
    
    for i, token in enumerate(doc):
        token_text = token.text.lower()
        
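        # Match whole tokens only; a substring test would flag words such as
        # "upcoming" (contains "pc"). "per cent" spans two tokens, so test the pair.
        is_percentage_marker = (
            token_text in PERCENTAGE_PATTERNS or
            (token_text == 'cent' and i > 0 and doc[i - 1].text.lower() == 'per')
        )
        
        if is_percentage_marker:
            # Look for the nearest number within 3 tokens before this percentage marker
            for offset in range(0, 4):
                num_idx = i - offset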
                if 0 <= num_idx < len(doc):
                    num_token = doc[num_idx]
                    
                    if num_token.pos_ == 'NUM' or num_token.like_num:
                        quantification_text = ' '.join([doc[j].text for j in range(num_idx, i + 1)])
                        
                        percentage_quantifications.append({
                            'text': quantification_text,
                            'start_idx': num_idx,
                            'end_idx': i,
                            'type': 'percentage',
                            'value': num_token.text
                        })
                        break
    
    return percentage_quantifications

def reconstruct_unit_quantifications(doc):
    """
    Reconstruct unit-based quantifications (MW, tons CO2, etc.).
    """
    unit_quantifications = []
    
    for i, token in enumerate(doc):
        token_text = token.text
        
        # Check against all unit patterns
        for unit_category, units in UNIT_PATTERNS.items():
            if token_text in units or token_text.lower() in [u.lower() for u in units]:
                # Look for the nearest number within 3 tokens before this unit
                for offset in range(0, 4):
                    num_idx = i - offset
                    if 0 <= num_idx < len(doc):
                        num_token = doc[num_idx]
                        
                        if num_token.pos_ == 'NUM' or num_token.like_num:
                            # Check for additional unit words (like "tons CO2")
                            end_idx = i
                            quantification_tokens = [doc[j].text for j in range(num_idx, i + 1)]
                            
                            # Look ahead for additional unit components
                            for lookahead in range(1, 3):
                                next_idx = i + lookahead
                                if next_idx < len(doc):
                                    next_token = doc[next_idx]
                                    if (next_token.text in ['CO2', 'CO2e', 'equivalent', 'eq'] or
                                        next_token.pos_ in ['NOUN', 'PROPN']):
                                        quantification_tokens.append(next_token.text)
                                        end_idx = next_idx
                                    else:
                                        break
                            
                            quantification_text = ' '.join(quantification_tokens)
                            
                            unit_quantifications.append({
                                'text': quantification_text,
                                'start_idx': num_idx,
                                'end_idx': end_idx,
                                'type': 'unit',
                                'category': unit_category,
                                'value': num_token.text,
                                'unit': token_text
                            })
                            break
    
    return unit_quantifications

def find_all_quantifications(doc):
    """
    Find all quantifications in a document by combining different detection methods.
    IMPROVED: Better overlap removal logic.
    """
    all_quantifications = []
    
    # Get different types of quantifications
    currency_quants = reconstruct_currency_quantifications(doc)
    percentage_quants = reconstruct_percentage_quantifications(doc)
    unit_quants = reconstruct_unit_quantifications(doc)
    
    all_quantifications.extend(currency_quants)
    all_quantifications.extend(percentage_quants)
    all_quantifications.extend(unit_quants)
    
    # Remove overlapping quantifications with better logic
    filtered_quantifications = remove_overlapping_quantifications(all_quantifications)
    
    return filtered_quantifications

def remove_overlapping_quantifications(quantifications):
    """
    Remove overlapping quantifications, preferring higher-quality ones.
    """
    if not quantifications:
        return []
    
    # Sort by start position first
    sorted_quants = sorted(quantifications, key=lambda x: x['start_idx'])
    
    # Define type priority (higher number = higher priority)
    type_priority = {
        'currency': 4,
        'percentage': 4,
        'unit': 4,
        'meaningful_count': 2,
        'relative': 1
    }
    
    filtered = []
    
    for current in sorted_quants:
        current_positions = set(range(current['start_idx'], current['end_idx'] + 1))
        current_priority = type_priority.get(current['type'], 0)
        
        # Check for overlaps with already added quantifications
        should_add = True
        to_remove = []
        
        for i, existing in enumerate(filtered):
            existing_positions = set(range(existing['start_idx'], existing['end_idx'] + 1))
            
            # Check if they overlap
            if current_positions & existing_positions:
                existing_priority = type_priority.get(existing['type'], 0)
                
                # Decide which one to keep based on priority and quality
                if current_priority > existing_priority:
                    # Current is better, mark existing for removal
                    to_remove.append(i)
                elif current_priority < existing_priority:
                    # Existing is better, don't add current
                    should_add = False
                    break
                else:
                    # Same priority, prefer the more specific one
                    if is_more_specific(current, existing):
                        to_remove.append(i)
                    else:
                        should_add = False
                        break
        
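        # Apply the pending removals only when current is actually kept;
        # otherwise an entry that current outranked would be dropped even
        # though current itself loses to another overlapping entry
        if should_add:
            # Remove in reverse order to keep the remaining indices valid
            for i in reversed(to_remove):
                filtered.pop(i)
            filtered.append(current)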
    
    return filtered

def is_more_specific(quant1, quant2):
    """
    Determine if quant1 is more specific than quant2.
    """
    # Prefer shorter, more precise quantifications
    len1 = len(quant1['text'])
    len2 = len(quant2['text'])
    
    # If one is much longer, prefer the shorter one
    if len1 < len2 * 0.8:
        return True
    elif len2 < len1 * 0.8:
        return False
    
    # If similar length, prefer the one with less extra text
    text1_clean = quant1['text'].lower().strip()
    text2_clean = quant2['text'].lower().strip()
    
    # Count non-alphanumeric characters (indicators of extra text)
    extra_chars1 = sum(1 for c in text1_clean if not (c.isalnum() or c in '.,-%€$£¥₹'))
    extra_chars2 = sum(1 for c in text2_clean if not (c.isalnum() or c in '.,-%€$£¥₹'))
    
    return extra_chars1 < extra_chars2

In [None]:
def find_meaningful_counts(doc):
    """
    Find meaningful numerical counts that aren't captured by other methods.
    """
    meaningful_counts = []
    
    for i, token in enumerate(doc):
        if token.pos_ == 'NUM' or token.like_num:
            # Skip decimals that are likely organizational (like version numbers)
            if '.' in token.text:
                context_start = max(0, i - 3)
                context_end = min(len(doc), i + 3)
                context_words = [doc[j].text.lower() for j in range(context_start, context_end)]
                
                # Check if surrounded by organizational terms
                if any(org_term in ' '.join(context_words) for org_term in ORGANIZATIONAL_TERMS):
                    continue
            
            # Look for meaningful objects after the number
            for phrase_length in range(1, 5):
                if i + phrase_length < len(doc):
                    object_tokens = [doc[i + j + 1] for j in range(phrase_length)]
                    object_phrase = ' '.join([t.text.lower() for t in object_tokens])
                    
                    # Check if this forms a meaningful quantification
                    if object_tokens:
                        # Must include at least one noun
                        has_noun = any(obj_token.pos_ in ['NOUN', 'PROPN'] for obj_token in object_tokens)
                        
                        if has_noun:
                            # Check exclusions
                            should_exclude = False
                            
                            # Exclusion check against EXCLUDED_OBJECTS on whole words.
                            # A substring test would match 'unit' inside 'community',
                            # and the 'of' in 'years of experience' inside 'offshore'.
                            padded_phrase = f" {object_phrase} "
                            for excluded_obj in EXCLUDED_OBJECTS:
                                if f" {excluded_obj.lower()} " in padded_phrase:
                                    should_exclude = True
                                    break
                            
                            # Also exclude if it looks like a list item (e.g., "1.", "2)", "a)", etc.)
                            if not should_exclude:
                                if re.match(r'^\d+[\.\)]\s*$', token.text) or re.match(r'^[a-z][\.\)]\s*$', token.text):
                                    should_exclude = True
                            
                            if not should_exclude and len(object_phrase.strip()) > 2:  # Must be meaningful
                                end_idx = i + len(object_tokens)
                                quantification_text = f"{token.text} {object_phrase}"
                                
                                meaningful_counts.append({
                                    'text': quantification_text,
                                    'start_idx': i,
                                    'end_idx': min(end_idx, len(doc) - 1),
                                    'type': 'meaningful_count',
                                    'value': token.text,
                                    'object': object_phrase
                                })
                                break
    
    return meaningful_counts

def find_relative_quantifiers(doc):
    """
    Find relative quantifiers like 'doubled', 'tripled', 'halved'.
    """
    relative_quants = []
    
    for i, token in enumerate(doc):
        token_lemma = token.lemma_.lower()
        token_text = token.text.lower()
        
        # Check for single-word relative quantifiers
        if token_lemma in RELATIVE_QUANTIFIERS or token_text in RELATIVE_QUANTIFIERS:
            relative_quants.append({
                'text': token.text,
                'start_idx': i,
                'end_idx': i,
                'type': 'relative',
                'quantifier': token_text
            })
        
        # Check for multi-word patterns
        for phrase_length in [2, 3, 4, 5]:
            if i + phrase_length - 1 < len(doc):
                phrase_tokens = [doc[i + j].text.lower() for j in range(phrase_length)]
                phrase = ' '.join(phrase_tokens)
                
                if phrase in RELATIVE_QUANTIFIERS:
                    relative_quants.append({
                        'text': phrase,
                        'start_idx': i,
                        'end_idx': i + phrase_length - 1,
                        'type': 'relative',
                        'quantifier': phrase
                    })
                    break  # Don't check longer phrases starting at same position
    
    return relative_quants
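
Multi-word quantifiers such as "more than doubled" contain shorter entries ("doubled"), so both can be reported for the same tokens. One way to avoid that is longest-match-first scanning, sketched below on plain strings. This is an illustrative variation, not the notebook's behaviour: `SAMPLE_QUANTIFIERS` is a reduced stand-in for `RELATIVE_QUANTIFIERS`, and `match_relative_quantifiers` is a hypothetical name.

```python
# Reduced stand-in for RELATIVE_QUANTIFIERS, stored as a set for fast lookup
SAMPLE_QUANTIFIERS = {'doubled', 'more than doubled', 'cut by half', 'halved'}
MAX_PHRASE_LEN = max(len(q.split()) for q in SAMPLE_QUANTIFIERS)


def match_relative_quantifiers(tokens):
    """Return (start_idx, end_idx, phrase) tuples, preferring the longest match."""
    lowered = [t.lower() for t in tokens]
    matches = []
    i = 0
    while i < len(lowered):
        # Try the longest phrase that fits first, then shorter ones
        for length in range(min(MAX_PHRASE_LEN, len(lowered) - i), 0, -1):
            phrase = ' '.join(lowered[i:i + length])
            if phrase in SAMPLE_QUANTIFIERS:
                matches.append((i, i + length - 1, phrase))
                i += length  # skip the tokens consumed by this match
                break
        else:
            i += 1  # no phrase starts at this token
    return matches


tokens = "Solar capacity more than doubled while emissions were cut by half".split()
print(match_relative_quantifiers(tokens))
# [(2, 4, 'more than doubled'), (8, 10, 'cut by half')]
```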

In [None]:
def is_year_quantification(quantification, doc):
    """
    Determine if a quantification is actually a year that should be excluded.
    """
    quant_text = quantification['text'].lower()
    
    # Check if the quantification matches year patterns
    for year_pattern in YEAR_PATTERNS:
        if re.search(year_pattern, quant_text, re.IGNORECASE):
            # Special case: Check if year is used with units (like "2025 MW")
            if quantification.get('type') == 'unit':
                return False
            
            # Check context around the quantification for unit indicators
            start_idx = quantification['start_idx']
            end_idx = quantification['end_idx']
            
            # Look at surrounding tokens for unit indicators
            for i in range(max(0, start_idx - 3), min(len(doc), end_idx + 4)):
                if i < start_idx or i > end_idx:
                    token = doc[i]
                    token_text = token.text.lower()
                    
                    # If we find units near the year, it's probably a valid quantification
                    for unit_category, units in UNIT_PATTERNS.items():
                        if token_text in [u.lower() for u in units]:
                            return False
            
            return True
    
    return False

def filter_year_quantifications(quantifications, doc):
    """
    Filter out year-based quantifications that aren't meaningful.
    """
    return [quant for quant in quantifications if not is_year_quantification(quant, doc)]
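
The year filter rests on one idea: a four-digit number starting with 19 or 20 is treated as a year unless a unit appears nearby. A minimal sketch of that rule on plain strings follows. `looks_like_bare_year` and `SAMPLE_UNITS` are hypothetical names, and the unit set is a small stand-in for `UNIT_PATTERNS`.

```python
import re

YEAR_RE = re.compile(r'\b(19|20)\d{2}\b')
SAMPLE_UNITS = {'mw', 'gwh', 'tonnes', 'kg'}  # stand-in for UNIT_PATTERNS


def looks_like_bare_year(quant_text, neighbour_tokens):
    """True when the text contains a year and no neighbouring token is a unit."""
    if not YEAR_RE.search(quant_text):
        return False
    return not any(tok.lower() in SAMPLE_UNITS for tok in neighbour_tokens)


print(looks_like_bare_year('2030', ['by', 'the', 'end']))      # True
print(looks_like_bare_year('2025', ['MW', 'of', 'capacity']))  # False
print(looks_like_bare_year('17,000', ['tonnes']))              # False
```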

In [None]:
def find_syntactic_connection(green_term, quantification, doc):
    """
    Determine if a quantification is syntactically connected to a green term.
    MORE SEMANTICALLY AWARE - only meaningful grammatical relationships.
    """
    green_start = green_term['start_idx']
    green_end = green_term['end_idx']
    quant_start = quantification['start_idx']
    quant_end = quantification['end_idx']
    
    # Check direct dependency relationships
    for green_idx in range(green_start, green_end + 1):
        green_token = doc[green_idx]
        for quant_idx in range(quant_start, quant_end + 1):
            quant_token = doc[quant_idx]
            
            # Direct dependency relationship
            if quant_token.head == green_token or green_token.head == quant_token:
                return True, 0.9, 'direct_dependency'
            
            # Check for specific meaningful relationships
            if green_token.head == quant_token.head:
                shared_head = green_token.head
                
                # Both modify the same noun (e.g., "2,284,254 MWh of clean energy")
                if shared_head.pos_ == 'NOUN' and shared_head.dep_ in ['nsubj', 'dobj', 'pobj']:
                    return True, 0.8, 'shared_noun_head'
                
                # Both are objects of the same verb (e.g., "produced 2,284,254 MWh of clean energy")
                if shared_head.pos_ == 'VERB' and green_token.dep_ in ['dobj', 'pobj'] and quant_token.dep_ in ['dobj', 'pobj']:
                    return True, 0.8, 'shared_verb_objects'
    
    # Check for quantification modifying the green term directly
    # E.g., "17,000 tons of recyclable materials"
    for green_idx in range(green_start, green_end + 1):
        green_token = doc[green_idx]
        for quant_idx in range(quant_start, quant_end + 1):
            quant_token = doc[quant_idx]
            
            # Quantification modifies green term through "of" relationship
            if quant_token.head == green_token and any(child.text.lower() == 'of' for child in quant_token.children):
                return True, 0.9, 'quantification_of_green_term'
            
            # Green term modifies quantification (e.g., "clean energy production")
            if green_token.head == quant_token and green_token.dep_ in ['amod', 'compound']:
                return True, 0.8, 'green_term_modifies_quantification'
    
    return False, 0.0, 'no_syntactic_connection'

def find_proximity_connection(green_term, quantification, doc):
    """
    Determine if a quantification is close enough to a green term to be connected.
    MORE SELECTIVE - only connects to closest quantification if multiple are nearby.
    """
    green_center = (green_term['start_idx'] + green_term['end_idx']) / 2
    quant_center = (quantification['start_idx'] + quantification['end_idx']) / 2
    distance = abs(green_center - quant_center)
    
    # Check if they're in the same sentence
    green_sent = doc[green_term['start_idx']].sent
    quant_sent = doc[quantification['start_idx']].sent
    
    if green_sent == quant_sent:
        # Same sentence - be more selective
        if distance <= 3:
            return True, 0.8, distance
        elif distance <= 6:
            return True, 0.6, distance
        elif distance <= 10:
            return True, 0.4, distance
        else:
            return False, 0.0, distance
    
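    # Different sentences - much more restrictive
    # Adjacent sentences share a boundary: one ends (exclusive token index) where the other starts
    if green_sent.end == quant_sent.start or quant_sent.end == green_sent.start:
        if distance <= 15:
            return True, 0.3, distance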
    
    return False, 0.0, distance
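
The distance tiers used by `find_proximity_connection` can be read as a lookup table. The sketch below reproduces the thresholds on plain numbers, without spaCy. `proximity_score` and `span_center` are hypothetical helpers, and the `relation` argument is an assumption standing in for the sentence comparison.

```python
def span_center(start_idx, end_idx):
    """Midpoint of an inclusive token span."""
    return (start_idx + end_idx) / 2


def proximity_score(distance, relation='same'):
    """Score a token distance; relation is 'same', 'adjacent' or 'other' sentence."""
    if relation == 'adjacent':
        return 0.3 if distance <= 15 else 0.0
    if relation != 'same':
        return 0.0
    # Same sentence: closer spans receive higher scores
    for max_distance, score in ((3, 0.8), (6, 0.6), (10, 0.4)):
        if distance <= max_distance:
            return score
    return 0.0


# Green term at tokens 4-5, quantification at tokens 9-10
distance = abs(span_center(4, 5) - span_center(9, 10))
print(distance, proximity_score(distance))  # 5.0 0.6
```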

def find_pattern_connection(green_term, quantification, doc):
    """
    Find pattern-based connections between green terms and quantifications.
    MUCH MORE RESTRICTIVE - only connects if quantification is directly describing the green term.
    """
    green_start = green_term['start_idx']
    green_end = green_term['end_idx']
    quant_start = quantification['start_idx']
    quant_end = quantification['end_idx']
    
    # Only check if they're in the same sentence
    sentence = doc[green_start].sent
    quant_sentence = doc[quant_start].sent
    
    if sentence != quant_sentence:
        return False, 0.0, 'different_sentences'
    
    # Build a lowercase context window that covers both spans, padded by two
    # tokens on each side and clipped to the sentence. This single expression
    # handles all orderings (green term first, quantification first, overlap).
    window_start = max(min(green_start, quant_start) - 2, sentence.start)
    window_end = min(max(green_end, quant_end) + 3, sentence.end)  # exclusive
    context_text = ' '.join(doc[i].text.lower() for i in range(window_start, window_end))
    
    # Strong patterns - quantification directly describes the green term
    strong_patterns = [
        # Direct measurement patterns
        r'(\d+[\.,]?\d*)\s*(mwh|gwh|twh|kw|mw|gw|tw|kwh)\s+(of|from)\s+(clean|green|renewable)',
        r'(clean|green|renewable)\s+(energy|power)\s+(of|from|produced|generated)?\s*(\d+[\.,]?\d*)',
        r'(save|saving|saved|reduce|reducing|reduced)\s+(approximately|about|around)?\s*(\d+[\.,]?\d*)',
        r'(cost|costs|costing|at|less than|under|over)\s+([€$£¥₹]?\d+[\.,]?\d*)',
        r'(invest|invested|investment|allocated|allocate|spend|spent|spending)\s+([€$£¥₹]?\d+[\.,]?\d*)',
        r'(target|targeting|aim|aiming|goal|objective)\s+(of|to)?\s*(\d+[\.,]?\d*)',
        r'(achieve|achieved|reach|reached|attain|attained)\s+(\d+[\.,]?\d*)',
        r'(increase|increased|grow|grew|rise|rose)\s+(by|to)\s+(\d+[\.,]?\d*)',
        r'(decrease|decreased|reduce|reduced|cut|lower|lowered)\s+(by|to)\s+(\d+[\.,]?\d*)',
        r'(\d+[\.,]?\d*)\s*(tons?|tonnes?|kg|mt|kt|gt)\s+(of|from)\s+(waste|emissions?|co2|recyclable)'
    ]
    
    # Check for strong patterns
    for pattern in strong_patterns:
        if re.search(pattern, context_text, re.IGNORECASE):
            return True, 0.9, 'strong_pattern'
    
    # Medium patterns - quantification is in close proximity with connecting words
    medium_patterns = [
        r'(approximately|about|around|roughly|nearly|almost|close to|up to|as much as|at least|more than|less than|over|under)\s+(\d+[\.,]?\d*)',
        r'(\d+[\.,]?\d*)\s+(per|each|every)\s+(year|month|day|hour)',
        r'(total|sum|amount|quantity)\s+(of|to)?\s*(\d+[\.,]?\d*)',
        r'(\d+[\.,]?\d*)\s*(million|billion|thousand|m|b|k|%|percent)'
    ]
    
    # Only apply medium patterns if green term and quantification are very close (within 5 tokens)
    distance = min(abs(green_start - quant_start), abs(green_end - quant_end))
    if distance <= 5:
        for pattern in medium_patterns:
            if re.search(pattern, context_text, re.IGNORECASE):
                return True, 0.6, 'medium_pattern'
    
    return False, 0.0, 'no_pattern'

def assess_quantification_connection(green_term, quantification, doc):
    """
    Assess the overall connection between a green term and quantification.
    MORE RESTRICTIVE - requires stronger evidence for connection.
    """
    # Try all connection methods
    syntactic_connected, syntactic_strength, syntactic_type = find_syntactic_connection(green_term, quantification, doc)
    proximity_connected, proximity_strength, proximity_distance = find_proximity_connection(green_term, quantification, doc)
    pattern_connected, pattern_strength, pattern_type = find_pattern_connection(green_term, quantification, doc)
    
    # NEW: More restrictive connection logic
    # Need either strong syntactic connection OR strong pattern connection OR very close proximity
    
    if syntactic_connected and syntactic_strength >= 0.8:
        # Strong syntactic connection is sufficient
        return {
            'is_connected': True,
            'connection_strength': syntactic_strength,
            'primary_method': 'syntactic',
            'connection_details': {
                'syntactic': {'connected': True, 'strength': syntactic_strength, 'type': syntactic_type},
                'proximity': {'connected': proximity_connected, 'strength': proximity_strength, 'distance': proximity_distance},
                'pattern': {'connected': pattern_connected, 'strength': pattern_strength, 'type': pattern_type}
            }
        }
    elif pattern_connected and pattern_strength >= 0.8:
        # Strong pattern connection is sufficient
        return {
            'is_connected': True,
            'connection_strength': pattern_strength,
            'primary_method': 'pattern',
            'connection_details': {
                'syntactic': {'connected': syntactic_connected, 'strength': syntactic_strength, 'type': syntactic_type},
                'proximity': {'connected': proximity_connected, 'strength': proximity_strength, 'distance': proximity_distance},
                'pattern': {'connected': True, 'strength': pattern_strength, 'type': pattern_type}
            }
        }
    elif syntactic_connected and proximity_connected and proximity_distance <= 5:
        # Medium syntactic + very close proximity
        combined_strength = (syntactic_strength + proximity_strength) / 2
        return {
            'is_connected': True,
            'connection_strength': combined_strength,
            'primary_method': 'syntactic',
            'connection_details': {
                'syntactic': {'connected': True, 'strength': syntactic_strength, 'type': syntactic_type},
                'proximity': {'connected': True, 'strength': proximity_strength, 'distance': proximity_distance},
                'pattern': {'connected': pattern_connected, 'strength': pattern_strength, 'type': pattern_type}
            }
        }
    elif pattern_connected and proximity_connected and proximity_distance <= 3:
        # Medium pattern + very close proximity
        combined_strength = (pattern_strength + proximity_strength) / 2
        return {
            'is_connected': True,
            'connection_strength': combined_strength,
            'primary_method': 'pattern',
            'connection_details': {
                'syntactic': {'connected': syntactic_connected, 'strength': syntactic_strength, 'type': syntactic_type},
                'proximity': {'connected': True, 'strength': proximity_strength, 'distance': proximity_distance},
                'pattern': {'connected': True, 'strength': pattern_strength, 'type': pattern_type}
            }
        }
    else:
        # No sufficient connection
        return {
            'is_connected': False,
            'connection_strength': 0.0,
            'primary_method': 'none',
            'connection_details': {
                'syntactic': {'connected': syntactic_connected, 'strength': syntactic_strength, 'type': syntactic_type},
                'proximity': {'connected': proximity_connected, 'strength': proximity_strength, 'distance': proximity_distance},
                'pattern': {'connected': pattern_connected, 'strength': pattern_strength, 'type': pattern_type}
            }
        }
    
def filter_best_connections(green_term, all_connected_quantifications, doc):
    """
    Filter to keep only the best/most relevant connections for each green term.
    """
    if not all_connected_quantifications:
        return []
    
    # Group by quantification type
    by_type = {}
    for quant_info in all_connected_quantifications:
        quant_type = quant_info['quantification']['type']
        if quant_type not in by_type:
            by_type[quant_type] = []
        by_type[quant_type].append(quant_info)
    
    # For each type, keep only the best connection(s)
    filtered_connections = []
    
    for quant_type, quant_list in by_type.items():
        # Sort by connection strength (descending)
        quant_list.sort(key=lambda x: x['connection']['connection_strength'], reverse=True)
        
        # Keep only the strongest connection(s) for this type
        if quant_type in ['currency', 'percentage', 'unit']:
            # For important types, keep top 2 connections if they're both strong
            if len(quant_list) >= 2 and quant_list[1]['connection']['connection_strength'] >= 0.7:
                filtered_connections.extend(quant_list[:2])
            else:
                filtered_connections.append(quant_list[0])
        else:
            # For other types, keep only the best one
            filtered_connections.append(quant_list[0])
    
    return filtered_connections
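
The decision cascade in `assess_quantification_connection` is easy to lose among the repeated return dictionaries. The sketch below restates the same four rules in a compact, stand-alone form. `decide_connection` and its tuple arguments are hypothetical names used only for illustration; they are not part of the notebook, and the real function also returns the per-method details.

```python
def decide_connection(syn, prox, pat):
    """Illustrative restatement of the connection rules.

    syn and pat are (connected, strength) pairs; prox is
    (connected, strength, distance). Returns (is_connected, strength, method).
    """
    syn_ok, syn_s = syn
    prox_ok, prox_s, dist = prox
    pat_ok, pat_s = pat

    if syn_ok and syn_s >= 0.8:          # rule 1: strong syntactic link
        return True, syn_s, "syntactic"
    if pat_ok and pat_s >= 0.8:          # rule 2: strong pattern match
        return True, pat_s, "pattern"
    if syn_ok and prox_ok and dist <= 5:  # rule 3: weaker syntax + close proximity
        return True, (syn_s + prox_s) / 2, "syntactic"
    if pat_ok and prox_ok and dist <= 3:  # rule 4: weaker pattern + very close proximity
        return True, (pat_s + prox_s) / 2, "pattern"
    return False, 0.0, "none"


# A strong pattern connects even without proximity support
print(decide_connection((False, 0.0), (False, 0.0, 12.0), (True, 0.9)))
# A medium pattern four tokens away is rejected (rule 4 needs distance <= 3)
print(decide_connection((False, 0.0), (True, 0.6, 4.0), (True, 0.6)))
```

Note the asymmetry between rules 3 and 4: a weaker syntactic link is accepted up to distance 5, while a weaker pattern match needs distance 3 or less.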

In [None]:
def classify_quantification_level(green_term_quantifications):
    """
    Classify the quantification level of a green term based on its connected quantifications.
    """
    if not green_term_quantifications:
        return 'non_quantified', 0.0
    
    # Calculate weighted confidence scores
    highly_quantified_score = 0
    partially_quantified_score = 0
    
    for quant_info in green_term_quantifications:
        quant = quant_info['quantification']
        connection = quant_info['connection']
        quant_type = quant['type']
        connection_strength = connection['connection_strength']
        
        if quant_type in ['currency', 'percentage', 'unit', 'meaningful_count']:
            # Weight by quantification type importance
            if quant_type in ['currency', 'percentage', 'unit']:
                highly_quantified_score += 2.0 * connection_strength  # Strong evidence
            else:  # meaningful_count
                highly_quantified_score += 1.5 * connection_strength  # Moderate evidence
                
        elif quant_type == 'relative':
            partially_quantified_score += 1.0 * connection_strength
    
    # Calculate final confidence
    total_score = highly_quantified_score + partially_quantified_score
    if total_score > 0:
        confidence = min(total_score / 2.0, 1.0)  # Normalize to 0-1
    else:
        confidence = 0.0
    
    # Determine classification based on scores AND confidence
    if highly_quantified_score > 0 and confidence >= 0.3:
        return 'highly_quantified', confidence
    elif partially_quantified_score > 0 and confidence >= 0.2:
        return 'partially_quantified', confidence
    else:
        return 'non_quantified', confidence

def add_quantification_classification(all_results, documents):
    """
    Add quantification classification to all valid green terms.
    """
    print(f"\n{'='*80}")
    print("QUANTIFICATION CLASSIFICATION ANALYSIS")
    print(f"{'='*80}")
    
    # Initialize summary statistics
    overall_quant_stats = Counter()
    doc_quant_stats = {}
    
    for doc_name, results in all_results.items():
        doc = documents[doc_name]
        valid_terms = results['valid_terms']
        
        # Find all quantifications in the document
        currency_quants = reconstruct_currency_quantifications(doc)
        percentage_quants = reconstruct_percentage_quantifications(doc)
        unit_quants = reconstruct_unit_quantifications(doc)
        meaningful_counts = find_meaningful_counts(doc)
        relative_quants = find_relative_quantifiers(doc)
        
        # Combine all quantifications
        all_quantifications = []
        all_quantifications.extend(currency_quants)
        all_quantifications.extend(percentage_quants)
        all_quantifications.extend(unit_quants)
        all_quantifications.extend(meaningful_counts)
        all_quantifications.extend(relative_quants)
        
        # Remove overlapping quantifications (this will handle the duplicates)
        filtered_quantifications = remove_overlapping_quantifications(all_quantifications)
        
        # Filter out year-based quantifications
        filtered_quantifications = filter_year_quantifications(filtered_quantifications, doc)
        
        # Classify each valid green term
        doc_quant_counter = Counter()
        
        for green_term in valid_terms:
            connected_quantifications = []
            
            # Check connection to each quantification
            for quantification in filtered_quantifications:
                connection_assessment = assess_quantification_connection(green_term, quantification, doc)
                
                if connection_assessment['is_connected']:
                    connected_quantifications.append({
                        'quantification': quantification,
                        'connection': connection_assessment
                    })
            
            # NEW: Filter to keep only the best connections
            connected_quantifications = filter_best_connections(green_term, connected_quantifications, doc)
            
            # Classify quantification level from the filtered connections
            quantification_level, quantification_confidence = classify_quantification_level(connected_quantifications)

            # Add to green term
            green_term['quantification_level'] = quantification_level
            green_term['quantification_confidence'] = quantification_confidence
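            # Mirror the weights used in classify_quantification_level:
            # 2.0 for currency/percentage/unit, 1.5 for meaningful_count, 1.0 for relative
            green_term['quantification_score'] = {
                'highly_score': sum(
                    (2.0 if q['quantification']['type'] in ['currency', 'percentage', 'unit'] else 1.5)
                    * q['connection']['connection_strength']
                    for q in connected_quantifications
                    if q['quantification']['type'] in ['currency', 'percentage', 'unit', 'meaningful_count']),
                'partially_score': sum(1.0 * q['connection']['connection_strength']
                                    for q in connected_quantifications
                                    if q['quantification']['type'] == 'relative')
            }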
            green_term['connected_quantifications'] = connected_quantifications
            
            # Update counters
            doc_quant_counter[quantification_level] += 1
            overall_quant_stats[quantification_level] += 1
        
        # Calculate quantification intensity score for document
        total_terms = len(valid_terms)
        if total_terms > 0:
            highly_count = doc_quant_counter['highly_quantified']
            partially_count = doc_quant_counter['partially_quantified']
            intensity_score = ((highly_count * 1.5) + (partially_count * 1.0)) / total_terms * 100
        else:
            intensity_score = 0.0
        
        # Store document statistics
        doc_quant_stats[doc_name] = {
            'total_terms': total_terms,
            'highly_quantified': doc_quant_counter['highly_quantified'],
            'partially_quantified': doc_quant_counter['partially_quantified'],
            'non_quantified': doc_quant_counter['non_quantified'],
            'quantification_intensity_score': intensity_score
        }
    
    # Print summary statistics
    print(f"\nQuantification Classification Results:")
    total_terms = sum(overall_quant_stats.values())
    for level in ['highly_quantified', 'partially_quantified', 'non_quantified']:
        count = overall_quant_stats[level]
        percentage = (count / total_terms * 100) if total_terms > 0 else 0
        level_display = level.replace('_', ' ').title()
        print(f"  {level_display}: {count} ({percentage:.1f}%)")
    
    print(f"\nDocument breakdown:")
    print(f"{'Document':<30} {'Highly':<8} {'Partial':<8} {'None':<8} {'Total':<8}")
    print("-" * 65)
    
    for doc_name, stats in doc_quant_stats.items():
        print(f"{doc_name:<30} {stats['highly_quantified']:<8} {stats['partially_quantified']:<8} {stats['non_quantified']:<8} {stats['total_terms']:<8}")
    
    # Store quantification statistics in results
    for doc_name in all_results:
        all_results[doc_name]['quantification_stats'] = doc_quant_stats[doc_name]
    
    return all_results
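
To make the thresholds in `classify_quantification_level` concrete, here is a minimal stand-alone sketch of the same scoring arithmetic. `score_level` and `HIGH_WEIGHTS` are hypothetical names introduced for this illustration, and the input is simplified to `(type, strength)` pairs rather than the notebook's nested dictionaries.

```python
# Weights as used in classify_quantification_level
HIGH_WEIGHTS = {"currency": 2.0, "percentage": 2.0, "unit": 2.0, "meaningful_count": 1.5}


def score_level(connections):
    """connections: list of (quant_type, connection_strength) pairs."""
    highly = sum(HIGH_WEIGHTS[t] * s for t, s in connections if t in HIGH_WEIGHTS)
    partially = sum(1.0 * s for t, s in connections if t == "relative")
    # Confidence is the combined score halved and capped at 1.0
    confidence = min((highly + partially) / 2.0, 1.0)

    if highly > 0 and confidence >= 0.3:
        return "highly_quantified", confidence
    if partially > 0 and confidence >= 0.2:
        return "partially_quantified", confidence
    return "non_quantified", confidence


for case in ([("percentage", 0.9)], [("relative", 0.6)], [("relative", 0.3)], []):
    print(case, "->", score_level(case))
```

A single percentage connected at strength 0.9 gives a confidence of about 0.9 and is labelled highly quantified. A relative quantifier at strength 0.6 gives about 0.3 and is labelled partially quantified. The same quantifier at strength 0.3 gives about 0.15, which is below the 0.2 cut-off, so the term is labelled non-quantified even though a connection exists.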

In [None]:
# Apply quantification classification to all documents
all_results = add_quantification_classification(all_results, documents)

def analyze_quantification_patterns(all_results):
    """
    Analyze patterns in quantification classification results.
    """
    print(f"\n{'='*60}")
    print("QUANTIFICATION PATTERN ANALYSIS")
    print(f"{'='*60}")
    
    # Analyze by quantification type
    quantification_type_stats = Counter()
    connection_method_stats = Counter()
    
    for doc_name, results in all_results.items():
        for term in results['valid_terms']:
            for quant_info in term.get('connected_quantifications', []):
                quant_type = quant_info['quantification']['type']
                connection_method = quant_info['connection']['primary_method']
                quantification_type_stats[quant_type] += 1
                connection_method_stats[connection_method] += 1
    
    print(f"\nQuantification types found:")
    for quant_type, count in quantification_type_stats.most_common():
        print(f"  {quant_type}: {count}")
    
    print(f"\nConnection methods used:")
    for method, count in connection_method_stats.most_common():
        print(f"  {method}: {count}")
    
    return quantification_type_stats, connection_method_stats

def print_quantification_examples_by_document(all_results, documents, examples_per_level=3):
    """
    Print examples of quantification classifications for each document.
    """
    print(f"\n{'='*80}")
    print("QUANTIFICATION EXAMPLES BY DOCUMENT")
    print(f"{'='*80}")
    
    for doc_name, results in all_results.items():
        doc = documents[doc_name]
        stats = results['quantification_stats']
        
        print(f"\n{'='*60}")
        print(f"DOCUMENT: {doc_name} (Intensity Score: {stats['quantification_intensity_score']:.2f}):")
        print(f"{'='*60}")
        total = stats['total_terms']
        print(f"Total terms: {total}")
        # Guard against documents with no valid terms (avoids ZeroDivisionError)
        for key, label in [('highly_quantified', 'Highly Quantified'),
                           ('partially_quantified', 'Partially Quantified'),
                           ('non_quantified', 'Non Quantified')]:
            share = stats[key] / total * 100 if total > 0 else 0.0
            print(f"  {label}: {stats[key]} ({share:.1f}%)")
        
        # Group terms by quantification level
        terms_by_level = {}
        for term in results['valid_terms']:
            level = term.get('quantification_level', 'unknown')
            if level not in terms_by_level:
                terms_by_level[level] = []
            terms_by_level[level].append(term)
        
        # Print examples for each level
        for level in ['highly_quantified', 'partially_quantified', 'non_quantified']:
            if level in terms_by_level and terms_by_level[level]:
                level_display = level.replace('_', ' ').upper()
                print(f"\n{level_display} EXAMPLES:")
                print("-" * 40)
                
                # Sort by confidence score (descending) and show top examples
                sorted_terms = sorted(terms_by_level[level], 
                                    key=lambda x: x.get('quantification_confidence', 0), 
                                    reverse=True)
                
                for i, term in enumerate(sorted_terms[:examples_per_level]):
                    # MODIFIED: Get 10-token context instead of entire sentence
                    green_start = term['start_idx']
                    green_end = term['end_idx']
                    sentence = term['sentence']
                    
                    # Get sentence boundaries
                    sent_start = sentence.start
                    sent_end = sentence.end - 1  # spaCy sentence.end is exclusive
                    
                    # Calculate context boundaries (10 tokens before and after, within sentence limits)
                    context_start = max(sent_start, green_start - 10)
                    context_end = min(sent_end, green_end + 10)
                    
                    # Build context with highlighting
                    context_tokens_list = []
                    for token_idx in range(context_start, context_end + 1):
                        token = doc[token_idx]
                        if green_start <= token_idx <= green_end:
                            # Highlight the green term
                            context_tokens_list.append(f"**{token.text}**")
                        else:
                            context_tokens_list.append(token.text)
                    
                    highlighted_context = ' '.join(context_tokens_list)
                    
                    print(f"{i+1}. '{term['term']}' ({term['pos']})")
                    print(f"   Context: {highlighted_context}")

                    # Show confidence score
                    confidence = term.get('quantification_confidence', 0)
                    print(f"   Confidence: {confidence:.3f}")

                    # Show connected quantifications with strength scores
                    connected_quants = term.get('connected_quantifications', [])
                    if connected_quants:
                        quant_texts = []
                        for quant_info in connected_quants:
                            quant = quant_info['quantification']
                            connection = quant_info['connection']
                            strength = connection['connection_strength']
                            quant_texts.append(f"'{quant['text']}' ({quant['type']}, {connection['primary_method']}, strength={strength:.2f})")
                        print(f"   Quantifications: {', '.join(quant_texts)}")
                        
                        # Show score breakdown
                        scores = term.get('quantification_score', {})
                        if scores:
                            print(f"   Scores: Highly={scores.get('highly_score', 0):.2f}, Partially={scores.get('partially_score', 0):.2f}")
                    else:
                        print(f"   Quantifications: None")
                    print()

def print_overall_quantification_summary(all_results):
    """
    Print overall summary statistics for quantification classification.
    """
    print(f"\n{'='*80}")
    print("OVERALL QUANTIFICATION SUMMARY")
    print(f"{'='*80}")
    
    # Aggregate level counts across all documents
    # (confidence values are collected further down, with the confidence statistics)
    overall_stats = Counter()
    
    for doc_name, results in all_results.items():
        for term in results['valid_terms']:
            level = term.get('quantification_level', 'unknown')
            overall_stats[level] += 1
    
    # Print level statistics
    total_terms = sum(overall_stats.values())
    print(f"\nOverall Classification Results:")
    print(f"  Total terms analyzed: {total_terms}")
    
    for level in ['highly_quantified', 'partially_quantified', 'non_quantified']:
        count = overall_stats[level]
        percentage = (count / total_terms * 100) if total_terms > 0 else 0
        level_display = level.replace('_', ' ').title()
        print(f"  {level_display}: {count} ({percentage:.1f}%)")
    
    # Calculate quantified vs non-quantified
    quantified_count = overall_stats['highly_quantified'] + overall_stats['partially_quantified']
    quantified_percentage = (quantified_count / total_terms * 100) if total_terms > 0 else 0
    print(f"  Total quantified (highly + partially): {quantified_count} ({quantified_percentage:.1f}%)")
    
    # Print POS distribution for quantified terms
    pos_stats = defaultdict(Counter)
    for doc_name, results in all_results.items():
        for term in results['valid_terms']:
            level = term.get('quantification_level', 'unknown')
            pos = term.get('pos', 'unknown')
            pos_stats[pos][level] += 1
    
    print(f"\nQuantification by POS tag:")
    print(f"{'POS':<30} {'Highly':<8} {'Partial':<8} {'None':<8} {'Total':<8}")
    print("-" * 65)
    
    for pos_type in sorted(pos_stats.keys()):
        level_counts = pos_stats[pos_type]
        highly = level_counts['highly_quantified']
        partial = level_counts['partially_quantified']
        none = level_counts['non_quantified']
        total = sum(level_counts.values())
        print(f"{pos_type:<30} {highly:<8} {partial:<8} {none:<8} {total:<8}")

    # Add confidence statistics (NEW SECTION)
    print(f"\nConfidence Score Statistics:")
    confidence_stats = []
    confidence_by_level = {'highly_quantified': [], 'partially_quantified': [], 'non_quantified': []}

    for doc_name, results in all_results.items():
        for term in results['valid_terms']:
            if 'quantification_confidence' in term:
                confidence = term['quantification_confidence']
                level = term.get('quantification_level', 'unknown')
                confidence_stats.append(confidence)
                if level in confidence_by_level:
                    confidence_by_level[level].append(confidence)

    if confidence_stats:
        avg_confidence = sum(confidence_stats) / len(confidence_stats)
        high_confidence = sum(1 for c in confidence_stats if c >= 0.7)
        medium_confidence = sum(1 for c in confidence_stats if 0.3 <= c < 0.7)
        low_confidence = sum(1 for c in confidence_stats if c < 0.3)
        
        print(f"  Overall average confidence: {avg_confidence:.3f}")
        print(f"  High confidence (≥0.7): {high_confidence} ({high_confidence/len(confidence_stats)*100:.1f}%)")
        print(f"  Medium confidence (0.3-0.7): {medium_confidence} ({medium_confidence/len(confidence_stats)*100:.1f}%)")
        print(f"  Low confidence (<0.3): {low_confidence} ({low_confidence/len(confidence_stats)*100:.1f}%)")
        
        # Confidence by quantification level
        print(f"\nAverage confidence by quantification level:")
        for level, confidences in confidence_by_level.items():
            if confidences:
                avg = sum(confidences) / len(confidences)
                print(f"  {level.replace('_', ' ').title()}: {avg:.3f} (n={len(confidences)})")
    else:
        print("  No confidence data available")
    
    # Add intensity score statistics (NEW SECTION)
    print(f"\nQuantification Intensity Scores by Document:")
    intensity_scores = []
    print(f"{'Document':<35} {'Intensity Score':<15}")
    print("-" * 50)

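    for doc_name, results in all_results.items():
        # The score is stored under 'quantification_stats' by add_quantification_classification
        intensity = results.get('quantification_stats', {}).get('quantification_intensity_score', 0)
        intensity_scores.append(intensity)
        print(f"{doc_name:<35} {intensity:<15.2f}")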

    if intensity_scores:
        avg_intensity = sum(intensity_scores) / len(intensity_scores)
        max_intensity = max(intensity_scores)
        min_intensity = min(intensity_scores)
        
        print(f"\nIntensity Score Statistics:")
        print(f"  Average intensity: {avg_intensity:.2f}")
        print(f"  Maximum intensity: {max_intensity:.2f}")
        print(f"  Minimum intensity: {min_intensity:.2f}")
        
        # Intensity distribution
        high_intensity = sum(1 for score in intensity_scores if score >= 100)
        medium_intensity = sum(1 for score in intensity_scores if 50 <= score < 100)
        low_intensity = sum(1 for score in intensity_scores if score < 50)
        
        print(f"  High intensity (≥100): {high_intensity} documents")
        print(f"  Medium intensity (50-100): {medium_intensity} documents") 
        print(f"  Low intensity (<50): {low_intensity} documents")

# Run analysis  
quant_type_stats, connection_stats = analyze_quantification_patterns(all_results)
print_quantification_examples_by_document(all_results, documents, examples_per_level=3)
print_overall_quantification_summary(all_results)

# Print final summary (OPTIONAL ADDITION)
print(f"\n{'='*80}")
print("QUANTIFICATION CLASSIFICATION WITH CONFIDENCE & INTENSITY COMPLETE")
print(f"{'='*80}")
print("All valid green terms now have confidence scores based on connection strength")
print("Enhanced classification considers both quantification type AND connection quality")
print("Intensity score: ((highly_count * 1.5) + (partially_count * 1)) / total_terms * 100")
print("Use 'quantification_confidence' field to filter high-quality quantifications")
print("Use quantification_stats['quantification_intensity_score'] to compare document quantification density")
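
The intensity formula printed above is easier to interpret with a few worked values. The sketch below is a stand-alone restatement; `intensity_score` is a hypothetical helper name, and the term counts are made-up inputs rather than results from any analysed document.

```python
def intensity_score(highly_count, partially_count, total_terms):
    """Weighted share of quantified terms, scaled by 100 (range 0 to 150)."""
    if total_terms <= 0:
        return 0.0
    return (highly_count * 1.5 + partially_count * 1.0) / total_terms * 100


# Hypothetical documents with 10 valid green terms each
print(intensity_score(4, 3, 10))   # mixed: 4 highly, 3 partially, 3 non-quantified
print(intensity_score(10, 0, 10))  # every term highly quantified
print(intensity_score(0, 10, 10))  # every term partially quantified
```

Because highly quantified terms carry a weight of 1.5, the score is not a percentage: it tops out at 150 when every term is highly quantified, and a document in which every term is only partially quantified scores 100, which already falls in the "high intensity" bin used in the summary.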

## Evidence/Claim

In [None]:
import re
from collections import Counter, defaultdict

# Evidence-based markers (concrete, verifiable)
EVIDENCE_MARKERS = {
    'strong_evidence': {
        'completed_actions': [
            'achieved', 'delivered', 'implemented', 'installed', 'completed', 'finished',
            'accomplished', 'executed', 'established', 'built', 'constructed', 'launched',
            'deployed', 'commissioned', 'operationalized', 'finalized'
        ],
        'verified_outcomes': [
            'certified', 'verified', 'audited', 'measured', 'reported', 'documented',
            'validated', 'confirmed', 'accredited', 'assessed', 'evaluated', 'monitored',
            'tracked', 'recorded', 'registered', 'approved'
        ],
        'measured_performance': [
            'resulted in', 'demonstrated', 'showed', 'recorded', 'yielded', 'generated',
            'produced', 'delivered', 'exceeded', 'surpassed', 'outperformed'
        ]
    },
    'moderate_evidence': {
        'implemented_actions': [
            'introduced', 'adopted', 'initiated', 'started', 'begun', 'commenced',
            'undertaken', 'proceeded', 'engaged', 'embarked', 'activated', 'enabled',
            # 'established' is listed under strong_evidence only, so it cannot match both levels
            'facilitated', 'developed', 'created'
        ],
        'present_reality': [
            'operates', 'generates', 'produces', 'supplies', 'provides', 'delivers',
            'maintains', 'runs', 'functions', 'performs', 'serves', 'supports',
            'contributes', 'consists', 'comprises', 'includes', 'contains'
        ],
        'ongoing_actions': [
            'continuing', 'ongoing', 'in progress', 'underway', 'proceeding',
            'advancing', 'progressing', 'working on', 'currently', 'actively'
        ]
    }
}

# Aspirational markers (future-oriented, commitments)
ASPIRATIONAL_MARKERS = {
    'strong_aspirational': {
        'modal_uncertainty': [
            'should', 'could', 'might', 'would', 'may', 'can', 'ought to',
            'supposed to', 'expected to', 'likely to', 'probable', 'possible'
        ],
        'conditional_promises': [
            'if', 'when', 'subject to', 'pending', 'contingent on', 'dependent on',
            'provided that', 'assuming', 'unless', 'in case', 'should', 'were to'
        ],
        'visionary_language': [
            'vision', 'dream', 'aspire', 'envision', 'imagine', 'hope',
            'wish', 'desire', 'ambition', 'ultimate goal', 'long-term vision'
        ]
    },
    'moderate_aspirational': {
        'future_commitments': [
            'will', 'plan to', 'aim to', 'intend to', 'commit to', 'pledge to',
            'promise to', 'undertake to', 'agree to', 'decide to', 'choose to',
            'resolve to', 'determine to', 'set out to', 'going to'
        ],
        'strategic_intentions': [
            'strategy', 'roadmap', 'pathway', 'plan', 'target', 'goal', 'objective',
            'initiative', 'program', 'project', 'scheme', 'approach', 'framework',
            'agenda', 'blueprint', 'timeline', 'schedule'
        ],
        'preparatory_actions': [
            'preparing', 'planning', 'designing', 'developing', 'working towards',
            'moving towards', 'progressing towards', 'aiming for', 'targeting',
            'seeking', 'pursuing', 'striving', 'endeavoring'
        ]
    }
}

# Words that indicate temporal context but not evidence/aspirational nature
TEMPORAL_NEUTRALS = [
    'in 2021', 'in 2022', 'by 2030', 'since 2020', 'during', 'throughout',
    'over the period', 'in the past', 'historically', 'previously',
    'currently', 'now', 'today', 'recently', 'lately'
]
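
The marker dictionaries mix single words ('can', 'will') with multi-word phrases ('resulted in', 'plan to'), so plain substring checks would produce false hits such as 'can' inside 'scan'. The sketch below shows one possible way to match them with whole-word regular expressions. `find_markers` and `SAMPLE_MARKERS` are hypothetical names for illustration; this is an assumption about how matching could be done, not the notebook's own detection logic, which works around semantic governors.

```python
import re

# A small illustrative subset of the marker lists defined above
SAMPLE_MARKERS = {
    "strong_evidence": ["achieved", "installed", "resulted in"],
    "moderate_aspirational": ["will", "plan to", "target"],
    "strong_aspirational": ["may", "can", "might"],
}


def find_markers(text, markers):
    """Return {category: [matched markers]} using whole-word, case-insensitive matching."""
    hits = {}
    for category, phrases in markers.items():
        found = []
        for phrase in phrases:
            # \b anchors keep 'can' from matching inside 'scan'
            pattern = r"\b" + re.escape(phrase) + r"\b"
            if re.search(pattern, text, re.IGNORECASE):
                found.append(phrase)
        if found:
            hits[category] = found
    return hits


print(find_markers("We installed 200 MW and plan to scan further sites.", SAMPLE_MARKERS))
```

In this sentence only 'installed' and 'plan to' are reported; 'can' is not, because it appears only as part of 'scan'.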

In [None]:
def detect_green_term_type(green_term):
    """
    Detect whether a green term is direct, context-dependent, or dependency-based.
    
    Args:
        green_term: Green term dictionary
    
    Returns:
        Tuple: (term_type, context_info)
    """
    # Check for dependency patterns first
    if green_term['pos'].startswith('dependency_'):
        return 'dependency', {
            'pattern_name': green_term.get('pattern', ''),
            'dependency_relation': green_term.get('dependency', ''),
            'pattern_type': green_term['pos']
        }
    
    # Check for context-dependent terms
    elif 'context_word' in green_term:
        return 'context_dependent', {
            'context_word': green_term['context_word'],
            'neutral_part': green_term.get('neutral_part', ''),
            'context_type': green_term.get('context_type', ''),
            'context_relationship': green_term.get('context_relationship', '')
        }
    
    # Default to direct terms
    else:
        return 'direct', {}

def find_semantic_governor_enhanced(green_term, doc):
    """
    Find the semantic governor with special handling for all term types.
    
    Args:
        green_term: Green term dictionary
        doc: spaCy doc object
    
    Returns:
        Dictionary with governor information
    """
    start_idx = green_term['start_idx']
    end_idx = green_term['end_idx']
    term_type, context_info = detect_green_term_type(green_term)
    
    if term_type == 'dependency':
        # For dependency patterns, we have two tokens in a dependency relationship
        # We need to find the verb that governs this dependency pair
        token1 = doc[start_idx]
        token2 = doc[end_idx]
        
        # Find which token is the head and which is the dependent
        if token1.head == token2:
            head_token = token2
            dependent_token = token1
        elif token2.head == token1:
            head_token = token1
            dependent_token = token2
        else:
            # They don't have direct dependency, use the first token as primary
            head_token = token1
            dependent_token = token2
        
        # Find the verb that governs the head token of the dependency pair
        current_token = head_token
        max_depth = 10
        depth = 0
        
        while current_token.head != current_token and depth < max_depth:
            if current_token.head.pos_ == 'VERB' or current_token.head.pos_ == 'AUX':
                return {
                    'semantic_governor': current_token.head,
                    'primary_token': head_token,
                    'secondary_token': dependent_token,
                    'dependency_relation': dependent_token.dep_
                }
            current_token = current_token.head
            depth += 1
        
        # Fallback to sentence root
        sentence_root = head_token.sent.root
        if sentence_root.pos_ == 'VERB' or sentence_root.pos_ == 'AUX':
            return {
                'semantic_governor': sentence_root,
                'primary_token': head_token,
                'secondary_token': dependent_token,
                'dependency_relation': dependent_token.dep_
            }
        
        return {
            'semantic_governor': None,
            'primary_token': head_token,
            'secondary_token': dependent_token,
            'dependency_relation': dependent_token.dep_
        }
    
    elif term_type == 'context_dependent':
        # For context-dependent terms, find the verb that governs the complete phrase
        context_word = context_info['context_word']
        sentence = doc[start_idx].sent
        context_token = None
        
        for token in sentence:
            if token.text.lower() == context_word.lower():
                if abs(token.i - start_idx) <= 3:
                    context_token = token
                    break
        
        if context_token:
            current_token = context_token
            max_depth = 10
            depth = 0
            
            while current_token.head != current_token and depth < max_depth:
                if current_token.head.pos_ == 'VERB' or current_token.head.pos_ == 'AUX':
                    return {
                        'semantic_governor': current_token.head,
                        'context_token': context_token,
                        'neutral_tokens': [doc[i] for i in range(start_idx, end_idx + 1)]
                    }
                current_token = current_token.head
                depth += 1
    
    # Strategy 1 (direct terms, and fallback for the other term types):
    # walk up the head chain from the first token of the term
    main_token = doc[start_idx]
    current_token = main_token
    max_depth = 10
    depth = 0
    
    while current_token.head != current_token and depth < max_depth:
        if current_token.head.pos_ == 'VERB' or current_token.head.pos_ == 'AUX':
            return {
                'semantic_governor': current_token.head,
                'main_tokens': [doc[i] for i in range(start_idx, end_idx + 1)]
            }
        current_token = current_token.head
        depth += 1
    
    # Strategy 2: Look for verbs in the same sentence that govern this term
    sentence = main_token.sent
    for token in sentence:
        if token.pos_ == 'VERB' or token.pos_ == 'AUX':
            for child in token.subtree:
                if start_idx <= child.i <= end_idx:
                    return {
                        'semantic_governor': token,
                        'main_tokens': [doc[i] for i in range(start_idx, end_idx + 1)]
                    }
    
    # Fallback to sentence root
    sentence_root = main_token.sent.root
    if sentence_root.pos_ == 'VERB' or sentence_root.pos_ == 'AUX':
        return {
            'semantic_governor': sentence_root,
            'main_tokens': [doc[i] for i in range(start_idx, end_idx + 1)]
        }
    
    return {
        'semantic_governor': None,
        'main_tokens': [doc[i] for i in range(start_idx, end_idx + 1)]
    }

def get_analysis_scope_enhanced(green_term, doc):
    """
    Determine the enhanced analysis scope for evidence/aspirational markers.
    
    Args:
        green_term: Green term dictionary
        doc: spaCy doc object
    
    Returns:
        Dictionary with enhanced analysis scope information
    """
    term_type, context_info = detect_green_term_type(green_term)
    governor_info = find_semantic_governor_enhanced(green_term, doc)
    
    # Get excluded words (green-making words that shouldn't be considered as evidence/aspirational markers)
    excluded_words = []
    
    if term_type == 'context_dependent':
        excluded_words.append(context_info['context_word'].lower())
    elif term_type == 'dependency':
        # For dependency patterns, both tokens are part of the green meaning
        # so we don't exclude them unless they appear in our marker lists
        pass
    
    return {
        'term_type': term_type,
        'context_info': context_info,
        'governor_info': governor_info,
        'excluded_words': excluded_words,
        'green_term_positions': set(range(green_term['start_idx'], green_term['end_idx'] + 1))
    }
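
The governor search above repeatedly climbs the dependency tree from a token until it meets a `VERB` or `AUX` head. The following is a simplified sketch of that climb on a hand-built tree, so it runs without a spaCy model. `Tok` and `find_governing_verb` are hypothetical stand-ins written for this illustration; the real code works on spaCy tokens and adds sentence-level fallbacks that are omitted here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Tok:
    """Hypothetical stand-in for a spaCy token: text, coarse POS tag, and head."""
    text: str
    pos_: str
    head: Optional["Tok"] = None

    def __post_init__(self):
        # As in spaCy, a root token is its own head
        if self.head is None:
            self.head = self


def find_governing_verb(token, max_depth=10):
    """Climb the head chain until a VERB or AUX head is found, or give up."""
    current = token
    for _ in range(max_depth):
        if current.head is current:          # reached the root without a verb
            return None
        if current.head.pos_ in ('VERB', 'AUX'):
            return current.head
        current = current.head
    return None


# "We installed solar panels": solar -> panels -> installed (root)
installed = Tok('installed', 'VERB')
panels = Tok('panels', 'NOUN', head=installed)
solar = Tok('solar', 'ADJ', head=panels)

print(find_governing_verb(solar).text)   # installed
print(find_governing_verb(installed))    # None
```

The depth limit mirrors `max_depth = 10` in the notebook and guards against unexpectedly deep or malformed parses.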

In [None]:
def _find_targeted_markers(analysis_scope, doc, marker_dict, min_connection=0.3):
    """
    Shared search used by the evidence and aspirational marker finders.
    
    Args:
        analysis_scope: Enhanced analysis scope dictionary
        doc: spaCy doc object
        marker_dict: Nested {strength_level: {category: [markers]}} dictionary
        min_connection: Markers must have a connection strength above this value
    
    Returns:
        List of marker dictionaries, one per (marker text, token position)
    """
    governor_info = analysis_scope['governor_info']
    excluded_words = analysis_scope['excluded_words']
    green_positions = analysis_scope['green_term_positions']
    
    # Flatten the nested dictionary into (marker, strength_level, category) tuples
    all_markers = [
        (marker, strength_level, category)
        for strength_level, categories in marker_dict.items()
        for category, markers in categories.items()
        for marker in markers
    ]
    
    # The same marker token can be reached from several targets (for example,
    # from two adjacent green-term tokens and from their governor). Keep only
    # the strongest connection per (marker, position) so it is scored once.
    best_by_key = {}
    
    for target_info in get_analysis_targets(analysis_scope):
        search_scope = target_info['scope']
        markers_in_scope = find_markers_in_scope(
            target_info['token'], search_scope, all_markers, excluded_words, doc
        )
        
        for marker_info in markers_in_scope:
            # Assess connection strength
            connection_strength, connection_type = assess_targeted_marker_connection(
                marker_info, target_info, green_positions, governor_info, doc
            )
            if connection_strength <= min_connection:
                continue
            
            key = (marker_info['text'], marker_info['token_position'])
            previous = best_by_key.get(key)
            if previous is None or connection_strength > previous['connection_strength']:
                best_by_key[key] = {
                    'text': marker_info['text'],
                    'strength_level': marker_info['strength_level'],
                    'category': marker_info['category'],
                    'token_position': marker_info['token_position'],
                    'connection_strength': connection_strength,
                    'connection_type': connection_type,
                    'search_scope': search_scope
                }
    
    return list(best_by_key.values())

def find_targeted_evidence_markers(analysis_scope, doc):
    """
    Find evidence markers using targeted dependency-first approach.
    
    Args:
        analysis_scope: Enhanced analysis scope dictionary
        doc: spaCy doc object
    
    Returns:
        List of evidence markers found
    """
    return _find_targeted_markers(analysis_scope, doc, EVIDENCE_MARKERS)

def find_targeted_aspirational_markers(analysis_scope, doc):
    """
    Find aspirational markers using targeted dependency-first approach.
    
    Args:
        analysis_scope: Enhanced analysis scope dictionary
        doc: spaCy doc object
    
    Returns:
        List of aspirational markers found
    """
    return _find_targeted_markers(analysis_scope, doc, ASPIRATIONAL_MARKERS)

def get_analysis_targets(analysis_scope):
    """
    Get targeted analysis locations based on term type and dependency structure.
    
    Args:
        analysis_scope: Enhanced analysis scope dictionary
    
    Returns:
        List of analysis targets with scope information
    """
    targets = []
    governor_info = analysis_scope['governor_info']
    term_type = analysis_scope['term_type']
    semantic_governor = governor_info.get('semantic_governor')
    
    if term_type == 'dependency':
        # For dependency patterns, analyze both tokens and their governor
        primary_token = governor_info.get('primary_token')
        secondary_token = governor_info.get('secondary_token')
        
        if primary_token:
            targets.append({
                'token': primary_token,
                'scope': 'token_modifiers',
                'role': 'primary_dependency_token'
            })
        
        if secondary_token:
            targets.append({
                'token': secondary_token,
                'scope': 'token_modifiers',
                'role': 'secondary_dependency_token'
            })
        
        if semantic_governor:
            targets.append({
                'token': semantic_governor,
                'scope': 'governor_modifiers',
                'role': 'semantic_governor'
            })
    
    elif term_type == 'context_dependent':
        # For context-dependent terms, analyze the complete phrase and governor
        context_token = governor_info.get('context_token')
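        # The governor search returns 'main_tokens' instead of 'neutral_tokens' when it
        # falls through to the direct-term path, so accept either key here
        neutral_tokens = governor_info.get('neutral_tokens') or governor_info.get('main_tokens', [])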
        
        if context_token:
            targets.append({
                'token': context_token,
                'scope': 'token_modifiers',
                'role': 'context_token'
            })
        
        for token in neutral_tokens:
            targets.append({
                'token': token,
                'scope': 'token_modifiers',
                'role': 'neutral_token'
            })
        
        if semantic_governor:
            targets.append({
                'token': semantic_governor,
                'scope': 'governor_modifiers',
                'role': 'semantic_governor'
            })
    
    else:  # direct terms
        # For direct terms, analyze the main tokens and governor
        main_tokens = governor_info.get('main_tokens', [])
        
        for token in main_tokens:
            targets.append({
                'token': token,
                'scope': 'token_modifiers',
                'role': 'main_token'
            })
        
        if semantic_governor:
            targets.append({
                'token': semantic_governor,
                'scope': 'governor_modifiers',
                'role': 'semantic_governor'
            })
    
    return targets

def find_markers_in_scope(target_token, search_scope, marker_list, excluded_words, doc):
    """
    Find markers in the specified scope around a target token.
    
    Args:
        target_token: The token to search around
        search_scope: Type of scope ('token_modifiers', 'governor_modifiers', etc.)
        marker_list: List of (marker, strength_level, category) tuples
        excluded_words: Words to exclude from search
        doc: spaCy doc object
    
    Returns:
        List of marker information dictionaries
    """
    markers_found = []
    search_tokens = []
    
    if search_scope == 'token_modifiers':
        # Look at tokens that modify this token (its children)
        search_tokens.extend(target_token.children)
        # Look at tokens this token modifies (siblings under same head)
        if target_token.head != target_token:
            search_tokens.extend([child for child in target_token.head.children if child != target_token])
    
    elif search_scope == 'governor_modifiers':
        # Look at tokens that modify the governor. Its children already include
        # auxiliary and modal verbs (dep_ 'aux' / 'auxpass'), so one pass is enough.
        search_tokens.extend(target_token.children)
    
    # Add the target token itself to search
    search_tokens.append(target_token)
    
    # Also check immediate neighbors (within 2 tokens)
    for offset in [-2, -1, 1, 2]:
        neighbor_idx = target_token.i + offset
        if 0 <= neighbor_idx < len(doc):
            neighbor_token = doc[neighbor_idx]
            # Only add if it's in the same sentence
            if neighbor_token.sent == target_token.sent:
                search_tokens.append(neighbor_token)
    
    # Remove duplicates and restore document order so results are reproducible
    search_tokens = sorted(set(search_tokens), key=lambda t: t.i)
    
    # Search for markers in these specific tokens
    for marker, strength_level, category in marker_list:
        marker_words = marker.lower().split()
        
        # Skip excluded words
        if marker.lower() in excluded_words:
            continue
        
        # Look for single-word markers
        if len(marker_words) == 1:
            for token in search_tokens:
                if token.text.lower() == marker_words[0] or token.lemma_.lower() == marker_words[0]:
                    markers_found.append({
                        'text': marker,
                        'strength_level': strength_level,
                        'category': category,
                        'token_position': token.i,
                        'matched_token': token
                    })
        
        # Look for multi-word markers in consecutive tokens
        else:
            for i, token in enumerate(search_tokens):
                if token.text.lower() == marker_words[0]:
                    # Check if subsequent tokens match
                    match = True
                    consecutive_tokens = [token]
                    
                    for j, word in enumerate(marker_words[1:], 1):
                        next_token_idx = token.i + j
                        if next_token_idx < len(doc):
                            next_token = doc[next_token_idx]
                            if next_token.text.lower() == word:
                                consecutive_tokens.append(next_token)
                            else:
                                match = False
                                break
                        else:
                            match = False
                            break
                    
                    if match:
                        markers_found.append({
                            'text': marker,
                            'strength_level': strength_level,
                            'category': category,
                            'token_position': token.i,
                            'matched_token': token,
                            'consecutive_tokens': consecutive_tokens
                        })
    
    return markers_found

def assess_targeted_marker_connection(marker_info, target_info, green_positions, governor_info, doc):
    """
    Assess connection strength for targeted marker detection.
    
    Args:
        marker_info: Dictionary with marker information
        target_info: Dictionary with target token information
        green_positions: Set of green term token positions
        governor_info: Governor information dictionary
        doc: spaCy doc object
    
    Returns:
        Tuple: (connection_strength, connection_type)
    """
    marker_token = marker_info['matched_token']
    target_token = target_info['token']
    target_role = target_info['role']
    
    # The marker is the target token itself (for example, a governor such as
    # "achieved" that is also an evidence marker), or one directly modifies the other
    if (marker_token.i == target_token.i or
            marker_token.head == target_token or target_token.head == marker_token):
        return 0.9, f'direct_modification_{target_role}'
    
    # A child of the target is already covered by the check above
    # (child.head == target), so no separate child tier is needed here.
    
    # Marker is sibling of target token (same head)
    if (marker_token.head == target_token.head and 
        marker_token.head != marker_token and 
        target_token.head != target_token):
        return 0.7, f'sibling_of_{target_role}'
    
    # Close proximity to target token
    distance = abs(marker_token.i - target_token.i)
    if distance <= 2:
        return 0.6, f'proximity_{distance}_{target_role}'
    elif distance <= 4:
        return 0.4, f'proximity_{distance}_{target_role}'
    
    # Same sentence but more distant
    if marker_token.sent == target_token.sent:
        return 0.2, f'same_sentence_{target_role}'
    
    return 0.0, 'no_connection'
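
The multi-word branch of `find_markers_in_scope` checks whether the words of a marker such as "plan to" appear as consecutive tokens. A minimal sketch of that check on plain word lists follows. It assumes whitespace tokenisation and scans the whole sentence, whereas the notebook only starts matches at tokens in the targeted search scope; `find_phrase_starts` is a hypothetical helper name.

```python
def find_phrase_starts(words, phrase):
    """Return the start indices where the words of `phrase` occur consecutively."""
    target = phrase.lower().split()
    lowered = [w.lower() for w in words]
    return [
        i for i in range(len(lowered) - len(target) + 1)
        if lowered[i:i + len(target)] == target
    ]


words = "We plan to reduce emissions and aim to cut waste".split()
print(find_phrase_starts(words, 'plan to'))     # [1]
print(find_phrase_starts(words, 'aim to'))      # [6]
print(find_phrase_starts(words, 'commit to'))   # []
```

Because the comparison is on surface forms, an inflected variant such as "plans to" would not match "plan to" in this sketch.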

In [None]:
def classify_evidence_aspirational(evidence_markers, aspirational_markers):
    """
    Classify the evidence vs aspirational nature based on found markers.
    
    Args:
        evidence_markers: List of evidence markers found
        aspirational_markers: List of aspirational markers found
    
    Returns:
        Dictionary with classification results
    """
    # Calculate evidence scores
    evidence_score = 0
    evidence_details = []
    
    for marker in evidence_markers:
        weight = marker['connection_strength']
        if marker['strength_level'] == 'strong_evidence':
            evidence_score += 2.0 * weight
        elif marker['strength_level'] == 'moderate_evidence':
            evidence_score += 1.0 * weight
        
        evidence_details.append({
            'text': marker['text'],
            'strength': marker['strength_level'],
            'category': marker['category'],
            'weight': weight,
            'connection_type': marker.get('connection_type', 'unknown'),
            'search_scope': marker.get('search_scope', 'unknown')
        })
    
    # Calculate aspirational scores
    aspirational_score = 0
    aspirational_details = []
    
    for marker in aspirational_markers:
        weight = marker['connection_strength']
        if marker['strength_level'] == 'strong_aspirational':
            aspirational_score += 2.0 * weight
        elif marker['strength_level'] == 'moderate_aspirational':
            aspirational_score += 1.0 * weight
        
        aspirational_details.append({
            'text': marker['text'],
            'strength': marker['strength_level'],
            'category': marker['category'],
            'weight': weight,
            'connection_type': marker.get('connection_type', 'unknown'),
            'search_scope': marker.get('search_scope', 'unknown')
        })
    
    # Determine classification
    classification = determine_final_classification(evidence_score, aspirational_score)
    
    # Calculate confidence score
    total_score = evidence_score + aspirational_score
    if total_score > 0:
        confidence = min(total_score / 2.0, 1.0)  # Normalize to 0-1
    else:
        confidence = 0.0
    
    return {
        'classification': classification,
        'confidence': confidence,
        'evidence_score': evidence_score,
        'aspirational_score': aspirational_score,
        'evidence_markers': evidence_details,
        'aspirational_markers': aspirational_details,
        'total_markers': len(evidence_markers) + len(aspirational_markers)
    }

def determine_final_classification(evidence_score, aspirational_score):
    """
    Determine the final classification based on evidence and aspirational scores.
    
    Args:
        evidence_score: Weighted evidence score
        aspirational_score: Weighted aspirational score
    
    Returns:
        String: Classification category
    """
    # Thresholds for classification
    strong_threshold = 1.5
    moderate_threshold = 0.5
    
    if evidence_score >= strong_threshold and aspirational_score < moderate_threshold:
        return 'strong_evidence'
    elif evidence_score >= moderate_threshold and aspirational_score < evidence_score:
        return 'moderate_evidence'
    elif aspirational_score >= strong_threshold and evidence_score < moderate_threshold:
        return 'strong_aspirational'
    elif aspirational_score >= moderate_threshold and evidence_score < aspirational_score:
        return 'moderate_aspirational'
    elif evidence_score > 0 and aspirational_score > 0:
        # Mixed context: classify by the stronger signal; equal scores resolve to aspirational
        if evidence_score > aspirational_score:
            return 'moderate_evidence'
        else:
            return 'moderate_aspirational'
    else:
        return 'neutral'

def classify_single_green_term_enhanced(green_term, doc):
    """
    Classify a single green term with enhanced analysis for all term types.
    
    Args:
        green_term: Green term dictionary
        doc: spaCy doc object
    
    Returns:
        Dictionary with enhanced classification results
    """
    # Get enhanced analysis scope
    analysis_scope = get_analysis_scope_enhanced(green_term, doc)
    
    # Find evidence and aspirational markers using targeted approach
    evidence_markers = find_targeted_evidence_markers(analysis_scope, doc)
    aspirational_markers = find_targeted_aspirational_markers(analysis_scope, doc)
    
    # Classify based on markers
    classification_result = classify_evidence_aspirational(evidence_markers, aspirational_markers)
    
    # Add enhanced context information
    governor_info = analysis_scope['governor_info']
    semantic_governor = governor_info.get('semantic_governor')
    
    classification_result.update({
        'term_type': analysis_scope['term_type'],
        'context_info': analysis_scope['context_info'],
        'governor_info': {
            'semantic_governor': semantic_governor.text if semantic_governor is not None else None,
            'governor_pos': semantic_governor.pos_ if semantic_governor is not None else None,
            # Remaining entries (tokens, token lists, dependency label) are passed through unchanged
            'additional_info': {k: v for k, v in governor_info.items()
                                if k != 'semantic_governor'}
        },
        'excluded_words': analysis_scope['excluded_words'],
        'analysis_method': 'targeted_dependency_first'
    })
    
    return classification_result
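
As a worked example of the scoring and threshold logic in this cell, the sketch below recomputes a few classifications by hand. It copies the level weights (2.0 and 1.0) and thresholds (1.5 and 0.5) from the functions above, but represents each marker as a simplified `(level, connection_strength)` pair; `weighted_score`, `classify`, and `classify_markers` are hypothetical names for this illustration.

```python
# Thresholds and level weights copied from the notebook functions
STRONG_THRESHOLD = 1.5
MODERATE_THRESHOLD = 0.5
LEVEL_WEIGHTS = {'strong': 2.0, 'moderate': 1.0}


def weighted_score(markers):
    """Sum level weight x connection strength over (level, strength) pairs."""
    return sum(LEVEL_WEIGHTS[level] * strength for level, strength in markers)


def classify(evidence, aspirational):
    """Apply the same decision order as determine_final_classification."""
    if evidence >= STRONG_THRESHOLD and aspirational < MODERATE_THRESHOLD:
        return 'strong_evidence'
    if evidence >= MODERATE_THRESHOLD and aspirational < evidence:
        return 'moderate_evidence'
    if aspirational >= STRONG_THRESHOLD and evidence < MODERATE_THRESHOLD:
        return 'strong_aspirational'
    if aspirational >= MODERATE_THRESHOLD and evidence < aspirational:
        return 'moderate_aspirational'
    if evidence > 0 and aspirational > 0:
        return 'moderate_evidence' if evidence > aspirational else 'moderate_aspirational'
    return 'neutral'


def classify_markers(evidence_markers, aspirational_markers):
    return classify(weighted_score(evidence_markers), weighted_score(aspirational_markers))


# One strongly connected strong-evidence marker: 2.0 * 0.9 clears the strong threshold
print(classify_markers([('strong', 0.9)], []))                                        # strong_evidence
# Aspirational signal outweighs a single moderate evidence marker
print(classify_markers([('moderate', 0.6)], [('moderate', 0.9), ('strong', 0.4)]))    # moderate_aspirational
# A lone weak marker stays below the moderate threshold
print(classify_markers([('moderate', 0.4)], []))                                      # neutral
# Equal scores fall through to the mixed branch, which resolves to aspirational
print(classify_markers([('moderate', 0.6)], [('moderate', 0.6)]))                     # moderate_aspirational
```

The last two cases are worth keeping in mind when reading the results: a single moderately connected marker is not enough to leave `neutral`, and exact ties are counted as aspirational.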

In [None]:
from collections import Counter, defaultdict  # used below; harmless if already imported earlier in the notebook

def add_evidence_aspirational_classification_enhanced(all_results, documents):
    """
    Add enhanced evidence vs aspirational classification to all valid green terms.
    
    Args:
        all_results: Dictionary containing results from green term analysis
        documents: Dictionary containing spaCy doc objects
    
    Returns:
        Dictionary: Updated results with enhanced evidence vs aspirational classification
    """
    print(f"\n{'='*80}")
    print("ENHANCED EVIDENCE VS ASPIRATIONAL CLASSIFICATION ANALYSIS")
    print(f"{'='*80}")
    print("Features: Dependency patterns support + Targeted marker detection")
    
    # Initialize summary statistics
    overall_ea_stats = Counter()
    doc_ea_stats = {}
    term_type_stats = Counter()
    
    for doc_name, results in all_results.items():
        doc = documents[doc_name]
        valid_terms = results['valid_terms']
        
        # Classify each valid green term with enhanced analysis
        doc_ea_counter = Counter()
        doc_term_type_counter = Counter()
        
        for green_term in valid_terms:
            # Classify this green term with enhanced method
            classification_result = classify_single_green_term_enhanced(green_term, doc)
            
            # Add enhanced classification to green term
            green_term['evidence_aspirational_class'] = classification_result['classification']
            green_term['ea_confidence'] = classification_result['confidence']
            green_term['ea_evidence_score'] = classification_result['evidence_score']
            green_term['ea_aspirational_score'] = classification_result['aspirational_score']
            green_term['ea_evidence_markers'] = classification_result['evidence_markers']
            green_term['ea_aspirational_markers'] = classification_result['aspirational_markers']
            green_term['ea_term_type'] = classification_result['term_type']
            green_term['ea_context_info'] = classification_result['context_info']
            green_term['ea_governor_info'] = classification_result['governor_info']
            green_term['ea_excluded_words'] = classification_result['excluded_words']
            green_term['ea_analysis_method'] = classification_result['analysis_method']
            
            # Update statistics
            ea_class = classification_result['classification']
            term_type = classification_result['term_type']
            
            doc_ea_counter[ea_class] += 1
            doc_term_type_counter[term_type] += 1
            overall_ea_stats[ea_class] += 1
            term_type_stats[term_type] += 1
        
        # Calculate evidence and aspirational intensity scores
        total_terms = len(valid_terms)
        strong_evidence_count = doc_ea_counter['strong_evidence']
        moderate_evidence_count = doc_ea_counter['moderate_evidence']
        strong_aspirational_count = doc_ea_counter['strong_aspirational']
        moderate_aspirational_count = doc_ea_counter['moderate_aspirational']

        # Strong classes are weighted 1.5 and moderate classes 1.0, so each score
        # ranges from 0 to 150 rather than 0 to 100
        if total_terms > 0:
            evidence_intensity_score = (((strong_evidence_count * 1.5) + (moderate_evidence_count * 1)) / total_terms) * 100
            aspirational_intensity_score = (((strong_aspirational_count * 1.5) + (moderate_aspirational_count * 1)) / total_terms) * 100
        else:
            evidence_intensity_score = 0.0
            aspirational_intensity_score = 0.0

        # Store document-level statistics
        doc_ea_stats[doc_name] = {
            'ea_distribution': doc_ea_counter,
            'term_type_distribution': doc_term_type_counter
        }
        
        # Store intensity scores in results
        all_results[doc_name]['evidence_intensity_score'] = evidence_intensity_score
        all_results[doc_name]['aspirational_intensity_score'] = aspirational_intensity_score
    
    # Print overall summary
    print(f"\nEvidence vs Aspirational Classification Results:")
    total_terms = sum(overall_ea_stats.values())
    
    # Print in logical order
    classification_order = ['strong_evidence', 'moderate_evidence', 'neutral', 
                          'moderate_aspirational', 'strong_aspirational']
    
    for ea_class in classification_order:
        count = overall_ea_stats[ea_class]
        percentage = (count / total_terms) * 100 if total_terms > 0 else 0
        print(f"  {ea_class.replace('_', ' ').title()}: {count} ({percentage:.1f}%)")
    
    # Print term type distribution
    print(f"\nTerm Type Distribution:")
    for term_type, count in term_type_stats.items():
        percentage = (count / total_terms) * 100 if total_terms > 0 else 0
        print(f"  {term_type.replace('_', ' ').title()}: {count} ({percentage:.1f}%)")
    
    # Document-level breakdown
    print(f"\nDocument breakdown:")
    print(f"{'Document':<25} {'StrEvid':<8} {'ModEvid':<8} {'Neutral':<8} {'ModAsp':<8} {'StrAsp':<8} {'Total':<8}")
    print("-" * 85)
    
    for doc_name, doc_stats in doc_ea_stats.items():
        ea_dist = doc_stats['ea_distribution']
        strong_ev = ea_dist['strong_evidence']
        mod_ev = ea_dist['moderate_evidence']
        neutral = ea_dist['neutral']
        mod_asp = ea_dist['moderate_aspirational']
        strong_asp = ea_dist['strong_aspirational']
        total = sum(ea_dist.values())
        
        print(f"{doc_name:<25} {strong_ev:<8} {mod_ev:<8} {neutral:<8} {mod_asp:<8} {strong_asp:<8} {total:<8}")
    
    # Store enhanced EA statistics in results
    for doc_name in all_results:
        all_results[doc_name]['ea_stats_enhanced'] = doc_ea_stats[doc_name]
    
    return all_results
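
The document-level intensity scores weight strong classifications by 1.5 and moderate ones by 1.0, then normalise by the number of valid terms. A small worked example of that arithmetic is sketched below; the class labels are made-up sample data and `intensity_score` is a hypothetical helper that mirrors the formula in the function above.

```python
from collections import Counter


def intensity_score(strong_count, moderate_count, total_terms):
    """Weighted share of strong (x1.5) and moderate (x1.0) terms, times 100."""
    if total_terms == 0:
        return 0.0
    return ((strong_count * 1.5 + moderate_count * 1.0) / total_terms) * 100


# Made-up classifications for five green terms in one document
classes = ['strong_evidence', 'moderate_evidence', 'moderate_evidence',
           'neutral', 'moderate_aspirational']
counts = Counter(classes)
total = len(classes)

evidence = intensity_score(counts['strong_evidence'], counts['moderate_evidence'], total)
aspirational = intensity_score(counts['strong_aspirational'], counts['moderate_aspirational'], total)

print(round(evidence, 1))                 # 70.0  -> (1 * 1.5 + 2 * 1.0) / 5 * 100
print(round(aspirational, 1))             # 20.0  -> (0 * 1.5 + 1 * 1.0) / 5 * 100
print(intensity_score(4, 0, 4))           # 150.0 -> every term is strong
```

Because of the 1.5 weight, the score is not a percentage: a document whose terms are all classified as strong reaches 150.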

In [None]:
# Apply enhanced evidence vs aspirational classification to all documents
all_results = add_evidence_aspirational_classification_enhanced(all_results, documents)

def analyze_enhanced_ea_patterns(all_results):
    """
    Analyze patterns in enhanced evidence vs aspirational classification results.
    """
    print(f"\n{'='*80}")
    print("ENHANCED EVIDENCE VS ASPIRATIONAL PATTERN ANALYSIS")
    print(f"{'='*80}")
    
    # Analyze by term type and classification
    term_type_ea_stats = defaultdict(Counter)
    marker_category_stats = Counter()
    confidence_stats = []
    connection_type_stats = Counter()
    search_scope_stats = Counter()
    
    for doc_name, results in all_results.items():
        for term in results['valid_terms']:
            if 'evidence_aspirational_class' in term:
                ea_class = term['evidence_aspirational_class']
                term_type = term.get('ea_term_type', 'unknown')
                confidence = term.get('ea_confidence', 0)
                
                term_type_ea_stats[term_type][ea_class] += 1
                confidence_stats.append(confidence)
                
                # Count marker categories, connection types, and search scopes
                for marker in term.get('ea_evidence_markers', []):
                    marker_category_stats[f"evidence_{marker['category']}"] += 1
                    connection_type_stats[marker.get('connection_type', 'unknown')] += 1
                    search_scope_stats[marker.get('search_scope', 'unknown')] += 1
                
                for marker in term.get('ea_aspirational_markers', []):
                    marker_category_stats[f"aspirational_{marker['category']}"] += 1
                    connection_type_stats[marker.get('connection_type', 'unknown')] += 1
                    search_scope_stats[marker.get('search_scope', 'unknown')] += 1
    
    print(f"\nClassification by term type:")
    print(f"{'Term Type':<20} {'StrEvid':<8} {'ModEvid':<8} {'Neutral':<8} {'ModAsp':<8} {'StrAsp':<8} {'Total':<8}")
    print("-" * 85)
    
    for term_type, ea_counter in term_type_ea_stats.items():
        strong_ev = ea_counter['strong_evidence']
        mod_ev = ea_counter['moderate_evidence']
        neutral = ea_counter['neutral']
        mod_asp = ea_counter['moderate_aspirational']
        strong_asp = ea_counter['strong_aspirational']
        total = sum(ea_counter.values())
        
        print(f"{term_type:<20} {strong_ev:<8} {mod_ev:<8} {neutral:<8} {mod_asp:<8} {strong_asp:<8} {total:<8}")
    
    print(f"\nTop marker categories used:")
    for category, count in marker_category_stats.most_common(10):
        print(f"  {category}: {count}")
    
    print(f"\nTop connection types:")
    for conn_type, count in connection_type_stats.most_common(10):
        print(f"  {conn_type}: {count}")
    
    print(f"\nSearch scope distribution:")
    for scope, count in search_scope_stats.most_common():
        print(f"  {scope}: {count}")
    
    if confidence_stats:
        avg_confidence = sum(confidence_stats) / len(confidence_stats)
        high_confidence = sum(1 for c in confidence_stats if c >= 0.7)
        print(f"\nConfidence Statistics:")
        print(f"  Average confidence: {avg_confidence:.3f}")
        print(f"  High confidence (≥0.7): {high_confidence}/{len(confidence_stats)} ({100*high_confidence/len(confidence_stats):.1f}%)")
    
    # Evidence and aspirational intensity score statistics per document
    print(f"\nEvidence & Aspirational Intensity Scores by Document:")
    evidence_scores = []
    aspirational_scores = []
    print(f"{'Document':<35} {'Evidence':<12} {'Aspirational':<12}")
    print("-" * 59)

    for doc_name, results in all_results.items():
        evidence_intensity = results.get('evidence_intensity_score', 0)
        aspirational_intensity = results.get('aspirational_intensity_score', 0)
        evidence_scores.append(evidence_intensity)
        aspirational_scores.append(aspirational_intensity)
        print(f"{doc_name:<35} {evidence_intensity:<12.2f} {aspirational_intensity:<12.2f}")

    if evidence_scores and aspirational_scores:
        avg_evidence = sum(evidence_scores) / len(evidence_scores)
        avg_aspirational = sum(aspirational_scores) / len(aspirational_scores)
        max_evidence = max(evidence_scores)
        max_aspirational = max(aspirational_scores)
        min_evidence = min(evidence_scores)
        min_aspirational = min(aspirational_scores)
        
        print(f"\nEvidence Intensity Statistics:")
        print(f"  Average evidence intensity: {avg_evidence:.2f}")
        print(f"  Maximum evidence intensity: {max_evidence:.2f}")
        print(f"  Minimum evidence intensity: {min_evidence:.2f}")
        
        print(f"\nAspirational Intensity Statistics:")
        print(f"  Average aspirational intensity: {avg_aspirational:.2f}")
        print(f"  Maximum aspirational intensity: {max_aspirational:.2f}")
        print(f"  Minimum aspirational intensity: {min_aspirational:.2f}")
        
        # Evidence intensity distribution
        high_evidence = sum(1 for score in evidence_scores if score >= 30)
        medium_evidence = sum(1 for score in evidence_scores if 10 <= score < 30)
        low_evidence = sum(1 for score in evidence_scores if score < 10)
        
        # Aspirational intensity distribution  
        high_aspirational = sum(1 for score in aspirational_scores if score >= 30)
        medium_aspirational = sum(1 for score in aspirational_scores if 10 <= score < 30)
        low_aspirational = sum(1 for score in aspirational_scores if score < 10)
        
        print(f"\nEvidence Intensity Distribution:")
        print(f"  High evidence (≥30): {high_evidence} documents")
        print(f"  Medium evidence (10 to <30): {medium_evidence} documents")
        print(f"  Low evidence (<10): {low_evidence} documents")
        
        print(f"\nAspirational Intensity Distribution:")
        print(f"  High aspirational (≥30): {high_aspirational} documents")
        print(f"  Medium aspirational (10 to <30): {medium_aspirational} documents")
        print(f"  Low aspirational (<10): {low_aspirational} documents")
    
    return term_type_ea_stats, marker_category_stats, connection_type_stats

def print_enhanced_ea_examples_by_document(all_results, documents, examples_per_class=2):
    """
    Print enhanced examples of evidence vs aspirational classifications for each document.
    """
    print(f"\n{'='*80}")
    print("ENHANCED EVIDENCE VS ASPIRATIONAL EXAMPLES BY DOCUMENT")
    print(f"{'='*80}")
    
    classification_order = ['strong_evidence', 'moderate_evidence', 'neutral', 
                          'moderate_aspirational', 'strong_aspirational']
    
    for doc_name, results in all_results.items():
        evidence_intensity = results.get('evidence_intensity_score', 0)
        aspirational_intensity = results.get('aspirational_intensity_score', 0)
        print(f"\n{'='*60}")
        print(f"DOCUMENT: {doc_name} (Evidence: {evidence_intensity:.2f}, Aspirational: {aspirational_intensity:.2f})")
        print(f"{'='*60}")
        
        doc = documents[doc_name]
        valid_terms = results['valid_terms']
        
        # Group terms by EA classification
        terms_by_class = defaultdict(list)
        for term in valid_terms:
            if 'evidence_aspirational_class' in term:
                ea_class = term['evidence_aspirational_class']
                terms_by_class[ea_class].append(term)
        
        # Print summary with term type breakdown
        total_terms = len(valid_terms)
        print(f"Total terms: {total_terms}")
        
        # Term type summary
        term_type_counts = Counter()
        for term in valid_terms:
            term_type = term.get('ea_term_type', 'unknown')
            term_type_counts[term_type] += 1
        
        print(f"Term types: ", end="")
        for term_type, count in term_type_counts.items():
            percentage = (count / total_terms) * 100 if total_terms > 0 else 0
            print(f"{term_type}={count}({percentage:.1f}%) ", end="")
        print()
        
        # EA classification summary
        for ea_class in classification_order:
            count = len(terms_by_class[ea_class])
            percentage = (count / total_terms) * 100 if total_terms > 0 else 0
            print(f"  {ea_class.replace('_', ' ').title()}: {count} ({percentage:.1f}%)")
        
        # Show examples for each classification
        for ea_class in classification_order:
            class_terms = terms_by_class[ea_class]
            if class_terms:
                print(f"\n{ea_class.replace('_', ' ').upper()} EXAMPLES:")
                print("-" * 40)
                
                examples_to_show = min(examples_per_class, len(class_terms))
                
                for i, term in enumerate(class_terms[:examples_to_show]):
                    sentence = term['sentence']
                    raw_sentence = sentence.text
                    sentence_text = raw_sentence.strip()
                    
                    # Highlight the green term
                    try:
                        start_char = doc[term['start_idx']].idx
                        end_char = doc[term['end_idx']].idx + len(doc[term['end_idx']].text)
                        sentence_start_char = sentence.start_char
                        
                        # Offsets are relative to the unstripped sentence, so highlight
                        # first and strip surrounding whitespace afterwards
                        relative_start = start_char - sentence_start_char
                        relative_end = end_char - sentence_start_char
                        
                        if 0 <= relative_start < relative_end <= len(raw_sentence):
                            highlighted_sentence = (
                                raw_sentence[:relative_start] + 
                                f"**{raw_sentence[relative_start:relative_end]}**" + 
                                raw_sentence[relative_end:]
                            ).strip()
                        else:
                            highlighted_sentence = sentence_text
                    except Exception:
                        # Fall back to the plain sentence if the term offsets are unusable
                        highlighted_sentence = sentence_text
                    
                    print(f"{i+1}. '{term['term']}' ({term.get('ea_term_type', 'unknown')})")
                    print(f"   Context: {highlighted_sentence}")
                    print(f"   Confidence: {term.get('ea_confidence', 0):.3f}")
                    
                    # Show enhanced information
                    governor_info = term.get('ea_governor_info', {})
                    if governor_info.get('semantic_governor'):
                        print(f"   Semantic Governor: '{governor_info['semantic_governor']}' ({governor_info.get('governor_pos', 'unknown')})")
                    
                    # Show excluded words for context-dependent terms
                    excluded_words = term.get('ea_excluded_words', [])
                    if excluded_words:
                        print(f"   Excluded Words: {excluded_words}")
                    
                    # Show markers with enhanced details
                    evidence_markers = term.get('ea_evidence_markers', [])
                    aspirational_markers = term.get('ea_aspirational_markers', [])
                    
                    if evidence_markers:
                        marker_details = []
                        for m in evidence_markers:
                            scope = m.get('search_scope', 'unknown')
                            conn_type = m.get('connection_type', 'unknown')
                            marker_details.append(f"'{m['text']}' ({m['strength']}, {scope}, {conn_type})")
                        print(f"   Evidence markers: {', '.join(marker_details)}")
                    
                    if aspirational_markers:
                        marker_details = []
                        for m in aspirational_markers:
                            scope = m.get('search_scope', 'unknown')
                            conn_type = m.get('connection_type', 'unknown')
                            marker_details.append(f"'{m['text']}' ({m['strength']}, {scope}, {conn_type})")
                        print(f"   Aspirational markers: {', '.join(marker_details)}")
                    
                    if not evidence_markers and not aspirational_markers:
                        print(f"   Markers: None (classified as neutral)")
                    
                    # Show dependency pattern details for dependency terms
                    if term.get('ea_term_type') == 'dependency':
                        context_info = term.get('ea_context_info', {})
                        pattern_name = context_info.get('pattern_name', 'unknown')
                        dependency_relation = context_info.get('dependency_relation', 'unknown')
                        print(f"   Dependency Pattern: {pattern_name} ({dependency_relation})")
                    
                    print()
                
                if len(class_terms) > examples_to_show:
                    print(f"   ... and {len(class_terms) - examples_to_show} more {ea_class.replace('_', ' ')} terms")

def print_classification_comparison(all_results):
    """
    Print comparison between different classification dimensions.
    """
    print(f"\n{'='*80}")
    print("MULTI-DIMENSIONAL CLASSIFICATION COMPARISON")
    print(f"{'='*80}")
    
    # Combine temporal, quantification, and evidence/aspirational classifications
    combined_stats = defaultdict(Counter)
    
    for doc_name, results in all_results.items():
        for term in results['valid_terms']:
            temporal_class = term.get('temporal_class', 'unknown')
            quant_level = term.get('quantification_level', 'unknown')
            ea_class = term.get('evidence_aspirational_class', 'unknown')
            
            # Cross-tabulations (tuple keys keep class names that contain underscores intact)
            combined_stats['temporal_ea'][(temporal_class, ea_class)] += 1
            combined_stats['quant_ea'][(quant_level, ea_class)] += 1
            combined_stats['temporal_quant'][(temporal_class, quant_level)] += 1
    
    print("\nTemporal vs Evidence/Aspirational (Top 10):")
    for (first, second), count in combined_stats['temporal_ea'].most_common(10):
        print(f"  {first} + {second}: {count}")
    
    print("\nQuantification vs Evidence/Aspirational (Top 10):")
    for (first, second), count in combined_stats['quant_ea'].most_common(10):
        print(f"  {first} + {second}: {count}")
    
    print("\nTemporal vs Quantification (Top 10):")
    for (first, second), count in combined_stats['temporal_quant'].most_common(10):
        print(f"  {first} + {second}: {count}")

# Run enhanced analysis
term_type_patterns, marker_stats, connection_stats = analyze_enhanced_ea_patterns(all_results)
print_enhanced_ea_examples_by_document(all_results, documents, examples_per_class=2)
print_classification_comparison(all_results)

print(f"\n{'='*80}")
print("ENHANCED EVIDENCE VS ASPIRATIONAL CLASSIFICATION WITH INTENSITY COMPLETE")
print(f"{'='*80}")
print("All valid green terms classified with enhanced analysis")
print("Special handling for direct, context-dependent, AND dependency pattern terms")
print("Targeted dependency-first marker detection (no sentence-wide search)")
print("Enhanced connection assessment with role-aware scoring")
print("Comprehensive multi-dimensional analysis")
print("Individual term classification with detailed confidence and connection info")
print("Evidence intensity: ((strong_count * 1.5) + (moderate_count * 1)) / total_terms * 100")
print("Aspirational intensity: ((strong_count * 1.5) + (moderate_count * 1)) / total_terms * 100")
print("Use intensity scores to compare evidence vs aspirational balance across documents")
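
The intensity formula printed above, together with the high/medium/low thresholds used in `analyze_enhanced_ea_patterns`, can be sketched as two small functions. This is an illustrative sketch only: the names `intensity_score` and `intensity_band` are hypothetical, and the notebook computes the stored scores elsewhere.

```python
def intensity_score(strong_count, moderate_count, total_terms):
    """((strong * 1.5) + (moderate * 1)) / total_terms * 100, or 0.0 when there are no terms."""
    if total_terms <= 0:
        return 0.0
    return ((strong_count * 1.5) + (moderate_count * 1.0)) / total_terms * 100


def intensity_band(score):
    """Bucket a score using the thresholds from the distribution report."""
    if score >= 30:
        return "high"
    if score >= 10:
        return "medium"
    return "low"


# Hypothetical document: 5 strong and 5 moderate evidence terms out of 25 valid terms
score = intensity_score(5, 5, 25)
print(score, intensity_band(score))
```

Because strong terms carry a weight of 1.5, the score is not a percentage: a document whose terms are all strong reaches 150, which is worth keeping in mind when comparing scores across documents.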

In [None]:
def create_context_classification_dataframe(all_results):
    """
    Create a focused DataFrame for context classification analysis only.
    Rows: one per document (organization-year combination, parsed from the document name)
    Columns: Context classification metrics (temporal, quantification, evidence/aspirational)
    """
    data = []
    
    for doc_name, term_data in all_results.items():
        
        # Extract organization and year from document name
        # (expects '<organization>_<year>'; int(year) below raises ValueError otherwise)
        org_name, _, year = doc_name.rpartition('_')
        
        # Get valid terms for classification analysis
        valid_terms = term_data['valid_terms']
        valid_terms_count = len(valid_terms)
        
        # TEMPORAL CLASSIFICATION METRICS
        temporal_stats = {}
        for temporal_class in ['past', 'present', 'future', 'unclear']:
            count = sum(1 for term in valid_terms if term.get('temporal_class') == temporal_class)
            temporal_stats[f'temporal_{temporal_class}'] = count
            temporal_stats[f'temporal_{temporal_class}_pct'] = round((count / valid_terms_count * 100) if valid_terms_count > 0 else 0, 2)
        
        # QUANTIFICATION CLASSIFICATION METRICS
        quant_stats = {}
        for quant_level in ['highly_quantified', 'partially_quantified', 'non_quantified']:
            count = sum(1 for term in valid_terms if term.get('quantification_level') == quant_level)
            quant_stats[f'quant_{quant_level}'] = count
            quant_stats[f'quant_{quant_level}_pct'] = round((count / valid_terms_count * 100) if valid_terms_count > 0 else 0, 2)
        
        # QUANTIFICATION CONFIDENCE AND INTENSITY SCORES
        quantified_terms = [term for term in valid_terms 
                          if term.get('quantification_level') in ['highly_quantified', 'partially_quantified']]
        if quantified_terms:
            quant_confidences = [term.get('quantification_confidence', 0) for term in quantified_terms]
            quant_avg_confidence = round(sum(quant_confidences) / len(quant_confidences), 3)
        else:
            quant_avg_confidence = 0.0
        
        quant_stats['quant_avg_confidence'] = quant_avg_confidence
        quant_stats['quantification_intensity_score'] = round(term_data.get('quantification_stats', {}).get('quantification_intensity_score', 0), 2)        
        
        # EVIDENCE/ASPIRATIONAL CLASSIFICATION METRICS
        ea_stats = {}
        for ea_class in ['strong_evidence', 'moderate_evidence', 'neutral', 'moderate_aspirational', 'strong_aspirational']:
            count = sum(1 for term in valid_terms if term.get('evidence_aspirational_class') == ea_class)
            ea_stats[f'ea_{ea_class}'] = count
            ea_stats[f'ea_{ea_class}_pct'] = round((count / valid_terms_count * 100) if valid_terms_count > 0 else 0, 2)
        
        # EVIDENCE/ASPIRATIONAL CONFIDENCE SCORES
        evidence_terms = [term for term in valid_terms 
                         if term.get('evidence_aspirational_class') in ['strong_evidence', 'moderate_evidence']]
        aspirational_terms = [term for term in valid_terms 
                            if term.get('evidence_aspirational_class') in ['strong_aspirational', 'moderate_aspirational']]
        
        if evidence_terms:
            evidence_confidences = [term.get('ea_confidence', 0) for term in evidence_terms]
            evidence_avg_confidence = round(sum(evidence_confidences) / len(evidence_confidences), 3)
        else:
            evidence_avg_confidence = 0.0
            
        if aspirational_terms:
            aspirational_confidences = [term.get('ea_confidence', 0) for term in aspirational_terms]
            aspirational_avg_confidence = round(sum(aspirational_confidences) / len(aspirational_confidences), 3)
        else:
            aspirational_avg_confidence = 0.0
        
        ea_stats['evidence_avg_confidence'] = evidence_avg_confidence
        ea_stats['aspirational_avg_confidence'] = aspirational_avg_confidence
        ea_stats['evidence_intensity_score'] = round(term_data.get('evidence_intensity_score', 0), 2)
        ea_stats['aspirational_intensity_score'] = round(term_data.get('aspirational_intensity_score', 0), 2)
        
        # COMBINED CLASSIFICATION INSIGHTS
        # High-impact terms (highly quantified + strong evidence)
        high_impact_terms = sum(1 for term in valid_terms 
                               if term.get('quantification_level') == 'highly_quantified' 
                               and term.get('evidence_aspirational_class') == 'strong_evidence')
        
        # Future aspirational terms (future + aspirational)
        future_aspirational = sum(1 for term in valid_terms 
                                 if term.get('temporal_class') == 'future' 
                                 and term.get('evidence_aspirational_class') in ['moderate_aspirational', 'strong_aspirational'])
        
        # Past evidence terms (past + evidence)
        past_evidence = sum(1 for term in valid_terms 
                           if term.get('temporal_class') == 'past' 
                           and term.get('evidence_aspirational_class') in ['moderate_evidence', 'strong_evidence'])
        
        # Present quantified terms (present + quantified)
        present_quantified = sum(1 for term in valid_terms 
                                if term.get('temporal_class') == 'present' 
                                and term.get('quantification_level') in ['highly_quantified', 'partially_quantified'])
        
        # Calculate overall average confidence scores
        all_ea_confidences = [term.get('ea_confidence', 0) for term in valid_terms if 'ea_confidence' in term]
        avg_ea_confidence = round(sum(all_ea_confidences) / len(all_ea_confidences), 3) if all_ea_confidences else 0
        
        # Create the row dictionary
        row = {
            # Basic identifiers
            'organization': org_name,
            'year': int(year),
            
            # Temporal classification metrics
            **temporal_stats,
            
            # Quantification classification metrics (counts, percentages, confidence, intensity)
            **quant_stats,
            
            # Evidence/Aspirational classification metrics (counts, percentages, confidences, intensities)
            **ea_stats,
            
            # Combined insights
            'high_impact_terms': high_impact_terms,
            'future_aspirational': future_aspirational,
            'past_evidence': past_evidence,
            'present_quantified': present_quantified,
            'avg_ea_confidence': avg_ea_confidence
        }
        
        data.append(row)
    
    # Create DataFrame
    context_classification_df = pd.DataFrame(data)
    
    # Sort by organization and year
    context_classification_df = context_classification_df.sort_values(['organization', 'year'])
    
    return context_classification_df

# Create the DataFrame
context_classification_df = create_context_classification_dataframe(all_results)

# Save to Excel
excel_path = "data/NLP/Results/Communication_Score_df_Context.xlsx"
context_classification_df.to_excel(excel_path, index=False)

print("CONTEXT CLASSIFICATION DATAFRAME CREATED")
print("="*80)
print(context_classification_df.head())

print(f"\nDataFrame shape: {context_classification_df.shape}")
print(f"Columns ({len(context_classification_df.columns)}):")

# Column descriptions in logical order
column_descriptions = {
    # Basic info
    'organization': 'Organization name',
    'year': 'Report year',
    
    # Temporal classification (counts and percentages)
    'temporal_past': 'Past temporal context terms',
    'temporal_past_pct': 'Past temporal terms percentage',
    'temporal_present': 'Present temporal context terms', 
    'temporal_present_pct': 'Present temporal terms percentage',
    'temporal_future': 'Future temporal context terms',
    'temporal_future_pct': 'Future temporal terms percentage',
    'temporal_unclear': 'Terms with unclear temporal context',
    'temporal_unclear_pct': 'Unclear temporal terms percentage',
    
    # Quantification classification (counts, percentages, confidence, intensity)
    'quant_highly_quantified': 'Highly quantified terms',
    'quant_highly_quantified_pct': 'Highly quantified percentage',
    'quant_partially_quantified': 'Partially quantified terms',
    'quant_partially_quantified_pct': 'Partially quantified percentage',
    'quant_non_quantified': 'Non-quantified terms',
    'quant_non_quantified_pct': 'Non-quantified percentage',
    'quant_avg_confidence': 'Average confidence for quantified terms (highly + partially)',
    'quantification_intensity_score': 'Quantification intensity score ((highly*1.5 + partially*1)/total*100)',
    
    # Evidence/Aspirational classification (counts, percentages, confidences, intensities)
    'ea_strong_evidence': 'Strong evidence terms',
    'ea_strong_evidence_pct': 'Strong evidence percentage',
    'ea_moderate_evidence': 'Moderate evidence terms',
    'ea_moderate_evidence_pct': 'Moderate evidence percentage',
    'ea_neutral': 'Neutral terms',
    'ea_neutral_pct': 'Neutral terms percentage',
    'ea_moderate_aspirational': 'Moderate aspirational terms',
    'ea_moderate_aspirational_pct': 'Moderate aspirational percentage',
    'ea_strong_aspirational': 'Strong aspirational terms',
    'ea_strong_aspirational_pct': 'Strong aspirational percentage',
    'evidence_avg_confidence': 'Average confidence for evidence terms (strong + moderate)',
    'aspirational_avg_confidence': 'Average confidence for aspirational terms (strong + moderate)',
    'evidence_intensity_score': 'Evidence intensity score ((strong*1.5 + moderate*1)/total*100)',
    'aspirational_intensity_score': 'Aspirational intensity score ((strong*1.5 + moderate*1)/total*100)',
    
    # Combined insights
    'high_impact_terms': 'Highly quantified + strong evidence terms',
    'future_aspirational': 'Future + aspirational terms',
    'past_evidence': 'Past + evidence terms',
    'present_quantified': 'Present + quantified terms',
    'avg_ea_confidence': 'Average evidence/aspirational confidence (all terms)'
}

for col, desc in column_descriptions.items():
    if col in context_classification_df.columns:
        print(f"  {col:<35}: {desc}")

print(f"\nData saved as: {excel_path}")
print(f"Variable available as: context_classification_df")
print("Contains ONLY context classification insights (temporal, quantification, evidence/aspirational)")
print("NEW: Added confidence scores for quantified and evidence/aspirational terms")
print("NEW: Added intensity scores for quantification, evidence, and aspirational classifications")
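
`create_context_classification_dataframe` assumes every document name has the form `<organization>_<year>`. A minimal sketch of a stricter parser that fails with a descriptive message is shown below. The function name `split_doc_name` and the example document name are hypothetical, and the notebook itself does not perform this validation.

```python
def split_doc_name(doc_name):
    """Split '<organization>_<year>' into (organization, year as int)."""
    org_name, separator, year_text = doc_name.rpartition("_")
    if not separator or not org_name or not year_text.isdigit():
        raise ValueError(
            f"Expected a name like '<organization>_<year>', got {doc_name!r}"
        )
    return org_name, int(year_text)


# Hypothetical document name, for illustration only
organization, year = split_doc_name("acme_energy_2022")
print(organization, year)
```

Splitting on the last underscore keeps organization names that themselves contain underscores intact.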

In [None]:
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
from openpyxl.utils import get_column_letter

# Output path (same file as the previous cell; it is overwritten with the formatted version)
output_path = "data/NLP/Results/Communication_Score_df_Context.xlsx"

# Save the DataFrame to Excel
context_classification_df.to_excel(output_path, index=False, engine="openpyxl")

# Load the workbook and sheet
wb = load_workbook(output_path)
ws = wb.active  # There's only one sheet since we saved just one DataFrame

# Auto-adjust column widths based on the longest string in each column
for col in ws.columns:
    max_length = 0
    col_letter = get_column_letter(col[0].column)
    for cell in col:
        if cell.value is not None:
            max_length = max(max_length, len(str(cell.value)))
    ws.column_dimensions[col_letter].width = max_length + 3  # Add padding

# Define grey fill for alternating rows
grey_fill = PatternFill(start_color="D9D9D9", end_color="D9D9D9", fill_type="solid")

# Alternate row colors by company
prev_company = None
use_grey = False
for row in range(2, ws.max_row + 1):
    current_company = ws[f"A{row}"].value  # Column A has the company names
    if current_company != prev_company:
        use_grey = not use_grey
        prev_company = current_company

    if use_grey:
        for col in range(1, ws.max_column + 1):
            ws.cell(row=row, column=col).fill = grey_fill

# Save the final cleaned and formatted workbook
wb.save(output_path)
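
The row-banding loop above toggles the fill each time the company in column A changes, which only groups rows correctly because the DataFrame was sorted by organization before saving. The toggle logic can be sketched independently of openpyxl as follows. The helper name `banding_flags` and the sample company names are hypothetical, and the sketch uses a sentinel object rather than `None` as the initial previous value.

```python
def banding_flags(group_keys):
    """Return one boolean per row: True where the row should be shaded."""
    flags = []
    previous = object()  # sentinel that never equals a real key
    shaded = False
    for key in group_keys:
        if key != previous:
            # A new group starts: flip the shading state
            shaded = not shaded
            previous = key
        flags.append(shaded)
    return flags


# Hypothetical company column, already sorted by organization
companies = ["A", "A", "B", "C", "C"]
print(banding_flags(companies))
```

As in the notebook loop, the first group is shaded and shading alternates from there, so adjacent companies are always visually separated.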
