# Environmental Sentiment Analysis using Climate-BERT

## Overview
This module analyzes environmental sentiment using Climate-BERT, a specialized model trained on climate texts. It measures how positively or negatively companies frame environmental topics, supporting both Green Communication Intensity and Substantiation Weakness dimensions through sentiment-based analysis.

## Climate-BERT Integration
- **Specialized model**: Uses "climatebert/distilroberta-base-climate-sentiment" for climate-specific sentiment classification
- **Three-class output**: OPPORTUNITY (positive climate sentiment), RISK (negative climate sentiment), NEUTRAL (neutral content)  
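- **Sentiment scoring**: Final score = Opportunity Score - Risk Score. With both scores taken as class probabilities in [0, 1], the result ranges from -1 (entirely risk) to +1 (entirely opportunity)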
- **Sentence-level analysis**: Processes individual sentences containing green terms for granular sentiment assessment
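
The scoring rule can be illustrated with a minimal sketch. It assumes the classifier yields one probability per label; the dictionary layout, the exact label strings, and the helper name `sentiment_score` are illustrative assumptions rather than part of the notebook.

```python
def sentiment_score(probs):
    """Return the opportunity probability minus the risk probability."""
    # A label that is missing from the dictionary is treated as probability 0.
    return probs.get("OPPORTUNITY", 0.0) - probs.get("RISK", 0.0)


# Hypothetical classifier output for one sentence (probabilities sum to 1).
example = {"OPPORTUNITY": 0.7, "NEUTRAL": 0.2, "RISK": 0.1}
score = sentiment_score(example)  # positive value: opportunity framing dominates
```

The NEUTRAL probability does not enter the score directly, but it pulls the result toward 0 because it reduces the mass left for the other two classes.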

## Multi-Level Analysis
1. **Document-level sentiment**: Average sentiment across all sustainability sentences per report
2. **Topic-weighted sentiment**: Separate analysis for renewable energy and climate emissions categories
3. **Weighted contribution**: When a sentence contains terms from multiple topics, its sentiment contribution to each topic is weighted by that topic's term frequency within the sentence
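
One possible reading of the weighting rule is sketched below. It assumes that a sentence's score is shared among topics in proportion to the number of matched terms per topic, and that a topic's sentiment is the weighted mean over sentences. The function name and input layout are hypothetical.

```python
from collections import Counter, defaultdict


def topic_weighted_sentiment(sentences):
    """sentences: list of (score, [topic, ...]) pairs, one topic entry per matched term."""
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for score, topics in sentences:
        counts = Counter(topics)
        n_terms = sum(counts.values())
        if n_terms == 0:
            continue  # sentence has no matched terms, nothing to attribute
        for topic, count in counts.items():
            weight = count / n_terms  # share of the sentence's terms in this topic
            weighted_sum[topic] += weight * score
            weight_total[topic] += weight
    return {t: weighted_sum[t] / weight_total[t] for t in weighted_sum}


# Two toy sentences with hypothetical opportunity-minus-risk scores.
toy = [
    (0.8, ["renewable_energy", "renewable_energy", "climate_emissions"]),
    (-0.4, ["climate_emissions"]),
]
result = topic_weighted_sentiment(toy)
```

In the toy data the first sentence counts with weight 2/3 toward renewable energy and 1/3 toward climate emissions, so the negative second sentence dominates the climate emissions average.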

## Topic-Specific Focus
- **Renewable energy terms**: "solar power", "wind turbines", "clean energy generation"
- **Climate emissions terms**: "carbon neutral", "emission reduction", "decarbonization"  
- **Justification**: Companies tend to frame these topics in distinctly positive terms, and that framing shapes public environmental perception

## Variables Produced for Communication Scoring
According to the analysis framework:
- **Average Environmental Sentiment** → Green Communication Intensity dimension
- **Renewable Energy Sentiment** → Green Communication Intensity dimension  
- **Climate and Emissions Sentiment** → Green Communication Intensity dimension
- **Positive framing assessment derived from these sentiment scores** → Substantiation Weakness dimension

## Processing Pipeline
Extracts sentences with green terms → Climate-BERT sentiment classification → Topic-based weighting → Aggregated sentiment scores for communication gap analysis.
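
The flow of the pipeline can be sketched end to end with a stand-in classifier. Everything here is a simplification under stated assumptions: `stub_classifier` is a hypothetical replacement for the Climate-BERT call, green terms are matched by plain whitespace tokenization instead of the spaCy lemma matcher, and topic weighting is left out.

```python
def document_sentiment(sentences, green_terms, classify):
    """Average opportunity-minus-risk score over sentences that mention a green term."""
    scores = []
    for sentence in sentences:
        words = sentence.lower().split()
        if not any(term in words for term in green_terms):
            continue  # skip sentences without green vocabulary
        probs = classify(sentence)
        scores.append(probs["OPPORTUNITY"] - probs["RISK"])
    return sum(scores) / len(scores) if scores else 0.0


def stub_classifier(sentence):
    """Hypothetical stand-in for the model call; returns fixed probabilities."""
    if "risk" in sentence.lower():
        return {"OPPORTUNITY": 0.1, "NEUTRAL": 0.1, "RISK": 0.8}
    return {"OPPORTUNITY": 0.6, "NEUTRAL": 0.3, "RISK": 0.1}


report = [
    "We expanded our solar capacity.",
    "Flooding is a growing risk for our wind assets.",
    "The board met four times.",
]
avg = document_sentiment(report, {"solar", "wind"}, stub_classifier)
```

Only the first two sentences are scored; the third contains no green term and therefore does not dilute the average.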

In [None]:
import spacy
from spacy_layout import spaCyLayout
from pathlib import Path
import pandas as pd
import numpy as np
import re
from collections import defaultdict, Counter

# Load spaCy model and configure for large documents
nlp = spacy.load("en_core_web_lg")
nlp.max_length = 1_500_000

In [None]:
from pathlib import Path

# Toggle between "test" and "actual"
MODE = "actual"  

# Define configuration based on mode
if MODE == "test":
    report_names = [ 
        "Axpo_Holding_AG", "NEOEN_SA"
    ]
    folders = {
        "2021": Path("data/NLP/Testing/Reports/Clean/2021"),
        "2022": Path("data/NLP/Testing/Reports/Clean/2022")
    }

elif MODE == "actual":
    report_names = [ 
        "Akenerji_Elektrik_Uretim_AS",
        "Arendals_Fossekompani_ASA",
        "Atlantica_Sustainable_Infrastructure_PLC",
        "CEZ",
        "EDF",
        "EDP_Energias_de_Portugal_SA",
        "Endesa",
        "ERG_SpA",
        "Orsted",
        "Polska_Grupa_Energetyczna_PGE_SA",
        "Romande_Energie_Holding_SA",
        "Scatec_ASA",
        "Solaria_Energia_y_Medio_Ambiente_SA",
        "Terna_Energy_SA"
    ]

    folders = {
        "2021": Path("data/NLP/Reports/Cleanest/2021"),
        "2022": Path("data/NLP/Reports/Cleanest/2022")
    }

else:
    raise ValueError("Invalid MODE. Use 'test' or 'actual'.")

# Check availability
for name in report_names:
    file_name = f"{name}.txt"
    in_2021 = (folders["2021"] / file_name).exists()
    in_2022 = (folders["2022"] / file_name).exists()
    print(f"{file_name}: 2021: {'YES' if in_2021 else 'NO'} | 2022: {'YES' if in_2022 else 'NO'}")


In [None]:
# Dictionary to store processed docs
documents = {}

# Load and process all documents
for version, folder_path in folders.items():
    for name in report_names:
        txt_path = folder_path / f"{name}.txt"
        try:
            with open(txt_path, "r", encoding="utf-8") as f:
                text = f.read()
            doc_key = f"{name}_{version}"
            documents[doc_key] = nlp(text)
            print(f"Processed {doc_key}")
        except Exception as e:
            print(f"Error processing {txt_path.name}: {e}")

print(f"\nTotal documents loaded: {len(documents)}")

## Topics

In [None]:
# GREEN TOPIC CATEGORIZATION WORD LISTS

# =============================================================================
# 1. RENEWABLE ENERGY SOURCES
# =============================================================================

# Single-word renewable energy terms (lemmas)
renewable_energy_nouns = [
    "biogas", "biofuel", "biomass", "geothermal", "hydroelectric", "hydropower", 
    "photovoltaic", "pv", "renewables", "solar", "turbine", "wind"
]

renewable_energy_adverbs = [
    "renewably"
]

# Multi-word renewable energy terms (lemmas) - base_word: [modifier_words]
renewable_energy_multiword = {
    "energy": ["alternative", "bio", "biomass", "clean", "geothermal", "green", "hydro", "renewable", "solar", "tidal", "wind"],
    "fuel": ["alternative", "bio", "biomass", "clean", "renewable", "synthetic"],
    "fuels": ["alternative", "clean", "renewable", "synthetic"],
    "gas": ["bio", "biomass", "renewable"],
    "mass": ["bio"],
    "fired": ["biomass"],
    "fueled": ["biomass"],
    "powered": ["biomass"],
    "hydrogen": ["blue", "clean", "green", "renewable"],
    "power": ["hydro", "renewable", "solar", "wind"],
    "farm": ["offshore", "solar", "wind"],
    "farms": ["offshore", "solar", "wind"],
    "station": ["offshore", "solar", "wind"],
    "stations": ["offshore", "solar", "wind"],
    "turbine": ["offshore", "onshore", "wind"],
    "turbines": ["offshore", "onshore", "wind"],
    "wind": ["offshore"],
    "panel": ["photovoltaic", "pv", "solar"],
    "panels": ["photovoltaic", "pv", "solar"],
    "generation": ["renewable"],
    "source": ["renewable"],
    "sources": ["renewable"],
    "plant": ["solar", "wind"],
    "plants": ["solar", "wind"]
}

# =============================================================================
# 2. CLIMATE & EMISSIONS MANAGEMENT
# =============================================================================

# Single-word climate & emissions terms (lemmas)
climate_emissions_nouns = [
    "climate", "co2", "co2e", "decarbonisation", "decarbonization", "emission", "emissions", 
    "footprint", "ghg", "greenhouse", "methane", "mitigation", "pollution", "scope1", 
    "scope2", "scope3", "sequestration"
]

climate_emissions_verbs = [
    "decarbonise", "decarbonised", "decarbonising", "decarbonize", "decarbonized", "decarbonizing"
]

# Multi-word climate & emissions terms (lemmas) - base_word: [modifier_words]
climate_emissions_multiword = {
    "emission": ["annual", "baseline", "carbon", "co2", "co2e", "direct", "ghg", "indirect", "scope", "total"],
    "emissions": ["annual", "baseline", "carbon", "co2", "co2e", "direct", "ghg", "indirect", "scope", "total"],
    "abatement": ["carbon", "co2", "co2e", "emission", "ghg", "pollution"],
    "capture": ["carbon", "co2", "ghg", "methane"],
    "captured": ["carbon"],
    "capturing": ["carbon"],
    "economy": ["carbon"],
    "footprint": ["carbon", "co2", "ecological", "environmental", "ghg", "zero"],
    "free": ["carbon", "co2", "emission", "emissions", "pollution"],
    "goal": ["carbon", "climate", "emission"],
    "goals": ["carbon", "climate", "emission"],
    "impact": ["carbon", "climate", "ecological", "environmental"],
    "intensity": ["carbon", "co2", "emission", "fuel", "ghg"],
    "low": ["carbon", "emission", "emissions"],
    "lower": ["carbon"],
    "management": ["carbon"],
    "negative": ["carbon", "co2"],
    "neutral": ["carbon", "climate", "co2", "emission"],
    "neutrality": ["carbon", "climate", "co2", "emission"],
    "non": ["carbon", "emitting"],
    "sequestered": ["carbon"],
    "storage": ["carbon", "co2"],
    "zero": ["carbon", "co2", "emission", "emissions", "footprint", "net", "pollution", "waste"],
    "change": ["climate"],
    "emitting": ["non", "zero"],
    "risk": ["climate"],
    "science": ["climate"],
    "warming": ["global"],
    "gas": ["greenhouse"],
    "1": ["scope"],
    "2": ["scope"],
    "3": ["scope"]
}

# =============================================================================
# 3. ENVIRONMENTAL CONSERVATION & BIODIVERSITY
# =============================================================================

# Single-word environmental conservation terms (lemmas)
environmental_conservation_nouns = [
    "adaptation", "afforestation", "biodiversity", "conservation", "deforestation", 
    "ecology", "ecosystem", "ecosystemic", "environment", "forest", "habitat", 
    "nature", "preservation", "reforestation", "regeneration", "restoration", 
    "soil", "species", "wildlife"
]

environmental_conservation_adjectives = [
    "ecological", "environmental", "ecosystemic"
]

environmental_conservation_adverbs = [
    "ecologically", "environmentally"
]

environmental_conservation_verbs = [
    "afforest", "afforesting", "conserve", "conserving", "preserve", "preserving", 
    "reforest", "reforesting", "regenerate", "regenerated", "regenerating", 
    "restore", "restored", "restoring"
]

# Multi-word environmental conservation terms (lemmas) - base_word: [modifier_words]
environmental_conservation_multiword = {
    "based": ["nature", "plant"],
    "environmental": ["protected"],
    "natural": ["protected"],
    "use": ["land"],
    "ecosystem": ["marine"],
    "capital": ["natural"],
    "habitat": ["natural"],
    "area": ["protected"],
    "areas": ["protected"]
}

# =============================================================================
# 4. ENERGY SYSTEMS & EFFICIENCY
# =============================================================================

# Single-word energy systems & efficiency terms (lemmas)
energy_systems_nouns = [
    "baseload", "battery", "ccs", "ccus", "cogeneration", "cooling", "demand", 
    "distribution", "electrification", "energy", "ess", "fuel", "grid", "heat", 
    "heating", "infrastructure", "insulation", "load", "optimization", "peak", 
    "supply", "thermal", "transmission", "transportation"
]

energy_systems_adjectives = [
    "efficient", "optimal"
]

energy_systems_verbs = [
    "cogenerate", "cogenerating", "electrify", "electrifying", "optimize", 
    "optimized", "optimising", "optimizing"
]

# Multi-word energy systems & efficiency terms (lemmas) - base_word: [modifier_words]
energy_systems_multiword = {
    "electric": ["all", "geothermal", "hydro", "tidal"],
    "technology": ["carbon", "clean", "efficiency", "green", "renewable", "smart"],
    "technologies": ["carbon", "clean", "efficiency", "green", "renewable"],
    "enabled": ["ccs"],
    "equipped": ["ccs"],
    "ready": ["ccs"],
    "generation": ["clean"],
    "power": ["clean", "green", "renewable"],
    "production": ["clean"],
    "source": ["clean", "green"],
    "sources": ["clean", "green"],
    "station": ["clean", "green", "hydro", "renewable"],
    "stations": ["clean", "green", "hydro", "renewable"],
    "consumption": ["coal", "electricity", "energy", "fuel", "gas", "oil", "power"],
    "efficient": ["eco", "energy", "fuel", "high", "resource"],
    "efficiency": ["eco", "energy", "fuel", "high", "resource", "thermal"],
    "standard": ["efficiency", "performance"],
    "standards": ["efficiency", "performance"],
    "usage": ["electricity", "energy", "fuel", "power", "resource"],
    "use": ["electricity", "energy", "fuel", "power", "resource"],
    "alternative": ["energy"],
    "clean": ["energy"],
    "environmental": ["efficient", "energy"],
    "renewable": ["energy"],
    "saved": ["energy"],
    "transition": ["energy"],
    "plant": ["geothermal", "hydro", "renewable"],
    "plants": ["geothermal", "hydro", "renewable"],
    "storage": ["battery", "energy"],
    "management": ["demand"],
    "infrastructure": ["energy"],
    "pump": ["heat"],
    "grid": ["smart"]
}

# =============================================================================
# 5. CIRCULAR ECONOMY & WASTE MANAGEMENT
# =============================================================================

# Single-word circular economy & waste management terms (lemmas)
circular_economy_nouns = [
    "composting", "incineration", "landfill", "lifecycle", "repair", "waste"
]

circular_economy_adjectives = [
    "circular", "durable", "recoverable", "recyclable", "recycled", 
    "refurbished", "regenerable", "reusable"
]

circular_economy_verbs = [
    "recover", "recovered", "recycle", "recycling", "refurbish", "reuse"
]

# Multi-word circular economy & waste management terms (lemmas) - base_word: [modifier_words]
circular_economy_multiword = {
    "economy": ["circular"],
    "free": ["waste"],
    "management": ["waste"],
    "zero": ["waste"],
    "assessment": ["lifecycle"],
    "efficiency": ["material"],
    "material": ["raw", "virgin"],
    "materials": ["raw", "virgin"]
}

# =============================================================================
# 6. SUSTAINABILITY & GOVERNANCE
# =============================================================================

# Single-word sustainability & governance terms (lemmas)
sustainability_governance_nouns = [
    "esg", "ethics", "governance", "innovation", "responsibility", "social", 
    "sustainability", "transparency"
]

sustainability_governance_adjectives = [
    "clean", "enriching", "green", "innovative", "responsible", "sustainable"
]

sustainability_governance_adverbs = [
    "sustainably"
]

sustainability_governance_verbs = [
    "innovate", "innovating"
]

# Multi-word sustainability & governance terms (lemmas) - base_word: [modifier_words]
sustainability_governance_multiword = {
    "development": ["clean", "green", "renewable", "sustainable"],
    "economy": ["green", "sustainable"],
    "growth": ["green", "sustainable"],
    "financing": ["sustainable"],
    "production": ["responsible", "sustainable"],
    "environmental": ["responsible"],
    "assessment": ["impact"],
    "impact": ["social"],
    "engagement": ["stakeholder"],
    "reporting": ["sustainability"],
    "standard": ["sustainability"],
    "standards": ["sustainability"]
}

# =============================================================================
# 7. GREEN FINANCE & INVESTMENT
# =============================================================================

# Single-word green finance & investment terms (lemmas) - None identified
green_finance_nouns = []

green_finance_adjectives = []

green_finance_verbs = []

# Multi-word green finance & investment terms (lemmas) - base_word: [modifier_words]
green_finance_multiword = {
    "bond": ["climate", "green", "sustainability"],
    "bonds": ["climate", "green", "sustainability"],
    "finance": ["blended", "climate", "green", "sustainable", "transition"],
    "financing": ["climate", "green", "sustainable"],
    "fund": ["climate", "green", "sustainability"],
    "funds": ["climate", "green", "sustainability"],
    "investment": ["climate", "esg", "green", "sustainable"],
    "investments": ["climate", "green", "sustainable"],
    "investing": ["esg", "impact", "sustainable"],
    "loan": ["sustainable"],
    "loans": ["sustainable"]
}

# =============================================================================
# CONSOLIDATED TOPIC DICTIONARY
# =============================================================================

# All topic categories in one dictionary for easy access
GREEN_TOPICS = {
    "renewable_energy": {
        "nouns": renewable_energy_nouns,
        "adjectives": [],
        "verbs": [],
        "adverbs": renewable_energy_adverbs,
        "multiword": renewable_energy_multiword
    },
    "climate_emissions": {
        "nouns": climate_emissions_nouns,
        "adjectives": [],
        "verbs": climate_emissions_verbs,
        "adverbs": [],
        "multiword": climate_emissions_multiword
    },
    "environmental_conservation": {
        "nouns": environmental_conservation_nouns,
        "adjectives": environmental_conservation_adjectives,
        "verbs": environmental_conservation_verbs,
        "adverbs": environmental_conservation_adverbs,
        "multiword": environmental_conservation_multiword
    },
    "energy_systems": {
        "nouns": energy_systems_nouns,
        "adjectives": energy_systems_adjectives,
        "verbs": energy_systems_verbs,
        "adverbs": [],
        "multiword": energy_systems_multiword
    },
    "circular_economy": {
        "nouns": circular_economy_nouns,
        "adjectives": circular_economy_adjectives,
        "verbs": circular_economy_verbs,
        "adverbs": [],
        "multiword": circular_economy_multiword
    },
    "sustainability_governance": {
        "nouns": sustainability_governance_nouns,
        "adjectives": sustainability_governance_adjectives,
        "verbs": sustainability_governance_verbs,
        "adverbs": sustainability_governance_adverbs,
        "multiword": sustainability_governance_multiword
    },
    "green_finance": {
        "nouns": green_finance_nouns,
        "adjectives": green_finance_adjectives,
        "verbs": green_finance_verbs,
        "adverbs": [],
        "multiword": green_finance_multiword
    }
}

In [None]:
# Helper functions for topic analysis

def count_valid_words(doc):
    """
    Count total valid words in document (excluding stop words, punctuation, whitespace).
    """
    valid_count = 0
    for token in doc:
        if (not token.is_stop and 
            not token.is_punct and 
            not token.is_space and 
            token.text.strip()):
            valid_count += 1
    return valid_count

def is_position_excluded(token_idx, excluded_positions):
    """
    Check if token position is already used in another term.
    """
    return token_idx in excluded_positions

def mark_positions_as_used(start_idx, end_idx, excluded_positions):
    """
    Mark token positions as used to prevent double counting.
    """
    for i in range(start_idx, end_idx + 1):
        excluded_positions.add(i)

In [None]:
def find_multiword_topic_terms(doc, excluded_positions):
    """
    Find multiword terms across all topics with position tracking.
    Returns: (found_terms, updated_excluded_positions)
    """
    found_terms = []
    tokens = [token.lemma_.lower() for token in doc]
    
    # Process all topics
    for topic_name, topic_data in GREEN_TOPICS.items():
        multiword_dict = topic_data["multiword"]
        
        for base_word, modifiers in multiword_dict.items():
            for modifier in modifiers:
                # Create search patterns
                pattern = f"{modifier} {base_word}"
                
                # Search for pattern in token sequence
                for i in range(len(tokens) - 1):
                    if (i not in excluded_positions and 
                        (i + 1) not in excluded_positions):
                        
                        if (tokens[i] == modifier and 
                            tokens[i + 1] == base_word):
                            
                            # Create term info
                            term_info = {
                                'term': pattern,
                                'topic': topic_name,
                                'start_idx': i,
                                'end_idx': i + 1,
                                'sentence': doc[i].sent
                            }
                            found_terms.append(term_info)
                            
                            # Mark positions as used
                            mark_positions_as_used(i, i + 1, excluded_positions)
    
    return found_terms, excluded_positions

In [None]:
def find_single_topic_terms(doc, excluded_positions):
    """
    Find single word terms across all topics, excluding already used positions.
    Returns: found_terms
    """
    found_terms = []
    
    # Process all topics and word types
    for topic_name, topic_data in GREEN_TOPICS.items():
        word_types = ['nouns', 'adjectives', 'verbs', 'adverbs']
        
        for word_type in word_types:
            word_list = topic_data[word_type]
            
            for i, token in enumerate(doc):
                if is_position_excluded(i, excluded_positions):
                    continue
                
                lemma_lower = token.lemma_.lower()
                
                if lemma_lower in word_list:
                    term_info = {
                        'term': lemma_lower,  # normalized form, so case and inflection variants count as one term
                        'topic': topic_name,
                        'start_idx': i,
                        'end_idx': i,
                        'sentence': token.sent,
                        'pos': word_type
                    }
                    found_terms.append(term_info)
                    
                    # Mark position as used
                    excluded_positions.add(i)
    
    return found_terms

In [None]:
def calculate_topic_densities(all_found_terms, total_valid_words):
    """
    Calculate density and counts for each topic.
    Returns: topic_results dictionary
    """
    topic_results = {}
    
    # Initialize all topics
    for topic_name in GREEN_TOPICS.keys():
        topic_results[topic_name] = {
            'count': 0,
            'density': 0.0,
            'terms_found': []
        }
    
    # Count terms per topic
    for term_info in all_found_terms:
        topic = term_info['topic']
        topic_results[topic]['count'] += 1
        topic_results[topic]['terms_found'].append(term_info['term'])
    
    # Calculate densities as percentage
    for topic_name, results in topic_results.items():
        if total_valid_words > 0:
            results['density'] = (results['count'] / total_valid_words) * 100
        else:
            results['density'] = 0.0
    
    return topic_results
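
A short worked example of the density formula, written as a standalone helper (the name `topic_density` and the numbers are illustrative only):

```python
def topic_density(term_count, total_valid_words):
    """Topic terms per 100 valid words; 0.0 for an empty document."""
    if total_valid_words <= 0:
        return 0.0
    return term_count / total_valid_words * 100


# 45 matched terms in a report with 12,000 valid words.
density = topic_density(45, 12000)
```

Note that a two-word term counts once in the numerator, while both of its tokens may count toward the valid-word total in the denominator.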

In [None]:
def analyze_document_topics(doc, document_name):
    """
    Main function to analyze topics in a document.
    Returns: complete analysis results
    """
    excluded_positions = set()
    
    # Step 1: Find multiword terms first (priority)
    multiword_terms, excluded_positions = find_multiword_topic_terms(doc, excluded_positions)
    
    # Step 2: Find single word terms (excluding used positions)
    single_terms = find_single_topic_terms(doc, excluded_positions)
    
    # Step 3: Combine all found terms
    all_found_terms = multiword_terms + single_terms
    
    # Step 4: Count total valid words in document
    total_valid_words = count_valid_words(doc)
    
    # Step 5: Calculate topic densities and counts
    topic_results = calculate_topic_densities(all_found_terms, total_valid_words)
    
    # Step 6: Create final result structure
    result = {
        'document_name': document_name,
        'total_valid_words': total_valid_words,
        'total_terms_found': len(all_found_terms),
        'topics': topic_results
    }
    
    return result, all_found_terms
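
The priority rule (multiword terms first, single words only on unused positions) can be seen on a toy lemma list. This is a simplified sketch without spaCy or topic labels: `match_terms` is a hypothetical helper that scans the tokens once from left to right, whereas the notebook functions iterate pattern by pattern.

```python
def match_terms(lemmas, multiword, single):
    """Two-pass matcher: bigrams first, then single words on unused positions."""
    used = set()
    found = []
    # Pass 1: "modifier base_word" bigrams take priority.
    for i in range(len(lemmas) - 1):
        if i in used or i + 1 in used:
            continue
        if lemmas[i] in multiword.get(lemmas[i + 1], []):
            found.append(f"{lemmas[i]} {lemmas[i + 1]}")
            used.update((i, i + 1))
    # Pass 2: single words, skipping tokens already consumed by a bigram.
    for i, lemma in enumerate(lemmas):
        if i not in used and lemma in single:
            found.append(lemma)
            used.add(i)
    return found


lemmas = ["the", "wind", "farm", "use", "wind", "and", "solar", "panel"]
found = match_terms(lemmas, {"farm": ["wind"], "panel": ["solar"]}, {"wind", "solar"})
```

The first "wind" is absorbed into "wind farm" and "solar" into "solar panel", so only the second, free-standing "wind" is counted as a single-word term.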

In [None]:
def compare_company_years(result_2021, result_2022, terms_2021, terms_2022, company_name):
    """
    Compare topic analysis results between two years for the same company.
    Focuses on density changes and individual term changes.
    Returns: comparison analysis with term-level details
    """
    comparison = {
        'company': company_name,
        'year_2021': {
            'total_valid_words': result_2021['total_valid_words'],
            'total_terms_found': result_2021['total_terms_found']
        },
        'year_2022': {
            'total_valid_words': result_2022['total_valid_words'],
            'total_terms_found': result_2022['total_terms_found']
        },
        'topic_comparison': {},
        'term_changes': {
            'increasing_terms': [],  # Terms that increased significantly
            'decreasing_terms': []   # Terms that decreased significantly
        }
    }
    
    # Create term frequency dictionaries for both years
    terms_freq_2021 = Counter(term_info['term'] for term_info in terms_2021)
    terms_freq_2022 = Counter(term_info['term'] for term_info in terms_2022)
    
    # Compare each topic
    for topic_name in GREEN_TOPICS.keys():
        topic_2021 = result_2021['topics'][topic_name]
        topic_2022 = result_2022['topics'][topic_name]
        
        # Calculate changes (density is already in percentage)
        count_change = topic_2022['count'] - topic_2021['count']
        density_change = topic_2022['density'] - topic_2021['density']
        
        # Calculate percentage change in density
        if topic_2021['density'] > 0:
            density_percentage = (density_change / topic_2021['density']) * 100
        else:
            # No 2021 baseline to divide by: report +100% if the topic first appears in 2022, otherwise 0%
            density_percentage = 100 if topic_2022['density'] > 0 else 0
        
        comparison['topic_comparison'][topic_name] = {
            '2021_count': topic_2021['count'],
            '2022_count': topic_2022['count'],
            '2021_density': round(topic_2021['density'], 4),
            '2022_density': round(topic_2022['density'], 4),
            'count_change': count_change,
            'density_change': round(density_change, 4),
            'density_percentage_change': round(density_percentage, 1)
        }
    
    # Analyze individual term changes
    all_terms = set(terms_freq_2021.keys()) | set(terms_freq_2022.keys())
    
    for term in all_terms:
        freq_2021 = terms_freq_2021.get(term, 0)
        freq_2022 = terms_freq_2022.get(term, 0)
        
        # Look for significant changes (rare in one year, common in another)
        if freq_2021 <= 1 and freq_2022 >= 3:  # Rare in 2021, more common in 2022
            comparison['term_changes']['increasing_terms'].append({
                'term': term,
                '2021_freq': freq_2021,
                '2022_freq': freq_2022,
                'change': freq_2022 - freq_2021
            })
        elif freq_2021 >= 3 and freq_2022 <= 1:  # Common in 2021, rare in 2022
            comparison['term_changes']['decreasing_terms'].append({
                'term': term,
                '2021_freq': freq_2021,
                '2022_freq': freq_2022,
                'change': freq_2022 - freq_2021
            })
    
    # Sort term changes by magnitude
    comparison['term_changes']['increasing_terms'].sort(key=lambda x: x['change'], reverse=True)
    comparison['term_changes']['decreasing_terms'].sort(key=lambda x: abs(x['change']), reverse=True)
    
    return comparison
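
The two decision rules inside the comparison can be isolated as small helpers to make their edge cases visible. The helper names and the "stable" label are hypothetical; the thresholds and the zero-baseline convention mirror the function above.

```python
def density_pct_change(d_old, d_new):
    """Relative density change in percent, with the zero-baseline convention."""
    if d_old > 0:
        return (d_new - d_old) / d_old * 100
    return 100 if d_new > 0 else 0


def prominence(freq_old, freq_new):
    """Classify a term using the <=1 (rare) and >=3 (common) thresholds."""
    if freq_old <= 1 and freq_new >= 3:
        return "increasing"
    if freq_old >= 3 and freq_new <= 1:
        return "decreasing"
    return "stable"  # label added here for illustration only


change = density_pct_change(0.40, 0.50)  # density rises from 0.40% to 0.50%
labels = [prominence(0, 5), prominence(4, 1), prominence(2, 6)]
```

The last case shows a consequence of the thresholds: a term that goes from 2 to 6 occurrences is not flagged, because it was not rare in the first year.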

In [None]:
# Process all documents and store results
document_results = {}
document_terms = {}

print("Processing documents for topic analysis...")
print("=" * 60)

for doc_key, doc in documents.items():
    print(f"Analyzing: {doc_key}")
    
    # Analyze document topics
    result, terms = analyze_document_topics(doc, doc_key)
    
    # Store results
    document_results[doc_key] = result
    document_terms[doc_key] = terms
    
    # Print summary
    total_terms = result['total_terms_found']
    total_words = result['total_valid_words']
    overall_density = round(total_terms / total_words, 6) if total_words > 0 else 0
    
    print(f"  Total valid words: {total_words}")
    print(f"  Total terms found: {total_terms}")
    print(f"  Overall topic density: {overall_density*100:.2f}%")
    print()

print("Topic analysis completed.")

In [None]:
# Display detailed topic analysis results
def display_topic_results(document_results):
    """
    Display topic analysis results in a formatted way.
    """
    for doc_name, result in document_results.items():
        print(f"\n{'='*60}")
        print(f"TOPIC ANALYSIS: {doc_name}")
        print(f"{'='*60}")
        print(f"Total valid words: {result['total_valid_words']}")
        print(f"Total terms found: {result['total_terms_found']}")
        print()
        
        # Sort topics by count (descending)
        sorted_topics = sorted(
            result['topics'].items(), 
            key=lambda x: x[1]['count'], 
            reverse=True
        )
        
        print("TOPIC BREAKDOWN:")
        print("-" * 60)
        print(f"{'Topic':<25} {'Count':<8} {'Density %':<10} {'Sample Terms'}")
        print("-" * 60)
        
        for topic_name, topic_data in sorted_topics:
            # Format topic name for display
            display_name = topic_name.replace('_', ' ').title()
            
            # Get sample terms (first 3)
            sample_terms = topic_data['terms_found'][:3]
            sample_str = ", ".join(sample_terms) if sample_terms else "None"
            if len(sample_str) > 30:
                sample_str = sample_str[:27] + "..."
            
            print(f"{display_name:<25} {topic_data['count']:<8} "
                  f"{topic_data['density']:<10.4f} {sample_str}")
        
        print()

# Display results
display_topic_results(document_results)

In [None]:
# Year-over-year comparison analysis with term-level details
print("YEAR-OVER-YEAR TOPIC COMPARISON")
print("=" * 80)

# Extract company names (assuming format: CompanyName_YEAR)
companies = {}
company_terms = {}

for doc_key in document_results.keys():
    # Keys follow "<company>_<year>"; split on the last underscore only
    company, _, year = doc_key.rpartition('_')
    if year not in ('2021', '2022'):
        continue
    if company not in companies:
        companies[company] = {}
        company_terms[company] = {}

    companies[company][year] = document_results[doc_key]
    company_terms[company][year] = document_terms[doc_key]

# Compare each company
for company_name, years_data in companies.items():
    if '2021' in years_data and '2022' in years_data:
        print(f"\nCOMPANY: {company_name}")
        print("-" * 80)
        
        # Perform comparison with term data
        comparison = compare_company_years(
            years_data['2021'], 
            years_data['2022'],
            company_terms[company_name]['2021'],
            company_terms[company_name]['2022'],
            company_name
        )
        
        # Display overall changes
        word_change = comparison['year_2022']['total_valid_words'] - comparison['year_2021']['total_valid_words']
        term_change = comparison['year_2022']['total_terms_found'] - comparison['year_2021']['total_terms_found']
        
        print(f"Document size change: {word_change:+d} words")
        print(f"Total terms change: {term_change:+d} terms")
        print()
        
        # Display topic density changes (focus on density percentage change)
        print(f"{'Topic':<25} {'2021%':<8} {'2022%':<8} {'Density Δ':<10} {'% Change':<10}")
        print("-" * 70)
        
        # Sort by absolute density percentage change (descending)
        sorted_topics = sorted(
            comparison['topic_comparison'].items(),
            key=lambda x: abs(x[1]['density_percentage_change']),
            reverse=True
        )
        
        for topic_name, topic_comp in sorted_topics:
            display_name = topic_name.replace('_', ' ').title()
            
            print(f"{display_name:<25} "
                  f"{topic_comp['2021_density']:<8.3f} "
                  f"{topic_comp['2022_density']:<8.3f} "
                  f"{topic_comp['density_change']:+8.3f} "
                  f"{topic_comp['density_percentage_change']:+8.1f}%")
        
        print()
        
        # Display increasing terms (rare in 2021, common in 2022)
        increasing_terms = comparison['term_changes']['increasing_terms']
        if increasing_terms:
            print("TERMS GAINING PROMINENCE (rare in 2021, more common in 2022):")
            print("-" * 60)
            print(f"{'Term':<30} {'2021':<6} {'2022':<6} {'Change'}")
            print("-" * 60)
            for term_info in increasing_terms[:10]:  # Show top 10
                print(f"{term_info['term']:<30} "
                      f"{term_info['2021_freq']:<6} "
                      f"{term_info['2022_freq']:<6} "
                      f"{term_info['change']:+d}")
            print()
        
        # Display decreasing terms (common in 2021, rare in 2022)
        decreasing_terms = comparison['term_changes']['decreasing_terms']
        if decreasing_terms:
            print("TERMS LOSING PROMINENCE (common in 2021, rare in 2022):")
            print("-" * 60)
            print(f"{'Term':<30} {'2021':<6} {'2022':<6} {'Change'}")
            print("-" * 60)
            for term_info in decreasing_terms[:10]:  # Show top 10
                print(f"{term_info['term']:<30} "
                      f"{term_info['2021_freq']:<6} "
                      f"{term_info['2022_freq']:<6} "
                      f"{term_info['change']:+d}")
            print()
        
        print("=" * 80)
        
    else:
        print(f"\nCOMPANY: {company_name}")
        print("Incomplete data - missing 2021 or 2022 report")
        print()
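
The ordering used in the topic table above can be checked on a toy input. This sketch assumes, for illustration only, that `density_percentage_change` is the relative change `(new - old) / old * 100`; `compare_company_years` is not shown in this section, so its actual formula may differ. The helper `pct_change` and the sample densities are hypothetical.

```python
def pct_change(old, new):
    """Relative change in percent; 0.0 when the old value is 0 (assumed convention)."""
    if old == 0:
        return 0.0
    return (new - old) / old * 100


# Hypothetical topic densities (percent of valid words) for one company
toy_topics = {
    "renewable_energy": {"2021_density": 0.40, "2022_density": 0.50},
    "climate_emissions": {"2021_density": 0.80, "2022_density": 0.40},
    "green_finance": {"2021_density": 0.20, "2022_density": 0.21},
}

for stats in toy_topics.values():
    stats["density_percentage_change"] = pct_change(
        stats["2021_density"], stats["2022_density"]
    )

# Same ordering rule as the display loop: largest absolute change first
sorted_topics = sorted(
    toy_topics.items(),
    key=lambda item: abs(item[1]["density_percentage_change"]),
    reverse=True,
)

for name, stats in sorted_topics:
    print(f"{name:<20} {stats['density_percentage_change']:+8.1f}%")
```

Because the sort key is the magnitude of the change, a drop from 0.80% to 0.40% ranks above a rise from 0.40% to 0.50%.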

In [None]:
def create_topic_analysis_dataframe(document_results):
    """
    Create a DataFrame for topic analysis.
    Rows: Organizations (company-year combinations)
    Columns: Topic metrics (counts and density percentages for each topic)
    """
    data = []
    
    for doc_name, result_data in document_results.items():
        
        # Extract organization and year from document name
        parts = doc_name.split('_')
        year = parts[-1]
        org_name = '_'.join(parts[:-1])
        
        # Get basic metrics
        total_valid_words = result_data['total_valid_words']
        total_terms_found = result_data['total_terms_found']
        overall_density = round((total_terms_found / total_valid_words * 100) if total_valid_words > 0 else 0, 4)
        
        # TOPIC-SPECIFIC METRICS
        topic_stats = {}
        topic_densities = []
        topic_counts = []
        
        for topic_name, topic_data in result_data['topics'].items():
            count = topic_data['count']
            density = topic_data['density']
            
            # Store individual topic metrics
            topic_stats[f'{topic_name}_count'] = count
            topic_stats[f'{topic_name}_density'] = round(density, 4)
            
            topic_densities.append(density)
            topic_counts.append(count)
        
        # COMBINED TOPIC INSIGHTS
        # Most prominent topic
        max_density_topic = max(result_data['topics'].items(), key=lambda x: x[1]['density'])
        most_prominent_topic = max_density_topic[0]
        highest_density = round(max_density_topic[1]['density'], 4)
        
        # Topic diversity (number of topics with terms)
        active_topics = sum(1 for topic_data in result_data['topics'].values() if topic_data['count'] > 0)
        
        # Topic concentration (percentage of terms in most prominent topic)
        if total_terms_found > 0:
            concentration = round((max_density_topic[1]['count'] / total_terms_found * 100), 2)
        else:
            concentration = 0
        
        # Topic balance (standard deviation of densities)
        if topic_densities:
            import statistics
            topic_balance = round(statistics.stdev(topic_densities) if len(topic_densities) > 1 else 0, 4)
        else:
            topic_balance = 0
        
        # Specific topic combinations
        renewable_plus_climate = (result_data['topics']['renewable_energy']['count'] + 
                                 result_data['topics']['climate_emissions']['count'])
        
        governance_plus_finance = (result_data['topics']['sustainability_governance']['count'] + 
                                  result_data['topics']['green_finance']['count'])
        
        # Create the row dictionary
        row = {
            # Basic identifiers
            'organization': org_name,
            'year': int(year),
            
            # Overall metrics
            'total_valid_words': total_valid_words,
            'total_terms_found': total_terms_found,
            'overall_density': overall_density,
            
            # Individual topic metrics
            **topic_stats,
            
            # Combined insights
            'most_prominent_topic': most_prominent_topic,
            'highest_topic_density': highest_density,
            'active_topics': active_topics,
            'topic_concentration': concentration,
            'topic_balance': topic_balance,
            'renewable_plus_climate': renewable_plus_climate,
            'governance_plus_finance': governance_plus_finance
        }
        
        data.append(row)
    
    return pd.DataFrame(data)

# Create the topic analysis DataFrame
import pandas as pd
topic_analysis_df = create_topic_analysis_dataframe(document_results)

# Sort by organization and year for better readability
topic_analysis_df = topic_analysis_df.sort_values(['organization', 'year']).reset_index(drop=True)

print("TOPIC ANALYSIS DATAFRAME CREATED")
print("="*80)
print(topic_analysis_df.to_string(index=False))

# Export to Excel
import os
excel_path = "data/NLP/Results/Communication_Score_df_Topics.xlsx"

# Create directory if it doesn't exist
os.makedirs(os.path.dirname(excel_path), exist_ok=True)

topic_analysis_df.to_excel(excel_path, index=False, sheet_name='Topic_Analysis')
print(f"\nExported to Excel: {excel_path}")
print(f"DataFrame shape: {topic_analysis_df.shape[0]} documents × {topic_analysis_df.shape[1]} columns")

# Column descriptions for reference
print(f"\nCOLUMN DESCRIPTIONS:")
column_descriptions = {
    # Basic identifiers
    'organization': 'Organization name',
    'year': 'Report year',
    
    # Overall metrics
    'total_valid_words': 'Total meaningful words in document',
    'total_terms_found': 'Total sustainability terms found',
    'overall_density': 'Overall sustainability density percentage',
    
    # Topic counts
    'renewable_energy_count': 'Renewable energy terms count',
    'climate_emissions_count': 'Climate & emissions terms count',
    'environmental_conservation_count': 'Environmental conservation terms count',
    'energy_systems_count': 'Energy systems & efficiency terms count',
    'circular_economy_count': 'Circular economy & waste terms count',
    'sustainability_governance_count': 'Sustainability & governance terms count',
    'green_finance_count': 'Green finance & investment terms count',
    
    # Topic densities
    'renewable_energy_density': 'Renewable energy density percentage',
    'climate_emissions_density': 'Climate & emissions density percentage',
    'environmental_conservation_density': 'Environmental conservation density percentage',
    'energy_systems_density': 'Energy systems & efficiency density percentage',
    'circular_economy_density': 'Circular economy & waste density percentage',
    'sustainability_governance_density': 'Sustainability & governance density percentage',
    'green_finance_density': 'Green finance & investment density percentage',
    
    # Combined insights
    'most_prominent_topic': 'Topic with highest density',
    'highest_topic_density': 'Density of most prominent topic',
    'active_topics': 'Number of topics with at least one term',
    'topic_concentration': 'Percentage of terms in most prominent topic',
    'topic_balance': 'Standard deviation of topic densities',
    'renewable_plus_climate': 'Combined renewable energy + climate terms',
    'governance_plus_finance': 'Combined governance + finance terms'
}

for col, desc in column_descriptions.items():
    if col in topic_analysis_df.columns:
        print(f"  {col:<35}: {desc}")

print(f"\nData saved as: {excel_path}")
print(f"Variable available as: topic_analysis_df")
print("Contains topic analysis metrics (counts, densities, insights)")

# Summary statistics
print(f"\n{'='*80}")
print("TOPIC ANALYSIS SUMMARY STATISTICS")
print(f"{'='*80}")
print(f"Total organizations analyzed: {topic_analysis_df['organization'].nunique()}")
print(f"Total documents: {len(topic_analysis_df)}")
print(f"Year range: {topic_analysis_df['year'].min()} - {topic_analysis_df['year'].max()}")

print(f"\nOverall Metrics (averages):")
print(f"  Total valid words: {topic_analysis_df['total_valid_words'].mean():.0f}")
print(f"  Total terms found: {topic_analysis_df['total_terms_found'].mean():.1f}")
print(f"  Overall density: {topic_analysis_df['overall_density'].mean():.2f}%")

print(f"\nTopic Density Distribution (averages):")
topic_cols = [col for col in topic_analysis_df.columns if col.endswith('_density')]
for col in topic_cols:
    topic_name = col.replace('_density', '').replace('_', ' ').title()
    print(f"  {topic_name:<35}: {topic_analysis_df[col].mean():.3f}%")

print(f"\nTopic Insights (averages):")
print(f"  Active topics per document: {topic_analysis_df['active_topics'].mean():.1f}")
print(f"  Topic concentration: {topic_analysis_df['topic_concentration'].mean():.1f}%")
print(f"  Topic balance (lower = more balanced): {topic_analysis_df['topic_balance'].mean():.3f}")
print(f"  Renewable + Climate terms: {topic_analysis_df['renewable_plus_climate'].mean():.1f}")
print(f"  Governance + Finance terms: {topic_analysis_df['governance_plus_finance'].mean():.1f}")

print(f"\nMost Prominent Topics:")
topic_prominence = topic_analysis_df['most_prominent_topic'].value_counts()
for topic, count in topic_prominence.items():
    topic_display = topic.replace('_', ' ').title()
    print(f"  {topic_display}: {count} documents")
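
To make the combined insight columns concrete, the sketch below recomputes them for one made-up document. The counts and word total are invented for illustration, and `summarize_topics` is a hypothetical helper, not part of the notebook.

```python
import statistics


def summarize_topics(topic_counts, total_valid_words):
    """Recompute the combined topic insights for a single document."""
    total_terms = sum(topic_counts.values())
    densities = {
        topic: (count / total_valid_words * 100) if total_valid_words > 0 else 0.0
        for topic, count in topic_counts.items()
    }
    top_topic = max(densities, key=densities.get)
    density_values = list(densities.values())
    return {
        "most_prominent_topic": top_topic,
        "active_topics": sum(1 for count in topic_counts.values() if count > 0),
        # Share of all found terms that belong to the most prominent topic
        "topic_concentration": (
            round(topic_counts[top_topic] / total_terms * 100, 2) if total_terms else 0
        ),
        # Sample standard deviation of the topic densities
        "topic_balance": (
            round(statistics.stdev(density_values), 4) if len(density_values) > 1 else 0
        ),
    }


# Invented counts for a 10,000-word report
toy_counts = {
    "renewable_energy": 30,
    "climate_emissions": 15,
    "green_finance": 5,
    "circular_economy": 0,
}
summary = summarize_topics(toy_counts, total_valid_words=10_000)
print(summary)
```

A document that concentrates its terms in one topic gets a high `topic_concentration` and a high `topic_balance`; an even spread pushes both down.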

## Sentiment Analysis

### Overall Sentiment
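
The scoring rule used in this part can be checked by hand: the sentence score is the opportunity probability minus the risk probability, and the label is whichever class has the strictly highest probability, with ties falling to NEUTRAL. A minimal sketch with made-up probabilities (the function name `score_sentence` is hypothetical):

```python
def score_sentence(opportunity, risk, neutral):
    """Turn three class probabilities into (score, label, confidence)."""
    score = opportunity - risk  # ranges from -1 to +1
    confidence = max(opportunity, risk, neutral)

    if opportunity > max(risk, neutral):
        label = "OPPORTUNITY"
    elif risk > max(opportunity, neutral):
        label = "RISK"
    else:
        label = "NEUTRAL"  # also covers ties between classes
    return score, label, confidence


# Made-up probabilities, not real model output
for probs in [(0.7, 0.1, 0.2), (0.1, 0.6, 0.3), (0.4, 0.4, 0.2)]:
    score, label, confidence = score_sentence(*probs)
    print(f"{probs} -> score={score:+.2f}, label={label}, confidence={confidence:.2f}")
```

Note that a sentence can be labelled NEUTRAL and still carry a non-zero score, since the score ignores the neutral probability.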

In [None]:
# ===================================================================
# CLIMATE-BERT SENTIMENT ANALYSIS EXECUTION
# This analyzes climate-specific sentiment using Climate-BERT which classifies text as:
# - OPPORTUNITY: Positive climate/sustainability sentiment
# - RISK: Negative climate/sustainability sentiment  
# - NEUTRAL: Neutral climate/sustainability sentiment
# Run this cell after the topic analysis cells: it relies on the
# `documents` and `document_terms` variables defined earlier in the notebook
# ===================================================================

# First, install required packages if not already installed
# Run this in a separate cell first:
# !pip install transformers torch

print("Starting Climate-BERT Climate Sentiment Analysis")
print("Climate-BERT classifies sustainability text as OPPORTUNITY/RISK/NEUTRAL")
print("=" * 60)

# Step 1: Initialize Climate-BERT Model
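from transformers import pipeline
import numpy as np   # used for the aggregate metrics in Step 3
import pandas as pd  # used for the results DataFrame in Step 5
import warnings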
warnings.filterwarnings('ignore')

def initialize_climate_bert():
    """Initialize Climate-BERT model for sentiment analysis."""
    try:
        print("Loading Climate-BERT model...")
        model_name = "climatebert/distilroberta-base-climate-sentiment"
        
        climate_sentiment = pipeline(
            "sentiment-analysis", 
            model=model_name,
            return_all_scores=True,
            truncation=True,
            max_length=512
        )
        
        print("Climate-BERT model loaded successfully!")
        return climate_sentiment
        
    except Exception as e:
        print(f"Error loading Climate-BERT: {e}")
        print("Falling back to a general-purpose Twitter RoBERTa sentiment model (positive/negative/neutral labels)...")
        
        # Fallback model
        fallback_sentiment = pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-roberta-base-sentiment-latest",
            return_all_scores=True,
            truncation=True,
            max_length=512
        )
        print("Fallback model loaded successfully!")
        return fallback_sentiment

# Initialize the model
climate_sentiment_model = initialize_climate_bert()

# Step 2: Extract Sustainability Sentences
def extract_sustainability_sentences_from_terms(doc, found_terms):
    """Extract sentences containing sustainability terms."""
    sustainability_sentences = []
    processed_sentences = set()
    
    for term_info in found_terms:
        # Use the sentence directly from term_info (it's already a spaCy Span)
        sentence = term_info['sentence']
        sentence_text = sentence.text.strip()
        
        # Avoid duplicates and very short sentences
        if sentence_text not in processed_sentences and len(sentence_text) > 20:
            sustainability_sentences.append({
                'text': sentence_text,
                'term_found': term_info['term'],
                'topic': term_info['topic'],
                'start_char': sentence.start_char,
                'end_char': sentence.end_char
            })
            processed_sentences.add(sentence_text)
    
    return sustainability_sentences

# Step 3: Analyze Sentiment
def analyze_sentences_sentiment(sustainability_sentences, model):
    """Analyze sentiment of sustainability sentences."""
    if not sustainability_sentences:
        return {
            'avg_sentiment_score': 0.0,
            'positive_ratio': 0.0,
            'negative_ratio': 0.0,
            'neutral_ratio': 0.0,
            'sentiment_confidence': 0.0,
            'sentiment_volatility': 0.0,
            'total_sentences': 0,
            'detailed_results': []
        }
    
    sentiment_results = []
    
    for sent_info in sustainability_sentences:
        text = sent_info['text']
        
        # Truncate very long sentences
        if len(text) > 500:
            text = text[:500] + "..."
        
        try:
            # Get sentiment prediction
            result = model(text)
            
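            # Process results. Climate-BERT uses 'opportunity', 'risk', 'neutral';
            # the fallback model is expected to use 'positive', 'negative', 'neutral'.
            if isinstance(result[0], list):
                # Lowercase the labels so the lookup does not depend on label casing
                scores_dict = {item['label'].lower(): item['score'] for item in result[0]}
                
                # Map both label sets onto opportunity/risk/neutral so the fallback
                # model does not silently score every sentence as 0
                opportunity_score = scores_dict.get('opportunity', scores_dict.get('positive', 0))  # Positive climate sentiment
                risk_score = scores_dict.get('risk', scores_dict.get('negative', 0))                # Negative climate sentiment
                neutral_score = scores_dict.get('neutral', 0)                                       # Neutral climate sentiment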
                
                # Calculate sentiment score (-1 to +1)
                # opportunity = positive, risk = negative
                sentiment_score = opportunity_score - risk_score
                confidence = max(opportunity_score, risk_score, neutral_score)
                
                # Determine primary label
                if opportunity_score > max(risk_score, neutral_score):
                    label = 'OPPORTUNITY'
                elif risk_score > max(opportunity_score, neutral_score):
                    label = 'RISK'
                else:
                    label = 'NEUTRAL'
            else:
                # Single result format (shouldn't happen with return_all_scores=True)
                main_result = result[0]
                label = main_result['label']
                confidence = main_result['score']
                
                # Convert Climate-BERT labels to sentiment score
                if 'opportunity' in label.lower():
                    sentiment_score = confidence
                    label = 'OPPORTUNITY'
                elif 'risk' in label.lower():
                    sentiment_score = -confidence
                    label = 'RISK'
                else:
                    sentiment_score = 0.0
                    label = 'NEUTRAL'
            
            sentiment_results.append({
                'text': sent_info['text'],
                'term_found': sent_info['term_found'],
                'topic': sent_info['topic'],
                'sentiment_score': sentiment_score,
                'sentiment_label': label,
                'confidence': confidence
            })
            
        except Exception as e:
            print(f"Error processing sentence: {e}")
            sentiment_results.append({
                'text': sent_info['text'],
                'term_found': sent_info['term_found'],
                'topic': sent_info['topic'],
                'sentiment_score': 0.0,
                'sentiment_label': 'NEUTRAL',
                'confidence': 0.5
            })
    
    # Calculate aggregate metrics
    scores = [r['sentiment_score'] for r in sentiment_results]
    labels = [r['sentiment_label'] for r in sentiment_results]
    confidences = [r['confidence'] for r in sentiment_results]
    
    # Count Climate-BERT labels correctly
    opportunity_count = sum(1 for label in labels if label == 'OPPORTUNITY')
    risk_count = sum(1 for label in labels if label == 'RISK')
    neutral_count = sum(1 for label in labels if label == 'NEUTRAL')
    
    total_sentences = len(sentiment_results)
    
    return {
        'avg_sentiment_score': np.mean(scores) if scores else 0.0,
        'positive_ratio': opportunity_count / total_sentences if total_sentences > 0 else 0.0,  # opportunity ratio
        'negative_ratio': risk_count / total_sentences if total_sentences > 0 else 0.0,        # risk ratio
        'neutral_ratio': neutral_count / total_sentences if total_sentences > 0 else 0.0,
        'sentiment_confidence': np.mean(confidences) if confidences else 0.0,
        'sentiment_volatility': np.std(scores) if len(scores) > 1 else 0.0,
        'total_sentences': total_sentences,
        'detailed_results': sentiment_results
    }

# Step 4: Run Analysis on All Documents
print("\nProcessing documents for sentiment analysis...")
print("-" * 60)

sentiment_analysis_results = {}

for doc_name, doc in documents.items():
    print(f"Processing: {doc_name}")
    
    # Get the found terms from your existing analysis
    if doc_name in document_terms:
        found_terms = document_terms[doc_name]
        
        # Extract sustainability sentences
        sustainability_sentences = extract_sustainability_sentences_from_terms(doc, found_terms)
        
        # Analyze sentiment
        sentiment_analysis = analyze_sentences_sentiment(sustainability_sentences, climate_sentiment_model)
        
        # Store results
        sentiment_analysis_results[doc_name] = {
            'sustainability_sentences_count': len(sustainability_sentences),
            'climate_bert_sentiment': sentiment_analysis
        }
        
        # Print summary (using opportunity ratio instead of positive)
        avg_sentiment = sentiment_analysis['avg_sentiment_score']
        sentence_count = len(sustainability_sentences)
        opp_ratio = sentiment_analysis['positive_ratio']  # This is actually opportunity ratio
        
        print(f"Avg Sentiment: {avg_sentiment:+.3f} | Sentences: {sentence_count} | Opportunity: {opp_ratio:.1%}")
    
    else:
        print(f"No terms found for {doc_name}")

# Step 5: Create Results DataFrame
print(f"\nCreating results summary...")

sentiment_data = []

for doc_name, sentiment_data_item in sentiment_analysis_results.items():
    # Extract organization and year
    parts = doc_name.split('_')
    year = parts[-1]
    org_name = '_'.join(parts[:-1])
    
    sentiment_metrics = sentiment_data_item['climate_bert_sentiment']
    
    row = {
        'organization': org_name,
        'year': int(year),
        'sustainability_sentences': sentiment_data_item['sustainability_sentences_count'],
        'avg_sentiment_score': round(sentiment_metrics['avg_sentiment_score'], 4),
        'opportunity_ratio': round(sentiment_metrics['positive_ratio'], 3),  # Actually opportunity ratio
        'risk_ratio': round(sentiment_metrics['negative_ratio'], 3),         # Actually risk ratio
        'neutral_ratio': round(sentiment_metrics['neutral_ratio'], 3),
        'sentiment_confidence': round(sentiment_metrics['sentiment_confidence'], 3),
        'sentiment_volatility': round(sentiment_metrics['sentiment_volatility'], 3),
        'total_sentences_analyzed': sentiment_metrics['total_sentences']
    }
    
    sentiment_data.append(row)

# Create DataFrame
climate_bert_sentiment_df = pd.DataFrame(sentiment_data)
climate_bert_sentiment_df = climate_bert_sentiment_df.sort_values(['organization', 'year']).reset_index(drop=True)

# Step 6: Save Results
import os

# Create output directory
output_dir = "data/NLP/Results"
os.makedirs(output_dir, exist_ok=True)

# Save to Excel
excel_path = f"{output_dir}/Overall_Sentiment_Analysis.xlsx"
climate_bert_sentiment_df.to_excel(excel_path, index=False, sheet_name='Document_Sentiment')

print(f"\nClimate-BERT sentiment analysis complete!")
print(f"Results saved to: {excel_path}")
print(f"Analyzed {len(sentiment_analysis_results)} documents")

# Step 7: Display Summary Statistics
print(f"\nSUMMARY STATISTICS:")
print("=" * 60)
print(f"Average sentiment across all documents: {climate_bert_sentiment_df['avg_sentiment_score'].mean():+.3f}")
print(f"Standard deviation: {climate_bert_sentiment_df['avg_sentiment_score'].std():.3f}")

# Top 5 most opportunity-focused (most positive sentiment)
most_positive = climate_bert_sentiment_df.nlargest(5, 'avg_sentiment_score')
print(f"\nTOP 5 MOST OPPORTUNITY-FOCUSED:")
for _, row in most_positive.iterrows():
    print(f"  {row['organization']} ({row['year']}): {row['avg_sentiment_score']:+.3f}")

# Top 5 most risk-focused (most negative sentiment)  
most_negative = climate_bert_sentiment_df.nsmallest(5, 'avg_sentiment_score')
print(f"\nTOP 5 MOST RISK-FOCUSED:")
for _, row in most_negative.iterrows():
    print(f"  {row['organization']} ({row['year']}): {row['avg_sentiment_score']:+.3f}")

# Display DataFrame
print(f"\nCOMPLETE RESULTS:")
print(climate_bert_sentiment_df.to_string(index=False))

print(f"\nDone! Your Climate-BERT sentiment analysis data is ready for integration with your communication score.")
print(f"Note: Climate-BERT measures 'opportunity' (positive) vs 'risk' (negative) sentiment specific to climate/sustainability topics.")
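
The document-level aggregation in Step 3 can be reproduced on a handful of invented sentence results. This sketch uses only the standard library; `statistics.pstdev` is the population standard deviation, which matches the default behaviour of `np.std`. The function name `aggregate` and the toy values are hypothetical.

```python
import statistics


def aggregate(results):
    """Collapse per-sentence results into document-level sentiment metrics."""
    n = len(results)
    if n == 0:
        return {
            "avg_sentiment_score": 0.0,
            "opportunity_ratio": 0.0,
            "risk_ratio": 0.0,
            "neutral_ratio": 0.0,
            "sentiment_volatility": 0.0,
        }
    scores = [r["sentiment_score"] for r in results]
    labels = [r["sentiment_label"] for r in results]
    return {
        "avg_sentiment_score": statistics.fmean(scores),
        "opportunity_ratio": labels.count("OPPORTUNITY") / n,
        "risk_ratio": labels.count("RISK") / n,
        "neutral_ratio": labels.count("NEUTRAL") / n,
        # Spread of sentence scores within the document
        "sentiment_volatility": statistics.pstdev(scores) if n > 1 else 0.0,
    }


# Invented sentence-level results, not real model output
toy_results = [
    {"sentiment_score": 0.8, "sentiment_label": "OPPORTUNITY"},
    {"sentiment_score": 0.4, "sentiment_label": "OPPORTUNITY"},
    {"sentiment_score": -0.6, "sentiment_label": "RISK"},
    {"sentiment_score": 0.0, "sentiment_label": "NEUTRAL"},
]
metrics = aggregate(toy_results)
print(metrics)
```

A report with an average near zero but high volatility mixes strongly positive and strongly negative sentences, which is a different pattern from a uniformly neutral report.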

In [None]:
from openpyxl import load_workbook
from openpyxl.utils import get_column_letter
from openpyxl.styles import PatternFill

# Define output path
output_path = "data/NLP/Results/Overall_Sentiment_Analysis.xlsx"

# Save the DataFrame to Excel (keep the same sheet name as the first export)
climate_bert_sentiment_df.to_excel(output_path, index=False, sheet_name='Document_Sentiment', engine="openpyxl")

# Load the workbook and sheet
wb = load_workbook(output_path)
ws = wb.active  # There's only one sheet since we saved just one DataFrame

# Auto-adjust column widths based on the longest string in each column
for col in ws.columns:
    max_length = 0
    col_letter = get_column_letter(col[0].column)
    for cell in col:
        if cell.value:
            max_length = max(max_length, len(str(cell.value)))
    ws.column_dimensions[col_letter].width = max_length + 3  # Add padding

# Define grey fill for alternating rows
grey_fill = PatternFill(start_color="D9D9D9", end_color="D9D9D9", fill_type="solid")

# Alternate row colors by company
prev_company = None
use_grey = False
for row in range(2, ws.max_row + 1):
    current_company = ws[f"A{row}"].value  # Column A has the company names
    if current_company != prev_company:
        use_grey = not use_grey
        prev_company = current_company

    if use_grey:
        for col in range(1, ws.max_column + 1):
            ws.cell(row=row, column=col).fill = grey_fill

# Save the final cleaned and formatted workbook
wb.save(output_path)
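
The alternating shading above can be tested without touching a workbook. The sketch below isolates the toggle logic in a hypothetical helper, `shaded_rows`, and runs it on invented company names.

```python
def shaded_rows(company_column):
    """Return one flag per data row; True means the row gets the grey fill."""
    flags = []
    prev_company = None
    use_grey = False
    for company in company_column:
        # Flip the fill whenever a new company block starts
        if company != prev_company:
            use_grey = not use_grey
            prev_company = company
        flags.append(use_grey)
    return flags


# Invented, already sorted company names
rows = ["Acme", "Acme", "Borealis", "Borealis", "Cobalt"]
print(shaded_rows(rows))  # [True, True, False, False, True]
```

The grouping only works when rows for the same company are adjacent, which holds here because the DataFrame is sorted by organization and year before export.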


### Sentiment per Topic
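
When a sentence mentions several topics, its sentiment is split across them in proportion to how many terms of each topic it contains. A small worked sketch, using an invented sentence score and invented term counts (`split_sentiment` is a hypothetical name):

```python
def split_sentiment(sentence_sentiment, topic_counts):
    """Split one sentence score across topics by their share of matched terms."""
    total_terms = sum(topic_counts.values())
    if total_terms == 0:
        return {}
    return {
        topic: sentence_sentiment * count / total_terms
        for topic, count in topic_counts.items()
        if count > 0  # topics not mentioned in the sentence get no contribution
    }


# One sentence scored +0.6 with two renewable terms and one emissions term
contributions = split_sentiment(
    0.6,
    {"renewable_energy": 2, "climate_emissions": 1, "green_finance": 0},
)
for topic, value in contributions.items():
    print(f"{topic:<20} {value:+.2f}")
```

The contributions always sum back to the sentence score, so a sentence is never counted more than once across topics.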

In [None]:
# ===================================================================
# TOPIC-LEVEL WEIGHTED SENTIMENT ANALYSIS USING CLIMATE-BERT
# This calculates average weighted sentiment per sustainability topic
# Can be run independently of main sentiment analysis
# ===================================================================

print("Starting Topic-Level Weighted Sentiment Analysis")
print("Using Climate-BERT with weighted contribution method")
print("=" * 70)

# Step 1: Import required libraries and define topics
import pandas as pd
import numpy as np
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')

# Define the same GREEN_TOPICS structure used in your topic analysis
GREEN_TOPICS = {
    'renewable_energy': 'Renewable Energy',
    'climate_emissions': 'Climate & Emissions', 
    'environmental_conservation': 'Environmental Conservation',
    'energy_systems': 'Energy Systems & Efficiency',
    'circular_economy': 'Circular Economy & Waste',
    'sustainability_governance': 'Sustainability & Governance',
    'green_finance': 'Green Finance & Investment'
}

# Step 2: Initialize Climate-BERT model
def initialize_climate_bert_for_topics():
    """Initialize Climate-BERT model for topic sentiment analysis."""
    try:
        print("Loading Climate-BERT model for topic analysis...")
        model_name = "climatebert/distilroberta-base-climate-sentiment"
        
        climate_sentiment = pipeline(
            "sentiment-analysis", 
            model=model_name,
            return_all_scores=True,
            truncation=True,
            max_length=512
        )
        
        print("Climate-BERT model loaded successfully!")
        return climate_sentiment
        
    except Exception as e:
        print(f"Error loading Climate-BERT: {e}")
        print("Falling back to a general-purpose Twitter RoBERTa sentiment model (positive/negative/neutral labels)...")
        
        fallback_sentiment = pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-roberta-base-sentiment-latest",
            return_all_scores=True,
            truncation=True,
            max_length=512
        )
        print("Fallback model loaded successfully!")
        return fallback_sentiment

# Initialize model
topic_climate_model = initialize_climate_bert_for_topics()

# Step 3: Extract sentences with topic mapping
def extract_sentences_with_topic_mapping(doc, found_terms):
    """
    Extract sentences and map which topics are mentioned in each sentence.
    Returns list of sentences with topic term counts.
    """
    sentence_topic_mapping = {}
    
    # Group terms by sentence
    for term_info in found_terms:
        sentence = term_info['sentence']
        sentence_text = sentence.text.strip()
        
        # Skip very short sentences
        if len(sentence_text) <= 20:
            continue
            
        # Initialize sentence if not seen before
        if sentence_text not in sentence_topic_mapping:
            sentence_topic_mapping[sentence_text] = {
                'sentence_obj': sentence,
                'topic_counts': {topic: 0 for topic in GREEN_TOPICS.keys()},
                'terms_found': []
            }
        
        # Add term to this sentence's topic count
        topic = term_info['topic']
        sentence_topic_mapping[sentence_text]['topic_counts'][topic] += 1
        sentence_topic_mapping[sentence_text]['terms_found'].append(term_info['term'])
    
    # Convert to list format with additional info
    sentences_with_topics = []
    for sentence_text, info in sentence_topic_mapping.items():
        total_terms_in_sentence = sum(info['topic_counts'].values())
        
        # Only include sentences that have sustainability terms
        if total_terms_in_sentence > 0:
            sentences_with_topics.append({
                'text': sentence_text,
                'sentence_obj': info['sentence_obj'],
                'topic_counts': info['topic_counts'],
                'total_terms': total_terms_in_sentence,
                'terms_found': info['terms_found']
            })
    
    return sentences_with_topics

# Step 4: Calculate weighted sentiment for each topic
def calculate_weighted_topic_sentiment(sentence_info, climate_model):
    """
    Calculate weighted sentiment contribution for each topic in a sentence.
    """
    text = sentence_info['text']
    topic_counts = sentence_info['topic_counts']
    total_terms = sentence_info['total_terms']
    
    # Truncate if needed
    if len(text) > 500:
        text = text[:500] + "..."
    
    try:
        # Get Climate-BERT sentiment for the sentence
        result = climate_model(text)
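        # Lowercase the labels so the lookup does not depend on label casing
        scores_dict = {item['label'].lower(): item['score'] for item in result[0]}
        
        # Accept the fallback model's 'positive'/'negative' labels as well, so it
        # does not silently score every sentence as 0
        opportunity_score = scores_dict.get('opportunity', scores_dict.get('positive', 0))
        risk_score = scores_dict.get('risk', scores_dict.get('negative', 0))
        neutral_score = scores_dict.get('neutral', 0)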
        
        # Calculate overall sentiment score for this sentence
        sentence_sentiment = opportunity_score - risk_score
        confidence = max(opportunity_score, risk_score, neutral_score)
        
        # Determine primary label
        if opportunity_score > max(risk_score, neutral_score):
            label = 'OPPORTUNITY'
        elif risk_score > max(opportunity_score, neutral_score):
            label = 'RISK'
        else:
            label = 'NEUTRAL'
        
        # Calculate weighted contribution to each topic
        topic_sentiment_contributions = {}
        for topic, count in topic_counts.items():
            if count > 0:  # Only calculate for topics mentioned in this sentence
                weight = count / total_terms  # Proportional weight
                weighted_sentiment = sentence_sentiment * weight
                topic_sentiment_contributions[topic] = {
                    'weighted_sentiment': weighted_sentiment,
                    'weight': weight,
                    'term_count': count,
                    'sentence_sentiment': sentence_sentiment,
                    'sentence_label': label,
                    'sentence_confidence': confidence
                }
            else:
                topic_sentiment_contributions[topic] = {
                    'weighted_sentiment': 0.0,
                    'weight': 0.0,
                    'term_count': 0,
                    'sentence_sentiment': 0.0,
                    'sentence_label': 'NONE',
                    'sentence_confidence': 0.0
                }
        
        return topic_sentiment_contributions
        
    except Exception as e:
        print(f"Error processing sentence: {e}")
        # Return zero contributions for all topics
        return {topic: {
            'weighted_sentiment': 0.0,
            'weight': 0.0,
            'term_count': 0,
            'sentence_sentiment': 0.0,
            'sentence_label': 'ERROR',
            'sentence_confidence': 0.0
        } for topic in GREEN_TOPICS.keys()}

# Step 5: Analyze all documents for topic-level sentiment
def analyze_topic_weighted_sentiment_all_docs(documents_dict, document_terms_dict, climate_model):
    """
    Analyze weighted topic sentiment for all documents.
    """
    print("\nProcessing documents for topic-weighted sentiment...")
    print("-" * 70)
    
    all_results = {}
    
    for doc_name, doc in documents_dict.items():
        print(f"Processing: {doc_name}")
        
        if doc_name in document_terms_dict:
            found_terms = document_terms_dict[doc_name]
            
            # Extract sentences with topic mapping
            sentences_with_topics = extract_sentences_with_topic_mapping(doc, found_terms)
            
            print(f"Found {len(sentences_with_topics)} sentences with sustainability terms")
            
            # Initialize topic accumulators
            topic_aggregations = {topic: {
                'total_weighted_sentiment': 0.0,
                'total_weight': 0.0,
                'sentence_count': 0,
                'term_count_total': 0,
                'opportunity_sentences': 0,
                'risk_sentences': 0,
                'neutral_sentences': 0
            } for topic in GREEN_TOPICS.keys()}
            
            # Process each sentence
            for sentence_info in sentences_with_topics:
                # Get weighted contributions for this sentence
                topic_contributions = calculate_weighted_topic_sentiment(sentence_info, climate_model)
                
                # Aggregate contributions for each topic
                for topic, contribution in topic_contributions.items():
                    if contribution['weight'] > 0:  # Only aggregate topics mentioned in this sentence
                        topic_aggregations[topic]['total_weighted_sentiment'] += contribution['weighted_sentiment']
                        topic_aggregations[topic]['total_weight'] += contribution['weight']
                        topic_aggregations[topic]['sentence_count'] += 1
                        topic_aggregations[topic]['term_count_total'] += contribution['term_count']
                        
                        # Count sentence types
                        if contribution['sentence_label'] == 'OPPORTUNITY':
                            topic_aggregations[topic]['opportunity_sentences'] += 1
                        elif contribution['sentence_label'] == 'RISK':
                            topic_aggregations[topic]['risk_sentences'] += 1
                        else:
                            topic_aggregations[topic]['neutral_sentences'] += 1
            
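            # Mean weighted contribution per topic: the sum of weighted sentence
            # contributions divided by the number of sentences mentioning the topic
            # (not divided by total weight)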
            topic_results = {}
            for topic, aggregation in topic_aggregations.items():
                if aggregation['sentence_count'] > 0:
                    avg_weighted_sentiment = aggregation['total_weighted_sentiment'] / aggregation['sentence_count']
                    avg_weight_per_sentence = aggregation['total_weight'] / aggregation['sentence_count']
                    
                    # Calculate ratios
                    total_sentences = aggregation['sentence_count']
                    opp_ratio = aggregation['opportunity_sentences'] / total_sentences
                    risk_ratio = aggregation['risk_sentences'] / total_sentences  
                    neutral_ratio = aggregation['neutral_sentences'] / total_sentences
                else:
                    avg_weighted_sentiment = 0.0
                    avg_weight_per_sentence = 0.0
                    opp_ratio = 0.0
                    risk_ratio = 0.0
                    neutral_ratio = 0.0
                
                topic_results[topic] = {
                    'avg_weighted_sentiment': avg_weighted_sentiment,
                    'avg_weight_per_sentence': avg_weight_per_sentence,
                    'sentence_count': aggregation['sentence_count'],
                    'total_terms': aggregation['term_count_total'],
                    'opportunity_ratio': opp_ratio,
                    'risk_ratio': risk_ratio,
                    'neutral_ratio': neutral_ratio
                }
            
            all_results[doc_name] = {
                'total_sentences_analyzed': len(sentences_with_topics),
                'topic_results': topic_results
            }
            
            # Print summary for this document
            topics_with_data = sum(1 for topic, results in topic_results.items() if results['sentence_count'] > 0)
            print(f"Topics with sentiment data: {topics_with_data}/{len(GREEN_TOPICS)}")
            
        else:
            print(f"No terms found for {doc_name}")
            all_results[doc_name] = {
                'total_sentences_analyzed': 0,
                'topic_results': {topic: {
                    'avg_weighted_sentiment': 0.0,
                    'avg_weight_per_sentence': 0.0,
                    'sentence_count': 0,
                    'total_terms': 0,
                    'opportunity_ratio': 0.0,
                    'risk_ratio': 0.0,
                    'neutral_ratio': 0.0
                } for topic in GREEN_TOPICS.keys()}
            }
    
    return all_results

# Step 6: Create DataFrame with topic sentiment results
def create_topic_sentiment_dataframe(all_results):
    """
    Create DataFrame with documents as rows and topic sentiment metrics as columns.
    """
    data = []
    
    for doc_name, doc_results in all_results.items():
        # Extract organization and year (assumes names of the form "<organization>_<year>")
        org_name, year = doc_name.rsplit('_', 1)
        
        # Base row information
        row = {
            'organization': org_name,
            'year': int(year),
            'total_sentences_analyzed': doc_results['total_sentences_analyzed']
        }
        
        # Add metrics for each topic
        for topic, topic_results in doc_results['topic_results'].items():
            # Add columns for this topic
            row[f'{topic}_avg_sentiment'] = round(topic_results['avg_weighted_sentiment'], 4)
            row[f'{topic}_sentence_count'] = topic_results['sentence_count']
            row[f'{topic}_total_terms'] = topic_results['total_terms']
            row[f'{topic}_opportunity_ratio'] = round(topic_results['opportunity_ratio'], 3)
            row[f'{topic}_risk_ratio'] = round(topic_results['risk_ratio'], 3)
            row[f'{topic}_neutral_ratio'] = round(topic_results['neutral_ratio'], 3)
        
        data.append(row)
    
    # Create DataFrame
    df = pd.DataFrame(data)
    df = df.sort_values(['organization', 'year']).reset_index(drop=True)
    
    return df

# Step 7: Run the complete analysis
print("\nRunning topic-weighted sentiment analysis...")

# Run analysis (assumes 'documents' and 'document_terms' variables exist)
topic_sentiment_results = analyze_topic_weighted_sentiment_all_docs(
    documents, 
    document_terms, 
    topic_climate_model
)

# Create DataFrame
topic_sentiment_df = create_topic_sentiment_dataframe(topic_sentiment_results)

# Step 8: Save results
import os
output_dir = "data/NLP/Results"
os.makedirs(output_dir, exist_ok=True)

excel_path = f"{output_dir}/Topic_Weighted_Sentiment_Analysis.xlsx"
topic_sentiment_df.to_excel(excel_path, index=False, sheet_name='Topic_Sentiment')

print(f"\nTopic-weighted sentiment analysis complete!")
print(f"Results saved to: {excel_path}")
print(f"Analyzed {len(topic_sentiment_results)} documents across {len(GREEN_TOPICS)} topics")

# Step 9: Display summary statistics
print(f"\nTOPIC SENTIMENT SUMMARY:")
print("=" * 70)

# Calculate and display average sentiment per topic across all documents
topic_columns = [col for col in topic_sentiment_df.columns if col.endswith('_avg_sentiment')]

print("Average weighted sentiment by topic (across all documents):")
print("-" * 50)

for col in topic_columns:
    topic_name = col.replace('_avg_sentiment', '').replace('_', ' ').title()
    avg_sentiment = topic_sentiment_df[col].mean()
    std_sentiment = topic_sentiment_df[col].std()
    
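    # Count documents with at least one sentence for this topic
    # (an average of exactly 0.0 does not by itself mean the topic is absent)
    count_col = col.replace('_avg_sentiment', '_sentence_count')
    docs_with_data = int((topic_sentiment_df[count_col] > 0).sum())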
    
    print(f"{topic_name:<35}: {avg_sentiment:+.3f} (±{std_sentiment:.3f}) | {docs_with_data} docs")

# Find most positive and negative topics overall
print(f"\nMOST OPPORTUNITY-FOCUSED TOPICS:")
topic_averages = [(col, topic_sentiment_df[col].mean()) for col in topic_columns]
topic_averages.sort(key=lambda x: x[1], reverse=True)

for col, avg in topic_averages[:3]:
    topic_name = col.replace('_avg_sentiment', '').replace('_', ' ').title()
    print(f"  {topic_name}: {avg:+.3f}")

print(f"\nMOST RISK-FOCUSED TOPICS:")
# Walk the tail in reverse so the most negative topic is listed first
for col, avg in reversed(topic_averages[-3:]):
    topic_name = col.replace('_avg_sentiment', '').replace('_', ' ').title()
    print(f"  {topic_name}: {avg:+.3f}")

print(f"\nDataFrame shape: {topic_sentiment_df.shape}")
print(f"Columns: {len(topic_sentiment_df.columns)} total")

# Display first few rows for verification
print(f"\nSAMPLE RESULTS (first 5 rows):")
print(topic_sentiment_df.head().to_string(index=False))

print("\nEach topic's sentiment is the mean weighted contribution across sentences mentioning that topic.")
print("Weighting is based on the proportion of topic terms in each sentence.")
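The aggregation step above divides the summed weighted contributions by the number of sentences that mention a topic. The following is a minimal sketch, not the pipeline's own code, that contrasts this with a weight-normalized average. The function name `aggregate_topic` and the example values are hypothetical; only the keys `weighted_sentiment` and `weight` come from the code above.

```python
def aggregate_topic(contributions):
    """Aggregate per-sentence contributions for one topic.

    Each contribution is a hypothetical dict with 'weighted_sentiment'
    (sentence sentiment x topic weight) and 'weight' (topic share of terms).
    """
    # Keep only sentences that actually mention the topic
    active = [c for c in contributions if c['weight'] > 0]
    if not active:
        return {'mean_contribution': 0.0, 'weight_normalized': 0.0}

    total_weighted = sum(c['weighted_sentiment'] for c in active)
    total_weight = sum(c['weight'] for c in active)
    return {
        # Sum of contributions divided by sentence count (as in the step above)
        'mean_contribution': total_weighted / len(active),
        # Alternative: classic weighted average, sum(w * s) / sum(w)
        'weight_normalized': total_weighted / total_weight,
    }


example = [
    {'weighted_sentiment': 0.4, 'weight': 0.5},    # sentiment +0.8, half of the terms
    {'weighted_sentiment': 0.8, 'weight': 1.0},    # sentiment +0.8, all of the terms
    {'weighted_sentiment': -0.1, 'weight': 0.25},  # sentiment -0.4, a quarter of the terms
    {'weighted_sentiment': 0.0, 'weight': 0.0},    # topic not mentioned, skipped
]
result = aggregate_topic(example)
print(result)
```

The two figures diverge whenever topic weights are below 1: the mean contribution shrinks toward zero for sentences that mix several topics, while the weight-normalized version stays on the original sentiment scale. Which one is appropriate depends on how the score is interpreted downstream.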

In [None]:
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
from openpyxl.utils import get_column_letter

# Output path (same file written in Step 8; it is overwritten here before formatting)
output_path = "data/NLP/Results/Topic_Weighted_Sentiment_Analysis.xlsx"

# Save the DataFrame to Excel, keeping the sheet name used in Step 8
topic_sentiment_df.to_excel(output_path, index=False, sheet_name='Topic_Sentiment', engine="openpyxl")

# Load the workbook and sheet
wb = load_workbook(output_path)
ws = wb.active  # There's only one sheet since we saved just one DataFrame

# Auto-adjust column widths based on the longest string in each column
for col in ws.columns:
    max_length = 0
    col_letter = get_column_letter(col[0].column)
    for cell in col:
        if cell.value is not None:  # Include zeros; skip only empty cells
            max_length = max(max_length, len(str(cell.value)))
    ws.column_dimensions[col_letter].width = max_length + 3  # Add padding

# Define grey fill for alternating rows
grey_fill = PatternFill(start_color="D9D9D9", end_color="D9D9D9", fill_type="solid")

# Alternate row colors by company
prev_company = None
use_grey = False
for row in range(2, ws.max_row + 1):
    current_company = ws[f"A{row}"].value  # Column A has the company names
    if current_company != prev_company:
        use_grey = not use_grey
        prev_company = current_company

    if use_grey:
        for col in range(1, ws.max_column + 1):
            ws.cell(row=row, column=col).fill = grey_fill

# Save the final cleaned and formatted workbook
wb.save(output_path)
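The row-banding rule in the cell above toggles the fill each time the company name changes, so all rows for one company share a shade. The sketch below isolates that rule from openpyxl so it can be checked on its own. The helper name `rows_to_shade` and the company names are hypothetical.

```python
def rows_to_shade(companies):
    """Return 0-based indices of rows to shade, toggling at each company change."""
    shaded = []
    prev_company = None
    use_grey = False
    for index, company in enumerate(companies):
        # Flip the shade whenever a new company block starts
        if company != prev_company:
            use_grey = not use_grey
            prev_company = company
        if use_grey:
            shaded.append(index)
    return shaded


# Hypothetical, already-sorted organization column
companies = ["Acme", "Acme", "Borealis", "Cobalt", "Cobalt"]
shaded_rows = rows_to_shade(companies)
print(shaded_rows)  # [0, 1, 3, 4]
```

The rule relies on the rows being sorted by organization, which the DataFrame construction step does before export.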


In [None]:
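# Show examples of Climate-BERT sentiment analysis
# (guarded so the cell does not raise NameError if the analysis has not been run)
if 'sentiment_analysis_results' in globals() and sentiment_analysis_results: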
    print(f"\nCLIMATE-BERT SENTIMENT ANALYSIS EXAMPLES")
    print("=" * 60)
    print("Climate-BERT classifies sustainability sentences as OPPORTUNITY/RISK/NEUTRAL")
    print("Sentiment Score: Opportunity - Risk (ranges from -1 to +1)")
    print()
    
    # Get the document with most analyzed sentences
    most_analyzed_doc = max(sentiment_analysis_results.items(), 
                           key=lambda x: x[1]['sustainability_sentences_count'])
    doc_name = most_analyzed_doc[0]
    doc_results = most_analyzed_doc[1]['climate_bert_sentiment']
    
    print(f"EXAMPLES FROM: {doc_name}")
    print(f"Total sustainability sentences analyzed: {doc_results['total_sentences']}")
    print(f"Average sentiment score: {doc_results['avg_sentiment_score']:+.3f}")
    print(f"Opportunity ratio: {doc_results['positive_ratio']:.1%}")
    print(f"Risk ratio: {doc_results['negative_ratio']:.1%}")
    print(f"Neutral ratio: {doc_results['neutral_ratio']:.1%}")
    print()
    
    # Show detailed examples if available
    if 'detailed_results' in doc_results and doc_results['detailed_results']:
        detailed_results = doc_results['detailed_results']
        
        # Show examples of each sentiment type: up to 3 for OPPORTUNITY and RISK, 2 for NEUTRAL
        for label, max_examples in [('OPPORTUNITY', 3), ('RISK', 3), ('NEUTRAL', 2)]:
            examples = [r for r in detailed_results if r['sentiment_label'] == label]
            if not examples:
                continue

            print(f"{label} EXAMPLES ({len(examples)} total):")
            for i, example in enumerate(examples[:max_examples], 1):
                confidence = example['confidence']
                sentiment_score = example['sentiment_score']
                term_found = example['term_found']
                topic = example['topic']
                text = example['text']
                # Truncate long sentences for display
                display_text = text[:150] + "..." if len(text) > 150 else text
                print(f"{i}. Term: '{term_found}' | Topic: {topic}")
                print(f"   Sentiment Score: {sentiment_score:+.3f} | Confidence: {confidence:.3f}")
                print(f"   Text: {display_text}")
                print()
    
    else:
        print("Detailed sentence examples not available in results")
        print("Note: To see individual sentence examples, ensure 'detailed_results' are stored in sentiment analysis")

print(f"\nClimate-BERT sentiment examples complete.")
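The score described above is the opportunity score minus the risk score, which bounds it to the range -1 to +1. As a minimal sketch, assuming each sentence comes with class probabilities that sum to 1, the per-sentence score and the document average could be computed as follows. The function names and the lowercase label keys are assumptions for illustration and may differ from the labels the model actually returns.

```python
def sentiment_score(probabilities):
    """Return P(opportunity) - P(risk) for one sentence.

    `probabilities` is a hypothetical dict of class probabilities;
    the key names are assumptions, not confirmed model labels.
    """
    return probabilities.get('opportunity', 0.0) - probabilities.get('risk', 0.0)


def average_sentiment(sentence_probabilities):
    """Average the per-sentence scores; 0.0 when there are no sentences."""
    scores = [sentiment_score(p) for p in sentence_probabilities]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up probabilities for three sentences
sentences = [
    {'opportunity': 0.90, 'neutral': 0.08, 'risk': 0.02},  # clearly opportunity
    {'opportunity': 0.10, 'neutral': 0.20, 'risk': 0.70},  # clearly risk
    {'opportunity': 0.05, 'neutral': 0.90, 'risk': 0.05},  # neutral, score 0
]
scores = [sentiment_score(p) for p in sentences]
document_average = average_sentiment(sentences)
print(scores, document_average)
```

A neutral sentence contributes a score near zero, so a document dominated by neutral sentences is pulled toward 0 regardless of how strongly its few opinionated sentences are framed.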

In [None]:
# Show examples of Topic-Weighted Climate-BERT sentiment analysis
if 'topic_sentiment_results' in globals() and topic_sentiment_results:
    print(f"\nTOPIC-WEIGHTED CLIMATE-BERT SENTIMENT EXAMPLES")
    print("=" * 65)
    print("Shows how Climate-BERT analyzes sentiment for specific sustainability topics")
    print("Weighted by topic term frequency within each sentence")
    print()
    
    # Get the document with most topic sentences analyzed
    most_analyzed_doc = max(topic_sentiment_results.items(), 
                           key=lambda x: x[1]['total_sentences_analyzed'])
    doc_name = most_analyzed_doc[0]
    doc_results = most_analyzed_doc[1]
    
    print(f"EXAMPLES FROM: {doc_name}")
    print(f"Total sentences analyzed: {doc_results['total_sentences_analyzed']}")
    print()
    
    # Focus on the two main topics: renewable_energy and climate_emissions
    topics_to_show = ['renewable_energy', 'climate_emissions']
    topic_display_names = {
        'renewable_energy': 'RENEWABLE ENERGY',
        'climate_emissions': 'CLIMATE & EMISSIONS'
    }
    
    for topic in topics_to_show:
        if topic in doc_results['topic_results']:
            topic_data = doc_results['topic_results'][topic]
            
            if topic_data['sentence_count'] > 0:
                print(f"{topic_display_names[topic]} TOPIC SENTIMENT:")
                print(f"   Average Weighted Sentiment: {topic_data['avg_weighted_sentiment']:+.3f}")
                print(f"   Sentences Analyzed: {topic_data['sentence_count']}")
                print(f"   Terms Found: {topic_data['total_terms']}")
                print(f"   Opportunity Ratio: {topic_data['opportunity_ratio']:.1%}")
                print(f"   Risk Ratio: {topic_data['risk_ratio']:.1%}")
                print(f"   Neutral Ratio: {topic_data['neutral_ratio']:.1%}")
                print()
            else:
                print(f"{topic_display_names[topic]} TOPIC SENTIMENT:")
                print(f"   No sentences found for this topic")
                print()
    
    print("NOTE: Topic-weighted sentiment calculation:")
    print("   • Each sentence is analyzed by Climate-BERT for overall sentiment")
    print("   • Sentiment contribution is weighted by topic term frequency in sentence")
    print("   • Example: Sentence with 3 renewable terms + 1 emissions term:")
    print("     - Renewable energy gets 75% of sentence sentiment weight")
    print("     - Climate emissions gets 25% of sentence sentiment weight")
    print()
    
    # Illustrative, hard-coded example of how the weighting works
    print("TOPIC WEIGHTING EXAMPLE:")
    print("   Sentence: 'Our renewable energy projects reduce carbon emissions significantly'")
    print("   Terms found: renewable (1), energy (1), carbon (1), emissions (1)")
    print("   If Climate-BERT scores this as +0.8 OPPORTUNITY:")
    print("   • Renewable Energy gets: +0.8 × (2/4) = +0.4 weighted sentiment") 
    print("   • Climate Emissions gets: +0.8 × (2/4) = +0.4 weighted sentiment")
    print()

elif 'topic_sentiment_df' in globals() and len(topic_sentiment_df) > 0:
    # If we only have the final DataFrame, show summary stats
    print(f"\nTOPIC-WEIGHTED SENTIMENT SUMMARY")
    print("=" * 45)
    
    # Show average sentiment scores across all documents for key topics
    renewable_avg = topic_sentiment_df['renewable_energy_avg_sentiment'].mean()
    emissions_avg = topic_sentiment_df['climate_emissions_avg_sentiment'].mean()
    
    print(f"AVERAGE TOPIC SENTIMENT ACROSS ALL DOCUMENTS:")
    print(f"Renewable Energy: {renewable_avg:+.3f}")
    print(f"Climate & Emissions: {emissions_avg:+.3f}")
    print()
    print("Note: Run topic sentiment analysis with detailed results enabled")
    print("   to see individual sentence examples and weighting details")

else:
    print(f"\nTOPIC-WEIGHTED SENTIMENT ANALYSIS")
    print("=" * 45)
    print("Topic sentiment results not available")
    print("Run the topic-weighted sentiment analysis section first to see examples")

print(f"\nTopic-weighted sentiment examples complete.")
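The weighting rule printed in the cell above (a topic's share of the sentence sentiment equals its share of the topic terms in that sentence) can be reproduced with a short sketch. The helper names `topic_weights` and `weighted_contributions` are hypothetical and are not the pipeline's own functions; the numbers are the ones used in the printed example.

```python
def topic_weights(term_counts):
    """Convert per-topic term counts in one sentence into weights that sum to 1."""
    total = sum(term_counts.values())
    if total == 0:
        return {topic: 0.0 for topic in term_counts}
    return {topic: count / total for topic, count in term_counts.items()}


def weighted_contributions(sentence_sentiment, term_counts):
    """Split one sentence-level sentiment score across topics by term share."""
    weights = topic_weights(term_counts)
    return {topic: sentence_sentiment * weight for topic, weight in weights.items()}


# 3 renewable terms + 1 emissions term -> 75% / 25%
shares = topic_weights({'renewable_energy': 3, 'climate_emissions': 1})

# Sentence scored +0.8 with 2 terms per topic -> +0.4 each
split = weighted_contributions(0.8, {'renewable_energy': 2, 'climate_emissions': 2})
print(shares, split)
```

Because the weights within a sentence sum to 1, the per-topic contributions always add back up to the sentence's overall sentiment score.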