# Greenwashing Risk Score Calculation and Ensemble Analysis

## Overview
This module integrates all performance and communication components into the final greenwashing risk assessment using an ensemble methodology. It calculates the Performance-Communication Gap (PCG) that defines classic greenwashing, then evaluates the score across a systematic grid of weight combinations so that results are robust to the choice of weights.

## Component Integration Process

### Communication Dimension Calculation
1. **Combined Sentiment Score**: Weighted combination (60% general environmental sentiment, 20% renewable energy sentiment, 20% climate emissions sentiment)
2. **Combined Green Terms Score**: Weighted combination (70% term frequency, 30% vocabulary diversity)  
3. **Green Communication Intensity**: Final combination (40% green terms, 60% sentiment)
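4. **Other Dimensions**: Calculated from NLP outputs (higher score = higher greenwashing risk):
   - Substantiation Weakness: 30% quantification intensity (reversed), 35% evidence intensity (reversed), 35% aspirational intensity
   - Language Vagueness: 70% vague-word intensity, 30% hedge-word intensity (reversed)
   - Temporal Orientation: 60% future focus (future vs. past/present ratio), 40% timeline specificity (reversed)
   - Reporting Consistency: 70% combined cross-year similarity, 30% high-similarity sentence ratio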
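
The weighting steps above can be traced with a small worked example. This is a minimal sketch, assuming the inputs have already been normalized to a 0-100 scale; the input values and the function name `green_communication_intensity` are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical, already-normalized (0-100) inputs for one company-year.
inputs = {
    "avg_sentiment_score": 70.0,
    "renewable_energy_avg_sentiment": 50.0,
    "climate_emissions_avg_sentiment": 40.0,
    "gt_freq_pct": 80.0,
    "unique_gt_relative": 60.0,
}

def green_communication_intensity(x):
    """Combine sentiment and green-term inputs using the weights listed above."""
    # Step 1: combined sentiment (60% general, 20% renewable, 20% climate emissions)
    sentiment = (0.6 * x["avg_sentiment_score"]
                 + 0.2 * x["renewable_energy_avg_sentiment"]
                 + 0.2 * x["climate_emissions_avg_sentiment"])
    # Step 2: combined green terms (70% frequency, 30% vocabulary diversity)
    green_terms = 0.7 * x["gt_freq_pct"] + 0.3 * x["unique_gt_relative"]
    # Step 3: final intensity (40% green terms, 60% sentiment)
    return 0.4 * green_terms + 0.6 * sentiment

score = green_communication_intensity(inputs)
print(round(score, 2))  # 65.6
```

Here the combined sentiment is 60 and the combined green terms score is 74, so the intensity is 0.4 × 74 + 0.6 × 60 = 65.6.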

### Performance-Communication Gap Calculation
- **Absolute Gap**: Green Communication Score - Performance Score
- **Classic Greenwashing Detection**: Companies with above-median communication AND below-median performance
- **Amplification Factor**: 1.5x multiplier applied to classic greenwashing pattern to reflect severity
- **Year-wise Normalization**: Scores normalized within each year (2021, 2022) for temporal consistency
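
The gap calculation can be sketched on toy data as follows. The four companies and their scores are invented for illustration, and the column names `gap`, `classic`, `amplified`, and `score` are placeholders rather than the names used later in the notebook; the sketch uses vectorized pandas operations instead of row-wise `apply`.

```python
import pandas as pd

# Toy data: four hypothetical companies in a single year, scores on 0-100.
toy = pd.DataFrame({
    "Organization": ["A", "B", "C", "D"],
    "year": [2021] * 4,
    "Green_Com_Score": [90.0, 80.0, 30.0, 20.0],
    "Performance_Score": [20.0, 70.0, 60.0, 10.0],
})

# Absolute gap: communication minus performance.
toy["gap"] = toy["Green_Com_Score"] - toy["Performance_Score"]

# Year-specific medians, broadcast back to every row of that year.
com_median = toy.groupby("year")["Green_Com_Score"].transform("median")
perf_median = toy.groupby("year")["Performance_Score"].transform("median")

# Classic pattern: above-median communication AND below-median performance.
toy["classic"] = (toy["Green_Com_Score"] > com_median) & (toy["Performance_Score"] < perf_median)

# 1.5x amplification for the classic pattern only.
toy["amplified"] = toy["gap"].where(~toy["classic"], toy["gap"] * 1.5)

# Min-max normalization to 0-100 within each year.
lo = toy.groupby("year")["amplified"].transform("min")
hi = toy.groupby("year")["amplified"].transform("max")
toy["score"] = (toy["amplified"] - lo) / (hi - lo) * 100

print(toy[["Organization", "gap", "classic", "amplified", "score"]])
```

In this toy year the medians are 55 (communication) and 40 (performance), so only company A matches the classic pattern: its gap of 70 is amplified to 105 and it receives the maximum normalized score.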

## Ensemble Methodology
- **Weight Constraints**: Individual weights 0.05-0.50, hierarchical ordering: Greenwashing Score > Substantiation Weakness > other components
- **Valid Combinations**: 59,881 weight combinations meeting theoretical constraints
- **Statistical Output**: Mean, median, standard deviation, quartiles, and range for each company-year observation
- **Robustness**: Multiple weight scenarios ensure that results do not depend on a single weight specification
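
One way to enumerate weight combinations under such constraints is sketched below. Several details are assumptions, not taken from this notebook: a grid step of 0.05, five components (the Greenwashing Score, Substantiation Weakness, and three others), weights summing to 1, and a strict reading of the ordering constraint. The number of combinations it produces will therefore differ from the count reported above unless the actual grid and component set happen to match.

```python
from itertools import product
from statistics import mean, median, pstdev

# Assumptions (not from the source): 0.05 grid step, five components, weights sum to 1.
STEP = 0.05
UNITS_TOTAL = 20               # 20 * 0.05 = 1.0
MIN_UNITS, MAX_UNITS = 1, 10   # 0.05 .. 0.50 per weight, in grid units

def valid_weight_sets(n_components=5):
    """Yield weight tuples ordered as (Greenwashing Score, Substantiation Weakness, others...)."""
    for units in product(range(MIN_UNITS, MAX_UNITS + 1), repeat=n_components):
        if sum(units) != UNITS_TOTAL:
            continue
        gw, subst, *others = units
        # Hierarchical ordering, checked on integers to avoid float comparisons.
        if gw > subst > max(others):
            yield tuple(u * STEP for u in units)

# Hypothetical component scores (0-100) for one company-year, same order as the weights.
component_scores = (80.0, 60.0, 40.0, 50.0, 30.0)

weight_sets = list(valid_weight_sets())
ensemble_scores = [sum(w * s for w, s in zip(ws, component_scores)) for ws in weight_sets]

summary = {
    "n_combinations": len(ensemble_scores),
    "mean": mean(ensemble_scores),
    "median": median(ensemble_scores),
    "std": pstdev(ensemble_scores),   # population standard deviation
    "min": min(ensemble_scores),
    "max": max(ensemble_scores),
}
print(summary)
```

The spread between `min` and `max` gives a quick sense of how sensitive a single observation is to the weighting choice.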

## Key Output Variables
- **Performance_Communication_Gap_Score**: Primary greenwashing risk measure (PCG)
- **Component scores**: Individual dimension scores (0-100 scale)
- **Ensemble statistics**: Complete uncertainty measures across all weight combinations
- **Company rankings**: Relative positioning for both years

## Data Integration
Creates a comprehensive master dataset that combines performance scores (from Performance_Score_Calculations.ipynb), communication analyses (from all 5 NLP modules), and ensemble results, preserving all variables for validation and further analysis.

## Theoretical Foundation
Implements the PCG approach where high environmental communication paired with poor environmental performance indicates greenwashing risk, with amplification factors revealing imbalances as recommended by composite index best practices.

## Loading and preparing data

In [None]:
import pandas as pd
import os

# Load all Excel files
df_density = pd.read_excel('data/NLP/Results/Communication_Score_df_Density.xlsx')
df_context = pd.read_excel('data/NLP/Results/Communication_Score_df_Context.xlsx')
df_similarity = pd.read_excel('data/NLP/Results/Similarity/similarity_analysis_results.xlsx')
df_sentiment = pd.read_excel('data/NLP/Results/Overall_Sentiment_Analysis.xlsx')
df_hedge_vague = pd.read_excel('data/NLP/Results/Communication_Score_df_Hedge_Vague.xlsx')
df_topics = pd.read_excel('data/NLP/Results/Communication_Score_df_Topics.xlsx')
df_topic_sentiment = pd.read_excel('data/NLP/Results/Topic_Weighted_Sentiment_Analysis.xlsx')

# For each df print the name of the first column
dataframes = [df_density, df_context, df_similarity, df_sentiment, df_hedge_vague, df_topics, df_topic_sentiment]
for df in dataframes:
    first_col = df.columns[0]
    print(f"First column: {first_col}")

In [None]:
# Extract relevant metrics from each dataframe
density_metrics = df_density[['organization', 'year', 'gt_freq_pct', 'unique_gt_relative']].copy()

context_metrics = df_context[['organization', 'year', 'temporal_past_pct', 'temporal_present_pct', 
                             'temporal_future_pct', 'quantification_intensity_score', 
                             'evidence_intensity_score', 'aspirational_intensity_score']].copy()

similarity_metrics = df_similarity[['Company', 'TFIDF_Doc', 'Jaccard', 'SpaCy_HighSim_Ratio', 'SpaCy_Avg_Similarity']].copy()
similarity_metrics.rename(columns={'Company': 'organization'}, inplace=True)

sentiment_metrics = df_sentiment[['organization', 'year', 'avg_sentiment_score', 'sentiment_confidence', 'opportunity_ratio', 'risk_ratio'
]].copy()

hedge_vague_metrics = df_hedge_vague[['organization', 'year', 'hedge_intensity_score', 
                                     'vague_intensity_score', 'commitment_timeline_pct', 'total_unclear_density',
                                     'combined_intensity_score']].copy()

topics_metrics = df_topics[['organization', 'year', 'renewable_energy_density', 'climate_emissions_density']].copy()

topic_sentiment_metrics = df_topic_sentiment[['organization', 'year', 'renewable_energy_avg_sentiment', 
                                             'climate_emissions_avg_sentiment']].copy()

# Standardize organization names by replacing underscores with spaces
for metrics in (density_metrics, context_metrics, sentiment_metrics,
                hedge_vague_metrics, topics_metrics, topic_sentiment_metrics):
    metrics['organization'] = metrics['organization'].str.replace('_', ' ')

In [None]:
# Merge all dataframes
df = density_metrics.merge(context_metrics, on=['organization', 'year'], how='outer')
df = df.merge(sentiment_metrics, on=['organization', 'year'], how='outer')
df = df.merge(hedge_vague_metrics, on=['organization', 'year'], how='outer')
df = df.merge(topics_metrics, on=['organization', 'year'], how='outer')
df = df.merge(topic_sentiment_metrics, on=['organization', 'year'], how='outer')
df = df.merge(similarity_metrics, on='organization', how='outer')

# Rename organization column to Organization
df.rename(columns={'organization': 'Organization'}, inplace=True)
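
Outer merges keep rows whose keys fail to match, which can hide spelling differences between files. The sketch below shows one way to surface such mismatches with the `indicator` and `validate` parameters of `DataFrame.merge`; the frames `left` and `right` and their contents are hypothetical stand-ins for the metric tables.

```python
import pandas as pd

# Hypothetical stand-ins for two metric tables; note the underscore in "Beta_Power".
left = pd.DataFrame({
    "organization": ["Alpha Energy", "Beta Power"],
    "year": [2021, 2021],
    "gt_freq_pct": [1.2, 0.8],
})
right = pd.DataFrame({
    "organization": ["Alpha Energy", "Beta_Power"],
    "year": [2021, 2021],
    "avg_sentiment_score": [0.3, 0.1],
})

merged = left.merge(
    right,
    on=["organization", "year"],
    how="outer",
    indicator=True,          # adds a "_merge" column: both / left_only / right_only
    validate="one_to_one",   # raises if a key is duplicated on either side
)

# Rows that matched on only one side point to inconsistent keys.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["organization", "year", "_merge"]])
```

In this toy case the two spellings of the second company appear as separate `left_only` and `right_only` rows, which is the pattern to look for before trusting the merged table.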

In [None]:
# Function to extract first word or apply exceptions
def simplify_org_name(name):
    if name == 'Polska Grupa Energetyczna PGE SA':
        return 'PGE'
    elif name == 'AKENERJİ ELEKTRİK ÜRETİM A.Ş.':
        return 'Akenerji'
    else:
        return name.split()[0]

# Apply the function to the 'Organization' column
df.loc[:, 'Organization'] = df['Organization'].apply(simplify_org_name)

In [None]:
# Check for NaN values and show which metrics and organizations have them
print("MISSING DATA ANALYSIS")
print("=" * 50)

# Get columns with NaN values
cols_with_nan = df.columns[df.isnull().any()].tolist()

if cols_with_nan:
    print(f"Metrics with NaN values: {cols_with_nan}")
    print()
    
    for col in cols_with_nan:
        nan_rows = df[df[col].isnull()]
        if not nan_rows.empty:
            print(f"Metric: {col}")
            print(f"Organizations with missing data:")
            for _, row in nan_rows.iterrows():
                if 'year' in df.columns:
                    print(f"  - {row['Organization']} ({row['year']})")
                else:
                    print(f"  - {row['Organization']}")
            print()
else:
    print("No NaN values found in the dataset.")

# Summary statistics
total_cells = df.shape[0] * df.shape[1]
nan_cells = df.isnull().sum().sum()
print(f"Total cells: {total_cells}")
print(f"NaN cells: {nan_cells}")
print(f"Missing data percentage: {(nan_cells/total_cells)*100:.2f}%")

In [None]:
# Load cleaned Excel with new structure
ensemble_perf = pd.read_excel('data/Performance/ensemble_performance_scores.xlsx')

# Confirm actual column names
print("Columns:", ensemble_perf.columns)

# Fix name consistency (Ørsted → Orsted)
ensemble_perf['Organization'] = ensemble_perf['Organization'].replace('Ørsted', 'Orsted')

# Rename median_score to Performance_Score (data is already in long format)
ensemble_perf = ensemble_perf.rename(columns={'median_score': 'Performance_Score'})

# Merge with your main DataFrame
df = df.merge(ensemble_perf[['Organization', 'year', 'Performance_Score']], on=['Organization', 'year'], how='left')

# Diagnostics
print("Performance scores added to df")
print(f"Shape: {df.shape}")
print(f"Performance scores available: {df['Performance_Score'].notna().sum()}/{len(df)}")

## Helper function

In [None]:
df

# Save df to Excel at: data/Greenwashing Results/df.xlsx
output_path = 'data/Greenwashing Results/df.xlsx'
os.makedirs(os.path.dirname(output_path), exist_ok=True)
df.to_excel(output_path, index=False)

In [None]:
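# Normalize scores per year (0-100 scale)
def normalize_by_year(df, column):
    """Min-max scale `column` to 0-100 within each year, overwriting it in place.

    If all values within a year are identical, that year's values are set to 50.
    Returns the name of the normalized column.
    """
    df[column] = df.groupby('year')[column].transform(
        lambda x: (x - x.min()) / (x.max() - x.min()) * 100 if x.max() != x.min() else 50
    )
    return column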
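
The behavior of the helper can be checked on toy data. The block below is a self-contained variant written for illustration: the function name `normalize_by_year_sketch` and the toy values are hypothetical, and the constant-year fallback is returned as an explicit Series rather than a scalar.

```python
import pandas as pd

def normalize_by_year_sketch(df, column):
    """Min-max scale `column` to 0-100 within each year (illustrative variant)."""
    def scale(x):
        if x.max() == x.min():
            # No spread within the year: assign the midpoint to every row.
            return pd.Series(50.0, index=x.index)
        return (x - x.min()) / (x.max() - x.min()) * 100
    df[column] = df.groupby("year")[column].transform(scale)
    return column

toy = pd.DataFrame({
    "year": [2021, 2021, 2021, 2022, 2022],
    "score": [10.0, 20.0, 30.0, 5.0, 5.0],
})
normalize_by_year_sketch(toy, "score")
print(toy)
```

The 2021 values become 0, 50, and 100, while the two identical 2022 values both become 50. Because the column is overwritten, the raw values are no longer available afterwards unless a copy was kept.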

## Additional metrics

In [None]:
# Normalize similarity values
normalize_by_year(df, 'TFIDF_Doc')
normalize_by_year(df, 'Jaccard')
normalize_by_year(df, 'SpaCy_Avg_Similarity')

# Create similarity combined score (average of three similarity metrics)
df['similarity_combined'] = (
    (df['TFIDF_Doc'] + 
     df['Jaccard'] + 
     df['SpaCy_Avg_Similarity']) / 3
).round(2)


# Additional metric: ratio of future-oriented statements to past + present statements
df['future_vs_past_present_ratio'] = (
    df['temporal_future_pct'] / 
    (df['temporal_past_pct'] + df['temporal_present_pct'])
).round(2)
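
If a report contains no past- or present-oriented statements, the denominator of this ratio is zero and pandas returns `inf`, which would then distort the per-year min-max normalization. Whether that case occurs in this dataset is not known; the sketch below, on hypothetical values, shows one way to guard against it by turning a zero denominator into a missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the second one has no past or present statements at all.
toy = pd.DataFrame({
    "temporal_future_pct": [30.0, 10.0],
    "temporal_past_pct": [20.0, 0.0],
    "temporal_present_pct": [40.0, 0.0],
})

# Replace a zero denominator with NaN so the ratio becomes missing instead of inf.
denominator = (toy["temporal_past_pct"] + toy["temporal_present_pct"]).replace(0, np.nan)
toy["future_vs_past_present_ratio"] = (toy["temporal_future_pct"] / denominator).round(2)

print(toy["future_vs_past_present_ratio"])
```

The first row gives 30 / 60 = 0.5 and the second is left missing, so it shows up in the missing-data check rather than silently dominating the scale.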

## Create df used for greenwashing score calculation

In [None]:
# Create a copy of df
greenwashing_df = df.copy()

# Drop specified columns
columns_to_drop = [
    'temporal_past_pct', 'temporal_present_pct', 'temporal_future_pct',
    'TFIDF_Doc', 'Jaccard', 'SpaCy_Avg_Similarity',
    'sentiment_confidence', 'opportunity_ratio', 'risk_ratio',
    'total_unclear_density', 'renewable_energy_density', 'climate_emissions_density'
]
greenwashing_df.drop(columns=columns_to_drop, inplace=True, errors='ignore')

# Normalize all numeric columns except 'year' (Performance_Score is normalized as well)
numeric_cols = greenwashing_df.select_dtypes(include='number').columns
numeric_cols = [col for col in numeric_cols if col != 'year']

for col in numeric_cols:
    normalize_by_year(greenwashing_df, col)


## Calculation of Greenwashing Components

In [None]:
# Combined sentiment score

# Calculate combined sentiment score
greenwashing_df['combined_sentiment_score'] = (
    0.6 * greenwashing_df['avg_sentiment_score'] +     
    0.2 * greenwashing_df['renewable_energy_avg_sentiment'] +
    0.2 * greenwashing_df['climate_emissions_avg_sentiment']
).round(2)


In [None]:
# Combined green term score

# Calculate combined green term score
greenwashing_df['combined_green_score'] = (
    0.7 * greenwashing_df['gt_freq_pct'] +     
    0.3 * greenwashing_df['unique_gt_relative']
).round(2)

In [None]:
# Calculate communication score
greenwashing_df['Green_Com_Score'] = (
    0.4 * greenwashing_df['combined_green_score'] +
    0.6 * greenwashing_df['combined_sentiment_score']
).round(2)

# Normalize Green_Com_Score per year to 0–100 scale
normalize_by_year(greenwashing_df, 'Green_Com_Score')

print("Communication score created")
print(f"Mean score: {greenwashing_df['Green_Com_Score'].mean():.2f}")
print(f"Score range: {greenwashing_df['Green_Com_Score'].min():.2f} - {greenwashing_df['Green_Com_Score'].max():.2f}")

In [None]:
# Calculate absolute gap between performance and communication
greenwashing_df['Greenwashing_Risk_Abs'] = (
    greenwashing_df['Green_Com_Score'] - greenwashing_df['Performance_Score']
).round(2)

# Calculate yearly medians for classic greenwashing pattern
yearly_medians = greenwashing_df.groupby('year').agg({
    'Performance_Score': 'median',
    'Green_Com_Score': 'median'
})

# Identify classic greenwashing pattern
greenwashing_df['Classic_Greenwashing'] = greenwashing_df.apply(
    lambda row: (row['Green_Com_Score'] > yearly_medians.loc[row['year'], 'Green_Com_Score']) and 
                (row['Performance_Score'] < yearly_medians.loc[row['year'], 'Performance_Score']), 
    axis=1
)

# Apply 1.5x amplifier for classic greenwashing
greenwashing_df['Amplified_Score'] = greenwashing_df.apply(
    lambda row: row['Greenwashing_Risk_Abs'] * 1.5 if row['Classic_Greenwashing'] else row['Greenwashing_Risk_Abs'],
    axis=1
).round(2)

# Normalize to 0–100 scale within each year
normalize_by_year(greenwashing_df, 'Amplified_Score')
greenwashing_df['Greenwashing_Score'] = greenwashing_df['Amplified_Score'].round(2)

# Count the classic greenwashing pattern before dropping the flag column
classic_count = int(greenwashing_df['Classic_Greenwashing'].sum())

# Clean up intermediate columns ('Amplified_Score' is kept)
greenwashing_df = greenwashing_df.drop(['Classic_Greenwashing'], axis=1)

print("Greenwashing score calculated")
print(f"Mean score: {greenwashing_df['Greenwashing_Score'].mean():.2f}")
print(f"Companies with classic greenwashing pattern: {classic_count}")

# 2021 highest scores
print(f"\n2021 HIGHEST GREENWASHING SCORES:")
highest_2021 = greenwashing_df[greenwashing_df['year'] == 2021].nlargest(10, 'Greenwashing_Score')[['Organization', 'Greenwashing_Score']]
print(highest_2021.to_string(index=False))

# 2022 highest scores  
print(f"\n2022 HIGHEST GREENWASHING SCORES:")
highest_2022 = greenwashing_df[greenwashing_df['year'] == 2022].nlargest(10, 'Greenwashing_Score')[['Organization', 'Greenwashing_Score']]
print(highest_2022.to_string(index=False))

# Average scores per company across both years
print(f"\nHIGHEST AVERAGE GREENWASHING SCORES (2021-2022):")
company_averages = greenwashing_df.groupby('Organization')['Greenwashing_Score'].agg(['mean', 'count']).reset_index()
company_averages.columns = ['Organization', 'Avg_Greenwashing_Score', 'Years_Count']
company_averages = company_averages[company_averages['Years_Count'] == 2]  # Only companies with data for both years
highest_avg = company_averages.nlargest(10, 'Avg_Greenwashing_Score')[['Organization', 'Avg_Greenwashing_Score']]
highest_avg['Avg_Greenwashing_Score'] = highest_avg['Avg_Greenwashing_Score'].round(2)
print(highest_avg.to_string(index=False))

print(f"\nCompanies with data for both years: {len(company_averages)}")
print(f"Average greenwashing score across all companies (both years): {company_averages['Avg_Greenwashing_Score'].mean():.2f}")

In [None]:
# Save performance communication gap data to Excel
import os

# Create directory if it doesn't exist
os.makedirs('data/Greenwashing Results', exist_ok=True)

# Save the dataframe to Excel
greenwashing_df.to_excel('data/Greenwashing Results/performance_communication_gap.xlsx', 
                        index=False)

print("Performance communication gap data saved to: data/Greenwashing Results/performance_communication_gap.xlsx")
print(f"Saved {len(greenwashing_df)} rows and {len(greenwashing_df.columns)} columns")

In [None]:
# Greenwashing Score Quadrant Visualization
# Communication Score vs Performance Score Analysis

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Calculate year-specific medians for thresholds (matches the per-year median lines in the plots)
quadrant_medians = greenwashing_df.groupby('year')[['Green_Com_Score', 'Performance_Score']].median()

# Create quadrant classification
def classify_quadrant(row):
    comm_high = row['Green_Com_Score'] >= quadrant_medians.loc[row['year'], 'Green_Com_Score']
    perf_high = row['Performance_Score'] >= quadrant_medians.loc[row['year'], 'Performance_Score']
    
    if comm_high and not perf_high:
        return 'Potential_Greenwashing'
    elif comm_high and perf_high:
        return 'Green_Leaders'
    elif not comm_high and not perf_high:
        return 'Laggards'
    else:
        return 'Under_Communicators'

greenwashing_df['Quadrant'] = greenwashing_df.apply(classify_quadrant, axis=1)

# Define colors for quadrants
colors = {'Potential_Greenwashing': 'red', 'Green_Leaders': 'green', 
          'Laggards': 'gray', 'Under_Communicators': 'blue'}

# Filter data by year
df_2021 = greenwashing_df[greenwashing_df['year'] == 2021]
df_2022 = greenwashing_df[greenwashing_df['year'] == 2022]

# Create the visualization function
def create_quadrant_plot(data, year, ax):
    """Create a quadrant plot for the given year"""
    
    # Calculate year-specific medians
    year_comm_median = data['Green_Com_Score'].median()
    year_perf_median = data['Performance_Score'].median()
    
    # Plot each quadrant
    for quadrant in data['Quadrant'].unique():
        if pd.isna(quadrant):
            continue
        subset = data[data['Quadrant'] == quadrant]
        ax.scatter(subset['Performance_Score'], subset['Green_Com_Score'], 
                  c=colors[quadrant], label=quadrant.replace('_', ' '), 
                  alpha=0.7, s=100, edgecolors='black', linewidth=0.5)
    
    # Add company name annotations
    for _, row in data.iterrows():
        if pd.notna(row['Performance_Score']) and pd.notna(row['Green_Com_Score']):
            ax.annotate(row['Organization'], 
                       (row['Performance_Score'], row['Green_Com_Score']),
                       xytext=(5, 5), textcoords='offset points', 
                       fontsize=10, alpha=0.9, fontweight='bold')
    
    # Add median lines
    ax.axvline(year_perf_median, color='gray', linestyle='--', alpha=0.5, linewidth=2)
    ax.axhline(year_comm_median, color='gray', linestyle='--', alpha=0.5, linewidth=2)
    
    # Add quadrant labels
    perf_range = ax.get_xlim()
    comm_range = ax.get_ylim()
    
    # Potential Greenwashing (Top Left)
    ax.text(perf_range[0] + (year_perf_median - perf_range[0])/2, 
            year_comm_median + (comm_range[1] - year_comm_median)/2,
            'Potential\nGreenwashing', ha='center', va='center', fontsize=11, 
            bbox=dict(boxstyle="round,pad=0.3", facecolor='lightcoral', alpha=0.7))
    
    # Green Leaders (Top Right)
    ax.text(year_perf_median + (perf_range[1] - year_perf_median)/2, 
            year_comm_median + (comm_range[1] - year_comm_median)/2,
            'Green\nLeaders', ha='center', va='center', fontsize=11, 
            bbox=dict(boxstyle="round,pad=0.3", facecolor='lightgreen', alpha=0.7))
    
    # Laggards (Bottom Left)
    ax.text(perf_range[0] + (year_perf_median - perf_range[0])/2, 
            comm_range[0] + (year_comm_median - comm_range[0])/2,
            'Laggards', ha='center', va='center', fontsize=11, 
            bbox=dict(boxstyle="round,pad=0.3", facecolor='lightgray', alpha=0.7))
    
    # Under Communicators (Bottom Right)
    ax.text(year_perf_median + (perf_range[1] - year_perf_median)/2, 
            comm_range[0] + (year_comm_median - comm_range[0])/2,
            'Under\nCommunicators', ha='center', va='center', fontsize=11, 
            bbox=dict(boxstyle="round,pad=0.3", facecolor='lightblue', alpha=0.7))
    
    # Customize axes
    ax.set_xlabel('Performance Score', fontsize=14)
    ax.set_ylabel('Communication Score', fontsize=14)
    ax.set_title(f'Greenwashing Detection {year}', fontsize=16, fontweight='bold')
    ax.legend(loc='upper left', bbox_to_anchor=(0, 1))
    ax.grid(True, alpha=0.3)

# Create figure with subplots
fig, axes = plt.subplots(1, 2, figsize=(24, 10))

# FIRST GRAPH: 2021
if len(df_2021) > 0:
    create_quadrant_plot(df_2021, 2021, axes[0])
else:
    axes[0].text(0.5, 0.5, 'No data available for 2021', 
                ha='center', va='center', transform=axes[0].transAxes, fontsize=14)
    axes[0].set_title('Greenwashing Detection 2021', fontsize=16, fontweight='bold')

# SECOND GRAPH: 2022
if len(df_2022) > 0:
    create_quadrant_plot(df_2022, 2022, axes[1])
else:
    axes[1].text(0.5, 0.5, 'No data available for 2022', 
                ha='center', va='center', transform=axes[1].transAxes, fontsize=14)
    axes[1].set_title('Greenwashing Detection 2022', fontsize=16, fontweight='bold')

plt.tight_layout()
plt.show()

# Print detailed quadrant analysis
print("GREENWASHING QUADRANT ANALYSIS")
print("=" * 60)

for year in [2021, 2022]:
    year_data = greenwashing_df[greenwashing_df['year'] == year]
    if len(year_data) > 0:
        print(f"\n{year} QUADRANT DISTRIBUTION:")
        print("-" * 40)
        
        quadrant_counts = year_data['Quadrant'].value_counts()
        for quadrant, count in quadrant_counts.items():
            if pd.notna(quadrant):
                print(f"{quadrant.replace('_', ' ')}: {count} companies")
        
        print(f"\nMedian Communication Score: {year_data['Green_Com_Score'].median():.2f}")
        print(f"Median Performance Score: {year_data['Performance_Score'].median():.2f}")
        
        # Show companies in each quadrant with their scores
        print(f"\nDETAILED BREAKDOWN ({year}):")
        for quadrant in ['Potential_Greenwashing', 'Green_Leaders', 'Laggards', 'Under_Communicators']:
            quad_data = year_data[year_data['Quadrant'] == quadrant]
            if len(quad_data) > 0:
                print(f"\n{quadrant.replace('_', ' ')} ({len(quad_data)} companies):")
                for _, row in quad_data.iterrows():
                    print(f"  {row['Organization']}: Com={row['Green_Com_Score']:.1f}, Perf={row['Performance_Score']:.1f}, Risk={row['Greenwashing_Score']:.1f}")

# Analysis of companies that changed quadrants
print(f"\n" + "=" * 60)
print("QUADRANT MOVEMENT ANALYSIS (2021 → 2022)")
print("=" * 60)

companies_both_years = set(df_2021['Organization'].unique()) & set(df_2022['Organization'].unique())

movements = []
for company in companies_both_years:
    quad_2021 = df_2021[df_2021['Organization'] == company]['Quadrant'].iloc[0]
    quad_2022 = df_2022[df_2022['Organization'] == company]['Quadrant'].iloc[0]
    
    if quad_2021 != quad_2022:
        movements.append({
            'Company': company,
            'From': quad_2021,
            'To': quad_2022
        })

if movements:
    print(f"\nCompanies that changed quadrants: {len(movements)}")
    for movement in movements:
        print(f"{movement['Company']}: {movement['From'].replace('_', ' ')} → {movement['To'].replace('_', ' ')}")
else:
    print("\nNo companies changed quadrants between 2021 and 2022")

# Summary statistics
print(f"\n" + "=" * 60)
print("SUMMARY STATISTICS")
print("=" * 60)

print(f"Overall median Communication Score: {greenwashing_df['Green_Com_Score'].median():.2f}")
print(f"Overall median Performance Score: {greenwashing_df['Performance_Score'].median():.2f}")
print(f"Overall median Greenwashing Score: {greenwashing_df['Greenwashing_Score'].median():.2f}")

print(f"\nCorrelation between Communication and Performance: {greenwashing_df['Green_Com_Score'].corr(greenwashing_df['Performance_Score']):.3f}")
print(f"Correlation between Communication and Greenwashing Risk: {greenwashing_df['Green_Com_Score'].corr(greenwashing_df['Greenwashing_Score']):.3f}")
print(f"Correlation between Performance and Greenwashing Risk: {greenwashing_df['Performance_Score'].corr(greenwashing_df['Greenwashing_Score']):.3f}")

# Print companies with highest risk scores in each quadrant
print(f"\n" + "=" * 60)
print("HIGHEST RISK COMPANIES BY QUADRANT")
print("=" * 60)

for quadrant in ['Potential_Greenwashing', 'Green_Leaders', 'Laggards', 'Under_Communicators']:
    quad_data = greenwashing_df[greenwashing_df['Quadrant'] == quadrant]
    if len(quad_data) > 0:
        top_risk = quad_data.nlargest(3, 'Greenwashing_Score')
        print(f"\n{quadrant.replace('_', ' ')} - Top Risk:")
        for _, row in top_risk.iterrows():
            print(f"  {row['Organization']} ({row['year']}): Risk={row['Greenwashing_Score']:.1f}")
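
The quadrant assignment can also be written without a row-wise `apply`. The following is a minimal sketch on hypothetical data, using year-specific medians and `numpy.select`; the toy frame and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for four companies in one year.
toy = pd.DataFrame({
    "year": [2021] * 4,
    "Green_Com_Score": [90.0, 80.0, 30.0, 20.0],
    "Performance_Score": [20.0, 70.0, 60.0, 10.0],
})

# Compare each row with the median of its own year.
com_high = toy["Green_Com_Score"] >= toy.groupby("year")["Green_Com_Score"].transform("median")
perf_high = toy["Performance_Score"] >= toy.groupby("year")["Performance_Score"].transform("median")

# Conditions are evaluated in order; rows matching none fall through to the default.
toy["Quadrant"] = np.select(
    [com_high & ~perf_high, com_high & perf_high, ~com_high & ~perf_high],
    ["Potential_Greenwashing", "Green_Leaders", "Laggards"],
    default="Under_Communicators",
)
print(toy)
```

With medians of 55 and 40, the four rows land in Potential Greenwashing, Green Leaders, Under Communicators, and Laggards respectively.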

### Additional components for greenwashing

In [None]:
# ==========================================
# COMPONENT 1: SUBSTANTIATION WEAKNESS SCORE
# Quantification (30%) + Evidence (35%) + Aspirational (35%)
# Higher scores = higher greenwashing risk
# ==========================================

greenwashing_df['Substantiation_Weakness'] = (
    0.30 * (100 - greenwashing_df['quantification_intensity_score']) +     # Reversed: higher quantification = lower risk
    0.35 * (100 - greenwashing_df['evidence_intensity_score']) +          # Reversed: higher evidence = lower risk  
    0.35 * greenwashing_df['aspirational_intensity_score']                # Normal: higher aspirational = higher risk
).round(2)



In [None]:
# COMPONENT 2: LANGUAGE VAGUENESS SCORE
# Vague (70%) + Hedge (30%)
# Higher scores = higher greenwashing risk
# ==========================================

greenwashing_df['Language_Vagueness'] = (
    0.70 * greenwashing_df['vague_intensity_score'] +                     # Normal: higher vagueness = higher risk
    0.30 * (100 - greenwashing_df['hedge_intensity_score'])               # Reversed: higher hedging = lower risk (per UCLA study)
).round(2)



In [None]:
# COMPONENT 3: TEMPORAL ORIENTATION SCORE  
# Future orientation (60%) + Timeline specificity (40%)
# Higher scores = higher greenwashing risk
# ==========================================

greenwashing_df['Temporal_Orientation'] = (
    0.60 * greenwashing_df['future_vs_past_present_ratio'] +             # Normal: higher future focus = higher risk
    0.40 * (100 - greenwashing_df['commitment_timeline_pct'])            # Reversed: more specific timelines = lower risk
).round(2)



In [None]:
# COMPONENT 4: REPORTING CONSISTENCY SCORE
# Overall similarity (70%) + High similarity ratio (30%)  
# Higher scores = higher greenwashing risk
# ==========================================

greenwashing_df['Reporting_Consistency'] = (
    0.70 * greenwashing_df['similarity_combined'] +                      # Normal: higher similarity = higher risk
    0.30 * greenwashing_df['SpaCy_HighSim_Ratio']                       # Normal: more identical sentences = higher risk
).round(2)
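
All four components follow the same pattern: a weighted sum in which some inputs are reversed (`100 - value`) so that higher always means higher risk. A generic helper along these lines could make that pattern explicit; the function `weighted_risk_component` and the toy values are hypothetical, and inputs are assumed to be on a 0-100 scale.

```python
import pandas as pd

def weighted_risk_component(frame, spec):
    """Weighted sum of 0-100 columns; `spec` maps column -> (weight, reversed_flag)."""
    total_weight = sum(weight for weight, _ in spec.values())
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError(f"Weights must sum to 1, got {total_weight}")
    result = pd.Series(0.0, index=frame.index)
    for column, (weight, is_reversed) in spec.items():
        # Reversed inputs lower the risk as they grow, so flip them first.
        values = 100 - frame[column] if is_reversed else frame[column]
        result = result + weight * values
    return result.round(2)

# Hypothetical normalized inputs for two company-years.
toy = pd.DataFrame({
    "quantification_intensity_score": [80.0, 20.0],
    "evidence_intensity_score": [60.0, 10.0],
    "aspirational_intensity_score": [30.0, 90.0],
})

toy["Substantiation_Weakness"] = weighted_risk_component(toy, {
    "quantification_intensity_score": (0.30, True),
    "evidence_intensity_score": (0.35, True),
    "aspirational_intensity_score": (0.35, False),
})
print(toy["Substantiation_Weakness"])
```

The first row works out to 0.30 × 20 + 0.35 × 40 + 0.35 × 30 = 30.5 and the second to 0.30 × 80 + 0.35 × 90 + 0.35 × 90 = 87.0, so the company with little quantification and evidence but strongly aspirational language scores far higher.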



In [None]:
# STANDARDIZE EACH COMPONENT BY YEAR
# ==========================================

normalize_by_year(greenwashing_df, 'Substantiation_Weakness')
normalize_by_year(greenwashing_df, 'Language_Vagueness')
normalize_by_year(greenwashing_df, 'Temporal_Orientation')
normalize_by_year(greenwashing_df, 'Reporting_Consistency')


# ==========================================
# SUMMARY STATISTICS BY YEAR
# ==========================================

print("GREENWASHING COMPONENTS SUMMARY BY YEAR")
print("="*60)

# Print the per-year mean and the five highest-risk companies for each component
components = {
    'SUBSTANTIATION WEAKNESS': 'Substantiation_Weakness',
    'LANGUAGE VAGUENESS': 'Language_Vagueness',
    'TEMPORAL ORIENTATION': 'Temporal_Orientation',
    'REPORTING CONSISTENCY': 'Reporting_Consistency',
}

for title, col in components.items():
    print(f"\n{title}")
    for year in [2021, 2022]:
        year_data = greenwashing_df[greenwashing_df['year'] == year]
        print(f"\n{year}:")
        print(f"  Mean Score: {year_data[col].mean():.2f}")
        print("  Highest Risk Companies:")
        top = year_data.nlargest(5, col)[['Organization', col]]
        for i, (_, row) in enumerate(top.iterrows(), 1):
            print(f"    {i}. {row['Organization']}: {row[col]:.2f}")

print(f"\n" + "="*60)
print("Component calculations and standardization complete!")
print("Higher scores indicate higher greenwashing risk")

In [None]:
greenwashing_df

### Export and save

In [None]:
# Cell: Create Comprehensive Greenwashing Results Export

print("CREATING COMPREHENSIVE GREENWASHING RESULTS")
print("=" * 60)

# ==========================================
# STEP 1: Create comprehensive results DataFrame with all variables
# ==========================================

print("Preparing comprehensive greenwashing dataset...")

# Create comprehensive dataset with all variables and clear naming
comprehensive_greenwashing = pd.DataFrame({
    # ============= IDENTIFICATION =============
    'Company': greenwashing_df['Organization'],
    'Year': greenwashing_df['year'],
    
    # ============= MAIN SCORES =============
    'Performance_Communication_Gap_Score': greenwashing_df['Greenwashing_Score'],
    'Performance_Score': greenwashing_df['Performance_Score'], 
    'Green_Communication_Score': greenwashing_df['Green_Com_Score'],
    'Performance_Communication_Absolute_Gap': greenwashing_df['Greenwashing_Risk_Abs'],
    # 'Performance_Communication_Gap_Amplified': greenwashing_df['Amplified_Score'],
    
    # ============= COMPONENT SCORES (0-100 each) =============
    'Component_Substantiation_Weakness_Score': greenwashing_df['Substantiation_Weakness'],
    'Component_Language_Vagueness_Score': greenwashing_df['Language_Vagueness'], 
    'Component_Temporal_Orientation_Score': greenwashing_df['Temporal_Orientation'],
    'Component_Reporting_Consistency_Score': greenwashing_df['Reporting_Consistency'],
    
    # ============= GREEN TERM ANALYSIS =============
    'Green_Terms_Frequency_Pct': df['gt_freq_pct'],
    'Green_Terms_Unique_Relative': df['unique_gt_relative'],
    'Green_Terms_Combined_Score': greenwashing_df['combined_green_score'],
    
    # ============= SENTIMENT ANALYSIS =============
    'Overall_Sentiment_Score': df['avg_sentiment_score'],
    'Renewable_Energy_Sentiment': df['renewable_energy_avg_sentiment'],
    'Climate_Emissions_Sentiment': df['climate_emissions_avg_sentiment'],
    'Combined_Sentiment_Score': greenwashing_df['combined_sentiment_score'],
    
    # ============= CONTEXT & SUBSTANTIATION =============
    # Quantification, Evidence, Aspirational (used in Substantiation Weakness)
    'Quantification_Intensity_Score': df['quantification_intensity_score'],
    'Evidence_Intensity_Score': df['evidence_intensity_score'], 
    'Aspirational_Intensity_Score': df['aspirational_intensity_score'],
    
    # ============= LANGUAGE VAGUENESS METRICS =============
    # Hedging, Vague language (used in Language Vagueness)
    'Hedge_Intensity_Score': df['hedge_intensity_score'],
    'Vague_Intensity_Score': df['vague_intensity_score'],
    'Combined_Unclear_Intensity_Score': greenwashing_df['combined_intensity_score'],
    
    # ============= TEMPORAL ORIENTATION METRICS =============
    # Future vs past/present focus and timeline specificity (used in Temporal Orientation)
    'Future_vs_Past_Present_Ratio': df['future_vs_past_present_ratio'],
    'Commitment_Timeline_Pct': df['commitment_timeline_pct'],
    
    # ============= SIMILARITY/CONSISTENCY METRICS =============
    # Document similarity (used in Reporting Consistency)
    'TFIDF_Document_Similarity': df['TFIDF_Doc'],
    'Jaccard_Similarity': df['Jaccard'],
    'SpaCy_Average_Similarity': df['SpaCy_Avg_Similarity'],
    'SpaCy_High_Similarity_Ratio': df['SpaCy_HighSim_Ratio'],
    'Similarity_Combined_Score': greenwashing_df['similarity_combined'],
})

# Round all numeric columns
numeric_columns = comprehensive_greenwashing.select_dtypes(include=[np.number]).columns
comprehensive_greenwashing[numeric_columns] = comprehensive_greenwashing[numeric_columns].round(3)

# Sort by Company and Year
comprehensive_greenwashing = comprehensive_greenwashing.sort_values(['Company', 'Year']).reset_index(drop=True)

print(f"✓ Comprehensive dataset created: {comprehensive_greenwashing.shape}")

# ==========================================
# STEP 2: Create component breakdown analysis
# ==========================================

print("Creating component breakdown analysis...")

# Component weights used in calculations (for documentation)
component_weights = pd.DataFrame({
    'Component': [
        'Substantiation Weakness - Quantification',
        'Substantiation Weakness - Evidence', 
        'Substantiation Weakness - Aspirational',
        'Language Vagueness - Vague Language',
        'Language Vagueness - Hedge Language',
        'Temporal Orientation - Future vs Past/Present',
        'Temporal Orientation - Timeline Specificity',
        'Reporting Consistency - Overall Similarity',
        'Reporting Consistency - High Similarity Ratio',
        'Green Communication - Green Terms',
        'Green Communication - Sentiment'
    ],
    'Weight': [0.30, 0.35, 0.35, 0.70, 0.30, 0.60, 0.40, 0.70, 0.30, 0.40, 0.60],
    'Direction': [
        'Reversed (higher quantification = lower risk)',
        'Reversed (higher evidence = lower risk)',
        'Normal (higher aspirational = higher risk)',
        'Normal (higher vagueness = higher risk)',
        'Reversed (higher hedging = lower risk)',
        'Normal (higher future focus = higher risk)',
        'Reversed (more specific timelines = lower risk)',
        'Normal (higher similarity = higher risk)',
        'Normal (more identical sentences = higher risk)',
        'Normal (higher green terms = higher communication)',
        'Normal (higher sentiment = higher communication)'
    ]
})

# ==========================================
# STEP 3: Create summary statistics by year
# ==========================================

print("Calculating summary statistics by year...")

summary_stats_2021 = comprehensive_greenwashing[comprehensive_greenwashing['Year'] == 2021].describe()
summary_stats_2022 = comprehensive_greenwashing[comprehensive_greenwashing['Year'] == 2022].describe()

# Component score means by year
component_columns = {
    'Greenwashing Risk Score': 'Performance_Communication_Gap_Score',
    'Green Communication Score': 'Green_Communication_Score',
    'Substantiation Weakness': 'Component_Substantiation_Weakness_Score',
    'Language Vagueness': 'Component_Language_Vagueness_Score',
    'Temporal Orientation': 'Component_Temporal_Orientation_Score',
    'Reporting Consistency': 'Component_Reporting_Consistency_Score'
}
# One groupby replaces the repeated per-year filters; reindex keeps NaN for a missing year
yearly_means = (
    comprehensive_greenwashing
    .groupby('Year')[list(component_columns.values())]
    .mean()
    .reindex([2021, 2022])
)
component_summary = pd.DataFrame({
    'Component': list(component_columns.keys()),
    'Mean_2021': yearly_means.loc[2021].to_numpy(),
    'Mean_2022': yearly_means.loc[2022].to_numpy()
}).round(2)

# ==========================================
# STEP 4: Export to Excel with multiple tabs
# ==========================================

print("\nExporting comprehensive greenwashing results to Excel...")
output_path = "data/Greenwashing Results/comprehensive_greenwashing_results.xlsx"

try:
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        # Main comprehensive dataset
        comprehensive_greenwashing.to_excel(writer, sheet_name='Comprehensive_Results', index=False)
        
        # Separate years for easier analysis
        comprehensive_greenwashing[comprehensive_greenwashing['Year'] == 2021].to_excel(
            writer, sheet_name='Results_2021', index=False)
        comprehensive_greenwashing[comprehensive_greenwashing['Year'] == 2022].to_excel(
            writer, sheet_name='Results_2022', index=False)
        
        # Component weights and methodology
        component_weights.to_excel(writer, sheet_name='Component_Methodology', index=False)
        
        # Component summary statistics
        component_summary.to_excel(writer, sheet_name='Component_Summary', index=False)
        
        # Detailed statistics by year
        summary_stats_2021.to_excel(writer, sheet_name='Stats_2021')
        summary_stats_2022.to_excel(writer, sheet_name='Stats_2022')
        
        # Data dictionary
        data_dict = pd.DataFrame({
            'Column_Name': comprehensive_greenwashing.columns,
            'Category': [
                'Identification' if col in ['Company', 'Year'] else
                'Main Scores' if any(x in col for x in ['Performance_Communication_Gap_Score', 'Performance_Score', 'Green_Communication_Score', 'Absolute_Gap', 'Amplified']) else
                'Component Scores' if 'Component_' in col else
                'Green Terms Analysis' if any(x in col for x in ['Green_Terms', 'combined_green_score']) else
                'Sentiment Analysis' if any(x in col for x in ['Sentiment', 'sentiment']) else
                'Substantiation Metrics' if any(x in col for x in ['Quantification', 'Evidence', 'Aspirational']) else
                'Language Vagueness Metrics' if any(x in col for x in ['Hedge', 'Vague', 'Unclear']) else
                'Temporal Metrics' if any(x in col for x in ['Future_vs_Past', 'Timeline']) else
                'Similarity Metrics' if any(x in col for x in ['Similarity', 'TFIDF', 'Jaccard', 'SpaCy']) else
                'Other' for col in comprehensive_greenwashing.columns
            ],
            'Description': [
                'Company identifier' if col == 'Company' else
                'Year (2021 or 2022)' if col == 'Year' else
                '(Amplified) Performance-Communication Gap score (0-100, higher = more risk)' if col == 'Performance_Communication_Gap_Score' else
                'Environmental performance score from ensemble analysis' if col == 'Performance_Score' else
                'Green communication intensity score (0-100)' if col == 'Green_Communication_Score' else
                'Absolute gap between communication and performance' if col == 'Performance_Communication_Absolute_Gap' else
                # 'Amplified risk score with classic greenwashing penalty' if col == 'Greenwashing_Risk_Amplified' else
                'Component score: Weakness of substantiation (0-100, higher = more risk)' if col == 'Component_Substantiation_Weakness_Score' else
                'Component score: Language Vagueness (0-100, higher = more risk)' if col == 'Component_Language_Vagueness_Score' else
                'Component score: Temporal orientation (0-100, higher = more risk)' if col == 'Component_Temporal_Orientation_Score' else
                'Component score: Reporting consistency (0-100, higher = more risk)' if col == 'Component_Reporting_Consistency_Score' else
                'Frequency of green terms as percentage of total words' if col == 'Green_Terms_Frequency_Pct' else
                'Unique green terms relative to document length' if col == 'Green_Terms_Unique_Relative' else
                'Combined score from green term frequency and uniqueness' if col == 'Green_Terms_Combined_Score' else
                'Overall sentiment score across all text' if col == 'Overall_Sentiment_Score' else
                'Average sentiment in renewable energy discussions' if col == 'Renewable_Energy_Sentiment' else
                'Average sentiment in climate/emissions discussions' if col == 'Climate_Emissions_Sentiment' else
                'Weighted combination of all sentiment scores' if col == 'Combined_Sentiment_Score' else
                'Intensity of quantitative claims and metrics' if col == 'Quantification_Intensity_Score' else
                'Intensity of evidence-based statements' if col == 'Evidence_Intensity_Score' else
                'Intensity of aspirational/future-oriented language' if col == 'Aspirational_Intensity_Score' else
                'Intensity of hedging language (uncertainty markers)' if col == 'Hedge_Intensity_Score' else
                'Intensity of vague, non-specific language' if col == 'Vague_Intensity_Score' else
                'Combined score of unclear language patterns' if col == 'Combined_Unclear_Intensity_Score' else
                'Percentage of commitments with specific timelines' if col == 'Commitment_Timeline_Pct' else
                'Ratio of future-focused vs past/present language' if col == 'Future_vs_Past_Present_Ratio' else
                'TF-IDF based document similarity score' if col == 'TFIDF_Document_Similarity' else
                'Jaccard similarity coefficient between documents' if col == 'Jaccard_Similarity' else
                'SpaCy semantic similarity average' if col == 'SpaCy_Average_Similarity' else
                'Ratio of highly similar sentences (SpaCy >0.8)' if col == 'SpaCy_High_Similarity_Ratio' else
                'Combined similarity score from multiple methods' if col == 'Similarity_Combined_Score' else
                f'Variable: {col}' for col in comprehensive_greenwashing.columns
            ],
            'Scale': [
                'Text' if col in ['Company'] else
                'Integer (2021, 2022)' if col == 'Year' else
                '0-100 (normalized by year)' if 'Score' in col or 'Component_' in col else
                'Percentage (0-100)' if 'Pct' in col else
                'Ratio (0+)' if 'Ratio' in col else
                'Normalized (0-100)' if any(x in col for x in ['Intensity', 'Similarity', 'Combined']) else
                'Score (-100 to +100)' if 'Sentiment' in col else
                'Continuous' for col in comprehensive_greenwashing.columns
            ]
        })
        
        data_dict.to_excel(writer, sheet_name='Data_Dictionary', index=False)
    
    print(f"✓ Comprehensive results exported successfully to: {output_path}")
    print(f"  - Comprehensive_Results: Complete dataset ({comprehensive_greenwashing.shape[0]} rows, {comprehensive_greenwashing.shape[1]} columns)")
    print(f"  - Results_2021: 2021 data only")
    print(f"  - Results_2022: 2022 data only") 
    print(f"  - Component_Methodology: Weights and calculation methods")
    print(f"  - Component_Summary: Component score means by year")
    print(f"  - Stats_2021/2022: Detailed descriptive statistics")
    print(f"  - Data_Dictionary: Complete variable descriptions")
    
except Exception as e:
    print(f"Error exporting comprehensive results: {e}")
    print("Comprehensive dataset created successfully in memory as 'comprehensive_greenwashing'")

# ==========================================
# STEP 5: Display key summary information
# ==========================================

print(f"\n" + "=" * 60)
print("COMPREHENSIVE GREENWASHING DATASET SUMMARY")
print("=" * 60)

print(f"\n📊 DATASET OVERVIEW:")
print(f"  Total records: {len(comprehensive_greenwashing):,}")
print(f"  Companies: {comprehensive_greenwashing['Company'].nunique()}")
print(f"  Years: {sorted(comprehensive_greenwashing['Year'].unique())}")
print(f"  Variables: {comprehensive_greenwashing.shape[1]}")

print(f"\n📈 VARIABLE CATEGORIES:")
main_scores = [col for col in comprehensive_greenwashing.columns if any(x in col for x in ['Performance_Communication_Gap_Score', 'Performance_Score', 'Green_Communication_Score', 'Absolute_Gap', 'Amplified'])]
components = [col for col in comprehensive_greenwashing.columns if 'Component_' in col]
green_terms = [col for col in comprehensive_greenwashing.columns if 'Green_Terms' in col or 'combined_green_score' in col]
sentiment = [col for col in comprehensive_greenwashing.columns if 'Sentiment' in col or 'sentiment' in col]
substantiation = [col for col in comprehensive_greenwashing.columns if any(x in col for x in ['Quantification', 'Evidence', 'Aspirational'])]
language = [col for col in comprehensive_greenwashing.columns if any(x in col for x in ['Hedge', 'Vague', 'Unclear'])]
temporal = [col for col in comprehensive_greenwashing.columns if any(x in col for x in ['Future_vs_Past', 'Timeline'])]
similarity = [col for col in comprehensive_greenwashing.columns if any(x in col for x in ['Similarity', 'TFIDF', 'Jaccard', 'SpaCy'])]

print(f"  Main Scores: {len(main_scores)} variables")
print(f"  Component Scores: {len(components)} variables")
print(f"  Green Terms Analysis: {len(green_terms)} variables")
print(f"  Sentiment Analysis: {len(sentiment)} variables")
print(f"  Substantiation Metrics: {len(substantiation)} variables")
print(f"  Language Metrics: {len(language)} variables")
print(f"  Temporal Metrics: {len(temporal)} variables")
print(f"  Similarity/Consistency: {len(similarity)} variables")

print(f"\n🎯 KEY STATISTICS:")
print(f"  Mean Greenwashing Risk (2021): {comprehensive_greenwashing[comprehensive_greenwashing['Year'] == 2021]['Performance_Communication_Gap_Score'].mean():.2f}")
print(f"  Mean Greenwashing Risk (2022): {comprehensive_greenwashing[comprehensive_greenwashing['Year'] == 2022]['Performance_Communication_Gap_Score'].mean():.2f}")
print(f"  Companies with both years: {len(comprehensive_greenwashing.groupby('Company').filter(lambda x: len(x) == 2)['Company'].unique())}")

print(f"\n💾 FILES CREATED:")
print(f"  Comprehensive dataset: {output_path}")
print(f"  (8 sheets with complete data, methodology, and documentation)")

print(f"\n🔬 READY FOR ENSEMBLE ANALYSIS:")
print(f"  All component scores calculated and normalized")
print(f"  All underlying variables preserved")
print(f"  Complete documentation provided")

print(f"\nVariable 'comprehensive_greenwashing' is available in memory for immediate use")
print("Proceed to ensemble analysis with confidence that all data is preserved!")
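
The data dictionary in the cell above assigns categories with chained conditional expressions, where the first matching substring wins. A minimal sketch of the same first-match logic as an ordered rule table follows; `CATEGORY_RULES` and `categorize` are hypothetical names, and the rule list is an abbreviated subset of the notebook's categories.

```python
# Ordered (category, substrings) rules; the first matching rule wins.
# Abbreviated, illustrative subset of the categories used in the notebook.
CATEGORY_RULES = [
    ("Main Scores", ["Performance_Communication_Gap_Score", "Performance_Score",
                     "Green_Communication_Score", "Absolute_Gap", "Amplified"]),
    ("Component Scores", ["Component_"]),
    ("Green Terms Analysis", ["Green_Terms"]),
    ("Sentiment Analysis", ["Sentiment"]),
    ("Similarity Metrics", ["Similarity", "TFIDF", "Jaccard", "SpaCy"]),
]


def categorize(column, rules=CATEGORY_RULES, default="Other"):
    """Return the first category whose substrings appear in `column`."""
    if column in ("Company", "Year"):
        return "Identification"
    for category, needles in rules:
        if any(needle in column for needle in needles):
            return category
    return default


columns = [
    "Company",
    "Performance_Score",
    "Component_Language_Vagueness_Score",
    "Green_Terms_Frequency_Pct",
    "Combined_Sentiment_Score",
    "Jaccard_Similarity",
    "Unmapped_Column",  # hypothetical name that matches no rule
]
categories = {col: categorize(col) for col in columns}
```

Because the rules are evaluated in order, moving a column to another category only requires editing one list entry rather than a long conditional chain.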

# Final Score Calculation
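
The ensemble enumerates every weight vector on a 0.01 grid that satisfies the bounds, the hierarchy, and the sum-to-one constraint. As a rough illustration, the sketch below runs the same checks on an integer percent grid, which sidesteps floating-point comparisons. `enumerate_weights` and its coarse default `step` are assumptions made for illustration and are not part of the notebook.

```python
from itertools import product


def enumerate_weights(lo=5, hi=50, step=5):
    """Yield (gw, sub, lang, temp, rep) weight tuples that satisfy the constraints.

    Hypothetical helper. Weights are handled as integer percent points and
    converted to fractions only when yielded. `lo` should be a multiple of
    `step` so that the remainder also falls on the grid.
    """
    grid = range(lo, hi + 1, step)
    for gw, sub, lang, temp in product(grid, repeat=4):
        # Reporting_Consistency receives whatever weight is left over
        rep = 100 - (gw + sub + lang + temp)
        if not lo <= rep <= hi:
            continue
        # Greenwashing_Score > Substantiation_Weakness > every other component
        if gw > sub > max(lang, temp, rep):
            yield tuple(w / 100 for w in (gw, sub, lang, temp, rep))


combos = list(enumerate_weights(step=5))
```

With `step=1` the grid matches the 0.01 resolution used in the notebook, at the cost of testing roughly 4.5 million candidate tuples.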

In [None]:
import pandas as pd
import numpy as np
from itertools import product
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print("ENSEMBLE GREENWASHING SCORE ANALYSIS")
print("="*50)

# ==========================================
# STEP 1: Generate valid weight combinations
# ==========================================

print("Generating valid weight combinations...")
print("Constraints:")
print("- Individual weights: 0.05 ≤ w ≤ 0.50")
print("- Greenwashing_Score > all other components")
print("- Substantiation_Weakness > Language_Vagueness, Temporal_Orientation, Reporting_Consistency")
print("- All weights sum to 1.0")
print()

# Create weight ranges on an integer percent grid (5..50), then scale to 2 decimals.
# This avoids the endpoint drift np.arange can show with a fractional step.
weight_range = np.round(np.arange(5, 51) / 100, 2)

valid_combinations = []
total_combinations = 0

# Generate all possible combinations
for w_gw in weight_range:  # Greenwashing_Score weight
    for w_sub in weight_range:  # Substantiation_Weakness weight
        for w_lang in weight_range:  # Language_Vagueness weight
            for w_temp in weight_range:  # Temporal_Orientation weight
                total_combinations += 1
                
                # Calculate remaining weight for Reporting_Consistency
                w_rep = 1.0 - (w_gw + w_sub + w_lang + w_temp)
                w_rep = round(w_rep, 2)
                
                # Check if all constraints are satisfied
                if (0.05 <= w_rep <= 0.50 and  # Reporting_Consistency in valid range
                    w_gw > w_sub and           # Greenwashing_Score > Substantiation_Weakness
                    w_gw > w_lang and          # Greenwashing_Score > Language_Vagueness 
                    w_gw > w_temp and          # Greenwashing_Score > Temporal_Orientation
                    w_gw > w_rep and           # Greenwashing_Score > Reporting_Consistency
                    w_sub > w_lang and         # Substantiation_Weakness > Language_Vagueness
                    w_sub > w_temp and         # Substantiation_Weakness > Temporal_Orientation  
                    w_sub > w_rep and          # Substantiation_Weakness > Reporting_Consistency
                    abs(w_gw + w_sub + w_lang + w_temp + w_rep - 1.0) < 0.001):  # Sum ≈ 1.0
                    
                    valid_combinations.append({
                        'w_greenwashing': w_gw,
                        'w_substantiation': w_sub,
                        'w_language': w_lang,
                        'w_temporal': w_temp,
                        'w_reporting': w_rep
                    })

print(f"Total combinations tested: {total_combinations:,}")
print(f"Valid combinations found: {len(valid_combinations):,}")
print(f"Percentage valid: {len(valid_combinations)/total_combinations*100:.2f}%")

if len(valid_combinations) == 0:
    print("ERROR: No valid weight combinations found. Check constraints.")
else:
    # ==========================================
    # STEP 2: Calculate scores for all combinations
    # ==========================================
    
    print(f"\nCalculating scores for {len(valid_combinations):,} weight combinations...")
    
    # Initialize storage for all results
    all_results = []
    
    # Progress bar for weight combinations
    for i, weights in enumerate(tqdm(valid_combinations, desc="Processing combinations")):
        
        # Calculate final score for this weight combination
        final_scores = (
            weights['w_greenwashing'] * greenwashing_df['Greenwashing_Score'] +
            weights['w_substantiation'] * greenwashing_df['Substantiation_Weakness'] +
            weights['w_language'] * greenwashing_df['Language_Vagueness'] +
            weights['w_temporal'] * greenwashing_df['Temporal_Orientation'] +
            weights['w_reporting'] * greenwashing_df['Reporting_Consistency']
        ).round(2)
        
        # Store results for this combination
        combination_result = {
            'combination_id': i,
            'weights': weights,
            'scores': final_scores.tolist(),
            'organizations': greenwashing_df['Organization'].tolist(),
            'years': greenwashing_df['year'].tolist(),
            'summary_stats': {
                'mean': final_scores.mean(),
                'median': final_scores.median(),
                'std': final_scores.std(),
                'min': final_scores.min(),
                'max': final_scores.max(),
                'q25': final_scores.quantile(0.25),
                'q75': final_scores.quantile(0.75)
            }
        }
        
        all_results.append(combination_result)
    
    # ==========================================
    # STEP 3: Create ensemble summary
    # ==========================================
    
    print("\nCreating ensemble summary...")
    
    # Extract all scores into matrix format
    n_combinations = len(all_results)
    n_observations = len(greenwashing_df)
    
    # Matrix: rows = combinations, columns = observations
    score_matrix = np.array([result['scores'] for result in all_results])
    
    # Calculate statistics across all combinations for each observation
    ensemble_stats = pd.DataFrame({
        'Organization': greenwashing_df['Organization'],
        'year': greenwashing_df['year'],
        'mean_score': np.mean(score_matrix, axis=0),
        'median_score': np.median(score_matrix, axis=0),
        'std_score': np.std(score_matrix, axis=0),
        'min_score': np.min(score_matrix, axis=0),
        'max_score': np.max(score_matrix, axis=0),
        'q25_score': np.percentile(score_matrix, 25, axis=0),
        'q75_score': np.percentile(score_matrix, 75, axis=0),
        'iqr_score': np.percentile(score_matrix, 75, axis=0) - np.percentile(score_matrix, 25, axis=0),
        'range_score': np.max(score_matrix, axis=0) - np.min(score_matrix, axis=0)
    }).round(2)
    
    # Weight distribution analysis
    weight_analysis = pd.DataFrame([result['weights'] for result in all_results])
    weight_stats = {
        'weight_ranges': {
            'greenwashing': [weight_analysis['w_greenwashing'].min(), weight_analysis['w_greenwashing'].max()],
            'substantiation': [weight_analysis['w_substantiation'].min(), weight_analysis['w_substantiation'].max()], 
            'language': [weight_analysis['w_language'].min(), weight_analysis['w_language'].max()],
            'temporal': [weight_analysis['w_temporal'].min(), weight_analysis['w_temporal'].max()],
            'reporting': [weight_analysis['w_reporting'].min(), weight_analysis['w_reporting'].max()]
        },
        'weight_means': weight_analysis.mean().to_dict(),
        'weight_stds': weight_analysis.std().to_dict()
    }
    
    # Overall ensemble statistics
    overall_stats = {
        'total_combinations': n_combinations,
        'total_observations': n_observations,
        'score_stability': {
            'mean_std_across_combinations': ensemble_stats['std_score'].mean(),
            'max_std_across_combinations': ensemble_stats['std_score'].max(),
            'mean_range_across_combinations': ensemble_stats['range_score'].mean(),
            'companies_with_high_uncertainty': len(ensemble_stats[ensemble_stats['std_score'] > ensemble_stats['std_score'].quantile(0.9)])
        }
    }
    
    # ==========================================
    # STEP 4: Calculate company averages across both years
    # ==========================================
    
    print("Calculating company averages across years...")
    
    # Company averages for median scores
    company_averages_median = ensemble_stats.groupby('Organization').agg({
        'median_score': ['mean', 'count']
    }).round(2)
    company_averages_median.columns = ['Avg_Median_Score', 'Years_Count']
    company_averages_median = company_averages_median.reset_index()
    company_averages_median = company_averages_median[company_averages_median['Years_Count'] == 2]  # Only companies with both years
    
    # Company averages for mean scores
    company_averages_mean = ensemble_stats.groupby('Organization').agg({
        'mean_score': ['mean', 'count']
    }).round(2)
    company_averages_mean.columns = ['Avg_Mean_Score', 'Years_Count']
    company_averages_mean = company_averages_mean.reset_index()
    company_averages_mean = company_averages_mean[company_averages_mean['Years_Count'] == 2]  # Only companies with both years
    
    # Combine company averages
    company_averages = pd.merge(company_averages_median[['Organization', 'Avg_Median_Score']], 
                               company_averages_mean[['Organization', 'Avg_Mean_Score']], 
                               on='Organization') 
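
The loop above computes one weighted sum per weight combination. Assuming the five component columns are numeric and aligned row by row, the same score matrix can be sketched as a single matrix product. The toy values below are made up for illustration and are not results from the analysis.

```python
import numpy as np

# Toy component scores (illustrative only). Column order:
# Greenwashing_Score, Substantiation_Weakness, Language_Vagueness,
# Temporal_Orientation, Reporting_Consistency
components = np.array([
    [80.0, 60.0, 40.0, 20.0, 10.0],  # observation 0
    [30.0, 50.0, 70.0, 90.0, 10.0],  # observation 1
])

# Two weight vectors that respect the bounds and the hierarchy
weights = np.array([
    [0.40, 0.30, 0.10, 0.10, 0.10],
    [0.50, 0.20, 0.10, 0.10, 0.10],
])

# Rows = weight combinations, columns = observations (same layout as score_matrix)
toy_score_matrix = np.round(weights @ components.T, 2)

# Per-observation statistics across all weight combinations
toy_median = np.median(toy_score_matrix, axis=0)
toy_std = np.std(toy_score_matrix, axis=0)  # population std (ddof=0), as np.std defaults to
```

The matrix form avoids storing per-combination dictionaries, which matters when tens of thousands of combinations are evaluated.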
    
    

In [None]:
from openpyxl import Workbook
from openpyxl.utils import get_column_letter
from openpyxl.styles import PatternFill
from openpyxl import load_workbook

# ==========================================
# STEP 5: Save results
# ==========================================
if len(valid_combinations) == 0:
    print("ERROR: No valid weight combinations found. Check constraints.")
else:    
    print(f"\nSaving results...")
    
    # Define file path
    output_path = "data/Greenwashing Results/ensemble_results_summary_ensforperf.xlsx"
    
    # Save key results as Excel for easy viewing
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        ensemble_stats.to_excel(writer, sheet_name='Ensemble_Scores', index=False)
        company_averages.to_excel(writer, sheet_name='Company_Averages', index=False)
        
        # Summary statistics sheet
        summary_df = pd.DataFrame([
            ['Total Combinations', overall_stats['total_combinations']],
            ['Total Observations', overall_stats['total_observations']],
            ['Mean Std Across Combinations', overall_stats['score_stability']['mean_std_across_combinations']],
            ['Max Std Across Combinations', overall_stats['score_stability']['max_std_across_combinations']],
            ['Mean Range Across Combinations', overall_stats['score_stability']['mean_range_across_combinations']],
            ['High Uncertainty Companies', overall_stats['score_stability']['companies_with_high_uncertainty']]
        ], columns=['Metric', 'Value'])
        summary_df.to_excel(writer, sheet_name='Summary_Stats', index=False)
    
    print("Applying Excel formatting...")
    
    # Load the workbook for formatting
    wb = load_workbook(output_path)
    
    # Define grey fill for alternating rows
    grey_fill = PatternFill(start_color="D9D9D9", end_color="D9D9D9", fill_type="solid")
    
    # Format each sheet
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        
        # Auto-adjust column widths based on the longest string in each column
        for col in ws.columns:
            max_length = 0
            col_letter = get_column_letter(col[0].column)
            for cell in col:
                if cell.value:
                    max_length = max(max_length, len(str(cell.value)))
            ws.column_dimensions[col_letter].width = max_length + 3  # Add padding
        
        # Apply alternating row colors
        if sheet_name in ['Ensemble_Scores', 'Company_Averages']:
            # These sheets have company names in column A - alternate by company
            prev_company = None
            use_grey = False
            for row in range(2, ws.max_row + 1):
                current_company = ws[f"A{row}"].value  # Column A has the company names
                if current_company != prev_company:
                    use_grey = not use_grey
                    prev_company = current_company
                
                if use_grey:
                    for col in range(1, ws.max_column + 1):
                        ws.cell(row=row, column=col).fill = grey_fill
        else:
            # Summary_Stats sheet - simple alternating rows
            for row in range(2, ws.max_row + 1):
                if row % 2 == 0:  # Even rows get grey background
                    for col in range(1, ws.max_column + 1):
                        ws.cell(row=row, column=col).fill = grey_fill
    
    # Save the final formatted workbook
    wb.save(output_path)
    
    print("Results saved and formatted:")
    print(f"- Summary tables: {output_path}")
    
    # ==========================================
    # STEP 6: Display results by year and averages
    # ==========================================
    
    print("\n" + "="*60)
    print("ENSEMBLE GREENWASHING SCORE RESULTS")
    print("="*60)
    
    print(f"\nOVERALL STATISTICS:")
    print(f"Total weight combinations tested: {overall_stats['total_combinations']:,}")
    print(f"Mean uncertainty (std) across all companies: {overall_stats['score_stability']['mean_std_across_combinations']:.2f}")
    print(f"Maximum uncertainty (std) for any company: {overall_stats['score_stability']['max_std_across_combinations']:.2f}")
    print(f"Companies with high score uncertainty (>90th percentile): {overall_stats['score_stability']['companies_with_high_uncertainty']}")
    
    # Results by year - 2021
    print(f"\n" + "="*40)
    print("HIGHEST RISK COMPANIES - 2021")
    print("="*40)
    print("(Ranked by median ensemble score)")
    
    ensemble_2021 = ensemble_stats[ensemble_stats['year'] == 2021]
    top_risk_2021 = ensemble_2021.nlargest(10, 'median_score')[['Organization', 'median_score', 'mean_score', 'std_score', 'min_score', 'max_score']]
    print(top_risk_2021.to_string(index=False))
    
    # Results by year - 2022
    print(f"\n" + "="*40)
    print("HIGHEST RISK COMPANIES - 2022")
    print("="*40)
    print("(Ranked by median ensemble score)")
    
    ensemble_2022 = ensemble_stats[ensemble_stats['year'] == 2022]
    top_risk_2022 = ensemble_2022.nlargest(10, 'median_score')[['Organization', 'median_score', 'mean_score', 'std_score', 'min_score', 'max_score']]
    print(top_risk_2022.to_string(index=False))
    
    # Company averages across both years
    print(f"\n" + "="*40)
    print("HIGHEST RISK COMPANIES - AVERAGE (2021-2022)")
    print("="*40)
    print("(Companies with data for both years)")
    
    top_avg_companies = company_averages.nlargest(14, 'Avg_Median_Score')
    print(top_avg_companies.to_string(index=False))
    
    print(f"\nCompanies with data for both years: {len(company_averages)}")
    
    # Most uncertain companies by year
    print(f"\n" + "="*40)
    print("MOST UNCERTAIN COMPANIES BY YEAR")
    print("="*40)
    print("(Highest std across weight combinations)")
    
    print(f"\n2021 - Most Uncertain:")
    most_uncertain_2021 = ensemble_2021.nlargest(5, 'std_score')[['Organization', 'median_score', 'mean_score', 'std_score', 'min_score', 'max_score']]
    print(most_uncertain_2021.to_string(index=False))
    
    print(f"\n2022 - Most Uncertain:")
    most_uncertain_2022 = ensemble_2022.nlargest(5, 'std_score')[['Organization', 'median_score', 'mean_score', 'std_score', 'min_score', 'max_score']]
    print(most_uncertain_2022.to_string(index=False))
    
    print(f"\n" + "="*60)
    print("ANALYSIS COMPLETE!")
    print("="*60)
    print("Use 'ensemble_stats' DataFrame for individual company/year scores")
    print("Use 'company_averages' DataFrame for cross-year company averages")
    print("All results saved to files for further analysis")

# Only report success when the ensemble actually ran
if len(valid_combinations) > 0:
    print("\nEnsemble analysis complete!")
    print("\nKey variables created:")
    print("- ensemble_stats: Individual company scores by year")
    print("- company_averages: Company averages across both years")
    print("- weight_analysis: All valid weight combinations (in memory)")
    print("- all_results: Complete results for all combinations")
    print(f"\nFormatted results saved to: {output_path}")
    print("(3 sheets: Ensemble_Scores, Company_Averages, Summary_Stats)")

### Final Export
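
The comprehensive table is keyed by `Company` and `Year`, while the ensemble table uses `Organization` and `year`. How the notebook joins them is not shown in this part, so the following is only a sketch of one way to harmonize the keys and merge. It uses the `validate` and `indicator` options of pandas `merge` to catch duplicate or unmatched company-years; the toy rows are invented.

```python
import pandas as pd

# Toy stand-ins for the two tables (values are illustrative only)
comp = pd.DataFrame({
    "Company": ["A", "A", "B"],
    "Year": [2021, 2022, 2021],
    "Green_Communication_Score": [55.0, 60.0, 42.0],
})
ens = pd.DataFrame({
    "Organization": ["A", "A", "B"],
    "year": [2021, 2022, 2021],
    "median_score": [48.1, 50.3, 39.9],
})

# Align the key names before merging
ens_renamed = ens.rename(columns={"Organization": "Company", "year": "Year"})

master = comp.merge(
    ens_renamed,
    on=["Company", "Year"],
    how="left",
    validate="one_to_one",  # raises if a company-year appears twice on either side
    indicator=True,         # adds a _merge column showing where each row matched
)

# Rows from the comprehensive table that found no ensemble counterpart
unmatched = master[master["_merge"] != "both"]
```

An empty `unmatched` frame indicates that every company-year in the comprehensive table has ensemble statistics.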

In [None]:
# Cell: Create Master Greenwashing Dataset - Combining All Results

print("CREATING MASTER GREENWASHING DATASET")
print("=" * 60)
print("Combining comprehensive communication results with ensemble statistics...")

# ==========================================
# STEP 1: Load Comprehensive Greenwashing Results
# ==========================================

print("\nLoading comprehensive greenwashing results...")

try:
    # Load comprehensive results
    comp_greenwashing = pd.read_excel("data/Greenwashing Results/comprehensive_greenwashing_results.xlsx", 
                                     sheet_name='Comprehensive_Results')
    
    print(f"✓ Comprehensive data loaded: {len(comp_greenwashing)} records")
    
except Exception as e:
    print(f"Error loading comprehensive results: {e}")
    print("Using in-memory comprehensive_greenwashing data...")
    comp_greenwashing = comprehensive_greenwashing.copy()

# ==========================================
# STEP 2: Load Ensemble Greenwashing Results
# ==========================================

print("Loading ensemble greenwashing results...")

try:
    ensemble_greenwashing = pd.read_excel("data/Greenwashing Results/ensemble_results_summary_ensforperf.xlsx", 
                                         sheet_name='Ensemble_Scores')
    
    print(f"✓ Ensemble data loaded: {len(ensemble_greenwashing)} records")
    
except Exception as e:
    print(f"Error loading ensemble results: {e}")
    print("Using in-memory ensemble_stats data...")
    ensemble_greenwashing = ensemble_stats.copy()

# ==========================================
# STEP 2.5: Load Sensitivity Analysis Results
# ==========================================

print("Loading sensitivity analysis results...")

try:
    # Load company sensitivity metrics from existing sensitivity analysis
    company_sensitivity = pd.read_excel("data/Greenwashing Results/sensitivity_analysis/communication_sensitivity_results.xlsx", 
                                       sheet_name='Company_Sensitivity')
    company_sensitivity.rename(columns={'Organization': 'Company'}, inplace=True)
    
    # Load scenario impact summary 
    scenario_impact = pd.read_excel("data/Greenwashing Results/sensitivity_analysis/communication_sensitivity_results.xlsx", 
                                   sheet_name='Scenario_Impact')
    
    print(f"✓ Sensitivity data loaded:")
    print(f"  - Company sensitivity: {len(company_sensitivity)} companies")
    print(f"  - Scenario impact: {len(scenario_impact)} scenarios")
    
except Exception as e:
    print(f"Warning: Could not load sensitivity analysis results: {e}")
    print("Continuing without sensitivity data...")
    company_sensitivity = pd.DataFrame()
    scenario_impact = pd.DataFrame()

# ==========================================
# STEP 3: Standardize and Rename Columns
# ==========================================

print("\nStandardizing column names...")

# Map comprehensive-results column names to the names used in the master dataset
# (identity entries document columns that pass through unchanged)
if not comp_greenwashing.empty:
    comprehensive_renamed = comp_greenwashing.rename(columns={
        'Company': 'Company',
        'Year': 'Year',
        
        # ============= MAIN GREENWASHING SCORES =============
        'Performance_Communication_Gap_Score': 'Performance_Communication_Gap_Score',
        'Performance_Score': 'Performance_Score',
        'Green_Communication_Score': 'Green_Com_Score',
        'Greenwashing_Risk_Absolute_Gap': 'Greenwashing_Risk_Absolute_Gap',
        'Greenwashing_Risk_Amplified': 'Greenwashing_Risk_Amplified',
        
        # ============= COMMUNICATION COMPONENT SCORES =============
        'Component_Substantiation_Weakness_Score': 'Substantiation_Weakness',
        'Component_Language_Vagueness_Score': 'Language_Vagueness',
        'Component_Temporal_Orientation_Score': 'Temporal_Orientation',
        'Component_Reporting_Consistency_Score': 'Reporting_Consistency',
        
        # ============= UNDERLYING RAW COMMUNICATION VARIABLES =============
        'Green_Terms_Frequency_Pct': 'gt_freq_pct',
        'Green_Terms_Unique_Relative': 'unique_gt_relative',
        'Green_Terms_Combined_Score': 'combined_green_score',
        
        'Overall_Sentiment_Score': 'avg_sentiment_score',
        'Renewable_Energy_Sentiment': 'renewable_energy_avg_sentiment',
        'Climate_Emissions_Sentiment': 'climate_emissions_avg_sentiment',
        'Combined_Sentiment_Score': 'combined_sentiment_score',
        
        'Quantification_Intensity_Score': 'quantification_intensity_score',
        'Evidence_Intensity_Score': 'evidence_intensity_score',
        'Aspirational_Intensity_Score': 'aspirational_intensity_score',
        
        'Hedge_Intensity_Score': 'hedge_density',
        'Vague_Intensity_Score': 'vague_density',
        'Combined_Unclear_Intensity_Score': 'combined_unclear_intensity',
        'Commitment_Timeline_Pct': 'commitment_timeline_pct',
        
        'Future_vs_Past_Present_Ratio': 'temporal_future_pct',
        
        'TFIDF_Document_Similarity': 'TFIDF_similarity',
        'Jaccard_Similarity': 'Jaccard_similarity',
        'SpaCy_Average_Similarity': 'SpaCy_avg_similarity',
        'SpaCy_High_Similarity_Ratio': 'SpaCy_HighSim_Ratio',
        'Similarity_Combined_Score': 'similarity_combined'
    })
else:
    comprehensive_renamed = pd.DataFrame()

# Rename ensemble data columns with the Ens_Greenwashing_ prefix
if not ensemble_greenwashing.empty:
    ensemble_renamed = ensemble_greenwashing.rename(columns={
        'Organization': 'Company',
        'year': 'Year',
        'mean_score': 'Ens_Greenwashing_mean_score',
        'median_score': 'Ens_Greenwashing_median_score',
        'std_score': 'Ens_Greenwashing_std_score',
        'min_score': 'Ens_Greenwashing_min_score',
        'max_score': 'Ens_Greenwashing_max_score',
        'q25_score': 'Ens_Greenwashing_q25_score',
        'q75_score': 'Ens_Greenwashing_q75_score',
        'iqr_score': 'Ens_Greenwashing_iqr_score',
        'range_score': 'Ens_Greenwashing_range_score'
    })
else:
    ensemble_renamed = pd.DataFrame()

print(f"✓ Column standardization complete")

# ==========================================
# STEP 4: Merge Datasets
# ==========================================

print("\nMerging comprehensive and ensemble data...")

if not comprehensive_renamed.empty and not ensemble_renamed.empty:
    # First merge comprehensive and ensemble data
    master_greenwashing = pd.merge(
        comprehensive_renamed, 
        ensemble_renamed, 
        on=['Company', 'Year'], 
        how='outer',
        suffixes=('', '_ensemble')
    )
    
    # Then add sensitivity data (company-level, no year dimension).
    # This block must stay inside the merge branch: at top level it would
    # detach the elif/else fallbacks below from the merge condition.
    if not company_sensitivity.empty:
        # Rename sensitivity columns with Sens_ prefix
        sensitivity_renamed = company_sensitivity.rename(columns={
            'CV_Percent': 'Sens_CV_Percent',
            'Score_Range': 'Sens_Score_Range',
            'MAD': 'Sens_MAD',
            'Avg_Rank_Shift': 'Sens_Avg_Rank_Shift',
            'Max_Rank_Shift': 'Sens_Max_Rank_Shift',
            'Sensitivity_Level': 'Sens_Sensitivity_Level'
        })
        
        sensitivity_cols = ['Company', 'Sens_CV_Percent', 'Sens_Score_Range', 'Sens_MAD', 'Sens_Avg_Rank_Shift', 
                           'Sens_Max_Rank_Shift', 'Sens_Sensitivity_Level']
        master_greenwashing = pd.merge(
            master_greenwashing,
            sensitivity_renamed[sensitivity_cols],
            on='Company',
            how='left'
        )
        print(f"✓ Sensitivity data merged: {len(sensitivity_cols)-1} sensitivity variables added")
    
    print(f"✓ Merge complete: {len(master_greenwashing)} records")
    print(f"  - Companies: {master_greenwashing['Company'].nunique()}")
    print(f"  - Years: {sorted(master_greenwashing['Year'].unique())}")
    
elif not comprehensive_renamed.empty:
    master_greenwashing = comprehensive_renamed.copy()
    print("Using comprehensive data only (ensemble data not available)")
    
elif not ensemble_renamed.empty:
    master_greenwashing = ensemble_renamed.copy()
    print("Using ensemble data only (comprehensive data not available)")
    
else:
    print("Error: No data available from either source")
    master_greenwashing = pd.DataFrame()

# ==========================================
# STEP 5: Final Data Organization and Validation
# ==========================================

if not master_greenwashing.empty:
    # Sort by Company and Year for consistency
    master_greenwashing = master_greenwashing.sort_values(['Company', 'Year']).reset_index(drop=True)
    
    # Round numerical columns for clean display
    numeric_columns = master_greenwashing.select_dtypes(include=[np.number]).columns
    master_greenwashing[numeric_columns] = master_greenwashing[numeric_columns].round(3)
    
    print(f"\nFinal dataset structure:")
    print(f"  - Shape: {master_greenwashing.shape}")
    print(f"  - Companies: {master_greenwashing['Company'].nunique()}")
    print(f"  - Years covered: {sorted(master_greenwashing['Year'].unique())}")
    print(f"  - Component score columns: {len([col for col in master_greenwashing.columns if col in ['Green_Com_Score', 'Substantiation_Weakness', 'Language_Vagueness', 'Temporal_Orientation', 'Reporting_Consistency']])}")
    print(f"  - Ensemble statistic columns: {len([col for col in master_greenwashing.columns if 'Ens_Greenwashing_' in col])}")

# ==========================================
# STEP 6: Export Master Dataset
# ==========================================

print(f"\nExporting master greenwashing dataset...")

if not master_greenwashing.empty:
    output_path = "data/Greenwashing Results/master_greenwashing_dataset.xlsx"
    
    try:
        with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
            # Main dataset
            master_greenwashing.to_excel(writer, sheet_name='Master_Greenwashing_Data', index=False)
            
            # Create data dictionary
            data_dict = pd.DataFrame({
                'Column_Name': master_greenwashing.columns,
                'Data_Type': [str(master_greenwashing[col].dtype) for col in master_greenwashing.columns],
                'Category': [
                    'Identification' if col in ['Company', 'Year'] else
                    'Main Greenwashing Scores' if col in ['Performance_Communication_Gap_Score', 'Performance_Score', 'Greenwashing_Risk_Absolute_Gap', 'Greenwashing_Risk_Amplified'] else
                    'Communication Component Scores' if col in ['Green_Com_Score', 'Substantiation_Weakness', 'Language_Vagueness', 'Temporal_Orientation', 'Reporting_Consistency'] else
                    'Greenwashing Ensemble Statistics' if any(x in col for x in ['Ens_Greenwashing_mean', 'Ens_Greenwashing_median', 'Ens_Greenwashing_std', 'Ens_Greenwashing_min', 'Ens_Greenwashing_max', 'Ens_Greenwashing_q25', 'Ens_Greenwashing_q75', 'Ens_Greenwashing_iqr', 'Ens_Greenwashing_range']) else
                    'Sensitivity Analysis' if col in ['Sens_CV_Percent', 'Sens_Score_Range', 'Sens_MAD', 'Sens_Avg_Rank_Shift', 'Sens_Max_Rank_Shift', 'Sens_Sensitivity_Level'] else
                    'Green Terms Analysis' if col in ['gt_freq_pct', 'unique_gt_relative', 'combined_green_score'] else
                    'Sentiment Analysis' if col in ['avg_sentiment_score', 'renewable_energy_avg_sentiment', 'climate_emissions_avg_sentiment', 'combined_sentiment_score'] else
                    'Substantiation Variables' if col in ['quantification_intensity_score', 'evidence_intensity_score', 'aspirational_intensity_score'] else
                    'Language Vagueness Variables' if col in ['hedge_density', 'vague_density', 'combined_unclear_intensity', 'commitment_timeline_pct'] else
                    'Temporal Variables' if col in ['temporal_future_pct'] else
                    'Similarity Variables' if col in ['TFIDF_similarity', 'Jaccard_similarity', 'SpaCy_avg_similarity', 'SpaCy_HighSim_Ratio', 'similarity_combined'] else
                    'Other' for col in master_greenwashing.columns
                ],
                'Description': [
                    'Company name' if col == 'Company' else
                    'Year (2021 or 2022)' if col == 'Year' else
                    '(Amplified) Performance Communication Gap Score (0-100, higher = more risk)' if col == 'Performance_Communication_Gap_Score' else
                    'Environmental performance score from performance analysis' if col == 'Performance_Score' else
                    'Combined green communication intensity score (0-100)' if col == 'Green_Com_Score' else
                    'Absolute gap between communication and performance scores' if col == 'Greenwashing_Risk_Absolute_Gap' else
                    'Amplified risk score with classic greenwashing penalty' if col == 'Greenwashing_Risk_Amplified' else
                    'Component score: Weakness of substantiation (0-100, higher = more risk)' if col == 'Substantiation_Weakness' else
                    'Component score: Language Vagueness (0-100, higher = more risk)' if col == 'Language_Vagueness' else
                    'Component score: Temporal orientation (0-100, higher = more risk)' if col == 'Temporal_Orientation' else
                    'Component score: Reporting consistency (0-100, higher = more risk)' if col == 'Reporting_Consistency' else
                    'Ensemble median greenwashing score across all weight combinations' if col == 'Ens_Greenwashing_median_score' else
                    'Ensemble mean greenwashing score across all weight combinations' if col == 'Ens_Greenwashing_mean_score' else
                    'Ensemble standard deviation of greenwashing scores' if col == 'Ens_Greenwashing_std_score' else
                    'Ensemble minimum greenwashing score across all weight combinations' if col == 'Ens_Greenwashing_min_score' else
                    'Ensemble maximum greenwashing score across all weight combinations' if col == 'Ens_Greenwashing_max_score' else
                    'Ensemble 25th percentile greenwashing score' if col == 'Ens_Greenwashing_q25_score' else
                    'Ensemble 75th percentile greenwashing score' if col == 'Ens_Greenwashing_q75_score' else
                    'Ensemble interquartile range of greenwashing scores' if col == 'Ens_Greenwashing_iqr_score' else
                    'Ensemble range (max-min) of greenwashing scores' if col == 'Ens_Greenwashing_range_score' else
                    'Coefficient of variation across weight scenarios (%)' if col == 'Sens_CV_Percent' else
                    'Score range across all weight scenarios' if col == 'Sens_Score_Range' else
                    'Mean absolute deviation from baseline' if col == 'Sens_MAD' else
                    'Average ranking shift across scenarios' if col == 'Sens_Avg_Rank_Shift' else
                    'Maximum ranking shift across scenarios' if col == 'Sens_Max_Rank_Shift' else
                    'Sensitivity level classification (High/Moderate/Low)' if col == 'Sens_Sensitivity_Level' else
                    'Green term frequency as percentage of total words' if col == 'gt_freq_pct' else
                    'Unique green terms relative to document length (vocabulary diversity)' if col == 'unique_gt_relative' else
                    'Combined score from green term frequency and uniqueness' if col == 'combined_green_score' else
                    'Average sentiment score across all text' if col == 'avg_sentiment_score' else
                    'Average sentiment in renewable energy discussions' if col == 'renewable_energy_avg_sentiment' else
                    'Average sentiment in climate/emissions discussions' if col == 'climate_emissions_avg_sentiment' else
                    'Weighted combination of all sentiment scores' if col == 'combined_sentiment_score' else
                    'Intensity of quantitative claims and metrics' if col == 'quantification_intensity_score' else
                    'Intensity of evidence-based statements' if col == 'evidence_intensity_score' else
                    'Intensity of aspirational/future-oriented language' if col == 'aspirational_intensity_score' else
                    'Vague language percentage (unclear, non-specific terms)' if col == 'vague_density' else
                    'Hedge word percentage (uncertainty markers)' if col == 'hedge_density' else
                    'Combined unclear language intensity score' if col == 'combined_unclear_intensity' else
                    'Percentage of commitments with specific timelines' if col == 'commitment_timeline_pct' else
                    'Future orientation percentage (future vs past/present focus)' if col == 'temporal_future_pct' else
                    'TF-IDF based document similarity score' if col == 'TFIDF_similarity' else
                    'Jaccard similarity coefficient between documents' if col == 'Jaccard_similarity' else
                    'SpaCy semantic similarity average' if col == 'SpaCy_avg_similarity' else
                    'Ratio of highly similar sentences (SpaCy >0.8)' if col == 'SpaCy_HighSim_Ratio' else
                    'Year-over-year combined similarity score from multiple methods' if col == 'similarity_combined' else
                    f'Variable: {col}' for col in master_greenwashing.columns
                ],
                'Source': [
                    'Both' if col in ['Company', 'Year'] else
                    'Communication Analysis' if any(x in col for x in ['Green_Com_Score', 'Substantiation_Weakness', 'Language_Vagueness', 'Temporal_Orientation', 'Reporting_Consistency', 'gt_freq_pct', 'unique_gt_relative', 'combined', 'sentiment', 'quantification', 'evidence', 'aspirational', 'vague', 'hedge', 'timeline', 'temporal', 'similarity', 'TFIDF', 'Jaccard', 'SpaCy']) else
                    'Performance Analysis' if col == 'Performance_Score' else
                    'Greenwashing Analysis' if any(x in col for x in ['Performance_Communication_Gap_Score', 'Greenwashing_Risk', 'Amplified']) else
                    'Ensemble Analysis' if 'Ens_Greenwashing_' in col else
                    'Sensitivity Analysis' if col in ['Sens_CV_Percent', 'Sens_Score_Range', 'Sens_MAD', 'Sens_Avg_Rank_Shift', 'Sens_Max_Rank_Shift', 'Sens_Sensitivity_Level'] else
                    'Unknown' for col in master_greenwashing.columns
                ],
                # Sens_ columns are checked before the substring tests so that
                # 'Sens_Score_Range' is not labelled as a 0-100 score via 'Score'
                'Scale': [
                    'Text' if col == 'Company' else
                    'Integer (2021, 2022)' if col == 'Year' else
                    'Categorical' if col == 'Sens_Sensitivity_Level' else
                    'Sensitivity Metrics' if col in ['Sens_CV_Percent', 'Sens_Score_Range', 'Sens_MAD', 'Sens_Avg_Rank_Shift', 'Sens_Max_Rank_Shift'] else
                    '0-100 (normalized by year)' if any(x in col for x in ['Score', 'Weakness', 'Vagueness', 'Orientation', 'Consistency']) else
                    'Percentage (0-100)' if any(x in col for x in ['pct', 'density']) else
                    'Ratio (0+)' if 'Ratio' in col else
                    'Score (-100 to +100)' if 'sentiment' in col else
                    'Normalized (0-100)' if any(x in col for x in ['intensity', 'similarity', 'combined']) else
                    'Ensemble Statistics' if 'Ens_Greenwashing_' in col else
                    'Continuous' for col in master_greenwashing.columns
                ]
            })
            
            data_dict.to_excel(writer, sheet_name='Data_Dictionary', index=False)
            
            # Summary statistics by year
            if 'Year' in master_greenwashing.columns:
                summary_2021 = master_greenwashing[master_greenwashing['Year'] == 2021].describe()
                summary_2022 = master_greenwashing[master_greenwashing['Year'] == 2022].describe()
                
                summary_2021.to_excel(writer, sheet_name='Summary_Stats_2021')
                summary_2022.to_excel(writer, sheet_name='Summary_Stats_2022')
                
            # Create component analysis sheet
            component_cols = ['Green_Com_Score', 'Substantiation_Weakness', 'Language_Vagueness', 
                            'Temporal_Orientation', 'Reporting_Consistency']
            if all(col in master_greenwashing.columns for col in component_cols):
                component_analysis = pd.DataFrame({
                    'Component': component_cols,
                    'Mean_2021': [master_greenwashing[master_greenwashing['Year'] == 2021][col].mean() 
                                 for col in component_cols],
                    'Mean_2022': [master_greenwashing[master_greenwashing['Year'] == 2022][col].mean() 
                                 for col in component_cols],
                    'Std_2021': [master_greenwashing[master_greenwashing['Year'] == 2021][col].std() 
                                for col in component_cols],
                    'Std_2022': [master_greenwashing[master_greenwashing['Year'] == 2022][col].std() 
                                for col in component_cols]
                }).round(3)
                
                component_analysis.to_excel(writer, sheet_name='Component_Analysis', index=False)
            
            # Add sensitivity analysis summary
            if not scenario_impact.empty:
                scenario_impact.to_excel(writer, sheet_name='Sensitivity_Summary', index=False)
        
        print(f"✓ Master dataset exported successfully to: {output_path}")
        print(f"  - Master_Greenwashing_Data: Complete dataset ({master_greenwashing.shape[0]} rows, {master_greenwashing.shape[1]} columns)")
        print(f"  - Data_Dictionary: Column descriptions and sources")
        print(f"  - Summary_Stats_2021: Descriptive statistics for 2021")
        print(f"  - Summary_Stats_2022: Descriptive statistics for 2022")
        print(f"  - Component_Analysis: Component score analysis by year")
        if not scenario_impact.empty:
            print(f"  - Sensitivity_Summary: Sensitivity analysis scenario impacts")
        
    except Exception as e:
        print(f"Error exporting master dataset: {e}")
else:
    print("Cannot export: Master dataset is empty")

# ==========================================
# STEP 7: Display Final Summary
# ==========================================

print(f"\n" + "=" * 60)
print("MASTER GREENWASHING DATASET CREATION COMPLETE")
print("=" * 60)

if not master_greenwashing.empty:
    print(f"\nDATASET OVERVIEW:")
    print(f"  Total records: {len(master_greenwashing):,}")
    print(f"  Companies: {master_greenwashing['Company'].nunique()}")
    print(f"  Years: {sorted(master_greenwashing['Year'].unique())}")
    
    print(f"\nVARIABLE CATEGORIES:")
    
    # Count variables by category
    main_scores = [col for col in master_greenwashing.columns if col in ['Performance_Communication_Gap_Score', 'Performance_Score', 'Greenwashing_Risk_Absolute_Gap', 'Greenwashing_Risk_Amplified']]
    component_scores = [col for col in master_greenwashing.columns if col in ['Green_Com_Score', 'Substantiation_Weakness', 'Language_Vagueness', 'Temporal_Orientation', 'Reporting_Consistency']]
    # Named ensemble_stat_cols so the ensemble_stats DataFrame from the ensemble analysis is not overwritten
    ensemble_stat_cols = [col for col in master_greenwashing.columns if 'Ens_Greenwashing_' in col and any(x in col for x in ['mean', 'median', 'std', 'min', 'max', 'q25', 'q75', 'iqr', 'range'])]
    sensitivity_vars = [col for col in master_greenwashing.columns if col in ['Sens_CV_Percent', 'Sens_Score_Range', 'Sens_MAD', 'Sens_Avg_Rank_Shift', 'Sens_Max_Rank_Shift', 'Sens_Sensitivity_Level']]
    raw_variables = [col for col in master_greenwashing.columns if col in ['gt_freq_pct', 'unique_gt_relative', 'avg_sentiment_score', 'renewable_energy_avg_sentiment', 'climate_emissions_avg_sentiment', 'quantification_intensity_score', 'evidence_intensity_score', 'aspirational_intensity_score', 'vague_density', 'hedge_density', 'temporal_future_pct', 'commitment_timeline_pct', 'similarity_combined', 'SpaCy_HighSim_Ratio']]
    
    print(f"  Main Greenwashing Scores: {len(main_scores)} variables")
    print(f"  Communication Component Scores: {len(component_scores)} variables")
    print(f"  Greenwashing Ensemble Statistics: {len(ensemble_stat_cols)} variables")
    print(f"  Sensitivity Analysis Variables: {len(sensitivity_vars)} variables")
    print(f"  Underlying Raw Communication Variables: {len(raw_variables)} variables")
    print(f"  Total Variables: {master_greenwashing.shape[1]}")
    
    print(f"\nKEY STATISTICS:")
    if 'Ens_Greenwashing_median_score' in master_greenwashing.columns:
        print(f"  Mean of ensemble median risk scores (2021): {master_greenwashing[master_greenwashing['Year'] == 2021]['Ens_Greenwashing_median_score'].mean():.2f}")
        print(f"  Mean of ensemble median risk scores (2022): {master_greenwashing[master_greenwashing['Year'] == 2022]['Ens_Greenwashing_median_score'].mean():.2f}")
    
    companies_both_years = len(master_greenwashing.groupby('Company').filter(lambda x: len(x) == 2)['Company'].unique())
    print(f"  Companies with both years: {companies_both_years}")
    
    print(f"\nSENSITIVITY ANALYSIS:")
    if 'Sens_CV_Percent' in master_greenwashing.columns:
        # Sensitivity metrics are company-level, so count each company once
        # rather than halving row counts (not every company has both years)
        company_cv = master_greenwashing.drop_duplicates('Company')['Sens_CV_Percent']
        high_sensitivity = int((company_cv > 15).sum())
        moderate_sensitivity = int(((company_cv > 5) & (company_cv <= 15)).sum())
        low_sensitivity = int((company_cv <= 5).sum())
        
        print(f"  High sensitivity companies: {high_sensitivity}")
        print(f"  Moderate sensitivity companies: {moderate_sensitivity}")
        print(f"  Low sensitivity companies: {low_sensitivity}")
        print(f"  Average CV across companies: {company_cv.mean():.2f}%")
    
    print(f"\nKEY VARIABLES CONFIRMED:")
    key_vars = ['Green_Com_Score', 'Substantiation_Weakness', 'Language_Vagueness', 'Temporal_Orientation', 'Reporting_Consistency',
                'Ens_Greenwashing_median_score', 'Sens_CV_Percent', 'Sens_Sensitivity_Level', 'gt_freq_pct', 'unique_gt_relative', 'avg_sentiment_score', 'similarity_combined']
    
    for var in key_vars:
        status = "YES" if var in master_greenwashing.columns else "NO"
        print(f"  {status} {var}")
    
    print(f"\nFILES CREATED:")
    print(f"  Master dataset: data/Greenwashing Results/master_greenwashing_dataset.xlsx")
    sheet_count = 6 if not scenario_impact.empty else 5
    print(f"  ({sheet_count} sheets: Master_Greenwashing_Data, Data_Dictionary, Summary_Stats_2021, Summary_Stats_2022, Component_Analysis" + (", Sensitivity_Summary" if not scenario_impact.empty else "") + ")")
    
    print(f"\nREADY FOR ANALYSIS:")
    print(f"  All communication variables are now in one dataset")
    print(f"  Both individual component scores and ensemble statistics included")
    print(f"  Sensitivity analysis metrics integrated")
    print(f"  Clear variable names with sources and descriptions")
    print(f"  Complete data dictionary and summary statistics provided")
    
    print(f"\nTOP 5 HIGHEST RISK COMPANIES BY ENSEMBLE SCORE:")
    if 'Ens_Greenwashing_median_score' in master_greenwashing.columns:
        for year in [2021, 2022]:
            year_data = master_greenwashing[master_greenwashing['Year'] == year]
            if not year_data.empty:
                top_risk = year_data.nlargest(5, 'Ens_Greenwashing_median_score')[['Company', 'Ens_Greenwashing_median_score']]
                print(f"\n  {year}:")
                for i, (_, row) in enumerate(top_risk.iterrows(), 1):
                    print(f"    {i}. {row['Company']}: {row['Ens_Greenwashing_median_score']:.2f}")
else:
    print("Master dataset creation failed - check input files")

print(f"\nVariable 'master_greenwashing' is available in memory for immediate use")
print("Ready for comprehensive greenwashing risk analysis!")
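
The outer merge on `Company` and `Year` silently keeps rows that exist in only one source, and the company-level sensitivity merge repeats its values for every year. One way to audit both is with the `validate` and `indicator` parameters of `pd.merge`. This is a minimal sketch on hypothetical miniature frames; the company names and values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature inputs standing in for the renamed frames.
comprehensive = pd.DataFrame({
    "Company": ["A", "A", "B"],
    "Year": [2021, 2022, 2021],
    "Green_Com_Score": [40.0, 45.0, 60.0],
})
ensemble = pd.DataFrame({
    "Company": ["A", "A", "C"],
    "Year": [2021, 2022, 2022],
    "Ens_Greenwashing_median_score": [35.0, 38.0, 70.0],
})
sensitivity = pd.DataFrame({
    "Company": ["A", "B"],
    "Sens_CV_Percent": [4.0, 18.0],
})

# validate= raises pandas.errors.MergeError if the keys are not unique
# as declared; indicator=True adds a _merge column naming each row's source.
merged = pd.merge(
    comprehensive, ensemble,
    on=["Company", "Year"], how="outer",
    validate="one_to_one", indicator=True,
)
coverage = merged["_merge"].value_counts()
print(coverage)

# Company-level metrics attach to every year of the same company.
merged = merged.drop(columns="_merge").merge(
    sensitivity, on="Company", how="left", validate="many_to_one"
)
```

Rows marked `left_only` or `right_only` in `coverage` are observations that will carry missing values in the master dataset, which is worth knowing before reading the summary statistics.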

In [None]:
# Cell: Export Master Greenwashing Data as CSV
# Place this AFTER the master dataset creation cell

print("EXPORTING MASTER GREENWASHING DATA AS CSV")
print("=" * 50)

try:
    # Load the Master_Greenwashing_Data sheet from the Excel file
    print("Loading Master_Greenwashing_Data sheet...")
    
    master_greenwashing_csv = pd.read_excel("data/Greenwashing Results/master_greenwashing_dataset.xlsx", 
                                           sheet_name='Master_Greenwashing_Data')
    
    print(f"✓ Data loaded successfully:")
    print(f"  - Shape: {master_greenwashing_csv.shape}")
    print(f"  - Companies: {master_greenwashing_csv['Company'].nunique()}")
    print(f"  - Years: {sorted(master_greenwashing_csv['Year'].unique())}")
    
    # Export as CSV
    output_csv_path = "data/Greenwashing Results/master_greenwashing_data.csv"
    master_greenwashing_csv.to_csv(output_csv_path, index=False)
    
    print(f"\n✓ CSV export successful!")
    print(f"  - File saved to: {output_csv_path}")
    print(f"  - Records exported: {len(master_greenwashing_csv):,}")
    print(f"  - Variables exported: {master_greenwashing_csv.shape[1]}")
    
    # Display file size info
    import os
    file_size = os.path.getsize(output_csv_path) / (1024 * 1024)  # Convert to MB
    print(f"  - File size: {file_size:.2f} MB")
    
    # Show first few rows and column names for verification
    print(f"\nFIRST 3 ROWS PREVIEW:")
    print(master_greenwashing_csv.head(3).to_string())
    
    print(f"\nCOLUMN NAMES ({len(master_greenwashing_csv.columns)} total):")
    for i, col in enumerate(master_greenwashing_csv.columns, 1):
        print(f"  {i:2d}. {col}")
    
    print(f"\nREADY FOR UPLOAD:")
    print(f"  You can now upload: {output_csv_path}")
    
except Exception as e:
    print(f"Error exporting CSV: {e}")
    print("Trying to use in-memory data instead...")
    
    try:
        # Fallback: use the in-memory master_greenwashing DataFrame
        if 'master_greenwashing' in locals() and not master_greenwashing.empty:
            output_csv_path = "data/Greenwashing Results/master_greenwashing_data.csv"
            master_greenwashing.to_csv(output_csv_path, index=False)
            
            print(f"✓ CSV export successful using in-memory data!")
            print(f"  - File saved to: {output_csv_path}")
            print(f"  - Records exported: {len(master_greenwashing):,}")
            print(f"  - Variables exported: {master_greenwashing.shape[1]}")
            
            # Display file size info
            import os
            file_size = os.path.getsize(output_csv_path) / (1024 * 1024)  # Convert to MB
            print(f"  - File size: {file_size:.2f} MB")
            
        else:
            print("No master_greenwashing data available in memory")
            
    except Exception as e2:
        print(f"Fallback also failed: {e2}")

print("\nCSV export process complete!")
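
Because the CSV is written from a frame that was itself read back from Excel, a round-trip check can confirm that values and column order survive the export. The following is a minimal sketch that uses an in-memory buffer rather than the real output path, with hypothetical sample rows.

```python
import io

import pandas as pd

# Hypothetical sample rows; the column names follow the master dataset.
frame = pd.DataFrame({
    "Company": ["A", "B"],
    "Year": [2021, 2022],
    "Ens_Greenwashing_median_score": [35.125, 70.5],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

# Raises AssertionError if values, dtypes, or column order differ.
pd.testing.assert_frame_equal(frame, restored)
```

The same comparison could be run against the exported file by passing its path to `pd.read_csv`.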