# Performance Score Calculations 2021-2022

## Overview
This module calculates environmental performance scores for the 14 sample companies using a constrained ensemble methodology. It demonstrates which variables from the comprehensive CDP-Eikon dataset are actually used in the final greenwashing risk assessment, revealing the core data requirements for the GRAT performance component.

## Data Sources Used
- **Primary input**: `perf_score_required_data_seperate.xlsx` (merged CDP-Eikon dataset)
- **Industry benchmarks**: `Eikon_Final_2021.xlsx` and `Eikon_Final_2022.xlsx` (27-company sector dataset)

## Core Performance Components (0-100 scale)
1. **Emission Intensity**: Company Scope 1+2 emission intensity vs. industry average
2. **Goal Achievement**: Actual emission reduction progress vs. stated targets  
3. **Target Ambition**: Annual reduction rate assessment against science-based targets (10%+ for full points)
4. **Transparency**: Self-reported (CDP) vs. third-party verified (Eikon) emission data comparison

## Key Variables Actually Used
- **Eikon emission intensities**: `Sc1+2 (ton CO2e) / M$ Rev.` for 2020, 2021, 2022
- **CDP emission intensities**: `Sc1+2 (ton CO2e) / M$ Rev.` for 2021, 2022 (transparency comparison)
- **Target data**: Emission reduction percentages and target years from Eikon
- **Renewable energy**: `Total Renewable Energy / Rev. (M$)` for bonus calculations
- **Industry benchmarks**: Sector averages calculated from 27-company expanded dataset

## Ensemble Methodology  
- **Weight constraints**: Individual weights 0.05-0.50, hierarchy Emissions > Goal_Progress > Target_Ambition > Transparency
- **Valid combinations**: 2,285 weight combinations meeting constraints
- **Statistical outputs**: Mean, median, standard deviation, quartiles, and range for each company-year
- **Adjustments**: Renewable energy bonus (0-10 points), suspicious target change penalty (0 to -10 points)
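
The weight constraints above can be enumerated directly. The following is a minimal sketch rather than the notebook's actual implementation: it assumes a 0.05 grid step (the step size is not stated here, so the number of combinations it finds need not equal 2,285), a strict hierarchy, and weights that sum to 1. The function name `enumerate_weight_grid` and the component scores in `example_components` are hypothetical.

```python
from itertools import product
import statistics

STEP_UNITS = 20  # assumption: weights on a 0.05 grid, so 20 units sum to 1.0


def enumerate_weight_grid(min_units=1, max_units=10):
    """Return weight tuples (emissions, goal, ambition, transparency).

    Works in integer units of 0.05 to avoid floating-point drift;
    1 unit = 0.05 (lower bound) and 10 units = 0.50 (upper bound).
    """
    combos = []
    units = range(min_units, max_units + 1)
    for e, g, a in product(units, repeat=3):
        t = STEP_UNITS - (e + g + a)  # transparency takes the remainder
        # Keep only in-bounds combinations that respect the strict hierarchy
        if min_units <= t <= max_units and e > g > a > t:
            combos.append(tuple(k / STEP_UNITS for k in (e, g, a, t)))
    return combos


weight_grid = enumerate_weight_grid()

# Hypothetical component scores (0-100) for one company-year, in the order
# emissions, goal progress, target ambition, transparency
example_components = (60.0, 40.0, 55.0, 100.0)

# One weighted score per valid weight combination
ensemble_scores = [
    sum(w * s for w, s in zip(weights, example_components))
    for weights in weight_grid
]

q1, _, q3 = statistics.quantiles(ensemble_scores, n=4)
median_score = statistics.median(ensemble_scores)

print(f"{len(weight_grid)} combinations on this grid")
print(f"mean={statistics.mean(ensemble_scores):.2f}, median={median_score:.2f}, "
      f"std={statistics.stdev(ensemble_scores):.2f}")
print(f"quartiles=({q1:.2f}, {q3:.2f}), "
      f"range=({min(ensemble_scores):.2f}, {max(ensemble_scores):.2f})")
```

Summarizing the spread across all valid weightings, rather than committing to a single weight vector, is what yields the per-company uncertainty measures listed above.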

## Final Output Variables
Performance scores with uncertainty measures for each company in 2021 and 2022, saved as:
- `ensemble_performance_scores.xlsx`: Median scores for further analysis
- `ensemble_results_summary.xlsx`: Complete statistical distributions

## Data Reduction Note
While the merged dataset contains extensive variables from both CDP and Eikon extractions, this module uses only the essential metrics for performance scoring. Most extracted variables serve as supporting data or were excluded due to data quality limitations, demonstrating the focused data requirements for reliable performance assessment.

# Calculation

## Getting Emission Intensity from Eikon Companies

In [None]:
import pandas as pd

# Load the merged CDP-Eikon dataset; component scores are added to this DataFrame as new columns
df = pd.read_excel('data/Performance/perf_score_required_data_seperate.xlsx')

In [None]:
# Load the Eikon ESG datasets for 2021 and 2022
eikon_esg_2021_path = "data/Eikon/Eikon_Final_2021.xlsx"
eikon_esg_2022_path = "data/Eikon/Eikon_Final_2022.xlsx"

eikon_esg_2021 = pd.read_excel(eikon_esg_2021_path)
eikon_esg_2022 = pd.read_excel(eikon_esg_2022_path)

# Calculate the average emission intensity for 2021
# We drop NaN values and calculate the mean of the 'CO2 Em. / Rev. (M$)' column
average_emission_intensity_2021 = round(eikon_esg_2021['CO2 Em. / Rev. (M$)'].dropna().mean(), 2)

# Calculate the average emission intensity for 2022
average_emission_intensity_2022 = round(eikon_esg_2022['CO2 Em. / Rev. (M$)'].dropna().mean(), 2)

# Calculate the average renewable energy intensity for 2021
# We drop NaN values and calculate the mean of the 'Total Renewable Energy / Rev. (M$)' column
average_renewable_intensity_2021 = round(eikon_esg_2021['Total Renewable Energy / Rev. (M$)'].dropna().mean(), 2)

# Calculate the average renewable energy intensity for 2022
average_renewable_intensity_2022 = round(eikon_esg_2022['Total Renewable Energy / Rev. (M$)'].dropna().mean(), 2)

# Print the results
print(f"Average Emission Intensity for 2021: {average_emission_intensity_2021}")
print(f"Average Emission Intensity for 2022: {average_emission_intensity_2022}")
print(f"Average Renewable Intensity for 2021: {average_renewable_intensity_2021}")
print(f"Average Renewable Intensity for 2022: {average_renewable_intensity_2022}")

In [None]:
# Cell 1: Import libraries and define helper functions

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings('ignore')

# Set display options for better DataFrame viewing
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

print("Libraries imported successfully!")

# Helper Functions

def safe_sqrt(x):
    """Safe square root function that handles edge cases"""
    if pd.isna(x) or x < 0:
        return 0
    return np.sqrt(x)

def safe_divide(numerator, denominator):
    """Safe division function that handles division by zero"""
    if pd.isna(numerator) or pd.isna(denominator) or denominator == 0:
        return np.nan
    return numerator / denominator

def calculate_percentage_change(old_value, new_value):
    """Calculate percentage change between two values"""
    if pd.isna(old_value) or pd.isna(new_value) or old_value == 0:
        return np.nan
    return ((new_value - old_value) / old_value) * 100

print("Helper functions defined successfully!")

In [None]:
# Cell 2: Component 1 - Absolute Emission Intensity Score (100 points per year)

def calculate_absolute_emission_score(row, year):
    """Calculate absolute emission intensity score for specified year (100 points max)"""
    
    # Select the appropriate data based on year
    if year == 2021:
        company_emission = row['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.']
        industry_average = average_emission_intensity_2021
    elif year == 2022:
        company_emission = row['Eikon22 - Sc1+2 (ton CO2e) / M$ Rev.']
        industry_average = average_emission_intensity_2022
    else:
        raise ValueError(f"Year {year} not supported. Use 2021 or 2022.")
    
    if pd.isna(company_emission) or pd.isna(industry_average) or industry_average == 0:
        return 0
    
    # Calculate ratio
    ratio = company_emission / industry_average
    
    # Cap ratio between 0.5 and 2.0
    capped_ratio = max(0.5, min(2.0, ratio))
    
    # Calculate score using square root decay
    if capped_ratio <= 0.5:
        score = 100  # Maximum score (horizontal line)
    elif capped_ratio >= 2.0:
        score = 0   # Minimum score (horizontal line)
    else:
        # Square root decay from 0.5 to 2.0
        # Normalize capped_ratio to [0, 1] range
        t = (capped_ratio - 0.5) / 1.5  # t goes from 0 to 1 as ratio goes from 0.5 to 2.0
        score = 100 * (1 - safe_sqrt(t))  # Square root decay
    
    return max(0, score)  # Ensure non-negative

# Calculate absolute emission scores for both years
print("CALCULATING ABSOLUTE EMISSION INTENSITY SCORES")
print("=" * 50)

df['absolute_emission_score_2021'] = df.apply(lambda row: calculate_absolute_emission_score(row, 2021), axis=1)
df['absolute_emission_score_2022'] = df.apply(lambda row: calculate_absolute_emission_score(row, 2022), axis=1)

# Print component scores per company for both years
print("ABSOLUTE EMISSION INTENSITY SCORES BY COMPANY:")
print()
for _, row in df.iterrows():
    company = row['Organization']
    score_2021 = row['absolute_emission_score_2021']
    score_2022 = row['absolute_emission_score_2022']
    emission_2021 = row['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.']
    emission_2022 = row['Eikon22 - Sc1+2 (ton CO2e) / M$ Rev.']
    
    print(f"{company}:")
    if not pd.isna(emission_2021):
        ratio_2021 = emission_2021 / average_emission_intensity_2021
        print(f"  2021: {score_2021:>6.2f}/100 points (Intensity: {emission_2021:>8.2f}, Ratio: {ratio_2021:>5.2f})")
    else:
        print(f"  2021: {score_2021:>6.2f}/100 points (No data)")
    
    if not pd.isna(emission_2022):
        ratio_2022 = emission_2022 / average_emission_intensity_2022
        print(f"  2022: {score_2022:>6.2f}/100 points (Intensity: {emission_2022:>8.2f}, Ratio: {ratio_2022:>5.2f})")
    else:
        print(f"  2022: {score_2022:>6.2f}/100 points (No data)")
    print()

# Summary statistics
print("SUMMARY STATISTICS:")
print(f"Industry Average Emission Intensity:")
print(f"  2021: {average_emission_intensity_2021:.2f} tCO2e/M$")
print(f"  2022: {average_emission_intensity_2022:.2f} tCO2e/M$")
print()
print(f"2021 Scores - Range: {df['absolute_emission_score_2021'].min():.2f} - {df['absolute_emission_score_2021'].max():.2f}")
print(f"2021 Scores - Mean: {df['absolute_emission_score_2021'].mean():.2f}")
print(f"2021 Scores - Companies with max score (100): {(df['absolute_emission_score_2021'] == 100).sum()}")
print()
print(f"2022 Scores - Range: {df['absolute_emission_score_2022'].min():.2f} - {df['absolute_emission_score_2022'].max():.2f}")
print(f"2022 Scores - Mean: {df['absolute_emission_score_2022'].mean():.2f}")
print(f"2022 Scores - Companies with max score (100): {(df['absolute_emission_score_2022'] == 100).sum()}")

print("\nAbsolute Emission Intensity Scores calculated successfully!")
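
As a worked example of the scoring curve in Cell 2, the sketch below applies the same square-root decay to a few intensity ratios. It is a standalone illustration: the function name `emission_score_from_ratio` is hypothetical and it takes the company-to-industry ratio directly instead of reading a DataFrame row.

```python
import math


def emission_score_from_ratio(ratio):
    """Score (0-100) for a company-to-industry emission intensity ratio.

    Mirrors the square-root decay in Cell 2: ratios at or below 0.5
    score 100, ratios at or above 2.0 score 0.
    """
    capped = max(0.5, min(2.0, ratio))
    t = (capped - 0.5) / 1.5  # 0 at ratio 0.5, 1 at ratio 2.0
    return 100 * (1 - math.sqrt(t))


for ratio in (0.5, 0.75, 1.0, 1.5, 2.0):
    print(f"ratio {ratio:.2f} -> {emission_score_from_ratio(ratio):6.2f} points")
```

Because the decay is a square root rather than a straight line, a company exactly at the industry average (ratio 1.0) scores about 42.3, not 50: the curve drops steeply just above the 0.5 threshold and flattens toward 2.0.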

In [None]:
# Cell 3: Component 2 - Goal Achievement Rate (100 points per year)

def calculate_goal_achievement_score(row, year):
    """Calculate goal achievement score for specified year (100 points max)
    
    Uses year-over-year target-relative progress:
    - 2021 score: Progress from 2020→2021 toward 2021 stated target
    - 2022 score: Progress from 2021→2022 toward 2022 stated target
    """
    
    # Select the appropriate data based on year
    if year == 2021:
        previous_year_emission = row['Eikon20 - Sc1+2 (ton CO2e) / M$ Rev.']
        current_year_emission = row['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.']
        target_year = row['Eikon21 - Em. Red. Target Year']
        target_reduction = row['Eikon21 - Em. Red. Target (%)']
        baseline_year = 2020
    elif year == 2022:
        previous_year_emission = row['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.']
        current_year_emission = row['Eikon22 - Sc1+2 (ton CO2e) / M$ Rev.']
        target_year = row['Eikon22 - Em. Red. Target Year']
        target_reduction = row['Eikon22 - Em. Red. Target (%)']
        baseline_year = 2021
    else:
        raise ValueError(f"Year {year} not supported. Use 2021 or 2022.")
    
    # Check for missing data
    if (pd.isna(previous_year_emission) or pd.isna(current_year_emission) or 
        pd.isna(target_year) or pd.isna(target_reduction) or 
        previous_year_emission == 0 or target_year <= baseline_year):
        return 0
    
    # Calculate actual progress (percentage reduction vs. previous year; positive means intensity fell)
    actual_progress = ((previous_year_emission - current_year_emission) / previous_year_emission) * 100
    
    # Calculate expected annual progress rate
    expected_annual_progress = target_reduction / (target_year - baseline_year)
    
    # Calculate achievement rate
    if expected_annual_progress <= 0:
        achievement_rate = 0
    else:
        achievement_rate = actual_progress / expected_annual_progress * 100
    
    # Calculate final score using linear scaling from 0% to 200%
    if achievement_rate >= 200:
        score = 100  # Maximum score
    elif achievement_rate <= 0:
        score = 0   # Minimum score
    else:
        # Linear relationship: score = 100 * (achievement_rate / 200)
        score = 100 * (achievement_rate / 200)
    
    return max(0, score)  # Ensure non-negative

# Calculate goal achievement scores for both years
print("CALCULATING GOAL ACHIEVEMENT SCORES")
print("=" * 50)

df['goal_achievement_score_2021'] = df.apply(lambda row: calculate_goal_achievement_score(row, 2021), axis=1)
df['goal_achievement_score_2022'] = df.apply(lambda row: calculate_goal_achievement_score(row, 2022), axis=1)

# Print component scores per company for both years
print("GOAL ACHIEVEMENT SCORES BY COMPANY:")
print()
for _, row in df.iterrows():
    company = row['Organization']
    
    # 2021 calculations
    score_2021 = row['goal_achievement_score_2021']
    emission_2020 = row['Eikon20 - Sc1+2 (ton CO2e) / M$ Rev.']
    emission_2021 = row['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.']
    target_year_2021 = row['Eikon21 - Em. Red. Target Year']
    target_reduction_2021 = row['Eikon21 - Em. Red. Target (%)']
    
    # 2022 calculations
    score_2022 = row['goal_achievement_score_2022']
    emission_2022 = row['Eikon22 - Sc1+2 (ton CO2e) / M$ Rev.']
    target_year_2022 = row['Eikon22 - Em. Red. Target Year']
    target_reduction_2022 = row['Eikon22 - Em. Red. Target (%)']
    
    print(f"{company}:")
    
    # 2021 details
    if not pd.isna(emission_2020) and not pd.isna(emission_2021) and not pd.isna(target_reduction_2021):
        actual_progress_2021 = ((emission_2020 - emission_2021) / emission_2020) * 100
        expected_annual_2021 = target_reduction_2021 / (target_year_2021 - 2020) if not pd.isna(target_year_2021) and target_year_2021 > 2020 else 0
        achievement_rate_2021 = (actual_progress_2021 / expected_annual_2021 * 100) if expected_annual_2021 > 0 else 0
        
        print(f"  2021: {score_2021:>6.2f}/100 points")
        print(f"        Progress: {actual_progress_2021:>6.2f}% (Expected: {expected_annual_2021:>6.2f}%)")
        print(f"        Achievement Rate: {achievement_rate_2021:>6.2f}%")
        print(f"        Target: {target_reduction_2021}% by {target_year_2021} (from 2020)")
    else:
        print(f"  2021: {score_2021:>6.2f}/100 points (Insufficient data)")
    
    # 2022 details
    if not pd.isna(emission_2021) and not pd.isna(emission_2022) and not pd.isna(target_reduction_2022):
        actual_progress_2022 = ((emission_2021 - emission_2022) / emission_2021) * 100
        expected_annual_2022 = target_reduction_2022 / (target_year_2022 - 2021) if not pd.isna(target_year_2022) and target_year_2022 > 2021 else 0
        achievement_rate_2022 = (actual_progress_2022 / expected_annual_2022 * 100) if expected_annual_2022 > 0 else 0
        
        print(f"  2022: {score_2022:>6.2f}/100 points")
        print(f"        Progress: {actual_progress_2022:>6.2f}% (Expected: {expected_annual_2022:>6.2f}%)")
        print(f"        Achievement Rate: {achievement_rate_2022:>6.2f}%")
        print(f"        Target: {target_reduction_2022}% by {target_year_2022} (from 2021)")
    else:
        print(f"  2022: {score_2022:>6.2f}/100 points (Insufficient data)")
    
    print()

# Summary statistics
print("SUMMARY STATISTICS:")
print(f"2021 Scores - Range: {df['goal_achievement_score_2021'].min():.2f} - {df['goal_achievement_score_2021'].max():.2f}")
print(f"2021 Scores - Mean: {df['goal_achievement_score_2021'].mean():.2f}")
print(f"2021 Scores - Companies with zero score: {(df['goal_achievement_score_2021'] == 0).sum()}")
print(f"2021 Scores - Companies with perfect score (100): {(df['goal_achievement_score_2021'] == 100).sum()}")
print()
print(f"2022 Scores - Range: {df['goal_achievement_score_2022'].min():.2f} - {df['goal_achievement_score_2022'].max():.2f}")
print(f"2022 Scores - Mean: {df['goal_achievement_score_2022'].mean():.2f}")
print(f"2022 Scores - Companies with zero score: {(df['goal_achievement_score_2022'] == 0).sum()}")
print(f"2022 Scores - Companies with perfect score (100): {(df['goal_achievement_score_2022'] == 100).sum()}")

print("\nGoal Achievement Scores calculated successfully!")
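
To make the arithmetic in Cell 3 concrete, the sketch below reproduces the same steps for plain numeric inputs. The function name `goal_achievement_from_inputs` and the example figures are hypothetical; the logic follows the cell above, where the full target percentage is spread evenly over the years between the scoring baseline and the target year.

```python
def goal_achievement_from_inputs(prev_intensity, curr_intensity,
                                 target_reduction_pct, target_year,
                                 baseline_year):
    """Goal achievement score (0-100) from raw inputs, as in Cell 3."""
    if prev_intensity == 0 or target_year <= baseline_year:
        return 0.0

    # Actual year-over-year reduction in percent (positive = intensity fell)
    actual_progress = (prev_intensity - curr_intensity) / prev_intensity * 100

    # Reduction needed per year to reach the target on a straight line
    expected_annual = target_reduction_pct / (target_year - baseline_year)
    if expected_annual <= 0:
        return 0.0

    achievement_rate = actual_progress / expected_annual * 100

    # Linear scaling: 0% achievement -> 0 points, 200% or more -> 100 points
    return max(0.0, min(100.0, achievement_rate / 2))


# Hypothetical company: 30% reduction target by 2030, scored for 2021
on_pace = goal_achievement_from_inputs(100.0, 97.0, 30, 2030, 2020)
ahead = goal_achievement_from_inputs(100.0, 95.5, 30, 2030, 2020)
backsliding = goal_achievement_from_inputs(100.0, 104.0, 30, 2030, 2020)
print(f"on pace: {on_pace:.1f}, ahead: {ahead:.1f}, backsliding: {backsliding:.1f}")
```

Under this scaling, a company that reduces exactly at the expected annual pace (3% per year in the example) receives 50 points, and full marks require doubling that pace. Any year-over-year increase in intensity scores 0.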

In [None]:
# Cell 4: Component 3 - Target Ambition (100 points per year)

def calculate_target_ambition_score(row, year):
    """Calculate target ambition score for specified year (100 points max)
    
    Evaluates the ambition level of stated emission reduction targets
    based on annual reduction intensity (higher rates = higher scores)
    """
    
    # Select the appropriate data based on year
    if year == 2021:
        target_year = row['Eikon21 - Em. Red. Target Year']
        target_reduction = row['Eikon21 - Em. Red. Target (%)']
        baseline_year = 2021
    elif year == 2022:
        target_year = row['Eikon22 - Em. Red. Target Year']
        target_reduction = row['Eikon22 - Em. Red. Target (%)']
        baseline_year = 2022
    else:
        raise ValueError(f"Year {year} not supported. Use 2021 or 2022.")
    
    # Check for missing data
    if (pd.isna(target_year) or pd.isna(target_reduction) or 
        target_year <= baseline_year or target_reduction <= 0):
        return 0
    
    # Calculate annual target intensity
    annual_intensity = target_reduction / (target_year - baseline_year)
    
    # Calculate score using linear scaling (0% to 10% per year)
    if annual_intensity >= 10:
        score = 100  # Maximum score for 10%+ per year
    elif annual_intensity <= 0:
        score = 0   # Minimum score for 0% or negative targets
    else:
        # Linear scaling: score = 100 * (intensity / 10)
        score = 100 * (annual_intensity / 10)
    
    return max(0, score)  # Ensure non-negative

# Calculate target ambition scores for both years
print("CALCULATING TARGET AMBITION SCORES")
print("=" * 50)

df['target_ambition_score_2021'] = df.apply(lambda row: calculate_target_ambition_score(row, 2021), axis=1)
df['target_ambition_score_2022'] = df.apply(lambda row: calculate_target_ambition_score(row, 2022), axis=1)

# Print component scores per company for both years
print("TARGET AMBITION SCORES BY COMPANY:")
print()
for _, row in df.iterrows():
    company = row['Organization']
    
    # 2021 calculations
    score_2021 = row['target_ambition_score_2021']
    target_year_2021 = row['Eikon21 - Em. Red. Target Year']
    target_reduction_2021 = row['Eikon21 - Em. Red. Target (%)']
    
    # 2022 calculations  
    score_2022 = row['target_ambition_score_2022']
    target_year_2022 = row['Eikon22 - Em. Red. Target Year']
    target_reduction_2022 = row['Eikon22 - Em. Red. Target (%)']
    
    print(f"{company}:")
    
    # 2021 details
    if not pd.isna(target_year_2021) and not pd.isna(target_reduction_2021) and target_year_2021 > 2021:
        annual_intensity_2021 = target_reduction_2021 / (target_year_2021 - 2021)
        print(f"  2021: {score_2021:>6.2f}/100 points")
        print(f"        Target: {target_reduction_2021}% by {target_year_2021} (from 2021)")
        print(f"        Annual Intensity: {annual_intensity_2021:.2f}% per year")
    else:
        print(f"  2021: {score_2021:>6.2f}/100 points (No valid target)")
    
    # 2022 details
    if not pd.isna(target_year_2022) and not pd.isna(target_reduction_2022) and target_year_2022 > 2022:
        annual_intensity_2022 = target_reduction_2022 / (target_year_2022 - 2022)
        print(f"  2022: {score_2022:>6.2f}/100 points")
        print(f"        Target: {target_reduction_2022}% by {target_year_2022} (from 2022)")
        print(f"        Annual Intensity: {annual_intensity_2022:.2f}% per year")
    else:
        print(f"  2022: {score_2022:>6.2f}/100 points (No valid target)")
    
    print()

# Summary statistics
print("SUMMARY STATISTICS:")
print(f"2021 Scores - Range: {df['target_ambition_score_2021'].min():.2f} - {df['target_ambition_score_2021'].max():.2f}")
print(f"2021 Scores - Mean: {df['target_ambition_score_2021'].mean():.2f}")
print(f"2021 Scores - Companies with zero score: {(df['target_ambition_score_2021'] == 0).sum()}")
print(f"2021 Scores - Companies with perfect score (100): {(df['target_ambition_score_2021'] == 100).sum()}")
print()
print(f"2022 Scores - Range: {df['target_ambition_score_2022'].min():.2f} - {df['target_ambition_score_2022'].max():.2f}")  
print(f"2022 Scores - Mean: {df['target_ambition_score_2022'].mean():.2f}")
print(f"2022 Scores - Companies with zero score: {(df['target_ambition_score_2022'] == 0).sum()}")
print(f"2022 Scores - Companies with perfect score (100): {(df['target_ambition_score_2022'] == 100).sum()}")

# Target intensity analysis
valid_targets_2021 = df[df['target_ambition_score_2021'] > 0]
valid_targets_2022 = df[df['target_ambition_score_2022'] > 0]

if len(valid_targets_2021) > 0:
    intensities_2021 = []
    for _, row in valid_targets_2021.iterrows():
        if not pd.isna(row['Eikon21 - Em. Red. Target Year']) and not pd.isna(row['Eikon21 - Em. Red. Target (%)']):
            intensity = row['Eikon21 - Em. Red. Target (%)'] / (row['Eikon21 - Em. Red. Target Year'] - 2021)
            intensities_2021.append(intensity)
    
    if intensities_2021:
        print(f"\n2021 TARGET INTENSITY ANALYSIS:")
        print(f"Mean annual intensity: {np.mean(intensities_2021):.2f}% per year")
        print(f"Companies with 10%+ per year: {sum(1 for i in intensities_2021 if i >= 10)}")
        print(f"Companies with 5%+ per year: {sum(1 for i in intensities_2021 if i >= 5)}")
        print(f"Companies with 3%+ per year: {sum(1 for i in intensities_2021 if i >= 3)}")

if len(valid_targets_2022) > 0:
    intensities_2022 = []
    for _, row in valid_targets_2022.iterrows():
        if not pd.isna(row['Eikon22 - Em. Red. Target Year']) and not pd.isna(row['Eikon22 - Em. Red. Target (%)']):
            intensity = row['Eikon22 - Em. Red. Target (%)'] / (row['Eikon22 - Em. Red. Target Year'] - 2022)
            intensities_2022.append(intensity)
    
    if intensities_2022:
        print(f"\n2022 TARGET INTENSITY ANALYSIS:")
        print(f"Mean annual intensity: {np.mean(intensities_2022):.2f}% per year")
        print(f"Companies with 10%+ per year: {sum(1 for i in intensities_2022 if i >= 10)}")
        print(f"Companies with 5%+ per year: {sum(1 for i in intensities_2022 if i >= 5)}")
        print(f"Companies with 3%+ per year: {sum(1 for i in intensities_2022 if i >= 3)}")

print("\nTarget Ambition Scores calculated successfully!")
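
The annualization in Cell 4 can be checked by hand with a small standalone sketch. The function name `target_ambition_from_inputs` and the example target are hypothetical; the formula is the one used in the cell above.

```python
def target_ambition_from_inputs(target_reduction_pct, target_year, baseline_year):
    """Target ambition score (0-100) from raw inputs, as in Cell 4."""
    if target_year <= baseline_year or target_reduction_pct <= 0:
        return 0.0

    # Percentage points of reduction implied per remaining year
    annual_intensity = target_reduction_pct / (target_year - baseline_year)

    # Linear scaling: 10% per year or more earns the full 100 points
    return min(100.0, 100 * annual_intensity / 10)


# Hypothetical target: 50% reduction by 2030, unchanged between the two years
score_2021 = target_ambition_from_inputs(50, 2030, 2021)
score_2022 = target_ambition_from_inputs(50, 2030, 2022)
print(f"2021: {score_2021:.2f}, 2022: {score_2022:.2f}")
```

An unchanged 50%-by-2030 target scores about 55.6 in 2021 and 62.5 in 2022, because the remaining horizon shrinks from nine years to eight while the full percentage is still divided over it. This is worth keeping in mind when comparing ambition scores across years.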

In [None]:
# Cell 5: Component 4 - Transparency Score (100 points per year)

def calculate_transparency_score(row, year):
    """Calculate transparency score for specified year (100 points max)
    
    Measures honesty in emission reporting by comparing self-reported (CDP) 
    vs third-party verified (Eikon) data
    """
    
    # Select the appropriate data based on year
    if year == 2021:
        cdp_emission = row['CDP21 - Sc1+2 (ton CO2e) / M$ Rev.']
        eikon_emission = row['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.']
    elif year == 2022:
        cdp_emission = row['CDP22 - Sc1+2 (ton CO2e) / M$ Rev.']
        eikon_emission = row['Eikon22 - Sc1+2 (ton CO2e) / M$ Rev.']
    else:
        raise ValueError(f"Year {year} not supported. Use 2021 or 2022.")
    
    # Check for missing data
    if pd.isna(cdp_emission) or pd.isna(eikon_emission) or cdp_emission == 0:
        return 0
    
    # Calculate transparency change: (CDP - Eikon) / CDP × 100
    transparency_change = ((cdp_emission - eikon_emission) / cdp_emission) * 100
    
    # Calculate score using piecewise linear scaling
    if transparency_change >= 0:
        # CDP ≥ Eikon: Perfect transparency (honest or conservative reporting)
        score = 100
    elif transparency_change <= -50:
        # Eikon ≥ 1.5 × CDP: Significant under-reporting
        score = 0
    else:
        # Piecewise linear scaling for -50% < Change < 0%
        # More gradual penalties for minor discrepancies
        if transparency_change >= -2.5:
            # -2.5% to 0%: Minimal penalty for small discrepancies (80 to 100 points)
            score = 80 + (20 * (transparency_change + 2.5) / 2.5)
        elif transparency_change >= -10:
            # -10% to -2.5%: Moderate penalty increase (40 to 80 points)
            score = 40 + (40 * (transparency_change + 10) / 7.5)
        else:
            # -50% to -10%: Gradual penalty increase (0 to 40 points)
            score = 40 * (transparency_change + 50) / 40
    
    return max(0, min(100, score))  # Ensure score is between 0 and 100

# Calculate transparency scores for both years
print("CALCULATING TRANSPARENCY SCORES")
print("=" * 50)

df['transparency_score_2021'] = df.apply(lambda row: calculate_transparency_score(row, 2021), axis=1)
df['transparency_score_2022'] = df.apply(lambda row: calculate_transparency_score(row, 2022), axis=1)

# Print component scores per company for both years
print("TRANSPARENCY SCORES BY COMPANY:")
print()
for _, row in df.iterrows():
    company = row['Organization']
    
    # 2021 calculations
    score_2021 = row['transparency_score_2021']
    cdp_2021 = row['CDP21 - Sc1+2 (ton CO2e) / M$ Rev.']
    eikon_2021 = row['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.']
    
    # 2022 calculations
    score_2022 = row['transparency_score_2022']
    cdp_2022 = row['CDP22 - Sc1+2 (ton CO2e) / M$ Rev.']
    eikon_2022 = row['Eikon22 - Sc1+2 (ton CO2e) / M$ Rev.']
    
    print(f"{company}:")
    
    # 2021 details
    if not pd.isna(cdp_2021) and not pd.isna(eikon_2021) and cdp_2021 != 0:
        transparency_change_2021 = ((cdp_2021 - eikon_2021) / cdp_2021) * 100
        print(f"  2021: {score_2021:>6.2f}/100 points")
        print(f"        CDP: {cdp_2021:>8.2f}, Eikon: {eikon_2021:>8.2f}")
        print(f"        Transparency Change: {transparency_change_2021:>6.2f}%")
        
        # Interpretation
        if transparency_change_2021 >= 0:
            print(f"        Status: Honest/Conservative reporting")
        elif transparency_change_2021 >= -10:
            print(f"        Status: Minor discrepancy")
        elif transparency_change_2021 >= -25:
            print(f"        Status: Moderate under-reporting")
        else:
            print(f"        Status: Significant under-reporting")
    else:
        print(f"  2021: {score_2021:>6.2f}/100 points (Missing data)")
    
    # 2022 details
    if not pd.isna(cdp_2022) and not pd.isna(eikon_2022) and cdp_2022 != 0:
        transparency_change_2022 = ((cdp_2022 - eikon_2022) / cdp_2022) * 100
        print(f"  2022: {score_2022:>6.2f}/100 points")
        print(f"        CDP: {cdp_2022:>8.2f}, Eikon: {eikon_2022:>8.2f}")
        print(f"        Transparency Change: {transparency_change_2022:>6.2f}%")
        
        # Interpretation
        if transparency_change_2022 >= 0:
            print(f"        Status: Honest/Conservative reporting")
        elif transparency_change_2022 >= -10:
            print(f"        Status: Minor discrepancy")
        elif transparency_change_2022 >= -25:
            print(f"        Status: Moderate under-reporting")
        else:
            print(f"        Status: Significant under-reporting")
    else:
        print(f"  2022: {score_2022:>6.2f}/100 points (Missing data)")
    
    print()

# Summary statistics
print("SUMMARY STATISTICS:")
print(f"2021 Scores - Range: {df['transparency_score_2021'].min():.2f} - {df['transparency_score_2021'].max():.2f}")
print(f"2021 Scores - Mean: {df['transparency_score_2021'].mean():.2f}")
print(f"2021 Scores - Companies with perfect score (100): {(df['transparency_score_2021'] == 100).sum()}")
print(f"2021 Scores - Companies with zero score: {(df['transparency_score_2021'] == 0).sum()}")
print()
print(f"2022 Scores - Range: {df['transparency_score_2022'].min():.2f} - {df['transparency_score_2022'].max():.2f}")
print(f"2022 Scores - Mean: {df['transparency_score_2022'].mean():.2f}")
print(f"2022 Scores - Companies with perfect score (100): {(df['transparency_score_2022'] == 100).sum()}")
print(f"2022 Scores - Companies with zero score: {(df['transparency_score_2022'] == 0).sum()}")

# Transparency analysis
print(f"\nTRANSPARENCY ANALYSIS:")
honest_2021 = (df['transparency_score_2021'] == 100).sum()
honest_2022 = (df['transparency_score_2022'] == 100).sum()
total_companies = len(df)

print(f"Companies with honest reporting:")
print(f"  2021: {honest_2021}/{total_companies} ({(honest_2021/total_companies)*100:.1f}%)")
print(f"  2022: {honest_2022}/{total_companies} ({(honest_2022/total_companies)*100:.1f}%)")

# Count companies with data for transparency analysis
valid_data_2021 = df[(~pd.isna(df['CDP21 - Sc1+2 (ton CO2e) / M$ Rev.'])) & 
                     (~pd.isna(df['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.'])) &
                     (df['CDP21 - Sc1+2 (ton CO2e) / M$ Rev.'] != 0)].shape[0]

valid_data_2022 = df[(~pd.isna(df['CDP22 - Sc1+2 (ton CO2e) / M$ Rev.'])) & 
                     (~pd.isna(df['Eikon22 - Sc1+2 (ton CO2e) / M$ Rev.'])) &
                     (df['CDP22 - Sc1+2 (ton CO2e) / M$ Rev.'] != 0)].shape[0]

print(f"\nData availability:")
print(f"  2021: {valid_data_2021}/{total_companies} companies with complete data")
print(f"  2022: {valid_data_2022}/{total_companies} companies with complete data")

print("\nTransparency Scores calculated successfully!")
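
The piecewise scale in Cell 5 is easiest to read from a few sample points. The sketch below restates it as a standalone function; the name `transparency_score_from_gap` and the sample intensities are hypothetical, and the breakpoints are those of the cell above.

```python
def transparency_score_from_gap(cdp_intensity, eikon_intensity):
    """Transparency score (0-100) from CDP and Eikon intensities, as in Cell 5."""
    if cdp_intensity == 0:
        return 0.0

    # Negative values mean Eikon reports higher emissions than CDP
    change = (cdp_intensity - eikon_intensity) / cdp_intensity * 100

    if change >= 0:
        return 100.0  # CDP at or above Eikon: no penalty
    if change <= -50:
        return 0.0    # Eikon at least 1.5x CDP
    if change >= -2.5:
        return 80 + 20 * (change + 2.5) / 2.5   # 80 to 100 points
    if change >= -10:
        return 40 + 40 * (change + 10) / 7.5    # 40 to 80 points
    return 40 * (change + 50) / 40              # 0 to 40 points


# Hypothetical CDP intensity of 100 against increasing Eikon values
for eikon in (100.0, 101.0, 105.0, 130.0, 150.0):
    score = transparency_score_from_gap(100.0, eikon)
    print(f"Eikon {eikon:6.1f} -> {score:6.2f} points")
```

With a CDP intensity of 100, Eikon values of 101, 105 and 130 score 92, about 66.7 and 20 respectively. The first 2.5% of discrepancy costs 20 points, the next 7.5% costs 40, and the remaining 40 points are spread over the range from 10% to 50%.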

In [None]:
# Cell 6: Component 5 - Suspicious Target Changes Penalty (-10 points maximum per year)

def calculate_suspicious_changes_penalty(row, year):
    """Calculate suspicious target changes penalty for specified year (-10 points max)
    
    Penalizes inconsistent target-setting that may indicate greenwashing behavior:
    - 2021 score: Analyzes 2020→2021 target changes
    - 2022 score: Analyzes 2021→2022 target changes
    """
    
    # Select the appropriate data based on year
    if year == 2021:
        previous_missing = row['Eikon20 - Missing Targets']
        current_missing = row['Eikon21 - Missing Targets']
        previous_target_year = row['Eikon20 - Em. Red. Target Year']
        current_target_year = row['Eikon21 - Em. Red. Target Year']
        previous_target_pct = row['Eikon20 - Em. Red. Target (%)']
        current_target_pct = row['Eikon21 - Em. Red. Target (%)']
    elif year == 2022:
        previous_missing = row['Eikon21 - Missing Targets']
        current_missing = row['Eikon22 - Missing Targets']
        previous_target_year = row['Eikon21 - Em. Red. Target Year']
        current_target_year = row['Eikon22 - Em. Red. Target Year']
        previous_target_pct = row['Eikon21 - Em. Red. Target (%)']
        current_target_pct = row['Eikon22 - Em. Red. Target (%)']
    else:
        raise ValueError(f"Year {year} not supported. Use 2021 or 2022.")
    
    total_penalty = 0
    penalty_details = []
    
    # 1. Missing targets penalty (-10 points if current year is missing)
    if current_missing == 'Yes':
        penalty = -10
        total_penalty += penalty
        penalty_details.append(f"Missing targets: {penalty} points")
    
    # 2. If both the current and the previous year have valid targets, check for suspicious changes
    elif (current_missing == 'No' and previous_missing == 'No' and 
          not pd.isna(previous_target_year) and not pd.isna(current_target_year) and
          not pd.isna(previous_target_pct) and not pd.isna(current_target_pct)):
        
        # 2a. Reduction percentage decrease penalty
        if current_target_pct < previous_target_pct:
            pct_decrease = previous_target_pct - current_target_pct
            pct_decrease_ratio = pct_decrease / previous_target_pct * 100
            
            # Base penalty for any decrease
            penalty = -2
            
            # Additional penalty based on decrease amount (max -2 additional)
            if pct_decrease_ratio >= 25:
                penalty -= 2  # Maximum additional penalty
            else:
                penalty -= (2 * pct_decrease_ratio / 25)  # Linear scaling
            
            # Cap at -4 points for reduction decrease
            penalty = max(-4, penalty)
            total_penalty += penalty
            penalty_details.append(f"Reduction % decreased by {pct_decrease:.1f}pp ({pct_decrease_ratio:.1f}%): {penalty:.2f} points")
        
        # 2b. Target year moved later penalty
        if current_target_year > previous_target_year:
            years_moved = current_target_year - previous_target_year
            
            # Base penalty for moving target later
            penalty = -2
            
            # Additional penalty if reduction % didn't increase
            if current_target_pct <= previous_target_pct:
                penalty -= 2
            
            # Proportionate penalty based on years moved (max -2 additional)
            if years_moved >= 5:
                penalty -= 2  # Maximum proportionate penalty
            else:
                penalty -= (years_moved * 0.4)  # 0.4 points per year moved
            
            # Cap at -6 points for year delay
            penalty = max(-6, penalty)
            total_penalty += penalty
            penalty_details.append(f"Target year moved later by {years_moved} years: {penalty:.2f} points")
    
    # Cap total penalty at -10 points
    total_penalty = max(-10, total_penalty)
    
    return total_penalty, penalty_details

# Calculate suspicious changes penalties for both years
print("CALCULATING SUSPICIOUS TARGET CHANGES PENALTIES")
print("=" * 50)

# Apply the function and store results
penalty_results_2021 = df.apply(lambda row: calculate_suspicious_changes_penalty(row, 2021), axis=1)
penalty_results_2022 = df.apply(lambda row: calculate_suspicious_changes_penalty(row, 2022), axis=1)

# Extract penalties and details
df['suspicious_changes_penalty_2021'] = [result[0] for result in penalty_results_2021]
df['suspicious_changes_penalty_2022'] = [result[0] for result in penalty_results_2022]

penalty_details_2021 = [result[1] for result in penalty_results_2021]
penalty_details_2022 = [result[1] for result in penalty_results_2022]

# Print component penalties per company for both years
print("SUSPICIOUS TARGET CHANGES PENALTIES BY COMPANY:")
print()
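# Enumerate positionally: the detail lists are positional and need not match df's index labels
for i, (_, row) in enumerate(df.iterrows()):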
    company = row['Organization']
    penalty_2021 = row['suspicious_changes_penalty_2021']
    penalty_2022 = row['suspicious_changes_penalty_2022']
    
    print(f"{company}:")
    
    # 2021 details
    print(f"  2021: {penalty_2021:>6.2f} points")
    if penalty_details_2021[i]:
        for detail in penalty_details_2021[i]:
            print(f"        - {detail}")
    else:
        print(f"        - No penalties (consistent targets)")
    
    # 2022 details  
    print(f"  2022: {penalty_2022:>6.2f} points")
    if penalty_details_2022[i]:
        for detail in penalty_details_2022[i]:
            print(f"        - {detail}")
    else:
        print(f"        - No penalties (consistent targets)")
    
    print()

# Summary statistics
print("SUMMARY STATISTICS:")
print(f"2021 Penalties - Range: {df['suspicious_changes_penalty_2021'].min():.2f} to {df['suspicious_changes_penalty_2021'].max():.2f}")
print(f"2021 Penalties - Mean: {df['suspicious_changes_penalty_2021'].mean():.2f}")
print(f"2021 Penalties - Companies with no penalty (0): {(df['suspicious_changes_penalty_2021'] == 0).sum()}")
print(f"2021 Penalties - Companies with maximum penalty (-10): {(df['suspicious_changes_penalty_2021'] == -10).sum()}")
print()
print(f"2022 Penalties - Range: {df['suspicious_changes_penalty_2022'].min():.2f} to {df['suspicious_changes_penalty_2022'].max():.2f}")
print(f"2022 Penalties - Mean: {df['suspicious_changes_penalty_2022'].mean():.2f}")
print(f"2022 Penalties - Companies with no penalty (0): {(df['suspicious_changes_penalty_2022'] == 0).sum()}")
print(f"2022 Penalties - Companies with maximum penalty (-10): {(df['suspicious_changes_penalty_2022'] == -10).sum()}")

# Penalty type analysis
print(f"\nPENALTY TYPE ANALYSIS:")
missing_penalties_2021 = sum(1 for details in penalty_details_2021 if any('Missing targets' in detail for detail in details))
missing_penalties_2022 = sum(1 for details in penalty_details_2022 if any('Missing targets' in detail for detail in details))

reduction_penalties_2021 = sum(1 for details in penalty_details_2021 if any('Reduction %' in detail for detail in details))
reduction_penalties_2022 = sum(1 for details in penalty_details_2022 if any('Reduction %' in detail for detail in details))

year_penalties_2021 = sum(1 for details in penalty_details_2021 if any('Target year moved' in detail for detail in details))
year_penalties_2022 = sum(1 for details in penalty_details_2022 if any('Target year moved' in detail for detail in details))

print(f"2021 penalty types:")
print(f"  Missing targets: {missing_penalties_2021} companies")
print(f"  Reduction % decreased: {reduction_penalties_2021} companies")
print(f"  Target year moved later: {year_penalties_2021} companies")

print(f"2022 penalty types:")
print(f"  Missing targets: {missing_penalties_2022} companies")
print(f"  Reduction % decreased: {reduction_penalties_2022} companies")
print(f"  Target year moved later: {year_penalties_2022} companies")

print("\nSuspicious Target Changes Penalties calculated successfully!")
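
The penalty rules above can be checked by hand on a single pair of targets. The following is a minimal sketch, assuming both years have valid targets; `target_change_penalty` is a hypothetical scalar helper written for illustration, not the notebook's row-based function, and it omits the missing-target branch.

```python
def target_change_penalty(prev_pct, cur_pct, prev_year, cur_year):
    """Hypothetical scalar version of the rules for two valid consecutive targets."""
    total = 0.0

    # Rule 2a: the reduction percentage was lowered (base -2, up to -2 more, cap -4)
    if cur_pct < prev_pct:
        drop_ratio = (prev_pct - cur_pct) / prev_pct * 100
        penalty = -2 - 2 * min(drop_ratio, 25) / 25
        total += max(-4, penalty)

    # Rule 2b: the target year was pushed later (base -2, up to -4 more, cap -6)
    if cur_year > prev_year:
        years_moved = cur_year - prev_year
        penalty = -2
        if cur_pct <= prev_pct:
            penalty -= 2                      # the delay was not offset by a higher target
        penalty -= min(years_moved, 5) * 0.4  # 0.4 points per year, at most 2
        total += max(-6, penalty)

    # Overall cap
    return max(-10, total)


# Example: a 50% target by 2030 becomes a 40% target by 2035
print(target_change_penalty(50, 40, 2030, 2035))
```

In this example the 10-point cut is a 20% relative decrease (2 + 1.6 = 3.6 points) and the five-year delay costs the full 6 points, so the combined penalty is -9.6, just inside the -10 cap.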

In [None]:
# Cell 7: Component 6 - Renewable Energy Bonus (+10 points maximum per year)

def calculate_renewable_energy_bonus(row, year):
    """Calculate renewable energy bonus for specified year (10 points max)
    
    Components:
    - Industry Comparison (8 points): Performance vs industry average
    - Growth Bonus (2 points): Year-over-year improvement
    """
    
    # Select the appropriate data based on year
    if year == 2021:
        previous_year_renewable = row['Eikon20 - Total Renewable Energy / Rev. (M$)']
        current_year_renewable = row['Eikon21 - Total Renewable Energy / Rev. (M$)']
        industry_average = average_renewable_intensity_2021
    elif year == 2022:
        previous_year_renewable = row['Eikon21 - Total Renewable Energy / Rev. (M$)']
        current_year_renewable = row['Eikon22 - Total Renewable Energy / Rev. (M$)']
        industry_average = average_renewable_intensity_2022
    else:
        raise ValueError(f"Year {year} not supported. Use 2021 or 2022.")
    
    total_score = 0
    score_breakdown = {}
    
    # 1. Industry Comparison Score (8 points max)
    if not pd.isna(current_year_renewable) and not pd.isna(industry_average) and industry_average > 0:
        # Calculate ratio (Industry Average / Company Renewable)
        # Lower ratio = company uses more renewable energy = better score
        ratio = industry_average / current_year_renewable if current_year_renewable > 0 else float('inf')
        
        # Cap ratio between 0.5 and 2.0
        capped_ratio = max(0.5, min(2.0, ratio))
        
        # Calculate score using square root decay (8 points max)
        if capped_ratio <= 0.5:
            industry_score = 8  # Maximum score (company's renewable intensity is at least 2x the average)
        elif capped_ratio >= 2.0:
            industry_score = 0  # Minimum score (company's renewable intensity is at most half the average)
        else:
            # Square root decay from 0.5 to 2.0
            t = (capped_ratio - 0.5) / 1.5  # Normalize to [0, 1]
            industry_score = 8 * (1 - safe_sqrt(t))  # Square root decay
        
        score_breakdown['industry_comparison'] = industry_score
        total_score += industry_score
    else:
        score_breakdown['industry_comparison'] = 0
    
    # 2. Growth Bonus Score (2 points max)
    if (not pd.isna(previous_year_renewable) and not pd.isna(current_year_renewable) and 
        previous_year_renewable > 0 and current_year_renewable > 0):
        
        # Calculate growth ratio
        growth_ratio = current_year_renewable / previous_year_renewable
        
        if growth_ratio <= 1.0:
            growth_score = 0  # No bonus for negative or zero growth
        elif growth_ratio >= 2.0:
            growth_score = 2  # Maximum bonus for doubling or more
        else:
            # Any growth earns a 1-point base, rising linearly to 2 points at a doubling
            growth_score = 1 + (growth_ratio - 1.0)  # just above ratio 1.0 -> ~1 point; ratio 2.0 -> 2 points
        
        score_breakdown['growth_bonus'] = growth_score
        total_score += growth_score
    else:
        score_breakdown['growth_bonus'] = 0
    
    return total_score, score_breakdown

# Calculate renewable energy bonuses for both years
print("CALCULATING RENEWABLE ENERGY BONUSES")
print("=" * 50)

# Apply the function and store results
bonus_results_2021 = df.apply(lambda row: calculate_renewable_energy_bonus(row, 2021), axis=1)
bonus_results_2022 = df.apply(lambda row: calculate_renewable_energy_bonus(row, 2022), axis=1)

# Extract bonuses and breakdowns
df['renewable_bonus_2021'] = [result[0] for result in bonus_results_2021]
df['renewable_bonus_2022'] = [result[0] for result in bonus_results_2022]

bonus_breakdown_2021 = [result[1] for result in bonus_results_2021]
bonus_breakdown_2022 = [result[1] for result in bonus_results_2022]

# Print component bonuses per company for both years
print("RENEWABLE ENERGY BONUSES BY COMPANY:")
print()
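# Enumerate positionally: the breakdown lists are positional and need not match df's index labels
for i, (_, row) in enumerate(df.iterrows()):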
    company = row['Organization']
    bonus_2021 = row['renewable_bonus_2021']
    bonus_2022 = row['renewable_bonus_2022']
    
    # Get renewable energy data
    renewable_2020 = row['Eikon20 - Total Renewable Energy / Rev. (M$)']
    renewable_2021 = row['Eikon21 - Total Renewable Energy / Rev. (M$)']
    renewable_2022 = row['Eikon22 - Total Renewable Energy / Rev. (M$)']
    
    print(f"{company}:")
    
    # 2021 details
    print(f"  2021: +{bonus_2021:>6.2f}/10 points")
    if not pd.isna(renewable_2021):
        print(f"        Renewable Intensity: {renewable_2021:>8.4f}")
        if not pd.isna(average_renewable_intensity_2021):
            ratio_2021 = average_renewable_intensity_2021 / renewable_2021 if renewable_2021 > 0 else float('inf')
            print(f"        Industry Ratio: {ratio_2021:>6.2f} (lower = better)")
        print(f"        Industry Comparison: +{bonus_breakdown_2021[i]['industry_comparison']:>5.2f}/8 points")
        
        if not pd.isna(renewable_2020) and renewable_2020 > 0:
            growth_2021 = renewable_2021 / renewable_2020
            print(f"        Growth (2020→2021): {growth_2021:>5.2f}x")
        print(f"        Growth Bonus: +{bonus_breakdown_2021[i]['growth_bonus']:>5.2f}/2 points")
    else:
        print(f"        No renewable energy data available")
    
    # 2022 details
    print(f"  2022: +{bonus_2022:>6.2f}/10 points")
    if not pd.isna(renewable_2022):
        print(f"        Renewable Intensity: {renewable_2022:>8.4f}")
        if not pd.isna(average_renewable_intensity_2022):
            ratio_2022 = average_renewable_intensity_2022 / renewable_2022 if renewable_2022 > 0 else float('inf')
            print(f"        Industry Ratio: {ratio_2022:>6.2f} (lower = better)")
        print(f"        Industry Comparison: +{bonus_breakdown_2022[i]['industry_comparison']:>5.2f}/8 points")
        
        if not pd.isna(renewable_2021) and renewable_2021 > 0:
            growth_2022 = renewable_2022 / renewable_2021
            print(f"        Growth (2021→2022): {growth_2022:>5.2f}x")
        print(f"        Growth Bonus: +{bonus_breakdown_2022[i]['growth_bonus']:>5.2f}/2 points")
    else:
        print(f"        No renewable energy data available")
    
    print()

# Summary statistics
print("SUMMARY STATISTICS:")
print(f"Industry Averages:")
print(f"  2021: {average_renewable_intensity_2021:.4f}")
print(f"  2022: {average_renewable_intensity_2022:.4f}")
print()
print(f"2021 Bonuses - Range: {df['renewable_bonus_2021'].min():.2f} - {df['renewable_bonus_2021'].max():.2f}")
print(f"2021 Bonuses - Mean: {df['renewable_bonus_2021'].mean():.2f}")
print(f"2021 Bonuses - Companies with maximum bonus (10): {(df['renewable_bonus_2021'] == 10).sum()}")
print(f"2021 Bonuses - Companies with zero bonus: {(df['renewable_bonus_2021'] == 0).sum()}")
print()
print(f"2022 Bonuses - Range: {df['renewable_bonus_2022'].min():.2f} - {df['renewable_bonus_2022'].max():.2f}")
print(f"2022 Bonuses - Mean: {df['renewable_bonus_2022'].mean():.2f}")
print(f"2022 Bonuses - Companies with maximum bonus (10): {(df['renewable_bonus_2022'] == 10).sum()}")
print(f"2022 Bonuses - Companies with zero bonus: {(df['renewable_bonus_2022'] == 0).sum()}")

# Renewable energy analysis
print(f"\nRENEWABLE ENERGY PERFORMANCE ANALYSIS:")

# Count companies with data
valid_data_2021 = df[~pd.isna(df['Eikon21 - Total Renewable Energy / Rev. (M$)'])].shape[0]
valid_data_2022 = df[~pd.isna(df['Eikon22 - Total Renewable Energy / Rev. (M$)'])].shape[0]

print(f"Data availability:")
print(f"  2021: {valid_data_2021}/{len(df)} companies with renewable data")
print(f"  2022: {valid_data_2022}/{len(df)} companies with renewable data")

# Growth analysis
positive_growth_2021 = sum(1 for breakdown in bonus_breakdown_2021 if breakdown['growth_bonus'] > 0)
positive_growth_2022 = sum(1 for breakdown in bonus_breakdown_2022 if breakdown['growth_bonus'] > 0)

print(f"\nGrowth performance:")
print(f"  2021: {positive_growth_2021} companies with positive renewable growth")
print(f"  2022: {positive_growth_2022} companies with positive renewable growth")

# Industry comparison analysis
excellent_industry_2021 = sum(1 for breakdown in bonus_breakdown_2021 if breakdown['industry_comparison'] >= 6)
excellent_industry_2022 = sum(1 for breakdown in bonus_breakdown_2022 if breakdown['industry_comparison'] >= 6)

print(f"\nIndustry comparison (6+ of 8 points = roughly 1.7x the industry average or more):")
print(f"  2021: {excellent_industry_2021} companies performing well above average")
print(f"  2022: {excellent_industry_2022} companies performing well above average")

print("\nRenewable Energy Bonuses calculated successfully!")
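
To see how the square-root decay distributes the 8 industry-comparison points, the curve can be evaluated at a few multiples of the industry average. This is a sketch under stated assumptions: `industry_comparison_score` is a hypothetical scalar helper, `math.sqrt` stands in for the notebook's `safe_sqrt` (defined elsewhere), and inputs are assumed to be valid, non-missing numbers.

```python
import math


def industry_comparison_score(company_intensity, industry_average, max_points=8.0):
    """Hypothetical scalar version of the industry-comparison rule."""
    if company_intensity <= 0:
        return 0.0  # the ratio would be infinite, which caps at 2.0 and scores zero
    ratio = industry_average / company_intensity
    capped = max(0.5, min(2.0, ratio))
    t = (capped - 0.5) / 1.5  # normalize [0.5, 2.0] to [0, 1]
    return max_points * (1 - math.sqrt(t))


# Company renewable intensity expressed as a multiple of the industry average
for multiple in (0.5, 1.0, 1.7, 2.0):
    score = industry_comparison_score(multiple, 1.0)
    print(f"{multiple:.1f}x the industry average -> {score:.2f} / 8 points")
```

Because the decay is steep near the top of the range, a company exactly at the industry average earns about 3.4 of 8 points, and reaching 6 points requires roughly 1.7 times the average.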

In [None]:
# Cell: Create Comprehensive Results DataFrames and Export

print("CREATING COMPREHENSIVE PERFORMANCE RESULTS")
print("=" * 50)

# ==========================================
# STEP 1: Create 2021 Results DataFrame
# ==========================================

results_2021 = pd.DataFrame({
    # Company identifier
    'Organization': df['Organization'],
    
    # Component Scores (out of 100 each)
    'Absolute_Emission_Score': df['absolute_emission_score_2021'].round(2),
    'Goal_Achievement_Score': df['goal_achievement_score_2021'].round(2),
    'Target_Ambition_Score': df['target_ambition_score_2021'].round(2),
    'Transparency_Score': df['transparency_score_2021'].round(2),
    'Suspicious_Changes_Penalty': df['suspicious_changes_penalty_2021'].round(2),
    'Renewable_Energy_Bonus': df['renewable_bonus_2021'].round(2),
    
    # Underlying Emission Data
    'Eikon_Emissions_2021_tCO2e_per_M$': df['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.'].round(2),
    'CDP_Emissions_2021_tCO2e_per_M$': df['CDP21 - Sc1+2 (ton CO2e) / M$ Rev.'].round(2),
    'Eikon_Emissions_2020_tCO2e_per_M$': df['Eikon20 - Sc1+2 (ton CO2e) / M$ Rev.'].round(2),
    
    # Benchmarks and Industry Data
    'Industry_Avg_Emissions_2021': average_emission_intensity_2021,
    'Company_vs_Industry_Ratio_2021': (df['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.'] / average_emission_intensity_2021).round(3),
    
    # Target Data
    'Target_Year_2021': df['Eikon21 - Em. Red. Target Year'],
    'Target_Reduction_Pct_2021': df['Eikon21 - Em. Red. Target (%)'],
    'Target_Missing_2021': df['Eikon21 - Missing Targets'],
    
    # Previous Year Target Data (for penalty calculation)
    'Target_Year_2020': df['Eikon20 - Em. Red. Target Year'],
    'Target_Reduction_Pct_2020': df['Eikon20 - Em. Red. Target (%)'],
    'Target_Missing_2020': df['Eikon20 - Missing Targets'],
    
    # Renewable Energy Data
    'Renewable_Energy_2021_per_M$': df['Eikon21 - Total Renewable Energy / Rev. (M$)'].round(2),
    'Renewable_Energy_2020_per_M$': df['Eikon20 - Total Renewable Energy / Rev. (M$)'].round(2),
    'Industry_Avg_Renewable_2021': average_renewable_intensity_2021,
    'Renewable_vs_Industry_Ratio_2021': (average_renewable_intensity_2021 / df['Eikon21 - Total Renewable Energy / Rev. (M$)']).round(3),
    
    # Transparency Analysis
    'Transparency_Change_2021_Pct': (((df['CDP21 - Sc1+2 (ton CO2e) / M$ Rev.'] - df['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.']) / df['CDP21 - Sc1+2 (ton CO2e) / M$ Rev.']) * 100).round(2),
    
    # Goal Achievement Analysis
    'Actual_Progress_2020_to_2021_Pct': (((df['Eikon20 - Sc1+2 (ton CO2e) / M$ Rev.'] - df['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.']) / df['Eikon20 - Sc1+2 (ton CO2e) / M$ Rev.']) * 100).round(2),
    'Expected_Annual_Progress_2021_Pct': (df['Eikon21 - Em. Red. Target (%)'] / (df['Eikon21 - Em. Red. Target Year'] - 2020)).round(2),
    'Achievement_Rate_2021_Pct': ((((df['Eikon20 - Sc1+2 (ton CO2e) / M$ Rev.'] - df['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.']) / df['Eikon20 - Sc1+2 (ton CO2e) / M$ Rev.']) * 100) / (df['Eikon21 - Em. Red. Target (%)'] / (df['Eikon21 - Em. Red. Target Year'] - 2020)) * 100).round(2),
    
    # Target Ambition Analysis
    'Annual_Target_Intensity_2021_Pct': (df['Eikon21 - Em. Red. Target (%)'] / (df['Eikon21 - Em. Red. Target Year'] - 2021)).round(2)
})

# ==========================================
# STEP 2: Create 2022 Results DataFrame
# ==========================================

results_2022 = pd.DataFrame({
    # Company identifier
    'Organization': df['Organization'],
    
    # Component Scores (out of 100 each)
    'Absolute_Emission_Score': df['absolute_emission_score_2022'].round(2),
    'Goal_Achievement_Score': df['goal_achievement_score_2022'].round(2),
    'Target_Ambition_Score': df['target_ambition_score_2022'].round(2),
    'Transparency_Score': df['transparency_score_2022'].round(2),
    'Suspicious_Changes_Penalty': df['suspicious_changes_penalty_2022'].round(2),
    'Renewable_Energy_Bonus': df['renewable_bonus_2022'].round(2),
    
    # Underlying Emission Data
    'Eikon_Emissions_2022_tCO2e_per_M$': df['Eikon22 - Sc1+2 (ton CO2e) / M$ Rev.'].round(2),
    'CDP_Emissions_2022_tCO2e_per_M$': df['CDP22 - Sc1+2 (ton CO2e) / M$ Rev.'].round(2),
    'Eikon_Emissions_2021_tCO2e_per_M$': df['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.'].round(2),
    
    # Benchmarks and Industry Data
    'Industry_Avg_Emissions_2022': average_emission_intensity_2022,
    'Company_vs_Industry_Ratio_2022': (df['Eikon22 - Sc1+2 (ton CO2e) / M$ Rev.'] / average_emission_intensity_2022).round(3),
    
    # Target Data
    'Target_Year_2022': df['Eikon22 - Em. Red. Target Year'],
    'Target_Reduction_Pct_2022': df['Eikon22 - Em. Red. Target (%)'],
    'Target_Missing_2022': df['Eikon22 - Missing Targets'],
    
    # Previous Year Target Data (for penalty calculation)
    'Target_Year_2021': df['Eikon21 - Em. Red. Target Year'],
    'Target_Reduction_Pct_2021': df['Eikon21 - Em. Red. Target (%)'],
    'Target_Missing_2021': df['Eikon21 - Missing Targets'],
    
    # Renewable Energy Data
    'Renewable_Energy_2022_per_M$': df['Eikon22 - Total Renewable Energy / Rev. (M$)'].round(2),
    'Renewable_Energy_2021_per_M$': df['Eikon21 - Total Renewable Energy / Rev. (M$)'].round(2),
    'Industry_Avg_Renewable_2022': average_renewable_intensity_2022,
    'Renewable_vs_Industry_Ratio_2022': (average_renewable_intensity_2022 / df['Eikon22 - Total Renewable Energy / Rev. (M$)']).round(3),
    
    # Transparency Analysis
    'Transparency_Change_2022_Pct': (((df['CDP22 - Sc1+2 (ton CO2e) / M$ Rev.'] - df['Eikon22 - Sc1+2 (ton CO2e) / M$ Rev.']) / df['CDP22 - Sc1+2 (ton CO2e) / M$ Rev.']) * 100).round(2),
    
    # Goal Achievement Analysis
    'Actual_Progress_2021_to_2022_Pct': (((df['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.'] - df['Eikon22 - Sc1+2 (ton CO2e) / M$ Rev.']) / df['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.']) * 100).round(2),
    'Expected_Annual_Progress_2022_Pct': (df['Eikon22 - Em. Red. Target (%)'] / (df['Eikon22 - Em. Red. Target Year'] - 2021)).round(2),
    'Achievement_Rate_2022_Pct': ((((df['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.'] - df['Eikon22 - Sc1+2 (ton CO2e) / M$ Rev.']) / df['Eikon21 - Sc1+2 (ton CO2e) / M$ Rev.']) * 100) / (df['Eikon22 - Em. Red. Target (%)'] / (df['Eikon22 - Em. Red. Target Year'] - 2021)) * 100).round(2),
    
    # Target Ambition Analysis
    'Annual_Target_Intensity_2022_Pct': (df['Eikon22 - Em. Red. Target (%)'] / (df['Eikon22 - Em. Red. Target Year'] - 2022)).round(2)
})

# ==========================================
# STEP 3: Sort by Organization for consistency
# ==========================================

results_2021 = results_2021.sort_values('Organization').reset_index(drop=True)
results_2022 = results_2022.sort_values('Organization').reset_index(drop=True)

# ==========================================
# STEP 4: Export to Excel with Multiple Tabs
# ==========================================

print("Exporting comprehensive results to Excel...")
output_path = "data/Performance/comprehensive_performance_results.xlsx"

try:
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        # Export main results
        results_2021.to_excel(writer, sheet_name='Results_2021', index=False)
        results_2022.to_excel(writer, sheet_name='Results_2022', index=False)
        
        # Create summary comparison sheet
        comparison_summary = pd.DataFrame({
            'Organization': df['Organization'],
            'Abs_Emission_2021': df['absolute_emission_score_2021'].round(2),
            'Abs_Emission_2022': df['absolute_emission_score_2022'].round(2),
            'Goal_Achievement_2021': df['goal_achievement_score_2021'].round(2),
            'Goal_Achievement_2022': df['goal_achievement_score_2022'].round(2),
            'Target_Ambition_2021': df['target_ambition_score_2021'].round(2),
            'Target_Ambition_2022': df['target_ambition_score_2022'].round(2),
            'Transparency_2021': df['transparency_score_2021'].round(2),
            'Transparency_2022': df['transparency_score_2022'].round(2),
            'Penalties_2021': df['suspicious_changes_penalty_2021'].round(2),
            'Penalties_2022': df['suspicious_changes_penalty_2022'].round(2),
            'Renewable_Bonus_2021': df['renewable_bonus_2021'].round(2),
            'Renewable_Bonus_2022': df['renewable_bonus_2022'].round(2)
        }).sort_values('Organization').reset_index(drop=True)
        
        comparison_summary.to_excel(writer, sheet_name='Component_Comparison', index=False)
        
        # Create metadata sheet with industry averages and calculation details
        metadata = pd.DataFrame({
            'Metric': [
                'Industry Average Emissions 2021 (tCO2e/M$)',
                'Industry Average Emissions 2022 (tCO2e/M$)',
                'Industry Average Renewable 2021 (per M$)',
                'Industry Average Renewable 2022 (per M$)',
                'Total Companies Analyzed',
                'Absolute Emission Score Max Points',
                'Goal Achievement Score Max Points',
                'Target Ambition Score Max Points', 
                'Transparency Score Max Points',
                'Suspicious Changes Penalty Max Points',
                'Renewable Energy Bonus Max Points'
            ],
            'Value': [
                average_emission_intensity_2021,
                average_emission_intensity_2022,
                average_renewable_intensity_2021,
                average_renewable_intensity_2022,
                len(df),
                100,
                100,
                100,
                100,
                -10,
                10
            ]
        })
        
        metadata.to_excel(writer, sheet_name='Metadata', index=False)
        
    print(f"✓ Results exported successfully to: {output_path}")
    print(f"  - Results_2021: {results_2021.shape[0]} companies, {results_2021.shape[1]} columns")
    print(f"  - Results_2022: {results_2022.shape[0]} companies, {results_2022.shape[1]} columns")
    print(f"  - Component_Comparison: Summary of all component scores")
    print(f"  - Metadata: Industry benchmarks and scoring parameters")
    
except Exception as e:
    print(f"Error exporting to Excel: {e}")
    print("Results DataFrames created successfully in memory:")
    print("- results_2021: 2021 comprehensive results")
    print("- results_2022: 2022 comprehensive results")

# ==========================================
# STEP 5: Display Summary Statistics
# ==========================================

print(f"\nCOMPONENT SCORE SUMMARY:")
print("=" * 40)

components = [
    ('Absolute_Emission_Score', 'Absolute Emission Intensity'),
    ('Goal_Achievement_Score', 'Goal Achievement Rate'),
    ('Target_Ambition_Score', 'Target Ambition'),
    ('Transparency_Score', 'Transparency'),
    ('Suspicious_Changes_Penalty', 'Suspicious Changes Penalty'),
    ('Renewable_Energy_Bonus', 'Renewable Energy Bonus')
]

for year, results_df in [('2021', results_2021), ('2022', results_2022)]:
    print(f"\n{year} SCORES:")
    for col_name, display_name in components:
        if col_name in results_df.columns:
            mean_score = results_df[col_name].mean()
            min_score = results_df[col_name].min()
            max_score = results_df[col_name].max()
            std_score = results_df[col_name].std()
            print(f"  {display_name:<30}: Mean={mean_score:>6.2f}, Range=[{min_score:>6.2f}, {max_score:>6.2f}], Std={std_score:>5.2f}")

print(f"\nDATA AVAILABILITY SUMMARY:")
print("=" * 40)

# Check data availability for key metrics
for year, results_df in [('2021', results_2021), ('2022', results_2022)]:
    emission_col = f'Eikon_Emissions_{year}_tCO2e_per_M$'
    cdp_col = f'CDP_Emissions_{year}_tCO2e_per_M$'
    renewable_col = f'Renewable_Energy_{year}_per_M$'
    target_col = f'Target_Reduction_Pct_{year}'
    
    emission_avail = results_df[emission_col].notna().sum()
    cdp_avail = results_df[cdp_col].notna().sum()
    renewable_avail = results_df[renewable_col].notna().sum()
    target_avail = results_df[target_col].notna().sum()
    
    total = len(results_df)
    
    print(f"\n{year}:")
    print(f"  Eikon Emissions: {emission_avail}/{total} ({emission_avail/total*100:.1f}%)")
    print(f"  CDP Emissions: {cdp_avail}/{total} ({cdp_avail/total*100:.1f}%)")
    print(f"  Renewable Energy: {renewable_avail}/{total} ({renewable_avail/total*100:.1f}%)")
    print(f"  Emission Targets: {target_avail}/{total} ({target_avail/total*100:.1f}%)")

print(f"\nCOMPREHENSIVE RESULTS CREATION COMPLETE!")
print("=" * 50)
print("Available DataFrames:")
print("- results_2021: Complete 2021 performance data")
print("- results_2022: Complete 2022 performance data")
print("- comparison_summary: Side-by-side component scores")
print(f"- Excel file: {output_path}")
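
The derived goal-achievement and target-ambition columns divide the target percentage by the years remaining, which is undefined when the target year equals the base year. The sketch below works through that arithmetic for one company; `annual_rate` and `achievement_rate` are hypothetical helpers, and returning NaN for a zero or negative horizon is an assumption made here, not something the results cell does.

```python
import math


def annual_rate(target_pct, target_year, base_year):
    """Hypothetical helper: spread the target reduction evenly over the remaining years."""
    years_remaining = target_year - base_year
    if years_remaining <= 0:
        return math.nan  # assumption: a same-year or expired target has no annual rate
    return target_pct / years_remaining


def achievement_rate(previous_intensity, current_intensity, expected_annual_pct):
    """Actual year-over-year reduction as a percentage of the expected annual reduction."""
    actual_pct = (previous_intensity - current_intensity) / previous_intensity * 100
    return actual_pct / expected_annual_pct * 100


# Example: a 50% target by 2030, with intensity falling from 100 to 96 tCO2e per M$
expected = annual_rate(50, 2030, 2020)  # expected annual progress, 2020 baseline
ambition = annual_rate(50, 2030, 2021)  # annual target intensity as of 2021
print(expected, round(ambition, 2), round(achievement_rate(100, 96, expected), 1))
```

Here the company was expected to cut 5 points per year and cut 4, so its achievement rate is 80%.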

In [None]:
import pandas as pd
import numpy as np
from itertools import product
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print("ENSEMBLE PERFORMANCE SCORE ANALYSIS")
print("="*50)

# ==========================================
# STEP 1: Generate Valid Weight Combinations
# ==========================================
print("Generating valid weight combinations...")
print("Constraints:")
print("- Individual weights: 0.05 ≤ w ≤ 0.50")
print("- Emissions > Goal_Progress > Target_Ambition > Transparency")
print("- All weights sum to 1.0\n")

weight_range = np.round(np.arange(0.05, 0.51, 0.01), 2)
valid_combinations = []
total_tested = 0

# Same search as four nested loops, in the same order, using the imported product()
for w_ei, w_gp, w_ta, w_tr in product(weight_range, repeat=4):
    total_tested += 1
    total = w_ei + w_gp + w_ta + w_tr
    if abs(total - 1.0) < 0.001 and w_ei > w_gp > w_ta > w_tr:
        valid_combinations.append({
            'w_ei': w_ei,
            'w_gp': w_gp,
            'w_ta': w_ta,
            'w_tr': w_tr
        })

print(f"Total combinations tested: {total_tested:,}")
print(f"Valid combinations found: {len(valid_combinations):,}")

if len(valid_combinations) == 0:
    raise ValueError("No valid weight combinations found. Check constraints.")

# ==========================================
# STEP 2: Calculate Scores for All Combinations
# ==========================================
print("\nCalculating scores for all valid combinations...")

all_results = []

for i, weights in enumerate(tqdm(valid_combinations, desc="Processing combinations")):
    for year in [2021, 2022]:
        base_score = (
            weights['w_ei'] * df[f'absolute_emission_score_{year}'] +
            weights['w_gp'] * df[f'goal_achievement_score_{year}'] +
            weights['w_ta'] * df[f'target_ambition_score_{year}'] +
            weights['w_tr'] * df[f'transparency_score_{year}']
        )

        final_score = base_score + df[f'suspicious_changes_penalty_{year}'] + df[f'renewable_bonus_{year}']
        final_score = final_score.clip(0, 110) * (100 / 110)

        all_results.append(pd.DataFrame({
            'Organization': df['Organization'],
            'year': year,
            'score': final_score
        }))

# ==========================================
# STEP 3: Aggregate Median Score Per Company-Year
# ==========================================
print("Computing median scores across all combinations...")

ensemble_df = pd.concat(all_results)
median_scores = ensemble_df.groupby(['Organization', 'year'])['score'].median().reset_index()
median_scores.rename(columns={'score': 'median_score'}, inplace=True)

# ==========================================
# STEP 4: Display or Save Results
# ==========================================
print("\nTOP COMPANIES BY MEDIAN SCORE:")
print(median_scores.sort_values('median_score', ascending=False).head(10))

# Optional: save to Excel
median_scores.to_excel("data/Performance/ensemble_performance_scores.xlsx", index=False)
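
The ensemble step can be illustrated for a single company without pandas. The sketch below is an assumption-laden restatement, not the notebook's implementation: it enumerates weights as integer hundredths (which sidesteps floating-point comparisons), and `integer_weight_grid`, `final_score`, and the example component scores are all hypothetical.

```python
from statistics import median


def integer_weight_grid(lo=5, hi=50):
    """Yield (emissions, goal, ambition, transparency) weights in hundredths."""
    for w_ei in range(lo, hi + 1):
        for w_gp in range(lo, w_ei):              # strictly below emissions
            for w_ta in range(lo, w_gp):          # strictly below goal progress
                w_tr = 100 - w_ei - w_gp - w_ta   # remainder goes to transparency
                if lo <= w_tr < w_ta:             # in range and strictly lowest
                    yield w_ei, w_gp, w_ta, w_tr


def final_score(components, weights, penalty, bonus):
    """Weighted base score plus adjustments, clipped to [0, 110] and rescaled to 0-100."""
    base = sum(w * c for w, c in zip(weights, components)) / 100
    raw = base + penalty + bonus
    return min(110.0, max(0.0, raw)) * (100 / 110)


combos = list(integer_weight_grid())

# Illustrative component scores: emissions, goal achievement, ambition, transparency
example = (80.0, 60.0, 40.0, 20.0)
scores = [final_score(example, w, penalty=-3.6, bonus=5.0) for w in combos]

print(f"Valid combinations: {len(combos):,}")
print(f"Median score: {median(scores):.2f}")
```

With weights of 0.40, 0.30, 0.20 and 0.10 the base score is 60, the adjustments bring it to 61.4, and rescaling by 100/110 gives about 55.8. The printed combination count can be compared against the count reported by the float-based loops as a consistency check.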

In [None]:
import pandas as pd
import numpy as np
from itertools import product
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

print("ENSEMBLE PERFORMANCE SCORE ANALYSIS")
print("="*50)

# ==========================================
# STEP 1: Generate valid weight combinations
# ==========================================
print("Generating valid weight combinations...")
print("Constraints:")
print("- Individual weights: 0.05 ≤ w ≤ 0.50")
print("- Emissions > Goal_Progress > Target_Ambition > Transparency")
print("- All weights sum to 1.0")
print()

# Create weight ranges (rounded to 2 decimal places)
weight_range = np.arange(0.05, 0.51, 0.01)
weight_range = np.round(weight_range, 2)

valid_combinations = []
total_combinations = 0

# Generate all possible combinations
for w_ei in weight_range:  # Emissions weight
    for w_gp in weight_range:  # Goal_Progress weight
        for w_ta in weight_range:  # Target_Ambition weight
            total_combinations += 1
            
            # Calculate remaining weight for Transparency
            w_tr = 1.0 - (w_ei + w_gp + w_ta)
            w_tr = round(w_tr, 2)
            
            # Check if all constraints are satisfied
            if (0.05 <= w_tr <= 0.50 and  # Transparency in valid range
                w_ei > w_gp and           # Emissions > Goal_Progress
                w_gp > w_ta and           # Goal_Progress > Target_Ambition  
                w_ta > w_tr and           # Target_Ambition > Transparency
                abs(w_ei + w_gp + w_ta + w_tr - 1.0) < 0.001):  # Sum ≈ 1.0
                
                valid_combinations.append({
                    'w_emissions': w_ei,
                    'w_goal_progress': w_gp,
                    'w_target_ambition': w_ta,
                    'w_transparency': w_tr
                })

print(f"Total combinations tested: {total_combinations:,}")
print(f"Valid combinations found: {len(valid_combinations):,}")
print(f"Percentage valid: {len(valid_combinations)/total_combinations*100:.2f}%")

if len(valid_combinations) == 0:
    print("ERROR: No valid weight combinations found. Check constraints.")
else:
    # ==========================================
    # STEP 2: Prepare performance data for ensemble analysis
    # ==========================================
    
    print("\nPreparing performance data...")
    
    # Create performance dataframe with both years
    performance_data = []
    for year in [2021, 2022]:
        year_data = pd.DataFrame({
            'Organization': df['Organization'],
            'year': year,
            'absolute_emission_score': df[f'absolute_emission_score_{year}'],
            'goal_achievement_score': df[f'goal_achievement_score_{year}'],
            'target_ambition_score': df[f'target_ambition_score_{year}'],
            'transparency_score': df[f'transparency_score_{year}'],
            'suspicious_changes_penalty': df[f'suspicious_changes_penalty_{year}'],
            'renewable_bonus': df[f'renewable_bonus_{year}']
        })
        performance_data.append(year_data)
    
    performance_df = pd.concat(performance_data, ignore_index=True)
    
    # ==========================================
    # STEP 3: Calculate scores for all combinations
    # ==========================================
    
    print(f"\nCalculating scores for {len(valid_combinations):,} weight combinations...")
    
    # Initialize storage for all results
    all_results = []
    
    # Progress bar for weight combinations
    for i, weights in enumerate(tqdm(valid_combinations, desc="Processing combinations")):
        
        # Calculate base weighted score for this weight combination
        base_scores = (
            weights['w_emissions'] * performance_df['absolute_emission_score'] +
            weights['w_goal_progress'] * performance_df['goal_achievement_score'] +
            weights['w_target_ambition'] * performance_df['target_ambition_score'] +
            weights['w_transparency'] * performance_df['transparency_score']
        )
        
        # Add penalties and bonuses
        final_scores = base_scores + performance_df['suspicious_changes_penalty'] + performance_df['renewable_bonus']
        
        # Clip to [0, 110] (maximum base score of 100 plus the maximum bonus of 10), then rescale to 0-100
        final_scores = final_scores.clip(0, 110) * (100 / 110)
        final_scores = final_scores.round(2)
        
        # Store results for this combination
        combination_result = {
            'combination_id': i,
            'weights': weights,
            'scores': final_scores.tolist(),
            'organizations': performance_df['Organization'].tolist(),
            'years': performance_df['year'].tolist(),
            'summary_stats': {
                'mean': final_scores.mean(),
                'median': final_scores.median(),
                'std': final_scores.std(),
                'min': final_scores.min(),
                'max': final_scores.max(),
                'q25': final_scores.quantile(0.25),
                'q75': final_scores.quantile(0.75)
            }
        }
        
        all_results.append(combination_result)
    
    # ==========================================
    # STEP 4: Create ensemble summary
    # ==========================================
    
    print("\nCreating ensemble summary...")
    
    # Extract all scores into matrix format
    n_combinations = len(all_results)
    n_observations = len(performance_df)
    
    # Matrix: rows = combinations, columns = observations
    score_matrix = np.array([result['scores'] for result in all_results])
    
    # Calculate statistics across all combinations for each observation
    # (np.std defaults to ddof=0, unlike the pandas .std() calls above, which default to ddof=1)
    ensemble_stats = pd.DataFrame({
        'Organization': performance_df['Organization'],
        'year': performance_df['year'],
        'mean_score': np.mean(score_matrix, axis=0),
        'median_score': np.median(score_matrix, axis=0),
        'std_score': np.std(score_matrix, axis=0),
        'min_score': np.min(score_matrix, axis=0),
        'max_score': np.max(score_matrix, axis=0),
        'q25_score': np.percentile(score_matrix, 25, axis=0),
        'q75_score': np.percentile(score_matrix, 75, axis=0),
        'iqr_score': np.percentile(score_matrix, 75, axis=0) - np.percentile(score_matrix, 25, axis=0),
        'range_score': np.max(score_matrix, axis=0) - np.min(score_matrix, axis=0)
    }).round(2)
    
    # Weight distribution analysis
    weight_analysis = pd.DataFrame([result['weights'] for result in all_results])
    weight_stats = {
        'weight_ranges': {
            'emissions': [weight_analysis['w_emissions'].min(), weight_analysis['w_emissions'].max()],
            'goal_progress': [weight_analysis['w_goal_progress'].min(), weight_analysis['w_goal_progress'].max()], 
            'target_ambition': [weight_analysis['w_target_ambition'].min(), weight_analysis['w_target_ambition'].max()],
            'transparency': [weight_analysis['w_transparency'].min(), weight_analysis['w_transparency'].max()]
        },
        'weight_means': weight_analysis.mean().to_dict(),
        'weight_stds': weight_analysis.std().to_dict()
    }
    
    # Overall ensemble statistics
    overall_stats = {
        'total_combinations': n_combinations,
        'total_observations': n_observations,
        'score_stability': {
            'mean_std_across_combinations': ensemble_stats['std_score'].mean(),
            'max_std_across_combinations': ensemble_stats['std_score'].max(),
            'mean_range_across_combinations': ensemble_stats['range_score'].mean(),
            # Counts company-year rows whose std is strictly above the 90th percentile of all rows
            'companies_with_high_uncertainty': len(ensemble_stats[ensemble_stats['std_score'] > ensemble_stats['std_score'].quantile(0.9)])
        }
    }
    
    # ==========================================
    # STEP 5: Calculate company averages across both years
    # ==========================================
    
    print("Calculating company averages across years...")
    
    # Company averages for median scores
    company_averages_median = ensemble_stats.groupby('Organization').agg({
        'median_score': ['mean', 'count']
    }).round(2)
    company_averages_median.columns = ['Avg_Median_Score', 'Years_Count']
    company_averages_median = company_averages_median.reset_index()
    company_averages_median = company_averages_median[company_averages_median['Years_Count'] == 2]  # Only companies with both years
    
    # Company averages for mean scores
    company_averages_mean = ensemble_stats.groupby('Organization').agg({
        'mean_score': ['mean', 'count']
    }).round(2)
    company_averages_mean.columns = ['Avg_Mean_Score', 'Years_Count']
    company_averages_mean = company_averages_mean.reset_index()
    company_averages_mean = company_averages_mean[company_averages_mean['Years_Count'] == 2]  # Only companies with both years
    
    # Combine company averages
    company_averages = pd.merge(company_averages_median[['Organization', 'Avg_Median_Score']], 
                               company_averages_mean[['Organization', 'Avg_Mean_Score']], 
                               on='Organization')
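
The constraint search in STEP 1 can be cross-checked without floating-point rounding by enumerating the weights as integer percentages. The following is a minimal sketch, not the notebook's implementation: `enumerate_weight_combinations` and its parameters are hypothetical names introduced here, and it assumes the same bounds (5 to 50 percent), the same 1-point step, and the same strict hierarchy Emissions > Goal_Progress > Target_Ambition > Transparency as the loop above.

```python
from itertools import product


def enumerate_weight_combinations(lo=5, hi=50, total=100):
    """Return (emissions, goal_progress, target_ambition, transparency) tuples in integer percent.

    Assumption: weights move in 1-point steps between `lo` and `hi` and must sum to `total`.
    """
    combos = []
    for w_ei, w_gp, w_ta in product(range(lo, hi + 1), repeat=3):
        # Transparency takes whatever weight is left over
        w_tr = total - (w_ei + w_gp + w_ta)
        # Keep the combination only if the remainder is in range and the hierarchy is strict
        if lo <= w_tr <= hi and w_ei > w_gp > w_ta > w_tr:
            combos.append((w_ei, w_gp, w_ta, w_tr))
    return combos


combos = enumerate_weight_combinations()
# Convert to fractional weights when they are needed for scoring
fractional = [tuple(w / 100 for w in combo) for combo in combos]
```

Because every comparison is done on integers, no rounding step is needed; `len(combos)` can be compared against the count printed by STEP 1.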

In [None]:
from openpyxl import Workbook, load_workbook
from openpyxl.styles import PatternFill
from openpyxl.utils import get_column_letter

# ==========================================
# STEP 6: Save results
# ==========================================
if len(valid_combinations) == 0:
    print("ERROR: No valid weight combinations found. Check constraints.")
else:    
    print(f"\nSaving results...")
    
    # Define file path
    output_path = "data/Performance/ensemble_results_summary.xlsx"
    
    # Save key results as Excel for easy viewing
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        ensemble_stats.to_excel(writer, sheet_name='Ensemble_Scores', index=False)
        company_averages.to_excel(writer, sheet_name='Company_Averages', index=False)
        
        # Summary statistics sheet
        summary_df = pd.DataFrame([
            ['Total Combinations', overall_stats['total_combinations']],
            ['Total Observations', overall_stats['total_observations']],
            ['Mean Std Across Combinations', overall_stats['score_stability']['mean_std_across_combinations']],
            ['Max Std Across Combinations', overall_stats['score_stability']['max_std_across_combinations']],
            ['Mean Range Across Combinations', overall_stats['score_stability']['mean_range_across_combinations']],
            ['High Uncertainty Companies', overall_stats['score_stability']['companies_with_high_uncertainty']]
        ], columns=['Metric', 'Value'])
        summary_df.to_excel(writer, sheet_name='Summary_Stats', index=False)
    
    print("Applying Excel formatting...")
    
    # Load the workbook for formatting
    wb = load_workbook(output_path)
    
    # Define grey fill for alternating rows
    grey_fill = PatternFill(start_color="D9D9D9", end_color="D9D9D9", fill_type="solid")
    
    # Format each sheet
    for sheet_name in wb.sheetnames:
        ws = wb[sheet_name]
        
        # Auto-adjust column widths based on the longest string in each column
        for col in ws.columns:
            max_length = 0
            col_letter = get_column_letter(col[0].column)
            for cell in col:
                if cell.value is not None:  # Include zeros, which a truthiness check would skip
                    max_length = max(max_length, len(str(cell.value)))
            ws.column_dimensions[col_letter].width = max_length + 3  # Add padding
        
        # Apply alternating row colors
        if sheet_name in ['Ensemble_Scores', 'Company_Averages']:
            # These sheets have company names in column A - alternate by company
            prev_company = None
            use_grey = False
            for row in range(2, ws.max_row + 1):
                current_company = ws[f"A{row}"].value  # Column A has the company names
                if current_company != prev_company:
                    use_grey = not use_grey
                    prev_company = current_company
                
                if use_grey:
                    for col in range(1, ws.max_column + 1):
                        ws.cell(row=row, column=col).fill = grey_fill
        else:
            # Summary_Stats sheet - simple alternating rows
            for row in range(2, ws.max_row + 1):
                if row % 2 == 0:  # Even rows get grey background
                    for col in range(1, ws.max_column + 1):
                        ws.cell(row=row, column=col).fill = grey_fill
    
    # Save the final formatted workbook
    wb.save(output_path)
    
    print("Results saved and formatted:")
    print(f"- Summary tables: {output_path}")
    
    # ==========================================
    # STEP 7: Display results by year and averages
    # ==========================================
    
    print("\n" + "="*60)
    print("ENSEMBLE PERFORMANCE SCORE RESULTS")
    print("="*60)
    
    print(f"\nOVERALL STATISTICS:")
    print(f"Total weight combinations tested: {overall_stats['total_combinations']:,}")
    print(f"Mean uncertainty (std) across all companies: {overall_stats['score_stability']['mean_std_across_combinations']:.2f}")
    print(f"Maximum uncertainty (std) for any company: {overall_stats['score_stability']['max_std_across_combinations']:.2f}")
    print(f"Company-year observations with high score uncertainty (std > 90th percentile): {overall_stats['score_stability']['companies_with_high_uncertainty']}")
    
    # Results by year - 2021
    print(f"\n" + "="*40)
    print("TOP PERFORMING COMPANIES - 2021")
    print("="*40)
    print("(Ranked by median ensemble score)")
    
    ensemble_2021 = ensemble_stats[ensemble_stats['year'] == 2021]
    top_performers_2021 = ensemble_2021.nlargest(10, 'median_score')[['Organization', 'median_score', 'mean_score', 'std_score', 'min_score', 'max_score']]
    print(top_performers_2021.to_string(index=False))
    
    # Results by year - 2022
    print(f"\n" + "="*40)
    print("TOP PERFORMING COMPANIES - 2022")
    print("="*40)
    print("(Ranked by median ensemble score)")
    
    ensemble_2022 = ensemble_stats[ensemble_stats['year'] == 2022]
    top_performers_2022 = ensemble_2022.nlargest(10, 'median_score')[['Organization', 'median_score', 'mean_score', 'std_score', 'min_score', 'max_score']]
    print(top_performers_2022.to_string(index=False))
    
    # Company averages across both years
    print(f"\n" + "="*40)
    print("TOP PERFORMING COMPANIES - AVERAGE (2021-2022)")
    print("="*40)
    print("(Companies with data for both years)")
    
    top_avg_companies = company_averages.nlargest(14, 'Avg_Median_Score')
    print(top_avg_companies.to_string(index=False))
    
    print(f"\nCompanies with data for both years: {len(company_averages)}")
    
    # Most uncertain companies by year
    print(f"\n" + "="*40)
    print("MOST UNCERTAIN COMPANIES BY YEAR")
    print("="*40)
    print("(Highest std across weight combinations)")
    
    print(f"\n2021 - Most Uncertain:")
    most_uncertain_2021 = ensemble_2021.nlargest(5, 'std_score')[['Organization', 'median_score', 'mean_score', 'std_score', 'min_score', 'max_score']]
    print(most_uncertain_2021.to_string(index=False))
    
    print(f"\n2022 - Most Uncertain:")
    most_uncertain_2022 = ensemble_2022.nlargest(5, 'std_score')[['Organization', 'median_score', 'mean_score', 'std_score', 'min_score', 'max_score']]
    print(most_uncertain_2022.to_string(index=False))
    
    print(f"\n" + "="*60)
    print("ANALYSIS COMPLETE!")
    print("="*60)
    print("Use 'ensemble_stats' DataFrame for individual company/year scores")
    print("Use 'company_averages' DataFrame for cross-year company averages")
    print("All results saved to files for further analysis")

# Only report success when the analysis above actually ran
if valid_combinations:
    print("\nEnsemble analysis complete!")
    print("\nKey variables created:")
    print("- ensemble_stats: Individual company scores by year")
    print("- company_averages: Company averages across both years")
    print("- weight_analysis: All valid weight combinations (in memory)")
    print("- all_results: Complete results for all combinations")
    print("\nFormatted results saved to: data/Performance/ensemble_results_summary.xlsx")
    print("(3 sheets: Ensemble_Scores, Company_Averages, Summary_Stats)")
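
The median, mean, and spread figures printed above come from applying every valid weight vector to the same component scores. The sketch below walks through that arithmetic for a single company-year. It is illustrative only: the component scores, the penalty and bonus values, and the three weight vectors are hypothetical, and `final_score` is a helper name introduced here. The clip-to-110 and rescale step mirrors the one in STEP 3.

```python
from statistics import mean, median, pstdev


def final_score(components, weights, penalty=0.0, bonus=0.0):
    """Weighted sum plus adjustments, clipped to [0, 110] and rescaled to 0-100."""
    base = sum(w * c for w, c in zip(weights, components))
    adjusted = base + penalty + bonus
    clipped = min(max(adjusted, 0.0), 110.0)
    return round(clipped * 100 / 110, 2)


# Hypothetical component scores: emissions, goal progress, target ambition, transparency
components = (80.0, 60.0, 40.0, 20.0)

# Three hand-picked weight vectors that respect the hierarchy and sum to 1
weight_vectors = [
    (0.40, 0.30, 0.20, 0.10),
    (0.50, 0.25, 0.15, 0.10),
    (0.35, 0.30, 0.25, 0.10),
]

# One score per weight vector for the same company-year
scores = [final_score(components, w, penalty=-5.0, bonus=10.0) for w in weight_vectors]

# Spread across weight vectors is what the notebook reports as uncertainty
summary = {
    'mean': round(mean(scores), 2),
    'median': round(median(scores), 2),
    'std': round(pstdev(scores), 2),  # population std, matching np.std's default
    'range': round(max(scores) - min(scores), 2),
}
```

A small `std` or `range` here means the score barely depends on which admissible weighting is chosen; a large one means the ranking of that company is sensitive to the weights.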

In [None]:
# Cell: Create Master Performance Dataset - Combining All Results

print("CREATING MASTER PERFORMANCE DATASET")
print("=" * 60)
print("Combining comprehensive component results with ensemble statistics...")

# ==========================================
# STEP 1: Load Comprehensive Performance Results
# ==========================================

print("\nLoading comprehensive performance results...")

try:
    # Load 2021 results
    comp_2021 = pd.read_excel("data/Performance/comprehensive_performance_results.xlsx", 
                             sheet_name='Results_2021')
    comp_2021['year'] = 2021
    
    # Load 2022 results  
    comp_2022 = pd.read_excel("data/Performance/comprehensive_performance_results.xlsx", 
                             sheet_name='Results_2022')
    comp_2022['year'] = 2022
    
    # Combine both years
    comprehensive_data = pd.concat([comp_2021, comp_2022], ignore_index=True)
    
    print(f"✓ Comprehensive data loaded: {len(comprehensive_data)} records")
    
except Exception as e:
    print(f"Error loading comprehensive results: {e}")
    comprehensive_data = pd.DataFrame()

# ==========================================
# STEP 2: Load Ensemble Performance Results
# ==========================================

print("Loading ensemble performance results...")

try:
    ensemble_data = pd.read_excel("data/Performance/ensemble_results_summary.xlsx", 
                                 sheet_name='Ensemble_Scores')
    
    print(f"✓ Ensemble data loaded: {len(ensemble_data)} records")
    
except Exception as e:
    print(f"Error loading ensemble results: {e}")
    ensemble_data = pd.DataFrame()

# ==========================================
# STEP 3: Standardize and Rename Columns
# ==========================================

print("\nStandardizing column names...")

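# Rename comprehensive data columns with clear, informative names.
# NOTE: this flat mapping cannot express year-specific columns correctly. Several keys are
# repeated (e.g. 'Eikon_Emissions_2021_tCO2e_per_M$'), and a Python dict literal keeps only the
# last value for a repeated key; several source columns also map to the same target name.
# The merge in STEP 4 uses comprehensive_final (built per row below), not comprehensive_renamed.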
comprehensive_renamed = comprehensive_data.rename(columns={
    'Organization': 'Company',
    'year': 'Year',
    
    # Component Scores (0-100 scale)
    'Absolute_Emission_Score': 'Component_Emission_Intensity_Score',
    'Goal_Achievement_Score': 'Component_Goal_Achievement_Score', 
    'Target_Ambition_Score': 'Component_Target_Ambition_Score',
    'Transparency_Score': 'Component_Transparency_Score',
    'Suspicious_Changes_Penalty': 'Component_Suspicious_Changes_Penalty',
    'Renewable_Energy_Bonus': 'Component_Renewable_Energy_Bonus',
    
    # Emission Data (tCO2e per million USD revenue)
    'Eikon_Emissions_2021_tCO2e_per_M$': 'Eikon_Emissions_Current_Year_tCO2e_M$',
    'Eikon_Emissions_2022_tCO2e_per_M$': 'Eikon_Emissions_Current_Year_tCO2e_M$',
    'CDP_Emissions_2021_tCO2e_per_M$': 'CDP_Emissions_Current_Year_tCO2e_M$',
    'CDP_Emissions_2022_tCO2e_per_M$': 'CDP_Emissions_Current_Year_tCO2e_M$',
    'Eikon_Emissions_2020_tCO2e_per_M$': 'Eikon_Emissions_Previous_Year_tCO2e_M$',
    'Eikon_Emissions_2021_tCO2e_per_M$': 'Eikon_Emissions_Previous_Year_tCO2e_M$',
    
    # Industry Benchmarks
    'Industry_Avg_Emissions_2021': 'Industry_Avg_Emissions_tCO2e_M$',
    'Industry_Avg_Emissions_2022': 'Industry_Avg_Emissions_tCO2e_M$',
    'Company_vs_Industry_Ratio_2021': 'Company_vs_Industry_Emissions_Ratio',
    'Company_vs_Industry_Ratio_2022': 'Company_vs_Industry_Emissions_Ratio',
    
    # Target Information
    'Target_Year_2021': 'Target_Year_Current',
    'Target_Year_2022': 'Target_Year_Current',
    'Target_Reduction_Pct_2021': 'Target_Reduction_Current_Pct',
    'Target_Reduction_Pct_2022': 'Target_Reduction_Current_Pct',
    'Target_Missing_2021': 'Target_Missing_Current',
    'Target_Missing_2022': 'Target_Missing_Current',
    'Target_Year_2020': 'Target_Year_Previous',
    'Target_Year_2021': 'Target_Year_Previous',
    'Target_Reduction_Pct_2020': 'Target_Reduction_Previous_Pct',
    'Target_Reduction_Pct_2021': 'Target_Reduction_Previous_Pct',
    'Target_Missing_2020': 'Target_Missing_Previous',
    'Target_Missing_2021': 'Target_Missing_Previous',
    
    # Renewable Energy Data (per million USD revenue)
    'Renewable_Energy_2021_per_M$': 'Renewable_Energy_Current_Year_M$',
    'Renewable_Energy_2022_per_M$': 'Renewable_Energy_Current_Year_M$',
    'Renewable_Energy_2020_per_M$': 'Renewable_Energy_Previous_Year_M$',
    'Renewable_Energy_2021_per_M$': 'Renewable_Energy_Previous_Year_M$',
    'Industry_Avg_Renewable_2021': 'Industry_Avg_Renewable_M$',
    'Industry_Avg_Renewable_2022': 'Industry_Avg_Renewable_M$',
    'Renewable_vs_Industry_Ratio_2021': 'Industry_vs_Company_Renewable_Ratio',
    'Renewable_vs_Industry_Ratio_2022': 'Industry_vs_Company_Renewable_Ratio',
    
    # Analysis Metrics (percentages)
    'Transparency_Change_2021_Pct': 'Transparency_Change_CDP_vs_Eikon_Pct',
    'Transparency_Change_2022_Pct': 'Transparency_Change_CDP_vs_Eikon_Pct',
    'Actual_Progress_2020_to_2021_Pct': 'Actual_Emission_Progress_YoY_Pct',
    'Actual_Progress_2021_to_2022_Pct': 'Actual_Emission_Progress_YoY_Pct',
    'Expected_Annual_Progress_2021_Pct': 'Expected_Annual_Progress_Pct',
    'Expected_Annual_Progress_2022_Pct': 'Expected_Annual_Progress_Pct',
    'Achievement_Rate_2021_Pct': 'Goal_Achievement_Rate_Pct',
    'Achievement_Rate_2022_Pct': 'Goal_Achievement_Rate_Pct',
    'Annual_Target_Intensity_2021_Pct': 'Annual_Target_Intensity_Pct',
    'Annual_Target_Intensity_2022_Pct': 'Annual_Target_Intensity_Pct'
})

# Handle the year-specific column naming issue by creating a proper mapping
def standardize_comprehensive_columns(df):
    """Copy year-specific emission columns into year-neutral ones.

    Expects a 'Year' column (2021 or 2022). For each row, the "current" value
    comes from that row's year and the "previous" value from the year before.
    Source columns that are missing from `df` are skipped.
    """
    df_clean = df.copy()

    # (target column, source column template, offset from the row's year)
    column_map = [
        ('Eikon_Emissions_Current_Year_tCO2e_M$', 'Eikon_Emissions_{}_tCO2e_per_M$', 0),
        ('CDP_Emissions_Current_Year_tCO2e_M$', 'CDP_Emissions_{}_tCO2e_per_M$', 0),
        ('Eikon_Emissions_Previous_Year_tCO2e_M$', 'Eikon_Emissions_{}_tCO2e_per_M$', -1),
    ]

    for year in (2021, 2022):
        year_mask = df_clean['Year'] == year
        for target, template, offset in column_map:
            source = template.format(year + offset)
            if source in df.columns:
                # Fill only the rows that belong to this year
                df_clean.loc[year_mask, target] = df.loc[year_mask, source]

    return df_clean

# Apply comprehensive column standardization
if not comprehensive_data.empty:
    comprehensive_clean = comprehensive_data.copy()
    comprehensive_clean['Company'] = comprehensive_clean['Organization']
    comprehensive_clean['Year'] = comprehensive_clean['year']
    
    # Manually create the cleaned columns we need
    def year_column(template):
        """Per row, read the column named by `template` for that row's year.

        `{year}` is replaced with the row's Year and `{prev}` with Year - 1.
        Rows whose resolved column is missing from the data get NaN.
        """
        def pick(row):
            year = int(row['Year'])
            col = template.format(year=year, prev=year - 1)
            return row[col] if col in comprehensive_clean.columns else np.nan
        return comprehensive_clean.apply(pick, axis=1)

    comprehensive_final = pd.DataFrame({
        'Company': comprehensive_clean['Company'],
        'Year': comprehensive_clean['Year'],

        # Component Scores
        'Component_Emission_Intensity_Score': comprehensive_clean['Absolute_Emission_Score'],
        'Component_Goal_Achievement_Score': comprehensive_clean['Goal_Achievement_Score'],
        'Component_Target_Ambition_Score': comprehensive_clean['Target_Ambition_Score'],
        'Component_Transparency_Score': comprehensive_clean['Transparency_Score'],
        'Component_Suspicious_Changes_Penalty': comprehensive_clean['Suspicious_Changes_Penalty'],
        'Component_Renewable_Energy_Bonus': comprehensive_clean['Renewable_Energy_Bonus'],

        # Current Year Emissions
        'Eikon_Emissions_Current_Year_tCO2e_M$': year_column('Eikon_Emissions_{year}_tCO2e_per_M$'),
        'CDP_Emissions_Current_Year_tCO2e_M$': year_column('CDP_Emissions_{year}_tCO2e_per_M$'),

        # Previous Year Emissions
        'Eikon_Emissions_Previous_Year_tCO2e_M$': year_column('Eikon_Emissions_{prev}_tCO2e_per_M$'),

        # Industry Benchmarks
        'Industry_Avg_Emissions_tCO2e_M$': year_column('Industry_Avg_Emissions_{year}'),
        'Company_vs_Industry_Emissions_Ratio': year_column('Company_vs_Industry_Ratio_{year}'),

        # Target Data - Current Year
        'Target_Year_Current': year_column('Target_Year_{year}'),
        'Target_Reduction_Current_Pct': year_column('Target_Reduction_Pct_{year}'),
        'Target_Missing_Current': year_column('Target_Missing_{year}'),

        # Target Data - Previous Year
        'Target_Year_Previous': year_column('Target_Year_{prev}'),
        'Target_Reduction_Previous_Pct': year_column('Target_Reduction_Pct_{prev}'),
        'Target_Missing_Previous': year_column('Target_Missing_{prev}'),

        # Renewable Energy Data
        'Renewable_Energy_Current_Year_M$': year_column('Renewable_Energy_{year}_per_M$'),
        'Renewable_Energy_Previous_Year_M$': year_column('Renewable_Energy_{prev}_per_M$'),
        'Industry_Avg_Renewable_M$': year_column('Industry_Avg_Renewable_{year}'),
        'Industry_vs_Company_Renewable_Ratio': year_column('Renewable_vs_Industry_Ratio_{year}'),

        # Analysis Metrics
        'Transparency_Change_CDP_vs_Eikon_Pct': year_column('Transparency_Change_{year}_Pct'),
        'Actual_Emission_Progress_YoY_Pct': year_column('Actual_Progress_{prev}_to_{year}_Pct'),
        'Expected_Annual_Progress_Pct': year_column('Expected_Annual_Progress_{year}_Pct'),
        'Goal_Achievement_Rate_Pct': year_column('Achievement_Rate_{year}_Pct'),
        'Annual_Target_Intensity_Pct': year_column('Annual_Target_Intensity_{year}_Pct')
    })
else:
    comprehensive_final = pd.DataFrame()

# Rename ensemble data columns  
if not ensemble_data.empty:
    ensemble_renamed = ensemble_data.rename(columns={
        'Organization': 'Company',
        'year': 'Year',
        'median_score': 'Ensemble_Performance_Median_Score',
        'mean_score': 'Ensemble_Performance_Mean_Score', 
        'std_score': 'Ensemble_Performance_Std_Score',
        'min_score': 'Ensemble_Performance_Min_Score',
        'max_score': 'Ensemble_Performance_Max_Score',
        'q25_score': 'Ensemble_Performance_Q25_Score',
        'q75_score': 'Ensemble_Performance_Q75_Score',
        'iqr_score': 'Ensemble_Performance_IQR_Score',
        'range_score': 'Ensemble_Performance_Range_Score'
    })
else:
    ensemble_renamed = pd.DataFrame()

print("Column standardization complete")

# ==========================================
# STEP 4: Merge Datasets
# ==========================================

print("\nMerging comprehensive and ensemble data...")

if not comprehensive_final.empty and not ensemble_renamed.empty:
    # Merge on Company and Year
    master_dataset = pd.merge(
        comprehensive_final, 
        ensemble_renamed, 
        on=['Company', 'Year'], 
        how='outer',
        suffixes=('', '_ensemble')
    )
    
    print(f"Merge complete: {len(master_dataset)} records")
    print(f"  - Companies: {master_dataset['Company'].nunique()}")
    print(f"  - Years: {sorted(master_dataset['Year'].unique())}")
    
elif not comprehensive_final.empty:
    master_dataset = comprehensive_final.copy()
    print("Using comprehensive data only (ensemble data not available)")
    
elif not ensemble_renamed.empty:
    master_dataset = ensemble_renamed.copy()
    print("Using ensemble data only (comprehensive data not available)")
    
else:
    print("Error: No data available from either source")
    master_dataset = pd.DataFrame()

# ==========================================
# STEP 5: Final Data Organization and Validation
# ==========================================

if not master_dataset.empty:
    # Sort by Company and Year for consistency
    master_dataset = master_dataset.sort_values(['Company', 'Year']).reset_index(drop=True)
    
    # Round numerical columns for clean display
    numeric_columns = master_dataset.select_dtypes(include=[np.number]).columns
    master_dataset[numeric_columns] = master_dataset[numeric_columns].round(3)
    
    print(f"\nFinal dataset structure:")
    print(f"  - Shape: {master_dataset.shape}")
    print(f"  - Companies: {master_dataset['Company'].nunique()}")
    print(f"  - Years covered: {sorted(master_dataset['Year'].unique())}")
    print(f"  - Component score columns: {len([col for col in master_dataset.columns if 'Component_' in col])}")
    print(f"  - Ensemble statistic columns: {len([col for col in master_dataset.columns if 'Ensemble_' in col])}")

# ==========================================
# STEP 6: Export Master Dataset
# ==========================================

print(f"\nExporting master dataset...")

if not master_dataset.empty:
    output_path = "data/Performance/master_performance_dataset.xlsx"
    
    try:
        with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
            # Main dataset
            master_dataset.to_excel(writer, sheet_name='Master_Performance_Data', index=False)
            
            # Create data dictionary (columns without an entry fall back to 'Variable: <name>')
            column_descriptions = {
                'Company': 'Company name',
                'Year': 'Year (2021 or 2022)',
                'Component_Emission_Intensity_Score': 'Component score (0-100): Absolute emission intensity performance',
                'Component_Goal_Achievement_Score': 'Component score (0-100): Goal achievement rate performance',
                'Component_Target_Ambition_Score': 'Component score (0-100): Target ambition level assessment',
                'Component_Transparency_Score': 'Component score (0-100): Transparency in emission reporting',
                'Component_Suspicious_Changes_Penalty': 'Component penalty (-10 to 0): Suspicious target changes',
                'Component_Renewable_Energy_Bonus': 'Component bonus (0-10): Renewable energy performance',
                'Eikon_Emissions_Current_Year_tCO2e_M$': 'Eikon emission intensity for current year (tCO2e per M$ revenue)',
                'CDP_Emissions_Current_Year_tCO2e_M$': 'CDP self-reported emission intensity for current year (tCO2e per M$ revenue)',
                'Eikon_Emissions_Previous_Year_tCO2e_M$': 'Eikon emission intensity for previous year (tCO2e per M$ revenue)',
                'Industry_Avg_Emissions_tCO2e_M$': 'Industry average emission intensity (tCO2e per M$ revenue)',
                'Company_vs_Industry_Emissions_Ratio': 'Company to industry emission ratio (company/industry)',
                'Renewable_Energy_Current_Year_M$': 'Renewable energy per M$ revenue for current year',
                'Renewable_Energy_Previous_Year_M$': 'Renewable energy per M$ revenue for previous year',
                'Industry_Avg_Renewable_M$': 'Industry average renewable energy per M$ revenue',
                'Industry_vs_Company_Renewable_Ratio': 'Industry to company renewable ratio (industry/company)',
                'Ensemble_Performance_Median_Score': 'Ensemble median performance score across all weight combinations',
                'Ensemble_Performance_Mean_Score': 'Ensemble mean performance score across all weight combinations',
                'Ensemble_Performance_Std_Score': 'Ensemble standard deviation of performance scores',
                'Ensemble_Performance_Min_Score': 'Ensemble minimum performance score across all weight combinations',
                'Ensemble_Performance_Max_Score': 'Ensemble maximum performance score across all weight combinations',
                'Ensemble_Performance_Q25_Score': 'Ensemble 25th percentile performance score',
                'Ensemble_Performance_Q75_Score': 'Ensemble 75th percentile performance score',
                'Ensemble_Performance_IQR_Score': 'Ensemble interquartile range of performance scores',
                'Ensemble_Performance_Range_Score': 'Ensemble range (max-min) of performance scores',
            }
            data_dict = pd.DataFrame({
                'Column_Name': master_dataset.columns,
                'Data_Type': [str(master_dataset[col].dtype) for col in master_dataset.columns],
                'Description': [
                    column_descriptions.get(col, f'Variable: {col}')
                    for col in master_dataset.columns
                ],
                'Source': [
                    'Both' if col in ['Company', 'Year'] else
                    'Component Analysis' if 'Component_' in col else
                    'Component Analysis' if any(x in col for x in ['Eikon_', 'CDP_', 'Industry_', 'Target_', 'Renewable_', 'Transparency_', 'Actual_', 'Expected_', 'Goal_', 'Annual_']) else
                    'Ensemble Analysis' if 'Ensemble_' in col else
                    'Unknown' for col in master_dataset.columns
                ]
            })
            
            data_dict.to_excel(writer, sheet_name='Data_Dictionary', index=False)
            
            # Summary statistics by year
            if 'Year' in master_dataset.columns:
                summary_2021 = master_dataset[master_dataset['Year'] == 2021].describe()
                summary_2022 = master_dataset[master_dataset['Year'] == 2022].describe()
                
                summary_2021.to_excel(writer, sheet_name='Summary_Stats_2021')
                summary_2022.to_excel(writer, sheet_name='Summary_Stats_2022')
        
        print(f"Master dataset exported successfully to: {output_path}")
        print(f"  - Master_Performance_Data: Complete dataset ({master_dataset.shape[0]} rows, {master_dataset.shape[1]} columns)")
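        print("  - Data_Dictionary: Column descriptions and sources")
        # The summary sheets are only written when a 'Year' column exists
        if 'Year' in master_dataset.columns:
            print("  - Summary_Stats_2021: Descriptive statistics for 2021")
            print("  - Summary_Stats_2022: Descriptive statistics for 2022")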
        
    except Exception as e:
        print(f"Error exporting master dataset: {e}")
else:
    print("Cannot export: Master dataset is empty")

# ==========================================
# STEP 7: Display Final Summary
# ==========================================

print("\n" + "=" * 60)
print("MASTER PERFORMANCE DATASET CREATION COMPLETE")
print("=" * 60)

if not master_dataset.empty:
    print(f"\nDATASET OVERVIEW:")
    print(f"  Total records: {len(master_dataset):,}")
    print(f"  Companies: {master_dataset['Company'].nunique()}")
    print(f"  Years: {sorted(master_dataset['Year'].unique())}")
    
    print(f"\nVARIABLE CATEGORIES:")
    component_cols = [col for col in master_dataset.columns if 'Component_' in col]
    ensemble_cols = [col for col in master_dataset.columns if 'Ensemble_' in col]
    emission_cols = [col for col in master_dataset.columns if any(x in col for x in ['Emission', 'tCO2e'])]
    target_cols = [col for col in master_dataset.columns if 'Target_' in col]
    renewable_cols = [col for col in master_dataset.columns if 'Renewable_' in col]
    
    print(f"  Component Scores: {len(component_cols)} variables")
    print(f"  Ensemble Statistics: {len(ensemble_cols)} variables") 
    print(f"  Emission Data: {len(emission_cols)} variables")
    print(f"  Target Information: {len(target_cols)} variables")
    print(f"  Renewable Energy: {len(renewable_cols)} variables")
    
    print("\nFILES CREATED:")
    print("  Master dataset: data/Performance/master_performance_dataset.xlsx")
    print("  (4 sheets: Master_Performance_Data, Data_Dictionary, Summary_Stats_2021, Summary_Stats_2022)")
    
    print("\nREADY FOR ANALYSIS:")
    print("  All performance variables are now in one dataset")
    print("  Clear variable names with units and sources")
    print("  Complete data dictionary included")
    print("  Summary statistics by year provided")
else:
    print("Master dataset creation failed - check input files")

print(f"\nVariable 'master_dataset' is available in memory for immediate use")
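
The data dictionary built in the cell above pairs each column name with a description and falls back to a generic `Variable: <name>` label for unlisted columns. A minimal sketch of that lookup pattern with a plain dict follows. The helper name `describe_columns`, the shortened mapping, and the example column `Target_Year` are illustrative assumptions, not part of the notebook.

```python
# Hypothetical, shortened mapping; the notebook's data dictionary has many more entries.
COLUMN_DESCRIPTIONS = {
    "Company": "Company name",
    "Year": "Year (2021 or 2022)",
    "Ensemble_Performance_Median_Score": (
        "Ensemble median performance score across all weight combinations"
    ),
}


def describe_columns(columns):
    """Return one {'Column_Name', 'Description'} record per column.

    Columns missing from COLUMN_DESCRIPTIONS get a generic 'Variable: <name>' label.
    """
    return [
        {
            "Column_Name": col,
            "Description": COLUMN_DESCRIPTIONS.get(col, f"Variable: {col}"),
        }
        for col in columns
    ]


# 'Target_Year' is an assumed column name used only to show the fallback label.
records = describe_columns(["Company", "Year", "Target_Year"])
for record in records:
    print(f"{record['Column_Name']}: {record['Description']}")
```

The list of records can be passed to `pd.DataFrame(records)` to obtain a table with the same two columns as the `Data_Dictionary` sheet.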

In [None]:
# Cell: Export Master Performance Data as CSV
# Place this AFTER the performance master dataset creation cell

print("EXPORTING MASTER PERFORMANCE DATA AS CSV")
print("=" * 50)

try:
    # Load the Master_Performance_Data sheet from the Excel file
    print("Loading Master_Performance_Data sheet...")
    
    master_performance_csv = pd.read_excel("data/Performance/master_performance_dataset.xlsx", 
                                          sheet_name='Master_Performance_Data')
    
    print(f"✓ Data loaded successfully:")
    print(f"  - Shape: {master_performance_csv.shape}")
    print(f"  - Companies: {master_performance_csv['Company'].nunique()}")
    print(f"  - Years: {sorted(master_performance_csv['Year'].unique())}")
    
    # Fix company name: Change Ørsted to Orsted
    print("\nApplying company name corrections...")
    # Count matching rows before the replacement so the change can be reported
    original_orsted_count = (master_performance_csv['Company'] == 'Ørsted').sum()
    master_performance_csv['Company'] = master_performance_csv['Company'].str.replace('Ørsted', 'Orsted', regex=False)
    
    if original_orsted_count > 0:
        print(f"✓ Changed {original_orsted_count} records from 'Ørsted' to 'Orsted'")
    else:
        print("  - No 'Ørsted' records found to change")
    
    # Export as CSV
    output_csv_path = "data/Performance/master_performance_data.csv"
    master_performance_csv.to_csv(output_csv_path, index=False)
    
    print(f"\n✓ CSV export successful!")
    print(f"  - File saved to: {output_csv_path}")
    print(f"  - Records exported: {len(master_performance_csv):,}")
    print(f"  - Variables exported: {master_performance_csv.shape[1]}")
    
    # Display file size info (KB reads better than MB for a file this small)
    import os
    file_size_kb = os.path.getsize(output_csv_path) / 1024  # Convert bytes to KB
    print(f"  - File size: {file_size_kb:.1f} KB")
    
    # Show first few rows and column names for verification
    print(f"\nFIRST 3 ROWS PREVIEW:")
    print(master_performance_csv.head(3).to_string())
    
    print(f"\nCOLUMN NAMES ({len(master_performance_csv.columns)} total):")
    for i, col in enumerate(master_performance_csv.columns, 1):
        print(f"  {i:2d}. {col}")
    
    print(f"\nREADY FOR UPLOAD:")
    print(f"  You can now upload: {output_csv_path}")
    
except Exception as e:
    print(f"Error exporting CSV: {e}")
    print("Trying to use in-memory data instead...")
    
    try:
        # Fallback: use the in-memory master_dataset DataFrame
        if 'master_dataset' in locals() and not master_dataset.empty:
            # Apply the same company name fix to in-memory data
            print("Applying company name corrections to in-memory data...")
            original_orsted_count = (master_dataset['Company'] == 'Ørsted').sum()
            master_dataset['Company'] = master_dataset['Company'].str.replace('Ørsted', 'Orsted', regex=False)
            
            if original_orsted_count > 0:
                print(f"✓ Changed {original_orsted_count} records from 'Ørsted' to 'Orsted'")
            
            output_csv_path = "data/Performance/master_performance_data.csv"
            master_dataset.to_csv(output_csv_path, index=False)
            
            print(f"✓ CSV export successful using in-memory data!")
            print(f"  - File saved to: {output_csv_path}")
            print(f"  - Records exported: {len(master_dataset):,}")
            print(f"  - Variables exported: {master_dataset.shape[1]}")
            
            # Display file size info (KB reads better than MB for a file this small)
            import os
            file_size_kb = os.path.getsize(output_csv_path) / 1024  # Convert bytes to KB
            print(f"  - File size: {file_size_kb:.1f} KB")
            
        else:
            print("No master_dataset data available in memory")
            
    except Exception as e2:
        print(f"Fallback also failed: {e2}")

print("\nCSV export process complete!")
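
The cell above converts 'Ørsted' to 'Orsted' with a single hard-coded substitution. If other company names with non-ASCII letters ever had to be handled, a more general helper might look like the sketch below. This is an illustration under assumptions: `to_ascii_name` and the explicit letter table are hypothetical and not part of the notebook. The explicit table is needed because letters such as `Ø` have no Unicode decomposition, so `unicodedata.normalize` alone leaves them unchanged.

```python
import unicodedata

# Letters without a Unicode decomposition need an explicit mapping.
# This table is illustrative and not exhaustive.
_EXPLICIT_LETTERS = str.maketrans({"Ø": "O", "ø": "o", "Æ": "AE", "æ": "ae", "ß": "ss"})


def to_ascii_name(name: str) -> str:
    """Replace accented or stroked letters in a name with plain ASCII letters."""
    # Step 1: handle letters that normalization cannot decompose.
    translated = name.translate(_EXPLICIT_LETTERS)
    # Step 2: split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", translated)
    # Step 3: drop the combining marks, keeping only the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(to_ascii_name("Ørsted"))
```

With pandas, the helper could be applied to the whole column through `master_performance_csv['Company'].map(to_ascii_name)`, assuming the column contains only strings.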