# External Validation Using Documented Greenwashing Cases

## Overview
This module conducts external validation of the GRAT by testing its ability to distinguish between companies with documented greenwashing accusations and those with clean environmental records. It implements the validation approach outlined in the methodology using real-world "ground truth" data to assess discriminant validity.

## Validation Framework
- **Data source**: Ensemble analysis results (`ensemble_results_summary_ensforperf.xlsx`, `Company_Averages` sheet)
- **Primary metric**: Each company's median greenwashing score averaged over 2021 and 2022 (`Avg_Median_Score`), chosen to reflect consistent risk patterns rather than yearly fluctuations
- **Validation approach**: Systematic comparison of known positive cases against clean record companies using non-parametric statistical testing
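
How `Avg_Median_Score` is produced is not shown in this module, so the following is only a minimal sketch of one plausible reading of "average median score": take the median score per company and year, then average those yearly medians. The toy data and the column names `Year` and `Score` are assumptions made for illustration; only `Organization` and `Avg_Median_Score` appear in the notebook.

```python
import pandas as pd

# Toy per-document scores (hypothetical values, not GRAT output)
toy = pd.DataFrame({
    "Organization": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Year":         [2021, 2021, 2022, 2022, 2021, 2021, 2022, 2022],
    "Score":        [10, 20, 30, 50, 5, 15, 25, 35],
})

# Step 1: median score per company and year
yearly_median = toy.groupby(["Organization", "Year"])["Score"].median()

# Step 2: average the yearly medians per company
avg_median = (
    yearly_median.groupby(level="Organization").mean().rename("Avg_Median_Score")
)

print(avg_median)
```

Averaging yearly medians gives each year equal weight regardless of how many documents were scored in it, which matches the stated goal of damping yearly fluctuations.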

## Company Classification Groups
Groups are based on a systematic online search covering all 14 sample companies for the period January 2020 to December 2023:

### Known Positive Cases (n=3)
- **CEZ**: Documented greenwashing accusations from credible sources
- **Ørsted**: Documented environmental claims controversies  
- **PGE**: Documented greenwashing-related criticisms
- **Classification basis**: Multi-language search across Greenpeace offices, investigative outlets, consumer protection agencies, and energy sector publications

### Clean Record Companies (n=11)
- **All remaining companies**: No documented greenwashing accusations found within the timeframe
- **Historical cases excluded**: Companies whose only accusations fall outside the 2020-2023 window are classified as clean record
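
The grouping above can be attached to the data as a label column. The sketch below is an illustration under assumptions: the helper `label_validation_groups`, the alias map `NAME_ALIASES`, and the toy company names are hypothetical, and the spelling used in the actual spreadsheet (for example `Ørsted` versus `Orsted`) should be checked before relying on it.

```python
import pandas as pd

KNOWN_POSITIVE = {"CEZ", "Orsted", "PGE"}

# Hypothetical spelling variants mapped to the names used in the analysis code
NAME_ALIASES = {"Ørsted": "Orsted"}


def label_validation_groups(df, name_col="Organization"):
    """Return a copy of df with a 'Group' column; fail if a known case is missing."""
    canonical = df[name_col].str.strip().replace(NAME_ALIASES)
    missing = KNOWN_POSITIVE - set(canonical)
    if missing:
        raise ValueError(f"Known positive cases not found: {sorted(missing)}")
    labelled = df.copy()
    labelled["Group"] = canonical.isin(KNOWN_POSITIVE).map(
        {True: "Known Positive", False: "Clean Record"}
    )
    return labelled


# Toy input: "Utility A" and "Utility B" are placeholder names
toy = pd.DataFrame(
    {"Organization": ["CEZ", " Ørsted", "PGE", "Utility A", "Utility B"]}
)
labelled = label_validation_groups(toy)
print(labelled["Group"].value_counts())
```

Raising on a missing name avoids silently moving a known positive case into the clean record group when the spreadsheet spelling differs.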

## Statistical Testing Methodology
**Primary test**: Mann-Whitney U test (Wilcoxon rank-sum test)
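- **Hypothesis**: H₀: no difference between groups; H₁: known positive cases tend to score higher than clean record companies (one-tailed test)
- **Justification**: Non-parametric test appropriate for small samples because it makes no normality assumption
- **Effect size**: Rank-biserial correlation, r = (2U)/(n₁ × n₂) - 1, where U is the statistic for the known positive group, so that positive r means known positive cases tend to score higher. This is equal in magnitude to 1 - (2U)/(n₁ × n₂); the sign depends on which group's U is used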
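
Because the sign of the rank-biserial correlation depends on which group's U statistic is used, a small worked example may help. The sketch below computes U for the first group directly from pairwise comparisons (ties count as one half) and converts it with r = 2U/(n₁ × n₂) - 1. The function name `rank_biserial` and the toy values are assumptions for illustration, not part of the GRAT code.

```python
import numpy as np


def rank_biserial(group_a, group_b):
    """Return (U_a, r), where r > 0 means group_a tends to score higher."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    # All pairwise differences a_i - b_j
    diff = a[:, None] - b[None, :]
    # U for group_a: pairs where a wins, plus half of the tied pairs
    u_a = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    r = 2.0 * u_a / (a.size * b.size) - 1.0
    return u_a, r


# Toy data with complete separation: every value in toy_high beats toy_low
toy_high = [7.0, 8.0, 9.0]
toy_low = [1.0, 2.0, 3.0, 4.0]

u, r = rank_biserial(toy_high, toy_low)
print(u, r)  # 12.0 1.0

# Swapping the groups flips the sign but not the magnitude
print(rank_biserial(toy_low, toy_high))
```

With this convention r ranges from -1 (every clean record company outscores every known positive case) to +1 (the reverse), and r = 0 means neither group tends to score higher.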

## Key Validation Metrics Calculated
1. **Group score distributions**: Median, mean, and range for both validation groups
2. **Statistical significance**: p-value testing at α = 0.05 significance level
3. **Effect size interpretation**: Small (≥0.1), medium (≥0.3), or large (≥0.5) effects
4. **Ranking analysis**: Position of known positive cases within overall company rankings
5. **Threshold analysis**: Number of known positive cases scoring above clean record median

## Critical Statistical Limitations
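- **Sample size constraints**: With only 14 observations (3 known positive, 11 clean record), statistical power is severely limited
- **Small-sample test behaviour**: The Mann-Whitney U test does not rely on the Central Limit Theorem, but at this sample size the normal approximation to U is unreliable, so p-values should come from the exact null distribution. With n₁ = 3 and n₂ = 11 and no ties, the smallest attainable one-tailed p-value is 1/364 ≈ 0.003
- **Validation interpretation**: Results should be considered **indicative only**, not conclusive evidence
- **Random phenomenon possibility**: Observed patterns could reflect chance variation in the 2021-2022 scores rather than systematic GRAT accuracy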
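
The small-sample constraint can be made concrete by enumerating the exact null distribution of U for group sizes 3 and 11. The sketch below assumes no tied scores and uses only the standard library; the helper names `exact_u_counts` and `one_tailed_p` are hypothetical and are not part of the notebook.

```python
from itertools import combinations
from math import comb


def exact_u_counts(n1, n2):
    """Count how often each U value occurs over all rank assignments (no ties)."""
    counts = {}
    for ranks in combinations(range(1, n1 + n2 + 1), n1):
        # U for the first group from its rank sum
        u1 = sum(ranks) - n1 * (n1 + 1) // 2
        counts[u1] = counts.get(u1, 0) + 1
    return counts


def one_tailed_p(u_observed, n1, n2):
    """P(U >= u_observed) under H0, i.e. the 'greater' alternative."""
    counts = exact_u_counts(n1, n2)
    total = comb(n1 + n2, n1)
    return sum(c for u, c in counts.items() if u >= u_observed) / total


n1, n2 = 3, 11
total_arrangements = comb(n1 + n2, n1)  # 364 equally likely rank assignments
min_p = one_tailed_p(n1 * n2, n1, n2)   # all 3 known cases ranked on top

print(f"Arrangements: {total_arrangements}")
print(f"Smallest attainable one-tailed p-value: {min_p:.4f}")

# Smallest U that would still be significant at alpha = 0.05
alpha = 0.05
critical_u = min(u for u in range(n1 * n2 + 1) if one_tailed_p(u, n1, n2) < alpha)
print(f"U must be at least {critical_u} of a possible {n1 * n2}")
```

Only a limited set of rank patterns can reach significance at these group sizes, which is why a non-significant result here says little about whether a real difference exists.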

## External Validation Outputs
- **Discriminant validity assessment**: Whether GRAT can separate documented cases from clean records
- **Effect size quantification**: Magnitude of difference between validation groups  
- **Ranking validation**: Positional analysis of known cases within risk score distribution
- **Statistical confidence measures**: p-values and confidence intervals (with appropriate caveats about small sample limitations)

This validation provides preliminary evidence for GRAT effectiveness. A larger dataset would be needed for adequately powered tests and definitive conclusions.

# Known Cases

In [None]:
# Known Case Validation Analysis
# Framework Validation Using Mann-Whitney U Test

import pandas as pd
import numpy as np
from scipy.stats import mannwhitneyu
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
file_path = "data/Greenwashing Results/ensemble_results_summary_ensforperf.xlsx"
df = pd.read_excel(file_path, sheet_name="Company_Averages")

print("Data loaded successfully:")
print(df.head())
print(f"\nDataset shape: {df.shape}")

In [None]:
# Clean and prepare the data
# Use the Avg_Median_Score as our primary validation metric
companies = df['Organization'].tolist()
scores = df['Avg_Median_Score'].tolist()

print("Companies and their average median scores:")
for company, score in zip(companies, scores):
    print(f"{company}: {score:.2f}")

In [None]:
# Define validation groups based on systematic search results
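known_positive_cases = ['CEZ', 'Orsted', 'PGE']  # Companies with documented greenwashing accusations

# Fail early if a name does not match the spreadsheet spelling (e.g. 'Ørsted' vs 'Orsted')
missing = [company for company in known_positive_cases if company not in companies]
if missing:
    raise ValueError(f"Known positive cases not found in data: {missing}")

clean_record_companies = [company for company in companies if company not in known_positive_cases]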

print("=== VALIDATION GROUPS ===")
print(f"\nKnown Positive Cases (n={len(known_positive_cases)}):")
for company in known_positive_cases:
    idx = companies.index(company)
    print(f"  {company}: {scores[idx]:.2f}")

print(f"\nClean Record Companies (n={len(clean_record_companies)}):")
for company in clean_record_companies:
    idx = companies.index(company)
    print(f"  {company}: {scores[idx]:.2f}")

In [None]:
# Extract scores for statistical testing
positive_scores = [scores[companies.index(company)] for company in known_positive_cases]
clean_scores = [scores[companies.index(company)] for company in clean_record_companies]

print("=== SCORE DISTRIBUTIONS ===")
print(f"\nKnown Positive Cases:")
print(f"  Scores: {positive_scores}")
print(f"  Median: {np.median(positive_scores):.2f}")
print(f"  Mean: {np.mean(positive_scores):.2f}")
print(f"  Range: {min(positive_scores):.2f} - {max(positive_scores):.2f}")

print(f"\nClean Record Companies:")
print(f"  Scores: {clean_scores}")
print(f"  Median: {np.median(clean_scores):.2f}")
print(f"  Mean: {np.mean(clean_scores):.2f}")
print(f"  Range: {min(clean_scores):.2f} - {max(clean_scores):.2f}")

In [None]:
# Perform Mann-Whitney U test
# H0: No difference between groups
# H1: Known positive cases have higher scores (one-tailed test)

statistic, p_value = mannwhitneyu(positive_scores, clean_scores, alternative='greater')

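# Calculate effect size (rank-biserial correlation)
# SciPy returns U for the first sample (positive_scores), so r = 2U/(n1*n2) - 1
# is positive when known positive cases tend to score higher.
n1, n2 = len(positive_scores), len(clean_scores)
r = (2 * statistic) / (n1 * n2) - 1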

print("=== MANN-WHITNEY U TEST RESULTS ===")
print(f"\nU-statistic: {statistic}")
print(f"p-value (one-tailed): {p_value:.4f}")
print(f"Effect size (r): {r:.3f}")

# Effect size interpretation
if abs(r) >= 0.5:
    effect_interpretation = "Large effect"
elif abs(r) >= 0.3:
    effect_interpretation = "Medium effect"
elif abs(r) >= 0.1:
    effect_interpretation = "Small effect"
else:
    effect_interpretation = "Negligible effect"

print(f"Effect size interpretation: {effect_interpretation}")

# Statistical significance
alpha = 0.05
if p_value < alpha:
    significance = "Statistically significant"
else:
    significance = "Not statistically significant"

print(f"Statistical significance (α = {alpha}): {significance}")

In [None]:
# Additional validation metrics
# Company rankings and individual case analysis

# Create combined dataset with rankings
all_data = list(zip(companies, scores))
all_data.sort(key=lambda x: x[1], reverse=True) # Sort by score (highest first)

print("=== COMPANY RANKINGS ===")
print("Rank | Company | Score | Group")
print("-" * 40)

positive_ranks = []
for rank, (company, score) in enumerate(all_data, 1):
    group = "Known Positive" if company in known_positive_cases else "Clean Record"
    print(f"{rank:2d}   | {company:<8s} | {score:5.2f} | {group}")
    
    if company in known_positive_cases:
        positive_ranks.append(rank)

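# Derive the top-half cutoff from the number of companies instead of hardcoding 7
top_half_cutoff = len(all_data) // 2
print(f"\nKnown positive cases rank positions: {positive_ranks}")
print(f"All known positive cases in top half (≤{top_half_cutoff}): {all(rank <= top_half_cutoff for rank in positive_ranks)}")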

# Count how many known positive cases score above clean record median
clean_median = np.median(clean_scores)
positive_above_clean_median = sum(1 for score in positive_scores if score > clean_median)
print(f"\nKnown positive cases scoring above clean record median ({clean_median:.2f}): {positive_above_clean_median}/{len(positive_scores)}")