# CDP-Refinitiv Eikon Data 2022: Integration and Merging  

## Overview
This module combines the cleaned CDP and Refinitiv Eikon datasets to create a comprehensive performance dataset for the 14 sample companies. The integration preserves data source distinctions while enabling cross-validation between self-reported (CDP) and third-party verified (Eikon) metrics.

## Integration Process
- **Source prefixing**: Every column except "Organization" is prefixed with "CDP - " or "Eikon - " to keep the data source transparent
- **Inner join**: Merges on "Organization" so that only companies with data in both sources are kept
- **Revenue harmonization**: Overwrites the CDP revenue figure with the Eikon figure for CEZ and Solaria Energia y Medio Ambiente SA, where CDP revenue data was incomplete
- **Emission intensity calculations**: Creates standardized per-million-dollar revenue metrics for both sources
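
The four steps above can be sketched on toy data. This is a minimal illustration, not the notebook's actual pipeline: the companies `A` to `D`, all figures, the shortened column names, and the helper `add_prefix` are invented for the example. The `validate="one_to_one"` argument is an optional safeguard that raises an error if an organization appears more than once in either table.

```python
import pandas as pd

# Toy inputs with invented values (not taken from the CDP or Eikon files)
cdp = pd.DataFrame({
    "Organization": ["A", "B", "C"],
    "Sc1 (ton CO2e)": [1000.0, 500.0, 200.0],
    "Revenue (M$)": [100.0, None, 40.0],   # B has no CDP revenue
})
eikon = pd.DataFrame({
    "Organization": ["A", "B", "D"],
    "Sc1 (ton CO2e)": [1100.0, 450.0, 300.0],
    "Revenue (M$)": [110.0, 50.0, 60.0],
})

def add_prefix(df, prefix, key="Organization"):
    """Hypothetical helper: prefix every column except the join key."""
    return df.rename(columns={c: f"{prefix} - {c}" for c in df.columns if c != key})

# Source prefixing + inner join: only A and B appear in both tables
merged = pd.merge(
    add_prefix(cdp, "CDP"),
    add_prefix(eikon, "Eikon"),
    on="Organization",
    how="inner",
    validate="one_to_one",
)

# Revenue harmonization: overwrite CDP revenue with Eikon revenue for named companies
mask = merged["Organization"].isin(["B"])
merged.loc[mask, "CDP - Revenue (M$)"] = merged.loc[mask, "Eikon - Revenue (M$)"]

# Emission intensity: tonnes CO2e per million dollars of revenue, per source
for source in ("CDP", "Eikon"):
    merged[f"{source} - Sc1 (ton CO2e) / M$ Rev."] = (
        merged[f"{source} - Sc1 (ton CO2e)"] / merged[f"{source} - Revenue (M$)"]
    )

print(merged[["Organization", "CDP - Sc1 (ton CO2e) / M$ Rev.", "Eikon - Sc1 (ton CO2e) / M$ Rev."]])
```

Because the intensities for each source are divided by that source's own revenue, harmonizing the revenue first matters: without it, company `B` would have a missing CDP intensity.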

## Key Output Variables
- **Core metrics**: Scope 1, 2, and combined emission intensities from both sources
- **Renewable indicators**: Total renewable energy per revenue (Eikon only)
- **Target data**: Emission reduction targets and timelines with missing data flags
- **Supporting data**: Absolute emission values and revenue figures for transparency

## Data Completeness Assessment
The "Eikon - Missing Targets" flag is set to "Yes" for companies that lack either an emission reduction target or a target year in the Eikon data, and to "No" otherwise. It supports the completeness evaluation of the performance assessment framework.
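
As an illustration of the flag logic, the sketch below uses a vectorized equivalent of a row-wise check. The three companies and their target values are invented; only the column names follow the merged dataset.

```python
import pandas as pd

# Toy data with invented values; B lacks a target, C lacks a target year
targets = pd.DataFrame({
    "Organization": ["A", "B", "C"],
    "Eikon - Em. Red. Target (%)": [50.0, None, 30.0],
    "Eikon - Em. Red. Target Year": [2030.0, 2040.0, None],
})

target_cols = ["Eikon - Em. Red. Target (%)", "Eikon - Em. Red. Target Year"]

# True where at least one of the two target fields is missing
missing = targets[target_cols].isna().any(axis=1)

# Map the boolean to the "Yes"/"No" labels used in the dataset
targets["Eikon - Missing Targets"] = missing.map({True: "Yes", False: "No"})

print(targets[["Organization", "Eikon - Missing Targets"]])
```

A company is flagged as soon as one of the two fields is missing, so a target percentage without a year still counts as incomplete.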

## Final Dataset Note
This merged dataset contains all variables initially considered potentially relevant for the greenwashing risk assessment. However, the final analysis uses only select variables due to data quality constraints and standardization requirements. The comprehensive dataset supports flexibility for methodology refinement and validation processes.

## Output
The integrated dataset is saved as `CDP_Eikon_2022.xlsx` with 17 columns (the "Organization" identifier plus 16 variables) for 14 companies, and serves as the foundation for the performance score calculations.
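
A quick sanity check on the stated dimensions could look like the sketch below. The helper `check_shape` is hypothetical and not part of the notebook, and the toy frame exists only to demonstrate it. For the real output it would be called with the default expectations of 14 rows and 17 columns.

```python
import pandas as pd

def check_shape(df, expected_rows=14, expected_cols=17, key="Organization"):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    if df.shape != (expected_rows, expected_cols):
        problems.append(f"expected shape {(expected_rows, expected_cols)}, got {df.shape}")
    if df[key].duplicated().any():
        problems.append(f"duplicate values in '{key}'")
    return problems

# Toy frame with an intentionally duplicated organization
toy = pd.DataFrame({"Organization": ["A", "B", "B"], "x": [1, 2, 3]})
problems = check_shape(toy, expected_rows=3, expected_cols=2)
print(problems)
```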

In [None]:
import pandas as pd

# Load the CDP and Eikon datasets
cdp_file_path = "data/CDP/CDP_Final_df_notargets_2023.xlsx"
eikon_file_path = "data/Eikon/Eikon_Final_2022.xlsx"

cdp_df = pd.read_excel(cdp_file_path)
eikon_df = pd.read_excel(eikon_file_path)

# Add prefixes to all columns except 'Organization' before merging
cdp_df = cdp_df.rename(columns={col: f"CDP - {col}" for col in cdp_df.columns if col != "Organization"})
eikon_df = eikon_df.rename(columns={col: f"Eikon - {col}" for col in eikon_df.columns if col != "Organization"})

# Merge with an inner join to keep only organizations present in both DataFrames
combined_df = pd.merge(cdp_df, eikon_df, on="Organization", how="inner")

# Print the combined DataFrame to check the result
print(combined_df)


In [None]:
# Organizations whose incomplete CDP revenue is replaced with the Eikon figure
orgs_to_update = ['CEZ', 'Solaria Energia y Medio Ambiente SA']

# Overwrite the CDP revenue for each listed organization found in the merged data
for org in orgs_to_update:
    idx = combined_df[combined_df['Organization'] == org].index
    if not idx.empty:
        combined_df.loc[idx, 'CDP - Revenue (M$)'] = combined_df.loc[idx, 'Eikon - Revenue (M$)']


In [None]:
# Add 6 new columns for emissions per revenue (M$)
combined_df['CDP - Sc1 (ton CO2e) / M$ Rev.'] = combined_df['CDP - Gross Global Sc1 (ton CO2e)'] / combined_df['CDP - Revenue (M$)']
combined_df['CDP - Sc2 (ton CO2e) / M$ Rev.'] = combined_df['CDP - Gross Global Sc2 (ton CO2e)'] / combined_df['CDP - Revenue (M$)']
combined_df['CDP - Sc1+2 (ton CO2e) / M$ Rev.'] = combined_df['CDP - Scope 1+2'] / combined_df['CDP - Revenue (M$)']

combined_df['Eikon - Sc1 (ton CO2e) / M$ Rev.'] = combined_df['Eikon - CO2 Em. Scope 1 (ton CO2e)'] / combined_df['Eikon - Revenue (M$)']
combined_df['Eikon - Sc2 (ton CO2e) / M$ Rev.'] = combined_df['Eikon - CO2 Em. Scope 2 (ton CO2e)'] / combined_df['Eikon - Revenue (M$)']
combined_df['Eikon - Sc1+2 (ton CO2e) / M$ Rev.'] = (
    (combined_df['Eikon - CO2 Em. Scope 1 (ton CO2e)'] + combined_df['Eikon - CO2 Em. Scope 2 (ton CO2e)']) / combined_df['Eikon - Revenue (M$)']
)

combined_df

In [None]:
# Create the new column 'Eikon - Missing Targets'
combined_df['Eikon - Missing Targets'] = combined_df.apply(
    lambda row: 'Yes' if pd.isna(row['Eikon - Em. Red. Target (%)']) or pd.isna(row['Eikon - Em. Red. Target Year']) else 'No',
    axis=1
)

# Print the updated DataFrame to check the new column
print(combined_df[['Organization', 'Eikon - Missing Targets']])


In [None]:
# Select only the required columns in the specified order
columns_to_keep = [
    "Organization", 
    "CDP - Sc1+2 (ton CO2e) / M$ Rev.",
    "Eikon - Sc1+2 (ton CO2e) / M$ Rev.",
    "CDP - Sc1 (ton CO2e) / M$ Rev.",
    "Eikon - Sc1 (ton CO2e) / M$ Rev.",
    "CDP - Sc2 (ton CO2e) / M$ Rev.",
    "Eikon - Sc2 (ton CO2e) / M$ Rev.",
    "Eikon - Total Renewable Energy / Rev. (M$)",
    "Eikon - Em. Red. Target Year",
    "Eikon - Em. Red. Target (%)",
    "Eikon - Missing Targets",
    "CDP - Gross Global Sc1 (ton CO2e)", 
    "Eikon - CO2 Em. Scope 1 (ton CO2e)", 
    "CDP - Gross Global Sc2 (ton CO2e)", 
    "Eikon - CO2 Em. Scope 2 (ton CO2e)", 
    "CDP - Revenue (M$)",
    "Eikon - Revenue (M$)",
]

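# Filter the DataFrame to keep only these columns;
# .copy() avoids a SettingWithCopyWarning on the assignment below
df_filtered = combined_df[columns_to_keep].copy()

# Convert the target percentage to numeric, stripping any '%' sign;
# astype(str) keeps this working if the column was already read as numeric
df_filtered["Eikon - Em. Red. Target (%)"] = pd.to_numeric(
    df_filtered["Eikon - Em. Red. Target (%)"].astype(str).str.replace('%', '', regex=False),
    errors='coerce'
)

# Round numeric columns to two decimals
df_filtered = df_filtered.round(2)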



In [None]:
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
from openpyxl.utils import get_column_letter

# Define the output path
output_path = "data/Eikon/CDP_Eikon_2022.xlsx"  # You can change this path if needed

# Save the DataFrame to Excel
df_filtered.to_excel(output_path, index=False, engine="openpyxl")

# Load the workbook and sheet
wb = load_workbook(output_path)
ws = wb.active  # There's only one sheet since we saved just one DataFrame

# Auto-adjust column widths based on the longest string in each column
for col in ws.columns:
    max_length = 0
    col_letter = get_column_letter(col[0].column)
    for cell in col:
        if cell.value is not None:  # count zero values too, skip only empty cells
            max_length = max(max_length, len(str(cell.value)))
    ws.column_dimensions[col_letter].width = max_length + 3  # Add padding

# Define grey fill for alternating rows
grey_fill = PatternFill(start_color="D9D9D9", end_color="D9D9D9", fill_type="solid")

# Alternate row colors by company
prev_company = None
use_grey = False
for row in range(2, ws.max_row + 1):
    current_company = ws[f"A{row}"].value  # Column A has the company names
    if current_company != prev_company:
        use_grey = not use_grey
        prev_company = current_company

    if use_grey:
        for col in range(1, ws.max_column + 1):
            ws.cell(row=row, column=col).fill = grey_fill

# Save the final cleaned and formatted workbook
wb.save(output_path)
