# Refinitiv Eikon Data 2022: Variable Extraction

## Overview
This module extracts environmental and financial variables from per-company Excel files exported from Refinitiv Eikon Datastream. The files were exported beforehand through the SUSFIN category at Erasmus University Rotterdam; each company's ESG data is stored as a separate Excel file in the `data/Eikon/ESG/` directory.

## Data Source Structure  
- **Source**: 27 individual Excel files (one per company) exported from Datastream
- **Target sheet**: "Environment" tab within each company file
- **Target year**: 2022, located by searching row 5 of the sheet for the column whose header cell contains "2022"
- **Companies**: 14 sample companies plus 13 additional European utilities for sector comparison
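
The column lookup described above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the helper name `find_year_column` and the toy sheet layout are assumptions. It relies on `pd.read_excel` treating the first Excel row as the header by default, which puts Excel row 5 at positional index 3.

```python
import pandas as pd

def find_year_column(sheet: pd.DataFrame, year: str = "2022", row_idx: int = 3):
    """Return the label of the first column whose cell in `row_idx` mentions `year`.

    Returns None when nothing matches, rather than silently falling back
    to the first column.
    """
    mask = sheet.iloc[row_idx].astype(str).str.contains(year, regex=False)
    return mask.idxmax() if mask.any() else None

# Toy stand-in for an "Environment" sheet (the layout is an assumption).
toy = pd.DataFrame({
    "A": [None, None, None, None, None],
    "B": ["", "", "", "Field", "Energy Use Total"],
    "C": ["", "", "", "2023-12-31", 10.0],
    "D": ["", "", "", "2022-12-31", 12.5],
})

year_col = find_year_column(toy)
print(year_col)
```

Returning `None` for a missing year makes it possible to skip a file explicitly; a bare `idxmax()` on an all-`False` mask would instead return the first column label.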

## Key Variables Extracted
- **Temporal data**: Period end date for alignment verification
- **Financial metrics**: Derived revenue and energy use per million USD of revenue
- **Energy data**: Total energy use, renewable energy ratios, and production figures
- **Emission data**: Scope 1, 2, and 3 CO2 emissions with intensity metrics
- **Target information**: Emission reduction targets and timelines
- **Validation data**: ESG scores and auditor information for transparency assessment

## Data Processing
- Robust text matching for variable identification across different file formats
- Revenue back-calculated from Scope 1 and 2 emissions and the emission intensity ratio, since revenue is not extracted directly
- Column renaming for standardization and brevity
- Final output saved as `Eikon_Final_2022.xlsx` with formatting optimizations
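
The text-matching step can be illustrated with a small, self-contained sketch. The function names `normalize_label` and `find_label_row` are hypothetical, and the sample labels are invented for the example. The idea is to normalize whitespace on both sides, try an exact match first, and only then fall back to a case-insensitive substring match.

```python
import re

def normalize_label(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

def find_label_row(labels, target):
    """Return the position of the first label matching `target`, or None.

    Exact matches (after normalization) win over substring matches,
    even when the substring match appears earlier in the list.
    """
    wanted = normalize_label(target)
    cleaned = [normalize_label(v) if isinstance(v, str) else None for v in labels]

    # Pass 1: exact match on the normalized text
    for pos, label in enumerate(cleaned):
        if label == wanted:
            return pos

    # Pass 2: case-insensitive substring match
    for pos, label in enumerate(cleaned):
        if label is not None and wanted.lower() in label.lower():
            return pos

    return None

# Invented labels standing in for column B of an export.
labels = [
    "Total Renewable Energy Purchased",
    None,
    "  Total   Renewable Energy ",
    "Total CO2 Emissions / Million in Revenue $ (tonnes)",
]

print(find_label_row(labels, "Total Renewable Energy"))
print(find_label_row(labels, "Total CO2 Emissions / Million in Revenue $"))
print(find_label_row(labels, "Scope 4"))
```

Running the exact pass first matters for labels that are prefixes of other labels, such as "Total Renewable Energy", where a substring-only search would pick up the wrong row.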

## Note
While comprehensive environmental data is extracted, the final analysis primarily uses emission intensity metrics and target information. Other variables serve as supporting data for validation and context.
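
Because emission intensity is the central metric, it may help to see how revenue can be backed out of it. The sketch below assumes the reported intensity equals (Scope 1 + Scope 2) divided by revenue in million USD, which is what the notebook's revenue calculation implies. The helper name `implied_revenue_musd` and the toy column names are hypothetical.

```python
import pandas as pd

def implied_revenue_musd(scope1, scope2, intensity):
    """Back out revenue (million USD) from emissions and emission intensity.

    Assumes intensity = (scope1 + scope2) / revenue. Non-numeric
    placeholders such as '--' and non-positive intensities yield NaN.
    """
    s1 = pd.to_numeric(scope1, errors="coerce")
    s2 = pd.to_numeric(scope2, errors="coerce")
    inten = pd.to_numeric(intensity, errors="coerce")
    inten = inten.where(inten > 0)  # avoid dividing by zero
    return ((s1 + s2) / inten).round(2)

# Toy data, not taken from any company file.
toy = pd.DataFrame({
    "scope1": [900.0, "--", 500.0],
    "scope2": [100.0, 50.0, 250.0],
    "intensity": [2.0, 1.5, 0.0],
})
toy["revenue_musd"] = implied_revenue_musd(toy["scope1"], toy["scope2"], toy["intensity"])
print(toy)
```

In the first row, 1000 tonnes at an intensity of 2 tonnes per million USD implies a revenue of 500 million USD. The other two rows stay missing because an input is unusable.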

In [None]:
import pandas as pd
import os

# List of companies to process
companies = [
    "AKENERJİ ELEKTRİK ÜRETİM A.Ş.",
    "Arendals Fossekompani ASA",
    "Atlantica Sustainable Infrastructure PLC",
    "CEZ",
    "EDF",
    "EDP - Energias de Portugal S.A.",
    "Endesa",
    "ERG S.p.A",
    "Ørsted",
    "Polska Grupa Energetyczna (PGE) SA",
    "Romande Energie Holding SA",
    "Scatec ASA",
    "Solaria Energia y Medio Ambiente SA",
    "Terna Energy S.A",
    "A2A",
    "Albioma",
    "AYDEM YENİLENEBİLİR ENERJİ A.Ş.",
    "ContourGlobal",
    "Drax Group",
    "EnBW Energie Baden-Württemberg AG",
    "ENGIE",
    "Fortum Oyj",
    "MYTILINEOS Holdings S.A.",
    "NEOEN SA",
    "RWE AG",
    "SSE",
    "VERBUND AG",
]

# Directory where the company files are stored
base_dir = "data/Eikon/ESG/"

# Variable names to extract from each company's sheet
variables = [
    "Period End Date", "ESG Report Auditor Name", "ESG Combined Score",
    "Resource Reduction Policy", "Resource Reduction Targets", 
    "Policy Energy Efficiency", "Targets Energy Efficiency",
    "Policy Environmental Supply Chain", "Total Energy Use / Million in Revenue $",
    "Energy Use Total", "Energy Purchased Direct", "Energy Produced Direct",
    "Indirect Energy Use", "Electricity Purchased", "Electricity Produced",
    "Renewable Energy Use Ratio", "Renewable Energy Supply", "Total Renewable Energy",
    "Renewable Energy Purchased", "Renewable Energy Produced",
    "Emission Reduction Target Percentage", "Emission Reduction Target Year",
    "Estimated CO2 Equivalents Emission Total", "Total CO2 Emissions / Million in Revenue $",
    "CO2 Equivalent Emissions Direct, Scope 1", "CO2 Equivalent Emissions Indirect, Scope 2",
    "CO2 Equivalent Emissions Indirect, Scope 3"
]

# Initialize an empty list to store each company's data
all_companies_data = []

# Loop through each company
for company_name in companies:
    # Construct the file path
    file_path = os.path.join(base_dir, f"{company_name}.xlsx")
    
    # Check if the file exists
    if not os.path.exists(file_path):
        print(f"File not found for company: {company_name}. Skipping.")
        continue

    # Load the specific sheet 'Environment'
    company_data = pd.read_excel(file_path, sheet_name="Environment")

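    # Locate the 2022 column by searching Excel row 5 (index 3, because row 1 becomes the header)
    year_mask = company_data.iloc[3].astype(str).str.contains("2022", regex=False)
    if not year_mask.any():
        # idxmax() on an all-False mask would silently return the first column
        print(f"No 2022 column found for company: {company_name}. Skipping.")
        continue
    year_column = year_mask.idxmax()  # Label of the first matching column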

    # Create a dictionary to store company name and corresponding variable values
    company_data_dict = {'Company Name': company_name}

    # Loop through the variables and extract their values
    for variable in variables:
        try:
            # Normalize whitespace in both the search term and the label column before matching
            clean_variable = variable.strip()
            
            # Clean Column B: strip whitespace and normalize internal spaces
            cleaned_column_b = (company_data.iloc[:, 1]
                              .str.strip()  # Remove leading/trailing spaces
                              .str.replace(r'\s+', ' ', regex=True))  # Normalize multiple spaces to single space
            
            # Try exact match first
            exact_match = cleaned_column_b == clean_variable
            if exact_match.any():
                row_idx = company_data[exact_match].index[0]
            else:
                # Try contains match (case insensitive)
                contains_match = cleaned_column_b.str.contains(clean_variable, na=False, case=False, regex=False)
                if contains_match.any():
                    row_idx = company_data[contains_match].index[0]
                else:
                    raise IndexError("Not found")
            
            # Add the value to the dictionary (row_idx is an index label, so use .loc)
            company_data_dict[variable] = company_data.loc[row_idx, year_column]
        
        except IndexError:
            company_data_dict[variable] = "Not found"
            # Debug output for the two intensity variables, whose labels vary between files
            if variable in ["Total Energy Use / Million in Revenue $", "Total CO2 Emissions / Million in Revenue $"]:
                print(f"DEBUG - Could not find '{variable}' for {company_name}")
                # Show up to three similar labels to help diagnose the mismatch
                similar = cleaned_column_b[cleaned_column_b.str.contains("Million in Revenue", na=False, case=False)]
                if len(similar) > 0:
                    print(f"  Similar matches found: {similar.tolist()[:3]}")

    # Append the company data dictionary to the list
    all_companies_data.append(company_data_dict)

# Create a DataFrame from the list of dictionaries
companies_df = pd.DataFrame(all_companies_data)

# Display the DataFrame
companies_df



In [None]:
import numpy as np

# Replace the placeholder strings '--' and 'Not found' with NaN in the entire DataFrame
companies_df.replace({'--': np.nan, 'Not found': np.nan}, inplace=True)

# Rows without a 'Period End Date' have no usable 2022 data and will be dropped
deleted_rows = companies_df[companies_df['Period End Date'].isna()]

# Keep only rows with a 'Period End Date'; .copy() avoids SettingWithCopyWarning
# when the cleaned frame is modified in later cells
companies_df_cleaned = companies_df[companies_df['Period End Date'].notna()].copy()

# Display the cleaned dataset
companies_df_cleaned


In [None]:
# Find the deleted rows
deleted_rows = companies_df[
    companies_df['Period End Date'].isin([None, '0', '--']) | 
    companies_df['Period End Date'].isna()
]

# Print the deleted rows
deleted_rows



In [None]:
# Clean the "Period End Date" to remove the time part
companies_df_cleaned['Period End Date'] = pd.to_datetime(companies_df_cleaned['Period End Date'], errors='coerce').dt.date

# Rename columns to shorter names, preserving all important information and adding units in brackets
companies_df_cleaned.rename(columns={
    "Company Name": "Organization",
    "ESG Report Auditor Name": "ESG Auditor",
    "ESG Combined Score": "ESG Score",
    "Resource Reduction Policy": "Res. Red. Policy",
    "Resource Reduction Targets": "Res. Red. Targets",
    "Policy Energy Efficiency": "Energy Eff. Policy",
    "Targets Energy Efficiency": "Energy Eff. Targets",
    "Policy Environmental Supply Chain": "Env. Supply Chain Policy",
    "Total Energy Use / Million in Revenue $": "Energy Use/Rev. (M$)",
    "Energy Use Total": "Energy Use Tot.",
    "Energy Purchased Direct": "Energy Purch. Dir.",
    "Energy Produced Direct": "Energy Prod. Dir.",
    "Indirect Energy Use": "Energy Use Ind.",
    "Electricity Purchased": "Elec. Purchased",
    "Electricity Produced": "Elec. Prod.",
    "Renewable Energy Use Ratio": "Renew. Energy Use (%)",
    "Renewable Energy Supply": "Renew. Energy Supply (%)",
    "Total Renewable Energy": "Total Renewable Energy",
    "Renewable Energy Purchased": "Renew. Energy Purch.",
    "Renewable Energy Produced": "Renew. Energy Prod.",
    "Emission Reduction Target Percentage": "Em. Red. Target (%)",
    "Emission Reduction Target Year": "Em. Red. Target Year",
    "Estimated CO2 Equivalents Emission Total": "Est. CO2 Em. Tot. (ton CO2e)",
    "Total CO2 Emissions / Million in Revenue $": "CO2 Em. / Rev. (M$)",
    "CO2 Equivalent Emissions Direct, Scope 1": "CO2 Em. Scope 1 (ton CO2e)",
    "CO2 Equivalent Emissions Indirect, Scope 2": "CO2 Em. Scope 2 (ton CO2e)",
    "CO2 Equivalent Emissions Indirect, Scope 3": "CO2 Em. Scope 3 (ton CO2e)"
}, inplace=True)



In [None]:
# Back out 'Revenue (M$)' from the intensity ratio, assuming
# intensity = (Scope 1 + Scope 2) / revenue, so revenue = (Scope 1 + Scope 2) / intensity
scope_1 = pd.to_numeric(companies_df_cleaned['CO2 Em. Scope 1 (ton CO2e)'], errors='coerce')
scope_2 = pd.to_numeric(companies_df_cleaned['CO2 Em. Scope 2 (ton CO2e)'], errors='coerce')
intensity = pd.to_numeric(companies_df_cleaned['CO2 Em. / Rev. (M$)'], errors='coerce')
companies_df_cleaned['Revenue (M$)'] = ((scope_1 + scope_2) / intensity).round(2)

In [None]:
# Calculate total renewable energy per million USD of revenue
companies_df_cleaned['Total Renewable Energy / Rev. (M$)'] = (
    pd.to_numeric(companies_df_cleaned['Total Renewable Energy'], errors='coerce')
    / companies_df_cleaned['Revenue (M$)']
).round(2)

In [None]:
# Define the output path
output_path = "data/Eikon/Eikon_Final_2022.xlsx"

# Save the cleaned DataFrame to an Excel file
companies_df_cleaned.to_excel(output_path, index=False, engine="openpyxl")


In [None]:
from openpyxl import load_workbook
from openpyxl.styles import PatternFill
from openpyxl.utils import get_column_letter

# Define the output path
output_path = "data/Eikon/Eikon_Final_2022.xlsx"  # You can change this path if needed

# Save the DataFrame to Excel
companies_df_cleaned.to_excel(output_path, index=False, engine="openpyxl")

# Load the workbook and sheet
wb = load_workbook(output_path)
ws = wb.active  # There's only one sheet since we saved just one DataFrame

# Auto-adjust column widths based on the longest string in each column
for col in ws.columns:
    max_length = 0
    col_letter = get_column_letter(col[0].column)
    for cell in col:
        if cell.value is not None:  # Count zeros too; skip only empty cells
            max_length = max(max_length, len(str(cell.value)))
    ws.column_dimensions[col_letter].width = max_length + 3  # Add padding

# Define grey fill for alternating rows
grey_fill = PatternFill(start_color="D9D9D9", end_color="D9D9D9", fill_type="solid")

# Alternate row colors by company
prev_company = None
use_grey = False
for row in range(2, ws.max_row + 1):
    current_company = ws[f"A{row}"].value  # Column A has the company names
    if current_company != prev_company:
        use_grey = not use_grey
        prev_company = current_company

    if use_grey:
        for col in range(1, ws.max_column + 1):
            ws.cell(row=row, column=col).fill = grey_fill

# Save the final cleaned and formatted workbook
wb.save(output_path)
