# ESG Data Exploration

This notebook introduces key ESG concepts and provides interactive examples for exploring ESG data.

## 1. Understanding ESG Components

ESG stands for Environmental, Social, and Governance:

- **Environmental**: Climate change, carbon emissions, water usage, waste management
- **Social**: Employee relations, diversity, human rights, community impact
- **Governance**: Board composition, executive compensation, shareholder rights

Let's explore these components in real ESG reports.

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import pdfplumber
import re
from collections import Counter

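# Set style for better visualizations
# (the 'seaborn' matplotlib style name is deprecated in recent matplotlib
# releases, so use seaborn's own theme setter instead)
sns.set_theme(style='darkgrid', palette='husl')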

## 2. ESG Keywords and Metrics

Let's define some common ESG keywords to look for in reports:

In [None]:
esg_keywords = {
    'environmental': [
        'carbon emissions', 'climate change', 'renewable energy', 'waste management',
        'water consumption', 'biodiversity', 'pollution', 'recycling'
    ],
    'social': [
        'diversity', 'inclusion', 'human rights', 'employee safety', 'community',
        'labor practices', 'data privacy', 'health'
    ],
    'governance': [
        'board diversity', 'executive compensation', 'shareholder rights',
        'ethics', 'compliance', 'transparency', 'risk management'
    ]
}
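
One thing to watch with a keyword list like this: some terms are contained in others. For example, `diversity` appears inside `biodiversity`, so a plain substring search can inflate the social count with environmental text. The sketch below compares the two matching styles on a made-up sentence. The helper names `count_substring` and `count_whole_word` are illustrative and not part of any library.

```python
import re

# Made-up sentence for illustration only
sample = "Biodiversity loss is a risk. Board diversity improved, and diversity training expanded."

def count_substring(keyword, text):
    """Count every occurrence, including hits inside longer words."""
    return len(re.findall(re.escape(keyword), text.lower()))

def count_whole_word(keyword, text):
    """Count only occurrences delimited by word boundaries."""
    return len(re.findall(r'\b' + re.escape(keyword) + r'\b', text.lower()))

substring_hits = count_substring('diversity', sample)
whole_word_hits = count_whole_word('diversity', sample)
print(substring_hits, whole_word_hits)
```

Whole-word matching excludes the `biodiversity` hit, but `board diversity` still counts toward `diversity` as well, so overlapping phrases across categories may need separate handling.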

## 3. Analyzing ESG Report Content

Let's analyze our sample ESG report to see the frequency of ESG-related terms:

In [None]:
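def analyze_esg_content(pdf_path):
    """Count occurrences of each ESG keyword in a PDF report."""
    # Extract text; extract_text() may return None or '' for pages without text
    with pdfplumber.open(pdf_path) as pdf:
        text = '\n'.join(page.extract_text() or '' for page in pdf.pages)
    
    # Lowercase once and collapse whitespace so phrases split across lines still match
    text = re.sub(r'\s+', ' ', text.lower())
    
    # Count ESG keywords as whole words (so 'diversity' does not match 'biodiversity')
    keyword_counts = {category: {} for category in esg_keywords}
    for category, words in esg_keywords.items():
        for word in words:
            count = len(re.findall(r'\b' + re.escape(word) + r'\b', text))
            if count > 0:
                keyword_counts[category][word] = count
    
    return keyword_counts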

# Path to your PDF file
pdf_path = os.path.join('..', 'data', 'totalenergies_sustainability-climate-2024-progress-report_2024_en_pdf.pdf')

# Analyze content if file exists
if os.path.exists(pdf_path):
    keyword_counts = analyze_esg_content(pdf_path)
    
    # Plot results, one panel per category
    plt.figure(figsize=(15, 5))
    for i, (category, counts) in enumerate(keyword_counts.items()):
        plt.subplot(1, 3, i+1)
        # Title every panel so an empty category is still identifiable
        plt.title(f'{category.capitalize()} Keywords')
        if counts:  # If we found any keywords
            plt.bar(list(counts.keys()), list(counts.values()))
            plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.show()
else:
    print("PDF file not found. Please check the path.")
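
The per-keyword counts can also be rolled up into one total per category, which makes it easier to see how a report balances the three pillars. This is a minimal sketch: `summarize_categories` is a hypothetical helper, and the counts below are made-up numbers in the same shape as the `keyword_counts` dictionary, not values from any report.

```python
# Illustrative counts only (made-up numbers, not taken from any report)
example_counts = {
    'environmental': {'climate change': 12, 'renewable energy': 8},
    'social': {'diversity': 3, 'health': 2},
    'governance': {},
}

def summarize_categories(keyword_counts):
    """Return total hits and share of all hits for each category."""
    totals = {cat: sum(counts.values()) for cat, counts in keyword_counts.items()}
    grand_total = sum(totals.values())
    return {
        cat: {'total': total, 'share': total / grand_total if grand_total else 0.0}
        for cat, total in totals.items()
    }

summary = summarize_categories(example_counts)
for cat, stats in summary.items():
    print(f"{cat:<14} {stats['total']:>3}  {stats['share']:.0%}")
```

Raw shares depend heavily on which keywords are in the list, so they are better read as a rough indication of emphasis than as a score.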

## 4. Exercise: ESG Metric Extraction

Try to extract specific ESG metrics from the report. Here's an example pattern to find carbon emission values:

In [None]:
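def extract_carbon_metrics(text):
    """Extract carbon-related metrics from text."""
    # Pattern for a number followed by "tons of CO2", "CO2", or "carbon dioxide".
    # Match case-insensitively instead of lowercasing the text: a literal 'CO2'
    # in the pattern can never match text that has been lowercased to 'co2'.
    pattern = r'(\d+(?:\.\d+)?)\s*(?:tons of CO2|CO2|carbon dioxide)'
    matches = re.finditer(pattern, text, flags=re.IGNORECASE)
    return [match.group() for match in matches]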

# Example usage (if PDF exists)
if os.path.exists(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() may return None or '' for pages without text
        text = '\n'.join(page.extract_text() or '' for page in pdf.pages)
    carbon_metrics = extract_carbon_metrics(text)
    print("Found carbon metrics:")
    for metric in carbon_metrics[:5]:  # Show first 5 matches
        print(f"- {metric}")
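
Matched strings are easier to analyze once they are split into a numeric value and a unit. The sketch below is one possible extension rather than a definitive parser: `CARBON_PATTERN` and `parse_carbon_metrics` are hypothetical names, the unit list is an assumption about how reports tend to write emissions, and the sample sentence is made up.

```python
import re

# Hypothetical pattern: a number, an optional mass unit, then a CO2 term
CARBON_PATTERN = re.compile(
    r'(\d[\d,]*(?:\.\d+)?)\s*'                      # value, e.g. 1,234.5
    r'(tonnes?|tons?|mt|kt|t)?\s*(?:of\s+)?'        # optional unit, optional "of"
    r'(co2e?|carbon dioxide)',                      # CO2 term
    re.IGNORECASE,
)

def parse_carbon_metrics(text):
    """Return (value, unit, term) tuples for each carbon metric found."""
    results = []
    for m in CARBON_PATTERN.finditer(text):
        value = float(m.group(1).replace(',', ''))  # drop thousands separators
        unit = (m.group(2) or '').lower()           # '' when no unit was written
        results.append((value, unit, m.group(3).lower()))
    return results

# Made-up sentence for illustration only
sample = "Scope 1 emissions fell to 1,234.5 kt CO2e, down from 2 Mt of CO2."
parsed = parse_carbon_metrics(sample)
print(parsed)
```

A pattern like this will also pick up years written directly before a CO2 term (for example "2030 CO2 emissions"), so the results are worth checking by hand before using them in analysis.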

## 5. Practice Exercises

Try these exercises to practice ESG data analysis:

1. Create a function to extract diversity metrics (e.g., the percentage of women in the workforce)
2. Generate a word cloud of governance-related terms
3. Find mentions of specific ESG targets or goals

Example solution for exercise 1:

In [None]:
def extract_diversity_metrics(text):
    """Extract diversity-related metrics from text."""
    # Pattern for a percentage immediately followed by a diversity-related
    # word (e.g. "35% women"); it does not match looser phrasing such as
    # "women make up 35%"
    pattern = r'(\d+(?:\.\d+)?%)\s*(?:women|diverse|minority|representation)'
    matches = re.finditer(pattern, text.lower())
    return [match.group() for match in matches]

# Your turn! Try implementing exercises 2 and 3
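
If you get stuck on exercise 3, here is one possible starting point. It treats a sentence as a target mention when it contains both a target-like word and a year from 2000 to 2099. The term list, the sentence splitter, and the function name `find_target_sentences` are assumptions made for this sketch, and the sample text is made up.

```python
import re

# Assumed vocabulary for target-like statements; extend as needed
TARGET_TERMS = re.compile(
    r'\b(?:targets?|goals?|aims?|commit\w*|pledg\w*|net[- ]zero)\b',
    re.IGNORECASE,
)
YEAR = re.compile(r'\b20\d{2}\b')

def find_target_sentences(text, max_results=5):
    """Return sentences that contain a target-like term and a 20xx year."""
    normalized = re.sub(r'\s+', ' ', text).strip()
    # Naive split: sentence-ending punctuation followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', normalized)
    hits = [s for s in sentences if TARGET_TERMS.search(s) and YEAR.search(s)]
    return hits[:max_results]

# Made-up text for illustration only
sample = (
    "We aim to reach net zero by 2050. "
    "Our offices recycle paper. "
    "The 2030 target is a 40% cut in methane emissions."
)
targets = find_target_sentences(sample)
for sentence in targets:
    print(f"- {sentence}")
```

Requiring a year keeps the output short but misses targets stated without a date, so try relaxing that condition and compare the results.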