# Leonardo - IRP 2025

## First steps
- Get access to an Imperial GitHub Enterprise account (instructions [here](https://imperialcollegelondon.github.io/#how-do-i-gain-access-to-github-enterprise-cloud))
- Access the MS Teams data-sharing folder
- Get acquainted with the data you will use
- Work in your project's repository

I use `uv`, `polars` and `ruff` instead of `pip`, `pandas` and other linters. I recommend them, but switching your tooling is not mandatory.

### Best Practices:
- Use `.env` for portability and sensitive information
- Never commit the `.env` file! Use `.env.sample` as a template to find and fill in unpopulated fields
- Datasets will be shared via OneDrive, and PDFs via an S3-type bucket. Use the `s3fs` library to access the bucket.
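
A minimal sketch of the `.env.sample` check described above, using only the standard library. The helper names `parse_env` and `unpopulated_fields` are hypothetical, as is the `S3_ENDPOINT` key; only `DATA_PATH` appears in this notebook. The parser handles plain `KEY=value` lines only (no quoting, no `export` prefix), so treat it as an illustration rather than a replacement for `python-dotenv`.

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def unpopulated_fields(sample_text: str, env_text: str) -> list[str]:
    """Return keys listed in the sample that are missing or empty in the real file."""
    sample = parse_env(sample_text)
    env = parse_env(env_text)
    return [key for key in sample if not env.get(key)]


# Hypothetical file contents, inlined so the example runs anywhere.
sample_text = "# template\nDATA_PATH=\nS3_ENDPOINT=\n"
env_text = "DATA_PATH=/home/me/data\n"

missing = unpopulated_fields(sample_text, env_text)
print(missing)
```

In practice the two strings would come from `Path(".env.sample").read_text()` and `Path(".env").read_text()`.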

## Three types of data
- Raw PDFs
- Text of initiatives
- Counts

# Getting familiar with the dataset

In [1]:
import polars as pl
from dotenv import dotenv_values
import json

env_vars = dotenv_values("../.env")
DATA_PATH = env_vars["DATA_PATH"]

lc_dataset = pl.read_csv(f"{DATA_PATH}/LC_dataset_v_1_1O.csv")
text_of_initiatives = pl.read_parquet(f"{DATA_PATH}/text_of_initiatives_v_1_1L.parquet")

In [40]:
text_of_initiatives.head()

id,action,sdg,Stakeholder,full_text,country,MacroRegion,Sector,Industry-group,Industry,Universe,company_name,rfyear
str,str,i32,str,str,str,str,str,str,str,str,str,str
"""319310_2021_606hh7g0109rim_ozd…","""modification of procedures""",12,"""local communities and society""","""Always with a view to making a…","""Italy""",,"""Consumer Discretionary""","""Consumer Durables & Apparel""","""Textiles, Apparel & Luxury Goo…","""Public""","""OVS Officine Valle Seriana Spa""","""2021"""
"""319310_2021_606hh7g0109rim_rcv…","""association""",12,"""local communities and society""","""we made more than 58 million g…","""Italy""",,"""Consumer Discretionary""","""Consumer Durables & Apparel""","""Textiles, Apparel & Luxury Goo…","""Public""","""OVS Officine Valle Seriana Spa""","""2021"""
"""319310_2021_606hh7g0109rim_hfs…","""new products""",12,"""local communities and society""","""Less is better is our circular…","""Italy""",,"""Consumer Discretionary""","""Consumer Durables & Apparel""","""Textiles, Apparel & Luxury Goo…","""Public""","""OVS Officine Valle Seriana Spa""","""2021"""
"""319310_2021_606hh7g0109rim_fss…","""communication""",12,"""local communities and society""","""Life Cycle Assessment LCA anal…","""Italy""",,"""Consumer Discretionary""","""Consumer Durables & Apparel""","""Textiles, Apparel & Luxury Goo…","""Public""","""OVS Officine Valle Seriana Spa""","""2021"""
"""319310_2021_606hh7g0109rim_npv…","""incentives""",12,"""local communities and society""","""Rewarding improvement The comm…","""Italy""",,"""Consumer Discretionary""","""Consumer Durables & Apparel""","""Textiles, Apparel & Luxury Goo…","""Public""","""OVS Officine Valle Seriana Spa""","""2021"""


In [38]:
text_of_initiatives["action"].value_counts().sort("action").to_dicts()

[{'action': 'adoption of standards and rules', 'count': 3316},
 {'action': 'assessment and measurement', 'count': 53433},
 {'action': 'asset modification', 'count': 61230},
 {'action': 'association', 'count': 48614},
 {'action': 'communication', 'count': 101236},
 {'action': 'donation & funding', 'count': 258987},
 {'action': 'incentives', 'count': 13011},
 {'action': 'modification of procedures', 'count': 96695},
 {'action': 'new products', 'count': 21105},
 {'action': 'organizational structuring', 'count': 13290},
 {'action': 'pricing', 'count': 1300},
 {'action': 'r&d investments', 'count': 17975},
 {'action': 'training', 'count': 136602},
 {'action': 'volunteerism', 'count': 49274}]

## LC/GOLDEN Dataset

In [41]:
lc_dataset.head()

mrg,predicted_company_name,predicted_report_year,predicted_report_type,searched_company_name,search_rank,source,pdfurl,pdf_source_path,pdffilename,md5_fingerprint,TYPE: adoption of standards and rules,TYPE: assessment and measurement,TYPE: asset modification,TYPE: association,TYPE: communication,TYPE: donation & funding,TYPE: incentives,TYPE: modification of procedures,TYPE: new products,TYPE: organizational structuring,TYPE: pricing,TYPE: r&d investments,TYPE: training,TYPE: volunteerism,SDG: 1,SDG: 10,SDG: 11,SDG: 12,SDG: 13,SDG: 14,SDG: 15,SDG: 16,SDG: 17,SDG: 2,SDG: 3,SDG: 4,…,SDG_SREC: SDG 8 - nothing,SDG_SREC: SDG 8 - shareholders,SDG_SREC: SDG 8 - suppliers,SDG_SREC: SDG 9 - customers,SDG_SREC: SDG 9 - employees,SDG_SREC: SDG 9 - environment,SDG_SREC: SDG 9 - local communities and society,SDG_SREC: SDG 9 - nothing,SDG_SREC: SDG 9 - shareholders,SDG_SREC: SDG 9 - suppliers,number_of_initiatives,num_tokenised_sentences,pdf_local_path,pdf_local_path_relative,json_filename,gvkey,rfyear,mixed_cname,fyear,available_date,Title,report_fyear,Source_Language,pdf_file_name,Period From,Period To,ids,full_isin_list,data_type,addition_date,CR Report No,conml,GICS_level_1,GICS_level_2,GICS_level_3,loc,MacroRegion
str,str,f64,str,str,str,str,str,str,str,str,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,…,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,i64,str,str,str,i64,i64,str,str,str,str,str,str,str,str,str,str,str,str,str,i64,str,str,str,str,str,str
"""1300-2024""","""Honeywell International Inc""",2024.0,"""sustainability report""",,,"""manual_CR_apr2025""",,,,"""d72de293f5c82641acfdd9b7f98f79…",0,3,6,3,2,8,1,1,0,0,0,0,6,2,0,0,0,3,0,0,3,2,0,0,2,9,…,0,0,0,0,0,0,0,0,0,0,32,1967,"""/srv/data/mrei/davinci/cr_repo…","""3112/235355-24Co-54131650T6219…","""235355-24Co-54131650T621902052…",1300,2024,,,,,,,"""235355-24Co-54131650T621902052…","""2023-01-01""","""2023-12-31""",,"""US4385161066,INE671A01010""","""Downloaded""","""0001-01-01""",235355,"""Honeywell International Inc""","""Industrials""","""Capital Goods""","""Industrial Conglomerates""","""USA""","""United States and Canada"""
"""2269-2024""","""H & R Block Inc""",2024.0,"""sustainability report""",,,"""manual_CR_apr2025""",,,,"""a48f8b70ccb99346f4c5ddf1328601…",0,3,0,2,1,15,0,4,1,0,0,0,7,5,1,0,4,8,0,1,0,3,0,3,4,8,…,0,0,0,0,0,0,0,0,0,0,38,870,"""/srv/data/mrei/davinci/cr_repo…","""11720/234795-24SG-54002850I309…","""234795-24SG-54002850I309873049…",2269,2024,,,,,,,"""234795-24SG-54002850I309873049…",,,,"""US093662AF15,US0936711052,US40…","""Downloaded""","""0001-01-01""",234795,"""Block H&R Inc""","""Consumer Discretionary""","""Consumer Services""","""Diversified Consumer Services""","""USA""","""United States and Canada"""
"""2285-2024""","""The Boeing Company""",2024.0,"""sustainability report""",,,"""manual_CR_apr2025""",,,,"""b7ca7f733c8199a9fc05dd9d771bc1…",0,0,0,5,1,18,0,3,0,2,0,0,4,2,0,1,1,2,0,0,0,1,0,0,6,11,…,0,0,0,0,0,0,0,0,0,0,35,5367,"""/srv/data/mrei/davinci/cr_repo…","""266/227673-24Su-63976113A91906…","""227673-24Su-63976113A919063536…",2285,2024,,,,,,,"""227673-24Su-63976113A919063536…","""2023-01-01""","""2023-12-31""",,"""US0970231058,US4825391034,US09…","""Downloaded""","""0001-01-01""",227673,"""The Boeing Company""","""Industrials""","""Industrial""","""Capital Goods""","""USA""","""United States and Canada"""
"""3157-2024""","""Coherent Corp""",2024.0,"""sustainability report""",,,"""manual_CR_apr2025""",,,,"""bc24f5f853f1ad7c9c9644ad247e43…",0,2,7,0,5,4,0,4,0,0,0,0,10,1,0,0,0,4,0,0,2,1,0,0,1,5,…,0,0,0,0,0,0,0,0,0,0,33,4292,"""/srv/data/mrei/davinci/cr_repo…","""37596/242645-24SG-60903895B932…","""242645-24SG-60903895B932921496…",3157,2024,,,,,,,"""242645-24SG-60903895B932921496…",,,,"""US1924791031,US19247G1076,US90…","""Downloaded""","""0001-01-01""",242645,"""Coherent Corp""","""Information Technology""","""Technology Hardware & Equipmen…","""Electronic Equipment, Instrume…","""USA""","""United States and Canada"""
"""3226-2024""","""Comcast Corporation""",2024.0,"""sustainability report""",,,"""manual_CR_apr2025""",,,,"""821981fd52eeda774c2bcfff0ea06a…",0,0,1,2,3,9,0,0,1,0,0,0,3,2,0,4,1,2,2,0,1,0,0,0,1,5,…,0,0,0,0,0,0,0,0,0,0,21,1584,"""/srv/data/mrei/davinci/cr_repo…","""5126/226163-24Co-49982023I3349…","""226163-24Co-49982023I334992635…",3226,2024,,,,,,,"""226163-24Co-49982023I334992635…","""2023-01-01""","""2023-12-31""",,"""US872287AF41,US63946DKC91,US20…","""Downloaded""","""0001-01-01""",226163,"""Comcast Corp""","""Communication Services""","""Communication Services""","""Media & Entertainment""","""USA""","""United States and Canada"""


## Columns

### PDF metadata

In [2]:
pdf_metadata = {
    "mrg": str,  # Ignore
    "searched_company_name": str,  # The company we were looking for when we downloaded the PDF
    "predicted_company_name": str,  # Equal to searched_company_name if our algorithm matched the PDF with the searched company name
    "predicted_report_year": float,  # Report year inferred from the PDF
    "predicted_report_type": str,  # Report type inferred from the PDF and the number of initiatives in the PDF
    "search_rank": str,  # Ignore
    "source": str,  # Ignore
    "pdfurl": str,  # Where the PDF was downloaded from
    "pdf_source_path": str,  # Ignore
    "pdffilename": str,  # Ignore
    "md5_fingerprint": str,  # 'Unique' identifier for the PDF
    "number_of_initiatives": int,  # Number of initiatives inferred from the PDF
    "num_tokenised_sentences": int,  # Number of sentences the PDF was broken into
    "pdf_local_path": str,  # Path to the PDF file on the server (used to identify it from the cloud)
    "pdf_local_path_relative": str,  # Ignore
    "json_filename": str,  # Ignore
    "rfyear": int,  # Reference fiscal year of the PDF. Either the same as predicted_report_year or manually provided
    "mixed_cname": str,  # Ignore
    "fyear": str,  # Ignore
    "available_date": str,  # Ignore
    "Title": str,  # Ignore
    "report_fyear": str,  # Ignore
    "Source_Language": str,  # Ignore
    "pdf_file_name": str,  # Ignore
    "Period From": str,  # Period of the actions reported in the PDF. Not available for all PDFs
    "Period To": str,  # Period of the actions reported in the PDF. Not available for all PDFs
    "data_type": str,  # Origin of the data: crawled, manually added, etc.
    "CR Report No": int,  # Report number provided by Corporate Register. Only available for reports purchased from Corporate Register.
}

## Initiative counts

### Total sums

In [3]:
# Types of actions
{col: v for col, v in lc_dataset.schema.to_python().items() if col.startswith("TYPE:")}

{'TYPE: adoption of standards and rules': int,
 'TYPE: assessment and measurement': int,
 'TYPE: asset modification': int,
 'TYPE: association': int,
 'TYPE: communication': int,
 'TYPE: donation & funding': int,
 'TYPE: incentives': int,
 'TYPE: modification of procedures': int,
 'TYPE: new products': int,
 'TYPE: organizational structuring': int,
 'TYPE: pricing': int,
 'TYPE: r&d investments': int,
 'TYPE: training': int,
 'TYPE: volunteerism': int}

In [4]:
# SDGs
{col: v for col, v in lc_dataset.schema.to_python().items() if col.startswith("SDG:")}

{'SDG: 1': int,
 'SDG: 10': int,
 'SDG: 11': int,
 'SDG: 12': int,
 'SDG: 13': int,
 'SDG: 14': int,
 'SDG: 15': int,
 'SDG: 16': int,
 'SDG: 17': int,
 'SDG: 2': int,
 'SDG: 3': int,
 'SDG: 4': int,
 'SDG: 5': int,
 'SDG: 6': int,
 'SDG: 7': int,
 'SDG: 8': int,
 'SDG: 9': int}
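
The schema lists the SDG columns in lexicographic order (`SDG: 1`, `SDG: 10`, ..., `SDG: 9`). For tables and plots a numeric order is usually easier to read. A minimal sketch, where `sdg_number` is a hypothetical helper name and the column list is a short excerpt of the names shown above:

```python
import re


def sdg_number(column: str) -> int:
    """Extract the SDG number from names such as 'SDG: 10' or 'training - SDG 8'."""
    match = re.search(r"SDG:? (\d+)", column)
    if match is None:
        raise ValueError(f"no SDG number found in {column!r}")
    return int(match.group(1))


columns = ["SDG: 1", "SDG: 10", "SDG: 11", "SDG: 2", "SDG: 9"]
ordered = sorted(columns, key=sdg_number)
print(ordered)
```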

In [5]:
# Stakeholder recipients
{
    col: v
    for col, v in lc_dataset.schema.to_python().items()
    if col.startswith("stakeholder_recipient_")
}

{'stakeholder_recipient_customers': int,
 'stakeholder_recipient_employees': int,
 'stakeholder_recipient_environment': int,
 'stakeholder_recipient_local communities and society': int,
 'stakeholder_recipient_nothing': int,
 'stakeholder_recipient_shareholders': int,
 'stakeholder_recipient_suppliers': int}

### Bi-granular counts

In [51]:
lc_dataset["training - SDG 8"].describe()

statistic,value
str,f64
"""count""",50520.0
"""null_count""",0.0
"""mean""",0.60576
"""std""",1.085444
"""min""",0.0
"""25%""",0.0
"""50%""",0.0
"""75%""",1.0
"""max""",25.0


In [52]:
# Type of action - SDG
{col: v for col, v in lc_dataset.schema.to_python().items() if " - SDG" in col}

{'adoption of standards and rules - SDG 1': int,
 'adoption of standards and rules - SDG 10': int,
 'adoption of standards and rules - SDG 11': int,
 'adoption of standards and rules - SDG 12': int,
 'adoption of standards and rules - SDG 13': int,
 'adoption of standards and rules - SDG 14': int,
 'adoption of standards and rules - SDG 15': int,
 'adoption of standards and rules - SDG 16': int,
 'adoption of standards and rules - SDG 17': int,
 'adoption of standards and rules - SDG 2': int,
 'adoption of standards and rules - SDG 3': int,
 'adoption of standards and rules - SDG 4': int,
 'adoption of standards and rules - SDG 5': int,
 'adoption of standards and rules - SDG 6': int,
 'adoption of standards and rules - SDG 7': int,
 'adoption of standards and rules - SDG 8': int,
 'adoption of standards and rules - SDG 9': int,
 'assessment and measurement - SDG 1': int,
 'assessment and measurement - SDG 10': int,
 'assessment and measurement - SDG 11': int,
 'assessment and measurem

In [7]:
# Type of action - Stakeholder recipient
{
    col: v
    for col, v in lc_dataset.schema.to_python().items()
    if col.startswith("TYPE_SREC:")
}

{'TYPE_SREC: adoption of standards and rules - customers': int,
 'TYPE_SREC: adoption of standards and rules - employees': int,
 'TYPE_SREC: adoption of standards and rules - environment': int,
 'TYPE_SREC: adoption of standards and rules - local communities and society': int,
 'TYPE_SREC: adoption of standards and rules - nothing': int,
 'TYPE_SREC: adoption of standards and rules - shareholders': int,
 'TYPE_SREC: adoption of standards and rules - suppliers': int,
 'TYPE_SREC: assessment and measurement - customers': int,
 'TYPE_SREC: assessment and measurement - employees': int,
 'TYPE_SREC: assessment and measurement - environment': int,
 'TYPE_SREC: assessment and measurement - local communities and society': int,
 'TYPE_SREC: assessment and measurement - nothing': int,
 'TYPE_SREC: assessment and measurement - shareholders': int,
 'TYPE_SREC: assessment and measurement - suppliers': int,
 'TYPE_SREC: asset modification - customers': int,
 'TYPE_SREC: asset modification - employee

In [8]:
# SDG - Stakeholder recipient
{
    col: v
    for col, v in lc_dataset.schema.to_python().items()
    if col.startswith("SDG_SREC:")
}

{'SDG_SREC: SDG 1 - customers': int,
 'SDG_SREC: SDG 1 - employees': int,
 'SDG_SREC: SDG 1 - environment': int,
 'SDG_SREC: SDG 1 - local communities and society': int,
 'SDG_SREC: SDG 1 - nothing': int,
 'SDG_SREC: SDG 1 - shareholders': int,
 'SDG_SREC: SDG 1 - suppliers': int,
 'SDG_SREC: SDG 10 - customers': int,
 'SDG_SREC: SDG 10 - employees': int,
 'SDG_SREC: SDG 10 - environment': int,
 'SDG_SREC: SDG 10 - local communities and society': int,
 'SDG_SREC: SDG 10 - nothing': int,
 'SDG_SREC: SDG 10 - shareholders': int,
 'SDG_SREC: SDG 10 - suppliers': int,
 'SDG_SREC: SDG 11 - customers': int,
 'SDG_SREC: SDG 11 - employees': int,
 'SDG_SREC: SDG 11 - environment': int,
 'SDG_SREC: SDG 11 - local communities and society': int,
 'SDG_SREC: SDG 11 - nothing': int,
 'SDG_SREC: SDG 11 - shareholders': int,
 'SDG_SREC: SDG 11 - suppliers': int,
 'SDG_SREC: SDG 12 - customers': int,
 'SDG_SREC: SDG 12 - employees': int,
 'SDG_SREC: SDG 12 - environment': int,
 'SDG_SREC: SDG 12 - loc
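
The count columns follow six naming patterns: `TYPE: <action>`, `SDG: <n>`, `stakeholder_recipient_<stakeholder>`, `<action> - SDG <n>`, `TYPE_SREC: <action> - <stakeholder>` and `SDG_SREC: SDG <n> - <stakeholder>`. The sketch below turns a column name into its kind and components. It is inferred only from the names printed in this notebook; `classify_count_column` and the kind labels are hypothetical, and the function assumes that action and stakeholder names never contain `" - "`.

```python
from typing import Optional, Tuple


def classify_count_column(name: str) -> Optional[Tuple[str, Tuple[str, ...]]]:
    """Return (kind, components) for a count column, or None for other columns."""
    if name.startswith("TYPE_SREC: "):
        action, _, stakeholder = name.removeprefix("TYPE_SREC: ").rpartition(" - ")
        return "type x stakeholder", (action, stakeholder)
    if name.startswith("SDG_SREC: "):
        sdg, _, stakeholder = name.removeprefix("SDG_SREC: ").partition(" - ")
        return "sdg x stakeholder", (sdg, stakeholder)
    if name.startswith("TYPE: "):
        return "type", (name.removeprefix("TYPE: "),)
    if name.startswith("SDG: "):
        return "sdg", ("SDG " + name.removeprefix("SDG: "),)
    if name.startswith("stakeholder_recipient_"):
        return "stakeholder", (name.removeprefix("stakeholder_recipient_"),)
    if " - SDG " in name:
        # Bi-granular type/SDG columns carry no prefix, e.g. "training - SDG 8".
        action, _, sdg = name.rpartition(" - ")
        return "type x sdg", (action, sdg)
    return None


for column in ["training - SDG 8", "SDG_SREC: SDG 9 - suppliers", "rfyear"]:
    print(column, "->", classify_count_column(column))
```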

## Company Metadata

In [9]:
company_metadata = {
    "ids": str,  # Main ISIN of the company
    "full_isin_list": str,  # All ISINs we gathered from the company
    "conml": str,  # Company legal name
    "GICS_level_1": str,  # GICS Sector name of the company
    "GICS_level_2": str,  # GICS Industry Group name of the company
    "GICS_level_3": str,  # GICS Industry name of the company
    "loc": str,  # Country alpha-3 code of the company HQ
    "MacroRegion": str,  # Macro region of the company HQ - inferred from loc
}

In [54]:
lc_dataset["GICS_level_1"].unique().to_list()

['Energy',
 'Health Care',
 'Financials',
 'Real Estate',
 'Utilities',
 'Industrials',
 'Materials',
 'Information Technology',
 None,
 'Communication Services',
 'Consumer Discretionary',
 'Consumer Staples']

In [58]:
lc_dataset.filter(pl.col("GICS_level_1") == "Health Care")["SDG: 12"].describe()

statistic,value
str,f64
"""count""",3222.0
"""null_count""",0.0
"""mean""",2.929236
"""std""",5.363238
"""min""",0.0
"""25%""",0.0
"""50%""",1.0
"""75%""",4.0
"""max""",101.0


# Maturity

In [59]:
from leonardo_core.constants.constants import IndexV1_1MacroGroups, MacroGroups

index_v1_1_macro_groups = IndexV1_1MacroGroups()
macro_groups = MacroGroups()

index_v1_1_macro_groups.type_groups
macro_groups.type_groups

{'Advocacy': ['communication', 'donation & funding', 'association'],
 'Measurement': ['adoption of standards and rules',
  'assessment and measurement'],
 'Upskilling': ['training', 'incentives', 'volunteerism'],
 'Adaptation': ['asset modification',
  'modification of procedures',
  'pricing',
  'organizational structuring'],
 'Innovation': ['new products', 'r&d investments']}
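
To make the grouping concrete, the sketch below rolls per-action counts up into the five macro groups by plain summation. It is not the `leonardo_core` implementation: `type_groups_expressions` may weight or combine columns differently, so its output need not match these sums. `macro_group_totals` is a hypothetical helper and the counts in `row` are invented for illustration.

```python
# Mapping copied from the output of macro_groups.type_groups above.
TYPE_GROUPS = {
    "Advocacy": ["communication", "donation & funding", "association"],
    "Measurement": ["adoption of standards and rules", "assessment and measurement"],
    "Upskilling": ["training", "incentives", "volunteerism"],
    "Adaptation": [
        "asset modification",
        "modification of procedures",
        "pricing",
        "organizational structuring",
    ],
    "Innovation": ["new products", "r&d investments"],
}

# Inverted lookup: action -> macro group.
ACTION_TO_GROUP = {
    action: group for group, actions in TYPE_GROUPS.items() for action in actions
}


def macro_group_totals(type_counts: dict[str, int]) -> dict[str, int]:
    """Sum 'TYPE: <action>' counts per macro group (unknown actions raise KeyError)."""
    totals = {group: 0 for group in TYPE_GROUPS}
    for column, count in type_counts.items():
        action = column.removeprefix("TYPE: ")
        totals[ACTION_TO_GROUP[action]] += count
    return totals


# Invented counts for a single report, for illustration only.
row = {
    "TYPE: communication": 2,
    "TYPE: training": 5,
    "TYPE: incentives": 1,
    "TYPE: new products": 3,
}
totals = macro_group_totals(row)
print(totals)
```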

In [68]:
lc_dataset["communication - SDG 8"]

communication - SDG 8
i64
0
0
0
1
0
…
1
0
0
0


In [64]:
macro_groups.sdg_groups

{'People': ['SDG 1', 'SDG 2', 'SDG 3', 'SDG 4', 'SDG 5', 'SDG 10'],
 'Planet': ['SDG 6', 'SDG 7', 'SDG 12', 'SDG 13', 'SDG 14', 'SDG 15'],
 'Prosperity': ['SDG 8', 'SDG 9', 'SDG 11', 'SDG 16', 'SDG 17']}

In [66]:
lc_dataset.select(
    "conml",
    "rfyear",
    *macro_groups.type_groups_expressions,
    *macro_groups.sdg_groups_expressions,
    "GICS_level_1",
).with_columns(macro_groups.get_zscore_expressions("GICS_level_1")).with_columns(
    macro_groups.index_expr
)

conml,rfyear,Advocacy,Measurement,Upskilling,Adaptation,Innovation,People,Planet,Prosperity,GICS_level_1,People_zscore,Planet_zscore,Prosperity_zscore,Advocacy_zscore,Measurement_zscore,Upskilling_zscore,Adaptation_zscore,Innovation_zscore,lc_index
str,i64,i64,i64,i64,i64,i64,i64,i64,i64,str,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""Honeywell International Inc""",2024,26,6,18,14,0,13,13,6,"""Industrials""",0.826083,0.975301,0.411232,0.753471,1.128319,0.955276,0.779565,-0.45363,-1.207101
"""Block H&R Inc""",2024,36,6,24,8,2,19,10,9,"""Consumer Discretionary""",1.118957,0.217788,0.936473,0.867302,0.608067,1.341891,-0.056559,0.041054,-0.826248
"""The Boeing Company""",2024,48,0,12,10,0,20,5,10,"""Industrials""",1.703124,-0.0927,1.21579,2.059153,-0.555574,0.368086,0.333843,-0.45363,-2.512784
"""Coherent Corp""",2024,18,4,22,22,0,12,15,6,"""Information Technology""",0.552653,1.068957,0.424974,0.21219,0.391795,1.191594,1.557144,-0.425199,-0.637388
"""Comcast Corp""",2024,28,0,10,2,2,12,6,3,"""Communication Services""",-0.025692,-0.041617,-0.382198,0.071521,-0.535428,-0.039458,-0.565667,-0.125644,-0.197165
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
"""POET Technologies Inc""",1991,8,2,0,2,0,3,2,1,"""Information Technology""",-0.5059,-0.456043,-0.608479,-0.369892,-0.086155,-0.834224,-0.541215,-0.425199,-0.055307
"""Ajinomoto Co Inc""",1990,8,0,2,8,2,6,3,1,"""Consumer Staples""",-0.391418,-0.543129,-0.54674,-0.550972,-0.56641,-0.639262,-0.18051,-0.115605,0.435367
"""NOCIL Ltd""",1990,6,0,0,0,0,0,3,0,"""Materials""",-1.004595,-0.508993,-0.931844,-0.604831,-0.686596,-0.948282,-0.931533,-0.531816,0.073016
"""Vale SA""",1990,6,0,0,0,0,0,3,0,"""Materials""",-1.004595,-0.508993,-0.931844,-0.604831,-0.686596,-0.948282,-0.931533,-0.531816,0.073016

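The `_zscore` columns above appear to be standardised within each `GICS_level_1` sector, since that column is passed to `get_zscore_expressions`. As a rough illustration of a within-group z-score (not the library's code), the sketch below uses the population standard deviation and maps zero-variance groups to `0.0`; both choices are assumptions, and `zscores_by_group` is a hypothetical name. The `lc_index` formula is not shown in this notebook and is not reproduced here.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscores_by_group(values: list[float], groups: list[str]) -> list[float]:
    """Standardise each value against the mean and std of its own group."""
    by_group: dict[str, list[float]] = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)

    # Per-group (mean, population std); sample std would be an equally valid choice.
    stats = {group: (mean(vals), pstdev(vals)) for group, vals in by_group.items()}

    result = []
    for value, group in zip(values, groups):
        mu, sigma = stats[group]
        # A group with no spread has no meaningful z-score; 0.0 is an assumption.
        result.append((value - mu) / sigma if sigma > 0 else 0.0)
    return result


# Invented values for two sectors, for illustration only.
values = [2, 4, 6, 10, 10]
groups = ["Materials", "Materials", "Materials", "Energy", "Energy"]
z = zscores_by_group(values, groups)
print(z)
```

In polars the same idea is usually written as a window expression with `.over("GICS_level_1")`, which avoids the explicit Python loops.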

# Text of Initiatives