# Outside the SDMX garden, looking at LEI and GLEIF

The Legal Entity Identifier (LEI) is a unique global identifier for legal entities participating in financial transactions. Its purpose is to help identify legal entities on a globally accessible database (source [Wikipedia](https://en.wikipedia.org/wiki/Legal_Entity_Identifier)).

The Global LEI Foundation (GLEIF) supports the implementation and use of the LEI. It gives access to all the LEI records through APIs or file downloads.

The GLEIF site does not use SDMX. It maintains a [data dictionary](https://www.gleif.org/en/lei-data/access-and-use-lei-data/gleif-data-dictionary) and [an API](https://www.gleif.org/en/lei-data/gleif-api). It also offers [golden copies of the data, as well as delta files](https://www.gleif.org/en/lei-data/gleif-golden-copy), for download.


The GLEIF data are relevant master data about entities, so a good integration with them may be of interest to many institutions.
For institutions with SDMX-driven systems, it may be useful to create SDMX metadata and convert the LEI data from their source formats to SDMX, so that they can be integrated into their systems. It may also be useful for those institutions to validate the input data and generate new statistics.

This notebook presents a way to use pysdmx and the VTL Engine to represent a statistical process that includes:
1. The collection of data from the GLEIF
2. The transformation of data into SDMX
3. The structural validation of data using FMR
4. The consistency validation of the data using VTL
5. The generation of new aggregated statistics using VTL
6. The conversion of the aggregated data into SDMX

![Agenda](images/use_case_diagram_start.png)


For this exercise, the necessary SDMX metadata have been added to an FMR instance hosted by MeaningfulData (https://fmr.meaningfuldata.eu/).

### List of pysdmx classes and functions used in this notebook:

Functions:
- pysdmx.io.read_sdmx
- pysdmx.io.csv.sdmx20.writer.write

Classes:
- pysdmx.api.fmr.RegistryClient (and methods)
- pysdmx.model.message.Message (and methods)
- pysdmx.io.pd.PandasDataset
- pysdmx.model.dataflow.Schema

## Data cleaning and set up using pandas and pysdmx

For this exercise, we will use the [golden copy from the GLEIF](https://www.gleif.org/en/lei-data/gleif-golden-copy/download-the-golden-copy#/) as input, and we will transform it into a dataset that follows the desired data structure, which can be found [here](https://fmr.meaningfuldata.eu/sdmx/v2/structure/datastructure/MD/LEI_DATA/1.0).

Note that we designed this DSD from the existing data, but took a subset of the data and renamed the attributes to make them closer to SDMX practices.

We are using Pandas to read the original data and to transform them to get a final dataset that follows the DSD.

The steps are:

1. Download the data from [the source](https://www.gleif.org/en/lei-data/gleif-golden-copy/download-the-golden-copy#.zip)
2. Read the downloaded data with Pandas
3. Drop the columns not used in the DSD and rename the existing ones
4. Filter to keep only the active entities (the GLEIF also publishes inactive entities)

The code includes a function that uses the chunking capabilities of Pandas for better memory efficiency.
This is a prototype of data streaming in pysdmx,
which will be available by the end of 2025.

This code requires the data extra of pysdmx,
which simply installs pandas.

```bash
pip install pysdmx[data]
```

The original CSV file is almost 4 GB.
We have run the example with the full dataset and got the same results as those shown here.
Efficient handling of large datasets is on the pysdmx roadmap, but not yet implemented.
For this reason, and for the live demo, we are using a subset of the original dataset.

In [1]:
from utils import streaming_load_save_csv_file

import requests
import zipfile
# 1. Download the Golden Copy file

GOLDEN_COPY_PATH = 'data_files/lei_golden_copy'

url = 'https://leidata-preview.gleif.org/storage/golden-copy-files/2025/01/25/1034569/20250125-1600-gleif-goldencopy-lei2-golden-copy.csv.zip'
r = requests.get(url)
r.raise_for_status()  # Fail early instead of saving an error page as a ZIP file

with open(GOLDEN_COPY_PATH + '.zip', 'wb') as f:
    f.write(r.content)

with zipfile.ZipFile(GOLDEN_COPY_PATH + '.zip', 'r') as zip_ref:
    zip_ref.extractall('data_files/')
    file_name = zip_ref.namelist()[0]

# file_name = '20250125-1600-gleif-goldencopy-lei2-golden-copy.csv'

# 2. Read the downloaded data with Pandas
import pandas as pd

# We will read only the first 10000 rows as a sample
data = pd.read_csv('data_files/' + file_name, dtype=str, nrows=10000)

# 3. Drop the columns not used in the DSD and rename the existing ones

RENAME_DICT = {
    "LEI": "LEI",
    "Entity.LegalName": "LEGAL_NAME",
    "Entity.LegalAddress.Country": "COUNTRY_INCORPORATION",
    "Entity.HeadquartersAddress.Country": "COUNTRY_HEADQUARTERS",
    "Entity.EntityCategory": "CATEGORY",
    "Entity.EntitySubCategory": "SUBCATEGORY",
    "Entity.LegalForm.EntityLegalFormCode": "LEGAL_FORM",
    "Entity.EntityStatus": "STATUS",
    "Entity.LegalAddress.PostalCode": "POSTAL_CODE",
}

data.rename(columns=RENAME_DICT, inplace=True)
data = data[list(RENAME_DICT.values())]

# 4. Data filtering by status
data = data[data['STATUS'] == 'ACTIVE'].reset_index(drop=True)
del data['STATUS']
display(data)

Unnamed: 0,LEI,LEGAL_NAME,COUNTRY_INCORPORATION,COUNTRY_HEADQUARTERS,CATEGORY,SUBCATEGORY,LEGAL_FORM,POSTAL_CODE
0,001GPB6A9XPE8XJICC14,Fidelity Advisor Leveraged Company Stock Fund,US,US,FUND,,8888,02210
1,004L5FPTUREIWK9T2N63,"Hutchin Hill Capital, LP",US,US,GENERAL,,T91T,19808
2,00EHHQ2ZHDCFXJCPCL46,Vanguard Russell 1000 Growth Index Trust,US,US,FUND,,8888,19355
3,00GBW0Z2GYIER7DHDS71,"ARISTEIA CAPITAL, L.L.C.",US,US,GENERAL,,HZEH,19801
4,00KLB2PFTM3060S2N216,Oakmark International Fund,US,US,FUND,,8888,02110
...,...,...,...,...,...,...,...,...
9549,21380012PY96FUKPP805,ITELA HOLDING AS,NO,NO,GENERAL,,YI42,0191
9550,21380012Q6KHP17Q4K39,JOSEPH WALTER CHAMBERLAIN DIS,GB,GB,GENERAL,,8888,BS2 0PT
9551,21380012QLF2BOUUZA18,R W ARMSTRONG & SONS LIMITED RETIREMENT AND DE...,GB,GB,GENERAL,,8888,RG26 5RU
9552,21380012QWUG4B14Y756,ORIENTAL HARBOR INVESTMENT FUND,KY,HK,GENERAL,,MPUG,KY1-1206
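
The download step above keeps the whole archive in memory through `r.content` before writing it to disk. For the full golden copy, a streamed variant may be preferable. The following is a minimal sketch that uses only the standard library; the helper names `download_file` and `extract_first_member` are hypothetical and are not part of the notebook's `utils` module.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def download_file(url: str, target: Path, chunk_size: int = 1 << 20) -> Path:
    """Stream a remote file to disk in fixed-size chunks (hypothetical helper)."""
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        # copyfileobj reads and writes chunk_size bytes at a time
        shutil.copyfileobj(response, out, length=chunk_size)
    return target


def extract_first_member(zip_path: Path, dest_dir: Path) -> Path:
    """Extract the first file of a ZIP archive and return its path."""
    with zipfile.ZipFile(zip_path) as archive:
        name = archive.namelist()[0]
        archive.extract(name, dest_dir)
    return dest_dir / name
```

With these helpers, the cell would call `download_file(url, Path(GOLDEN_COPY_PATH + '.zip'))` and then pass the returned path to `extract_first_member` together with `Path('data_files')`.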


In [2]:
# Load and save using chunks (if memory is a problem). Reading the output afterwards
streaming_load_save_csv_file('data_files/' + file_name, 'data_files/' + "golden_copy_changed.csv", use_sdmx_csv=False)
# data = pd.read_csv('data_files/' + "golden_copy_changed.csv", dtype=str)

Number of lines written: 2652203
Written in CSV
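
The source of `streaming_load_save_csv_file` is not shown in this notebook. As a rough idea of what chunked processing with Pandas can look like, here is a minimal sketch. The function name `stream_filter_csv`, its parameters and the default chunk size are assumptions made for illustration, and the rename mapping is expected to include the column mapped to `STATUS`.

```python
import pandas as pd


def stream_filter_csv(source, target, rename, chunksize=100_000):
    """Filter and rename a large CSV chunk by chunk (hypothetical helper).

    `source` is a path or file-like object and `target` an open text handle.
    Returns the number of data rows written.
    """
    written = 0
    reader = pd.read_csv(
        source, dtype=str, usecols=list(rename), chunksize=chunksize
    )
    for i, chunk in enumerate(reader):
        # Rename the columns and put them in the order of the mapping
        chunk = chunk.rename(columns=rename)[list(rename.values())]
        # Keep only active entities, then drop the helper column
        chunk = chunk[chunk["STATUS"] == "ACTIVE"].drop(columns="STATUS")
        # Write the header only once, with the first chunk
        chunk.to_csv(target, header=(i == 0), index=False)
        written += len(chunk)
    return written
```

Only one chunk is held in memory at a time, so the memory footprint depends on `chunksize` rather than on the size of the input file.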


# Generate the pysdmx dataset

## Retrieving the Schema from FMR

Using the Registry Client, we download the Schema from FMR, using the FusionJSON format.

In [3]:
from pysdmx.api.fmr import RegistryClient
from pysdmx.io.format import StructureFormat

client = RegistryClient(
    "https://fmr.meaningfuldata.eu/sdmx/v2", format=StructureFormat.FUSION_JSON
)
# Tip: use a debugger to inspect the response
schema = client.get_schema(
    "datastructure", agency="MD", id="LEI_DATA", version="1.0"
)

NotFound: Not found: The requested artefact could not be found in the targeted registry. The query was `https://fmr.meaningfuldata.eu/sdmx/v2/schema/datastructure/MD/LEI_DATA/1.0`

# Generate dataset and validate using FMR

In [None]:
# Code to validate the dataset on FMR
from utils import validate_data_fmr
from pysdmx.io.csv.sdmx20.writer import write

from pysdmx.io.pd import PandasDataset

dataset = PandasDataset(structure=schema, data=data)

# Serialization on SDMX-CSV 2.0
csv_text = write([dataset])

result = validate_data_fmr(csv_text, host="fmr.meaningfuldata.eu", port=443,
                           use_https=True)
result

## Write data to an SDMX-CSV 2.0 file

In [None]:
from pysdmx.io.csv.sdmx20.writer import write

output = write([dataset], "data_files/golden_copy_changed_sdmx.csv")

# Using VTL to validate the data with GLEIF data quality checks

The VTL language allows us to perform validations over the data
with a business-friendly syntax.
While SDMX allows validating the data against the metadata,
VTL broadens the possibilities
by defining rules that state the conditions under which the data are valid
and by combining the errors into a single dataset.

For this purpose,
at MeaningfulData we have developed a library called vtlengine,
part of our own VTL Suite.

In this example,
we will use a VTL script
that performs validations based on the GLEIF [data quality checks](https://www.gleif.org/en/lei-data/gleif-data-quality-management/data-quality-checks) and a custom validation on Subcategory data.

Steps to use VTL from pysdmx:
1. Convert the Schema to a VTL DataStructure
2. Validate the data using VTL (with vtlengine library)
3. Analyse the results and write to an SDMX file

## Convert the Schema to a VTL DataStructure

This code converts the pysdmx.model Schema and DataStructureDefinition objects into a VTL data structure,
using the MeaningfulData internal format, which is usable only with vtlengine.
We will include this method in pysdmx,
but there it will generate the VTL 2.1 standard data structure.
Both options will be usable by the vtlengine library.

In [None]:
from utils import to_vtl_json

vtl_datastructure = to_vtl_json(schema)
vtl_datastructure
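
The source of `to_vtl_json` is not shown here. The sketch below illustrates the kind of mapping such a conversion performs. It works on plain `(id, role, required)` tuples instead of pysdmx objects, and the JSON layout (`datasets`, `DataStructure`, `name`, `type`, `role`, `nullable`) is our understanding of the format accepted by vtlengine, so treat it as an assumption. The function name and the sample components are hypothetical and do not reproduce the actual LEI_DATA structure.

```python
import json

# Assumed mapping from SDMX component roles to VTL roles
ROLE_MAP = {
    "dimension": "Identifier",
    "measure": "Measure",
    "attribute": "Attribute",
}


def components_to_vtl(name, components):
    """Build a vtlengine-style data structure from (id, role, required) tuples."""
    structure = []
    for comp_id, role, required in components:
        vtl_role = ROLE_MAP[role]
        structure.append({
            "name": comp_id,
            # Every column is read as text in this notebook (dtype=str)
            "type": "String",
            "role": vtl_role,
            # Identifiers are never nullable; other components are nullable
            # unless the SDMX component is required
            "nullable": vtl_role != "Identifier" and not required,
        })
    return {"datasets": [{"name": name, "DataStructure": structure}]}


# Illustrative components only, not the real DSD
sample = [
    ("LEI", "dimension", True),
    ("LEGAL_NAME", "measure", True),
    ("SUBCATEGORY", "attribute", False),
]
print(json.dumps(components_to_vtl("LEI_DATA", sample), indent=2))
```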

## Validate the data using VTL

This process will execute the VTL Script "validations.vtl"
on the generated data.
The script will perform the following validations:

1. Check that the Postal Code format is valid for a specific country, stored in dataset "validation.postal_code_errors"
2. Check that the SubCategory is null when the Category is not "RESIDENT_GOVERNMENT_ENTITY", stored in dataset "validation.subcategories_errors"
3. Check that the SubCategory is not null when the Category is "RESIDENT_GOVERNMENT_ENTITY", stored in dataset "validation.subcategories_errors"
4. Merge the errors into a single dataset, named "errors_count"

We treat any error with level 1 as a warning,
while any error with level 3 is a critical error,
which means that the data are not valid.

For more details on the run method, please see the [vtlengine API documentation](https://docs.vtlengine.meaningfuldata.eu/api.html#api).
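
The rules themselves live in "validations.vtl", which is not reproduced in this notebook. To make checks 2 and 3 more concrete, here is a rough pandas equivalent. It is only an illustration of the logic, not a translation of the VTL script: the function names, the error codes and the level assigned to each rule are hypothetical.

```python
import pandas as pd

GOV = "RESIDENT_GOVERNMENT_ENTITY"


def subcategory_errors(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that break the two SubCategory rules."""
    is_gov = df["CATEGORY"] == GOV
    has_sub = df["SUBCATEGORY"].notna()
    # Rule 2: SubCategory must be null for non-government entities
    # (error code and level are invented for this sketch)
    unexpected = df[~is_gov & has_sub].assign(
        errorcode="SUBCAT_UNEXPECTED", errorlevel=1
    )
    # Rule 3: SubCategory must be filled for government entities
    missing = df[is_gov & ~has_sub].assign(
        errorcode="SUBCAT_MISSING", errorlevel=3
    )
    return pd.concat([unexpected, missing], ignore_index=True)


def is_valid(errors: pd.DataFrame) -> bool:
    """Data are valid when no error reaches the critical level 3."""
    return not (errors["errorlevel"] == 3).any()
```

Calling `subcategory_errors(data)` on the filtered DataFrame would return one row per violation, and `is_valid` would then tell warnings apart from critical errors.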


## Running the VTL script

In [None]:
from utils import _load_script
from vtlengine import run

script = _load_script("vtl/validations.vtl")
datapoints = {"LEI_DATA": data}

validations_result = run(script=script, data_structures=vtl_datastructure,
                         datapoints=datapoints)

### Getting the total number of errors

In [None]:
validations_result['errors_count'].data

### Analysing data on Postal Code errors

In [None]:
cols_to_analyse = ['COUNTRY_INCORPORATION', 'POSTAL_CODE', 'errorcode', 'errorlevel']
validations_result['validation.postal_codes_errors'].data[cols_to_analyse]

### Analysing data on Subcategory errors

In [None]:
cols_to_analyse = ['CATEGORY', 'SUBCATEGORY', 'errorcode', 'errorlevel']
validations_result['validation.subcategories_errors'].data[cols_to_analyse]

## Using VTL to perform calculations

We have designed a VTL script that performs the following calculations:
1. Count the number of entities that are incorporated in each country, stored in dataset "calculation.number_incorporated_entities"
2. Count the number of entities with their headquarters located in each country, stored in dataset "calculation.number_entities_different_hq"
3. Count the number of entities with their headquarters located in a country different from the one where they are incorporated, stored in dataset "calculation.number_entities_different_hq"
4. Join the three datasets into a single dataset, named "lei_statistics"
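
The calculations are written in "calculations.vtl", which is not reproduced in this notebook. As an illustration of the logic only, the same three counts and their join could be sketched in pandas as follows. The function name and the output column names are hypothetical, and grouping the third count by country of incorporation is an assumption.

```python
import pandas as pd


def lei_statistics(df: pd.DataFrame) -> pd.DataFrame:
    """Per-country entity counts (pandas illustration, not the VTL script)."""
    # 1. Entities incorporated in each country
    incorporated = (
        df.groupby("COUNTRY_INCORPORATION").size().rename("N_INCORPORATED")
    )
    # 2. Entities with their headquarters in each country
    headquartered = (
        df.groupby("COUNTRY_HEADQUARTERS").size().rename("N_HEADQUARTERED")
    )
    # 3. Entities whose headquarters are in another country
    #    (assumed here to be grouped by country of incorporation)
    different = df[df["COUNTRY_INCORPORATION"] != df["COUNTRY_HEADQUARTERS"]]
    different_hq = (
        different.groupby("COUNTRY_INCORPORATION").size().rename("N_DIFFERENT_HQ")
    )
    # 4. Join the three series on the country code, filling gaps with 0
    stats = pd.concat([incorporated, headquartered, different_hq], axis=1)
    stats = stats.sort_index().fillna(0).astype(int)
    return stats.rename_axis("COUNTRY").reset_index()
```

Countries that appear in only one of the three counts get a zero in the others, so the joined table has one row per country code found in either column.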

## Running the VTL script

In [None]:
script = _load_script("vtl/calculations.vtl")
datapoints = {"LEI_DATA": data}

calculations_result = run(script=script, data_structures=vtl_datastructure,
                          datapoints=datapoints)

### Analysing the results

In [None]:
calculations_result['lei_statistics'].data

# Generate SDMX-ML file with the aggregated data

We generate a PandasDataset from the vtlengine output
and use the SDMX-ML 2.1 data writer from pysdmx.
We download the SDMX-ML 2.1 file with the DataStructureDefinition
(with descendants),
convert the DataStructureDefinition into a Schema object and use it to create the PandasDataset.

This code requires the xml extra of pysdmx,
which installs the dependencies needed to read and write SDMX-ML.

```bash
pip install pysdmx[xml]
```

## Getting the Schema from a URL using read_sdmx

In [None]:
from pysdmx.io import read_sdmx
from pysdmx.io.pd import PandasDataset

msg = read_sdmx(
    "https://fmr.meaningfuldata.eu/sdmx/v2/structure/datastructure/MD/LEI_AGGREGATE_STATISTICS/+/?format=sdmx-2.1&references=descendants&prettyPrint=true")
dsd = msg.get_data_structure_definition(
    "DataStructure=MD:LEI_AGGREGATE_STATISTICS(1.0)")
schema_aggregated = dsd.to_schema()
data = calculations_result['lei_statistics'].data
pd_dataset = PandasDataset(structure=schema_aggregated, data=data)
pd_dataset

### Write the SDMX-ML 2.1 file

In [None]:
from pysdmx.io.xml.sdmx21.writer.structure_specific import write

xml_str = write([pd_dataset], prettyprint=False)

xml_str