# Overview of the use case

Talking points and agenda:

- General use of pysdmx for data producers
- Outside the SDMX garden, looking at LEI and GLEIF
- Data cleaning and set up using pandas
- Downloading and reading the ConceptScheme in SDMX-ML 2.1 using read_sdmx
- Retrieving the Schema from FMR (FusionJSON)
- Convert the Schema to a VTL DataStructure
- Using VTL to validate the data
- Using VTL to perform calculations
- Generate SDMX file with the aggregated data
- Reading back the SDMX file using read_sdmx

### List of pysdmx classes and functions used in this notebook:

Functions:
- pysdmx.io.read_sdmx
- pysdmx.io.csv.sdmx20.writer.write
- pysdmx.io.xml.sdmx21.writer.structure_specific.write

Classes:
- pysdmx.api.fmr.RegistryClient (and methods)
- pysdmx.model.message.Message (and methods)
- pysdmx.io.pd.PandasDataset
- pysdmx.model.dataflow.Schema

# Outside the SDMX garden, looking at LEI and GLEIF

The Legal Entity Identifier (LEI) is a unique global identifier for legal entities participating in financial transactions. Its purpose is to help identify legal entities in a globally accessible database (source: [Wikipedia](https://en.wikipedia.org/wiki/Legal_Entity_Identifier)).

The Global LEI Foundation (GLEIF) supports the implementation and use of the LEI. It makes all LEI records accessible through APIs or as downloadable files.

The GLEIF site does not use SDMX. It maintains a [data dictionary](https://www.gleif.org/en/lei-data/access-and-use-lei-data/gleif-data-dictionary) and [an API](https://www.gleif.org/en/lei-data/gleif-api). It also makes it possible to download [golden copies of the data, as well as delta files](https://www.gleif.org/en/lei-data/gleif-golden-copy).


Because the GLEIF data are relevant master data about entities, many institutions may be interested in integrating them well.
For institutions with SDMX-driven systems, it may be useful to create SDMX metadata and convert the LEI data from their source formats to SDMX, so that the data can be integrated into those systems. It may also be useful for those institutions to validate the input data and generate new statistics.

This notebook presents a way to use pysdmx and the VTL Engine to represent a statistical process that includes:
1. The collection of data from the GLEIF
2. The transformation of data into SDMX
3. The structural validation of data using FMR
4. The consistency validation of the data using VTL
5. The generation of new aggregated statistics using VTL
6. The conversion of the data into SDMX


For this exercise, the necessary SDMX metadata have been added to an FMR instance hosted by MeaningfulData (https://fmr.meaningfuldata.eu/).

## Data cleaning and set up using pandas and pysdmx

For this exercise, we will use as input the [golden copy from the GLEIF](https://www.gleif.org/en/lei-data/gleif-golden-copy/download-the-golden-copy#/), and we will transform it to get a dataset following the desired data structure, which can be found [here](https://fmr.meaningfuldata.eu/sdmx/v2/structure/datastructure/MD/LEI_DATA/1.0).

Note that we designed this DSD from the existing data, but took a subset of the data and renamed the attributes to make them closer to SDMX practices.

We are using Pandas to read the original data and to transform them to get a final dataset that follows the DSD.

The steps are:

1. Download the data from [the source](https://www.gleif.org/en/lei-data/gleif-golden-copy/download-the-golden-copy#.zip)
2. Read the downloaded data with Pandas
3. Drop the columns not used in the DSD and rename the existing ones
4. Filter to get only the active entities (the GLEIF also publishes inactive entities)

The code uses the chunking capabilities of Pandas for better memory efficiency.
This is a prototype of the pandas-based streaming capabilities in pysdmx,
which will be available by the end of 2025.

This code requires the data extra from pysdmx,
which simply installs pandas.

```bash
pip install pysdmx[data]
```

In [24]:
# Import required libraries

import requests
import zipfile

import pandas as pd

In [5]:
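# 1. Download the Golden Copy file and extract it

GOLDEN_COPY_PATH = 'data_files/lei_golden_copy'

url = 'https://leidata-preview.gleif.org/storage/golden-copy-files/2025/01/25/1034569/20250125-1600-gleif-goldencopy-lei2-golden-copy.csv.zip'
r = requests.get(url, timeout=60)
# Fail early on an HTTP error instead of saving an error page as a zip
r.raise_for_status()

# Save the zip archive to disk (the data_files folder must already exist)
with open(GOLDEN_COPY_PATH + '.zip', 'wb') as f:
    f.write(r.content)

# Extract the archive and keep the name of its first entry
with zipfile.ZipFile(GOLDEN_COPY_PATH + '.zip', 'r') as zip_ref:
    zip_ref.extractall('data_files/')
    file_name = zip_ref.namelist()[0]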




In [None]:
# Load file, rename columns and save new files.
from utils import streaming_load_save_csv_file

streaming_load_save_csv_file(
    golden_copy_original_path="data_files/golden-copy-original.csv",
    output_filename="data_files/golden_copy_changed_10000_sdmx.csv",
    use_sdmx_csv=True, nrows=10000)

streaming_load_save_csv_file(
    golden_copy_original_path="data_files/golden-copy-original.csv",
    output_filename="data_files/golden_copy_changed_10000.csv",
    use_sdmx_csv=False, nrows=10000)
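
The helper `streaming_load_save_csv_file` lives in the notebook's `utils` module and is not shown here. As a rough illustration of the chunked approach described above (read in chunks, drop and rename columns, keep only active entities), the sketch below uses plain pandas. The function name `stream_clean_csv`, the column mapping, the `ACTIVE` status value and the toy rows are all assumptions made for the example; they are not the actual helper or the real DSD mapping.

```python
import io

import pandas as pd

# Hypothetical mapping from source column names to DSD component ids.
# The real mapping used by the notebook is defined in its utils module.
COLUMN_MAP = {
    "LEI": "LEI",
    "Entity.LegalName": "LEGAL_NAME",
    "Entity.EntityStatus": "STATUS",
}


def stream_clean_csv(source, target, chunksize=1000, nrows=None):
    """Read `source` in chunks, keep mapped columns and active rows, write `target`."""
    reader = pd.read_csv(
        source,
        usecols=list(COLUMN_MAP),  # drop unused columns at read time
        dtype=str,                 # avoid dtype guessing that may differ per chunk
        chunksize=chunksize,
        nrows=nrows,
    )
    first = True
    for chunk in reader:
        chunk = chunk.rename(columns=COLUMN_MAP)
        # Assumed status value for active entities
        chunk = chunk[chunk["STATUS"] == "ACTIVE"]
        # Write the header once, then append the following chunks
        chunk.to_csv(target, mode="w" if first else "a", header=first, index=False)
        first = False


# Toy input with made-up rows, standing in for the golden copy file
source = io.StringIO(
    "LEI,Entity.LegalName,Entity.EntityStatus,Other\n"
    "A1,Alpha,ACTIVE,x\n"
    "B2,Beta,INACTIVE,y\n"
    "C3,Gamma,ACTIVE,z\n"
)
target = io.StringIO()
stream_clean_csv(source, target, chunksize=2)
print(target.getvalue())
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the size of the input file.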

# Reading the ConceptScheme on SDMX-ML 2.1 using read_sdmx

For this example,
we created a DataStructure called LEI_DATA in FMR
(short URN: DataStructure=MD:LEI_DATA(1.0)),
together with the codelists required for structural validation in FMR.

These structures are also available in the SDMX_Structures/structures.xml file in this project,
and in the MeaningfulData FMR (fmr.meaningfuldata.eu).
The library does not currently support SDMX-ML 3.0,
so we will read only the ConceptScheme and its descendants (available at SDMX_Structures/concepts.xml).

To ensure we are able to validate the data correctly,
we extended the CL_AREA codelist from SDMX
to add a code that was present in the LEI Golden Copy.

The code below reads the ConceptScheme and its descendants.
Make sure you have installed the xml extra from pysdmx.

```bash
pip install pysdmx[xml]
```

In [26]:
from pysdmx.io import read_sdmx

structures_msg = read_sdmx("SDMX_Structures/concepts.xml")
# We can access the first concept scheme, or look for the short_urn
concept_scheme1 = structures_msg.get_concept_schemes()[0]
concept_scheme2 = structures_msg.get_concept_scheme(
    "ConceptScheme=MD:LEI_CONCEPTS(1.0)")
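
The short URN passed to `get_concept_scheme` follows the pattern `Class=Agency:Id(Version)`. As a small illustration of how such a string breaks down, here is a standard-library sketch. The `ShortUrn` type and `parse_short_urn` function are hypothetical names for this example and are not part of pysdmx; the regular expression is a simplification that covers identifiers like the ones used in this notebook.

```python
import re
from typing import NamedTuple


class ShortUrn(NamedTuple):
    """Parts of a short URN such as ConceptScheme=MD:LEI_CONCEPTS(1.0)."""

    sdmx_class: str
    agency: str
    resource_id: str
    version: str


# Simplified pattern: Class=Agency:Id(Version)
_SHORT_URN = re.compile(
    r"^(?P<cls>\w+)=(?P<agency>[\w.]+):(?P<id>[\w@$-]+)\((?P<version>[^)]+)\)$"
)


def parse_short_urn(value: str) -> ShortUrn:
    """Split a short URN into its four parts, or raise ValueError."""
    match = _SHORT_URN.match(value)
    if match is None:
        raise ValueError(f"Not a short URN: {value!r}")
    return ShortUrn(match["cls"], match["agency"], match["id"], match["version"])


urn = parse_short_urn("ConceptScheme=MD:LEI_CONCEPTS(1.0)")
print(urn.agency, urn.resource_id, urn.version)
```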

# Retrieving the Schema from FMR

We can also use the FMR web services to download the Schema from FMR, using the FusionJSON format.

In [28]:
from pysdmx.api.fmr import RegistryClient
from pysdmx.io.format import StructureFormat

client = RegistryClient(
    "https://fmr.meaningfuldata.eu/sdmx/v2", format=StructureFormat.FUSION_JSON
)
# Tip: use a debugger to inspect the response
schema = client.get_schema(
    "datastructure", agency="MD", id="LEI_DATA", version="1.0"
)


# Using VTL to validate the data with GLEIF data quality checks

VTL allows us to perform validations over the data,
with a business-friendly syntax.

For this purpose, at MeaningfulData we have developed a library called vtlengine,
which can run VTL scripts over data.

In this example,
we will use a VTL script
that performs validations based on the GLEIF data quality checks
(link) and a custom validation on Subcategory data.


Steps to use VTL from pysdmx:
1. Convert the Schema to a VTL DataStructure
2. Validate the data using VTL
3. Analyse the results

---------(Explanation on VTL validations usefulness,
validating more than one component. Quick overview on the code
using VTL Playground)---------

## Convert the Schema to a VTL DataStructure

This code converts the pysdmx.model Schema and DataStructureDefinition objects into a VTL data structure,
using MeaningfulData's internal format, which is usable only with vtlengine.
We will include this method in pysdmx,
but there it will generate the standard VTL 2.1 data structure.
Both options will be usable by the vtlengine library.

In [None]:
from utils import to_vtl_json

vtl_datastructure = to_vtl_json(schema)
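
The `to_vtl_json` helper comes from the notebook's `utils` module and is not shown. To give an idea of the kind of mapping involved, the sketch below turns a list of simplified components into a dictionary with a `datasets` / `DataStructure` layout. The `SimpleComponent` class, the role mapping, the component roles and the exact dictionary keys are assumptions for illustration; they are not the pysdmx classes, and the layout should be checked against the vtlengine documentation before use.

```python
from dataclasses import dataclass


@dataclass
class SimpleComponent:
    """Hypothetical stand-in for a schema component (not the pysdmx class)."""

    id: str
    role: str  # "dimension", "measure" or "attribute"
    vtl_type: str = "String"


# Assumed mapping from SDMX component roles to VTL roles
ROLE_MAP = {
    "dimension": "Identifier",
    "measure": "Measure",
    "attribute": "Attribute",
}


def to_vtl_structure(dataset_name, components):
    """Build a VTL-style data structure dictionary for one dataset."""
    return {
        "datasets": [
            {
                "name": dataset_name,
                "DataStructure": [
                    {
                        "name": comp.id,
                        "type": comp.vtl_type,
                        "role": ROLE_MAP[comp.role],
                        # Identifiers cannot be null in VTL
                        "nullable": comp.role != "dimension",
                    }
                    for comp in components
                ],
            }
        ]
    }


# Illustrative components; the roles assigned here are assumptions
example = to_vtl_structure(
    "LEI_DATA",
    [
        SimpleComponent("LEI", "dimension"),
        SimpleComponent("CATEGORY", "attribute"),
    ],
)
print(example)
```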

## Validate the data using VTL (sample 10000)

This step runs the VTL script validations.vtl on the generated data.

For more details on the run method, [please visit]()

---------(Explanation on the code, overview on the VTL run method documentation)---------

## Running the VTL script

In [22]:
from utils import _load_script
from vtlengine import run

script = _load_script("vtl/validations.vtl")
data_df = pd.read_csv("data_files/golden_copy_changed_10000.csv")
datapoints = {"LEI_DATA": data_df}

validations_result = run(script=script, data_structures=vtl_datastructure,
                         datapoints=datapoints)
validations_result['errors_count'].data

| | errorlevel | int_var |
|---|---|---|
| 0 | 1 | 4 |


### Getting the total number of errors (sample 10000)

In [11]:
validations_result['errors_count'].data

| | errorlevel | int_var |
|---|---|---|
| 0 | 1 | |


### Analysing data on Subcategory errors (sample 10000)

In [12]:
cols_to_analyse = ['CATEGORY', 'SUBCATEGORY', 'errorcode', 'errorlevel']
validations_result['validation.subcategories_errors'].data[cols_to_analyse]

| | CATEGORY | SUBCATEGORY | errorcode | errorlevel |
|---|---|---|---|---|
| 0 | RESIDENT_GOVERNMENT_ENTITY | | C1 | 1 |
| 1 | RESIDENT_GOVERNMENT_ENTITY | | C1 | 1 |
| 2 | RESIDENT_GOVERNMENT_ENTITY | | C1 | 1 |
| 3 | RESIDENT_GOVERNMENT_ENTITY | | C1 | 1 |


---------(Explanation on Subcategory errors)---------

## Using VTL to perform calculations

--- Explanation on VTL scripts for calculations, overview on the code ---

## Running the VTL script

In [23]:
script = _load_script("vtl/calculations.vtl")
data_df = pd.read_csv("data_files/golden_copy_changed_10000.csv")
datapoints = {"LEI_DATA": data_df}

calculations_result = run(script=script, data_structures=vtl_datastructure,
                          datapoints=datapoints)

In [14]:
calculations_result['lei_statistics'].data

| | COUNTRY | MEASURE | OBS_VALUE |
|---|---|---|---|
| 0 | US | NUMBER_INCORPORATED_ENTITIES | 594 |
| 1 | CZ | NUMBER_INCORPORATED_ENTITIES | 53 |
| 2 | CA | NUMBER_INCORPORATED_ENTITIES | 16 |
| 3 | KY | NUMBER_INCORPORATED_ENTITIES | 88 |
| 4 | IE | NUMBER_INCORPORATED_ENTITIES | 36 |
| ... | ... | ... | ... |
| 148 | NO | NUMBER_ENTITIES_DIFF_HQ | 1 |
| 149 | CA | NUMBER_ENTITIES_DIFF_HQ | 1 |
| 150 | BE | NUMBER_ENTITIES_DIFF_HQ | 1 |
| 151 | MH | NUMBER_ENTITIES_DIFF_HQ | 1 |


---------(Explanation on the calculations)---------
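
The content of calculations.vtl is not shown in this notebook. For readers more familiar with pandas, the sketch below shows one way a result with the same shape (COUNTRY, MEASURE, OBS_VALUE) could be produced. It assumes that NUMBER_INCORPORATED_ENTITIES counts entities per country of incorporation and that NUMBER_ENTITIES_DIFF_HQ counts entities whose headquarters country differs from it. Both interpretations, the input column names `LEGAL_COUNTRY` and `HQ_COUNTRY`, and the toy rows are assumptions, not the actual VTL logic or DSD.

```python
import pandas as pd

# Toy input with made-up rows and hypothetical column names
entities = pd.DataFrame(
    {
        "LEI": ["A1", "B2", "C3", "D4"],
        "LEGAL_COUNTRY": ["US", "US", "CZ", "US"],
        "HQ_COUNTRY": ["US", "IE", "CZ", "US"],
    }
)


def count_by_country(df, measure):
    """Count rows per LEGAL_COUNTRY and label them with a measure name."""
    return (
        df.groupby("LEGAL_COUNTRY")
        .size()
        .rename_axis("COUNTRY")
        .reset_index(name="OBS_VALUE")
        .assign(MEASURE=measure)
    )


incorporated = count_by_country(entities, "NUMBER_INCORPORATED_ENTITIES")
# Keep only entities whose headquarters are in a different country
different_hq = entities[entities["LEGAL_COUNTRY"] != entities["HQ_COUNTRY"]]
diff_hq = count_by_country(different_hq, "NUMBER_ENTITIES_DIFF_HQ")

stats = pd.concat([incorporated, diff_hq], ignore_index=True)
stats = stats[["COUNTRY", "MEASURE", "OBS_VALUE"]]
print(stats)
```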

# Generate SDMX file with the aggregated data

We build a PandasDataset from the vtlengine output
and serialise it with the SDMX-ML 2.1 data writer from pysdmx.

## Setting up the code

In [15]:
from pysdmx.io.pd import PandasDataset

data = calculations_result['lei_statistics'].data
structure = "DataStructure=MD:LEI_AGGREGATE_STATISTICS(1.0)"
pd_dataset = PandasDataset(structure=structure, data=data)

### Write the SDMX-ML 2.1 file

In [16]:
from pysdmx.io.xml.sdmx21.writer.structure_specific import write

output = write([pd_dataset], prettyprint=False)

output

'<?xml version="1.0" encoding="UTF-8"?><mes:StructureSpecificData xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:mes="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message" xmlns:ss="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/data/structurespecific" xmlns:com="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/common" xmlns:ns1="urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=MD:LEI_AGGREGATE_STATISTICS(1.0):ObsLevelDim:AllDimensions" xsi:schemaLocation="http://www.sdmx.org/resources/sdmxml/schemas/v2_1/message https://registry.sdmx.org/schemas/v2_1/SDMXMessage.xsd"><mes:Header><mes:ID>ae1f1049-9063-4338-a1fd-200a7b27bbb0</mes:ID><mes:Test>true</mes:Test><mes:Prepared>2025-01-26T14:25:48</mes:Prepared><mes:Sender id="ZZZ"/><mes:Structure structureID="LEI_AGGREGATE_STATISTICS" namespace="urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure==MD:LEI_AGGREGATE_STATISTICS(1.0)" dimensionAtObservation="AllDimensions"><com:Structure><Ref agencyID="MD" id="LEI

# Reading back the SDMX file using read_sdmx

With the read_sdmx function, we can read back the data we just produced.
Note that we do not need to state which SDMX format is used,
or whether the message is a Data or a Structure message.

In [30]:
from pysdmx.io import read_sdmx

data_msg = read_sdmx(output)
data_msg.data[0].data

| | COUNTRY | MEASURE | OBS_VALUE |
|---|---|---|---|
| 0 | US | NUMBER_INCORPORATED_ENTITIES | 594 |
| 1 | CZ | NUMBER_INCORPORATED_ENTITIES | 53 |
| 2 | CA | NUMBER_INCORPORATED_ENTITIES | 16 |
| 3 | KY | NUMBER_INCORPORATED_ENTITIES | 88 |
| 4 | IE | NUMBER_INCORPORATED_ENTITIES | 36 |
| ... | ... | ... | ... |
| 148 | NO | NUMBER_ENTITIES_DIFF_HQ | 1 |
| 149 | CA | NUMBER_ENTITIES_DIFF_HQ | 1 |
| 150 | BE | NUMBER_ENTITIES_DIFF_HQ | 1 |
| 151 | MH | NUMBER_ENTITIES_DIFF_HQ | 1 |
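
To check a round trip like this one programmatically rather than by eye, the two DataFrames can be compared after normalising them. The sketch below is one possible way to do it in plain pandas. The helper name `same_content` is hypothetical, and it assumes that values may come back as strings or in a different row order after parsing; whether that happens with pysdmx is not shown in this notebook.

```python
import pandas as pd


def same_content(left, right):
    """Compare two DataFrames, ignoring dtypes, row order and column order."""
    if sorted(left.columns) != sorted(right.columns):
        return False
    cols = sorted(left.columns)

    def normalise(df):
        # Cast everything to str so 594 and "594" compare as equal
        return df[cols].astype(str).sort_values(cols).reset_index(drop=True)

    return normalise(left).equals(normalise(right))


# Toy frames: same rows, but string values and a different row order
original = pd.DataFrame(
    {"COUNTRY": ["US", "CZ"], "MEASURE": ["M1", "M1"], "OBS_VALUE": [594, 53]}
)
round_tripped = pd.DataFrame(
    {"COUNTRY": ["CZ", "US"], "MEASURE": ["M1", "M1"], "OBS_VALUE": ["53", "594"]}
)
print(same_content(original, round_tripped))
```

In the notebook, the same helper could be applied to `calculations_result['lei_statistics'].data` and `data_msg.data[0].data`.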
