# Overview of the use case

---------(Overview on the use case,
why is it important for data producers,
usefulness of the library
when reading non-SDMX data
and convert to SDMX. Mention validations and calculations over VTL.)---------

Talking points and agenda:

- General use of pysdmx by data producers
- Outside the SDMX garden, looking at LEI and GLEIF
- Data cleaning and setup using pandas
- Downloading and reading the ConceptScheme in SDMX-ML 2.1 using read_sdmx
- Retrieving the Schema from FMR (FusionJSON)
- Converting the Schema to a VTL DataStructure
- Using VTL to validate the data
- Using VTL to perform calculations
- Generating an SDMX file with the aggregated data
- Reading back the SDMX file using read_sdmx

### List of pysdmx classes and functions used in this notebook:

Functions:
- pysdmx.io.read_sdmx
- pysdmx.io.csv.sdmx20.writer.write

Classes:
- pysdmx.api.fmr.RegistryClient (and methods)
- pysdmx.model.message.Message (and methods)
- pysdmx.io.pd.PandasDataset
- pysdmx.model.dataflow.Schema

# Outside the SDMX garden, looking at LEI and GLEIF

---------(Explanation on LEI and GLEIF, use for data producers)---------

## Data cleaning and set up using pandas and pysdmx

For this use case,
we will use the Golden Copy file from GLEIF (link)
and keep only the LEIs whose entity status is ACTIVE.

The code renames the columns and selects the data we need for later validation.
The output can be saved either as plain CSV or as SDMX-CSV 2.0
(using pysdmx).
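
To illustrate the difference between the two outputs, the sketch below writes the same rows as plain CSV and in an SDMX-CSV 2.0 style layout, using only the standard library. The sample rows and the helper names are hypothetical, and the leading STRUCTURE, STRUCTURE_ID and ACTION columns reflect our reading of the SDMX-CSV 2.0 layout; the exact output of the pysdmx writer may differ in detail.

```python
import csv
import io

# Hypothetical rows standing in for the cleaned Golden Copy data
rows = [
    {"LEI": "AAAAAAAAAAAAAAAAAAAA", "LEGAL_NAME": "Example Ltd", "CATEGORY": "GENERAL"},
    {"LEI": "BBBBBBBBBBBBBBBBBBBB", "LEGAL_NAME": "Sample GmbH", "CATEGORY": "FUND"},
]


def to_plain_csv(records):
    """Write the records as a plain CSV text (hypothetical helper)."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=list(records[0]), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


def to_sdmx_csv_like(records, structure_type, structure_id, action="I"):
    """Prefix every row with the assumed SDMX-CSV 2.0 leading columns."""
    prefix = {
        "STRUCTURE": structure_type,
        "STRUCTURE_ID": structure_id,
        "ACTION": action,
    }
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(prefix) + list(records[0]),
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow({**prefix, **record})
    return out.getvalue()


plain_text = to_plain_csv(rows)
sdmx_text = to_sdmx_csv_like(rows, "datastructure", "MD:LEI_DATA(1.0)")
print(plain_text)
print(sdmx_text)
```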

The code uses the chunking capabilities of Pandas for better memory efficiency.
This is a prototype of the streaming capabilities with pandas in pysdmx,
which will be available by the end of 2025.

For the sake of this example,
we distinguish between the Golden Copy original path
(link to download the file) and the Golden Copy Changed,
which would be the output of this code.

This code requires the data extra from pysdmx,
which simply installs pandas.

```bash
pip install pysdmx[data]
```

In [None]:
import os

import pandas as pd
from pysdmx.io.csv.sdmx20.writer import write
from pysdmx.io.pd import PandasDataset

# Original columns and their simple name for next steps of this tutorial
RENAME_DICT = {
    "LEI": "LEI",
    "Entity.LegalName": "LEGAL_NAME",
    "Entity.LegalAddress.Country": "COUNTRY_INCORPORATION",
    "Entity.HeadquartersAddress.Country": "COUNTRY_HEADQUARTERS",
    "Entity.EntityCategory": "CATEGORY",
    "Entity.EntitySubCategory": "SUBCATEGORY",
    "Entity.LegalForm.EntityLegalFormCode": "LEGAL_FORM",
    "Entity.EntityStatus": "STATUS",
    "Entity.LegalAddress.PostalCode": "POSTAL_CODE",
}


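def _process_chunk(data: pd.DataFrame) -> pd.DataFrame:
    """Rename the columns, keep the ones we need and filter on ACTIVE."""
    data = data.rename(columns=RENAME_DICT)
    data = data[list(RENAME_DICT.values())]
    data = data[data["STATUS"] == "ACTIVE"]
    # drop returns a new frame, so we never modify a filtered view in place
    return data.drop(columns=["STATUS"])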


def _save_as_sdmx_csv(data: pd.DataFrame):
    dataset = PandasDataset(
        structure="DataStructure=MD:LEI_DATA(1.0)", data=data
    )
    return write([dataset])


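def __clean_output(output, header=False):
    """Normalise the SDMX-CSV text generated for one chunk.

    Splitting and re-joining the lines removes the CR characters
    that may currently appear on Windows. Unless header is True,
    the header row is dropped, so only the first chunk keeps it.
    """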
    out_lst = output.splitlines()
    if not header:
        out_lst = out_lst[1:]
    output = "\n".join(out_lst)
    del out_lst
    return output


def streaming_load_save_csv_file(golden_copy_original_path, output_filename,
                                 use_sdmx_csv=False, nrows=None):
    """Load data and rename using small memory"""
    chunksize = None
    if nrows is None or nrows > 100000:
        chunksize = 100000
    data = pd.read_csv(golden_copy_original_path, dtype=str,
                       chunksize=chunksize, nrows=nrows)
    # Add header only to the first chunk
    add_header = True
    # Removing the file if already present
    if os.path.exists(output_filename):
        os.remove(output_filename)
    number_of_lines_written = 0
    if isinstance(data, pd.DataFrame):
        data = [data]
    for chunk in data:
        chunk = _process_chunk(chunk)
        if add_header:
            header = True
            add_header = False
        else:
            header = False
        number_of_lines_written += len(chunk)
        if not use_sdmx_csv:
            chunk.to_csv(output_filename, mode="a", index=False, header=header)
        else:
            out = _save_as_sdmx_csv(chunk)
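            out = __clean_output(out, header)
            # Append: mode "w" would overwrite the chunks already written.
            # The cleaned text has no trailing newline, so add one per chunk.
            if out:
                with open(output_filename, "a", encoding="utf-8") as f:
                    f.write(out + "\n")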
    print(f"Number of lines written: {number_of_lines_written}")


streaming_load_save_csv_file(
    golden_copy_original_path="data_files/golden-copy-original.csv",
    output_filename="data_files/golden_copy_changed_10000_sdmx.csv",
    use_sdmx_csv=True, nrows=10000)

# Reading the ConceptScheme on SDMX-ML 2.1 using read_sdmx

For this example,
we generated a DataStructure on FMR called LEI_DATA,
with Short URN: DataStructure=MD:LEI_DATA(1.0),
with the required codelists to be used for structural validation on FMR.
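
As a small illustration of how the Short URN above is composed, the sketch below splits it into artefact type, agency, id and version. The regular expression and the helper name split_short_urn are our own assumptions for this example and are not part of pysdmx.

```python
import re

# Assumed layout: <ArtefactType>=<Agency>:<Id>(<Version>)
SHORT_URN_PATTERN = re.compile(
    r"^(?P<type>\w+)=(?P<agency>[\w.]+):(?P<id>[\w@$-]+)\((?P<version>[^()]+)\)$"
)


def split_short_urn(short_urn):
    """Return the parts of a short URN as a dict (hypothetical helper)."""
    match = SHORT_URN_PATTERN.match(short_urn)
    if match is None:
        raise ValueError(f"Not a valid short URN: {short_urn!r}")
    return match.groupdict()


parts = split_short_urn("DataStructure=MD:LEI_DATA(1.0)")
print(parts)
```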

These structures are also available in the SDMX_Structures/structures.xml file in this project,
or at the MeaningfulData FMR (fmr.meaningfuldata.eu).
Currently the library does not support SDMX-ML 3.0,
so we will read only the ConceptScheme and its descendants (available at SDMX_Structures/concepts.xml).

To ensure we are able to validate the data correctly,
we extended the CL_AREA codelist from SDMX
to add a code that was present in the LEI Golden Copy.
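
A gap of this kind can be spotted by comparing the distinct values of a column against the codes of the codelist. The sketch below shows the idea with plain Python sets. The codelist extract, the sample values and the helper name are hypothetical, and the code XK is only an illustration, not necessarily the code we added.

```python
def codes_missing_from_codelist(values, codelist_codes):
    """Return the distinct values that are not codes of the codelist, sorted."""
    return sorted({value for value in values if value not in codelist_codes})


# Hypothetical extract of CL_AREA and of a country column
cl_area_codes = {"DE", "ES", "FR", "US"}
countries_in_data = ["ES", "DE", "XK", "US", "XK"]

missing = codes_missing_from_codelist(countries_in_data, cl_area_codes)
print(missing)
```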

The code below reads the ConceptScheme and its descendants.
Make sure you have installed the xml extra from pysdmx.

```bash
pip install pysdmx[xml]
```

In [None]:
from pysdmx.io import read_sdmx

structures_msg = read_sdmx("SDMX_Structures/concepts.xml")
# We can access the first concept scheme, or look for the short_urn
concept_scheme1 = structures_msg.get_concept_schemes()[0]
concept_scheme2 = structures_msg.get_concept_scheme(
    "ConceptScheme=MD:LEI_CONCEPTS(1.0)")

# Retrieving the Schema from FMR

We can also use the FMR web services to download the Schema from FMR, using the FusionJSON format.

In [None]:
from pprint import pprint
from pysdmx.api.fmr import RegistryClient
from pysdmx.io.format import StructureFormat

client = RegistryClient(
    "https://fmr.meaningfuldata.eu/sdmx/v2", format=StructureFormat.FUSION_JSON
)
# Tip: use a debugger to inspect the response
schema = client.get_schema(
    "datastructure", agency="MD", id="LEI_DATA", version="1.0"
)
pprint(schema)

# Using VTL to validate the data with GLEIF data quality checks

VTL allows us to perform validations over the data,
with a business-friendly syntax.

For this purpose, at MeaningfulData we have developed a library called vtlengine,
which is able to run VTL scripts over data.

In this example,
we will use a VTL script
that performs validations based on the GLEIF data quality checks
(link) and a custom validation on Subcategory data.


Steps to use VTL from pysdmx:
1. Convert the Schema to a VTL DataStructure
2. Validate the data using VTL
3. Analyse the results
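
To give an idea of what a validation over more than one component produces, here is a plain-Python sketch. It is not VTL and it is not one of the GLEIF checks: the rule, the sample rows and the error code are hypothetical. Only the errorcode and errorlevel column names follow the ones used in the results later in this notebook.

```python
# Hypothetical rule, for illustration only:
# rows with CATEGORY "GENERAL" must not carry a SUBCATEGORY.
def check_subcategory(rows):
    """Return the failing rows, each with an errorcode and an errorlevel."""
    errors = []
    for row in rows:
        if row["CATEGORY"] == "GENERAL" and row["SUBCATEGORY"] is not None:
            errors.append({**row, "errorcode": "SUBCAT_001", "errorlevel": 1})
    return errors


# Hypothetical sample rows
sample_rows = [
    {"LEI": "AAAAAAAAAAAAAAAAAAAA", "CATEGORY": "GENERAL", "SUBCATEGORY": None},
    {"LEI": "BBBBBBBBBBBBBBBBBBBB", "CATEGORY": "GENERAL", "SUBCATEGORY": "X"},
    {"LEI": "CCCCCCCCCCCCCCCCCCCC", "CATEGORY": "FUND", "SUBCATEGORY": "X"},
]

subcategory_errors = check_subcategory(sample_rows)
print(f"Errors found: {len(subcategory_errors)}")
for error in subcategory_errors:
    print(error)
```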

---------(Explanation on VTL validations usefulness,
validating more than one component. Quick overview on the code
using VTL Playground)---------

## Convert the Schema to a VTL DataStructure

This code converts a pysdmx Schema or DataStructureDefinition object into a VTL data structure,
using the MeaningfulData internal format, which is usable only with vtlengine.
pysdmx will include this method,
but there it will generate the VTL 2.1 standard data structure.
Both options will be usable by the vtlengine library.
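
The target shape of the conversion can be sketched without pysdmx or FMR access. In the example below, StandInComponent is a hypothetical stand-in for a pysdmx component, the two mappings are reduced extracts, and the roles assigned to LEI and LEGAL_NAME are assumptions made only for illustration.

```python
import json
from dataclasses import dataclass


@dataclass
class StandInComponent:
    """Hypothetical stand-in for a pysdmx component."""

    id: str
    role: str  # "dimension", "measure" or "attribute"
    dtype: str  # SDMX data type name, e.g. "String"


# Reduced extracts of the role and data type mappings
ROLE_TO_VTL = {
    "dimension": "Identifier",
    "measure": "Measure",
    "attribute": "Attribute",
}
DTYPE_TO_VTL = {"String": "String", "Integer": "Integer", "Decimal": "Number"}


def describe(dataset_name, components):
    """Build the JSON-like dict: one dataset with its list of components."""
    return {
        "datasets": [
            {
                "name": dataset_name,
                "DataStructure": [
                    {
                        "name": c.id,
                        "role": ROLE_TO_VTL[c.role],
                        "type": DTYPE_TO_VTL[c.dtype],
                        # Only identifiers are non-nullable
                        "nullable": c.role != "dimension",
                    }
                    for c in components
                ],
            }
        ]
    }


example = describe(
    "LEI_DATA",
    [
        StandInComponent("LEI", "dimension", "String"),
        StandInComponent("LEGAL_NAME", "attribute", "String"),
    ],
)
print(json.dumps(example, indent=2))
```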

### Setting up the code

In [None]:
import json
from typing import Optional, Dict, Any, List
from pysdmx.model.dataflow import DataStructureDefinition, Component
from pysdmx.model import Role

VTL_DTYPES_MAPPING = {
    "String": "String",
    "Alpha": "String",
    "AlphaNumeric": "String",
    "Numeric": "String",
    "BigInteger": "Integer",
    "Integer": "Integer",
    "Long": "Integer",
    "Short": "Integer",
    "Decimal": "Number",
    "Float": "Number",
    "Double": "Number",
    "Boolean": "Boolean",
    "URI": "String",
    "Count": "Integer",
    "InclusiveValueRange": "Number",
    "ExclusiveValueRange": "Number",
    "Incremental": "Number",
    "ObservationalTimePeriod": "Time_Period",
    "StandardTimePeriod": "Time_Period",
    "BasicTimePeriod": "Date",
    "GregorianTimePeriod": "Date",
    "GregorianYear": "Date",
    "GregorianYearMonth": "Date",
    "GregorianMonth": "Date",
    "GregorianDay": "Date",
    "ReportingTimePeriod": "Time_Period",
    "ReportingYear": "Time_Period",
    "ReportingSemester": "Time_Period",
    "ReportingTrimester": "Time_Period",
    "ReportingQuarter": "Time_Period",
    "ReportingMonth": "Time_Period",
    "ReportingWeek": "Time_Period",
    "ReportingDay": "Time_Period",
    "DateTime": "Date",
    "TimeRange": "Time",
    "Month": "String",
    "MonthDay": "String",
    "Day": "String",
    "Time": "String",
    "Duration": "Duration",
}

VTL_ROLE_MAPPING = {
    Role.DIMENSION: "Identifier",
    Role.MEASURE: "Measure",
    Role.ATTRIBUTE: "Attribute",
}


def to_vtl_json(
        dsd: DataStructureDefinition, path: Optional[str] = None
) -> Optional[Dict[str, Any]]:
    """Formats the DataStructureDefinition as a VTL DataStructure."""
    dataset_name = dsd.id
    components = []
    NAME = "name"
    ROLE = "role"
    TYPE = "type"
    NULLABLE = "nullable"

    _components: List[Component] = []
    _components.extend(dsd.components.dimensions)
    _components.extend(dsd.components.measures)
    _components.extend(dsd.components.attributes)

    for c in _components:
        _type = VTL_DTYPES_MAPPING[c.dtype]
        _nullability = c.role != Role.DIMENSION
        _role = VTL_ROLE_MAPPING[c.role]

        component = {
            NAME: c.id,
            ROLE: _role,
            TYPE: _type,
            NULLABLE: _nullability,
        }

        components.append(component)

    result = {
        "datasets": [{"name": dataset_name, "DataStructure": components}]
    }
    if path is not None:
        with open(path, "w") as fp:
            json.dump(result, fp, indent=2)
        return None

    return result

### Perform the conversion

In [None]:
from pprint import pprint

vtl_datastructure = to_vtl_json(schema)
pprint(vtl_datastructure)

## Validate the data using VTL (sample 10000)

### Setting up the code

In [None]:
import pandas as pd
from vtlengine import run


def _load_script(filename):
    with open(filename, "r") as f:
        script = f.read()
    return script

---------(Explanation on the code, overview on the VTL run method documentation)---------

### Running the VTL script

In [None]:
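script = _load_script("vtl/script.vtl")
# Plain CSV version of the sample, presumably generated by calling
# streaming_load_save_csv_file with use_sdmx_csv=False
data_df = pd.read_csv("data_files/golden_copy_changed_10000.csv")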
datapoints = {"LEI_DATA": data_df}

run_result = run(script=script, data_structures=vtl_datastructure,
                 datapoints=datapoints)
pprint(run_result)

### Getting the total number of errors (sample 10000)

In [None]:
run_result['errors_count'].data

### Analysing data on Subcategory errors (sample 10000)

In [None]:
cols_to_analyse = ['CATEGORY', 'SUBCATEGORY', 'errorcode', 'errorlevel']
run_result['validation.subcategories_errors'].data[cols_to_analyse]

---------(Explanation on Subcategory errors)---------
