# Overview of the use case

This notebook walks through a data producer use case for pysdmx: reading non-SDMX data, converting it to SDMX, and running validations and calculations on it with VTL.

Talking points and agenda:

- General use of pysdmx for data producers
- Outside the SDMX garden: looking at LEI and GLEIF
- Data cleaning and setup using pandas
- Downloading and reading the ConceptScheme in SDMX-ML 2.1 using read_sdmx
- Retrieving the Schema from FMR (FusionJSON)
- Converting the Schema to a VTL DataStructure
- Using VTL to validate the data
- Using VTL to perform calculations
- Generating an SDMX file with the aggregated data
- Reading back the SDMX file using read_sdmx

# Retrieving the Schema from FMR

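The cell below creates a RegistryClient pointing at a local FMR instance and asks for Fusion-JSON responses. It then calls get_schema to retrieve the schema of the LEI_DATA data structure (agency MD, version 1.0).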


In [18]:
from pysdmx.api.fmr import RegistryClient
from pysdmx.io.format import StructureFormat

client = RegistryClient(
    "http://localhost:8080/sdmx/v2", format=StructureFormat.FUSION_JSON
)
# Tip: use a debugger to inspect the response
schema = client.get_schema(
    "datastructure", agency="MD", id="LEI_DATA", version="1.0"
)

# Convert the Schema to a VTL DataStructure

## Setting up the code

In [19]:
import json
from typing import Optional, Dict, Any, List
from pysdmx.model.dataflow import DataStructureDefinition, Component
from pysdmx.model import Role

VTL_DTYPES_MAPPING = {
    "String": "String",
    "Alpha": "String",
    "AlphaNumeric": "String",
    "Numeric": "String",
    "BigInteger": "Integer",
    "Integer": "Integer",
    "Long": "Integer",
    "Short": "Integer",
    "Decimal": "Number",
    "Float": "Number",
    "Double": "Number",
    "Boolean": "Boolean",
    "URI": "String",
    "Count": "Integer",
    "InclusiveValueRange": "Number",
    "ExclusiveValueRange": "Number",
    "Incremental": "Number",
    "ObservationalTimePeriod": "Time_Period",
    "StandardTimePeriod": "Time_Period",
    "BasicTimePeriod": "Date",
    "GregorianTimePeriod": "Date",
    "GregorianYear": "Date",
    "GregorianYearMonth": "Date",
    "GregorianMonth": "Date",
    "GregorianDay": "Date",
    "ReportingTimePeriod": "Time_Period",
    "ReportingYear": "Time_Period",
    "ReportingSemester": "Time_Period",
    "ReportingTrimester": "Time_Period",
    "ReportingQuarter": "Time_Period",
    "ReportingMonth": "Time_Period",
    "ReportingWeek": "Time_Period",
    "ReportingDay": "Time_Period",
    "DateTime": "Date",
    "TimeRange": "Time",
    "Month": "String",
    "MonthDay": "String",
    "Day": "String",
    "Time": "String",
    "Duration": "Duration",
}

VTL_ROLE_MAPPING = {
    Role.DIMENSION: "Identifier",
    Role.MEASURE: "Measure",
    Role.ATTRIBUTE: "Attribute",
}


def to_vtl_json(
    dsd: DataStructureDefinition, path: Optional[str] = None
) -> Optional[Dict[str, Any]]:
    """Formats the DataStructureDefinition as a VTL DataStructure."""
    dataset_name = dsd.id
    components = []
    NAME = "name"
    ROLE = "role"
    TYPE = "type"
    NULLABLE = "nullable"

    _components: List[Component] = []
    _components.extend(dsd.components.dimensions)
    _components.extend(dsd.components.measures)
    _components.extend(dsd.components.attributes)

    for c in _components:
        _type = VTL_DTYPES_MAPPING[c.dtype]
        _nullability = c.role != Role.DIMENSION
        _role = VTL_ROLE_MAPPING[c.role]

        component = {
            NAME: c.id,
            ROLE: _role,
            TYPE: _type,
            NULLABLE: _nullability,
        }

        components.append(component)

    result = {
        "datasets": [{"name": dataset_name, "DataStructure": components}]
    }
    if path is not None:
        with open(path, "w") as fp:
            json.dump(result, fp, indent=2)
        return None

    return result
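
Note that the lookup `VTL_DTYPES_MAPPING[c.dtype]` raises a `KeyError` for any data type that is missing from the table. The following is a minimal sketch of a more forgiving lookup, assuming that falling back to `String` is acceptable for your data. `vtl_type_for` and the reduced `SAMPLE_MAPPING` are hypothetical names used only for illustration and are not part of pysdmx.

```python
import warnings

# Reduced copy of the mapping above, only for illustration.
SAMPLE_MAPPING = {"String": "String", "Integer": "Integer", "Decimal": "Number"}


def vtl_type_for(dtype: str, mapping: dict, default: str = "String") -> str:
    """Return the VTL type for an SDMX data type, warning on unknown types."""
    try:
        return mapping[dtype]
    except KeyError:
        # Assumption: an unmapped type can safely be treated as text.
        warnings.warn(f"No VTL type mapped for {dtype!r}; using {default!r}")
        return default


print(vtl_type_for("Decimal", SAMPLE_MAPPING))  # mapped type
print(vtl_type_for("XHTML", SAMPLE_MAPPING))    # unmapped, falls back
```

Whether a silent fallback or a hard failure is preferable depends on the use case: failing fast makes gaps in the mapping visible, while a fallback keeps the conversion running.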

## Convert the Schema to a VTL DataStructure

In [20]:
vtl_datastructure = to_vtl_json(schema)
vtl_datastructure

{'datasets': [{'name': 'LEI_DATA',
   'DataStructure': [{'name': 'LEI',
     'role': 'Identifier',
     'type': 'String',
     'nullable': False},
    {'name': 'POSTAL_CODE',
     'role': 'Measure',
     'type': 'String',
     'nullable': True},
    {'name': 'COUNTRY_INCORPORATION',
     'role': 'Measure',
     'type': 'String',
     'nullable': True},
    {'name': 'COUNTRY_HEADQUARTERS',
     'role': 'Measure',
     'type': 'String',
     'nullable': True},
    {'name': 'CATEGORY',
     'role': 'Measure',
     'type': 'String',
     'nullable': True},
    {'name': 'SUBCATEGORY',
     'role': 'Measure',
     'type': 'String',
     'nullable': True},
    {'name': 'LEGAL_FORM',
     'role': 'Measure',
     'type': 'String',
     'nullable': True},
    {'name': 'LEGAL_NAME',
     'role': 'Attribute',
     'type': 'String',
     'nullable': True}]}]}

# Using VTL to validate the data

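VTL validations are useful because a single rule can check more than one component at a time, for example the combination of CATEGORY and SUBCATEGORY analysed later in this section.

---------(Quick overview on the code using VTL Playground)---------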

## Setting up the code

In [21]:
import pandas as pd
from vtlengine import run

def _load_script(filename):
    with open(filename, "r") as f:
        script = f.read()
    return script


script = _load_script("vtl/script.vtl")
data_df = pd.read_csv("data_files/golden_copy_10000.csv")
datapoints = {"LEI_DATA": data_df}
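
Before running the script, it can help to confirm that the CSV contains every component declared in the data structure. Below is a minimal sketch using only the standard library. `missing_columns` and `example_structure` are hypothetical names, and the structure is a trimmed-down copy of the output of `to_vtl_json` shown earlier.

```python
from typing import Any, Dict, Iterable, List


def missing_columns(
    vtl_structure: Dict[str, Any], dataset_name: str, columns: Iterable[str]
) -> List[str]:
    """List component names of `dataset_name` that are absent from `columns`."""
    available = set(columns)
    for dataset in vtl_structure["datasets"]:
        if dataset["name"] == dataset_name:
            return [
                comp["name"]
                for comp in dataset["DataStructure"]
                if comp["name"] not in available
            ]
    raise KeyError(f"Dataset {dataset_name!r} not found in the structure")


# Trimmed-down structure in the shape produced by to_vtl_json
example_structure = {
    "datasets": [
        {
            "name": "LEI_DATA",
            "DataStructure": [
                {"name": "LEI", "role": "Identifier",
                 "type": "String", "nullable": False},
                {"name": "CATEGORY", "role": "Measure",
                 "type": "String", "nullable": True},
            ],
        }
    ]
}

# With a DataFrame, pass data_df.columns instead of a plain list
print(missing_columns(example_structure, "LEI_DATA", ["LEI", "POSTAL_CODE"]))
```

With the real objects, the call would be `missing_columns(vtl_datastructure, "LEI_DATA", data_df.columns)`, and an empty list means every declared component has a matching column.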

## Validate the data using VTL

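run receives the VTL script, the data structures, and the datapoints, a dictionary keyed by dataset name. Its result is indexed by output dataset name, and each entry exposes its rows through .data, as the next cells show.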

In [22]:
run_result = run(script=script, data_structures=vtl_datastructure, datapoints=datapoints)
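
To get a quick overview of everything the script produced, the result can be summarised by row count. This is a minimal sketch that assumes only what the following cells rely on: the result is indexed by dataset name and each value has a `.data` attribute supporting `len()`. `summarise_results` and `fake_results` are hypothetical names, and the stand-in objects replace the real output so the example runs without vtlengine.

```python
from types import SimpleNamespace
from typing import Any, Dict, Mapping


def summarise_results(results: Mapping[str, Any]) -> Dict[str, int]:
    """Map each output dataset name to its number of rows (0 if no data)."""
    summary = {}
    for name, dataset in results.items():
        data = getattr(dataset, "data", None)
        summary[name] = 0 if data is None else len(data)
    return summary


# Stand-ins for the objects returned by run(); the real ones hold DataFrames
fake_results = {
    "errors_count": SimpleNamespace(data=[{"errorlevel": 1}]),
    "validation.subcategories_errors": SimpleNamespace(data=[{}, {}, {}, {}]),
    "empty_output": SimpleNamespace(data=None),
}
print(summarise_results(fake_results))
```

With the real output, `summarise_results(run_result)` would give one entry per dataset defined in the script.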

### Getting the total number of errors

In [23]:
run_result['errors_count'].data

| | errorlevel | int_var |
|---|---|---|
| 0 | 1 | |


### Analysing data on Subcategory errors

In [24]:
cols_to_analyse = ['CATEGORY', 'SUBCATEGORY', 'errorcode', 'errorlevel']
run_result['validation.subcategories_errors'].data[cols_to_analyse]

| | CATEGORY | SUBCATEGORY | errorcode | errorlevel |
|---|---|---|---|---|
| 0 | RESIDENT_GOVERNMENT_ENTITY | | C1 | 1 |
| 1 | RESIDENT_GOVERNMENT_ENTITY | | C1 | 1 |
| 2 | RESIDENT_GOVERNMENT_ENTITY | | C1 | 1 |
| 3 | RESIDENT_GOVERNMENT_ENTITY | | C1 | 1 |


All four rows have CATEGORY RESIDENT_GOVERNMENT_ENTITY and an empty SUBCATEGORY, and each is reported with error code C1 at error level 1.
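
The validation script itself (`vtl/script.vtl`) is not shown here. To illustrate what a rule spanning two components looks like, here is a plain-Python sketch of one plausible reading of the output above: certain categories must come with a subcategory. This is an assumption for illustration, not the actual VTL rule. `subcategory_errors`, `CATEGORIES_REQUIRING_SUBCATEGORY`, and the sample rows are all made up.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical rule: these categories must come with a subcategory
CATEGORIES_REQUIRING_SUBCATEGORY = {"RESIDENT_GOVERNMENT_ENTITY"}


def subcategory_errors(rows: List[Row]) -> List[Row]:
    """Return rows breaking the rule, tagged with errorcode and errorlevel."""
    errors = []
    for row in rows:
        needs_sub = row.get("CATEGORY") in CATEGORIES_REQUIRING_SUBCATEGORY
        if needs_sub and not row.get("SUBCATEGORY"):
            errors.append({**row, "errorcode": "C1", "errorlevel": 1})
    return errors


# Made-up sample rows; only the first one breaks the hypothetical rule
sample = [
    {"LEI": "A", "CATEGORY": "RESIDENT_GOVERNMENT_ENTITY", "SUBCATEGORY": None},
    {"LEI": "B", "CATEGORY": "RESIDENT_GOVERNMENT_ENTITY", "SUBCATEGORY": "LOCAL"},
    {"LEI": "C", "CATEGORY": "FUND", "SUBCATEGORY": None},
]
print(subcategory_errors(sample))
```

The point of the sketch is that the check needs both CATEGORY and SUBCATEGORY to decide whether a row is valid, which is the kind of multi-component rule VTL expresses declaratively.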