# ERDRI CDS
The "Set of common data elements for Rare Diseases Registration" is the first practical instrument released by the EU RD Platform aiming at increasing interoperability of RD registries.

It contains 16 data elements to be registered by each rare disease registry across Europe, which are considered to be essential for further research. They refer to patient's personal data, diagnosis, disease history and care pathway, information for research purposes and about disability.

The "Set of common data elements for Rare Diseases Registration" was produced by a Working Group coordinated by the JRC and composed of experts from EU projects which worked on common data sets: EUCERD Joint Action, EPIRARE and RD-Connect.

[Source](https://eu-rd-platform.jrc.ec.europa.eu/set-of-common-data-elements_en)

## 1. Defining the ERDRI CDS Data Model
When creating a data model definition with this package, it helps to first define the data model in a tabular format such as CSV or Excel. We have transcribed the first six sections of the ERDRI CDS into an Excel file. Take a look:

In [8]:
import pandas as pd

from pathlib import Path

import rarelink_phenopacket_mapper as rlpm

In [9]:
erdri_cds_excel_path = Path('../res/test_data/erdri/erdri_cds.xlsx')
erdri_cds_tabular = pd.read_excel(erdri_cds_excel_path)

erdri_cds_tabular.head(15)

| | data_model_section | data_field_name | description | data_types | required | comment |
|---|---|---|---|---|---|---|
| 0 | 1. Pseudonym | 1.1. Pseudonym | Patient's pseudonym | string | True | |
| 1 | 2. Personal information | 2.1. Date of Birth | Patient's date of birth | date | True | dd/mm/yy |
| 2 | 2. Personal information | 2.2. Sex | Patient's sex at birth | string | True | Female, Male, Undetermined, Foetus (Unknown) |
| 3 | 3. Patient Status | 3.1. Patient's status | Patient alive or dead | | True | Alive, Dead, Lost in follow-up, Opted-out |
| 4 | 3. Patient Status | 3.2. Date of death | Patient's date of death | date | True | dd/mm/yy |
| 5 | 4. Care Pathway | 4.1. First contact with specialised centre | Date of first contact with specialised centre | date | True | dd/mm/yy |
| 6 | 5. Disease history | 5.1. Age at onset | Age at which symptoms/signs first appeared | string, date | True | Antenatal, At birth, Date (dd/mm/yyyy), Undete... |
| 7 | 5. Disease history | 5.2. Age at diagnosis | Age at which diagnosis was made | | True | Antenatal, At birth, Date (dd/mm/yyyy), Undete... |
| 8 | 6. Diagnosis | 6.1. Diagnosis of the rare disease | Diagnosis retained by the specialised centre | orpha, alpha, icd-9, icd-9-cm, icd-10 | True | Orpha code (strongly\nrecommended – see link) ... |
| 9 | 6. Diagnosis | 6.2. Genetic diagnosis | Genetic diagnosis retained by\nthe specialised... | hgvs, hgnc, omim | True | International classification of\nmutations (HG... |
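
If you want to start a tabular definition of your own, a minimal sketch using only Python's standard `csv` module might look like the following. The column names mirror the ERDRI CDS file above and the two rows are copied from the table; the in-memory buffer is an assumption made for illustration (in practice you would write to a `.csv` file on disk).

```python
import csv
import io

# Column layout mirrors the ERDRI CDS spreadsheet shown above.
COLUMNS = [
    "data_model_section",
    "data_field_name",
    "description",
    "data_types",
    "required",
    "comment",
]

rows = [
    {
        "data_model_section": "1. Pseudonym",
        "data_field_name": "1.1. Pseudonym",
        "description": "Patient's pseudonym",
        "data_types": "string",
        "required": True,
        "comment": "",
    },
    {
        "data_model_section": "2. Personal information",
        "data_field_name": "2.1. Date of Birth",
        "description": "Patient's date of birth",
        "data_types": "date",
        "required": True,
        "comment": "dd/mm/yy",
    },
]

# Write the definition to an in-memory CSV (stand-in for a real file).
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)

# Read it back; note that the csv module returns every value as a string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[1]["data_field_name"])
```

Because CSV has no native types, the `required` flag comes back as the string `"True"`, so a reader of such a file has to convert it explicitly.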


## 1.1. Defining the Resources used in the Data Model
To accurately load the data model from a file, we need to define the resources that are used in the data model. 

We can use the resources that are predefined in `rlpm.data_standards.code_system` and enhance them by setting the version of each resource that the data model actually uses.

We can then refer to these resources in the data model definition file by listing their `namespace_prefix` in the `data_types` column.

In [10]:
from rarelink_phenopacket_mapper.data_standards import code_system

In [11]:
resources = [
    code_system.ORDO.set_version('2024-08-02'),
    code_system.ICD10_GM,
    code_system.HPO,
    code_system.HGVS,
    code_system.OMIM,
    code_system.HGNC,
    code_system.ICD9
]
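
To make the link between the `data_types` column and the resource list more concrete, here is a small self-contained sketch. It does not use the package: `CodeSystemStub`, `resolve`, and the prefix values `orpha` and `hgvs` are hypothetical stand-ins chosen for illustration, and whether the package's own `set_version` returns a copy is an assumption.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class CodeSystemStub:
    """Hypothetical stand-in for a code system resource (not the package's class)."""
    name: str
    namespace_prefix: str
    version: Optional[str] = None

    def set_version(self, version: str) -> "CodeSystemStub":
        # Assumption: return a copy so the shared constant stays unversioned.
        return replace(self, version=version)


# Illustrative constants; the prefixes are assumptions.
ORDO = CodeSystemStub("Orphanet Rare Disease Ontology", "orpha")
HGVS = CodeSystemStub("HGVS Nomenclature", "hgvs")

resources = [ORDO.set_version("2024-08-02"), HGVS]
by_prefix = {r.namespace_prefix: r for r in resources}


def resolve(data_types: str) -> Tuple[List[CodeSystemStub], List[str]]:
    """Split a comma-separated data_types cell into matched resources and leftovers."""
    matched: List[CodeSystemStub] = []
    unmatched: List[str] = []
    for token in (t.strip().lower() for t in data_types.split(",")):
        if not token:
            continue
        if token in by_prefix:
            matched.append(by_prefix[token])
        else:
            unmatched.append(token)
    return matched, unmatched


matched, unmatched = resolve("orpha, alpha, icd-9, icd-9-cm, icd-10")
print([r.namespace_prefix for r in matched])
print(unmatched)
```

In this sketch, any entry that does not match a prefix in the resource list ends up in `unmatched`, which is the kind of situation where a parser would have to warn or fall back to a plain string.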

## 1.2. Reading in the Data Model from a file

### Data Model Definition
Now we can import this tabular data model definition into the package and create a data model definition object.

We start by defining a dictionary whose keys are the names of the fields of the `DataField` class and whose values are the columns of the file we want to import our data model from. Naming the columns after the fields is recommended but not necessary; in our file several names differ (for example, the `comment` column maps to the `specification` field), which the dictionary takes care of.

We pass the path to the tabular data model definition, its file type, and the `column_names` dictionary to the `rlpm.pipeline.read_data_model` function.

In [12]:
column_names = {
    # left side: fields of DataField class, right side: names of columns in data model definition file
    'name': 'data_field_name',
    'section': 'data_model_section',
    'description': 'description',
    'data_type': 'data_types',
    'required': 'required',
    'specification': 'comment',
    'ordinal': ''  # if left empty such as here, the program will try to parse the ordinal from the file or leave it empty otherwise
}

erdri_cds_data_model = rlpm.pipeline.read_data_model(
    data_model_name='ERDRI CDS',
    path=erdri_cds_excel_path,
    file_type='excel',
    column_names=column_names,
    resources=resources,
    remove_line_breaks=True,
    parse_data_types=True,
)

print(erdri_cds_data_model)

df_columns=['data_model_section', 'data_field_name', 'description', 'data_types', 'required', 'comment']
Column data_field_name maps to DataField.name
Column data_model_section maps to DataField.section
Column description maps to DataField.description
Column data_types maps to DataField.data_type
Column required maps to DataField.required
Column comment maps to DataField.specification

ERDRI CDS
DataModel(name=ERDRI CDS
	DataField(
		ordinal, section=None 1. Pseudonym,
		name=1.1. Pseudonym,
		data type=[<class 'str'>], required=True,
		sepcification=None
	)
	DataField(
		ordinal, section=None 2. Personal information,
		name=2.1. Date of Birth,
		data type=[<class 'rarelink_phenopacket_mapper.data_standards.date.Date'>], required=True,
		sepcification=dd/mm/yy
	)
	DataField(
		ordinal, section=None 2. Personal information,
		name=2.2. Sex,
		data type=[<class 'str'>], required=True,
		sep
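
The role of the `column_names` dictionary can be illustrated with a short, self-contained sketch. This is not how `read_data_model` is implemented; `FieldStub` and `row_to_field` are hypothetical names, and the sketch only shows the general idea of translating file columns into field attributes.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FieldStub:
    """Hypothetical, simplified stand-in for the package's DataField."""
    name: str
    section: str
    description: str
    data_type: str
    required: bool
    specification: Optional[str] = None
    ordinal: Optional[str] = None


# Same mapping as in the cell above: attribute name -> column name.
column_names = {
    "name": "data_field_name",
    "section": "data_model_section",
    "description": "description",
    "data_type": "data_types",
    "required": "required",
    "specification": "comment",
    "ordinal": "",
}


def row_to_field(row: Dict[str, object], mapping: Dict[str, str]) -> FieldStub:
    # An empty column name means "no column for this attribute": use None.
    kwargs = {attr: (row.get(col) if col else None) for attr, col in mapping.items()}
    return FieldStub(**kwargs)


row = {
    "data_model_section": "2. Personal information",
    "data_field_name": "2.1. Date of Birth",
    "description": "Patient's date of birth",
    "data_types": "date",
    "required": True,
    "comment": "dd/mm/yy",
}

field = row_to_field(row, column_names)
print(field.name, "->", field.specification)
```

Keeping the mapping in a dictionary means the spreadsheet can use whatever headers suit its authors, while the code only ever sees the attribute names.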

### Done! ... Almost!
This applies only if `parse_data_types=True`. If you inspect the output above, you might find warnings such as:
```
Warning: The type icd-10 could not be parsed to a type or resource. If it refers to a resource, please add it to the list of resources. Otherwise, check your file.
```
This can happen if a resource was not included in the list of resources, or because of a parsing error. Both are usually easy to fix. Having actual types in the `DataModel` instead of strings helps later, when the data model is used to load data: each field can then be checked against its expected type. By default, `compliance` is set to soft, which produces warnings. Changing the setting to hard raises a `ValueError` instead.
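
The difference between soft and hard compliance can be sketched as follows. The function `check_value` and its parameters are hypothetical and only illustrate the warn-versus-raise pattern described above; the package's own checks may be organised differently.

```python
import warnings
from typing import Tuple


def check_value(
    field_name: str,
    value: object,
    allowed_types: Tuple[type, ...],
    compliance: str = "soft",
) -> bool:
    """Return True if value matches one of allowed_types; otherwise warn or raise."""
    if compliance not in ("soft", "hard"):
        raise ValueError(f"Unknown compliance level: {compliance!r}")
    if isinstance(value, allowed_types):
        return True
    message = (
        f"Field {field_name!r}: value {value!r} does not match "
        f"any of the types {[t.__name__ for t in allowed_types]}"
    )
    if compliance == "hard":
        # Hard compliance: stop immediately.
        raise ValueError(message)
    # Soft compliance: report the problem and carry on.
    warnings.warn(message)
    return False


# Soft compliance records a warning and returns False.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    soft_result = check_value("1.1. Pseudonym", 42, (str,), compliance="soft")

# Hard compliance raises a ValueError for the same input.
try:
    check_value("1.1. Pseudonym", 42, (str,), compliance="hard")
    hard_raised = False
except ValueError:
    hard_raised = True

print(soft_result, len(caught), hard_raised)
```

Soft compliance suits exploratory loading, where you want to see every problem in one pass; hard compliance suits pipelines in which a single malformed value should stop processing.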