
MyDigiTwinNL/LifelinesCSV2CDF


Lifelines CSV data files to JSON/CDF (Cohort-data format) transformation tool

Cohort studies play a crucial role in understanding the relationships between various factors and health outcomes over time. These studies collect extensive data from participants, including demographic information, clinical variables, lifestyle factors, and biomarkers. However, analyzing cohort study data is often challenging, particularly when a variable must be followed across time points: the data for the same variable is typically scattered across multiple files, each representing a specific assessment or follow-up visit. This fragmentation makes it difficult to perform comprehensive longitudinal analyses or, in the particular case of the MyDigiTwin project, to collect the values of the same variable at multiple points in time so that they can be mapped to standards like FHIR/MedMij.

This tool transforms cohort study data files in CSV (Comma-Separated Values) format into a format we call CDF/JSON (Cohort Data Format), which is already used by other data analysis tools in the MyDigiTwin project. A CDF file describes all the variables of an individual study participant and their values over time (i.e., at each assessment). This format is particularly useful for generating FHIR/MedMij-compliant data (the task of one of the aforementioned analysis tools).

To illustrate what the tool does, consider the following three files, which contain values for the same variables (VariableX, VariableY, and VariableZ), collected in three different assessments (a1, a2, a3) for three participants (9f0..., 961..., e84...):

a1_file.csv (first assessment datafile)

PROJECT_PSEUDO_ID,VariableX,VariableY,VariableZ
"9f0f9676-9ec4-3","59","1","64"
"9618d9b2-65dd-3","77","1","49"
"e84e5122-f9d3-3","15","2","49"

a2_file.csv (second assessment datafile)

PROJECT_PSEUDO_ID,VariableX,VariableY,VariableZ
"9f0f9676-9ec4-3","67","1","31"
"9618d9b2-65dd-3","99","2","27"
"e84e5122-f9d3-3","59","1","19"

a3_file.csv (third assessment datafile)

PROJECT_PSEUDO_ID,VariableX,VariableY,VariableZ
"9f0f9676-9ec4-3","88","2","23"
"9618d9b2-65dd-3","3","2","64"
"e84e5122-f9d3-3","56","1","38"

Let's assume that only VariableX and VariableZ are needed in the analysis. To perform the transformation, define a configuration file that indicates, for each variable, which file contains each of its assessments:

configuration.json

{
    "VariableX": [{"a1":"a1_file.csv"},{"a2":"a2_file.csv"},{"a3":"a3_file.csv"}],
    "VariableZ": [{"a1":"a1_file.csv"},{"a2":"a2_file.csv"},{"a3":"a3_file.csv"}]
}
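
The configuration maps each variable name to a list of single-entry objects, each pairing an assessment label with the CSV file that holds it. As a minimal sketch (the `flatten_config` helper is hypothetical and not part of this tool), the structure can be flattened into (variable, assessment, file) triples to see which files a run would need:

```python
import json

# Same structure as the configuration.json example above.
CONFIG_TEXT = """
{
    "VariableX": [{"a1":"a1_file.csv"},{"a2":"a2_file.csv"},{"a3":"a3_file.csv"}],
    "VariableZ": [{"a1":"a1_file.csv"},{"a2":"a2_file.csv"},{"a3":"a3_file.csv"}]
}
"""


def flatten_config(config):
    """Hypothetical helper: yield (variable, assessment, csv_file) triples."""
    for variable, entries in config.items():
        for entry in entries:
            for assessment, csv_file in entry.items():
                yield variable, assessment, csv_file


config = json.loads(CONFIG_TEXT)
triples = list(flatten_config(config))
# Distinct files referenced anywhere in the configuration.
files_needed = sorted({csv_file for _, _, csv_file in triples})

for variable, assessment, csv_file in triples:
    print(f"{variable} / {assessment} -> {csv_file}")
print("Files needed:", files_needed)
```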

With the above configuration, the tool would generate the following JSON files, one per participant:

{
    "PROJECT_PSEUDO_ID": {"1A": "9f0f9676-9ec4-3"}, 
    "VariableX": {"a1": "59", "a2": "67", "a3": "88"}, 
    "VariableZ": {"a1": "64", "a2": "31", "a3": "23"}
}

{
    "PROJECT_PSEUDO_ID": {"1A": "9618d9b2-65dd-3"}, 
    "VariableX": {"a1": "77", "a2": "99", "a3": "3"}, 
    "VariableZ": {"a1": "49", "a2": "27", "a3": "64"}
}

{
    "PROJECT_PSEUDO_ID": {"1A": "e84e5122-f9d3-3"}, 
    "VariableX": {"a1": "15", "a2": "59", "a3": "56"}, 
    "VariableZ": {"a1": "49", "a2": "19", "a3": "38"}
}
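
The regrouping illustrated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the `transform` function is hypothetical, the CSV contents are held in memory rather than read from disk, and the `PROJECT_PSEUDO_ID` entry of the real output is left out because the sketch keys records by participant ID instead.

```python
import csv
import io
import json

# The three assessment files from the example, held in memory for brevity.
CSV_FILES = {
    "a1_file.csv": '''PROJECT_PSEUDO_ID,VariableX,VariableY,VariableZ
"9f0f9676-9ec4-3","59","1","64"
"9618d9b2-65dd-3","77","1","49"
"e84e5122-f9d3-3","15","2","49"
''',
    "a2_file.csv": '''PROJECT_PSEUDO_ID,VariableX,VariableY,VariableZ
"9f0f9676-9ec4-3","67","1","31"
"9618d9b2-65dd-3","99","2","27"
"e84e5122-f9d3-3","59","1","19"
''',
    "a3_file.csv": '''PROJECT_PSEUDO_ID,VariableX,VariableY,VariableZ
"9f0f9676-9ec4-3","88","2","23"
"9618d9b2-65dd-3","3","2","64"
"e84e5122-f9d3-3","56","1","38"
''',
}

CONFIG = {
    "VariableX": [{"a1": "a1_file.csv"}, {"a2": "a2_file.csv"}, {"a3": "a3_file.csv"}],
    "VariableZ": [{"a1": "a1_file.csv"}, {"a2": "a2_file.csv"}, {"a3": "a3_file.csv"}],
}


def transform(config, csv_files):
    """Hypothetical sketch: return {participant_id: {variable: {assessment: value}}}."""
    # Index each file once: file name -> {participant id -> row}.
    rows_by_file = {}
    for name, text in csv_files.items():
        reader = csv.DictReader(io.StringIO(text))
        rows_by_file[name] = {row["PROJECT_PSEUDO_ID"]: row for row in reader}

    participants = {}
    for variable, entries in config.items():
        for entry in entries:
            for assessment, file_name in entry.items():
                for pid, row in rows_by_file[file_name].items():
                    record = participants.setdefault(pid, {})
                    record.setdefault(variable, {})[assessment] = row[variable]
    return participants


records = transform(CONFIG, CSV_FILES)
print(json.dumps(records["9f0f9676-9ec4-3"], indent=4))
```

VariableY never appears in the result because the configuration does not list it.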

Prerequisites

Make sure you have the following prerequisites installed on your system:

  • Python (version 3.6 or higher): Python Downloads
  • pip (Python package installer): This usually comes pre-installed with Python. You can check its presence by running pip --version in your terminal or command prompt.

Setup

To set up and run the Python program, follow these steps:

  1. Clone the repository to your local machine using the following command:

    git clone <repository_url>
  2. Navigate to the project directory:

    cd <project_directory>
  3. Create a virtual environment (venv) for the project. Virtual environments keep the project's dependencies isolated from your system's Python installation. Run the following command to create a venv:

    python -m venv venv
  4. Activate the virtual environment. The process for activating the virtual environment depends on your operating system:

    • For Unix/Linux/macOS:

      source venv/bin/activate
    • For Windows (Command Prompt):

      venv\Scripts\activate.bat

    • For Windows (PowerShell):

      venv\Scripts\Activate.ps1
  5. Once the virtual environment is activated, install the required libraries by using the requirements.txt file provided in the repository. Run the following command:

    pip install -r requirements.txt

    This command will install all the necessary dependencies for the program to run.

Generating sample data files for a test run/benchmarking (memory usage/speed)

You can generate sample data files in the folder 'samplecsv/bigfiles' (this folder is listed in the .gitignore). The following command generates N files, each with C columns and R rows. By default, the first column is named 'PROJECT_PSEUDO_ID', and the following columns are named 'Column1', 'Column2', ..., 'ColumnC'. A unique ID is generated for each row, and the same IDs are used in each of the N files:

python -m samplecsv.generate_sample_csv_datafiles N C R

For example, to generate ten CSV files, each with 200 columns (variables) and 150000 rows:

python -m samplecsv.generate_sample_csv_datafiles 10 200 150000
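
For readers who want to see the shape of such files without running the generator, here is a minimal sketch of the same idea at a much smaller scale. It is not the repository's generator: the `write_sample_files` function, the `sample_<n>.csv` file names, and the random integer cell values are all assumptions made for illustration.

```python
import csv
import random
import tempfile
import uuid
from pathlib import Path


def write_sample_files(folder, n_files, n_columns, n_rows, seed=0):
    """Hypothetical stand-in for the real generator; returns (ids, paths)."""
    rng = random.Random(seed)
    # One ID per row, reused in every file so participants line up across files.
    ids = [str(uuid.uuid4()) for _ in range(n_rows)]
    header = ["PROJECT_PSEUDO_ID"] + [f"Column{i}" for i in range(1, n_columns + 1)]
    paths = []
    for file_index in range(1, n_files + 1):
        path = Path(folder) / f"sample_{file_index}.csv"  # file name is an assumption
        with open(path, "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(header)
            for participant_id in ids:
                values = [rng.randint(0, 99) for _ in range(n_columns)]
                writer.writerow([participant_id] + values)
        paths.append(path)
    return ids, paths


# Write two small files into a temporary folder and read the first one back.
with tempfile.TemporaryDirectory() as folder:
    ids, paths = write_sample_files(folder, n_files=2, n_columns=3, n_rows=5)
    with open(paths[0], newline="") as handle:
        rows = list(csv.reader(handle))

print(rows[0])
print(len(rows) - 1, "data rows")
```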

Transforming CSV data files

To transform the sample data files (or the actual files), you must first define where the different assessments of the variables you need to process are located. You can use the sample configuration files in 'sample-configs' for reference. You also need to provide a CSV file with the IDs of all the participants expected to be included in the transformation (the sample data generator mentioned above creates one: 'samplecsv/bigfiles/pseudo_ids.csv'). Then run:

python -m lifelinescsv_to_icdf.cdfgenerator <file with ids> <config file> <output folder>
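
Before launching a long run, it can help to confirm that every file named in the configuration exists and contains the expected columns. The sketch below is a hypothetical pre-flight check, not a feature of this tool: `check_config` is an illustrative name, and resolving the configured file names against a `base_folder` is an assumption about how paths are interpreted.

```python
import csv
import tempfile
from pathlib import Path


def check_config(config, base_folder="."):
    """Hypothetical pre-flight check; returns a list of problem descriptions."""
    problems = []
    for variable, entries in config.items():
        for entry in entries:
            for assessment, file_name in entry.items():
                path = Path(base_folder) / file_name
                if not path.is_file():
                    problems.append(f"{variable}/{assessment}: missing file {file_name}")
                    continue
                # Only the header row is needed to check the column names.
                with open(path, newline="") as handle:
                    header = next(csv.reader(handle), [])
                for column in ("PROJECT_PSEUDO_ID", variable):
                    if column not in header:
                        problems.append(
                            f"{variable}/{assessment}: no column {column} in {file_name}"
                        )
    return problems


# Demo: one file exists but lacks VariableZ, and a2_file.csv is absent.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "a1_file.csv").write_text(
        'PROJECT_PSEUDO_ID,VariableX\n"9f0f9676-9ec4-3","59"\n'
    )
    demo_config = {
        "VariableX": [{"a1": "a1_file.csv"}, {"a2": "a2_file.csv"}],
        "VariableZ": [{"a1": "a1_file.csv"}],
    }
    problems = check_config(demo_config, folder)

for problem in problems:
    print(problem)
```

An empty list means every referenced file was found and has both the ID column and the variable's column.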
