# Installing Dependencies


In [1]:
%pip install duckdb

Note: you may need to restart the kernel to use updated packages.


# ETL: Loading data into Beacon-compliant JSON format


## Content of data

1. Dataset information

   This contains the information related to the actual data collection event.

   You can explore this data here - [./data-with-dictionary/dataset.csv](./data-with-dictionary/dataset.csv)

2. Metadata

   This contains the actual data collected.

   You can explore this data here - [./data-with-dictionary/metadata.csv](./data-with-dictionary/metadata.csv)

3. Data dictionary

   The data dictionary maps each label used in the data (for example, a disease name) to its ontology code. This ensures Beacon can perform ontology-based queries.

   This data is available here - [./data-with-dictionary/data_dictionary.csv](./data-with-dictionary/data_dictionary.csv)
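
To make the role of the data dictionary concrete, here is a minimal sketch of a label-to-code lookup built with the standard library. The column names `id`, `label`, and `ontology` follow the dictionary shape returned by the lookup helper later in this notebook; the two rows and the `ontology` values are illustrative assumptions, not the contents of the real file.

```python
import csv
import io

# Illustrative stand-in for data_dictionary.csv (assumed columns: id, label, ontology).
SAMPLE_DICTIONARY = """id,label,ontology
DUO:0000042,general research use,DUO
SNOMED:52075006,Congolese,SNOMED
"""

# Index the rows by label so a free-text label resolves to its ontology code.
label_to_id = {
    row["label"]: row["id"]
    for row in csv.DictReader(io.StringIO(SAMPLE_DICTIONARY))
}

print(label_to_id["Congolese"])
```

The notebook itself performs the same lookup through a DuckDB table rather than an in-memory dictionary.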


# ETL Logic

## Import necessary libraries

Note that duckdb is the only third-party library we use. Read about it at [https://duckdb.org](https://duckdb.org).


In [2]:
import duckdb
from functools import lru_cache
from itertools import chain
import json

## Load metadata from CSV files to duckdb for querying

We load the data into three tables.

1. datasets - holds the dataset information. Since we consider one dataset per submission, it has a single row.
2. metadata - holds the actual metadata belonging to that dataset.
3. dict - the data dictionary, which maps the labels used during data gathering to ontology terms.


In [3]:
# Open (or create) a persistent DuckDB file and load each CSV file as an all-text table
con = duckdb.connect(database="./dictionary-metadata.db")
con.execute("CREATE TABLE IF NOT EXISTS datasets AS SELECT * FROM read_csv('./data-with-dictionary/dataset.csv', ALL_VARCHAR=TRUE)")
con.execute("CREATE TABLE IF NOT EXISTS metadata AS SELECT * FROM read_csv('./data-with-dictionary/metadata.csv', ALL_VARCHAR=TRUE)")
con.execute("CREATE TABLE IF NOT EXISTS dict AS SELECT * FROM read_csv('./data-with-dictionary/data_dictionary.csv', ALL_VARCHAR=TRUE)")
con.execute("SHOW TABLES").df()


Unnamed: 0,name
0,datasets
1,dict
2,metadata
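
Because the database is stored in a file and the tables are created with `IF NOT EXISTS`, re-running the cell after editing a CSV file keeps the previously loaded rows. The sketch below illustrates that behavior with the standard-library `sqlite3` module as a stand-in for DuckDB (an assumption made so the example runs without extra packages); the table contents are made up.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")  # stand-in for the on-disk DuckDB file

# First run: the table is created from the "source" data.
demo_con.execute("CREATE TABLE IF NOT EXISTS datasets AS SELECT 'v1.0' AS version")

# Second run with changed source data: the table already exists, so nothing happens.
demo_con.execute("CREATE TABLE IF NOT EXISTS datasets AS SELECT 'v1.1' AS version")
stale = demo_con.execute("SELECT version FROM datasets").fetchall()
print(stale)  # [('v1.0',)]

# Dropping the table first forces a reload from the current source.
demo_con.execute("DROP TABLE IF EXISTS datasets")
demo_con.execute("CREATE TABLE datasets AS SELECT 'v1.1' AS version")
fresh = demo_con.execute("SELECT version FROM datasets").fetchall()
print(fresh)  # [('v1.1',)]
```

In DuckDB, `CREATE OR REPLACE TABLE ... AS SELECT ...` refreshes a table in one statement; deleting `./dictionary-metadata.db` before re-running also works.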


Below is a helper function that fetches the ontology code for an English label recorded in the metadata.

It returns the matching row of the `dict` table as a dictionary of the form `{ "id": str, "label": str, "ontology": str}`.

If your data dictionary uses different columns, amend the function as required.


In [4]:
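@lru_cache(maxsize=1000)
def fetch_id(label):
    """Return the `dict` table row for `label` as a Python dict (cached per label)."""
    if not label:
        return {"id": "", "label": "", "ontology": ""}
    # Bind the label as a query parameter instead of escaping quotes by hand
    result = con.execute("SELECT * FROM dict WHERE label = ?", [label]).df()
    if result.empty:
        raise KeyError(f"Label not found in data dictionary: {label!r}")
    return result.iloc[0].to_dict()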
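
Many rows share the same labels, so `lru_cache` means each distinct label reaches the table only once. The sketch below shows that effect with a plain dictionary standing in for the `dict` table; `FAKE_DICT`, `cached_lookup`, and `table_hits` are hypothetical names used only for illustration.

```python
from functools import lru_cache

# Hypothetical stand-in for the `dict` table: label -> ontology id.
FAKE_DICT = {
    "Tamils": "SNOMED:12556008",
    "Congolese": "SNOMED:52075006",
}

table_hits = []  # records every call that actually reaches the "table"


@lru_cache(maxsize=1000)
def cached_lookup(label):
    table_hits.append(label)
    return FAKE_DICT.get(label, "")


# Repeated labels are served from the cache after the first lookup.
for label in ["Tamils", "Congolese", "Tamils", "Tamils"]:
    cached_lookup(label)

print(table_hits)
print(cached_lookup.cache_info())
```

One caveat: a cached function hands back the same object on every hit, so a caller that mutates a returned dictionary would change what later callers see.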


We are now ready to construct our dataset entry.


In [5]:
dataset_df = con.execute("SELECT * FROM datasets").df()
_, dataset = next(dataset_df.iterrows())
dataset = dataset.to_dict()

beacon_dataset = {
    "createDateTime": dataset["createDateTime"],
    "dataUseConditions": {
        "duoDataUse": [
            {
                "id": fetch_id(cond)["id"],
                "label": cond,
                "version": ver,  # paired version from dataUseConditionsVersions
            }
            for cond, ver in zip(
                dataset["dataUseConditions"].split(","),
                dataset["dataUseConditionsVersions"].split(","),
            )
        ]
    },
    "description": dataset["description"],
    "externalUrl": dataset["externalUrl"],
    "info": {},
    "name": dataset["name"],
    "updateDateTime": dataset["updateDateTime"],
    "version": dataset["version"],
}
print(json.dumps(beacon_dataset, indent=2))


{
  "createDateTime": "2021-03-21T02:37:00-08:00",
  "dataUseConditions": {
    "duoDataUse": [
      {
        "id": "DUO:0000042",
        "label": "general research use",
        "version": "17-07-2016"
      }
    ]
  },
  "description": "Simulation set 1.",
  "externalUrl": "http://example.org/wiki/Main_Page",
  "info": {},
  "name": "Dataset with fake data",
  "updateDateTime": "2022-08-05T17:21:00+01:00",
  "version": "v1.1"
}
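
The comprehension above pairs each data use condition with a version using `zip`, which silently stops at the shorter list if the two comma-separated columns have different lengths. A minimal sketch of a stricter pairing follows; the helper name `pair_conditions` is hypothetical, and trimming whitespace around each item is an added assumption.

```python
def pair_conditions(conditions, versions):
    """Pair comma-separated conditions with versions, refusing mismatched lengths."""
    conds = [c.strip() for c in conditions.split(",")]
    vers = [v.strip() for v in versions.split(",")]
    if len(conds) != len(vers):
        raise ValueError(
            f"{len(conds)} condition(s) but {len(vers)} version(s) were given"
        )
    return [{"label": c, "version": v} for c, v in zip(conds, vers)]


pairs = pair_conditions("general research use", "17-07-2016")
print(pairs)
```

A mismatch such as two conditions with one version then raises a `ValueError` instead of dropping the second condition.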


In the following block we extract information related to the `individuals` entity type.

Because the `metadata` table has more fields than the `individuals` entity type needs, we use SQL `SELECT <fields>` syntax to extract just the columns we need.


In [6]:
fields = ["vcf_sample_id as id", "ethnicity", "geographic_origin", "diseases", "interventions_or_procedures", "karyotypic_sex", "sex"]
individuals_df = con.execute(f"SELECT {','.join(fields)} FROM metadata").df()
individuals_df

Unnamed: 0,id,ethnicity,geographic_origin,diseases,interventions_or_procedures,karyotypic_sex,sex
0,HG00096,Congolese,United States of America,,"Cancer Diagnostic or Therapeutic Procedure,Ima...",XXY,"Surgically transgendered transsexual, male-to-..."
1,HG00097,Tamils,United States of America,"Neuroblastoma of central nervous system,Lewy b...","Cancer Diagnostic or Therapeutic Procedure,Ima...",XXYY,"Surgically transgendered transsexual, male-to-..."
2,HG00099,Aymara,United States of America,"Alzheimer's disease,Disorder of the central ne...","Cancer Diagnostic or Therapeutic Procedure,Lab...",XXX,Transsexual
3,HG00100,Onge,India,,"Cancer Diagnostic or Therapeutic Procedure,Ima...",XYY,Female-to-male transsexual
4,HG00101,Tamils,Africa,Pituitary carcinoma,"Laboratory Biomarker Analysis,Imaging Biomarke...",XXXX,Transsexual
5,HG00102,Papuans,Argentina,Heart disease (disorder),Serum Tumor Marker Test,XX,Female
6,HG00103,Atacamenos,Africa,,Cancer Diagnostic or Therapeutic Procedure,XXXY,Female-to-male transsexual
7,HG00105,Alacaluf,Africa,Diabetes mellitus type 2 in nonobese (disorder...,Laboratory Biomarker Analysis,XX,"Surgically transgendered transsexual, male-to-..."
8,HG00106,Guamians,Africa,"Alzheimer's disease,Diabetes mellitus type 2 i...",,XXXX,Female-to-male transsexual
9,HG00107,Yanomama,United States of America,,Imaging Biomarker Analysis,XXXY,Male
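
For reference, this is the SQL text that the f-string assembles, shown here with a shortened field list (`demo_fields` is a hypothetical name). The `vcf_sample_id as id` item is an ordinary SQL column alias, which is why the resulting column is called `id`.

```python
# Shortened version of the field list used in the cell above.
demo_fields = ["vcf_sample_id as id", "ethnicity", "sex"]

# The same string-building step, without executing the query.
query = f"SELECT {','.join(demo_fields)} FROM metadata"
print(query)  # SELECT vcf_sample_id as id,ethnicity,sex FROM metadata
```

Since the field names are interpolated directly into the SQL text, they should come from a fixed list like this one rather than from user input.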


We can now create the JSON entry for each individual.


In [7]:
individuals = []

for _, row in individuals_df.iterrows():
    # Replace missing values with empty strings so lookups and splits behave uniformly
    data = row.fillna("").to_dict()
    individual = {
            "id": data["id"],
            "ethnicity": {
                "id": fetch_id(data["ethnicity"])["id"],
                "label": data["ethnicity"]
            },
            "geographicOrigin": {
                "id": fetch_id(data["geographic_origin"])["id"],
                "label": data["geographic_origin"]
            },
            "diseases": [
                {
                    "diseaseCode": {
                        "id": fetch_id(label)["id"],
                        "label": label
                    }
                }
                for label in (data["diseases"].split(",") if data["diseases"] else [])
            ],
            "interventionsOrProcedures": [
                {
                    "procedureCode": {
                        "id": fetch_id(proc)["id"],
                        "label": proc
                    }
                } for proc in (data["interventions_or_procedures"].split(",") if data["interventions_or_procedures"] else [])
            ],
            "karyotypicSex": data["karyotypic_sex"],
            "sex": {
                "id": fetch_id(data["sex"])["id"],
                "label": data["sex"]
            }
        }
    individuals.append(individual)

print(json.dumps(individuals, indent=2))

[
  {
    "id": "HG00096",
    "ethnicity": {
      "id": "SNOMED:52075006",
      "label": "Congolese"
    },
    "geographicOrigin": {
      "id": "SNOMED:223688001",
      "label": "United States of America"
    },
    "diseases": [],
    "interventionsOrProcedures": [
      {
        "procedureCode": {
          "id": "NCIT:C79426",
          "label": "Cancer Diagnostic or Therapeutic Procedure"
        }
      },
      {
        "procedureCode": {
          "id": "NCIT:C64264",
          "label": "Imaging Biomarker Analysis"
        }
      }
    ],
    "karyotypicSex": "XXY",
    "sex": {
      "id": "SNOMED:407378000",
      "label": "Surgically transgendered transsexual, male-to-female"
    }
  },
  {
    "id": "HG00097",
    "ethnicity": {
      "id": "SNOMED:12556008",
      "label": "Tamils"
    },
    "geographicOrigin": {
      "id": "SNOMED:223688001",
      "label": "United States of America"
    },
    "diseases": [
      {
        "diseaseCode": {
          "id": "SNOM
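
The `diseases` and `interventions_or_procedures` columns follow the same pattern: split a comma-separated cell, then wrap each label in a coded entry. A small helper could express that once, as in the sketch below. The names `split_labels` and `coded_list` are hypothetical, `lookup` stands in for `fetch_id(label)["id"]`, and trimming whitespace is an added assumption.

```python
def split_labels(cell):
    """Split a comma-separated cell into labels; empty or missing cells give []."""
    if not cell:
        return []
    return [part.strip() for part in cell.split(",") if part.strip()]


def coded_list(cell, key, lookup):
    """Wrap each label of a multi-valued cell as {key: {"id": ..., "label": ...}}."""
    return [
        {key: {"id": lookup(label), "label": label}}
        for label in split_labels(cell)
    ]


# Hypothetical lookup table reusing two codes from the output above.
demo_codes = {
    "Cancer Diagnostic or Therapeutic Procedure": "NCIT:C79426",
    "Imaging Biomarker Analysis": "NCIT:C64264",
}
procedures = coded_list(
    "Cancer Diagnostic or Therapeutic Procedure,Imaging Biomarker Analysis",
    "procedureCode",
    lambda label: demo_codes.get(label, ""),
)
print(procedures)
```

Splitting on commas assumes that no individual label in a multi-valued column contains a comma itself; such a label would be broken into pieces.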

Creating biosample entries.


In [8]:
fields = ["vcf_sample_id as id", "biosample_status", "collection_date", "collection_moment", "histological_diagnosis", "obtention_procedure", "pathological_tnm_finding", "sample_origin_detail", "sample_origin_type", "tumor_progression"]
biosamples_df = con.execute(f"SELECT {','.join(fields)} FROM metadata").df()
biosamples_df

Unnamed: 0,id,biosample_status,collection_date,collection_moment,histological_diagnosis,obtention_procedure,pathological_tnm_finding,sample_origin_detail,sample_origin_type,tumor_progression
0,HG00096,Minor blood groups - finding,2019-04-23,P32Y6M1D,12q14 microdeletion syndrome,FGFR1 Mutation Analysis,T2a Stage Finding,Abscess swab,Capillary blood,Primary Malignant Neoplasm
1,HG00097,Mitochondrial 1555 A to G mutation positive,2022-04-23,P32Y6M1D,14q22q23 microdeletion syndrome,,M0 Stage Finding,Specimen from aorta,Capillary blood,
2,HG00099,Mitochondrial 1555 A to G mutation positive,2021-04-23,P32Y6M1D,14q22q23 microdeletion syndrome,,T2a Stage Finding,,Cultured cells,Recurrent Malignant Neoplasm
3,HG00100,Minor blood groups - finding,2021-04-23,P7D,14q22q23 microdeletion syndrome,FGFR1 Mutation Analysis,T2a Stage Finding,Respiratory specimen,Cultured autograft of skin,Primary Malignant Neoplasm
4,HG00101,Mitochondrial antibodies positive,2022-04-23,P7D,Disorder of body system (disorder),,T2a Stage Finding,Nasopharyngeal swab,Cultured autograft of skin,
5,HG00102,Mite present,2018-04-23,P32Y6M1D,12q14 microdeletion syndrome,biopsy,,,Cultured autograft of skin,Primary Malignant Neoplasm
6,HG00103,Mitochondrial antibodies positive,2021-04-23,P32Y6M1D,Abnormality of bombesin secretion,,M0 Stage Finding,Specimen from aorta,Capillary blood,Primary Malignant Neoplasm
7,HG00105,Mitochondrial 1555 A to G mutation positive,2015-04-23,P32Y6M1D,Abnormality of bombesin secretion,,,Specimen from anus obtained by transanal disk ...,Agar medium,
8,HG00106,Mitochondrial antibodies negative,2018-04-23,P32Y6M1D,14q22q23 microdeletion syndrome,,N1c Stage Finding,,Capillary blood,
9,HG00107,Minor blood groups - finding,2022-04-23,P7D,12q14 microdeletion syndrome,,N1c Stage Finding,,Agar medium,Primary Malignant Neoplasm


In [9]:
biosamples = []

for _, row in biosamples_df.iterrows():
    # Replace missing values with empty strings so lookups behave uniformly
    data = row.fillna("").to_dict()
    biosample = {
            "id": data["id"],
            "individualId": data["id"],
            "biosampleStatus": {
                "id": fetch_id(data["biosample_status"])["id"],
                "label": data["biosample_status"]
            },
            "collectionDate": data["collection_date"],
            "collectionMoment": data["collection_moment"],
            "histologicalDiagnosis": {
                "id": fetch_id(data["histological_diagnosis"])["id"],
                "label": data["histological_diagnosis"]
            },
            "obtentionProcedure": {
                "procedureCode": {
                    "id": fetch_id(data["obtention_procedure"])["id"],
                    "label": data["obtention_procedure"]
                }
            },
            "pathologicalTnmFinding": [
                {
                    "id": fetch_id(data["pathological_tnm_finding"])["id"],
                    "label": data["pathological_tnm_finding"]
                }
            ],
            "sampleOriginDetail": {
                "id": fetch_id(data["sample_origin_detail"])["id"],
                "label": data["sample_origin_detail"]
            },
            "sampleOriginType": {
                "id": fetch_id(data["sample_origin_type"])["id"],
                "label": data["sample_origin_type"]
            },
            "tumorProgression": {
                "id": fetch_id(data["tumor_progression"])["id"],
                "label": data["tumor_progression"]
            },
            "info": {},
            "notes": ""
        }
    biosamples.append(biosample)

print(json.dumps(biosamples, indent=2))

[
  {
    "id": "HG00096",
    "individualId": "HG00096",
    "biosampleStatus": {
      "id": "SNOMED:365641003",
      "label": "Minor blood groups - finding"
    },
    "collectionDate": "2019-04-23",
    "collectionMoment": "P32Y6M1D",
    "histologicalDiagnosis": {
      "id": "SNOMED:719046005",
      "label": "12q14 microdeletion syndrome"
    },
    "obtentionProcedure": {
      "procedureCode": {
        "id": "NCIT:C157179",
        "label": "FGFR1 Mutation Analysis"
      }
    },
    "pathologicalTnmFinding": [
      {
        "id": "NCIT:C48725",
        "label": "T2a Stage Finding"
      }
    ],
    "sampleOriginDetail": {
      "id": "SNOMED:258497007",
      "label": "Abscess swab"
    },
    "sampleOriginType": {
      "id": "SNOMED:31675002",
      "label": "Capillary blood"
    },
    "tumorProgression": {
      "id": "NCIT:C84509",
      "label": "Primary Malignant Neoplasm"
    },
    "info": {},
    "notes": ""
  },
  {
    "id": "HG00097",
    "individualId": "H
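
Rows with missing values, such as the empty `obtention_procedure` of HG00097, produce ontology terms whose `id` and `label` are both empty strings. If the receiving Beacon expects absent fields to be omitted instead (an assumption to check against the schema you validate with), a recursive cleanup could look like this sketch. The helper name `prune_empty` is hypothetical, and note that it also drops `"info": {}` and `"notes": ""`.

```python
EMPTY_VALUES = ("", {}, [], None)


def prune_empty(value):
    """Recursively drop empty strings, None, and containers that end up empty."""
    if isinstance(value, dict):
        cleaned = {key: prune_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if item not in EMPTY_VALUES}
    if isinstance(value, list):
        cleaned = [prune_empty(item) for item in value]
        return [item for item in cleaned if item not in EMPTY_VALUES]
    return value


# Trimmed-down biosample shaped like the entries built above.
demo_biosample = {
    "id": "HG00097",
    "collectionDate": "2022-04-23",
    "obtentionProcedure": {"procedureCode": {"id": "", "label": ""}},
    "sampleOriginType": {"id": "SNOMED:31675002", "label": "Capillary blood"},
    "tumorProgression": {"id": "", "label": ""},
    "info": {},
    "notes": "",
}
pruned = prune_empty(demo_biosample)
print(pruned)
```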

Creating runs entries.


In [10]:
fields = ["vcf_sample_id as id", "library_layout", "library_selection", "library_source", "library_strategy", "platform", "platform_model", "run_date"]
runs_df = con.execute(f"SELECT {','.join(fields)} FROM metadata").df()
runs_df

Unnamed: 0,id,library_layout,library_selection,library_source,library_strategy,platform,platform_model,run_date
0,HG00096,PAIRED,RANDOM,other library source,WGS,PacBio,PacBio RS II,2021-10-18
1,HG00097,PAIRED,RANDOM,genomic source,WGS,Illumina,Illumina HiSeq 3000,2021-10-18
2,HG00099,PAIRED,RANDOM,genomic source,WGS,NanoPore,Oxford Nanopore MinION,2021-10-18
3,HG00100,PAIRED,RANDOM,other library source,WGS,NanoPore,Oxford Nanopore MinION,2021-10-18
4,HG00101,PAIRED,RANDOM,other library source,WGS,PacBio,PacBio RS II,2018-01-01
5,HG00102,PAIRED,RANDOM,other library source,WGS,PacBio,PacBio RS II,2018-01-01
6,HG00103,PAIRED,RANDOM,other library source,WGS,Illumina,Illumina HiSeq 3000,2021-10-18
7,HG00105,PAIRED,RANDOM,genomic source,WGS,NanoPore,Oxford Nanopore MinION,2021-10-18
8,HG00106,PAIRED,RANDOM,genomic source,WGS,Illumina,Illumina HiSeq 3000,2018-01-01
9,HG00107,PAIRED,RANDOM,other library source,WGS,Illumina,Illumina HiSeq 3000,2022-08-08


In [11]:
runs = []

for _, row in runs_df.iterrows():
    # Replace missing values with empty strings so lookups behave uniformly
    data = row.fillna("").to_dict()
    run = {
            "id": data["id"],
            "biosampleId": data["id"],
            "individualId": data["id"],
            "libraryLayout": data["library_layout"],
            "librarySelection": data["library_selection"],
            "librarySource": {
                "id": fetch_id(data["library_source"])["id"],
                "label": data["library_source"]
            },
            "libraryStrategy": data["library_strategy"],
            "platform": data["platform"],
            "platformModel": {
                "id": fetch_id(data["platform_model"])["id"],
                "label": data["platform_model"]
            },
            "runDate": data["run_date"],
        }
    runs.append(run)

print(json.dumps(runs, indent=2))

[
  {
    "id": "HG00096",
    "biosampleId": "HG00096",
    "individualId": "HG00096",
    "libraryLayout": "PAIRED",
    "librarySelection": "RANDOM",
    "librarySource": {
      "id": "GENEPIO:0001969",
      "label": "other library source"
    },
    "libraryStrategy": "WGS",
    "platform": "PacBio",
    "platformModel": {
      "id": "OBI:0002012",
      "label": "PacBio RS II"
    },
    "runDate": "2021-10-18"
  },
  {
    "id": "HG00097",
    "biosampleId": "HG00097",
    "individualId": "HG00097",
    "libraryLayout": "PAIRED",
    "librarySelection": "RANDOM",
    "librarySource": {
      "id": "GENEPIO:0001966",
      "label": "genomic source"
    },
    "libraryStrategy": "WGS",
    "platform": "Illumina",
    "platformModel": {
      "id": "OBI:0002048",
      "label": "Illumina HiSeq 3000"
    },
    "runDate": "2021-10-18"
  },
  {
    "id": "HG00099",
    "biosampleId": "HG00099",
    "individualId": "HG00099",
    "libraryLayout": "PAIRED",
    "librarySelection": "R

Creating analyses entries.


In [12]:
fields = ["vcf_sample_id as id", "aligner", "analysis_date", "pipeline_name", "pipeline_ref", "variant_caller", "vcf_sample_id"]
analyses_df = con.execute(f"SELECT {','.join(fields)} FROM metadata").df()
analyses_df

Unnamed: 0,id,aligner,analysis_date,pipeline_name,pipeline_ref,variant_caller,vcf_sample_id
0,HG00096,bwa-0.7.8,2020-2-15,pipeline 5,Example,SoapSNP,HG00096
1,HG00097,minimap2,2019-3-17,pipeline 1,Example,GATK4.0,HG00097
2,HG00099,minimap2,2018-10-2,pipeline 5,Example,GATK4.0,HG00099
3,HG00100,bwa-0.7.8,2018-11-9,pipeline 5,Example,kmer2snp,HG00100
4,HG00101,bowtie,2019-5-27,pipeline 3,Example,GATK4.0,HG00101
5,HG00102,bwa-0.7.8,2021-11-22,pipeline 1,Example,SoapSNP,HG00102
6,HG00103,bowtie,2018-1-8,pipeline 1,Example,SoapSNP,HG00103
7,HG00105,minimap2,2022-3-6,pipeline 1,Example,GATK4.0,HG00105
8,HG00106,bowtie,2021-2-17,pipeline 2,Example,SoapSNP,HG00106
9,HG00107,bwa-0.7.8,2019-8-13,pipeline 1,Example,SoapSNP,HG00107


In [13]:
analyses = []

for _, row in analyses_df.iterrows():
    # Replace missing values with empty strings
    data = row.fillna("").to_dict()
    analysis = {
            "id": data["id"],
            "individualId": data["id"],
            "biosampleId": data["id"],
            "runId": data["id"],
            "aligner": data["aligner"],
            "analysisDate": data["analysis_date"],
            "pipelineName": data["pipeline_name"],
            "pipelineRef": data["pipeline_ref"],
            "variantCaller": data["variant_caller"],
            "vcfSampleId": data["vcf_sample_id"],
        }
    analyses.append(analysis)

print(json.dumps(analyses, indent=2))

[
  {
    "id": "HG00096",
    "individualId": "HG00096",
    "biosampleId": "HG00096",
    "runId": "HG00096",
    "aligner": "bwa-0.7.8",
    "analysisDate": "2020-2-15",
    "pipelineName": "pipeline 5",
    "pipelineRef": "Example",
    "variantCaller": "SoapSNP",
    "vcfSampleId": "HG00096"
  },
  {
    "id": "HG00097",
    "individualId": "HG00097",
    "biosampleId": "HG00097",
    "runId": "HG00097",
    "aligner": "minimap2",
    "analysisDate": "2019-3-17",
    "pipelineName": "pipeline 1",
    "pipelineRef": "Example",
    "variantCaller": "GATK4.0",
    "vcfSampleId": "HG00097"
  },
  {
    "id": "HG00099",
    "individualId": "HG00099",
    "biosampleId": "HG00099",
    "runId": "HG00099",
    "aligner": "minimap2",
    "analysisDate": "2018-10-2",
    "pipelineName": "pipeline 5",
    "pipelineRef": "Example",
    "variantCaller": "GATK4.0",
    "vcfSampleId": "HG00099"
  },
  {
    "id": "HG00100",
    "individualId": "HG00100",
    "biosampleId": "HG00100",
    "runId"
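
The `analysis_date` values in this sample are not zero-padded (for example `2020-2-15`), unlike `collection_date` and `run_date`. If a consumer expects strict `YYYY-MM-DD` dates (an assumption to verify against the Beacon schema you target), they could be normalized with the standard library as sketched here; `to_iso_date` is a hypothetical helper name.

```python
from datetime import datetime


def to_iso_date(text):
    """Return `text` as zero-padded YYYY-MM-DD, or "" for an empty value."""
    if not text:
        return ""
    # strptime accepts one- or two-digit months and days for %m and %d.
    return datetime.strptime(text, "%Y-%m-%d").date().isoformat()


print(to_iso_date("2020-2-15"))  # 2020-02-15
```

An invalid date raises a `ValueError`, which surfaces bad rows during the ETL run instead of at submission time.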

## Final Submission Entry


In [14]:
submission = {
    "dataset": beacon_dataset,  # the Beacon-formatted entry built earlier, not the raw CSV row
    "assemblyId": "GRCH38",
    "individuals": individuals,
    "biosamples": biosamples,
    "runs": runs,
    "analyses": analyses
}

print(json.dumps(submission, indent=2))
# Use a context manager so the output file is flushed and closed
with open("submission-data-dictionary.json", "w") as f:
    json.dump(submission, f, indent=2)


{
  "dataset": {
    "createDateTime": "2021-03-21T02:37:00-08:00",
    "dataUseConditions": {
      "duoDataUse": [
        {
          "id": "DUO:0000042",
          "label": "general research use",
          "version": "17-07-2016"
        }
      ]
    },
    "description": "Simulation set 1.",
    "externalUrl": "http://example.org/wiki/Main_Page",
    "info": {},
    "name": "Dataset with fake data",
    "updateDateTime": "2022-08-05T17:21:00+01:00",
    "version": "v1.1"
  },
  "assemblyId": "GRCH38",
  "individuals": [
    {
      "id": "HG00096",
      "ethnicity": {
        "id": "SNOMED:52075006",
        "label": "Congolese"
      },
      "geographicOrigin": {
        "id": "SNOMED:223688001",
        "label": "United States of America"
      },
      "diseases": [],
      "interventionsOrProcedures": [
        {
          "procedureCode": {
            "id": "NCIT:C79426",
            "label": "Cancer Diagnostic or Therapeutic Procedure"
          }
        },
        {
          "procedureCode": {
            "id": "NCIT:C64264",
            "label": "Imaging Bio
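
Every entity in this notebook reuses the sample id for its cross-references (`individualId`, `biosampleId`, `runId`), so the links are consistent by construction. With other data that may not hold, and a quick check before writing the file can catch dangling references. The sketch below is one way to do it; `find_broken_links` is a hypothetical helper, and the demo submission is made up.

```python
def find_broken_links(submission):
    """Return (entity, entry id, field, missing id) tuples for dangling references."""
    known = {
        "individualId": {entry["id"] for entry in submission["individuals"]},
        "biosampleId": {entry["id"] for entry in submission["biosamples"]},
        "runId": {entry["id"] for entry in submission["runs"]},
    }
    problems = []
    for entity in ("biosamples", "runs", "analyses"):
        for entry in submission[entity]:
            for field, ids in known.items():
                if field in entry and entry[field] not in ids:
                    problems.append((entity, entry["id"], field, entry[field]))
    return problems


# Made-up submission in which the analysis points at a run that does not exist.
demo_submission = {
    "individuals": [{"id": "HG00096"}],
    "biosamples": [{"id": "HG00096", "individualId": "HG00096"}],
    "runs": [{"id": "HG00096", "biosampleId": "HG00096", "individualId": "HG00096"}],
    "analyses": [
        {
            "id": "HG00096",
            "individualId": "HG00096",
            "biosampleId": "HG00096",
            "runId": "HG00099",
        }
    ],
}
broken = find_broken_links(demo_submission)
print(broken)
```

An empty list means every reference resolves to an entry in the same submission.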