### Demonstration Overview: ClinVar to FHIR Allele Profiles

This notebook demonstrates how to extract and explore ClinVar variation data and translate it into FHIR Allele Profiles. It is designed to run in a GitHub Codespace or a local environment.
    
#### Key features of this notebook:
- **Stream large datasets efficiently:** Read and process a compressed `.jsonl.gz` ClinVar file line by line without loading it fully into memory.
- **Filter relevant records:** Identify and retain records whose `members` list includes an object of type `Allele`.
- **Preview representative examples:** Display either the full ClinVar record, the embedded VRS Allele object, or both, depending on the selected display mode.
- **Extract VRS Alleles:** Collect a small, reproducible subset (e.g., 5 examples) of VRS Allele objects for quick translation.
- **Perform translation:** Convert those VRS Allele objects into FHIR Allele Profiles to demonstrate the end-to-end mapping workflow.
- **Portable execution:** Runs efficiently in a GitHub Codespace or on a standard laptop without requiring additional compute resources.


> **Note:**  
> - We have translated a **subset** of the original ClinVar data folder, generating **FHIR Allele Profiles for approximately 20,000 examples**, and are now working toward **translating the complete dataset**.
> - If you would like access to the **input or output data files** used in this demonstration, please contact the code owners.  
> - A **command-line Python script** is currently under development to perform this same ClinVar -> FHIR Allele Profile translation. Once released, it will be available in the `src/utils/` directory of this repository.

In [1]:
import gzip
import json
import random

from ga4gh.vrs.models import Allele
from translators.vrs_to_fhir import VrsToFhirAlleleTranslator

vrs_translator = VrsToFhirAlleleTranslator()

In [2]:
path = "data/clinvar_gks_variation_2025_09_28_v2_4_3.jsonl.gz"

### Filtering for Allele Records
We keep only records whose `members` list includes an object with `"type": "Allele"`.
Only these entries contain GA4GH VRS Allele representations, which are required for downstream translation into FHIR Allele Profiles.

This filtering step is performed immediately after obtaining the ClinVar data to ensure that we only process records that can be represented as VRS Alleles.
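
The filter described above can be sketched as a small generator. This is a minimal illustration rather than the notebook's exact code: `first_allele_member` and `iter_allele_records` are hypothetical helper names, and the demo records are made up and written to a temporary file so the block runs without the ClinVar data.

```python
import gzip
import json
import os
import tempfile


def first_allele_member(record):
    """Return the first member dict with type 'Allele', or None if there is none."""
    for member in record.get("members", []):
        if isinstance(member, dict) and member.get("type") == "Allele":
            return member
    return None


def iter_allele_records(jsonl_gz_path):
    """Yield (record, allele_member) pairs, reading one line at a time."""
    with gzip.open(jsonl_gz_path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            allele = first_allele_member(record)
            if allele is not None:
                yield record, allele


# Tiny made-up records, used only to exercise the filter.
demo_records = [
    {"id": "a", "members": [{"type": "CopyNumberChange"}]},
    {"id": "b", "members": ["not-a-dict", {"type": "Allele", "name": "x"}]},
    {"id": "c"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.jsonl.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as out:
        for rec in demo_records:
            out.write(json.dumps(rec) + "\n")
    matches = [(rec["id"], allele["name"]) for rec, allele in iter_allele_records(demo_path)]

print(matches)
```

Because the generator yields the matching member itself, callers do not need to assume that the Allele is the first entry in `members`.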

In [3]:
# Choose which view of the first matching record to print.
# Options:
# "full" -> show the entire original JSON object
# "vrs"  -> show only the embedded VRS Allele
# "both" -> show both the full object and its VRS Allele

display_mode = "both" 

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        members = obj.get("members", [])

        # Keep the member that actually matched, rather than assuming it is members[0].
        allele = None
        for member in members:
            if isinstance(member, dict) and member.get("type") == "Allele":
                allele = member
                break

        if allele is not None:
            if display_mode == "full":
                print(json.dumps(obj, indent=2))
            elif display_mode == "vrs":
                print(json.dumps(allele, indent=2))
            elif display_mode == "both":
                print("----------- Original Example -----------")
                print(json.dumps(obj, indent=2))
                print("----------- VRS Example from Original -----------")
                print(json.dumps(allele, indent=2))
            break  # stop after the first matching record


----------- Original Example -----------
{
  "constraints": [
    {
      "allele": "2/members/0/",
      "relations": [
        {
          "primaryCoding": {
            "code": "liftover_to",
            "system": "ga4gh-gks-term:allele-relation"
          }
        },
        {
          "primaryCoding": {
            "code": "transcribed_to",
            "iris": [
              "http://www.sequenceontology.org/browser/current_release/term/transcribed_to"
            ],
            "system": "http://www.sequenceontology.org"
          }
        }
      ],
      "type": "DefiningAlleleConstraint"
    }
  ],
  "extensions": [
    {
      "name": "clinvarHgvsList",
      "value": [
        {
          "maneSelect": true,
          "nucleotideExpression": {
            "syntax": "hgvs.n",
            "value": "NR_023343.3:n.16G>C"
          },
          "nucleotideType": "non-coding"
        },
        {
          "molecularConsequence": [
            {
              "code": "SO:000162

### Extracting Sample VRS Allele Examples
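- In this step, we extract five VRS Allele examples from the filtered ClinVar data to explore and test the translation process.
- The loop takes Allele members in file order until it has five, so the same subset is selected each time the notebook is run. The `random.seed(42)` call does not influence this selection, because the loop draws no random numbers.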

> **Note:**  
> - Some entries in the dataset contain `"type": "Allele"` but cannot be successfully loaded as `Allele(**member)` objects.  
> - These cases typically fail VRS validation because their `"moleculeType"` is incorrectly set to `"mrna"` instead of `"mRNA"`.  
> - This inconsistency may have been introduced during the **data input process** or during **translation into the dataset**.
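
One possible workaround for the casing issue is to normalize the value before constructing the `Allele`. The sketch below assumes that the lowercase `"mrna"` value is the only problem; `normalize_molecule_type` and `MOLECULE_TYPE_FIXES` are hypothetical names, not part of `ga4gh.vrs`, and the sample member is made up.

```python
import copy

# Assumed mapping: covers only the casing issue described in the note above.
MOLECULE_TYPE_FIXES = {"mrna": "mRNA"}


def _fix_in_place(node):
    """Walk nested dicts and lists, correcting known moleculeType values."""
    if isinstance(node, dict):
        value = node.get("moleculeType")
        if isinstance(value, str) and value in MOLECULE_TYPE_FIXES:
            node["moleculeType"] = MOLECULE_TYPE_FIXES[value]
        for child in node.values():
            _fix_in_place(child)
    elif isinstance(node, list):
        for child in node:
            _fix_in_place(child)


def normalize_molecule_type(member):
    """Return a corrected deep copy; the input dict is left unchanged."""
    fixed = copy.deepcopy(member)
    _fix_in_place(fixed)
    return fixed


# Made-up member with the lowercase value, for illustration only.
member = {
    "type": "Allele",
    "location": {
        "type": "SequenceLocation",
        "sequenceReference": {"type": "SequenceReference", "moleculeType": "mrna"},
    },
}
cleaned = normalize_molecule_type(member)
print(cleaned["location"]["sequenceReference"]["moleculeType"])
print(member["location"]["sequenceReference"]["moleculeType"])
```

With such a helper, `Allele(**normalize_molecule_type(member))` could be tried in place of `Allele(**member)`. Whether that is appropriate depends on whether the values should instead be corrected upstream in the dataset.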

In [4]:
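from pydantic import ValidationError

random.seed(42)  # kept from the original cell; the loop below selects in file order
vrs_examples = []

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        obj = json.loads(line)
        members = obj.get("members", [])

        for member in members:
            if isinstance(member, dict) and member.get("type") == "Allele":
                try:
                    vrs_examples.append(Allele(**member))
                except ValidationError:
                    # Assumes failures surface as pydantic's ValidationError;
                    # skip members that fail VRS validation (see the note above).
                    pass
                break
        if len(vrs_examples) == 5:
            break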

In [5]:
# Display one example VRS Allele to show its structure.
vrs_examples[0].model_dump(exclude_none=True)

{'id': 'ga4gh:VA.RYY2yzWCjihuu2hmqFWu8qg7aUB3aPwR',
 'type': 'Allele',
 'name': 'NC_000002.12:g.121530895G>C',
 'digest': 'RYY2yzWCjihuu2hmqFWu8qg7aUB3aPwR',
 'expressions': [{'syntax': 'spdi', 'value': 'NC_000002.12:121530894:G:C'},
  {'syntax': 'hgvs.g', 'value': 'NC_000002.12:g.121530895G>C'},
  {'syntax': 'gnomad', 'value': '2-121530895-G-C'}],
 'location': {'id': 'ga4gh:SL.3sVPpjle__QNFe8yGfLEja0R42VIZIp8',
  'type': 'SequenceLocation',
  'digest': '3sVPpjle__QNFe8yGfLEja0R42VIZIp8',
  'sequenceReference': {'type': 'SequenceReference',
   'name': 'NC_000002.12',
   'extensions': [{'name': 'assembly', 'value': 'GRCh38'}],
   'refgetAccession': 'SQ.pnAqCRBrTsUoBghSD1yp_jXWSmlbdh4g',
   'residueAlphabet': 'na',
   'moleculeType': 'genomic'},
  'start': 121530894,
  'end': 121530895},
 'state': {'type': 'LiteralSequenceExpression', 'sequence': 'C'}}

### Translating VRS Alleles into FHIR Allele Profiles

- Translate the five extracted VRS Allele examples into FHIR Allele Profiles.
- This step demonstrates the end-to-end VRS -> FHIR mapping process on a small subset, which keeps testing quick and reproducible.
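
When the same step is run on more than a handful of alleles, it can help to keep going if a single translation fails. The following is a minimal sketch of that pattern, not the translator's own behavior: `translate_batch` is a hypothetical helper, and `fake_translate` stands in for `vrs_translator.translate_allele_to_fhir` so the block runs without the translator installed.

```python
def translate_batch(items, translate):
    """Apply translate to each item and return (results, failures)."""
    results, failures = [], []
    for index, item in enumerate(items):
        try:
            results.append(translate(item))
        except Exception as exc:  # broad on purpose: the translator's error types are not assumed
            failures.append((index, repr(exc)))
    return results, failures


def fake_translate(item):
    """Stand-in translator used only for this demonstration."""
    if item is None:
        raise ValueError("nothing to translate")
    return {"resourceType": "MolecularDefinition", "source": item}


results, failures = translate_batch(["allele-1", None, "allele-2"], fake_translate)
print(len(results), "translated;", len(failures), "failed")
```

In the notebook, the call would look like `translate_batch(vrs_examples, vrs_translator.translate_allele_to_fhir)`, with the recorded indices pointing back to the inputs that need inspection.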

In [6]:
# Translate each VRS Allele into a FHIR Allele Profile.
fhir_examples = [vrs_translator.translate_allele_to_fhir(vrs) for vrs in vrs_examples]


In [None]:
# Display one example FHIR Allele Profile to show its structure.
fhir_examples[0].model_dump()

{'resourceType': 'MolecularDefinition',
 'contained': [{'resourceType': 'MolecularDefinition',
   'id': 'vrs-location-sequenceReference',
   'extension': [{'url': 'https://w3id.org/ga4gh/schema/vrs/2.0.1/json/SequenceReference#properties/name',
     'valueString': 'NC_000002.12'},
    {'extension': [{'url': 'https://github.com/ga4gh/gks-core/blob/1.0/schema/gks-core/json/Extension#properties/name',
       'valueString': 'assembly'},
      {'url': 'https://github.com/ga4gh/gks-core/blob/1.0/schema/gks-core/json/Extension#properties/value',
       'valueString': 'GRCh38'}]}],
   'moleculeType': {'coding': [{'system': 'https://w3id.org/ga4gh/schema/vrs/2.0.1/json/SequenceReference#properties/moleculeType',
      'code': 'genomic'}]},
   'representation': [{'code': [{'coding': [{'system': 'https://w3id.org/ga4gh/schema/vrs/2.0.1/json/SequenceReference#properties/refgetAccession',
         'code': 'SQ.pnAqCRBrTsUoBghSD1yp_jXWSmlbdh4g'}]}]}]}],
 'identifier': [{'system': 'https://w3id.org/ga