### Demonstration Overview: Bulk Translation (VRS → FHIR Allele Profiles)

This notebook demonstrates a bulk translation of ~2,000 Allele records from a **VRS Allele Dictionary** into **FHIR Allele Profiles**, and documents both the successes and the failures.

**Goals**
- Show that the translation pipeline handles a batch of ~2,000 records.
- Provide insight into which inputs successfully translate into valid FHIR profiles.
- Highlight problematic or unmapped values and categorize the reasons they fail.



The input consists of three datasets, each loaded as a DataFrame:

- **`spdiDF`** → Parsed from `spdi-Allele.jsonl`  
  - Contains 1,000 SPDI Allele examples.  
  - All examples were successfully translated to FHIR Allele Profiles (no errors).

- **`hgvsDF`** → Parsed from `hgvs-Allele.jsonl`  
  - Contains 1,000 HGVS Allele examples.  
  - 210 of the 1,000 examples failed translation; these are collected in `errorDF`.

- **`errorDF`** → Parsed from `hgvs-ErrorTranslation.jsonl`  
  - Captures the HGVS Alleles that failed translation.  
  - Useful for analysis of unmapped or invalid values.

In [14]:
import pandas as pd

# Load input data
spdiDF = pd.read_json("spdi-Allele.jsonl", lines=True)
hgvsDF = pd.read_json("hgvs-Allele.jsonl", lines=True)
errorDF = pd.read_json("hgvs-ErrorTranslation.jsonl", lines=True)

# Quick Check
print("SPDI:", len(spdiDF))
print("HGVS:", len(hgvsDF))
print("Errors:", len(errorDF))

SPDI: 1000
HGVS: 1000
Errors: 210


In [None]:
def count_state_type(df):
    """Tally the `state.type` of each record in the `out` column."""
    counts = {
        "ReferenceLengthExpression": 0,
        "LiteralSequenceExpression": 0,
        "Other": 0
    }

    for vo in df["out"]:
        if isinstance(vo, dict):
            # Tolerate a missing or null `state` / `type` field.
            type_val = (vo.get("state") or {}).get("type") or ""
            if "LiteralSequenceExpression" in type_val:
                counts["LiteralSequenceExpression"] += 1
            elif "ReferenceLengthExpression" in type_val:
                counts["ReferenceLengthExpression"] += 1
            else:
                counts["Other"] += 1
        else:
            # Not a dict (e.g. no translated output for this record).
            counts["Other"] += 1

    return counts

def types_of_error(error_df):
    """Count error rows by the substring found in the `ERROR` column."""
    # Display label -> substring searched for in the error message.
    patterns = {
        "HGVSParseError": "HGVSParseError",
        "ValueError": "ValueError",
        "Task did not complete": "Task did not complete",
        "AttributeError": "AttributeError",
        "HGVSDataNotAvailable": "HGVSDataNotAvailableError",
        "HGVSInvalidVariant": "HGVSInvalidVariantError",
    }

    # regex=False matches the text literally; na=False treats missing messages as no match.
    return {
        label: int(error_df["ERROR"].str.contains(text, regex=False, na=False).sum())
        for label, text in patterns.items()
    }
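
Because `types_of_error` classifies messages by substring, a single message could in principle match more than one label, in which case the counts would add up to more than the number of rows. The following is a minimal sketch of a check for that, not part of the original pipeline: the helper `matches_per_row` and the toy error strings are hypothetical and were made up for illustration.

```python
import pandas as pd

# Substrings searched for in the ERROR column (the same ones used above).
PATTERNS = [
    "HGVSParseError",
    "ValueError",
    "Task did not complete",
    "AttributeError",
    "HGVSDataNotAvailableError",
    "HGVSInvalidVariantError",
]

def matches_per_row(error_df, patterns=PATTERNS):
    """Return, for each row, how many of the patterns occur in its ERROR text."""
    hits = pd.DataFrame({
        p: error_df["ERROR"].str.contains(p, regex=False, na=False)
        for p in patterns
    })
    return hits.sum(axis=1)

# Hypothetical messages, for illustration only (not taken from the dataset).
toy = pd.DataFrame({"ERROR": [
    "HGVSParseError: could not parse expression",
    "ValueError: unexpected input",
    "ValueError raised after AttributeError",  # matches two patterns
    "some unrelated failure",                   # matches none
]})

per_row = matches_per_row(toy)
print(per_row.tolist())
```

Rows with a value other than 1 would be either double-counted or left unclassified. Running `matches_per_row(errorDF)` on the real data is one way to confirm that each failure is counted exactly once.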

## Allele State Distribution

We count the occurrences of each **Allele State** in the SPDI and HGVS inputs to understand the composition of the data and where translations may fail.

- **`SPDI`** is uniform and fully translatable in this dataset (all examples are `ReferenceLengthExpression`).
- **`HGVS`** includes multiple state types. The **`Other`** category counts records without a recognized `state.type`; in this dataset these are HGVS expressions that could not be mapped to valid **VRS Alleles** and therefore could not be translated into **FHIR Allele Profiles**. These cases are captured in **`errorDF`** for analysis.
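
The tally can also be expressed with pandas operations instead of an explicit loop. This is a sketch under the assumption that each `out` value is a dict with a nested `state.type` field, as `count_state_type` implies; the helper `state_type_of` and the toy records are illustrative. Unlike `count_state_type`, it reports any state type under its own name rather than folding unknown types into `Other`.

```python
import pandas as pd

def state_type_of(vo):
    """Return the nested state.type of one output record, or 'Other' if absent."""
    if isinstance(vo, dict):
        state = vo.get("state")
        if isinstance(state, dict) and state.get("type"):
            return state["type"]
    return "Other"

# Toy records for illustration; real rows come from the JSONL files.
toy = pd.DataFrame({"out": [
    {"state": {"type": "ReferenceLengthExpression"}},
    {"state": {"type": "LiteralSequenceExpression"}},
    {"state": {"type": "ReferenceLengthExpression"}},
    None,  # a record with no translated output
]})

# Map each record to its state type, then count occurrences per type.
counts = toy["out"].map(state_type_of).value_counts().to_dict()
print(counts)
```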


In [11]:
# SPDI state counts
print("SPDI file")
print(count_state_type(spdiDF))

# HGVS state counts
print("HGVS file")
print(count_state_type(hgvsDF))


SPDI file
{'ReferenceLengthExpression': 1000, 'LiteralSequenceExpression': 0, 'Other': 0}
HGVS file
{'ReferenceLengthExpression': 530, 'LiteralSequenceExpression': 260, 'Other': 210}
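
The HGVS `Other` count (210) equals the number of rows loaded into `errorDF` (210). A small sketch of making that consistency check explicit, using the values printed above rather than the DataFrames themselves so that it runs on its own:

```python
# Values copied from the printed outputs above.
hgvs_counts = {
    "ReferenceLengthExpression": 530,
    "LiteralSequenceExpression": 260,
    "Other": 210,
}
n_hgvs = 1000    # len(hgvsDF)
n_errors = 210   # len(errorDF)

# Every HGVS record falls into exactly one bucket.
assert sum(hgvs_counts.values()) == n_hgvs
# The number of unmapped records should equal the number of logged errors.
assert hgvs_counts["Other"] == n_errors

success_rate = 1 - n_errors / n_hgvs
print(f"HGVS translation success rate: {success_rate:.1%}")
```

In the notebook itself the same assertions could be written against `count_state_type(hgvsDF)`, `len(hgvsDF)`, and `len(errorDF)` directly.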


In [16]:
error_counts = types_of_error(errorDF)

# Display as a DataFrame for readability
pd.DataFrame(list(error_counts.items()), columns=["Error Type", "Count"])

| | Error Type | Count |
|---|---|---|
| 0 | HGVSParseError | 179 |
| 1 | ValueError | 25 |
| 2 | Task did not complete | 2 |
| 3 | AttributeError | 1 |
| 4 | HGVSDataNotAvailable | 2 |
| 5 | HGVSInvalidVariant | 1 |


## Error Analysis for HGVS Translations

The `errorDF` dataset captures the **210 HGVS Alleles (21% of the 1,000 HGVS inputs)** that could not be translated into valid VRS Alleles and FHIR Allele Profiles.  
Breaking down the failures by type reveals clear patterns:

- **HGVSParseError (179 cases)** → About 85% of the failures (179 of 210) were due to parsing errors in HGVS strings.  
- **ValueError (25 cases)** → Inputs did not conform to expected formats or ranges.  
- **Task did not complete (2 cases)** → Likely timeouts or incomplete processing.  
- **AttributeError (1 case)** → Unsupported HGVS expression not handled by the pipeline.  
- **HGVSDataNotAvailable (2 cases)** → Missing reference data prevented translation.  
- **HGVSInvalidVariant (1 case)** → The input was not a valid HGVS variant.  
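
To put the breakdown in proportion, each error type's share of the 210 failures can be computed from the counts. This sketch hard-codes the counts from the table above so it is self-contained; the `Share (%)` column name is an illustrative choice.

```python
import pandas as pd

# Counts copied from the error table above.
error_counts = {
    "HGVSParseError": 179,
    "ValueError": 25,
    "Task did not complete": 2,
    "AttributeError": 1,
    "HGVSDataNotAvailable": 2,
    "HGVSInvalidVariant": 1,
}

summary = pd.DataFrame(list(error_counts.items()), columns=["Error Type", "Count"])
total = summary["Count"].sum()

# Percentage of all failures attributable to each error type.
summary["Share (%)"] = (summary["Count"] / total * 100).round(1)
summary = summary.sort_values("Count", ascending=False, ignore_index=True)
print(summary)
```

In the notebook, `error_counts` could instead come straight from `types_of_error(errorDF)`.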