# Parse Nirvana JSON output in python

Nirvana outputs a single JSON annotation file for a single input VCF file. The output file contains a single [JSON object](https://www.w3schools.com/js/js_json_objects.asp) to represent the annotations of all input VCF variants. The JSON file format can be found in the [documentation](https://illumina.github.io/NirvanaDocumentation/file-formats/nirvana-json-file-format).

A [sample JSON file](https://github.com/Illumina/NirvanaDocumentation/blob/master/static/files/ceph_trio_test.json.gz) of the CEPH trio (NA12878, NA12891, and NA12892) can be downloaded from the [NirvanaDocumention](https://github.com/Illumina/NirvanaDocumentation/tree/master/static/files) git repo.

This notebook demonstrates how you can parse this sample JSON file in python and retrieves the annotation data.

## Read Nirvana JSON output by lines

Even though the Nirvana JSON output is a single JSON object, different JSON object fields are written in different lines for memory efficient reading.

The first line in the JSON file is the `header` line:

```json
{"header":{"annotator":"Nirvana
...
,"positions":[
```
Followed by the `positions` lines:

```json
{"chromosome":"chr21","position":9975027,"refAllele":"C","altAlleles":["G"],"quality":102.47,"filters":["PASS"],"fisherStrandBias":0.727,"mappingQuality":43.11,"cytogeneticBand":"21p11.2","samples":,
...
```

After the `positions` lines, there are the `genes` lines, which are optional if there is no overlapping gene of the input VCF variants (3 lines for 2 genes in the example):

```json
],"genes":[
{"name":"ABCC13","omim":[{"mimNumber":608835,"geneName":"ATP-binding cassette, subfamily C, member 13","description":"ABCC13 belongs to a large family of ATP-binding cassette (ABC) transporters that play important roles as membrane transporters or ion channel modulators. However, ABCC13 is a truncated protein that lacks critical ATP-binding motifs and is unlikely to be a functional transporter (Yabuuchi et al., 2002)."}]},
```

Finally, the last line of the JSON file are two brackets to complete the JSON object structure:

```json
]}
```



## Install requirements

In [364]:
# !pip3 install pydantic ijson polars


Collecting polars
  Downloading polars-0.19.19-cp38-abi3-macosx_10_12_x86_64.whl (26.6 MB)
[2K     [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m26.6/26.6 MB[0m [31m22.6 MB/s[0m eta [36m0:00:00[0mm eta [36m0:00:01[0m[36m0:00:01[0m
Installing collected packages: polars
Successfully installed polars-0.19.19

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3.11 -m pip install --upgrade pip[0m


In [389]:
import gzip
import ijson
import functools

from typing import Any, Dict, List, Optional


import polars as pd
import pydantic

# pd.set_option('display.max_columns', None)
pd.Config(tbl_rows=50)

class AnnotatedData:
    def __init__(self, filename: str):
        self._filename = filename
        
        
        for key in ("annotator", "genomeAssembly", "creationTime"):
            print(f"{key}: {self.header[key]}")     
        
    @property
    def header(self) -> dict:
        with gzip.open(self._filename, 'r') as f:
            return next(ijson.items(f, "header"))
    
    @property
    def data_sources(self):
        return pd.DataFrame(self.header["dataSources"]).sort(by="name")
        
    @property
    def genes(self) -> pd.DataFrame:
        with gzip.open(self._filename, 'r') as f:
            # return pd.json_normalize(ijson.items(f, "genes.item"))
            return pd.DataFrame(ijson.items(f, "genes.item"))
        
    @property
    def positions(self) -> list:
        f =  gzip.open(self._filename, 'r')
        return ijson.items(f, "positions.item")


    def get_annotation(self, chromosome: str, position: int) -> dict:
        
        return next(
            (
                position_item for position_item in self.positions
                if chromosome == position_item.get("chromosome") and \
                position == position_item.get("position")
            ),
            None
        )
    
    @staticmethod
    def multiple_to_df(items: list, key: str = ""):
        return pd.concat((item.to_df(key) for item in items))
    
class BaseClass(pydantic.BaseModel):
    @staticmethod
    def _get_nested_value(dictionary, keys):
        keys = keys.split('.')
        for key in keys:
            print(dictionary)
            if isinstance(dictionary, dict):
                dictionary = dictionary.get(key)
            else:
                return None
        return dictionary
        
    def get_top_level(self):
        return pd.DataFrame(self.get_top_level_dict()).unnest()
    
    def get_top_level_dict(self):
        raise NotImplementedError
    
    def to_df(self, key: str = "") -> pd.DataFrame:
        if not key:
            return pd.DataFrame(self.model_dump()).unnest()
                        
        # values = self._get_nested_value(self.model_dump(), key)
        values = self.model_dump().get(key)
        
        if isinstance(values, list):
            merged = [self.get_top_level_dict() | value for value in values]
        else:
            merged = [self.get_top_level_dict() | {key: values}]
        
        # return pd.json_normalize(merged)
        return pd.DataFrame(merged).unnest()
    
class Transcript(BaseClass):
    model_config = pydantic.ConfigDict(extra='allow') 
    transcript: str
    source: str
    bioType: Optional[str]
    geneId: Optional[str]
    hgnc: Optional[str]
    consequence: Optional[List[str]]
    impact: Optional[str]
    isCanonical: Optional[bool]
    
    def get_top_level_dict(self):
        return dict(zip(
            ("transcript", "isCanonical"),
            (self.transcript, self.isCanonical)
        ))
    
    
class Variant(BaseClass):
    model_config = pydantic.ConfigDict(extra='allow') 
    vid: str
    chromosome: str
    begin: int
    end: int
    refAllele: str
    altAllele: str
    variantType: Optional[str]
    hgvsg: Optional[str]
    phylopScore: Optional[float]
    phyloPPrimateScore: Optional[float]
    transcripts: Optional[List[Transcript]] = None
        
    def get_top_level_dict(self):
        return dict(zip(
            ("chromosome", "begin", "end", "refAllele", "altAllele", "hgvsg"),
            (self.chromosome, self.begin, self.end, self.refAllele, self.altAllele, self.hgvsg)
        ))
    
    
class Position(BaseClass):
    model_config = pydantic.ConfigDict(extra='allow') 
    chromosome: str
    position: int
    refAllele: str
    altAlleles: List[str]
    filters: List[str]
    mappingQuality: float
    cytogeneticBand: str
    vcfInfo: Dict[str, Any]
    samples: List
    variants: List[Variant]
    
    def get_top_level_dict(self):
        return dict(zip(
            ("chromosome", "position", "refAllele", "altAlleles", "filters", "mappingQuality", "cyatogeneticBand", "vcfInfo"),
            (self.chromosome, self.position, self.refAllele, self.altAlleles, self.filters, self.mappingQuality, self.cytogeneticBand, self.vcfInfo)
        ))


## Header

In [390]:
filename = "annotated_38.json.gz"
annotated_data = AnnotatedData(filename=filename)

annotator: Illumina Connected Annotations 3.22.0
genomeAssembly: GRCh38
creationTime: 2023-12-07 14:15:54


In [391]:
annotated_data.data_sources

name,version,description,releaseDate
str,str,str,str
"""1000 Genomes P…","""Phase 3 v3plus…","""A public catal…","""2013-05-27"""
"""1000 Genomes P…","""Phase 3 v5a""","""A public catal…","""2013-05-27"""
"""COSMIC""","""96""","""resource for e…","""2022-05-31"""
"""COSMIC gene fu…","""96""","""manually curat…","""2023-11-07"""
"""CancerHotspots…","""2017""","""A resouce for …","""2017-01-01"""
"""ClinGen""","""20160414""",,"""2016-04-14"""
"""ClinGen Dosage…","""20231105""","""Dosage sensiti…","""2023-11-05"""
"""ClinGen diseas…","""20231105""","""Disease validi…","""2023-11-05"""
"""ClinVar""","""20231028""","""A freely acces…","""2023-11-05"""
"""Cosmic Cancer …","""97""","""Cosmic Cancer …","""2022-11-29"""


## Genes

In [392]:
annotated_data.genes

thread '<unnamed>' panicked at crates/polars-core/src/frame/row/av_buffer.rs:249:85:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("unable to losslessly convert any-value of scale 60 to scale 3"))


PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("unable to losslessly convert any-value of scale 60 to scale 3"))

## Positions

In [354]:

position = Position.model_validate(annotated_data.get_annotation("chr21", 5228221))
position.to_df()

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cytogeneticBand,samples,variants,vcfInfo.DP
0,chr21,5228221,G,[T],[PASS],75.75,21p12,"[{'genotype': '0/0', 'variantFrequencies': [0....","[{'vid': '21-5228221-G-T', 'chromosome': 'chr2...",1004


In [355]:
position.to_df(key="variants")

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,vid,begin,end,altAllele,variantType,hgvsg,phylopScore,phyloPPrimateScore,transcripts,vcfInfo.DP
0,chr21,5228221,G,[T],[PASS],75.75,21p12,21-5228221-G-T,5228221,5228221,T,SNV,NC_000021.9:g.5228221G>T,0.2,0.0,"[{'transcript': 'ENST00000623753.1', 'source':...",1004


In [356]:
position.variants[0].to_df("transcripts")

Unnamed: 0,chromosome,begin,end,refAllele,altAllele,hgvsg,transcript,source,bioType,geneId,hgnc,consequence,impact,isCanonical
0,chr21,5228221,5228221,G,T,NC_000021.9:g.5228221G>T,ENST00000623753.1,Ensembl,lncRNA,ENSG00000279669,ENSG00000279669,[downstream_gene_variant],modifier,True


In [357]:
AnnotatedData.multiple_to_df(position.variants, key="transcripts")

Unnamed: 0,chromosome,begin,end,refAllele,altAllele,hgvsg,transcript,source,bioType,geneId,hgnc,consequence,impact,isCanonical
0,chr21,5228221,5228221,G,T,NC_000021.9:g.5228221G>T,ENST00000623753.1,Ensembl,lncRNA,ENSG00000279669,ENSG00000279669,[downstream_gene_variant],modifier,True


In [358]:
position = Position.model_validate(annotated_data.get_annotation("chr21", 5222289))
position.get_top_level()

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,vcfInfo.DP
0,chr21,5222289,C,[T],"[mapping_quality, weak_evidence]",49.7,21p12,749


In [359]:
position.to_df()

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cytogeneticBand,samples,variants,vcfInfo.DP
0,chr21,5222289,C,[T],"[mapping_quality, weak_evidence]",49.7,21p12,"[{'genotype': '0/0', 'variantFrequencies': [0....","[{'vid': '21-5222289-C-T', 'chromosome': 'chr2...",749


In [360]:
position.to_df("cytogeneticBand")

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,cytogeneticBand,vcfInfo.DP
0,chr21,5222289,C,[T],"[mapping_quality, weak_evidence]",49.7,21p12,21p12,749


In [361]:
position.to_df("variants")

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,vid,begin,end,altAllele,variantType,hgvsg,phylopScore,phyloPPrimateScore,transcripts,dbsnp,vcfInfo.DP,gnomad.coverage,gnomad.allAf,gnomad.allAn,gnomad.allAc,gnomad.allHc,gnomad.afrAf,gnomad.afrAn,gnomad.afrAc,gnomad.afrHc,gnomad.amrAf,gnomad.amrAn,gnomad.amrAc,gnomad.amrHc,gnomad.easAf,gnomad.easAn,gnomad.easAc,gnomad.easHc,gnomad.finAf,gnomad.finAn,gnomad.finAc,gnomad.finHc,gnomad.nfeAf,gnomad.nfeAn,gnomad.nfeAc,gnomad.nfeHc,gnomad.asjAf,gnomad.asjAn,gnomad.asjAc,gnomad.asjHc,gnomad.sasAf,gnomad.sasAn,gnomad.sasAc,gnomad.sasHc,gnomad.othAf,gnomad.othAn,gnomad.othAc,gnomad.othHc,gnomad.maleAf,gnomad.maleAn,gnomad.maleAc,gnomad.maleHc,gnomad.femaleAf,gnomad.femaleAn,gnomad.femaleAc,gnomad.femaleHc,gnomad.controlsAllAf,gnomad.controlsAllAn,gnomad.controlsAllAc,topmed.allAf,topmed.allAn,topmed.allAc,topmed.allHc,topmed.failedFilter
0,chr21,5222289,C,[T],"[mapping_quality, weak_evidence]",49.7,21p12,21-5222289-C-T,5222289,5222289,T,SNV,NC_000021.9:g.5222289C>T,-0.2,0.074,,[rs1366179382],749,3,0.000502,149274,75,0,0.001704,40502,69,0,6.7e-05,14966,1,0,0,5022,0,0,0,10430,0,0,7.5e-05,66950,5,0,0,3438,0,0,0,4734,0,0,0,2024,0,0,0.000411,73006,30,0,0.00059,76268,45,0,0.000565,31850,18,0.001091,125568,137,0,True


# Parsing and Filtering

In [241]:
class Parser:
    def __init__(self, annotated_data: AnnotatedData):
        self.annotated_data = annotated_data
        
    def get_variants_above_gnomad_freq(
        self,
        frequency_key: str,
        frequency_threshold_low=float("-inf"),
        frequency_threshold_high=float("inf")
    ) -> list:
        positions = [
            Position.model_validate(position)
            for position in self.annotated_data.positions
            for variant in position.get("variants", {})
            if (freq := variant.get("gnomad", {}).get(frequency_key, None)) \
            and frequency_threshold_low < freq < frequency_threshold_high
        ]
        return positions
    
    def get_positions_with_cannonical_transcripts(self):   
        positions = [
            Position.model_validate(position)
            for position in self.annotated_data.positions
            for variant in position.get("variants", {})
            for transcript in variant.get("transcripts", [])
            if transcript.get("isCanonical")
        ]

        return positions
    
    def filter_transcripts_by_consequence(self, include=[], exclude=[]):
        positions = [
            Position.model_validate(position)
            for position in self.annotated_data.positions
            for variant in position.get("variants", {})
            for transcript in variant.get("transcripts", [])
            for consequence in transcript.get("consequence", [])
            if (not bool(include) or consequence in include) and consequence not in exclude
        ]
        return positions
        
    

## Positions with Cannonical Transcripts only

In [242]:
parser = Parser(annotated_data)

In [243]:
positions = parser.get_positions_with_cannonical_transcripts()
len(list(positions))

2472

In [244]:
Position.positions_to_df(positions, "variants")

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,vid,begin,end,altAllele,variantType,hgvsg,phylopScore,phyloPPrimateScore,transcripts,vcfInfo.DP,regulatoryRegions,gnomad.coverage,gnomad.failedFilter,gnomad.allAf,gnomad.allAn,gnomad.allAc,gnomad.allHc,gnomad.afrAf,gnomad.afrAn,gnomad.afrAc,gnomad.afrHc,gnomad.amrAf,gnomad.amrAn,gnomad.amrAc,gnomad.amrHc,gnomad.easAf,gnomad.easAn,gnomad.easAc,gnomad.easHc,gnomad.finAf,gnomad.finAn,gnomad.finAc,gnomad.finHc,gnomad.nfeAf,gnomad.nfeAn,gnomad.nfeAc,gnomad.nfeHc,gnomad.asjAf,gnomad.asjAn,gnomad.asjAc,gnomad.asjHc,gnomad.sasAf,gnomad.sasAn,gnomad.sasAc,gnomad.sasHc,gnomad.othAf,gnomad.othAn,gnomad.othAc,gnomad.othHc,gnomad.maleAf,gnomad.maleAn,gnomad.maleAc,gnomad.maleHc,gnomad.femaleAf,gnomad.femaleAn,gnomad.femaleAc,gnomad.femaleHc,gnomad.controlsAllAf,gnomad.controlsAllAn,gnomad.controlsAllAc,dbsnp,topmed.allAf,topmed.allAn,topmed.allAc,topmed.allHc,topmed.failedFilter,inLowComplexityRegion,dannScore,oneKg.allAf,oneKg.afrAf,oneKg.amrAf,oneKg.easAf,oneKg.eurAf,oneKg.sasAf,oneKg.allAn,oneKg.afrAn,oneKg.amrAn,oneKg.easAn,oneKg.eurAn,oneKg.sasAn,oneKg.allAc,oneKg.afrAc,oneKg.amrAc,oneKg.easAc,oneKg.eurAc,oneKg.sasAc,cosmic,gerpScore,primateAI-3D,revel.score,clinvar,spliceAI
0,chr21,5228221,G,[T],[PASS],75.75,21p12,21-5228221-G-T,5228221,5228221,T,SNV,NC_000021.9:g.5228221G>T,0.2,0,"[{'transcript': 'ENST00000623753.1', 'source':...",1004,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,5232869,T,[G],"[base_quality, weak_evidence]",58.10,21p12,21-5232869-T-G,5232869,5232869,G,SNV,NC_000021.9:g.5232869T>G,0.5,,"[{'transcript': 'ENST00000623753.1', 'source':...",799,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,5243664,C,[T],"[mapping_quality, no_reliable_supporting_read,...",58.13,21p12,21-5243664-C-T,5243664,5243664,T,SNV,NC_000021.9:g.5243664C>T,0.6,0.074,"[{'transcript': 'ENST00000623753.1', 'source':...",644,"[{'id': 'ENSR00000140073', 'type': 'TF_binding...",0.0,True,0,152310.0,0.0,0.0,0,41488.0,0.0,0.0,0,15294.0,0.0,0.0,0,5208.0,0.0,0.0,0,10632.0,0.0,0.0,0,68056.0,0.0,0.0,0,3472.0,0.0,0.0,0,4838.0,0.0,0.0,0,2094.0,0.0,0.0,0,74416.0,0.0,0.0,0,77894.0,0.0,0.0,0,32928.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,5243812,C,[T],[PASS],108.19,21p12,21-5243812-C-T,5243812,5243812,T,SNV,NC_000021.9:g.5243812C>T,0.4,0.074,"[{'transcript': 'ENST00000623753.1', 'source':...",514,"[{'id': 'ENSR00000140073', 'type': 'TF_binding...",0.0,True,0,152296.0,0.0,0.0,0,41480.0,0.0,0.0,0,15294.0,0.0,0.0,0,5206.0,0.0,0.0,0,10628.0,0.0,0.0,0,68058.0,0.0,0.0,0,3472.0,0.0,0.0,0,4836.0,0.0,0.0,0,2094.0,0.0,0.0,0,74406.0,0.0,0.0,0,77890.0,0.0,0.0,0,32922.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,6367612,-,[CAATG],[filtered_reads],21.70,21p12,21-6367612-C-CAATG,6367613,6367612,AATG,insertion,NC_000021.9:g.6367616_6367619dup,,,"[{'transcript': 'ENST00000615262.1', 'source':...",375,,11.0,True,0.139262,20860.0,2905.0,3.0,0.065393,2722.0,178.0,0.0,0.110881,1930.0,214.0,1.0,0.108384,978.0,106.0,0.0,0.132432,370.0,49.0,0.0,0.172472,12518.0,2159.0,2.0,0.087379,824.0,72.0,0.0,0.059756,820.0,49.0,0.0,0.16,300.0,48.0,0.0,0.127778,8640.0,1104.0,0.0,0.147381,12220.0,1801.0,3.0,0.123314,5190.0,640.0,[rs1234944247],0.078897,125568.0,9907.0,0.0,True,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,chr21,46597114,C,[A],[PASS],219.26,21q22.3,21-46597114-C-A,46597114,46597114,A,SNV,NC_000021.9:g.46597114C>A,0.5,-0.072,"[{'transcript': 'ENST00000291700.9', 'source':...",143,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.75,,,,,,,,,,,,,,,,,,,,-0.635,,,,
0,chr21,46597227,A,[C],"[base_quality, weak_evidence]",217.92,21q22.3,21-46597227-A-C,46597227,46597227,C,SNV,NC_000021.9:g.46597227A>C,0.6,0.781,"[{'transcript': 'ENST00000291700.9', 'source':...",151,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.85,,,,,,,,,,,,,,,,,,,,-0.574,,,,
0,chr21,46597227,A,[C],"[base_quality, weak_evidence]",217.92,21q22.3,21-46597227-A-C,46597227,46597227,C,SNV,NC_000021.9:g.46597227A>C,0.6,0.781,"[{'transcript': 'ENST00000291700.9', 'source':...",151,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.85,,,,,,,,,,,,,,,,,,,,-0.574,,,,
0,chr21,46643510,T,[A],"[base_quality, weak_evidence]",211.58,21q22.3,21-46643510-T-A,46643510,46643510,A,SNV,NC_000021.9:g.46643510T>A,0,-0.603,"[{'transcript': 'ENST00000451211.6', 'source':...",150,"[{'id': 'ENSR00001057246', 'type': 'CTCF_bindi...",0.0,True,0.000021,141118.0,3.0,0.0,0.000051,39060.0,2.0,0.0,0,14018.0,0.0,0.0,0.000232,4306.0,1.0,0.0,0,7762.0,0.0,0.0,0,65358.0,0.0,0.0,0,3364.0,0.0,0.0,0,4110.0,0.0,0.0,0,1968.0,0.0,0.0,0,68068.0,0.0,0.0,0.000041,73050.0,3.0,0.0,0.000108,27846.0,3.0,[rs1601921720],,,,,,,0.58,,,,,,,,,,,,,,,,,,,,0.622,,,,


In [245]:
positions[0].to_df("variants.transcripts")

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,variants.transcripts,vcfInfo.DP
0,chr21,5228221,G,[T],[PASS],75.75,21p12,,1004


In [246]:
positions[0].variants[0].get("transcripts")

[{'transcript': 'ENST00000623753.1',
  'source': 'Ensembl',
  'bioType': 'lncRNA',
  'geneId': 'ENSG00000279669',
  'hgnc': 'ENSG00000279669',
  'consequence': ['downstream_gene_variant'],
  'impact': 'modifier',
  'isCanonical': True}]

## Filter by Consequence

In [13]:
parser = Parser(annotated_data)

In [14]:
positions = parser.filter_transcripts_by_consequence(
    include=["non_coding_transcript_exon_variant"]
)
len(positions), positions[0].model_dump()

(147,
 {'chromosome': 'chr21',
  'position': 5232869,
  'refAllele': 'T',
  'altAlleles': ['G'],
  'filters': ['base_quality', 'weak_evidence'],
  'mappingQuality': 58.1,
  'cytogeneticBand': '21p12',
  'vcfInfo': {'DP': '799'},
  'samples': [{'genotype': '0/0',
    'variantFrequencies': [0.0552],
    'totalDepth': 181,
    'alleleDepths': [171, 10],
    'somaticQuality': 0,
    'vcfSampleInfo': {'F1R2': '96,6', 'F2R1': '75,4'}},
   {'genotype': '0/1',
    'variantFrequencies': [0.0677],
    'totalDepth': 591,
    'alleleDepths': [551, 40],
    'somaticQuality': 8.6,
    'vcfSampleInfo': {'F1R2': '280,23', 'F2R1': '271,17'}}],
  'variants': [{'vid': '21-5232869-T-G',
    'chromosome': 'chr21',
    'begin': 5232869,
    'end': 5232869,
    'refAllele': 'T',
    'altAllele': 'G',
    'variantType': 'SNV',
    'hgvsg': 'NC_000021.9:g.5232869T>G',
    'phylopScore': 0.5,
    'transcripts': [{'transcript': 'ENST00000623753.1',
      'source': 'Ensembl',
      'bioType': 'lncRNA',
      'cdn

In [15]:
positions = parser.filter_transcripts_by_consequence(
    exclude=["downstream_gene_variant", "upstream_gene_variant"]
)
len(positions), positions[0].model_dump()

(11723,
 {'chromosome': 'chr21',
  'position': 5232869,
  'refAllele': 'T',
  'altAlleles': ['G'],
  'filters': ['base_quality', 'weak_evidence'],
  'mappingQuality': 58.1,
  'cytogeneticBand': '21p12',
  'vcfInfo': {'DP': '799'},
  'samples': [{'genotype': '0/0',
    'variantFrequencies': [0.0552],
    'totalDepth': 181,
    'alleleDepths': [171, 10],
    'somaticQuality': 0,
    'vcfSampleInfo': {'F1R2': '96,6', 'F2R1': '75,4'}},
   {'genotype': '0/1',
    'variantFrequencies': [0.0677],
    'totalDepth': 591,
    'alleleDepths': [551, 40],
    'somaticQuality': 8.6,
    'vcfSampleInfo': {'F1R2': '280,23', 'F2R1': '271,17'}}],
  'variants': [{'vid': '21-5232869-T-G',
    'chromosome': 'chr21',
    'begin': 5232869,
    'end': 5232869,
    'refAllele': 'T',
    'altAllele': 'G',
    'variantType': 'SNV',
    'hgvsg': 'NC_000021.9:g.5232869T>G',
    'phylopScore': 0.5,
    'transcripts': [{'transcript': 'ENST00000623753.1',
      'source': 'Ensembl',
      'bioType': 'lncRNA',
      'c

## Filter by gnomad frequency
Possible values

'coverage', 'failedFilter', 'allAf', 'allAn', 'allAc', 'allHc', 'afrAf', 'afrAn', 'afrAc', 'afrHc', 'amrAf', 'amrAn', 'amrAc', 'amrHc', 'easAf', 'easAn', 'easAc', 'easHc', 'finAf', 'finAn', 'finAc', 'finHc', 'nfeAf', 'nfeAn', 'nfeAc', 'nfeHc', 'asjAf', 'asjAn', 'asjAc', 'asjHc', 'sasAf', 'sasAn', 'sasAc', 'sasHc', 'othAf', 'othAn', 'othAc', 'othHc', 'maleAf', 'maleAn', 'maleAc', 'maleHc', 'femaleAf', 'femaleAn', 'femaleAc', 'femaleHc', 'controlsAllAf', 'controlsAllAn', 'controlsAllAc'

In [16]:
parser = Parser(annotated_data)

In [17]:
positions = parser.get_variants_above_gnomad_freq(frequency_key="allAf", frequency_threshold_high=0.1)
len(positions), positions[0].model_dump()

(851,
 {'chromosome': 'chr21',
  'position': 5222289,
  'refAllele': 'C',
  'altAlleles': ['T'],
  'filters': ['mapping_quality', 'weak_evidence'],
  'mappingQuality': 49.7,
  'cytogeneticBand': '21p12',
  'vcfInfo': {'DP': '749'},
  'samples': [{'genotype': '0/0',
    'variantFrequencies': [0.0219],
    'totalDepth': 137,
    'alleleDepths': [134, 3],
    'somaticQuality': 0,
    'vcfSampleInfo': {'F1R2': '77,3', 'F2R1': '57,0'}},
   {'genotype': '0/1',
    'variantFrequencies': [0.0196],
    'totalDepth': 562,
    'alleleDepths': [551, 11],
    'somaticQuality': 7,
    'vcfSampleInfo': {'F1R2': '286,6', 'F2R1': '265,5'}}],
  'variants': [{'vid': '21-5222289-C-T',
    'chromosome': 'chr21',
    'begin': 5222289,
    'end': 5222289,
    'refAllele': 'C',
    'altAllele': 'T',
    'variantType': 'SNV',
    'hgvsg': 'NC_000021.9:g.5222289C>T',
    'phylopScore': -0.2,
    'dbsnp': ['rs1366179382'],
    'gnomad': {'coverage': 3,
     'allAf': 0.000502,
     'allAn': 149274,
     'allAc': 

In [18]:
positions = parser.get_variants_above_gnomad_freq(frequency_key="allAf", frequency_threshold_low=0.1)
len(positions), positions[0].model_dump()

(302,
 {'chromosome': 'chr21',
  'position': 5227548,
  'refAllele': 'C',
  'altAlleles': ['T'],
  'filters': ['PASS'],
  'mappingQuality': 93.13,
  'cytogeneticBand': '21p12',
  'vcfInfo': {'DP': '314'},
  'samples': [{'genotype': '0|0',
    'variantFrequencies': [0.125],
    'totalDepth': 40,
    'alleleDepths': [35, 5],
    'somaticQuality': 0,
    'vcfSampleInfo': {'F1R2': '16,4', 'F2R1': '19,1'}},
   {'genotype': '0|1',
    'variantFrequencies': [0.3432],
    'totalDepth': 169,
    'alleleDepths': [111, 58],
    'somaticQuality': 28.1,
    'vcfSampleInfo': {'F1R2': '60,29', 'F2R1': '51,29'}}],
  'variants': [{'vid': '21-5227548-C-T',
    'chromosome': 'chr21',
    'begin': 5227548,
    'end': 5227548,
    'refAllele': 'C',
    'altAllele': 'T',
    'variantType': 'SNV',
    'hgvsg': 'NC_000021.9:g.5227548C>T',
    'phylopScore': 0.2,
    'inLowComplexityRegion': True,
    'dbsnp': ['rs1219135472'],
    'gnomad': {'coverage': 15,
     'allAf': 0.348842,
     'allAn': 65554,
     'a

In [19]:
positions = parser.get_variants_above_gnomad_freq(frequency_key="allAf", frequency_threshold_low=0.1, frequency_threshold_high=0.2)
len(positions), positions[0].model_dump()

(114,
 {'chromosome': 'chr21',
  'position': 5278225,
  'refAllele': 'A',
  'altAlleles': ['T'],
  'filters': ['alt_allele_in_normal',
   'filtered_reads',
   'mapping_quality',
   'non_homref_normal',
   'weak_evidence'],
  'mappingQuality': 12.91,
  'cytogeneticBand': '21p12',
  'vcfInfo': {'DP': '113'},
  'samples': [{'genotype': '0/0',
    'variantFrequencies': [0.25],
    'totalDepth': 12,
    'alleleDepths': [9, 3],
    'somaticQuality': 0.5,
    'vcfSampleInfo': {'F1R2': '5,2', 'F2R1': '4,1'}},
   {'genotype': '0/1',
    'variantFrequencies': [0.7931],
    'totalDepth': 29,
    'alleleDepths': [6, 23],
    'somaticQuality': 7.5,
    'vcfSampleInfo': {'F1R2': '4,9', 'F2R1': '2,14'}}],
  'variants': [{'vid': '21-5278225-A-T',
    'chromosome': 'chr21',
    'begin': 5278225,
    'end': 5278225,
    'refAllele': 'A',
    'altAllele': 'T',
    'variantType': 'SNV',
    'hgvsg': 'NC_000021.9:g.5278225A>T',
    'dbsnp': ['rs1171728286'],
    'gnomad': {'coverage': 7,
     'failedFilter

## Gene Filtering

In [21]:
annotated_data.genes

Unnamed: 0,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD,clingenDosageSensitivityMap,clingenGeneValidity,cosmic
0,AATBC,51526.0,284837,ENSG00000215458,,,,,
1,ABCC13,,,ENSG00000291052,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,
2,ABCC13,16022.0,150000,ENSG00000243064,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,
3,ABCG1,73.0,9619,ENSG00000160179,"[{'mimNumber': 603076, 'geneName': 'ATP-bindin...","{'pLi': 0.112, 'pRec': 0.888, 'pNull': 9.34e-0...",,,
4,ADARB1,226.0,104,ENSG00000197381,"[{'mimNumber': 601218, 'geneName': 'Adenosine ...","{'pLi': 0.803, 'pRec': 0.197, 'pNull': 8.32e-0...",,,
...,...,...,...,...,...,...,...,...,...
437,YBEY,1299.0,54059,ENSG00000182362,"[{'mimNumber': 617461, 'geneName': 'YBEY metal...","{'pLi': 1.34e-07, 'pRec': 0.0625, 'pNull': 0.9...",,,
438,YRDCP3,39921.0,100861429,ENSG00000230859,,,,,
439,ZBTB21,13083.0,49854,ENSG00000173276,"[{'mimNumber': 616485, 'geneName': 'Zinc finge...","{'pLi': 0.998, 'pRec': 0.00222, 'pNull': 5.45e...",,,
440,ZNF295-AS1,23130.0,150142,ENSG00000237232,,,,,


In [38]:
annotated_data.genes[annotated_data.genes["name"] == "ABCC13"]

Unnamed: 0,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD,clingenDosageSensitivityMap,clingenGeneValidity,cosmic
1,ABCC13,,,ENSG00000291052,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,
2,ABCC13,16022.0,150000.0,ENSG00000243064,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,


In [36]:
annotated_data.genes[annotated_data.genes["ensemblGeneId"].isin(["ENSG00000173276", "ENSG00000243064"])]

Unnamed: 0,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD,clingenDosageSensitivityMap,clingenGeneValidity,cosmic
2,ABCC13,16022.0,150000,ENSG00000243064,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,
439,ZBTB21,13083.0,49854,ENSG00000173276,"[{'mimNumber': 616485, 'geneName': 'Zinc finge...","{'pLi': 0.998, 'pRec': 0.00222, 'pNull': 5.45e...",,,


In [31]:
annotated_data.genes.dropna(subset=["cosmic"])

Unnamed: 0,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD,clingenDosageSensitivityMap,clingenGeneValidity,cosmic
198,ERG,3446.0,2078,ENSG00000157554,"[{'mimNumber': 165080, 'geneName': 'ETS transc...","{'pLi': 0.964, 'pRec': 0.0359, 'pNull': 5.03e-...",,,"{'roleInCancer': ['oncogene', 'fusion']}"
418,TMPRSS2,11876.0,7113,ENSG00000184012,"[{'mimNumber': 602060, 'geneName': 'Transmembr...","{'pLi': 2.4e-10, 'pRec': 0.88, 'pNull': 0.12, ...",,,{'roleInCancer': ['fusion']}
