# Parse Nirvana JSON output in Python

Nirvana outputs a single JSON annotation file for a single input VCF file. The output file contains a single [JSON object](https://www.w3schools.com/js/js_json_objects.asp) to represent the annotations of all input VCF variants. The JSON file format can be found in the [documentation](https://illumina.github.io/NirvanaDocumentation/file-formats/nirvana-json-file-format).

A [sample JSON file](https://github.com/Illumina/NirvanaDocumentation/blob/master/static/files/ceph_trio_test.json.gz) of the CEPH trio (NA12878, NA12891, and NA12892) can be downloaded from the [NirvanaDocumentation](https://github.com/Illumina/NirvanaDocumentation/tree/master/static/files) git repo.

This notebook demonstrates how to parse this sample JSON file in Python and retrieve the annotation data.

## Read Nirvana JSON output by lines

Even though the Nirvana JSON output is a single JSON object, its fields are written on separate lines so that the file can be read line by line without loading the whole object into memory.

The first line in the JSON file is the `header` line:

```json
{"header":{"annotator":"Nirvana
...
,"positions":[
```

It is followed by the `positions` lines, one position per line:

```json
{"chromosome":"chr21","position":9975027,"refAllele":"C","altAlleles":["G"],"quality":102.47,"filters":["PASS"],"fisherStrandBias":0.727,"mappingQuality":43.11,"cytogeneticBand":"21p11.2","samples":,
...
```

After the `positions` lines come the `genes` lines. This section is optional and may be absent when no gene overlaps the input VCF variants (3 lines for 2 genes in the example):

```json
],"genes":[
{"name":"ABCC13","omim":[{"mimNumber":608835,"geneName":"ATP-binding cassette, subfamily C, member 13","description":"ABCC13 belongs to a large family of ATP-binding cassette (ABC) transporters that play important roles as membrane transporters or ion channel modulators. However, ABCC13 is a truncated protein that lacks critical ATP-binding motifs and is unlikely to be a functional transporter (Yabuuchi et al., 2002)."}]},
```

Finally, the last line of the JSON file contains two closing brackets that complete the JSON object structure:

```json
]}
```
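
The line layout described above can be read one record at a time. The following is a minimal sketch, assuming the first line ends with `,"positions":[`, every position or gene sits on its own line with an optional trailing comma, and the gene section starts with a line equal to `],"genes":[`. The function name `iter_nirvana_lines` is hypothetical and not part of Nirvana.

```python
import gzip
import json
from typing import Iterator, Tuple


def iter_nirvana_lines(filename: str) -> Iterator[Tuple[str, dict]]:
    """Yield ("header" | "position" | "gene", record) pairs line by line.

    Illustrative helper that relies on the line layout described above.
    """
    positions_marker = ',"positions":['
    genes_marker = '],"genes":['
    end_marker = "]}"

    section = "header"
    with gzip.open(filename, "rt", encoding="utf-8") as f:
        for raw_line in f:
            line = raw_line.strip()
            if not line:
                continue
            if section == "header":
                if not line.endswith(positions_marker):
                    raise ValueError("unexpected header line layout")
                # Drop the ',"positions":[' suffix and close the outer
                # brace so that the header line becomes valid JSON.
                header_text = line[: -len(positions_marker)] + "}"
                yield "header", json.loads(header_text)["header"]
                section = "position"
            elif line == genes_marker:
                section = "gene"
            elif line == end_marker:
                break
            else:
                # Records are separated by a trailing comma on each line.
                yield section, json.loads(line.rstrip(","))
```

Iterating with `for kind, record in iter_nirvana_lines(path): ...` keeps only one record in memory at a time, which is the point of the per-line layout. The rest of this notebook loads the whole file with `json.load` instead, which is simpler for a small file.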



## Install requirements

In [1]:
# !pip3 install pydantic


In [2]:
import gzip
import json
import functools


import pandas as pd
import pydantic


class AnnotatedData:
    def __init__(self, filename: str):
        with gzip.open(filename, 'r') as f:
            self._annotated_data = json.load(f)
        
        for key in ("annotator", "genomeAssembly", "creationTime"):
            print(f"{key}: {self.header[key]}")     
        
        print()
        
        for chromosome in self.chromosomes:
            chr_positions = self._split_by_chromosome[chromosome]
            print(
                f"chromsome={chromosome}, "
                f"positions={len(chr_positions)}, "
                f"min={chr_positions[0]['position']}, "
                f"max={chr_positions[-1]['position']}"
            )
        
    @property
    def header(self) -> dict:
        return self._annotated_data["header"]
    
    @property
    def data_sources(self):
        return pd.DataFrame(self.header["dataSources"]).set_index("name").sort_index()
        
    @property
    def genes(self) -> pd.DataFrame:
        return pd.DataFrame(self._annotated_data["genes"])
        
    @property
    def positions(self) -> list:
        return self._annotated_data["positions"]
    
    @property
    def chromosomes(self) -> list:
        return sorted(self._split_by_chromosome.keys())
    
    @functools.cached_property
    def _split_by_chromosome(self) -> dict:
        split_by_chromosome = {}
        for position in self.positions:
            chromosome = position["chromosome"]
            if chromosome not in split_by_chromosome:
                split_by_chromosome[chromosome] = []
                
            split_by_chromosome[chromosome].append(position)
            
        
        
        return split_by_chromosome
    
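    def get_annotation(self, chromosome: str, position: int) -> dict:
        if chromosome not in self.chromosomes:
            return None
        
        # Return the first record at the requested coordinate, or None if absent.
        return next(
            (
                position_item
                for position_item in self._split_by_chromosome[chromosome]
                if position_item["position"] == position
            ),
            None
        )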
    
class Position(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra='allow') 
    chromosome: str
    position: int
    refAllele: str
    altAlleles: list[str]
    filters: list
    mappingQuality: float
    cytogeneticBand: str
    vcfInfo: dict
    samples: list
    variants: list
        
    def get_top_level(self):
        return dict(zip(
            ("chromosome", "position", "refAllele", "altAlleles", "filters", "mappingQuality", "cyatogeneticBands", "vcfInfo"),
            (self.chromosome, self.position, self.refAllele, self.altAlleles, self.filters, self.mappingQuality, self.cytogeneticBand, self.vcfInfo)
        ))


## Header

In [3]:
annotated_data = AnnotatedData(filename="annotated_38.json.gz")

annotator: Illumina Connected Annotations 3.22.0
genomeAssembly: GRCh38
creationTime: 2023-12-07 14:15:54

chromsome=chr21, positions=1819, min=5222289, max=46678074


In [4]:
annotated_data.data_sources

name,version,description,releaseDate
1000 Genomes Project,Phase 3 v3plus,A public catalogue of human variation and geno...,2013-05-27
1000 Genomes Project (SV),Phase 3 v5a,A public catalogue of human variation and geno...,2013-05-27
COSMIC,96,resource for exploring the impact of somatic m...,2022-05-31
COSMIC gene fusions,96,manually curated somatic gene fusions,2023-11-07
CancerHotspots,2017,A resouce for statistically significant mutati...,2017-01-01
ClinGen,20160414,,2016-04-14
ClinGen Dosage Sensitivity Map,20231105,Dosage sensitivity map from ClinGen (dbVar),2023-11-05
ClinGen disease validity curations,20231105,Disease validity curations from ClinGen (dbVar),2023-11-05
ClinVar,20231028,"A freely accessible, public archive of reports...",2023-11-05
Cosmic Cancer Gene Census,97,Cosmic Cancer Gene Census catalogs genes with ...,2022-11-29


## Genes

In [5]:
annotated_data.genes

,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD,clingenDosageSensitivityMap,clingenGeneValidity,cosmic
0,AATBC,51526.0,284837,ENSG00000215458,,,,,
1,ABCC13,,,ENSG00000291052,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,
2,ABCC13,16022.0,150000,ENSG00000243064,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,
3,ABCG1,73.0,9619,ENSG00000160179,"[{'mimNumber': 603076, 'geneName': 'ATP-bindin...","{'pLi': 0.112, 'pRec': 0.888, 'pNull': 9.34e-0...",,,
4,ADARB1,226.0,104,ENSG00000197381,"[{'mimNumber': 601218, 'geneName': 'Adenosine ...","{'pLi': 0.803, 'pRec': 0.197, 'pNull': 8.32e-0...",,,
...,...,...,...,...,...,...,...,...,...
437,YBEY,1299.0,54059,ENSG00000182362,"[{'mimNumber': 617461, 'geneName': 'YBEY metal...","{'pLi': 1.34e-07, 'pRec': 0.0625, 'pNull': 0.9...",,,
438,YRDCP3,39921.0,100861429,ENSG00000230859,,,,,
439,ZBTB21,13083.0,49854,ENSG00000173276,"[{'mimNumber': 616485, 'geneName': 'Zinc finge...","{'pLi': 0.998, 'pRec': 0.00222, 'pNull': 5.45e...",,,
440,ZNF295-AS1,23130.0,150142,ENSG00000237232,,,,,


## Positions

In [6]:
position = Position.model_validate(annotated_data.get_annotation("chr21", 5222289))
position.get_top_level()

{'chromosome': 'chr21',
 'position': 5222289,
 'refAllele': 'C',
 'altAlleles': ['T'],
 'filters': ['mapping_quality', 'weak_evidence'],
 'mappingQuality': 49.7,
 'cyatogeneticBands': '21p12',
 'vcfInfo': {'DP': '749'}}

In [7]:
position.samples

[{'genotype': '0/0',
  'variantFrequencies': [0.0219],
  'totalDepth': 137,
  'alleleDepths': [134, 3],
  'somaticQuality': 0,
  'vcfSampleInfo': {'F1R2': '77,3', 'F2R1': '57,0'}},
 {'genotype': '0/1',
  'variantFrequencies': [0.0196],
  'totalDepth': 562,
  'alleleDepths': [551, 11],
  'somaticQuality': 7,
  'vcfSampleInfo': {'F1R2': '286,6', 'F2R1': '265,5'}}]

In [8]:
position.variants

[{'vid': '21-5222289-C-T',
  'chromosome': 'chr21',
  'begin': 5222289,
  'end': 5222289,
  'refAllele': 'C',
  'altAllele': 'T',
  'variantType': 'SNV',
  'hgvsg': 'NC_000021.9:g.5222289C>T',
  'phylopScore': -0.2,
  'dbsnp': ['rs1366179382'],
  'gnomad': {'coverage': 3,
   'allAf': 0.000502,
   'allAn': 149274,
   'allAc': 75,
   'allHc': 0,
   'afrAf': 0.001704,
   'afrAn': 40502,
   'afrAc': 69,
   'afrHc': 0,
   'amrAf': 6.7e-05,
   'amrAn': 14966,
   'amrAc': 1,
   'amrHc': 0,
   'easAf': 0,
   'easAn': 5022,
   'easAc': 0,
   'easHc': 0,
   'finAf': 0,
   'finAn': 10430,
   'finAc': 0,
   'finHc': 0,
   'nfeAf': 7.5e-05,
   'nfeAn': 66950,
   'nfeAc': 5,
   'nfeHc': 0,
   'asjAf': 0,
   'asjAn': 3438,
   'asjAc': 0,
   'asjHc': 0,
   'sasAf': 0,
   'sasAn': 4734,
   'sasAc': 0,
   'sasHc': 0,
   'othAf': 0,
   'othAn': 2024,
   'othAc': 0,
   'othHc': 0,
   'maleAf': 0.000411,
   'maleAn': 73006,
   'maleAc': 30,
   'maleHc': 0,
   'femaleAf': 0.00059,
   'femaleAn': 76268,
   '

# Parsing and Filtering

In [9]:
class Parser:
    def __init__(self, annotated_data: AnnotatedData):
        self.annotated_data = annotated_data
        
    def get_variants_above_gnomad_freq(
        self,
        frequency_key: str,
        frequency_threshold_low=float("-inf"),
        frequency_threshold_high=float("inf")
    ) -> list:
        # Keep variants whose gnomAD value lies strictly between the thresholds.
        # A position is emitted once per matching variant. Variants whose value
        # is missing or 0 are skipped, because `freq` is tested for truthiness.
        positions = [
            Position.model_validate(position)
            for position in self.annotated_data.positions
            for variant in position.get("variants", [])
            if (freq := variant.get("gnomad", {}).get(frequency_key, None))
            and frequency_threshold_low < freq < frequency_threshold_high
        ]
        return positions
    
    def get_positions_with_cannonical_transcripts(self):
        # A position is emitted once per canonical transcript, so the same
        # position can appear several times in the result.
        positions = [
            Position.model_validate(position)
            for position in self.annotated_data.positions
            for variant in position.get("variants", [])
            for transcript in variant.get("transcripts", [])
            if transcript.get("isCanonical")
        ]

        return positions
    
    def filter_transcripts_by_consequence(self, include=(), exclude=()):
        # An empty `include` accepts every consequence; `exclude` always applies.
        # A position is emitted once per matching consequence of each transcript.
        positions = [
            Position.model_validate(position)
            for position in self.annotated_data.positions
            for variant in position.get("variants", [])
            for transcript in variant.get("transcripts", [])
            for consequence in transcript.get("consequence", [])
            if (not include or consequence in include) and consequence not in exclude
        ]
        return positions
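
The comprehensions in `Parser` add a position to the result once for every matching transcript or consequence, so the same position can appear several times. If one entry per position is preferred, `any()` can be used instead. This is a minimal sketch on plain dictionaries; the function name `positions_with_matching_transcript` and the example records are hypothetical and only mimic the shape of the Nirvana output.

```python
from typing import Callable, Iterable, List


def positions_with_matching_transcript(
    positions: Iterable[dict],
    predicate: Callable[[dict], bool],
) -> List[dict]:
    """Return each position at most once if any of its transcripts matches."""
    return [
        position
        for position in positions
        if any(
            predicate(transcript)
            for variant in position.get("variants", [])
            for transcript in variant.get("transcripts", [])
        )
    ]


# Tiny hand-written records shaped like the Nirvana output (not real data).
example_positions = [
    {"position": 1, "variants": [{"transcripts": [
        {"isCanonical": True, "consequence": ["downstream_gene_variant"]},
        {"isCanonical": True, "consequence": ["intron_variant"]},
    ]}]},
    {"position": 2, "variants": [{"transcripts": [
        {"consequence": ["upstream_gene_variant"]},
    ]}]},
    {"position": 3},  # no variants at all
]

canonical = positions_with_matching_transcript(
    example_positions, lambda t: bool(t.get("isCanonical"))
)
print([p["position"] for p in canonical])  # [1]
```

Position 1 has two canonical transcripts but is returned only once. The same helper covers the consequence filter by passing a different predicate, for example `lambda t: "intron_variant" in t.get("consequence", [])`.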
        
    

## Positions with Canonical Transcripts Only

In [10]:
parser = Parser(annotated_data)

In [11]:
positions = parser.get_positions_with_cannonical_transcripts()
len(positions), positions[0].model_dump()

(2472,
 {'chromosome': 'chr21',
  'position': 5228221,
  'refAllele': 'G',
  'altAlleles': ['T'],
  'filters': ['PASS'],
  'mappingQuality': 75.75,
  'cytogeneticBand': '21p12',
  'vcfInfo': {'DP': '1004'},
  'samples': [{'genotype': '0/0',
    'variantFrequencies': [0.0046],
    'totalDepth': 219,
    'alleleDepths': [218, 1],
    'somaticQuality': 0,
    'vcfSampleInfo': {'F1R2': '108,0', 'F2R1': '110,1'}},
   {'genotype': '0/1',
    'variantFrequencies': [0.1227],
    'totalDepth': 709,
    'alleleDepths': [622, 87],
    'somaticQuality': 86.7,
    'vcfSampleInfo': {'F1R2': '309,38', 'F2R1': '313,49'}}],
  'variants': [{'vid': '21-5228221-G-T',
    'chromosome': 'chr21',
    'begin': 5228221,
    'end': 5228221,
    'refAllele': 'G',
    'altAllele': 'T',
    'variantType': 'SNV',
    'hgvsg': 'NC_000021.9:g.5228221G>T',
    'phylopScore': 0.2,
    'phyloPPrimateScore': 0,
    'transcripts': [{'transcript': 'ENST00000623753.1',
      'source': 'Ensembl',
      'bioType': 'lncRNA',
 

In [12]:
positions[0].variants[0].get("transcripts")

[{'transcript': 'ENST00000623753.1',
  'source': 'Ensembl',
  'bioType': 'lncRNA',
  'geneId': 'ENSG00000279669',
  'hgnc': 'ENSG00000279669',
  'consequence': ['downstream_gene_variant'],
  'impact': 'modifier',
  'isCanonical': True}]

## Filter by Consequence

In [13]:
parser = Parser(annotated_data)

In [14]:
positions = parser.filter_transcripts_by_consequence(
    include=["non_coding_transcript_exon_variant"]
)
len(positions), positions[0].model_dump()

(147,
 {'chromosome': 'chr21',
  'position': 5232869,
  'refAllele': 'T',
  'altAlleles': ['G'],
  'filters': ['base_quality', 'weak_evidence'],
  'mappingQuality': 58.1,
  'cytogeneticBand': '21p12',
  'vcfInfo': {'DP': '799'},
  'samples': [{'genotype': '0/0',
    'variantFrequencies': [0.0552],
    'totalDepth': 181,
    'alleleDepths': [171, 10],
    'somaticQuality': 0,
    'vcfSampleInfo': {'F1R2': '96,6', 'F2R1': '75,4'}},
   {'genotype': '0/1',
    'variantFrequencies': [0.0677],
    'totalDepth': 591,
    'alleleDepths': [551, 40],
    'somaticQuality': 8.6,
    'vcfSampleInfo': {'F1R2': '280,23', 'F2R1': '271,17'}}],
  'variants': [{'vid': '21-5232869-T-G',
    'chromosome': 'chr21',
    'begin': 5232869,
    'end': 5232869,
    'refAllele': 'T',
    'altAllele': 'G',
    'variantType': 'SNV',
    'hgvsg': 'NC_000021.9:g.5232869T>G',
    'phylopScore': 0.5,
    'transcripts': [{'transcript': 'ENST00000623753.1',
      'source': 'Ensembl',
      'bioType': 'lncRNA',
      'cdn

In [15]:
positions = parser.filter_transcripts_by_consequence(
    exclude=["downstream_gene_variant", "upstream_gene_variant"]
)
len(positions), positions[0].model_dump()

(11723,
 {'chromosome': 'chr21',
  'position': 5232869,
  'refAllele': 'T',
  'altAlleles': ['G'],
  'filters': ['base_quality', 'weak_evidence'],
  'mappingQuality': 58.1,
  'cytogeneticBand': '21p12',
  'vcfInfo': {'DP': '799'},
  'samples': [{'genotype': '0/0',
    'variantFrequencies': [0.0552],
    'totalDepth': 181,
    'alleleDepths': [171, 10],
    'somaticQuality': 0,
    'vcfSampleInfo': {'F1R2': '96,6', 'F2R1': '75,4'}},
   {'genotype': '0/1',
    'variantFrequencies': [0.0677],
    'totalDepth': 591,
    'alleleDepths': [551, 40],
    'somaticQuality': 8.6,
    'vcfSampleInfo': {'F1R2': '280,23', 'F2R1': '271,17'}}],
  'variants': [{'vid': '21-5232869-T-G',
    'chromosome': 'chr21',
    'begin': 5232869,
    'end': 5232869,
    'refAllele': 'T',
    'altAllele': 'G',
    'variantType': 'SNV',
    'hgvsg': 'NC_000021.9:g.5232869T>G',
    'phylopScore': 0.5,
    'transcripts': [{'transcript': 'ENST00000623753.1',
      'source': 'Ensembl',
      'bioType': 'lncRNA',
      'c

## Filter by gnomAD frequency
Possible values for `frequency_key`:

'coverage', 'failedFilter', 'allAf', 'allAn', 'allAc', 'allHc', 'afrAf', 'afrAn', 'afrAc', 'afrHc', 'amrAf', 'amrAn', 'amrAc', 'amrHc', 'easAf', 'easAn', 'easAc', 'easHc', 'finAf', 'finAn', 'finAc', 'finHc', 'nfeAf', 'nfeAn', 'nfeAc', 'nfeHc', 'asjAf', 'asjAn', 'asjAc', 'asjHc', 'sasAf', 'sasAn', 'sasAc', 'sasHc', 'othAf', 'othAn', 'othAc', 'othHc', 'maleAf', 'maleAn', 'maleAc', 'maleHc', 'femaleAf', 'femaleAn', 'femaleAc', 'femaleHc', 'controlsAllAf', 'controlsAllAn', 'controlsAllAc'
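
Several of these fields can legitimately be `0` (for example `easAf` in the variant shown earlier). `get_variants_above_gnomad_freq` tests the looked-up value for truthiness, so a value of `0` is treated the same as a missing key. The sketch below shows one way to separate the two cases on plain dictionaries. The names `positions_in_frequency_range` and `example_positions` are hypothetical, and the records are hand-written rather than taken from the annotated file.

```python
from typing import Iterable, List


def positions_in_frequency_range(
    positions: Iterable[dict],
    frequency_key: str,
    low: float = float("-inf"),
    high: float = float("inf"),
) -> List[dict]:
    """Keep positions with at least one variant whose gnomAD value is in (low, high)."""
    selected = []
    for position in positions:
        frequencies = (
            variant.get("gnomad", {}).get(frequency_key)
            for variant in position.get("variants", [])
        )
        # `is not None` keeps a frequency of 0, which a plain truthiness
        # test would discard together with missing values.
        if any(f is not None and low < f < high for f in frequencies):
            selected.append(position)
    return selected


# Hand-written example records (not real data).
example_positions = [
    {"position": 1, "variants": [{"gnomad": {"allAf": 0.0}}]},
    {"position": 2, "variants": [{"gnomad": {"allAf": 0.05}},
                                 {"gnomad": {"allAf": 0.07}}]},
    {"position": 3, "variants": [{"gnomad": {"allAf": 0.35}}]},
    {"position": 4, "variants": [{}]},  # no gnomAD annotation
]

rare = positions_in_frequency_range(example_positions, "allAf", high=0.1)
print([p["position"] for p in rare])  # [1, 2]
```

Whether variants with a frequency of exactly `0` should count as "rare" depends on the analysis, so treat this as one possible convention rather than the required behaviour.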

In [16]:
parser = Parser(annotated_data)

In [17]:
positions = parser.get_variants_above_gnomad_freq(frequency_key="allAf", frequency_threshold_high=0.1)
len(positions), positions[0].model_dump()

(851,
 {'chromosome': 'chr21',
  'position': 5222289,
  'refAllele': 'C',
  'altAlleles': ['T'],
  'filters': ['mapping_quality', 'weak_evidence'],
  'mappingQuality': 49.7,
  'cytogeneticBand': '21p12',
  'vcfInfo': {'DP': '749'},
  'samples': [{'genotype': '0/0',
    'variantFrequencies': [0.0219],
    'totalDepth': 137,
    'alleleDepths': [134, 3],
    'somaticQuality': 0,
    'vcfSampleInfo': {'F1R2': '77,3', 'F2R1': '57,0'}},
   {'genotype': '0/1',
    'variantFrequencies': [0.0196],
    'totalDepth': 562,
    'alleleDepths': [551, 11],
    'somaticQuality': 7,
    'vcfSampleInfo': {'F1R2': '286,6', 'F2R1': '265,5'}}],
  'variants': [{'vid': '21-5222289-C-T',
    'chromosome': 'chr21',
    'begin': 5222289,
    'end': 5222289,
    'refAllele': 'C',
    'altAllele': 'T',
    'variantType': 'SNV',
    'hgvsg': 'NC_000021.9:g.5222289C>T',
    'phylopScore': -0.2,
    'dbsnp': ['rs1366179382'],
    'gnomad': {'coverage': 3,
     'allAf': 0.000502,
     'allAn': 149274,
     'allAc': 

In [18]:
positions = parser.get_variants_above_gnomad_freq(frequency_key="allAf", frequency_threshold_low=0.1)
len(positions), positions[0].model_dump()

(302,
 {'chromosome': 'chr21',
  'position': 5227548,
  'refAllele': 'C',
  'altAlleles': ['T'],
  'filters': ['PASS'],
  'mappingQuality': 93.13,
  'cytogeneticBand': '21p12',
  'vcfInfo': {'DP': '314'},
  'samples': [{'genotype': '0|0',
    'variantFrequencies': [0.125],
    'totalDepth': 40,
    'alleleDepths': [35, 5],
    'somaticQuality': 0,
    'vcfSampleInfo': {'F1R2': '16,4', 'F2R1': '19,1'}},
   {'genotype': '0|1',
    'variantFrequencies': [0.3432],
    'totalDepth': 169,
    'alleleDepths': [111, 58],
    'somaticQuality': 28.1,
    'vcfSampleInfo': {'F1R2': '60,29', 'F2R1': '51,29'}}],
  'variants': [{'vid': '21-5227548-C-T',
    'chromosome': 'chr21',
    'begin': 5227548,
    'end': 5227548,
    'refAllele': 'C',
    'altAllele': 'T',
    'variantType': 'SNV',
    'hgvsg': 'NC_000021.9:g.5227548C>T',
    'phylopScore': 0.2,
    'inLowComplexityRegion': True,
    'dbsnp': ['rs1219135472'],
    'gnomad': {'coverage': 15,
     'allAf': 0.348842,
     'allAn': 65554,
     'a

In [19]:
positions = parser.get_variants_above_gnomad_freq(frequency_key="allAf", frequency_threshold_low=0.1, frequency_threshold_high=0.2)
len(positions), positions[0].model_dump()

(114,
 {'chromosome': 'chr21',
  'position': 5278225,
  'refAllele': 'A',
  'altAlleles': ['T'],
  'filters': ['alt_allele_in_normal',
   'filtered_reads',
   'mapping_quality',
   'non_homref_normal',
   'weak_evidence'],
  'mappingQuality': 12.91,
  'cytogeneticBand': '21p12',
  'vcfInfo': {'DP': '113'},
  'samples': [{'genotype': '0/0',
    'variantFrequencies': [0.25],
    'totalDepth': 12,
    'alleleDepths': [9, 3],
    'somaticQuality': 0.5,
    'vcfSampleInfo': {'F1R2': '5,2', 'F2R1': '4,1'}},
   {'genotype': '0/1',
    'variantFrequencies': [0.7931],
    'totalDepth': 29,
    'alleleDepths': [6, 23],
    'somaticQuality': 7.5,
    'vcfSampleInfo': {'F1R2': '4,9', 'F2R1': '2,14'}}],
  'variants': [{'vid': '21-5278225-A-T',
    'chromosome': 'chr21',
    'begin': 5278225,
    'end': 5278225,
    'refAllele': 'A',
    'altAllele': 'T',
    'variantType': 'SNV',
    'hgvsg': 'NC_000021.9:g.5278225A>T',
    'dbsnp': ['rs1171728286'],
    'gnomad': {'coverage': 7,
     'failedFilter

## Gene Filtering

In [21]:
annotated_data.genes

,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD,clingenDosageSensitivityMap,clingenGeneValidity,cosmic
0,AATBC,51526.0,284837,ENSG00000215458,,,,,
1,ABCC13,,,ENSG00000291052,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,
2,ABCC13,16022.0,150000,ENSG00000243064,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,
3,ABCG1,73.0,9619,ENSG00000160179,"[{'mimNumber': 603076, 'geneName': 'ATP-bindin...","{'pLi': 0.112, 'pRec': 0.888, 'pNull': 9.34e-0...",,,
4,ADARB1,226.0,104,ENSG00000197381,"[{'mimNumber': 601218, 'geneName': 'Adenosine ...","{'pLi': 0.803, 'pRec': 0.197, 'pNull': 8.32e-0...",,,
...,...,...,...,...,...,...,...,...,...
437,YBEY,1299.0,54059,ENSG00000182362,"[{'mimNumber': 617461, 'geneName': 'YBEY metal...","{'pLi': 1.34e-07, 'pRec': 0.0625, 'pNull': 0.9...",,,
438,YRDCP3,39921.0,100861429,ENSG00000230859,,,,,
439,ZBTB21,13083.0,49854,ENSG00000173276,"[{'mimNumber': 616485, 'geneName': 'Zinc finge...","{'pLi': 0.998, 'pRec': 0.00222, 'pNull': 5.45e...",,,
440,ZNF295-AS1,23130.0,150142,ENSG00000237232,,,,,


In [38]:
annotated_data.genes[annotated_data.genes["name"] == "ABCC13"]

,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD,clingenDosageSensitivityMap,clingenGeneValidity,cosmic
1,ABCC13,,,ENSG00000291052,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,
2,ABCC13,16022.0,150000.0,ENSG00000243064,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,


In [36]:
annotated_data.genes[annotated_data.genes["ensemblGeneId"].isin(["ENSG00000173276", "ENSG00000243064"])]

,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD,clingenDosageSensitivityMap,clingenGeneValidity,cosmic
2,ABCC13,16022.0,150000,ENSG00000243064,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,
439,ZBTB21,13083.0,49854,ENSG00000173276,"[{'mimNumber': 616485, 'geneName': 'Zinc finge...","{'pLi': 0.998, 'pRec': 0.00222, 'pNull': 5.45e...",,,


In [31]:
annotated_data.genes.dropna(subset=["cosmic"])

,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD,clingenDosageSensitivityMap,clingenGeneValidity,cosmic
198,ERG,3446.0,2078,ENSG00000157554,"[{'mimNumber': 165080, 'geneName': 'ETS transc...","{'pLi': 0.964, 'pRec': 0.0359, 'pNull': 5.03e-...",,,"{'roleInCancer': ['oncogene', 'fusion']}"
418,TMPRSS2,11876.0,7113,ENSG00000184012,"[{'mimNumber': 602060, 'geneName': 'Transmembr...","{'pLi': 2.4e-10, 'pRec': 0.88, 'pNull': 0.12, ...",,,{'roleInCancer': ['fusion']}
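
Columns such as `gnomAD` and `cosmic` hold nested dictionaries, so filtering on a nested value needs an extraction step first. The sketch below filters genes by the `pLi` score inside the `gnomAD` column. It uses a small hand-built DataFrame with two `pLi` values copied from the table above; the helper name `extract_pli` and the `0.9` cut-off are assumptions chosen for illustration.

```python
import pandas as pd

# Small stand-in for annotated_data.genes (only the columns needed here).
genes = pd.DataFrame([
    {"name": "ABCG1", "gnomAD": {"pLi": 0.112, "pRec": 0.888}},
    {"name": "ZBTB21", "gnomAD": {"pLi": 0.998, "pRec": 0.00222}},
    {"name": "AATBC"},  # no gnomAD annotation, the cell becomes NaN
])


def extract_pli(gnomad_entry) -> float:
    """Return the pLi score, or NaN when the gene has no gnomAD entry."""
    if isinstance(gnomad_entry, dict):
        return gnomad_entry.get("pLi", float("nan"))
    return float("nan")


pli = genes["gnomAD"].apply(extract_pli)

# Comparisons with NaN evaluate to False, so unannotated genes drop out.
high_pli_genes = genes[pli > 0.9]
print(high_pli_genes["name"].tolist())  # ['ZBTB21']
```

The same pattern applies to the `cosmic` column, for example by extracting `roleInCancer` and testing whether it contains `"oncogene"`.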
