# Parse Nirvana JSON output in python

Nirvana outputs a single JSON annotation file for a single input VCF file. The output file contains a single [JSON object](https://www.w3schools.com/js/js_json_objects.asp) to represent the annotations of all input VCF variants. The JSON file format can be found in the [documentation](https://illumina.github.io/IlluminaConnectedAnnotationsDocumentation/file-formats/illumina-annotator-json-file-format).


This notebook demonstrates how you can parse the JSON file in python and perform filtering operations.

## Read Nirvana JSON output by lines

Even though the Nirvana JSON output is a single JSON object, different JSON object fields are written in different lines for memory efficient reading.

The first line in the JSON file is the `header` line:

```json
{"header":{"annotator":"Nirvana
...
,"positions":[
```
Followed by the `positions` lines:

```json
{"chromosome":"chr21","position":9975027,"refAllele":"C","altAlleles":["G"],"quality":102.47,"filters":["PASS"],"fisherStrandBias":0.727,"mappingQuality":43.11,"cytogeneticBand":"21p11.2","samples":,
...
```

After the `positions` lines, there are the `genes` lines, which are optional if there is no overlapping gene of the input VCF variants (3 lines for 2 genes in the example):

```json
],"genes":[
{"name":"ABCC13","omim":[{"mimNumber":608835,"geneName":"ATP-binding cassette, subfamily C, member 13","description":"ABCC13 belongs to a large family of ATP-binding cassette (ABC) transporters that play important roles as membrane transporters or ion channel modulators. However, ABCC13 is a truncated protein that lacks critical ATP-binding motifs and is unlikely to be a functional transporter (Yabuuchi et al., 2002)."}]},
```

Finally, the last line of the JSON file are two brackets to complete the JSON object structure:

```json
]}
```



## Install requirements

In [1]:
# !pip3 install pydantic ijson pandas


## Pydantic objects
Here we use pydantic objects to deserialize the json.
Pandas is also used to display the data neatly.

### Note: Memory efficiency
In these examples, ijson is used to minimize the footprint. Python generators are used where possible to minimize memory footprint. This also means that the file is scanned each time which might be time consuming for very large file.

In [2]:
import gzip
from typing import Any, Dict, Generator, List, Optional

import ijson
import pandas as pd
import pydantic

pd.set_option('display.max_columns', None)


class BaseClass(pydantic.BaseModel):
    model_config = pydantic.ConfigDict(extra='allow')

    def get_top_level(self) -> pd.DataFrame:
        return pd.json_normalize(self.get_top_level_dict())

    def get_top_level_dict(self) -> Dict[str, Any]:
        raise NotImplementedError

    def to_df(self, key: str = "") -> pd.DataFrame:
        if not key:
            return pd.json_normalize(self.model_dump())

        values = self.model_dump().get(key)

        if isinstance(values, list):
            merged = [self.get_top_level_dict() | value for value in values]
        else:
            merged = [self.get_top_level_dict() | {key: values}]

        return pd.json_normalize(merged)


class Transcript(BaseClass):
    transcript: str
    source: str
    bioType: Optional[str] = None
    geneId: Optional[str] = None
    hgnc: Optional[str] = None
    consequence: Optional[List[str]] = None
    impact: Optional[str] = None
    isCanonical: Optional[bool] = None

    def get_top_level_dict(self) -> Dict[str, Any]:
        return dict(zip(("transcript", "isCanonical"), (self.transcript, self.isCanonical)))


class Variant(BaseClass):
    vid: str
    chromosome: str
    begin: int
    end: int
    refAllele: str
    altAllele: str
    variantType: Optional[str] = None
    hgvsg: Optional[str] = None
    phylopScore: Optional[float] = None
    phyloPPrimateScore: Optional[float] = None
    transcripts: Optional[List[Transcript]] = None

    def get_top_level_dict(self) -> Dict[str, Any]:
        return dict(
            zip(
                ("chromosome", "begin", "end", "refAllele", "altAllele", "hgvsg"),
                (self.chromosome, self.begin, self.end, self.refAllele, self.altAllele, self.hgvsg),
            )
        )


class Position(BaseClass):
    chromosome: str
    position: int
    refAllele: str
    altAlleles: List[str]
    filters: List[str]
    mappingQuality: float
    cytogeneticBand: str
    vcfInfo: Dict[str, Any]
    samples: List[Dict[str, Any]]
    variants: List[Variant]

    def get_top_level_dict(self) -> Dict[str, Any]:
        return dict(
            zip(
                (
                    "chromosome",
                    "position",
                    "refAllele",
                    "altAlleles",
                    "filters",
                    "mappingQuality",
                    "cyatogeneticBand",
                    "vcfInfo",
                ),
                (
                    self.chromosome,
                    self.position,
                    self.refAllele,
                    self.altAlleles,
                    self.filters,
                    self.mappingQuality,
                    self.cytogeneticBand,
                    self.vcfInfo,
                ),
            )
        )


class AnnotatedData:
    def __init__(self, filename: str):
        self._filename = filename

        for key in ("annotator", "genomeAssembly", "creationTime"):
            print(f"{key}: {self.header[key]}")

    @property
    def header(self) -> Dict[str, Any]:
        with gzip.open(self._filename, 'r') as f:
            return next(ijson.items(f, "header"))

    @property
    def data_sources(self) -> pd.DataFrame:
        return pd.DataFrame(self.header["dataSources"]).set_index("name").sort_index()

    @property
    def genes(self) -> pd.DataFrame:
        with gzip.open(self._filename, 'r') as f:
            return pd.json_normalize(ijson.items(f, "genes.item"))

    @property
    def positions(self) -> Any:
        f = gzip.open(self._filename, 'r')
        return ijson.items(f, "positions.item")

    def get_annotation(self, chromosome: str, position: int) -> Dict[str, Any]:
        return next(
            (
                position_item
                for position_item in self.positions
                if chromosome == position_item.get("chromosome") and position == position_item.get("position")
            ),
            {},
        )

    def get_annotation_range(self, chromosome: str, position: int, end: int) -> Generator[Any, Any, None]:
        return (
            position_item
            for position_item in self.positions
            if chromosome == position_item.get("chromosome") and position <= position_item.get("position") <= end
        )

    @staticmethod
    def multiple_to_df(items: List[BaseClass], key: str = "") -> pd.DataFrame:
        return pd.concat((item.to_df(key) for item in items))


## Filename

In [3]:
filename = ""

## Header

In [4]:
annotated_data = AnnotatedData(filename=filename)

annotator: Illumina Connected Annotations 3.22.0
genomeAssembly: GRCh38
creationTime: 2023-12-07 14:15:54


In [5]:
annotated_data.data_sources

Unnamed: 0_level_0,version,description,releaseDate
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1000 Genomes Project,Phase 3 v3plus,A public catalogue of human variation and geno...,2013-05-27
1000 Genomes Project (SV),Phase 3 v5a,A public catalogue of human variation and geno...,2013-05-27
COSMIC,96,resource for exploring the impact of somatic m...,2022-05-31
COSMIC gene fusions,96,manually curated somatic gene fusions,2023-11-07
CancerHotspots,2017,A resouce for statistically significant mutati...,2017-01-01
ClinGen,20160414,,2016-04-14
ClinGen Dosage Sensitivity Map,20231105,Dosage sensitivity map from ClinGen (dbVar),2023-11-05
ClinGen disease validity curations,20231105,Disease validity curations from ClinGen (dbVar),2023-11-05
ClinVar,20231028,"A freely accessible, public archive of reports...",2023-11-05
Cosmic Cancer Gene Census,97,Cosmic Cancer Gene Census catalogs genes with ...,2022-11-29


## Genes

In [6]:
annotated_data.genes

Unnamed: 0,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD.pLi,gnomAD.pRec,gnomAD.pNull,gnomAD.synZ,gnomAD.misZ,gnomAD.loeuf,clingenDosageSensitivityMap.haploinsufficiency,clingenDosageSensitivityMap.triplosensitivity,clingenGeneValidity,cosmic.roleInCancer
0,AATBC,51526.0,284837,ENSG00000215458,,,,,,,,,,,
1,ABCC13,,,ENSG00000291052,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,,,,,,,
2,ABCC13,16022.0,150000,ENSG00000243064,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,,,,,,,
3,ABCG1,73.0,9619,ENSG00000160179,"[{'mimNumber': 603076, 'geneName': 'ATP-bindin...",0.112,0.888,0.00000934,0.423,2.14,0.461,,,,
4,ADARB1,226.0,104,ENSG00000197381,"[{'mimNumber': 601218, 'geneName': 'Adenosine ...",0.803,0.197,0.00000832,0.603,3.52,0.399,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
437,YBEY,1299.0,54059,ENSG00000182362,"[{'mimNumber': 617461, 'geneName': 'YBEY metal...",1.34E-7,0.0625,0.937,1.13,0.0149,1.92,,,,
438,YRDCP3,39921.0,100861429,ENSG00000230859,,,,,,,,,,,
439,ZBTB21,13083.0,49854,ENSG00000173276,"[{'mimNumber': 616485, 'geneName': 'Zinc finge...",0.998,0.00222,5.45E-10,-1.59,0.512,0.249,,,,
440,ZNF295-AS1,23130.0,150142,ENSG00000237232,,,,,,,,,,,


## Positions

In [7]:
position = Position.model_validate(annotated_data.get_annotation("chr21", 5228221))
position.to_df()

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cytogeneticBand,samples,variants,vcfInfo.DP
0,chr21,5228221,G,[T],[PASS],75.75,21p12,"[{'genotype': '0/0', 'variantFrequencies': [0....","[{'vid': '21-5228221-G-T', 'chromosome': 'chr2...",1004


In [8]:
position.to_df(key="variants")

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,vid,begin,end,altAllele,variantType,hgvsg,phylopScore,phyloPPrimateScore,transcripts,vcfInfo.DP
0,chr21,5228221,G,[T],[PASS],75.75,21p12,21-5228221-G-T,5228221,5228221,T,SNV,NC_000021.9:g.5228221G>T,0.2,0.0,"[{'transcript': 'ENST00000623753.1', 'source':...",1004


In [9]:
position.variants[0].to_df("transcripts")

Unnamed: 0,chromosome,begin,end,refAllele,altAllele,hgvsg,transcript,source,bioType,geneId,hgnc,consequence,impact,isCanonical
0,chr21,5228221,5228221,G,T,NC_000021.9:g.5228221G>T,ENST00000623753.1,Ensembl,lncRNA,ENSG00000279669,ENSG00000279669,[downstream_gene_variant],modifier,True


In [10]:
AnnotatedData.multiple_to_df(position.variants, key="transcripts")

Unnamed: 0,chromosome,begin,end,refAllele,altAllele,hgvsg,transcript,source,bioType,geneId,hgnc,consequence,impact,isCanonical
0,chr21,5228221,5228221,G,T,NC_000021.9:g.5228221G>T,ENST00000623753.1,Ensembl,lncRNA,ENSG00000279669,ENSG00000279669,[downstream_gene_variant],modifier,True


In [11]:
position = Position.model_validate(annotated_data.get_annotation("chr21", 5222289))
position.get_top_level()

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,vcfInfo.DP
0,chr21,5222289,C,[T],"[mapping_quality, weak_evidence]",49.7,21p12,749


In [12]:
position.to_df()

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cytogeneticBand,samples,variants,vcfInfo.DP
0,chr21,5222289,C,[T],"[mapping_quality, weak_evidence]",49.7,21p12,"[{'genotype': '0/0', 'variantFrequencies': [0....","[{'vid': '21-5222289-C-T', 'chromosome': 'chr2...",749


In [13]:
position.to_df("cytogeneticBand")

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,cytogeneticBand,vcfInfo.DP
0,chr21,5222289,C,[T],"[mapping_quality, weak_evidence]",49.7,21p12,21p12,749


In [14]:
position.to_df("variants")

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,vid,begin,end,altAllele,variantType,hgvsg,phylopScore,phyloPPrimateScore,transcripts,dbsnp,vcfInfo.DP,gnomad.coverage,gnomad.allAf,gnomad.allAn,gnomad.allAc,gnomad.allHc,gnomad.afrAf,gnomad.afrAn,gnomad.afrAc,gnomad.afrHc,gnomad.amrAf,gnomad.amrAn,gnomad.amrAc,gnomad.amrHc,gnomad.easAf,gnomad.easAn,gnomad.easAc,gnomad.easHc,gnomad.finAf,gnomad.finAn,gnomad.finAc,gnomad.finHc,gnomad.nfeAf,gnomad.nfeAn,gnomad.nfeAc,gnomad.nfeHc,gnomad.asjAf,gnomad.asjAn,gnomad.asjAc,gnomad.asjHc,gnomad.sasAf,gnomad.sasAn,gnomad.sasAc,gnomad.sasHc,gnomad.othAf,gnomad.othAn,gnomad.othAc,gnomad.othHc,gnomad.maleAf,gnomad.maleAn,gnomad.maleAc,gnomad.maleHc,gnomad.femaleAf,gnomad.femaleAn,gnomad.femaleAc,gnomad.femaleHc,gnomad.controlsAllAf,gnomad.controlsAllAn,gnomad.controlsAllAc,topmed.allAf,topmed.allAn,topmed.allAc,topmed.allHc,topmed.failedFilter
0,chr21,5222289,C,[T],"[mapping_quality, weak_evidence]",49.7,21p12,21-5222289-C-T,5222289,5222289,T,SNV,NC_000021.9:g.5222289C>T,-0.2,0.074,,[rs1366179382],749,3,0.000502,149274,75,0,0.001704,40502,69,0,6.7e-05,14966,1,0,0,5022,0,0,0,10430,0,0,7.5e-05,66950,5,0,0,3438,0,0,0,4734,0,0,0,2024,0,0,0.000411,73006,30,0,0.00059,76268,45,0,0.000565,31850,18,0.001091,125568,137,0,True


# Parsing and Filtering Positions

In [15]:
class Parser:
    def __init__(self, annotated_data: AnnotatedData):
        self.annotated_data = annotated_data

    def get_variants_above_gnomad_freq(
        self,
        frequency_key: str,
        frequency_threshold_low: float = float("-inf"),
        frequency_threshold_high: float = float("inf"),
    ) -> Generator[Any, Any, None]:
        positions = (
            Position.model_validate(position)
            for position in self.annotated_data.positions
            for variant in position.get("variants", {})
            if (freq := variant.get("gnomad", {}).get(frequency_key, None))
            and frequency_threshold_low < freq < frequency_threshold_high
        )
        return positions

    def get_positions_with_cannonical_transcripts(self) -> Generator[Any, Any, None]:
        positions = (
            Position.model_validate(position)
            for position in self.annotated_data.positions
            for variant in position.get("variants", {})
            for transcript in variant.get("transcripts", [])
            if transcript.get("isCanonical")
        )

        return positions

    def filter_transcripts_by_consequence(
        self, include: Optional[List[str]] = None, exclude: Optional[List[str]] = None
    ) -> Generator[Any, Any, None]:
        if not exclude:
            exclude = []

        if not include:
            include = []

        positions = (
            Position.model_validate(position)
            for position in self.annotated_data.positions
            for variant in position.get("variants", {})
            for transcript in variant.get("transcripts", [])
            for consequence in transcript.get("consequence", [])
            if (not bool(include) or consequence in include) and consequence not in exclude
        )
        return positions

## Positions with Cannonical Transcripts only

In [16]:
parser = Parser(annotated_data)

In [17]:
positions = list(parser.get_positions_with_cannonical_transcripts())
len(positions)

2472

In [18]:
# AnnotatedData.multiple_to_df(positions, "variants")

In [19]:
positions[0].to_df("variants")

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,vid,begin,end,altAllele,variantType,hgvsg,phylopScore,phyloPPrimateScore,transcripts,vcfInfo.DP
0,chr21,5228221,G,[T],[PASS],75.75,21p12,21-5228221-G-T,5228221,5228221,T,SNV,NC_000021.9:g.5228221G>T,0.2,0.0,"[{'transcript': 'ENST00000623753.1', 'source':...",1004


In [20]:
positions[0].variants[0].to_df("transcripts")

Unnamed: 0,chromosome,begin,end,refAllele,altAllele,hgvsg,transcript,source,bioType,geneId,hgnc,consequence,impact,isCanonical
0,chr21,5228221,5228221,G,T,NC_000021.9:g.5228221G>T,ENST00000623753.1,Ensembl,lncRNA,ENSG00000279669,ENSG00000279669,[downstream_gene_variant],modifier,True


## Filter by Consequence

In [21]:
parser = Parser(annotated_data)

In [22]:
positions = list(parser.filter_transcripts_by_consequence(
    include=["non_coding_transcript_exon_variant"]
))
len(positions)

147

In [23]:
AnnotatedData.multiple_to_df(positions, "variants")

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,vid,begin,end,altAllele,variantType,hgvsg,phylopScore,phyloPPrimateScore,transcripts,vcfInfo.DP,regulatoryRegions,gnomad.coverage,gnomad.failedFilter,gnomad.allAf,gnomad.allAn,gnomad.allAc,gnomad.allHc,gnomad.afrAf,gnomad.afrAn,gnomad.afrAc,gnomad.afrHc,gnomad.amrAf,gnomad.amrAn,gnomad.amrAc,gnomad.amrHc,gnomad.easAf,gnomad.easAn,gnomad.easAc,gnomad.easHc,gnomad.finAf,gnomad.finAn,gnomad.finAc,gnomad.finHc,gnomad.nfeAf,gnomad.nfeAn,gnomad.nfeAc,gnomad.nfeHc,gnomad.asjAf,gnomad.asjAn,gnomad.asjAc,gnomad.asjHc,gnomad.sasAf,gnomad.sasAn,gnomad.sasAc,gnomad.sasHc,gnomad.othAf,gnomad.othAn,gnomad.othAc,gnomad.othHc,gnomad.maleAf,gnomad.maleAn,gnomad.maleAc,gnomad.maleHc,gnomad.femaleAf,gnomad.femaleAn,gnomad.femaleAc,gnomad.femaleHc,gnomad.controlsAllAf,gnomad.controlsAllAn,gnomad.controlsAllAc,dbsnp,topmed.allAf,topmed.allAn,topmed.allAc,topmed.allHc,topmed.failedFilter,dannScore,gerpScore,inLowComplexityRegion,oneKg.allAf,oneKg.afrAf,oneKg.amrAf,oneKg.easAf,oneKg.eurAf,oneKg.sasAf,oneKg.allAn,oneKg.afrAn,oneKg.amrAn,oneKg.easAn,oneKg.eurAn,oneKg.sasAn,oneKg.allAc,oneKg.afrAc,oneKg.amrAc,oneKg.easAc,oneKg.eurAc,oneKg.sasAc,primateAI-3D,revel.score,cosmic,spliceAI,clinvar
0,chr21,5232869,T,[G],"[base_quality, weak_evidence]",58.10,21p12,21-5232869-T-G,5232869,5232869,G,SNV,NC_000021.9:g.5232869T>G,0.5,,"[{'transcript': 'ENST00000623753.1', 'source':...",799,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,5243812,C,[T],[PASS],108.19,21p12,21-5243812-C-T,5243812,5243812,T,SNV,NC_000021.9:g.5243812C>T,0.4,0.074,"[{'transcript': 'ENST00000623753.1', 'source':...",514,"[{'id': 'ENSR00000140073', 'type': 'TF_binding...",0.0,True,0,152296.0,0.0,0.0,0,41480.0,0.0,0.0,0,15294.0,0.0,0.0,0,5206.0,0.0,0.0,0,10628.0,0.0,0.0,0,68058.0,0.0,0.0,0,3472.0,0.0,0.0,0,4836.0,0.0,0.0,0,2094.0,0.0,0.0,0,74406.0,0.0,0.0,0,77890.0,0.0,0.0,0,32922.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,8209019,-,[GCGC],"[filtered_reads, mapping_quality, no_reliable_...",9.94,21p11.2,21-8209019-G-GCGC,8209020,8209019,CGC,insertion,NC_000021.9:g.8209022_8209024dup,,,"[{'transcript': 'ENST00000623664.1', 'source':...",1317,,1.0,,0.000379,145094.0,55.0,0.0,0.000049,40532.0,2.0,0.0,0.000205,14652.0,3.0,0.0,0,5082.0,0.0,0.0,0.000367,8170.0,3.0,0.0,0.000582,65286.0,38.0,0.0,0.002097,3338.0,7.0,0.0,0,4814.0,0.0,0.0,0,2000.0,0.0,0.0,0.000311,70634.0,22.0,0.0,0.000443,74460.0,33.0,0.0,0.000297,30260.0,9.0,[rs1555877791],0.00039,125568.0,49.0,0.0,True,,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,8213090,T,[G],"[base_quality, filtered_reads, mapping_quality...",21.85,21p11.2,21-8213090-T-G,8213090,8213090,G,SNV,NC_000021.9:g.8213090T>G,-1.8,,"[{'transcript': 'ENST00000623664.1', 'source':...",1841,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,8213222,A,[C],"[filtered_reads, mapping_quality, no_reliable_...",18.39,21p11.2,21-8213222-A-C,8213222,8213222,C,SNV,NC_000021.9:g.8213222A>C,0.2,,"[{'transcript': 'ENST00000623664.1', 'source':...",1173,,178.0,,0.154747,118316.0,18309.0,4.0,0.076287,34816.0,2656.0,1.0,0.117568,11942.0,1404.0,0.0,0.003146,5086.0,16.0,0.0,0.164856,7734.0,1275.0,1.0,0.228802,49816.0,11398.0,2.0,0.276284,2454.0,678.0,0.0,0.098278,3948.0,388.0,0.0,0.182048,1582.0,288.0,0.0,0.148306,57840.0,8578.0,1.0,0.160907,60476.0,9731.0,3.0,0.126071,25922.0,3268.0,[rs1460148663],0.210945,125568.0,26488.0,1.0,True,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,chr21,46193167,T,[C],"[base_quality, mapping_quality, non_homref_nor...",56.43,21q22.3,21-46193167-T-C,46193167,46193167,C,SNV,NC_000021.9:g.46193167T>C,-2.0,-3.537,"[{'transcript': 'ENST00000415026.1', 'source':...",130,,1.0,,0.002211,94994.0,210.0,0.0,0.003126,24314.0,76.0,0.0,0.001071,9340.0,10.0,0.0,0.005271,2846.0,15.0,0.0,0.001511,5958.0,9.0,0.0,0.001844,45550.0,84.0,0.0,0,2332.0,0.0,0.0,0.002682,2610.0,7.0,0.0,0.00219,1370.0,3.0,0.0,0.002105,46556.0,98.0,0.0,0.002312,48438.0,112.0,0.0,0.003282,17368.0,57.0,[rs112603162],0.123869,125568.0,15554.0,0.0,True,0.32,,,,,,,,,,,,,,,,,,,,,,0.007,"[{'id': 'COSV62702051', 'numSamples': 2, 'refA...",,
0,chr21,46286556,T,[G],"[base_quality, weak_evidence]",213.21,21q22.3,21-46286556-T-G,46286556,46286556,G,SNV,NC_000021.9:g.46286556T>G,-0.2,-1.756,"[{'transcript': 'ENST00000291688.6', 'source':...",93,"[{'id': 'ENSR00000143372', 'type': 'promoter',...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.32,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,46325257,A,[C],"[base_quality, non_homref_normal, weak_evidence]",221.46,21q22.3,21-46325257-A-C,46325257,46325257,C,SNV,NC_000021.9:g.46325257A>C,-0.5,-2.122,"[{'transcript': 'ENST00000397683.5', 'source':...",144,"[{'id': 'ENSR00000143378', 'type': 'promoter',...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.54,-0.51,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,46436858,T,[G],"[base_quality, weak_evidence]",216.57,21q22.3,21-46436858-T-G,46436858,46436858,G,SNV,NC_000021.9:g.46436858T>G,0.1,-1.329,"[{'transcript': 'ENST00000703224.1', 'source':...",149,"[{'id': 'ENSR00000665397', 'type': 'enhancer',...",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.67,-0.837,,,,,,,,,,,,,,,,,,,,,,,"[{'hgnc': 'PCNT', 'donorGainScore': 0.6, 'dono...",


In [24]:
positions = list(parser.filter_transcripts_by_consequence(
    exclude=["downstream_gene_variant", "upstream_gene_variant"]
))
len(positions)

11723

In [25]:
AnnotatedData.multiple_to_df(positions, "variants")

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,vid,begin,end,altAllele,variantType,hgvsg,phylopScore,phyloPPrimateScore,transcripts,vcfInfo.DP,regulatoryRegions,gnomad.coverage,gnomad.failedFilter,gnomad.allAf,gnomad.allAn,gnomad.allAc,gnomad.allHc,gnomad.afrAf,gnomad.afrAn,gnomad.afrAc,gnomad.afrHc,gnomad.amrAf,gnomad.amrAn,gnomad.amrAc,gnomad.amrHc,gnomad.easAf,gnomad.easAn,gnomad.easAc,gnomad.easHc,gnomad.finAf,gnomad.finAn,gnomad.finAc,gnomad.finHc,gnomad.nfeAf,gnomad.nfeAn,gnomad.nfeAc,gnomad.nfeHc,gnomad.asjAf,gnomad.asjAn,gnomad.asjAc,gnomad.asjHc,gnomad.sasAf,gnomad.sasAn,gnomad.sasAc,gnomad.sasHc,gnomad.othAf,gnomad.othAn,gnomad.othAc,gnomad.othHc,gnomad.maleAf,gnomad.maleAn,gnomad.maleAc,gnomad.maleHc,gnomad.femaleAf,gnomad.femaleAn,gnomad.femaleAc,gnomad.femaleHc,gnomad.controlsAllAf,gnomad.controlsAllAn,gnomad.controlsAllAc,dbsnp,topmed.allAf,topmed.allAn,topmed.allAc,topmed.allHc,topmed.failedFilter,inLowComplexityRegion,dannScore,cosmic,oneKg.allAf,oneKg.afrAf,oneKg.amrAf,oneKg.easAf,oneKg.eurAf,oneKg.sasAf,oneKg.allAn,oneKg.afrAn,oneKg.amrAn,oneKg.easAn,oneKg.eurAn,oneKg.sasAn,oneKg.allAc,oneKg.afrAc,oneKg.amrAc,oneKg.easAc,oneKg.eurAc,oneKg.sasAc,gerpScore,primateAI-3D,revel.score,clinvar,spliceAI
0,chr21,5232869,T,[G],"[base_quality, weak_evidence]",58.10,21p12,21-5232869-T-G,5232869,5232869,G,SNV,NC_000021.9:g.5232869T>G,0.5,,"[{'transcript': 'ENST00000623753.1', 'source':...",799,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,5243664,C,[T],"[mapping_quality, no_reliable_supporting_read,...",58.13,21p12,21-5243664-C-T,5243664,5243664,T,SNV,NC_000021.9:g.5243664C>T,0.6,0.074,"[{'transcript': 'ENST00000623753.1', 'source':...",644,"[{'id': 'ENSR00000140073', 'type': 'TF_binding...",0.0,True,0,152310.0,0.0,0.0,0,41488.0,0.0,0.0,0,15294.0,0.0,0.0,0,5208.0,0.0,0.0,0,10632.0,0.0,0.0,0,68056.0,0.0,0.0,0,3472.0,0.0,0.0,0,4838.0,0.0,0.0,0,2094.0,0.0,0.0,0,74416.0,0.0,0.0,0,77894.0,0.0,0.0,0,32928.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,5243664,C,[T],"[mapping_quality, no_reliable_supporting_read,...",58.13,21p12,21-5243664-C-T,5243664,5243664,T,SNV,NC_000021.9:g.5243664C>T,0.6,0.074,"[{'transcript': 'ENST00000623753.1', 'source':...",644,"[{'id': 'ENSR00000140073', 'type': 'TF_binding...",0.0,True,0,152310.0,0.0,0.0,0,41488.0,0.0,0.0,0,15294.0,0.0,0.0,0,5208.0,0.0,0.0,0,10632.0,0.0,0.0,0,68056.0,0.0,0.0,0,3472.0,0.0,0.0,0,4838.0,0.0,0.0,0,2094.0,0.0,0.0,0,74416.0,0.0,0.0,0,77894.0,0.0,0.0,0,32928.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,5243812,C,[T],[PASS],108.19,21p12,21-5243812-C-T,5243812,5243812,T,SNV,NC_000021.9:g.5243812C>T,0.4,0.074,"[{'transcript': 'ENST00000623753.1', 'source':...",514,"[{'id': 'ENSR00000140073', 'type': 'TF_binding...",0.0,True,0,152296.0,0.0,0.0,0,41480.0,0.0,0.0,0,15294.0,0.0,0.0,0,5206.0,0.0,0.0,0,10628.0,0.0,0.0,0,68058.0,0.0,0.0,0,3472.0,0.0,0.0,0,4836.0,0.0,0.0,0,2094.0,0.0,0.0,0,74406.0,0.0,0.0,0,77890.0,0.0,0.0,0,32922.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,8209019,-,[GCGC],"[filtered_reads, mapping_quality, no_reliable_...",9.94,21p11.2,21-8209019-G-GCGC,8209020,8209019,CGC,insertion,NC_000021.9:g.8209022_8209024dup,,,"[{'transcript': 'ENST00000623664.1', 'source':...",1317,,1.0,,0.000379,145094.0,55.0,0.0,0.000049,40532.0,2.0,0.0,0.000205,14652.0,3.0,0.0,0,5082.0,0.0,0.0,0.000367,8170.0,3.0,0.0,0.000582,65286.0,38.0,0.0,0.002097,3338.0,7.0,0.0,0,4814.0,0.0,0.0,0,2000.0,0.0,0.0,0.000311,70634.0,22.0,0.0,0.000443,74460.0,33.0,0.0,0.000297,30260.0,9.0,[rs1555877791],0.00039,125568.0,49.0,0.0,True,,,,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,chr21,46643510,T,[A],"[base_quality, weak_evidence]",211.58,21q22.3,21-46643510-T-A,46643510,46643510,A,SNV,NC_000021.9:g.46643510T>A,0.0,-0.603,"[{'transcript': 'ENST00000451211.6', 'source':...",150,"[{'id': 'ENSR00001057246', 'type': 'CTCF_bindi...",0.0,True,0.000021,141118.0,3.0,0.0,0.000051,39060.0,2.0,0.0,0,14018.0,0.0,0.0,0.000232,4306.0,1.0,0.0,0,7762.0,0.0,0.0,0,65358.0,0.0,0.0,0,3364.0,0.0,0.0,0,4110.0,0.0,0.0,0,1968.0,0.0,0.0,0,68068.0,0.0,0.0,0.000041,73050.0,3.0,0.0,0.000108,27846.0,3.0,[rs1601921720],,,,,,,0.58,,,,,,,,,,,,,,,,,,,,0.622,,,,
0,chr21,46643510,T,[A],"[base_quality, weak_evidence]",211.58,21q22.3,21-46643510-T-A,46643510,46643510,A,SNV,NC_000021.9:g.46643510T>A,0.0,-0.603,"[{'transcript': 'ENST00000451211.6', 'source':...",150,"[{'id': 'ENSR00001057246', 'type': 'CTCF_bindi...",0.0,True,0.000021,141118.0,3.0,0.0,0.000051,39060.0,2.0,0.0,0,14018.0,0.0,0.0,0.000232,4306.0,1.0,0.0,0,7762.0,0.0,0.0,0,65358.0,0.0,0.0,0,3364.0,0.0,0.0,0,4110.0,0.0,0.0,0,1968.0,0.0,0.0,0,68068.0,0.0,0.0,0.000041,73050.0,3.0,0.0,0.000108,27846.0,3.0,[rs1601921720],,,,,,,0.58,,,,,,,,,,,,,,,,,,,,0.622,,,,
0,chr21,46643510,T,[A],"[base_quality, weak_evidence]",211.58,21q22.3,21-46643510-T-A,46643510,46643510,A,SNV,NC_000021.9:g.46643510T>A,0.0,-0.603,"[{'transcript': 'ENST00000451211.6', 'source':...",150,"[{'id': 'ENSR00001057246', 'type': 'CTCF_bindi...",0.0,True,0.000021,141118.0,3.0,0.0,0.000051,39060.0,2.0,0.0,0,14018.0,0.0,0.0,0.000232,4306.0,1.0,0.0,0,7762.0,0.0,0.0,0,65358.0,0.0,0.0,0,3364.0,0.0,0.0,0,4110.0,0.0,0.0,0,1968.0,0.0,0.0,0,68068.0,0.0,0.0,0.000041,73050.0,3.0,0.0,0.000108,27846.0,3.0,[rs1601921720],,,,,,,0.58,,,,,,,,,,,,,,,,,,,,0.622,,,,
0,chr21,46643510,T,[A],"[base_quality, weak_evidence]",211.58,21q22.3,21-46643510-T-A,46643510,46643510,A,SNV,NC_000021.9:g.46643510T>A,0.0,-0.603,"[{'transcript': 'ENST00000451211.6', 'source':...",150,"[{'id': 'ENSR00001057246', 'type': 'CTCF_bindi...",0.0,True,0.000021,141118.0,3.0,0.0,0.000051,39060.0,2.0,0.0,0,14018.0,0.0,0.0,0.000232,4306.0,1.0,0.0,0,7762.0,0.0,0.0,0,65358.0,0.0,0.0,0,3364.0,0.0,0.0,0,4110.0,0.0,0.0,0,1968.0,0.0,0.0,0,68068.0,0.0,0.0,0.000041,73050.0,3.0,0.0,0.000108,27846.0,3.0,[rs1601921720],,,,,,,0.58,,,,,,,,,,,,,,,,,,,,0.622,,,,


## Filter by gnomad frequency
Possible values

'coverage', 'failedFilter', 'allAf', 'allAn', 'allAc', 'allHc', 'afrAf', 'afrAn', 'afrAc', 'afrHc', 'amrAf', 'amrAn', 'amrAc', 'amrHc', 'easAf', 'easAn', 'easAc', 'easHc', 'finAf', 'finAn', 'finAc', 'finHc', 'nfeAf', 'nfeAn', 'nfeAc', 'nfeHc', 'asjAf', 'asjAn', 'asjAc', 'asjHc', 'sasAf', 'sasAn', 'sasAc', 'sasHc', 'othAf', 'othAn', 'othAc', 'othHc', 'maleAf', 'maleAn', 'maleAc', 'maleHc', 'femaleAf', 'femaleAn', 'femaleAc', 'femaleHc', 'controlsAllAf', 'controlsAllAn', 'controlsAllAc'

In [26]:
parser = Parser(annotated_data)

In [27]:
positions = list(parser.get_variants_above_gnomad_freq(
    frequency_key="allAf", 
    frequency_threshold_high=0.1
))
len(positions)

851

In [28]:
positions = list(parser.get_variants_above_gnomad_freq(
    frequency_key="allAf",
    frequency_threshold_low=0.1
))
len(positions)

302

In [29]:
positions = list(parser.get_variants_above_gnomad_freq(
    frequency_key="allAf", 
    frequency_threshold_low=0.1,
    frequency_threshold_high=0.2
))
len(positions)

114

### Further filtering

In [30]:
df = AnnotatedData.multiple_to_df(positions, "variants")
df

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,vid,begin,end,altAllele,variantType,hgvsg,phylopScore,phyloPPrimateScore,transcripts,dbsnp,vcfInfo.DP,gnomad.coverage,gnomad.failedFilter,gnomad.allAf,gnomad.allAn,gnomad.allAc,gnomad.allHc,gnomad.afrAf,gnomad.afrAn,gnomad.afrAc,gnomad.afrHc,gnomad.amrAf,gnomad.amrAn,gnomad.amrAc,gnomad.amrHc,gnomad.easAf,gnomad.easAn,gnomad.easAc,gnomad.easHc,gnomad.finAf,gnomad.finAn,gnomad.finAc,gnomad.finHc,gnomad.nfeAf,gnomad.nfeAn,gnomad.nfeAc,gnomad.nfeHc,gnomad.asjAf,gnomad.asjAn,gnomad.asjAc,gnomad.asjHc,gnomad.sasAf,gnomad.sasAn,gnomad.sasAc,gnomad.sasHc,gnomad.othAf,gnomad.othAn,gnomad.othAc,gnomad.othHc,gnomad.maleAf,gnomad.maleAn,gnomad.maleAc,gnomad.maleHc,gnomad.femaleAf,gnomad.femaleAn,gnomad.femaleAc,gnomad.femaleHc,gnomad.controlsAllAf,gnomad.controlsAllAn,gnomad.controlsAllAc,topmed.allAf,topmed.allAn,topmed.allAc,topmed.allHc,topmed.failedFilter,dannScore,oneKg.allAf,oneKg.afrAf,oneKg.amrAf,oneKg.easAf,oneKg.eurAf,oneKg.sasAf,oneKg.allAn,oneKg.afrAn,oneKg.amrAn,oneKg.easAn,oneKg.eurAn,oneKg.sasAn,oneKg.allAc,oneKg.afrAc,oneKg.amrAc,oneKg.easAc,oneKg.eurAc,oneKg.sasAc,cosmic,inLowComplexityRegion,gerpScore,regulatoryRegions
0,chr21,5278225,A,[T],"[alt_allele_in_normal, filtered_reads, mapping...",12.91,21p12,21-5278225-A-T,5278225,5278225,T,SNV,NC_000021.9:g.5278225A>T,,,,[rs1171728286],113,7,True,0.148668,46856,6966,80,0.181794,26244,4771,46,0.110039,2590,285,5,0.000608,1644,1,0,0.050104,958,48,0,0.122751,12448,1528,26,0.225636,944,213,2,0.022059,1224,27,0,0.116838,582,68,0,0.145535,22084,3214,34,0.151461,24772,3752,46,0.128554,9848,1266,0.084209,125568.0,10574.0,0.0,True,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,5291458,C,[T],"[mapping_quality, no_reliable_supporting_read,...",106.93,21p12,21-5291458-C-T,5291458,5291458,T,SNV,NC_000021.9:g.5291458C>T,,,,[rs1265206646],220,26,,0.110935,70798,7854,78,0.022681,26322,597,0,0.079767,7196,574,2,0.075443,2598,196,0,0.102046,4204,429,0,0.210871,25480,5373,72,0.151239,1534,232,2,0.122714,2078,255,0,0.107527,930,100,0,0.102856,34660,3565,30,0.118684,36138,4289,48,0.074837,16302,1220,0.372101,125568.0,46724.0,4255.0,True,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,5317005,C,[A],"[mapping_quality, non_homref_normal]",32.32,21p12,21-5317005-C-A,5317005,5317005,A,SNV,NC_000021.9:g.5317005C>A,,,,[rs1399429098],194,24,,0.187619,33520,6289,71,0.096655,18116,1751,1,0.237973,2328,554,10,0.275613,1386,382,2,0.122881,708,87,2,0.327003,8636,2824,47,0.306579,760,233,4,0.300403,992,298,3,0.234742,426,100,1,0.181277,15788,2862,24,0.193266,17732,3427,47,0.216009,7958,1719,0.508537,125568.0,63856.0,3328.0,True,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,5341568,C,[T],[weak_evidence],30.38,21p12,21-5341568-C-T,5341568,5341568,T,SNV,NC_000021.9:g.5341568C>T,,,,[rs1362623940],241,19,,0.199399,94564,18856,550,0.039066,30282,1183,4,0.28162,8792,2476,39,0.320692,3296,1057,20,0.257278,4878,1255,27,0.283761,40298,11435,434,0.216651,2174,471,12,0.177857,2800,498,3,0.208531,1266,264,8,0.199067,45432,9044,245,0.199707,49132,9812,305,0.179372,19546,3506,0.22923,125568.0,28784.0,977.0,True,,,,,,,,,,,,,,,,,,,,,,,
0,chr21,5344000,C,[T],"[no_reliable_supporting_read, weak_evidence]",24.00,21p12,21-5344000-C-T,5344000,5344000,T,SNV,NC_000021.9:g.5344000C>T,,,,[rs1156358732],241,15,,0.108729,99118,10777,218,0.01388,35374,491,2,0.178461,9016,1609,17,0.194145,2972,577,8,0.15375,5574,857,6,0.165372,38876,6429,177,0.113576,2254,256,3,0.096678,3010,291,0,0.107963,1306,141,2,0.108202,47892,5182,85,0.109222,51226,5595,133,0.103308,21344,2205,0.191402,125568.0,24034.0,247.0,True,,,,,,,,,,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,chr21,44260048,A,[C],[PASS],204.27,21q22.3,21-44260048-CA-C,44260049,44260049,-,deletion,NC_000021.9:g.44260071del,,,"[{'transcript': 'ENST00000628202.3', 'source':...",[rs765722839],121,18,,0.163523,63508,10385,235,0.241867,17460,4223,131,0.11847,5360,635,16,0.020419,1910,39,1,0.089506,1944,174,7,0.136704,32040,4380,49,0.189066,1756,332,8,0.229017,1668,382,13,0.157407,756,119,6,0.164289,29022,4768,103,0.162878,34486,5617,132,0.163979,11288,1851,0.110832,125568.0,13917.0,51.0,True,,,,,,,,,,,,,,,,,,,,,True,,"[{'id': 'ENSR00000664607', 'type': 'enhancer',..."
0,chr21,45654338,T,[A],[PASS],205.80,21q22.3,21-45654338-AT-A,45654339,45654339,-,deletion,NC_000021.9:g.45654359del,,,"[{'transcript': 'ENST00000465077.5', 'source':...",[rs398036499],145,15,,0.162227,108952,17675,674,0.215993,28200,6091,308,0.181828,9476,1723,86,0.060228,3852,232,1,0.08975,3844,345,9,0.149985,55012,8251,237,0.143775,2956,425,14,0.091275,3232,295,12,0.147222,1440,212,6,0.159572,50598,8074,306,0.16453,58354,9601,368,0.155962,19960,3113,0.163991,125568.0,20592.0,276.0,True,,,,,,,,,,,,,,,,,,,,,True,,
0,chr21,45909790,C,[A],[weak_evidence],107.77,21q22.3,21-45909790-AC-A,45909791,45909791,-,deletion,NC_000021.9:g.45909800del,,,"[{'transcript': 'ENST00000400314.5', 'source':...",[rs376651490],53,35,True,0.153031,9338,1429,128,0.31489,2458,774,92,0.086076,790,68,2,0.081818,220,18,2,0.079592,490,39,2,0.091773,4838,444,22,0.180952,210,38,3,0.131579,114,15,1,0.126984,126,16,1,0.154401,4158,642,64,0.151931,5180,787,64,0.158901,1674,266,0.373614,125568.0,46914.0,125.0,True,,,,,,,,,,,,,,,,,,,,,True,,
0,chr21,46145756,C,[G],[weak_evidence],205.79,21q22.3,21-46145756-C-G,46145756,46145756,G,SNV,NC_000021.9:g.46145756C>G,-0.2,-1.657,"[{'transcript': 'ENST00000494498.2', 'source':...",[rs191178818],65,12,True,0.177827,53940,9592,0,0.077993,14232,1110,0,0.256837,5704,1465,0,0.139273,1156,161,0,0.153454,4314,662,0,0.222038,24784,5503,0,0.219403,1340,294,0,0.11203,1330,149,0,0.22093,688,152,0,0.183313,26332,4827,0,0.172595,27608,4765,0,0.155291,11018,1711,0.242052,125568.0,30394.0,0.0,True,0.3,,,,,,,,,,,,,,,,,,,,True,-0.675,


In [31]:
df = df.dropna(subset= ["phyloPPrimateScore", "gerpScore"])
df[(df["gerpScore"]>0) & (df["phyloPPrimateScore"]>0)]

Unnamed: 0,chromosome,position,refAllele,altAlleles,filters,mappingQuality,cyatogeneticBand,vid,begin,end,altAllele,variantType,hgvsg,phylopScore,phyloPPrimateScore,transcripts,dbsnp,vcfInfo.DP,gnomad.coverage,gnomad.failedFilter,gnomad.allAf,gnomad.allAn,gnomad.allAc,gnomad.allHc,gnomad.afrAf,gnomad.afrAn,gnomad.afrAc,gnomad.afrHc,gnomad.amrAf,gnomad.amrAn,gnomad.amrAc,gnomad.amrHc,gnomad.easAf,gnomad.easAn,gnomad.easAc,gnomad.easHc,gnomad.finAf,gnomad.finAn,gnomad.finAc,gnomad.finHc,gnomad.nfeAf,gnomad.nfeAn,gnomad.nfeAc,gnomad.nfeHc,gnomad.asjAf,gnomad.asjAn,gnomad.asjAc,gnomad.asjHc,gnomad.sasAf,gnomad.sasAn,gnomad.sasAc,gnomad.sasHc,gnomad.othAf,gnomad.othAn,gnomad.othAc,gnomad.othHc,gnomad.maleAf,gnomad.maleAn,gnomad.maleAc,gnomad.maleHc,gnomad.femaleAf,gnomad.femaleAn,gnomad.femaleAc,gnomad.femaleHc,gnomad.controlsAllAf,gnomad.controlsAllAn,gnomad.controlsAllAc,topmed.allAf,topmed.allAn,topmed.allAc,topmed.allHc,topmed.failedFilter,dannScore,oneKg.allAf,oneKg.afrAf,oneKg.amrAf,oneKg.easAf,oneKg.eurAf,oneKg.sasAf,oneKg.allAn,oneKg.afrAn,oneKg.amrAn,oneKg.easAn,oneKg.eurAn,oneKg.sasAn,oneKg.allAc,oneKg.afrAc,oneKg.amrAc,oneKg.easAc,oneKg.eurAc,oneKg.sasAc,cosmic,inLowComplexityRegion,gerpScore,regulatoryRegions
0,chr21,9325863,C,[G],"[filtered_reads, mapping_quality, non_homref_n...",8.65,21p11.2,21-9325863-C-G,9325863,9325863,G,SNV,NC_000021.9:g.9325863C>G,-0.8,0.171,"[{'transcript': 'ENST00000622961.3', 'source':...",[rs796282180],306,30,True,0.118328,57146,6762,0,0.205042,13446,2757,0,0.088286,5822,514,0,0.040883,2446,100,0,0.150943,3710,560,0,0.089256,27382,2444,0,0.062331,1476,92,0,0.111321,1590,177,0,0.079012,810,64,0,0.119863,27498,3296,0,0.116905,29648,3466,0,0.121282,12170,1476,0.03997,125568.0,5019.0,0.0,True,0.17,,,,,,,,,,,,,,,,,,,,,1.46,"[{'id': 'ENSR00000140262', 'type': 'promoter',..."
0,chr21,9577267,C,[T],"[non_homref_normal, weak_evidence]",34.56,21p11.2,21-9577267-C-T,9577267,9577267,T,SNV,NC_000021.9:g.9577267C>T,0.2,0.292,"[{'transcript': 'ENST00000623408.1', 'source':...",[rs1414013851],227,24,True,0.16528,104290,17237,1,0.131743,27918,3678,0,0.198579,10414,2068,1,0.168754,3354,566,0,0.170216,7420,1263,0,0.178455,47250,8432,0,0.193937,2408,467,0,0.129664,3270,424,0,0.150636,1414,213,0,0.163481,50954,8330,1,0.166998,53336,8907,0,0.16508,22486,3712,0.151583,125568.0,19034.0,0.0,True,,,,,,,,,,,,,,,,,,,,,,0.718,
0,chr21,9732136,C,[A],[weak_evidence],71.44,21p11.2,21-9732136-C-A,9732136,9732136,A,SNV,NC_000021.9:g.9732136C>A,-0.7,0.019,,[rs910381466],393,59,True,0.181347,101518,18410,58,0.081754,29760,2433,2,0.231872,9764,2264,7,0.126896,3428,435,1,0.230513,6902,1591,5,0.23185,44214,10251,38,0.164279,2234,367,2,0.207627,3068,637,2,0.168666,1334,225,1,0.181859,49346,8974,31,0.180863,52172,9436,27,0.169482,22026,3733,0.22583,125568.0,28357.0,46.0,True,,,,,,,,,,,,,,,,,,,,,,1.22,
0,chr21,27416167,T,[C],"[alt_allele_in_normal, non_homref_normal, weak...",194.46,21q21.3,21-27416167-T-C,27416167,27416167,C,SNV,NC_000021.9:g.27416167T>C,0.0,0.163,"[{'transcript': 'ENST00000420186.2', 'source':...",[rs866572428],113,5,,0.161367,102834,16594,3980,0.072592,27262,1979,377,0.147874,8514,1259,386,0.246599,2940,725,200,0.08338,3550,296,115,0.210971,52538,11084,2603,0.178827,2796,500,101,0.107071,2998,321,90,0.174312,1308,228,59,0.148092,47700,7064,1751,0.172852,55134,9530,2229,0.112171,18026,2022,0.049694,125568.0,6240.0,0.0,True,0.46,,,,,,,,,,,,,,,,,,,,True,0.266,


## Gene Filtering

In [32]:
annotated_data.genes

Unnamed: 0,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD.pLi,gnomAD.pRec,gnomAD.pNull,gnomAD.synZ,gnomAD.misZ,gnomAD.loeuf,clingenDosageSensitivityMap.haploinsufficiency,clingenDosageSensitivityMap.triplosensitivity,clingenGeneValidity,cosmic.roleInCancer
0,AATBC,51526.0,284837,ENSG00000215458,,,,,,,,,,,
1,ABCC13,,,ENSG00000291052,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,,,,,,,
2,ABCC13,16022.0,150000,ENSG00000243064,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,,,,,,,
3,ABCG1,73.0,9619,ENSG00000160179,"[{'mimNumber': 603076, 'geneName': 'ATP-bindin...",0.112,0.888,0.00000934,0.423,2.14,0.461,,,,
4,ADARB1,226.0,104,ENSG00000197381,"[{'mimNumber': 601218, 'geneName': 'Adenosine ...",0.803,0.197,0.00000832,0.603,3.52,0.399,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
437,YBEY,1299.0,54059,ENSG00000182362,"[{'mimNumber': 617461, 'geneName': 'YBEY metal...",1.34E-7,0.0625,0.937,1.13,0.0149,1.92,,,,
438,YRDCP3,39921.0,100861429,ENSG00000230859,,,,,,,,,,,
439,ZBTB21,13083.0,49854,ENSG00000173276,"[{'mimNumber': 616485, 'geneName': 'Zinc finge...",0.998,0.00222,5.45E-10,-1.59,0.512,0.249,,,,
440,ZNF295-AS1,23130.0,150142,ENSG00000237232,,,,,,,,,,,


In [33]:
annotated_data.genes[annotated_data.genes["name"] == "ABCC13"]

Unnamed: 0,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD.pLi,gnomAD.pRec,gnomAD.pNull,gnomAD.synZ,gnomAD.misZ,gnomAD.loeuf,clingenDosageSensitivityMap.haploinsufficiency,clingenDosageSensitivityMap.triplosensitivity,clingenGeneValidity,cosmic.roleInCancer
1,ABCC13,,,ENSG00000291052,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,,,,,,,
2,ABCC13,16022.0,150000.0,ENSG00000243064,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,,,,,,,


In [34]:
required_gene_ids = ["ENSG00000173276", "ENSG00000243064"]
annotated_data.genes[annotated_data.genes["ensemblGeneId"].isin(required_gene_ids)]

Unnamed: 0,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD.pLi,gnomAD.pRec,gnomAD.pNull,gnomAD.synZ,gnomAD.misZ,gnomAD.loeuf,clingenDosageSensitivityMap.haploinsufficiency,clingenDosageSensitivityMap.triplosensitivity,clingenGeneValidity,cosmic.roleInCancer
2,ABCC13,16022.0,150000,ENSG00000243064,"[{'mimNumber': 608835, 'geneName': 'ATP-bindin...",,,,,,,,,,
439,ZBTB21,13083.0,49854,ENSG00000173276,"[{'mimNumber': 616485, 'geneName': 'Zinc finge...",0.998,0.00222,5.45e-10,-1.59,0.512,0.249,,,,


In [35]:
annotated_data.genes.dropna(subset=["cosmic.roleInCancer"])

Unnamed: 0,name,hgncId,ncbiGeneId,ensemblGeneId,omim,gnomAD.pLi,gnomAD.pRec,gnomAD.pNull,gnomAD.synZ,gnomAD.misZ,gnomAD.loeuf,clingenDosageSensitivityMap.haploinsufficiency,clingenDosageSensitivityMap.triplosensitivity,clingenGeneValidity,cosmic.roleInCancer
198,ERG,3446.0,2078,ENSG00000157554,"[{'mimNumber': 165080, 'geneName': 'ETS transc...",0.964,0.0359,5.03e-07,0.228,2.53,0.329,,,,"[oncogene, fusion]"
418,TMPRSS2,11876.0,7113,ENSG00000184012,"[{'mimNumber': 602060, 'geneName': 'Transmembr...",2.4e-10,0.88,0.12,0.758,0.399,0.942,,,,[fusion]
