# Orange QB1.2_FA_Gene_Pathogenic_Variants

## Query:
Which core FA gene variants are pathogenic (i.e. cause or contribute to pathogenic outcomes), and for which conditions?

## Input:
Hardcoded tsv file from:
https://raw.githubusercontent.com/NCATS-Tangerine/cq-notebooks/master/FA_gene_sets/FA_1_core_complex.txt

## Goals:

A benchmarking query to assess information in the Translator system about pathogenicity of variants in human FA core genes.
The query can be answered independently using either CIViC data accessible via Wikidata or ClinVar data accessible via Monarch/Biolink. It serves as a simple test case for aggregating results that come from different primary sources, are served by different knowledge beacons, and may use different IRIs for equivalent concepts.

## Proposed Data Types, Sources, and Access Endpoints:

ClinVar variant-disease associations (via Monarch/Biolink)

CIViC variant-disease associations (via Wikidata, and soon via Biolink)

## Proposed Sub-Queries/Tasks:

### Input: FA Gene X 

Retrieve all variants of Gene X
Gene -[has_affected_feature]-> Variant (Monarch)
Variant -[biological variant of]-> Gene (Wikidata)

Retrieve all diseases caused by variants in the set above
Variant -[causes_or_contributes_to_condition*]-> Disease (Monarch)
Variant -[positive diagnostic predictor]-> Disease (Wikidata)

Output: Set of diseases associated with variant in FA gene(s)
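
The two hops above can be sketched against a small in-memory edge list. This is only an illustration: `gene_to_variants`, `variant_to_diseases` and `diseases_for_gene` are hypothetical names standing in for the Monarch and Wikidata lookups, and the identifiers are taken from the ClinVar results shown later in this notebook.

```python
# Hypothetical in-memory stand-ins for the two lookups (not a real API).
gene_to_variants = {
    "NCBIGene:2175": ["ClinVarVariant:3440", "ClinVarVariant:3441"],
}
variant_to_diseases = {
    "ClinVarVariant:3440": ["OMIM:227650"],
    "ClinVarVariant:3441": ["OMIM:227650"],
}


def diseases_for_gene(gene_curie):
    """Hop 1: gene -> variants. Hop 2: variants -> diseases (deduplicated)."""
    diseases = set()
    for variant in gene_to_variants.get(gene_curie, []):
        diseases.update(variant_to_diseases.get(variant, []))
    return diseases


print(diseases_for_gene("NCBIGene:2175"))  # {'OMIM:227650'}
```

Returning a set means that a disease reached through several variants of the same gene is reported once.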

In [1]:
import requests
import pandas as pd
from pprint import pprint

In [9]:
# get gene sets from github
base_url = "https://raw.githubusercontent.com/NCATS-Tangerine/cq-notebooks/master/FA_gene_sets/"
FA_all_genes = "FA_4_all_genes.txt"
columns = ['gene_curie', 'gene_symbol']
fa_genes = pd.read_csv(base_url + FA_all_genes, sep='\t', names=columns)

In [10]:
fa_genes

Unnamed: 0,gene_curie,gene_symbol
0,NCBIGene:2175,FANCA
1,NCBIGene:2187,FANCB
2,NCBIGene:2176,FANCC
3,NCBIGene:2178,FANCE
4,NCBIGene:2188,FANCF
5,NCBIGene:2189,FANCG
6,NCBIGene:55120,FANCL
7,NCBIGene:57697,FANCM
8,NCBIGene:2177,FANCD2
9,NCBIGene:55215,FANCI


In [11]:
def query_biolink_gene_disease(gene_curie):
    """Return gene-disease association data for a gene CURIE from the Monarch Biolink API."""
    bl_url = 'https://api.monarchinitiative.org/api/bioentity/gene/{}/diseases/'
    params = {
        'fetch_objects': True,
    }
    r = requests.get(url=bl_url.format(gene_curie), params=params)
    r.raise_for_status()  # fail early on HTTP errors instead of parsing an error body
    return r.json()

In [13]:
# map GENO clinical-significance term CURIEs to human-readable labels
term_map_reversed = {
    "GENO:0000840": "pathogenic",
    "GENO:0000841": "likely pathogenic",
    "GENO:0000843": "benign",
    "GENO:0000844": "likely benign",
    "GENO:0000845": "uncertain significance"    
}

In [14]:
"""
Look for pathogenic variants and the diseases
they are implicated in via Biolink
"""
column_names = ['gene_name', 'gene_curie', 'variant_name', 'variant_curie',
                'relation_label', 'relation_curie', 'disease_label', 'disease_curie']
result_set = []
for index, row in fa_genes.iterrows():
    bl_dat = query_biolink_gene_disease(row['gene_curie'])
    for assoc in bl_dat['associations']:
        edges = assoc['evidence_graph']['edges']
        nodes = assoc['evidence_graph']['nodes']
        # map node ids to labels so that edges can be rendered readably
        node_map = {node['id']: node['lbl'] for node in nodes}
        for edge in edges:
            # keep only edges whose predicate is a GENO clinical-significance term
            if edge['pred'] in term_map_reversed:
                pd_row = [row['gene_symbol'], row['gene_curie'], node_map[edge['sub']],
                          edge['sub'], term_map_reversed[edge['pred']],
                          edge['pred'], node_map[edge['obj']], edge['obj']]
                result_set.append(pd_row)
result_frame = pd.DataFrame(data=result_set, columns=column_names)
result_frame.to_csv('FA_pathogenic_variant_BioLink.csv', sep=',')
result_frame

# Returns a DataFrame with FA gene ClinVar data.

Unnamed: 0,gene_name,gene_curie,variant_name,variant_curie,relation_label,relation_curie,disease_label,disease_curie
0,FANCA,NCBIGene:2175,NC_000016.9:g.(89837128_89837200)_(89847471_89...,ClinVarVariant:402243,likely pathogenic,GENO:0000841,"Fanconi Anemia, Complementation Group a",OMIM:227650
1,FANCA,NCBIGene:2175,NM_000135.2(FANCA):c.1115_1118delTTGG (p.Val37...,ClinVarVariant:3440,pathogenic,GENO:0000840,"Fanconi Anemia, Complementation Group a",OMIM:227650
2,FANCA,NCBIGene:2175,"FANCA, 156-BP DEL, NT1515",ClinVarVariant:3441,pathogenic,GENO:0000840,"Fanconi Anemia, Complementation Group a",OMIM:227650
3,FANCA,NCBIGene:2175,NM_000135.2(FANCA):c.2574C>G (p.Ser858Arg),ClinVarVariant:134256,pathogenic,GENO:0000840,"Fanconi Anemia, Complementation Group a",OMIM:227650
4,FANCA,NCBIGene:2175,NM_000135.2(FANCA):c.3788_3790delTCT (p.Phe126...,ClinVarVariant:41003,pathogenic,GENO:0000840,"Fanconi Anemia, Complementation Group a",OMIM:227650
5,FANCA,NCBIGene:2175,NM_000135.2(FANCA):c.513G>A (p.Trp171Ter),ClinVarVariant:3447,pathogenic,GENO:0000840,"Fanconi Anemia, Complementation Group a",OMIM:227650
6,FANCA,NCBIGene:2175,NM_000135.2(FANCA):c.2839dupT (p.Ser947Phefs),ClinVarVariant:188383,likely pathogenic,GENO:0000841,"Fanconi Anemia, Complementation Group a",OMIM:227650
7,FANCA,NCBIGene:2175,NM_000135.2(FANCA):c.3761_3762delAG (p.Glu1254...,ClinVarVariant:370466,pathogenic,GENO:0000840,"Fanconi Anemia, Complementation Group a",OMIM:227650
8,FANCA,NCBIGene:2175,NM_000135.2(FANCA):c.1615delG (p.Asp539Thrfs),ClinVarVariant:3443,pathogenic,GENO:0000840,"Fanconi Anemia, Complementation Group a",OMIM:227650
9,FANCA,NCBIGene:2175,NM_000135.2(FANCA):c.2557C>T (p.Arg853Ter),ClinVarVariant:192384,pathogenic,GENO:0000840,"Fanconi Anemia, Complementation Group a",OMIM:227650
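
The loop above keeps every edge whose predicate is a GENO term, so benign and uncertain classifications are collected as well. To answer the query as stated, a follow-up step could keep only the pathogenic and likely pathogenic rows and collect the diseases per gene. The sketch below assumes the row layout of `result_set` (positions 0, 5 and 7 hold the gene symbol, relation CURIE and disease CURIE). `PATHOGENIC_CURIES` and `pathogenic_diseases_by_gene` are names introduced here, and the third sample row is a made-up benign entry included only to show the filter.

```python
from collections import defaultdict

# GENO terms treated as pathogenic for this query (assumption of this sketch).
PATHOGENIC_CURIES = {"GENO:0000840", "GENO:0000841"}

# Rows in the same column order as `result_set`; the first two are abridged
# from the output above, the last one is hypothetical.
sample_rows = [
    ["FANCA", "NCBIGene:2175", "FANCA, 156-BP DEL, NT1515", "ClinVarVariant:3441",
     "pathogenic", "GENO:0000840", "Fanconi Anemia, Complementation Group a", "OMIM:227650"],
    ["FANCA", "NCBIGene:2175", "NM_000135.2(FANCA):c.2839dupT (p.Ser947Phefs)", "ClinVarVariant:188383",
     "likely pathogenic", "GENO:0000841", "Fanconi Anemia, Complementation Group a", "OMIM:227650"],
    ["FANCA", "NCBIGene:2175", "hypothetical benign variant", "ClinVarVariant:EXAMPLE",
     "benign", "GENO:0000843", "Fanconi Anemia, Complementation Group a", "OMIM:227650"],
]


def pathogenic_diseases_by_gene(rows):
    """Return {gene symbol: set of disease CURIEs} for pathogenic rows only."""
    diseases = defaultdict(set)
    for row in rows:
        if row[5] in PATHOGENIC_CURIES:
            diseases[row[0]].add(row[7])
    return dict(diseases)


print(pathogenic_diseases_by_gene(sample_rows))
```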


In [15]:
from SPARQLWrapper import SPARQLWrapper, JSON

def execute_query(query):
    endpoint = SPARQLWrapper('https://query.wikidata.org/sparql')
    endpoint.setQuery(query)
    endpoint.setReturnFormat(JSON)
    return endpoint.query().convert()

def var_query(entrez):
    """
    Query Wikidata by Entrez Gene ID for variants of the gene and, where present,
    the diseases for which a variant is a 'positive diagnostic predictor'.
    """
    query = """
    SELECT distinct ?gene ?geneLabel ?variant ?variantLabel ?disease ?diseaseLabel
     WHERE {
      ?gene wdt:P351 '%s'. 
      OPTIONAL {?variant wdt:P3433 ?gene.}        # variant of gene
      OPTIONAL {?variant wdt:P3433 ?gene;
                         wdt:P3356 ?disease.}    # variant is a positive diagnostic predictor of disease
       SERVICE wikibase:label {
            bd:serviceParam wikibase:language "en" .
      }
    }
    """ % (entrez)
    r = execute_query(query)
    return r['results']['bindings']

def keycheck(ckey, cdict):
    if ckey in cdict.keys():
        return cdict[ckey]['value']
    else:
        return None
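
A SPARQL JSON result omits any variable that an `OPTIONAL` pattern left unbound, which is why a guarded lookup like `keycheck` is needed. The sketch below illustrates this with a binding shaped like one entry of `r['results']['bindings']`, using the FANCC values that appear in the Wikidata results of this notebook. `binding_value` is a hypothetical helper with the same behavior as `keycheck`, written with `dict.get`.

```python
def binding_value(key, binding):
    """Return the 'value' of a SPARQL JSON binding, or None if the variable is unbound."""
    return binding.get(key, {}).get("value")


# One result binding; there is no "disease" key because the OPTIONAL did not match.
hit = {
    "gene": {"type": "uri", "value": "http://www.wikidata.org/entity/Q18250517"},
    "geneLabel": {"type": "literal", "xml:lang": "en", "value": "FANCC"},
    "variant": {"type": "uri", "value": "http://www.wikidata.org/entity/Q28445146"},
    "variantLabel": {"type": "literal", "xml:lang": "en", "value": "FANCC LOSS-OF-FUNCTION"},
}

columns = ["gene", "geneLabel", "variant", "variantLabel", "disease", "diseaseLabel"]
row = {column: binding_value(column, hit) for column in columns}
print(row["variantLabel"], row["disease"])
```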

In [16]:
"""
Look for variants of FA genes in Wikidata that are 'positive diagnostic predictors' for a disease
"""
wd_columns = ['gene', 'geneLabel', 'variant', 'variantLabel', 'disease', 'diseaseLabel']
wd_results = []

for index, row in fa_genes.iterrows():
    entrez_id = row['gene_curie'].split(":")[-1]
    wd_hits = var_query(entrez_id)
    for hit in wd_hits:
        # unbound OPTIONAL variables are absent from the binding, so keycheck returns None
        wd_results.append({column: keycheck(column, hit) for column in wd_columns})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
wd_result_frame = pd.DataFrame(wd_results, columns=wd_columns)
wd_result_frame.to_csv('FA_pathogenic_variant_Wikidata.csv', sep=',')
wd_result_frame

# Returns a DataFrame linking FA genes to CIViC variant data

Unnamed: 0,gene,geneLabel,variant,variantLabel,disease,diseaseLabel
0,http://www.wikidata.org/entity/Q17927056,FANCA,,,,
1,http://www.wikidata.org/entity/Q17927471,FANCB,,,,
2,http://www.wikidata.org/entity/Q18250517,FANCC,http://www.wikidata.org/entity/Q28445146,FANCC LOSS-OF-FUNCTION,,
3,http://www.wikidata.org/entity/Q17927077,FANCE,,,,
4,http://www.wikidata.org/entity/Q17927502,FANCF,,,,
5,http://www.wikidata.org/entity/Q17927524,FANCG,,,,
6,http://www.wikidata.org/entity/Q18041564,FANCL,,,,
7,http://www.wikidata.org/entity/Q18044458,FANCM,,,,
8,http://www.wikidata.org/entity/Q17927069,FANCD2,,,,
9,http://www.wikidata.org/entity/Q18041663,FANCI,,,,
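
The stated goal is to aggregate results whose sources use different IRIs for equivalent concepts: the Biolink rows identify genes by NCBIGene CURIE, the Wikidata rows by entity URI. A minimal sketch of one way to combine them is given below, assuming the gene symbol (`gene_name` and `geneLabel`) is an acceptable join key. `aggregate_by_symbol` is a hypothetical helper, and the sample records are abridged from the two result tables above.

```python
from collections import defaultdict

# Records shaped like rows of the two result frames (abridged).
biolink_rows = [
    {"gene_name": "FANCA", "variant_curie": "ClinVarVariant:3440",
     "disease_curie": "OMIM:227650"},
]
wikidata_rows = [
    {"geneLabel": "FANCA", "variant": None, "disease": None},
    {"geneLabel": "FANCC", "variant": "http://www.wikidata.org/entity/Q28445146",
     "disease": None},
]


def aggregate_by_symbol(biolink_rows, wikidata_rows):
    """Group variant/disease pairs from both sources under the gene symbol."""
    merged = defaultdict(list)
    for row in biolink_rows:
        merged[row["gene_name"]].append({
            "source": "ClinVar via Biolink",
            "variant": row["variant_curie"],
            "disease": row["disease_curie"],
        })
    for row in wikidata_rows:
        if row["variant"] is None:
            continue  # gene exists in Wikidata but has no linked variant item
        merged[row["geneLabel"]].append({
            "source": "CIViC via Wikidata",
            "variant": row["variant"],
            "disease": row["disease"],
        })
    return dict(merged)


merged = aggregate_by_symbol(biolink_rows, wikidata_rows)
print(sorted(merged))
```

Joining on the symbol is a simplification. Since the Wikidata query already matches genes by Entrez Gene ID (`wdt:P351`), a stricter join could rebuild the `NCBIGene:` CURIE from that ID and key on it instead.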
