# Variant annotation tables & direction of effect investigation

## Table of contents

1. [Initial data exploration](#Initial-data-exploration)
    1. [Example clinical annotation](#Example-clinical-annotation)
2. [Coverage](#Coverage)
3. [Direction of effect](#Direction-of-effect)
    1. [Sentence breakdown](#Sentence-breakdown)
    2. [Vocabulary](#Vocabulary)
4. [Alleles and genotypes](#Alleles-and-genotypes)
5. [Bonus material](#Bonus-material)

In [96]:
import os
import csv
import re
from collections import Counter

import pandas as pd

from opentargets_pharmgkb.evidence_generation import ID_COL_NAME
from opentargets_pharmgkb.pandas_utils import read_tsv_to_df

In [6]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', None)

## Initial data exploration

[Top of page](#Table-of-contents)

The variant annotations zip file contains 4 new tables, described in the readme as follows:
>* **var_pheno_ann.tsv**: Contains associations in which the variant affects a phenotype, with or without drug information.
>* **var_drug_ann.tsv**: Contains associations in which the variant affects a drug dose, response, metabolism, etc.
>* **var_fa_ann.tsv**: Contains in vitro and functional analysis-type associations.
>* **study_parameters.tsv**: Contains information about the study population size, biogeographical group and statistics for the variant annotations; this file is cross-referenced against the 3 variant annotation files.

The study parameters table is interesting but feels out of scope, at least for now, so I'll ignore it for the rest of the notebook.

In [4]:
data_dir = '/home/april/projects/opentargets/pharmgkb/doe'

In [5]:
# Download new data (2024-05-05)
# %cd {data_dir}  # `!cd` runs in a subshell, so the directory change would not persist

# !wget -q https://api.pharmgkb.org/v1/download/file/data/clinicalAnnotations.zip
# !wget -q https://api.pharmgkb.org/v1/download/file/data/variantAnnotations.zip

# !unzip -jq clinicalAnnotations.zip "*.tsv" -d {data_dir}
# !unzip -jq variantAnnotations.zip "*.tsv" -d {data_dir}

# !rm clinicalAnnotations.zip variantAnnotations.zip

In [11]:
var_drug_ann = read_tsv_to_df(os.path.join(data_dir, 'var_drug_ann.tsv'))
var_fa_ann = read_tsv_to_df(os.path.join(data_dir, 'var_fa_ann.tsv'))
var_pheno_ann = read_tsv_to_df(os.path.join(data_dir, 'var_pheno_ann.tsv'))

Questions to consider:
* How many annotations?
* What's the coverage of variant/haplotypes relative to clinical annotations?
* What are the relevant fields?
* What's the relationship between these annotations and clinical annotations?
* Which of these columns has a controlled vocab vs. free text?
* How do the different variant-level annotation sentences contribute to the overall clinical annotation sentences?

In [30]:
len(var_drug_ann)

11901

In [31]:
len(var_fa_ann)

2009

In [32]:
len(var_pheno_ann)

13517

In [245]:
# Looking at the data - output suppressed for brevity
var_drug_ann.head()

Unnamed: 0,Variant Annotation ID,Variant/Haplotypes,Gene,Drug(s),PMID,Phenotype Category,Significance,Notes,Sentence,Alleles,Specialty Population,Metabolizer types,isPlural,Is/Is Not associated,Direction of effect,PD/PK terms,Multiple drugs And/or,Population types,Population Phenotypes or diseases,Multiple phenotypes or diseases And/or,Comparison Allele(s) or Genotype(s),Comparison Metabolizer types
0,1451834452,"CYP3A4*1, CYP3A4*17",CYP3A4,nifedipine,15634941,"Other, Metabolism/PK",not stated,in vitro expression of the recombinant CYP3A4*17 allelic protein and the wild-type protein,CYP3A4 *17 is associated with decreased metabolism of nifedipine as compared to CYP3A4 *1.,*17,,,Is,Associated with,decreased,metabolism of,,,,,*1,
1,1451159680,rs5031016,CYP2A6,warfarin,22248286,Dosage,no,No association was found between this variant and warfarin-maintenance dose. Described as CYP2A6*7 in this study.,Allele G is not associated with increased dose of warfarin in people with an international normalized ratio (INR) of 2.0-3.0 as compared to allele A.,G,,,Is,Not associated with,increased,dose of,,in people with,Other:an international normalized ratio (INR) of 2.0-3.0,,A,
2,1451306860,CYP2C9*11,CYP2C9,warfarin,33350885,Dosage,not stated,"""This case suggests that CYP2C9 *11/*11 carriers require approximately two thirds less warfarin than CYP2C9"" normal function homozygotes.",CYP2C9 *11/*11 is associated with decreased dose of warfarin.,*11/*11,,,Is,Associated with,decreased,dose of,,,,,,
3,1448997750,"CYP2B6*1, CYP2B6*18",CYP2B6,efavirenz,16495778,Metabolism/PK,yes,"Please note that in the paper the allele was referred to as CYP2B6*16. CYP2B6*16 and *18 alleles have been consolidated by PharmVar in Jan 2020, with *16 now listed as a suballele of *18 (CYP2B6*18.002). This annotation is updated to be on CYP2B6*18, instead of CYP2B6*16.",CYP2B6 *1/*18 is associated with increased concentrations of efavirenz in people with HIV Infections as compared to CYP2B6 *1/*1.,*1/*18,,,Is,Associated with,increased,concentrations of,,in people with,Disease:HIV Infections,,*1/*1,
4,1448631821,"CYP2C19*1, CYP2C19*2",CYP2C19,"clomipramine, desmethyl clomipramine",28470111,Metabolism/PK,no,in a single individual,CYP2C19 *1/*2 is associated with increased trough concentration of clomipramine and desmethyl clomipramine.,*1/*2,,,Is,Associated with,increased,trough concentration of,and,,,,,


In [246]:
var_fa_ann.head()

Unnamed: 0,Variant Annotation ID,Variant/Haplotypes,Gene,Drug(s),PMID,Phenotype Category,Significance,Notes,Sentence,Alleles,Specialty Population,Assay type,Metabolizer types,isPlural,Is/Is Not associated,Direction of effect,Functional terms,Gene/gene product,When treated with/exposed to/when assayed with,Multiple drugs And/or,Cell type,Comparison Allele(s) or Genotype(s),Comparison Metabolizer types
0,1451148445,"CYP2C19*1, CYP2C19*17",CYP2C19,normeperidine,30902024,,not stated,"In other in vitro experiments, normeperidine formation was significantly correlated with CYP2C19 activity, as measured by S-mephenytoin 4-hydroxylation.",CYP2C19 *17/*17 is associated with increased formation of normeperidine as compared to CYP2C19 *1/*1 + *1/*17.,*17/*17,,in human liver microsomes,,Is,Associated with,increased,formation of,,,,,*1/*1 + *1/*17,
1,1447814273,rs9923231,VKORC1,,26847243,Other,no,,Allele T is not associated with transcription of VKORC1 in HepG2 cells as compared to allele C.,T,,luciferase assay,,Is,Not associated with,,transcription of,VKORC1,,,in HepG2 cells,C,
2,1447814277,rs56314408,VKORC1,,26847243,Other,yes,"In the European population, this SNPs is in high LD with rs9923231 but not other populations. This SNP disrupts a binding motif for transcription factor TFAP2A/C.",Allele C is associated with increased transcription of VKORC1 in HepG2 cells as compared to allele T.,C,,luciferase assay,,Is,Associated with,increased,transcription of,VKORC1,,,in HepG2 cells,T,
3,1447990384,rs1065852,CYP2D6,bufuralol,2211621,Metabolism/PK,not stated,In vitro experiments showed a significant decrease in CYP2D6 activity for the variant construct expressed in COS-1 cells as compared to wild-type.,Allele A is associated with decreased activity of CYP2D6 when assayed with bufuralol in COS-1 cells as compared to allele G.,A,,,,Is,Associated with,decreased,activity of,CYP2D6,when assayed with,,in COS-1 cells,G,
4,1448281185,"CYP2B6*1, CYP2B6*6",CYP2B6,bupropion,27439448,Efficacy,yes,The ratio of hydroxybupropion versus bupropion (AUC_hyd/ AUC_bup) in terms of area under the time-concentration curve (AUC) was used to assay CYP2B6 activity.,CYP2B6 *1/*1 is associated with increased activity of CYP2B6 when assayed with bupropion as compared to CYP2B6 *1/*6.,*1/*1,,,,Is,Associated with,increased,activity of,CYP2B6,when assayed with,,,*1/*6,


In [247]:
var_pheno_ann.head()

Unnamed: 0,Variant Annotation ID,Variant/Haplotypes,Gene,Drug(s),PMID,Phenotype Category,Significance,Notes,Sentence,Alleles,Specialty Population,Metabolizer types,isPlural,Is/Is Not associated,Direction of effect,Side effect/efficacy/other,Phenotype,Multiple phenotypes And/or,When treated with/exposed to/when assayed with,Multiple drugs And/or,Population types,Population Phenotypes or diseases,Multiple phenotypes or diseases And/or,Comparison Allele(s) or Genotype(s),Comparison Metabolizer types
0,1449169911,HLA-B*35:08,HLA-B,lamotrigine,29238301,Toxicity,no,"The allele was not significant when comparing allele frequency in cases of severe cutaneous adverse reactions (SCAR), Stevens-Johnson Syndrome (SJS) and Maculopapular Exanthema (MPE) (1/15) and controls (individuals without AEs who took lamotrigine) (0/50). The allele was significant when comparing between cases (1/15) and the general population (1/986).","HLA-B *35:08 is not associated with likelihood of Maculopapular Exanthema, severe cutaneous adverse reactions or Stevens-Johnson Syndrome when treated with lamotrigine in people with Epilepsy.",*35:08,,,Is,Not associated with,,likelihood of,"Side Effect:Maculopapular Exanthema, Side Effect:severe cutaneous adverse reactions, Side Effect:Stevens-Johnson Syndrome",or,when treated with,,in people with,Disease:Epilepsy,,,
1,982022165,rs45607939,NAT2,sulfamethoxazole / trimethoprim,22850190,Toxicity,no,Minor allele frequencies were compared between cases (with drug-induced hypersensitivity) and controls.,Allele T is not associated with increased risk of Hypersensitivity when treated with sulfamethoxazole / trimethoprim in people with Infection.,T,,,Is,Not associated with,increased,risk of,Disease:Hypersensitivity,,when treated with,,in people with,Disease:Infection,,,
2,982022148,rs1799930,NAT2,sulfamethoxazole / trimethoprim,22850190,Toxicity,no,Minor allele frequencies were compared between cases (with drug-induced hypersensitivity) and controls.,Allele A is not associated with increased risk of Hypersensitivity when treated with sulfamethoxazole / trimethoprim in people with Infection.,A,,,Is,Not associated with,increased,risk of,Disease:Hypersensitivity,,when treated with,,in people with,Disease:Infection,,,
3,1451283480,rs16969968,CHRNA5,,22071378,Other,yes,"this was from meta-analysis of 27 studies but the number of total cases and the risk allele not clearly specified. Minor allele frequency was given for A allele. Introduction states that variant is Asp398Asn, where Asn (A allele) has lower nicotine response than Asp (G allele) and may be at greater risk for nicotine addiction.",Allele A is associated with increased severity of Tobacco Use Disorder in people with Tobacco Use Disorder.,A,,,Is,Associated with,increased,severity of,Other:Tobacco Use Disorder,,,,in people with,Other:Tobacco Use Disorder,,,
4,1444696916,rs267606617,MT-RNR1,streptomycin,7689389,Toxicity,not stated,"Pedigree analysis with 3 separate families. Within the maternal lines, 15 individuals had the 1555G variant, took aminoglycoside antibiotics, and developed hearing loss. 100% of individuals with the 1555G variant who took aminoglycosides developed hearing loss. Homoplasmic. Please note that no statistical analyses were done.",Allele G is associated with Ototoxicity when treated with streptomycin as compared to allele A.,G,,,Is,Associated with,,,Side Effect:Ototoxicity,and,when treated with,,,,,A,


The 3 annotation tables provide evidence for the clinical annotations and can be connected to them by joining with the `clinical_ann_evidence.tsv` file. In general a clinical annotation can have multiple variant annotations as evidence, and a variant annotation can be used as evidence for multiple clinical annotations (in theory; I've not actually observed this).
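One way to check the "in theory" part is to count how many distinct clinical annotations each evidence ID supports. This is a minimal sketch on a toy stand-in for `clinical_ann_evidence` (all IDs below are invented; only the column names follow the tables used in this notebook):

```python
import pandas as pd

# Toy stand-in for clinical_ann_evidence.tsv; all IDs are invented for illustration
toy_evidence = pd.DataFrame({
    'Clinical Annotation ID': ['ca1', 'ca1', 'ca2', 'ca3'],
    'Evidence ID': ['ev1', 'ev2', 'ev2', 'ev3'],
})

# Number of distinct clinical annotations supported by each piece of evidence
ca_per_evidence = toy_evidence.groupby('Evidence ID')['Clinical Annotation ID'].nunique()

# Evidence IDs that are reused across more than one clinical annotation
shared_evidence = ca_per_evidence[ca_per_evidence > 1]
print(shared_evidence)
```

Applied to the real `clinical_ann_evidence` table, the same `groupby` would show directly whether any variant annotation is shared. The counts in the Coverage section (14658 evidence rows with a variant annotation versus 27427 - 13778 = 13649 distinct variant annotation IDs in use) hint that some reuse may occur, assuming there are no duplicated rows.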

Each of these tables has a "Direction of effect" column, and the type of "effect" is different for each - likelihood of side effects, formation of product, metabolism of drug, etc.

**Question for OT**: when we say "direction of effect", do we mean any of these "effects"? I.e. should we include all three of these tables or focus on one?

In [38]:
clinical_annotations = read_tsv_to_df(os.path.join(data_dir, 'clinical_annotations.tsv'))
clinical_ann_evidence = read_tsv_to_df(os.path.join(data_dir, 'clinical_ann_evidence.tsv'))
clinical_ann_alleles = read_tsv_to_df(os.path.join(data_dir, 'clinical_ann_alleles.tsv'))

In [122]:
main_df = pd.merge(clinical_annotations, clinical_ann_evidence, how='left', on=ID_COL_NAME)
main_df = main_df[[
    # Main table
    'Clinical Annotation ID', 'Variant/Haplotypes', 'Gene', 'Level of Evidence', 'Phenotype Category', 'Drug(s)', 'Phenotype(s)',
    # Evidence table
    'Evidence ID', 'Evidence Type', 'PMID', 'Summary',
]]

#### Example clinical annotation

[Top of page](#Table-of-contents)

Looking at [981755803](https://www.pharmgkb.org/clinicalAnnotation/981755803), which has all three types of variant annotation evidence as well as label/guideline evidence.

In [124]:
df_981755803 = main_df[main_df[ID_COL_NAME] == '981755803']

df_981755803_drug = pd.merge(df_981755803, var_drug_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='inner', suffixes=(None, '_var_drug'))
df_981755803_pheno = pd.merge(df_981755803, var_pheno_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='inner', suffixes=(None, '_var_pheno'))
df_981755803_fa = pd.merge(df_981755803, var_fa_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='inner', suffixes=(None, '_var_fa'))

In [229]:
print('Number of evidence', len(df_981755803))
print('Number of var/drug evidence', len(df_981755803_drug))
print('Number of var/fa evidence', len(df_981755803_fa))
print('Number of var/pheno evidence', len(df_981755803_pheno))

Number of evidence 30
Number of var/drug evidence 24
Number of var/fa evidence 2
Number of var/pheno evidence 2


In [248]:
df_981755803_drug

Unnamed: 0,Clinical Annotation ID,Variant/Haplotypes,Gene,Level of Evidence,Phenotype Category,Drug(s),Phenotype(s),Evidence ID,Evidence Type,PMID,Summary,Variant Annotation ID,Variant/Haplotypes_var_drug,Gene_var_drug,Drug(s)_var_drug,PMID_var_drug,Phenotype Category_var_drug,Significance,Notes,Sentence,Alleles,Specialty Population,Metabolizer types,isPlural,Is/Is Not associated,Direction of effect,PD/PK terms,Multiple drugs And/or,Population types,Population Phenotypes or diseases,Multiple phenotypes or diseases And/or,Comparison Allele(s) or Genotype(s),Comparison Metabolizer types
0,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,981755665,Variant Drug Annotation,21083385,Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis.,981755665,rs75527207,CFTR,ivacaftor,21083385,Efficacy,not stated,Clinical trials were carried out to test efficacy of ivacaftor selecting only patients with the CFTR G551D mutation on at least one allele (genotype AA or AG).,Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis.,AA + AG,,,Are,Associated with,,response to,,in people with,Disease:Cystic Fibrosis,,,
1,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,981755678,Variant Drug Annotation,22047557,Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis.,981755678,rs75527207,CFTR,ivacaftor,22047557,Efficacy,not stated,A clinical trial that selected patients with the G551D CFTR mutation (rs75527207 genotype AA or AG). Patients without this mutation were excluded. One patient included in the placebo group was homozygous for F508del (rs113993960 genotype del/del).,Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis.,AA + AG,,,Are,Associated with,,response to,,in people with,Disease:Cystic Fibrosis,,,
2,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,982009991,Variant Drug Annotation,23590265,Allele A is associated with response to ivacaftor in children with Cystic Fibrosis.,982009991,rs75527207,CFTR,ivacaftor,23590265,Efficacy,yes,Patients aged 6-11 at time of screening who had at least one allele with the G551D mutation (allele A at position rs75527207) were recruited for this trial. Ivacaftor is only indicated in CF patients with this mutation. Significant improvements in lung function were seen in the ivacaftor treatment group compared to placebo.,Allele A is associated with response to ivacaftor in children with Cystic Fibrosis.,A,Pediatric,,Is,Associated with,,response to,,in children with,Disease:Cystic Fibrosis,,,
3,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1183629335,Variant Drug Annotation,24066763,Genotype AA is associated with response to ivacaftor in women with Cystic Fibrosis.,1183629335,rs75527207,CFTR,ivacaftor,24066763,Efficacy,not stated,"Case report of a female homozygous for the G551D CFTR mutation (genotype AA) in which ivacaftor was efficacious: increased absolute change in percent of predicted FEV1, increased weight and walk distance and decreased sweat chloride levels over a 12 month course with no sign of plateau to date.",Genotype AA is associated with response to ivacaftor in women with Cystic Fibrosis.,AA,,,Is,Associated with,,response to,,in women with,Disease:Cystic Fibrosis,,,
4,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1448423752,Variant Drug Annotation,27773592,Genotypes AA + AG is associated with increased response to ivacaftor in people with Cystic Fibrosis as compared to genotype GG.,1448423752,rs75527207,CFTR,ivacaftor,27773592,Efficacy,yes,The outcome of change in sweat chloride was correlated with change in FEV1 in patients with cystic fibrosis and found to have improved results for both.,Genotypes AA + AG is associated with increased response to ivacaftor in people with Cystic Fibrosis as compared to genotype GG.,AA + AG,Pediatric,,Is,Associated with,increased,response to,,in people with,Disease:Cystic Fibrosis,,GG,
5,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1449191908,Variant Drug Annotation,25682022,Allele A is associated with response to ivacaftor in people with Cystic Fibrosis.,1449191908,rs75527207,CFTR,ivacaftor,25682022,Efficacy,not stated,Study was an expanded access program targeted at patients with severe lung disease and was not powered to determine efficacy. Majority of patients reported an improvement in FEV following 24 weeks of treatment.,Allele A is associated with response to ivacaftor in people with Cystic Fibrosis.,A,Pediatric,,Is,Associated with,,response to,,in people with,Disease:Cystic Fibrosis,,,
6,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1449192055,Variant Drug Annotation,28711222,Allele A is associated with response to ivacaftor in people with Cystic Fibrosis.,1449192055,rs75527207,CFTR,ivacaftor,28711222,Efficacy,yes,"G551D allele. Statistically significant increases in FEV1, weight and BMI and statistically significant decreases in sweat chloride level, the number of days of antibiotic treatment and in the use of some maintenance treatments.; No differences in bone density, pancreatic insufficiency and cystic fibrosis related diabetes were observed.",Allele A is associated with response to ivacaftor in people with Cystic Fibrosis.,A,Pediatric,,Is,Associated with,,response to,,in people with,Disease:Cystic Fibrosis,,,
7,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1449192093,Variant Drug Annotation,25311995,Allele A is associated with response to ivacaftor in people with Cystic Fibrosis.,1449192093,rs75527207,CFTR,ivacaftor,25311995,Efficacy,not stated,"G551 D allele. Increases in FEV1, body weight, CFQ-R scores and time to first pulmonary exacerbation were observed.",Allele A is associated with response to ivacaftor in people with Cystic Fibrosis.,A,Pediatric,,Is,Associated with,,response to,,in people with,Disease:Cystic Fibrosis,,,
8,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1449192439,Variant Drug Annotation,28611235,Allele A is associated with response to ivacaftor in people with Cystic Fibrosis.,1449192439,rs75527207,CFTR,ivacaftor,28611235,Efficacy,yes,"G551D allele. FEV1, Alfred wellness score, exercise time, CFQ-R score and sweat chloride levels showed a significant improvement following ivacaftor treatment as compared to placebo while other outcomes (VO2, ventilation, cardiac response nd recovery following exercise) did not.",Allele A is associated with response to ivacaftor in people with Cystic Fibrosis.,A,,,Is,Associated with,,response to,,in people with,Disease:Cystic Fibrosis,,,
9,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1449192481,Variant Drug Annotation,26135562,Allele A is associated with response to ivacaftor in people with Cystic Fibrosis.,1449192481,rs75527207,CFTR,ivacaftor,26135562,Efficacy,yes,"G551D allele. Analysis of CFQ-R scores from participants in the STRIVE trial. Scores for eating problems, health perceptions, physical functioning, respiratory symptoms, social functioning, treatment burden and vitality showed significant improvements following ivacaftor treatment.",Allele A is associated with response to ivacaftor in people with Cystic Fibrosis.,A,Pediatric,,Is,Associated with,,response to,,in people with,Disease:Cystic Fibrosis,,,


In [138]:
df_981755803_fa

Unnamed: 0,Clinical Annotation ID,Variant/Haplotypes,Gene,Level of Evidence,Phenotype Category,Drug(s),Phenotype(s),Evidence ID,Evidence Type,PMID,Summary,Variant Annotation ID,Variant/Haplotypes_var_fa,Gene_var_fa,Drug(s)_var_fa,PMID_var_fa,Phenotype Category_var_fa,Significance,Notes,Sentence,Alleles,Specialty Population,Assay type,Metabolizer types,isPlural,Is/Is Not associated,Direction of effect,Functional terms,Gene/gene product,When treated with/exposed to/when assayed with,Multiple drugs And/or,Cell type,Comparison Allele(s) or Genotype(s),Comparison Metabolizer types
0,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1043737620,Variant Functional Assay Annotation,23757361,Allele A is associated with increased activity of CFTR when treated with ivacaftor in transfected CHO cells.,1043737620,rs75527207,CFTR,ivacaftor,23757361,Efficacy,yes,compared to no treatment. Ivacaftor stimulated CFTR activity in CFTR-G551D expressing CHO cells (as measured by iodine efflux).,Allele A is associated with increased activity of CFTR when treated with ivacaftor in transfected CHO cells.,A,,,,Is,Associated with,increased,activity of,CFTR,when treated with,,in transfected CHO cells,,
1,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1043737636,Variant Functional Assay Annotation,23891399,Allele A is associated with activity of CFTR when treated with ivacaftor in FRT cell lines.,1043737636,rs75527207,CFTR,ivacaftor,23891399,Efficacy,yes,G551D allele. 55.3 fold increase in chloride transport upon ivacaftor treatment as compared to baseline (no ivacaftor treatment).,Allele A is associated with activity of CFTR when treated with ivacaftor in FRT cell lines.,A,,,,Is,Associated with,,activity of,CFTR,when treated with,,in FRT cell lines,,


In [94]:
df_981755803_pheno

Unnamed: 0,Clinical Annotation ID,Variant/Haplotypes,Gene,Level of Evidence,Phenotype Category,Drug(s),Phenotype(s),Evidence ID,Evidence Type,PMID,Summary,Variant Annotation ID,Variant/Haplotypes_var_pheno,Gene_var_pheno,Drug(s)_var_pheno,PMID_var_pheno,Phenotype Category_var_pheno,Significance,Notes,Sentence,Alleles,Specialty Population,Metabolizer types,isPlural,Is/Is Not associated,Direction of effect,Side effect/efficacy/other,Phenotype,Multiple phenotypes And/or,When treated with/exposed to/when assayed with,Multiple drugs And/or,Population types,Population Phenotypes or diseases,Multiple phenotypes or diseases And/or,Comparison Allele(s) or Genotype(s),Comparison Metabolizer types
0,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1448267532,Variant Phenotype Annotation,27745802,Genotypes AA + AG is associated with decreased severity of bone density when treated with ivacaftor in people with Cystic Fibrosis as compared to genotype GG.,1448267532,rs75527207,CFTR,ivacaftor,27745802,Efficacy,yes,Bone mineral density compared before and after 1 year of treatment with ivacaftor using dual energy X-ray absorptiometry at the L2-L4 lumbar spine. All patients were pancreatic insufficient.,Genotypes AA + AG is associated with decreased severity of bone density when treated with ivacaftor in people with Cystic Fibrosis as compared to genotype GG.,AA + AG,,,Is,Associated with,decreased,severity of,Side Effect:bone density,and,when treated with,,in people with,Disease:Cystic Fibrosis,,GG,
1,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1449192031,Variant Phenotype Annotation,28651844,Allele A is associated with decreased likelihood of cystic fibrosis pulmonary exacerbation when treated with ivacaftor in people with Cystic Fibrosis.,1449192031,rs75527207,CFTR,ivacaftor,28651844,Efficacy,yes,G551D allele. Patients receiving ivacaftor treatment had a reduced rate of pulmonary exacerbation events compared to patients receiving a placebo.,Allele A is associated with decreased likelihood of cystic fibrosis pulmonary exacerbation when treated with ivacaftor in people with Cystic Fibrosis.,A,Pediatric,,Is,Associated with,decreased,likelihood of,Disease:cystic fibrosis pulmonary exacerbation,and,when treated with,,in people with,Disease:Cystic Fibrosis,,,


In [93]:
# Comparing number of PMIDs vs. number of evidence
len(set(df_981755803_drug['PMID']) | set(df_981755803_pheno['PMID']) | set(df_981755803_fa['PMID']))

28

#### Observations so far
Clinical annotation [981755803](https://www.pharmgkb.org/clinicalAnnotation/981755803) has 30 pieces of supporting evidence:
* 24 variant/drug annotations
* 2 variant/functional assay annotations
* 2 variant/phenotype annotations
* 2 others (drug labels & guidelines, present in another data download so not included here)

Each variant annotation is associated with a PMID, and the relationship is 1:1 (at least in this example).
* We should think about whether we want to preserve the PMID & evidence associations.

These annotations seem much more specific than the clinical annotations, e.g.
* they distinguish between "disease" and "side effect" (check how often)
* if there are multiple phenotypes or drugs, they specify whether these should be "and"s or "or"s
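For the "check how often" note, a rough sketch of counting the type prefixes in the `Phenotype` column. The values are copied from the `var_pheno_ann.head()` output above; the splitting rule (comma-separated entries, each starting with a `<Type>:` prefix) is an assumption based on those five rows and may not hold for the whole table.

```python
import re
from collections import Counter

# Phenotype values copied from the var_pheno_ann.head() output above
phenotype_values = [
    'Side Effect:Maculopapular Exanthema, Side Effect:severe cutaneous adverse reactions, '
    'Side Effect:Stevens-Johnson Syndrome',
    'Disease:Hypersensitivity',
    'Disease:Hypersensitivity',
    'Other:Tobacco Use Disorder',
    'Side Effect:Ototoxicity',
]

# Assumption: entries are comma-separated and each starts with a "<Type>:" prefix.
# Splitting only on commas followed by such a prefix avoids breaking names that contain commas.
ENTRY_SEPARATOR = re.compile(r',\s*(?=[A-Z][A-Za-z ]*:)')


def phenotype_types(value):
    """Return the type prefix of each phenotype entry in one cell."""
    return [entry.split(':', 1)[0] for entry in ENTRY_SEPARATOR.split(value)]


type_counts = Counter(t for value in phenotype_values for t in phenotype_types(value))
print(type_counts)
```

On the full table this would be run over `var_pheno_ann['Phenotype'].dropna()` instead of the hard-coded list.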

These annotations are specific to one or more alleles or genotypes, so we will need to associate them accordingly.

It would be good if we could select the relevant columns from the 3 variant annotation tables and merge them into a unified representation, so we don't have to manage them separately in the pipelines or in the UI.
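One possible shape for such a unified table, sketched with toy data. The output column names `Effect term` and `Source table` are hypothetical. Treating `PD/PK terms`, `Functional terms` and `Side effect/efficacy/other` as playing the same role is an assumption based on where they sit in the example sentences, not something the PharmGKB readme states.

```python
import pandas as pd

# Toy stand-ins for the three tables, with values abbreviated from the examples above
toy_drug = pd.DataFrame({
    'Variant Annotation ID': ['981755665', '1448423752'],
    'Direction of effect': [None, 'increased'],
    'PD/PK terms': ['response to', 'response to'],
})
toy_fa = pd.DataFrame({
    'Variant Annotation ID': ['1043737620'],
    'Direction of effect': ['increased'],
    'Functional terms': ['activity of'],
})
toy_pheno = pd.DataFrame({
    'Variant Annotation ID': ['1448267532'],
    'Direction of effect': ['decreased'],
    'Side effect/efficacy/other': ['severity of'],
})

# Hypothetical mapping: source table name -> (table, column holding the "effect" term)
sources = {
    'drug': (toy_drug, 'PD/PK terms'),
    'fa': (toy_fa, 'Functional terms'),
    'pheno': (toy_pheno, 'Side effect/efficacy/other'),
}

# Rename each table-specific term column to a shared name and record the origin table
unified = pd.concat(
    [
        df.rename(columns={term_col: 'Effect term'}).assign(**{'Source table': name})
        for name, (df, term_col) in sources.items()
    ],
    ignore_index=True,
)
print(unified)
```

Keeping the source table as a column means downstream code can still filter by annotation type if only one kind of effect turns out to be in scope.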


## Coverage

[Top of page](#Table-of-contents)

How many annotations have evidence, and how many have direction of effect specifically?

In [97]:
len(clinical_annotations)

5111

In [129]:
# Exploded on evidence - average 3 per annotation
len(main_df)

15129

In [130]:
# Add all the var annotation tables - this will be a huge mess
main_with_var = pd.merge(main_df, var_drug_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='left', suffixes=(None, '_var_drug'))
main_with_var = pd.merge(main_with_var, var_pheno_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='left', suffixes=(None, '_var_pheno'))
main_with_var = pd.merge(main_with_var, var_fa_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='left', suffixes=(None, '_var_fa'))
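Because these merges use `suffixes=(None, ...)`, which table an unsuffixed column comes from depends on whether that name already existed on the left-hand side. A minimal sketch with toy frames (all values invented) showing where each column ends up:

```python
import pandas as pd

# Toy stand-ins: the main table has PMID but no Sentence column
toy_main = pd.DataFrame({'Evidence ID': ['1', '2', '3'], 'PMID': ['p1', 'p2', 'p3']})
toy_drug = pd.DataFrame({'Variant Annotation ID': ['1'], 'PMID': ['p1'], 'Sentence': ['drug sentence']})
toy_pheno = pd.DataFrame({'Variant Annotation ID': ['2'], 'PMID': ['p2'], 'Sentence': ['pheno sentence']})

merged = pd.merge(toy_main, toy_drug, left_on='Evidence ID', right_on='Variant Annotation ID',
                  how='left', suffixes=(None, '_var_drug'))
merged = pd.merge(merged, toy_pheno, left_on='Evidence ID', right_on='Variant Annotation ID',
                  how='left', suffixes=(None, '_var_pheno'))

# 'PMID' is still the main table's column; the var/drug copy was renamed to 'PMID_var_drug'.
# 'Sentence' did not exist in the main table, so the unsuffixed column is the var/drug one.
print(list(merged.columns))
```

So in `main_with_var`, `Sentence` holds the var/drug sentence, whereas `PMID` holds the evidence table's PMID and the var/drug copy lives in `PMID_var_drug`.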

In [144]:
ca_with_var_evidence = set(main_with_var[main_with_var['Sentence'].notna() | main_with_var['Sentence_var_pheno'].notna() | main_with_var['Sentence_var_fa'].notna()][ID_COL_NAME])

In [147]:
# Every clinical annotation has at least one variant annotation as supporting evidence
len(ca_with_var_evidence)

5111

In [205]:
print('Evidence from var/drug', len(set(main_with_var[main_with_var['Sentence'].notna()][ID_COL_NAME])))
print('Evidence from var/pheno', len(set(main_with_var[main_with_var['Sentence_var_pheno'].notna()][ID_COL_NAME])))
print('Evidence from var/fa', len(set(main_with_var[main_with_var['Sentence_var_fa'].notna()][ID_COL_NAME])))

Evidence from var/drug 2435
Evidence from var/pheno 2958
Evidence from var/fa 418


In [196]:
def main_with_var_where_notna(common_col_name):
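    # Filter main_with_var on non-na columns that are common to all three variant annotation tables
    # Note: for names also present in main_df (e.g. 'PMID'), the unsuffixed column is main_df's, not var_drug's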
    return main_with_var[
        main_with_var[common_col_name].notna()
        | main_with_var[f'{common_col_name}_var_pheno'].notna()
        | main_with_var[f'{common_col_name}_var_fa'].notna()
    ]

In [197]:
def main_with_var_values_in(common_col_name):
    # Return set of values in given column, common to all three variant annotation tables
    return (
        set(main_with_var[common_col_name]) 
        | set(main_with_var[f'{common_col_name}_var_pheno']) 
        | set(main_with_var[f'{common_col_name}_var_fa'])
    )

In [231]:
ca_with_doe_evidence = set(main_with_var_where_notna('Direction of effect')[ID_COL_NAME])

In [232]:
# Most contain some kind of direction of effect info
len(ca_with_doe_evidence)

4917

In [230]:
4917 / 5111

0.9620426531011543

In [234]:
print('Total', len(main_with_var))  # i.e. clinical annotations exploded by evidence id
print('With variant annotation', len(main_with_var_where_notna('Variant Annotation ID')))
print('With PMID', len(main_with_var_where_notna('PMID')))
print('With allele', len(main_with_var_where_notna('Alleles')))
print('With comparison allele', len(main_with_var_where_notna('Comparison Allele(s) or Genotype(s)')))

Total 15129
With variant annotation 14658
With PMID 14658
With allele 14248
With comparison allele 12464


In [199]:
# Suddenly worried about counts
# All variant annotation IDs in all three tables
len(set(var_drug_ann['Variant Annotation ID']) | set(var_fa_ann['Variant Annotation ID']) | set(var_pheno_ann['Variant Annotation ID']))

27427

In [200]:
# All evidence IDs - includes variant annotations and drug labels
len(set(main_df['Evidence ID']))

13783

In [180]:
all_var_ann_ids = set(var_drug_ann['Variant Annotation ID']) | set(var_fa_ann['Variant Annotation ID']) | set(var_pheno_ann['Variant Annotation ID'])
all_ev_ids = set(main_df['Evidence ID'])

# Not all variant annotation evidence is used
len(all_var_ann_ids - all_ev_ids)

13778

#### Observations so far:
* Every clinical annotation has at least one variant annotation as supporting evidence
* Most contain some kind of direction of effect info (i.e. in one of the three tables)
   * => coverage is good, assuming we care about all three types of effects
* Selecting one table covers at most about 58% of the clinical annotations (var/pheno: 2958 of 5111)
* About half of the variant annotations (13778 of 27427) are not used as evidence for any clinical annotation
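The proportions behind these observations can be recomputed from the counts printed above. The sketch below hard-codes those counts rather than re-deriving them from the tables:

```python
# Counts reported by the cells above
total_clinical_annotations = 5111
with_evidence_from = {'var/drug': 2435, 'var/pheno': 2958, 'var/fa': 418}
with_direction_of_effect = 4917
total_variant_annotations = 27427
unused_variant_annotations = 13778

# Fraction of clinical annotations with at least one piece of evidence from each table
coverage = {table: n / total_clinical_annotations for table, n in with_evidence_from.items()}
doe_coverage = with_direction_of_effect / total_clinical_annotations
unused_fraction = unused_variant_annotations / total_variant_annotations

for table, fraction in coverage.items():
    print(f'{table}: {fraction:.1%} of clinical annotations')
print(f'Any direction of effect: {doe_coverage:.1%} of clinical annotations')
print(f'Variant annotations not used as evidence: {unused_fraction:.1%}')
```

The per-table fractions sum to more than 100% because a clinical annotation can draw evidence from more than one table.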

## Direction of effect

[Top of page](#Table-of-contents)

In [None]:
# Trying to make sense of the columns - output suppressed for brevity
main_with_var.columns

In [133]:
all_var_ann_cols = set(var_drug_ann.columns) | set(var_fa_ann.columns) | set(var_pheno_ann.columns)
common_var_ann_cols = set(var_drug_ann.columns) & set(var_fa_ann.columns) & set(var_pheno_ann.columns)

In [134]:
common_var_ann_cols

{'Alleles',
 'Comparison Allele(s) or Genotype(s)',
 'Comparison Metabolizer types',
 'Direction of effect',
 'Drug(s)',
 'Gene',
 'Is/Is Not associated',
 'Metabolizer types',
 'Multiple drugs And/or',
 'Notes',
 'PMID',
 'Phenotype Category',
 'Sentence',
 'Significance',
 'Specialty Population',
 'Variant Annotation ID',
 'Variant/Haplotypes',
 'isPlural'}

In [135]:
unique_var_ann_cols = all_var_ann_cols - common_var_ann_cols

In [136]:
# annotate with origin table
annotated_unique_var_ann_cols = {'drug':[], 'fa':[], 'pheno':[]}
for c in unique_var_ann_cols:
    if c in var_drug_ann.columns:
        annotated_unique_var_ann_cols['drug'].append(c)
    if c in var_fa_ann.columns:
        annotated_unique_var_ann_cols['fa'].append(c)
    if c in var_pheno_ann.columns:
        annotated_unique_var_ann_cols['pheno'].append(c)

In [137]:
annotated_unique_var_ann_cols

{'drug': ['Multiple phenotypes or diseases And/or',
  'Population types',
  'Population Phenotypes or diseases',
  'PD/PK terms'],
 'fa': ['Cell type',
  'Functional terms',
  'When treated with/exposed to/when assayed with',
  'Gene/gene product',
  'Assay type'],
 'pheno': ['Multiple phenotypes or diseases And/or',
  'Side effect/efficacy/other',
  'Multiple phenotypes And/or',
  'When treated with/exposed to/when assayed with',
  'Population types',
  'Population Phenotypes or diseases',
  'Phenotype']}

#### Sentence breakdown

[Top of page](#Table-of-contents)

* "Population Phenotypes or diseases" always goes with "Multiple phenotypes or diseases And/or" (note the functional analysis table has no phenotype columns; I guess there are no multiples for the gene product?)
    * where does the main table phenotype come from? OT is overwriting this now but I'm still confused
* "Drug(s)" (main table) always goes with "Multiple drugs And/or"
* "Comparison Allele(s) or Genotype(s)" in theory tells us something about the reference or baseline, but it is not always present

Sentence examples:
* **drug**: "Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis."
    * alleles = "AA + AG"
    * direction of effect = [none]
    * pd/pk term = "response to"
    * drug = "ivacaftor"
    * population types = "people"
    * population phenotype = "cystic fibrosis"
    * comparison alleles/genotypes = [none]
* **fa**: "Allele A is associated with increased activity of CFTR when treated with ivacaftor in transfected CHO cells."
    * alleles = "A"
    * direction of effect = "increased"
    * functional term = "activity of"
    * gene/gene product = "CFTR"
    * when treated with/exposed to/assayed with = "when treated with"
    * drug = "ivacaftor"
    * cell type = "transfected CHO cells"
    * comparison alleles/genotypes = [none]
* **pheno**: "Genotypes AA + AG is associated with decreased severity of bone density when treated with ivacaftor in people with Cystic Fibrosis as compared to genotype GG."
    * alleles = "AA + AG"
    * direction of effect = "decreased"
    * side effect/efficacy/other = "severity of"
    * phenotype = "bone density" **< note distinction between this and population phenotype**
    * when treated with/exposed to/assayed with = "when treated with"
    * drug = "ivacaftor"
    * population types = "people"
    * population phenotype = "cystic fibrosis"
    * comparison alleles/genotypes = "GG"

For the simplest direction of effect annotation (i.e. not including the population, cell type, etc.), I think we only strictly need the following:
1. direction of effect
2. pd/pk term | functional term | side effect/efficacy/other
3. drug | gene/gene product | phenotype

This tells us the direction (1) and what the effect is (2&3).
Of course we also need the alleles to associate with the appropriate evidence string, maybe also the comparison alleles/genotypes when present, and the "is/is not associated" column so we don't report negative results (unless we want to).

Maybe also just the origin of the evidence (variant/drug, variant/phenotype, or functional analysis) is useful.
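
A rough sketch of what pulling out this minimal set could look like. Column names are the ones seen in the tables above; the function name, the output keys, and the assumption that negative results are recorded as "Not associated with" are mine and would need checking against the data. Rows are plain dicts with `None` for missing values (NaN from pandas would need converting first).

```python
# Which columns hold the effect term (2) and its object (3) for each origin table.
EFFECT_COLUMNS = {
    'drug': ('PD/PK terms', 'Drug(s)'),
    'fa': ('Functional terms', 'Gene/gene product'),
    'pheno': ('Side effect/efficacy/other', 'Phenotype'),
}

def minimal_doe(row, origin, include_negative=False):
    """Return the minimal direction-of-effect fields for one variant annotation row."""
    association = row.get('Is/Is Not associated') or ''
    # Assumption: negative results start with "Not" (e.g. "Not associated with").
    if association.lower().startswith('not') and not include_negative:
        return None
    term_col, object_col = EFFECT_COLUMNS[origin]
    return {
        'origin': origin,
        'alleles': row.get('Alleles'),
        'comparison': row.get('Comparison Allele(s) or Genotype(s)'),
        'direction': row.get('Direction of effect'),
        'effect_term': row.get(term_col),
        'effect_object': row.get(object_col),
    }

# The functional analysis example sentence from above, as a row.
fa_row = {
    'Alleles': 'A',
    'Is/Is Not associated': 'Associated with',
    'Direction of effect': 'increased',
    'Functional terms': 'activity of',
    'Gene/gene product': 'CFTR',
    'Comparison Allele(s) or Genotype(s)': None,
}
fa_doe = minimal_doe(fa_row, 'fa')
```

Keeping `origin` in the output covers the last point above, since the same three slots mean slightly different things in each table.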

#### Vocabulary

[Top of page](#Table-of-contents)

Check some vocabulary - how variable or consistent are the most critical terms, are they using fixed vocab, etc.

In [161]:
# DoE fixed vocab according to readme
set(main_with_var['Direction of effect']) | set(main_with_var['Direction of effect_var_pheno']) | set(main_with_var['Direction of effect_var_fa'])

{'decreased', 'increased', nan}

In [158]:
set(main_with_var['PD/PK terms'])  # not limited according to readme

{'clearance of',
 'clinical benefit to',
 'concentrations of',
 'discontinuation of',
 'dose of',
 'dose-adjusted trough concentrations of',
 'exposure to',
 'half-life time of',
 'metabolism of',
 nan,
 'resistance to',
 'response to',
 'steady-state concentration of',
 'time to response to',
 'trough concentration of'}

In [159]:
set(main_with_var['Functional terms'])  # not limited

{'activity of',
 'affinity to',
 'catalytic activity of',
 'clearance of',
 'concentrations of',
 'enzyme activity of',
 'expression of',
 'formation of',
 'glucuronidation of',
 'half-life of',
 'inhibition of',
 'metabolism of',
 nan,
 'protein stability of',
 'sensitivity to',
 'steady-state level of',
 'sulfation of',
 'transcription of',
 'transport of',
 'uptake of'}

In [160]:
set(main_with_var['Side effect/efficacy/other'])  # limited

{'age at onset of', 'likelihood of', nan, 'risk of', 'severity of'}

I assume the final term (drug | gene/gene product | phenotype) will vary, but hopefully we can also map it to the relevant domain if needed (ChEMBL, EFO, Ensembl). Drug and gene terms look about as you'd expect; as usual, phenotype is the most diverse (see below; the readme explicitly states phenotype is not standardized).

Otherwise this is a pretty small and consistent set of terms for ca. 15000 rows, though I don't think we can assume the vocab won't grow (except the actual direction word, we should be good there).
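
Since the vocab might grow, a cheap guard is to compare each column against the terms seen so far and flag anything new. A minimal sketch, where the function names are hypothetical and `'unchanged'` is an invented value used only to show a hit:

```python
import math

# Vocabularies copied from the outputs above.
KNOWN_DIRECTIONS = {'increased', 'decreased'}
KNOWN_SIDE_EFFECT_TERMS = {'age at onset of', 'likelihood of', 'risk of', 'severity of'}

def is_missing(value):
    """True for None and float NaN, the two ways a missing cell can show up."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def unexpected_terms(values, known):
    """Return the non-missing values that are not in the known vocabulary."""
    return {v for v in values if not is_missing(v) and v not in known}

# 'unchanged' is a made-up value, not something seen in the data.
new_directions = unexpected_terms(
    ['increased', float('nan'), 'decreased', 'unchanged'], KNOWN_DIRECTIONS
)
```

An empty result means the column still matches the known set; anything else is worth a look before generating evidence strings.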

In [228]:
list(set(main_with_var['Phenotype']))[1:11]

['PK:differences in exposure to the active metabolite of prasugrel',
 'Disease:Endometrial Neoplasms',
 '"Disease:Epidermal Necrolysis, Toxic", "Disease:Stevens-Johnson Syndrome"',
 'Side Effect:total hemorrhage and major hemorrhage',
 'Other:subjective feelings of intoxication, stimulation, sedation, and happiness',
 'PK:plasma oxymorphone/oxycodone ratio',
 'Disease:Hematologic Diseases',
 'Side Effect:Venous Thrombosis',
 'Efficacy:non-remission',
 'Side Effect:Leukopenia']

In [219]:
len(set(main_with_var['Phenotype']))

1735

In [216]:
# Dirty attempt to get the prefix - doesn't account for multiples
set(main_with_var['Phenotype'].dropna().apply(lambda p: p.split(':')[0].strip('"')))

{'Disease', 'Efficacy', 'Other', 'PK', 'Side Effect'}

In [218]:
len(set(main_with_var['Phenotype'].dropna().apply(lambda p: p.split(':')[1].strip('"'))))

1499

The prefix looks to be fixed and always present, which is nice and honestly kind of surprising. The rest, I think, can either come from a body of PharmGKB terms or be filled in freely. We could perhaps map them (with OLS or NLP).

Same is true for population phenotypes when present:

In [226]:
# var_fa does not have the column so I can't use my nice function :(
(
    set(main_with_var['Population Phenotypes or diseases'].dropna().apply(lambda p: p.split(':')[0].strip('"'))) 
    | set(main_with_var['Population Phenotypes or diseases_var_pheno'].dropna().apply(lambda p: p.split(':')[0].strip('"'))) 
)

{'Disease', 'Efficacy', 'Other', 'PK', 'Side Effect'}
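
The prefix extraction above ignores cells with multiple phenotypes. A slightly less dirty sketch splits on commas only when the next entry starts with one of the five known prefixes, so commas inside a term are left alone. The function name is hypothetical, and this assumes no term itself contains something like `, PK:`.

```python
import re

# The five prefixes observed above.
PREFIXES = ('Disease', 'Efficacy', 'Other', 'PK', 'Side Effect')

# Split on a comma only if what follows (optionally quoted) is "<prefix>:".
_ENTRY_SPLIT = re.compile(r',\s*(?="?(?:' + '|'.join(PREFIXES) + r'):)')

def parse_phenotypes(cell):
    """Split a phenotype cell into (prefix, term) pairs."""
    parsed = []
    for entry in _ENTRY_SPLIT.split(cell):
        entry = entry.strip().strip('"')
        prefix, sep, term = entry.partition(':')
        # Entries without a colon are kept whole with no prefix.
        parsed.append((prefix, term) if sep else (None, entry))
    return parsed

multi = parse_phenotypes(
    '"Disease:Epidermal Necrolysis, Toxic", "Disease:Stevens-Johnson Syndrome"'
)
single = parse_phenotypes(
    'Other:subjective feelings of intoxication, stimulation, sedation, and happiness'
)
```

Both example cells are taken from the `Phenotype` output above: the first yields two `Disease` entries, the second stays a single `Other` entry despite its commas.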

## Alleles and genotypes

[Top of page](#Table-of-contents)

In [None]:
# Visual inspection of alleles - output suppressed for brevity
main_with_var_values_in('Alleles')

In [None]:
main_with_var_values_in('Comparison Allele(s) or Genotype(s)')

Alleles and comparison alleles look relatively consistent with what's in the alleles table:
* SNP genotype `C/T` or `CC` (annoying)
* SNP allele `C`
* indel `GGGGAGCTTTCCCAGAGACCC/del`
* named allele `*17` or `HTTLPR short form (S allele)`
* named genotype `*2/*4`
* combinations of the above delineated with `+`
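
To get a feel for how these forms could be told apart automatically, here is a rough regex-based sketch. The category labels and function names are hypothetical, and the rules are guesses from the examples above: a two-letter string like `CC` is assumed to be a genotype rather than a two-base allele, and a named allele containing `+` would be split incorrectly.

```python
import re

def classify_allele(token):
    """Assign one allele/genotype string to a rough category."""
    token = token.strip()
    if re.fullmatch(r'[ACGT]', token):
        return 'snp_allele'
    if re.fullmatch(r'[ACGT]{2}|[ACGT]/[ACGT]', token):
        return 'snp_genotype'
    if re.fullmatch(r'(?:[ACGT]+|del)/(?:[ACGT]+|del)', token):
        return 'indel'
    if re.fullmatch(r'\*\w+/\*\w+', token):
        return 'named_genotype'
    # Catch-all: anything not matched above is treated as a named allele.
    return 'named_allele'

def parse_alleles(cell):
    """Split a cell on '+' and classify each part."""
    return [(t.strip(), classify_allele(t)) for t in cell.split('+')]

combo = parse_alleles('AA + AG')
```

Running this over the real `Alleles` column and counting categories would show how much falls into the catch-all, which is where the manual inspection effort should go.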

In [203]:
variant_haps = main_with_var_values_in('Variant/Haplotypes')

[v for v in variant_haps if pd.notna(v) and not (v.startswith('rs') or '*' in v)]

['G6PD Canton, Taiwan-Hakka, Gifu-like, Agrigento-like',
 'G6PD A- 202A_376G',
 'G6PD A- 202A_376G, G6PD B (reference)',
 'CYP1A2 high activity',
 'SLC6A4 HTTLPR long form (L allele), SLC6A4 HTTLPR short form (S allele)',
 'SLC6A4 HTTLPR short form (S allele)',
 'CYP2D6 poor metabolizer genotype',
 'CYP1A2 low activity',
 'CYP2A6 poor metabolizer genotype',
 'CYP2D6 ultrarapid metabolizer genotype',
 'CYP2D6 low activity',
 'CYP2A6 low activity',
 'G6PD B (reference), G6PD Mediterranean Haplotype',
 'CYP2C19 poor metabolizer phenotype',
 'CYP2C19 poor metabolizers',
 'CYP2C19 poor metabolizer genotype',
 'CYP2D6 poor metabolizer phenotype',
 'CYP3A4 low activity',
 'TPMT intermediate metabolizer phenotype',
 'G6PD deficiency',
 'NAT2 slow acetylator',
 'CYP2D6 ultrarapid metabolizer phenotype',
 'CYP2D6 poor and ultrarapid metabolizers',
 'CYP2D6 poor metabolizer and intermediate metabolizer genotypes',
 'CYP2D6 normal metabolizer and ultrarapid metabolizer genotypes',
 'CYP2C19 normal

Some of these are just named (non-star) alleles, but things like "CYP2A6 poor metabolizer genotype" are where the comparison metabolizer types get used as opposed to comparison alleles.

Not sure what to do about these - we can't easily associate them with an allele or genotype, only with the clinical annotation as a whole.

Here's [one example](https://www.pharmgkb.org/clinicalAnnotation/1139506787) - the clinical annotation has many haplotype-level annotations, but the variant annotation is only given for "poor metabolizer genotype".
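
One way to separate these metabolizer-style values from real named alleles is a keyword match. A minimal sketch: the keyword list comes from the values printed above and is almost certainly incomplete, and the function name is hypothetical.

```python
import re

# Keywords seen in the output above; an assumption, not an official vocabulary.
FUNCTION_TERM = re.compile(r'metabolizer|activity|acetylator|deficiency', re.IGNORECASE)

def is_function_term(variant_haplotype):
    """True if the value describes a metabolizer/activity group rather than a specific allele."""
    return bool(FUNCTION_TERM.search(variant_haplotype))

# Values taken from the outputs in this notebook.
values = [
    'CYP2A6 poor metabolizer genotype',
    'NAT2 slow acetylator',
    'G6PD A- 202A_376G',
    'SLC6A4 HTTLPR short form (S allele)',
    'CYP2A6*2',
]
flagged = [v for v in values if is_function_term(v)]
```

Counting flagged values across all variant annotations would give a first estimate of how many annotations can only be attached at the clinical annotation level.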

In [240]:
def main_with_var_where_equals(common_col_name, value):
    # Rows of main_with_var where a column common to all three variant annotation tables
    # equals value (checks the unsuffixed, _var_pheno and _var_fa copies of the column)
    return main_with_var[
        (main_with_var[common_col_name] == value)
        | (main_with_var[f'{common_col_name}_var_pheno'] == value)
        | (main_with_var[f'{common_col_name}_var_fa'] == value)
    ]

In [243]:
main_with_var_where_equals('Variant/Haplotypes', 'CYP2A6 poor metabolizer genotype')[[
    'Clinical Annotation ID', 'Variant/Haplotypes', 'Gene',
       'Level of Evidence', 'Phenotype Category', 'Drug(s)', 'Phenotype(s)',
       'Evidence ID', 'Evidence Type', 'PMID', 'Summary',
       'Variant Annotation ID_var_pheno', 'Variant/Haplotypes_var_pheno',
       'Gene_var_pheno', 'Drug(s)_var_pheno', 'PMID_var_pheno',
       'Phenotype Category_var_pheno', 'Significance_var_pheno',
       'Notes_var_pheno', 'Sentence_var_pheno', 'Alleles_var_pheno',
       'Specialty Population_var_pheno', 'Metabolizer types_var_pheno',
       'isPlural_var_pheno', 'Is/Is Not associated_var_pheno',
       'Direction of effect_var_pheno', 'Side effect/efficacy/other',
       'Phenotype', 'Multiple phenotypes And/or',
       'When treated with/exposed to/when assayed with',
       'Multiple drugs And/or_var_pheno', 'Population types_var_pheno',
       'Population Phenotypes or diseases_var_pheno',
       'Multiple phenotypes or diseases And/or_var_pheno',
       'Comparison Allele(s) or Genotype(s)_var_pheno',
       'Comparison Metabolizer types_var_pheno'
]]

Unnamed: 0,Clinical Annotation ID,Variant/Haplotypes,Gene,Level of Evidence,Phenotype Category,Drug(s),Phenotype(s),Evidence ID,Evidence Type,PMID,Summary,Variant Annotation ID_var_pheno,Variant/Haplotypes_var_pheno,Gene_var_pheno,Drug(s)_var_pheno,PMID_var_pheno,Phenotype Category_var_pheno,Significance_var_pheno,Notes_var_pheno,Sentence_var_pheno,Alleles_var_pheno,Specialty Population_var_pheno,Metabolizer types_var_pheno,isPlural_var_pheno,Is/Is Not associated_var_pheno,Direction of effect_var_pheno,Side effect/efficacy/other,Phenotype,Multiple phenotypes And/or,When treated with/exposed to/when assayed with,Multiple drugs And/or_var_pheno,Population types_var_pheno,Population Phenotypes or diseases_var_pheno,Multiple phenotypes or diseases And/or_var_pheno,Comparison Allele(s) or Genotype(s)_var_pheno,Comparison Metabolizer types_var_pheno
10622,1139506787,"CYP2A6*1, CYP2A6*1x2, CYP2A6*2, CYP2A6*4, CYP2A6*7, CYP2A6*9, CYP2A6*10, CYP2A6*11, CYP2A6*12, CYP2A6*13, CYP2A6*14, CYP2A6*15, CYP2A6*17, CYP2A6*19, CYP2A6*20, CYP2A6*23, CYP2A6*24, CYP2A6*25, CYP2A6*26, CYP2A6*27, CYP2A6*28, CYP2A6*35, CYP2A6*38, CYP2A6*39, CYP2A6*41, CYP2A6*46, CYP2A6*55",CYP2A6,1B,Metabolism/PK,nicotine,Tobacco Use Disorder,1183689160,Variant Phenotype Annotation,23371292,CYP2A6 poor metabolizer is associated with increased ratio of cotinine formation to removal when exposed to nicotine in nonsmokers as compared to CYP2A6 normal metabolizer.,1183689160,CYP2A6 poor metabolizer genotype,CYP2A6,nicotine,23371292,Metabolism/PK,yes,"In CYP2A6 reduced metabolizers, cotinine formation was altered less than was cotinine removal as compared to normal metabolizers. Ratios of cotinine formation to removal were 1.31 for reduced metabolizers and 1.12 for normal metabolizers . Reduced metabolizers were defined as subjects with one or two copies of *2,*4, *7,*9,*10,*12,*17,*35.",CYP2A6 poor metabolizer is associated with increased ratio of cotinine formation to removal when exposed to nicotine in nonsmokers as compared to CYP2A6 normal metabolizer.,,,poor metabolizer,Is,Associated with,increased,,PK:ratio of cotinine formation to removal,,when exposed to,,in,Other:nonsmokers,,,normal metabolizer
10623,1139506787,"CYP2A6*1, CYP2A6*1x2, CYP2A6*2, CYP2A6*4, CYP2A6*7, CYP2A6*9, CYP2A6*10, CYP2A6*11, CYP2A6*12, CYP2A6*13, CYP2A6*14, CYP2A6*15, CYP2A6*17, CYP2A6*19, CYP2A6*20, CYP2A6*23, CYP2A6*24, CYP2A6*25, CYP2A6*26, CYP2A6*27, CYP2A6*28, CYP2A6*35, CYP2A6*38, CYP2A6*39, CYP2A6*41, CYP2A6*46, CYP2A6*55",CYP2A6,1B,Metabolism/PK,nicotine,Tobacco Use Disorder,1183689165,Variant Phenotype Annotation,23371292,CYP2A6 poor metabolizer is associated with decreased ratio of plasma cotinine to urinary TNE when exposed to nicotine in smokers as compared to CYP2A6 normal metabolizer.,1183689165,CYP2A6 poor metabolizer genotype,CYP2A6,nicotine,23371292,Metabolism/PK,yes,"In CYP2A6 reduced metabolizers, the slope between urinary TNE (a measurement of tobacco exposure) and plasma cotinine was significantly lower as compared to normal metabolizers. Reduced metabolizers were defined as subjects with one or two copies of *2,*4, *7,*9,*10,*12,*17,*35.",CYP2A6 poor metabolizer is associated with decreased ratio of plasma cotinine to urinary TNE when exposed to nicotine in smokers as compared to CYP2A6 normal metabolizer.,,,poor metabolizer,Is,Associated with,decreased,,PK:ratio of plasma cotinine to urinary TNE,,when exposed to,,in,Other:smokers,,,normal metabolizer


## Bonus material

[Top of page](#Table-of-contents)

Things I thought about but haven't checked yet:
* How many so-called "alleles" are actually these metabolizer terms?
    * might inform whether we need to associate via something other than allele, if many important annotations fall under this category
* Do we have to manage contradictory information for a single clinical annotation or even for a single allele/genotype?
    * i.e. one study says genotype AA increases X, another says it decreases X, another says it decreases some other Y...
    * maybe also check whether this occurs in level 1/2 evidence especially
    * informs data structure - i.e. do fields need to be lists or strings
* How do multiple phenotypes, genes and drugs at the variant annotation level get aggregated at the clinical annotation level?
    * [981755803](https://www.pharmgkb.org/clinicalAnnotation/981755803) indicates they do _not_ include all "phenotype" and "population phenotype" for all variant annotations at the clinical annotation level
    * similarly interested in how they derive the clinical annotation sentence from all the variant annotation sentences, though it's arguably not important for our automated processing
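
For the contradiction question, a possible check is to group variant annotations by clinical annotation, alleles and effect, then look for groups with more than one direction. A sketch on invented records; the record keys and the combined `effect` string (effect term plus its object) are hypothetical:

```python
from collections import defaultdict

def find_conflicts(records):
    """Return the (annotation, alleles, effect) keys reported with more than one direction."""
    directions = defaultdict(set)
    for rec in records:
        if rec['direction'] is None:
            continue  # no direction given, so nothing to contradict
        key = (rec['ca_id'], rec['alleles'], rec['effect'])
        directions[key].add(rec['direction'])
    return {key: dirs for key, dirs in directions.items() if len(dirs) > 1}

# Toy records; IDs, alleles and effects are invented for illustration.
records = [
    {'ca_id': 'CA1', 'alleles': 'AA', 'effect': 'response to drugX', 'direction': 'increased'},
    {'ca_id': 'CA1', 'alleles': 'AA', 'effect': 'response to drugX', 'direction': 'decreased'},
    {'ca_id': 'CA1', 'alleles': 'AG', 'effect': 'response to drugX', 'direction': 'increased'},
    {'ca_id': 'CA2', 'alleles': 'CC', 'effect': 'risk of Y', 'direction': 'decreased'},
    {'ca_id': 'CA2', 'alleles': 'CC', 'effect': 'risk of Y', 'direction': None},
]
conflicts = find_conflicts(records)
```

On the real dataframe the same idea should be a `groupby` on the key columns followed by `nunique()` on the direction column. If conflicts turn out to be common, the direction field probably needs to be a list rather than a string.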
 
I also think there are at least 2 additional issues (not relating to direction of effect) that we can explore using these variant annotation tables, namely:
* Using the "and/or" column to clearly delineate drug combinations vs. drugs that are just being annotated together
* Using the additional phenotype annotations (side effect etc.) to disambiguate or supplement the phenotype information we use from the clinical annotation

## Post-meeting

* Get a few representative (?!) examples of annotations
* Join with all variant evidence _and_ all clinical_alleles
* Dump to CSV

In [285]:
# Build a clean table showing everything
complete_df = pd.merge(clinical_annotations, clinical_ann_evidence, how='left', on=ID_COL_NAME)
complete_df = pd.merge(complete_df, clinical_ann_alleles, how='left', on=ID_COL_NAME)

In [286]:
def get_annotation_tables_for_ids(ca_ids):
    # Convert IDs to str before matching (avoid shadowing the builtin `id`)
    df = complete_df[complete_df[ID_COL_NAME].isin({str(ca_id) for ca_id in ca_ids})]
    df_drug = pd.merge(df, var_drug_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='inner', suffixes=(None, '_var_drug'))
    df_pheno = pd.merge(df, var_pheno_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='inner', suffixes=(None, '_var_pheno'))
    df_fa = pd.merge(df, var_fa_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='inner', suffixes=(None, '_var_fa'))
    return df_drug, df_pheno, df_fa

In [276]:
example_ca_ids = [981755803, 1139506787, 1183888969, 1184514050, 981419266]

d, p, f = get_annotation_tables_for_ids(example_ca_ids)
d.to_csv(f'{data_dir}/example_drug.csv', index=False)
p.to_csv(f'{data_dir}/example_pheno.csv', index=False)
f.to_csv(f'{data_dir}/example_func.csv', index=False)