# Variant annotation tables & direction of effect investigation

## Table of contents

1. [Initial data exploration](#Initial-data-exploration)
2. [Coverage](#Coverage)
3. [Direction of effect](#Direction-of-effect)
4. [Alleles and genotypes](#Alleles-and-genotypes)
5. [Summary](#Summary)

In [96]:
import os
import csv
import re
from collections import Counter

import pandas as pd

from opentargets_pharmgkb.evidence_generation import ID_COL_NAME
from opentargets_pharmgkb.pandas_utils import read_tsv_to_df

In [6]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', None)

## Initial data exploration

[Top of page](#Table-of-contents)

In [4]:
data_dir = '/home/april/projects/opentargets/pharmgkb/doe'

In [5]:
# Download new data (2024-05-05)
# %cd {data_dir}  # %cd rather than !cd: a "!cd" runs in a subshell and does not persist

# !wget -q https://api.pharmgkb.org/v1/download/file/data/clinicalAnnotations.zip
# !wget -q https://api.pharmgkb.org/v1/download/file/data/variantAnnotations.zip

# !unzip -jq clinicalAnnotations.zip "*.tsv" -d {data_dir}
# !unzip -jq variantAnnotations.zip "*.tsv" -d {data_dir}

# !rm clinicalAnnotations.zip variantAnnotations.zip

The variant annotations zip file contains 4 new tables, described in the readme as follows:
>* **var_pheno_ann.tsv**: Contains associations in which the variant affects a phenotype, with or without drug
information.
>* **var_drug_ann.tsv**: Contains associations in which the variant affects a drug dose, response, metabolism, etc.
>* **var_fa_ann.tsv**: Contains in vitro and functional analysis-type associations.
>* **study_parameters.tsv**: Contains information about the study population size, biogeographical group and statistics
for the variant annotations; this file is cross-referenced against the 3 variant annotation files.

The study parameters table is interesting but feels out of scope, at least for now, so I will ignore it for the rest of the notebook.

In [11]:
var_drug_ann = read_tsv_to_df(os.path.join(data_dir, 'var_drug_ann.tsv'))
var_fa_ann = read_tsv_to_df(os.path.join(data_dir, 'var_fa_ann.tsv'))
var_pheno_ann = read_tsv_to_df(os.path.join(data_dir, 'var_pheno_ann.tsv'))

Questions to consider:
* How many annotations?
* What's the coverage of variant/haplotypes relative to clinical annotations?
* What are the relevant fields?
* What's the relationship between these annotations and clinical annotations?
* Which of these columns has a controlled vocab vs. free text?
* How do the different variant-level annotation sentences contribute to the overall clinical annotation sentences?

In [30]:
len(var_drug_ann)

11901

In [31]:
len(var_fa_ann)

2009

In [32]:
len(var_pheno_ann)

13517

In [None]:
# Looking at the data - output suppressed for brevity
var_drug_ann.head()

In [None]:
var_fa_ann.head()

In [None]:
var_pheno_ann.head()

The 3 annotation tables provide evidence for the clinical annotations, and can be connected to them by joining with the `clinical_ann_evidence.tsv` file. In general a clinical annotation can have multiple variant annotations as evidence, and a variant annotation can be used as evidence for multiple clinical annotations.

Each of these tables has a "Direction of effect" column, and the type of "effect" is different for each - likelihood of side effects, formation of product, metabolism of drug, etc.

**Question for OT**: when we say "direction of effect", do we mean any of these "effects"? I.e. should we include all three of these tables or focus on one?

In [38]:
clinical_annotations = read_tsv_to_df(os.path.join(data_dir, 'clinical_annotations.tsv'))
clinical_ann_evidence = read_tsv_to_df(os.path.join(data_dir, 'clinical_ann_evidence.tsv'))
clinical_ann_alleles = read_tsv_to_df(os.path.join(data_dir, 'clinical_ann_alleles.tsv'))

In [122]:
main_df = pd.merge(clinical_annotations, clinical_ann_evidence, how='left', on=ID_COL_NAME)
main_df = main_df[[
    # Main table
    'Clinical Annotation ID', 'Variant/Haplotypes', 'Gene', 'Level of Evidence', 'Phenotype Category', 'Drug(s)', 'Phenotype(s)',
    # Evidence table
    'Evidence ID', 'Evidence Type', 'PMID', 'Summary',
]]

In [123]:
# https://www.pharmgkb.org/clinicalAnnotation/981755803 - has all three types of variant annotation evidence, plus label/guideline evidence
df_981755803 = main_df[main_df[ID_COL_NAME] == '981755803']

In [124]:
# Just look at one clinical annotation as an example
df_981755803_drug = pd.merge(df_981755803, var_drug_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='inner', suffixes=(None, '_var_drug'))
df_981755803_pheno = pd.merge(df_981755803, var_pheno_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='inner', suffixes=(None, '_var_pheno'))
df_981755803_fa = pd.merge(df_981755803, var_fa_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='inner', suffixes=(None, '_var_fa'))

In [125]:
len(df_981755803)

30

In [126]:
len(df_981755803_drug)

24

In [127]:
len(df_981755803_fa)

2

In [128]:
len(df_981755803_pheno)

2

In [206]:
df_981755803_drug.head(1)

Unnamed: 0,Clinical Annotation ID,Variant/Haplotypes,Gene,Level of Evidence,Phenotype Category,Drug(s),Phenotype(s),Evidence ID,Evidence Type,PMID,Summary,Variant Annotation ID,Variant/Haplotypes_var_drug,Gene_var_drug,Drug(s)_var_drug,PMID_var_drug,Phenotype Category_var_drug,Significance,Notes,Sentence,Alleles,Specialty Population,Metabolizer types,isPlural,Is/Is Not associated,Direction of effect,PD/PK terms,Multiple drugs And/or,Population types,Population Phenotypes or diseases,Multiple phenotypes or diseases And/or,Comparison Allele(s) or Genotype(s),Comparison Metabolizer types
0,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,981755665,Variant Drug Annotation,21083385,Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis.,981755665,rs75527207,CFTR,ivacaftor,21083385,Efficacy,not stated,Clinical trials were carried out to test efficacy of ivacaftor selecting only patients with the CFTR G551D mutation on at least one allele (genotype AA or AG).,Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis.,AA + AG,,,Are,Associated with,,response to,,in people with,Disease:Cystic Fibrosis,,,


In [138]:
df_981755803_fa

Unnamed: 0,Clinical Annotation ID,Variant/Haplotypes,Gene,Level of Evidence,Phenotype Category,Drug(s),Phenotype(s),Evidence ID,Evidence Type,PMID,Summary,Variant Annotation ID,Variant/Haplotypes_var_fa,Gene_var_fa,Drug(s)_var_fa,PMID_var_fa,Phenotype Category_var_fa,Significance,Notes,Sentence,Alleles,Specialty Population,Assay type,Metabolizer types,isPlural,Is/Is Not associated,Direction of effect,Functional terms,Gene/gene product,When treated with/exposed to/when assayed with,Multiple drugs And/or,Cell type,Comparison Allele(s) or Genotype(s),Comparison Metabolizer types
0,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1043737620,Variant Functional Assay Annotation,23757361,Allele A is associated with increased activity of CFTR when treated with ivacaftor in transfected CHO cells.,1043737620,rs75527207,CFTR,ivacaftor,23757361,Efficacy,yes,compared to no treatment. Ivacaftor stimulated CFTR activity in CFTR-G551D expressing CHO cells (as measured by iodine efflux).,Allele A is associated with increased activity of CFTR when treated with ivacaftor in transfected CHO cells.,A,,,,Is,Associated with,increased,activity of,CFTR,when treated with,,in transfected CHO cells,,
1,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1043737636,Variant Functional Assay Annotation,23891399,Allele A is associated with activity of CFTR when treated with ivacaftor in FRT cell lines.,1043737636,rs75527207,CFTR,ivacaftor,23891399,Efficacy,yes,G551D allele. 55.3 fold increase in chloride transport upon ivacaftor treatment as compared to baseline (no ivacaftor treatment).,Allele A is associated with activity of CFTR when treated with ivacaftor in FRT cell lines.,A,,,,Is,Associated with,,activity of,CFTR,when treated with,,in FRT cell lines,,


In [94]:
df_981755803_pheno

Unnamed: 0,Clinical Annotation ID,Variant/Haplotypes,Gene,Level of Evidence,Phenotype Category,Drug(s),Phenotype(s),Evidence ID,Evidence Type,PMID,Summary,Variant Annotation ID,Variant/Haplotypes_var_pheno,Gene_var_pheno,Drug(s)_var_pheno,PMID_var_pheno,Phenotype Category_var_pheno,Significance,Notes,Sentence,Alleles,Specialty Population,Metabolizer types,isPlural,Is/Is Not associated,Direction of effect,Side effect/efficacy/other,Phenotype,Multiple phenotypes And/or,When treated with/exposed to/when assayed with,Multiple drugs And/or,Population types,Population Phenotypes or diseases,Multiple phenotypes or diseases And/or,Comparison Allele(s) or Genotype(s),Comparison Metabolizer types
0,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1448267532,Variant Phenotype Annotation,27745802,Genotypes AA + AG is associated with decreased severity of bone density when treated with ivacaftor in people with Cystic Fibrosis as compared to genotype GG.,1448267532,rs75527207,CFTR,ivacaftor,27745802,Efficacy,yes,Bone mineral density compared before and after 1 year of treatment with ivacaftor using dual energy X-ray absorptiometry at the L2-L4 lumbar spine. All patients were pancreatic insufficient.,Genotypes AA + AG is associated with decreased severity of bone density when treated with ivacaftor in people with Cystic Fibrosis as compared to genotype GG.,AA + AG,,,Is,Associated with,decreased,severity of,Side Effect:bone density,and,when treated with,,in people with,Disease:Cystic Fibrosis,,GG,
1,981755803,rs75527207,CFTR,1A,Efficacy,ivacaftor,Cystic Fibrosis,1449192031,Variant Phenotype Annotation,28651844,Allele A is associated with decreased likelihood of cystic fibrosis pulmonary exacerbation when treated with ivacaftor in people with Cystic Fibrosis.,1449192031,rs75527207,CFTR,ivacaftor,28651844,Efficacy,yes,G551D allele. Patients receiving ivacaftor treatment had a reduced rate of pulmonary exacerbation events compared to patients receiving a placebo.,Allele A is associated with decreased likelihood of cystic fibrosis pulmonary exacerbation when treated with ivacaftor in people with Cystic Fibrosis.,A,Pediatric,,Is,Associated with,decreased,likelihood of,Disease:cystic fibrosis pulmonary exacerbation,and,when treated with,,in people with,Disease:Cystic Fibrosis,,,


In [93]:
# Comparing number of PMIDs vs. number of evidence
len(set(df_981755803_drug['PMID']) | set(df_981755803_pheno['PMID']) | set(df_981755803_fa['PMID']))

28

#### Observations so far
Clinical annotation [981755803](https://www.pharmgkb.org/clinicalAnnotation/981755803) has 30 pieces of supporting evidence:
* 24 variant/drug annotations
* 2 variant/functional assay annotations
* 2 variant/phenotype annotations
* 2 others (drug labels & guidelines, present in another data download so not included here)

Each variant annotation is associated with a PMID, and within this clinical annotation the mapping is 1:1 (28 variant annotations, 28 distinct PMIDs).
* We should think about whether we want to preserve the PMID & evidence associations.

These annotations seem much more specific than the clinical annotations, e.g.
* they distinguish between "disease" and "side effect" (check how often)
* if there are multiple phenotypes or drugs, they specify whether these should be "and"s or "or"s

These annotations are specific to one or more alleles or genotypes, so we will need to associate them accordingly.

It would be good if we could select the relevant columns from the 3 variant annotation tables and merge them into a unified representation, so we don't have to manage them separately in the pipelines or in the UI.
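
One possible shape for such a unified representation, as a minimal sketch only. The source column names come from the three tables above; `origin`, `effect_term`, `effect_object` and `unify_variant_annotations` are hypothetical names made up here, and the toy rows are copied from the example annotation rather than read from the real files.

```python
import pandas as pd

# Columns present in all three tables that are kept unchanged (a subset of the common columns).
SHARED_COLS = ['Variant Annotation ID', 'Alleles', 'Is/Is Not associated', 'Direction of effect']

# origin -> (column holding the effect term, column holding the thing affected).
# The two target column names used below are hypothetical.
TABLE_SPECIFIC = {
    'drug': ('PD/PK terms', 'Drug(s)'),
    'fa': ('Functional terms', 'Gene/gene product'),
    'pheno': ('Side effect/efficacy/other', 'Phenotype'),
}

def unify_variant_annotations(tables):
    """Stack the variant annotation tables into one frame with an 'origin' column."""
    frames = []
    for origin, df in tables.items():
        term_col, object_col = TABLE_SPECIFIC[origin]
        unified = df[SHARED_COLS].copy()
        unified['effect_term'] = df[term_col]
        unified['effect_object'] = df[object_col]
        unified['origin'] = origin
        frames.append(unified)
    return pd.concat(frames, ignore_index=True)

# Toy input: one row per table, taken from clinical annotation 981755803.
toy_tables = {
    'drug': pd.DataFrame({
        'Variant Annotation ID': ['981755665'], 'Alleles': ['AA + AG'],
        'Is/Is Not associated': ['Associated with'], 'Direction of effect': [None],
        'PD/PK terms': ['response to'], 'Drug(s)': ['ivacaftor'],
    }),
    'fa': pd.DataFrame({
        'Variant Annotation ID': ['1043737620'], 'Alleles': ['A'],
        'Is/Is Not associated': ['Associated with'], 'Direction of effect': ['increased'],
        'Functional terms': ['activity of'], 'Gene/gene product': ['CFTR'],
    }),
    'pheno': pd.DataFrame({
        'Variant Annotation ID': ['1448267532'], 'Alleles': ['AA + AG'],
        'Is/Is Not associated': ['Associated with'], 'Direction of effect': ['decreased'],
        'Side effect/efficacy/other': ['severity of'], 'Phenotype': ['Side Effect:bone density'],
    }),
}
unified = unify_variant_annotations(toy_tables)
```

Keeping an explicit `origin` column means the table of origin is still recoverable after stacking, which matters if the three kinds of effect end up being treated differently.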


## Coverage

[Top of page](#Table-of-contents)

In [97]:
len(clinical_annotations)

5111

In [129]:
# Exploded on evidence - average 3 per annotation
len(main_df)

15129

In [130]:
# Add all the var annotation tables - this will be a huge mess
main_with_var = pd.merge(main_df, var_drug_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='left', suffixes=(None, '_var_drug'))
main_with_var = pd.merge(main_with_var, var_pheno_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='left', suffixes=(None, '_var_pheno'))
main_with_var = pd.merge(main_with_var, var_fa_ann, left_on='Evidence ID', right_on='Variant Annotation ID', how='left', suffixes=(None, '_var_fa'))

In [144]:
ca_with_var_evidence = set(main_with_var[main_with_var['Sentence'].notna() | main_with_var['Sentence_var_pheno'].notna() | main_with_var['Sentence_var_fa'].notna()][ID_COL_NAME])

In [147]:
# Every clinical annotation has at least one variant annotation as supporting evidence
len(ca_with_var_evidence)

5111

In [205]:
print('Evidence from var/drug', len(set(main_with_var[main_with_var['Sentence'].notna()][ID_COL_NAME])))
print('Evidence from var/pheno', len(set(main_with_var[main_with_var['Sentence_var_pheno'].notna()][ID_COL_NAME])))
print('Evidence from var/fa', len(set(main_with_var[main_with_var['Sentence_var_fa'].notna()][ID_COL_NAME])))

Evidence from var/drug 2435
Evidence from var/pheno 2958
Evidence from var/fa 418


In [148]:
ca_with_doe_evidence = set(main_with_var[main_with_var['Direction of effect'].notna() | main_with_var['Direction of effect_var_pheno'].notna() | main_with_var['Direction of effect_var_fa'].notna()][ID_COL_NAME])

In [149]:
# Most contain some kind of direction of effect info
len(ca_with_doe_evidence)

4917

In [196]:
def main_with_var_where_notna(common_col_name):
    # Filter main_with_var on non-na columns that are common to all three variant annotation tables
    return main_with_var[
        main_with_var[common_col_name].notna()
        | main_with_var[f'{common_col_name}_var_pheno'].notna()
        | main_with_var[f'{common_col_name}_var_fa'].notna()
    ]

In [197]:
def main_with_var_values_in(common_col_name):
    # Return set of values in given column, common to all three variant annotation tables
    return (
        set(main_with_var[common_col_name]) 
        | set(main_with_var[f'{common_col_name}_var_pheno']) 
        | set(main_with_var[f'{common_col_name}_var_fa'])
    )

In [198]:
print('Total', len(main_with_var))  # i.e. clinical annotations exploded by evidence id
print('With variant annotation', len(main_with_var_where_notna('Variant Annotation ID')))
print('With allele', len(main_with_var_where_notna('Alleles')))
print('With comparison allele', len(main_with_var_where_notna('Comparison Allele(s) or Genotype(s)')))

Total 15129
With variant annotation 14658
With allele 14248
With comparison allele 12464


In [199]:
# Suddenly worried about counts
# All variant annotation IDs in all three tables
len(set(var_drug_ann['Variant Annotation ID']) | set(var_fa_ann['Variant Annotation ID']) | set(var_pheno_ann['Variant Annotation ID']))

27427

In [200]:
# All evidence IDs - includes variant annotations and drug labels
len(set(main_df['Evidence ID']))

13783

In [180]:
all_var_ann_ids = set(var_drug_ann['Variant Annotation ID']) | set(var_fa_ann['Variant Annotation ID']) | set(var_pheno_ann['Variant Annotation ID'])
all_ev_ids = set(main_df['Evidence ID'])

# Not all variant annotation evidence is used
len(all_var_ann_ids - all_ev_ids)

13778

#### Observations so far:
* Every clinical annotation has at least one variant annotation as supporting evidence
* Most (4917 of 5111, about 96%) contain some kind of direction of effect info (i.e. in one of the three tables)
   * => coverage is good, assuming we care about all three types of effects
* Selecting a single table covers at most 2958 of 5111 clinical annotations (about 58%, for var/pheno)
* Not all variant annotation evidence is used by a clinical annotation: 13778 of the 27427 variant annotation IDs never appear as an evidence ID
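
The per-table coverage counts above can also be computed without the three-way merge. The following is a rough sketch under assumptions: `coverage_by_table` is a hypothetical helper, the IDs are assumed to be strings in both frames (as they appear to be in this notebook), and the toy frames stand in for `main_df` and the variant annotation tables.

```python
import pandas as pd

def coverage_by_table(evidence, var_tables, id_col='Clinical Annotation ID'):
    """Count clinical annotations with at least one evidence row found in each table."""
    counts = {}
    for name, table in var_tables.items():
        # Evidence rows whose ID appears as a variant annotation ID in this table
        in_table = evidence['Evidence ID'].isin(table['Variant Annotation ID'])
        counts[name] = evidence.loc[in_table, id_col].nunique()
    return counts

# Toy data: E4 plays the role of a drug label (not a variant annotation),
# and E9 a variant annotation that no clinical annotation uses.
toy_evidence = pd.DataFrame({
    'Clinical Annotation ID': ['CA1', 'CA1', 'CA2', 'CA3'],
    'Evidence ID': ['E1', 'E2', 'E3', 'E4'],
})
toy_var_tables = {
    'drug': pd.DataFrame({'Variant Annotation ID': ['E1', 'E3']}),
    'pheno': pd.DataFrame({'Variant Annotation ID': ['E2']}),
    'fa': pd.DataFrame({'Variant Annotation ID': ['E9']}),
}
toy_coverage = coverage_by_table(toy_evidence, toy_var_tables)
```

Using `isin` avoids the suffixed duplicate columns that the chained merges create, at the cost of not bringing any variant annotation columns along.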

## Direction of effect

[Top of page](#Table-of-contents)

In [None]:
# Trying to make sense of the columns - output suppressed for brevity
main_with_var.columns

In [133]:
all_var_ann_cols = set(var_drug_ann.columns) | set(var_fa_ann.columns) | set(var_pheno_ann.columns)
common_var_ann_cols = set(var_drug_ann.columns) & set(var_fa_ann.columns) & set(var_pheno_ann.columns)

In [134]:
common_var_ann_cols

{'Alleles',
 'Comparison Allele(s) or Genotype(s)',
 'Comparison Metabolizer types',
 'Direction of effect',
 'Drug(s)',
 'Gene',
 'Is/Is Not associated',
 'Metabolizer types',
 'Multiple drugs And/or',
 'Notes',
 'PMID',
 'Phenotype Category',
 'Sentence',
 'Significance',
 'Specialty Population',
 'Variant Annotation ID',
 'Variant/Haplotypes',
 'isPlural'}

In [135]:
unique_var_ann_cols = all_var_ann_cols - common_var_ann_cols

In [136]:
# annotate with origin table
annotated_unique_var_ann_cols = {'drug':[], 'fa':[], 'pheno':[]}
for c in unique_var_ann_cols:
    if c in var_drug_ann.columns:
        annotated_unique_var_ann_cols['drug'].append(c)
    if c in var_fa_ann.columns:
        annotated_unique_var_ann_cols['fa'].append(c)
    if c in var_pheno_ann.columns:
        annotated_unique_var_ann_cols['pheno'].append(c)

In [137]:
annotated_unique_var_ann_cols

{'drug': ['Multiple phenotypes or diseases And/or',
  'Population types',
  'Population Phenotypes or diseases',
  'PD/PK terms'],
 'fa': ['Cell type',
  'Functional terms',
  'When treated with/exposed to/when assayed with',
  'Gene/gene product',
  'Assay type'],
 'pheno': ['Multiple phenotypes or diseases And/or',
  'Side effect/efficacy/other',
  'Multiple phenotypes And/or',
  'When treated with/exposed to/when assayed with',
  'Population types',
  'Population Phenotypes or diseases',
  'Phenotype']}

#### Sentence analysis

* Population phenotype always goes with "multiple phenotypes and/or" (note functional assay doesn't mention phenotype - I guess there's no multiples in the gene product?)
    * where does the main table phenotype come from? OT is overwriting this now but I'm still confused
* Drug (main table) always goes with "multiple drugs and/or"
* Comparison alleles should, in theory, tell us something about the reference or baseline, but they are not always present

Sentence examples:
* **drug**: "Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis."
    * alleles = "AA + AG"
    * direction of effect = [none]
    * pd/pk term = "response to"
    * drug = "ivacaftor"
    * population types = "people"
    * population phenotype = "cystic fibrosis"
    * comparison alleles/genotypes = [none]
* **fa**: "Allele A is associated with increased activity of CFTR when treated with ivacaftor in transfected CHO cells."
    * alleles = "A"
    * direction of effect = "increased"
    * functional term = "activity of"
    * gene/gene product = "CFTR"
    * when treated with/exposed to/assayed with = "when treated with"
    * drug = "ivacaftor"
    * cell type = "transfected CHO cells"
    * comparison alleles/genotypes = [none]
* **pheno**: "Genotypes AA + AG is associated with decreased severity of bone density when treated with ivacaftor in people with Cystic Fibrosis as compared to genotype GG."
    * alleles = "AA + AG"
    * direction of effect = "decreased"
    * side effect/efficacy/other = "severity of"
    * phenotype = "bone density" **< note distinction between this and population phenotype**
    * when treated with/exposed to/assayed with = "when treated with"
    * drug = "ivacaftor"
    * population types = "people"
    * population phenotype = "cystic fibrosis"
    * comparison alleles/genotypes = "GG"

For the simplest direction of effect annotation (i.e. not including the population, cell type, etc.), I think we only strictly need the following:
1. direction of effect
2. pd/pk term | functional term | side effect/efficacy/other
3. drug | gene/gene product | phenotype

This tells us the direction (1) and what the effect is (2&3).
Of course we also need the alleles to associate with the appropriate evidence string, maybe also the comparison alleles/genotypes when present, and the "is/is not associated" column so we don't report negative results (unless we want to).

The origin of the evidence (variant/drug, variant/phenotype, or functional analysis) may also be useful on its own.
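
To make the three-part annotation concrete, here is a small sketch that pulls it out of a single row. It works on plain dicts (for example the records from `df.to_dict('records')`); `minimal_doe` and `is_missing` are hypothetical helpers, and treating `'Associated with'` as the only positive value is an assumption based on the rows shown in this notebook.

```python
def is_missing(value):
    """True for None, empty strings and NaN (NaN is the only value not equal to itself)."""
    return value is None or value == '' or value != value

def minimal_doe(row, term_col, object_col):
    """Return the minimal direction-of-effect annotation for a row, or None.

    Rows are skipped when the association is not positive or no direction is given.
    """
    if row.get('Is/Is Not associated') != 'Associated with':
        return None
    direction = row.get('Direction of effect')
    if is_missing(direction):
        return None
    return {
        'alleles': row.get('Alleles'),
        'direction': direction,        # (1) direction
        'term': row.get(term_col),     # (2) what kind of effect
        'object': row.get(object_col), # (3) what is affected
    }

# Toy rows copied from the example clinical annotation above.
fa_row = {
    'Alleles': 'A', 'Is/Is Not associated': 'Associated with',
    'Direction of effect': 'increased',
    'Functional terms': 'activity of', 'Gene/gene product': 'CFTR',
}
drug_row = {
    'Alleles': 'AA + AG', 'Is/Is Not associated': 'Associated with',
    'Direction of effect': float('nan'),
    'PD/PK terms': 'response to', 'Drug(s)': 'ivacaftor',
}
fa_doe = minimal_doe(fa_row, 'Functional terms', 'Gene/gene product')
drug_doe = minimal_doe(drug_row, 'PD/PK terms', 'Drug(s)')
```

The drug row yields nothing because it carries no direction, which is the same gap the coverage numbers show for annotations without direction of effect info.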

In [161]:
# Check some vocab 
# DoE fixed vocab according to readme
set(main_with_var['Direction of effect']) | set(main_with_var['Direction of effect_var_pheno']) | set(main_with_var['Direction of effect_var_fa'])

{'decreased', 'increased', nan}

In [158]:
set(main_with_var['PD/PK terms'])  # not limited according to readme

{'clearance of',
 'clinical benefit to',
 'concentrations of',
 'discontinuation of',
 'dose of',
 'dose-adjusted trough concentrations of',
 'exposure to',
 'half-life time of',
 'metabolism of',
 nan,
 'resistance to',
 'response to',
 'steady-state concentration of',
 'time to response to',
 'trough concentration of'}

In [159]:
set(main_with_var['Functional terms'])  # not limited

{'activity of',
 'affinity to',
 'catalytic activity of',
 'clearance of',
 'concentrations of',
 'enzyme activity of',
 'expression of',
 'formation of',
 'glucuronidation of',
 'half-life of',
 'inhibition of',
 'metabolism of',
 nan,
 'protein stability of',
 'sensitivity to',
 'steady-state level of',
 'sulfation of',
 'transcription of',
 'transport of',
 'uptake of'}

In [160]:
set(main_with_var['Side effect/efficacy/other'])  # limited

{'age at onset of', 'likelihood of', nan, 'risk of', 'severity of'}

Assume the final term (drug | gene/gene product | phenotype) will vary, but hopefully we can also map it to the relevant domain if needed (CHEMBL, EFO, Ensembl). Drug and gene terms look about as you'd expect; as usual, phenotype is the most diverse (see below; the readme explicitly states phenotype is not standardized).

Otherwise this is a pretty small and consistent set of terms for ca. 15000 rows, though I don't think we can assume the vocab won't grow (except the actual direction word, we should be good there).
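
Since the vocab may grow, a cheap guard is to count anything outside the values seen so far. This is a sketch, not part of the existing pipeline: `unexpected_values` is a hypothetical helper, and the expected set is simply the two direction words observed above.

```python
from collections import Counter

# The two direction words observed in this data download.
EXPECTED_DIRECTIONS = {'increased', 'decreased'}

def unexpected_values(values, expected):
    """Count non-missing values that fall outside the expected vocabulary."""
    counts = Counter()
    for value in values:
        # Skip None and NaN (NaN is not equal to itself)
        if value is None or value != value:
            continue
        if value not in expected:
            counts[value] += 1
    return counts

# Toy column with a capitalisation variant and an unseen term mixed in.
toy_directions = ['increased', float('nan'), 'decreased', 'Increased', None, 'higher', 'higher']
surprises = unexpected_values(toy_directions, EXPECTED_DIRECTIONS)
```

The same function could be pointed at the PD/PK, functional and side effect/efficacy term columns, with the sets printed above as the expected vocabularies.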

In [209]:
list(set(main_with_var['Phenotype']))[:100]

[nan,
 'PK:differences in exposure to the active metabolite of prasugrel',
 'Disease:Endometrial Neoplasms',
 '"Disease:Epidermal Necrolysis, Toxic", "Disease:Stevens-Johnson Syndrome"',
 'Side Effect:total hemorrhage and major hemorrhage',
 'Other:subjective feelings of intoxication, stimulation, sedation, and happiness',
 'PK:plasma oxymorphone/oxycodone ratio',
 'Disease:Hematologic Diseases',
 'Side Effect:Venous Thrombosis',
 'Efficacy:non-remission',
 'Side Effect:Leukopenia',
 'Other:Drug Toxicity',
 'PK:renal clearance and secretion clearance of metformin',
 'PK:serum concentration',
 'Disease:Hyperammonemia',
 'Disease:Cough',
 'Disease:Hepatitis, Toxic',
 'Other:overanticoagulation',
 'Side Effect:Anemia, Side Effect:Leukopenia, Side Effect:Neutropenia, Side Effect:Thrombocytopenia',
 'PK:active moiety levels',
 'PK:log metabolic ratio of amitriptyline/nortriptyline',
 'Efficacy:maximum platelet aggregation (MPA) or the anti-platelet effect of clopidogrel in Chinese stroke pa

In [219]:
len(set(main_with_var['Phenotype']))

1735

In [216]:
# Dirty attempt to get the prefix - doesn't account for multiples
set(main_with_var['Phenotype'].dropna().apply(lambda p: p.split(':')[0].strip('"')))

{'Disease', 'Efficacy', 'Other', 'PK', 'Side Effect'}

In [218]:
len(set(main_with_var['Phenotype'].dropna().apply(lambda p: p.split(':')[1].strip('"'))))

1499

The prefix looks to be fixed and always present, which is kind of nice actually. The rest, I think, can come from a body of PharmGKB terms or be filled in freely. We could perhaps map them (with OLS or NLP).

Same is true for population phenotypes when present:

In [226]:
# var_fa does not have the column so I can't use my nice function :(
(
    set(main_with_var['Population Phenotypes or diseases'].dropna().apply(lambda p: p.split(':')[0].strip('"'))) 
    | set(main_with_var['Population Phenotypes or diseases_var_pheno'].dropna().apply(lambda p: p.split(':')[0].strip('"'))) 
)

{'Disease', 'Efficacy', 'Other', 'PK', 'Side Effect'}
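
The prefix extraction above ignores multiples. A sketch of one way to handle them, assuming the five prefixes seen so far are the complete set: split only on commas that are followed by a known prefix, so that terms containing commas (e.g. "Hepatitis, Toxic") stay whole. `split_phenotypes` is a hypothetical helper.

```python
import re

# Prefixes observed in the Phenotype column; assumed here to be the complete set.
PREFIXES = ('Disease', 'Efficacy', 'Other', 'PK', 'Side Effect')

# Split on a comma only when what follows is (optionally a quote and) a known prefix plus ':'.
_PREFIX_ALTERNATION = '|'.join(re.escape(p) for p in PREFIXES)
_SPLIT_RE = re.compile(r',\s*(?="?(?:' + _PREFIX_ALTERNATION + r'):)')

def split_phenotypes(value):
    """Split a Phenotype cell into (prefix, term) pairs."""
    pairs = []
    for part in _SPLIT_RE.split(value):
        part = part.strip().strip('"')
        prefix, _, term = part.partition(':')
        pairs.append((prefix.strip(), term.strip()))
    return pairs

# Examples taken from the Phenotype values listed above.
quoted = split_phenotypes('"Disease:Epidermal Necrolysis, Toxic", "Disease:Stevens-Johnson Syndrome"')
unquoted = split_phenotypes('Side Effect:Anemia, Side Effect:Leukopenia')
single = split_phenotypes('Disease:Hepatitis, Toxic')
```

This would still mis-split a free-text term that itself contains ", Disease:" or similar, so it is worth checking against the full column before relying on it.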

## Alleles and genotypes

[Top of page](#Table-of-contents)

In [None]:
# Visual inspection of alleles - output suppressed for brevity
main_with_var_values_in('Alleles')

In [None]:
main_with_var_values_in('Comparison Allele(s) or Genotype(s)')

Alleles and comparison alleles look relatively consistent with what's in the alleles table:
* SNP genotype `C/T` or `CC` (annoying)
* SNP allele `C`
* indel `GGGGAGCTTTCCCAGAGACCC/del`
* named allele `*17` or `HTTLPR short form (S allele)`
* named genotype `*2/*4`
* combinations of the above delineated with `+`
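
The formats listed above can be told apart with a few regular expressions. The following is a rough sketch based only on the examples in that list; `split_alleles`, `classify_allele` and the category labels are hypothetical, and it assumes `+` never occurs inside a named allele.

```python
import re

def split_alleles(value):
    """Split a combined Alleles cell such as 'AA + AG' into its parts."""
    return [part.strip() for part in value.split('+') if part.strip()]

def classify_allele(allele):
    """Guess the kind of a single allele or genotype string."""
    if re.fullmatch(r'[ACGT]', allele):
        return 'snp_allele'
    # SNP genotypes appear both as 'CC' and as 'C/T'
    if re.fullmatch(r'[ACGT]{2}', allele) or re.fullmatch(r'[ACGT]/[ACGT]', allele):
        return 'snp_genotype'
    # Longer sequences and/or 'del' on either side of the slash
    if re.fullmatch(r'(?:[ACGT]+|del)/(?:[ACGT]+|del)', allele):
        return 'indel'
    # Star-allele pairs such as '*2/*4'
    if re.fullmatch(r'\*[^/\s]+/\*[^/\s]+', allele):
        return 'named_genotype'
    # Everything else, e.g. '*17' or 'HTTLPR short form (S allele)'
    return 'named_allele'

# Examples taken from the list above.
examples = ['C/T', 'CC', 'C', 'GGGGAGCTTTCCCAGAGACCC/del', '*17',
            'HTTLPR short form (S allele)', '*2/*4']
kinds = [classify_allele(a) for a in examples]
combined = [classify_allele(a) for a in split_alleles('AA + AG')]
```

Two-letter strings are treated as genotypes here, matching the `CC` example, although a two-base allele would look identical.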

In [203]:
variant_haps = main_with_var_values_in('Variant/Haplotypes')

[v for v in variant_haps if pd.notna(v) and not (v.startswith('rs') or '*' in v)]

['G6PD Canton, Taiwan-Hakka, Gifu-like, Agrigento-like',
 'G6PD A- 202A_376G',
 'G6PD A- 202A_376G, G6PD B (reference)',
 'CYP1A2 high activity',
 'SLC6A4 HTTLPR long form (L allele), SLC6A4 HTTLPR short form (S allele)',
 'SLC6A4 HTTLPR short form (S allele)',
 'CYP2D6 poor metabolizer genotype',
 'CYP1A2 low activity',
 'CYP2A6 poor metabolizer genotype',
 'CYP2D6 ultrarapid metabolizer genotype',
 'CYP2D6 low activity',
 'CYP2A6 low activity',
 'G6PD B (reference), G6PD Mediterranean Haplotype',
 'CYP2C19 poor metabolizer phenotype',
 'CYP2C19 poor metabolizers',
 'CYP2C19 poor metabolizer genotype',
 'CYP2D6 poor metabolizer phenotype',
 'CYP3A4 low activity',
 'TPMT intermediate metabolizer phenotype',
 'G6PD deficiency',
 'NAT2 slow acetylator',
 'CYP2D6 ultrarapid metabolizer phenotype',
 'CYP2D6 poor and ultrarapid metabolizers',
 'CYP2D6 poor metabolizer and intermediate metabolizer genotypes',
 'CYP2D6 normal metabolizer and ultrarapid metabolizer genotypes',
 'CYP2C19 normal

Some of these are just named (non-star) alleles, but things like "CYP2A6 poor metabolizer genotype" are where the comparison metabolizer types get used as opposed to comparison alleles.

Not sure what to do about these - we can't easily associate them with an allele or genotype, only with the clinical annotation as a whole.

### To-do's

Prototyping:
* sample schema additions for unified representation
* sample extraction
* check multiplicity when connecting variant annotations to exploded evidence - what does this look like?
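
For the multiplicity check, a starting point could look like the sketch below. `multiplicity` and its output keys are hypothetical, the toy frame stands in for `main_df`, and IDs are assumed to be strings.

```python
import pandas as pd

def multiplicity(evidence, var_ann_ids):
    """Summarise how variant annotations and clinical annotations map onto each other."""
    # Keep only evidence rows that are variant annotations
    linked = evidence[evidence['Evidence ID'].isin(var_ann_ids)]
    per_clinical = linked.groupby('Clinical Annotation ID')['Evidence ID'].nunique()
    per_variant = linked.groupby('Evidence ID')['Clinical Annotation ID'].nunique()
    return {
        'max_var_ann_per_clinical_ann': int(per_clinical.max()),
        'var_ann_shared_across_clinical_ann': int((per_variant > 1).sum()),
    }

# Toy data: E2 supports two clinical annotations, E4 is not a variant annotation.
toy_evidence = pd.DataFrame({
    'Clinical Annotation ID': ['CA1', 'CA1', 'CA2', 'CA3'],
    'Evidence ID': ['E1', 'E2', 'E2', 'E4'],
})
toy_summary = multiplicity(toy_evidence, {'E1', 'E2', 'E3'})
```

A non-zero shared count on the real data would confirm that the relationship is many-to-many, so any per-allele association has to be made per clinical annotation rather than per variant annotation.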

## Summary

[Top of page](#Table-of-contents)

Specific questions:
* Do we care about all of the tables? (for v1)
* Do we care about the other columns? (for v1)