# Prototype of associating variant annotations with clinical annotations

Goal:
* For each clinical annotation genotype/haplotype ID, associate one or more variant annotation IDs
* Try a simple algorithm first, check approx. what percentage we can associate naively

Known issues:
* What to do about ref/ref genotypes?
    * Curious how often the "comparison allele" is the reference?
        * not always, e.g. [here](https://www.pharmgkb.org/variantAnnotation/1451665780)
* How to go from statement about alleles to a statement about genotypes?
    * Or vice versa, apparently
* What to do about "metabolizer" types?

## Table of contents

1. [Intro](#Intro)
    1. [Example 1](#Example-1)
    2. [Example 2](#Example-2)
2. [Algorithm](#Algorithm)
    1. [Prototyping](#Prototyping)
    2. [Full dataset run](#Full-dataset-run)

## Intro

[Top of page](#Table-of-contents)

In [1]:
import os
import csv
import re
from collections import Counter

import pandas as pd

from opentargets_pharmgkb.evidence_generation import ID_COL_NAME
from opentargets_pharmgkb.pandas_utils import read_tsv_to_df

In [2]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', None)

In [3]:
data_dir = '/home/april/projects/opentargets/pharmgkb/doe'

In [4]:
var_drug_ann = read_tsv_to_df(os.path.join(data_dir, 'var_drug_ann.tsv'))
var_fa_ann = read_tsv_to_df(os.path.join(data_dir, 'var_fa_ann.tsv'))
var_pheno_ann = read_tsv_to_df(os.path.join(data_dir, 'var_pheno_ann.tsv'))

In [5]:
clinical_annotations = read_tsv_to_df(os.path.join(data_dir, 'clinical_annotations.tsv'))
clinical_ann_evidence = read_tsv_to_df(os.path.join(data_dir, 'clinical_ann_evidence.tsv'))
clinical_ann_alleles = read_tsv_to_df(os.path.join(data_dir, 'clinical_ann_alleles.tsv'))

### Example 1

[Top of page](#Table-of-contents)

In [32]:
var_drug_ann[var_drug_ann['Variant Annotation ID'] == '1448997750']

Unnamed: 0,Variant Annotation ID,Variant/Haplotypes,Gene,Drug(s),PMID,Phenotype Category,Significance,Notes,Sentence,Alleles,Specialty Population,Metabolizer types,isPlural,Is/Is Not associated,Direction of effect,PD/PK terms,Multiple drugs And/or,Population types,Population Phenotypes or diseases,Multiple phenotypes or diseases And/or,Comparison Allele(s) or Genotype(s),Comparison Metabolizer types
3,1448997750,"CYP2B6*1, CYP2B6*18",CYP2B6,efavirenz,16495778,Metabolism/PK,yes,"Please note that in the paper the allele was referred to as CYP2B6*16. CYP2B6*16 and *18 alleles have been consolidated by PharmVar in Jan 2020, with *16 now listed as a suballele of *18 (CYP2B6*18.002). This annotation is updated to be on CYP2B6*18, instead of CYP2B6*16.",CYP2B6 *1/*18 is associated with increased concentrations of efavirenz in people with HIV Infections as compared to CYP2B6 *1/*1.,*1/*18,,,Is,Associated with,increased,concentrations of,,in people with,Disease:HIV Infections,,*1/*1,


In [14]:
var_drug_ann[var_drug_ann['Variant Annotation ID'] == '1184988315']

Unnamed: 0,Variant Annotation ID,Variant/Haplotypes,Gene,Drug(s),PMID,Phenotype Category,Significance,Notes,Sentence,Alleles,Specialty Population,Metabolizer types,isPlural,Is/Is Not associated,Direction of effect,PD/PK terms,Multiple drugs And/or,Population types,Population Phenotypes or diseases,Multiple phenotypes or diseases And/or,Comparison Allele(s) or Genotype(s),Comparison Metabolizer types
299,1184988315,rs28399499,CYP2B6,efavirenz,20952418,Metabolism/PK,yes,*Note: combined analysis with rs3745274* Mean log efavirenz trough concentrations increased with the number of rs3745274 T or rs28399499 C alleles.,Allele C is associated with decreased metabolism of efavirenz in people with HIV Infections.,C,,,Is,Associated with,decreased,metabolism of,,in people with,Disease:HIV Infections,,,


In [15]:
clinical_ann_evidence[clinical_ann_evidence['Evidence ID'] == '1184988315']

Unnamed: 0,Clinical Annotation ID,Evidence ID,Evidence Type,Evidence URL,PMID,Summary,Score
13356,1184133833,1184988315,Variant Drug Annotation,https://www.pharmgkb.org/variantAnnotation/1184988315,20952418,Allele C is associated with decreased metabolism of efavirenz in people with HIV Infections.,1.5


In [10]:
clinical_annotations[clinical_annotations[ID_COL_NAME] == '1184133833']

Unnamed: 0,Clinical Annotation ID,Variant/Haplotypes,Gene,Level of Evidence,Level Override,Level Modifiers,Score,Phenotype Category,PMID Count,Evidence Count,Drug(s),Phenotype(s),Latest History Date (YYYY-MM-DD),URL,Specialty Population
4493,1184133833,"CYP2B6*1, CYP2B6*4, CYP2B6*6, CYP2B6*9, CYP2B6*18, CYP2B6*28",CYP2B6,1A,,Tier 1 VIP,361.5,Metabolism/PK,44,69,efavirenz,HIV Infections,2021-03-24,https://www.pharmgkb.org/clinicalAnnotation/1184133833,Pediatric


In [11]:
clinical_ann_alleles[clinical_ann_alleles[ID_COL_NAME] == '1184133833']

Unnamed: 0,Clinical Annotation ID,Genotype/Allele,Annotation Text,Allele Function
13707,1184133833,*1,"The CYP2B6*1 allele is assigned as a normal function allele by CPIC. Patients carrying CYP2B6*1 allele in combination with another normal function allele may have increased metabolism and decreased concentrations of efavirenz as compared to patients with a no or decreased function allele in combination with a normal or increased function allele or with two no or decreased function alleles. However, conflicting evidence has been reported. Other genetic and clinical factors may also influence metabolism of efavirenz. This annotation only covers the pharmacokinetic relationship between CYP2B6 and efavirenz and does not include evidence about clinical outcomes.",Normal function
13708,1184133833,*4,"The CYP2B6*4 allele is assigned as an increased function allele by CPIC. Patients carrying the CYP2B6*4 allele in combination with a normal function allele or another increased function allele may have increased metabolism of efavirenz as compared to patients with two normal function alleles. However, conflicting evidence has been reported. Other genetic and clinical factors may also influence metabolism of efavirenz. This annotation only covers the pharmacokinetic relationship between CYP2B6 and efavirenz and does not include evidence about clinical outcomes.",Increased function
13709,1184133833,*6,"The CYP2B6*6 allele is assigned as a decreased function allele by CPIC. Patients carrying the CYP2B6*6 allele in combination with a no, decreased, normal, or increased function allele may have decreased metabolism of efavirenz as compared to patients with two normal function alleles. However, conflicting evidence has been reported. Other genetic and clinical factors may also influence metabolism of efavirenz. This annotation only covers the pharmacokinetic relationship between CYP2B6 and efavirenz and does not include evidence about clinical outcomes.",Decreased function
13710,1184133833,*9,"The CYP2B6*9 allele is assigned as a decreased function allele by CPIC. Patients carrying the CYP2B6*9 allele in combination with a no, decreased, normal, or increased function allele may have decreased metabolism of efavirenz as compared to patients with two normal function alleles. Other genetic and clinical factors may also influence metabolism of efavirenz. This annotation only covers the pharmacokinetic relationship between CYP2B6 and efavirenz and does not include evidence about clinical outcomes.",Decreased function
13711,1184133833,*18,"The CYP2B6*18 allele is assigned as a no function allele by CPIC. Patients carrying the CYP2B6*18 allele in combination with a no, decreased, normal, or increased function allele may have decreased metabolism of efavirenz as compared to patients with two normal function alleles. However, conflicting evidence has been reported. Other genetic and clinical factors may also influence metabolism of efavirenz. This annotation only covers the pharmacokinetic relationship between CYP2B6 and efavirenz and does not include evidence about clinical outcomes.",No function
13712,1184133833,*28,"The CYP2B6*18 allele is assigned as a no function allele by CPIC. Patients carrying the CYP2B6*28 allele in combination with a no, decreased, normal, or increased function allele may have decreased metabolism of efavirenz as compared to patients with two normal function alleles. Other genetic and clinical factors may also influence metabolism of efavirenz. This annotation only covers the pharmacokinetic relationship between CYP2B6 and efavirenz and does not include evidence about clinical outcomes.",No function


Variant annotation sentence:
```
CYP2B6 *1/*18 is associated with increased concentrations of efavirenz in people with HIV Infections as compared to CYP2B6 *1/*1.
```

The clinical annotation has sentences for `*1` and `*18`, so this variant annotation needs to be associated with one or both of them.

It's actually much worse: some variant annotations associated with this clinical annotation refer to alleles of a single rsID that is one of the variants composing a named allele!

[Here](https://www.pharmgkb.org/variantAnnotation/1184988315) is an example. Variant annotation sentence:
```
Allele C is associated with decreased metabolism of efavirenz in people with HIV Infections.
```
This is for a specific variant, rs28399499. It would have to be associated with one or more of the CYP2B6 star alleles based on the allele definition table (in this case, only `*18` has C at that rsID).
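
The lookup from an rsID allele to star alleles could be sketched roughly as below. The definition table is a toy written for illustration, not real PharmVar data (only the fact that `*18` alone carries C at rs28399499 comes from the text above), and `star_alleles_carrying` is a hypothetical helper name.

```python
# Toy allele definition table: star allele -> {rsID: allele}.
# Values are illustrative placeholders, NOT authoritative definitions;
# the only property taken from the text is that only *18 has C at rs28399499.
TOY_ALLELE_DEFINITIONS = {
    '*1': {'rs28399499': 'T', 'rs3745274': 'G'},
    '*6': {'rs28399499': 'T', 'rs3745274': 'T'},
    '*18': {'rs28399499': 'C', 'rs3745274': 'G'},
}


def star_alleles_carrying(rsid, allele, definitions=TOY_ALLELE_DEFINITIONS):
    """Return the star alleles whose definition has `allele` at `rsid`."""
    return [star for star, variants in definitions.items()
            if variants.get(rsid) == allele]


print(star_alleles_carrying('rs28399499', 'C'))  # ['*18']
```

A variant annotation on `rs28399499` allele `C` would then be attached to whichever clinical annotation alleles this lookup returns.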

Also note the direction flipping - "increased concentration" to "decreased metabolism".

### Example 2

[Top of page](#Table-of-contents)

In [24]:
var_drug_ann[var_drug_ann['Variant Annotation ID'] == '981755665']

Unnamed: 0,Variant Annotation ID,Variant/Haplotypes,Gene,Drug(s),PMID,Phenotype Category,Significance,Notes,Sentence,Alleles,Specialty Population,Metabolizer types,isPlural,Is/Is Not associated,Direction of effect,PD/PK terms,Multiple drugs And/or,Population types,Population Phenotypes or diseases,Multiple phenotypes or diseases And/or,Comparison Allele(s) or Genotype(s),Comparison Metabolizer types
7365,981755665,rs75527207,CFTR,ivacaftor,21083385,Efficacy,not stated,Clinical trials were carried out to test efficacy of ivacaftor selecting only patients with the CFTR G551D mutation on at least one allele (genotype AA or AG).,Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis.,AA + AG,,,Are,Associated with,,response to,,in people with,Disease:Cystic Fibrosis,,,


In [26]:
clinical_ann_evidence[clinical_ann_evidence['Evidence ID'] == '981755665']

Unnamed: 0,Clinical Annotation ID,Evidence ID,Evidence Type,Evidence URL,PMID,Summary,Score
2,981755803,981755665,Variant Drug Annotation,https://www.pharmgkb.org/variantAnnotation/981755665,21083385,Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis.,0.25


In [27]:
clinical_ann_alleles[clinical_ann_alleles[ID_COL_NAME] == '981755803']

Unnamed: 0,Clinical Annotation ID,Genotype/Allele,Annotation Text,Allele Function
0,981755803,AA,"Patients with the rs75527207 AA genotype (two copies of the CFTR G551D variant) and cystic fibrosis may respond to ivacaftor treatment. FDA-approved drug labeling information and CPIC guidelines indicate use of ivacaftor in cystic fibrosis patients with at least one copy of a list of 33 CFTR genetic variants, including G551D. Other genetic and clinical factors may also influence response to ivacaftor.",
1,981755803,AG,"Patients with the rs75527207 AG genotype (one copy of the CFTR G551D variant) and cystic fibrosis may respond to ivacaftor treatment. FDA-approved drug labeling information and CPIC guidelines indicate use of ivacaftor in cystic fibrosis patients with at least one copy of a list of 33 CFTR genetic variants, including G551D. Other genetic and clinical factors may also influence response to ivacaftor.",
2,981755803,GG,"Patients with the rs75527207 GG genotype (do not have a copy of the CFTR G551D variant) and cystic fibrosis have an unknown response to ivacaftor treatment, as response may depend on the presence of other CFTR variants. FDA-approved drug labeling information and CPIC guidelines indicate use of ivacaftor in cystic fibrosis patients with at least one copy of a list of 33 CFTR genetic variants, including G551D. Other genetic and clinical factors may also influence response to ivacaftor.",


Variant annotation sentence:
```
Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis.
```
This lines up exactly with the clinical annotation, which is on genotypes as well.

In [29]:
clinical_ann_evidence[clinical_ann_evidence['Evidence ID'] == '982009991']

Unnamed: 0,Clinical Annotation ID,Evidence ID,Evidence Type,Evidence URL,PMID,Summary,Score
4,981755803,982009991,Variant Drug Annotation,https://www.pharmgkb.org/variantAnnotation/982009991,23590265,Allele A is associated with response to ivacaftor in children with Cystic Fibrosis.,2.25


On the other hand, this evidence is for the same clinical annotation but is for an allele.

Allele types observed in variant annotation tables:
* SNP genotype `C/T` or `CC` (annoying)
* SNP allele `C`
* insertion/deletion genotype `GGGGAGCTTTCCCAGAGACCC/del` (not sure about indel allele)
* named allele `*17` or `HTTLPR short form (S allele)`
* named genotype `*2/*4`
* combinations of the above delineated with `+`
* tricky ones that we'll ignore
    * poor/intermediate/normal/ultrarapid metabolizer, or combinations of these
    * low/high activity
    * slow/intermediate acetylator
    * deficiency
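
As a rough illustration of the categories above, one `+`-separated piece could be classified with a few regular expressions. This is a sketch only: `classify_allele_string` and its category labels are hypothetical names, the patterns were written from the examples listed here rather than from the full data, and named alleles that do not start with `*` (such as `HTTLPR short form (S allele)`) simply fall through to `other`.

```python
import re


def classify_allele_string(piece):
    """Roughly classify one `+`-separated piece of the `Alleles` column."""
    piece = piece.strip()
    if re.fullmatch(r'[ACGT]', piece):
        return 'snp_allele'
    if re.fullmatch(r'[ACGT]{2}|[ACGT]/[ACGT]', piece):
        return 'snp_genotype'
    if re.fullmatch(r'(?:[ACGT]+|del)/(?:[ACGT]+|del)', piece):
        return 'indel_genotype'
    if re.fullmatch(r'\*[\w:.]+/\*[\w:.]+', piece):
        return 'named_genotype'
    if piece.startswith('*'):
        return 'named_allele'
    # Metabolizer types, activity levels, non-star named alleles, etc.
    return 'other'


print([classify_allele_string(p) for p in 'AA + AG'.split('+')])
# ['snp_genotype', 'snp_genotype']
print(classify_allele_string('GGGGAGCTTTCCCAGAGACCC/del'))  # indel_genotype
print(classify_allele_string('*2/*4'))  # named_genotype
```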

## Algorithm

[Top of page](#Table-of-contents)

#### Main
1. First associate variant annotations with clinical annotations by ID
2. For variant annotation, take the `Alleles` column
3. Parse by splitting on `+` (and)
4. For each piece, try to match exactly against the `Genotype/Allele` column in clinical annotation
    * If it matches, go to "Resolve combo"
5. Parse each piece by splitting on `/` to get alleles
    * If no `/`, assume this to be an allele already
6. Try to match against alleles parsed from clinical annotation (done as part of ID generation)
    * If it matches, go to "Resolve genotype" and then "Resolve combo"
7. If no match, log and exit
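
The two parsing steps (3 and 5) can be sketched without pandas as follows. `split_alleles_column` is a hypothetical helper written for illustration; it is not part of `opentargets_pharmgkb`.

```python
def split_alleles_column(alleles):
    """Split an `Alleles` value on `+`, then split each piece on `/`.

    Returns a list of (piece, alleles_in_piece) tuples. A piece without `/`
    is assumed to be an allele already, so it is returned as a single item.
    """
    pieces = [p.strip() for p in alleles.split('+') if p.strip()]
    return [(piece, [a.strip() for a in piece.split('/')]) for piece in pieces]


print(split_alleles_column('*1/*18 + *2/*18'))
# [('*1/*18', ['*1', '*18']), ('*2/*18', ['*2', '*18'])]
print(split_alleles_column('C'))
# [('C', ['C'])]
```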

#### Resolve combo (`+`)
Variant annotation should be associated with each operand of `+`.

#### Resolve genotype vs. allele
We reach this state if we've matched something but the level isn't right - either variant annotation is per allele and clinical annotation is per genotype, or vice versa.

**Case 1:** Variant annotation is per allele but clinical annotation is per genotype.
* Variant annotation should be associated with each genotype containing the allele.

**Case 2:** Variant annotation is per genotype but clinical annotation is per allele.
* Variant annotation should be associated with each allele contained in its genotype.

Worked example 1: Variant annotation is for `*1/*18 + *2/*18 + *18/*18`, clinical annotations are on `*1`, `*2` and `*18`

1. Break combo to `*1/*18` AND `*2/*18` AND `*18/*18` => no match
2. Split genotype to `*1` & `*18` AND `*2` & `*18` AND `*18` & `*18` => match
3. Associate variant annotation with `*1`, `*2`, and `*18`

Worked example 2: Variant annotation is on `AA + AG`, clinical annotations are on `AA`, `AG`, and `GG`.

1. Break combo to `AA` and `AG` => match
2. Associate variant annotation with `AA` and `AG`
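
Both worked examples can be reproduced with a small pure-Python sketch of the matching logic. The names `clinical_parts` and `associate` are hypothetical, and the sketch assumes (as the steps above do) that variant annotation pieces are only split on `/`, while two-letter clinical SNP genotypes such as `AG` are also split into their letters.

```python
def clinical_parts(genotype):
    """Alleles contained in one clinical annotation genotype/allele entry."""
    if '/' in genotype:
        return set(genotype.split('/'))
    if len(genotype) == 2 and '*' not in genotype:
        return set(genotype)  # SNP genotype such as 'AG' -> {'A', 'G'}
    return {genotype}


def associate(alleles, clinical_genotypes):
    """Return (matched clinical entries, unmatched variant annotation pieces)."""
    parts_by_entry = {g: clinical_parts(g) for g in clinical_genotypes}
    matched, unmatched = [], []
    for piece in (p.strip() for p in alleles.split('+')):
        if piece in parts_by_entry:
            hits = [piece]  # exact match on the whole piece
        else:
            piece_parts = set(piece.split('/'))
            # Covers both cases: an allele contained in a clinical genotype,
            # or a genotype containing a clinical allele
            hits = [g for g, parts in parts_by_entry.items() if parts & piece_parts]
        if hits:
            matched.extend(h for h in hits if h not in matched)
        else:
            unmatched.append(piece)  # would be logged in the real run
    return matched, unmatched


print(associate('*1/*18 + *2/*18 + *18/*18', ['*1', '*2', '*18']))
# (['*1', '*18', '*2'], [])
print(associate('AA + AG', ['AA', 'AG', 'GG']))
# (['AA', 'AG'], [])
```

Pieces such as `poor metabolizer` end up in the second list, which corresponds to the "log and exit" step.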

### Prototyping

[Top of page](#Table-of-contents)

In [279]:
from opentargets_pharmgkb.pandas_utils import split_and_explode_column
import numpy as np
from IPython.display import display

In [368]:
# Examples from spreadsheet
example_ca_ids = ['981755803', '1139506787', '1183888969', '1184514050', '981419266']

In [456]:
def get_evidence(caids=None):
    if not caids:
        caids = clinical_ann_evidence[ID_COL_NAME]
    # Variant annotation ids
    caid_to_vaid = {
        caid: clinical_ann_evidence[clinical_ann_evidence[ID_COL_NAME] == caid]['Evidence ID'].to_list()
        for caid in caids
    }
    return caid_to_vaid

In [342]:
def extended_parse_genotype(genotype_string):
    """
    Parse PGKB string representations of genotypes into alleles. Extended to include star alleles.
    """
    alleles = [genotype_string]
    
    # SNPs
    if len(genotype_string) == 2 and '*' not in genotype_string:
        alleles = [genotype_string[0], genotype_string[1]]

    # others
    m = re.match(r'([^/]+)/([^/]+)', genotype_string)
    if m:
        alleles = [m.group(1), m.group(2)]

    return alleles

In [455]:
def get_associations_old(annotation_df, clinical_alleles_df):
    """
    Older version for testing.
    This was faster (no for-loop) and useful for seeing which variant annotation pieces don't get matched,
    but it has some small bugs and can't show which clinical annotations don't get matched.
    """
    # Split on +
    split_df = split_and_explode_column(annotation_df, 'Alleles', 'split_alleles_1', sep=r'\+')
    # Split on /
    split_df = split_and_explode_column(split_df, 'split_alleles_1', 'split_alleles_2', sep='/')
    # Get alleles from clinical annotations - same logic as for getting ids
    split_clin_df = clinical_alleles_df.assign(parsed_genotype=clinical_alleles_df['Genotype/Allele'].apply(extended_parse_genotype))
    split_clin_df = split_clin_df.explode('parsed_genotype').reset_index(drop=True)

    # First match by the +-split alone
    merged_df = pd.merge(split_clin_df, split_df, how='right', left_on='Genotype/Allele', right_on='split_alleles_1')

    # Then try to fill in non-matches from /-split
    merged_df_2 = pd.merge(split_clin_df, split_df, how='right', left_on='parsed_genotype', right_on='split_alleles_2')
    merged_df.fillna(value={
        'Clinical Annotation ID': merged_df_2['Clinical Annotation ID'],
        'Genotype/Allele': merged_df_2['Genotype/Allele'],
        'Annotation Text': merged_df_2['Annotation Text'],
        'parsed_genotype': merged_df_2['parsed_genotype']
    }, inplace=True)

    # Select some columns to return & drop duplicates
    return merged_df[
    [ID_COL_NAME, 'Genotype/Allele', 'parsed_genotype', 'Annotation Text', 'Variant Annotation ID', 'Alleles', 'split_alleles_1', 'split_alleles_2', 'Sentence']
    ].drop_duplicates()

In [453]:
def get_associations(annotation_df, clinical_alleles_df):
    # Split on +
    split_ann_df = split_and_explode_column(annotation_df, 'Alleles', 'split_alleles_1', sep=r'\+')
    # Split on /
    split_ann_df = split_and_explode_column(split_ann_df, 'split_alleles_1', 'split_alleles_2', sep='/')
    # Get alleles from clinical annotations - same logic as for getting ids
    split_clin_df = clinical_alleles_df.assign(parsed_genotype=clinical_alleles_df['Genotype/Allele'].apply(extended_parse_genotype))
    split_clin_df = split_clin_df.explode('parsed_genotype').reset_index(drop=True)

    # Trim down columns 
    split_ann_df = split_ann_df[['Variant Annotation ID', 'Alleles', 'split_alleles_1', 'split_alleles_2', 'Sentence']]
    split_clin_df = split_clin_df[[ID_COL_NAME, 'Genotype/Allele', 'parsed_genotype', 'Annotation Text']]

    # Match by +-split and /-split
    merged_df = pd.merge(split_clin_df, split_ann_df, how='outer', left_on='Genotype/Allele', right_on='split_alleles_1')
    merged_df_2 = pd.merge(split_clin_df, split_ann_df, how='outer', left_on='parsed_genotype', right_on='split_alleles_2')

    # If a genotype in a clinical annotation doesn't have evidence, want this listed with nan's
    all_results = []
    for _, genotype, parsed_genotype, _ in split_clin_df.itertuples(index=False):
        # Rows that matched on genotype
        rows_first_match = merged_df[(merged_df['Genotype/Allele'] == genotype) & (merged_df['parsed_genotype'] == parsed_genotype) & (~merged_df['Variant Annotation ID'].isna())]
        # Rows that matched on parsed genotype
        rows_second_match = merged_df_2[(merged_df_2['Genotype/Allele'] == genotype) & (merged_df_2['parsed_genotype'] == parsed_genotype) & (~merged_df_2['Variant Annotation ID'].isna())]
        # If neither matches, add with nan's
        if rows_first_match.empty and rows_second_match.empty:
            all_results.append(merged_df[(merged_df['Genotype/Allele'] == genotype) & (merged_df['parsed_genotype'] == parsed_genotype)])
        else:
            all_results.extend([rows_first_match, rows_second_match])

    final_result = pd.concat(all_results).drop_duplicates()

    # If _no_ part of a variant annotation is associated with any clinical annotation, want this listed with nan's
    for vaid, alleles, split_1, split_2, _ in split_ann_df.itertuples(index=False):
        results_with_vaid = final_result[final_result['Variant Annotation ID'] == vaid]
        if results_with_vaid.empty:
            final_result = pd.concat((final_result,
                                      merged_df[(merged_df['Variant Annotation ID'] == vaid) & (merged_df['Alleles'] == alleles) &
                                                (merged_df['split_alleles_1'] == split_1) & (merged_df['split_alleles_2'] == split_2)]
                                     ))
    return final_result
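
The function above relies on outer merges keeping unmatched rows from both sides, with NaN in the missing columns. A toy illustration of that behaviour, using made-up IDs and pandas' `indicator=True` option (which the function itself does not use), might look like this:

```python
import pandas as pd

clin = pd.DataFrame({'Genotype/Allele': ['AA', 'AG', 'GG']})
ann = pd.DataFrame({
    'Variant Annotation ID': ['va1', 'va1', 'va2'],  # made-up IDs
    'split_alleles_1': ['AA', 'AG', 'poor metabolizer'],
})

merged = pd.merge(clin, ann, how='outer',
                  left_on='Genotype/Allele', right_on='split_alleles_1',
                  indicator=True)

# Clinical genotypes with no variant annotation (right-hand columns are NaN)
print(merged[merged['_merge'] == 'left_only']['Genotype/Allele'].tolist())
# ['GG']

# Variant annotation pieces with no clinical genotype (left-hand columns are NaN)
print(merged[merged['_merge'] == 'right_only']['split_alleles_1'].tolist())
# ['poor metabolizer']
```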

In [457]:
def get_results(caid_to_vaid):
    results = {}
    for caid, vaids in caid_to_vaid.items():
        clinical_alleles_df = clinical_ann_alleles[clinical_ann_alleles[ID_COL_NAME] == caid]
        # Get common columns among all types of variant annotations so we can concat
        drug_df = var_drug_ann[var_drug_ann['Variant Annotation ID'].isin(vaids)][['Variant Annotation ID', 'Alleles', 'Sentence']]
        phenotype_df = var_pheno_ann[var_pheno_ann['Variant Annotation ID'].isin(vaids)][['Variant Annotation ID', 'Alleles', 'Sentence']]
        assay_df = var_fa_ann[var_fa_ann['Variant Annotation ID'].isin(vaids)][['Variant Annotation ID', 'Alleles', 'Sentence']]

        all_variant_annotations = pd.concat((drug_df, phenotype_df, assay_df))
        results[caid] = get_associations(all_variant_annotations, clinical_alleles_df)

    return results

In [458]:
example_results = get_results(get_evidence(example_ca_ids))

In [459]:
# Approximate percentage of variant annotations we were able to associate to clinical annotation
for caid in example_ca_ids:
    num_rows = len(example_results[caid])
    num_success = example_results[caid][ID_COL_NAME].count()
    print(f'{caid}: {num_success}/{num_rows} ({num_success/num_rows*100:0.2f}%)')

981755803: 62/62 (100.00%)
1139506787: 168/171 (98.25%)
1183888969: 12/12 (100.00%)
1184514050: 39/39 (100.00%)
981419266: 20/20 (100.00%)


In [460]:
# Approximate percentage of clinical annotation genotype/alleles we were able to associate
for caid in example_ca_ids:
    num_rows = len(example_results[caid])
    num_success = example_results[caid]['Variant Annotation ID'].count()
    print(f'{caid}: {num_success}/{num_rows} ({num_success/num_rows*100:0.2f}%)')

981755803: 61/62 (98.39%)
1139506787: 171/171 (100.00%)
1183888969: 10/12 (83.33%)
1184514050: 39/39 (100.00%)
981419266: 20/20 (100.00%)


In [371]:
example_results['1139506787'][example_results['1139506787'][ID_COL_NAME].isna()]

Unnamed: 0,Clinical Annotation ID,Genotype/Allele,parsed_genotype,Annotation Text,Variant Annotation ID,Alleles,split_alleles_1,split_alleles_2,Sentence
8,,,,,1451663080,CT + TT,CT,CT,Genotypes CT + TT are associated with decreased metabolism of nicotine as compared to genotype CC.
9,,,,,1451663080,CT + TT,TT,TT,Genotypes CT + TT are associated with decreased metabolism of nicotine as compared to genotype CC.
84,,,,,1451667320,T,T,T,Allele T is associated with decreased metabolism of nicotine as compared to allele C.
85,,,,,1451672860,T,T,T,Allele T is associated with decreased metabolism of nicotine as compared to allele C.
173,,,,,1183703333,,,,CYP2A6 poor metabolizer is associated with decreased metabolism of nicotine in nonsmokers as compared to CYP2A6 normal metabolizer.
10,,,,,1183689160,,,,CYP2A6 poor metabolizer is associated with increased ratio of cotinine formation to removal when exposed to nicotine in nonsmokers as compared to CYP2A6 normal metabolizer.
11,,,,,1183689165,,,,CYP2A6 poor metabolizer is associated with decreased ratio of plasma cotinine to urinary TNE when exposed to nicotine in smokers as compared to CYP2A6 normal metabolizer.


In [372]:
example_results['981419266'][example_results['981419266'][ID_COL_NAME].isna()]

Unnamed: 0,Clinical Annotation ID,Genotype/Allele,parsed_genotype,Annotation Text,Variant Annotation ID,Alleles,split_alleles_1,split_alleles_2,Sentence
18,,,,,981859108,*15:02 + *46:01,*46:01,*46:01,HLA-B *15:02 + *46:01 are associated with increased risk of Stevens-Johnson Syndrome when treated with phenytoin in Epilepsy.


7 missing from `1139506787`:
* 4 are annotations on an rsID that is part of the named haplotype
* 3 are on "poor metabolizers"

1 missing from `981419266`:
* an allele that is indeed not part of the clinical annotation for some reason

Note that these counts are taken after all the exploding; counting by variant annotation, it should be something like 6 missing + 0 missing.

Also, this counts coverage of variant annotations, not coverage of clinical annotations (e.g. not all genotypes in clinical annotations are present in these tables).

In [461]:
for caid, table in example_results.items():
    table.to_csv(f'{data_dir}/{caid}_association_test.csv', index=False)

### Full dataset run

[Top of page](#Table-of-contents)

In [462]:
full_results = get_results(get_evidence())

In [504]:
def compute_variant_annotation_coverage(results):
    """Compute fraction of variant annotations that are covered by at least one clinical annotation genotype."""
    total_var_anns = 0
    total_failure = 0
    for caid in results:
        var_anns = results[caid]['Variant Annotation ID'].dropna().unique()
        total_var_anns += len(var_anns)
        match = set(results[caid][~results[caid][ID_COL_NAME].isna()]['Variant Annotation ID'].dropna().unique())
        non_match = set(results[caid][results[caid][ID_COL_NAME].isna()]['Variant Annotation ID'].dropna().unique())
        total_failure += len(non_match - match)

    total_success = total_var_anns - total_failure
    print(f'Coverage of variant annotations: {total_success}/{total_var_anns} ({total_success/total_var_anns*100:0.2f}%)')

In [499]:
def compute_clinical_annotation_coverage(results):
    """Compute fraction of clinical annotation genotype/alleles that are covered by at least one variant annotation"""
    total_genotypes = 0
    total_failure = 0
    
    for caid in results:
        genotypes = results[caid]['Genotype/Allele'].dropna().unique()
        total_genotypes += len(genotypes)
        match = set(results[caid][~results[caid]['Variant Annotation ID'].isna()]['Genotype/Allele'].dropna().unique())
        non_match = set(results[caid][results[caid]['Variant Annotation ID'].isna()]['Genotype/Allele'].dropna().unique())
        total_failure += len(non_match - match)
        
    total_success = total_genotypes - total_failure
    print(f'Coverage of clinical annotation genotype/alleles: {total_success}/{total_genotypes} ({total_success/total_genotypes*100:0.2f}%)')
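
Both coverage functions take `non_match - match` rather than just counting rows with NaN. Presumably this is because the same genotype (or variant annotation) can appear in both a matched row and an unmatched row after exploding. A toy table with made-up values shows the effect:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Genotype/Allele': ['AA', 'AA', 'AG', 'GG'],
    'Variant Annotation ID': ['va1', np.nan, 'va1', np.nan],  # made-up IDs
})

has_evidence = toy['Variant Annotation ID'].notna()
match = set(toy[has_evidence]['Genotype/Allele'])
non_match = set(toy[~has_evidence]['Genotype/Allele'])

# 'AA' has one NaN row but is matched elsewhere, so only 'GG' is uncovered
uncovered = non_match - match
print(uncovered)  # {'GG'}
```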

In [505]:
compute_variant_annotation_coverage(full_results)

Coverage of variant annotations: 13380/14280 (93.70%)


In [500]:
compute_clinical_annotation_coverage(full_results)

Coverage of clinical annotation genotype/alleles: 10258/15755 (65.11%)
