# Benchmarking ClinGen evidence

We want to understand how ClinGen evidence, and the scores assigned to it, contributes to identifying known drug/target interactions.

**Steps:**
1. Fetching data
2. Exploring data: distribution of scores, annotation and uniqueness of evidence
3. Calculating recovery of Pharmaproject hits

## Fetching data + exploration

In [None]:
%%bash 

# Fetching clingen evidence from google bucket:
gsutil cp -r gs://otar000-evidence_input/ClinGen/clingen_2020-08-04.json.gz .

In [3]:
import json
import pandas as pd
import gzip

clingenFile = './clingen_2020-08-04.json.gz'

parsed_data = []
with gzip.open(clingenFile, 'r') as f:
    for line in f:
        data = json.loads(line)
        parsed_data.append({
            'confidence': data['evidence']['confidence'],
            'allelic_requirement': data['evidence']['allelic_requirement'],
            'disease': data['disease']['id'].split('/')[-1],
            'target': data['target']['id'].split('/')[-1],
            'unique_disease_id': data['unique_association_fields']['disease_id'],
            'score': data['evidence']['resource_score']['value']    
        })

clingen_data = pd.DataFrame(parsed_data)
clingen_data.head()

|    | confidence           | allelic_requirement | disease       | target          | unique_disease_id |   score |
|---:|:---------------------|:--------------------|:--------------|:----------------|:------------------|--------:|
|  0 | No Reported Evidence | Autosomal Dominant  | Orphanet_500  | ENSG00000166535 | MONDO_0007893     |    0.01 |
|  1 | No Reported Evidence | Autosomal Dominant  | Orphanet_1340 | ENSG00000166535 | MONDO_0015280     |    0.01 |
|  2 | No Reported Evidence | Autosomal Dominant  | Orphanet_3071 | ENSG00000166535 | MONDO_0009026     |    0.01 |
|  3 | Disputed             | Autosomal Dominant  | Orphanet_648  | ENSG00000166535 | MONDO_0018997     |    0.1  |
|  4 | No Reported Evidence | Autosomal Dominant  | Orphanet_2701 | ENSG00000166535 | MONDO_0011899     |    0.01 |
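To make the flattening step easier to follow, here is a minimal sketch of how a single evidence line maps to one row. The record below is illustrative: its key layout mirrors what the parsing cell reads, the values are taken from the first row shown above, and the `http://example.org/` prefixes as well as the helper name `flatten_evidence` are assumptions made for this example.

```python
import json

# Illustrative record: key layout mirrors the parsing cell; the URL prefixes
# are placeholders (assumption), not the namespaces used in the real file.
raw_line = json.dumps({
    'evidence': {
        'confidence': 'No Reported Evidence',
        'allelic_requirement': 'Autosomal Dominant',
        'resource_score': {'value': 0.01},
    },
    'disease': {'id': 'http://example.org/Orphanet_500'},
    'target': {'id': 'http://example.org/ENSG00000166535'},
    'unique_association_fields': {'disease_id': 'MONDO_0007893'},
})


def flatten_evidence(line):
    """Turn one JSON evidence line into a flat dict (one DataFrame row)."""
    data = json.loads(line)
    return {
        'confidence': data['evidence']['confidence'],
        'allelic_requirement': data['evidence']['allelic_requirement'],
        # Keep only the trailing identifier of the URL-style ids
        'disease': data['disease']['id'].split('/')[-1],
        'target': data['target']['id'].split('/')[-1],
        'unique_disease_id': data['unique_association_fields']['disease_id'],
        'score': data['evidence']['resource_score']['value'],
    }


row = flatten_evidence(raw_line)
print(row)
```

Keeping the flattening in a small function makes it possible to check it on one record before running it over the whole file.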


In [97]:
# clingen_data.apply(lambda row: (row['confidence'],row['score']), axis=1).value_counts()

summary_list = []
# Group by a single column name (not a one-element list) so `conf` is a plain string
for conf, group in clingen_data.groupby('confidence'):
    summary_list.append({
        'confidence': conf,
        'evidence_count': len(group),
        'score': group.score.tolist()[0]
    })
    
summary_df = pd.DataFrame(summary_list)
summary_df.sort_values(['score'],ascending=False, inplace=True)
summary_df['counts_perc'] = summary_df.evidence_count.apply(lambda x: round(x / summary_df.evidence_count.sum() * 100, 1))

summary_df['cumulative_perc'] = summary_df.evidence_count.cumsum().apply(lambda x: round(x / summary_df.evidence_count.sum() * 100, 1))



summary_df['target_cnt'] = summary_df.confidence.apply(lambda x: len(clingen_data.loc[clingen_data.confidence == x].target.unique()))
summary_df['disease_cnt'] = summary_df.confidence.apply(lambda x: len(clingen_data.loc[clingen_data.confidence == x].disease.unique()))



summary_df



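|    | confidence           |   evidence_count |   score |   counts_perc |   cumulative_perc |   target_cnt |   disease_cnt |
|---:|:---------------------|-----------------:|--------:|--------------:|------------------:|-------------:|--------------:|
|  0 | Definitive           |              528 |    1.0  |          53.9 |              53.9 |          498 |           333 |
|  6 | Strong               |               19 |    0.75 |           1.9 |              55.8 |           19 |            14 |
|  3 | Moderate             |              100 |    0.5  |          10.2 |              66.0 |           99 |            52 |
|  2 | Limited              |              162 |    0.25 |          16.5 |              82.6 |          152 |            55 |
|  1 | Disputed             |               69 |    0.1  |           7.0 |              89.6 |           63 |            23 |
|  5 | Refuted              |               10 |    0.05 |           1.0 |              90.6 |           10 |             9 |
|  4 | No Reported Evidence |               92 |    0.01 |           9.4 |             100.0 |           48 |            11 |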
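The per-class summary built in the loop above can also be expressed as a single `groupby(...).agg(...)` call. The sketch below uses a small made-up frame (`toy`, hypothetical rows with the same column names as `clingen_data`) and assumes, as the real data suggests, that every confidence class maps to exactly one score.

```python
import pandas as pd

# Hypothetical stand-in for clingen_data: same column names, made-up rows.
toy = pd.DataFrame({
    'confidence': ['Definitive', 'Definitive', 'Limited', 'Limited', 'Limited'],
    'score':      [1.0, 1.0, 0.25, 0.25, 0.25],
    'target':     ['ENSG_A', 'ENSG_B', 'ENSG_A', 'ENSG_A', 'ENSG_C'],
    'disease':    ['D1', 'D1', 'D2', 'D3', 'D3'],
})

summary = (
    toy.groupby('confidence')
       .agg(evidence_count=('target', 'size'),     # rows per class
            score=('score', 'first'),              # one score per class (assumption)
            target_cnt=('target', 'nunique'),      # unique targets per class
            disease_cnt=('disease', 'nunique'))    # unique diseases per class
       .reset_index()
       .sort_values('score', ascending=False)
)

# Percentages are computed on whole columns instead of row-wise apply calls
total = summary['evidence_count'].sum()
summary['counts_perc'] = (summary['evidence_count'] / total * 100).round(1)
summary['cumulative_perc'] = (summary['evidence_count'].cumsum() / total * 100).round(1)

print(summary)
```

Named aggregation keeps all per-class statistics in one place, which avoids filtering `clingen_data` once per class and per column.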


### Checking the uniqueness of target/disease pairs

In [70]:
number_of_assoc = len(clingen_data)
number_of_unique_t_d = len(clingen_data[['target','disease']].drop_duplicates())

# Count of unique disease and target:
print(f'Number of unique diseases: {len(clingen_data.disease.unique())}')
print(f'Number of unique targets: {len(clingen_data.target.unique())}')

# Counts of disease/target pairs:
print(f'Number of associations: {number_of_assoc}')
print(f'Number of unique target/disease pairs: {number_of_unique_t_d}')

# How the targets are distributed across diseases:
target_disease_distribution = {}
for disease in clingen_data.disease.unique():
    target_count = len(clingen_data.loc[clingen_data.disease == disease].target.unique())
    # dict.get replaces the bare except, which would also hide unrelated errors
    target_disease_distribution[target_count] = target_disease_distribution.get(target_count, 0) + 1

print('')
for k in sorted(target_disease_distribution.keys()):
    print(f"Number of diseases with {k} targets: {target_disease_distribution[k]}")

Number of unique diseases: 393
Number of unique targets: 776
Number of associations: 980
Number of unique target/disease pairs: 961

Number of diseases with 1 targets: 310
Number of diseases with 2 targets: 33
Number of diseases with 3 targets: 17
Number of diseases with 4 targets: 5
Number of diseases with 5 targets: 4
Number of diseases with 6 targets: 6
Number of diseases with 9 targets: 1
Number of diseases with 12 targets: 1
Number of diseases with 14 targets: 1
Number of diseases with 18 targets: 1
Number of diseases with 20 targets: 6
Number of diseases with 22 targets: 3
Number of diseases with 25 targets: 1
Number of diseases with 33 targets: 2
Number of diseases with 45 targets: 1
Number of diseases with 83 targets: 1
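The targets-per-disease distribution above can also be obtained without an explicit loop. This is a minimal sketch on a made-up frame (`toy` is hypothetical and only shares the `disease` and `target` column names with `clingen_data`).

```python
import pandas as pd

# Hypothetical stand-in for clingen_data with one repeated target/disease row.
toy = pd.DataFrame({
    'disease': ['D1', 'D1', 'D1', 'D2', 'D3', 'D3'],
    'target':  ['T1', 'T2', 'T2', 'T1', 'T3', 'T4'],
})

# Step 1: number of unique targets for each disease
targets_per_disease = toy.groupby('disease')['target'].nunique()

# Step 2: how many diseases share each target count
distribution = targets_per_disease.value_counts().sort_index()

for n_targets, n_diseases in distribution.items():
    print(f"Number of diseases with {n_targets} targets: {n_diseases}")
```

Because `nunique` is used, repeated target/disease rows do not inflate the counts, which matches the behaviour of the loop-based version.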


In [54]:
d = {
    'redundant_cnt': 0,
    'redundancy_level': []
}
for (target,disease), group in clingen_data.groupby(['target','disease']):

    if len(group) > 1:
        d['redundant_cnt'] += 1
        d['redundancy_level'].append(str(len(group)))
        print(f"\nNon-unique target/disease pair: {target}/{disease}")
        # Report which columns differ between the redundant rows:
        for col in group.columns:
            if len(group[col].unique()) != 1:
                print(f"Non-unique column: {col}. Values: {','.join(group[col].unique().astype(str).tolist())}")

# Let's get the counts:
print(f"\n\nNumber of redundant target/disease pairs: {d['redundant_cnt']}")
print(f"Distribution of redundancy levels: {','.join(d['redundancy_level'])}")


Non-unique target/disease pair: ENSG00000099949/Orphanet_1340
Non-unique column: allelic_requirement. Values: Autosomal Dominant,Autosomal Recessive

Non-unique target/disease pair: ENSG00000099949/Orphanet_2701
Non-unique column: allelic_requirement. Values: Autosomal Recessive,Autosomal Dominant

Non-unique target/disease pair: ENSG00000099949/Orphanet_3071
Non-unique column: allelic_requirement. Values: Autosomal Dominant,Autosomal Recessive

Non-unique target/disease pair: ENSG00000099949/Orphanet_500
Non-unique column: allelic_requirement. Values: Autosomal Dominant,Autosomal Recessive

Non-unique target/disease pair: ENSG00000099949/Orphanet_648
Non-unique column: confidence. Values: Limited,Strong
Non-unique column: allelic_requirement. Values: Autosomal Recessive,Autosomal Dominant
Non-unique column: score. Values: 0.25,0.75

Non-unique target/disease pair: ENSG00000108576/MONDO_0100038

Non-unique target/disease pair: ENSG00000109927/Orphanet_87884
Non-unique column: allelic_require
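As a compact alternative to the loop, redundant target/disease pairs can be flagged with `DataFrame.duplicated`. The sketch below runs on a made-up frame (`toy`, with hypothetical values and abbreviated allelic requirement labels) and reports, per redundant pair, how many distinct values each annotation column takes.

```python
import pandas as pd

# Hypothetical stand-in for clingen_data: T1/D1 differs in its annotation,
# T3/D2 is an exact duplicate, T2/D1 is unique.
toy = pd.DataFrame({
    'target':              ['T1', 'T1', 'T2', 'T3', 'T3'],
    'disease':             ['D1', 'D1', 'D1', 'D2', 'D2'],
    'confidence':          ['Limited', 'Strong', 'Definitive', 'Definitive', 'Definitive'],
    'allelic_requirement': ['AR', 'AD', 'AD', 'AD', 'AD'],
})

# keep=False marks every member of a repeated pair, not only the later copies
dup_mask = toy.duplicated(subset=['target', 'disease'], keep=False)
redundant = toy[dup_mask]

# Distinct values per annotation column within each redundant pair;
# a value above 1 means the redundant rows disagree in that column.
varying = (
    redundant.groupby(['target', 'disease'])[['confidence', 'allelic_requirement']]
             .nunique()
)

print(f"Number of redundant target/disease pairs: {len(varying)}")
print(varying)
```

A pair whose counts are all 1 corresponds to fully identical rows, like the pair in the output above that lists no differing column.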

## Comparing performance against Pharmaproject data

In [76]:
# Read/prepare pharmaproject file:
pharmaprojectFile = './abbvie_pharmaprojects_2018_mapped.csv'
pharma_df = pd.read_csv(pharmaprojectFile, sep=',')
pharma_df.rename(columns={'lApprovedUS.EU':'is_approved'}, inplace=True)


# What do we have:
print(f'Number of disease target pairs: {len(pharma_df)}')
print(f"Number of unique disease/target pairs: {len(pharma_df[['ensembl_id','id']].drop_duplicates())}")
print(f"Number of unique targets: {len(pharma_df.ensembl_id.unique())}")
print(f"Number of unique diseases: {len(pharma_df.id.unique())}")

# What is in the approved subset?
approved = pharma_df.loc[pharma_df.is_approved == True]
print('\n\nIn the approved set:')
print(f'Number of disease target pairs: {len(approved)}')
print(f"Number of unique disease/target pairs: {len(approved[['ensembl_id','id']].drop_duplicates())}")
print(f"Number of unique targets: {len(approved.ensembl_id.unique())}")
print(f"Number of unique diseases: {len(approved.id.unique())}")


Number of disease target pairs: 22947
Number of unique disease/target pairs: 22947
Number of unique targets: 2089
Number of unique diseases: 514


In the approved set:
Number of disease target pairs: 1606
Number of unique disease/target pairs: 1606
Number of unique targets: 368
Number of unique diseases: 293


### Calculating the overlap between the Pharmaproject data and ClinGen

In [138]:
# Get the number of targets in the Pharmaproject data corresponding to a given ClinGen confidence:
def get_pharmaproject_target_count(conf, approved_only=False):
    conf_genes = clingen_data.loc[clingen_data.confidence == conf].target.unique().tolist()
    if approved_only:
        # Each condition is parenthesised because `&` binds tighter than `==`
        return len(pharma_df.loc[(pharma_df.ensembl_id.isin(conf_genes)) &
                                 (pharma_df.is_approved == True)].ensembl_id.unique())
    else:
        return len(pharma_df.loc[pharma_df.ensembl_id.isin(conf_genes)].ensembl_id.unique())


# Get the number of diseases in the Pharmaproject data corresponding to a given ClinGen confidence:
def get_pharmaproject_disease_count(conf, approved_only=False):
    conf_diseases = clingen_data.loc[clingen_data.confidence == conf].disease.unique().tolist()
    if approved_only:
        # Each condition is parenthesised because `&` binds tighter than `==`
        return len(pharma_df.loc[(pharma_df.id.isin(conf_diseases)) &
                                 (pharma_df.is_approved == True)].id.unique())
    else:
        return len(pharma_df.loc[pharma_df.id.isin(conf_diseases)].id.unique())


# Look up exact d/t match:
def get_pharmaproject_association_count(conf, approved_only=False):
    conf_df = clingen_data.loc[clingen_data.confidence == conf][['target','disease']].drop_duplicates()
    
    if approved_only:
        return conf_df.apply(lambda row: len(
                            pharma_df.loc[
                                (pharma_df.id == row['disease'])&
                                (pharma_df.ensembl_id == row['target']) &
                                (pharma_df.is_approved == True)]), axis=1).sum()
    else:
        return conf_df.apply(lambda row: len(
                            pharma_df.loc[
                                (pharma_df.id == row['disease'])&
                                (pharma_df.ensembl_id == row['target'])]), axis=1).sum()    
    
# Adding overlap counts for pharmaproject:
summary_df['approved_targets']  = summary_df.confidence.apply(get_pharmaproject_target_count, approved_only=True)
summary_df['poscon_targets']    = summary_df.confidence.apply(get_pharmaproject_target_count)
summary_df['approved_diseases'] = summary_df.confidence.apply(get_pharmaproject_disease_count, approved_only=True)
summary_df['poscon_diseases']   = summary_df.confidence.apply(get_pharmaproject_disease_count)

summary_df['approved_associations'] = summary_df.confidence.apply(get_pharmaproject_association_count, approved_only=True)
summary_df['poscon_associations']   = summary_df.confidence.apply(get_pharmaproject_association_count)

summary_df


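|    | confidence           |   evidence_count |   score |   counts_perc |   cumulative_perc |   target_cnt |   disease_cnt |   approved_targets |   poscon_targets |   approved_diseases |   poscon_diseases |   approved_associations |   poscon_associations |
|---:|:---------------------|-----------------:|--------:|--------------:|------------------:|-------------:|--------------:|-------------------:|-----------------:|--------------------:|------------------:|------------------------:|----------------------:|
|  0 | Definitive           |              528 |    1.0  |          53.9 |              53.9 |          498 |           333 |                 36 |              126 |                   3 |                 7 |                       1 |                     3 |
|  6 | Strong               |               19 |    0.75 |           1.9 |              55.8 |           19 |            14 |                  2 |                6 |                   0 |                 1 |                       0 |                     0 |
|  3 | Moderate             |              100 |    0.5  |          10.2 |              66.0 |           99 |            52 |                  6 |               17 |                   0 |                 1 |                       0 |                     0 |
|  2 | Limited              |              162 |    0.25 |          16.5 |              82.6 |          152 |            55 |                 10 |               35 |                   1 |                 2 |                       0 |                     1 |
|  1 | Disputed             |               69 |    0.1  |           7.0 |              89.6 |           63 |            23 |                 10 |               17 |                   2 |                 2 |                       0 |                     0 |
|  5 | Refuted              |               10 |    0.05 |           1.0 |              90.6 |           10 |             9 |                  0 |                2 |                   1 |                 1 |                       0 |                     0 |
|  4 | No Reported Evidence |               92 |    0.01 |           9.4 |             100.0 |           48 |            11 |                  8 |               16 |                   0 |                 1 |                       0 |                     0 |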
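The exact target/disease matching done row by row in `get_pharmaproject_association_count` could also be written as one join. The following is a sketch on made-up frames (`clingen_toy` and `pharma_toy` are hypothetical and only reuse the column names of `clingen_data` and `pharma_df`). It gives the same counts as the row-wise version as long as each target/disease pair occurs once in the Pharmaproject table, which the pair counts printed earlier suggest.

```python
import pandas as pd

# Hypothetical stand-ins that reuse the notebook's column names.
clingen_toy = pd.DataFrame({
    'target':     ['T1', 'T2', 'T3', 'T3'],
    'disease':    ['D1', 'D1', 'D2', 'D2'],
    'confidence': ['Definitive', 'Definitive', 'Limited', 'Limited'],
})
pharma_toy = pd.DataFrame({
    'ensembl_id':  ['T1', 'T3', 'T9'],
    'id':          ['D1', 'D2', 'D1'],
    'is_approved': [True, False, True],
})

# Unique target/disease pairs per confidence class, as in the original function
pairs = clingen_toy[['confidence', 'target', 'disease']].drop_duplicates()

# The inner join keeps only pairs present in both data sets
merged = pairs.merge(
    pharma_toy,
    left_on=['target', 'disease'],
    right_on=['ensembl_id', 'id'],
    how='inner',
)

overlap = merged.groupby('confidence').agg(
    poscon_associations=('target', 'size'),         # all matched pairs
    approved_associations=('is_approved', 'sum'),   # matched pairs flagged as approved
)
print(overlap)
```

Classes without any match are absent from `overlap`, so reindexing on the full list of confidence classes and filling with 0 would be needed before adding the columns to `summary_df`.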


## Results


The table below shows the evidence distribution in the ClinGen data set stratified by the `confidence` field (which is the basis of the scoring):

| confidence class     |   score |   evidence count | count % |cumulative % |
|:---------------------|--------:|-----------------:|--------------:|------------------:|
| Definitive           |    1    |              528 |          53.9 |              53.9 |
| Strong               |    0.75 |               19 |           1.9 |              55.8 |
| Moderate             |    0.5  |              100 |          10.2 |              66   |
| Limited              |    0.25 |              162 |          16.5 |              82.6 |
| Disputed             |    0.1  |               69 |           7   |              89.6 |
| Refuted              |    0.05 |               10 |           1   |              90.6 |
| No Reported Evidence |    0.01 |               92 |           9.4 |             100   |

The table below shows the number of unique targets in each confidence category and how many of them can be found in the Pharmaproject dataset:

| confidence           |   target_cnt |   poscon_targets |   approved_targets |
|:---------------------|-------------:|-----------------:|-------------------:|
| Definitive           |          498 |              126 |                 36 |
| Strong               |           19 |                6 |                  2 |
| Moderate             |           99 |               17 |                  6 |
| Limited              |          152 |               35 |                 10 |
| Disputed             |           63 |               17 |                 10 |
| Refuted              |           10 |                2 |                  0 |
| No Reported Evidence |           48 |               16 |                  8 |
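To read the table above as recovery rates, the matched counts can be divided by the number of unique targets in each class. The sketch below copies the counts from the table; the names `target_counts` and `recovery` are introduced here for illustration only.

```python
# (target_cnt, poscon_targets, approved_targets), copied from the table above
target_counts = {
    'Definitive':           (498, 126, 36),
    'Strong':               (19, 6, 2),
    'Moderate':             (99, 17, 6),
    'Limited':              (152, 35, 10),
    'Disputed':             (63, 17, 10),
    'Refuted':              (10, 2, 0),
    'No Reported Evidence': (48, 16, 8),
}

# Fraction of each class's targets that appear in the Pharmaproject data
recovery = {
    conf: (poscon / total, approved / total)
    for conf, (total, poscon, approved) in target_counts.items()
}

print(f"{'confidence':<22}{'poscon':>10}{'approved':>10}")
for conf, (poscon_rate, approved_rate) in recovery.items():
    print(f"{conf:<22}{poscon_rate:>10.1%}{approved_rate:>10.1%}")
```

The rates for the small classes (for example Strong and Refuted) rest on very few targets, so they should be compared with caution.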

The table below shows the number of unique diseases in each confidence category and how many of them can be found in the Pharmaproject dataset:

| confidence           |   disease_cnt |   poscon_diseases |   approved_diseases |
|:---------------------|--------------:|------------------:|--------------------:|
| Definitive           |           333 |                 7 |                   3 |
| Strong               |            14 |                 1 |                   0 |
| Moderate             |            52 |                 1 |                   0 |
| Limited              |            55 |                 2 |                   1 |
| Disputed             |            23 |                 2 |                   2 |
| Refuted              |             9 |                 1 |                   1 |
| No Reported Evidence |            11 |                 1 |                   0 |

The last table shows the number of associations in each category and how many of them were found in the Pharmaproject dataset:

| confidence           |   evidence_count |   poscon_associations |   approved_associations |
|:---------------------|-----------------:|----------------------:|------------------------:|
| Definitive           |              528 |                     3 |                       1 |
| Strong               |               19 |                     0 |                       0 |
| Moderate             |              100 |                     0 |                       0 |
| Limited              |              162 |                     1 |                       0 |
| Disputed             |               69 |                     0 |                       0 |
| Refuted              |               10 |                     0 |                       0 |
| No Reported Evidence |               92 |                     0 |                       0 |


In [181]:
# pandas >= 1.1 takes index=False; older versions forwarded showindex to tabulate
print(summary_df[['confidence','evidence_count','poscon_associations','approved_associations']].to_markdown(index=False))


| confidence           |   evidence_count |   poscon_associations |   approved_associations |
|:---------------------|-----------------:|----------------------:|------------------------:|
| Definitive           |              528 |                     3 |                       1 |
| Strong               |               19 |                     0 |                       0 |
| Moderate             |              100 |                     0 |                       0 |
| Limited              |              162 |                     1 |                       0 |
| Disputed             |               69 |                     0 |                       0 |
| Refuted              |               10 |                     0 |                       0 |
| No Reported Evidence |               92 |                     0 |                       0 |



In [6]:
clingen_data.to_csv('./parsed_clingen_data.tsv.gz', sep='\t', compression='infer', index=False)