# IMPC - round #3

We define a disease as relevant if it is:
- A descendant of hematologic disease (`EFO_0005803`); these diseases will be categorized as `D`.
- A descendant of hematological measurement (`EFO_0004503`); these diseases will be categorized as `M`.
- A descendant of Abnormality of the blood and blood-forming tissues (`HP_0001871`); these diseases will be categorized as `P`.

This table will be joined with associations, then stratified by EFO term categories. Then the report is generated as usual.


In [80]:
from statistics import median
from functools import reduce 

from pyspark.sql import dataframe
import pyspark.sql
import pyspark.sql.types as t
import pyspark.sql.functions as f
from pyspark.sql.window import Window
from pyspark.conf import SparkConf

spark_conf = (
    SparkConf()
    .set("spark.driver.memory", "10g")
    .set("spark.executor.memory", "10g")
    .set("spark.driver.maxResultSize", "0")
    .set("spark.debug.maxToStringFields", "2000")
    .set("spark.sql.execution.arrow.maxRecordsPerBatch", "500000")
    .set("spark.driver.bindAddress", "127.0.0.1")
)
spark = (
    pyspark.sql.SparkSession.builder.config(conf=spark_conf)
    .master("local[*]")
    .getOrCreate()
)
   
disease_dataset = '/Users/dsuveges/project_data/diseases_22.06'
association_data_nomuse = '/Users/dsuveges/project_data/nmm/associationByOverallDirect'
association_data = '/Users/dsuveges/project_data/associationByOverallDirect'
target_data = '/Users/dsuveges/project_data/targets'


## Processing disease index

1. Get list of diseases that we will use as root terms
2. Read disease index
3. Filter for the above root terms, explode to all terms while keeping track of the source
4. Join with original disease set
5. Save annotated disease set as tsv.
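
The steps above can be sketched in plain Python on a toy ontology. The identifiers `TOY_1` to `TOY_3` and the `descendants` mapping are made up for illustration; only the three root identifiers and their labels come from this notebook. This is not the Spark code used below, just a minimal illustration of the explode, group and left-join logic.

```python
from collections import defaultdict

# Root terms and category labels, as defined in this notebook.
roots = {"EFO_0005803": "D", "EFO_0004503": "M", "HP_0001871": "P"}

# Hypothetical disease index: term id -> list of descendant ids.
descendants = {
    "EFO_0005803": ["TOY_1", "TOY_2"],
    "EFO_0004503": [],
    "HP_0001871": ["TOY_2"],
    "TOY_1": [],
    "TOY_2": [],
    "TOY_3": [],
}

# Step 3: explode each root to itself plus its descendants, tracking the category.
categories = defaultdict(set)
for root, label in roots.items():
    for term in [root, *descendants[root]]:
        categories[term].add(label)

# Step 4: left join back onto the full index; unmatched terms are not relevant.
annotated = {
    term: {
        "category": sorted(categories.get(term, [])),
        "isRelevant": term in categories,
    }
    for term in descendants
}
```

A term that sits under more than one root (here `TOY_2`) ends up with several category labels, which is why the category is kept as a list rather than a single value.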

In [86]:

# Create a dataframe with the relevant disease identifiers and the corresponding category label:
category_of_interest = spark.createDataFrame([
    {'id': 'EFO_0005803', 'category': 'D'}, # hematologic disease
    {'id': 'EFO_0004503', 'category': 'M'}, # hematological measurement
    {'id': 'HP_0001871',  'category': 'P'}  # Abnormality of the blood and blood-forming tissues
])

# Relevant diseases are all descendants of the above terms:
relevant_diseases = (
    spark.read.parquet(disease_dataset)
    
    # Extract the relevant rows from the disease index:
    .join(category_of_interest, on='id', how='right')
    
    # Add each root term to its own descendants; otherwise the three root terms would not be annotated as relevant:
    .withColumn('descendants', f.array_union(f.col('descendants'), f.array(f.col('id'))))
    
    # Extract descendants:
    .select('category', f.explode('descendants').alias('id'))
    
    # Grouping by disease id -> get a list of categories the disease is annotated with:
    .groupby('id')
    .agg(f.collect_set('category').alias('category'))
    .persist()
)

# Join the disease index with the above generated list:
annotated_diseases = (
    spark.read.parquet(disease_dataset)
    .join(relevant_diseases, on='id', how='left')
    .select(
        f.col('id').alias('diseaseId'),
        f.col('name').alias('diseaseName'),
        f.col('category')
    )
    .withColumn('isRelevant', f.when(f.col('category').isNotNull(), True).otherwise(False))
    .persist()
)

# Saving data:
annotated_diseases.toPandas().to_csv('annotated_diseases.tsv', sep='\t', index=False)

## Processing gene dataset

1. Read parquet
2. Select/rename columns

In [87]:
# Read targets:
targets = (
    spark.read.parquet(target_data)
    .select(
        f.col('id').alias('targetId'),
        f.col('approvedSymbol').alias('targetSymbol'),
        f.col('approvedName').alias('targetName')
    )
    .persist()
)

## Processing association dataset

Association sources:
- Overall **direct** associations from Platform release `22.06`
- Overall **direct** associations from Platform release `22.06`, where the weight of mouse models is set to 0.

Setting the weight of mouse models to zero serves two purposes:
1. It allows removing associations that are supported only by mouse models.
2. It allows assessing the importance of the evidence provided by the available animal models.

**Logic**:
1. Read both files.
2. Join the two files. Based on the difference between the two scores, flag each association as supported **ONLY** by mouse models, by **NO** mouse models, or **YES**, by mouse models together with other data sources.
3. Join with gene data.
4. Join with disease data
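
The flagging rule in step 2 can be sketched as a plain Python function. The function and parameter names are illustrative assumptions; the three labels and the order of the checks mirror the Spark expression in the cell further down.

```python
def evidence_source(score_no_mouse: float, score_standard: float) -> str:
    """Flag an association by comparing its two overall scores.

    score_no_mouse: overall score with the mouse model weight set to 0.
    score_standard: overall score from the standard release.
    """
    # No score is left once mouse models are ignored.
    if score_no_mouse == 0.0:
        return "mouse only"
    # Ignoring mouse models did not change the score.
    if score_no_mouse == score_standard:
        return "no mouse"
    # The score changed but did not vanish.
    return "mouse and other"


examples = [
    evidence_source(0.0, 0.4),
    evidence_source(0.3, 0.3),
    evidence_source(0.3, 0.5),
]
```

The order of the checks matters: the zero check comes first, so an association whose score is zero in both datasets would be labelled `mouse only` rather than `no mouse`.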

In [88]:
# Reading association dataset where the mouse model weight is 0:
assoc_nomouse = spark.read.parquet(association_data_nomuse).persist()

# Reading normal dataset:
assoc = spark.read.parquet(association_data).persist()

def QC_df(df: dataframe.DataFrame, dataset: str) -> None:
    print(f'Processing {dataset}')
    print(f'\tNumber of associations: {df.count()}')
    print(f'\tNumber of genes: {df.select("targetId").distinct().count()}')
    print(f'\tNumber of diseases: {df.select("diseaseId").distinct().count()}')
    
    
QC_df(assoc_nomouse, 'Nullified mouse')
QC_df(assoc, 'Standard')

Processing Nullified mouse
	Number of associations: 2120908
	Number of genes: 29221
	Number of diseases: 16578
Processing Standard
	Number of associations: 2120908
	Number of genes: 29221
	Number of diseases: 16578


All looks good, all the numbers are matching. Great.
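
Matching counts do not by themselves guarantee that both datasets contain the same disease/target pairs, which matters because the two are combined with an inner join later on. A stricter check compares the key sets directly. The sketch below does this in plain Python on made-up pairs; the identifiers and the `same_keys` helper are placeholders, not part of the pipeline.

```python
def same_keys(left, right) -> bool:
    """Return True when both collections hold exactly the same (diseaseId, targetId) pairs."""
    return set(left) == set(right)


# Made-up key pairs, for illustration only.
standard = [("D1", "T1"), ("D1", "T2"), ("D2", "T1")]
nullified = [("D2", "T1"), ("D1", "T1"), ("D1", "T2")]  # same pairs, different order
shifted = [("D1", "T1"), ("D1", "T2"), ("D3", "T1")]    # same size, one pair differs

identical = same_keys(standard, nullified)
different = same_keys(standard, shifted)
```

In Spark, the same idea could presumably be expressed with a `left_anti` join on `diseaseId` and `targetId` in both directions, checking that each returns zero rows.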

In [90]:
# +---------------+-------+
# | evidenceSource|  count|
# +---------------+-------+
# |     mouse only| 506339|
# |mouse and other|  16954|
# |       no mouse|1597615|
# +---------------+-------+


annotated_associations = (
    # Start from the dataset where the mouse model weight is 0:
    assoc_nomouse
    
    # Join with the standard dataset, keeping its score as `old_score`:
    .join(assoc.withColumnRenamed('score', 'old_score'), on=['diseaseId', 'targetId'], how='inner')
    
    # Generate a flag from where the score is coming from:
    .withColumn(
        'evidenceSource',
        f.when(f.col('score') == 0.0, f.lit('mouse only'))
        .when(f.col('score') == f.col('old_score'), f.lit('no mouse'))
        .otherwise(f.lit('mouse and other'))
    )
    
    # Join with targets:
    .join(targets, on='targetId', how='left')
    
    # Join with diseases:
    .join(annotated_diseases, on='diseaseId', how='left')
    
    # Select and order columns:
    .select(
        'targetId', 
        'targetSymbol', 
        'targetName', 
        'diseaseId', 
        'diseaseName', 
        'isRelevant', 
        'category',
        'evidenceSource',
        'score'
    )

    .persist()
)

# Let's see what we have:
annotated_associations.show()

# Saving data:
(
    annotated_associations
    .withColumn('category', f.concat_ws(',', f.col('category')))
    .write.option("compression", "gzip").mode('overwrite')
    .csv('annotated_associations.tsv.gz', sep='\t')
)    
# Print number of associations with different evidence source categories:
(
    annotated_associations
    .groupby('evidenceSource')
    .agg(
        f.count(f.col('targetId')).alias('associationCount'),
        f.first(f.col('targetId')).alias('targetExample'),
        f.first(f.col('diseaseId')).alias('diseaseExample')
    )
    .show()
)

+---------------+------------+--------------------+----------+--------------------+----------+--------+--------------+--------------------+
|       targetId|targetSymbol|          targetName| diseaseId|         diseaseName|isRelevant|category|evidenceSource|               score|
+---------------+------------+--------------------+----------+--------------------+----------+--------+--------------+--------------------+
|ENSG00000113749|        HRH2|histamine recepto...|DOID_10718|          giardiasis|     false|    null|      no mouse|0.001478319418738...|
|ENSG00000120937|        NPPB|natriuretic pepti...|DOID_13406|pulmonary sarcoid...|     false|    null|      no mouse|0.002217479128108...|
|ENSG00000066427|       ATXN3|            ataxin 3| DOID_7551|           gonorrhea|     false|    null|      no mouse|0.001478319418738...|
|ENSG00000095739|       BAMBI|BMP and activin m...| DOID_7551|           gonorrhea|     false|    null|      no mouse|0.003695798546847...|
|ENSG00000102755|   

### Association by evidence source categories

* Mouse only: 506k unique disease/target pairs. Example: [ENSG00000158163/MONDO_0009148](https://platform.opentargets.org/evidence/ENSG00000158163/MONDO_0009148)
* Mouse and other: 17k unique disease/target pairs. Example: [ENSG00000163646/MONDO_0016485](https://platform.opentargets.org/evidence/ENSG00000163646/MONDO_0016485)
* No mouse: 1.6M unique disease/target pairs. Example: [ENSG00000066427/DOID_7551](https://platform.opentargets.org/evidence/ENSG00000066427/DOID_7551)


```
+---------------+----------------+---------------+---------------+
| evidenceSource|associationCount| example_target|example_disease|
+---------------+----------------+---------------+---------------+
|     mouse only|          506339|ENSG00000158163|  MONDO_0009148|
|mouse and other|           16954|ENSG00000163646|  MONDO_0016485|
|       no mouse|         1597615|ENSG00000066427|      DOID_7551|
+---------------+----------------+---------------+---------------+
```
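
As a quick sanity check, the three counts in the table can be turned into shares of all associations. The sketch below uses only the numbers reported above; the variable names are illustrative.

```python
# Association counts per evidence source category, copied from the table above.
counts = {
    "mouse only": 506_339,
    "mouse and other": 16_954,
    "no mouse": 1_597_615,
}

# The categories are mutually exclusive, so they should add up to all associations.
total = sum(counts.values())

# Share of each category among all associations.
shares = {source: count / total for source, count in counts.items()}

for source, share in shares.items():
    print(f"{source}: {share:.1%}")
```

The total is 2,120,908, which equals the number of associations reported for both input datasets, so the inner join did not drop any pairs. Roughly 23.9% of associations are supported only by mouse models, 0.8% by mouse models and other sources, and 75.3% by no mouse models.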


In [101]:
# Window over each evidence source category, used to compute the per-category total:
windowSpec = Window.partitionBy('evidenceSource')


(
    annotated_associations
    .groupBy(['evidenceSource', 'isRelevant'])
    .count()
    .withColumn('totalCount', f.sum(f.col('count')).over(windowSpec))
    .withColumn('relevantRatio', f.col('count') / f.col('totalCount'))
    .groupBy('evidenceSource')
    .pivot('isRelevant')
    .agg(f.first('relevantRatio'))
    .show()
)

+---------------+------------------+-------------------+
| evidenceSource|             false|               true|
+---------------+------------------+-------------------+
|     mouse only| 0.891201349293655|0.10879865070634497|
|mouse and other|0.8629232039636664| 0.1370767960363336|
|       no mouse|0.8874196849679052|0.11258031503209472|
+---------------+------------------+-------------------+
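
The ratios above are, for each evidence source, the count of relevant (or non-relevant) associations divided by the total count for that source. A minimal plain-Python sketch of the same computation on made-up rows (the rows and variable names are assumptions for illustration, not real data):

```python
from collections import Counter

# Made-up (evidenceSource, isRelevant) rows, for illustration only.
rows = [
    ("mouse only", True),
    ("mouse only", False),
    ("mouse only", False),
    ("mouse only", False),
    ("no mouse", True),
    ("no mouse", False),
]

# Count per (evidenceSource, isRelevant) pair, like groupBy(...).count().
pair_counts = Counter(rows)

# Count per evidenceSource, like the sum over the window partition.
totals = Counter(source for source, _ in rows)

# Fraction of each relevance flag within its evidence source.
relevant_ratio = {
    (source, is_relevant): count / totals[source]
    for (source, is_relevant), count in pair_counts.items()
}
```

Within each evidence source the two ratios sum to 1, which also holds for every row of the table above.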

