* Mohd's dataset is here: `gs://genetics-portal-dev-mr/mr_results_v2`
* Contents:
```
gs://genetics-portal-dev-mr/mr_results_v2/prot_mr_filtered.tsv.gz
gs://genetics-portal-dev-mr/mr_results_v2/prot_snp_filtered.tsv.gz
gs://genetics-portal-dev-mr/mr_results_v2/prot_snp_unfiltered.tsv.gz
gs://genetics-portal-dev-mr/mr_results_v2/partners/
```

* Data folder: `~/project_data/MR`

In [1]:
%%bash

ls -lah ~/project_data/MR

total 58400
drwxr-xr-x   5 dsuveges  384566875   160B  9 Mar 16:25 .
drwxrwxr-x  19 dsuveges  384566875   608B  9 Mar 16:17 ..
drwxr-xr-x   5 dsuveges  384566875   160B  9 Mar 16:18 partners
-rw-r--r--   1 dsuveges  384566875   7.7M  9 Mar 16:25 prot_mr_filtered.tsv.gz
-rw-r--r--   1 dsuveges  384566875    20M  9 Mar 16:24 prot_snp_filtered.tsv.gz


The unfiltered set was not downloaded because it is not relevant for platform evidence.

In [57]:
protMrFiltered = '/Users/dsuveges/project_data/MR/prot_mr_filtered.tsv.gz'
protSnpFiltered = '/Users/dsuveges/project_data/MR/prot_snp_filtered.tsv.gz'

import json
from json import JSONDecodeError

import requests
from functools import reduce
import pandas as pd
from pyspark.sql.functions import (
    col, udf, struct, lit, split, expr, collect_set,
    regexp_replace, min as pyspark_min, explode, when, regexp_extract,
    array_contains, count, first, element_at, size, sum as pyspark_sum
)
from pyspark.sql.types import FloatType, ArrayType, StructType, StructField, BooleanType, StringType
from pyspark.sql import SparkSession
from collections import defaultdict

# establish spark connection
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)


## Reviewing the filtered protein MR dataset

In [2]:
(
    spark.read.csv(protMrFiltered, sep='\t', header=True)
    .show(1, False, True)
)

-RECORD 0------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Data                     | SUHRE_2017                                                                                                                                                       
 exposure                 | 2182-54_1_one_out_merged_h2                                                                                                                                      
 protein_trait            | Complement C4b                                                                                                                                                   
 hgnc_protein             | C4B                                                                                                                                                              
 ensid                    | ENSG00000224389       

## Reviewing the filtered SNP dataset

In [3]:
(
    spark.read.csv(protSnpFiltered, sep='\t', header=True)
    .filter(col('left_study') != 'NA')
    .show(1, False, True)
)

-RECORD 0----------------------------------------------------------------------------------------------------------------------------------
 snp                       | 7:45945221:C:T                                                                                                
 bxy                       | -1.32196                                                                                                      
 bzx                       | 0.428                                                                                                         
 bzx_se                    | 0.0429935                                                                                                     
 bzx_pval                  | 2.39839281528496e-23                                                                                          
 bzy                       | -0.565798                                                                                                     
 bzy_se             

## Estimate the overlap of MR hits and already known associations


**Processing MR data:**
1. Read the TSV.
2. Keep significant MR results only (`bxy_pval < 0.0005`).
3. Explode rows that carry multiple EFO annotations.
4. Drop rows with no EFO annotation.
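
Steps 3 and 4 can be illustrated outside Spark. The sketch below is a plain-Python stand-in, not the notebook's own code: `extract_trait_ids` is a hypothetical helper that splits an annotation on spaces and keeps the first ontology-like ID found in each token, mirroring the `split` and `regexp_extract` calls in the MR processing cell. The `'NA'` input is only an assumed example of a value without an ID.

```python
import re

# Same pattern as the one passed to regexp_extract in the Spark pipeline.
TRAIT_ID = re.compile(r'[A-Z]+_[0-9]+')


def extract_trait_ids(annotation):
    """Hypothetical helper: return the ontology IDs found in one annotation string."""
    trait_ids = []
    for token in annotation.split(' '):
        match = TRAIT_ID.search(token)
        if match:  # tokens without an ID are dropped
            trait_ids.append(match.group(0))
    return trait_ids


# An R-style vector literal yields one ID per element.
print(extract_trait_ids('c("HP_0002024", "EFO_0001060")'))
# A value without any ID yields an empty list.
print(extract_trait_ids('NA'))
```

Each returned ID corresponds to one exploded row in the Spark version, and an empty list corresponds to a row that is dropped.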

**Processing OT data:**
1. Fetch the 22.02 release data from the FTP (`/pub/databases/opentargets/platform/22.02/output/etl/parquet/associationByDatasourceDirect`).
2. Group by disease and target, collecting the `datasourceId` values of each pair.
3. Join with the MR data to see the overlap.
4. Estimate the overlap specifically with `ot_genetics_portal`.
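
The grouping and the left join can be pictured with plain Python dictionaries. This is a toy sketch rather than the Spark code: the rows are made-up placeholders, not real platform data, and `classify_overlap` is a hypothetical name introduced for illustration.

```python
from collections import defaultdict

# Toy rows (diseaseId, targetId, datasourceId); placeholders, not real data.
platform_rows = [
    ('EFO_A', 'ENSG_1', 'ot_genetics_portal'),
    ('EFO_A', 'ENSG_1', 'europepmc'),
    ('EFO_B', 'ENSG_2', 'europepmc'),
]
# Toy MR associations (diseaseId, targetId); also placeholders.
mr_pairs = [('EFO_A', 'ENSG_1'), ('EFO_B', 'ENSG_2'), ('EFO_C', 'ENSG_3')]


def classify_overlap(mr_pairs, platform_rows):
    """Hypothetical helper: count MR pairs by the kind of platform support."""
    # Counterpart of groupBy + collect_set: one set of sources per pair.
    sources = defaultdict(set)
    for disease, target, source in platform_rows:
        sources[(disease, target)].add(source)

    # Counterpart of the left join: every MR pair is kept, unmatched ones get None.
    joined = {pair: sources.get(pair) for pair in mr_pairs}

    return {
        'genetics': sum(1 for s in joined.values() if s and 'ot_genetics_portal' in s),
        'any_source': sum(1 for s in joined.values() if s is not None),
        'new': sum(1 for s in joined.values() if s is None),
    }


print(classify_overlap(mr_pairs, platform_rows))
```

Because the join keeps every MR pair, the `any_source` and `new` counts always add up to the number of MR pairs.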

In [103]:
MR_associations = (
    spark.read.csv(protMrFiltered, sep='\t', header=True)
    
    # Filter for significant MR only (columns are read as strings, so cast explicitly):
    .filter(col('bxy_pval').cast('double') < 0.0005)
    
    # Some outcome traits list several IDs as an R vector, e.g. `c("HP_0002024", "EFO_0001060")`:
    .withColumn('traits', explode(split(col('outcome_trait_efo'), ' ')))
    .withColumn('trait', regexp_extract(col('traits'), r'[A-Z]+_[0-9]+', 0))

    # Drop missing efos:
    .filter(col('trait') != '')

    # Get unique associations:
    .select('ensid', 'trait')
    .withColumnRenamed('ensid', 'targetId')
    .withColumnRenamed('trait', 'diseaseId')
    .distinct()
    .persist()
)

MR_associations.show()

print(MR_associations.count())

+---------------+-------------+
|       targetId|    diseaseId|
+---------------+-------------+
|ENSG00000204520|  EFO_0000401|
|ENSG00000204520|MONDO_0045002|
|ENSG00000105472|  EFO_0008082|
|ENSG00000168811|  EFO_0008264|
|ENSG00000172156|  EFO_0007997|
|ENSG00000106952|  EFO_0007991|
|ENSG00000010438|  EFO_0001069|
|ENSG00000010438|  EFO_0004713|
|ENSG00000160712|  EFO_0000275|
|ENSG00000204516|  EFO_0002690|
|ENSG00000148400|   HP_0004418|
|ENSG00000130203|  EFO_0007796|
|ENSG00000159958|   HP_0000726|
|ENSG00000175164|  EFO_0008300|
|ENSG00000175164|  EFO_0004312|
|ENSG00000204305|  EFO_0009676|
|ENSG00000204305|  EFO_0004833|
|ENSG00000204305|  EFO_1001771|
|ENSG00000164022|  EFO_0000677|
|ENSG00000164022|MONDO_0001627|
+---------------+-------------+
only showing top 20 rows

26312


In [107]:
platform_assoc = (
    spark.read.parquet('/Users/dsuveges/project_data/associationByDatasourceDirect/')
    .groupBy(['diseaseId', 'targetId'])
    .agg(collect_set(col('datasourceId')).alias('datasourceIds'))
    .persist()
)



platform_assoc.show()

print(platform_assoc.count())

+--------------------+---------------+-------------+
|           diseaseId|       targetId|datasourceIds|
+--------------------+---------------+-------------+
|//ebi.ac.uk/efo/E...|ENSG00000078549|  [europepmc]|
|//ebi.ac.uk/efo/E...|ENSG00000153094|  [europepmc]|
|//ebi.ac.uk/efo/E...|ENSG00000164400|  [europepmc]|
|//ebi.ac.uk/efo/E...|ENSG00000169194|  [europepmc]|
|//ebi.ac.uk/efo/E...|ENSG00000198873|  [europepmc]|
|//ebi.ac.uk/efo/E...|ENSG00000233276|  [europepmc]|
|          DOID_10718|ENSG00000113749|  [europepmc]|
|          DOID_13406|ENSG00000120937|  [europepmc]|
|           DOID_7551|ENSG00000066427|  [europepmc]|
|           DOID_7551|ENSG00000095739|  [europepmc]|
|           DOID_7551|ENSG00000102755|  [europepmc]|
|           DOID_7551|ENSG00000103335|  [europepmc]|
|           DOID_7551|ENSG00000106004|  [europepmc]|
|           DOID_7551|ENSG00000116132|  [europepmc]|
|           DOID_7551|ENSG00000118046|  [europepmc]|
|           DOID_7551|ENSG00000121060|  [europ

In [108]:
overlap = (
    MR_associations
    .join(platform_assoc, on=['diseaseId', 'targetId'], how='left')
    .persist()
)

overlap.show()

+-------------+---------------+--------------------+
|    diseaseId|       targetId|       datasourceIds|
+-------------+---------------+--------------------+
|  EFO_0004995|ENSG00000002822|[ot_genetics_portal]|
|MONDO_0056820|ENSG00000006210|                null|
|  EFO_0009805|ENSG00000007312|[ot_genetics_portal]|
|  EFO_0001069|ENSG00000010438|                null|
|  EFO_0004713|ENSG00000010438|                null|
|MONDO_0021661|ENSG00000030582|                null|
|  EFO_0009116|ENSG00000037280|                null|
|  EFO_0004309|ENSG00000067182|[ot_genetics_portal]|
|  EFO_0004309|ENSG00000069482|                null|
|  EFO_0005755|ENSG00000073969|                null|
|   HP_0000790|ENSG00000073969|                null|
|  EFO_0007990|ENSG00000079950|                null|
|  EFO_0004518|ENSG00000088387|                null|
|  EFO_0009440|ENSG00000088387|                null|
|  EFO_0004312|ENSG00000088543|                null|
|  EFO_0004526|ENSG00000089692|               

In [111]:
supported_by_genetics = (
    overlap
    .select('*', explode(col('datasourceIds')).alias('datasourceId'))
    .filter(col('datasourceId') == 'ot_genetics_portal')
    .count()
)

completely_new = (
    overlap
    .filter(col('datasourceIds').isNull())
    .count()
)
supported_by_any_source = (
    overlap
    .filter(col('datasourceIds').isNotNull())
    .count()
)

print(f'Number of MR associations covered by OT genetics: {supported_by_genetics}')
print(f'Number of MR associations covered by any source: {supported_by_any_source}')
print(f'Completely new associations: {completely_new}')


Number of MR associations covered by OT genetics: 4484
Number of MR associations covered by any source: 6111
Completely new associations: 20201
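
As a quick sanity check, the covered and new counts should add up to the 26312 distinct MR associations counted earlier. The sketch below simply re-uses the printed numbers to verify that and to express each count as a share of the total; the variable names are illustrative.

```python
total = 26312               # distinct MR associations, as printed earlier
covered_by_any = 6111       # matched by at least one platform datasource
covered_by_genetics = 4484  # matched by ot_genetics_portal
completely_new = 20201      # no platform association at all

# Every MR association is either covered by some source or completely new.
assert covered_by_any + completely_new == total

shares = {
    'any source': covered_by_any / total,
    'OT genetics': covered_by_genetics / total,
    'completely new': completely_new / total,
}
for label, share in shares.items():
    print(f'{label}: {share:.1%}')
```

Roughly 23% of the MR associations are already known to the platform, about 17% through OT genetics, and about 77% are new.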


### What do the scores look like?

I want to see whether the MR associations that overlap with the genetics portal get higher scores than the average OT genetics association.

In [112]:
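genetics_only = (
    # NOTE: no `datasourceId` filter is applied here, so despite its name this
    # dataframe holds the scores of every datasource, not only `ot_genetics_portal`.
    spark.read.parquet('/Users/dsuveges/project_data/associationByDatasourceDirect/')
    .select('targetId', 'diseaseId', 'score')
    .persist()
)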



genetics_only.show()

print(genetics_only.count())

+---------------+-----------+-------------------+
|       targetId|  diseaseId|              score|
+---------------+-----------+-------------------+
|ENSG00000180644|EFO_0000574| 0.5824669406749944|
|ENSG00000006062|EFO_0000584|0.26201817377060865|
|ENSG00000020633|EFO_0000584| 0.2956367468785313|
|ENSG00000026103|EFO_0000584| 0.3366112826375546|
|ENSG00000049768|EFO_0000584| 0.2611062775741913|
|ENSG00000077150|EFO_0000584| 0.2783107191466001|
|ENSG00000104856|EFO_0000584| 0.2649970346789056|
|ENSG00000109471|EFO_0000584| 0.2555741073159255|
|ENSG00000117560|EFO_0000584|0.24998114397789856|
|ENSG00000126353|EFO_0000584| 0.2545406249599857|
|ENSG00000134242|EFO_0000584|0.26530100007771146|
|ENSG00000145901|EFO_0000584|0.25891772670278945|
|ENSG00000162924|EFO_0000584|0.25168335021121113|
|ENSG00000163154|EFO_0000584| 0.2756966167168702|
|ENSG00000163554|EFO_0000584|0.25660758967186525|
|ENSG00000166908|EFO_0000584| 0.2764261336740041|
|ENSG00000196329|EFO_0000584|0.25113621249336066|


In [116]:
MR_vs_genetics = (
    MR_associations
    .join(genetics_only, on=['targetId', 'diseaseId'], how='left')
    .persist()
)

print('Score distribution of all genetics associations:')
(
    genetics_only
    .select(col('score'))
    .describe()
    .show()
)

print('\nScore distribution of genetics associations overlapping with MR:')
(
    MR_vs_genetics
    .filter(col('score').isNotNull())
    .select(col('score'))
    .describe()
    .show()
)

Score distribution of all genetics associations:
+-------+-------------------+
|summary|              score|
+-------+-------------------+
|  count|            2366419|
|   mean| 0.1973850285140361|
| stddev|0.20511342467504334|
|    min|6.99311006169496E-5|
|    max| 0.9997595486001684|
+-------+-------------------+


Score distribution of genetics associations overlapping with MR:
+-------+--------------------+
|summary|               score|
+-------+--------------------+
|  count|                6994|
|   mean| 0.26430191317209856|
| stddev| 0.23779356044199645|
|    min|8.399583698113842E-5|
|    max|  0.9963588700588945|
+-------+--------------------+

