# QC of Genetics Portal Evidence

## Findings

1. There are 75 studies in the 21.11 dataset that used to have EFO identifiers but no longer do. Two of these are GCST identifiers, probably the implicated studies the genetics team already knows about. The rest are FINNGEN and UKBB studies with unusual trait categories. This discrepancy leads to dropping about 8k evidence.
    * Verify the reasoning behind the mappings.
    * Was it only the ambiguity of the mapping? Or the nature of the studied trait? 
2. In the unfiltered evidence set (l2g threshold == 0), there are 3,379,045 evidence from 18,031 studies. Applying the l2g >= 0.05 cutoff leaves 627,835 evidence from 17,359 studies, so the l2g threshold costs us only 672 studies.
3. Compared with the Genetics Portal evidence set from the 21.09 release, only 147 evidence from 26 studies are lost because of missing studies; these 26 studies are absent even from the unfiltered dataset. When the l2g threshold is applied, 178 studies with 1,788 evidence are missing.
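
The study and evidence counts in finding 2 can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the counts printed by the notebook cells further down; the variable names are illustrative only.

```python
# Counts copied from the notebook output for the 21.11 evidence set.
evidence_unfiltered = 3_379_045
evidence_filtered = 627_835    # evidence with l2g >= 0.05
studies_unfiltered = 18_031
studies_filtered = 17_359      # studies with at least one evidence at l2g >= 0.05

# Studies that disappear entirely once the cutoff is applied.
studies_lost = studies_unfiltered - studies_filtered
# Fraction of evidence that survives the cutoff.
evidence_retained = evidence_filtered / evidence_unfiltered

print(f'Studies lost to the l2g cutoff: {studies_lost}')
print(f'Share of studies lost: {studies_lost / studies_unfiltered:.1%}')
print(f'Share of evidence retained: {evidence_retained:.1%}')
```

The cutoff removes most of the evidence but only a small share of the studies, because a study is lost only when none of its evidence reaches the threshold.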

Running parser on the new data:

### Creating cluster

```bash
gcloud dataproc clusters create \
  genetics-evidence-generation \
  --image-version=1.4 \
  --properties=spark:spark.debug.maxToStringFields=100,spark:spark.executor.cores=31,spark:spark.executor.instances=1 \
  --master-machine-type=n1-standard-32 \
  --master-boot-disk-size=1TB \
  --zone=europe-west1-d \
  --single-node \
  --max-idle=15m \
  --region=europe-west1 \
  --project=open-targets-genetics-dev
```

### Running for release `21.11`


```bash
# Set Genetics Portal evidence source files:
export TAG='21.11_nofilter'

L2G_FILE="gs://genetics-portal-dev-data/21.10/outputs/l2g"
STUDY_FILE="gs://genetics-portal-dev-data/21.10/outputs/lut/study-index"
TOPLOCI_FILE="gs://genetics-portal-dev-data/21.10/inputs/v2d/toploci.parquet"
VARIANTINDEX="gs://genetics-portal-dev-data/21.10/inputs/variant-annotation/190129/variant-annotation.parquet"
ECO_CODES="gs://genetics-portal-dev-data/21.10/inputs/lut/vep_consequences.tsv"


OUT_FILE="gs://genetics-portal-analysis/l2g-platform-export/data/genetics_portal-${TAG}.parquet"
SCRIPT_DIR='/Users/dsuveges/repositories/evidence_datasource_parsers/'
```

```bash
gcloud dataproc jobs submit pyspark \
  --cluster=genetics-evidence-generation \
  --project=open-targets-genetics-dev \
  --region=europe-west1 \
  ${SCRIPT_DIR}/modules/GeneticsPortal.py -- \
  --locus2gene ${L2G_FILE} \
  --toploci ${TOPLOCI_FILE} \
  --study ${STUDY_FILE} \
  --variantIndex ${VARIANTINDEX}  \
  --ecoCodes ${ECO_CODES} \
  --threshold 0 \
  --outputFile ${OUT_FILE}
```

Copy the output to the local working directory:

```bash
gsutil cp -r ${OUT_FILE} .
```

Copy old evidence files used for the 21.09 release:

```bash
gsutil cp -r gs://open-targets-pre-data-releases/21.09.5/input/evidence-files/genetics-portal-evidences-2021-06-21  .
```

In [32]:
new_data = '/Users/dsuveges/repositories/evidence_datasource_parsers/genetics_portal-21.11_nofilter.parquet'
old_data = '/Users/dsuveges/repositories/evidence_datasource_parsers/genetics_portal-2021-06-21/'

from pyspark.sql.functions import (
    col, udf, struct, lit, split, regexp_replace, create_map, min as spark_min, max as spark_max
)
from pyspark.sql.types import FloatType, ArrayType, StructType, StructField
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from itertools import chain


# establish spark connection
sparkConf = (
    SparkConf()
    .set('spark.driver.memory', '15g')
    .set('spark.executor.memory', '15g')
    .set('spark.driver.maxResultSize', '0')
)
spark = (
    SparkSession.builder
    .config(conf=sparkConf)
    .master('local[*]')
    .getOrCreate()
)


In [9]:
new_df = (
    spark.read.parquet(new_data)
    .select('diseaseFromSourceMappedId', 'targetFromSourceId', 'variantId', 'studyId', 'resourceScore')
    .withColumn('release', lit('21.11'))
    .persist()
)

In [54]:
print('Evidence count')
print(f'Unfiltered: {new_df.count()}')
print(f'Filtered: {new_df.filter(col("resourceScore") >= 0.05).count()}')

print('\nStudy count')
print(f'Unfiltered: {new_df.select(col("studyId")).distinct().count()}')
print(f'Filtered: {new_df.filter(col("resourceScore") >= 0.05).select(col("studyId")).distinct().count()}')

Evidence count
Unfiltered: 3379045
Filtered: 627835

Study count
Unfiltered: 18031
Filtered: 17359


In [13]:
old_df = (
    spark.read.json(old_data)
    .select('diseaseFromSourceMappedId', 'targetFromSourceId', 'variantId', 'studyId', 'resourceScore')
    .withColumn('release', lit('21.09'))
    .withColumnRenamed('resourceScore', 'resourceScore_old')
    .persist()
)

In [14]:
old_df.count()

635033

Are there studies in the new set without an EFO mapping that used to have one?

In [27]:
(
    new_df
    .filter(col('diseaseFromSourceMappedId').isNull())
    .select('diseaseFromSourceMappedId', 'studyId')
    .distinct()
    .join(
        (
            old_df
            .filter(col('diseaseFromSourceMappedId').isNotNull())
            .select(col('diseaseFromSourceMappedId').alias('old_disease'), 'studyId')
            .distinct()
        ), on='studyId', how='inner')
#     .count()
    .show(100, truncate=False)
)

+----------------------------------------------------------+-------------------------+---------------+
|studyId                                                   |diseaseFromSourceMappedId|old_disease    |
+----------------------------------------------------------+-------------------------+---------------+
|SAIGE_202                                                 |null                     |EFO_0001642    |
|SAIGE_695_4                                               |null                     |MONDO_0004670  |
|SAIGE_240                                                 |null                     |EFO_0004283    |
|NEALE2_23101_raw                                          |null                     |EFO_0004995    |
|FINNGEN_R5_Z21_DESEN_ALLERGE                              |null                     |NCIT_C15262    |
|NEALE2_30290_raw                                          |null                     |EFO_0007986    |
|NEALE2_30520_raw                                          |null         
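
Finding 1 says the studies that lost their EFO mapping are mostly FINNGEN and UKBB studies. A minimal sketch of how the affected study identifiers could be tallied by source is shown below. The prefix-to-source mapping is an assumption (in particular, that `SAIGE_` and `NEALE2_` identifiers denote UKBB analyses), and `study_source` is a hypothetical helper, not part of the parser.

```python
from collections import Counter

# Assumed mapping from study identifier prefix to source.
# SAIGE_ and NEALE2_ are taken to be UKBB analyses (assumption).
SOURCE_BY_PREFIX = {
    'GCST': 'GWAS Catalog',
    'FINNGEN': 'FinnGen',
    'SAIGE': 'UKBB',
    'NEALE2': 'UKBB',
}


def study_source(study_id: str) -> str:
    """Return the assumed source of a study based on its identifier prefix."""
    for prefix, source in SOURCE_BY_PREFIX.items():
        if study_id.startswith(prefix):
            return source
    return 'other'


# The study identifiers visible in the output above.
lost_efo_studies = [
    'SAIGE_202', 'SAIGE_695_4', 'SAIGE_240', 'NEALE2_23101_raw',
    'FINNGEN_R5_Z21_DESEN_ALLERGE', 'NEALE2_30290_raw', 'NEALE2_30520_raw',
]

source_counts = Counter(study_source(study_id) for study_id in lost_efo_studies)
print(source_counts)
```

Running the same tally over the full list of 75 studies would show how the loss splits across sources.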

In [20]:
(
    old_df
    .filter(col('studyId') =='SAIGE_695_4')
    .select('targetFromSourceId')
    .distinct()
    .show()
)

+------------------+
|targetFromSourceId|
+------------------+
|   ENSG00000137338|
|   ENSG00000112337|
|   ENSG00000197279|
|   ENSG00000010704|
|   ENSG00000180573|
|   ENSG00000112343|
|   ENSG00000096654|
|   ENSG00000180596|
|   ENSG00000185130|
|   ENSG00000124610|
|   ENSG00000137185|
|   ENSG00000168131|
|   ENSG00000124568|
|   ENSG00000198315|
+------------------+



In [22]:
(
    new_df
    .filter(col('studyId') =='SAIGE_695_4')
    .filter(col('resourceScore') >= 0.05)
    .select('targetFromSourceId', 'diseaseFromSourceMappedId', 'resourceScore', 'variantId')
    .distinct()
    .show()
)

+------------------+-------------------------+--------------------+--------------+
|targetFromSourceId|diseaseFromSourceMappedId|       resourceScore|     variantId|
+------------------+-------------------------+--------------------+--------------+
|   ENSG00000124568|                     null|  0.0514032244682312|6_25769380_G_C|
|   ENSG00000137338|                     null|0.057079605758190155|6_27846899_A_G|
|   ENSG00000168131|                     null|  0.0694759339094162|6_27846899_A_G|
|   ENSG00000185130|                     null|  0.1946474313735962|6_27846899_A_G|
|   ENSG00000233822|                     null| 0.08230216801166534|6_27846899_A_G|
|   ENSG00000146039|                     null|  0.0857049822807312|6_25769380_G_C|
|   ENSG00000180573|                     null|  0.1071556955575943|6_25769380_G_C|
|   ENSG00000112337|                     null| 0.22847461700439453|6_25769380_G_C|
|   ENSG00000124610|                     null| 0.08196471631526947|6_25769380_G_C|
|   

In [24]:
(
    old_df
    .filter(col('studyId') == 'SAIGE_695_4')
    # old_df carries the score as resourceScore_old after the rename above
    .filter(col('resourceScore_old') >= 0.05)
    .select('targetFromSourceId', 'diseaseFromSourceMappedId', 'resourceScore_old', 'variantId')
    .distinct()
    .show()
)

+------------------+-------------------------+--------------------+--------------+
|targetFromSourceId|diseaseFromSourceMappedId|   resourceScore_old|     variantId|
+------------------+-------------------------+--------------------+--------------+
|   ENSG00000096654|            MONDO_0004670| 0.05934744328260422|6_27846899_A_G|
|   ENSG00000124610|            MONDO_0004670| 0.09076076745986938|6_25769380_G_C|
|   ENSG00000112337|            MONDO_0004670|  0.2199445515871048|6_25769380_G_C|
|   ENSG00000010704|            MONDO_0004670| 0.07372410595417023|6_25769380_G_C|
|   ENSG00000180596|            MONDO_0004670|0.054590966552495956|6_25769380_G_C|
|   ENSG00000197279|            MONDO_0004670| 0.05814579874277115|6_27846899_A_G|
|   ENSG00000124568|            MONDO_0004670|0.059311188757419586|6_25769380_G_C|
|   ENSG00000137338|            MONDO_0004670| 0.07066993415355682|6_27846899_A_G|
|   ENSG00000180573|            MONDO_0004670| 0.09729931503534317|6_25769380_G_C|
|   

### Are there lost studies?

* Of all the studies in the new dataset, how many have no association at or above the l2g cutoff?

In [37]:
(
    new_df
    .groupBy('studyId')
    .agg(
        # Best (maximum) l2g score of each study
        spark_max(col('resourceScore')).alias('max_l2g_score')
    )
    # Keep the studies whose best score falls below the cutoff
    .filter(col('max_l2g_score') < 0.05)
    .show()
#     .count() # 672
)


+-----------------+--------------------+
|          studyId|       max_l2g_score|
+-----------------+--------------------+
|       GCST006993|0.036080799996852875|
|       GCST003381| 0.04202813655138016|
|       GCST003723| 0.03896152600646019|
|       GCST003830|0.027043016627430916|
|     GCST90002330| 0.02011059783399105|
|     GCST90026213|0.023423319682478905|
|NEALE2_20003_1189|0.033882319927215576|
|   GCST011427_509| 0.04131009057164192|
|   GCST005650_236|0.020931528881192207|
|       GCST007055|0.038793694227933884|
|     GCST005498_2|0.045617688447237015|
|       GCST004553| 0.04202813655138016|
|   GCST011427_405|0.020421117544174194|
|     GCST002466_3|0.027632499113678932|
|       GCST007952|0.026811346411705017|
|     GCST006584_3|0.020399658009409904|
|     GCST90014054| 0.04202813655138016|
|       GCST005486|0.045617688447237015|
|    GCST012353_27|0.018399110063910484|
|       GCST011570|0.020900540053844452|
+-----------------+--------------------+
only showing top
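
The logic of the cell above is that a study is lost only when its best l2g score is below the cutoff, since the evidence filter keeps scores `>= 0.05`. A plain-Python sketch of that logic on toy data follows; the study names and scores are hypothetical and not taken from the dataset.

```python
from collections import defaultdict

L2G_CUTOFF = 0.05

# Toy (studyId, l2g score) pairs; hypothetical values for illustration.
toy_evidence = [
    ('study_A', 0.02), ('study_A', 0.31),
    ('study_B', 0.01), ('study_B', 0.04),
    ('study_C', 0.05),
]

# Track the maximum score seen for each study.
max_score = defaultdict(float)
for study_id, score in toy_evidence:
    max_score[study_id] = max(max_score[study_id], score)

# A study is lost when even its best score is below the cutoff.
studies_below_cutoff = sorted(
    study_id for study_id, best in max_score.items() if best < L2G_CUTOFF
)
print(studies_below_cutoff)
```

Here `study_A` survives thanks to a single high-scoring evidence, and `study_C` survives because a score exactly at the cutoff is kept.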

How about the old dataset?

In [61]:
print(f'Number of studies in the old dataset: {old_df.select("studyId").distinct().count()}')

Number of studies in the old dataset: 16429


In [62]:
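# Distinct study identifiers of the old dataset. This definition is missing
# from the exported cells and is assumed here.
old_study_list = old_df.select('studyId').distinct()

print('Studies found in the old dataset, but not in the new:')
print(
    old_study_list
    .join(new_df, on='studyId', how='leftanti')
    .select('studyId')
    .distinct()
    .count()
)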

print('Studies found in the old dataset, but not in the new AFTER applying l2g threshold:')
print(
    old_study_list
    .join(new_df.filter(col('resourceScore') >= 0.05), on='studyId', how='leftanti')
    .select('studyId')
    .distinct()
    .count()
)

Studies found in the old dataset, but not in the new:
26
Studies found in the old dataset, but not in the new AFTER applying l2g threshold:
178


In [52]:
old_missing_studies = (
    old_df
    .join(new_df.select('studyId').distinct(), on='studyId', how='left_anti')
    .persist()
)

print(f'Number of missing associations: {old_missing_studies.count()}')
print(f'Number of missing studies: {old_missing_studies.select("studyId").distinct().count()}')

# Repeat the comparison, this time against the new data filtered at l2g >= 0.05
old_missing_studies = (
    old_df
    .join(new_df.filter(col('resourceScore') >= 0.05).select('studyId').distinct(), on='studyId', how='left_anti')
    .persist()
)

print(f'Number of missing associations: {old_missing_studies.count()}')
print(f'Number of missing studies: {old_missing_studies.select("studyId").distinct().count()}')


Number of missing associations: 147
Number of missing studies: 26
Number of missing associations: 1788
Number of missing studies: 178
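
The two comparisons above are left anti joins: they keep the old studies that have no match in the new data, first without and then with the l2g cutoff. A small sketch of the same idea using Python sets on toy data follows; the study names and scores are hypothetical.

```python
L2G_CUTOFF = 0.05

# Toy data; study names and scores are hypothetical.
old_studies = {'study_A', 'study_B', 'study_C'}
new_evidence = [('study_A', 0.40), ('study_B', 0.03)]

# Studies present in the new data, before and after applying the cutoff.
new_all = {study for study, _ in new_evidence}
new_filtered = {study for study, score in new_evidence if score >= L2G_CUTOFF}

# A left anti join on studyId is a set difference on the study identifiers.
missing_unfiltered = old_studies - new_all
missing_filtered = old_studies - new_filtered
print(sorted(missing_unfiltered), sorted(missing_filtered))
```

The difference between the two results is the set of studies that still exist in the new data but only with scores below the cutoff. With the counts reported above, that is 178 - 26 = 152 studies.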
