# Gene2Phenotype evidence drop

- A large amount of evidence is missing from the Gene2Phenotype dataset.
- The drop happens in the ETL.
- On closer inspection, the evidence is dropped because its score is `null`.
- The scores are `null` because the ETL configuration cannot handle confidence values that are no longer supported.
- The old confidence values came from old files downloaded from ftp.
- The files downloaded from http were good.

**Actions:**

1. Update evidence based on files fetched from http.
2. Notify the Backend team about the issue and the possibility of a new ETL run.
3. Update the configuration of the evidence datasource parsers.

## As of 2022.04.25

* This issue is now fixed: we have verified the distribution of the clinical significances.

In [7]:
%%bash

TARGET='/Users/dsuveges/project_data/gene2phenotype'

mkdir -p ${TARGET}/22.02
mkdir -p ${TARGET}/22.04

# Fetching new data (22.04 pre-release input):
gsutil cp -r gs://open-targets-pre-data-releases/22.04/input/evidence-files/gene2phenotype-2022-04-11.json.gz ${TARGET}  2> /dev/null

# Fetching old data (22.02 release input):
gsutil cp -r gs://open-targets-data-releases/22.02/input/evidence-files/gene2phenotype-2022-01-25.json.gz ${TARGET} 2> /dev/null

# Fetching pre-release post-pipeline:
gsutil cp -r gs://open-targets-pre-data-releases/22.04/output/etl/parquet/evidence/sourceId=gene2phenotype \
    ${TARGET}/22.04 2> /dev/null

gsutil cp -r gs://open-targets-data-releases/22.02/output/etl/parquet/evidence/sourceId=gene2phenotype \
    ${TARGET}/22.02 2> /dev/null

ls -ls $TARGET | grep gene2

360 -rw-r--r--  1 dsuveges  384566875  184217 25 Apr 15:28 gene2phenotype-2022-01-25.json.gz
392 -rw-r--r--  1 dsuveges  384566875  197986 25 Apr 15:28 gene2phenotype-2022-04-11.json.gz


In [147]:
import json
from json import JSONDecodeError

import requests
from functools import reduce
import pandas as pd
from pyspark.sql.functions import (
    col, udf, struct, lit, split, expr, collect_set,
    regexp_replace, min as pyspark_min, explode, when, array_except,
    array_contains, count, first, element_at, size, sum as pyspark_sum
)
from pyspark.sql.types import FloatType, ArrayType, StructType, TimestampType, StructField, BooleanType, StringType, IntegerType
from pyspark.sql import SparkSession
from collections import defaultdict

# establish spark connection
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)

In [8]:
oldFile = '/Users/dsuveges/project_data/gene2phenotype/gene2phenotype-2022-01-25.json.gz'
newFile = '/Users/dsuveges/project_data/gene2phenotype/gene2phenotype-2022-04-11.json.gz'

In [9]:
old = spark.read.json(oldFile).persist()
new = spark.read.json(newFile).persist()

print(f'New evidence count: {new.count()}')
print(f'Old evidence count: {old.count()}')

New evidence count: 4427
Old evidence count: 4449


In [10]:
new.printSchema()

root
 |-- allelicRequirements: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- confidence: string (nullable = true)
 |-- datasourceId: string (nullable = true)
 |-- datatypeId: string (nullable = true)
 |-- diseaseFromSource: string (nullable = true)
 |-- diseaseFromSourceId: string (nullable = true)
 |-- diseaseFromSourceMappedId: string (nullable = true)
 |-- literature: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- studyId: string (nullable = true)
 |-- targetFromSourceId: string (nullable = true)
 |-- variantFunctionalConsequenceId: string (nullable = true)



In [15]:
x = (
    new.groupBy('confidence')
    .count()
    .withColumnRenamed('count', 'new')
)

y = (
    old.groupBy('confidence')
    .count()
    .withColumnRenamed('count', 'old')
)

y.join(x, on='confidence', how='outer').show()

+--------------+----+----+
|    confidence| old| new|
+--------------+----+----+
|    definitive|2956|1736|
|both RD and IF| 138| 121|
|      moderate|   2|null|
|      possible|null| 129|
|        strong| 869| 696|
|      probable|null| 166|
|       limited| 484| 354|
|     confirmed|null|1208|
|      child IF|null|  17|
+--------------+----+----+



In [24]:
(
    old.filter(col('confidence') =='definitive')
    .groupby('studyId')
    .count()
    .show()
)

+-------+-----+
|studyId|count|
+-------+-----+
|   Skin|  446|
|     DD| 1749|
| Cancer|  111|
|    Eye|  650|
+-------+-----+



In [25]:
        map(
          'definitive', 1.0,
          'both RD and IF', 1.0,
          'strong', 1.0,
          'moderate', 0.5,
          'limited', 0.01
        ),

TypeError: 'float' object is not iterable
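
The fragment above appears to be the confidence-to-score mapping written as a Spark SQL `map` expression. Pasted into a Python cell, it calls the built-in `map`, which explains the `TypeError`. The plain-Python sketch below assumes that this mapping is what the ETL applies and that a confidence value without an entry yields a `null` score. Under those assumptions it shows how many rows of the 22.04 input file would lose their score. The counts are copied from the `new` column of the confidence table above; `SCORE_MAP`, `NEW_COUNTS` and `split_by_score` are illustrative names, not part of the ETL.

```python
# Confidence-to-score mapping copied from the fragment above.
SCORE_MAP = {
    "definitive": 1.0,
    "both RD and IF": 1.0,
    "strong": 1.0,
    "moderate": 0.5,
    "limited": 0.01,
}

# Evidence counts per confidence value in the 22.04 input file
# (the `new` column of the comparison table).
NEW_COUNTS = {
    "definitive": 1736,
    "both RD and IF": 121,
    "possible": 129,
    "strong": 696,
    "probable": 166,
    "limited": 354,
    "confirmed": 1208,
    "child IF": 17,
}


def split_by_score(counts, score_map):
    """Return (scored, unscored) counts; unscored values have no mapping entry."""
    scored = {k: v for k, v in counts.items() if k in score_map}
    unscored = {k: v for k, v in counts.items() if k not in score_map}
    return scored, unscored


scored, unscored = split_by_score(NEW_COUNTS, SCORE_MAP)
print(f"Evidence with a score:    {sum(scored.values())}")
print(f"Evidence without a score: {sum(unscored.values())}")
print(f"Unsupported confidence values: {sorted(unscored)}")
```

The two sums add up to the 4427 rows counted for the new file, so in this sketch the unscored rows correspond to the four confidence values absent from the mapping.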

In [32]:
import pandas as pd

df = pd.read_csv('http://ftp.ebi.ac.uk/pub/databases/gene2phenotype/DD/DDG2P_1_12_2021.csv.gz', sep=',')

df['confidence category'].unique()


array(['strong', 'definitive', 'limited', 'both RD and IF', 'child IF'],
      dtype=object)

In [48]:
df = pd.read_csv('~/project_data/gene2phenotype/sourceFiles/CardiacG2P-2022-04-25.csv.gz')
print(len(df))
df['gene disease pair entry date'].unique()

53


array(['2022-04-25 12:45:20', '2022-04-25 12:45:21',
       '2022-04-25 12:45:22', '2022-04-25 12:45:23'], dtype=object)

In [40]:
%%bash 

# Fetching files:
files=(
    https://www.ebi.ac.uk/gene2phenotype/downloads/CancerG2P.csv.gz
    https://www.ebi.ac.uk/gene2phenotype/downloads/CardiacG2P.csv.gz
    https://www.ebi.ac.uk/gene2phenotype/downloads/DDG2P.csv.gz
    https://www.ebi.ac.uk/gene2phenotype/downloads/EyeG2P.csv.gz
    https://www.ebi.ac.uk/gene2phenotype/downloads/SkinG2P.csv.gz
)

# Folder to download file into:
TARGET='/Users/dsuveges/project_data/gene2phenotype/sourceFiles'
mkdir -p $TARGET

for f in "${files[@]}"; do
    wget -q "$f" -P "$TARGET"
done

ls -lah $TARGET

total 744
drwxr-xr-x  7 dsuveges  EBI\Domain Users   224B 25 Apr 17:06 .
drwxr-xr-x  7 dsuveges  EBI\Domain Users   224B 25 Apr 17:03 ..
-rw-r--r--  1 dsuveges  EBI\Domain Users   8.8K 25 Apr 17:06 CancerG2P.csv.gz
-rw-r--r--  1 dsuveges  EBI\Domain Users   4.4K 25 Apr 17:06 CardiacG2P.csv.gz
-rw-r--r--  1 dsuveges  EBI\Domain Users   222K 25 Apr 17:06 DDG2P.csv.gz
-rw-r--r--  1 dsuveges  EBI\Domain Users    71K 25 Apr 17:06 EyeG2P.csv.gz
-rw-r--r--  1 dsuveges  EBI\Domain Users    53K 25 Apr 17:06 SkinG2P.csv.gz


In [42]:
%%bash

TARGET='/Users/dsuveges/project_data/gene2phenotype/sourceFiles'

for f in ${TARGET}/*gz; do
    echo $f;
    gzcat $f | head -n1 
done

/Users/dsuveges/project_data/gene2phenotype/sourceFiles/CancerG2P.csv.gz
"gene symbol","gene mim","disease name","disease mim","confidence category","allelic requirement","mutation consequence",phenotypes,"organ specificity list",pmids,panel,"prev symbols","hgnc id","gene disease pair entry date","cross cutting modifier","mutation consequence flag","confidence value flag",comments,"variant consequence","disease ontology"
/Users/dsuveges/project_data/gene2phenotype/sourceFiles/CardiacG2P.csv.gz
"gene symbol","gene mim","disease name","disease mim","confidence category","allelic requirement","mutation consequence",phenotypes,"organ specificity list",pmids,panel,"prev symbols","hgnc id","gene disease pair entry date","cross cutting modifier","mutation consequence flag","confidence value flag",comments,"variant consequence","disease ontology"
/Users/dsuveges/project_data/gene2phenotype/sourceFiles/DDG2P.csv.gz
"gene symbol","gene mim","disease name","disease mim","confidence category","a

In [43]:
(
    spark.read.json('/Users/dsuveges/repositories/evidence_datasource_parsers/gene2phenotype-2022-04-25.json.gz')
    .groupBy('confidence')
    .count()
    .show()
)

+--------------+-----+
|    confidence|count|
+--------------+-----+
|    definitive| 3815|
|both RD and IF|  162|
|      moderate|   11|
|        strong|  999|
|       limited|  590|
+--------------+-----+



In [44]:
(
    spark.read.json('/Users/dsuveges/repositories/evidence_datasource_parsers/gene2phenotype-2022-04-25.json.gz')
    .count()
)

5577

In [55]:
new_cardiac = (
    spark.read.json('/Users/dsuveges/repositories/evidence_datasource_parsers/gene2phenotype-2022-04-25.json.gz')
    .filter(col('studyId') == 'Cardiac')
    .persist()
)

assoc_cnt = (
    new_cardiac
    .select('diseaseFromSource', 'diseaseFromSourceMappedId', 'targetFromSourceId')
    .distinct()
    .count()
)

print(f'Evidence count: {new_cardiac.count()}')
print(f'Association count: {assoc_cnt}')
print(f'Target count: {new_cardiac.select("targetFromSourceId").distinct().count()}')
print(f'Disease count: {new_cardiac.select("diseaseFromSourceMappedId").distinct().count()}')

Evidence count: 53
Association count: 46
Target count: 39
Disease count: 1


In [57]:
new_cardiac.select("diseaseFromSourceMappedId").distinct().show()

+-------------------------+
|diseaseFromSourceMappedId|
+-------------------------+
|                     null|
+-------------------------+



In [77]:
new_cardiac.select('diseaseFromSource').show(100, truncate=False)

+-----------------------------------------------------+
|diseaseFromSource                                    |
+-----------------------------------------------------+
|CALM1-related CPVT                                   |
|SLC22A5-related primary systemic carnitine deficiency|
|JUP-related Naxos disease                            |
|PKP2-related  ARVC                                   |
|DSC2-related ARVC                                    |
|DSC2-related ARVC                                    |
|DSG2-related ARVC                                    |
|DSG2-related ARVC                                    |
|MYH7-related DCM                                     |
|SLC4A3-related SQTS                                  |
|TPM1-related HCM                                     |
|DSP-related ARVC                                     |
|DSP-related ARVC                                     |
|CASQ2-related CPVT                                   |
|CASQ2-related CPVT                             

In [72]:
(
    spark.read.json('/Users/dsuveges/repositories/evidence_datasource_parsers/gene2phenotype-2022-04-25.json.gz')
    .filter(col('studyId') == 'Cardiac')
    .withColumn('test', regexp_replace(col('diseaseFromSource'), r'.+-related ', ''))
    .select('diseaseFromSource','test', 'diseaseFromSourceId', 'diseaseFromSourceMappedId', 'targetFromSourceId')
    .show()
)

+--------------------+--------------------+-------------------+-------------------------+------------------+
|   diseaseFromSource|                test|diseaseFromSourceId|diseaseFromSourceMappedId|targetFromSourceId|
+--------------------+--------------------+-------------------+-------------------------+------------------+
|  CALM1-related CPVT|                CPVT|               null|                     null|             CALM1|
|SLC22A5-related p...|primary systemic ...|               null|                     null|           SLC22A5|
|JUP-related Naxos...|       Naxos disease|               null|                     null|               JUP|
|  PKP2-related  ARVC|                ARVC|               null|                     null|              PKP2|
|   DSC2-related ARVC|                ARVC|               null|                     null|              DSC2|
|   DSC2-related ARVC|                ARVC|               null|                     null|              DSC2|
|   DSG2-related AR
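
The `regexp_replace` call above strips the `<gene>-related ` prefix from the Cardiac disease labels. The same idea in plain Python, as a sketch only: the pattern is tightened to `^.+-related\s+` (an assumption, not the pattern used in the cell) so that a double space, as in `PKP2-related  ARVC`, does not leave a leading blank. `strip_gene_prefix` is a hypothetical helper name.

```python
import re

# Hypothetical pattern: a leading "<gene>-related" followed by one or more spaces.
PREFIX = re.compile(r"^.+-related\s+")


def strip_gene_prefix(label: str) -> str:
    """Drop a leading '<gene>-related ' prefix; labels without it are unchanged."""
    return PREFIX.sub("", label)


for label in [
    "CALM1-related CPVT",
    "PKP2-related  ARVC",  # note the double space in the source label
    "JUP-related Naxos disease",
    "Ectodermal dysplasia (dominant)",  # no prefix: returned unchanged
]:
    print(f"{label!r} -> {strip_gene_prefix(label)!r}")
```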

In [78]:
(
    spark.read.json('/Users/dsuveges/repositories/evidence_datasource_parsers/gene2phenotype-2022-04-25.json.gz')
    .filter(col('diseaseFromSourceId').isNull())
#     .groupby('studyId')    
#     .count()
    .select('diseaseFromSource')
    .distinct()
#     .count()
    .show(300, truncate=False)
)

+------------------------------------------------------------------------------------------------------------------------------+
|diseaseFromSource                                                                                                             |
+------------------------------------------------------------------------------------------------------------------------------+
|Ectodermal dysplasia (dominant)                                                                                               |
|TFE3-related intellectual disability with pigmentary mosaicism                                                                |
|Ectodermal dysplasia, cleft lip/palate                                                                                        |
|KDM4B-related Developmental Disorder                                                                                          |
|Marfanoid Habitus and Cognitive Impairment                                                      

In [79]:
import ontoma


In [112]:
new_df = (
    spark.read.json('/Users/dsuveges/repositories/evidence_datasource_parsers/gene2phenotype-2022-04-27.json.gz')
    .select('diseaseFromSource', 'diseaseFromSourceId', 'diseaseFromSourceMappedID')
    .filter(col('diseaseFromSourceMappedID').isNotNull())
    .groupBy(['diseaseFromSource', 'diseaseFromSourceId'])
    .agg(collect_set(col('diseaseFromSourceMappedID')).alias('mapped_new'))
    .persist()
)

old_df = (
    spark.read.json('/Users/dsuveges/repositories/evidence_datasource_parsers/gene2phenotype-2022-04-25.json.gz')
    .select('diseaseFromSource', 'diseaseFromSourceId', 'diseaseFromSourceMappedID')
    .filter(col('diseaseFromSourceMappedID').isNotNull())
    .groupBy(['diseaseFromSource', 'diseaseFromSourceId'])
    .agg(collect_set(col('diseaseFromSourceMappedID')).alias('mapped_old'))
    .persist()
)



+----------------------------------------------------------------------------------------------------+-------------------+------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+
|diseaseFromSource                                                                                   |diseaseFromSourceId|mapped_new                                                                                      |mapped_old                                                                                      |
+----------------------------------------------------------------------------------------------------+-------------------+------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+
|ANTLEY-BIXLER SYNDROME                          

In [117]:
(
    new_df
    .join(old_df, on=['diseaseFromSource', 'diseaseFromSourceId'], how='outer')
    .filter(col('mapped_new').isNotNull() & col('mapped_old').isNotNull())
    .filter(array_except(col('mapped_new'), col('mapped_old')).isNotNull())
    .show(truncate=False, vertical=True)
)

-RECORD 0-------------------------------------------------------------------------------------------------------------------
 diseaseFromSource   | ANTLEY-BIXLER SYNDROME                                                                               
 diseaseFromSourceId | OMIM:207410                                                                                          
 mapped_new          | [MONDO_0008803]                                                                                      
 mapped_old          | [MONDO_0008803]                                                                                      
-RECORD 1-------------------------------------------------------------------------------------------------------------------
 diseaseFromSource   | CATARACT 4, MULTIPLE TYPES                                                                           
 diseaseFromSourceId | OMIM:115700                                                                                          
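
Record 0 above has identical `mapped_new` and `mapped_old` values. In Spark, `array_except` returns an empty array, not `null`, when every element of the first array is also in the second, so the `.isNotNull()` filter keeps all rows. Testing the size of the difference instead (for instance `size(array_except(...)) > 0` in PySpark) would keep only the changed mappings. The sketch below mimics that logic in plain Python; the second row is a made-up example, and `only_in_first` is a hypothetical helper.

```python
def only_in_first(first, second):
    """Elements of `first` absent from `second`, without duplicates,
    mirroring what Spark's array_except returns."""
    seen = set(second)
    result = []
    for item in first:
        if item not in seen and item not in result:
            result.append(item)
    return result


rows = [
    # Taken from record 0 of the output above: mapping unchanged.
    {"disease": "ANTLEY-BIXLER SYNDROME",
     "mapped_new": ["MONDO_0008803"], "mapped_old": ["MONDO_0008803"]},
    # Made-up row for illustration: mapping changed between runs.
    {"disease": "EXAMPLE DISEASE",
     "mapped_new": ["EXAMPLE_0000001"], "mapped_old": ["EXAMPLE_0000002"]},
]

# "is not None" keeps every row, because the difference is [] rather than None.
not_null = [r for r in rows
            if only_in_first(r["mapped_new"], r["mapped_old"]) is not None]

# A non-empty difference is the condition that flags changed mappings.
changed = [r for r in rows
           if len(only_in_first(r["mapped_new"], r["mapped_old"])) > 0]

print(f"Rows kept by the not-null test:  {len(not_null)}")
print(f"Rows kept by the non-empty test: {len(changed)}")
```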


In [105]:
old_df.count()

2157

In [126]:
diseases = (
    spark.read.parquet('/Users/dsuveges/project_data/diseases')
    .select(col('id').alias('diseaseFromSourceMappedId'), 'name')
    .distinct()
)

dis_map = (
    spark.read.json('/Users/dsuveges/repositories/evidence_datasource_parsers/gene2phenotype-2022-04-27.json.gz')
    .select('diseaseFromSource', 'diseaseFromSourceId', 'diseaseFromSourceMappedID')
#     .filter(col('diseaseFromSourceMappedID').isNotNull())
    .join(diseases, on='diseaseFromSourceMappedId', how='left')
    .persist()
)

In [130]:
(
    dis_map
    .filter(col('name').isNotNull())
#     .show()
    .count()
)

3595

In [131]:
(
    dis_map
    .filter(col('diseaseFromSourceMappedId').isNotNull())
#     .show()
    .count()
)

3595

In [132]:
dis_map.count()

4585
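
The three counts above fit together as follows. Assuming each disease ID appears only once in the disease index, every evidence row with a mapped disease ID also resolved to a name (3595 in both cases), which leaves 990 of the 4585 rows unmapped. A short arithmetic sketch, with illustrative variable names:

```python
total_rows = 4585         # dis_map.count()
rows_with_mapping = 3595  # diseaseFromSourceMappedId is not null
rows_with_name = 3595     # name is not null after the left join with the disease index

unmapped = total_rows - rows_with_mapping
unresolved = rows_with_mapping - rows_with_name  # mapped IDs missing from the index
coverage = rows_with_mapping / total_rows

print(f"Unmapped evidence rows: {unmapped}")
print(f"Mapped IDs missing from the disease index: {unresolved}")
print(f"Mapping coverage: {coverage:.1%}")
```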

In [173]:
g2p = (
    spark.read.json('/Users/dsuveges/repositories/evidence_datasource_parsers/gene2phenotype-2022-04-27.json.gz')
    .filter((col('studyId') == 'Cardiac') & (col('diseaseFromSourceMappedId').isNotNull()))
    .join(diseases, on='diseaseFromSourceMappedId', how='left')
    .filter(col('name').isNotNull())
    .persist()
)

print(g2p.count())
print(
    g2p
    .select('diseaseFromSourceMappedId', 'targetFromSourceId')
    .distinct()
    .count()
)

targets = (
    spark.read.parquet('/Users/dsuveges/project_data/targets')
    .select(
        col('id').alias('targetId'),
        col('approvedSymbol').alias('targetFromSourceId')
    )
    .distinct()
)

associations = (
    g2p
    .select('diseaseFromSourceMappedId', 'targetFromSourceId')
    .distinct()
    .join(diseases, on='diseaseFromSourceMappedId', how='left')
    .join(targets, on='targetFromSourceId', how='left')
    .withColumnRenamed('diseaseFromSourceMappedId', 'diseaseId')
    .persist()
)

associations.show(100)

54
47
+------------------+--------------+--------------------+---------------+
|targetFromSourceId|     diseaseId|                name|       targetId|
+------------------+--------------+--------------------+---------------+
|             TNNI3|   EFO_0000538|hypertrophic card...|ENSG00000129991|
|             ALPK3|   EFO_0000538|hypertrophic card...|ENSG00000136383|
|             LAMP2|Orphanet_34587|Glycogen storage ...|ENSG00000005893|
|               GLA|  Orphanet_324|       Fabry disease|ENSG00000102393|
|              BAG3|   EFO_0000407|dilated cardiomyo...|ENSG00000151929|
|               PLN| MONDO_0000591|intrinsic cardiom...|ENSG00000198523|
|              MYH7|   EFO_0000407|dilated cardiomyo...|ENSG00000092054|
|             ACTC1|   EFO_0000538|hypertrophic card...|ENSG00000159251|
|             SCN5A|  Orphanet_768|Familial long QT ...|ENSG00000183873|
|             LAMP2|   EFO_1001333|Glycogen Storage ...|ENSG00000005893|
|             KCNQ1|Orphanet_51083|Familial s

* The Cardiac panel was first released on 2022.04.25.
* Number of evidence: 54, supporting 47 unique target/disease associations.
* None of the associations are completely new.
* All associations were supported by genetic evidence.

In [183]:
all_associations = (
    spark.read.parquet('/Users/dsuveges/project_data/associationByDatasourceDirect')
)

(
    associations
    .join(all_associations, on=['targetId', 'diseaseId'], how='left')
    .filter(col('datatypeId').isNull())
    .show(10, False, True)
)

-RECORD 0---------------------------------------
 targetId           | ENSG00000128591           
 diseaseId          | Orphanet_593              
 targetFromSourceId | FLNC                      
 name               | Myofibrillar myopathy     
 datatypeId         | null                      
 datasourceId       | null                      
 score              | null                      
 evidenceCount      | null                      
-RECORD 1---------------------------------------
 targetId           | ENSG00000186439           
 diseaseId          | Orphanet_768              
 targetFromSourceId | TRDN                      
 name               | Familial long QT syndrome 
 datatypeId         | null                      
 datasourceId       | null                      
 score              | null                      
 evidenceCount      | null                      



In [170]:
aggregated_associations = (
    spark.read.parquet('/Users/dsuveges/project_data/associationByDatasourceDirect')
    .withColumn('sources', struct(col('datatypeId'), col('datasourceId')))
    .groupby('diseaseId','targetId')
    .agg(collect_set(col('sources')).alias('sources'))
)

+----------+---------------+--------------------+
| diseaseId|       targetId|             sources|
+----------+---------------+--------------------+
|DOID_10718|ENSG00000113749|[{literature, eur...|
|DOID_13406|ENSG00000120937|[{literature, eur...|
| DOID_7551|ENSG00000066427|[{literature, eur...|
| DOID_7551|ENSG00000095739|[{literature, eur...|
| DOID_7551|ENSG00000102755|[{literature, eur...|
| DOID_7551|ENSG00000103335|[{literature, eur...|
| DOID_7551|ENSG00000106004|[{literature, eur...|
| DOID_7551|ENSG00000116132|[{literature, eur...|
| DOID_7551|ENSG00000118046|[{literature, eur...|
| DOID_7551|ENSG00000121060|[{literature, eur...|
| DOID_7551|ENSG00000130770|[{literature, eur...|
| DOID_7551|ENSG00000137752|[{literature, eur...|
| DOID_7551|ENSG00000147133|[{literature, eur...|
| DOID_7551|ENSG00000149806|[{literature, eur...|
| DOID_7551|ENSG00000154370|[{literature, eur...|
| DOID_7551|ENSG00000158941|[{literature, eur...|
| DOID_7551|ENSG00000163629|[{literature, eur...|


In [176]:
spark.read.parquet('/Users/dsuveges/project_data/associationByDatasourceDirect').select('datatypeId').distinct().show()

+-------------------+
|         datatypeId|
+-------------------+
|   affected_pathway|
|         literature|
|     rna_expression|
|       animal_model|
|   somatic_mutation|
|         known_drug|
|genetic_association|
+-------------------+



In [184]:
(
    spark.read.parquet('/Users/dsuveges/project_data/associationByDatasourceDirect')
    .filter(
        (col('targetId') == 'ENSG00000186439') &
        (col('diseaseId') == 'Orphanet_768')
    )
    .show(100, False, True)
)


-RECORD 0-----------------------------
 datatypeId    | literature           
 datasourceId  | europepmc            
 diseaseId     | Orphanet_768         
 targetId      | ENSG00000186439      
 score         | 0.024148362238461615 
 evidenceCount | 3                    

