# HUMSAVAR


**Exploring the HUMSAVAR dataset**

* Getting basic stats: evidence count, association count, target count, disease count
* Comparing with the existing Uniprot evidence set: overlap by association.

#### Acquiring dataset:

* Downloading humsavar from `https://www.uniprot.org/docs/humsavar.txt` and loading it as a PySpark dataframe.
* Downloading the original uniprot evidence set from: 

### Basic stats

* Number of evidence: 80,129
* Number of evidence with no uniprot identifier: 0
* Number of evidence with no genes: 27 (discrepancy between gene and uniprot annotation)

* Number of evidence with no disease annotation: 41,354 (with 11,855 genes)
* Number of evidence with disease annotation: 38,775

* Number of associations: 6,318 (genes: 4,190, diseases: 4,529)
* Number of associations with MIM: 4,002 (genes: 2,914, diseases: 3,843)
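
The difference between evidence and association counts can be illustrated with a minimal sketch. The identifiers and disease labels below are made up for illustration and are not taken from humsavar; the only assumption carried over from the notebook is that `-` marks a missing disease annotation.

```python
import pandas as pd

# Toy evidence rows (hypothetical values): one row per variant.
evidence = pd.DataFrame(
    {
        "uniprotId": ["P1", "P1", "P1", "P2", "P2"],
        "disease": ["Disease A", "Disease A", "-", "Disease A", "Disease B"],
    }
)

# Rows with "-" carry no disease annotation and are excluded first.
annotated = evidence[evidence["disease"] != "-"]

# An association is a unique target/disease pair.
associations = annotated[["uniprotId", "disease"]].drop_duplicates()

n_evidence = len(annotated)
n_associations = len(associations)
n_targets = associations["uniprotId"].nunique()
n_diseases = associations["disease"].nunique()

print(n_evidence, n_associations, n_targets, n_diseases)
```

Several variants of the same protein annotated with the same disease collapse into a single association, which is why the association count is much smaller than the evidence count.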

## Humsavar vs uniprot evidence

#### Acquiring dataset:

Dataset downloaded from `gs://otar011-uniprot/cttv011-11-10-2021.json.gz`

### Comparison 


#### Approach 1:

To compare the two datasets, I only considered associations. The humsavar dataset was filtered for entries with OMIM identifiers; I made no attempt to map the diseases to EFO. The Uniprot evidence set was joined with the humsavar dataset on the `omim` disease identifier AND the `uniprot` target identifier (the Uniprot evidence set carries the OMIM identifier in the `diseaseFromSourceId` field).
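
A minimal pandas sketch of this kind of comparison, using toy identifiers rather than real data. The notebook itself performs the join in PySpark; the `indicator` column used here is just one convenient way to label where each association comes from.

```python
import pandas as pd

# Toy humsavar associations (hypothetical identifiers).
humsavar = pd.DataFrame(
    {
        "uniprotId": ["P1", "P2", "P3"],
        "omimId": ["OMIM:1", "OMIM:2", "OMIM:3"],
    }
)

# Toy Uniprot evidence, using the field names of the evidence set.
uniprot = pd.DataFrame(
    {
        "targetFromSourceId": ["P1", "P2", "P4"],
        "diseaseFromSourceId": ["OMIM:1", "OMIM:2", "OMIM:4"],
    }
).rename(
    columns={"targetFromSourceId": "uniprotId", "diseaseFromSourceId": "omimId"}
)

# Outer join on BOTH keys; `_merge` records the origin of each pair.
compared = humsavar.merge(
    uniprot.drop_duplicates(),
    on=["uniprotId", "omimId"],
    how="outer",
    indicator=True,
)

counts = compared["_merge"].astype(str).value_counts().to_dict()
print(counts)
```

Here `both` counts shared associations, `left_only` those found only in humsavar, and `right_only` those found only in the Uniprot evidence.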

Of the 4,002 unique associations from humsavar, 3,815 were found in the Uniprot dataset, which suggests humsavar is largely a subset of the Uniprot dataset. The 187 missing associations comprise 123 distinct diseases (of which only 3 appear elsewhere in the Uniprot dataset) and 182 distinct targets (of which 159 appear elsewhere in the Uniprot dataset).
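
As a quick consistency check, the counts reported in this notebook can be combined arithmetically. The sketch below only uses numbers that appear in the notebook outputs and comments (4002, 3815, 187, 2169).

```python
# Counts taken from the notebook outputs.
humsavar_total = 4002   # humsavar associations with OMIM identifiers
shared = 3815           # associations found in both datasets
humsavar_only = 187     # associations found only in humsavar
uniprot_only = 2169     # associations found only in the Uniprot evidence

# The humsavar associations split into shared and humsavar-only.
assert shared + humsavar_only == humsavar_total

# Size of the Uniprot association set and of the outer join.
uniprot_total = shared + uniprot_only
outer_join_rows = shared + humsavar_only + uniprot_only

# Fraction of humsavar associations recovered in the Uniprot evidence.
overlap_fraction = shared / humsavar_total

print(uniprot_total, outer_join_rows, f"{overlap_fraction:.1%}")
```

The derived totals (5984 Uniprot associations, 6171 rows in the outer join) match the counts printed by the Spark cells.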

**Conclusion**:

I think the humsavar dataset is likely a subset of the evidence set curated by Uniprot, given the large overlap. However, some of the evidence in the humsavar dataset has no disease annotation at all, indicating it might be generated automatically by parsing Uniprot entries.

#### Approach 2

Given one third of the diseases have no OMIM identifiers, we should try to map those diseases to EFO then try to join with the uniprot evidence set.

1. Extract evidence from the humsavar dataset where no OMIM identifier is available.
2. Generate associations.
3. Map diseases to EFO.
4. Join with Uniprot evidence based on uniprot id and EFO id. 
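
The four steps above can be sketched on toy data. This is only an illustration under stated assumptions: the `label_to_efo` dictionary is a hypothetical stand-in for the OnToma lookup used in the notebook, and the protein identifiers are invented.

```python
import pandas as pd

# Toy humsavar evidence (hypothetical identifiers and annotations).
evidence = pd.DataFrame(
    {
        "uniprotId": ["P1", "P1", "P2", "P3"],
        "disease": [
            "A breast cancer sample",
            "A breast cancer sample",
            "Example syndrome (EXS) [MIM:123456]",
            "A lung adenocarcinoma sample",
        ],
    }
)

# 1. Keep evidence whose annotation carries no OMIM identifier.
no_omim = evidence[~evidence["disease"].str.contains("MIM:", regex=False)]

# Hypothetical label -> EFO lookup standing in for the OnToma mapping.
label_to_efo = {"A lung adenocarcinoma sample": "EFO_0000571"}

# 2. Generate associations, 3. map the diseases to EFO.
associations = (
    no_omim[["uniprotId", "disease"]]
    .drop_duplicates()
    .assign(efoId=lambda df: df["disease"].map(label_to_efo))
)
mapped = associations.dropna(subset=["efoId"])

# Toy Uniprot evidence, already reduced to target and mapped disease.
uniprot = pd.DataFrame(
    {"uniprotId": ["P3", "P9"], "efoId": ["EFO_0000571", "EFO_0000571"]}
)

# 4. Join on uniprot id AND EFO id.
shared = mapped.merge(uniprot, on=["uniprotId", "efoId"], how="inner")
print(len(no_omim), len(associations), len(mapped), len(shared))
```

Unmapped labels drop out before the join, so the quality of the disease mapping directly limits how much overlap this approach can detect.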

**Conclusion**:

In [117]:
%%bash

gsutil cp -r gs://otar011-uniprot/cttv011-11-10-2021.json.gz . 

ls -lah

gzcat cttv011-11-10-2021.json.gz | head -n1 | jq

total 32216
drwxr-xr-x   7 dsuveges  384566875   224B 21 Dec 23:37 .
drwxrwxr-x  65 dsuveges  384566875   2.0K 20 Dec 22:33 ..
drwxr-xr-x   3 dsuveges  384566875    96B 20 Dec 22:33 .ipynb_checkpoints
-rw-r--r--   1 dsuveges  384566875    55K 21 Dec 23:36 Untitled.ipynb
-rw-r--r--   1 dsuveges  384566875   703K 21 Dec 23:37 cttv011-11-10-2021.json.gz
-rw-r--r--   1 dsuveges  384566875   7.5M 29 Sep 01:00 humsavar-2021-12-20.txt
-rw-r--r--   1 dsuveges  384566875   7.5M 21 Dec 00:26 humsavar.txt
{
  "datasourceId": "uniprot_literature",
  "datatypeId": "genetic_literature",
  "diseaseFromSource": "Ectodermal dysplasia and immunodeficiency 1",
  "diseaseFromSourceId": "OMIM:300291",
  "confidence": "high",
  "diseaseFromSourceMappedId": "MONDO_0020740",
  "literature": [
    "21606507",
    "16547522",
    "11224521",
    "14651848",
    "19185524",
    "11242109",
    "11047757",
    "15100680",
    "12045264"
  ],
  "targetFromSourceId": "Q9Y6K9",
  "targetModulation": "up_or_down"
}


Copying gs://otar011-uniprot/cttv011-11-10-2021.json.gz...
Operation completed over 1 objects/702.7 KiB.                                    


In [23]:
%%bash

# Fetching data from uniprot:
# wget https://www.uniprot.org/docs/humsavar.txt -O humsavar-2021-12-20.txt
wc -l humsavar-2021-12-20.txt

head -47 humsavar-2021-12-20.txt
echo "... 80k rows ..."
tail -15 humsavar-2021-12-20.txt

   80179 humsavar-2021-12-20.txt
--------------------------------------------------------------------------------
        UniProt - Swiss-Prot Protein Knowledgebase
        SIB Swiss Institute of Bioinformatics; Geneva, Switzerland
        European Bioinformatics Institute (EBI); Hinxton, United Kingdom
        Protein Information Resource (PIR); Washington DC, USA
--------------------------------------------------------------------------------

Description: Index of human variants curated from literature reports
Name:        humsavar.txt
Release:     2021_04 of 29-Sep-2021

--------------------------------------------------------------------------------
This file lists all missense variants annotated in UniProtKB/Swiss-Prot human
entries. It provides a variant classification which is intended for research
purposes only, not for clinical and diagnostic use.

 - The column 'Variant category' shows the classification of the variant using
   the American College of Medical Genetics and Ge

In [2]:
import pandas as pd
import re
from ontoma import OnToma

# Initialize ontoma:
otmap = OnToma()

# Data read directly from the web:
humsavar_url = 'https://www.uniprot.org/docs/humsavar.txt'

# Column names are not inferred from the data:
humsavar_columns = [
    'gene', 
    'uniprotId', 
    'variantId', 
    'aaChange', 
    'severity',
    'rsId',
    'disease'
]

# Significance terms are mapped:
significant_map = {
    "LP/P": "pathogenic",
    "LB/B": "benign",
    "US": "uncertain significance"
}

# Humsavar disease annotation needs to be mapped at first:
def parse_disease(disease: str) -> dict:
    """Return the parsed disease label, its EFO mapping and the OMIM identifier."""
    
    return_dict = {
        'disease': disease,
        'diseaseMappedId': None,
        'diseaseMappedLabel': None,
        'omimId': None
    }

    if disease == '-':
        return return_dict
    
    elif '[MIM' not in disease:
        disease_label = disease

    else:
    
        # The OMIM needs to be extracted:
        m = re.match(r'(.+?) \[(MIM:\d+)', disease)

        # Do we have OMIM:
        if m:
            return_dict['omimId'] = m[2].replace('MIM', 'OMIM')
            disease_label = m[1]
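        else:
            # No parsable OMIM identifier: keep the label before the first parenthesis, if any.
            m = re.match(r'(.+?) \(', disease)
            disease_label = m[1] if m else disease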

    # Mapping label:
    mapping = otmap.find_term(disease_label)
    
    if len(mapping) > 0:
        return_dict['diseaseMappedId'] = mapping[0].id_normalised
        return_dict['diseaseMappedLabel'] = mapping[0].label

    elif return_dict['omimId']:
        mapping = otmap.find_term(return_dict['omimId'], code=True)
        if len(mapping) > 0:
            return_dict['diseaseMappedId'] = mapping[0].id_normalised
            return_dict['diseaseMappedLabel'] = mapping[0].label

    return return_dict

# Reading data as dataframe:
humsavar_data = (
    
    # Reading file:
    pd.read_fwf(humsavar_url, header=None, names=humsavar_columns, skiprows=44, skipfooter=6)
    
    # mapping severity significance:
    .assign(severity = lambda df: df.severity.map(significant_map))
    
    # Filtering out rows where no disease is given:
    .query('disease != "-"')
)

# Get the unique list of diseases and try to map to efo using ontoma:
efo_mapping = pd.DataFrame(
    pd.Series(
        humsavar_data.disease.unique()
    ).apply(parse_disease)
    .to_list()
)

# Joining humsavar data with disease mapping:
humsavar_data = (
    humsavar_data
    
    # Merging with efo:
    .merge(efo_mapping, on='disease', how='left')
    
    # Dropping unused columns:
    .drop(['gene', 'variantId'], axis=1)
)


INFO     - ontoma.interface - Created EFO cache directory /var/folders/gr/q7cf7ljx2ss8b5cgvycxn9yxz2wxjr/T/tmpm_pr1ics.
INFO     - ontoma.interface - Downloading EFO cache to /var/folders/gr/q7cf7ljx2ss8b5cgvycxn9yxz2wxjr/T/tmpm_pr1ics.
INFO:ontoma.interface:Using EFO cache from /var/folders/gr/q7cf7ljx2ss8b5cgvycxn9yxz2wxjr/T/tmpm_pr1ics.
INFO     - ontoma.interface - Loaded 20322 terms, 92701 xrefs, and 67124 synonyms from EFO cache.


JSONDecodeError: Expecting value: line 1 column 1 (char 0)

In [142]:
## Getting some stats:
print(f'Number of unique uniprot ids: {len(humsavar_data.uniprotId.unique())}')
print(f'Number of unique diseases: {len(humsavar_data.disease.unique())}')

print(f'\nNumber of evidence: {len(humsavar_data)}')
print(f'Number of associations: {len(humsavar_data[["uniprotId", "disease"]].drop_duplicates())}')

print(f'\nNumber of diseases with no mapping: {len(efo_mapping.loc[efo_mapping.diseaseMappedId.isna()])}')

humsavar_mapped = humsavar_data.loc[humsavar_data.diseaseMappedId.notna()]
humsavar_notmapped = humsavar_data.loc[humsavar_data.diseaseMappedId.isna()]

print(f'\nNumber of evidence w. mapping: {len(humsavar_mapped)}')
print(f'Number of associations w. mapping: {len(humsavar_mapped[["uniprotId", "diseaseMappedId"]].drop_duplicates())}')


print(f'\nNumber of evidence wo. mapping: {len(humsavar_notmapped)}')
print(f'Number of associations wo. mapping: {len(humsavar_notmapped[["uniprotId", "disease"]].drop_duplicates())}')


humsavar_df = spark.createDataFrame(humsavar_data).persist()
humsavar_df.show()

Number of unique uniprot ids: 4190
Number of unique diseases: 4529

Number of evidence: 38775
Number of associations: 6318

Number of diseases with no mapping: 874

Number of evidence w. mapping: 31307
Number of associations w. mapping: 3806

Number of evidence wo. mapping: 7468
Number of associations wo. mapping: 2404
+---------+-----------+--------------------+------------+--------------------+---------------+--------------------+-----------+
|uniprotId|   aaChange|            severity|        rsId|             disease|diseaseMappedId|  diseaseMappedLabel|     omimId|
+---------+-----------+--------------------+------------+--------------------+---------------+--------------------+-----------+
|   Q9NRG9| p.Gln15Lys|          pathogenic| rs121918549|Achalasia-addison...|       ORDO:869|   triple a syndrome|OMIM:231550|
|   Q9NRG9|p.His160Arg|          pathogenic|rs1297831120|Achalasia-addison...|       ORDO:869|   triple a syndrome|OMIM:231550|
|   Q9NRG9|p.Ser263Pro|          pathog

In [84]:
from pyspark.conf import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, col, concat, lit, lower, regexp_replace, split, udf
from pyspark import SparkFiles
from pyspark.sql.types import StringType

spark_conf = (
    SparkConf()
    .set('spark.driver.memory', '5g')
    .set('spark.executor.memory', '5g')
    .set('spark.driver.maxResultSize', '0')
    .set('spark.debug.maxToStringFields', '2000')
    .set('spark.sql.execution.arrow.maxRecordsPerBatch', '500000')
)
spark = (
    SparkSession.builder
    .config(conf=spark_conf)
    .config("spark.sql.broadcastTimeout", "36000")
    .master('local[*]')
    .getOrCreate()
)

uniprot_df = (
    spark.read.json('cttv011-11-10-2021.json.gz')
    .select('diseaseFromSource', 'diseaseFromSourceId', 'confidence', 'diseaseFromSourceMappedId', 'targetFromSourceId')
    .distinct()
    .persist()
)


print(uniprot_df.count())
print(uniprot_df.select('diseaseFromSourceMappedId', 'targetFromSourceId').distinct().count())





7754
7396


In [155]:
# Prepare association from the humsavar evidence set:
humsavar_associations = (
    humsavar_df
    
    # Filtering for evidence where OMIM disease identifiers are available:
    .filter(col('omimId').isNotNull())
    .select('uniprotId', 'omimId')
    
    # Generate associations:
    .distinct()
    
    # Annotate source:
    .withColumn('humsavar', lit(True))
    .persist()
)

print(humsavar_associations.count()) # 4002
print(humsavar_associations.show())

# +---------+-----------+--------+
# |uniprotId|     omimId|humsavar|
# +---------+-----------+--------+
# |   Q9UPV0|OMIM:614845|    true|
# |   Q9Y259|OMIM:602541|    true|
# |   Q8TDJ6|OMIM:618663|    true|
# |   Q9NYP3|OMIM:617604|    true|
# |   Q04446|OMIM:232500|    true|
# |   P36382|OMIM:108770|    true|
# |   P29033|OMIM:602540|    true|
# |   Q5T442|OMIM:613206|    true|
# |   Q96IR7|OMIM:619027|    true|
# |   P78411|OMIM:611174|    true|
# |   Q96FE5|OMIM:618103|    true|
# |   P37287|OMIM:300818|    true|
# |   O00469|OMIM:609220|    true|
# |   P28340|OMIM:612591|    true|
# |   P78424|OMIM:601583|    true|
# |   Q9ULP9|OMIM:605021|    true|
# |   Q9NUM4|OMIM:617964|    true|
# |   Q9Y462|OMIM:300803|    true|
# |   P22557|OMIM:300751|    true|
# |   Q2TAA5|OMIM:613661|    true|
# +---------+-----------+--------+


# Prepare association from the uniprot evidence set:
uniprot_assoc = (
    uniprot_df
    
    # Selecting and renaming relevant columns:
    .select('diseaseFromSourceId', 'targetFromSourceId')
    .withColumnRenamed('targetFromSourceId', 'uniprotId')
    .withColumnRenamed('diseaseFromSourceId', 'omimId')

    # Generate associations:
    .distinct()
    
    # Annotate source:
    .withColumn('uniprot', lit(True))
    .persist()
)

print(uniprot_assoc.count()) # 5984
print(uniprot_assoc.show())

# +-----------+---------+-------+
# |     omimId|uniprotId|uniprot|
# +-----------+---------+-------+
# |OMIM:614935|   Q86X45|   true|
# |OMIM:264350|   P51168|   true|
# |OMIM:615833|   Q8NC96|   true|
# |OMIM:617613|   Q9BUE6|   true|
# |OMIM:612782|   Q96D31|   true|
# |OMIM:619051|   P14854|   true|
# |OMIM:609968|   P06213|   true|
# |OMIM:218330|   Q9HBG6|   true|
# |OMIM:607174|   Q969G3|   true|
# |OMIM:614250|   Q16653|   true|
# |OMIM:203780|   P53420|   true|
# |OMIM:242300|   P22735|   true|
# |OMIM:600195|   Q02763|   true|
# |OMIM:608812|   Q8IXK2|   true|
# |OMIM:310490|   O95831|   true|
# |OMIM:273250|   Q8IY37|   true|
# |OMIM:608516|   P08172|   true|
# |OMIM:309520|   Q93074|   true|
# |OMIM:615558|   P04114|   true|
# |OMIM:300559|   P46020|   true|
# +-----------+---------+-------+


# Joining humsavar and uniprot datasets:
association_compare = (
    uniprot_assoc
    .join(humsavar_associations, on=['omimId', 'uniprotId'], how='outer')
    .persist()
)

print(association_compare.count()) # 6171
print(association_compare.show())

# +-----------+---------+-------+--------+
# |     omimId|uniprotId|uniprot|humsavar|
# +-----------+---------+-------+--------+
# |OMIM:149200|   P29033|   true|    true|
# |OMIM:203780|   P53420|   true|    true|
# |OMIM:218330|   Q9HBG6|   true|    true|
# |OMIM:242300|   P22735|   true|    true|
# |OMIM:264350|   P51168|   true|    true|
# |OMIM:273250|   Q8IY37|   true|    true|
# |OMIM:300494|   Q9NZ94|   true|    true|
# |OMIM:300559|   P46020|   true|    true|
# |OMIM:300661|   P60891|   true|    null|
# |OMIM:309520|   Q93074|   true|    null|
# |OMIM:310490|   O95831|   true|    null|
# |OMIM:600195|   Q02763|   true|    true|
# |OMIM:605282|   Q86X52|   true|    true|
# |OMIM:605589|   Q71SY5|   true|    true|
# |OMIM:607174|   Q969G3|   true|    null|
# |OMIM:608516|   P08172|   true|    null|
# |OMIM:608812|   Q8IXK2|   true|    true|
# |OMIM:609968|   P06213|   true|    true|
# |OMIM:611291|   Q9H9Q4|   true|    null|
# |OMIM:612286|   Q06495|   true|    true|
# +-----------+---------+-------+--------+

# Number of associations found only in the uniprot dataset: 2169
print(
    association_compare
    .filter(
        col('uniprot').isNotNull() & 
        col('humsavar').isNull()
    )
    .count()
)

# Number of associations found only in the humsavar dataset: 187
print(
    association_compare
    .filter(
        col('uniprot').isNull() & 
        col('humsavar').isNotNull()
    )
    .count()
)

# Number of associations found in both dataset: 3815
print(
    association_compare
    .filter(
        col('uniprot').isNotNull() & 
        col('humsavar').isNotNull()
    )
    .count()
)

4002
+---------+-----------+--------+
|uniprotId|     omimId|humsavar|
+---------+-----------+--------+
|   Q9UPV0|OMIM:614845|    true|
|   Q9Y259|OMIM:602541|    true|
|   Q8TDJ6|OMIM:618663|    true|
|   Q9NYP3|OMIM:617604|    true|
|   Q04446|OMIM:232500|    true|
|   P36382|OMIM:108770|    true|
|   P29033|OMIM:602540|    true|
|   Q5T442|OMIM:613206|    true|
|   Q96IR7|OMIM:619027|    true|
|   P78411|OMIM:611174|    true|
|   Q96FE5|OMIM:618103|    true|
|   P37287|OMIM:300818|    true|
|   O00469|OMIM:609220|    true|
|   P28340|OMIM:612591|    true|
|   P78424|OMIM:601583|    true|
|   Q9ULP9|OMIM:605021|    true|
|   Q9NUM4|OMIM:617964|    true|
|   Q9Y462|OMIM:300803|    true|
|   P22557|OMIM:300751|    true|
|   Q2TAA5|OMIM:613661|    true|
+---------+-----------+--------+
only showing top 20 rows

None
5984
+-----------+---------+-------+
|     omimId|uniprotId|uniprot|
+-----------+---------+-------+
|OMIM:614935|   Q86X45|   true|
|OMIM:264350|   P51168|   true|
|OMIM:6

In [251]:
# Getting more insights on the associations which are not covered in the uniprot datasets: 
missing_assoc = (
    association_compare
    .filter(col('uniprot').isNull())
    .persist()
)

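# Which of the missing diseases/targets are present elsewhere in the uniprot dataset?
# NOTE: the inner joins below keep identifiers that DO occur in uniprot, despite the variable names.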
disease_not_in_uniprot = (
    missing_assoc
    .select("omimId")
    .distinct()
    .join((
        association_compare
        .filter(col('uniprot').isNotNull())
        .select('omimId')
        .distinct()
    ), on='omimId', how='inner')
)

target_not_in_uniprot = (
    missing_assoc
    .select("uniprotId")
    .distinct()
    .join((
        association_compare
        .filter(col('uniprot').isNotNull())
        .select('uniprotId')
        .distinct()
    ), on='uniprotId', how='inner')
)

print(f'Missing associations: {missing_assoc.count()}')
print(f'Number of diseases in the missing associations: {missing_assoc.select("omimId").distinct().count()} (also in Uniprot: {disease_not_in_uniprot.count()})')
print(f'Number of targets in the missing associations: {missing_assoc.select("uniprotId").distinct().count()} (also in Uniprot: {target_not_in_uniprot.count()})')


Missing associations: 187
Number of diseases in the missing associations: 123 (also in Uniprot: 3)
Number of targets in the missing associations: 182 (also in Uniprot: 159)


In [252]:
disease_not_in_uniprot.show()
target_not_in_uniprot.show()

+-----------+
|     omimId|
+-----------+
|OMIM:608089|
|OMIM:613659|
|OMIM:146550|
+-----------+

+---------+
|uniprotId|
+---------+
|   Q00796|
|   O14656|
|   Q02487|
|   Q9UHD2|
|   A2RRP1|
|   O94886|
|   P55268|
|   Q9BY12|
|   Q8WZ42|
|   Q8NB90|
|   Q5K651|
|   Q8TE60|
|   Q13228|
|   Q8NBL1|
|   O15047|
|   P17612|
|   P49589|
|   Q8IVL5|
|   P53701|
|   Q8WVS4|
+---------+
only showing top 20 rows



In [197]:
humsavar_associations = (
    pd.read_fwf(humsavar_url, header=None, names=humsavar_columns, skiprows=44, skipfooter=6)
    
    # Extracting OMIM disease identifier:
    .assign(
        diseaseFromSourceId = lambda df: df.disease.str.extract(r'(MIM:\d+)').replace(regex=r'MIM', value='OMIM')
    )
    
    # Filtering for rows where OMIM identifiers are available:
    .query('diseaseFromSourceId.notnull()', engine='python')
    
    # Get unique disease/target associations from evidence:
    [['diseaseFromSourceId', 'uniprotId']]
    .drop_duplicates()
    .reset_index(drop=True)
)

humsavar_associations

Unnamed: 0,diseaseFromSourceId,uniprotId
0,OMIM:231550,Q9NRG9
1,OMIM:613287,P49588
2,OMIM:616339,P49588
3,OMIM:614096,Q5JTZ9
4,OMIM:615889,Q5JTZ9
...,...,...
3997,OMIM:616833,Q8N1G0
3998,OMIM:300803,Q9Y462
3999,OMIM:260565,Q15649
4000,OMIM:617712,P21754


In [198]:
humsavar_associations = (
    pd.read_fwf(humsavar_url, header=None, names=humsavar_columns, skiprows=44, skipfooter=6)
    
    # Extracting OMIM disease identifier:
    .assign(
        diseaseFromSourceId = lambda df: df.disease.str.extract(r'(MIM:\d+)').replace(regex=r'MIM', value='OMIM')
    )
)

humsavar_associations.head()

Unnamed: 0,gene,uniprotId,variantId,aaChange,severity,rsId,disease,diseaseFromSourceId
0,A1BG,P04217,VAR_018369,p.His52Arg,LB/B,rs893184,-,
1,A1BG,P04217,VAR_018370,p.His395Arg,LB/B,rs2241788,-,
2,A1CF,Q9NQ94,VAR_052201,p.Val555Met,LB/B,rs9073,-,
3,A1CF,Q9NQ94,VAR_059821,p.Ala558Ser,LB/B,rs11817448,-,
4,A2M,P01023,VAR_000012,p.Arg704His,LB/B,rs1800434,-,


In [220]:
# Number of evidence:
print(f'Number of evidence: {len(humsavar_associations)}')

# Number of evidence with neither gene nor uniprot id:
missing_gene_uniprot = humsavar_associations.query('(gene == "-") & (uniprotId == "-")')
print(f'Number of evidence with no gene and no uniprot id: {len(missing_gene_uniprot)}')

# Number of evidence without gene:
missing_gene = humsavar_associations.query('gene == "-"')
print(f'Number of evidence with no genes: {len(missing_gene)}')
missing_gene[['gene', 'uniprotId', 'rsId', 'disease']]

# Number of evidence without disease annotation:
missing_disease = humsavar_associations.query('disease == "-"')
genes_w_missing_disease = missing_disease.uniprotId.unique()
print(f'Number of evidence with no disease annotation: {len(missing_disease)} (with {len(genes_w_missing_disease)} genes)')

# Number of evidence with disease annotation:
w_disease = humsavar_associations.query('disease != "-"')
genes_w_disease = w_disease.uniprotId.unique()
print(f'Number of evidence disease annotation: {len(w_disease)}')

# Number of associations with disease annotation:
associations = humsavar_associations[['uniprotId', 'disease']].query('disease != "-"').drop_duplicates()
print(f'Number of associations: {len(associations)}. genes: {len(associations.uniprotId.unique())}, diseases: {len(associations.disease.unique())}')

# Number of associations with MIM disease:
mim_assoc = (
    associations
    .assign(
        diseaseFromSourceId = lambda df: df.disease.str.extract(r'(MIM:\d+)').replace(regex=r'MIM', value='OMIM')
    )
    .query('diseaseFromSourceId.notnull()', engine='python')
)
print(f'Number of associations with MIM : {len(mim_assoc)}. genes: {len(mim_assoc.uniprotId.unique())}, diseases: {len(mim_assoc.disease.unique())}')

# Number of evidence without disease with OMIM identifier

Number of evidence: 80129
Number of evidence with no gene and no uniprot id: 0
Number of evidence with no genes: 27
Number of evidence with no disease annotation: 41354 (with 11855 genes)
Number of evidence disease annotation: 38775
Number of associations: 6318. genes: 4190, diseases: 4529
Number of associations with MIM : 4062. genes: 2914, diseases: 3843


In [224]:
mim_assoc[['uniprotId', 'diseaseFromSourceId']].drop_duplicates()

Unnamed: 0,uniprotId,diseaseFromSourceId
29,Q9NRG9,OMIM:231550
61,P49588,OMIM:613287
64,P49588,OMIM:616339
70,Q5JTZ9,OMIM:614096
72,Q5JTZ9,OMIM:615889
...,...,...
79750,Q8N1G0,OMIM:616833
79783,Q9Y462,OMIM:300803
79999,Q15649,OMIM:260565
80016,P21754,OMIM:617712


In [230]:
(
    association_compare
    .filter(col('uniprot').isNull())
    .select(col('uniprotId'))
    .distinct()
    .count()
)
# Disease: 123 -> found in uniprot: 3 (missing 120)
# Target: 182 -> found in uniprot: 159 (missing 23)

182

In [240]:
(
    association_compare
    .filter(col('uniprot').isNotNull())
    .select(col('omimId'), ('uniprot'))
    .distinct()
    .join((
        association_compare
        .filter(col('uniprot').isNull())
        .select(col('omimId'))
        .distinct()
    ), on='omimId', how='inner')
    .count()
)

3

In [253]:
efo_mapping.head()

Unnamed: 0,disease,diseaseMappedId,diseaseMappedLabel,omimId
0,Achalasia-addisonianism-alacrima syndrome (AAA...,ORDO:869,triple a syndrome,OMIM:231550
1,Charcot-Marie-Tooth disease 2N (CMT2N) [MIM:61...,ORDO:228174,autosomal dominant charcot-marie-tooth disease...,OMIM:613287
2,Developmental and epileptic encephalopathy 29 ...,ORDO:1934,early infantile epileptic encephalopathy,OMIM:616339
3,Combined oxidative phosphorylation deficiency ...,ORDO:319504,combined oxidative phosphorylation defect type 8,OMIM:614096
4,"Leukoencephalopathy, progressive, with ovarian...",,,OMIM:615889


## Approach 2

1. Filter the humsavar dataset for rows where no OMIM identifier is available but a disease annotation is.
2. Map disease annotation to EFO
3. Join with Uniprot dataset
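
The first step depends on recognising whether a disease annotation carries a MIM identifier. Below is a minimal sketch of such a check, assuming annotations of the form `label [MIM:digits]` and `-` for missing values; the helper name `split_annotation` and the example strings are hypothetical.

```python
import re
from typing import Optional, Tuple

# Label followed by a bracketed MIM identifier, e.g. "... [MIM:123456]".
MIM_PATTERN = re.compile(r"^(?P<label>.+?)\s*\[MIM:(?P<mim>\d+)\]")


def split_annotation(disease: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a disease annotation into (label, OMIM id); either may be None."""
    if disease == "-":
        # No disease annotation at all.
        return None, None
    match = MIM_PATTERN.match(disease)
    if match:
        return match.group("label"), f"OMIM:{match.group('mim')}"
    # Annotation present, but without an OMIM identifier.
    return disease, None


with_mim = split_annotation("Example syndrome (EXS) [MIM:123456]")
without_mim = split_annotation("A lung adenocarcinoma sample")
missing = split_annotation("-")
print(with_mim, without_mim, missing)
```

Rows for which the second element is `None` but the first is not are the ones Approach 2 targets.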


In [24]:
import pandas as pd
import re
from ontoma import OnToma

# Initialize ontoma:
# otmap = OnToma()

# Data read directly from the web:
humsavar_url = 'https://www.uniprot.org/docs/humsavar.txt'

# Column names are not inferred from the data:
humsavar_columns = [
    'gene', 
    'uniprotId', 
    'variantId', 
    'aaChange', 
    'severity',
    'rsId',
    'disease'
]


# Reading data as dataframe:
humsavar_data = (
    
    # Reading file:
    pd.read_fwf(humsavar_url, header=None, names=humsavar_columns, skiprows=44, skipfooter=6)
    
    # mapping severity significance:
    .assign(severity = lambda df: df.severity.map(significant_map))
    
    # Filtering out rows where no disease is given:
    .query('disease != "-"')
    
    # Filtering for diseases where no OMIM is available:
    .loc[lambda df: ~ df.disease.str.contains(r'MIM')]
)

print(f'Number of evidence: {len(humsavar_data)}')

Number of evidence: 6261


In [31]:
# Restricting data for associations
associations = (
    humsavar_data
    [['uniprotId', 'disease']]
    .drop_duplicates()
)

print(f'Number of association: {len(associations)}')
print(associations.head())

diseases = pd.Series(associations.disease.unique())
print(f'Number of unique diseases: {len(diseases)}')



Number of association: 2193
    uniprotId                                    disease
86     Q6ZMQ8       An ovarian mucinous carcinoma sample
87     Q6ZMQ8               A lung adenocarcinoma sample
132    O95477                 A colorectal cancer sample
189    Q86UK0  A pancreatic ductal adenocarcinoma sample
227    Q99758                     A breast cancer sample
Number of unique diseases: 623


In [52]:
from json import JSONDecodeError

def map_disease(disease: str) -> dict:
    """Return the disease label together with its EFO mapping, if any."""
    
    return_dict = {
        'disease': disease,
        'diseaseMappedId': None,
        'diseaseMappedLabel': None,
    }
    
    # Strip the leading article and the trailing 'sample' (e.g. 'A breast cancer sample'):
    disease_label = re.sub(r'^An? ', '', disease)
    disease_label = re.sub(r' sample$', '', disease_label)
        
    # Mapping label; lookups that raise (e.g. the JSONDecodeError seen earlier) stay unmapped:
    try:
        mapping = otmap.find_term(disease_label)
    except Exception:
        return return_dict
    
    if len(mapping) > 0:
        return_dict['diseaseMappedId'] = mapping[0].id_normalised
        return_dict['diseaseMappedLabel'] = mapping[0].label

    return return_dict


mapped_diseases = diseases.apply(map_disease)
# otmap.find_term('An ovarian mucinous carcinoma')



INFO     - ontoma.interface - Processed: ovarian mucinous carcinoma → [OnTomaResult(query='ovarian mucinous carcinoma', id_normalised='EFO:0006462', id_ot_schema='EFO_0006462', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0006462', label='ovarian mucinous adenocarcinoma')]
INFO:ontoma.interface:Processed: ovarian mucinous carcinoma → [OnTomaResult(query='ovarian mucinous carcinoma', id_normalised='EFO:0006462', id_ot_schema='EFO_0006462', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0006462', label='ovarian mucinous adenocarcinoma')]
INFO     - ontoma.interface - Processed: lung adenocarcinoma → [OnTomaResult(query='lung adenocarcinoma', id_normalised='EFO:0000571', id_ot_schema='EFO_0000571', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0000571', label='lung adenocarcinoma')]
INFO:ontoma.interface:Processed: lung adenocarcinoma → [OnTomaResult(query='lung adenocarcinoma', id_normalised='EFO:0000571', id_ot_schema='EFO_0000571', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0000571', label='lung ad

INFO:ontoma.interface:Processed: Midface hypoplasia, hearing impairment, elliptocytosis, and nephrocalcinosi → []
INFO     - ontoma.interface - Processed: lung neuroendocrine carcinoma → []
INFO:ontoma.interface:Processed: lung neuroendocrine carcinoma → []
INFO     - ontoma.interface - Processed: Colorectal carcinoma → [OnTomaResult(query='Colorectal carcinoma', id_normalised='EFO:1001951', id_ot_schema='EFO_1001951', id_full_uri='http://www.ebi.ac.uk/efo/EFO_1001951', label='colorectal carcinoma')]
INFO:ontoma.interface:Processed: Colorectal carcinoma → [OnTomaResult(query='Colorectal carcinoma', id_normalised='EFO:1001951', id_ot_schema='EFO_1001951', id_full_uri='http://www.ebi.ac.uk/efo/EFO_1001951', label='colorectal carcinoma')]
INFO     - ontoma.interface - Processed: Colorectal tumor → [OnTomaResult(query='Colorectal tumor', id_normalised='EFO:0004142', id_ot_schema='EFO_0004142', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0004142', label='colorectal neoplasm')]
INFO:ontoma.int

INFO     - ontoma.interface - Processed: bladder transitional cell carcinoma → [OnTomaResult(query='bladder transitional cell carcinoma', id_normalised='EFO:0006544', id_ot_schema='EFO_0006544', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0006544', label='bladder transitional cell carcinoma')]
INFO:ontoma.interface:Processed: bladder transitional cell carcinoma → [OnTomaResult(query='bladder transitional cell carcinoma', id_normalised='EFO:0006544', id_ot_schema='EFO_0006544', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0006544', label='bladder transitional cell carcinoma')]
INFO     - ontoma.interface - Processed: Neurodevelopmental disorder with dysmorphic facies and distal limb anomalie → []
INFO     - ontoma.interface - Processed: colorectal cancer cell line → []
INFO     - ontoma.interface - Processed: Colon

INFO     - ontoma.interface - Processed: thyroid cancer → [OnTomaResult(query='thyroid cancer', id_normalised='MONDO:0002108', id_ot_schema='MONDO_0002108', id_full_uri='http://purl.obolibrary.org/obo/MONDO_0002108', label='thyroid cancer')]
INFO     - ontoma.interface - Processed: ovarian carcinoma → [OnTomaResult(query='ovarian carcinoma', id_normalised='EFO:0001075', id_ot_schema='EFO_0001075', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0001075', label='ovarian carcinoma')]
INFO     - ontoma.interface - Processed: ovarian mucin

INFO     - ontoma.interface - Processed: Multiple cancers → []
INFO     - ontoma.interface - Processed: Myasthenic syndrome, congenital, 3C, associated with acetylcholine receptor → []
INFO     - ontoma.interface - Processed: Low molecular weight proteinuria with hypercalciuria and nephrocalcinosis ( → []
INFO     - ontoma.interface - Processed: Intellectual developmental disorder with speech delay, autism and dysmorphi → []
INFO     - ontoma.interface - Processed: Breast cancers → [OnTomaResult(query='Breast cancers', id_normalised='EFO:0000305', id_ot_schema='EFO_0000305', id_full_uri='http://w

INFO     - ontoma.interface - Processed: Spinal muscular atrophy, lower extremity-predominant 1, autosomal dominant → []
INFO     - ontoma.interface - Processed: Neurodevelopmental disorder with microcephaly and structural brain anomalie → []
INFO     - ontoma.interface - Processed: Ectodermal dysplasia 10B, hypohidrotic/hair/tooth type, autosomal recessive → []
INFO     - ontoma.interface - Processed: Ectodermal dysplasia 10A, hypohidrotic/hair/nail type, autosomal dominant ( → []
INFO     - ontoma.interface - Processed: Ectodermal dyspla

INFO     - ontoma.interface - Processed: Granulosa-cell tumors of the ovary → []
INFO     - ontoma.interface - Processed: Myopathy, mitochondrial progressive, with congenital cataract, hearing loss → []
INFO     - ontoma.interface - Processed: Muscular dystrophy-dystroglycanopathy congenital with mental retardation B1 → []
INFO     - ontoma.interface - Processed: acute myeloid leukemia → [OnTomaResult(query='acute myeloid leukemia', id_normalised='EFO:0000222', id_ot_schema='EFO_0000222', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0000222', label='acute myeloid leukemia')]

INFO:ontoma.interface:Processed: Mast cell leukemia → [OnTomaResult(query='Mast cell leukemia', id_normalised='EFO:0007359', id_ot_schema='EFO_0007359', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0007359', label='mast-cell leukemia')]
INFO     - ontoma.interface - Processed: germ cell tumor of the testis → [OnTomaResult(query='germ cell tumor of the testis', id_normalised='EFO:1000566', id_ot_schema='EFO_1000566', id_full_uri='http://www.ebi.ac.uk/efo/EFO_1000566', label='testicular germ cell tumor')]
INFO     - ontoma.interface - Processed: testicular tumor → [OnTomaResult(query='testicular tumor', id_normalised='MONDO:0021348', id_ot_schema='MONDO_0021348', id_full_uri='http://purl.obolibrary.org/obo/MONDO_0021348', label='ne

INFO     - ontoma.interface - Processed: Hypotonia, infantile, with psychomotor retardation and characteristic facie → []
INFO     - ontoma.interface - Processed: Congenital contractures of the limbs and face, hypotonia, and developmental → []
INFO     - ontoma.interface - Processed: Neurodevelopmental disorder with microcephaly, impaired language, and gait → []
INFO     - ontoma.interface - Processed: Breast ductal carcinoma → [OnTomaResult(query='Breast ductal carcinoma', id_normalised='EFO:0006318', id_ot_schema='EFO_0006318', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0006318', label='breast ductal adenocarcinoma')]

INFO:ontoma.interface:Processed: colon cancer → [OnTomaResult(query='colon cancer', id_normalised='MONDO:0021063', id_ot_schema='MONDO_0021063', id_full_uri='http://purl.obolibrary.org/obo/MONDO_0021063', label='malignant colon neoplasm'), OnTomaResult(query='colon cancer', id_normalised='EFO:1001950', id_ot_schema='EFO_1001950', id_full_uri='http://www.ebi.ac.uk/efo/EFO_1001950', label='colon carcinoma')]
INFO     - ontoma.interface - Processed: colorectal cancer → [OnTomaResult(query='colorectal cancer', id_normalised='EFO:0005842', id_ot_schema='EFO_0005842', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0005842', label='colorectal cancer')]
INFO     - ontoma.interface - Processed: gastric cancer → [OnTomaResult(query='gastric cancer', id_normalised='MONDO:000

INFO:ontoma.interface:Processed: Cervical cancer → [OnTomaResult(query='Cervical cancer', id_normalised='MONDO:0002974', id_ot_schema='MONDO_0002974', id_full_uri='http://purl.obolibrary.org/obo/MONDO_0002974', label='cervical cancer')]
INFO     - ontoma.interface - Processed: Cervical carcinoma → [OnTomaResult(query='Cervical carcinoma', id_normalised='EFO:0001061', id_ot_schema='EFO_0001061', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0001061', label='cervical carcinoma')]
INFO     - ontoma.interface - Processed: Ovarian carcinoma → [OnTomaResult(query='Ovarian carcinoma', id_normalised='EFO:0001075', id_ot_schema='EFO_0001075', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0001075', label='ovarian carcinoma')]

INFO:ontoma.interface:Processed: bladder carcinoma → [OnTomaResult(query='bladder carcinoma', id_normalised='EFO:0000292', id_ot_schema='EFO_0000292', id_full_uri='http://www.ebi.ac.uk/efo/EFO_0000292', label='bladder carcinoma')]
INFO     - ontoma.interface - Processed: Wilms tumor → [OnTomaResult(query='Wilms tumor', id_normalised='MONDO:0019004', id_ot_schema='MONDO_0019004', id_full_uri='http://purl.obolibrary.org/obo/MONDO_0019004', label='kidney wilms tumor'), OnTomaResult(query='Wilms tumor', id_normalised='ORDO:654', id_ot_schema='Orphanet_654', id_full_uri='http://www.orpha.net/ORDO/Orphanet_654', label='nephroblastoma')]

In [80]:


# Collect the OnToma mapping results for the unique diseases into a dataframe:
efo_mapping = pd.DataFrame(
    mapped_diseases
    .to_list()
)


# Joining humsavar data with disease mapping:
humsavar_associations_mapped = (
    associations
    
    # Merging with efo:
    .merge(efo_mapping, on='disease', how='left')
    
    # Selecting only mapped associations
    .loc[lambda df: df.diseaseMappedId.notna()]
    
    # Replace ':' with '_' in the disease id to match the identifier format of the uniprot evidence:
    .assign(diseaseMappedId = lambda df: df.diseaseMappedId.str.replace(':', '_', regex=False))
    
    # Get unique list of associations:
    [['uniprotId', 'diseaseMappedId']]
    .drop_duplicates()
    .assign(humsavar = True)
)

print(f'Number of unique associations with mapped diseases: {len(humsavar_associations_mapped)}')
print(f'Number of unique mapped diseases: {len(humsavar_associations_mapped.diseaseMappedId.unique())}')


# Convert it to spark df:
hs_mapped_df = spark.createDataFrame(humsavar_associations_mapped).persist()
hs_mapped_df.show()

Number of unique associations with mapped diseases: 1567
Number of unique mapped diseases: 93
+---------+---------------+--------+
|uniprotId|diseaseMappedId|humsavar|
+---------+---------------+--------+
|   Q6ZMQ8|    EFO_0006462|    true|
|   Q6ZMQ8|    EFO_0000571|    true|
|   O95477|    EFO_0005842|    true|
|   Q86UK0|    EFO_0002517|    true|
|   Q99758|  MONDO_0007254|    true|
|   P78363|  MONDO_0007254|    true|
|   P08183|    EFO_0005842|    true|
|   Q9NRK6|  MONDO_0007254|    true|
|   Q2M3G0|    EFO_0005842|    true|
|   Q2M3G0|    EFO_0002517|    true|
|   Q9NP58|  MONDO_0007254|    true|
|   Q9NUT2|  MONDO_0007254|    true|
|   Q9UBJ2|    EFO_0002517|    true|
|   Q7Z5M8|  MONDO_0007254|    true|
|   P00519|    EFO_0003050|    true|
|   P00519|    EFO_0000756|    true|
|   Q6H8Q1|    EFO_0002517|    true|
|   Q13085|    EFO_0005842|    true|
|   O00763|    EFO_0002517|    true|
|   P11310|  MONDO_0007254|    true|
+---------+---------------+--------+
only showing top 20 rows
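
One thing worth checking before the join: replacing `:` with `_` only yields the identifier format of the evidence files when the prefix is the same in both forms. In the OnToma log above, `Wilms tumor` comes back with `id_normalised='ORDO:654'` but `id_ot_schema='Orphanet_654'`, and the uniprot evidence uses the `Orphanet_` form. Below is a minimal sketch of a prefix-aware conversion, assuming `diseaseMappedId` is filled from `id_normalised`. The helper name `to_ot_schema` and the prefix table are illustrative and not part of OnToma.

```python
# Hypothetical helper: convert a normalised CURIE such as 'EFO:0000571'
# into the underscore-separated form used in the evidence files.
# Assumption: only the ORDO prefix needs renaming; extend the table if
# other prefixes turn out to differ.
PREFIX_OVERRIDES = {'ORDO': 'Orphanet'}


def to_ot_schema(curie):
    """Return the underscore-separated identifier for a 'PREFIX:LOCAL' CURIE."""
    prefix, sep, local_id = curie.partition(':')
    if not sep:
        # No colon found (already converted or malformed): leave untouched.
        return curie
    return f'{PREFIX_OVERRIDES.get(prefix, prefix)}_{local_id}'


for curie in ['EFO:0000571', 'MONDO:0002108', 'ORDO:654']:
    print(curie, '->', to_ot_schema(curie))
```

If the `id_ot_schema` field of the `OnTomaResult` is already carried through to the mapping table, using it directly would avoid the conversion altogether.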

Join this mapped dataset with our uniprot associations:


In [79]:
uniprot_df = (
    spark.read.json('cttv011-11-10-2021.json.gz')
    .select('diseaseFromSource', 'diseaseFromSourceId', 'confidence', 'diseaseFromSourceMappedId', 'targetFromSourceId')
    .distinct()
    .persist()
)


# Number of distinct evidence rows, then number of distinct disease/target associations:
print(uniprot_df.count())
print(uniprot_df.select('diseaseFromSourceMappedId', 'targetFromSourceId').distinct().count())


uniprot_assoc = (
    uniprot_df
    
    # Selecting and renaming relevant columns:
    .select('diseaseFromSourceMappedId', 'targetFromSourceId')
    .withColumnRenamed('targetFromSourceId', 'uniprotId')
    .withColumnRenamed('diseaseFromSourceMappedId', 'diseaseMappedId')

    # Generate associations:
    .distinct()
    
    # Annotate source:
    .withColumn('uniprot', lit(True))
    .persist()
)

uniprot_assoc.show()

7754
7396
+---------------+---------+-------+
|diseaseMappedId|uniprotId|uniprot|
+---------------+---------+-------+
|    EFO_0009070|   P55201|   true|
|  MONDO_0032910|   A1L188|   true|
|  Orphanet_3337|   Q06495|   true|
|Orphanet_169446|   P29597|   true|
| Orphanet_79351|   O43175|   true|
|  Orphanet_1934|   Q8NC96|   true|
|   Orphanet_610|   P12111|   true|
| Orphanet_95716|   P07202|   true|
| Orphanet_63454|   Q16644|   true|
|   Orphanet_154|   P49810|   true|
|Orphanet_166081|   P04275|   true|
|   Orphanet_791|   Q9NZN9|   true|
|   Orphanet_334|   Q8IWT1|   true|
|  Orphanet_1020|   P49810|   true|
|   Orphanet_822|   P16157|   true|
|           null|   Q9C0K0|   true|
|   Orphanet_740|   P02545|   true|
|Orphanet_314689|   Q13043|   true|
|  MONDO_0029134|   Q6F5E8|   true|
|    EFO_0000279|   Q8TBY8|   true|
+---------------+---------+-------+
only showing top 20 rows



In [98]:
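# Label each association by the dataset(s) it was found in.
# After the outer join a missing flag is null, which is treated as False here:
get_sources = udf(
    lambda humsavar, uniprot: 'both' if humsavar and uniprot else 'uniprot' if uniprot else 'humsavar',
    StringType()
)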

joined_df = (
    hs_mapped_df
    .join(uniprot_assoc, on = ['diseaseMappedId', 'uniprotId'], how='outer')
    .withColumn('source', get_sources(col('humsavar'), col('uniprot')))
    .persist()
)

print(
    joined_df
    .groupBy('source')
    .count()
    .toPandas()
    .to_markdown(index=False)
)

joined_df.show()

| source   |   count |
|:---------|--------:|
| humsavar |    1557 |
| both     |      10 |
| uniprot  |    7386 |
+---------------+---------+--------+-------+--------+
|diseaseMappedId|uniprotId|humsavar|uniprot|  source|
+---------------+---------+--------+-------+--------+
|           null|   Q9C0K0|    null|   true| uniprot|
|           null|   Q9NP91|    null|   true| uniprot|
|    EFO_0000279|   Q8TBY8|    null|   true| uniprot|
|    EFO_0000712|   P24723|    null|   true| uniprot|
|    EFO_0002517|   O75078|    true|   null|humsavar|
|    EFO_0003105|   Q8TEW0|    null|   true| uniprot|
|    EFO_0003105|   Q9H2D1|    null|   true| uniprot|
|    EFO_0005842|   P50749|    true|   null|humsavar|
|    EFO_0005842|   Q9Y286|    true|   null|humsavar|
|    EFO_0009070|   P55201|    null|   true| uniprot|
|  MONDO_0007254|   P15090|    true|   null|humsavar|
|  MONDO_0007254|   P24557|    true|   null|humsavar|
|  MONDO_0007254|   Q86UW6|    true|   null|humsavar|
|  MONDO_0008572|   Q

Why is the overlap so poor? Only 10 associations are found in both datasets, while 1,557 are specific to humsavar and 7,386 to uniprot.
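
A quick way to narrow this down is to check whether the two sets share disease identifiers and targets at all, independently of how they are paired. The sketch below uses plain Python sets on a handful of rows copied from the outputs above. On the real data the pairs would come from `hs_mapped_df` and `uniprot_assoc` (for example via `.toPandas()`), and the function name `overlap_breakdown` is only illustrative.

```python
def overlap_breakdown(humsavar_pairs, uniprot_pairs):
    """Compare two collections of (uniprotId, diseaseMappedId) pairs.

    Pairs with a missing disease id are dropped first, since they can
    never match on the disease key.
    """
    hs = {(t, d) for t, d in humsavar_pairs if d is not None}
    up = {(t, d) for t, d in uniprot_pairs if d is not None}

    hs_targets, hs_diseases = {t for t, _ in hs}, {d for _, d in hs}
    up_targets, up_diseases = {t for t, _ in up}, {d for _, d in up}

    return {
        'shared_associations': len(hs & up),
        'shared_targets': len(hs_targets & up_targets),
        'shared_diseases': len(hs_diseases & up_diseases),
        'humsavar_only_diseases': sorted(hs_diseases - up_diseases),
    }


# A few rows taken from the tables above, not the full datasets:
humsavar_sample = [
    ('O75078', 'EFO_0002517'),
    ('P50749', 'EFO_0005842'),
    ('P15090', 'MONDO_0007254'),
]
uniprot_sample = [
    ('Q9C0K0', None),
    ('Q8TBY8', 'EFO_0000279'),
    ('P55201', 'EFO_0009070'),
]

print(overlap_breakdown(humsavar_sample, uniprot_sample))
```

If many targets are shared but few disease identifiers are, the mismatch is more likely to lie in the disease mapping than in the target annotation.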

In [100]:
(
    humsavar_data
    .query('uniprotId == "O75078"')
    .head(50)
)

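|   Unnamed: 0 | gene   | uniprotId   | variantId   | aaChange    | severity               | rsId   | disease                                   |
|-------------:|:-------|:------------|:------------|:------------|:-----------------------|:-------|:------------------------------------------|
|         2273 | ADAM11 | O75078      | VAR_062669  | p.Ser693Arg | uncertain significance | -      | A pancreatic ductal adenocarcinoma sample |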


In [103]:
hs_all = spark.createDataFrame(humsavar_data).persist()

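# List the original humsavar disease labels for targets with humsavar-only associations.
# The join is on uniprotId alone, so every disease label of each target is returned:
(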
    joined_df
    .filter(col('source') == 'humsavar')
    .select('diseaseMappedId', 'uniprotId')
    .join(hs_all, on = 'uniprotId', how='left')
    .select('uniprotId', 'disease')
    .distinct()
    .show(truncate=False)
)

+---------+---------------------------------------------------------------------------+
|uniprotId|disease                                                                    |
+---------+---------------------------------------------------------------------------+
|O75643   |A colorectal cancer sample                                                 |
|P35222   |Hepatocellular carcinoma                                                   |
|P35222   |Neurodevelopmental disorder with spastic diplegia and visual defects (NEDSD|
|P51812   |A breast cancer sample                                                     |
|P51812   |A gastric adenocarcinoma sample                                            |
|P51812   |A glioblastoma multiforme sample                                           |
|Q00341   |A breast cancer sample                                                     |
|Q15477   |A breast cancer sample                                                     |
|Q15477   |A colorectal cancer s
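
The disease column above mixes two kinds of labels: sample annotations of the form `A ... sample`, and named diseases. Several of the named diseases, here and in the OnToma log, are exactly 75 characters long and end mid-word, which suggests they were cut off. Below is a small sketch for bucketing the labels, assuming that pattern and that width hold across the whole file. Both are observations from the output in this notebook, not documented properties of humsavar, and the function name `label_kind` is illustrative.

```python
import re
from collections import Counter

# Assumption: sample annotations look like 'A breast cancer sample'.
SAMPLE_PATTERN = re.compile(r'^An? .+ sample$')
# Assumption: labels were cut at 75 characters when the file was parsed.
MAX_LABEL_WIDTH = 75


def label_kind(label):
    """Bucket a humsavar disease label into a coarse category."""
    if SAMPLE_PATTERN.match(label):
        return 'sample annotation'
    if len(label) >= MAX_LABEL_WIDTH:
        return 'possibly truncated'
    return 'named disease'


# Labels copied from the table above:
labels = [
    'A colorectal cancer sample',
    'Hepatocellular carcinoma',
    'Neurodevelopmental disorder with spastic diplegia and visual defects (NEDSD',
    'A breast cancer sample',
    'A gastric adenocarcinoma sample',
    'A glioblastoma multiforme sample',
]

print(Counter(label_kind(label) for label in labels))
```

Applied to the full `disease` column, the counts per bucket would show how much of the humsavar-only set comes from sample annotations and how many labels may have failed to map because they were cut short.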

In [97]:
(
    hs_mapped_df
    .filter(col('uniprotId') == 'Q8N2K0')
    .show()
)

+---------+---------------+--------+
|uniprotId|diseaseMappedId|humsavar|
+---------+---------------+--------+
+---------+---------------+--------+

