# Intact network dump

While working with the network data exported by Intact (OTAR044), we identified a series of issues. They can be summarized as follows:

* Explosion of evidence: there are many evidence objects where the interaction id, the interactors and the biological roles are all the same. So, what is different?
* For many evidence objects, the participant detection method is not properly mapped.
* Some evidence objects are not unique: every value is the same.

**Workflow:**

1. Fetch the network data from the FTP server.
2. Build a table with all fields, one row for each evidence + interactors + biological roles.
3. Find redundancy and irregularities.
4. Report the findings back to Intact.

## Fetching data from ftp

In [2]:
%%bash

TARGETDIR=/Users/dsuveges/repositories/random_notebooks/2020.09.15_checking_intact_data
curl -s ftp://ftp.ebi.ac.uk/pub/databases/intact/various/ot_graphdb/current/data/interactor_pair_interactions.json \
    | gzip > ${TARGETDIR}/interactor_pair_interactions.json.gz

gzcat ${TARGETDIR}/interactor_pair_interactions.json.gz | wc -l 

gzcat ${TARGETDIR}/interactor_pair_interactions.json.gz | head -n1 | jq

  391905
{
  "interactorA": {
    "organism": {
      "taxon_id": 727,
      "mnemonic": "haeif",
      "scientific_name": "Haemophilus influenzae"
    },
    "id_source": "uniprotkb",
    "id": "A0A024A2C9",
    "biological_role": "unspecified role"
  },
  "interactorB": {
    "organism": {
      "taxon_id": 9606,
      "mnemonic": "human",
      "scientific_name": "Homo sapiens"
    },
    "id_source": "uniprotkb",
    "id": "P08603-2",
    "biological_role": "unspecified role"
  },
  "source_info": {
    "database_version": "234",
    "source_database": "intact"
  },
  "interaction": {
    "causal_interaction": null,
    "evidence": [
      {
        "host_organism_tax_id": -1,
        "participant_detection_method_mi_identifier_B": "MI:0421",
        "interaction_type_mi_identifier": "MI:0407",
        "participant_detection_method_mi_identifier_A": "MI:0421",
        "expansion_method_short_name": null,
        "host_organism_scientific_name": "In vitro",
        "participant_dete

So there are ~390k records in the dataset, one interactor pair per line.

## Parse data and build table

In [83]:
import json
import pandas as pd
import gzip

source_file = 'interactor_pair_interactions.json.gz'

# Reading file:
rows = []
with gzip.open(source_file) as f:
    for line in f:
        row = json.loads(line)
        
        for evidence in row['interaction']['evidence']:
            try:
                evidence.update({
                    'interactor_A': row['interactorA']['id'],
                    'biological_role_A ': row['interactorA']["biological_role"],
                    'interactor_B': row['interactorB']['id'],
                    'biological_role_B ': row['interactorB']["biological_role"],
                    'source': row['source_info']['source_database']
                })
            except (KeyError, TypeError):
                # Some interactor B objects are missing. Known issue, it's fine.
                evidence.update({
                    'interactor_A': row['interactorA']['id'],
                    'biological_role_A ': row['interactorA']["biological_role"],
                    'interactor_B': None,
                    'biological_role_B ': None,
                    'source': row['source_info']['source_database']
                })
            rows.append(evidence)

# Compile into pandas dataframe:
evidence_df = pd.DataFrame(rows)
print(len(evidence_df))

evidence_df.head()

879631


Unnamed: 0,host_organism_tax_id,participant_detection_method_mi_identifier_B,interaction_type_mi_identifier,participant_detection_method_mi_identifier_A,expansion_method_short_name,host_organism_scientific_name,participant_detection_method_short_name_A,participant_detection_method_short_name_B,pubmed_id,interaction_detection_method_mi_identifier,interaction_detection_method_short_name,expansion_method_mi_identifier,interaction_identifier,interaction_type_short_name,interactor_A,biological_role_A,interactor_B,biological_role_B,source
0,-1,MI:0421,MI:0407,MI:0421,,In vitro,antibody detection,antibody detection,24835392,MI:0411,elisa,,EBI-12684777,direct interaction,A0A024A2C9,unspecified role,P08603-2,unspecified role,intact
1,4932,MI:0078,MI:0915,MI:0078,,Saccharomyces cerevisiae (Baker's yeast),nucleotide sequence,nucleotide sequence,32296183,MI:1356,validated two hybrid,,EBI-24521810,physical association,A0A024R0L9,unspecified role,Q93062-3,unspecified role,intact
2,4932,MI:0078,MI:0915,MI:0078,,Saccharomyces cerevisiae (Baker's yeast),nucleotide sequence,nucleotide sequence,32296183,MI:0397,two hybrid array,,EBI-23426250,physical association,A0A024R0L9,unspecified role,Q93062-3,unspecified role,intact
3,4932,MI:0078,MI:0915,MI:0078,,Saccharomyces cerevisiae (Baker's yeast),nucleotide sequence,nucleotide sequence,32296183,MI:1112,two hybrid prey pooling approach,,EBI-23201216,physical association,A0A024R0L9,unspecified role,Q93062-3,unspecified role,intact
4,9606,MI:0102,MI:0915,MI:0102,,Homo sapiens transformed primary embryonal kid...,sequence tag,sequence tag,17353931,MI:0006,anti bait coip,,EBI-1081478,physical association,A0A024R493,unspecified role,Q07283,unspecified role,intact
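
The flattening step above could also be written as a small helper, which avoids duplicating the update logic for the missing interactor B case. This is a minimal sketch, not the code that produced the table: `flatten_pair` and the toy record are hypothetical, and the column names are written without the trailing space that the keys in this notebook carry, so it is not a drop-in replacement for the later cells.

```python
import pandas as pd

def flatten_pair(row):
    """Return one flat dict per evidence of an interactor-pair record."""
    # interactorB may be absent or null in the export, so fall back to {}.
    interactor_a = row.get('interactorA') or {}
    interactor_b = row.get('interactorB') or {}
    shared = {
        'interactor_A': interactor_a.get('id'),
        'biological_role_A': interactor_a.get('biological_role'),
        'interactor_B': interactor_b.get('id'),
        'biological_role_B': interactor_b.get('biological_role'),
        'source': row['source_info']['source_database'],
    }
    # Copy each evidence dict instead of mutating the parsed record in place.
    return [{**evidence, **shared} for evidence in row['interaction']['evidence']]

# Hypothetical toy record with a missing interactor B:
toy_row = {
    'interactorA': {'id': 'P12345', 'biological_role': 'unspecified role'},
    'interactorB': None,
    'source_info': {'source_database': 'intact'},
    'interaction': {'evidence': [{'pubmed_id': '1'}, {'pubmed_id': '2'}]},
}
toy_df = pd.DataFrame(flatten_pair(toy_row))
```

Using `.get` with an empty-dict fallback makes the missing-interactor case explicit, so no exception handling is needed for it.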


Before digging deeper, let's see some numbers. I am mostly interested in the uniqueness of the data.

In [172]:
interaction_ids = evidence_df.interaction_identifier.value_counts().sort_values(ascending=False)
unique_interactions = evidence_df.drop_duplicates()
unique_major_values = evidence_df[['interactor_A', 'biological_role_A ','source',
       'interactor_B', 'biological_role_B ','pubmed_id', 'interaction_identifier']].drop_duplicates()

print(f'Number of evidence: {len(evidence_df)}')
print(f'Number of unique interaction ids: {len(interaction_ids)}')
print(f'Number of completely unique properties:{len(unique_interactions)}')
print(f'Number of unique interactions keeping major values fixed: {len(unique_major_values)}')



Number of evidence: 879631
Number of unique interaction ids: 427822
Number of completely unique properties:875688
Number of unique interactions keeping major values fixed: 802900


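* ~4000 evidence objects (879,631 - 875,688 = 3,943) are completely redundant: every field is identical.
* ~10% of the evidence (879,631 - 802,900 = 76,731 rows) is somewhat redundant: the major values are identical and there might be differences only in the participant detection methods, but that doesn't really make sense.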
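
To see where this partial redundancy concentrates, one could count the evidence rows per combination of key fields. The sketch below is an illustration under assumptions: `group_sizes`, `MAJOR_KEYS` and the toy dataframe are hypothetical, and the key list is a reduced version of the major values used above. The `is_square` flag relates to the observation in the email further down that the counts are very often square numbers. It needs pandas 1.1+ for `dropna=False` and Python 3.8+ for `math.isqrt`.

```python
import math
import pandas as pd

# Assumed subset of the "major" columns; adjust to the real column names.
MAJOR_KEYS = ['interactor_A', 'interactor_B', 'pubmed_id', 'interaction_identifier']

def group_sizes(df, keys=MAJOR_KEYS):
    """Count evidence rows per combination of the key columns."""
    sizes = (
        df.groupby(keys, dropna=False)  # keep rows where interactor_B is missing
        .size()
        .rename('evidence_count')
        .reset_index()
    )
    # Flag counts that are perfect squares greater than one (4, 9, 16, ...).
    sizes['is_square'] = sizes['evidence_count'].map(
        lambda n: n > 1 and math.isqrt(n) ** 2 == n
    )
    return sizes.sort_values('evidence_count', ascending=False, ignore_index=True)

# Toy data: one pair with 2 x 2 detection-method combinations, one clean pair.
toy = pd.DataFrame({
    'interactor_A': ['Q1'] * 4 + ['Q2'],
    'interactor_B': ['Q3'] * 4 + ['Q4'],
    'pubmed_id': ['100'] * 4 + ['200'],
    'interaction_identifier': ['EBI-1'] * 4 + ['EBI-2'],
    'participant_detection_method_short_name_A': ['a', 'a', 'b', 'b', 'a'],
    'participant_detection_method_short_name_B': ['a', 'b', 'a', 'b', 'a'],
})
sizes = group_sizes(toy)
```

On the real table, `group_sizes(evidence_df)` would list the most inflated interactions first, which should make it easier to pick representative examples.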

## Preparing example data for Intact

To help with troubleshooting, we are preparing a representative dataset so that the Intact team gets a better idea of the issue.

In [None]:
%%bash

gzcat interactor_pair_interactions.json.gz | grep EBI-21454169 | grep Q96RK4 | grep Q8N3I7 > example.json

In [175]:
with open('example.json') as f:
    data = json.load(f)
    

# Load all the evidence into a pandas dataframe:
parsed_df = pd.DataFrame(data['interaction']['evidence'])
parsed_df_filtered = parsed_df.loc[parsed_df.pubmed_id == '19081074']

print(f'Number of evidence for this interactor pair: {len(parsed_df)}')
parsed_df_filtered.head()

Number of evidence for this interactor pair: 22


Unnamed: 0,host_organism_tax_id,participant_detection_method_mi_identifier_B,interaction_type_mi_identifier,participant_detection_method_mi_identifier_A,expansion_method_short_name,host_organism_scientific_name,participant_detection_method_short_name_A,participant_detection_method_short_name_B,pubmed_id,interaction_detection_method_mi_identifier,interaction_detection_method_short_name,expansion_method_mi_identifier,interaction_identifier,interaction_type_short_name
2,9606,MI:0661,MI:0914,MI:0661,spoke expansion,Human retinal pigment epithelium cell line,experimental particp,experimental particp,19081074,MI:0676,tap,MI:1060,EBI-21454169,association
3,9606,MI:0661,MI:0914,MI:0661,spoke expansion,Human retinal pigment epithelium cell line,experimental particp,sequence tag,19081074,MI:0676,tap,MI:1060,EBI-21454169,association
4,9606,MI:0661,MI:0914,MI:0661,spoke expansion,Human retinal pigment epithelium cell line,experimental particp,weight autoradiogra,19081074,MI:0676,tap,MI:1060,EBI-21454169,association
5,9606,MI:0661,MI:0914,MI:0661,spoke expansion,Human retinal pigment epithelium cell line,experimental particp,weight silver stain,19081074,MI:0676,tap,MI:1060,EBI-21454169,association
6,9606,MI:0102,MI:0914,MI:0102,spoke expansion,Human retinal pigment epithelium cell line,sequence tag,experimental particp,19081074,MI:0676,tap,MI:1060,EBI-21454169,association


In [167]:
(
    parsed_df_filtered[['participant_detection_method_mi_identifier_A','participant_detection_method_short_name_A','participant_detection_method_mi_identifier_B','participant_detection_method_short_name_B']]
    .rename(columns={
        'participant_detection_method_mi_identifier_A': 'mi_identifier_A',
        'participant_detection_method_short_name_A' : 'short_name_A',
        'participant_detection_method_mi_identifier_B': 'mi_identifier_B',
        'participant_detection_method_short_name_B': 'short_name_B'
    })
    .to_csv('participant_detection_methods.tsv', sep='\t', index=False)
)
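
Whether the MI identifiers and the short names correspond 1:1 can be checked by counting the distinct labels per identifier. This is a minimal sketch: `labels_per_identifier` is a hypothetical helper, and the toy rows reuse values for interactor B from the table above.

```python
import pandas as pd

def labels_per_identifier(df, id_col, label_col):
    """Return identifiers that are paired with more than one distinct label."""
    counts = df.groupby(id_col)[label_col].nunique()
    return counts[counts > 1]

# Toy rows mimicking the pattern seen for EBI-21454169 (interactor B columns):
toy = pd.DataFrame({
    'mi_identifier_B': ['MI:0661', 'MI:0661', 'MI:0661', 'MI:0102'],
    'short_name_B': [
        'experimental particp',
        'sequence tag',
        'weight silver stain',
        'experimental particp',
    ],
})
ambiguous = labels_per_identifier(toy, 'mi_identifier_B', 'short_name_B')
```

An empty result would mean that each identifier maps to a single label. Swapping the two column arguments checks the reverse direction, that is, labels attached to more than one identifier.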



### Email sent to intact:

```
Hi Guys,
 
 
We are getting the backend pipeline ready to process the exported JSON file and getting the endpoints ready for the frontend. In this process we have noticed something about the data, which might be of a concern. We very frequently see multiple evidence objects for the same interactor pairs with the same biological role, interaction identifier and pubmed id. Very frequently the number is a square number.
 
In the attached example json file. You’ll see evidence for the interaction between Q96RK4 and  Q8N3I7 [1]. Out of the 22 evidence 16 has the same interaction id: EBI-21454169. When taking a closer look at these evidence, we saw an interesting pattern: there are four different participant detection methods for both interactors. But while the mi_identifier for interactor A and B are always the same, the participant detection short names are different.  So the correspondence between the mi identifier and labels are not 1:1 (participant_detection_methods.tsv). This suggests us that there might be a mapping issue between the mi_id and label in your pipeline maybe. Taking a closer look, we think ~10% of the exported evidence affected. We think this issue might be related to the ~4000 identical evidence we see in the data.
 
We are wondering if you could take a look at this issue and let us know if this is indeed a problem, or we are interpreting the case wrongly. Anyway, on our side it causes some problem, and we would like to know if this inflation of evidence could be reduced somehow. If it’s a data issue, how complicated do you think it would be to fix it for us? Anyway, we are more than happy to schedule a meeting for discussion.
 
Have a wonderful day!
 
 
Best,
Daniel
 
 
 
[1]  zcat interactor_pair_interactions.json.gz | grep EBI-21454169 | grep Q96RK4 | grep Q8N3I7 > example.json
```