## Testing network data - Release date: 2020.10.06

**Conclusions:**
* 408k rows in the JSON file.
* 746k evidence records.
* 169 associations do not have an interactor B.
* 406k unique interactions, so ~1.8k rows are duplicates (408,128 - 406,311 = 1,817).
* The duplication is caused by different organisms being annotated to the same UniProt id. (That is the only reason for the duplications, and all affected interactors are non-human.)
* 24 different biological roles, forming 87 pairwise combinations.
* For 95% of the interactions no biological role is specified for the interactors. 66 of the 87 pairs are represented by 10 or fewer interactions.







## Workflow:

In [20]:
%%bash 


curl -s ftp://ftp.ebi.ac.uk/pub/databases/intact/various/ot_graphdb/2020-10-05/data/interactor_pair_interactions.json \
    | gzip > /Users/dsuveges/project/evidences/2020.10.06.interactor_pair_interactions.json.gz
    
gzcat /Users/dsuveges/project/evidences/2020.10.06.interactor_pair_interactions.json.gz | wc -l

  408127


In [101]:
import json
import pandas as pd
import gzip

intact_file = '/Users/dsuveges/project/evidences/2020.10.06.interactor_pair_interactions.json.gz'

parsed_interaction_data = []
parsed_evidence_data = []

# Open the file and read it line by line, extract the relevant fields, then build the dataframes
with gzip.open(intact_file) as f:
    for row in f:
        data = json.loads(row)
        
        # Fields describing interactor A and the interaction itself:
        interaction_data = {
            'int_A_id': data['interactorA']['id'],
            'int_A_source': data['interactorA']["id_source"],
            'int_A_organism': data['interactorA']["organism"]['mnemonic'],
            'int_A_biological_role': data['interactorA']["biological_role"],
            'source': data['source_info']['source_database'],
            'causal': data['interaction']['causal_interaction']
        }
        
        try:
            interaction_data.update({
                'int_B_id': data['interactorB']['id'],
                'int_B_source': data['interactorB']["id_source"],
                'int_B_organism': data['interactorB']["organism"]['mnemonic'],
                'int_B_biological_role': data['interactorB']["biological_role"],
            })
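        except (KeyError, TypeError):
            # Interactor B is missing (absent key or null value): fill its fields with None
            interaction_data.update({
                'int_B_id': None,
                'int_B_source': None,
                'int_B_organism': None,
                'int_B_biological_role': None
            })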
            
        # Adding interaction to list:
        parsed_interaction_data.append(interaction_data)
        
        # Extract evidence data:
        for evidence in data['interaction']['evidence']:
            evidence_data = {
                'pmid': evidence['pubmed_id'],
                'interaction_type': evidence['interaction_type_short_name'],
                'interaction_id': evidence['interaction_identifier'],
                'interaction_detection_method': f"{evidence['interaction_detection_method_short_name']} ({evidence['interaction_detection_method_mi_identifier']})"
            }
            
            # Adding participant detection methods (only when a list is provided):
            if isinstance(evidence['participant_detection_method_A'], list):
                evidence_data['participant_detection_method_A'] = [f'{x["short_name"]} ({x["mi_identifier"]})' for x in evidence['participant_detection_method_A']]
            if isinstance(evidence['participant_detection_method_B'], list):
                evidence_data['participant_detection_method_B'] = [f'{x["short_name"]} ({x["mi_identifier"]})' for x in evidence['participant_detection_method_B']]
            
            # Each evidence row also carries the fields of its parent interaction:
            evidence_data.update(interaction_data)
            parsed_evidence_data.append(evidence_data)
            
intact_interaction_df = pd.DataFrame(parsed_interaction_data)
intact_evidence_df = pd.DataFrame(parsed_evidence_data)

# These fields define a unique interaction:
uniqueness_columns  = ["int_A_id",
                       "int_B_id",
                       "int_A_biological_role",
                       "int_B_biological_role",
                       "source",
                       "causal"]

print(f'Number of evidence: {len(intact_evidence_df)}')
print(f'Number of interactions: {len(intact_interaction_df)}')
print(f'Number of unique interactions: {len(intact_interaction_df[uniqueness_columns].drop_duplicates())}')


Number of evidence: 746474
Number of interactions: 408128
Number of unique interactions: 406311
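
The parser sees 408128 records, while `wc -l` reported 408127 lines. One explanation consistent with these numbers, though not checked against the file itself, is that the final record has no trailing newline: `wc -l` counts newline characters, whereas iterating over a file also yields a last, unterminated line. A minimal sketch on in-memory toy data (the two records are made up):

```python
import gzip
import io

# Two toy JSON lines; the last one deliberately lacks a trailing newline
raw = b'{"id": 1}\n{"id": 2}'

# Compress in memory to mimic reading a .json.gz file
buffer = io.BytesIO()
with gzip.GzipFile(fileobj=buffer, mode='wb') as gz:
    gz.write(raw)
buffer.seek(0)

with gzip.GzipFile(fileobj=buffer, mode='rb') as gz:
    rows_seen = sum(1 for _ in gz)   # what a `for row in f` loop iterates over

newline_count = raw.count(b'\n')     # what `wc -l` counts

print(rows_seen, newline_count)
```

If this is the cause, the two counts describe the same set of records and nothing is lost in the download.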


How many of the interactions have a missing interactor B?

In [104]:
print(f'Number of associations with missing B interactor: {len(intact_interaction_df.loc[intact_interaction_df.int_B_id.isna()])}')
print(f'Number of associations non-null direction: {len(intact_interaction_df.loc[~intact_interaction_df.causal.isna()])}')
print(f'Number of homomeric associations: {len(intact_interaction_df.loc[intact_interaction_df.int_B_id == intact_interaction_df.int_A_id])}')


Number of associations with missing B interactor: 169
Number of associations non-null direction: 0
Number of homomeric associations: 7405
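
The same three counts can also be obtained by summing boolean masks, which avoids building intermediate dataframes. A minimal sketch on a made-up toy frame (the identifiers are illustrative; the column names follow the cell above):

```python
import pandas as pd

# Toy stand-in for intact_interaction_df
toy_df = pd.DataFrame({
    'int_A_id': ['P1', 'P2', 'P3', 'P4'],
    'int_B_id': ['P1', None, 'P9', 'P4'],
    'causal': [None, None, None, None],
})

# True counts as 1 when summed, so each sum is a row count
missing_b = toy_df.int_B_id.isna().sum()
with_direction = toy_df.causal.notna().sum()
homomeric = (toy_df.int_A_id == toy_df.int_B_id).sum()

print(missing_b, with_direction, homomeric)
```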


### Duplication

There is another source of duplication. Apparently, the organism annotation of an interactor can be ambiguous, which multiplies the interaction objects for the same pair of ids.

In [55]:
columns  = ["int_A_id",
           "int_B_id",
           "int_A_organism",
           "int_B_organism"]
intact_interaction_df.loc[
    (intact_interaction_df.int_A_id == 'Q99IB8') &
    (intact_interaction_df.int_B_id == 'Q99IB8'),
    columns
]

| | int_A_id | int_B_id | int_A_organism | int_B_organism |
|---|---|---|---|---|
| 406992 | Q99IB8 | Q99IB8 | Hepatitis C virus genotype 2a | Hepatitis C virus genotype 2a |
| 406993 | Q99IB8 | Q99IB8 | Hepatitis C virus genotype 2a | hcvjf |
| 406994 | Q99IB8 | Q99IB8 | hcvjf | Hepatitis C virus genotype 2a |
| 406995 | Q99IB8 | Q99IB8 | hcvjf | hcvjf |
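
To see how widespread this is, one could list every interactor id that carries more than one organism annotation. A sketch on toy rows modelled on the output above (the second id, `P12345`, is made up):

```python
import pandas as pd

toy_df = pd.DataFrame({
    'int_A_id': ['Q99IB8', 'Q99IB8', 'P12345'],
    'int_A_organism': ['Hepatitis C virus genotype 2a', 'hcvjf', 'human'],
})

# Count the distinct organism labels attached to each interactor id
organisms_per_id = toy_df.groupby('int_A_id')['int_A_organism'].nunique()

# Ids with more than one label are the ambiguous ones
ambiguous_ids = organisms_per_id[organisms_per_id > 1].index.tolist()
print(ambiguous_ids)
```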


In [53]:
from tqdm import tqdm

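# Loop over every row flagged as a duplicate and collect all rows sharing its key fields.
# 'causal' is left out of the comparison: it is null everywhere, and null == null is False.
for index, row in tqdm(intact_interaction_df.loc[intact_interaction_df[uniqueness_columns].duplicated()].iterrows()):
    m = intact_interaction_df.loc[
        (intact_interaction_df.int_A_id == row['int_A_id']) &
        (intact_interaction_df.int_B_id == row['int_B_id']) &
        (intact_interaction_df.int_A_biological_role == row['int_A_biological_role']) &
        (intact_interaction_df.int_B_biological_role == row['int_B_biological_role']) &
        (intact_interaction_df.source == row['source'])
    ]
    
    # Print the group only if the organism labels do NOT vary, i.e. the duplication is not
    # explained by the organism annotation (nunique counts distinct labels; len would count rows):
    if (m.int_A_organism.nunique() == 1) and (m.int_B_organism.nunique() == 1):
        print(m)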

1817it [07:27,  4.06it/s]


In [110]:
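# Same check per group of duplicated rows. dropna=False keeps the groups although 'causal' is always null
# (groupby drops null keys by default); nunique counts the distinct organism labels within a group.
duplicated_rows = intact_interaction_df.loc[intact_interaction_df[uniqueness_columns].duplicated(keep=False)]
for fields, group in tqdm(duplicated_rows.groupby(uniqueness_columns, dropna=False)):
    if (group.int_A_organism.nunique() == 1) and (group.int_B_organism.nunique() == 1):
        print(group)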


  0%|          | 0/1625 [00:00<?, ?it/s]


That's good: nothing was printed, which indicates that the duplications are due exclusively to the organism annotation.
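
The same check can be written without an explicit loop. The sketch below, on made-up toy rows with a shortened key, counts the distinct organism pairs within each group of rows sharing the key; a group with a single organism pair would be a duplicate that the organism annotation cannot explain. `dropna=False` (pandas 1.1 or later) is assumed to be needed here because `causal` is null everywhere and `groupby` discards null keys by default.

```python
import pandas as pd

key_columns = ['int_A_id', 'int_B_id', 'causal']  # shortened key, for illustration only
toy_df = pd.DataFrame({
    'int_A_id': ['X1', 'X1', 'X2', 'X2'],
    'int_B_id': ['Y1', 'Y1', 'Y2', 'Y2'],
    'causal': [None, None, None, None],
    'int_A_organism': ['9infa', 'i34a1', 'human', 'human'],
    'int_B_organism': ['human', 'human', 'human', 'human'],
})

# Rows whose key occurs more than once
duplicated = toy_df.loc[toy_df[key_columns].duplicated(keep=False)]

# Number of distinct (organism A, organism B) pairs per key
organism_pairs = (
    duplicated
    .drop_duplicates(key_columns + ['int_A_organism', 'int_B_organism'])
    .groupby(key_columns, dropna=False)
    .size()
)

# Keys duplicated with identical organism labels: not explained by organism annotation
unexplained = organism_pairs[organism_pairs == 1]
print(unexplained)
```

In the toy data the `X1`/`Y1` pair is explained by two organism labels, while `X2`/`Y2` is an exact duplicate and is reported.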


Extract a sample set of the json lines with duplicated entries:

In [114]:
%%bash

# Extract the 6 rows (lines 704-709) that hold duplicated entries:
gzcat /Users/dsuveges/project/evidences/2020.10.06.interactor_pair_interactions.json.gz \
    | perl -lane 'print $_ if $. >= 704 and $. <= 709' \
    | head > duplicated_entries.json

wc -l duplicated_entries.json

head -n2 duplicated_entries.json | jq '.interactorB.organism'

       6 duplicated_entries.json
{
  "taxon_id": 11320,
  "mnemonic": "9infa",
  "scientific_name": "Influenza A virus"
}
{
  "taxon_id": 211044,
  "mnemonic": "i34a1",
  "scientific_name": "Influenza A virus (strain A/Puerto Rico/8/1934 H1N1)"
}
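
For readers without `perl` and `jq`, the same slice-and-inspect step could be sketched in Python with `itertools.islice`. The records below are toy stand-ins and the line range is shortened; on the real file the equivalent call would be `islice(f, 703, 709)` for lines 704 to 709.

```python
import io
import json
from itertools import islice

# Toy JSON-lines stream standing in for the gzipped file
lines = [json.dumps({'interactorB': {'organism': {'mnemonic': f'org{i}'}}}) for i in range(1, 11)]
stream = io.StringIO('\n'.join(lines) + '\n')

# Lines 4..6 (1-based, inclusive) correspond to islice(start=3, stop=6)
selected = [json.loads(line) for line in islice(stream, 3, 6)]

# Equivalent of jq '.interactorB.organism', reduced to the mnemonic
mnemonics = [record['interactorB']['organism']['mnemonic'] for record in selected]
print(mnemonics)
```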


### Exploring biological roles


* Filter out rows where interactor B is null.
* Select the biological roles of both partners.


In [62]:
biological_roles = (
    intact_interaction_df
    .loc[~intact_interaction_df.int_B_biological_role.isna(),['int_A_biological_role','int_B_biological_role']]
    .value_counts()
)

print(len(biological_roles))

87


In [87]:
biol_roles = []
    
for role in biological_roles.index:
    biol_roles.append(role[0])
    biol_roles.append(role[1])
    
print(set(biol_roles))
print(len(set(biol_roles)))
biological_roles.loc[biological_roles <= 10]



{'phosphate donor', 'acceptor', 'proton acceptor', 'self', 'proton donor', 'inhibitor', 'phosphate acceptor', 'enzyme', 'putative self', 'stimulator', 'electron donor', 'electron acceptor', 'regulator target', 'photon donor', 'biological role', 'unspecified role', 'competitor', 'regulator', 'enzyme target', 'ancillary', 'cofactor', 'enzyme regulator', 'donor', 'photon acceptor'}
24


int_A_biological_role  int_B_biological_role
stimulator             unspecified role         10
phosphate donor        phosphate acceptor       10
enzyme                 competitor               10
putative self          putative self             9
unspecified role       putative self             9
                                                ..
enzyme                 phosphate acceptor        1
regulator              regulator                 1
                       unspecified role          1
phosphate donor        phosphate donor           1
phosphate acceptor     enzyme                    1
Length: 66, dtype: int64
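
The share of interactions without an informative role pair can be read off the same counts with `normalize=True`. A sketch on made-up toy data; treating `unspecified role` on both sides as "no role specified" is an assumption about how the 95% figure in the conclusions was derived.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'int_A_biological_role': ['unspecified role'] * 3 + ['enzyme'],
    'int_B_biological_role': ['unspecified role'] * 3 + ['enzyme target'],
})

role_columns = ['int_A_biological_role', 'int_B_biological_role']

# Relative frequency of every (role A, role B) pair
role_share = toy_df[role_columns].value_counts(normalize=True)
unspecified_share = role_share[('unspecified role', 'unspecified role')]

# All distinct roles seen on either side of an interaction
all_roles = set(toy_df[role_columns].to_numpy().ravel())

print(unspecified_share, len(all_roles))
```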

### Exploring interaction detection methods


In [118]:
det_methods = intact_evidence_df.interaction_detection_method.unique()

det_methods.sort()
print(f'Number of methods: {len(det_methods)}')
# Split on the same separator in both cases so the labels carry no trailing space
print(f'Number of unique method labels: {len(set([x.rsplit(" (MI", 1)[0] for x in det_methods]))}')
print(f'Number of unique method ids: {len(set([x.rsplit(" (MI", 1)[1] for x in det_methods]))}')



Number of methods: 214
Number of unique method labels: 214
Number of unique method ids: 214
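
Splitting the combined string back into label and identifier is a little more robust with a regular expression than with `split`. A sketch using `Series.str.extract`; the three example strings are taken from the evidence table in this notebook, and the group names `label` and `mi_id` are chosen here for illustration.

```python
import pandas as pd

methods = pd.Series([
    'elisa (MI:0411)',
    'two hybrid array (MI:0397)',
    'anti bait coip (MI:0006)',
])

# Named groups become the column names of the result
parsed = methods.str.extract(r'^(?P<label>.+) \((?P<mi_id>MI:\d+)\)$')

print(parsed['label'].nunique(), parsed['mi_id'].nunique())
```

Equal counts of labels and ids, as in the output above, suggest a one-to-one mapping between method names and MI identifiers.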


In [128]:
print(len(intact_evidence_df))
len(intact_evidence_df.drop(['participant_detection_method_A','participant_detection_method_B'], axis=1).drop_duplicates())

746474


746474

In [127]:
intact_evidence_df['participant_detection_method_A_json'] = intact_evidence_df['participant_detection_method_A'].apply(json.dumps)
intact_evidence_df['participant_detection_method_B_json'] = intact_evidence_df['participant_detection_method_B'].apply(json.dumps)
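
Serialising the list-valued columns is presumably needed because Python lists are unhashable, so `drop_duplicates` cannot handle them directly. A sketch of the problem and the workaround on toy data (the values are made up):

```python
import json
import pandas as pd

toy_df = pd.DataFrame({
    'interaction_id': ['EBI-1', 'EBI-1'],
    'participant_detection_method_A': [['a (MI:1)'], ['a (MI:1)']],
})

try:
    toy_df.drop_duplicates()
    list_columns_hashable = True
except TypeError:
    # Lists cannot be hashed, so deduplication on the raw column fails
    list_columns_hashable = False

# Serialise the lists to JSON strings, which are hashable and comparable
toy_df['participant_detection_method_A_json'] = toy_df['participant_detection_method_A'].apply(json.dumps)
deduplicated = toy_df.drop(columns='participant_detection_method_A').drop_duplicates()

print(list_columns_hashable, len(deduplicated))
```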




In [124]:
intact_evidence_df.drop([], axis=1).head()

| | pmid | interaction_type | interaction_id | interaction_detection_method | int_A_id | int_A_source | int_A_organism | int_A_biological_role | source | causal | int_B_id | int_B_source | int_B_organism | int_B_biological_role |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24835392 | direct interaction | EBI-12684777 | elisa (MI:0411) | A0A024A2C9 | uniprotkb | haeif | unspecified role | intact | | P08603-2 | uniprotkb | human | unspecified role |
| 1 | 32296183 | physical association | EBI-24521810 | validated two hybrid (MI:1356) | A0A024R0L9 | uniprotkb | human | unspecified role | intact | | Q93062-3 | uniprotkb | human | unspecified role |
| 2 | 32296183 | physical association | EBI-23426250 | two hybrid array (MI:0397) | A0A024R0L9 | uniprotkb | human | unspecified role | intact | | Q93062-3 | uniprotkb | human | unspecified role |
| 3 | 32296183 | physical association | EBI-23201216 | two hybrid prey pooling approach (MI:1112) | A0A024R0L9 | uniprotkb | human | unspecified role | intact | | Q93062-3 | uniprotkb | human | unspecified role |
| 4 | 17353931 | physical association | EBI-1081478 | anti bait coip (MI:0006) | A0A024R493 | uniprotkb | human | unspecified role | intact | | Q07283 | uniprotkb | human | unspecified role |
