In this notebook we'll process the networks inside the folder with the same name: we'll add headers, retrieve the coordinates of the network nodes, and merge them with the metadata we obtained from InterProScan (GO terms and functional annotation). In the end, we'll have a very informative and easy-to-use table for Tableau.

We'll be using bash and Python to accomplish this. Let's import the Python libraries.

In [23]:
import pandas as pd
import numpy as np

## Preprocessing

Now it's time to start. First, we'll add headers to the networks. Each column describes the following:

1. The id of this interaction
2. Transcription Factor TF
3. Target TG
4. Status of the interaction (is it known or is it new?)
5. Organism where that interaction was mapped
6. Id of that interaction in the organism where it was mapped
7. Total number of organisms that interaction came from
8. Forward or reverse strand (stored in the `string` column)
9. Transcription Unit id of the target

To accomplish this, we'll use bash to iterate over the files and prepend the header line to each one.

In [None]:
%%bash
for net in networks/*tsv; do

    # Note we are modifying in place, so first check whether the header is already there
    if ! head -n 1 "${net}" | grep -q '^id_net'; then
        sed -i '1s/^/id_net\ttf\ttg\tstatus\torg_reference\tid_net_reference\tcount_org_reference\tstring\ttu_unit\n/' "${net}"
    fi

done
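
The same header could also be attached at read time in pandas, which leaves the files on disk untouched. This is only a minimal sketch, assuming the network files are headerless tab-separated tables with the nine columns listed above; `NETWORK_COLUMNS`, `read_network`, and the example row are hypothetical and not part of the original workflow.

```python
import io

import pandas as pd

# Column names copied from the header written by the sed command above.
NETWORK_COLUMNS = [
    "id_net", "tf", "tg", "status", "org_reference",
    "id_net_reference", "count_org_reference", "string", "tu_unit",
]


def read_network(path_or_buffer):
    """Read a headerless network TSV and attach the column names (hypothetical helper)."""
    return pd.read_csv(path_or_buffer, sep="\t", header=None, names=NETWORK_COLUMNS)


# Made-up row, used only to show the call; real files would be passed by path.
example = io.StringIO("1\ttfA\ttgB\tknown\torgX\t7\t3\tforward\ttu1\n")
network_df = read_network(example)
print(network_df.columns.tolist())
```

With this approach, running the cell twice cannot duplicate the header, because nothing is written back to the file.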

In [14]:
%%bash

# NOTE: "sakldsa" is a placeholder; point it at the XGMML file(s) to process
for xgmml in sakldsa; do 

    # Getting node coordinates
    grep -P "(<node |<graphics y=)" "${xgmml}" |
        sed -r ':r;$!{N;br};s/\n\s+<graphics/ /g' |
        sed -r 's/(\s+<node\s+|>|\w+=|")//g' |
        sed -r 's/\s+/\t/g' |
        cut -f1,2,3,6 |
        # The first column is named "id" so it matches the merge key used later
        sed '1s/^/id\tlabel\ty\tx\n/' > test_1.tsv

    # Getting edge relations
    grep "<edge " "${xgmml}" | 
        grep -oP 'source="\S+"\s+target="\S+"' | 
        sed -r 's/(\S+=|")//g' | 
        sed -r 's/\s+/\t/g' | 
        sed '1s/^/source\ttarget\n/' > test_2.tsv

done
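
The grep and sed pipeline above depends on how the XGMML file is laid out line by line. As an alternative, the same node and edge tables could be built with Python's XML parser. This is a sketch under assumptions, not the method used in this notebook: it assumes each `<node>` carries `id` and `label` attributes and has a child `<graphics>` element with `x` and `y` attributes, and each `<edge>` carries `source` and `target`. The names `parse_xgmml`, `local_name`, and `to_float`, and the sample document, are hypothetical.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://...}node'."""
    return tag.rsplit("}", 1)[-1]


def to_float(value):
    """Convert an attribute value to float, keeping missing values as None."""
    return float(value) if value is not None else None


def parse_xgmml(xml_text):
    """Return (nodes, edges) DataFrames from an XGMML document given as a string."""
    root = ET.fromstring(xml_text)
    nodes, edges = [], []
    for element in root.iter():
        name = local_name(element.tag)
        if name == "node":
            # Assumption: coordinates live in a direct <graphics> child.
            graphics = next(
                (child for child in element if local_name(child.tag) == "graphics"),
                None,
            )
            attrs = graphics.attrib if graphics is not None else {}
            nodes.append({
                "id": element.get("id"),
                "label": element.get("label"),
                "y": to_float(attrs.get("y")),
                "x": to_float(attrs.get("x")),
            })
        elif name == "edge":
            edges.append({
                "source": element.get("source"),
                "target": element.get("target"),
            })
    return (
        pd.DataFrame(nodes, columns=["id", "label", "y", "x"]),
        pd.DataFrame(edges, columns=["source", "target"]),
    )


# Made-up document, only to show the expected shape of the input.
sample = """<graph xmlns="http://www.cs.rpi.edu/XGMML">
  <node id="1" label="A"><graphics y="2.5" x="-1.0"/></node>
  <node id="2" label="B"><graphics y="0.0" x="3.0"/></node>
  <edge source="1" target="2"/>
</graph>"""
sample_nodes, sample_edges = parse_xgmml(sample)
print(sample_nodes)
print(sample_edges)
```

Note that the ids come back as strings here; they would need to be cast to integers to match tables read with `pd.read_csv`.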

In [5]:
node_file = "networks/test.tsv"
edge_file = "networks/test_2.tsv"

nodes_df = pd.read_csv(node_file, sep="\t")
edges_df = pd.read_csv(edge_file, sep="\t")

In [6]:
nodes_df

Unnamed: 0,id,label,y,x
0,22650,YP_026246.1,-279.246521,-701.868628
1,22632,YP_026222.1,422.753479,-79.868628
2,22601,NP_418049.1,-2384.563893,-1849.295083
3,22544,NP_416686.1,-2278.246521,-2033.442173
4,22534,NP_416045.1,-2171.929149,-1849.295083
...,...,...,...,...
2219,15148,NP_415542.1,-2382.907843,-1144.823467
2220,15146,NP_415541.1,-2503.133103,-1320.207107
2221,15144,NP_415540.1,-2493.649369,-1197.809244
2222,15141,NP_414560.1,1087.753479,-4468.868628


In [7]:
edges_df

Unnamed: 0,source,target
0,22650,16555
1,22650,16553
2,22650,16551
3,22650,15986
4,22650,15984
...,...,...
5287,15140,15148
5288,15140,15146
5289,15140,15144
5290,15140,15140


In [9]:
source = edges_df.merge(nodes_df, how="left", right_on="id", left_on="source")
source

Unnamed: 0,source,target,id,label,y,x
0,22650,16555,22650,YP_026246.1,-279.246521,-701.868628
1,22650,16553,22650,YP_026246.1,-279.246521,-701.868628
2,22650,16551,22650,YP_026246.1,-279.246521,-701.868628
3,22650,15986,22650,YP_026246.1,-279.246521,-701.868628
4,22650,15984,22650,YP_026246.1,-279.246521,-701.868628
...,...,...,...,...,...,...
5287,15140,15148,15140,NP_414561.1,-121.246521,-3261.868628
5288,15140,15146,15140,NP_414561.1,-121.246521,-3261.868628
5289,15140,15144,15140,NP_414561.1,-121.246521,-3261.868628
5290,15140,15140,15140,NP_414561.1,-121.246521,-3261.868628


In [13]:
target = edges_df.merge(nodes_df, how="left", right_on="id", left_on="target")
target

Unnamed: 0,source,target,id,label,y,x
0,22650,16555,16555,NP_417681.1,-495.246521,-744.868628
1,22650,16553,16553,NP_417680.1,-558.246521,-1028.868628
2,22650,16551,16551,NP_417679.2,-468.246521,-953.868628
3,22650,15986,15986,NP_416406.4,295.753479,-417.868628
4,22650,15984,15984,NP_416405.1,331.753479,-379.868628
...,...,...,...,...,...,...
5287,15140,15148,15148,NP_415542.1,-2382.907843,-1144.823467
5288,15140,15146,15146,NP_415541.1,-2503.133103,-1320.207107
5289,15140,15144,15144,NP_415540.1,-2493.649369,-1197.809244
5290,15140,15140,15140,NP_414561.1,-121.246521,-3261.868628
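
A left merge keeps every edge, but it silently fills the node columns with NaN when an endpoint has no matching node. A small sketch of how this could be checked, using the `validate` and `indicator` parameters of `DataFrame.merge`; the toy tables below are made up for illustration and are not the notebook's data.

```python
import pandas as pd

# Toy data: node 3 is referenced by an edge but missing from the node table.
toy_nodes = pd.DataFrame(
    {"id": [1, 2], "label": ["A", "B"], "y": [0.0, 1.0], "x": [5.0, 6.0]}
)
toy_edges = pd.DataFrame({"source": [1, 1, 2], "target": [2, 3, 1]})

checked = toy_edges.merge(
    toy_nodes,
    how="left",
    left_on="target",
    right_on="id",
    validate="many_to_one",  # raises MergeError if a node id appears twice
    indicator=True,          # adds a _merge column: "both" or "left_only"
)

# Edges whose target was not found in the node table.
missing = checked[checked["_merge"] == "left_only"]
print(missing[["source", "target"]])
```

An empty `missing` table would mean every edge endpoint has coordinates.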


In [15]:
main = pd.concat([source, target])

In [16]:
main

Unnamed: 0,source,target,id,label,y,x
0,22650,16555,22650,YP_026246.1,-279.246521,-701.868628
1,22650,16553,22650,YP_026246.1,-279.246521,-701.868628
2,22650,16551,22650,YP_026246.1,-279.246521,-701.868628
3,22650,15986,22650,YP_026246.1,-279.246521,-701.868628
4,22650,15984,22650,YP_026246.1,-279.246521,-701.868628
...,...,...,...,...,...,...
5287,15140,15148,15148,NP_415542.1,-2382.907843,-1144.823467
5288,15140,15146,15146,NP_415541.1,-2503.133103,-1320.207107
5289,15140,15144,15144,NP_415540.1,-2493.649369,-1197.809244
5290,15140,15140,15140,NP_414561.1,-121.246521,-3261.868628
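
After the concatenation, each edge appears twice (once with its source node, once with its target node) and the row index repeats. If it helps downstream, the two halves could be labelled before stacking. This is an optional sketch on made-up data; the `role` column and the toy frames are hypothetical additions, not part of the original table.

```python
import pandas as pd

# Toy stand-ins for the source and target merges above.
toy_source = pd.DataFrame({"source": [1], "target": [2], "id": [1], "label": ["A"]})
toy_target = pd.DataFrame({"source": [1], "target": [2], "id": [2], "label": ["B"]})

labelled = pd.concat(
    [toy_source.assign(role="source"), toy_target.assign(role="target")],
    ignore_index=True,  # rebuild a unique 0..n-1 index
)
print(labelled)
```

The extra column also disambiguates self-loops, where `id` equals both `source` and `target`.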


In [18]:
main.to_csv("network.tsv", sep="\t", header=True, index=False)