In this notebook we'll process the networks stored in the folder of the same name: we'll add headers, retrieve the coordinates of the network nodes, and merge them with the metadata we obtained from InterProScan (GO terms and functional annotation). The end result is a very informative, easy-to-use table for Tableau.

We'll use bash and Python to accomplish this. Let's start by importing the Python libraries.

In [1]:
import pandas as pd
import numpy as np
import os

## Preprocessing

Now it's time to start. First, we'll add headers to the networks. Each column describes the following:

1. The id of this interaction
2. Transcription Factor TF
3. Target TG
4. Status of the interactions (is it known or is it new?)
5. Organism where that interaction was mapped
6. ID of that interaction in the organism where it was mapped
7. Total number of organisms where that interaction came from
8. Forward or reverse strand (stored in the `string` column)
9. Transcription Unit id of the target

To accomplish this, we'll use bash to iterate over the files and prepend the header line to each one.

In [2]:
%%bash
for net in networks/*tsv; do
    
    # The file is modified in place below, so first check whether the header is already there
    HEADER=$(head -n 1 "${net}")
    EXPECTED_HEADER=$(printf "id_net\ttf\ttg\tstatus\torg_reference\tid_net_reference\tcount_org_reference\tstring\ttu_unit")
    
    echo "Checking headers in ${net}"
    
    if [ "${HEADER}" != "${EXPECTED_HEADER}" ]; then
        printf "\tAdding header\n"
        sed -i "1s/^/${EXPECTED_HEADER}\n/" "${net}"
    fi
done

Checking headers in networks/GCF_000005845.2_E_coli_K12_genomic_extended_network_plus_TU.tsv
Checking headers in networks/GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic_extended_network_plus_TU.tsv
Checking headers in networks/GCF_000006945.2_ASM694v2_S_enterica_LT2_genomic_extended_network_plus_TU.tsv
Checking headers in networks/GCF_000009045.1_ASM904v1_B_subtilis_168_genomic_extended_network_plus_TU.tsv
Checking headers in networks/GCF_000009645.1_ASM964v1_S_aureus_N315_genomic_extended_network_plus_TU.tsv
Checking headers in networks/GCF_000195955.2_ASM19595v2_M_tuberculosis_H37Rv_genomic_extended_network_plus_TU.tsv
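
The same "add the header only if it is missing" idea can be sketched in Python. This is a minimal sketch and not a replacement for the bash cell: the helper name `ensure_header` and the demo row are made up for illustration, and the file is read fully into memory, which assumes the networks are small enough for that.

```python
import tempfile
from pathlib import Path

# Same column names as EXPECTED_HEADER in the bash cell
EXPECTED_HEADER = "\t".join([
    "id_net", "tf", "tg", "status", "org_reference",
    "id_net_reference", "count_org_reference", "string", "tu_unit",
])

def ensure_header(path, header=EXPECTED_HEADER):
    """Prepend `header` to the file unless its first line already matches.

    Returns True if the header was added, False if it was already present.
    (Hypothetical helper, not part of the original notebook.)
    """
    path = Path(path)
    lines = path.read_text().splitlines(keepends=True)
    if lines and lines[0].rstrip("\n") == header:
        return False
    path.write_text(header + "\n" + "".join(lines))
    return True

# Demo on a throwaway file with one made-up interaction row
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.tsv"
    demo.write_text("1\ttfA\tgeneB\tKnown\tE.coli\t1\t1\t+\tTU1\n")
    first = ensure_header(demo)    # header is added
    second = ensure_header(demo)   # nothing changes the second time
    demo_lines = demo.read_text().splitlines()
```

Calling the helper twice on the same file leaves a single header line, which is the property the `if` check in the bash loop is there to guarantee.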


From here we'll use Cytoscape to load our networks and apply a layout. You need to launch the software manually, either by clicking on it or by calling it directly from the terminal. In my case, that would be:

```bash ~/Cytoscape_v3.8.2/cytoscape.sh```

Cytoscape accepts calls from Python and R to automate processes. However, I'll apply an organic layout from the **yFiles** app which, unfortunately, does not support automation, so loading the networks and applying the layout will be done manually. Load each network with *File -> import -> Network from file* and select the tsv file. I'll use only the *tf* (source node), *tg* (target node) and *status* (edge attribute) columns. Then select *layout -> yFiles organic layout*, and finally save the network with *File -> export -> Network to File* as xgmml. This last file contains the coordinates of the nodes. Those files will be saved in the same folder as the networks.

Now we'll retrieve the node id, node label, x and y coordinates, and the edges in tabular format. Those files will be saved in the folder *CoordinatesNetworks*.

In [3]:
%%bash

if [ ! -d "CoordinatesNetworks" ]; then
    echo "Creating folder: CoordinatesNetworks"
    mkdir CoordinatesNetworks
fi

In [4]:
%%bash

PATH_OUTPUTH="CoordinatesNetworks/"

for xgmml in networks/*xgmml; do 

    BASENAME=$(basename "${xgmml}")
    NODE_FILE_NAME="${PATH_OUTPUTH}${BASENAME}.node.tsv"
    echo "Getting nodes in: ${xgmml}"
    
    # Getting node coordinates
    grep -P "(<node |<graphics )" "${xgmml}"|
        sed -r ':r;$!{N;br};s/\n\s+<graphics/ /g' | 
        sed -r 's/(\s+<node\s+|>|\w+=|")//g' | 
        sed -r 's/\s+/\t/g' | 
        cut -f1,2,5,9 | 
        sed '1s/^/id_node\tlabel\tx\ty\n/' > "${NODE_FILE_NAME}"

    EDGE_FILE_NAME="${PATH_OUTPUTH}${BASENAME}.edge.tsv"
    echo "Getting edges in: ${xgmml}"
    
    # Getting edge relations
    grep -P '(<edge\s+|<att\s+name="status" value="\S+")' "${xgmml}" |
        grep -oP '(source="\S+"\s+target="\S+"|value="\S+")' |
        sed -r ':r;$!{N;br};s/\nvalue/ value/g' |
        sed -r 's/(\S+=|")//g' |
        sed -r 's/\s+/\t/g' |
        sed '1s/^/source\ttarget\tstatus_interaction\n/' > "${EDGE_FILE_NAME}"

done

Getting nodes in: networks/GCF_000005845.2_E_coli_K12_genomic_extended_network_plus_TU.tsv.xgmml
Getting edges in: networks/GCF_000005845.2_E_coli_K12_genomic_extended_network_plus_TU.tsv.xgmml
Getting nodes in: networks/GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic_extended_network_plus_TU.tsv.xgmml
Getting edges in: networks/GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic_extended_network_plus_TU.tsv.xgmml
Getting nodes in: networks/GCF_000006945.2_ASM694v2_S_enterica_LT2_genomic_extended_network_plus_TU.tsv.xgmml
Getting edges in: networks/GCF_000006945.2_ASM694v2_S_enterica_LT2_genomic_extended_network_plus_TU.tsv.xgmml
Getting nodes in: networks/GCF_000009045.1_ASM904v1_B_subtilis_168_genomic_extended_network_plus_TU.tsv.xgmml
Getting edges in: networks/GCF_000009045.1_ASM904v1_B_subtilis_168_genomic_extended_network_plus_TU.tsv.xgmml
Getting nodes in: networks/GCF_000009645.1_ASM964v1_S_aureus_N315_genomic_extended_network_plus_TU.tsv.xgmml
Getting edges in: networks/GCF_

Those files look like this:

In [5]:
%%bash

head CoordinatesNetworks/GCF_000005845.2_E_coli_K12_genomic_extended_network_plus_TU.tsv.xgmml.edge.tsv

source	target	status_interaction
7581	1486	Known
7581	1484	Known
7581	1482	Known
7581	917	Known
7581	915	Known
7563	5998	Known
7563	1725	Known
7563	5949	Known
7563	5947	Known


In [6]:
%%bash

column -s$'\t' -t CoordinatesNetworks/GCF_000005845.2_E_coli_K12_genomic_extended_network_plus_TU.tsv.xgmml.node.tsv | head

id_node  label                       x                    y
7581     YP_026246.1                 -299.6598532119333   -941.3007732911692
7563     YP_026222.1                 -1030.6598532119333  -772.3007732911692
7532     NP_418049.1                 1757.7666016983903   984.5165987943656
7475     NP_416686.1                 1941.913691877743    878.1992267088308
7465     NP_416045.1                 1757.7666016983903   771.881854623296
7365     NP_416959.1                 917.3401467880667    705.6992267088308
7357     NP_417053.2                 1479.3401467880658   53.69922670883079
7334     NP_416219.1                 4666.3421134646205   -215.64317887042944
7332     NP_415781.1                 4759.133635319123    -289.6419482632955


Now it's time to merge everything into one file containing the node coordinates and the InterProScan metadata that we have been processing. We'll need to load the files with the coordinates, the interactions and the InterProScan data as follows:

In [7]:
coordinates_path = "CoordinatesNetworks"
full_results_path = "InterproFullMerge/"

files_list_coordinates = os.listdir(coordinates_path)
files_list_full = os.listdir(full_results_path)

files_nodes_names = [file for file in files_list_coordinates if file.endswith("node.tsv")] 
files_edges_names = [file for file in files_list_coordinates if file.endswith("edge.tsv")] 

files_nodes_names.sort()
files_edges_names.sort()
files_list_full.sort()

In [8]:
files_nodes_names

['GCF_000005845.2_E_coli_K12_genomic_extended_network_plus_TU.tsv.xgmml.node.tsv',
 'GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic_extended_network_plus_TU.tsv.xgmml.node.tsv',
 'GCF_000006945.2_ASM694v2_S_enterica_LT2_genomic_extended_network_plus_TU.tsv.xgmml.node.tsv',
 'GCF_000009045.1_ASM904v1_B_subtilis_168_genomic_extended_network_plus_TU.tsv.xgmml.node.tsv',
 'GCF_000009645.1_ASM964v1_S_aureus_N315_genomic_extended_network_plus_TU.tsv.xgmml.node.tsv',
 'GCF_000195955.2_ASM19595v2_M_tuberculosis_H37Rv_genomic_extended_network_plus_TU.tsv.xgmml.node.tsv']

In [9]:
files_edges_names

['GCF_000005845.2_E_coli_K12_genomic_extended_network_plus_TU.tsv.xgmml.edge.tsv',
 'GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic_extended_network_plus_TU.tsv.xgmml.edge.tsv',
 'GCF_000006945.2_ASM694v2_S_enterica_LT2_genomic_extended_network_plus_TU.tsv.xgmml.edge.tsv',
 'GCF_000009045.1_ASM904v1_B_subtilis_168_genomic_extended_network_plus_TU.tsv.xgmml.edge.tsv',
 'GCF_000009645.1_ASM964v1_S_aureus_N315_genomic_extended_network_plus_TU.tsv.xgmml.edge.tsv',
 'GCF_000195955.2_ASM19595v2_M_tuberculosis_H37Rv_genomic_extended_network_plus_TU.tsv.xgmml.edge.tsv']

In [10]:
files_list_full

['GCF_000005845.2_E_coli_K12_genomic.faa.tsv.full.tsv',
 'GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic.faa.tsv.full.tsv',
 'GCF_000006945.2_ASM694v2_S_enterica_LT2_genomic.faa.tsv.full.tsv',
 'GCF_000009045.1_ASM904v1_B_subtilis_168_genomic.faa.tsv.full.tsv',
 'GCF_000009645.1_ASM964v1_S_aureus_N315_genomic.faa.tsv.full.tsv',
 'GCF_000195955.2_ASM19595v2_M_tuberculosis_H37Rv_genomic.faa.tsv.full.tsv']
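
Since the three lists are going to be walked in parallel, it may be worth checking that they really line up instead of trusting the sort order. Below is a minimal sketch, assuming every file name starts with its assembly accession (`GCF_` followed by digits, a dot and a version); `genome_key` and `check_aligned` are hypothetical helper names.

```python
import re

def genome_key(filename):
    """Return the leading assembly accession, e.g. 'GCF_000005845.2'."""
    match = re.match(r"GCF_\d+\.\d+", filename)
    if match is None:
        raise ValueError(f"No accession prefix in {filename!r}")
    return match.group(0)

def check_aligned(*file_lists):
    """True if every list has the same accessions in the same order."""
    keys = [[genome_key(name) for name in names] for names in file_lists]
    return all(key_list == keys[0] for key_list in keys)

# File names copied from the listings above (first two genomes only)
demo_nodes = [
    "GCF_000005845.2_E_coli_K12_genomic_extended_network_plus_TU.tsv.xgmml.node.tsv",
    "GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic_extended_network_plus_TU.tsv.xgmml.node.tsv",
]
demo_full = [
    "GCF_000005845.2_E_coli_K12_genomic.faa.tsv.full.tsv",
    "GCF_000006765.1_ASM676v1_P_aeruginosa_PA01_genomic.faa.tsv.full.tsv",
]

aligned = check_aligned(demo_nodes, demo_full)
misaligned = check_aligned(demo_nodes, demo_full[::-1])
```

In the notebook this could be called as `check_aligned(files_nodes_names, files_edges_names, files_list_full)` before the main loop, so a missing or extra file is caught early instead of silently pairing the wrong genomes.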

As you can see, thanks to the naming format we have been using, all the file names match in order, so we'll iterate over them simultaneously. In addition, the files with the edge information contain the ID rather than the name of the nodes, so I'll create a function that adds a column with that data; it'll be used later.

In [11]:
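def AddSourceTargetLabel(left:pd.DataFrame, right:pd.DataFrame,
                         keydict:dict, drop:list=None,
                         rename:dict=None) -> pd.DataFrame:
    """Merge `left` with `right`, then optionally drop and rename columns.

    `keydict` holds the keyword arguments passed to `DataFrame.merge`
    (for example how, left_on and right_on).
    """
    
    df = left.merge(right, **keydict)
    
    # Remove the helper columns created by the merge, if requested
    if drop:
        df.drop(drop, axis=1, inplace=True)
        
    # Give the remaining columns their final names, if requested
    if rename:
        df.rename(rename, axis=1, inplace=True)
    
    return df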
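
To see why the function is later called with `id_node_y` in `drop` and `id_node_x` in `rename`, here is a small sketch on toy data. It restates the helper under the hypothetical name `add_label` so the snippet runs on its own; the two-node table is made up.

```python
import pandas as pd

def add_label(left, right, keydict, drop=None, rename=None):
    """Stand-alone copy of the merge/drop/rename helper (illustration only)."""
    df = left.merge(right, **keydict)
    if drop:
        df = df.drop(columns=drop)
    if rename:
        df = df.rename(columns=rename)
    return df

# Toy node table and one edge row that already carries its own id_node column
toy_nodes = pd.DataFrame({"id_node": [1, 2], "label": ["tfA", "geneB"]})
toy_main = pd.DataFrame({"source": [1], "target": [2],
                         "id_node": [1], "node_label": ["tfA"]})

# Both tables have an id_node column, so pandas suffixes them with _x and _y;
# the _y copy (from the node table) is dropped and the _x copy gets its name back.
labelled = add_label(
    toy_main, toy_nodes,
    {"how": "left", "left_on": "source", "right_on": "id_node"},
    drop=["id_node_y"],
    rename={"id_node_x": "id_node", "label": "source_label"},
)
```

The result keeps the original columns and gains a `source_label` column holding the label of the node whose id is in `source`.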

Now let's put all the pieces together. Take your time to decode the cell below.

In [32]:
# Labels to filter what organism we want in the network (useful for Tableau)
organism_names = ["E.coli","P.aeruginosa","S.enterica","B.subtilis","S.aureus","M.tuberculosis"]

# Columns that will be maintained at the end
important_columns = ['source', 'target', "source_label","target_label",
                     'status_interaction','id_node','node_label', 'x', 'y',
                     'organism','analysis', 'id_analysis', 'desc_analysis',
                     'id_inter', 'desc_inter', 'id_go', 'type']

arguments_add_source_function = {"how":"left", "left_on":"source", "right_on":"id_node"}
arguments_add_target_function = {"how":"left", "left_on":"target", "right_on":"id_node"}

# All files will be saved into one big file
interactions_all_organism_df = pd.DataFrame()

iterator = zip(files_nodes_names, files_edges_names, files_list_full, organism_names)

for f_n, f_e, f_f, organism in iterator:
    
    nodes_df = pd.read_csv(os.path.join(coordinates_path, f_n), sep="\t")
    edges_df = pd.read_csv(os.path.join(coordinates_path, f_e), sep="\t")
    metadata_df = pd.read_csv(os.path.join(full_results_path, f_f), sep="\t")
    
    # In order to have both directions of the interaction, merge for source and target
    source_df = edges_df.merge(nodes_df, how="left", right_on="id_node", left_on="source")
    target_df = edges_df.merge(nodes_df, how="left", right_on="id_node", left_on="target")
    
    # Each row represents the info about a node, and this node could be either a source or a target,
    # so I'll specify this in the table by renaming the column and adding the respective
    # organism
    main_df = pd.concat([source_df, target_df])
    main_df["organism"] = organism
    main_df.rename({"label":"node_label"}, axis=1, inplace=True)
    
    # In addition to the id in source and target, add the label to source and target
    main_df = AddSourceTargetLabel(main_df, nodes_df[["id_node","label"]],
                                   arguments_add_source_function, ["id_node_y"],
                                   {"id_node_x": "id_node", "label":"source_label"})
    
    main_df = AddSourceTargetLabel(main_df, nodes_df[["id_node","label"]],
                                   arguments_add_target_function, ["id_node_y"],
                                   {"id_node_x": "id_node", "label":"target_label"})
    
    # Combine the info from the networks with the interproscan metadata based on the node
    # and not on the source or target label directly (the node already accounts for this)
    data_df = main_df.merge(metadata_df, how="left", left_on="node_label", right_on="id_protein")
    data_df = data_df[important_columns]

    interactions_all_organism_df = pd.concat([interactions_all_organism_df, data_df])  

# Saving
interactions_all_organism_df.to_csv("Data/network_all_info.tsv", header=True,
                                    index=False, na_rep="NaN", sep="\t")
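
A common variation on the accumulation pattern used above is to collect the per-organism tables in a list and concatenate once at the end, which avoids re-copying the growing table on every iteration. This is only a sketch with made-up toy frames, not a change to the cell above.

```python
import pandas as pd

frames = []
for organism in ["E.coli", "B.subtilis"]:
    # Stand-in for the per-organism data_df built in the loop above
    toy = pd.DataFrame({"source": [1], "target": [2]})
    toy["organism"] = organism
    frames.append(toy)

# One concat at the end; ignore_index gives a clean 0..n-1 index
combined = pd.concat(frames, ignore_index=True)
```

With six organisms the difference is negligible, so this is mostly a matter of habit for larger collections.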