# Data gathering

In this part, we gather as much proteomics data as possible from the published literature and public databases. The main sources are:

1. CyanoMapDB: a database of cyanobacterial protein-protein interactions (PPIs) with experimental evidence, comprising 52,304 PPIs among 6,789 proteins from 23 cyanobacterial species. It collects the data available in UniProt, STRING, and IntAct, and mines additional PPIs from co-fractionation mass spectrometry (CF-MS) data in cyanobacteria.
2. "Native Protein Complexes in Synechocystis sp. PCC 6803" and "Comparative Network Biology Discovers Protein Complexes That Underline Cellular Differentiation in Anabaena sp.": these two papers describe how to construct protein complexes from interactome data with a clustering method. We take the same approach and apply the same clustering method to the CyanoMapDB interactome data for S. elongatus PCC 7942.
3. Co-fractionation proteomics data for S. elongatus PCC 7942 were reported in the paper "Monitoring light/dark association dynamics of multi-protein complexes in cyanobacteria using size exclusion chromatography-based proteomics". CyanoMapDB includes this dataset and uses it to predict the interactome.

In [1]:
import pandas as pd
import os, sys, re
from pathlib import Path
home = str(Path.home())

In [6]:
work_dir = home + "/Dropbox/PNNL/PredPheno/SystemModeling/Modeling/S_elongatus"
# work_dir

In [7]:
# Both tables live in the same workbook, on separate sheets.
dataset = work_dir + "/data/interactome/Synechococcus_PCC_7942_Dataset.xlsx"
proteome = pd.read_excel(dataset, sheet_name="Protein")
interactome = pd.read_excel(dataset, sheet_name="PPI")


In [8]:
interactome

Unnamed: 0,Protein A,Protein B,Taxon,UniProt evidence,STRING score,IntAct score,IntAct method,GS complex evidence,CF-MS score,CF-MS ID A,CF-MS ID B,PPI index
0,O05161,P16954,Synechococcus elongatus (strain PCC 7942 / FAC...,,,,,,0.651,1G6TR,1G1IW,0.651
1,O05161,Q31KN7,Synechococcus elongatus (strain PCC 7942 / FAC...,,,,,,0.595,1G6TR,1G152,0.595
2,O05161,Q31LJ5,Synechococcus elongatus (strain PCC 7942 / FAC...,,,,,,0.694,1G6TR,1G2EH,0.694
3,O05161,Q31LM9,Synechococcus elongatus (strain PCC 7942 / FAC...,,,,,,0.639,1G6TR,1G4ZP,0.639
4,O05161,Q31N38,Synechococcus elongatus (strain PCC 7942 / FAC...,,,,,,0.647,1G6TR,1GCIT,0.647
...,...,...,...,...,...,...,...,...,...,...,...,...
4529,Q8GMR7,Q99QJ5,Synechococcus elongatus (strain PCC 7942 / FAC...,,,,,,0.598,1G0PI,1G030,0.598
4530,Q8GMT0,Q935Z3,Synechococcus elongatus (strain PCC 7942 / FAC...,,,,,,0.849,1G6TX,1G1IA,0.849
4531,Q8KPQ0,Q9L4P3,Synechococcus elongatus (strain PCC 7942 / FAC...,,,,,,0.675,1G0ND,1G126,0.675
4532,Q8KPU9,Q8L1E5,Synechococcus elongatus (strain PCC 7942 / FAC...,,,,,,0.696,1G15V,1G1X2,0.696


In [10]:
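# ClusterONE edge-list input: one interaction per line (protein A, protein B, weight),
# tab-separated, with no header row and no index column.
cl1_input = interactome[["Protein A", "Protein B", "PPI index"]]
input_file = work_dir + "/data/interactome/ClusterONE_input.tsv"
cl1_input.to_csv(input_file, sep="\t", index=False, header=False)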
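
Before handing an edge list to a clustering tool, it can be worth checking it for self-loops and for pairs that appear twice in opposite order. The sketch below is not part of the original workflow. It is a minimal sanity check on a toy table that borrows the column names of the PPI sheet. The helper name `check_edge_list` and the rule "keep the highest weight per pair" are assumptions made for illustration.

```python
import pandas as pd


def check_edge_list(edges: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of a three-column edge list (node, node, weight).

    Hypothetical helper: drops incomplete rows and self-loops, then collapses
    A-B / B-A duplicates, keeping the highest-weight copy of each pair.
    """
    a, b, w = edges.columns[:3]
    out = edges[[a, b, w]].dropna().copy()

    # Remove self-loops (a protein paired with itself).
    out = out[out[a] != out[b]].copy()

    # Put each pair in a canonical order so A-B and B-A compare equal.
    swap = out[a] > out[b]
    first = out[a].where(~swap, out[b])
    second = out[b].where(~swap, out[a])
    out[a] = first
    out[b] = second

    # Keep the highest-weight copy of every undirected pair.
    out = out.sort_values(w, ascending=False).drop_duplicates([a, b])
    return out.reset_index(drop=True)


# Toy data, not taken from CyanoMapDB.
toy = pd.DataFrame({
    "Protein A": ["P1", "P2", "P3", "P1"],
    "Protein B": ["P2", "P1", "P3", "P4"],
    "PPI index": [0.65, 0.60, 0.90, 0.70],
})
clean = check_edge_list(toy)
print(clean)
```

If the cleaned table has the same number of rows as `cl1_input`, the edge list was already free of these issues and can be written out unchanged.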

In [13]:
import subprocess
# Run ClusterONE twice on the same edge list: once for the plain output, once for the CSV output.
cl1_jar = home + "/Dropbox/PNNL/PredPheno/SystemModeling/tools/cluster_one-1.2.jar"
output_file = work_dir + "/data/interactome/ClusterONE_output.tsv"
output_file_csv = work_dir + "/data/interactome/ClusterONE_output.csv"
for fmt, path in [("plain", output_file), ("csv", output_file_csv)]:
    with open(path, 'w') as output:
        # check=True raises CalledProcessError if the Java process exits with an error.
        subprocess.run(["java", "-jar", cl1_jar, "-f", "edge_list", "-F", fmt, input_file], stdout=output, check=True)

Loaded graph with 679 nodes and 4534 edges
Detected 129 complexes
Loaded graph with 679 nodes and 4534 edges
Detected 129 complexes
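
To work with the detected complexes in pandas, the plain output can be reshaped into a long membership table. The sketch below assumes the plain format holds one complex per line, with member identifiers separated by tabs. That assumption should be checked against the actual `ClusterONE_output.tsv`. The helper name `read_complexes` and the toy input are hypothetical, and the identifiers are reused from the table above only as placeholders.

```python
import io

import pandas as pd


def read_complexes(handle) -> pd.DataFrame:
    """Parse one-complex-per-line text into (complex_id, protein) rows.

    Assumes tab-separated member identifiers on each line; blank lines are skipped.
    """
    rows = []
    complex_id = 0
    for line in handle:
        members = [m for m in line.strip().split("\t") if m]
        if not members:
            continue
        complex_id += 1
        rows.extend({"complex_id": complex_id, "protein": p} for p in members)
    return pd.DataFrame(rows, columns=["complex_id", "protein"])


# Toy stand-in for the output file; groupings are illustrative, not real results.
toy_output = io.StringIO("O05161\tP16954\tQ31KN7\n\nQ8GMT0\tQ935Z3\n")
membership = read_complexes(toy_output)
sizes = membership.groupby("complex_id")["protein"].size()
print(sizes)
```

On the real file, the same function could be called as `read_complexes(open(output_file))`. The long format then makes it straightforward to merge complex membership with the `proteome` table on the protein identifier.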


In [17]:
sys.path.append("./data/interactome")
import utils3 as util3