This repository has been archived by the owner on Mar 11, 2019. It is now read-only.

Protein connection graph

Carlos edited this page Jul 23, 2018 · 7 revisions

PathwayMatcher can generate a connection graph as an additional output of the pathway search and analysis. The vertices of the graph can be genes, proteins or proteoforms, selected with the command line arguments -gg, -gu and -gp, respectively.

Graph Definition

The connection graph is defined by a set of vertices and edges, where vertices represent genes, proteins or proteoforms. The edges represent connections/relations between proteins according to the data model in the Reactome database.

Proteins are referenced only by their UniProt[1] accession. Genes follow the HUGO gene nomenclature[2]. The proteoforms follow the Simple format explained here.

There is a connection between two proteins when:

  • (Protein1)--(Complex)--(Protein2): Both are components of the same complex.
  • (Protein1)--(Reaction)--(Protein2): Both participate in the same reaction.
  • (Protein1)--(Set)--(Protein2): Both are members of the same entity set.

These connections are undirected: the two proteins are simply related to each other.
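The three connection rules can be sketched as follows. This is a minimal illustration and not PathwayMatcher's own code: the `containers` mapping, its identifiers and the `connections` function are hypothetical names invented for the example.

```python
from itertools import combinations

# Hypothetical container -> members mapping. The ids and groupings are
# invented for illustration and do not come from Reactome.
containers = {
    ("Complex", "complex_1"): ["PROT_A", "PROT_B", "PROT_C"],
    ("Reaction", "reaction_1"): ["PROT_B", "PROT_D"],
}


def connections(containers):
    """Return one undirected edge per pair of proteins sharing a container."""
    edges = set()
    for (kind, container_id), members in containers.items():
        # Sorting the members gives each unordered pair a single canonical
        # form, so (A, B) and (B, A) are never both stored.
        for a, b in combinations(sorted(set(members)), 2):
            edges.add((a, b, kind, container_id))
    return edges


edges = connections(containers)
```

Storing each pair in sorted order is one simple way to model an undirected edge; a frozenset per pair would work equally well.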

Proteins can participate with multiple roles in a chemical reaction:

  • input (reactant)
  • output (product)
  • catalyst
  • regulator

Proteins participate independently or as components of a complex or entity set:

  • (Reaction)--(Protein)
  • (Reaction)--(Complex)--(Protein)
  • (Reaction)--(Complex)--(Complex)--(Protein)
  • (Reaction)--(Set)--(Protein)
  • (Reaction)--(Set)--(Set)--(Protein)
  • (Reaction)--(Complex)--(Set)--(Protein)
  • (Reaction)--(Complex)--(Set)--(Set)--(Complex)--(Protein)
  • ...
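Because complexes and sets can be nested to any depth, collecting the proteins that take part in a reaction is naturally recursive. The sketch below assumes a hypothetical in-memory representation (a plain string for a protein, a `(kind, id, children)` tuple for a complex or set); it does not reflect how Reactome or PathwayMatcher store the data.

```python
def proteins_in(entity):
    """Collect every protein reachable from a participant, at any depth."""
    if isinstance(entity, str):
        # A plain string stands for a protein participating directly.
        return {entity}
    _kind, _entity_id, children = entity
    found = set()
    for child in children:
        found |= proteins_in(child)
    return found


# Hypothetical reaction participants, mirroring the patterns
# (Reaction)--(Protein) and (Reaction)--(Complex)--(Set)--(Protein).
reaction_participants = [
    "PROT_A",
    ("Complex", "complex_1", ["PROT_B", ("Set", "set_1", ["PROT_C", "PROT_D"])]),
]

all_proteins = set()
for participant in reaction_participants:
    all_proteins |= proteins_in(participant)
```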

For genes and proteoforms, the connections work the same way, with each protein replaced by the corresponding gene or proteoform.

Finally, there are two types of edges: internal and external.

  • Internal edges are connections between proteins of the input list.
  • External edges are connections between a protein in the input list and a protein not in the input list.
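The distinction between the two edge types can be expressed as a small check against the input list. The function name `classify_edge` and the choice of input list are assumptions made for this example.

```python
def classify_edge(id1, id2, input_proteins):
    """Classify an edge as internal or external relative to the input list."""
    in1 = id1 in input_proteins
    in2 = id2 in input_proteins
    if in1 and in2:
        return "internal"
    if in1 or in2:
        return "external"
    # Neither endpoint is in the input list: the definitions above
    # do not cover this case, so no label is assigned.
    return None


# Hypothetical input list for illustration.
input_proteins = {"P27361", "P28482"}
kind_a = classify_edge("P27361", "P28482", input_proteins)
kind_b = classify_edge("P27361", "P28562", input_proteins)
```

Here `kind_a` is an internal edge because both accessions are in the input list, while `kind_b` is external because only one of them is.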

Graph representation

The graph is defined in three files: vertices.tsv, internalEdges.tsv and externalEdges.tsv. The format of these files is compatible with the iGraph System notation [3] for graphs. By default, they are saved in the directory where PathwayMatcher is located. To save them in a different directory, use the command line argument -o.

Vertices file

A tab-separated file (.tsv) with two columns and one vertex (protein) per row:

  • id: UniProt accession of the protein
  • name: Colloquial name of the protein

Example:

id	name
P35070	Probetacellulin
P21359	Neurofibromin
Q8IV61	Ras guanyl-releasing protein 3
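A vertices file in this layout can be read with Python's standard csv module. The sketch below parses the example rows from an in-memory string; the function name `read_vertices` is hypothetical, and for a real file the handle would come from opening vertices.tsv instead.

```python
import csv
import io

# The example rows above, as tab-separated text.
sample = (
    "id\tname\n"
    "P35070\tProbetacellulin\n"
    "P21359\tNeurofibromin\n"
    "Q8IV61\tRas guanyl-releasing protein 3\n"
)


def read_vertices(handle):
    """Map each accession to its name from a tab-separated vertices file."""
    # skipinitialspace tolerates a stray space after a tab separator.
    reader = csv.DictReader(handle, delimiter="\t", skipinitialspace=True)
    return {row["id"].strip(): row["name"].strip() for row in reader}


names = read_vertices(io.StringIO(sample))
```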

Edges files

Tab-separated files (.tsv) with six columns and one edge (connection) per row:

  • id1: UniProt accession of one protein in the connection
  • id2: UniProt accession of the second protein in the connection
  • type: Where the two proteins meet (Complex or Reaction)
  • container_id: Id of the complex or reaction
  • role1: Role of the first protein in the connection
  • role2: Role of the second protein in the connection

Example:

id1	id2	type	container_id	role1	role2
P27361	P28482	Reaction	R-HSA-5675373	input	output
P27361	P28562	Reaction	R-HSA-5675373	input	catalyst
P27361	P28562	Reaction	R-HSA-5675373	output	catalyst
O43524	P84022	Complex	R-HSA-1535906	component	component
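An edges file can be loaded the same way and turned into an undirected adjacency structure. This is a sketch using only the standard library, with the example rows held in a string; `read_adjacency` is a hypothetical helper name, and the same function would apply to both internalEdges.tsv and externalEdges.tsv under the column layout described above.

```python
import csv
import io
from collections import defaultdict

# The example rows above, as tab-separated text.
sample = (
    "id1\tid2\ttype\tcontainer_id\trole1\trole2\n"
    "P27361\tP28482\tReaction\tR-HSA-5675373\tinput\toutput\n"
    "P27361\tP28562\tReaction\tR-HSA-5675373\tinput\tcatalyst\n"
    "P27361\tP28562\tReaction\tR-HSA-5675373\toutput\tcatalyst\n"
    "O43524\tP84022\tComplex\tR-HSA-1535906\tcomponent\tcomponent\n"
)


def read_adjacency(handle):
    """Build an undirected adjacency map from a tab-separated edges file."""
    neighbours = defaultdict(set)
    reader = csv.DictReader(handle, delimiter="\t", skipinitialspace=True)
    for row in reader:
        a = row["id1"].strip()
        b = row["id2"].strip()
        # Edges are undirected, so each endpoint lists the other.
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours


neighbours = read_adjacency(io.StringIO(sample))
```

Using sets collapses repeated rows for the same pair, such as the two P27361 and P28562 rows that differ only in role. To keep roles and container ids, each row could be stored as an edge attribute instead.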

References

[1] UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45: D158-D169 (2017)
[2] HUGO gene nomenclature
[3] Ferres L., Parush A., Li Z., Oppacher Y., Lindgaard G. (2006) Representing and Querying Line Graphs in Natural Language: The iGraph System. In: Butz A., Fisher B., Krüger A., Olivier P. (eds) Smart Graphics. SG 2006. Lecture Notes in Computer Science, vol 4073. Springer, Berlin, Heidelberg
