This repository has been archived by the owner on Mar 11, 2019. It is now read-only.

Performance

Luis Francisco Hernández Sánchez edited this page Jun 14, 2018 · 14 revisions

PathwayMatcher was benchmarked against different reference datasets covering multiple types of omics data.

Response time

In all cases, response time increases with query size until it reaches a plateau, which indicates near-complete coverage of pathways for the given input type. As expected, protein identifiers yield the lowest response time, completing in less than a minute. Mapping peptides and genetic variants to proteins adds computational complexity, resulting in response times of approximately one and two minutes, respectively. Finally, proteoform matching, the most computationally demanding task, shows a response time that increases linearly until it reaches 3.5 min.

Response time of PathwayMatcher using (A) proteins in blue, (B) proteoforms in green, (C) peptides in yellow, and (D) genetic variants in red. Response time in minutes is plotted against query size. The mean is displayed as a solid line and the 95% range as a ribbon.
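The mean and 95% range shown in the figure can be summarized from repeated timings roughly as follows. This is a minimal sketch, not the benchmark code used for PathwayMatcher: the function names, the `run_query` callable, the number of repeats, and the choice of the 2.5th and 97.5th percentiles as the "95% range" are all assumptions made for illustration.

```python
import random
import statistics
import time


def mean_and_95_range(values):
    """Return (mean, 2.5th percentile, 97.5th percentile) of the values."""
    # n=40 yields cut points at every 2.5%; the first and last ones
    # bound the central 95% of the observations.
    cuts = statistics.quantiles(values, n=40, method="inclusive")
    return statistics.fmean(values), cuts[0], cuts[-1]


def time_samples(reference, sizes, run_query, repeats=10, seed=0):
    """Time run_query (hypothetical callable) on random samples of each size."""
    rng = random.Random(seed)
    summary = {}
    for size in sizes:
        timings = []
        for _ in range(repeats):
            sample = rng.sample(reference, size)  # sampling without replacement
            start = time.perf_counter()
            run_query(sample)
            timings.append(time.perf_counter() - start)
        summary[size] = mean_and_95_range(timings)
    return summary
```

Here `reference` would be one of the reference lists described under Datasets, and `run_query` would wrap whatever call launches the matching for one sample.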

Datasets

Proteins

  • The human protein set from UniProt/Swiss-Prot, which is manually annotated and reviewed (release 2017_10).
  • The list of all annotated proteins in Reactome version 63. Query for Neo4j:
MATCH (pe:PhysicalEntity)-[:referenceEntity]->(re:ReferenceEntity)
WHERE pe.speciesName = "Homo sapiens" AND re.databaseName = "UniProt"
RETURN DISTINCT (CASE WHEN size(re.variantIdentifier) > 0 THEN re.variantIdentifier ELSE re.identifier END) AS proteinAccession
ORDER BY proteinAccession

Proteoforms

The list of all annotated proteoforms in Reactome. Query for Neo4j:

MATCH (pe:PhysicalEntity{speciesName:'Homo sapiens'})-[:referenceEntity]->(re:ReferenceEntity{databaseName:'UniProt'})
WITH DISTINCT pe, re
OPTIONAL MATCH (pe)-[:hasModifiedResidue]->(tm:TranslationalModification)-[:psiMod]->(mod:PsiMod)
WITH DISTINCT pe.stId AS physicalEntity,
              re.identifier AS protein,
              re.variantIdentifier AS isoform,
              tm.coordinate AS coordinate,
              mod.identifier AS type
ORDER BY type, coordinate
WITH DISTINCT physicalEntity,
              CASE WHEN isoform IS NOT NULL THEN isoform ELSE protein END AS UniProtAcc,
              COLLECT(type + ":" + CASE WHEN coordinate IS NOT NULL THEN coordinate ELSE "null" END) AS ptms
RETURN DISTINCT UniProtAcc, ptms
ORDER BY UniProtAcc
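To make the shape of the query result easier to follow, the aggregation it performs can be mirrored on in-memory rows. The sketch below is an illustration only and does not connect to Neo4j: the function name and the row dictionaries are hypothetical, and it assumes that a row without a modification contributes no entry to the list, as a null operand would when collected in Cypher.

```python
def collect_proteoforms(rows):
    """Group rows per physical entity and build (accession, ptms) pairs.

    Each row is a dict with the keys physicalEntity, protein, isoform,
    coordinate, and type, matching the aliases used in the query.
    """
    # First WITH DISTINCT: drop duplicate rows.
    unique = {
        (r["physicalEntity"], r["protein"], r["isoform"], r["coordinate"], r["type"])
        for r in rows
    }
    # ORDER BY type, coordinate, with missing values sorted last.
    ordered = sorted(
        unique,
        key=lambda t: (t[4] is None, t[4] or "", t[3] is None, t[3] or 0),
    )
    grouped = {}
    for entity, protein, isoform, coordinate, mod_type in ordered:
        # Prefer the isoform accession when one is annotated.
        accession = isoform if isoform is not None else protein
        ptms = grouped.setdefault((entity, accession), [])
        if mod_type is not None:
            position = coordinate if coordinate is not None else "null"
            ptms.append(f"{mod_type}:{position}")
    # RETURN DISTINCT UniProtAcc, ptms ORDER BY UniProtAcc.
    return sorted({(acc, tuple(ptms)) for (_, acc), ptms in grouped.items()})
```

Each returned pair holds an accession and a tuple of "type:coordinate" strings, so two physical entities with the same accession and the same modifications collapse into a single proteoform.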

Peptides

  • Proteotypic Peptide Set from ProteomeTools, available from the ProteomeXchange Consortium via the PRIDE repository (PXD004732, release date 01/23/2017). It includes 139,797 non-redundant peptides.

  • 'Missing Gene' Set, a collection of 141,601 non-redundant peptides that includes all unique tryptic peptides between 7 and 30 amino acids in length for canonical gene products lacking confident protein-level identification in ProteomicsDB.org. The set comprises all the files designated as “TUM_second_pool” with the ".zip" type.

  • 'SRMAtlas' Set, the SRMAtlas collection of 81,497 non-redundant peptides. The set comprises all the files designated as “SRMAtlas” with the ".zip" type.

Each compressed file contains a text file, peptides.txt, with a list of peptides. The utility class used to gather the peptides from all the files is no.uib.pathwaymatcher.tools.ProteomeTools_PTPListExtractor. To use it, download all the files locally and specify their location in the class.

In total there are 333,784 non-redundant peptides in the reference list used for sampling.
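The gathering step described above (reading peptides.txt from each archive and merging the results into one non-redundant list) can be sketched in a few lines. This is not the ProteomeTools_PTPListExtractor class itself: the function names are hypothetical, and the sketch assumes that the peptide sequence is the first tab-separated field on each line and that a header row, if present, starts with "Sequence".

```python
import io
import zipfile


def read_peptides(zip_path, member_name="peptides.txt"):
    """Collect the peptide sequences listed in one compressed file."""
    peptides = set()
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # Accept the file at any depth inside the archive.
            if name.rsplit("/", 1)[-1] != member_name:
                continue
            with archive.open(name) as handle:
                for line in io.TextIOWrapper(handle, encoding="utf-8"):
                    sequence = line.split("\t", 1)[0].strip()
                    # Assumption: a header row is labeled "Sequence".
                    if sequence and sequence != "Sequence":
                        peptides.add(sequence)
    return peptides


def merge_peptide_sets(zip_paths):
    """Merge several archives into one sorted, non-redundant peptide list."""
    merged = set()
    for path in zip_paths:
        merged |= read_peptides(path)
    return sorted(merged)
```

Because the three sets are merged through a set union, peptides shared between them are counted once, which is why the combined reference list can be smaller than the sum of the individual set sizes.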

Genetic Variants

Variants from the human assembly GRCh37.p13.
