https://spectra-cluster.github.io provides a complete overview of all the tools we provide for spectrum clustering and of the spectra-cluster algorithm.
The spectra-cluster Java API is the central collection of algorithms used to develop and run the PRIDE Cluster project. The library was built to quickly test different combinations of clustering approaches and contains implementations of, for example, a variety of similarity metrics for MS/MS spectrum clustering.
It is currently used in two applications:
- spectra-cluster-hadoop: The Hadoop implementation of the re-developed PRIDE Cluster algorithm
- spectra-cluster-cli: A (still in beta) stand-alone implementation of the PRIDE Cluster algorithm.
spectra-cluster is an open-source (Apache 2 licensed) library. It offers the following features out of the box:
- A collection of both classic and new algorithms for measuring spectra similarities.
- A set of engines for clustering spectra together.
- A set of normalizers for normalising spectral peaks.
- A set of filters and functions for pre-processing spectra, such as removing noisy peaks.
- A set of cleanly defined data models and interfaces that represent spectra, peptide spectrum matches, and clusters.
- Functions to read in spectra and write out clustering results.
- Moved to Java 1.8
- Changed default consensus spectrum builder to a binned version of the GreedyConsensusSpectrum builder
- Added features to estimate the number of comparisons directly from the data
- Optimised the MGF parser
- Added predicates to cluster only identified and/or unidentified spectra
- Added support for additional MGF parameters and encode these in the .clustering file using JSON strings
- Added feature to output similarity scores at the time a spectrum is added to a cluster
- Added new function to remove contaminant ions (RemoveContaminantsPeaksFunction). Currently, this function removes all commonly observed immonium ions.
- Added a new function to remove all peaks outside a given m/z range (RemoveWindowsPeaksFunction). By default, all peaks below 200 m/z are ignored.
- Adapted the RemoveImpossibleHighPeaksFunction and the RemovePrecursorPeaksFunction classes to work with spectra where the charge state is unknown (i.e. < 1). In these cases the original spectrum is returned unchanged.
- Fixed bug in the function removing precursor peaks
- Added the mass of the complete TMT tag to the functions removing reporter peaks
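To illustrate the m/z range filtering mentioned in the list above, the following is a minimal, self-contained sketch of the idea. It is not the code of RemoveWindowsPeaksFunction: the `Peak` class, the `keepInRange` method, and the inclusive handling of the range boundaries are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: Peak and keepInRange are hypothetical names
// and are not part of the spectra-cluster API.
class MzRangeFilterSketch {

    /** Minimal peak representation: an m/z value and an intensity. */
    static final class Peak {
        final float mz;
        final float intensity;

        Peak(float mz, float intensity) {
            this.mz = mz;
            this.intensity = intensity;
        }
    }

    /**
     * Returns a new list holding only the peaks with minMz <= m/z <= maxMz.
     * Treating both boundaries as inclusive is an assumption of this sketch.
     */
    static List<Peak> keepInRange(List<Peak> peaks, float minMz, float maxMz) {
        List<Peak> kept = new ArrayList<Peak>();
        for (Peak peak : peaks) {
            if (peak.mz >= minMz && peak.mz <= maxMz) {
                kept.add(peak);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Peak> spectrum = Arrays.asList(
                new Peak(110.07f, 500f),   // below the 200 m/z cut-off
                new Peak(250.10f, 1200f),
                new Peak(875.42f, 300f));

        // Lower bound of 200 m/z as in the list entry above; no upper bound.
        List<Peak> filtered = keepInRange(spectrum, 200f, Float.MAX_VALUE);
        for (Peak peak : filtered) {
            System.out.println(peak.mz + "\t" + peak.intensity);
        }
    }
}
```

The sketch returns a new list instead of modifying its input, so the unfiltered spectrum stays available for later steps.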
You will need to have Maven installed in order to build and use the spectra-cluster library.
Add the following snippets to your Maven pom file:
<!-- spectra-cluster dependency -->
<dependency>
<groupId>uk.ac.ebi.pride.spectracluster</groupId>
<artifactId>spectra-cluster</artifactId>
<version>${current.version}</version>
</dependency>
<!-- EBI repo -->
<repository>
<id>pst-release</id>
<url>http://www.ebi.ac.uk/Tools/maven/repos/content/repositories/pst-release</url>
</repository>
<!-- EBI SNAPSHOT repo -->
<snapshotRepository>
<id>pst-snapshots</id>
<url>http://www.ebi.ac.uk/Tools/maven/repos/content/repositories/pst-snapshots</url>
</snapshotRepository>
The clustering process itself is done by a clustering engine. The following examples use the implementations used for PRIDE Cluster.
float WINDOW_SIZE = 4.0F;
float FRAGMENT_TOLERANCE = 0.5F;
double CLUSTERING_PRECISION = 0.01;
/**
* This creates an incremental clustering engine that
* uses the CombinedFisherIntensityTest with a fragment
* ion tolerance of 0.5 m/z as the similarity metric. The
* ClusterComparator is only used for sorting of the clusters
* during the clustering process. The WINDOW_SIZE of 4.0 m/z
* means that as soon as a new cluster is added, any cluster
* whose average precursor m/z is more than 4.0 m/z below that
* of the newly added cluster is automatically returned during
* the clustering process. The CLUSTERING_PRECISION is the defined
* accuracy for the clustering process (benchmarked on the
* PRIDE Cluster test dataset). Finally, the FractionTICPeakFunction
* is a peak filter function that is applied to every spectrum
* before comparison (in this case all peaks that represent
* 50% of the total ion current, but a minimum of 20 peaks).
* For consensus spectrum building, the complete unfiltered
* spectrum is used.
*/
IIncrementalClusteringEngine clusteringEngine = new GreedyIncrementalClusteringEngine(
new CombinedFisherIntensityTest(FRAGMENT_TOLERANCE),
ClusterComparator.INSTANCE,
WINDOW_SIZE,
CLUSTERING_PRECISION,
new FractionTICPeakFunction(0.5f, 20));
// during clustering the clusters must be sorted
// according to precursor m/z. Otherwise an
// exception is thrown
for (ICluster clusterToAdd : clusterIterable) {
// clusters are simply added through the 'addClusterIncremental'
// function. Clusters that have a lower precursor m/z
// than the added cluster (based on the set window size)
// are returned.
Collection<ICluster> removedClusters = clusteringEngine.addClusterIncremental(clusterToAdd);
if (!removedClusters.isEmpty()) {
// use some method to save the removed and thereby
// "final" clusters
writeOutClusters(removedClusters);
}
}
// after all spectra were clustered, save the finally
// remaining clusters still stored in the clustering
// engine
Collection<ICluster> clusters = clusteringEngine.getClusters();
writeOutClusters(clusters);
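The window behaviour described in the comment above can be sketched in isolation. The following self-contained example is not the GreedyIncrementalClusteringEngine: it does no similarity scoring or merging, and it represents each cluster by a single fixed precursor m/z value. The `SlidingWindowSketch` class, its `add` and `remaining` methods, and the choice of `IllegalArgumentException` for unsorted input are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only: hypothetical class, not part of the spectra-cluster API.
class SlidingWindowSketch {

    private final double windowSize;
    // Precursor m/z values of the clusters still inside the window, in ascending order.
    private final Deque<Double> active = new ArrayDeque<Double>();

    SlidingWindowSketch(double windowSize) {
        this.windowSize = windowSize;
    }

    /**
     * Adds a cluster (represented by its precursor m/z) and returns every
     * cluster whose precursor m/z lies more than windowSize below it.
     */
    List<Double> add(double precursorMz) {
        if (!active.isEmpty() && precursorMz < active.peekLast()) {
            throw new IllegalArgumentException("input must be sorted by precursor m/z");
        }
        List<Double> released = new ArrayList<Double>();
        while (!active.isEmpty() && active.peekFirst() < precursorMz - windowSize) {
            released.add(active.pollFirst());
        }
        active.addLast(precursorMz);
        return released;
    }

    /** Returns the clusters still held in the window. */
    List<Double> remaining() {
        return new ArrayList<Double>(active);
    }

    public static void main(String[] args) {
        SlidingWindowSketch window = new SlidingWindowSketch(4.0);
        double[] sortedPrecursorMz = {400.0, 401.5, 403.9, 405.0, 410.2};

        for (double mz : sortedPrecursorMz) {
            List<Double> released = window.add(mz);
            System.out.println("added " + mz + ", released " + released);
        }
        System.out.println("remaining " + window.remaining());
    }
}
```

Because the input is sorted, a cluster that has dropped out of the window can never be matched by a later cluster, which is why the returned clusters can be written out immediately as final.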
If you have questions or need additional help, please contact the PRIDE help desk at the EBI.
email: pride-support@ebi.ac.uk
Please give us your feedback, including error reports, suggestions for improvements, and new feature requests. You can do so by opening a new issue in our issues section.
Please cite this library using one of the following publications:
- Griss J, et al. Recognizing millions of consistently unidentified spectra across hundreds of shotgun proteomics datasets. Nature methods. 2016; doi:10.1038/nmeth.3902
- Griss J, Foster JM, Hermjakob H, Vizcaíno JA. PRIDE Cluster: building the consensus of proteomics data. Nature methods. 2013;10(2):95-96. doi:10.1038/nmeth.2343.
We welcome all contributions submitted as pull requests.
This project is available under the Apache 2 open source software (OSS) license.