Google Season of Docs 2022 Proposal: Restructuring OpenMS Developer Documentation (OpenMS ReDevDoc)
OpenMS is a cross-platform (Linux, Windows, macOS) software framework built around a core open-source C++ library that implements all data structures and algorithms required for mass spectrometry (MS) data analysis. It was first released in 2006, is currently at release 2.8, and is licensed under the three-clause BSD license. Our contributors are computational mass spectrometrists, bioinformaticians, and data scientists. Among our users are large data repositories like MassIVE and PRIDE, individuals who value the flexibility and open-source nature of OpenMS, and companies in need of customized solutions. OpenMS was cited in ~1,200 scientific publications in 2021 according to Google Scholar, and its tools were used ~3,000 times per month by unique users in the last quarter of 2021. We have seen strong interest in our workshops, and in recent years we have trained between 80 and 120 participants annually (2020 and 2021 numbers are lower due to COVID-19 travel restrictions). Several downstream tools, including MSstats, aLFQ, QCloud2, and Skyline, integrate with OpenMS.
Mass spectrometry is a sensitive high-throughput technique capable of quantitative and qualitative measurement of the molecules of life, including nucleic acids, proteins, and metabolites. For example, direct protein interactions between SARS-CoV-2 and human proteins are essential for virulence (such as the viral spike protein and human ACE2); mass spectrometry has produced a full map of 332 pairwise protein interactions. OpenMS is used not only for proteins but also for metabolites, and it is used for more specialized purposes like analysis of cross-linked molecules (both protein-protein and protein-nucleic acid) and nucleic acid identification and quantification.
OpenMS uses modern object-oriented, template-based C++, and extensive documentation is available for several thousand C++ functions of the public API. All development is performed in the open on GitHub. Besides C++, most of the functionality of OpenMS is also available from Python through pyOpenMS. These Python bindings allow for rapid prototyping and offer aspiring developers an easier route to get acquainted with OpenMS, but they also create a documentation issue: classes are documented in C++ using Doxygen and (partly duplicated) in Python using docstrings.
Besides duplicated class documentation, there are also two tutorials for software developers, one for C++ (using Doxygen) and one for pyOpenMS (using readthedocs). The C++ tutorial and documentation have not been updated for the last few OpenMS releases, whereas the pyOpenMS ones were updated recently with a 2021 grant from Season of Docs. As a result, the documentation and tutorials have diverged considerably. This divergence makes it difficult for new developers to make the step from pyOpenMS (easier to learn) to C++ (more efficient in runtime and memory, with long-term support). In addition, the C++ tutorial focuses on proteomics, and metabolomics is completely absent. Thanks to the 2021 GSoD project, there is a metabolomics workflow in the pyOpenMS documentation. Metabolomics is an important use case for OpenMS, and its importance in biomedicine is growing rapidly.
The OpenMS project (code-named OpenMS ReDevDoc) will:
- Review the current C++ tutorial and update it to the latest release of OpenMS, currently 2.8 with 3.0 in the works.
- Make the C++ documentation similar to the pyOpenMS documentation by porting it to reStructuredText.
- Improve the maintainability of the C++ documentation by combining Doxygen, Breathe and Sphinx in a CMake script.
- Translate the pyOpenMS metabolomics workflow to a C++ tutorial example including KNIME and TOPPAS.
- Create a “cheat sheet” that documents the main classes and methods.
- Make a documentation landing page that summarizes all available documentation, both on the C++ and the Python side.
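As a rough illustration of how Doxygen, Breathe and Sphinx fit together, the following is a minimal sketch of a Sphinx `conf.py`. It assumes Doxygen has been run with `GENERATE_XML = YES`; the project name and the XML output path are placeholders chosen for this example and are not taken from the OpenMS repository. In the planned setup, a CMake script would presumably fill in the path at configure time.

```python
# Sphinx conf.py sketch. The project name and paths are hypothetical
# placeholders, not values from the OpenMS build system.

project = "OpenMS"  # assumed Sphinx project title

# Breathe is the Sphinx extension that reads Doxygen's XML output.
extensions = ["breathe"]

# Map a Breathe project name to the directory containing Doxygen's XML output.
# Doxygen only writes this directory when GENERATE_XML = YES is set.
breathe_projects = {"OpenMS": "build/doxygen/xml"}  # placeholder path

# Project used by Breathe directives that do not name one explicitly.
breathe_default_project = "OpenMS"
```

With a configuration along these lines, a reStructuredText page can pull in Doxygen-documented C++ classes through Breathe directives such as `.. doxygenclass::`, so the API text is written once in the C++ headers and rendered in the same Sphinx site as the tutorials.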
Work that is out-of-scope of the OpenMS ReDevDoc project:
- Substantial changes to the Python API and pyOpenMS tutorial documentation
The core OpenMS developers Hannes Röst, Axel Walter, Timo Sachsenberg, and Tjeerd Dijkstra have committed to supervising the OpenMS ReDevDoc project. We have identified Rahul Agrawal, who worked on last year's GSoD project, as the technical writer.
OpenMS receives an average of 300 pull requests per year to add or update classes or functions. Most of these pull requests are from the core OpenMS developers; only 20 of them were from new developers in 2021. We believe that improved tutorials will result in more pull requests from new developers. Also, after successful completion, metabolomics workflows will be documented both in pyOpenMS and in KNIME/TOPPAS. We would consider the OpenMS ReDevDoc project successful if, after publication of the new documentation:
- The yearly number of pull requests by new developers increases by 50% in 2023
- The number of questions (on GitHub and Gitter) about metabolomics workflows increases by 25% in 2023
In 2021 we were successful in the GSoD program, and Rahul Agrawal improved the documentation on the Python side. In detail, he documented over 300 classes across 22 pull requests, made the readthedocs code snippets live by integrating them with Binder, and wrote a tutorial showing how to integrate pyOpenMS with machine learning methods.
Furthermore, we have been highly active in mentoring GSoC students and participated over three summers under the umbrella of the Open Bioinformatics Foundation (OBF). All students successfully completed GSoC, and two of them continued contributing to OpenMS. In 2017, a student added algorithms for high-resolution isotope generators. In 2018, a student improved the estimation of error probabilities and extended the project into a master's thesis. In 2020, we mentored two students: the first automatically generated an OpenMS R package from our Python bindings, and the second developed a novel tool for protein database suitability estimation. For every student, we follow the same principles: provide a friendly atmosphere on equal footing, hold frequent meetings, and work closely together to achieve our joint goal of improving open-source software and lowering the barrier to entry for beginners.
Reference: GSoC 2017 mentors: Timo Sachsenberg, Julianus Pfeuffer, Artem Tarasov, Oliver Alka https://summerofcode.withgoogle.com/archive/2017/projects/6722516903002112/
GSoC 2018 mentors: Timo Sachsenberg, Julianus Pfeuffer, Oliver Alka https://summerofcode.withgoogle.com/archive/2018/projects/5921078421487616/
GSoC 2020 mentors: Hannes Röst, Timo Sachsenberg, Oliver Alka, Chris Bielow, Julianus Pfeuffer https://summerofcode.withgoogle.com/archive/2020/projects/5403269754519552/ https://summerofcode.withgoogle.com/archive/2020/projects/5225514815455232/
We estimate that this work will take five months to complete.
| Budget item | Amount | Running Total | Notes/justifications |
|---|---|---|---|
| Review and update current C++ documentation and tutorial | 4000.00 | 4000.00 | Two months |
| Improve maintainability and add metabolomics example | 4000.00 | 8000.00 | Two months |
| Cheatsheet, landing page and final report | 2000.00 | 10000.00 | One month |
| Project t-shirts (10 t-shirts) | 200.00 | 10200.00 | Reuse the OpenMS t-shirt design from ASMS 2019 conference |
| TOTAL | 10200.00 | | |
Open an issue on Gitter to contact Tjeerd Dijkstra (@tjeerdijk), Axel Walter (@axelwalter), or Timo Sachsenberg (@timosachsenberg).