
Google Season of Docs 2024 Proposal for a Chatbot for OpenMS Documentation

Tjeerd Dijkstra edited this page Apr 2, 2024 · 11 revisions

Background

OpenMS is an open-source, cross-platform (Linux, Windows, macOS) software framework based on a core C++ library that aims to implement the data structures and algorithms required for mass spectrometry (MS) data analyses. Mass spectrometry is a sensitive high-throughput analytical technique capable of quantitative and qualitative measurement of the molecules of life, including nucleic acids, proteins, and metabolites. OpenMS was first released in 2002 and is currently at release 3.0. As a major milestone, the scientific paper detailing the improvements of OpenMS 3.0 over 2.0 came out this year in Nature Methods. OpenMS was cited in ~1,300 scientific publications in 2023 according to Google Scholar, and its tools were used ~3,000 times per week by unique users in the last quarter of 2023.

OpenMS uses modern object-oriented, template-based C++, and extensive documentation is available for several thousand C++ functions of the public API. All development is performed in the open, using GitHub. The documentation of OpenMS consists of five main parts:

  • the web site, which is the main entry point for OpenMS and was rewritten with partial support from a 2022 Season of Docs grant
  • the OpenMS readthedocs documentation, which targets beginners and was rewritten with partial support from a 2022 Season of Docs grant
  • the pyOpenMS readthedocs documentation, which targets beginning developers and was extended with support from a 2021 Season of Docs grant
  • the OpenSwath readthedocs documentation, which targets users of OpenSwath and has not been updated since 2019
  • the C++ doxygen documentation, which targets developers

Enhancing the OpenMS documentation with a chatbot

While the OpenMS documentation has improved thanks to multiple grants from the Season of Docs program, it has also grown in size, and new users often complain of getting lost in the many pages. As a solution, we propose a chatbot that allows users to ask questions in natural language, such as: "I have a data-independent acquisition data set and want to perform quantitative proteomics. Where should I start in the manual?"

LLMs like ChatGPT or Gemini give highly interactive access to general knowledge available on the internet. Recently, the power of LLMs has been harnessed to improve the documentation interface of bioinformatics software, as in the PRIDE-chatbot. We propose to use the same approach for a chatbot that allows users to query the OpenMS documentation.

OpenMS chatbot scope

The OpenMS chatbot project will:

  • Review the current OpenMS documentation system and propose an architecture for the chatbot.
  • Encode the OpenMS documentation with a sentence transformer.
  • Evaluate the current state of LLMs and propose the most suitable one. For the PRIDE-chatbot, the authors found mistral-7B-instruct-v0.2 to be the best LLM, but as the landscape of LLMs is changing rapidly, we will re-evaluate this choice for this project.
  • Implement the chatbot.
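
The scope items above could fit together as a retrieval step followed by a prompt to the LLM. The following is only a minimal sketch under stated assumptions, not the proposed architecture: the documentation snippets, the function names, and the prompt wording are all hypothetical, and a toy word-count embedding stands in for the sentence transformer so that the example runs with the Python standard library alone.

```python
import math
import re
from collections import Counter

# Hypothetical documentation snippets; the real input would be the OpenMS pages.
DOC_CHUNKS = {
    "openswath": "OpenSwath analyses data-independent acquisition data for quantitative proteomics",
    "pyopenms": "pyOpenMS provides Python bindings for developers who want to script OpenMS",
    "doxygen": "The doxygen pages document the C++ classes and functions of the public API",
}


def embed(text):
    """Toy embedding: lower-cased word counts (assumed stand-in for a sentence transformer)."""
    return Counter(re.findall(r"[a-z0-9+]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_k=1):
    """Return the names of the top_k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda name: cosine(q, embed(chunks[name])), reverse=True)
    return ranked[:top_k]


def build_prompt(question, chunks, top_k=1):
    """Assemble a prompt that grounds the LLM in the retrieved documentation."""
    context = "\n".join(chunks[name] for name in retrieve(question, chunks, top_k))
    return f"Answer using only this documentation:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("Where do I start with data-independent acquisition data?", DOC_CHUNKS))
```

In a real system the `embed` function would be replaced by the chosen sentence transformer and the prompt would be sent to the selected LLM. The retrieval and prompt-assembly logic would stay structurally similar.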

Work that is out-of-scope of the OpenMS chatbot project:

  • Changes to the OpenMS web site and the OpenMS C++ doxygen documentation.

The core OpenMS developers Sam Wein, Matteo Pilz, Timo Sachsenberg, and Tjeerd Dijkstra will supervise the OpenMS chatbot project.

Measuring OpenMS chatbot success

OpenMS receives an average of 300 pull requests per year to add or update classes or functions. Most of these pull requests are from the core OpenMS developers and only 20 of them were from new developers in 2023. We believe that the chatbot will increase the number of pull requests from new developers. We would consider the OpenMS chatbot project successful if:

  • The number of questions posed to the chatbot exceeds 1000 in 2025.
  • The yearly number of pull requests by new developers increases by 50% in 2025.

Project timeline

We estimate that this work will take six months to complete.

Dates     | Action items
May       | Introduction
June      | Encode documentation with a sentence transformer
July      | Evaluate the current state of LLMs
Aug-Sep   | Implement the chatbot
October   | Wrap-up

Project budget

Budget item | Amount | Running Total | Notes/justifications
Introduction | 2500.00 | 2500.00 | One month
Encode documentation with sentence transformer | 2500.00 | 5000.00 | One month
Evaluate current state of LLMs | 2500.00 | 7500.00 | One month
Implement chatbot | 5000.00 | 12500.00 | Two months
Project t-shirts (10 t-shirts) | 200.00 | 12700.00 | Reuse the OpenMS t-shirt design from ASMS 2019 conference
TOTAL | 12700.00 | |

Previous experience with GSoD or GSoC

In 2021 we were successful in the GSoD program (proposal), and Rahul Agrawal improved documentation on the Python side. In detail, he documented over 300 classes distributed over 22 pull requests, made the readthedocs code snippets live by integration with Binder, and wrote a tutorial showing how to integrate pyOpenMS with machine learning methods.

In 2022 we were also successful in the GSoD program (proposal). We supplemented the GSoD funds with funds from the Chan-Zuckerberg Initiative (18k) and from the University of Tübingen (6k). We (1) ported the OpenMS web site, which dated from 2015 and was written in WordPress, to static HTML with a modern theme and (2) added an OpenMS manual for beginning mass spectrometry data analysts, in a format similar to the pyOpenMS manual. This new manual included the PDF tutorial that was used for training purposes and links to video lectures from the University of Tübingen that explain mass spectrometry concepts. A team of three technical writers helped: Rahul Agrawal worked mostly on the new web site, and Tapasweni Patak and Christina Kumar worked mostly on the new manual.

Furthermore, we have been highly active in mentoring GSoC students and participated over three summers under the umbrella of the Open Bioinformatics Foundation (OBF). All students successfully completed GSoC, and two of them continued contributing to OpenMS. In 2017, a student added algorithms for high-resolution isotope generators. In 2018, a student improved estimation of error probabilities and extended the project to a master's thesis. In 2020, we mentored two students. The first student automatically generated an OpenMS R package from our Python bindings, and the second student developed a novel tool for protein database suitability estimation. For every student, we follow the same principles: provide a friendly atmosphere on equal footing, have frequent meetings, and work closely to achieve our joint goal of improving open source software and lowering the barrier of entry for beginners.

Reference: GSoC 2017 mentors: Timo Sachsenberg, Julianus Pfeuffer, Artem Tarasov, Oliver Alka https://summerofcode.withgoogle.com/archive/2017/projects/6722516903002112/

GSoC 2018 mentors: Timo Sachsenberg, Julianus Pfeuffer, Oliver Alka https://summerofcode.withgoogle.com/archive/2018/projects/5921078421487616/

GSoC 2020 mentors: Hannes Röst, Timo Sachsenberg, Oliver Alka, Chris Bielow, Julianus Pfeuffer https://summerofcode.withgoogle.com/archive/2020/projects/5403269754519552/ https://summerofcode.withgoogle.com/archive/2020/projects/5225514815455232/

Contact

Open an issue on Gitter to contact Tjeerd Dijkstra (@tjeerdijk) or Timo Sachsenberg (@timosachsenberg).
