MDAnalysis 1.0 paper initial notes

We are slowly moving towards MDAnalysis 1.0. This release will freeze the API (until MDAnalysis 2.0) and guarantee backwards compatibility throughout 1.x. We want to write a paper that describes the library and also gives the more recent contributors a chance to get proper academic credit. This paper should be suitable to replace the current 2011 and 2016 papers as the sole citation for MDAnalysis in published work.

This wiki page should act as a starting point for gathering ideas and hashing out the broad structure. Once we get started in earnest, we will create a separate repository.

Paper content
Authorship model
Journal

Paper content

Introduction

domain description
- challenges and requirements
history and other similar packages

Design and Structure of the library

Design

Philosophy: what lessons did we learn between the initial release and 1.0, and how did they impact the design?

object oriented
pythonic
interactive
interoperable

Organization

Understanding the code base

core data structures
library structure
- core, topology system, "lib"
- coordinates and topology readers/writers (+ maybe briefly selection writers); note on random access in trajectories
  - trajectory formats, topology formats (table)
  - special topics
    - MemoryReader
    - download from PDB with fetch_mmtf()
- analysis: overview

Capabilities

This should answer the reader's question "What can MDAnalysis do for me?"

(Ordering of topics? First the "developer" oriented ones on AtomGroup etc or rather "user" oriented with analysis first?)

Analysis

Ready-made building blocks with a common API

mention anything already published (ENCORE, water analysis, ...)
highlights? – add anything here that you'd like to write about

Working with MD data: trajectories, topologies, etc

Dealing with MD data in a unified way is the core task in MDAnalysis; therefore, we support many commonly used formats (and some uncommonly used ones, too):

topology formats
trajectory formats
- special topics
  - MemoryReader
  - ChainReader
  - download from PDB with fetch_mmtf()
- possibly some benchmarks on reading/writing?
selection writers (brief)
augmenting trajectories
- auxiliaries
- on-the-fly transformations
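
To make the idea of on-the-fly transformations concrete for the paper, here is a minimal plain-Python sketch of the concept: each transformation is a callable that takes a frame and returns a modified frame, and a stack of them is applied in order as each frame is read. This is only an illustration, not the MDAnalysis implementation; the names `make_shift`, `make_scale`, and `read_frames`, and the representation of a frame as a list of coordinate tuples, are assumptions made for this sketch.

```python
def make_shift(vector):
    """Return a transformation that translates every atom by `vector`."""
    def shift(frame):
        return [tuple(x + dx for x, dx in zip(atom, vector)) for atom in frame]
    return shift


def make_scale(factor):
    """Return a transformation that multiplies every coordinate by `factor`."""
    def scale(frame):
        return [tuple(factor * x for x in atom) for atom in frame]
    return scale


def read_frames(frames, transformations=()):
    """Yield frames one at a time, applying each transformation in order.

    The stored frames are never modified; the transformed copy only
    exists while the frame is being processed.
    """
    for frame in frames:
        for transform in transformations:
            frame = transform(frame)
        yield frame


# Hypothetical two-frame, two-atom "trajectory" used only for illustration.
raw_trajectory = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
]

# Order matters: shift first, then scale.
workflow = [make_shift((1.0, 1.0, 1.0)), make_scale(2.0)]
transformed = list(read_frames(raw_trajectory, workflow))
```

Because the transformations are applied lazily during iteration, downstream analysis code sees already-transformed coordinates without a modified trajectory ever being written to disk, which is the main point this section of the paper would make.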

Atom selection and working with AtomGroups

hierarchy of containers (Atom, AtomGroup, Residue, ResidueGroup, Segment, SegmentGroup) + fragments; indexing, slicing, groupby
selection language
set operations with groups
updating AtomGroup
highlighting some AtomGroup or Residue methods

Interoperability

API compatibility enables conversion between different data structures
converter framework
- ParmEd
- MMTF
- Possibly: RDKit, VMD molfile, mdtraj, ...
Possibly: downstream CLI-based analysis frameworks

Enabling algorithms

What do we use under the hood?

distance calculations
- choose optimal algorithms (e.g., PR #2035 and much of @ayushsuhane's work; see also his notebooks and his GSOC summary)
RMSD (just mention QCPROT)
PBC treatment (hopefully consistent...)
- user interface
- algorithms (and perhaps benchmarks to show why we chose what we chose)
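
As a reminder of what PBC treatment means for distance calculations, the following plain-Python sketch applies the minimum-image convention in an orthorhombic box. It is a textbook illustration under stated assumptions (rectangular box, edge lengths given as `box`), not the algorithm used in MDAnalysis; the function name `minimum_image_distance` is hypothetical.

```python
import math


def minimum_image_distance(a, b, box):
    """Distance between points `a` and `b` in an orthorhombic periodic box.

    `box` holds the three edge lengths (Lx, Ly, Lz). Triclinic boxes
    need a more general treatment and are not handled here.
    """
    squared = 0.0
    for xa, xb, length in zip(a, b, box):
        delta = xb - xa
        # Subtract a whole number of box lengths so that the separation
        # along this axis is at most half a box length in magnitude.
        delta -= length * round(delta / length)
        squared += delta * delta
    return math.sqrt(squared)


point_a = (1.0, 1.0, 1.0)
point_b = (9.0, 1.0, 1.0)
box = (10.0, 10.0, 10.0)

# Naive distance ignores periodicity.
d_plain = math.sqrt(sum((xb - xa) ** 2 for xa, xb in zip(point_a, point_b)))
# Minimum-image distance uses the nearest periodic copy of point_b.
d_pbc = minimum_image_distance(point_a, point_b, box)
```

Here the two points are 8 length units apart inside the box, but only 2 units apart through the periodic boundary, which is the kind of case a consistent PBC treatment has to get right.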

Development process and Community

development (CI, testing, PR/review/merge)
broad community (mailing lists, issue tracker), conferences, workshops
code of conduct
future community plans (@IAlibay: particularly in light of recent discussions on DEI, it would be good to have something devoted to our roadmap for continual improvement of diversity within our community).

Example uses

Highlights from the literature

something that was accomplished with the help of MDAnalysis, e.g., scientific question answered
other packages/tools that use MDAnalysis

@richardjgowers - I think this is where MDA has really changed recently. I'm increasingly seeing cool applications built on top of MDA, which is hopefully because we've made a nice format-agnostic API. So I'd be interested in making this section prominent, a mini-review of MDA apps?

Conclusions

summary and future work and plans

Authorship model

We need to decide on how we want to deal with authorship. A few potential models:

recent core developers (core developers with recent contributions; example: SciPy papers such as the SciPy 1.0 major paper draft, published as doi:10.1038/s41592-019-0686-2 in Nature Methods)
all core developers (past and present) (e.g. MDAnalysis 2011 paper)
anyone who has contributed (example: Astropy 2.0 paper, arXiv:1801.02634; see also Khmer 2.0 and C. Titus Brown's rationale, "Pubwication of software papers, and authorship on them")

Notes on authorship

@orbeckst feels that authorship should require at a minimum

code contributions (commits to develop/master)
participation in paper writing
approving the paper and committing to be accountable for the work (e.g., if you're the author of an analysis module, commit to fixing any issues that might come up)

The widely used ICMJE guidelines on authorship suggest that authorship be based on:

Substantial contributions to the conception or design of the work; or the acquisition, analysis, or interpretation of data for the work; AND
Drafting the work or revising it critically for important intellectual content; AND
Final approval of the version to be published; AND
Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

@orbeckst feels that for software, almost all contributions are important and that it is not always straightforward to measure the impact of code contributions, certainly not by lines of code or number of PRs. See also "Pubwication of software papers, and authorship on them".

@IAlibay feels that the SciPy (Nature Methods) authorship model of including all contributors, while visually separating core from non-core contributors, is attractive, provided that all the authors listed under "SciPy 1.0 Contributors" did indeed get indexed. Including all contributors would be a good way to celebrate our community. Maybe there are SciPy-like alternatives we can consider in which we include all contributors but also highlight the work done by core developers? With regard to the "impact of code contributions", @IAlibay agrees with @orbeckst, particularly because reviews and issue discussions are an important part of what moves MDAnalysis forward.

Journal

The journal prescribes the length of the article.

Some suggestions

Nature Methods: the SciPy paper was published there (doi:10.1038/s41592-019-0686-2)
PLoS Comp Biol (software paper, requires presubmission inquiry; no length limitations; authorship policy requires "Substantial contributions to conception and design, acquisition of data, or analysis and interpretation of data" + more), Open Access
J Chem Theory Comput (no length limitations as far as I can see), immediate Gold Open Access available ($4,000)
J Comp Chem (Software News and Updates, no length limitations(?)) Gold Open Access available
Biophys J (Computational Tools, max 5 pages; see author guidelines (PDF)), Open Access available
Software X (max 6 pages) Open Access
