
Jupyter Notebooks: A Primer for Data Curators

Participants:

Mentor: Susan Borda, University of Michigan (sborda@umich.edu)

Suggested Citation: Bouquin, Daina; Hou, Sophie; Benzing, Matthew; Wilson, Lee. (2019). Jupyter Notebooks: A Primer for Data Curators. Data Curation Network.

An archived version of this primer is available at: Bouquin, Daina; Hou, Sophie; Benzing, Matthew; Wilson, Lee. (2019). Jupyter Notebooks: A Primer for Data Curators. Data Curation Network. Retrieved from the University of Minnesota Digital Conservancy, http://hdl.handle.net/11299/202815.

This work was created as part of the Data Curation Network “Specialized Data Curation” Workshop #1 co-located with the Digital Library Federation (DLF) Forum 2018 in Las Vegas, Nevada on October 17-18, 2018. These workshops have been generously funded by the Institute of Museum and Library Services # RE-85-18-0040-18.

See also: Primers authored by the workshop attendees at DLF: http://datacurationnetwork.org.

Format overview

Topic | Description
File Extension | .ipynb
MIME type | https://jupyter.readthedocs.io/en/latest/reference/mimetype.html
Structure | Browser-rendered composite digital asset: Notebook file (.ipynb); Notebook app; kernel
Versions | 4.0.0 - 5.7.0 (previously IPython Notebook)
Primary fields or areas of use | Not discipline-specific; can be used by anyone who writes code in a language with a supported kernel
Source and affiliation | Project Jupyter
Metadata standards | Codemeta; CFF; Jisc/SSI Guidance; discipline-specific keywords
Tools for curation review | nbconvert, nbviewer, repo2docker; CodeMeta crosswalks; CodeMeta tools; CFF tools
Date created | February 1, 2019
Created by | Daina Bouquin (daina.bouquin@cfa.harvard.edu); Sophie Hou (hou@ucar.edu); Matthew Benzing (matt.benzing@miamioh.edu); Lee Wilson (lee.wilson@ace-net.ca)
Date updated and summary of changes made | February 1, 2019

Description of format

Background

Jupyter Notebooks are composite digital objects used to develop, share, view, and execute interspersed, interlinked, and interactive documentation, equations, visualizations, and code. Researchers seeking to deposit software, in this case Jupyter Notebooks, in repositories do so with the expectation that repositories will provide documentation explaining "what you can deposit, the supported file formats for deposits, what metadata you may need to provide, how to provide this metadata and what happens after you make your deposit" (Jackson, 2018a). This expectation is not necessarily met by repositories that currently accept software deposits and complex objects like Jupyter Notebooks. This guide is meant both to inform curatorial practices around Jupyter Notebooks and to support the development of resources that meet researchers' expectations, so that software remains available long-term in curated archival repositories. Guidance provided by Jisc (1) and the Software Sustainability Institute (2) outlines three different kinds of software deposits: a minimal deposit, a runnable deposit, and a comprehensive deposit (Jackson, 2018b). This primer follows the same conceptual framework in dealing with Jupyter Notebooks, which, even in their static, non-executable form, can be used to document how scientific research was carried out or serve as teaching models, among many other use cases.

Jupyter Notebook Format Description

A Jupyter Notebook is a file used in conjunction with a suite of tools that allow users to create and share documents that contain runnable code, equations, data visualizations, and other interactive material. While Python is the most common language associated with Jupyter Notebooks, they can be used with code written in over 40 different programming languages. Jupyter Notebooks' versatility enables them to be used in any number of disciplines and for various purposes, and while they are very popular in the sciences, they are also used in the social sciences and the humanities. Because Jupyter Notebooks are meant to be interactive and are constructed using a multitude of programming and spoken languages, they are especially challenging for curators to work with. Any curation and archiving activity needs to be done in such a way as to not inhibit a future user's need to adapt the code contained within the Notebook file. Similarly, when a future user extracts deposited Notebook files, metadata, and supplemental material from the archive, curation and archiving activities should have had no degrading influence on the level of functionality that a depositor enabled with their initial deposit. For example, rather than zipping files on the depositor's behalf, it is preferable for curators to request that depositors pack and unpack their content prior to making their deposit, so that they can check that files function as intended when unpacked.

To open a Jupyter Notebook file, a curator would need to have installed Python and Jupyter (using either pip or Anaconda(3)) and be familiar with using the Terminal (Mac/Linux), Command Prompt, or Bash (Windows).(4) Once opened, Jupyter Notebooks have a browser-rendered user interface composed of "cells" and clickable buttons to execute tasks. A cell is a multiline text input field where a user can enter and execute code or a markup language called Markdown. Markdown handles text formatting, linking, and the display of images. Behind the Notebook cells is a kernel that runs the processes needed for each cell to function. Code cells often require dependencies and specific input parameters, and may be run in any order, which is both a strength and a weakness.(5)
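The cell and kernel structure described above is stored in the .ipynb file as JSON, so a curator can inspect a Notebook without launching Jupyter. The following is a minimal sketch, assuming the version 4 notebook layout (top-level `cells`, `metadata`, `nbformat`, and `nbformat_minor` keys); the function name `summarize_notebook`, the in-memory example, and the file name in the comment are hypothetical. The top-to-bottom check is only a heuristic based on each code cell's recorded `execution_count`; it cannot detect cells that were rerun or deleted.

```python
import json

def summarize_notebook(nb: dict) -> dict:
    """Summarize a parsed .ipynb document (nbformat 4 layout assumed)."""
    cells = nb.get("cells", [])
    code_cells = [c for c in cells if c.get("cell_type") == "code"]
    markdown_cells = [c for c in cells if c.get("cell_type") == "markdown"]
    # execution_count records the order in which the kernel ran each code
    # cell; None means the cell has no recorded run.
    counts = [c.get("execution_count") for c in code_cells]
    executed = [n for n in counts if n is not None]
    kernelspec = nb.get("metadata", {}).get("kernelspec", {})
    return {
        "nbformat": f'{nb.get("nbformat")}.{nb.get("nbformat_minor")}',
        "kernel": kernelspec.get("name"),
        "code_cells": len(code_cells),
        "markdown_cells": len(markdown_cells),
        "unexecuted_code_cells": counts.count(None),
        # Heuristic: counts that increase down the page suggest the cells
        # were run in document order.
        "ran_top_to_bottom": executed == sorted(executed),
    }

# Hypothetical in-memory Notebook; a real one would be loaded with
#   with open("analysis.ipynb", encoding="utf-8") as f: nb = json.load(f)
example = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {"kernelspec": {"name": "python3", "language": "python"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# Title"},
        {"cell_type": "code", "metadata": {}, "execution_count": 2,
         "outputs": [], "source": "y = x + 1"},
        {"cell_type": "code", "metadata": {}, "execution_count": 1,
         "outputs": [], "source": "x = 1"},
    ],
}

summary = summarize_notebook(example)
print(json.dumps(summary, indent=2))
```

In this example the second code cell was run before the first, so `ran_top_to_bottom` is false, which illustrates why arbitrary execution order is a weakness for reproducibility.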

Once rendered in the user's browser, a Notebook can be exported in the following formats:

  • Notebook (.ipynb)
  • Python (.py)
  • HTML (.html)
  • Markdown (.md)
  • reST (.rst)
  • PDF via LaTeX (.pdf)
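The same exports can also be produced outside the browser with the nbconvert command-line tool. The sketch below only builds the command lines and does not run them; it assumes that Jupyter and nbconvert are installed and on the PATH, and that a LaTeX distribution is available for the PDF target. The file name `analysis.ipynb` and the helper `nbconvert_command` are hypothetical.

```python
# Export formats from the list above, mapped to nbconvert's --to values.
# "script" exports to the kernel's language, which is .py for Python kernels.
EXPORT_TARGETS = {
    "Notebook (.ipynb)": "notebook",
    "Python (.py)": "script",
    "HTML (.html)": "html",
    "Markdown (.md)": "markdown",
    "reST (.rst)": "rst",
    "PDF via LaTeX (.pdf)": "pdf",
}

def nbconvert_command(notebook_path: str, target: str) -> list:
    """Build (but do not run) an nbconvert command for one export target."""
    return ["jupyter", "nbconvert", "--to", target, notebook_path]

commands = [
    nbconvert_command("analysis.ipynb", target)
    for target in EXPORT_TARGETS.values()
]

for label, command in zip(EXPORT_TARGETS, commands):
    print(f"{label}: {' '.join(command)}")
```

Each command list could be passed to `subprocess.run(command, check=True)` on a machine where Jupyter is installed.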

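The following are useful tools for working with Jupyter Notebook files and curating metadata associated with them, as listed in the format overview above:

  • nbviewer (6)
  • nbconvert (7)
  • repo2docker
  • CodeMeta crosswalks and CodeMeta tools (9)
  • CFF tools (10)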

Deposit Requirements

The following elements outline recommendations for repositories accepting Jupyter Notebook submissions. Minimally required files and metadata will support the ability to open and cite the Notebook, but additional functionality should not be expected without requiring additional files and more comprehensive metadata.

File Requirements:

  • Minimally required files:
    • .ipynb (cells run with results viewable)
    • README (.txt or .md)
    • LICENSE (.txt or .md)
  • Additional files to request:
    • PDF of the Jupyter Notebook (export from Jupyter web application or nbviewer)
    • reST export of the Jupyter Notebook (export from Jupyter web application)
    • CodeMeta.json
    • CITATION.cff
    • Sample datasets and documentation (see below)
    • Container metafile (e.g. docker, singularity, reprozip)
      • Can be created using jupyter-repo2docker
      • Can be published separately with execution instructions; link this to the Jupyter Notebook record
    • Release of the full repository of files associated with .ipynb when applicable
      • Recommend minting a software DOI for the code repository (Fenner et al., 2018)
      • Provide guidance on how to mint a software DOI (e.g. assigning a software DOI via Zenodo (11))
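A check for the minimally required files listed above can be automated at intake. This is a rough sketch, assuming the deposit has been unpacked into a single directory and that file names follow the conventional `README` and `LICENSE` spellings with a .txt or .md extension; the function names and the temporary demo deposit are hypothetical. It does not verify that the Notebook's cells were actually run.

```python
from pathlib import Path
import tempfile

def check_minimal_files(deposit_dir) -> dict:
    """Report which minimally required files are present in a deposit."""
    names = [p.name.lower() for p in Path(deposit_dir).iterdir() if p.is_file()]
    return {
        "notebook": any(n.endswith(".ipynb") for n in names),
        "readme": any(n in ("readme.txt", "readme.md") for n in names),
        "license": any(n in ("license.txt", "license.md") for n in names),
    }

def missing_minimal_files(deposit_dir) -> list:
    """List the required file types that were not found."""
    return [kind for kind, found in check_minimal_files(deposit_dir).items()
            if not found]

# Demo on a throwaway deposit that lacks a LICENSE file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "analysis.ipynb").write_text("{}", encoding="utf-8")
    (Path(tmp) / "README.md").write_text("What this Notebook is for.",
                                         encoding="utf-8")
    demo_missing = missing_minimal_files(tmp)

print(demo_missing)  # ['license']
```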

Metadata Requirements:

  • Minimal submission: baseline description; enables user to view and cite the Notebook

    • Jupyter Notebook title
    • Author(s)
    • Jupyter implementation details
      • Jupyter version
      • Distribution (e.g. Anaconda)
      • Kernel version
    • README
      • Documents what the Jupyter Notebook is for
      • Request that this file include citation(s) to third-party algorithms and analyses
      • Recommend code comments within the Notebook file itself in addition to the README file
    • Alternate identifiers and supplemental links associated with the Notebook
    • License information
  • Runnable submission: allows another researcher to execute the Notebook locally using sample data and files provided by the depositor (12); minimal submission metadata plus:

    • User documentation
      • Instructions to support configuration needed to execute the Notebook and code cells
      • Sample input and output files
    • CodeMeta.json
      • Document required software dependencies
      • Recommend additional machine actionable dependency documentation (e.g. requirements.txt)
    • CITATION.cff for the Notebook
      • Preferred citation; should enable native software citation
  • Comprehensive submission: minimal and runnable requirements plus:

    • Developer documentation
      • Include test code and description of expected results
    • Narrative description of how the code implemented in the Notebook works and what it does
    • Documentation about the computing ecosystem (e.g. CodeMeta.json: targetProduct, processorRequirements)
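The three cumulative tiers above can be expressed as a simple set comparison. The sketch below is illustrative only: the field keys (such as `jupyter_version` or `citation_cff`) are hypothetical labels for the items in the list, not an established schema, and alternate identifiers are left out of the minimal set on the assumption that they apply only when they exist.

```python
# Hypothetical keys standing in for the metadata items listed above.
MINIMAL = {
    "title", "authors", "jupyter_version", "distribution",
    "kernel_version", "readme", "license",
}
RUNNABLE_EXTRA = {
    "user_documentation", "sample_input_output",
    "codemeta_json", "citation_cff",
}
COMPREHENSIVE_EXTRA = {
    "developer_documentation", "test_code",
    "narrative_description", "computing_ecosystem",
}

def classify_submission(provided: set) -> str:
    """Return the highest tier whose requirements are all present."""
    if not MINIMAL <= provided:
        return "incomplete"
    if not RUNNABLE_EXTRA <= provided:
        return "minimal"
    if not COMPREHENSIVE_EXTRA <= provided:
        return "runnable"
    return "comprehensive"

def missing_for_runnable(provided: set) -> list:
    """List what is still needed to reach the runnable tier."""
    return sorted((MINIMAL | RUNNABLE_EXTRA) - provided)

# Demo: a deposit with all minimal metadata plus part of the runnable set.
provided = MINIMAL | {"user_documentation", "codemeta_json"}
tier = classify_submission(provided)
gaps = missing_for_runnable(provided)
print(tier, gaps)
```

A report like `gaps` gives the curator a concrete list to send back to the depositor when the expected functionality exceeds what the submitted metadata supports.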

Key Curatorial Questions

Once a decision has been made to accept and curate Jupyter Notebook submissions in an archival repository, the following questions should be considered with each submission:

  1. What are the depositor's expectations for the Notebook's future functionality once the deposited files are exported from the archival repository?
  2. Does the submission include minimally required files and metadata to enable the expected functionality?
  3. Is the Notebook self-contained?
  4. Is the Notebook a standalone object or one of many products resulting from a project?
    • Examples:
    • Were supplemental files deposited along with the Notebook?
      • Is information about supplemental files included within the Notebook or in separate files?
      • If separate files, can those files be opened and read?
    • Are there multiple Notebooks in the deposit?
      • If multiple Notebooks were deposited together, do they require different metadata to meet the depositor's functionality expectations?
  5. What are the technical characteristics of the Notebook? Including:
    • File size
    • Availability of alternate format(s)
    • Availability of additional copies
  6. Who is the intended user community?
  7. Are there any specific search, discovery, and/or access needs?
  8. Are there any specific usage metrics requirements?
  9. Is the Notebook expected to be replaced or updated by a newer version at a later date?
  10. Is the Notebook peer-reviewed?
  11. Are there any confidentiality/ethics concerns associated with the Notebook?
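When assessing whether a Notebook is self-contained (question 3), one starting point is to list the modules its code cells import and compare them with the dependency documentation supplied by the depositor. The following sketch applies only to Notebooks with a Python kernel and assumes the version 4 JSON layout; the function name `imported_modules` and the example Notebook are hypothetical. It ignores IPython magics and shell escapes, skips cells that do not parse as Python, and cannot tell whether a listed module is actually installable.

```python
import ast

def imported_modules(nb: dict) -> list:
    """Collect top-level module names imported in a Notebook's code cells."""
    modules = set()
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # source may be stored as a list of lines
            source = "".join(source)
        # Drop IPython magics (%...) and shell escapes (!...), which are
        # not valid Python syntax.
        lines = [ln for ln in source.splitlines()
                 if not ln.lstrip().startswith(("%", "!"))]
        try:
            tree = ast.parse("\n".join(lines))
        except SyntaxError:
            continue  # skip cells that still do not parse
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(a.name.split(".")[0] for a in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return sorted(modules)

# Hypothetical Notebook content for demonstration.
example = {
    "cells": [
        {"cell_type": "code", "source": [
            "%matplotlib inline\n",
            "import numpy as np\n",
            "from pandas.io import json as pj\n",
        ]},
        {"cell_type": "markdown", "source": "import this_is_prose"},
        {"cell_type": "code", "source": "import os"},
    ]
}

found = imported_modules(example)
print(found)  # ['numpy', 'os', 'pandas']
```

For the technical characteristics in question 5, the file size can be read with `Path(path).stat().st_size` from the standard library.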

Decision Trees

(view online)

The following decision trees (15) illustrate questions and actions that should be considered when determining whether or not to accept a Jupyter Notebook submission into a particular repository, as well as key questions curators should consider when evaluating Jupyter Notebook submissions.

Repository Suitability

*https://datacurationnetwork.org/home/resources/
**http://hdl.handle.net/11299/202815

Curatorial Activities

Additional Recommended Reading

References

Fenner, M., Katz, D. S., Nielsen, L. H., & Smith, A. (2018, May 17). DOI Registrations for Software. DataCite Blog. https://doi.org/10.5438/1nmy-9902

Jackson, M. (2018a). Software Deposit: How to deposit software (Version 1.0). Zenodo. http://doi.org/10.5281/zenodo.1327327

Jackson, M. (2018b). Software Deposit: What to deposit (Version 1.0). Zenodo. http://doi.org/10.5281/zenodo.1327325

End Notes

1 https://www.jisc.ac.uk/

2 https://www.software.ac.uk/

3 https://jupyter.org/install

4 https://jupyter.readthedocs.io/en/latest/running.html#running

5 https://bit.ly/2Tw2aIo

6 https://github.com/jupyter/nbviewer

7 https://github.com/jupyter/nbconvert

8 https://github.com/jupyter/nbconvert

9 https://codemeta.github.io/tools/

10 https://citation-file-format.github.io/#/tools

11 https://guides.github.com/activities/citable-code/

12 This assumes the Notebook is self-contained. How to best archive Notebooks that are not self-contained is an unresolved issue.

13 https://bit.ly/2sBF3jH

14 https://arxiv.org/abs/1810.06559

15 https://www.lucidchart.com/documents/view/4848c483-1267-499c-9172-3a2782abfaaf/0
