
Welcome to this twenty-fourth edition of the Tracking Jupyter newsletter (#TJ24).

Since getting out of the weekly / fortnightly cycle, I'm noticing how quickly some of the "news" items I queue up are dating. Which has got me wondering about what sort of form, style and content an erratically occasional newsletter should take...

News and Announcements

Keeping up with its monthly releases, IPython v7.10 is now out. Support for Python 3.5 has been dropped.

I belatedly note that the November Python extension update for VS Code is long since out [and the December one is probably due to land any time...—Ed.] bringing Altair chart support and line numbering within code blocks.

If you pay for your editor, the PyCharm 2019.3 release is out now, offering widget support and runtime code completion in the Jupyter views.

Even as VS Code increasingly lets you edit Jupyter documents inside that environment, the Jupytext Jupyter extension goes the other way, letting you edit Python and Markdown documents, among other things, in a notebook UI. It recently bumped up to v1.3, with changes including improved UI tools for pairing against multiple formats, multi-line comments in Python scripts, and improved ways of handling raw cells in Markdown and encoding cell metadata. Supported formats now also include PowerShell, Rust and Robot Framework, the latter two community contributions. Jupytext also plays nicely with things like Papermill (parameterised notebook execution) as described in this post on Automated reports with Jupyter Notebooks using Jupytext and Papermill. A Vim plugin for editing Jupyter notebook (ipynb) files via the jupytext percent format, jupycent, also appeared recently, suggesting that an ecosystem is now developing around Jupytext [jupyter-book and nbsphinx already make use of Jupytext, for example...—Ed.].
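If you haven't seen the percent format before, here's roughly what the paired Python twin of a notebook looks like (contents illustrative); pairing can be switched on from the command line with something like jupytext --set-formats ipynb,py:percent notebook.ipynb, if I'm reading the docs right:

```python
# notebook.py: the py:percent twin of notebook.ipynb. Markdown cells
# round-trip as commented blocks; code cells are delimited by "# %%".

# %% [markdown]
# # A worked example
# This text is stored as a markdown cell in the .ipynb file.

# %%
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})
df.describe()
```

Because the paired file is just a script, it diffs cleanly under version control, which is rather the point.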

If you prefer notebook-lite UIs, nteract is now available in the browser, as a Jupyter server extension. So now, as well as JupyterLab /lab views down the server path, you can also have /nteract views.

And if you want to access remote Databricks GPUs from your local JupyterLab server, it seems you now can...

Great to see an announcement from the Chan Zuckerberg Initiative's Essential Open Source Software for Science program that includes funding for a JupyterHub Contributor in Residence program, sustainability projects on scaling OpenRefine and containerized reproducible analyses with Rocker, and development projects ensuring the continued growth of pandas, developing matplotlib as the foundation of scientific visualization in Python and developing scipy as a solid foundation for statistics in Python.

If you're looking for funding yourself, the call for proposals for Jupyter Community Workshops, Jan-Aug 2020 (supported by Bloomberg and AWS), is open until Sunday, December 15th, 2019.

Or maybe a new job? If so, and you fancy running a hosted notebook service for the UK education sector (schools, FE, HE), Edina at the University of Edinburgh are looking for a service manager to run the noteable hosted/cloud notebooks service.

At their annual re:Invent event, Amazon have made their traditional slew of AWS related announcements. In particular, Amazon SageMaker Studio was presented as "the first fully integrated development environment (IDE) for machine learning (ML)" that "unifies all the tools needed for ML development: write code, track experiments, visualize data, and perform debugging and monitoring". Make of that what you will, but note this — the first-class UI to the Studio appears to be JupyterLab. In the first case, "with Amazon SageMaker Notebooks, you can enjoy an enhanced notebook experience that lets you easily create and share Jupyter notebooks"; in the second, custom tabs inside the JupyterLab experience offer experiment tracking and job monitoring as well as ML model automated builds, debugging and monitoring. [By the by, I also note you can now co-lo AWS hardware... —Ed.]

Part of the rationale for my starting Tracking Jupyter was to help gather items that might be useful for Jupyter advocacy projects [tho' I've spectacularly failed in making any progress on that front in my own institution for the 5th year in a row...—Ed.]. As it is, TJ is maybe a bit too scattergun an approach for that, so the JupyterHub Institutional FAQ may be a better start for anyone looking for pointers on how to start putting a case together and how to start allaying concerns. If folk are swayed by prestigious journals, a Nature article last month described how to "Make code accessible with cloud services" such as MyBinder and Nextjournal [both oft mentioned in Tracking Jupyter —Ed.], as well as Gigantum and Code Ocean, deployers of "container platforms that let researchers run each other’s software — and check the results"; I guess it may help make the case for reproducible computational environments and tease the question "so could we do this locally?". In which case, this slidedeck on OpenShift and Machine Learning at ExxonMobil, which in passing shows how Jupyter notebooks may themselves be used in support of advocacy around a cloud based machine learning stack, may be worth a quick read...

As if Jupytext isn't keeping him busy enough, maintainer Marc Wouts has another new project on the go: itables [repo] offers [yet another? —Ed.] alternative to the range of interactive grid / dataframe viewers available for notebooks and JupyterLab. [I've had a thematic round up on related items languishing in the Tracking Jupyter queue for ages, which I really should try to get round to finishing off for a future issue...—Ed.]

This could be quite handy for those of you who, like me, tend to have more than a few open tabs on the go at any one time: Save-and-Quit functionality for JupyterLab-in-JupyterHub that "allows the user to save all notebooks, stop the container, and log out of the hub". Just the sort of thing for an institutionally deployed JupyterHub, methinks...

Docker may not have given Kitematic any love since they first acquired it, but it's been forked and rebooted in the form of ContainDS (available for Windows and Mac, at least...). Improvements over Kitematic include better support for mounting local directories into a container and clicking through to a notebook server in your browser (the first-run token is managed for you...). Even more impressive is a "local Binder" facility that lets you build and launch containers from remote repositories or a local directory using repo2docker. [Thinking through some ways of integrating nbgitpuller to exploit the idea of "Binder base boxes" could be really interesting here, too... —Ed.]
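For anyone who hasn't met repo2docker directly, the "local Binder" trick boils down to something like the following (the example repo URL is illustrative; ContainDS wraps the equivalent up in a GUI):

```python
# A minimal sketch: shell out to the repo2docker CLI, which builds a
# Docker image from the repo's Binder/REES configuration files and then
# launches a notebook server from the resulting container.
import subprocess

repo = "https://github.com/binder-examples/requirements"  # illustrative repo
subprocess.run(["jupyter-repo2docker", repo], check=True)
```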

A phrase presumably intended to allay, rather than excite, the fears of everyone who's concerned about the ways in which Jupyter user environments are used to develop code, "Use Jupyter Notebooks for everything" was the strapline on the fast.ai blog post announcing nbdev [repo], their workflow and tooling package to help you create "delightful" [is that term back in vogue again?! —Ed.] Python projects using Jupyter notebooks. nbdev allows you to "put all your code, tests and documentation in one place", using an #export flag at the start of a code cell to mark the contents of that cell as exportable to a separate source code file; a limited ability to sync changes made to the code file back into the parent notebook is also possible [there are lessons to be learned from Jupytext about how to do this more smoothly, and completely, methinks? —Ed.]. Tests can be defined and run in situ, and integration with CI tools such as Github Actions is provided. Docs can be generated directly from notebooks, and completed packages built and pushed to PyPI.
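As a flavour of the workflow, here's a sketch of what the docs suggest it looks like, compressed into one listing (cell boundaries marked in comments; the module and function names are made up):

```python
# --- cell 1 of 00_core.ipynb: name the module the flagged cells export to ---
#default_exp core

# --- a later cell: #export marks its contents for export to yourlib/core.py ---
#export
def hello(name):
    "Return a friendly greeting."
    return f"Hello, {name}!"

# --- final cell: write out all flagged cells as Python source files ---
from nbdev.export import notebook2script
notebook2script()
```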

As well as "pro-developer" support, formal proof of code is another area that is arguably lacking in a lot of Jupyter notebook use... But what if you do want to do those proofs in a notebook environment? For any TLA+ aficionados out there, tlaplus_jupyter brings you a Jupyter kernel for TLA⁺. [I didn't know either... So, erm, "a high-level language for modeling programs and systems--especially concurrent and distributed ones"...—Ed.]
 
One of the things I've found myself doing repeatedly over recent years is dropping minimally viable notebook based code fragments and demos into Github gists. So the recent cocode.ai community'n'JupyterLab extension announcement caught my eye, offering as it does a "code snippet search" library that "helps programmers to find the right code snippets ('bricks') that they're looking for and allows them to save bricks for easy access in the future". Bricks are Jupyter notebooks containing the "key code", an example, and some narrative text that provides the ability to "learn more" [think of them as akin to reversed good-answer-then-question Stack Overflow posts...—Ed.]. Bricks form part of a common repository, but can also be created and saved locally. [The demo video on the homepage is worth a watch...—Ed.] It's maybe in beta (you need to sign up), and I've no idea what the business model is... [I'll never make it as a journalist...—Ed.]

Notebook Practice

As with any new technology or medium, the development of uninformed, have-a-go practice often appears in advance of informed practice. If notebooks are to become a thing in academic publishing, we might expect standards to emerge that situate the notebooks in the wider, formalised structure of academic publishing so that they contribute to that system in an appropriate way.

So here are some readings that speak to the current state of notebooks in academic communications.

Once you get into the swing of it, it's easy to hack together code in notebooks, and easy to hack together low quality code. But with so many notebooks out there, how can we even begin to get a sense of code quality, in general, across those notebooks? Better Code, Better Sharing: On the Need of Analyzing Jupyter Notebooks [arXiv:1906.05234] explored the extent to which code in a sample of Jupyter notebooks respected recommended Python programming conventions, the extent to which unused variables appear in them, and whether deprecated library functions are used, identifying the need for the community to enforce good coding styles, improve code quality and reliability, apply best practices for software quality [testing, in other words...—Ed.] and ensure a good balance between text and code. A separate "Large-scale Study About Quality and Reproducibility of Jupyter Notebooks" [DOI: 10.1109/MSR.2019.00077 and PDF], covering 1.4 million notebooks found on Github, asked how literate programming features are used in notebooks [markdown cells are used... —Ed.], how notebooks are named [generally meaningful but not conventional...—Ed.], how they use modules, functions, and classes [hands up if you understand classes...—Ed.], how they are tested [you're having a laugh...—Ed.], whether they are stored with executed cell output [yes...—Ed.], whether any recorded cell execution appears to be in linear order [it's all over the place...—Ed.], and how reproducible the notebooks are, for example, in terms of declared versions of imported libraries [not really...—Ed.].

[Loosely related to code presentation and quality, I've previously made some notes for myself on "Nudging Student Coders into Conforming with the PEP8 Python Style Guide Using Jupyter Notebooks, flake8 and pycodestyle_magic Linters"... —Ed.]
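If you want to try that sort of nudging yourself, the magics are about as simple as it gets (assuming pycodestyle_magic and the flake8/pycodestyle packages are installed; this is a sketch from memory of the API):

```python
# Load the extension, then switch on automatic linting: every cell run
# from now on is checked and any style violations are reported inline.
%load_ext pycodestyle_magic
%pycodestyle_on

x=1  # flagged as E225: missing whitespace around operator

# ...and switch it off again when the nagging gets too much:
%pycodestyle_off
```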

In terms of notebooks that have made it, in some way, into formal publications, "Reproducible Research is more than Publishing Research Artefacts: A Systematic Analysis of Jupyter Notebooks from Research Articles" [arXiv:1905.00092] reviewed five publications from PubMed Central, performing a metadata analysis of the repository that is used [Github, Supplementary Material, Zenodo and GIN...—Ed.] and the source code license [do I need one? Erm, MIT, GPL 3, CC0 1.0?! —Ed.]; and a reproducibility analysis, noting the number of notebooks mentioned compared to published [mostly the same...—Ed.], whether source code artefacts are documented besides the publication [sometimes...—Ed.], where software requirements documentation can be found [all over the place...—Ed.], whether the computing environment is available or can be reconstructed from the documentation [not really...—Ed.], whether the complete raw data of the study is available [it's my data, not yours...—Ed.] and whether any provided Jupyter notebook can be completely re-executed with the same results [no...—Ed.].

Effective citation is also an important part of scholarly communication, so how do you cite a notebook? Jupyter notebooks as discovery mechanisms for open science: Citation practices in the astronomy community [DOI: 10.1109/MCSE.2019.2932067] reports on "a study of references to Jupyter notebooks in astronomy over a 5-year period (2014-2018)", noting that "references increased rapidly, but fewer than half of the references led to Jupyter notebooks that could be located and opened". It also discusses how folk might be able to do it better... [I think... maybe... paywalled... —Ed.]

Environmentally Speaking...

If you're a conda fan, you probably have environments everywhere. So how about a Jupyter kernel for — and from — each of your conda environments [I think?! —Ed.]? That's what nb_conda_kernels offers.
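Under the hood, as I understand it, installing the package amounts to swapping in a conda-aware kernel spec manager, something like this configuration sketch:

```python
# jupyter_notebook_config.py: use nb_conda_kernels' spec manager, which
# advertises a kernel for every conda environment that has a kernel
# package (e.g. ipykernel) installed in it.
c.NotebookApp.kernel_spec_manager_class = (
    "nb_conda_kernels.manager.CondaKernelSpecManager"
)
```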

And here's another twist on conda environments: use them to create different notebook environments, which is to say, notebook environments with different pre-enabled extensions, in this case appropriate for authoring maths chapters: jupyter-forchaps. [Makes me wonder: is there a meta nbextensions configurator that would let you specify a set of enabled configurations as an "extension set" and then associate this with a particular notebook directory path? Those notebooks would then open into an environment with particular extensions enabled, the name/label of the environment being displayed on the right side of the notebook header. —Ed.]

In passing, a handful more things for the maths fans: pyganja offers a visualisation library for geometric algebra with cefpython ["Python bindings for the Chromium Embedded Framework", apparently...—Ed.] and ganja.js, the Javascript Geometric Algebra Generator. It also plays nicely with the clifford numerical geometric algebra Python package. Although from the repo commit stamps, mathbox ("presentation-quality WebGL math graphing") looks like it may be a bit stale, there are some nbviewer rendered notebook demos that look quite pretty. And finally, this may be helpful for instructors wanting to create animated explanatory maths tutorials: jupyter-manim, some cell magic to integrate the manim animation engine developed for just that purpose.
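Usage appears to be a case of import jupyter_manim in one cell and then something like the following in another (the scene is the classic manim hello-world; the imports are the manimlib-era ones, so treat this as a sketch rather than gospel):

```python
%%manim SquareToCircle
from manimlib.imports import *

class SquareToCircle(Scene):
    def construct(self):
        # Morph a square into a circle; the rendered video is embedded
        # in the cell output.
        self.play(Transform(Square(), Circle()))
```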

Ramblings...

A couple of days ago, I idly wondered on the Twitterz about whether there was a way I could connect to a notebook kernel running on Google Colab from my own local notebook server [seems only fair, right? After all, you can connect to a local kernel from Colab, albeit with Colab GPU support, as this post suggests... And it'd mean free GPUs locally, which would be good for edu...—Ed.]. I imagined something like an nb2kg extension [now a native part of Jupyter notebooks...—Ed.] that would let me connect to Colab rather than my own org's enterprise kernel gateway [I wish! Or maybe noteable will offer this sort of thing?—Ed.]. The reality is that it seems it may be possible via an ssh tunnel using something like ngrok [example, perhaps, though I haven't tried it yet...—Ed.]. By chance, a recent post on the Paypal Engineering blog describes their set up for bringing GPU-powered Jupyter Notebooks to their analysts: Jupyter Enterprise Gateway fronting a Kubernetes cluster that spans CPU and GPU machines (see #TJ4 and #TJ8 for more on PayPal's Jupyter notebook extensions).

By chance (and as noted in the news section), I just noticed another post, which appeared a couple of days ago, describing how to access Databricks Spark Workspaces and GPUs from local JupyterLab instances. The recipe is described here: New Databricks Integration for Jupyter Bridges Local and Remote Workflows.

I've also been wondering whether Amazon Ignite, the platform for selling original digital educational resources announced last month, will soon open up a JupyterHub gateway for selling not just notebook based materials, but also the means of delivering them? Or maybe they'd buy something like the notebook powered educative.io, which claims that its "rich, text-based courses with embedded coding environments make learning a breeze". Either way, how might that affect the noteable service roadmap?

Related to this, and something I missed at the time, was an announcement back in September that O'Reilly had added Jupyter notebooks to their online learning platform... And it's not just notebooks — since March, there have also been dashboards that deliver "learning impact data to the enterprise", providing a "single-pane-of-glass view into the learning behaviors of users, generating actionable data for business leaders" and other Wizard of Oz-ery... [Now that's probably what I should be pushing in order to get notebooks adopted in my org... :-(—Ed.]

Continuous Integration and Automated Builds

Sharing reproducible code is one of the motivating reasons behind the Jupyter project, and making computational environments shareable is an important part of that, as applications like BinderHub/MyBinder and the Binder Reproducible Execution Environment Specification (REES) demonstrate. Continuous integration, the art of invisibly getting other computers to build your project (or shareable computational environment...) for you every time you push to Github [it's never been easier to unknowingly trigger huge amounts of computation, bandwidth and energy consumption off the back of every minor typo correction in your README, has it?! —Ed.] is related to this, and is something that's crossing my radar more and more, in part as I learn to automate more and more of my own side-project activities.

MyBinder offers a lazy way of doing this, building a new image, if necessary, when a Binder instance is requested. But if you want to build a Docker image from your repo that you can share publicly on Dockerhub, for example, then you either need to configure a Dockerhub automated build to watch the repo and build a new container when an appropriate commit is made to it, or build an image and push it to Dockerhub using another continuous build service. [By the by, I forget where I saw this, but using {sourceref} as the tag and /^([^m]|.[^a]|..[^s]|...[^t]|....[^e]|.....[^r]|.{0,5}$|.{7,})/ as the source in a Dockerhub build rule will build from branches other than master and tag the image with the branch name. Is there a simpler, approved way of doing this?! —Ed.]
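If you squint at that regex, it reads as "any branch name that isn't exactly master": the first six alternatives each veto one letter position, and the last two veto the length. A quick sanity check (minus the / delimiters Dockerhub uses):

```python
import re

# Matches any branch name except the literal "master": one alternative
# per letter position, plus "shorter than six chars" and "longer than
# six chars" escape hatches.
not_master = re.compile(
    r"^([^m]|.[^a]|..[^s]|...[^t]|....[^e]|.....[^r]|.{0,5}$|.{7,})"
)

for branch in ["master", "dev", "feature/foo", "masterful"]:
    print(branch, bool(not_master.match(branch)))
# master -> False; everything else -> True
```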

The "official" continuous-build Binder example uses Circle.CI and demonstrates "how to use repo2docker to build a container and push it to Dockerhub for others to use". The pangeo-stacks repo shows how to use Travis to a similar end.

For Github users, the native Github Actions automation route may be more convenient, such as using this recipe or this actual repo2docker-action to build a Docker image using repo2docker and then push it to Dockerhub following a Github commit; (see also this Jupyter discourse thread on GitHub Actions and Binder).

And for Google Cloud Platform / Google Cloud Build users, this article on "Continuous Integration For Your Jupyter Notebooks On GitHub With GCP" describes "a fully configured example of CI on GitHub that you can use as a reference example on your project" [apparently...—Ed.].

In terms of use cases, another example of how to use CI is building a documentation site that draws on something like Jupyter Book or nbjekyll to run a Jupyter source document through nbconvert to generate an output document that includes code-generated output content [my own demos for these are here (Jupyter Book/sphinx) and here (nbjekyll). —Ed.].
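The nbconvert step at the heart of such a build is small enough to sketch here (file names illustrative): execute the notebook, then render it, outputs and all, to HTML:

```python
# Execute a notebook and render it (including code-generated outputs)
# to a standalone HTML page via nbconvert's Python API.
import nbformat
from nbconvert import HTMLExporter
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("index.ipynb", as_version=4)
ExecutePreprocessor(timeout=600).preprocess(nb, {"metadata": {"path": "."}})
body, _ = HTMLExporter().from_notebook_node(nb)
with open("index.html", "w") as f:
    f.write(body)
```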

Keeping Track of Cell History...

Depending on how you use them, one of the great things about notebooks [which can also be a not so great thing...—Ed.] is the way the history of an idea can be captured as code is developed across notebook code cells. But iterative code development can also take place within a code cell; and whilst each cell maintains its own history, that history is lost when you close a notebook session. Unless you save that history, as CoCalc's notebook history does.
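(The IPython kernel does keep its own input log, queryable via the stock %history magic and persisted across sessions in a local sqlite database; but that's a linear record of what was executed, not the per-cell edit history these tools try to reconstruct.)

```python
# In a running notebook: the kernel's linear input log. -n adds line
# numbers, -l limits to the last five inputs; -g greps across previous
# sessions' persisted logs.
%history -n -l 5
%history -g pandas
```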

I've previously mentioned the [now deprecated? —Ed.] nbcomet (#TJ16; this project itself seemed to build on the earlier comet and comet_server work on tracking changes to notebooks over time) and Microsoft's nbgather (#TJ17), but I've recently come across a couple more examples of cell level history capture in the form of Verdant and ProvBook.

Verdant is "a JupyterLab extension that automatically records history of all experiments you run in a Jupyter notebook, store them in a tidy .ipyhistory JSON file". You can then "visualize the history of individual cells, code snippets, markdown, and outputs". The associated two minute CHI video [that initially reminded me of The Machine is Us/ing Us video... —Ed.] provides a good overview. [You can also read more via the intriguingly titled paper, "Towards Effective Foraging by Data Scientists to Find Past Analysis Choices"...—Ed.]. This extension was originally developed for use "by data scientists", but I think it could also be useful for learners... [REDACTED —Ed.]

Second up, ProvBook is a notebook, rather than JupyterLab, extension that displays the provenance of each notebook code cell through a cell based time slider. Cell history is saved as cell metadata, which can itself be exported to a .ttl RDF datafile. Again there's a paper — ProvBook: Provenance-based Semantic Enrichment of Interactive Notebooks for Reproducibility — and again, a video demo, although in this case it comes in at just under 6 minutes, which requires a bit more of a commitment [that said, you don't lose much by skipping the first 2 mins...—Ed.].
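Since the history lives in cell metadata, plain nbformat should be enough to poke at it; a sketch, noting that the "provenance" key name here is my guess rather than ProvBook's documented schema:

```python
import nbformat

# Walk a ProvBook-annotated notebook and report how many execution runs
# are recorded against each cell. NB: the "provenance" metadata key is
# an assumption, for illustration only.
nb = nbformat.read("notebook.ipynb", as_version=4)
for i, cell in enumerate(nb.cells):
    runs = cell.get("metadata", {}).get("provenance", [])
    if runs:
        print(f"cell {i}: {len(runs)} recorded runs")
```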

By the by, it's probably also worth mentioning the multi-outputs extension and ordo here: both these extensions allow you to save a particular cell output for comparison with the executed output of the same cell at a later point, via a simple diff in the case of the multi-outputs extension, and as part of a simple self-assessment/feedback activity in the case of ordo.


Disclaimer: this newsletter is produced independently of the official Jupyter project.

All errors are down to the editor... Oops... If I make any that I later become aware of, or I'm informed of, I'll announce them in the first possible issue thereafter. If it's really bad, I'll do a Stop Press/emergency issue.

If you have any Jupyter related news items or notebooks you'd like to be considered for inclusion in the newsletter, or experiences of using any of the technologies described in this newsletter that you'd like to share, please email them along to: tony.hirst@open.ac.uk
