Reproducible and Interactive Data Science

Syllabus
Credits
Program
Prerequisites
Preparation Before the First Session
Project Work
Notebook Requirements
Getting a DOI via Zenodo
Create and Export Conda Environments
Troubleshooting
External Resources

Syllabus

The aim of this course is to introduce students to the Jupyter Notebook, an open-source application for creating and sharing documents that contain live code, equations, visualizations, and explanatory text. Uses include data cleaning and manipulation, numerical simulation, statistical modeling, machine learning, and much more. With notebooks, research results and the underlying analyses can be transparently reproduced and shared. As an example, see this Notebook on gravitational waves published in Physical Review Letters.

Over three days of alternating video lectures (Intro & Widgets, Libraries, ATLAS Dijet) and hands-on exercises, participants will learn to construct well-documented electronic notebooks that perform advanced data analyses and produce publication-ready plots. While the course is based on Python, prior Python knowledge is not a prerequisite, since the Jupyter Notebook supports many programming languages. The name Jupyter itself refers to Julia, Python, and R, the main languages of data science.

Credits

4 ECTS.

Program

Sessions on March 9, 10, and 11, 2020, from 10:15 to 15:00. Project presentations on a date to be announced (TBA), from 10:15 to 12:00.

Location:

March 9, 2020: LINXS, IDEON Building Delta 5, Scheelevägen 19 (5th floor)
March 10, 2020: LINXS, IDEON Building Delta 5, Scheelevägen 19 (5th floor)
March 11, 2020: LINXS, IDEON Building Delta 5, Scheelevägen 19 (5th floor)

The course consists of five full days: three with alternating video lectures (Intro & Widgets, Libraries, ATLAS Dijet) and hands-on exercises, and two days with project presentations. The notebooks shown in the video lectures are available on this site in the lectures folder.

Day 1. Introduction
- morning:
  - Introduction and overview of the Jupyter Notebook (10')
  - Introduction to project work and peer discussion (15')
  - Installation and package management (Miniconda)
  - Binder and conda environments
  - Navigating cells, online resources, and getting help
  - Documenting using Markdown: rich text, equations, images, tables, videos
  - IPython Magic commands
  - Cross-language interaction (bash)
- afternoon:
  - Python built-in functions
  - Storage and manipulation of numerical arrays (numpy)
  - Repeated operations and universal functions (numpy, Cython, and fortranmagic)
Day 2. Data Science
- morning:
  - Data structures and data wrangling (pandas)
  - Pivot tables, grouping and aggregating (pandas)
  - Creating publication ready plots (matplotlib)
- afternoon:
  - Plotting images, errorbars, histograms, and composite plots (matplotlib)
  - Exporting figures to raster and vector formats (matplotlib)
  - Plotting categorical data (matplotlib, pandas, seaborn)
Day 3. Visualization and Interactivity
- morning:
  - Nonlinear least-squares (scipy, R, and rpy2)
  - Explore a Notebook in action in the search for new particles (ATLAS Dijet)
- afternoon:
  - IPython widgets
  - Interactive plots (bokeh)
  - Version control, sharing, and archiving (GitHub and Zenodo)
Day 4 and 5. Project presentations

Prerequisites

No prior knowledge of Python is required, but familiarity with programming concepts is helpful.
A laptop connected to the internet (eduroam, for example), running Linux, macOS, or Windows, and with Anaconda installed (see below).
Earphones for silently watching lectures during the sessions.

If you have little experience with Python or shell programming, the following two tutorials may be helpful:

Preparation Before the First Session

Watch the video lectures (Intro & Widgets, Libraries, ATLAS dijets)
Install miniconda3 or, alternatively, the full anaconda3 environment on your laptop (the latter is much larger).
Download the course material (this GitHub repository) and unzip it.
Uncomment the line with "# - gcc # [osx]" in the file environment.yml.
Install and activate the LUcompute environment described by the file environment.yml by running the following in a terminal:
```
conda env create -f environment.yml
conda activate LUcompute
```
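
Once the environment has been activated, a quick check from Python can confirm that you are working inside it. The sketch below assumes that conda sets the `CONDA_DEFAULT_ENV` variable on activation; the helper name `active_conda_env` is made up for illustration and is not part of the course material.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None.

    Assumption: conda exports CONDA_DEFAULT_ENV when an environment
    is activated. `environ` can be any dict-like mapping (for testing).
    """
    if environ is None:
        environ = os.environ
    return environ.get("CONDA_DEFAULT_ENV")


if __name__ == "__main__":
    # After `conda activate LUcompute` this should report LUcompute.
    print("Active environment:", active_conda_env())
    print("Python interpreter:", sys.executable)
```

If the reported name is not LUcompute, the environment is most likely not activated in the terminal from which Python or Jupyter was started.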

Instructions for Windows:

Watch the video lectures (Intro & Widgets, Libraries, ATLAS dijets)
Install miniconda3.
Download the course material (this GitHub repository) and unzip it.
Open the Anaconda Prompt from the Start menu.
Navigate to the folder where the course material has been unzipped (e.g. using cd to change directory and dir to list files in a folder).
Install and activate the LUcompute environment described by the file environment.yml by running the following in the Anaconda Prompt:
```
conda env create -f environment.yml
activate LUcompute
```

Documentation on conda environments

Project Work

The project work consists of three steps:

Each student will make a Notebook project covering topics from days 1–3 with either:

research, presenting data analysis and theory behind a manuscript or published paper. The Notebook should ideally be written such that it can act as supporting information (SI) for a journal. Here's some inspiration.
or a Notebook presenting a text-book topic of choice and aimed at students. Here's some inspiration.
Deadline for project: April 1st

Each student will upload her/his project to a public GitHub repository created through GitHub Classroom. For a brief introduction to git repositories, see here. You can find your repository here; press cancel if an error has occurred during importing.
Notify your referees via email that your notebook is ready to be checked.
A peer-review process where each student reviews and writes comments on two other notebooks by creating issues on the respective GitHub repositories. The review should be based on the criteria listed below. For each point, include specific suggestions for improvements. Deadline for review: April 8th.
The deadline for implementing the reviews and answering the GitHub issues is April 22nd. At this point you should also have a Zenodo DOI for your project - add this as a badge to your repository, or as a link to your README.
Notebook presentation to the class (remote on Zoom, days to be chosen). Maximum 10 minutes per participant.
The presentations serve to briefly show the workflow of the Notebook. Include the response to the reviewers' comments and highlight the most interesting, original, or advanced features of your Notebook (e.g. the use of a particular library, a certain composite plot, a method to manage references or implement interactivity, or any other feature that you found particularly useful and would like to share). The presentation is optional but part of the course; if you skip it, please make sure to include the response to the reviewers' comments and the highlights of your Notebook in the README.md of your GitHub repository.
Save your project to your own GitHub repository when the course has finished, as we may delete it before the next course event.

Notebook Requirements

This check list summarizes the minimum requirements for the Notebook project to be approved. It should be used as a reference for both the development of the Notebook and the peer-review process.

Getting a DOI via Zenodo

Part of your project work consists of adding a Digital Object Identifier (DOI) to your work through Zenodo. To do so, first watch the day 3 video on version control, sharing, and archiving (GitHub and Zenodo). The easiest and preferred way is to connect your GitHub account to Zenodo, enable the repository so that Zenodo can see it, and then make a tag in GitHub, following the instructions here.
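
Once Zenodo has issued a DOI, the badge for the README can be written by hand or generated. The following is a minimal sketch that assumes Zenodo's usual badge URL pattern (`https://zenodo.org/badge/DOI/<doi>.svg`). The function name and the DOI are placeholders for illustration only; if the snippet shown on your Zenodo record page differs, use that one instead.

```python
def zenodo_badge_markdown(doi):
    """Return a Markdown badge that links to the given DOI.

    Assumption: the badge image lives at zenodo.org/badge/DOI/<doi>.svg
    and the DOI resolves through doi.org.
    """
    badge_url = f"https://zenodo.org/badge/DOI/{doi}.svg"
    doi_url = f"https://doi.org/{doi}"
    return f"[![DOI]({badge_url})]({doi_url})"


# Placeholder DOI, not a real record: replace it with your own.
print(zenodo_badge_markdown("10.5281/zenodo.0000000"))
```

The printed line can be pasted near the top of README.md.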

Create and Export Conda Environments

The command to create a new environment with Python x.y is

conda create --name myenv python=x.y

where myenv is a name of your choice for the new environment and x.y is a specific Python version (e.g. 2.7 or 3.6). After activating the environment (conda activate myenv), you can install other packages within it. conda list shows the packages installed in the environment. The command to export the active environment myenv to an environment YAML file (e.g. myenv.yml) is

conda env export > myenv.yml
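
The same two commands can also be scripted from Python, for instance to regenerate the environment file from within a notebook. This is only a sketch: the helper names are hypothetical, and it assumes that the `conda` executable is on the PATH and that the environment to export is the active one.

```python
import subprocess


def conda_create_command(name, python_version):
    """Build the argument list for `conda create --name NAME python=X.Y`."""
    return ["conda", "create", "--name", name, f"python={python_version}"]


def conda_export_command():
    """Build the argument list for `conda env export`."""
    return ["conda", "env", "export"]


def export_environment(path):
    """Write the active environment to `path`, like `conda env export > path`."""
    with open(path, "w") as handle:
        # check=True raises CalledProcessError if conda exits with an error.
        return subprocess.run(conda_export_command(), stdout=handle, check=True)
```

For example, `export_environment("myenv.yml")` writes the same file as the shell command above, while `subprocess.run(conda_create_command("myenv", "3.6"))` starts the interactive creation of a new environment.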

Troubleshooting

If your notebook seems to have a connection issue, with log output similar to the lines below:

[E 12:18:57.001 NotebookApp] Uncaught exception in /api/kernels/5e16fa4b-3e35-4265-89b0-ab36bb0573f5/channels
 Traceback (most recent call last):
   File "/Library/Python/2.7/site-packages/tornado-5.0a1-py2.7-macosx-10.13-intel.egg/tornado/websocket.py", line 494, in _run_callback
     result = callback(*args, **kwargs)
   File "/Library/Python/2.7/site-packages/notebook-5.2.2-py2.7.egg/notebook/services/kernels/handlers.py", line 258, in open
     super(ZMQChannelsHandler, self).open()
   File "/Library/Python/2.7/site-packages/notebook-5.2.2-py2.7.egg/notebook/base/zmqhandlers.py", line 168, in open
     self.send_ping, self.ping_interval, io_loop=loop,
 TypeError: __init__() got an unexpected keyword argument 'io_loop'
[I 12:18:58.021 NotebookApp] Adapting to protocol v5.1 for kernel 5e16fa4b-3e35-4265

You should either a) downgrade the package "tornado" or b) change line 178 of the file

[your conda installation location]/miniconda3/envs/LUcompute/lib/python3.6/site-packages/notebook/base/zmqhandlers.py

from

             self.send_ping, self.ping_interval, io_loop=loop,

into

             self.send_ping, self.ping_interval,

https://stackoverflow.com/questions/48090119/jupyter-notebook-typeerror-init-got-an-unexpected-keyword-argument-io-l
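
To see whether an installation is likely to hit this error, you can check the installed tornado version. The sketch below relies on the fact that tornado removed the `io_loop` argument in version 5.0, which matches the 5.0a1 version in the traceback above. The function name is hypothetical, and the check covers only the tornado side: whether the error actually occurs also depends on the installed notebook version.

```python
def tornado_rejects_io_loop(version):
    """Return True if this tornado version string is 5.0 or newer."""
    major = int(version.split(".")[0])
    return major >= 5


try:
    import tornado
except ImportError:
    print("tornado is not installed in this environment")
else:
    if tornado_rejects_io_loop(tornado.version):
        print("tornado", tornado.version, "no longer accepts io_loop")
    else:
        print("tornado", tornado.version, "still accepts io_loop")
```

If the first message is printed together with the traceback above, downgrading tornado below 5.0 corresponds to option a).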

External Resources

Cross-language interaction is a striking feature of Jupyter notebooks: The possibility to integrate multiple languages in the same notebook makes it feasible to exploit the best tools of the various languages in the different steps of data analysis. You can read more about it in this post.
The Jupyter notebook is a very popular tool for working with data in academia as well as in the private sector.
- These tutorials show how the LIGO/VIRGO collaboration extensively uses Jupyter notebooks to communicate its research.
- The streaming service Netflix currently uses Jupyter notebooks as the main tool for data analysis. For example, recommendation algorithms which suggest which movies or TV series to watch next are currently run on Jupyter notebooks. You can read more about it in this post.
- In 2017 Jupyter received the ACM Software System Award, a prestigious award that it shares with projects such as Unix and the Web.
There are many freely available online resources to learn data science.
- The best resource to find help with programming and scripting is Stack Overflow, which is a question and answer website curated by software developer communities.
- An excellent book is "Python Data Science Handbook" by Jake VanderPlas which is freely available as Jupyter notebooks at this GitHub page. On the author's webpage, you can also find a list of excellent talks, lectures, and tutorials and a blog.
- Yet another useful resource is the podcast Data Skeptic which features a collection of entertaining and educational mini-lectures on data science as well as interviews with experts.
