Tutorial to run OPSD scripts

Jonathan Mühlenpfordt edited this page Aug 1, 2018 · 20 revisions

Software requirements / installation

  • Install Anaconda (choose Python 3.x). Anaconda is a standard Python distribution that includes packages required for scientific work with Python. It also includes Jupyter Notebook (formerly known as IPython Notebook) which is used in OPSD to run and document the scripts.
  • This Beginners Guide is a brief step-by-step tutorial on installing and running Jupyter (IPython) notebooks, aimed at new users with no prior Python experience.

Retrieve the scripts

Option 1

  • On the OPSD site on GitHub, choose the Data Package you are interested in and click on it.
  • You will then see all files contained in the Data Package. Click the green button at the top right ("Clone or download") to download all files to your computer.
  • You can download the package as a ZIP file. This gives you only the latest version on your computer.

Option 2

Alternatively, you can use Git, which also lets you retrieve older versions of Data Packages. This is recommended if you want to retrieve updated versions easily.

  • Install the version control software Git
  • Register a free account on GitHub, which hosts collaborative open source projects
  • Generate an SSH key for authentication on GitHub
  • Log into GitHub with your account and add the SSH key to your profile
  • Clone the repository from GitHub by running git clone, followed by the repository's address, in your terminal
  • The most important Git commands are summarized in this Git cheat sheet (PDF)

Install required Python packages

Our recommended approach is to create a separate conda environment for each data package. Using separate conda environments is a way to keep dependencies required by different projects separate from each other, making it easy, for example, for different projects to use different versions of a package like pandas.

  • Each Data Package has a file called requirements.yml, which lists all packages and their versions required for the scripts in that Data Package. To install these and create a corresponding conda environment, run conda env create --file requirements.yml in your terminal
  • The name of the environment (called env_name below) is specific to each Data Package and will be displayed after successful installation. That name is defined in requirements.yml
  • Activate the environment with source activate env_name (with conda 4.4 or later: conda activate env_name)
  • Deactivate the environment with source deactivate (with conda 4.4 or later: conda deactivate)
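
Because the environment name is defined in requirements.yml, it can be looked up before installing anything. The following is a minimal sketch, assuming the file contains an unindented name: line as conda environment files usually do. The function read_env_name and the sample file content are hypothetical and not part of the OPSD scripts.

```python
from pathlib import Path
from typing import Optional
import tempfile


def read_env_name(requirements_file: Path) -> Optional[str]:
    """Return the value of the top-level 'name:' key, or None if absent."""
    for line in requirements_file.read_text(encoding="utf-8").splitlines():
        # Only unindented lines are checked, so nested keys are ignored.
        if line.startswith("name:"):
            return line.split(":", 1)[1].strip().strip("'\"") or None
    return None


# Hypothetical file content for illustration only; real files differ.
sample = "name: example_env\ndependencies:\n  - python=3.6\n  - pandas\n"

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "requirements.yml"
    path.write_text(sample, encoding="utf-8")
    env_name = read_env_name(path)

print(env_name)
```

In practice you would point read_env_name at the requirements.yml inside the Data Package folder and use the returned name when activating the environment.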

Open Notebook

There are two ways to open the Jupyter Notebook:

(1) By command line / terminal

  • On Windows, you can find it by searching for "cmd" and opening "cmd.exe". Linux users probably know where to find the terminal.
  • Type jupyter notebook and press Enter.
  • (If it does not work, try ipython notebook)

(2) Without command line / terminal

  • Go to the Anaconda folder in the start menu and click on the IPython Notebook link.

Whichever way you choose, the following should happen:

  • A new window in your web browser opens and shows the Notebook Dashboard. Firefox works best, Internet Explorer does not work (well).
  • Now you can choose the first notebook of the Data Package
  • It is recommended to start with the main.ipynb notebook because it explains the structure and aim of the scripts in the respective Data Package.
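
To get an overview of the notebooks in a Data Package folder before opening the dashboard, they can be listed with main.ipynb first. This is only a sketch: the function notebooks_in_order and all file names other than main.ipynb are hypothetical.

```python
from pathlib import Path
from typing import List
import tempfile


def notebooks_in_order(package_dir: Path) -> List[str]:
    """List notebook file names alphabetically, with main.ipynb first."""
    names = sorted(p.name for p in package_dir.glob("*.ipynb"))
    # False sorts before True, so main.ipynb moves to the front;
    # sorted() is stable, so the other names stay in alphabetical order.
    return sorted(names, key=lambda name: name != "main.ipynb")


with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp)
    # Hypothetical folder content for illustration only.
    for name in ["processing.ipynb", "main.ipynb", "download.ipynb", "README.md"]:
        (package_dir / name).touch()
    ordered = notebooks_in_order(package_dir)

print(ordered)
```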

If this explanation of Jupyter Notebooks was too short, this Jupyter Notebook beginner guide may be helpful.

Work and Play with the Notebook

  • Start with main.ipynb, which can be found in each Data Package and explains the Data Package and scripts
  • Lengthy routine code (e.g. download routines) is sometimes moved into separate Python scripts, which are also part of the Data Package on your computer. The notebooks run them automatically where necessary. Code that works directly with the original data, such as verification, processing and corrections, is in the notebooks.
  • Generally, the scripts start with downloading and continue with reading and processing; some also include graphs to illustrate the data.
  • Downloaded files will be put in the input/original_data folder on your computer
  • If you run the scripts again, these already downloaded files will be used
  • Output files will be stored in the output folder, which is created inside the Data Package folder on your computer once you run the script
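
The reuse of already downloaded files can be pictured as a "download only if missing" check. The sketch below is not the OPSD download code. It only illustrates the idea, and the function fetch_once, the stand-in fake_download and the example URL are assumptions made for this illustration.

```python
from pathlib import Path
from typing import Callable
import tempfile


def fetch_once(url: str, target_dir: Path, download: Callable[[str], bytes]) -> Path:
    """Download url into target_dir unless the file is already there."""
    target_dir.mkdir(parents=True, exist_ok=True)
    local_file = target_dir / url.rsplit("/", 1)[-1]
    if not local_file.exists():
        local_file.write_bytes(download(url))
    return local_file


calls = []


def fake_download(url: str) -> bytes:
    """Stand-in for a real HTTP request; records each call."""
    calls.append(url)
    return b"timestamp,load\n"


with tempfile.TemporaryDirectory() as tmp:
    original_data = Path(tmp) / "input" / "original_data"
    first = fetch_once("https://example.org/data.csv", original_data, fake_download)
    second = fetch_once("https://example.org/data.csv", original_data, fake_download)
    same_file = first == second
    content = second.read_bytes()

print(len(calls))  # prints 1: the second run reused the existing file
```

Under this pattern, deleting a file from input/original_data forces it to be downloaded again on the next run.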

Closing a Jupyter Notebook

  • In the browser window of the notebook you want to close, click File -> Close and Halt
  • Alternatively, on the main Jupyter Notebook browser window (the dashboard), click on the Running tab and shut down all notebooks you want to close
  • Closing only the tabs/browser does not shut down the kernel

Troubleshooting

  • Creating new Notebook: Creating Notebook Failed [...] Errno 13: Copy the link that opens IPython Notebook (see the Anaconda folder in the start menu) to the folder that contains the notebook files (.ipynb).

Versioning of Data Packages

  • Versions of the Data Packages are organized by tags, which are named after the date of the version. We do not use version names but version dates
  • On the respective Data Package page on GitHub, click on branches and then choose tags
  • There you have the choice between different dates
  • The original_data on which the version of the script is based is stored in a folder with the same date. This can be accessed via view_original_data on the Data Package site on the OPSD page
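
Since versions are identified by date, choosing the most recent one amounts to comparing dates. The sketch below assumes tags are written as YYYY-MM-DD. That format, the function names and the example tags are assumptions for illustration and should be checked against the tags actually shown on GitHub.

```python
from datetime import date, datetime
from typing import List, Optional


def parse_tag(tag: str) -> Optional[date]:
    """Return the date encoded in a tag, or None if it is not a date."""
    try:
        return datetime.strptime(tag, "%Y-%m-%d").date()
    except ValueError:
        return None


def latest_version(tags: List[str]) -> Optional[str]:
    """Return the tag with the most recent date, ignoring non-date tags."""
    dated = [(parse_tag(tag), tag) for tag in tags]
    dated = [(day, tag) for day, tag in dated if day is not None]
    return max(dated)[1] if dated else None


# Hypothetical tag names for illustration only.
example_tags = ["2016-10-27", "2017-03-06", "not-a-date", "2016-07-14"]
print(latest_version(example_tags))
```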