# Starting a Renku project

## Initializing a repository

The first thing to do when starting a project is to initialize a repository. A Renku repository is just a git repository with a little bit of extra structure. 

Note: In this tutorial, we are placing the repository within another git repository. Normally, you would want to use a submodule for such cases. Here, we have instead added the folder `renku-tutorial-flights` to the `.gitignore` file of the outer repository.

In [None]:
!renku init renku-tutorial-flights

Let us take a look at what's inside the renku repository.

In [None]:
%ls -al renku-tutorial-flights

It's just a normal git repository with a few files in there. 

The `.renku` folder is where renku stores internal information. Similar to the `.git` folder, it is not designed for users to regularly interact with, but it is possible to do so if necessary.

We will look at the files `.gitlab-ci.yml` and `Dockerfile` soon. That leaves `requirements.txt`, which we already know.

## Housekeeping

For the rest of the tutorial, we will work in the renku repository.

In [None]:
%cd renku-tutorial-flights

## Importing data

There are several ways renku can import data. Renku can import data from another renku or git repository, from a URL, or from a file on the file system. We will start with the last of these options.

### Create a dataset

First, we will create a dataset which will group together the files we want to work with.

In [None]:
!renku dataset create flights

### Add data

Next, we download some data and add it to the dataset.

In [None]:
%%bash
# Download the data we will work with and add it to a dataset
# The URL is quoted so the shell does not treat the "?" as a glob character
curl -L -o /tmp/2019-01-flights.csv.zip "https://www.dropbox.com/s/99w7evit5y7jxb3/2019-01-flights.csv.zip?dl=0"
renku dataset add flights /tmp/2019-01-flights.csv.zip

This copies the data into the folder for the flights dataset.

In [None]:
!renku dataset ls-files flights

As you probably know, git is not normally a good system for storing binary artifacts. For this, renku uses git-lfs. For the moment, just remember that. It's something we will get back to when we look at the technical details of renku.

In [None]:
!git lfs track
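
The patterns that git-lfs tracks are recorded in the repository's `.gitattributes` file, one line per pattern with the attribute `filter=lfs`. As a minimal sketch, the following lists those patterns from the text of such a file. The function name `lfs_patterns` and the sample content are assumptions made for illustration, not part of renku or git-lfs.

```python
from typing import List


def lfs_patterns(gitattributes_text: str) -> List[str]:
    """Return the path patterns that are routed through the git-lfs filter."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attributes = line.split()
        if "filter=lfs" in attributes:
            patterns.append(pattern)
    return patterns


# Hypothetical .gitattributes content, for illustration only.
example = """\
# data files handled by git-lfs
data/flights/2019-01-flights.csv.zip filter=lfs diff=lfs merge=lfs -text
*.md text
"""

print(lfs_patterns(example))
```

Inside the repository, the same function could be applied to the real file with `lfs_patterns(open(".gitattributes").read())`.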

What have we done so far? This is easy to find out! Just look at the git log.

In [None]:
!git log --graph --oneline

# Inspect and preprocess data

There are notebooks prepared in the templates folder. One of these notebooks will get us started with reading in and preprocessing the data.

In [None]:
%mkdir notebooks
%cp ../templates/01-Preprocess-00.ipynb ./notebooks/01-Preprocess.ipynb

## Inspect data

Open [01-Preprocess.ipynb](renku-tutorial-flights/notebooks/01-Preprocess.ipynb) and run through it. When done, return to this notebook to continue.

## Preprocess data

The data looks good. We want to save the output as a file. We could just save the file in the notebook, but then we have not recorded what input and processing were used to produce an output.

Instead, we can use the tool **papermill** which runs Jupyter notebooks in a reproducible way. To do this, we need to convert the notebook into one that is papermill compatible and then run it with papermill.

**Before continuing with the notebook, you need to quit the *other* JupyterLab we started above**.

### Modify the notebook to use papermill

Converting the notebook to be papermill compatible is done by creating a cell that initializes all the values we want to use as parameters and converting the code to use these parameters. The cell needs to be tagged as a parameters cell. This is done by editing the cell metadata and adding the following:
```
{
    "tags": [
        "parameters"
    ]
}
```
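
Because a notebook file is plain JSON, the tag can also be added programmatically instead of through the JupyterLab metadata editor. The sketch below does this with only the standard library. The helper name `tag_parameters_cell` and the tiny in-memory notebook are assumptions for illustration; a real notebook would be loaded with `json.load` and written back with `json.dump`.

```python
import json


def tag_parameters_cell(notebook: dict, cell_index: int) -> dict:
    """Add the "parameters" tag to the metadata of one cell (in place)."""
    cell = notebook["cells"][cell_index]
    tags = cell.setdefault("metadata", {}).setdefault("tags", [])
    if "parameters" not in tags:
        tags.append("parameters")
    return notebook


# Minimal hypothetical notebook with a single code cell of default values.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ['input_path = "data/flights/2019-01-flights.csv.zip"\n'],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

tag_parameters_cell(notebook, cell_index=0)
print(json.dumps(notebook["cells"][0]["metadata"], indent=4))
```

The printed metadata has the same shape as the snippet above. Calling the helper a second time leaves the tag list unchanged, so it is safe to rerun.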

In [None]:
# Update the notebook to a version with parameters
%cp ../templates/01-Preprocess-01-papermill.ipynb ./notebooks/01-Preprocess.ipynb 

### Resolve the dirty repository state

Renku relies on information from git to determine which files a program produces as output. For this to work, the working directory needs to be clean (no uncommitted modifications).

In [None]:
!git status
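
When scripting these steps, the same check can be made from Python. The sketch below assumes `git` is on the `PATH` and uses `git status --porcelain`, which prints nothing when the working tree is clean. The function names are illustrative and not part of renku.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """A working tree is clean when `git status --porcelain` prints nothing."""
    return porcelain_output.strip() == ""


def working_tree_is_clean(repo_path: str = ".") -> bool:
    """Ask git whether the repository at repo_path has uncommitted changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. outside a repository
    )
    return is_clean(result.stdout)
```

Calling `working_tree_is_clean()` before a `renku run` command gives an early, explicit signal that something still needs to be committed.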

We added a notebook. Let us put it into git and make a commit.

In [None]:
!git add notebooks
!git commit -m"Initial data inspection and processing"

Now we can run the notebook with papermill.

In [None]:
%%bash
mkdir -p data/output
renku run papermill \
  -p input_path data/flights/2019-01-flights.csv.zip \
  -p output_path data/output/2019-01-flights-preprocessed.csv \
  notebooks/01-Preprocess.ipynb \
  notebooks/01-Preprocess.ran.ipynb
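
The command above follows a fixed pattern: `renku run papermill`, one `-p name value` pair per parameter, then the input notebook and the executed output notebook. As a sketch, the pattern can be captured in a small helper that builds the argument list. The function name `renku_papermill_command` is an assumption for illustration; only the command-line shape is taken from the cell above.

```python
import shlex
from typing import Dict, List


def renku_papermill_command(
    input_notebook: str, output_notebook: str, parameters: Dict[str, str]
) -> List[str]:
    """Build the argument list for running a notebook via renku and papermill."""
    command = ["renku", "run", "papermill"]
    for name, value in parameters.items():
        command += ["-p", name, str(value)]  # one -p pair per parameter
    command += [input_notebook, output_notebook]
    return command


command = renku_papermill_command(
    "notebooks/01-Preprocess.ipynb",
    "notebooks/01-Preprocess.ran.ipynb",
    {
        "input_path": "data/flights/2019-01-flights.csv.zip",
        "output_path": "data/output/2019-01-flights-preprocessed.csv",
    },
)
print(shlex.join(command))
```

The list can be passed to `subprocess.run(command, check=True)` from inside the repository. Passing a list rather than a single string avoids shell quoting problems if a path contains spaces.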

Let's take a look at how things look from the renku perspective.

In [None]:
!renku status

In [None]:
!renku log

In [None]:
!renku log notebooks/01-Preprocess.ran.ipynb --format Makefile

# Inspect preprocessed data

Let us examine the preprocessed data to make sure we interpreted the original data correctly.

To do this, we use another notebook. This notebook takes the preprocessed data and visualizes it so we can see if it conforms to our expectations regarding how it should look.

In [None]:
%cp ../templates/02-Inspection-00.ipynb ./notebooks/02-Inspection.ipynb 

## Inspect preprocessed data

Run through the [02-Inspection.ipynb](./renku-tutorial-flights/notebooks/02-Inspection.ipynb) notebook and then come back here.

### Modify the notebook to use papermill (again)

The 02-Inspection.ipynb notebook was already written with papermill in mind. The one parameter is already declared in its own cell. To make the notebook papermill compatible, all that needs to be done is to tag the parameters cell. This is done by editing the cell metadata and adding the `parameters` tag, exactly as we did for the preprocessing notebook.

## **Exercise 1**

Make the 02-Inspection.ipynb notebook work with papermill.

In [None]:
# Ex. 1 Solution
# Update the notebook to a version with parameters
# %cp ../templates/02-Inspection-01-papermill.ipynb ./notebooks/02-Inspection.ipynb 

### Resolve the dirty repository state

Again, we need to ensure the working directory is clean.

In [None]:
!git status

We added a notebook. Let us put it into git and make a commit.

In [None]:
!git add notebooks
!git commit -m"Inspecting the results of preprocessing."

Now we can run the notebook with papermill.

In [None]:
%%bash
renku run papermill \
  -p input_path data/output/2019-01-flights-preprocessed.csv \
  notebooks/02-Inspection.ipynb \
  notebooks/02-Inspection.ran.ipynb

Now we have a workflow in Renku!

In [None]:
!renku log

In [None]:
!renku log notebooks/02-Inspection.ran.ipynb --format Makefile