# Iterating on a Renku project

## Improving steps in a workflow

In the last step of 01-Starting.ipynb, we produced a boxplot and a probability plot of the delay times. Some flights appeared to arrive 1500 minutes (25 hours) early! This seems quite unlikely and is probably the result of an error in our preprocessing.
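
A quick sanity check can surface such values before any plotting. The snippet below is only a sketch: the sample delays and the three-hour bound are assumptions made for illustration, not values taken from the tutorial data.

```python
# Hypothetical sample of arrival delays in minutes (negative means early).
delays_min = [-12.0, 3.0, 45.0, -1500.0, 0.0, 130.0]

# Assumed plausibility bound: no flight arrives more than 3 hours early.
MAX_EARLY_MIN = 180.0


def implausibly_early(delays, max_early=MAX_EARLY_MIN):
    """Return the delays that are earlier than the assumed bound."""
    return [d for d in delays if d < -max_early]


suspicious = implausibly_early(delays_min)
for d in suspicious:
    # Convert minutes to hours to make the size of the error obvious.
    print(f"{d:.0f} min = {abs(d) / 60:.0f} h early")
```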

Let us go back to the preprocessing notebook and try to track down the source of the problem.

In [None]:
%cd renku-tutorial-flights

Normally, we would improve the code in our notebook, but this is a tutorial on Renku, not on pandas, so we will fast-forward through that part of the process and continue from there.

In [None]:
%cp ../templates/01-Preprocess-02-improved.ipynb ./notebooks/01-Preprocess.ipynb 

## Look at the changes to the notebook

Run through the new [01-Preprocess.ipynb](renku-tutorial-flights/notebooks/01-Preprocess.ipynb) to see what has changed. Come back here when you are done.

### Resolve the dirty repository state

Now that we have improved the preprocessing, let us re-run it with papermill. Again, we need to ensure the working directory is clean.

In [None]:
!git status
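
The same check can be scripted. The sketch below relies on `git status --porcelain`, which prints nothing when there are no pending changes. The helper names `is_clean` and `working_tree_is_clean` are made up for this example.

```python
import subprocess


def is_clean(porcelain_output: str) -> bool:
    """A repository is clean when `git status --porcelain` prints nothing."""
    return porcelain_output.strip() == ""


def working_tree_is_clean(repo_dir: str = ".") -> bool:
    """Run git in repo_dir and report whether there are no pending changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return is_clean(result.stdout)


# Example with canned output instead of a real repository:
print(is_clean(""))                                    # True
print(is_clean(" M notebooks/01-Preprocess.ipynb\n"))  # False
```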

We updated a notebook. Let us put it into git and make a commit.

In [None]:
!git add notebooks/01-Preprocess.ipynb 
!git commit -m"Fixed problems extracting delay durations."

In [None]:
!git status

<div style="color: #004085; background-color: #cce5ff; border-color: #b8daff; padding: .75rem 1.25rem; margin-bottom: 1rem; border: 1px solid transparent; border-radius: .25rem; font-size: larger;">
Let's pause for a second and reflect on where we are...
</div>

We read in some data, processed it, and worked with the processed data. Now we want to change the initial processing. 
- How do you know what is downstream of the processed data?
- How do you know how to update all downstream consumers of the data?

**With Renku, we can just ask the system.**

In [None]:
!renku status
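
Conceptually, answering "what is downstream?" means walking a lineage graph from the changed file. The sketch below illustrates that idea only; it is not how Renku is implemented, and the graph, including the name `data/output/preprocessed.csv`, is hypothetical.

```python
from collections import deque

# Hypothetical lineage: each file maps to the files produced directly from it.
# The intermediate file name is invented for illustration.
consumers = {
    "notebooks/01-Preprocess.ipynb": ["data/output/preprocessed.csv"],
    "data/output/preprocessed.csv": [
        "notebooks/02-Inspection.ran.ipynb",
        "data/output/2019-01-flights-delay-fivenums.csv",
    ],
}


def downstream(changed, graph):
    """Return every file reachable from `changed`, in breadth-first order."""
    seen = {changed}
    queue = deque([changed])
    order = []
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


outdated = downstream("notebooks/01-Preprocess.ipynb", consumers)
print(outdated)
```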

And we can ask Renku to update everything.

In [None]:
!renku update
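
Updating everything implies re-running the outdated steps in dependency order, so that each step sees fresh inputs. A minimal sketch of that ordering, using the standard library's `graphlib`, follows. The step names are hypothetical and this is not Renku's actual implementation.

```python
from graphlib import TopologicalSorter  # available from Python 3.9

# Hypothetical steps, each mapped to the steps it depends on.
depends_on = {
    "preprocess": set(),
    "inspect": {"preprocess"},
    "summarize": {"preprocess"},
}


def update_order(dependencies, stale):
    """Order the stale steps so that each runs after its dependencies."""
    full_order = TopologicalSorter(dependencies).static_order()
    return [step for step in full_order if step in stale]


order = update_order(depends_on, stale={"preprocess", "inspect", "summarize"})
print(order[0])  # preprocess
```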

<div style="color: #155724; background-color: #d4edda; border-color: #c3e6cb; padding: .75rem 1.25rem; margin-bottom: 1rem; border: 1px solid transparent; border-radius: .25rem; font-size: larger;">
Wasn't that easy!?
</div>

## Inspecting the results

Let us look at [2019-01-flights-delay-fivenums.csv](renku-tutorial-flights/data/output/2019-01-flights-delay-fivenums.csv) to see if the problem has been fixed.

In [None]:
!git diff HEAD^ data/output/2019-01-flights-delay-fivenums.csv
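
For reference, a five-number summary consists of the minimum, first quartile, median, third quartile, and maximum. The sketch below computes one with the standard library. The sample values are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption that may differ from the one used in the tutorial notebook.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of at least two values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical delays in minutes, including one implausible outlier.
delays_min = [-1500, -10, -5, 0, 5, 10, 20, 45, 120]
print(five_number_summary(delays_min))
```

An outlier such as -1500 shows up directly in the minimum while leaving the quartiles largely untouched, which is why the summary is a convenient place to look for the problem.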

We can also look at the notebook [02-Inspection.ran.ipynb](renku-tutorial-flights/notebooks/02-Inspection.ran.ipynb) to see the new boxplot.

Alas, it has not been fixed.

# Fixing the problem at the source

Let us take a different approach. Instead of trying to compute the delay, let us go back to the Bureau of Transportation Statistics and use a data series that includes the delay as part of the data.

In [None]:
%%bash
# Download an improved version of the data we will work with and add it to a dataset
curl -L -o ./data/flights/2019-01-flights.csv.zip https://renkulab.io/gitlab/cramakri/renku-tutorial-flights-data/blob/master/data/v2/2019-01-flights.csv.zip
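
A download can succeed and still save something other than the archive, for example an HTML page. Checking that the file really is a ZIP archive before committing it is cheap. The sketch below uses `zipfile.is_zipfile` on in-memory data; the helper name `looks_like_zip` is made up for this example, and the same call accepts a path such as `./data/flights/2019-01-flights.csv.zip`.

```python
import io
import zipfile


def looks_like_zip(source) -> bool:
    """Return True if `source` (a path or a binary file object) is a ZIP archive."""
    return zipfile.is_zipfile(source)


# Demonstration on in-memory data rather than the real download:
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("2019-01-flights.csv", "col_a,col_b\n1,2\n")

print(looks_like_zip(buffer))                                # True
print(looks_like_zip(io.BytesIO(b"<!DOCTYPE html><html>")))  # False
```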

In [None]:
%cp ../templates/01-Preprocess-03-fixed.ipynb ./notebooks/01-Preprocess.ipynb

## Running through the fix

Open [01-Preprocess.ipynb](renku-tutorial-flights/notebooks/01-Preprocess.ipynb) and run through the notebook to see how it was changed to adapt to the new data.

Now that we have fixed the data and the processing, let us see what is in git.

In [None]:
!git status

In [None]:
!git add data/flights/2019-01-flights.csv.zip
!git add notebooks/01-Preprocess.ipynb
!git commit -m"New way of computing delays -- use BTS data directly."

<div style="color: #004085; background-color: #cce5ff; border-color: #b8daff; padding: .75rem 1.25rem; margin-bottom: 1rem; border: 1px solid transparent; border-radius: .25rem; font-size: larger;">
Again, we have made a change to data and code in the pipeline. How do we get everything up-to-date?
</div>


In [None]:
!renku status

In [None]:
!renku update

Let us take a look at [02-Inspection.ran.ipynb](renku-tutorial-flights/notebooks/02-Inspection.ran.ipynb) again. That looks much better! 

So does the five-number summary.

In [None]:
!git diff HEAD^ data/output/2019-01-flights-delay-fivenums.csv

# Answering the questions

A notebook has been prepared to compute the mean delay time of flights to Austin.

In [None]:
%cp ../templates/03-Analysis.ipynb ./notebooks/03-Analysis.ipynb
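
The core of such a computation is a filter followed by a mean. The sketch below shows the idea on a tiny in-memory extract. The column names `DEST` and `ARR_DELAY`, the airport code `AUS`, and the sample values are all assumptions for illustration, and the prepared notebook may do this differently.

```python
import csv
import io
from statistics import mean

# Hypothetical extract; the column names DEST and ARR_DELAY are assumptions.
sample_csv = """DEST,ARR_DELAY
AUS,12
AUS,-4
DFW,30
AUS,7
"""


def mean_delay(rows, dest):
    """Mean of ARR_DELAY over rows whose DEST equals `dest`, skipping blanks."""
    delays = [
        float(row["ARR_DELAY"])
        for row in rows
        if row["DEST"] == dest and row["ARR_DELAY"]
    ]
    return mean(delays)


rows = list(csv.DictReader(io.StringIO(sample_csv)))
print(mean_delay(rows, "AUS"))  # 5.0
```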

## Exercise 2

Run through the notebook above. Then run it with papermill. Are flights generally on time?

In [None]:
# %load ../solutions/ex2.fragment

## Exercise 3

Modify the notebook to compute delay time by city. Are there differences between cities? Update the results with your new code.