# Demo Week 2 DATA2901 - Jupyter Notebooks and Unix

Jupyter notebooks can include Unix commands - as long as the server is running under Unix itself.

These commands have access to the same file system as the notebook itself.

## 1. Executing Unix commands in IPython Notebooks

We can execute Unix commands in a Jupyter Python notebook after a <font color='purple'>**!**</font>

In [None]:
# list the content of the current directory
! ls -al

In [None]:
# pwd prints the name of the current work directory
! pwd

Unix commands can refer to environment variables (using the usual _$var_ notation) that can be set with <font color='purple'>**%env**</font> in the notebook.

In [None]:
%env filename = MajorPowerStations_v2.csv

In [None]:
# show the first two lines of the file referred to via the variable $filename
! head -n 2 $filename
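As a rough sketch of what happens behind the scenes: `%env` stores the value in the environment of the notebook's Python process, and any shell started from that process inherits it. The example below assumes a POSIX `sh` is available on the server and uses `echo` instead of `head`, so it does not need the CSV file.

```python
import os
import subprocess

# Roughly what `%env filename = MajorPowerStations_v2.csv` does:
# store the value in the environment of the current Python process.
os.environ["filename"] = "MajorPowerStations_v2.csv"

# A child shell inherits that environment, so $filename is expanded
# by the shell, not by Python.
result = subprocess.run(
    ["sh", "-c", 'echo "$filename"'],
    capture_output=True,
    text=True,
    check=True,
)
expanded = result.stdout.strip()
print(expanded)
```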

We can also execute pipelines of Unix commands this way.

In [None]:
# list all unique OPERATIONALSTATUS values
! cut -f 5 -d , $filename | sort | uniq

In [None]:
# list all unique OPERATIONALSTATUS values - ignoring the CSV file's header line (starting from line 2)
! tail -n +2 $filename | cut -f 5 -d , | sort | uniq

In [None]:
# which classes of power stations do we have - and how many of each?
! tail -n +2 $filename | cut -f 2 -d , | sort | uniq -c
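To see what each stage of such a pipeline contributes, here is a minimal plain-Python sketch of `tail -n +2 | cut -f 2 -d , | sort | uniq -c`. The sample rows and column names are made up for illustration and are not taken from the real CSV file.

```python
from itertools import groupby

# Hypothetical sample data (NOT the real MajorPowerStations file)
sample = """ID,CLASS,NAME
1,Wind,A
2,Solar,B
3,Wind,C
4,Hydro,D
"""

lines = sample.splitlines()
data_lines = lines[1:]                                  # tail -n +2 : skip the header
classes = [line.split(",")[1] for line in data_lines]   # cut -f 2 -d , : second field
classes.sort()                                          # sort
# uniq -c : count runs of identical adjacent values (hence the sort before it)
counts = [(len(list(group)), value) for value, group in groupby(classes)]

for count, value in counts:
    print(f"{count:7d} {value}")
```

Like `cut`, the naive `split(",")` does not understand quoted fields that contain commas; for such files Python's `csv` module would be the safer choice.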

You can also access the help pages for the Unix commands using the **man** command; just be careful with long help texts...

In [None]:
# if you want to check the user manual of a Unix command, use 'man' (careful with long help texts)
! man uniq

**sed** is the Unix stream editor, which is handy for basic string replacements, for example:

In [None]:
# sed example (cf. adv lecture slides and man page for more information about sed)
! echo "hello world"
! echo "hello world" | sed -e 's/hello/bonjours/g'
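For comparison, a similar global replacement can be written in Python with `re.sub`. This is only a sketch of the same idea: sed and Python use different regular expression dialects, but for a plain literal like this they behave alike.

```python
import re

greeting = "hello world"

# Counterpart of: sed -e 's/hello/bonjours/g'
# re.sub replaces all matches by default, like the trailing 'g' flag in sed.
replaced = re.sub(r"hello", "bonjours", greeting)
print(replaced)
```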

## 2. Connecting Python with Unix

So far, the examples just invoked Unix commands and included their results in the notebook as output text. But we can also connect Python and Unix more directly:
 - Use <font color="red">__{var}__</font> to pass data from a Python variable into a Unix command as input.
 - Use <font color="purple"> = ! </font> as part of a variable assignment to receive results from Unix into a Python variable.

In [None]:
# let's prepare a Python variable
text = """Standing at the limit of an endless ocean
Stranded like a runaway, lost at sea
City on a rainy day down in the harbour
Watching as the grey clouds shadow the bay
Looking everywhere 'cause I had to find you
This is not the way that I remember it here
Anyone will tell you its a prisoner island
Hidden in the summer for a million years
"""
print(text)

# send the Python 'text' to Unix and modify it with sed; result is assigned to 'modified_text'
modified_text = ! echo "{text}" | sed -e 's/ocean/beach/g'
print(modified_text)

Notice how the text is actually processed line-by-line by the Unix commands, and returned as a string list (<tt>IPython.utils.text.SList</tt>).

In [None]:
type(modified_text)

A <tt>text.SList</tt> type supports several special fields that help when working with these lists of texts.

For example, to see it in its original multi-line text format, we need to join the individual lines with a newline in between. This is done by the **.n** special field of the <tt>SList</tt> type:

In [None]:
# print result as one concatenated string with newlines in-between
print( modified_text.n )
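The following sketch mimics this behaviour with an ordinary Python list, so it also runs outside IPython. It is not the actual `SList` implementation, and the two lines are just a shortened stand-in for `modified_text`.

```python
# Stand-in for the list of lines returned by the Unix command
lines = [
    "Standing at the limit of an endless beach",
    "Stranded like a runaway, lost at sea",
]

# Comparable to modified_text.n : one string, lines separated by newlines
joined = "\n".join(lines)
print(joined)

# Similar in spirit to modified_text.grep("sea") : keep only matching lines
matching = [line for line in lines if "sea" in line]
print(matching)
```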

There are several other very useful functions defined on these <tt>SList</tt>s, such as **grep()** and **sort()**. For more information about the <tt>IPython.utils.text.SList</tt> type, see the IPython online documentation at: https://ipython.readthedocs.io/en/stable/api/generated/IPython.utils.text.html#IPython.utils.text.SList

**Exercise:** What would happen if you used the pattern 's/way/PATH/g' with sed?

## 3. IPython Magics Examples

In the advanced lecture slides, we explained that Jupyter is quite an extensible ecosystem with multiple options to add new functionality. Python notebooks are actually executed by an **IPython kernel**, and this IPython kernel in turn supports its own extensions, called **magics**. Quite a few of those magics are already built in.

A **line magic** is invoked with a <font color="purple">**%**</font> at the start of the line:

In [None]:
%lsmagic

In [None]:
%pwd

A **cell magic** is invoked with a double <font color="purple">**%%**</font> at the start of the cell and then affects everything following it in the same notebook cell:

In [None]:
%%time
# measure the execution time needed for a cell
! tail -n +2 $filename | cut -f 2 -d , | sort | uniq -c

In [None]:
%%bash
# the %%bash cell magic allows you to run multi-line Unix shell scripts.
 
 tail -n +2 $filename \
| cut -f 2 -d ,       \
| sort                \
| uniq -c

In [None]:
%%html
<h3>Heading Level 3</h3>
<p>The quick brown fox <b>jumps</b> over the lazy dog.</p>

In [None]:
%%latex
$$ E = m c^2$$
$$f(\omega) = \int_{0}^{\infty} \! f(x) \mathrm{e}^{-2\pi{}i\omega x}\, \mathrm{d}x $$

## 4. Performance Comparison

Finally, let's do a simple performance comparison between Python+Pandas and Unix shell commands for some simple data analysis.

**Important:** Note that the runtime results completely depend on the computer hardware where the Jupyter notebook is executed.

We are using a slightly larger dataset here, from the US Bureau of Transportation Statistics, about the on-time performance of major US airlines. The flight performance dataset for January 2019 is a CSV file of about 54 MB:

In [1]:
! ls -al

total 52992
drwxr-sr-x 3 uroehm linuxusers     4096 Mar  7 21:50 .
drwxr-sr-x 9 uroehm linuxusers     4096 Mar  4 13:39 ..
-rw-r--r-- 1 uroehm linuxusers    21862 Mar  7 21:50 Demo_Week2_Adv.ipynb
drwxr-sr-x 2 uroehm linuxusers     4096 Mar  4 13:40 .ipynb_checkpoints
-rw-r--r-- 1 uroehm linuxusers    91310 Mar  4 14:07 MajorPowerStations_v2.csv
-rw-r--r-- 1 uroehm linuxusers 54125805 Mar  5 10:57 ontime_performance_2019-01.csv
-rw-r--r-- 1 uroehm linuxusers       12 Mar  5 11:13 test.txt


In [2]:
# let's check the format of this file by looking at the header line and the first data row
! head -n 2 ontime_performance_2019-01.csv

FL_DATE,OP_UNIQUE_CARRIER,ORIGIN,ORIGIN_STATE_NM,DEST,DEST_STATE_NM,DEP_TIME,DEP_DELAY,ARR_TIME,ARR_DELAY,CANCELLATION_CODE,AIR_TIME,DISTANCE
2019-01-28,"DL","DFW","Texas","SLC","Utah","1233",-8.00,"1437",0.00,"",158.00,989.00


### Experiment 1: List Top-3 Airports by Flight Departures 

The first experiment requires reading the full 54 MB dataset, grouping (and sorting) by 'ORIGIN', counting the flights per origin, and then printing the top-3 values.

#### Measurement 1.1: List Top-3 Airports by Flight Departures using Pandas

In [3]:
%%time

# load OnTime Performance dataset for 2019-01 into Pandas DataFrame
import pandas as pd
data = pd.read_csv('ontime_performance_2019-01.csv')

# What is the frequency distribution of the origin airports?
originDistr = data.groupby('ORIGIN').size()
print(originDistr.nlargest(3))

ORIGIN
ATL    31155
ORD    26216
DFW    23063
dtype: int64
CPU times: user 1.03 s, sys: 1.12 s, total: 2.15 s
Wall time: 1.44 s


#### Measurement 1.2: List Top-3 Airports by Flight Departures using Unix commands
_(Careful: the CPU times are measured only for the Jupyter process and do not include the CPU time of the Unix sub-commands; compare the wall times instead.)_

In [4]:
%%time
# find the top 3 origin airports by number of flights
! cut -f 3 -d , 'ontime_performance_2019-01.csv' | sort | uniq -c |  sort --sort=numeric -r |  head -n 3

  31155 "ATL"
  26216 "ORD"
  23063 "DFW"
CPU times: user 50 ms, sys: 23 ms, total: 73 ms
Wall time: 2.6 s
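The gap between CPU time and wall time in this output can be reproduced with a small sketch. It starts a child Python process (standing in for the Unix sub-commands) and measures both clocks in the parent; the amount of work done by the child is an arbitrary choice for illustration.

```python
import subprocess
import sys
import time

wall_start = time.perf_counter()   # wall-clock timer
cpu_start = time.process_time()    # CPU time of THIS process only

# The child process does the CPU-bound work, like the Unix commands
# started via '!' do in the notebook.
subprocess.run([sys.executable, "-c", "sum(range(5_000_000))"], check=True)

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

print(f"CPU time of parent: {cpu_elapsed:.3f} s")
print(f"Wall time:          {wall_elapsed:.3f} s")
```

The parent mostly waits for the child, so its own CPU time stays small while the wall time covers the whole run.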


#### Measurement 1.3: List Top-3 Airports by Flight Departures using the Unix awk command

The 'awk' command is a very powerful and flexible data processing command in Unix. It pattern-matches certain lines of the input and then executes a small code function (which can include local variables, conditional statements and loops) on the matching lines. The code block after BEGIN is for program initialization; the code block after END is executed once awk has finished processing the file. Again, when comparing times, concentrate on _wall time_.

In [5]:
%%time
%%bash
awk 'BEGIN {FS=","} 
           {origin[$3]++}
     END {for(i in origin) print i,origin[i]}'  "ontime_performance_2019-01.csv" \
| sort -nr -k2 \
| head -n 3

"ATL" 31155
"ORD" 26216
"DFW" 23063
CPU times: user 4 ms, sys: 2 ms, total: 6 ms
Wall time: 435 ms
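To make the structure of the awk program easier to follow, here is a plain-Python sketch of the same BEGIN / per-line / END pattern. The sample lines are hypothetical and shortened to the first three columns; they are not rows from the real dataset.

```python
# Hypothetical sample lines (first three columns only)
sample_lines = [
    '2019-01-01,"DL","ATL"',
    '2019-01-01,"UA","ORD"',
    '2019-01-02,"DL","ATL"',
    '2019-01-02,"AA","DFW"',
    '2019-01-03,"UA","ORD"',
    '2019-01-03,"DL","ATL"',
]

# BEGIN {FS=","} : initialisation; awk arrays start out empty
field_separator = ","
origin = {}

# {origin[$3]++} : action without a pattern, so it runs for every line
for line in sample_lines:
    fields = line.split(field_separator)
    key = fields[2]                      # $3 in awk (awk fields are 1-based)
    origin[key] = origin.get(key, 0) + 1

# END {...} followed by: sort -nr -k2 | head -n 3
top = sorted(origin.items(), key=lambda item: item[1], reverse=True)[:3]
for name, count in top:
    print(name, count)
```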


#### Measurement 1.4: List Top-3 Airports by Flight Departures using Perl

Perl is another popular text and pattern processing language; we are not covering Perl in more depth in this unit, but just for fun let's see how fast or slow it is on this task compared to the previous alternatives.

In [6]:
%%time
! perl -F, -l -an -e '$h{$F[2]}++; END{for $w (sort {$h{$b}<=>$h{$a}} keys %h) {print "$h{$w}\t$w"}}' 'ontime_performance_2019-01.csv' | head -n 3

31155	"ATL"
26216	"ORD"
23063	"DFW"
CPU times: user 42 ms, sys: 20 ms, total: 62 ms
Wall time: 2.43 s


### Experiment 2: Determine average departure delay of United Airlines
The next experiment is an analysis without grouping or sorting. It requires scanning the full dataset and determining the average DEP_DELAY value for the entries of the 'UA' carrier (that is, filtering for United Airlines flights).

#### Measurement 2.1: Determine average departure delay of United Airlines using Pandas

In [7]:
%%time

# load OnTime Performance dataset for 2019-01 into Pandas DataFrame
import pandas as pd
data = pd.read_csv('ontime_performance_2019-01.csv')

# What is the average delay of United Airlines flights?
uadelays  = data.loc[data['OP_UNIQUE_CARRIER']=='UA']
print(uadelays['OP_UNIQUE_CARRIER'].count(),uadelays['DEP_DELAY'].sum(),uadelays['DEP_DELAY'].mean())

46915 560776.0 12.1293448403
CPU times: user 824 ms, sys: 144 ms, total: 968 ms
Wall time: 971 ms


#### Measurement 2.2: Determine average departure delay of United Airlines using Unix and awk

In [8]:
%%time
%%bash
awk 'BEGIN  {FS=","}
     /"UA"/ {delay_sum+=$8; delay_count++} 
     END    {print delay_count, delay_sum, delay_sum/delay_count}' "ontime_performance_2019-01.csv"

46915 560776 11.953
CPU times: user 1e+03 µs, sys: 8 ms, total: 9 ms
Wall time: 381 ms


Note that the results of the two queries above differ: while Pandas computed a mean departure delay of 12.1 min, awk found a mean of about 11.95 min.

Why?

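The reason is that Pandas by default ignores any NaN entries in the DataFrame, i.e. flights with unknown delay, when computing the mean. The first awk program still counts those lines: an empty DEP_DELAY field adds 0 to the sum but increments the count.
If we re-code the awk program to also ignore those lines, we get the same average value of 12.1 min: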

In [9]:
%%time
%%bash
awk 'BEGIN  { FS="," }
     /"UA"/ { if ($8!="") { delay_sum+=$8; delay_count++} } 
     END    { print delay_count, delay_sum, delay_sum/delay_count }' "ontime_performance_2019-01.csv"

46233 560776 12.1293
CPU times: user 5 ms, sys: 4 ms, total: 9 ms
Wall time: 383 ms
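The effect of the missing values can be checked on a tiny example. The delay values below are made up for illustration; an empty string stands for a flight with unknown delay, as in the CSV file.

```python
# Hypothetical DEP_DELAY fields; "" marks an unknown delay
delays = ["10.00", "", "20.00", "-6.00"]

# First awk program: an empty field counts as 0 in the sum
# but still increments the counter.
total = sum(float(d) if d != "" else 0.0 for d in delays)
mean_counting_missing = total / len(delays)

# Second awk program (and the Pandas default): skip unknown delays.
known = [float(d) for d in delays if d != ""]
mean_skipping_missing = sum(known) / len(known)

print(mean_counting_missing)   # 24 / 4
print(mean_skipping_missing)   # 24 / 3
```

The sum is identical in both cases; only the divisor changes, which is exactly the pattern in the two awk outputs above.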


That's it.

Please keep in mind that the performance values shown above are those measured on the current Jupyter server at USyd. They will differ on your own machine, and on a more modern machine with enough memory, Python+Pandas can be faster than the Unix commands.

# The End