<DIV ALIGN=CENTER>

# Unix Data Processing
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Unix Data Processing Overview

The purpose of this class is to introduce practical data science concepts. The Unix operating system and its rich set of commands have existed for decades. When these commands were first developed, data sets were generally much larger than the available compute resources. As we now find ourselves in a similar situation, these commands remain extremely useful. Furthermore, since a lot of data processing is most easily done directly on the data where it resides on the file system (to minimize data transfer and other overheads), familiarity with these commands is valuable.

Overall, we will explore Unix commands for data processing that can be categorized in the following ways:

- Viewing data
- Basic data processing
- Finding data
- Transforming data
- Advanced data processing

Basic data processing refers to simple operations on a data file, including sorting, finding unique or duplicate lines, counting the number of lines in a file, joining two files, or extracting columns from a file. Advanced data processing refers to performing simple or complicated operations on data in a file, where the operation may change depending on values in the file itself. Finally, we will briefly discuss shell scripting to make these data processing tasks more permanent.

-----

## Viewing data

An important task we will cover first is how to actually view the
contents of a file. In a graphical interface, you would open a document
editor, such as Microsoft Word, and then load the file into your editor.
At the command line, however, we simply use  a Unix command to open a
file for reading and to display the contents of a file to `stdout`,
which is generally the screen. When constructing Unix pipes, however, we
can use a file viewer at the start of the pipeline and pipe `stdout` 
into the next command in the chain.

Several commands are useful for viewing files:

### [`cat`][1]

Used to view the entire contents of a file. For example, to send the
contents of myfile to `stdout`, which in this case is the terminal
display:

    $ cat myfile
    
### [`less`][2]

Used to view the contents of a file, one screen at a time. Additional
options, which can be changed while viewing the file, provide a lot of
flexibility. `less` is a more recent version of the `more` command, which
can also be used. For example, to page through the contents of myfile
(using the spacebar to go to the next screen, the `b` key to go back one
screen, and the `q` key to quit):

    $ less myfile
   
### [`head`][3]

Used to view a limited number of lines from the start (or head) of the
file. By default the first ten lines will be displayed, but you can
specify the exact number by using the `-n num` flag, where _num_ is the
number of lines to display. For example, to display the first five lines
from myfile:

    $ head -n 5 myfile

### [`tail`][4]

Used to view a limited number of lines from the end (or tail) of the
file. By default the last ten lines will be displayed, but you can
specify the exact number by using the `-n num` flag, where _num_ is the
number of lines to display. For example, to display the last three lines
from myfile:

    $ tail -n 3 myfile

Another useful option for the `tail` command is the `-f` flag, which can
be used to display the last lines of a file that might be continually
updated (e.g., the output of another command).

We can demonstrate several of these commands, by first grabbing some
data (as indicated in the Unix Networking lesson) and viewing part of
the data.

![Viewing data example](images/shell-view.png)
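
The viewing commands can also be tried on a small generated file. The following is a minimal sketch, assuming a hypothetical twelve-line `myfile` created with `seq` in a temporary directory (neither is part of the lesson's data):

```shell
# Create a hypothetical 12-line sample file in a temporary directory.
workdir=$(mktemp -d)
seq 1 12 > "$workdir/myfile"

# The first five lines and the last three lines of the file.
first_five=$(head -n 5 "$workdir/myfile")
last_three=$(tail -n 3 "$workdir/myfile")

# A viewer can also start a pipeline: keep lines 1-6, then the last 3 of those.
middle=$(cat "$workdir/myfile" | head -n 6 | tail -n 3)

echo "$first_five"
echo "$last_three"
echo "$middle"
```

Chaining `head` and `tail` in this way is a simple method for extracting a range of lines from the middle of a file.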

-----
[1]: https://en.wikipedia.org/wiki/Cat_(Unix)
[2]: https://en.wikipedia.org/wiki/Less_(Unix)
[3]: https://en.wikipedia.org/wiki/Head_(Unix)
[4]: https://en.wikipedia.org/wiki/Tail_(Unix)

## Simple data processing

A number of simple Unix commands exist to perform basic data processing
tasks, like sorting, eliminating duplicate lines, counting lines, and
joining files together or splitting files apart.

### [`sort`][1]

Used to sort the lines of a file, alphabetically by default. The field, or
column, that should be used as the sort key is specified by the `-k`
flag; if no key is given, the entire line is used. A numeric sort can be
requested by using the `-n` flag. For example, to sort the contents of
myfile numerically, using the third column, we can use the following
command:

    $ sort -n -k 3 myfile
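
As a minimal sketch of the numeric sort, assume a hypothetical whitespace-separated `myfile` whose third column holds a count (the names and values below are invented for illustration):

```shell
# Hypothetical records: name, state, count.
workdir=$(mktemp -d)
printf 'carol IL 30\nalice WI 5\nbob IN 100\n' > "$workdir/myfile"

# Sort numerically (-n) using the third column as the key (-k 3).
sorted=$(sort -n -k 3 "$workdir/myfile")
echo "$sorted"
```

The `-n` flag matters here because, without it, the counts would be compared as text rather than as numbers.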

### [`uniq`][2] 

Used to view only the unique lines in a file. Repeated lines are only
detected when they are adjacent, so this command is often used in a pipe
after a sort, which places matching lines next to each other. For
example, to display only the unique lines in a myfile that has already
been sorted, we can use the following command:

    $ uniq myfile

### [`wc`][3]

Used to count the number of words and/or lines in a file (or set of
files). One common use of `wc` is to display the number of lines in a
file, which is done by using the `-l` flag. For example, to count and
display the number of lines in myfile, we can use the following command:

    $ wc -l myfile
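
These three commands are frequently combined. The sketch below, which assumes a hypothetical `myfile` of state codes, counts the distinct lines in a file and shows why the sort has to come first:

```shell
# Hypothetical file in which no two identical lines are adjacent.
workdir=$(mktemp -d)
printf 'IL\nWI\nIL\nIN\nWI\nIL\n' > "$workdir/myfile"

# uniq alone only collapses adjacent repeats, so nothing is removed here.
# (tr strips the padding that some wc implementations add to the count.)
unsorted_count=$(uniq "$workdir/myfile" | wc -l | tr -d ' ')

# Sorting first makes the duplicates adjacent, so uniq can collapse them.
distinct_count=$(sort "$workdir/myfile" | uniq | wc -l | tr -d ' ')

echo "$unsorted_count $distinct_count"
```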

-----

Three other related Unix commands can be useful for simple data
processing, especially as part of a Unix pipeline: the `paste`, `join`,
and `cut` commands. The `paste` command connects corresponding lines from
multiple files. The `join` command is similar, but it only connects lines
that have matching entries. Finally, the `cut` command selects part of
each line of a file to display.

### [`paste`][4]

Generally used to combine two files together by pasting _row1_ from the
second file  to the end of _row1_ in the first file. This process
continues until all rows have been pasted. For example, to paste the two
files, myfile1 and myfile2 together, we can use the following command:

    $ paste myfile1 myfile2
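
A minimal sketch of `paste`, assuming two hypothetical two-line files:

```shell
# Two hypothetical files with the same number of rows.
workdir=$(mktemp -d)
printf 'alice\nbob\n' > "$workdir/myfile1"
printf '30\n25\n' > "$workdir/myfile2"

# Corresponding rows are joined, separated by a tab by default.
pasted=$(paste "$workdir/myfile1" "$workdir/myfile2")
echo "$pasted"
```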

### [`join`][5]

Used to combine only the rows from two files that have matching entries
in a specific column, in a similar manner to a database join (both
inner, the default, and outer joins are supported). The columns can be
different for each file, and are specified by using flags: `-1`, to
refer to the field position in the first file, and `-2`, to refer to the
field position in the second file. Both files should already be sorted
on their join fields. For example, to perform an inner join for rows
from myfile1 and myfile2 where the second column in myfile1 matches the
fifth column in myfile2, we can use the following command:

    $ join -1 2 -2 5 myfile1 myfile2
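
The following sketch shows an inner join on a pair of small, hypothetical files. It assumes the state code is the second column of `myfile1` and the first column of `myfile2`, and that both files are already sorted on that column:

```shell
# Hypothetical inputs, each sorted on its join field (the state code).
workdir=$(mktemp -d)
printf 'alice IL\nbob IN\ncarol WI\n' > "$workdir/myfile1"
printf 'IL Springfield\nWI Madison\n' > "$workdir/myfile2"

# Match column 2 of the first file against column 1 of the second file.
joined=$(join -1 2 -2 1 "$workdir/myfile1" "$workdir/myfile2")
echo "$joined"
```

The join field is written first on each output line, followed by the remaining fields from each file. The row for `bob` is dropped because `IN` has no match in the second file.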

### [`cut`][6]

Used to select only part of each row for display. The part to cut can
be specified as either specific numbers of characters or bytes, or the
rows can be split on a delimiter value and only selected fields (or
columns) will be displayed. For example, to display only the third
through the fifth fields from a CSV file, we can use the following
command:

    $ cut -d "," -f 3-5 myfile.csv
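
A minimal sketch of both modes of `cut`, assuming a hypothetical six-column `myfile.csv`:

```shell
# Hypothetical CSV file with a header row and one record.
workdir=$(mktemp -d)
printf 'id,name,state,city,zip,phone\n1,alice,IL,Urbana,61801,555-0100\n' > "$workdir/myfile.csv"

# Select fields three through five, splitting each row on commas.
fields=$(cut -d "," -f 3-5 "$workdir/myfile.csv")

# Select by character position instead: the first two characters of each row.
first_two=$(cut -c 1-2 "$workdir/myfile.csv")

echo "$fields"
echo "$first_two"
```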


-----
[1]: https://en.wikipedia.org/wiki/Sort_(Unix)
[2]: https://en.wikipedia.org/wiki/Uniq
[3]: https://en.wikipedia.org/wiki/Wc_(Unix)
[4]: https://en.wikipedia.org/wiki/Paste_(Unix)
[5]: https://en.wikipedia.org/wiki/Join_(Unix)
[6]: https://en.wikipedia.org/wiki/Cut_(Unix)

## Finding data

A common data processing task is to find information, which can be even
more difficult in a big data environment. Unix offers several commands
that are designed to quickly find relevant information.

### [`grep`][1]

One common task is searching for a pattern within a stream of data. On a
Unix system, we can use the `grep` command to search for a pattern in
one or more files. Formally, the pattern can be a
[regular expression](http://www.aboutlinux.info/2006/01/learn-how-to-use-regular-expressions.html),
allowing for complicated pattern matching. Any row in the file (or list
of files) supplied to `grep` that matches the target pattern is written
to `stdout`, which typically means displayed to the terminal.

A typical use of `grep` is to output lines from a file that match the
target pattern. For example, to print out all rows in myfile that
contain _Illinois_, we can use the following command:

    $ grep Illinois myfile
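
The sketch below tries a literal pattern, a case-insensitive count, and a simple regular expression on a hypothetical four-line `myfile` (the city and state values are invented for illustration):

```shell
# Hypothetical data; note the lowercase "illinois" on the last line.
workdir=$(mktemp -d)
printf 'Urbana Illinois\nMadison Wisconsin\nChicago Illinois\nSpringfield illinois\n' > "$workdir/myfile"

# Literal, case-sensitive match.
matches=$(grep Illinois "$workdir/myfile")

# -i ignores case and -c prints the number of matching lines.
count_any_case=$(grep -i -c illinois "$workdir/myfile")

# Regular expression: lines that start with either C or M.
regex_matches=$(grep '^[CM]' "$workdir/myfile")

echo "$matches"
echo "$count_any_case"
echo "$regex_matches"
```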

### [`which`][2]

In some cases, we need to know where a program is located in the Unix
filesystem. This can be useful when multiple versions of an executable
might exist. We can find the location of a particular command or program
by using the `which` command. For example, to find the directory in
which the `which` command is located, we can use the following command:

    $ which which

### [`find`][3]

A more general case is where we might need to find one or more files
within a large file system. The standard tool used to accomplish this on
a Unix system is the `find` command. The `find` command is extremely
powerful, as you can recursively traverse directories to find files that
match a pattern, and then perform some action based on the results. This
can range from simply listing the full pathnames of the files to
performing some complicated processing task on the located files (like
renaming the files). For example, to find and list all files located
within the current directory (and any directories below it) that have
the word _illinois_ somewhere in the filename, we can use the following
command:

    $ find . -name '*illinois*' -print

Here the `.` tells `find` to start in the current directory, and the `*`
characters signify that any pattern is matched, allowing any filename
with illinois anywhere in the filename to be identified. The pattern is
quoted so that `find`, rather than the shell, performs the matching.
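
A minimal sketch of a recursive search by name, assuming a hypothetical directory tree built in a temporary directory. Quoting the pattern lets `find`, rather than the shell, match it against each filename:

```shell
# Build a small hypothetical directory tree.
workdir=$(mktemp -d)
mkdir -p "$workdir/data/2015"
touch "$workdir/data/illinois-census.csv" \
      "$workdir/data/2015/north-illinois.txt" \
      "$workdir/data/ohio.csv"

# Search from the top of the tree for regular files whose names contain
# "illinois"; the output is sorted only to make the listing order stable.
found=$(cd "$workdir" && find . -type f -name '*illinois*' -print | LC_ALL=C sort)
echo "$found"
```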

-----

[1]: https://en.wikipedia.org/wiki/Grep
[2]: https://en.wikipedia.org/wiki/Which_(Unix)
[3]: https://en.wikipedia.org/wiki/Find

## Transforming data

To transform data we can use the stream editor, or
[`sed`](https://en.wikipedia.org/wiki/Sed). `sed` provides an easy
mechanism for the find-and-replace approach to data processing, working
line-by-line through a file. A `sed` command is more flexible, however,
and it can also be used to filter data as part of a Unix pipe sequence.
By default, `sed` processes a stream of data read from `stdin` and
writes the possibly transformed data to `stdout`.

We can summarize the general format of a substitution `sed` command:

    sed 's/pattern/replacement/g' myfile

where the `s` indicates a substitution will occur, `pattern` is the text
to find, which can include a
[regular expression](http://www.aboutlinux.info/2006/01/learn-how-to-use-regular-expressions.html),
`replacement` is the text to substitute in place of the pattern, and the
`g` indicates that every occurrence of the `pattern` on the line should
be replaced. For example, if we want to transform a comma-separated
value file to use two spaces instead of commas to separate fields, we
could use the following `sed` command:

    $ sed 's/,/  /g' myfile.csv
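
The sketch below, which assumes a hypothetical two-line `myfile.csv`, contrasts a substitution with and without the `g` flag, and also shows `sed` used as a filter:

```shell
# Hypothetical CSV file with a header row and one record.
workdir=$(mktemp -d)
printf 'id,name,state\n1,alice,IL\n' > "$workdir/myfile.csv"

# Without g, only the first comma on each line is replaced.
first_only=$(sed 's/,/  /' "$workdir/myfile.csv")

# With g, every comma on each line is replaced.
all_commas=$(sed 's/,/  /g' "$workdir/myfile.csv")

# Filtering: -n suppresses the default output and p prints matching lines.
only_il=$(sed -n '/IL$/p' "$workdir/myfile.csv")

echo "$first_only"
echo "$all_commas"
echo "$only_il"
```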

-----

## Processing data

A straightforward technique for processing large data files on a Unix
system is to use the [Awk](https://en.wikipedia.org/wiki/AWK)
programming language. We will generally write a short, inline `awk`
command to perform simple processing; however, we could also write
complex `awk` programs that are saved in their own text file.

The general format of an `awk` program is pre-processing, per-line
processing, and post-processing. The pre-processing is contained within
the `BEGIN { ... }` clause, while the post-processing is contained within
the `END { ... }` clause. The per-line processing occurs in the
central `{ ... }` clause. Both the pre- and post-processing are optional.

For example, if we want to print out only the first, third and sixth
columns of every row in `myfile`, we can use the following `awk` command:

    $ awk '{print $1, $3, $6 ; }' myfile

We can also selectively process only rows that meet a certain condition.
For example, to print out the entire row when the first column is
greater than ten, we can use the following `awk` command:

    $ awk '{if ($1 > 10) print $0 ; }' myfile

where we have used the fact that `$0` is a special variable that
contains the entire row data.

Another useful trick is to accumulate a running sum from one or more
columns, and print the result out after processing an entire file:

    $ awk 'BEGIN {sum = 0}{sum += $3 } END {print sum}' myfile
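
As a minimal sketch that combines these ideas, assume a hypothetical three-column `myfile` (name, state, count). The first command writes the condition as an `awk` pattern in front of the action, which is an alternative to the `if` form, and the second uses the built-in `NR` variable (the number of rows read) to report a mean alongside the sum:

```shell
# Hypothetical whitespace-separated records: name, state, count.
workdir=$(mktemp -d)
printf 'alice IL 30\nbob IN 5\ncarol WI 25\n' > "$workdir/myfile"

# Print the first column only for rows whose third column exceeds ten.
names=$(awk '$3 > 10 { print $1 }' "$workdir/myfile")

# Accumulate the third column, then print the sum and the mean at the end.
summary=$(awk 'BEGIN { sum = 0 } { sum += $3 } END { print sum, sum / NR }' "$workdir/myfile")

echo "$names"
echo "$summary"
```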

-----

## Simple Shell Scripting

The Bash shell provides a full scripting capability, including the use
of variables, expressions, loops, and conditionals. While a full
discussion on Bash shell scripting is beyond the scope of this lesson,
it is often useful to take an existing data processing command sequence
and turn it into a simple script. This can provide several benefits.
First, by using comments to document the script, you have a documented
shell command sequence. Second, by creating a simple script, you can
often increase the functionality of your data processing, for example,
by allowing arbitrary filenames for the input and output of the command.

When creating a simple Bash shell script there are several useful things
to keep in mind.

1. The first line of the file should start with the special sequence
`#!/bin/bash` which signifies this script should be run via the bash
shell.  
2. Comment lines start with the hash `#` character.  
3. Command line parameters are available as special variables encoded by
`$D` where D is a decimal integer. For example, the first argument is
`$1`.  
4. The shell script must have execute permission set for the current
user (either via owner, group, or all).  

For example, to turn our `sed` command that converted a CSV file into a
whitespace-separated file into a script, we can create the following
shell script, which we call test.sh:

```
#!/bin/bash
#
# Convert a CSV file to a whitespace-separated file on STDOUT. The input is
# the file named by the first argument, or STDIN when no argument is given.

sed 's/,/  /g' $1
```

To use this script, we first need to make the script executable:

    $ chmod u+x ./test.sh
    
Next, we can execute this script as shown in the following image:

![shell script example](images/shell-script.png)

Notice how we first display two lines, and then convert only these two
lines to whitespace-separated values.
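
The whole workflow can also be sketched end to end. The example below assumes a hypothetical three-line `myfile.csv`, and it writes a variant of the script that quotes `"$1"` and is always given a filename (both are assumptions of this sketch rather than part of the lesson's script):

```shell
# Hypothetical CSV input in a temporary directory.
workdir=$(mktemp -d)
printf 'id,name,state\n1,alice,IL\n2,bob,IN\n' > "$workdir/myfile.csv"

# Write the script; quoting 'EOF' keeps $1 from being expanded right now.
cat > "$workdir/test.sh" <<'EOF'
#!/bin/bash
# Convert the CSV file named by the first argument to a
# whitespace-separated file on STDOUT.
sed 's/,/  /g' "$1"
EOF

# Give the owner execute permission, then run the script on the CSV file.
chmod u+x "$workdir/test.sh"
converted=$("$workdir/test.sh" "$workdir/myfile.csv")
echo "$converted"
```

Quoting `"$1"` keeps a filename that contains spaces intact, at the cost of requiring that a filename always be supplied.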

-----

### Additional References

1. Why a data scientist should be familiar with the [command line](http://www.dataists.com/2010/09/a-taxonomy-of-data-science/)
2. The book on [Data Science at the Command Line](http://datascienceatthecommandline.com)
3. A detailed discussion on [regular expressions](http://en.wikipedia.org/wiki/Regular_expression#Basic_concepts)
4. [AWK Tutorial](http://www.thelinuxtips.com/2012/03/awk-basics-tutorial-1/)

-----
