# UNIX Commands for Data Scientists

## Declare Filename

First, create a variable that holds the name of the text file to analyze.

In [2]:
!ls 

02_UNIX_reading.pdf                    shakespeare.txt
UNIX-Jupyter-Notebook-Example.ipynb    text_file_analysis_unix_commands.ipynb


In [3]:
%env filename= shakespeare.txt

env: filename=shakespeare.txt


Verify the variable by printing it using `echo`

In [4]:
!echo $filename

shakespeare.txt
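Outside a notebook, the same idea can be sketched in plain shell with `export`. This is a minimal example, assuming a Bourne-style shell such as bash; the file name `my file.txt` is hypothetical and is chosen only to show why the variable should be quoted when a name contains spaces.

```shell
# Hypothetical file with a space in its name, created in a temporary directory.
workdir=$(mktemp -d)
export filename="$workdir/my file.txt"
printf 'first line\nsecond line\n' > "$filename"

# Quoting "$filename" keeps the name as a single argument.
first=$(head -n 1 "$filename")
echo "$first"

rm -rf "$workdir"
```

Without the double quotes, the shell would split the name at the space and `head` would look for two separate files.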


## head

`head` prints lines from the top of a file; `-n` specifies how many. Without `-n` it prints the first 10 lines.

In [5]:
!head -n 3 $filename

This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
Library of the Future and Shakespeare CDROMS.  Project Gutenberg


## tail

`tail` prints lines from the bottom of a file; `-n` specifies how many. Without `-n` it prints the last 10 lines.

In [6]:
!tail -n 10 $filename

PERSONAL USE ONLY, AND (2) ARE NOT DISTRIBUTED OR USED
COMMERCIALLY.  PROHIBITED COMMERCIAL DISTRIBUTION INCLUDES BY ANY
SERVICE THAT CHARGES FOR DOWNLOAD TIME OR FOR MEMBERSHIP.>>



End of this Etext of The Complete Works of William Shakespeare
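Because `head` and `tail` both accept input from a pipe, they can be chained to pull out a range of lines. A minimal sketch on a generated ten-line file (the file name `numbers.txt` and the variable names are placeholders, not part of the notebook):

```shell
workdir=$(mktemp -d)

# Ten numbered lines: "line 1" ... "line 10".
i=1
while [ "$i" -le 10 ]; do
  echo "line $i"
  i=$((i + 1))
done > "$workdir/numbers.txt"

# Keep the first 5 lines, then the last 2 of those: lines 4 and 5.
middle=$(head -n 5 "$workdir/numbers.txt" | tail -n 2)
echo "$middle"

rm -rf "$workdir"
```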





## wc

`wc` stands for "word count". By default it prints the number of lines, words and bytes in the file:

In [7]:
!wc $filename

  124505  901447 5583442 shakespeare.txt


`-l` prints just the number of lines:

In [8]:
!wc -l $filename

  124505 shakespeare.txt
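The three counts can also be requested one at a time with `-l`, `-w` and `-c`. The sketch below runs them on a small generated file; the file name `sample.txt` and the variable names are illustrative only.

```shell
workdir=$(mktemp -d)
printf 'to be or\nnot to be\n' > "$workdir/sample.txt"

# tr strips the padding that some wc implementations put before the number.
lines=$(wc -l < "$workdir/sample.txt" | tr -d ' ')
words=$(wc -w < "$workdir/sample.txt" | tr -d ' ')
bytes=$(wc -c < "$workdir/sample.txt" | tr -d ' ')
echo "$lines lines, $words words, $bytes bytes"

rm -rf "$workdir"
```

For this sample the result is 2 lines, 6 words and 19 bytes (each newline counts as one byte).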


## cat

`|` (a pipe) streams the output of one command into the input of the next.

For example, `cat` dumps the content of a file, and we can pipe it to `wc`:

In [9]:
!cat $filename | wc -l 

  124505
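When the first command is only `cat`, input redirection with `<` feeds the file to the next command directly. A small sketch comparing the two forms, using a throwaway file whose name is made up for the example:

```shell
workdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$workdir/letters.txt"

# cat writes the file into the pipe; wc reads it from standard input.
via_pipe=$(cat "$workdir/letters.txt" | wc -l | tr -d ' ')

# Input redirection hands wc the same stream without running cat.
via_redirect=$(wc -l < "$workdir/letters.txt" | tr -d ' ')

echo "$via_pipe $via_redirect"
rm -rf "$workdir"
```

Both forms report 3 lines, and neither prints the file name because `wc` only sees a stream.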


## grep

`grep` is an extremely powerful tool for searching text in one or more files.

For example, the next command looks for all the lines that contain `parchment`; `-i` makes the match case-insensitive (upper or lower case).

In [10]:
!grep -i 'parchment' $filename

  If the skin were parchment, and the blows you gave were ink,
  Ham. Is not parchment made of sheepskins?
    of the skin of an innocent lamb should be made parchment? That
    parchment, being scribbl'd o'er, should undo a man? Some say the
    Upon a parchment, and against this fire
    But here's a parchment with the seal of Caesar;  
    With inky blots and rotten parchment bonds;
    Nor brass, nor stone, nor parchment, bears not one,


Combine `grep` and `wc` to count the number of lines in a file that contain a specific word

In [11]:
!grep -i 'liberty' $filename | wc -l

      72
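Note that this counts matching lines, not occurrences: a line that contains the word twice is counted once. The sketch below, on a made-up three-line sample, compares the pipe to `wc -l` with grep's own `-c` flag, and uses `-o` to count individual matches. `-o` is not part of POSIX, but it is available in both GNU and BSD `grep`.

```shell
workdir=$(mktemp -d)
printf 'Liberty! Freedom!\nno match here\nliberty and liberty\n' > "$workdir/sample.txt"

# Matching lines, counted with a pipe and with grep's own -c flag.
with_wc=$(grep -i 'liberty' "$workdir/sample.txt" | wc -l | tr -d ' ')
with_c=$(grep -c -i 'liberty' "$workdir/sample.txt")

# -o prints every match on its own line, so this counts occurrences.
occurrences=$(grep -o -i 'liberty' "$workdir/sample.txt" | wc -l | tr -d ' ')

echo "$with_wc $with_c $occurrences"
rm -rf "$workdir"
```

Here two lines match, but the word occurs three times.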


## sed

`sed` is a powerful stream editor. Like `grep`, it matches patterns line by line, but it can also modify the text it outputs. Its patterns are regular expressions, a language for describing the text to match.

For example:

    s/from/to/g
    
means:

* `s` for substitution
* `from` is the word to match
* `to` is the replacement string
* `g` specifies to apply this to all occurrences on a line, not just the first
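The effect of the `g` flag is easiest to see on a line that contains the word twice. A minimal sketch on a made-up string:

```shell
line='parchment upon parchment'

# Without g, only the first match on the line is replaced.
first_only=$(echo "$line" | sed -e 's/parchment/manuscript/')

# With g, every match on the line is replaced.
all=$(echo "$line" | sed -e 's/parchment/manuscript/g')

echo "$first_only"
echo "$all"
```

The first command gives `manuscript upon parchment`, the second `manuscript upon manuscript`.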

In the following we are replacing all instances of 'parchment' with 'manuscript'.

We are also redirecting the output to a file with `>`, so instead of being printed to the screen it is saved in the text file `temp.txt`.

In [12]:
# replace all instances of 'parchment' with 'manuscript'

!sed -e 's/parchment/manuscript/g' $filename > temp.txt

Then we are checking with `grep` that `temp.txt` contains the word "manuscript":

In [13]:
!grep -i 'manuscript' temp.txt 

  If the skin were manuscript, and the blows you gave were ink,
  Ham. Is not manuscript made of sheepskins?
    of the skin of an innocent lamb should be made manuscript? That
    manuscript, being scribbl'd o'er, should undo a man? Some say the
    Upon a manuscript, and against this fire
    But here's a manuscript with the seal of Caesar;  
    With inky blots and rotten manuscript bonds;
    Nor brass, nor stone, nor manuscript, bears not one,


## sort

In [21]:
!head -n 5 $filename

This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
Library of the Future and Shakespeare CDROMS.  Project Gutenberg
often releases Etexts that are NOT placed in the Public Domain!!



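Use `sort` to put the lines above in alphabetical order; `-k1` starts the sort key at the first word. In the output below uppercase letters come before lowercase ones; the exact ordering depends on the locale.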

In [20]:
!head -n 5 $filename | sort -k1


Library of the Future and Shakespeare CDROMS.  Project Gutenberg
This is the 100th Etext file presented by Project Gutenberg, and
is presented in cooperation with World Library, Inc., from their
often releases Etexts that are NOT placed in the Public Domain!!


We can also sort on the second word of each line: `-t' '` sets the field delimiter to a space and `-k2` starts the sort key at field 2.

The lines are therefore ordered by "is", "of", "presented", "releases":

In [19]:
!head -n 5 $filename | sort -t' ' -k2


This is the 100th Etext file presented by Project Gutenberg, and
Library of the Future and Shakespeare CDROMS.  Project Gutenberg
is presented in cooperation with World Library, Inc., from their
often releases Etexts that are NOT placed in the Public Domain!!
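A key written as `-k2` runs from field 2 to the end of the line. To compare only the second field, the key can be closed with `-k2,2`, and a trailing `n` compares that field as a number. The sketch below uses made-up three-column data and sets `LC_ALL=C` so the ordering does not depend on the local language settings.

```shell
data=$(printf 'b 2 x\na 10 y\nc 1 z\n')

# -k2,2 restricts the key to field 2; as text, "10" sorts before "2".
lexical=$(printf '%s\n' "$data" | LC_ALL=C sort -t' ' -k2,2)

# Appending n compares the same field numerically: 1, 2, 10.
numeric=$(printf '%s\n' "$data" | LC_ALL=C sort -t' ' -k2,2n)

echo "$lexical"
echo "$numeric"
```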


`sort` is often used in combination with `uniq` to deal with duplicated lines.

`uniq` only compares consecutive lines, so we first use `sort` to bring equal lines next to each other. Plain `uniq` keeps one copy of each line, while `uniq -u` prints only the lines that are not repeated at all, so the second count below is the number of lines that occur exactly once:

In [16]:
!sort $filename | wc -l

124505


In [17]:
!sort $filename | uniq -u | wc -l

110834
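The difference between the `uniq` variants is easier to see on a tiny input. A minimal sketch with made-up data, in which `a` occurs twice, `b` three times and `c` once:

```shell
data=$(printf 'b\na\nb\nc\na\nb\n')

# Plain uniq keeps one copy of every distinct line.
distinct=$(printf '%s\n' "$data" | LC_ALL=C sort | uniq)

# uniq -u keeps only the lines that occur exactly once.
singletons=$(printf '%s\n' "$data" | LC_ALL=C sort | uniq -u)

# uniq -d keeps one copy of each line that occurs more than once.
repeated=$(printf '%s\n' "$data" | LC_ALL=C sort | uniq -d)

echo "$distinct"
echo "$singletons"
echo "$repeated"
```

Plain `uniq` yields `a`, `b`, `c`; `uniq -u` yields only `c`; `uniq -d` yields `a` and `b`.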


# Let's bring it all together

The "UNIX philosophy" is "Do one thing, do it well" (https://en.wikipedia.org/wiki/Unix_philosophy). The point is to have specialized tools, each with one well-defined function, and then compose them with pipes.

## Count the most frequent words

**Warning for macOS**: macOS ships a different version of `sed` that treats line feed `\n` and carriage return `\r` differently. Therefore on a Mac we need to replace each occurrence of:

    sed -e 's/ /\n/g' -e 's/\r//g'

with:

    sed -e 's/ /\'$'\n/g' -e $'s/\r//g'
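One way to sidestep the difference between the two `sed` versions is to do this step with `tr`, which understands the `\n` and `\r` escapes on both kinds of system. This is a swapped-in technique rather than what the notebook uses, shown here on a short made-up sample with Windows line endings:

```shell
# Sample text with Windows line endings (\r\n), like shakespeare.txt.
text=$(printf 'to be\r\nor not\r\n')

# Turn every space into a newline, then delete the carriage returns.
words=$(printf '%s\n' "$text" | tr ' ' '\n' | tr -d '\r')
echo "$words"
```

The result is one word per line: `to`, `be`, `or`, `not`.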

In [18]:
!sed -e 's/ /\n/g' -e 's/\r//g' $filename  | sed '/^$/d'| sort | uniq -c | sort -nr | head -15

  23244 the
  19542 I
  18302 and
  15623 to
  15551 of
  12532 a
  10824 my
   9576 in
   9081 you
   7851 is
   7531 that
   7068 And
   6948 not
   6722 with
   6218 his
sort: write failed: 'standard output': Broken pipe
sort: write error


**Do not worry** about the `Broken pipe` error: `head` closes the pipe after reading the first 15 lines, and `sort` complains because it still has text to write.
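The same mechanism is what lets a pipeline stop early even when the first command never finishes on its own. A small sketch using `yes`, which prints its argument forever (the error output is discarded here, since whether a message appears depends on the environment):

```shell
# head exits after three lines, which closes the pipe; the next write
# by yes then fails, so the pipeline terminates.
first_three=$(yes line 2>/dev/null | head -n 3)
echo "$first_three"
```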

Now let's go through the pipeline one step at a time, starting with:

    !sed -e 's/ /\n/g' -e 's/\r//g' $filename

`sed` makes two replacements. The first replaces each space with `\n`, the newline character, which puts every word of the text on its own line. See for yourself below!

The second replacement is needed because `shakespeare.txt` uses the Windows convention of ending each line with `\r\n`. `\r` is a carriage return; we get rid of it by replacing it with nothing.

In [19]:
!sed -e 's/ /\n/g' -e 's/\r//g' < $filename | head

This
is
the
100th
Etext
file
presented
by
Project
Gutenberg,
sed: couldn't write 48 items to stdout: Broken pipe


Next we are not interested in counting empty lines, so we can remove them with:

     sed '/^$/d'
     
* `^` indicates the beginning of a line
* `$` indicates the end of a line

Therefore `/^$/` matches empty lines, and the `d` command that follows the pattern tells `sed` to delete them.
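As a quick check on a made-up sample, the sketch below removes the empty lines with `sed` and, for comparison, with `grep -v`, which prints only the lines that do not match the pattern:

```shell
text=$(printf 'one\n\ntwo\n\n\nthree\n')

# /^$/ selects lines with nothing between start and end; d deletes them.
with_sed=$(printf '%s\n' "$text" | sed '/^$/d')

# grep -v inverts the match, so only non-empty lines are printed.
with_grep=$(printf '%s\n' "$text" | grep -v '^$')

echo "$with_sed"
```

Both commands leave the three lines `one`, `two`, `three`.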

Next we'd like to count the occurrences of each word. Here we can use `uniq` with the `-c` option but, as with the `-u` option, it needs equal lines to be consecutive, so we sort first:

In [20]:
!sed -e 's/ /\n/g' -e 's/\r//g' $filename  | sed '/^$/d' | sort | uniq -c | head

      1 __
      9 -
      2 ?
      1 /
     51 .
    241 "
      1 (~),
      1 (_)
      1 (*)
     14 [
uniq: write error: Broken pipe


Good, the words are counted. Now we need to sort them by count, which calls for numeric rather than alphabetical ordering, so we specify `-n`. We also want reverse order, biggest first, so we add `-r`.

And finally we take the first 15 lines:

In [21]:
!sed -e 's/ /\n/g' -e 's/\r//g' $filename | sed '/^$/d' | sort | uniq -c | sort -nr | head -15

  23244 the
  19542 I
  18302 and
  15623 to
  15551 of
  12532 a
  10824 my
   9576 in
   9081 you
   7851 is
   7531 that
   7068 And
   6948 not
   6722 with
   6218 his
sort: write failed: 'standard output': Broken pipe
sort: write error
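In the result above, `and` and `And` are counted as different words. One possible variant, sketched below, folds everything to lower case before counting. The helper name `count_words` is hypothetical, the sample text is made up, and the splitting step uses `tr` instead of `sed` so that it behaves the same on Linux and macOS; treat it as an illustration rather than a replacement for the pipeline above.

```shell
# count_words: hypothetical helper; reads text on standard input and
# prints "count word" lines, highest count first, ties in word order.
count_words() {
  tr -d '\r' |
    tr '[:upper:]' '[:lower:]' |
    tr ' ' '\n' |
    sed '/^$/d' |
    LC_ALL=C sort |
    uniq -c |
    LC_ALL=C sort -k1,1nr -k2 |
    sed 's/^ *//'
}

top=$(printf 'The cat and the hat\r\nAnd the cat sat\r\n' | count_words | head -n 3)
echo "$top"
```

The final `sed` only strips the padding that `uniq -c` puts in front of each count. On this sample the top three entries are `3 the`, `2 and`, `2 cat`.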


## Write the output to a file

We can also run the same pipeline and save its output to a file for later use:

In [22]:
!sed -e 's/ /\n/g' -e 's/\r//g' < $filename | sed '/^$/d' | sort | uniq -c | sort -nr | head -15 > count_vs_words

sort: write failed: 'standard output': Broken pipe
sort: write error


In [23]:
!cat count_vs_words

  23244 the
  19542 I
  18302 and
  15623 to
  15551 of
  12532 a
  10824 my
   9576 in
   9081 you
   7851 is
   7531 that
   7068 And
   6948 not
   6722 with
   6218 his
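Keep in mind that `>` overwrites the target file every time the command runs, while `>>` appends to it. A minimal sketch with a throwaway file (the name `result.txt` is made up for the example):

```shell
workdir=$(mktemp -d)
out="$workdir/result.txt"

echo "first run" > "$out"    # > creates the file or overwrites it
echo "second run" > "$out"   # the earlier content is gone
echo "third run" >> "$out"   # >> appends instead

saved=$(cat "$out")
echo "$saved"
rm -rf "$workdir"
```

The file ends up with two lines, `second run` and `third run`.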
