# Exercise 1 - Shell basics

Work through as much of the Software Carpentry [lesson on the Unix Shell](http://swcarpentry.github.io/shell-novice/) as you can.  Run through the Setup section just below, then open a shell from the command line or with a terminal session through Jupyter to run through the exercises.

After you have completed the first few sections of the tutorial, return to this notebook.

Execute all of the cells, answer all of the questions, and wherever you see "**Edit this cell**", do it!


## 0. Setup - getting required files 

To get started, you'll need to have the required files in your directory.  Use `wget` to get them:

In [1]:
!wget http://swcarpentry.github.io/shell-novice/data/data-shell.zip

--2018-09-08 15:13:46--  http://swcarpentry.github.io/shell-novice/data/data-shell.zip
Resolving swcarpentry.github.io (swcarpentry.github.io)... 185.199.110.153, 185.199.109.153, 185.199.108.153, ...
Connecting to swcarpentry.github.io (swcarpentry.github.io)|185.199.110.153|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 579150 (566K) [application/zip]
Saving to: ‘data-shell.zip’


2018-09-08 15:13:47 (29.6 MB/s) - ‘data-shell.zip’ saved [579150/579150]



The `-l` option to `unzip` lists what will be unpacked without actually unpacking it.  It's a good habit to check what you're going to get before you fill your disk with new files.

In [2]:
!unzip -l data-shell.zip

Archive:  data-shell.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  2018-03-08 09:17   data-shell/
       32  2017-02-16 15:58   data-shell/pizza.cfg
      199  2017-06-30 09:40   data-shell/.bash_profile
        0  2017-02-16 15:58   data-shell/molecules/
      622  2017-02-16 15:58   data-shell/molecules/ethane.pdb
     1158  2017-02-16 15:58   data-shell/molecules/cubane.pdb
      825  2017-02-16 15:58   data-shell/molecules/propane.pdb
     1226  2017-02-16 15:58   data-shell/molecules/pentane.pdb
      422  2017-02-16 15:58   data-shell/molecules/methane.pdb
     1828  2017-02-16 15:58   data-shell/molecules/octane.pdb
        0  2017-02-16 15:58   data-shell/north-pacific-gyre/
        0  2017-10-19 09:41   data-shell/north-pacific-gyre/2012-07-03/
     4371  2017-02-16 15:58   data-shell/north-pacific-gyre/2012-07-03/NENE01736A.txt
     4391  2017-02-16 15:58   data-shell/north-pacific-gyre/2012-07-03/NENE02040A.txt
     4409  2017-02-16 15:

Looks good - now we remove the `-l` so we can `unzip` it for real this time.

In [3]:
!unzip data-shell.zip

Archive:  data-shell.zip
replace data-shell/pizza.cfg? [y]es, [n]o, [A]ll, [N]one, [r]ename: ^C
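
The same look-before-you-unpack habit can be sketched in Python with the standard `zipfile` module. This is only an illustration: the archive is built in memory, and the member names (`demo/notes.txt`, `demo/data/values.csv`) are made up for the example rather than taken from `data-shell.zip`.

```python
import io
import zipfile

# Build a tiny archive in memory so the example is self-contained.
# The member names below are hypothetical.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("demo/notes.txt", "hello\n")
    archive.writestr("demo/data/values.csv", "1,2,3\n")

# List the contents without extracting anything, much like `unzip -l`.
with zipfile.ZipFile(buffer) as archive:
    listing = [(info.filename, info.file_size) for info in archive.infolist()]

for name, size in listing:
    print(f"{size:>9}  {name}")
```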


*Note*: you only need to do this once per session while using Jupyter.  You can open a terminal now and work through the steps, and return to this notebook a little later, and the files will be available either way.  That's because you're working in the same local directory.

If you're reusing one AWS EC2 instance, your files should remain after you stop and restart.  But if you're using datanotebook.org, you'll have to repeat the above, because it cleans out every session when you're done.

Okay, let's get on with the exercise!

## 1. Markdown basics

In the following cells, practice marking up text using markdown:

Make this a top-level header (add "`# `" to the left of the text).

Make this a second-level header (add "`## `" to the left of the text).

Make the following three words three separate bullet points (add "`* `" to the left of each line):

  One
  Two
  Three

Make this cell a Markdown cell instead of code!

In [4]:
print("Make this cell a code cell instead of Markdown!")

Make this cell a code cell instead of Markdown!


## 2. Navigating Files and Directories

As you work through this section of the tutorial, complete the steps here as well, using the `!` shell escape command.  Execute each cell as you go.

These steps aren't exactly the same as what's in the tutorial, where the file layout is a little different and where they're not using a notebook like we are.  That's okay.  Just consider this practice.

In [5]:
!whoami

ubuntu


In [6]:
!pwd

/home/ubuntu


In [7]:
!ls -F

data-shell/	exercise-1.ipynb   R/
data-shell.zip	practice-02.ipynb  Untitled.ipynb


In [8]:
!ls -F

data-shell/	exercise-1.ipynb   R/
data-shell.zip	practice-02.ipynb  Untitled.ipynb


In [9]:
!ls -F data-shell/

creatures/  haiku-words.txt  lengths.txt	  numbers.txt	      writing/
data/	    hydro.csv	     molecules/		  pizza.cfg
example/    hydro.xls	     north-pacific-gyre/  solar.pdf
final.txt   index.html	     notes.txt		  sorted-lengths.txt


In [10]:
!ls -aF

./	       data-shell.zip	    .RData
../	       exercise-1.ipynb     .screenrc
.bash_aliases  .ipynb_checkpoints/  .ssh/
.bash_history  .ipython/	    .sudo_as_admin_successful
.bash_logout   .jupyter/	    Untitled.ipynb
.bashrc        .local/		    .viminfo
.cache/        .parallel/	    .vimrc
.cargo/        practice-02.ipynb    .wget-hsts
.config/       .profile
data-shell/    R/


In [11]:
!ls -af .

Untitled.ipynb	    data-shell.zip  exercise-1.ipynb	       .viminfo
.ipynb_checkpoints  .cache	    .bash_aliases	       .screenrc
.cargo		    .config	    .profile		       .bash_logout
..		    .ssh	    .sudo_as_admin_successful  .jupyter
R		    .ipython	    practice-02.ipynb	       .bash_history
.parallel	    .bashrc	    .RData		       .local
.		    .wget-hsts	    .vimrc		       data-shell


What is the difference between the two previous cells, and what does the single dot mean?

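**EDIT THIS CELL** WITH YOUR ANSWER HERE    
1. `!ls -aF` marks each entry's type (for example, a trailing `/` on directories) and sorts the listing; `!ls -af .` adds no markers and lists the entries in unsorted directory order.    
2. `.` on its own means "the current directory".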

In [12]:
!ls -F ..

ubuntu/


What do the double dots mean?

**EDIT THIS CELL** WITH YOUR ANSWER HERE    
`..` means "the directory above the current one", that is, the parent of the current directory.
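
As a small sketch, the same two special names can be resolved from Python with `os.path.abspath`. The variable names `here` and `parent` are chosen for this example only.

```python
import os

here = os.path.abspath(".")     # "." resolves to the current directory
parent = os.path.abspath("..")  # ".." resolves to the directory above it

print(here)
print(parent)
```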


In [13]:
!ls data-shell/north-pacific-gyre/2012-07-03/

goodiff		NENE01736A.txt	NENE01843A.txt	NENE01978B.txt	NENE02040Z.txt
goostats	NENE01751A.txt	NENE01843B.txt	NENE02018B.txt	NENE02043A.txt
NENE01729A.txt	NENE01751B.txt	NENE01971Z.txt	NENE02040A.txt	NENE02043B.txt
NENE01729B.txt	NENE01812A.txt	NENE01978A.txt	NENE02040B.txt


## 3. Working with Files and Directories

The following cells come from the next section of the tutorial.

In [14]:
!ls -F

data-shell/	exercise-1.ipynb   R/
data-shell.zip	practice-02.ipynb  Untitled.ipynb


In [15]:
!mkdir thesis

In [16]:
import os
assert "thesis" in os.listdir()

In [17]:
!ls -F

data-shell/	exercise-1.ipynb   R/	    Untitled.ipynb
data-shell.zip	practice-02.ipynb  thesis/


You can't use the nano editor here in Jupyter, so we'll use the `touch` command to create an empty file instead.

In [18]:
!touch thesis/draft.txt

In [19]:
assert "draft.txt" in os.listdir("thesis")

In [20]:
!ls -F thesis

draft.txt


Removing files and directories.

In [21]:
!rm thesis/draft.txt

In [22]:
assert "draft.txt" not in os.listdir("thesis")

In [23]:
!rm thesis

rm: cannot remove 'thesis': Is a directory


In [24]:
!rmdir thesis

In [25]:
assert "thesis" not in os.listdir()
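
The file-versus-directory distinction seen with `rm` and `rmdir` also shows up in Python's `os` module. The sketch below works in a temporary directory (an assumption made so that it leaves nothing behind) and shows that `os.remove` refuses a directory while `os.rmdir` removes an empty one.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workspace:
    target = os.path.join(workspace, "thesis")
    os.mkdir(target)

    # os.remove() is meant for files; given a directory it raises an
    # OSError subclass (the exact subclass depends on the platform).
    try:
        os.remove(target)
        remove_failed = False
    except OSError:
        remove_failed = True

    # os.rmdir() removes a directory, but only if it is empty.
    os.rmdir(target)
    still_exists = os.path.exists(target)

print(remove_failed, still_exists)
```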

In [26]:
!ls

data-shell	exercise-1.ipynb   R
data-shell.zip	practice-02.ipynb  Untitled.ipynb


Renaming and copying files.

In [27]:
!touch draft.txt

In [28]:
assert "draft.txt" in os.listdir()

In [29]:
!mv draft.txt quotes.txt

In [30]:
assert "quotes.txt" in os.listdir()
assert "draft.txt" not in os.listdir()

In [31]:
!ls

data-shell	exercise-1.ipynb   quotes.txt  Untitled.ipynb
data-shell.zip	practice-02.ipynb  R


In [32]:
!cp quotes.txt quotations.txt

In [33]:
assert "quotes.txt" in os.listdir()
assert "quotations.txt" in os.listdir()
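
For comparison, here is a rough Python counterpart of the `touch`, `mv`, and `cp` steps above, using `pathlib` and `shutil`. It runs in a temporary directory, which is an assumption of this sketch so that the notebook's own files stay untouched.

```python
import os
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as workspace:
    draft = Path(workspace) / "draft.txt"
    quotes = Path(workspace) / "quotes.txt"
    quotations = Path(workspace) / "quotations.txt"

    draft.touch()                    # like `touch draft.txt`
    draft.rename(quotes)             # like `mv draft.txt quotes.txt`
    shutil.copy(quotes, quotations)  # like `cp quotes.txt quotations.txt`

    names = sorted(os.listdir(workspace))

print(names)
```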

## 4. Working with output redirection

Create a new directory:

In [34]:
!mkdir part1

Rename `part1` to `partone` using `mv`.

In [35]:
!mv part1 partone
!ls

data-shell	exercise-1.ipynb  practice-02.ipynb  quotes.txt  Untitled.ipynb
data-shell.zip	partone		  quotations.txt     R


Create a file named `filelist.txt` using the output from `ls` and the output redirector `>`.

In [36]:
!ls > filelist.txt

In [37]:
!cat filelist.txt

data-shell
data-shell.zip
exercise-1.ipynb
filelist.txt
partone
practice-02.ipynb
quotations.txt
quotes.txt
R
Untitled.ipynb


Append to `filelist.txt` using the output appending redirector `>>`.  Note the difference between the single `>` and double `>>`.

In [38]:
!ls >> filelist.txt
!cat filelist.txt

data-shell
data-shell.zip
exercise-1.ipynb
filelist.txt
partone
practice-02.ipynb
quotations.txt
quotes.txt
R
Untitled.ipynb
data-shell
data-shell.zip
exercise-1.ipynb
filelist.txt
partone
practice-02.ipynb
quotations.txt
quotes.txt
R
Untitled.ipynb


In [39]:
!ls > filelist.txt
!cat filelist.txt

data-shell
data-shell.zip
exercise-1.ipynb
filelist.txt
partone
practice-02.ipynb
quotations.txt
quotes.txt
R
Untitled.ipynb


What's the difference between `>` and `>>`?


**EDIT THIS CELL** WITH YOUR ANSWER HERE    
With `>`, the file is overwritten each time the command runs.    
With `>>`, the new output is appended to the end of the existing file.
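
The same overwrite-versus-append distinction exists in Python's file modes: `"w"` behaves like `>` and `"a"` behaves like `>>`. A minimal sketch, assuming a throwaway file in a temporary directory:

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as workspace:
    path = Path(os.path.join(workspace, "filelist.txt"))

    with open(path, "w") as handle:   # like `>`: create or overwrite
        handle.write("first\n")
    with open(path, "a") as handle:   # like `>>`: append to the end
        handle.write("second\n")
    after_append = path.read_text()

    with open(path, "w") as handle:   # `>` again: earlier content is lost
        handle.write("third\n")
    after_overwrite = path.read_text()

print(repr(after_append))
print(repr(after_overwrite))
```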


Now create a directory called "`mydirectory`":

In [40]:
# Edit this cell!
!mkdir mydirectory

In [41]:
assert 'mydirectory' in os.listdir('.')

Using `ls` and output redirection, create a file called `myfiles.txt` in the directory `mydirectory` that contains the list of files in the current directory.

In [42]:
# Edit this cell!
!ls > mydirectory/myfiles.txt

In [43]:
assert 'myfiles.txt' in os.listdir('mydirectory')

Clean up the directory you just created by removing its contents (the file you created) using `rm`.

In [44]:
# Edit this cell!
!rm mydirectory/myfiles.txt

In [45]:
assert 'myfiles.txt' not in os.listdir('mydirectory')

Now remove the directory itself using `rmdir`.

In [46]:
# Edit this cell!
!rmdir mydirectory

In [47]:
assert 'mydirectory' not in os.listdir('.')

## 5. Filters and pipes

Let's look at something a little more interesting.  Download the text of Charlotte Bronte's *Jane Eyre* from [Project Gutenberg](http://www.gutenberg.org/):

In [48]:
!wget https://s3.amazonaws.com/2018-dmfa/assignment-1/jane-eyre.txt

--2018-09-08 15:21:18--  https://s3.amazonaws.com/2018-dmfa/assignment-1/jane-eyre.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.133.221
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.133.221|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1070331 (1.0M) [text/plain]
Saving to: ‘jane-eyre.txt’


2018-09-08 15:21:18 (21.5 MB/s) - ‘jane-eyre.txt’ saved [1070331/1070331]



`head` and `tail` are very useful.  They let you take a quick peek at the start and end of files.

In [49]:
!head jane-eyre.txt

﻿The Project Gutenberg eBook, Jane Eyre, by Charlotte Bronte, Illustrated
by F. H. Townsend


This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project Gutenberg License included
with this eBook or online at www.gutenberg.org




In [50]:
!tail jane-eyre.txt

keep eBooks in compliance with any particular paper edition.

Most people start at our Web site which has the main PG search facility:

     http://www.gutenberg.org

This Web site includes information about Project Gutenberg-tm,
including how to make donations to the Project Gutenberg Literary
Archive Foundation, how to help produce our new eBooks, and how to
subscribe to our email newsletter to hear about new eBooks.
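
To make the idea concrete, here is one way `head` and `tail` might be imitated in Python. The helper names `head` and `tail` and the sample lines are made up for this sketch; it is not how the shell tools are implemented.

```python
from collections import deque
from itertools import islice

def head(lines, n=10):
    """Return the first n items, reading no further than needed."""
    return list(islice(lines, n))

def tail(lines, n=10):
    """Return the last n items, keeping only n in memory at a time."""
    return list(deque(lines, maxlen=n))

sample = [f"line {i}" for i in range(1, 101)]
first_three = head(sample, 3)
last_three = tail(sample, 3)

print(first_three)
print(last_three)
```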


`grep` is one of the most useful filters.  It lets you search for and match lines that contain specific expressions.  For example, to find mentions of "copyright":

In [51]:
!grep copyright jane-eyre.txt

_in this Volume are the copyright of_
one owns a United States copyright in these works, so the Foundation
permission and without paying copyright royalties.  Special rules,
(trademark/copyright) agreement.  If you do not agree to abide by all
or PGLAF), owns a compilation copyright in the collection of Project
1.D.  The copyright laws of the place where you are located also govern
the copyright status of any work in any country outside the United
posted with permission of the copyright holder), the work can be copied
with the permission of the copyright holder, your use and distribution
terms imposed by the copyright holder.  Additional terms will be linked
permission of the copyright holder found at the beginning of this work.
effort to identify, do copyright research on, transcribe and proofread
corrupt data, transcription errors, a copyright or other intellectual
unless a copyright notice is included.  Thus, we do not necessarily


Notice anything that those lines have in common?

Let's add a little more information by including the `-n` flag to add matching line numbers.

In [52]:
!grep -n copyright jane-eyre.txt

57:_in this Volume are the copyright of_
20719:one owns a United States copyright in these works, so the Foundation
20721:permission and without paying copyright royalties.  Special rules,
20756:(trademark/copyright) agreement.  If you do not agree to abide by all
20775:or PGLAF), owns a compilation copyright in the collection of Project
20790:1.D.  The copyright laws of the place where you are located also govern
20797:the copyright status of any work in any country outside the United
20816:posted with permission of the copyright holder), the work can be copied
20826:with the permission of the copyright holder, your use and distribution
20828:terms imposed by the copyright holder.  Additional terms will be linked
20830:permission of the copyright holder found at the beginning of this work.
20901:effort to identify, do copyright research on, transcribe and proofread
20906:corrupt data, transcription errors, a copyright or other intellectual
21052:unless a copyright notice is included.  Thus, we do not necessarily

Now let's look for any mention of "book".  This will match a lot of text, so we'll just take the first 10 matching lines by *piping* the output from `grep` into `head`.

In [53]:
!grep -n book jane-eyre.txt | head

101:doubt the tendency of such books as "Jane Eyre:" in whose eyes whatever
220:contained a bookcase: I soon possessed myself of a volume, taking care
229:book, I studied the aspect of that winter afternoon.  Afar, it offered a
234:I returned to my book--Bewick's History of British Birds: the letterpress
357:"Show the book."
361:"You have no business to take our books; you are a dependent, mama says;
365:rummage my bookshelves: for they _are_ mine; all the house belongs to me,
370:lift and poise the book and stand in act to hurl it, I instinctively
799:tart away.  Bessie asked if I would have a book: the word _book_ acted as
801:the library.  This book I had again and again perused with delight.  I
grep: write error: Broken pipe


How many lines contain "book"?  We can count by piping into the word count tool `wc`.

In [54]:
!grep book jane-eyre.txt | wc

     84    1055    5889


That's 84 matching lines, containing 1055 words and 5889 bytes (which equals the character count for plain ASCII text).  If you want only the line count, use `wc -l`:

In [55]:
!grep book jane-eyre.txt | wc -l

84
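
As a rough sketch of what the three `wc` columns mean, the function below computes them for a string. The function name `wc` and the sample text are assumptions for illustration; like the real tool's default output, it counts newline characters, whitespace-separated words, and bytes.

```python
def wc(text):
    """Return (lines, words, bytes) for a string, in the spirit of `wc`."""
    lines = text.count("\n")           # number of newline characters
    words = len(text.split())          # whitespace-separated tokens
    size = len(text.encode("utf-8"))   # bytes, not characters
    return lines, words, size

sample = "one two\nthree\n"
counts = wc(sample)
print(counts)
```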


What if we want to match both upper- and lower-case text?  Use `grep -i`:

In [56]:
!grep time jane-eyre.txt | wc -l

402


In [57]:
!grep -i time jane-eyre.txt | wc -l

402
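
The filtering that `grep`, `grep -n`, and `grep -i` perform can be sketched in a few lines of Python. Note the simplifying assumption: this version treats the pattern as a plain substring, whereas real `grep` interprets it as a regular expression. The function name and the sample sentences are invented for the example.

```python
def grep(pattern, lines, ignore_case=False, number=False):
    """Return the lines containing pattern (plain substring match)."""
    needle = pattern.lower() if ignore_case else pattern
    matches = []
    for lineno, line in enumerate(lines, start=1):
        haystack = line.lower() if ignore_case else line
        if needle in haystack:
            # With number=True, prefix the 1-based line number like `grep -n`.
            matches.append(f"{lineno}:{line}" if number else line)
    return matches

sample = [
    "Time passed slowly.",
    "I had no time to spare.",
    "Nothing happened.",
    "Sometimes it rained.",
]

case_sensitive = grep("time", sample)
case_insensitive = grep("time", sample, ignore_case=True)
numbered = grep("time", sample, number=True)

print(len(case_sensitive), len(case_insensitive))
print(numbered)
```

In this made-up sample the case-insensitive search finds one more line, because only it matches the capitalized "Time".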


How many lines in *Jane Eyre* contain "other" (just lower-case)?  Start by using `grep` to extract the lines that contain "other" from `jane-eyre.txt`, redirecting the output to a file called `other-lines.txt`.

In [58]:
!grep other jane-eyre.txt > other-lines.txt

In [59]:
%sc h_other = head -1 other-lines.txt
assert "other" in h_other

In [60]:
%sc t_other = tail -1 other-lines.txt
assert "other" in t_other

Now count the lines in the file you created using `wc`.

In [61]:
!wc -l other-lines.txt

426 other-lines.txt


Your answer should be 426!

## 6. Counting words with `grep`

By piping commands together we can do a lot of powerful things right at the command line.  Let's create a count of the most commonly occurring words in *Jane Eyre*.  To do that, we could write a Python or R script that just counts words, but with the command line shell tools we only need to put a proper pipeline together and we can often accomplish tasks like this in one line.

First we need to split the text into one word per line.  The `tr` command can do that: here it turns every run of non-letter characters into a single newline.

In [62]:
!cat jane-eyre.txt | tr -sc '[:alpha:]' '[\n*]' | head -10


The
Project
Gutenberg
eBook
Jane
Eyre
by
Charlotte
Bronte
tr: write error: Broken pipe
tr: write error
cat: write error: Broken pipe


Now we need to sort them and count the unique tokens.  `sort` solves the first problem.

In [63]:
!cat jane-eyre.txt | tr -sc '[:alpha:]' '[\n*]' | sort | head -10


a
a
a
a
a
a
a
a
a
sort: write failed: 'standard output': Broken pipe
sort: write error


And `uniq -c` solves the second problem.

In [64]:
!cat jane-eyre.txt | tr -sc '[:alpha:]' '[\n*]' | sort | uniq -c | head -25

      1 
   4382 a
    159 A
      3 abandon
      8 abandoned
      2 abandonment
      1 abate
     25 Abbot
      4 abhor
      4 abhorred
      1 Abhorred
      3 abide
      1 Abigail
      2 abigails
      1 abilities
     23 able
      8 abode
      1 abodes
      1 abominable
      1 Abominable
      1 aboon
    223 about
      5 About
     38 above
      3 Above
uniq: write error: Broken pipe


But there's a catch... do you see it?

We need to convert all the words to lower case so that we count unique words correctly.  A second `tr` call does that.

In [65]:
!cat jane-eyre.txt | tr -sc '[:alpha:]' '[\n*]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | head -25

      1 
   4541 a
      3 abandon
      8 abandoned
      2 abandonment
      1 abate
     25 abbot
      4 abhor
      5 abhorred
      3 abide
      1 abigail
      2 abigails
      1 abilities
     23 able
      8 abode
      1 abodes
      2 abominable
      1 aboon
    228 about
     41 above
      1 abridge
      4 abroad
      8 abrupt
      6 abruptly
      1 abruptness
uniq: write error: Broken pipe
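
Why does the pipeline sort before calling `uniq -c`, and why does case matter? `uniq` only merges adjacent duplicate lines, and Python's `itertools.groupby` behaves the same way, so it makes a convenient stand-in. The helper name `uniq_c` and the word list are hypothetical.

```python
from itertools import groupby

def uniq_c(items):
    """Count runs of adjacent equal items, in the spirit of `uniq -c`."""
    return [(len(list(group)), key) for key, group in groupby(items)]

words = ["a", "About", "a", "about", "A"]

# Without sorting or case folding, no two equal words are adjacent.
unsorted_counts = uniq_c(words)

# Lower-casing and sorting first brings equal words together.
folded_counts = uniq_c(sorted(word.lower() for word in words))

print(unsorted_counts)
print(folded_counts)
```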


...and if we want to know only the top 10 words in *Jane Eyre*, we need to sort the output.

In [66]:
!cat jane-eyre.txt | tr -sc '[:alpha:]' '[\n*]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort | head -10

      1 
    100 looking
    100 master
    100 thornfield
    101 got
    101 those
    101 work
    102 almost
    103 most
    103 why
sort: write failed: 'standard output': Broken pipe
sort: write error


But that sorts the counts as text, character by character, not as numbers.  Fortunately, `sort -n` does what we want.

In [67]:
!cat jane-eyre.txt | tr -sc '[:alpha:]' '[\n*]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -n | head -10

      1 
      1 abate
      1 abigail
      1 abilities
      1 abodes
      1 aboon
      1 abridge
      1 abruptness
      1 absences
      1 absorbing
sort: write failed: 'standard output': Broken pipe
sort: write error
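
The difference between sorting as text and sorting as numbers is easy to reproduce in Python. The counts below are made-up values, stored as strings the way a text pipeline sees them.

```python
counts = ["9", "100", "25", "3"]

as_text = sorted(counts)              # compares character by character
as_numbers = sorted(counts, key=int)  # compares numeric values, like `sort -n`

print(as_text)
print(as_numbers)
```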


But that's the wrong end of the list!  Two ways to fix that:  (a) use `tail` instead of `head`; (b) use `sort -rn`, which will sort in reverse order.  Let's try the latter.

In [68]:
!cat jane-eyre.txt | tr -sc '[:alpha:]' '[\n*]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -10

   8033 the
   7268 i
   6699 and
   5318 to
   4541 a
   4484 of
   3066 you
   2826 in
   2527 was
   2431 it
sort: write failed: 'standard output': Broken pipe
sort: write error
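
For comparison, here is a minimal Python sketch of the same word count using `collections.Counter`. It assumes ASCII letters only (the regular expression `[a-z]+`), so, as in the `tr` version, an apostrophe splits a word in two. The function name `top_words` and the sample sentence are invented for illustration, and words with equal counts may come out in a different order than in the shell pipeline.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most common lower-cased words as (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)

sample = "The cat and the hat. The Cat sat!"
top = top_words(sample, 2)
print(top)
```

To try it on the novel itself, one could pass in the contents of `jane-eyre.txt` read with `open(...).read()`.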


Let's try another text.

Download *Through the Looking Glass* from https://s3.amazonaws.com/2018-dmfa/assignment-1/looking-glass.txt

In [69]:
!wget https://s3.amazonaws.com/2018-dmfa/assignment-1/looking-glass.txt

--2018-09-08 15:21:54--  https://s3.amazonaws.com/2018-dmfa/assignment-1/looking-glass.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 54.231.40.98
Connecting to s3.amazonaws.com (s3.amazonaws.com)|54.231.40.98|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 193607 (189K) [text/plain]
Saving to: ‘looking-glass.txt’


2018-09-08 15:21:54 (7.30 MB/s) - ‘looking-glass.txt’ saved [193607/193607]



In [70]:
assert 'looking-glass.txt' in os.listdir('.')

Take a look at the next cell.  Will it find the top 25 unique words in *Through the Looking Glass* successfully?

In [71]:
!cat looking-glass.txt | tr -sc '[:alpha:]' '[\n*]' | sort | uniq -c | head -25

      1 
    793 a
     26 A
      1 abide
      6 able
     75 about
      1 About
      2 above
      1 accents
      1 accept
      2 accepted
      1 accepting
     10 access
      1 accessed
      1 accessible
      1 accident
      2 accordance
      1 accounts
      1 acres
      8 across
      1 act
      3 active
      1 ACTUAL
      2 actually
     21 added
uniq: write error: Broken pipe


Describe what needs to be done to the previous cell to get it to work correctly.  **Describe it using words**, explaining the issues, rather than using shell commands!

**EDIT THIS CELL** WITH YOUR ANSWER HERE    
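**Problems with the cell above**
1. Upper- and lower-case spellings of the same word ("a" and "A") are counted separately, so the text should be converted to lower case before counting.    
2. The counts are never sorted, so the output is in alphabetical order by word rather than by frequency.  After counting, the lines need a numeric sort on the count.    
3. A numeric sort runs from smallest to largest, so the top 25 words are the last 25 lines, not the first 25 (alternatively, reverse the sort and keep the first 25).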

Okay, now implement your solution using shell commands with a pipeline.

In [73]:
# Edit this cell!
!cat looking-glass.txt | tr -sc '[:alpha:]' '[\n*]' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -n | tail -25

    208 but
    210 all
    225 for
    226 he
    245 at
    260 on
    268 with
    272 her
    309 s
    323 t
    326 as
    358 was
    374 that
    455 alice
    466 in
    473 said
    544 she
    604 of
    660 i
    681 it
    686 you
    817 to
    819 a
    975 and
   1775 the
