Linux Intro
===========

Objectives
----------

By the end of this morning, you will be able to:

- `ssh` into a remote Linux server

- Examine log files using `head` and `vi`

- Create a `pipe` for interprocess communication

- Find patterns using `grep`

- `cut` out selected portions of each line of a file

- `sort` lines of text files

- Report or filter out repeated lines in a file using `uniq`

Connecting to a remote computer
-------------------------------

#### ~~`telnet`~~  

#### `ssh`

SSH
---

> Secure Shell, or SSH, is a cryptographic (encrypted) network
> protocol for initiating text-based shell sessions on remote machines
> in a secure way.
 
![Ssh_binary_packet](https://upload.wikimedia.org/wikipedia/commons/0/0f/Ssh_binary_packet_alt.svg)


    ssh -i ~/.ssh/galvanize-DEI.pem ec2-user@ec2-52-27-60-213.us-west-2.compute.amazonaws.com ls

Unix Shell Walkthrough
----------------------

#### Grab `shakespeare-sonnets.txt`.

    curl -LO http://dsci6007.s3.amazonaws.com/data/shakespeare-sonnets.txt

#### Let's look at the first few lines.
    
    head shakespeare-sonnets.txt 

#### Let's skip the first two lines:

    tail -n +3 shakespeare-sonnets.txt | head
    
#### NOTE: tail -n +3 shakespeare-sonnets.txt

    Starts output at line 3, i.e. skips the first 2 lines and displays
    the remainder of the file (here, nearly the entire file).
    
    This is useful because it filters out the title and the name of
    the author, leaving behind the actual sonnets.
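
The `+3` form can be checked on a tiny stand-in for the file. This is a minimal sketch; the five sample lines are made up and only mimic a title, an author line, and some text.

```shell
# Build a five-line stand-in for the sonnets file (hypothetical content).
sample=$(printf 'THE SONNETS\nby William Shakespeare\nline three\nline four\nline five\n')

# tail -n +3 starts output AT line 3, so lines 1 and 2 are dropped.
body=$(printf '%s\n' "$sample" | tail -n +3)

printf '%s\n' "$body"
```

Without the `+`, `tail -n 3` would instead print the last three lines, which happens to give the same result only because this sample has five lines.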

#### Let's cut out the chapter : verse part:

    tail -n +3 shakespeare-sonnets.txt | cut -d' ' -f2- | head
    
#### NOTE: cut -d' ' -f2-

    Uses a single space as the field separator (-d' ') and keeps
    field 2 through the end of the line (-f2-), i.e. removes field 1.
    Here, 'field 1' is the first word on every line.
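
A small sketch of what `cut -d' ' -f2-` does, assuming lines of the form "label text". The sample lines are hypothetical and the real file's layout may differ.

```shell
# Hypothetical lines in "label text" form; the real file's layout may differ.
lines=$(printf '1:1 From fairest creatures\n1:2 That thereby beauty\n')

# -d' ' splits on each single space; -f2- keeps field 2 through the last field.
text=$(printf '%s\n' "$lines" | cut -d' ' -f2-)
printf '%s\n' "$text"

# A line that contains no space at all is passed through unchanged,
# so a one-word line survives the cut.
single=$(printf 'XVII\n' | cut -d' ' -f2-)
printf '%s\n' "$single"
```

The second case matters later in the walkthrough: one-word lines are still present after this step.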

#### Let's translate the characters to lower case:

    tail -n +3 shakespeare-sonnets.txt | cut -d' ' -f2- | tr 'A-Z' 'a-z' | head

#### NOTE: tr 'A-Z' 'a-z'

    Replaces A-Z with a-z (lowercases the words)


#### Let's tokenize our words:


    tail -n +3 shakespeare-sonnets.txt | 
        cut -d' ' -f2- | 
        tr 'A-Z' 'a-z' | 
        tr -cs 'a-z' '\012' | 
        head

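#### NOTE: tr -cs 'a-z' '\012'

    -c selects the complement of the set (everything that is NOT a-z)
    and -s squeezes each run of those characters into a single newline
    ('\012' is the octal code for newline), leaving one word per line.
    Spaces, commas, periods, etc. therefore act as word boundaries.
    The option orders '-cs' and '-sc' are equivalent.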
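
To see the tokenizer on one line, here is a minimal sketch with a made-up, already-lowercased input.

```shell
# An already-lowercased sample line (hypothetical input).
line="thy beauty's rose, might never die"

# -c: act on the complement of a-z; -s: squeeze each run into one newline.
tokens=$(printf '%s\n' "$line" | tr -cs 'a-z' '\012')
printf '%s\n' "$tokens"
```

Note that the apostrophe is also a non-letter, so `beauty's` is split into `beauty` and `s`. That is a limitation of this simple tokenizer.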
    
#### Let's sort them:


    tail -n +3 shakespeare-sonnets.txt |
        cut -d' ' -f2- |
        tr 'A-Z' 'a-z' |
        tr -sc 'a-z' '\012' |
        sort |
        head

#### NOTE: sort, no head

    Returns the full, long list of sorted tokens

#### NOTE: sort, with head

    Returns only the first 10 sorted tokens


#### What is our vocabulary?

    tail -n +3 shakespeare-sonnets.txt |
        cut -d' ' -f2- |
        tr 'A-Z' 'a-z' |
        tr -sc 'a-z' '\012' |
        sort |
        uniq |
        head
        
#### NOTE: uniq

    Collapses adjacent repeated lines, so the sorted list keeps
    exactly one copy of each token
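
`uniq` only compares neighbouring lines, which is why the pipeline sorts first. A minimal sketch with made-up words:

```shell
# Five words with no two identical neighbours (hypothetical input).
words=$(printf 'rose\nthy\nrose\nthy\nrose\n')

# Without sorting, no line equals its neighbour, so nothing is removed.
unsorted=$(printf '%s\n' "$words" | uniq | wc -l | tr -d ' ')

# After sorting, duplicates sit next to each other and collapse.
sorted=$(printf '%s\n' "$words" | sort | uniq | wc -l | tr -d ' ')

printf 'uniq alone: %s lines, sort | uniq: %s lines\n' "$unsorted" "$sorted"
```

`sort -u` gives the same deduplicated list in one step.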
    

#### How big is our vocabulary?

    tail -n +3 shakespeare-sonnets.txt | 
        cut -d' ' -f2- | 
        tr 'A-Z' 'a-z' | 
        tr -sc 'a-z' '\012' | 
        sort | 
        uniq | 
        wc -w

#### NOTE: wc -w

    wc   counts lines, words, and characters
    -w   restricts the count to words

#### How many times does each word occur?

    tail -n +3 shakespeare-sonnets.txt | cut -d' ' -f2- | 
        tr 'A-Z' 'a-z' | tr -sc 'a-z' '\012' | 
        sort | uniq -c | head
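
`uniq -c` prefixes each line with its count, but the result is still in alphabetical order. One possible follow-up, not part of the original walkthrough, is a second numeric sort to rank words by frequency. A sketch with made-up words:

```shell
# Six made-up tokens, one per line.
words=$(printf 'the\nof\nthe\nand\nof\nthe\n')

# sort groups duplicates, uniq -c counts them, sort -rn ranks by count.
# The final awk only normalises the padding that uniq -c adds.
top=$(printf '%s\n' "$words" | sort | uniq -c | sort -rn | awk '{print $1, $2}')
printf '%s\n' "$top"
```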

#### How might we construct a rhyming dictionary?

    tail -n +3 shakespeare-sonnets.txt | cut -d' ' -f2- | 
        tr 'A-Z' 'a-z' | tr -sc 'a-z' '\012' | 
        sort | uniq | rev | sort | rev | head

#### NOTE: sort | uniq | rev

    Sorts the tokens, keeps one copy of each, then reverses the
    characters of each token

#### NOTE: sort | uniq | rev | sort

    Takes the reversed tokens (backward words) and sorts them.
    This groups words by their endings; the final rev then turns
    each token back into a readable word
    
    (i.e. words that end with the same letters tend to rhyme!)
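
The `rev | sort | rev` trick can be tried on a handful of words. A minimal sketch with made-up input; `LC_ALL=C` is added here only to make the sort order predictable.

```shell
# Five made-up words with two different endings.
words=$(printf 'light\nday\nnight\nway\nbright\n')

# Reverse each word, sort the reversed forms, then reverse back.
rhymes=$(printf '%s\n' "$words" | rev | LC_ALL=C sort | rev)
printf '%s\n' "$rhymes"
```

The words ending in `-ight` come out next to each other, followed by the words ending in `-ay`.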


#### What was the penultimate word of each line?

    tail -n +3 shakespeare-sonnets.txt | cut -d' ' -f2- | awk '{print   $(NF-1)}' | head
    
#### NOTE: NF

    The number of fields in the current record (line)

#### NOTE: awk '{print $(NF-1)}' THROWS AN ERROR

    awk: trying to access out of range field -1
    
    A field index of -1 means NF was 0, i.e. the line is empty or
    holds only whitespace. A line with a single token (such as a
    Roman numeral heading) gives NF-1 = 0, and $0 prints the whole
    line rather than failing.
    The command assumes that there are at least two tokens per line.
    
    Commands using awk '{print $(NF-1)}' will not work on this file
    until the blank lines and Roman numeral lines are filtered out
    (not part of the exercise, but worth doing anyway).
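
One way to avoid the error, offered as a sketch rather than as the intended solution, is to guard the action with a pattern on `NF`. The three sample lines are made up: a one-word line, a blank line, and a normal line.

```shell
# A one-word line, a blank line, and a six-word line (hypothetical input).
lines=$(printf 'I\n\nfrom fairest creatures we desire increase\n')

# NF for each line: the blank line has zero fields.
counts=$(printf '%s\n' "$lines" | awk '{print NF}')

# Only lines with at least two fields reach the print action.
penultimate=$(printf '%s\n' "$lines" | awk 'NF >= 2 {print $(NF-1)}')

printf '%s\n' "$counts" "$penultimate"
```

For the antepenultimate word the guard would be `NF >= 3`.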


#### What was the antepenultimate word of each line?

    tail -n +3 shakespeare-sonnets.txt | cut -d' ' -f2- | awk '{print $(NF-2)}' | head
    
#### NOTE: THROWS AN ERROR

    awk: trying to access out of range field -2
    
    A field index of -2 means NF was 0 (an empty line); a line with a
    single token, such as a Roman numeral, would ask for field -1.
    This command assumes that there are at least three tokens per line.


#### What's the word count of those words?

    tail -n +3 shakespeare-sonnets.txt |
        cut -d' ' -f2- |
        awk '{print $(NF-2)}' |
        sort |
        uniq -c |
        head

#### Let's delete punctuation:

    tail -n +3 shakespeare-sonnets.txt |
        cut -d' ' -f2- |
        awk '{print $(NF-2)}' |
        tr -d '[:punct:]' |
        sort |
        uniq -c |
        head

Parsing Web Log Files
---------------------

1) Look at the file `data/NASA_access_log_Jul95.gz` without
saving the uncompressed version, using `gunzip -c` (equivalent to
`gzip -cd`).

#### `gzip -cd NASA_access_log_Jul95.gz | head`

`gzip` is used to manipulate .gz files

`-c` writes output to standard output

`-d` decompresses the file

#### Note:

    If '-d' is used without '-c', the .gz file is replaced by the
    decompressed file.
    
    If '-d' is used with '-c', the decompressed content is displayed
    and the original .gz file is left untouched.
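
The difference can be checked on a throwaway file. A minimal sketch; the directory and file names are hypothetical.

```shell
# Work in a throwaway directory; file names here are hypothetical.
workdir=$(mktemp -d)
printf 'first line\nsecond line\n' > "$workdir/sample.log"

# Plain gzip replaces sample.log with sample.log.gz.
gzip "$workdir/sample.log"

# -c streams the decompressed bytes to stdout; nothing is written to disk.
first=$(gzip -cd "$workdir/sample.log.gz" | head -n 1)

# Only the archive should be present afterwards.
listing=$(ls "$workdir")
printf '%s\n%s\n' "$first" "$listing"

rm -rf "$workdir"
```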

2) Find the total number of lines in the file.

#### `gzip -cd NASA_access_log_Jul95.gz | wc -l`

There are 1891714 lines in this file.

3) Using `gunzip -c`, find the total number of 4xx errors in the file.
This includes errors such as 401, 404, etc.

#### `gzip -cd NASA_access_log_Jul95.gz | awk '{print $(NF-1)}' | grep '^4' | sort | uniq -c`
 
        5 400
       54 403
    10845 404
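
The breakdown above lists each code separately. If a single total is wanted, `grep -c` can count the matching lines directly. A sketch on three made-up lines shaped like the log; the hosts and paths are hypothetical.

```shell
# Three made-up lines in the same shape as the NASA log (hypothetical data).
log=$(printf '%s\n' \
  'host1 - - [01/Jul/1995:00:00:01 -0400] "GET /a.html HTTP/1.0" 200 6245' \
  'host2 - - [01/Jul/1995:00:00:02 -0400] "GET /b.html HTTP/1.0" 404 -' \
  'host3 - - [01/Jul/1995:00:00:03 -0400] "GET /c.html HTTP/1.0" 403 -')

# The status code is the second-to-last field; grep -c counts matching lines.
total_4xx=$(printf '%s\n' "$log" | awk '{print $(NF-1)}' | grep -c '^4')
printf '%s\n' "$total_4xx"
```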

4) Find the total number of 5xx errors in the file. Again, include all
errors from 500-599.

#### `gzip -cd NASA_access_log_Jul95.gz | awk '{print $(NF-1)}' | grep '^5' | sort | uniq -c`

      62 500
      14 501

5) Find the count of each distinct status code in the
`NASA_access_log_Jul95.gz` file.

#### `gzip -cd NASA_access_log_Jul95.gz | awk '{print $(NF-1)}' | sort | uniq -c`

    1701534 200
      46573 302
     132627 304
          5 400
         54 403
      10845 404
         62 500
         14 501
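
As an alternative to `sort | uniq -c`, the tally can be kept inside `awk` with an associative array. This is a sketch on made-up status codes, assuming the status field has already been extracted.

```shell
# Five made-up status codes, one per line.
codes=$(printf '200\n404\n200\n304\n200\n')

# n[code] accumulates a count per code; the order of "for (c in n)" is
# unspecified, so the output is sorted afterwards.
tally=$(printf '%s\n' "$codes" |
    awk '{n[$1]++} END {for (c in n) print c, n[c]}' |
    sort)
printf '%s\n' "$tally"
```

This avoids sorting every input line, which can matter on a file with nearly two million lines.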

Note
----

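- Using `export LC_CTYPE=C` sets the character-type locale to the
  default C (POSIX) locale, which treats input as plain bytes instead
  of decoding it as a multibyte encoding such as UTF-8.

- Setting this stops commands like `rev` and `tr` from producing
  `Illegal byte sequence` errors.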
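
The setting can also be applied to a single command instead of being exported. A minimal sketch, assuming `LC_ALL` is not set in the environment (it would take precedence over `LC_CTYPE`); the sample word is made up.

```shell
# "café" encoded as UTF-8: the é is the two bytes \303\251 (octal).
word=$(printf 'caf\303\251')

# In the C locale tr sees two separate non-a-z bytes and squeezes them
# into one newline, so only the ASCII letters remain.
letters=$(printf '%s\n' "$word" | LC_CTYPE=C tr -cs 'a-z' '\012')
printf '%s\n' "$letters"
```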