# Lab2: Introduction to Shell for Data Science

Outline
1. Basic commands
2. Manipulating data
3. Combining tools
4. Batch processing
----
Appendix: Regular expressions

## 1. Basic commands

Suppose you want to download a file from the web and place it in `/Labs/data/`, where `/Labs/` is your current directory. 


In [None]:
# Prefixing a line with ! runs it as a shell command from within a Jupyter notebook
! pwd

- We start by downloading a data file from the [UCI Machine Learning repository](https://archive.ics.uci.edu/ml/index.php). 
- To do so, we will use the `wget` command introduced in [Data collection and manipulation](https://github.com/UCSB-PSTAT-134-234/Spring-2018/blob/master/03-Data-collection-and-manipulation.ipynb).

In [None]:
! wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

We now make sure that the data has been successfully downloaded by using our basic navigation commands (`pwd`, `ls`, `cd`, etc).

In [None]:
! ls -F -l

## 2. Manipulating data

- The first thing most data scientists do when given a new dataset to analyze is figure out what fields it contains and what values those fields have. 
- If the dataset has been exported from a database or spreadsheet, it will often be stored as comma-separated values (CSV). 
- A quick way to figure out what it contains is to look at the first few rows.

### Selecting rows from a file

In [None]:
# Prints entire contents of a file
! cat iris.data

In [None]:
! head iris.data # show first ten lines

In [None]:
! tail iris.data # show last ten lines

In [None]:
# From the help file: with a leading +, tail -n +K starts printing at line K,
# so this displays all but the first 5 lines of the iris.data file
! tail -n +6 iris.data
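
The row-selection options can be checked on a file whose contents are easy to predict. The following is a minimal sketch, assuming a stand-in file of the numbers 1 to 10 (the name `numbers.txt` and the temporary directory are hypothetical, not part of the lab); in a notebook it can be pasted into a `%%bash` cell.

```shell
# Stand-in file with the numbers 1..10, one per line (hypothetical name).
workdir=$(mktemp -d)
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > "$workdir/numbers.txt"

# head -n 3: the first three lines (1, 2, 3).
first_three=$(head -n 3 "$workdir/numbers.txt")

# tail -n 3: the last three lines (8, 9, 10).
last_three=$(tail -n 3 "$workdir/numbers.txt")

# tail -n +6: everything from line 6 onward, i.e. all but the first 5 lines.
from_six=$(tail -n +6 "$workdir/numbers.txt")

echo "$from_six"
```

The leading `+` changes the meaning of the number from "how many lines from the end" to "which line to start at".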

### Selecting columns from a file

`head` and `tail` let you select rows from a text file. If you want to select columns, you can use the command `cut`. 

In [None]:
! cut -f 1-3,5 -d , iris.data # -f: "fields" , -d: "delimiter"
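
To see how field lists and ranges behave, here is a small sketch that runs `cut` on a three-row stand-in file in the same format as `iris.data` (the file name `sample.data` and the temporary directory are assumptions made for illustration).

```shell
# Three rows in the iris.data format, written to a hypothetical stand-in file.
workdir=$(mktemp -d)
cat > "$workdir/sample.data" <<'EOF'
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
EOF

# A comma-separated list picks individual fields: here the first and the last.
first_and_last=$(cut -d , -f 1,5 "$workdir/sample.data")

# A range picks consecutive fields: here fields 2 through 4.
middle=$(cut -d , -f 2-4 "$workdir/sample.data")

echo "$first_and_last"
echo "$middle"
```

Note that `cut` splits on every delimiter it sees; it does not understand quoted CSV fields that themselves contain commas.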

### Selecting lines containing particular values

- `head` and `tail` select rows, `cut` selects columns.
- `grep` selects lines according to what they contain. 
- In its simplest form, `grep` takes a piece of text followed by one or more filenames and prints all of the lines in those files that contain that text. 

For example, suppose we want to retrieve all lines containing the word "setosa": 

In [None]:
! grep "setosa" iris.data

###  `grep` command 

- The name "grep" stands for "global regular expression print". This means that grep can be used to see if the input it receives matches a specified pattern.

- In the example above, `grep` prints all of the lines from `iris.data` that contain "setosa". It can also search for patterns using [*regular expressions*](https://www.digitalocean.com/community/tutorials/using-grep-regular-expressions-to-search-for-text-patterns-in-linux); we will explore those later. 

What's more important right now is some of `grep`'s more common **flags**:

`-c`: print a count of matching lines rather than the lines themselves

`-h`: do not print the names of files when searching multiple files

`-i`: ignore case (e.g., treat "Regression" and "regression" as matches)

`-l`: print the names of files that contain matches, not the matches

`-n`: print line numbers for matching lines

`-v`: invert the match, i.e., only show lines that don't match

In [None]:
# Best practice with grep is to place options/flags 
# before the pattern we are looking for:
! grep -n -v "setosa" iris.data 

In [None]:
! grep -c "setosa" iris.data
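
The flags listed above can be compared side by side on a tiny file. This is a sketch under the assumption of a four-row stand-in file (`sample.data`, a hypothetical name) rather than the full dataset.

```shell
# Four rows in the iris.data format (hypothetical stand-in file).
workdir=$(mktemp -d)
cat > "$workdir/sample.data" <<'EOF'
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
EOF

# -c: count the matching lines instead of printing them.
count=$(grep -c "setosa" "$workdir/sample.data")

# -i: ignore case, so the upper-case pattern still matches.
count_upper=$(grep -c -i "SETOSA" "$workdir/sample.data")

# -n -v: print the lines that do NOT match, prefixed by their line numbers.
others=$(grep -n -v "setosa" "$workdir/sample.data")

echo "$count"
echo "$count_upper"
echo "$others"
```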

## 3. Combining tools

## Re-directing: store a command's output in a file

- All of the tools you have seen so far let you name input files. Most don't have an option for naming an output file because they don't need one. 
- Instead, you can use redirection to save any command's output anywhere you want. 

In [None]:
! grep "setosa" iris.data > setosa.csv

In [None]:
! ls

We can then apply any of the above commands to the resulting file `setosa.csv`.

In [None]:
# For example, select only first and last columns
! cut -f 1,5 -d , setosa.csv

## Piping: using one command's output as another command's input

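Saving intermediate results with redirection works, but it leaves intermediate files (such as `setosa.csv`) lying around and spreads one task over several commands. The shell provides another tool that solves both of these problems at once: the **pipe**. 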


In [None]:
# Option 1
! grep "setosa" iris.data > setosa.csv
! cut -f 1,5 -d , setosa.csv

In [None]:
# Option 2
! grep "setosa" iris.data | cut -f 1,5 -d , 

The pipe symbol `|` tells the shell to use the output of the command on the left as the input to the command on the right.

In [None]:
# Sort the rows containing the word setosa numerically (-n) by the
# 4th comma-separated field (-k 4) and show the first five
! grep -n "setosa" iris.data | sort -n -k 4 --field-separator=',' | head -n 5
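
A pipeline is easier to read when each stage is written on its own line. The sketch below chains four commands on a hypothetical stand-in file (`sample.data`); `-t ,` is the short form of `--field-separator=','`, and `LC_ALL=C` is set as a precaution so that the decimal point is interpreted the same way regardless of locale.

```shell
# Four rows in the iris.data format (hypothetical stand-in file).
workdir=$(mktemp -d)
cat > "$workdir/sample.data" <<'EOF'
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
EOF

# 1. grep keeps the setosa rows.
# 2. cut keeps fields 1 and 2.
# 3. sort orders the rows numerically by the second remaining field.
# 4. head keeps the two rows with the smallest values in that field.
narrowest=$(grep "setosa" "$workdir/sample.data" \
    | cut -d , -f 1,2 \
    | LC_ALL=C sort -t , -k 2,2 -n \
    | head -n 2)

echo "$narrowest"
```

No intermediate file is created: each command reads what the previous one writes.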

## Pipes and re-direction

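The shell lets us redirect the output of a sequence of piped commands. However, `>` must appear at the end of the pipeline, as in the command below. If we use it in the middle, the output of the command to its left is written to the file, and nothing is left for the commands to its right.

In [None]:
! grep -n "setosa" iris.data | sort -n -k 4 --field-separator=',' | head -n 5 > bottom5.csv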
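
The effect of where `>` is placed can be observed directly. This is a minimal sketch in bash, assuming a three-line stand-in file (all file names here are hypothetical): with the redirection at the end the file receives the result of the whole pipeline, while with the redirection in the middle the downstream command receives no input.

```shell
# Three-line stand-in file (hypothetical names throughout).
workdir=$(mktemp -d)
printf 'a,setosa\nb,versicolor\nc,setosa\n' > "$workdir/sample.csv"

# Redirection at the end: end.txt receives the output of the whole pipeline.
grep "setosa" "$workdir/sample.csv" | cut -d , -f 1 > "$workdir/end.txt"

# Redirection in the middle: grep's output goes to middle.txt,
# so cut has nothing to read and produces no output.
downstream=$(grep "setosa" "$workdir/sample.csv" > "$workdir/middle.txt" | cut -d , -f 1)

cat "$workdir/end.txt"
cat "$workdir/middle.txt"
echo "downstream output: '$downstream'"
```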

## How can I stop a running program?

The commands and scripts that you have run so far have all executed quickly, but some tasks will take minutes, hours, or even days to complete. 

You may also make a mistake, such as putting **redirection** in the middle of a **pipeline**, that leaves a command waiting for input and appearing to hang. If you decide that you don't want a program to keep running, you can type `Ctrl-C` to end it. This is often written `^C` in Unix documentation; note that the 'c' can be lower-case.

## 4. Batch processing

## Environment variables

Like other programs, the shell stores information in variables. Some of these, called **environment variables**, are available all the time. Environment variables' names are conventionally written in upper case, and a few of the more commonly-used ones are shown below:

Variable | Purpose | Value
--- | --- | ---
`HOME` | User's home directory | `/home/repl`
`PWD` | Present working directory | Same as `pwd` command
`SHELL` | Which shell program is being used | `/bin/bash`
`USER` | User's ID | `repl`

In [None]:
# To list all shell variables, including the environment variables, use the set command
! set

To see the **variable value**, the simplest way is to use the `echo` command and the variable name preceded by the dollar sign `$`. For example:

In [None]:
! echo $SHELL 

## `for` loops

- Suppose we want to split `iris.data` into three files, each containing the rows associated with one class (in this case `Iris-setosa`, `Iris-versicolor`, `Iris-virginica`). 
- One option would be to use `grep` and redirect the output to a file using `>`. 
- Since this would involve repeating the same process three times, we can do it more efficiently with a `for` loop.

In [None]:
# Obtain iris flower types
! cut -f 5 -d , iris.data | uniq 

We can create shell variables, just like environment variables, by assigning a value to a name. With command substitution, `$(...)`, the output of a command becomes the value of the variable on the left-hand side. Note that there must be no spaces around the equal sign.

In [None]:
%%bash  
iris_type=$(cut -f 5 -d , iris.data | uniq)
echo $iris_type
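
Command substitution and quoting interact in a way that is worth seeing once. The following sketch assumes a three-line stand-in file with made-up first fields (`sample.data` and the variable names are hypothetical); it stores command output in variables and compares printing a multi-line value with and without quotes.

```shell
# Three-line stand-in file; the first field is a placeholder value.
workdir=$(mktemp -d)
printf 'x,Iris-setosa\ny,Iris-setosa\nz,Iris-virginica\n' > "$workdir/sample.data"

# $(...) runs the command and stores its output; no spaces around "=".
n_setosa=$(grep -c "setosa" "$workdir/sample.data")
classes=$(cut -d , -f 2 "$workdir/sample.data" | uniq)

# Unquoted, the shell splits the value into words, so echo prints one line.
unquoted=$(echo $classes)

# Quoted, the line breaks inside the value are preserved.
quoted=$(echo "$classes")

echo "setosa rows: $n_setosa"
echo "$unquoted"
echo "$quoted"
```

The word splitting seen in the unquoted case is also what lets a `for` loop iterate over the individual values stored in such a variable.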

In [None]:
%%bash  
iris_type=$(cut -f 5 -d , iris.data | uniq)
for flower in $iris_type; do grep $flower iris.data > $flower.csv; done

In [None]:
! ls

### The loop's parts are:

The skeleton `for ...variable... in ...list...; do ...body...; done`

1. The list of things the loop is to process (in our case, the words Iris-setosa, Iris-versicolor, Iris-virginica, accessed using `$iris_type`).
2. The variable that keeps track of which thing the loop is currently processing (in our case, `flower`).
3. The body of the loop that does the processing (in our case, `grep $flower iris.data > $flower.csv`).

> Notice that the body uses `$flower` to get the variable's value instead of just `flower`, just like it does with any other shell variable. Also notice where the semi-colons go: the first one comes between the list and the keyword `do`, and the second comes between the body and the keyword `done`.
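
The same loop can be written across several lines, which some readers find easier to follow. This is a sketch under stated assumptions: it uses a four-row stand-in file (`sample.data`, hypothetical), quotes `"$flower"` defensively, and swaps `uniq` for `sort -u`, which removes duplicates even when they are not on adjacent lines.

```shell
# Four rows in the iris.data format (hypothetical stand-in file).
workdir=$(mktemp -d)
cat > "$workdir/sample.data" <<'EOF'
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
EOF

# The list: one word per distinct class in the fifth field.
for flower in $(cut -d , -f 5 "$workdir/sample.data" | sort -u); do
    # The body: write the rows of the current class to their own file.
    grep "$flower" "$workdir/sample.data" > "$workdir/$flower.csv"
done

ls "$workdir"
n_setosa_rows=$(grep -c "" "$workdir/Iris-setosa.csv")
echo "$n_setosa_rows"
```

When the loop is split over lines like this, the line breaks take the place of the semi-colons.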

----
# Appendix: Regular Expressions

1. Literal Matches
2. Anchor Matches
3. Matching Any Character
4. Bracket Expressions
5. Repeat Pattern Zero or More Times
6. Escaping Meta-Characters

Let's try an example. We will use grep to search for every line that contains the word "GNU" in the GNU General Public License version 3, which we first download with `wget`.

In [None]:
! wget https://www.gnu.org/licenses/gpl.txt

In [None]:
! ls

## Literal Matches
---
- Patterns that exactly specify the characters to be matched are called "literals" because they match the pattern literally, character-for-character.
- All alphabetic and numerical characters (as well as certain other characters) are matched literally unless modified by other expression mechanisms.




In [None]:
! grep "GNU" gpl.txt

## Anchor Matches
---
- Anchors are special characters that specify where in the line a match must occur to be valid.



In [None]:
! grep "^GNU" gpl.txt

In [None]:
! grep "and$" gpl.txt
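
Anchors are easiest to understand on a file small enough to read at a glance. The sketch below uses four made-up lines (the file `sample.txt` is a hypothetical stand-in for `gpl.txt`) and single quotes around the patterns so that the shell leaves `$` alone.

```shell
# Four made-up lines standing in for the license text.
workdir=$(mktemp -d)
cat > "$workdir/sample.txt" <<'EOF'
GNU General Public License
the GNU project
free software and
and then some
EOF

# Literal match: "GNU" anywhere in the line (two lines).
anywhere=$(grep -c 'GNU' "$workdir/sample.txt")

# ^ anchors the match to the start of the line.
at_start=$(grep '^GNU' "$workdir/sample.txt")

# $ anchors the match to the end of the line.
at_end=$(grep 'and$' "$workdir/sample.txt")

echo "$anywhere"
echo "$at_start"
echo "$at_end"
```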

## Matching Any Character
---
- The period character (.) is used in regular expressions to mean that any single character can exist at the specified location.

- For example, if we want to match anything that has two characters and then the string "cept", we could use the following pattern:


In [None]:
! grep "..cept" gpl.txt

## Bracket Expressions
---
- By placing a group of characters within brackets ("[" and "]"), we can specify that the character at that position can be any one character found within the bracket group.


In [None]:
!grep "t[wo]o" gpl.txt

In [None]:
!grep "[^c]ode" gpl.txt

In [None]:
!grep "^[A-Z]" gpl.txt
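
The three bracket patterns above can be tried on a short word list. This is a sketch with a hypothetical file `words.txt`; `LC_ALL=C` is set as a precaution because the meaning of a range such as `A-Z` can depend on the locale.

```shell
# One word per line (hypothetical stand-in file).
workdir=$(mktemp -d)
printf '%s\n' 'too' 'two' 'tao' 'code' 'mode' 'Apple' > "$workdir/words.txt"

# [wo] matches exactly one character, either "w" or "o".
two_or_too=$(grep 't[wo]o' "$workdir/words.txt")

# [^c] matches any one character except "c", so "code" is excluded.
not_c=$(grep '[^c]ode' "$workdir/words.txt")

# A range inside brackets: any capital letter at the start of the line.
capital=$(LC_ALL=C grep '^[A-Z]' "$workdir/words.txt")

echo "$two_or_too"
echo "$not_c"
echo "$capital"
```

Note that `[^c]` still has to match some character, so a line containing only "ode" would not match `[^c]ode`.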

## Repeat Pattern Zero or More Times
---
- Finally, one of the most commonly used meta-characters is the "*", which means "repeat the previous character or expression zero or more times".

- If we wanted to find each line that contained an opening and closing parenthesis, with only letters and single spaces in between, we could use the following expression:


In [None]:
!grep "([A-Za-z ]*)" gpl.txt
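
The "zero or more" part of `*` is easy to overlook. The sketch below runs the same pattern on four made-up lines (the file `parens.txt` is hypothetical) and shows that an empty pair of parentheses also matches, while a digit between the parentheses prevents a match.

```shell
# Four made-up lines (hypothetical stand-in file).
workdir=$(mktemp -d)
printf '%s\n' \
    'See (section two) here' \
    'Call f(x1) now' \
    'Empty () pair' \
    'No parentheses' > "$workdir/parens.txt"

# "(" then zero or more letters/spaces then ")".
# In grep's basic regular expressions, parentheses are literal characters.
matches=$(grep '([A-Za-z ]*)' "$workdir/parens.txt")

echo "$matches"
```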

## Escaping Meta-Characters
---
Sometimes, we may want to search for a literal period or a literal opening bracket. Because these characters have special meaning in regular expressions, we need to "escape" these characters to tell grep that we do not wish to use their special meaning in this case.

We can escape characters by using the backslash character (`\`) before the character that would normally have a special meaning.

For instance, if we want to find any line that begins with a capital letter and ends with a period, we could use the following expression. The ending period is escaped so that it represents a literal period instead of the usual "any character" meaning:


In [None]:
!grep "^[A-Z].*\.$" gpl.txt
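
Comparing the escaped and unescaped versions of the pattern shows what the backslash buys us. This is a sketch on three made-up lines (the file `sentences.txt` is hypothetical), with `LC_ALL=C` set as a precaution for the `A-Z` range.

```shell
# Three made-up lines (hypothetical stand-in file).
workdir=$(mktemp -d)
printf '%s\n' \
    'Ends with a period.' \
    'No period here' \
    'lowercase start.' > "$workdir/sentences.txt"

# Escaped: the line must end with a literal period.
escaped=$(LC_ALL=C grep '^[A-Z].*\.$' "$workdir/sentences.txt")

# Unescaped: the final "." matches any character, so the second line matches too.
unescaped_count=$(LC_ALL=C grep -c '^[A-Z].*.$' "$workdir/sentences.txt")

echo "$escaped"
echo "$unescaped_count"
```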

# References

1. [https://www.digitalocean.com/community/tutorials/using-grep-regular-expressions-to-search-for-text-patterns-in-linux](https://www.digitalocean.com/community/tutorials/using-grep-regular-expressions-to-search-for-text-patterns-in-linux)
2. [https://regexone.com/](https://regexone.com/)