# Grep

[GNU Grep User's Manual](https://www.gnu.org/software/grep/manual/grep.html)

[Grep on Wikipedia](https://en.wikipedia.org/wiki/Grep)

## History

The name `grep` comes from the `ed` command `g/re/p`: *globally search for a regular expression and print*. In 1974, Ken Thompson of Bell Labs was approached by a researcher, Lee McMahon, who was attempting to determine the authors of the 85 anonymous [Federalist Papers](https://en.wikipedia.org/wiki/The_Federalist_Papers) solely through natural language processing. `ed`, the standard UNIX text editor at the time, would run out of memory with so many documents, so Thompson wrote `grep` overnight as a replacement.

![federalist.jpg](federalist.jpg)

## So, What's It Good For?

`grep` searches plain-text files for lines that match a regular expression. It embodies the Unix philosophy:

```
1. Write programs that do one thing and do it well.
2. Write programs to work together.
3. Write programs to handle text streams, because that is a universal interface.

 -- Peter Salus (1994)
```

In [None]:
grep --version

## Examples

### airports.dat

`airports.dat` is a plain-text database of airports located around the world.

In [None]:
# first examine the structure of the database

head airports.dat

In [None]:
# count the number of entries (lines) in the file

wc -l airports.dat

In [None]:
# Searching for airports in Canada
grep 'Canada' airports.dat

In [None]:
# Counting the number of results

grep -c 'Canada' airports.dat

In [None]:
# Filtering for all airports in the America/Los_Angeles time zone

grep 'America/Los_Angeles' airports.dat
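
The searches above can be tried without the real database. The sketch below builds a three-row stand-in file in a temporary location; the rows are hypothetical and only imitate the layout of `airports.dat`.

```shell
# Hypothetical three-row sample in the same comma-separated layout as airports.dat
sample=$(mktemp)
cat > "$sample" <<'EOF'
1,"Example Field","Exampleville","Canada","EXA","CEXA",49.10,-123.20,14,-8,"A","America/Vancouver","airport","Sample"
2,"Demo Airport","Demotown","United States","DEM","KDEM",34.00,-118.40,125,-8,"A","America/Los_Angeles","airport","Sample"
3,"Test Strip","Testburg","Canada","TST","CTST",53.30,-113.60,723,-7,"A","America/Edmonton","airport","Sample"
EOF

# Print every line that contains the text Canada
grep 'Canada' "$sample"

# -c prints the number of matching lines instead of the lines themselves
canada_count=$(grep -c 'Canada' "$sample")
la_count=$(grep -c 'America/Los_Angeles' "$sample")
echo "Canada: $canada_count, Los Angeles time zone: $la_count"

# Clean up the temporary file
rm -f "$sample"
```

Note that `grep` matches the text anywhere on the line, not in a particular column, so a row whose airport name contained the word Canada would be counted as well.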

## Piping

The standard output stream of a command can be piped into `grep` using the `|` operator.

![Pipeline.svg](Pipeline.svg)

In [None]:
# Example: printing every process running on the system

ps aux

In [None]:
# Pipe the output of ps into grep to find all lines that mention the current user

ps aux | grep "$USER"

In [None]:
# Piping cat into grep

cat airports.dat | grep -c 'Canada'

In [None]:
# Equivalent of the above

cat airports.dat | grep 'Canada' | wc -l
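
As a small self-contained check that the two pipelines agree, the sketch below feeds three made-up lines to `grep` through a pipe. The variable names are illustrative only.

```shell
# Three made-up lines stand in for airports.dat
rows='1,"A","Canada"
2,"B","France"
3,"C","Canada"'

# grep reads standard input when no file name is given
count_with_c=$(printf '%s\n' "$rows" | grep -c 'Canada')

# Counting the matching lines with wc -l instead; tr strips the padding
# that some versions of wc add around the number
count_with_wc=$(printf '%s\n' "$rows" | grep 'Canada' | wc -l | tr -d ' ')

echo "grep -c: $count_with_c, wc -l: $count_with_wc"
```

Both counts should be the same, since `grep -c` counts matching lines and `wc -l` counts the lines that `grep` prints.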

## Regular Expressions

Regular expressions are search patterns that can be used to find or extract text from a larger body.

A regular expression may be an exact sequence of characters, e.g. `Canada`.

However, an exact sequence is rather inflexible. Regular expressions allow for many options to customize a search.

## Airport Identifiers

### Dot Wildcard

In [None]:
# Finding all airports with an ICAO code beginning with 'LF' (French airports)

# . <--- dot wildcard matches any single character (including whitespace)

grep -E '"LF.."' airports.dat
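
To see that each dot stands for exactly one character, the sketch below runs the same pattern over a short list of quoted codes. The list is a stand-in for the ICAO column, not an extract from `airports.dat`.

```shell
# A handful of quoted codes, one per line
codes='"LFPG"
"LFBO"
"LF"
"EGLL"
"LFPGX"'

# Each dot must match exactly one character, so only quoted
# four-character codes that start with LF are kept
lf_matches=$(printf '%s\n' "$codes" | grep -E '"LF.."')
printf '%s\n' "$lf_matches"
```

The two-character and five-character entries are rejected because the closing quotation mark has to appear right after the second wildcard.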

### `[AB]` Filtering for only certain characters

In [None]:
# Filtering for airports whose identifier begins with "C" or "K" (Canada or the continental USA)

# [CK] <--- Select either C or K

grep -E '"[CK]..."' airports.dat

Among its results, our search returned:

```
1921,"Sancti Spiritus Airport","Sancti Spiritus","Cuba","USS","MUSS",21.9704,-79.442703,295,-5,"U","America/Havana","airport","OurAirports"
```

The pattern matched the country field `"Cuba"` rather than an identifier, so our search is not restrictive enough. We need to restrict the match to uppercase letters.
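
One way to confirm which part of the row triggered the match is the `-o` option, which GNU and BSD `grep` provide to print only the matching text. The sketch below applies it to the row shown above; `-o` is not used elsewhere in this notebook.

```shell
# The row returned by the search above
row='1921,"Sancti Spiritus Airport","Sancti Spiritus","Cuba","USS","MUSS",21.9704,-79.442703,295,-5,"U","America/Havana","airport","OurAirports"'

# -o prints only the part of the line that matched the pattern
matched=$(printf '%s\n' "$row" | grep -oE '"[CK]..."')
echo "$matched"
```

The dot wildcard accepts lowercase letters, which is why a four-letter country name beginning with C satisfies the pattern.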

### Range of characters

In [None]:
# [A-Z] Search only for uppercase letters

grep -E '"[CK][A-Z][A-Z][A-Z]"' airports.dat

### `{}` Repetition

In [None]:
# Repetition

# If we know how many times a pattern should be repeated, then we can specify this amount in {}

grep -E '"[CK][A-Z]{3}"' airports.dat

In [None]:
# Repetition

# We can also specify a range of possible repetitions from a to b with {a,b}

grep -E '"[CK][A-Z]{2,3}"' airports.dat
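
The difference between `{3}` and `{2,3}` is easier to see on a short list. The sketch below counts matches for both patterns over a few sample codes of different lengths.

```shell
# Quoted sample codes with 2 to 5 letters
codes='"CYVR"
"CYV"
"CY"
"KLAX"
"CYVRX"'

# {3}: C or K followed by exactly three uppercase letters
exactly_three=$(printf '%s\n' "$codes" | grep -cE '"[CK][A-Z]{3}"')

# {2,3}: C or K followed by two or three uppercase letters
two_or_three=$(printf '%s\n' "$codes" | grep -cE '"[CK][A-Z]{2,3}"')

echo "{3}: $exactly_three  {2,3}: $two_or_three"
```

Only the `{2,3}` form accepts the three-letter entry; neither form accepts the two-letter or five-letter entries, because the quotation marks pin down both ends of the code.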

### `+` One or more repetitions of an element

In [None]:
# Filtering for all Canadian airports between 40 N and 50 N

# Breaking it down

# Step 1) Filter for a Canadian ICAO code in quotation marks ("C" followed by capital letters) -- see previous examples

# Step 2) Search for a comma

# Step 3) Search for the digit 4

# Step 4) Search for a digit between 0-9, followed by a period (the period is escaped with a backslash)

# Step 5) Search for one or more repetitions of another digit with the + operator

grep -E '"C[A-Z]{2,3}",4[0-9]\.[0-9]+' airports.dat

In [None]:
# Filtering for all Canadian airports between 40-50 N and 120-130W

grep -E '"C[A-Z]{2,3}",4[0-9]\.[0-9]+,-12[0-9]\.[0-9]+' airports.dat
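
The two coordinate filters can be checked on a few shortened rows. The sketch below uses hypothetical `code,latitude,longitude` fragments, not real entries from `airports.dat`.

```shell
# Hypothetical ICAO,latitude,longitude fragments
rows='"CAAA",49.19,-123.18
"CBBB",53.31,-113.58
"CCCC",45.47,-73.74
"KDDD",47.45,-122.31
"CEEE",4.70,-74.15'

# Latitude only: a code starting with C, then 40.x to 49.x
lat_only=$(printf '%s\n' "$rows" | grep -cE '"C[A-Z]{2,3}",4[0-9]\.[0-9]+')

# Latitude and longitude: additionally require -120.x to -129.x
lat_and_lon=$(printf '%s\n' "$rows" | grep -E '"C[A-Z]{2,3}",4[0-9]\.[0-9]+,-12[0-9]\.[0-9]+')

echo "latitude matches: $lat_only"
echo "$lat_and_lon"
```

The row with latitude `4.70` is rejected because the pattern requires a second digit before the decimal point.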

### `^` Not operator

In [None]:
# Filtering for all non-Canadian airports between 40-50 N and 120-130W

grep -E '"[^C][A-Z]{2,3}",4[0-9]\.[0-9]+,-12[0-9]\.[0-9]+' airports.dat

In [None]:
# Filtering for all non-Canadian and non-continental U.S. airports between 50-60 N and 130-140W

grep -E '"[^CK][A-Z]{2,3}",5[0-9]\.[0-9]+,-13[0-9]\.[0-9]+' airports.dat

In [None]:
# Preview the first 200 lines of RJ_WS.txt, the play text used in the remaining examples

head -n 200 RJ_WS.txt

### Beginning of Line `^`
### Whitespace `\s`
### Zero or more repetitions `*`

### Example: Filtering for all lines by a character

Step 1) Anchor your search at the beginning of each line with the `^` anchor

Step 2) Filter for zero or more repetitions (`*`) of whitespace (`\s`)

Step 3) Filter for the character's name prefix

Step 4) Filter for a period (`\.`)

In [None]:
# Filtering for all lines by Juliet

grep -E '^\s*Jul\.' RJ_WS.txt
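
The effect of the `^` anchor shows up when a name prefix also appears in the middle of a line. The sketch below uses a few sample lines written in the style of the play text (they are not quoted from `RJ_WS.txt`) and spells whitespace as `[[:space:]]`, the POSIX character class, since `\s` is a shorthand that GNU `grep` accepts but that is not guaranteed everywhere.

```shell
# Sample lines in the style of the play text
lines='  Jul. Ay me!
Jul. No indentation here.
  Rom. She speaks.
  Rom. This line mentions Jul. in the middle.'

# Anchored: optional leading whitespace, then the prefix at the start of the line
anchored=$(printf '%s\n' "$lines" | grep -cE '^[[:space:]]*Jul\.')

# Unanchored: the prefix may appear anywhere on the line
unanchored=$(printf '%s\n' "$lines" | grep -cE 'Jul\.')

echo "anchored: $anchored, unanchored: $unanchored"
```

The unanchored search picks up the extra line spoken by another character, which is why the anchor is needed.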

### `$` End-of-Line Anchor

In [None]:
# Finding all questions asked by Juliet

grep -E '^\s*Jul\..*\?$' RJ_WS.txt
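
The `$` anchor requires the question mark to be the last character on the line. The sketch below illustrates this with sample lines in the style of the play text (not quoted from `RJ_WS.txt`), again using `[[:space:]]` as a portable spelling of `\s`.

```shell
# Sample lines in the style of the play text
lines='  Jul. Wherefore art thou Romeo?
  Jul. Ay me!
  Rom. Shall I hear more?
  Jul. What? No more.'

# Start of line, optional whitespace, the prefix, any text, then ? at the end of the line
questions=$(printf '%s\n' "$lines" | grep -E '^[[:space:]]*Jul\..*\?$')
printf '%s\n' "$questions"
```

The last sample line contains a question mark but does not end with one, so it is not reported. The pattern also only finds questions that finish on the same line as the speaker prefix, and a file with Windows line endings would need the trailing carriage return removed first, because that character sits between the `?` and the end of the line.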