#### UNIX grep and sed Tutorial

The UNIX command `grep` (global search for regular expressions and print) searches the lines of text files for substrings that match a regular expression and prints those lines. The symbols of grep expressions are ASCII characters. _Extended regular expressions_ are supported by `grep -E` (or `egrep` in earlier versions of UNIX):

| regular expression | matching string                                                                           |
|:-------------------|:------------------------------------------------------------------------------------------|
| `c`                | non-operator character `c`                                                                |
| `\c`               | character `c` literally                                                                   |
| `r₁r₂`             | sequence of regular expressions `r₁` and `r₂`                                                 |
| `r₁│r₂`         | either `r₁` or `r₂`                                                          |
| `r*`               | zero or more occurrences of `r`, where `r` is a single character, a character class, or a parenthesized expression |
| `r+`               | one or more occurrences of `r`, same as `rr*`                                |
| `r?`               | zero or one occurrence of `r`                                                |
| `(r)`              | same as `r`                        |
| `r{i}`         | `i` repetitions of `r`, e.g. `9{3}` = `999`                               |
| `r{i,j}`         | `i` to `j` repetitions of `r`, e.g. `9{1,3}` = `9│99│999`                               |
| `r{i,}`         | at least `i` repetitions of `r`, e.g. `9{3,}` = `9999*`                               |
| `r{,j}`         | at most `j` repetitions of `r`, e.g. `9{,3}` = `(9│99│999)?`                               |
| `[s]`              | character class, e.g. `[ab34] = a│b│3│4` and `[A-Za-z] = A│…│Z│a│…│z`                     |
| `[^s]`             | complemented character class, e.g. `[^0-9]` are all non-digit characters                  |
| `.`                | character class with all characters                                                       |
| `^`                | a fictitious character at the beginning of a line                                         |
| `$`                | a fictitious character at the end of a line                                               |
|`\<`                | a fictitious character at the beginning of a word |
| `\>`               | a fictitious character at the end of a word |
| `\b`               | a fictitious character at the edge of a word |
| `\B`               | a fictitious character that is not at the edge of a word |
| `\w`               | synonym for `[_[:alnum:]]` |
| `\W`              | same as `[^_[:alnum:]]` |
| `[[:alnum:]]`   | same as `[0-9A-Za-z]` |
| `[[:alpha:]]`    | same as `[A-Za-z]` |
| `[[:cntrl:]]`      | Control characters. In ASCII, these characters have octal codes 000 through 037, and 177 (DEL) |
| `[[:digit:]]`      | same as `[0123456789]` |
| `[[:graph:]]`    | `[:alnum:]` and `[:punct:]` |
| `[[:lower:]]`    | same as `[abcdefghijklmnopqrstuvwxyz]` |
| `[[:print:]]`     | `[:alnum:]`, `[:punct:]`, and ` ` (space) |
| `[[:punct:]]`    | punctuation characters, same as <code>[!"#$%&'()*+,-./:;<=>?@[\]^`{\|}~]</code> |
| `[[:space:]]`   | tab, newline, vertical tab, form feed, carriage return, and `' '` (space) |
| `[[:upper:]]`   | same as `[ABCDEFGHIJKLMNOPQRSTUVWXYZ]` |
| `[[:xdigit:]]`    | hexadecimal digits, same as `[0123456789ABCDEFabcdef]` |
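
A few of the table entries can be tried out directly. The following is a minimal sketch, assuming GNU grep; the input strings and variable names are made up for illustration. The `-o` option prints only the matched substrings, one per line, which makes it easy to see what an expression actually matches:

```shell
# Interval expression: runs of at least three digits.
digits=$(printf 'tel 555 0199, ext 42\n' | grep -E -o '[[:digit:]]{3,}')
printf '%s\n' "$digits"
# 555
# 0199

# Character classes: an uppercase letter followed by one or more lowercase letters.
names=$(printf 'Alice met Bob at noon\n' | grep -E -o '[[:upper:]][[:lower:]]+')
printf '%s\n' "$names"
# Alice
# Bob
```

The number `42` is not printed because it has fewer than three digits.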

In basic regular expressions, `?`, `+`, `{`, `|`, `(`, and `)` lose their special meaning; `\?`, `\+`, `\{`, `\|`, `\(`, and `\)` have to be used instead.
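
The difference can be seen on a small example. This is a sketch assuming GNU grep and a hypothetical three-line input; the variable names are illustrative only:

```shell
# Hypothetical sample input: three short lines.
sample=$(printf 'ac\nabc\nabbc\n')

# Extended syntax: + is an operator meaning "one or more".
ere=$(printf '%s\n' "$sample" | grep -E 'ab+c')

# Basic syntax: the same operator has to be written as \+.
bre=$(printf '%s\n' "$sample" | grep 'ab\+c')

# Basic syntax without the backslash: + is an ordinary character,
# so only a line containing the literal text "ab+c" would match.
literal=$(printf '%s\n' "$sample" | grep 'ab+c' || true)

printf 'ERE:\n%s\nBRE:\n%s\nliteral +: [%s]\n' "$ere" "$bre" "$literal"
```

Both the first and the second command select `abc` and `abbc`; the third selects nothing.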

Consult the [online manual](https://www.gnu.org/software/grep/manual/grep.html), or read the _man page_ by opening a new terminal in Jupyter and typing `man grep`.

The Jupyter (IPython) `%%writefile` [cell magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cell-magics) writes the content of the cell to the named file:

In [2]:
%%writefile file.txt
A line starting with a character and with 3.14
17 does not start with a character
(comment)
(comment with space in between)
(  comment9)
()

Writing file.txt


The Jupyter (IPython) `%%bash` [cell magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cell-magics) allows an arbitrary shell command to be executed:

In [3]:
%%bash
grep -E '\b[[:digit:]]+\b' file.txt # Prints all lines with at least one separated number

A line starting with a character and with 3.14
17 does not start with a character


In [None]:
%%bash
grep -E '\b[[:digit:]]+\b.*\b[[:digit:]]+\b' file.txt # Prints all lines with at least two separated numbers

In [None]:
%%bash
grep -E '^[a-zA-Z]' file.txt # Prints all lines beginning with a letter.

In [None]:
%%bash
grep -E '\( *([a-zA-Z]*|[0-9]*) *\)' file.txt # Prints all lines that contain either letters or digits in
# parentheses with spaces optionally preceding and following, but not letter-digit combinations in parentheses.

In [None]:
%%bash
grep -E '[0-9]+\.[0-9]*|\.[0-9]+' file.txt # Prints all lines containing a floating point number.
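
To see which substrings the floating point expression accepts, the same pattern can be run with `-o` on a made-up line. This is only a sketch, assuming GNU grep; the input text is hypothetical:

```shell
# Print each substring matched by the floating point pattern on its own line.
floats=$(printf 'pi 3.14, half .5, seven 7. and 42\n' |
    grep -E -o '[0-9]+\.[0-9]*|\.[0-9]+')
printf '%s\n' "$floats"
# 3.14
# .5
# 7.
```

The pattern requires a decimal point, so the integer `42` is not matched, while `7.` with an empty fractional part is.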

---

The UNIX command `sed` is a _stream editor_ that uses regular expressions in search and replace commands. Consult the [online manual](https://www.gnu.org/software/sed/manual/sed.html), or read the _man page_ by opening a new terminal in Jupyter and typing `man sed`.

In [None]:
%%writefile dayandnight.txt
night and day, day and night   
daydreamer daylight dayshift
daydreamer daylight dayshift
nightmare nightmare    

The typical use is `sed SCRIPT INPUTFILE`, which writes its output to `stdout`. For example, the script `'1,2p'` prints lines 1 to 2. The `-n` (quiet) option suppresses the automatic printing of every input line, so that only the lines selected by `p` appear:

In [None]:
!sed -n '1,2p' dayandnight.txt
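
The effect of `-n` is easiest to see side by side. A small sketch with a hypothetical three-line input, read from standard input instead of a file:

```shell
# Hypothetical three-line input.
input=$(printf 'one\ntwo\nthree\n')

# With -n: only the lines selected by p are printed.
quiet=$(printf '%s\n' "$input" | sed -n '1,2p')

# Without -n: every line is printed once automatically,
# and lines 1 and 2 are printed a second time by p.
noisy=$(printf '%s\n' "$input" | sed '1,2p')

printf 'with -n:\n%s\nwithout -n:\n%s\n' "$quiet" "$noisy"
```

With `-n` the output is `one`, `two`; without it the output is `one`, `one`, `two`, `two`, `three`.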

The script `'2d'` deletes line 2; here, the `-n` option is left out:

In [None]:
!sed '2d' dayandnight.txt

The script `'s/night/day/'` replaces the first occurrence of `night` with `day` on each line:

In [None]:
!sed 's/night/day/' dayandnight.txt

The script `'s/night/day/2'` replaces the second occurrence of `night` with `day` on each line:

In [None]:
!sed 's/night/day/2' dayandnight.txt

The script `'s/night/day/g'` replaces all occurrences of `night` with `day` on each line:

In [None]:
!sed 's/night/day/g' dayandnight.txt
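
The three variants of the `s` command can be compared on a single made-up line that contains the search word three times (a sketch; the variable names are illustrative):

```shell
line='night night night'

first=$(printf '%s\n' "$line" | sed 's/night/day/')    # no flag: first occurrence only
second=$(printf '%s\n' "$line" | sed 's/night/day/2')  # 2: second occurrence only
all=$(printf '%s\n' "$line" | sed 's/night/day/g')     # g: every occurrence

printf '%s\n%s\n%s\n' "$first" "$second" "$all"
# day night night
# night day night
# day day day
```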

The script `'2 s/day/night/g'` replaces all occurrences of `day` with `night` on the second line:

In [None]:
!sed ' 2 s/day/night/g' dayandnight.txt
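
An address in front of a command can also be a range, as in `'1,2p'` earlier. The sketch below uses a hypothetical three-line input and the address `$`, which stands for the last line:

```shell
# Hypothetical three-line input.
input=$(printf 'day one\nday two\nday three\n')

# Single address: substitute on line 2 only.
only_second=$(printf '%s\n' "$input" | sed '2 s/day/night/')

# Range address: substitute from line 2 to the last line ($).
from_second=$(printf '%s\n' "$input" | sed '2,$ s/day/night/')

printf '%s\n---\n%s\n' "$only_second" "$from_second"
```

The first command changes only `day two`; the second changes `day two` and `day three`.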

The script `'s/([[:alpha:]]+)/\1 \1/g'` duplicates every word consisting only of letters: `[[:alpha:]]+` matches one or more letters, as many as possible, and the backreference `\1` refers to the text matched by the first parenthesized expression, `([[:alpha:]]+)`. (With `*` instead of `+`, the expression would also match the empty string, and the replacement would insert stray spaces.)

In [None]:
!sed -r 's/([[:alpha:]]+)/\1 \1/g' dayandnight.txt

The `-r` flag enables extended regular expressions. Backreferences are numbered `\0`, `\1`, `\2`, etc.; for n ≥ 1, `\n` refers to the text matched by the n-th parenthesized expression. The reference `\0` refers to the entire match and is abbreviated by `&`; see below for an example. Backreferences can also be used in the search expression. The script `'/([a-z]+) \1/p'` prints all lines in which a string matching `[a-z]+` is followed by a space and the same string again, which is meant to find lines with duplicate words:

In [None]:
!sed -rn '/([a-z]+) \1/p' dayandnight.txt
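
Backreferences are used most often in the replacement part. The following sketch, assuming GNU sed and a made-up input line, swaps two words with `\1` and `\2` and wraps every word in angle brackets with `&`:

```shell
# \1 and \2 hold the words before and after "and"; the replacement swaps them.
swapped=$(printf 'night and day\n' | sed -r 's/([a-z]+) and ([a-z]+)/\2 and \1/')

# & stands for the whole match, here each run of lowercase letters.
bracketed=$(printf 'night and day\n' | sed -r 's/[a-z]+/<&>/g')

printf '%s\n%s\n' "$swapped" "$bracketed"
# day and night
# <night> <and> <day>
```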

The pattern also matches `d d` in `and day`. To fix this, `\b` is used to match at word boundaries:

In [None]:
!sed -rn '/(\b[a-z]+)\b \1/p' dayandnight.txt
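
The role of the word boundaries can be checked on a hypothetical input that is not part of `dayandnight.txt`. The sketch assumes GNU sed. The third pattern, with an additional `\b` after the backreference, is a suggested variant rather than the one used above; it also rejects a word followed by a longer word that starts with it, such as `day daylight`:

```shell
# Hypothetical input with one true duplicate (last line) and two near misses.
input=$(printf 'and day\nday daylight\nnightmare nightmare\n')

# No boundaries: "d d" and "day day(light)" count as duplicates.
loose=$(printf '%s\n' "$input" | sed -rn '/([a-z]+) \1/p')

# Boundaries around the first word only: "day daylight" still matches.
partial=$(printf '%s\n' "$input" | sed -rn '/(\b[a-z]+)\b \1/p')

# Boundary after the backreference as well: only whole duplicate words match.
strict=$(printf '%s\n' "$input" | sed -rn '/\b([a-z]+) \1\b/p')

printf 'loose:\n%s\npartial:\n%s\nstrict:\n%s\n' "$loose" "$partial" "$strict"
```

Here `loose` contains all three lines, `partial` contains `day daylight` and `nightmare nightmare`, and `strict` contains only `nightmare nightmare`.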

[Perl regular expressions](https://perldoc.perl.org/perlre) extend standard Unix regular expressions; they are not universally supported. The script `'s/\<./\u&/g'` capitalizes the first letter of each word: `\u` converts the next character of the replacement to uppercase, here the first character of `&`, which stands for the text matched by `\<.`, i.e. the first letter of a word. See also this [introduction](https://www.regular-expressions.info/replacecase.html):

In [1]:
!sed -r 's/\<./\u&/g' dayandnight.txt


Write an sed script that indents all lines by four characters!

In [None]:
!sed -r 's/^/    /' dayandnight.txt

Write an sed script that removes all trailing spaces! Use `wc` to check that spaces have indeed been removed!

In [None]:
!sed -r 's/ *$//g' dayandnight.txt
!sed -r 's/ *$//g' dayandnight.txt | wc
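
Since `wc` without options prints line, word, and byte counts together, the check may be easier to read with `wc -c`, which prints only the byte count. A sketch on a hypothetical two-line input with three and four trailing spaces:

```shell
# Hypothetical input: 3 trailing spaces on line 1, 4 on line 2.
input=$(printf 'night and day   \nnightmare    \n')

# Byte counts before and after removing the trailing spaces.
before=$(printf '%s\n' "$input" | wc -c)
after=$(printf '%s\n' "$input" | sed 's/ *$//' | wc -c)

printf 'bytes before: %s, after: %s, removed: %s\n' \
    "$before" "$after" "$((before - after))"
```

The difference is 7 bytes, matching the seven trailing spaces in the input.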