# Chapter 5 - Regular Expressions

### - Greatest thing since sliced bread

In [2]:
import re

# Today
- Recap
- How did you manage with Chapter 5?
- What was difficult?
- How did you solve the exercises?

## Extract Year from file bee_list.txt

    Scientific Name         Taxon Author        tsn
    Alocandrena porteri     Michener, 1986      752055
    Ancylandrena atoposoma  (Cockerell, 1934)   654122
    Ancylandrena koebelei   (Timberlake, 1951)  654123

Using `grep` 

    $ grep -P -o -w '\d{4}' bee_list.txt


Using `awk`

    $ awk 'match($0, /[0-9]{4}/) {print substr($0, RSTART, RLENGTH)}' bee_list.txt


Using `sed`

    $ sed -r 's/([^0-9]*([0-9]*)){1}.*/\2/' bee_list.txt


Using `perl`

    $ perl -lne 'print $1 if /(\d{4})/' bee_list.txt

## Common to all

#### To match:

`/regex/`

`m/regex/`

#### To search (match) and replace:

`s/regex/replacement/`

## Using Python

```python
import re
rexp = re.compile(r'\(?[\w\s,\.\-\&]*,\s(\d{4})\)?')
with open("bee_list.txt", "r") as f:
    for line in f:
        match = rexp.search(line)
        if match:
            print(match.group(1))
```
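To try the pattern without the file, a minimal sketch can run it over the sample lines shown earlier, held in a string. The names `sample` and `years` are illustrative only.

```python
import re

# Sample lines copied from the bee_list.txt excerpt above.
sample = """Scientific Name         Taxon Author        tsn
Alocandrena porteri     Michener, 1986      752055
Ancylandrena atoposoma  (Cockerell, 1934)   654122
Ancylandrena koebelei   (Timberlake, 1951)  654123
"""

rexp = re.compile(r'\(?[\w\s,\.\-\&]*,\s(\d{4})\)?')

years = []
for line in sample.splitlines():
    match = rexp.search(line)
    if match:  # the header line has no ", YYYY" part, so it is skipped
        years.append(match.group(1))

print(years)
```

The pattern relies on the comma followed by a space in front of the year, which is why the six-digit `tsn` column is not picked up.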

#### This one-liner didn't work for me

```sh
$ python -c "import sys,re;[sys.stdout.write(re.search(r'(\d{4})',line).group(1)) for line in sys.stdin]" < bee_list.txt
```
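One likely reason, judging only from the excerpt above: the header line contains no four-digit run, so `re.search()` returns `None` and `.group(1)` raises an `AttributeError`. In addition, `sys.stdout.write()` does not append a newline. A sketch of the same logic as a small function (the name `extract_years` is made up for illustration) that skips non-matching lines:

```python
import re

def extract_years(lines):
    """Return the first four-digit run of each line that has one."""
    years = []
    for line in lines:
        match = re.search(r'(\d{4})', line)
        if match:  # skip lines without a match, e.g. the header
            years.append(match.group(1))
    return years

# Two lines from the excerpt, used here instead of sys.stdin.
lines = [
    "Scientific Name         Taxon Author        tsn",
    "Alocandrena porteri     Michener, 1986      752055",
]
print("\n".join(extract_years(lines)))
```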

## 5.4.2 Metacharacters (p. 168)

Add to the list:

- `\S` - not a white space (*cf.* `\s`)

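- `\W` - not a word character (*cf.* `\w`, which is the set `[a-zA-Z0-9_]` for ASCII; in Python 3, `\w` on `str` patterns also matches Unicode letters and digits unless `re.ASCII` is set).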

Much more info from `help(re)` in Python.
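A small sketch of the complementary pairs in action; the example string is invented.

```python
import re

text = "Apis mellifera, 1758"

words = re.findall(r'\w+', text)      # runs of word characters
non_words = re.findall(r'\W+', text)  # runs of everything else
tokens = re.findall(r'\S+', text)     # runs of non-whitespace

print(words)      # ['Apis', 'mellifera', '1758']
print(non_words)  # [' ', ', ']
print(tokens)     # ['Apis', 'mellifera,', '1758']
```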

## 5.4.3 Sets (p. 169) (See also p. 177)

A commonly used strategy to avoid (some) issues: **Use a middle step**!

Instead of:

    re.search(r"GATC", my_unknown_DNA_string)

normalize the string first (the "middle step"):

    re.search(r"GATC", my_unknown_DNA_string.upper())
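As a sketch, here is the middle step next to an alternative that leaves the string untouched, the `re.IGNORECASE` flag. The DNA string is made up for the example.

```python
import re

my_unknown_DNA_string = "ttgatcaa"  # hypothetical lower-case input

# Without the middle step, the upper-case pattern finds nothing.
miss = re.search(r"GATC", my_unknown_DNA_string)
print(miss)  # None

# Middle step: normalize the case before searching.
hit = re.search(r"GATC", my_unknown_DNA_string.upper())
print(hit.group())  # GATC

# Alternative: keep the string as it is and ignore case in the pattern.
hit_ic = re.search(r"GATC", my_unknown_DNA_string, re.IGNORECASE)
print(hit_ic.group())  # gatc
```

Note the difference in what comes back: with `.upper()` the match is reported in upper case, while `re.IGNORECASE` returns the text as it appears in the original string.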

## 5.4.4 Quantifiers (p. 171)

Greedy vs non-greedy matching -- major source of error!

In [3]:
re.search(r".*\s", "once upon a time").group()

'once upon a '

In [4]:
re.search(r".*?\s", "once upon a time").group()

'once '
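The same effect with `re.findall()`, as a minimal sketch on an invented string with markup-like tags:

```python
import re

text = "<i>Apis</i> <i>mellifera</i>"

# Greedy: .* runs to the last '>' in the string, giving one long match.
greedy = re.findall(r"<.*>", text)

# Non-greedy: .*? stops at the first '>' it can, giving one match per tag.
lazy = re.findall(r"<.*?>", text)

print(greedy)  # ['<i>Apis</i> <i>mellifera</i>']
print(lazy)    # ['<i>', '</i>', '<i>', '</i>']
```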

## 5.5 Functions of the `re` module (p.175)

All examples up to here were used with `re.search()` --- and with matches!

**Remember**: if there is no match, `re.search()` returns `None`, and `None` (type `NoneType`) has no attribute `.group()`:

```sh
>>> match.group()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'
```

**Test** if you got a match. For example:

```python
my_match = re.search(reg, target_string)
if my_match:
    print(my_match.group())
```

Similar issue with `re.findall()`. It returns a list (empty or not):

```python
my_list = re.findall(r'Pepes bodega', my_DNA_string)
if my_list:
    pass  # do something with my_list
```

`re.finditer()`, on the other hand, returns a `callable_iterator` object (which may yield matches or not), and that object is always truthy: it never evaluates to `False` or `None`.

```python
my_iter = re.finditer(r'Pepes bodega', my_DNA_string)
if my_iter: # This will be True even without a match!
    pass  # do something
```

**One solution**: Convert the iterator to a list once, and reuse that list for both the length check and the loop!

```python
my_iter = re.finditer(r'Pepes bodega', r'ACGTACGT')
my_iter = list(my_iter) # Needed: an iterator has no len(), and list() "consumes" it, so convert only once!
if len(my_iter):
    for i in my_iter:
        print(i.group())
else:
    print("No match found")
```

(See <https://stackoverflow.com/questions/12351475/regular-expression-callable-iterator-get-length/12351496>)

In [6]:
my_iter = re.finditer(r'Pepes bodega', r'ACGTACGT')
my_iter = list(my_iter) # Needed: an iterator has no len(), and list() "consumes" it!
if len(my_iter):
    for i in my_iter:
        print(i.group())
else:
    print("No match found")

No match found


In [7]:
my_iter = re.finditer(r'ACGT', r'ACGTACGT')
my_iter = list(my_iter) # Needed: an iterator has no len(), and list() "consumes" it!
if len(my_iter):
    for i in my_iter:
        print(i.group())
else:
    print("No match found")

ACGT
ACGT


In [8]:
my_iter = re.finditer(r'ACGT', r'ACGTACGT')
#my_iter = list(my_iter) # Without this conversion, len() fails: an iterator has no length
if len(my_iter):
    for i in my_iter:
        print(i.group())
else:
    print("No match found")

TypeError: object of type 'callable_iterator' has no len()

In [9]:
my_iter = re.finditer(r'ACGT', r'ACGTACGT')
print(len(list(my_iter)))
len(list(my_iter))

2


0

In [10]:
my_iter = re.finditer(r'ACGT', r'ACGTACGT')
my_iter = list(my_iter)
print(len(my_iter))
my_iter

2


[<re.Match object; span=(0, 4), match='ACGT'>,
 <re.Match object; span=(4, 8), match='ACGT'>]
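If building a full list is undesirable (for example with very many matches), one alternative is to peek at the first match and then chain it back in front of the rest. This is only a sketch; the helper name `matches_or_none` is hypothetical and not part of the `re` module.

```python
import re
from itertools import chain

def matches_or_none(pattern, text):
    """Return an iterator over all matches, or None if there are none."""
    it = re.finditer(pattern, text)
    first = next(it, None)      # peek at the first match, if any
    if first is None:
        return None
    return chain([first], it)   # put the first match back in front

for target in ("ACGTACGT", "TTTT"):
    found = matches_or_none(r"ACGT", target)
    if found is None:
        print("No match found")
    else:
        for m in found:
            print(m.group())
```

Because the function returns `None` when nothing matches, the caller can test the result the same way as with `re.search()`.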

## 5.5 Functions of the `re` module (p.175)

Reminder: You can find things in strings/files without using regular expressions.

Compare

```python
re.sub(reg, repl, target_string)
```

with

```python
target_string.replace(search, repl) # replace doesn't take a regular expression
```
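A short sketch of the comparison, using an invented string: plain string methods cover literal searches and replacements, while `re.sub()` is needed once the thing to replace is a pattern.

```python
import re

target_string = "Apis mellifera and Apis cerana"

# Plain substring test and replacement: no regular expression needed.
has_apis = "Apis" in target_string
with_replace = target_string.replace("Apis", "A.")

# The same replacement with re.sub(); here the pattern is just a literal.
with_sub = re.sub(r"Apis", "A.", target_string)

# A case where a pattern is needed: collapse runs of whitespace.
collapsed = re.sub(r"\s+", " ", "Apis    mellifera")

print(has_apis)
print(with_replace)
print(with_sub)
print(collapsed)
```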

## Exercises (p. 182)

**Any comments or questions?**

## 5.8 The Quest for the Perfect Regular Expression (p. 181)

How often have you filled in a web form with, say, your personal ID number, only to be greeted with a message saying: "This is not a valid personal ID number"?

The example for a pattern to match a (US) Zip code on p. 181 is non-exhaustive. As it stands, it handles three cases. 

```text
60637
60637 1503
60637-1503
```

How would you have made it better?

## 5.8 The Quest for the Perfect Regular Expression (p. 181)

What about?

```text
60637 -1503
60637- 1503
60637 - 1503
60637-
...
```
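One possible direction, as a sketch only: accept optional whitespace around an optional hyphen, and normalize whatever was typed to a single form. This is not the pattern from the book; `ZIP_RE` and `normalize_zip` are names invented for this example, and rejecting a dangling hyphen (`60637-`) is a design choice that could just as well go the other way.

```python
import re

# Five digits, optionally followed by four more digits; the separator may be
# whitespace, a hyphen, a hyphen surrounded by whitespace, or nothing at all.
ZIP_RE = re.compile(r"(\d{5})(?:\s*-?\s*(\d{4}))?")

def normalize_zip(text):
    """Return 'NNNNN' or 'NNNNN-NNNN', or None if the input does not fit."""
    m = ZIP_RE.fullmatch(text.strip())
    if m is None:
        return None
    five, plus4 = m.groups()
    return f"{five}-{plus4}" if plus4 else five

for raw in ["60637", "60637 1503", "60637-1503", "60637 -1503",
            "60637- 1503", "60637 - 1503", "60637-"]:
    print(repr(raw), "->", normalize_zip(raw))
```

Using `fullmatch()` rather than `search()` means trailing junk is rejected instead of silently ignored, which is usually what a form validator wants.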

# "Never underestimate the imagination of the user!"

![](img/User_interactive.png)

# Assignment for next occasion (Dec 03, in two weeks)

- Chapter 6. Scientific Computing (Python)
- **Use the Slack channels (<https://bioinfo-course-2020.slack.com>)!**
