# Pattern Matching With Regular Expressions

## Introduction

This notebook covers finding and replacing text with regular expressions (regexes). Writing regexes takes practice and can be intimidating at first. Build each regex up step by step, and don't shy away from using one of the websites listed below!

This notebook covers [chapter 7](https://automatetheboringstuff.com/2e/chapter7/) of the book.

You can find more information about regular expressions in the [Python Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html). There are also many websites that help you write and test regular expressions:
* [RegEx Pal](https://www.regexpal.com/)
* [RegExr](https://regexr.com/)
* [regex101](https://regex101.com/)


## Summary

### Creating a Regex Pattern Object

Creating a new regex `Pattern` object is done by using the `compile` function of the `re` module.

```python
import re
phone_number_regex = re.compile(r'\+\d{9,15}')
```

### Matching

To find the first occurrence of a pattern, use the `search` method of the regex `Pattern` object. It returns a `Match` object if the pattern is found anywhere in the given string, and `None` otherwise.

```python
phone_number_regex.search('My number is +41552221838')
# <re.Match object; span=(13, 25), match='+41552221838'>
```
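
Because `search` may return `None`, it is usually worth checking the result before using it. The following is a minimal sketch of that check; the helper name `find_phone_number` is made up for this illustration and is not part of the `re` module.

```python
import re

phone_number_regex = re.compile(r'\+\d{9,15}')

def find_phone_number(text):
    """Return the first phone number in text, or None if there is none."""
    match = phone_number_regex.search(text)
    if match is None:      # search() found nothing
        return None
    return match.group()   # the matched text as a plain string

first = find_phone_number('My number is +41552221838')
missing = find_phone_number('I have no phone')
print(first)    # +41552221838
print(missing)  # None
```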

To find all matches, use the `findall` method of the regex `Pattern` object. For a pattern without groups, `findall` returns a list of strings, not `Match` objects.

```python
phone_number_regex.findall('My phone number is +41552221838, my fax number is +41552224270')
# ['+41552221838', '+41552224270']
```

Both are also available as module-level functions that take the pattern as their first argument.

```python
re.search(r'\+\d{9,15}', 'My number is +41552221838')
re.findall(r'\+\d{9,15}', 'My phone number is +41552221838, my fax number is +41552224270')
```

### Replacing

Replacing patterns or parts of patterns is done using the `sub` method of the regex `Pattern` object.

```python
phone_number_regex.sub('CENSORED', 'My number is +41552221838')
# 'My number is CENSORED'
```

`sub` can be called at the module level, too.

```python
re.sub(r'\+\d{9,15}', 'CENSORED', 'My number is +41552221838')
# 'My number is CENSORED'
```

### Flags

`re.compile` and the module-level functions take an optional `flags` argument that modifies how the pattern is matched.

- `IGNORECASE`: Perform case-insensitive matching.
- `DOTALL`: The `.` also matches newlines.
- `VERBOSE`: Whitespace and `#` comments in the pattern are ignored, so a pattern can be spread over several lines.

The full list of flags is available at the [Python Documentation](https://docs.python.org/3/library/re.html).

Multiple flags can be passed using `|`:

```python
re.compile('foo', re.IGNORECASE | re.DOTALL | re.VERBOSE)
```
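
To see what each flag changes, here is a small sketch that applies them one at a time. The sample strings are invented for this illustration.

```python
import re

# IGNORECASE: 'spam' also matches 'Spam' and 'SPAM'
ignorecase_hits = re.findall(r'spam', 'Spam, SPAM and spam', re.IGNORECASE)
print(ignorecase_hits)  # ['Spam', 'SPAM', 'spam']

# DOTALL: without the flag, '.' stops at the line break
without_dotall = re.search(r'first.*second', 'first\nsecond')
with_dotall = re.search(r'first.*second', 'first\nsecond', re.DOTALL)
print(without_dotall)            # None
print(with_dotall is not None)   # True

# VERBOSE: the pattern can be laid out over several lines with comments
phone_number_regex = re.compile(r'''
    \+         # leading plus sign
    \d{9,15}   # 9 to 15 digits
''', re.VERBOSE)
verbose_hit = phone_number_regex.search('My number is +41552221838')
print(verbose_hit.group())  # +41552221838
```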

### Groups

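You can capture different parts (groups) of a match by wrapping them in parentheses in the pattern. The `groups` method of the `Match` object returns the captured parts, and a replacement string can reference each group with a backslash plus the group number: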

```python
re.search(r'\+41(\d{2})(\d{7})', 'My number is +41552221838').groups()
# ('55', '2221838')
re.sub(r'\+41(\d{2})(\d{7})', r'0\1*******', 'My number is +41552221838')
# 'My number is 055*******'
```
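
Groups also change what `findall` returns: with two or more groups in the pattern, each list entry is a tuple of the captured groups instead of the whole match. A short sketch, reusing the phone number text from above:

```python
import re

text = 'My phone number is +41552221838, my fax number is +41552224270'

# No groups: a list of the full matches
no_groups = re.findall(r'\+41\d{9}', text)
print(no_groups)   # ['+41552221838', '+41552224270']

# Two groups: a list of tuples, one tuple per match
two_groups = re.findall(r'\+41(\d{2})(\d{7})', text)
print(two_groups)  # [('55', '2221838'), ('55', '2224270')]

# A single group can be read from a Match object by its number
match = re.search(r'\+41(\d{2})(\d{7})', text)
print(match.group(0))  # +41552221838 (the whole match)
print(match.group(1))  # 55
```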

### Cheat Sheet

| Expression | Description                                                  |
| ---------- | ------------------------------------------------------------ |
| ?          | matches zero or one of the preceding group                   |
| *          | matches zero or more of the preceding group                  |
| +          | matches one or more of the preceding group                   |
| {n}        | matches exactly *n* of the preceding group                   |
| {n,}       | matches *n* or more of the preceding group                   |
| {,m}       | matches 0 to *m* of the preceding group                      |
| {n,m}      | matches at least *n* and at most *m* of the preceding group  |
| ^spam      | means the string must begin with *spam*                      |
| spam$      | means the string must end with *spam*                        |
| .          | matches any character, except newline characters             |
| \d         | matches a digit                                              |
| \w         | matches a word character (letter, digit, or underscore)      |
| \s         | matches a whitespace character                               |
| \b         | matches a word boundary                                      |
| [abc]      | matches any character between the brackets (such as *a*, *b*, or *c*) |
| [a-z]      | matches any character between the brackets (such as *a*, *b*, ... *z*) |
| [^abc]     | matches any character that isn’t between the brackets.       |
| ab\|cd     | matches *ab* or *cd*                                         |
| \          | escapes special characters                                   |
| ()         | capture group                                                |
| \1         | back reference to capture group 1                            |
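
A few of the entries above can be tried out directly. The sample text below is made up for this illustration, and the expected results are shown as comments.

```python
import re

text = 'spam, eggs and 3 spams'

digits = re.findall(r'\d', text)              # ['3']
words = re.findall(r'\w+', text)              # ['spam', 'eggs', 'and', '3', 'spams']
optional_s = re.findall(r'spams?', text)      # ['spam', 'spams']
whole_word = re.findall(r'\bspam\b', text)    # ['spam'], the 'spam' inside 'spams' is skipped
four_letters = re.findall(r'\b\w{4}\b', text) # ['spam', 'eggs']
not_lower = re.findall(r'[^a-z ]', text)      # [',', '3']
either = re.findall(r'eggs|and', text)        # ['eggs', 'and']

starts_with_spam = re.search(r'^spam', text) is not None  # True
ends_with_spam = re.search(r'spam$', text) is not None    # False, the text ends with 'spams'

print(digits, words, optional_s, whole_word, four_letters, not_lower, either)
print(starts_with_spam, ends_with_spam)
```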

## Exercises

### Exercise 1: Matching Patterns
Use regexes on the following sentence

    My dog is 12 years old, has brown fur and eats 2kg of meat every day. I like my dog.

to find and print:
- the numbers (12, 2)
- the weight (2kg)
- the numbers without units (12)
- the words years and day (years, day)
- the words starting with a capital letter (My)
- the colored fur, no matter what color (brown fur)
- the sentences

```python
import re

text = "My dog is 12 years old, has brown fur and eats 2kg of meat every day. I like my dog."

# todo: find and print all the things
```

Now find the verbs (is, eats, like). Try to write a single regex such that:
- A trailing `s` doesn't matter, e.g. both `count` and `counts` are found
- Word boundaries are respected, e.g. `eat` and `eats` are found, but not `meat`

```python
# todo: find the verbs
```

### Exercise 2: Replacing Patterns
Censor every digit with `*`; nobody needs to know how old my dog is and how much he eats! The text should look like this:

    My dog is ** years old, has brown fur and eats *kg of meat every day. I like my dog.

```python
# todo: replace the numbers
```

Now make the dog younger and less greedy. Replace each number with `1`. The text should look like this:

    My dog is 1 years old, has brown fur and eats 1kg of meat every day. I like my dog.

```python
# todo: fix the numbers
```

Now, instead of censoring, append a `0` to every number! The text should look like this:

    My dog is 120 years old, has brown fur and eats 20kg of meat every day. I like my dog.

This is a bit tricky: writing the zero right after the back reference (`\10`) leads to an error, because Python then looks for a 10th group! Besides the `\1` back reference syntax, it's also possible to use the `\g<number>` syntax, which avoids this ambiguity.
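
As a hedged illustration of the difference, the sketch below appends a `2` to a number in an unrelated sample string (invented for this example). It assumes the ambiguous form fails with `re.error`, which is what `re.sub` raises for an invalid group reference.

```python
import re

# r'\12' is read as "group 12", which does not exist in this pattern
try:
    re.sub(r'(\d+)', r'\12', 'Page 4')
    failed = False
except re.error:
    failed = True
print(failed)  # True

# r'\g<1>2' is unambiguous: group 1, followed by a literal '2'
fixed = re.sub(r'(\d+)', r'\g<1>2', 'Page 4')
print(fixed)   # Page 42
```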

```python
# todo: increase the numbers
```

### Exercise 3: Using Flags

Replace all instances of `my` with `Peter's`, regardless of capitalization, because the dog is not actually mine! The text should look like this:

    Peter's dog is 12 years old, has brown fur and eats 2kg of meat every day. I like Peter's dog.

```python
# todo: give the dog to Peter
```

### Exercise 4: Analysing Code Lines

Let's evaluate some Python code line by line!

```python
#  This is my script, it does something.
def add(a, b):
    #add both arguments
    result = a + b
    return result

a = 1
b   = 2
print(add(a, b))
```

Create a dictionary with the zero-based line numbers as keys (line 0 is the empty line right after the opening `"""` of the `code` string) and the following values:
- "Comment: {comment}" if the line is a comment
- "Assignment: {variable} becomes {value}" if the line is a variable assignment
- "Function definition: {function name} with parameters [{parameter list}]" if the line is a function definition
- "Unknown" in all other cases

Your dictionary should look like this:

```python
{0: 'Unknown',
 1: 'Comment: This is my script, it does something.',
 2: 'Function definition: add with parameters [a, b]',
 3: 'Comment: add both arguments',
 4: 'Assignment: result becomes a + b',
 5: 'Unknown',
 6: 'Unknown',
 7: 'Assignment: a becomes 1',
 8: 'Assignment: b becomes 2',
 9: 'Unknown'}
```

Don't despair if you struggle with the solution; just see how far you get. Remember, regexes require practice!

```python
code = """
#  This is my script, it does something.
def add(a, b):
    #add both arguments
    result = a + b
    return result

a = 1
b   = 2
print(add(a, b))
"""

lines = {}

# todo: evaluate all the lines

from pprint import pprint

pprint(lines)
```