# Intro to Regular Expressions

Regular Expressions are a powerful tool for finding and editing patterns in text.

They can be found in some operating systems and in many programming languages.

In [None]:
# First look at the Unix Shell

! ls

In [None]:
! ls *.txt

In [None]:
! ls N*m*
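
The shell wildcards above are a simpler relative of regular expressions. As a rough sketch, Python's standard `fnmatch` module applies the same wildcard rules to a list of names. The file names below are invented for illustration and do not come from any real directory.

```python
import fnmatch

# Hypothetical directory listing, made up for this example
names = ['Notes.txt', 'Names.txt', 'Nomads.txt', 'words.txt', 'data.csv']

# '*' stands for any run of characters, as it does in the shell
txt_files = fnmatch.filter(names, '*.txt')
n_m_files = fnmatch.filter(names, 'N*m*')

print(txt_files)   # ['Notes.txt', 'Names.txt', 'Nomads.txt', 'words.txt']
print(n_m_files)   # ['Names.txt', 'Nomads.txt']
```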

## Stop and think

List all the Python programs in a directory

## Let's look at the Unix command grep (Global Regular Expression Print)

In [None]:
! grep ^c.t.$ ../words.txt

```python
^c.t.$
```
- ^ - match start of string
- $ - match end of string
- . - match a single character

```python
cat
cot
cut
```
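grep removes the newline from each line before matching, so the last period matches the carriage return that is left at the end of each line (assuming words.txt uses Windows-style `\r\n` line endings).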
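
The role of the trailing period can be checked in Python. This is a minimal sketch that assumes each line of the word list ends in `\r\n`; the four sample lines are made up and stand in for the real file. Only the newline is removed before matching, which is roughly what grep does.

```python
import re

# Hypothetical lines, as they might look in a file with \r\n line endings
lines = ['cat\r\n', 'coat\r\n', 'cot\r\n', 'dog\r\n']

pattern = r'^c.t.$'

matches = []
for line in lines:
    text = line.rstrip('\n')          # Remove only the newline, leaving '\r'
    if re.search(pattern, text):
        matches.append(text.strip())  # Strip the '\r' for display

print(matches)                          # ['cat', 'cot']

# Without the trailing period, the leftover '\r' blocks the match
print(re.search(r'^c.t$', 'cat\r'))     # None
```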

In [None]:
## Four-letter words starting with a and ending in t
    
! grep ^a..t.$ ../words.txt

## The last period matches the carriage return at the end of each line

# Regular Expressions in Python

## Problem:

Take a string of the form "Smith, C" and turn it into "C. Smith".

We will use Regular Expressions to break the string into components and reassemble it.

We will need to understand our tools to accomplish this

In [None]:
import re

text = 'I yam what I yam'

pattern = r'yam'

### We are using the 'raw' string format

```python
pattern = r'yam'
```

While it won't be needed for these examples, it is needed for some RE patterns
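
One case where the raw format matters is `\b`. In an ordinary string Python reads it as a backspace character, while the `re` module expects a backslash followed by `b` to mean a word boundary. The sample text below is made up for this sketch.

```python
import re

text = 'cat concat cat'

# Without r, Python turns '\b' into a single backspace character
plain = '\bcat\b'
# With r, the backslash survives and re sees a word boundary
raw = r'\bcat\b'

print(len('\b'), len(r'\b'))      # 1 2
print(re.findall(plain, text))    # []
print(re.findall(raw, text))      # ['cat', 'cat']
```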

### We look at search(), findall(), and split()

In [None]:
m = re.search(r'yam', text)    # Finds first

m

In [None]:
print(f"[{m.start()}:{m.end()}]")

In [None]:
text[m.start():m.end()]

In [None]:
## There is a simpler way to get the result

m.group()

In [None]:
re.findall(r'yam', text) 

In [None]:
re.split(r'yam', text)  
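
For reference, here is a short sketch that collects the three calls side by side with the results they give on this text. It also shows that `search()` returns `None` when nothing matches, so it is worth checking the result before calling `group()`.

```python
import re

text = 'I yam what I yam'

m = re.search(r'yam', text)
first = (m.start(), m.end(), m.group())
every = re.findall(r'yam', text)
pieces = re.split(r'yam', text)

print(first)    # (2, 5, 'yam')
print(every)    # ['yam', 'yam']
print(pieces)   # ['I ', ' what I ', '']

# No match: search() gives back None rather than a match object
missing = re.search(r'ham', text)
print(missing)  # None
```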

## New Example

Finding an exact match is easy.  Regular Expressions can find patterns.

We start with '.', which matches any single character

We are using findall(), but the patterns work with all methods

In [None]:
s = 'have heat, hear heart, hold hard hat' 
m = re.findall(r'h.t', s)    # '.' == match any character

m

In [None]:
## Take two

s = 'have heat, hear heart, hold hard hat' 
m = re.findall(r'h..t', s)    # '.' == match any character
m

## New example

In [None]:
import re

s = 'put the pot upon the spit' 
m = re.findall(r'p.t', s)    # '.' == match any character
m

## Anchors

We can match the start of a string with '^' and the end of a string with '$'

In [None]:
s = 'put the pot upon the spit'

re.findall(r'^p.t', s)       # Find at start

In [None]:
re.findall(r'p.t$', s)       # Find at end

## This is useful to validate a string
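
As a minimal sketch of validation, using both anchors in one pattern forces the match to cover the whole string. The function name `is_p_t_word` is hypothetical and chosen only for this example.

```python
import re

def is_p_t_word(word):
    """Return True when the whole string is 'p', any character, then 't'."""
    # Both anchors together mean nothing may come before or after the match
    return re.search(r'^p.t$', word) is not None

print(is_p_t_word('pot'))    # True
print(is_p_t_word('spot'))   # False
print(is_p_t_word('pots'))   # False
```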

## Match multiple targets

We can specify a set of characters to match 

```python
r'p[aeou]t'
```

[aeou] means a, e, o, or u

In [None]:
s = 'put the pot upon the spit'

re.findall(r'p[aeou]t', s)

In [None]:
re.findall(r'p[aeiu]t', s)

In [None]:
re.findall(r'p[a-i]t', s)     # Match any letters in range a...i

## Negation

We can negate a set with ^
```python
[^aeou]
```
### *We have also used ^ to anchor the start*

Inside square brackets, a leading ^ negates the set; outside them, it anchors the start of the string. The contexts differ, so Python will not be confused.

In [None]:
s = 'put the pot upon the spit'

re.findall(r'p[^aeou]t', s)
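
The two meanings of ^ can appear in the same pattern. The sketch below uses made-up sample strings and also shows that a negated set accepts any character that is not listed, including spaces and digits.

```python
import re

s = 'pit stop, spit pot'

# Outside brackets, ^ anchors the match to the start of the string
at_start = re.findall(r'^p[^aeou]t', s)
# Inside brackets, ^ negates the set
anywhere = re.findall(r'p[^aeou]t', s)

print(at_start)   # ['pit']
print(anywhere)   # ['pit', 'pit']

# A negated set matches any character not listed, not only letters
odd = re.findall(r'p[^aeou]t', 'p t p9t')
print(odd)        # ['p t', 'p9t']
```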

## Stop and think

Write two different expressions that will match 'put' and 'pit', but not 'pot'.

## Classes of Characters

In [None]:
import string

print(string.printable)

In [None]:
import string

printable = string.printable
re.findall(r'\d', printable)     # \d == digit

## Definitions

```python
\d Digit                    [0-9]
\D Non-digit                [^0-9]
\w Alphanumeric             [a-zA-Z0-9_]
\W Non alphanumeric         [^a-zA-Z0-9_]
\s Whitespace               [ \t\n\r\x0b\x0c]
\S Non whitespace           [^ \t\n\r\x0b\x0c]
\b Word boundary
\B Non word boundary
```
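
A quick way to get a feel for these classes is to run them over one small string. The sample string here is invented for this sketch. Note that `\w` counts the underscore as a word character.

```python
import re

sample = 'Room 101, floor_2!'

digits = re.findall(r'\d', sample)
spaces = re.findall(r'\s', sample)
non_word = re.findall(r'\W', sample)
word_starts = re.findall(r'\b\w', sample)   # First character of each word

print(digits)        # ['1', '0', '1', '2']
print(spaces)        # [' ', ' ']
print(non_word)      # [' ', ',', ' ', '!']
print(word_starts)   # ['R', '1', 'f']

# The underscore is matched by \w, the hyphen is not
print(re.findall(r'\w', 'a_1-'))   # ['a', '_', '1']
```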

## Problem: Match a Social Security Number

In [None]:
text = 'Do I have 203-45-7834 in the string?'

# Looking for 9 digits in text of the form 123-45-6789 

re.search(r'\d\d\d-\d\d-\d\d\d\d', text)
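
The pattern above will also match inside a longer run of digits. If that is not wanted, one possible refinement is to add `\b` at each end. This is a sketch under the assumption that extra digits on either side should disqualify the match; the `embedded` string is made up for the test.

```python
import re

loose = r'\d\d\d-\d\d-\d\d\d\d'
strict = r'\b\d\d\d-\d\d-\d\d\d\d\b'   # Assumption: stray digits should disqualify

embedded = 'Serial 9203-45-78341 is not an SSN'
text = 'Do I have 203-45-7834 in the string?'

print(re.search(loose, embedded).group())   # 203-45-7834
print(re.search(strict, embedded))          # None
print(re.search(strict, text).group())      # 203-45-7834
```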

## Stop and Think

License plates in Fredonia have 3 capital letters, followed by a dash and 3 digits.  Write an expression to match Fredonian plates and test your work.  Add any tests that I forgot to include.  

CAT-492

In [None]:
s = 'DOG-gone Cat-1234 DOG-4567 BIG-123'

re.findall(<pattern>, s)

## We can include a count

In [None]:
## Before

re.search(r'\d\d\d-\d\d-\d\d\d\d', text)

### Included counts

In [None]:
re.search(r'\d{3}-\d{2}-\d{4}', text)
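
Counts can also be written as a range. As a small sketch that goes slightly beyond the example above, `{m,n}` accepts between m and n repetitions and `{m,}` accepts m or more. The word boundaries keep each match to a whole run of digits.

```python
import re

s = '1 22 333 4444'

exactly_two = re.findall(r'\b\d{2}\b', s)
two_to_three = re.findall(r'\b\d{2,3}\b', s)
two_or_more = re.findall(r'\b\d{2,}\b', s)

print(exactly_two)    # ['22']
print(two_to_three)   # ['22', '333']
print(two_or_more)    # ['22', '333', '4444']
```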

## Stop and Think

Rewrite the Fredonian match using counts

In [None]:
s = 'DOG-gone Cat-1234 DOG-4567 BIG-123'

re.findall(<pattern>, s)

## Quantifiers: Non-explicit counts: \*, +, ?

In [None]:
s = 'a b1 c23'

In [None]:
# One letter, one digit

re.findall(r'[a-z]\d', s) 

In [None]:
# One letter and zero or more digits (*)

re.findall(r'[a-z]\d*', s)

In [None]:
# One letter and one or more digits (+)

re.findall(r'[a-z]\d+', s) 

In [None]:
# Optional digit: One letter and zero or one digit (?)

re.findall(r'[a-z]\d?', s)

## Matching is greedy by default

```python
re.findall(r'[a-z]\d+', 'a b1 c23')

['b1', 'c23']
```
We will need to take steps to turn off greedy matching
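
One such step is to put a `?` after the quantifier, which asks for the shortest match instead of the longest. The sketch below compares the two on the same string, and then on a made-up tag example where the difference is easier to see.

```python
import re

s = 'a b1 c23'

greedy = re.findall(r'[a-z]\d+', s)    # + takes as many digits as it can
lazy = re.findall(r'[a-z]\d+?', s)     # +? stops after the first digit

print(greedy)   # ['b1', 'c23']
print(lazy)     # ['b1', 'c2']

tags = '<b>bold</b>'
print(re.findall(r'<.+>', tags))    # ['<b>bold</b>']
print(re.findall(r'<.+?>', tags))   # ['<b>', '</b>']
```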

## Stop and Think

Write an RE to break a sentence into words.

In [None]:
s = "In Hertford, Hereford, and Hampshire, hurricanes hardly ever happen"

re.findall(<pattern>, s)

## Problem: Find all words with only vowels

Use Downey's Dictionary to find words with only vowels

```python
r'^[aeiou]+$'
```
From start of string to end, nothing but vowels

In [None]:
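def find_words():
    """Return the words in ../words.txt that contain only vowels."""
    lst = []
    pattern = r'^[aeiou]+$'
    with open('../words.txt', 'r') as f:
        for line in f:
            word = line.strip()    # Drop the line ending before matching
            if re.search(pattern, word):
                lst.append(word)
    return lst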

find_words()
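
The same loop can be tried without the word list file. This is a sketch of a more general variant: `find_matching` is a hypothetical helper that takes any iterable of lines and any pattern, and the sample lines are made up to stand in for the file.

```python
import re

def find_matching(lines, pattern):
    """Return the stripped lines that match pattern (hypothetical helper)."""
    compiled = re.compile(pattern)     # Compile once, reuse for every line
    return [word for word in (line.strip() for line in lines)
            if compiled.search(word)]

# Made-up sample standing in for the word list file
sample = ['aa\n', 'ab\n', 'eau\n', 'oui\n', 'yea\n']

vowel_words = find_matching(sample, r'^[aeiou]+$')
print(vowel_words)   # ['aa', 'eau', 'oui']
```

Because an open file is itself an iterable of lines, the helper could also be handed the file object from a `with open(...)` block.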

## Matching Special Characters

Some characters, such as +, (, or ?, have special meaning

We can still match them with the 'escape character' \

https://xkcd.com/234/

In [None]:
s = 'find this ( for me'

re.findall(r'\(', s)

In [None]:
s = 'find that * for me'

re.findall(r'\*', s)
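
When a literal string contains several special characters, escaping each one by hand gets tedious. As a sketch, the standard `re.escape()` function adds the backslashes for us. The literal below is made up for the example.

```python
import re

literal = '(a+b)?'               # Contains (, +, ) and ?, all special in a pattern
pattern = re.escape(literal)     # Backslashes are added where needed

found = re.findall(pattern, 'is (a+b)? in here?')
print(found)   # ['(a+b)?']
```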

## We can use RE to edit strings

Take a string of the form "Smith, C" and turn it into "C. Smith".

### First, we define a pattern

In [None]:
pattern = r'^[A-Z]\w+,?\s*[A-Z]$'

re.findall(pattern, 'Smith, C')

In [None]:
re.findall(pattern, 'C Smith')

In [None]:
re.findall(pattern, 'Jones K')

## Break the pattern down

```python
pattern = '^[A-Z]\w+,?\s*[A-Z]$'
           ^ Start of string
            [A-Z] One upper case letter 
            \w+ One or more alphanumeric 
            ,? Optional comma
            \s* Zero or more white space 
            [A-Z] One upper case letter
            $ End of string
```        

In [None]:
re.findall(pattern, 'Smith, C')

In [None]:
re.findall(pattern, 'Potter-Phirbright, Q')

In [None]:
re.findall(pattern, 'Potter_Phirbright, Q')
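
The hyphenated name fails because `\w` does not include the hyphen, while the underscore version succeeds because `\w` does include the underscore. One possible adjustment, assuming hyphens should be accepted in surnames, is to replace `\w` with the set `[\w-]`. This is a sketch and not the only way to handle such names.

```python
import re

original = r'^[A-Z]\w+,?\s*[A-Z]$'
with_hyphen = r'^[A-Z][\w-]+,?\s*[A-Z]$'   # Assumption: hyphens are allowed in surnames

name = 'Potter-Phirbright, Q'

print(re.findall(original, name))      # []
print(re.findall(with_hyphen, name))   # ['Potter-Phirbright, Q']
```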

## Take the pattern and break it into groups

We use parentheses to define grouping within the pattern

In [None]:
#  Before '^[A-Z]\w+,?\s*[A-Z]$'
#  Now add parens to break into two groups

pattern = r'^([A-Z]\w+)(,?\s*[A-Z])$'

match = re.search(pattern, 'Smith, C')

In [None]:
## group 0 is the whole mishpucha

print(match.group(0))

In [None]:
# First group is '[A-Z]\w+'

print(match.group(1))

In [None]:
# Second group is ',?\s*[A-Z]'

print(match.group(2))

## We need to break group 2 down even more

We want to remove the comma

In [None]:
pattern = r'^([A-Z]\w+)(,?\s*)([A-Z])$'

match = re.search(pattern, 'Smith, C')

In [None]:
print(match.group(0))

In [None]:
# '^([A-Z]\w+)'

print(match.group(1))

In [None]:
# '(,?\s*)'

print(match.group(2))

In [None]:
# '([A-Z])$'

print(match.group(3))

## Put it all back together

In [None]:
pattern = r'^([A-Z]\w+)(,?\s*)([A-Z])$'

match = re.search(pattern, 'Smith, C')

match.group(3) + '. ' + match.group(1)
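
The same reassembly can be done in one step with `re.sub()`, which is not covered above. In the replacement string, `\1` and `\3` refer back to the groups in the pattern. This is a sketch, and the helper name `initial_first` is hypothetical.

```python
import re

pattern = r'^([A-Z]\w+)(,?\s*)([A-Z])$'

def initial_first(name):
    """Turn 'Smith, C' into 'C. Smith' (hypothetical helper name)."""
    # \3 is the initial and \1 is the surname; group 2 is left out
    return re.sub(pattern, r'\3. \1', name)

print(initial_first('Smith, C'))   # C. Smith
print(initial_first('Jones K'))    # K. Jones
print(initial_first('C Smith'))    # C Smith (no match, so unchanged)
```

A string that does not match the pattern comes back unchanged, which may or may not be what is wanted; checking with `re.search()` first makes the failure explicit.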

## Stop and Think: Spoonerism

*The Lord is a shoving leopard*

A word game that never fails to amuse the very young is to 
exchange the first letters of a pair of words.

Write a fragment that turns pairs like 'toy boat' into 'boy toat'

## Read more about it:

- https://pymotw.com/3/re/
- http://evc-cit.info/comsc020/python-regex-tutorial/#