# Welcome to Regular Expressions!

What are regular expressions?<br>They're a shorthand in programming for pattern matching.
Any data that follows a pattern is recognized, regardless of what the data actually is.<br>
So what does that mean?

Please note: All facts (example strings) are taken from www.twitter.com/qikipedia

## `(###) ###-####`

## `###-##-####`

## `AA 1111`
## `AAA-1111`
## `1AA1111`

## `import re`

In Python, regular expressions and their functions are part of the `re` module in the standard library.<br>
To access them, we need to import the module into any code we're running.

In [None]:
import re

In [None]:
# Note: if you run these as-is, you will get an error message,
# because `pattern` and `string` are placeholders, not defined variables.
# Find first instance of a match
# Returns a match object

re.search(pattern, string)


# Find all instances of a match
# Returns a list of all matches

re.findall(pattern, string)

In [None]:
nz_fact = "Since an earthquake hit New Zealand in 2016, the South Island has crept 35 cm closer to the North Island."

# In this case, the pattern = r'\d+', the string = nz_fact
# The r prefix makes a raw string, so Python leaves the backslash for re to interpret
re.search(r'\d+', nz_fact)

In [None]:
# Find the first run of at least 4 lowercase letters
# (capital letters are not part of [a-z], so "Lord" would not count)

lor_fact = "The music for The Lord of the Rings: The Two Towers was ineligible for an Oscar because it reused themes from the first film. This rule was specifically changed in order to allow The Return of the King to be nominated for Best Original Music Score, which it subsequently won."
re.search('[a-z]{4,}', lor_fact)

In [None]:
# Find all the runs of at least 4 lowercase letters
# (a capitalized word such as "Rings" shows up without its first letter)

re.findall('[a-z]{4,}', lor_fact)
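
`re.search()` hands back a match object (or `None` when nothing matches) rather than the matched text itself. Here is a minimal sketch of pulling the text and its position out of that object. The `sample` string is made up for illustration and is not one of the facts above.

```python
import re

# Hypothetical sample string, for illustration only
sample = "Room 42 is on floor 7"

match = re.search(r'\d+', sample)

# A failed search returns None, so check before using the result
if match:
    print(match.group())  # the text that matched
    print(match.span())   # (start, end) positions within the string

# findall() skips match objects and returns a list of plain strings
numbers = re.findall(r'\d+', sample)
print(numbers)
```

The `r` prefix marks a raw string, which keeps Python from trying to interpret the backslash before `re` sees it.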

# Anchors

Anchors let you match the beginning or end of a string.<br>
Use `^` to match the beginning of a string.<br>
Use `$` to match the end of a string.<br>
### Because it doesn't fit anywhere else: Or.
If you'd like to use `OR` in your regex, use a pipe character, `|`.
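
As a rough sketch of how anchors and `|` behave, here they are on a made-up sentence (the `sample` string and variable names are illustrative only):

```python
import re

# Hypothetical sample string, for illustration only
sample = "Cats sleep for most of the day"

# ^ pins the pattern to the start of the string
starts_with_cats = re.search(r'^Cats', sample)

# $ pins the pattern to the end; this string ends with "day", so no match
ends_with_night = re.search(r'night$', sample)

# | matches whichever alternative is found
cats_or_dogs = re.search(r'Cats|Dogs', sample)

print(starts_with_cats is not None)
print(ends_with_night is None)
print(cats_or_dogs.group())
```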

In [None]:
dk_fact = "Denmark hated the letter Q so much they abolished it in 1872"

# Does this string start with "Den"?

In [None]:
# Does it end with Q?


In [None]:
# Is the fact about Denmark or Sweden?



# Metacharacters

### First, Character Classes
Remember when we used `[a-z]` to find words with at least 4 letters?<br>
Using brackets like that is called a *character class* and allows you to specify a group of characters to search for, such as `[0-9]` for all digits or `[A-Z]` for capital letters. If you *don't* want those characters, add `^` after the first bracket: `[^A-Z]` will match any character that is not a capital letter. Any combination of characters can be in `[]`. `[aeiou]` is acceptable and will match any lowercase vowel.
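
A small sketch of the three kinds of character class described above, run against a made-up string (not one of the facts):

```python
import re

# Hypothetical sample string, for illustration only
sample = "Regex Is Fun in 2024"

# A hand-picked set of characters
vowels = re.findall(r'[aeiou]', sample)

# A range of characters
capitals = re.findall(r'[A-Z]', sample)

# A negated class: anything that is NOT a letter or a space
non_letters = re.findall(r'[^a-zA-Z ]', sample)

print(vowels)
print(capitals)
print(non_letters)
```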


### But what if you want all letters, both upper & lower case? Or a combination of letters & numbers? Or white space?
That's where metacharacters come in. There's a variety of them available. Using one of these metacharacter codes will match any character that matches that description.

Metacharacter | Characters        | Equivalent
--------------|-------------------|------------
`\d` | Any digit | `[0-9]`
`\D` | Any non-digit | `[^0-9]`
`\s` | Any whitespace character | `[ \t\n\r\f\v]`
`\S` | Any non-whitespace character | `[^ \t\n\r\f\v]`
`\w` | Any alphanumeric character or underscore | `[a-zA-Z0-9_]`
`\W` | Any character that is not alphanumeric or an underscore | `[^a-zA-Z0-9_]`
`.` | Any character (except a newline) |
`\b` | Word boundary: the position between a word character (`\w`) and a non-word character, or the start or end of the string. It matches a position rather than a character, so it adds nothing to the matched text. |
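
To see a few of these in action, here is a short sketch on a made-up string. The string and variable names are only for illustration.

```python
import re

# Hypothetical sample string, for illustration only
sample = "Flight BA 2490 departs at 9 pm"

digits = re.findall(r'\d', sample)     # every digit, one at a time
spaces = re.findall(r'\s', sample)     # every whitespace character
words = re.findall(r'\w+', sample)     # runs of word characters
initials = re.findall(r'\b\w', sample) # a word character right after a boundary

print(digits)
print(len(spaces))
print(words)
print(initials)
```

Note that `\b` contributes no characters of its own: each item in `initials` is a single character.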


In [None]:
nut_fact = "If other squirrels are watching, squirrels just pretend to bury their nuts and then they scuttle off to a secret location where they actually hide them."

# Find the last letter in each word

# Escaping

You might have noticed we're starting to use some characters, `.$^|\`, that you might occasionally want to actually search for.

If you want to search for one of them, put a `\` in front of the character, like so: `\|`
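
A brief sketch of the difference escaping makes, using made-up strings. It also shows `re.escape()`, a standard-library helper that adds the backslashes for you.

```python
import re

# Hypothetical sample string, for illustration only
sample = "Total: $4.50 (approx.)"

# Unescaped, "." means "any character", so it also matches the "x" here
loose = re.findall(r'4.5', "4.5 4x5")

# Escaped, "\." only matches a literal dot
price = re.findall(r'\d\.\d+', sample)

# "\$" matches a literal dollar sign instead of "end of string"
dollars = re.findall(r'\$', sample)

# re.escape() backslash-escapes the special characters in a string
literal = re.escape("(approx.)")
found = re.search(literal, sample)

print(loose)
print(price)
print(dollars)
print(found.group())
```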

# Quantifiers

What if you want to see if a character is repeated? What if you think a specific character might be in the string, or it might not? That's where quantifiers come in. You can see if something is there once, multiple times, or not at all.

Quantifier | Meaning        | Equivalent
--------------|-------------------|------------
`x?` | Matches x 0-1 times | `x{0,1}`
`x*` | Matches x 0 or more times | `x{0,}`
`x+` | Matches x 1 or more times | `x{1,}`
`x{2}` | Exactly 2 of x | `xx`
`x{2,}` | 2 or more of x | `xx+`
`x{2,4}` | 2-4 of x | `xxx?x?`
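
Here is a small sketch comparing the quantifiers on one made-up string. In each pattern the quantifier applies only to the `u` directly before it.

```python
import re

# Hypothetical sample string, for illustration only
sample = "color, colour, colouur"

zero_or_one = re.findall(r'colou?r', sample)   # u is optional
zero_or_more = re.findall(r'colou*r', sample)  # any number of u, including none
one_or_more = re.findall(r'colou+r', sample)   # at least one u
exactly_two = re.findall(r'colou{2}r', sample) # exactly two u

print(zero_or_one)
print(zero_or_more)
print(one_or_more)
print(exactly_two)
```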

In [None]:
book_fact = "A 2009 study by the University of Sussex found that reading for just 6 minutes can reduce stress levels by up to 68%."

# What are all the numbers in the fact?

# What if we want to know if it's a percentage?

## Greedy vs. Lazy

### Regular expressions are greedy by default: each quantifier matches as much text as it can
### Adding the lazy modifier, `?`, after a quantifier (as in `*?` or `+?`) makes it match as little text as it can
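
A minimal sketch of the difference, using a made-up string with angle brackets as the delimiters:

```python
import re

# Hypothetical sample string, for illustration only
sample = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the LAST ">" in the string
greedy = re.findall(r'<.*>', sample)

# Lazy: .*? stops at the FIRST ">" it can
lazy = re.findall(r'<.*?>', sample)

print(greedy)
print(lazy)
```

The greedy version returns one long match covering the whole string, while the lazy version returns each tag separately.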

In [None]:
play_fact = "The word 'tragedy' has its roots in a Greek word that translates as 'goat song'. No one knows why."

# Find everything in single quotes. Be careful of whether you need to be greedy or lazy!

## Grouping

### One of the most powerful features of Regular Expressions is the ability to work with groups.
### These let you break down your regular expression matches into parts that can be used to isolate or substitute relevant data
### Put parentheses around the part of the pattern you'd like to have as a group, e.g. `(ab)c`
### Groups are numbered starting with 1, because group 0 is always the entire match (remember match objects from `re.search()`?). To reference a group in a replacement string, use the notation `\g<1>`
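
As a rough illustration of group numbering, here is a sketch that splits a made-up date into parts. The string and variable names are hypothetical.

```python
import re

# Hypothetical sample string, for illustration only
sample = "Launch date: 2019-07-16"

match = re.search(r'(\d{4})-(\d{2})-(\d{2})', sample)

whole = match.group(0)   # group 0 is the entire match
year = match.group(1)    # group 1 is the first set of parentheses
parts = match.groups()   # all numbered groups as a tuple

# With a group in the pattern, findall() returns only the group's text
years = re.findall(r'(\d{4})-\d{2}-\d{2}', sample)

print(whole)
print(year)
print(parts)
print(years)
```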

In [None]:
castle_fact = "UK castles include Almond Castle, Cadbury Castle, Cooling Castle, Deal Castle, Drum Castle, Eye Castle, Fail Castle, Fast Castle, Sandal Castle, and Stalker Castle."

# Print just the names of each castle (NB: when the pattern contains a group, re.findall() returns only the captured groups)

## One last thing: `re.sub()`

### To substitute a regular expression match for something else, use `re.sub(pattern, repl, string)` where *pattern* is the regular expression you're looking for, *repl* is what you want to replace it with, and *string* is the string you're searching through.
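
A short sketch of `re.sub()` on a made-up string: first with a fixed replacement, then reusing captured groups via `\g<n>`, and finally with the optional `count` argument to limit how many replacements are made.

```python
import re

# Hypothetical sample string, for illustration only
sample = "Call 555-1234 or 555-9876"

# Replace every match with a fixed string
masked = re.sub(r'\d{3}-\d{4}', 'XXX-XXXX', sample)

# Reuse the captured groups in the replacement, here in swapped order
swapped = re.sub(r'(\d{3})-(\d{4})', r'\g<2>-\g<1>', sample)

# count=1 replaces only the first match
first_only = re.sub(r'\d{3}-\d{4}', 'XXX-XXXX', sample, count=1)

print(masked)
print(swapped)
print(first_only)
```

`re.sub()` returns a new string and leaves `sample` unchanged.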

In [None]:
wy_fact = "In 1890, Congress was initially going to accept Wyoming as a new state only if it revoked its women’s voting rights, which they’d had since 1869. Wyoming answered with a telegram saying: ‘We will remain out of the Union one hundred years rather than come in without the women’."

# Replace all the years with the string `[____]` to create a history quiz

## Exercise

You have the dictionary of names and phone numbers below. Unfortunately, the phone numbers are in several different formats. Can you get all the phone numbers to be in the same format?

To do this, you will need `re.sub()`.

In [None]:
companions = {"Rose":"(672)623-7923","Martha":"6427820946","Donna":"962-672-9587","Amy":"242.674.0481","Rory":"381-534-9184","Clara":"(526) 918-4813","Bill":"145-654-9823","Yaz":"324-912-0193","Ryan":"317.874.0194","Graham":"(723)418-3854"}

# Resources

www.regex101.com<br>
https://docs.python.org/3.6/howto/regex.html<br>
Email: kathleenmburnett@gmail.com

# Answers

dk_fact
1. `re.search('^Den', dk_fact)`
2. `re.search('Q$', dk_fact)` (returns `None`: the string ends with `1872`, not `Q`)
3. `re.search('Denmark|Sweden', dk_fact)`

nut_fact
1. `re.findall(r'\w\b', nut_fact)` (the `r` prefix matters here: in a normal string, `\b` is a backspace character)

book_fact
1. `re.findall(r'\d+', book_fact)`
2. `re.findall(r'\d+%?', book_fact)`

play_fact
1. `re.findall("'.*?'", play_fact)`

castle_fact
1. `print(re.findall(r'(\w*) Castle', castle_fact))`

wy_fact
1. `re.sub(r'\d{4}', '[____]', wy_fact)`


In [None]:
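# exercise answer: reduce every number to its 10 digits
for first, phone in companions.items():
    # Three groups capture the area code, prefix, and line number;
    # \(? \)? and [ .-]? skip whatever punctuation happens to be present
    phone = re.sub(r'\(?(\d{3})\)?[ .-]?(\d{3})[ .-]?(\d{4})', r'\g<1>\g<2>\g<3>', phone)
    companions[first] = phone

print(companions)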