# Regex

What is it?
- Regular expressions (called regexes or regex patterns) are a tiny language for dealing with text and character patterns

Why do we care? 
- With this, we can extract text, match text, or replace text that matches a pattern

In [10]:
import pandas as pd
import numpy as np

import re

## re library function

`re.findall(pattern, string)` 

 - finds all substrings where the REGEX matches; returns a list

### Literals - start simple

In [11]:
subject = 'abc'

#### find the letter a

In [12]:
re.findall(r'a', subject)

['a']

#### find the letter d

In [13]:
regexp = r'd'

re.findall(regexp, subject)

[]

### Literals - make it more complex

In [14]:
subject = 'Mary had a little lamb. 1 little lamb. Not 10 lambs, not 12, not 22, just one'

#### find not

In [15]:
regexp = r'not'

re.findall(regexp, subject)

['not', 'not']

In [16]:
regexp = r'Not'

re.findall(regexp, subject)

['Not']

##### regex flag: re.IGNORECASE

In [17]:
regexp = r'not'

re.findall(regexp, subject, re.IGNORECASE)

['Not', 'not', 'not']
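The flag can also be passed by keyword or written inline at the start of the pattern. A minimal sketch, reusing the same subject string:

```python
import re

subject = 'Mary had a little lamb. 1 little lamb. Not 10 lambs, not 12, not 22, just one'

# passing the flag by keyword makes the call a little easier to read
by_keyword = re.findall(r'not', subject, flags=re.IGNORECASE)

# the inline form (?i) at the start of the pattern has the same effect
inline = re.findall(r'(?i)not', subject)

print(by_keyword)
print(inline)
```

Both calls should return the same three matches as the cell above.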

#### find lamb

In [18]:
regexp = r'lamb'

re.findall(regexp, subject)

['lamb', 'lamb', 'lamb']

## Metacharacters
<div class="alert alert-block alert-info">
    
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore



- `\d`: any digit
- `\D`: anything that is *not* a digit


- `\s` : any whitespace


-  `.` : any character except a newline
    
</div>

### Try them all 

In [42]:
subject = 'abcccC. 123!'

#### `\w`: any letter, digit, or underscore

In [45]:
regexp = r'\w'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '1', '2', '3']

#### what does \w\w bring back?

In [23]:
regexp = r'\w\w'

re.findall(regexp, subject)

['ab', 'cc', 'cC', '12']

#### `\W`: anything that is not a letter or number

In [24]:
regexp = r'\W'

re.findall(regexp, subject)

['.', ' ', '!']

#### `\d`: any digit

In [25]:
regexp = r'\d'

re.findall(regexp, subject)

['1', '2', '3']

#### `\D`: anything that is not a digit

In [26]:
regexp = r'\D'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '.', ' ', '!']

#### `\s` : any whitespace

In [27]:
regexp = r'\s'

re.findall(regexp, subject)

[' ']

#### `.` : anything

In [30]:
regexp = r'.'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '.', ' ', '1', '2', '3', '!']

#### find the `.` only
use an escape character, `\`, to find characters that are metacharacters

In [33]:
regexp = r'\.'

re.findall(regexp, subject)

['.']
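When the literal text comes from a variable, `re.escape` can add the backslashes for us. A small sketch, where `search_term` is a made-up example value:

```python
import re

subject = 'abcccC. 123!'

# hypothetical search term that happens to contain a metacharacter
search_term = 'C.'
escaped = re.escape(search_term)

print(escaped)                        # the dot now has a backslash in front
print(re.findall(escaped, subject))   # only the literal 'C.' matches

# without escaping, the dot matches any character after a 'C'
print(re.findall(search_term, 'C. C! Cx'))
```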

<div class="alert alert-block alert-warning">
    <b>Mini-Exercise</b>
    
- Match the string 'c 1' using only metacharacters
- The returned list should have only one element in it
- Find 3 different syntax combinations

    
</div>

In [58]:
subject = 'c 1'

In [55]:
regexp = r'\w\s\w'

re.findall(regexp, subject)

['c 1']

In [62]:
regexp = r'\D\s\d'

re.findall(regexp, subject)

['c 1']

In [64]:
regexp = r'\D\W\w'

re.findall(regexp, subject)

['c 1']

In [65]:
regexp = r'.\W.'

re.findall(regexp, subject)

['c 1']

In [66]:
regexp = r'...'

re.findall(regexp, subject)

['c 1']

In [67]:
regexp = r'.\D.'

re.findall(regexp, subject)

['c 1']

## Repeating

<div class='alert alert-info'>
    
- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions

- `*`: zero or more
- `+`: one or more
- `?`: optional / not greedy
    
</div>

In [89]:
subject = 'ccccc! 123 ccc! 99!'

#### find the whole string

In [90]:
regexp = r'.+'

re.findall(regexp, subject)

['ccccc! 123 ccc! 99!']

In [70]:
regexp = r'.*'

re.findall(regexp, subject)

['ccccc! 123 ccc! 99!', '']
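The extra `''` shows up because `*` also allows zero repetitions: after the whole string is consumed, one more empty match is found at the very end. A quick sketch using `re.finditer` to see where each match sits (behavior as seen on recent Python 3 versions):

```python
import re

subject = 'ccccc! 123 ccc! 99!'

# finditer yields match objects, so we can look at the text and the span of each match
spans = [(m.group(), m.span()) for m in re.finditer(r'.*', subject)]
print(spans)
```

The second match has the same start and end index, which is why it is an empty string.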

#### find all the c groups

In [73]:
regexp = r'c{3}'

re.findall(regexp, subject)

['ccc', 'ccc']

In [74]:
regexp = r'c{5}'

re.findall(regexp, subject)

['ccccc']

In [76]:
regexp = r'c{3,}'

re.findall(regexp, subject)

['ccccc', 'ccc']

In [77]:
regexp = r'c+'

re.findall(regexp, subject)

['ccccc', 'ccc']

#### find 123 and anything after it

In [79]:
regexp = r'\d{3}'

re.findall(regexp, subject)

['123']

In [82]:
regexp = r'123.+'

re.findall(regexp, subject)

['123 ccc! 99!']

In [84]:
regexp = r'123(.+)' ## gets everything after a certain sequence

re.findall(regexp, subject)

[' ccc! 99!']

#### find the exclamation points and everything in between them

In [85]:
regexp = r'!.+!'

re.findall(regexp, subject)

['! 123 ccc! 99!']

#### find the exclamation points and everything in between the first two

In [86]:
regexp = r'!.+?!'

re.findall(regexp, subject)

['! 123 ccc!']
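Another way to stop at the next `!` is to say "anything that is not an exclamation point" with a negated character class (brackets are covered in the Any of / None of section). A sketch comparing the two:

```python
import re

subject = 'ccccc! 123 ccc! 99!'

# lazy quantifier: take as little as possible before the next !
lazy = re.findall(r'!.+?!', subject)

# negated class: take characters that are not ! until we hit one
negated = re.findall(r'![^!]+!', subject)

print(lazy)
print(negated)
```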

<div class='alert alert-warning'>
<b>Mini Exercise</b>
    
From the string below, find the following information:
- Find all the numbers
- Find the number that has exactly 5 digits
- Find numbers that have 4 or more digits
- Find `http://` or `https://`
- Find the sentences contained in quotes

</div>

In [95]:
subject = '''
Codeup, founded in 2014, is located at 600 Navarro St. 
Suite 350, San Antonio, TX 78230. 
tagline: "launch your career in tech!" 
You can find us online at http://codeup.com 
and our alumni portal is located at https://alumni.codeup.com
and "codeup is a great school"!
'''

subject = subject.replace('\n','')
subject

'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. tagline: "launch your career in tech!" You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.comand "codeup is a great school"!'

#### Find all the numbers

In [97]:
regexp = r'\d+'

re.findall(regexp, subject)

['2014', '600', '350', '78230']

#### Find a number that has exactly 5 digits

In [98]:
regexp = r'\d{5}'

re.findall(regexp, subject)

['78230']
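One caveat: `\d{5}` also matches the first five digits of a longer number. A sketch of the difference using word boundaries (`\b`, covered in the Anchors section); the sample text here is made up for illustration:

```python
import re

# hypothetical sample text with a 7-digit number in it
text = 'zip 78230, order 1234567, code 12345'

# matches any run of five digits, even inside a longer number
loose = re.findall(r'\d{5}', text)

# \b on both sides requires the five digits to stand alone
strict = re.findall(r'\b\d{5}\b', text)

print(loose)
print(strict)
```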

#### Find numbers that have 4 or more digits

In [109]:
regexp = r'\d{4,}'

re.findall(regexp, subject)

['2014', '78230']

#### Find the sentences contained in quotes

In [102]:
regexp = r'".+?"'

re.findall(regexp, subject)

['"launch your career in tech!"', '"codeup is a great school"']

#### Find `http://` or `https://`

In [112]:
regexp = r'https?://'  # the ? makes the s optional

re.findall(regexp, subject)

['http://', 'https://']

### Any of / None of

<div class='alert alert-info'>
    
- `[]`: matches any one character listed inside the brackets
- `[^]`: matches any one character NOT listed inside the brackets
- `[-]`: matches any one character in a range, such as `[a-z]` or `[0-9]`
    
</div>

#### match using brackets

In [113]:
subject = 'abc 12345 1bc'

#### find a or b

In [114]:
regexp = r'a|b'

re.findall(regexp, subject)

['a', 'b', 'b']

In [117]:
regexp = r'[ab]'

re.findall(regexp, subject)

['a', 'b', 'b']

#### find values that are NOT a or b

In [120]:
regexp = r'[^ab]'

re.findall(regexp, subject)

['c', ' ', '1', '2', '3', '4', '5', ' ', '1', 'c']

#### find values that are between 2 and 4

In [121]:
regexp = r'[2-4]'

re.findall(regexp, subject)

['2', '3', '4']
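Ranges can be combined inside one pair of brackets, and a quantifier can follow the brackets. A short sketch on the same subject:

```python
import re

subject = 'abc 12345 1bc'

# several ranges can share one pair of brackets
mixed = re.findall(r'[a-c1-3]', subject)

# a range followed by + grabs whole runs of lowercase letters
lowercase_runs = re.findall(r'[a-z]+', subject)

print(mixed)
print(lowercase_runs)
```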

### Anchors
<div class='alert alert-info'>

- `^`: starts with
- `$`: ends with
- `\b`: word boundary

</div>

In [123]:
subject = 'kiwi aardvark banana codeup data science academy extra'

#### match all words that start with a vowel

In [131]:
regexp = r'[aeiou]\w+'

re.findall(regexp, subject)

['iwi', 'aardvark', 'anana', 'odeup', 'ata', 'ience', 'academy', 'extra']

In [132]:
#using a boundary
regexp = r'\b[aeiou]\w+'

re.findall(regexp, subject)

['aardvark', 'academy', 'extra']

In [136]:
subjects = subject.split()
subjects

['kiwi', 'aardvark', 'banana', 'codeup', 'data', 'science', 'academy', 'extra']

In [141]:
for sub in subjects:
    print(re.findall(r'^[aeiou]\w+', sub))

[]
['aardvark']
[]
[]
[]
[]
['academy']
['extra']
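Instead of looping, one option is to put each word on its own line and pass `re.MULTILINE`, which lets `^` and `$` match at the start and end of every line. A sketch under that assumption:

```python
import re

subject = 'kiwi aardvark banana codeup data science academy extra'

# one word per line, so each word gets its own start and end
lines = '\n'.join(subject.split())

starts_with_vowel = re.findall(r'^[aeiou]\w+', lines, flags=re.MULTILINE)
ends_with_vowel = re.findall(r'\w+[aeiou]$', lines, flags=re.MULTILINE)

print(starts_with_vowel)
print(ends_with_vowel)
```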


#### match all words that end with a vowel

In [143]:
#using a boundary
regexp = r'\w+[aeiou]\b'

re.findall(regexp, subject)

['kiwi', 'banana', 'data', 'science', 'extra']

In [144]:
#using an anchor: $ only matches at the end of the whole string
regexp = r'\w+[aeiou]$'

re.findall(regexp, subject)

['extra']

<div class='alert alert-warning'>

<b>Mini Exercise</b>
    
Write regular expressions to find the following values

- Find any even digits (regardless of whether they are part of a larger number)
- Find entire numbers that are even


- Find 2 or more odd digits in a row.


- Find all the capital letters
- Find all words that start with a capital letter

</div>

In [145]:
subject = '''
Codeup, founded in 2014, is located at 600 Navarro St. 
Suite 350, San Antonio, TX 78230. 
tagline: "launch your career in tech!" 
You can find us online at http://codeup.com 
and our alumni portal is located at https://alumni.codeup.com
and "codeup is a great school"!
'''

subject = subject.replace('\n','')
subject

'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. tagline: "launch your career in tech!" You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.comand "codeup is a great school"!'

#### Find any even digits (regardless of whether they are part of a larger number)

In [176]:
regexp = r'[02468]'  # 0 is even too

re.findall(regexp, subject)

['2', '0', '4', '6', '0', '0', '0', '8', '2', '0']

In [155]:
# same result without a character class: keep the digits that divide evenly by 2
regexp = r'\d'

new_list = []
for digit in re.findall(regexp, subject):
    if int(digit) % 2 == 0:
        new_list.append(digit)
print(new_list)

['2', '0', '4', '6', '0', '0', '0', '8', '2', '0']


#### Find entire numbers that are even

In [199]:
regexp = r'\b\d*[02468]\b'  # digits only, ending in an even digit

re.findall(regexp, subject)

['2014', '600', '350', '78230']

In [156]:
regexp = r'\d+'

the_answer = list(re.findall(regexp, subject) )
new_list = []
for a in the_answer:
    j = int(a)
    if j % 2 == 0:
        new_list.append(a)
print(new_list)

['2014', '600', '350', '78230']


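#### Find 2 or more odd digits in a row (regardless of whether they are part of a larger number)

In [202]:
regexp = r'[13579]{2,}'  # this one is right

re.findall(regexp, subject)

['35']

In [161]:
# this one is wrong: it splits the digits into non-overlapping pairs and only
# checks that each pair is an odd number, so '23' (even digit first) slips through
regexp = r'\d\d'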

the_answer = list(re.findall(regexp, subject) )
the_answer

new_list = []
for a in the_answer:
    j = int(a)
    if j % 2 != 0:
        new_list.append(a)
print(new_list)

['35', '23']


#### Find all the capital letters

In [190]:
regexp = r'[A-Z]'

re.findall(regexp, subject)

['C', 'N', 'S', 'S', 'S', 'A', 'T', 'X', 'Y']

In [167]:
regexp = r'\w'

answer_list = list(re.findall(regexp, subject))
new_list = []
for letter in answer_list:
    if letter.isupper():
        new_list.append(letter)
print(new_list)

['C', 'N', 'S', 'S', 'S', 'A', 'T', 'X', 'Y']


#### Find all words that start with a capital letter

In [204]:
regexp = r'[A-Z]\w\w+'  # requires 3 or more characters, so 'St' and 'TX' are skipped

re.findall(regexp, subject)

['Codeup', 'Navarro', 'Suite', 'San', 'Antonio', 'You']

In [175]:
regexp = r'\w+'

for word in re.findall(regexp, subject):
    # skip pure numbers and keep capitalized words longer than 2 characters
    if not word.isnumeric() and word[0].isupper() and len(word) > 2:
        print(word)

Codeup
Navarro
Suite
San
Antonio
You
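If short words such as abbreviations should count too, a word boundary plus `\w*` is one possible alternative. A sketch on a shortened copy of the subject:

```python
import re

# shortened copy of the subject, for illustration
subject = ('Codeup, founded in 2014, is located at 600 Navarro St. '
           'Suite 350, San Antonio, TX 78230.')

# \b makes sure the capital letter is at the start of a word;
# \w* allows zero or more characters after it
capitalized = re.findall(r'\b[A-Z]\w*', subject)

print(capitalized)
```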


### Capture Groups

<div class='alert alert-info'>
    
- `()`: grab what's contained in parentheses 
    
</div> 

In [205]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

#### find the domain only

In [214]:
regexp = r'https?://(.+)\.com'

re.findall(regexp, subject)

['codeup']

In [215]:
regexp = r'https?://.+\.com'

re.findall(regexp, subject)

['https://codeup.com']

#### find everything after the first sentence

In [220]:
regexp = r'\.\s(.+)'

re.findall(regexp, subject)

['Our ip address is 123.123.123.123 (maybe).']

#### find the ip address

In [222]:
regexp = r'Our ip address is (.+?)\s'

re.findall(regexp, subject)

['123.123.123.123']

In [224]:
#escape the dots so they only match literal periods
regexp = r'\d{3}\.\d{3}\.\d{3}\.\d{3}'

re.findall(regexp, subject)

['123.123.123.123']

In [233]:
regexp = r'((\d{3}\.){3}\d{3})'

re.findall(regexp, subject)

[('123.123.123.123', '123.')]

### Non Capture Group

<div class='alert alert-info'>

- `(?:...)`: a non-capturing (or "shy") group; it groups without adding to the results

</div>

#### find the ip address

In [236]:
regexp = r'((?:\d{3}\.){3}\d{3})'

re.findall(regexp, subject)

['123.123.123.123']

<div class='alert alert-warning'>

<b>Mini-Exercise</b>    
    
- Find the protocol, domain, and tld
- Edit your previous code with a non-capture group to remove the domain from your result
    
</div>

In [None]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

#### Find the protocol, domain, and tld

In [None]:
regexp = r'(https?)://(\w+)\.(\w+)'

re.findall(regexp, subject)

[('https', 'codeup', 'com')]

#### Edit your previous code with a non-capture group to remove the domain

In [None]:
regexp = r'(https?)://(?:\w+)\.(\w+)'

re.findall(regexp, subject)

[('https', 'com')]

## more re library functions

### `re.search`

- scans through a string for the first location where the regex matches; returns a match object, or `None` if nothing matches

In [237]:
subject = """
    Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, 
    San Antonio, TX 78230. 
    You can find us online at http://codeup.com and
    our alumni portal is located at https://alumni.codeup.com.
    """

#### find the word located

In [242]:
#save the match result
regexp = r'located'

match = re.search(regexp, subject)

In [243]:
#results type
type(match)

re.Match

In [244]:
#.span()
match.span()

(33, 40)

In [246]:
#match[0] is shorthand for match.group(0)
match[0]

'located'

In [247]:
match.group()

'located'

#### find numbers

In [250]:
regexp = r'\d+'

re.search(regexp, subject)[0]

'2014'

#### find navarro

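In [254]:
#the search is case sensitive, so lowercase 'navarro' finds nothing
regexp = r'navarro'

#results type: re.search returns None when there is no match
type(re.search(regexp, subject))

NoneType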
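Because `re.search` can return `None`, calling `.group()` directly on the result may raise an `AttributeError`. A minimal sketch of a guard; `first_match` is a hypothetical helper name, not part of the `re` library:

```python
import re

subject = 'Codeup, founded in 2014, is located at 600 Navarro St.'

def first_match(pattern, text, default=None):
    """Return the first matched text, or `default` when nothing matches."""
    match = re.search(pattern, text)
    return match.group() if match else default

print(first_match(r'Navarro', subject))
print(first_match(r'navarro', subject))
print(first_match(r'navarro', subject, default='no match'))
```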


### Name Capture Group

- `(?P<name>...)`: names a capture group

In [255]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

#### find the protocol, domain, and tld and name them

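In [261]:
regexp = r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)'

match = re.search(regexp, subject)
match

<re.Match object; span=(35, 53), match='https://codeup.com'>

In [260]:
#groups()
match.groups()

('https', 'codeup', 'com')

In [None]:
#groupdict()
match.groupdict()

{'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}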

##### regex flag:  `re.VERBOSE`

- `re.VERBOSE` ignores whitespace in the regex pattern and allows `#` comments, so a long pattern can be spread over several lines

In [262]:
regexp = r'''
            (?P<protocol>https?)://
            (?P<domain>\w+)\.
            (?P<tld>\w+)
            '''
match = re.search(regexp, subject, re.VERBOSE)
match

<re.Match object; span=(35, 53), match='https://codeup.com'>
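Verbose mode also allows a comment on each line of the pattern. A sketch of the same idea on a one-line copy of the subject:

```python
import re

subject = 'You can find us on the web at https://codeup.com.'

regexp = r'''
    (?P<protocol>https?)   # http or https
    ://                    # literal separator
    (?P<domain>\w+)        # e.g. codeup
    \.                     # a literal dot
    (?P<tld>\w+)           # e.g. com
'''

match = re.search(regexp, subject, re.VERBOSE)
print(match.groupdict())
```

One thing to keep in mind: to match a literal space or `#` in verbose mode, it has to be escaped or placed inside brackets.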

### `re.sub`

- allows us to match a regex and substitute in a new substring for the match; returns string

In [None]:
subject = 'abc 12345xyz'

#### remove all the digits

In [None]:
regexp = r'\d'

re.sub(regexp, '', subject)

'abc xyz'

#### replace all digits with an o 

In [None]:
regexp = r'\d'

re.sub(regexp, 'o', subject)

'abc oooooxyz'

#### replace all the digits with a single o

In [None]:
regexp = r'\d+'

re.sub(regexp, 'o', subject)

'abc oxyz'
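The signature is `re.sub(pattern, repl, string)`, and the replacement can do more than insert fixed text. A sketch of three variations on the same subject:

```python
import re

subject = 'abc 12345xyz'

# \1 in the replacement refers to whatever the first group captured
wrapped = re.sub(r'(\d+)', r'<\1>', subject)

# a function can compute the replacement from each match object
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), subject)

# count limits how many replacements are made
first_only = re.sub(r'\d', 'o', subject, count=1)

print(wrapped)
print(doubled)
print(first_only)
```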

#### Using regex with pandas `str.replace`: add the `regex=True` argument

In [None]:
# subjects

### `re.compile`

- compiles a regular expression ahead of time so it can be reused; returns a pattern object

In [None]:
emails = [
    "jane@company.com",
    "bob@company.com",
    "jane.janeway@company.com",
    "jane.janeway@dogood.org",
    "jane.janet.janeway@dogood.org", # bonus for the 3 part address
]

#### get name, domain, tld

In [None]:
regexp = r'([\w.]+)@(\w+)\.(\w+)'  # [\w.]+ also handles the 3 part address

for email in emails:
    print(re.findall(regexp, email))

In [None]:
pattern = re.compile(r'''
        (?P<name>[\w.]+)@
        (?P<domain>\w+)\.
        (?P<tld>\w+)
        ''', re.VERBOSE)
pattern

In [None]:
contacts = [pattern.search(email).groupdict() for email in emails]
contacts
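A compiled pattern has the same methods as the `re` module functions. A sketch contrasting `search` with `fullmatch`; the pattern is deliberately simple (real email validation is more involved) and the candidate strings are made up:

```python
import re

# deliberately simple pattern, for illustration only
pattern = re.compile(r'(?P<name>[\w.]+)@(?P<domain>\w+)\.(?P<tld>\w+)')

# hypothetical inputs
candidates = ['jane@company.com', 'not an email', 'bob@company.com!']

# search finds the pattern anywhere in the string
found_anywhere = [c for c in candidates if pattern.search(c)]

# fullmatch requires the whole string to match
valid = [c for c in candidates if pattern.fullmatch(c)]

print(found_anywhere)
print(valid)
```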

# Recap

re functions
- `re.findall()`: finds all substrings where the regex matches; returns a list
- `re.search()`: scans through a string for the first location where the regex matches; returns a match object, or `None`
    - `match.span()`
    - `match.group()`
    - `match.groups()`
    - `match.groupdict()`
- `re.sub()`: matches a regex and substitutes a new substring for each match; returns a string
- `re.compile()`: compiles a regular expression ahead of time for reuse; returns a pattern object

flags
- `re.IGNORECASE`: ignore case
- `re.VERBOSE`: ignore whitespace and allow comments in the pattern



Metacharacters
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore
- `\d`: any digit
- `\D`: anything that is *not* a digit
- `\s` : any whitespace
-  `.` : any character except a newline


Repeating

- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions

- `*`: zero or more
- `+`: one or more
- `?`: optional / not greedy


Any/None

- `[]`: will match anything inside of
- `[^]`: will match anything NOT inside of
- `[-]`: will match a range of values inside of


Capture Groups

- `()`: grab what's contained in parentheses 
- `(?:...)`: a non-capturing (or "shy") group
- `(?P<name>...)`: names a capture group


Other
- `\`: escape character
    


