# Regex

What is it?
- Regular expressions (also called regexes or regex patterns) are a small language for describing patterns in text

Why do we care? 
- With a regex, we can match, extract, or replace any text that fits a pattern

In [1]:
import pandas as pd
import numpy as np

#new import!
import re

## re library function

`re.findall(pattern, string)` 

 - finds every non-overlapping substring that the regex matches; returns a list

### Literals - start simple

In [2]:
subject = 'abc'

#### find the letter a

In [3]:
re.findall(r'a', subject)

['a']

#### find the letter d

In [4]:
regexp = r'd'

re.findall(regexp, subject)

[]

### Literals - make it more complex

In [5]:
subject = 'Mary had a little lamb. 1x little lamb. Not 10 lambs, not 12, not 22, just one'

#### find not

In [6]:
regexp = r'not'

re.findall(regexp, subject)

['not', 'not']

In [7]:
regexp = r'Not'

re.findall(regexp, subject)

['Not']

##### regex flag: re.IGNORECASE

In [8]:
regexp = r'not'

re.findall(regexp, subject, re.IGNORECASE)

['Not', 'not', 'not']
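The flag can also be passed by keyword or switched on inside the pattern itself. A minimal sketch of both spellings; the variable names `by_keyword` and `inline` are only illustrative:

```python
import re

subject = 'Mary had a little lamb. 1x little lamb. Not 10 lambs, not 12, not 22, just one'

# Passing the flag by keyword makes the call easier to read
by_keyword = re.findall(r'not', subject, flags=re.IGNORECASE)

# (?i) at the very start of the pattern turns on case-insensitive matching inline
inline = re.findall(r'(?i)not', subject)

print(by_keyword)  # ['Not', 'not', 'not']
print(inline)      # ['Not', 'not', 'not']
```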

#### find lamb

In [9]:
regexp = r'lamb'

re.findall(regexp, subject)

['lamb', 'lamb', 'lamb']

## Metacharacters
<div class="alert alert-block alert-info">
    
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore



- `\d`: any digit
- `\D`: anything that is *not* a digit


- `\s` : any whitespace


-  `.` : any character except a newline
    
</div>
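Two edge cases are easy to miss with these metacharacters: `\w` also matches the underscore, and `.` stops at a newline unless the `re.DOTALL` flag is passed. The sketch below uses a made-up `sample` string, and also shows `\S` (not listed above), which is the opposite of `\s`:

```python
import re

# A made-up two-line string for illustration
sample = 'snake_case: 42\nnext line'

# \w also matches the underscore, not just letters and digits
word_chars = re.findall(r'\w+', sample)

# \S is the opposite of \s: any non-whitespace character
chunks = re.findall(r'\S+', sample)

# . does not cross a newline by default, so each line is matched separately
per_line = re.findall(r'.+', sample)

# re.DOTALL lets . match the newline as well
everything = re.findall(r'.+', sample, re.DOTALL)

print(word_chars)  # ['snake_case', '42', 'next', 'line']
print(chunks)      # ['snake_case:', '42', 'next', 'line']
print(per_line)    # ['snake_case: 42', 'next line']
print(everything)  # ['snake_case: 42\nnext line']
```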

### Try them all 

In [10]:
subject = 'abcccC. 123!w '

#### `\w`: any letter, digit, or underscore

In [11]:
regexp = r'\w'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '1', '2', '3', 'w']

#### what does \w\w bring back?

In [12]:
regexp = r'\w\w'

re.findall(regexp, subject)

['ab', 'cc', 'cC', '12']

#### `\W`: anything that is not a letter, digit, or underscore

In [13]:
subject

'abcccC. 123!w '

In [14]:
regexp = r'\W'

re.findall(regexp, subject)

['.', ' ', '!', ' ']

#### `\d`: any digit

In [15]:
regexp = r'\d'

re.findall(regexp, subject)

['1', '2', '3']

#### `\D`: anything that is not a digit

In [16]:
regexp = r'\D'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '.', ' ', '!', 'w', ' ']

#### `\s` : any whitespace

In [17]:
subject

'abcccC. 123!w '

In [18]:
regexp = r'\s'

re.findall(regexp, subject)

[' ', ' ']

#### `.` : anything

In [19]:
regexp = r'.'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '.', ' ', '1', '2', '3', '!', 'w', ' ']

#### find the `.` only
Use the escape character, `\`, to match a metacharacter literally

In [20]:
subject

'abcccC. 123!w '

In [21]:
regexp = r'\.'

re.findall(regexp, subject)

['.']
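When the text to search for comes from a variable, escaping by hand is error-prone. One option is `re.escape`, which adds the backslashes for us. A small sketch with a made-up `text` string:

```python
import re

text = 'Cx C.'

# Unescaped, the dot is a metacharacter, so 'C.' matches C plus any character
unescaped = re.findall('C.', text)

# re.escape returns a pattern that matches the string literally
literal_pattern = re.escape('C.')
escaped = re.findall(literal_pattern, text)

print(unescaped)        # ['Cx', 'C.']
print(literal_pattern)  # C\.
print(escaped)          # ['C.']
```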

<div class="alert alert-block alert-warning">
    <b>Mini-Exercise</b>
    
- Match the string 'c 1' using only metacharacters
- The returned list should have only one element in it
- Find 3 different syntax combinations

    
</div>

In [22]:
subject = 'c 1'

In [23]:
regexp = r'...'

re.findall(regexp, subject)

['c 1']

In [24]:
regexp = r'\w\s\d'

re.findall(regexp, subject)

['c 1']

In [25]:
regexp = r'\D\s\d'

re.findall(regexp, subject)

['c 1']

In [26]:
regexp = r'\w\W\w'

re.findall(regexp, subject)

['c 1']

## Repeating

<div class='alert alert-info'>
    
- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions

- `*`: zero or more
- `+`: one or more
- `?`: optional (zero or one); placed after another quantifier, it makes that quantifier not greedy
    
</div>
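A quick sketch of the two forms from the list that are easiest to misread, `{x,y}` and `*`, using the same subject string as the cells that follow:

```python
import re

subject = 'ccccc! 123 ccc! 99!'

# {2,3}: between 2 and 3 repetitions, taking as many as possible each time,
# so the run of five c's is split into 'ccc' and 'cc'
two_to_three = re.findall(r'c{2,3}', subject)

# *: zero or more, so a bare '!' still matches when no 9 comes before it
optional_nines = re.findall(r'9*!', subject)

print(two_to_three)    # ['ccc', 'cc', 'ccc']
print(optional_nines)  # ['!', '!', '99!']
```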

In [27]:
subject = 'ccccc! 123 ccc! 99!'

#### find the whole string

In [28]:
regexp = r'.+'

re.findall(regexp, subject)

['ccccc! 123 ccc! 99!']

#### find all the c groups

In [29]:
subject

'ccccc! 123 ccc! 99!'

In [30]:
regexp = r'c'

re.findall(regexp, subject)

['c', 'c', 'c', 'c', 'c', 'c', 'c', 'c']

In [31]:
regexp = r'c{5}'

re.findall(regexp, subject)

['ccccc']

In [32]:
regexp = r'c{3}'

re.findall(regexp, subject)

['ccc', 'ccc']

In [33]:
regexp = r'c{3,}'

re.findall(regexp, subject)

['ccccc', 'ccc']

In [34]:
subject

'ccccc! 123 ccc! 99!'

In [35]:
regexp = r'c+'

re.findall(regexp, subject)

['ccccc', 'ccc']

#### find 123 and anything after it

In [36]:
subject

'ccccc! 123 ccc! 99!'

In [37]:
#123: match the literal 123
#.: match any character
#+: repeat the previous token one or more times, as many as possible
regexp = r'123.+'

re.findall(regexp, subject)

['123 ccc! 99!']

#### find the exclamation points and everything in between them

In [38]:
subject

'ccccc! 123 ccc! 99!'

In [39]:
#!: first exclamation point
#.+: everything after
#!: last exclamation point

regexp = r'!.+!'

re.findall(regexp, subject)

['! 123 ccc! 99!']

#### find the exclamation points and everything in between the first two

In [40]:
#!: first exclamation point
#.+: everything after
#?: don't be greedy and stop at the first exclamation point
#!: last exclamation point

regexp = r'!.+?!'

re.findall(regexp, subject)

['! 123 ccc!']
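The same greedy versus non-greedy difference shows up whenever a closing delimiter appears more than once. A short sketch on a made-up `html` string:

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .+ runs to the last '>' in the string
greedy = re.findall(r'<.+>', html)

# Not greedy: .+? stops at the first '>' it can reach
lazy = re.findall(r'<.+?>', html)

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```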

<div class='alert alert-warning'>
<b>Mini Exercise</b>
    
From the string below, find the following information:
- Find all the numbers
- Find the number that has exactly 5 digits
- Find numbers that have 4 or more digits
- Find the sentences contained in quotes
- Find `http://` or `https://`

</div>

In [41]:
subject = '''
Codeup, founded in 2014, is located at 600 Navarro St. 
Suite 350, San Antonio, TX 78230. 
tagline: "launch your career in tech!" 
You can find us online at http://codeup.com 
and our alumni portal is located at https://alumni.codeup.com
and "codeup is a great school"! 100,000
'''

subject = subject.replace('\n','')
subject

'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. tagline: "launch your career in tech!" You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.comand "codeup is a great school"! 100,000'

#### Find all the numbers

In [42]:
regexp = r'\d+'

re.findall(regexp, subject)

['2014', '600', '350', '78230', '100', '000']

#### Find a number that has exactly 5 digits

In [43]:
regexp = r'\d\d\d\d\d'

re.findall(regexp, subject)

['78230']

In [44]:
regexp = r'\d{5}'

re.findall(regexp, subject)

['78230']

#### Find numbers that have 4 or more digits

In [45]:
regexp =  r'\d{4,}'

re.findall(regexp, subject)

['2014', '78230']

#### Find the sentences contained in quotes

In [46]:
subject

'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. tagline: "launch your career in tech!" You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.comand "codeup is a great school"! 100,000'

In [47]:
regexp = r'".+?"'

re.findall(regexp, subject)

['"launch your career in tech!"', '"codeup is a great school"']

#### Find `http://` or `https://`

In [48]:
regexp = r'http://|https://'

re.findall(regexp, subject)

['http://', 'https://']

In [49]:
regexp = r'htt.+?//'

re.findall(regexp, subject)

['http://', 'https://']

In [50]:
regexp = r'http.*?//'

re.findall(regexp, subject)

['http://', 'https://']

In [51]:
regexp = r'https?://'

re.findall(regexp, subject)

['http://', 'https://']

In [52]:
regexp = r'https*://'

re.findall(regexp, subject)

['http://', 'https://']

### Any of / None of

<div class='alert alert-info'>
    
- `[]`: matches any one character listed inside the brackets
- `[^]`: matches any one character NOT listed inside the brackets
- `[-]`: matches any one character in the given range
    
</div>
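Brackets can hold several ranges at once, and most metacharacters (such as `.` and `?`) are treated as literals inside them. A minimal sketch on a made-up `text` string:

```python
import re

text = 'Wait. What?! Room 4B, floor 12.'

# Inside brackets, . and ? are literal characters
punctuation = re.findall(r'[.!?]', text)

# Two ranges in one class: any lowercase or uppercase ASCII letter
letters = re.findall(r'[a-zA-Z]+', text)

# Negated class: runs of characters that are neither letters nor whitespace
other = re.findall(r'[^a-zA-Z\s]+', text)

print(punctuation)  # ['.', '?', '!', '.']
print(letters)      # ['Wait', 'What', 'Room', 'B', 'floor']
print(other)        # ['.', '?!', '4', ',', '12.']
```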

#### match using brackets

In [53]:
subject = 'abc 123745 1bc'

#### find a or b

In [54]:
regexp = r'a|b'

re.findall(regexp, subject)

['a', 'b', 'b']

In [55]:
regexp = r'[ab]'

re.findall(regexp, subject)

['a', 'b', 'b']

#### find values that are NOT a or b

In [56]:
regexp = r'[^ab]'

re.findall(regexp, subject)

['c', ' ', '1', '2', '3', '7', '4', '5', ' ', '1', 'c']

#### find values that are between 2 and 4

In [57]:
subject

'abc 123745 1bc'

In [58]:
regexp = r'[2-4]'

re.findall(regexp, subject)

['2', '3', '4']

### Anchors
<div class='alert alert-info'>

- `^`: starts with
- `$`: ends with
- `\b`: word boundary

</div>
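By default `^` and `$` anchor to the start and end of the whole string. The `re.MULTILINE` flag (not covered elsewhere in this lesson) makes them anchor to each line instead. A sketch, assuming the words sit on separate lines:

```python
import re

lines = 'kiwi\naardvark\nbanana\nextra'

# Without the flag, ^ only matches at the very start of the string
whole_string = re.findall(r'^[aeiou]\w+', lines)

# With re.MULTILINE, ^ matches at the start of every line
starts_with_vowel = re.findall(r'^[aeiou]\w+', lines, re.MULTILINE)

# Likewise, $ matches at the end of every line
ends_with_vowel = re.findall(r'\w+[aeiou]$', lines, re.MULTILINE)

print(whole_string)       # []
print(starts_with_vowel)  # ['aardvark', 'extra']
print(ends_with_vowel)    # ['kiwi', 'banana', 'extra']
```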

In [59]:
subject = 'kiwi aardvark banana codeup data science academy extra'

#### match all words that start with a vowel

In [60]:
regexp = r'[aeiou]\w+'

re.findall(regexp, subject)

['iwi', 'aardvark', 'anana', 'odeup', 'ata', 'ience', 'academy', 'extra']

In [61]:
#using a boundary
regexp = r'\b[aeiou]\w+'

re.findall(regexp, subject)

['aardvark', 'academy', 'extra']

In [62]:
subject

'kiwi aardvark banana codeup data science academy extra'

In [63]:
#using a caret
regexp = r'^[aeiou]\w+'

re.findall(regexp, subject)

[]

In [64]:
#split the subject into words so ^ anchors at the start of each word
subjects = subject.split()
subjects

['kiwi', 'aardvark', 'banana', 'codeup', 'data', 'science', 'academy', 'extra']

In [65]:
#use for loop to cycle through them
for sub in subjects:
    print(re.findall(r'^[aeiou]\w+',sub))

[]
['aardvark']
[]
[]
[]
[]
['academy']
['extra']


#### match all words that end with a vowel

In [66]:
regexp = r'\w+[aeiou]\b'

re.findall(regexp, subject)

['kiwi', 'banana', 'data', 'science', 'extra']

In [67]:
subject

'kiwi aardvark banana codeup data science academy extra'

In [68]:
regexp = r'\w+[aeiou]$'

re.findall(regexp, subject)

['extra']

In [69]:
subject

'kiwi aardvark banana codeup data science academy extra'

In [70]:
regexp = r'[ka][ia]\w+'

re.findall(regexp, subject)

['kiwi', 'aardvark']

In [71]:
regexp = r'i |j '

re.findall(regexp, subject)

['i ']

<div class='alert alert-block alert-warning'>

<b>Mini Exercise</b>
    
Write regular expressions to find the following values

- Find any even digits (regardless of whether they are part of a bigger number)
- Find entire numbers that are even
- Find 2 or more odd digits in a row.
- Find all the capital letters
- Find all words that start with a capital letter
    
</div>

In [72]:
subject = '''
Codeup, founded in 2014, is located at 600 Navarro St. 
Suite 350, San Antonio, TX 78230. 
tagline: "launch your career in tech!" 
You can find us online at http://codeup.com 
and our alumni portal is located at https://alumni.codeup.com
and "codeup is a great school"!
'''

subject = subject.replace('\n','')
subject

'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. tagline: "launch your career in tech!" You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.comand "codeup is a great school"!'

#### Find any even digits (regardless of whether they are part of a bigger number)

In [73]:
regexp = r'[24680]'

re.findall(regexp, subject)

['2', '0', '4', '6', '0', '0', '0', '8', '2', '0']

#### Find entire numbers that are even

In [74]:
regexp = r'\d*[24680]\b'

re.findall(regexp, subject)

['2014', '600', '350', '78230']

#### Find 2 or more odd digits in a row (regardless of whether they are part of a bigger number)

In [75]:
regexp = r'[13579][13579]'

re.findall(regexp, subject)

['35']

In [76]:
regexp = r'[13579]{2,}'

re.findall(regexp, subject)

['35']

#### Find all the capital letters

In [77]:
regexp = r'[A-Z]'

re.findall(regexp, subject)

['C', 'N', 'S', 'S', 'S', 'A', 'T', 'X', 'Y']

#### Find all words that start with a capital letter

In [78]:
regexp = r'[A-Z]\w+'

re.findall(regexp, subject)

['Codeup', 'Navarro', 'St', 'Suite', 'San', 'Antonio', 'TX', 'You']
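Without an anchor, `[A-Z]\w+` can also start in the middle of a word, and it skips one-letter words. A hedged sketch on a made-up sentence, adding `\b` and switching `+` to `*`:

```python
import re

text = 'A Codeup grad built iPhone apps in TX'

# No boundary: matches 'Phone' inside 'iPhone' and misses the one-letter word 'A'
loose = re.findall(r'[A-Z]\w+', text)

# \b requires the capital to begin the word; \w* allows one-letter words
strict = re.findall(r'\b[A-Z]\w*', text)

print(loose)   # ['Codeup', 'Phone', 'TX']
print(strict)  # ['A', 'Codeup', 'TX']
```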

### Capture Groups

<div class='alert alert-info'>
    
- `()`: capture only the text matched inside the parentheses
    
</div> 
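What `re.findall` returns depends on how many capture groups the pattern has. A small sketch on a made-up `text` string:

```python
import re

text = 'x=1, y=22'

# No groups: each item is the whole match
no_group = re.findall(r'\w=\d+', text)

# One group: each item is just the captured part
one_group = re.findall(r'\w=(\d+)', text)

# Two or more groups: each item is a tuple with one entry per group
two_groups = re.findall(r'(\w)=(\d+)', text)

print(no_group)    # ['x=1', 'y=22']
print(one_group)   # ['1', '22']
print(two_groups)  # [('x', '1'), ('y', '22')]
```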

In [79]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.897.123.123 (maybe).
    '''
subject

'\n    You can find us on the web at https://codeup.com. Our ip address is 123.897.123.123 (maybe).\n    '

#### find the domain only

In [80]:
regexp = r'https?://(.+)\.com'

re.findall(regexp, subject)

['codeup']

#### find everything after the first sentence

In [81]:
regexp = r'\.\s(.+)'

re.findall(regexp, subject)

['Our ip address is 123.897.123.123 (maybe).']

#### find the ip address

In [82]:
regexp = r'Our ip address is (.+)\s'

re.findall(regexp, subject)

['123.897.123.123 (maybe).']

In [83]:
#don't be greedy
regexp = r'Our ip address is (.+?)\s'

re.findall(regexp, subject)

['123.897.123.123']

In [84]:
regexp = r'((\d{3}\.){3}\d{3})'

re.findall(regexp, subject)

[('123.897.123.123', '123.')]

### Non Capture Group

<div class='alert alert-info'>

- `(?:...)`: group without capturing (sometimes called a shy group)
    
</div>

#### find the ip address

In [85]:
regexp = r'((?:\d{3}\.){3}\d{3})'

re.findall(regexp, subject)

['123.897.123.123']
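The pattern above only matches groups of exactly three digits, and it accepts values such as 897 that cannot appear in a real IPv4 address. One possible refinement, sketched here on a made-up `text` string, is to allow one to three digits per part and then check the range in plain Python:

```python
import re

text = 'hosts: 10.0.0.1, 123.897.123.123 and 192.168.1.254'

# With no capturing groups, findall returns the whole match
candidates = re.findall(r'(?:\d{1,3}\.){3}\d{1,3}', text)

# The regex checks the shape only; the 0-255 range is checked separately
valid = [ip for ip in candidates
         if all(int(part) <= 255 for part in ip.split('.'))]

print(candidates)  # ['10.0.0.1', '123.897.123.123', '192.168.1.254']
print(valid)       # ['10.0.0.1', '192.168.1.254']
```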

## more re library functions

### `re.search`

- scans through a string for the first location where the regex matches; returns a match object, or `None` if nothing matches

In [86]:
subject = """
    Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, 
    San Antonio, TX 78230. 
    You can find us online at http://codeup.com and
    our alumni portal is located at https://alumni.codeup.com.
    """

#### find the word located

In [87]:
#save the match result
regexp = r'located'

match = re.search(regexp, subject)
match

<re.Match object; span=(33, 40), match='located'>

In [88]:
#results type
type(match)

re.Match

In [89]:
#.span()
match.span()

(33, 40)

In [90]:
match[0]

'located'

In [91]:
#.group()
match.group()

'located'

#### find numbers

In [92]:
regexp = r'\d+'

re.search(regexp, subject)

<re.Match object; span=(24, 28), match='2014'>

#### find navarro

In [93]:
regexp = r'navarro'

re.search(regexp, subject)

In [94]:
#results type
type(re.search(regexp, subject))

NoneType
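Because a failed search returns `None`, calling `.group()` on the result without a check raises an `AttributeError`. A minimal sketch of a guard; `first_match` is a hypothetical helper, not part of the `re` library:

```python
import re

subject = 'Codeup is located at 600 Navarro St.'

def first_match(pattern, text, default=None):
    """Return the matched text, or `default` when the pattern is not found."""
    match = re.search(pattern, text)
    # re.search returns None on failure, so check before calling .group()
    if match is None:
        return default
    return match.group()

print(first_match(r'navarro', subject))               # None
print(first_match(r'Navarro', subject))               # Navarro
print(first_match(r'navarro', subject, 'not found'))  # not found
```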

### Named Capture Group

- `(?P<name>...)`: give a capture group a name

In [95]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

#### find the protocol, domain, and tld and name them

In [96]:
regexp = r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)'

match = re.search(regexp, subject)
match

<re.Match object; span=(35, 53), match='https://codeup.com'>

In [97]:
#groups()
match.groups()

('https', 'codeup', 'com')

In [98]:
#groupdict()
match.groupdict()

{'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}
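Individual groups can also be read from the match object by name or by position. A short sketch using a shortened version of the subject string:

```python
import re

subject = 'You can find us on the web at https://codeup.com.'

match = re.search(r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)', subject)

# Index the match object with a group name...
domain = match['domain']

# ...or pass the name (or the group number) to .group()
tld = match.group('tld')
protocol = match.group(1)

print(protocol, domain, tld)  # https codeup com
```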

##### regex flag:  `re.VERBOSE`

- `re.VERBOSE` ignores whitespace and `#` comments in the regex pattern, so a long pattern can be split across lines

In [99]:
regexp = r'''
            (?P<protocol>https?)://
            (?P<domain>\w+)\.
            (?P<tld>\w+)
            '''
match = re.search(regexp, subject, re.VERBOSE)
match

<re.Match object; span=(35, 53), match='https://codeup.com'>
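Since whitespace is ignored in verbose mode, a literal space in the pattern has to be escaped or written as `\s`. The sketch below also adds `#` comments inside the pattern; the group name `ip` is just an illustrative choice:

```python
import re

subject = 'Our ip address is 123.123.123.123 (maybe).'

regexp = r'''
    ip\ address\ is\s     # literal text: each space is escaped or written as \s
    (?P<ip>               # start of the named group
        (?:\d{1,3}\.){3}  # three number-plus-dot parts
        \d{1,3}           # the final number
    )
'''

match = re.search(regexp, subject, re.VERBOSE)
ip = match['ip']

print(ip)  # 123.123.123.123
```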

### `re.sub(pattern, repl, string)`

- replaces every match of a regex with a new substring; returns a string

In [100]:
subject = 'abc 12345xyz345'

#### remove all the digits

In [101]:
regexp = r'\d'

re.sub(regexp, '', subject)

'abc xyz'

#### replace all digits with an o 

In [102]:
regexp = r'\d'

re.sub(regexp, 'o',subject)

'abc oooooxyzooo'

#### replace each group of digits with a single *

In [103]:
regexp = r'\d+'

re.sub(regexp, '*', subject)

'abc *xyz*'
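The replacement does not have to be a fixed string: it can refer back to capture groups, or be a function that receives each match. A sketch with made-up inputs:

```python
import re

# \1 and \2 in the replacement refer to the first and second capture groups
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', 'Janeway, Jane')

# \g<name> refers to a named capture group
spelled_out = re.sub(r'(?P<user>\w+)@(?P<domain>\w+)',
                     r'\g<user> at \g<domain>', 'jane@company')

# A function gets the match object and returns the replacement text
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), 'abc 12 xyz 5')

print(swapped)      # Jane Janeway
print(spelled_out)  # jane at company
print(doubled)      # abc 24 xyz 10
```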

#### Using regex with pandas `.str.replace`: add the `regex=True` argument

In [104]:
pd.Series(subjects).str.replace(r'[aeiou]','***', regex=True)

0          k***w***
1    ******rdv***rk
2      b***n***n***
3      c***d******p
4          d***t***
5     sc******nc***
6     ***c***d***my
7         ***xtr***
dtype: object

### `re.compile`

- compiles a regular expression ahead of time so it can be reused; returns a pattern object

In [105]:
emails = [
    "jane@company.com",
    "bob@company.com",
    "jane.janeway@company.com",
    "jane.janeway@dogood.org",
    "jane.janet.janeway@dogood.org", # bonus for the 3 part address
]

#### get name, domain, tld

In [106]:
regexp = r'(\w+\.*\w+)@(\w+)\.(\w+)'

for email in emails:
    print(re.findall(regexp, email))

[('jane', 'company', 'com')]
[('bob', 'company', 'com')]
[('jane.janeway', 'company', 'com')]
[('jane.janeway', 'dogood', 'org')]
[('janet.janeway', 'dogood', 'org')]


In [107]:
pattern = re.compile(r'''
        (?P<name>\w+\.*\w+)@
        (?P<domain>\w+)\.
        (?P<tld>\w+)
        ''', re.VERBOSE)
pattern

re.compile(r'\n        (?P<name>\w+\.*\w+)@\n        (?P<domain>\w+)\.\n        (?P<tld>\w+)\n        ',
           re.UNICODE|re.VERBOSE)

In [108]:
contacts = [re.search(pattern,email).groupdict() for email in emails]
contacts

[{'name': 'jane', 'domain': 'company', 'tld': 'com'},
 {'name': 'bob', 'domain': 'company', 'tld': 'com'},
 {'name': 'jane.janeway', 'domain': 'company', 'tld': 'com'},
 {'name': 'jane.janeway', 'domain': 'dogood', 'tld': 'org'},
 {'name': 'janet.janeway', 'domain': 'dogood', 'tld': 'org'}]

In [109]:
pd.DataFrame(contacts)

            name   domain  tld
0           jane  company  com
1            bob  company  com
2   jane.janeway  company  com
3   jane.janeway   dogood  org
4  janet.janeway   dogood  org
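The pattern above drops the first part of the three-part address (`jane.janet.janeway` comes back as `janet.janeway`). One possible fix, shown as a sketch rather than the only answer, is a character class that allows word characters and dots in the name. It also calls `.search` directly on the compiled pattern:

```python
import re

emails = ['jane@company.com', 'jane.janet.janeway@dogood.org']

# [\w.]+ accepts any run of word characters and dots before the @
pattern = re.compile(r'(?P<name>[\w.]+)@(?P<domain>\w+)\.(?P<tld>\w+)')

# A compiled pattern has its own .search, .findall and .sub methods
contacts = [pattern.search(email).groupdict() for email in emails]

print(contacts[0])  # {'name': 'jane', 'domain': 'company', 'tld': 'com'}
print(contacts[1])  # {'name': 'jane.janet.janeway', 'domain': 'dogood', 'tld': 'org'}
```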


# Recap

re functions
- `re.findall()`: finds every substring the regex matches; returns a list
- `re.search()`: scans a string for the first location where the regex matches; returns a match object, or `None` if nothing matches
    - `match.span()`
    - `match.group()`
    - `match.groups()`
    - `match.groupdict()`
- `re.sub()`: replaces every match of a regex with a new substring; returns a string
- `re.compile()`: compiles a regular expression ahead of time for reuse; returns a pattern object

flags
- `re.IGNORECASE`: ignore case
- `re.VERBOSE`: ignore whitespace and `#` comments in the pattern



Metacharacters
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore
- `\d`: any digit
- `\D`: anything that is *not* a digit
- `\s` : any whitespace
-  `.` : any character except a newline


Repeating

- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions

- `*`: zero or more
- `+`: one or more
- `?`: optional / not greedy


Any/None

- `[]`: matches any one character listed inside the brackets
- `[^]`: matches any one character NOT listed inside the brackets
- `[-]`: matches any one character in the given range


Capture Groups

- `()`: grab what's contained in parentheses 
- `(?:...)`: group without capturing (sometimes called a shy group)
- `(?P<name>...)`: give a capture group a name


Other
- `\`: escape character
    


