# Regex

What is it?
- Regular expressions (also called regexes or regex patterns) are a small language for describing patterns of characters in text

Why do we care? 
- With a regex, we can check whether text matches a pattern, extract the parts that match, or replace them

In [2]:
import pandas as pd
import numpy as np

#new import!
import re

## re library function

`re.findall(pattern, string)` 

 - finds all non-overlapping substrings where the regex matches; returns a list
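
As a small illustration of the return value, the sketch below (the sample string is made up for this example) suggests how `re.findall` scans left to right, returns matches that do not overlap, and gives back an empty list when nothing matches.

```python
import re

# Hypothetical sample string, not part of the lesson data
sample = 'aaaa'

# Matches are found left to right and never overlap
pairs = re.findall(r'aa', sample)
print(pairs)    # ['aa', 'aa']

# No match returns an empty list rather than None
missing = re.findall(r'z', sample)
print(missing)  # []
```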

### Literals - start simple

In [3]:
subject = 'abc'

#### find the letter a

In [4]:
re.findall(r'a', subject)

['a']

#### find the letter d

In [6]:
regexp = r'd'

re.findall(regexp, subject)

[]

### Literals - make it more complex

In [14]:
subject = 'Mary had a little lamb. 1x little lamb. Not 10 lambs, not 12, not 22, just one'

#### find not

In [9]:
regexp = r'not'

re.findall(regexp, subject)

['not', 'not']

In [10]:
regexp = r'Not'

re.findall(regexp, subject)

['Not']

##### regex flag: re.IGNORECASE

In [11]:
regexp = r'not'

re.findall(regexp, subject, re.IGNORECASE)

['Not', 'not', 'not']
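
For comparison, here is a minimal sketch of two other ways to ask for case-insensitive matching: the short alias `re.I` and the inline `(?i)` marker at the start of the pattern. The subject is a shortened version of the string above.

```python
import re

subject = 'Mary had a little lamb. Not 10 lambs, not 12, not 22, just one'

# re.I is the short alias for re.IGNORECASE
with_alias = re.findall(r'not', subject, re.I)

# (?i) at the start of the pattern switches on case-insensitive matching inline
with_inline = re.findall(r'(?i)not', subject)

print(with_alias)   # ['Not', 'not', 'not']
print(with_inline)  # ['Not', 'not', 'not']
```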

#### find lamb

In [16]:
regexp = r'lamb'

re.findall(regexp, subject)

['lamb', 'lamb', 'lamb']

## Metacharacters
<div class="alert alert-block alert-info">
    
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore



- `\d`: any digit
- `\D`: anything that is *not* a digit


- `\s` : any whitespace


-  `.` : any character except a newline
    
</div>
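
A few edge cases are easy to miss in the list above. The sketch below uses made-up sample strings to show that `\w` includes the underscore, that `\S` is the counterpart of `\s`, and that `.` skips newlines unless the `re.DOTALL` flag is passed.

```python
import re

# \w also matches the underscore; the hyphen is not a word character
word_chars = re.findall(r'\w', 'a_1-')
print(word_chars)    # ['a', '_', '1']

# \S is the opposite of \s: any character that is not whitespace
non_space = re.findall(r'\S', 'a b\tc')
print(non_space)     # ['a', 'b', 'c']

# . matches anything except a newline by default
no_newline = re.findall(r'.', 'a\nb')
print(no_newline)    # ['a', 'b']

# re.DOTALL lets . match the newline as well
with_newline = re.findall(r'.', 'a\nb', re.DOTALL)
print(with_newline)  # ['a', '\n', 'b']
```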

### Try them all 

In [31]:
subject = 'abcccC. 123!'

#### `\w`: any letter, digit, or underscore

In [18]:
regexp = r'\w'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '1', '2', '3']

#### what does \w\w bring back?

In [19]:
regexp = r'\w\w'

re.findall(regexp, subject)

['ab', 'cc', 'cC', '12']

#### `\W`: anything that is not a letter, digit, or underscore

In [20]:
subject

'abcccC. 123!'

In [21]:
regexp = r'\W'

re.findall(regexp, subject)

['.', ' ', '!']

#### `\d`: any digit

In [22]:
regexp = r'\d'

re.findall(regexp, subject)

['1', '2', '3']

#### `\D`: anything that is not a digit

In [23]:
regexp = r'\D'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '.', ' ', '!']

#### `\s` : any whitespace

In [25]:
subject

'abcccC. 123!'

In [24]:
regexp = r'\s'

re.findall(regexp, subject)

[' ']

#### `.` : any character except a newline

In [26]:
regexp = r'.'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '.', ' ', '1', '2', '3', '!']

#### find the `.` only
Use the escape character, `\`, to match a metacharacter literally.

In [27]:
subject

'abcccC. 123!'

In [35]:
regexp = r'\.'

re.findall(regexp, subject)

['.']
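
When the text to match comes from a variable, escaping by hand gets tedious. One option, sketched below with a made-up sample string, is the standard library's `re.escape`, which adds the backslashes for you.

```python
import re

text = 'pi is 3.14, not 3x14'  # hypothetical sample string

# Unescaped, the . matches any character, so '3x14' is found too
unescaped = re.findall(r'3.14', text)
print(unescaped)  # ['3.14', '3x14']

# re.escape backslash-escapes the metacharacters in a literal string
literal = re.escape('3.14')
print(literal)    # 3\.14

escaped = re.findall(literal, text)
print(escaped)    # ['3.14']
```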

<div class="alert alert-block alert-warning">
    <b>Mini-Exercise</b>
    
- Match the string 'c 1' using only metacharacters
- The returned list should have only one element in it
- Find 3 different syntax combinations

    
</div>

In [37]:
subject = 'c 1'

In [39]:
regexp = r'...'

re.findall(regexp, subject)

['c 1']

In [40]:
regexp = r'\w\s\d'

re.findall(regexp, subject)

['c 1']

In [41]:
regexp = r'\D\s\d'

re.findall(regexp, subject)

['c 1']

In [42]:
regexp = r'\w\W\w'

re.findall(regexp, subject)

['c 1']

## Repeating

<div class='alert alert-info'>
    
- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions

- `*`: zero or more
- `+`: one or more
- `?`: optional (zero or one); placed after another quantifier, it makes that quantifier not greedy
    
</div>
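
Before working through the lesson string, here is a quick sketch of three of these quantifiers on small made-up strings. The inputs are chosen only to make the behavior easy to see.

```python
import re

# {2,3}: between two and three c's; a lone 'c' is skipped
bounded = re.findall(r'c{2,3}', 'c cc ccc cccc')
print(bounded)       # ['cc', 'ccc', 'ccc']

# *: the b may appear zero times
zero_or_more = re.findall(r'ab*', 'a ab abbb')
print(zero_or_more)  # ['a', 'ab', 'abbb']

# ? after a character makes that character optional
optional = re.findall(r'colou?r', 'color colour')
print(optional)      # ['color', 'colour']
```

Note that `cccc` yields only one match: the first three characters are consumed, and the single `c` left over is too short for `{2,3}`.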

In [43]:
subject = 'ccccc! 123 ccc! 99!'

#### find the whole string

In [49]:
regexp = r'.+'

re.findall(regexp, subject)

['ccccc! 123 ccc! 99!']

#### find all the c groups

In [53]:
subject

'ccccc! 123 ccc! 99!'

In [51]:
regexp = r'c'

re.findall(regexp, subject)

['c', 'c', 'c', 'c', 'c', 'c', 'c', 'c']

In [52]:
regexp = r'c{5}'

re.findall(regexp, subject)

['ccccc']

In [54]:
regexp = r'c{3}'

re.findall(regexp, subject)

['ccc', 'ccc']

In [55]:
regexp = r'c{3,}'

re.findall(regexp, subject)

['ccccc', 'ccc']

In [57]:
subject

'ccccc! 123 ccc! 99!'

In [56]:
regexp = r'c+'

re.findall(regexp, subject)

['ccccc', 'ccc']

#### find 123 and anything after it

In [59]:
subject

'ccccc! 123 ccc! 99!'

In [68]:
#123: the literal characters 123
#.: any character
#+: one or more of the preceding element, so .+ runs to the end of the line
regexp = r'123.+'

re.findall(regexp, subject)

['123 ccc! 99!']

#### find the exclamation points and everything in between them

In [69]:
subject

'ccccc! 123 ccc! 99!'

In [71]:
#!: first exclamation point
#.+: everything after
#!: last exclamation point

regexp = r'!.+!'

re.findall(regexp, subject)

['! 123 ccc! 99!']

#### find the exclamation points and everything in between the first two

In [72]:
#!: first exclamation point
#.+: one or more of any character
#?: don't be greedy; stop as soon as the rest of the pattern can match
#!: the next exclamation point

regexp = r'!.+?!'

re.findall(regexp, subject)

['! 123 ccc!']
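
The same greedy versus non-greedy difference shows up whenever a closing delimiter appears more than once. A minimal sketch, using a made-up tag string rather than the lesson data:

```python
import re

html = '<b>bold</b>'  # hypothetical sample string

# Greedy: .+ grabs as much as it can, so the match runs to the last >
greedy = re.findall(r'<.+>', html)
print(greedy)  # ['<b>bold</b>']

# Non-greedy: .+? stops at the first > that lets the pattern succeed
lazy = re.findall(r'<.+?>', html)
print(lazy)    # ['<b>', '</b>']
```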

<div class='alert alert-warning'>
<b>Mini Exercise</b>
    
From the below string, find the following information:
- Find all the numbers
- Find the number that has exactly 5 digits
- Find numbers that have 4 or more digits
- Find the sentences contained in quotes
- Find `http://` or `https://`

</div>

In [146]:
subject = '''
Codeup, founded in 2014, is located at 600 Navarro St. 
Suite 350, San Antonio, TX 78230. 
tagline: "launch your career in tech!" 
You can find us online at http://codeup.com 
and our alumni portal is located at https://alumni.codeup.com
and "codeup is a great school"! 100,000
'''

subject = subject.replace('\n','')
subject

'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. tagline: "launch your career in tech!" You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.comand "codeup is a great school"! 100,000'

#### Find all the numbers

In [147]:
regexp = r'\d+'

re.findall(regexp, subject)

['2014', '600', '350', '78230', '100', '000']
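
The `\d+` pattern splits `100,000` into two matches. One possible way to keep comma-grouped numbers together is sketched below; it relies on a non-capturing group, `(?:...)`, which is covered in a later section, and the sample text is made up for the example.

```python
import re

# Hypothetical sample text with a comma-grouped number
text = 'founded in 2014, Suite 350, population 100,000'

# \d+ followed by zero or more groups of ",ddd"
numbers = re.findall(r'\d+(?:,\d{3})*', text)
print(numbers)  # ['2014', '350', '100,000']
```

This is only a sketch: a list such as `2014,350` written without a space would be read as one number.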

#### Find a number that has exactly 5 digits

In [148]:
regexp = r'\d\d\d\d\d'

re.findall(regexp, subject)

['78230']

In [149]:
regexp = r'\d{5}'

re.findall(regexp, subject)

['78230']

#### Find numbers that have 4 or more digits

In [150]:
regexp =  r'\d{4,}'

re.findall(regexp, subject)

['2014', '78230']

#### Find the sentences contained in quotes

In [151]:
subject

'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. tagline: "launch your career in tech!" You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.comand "codeup is a great school"! 100,000'

In [152]:
regexp = r'".+?"'

re.findall(regexp, subject)

['"launch your career in tech!"', '"codeup is a great school"']

#### Find `http://` or `https://`

In [153]:
regexp = r'http://|https://'

re.findall(regexp, subject)

['http://', 'https://']

In [154]:
regexp = r'htt.+?//'

re.findall(regexp, subject)

['http://', 'https://']

In [155]:
regexp = r'http.*?//'

re.findall(regexp, subject)

['http://', 'https://']

In [156]:
regexp = r'https?://'

re.findall(regexp, subject)

['http://', 'https://']

In [157]:
regexp = r'https*://'

re.findall(regexp, subject)

['http://', 'https://']
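
Building on `https?://`, a rough way to pull out the whole URL is to keep matching until the next whitespace. The sketch below uses a shortened sample sentence; `\S+` is an assumption about what counts as "the rest of the URL" and would also swallow trailing punctuation.

```python
import re

text = 'find us at http://codeup.com and https://alumni.codeup.com'

# https?:// followed by one or more non-whitespace characters
urls = re.findall(r'https?://\S+', text)
print(urls)  # ['http://codeup.com', 'https://alumni.codeup.com']
```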

### Any of / None of

<div class='alert alert-info'>
    
- `[]`: matches any one character listed inside the brackets
- `[^]`: matches any one character NOT listed inside the brackets
- `[-]`: matches any one character in the given range, such as `[a-z]` or `[0-9]`
    
</div>
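
A short sketch of a few bracket behaviors that the list above does not spell out: metacharacters such as `.` are literal inside brackets, several ranges can share one set of brackets, and a set can be repeated like any other element. The first sample string is made up.

```python
import re

# Inside brackets, . is a literal period; no escape is needed
punctuation = re.findall(r'[.!?]', 'Hi. Wow! Really?')
print(punctuation)  # ['.', '!', '?']

# Several ranges can share one set of brackets
mixed = re.findall(r'[a-c1-3]', 'abc 123745 1bc')
print(mixed)        # ['a', 'b', 'c', '1', '2', '3', '1', 'b', 'c']

# A set can be repeated like any other element
letter_runs = re.findall(r'[a-z]+', 'abc 123745 1bc')
print(letter_runs)  # ['abc', 'bc']
```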

#### match using brackets

In [166]:
subject = 'abc 123745 1bc'

#### find a or b

In [167]:
regexp = r'a|b'

re.findall(regexp, subject)

['a', 'b', 'b']

In [168]:
regexp = r'[ab]'

re.findall(regexp, subject)

['a', 'b', 'b']

#### find values that are NOT a or b

In [169]:
regexp = r'[^ab]'

re.findall(regexp, subject)

['c', ' ', '1', '2', '3', '7', '4', '5', ' ', '1', 'c']

#### find values that are between 2 and 4

In [170]:
subject

'abc 123745 1bc'

In [171]:
regexp = r'[2-4]'

re.findall(regexp, subject)

['2', '3', '4']

### Anchors
<div class='alert alert-info'>

- `^`: starts with
- `$`: ends with
- `\b`: word boundary

</div>

In [172]:
subject = 'kiwi aardvark banana codeup data science academy extra'

#### match all words that start with a vowel

In [176]:
regexp = r'[aeiou]\w+'

re.findall(regexp, subject)

['iwi', 'aardvark', 'anana', 'odeup', 'ata', 'ience', 'academy', 'extra']

In [183]:
#using a boundary
regexp = r'\b[aeiou]\w+'

re.findall(regexp, subject)

['aardvark', 'academy', 'extra']

In [186]:
subject

'kiwi aardvark banana codeup data science academy extra'

In [185]:
#using a caret
regexp = r'^[aeiou]\w+'

re.findall(regexp, subject)

[]

In [190]:
#split subjects to use anchor
subjects = subject.split()
subjects

['kiwi', 'aardvark', 'banana', 'codeup', 'data', 'science', 'academy', 'extra']

In [192]:
#use for loop to cycle through them
for sub in subjects:
    print(re.findall(r'^[aeiou]\w+',sub))

[]
['aardvark']
[]
[]
[]
[]
['academy']
['extra']
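
An alternative to splitting and looping, sketched here under the assumption that each word sits on its own line, is the `re.MULTILINE` flag. With it, `^` matches at the start of every line instead of only at the start of the string.

```python
import re

# One word per line (a few words from the list above)
lines = 'kiwi\naardvark\nbanana\nextra'

# Without the flag, ^ only matches at the very start of the string
start_only = re.findall(r'^[aeiou]\w+', lines)
print(start_only)  # []

# With re.MULTILINE, ^ also matches right after each newline
each_line = re.findall(r'^[aeiou]\w+', lines, re.MULTILINE)
print(each_line)   # ['aardvark', 'extra']
```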


#### match all words that end with a vowel

In [195]:
regexp = r'\w+[aeiou]\b'

re.findall(regexp, subject)

['kiwi', 'banana', 'data', 'science', 'extra']

In [197]:
subject

'kiwi aardvark banana codeup data science academy extra'

In [196]:
regexp = r'\w+[aeiou]$'

re.findall(regexp, subject)

['extra']

In [200]:
subject

'kiwi aardvark banana codeup data science academy extra'

In [201]:
regexp = r'[ka][ia]\w+'

re.findall(regexp, subject)

['kiwi', 'aardvark']

In [208]:
regexp = r'i |j '

re.findall(regexp, subject)

['i ']

<div class='alert alert-warning'>

<b>Mini Exercise</b>
    
Write regular expressions to find the following values

- Find any even digits (regardless of whether they are part of a bigger number)
- Find entire numbers that are even
   
    
- Find 2 or more odd digits in a row.

    
- Find all the capital letters
- Find all words that start with a capital letter
    
</div>

In [243]:
subject = '''
Codeup, founded in 2014, is located at 600 Navarro St. 
Suite 350, San Antonio, TX 78230. 
tagline: "launch your career in tech!" 
You can find us online at http://codeup.com 
and our alumni portal is located at https://alumni.codeup.com
and "codeup is a great school"!
'''

subject = subject.replace('\n','')
subject

'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. tagline: "launch your career in tech!" You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.comand "codeup is a great school"!'

#### Find any even digits (regardless of whether they are part of a bigger number)

In [222]:
regexp = r'[24680]'

re.findall(regexp, subject)

['2', '0', '4', '6', '0', '0', '0', '8', '2', '0']

#### Find entire numbers that are even

In [226]:
regexp = r'\d*[24680]\b'

re.findall(regexp, subject)

['2014', '600', '350', '78230']

#### Find 2 or more odd digits in a row (regardless of whether they are part of a bigger number)

In [229]:
regexp = r'[13579][13579]'

re.findall(regexp, subject)

['35']

In [232]:
regexp = r'[13579]{2,}'

re.findall(regexp, subject)

['35']

#### Find all the capital letters

In [238]:
regexp = r'[A-Z]'

re.findall(regexp, subject)

['C', 'N', 'S', 'S', 'S', 'A', 'T', 'X', 'Y']

#### Find all words that start with a capital letter

In [249]:
regexp = r'[A-Z]\w+'

re.findall(regexp, subject)

['Codeup', 'Navarro', 'St', 'Suite', 'San', 'Antonio', 'TX', 'You']
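
The pattern `[A-Z]\w+` works on this subject, but it has two gaps worth knowing about: it can start in the middle of a word, and it skips one-letter words. A possible tightening, shown on a made-up sentence, adds a word boundary and switches `+` to `*`.

```python
import re

text = 'I saw an iPhone at Codeup'  # hypothetical sample string

# Starts mid-word at the P in iPhone, and misses the one-letter word I
loose = re.findall(r'[A-Z]\w+', text)
print(loose)  # ['Phone', 'Codeup']

# \b forces the capital to be the first character of a word;
# \w* allows zero characters after it
strict = re.findall(r'\b[A-Z]\w*', text)
print(strict)  # ['I', 'Codeup']
```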

### Capture Groups

<div class='alert alert-info'>
    
- `()`: grab what's contained in parentheses 
    
</div> 

In [281]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.897.123.123 (maybe).
    '''
subject

'\n    You can find us on the web at https://codeup.com. Our ip address is 123.897.123.123 (maybe).\n    '

#### find the domain only

In [282]:
regexp = r'https?://(.+)\.com'

re.findall(regexp, subject)

['codeup']

#### find everything after the first sentence

In [283]:
regexp = r'\.\s(.+)'

re.findall(regexp, subject)

['Our ip address is 123.897.123.123 (maybe).']

#### find the ip address

In [284]:
regexp = r'Our ip address is (.+)\s'

re.findall(regexp, subject)

['123.897.123.123 (maybe).']

In [285]:
#dont be greedy
regexp = r'Our ip address is (.+?)\s'

re.findall(regexp, subject)

['123.897.123.123']

In [286]:
regexp = r'((\d{3}\.){3}\d{3})'

re.findall(regexp, subject)

[('123.897.123.123', '123.')]

### Non Capture Group

<div class='alert alert-info'>

- `(?:...)`: groups part of a pattern without capturing it; also called a non-capturing or shy group

</div>

#### find the ip address

In [287]:
regexp = r'((?:\d{3}\.){3}\d{3})'

re.findall(regexp, subject)

['123.897.123.123']
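
The pattern above assumes every part of the address has exactly three digits. A slightly more flexible sketch, using made-up addresses, allows one to three digits per part. It still does not check that each part is at most 255, so a value like 897 would pass.

```python
import re

text = 'hosts 10.0.0.1 and 192.168.1.254'  # hypothetical sample string

# Three repetitions of "1-3 digits then a dot", followed by 1-3 digits
addresses = re.findall(r'(?:\d{1,3}\.){3}\d{1,3}', text)
print(addresses)  # ['10.0.0.1', '192.168.1.254']
```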

<div class='alert alert-warning'>

<b>Mini-Exercise</b>    
    
- Find the protocol, domain, and tld
- Edit your previous code with a non-capture group to remove the domain from your result
    
</div>

In [None]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

#### Find the protocol, domain, and tld

In [290]:
regexp = r'(https)://(\w+)\.(\w+)'

re.findall(regexp, subject)

[('https', 'codeup', 'com')]

#### Edit your previous code with a non-capture group to remove the domain

In [291]:
regexp = r'(https)://(?:\w+)\.(\w+)'

re.findall(regexp, subject)

[('https', 'com')]

## more re library functions

### `re.search`

- scans through a string for the first location where the regex matches; returns a match object, or `None` if there is no match

In [292]:
subject = """
    Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, 
    San Antonio, TX 78230. 
    You can find us online at http://codeup.com and
    our alumni portal is located at https://alumni.codeup.com.
    """

#### find the word located

In [296]:
#save the match result
regexp = r'located'

match = re.search(regexp, subject)
match

<re.Match object; span=(33, 40), match='located'>

In [297]:
#results type
type(match)

re.Match

In [298]:
#.span()
match.span()

(33, 40)

In [301]:
match[0]

'located'

In [302]:
#.group()
match.group()

'located'

#### find numbers

In [307]:
regexp = r'\d+'

re.search(regexp, subject)

<re.Match object; span=(24, 28), match='2014'>

#### find navarro

In [308]:
regexp = r'navarro'

re.search(regexp, subject)

In [309]:
#results type
type(re.search(regexp, subject))

NoneType
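
Because `re.search` returns `None` when nothing matches, calling `.group()` on the result directly would raise an `AttributeError`. A minimal sketch of one way to guard against that, using a shortened subject and the placeholder text `'no match'` chosen for this example:

```python
import re

subject = 'Codeup is located at 600 Navarro St.'

# Lowercase 'navarro' is not in the string, so search returns None
match = re.search(r'navarro', subject)
missing = match.group() if match else 'no match'
print(missing)  # no match

# With re.IGNORECASE the capitalized street name is found
match = re.search(r'navarro', subject, re.IGNORECASE)
found = match.group() if match else 'no match'
print(found)    # Navarro
```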

### Name Capture Group

- `(?P<name>...)`: gives a capture group a name

In [322]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

#### find the protocol, domain, and tld and name them

In [324]:
regexp = r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)'

match = re.search(regexp, subject)
match

<re.Match object; span=(35, 53), match='https://codeup.com'>

In [325]:
#groups()
match.groups()

('https', 'codeup', 'com')

In [326]:
#groupdict()
match.groupdict()

{'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}

##### regex flag:  `re.VERBOSE`

- `re.VERBOSE` ignores whitespace in the regex pattern (outside of brackets) and allows `#` comments, so a long pattern can be split across lines

In [329]:
regexp = r'''
            (?P<protocol>https?)://
            (?P<domain>\w+)\.
            (?P<tld>\w+)
            '''
match = re.search(regexp, subject, re.VERBOSE)
match

<re.Match object; span=(35, 53), match='https://codeup.com'>
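
Verbose mode also permits `#` comments inside the pattern. The sketch below annotates the same three groups and shows one way to match a literal space, which would otherwise be ignored; the subject is a shortened version of the one above.

```python
import re

subject = 'You can find us on the web at https://codeup.com.'

regexp = r'''
    (?P<protocol>https?)://   # http or https
    (?P<domain>\w+)\.         # site name, then a literal dot
    (?P<tld>\w+)              # top-level domain
    '''

parts = re.search(regexp, subject, re.VERBOSE).groupdict()
print(parts)   # {'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}

# Plain spaces are ignored in verbose mode; [ ] matches a real space
spaced = re.findall(r'find [ ] us', subject, re.VERBOSE)
print(spaced)  # ['find us']
```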

### `re.sub(pattern, repl, string)`

- substitutes a new substring for every match of a regex; returns a string

In [343]:
subject = 'abc 12345xyz345'

#### remove all the digits

In [337]:
regexp = r'\d'

re.sub(regexp, '', subject)

'abc xyz'

#### replace all digits with an o 

In [338]:
regexp = r'\d'

re.sub(regexp, 'o',subject)

'abc oooooxyzooo'

#### replace all the digit (groups) with a single *

In [341]:
regexp = r'\d+'

re.sub(regexp, '*', subject)

'abc *xyz*'
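
Two further features of `re.sub` are worth a quick sketch: the replacement can refer back to capture groups with `\1`, `\2`, and so on, and the `count` argument limits how many matches are replaced. The date string is made up for the example.

```python
import re

# \1, \2, \3 in the replacement refer to the capture groups in the pattern
date = re.sub(r'(\d{4})-(\d{2})-(\d{2})', r'\2/\3/\1', 'date 2014-02-04')
print(date)       # date 02/04/2014

# count limits how many matches are replaced
first_two = re.sub(r'\d', 'o', 'abc 12345xyz345', count=2)
print(first_two)  # abc oo345xyz345
```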

#### Using regex with `str.replace`: add the `regex=True` argument

In [348]:
pd.Series(subjects).str.replace(r'[aeiou]','***', regex=True)

0          k***w***
1    ******rdv***rk
2      b***n***n***
3      c***d******p
4          d***t***
5     sc******nc***
6     ***c***d***my
7         ***xtr***
dtype: object

### `re.compile`

- prepares a regular expression ahead of time so it can be reused; returns a compiled pattern object

In [349]:
emails = [
    "jane@company.com",
    "bob@company.com",
    "jane.janeway@company.com",
    "jane.janeway@dogood.org",
    "jane.janet.janeway@dogood.org", # bonus for the 3 part address
]

#### get name, domain, tld

In [351]:
regexp = r'(\w+\.*\w+)@(\w+)\.(\w+)'

for email in emails:
    print(re.findall(regexp, email))

[('jane', 'company', 'com')]
[('bob', 'company', 'com')]
[('jane.janeway', 'company', 'com')]
[('jane.janeway', 'dogood', 'org')]
[('janet.janeway', 'dogood', 'org')]


In [352]:
pattern = re.compile(r'''
        (?P<name>\w+\.*\w+)@
        (?P<domain>\w+)\.
        (?P<tld>\w+)
        ''', re.VERBOSE)
pattern

re.compile(r'\n        (?P<name>\w+\.*\w+)@\n        (?P<domain>\w+)\.\n        (?P<tld>\w+)\n        ',
           re.UNICODE|re.VERBOSE)

In [354]:
contacts = [re.search(pattern,email).groupdict() for email in emails]
contacts

[{'name': 'jane', 'domain': 'company', 'tld': 'com'},
 {'name': 'bob', 'domain': 'company', 'tld': 'com'},
 {'name': 'jane.janeway', 'domain': 'company', 'tld': 'com'},
 {'name': 'jane.janeway', 'domain': 'dogood', 'tld': 'org'},
 {'name': 'janet.janeway', 'domain': 'dogood', 'tld': 'org'}]

In [355]:
pd.DataFrame(contacts)

            name   domain  tld
0           jane  company  com
1            bob  company  com
2   jane.janeway  company  com
3   jane.janeway   dogood  org
4  janet.janeway   dogood  org
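
The name pattern `\w+\.*\w+` drops the first part of the three-part address, so the last row reads `janet.janeway`. One possible fix, sketched below on a subset of the emails, is to accept any mix of word characters and dots before the `@`. The sketch also calls `search` directly on the compiled pattern, which is equivalent to passing it to `re.search`.

```python
import re

emails = [
    'jane@company.com',
    'jane.janeway@dogood.org',
    'jane.janet.janeway@dogood.org',
]

# [\w.]+ accepts any mix of word characters and dots before the @
pattern = re.compile(r'(?P<name>[\w.]+)@(?P<domain>\w+)\.(?P<tld>\w+)')

# A compiled pattern has its own search method
contacts = [pattern.search(email).groupdict() for email in emails]
names = [contact['name'] for contact in contacts]
print(names)  # ['jane', 'jane.janeway', 'jane.janet.janeway']
```

This version is deliberately permissive: it would also accept a name that starts or ends with a dot, so treat it as a starting point rather than a full email validator.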


# Recap

re functions
- `re.findall()`: finds all non-overlapping substrings where the regex matches; returns a list
- `re.search()`: scans through a string for the first location where the regex matches; returns a match object, or `None` if there is no match
    - `match.span()`
    - `match.group()`
    - `match.groups()`
    - `match.groupdict()`
- `re.sub()`: substitutes a new substring for every match of a regex; returns a string
- `re.compile()`: prepares a regular expression ahead of time so it can be reused; returns a compiled pattern object

flags
- `re.IGNORECASE`: ignore case
- `re.VERBOSE`: ignore whitespace in the pattern and allow `#` comments



Metacharacters
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore
- `\d`: any digit
- `\D`: anything that is *not* a digit
- `\s` : any whitespace
-  `.` : any character except a newline


Repeating

- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions

- `*`: zero or more
- `+`: one or more
- `?`: optional (zero or one); placed after another quantifier, it makes that quantifier not greedy


Any/None

- `[]`: matches any one character listed inside the brackets
- `[^]`: matches any one character NOT listed inside the brackets
- `[-]`: matches any one character in the given range, such as `[a-z]` or `[0-9]`


Capture Groups

- `()`: grab what's contained in parentheses 
- `(?:...)`: groups part of a pattern without capturing it; also called a non-capturing or shy group
- `(?P<name>...)`: gives a capture group a name


Other
- `\`: escape character
    


