# Regex

What is it?
- Regular expressions (called regexes or regex patterns) are a tiny language for dealing with text and character patterns

Why do we care? 
- With regular expressions, we can match, extract, or replace text that fits a pattern

In [1]:
import pandas as pd
import numpy as np

#new import: regex

import re

## re library function

`re.findall(pattern, string)` 

 - finds all substrings where the REGEX matches; returns a list

### Literals - start simple

In [2]:
subject = 'abc'

#### find the letter a

In [3]:
re.findall(r'a', subject)

['a']

#### find the letter d

In [4]:
regexp = r'd' # returns an empty list because 'd' does not appear in 'abc'

re.findall(regexp, subject)

[]

### Literals - make it more complex

In [5]:
subject = 'Mary had a little lamb. 1 little lamb. Not 10 lambs, not 12, not 22, just one'

#### find not

In [6]:
regexp = r'not'

re.findall(regexp, subject)

['not', 'not']

In [7]:
regexp = r'Not' #case sensitive

re.findall(regexp, subject)

['Not']

##### regex flag: re.IGNORECASE

In [8]:
regexp = r'not'

re.findall(regexp, subject, re.IGNORECASE) #re.IGNORECASE will ignore case

['Not', 'not', 'not']

#### find lamb

In [9]:
regexp = r'lamb'

re.findall(regexp, subject) # the third match is the 'lamb' inside 'lambs': a literal matches anywhere, even inside a longer word

['lamb', 'lamb', 'lamb']
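If only the standalone word should match, one option is to wrap the literal in word boundaries (`\b`, covered under Anchors later in these notes). A minimal sketch using the same sentence; the variable name `whole_words` is only illustrative:

```python
import re

subject = 'Mary had a little lamb. 1 little lamb. Not 10 lambs, not 12, not 22, just one'

# \b marks a word boundary, so 'lamb' must stand alone as a word;
# the 'lamb' inside 'lambs' is followed by 's', so it is skipped
whole_words = re.findall(r'\blamb\b', subject)
print(whole_words)  # ['lamb', 'lamb']
```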

## Metacharacters
<div class="alert alert-block alert-info">
    
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore



- `\d`: any digit
- `\D`: anything that is *not* a digit


- `\s` : any whitespace


-  `.` : any character except a newline (includes spaces)
    
</div>
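Two details of the list above are easy to miss: `\w` also matches the underscore, and `.` skips newlines unless the `re.DOTALL` flag is passed. A small sketch on a made-up string (the names `text`, `word_chars`, `dots`, and `dots_all` are illustrative):

```python
import re

text = 'a_b\nc'

word_chars = re.findall(r'\w', text)           # the underscore counts as a word character
dots = re.findall(r'.', text)                  # by default '.' does not match '\n'
dots_all = re.findall(r'.', text, re.DOTALL)   # with re.DOTALL it does

print(word_chars)  # ['a', '_', 'b', 'c']
print(dots)        # ['a', '_', 'b', 'c']
print(dots_all)    # ['a', '_', 'b', '\n', 'c']
```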

### Try them all 

In [11]:
subject = 'abcccC. 123!'

#### `\w`: any letter/number

In [12]:
regexp = r'\w'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '1', '2', '3']

#### what does \w\w bring back?

In [13]:
regexp = r'\w\w'

re.findall(regexp, subject)

['ab', 'cc', 'cC', '12']

#### `\W`: anything that is not a letter or number

In [14]:
regexp = r'\W'

re.findall(regexp, subject)

['.', ' ', '!']

#### `\d`: any digit

In [15]:
regexp = r'\d'

re.findall(regexp, subject)

['1', '2', '3']

#### `\D`: anything that is not a digit

In [16]:
regexp = r'\D'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '.', ' ', '!']

#### `\s` : any whitespace

In [17]:
regexp = r'\s'

re.findall(regexp, subject)

[' ']

#### `.` : anything

In [18]:
regexp = r'.'

re.findall(regexp, subject)

['a', 'b', 'c', 'c', 'c', 'C', '.', ' ', '1', '2', '3', '!']

#### find the `.` only
Use the escape character, `\`, to match a metacharacter literally

In [19]:
subject

'abcccC. 123!'

In [20]:
regexp = r'\.'

re.findall(regexp, subject) # '\.' matches only a literal '.'; ordinary characters such as letters need no backslash

['.']
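When the text to match comes from a variable, escaping by hand gets tedious. The standard library's `re.escape` adds the backslashes for us. A short sketch with made-up sample values:

```python
import re

literal = '1.5'
text = '1.5 125 1x5'

unescaped = re.findall(literal, text)            # '.' acts as a wildcard here
escaped = re.findall(re.escape(literal), text)   # '.' is matched literally

print(unescaped)  # ['1.5', '125', '1x5']
print(escaped)    # ['1.5']
```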

<div class="alert alert-block alert-warning">
    <b>Mini-Exercise</b>
    
- Match the string 'c 1' using only metacharacters
- The returned list should have only one element in it
- Find 3 different syntax combinations

    
</div>


In [21]:
subject = 'c 1'

In [22]:
regexp = r'\w\d'

re.findall(regexp, subject)

[]

In [23]:
regexp = r'\d\w'
re.findall(regexp, subject)

[]

In [24]:
regexp = r'\W\d'
re.findall(regexp, subject)

[' 1']

In [25]:
regexp = r'\w\D'
re.findall(regexp, subject)

['c ']

In [26]:
regexp = r'\W\D' # \W matches the space, but \D cannot match the digit '1', so nothing is found
re.findall(regexp, subject)

[]

In [27]:
regexp = r'\D\s\d'

re.findall(regexp, subject)

['c 1']

In [28]:
regexp = r'\w\W\w'

re.findall(regexp, subject)

['c 1']

## Repeating

<div class='alert alert-info'>  
    
-------------------------------------------------------------------------

- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions

- `*`: zero or more
- `+`: one or more
- `?`: optional (zero or one); after another quantifier (`+?`, `*?`) it makes that quantifier lazy (non-greedy)  
-------------------------------------------------------------------------
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore
- `\d`: any digit
- `\D`: anything that is *not* a digit
- `\s` : any whitespace
-  `.` : any character except a newline (includes spaces)
</div>
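A quick sketch of three quantifiers from the list above that the cells in this section do not exercise: `{x,y}`, `*`, and `?` as "optional". The `'color colour'` string is a made-up sample:

```python
import re

subject = 'ccccc! 123 ccc! 99!'

# {2,4}: between 2 and 4 repetitions, taking as many as possible
between = re.findall(r'c{2,4}', subject)
print(between)  # ['cccc', 'ccc']

# *: zero or more 9s before an exclamation point
zero_or_more = re.findall(r'9*!', subject)
print(zero_or_more)  # ['!', '!', '99!']

# ?: the 'u' may appear zero times or once
optional = re.findall(r'colou?r', 'color colour')
print(optional)  # ['color', 'colour']
```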

In [29]:
subject = 'ccccc! 123 ccc! 99!'

#### find the whole string

In [30]:
regexp = r'.+'

re.findall(regexp, subject)

['ccccc! 123 ccc! 99!']

#### find all the c groups

In [31]:
regexp = r'c'

re.findall(regexp, subject)

['c', 'c', 'c', 'c', 'c', 'c', 'c', 'c']

In [32]:
regexp = r'c{5}'

re.findall(regexp, subject)

['ccccc']

In [33]:
regexp = r'c{3}'

re.findall(regexp, subject)

['ccc', 'ccc']

In [34]:
regexp = r'c{3,}'

re.findall(regexp, subject)

['ccccc', 'ccc']

In [35]:
regexp = r'c+'

re.findall(regexp, subject)

['ccccc', 'ccc']

#### find 123 and anything after it

In [36]:
regexp = r'123+'

re.findall(regexp, subject) 

['123']

`123+` matches `12` followed by one or more `3`s, because `+` applies only to the character directly before it.
To match `123` and everything that comes after it, add `.+` as shown below.

In [37]:
regexp = r'123.+'

re.findall(regexp, subject)

['123 ccc! 99!']
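To make the difference concrete, here is a small sketch on a made-up string. It contrasts `+` applied to a single character with `+` applied to a group; the `(?:...)` non-capturing group is covered later in these notes:

```python
import re

text = '12333 123123'

last_char = re.findall(r'123+', text)       # + repeats only the 3
whole_seq = re.findall(r'(?:123)+', text)   # + repeats the whole sequence 123

print(last_char)  # ['12333', '123', '123']
print(whole_seq)  # ['123', '123123']
```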

#### find the exclamation points and everything in between them

In [38]:
#!: first exclamation point
#.+: everything after
#!: last exclamation point

regexp = r'!.+!'

re.findall(regexp, subject)

['! 123 ccc! 99!']

#### find the exclamation points and everything in between the first two

In [39]:
regexp = r'!.+?!'

re.findall(regexp, subject)

['! 123 ccc!']
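The lazy `+?` is one way to stop at the next exclamation point. Another common approach, sketched here, is a negated character class (`[^!]`, covered in the Any of / None of section), which can never run past a `!` in the first place:

```python
import re

subject = 'ccccc! 123 ccc! 99!'

lazy = re.findall(r'!.+?!', subject)       # as few characters as possible
negated = re.findall(r'![^!]+!', subject)  # any characters except '!'

print(lazy)     # ['! 123 ccc!']
print(negated)  # ['! 123 ccc!']
```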

<div class='alert alert-warning'>
<b>Mini Exercise</b>
    
From the below string, find the following information:
- Find all the numbers
- Find the number that has exactly 5 digits
- Find numbers that have 4 or more digits
- Find `http://` or `https://`
- Find the sentences contained in quotes

</div>

<div class='alert alert-info'>

  
-------------------------------------------------------------------------
Repeating
    
- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions

- `*`: zero or more
- `+`: one or more
- `?`: optional (zero or one); after another quantifier (`+?`, `*?`) it makes that quantifier lazy (non-greedy)  
------------------------------------------------------------------------- 
Meta
    
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore
- `\d`: any digit
- `\D`: anything that is *not* a digit
- `\s` : any whitespace
-  `.` : any character except a newline (includes spaces)
-------------------------------------------------------------------------

    
    
</div>

In [40]:
subject = '''
Codeup, founded in 2014, is located at 600 Navarro St. 
Suite 350, San Antonio, TX 78230. 
tagline: "launch your career in tech!" 
You can find us online at http://codeup.com 
and our alumni portal is located at https://alumni.codeup.com
and "codeup is a great school"!
'''

subject = subject.replace('\n','')
subject

'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. tagline: "launch your career in tech!" You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.comand "codeup is a great school"!'

#### Find all the numbers

In [41]:
regexp = r'\w\d'

#better is r'\d+'

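re.findall(regexp, subject) # \w\d matches exactly two characters (a word character, then a digit), so each number is split into pairs and leftover digits are dropped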

['20', '14', '60', '35', '78', '23']

In [42]:
regexp = r'\d+'

re.findall(regexp, subject)

['2014', '600', '350', '78230']

#### Find a number that has exactly 5 digits

In [43]:
regexp = r'\d{5}'

re.findall(regexp, subject)

['78230']

#### Find numbers that have 4 or more digits

In [44]:
regexp = r'\d{4,}'

re.findall(regexp, subject)

['2014', '78230']

#### Find `http://` or `https://`

In [45]:
regexp = r'htt.+?//'

re.findall(regexp, subject)

['http://', 'https://']
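A tighter alternative, sketched on a shortened sample string: `https?` spells out the protocol and makes only the `s` optional, so the pattern cannot wander across other characters the way `.+?` can:

```python
import re

text = 'find us at http://codeup.com and https://alumni.codeup.com'

# 's?' means the letter s is optional
protocols = re.findall(r'https?://', text)
print(protocols)  # ['http://', 'https://']
```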

#### Find the sentences contained in quotes

In [46]:
regexp = r'".+?"'

re.findall(regexp, subject)

['"launch your career in tech!"', '"codeup is a great school"']

### Any of / None of

<div class='alert alert-info'>
    
- `[]`: will match any element inside of
- `[^]`: will match any element NOT inside of
- `[-]`: will match a range of values inside of
    
------------------------------------------------------------------------- 
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore
- `\d`: any digit
- `\D`: anything that is *not* a digit
- `\s` : any whitespace
-  `.` : any character except a newline

- `{}`: custom number of repetitions
- `{x}`: exactly x repetitions
- `{x,}`: x or more
- `{x,y}`: between x and y repetitions

- `*`: zero or more
- `+`: one or more
- `?`: optional: makes the element to its left optional; after another quantifier (`+?`, `*?`) it makes that quantifier lazy (non-greedy)
-------------------------------------------------------------------------
    
</div>

#### match using brackets

In [47]:
subject = 'abc 12345 1bc'

#### find a or b

In [48]:
regexp = r'[ab]' # a character class; the alternation r'a|b' matches the same characters

re.findall(regexp, subject)

['a', 'b', 'b']

#### find values that are NOT a or b

In [49]:
regexp = r'[^ab]'

re.findall(regexp, subject)

['c', ' ', '1', '2', '3', '4', '5', ' ', '1', 'c']

#### find values that are between 2 and 4

In [50]:
regexp = r'[2-4]'

re.findall(regexp, subject)

['2', '3', '4']
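Several ranges can share one pair of brackets, and a quantifier can follow the class. A brief sketch on the same subject string:

```python
import re

subject = 'abc 12345 1bc'

# letters a through c, or digits 1 through 3, one character at a time
mixed = re.findall(r'[a-c1-3]', subject)
print(mixed)  # ['a', 'b', 'c', '1', '2', '3', '1', 'b', 'c']

# the same class with +, which groups consecutive matches into runs
runs = re.findall(r'[a-c1-3]+', subject)
print(runs)  # ['abc', '123', '1bc']
```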

### Anchors
<div class='alert alert-info'>

- `^`: start of the string
- `$`: end of the string
- `\b`: word boundary

</div>

In [51]:
subject = 'kiwi aardvark banana codeup data science academy extra'

#### match all words that start with a vowel

In [52]:
regexp = r'[aeiou]\w+'

re.findall(regexp, subject)

['iwi', 'aardvark', 'anana', 'odeup', 'ata', 'ience', 'academy', 'extra']

In [53]:
#using a boundary
regexp = r'\b[aeiou]\w+'

re.findall(regexp, subject)

['aardvark', 'academy', 'extra']

In [55]:
#using an anchor: ^ matches only at the very start of the string, and 'kiwi' does not start with a vowel
regexp = r'^[aeiou]\w+'

re.findall(regexp, subject)

[]

In [56]:
#split subjects to use anchor
subjects = subject.split()
subjects

['kiwi', 'aardvark', 'banana', 'codeup', 'data', 'science', 'academy', 'extra']

In [57]:
#use for loop to cycle through them
for sub in subjects:
    print(re.findall(r'^[aeiou]\w+',sub))

[]
['aardvark']
[]
[]
[]
[]
['academy']
['extra']
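As an alternative to looping, the `re.MULTILINE` flag makes `^` match at the start of every line. The sketch below assumes we are free to put each word on its own line first (the `lines` variable is illustrative):

```python
import re

subject = 'kiwi aardvark banana codeup data science academy extra'

# one word per line, so ^ can anchor to each word
lines = '\n'.join(subject.split())

starts_with_vowel = re.findall(r'^[aeiou]\w+', lines, re.MULTILINE)
print(starts_with_vowel)  # ['aardvark', 'academy', 'extra']
```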


#### match all words that end with a vowel

In [58]:
subject

'kiwi aardvark banana codeup data science academy extra'

In [59]:
regexp = r'\w+[aeiou]$'
re.findall(regexp, subject)

['extra']
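Because `$` only matches at the end of the whole string, the cell above finds just the last word. To check every word, one option is to swap `$` for a word boundary, sketched here:

```python
import re

subject = 'kiwi aardvark banana codeup data science academy extra'

# \b after the vowel requires the word to end right there
ends_with_vowel = re.findall(r'\w+[aeiou]\b', subject)
print(ends_with_vowel)  # ['kiwi', 'banana', 'data', 'science', 'extra']
```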

In [60]:
regexp = r'[ka][ia]\w+'

re.findall(regexp, subject)

['kiwi', 'aardvark']

<div class='alert alert-info'>

-------------------------------------------------------------------------
| 
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore
- `\d`: any digit
- `\D`: anything that is *not* a digit
- `\s` : any whitespace
-  `.` : any character except a newline
    
|
    
- `{}`: custom number of repetitions
- `{x}`: exactly x repetitions
- `{x,}`: x or more
- `{x,y}`: between x and y repetitions
    
|
- `*`: zero or more
- `+`: one or more
- `?`: optional: makes the element to its left optional; after another quantifier (`+?`, `*?`) it makes that quantifier lazy (non-greedy)
    
|
    
- `[]`: will match any element inside of
- `[^]`: will match any element NOT inside of
- `[-]`: will match a range of values inside of
    
|    
    
- `^`: starts with
- `$`: ends with
- `\b`: word boundary
  
-------------------------------------------------------------------------
    
</div>

In [61]:
regexp = r'[\d][\W].'
re.findall(regexp, subject)

[]

<div class='alert alert-warning'>

<b>Mini Exercise</b>
    
Write regular expressions to find the following values

- Find any even digits (regardless of whether they are part of a larger number)
- Find entire numbers that are even
   
    
- Find 2 or more odd digits in a row.

    
- Find all the capital letters
- Find all words that start with a capital letter
    
</div>

In [62]:
subject = '''
Codeup, founded in 2014, is located at 600 Navarro St. 
Suite 350, San Antonio, TX 78230. 
tagline: "launch your career in tech!" 
You can find us online at http://codeup.com 
and our alumni portal is located at https://alumni.codeup.com
and "codeup is a great school"!
'''

subject = subject.replace('\n','')
subject

'Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, San Antonio, TX 78230. tagline: "launch your career in tech!" You can find us online at http://codeup.com and our alumni portal is located at https://alumni.codeup.comand "codeup is a great school"!'

In [63]:
regexp = r'\d.'

re.findall(regexp, subject)

['20', '14', '60', '0 ', '35', '0,', '78', '23', '0.']

#### Find any even digits (regardless of whether they are part of a larger number)

In [64]:
#over thought this one

regexp = r'[24680]'

re.findall(regexp, subject)

['2', '0', '4', '6', '0', '0', '0', '8', '2', '0']

#### Find entire numbers that are even

In [65]:
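regexp = r'\d+[24680]\b' 
# in plain English: \d+ matches one or more digits, [24680] requires the last digit to be even, and \b requires the number to end there
# note: this needs at least two digits; r'\d*[24680]\b' would also match single-digit even numbers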

re.findall(regexp, subject)

['2014', '600', '350', '78230']

#### Find 2 or more odd digits in a row (regardless of whether they are part of a larger number)

In [66]:
regexp = r'[13579]{2,}' # [13579] matches one odd digit; {2,} requires two or more of them in a row

re.findall(regexp, subject)

['35']

#### Find all the capital letters

In [67]:
regexp = r'[A-Z]'

re.findall(regexp, subject)

['C', 'N', 'S', 'S', 'S', 'A', 'T', 'X', 'Y']

#### Find all words that start with a capital letter

In [68]:
regexp = r'[A-Z]\w*'

re.findall(regexp, subject)

['Codeup', 'Navarro', 'St', 'Suite', 'San', 'Antonio', 'TX', 'You']
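Without an anchor, `[A-Z]\w*` can also start in the middle of a word. The sketch below uses a made-up sentence to show the difference a leading `\b` makes:

```python
import re

text = 'Codeup teaches iPhone apps in San Antonio'

anywhere = re.findall(r'[A-Z]\w*', text)        # may begin mid-word
word_start = re.findall(r'\b[A-Z]\w*', text)    # must begin at a word boundary

print(anywhere)    # ['Codeup', 'Phone', 'San', 'Antonio']
print(word_start)  # ['Codeup', 'San', 'Antonio']
```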

### Capture Groups

<div class='alert alert-info'>
    
- `()`: grab what's contained in parentheses 
    
</div> 

In [69]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

#### find the domain only

In [70]:
regexp = r'https?://(.+)\.com' # escape the dot so it matches a literal '.'

re.findall(regexp, subject)

['codeup']

#### find everything after the first sentence

In [71]:
regexp = r'\.\s(.+)'

re.findall(regexp, subject)

['Our ip address is 123.123.123.123 (maybe).']

#### find the ip address

In [72]:
regexp = r'Our ip address is (.+)\s'
re.findall(regexp, subject)

['123.123.123.123 (maybe).']

In [73]:
#dont be greedy
regexp = r'Our ip address is (.+?)\s'

re.findall(regexp, subject)

['123.123.123.123']

In [74]:
regexp = r'((\d{3}\.){3})'

re.findall(regexp, subject)

[('123.123.123.', '123.')]

### Non Capture Group

<div class='alert alert-info'>

- `(?:...)`: a non-capturing group (sometimes called a shy group); it groups without capturing

</div>

#### find the ip address

In [75]:
regexp = r'((?:\d{3}\.){3}\d{3})'

re.findall(regexp, subject)

['123.123.123.123']
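The pattern above assumes every part of the address has exactly three digits. A slightly more general sketch allows one to three digits per part. It still does not check that each part is between 0 and 255, and the sample text is made up:

```python
import re

text = 'Servers: 10.0.0.1 and 123.123.123.123 (maybe).'

# (?:\d{1,3}\.){3} : three groups of 1-3 digits, each followed by a dot
# \d{1,3}          : the final group of digits
ips = re.findall(r'\b(?:\d{1,3}\.){3}\d{1,3}\b', text)
print(ips)  # ['10.0.0.1', '123.123.123.123']
```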

<div class='alert alert-warning'>

<b>Mini-Exercise</b>    
    
- Find the protocol, domain, and tld
- Edit your previous code with a non-capture group to remove the domain from your result
    
</div>

In [76]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

#### Find the protocol, domain, and tld

In [77]:
regexp = r'(https?)://(\w+)\.(\w+)'

re.findall(regexp, subject)

[('https', 'codeup', 'com')]
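One possible answer to both parts of the mini-exercise, sketched with three capture groups first and then with the domain group switched to non-capturing (the variable names are illustrative):

```python
import re

subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

# three capture groups: protocol, domain, tld
all_parts = re.findall(r'(https?)://(\w+)\.(\w+)', subject)
print(all_parts)  # [('https', 'codeup', 'com')]

# (?:\w+) still has to match the domain, but it is left out of the result
no_domain = re.findall(r'(https?)://(?:\w+)\.(\w+)', subject)
print(no_domain)  # [('https', 'com')]
```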

## more re library functions

### `re.search`

- scans through a string for the first location where the regex matches; returns a match object, or `None` if nothing matches

In [78]:
subject = """
    Codeup, founded in 2014, is located at 600 Navarro St. Suite 350, 
    San Antonio, TX 78230. 
    You can find us online at http://codeup.com and
    our alumni portal is located at https://alumni.codeup.com.
    """

#### find the word located

In [79]:
#save the match result
regexp = r'located'

match = re.search(regexp, subject)
match

<re.Match object; span=(33, 40), match='located'>

In [80]:
#results type
type(match)

re.Match

In [81]:
#.span()
match.span()

(33, 40)

In [82]:
match[0]

'located'

In [83]:
#.group()
match.group()

'located'

#### find numbers

In [84]:
regexp = r'\d+'

re.search(regexp, subject)

<re.Match object; span=(24, 28), match='2014'>

#### find navarro

In [85]:
regexp = r'navarro' # lowercase, so it does not match 'Navarro'; re.search returns None

re.search(regexp, subject)

In [86]:
#results type
type(re.search(regexp, subject))

NoneType
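Since `re.search` returns `None` when nothing matches, calling `.group()` on the result directly would raise an `AttributeError`. A minimal sketch of guarding against that, using a shortened sample string:

```python
import re

subject = 'Codeup is located at 600 Navarro St.'

match = re.search(r'navarro', subject)

# check the result before using it
result = match.group() if match else 'no match'
print(result)  # no match

# ignoring case finds the street name
match_ci = re.search(r'navarro', subject, re.IGNORECASE)
found = match_ci.group() if match_ci else 'no match'
print(found)  # Navarro
```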

### Name Capture Group

- `(?P<name>...)`: names a capture group

In [87]:
subject = '''
    You can find us on the web at https://codeup.com. Our ip address is 123.123.123.123 (maybe).
    '''

#### find the protocol, domain, and tld and name them

In [88]:
regexp = r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)'

match = re.search(regexp, subject)
match

<re.Match object; span=(35, 53), match='https://codeup.com'>

In [89]:
#groups()
match.groups()

('https', 'codeup', 'com')

In [90]:
#groupdict()
match.groupdict()

{'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}
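Named groups can also be read one at a time, either by name or by position. A short sketch on a made-up sentence:

```python
import re

text = 'Visit https://codeup.com today'
regexp = r'(?P<protocol>https?)://(?P<domain>\w+)\.(?P<tld>\w+)'

match = re.search(regexp, text)

domain = match['domain']       # index the match object by group name
tld = match.group('tld')       # or pass the name to .group()
protocol = match.group(1)      # positional access still works

print(protocol, domain, tld)  # https codeup com
```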

##### regex flag:  `re.VERBOSE`

- `re.VERBOSE` ignores whitespace in the regex pattern and allows `#` comments, so a pattern can be split across lines

In [91]:
regexp = r'''
            (?P<protocol>https?)://
            (?P<domain>\w+)\.
            (?P<tld>\w+)
            '''
match = re.search(regexp, subject, re.VERBOSE)
match

<re.Match object; span=(35, 53), match='https://codeup.com'>
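With `re.VERBOSE`, each line of the pattern can also carry a `#` comment. A sketch of the same idea on a made-up sentence (the name `parts` is illustrative):

```python
import re

pattern = r'''
    (?P<protocol>https?)://   # http or https, then ://
    (?P<domain>\w+)\.         # domain name, then a literal dot
    (?P<tld>\w+)              # top-level domain
'''

match = re.search(pattern, 'Visit https://codeup.com today', re.VERBOSE)
parts = match.groupdict()
print(parts)  # {'protocol': 'https', 'domain': 'codeup', 'tld': 'com'}
```

Because whitespace is ignored in this mode, a literal space in the pattern has to be written as `\ ` or `[ ]`.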

### `re.sub`

- allows us to match a regex and substitute in a new substring for the match; returns string

In [92]:
subject = 'abc 12345xyz'

#### remove all the digits

In [93]:
regexp = r'\d'

re.sub(regexp,'' ,subject)

'abc xyz'

#### replace all digits with a space

In [94]:
regexp = r'\d'

re.sub(regexp,' ' ,subject)

'abc      xyz'

#### replace each run of digits with a single `*`

In [95]:
regexp = r'\d+'

re.sub(regexp,'*', subject)

'abc *xyz'
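`re.sub` can do more than swap in a fixed string. The sketch below shows three standard options: a backreference (`\1`) to reuse the matched text, the `count` argument to limit replacements, and a function as the replacement:

```python
import re

subject = 'abc 12345xyz'

# \1 inserts whatever the first capture group matched
wrapped = re.sub(r'(\d+)', r'<\1>', subject)
print(wrapped)  # abc <12345>xyz

# count=1 replaces only the first match
first_only = re.sub(r'\d', '#', subject, count=1)
print(first_only)  # abc #2345xyz

# a function receives each match object and returns the replacement text
doubled = re.sub(r'\d', lambda m: str(int(m.group()) * 2), subject)
print(doubled)  # abc 246810xyz
```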

#### Using regex with pandas `str.replace`: add the `regex=True` argument

In [96]:
# subjects
pd.Series(subjects).str.replace(r'[aeiou]', '***', regex=True)

0          k***w***
1    ******rdv***rk
2      b***n***n***
3      c***d******p
4          d***t***
5     sc******nc***
6     ***c***d***my
7         ***xtr***
dtype: object

### `re.compile`

- compiles a regular expression ahead of time for reuse; returns a compiled pattern object

In [97]:
emails = [
    "jane@company.com",
    "bob@company.com",
    "jane.janeway@company.com",
    "jane.janeway@dogood.org",
    "jane.janet.janeway@dogood.org", # bonus for the 3 part address
]

#### get name, domain, tld

In [98]:
regexp = r'(\w+\.*\w+)@(\w+)\.(\w+)'

for email in emails:
    print(re.findall(regexp, email))

[('jane', 'company', 'com')]
[('bob', 'company', 'com')]
[('jane.janeway', 'company', 'com')]
[('jane.janeway', 'dogood', 'org')]
[('janet.janeway', 'dogood', 'org')]


In [99]:
pattern = re.compile(r'''
        (?P<name>\w+\.*\w+)@
        (?P<domain>\w+)\.
        (?P<tld>\w+)
        ''', re.VERBOSE)
pattern

re.compile(r'\n        (?P<name>\w+\.*\w+)@\n        (?P<domain>\w+)\.\n        (?P<tld>\w+)\n        ',
           re.UNICODE|re.VERBOSE)

In [100]:
contacts = [pattern.search(email).groupdict() for email in emails]
contacts

[{'name': 'jane', 'domain': 'company', 'tld': 'com'},
 {'name': 'bob', 'domain': 'company', 'tld': 'com'},
 {'name': 'jane.janeway', 'domain': 'company', 'tld': 'com'},
 {'name': 'jane.janeway', 'domain': 'dogood', 'tld': 'org'},
 {'name': 'janet.janeway', 'domain': 'dogood', 'tld': 'org'}]
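The last address loses its first part because `\w+\.*\w+` allows only one run of dots in the name. One possible fix for the bonus, sketched below, is a character class that accepts word characters and dots in any combination. It is deliberately loose and not a full email validator:

```python
import re

emails = [
    "jane@company.com",
    "bob@company.com",
    "jane.janeway@company.com",
    "jane.janeway@dogood.org",
    "jane.janet.janeway@dogood.org",
]

# [\w.]+ : one or more word characters or dots, so any number of name parts
pattern = re.compile(r'(?P<name>[\w.]+)@(?P<domain>\w+)\.(?P<tld>\w+)')

# fullmatch requires the whole string to fit the pattern
contacts = [pattern.fullmatch(email).groupdict() for email in emails]
names = [contact['name'] for contact in contacts]
print(names)
# ['jane', 'bob', 'jane.janeway', 'jane.janeway', 'jane.janet.janeway']
```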

In [101]:
pd.DataFrame(contacts)

            name   domain  tld
0           jane  company  com
1            bob  company  com
2   jane.janeway  company  com
3   jane.janeway   dogood  org
4  janet.janeway   dogood  org


# Recap

re functions
- `re.findall()`: finds all substrings where the regex matches; returns a list
- `re.search()`: scans through a string for the first location where the regex matches; returns a match object, or `None` if nothing matches
    - `match.span()`
    - `match.group()`
    - `match.groups()`
    - `match.groupdict()`
- `re.sub()`: matches a regex and substitutes a new substring for each match; returns a string
- `re.compile()`: compiles a regular expression ahead of time for reuse; returns a compiled pattern object

flags
- `re.IGNORECASE`: ignore case
- `re.VERBOSE`: ignore whitespace (and allow `#` comments) in the pattern



Metacharacters
- `\w`: any letter, digit, or underscore
- `\W`: anything that is *not* a letter, digit, or underscore
- `\d`: any digit
- `\D`: anything that is *not* a digit
- `\s` : any whitespace
-  `.` : any character except a newline


Repeating

- `{}`: custom number of repetitions
    - `{x}`: exactly x repetitions
    - `{x,}`: x or more
    - `{x,y}`: between x and y repetitions

- `*`: zero or more
- `+`: one or more
- `?`: optional (zero or one); after another quantifier (`+?`, `*?`) it makes that quantifier lazy (non-greedy)


Any/None

- `[]`: will match anything inside of
- `[^]`: will match anything NOT inside of
- `[-]`: will match a range of values inside of


Capture Groups

- `()`: grab what's contained in parentheses 
- `(?:...)`: a non-capturing group (sometimes called a shy group)
- `(?P<name>...)`: names a capture group


Other
- `\`: escape character
    


