## Regular Expressions 
- A regular expression is a text string that describes a search pattern
- 'nlp' -> matches the literal string nlp
- '[j-q]' -> matches any single character from j through q, so each match is one character
- '[j-q]+' -> matches runs of one or more consecutive characters from j through q
- '[0-9]+' -> matches runs of one or more digits, such as 7 or 2024
- '[j-q0-9]+' -> matches sequences made up of letters j through q and/or digits 0 through 9
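
The patterns listed above can be tried out with `re.findall()`. This is a minimal sketch; the `sample` string is made up purely for illustration.

```python
import re

# Hypothetical sample text, chosen only to exercise the patterns above
sample = 'nlp models king 2024'

literal = re.findall('nlp', sample)        # exact substring
single = re.findall('[j-q]', sample)       # one character per match
runs = re.findall('[j-q]+', sample)        # runs of consecutive j-q characters
digits = re.findall('[0-9]+', sample)      # runs of digits
mixed = re.findall('[j-q0-9]+', sample)    # runs of j-q letters and/or digits

print(literal)   # ['nlp']
print(single)    # ['n', 'l', 'p', 'm', 'o', 'l', 'k', 'n']
print(runs)      # ['nlp', 'mo', 'l', 'k', 'n']
print(digits)    # ['2024']
print(mixed)     # ['nlp', 'mo', 'l', 'k', 'n', '2024']
```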

### Why is RegEx useful?
- Identify whitespace between words/tokens
- Identify or create delimiters and end-of-line escape characters
- Remove punctuation or numbers from text
- Clean HTML tags from text
- Identify patterns
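
A rough sketch of how several of these cleaning steps could be chained with `re.sub()`. The `raw` string is a made-up example, and stripping tags with a regex is only an approximation suitable for simple snippets, not a replacement for a real HTML parser.

```python
import re

# Hypothetical input; the markup and wording are made up for illustration
raw = '<p>Hello,   world! It is 2024.</p>'

no_tags = re.sub(r'<[^>]+>', '', raw)            # drop anything shaped like <...>
no_punct = re.sub(r'[^\w\s]', '', no_tags)       # drop punctuation
no_digits = re.sub(r'\d+', '', no_punct)         # drop numbers
clean = re.sub(r'\s+', ' ', no_digits).strip()   # collapse whitespace runs

print(clean)   # Hello world It is
```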

### Use-cases
- Confirm passwords meet criteria 
- Search URL for some substring
- Search for files on your computer
- Document scraping 
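
Two of these use-cases sketched in code. The password policy in `meets_criteria` and the example URL are assumptions invented for illustration, not a recommended security policy.

```python
import re

def meets_criteria(password):
    """Hypothetical policy: 8+ characters with a lowercase letter, an uppercase letter, and a digit."""
    checks = [r'.{8,}', r'[a-z]', r'[A-Z]', r'[0-9]']
    # Every pattern must be found somewhere in the password
    return all(re.search(pattern, password) for pattern in checks)

# Hypothetical URL used only for illustration
url = 'https://example.com/docs/regex?page=2'
has_docs = re.search(r'/docs/', url) is not None

print(meets_criteria('Passw0rdX'))   # True
print(meets_criteria('password'))    # False
print(has_docs)                      # True
```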

### Exercise Takeaways
- Useful methods for tokenizing
-- findall(): matches the words themselves and ignores whatever separates them
-- split(): matches the separators and returns the text between them

- Useful regexes for tokenizing
-- '\w' (word character) & '\W' (non-word character) for words
-- '\s' (whitespace) & '\S' (non-whitespace) for whitespace

In [1]:
import re

In [2]:
test1 = 'This is a made up string to test 2 different regex methods'
test2 = 'This     is a made    up string    to test 2  different regex methods'
test3 = 'This-is-a-made/up.string*to>>>test----2"""different-regex-methods'

### Split sentence into a list of words

In [9]:
# Raw strings (r'...') pass backslashes such as \s to the regex engine unchanged
print(re.split(r'\s', test1))
print(re.split(r'\s', test2))
print(re.split(r'\s', test3))

['This', 'is', 'a', 'made', 'up', 'string', 'to', 'test', '2', 'different', 'regex', 'methods']
['This', '', '', '', '', 'is', 'a', 'made', '', '', '', 'up', 'string', '', '', '', 'to', 'test', '2', '', 'different', 'regex', 'methods']
['This-is-a-made/up.string*to>>>test----2"""different-regex-methods']


In [24]:
print(re.split(r'\s+', test2))
print(re.split(r'\W+', test2))

['This', 'is', 'a', 'made', 'up', 'string', 'to', 'test', '2', 'different', 'regex', 'methods']
['This', 'is', 'a', 'made', 'up', 'string', 'to', 'test', '2', 'different', 'regex', 'methods']


In [23]:
print(re.split('[-/>".*]+',test3))
print(re.split(r'\W+', test3))

['This', 'is', 'a', 'made', 'up', 'string', 'to', 'test', '2', 'different', 'regex', 'methods']
['This', 'is', 'a', 'made', 'up', 'string', 'to', 'test', '2', 'different', 'regex', 'methods']


In [32]:
# Find all
print(re.findall(r'\S+', test1))
print(re.findall(r'\S+', test2))
print(re.findall(r'\w+', test3))

['This', 'is', 'a', 'made', 'up', 'string', 'to', 'test', '2', 'different', 'regex', 'methods']
['This', 'is', 'a', 'made', 'up', 'string', 'to', 'test', '2', 'different', 'regex', 'methods']
['This', 'is', 'a', 'made', 'up', 'string', 'to', 'test', '2', 'different', 'regex', 'methods']


### Replacing a specific string

In [40]:
pep8 = 'I try to follow PEP8 guidelines'
pep7 = 'I try to follow PEP7 guidelines'
peep8 = 'I try to follow PEEP8 guidelines'

In [43]:
print(re.findall('[a-zA-Z]+',pep8))
print(re.findall('[A-Z]+',pep8))
print(re.findall('[A-Z]+[0-9]',pep8))

['I', 'try', 'to', 'follow', 'PEP', 'guidelines']
['I', 'PEP']
['PEP8']


In [47]:
print(re.findall('[A-Z]+[0-9]+',peep8))
print(re.findall('[A-Z]+[0-9]+',pep7))

['PEEP8']
['PEP7']


In [48]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', pep8)

'I try to follow PEP8 Python Styleguide guidelines'

In [50]:
# Both calls return the same string; the notebook displays only the last expression
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', pep7)
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styleguide', peep8)

'I try to follow PEP8 Python Styleguide guidelines'

### Other Examples 
- re.search()
- re.match()
- re.fullmatch()
- re.finditer()
- re.escape()
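
A short sketch of what each of these functions does, reusing the PEP8 sentence from the replacement examples. The variable names are illustrative only.

```python
import re

text = 'I try to follow PEP8 guidelines'
pattern = r'[A-Z]+[0-9]+'

# search(): first match anywhere in the string
found = re.search(pattern, text)
print(found.group(), found.span())   # PEP8 (16, 20)

# match(): only succeeds if the pattern matches at the start of the string
at_start = re.match(pattern, text)   # None, because the string starts with 'I '

# fullmatch(): the entire string must match the pattern
whole = re.fullmatch(pattern, 'PEP8')

# finditer(): iterator of match objects, useful when positions matter
positions = [(m.group(), m.start()) for m in re.finditer(r'[a-z]+', text)]

# escape(): backslash-escape regex metacharacters so a string is matched literally
literal = re.escape('1.5+2')
print(literal)   # 1\.5\+2
```

`search()`, `match()`, and `fullmatch()` return a match object or `None`, so the result is usually checked before calling `.group()`.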

### Searching Methods
#### Identifiers
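- \d : any digit (0-9)
- \D : anything but a digit
- \s : any whitespace character (space, tab, newline, and so on)
- \S : anything but whitespace
- \w : any word character (letter, digit, or underscore)
- \W : anything but a word character
- . : any character except a newline
- \. : a literal period
- \b : a word boundary (the zero-width position between a word character and a non-word character)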
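
A minimal sketch that runs each identifier against one made-up string (`s` is an assumption for illustration). Note that `\b` matches a position rather than a character, so the example reports where the boundaries fall.

```python
import re

# Hypothetical sample string, made up for illustration
s = 'Tel. 555 ext_9'

digits = re.findall(r'\d+', s)         # ['555', '9']
non_digits = re.findall(r'\D+', s)     # ['Tel. ', ' ext_']
spaces = re.findall(r'\s', s)          # [' ', ' ']
chunks = re.findall(r'\S+', s)         # ['Tel.', '555', 'ext_9']
words = re.findall(r'\w+', s)          # ['Tel', '555', 'ext_9']
non_words = re.findall(r'\W+', s)      # ['. ', ' ']
periods = re.findall(r'\.', s)         # ['.']
any_char = re.findall(r'.', 'a\nb')    # ['a', 'b'], the newline is skipped

# \b matches a zero-width position, so collect the indexes where it occurs
boundaries = [m.start() for m in re.finditer(r'\b', s)]   # [0, 3, 5, 8, 9, 14]
```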

#### Modifiers
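- {1,3} : match between 1 and 3 repetitions of the preceding element
- '+' : match 1 or more
- '*' : match 0 or more
- ? : match 0 or 1
- $ : match the end of a string
- ^ : match the beginning of a string
- | : either or (alternation)
- [] : character set or range, such as [aeiou] or [a-z]
- {x} : match exactly x repetitions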
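
Each modifier can be seen in isolation with a one-line `re.findall()` call. The input strings below are made up for illustration.

```python
import re

one_to_three = re.findall(r'\d{1,3}', '7 42 1234')      # ['7', '42', '123', '4']
exactly_two = re.findall(r'\d{2}', '7 42 1234')         # ['42', '12', '34']
one_or_more = re.findall(r'ab+', 'a ab abb')            # ['ab', 'abb']
zero_or_more = re.findall(r'ab*', 'a ab abb')           # ['a', 'ab', 'abb']
optional = re.findall(r'colou?r', 'color colour')       # ['color', 'colour']
at_start = re.findall(r'^\w+', 'first middle last')     # ['first']
at_end = re.findall(r'\w+$', 'first middle last')       # ['last']
either = re.findall(r'cat|dog', 'cat dog bird')         # ['cat', 'dog']
char_set = re.findall(r'[aeiou]', 'regex')              # ['e', 'e']
```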

#### White Space Characters 
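- \n : new line
- \s : any whitespace character
- \t : tab
- \e : escape (recognized by some regex engines, but not by Python's re module)
- \f : form feed
- \r : carriage return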
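
A small sketch showing how `\s` covers the individual whitespace characters in Python's `re` module. The `messy` string is a made-up example.

```python
import re

# Hypothetical string mixing a space, tab, newline, carriage return, and form feed
messy = 'a b\tc\nd\r\ne\ff'

all_ws = re.findall(r'\s', messy)     # every whitespace character, one per match
tokens = re.split(r'\s+', messy)      # split on runs of any whitespace
tabs_only = re.findall(r'\t', messy)  # target a single kind of whitespace

print(all_ws)      # [' ', '\t', '\n', '\r', '\n', '\x0c']
print(tokens)      # ['a', 'b', 'c', 'd', 'e', 'f']
print(tabs_only)   # ['\t']
```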


In [56]:
test_str = 'Hello world'

In [65]:
re.findall('l',test_str)

['l', 'l', 'l']

In [69]:
test_str2 = '1.Hello\n2.World'
print(test_str2)

1.Hello
2.World


In [88]:
print(re.findall(r'\w+', test_str2))
print(re.split(r'\w+', test_str2))

['1', 'Hello', '2', 'World']
['', '.', '\n', '.', '']


In [89]:
ex_str = '''
Jessica is 15 years old, and Daniel is 27 years old. 
Edward is 97, and his grandfather, Oscar, is 102.
'''

In [100]:
ages = re.findall(r'\d{1,3}', ex_str)
names = re.findall('[A-Z][a-z]+', ex_str)
print(ages)
print(names)

['15', '27', '97', '102']
['Jessica', 'Daniel', 'Edward', 'Oscar']


In [101]:
ageDict = dict(zip(names,ages))
print(ageDict)

{'Jessica': '15', 'Daniel': '27', 'Edward': '97', 'Oscar': '102'}
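
Pairing two separate lists with `zip()` relies on every name having exactly one age, in the same order. One possible alternative, sketched here, captures each name and age together with groups. The pattern assumes the "<Name> is <age>" sentence shape seen in this sample text and would need adjusting for other phrasing.

```python
import re

ex_str = '''
Jessica is 15 years old, and Daniel is 27 years old.
Edward is 97, and his grandfather, Oscar, is 102.
'''

# Assumed sentence shape: "<Name> is <age>", with an optional comma after the name
pairs = re.findall(r'([A-Z][a-z]+),? is (\d{1,3})', ex_str)
age_by_name = {name: int(age) for name, age in pairs}

print(age_by_name)   # {'Jessica': 15, 'Daniel': 27, 'Edward': 97, 'Oscar': 102}
```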
