# NLP Basics: Learning how to use regular expressions

### Using regular expressions in Python

Python's built-in `re` module is the standard tool for working with regular expressions. More details can be found in the [official documentation](https://docs.python.org/3/library/re.html).

In [2]:
# import the regex module
import re

# we have three test sentences
re_test = 'This is a made up string to test 2 different regex methods'
re_test_messy = 'This      is a made up     string to test 2    different regex methods'
re_test_messy1 = 'This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods'

### Splitting a sentence into a list of words

In [3]:
# Approach 1: use the split() function from the re module. The goal is to identify
# the whitespace or other characters separating the words, and then tokenize on them.

# \s tells Python to split the string on a single whitespace character.
re.split(r'\s', re_test)


['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [5]:
# \s tells Python to split the string on a single whitespace character.
# This does not work as well with the second sentence: every extra space yields an empty string.
re.split(r'\s', re_test_messy)

['This',
 '',
 '',
 '',
 '',
 '',
 'is',
 'a',
 'made',
 'up',
 '',
 '',
 '',
 '',
 'string',
 'to',
 'test',
 '2',
 '',
 '',
 '',
 'different',
 'regex',
 'methods']

In [7]:
# \s+ tells Python to split the string on one or more whitespace characters.
re.split(r'\s+', re_test_messy)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
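The difference between `\s` and `\s+` can be checked side by side. The sketch below reuses the second test sentence; the built-in `str.split()` is not part of the original walkthrough and is included only as a point of comparison, since with no arguments it also treats a run of whitespace as one separator.

```python
import re

re_test_messy = 'This      is a made up     string to test 2    different regex methods'

# a single \s splits at every whitespace character,
# so each extra space in a run produces an empty string
single = re.split(r'\s', re_test_messy)

# \s+ treats a whole run of whitespace as one separator
multi = re.split(r'\s+', re_test_messy)

# str.split() with no arguments also splits on runs of whitespace
builtin = re_test_messy.split()

print('' in single)      # True
print('' in multi)       # False
print(multi == builtin)  # True
```

The two results match here because the sentence has no leading or trailing whitespace; with such whitespace, `re.split` would return empty strings at the edges while `str.split()` would not.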

In [9]:
# \s+ tells Python to split the string on one or more whitespace characters.
# This does not work with the third sentence, which contains no whitespace at all.
re.split(r'\s+', re_test_messy1)

['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']

In [10]:
# \W+ matches one or more non-word characters and uses each such run as a separator.
re.split(r'\W+', re_test_messy1)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
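One edge case worth knowing when splitting on `\W+`: if the string starts or ends with non-word characters, `re.split` returns empty strings at those positions. A minimal sketch, using a made-up string (`edge_case` is a hypothetical example, not one of the test sentences above):

```python
import re

# hypothetical sentence that starts and ends with punctuation
edge_case = '>>>This-is/a.test!!!'

tokens = re.split(r'\W+', edge_case)
print(tokens)  # ['', 'This', 'is', 'a', 'test', '']

# drop the empty strings left at the edges
clean = [t for t in tokens if t]
print(clean)   # ['This', 'is', 'a', 'test']
```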

In [11]:
# Approach 2: instead of searching for what separates the words, search for the words themselves
# and ignore the characters between them. This can be done with the findall() function.

# \S+ looks for one or more non-whitespace characters, so we get the opposite effect.
re.findall(r'\S+', re_test)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [12]:
re.findall(r'\S+', re_test_messy)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [13]:
# \S+ cannot handle the third sentence, because the dashes, slashes and periods still count as
# non-whitespace characters, so they are treated as part of the word.
re.findall(r'\S+', re_test_messy1)

['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']

In [14]:
# \w+ searches for one or more word characters (letters, digits and underscores),
# so it picks out the tokens that resemble words, even in the third sentence.
re.findall(r'\w+', re_test_messy1)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
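Because `\w` covers letters, digits and the underscore, `\w+` keeps identifiers such as `user_id` together but breaks contractions at the apostrophe. The sketch below shows this on a made-up string; the second pattern, which allows one internal apostrophe, is an illustrative assumption and not part of the original walkthrough.

```python
import re

# hypothetical sample containing an apostrophe, an underscore and digits
sample = "don't rename user_id 42"

plain = re.findall(r'\w+', sample)
print(plain)  # ['don', 't', 'rename', 'user_id', '42']

# assumption: optionally allow one apostrophe followed by more word characters
with_apostrophes = re.findall(r"\w+(?:'\w+)?", sample)
print(with_apostrophes)  # ["don't", 'rename', 'user_id', '42']
```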

### Replacing a specific string

In [15]:
# PEP 8 is the style guide for Python code
pep8_test = 'I try to follow PEP8 guidelines'
pep7_test = 'I try to follow PEP7 guidelines'
peep8_test = 'I try to follow PEEP8 guidelines'

In [17]:
# We need a pattern that captures not only PEP8 but also likely typos (PEP7, PEEP8).
# The goal is to find every place where PEP8 or one of these typos occurs
# and replace it with "PEP8 Python Style Guide".

# [a-z]+ matches runs of lowercase letters only
re.findall('[a-z]+', pep8_test)

['try', 'to', 'follow', 'guidelines']

In [18]:
re.findall('[A-Z]+', pep8_test)

['I', 'PEP']

In [20]:
re.findall('[A-Z]+[0-9]+', pep8_test)

['PEP8']

In [21]:
re.findall('[A-Z]+[0-9]+', pep7_test)

['PEP7']

In [22]:
re.findall('[A-Z]+[0-9]+', peep8_test)

['PEEP8']

In [23]:
# The sub() function searches for a regex pattern and replaces every match with a given string.
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Style Guide', pep8_test)

'I try to follow PEP8 Python Style Guide guidelines'

In [24]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Style Guide', pep7_test)

'I try to follow PEP8 Python Style Guide guidelines'

In [25]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Style Guide', peep8_test)

'I try to follow PEP8 Python Style Guide guidelines'
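The pattern `[A-Z]+[0-9]+` is deliberately broad: it matches any run of capital letters followed by digits, not just PEP8 and its typos. The sketch below illustrates that trade-off on a made-up sentence. The narrower pattern `\bPE+P[0-9]\b` is an assumption chosen for illustration; it covers only the three spellings used in this tutorial.

```python
import re

replacement = 'PEP8 Python Style Guide'

# the broad pattern used above, compiled once for reuse
broad = re.compile(r'[A-Z]+[0-9]+')

# assumption: a narrower pattern (P, one or more E, P, one digit) with word boundaries
narrow = re.compile(r'\bPE+P[0-9]\b')

# hypothetical sentence containing an unrelated token (UTF8)
unrelated = 'I save files as UTF8 and follow PEP7'

broad_result = broad.sub(replacement, unrelated)
narrow_result = narrow.sub(replacement, unrelated)

print(broad_result)   # UTF8 is replaced as well
print(narrow_result)  # only PEP7 is replaced
```

Which pattern is appropriate depends on how much unrelated text the input is expected to contain.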

### Other examples of regex methods

- re.search()
- re.match()
- re.fullmatch()
- re.finditer()
- re.escape()
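A brief sketch of how the functions listed above behave, reusing the PEP8 sentence and pattern from the previous section. The remaining input strings are made up for illustration.

```python
import re

text = 'I try to follow PEP8 guidelines'
pattern = r'[A-Z]+[0-9]+'

# search(): first match anywhere in the string, or None if there is none
found = re.search(pattern, text)
print(found.group(), found.start())  # PEP8 16

# match(): succeeds only if the pattern matches at the start of the string
at_start = re.match(pattern, text)
print(at_start)  # None

# fullmatch(): succeeds only if the whole string matches the pattern
whole = re.fullmatch(pattern, 'PEP8')
partial = re.fullmatch(pattern, 'PEP8 guidelines')
print(whole is not None, partial is not None)  # True False

# finditer(): yields one match object per match, with position information
numbers = [(m.group(), m.start())
           for m in re.finditer(r'[0-9]+', 'test 2 different 10 methods')]
print(numbers)  # [('2', 5), ('10', 17)]

# escape(): makes special characters literal, so '.' means a real period
parts = re.split(re.escape('.'), 'made/up.string')
print(parts)  # ['made/up', 'string']
```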