# Regular Expressions (Regex) in NLP

Common uses of regex in text processing:

1. Identifying white space between words or characters
2. Identifying or creating delimiters and end-of-line escape characters
3. Removing punctuation, numbers, or HTML tags from text
4. Identifying patterns in text

In [2]:
import re

# Test cases: clean text, irregular spacing, and punctuation in place of spaces
test = 'This is a made up string to test 2 different regex methods'
test_space = 'This      is a made up     string to test 2    different regex methods'
test_messy = 'This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods'

### 1. Split a sentence into a list of words or tokens

1) Using re.split()



In [3]:
# Split on every single whitespace character (raw string keeps \s intact)
re.split(r'\s', test)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [4]:
re.split(r'\s', test_space)

['This',
 '',
 '',
 '',
 '',
 '',
 'is',
 'a',
 'made',
 'up',
 '',
 '',
 '',
 '',
 'string',
 'to',
 'test',
 '2',
 '',
 '',
 '',
 'different',
 'regex',
 'methods']

Each whitespace character is treated as its own delimiter, so a run of several spaces produces empty strings in the result.

In [5]:
# + matches one or more consecutive occurrences, so a run of spaces counts as one delimiter
re.split(r'\s+', test_space)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [6]:
re.split(r'\s+', test_messy)

['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']

Nothing is split because the sentence contains no white space. A more general pattern is needed to handle such cases.

In [7]:
# \W matches any non-word character (anything other than letters, digits, and underscore)
re.split(r'\W+', test_messy)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
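One caveat when splitting on `\W+`: if the text begins or ends with a non-word character, `re.split` returns an empty string at that edge. The sketch below illustrates this with a hypothetical string `edge_case` (not one of the test cases above) and one possible way to filter the empty strings out.

```python
import re

# Hypothetical sample: punctuation at both ends of the text
edge_case = '--This-is/messy!!'

raw_tokens = re.split(r'\W+', edge_case)
# Leading and trailing separators leave empty strings at the edges
print(raw_tokens)

# Keep only non-empty tokens
tokens = [tok for tok in raw_tokens if tok]
print(tokens)
```

Filtering after the split is only one option; `re.findall(r'\w+', ...)` avoids the empty strings in the first place.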

2) Using re.findall()

In [8]:
# \S matches any non-whitespace character
re.findall(r'\S+', test)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [9]:
re.findall(r'\S+', test_space)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [10]:
re.findall(r'\S+', test_messy)

['This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods']

Since the text contains no white space, the whole string is returned as a single match.

In [12]:
# \w matches word characters (letters, digits, and underscore)
re.findall(r'\w+', test)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
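For completeness, the same `\w+` pattern can be tried on the messy string, where `\S+` returned everything as one match. This is a small sketch; the second string, `'snake_case-name'`, is a hypothetical sample added only to show that the underscore counts as a word character.

```python
import re

test_messy = 'This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods'

# \w+ collects runs of word characters, so punctuation acts as a separator
words = re.findall(r'\w+', test_messy)
print(words)

# \w also matches the underscore, so snake_case stays in one piece
# (hypothetical sample string, not one of the cases above)
underscore_words = re.findall(r'\w+', 'snake_case-name')
print(underscore_words)
```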

### Useful functions for tokenizing:

1. `re.split()`
2. `re.findall()`

### Useful regex for tokenizing:

1. words - `\w` (word character), `\W` (non-word character)
2. spaces - `\s` (whitespace), `\S` (non-whitespace)

### 2. Replacing a specific string

In [14]:
#Cases: Replace PEP8 or any other similar instances with "PEP8 Python Styling"
pep8_test = 'I try to follow PEP8 guidelines'
pep7_test = 'I try to follow PEP7 guidelines'
peep8_test = 'I try to follow PEEP8 guidelines'

In [16]:
re.findall('[a-z]+', pep8_test)

['try', 'to', 'follow', 'guidelines']

`+` means one or more of the preceding character class, so each match runs until the first character that is not a lowercase letter. That returns every lowercase word, but our target is in uppercase.

In [17]:
re.findall('[A-Z]+', pep8_test)

['I', 'PEP']

The pattern also needs to include the digit, and 'I' should not be matched.

In [19]:
re.findall('[A-Z]+[0-9]+', pep8_test)

['PEP8']

In [20]:
#Test above code on other cases
re.findall('[A-Z]+[0-9]+', pep7_test)

['PEP7']

In [21]:
re.findall('[A-Z]+[0-9]+', peep8_test)

['PEEP8']

In [22]:
# The pattern finds the target in all three cases.
# Next, substitute each match with "PEP8 Python Styling" using re.sub

re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styling', pep8_test)

'I try to follow PEP8 Python Styling guidelines'

In [23]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styling', pep7_test)

'I try to follow PEP8 Python Styling guidelines'

In [24]:
re.sub('[A-Z]+[0-9]+', 'PEP8 Python Styling', peep8_test)

'I try to follow PEP8 Python Styling guidelines'
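Note that `[A-Z]+[0-9]+` matches any run of uppercase letters followed by digits, not only PEP-like tokens. The sketch below shows this on a hypothetical sentence `mixed` and tries a narrower pattern; the pattern `\bPE+P[0-9]+\b` is an assumption about what "similar instances" should mean, not a definitive rule.

```python
import re

# Hypothetical sentence mixing a PEP-like token with an unrelated one
mixed = 'I try to follow PEEP8 guidelines over HTTP2'

# The broad pattern rewrites every uppercase-plus-digits token
broad = re.sub(r'[A-Z]+[0-9]+', 'PEP8 Python Styling', mixed)
print(broad)

# Narrower assumption: "P", one or more "E", "P", then digits, as a whole word
pep_like = re.compile(r'\bPE+P[0-9]+\b')

# subn returns the new string together with the number of replacements made
result, count = pep_like.subn('PEP8 Python Styling', mixed)
print(result, count)
```

`re.subn` is handy here because the replacement count makes it easy to check that only the intended tokens were changed.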

### Other examples of regex methods

- re.search()
- re.match()
- re.fullmatch()
- re.finditer()
- re.escape()
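
A brief sketch of how the methods listed above behave, reusing the PEP8 sentence and pattern from the previous section. The variable names and the shortened messy string are chosen here for illustration only.

```python
import re

text = 'I try to follow PEP8 guidelines'
pattern = r'[A-Z]+[0-9]+'

# search: first match anywhere in the string, or None
found = re.search(pattern, text)
print(found.group(), found.span())

# match: succeeds only if the pattern matches at the very start
at_start = re.match(pattern, text)                  # None, text starts with "I "
at_start_ok = re.match(pattern, 'PEP8 is a guide')  # matches "PEP8"
print(at_start, at_start_ok.group())

# fullmatch: the entire string has to match the pattern
whole = re.fullmatch(pattern, 'PEP8')
partial = re.fullmatch(pattern, 'PEP8 guide')       # None, trailing text
print(whole is not None, partial is not None)

# finditer: iterate over match objects to get positions as well as text
positions = [(m.start(), m.group()) for m in re.finditer(r'[a-z]+', text)]
print(positions)

# escape: neutralise metacharacters so a literal string can be used as a pattern
literal = re.escape('up.string*to')
hit = re.search(literal, 'This-is-a-made/up.string*to>>>>test')
print(literal, hit.group())
```

`re.search`, `re.match`, and `re.fullmatch` differ only in where the match is allowed to sit, so checking the result against `None` is the usual way to tell them apart in practice.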