# NLP Basics: Learning how to use regular expressions

### Regular Expression (REGEX)

> A regular expression, or regex for short, is a text string that describes a search pattern. For example, if the pattern is "nlp" and the text is "I love nlp", the search captures and returns just "nlp".


> Another way to find the "nlp" string is to use the character class "[j-q]", which matches any single character between 'j' and 'q'. Note that it matches every character in that range, not just 'n', 'l', and 'p'; in "I love nlp" it also matches the 'l' and 'o' of "love".

> If we add a "+" after the expression (e.g. "[j-q]+"), it matches runs of one or more such characters, so we can capture strings longer than a single character.
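The three patterns described above can be tried directly with Python's `re` module. This is a minimal sketch; the variable names (`text`, `literal`, `single`, `runs`) are illustrative and not part of the original notebook.

```python
import re

text = "I love nlp"

# Literal pattern: matches the exact substring "nlp"
literal = re.findall(r"nlp", text)

# Character class: one match per character in the range j..q
single = re.findall(r"[j-q]", text)

# Character class followed by "+": one match per run of such characters
runs = re.findall(r"[j-q]+", text)

print(literal)  # ['nlp']
print(single)   # ['l', 'o', 'n', 'l', 'p']
print(runs)     # ['lo', 'nlp']
```

Note that "[j-q]+" also picks up "lo" from "love", which is why a character range alone is a loose way to target one specific word.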


### Using regular expressions in Python

Python's built-in `re` module is the standard tool for working with regular expressions. More details can be found [in the official documentation](https://docs.python.org/3/library/re.html).

In [1]:
import re

re_test = 'This is a made up string to test 2 different regex methods'
re_test_messy = 'This      is a made up     string to test 2    different regex methods'
re_test_messy1 = 'This-is-a-made/up.string*to>>>>test----2""""""different~regex-methods'

### Splitting a sentence into a list of words

In [2]:
re.split(r"\s", re_test)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [3]:
re.split(r"\s+", re_test_messy)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
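To see why the "+" matters for the messy string, it may help to compare the two patterns on a small input with repeated spaces. This is a sketch only; `sample`, `single_split`, and `run_split` are names made up for this illustration.

```python
import re

sample = "a  b   c"  # two spaces, then three spaces

# "\s" splits on every individual whitespace character,
# so consecutive spaces leave empty strings in the result
single_split = re.split(r"\s", sample)

# "\s+" treats each run of whitespace as one separator
run_split = re.split(r"\s+", sample)

print(single_split)  # ['a', '', 'b', '', '', 'c']
print(run_split)     # ['a', 'b', 'c']
```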

The last string has no spaces at all; its words are separated by a mix of special characters.

> In the code, we replace "\s+" with "\W+".

> This regex matches any non-word character and uses it as the split point, whether that is a space, a slash, a quote, a period, and so on.

> We keep the plus sign because our string has more than one special character in a row.

In [4]:
re.split(r"\W+", re_test_messy1)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
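One detail worth knowing about "\W" is that the underscore counts as a word character, so it is not used as a split point. The sketch below illustrates this with a made-up string; `sample`, `tokens`, and `tokens_no_underscore` are hypothetical names, and adding "_" to the character class is just one possible way to split on underscores too.

```python
import re

sample = "snake_case/and.kebab-case"

# "\W+" does not split on "_" because "_" is a word character
tokens = re.split(r"\W+", sample)

# Adding "_" to the class splits on underscores as well
tokens_no_underscore = re.split(r"[\W_]+", sample)

print(tokens)                # ['snake_case', 'and', 'kebab', 'case']
print(tokens_no_underscore)  # ['snake', 'case', 'and', 'kebab', 'case']
```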

In [5]:
re.findall(r"\S+", re_test)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [6]:
re.findall(r"\S+", re_test_messy)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']

In [7]:
re.findall(r"\w+", re_test_messy1)

['This',
 'is',
 'a',
 'made',
 'up',
 'string',
 'to',
 'test',
 '2',
 'different',
 'regex',
 'methods']
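On the test strings above, `re.split()` and `re.findall()` give the same result, because "\S+" matches exactly what "\s+" leaves behind. The two can differ when the text starts or ends with a separator, as this small sketch suggests (the `padded` string and the result names are illustrative only):

```python
import re

padded = "  hello   world  "  # leading and trailing spaces

# split() reports empty strings where the text starts or ends with a separator
split_result = re.split(r"\s+", padded)

# findall() only returns what the pattern actually matched
findall_result = re.findall(r"\S+", padded)

print(split_result)    # ['', 'hello', 'world', '']
print(findall_result)  # ['hello', 'world']
```

If the empty strings are unwanted, searching for the tokens with `findall()` avoids a separate filtering step.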

### Replacing a specific string

We will try several regular expressions with `findall()` to see which one isolates the word "PEP8". Once we have the right regex, we can pass it to `sub()` to replace the match.

In [8]:
pep8_test = 'I try to follow PEP8 guidelines'
pep7_test = 'I try to follow PEP7 guidelines'
peep8_test = 'I try to follow PEEP8 guidelines'

In [9]:
re.findall("[a-z]+", pep8_test)

['try', 'to', 'follow', 'guidelines']

From the above output, we can see that the regex is case-sensitive: "[a-z]+" matches only runs of lowercase letters, so "I" and "PEP8" are left out.
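If case-insensitive matching is wanted instead, the `re` module provides the `re.IGNORECASE` flag. The following is a minimal sketch that reuses the same sentence; the result names are made up for illustration.

```python
import re

pep8_test = 'I try to follow PEP8 guidelines'

# Default behaviour: only lowercase runs are matched
case_sensitive = re.findall(r"[a-z]+", pep8_test)

# With re.IGNORECASE, the same class also matches uppercase letters
case_insensitive = re.findall(r"[a-z]+", pep8_test, flags=re.IGNORECASE)

print(case_sensitive)    # ['try', 'to', 'follow', 'guidelines']
print(case_insensitive)  # ['I', 'try', 'to', 'follow', 'PEP', 'guidelines']
```

The digit "8" is still dropped in both cases, since the class contains letters only.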

In [10]:
re.findall("[A-Z]+", pep8_test)

['I', 'PEP']

In [11]:
re.findall("[a-z0-9]+", pep8_test)

['try', 'to', 'follow', '8', 'guidelines']

In [12]:
re.findall("[A-Z0-9]+", pep8_test)

['I', 'PEP8']

In [13]:
re.findall("[A-Z]+[0-9]+", pep8_test)

['PEP8']

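This is what we were looking for. The pattern requires one or more uppercase letters immediately followed by one or more digits, so it matches "PEP8" but not the standalone "I".

Let's run it on the other examples as well.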

In [14]:
print(re.findall("[A-Z]+[0-9]+", pep8_test),
      re.findall("[A-Z]+[0-9]+", pep7_test),
      re.findall("[A-Z]+[0-9]+", peep8_test))

['PEP8'] ['PEP7'] ['PEEP8']
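Since the same pattern is applied to several strings, one option is to compile it once with `re.compile()` and reuse the resulting pattern object. This is only a sketch of that idea; `token_pattern`, `samples`, and `matches` are names introduced here, not part of the original notebook.

```python
import re

# Compile the pattern once and reuse it for every string
token_pattern = re.compile(r"[A-Z]+[0-9]+")

samples = [
    'I try to follow PEP8 guidelines',
    'I try to follow PEP7 guidelines',
    'I try to follow PEEP8 guidelines',
]

matches = [token_pattern.findall(sample) for sample in samples]

print(matches)  # [['PEP8'], ['PEP7'], ['PEEP8']]
```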


> Now we will use `sub()` to replace "PEP8", or the misspelled variants "PEP7" and "PEEP8", with "PEP8 Style".

In [15]:
# Before

print(pep8_test)


# After

print(re.sub("[A-Z]+[0-9]+", "PEP8 Style", pep8_test))

I try to follow PEP8 guidelines
I try to follow PEP8 Style guidelines


In [16]:
# Before

print(pep7_test)


# After

print(re.sub("[A-Z]+[0-9]+", "PEP8 Style", pep7_test))

I try to follow PEP7 guidelines
I try to follow PEP8 Style guidelines


In [17]:
# Before

print(peep8_test)


# After

print(re.sub("[A-Z]+[0-9]+", "PEP8 Style", peep8_test))

I try to follow PEEP8 guidelines
I try to follow PEP8 Style guidelines
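The pattern "[A-Z]+[0-9]+" works for these three sentences, but it would also replace any other token made of uppercase letters followed by digits. The sketch below shows this on a made-up sentence and tries a narrower pattern; both the `mixed` string and the pattern `\bPE+P[0-9]+\b` are assumptions chosen for illustration, not something the examples above require.

```python
import re

mixed = 'I use HTML5 and follow PEEP8 guidelines'

# The broad pattern also rewrites "HTML5"
broad = re.sub(r"[A-Z]+[0-9]+", "PEP8 Style", mixed)

# A narrower pattern: "P", one or more "E", "P", then digits,
# bounded by word boundaries on both sides
narrow = re.sub(r"\bPE+P[0-9]+\b", "PEP8 Style", mixed)

print(broad)   # I use PEP8 Style and follow PEP8 Style guidelines
print(narrow)  # I use HTML5 and follow PEP8 Style guidelines
```

How narrow the pattern should be depends on which misspellings are expected in the real text.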
