
# Regular Expressions

Regular expressions (regex for short) let you search text for strings that match almost any rule you can describe, for example every capital letter in a string or a phone number in a document.

Regular expressions are notorious for their seemingly strange syntax. That syntax is a byproduct of their flexibility: a notation able to describe almost any string pattern you can imagine needs a compact, symbol-heavy format.

Regular expressions are handled by Python's built-in **re** module. See [the docs](https://docs.python.org/3/library/re.html) for more information.

In [None]:
sampletext = "My phone number is +91-9677080857 and my dad's phone number is +91-9884831392"

In [None]:
"phone" in sampletext

True

In [None]:
r"+\d{2}-\d{10}" in sampletext

False
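The `in` check above fails because `in` looks for a literal substring and never interprets the text as a pattern. The sketch below contrasts the two approaches on the same sample text. It also shows, as a side note, that a pattern starting with an unescaped `+` is rejected by `re`, which is why the later cells write `\+`. The variable names are chosen only for this illustration.

```python
import re

sampletext = "My phone number is +91-9677080857 and my dad's phone number is +91-9884831392"

# `in` compares characters one by one, so the backslashes and braces are
# looked for literally and nothing in the text matches.
literal_hit = r"\+\d{2}-\d{10}" in sampletext

# re.search interprets the same characters as a pattern.
pattern_hit = re.search(r"\+\d{2}-\d{10}", sampletext) is not None

# An unescaped leading "+" is a quantifier with nothing to repeat,
# so the re module refuses to compile it.
try:
    re.compile(r"+\d{2}-\d{10}")
    unescaped_plus_ok = True
except re.error:
    unescaped_plus_ok = False

print(literal_hit, pattern_hit, unescaped_plus_ok)  # False True False
```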

In [None]:
import re

In [None]:
re.search("phone", sampletext)

<re.Match object; span=(3, 8), match='phone'>

In [None]:
re.search("phone", sampletext).span()

(3, 8)

`re.search` returns only the first match. Its start and end positions are also available separately:

In [None]:
print(re.search("phone", sampletext).start())
print(re.search("phone", sampletext).end())

3
8
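One detail worth keeping in mind: when nothing matches, `re.search` returns `None`, so chaining `.span()` or `.start()` directly onto the call would raise an `AttributeError`. A minimal sketch of guarding against that, using a search word ("email") picked only because it does not occur in the sample text:

```python
import re

sampletext = "My phone number is +91-9677080857 and my dad's phone number is +91-9884831392"

# "email" does not occur in the text, so there is no match object.
match = re.search("email", sampletext)
print(match)  # None

# Check for None before calling .span(), .start(), .end() or .group().
if match is None:
    result = "no match"
else:
    result = match.span()

print(result)  # no match
```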


In [None]:
re.findall("phone", sampletext)

['phone', 'phone']

In [None]:
len(re.findall("phone", sampletext))

2

In [None]:
for instance in re.finditer("phone", sampletext):
  print(instance.span())

(3, 8)
(47, 52)


In [17]:
instance.group()

'phone'

In [18]:
instance

<re.Match object; span=(47, 52), match='phone'>

## Identifiers for Characters in Patterns

Kinds of characters, such as digits, word characters, and whitespace, each have a short code that represents them. You can combine these codes to build up a pattern string. Notice how they make heavy use of the backslash `\`. Because of this, when defining a pattern string for a regular expression we use the raw-string format:

    r'mypattern'
    
Placing the `r` in front of the string tells Python to keep each `\` in the pattern as a literal backslash instead of treating it as the start of a string escape sequence.
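To see why the prefix matters, here is a small sketch using `\b`. This code is not in the table below; it is used here only because an ordinary Python string turns `\b` into a backspace character, while the regex engine reads backslash plus `b` as a word boundary.

```python
import re

plain = "\bcat\b"   # ordinary string: each "\b" becomes one backspace character
raw = r"\bcat\b"    # raw string: backslash and "b" are kept as two characters

print(len(plain), len(raw))  # 5 7

# The plain version looks for backspace characters, which the text lacks.
plain_matches = re.findall(plain, "cat concat")
# The raw version matches "cat" only as a whole word.
raw_matches = re.findall(raw, "cat concat")

print(plain_matches)  # []
print(raw_matches)    # ['cat']
```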

The table below lists the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric or underscore</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric (and not underscore)</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
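As a quick check, each example pattern in the table can be tested against its example match. The sketch below uses `re.fullmatch`, which succeeds only when the whole string fits the pattern. The list name `examples` is just for this illustration.

```python
import re

# (pattern, example match) pairs copied from the table above
examples = [
    (r"file_\d\d", "file_25"),
    (r"\w-\w\w\w", "A-b_1"),
    (r"a\sb\sc", "a b c"),
    (r"\D\D\D", "ABC"),
    (r"\W\W\W\W\W", "*-+=)"),
    (r"\S\S\S\S", "Yoyo"),
]

# True when the entire example string matches its pattern
results = {pattern: re.fullmatch(pattern, text) is not None
           for pattern, text in examples}

for pattern, ok in results.items():
    print(pattern, ok)
```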

In [20]:
sampletext

"My phone number is +91-9677080857 and my dad's phone number is +91-9884831392"

Notice the repetition of `\d` in a pattern like `file_\d\d`. That gets tedious quickly, especially when we are looking for long runs of digits such as the 10-digit phone numbers above. Let's explore the possible quantifiers.

## Quantifiers

Now that we know the special character codes, we can combine them with quantifiers to specify how many occurrences we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
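The quantifiers in the table can be tried out with `re.findall`. This is a small sketch; the short test strings other than `sampletext` are made up for the illustration.

```python
import re

sampletext = "My phone number is +91-9677080857 and my dad's phone number is +91-9884831392"

# + : one or more digits in a row
digit_runs = re.findall(r"\d+", sampletext)
print(digit_runs)  # ['91', '9677080857', '91', '9884831392']

# {10} : exactly ten digits
ten_digits = re.findall(r"\d{10}", sampletext)
print(ten_digits)  # ['9677080857', '9884831392']

# {2,4} : two to four digits, taking as many as possible
two_to_four = re.findall(r"\d{2,4}", "1 12 123 12345")
print(two_to_four)  # ['12', '123', '1234']

# ? : the preceding character is optional
plurals = re.findall(r"plurals?", "plural plurals")
print(plurals)  # ['plural', 'plurals']

# * : zero or more, so the missing B is allowed
star_ok = re.fullmatch(r"A*B*C*", "AAACC") is not None
print(star_ok)  # True
```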

In [30]:
pattern = r'\+\d{2}-\d{10}'

In [31]:
phonenumber = re.search(pattern,sampletext)

In [32]:
phonenumber

<re.Match object; span=(19, 33), match='+91-9677080857'>

In [33]:
phonenumber.group()

'+91-9677080857'

## Groups

What if we want to do two things at once: find phone numbers and also quickly extract their country code (the `+91` prefix)? Groups let us wrap parts of a regular expression in parentheses so that we can later pull those parts out of a match individually.

Using the phone number example, we can separate groups of regular expressions using parentheses:

In [34]:
phone_pattern = re.compile(r"(\+\d{2})-(\d{10})")

In [35]:
phone_pattern

re.compile(r'(\+\d{2})-(\d{10})', re.UNICODE)

In [36]:
pattern1 = r"\+\d{2}-\d{10}"

In [44]:
searchobj = re.search(pattern1,sampletext)

In [45]:
searchobj.group(0)

'+91-9677080857'

In [51]:
pattern2 = r"(\+\d{2})-(\d{10})"

In [52]:
searchobj1 = re.search(pattern2,sampletext)

In [60]:
searchobj1.group()

'+91-9677080857'

In [62]:
searchobj1.group(1)

'+91'

In [55]:
searchobj1 = re.search(phone_pattern,sampletext)

In [56]:
searchobj1.group(0)

'+91-9677080857'
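Building on the cells above, a match object can also return every group at once with `.groups()`, and combining the compiled pattern with `finditer` covers both numbers in the text. The named-group variant at the end is an optional extra; the names `code` and `number` are hypothetical labels chosen for this sketch.

```python
import re

sampletext = "My phone number is +91-9677080857 and my dad's phone number is +91-9884831392"
phone_pattern = re.compile(r"(\+\d{2})-(\d{10})")

# .groups() returns all parenthesised groups of one match as a tuple.
pairs = [m.groups() for m in phone_pattern.finditer(sampletext)]
print(pairs)  # [('+91', '9677080857'), ('+91', '9884831392')]

# Named groups: (?P<name>...) lets you refer to a group by label
# instead of by position.
named_pattern = re.compile(r"(?P<code>\+\d{2})-(?P<number>\d{10})")
named = named_pattern.search(sampletext)
print(named.group("code"), named.group("number"))  # +91 9677080857
```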

In [63]:
re.search(r"man|woman", "The man is here")

<re.Match object; span=(4, 7), match='man'>

In [64]:
re.search(r"man|woman", "The woman is here")

<re.Match object; span=(4, 9), match='woman'>

In [65]:
re.findall(r".at", "The fat cat sat at the batmobile")

['fat', 'cat', 'sat', ' at', 'bat']

In [66]:
re.findall(r"..at.", "The fat cat sat at the batmobile")

[' fat ', ' sat ', ' batm']

In [67]:
re.findall(r"\d$", "1number2")

['2']

In [68]:
re.findall(r"^\d", "1number2")

['1']

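To exclude digits, put `^` at the start of a character class. Inside square brackets it negates the class, so `[^\d]` matches any character that is not a digit: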

In [69]:
re.findall(r"[^\d]", "1number2")

['n', 'u', 'm', 'b', 'e', 'r']

In [70]:
re.findall(r"[^\d]+", "1number2")

['number']

In [71]:
re.findall(r"[^\d]+", "1number2 2to0 3be4 5removed8")

['number', ' ', 'to', ' ', 'be', ' ', 'removed']
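The negated class `[^\d]` behaves the same as the `\D` identifier from the earlier table, and more codes can be added inside the brackets to exclude more. A brief sketch; excluding whitespace as well is an extension added here for illustration.

```python
import re

text = "1number2 2to0 3be4 5removed8"

negated = re.findall(r"[^\d]+", text)
shorthand = re.findall(r"\D+", text)   # \D is the built-in "not a digit" code
print(negated == shorthand)  # True

# Excluding whitespace too drops the lone spaces from the result.
words = re.findall(r"[^\d\s]+", text)
print(words)  # ['number', 'to', 'be', 'removed']
```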

In [72]:
punct = "This is a string with punctuations! How to remove them? Let's find out."

In [74]:
re.findall("[^.?!]+",punct)

['This is a string with punctuations',
 ' How to remove them',
 " Let's find out"]

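Let us join the pieces back into one string. Note the double spaces in the result: each piece after the first keeps its leading space, and the join adds another.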

In [75]:
" ".join(re.findall("[^.?!]+",punct))

"This is a string with punctuations  How to remove them  Let's find out"

In [76]:
re.findall(r'[\w]+-[\w]+', "I want to pick hyphen-words. I am re-applying regular expressions logics.")

['hyphen-words', 're-applying']
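Returning to the punctuation example: an alternative to collecting the non-punctuation pieces and joining them is to delete the unwanted characters directly with `re.sub`. This is one possible approach, not necessarily the one intended by the notebook, and it avoids the double spaces left by the join.

```python
import re

punct = "This is a string with punctuations! How to remove them? Let's find out."

# Replace every ".", "?" or "!" with an empty string.
cleaned = re.sub(r"[.?!]", "", punct)
print(cleaned)
# This is a string with punctuations How to remove them Let's find out
```

The apostrophe in "Let's" survives because it is not listed in the character class.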