# Regular Expressions in action

Import the 're' module, Python's built-in library for working with regular expressions.

In [3]:
import re

Regular expressions can match simple text using ordinary characters as well as complex patterns using special characters.

- **Ordinary characters include letters, numbers and symbols that carry no special meaning.**

Ordinary characters are used to get exact matches. For example, if you wanted to find the occurrences of the term 'Python' in some text, your regex would simply be '*Python*'.
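
As a small illustration of exact matching, the sketch below counts how often 'Python' appears in a sentence. The sentence and the variable names are made up for this example; *re.IGNORECASE* is a standard flag of the 're' module.

```python
import re

# Hypothetical sample sentence, not part of the original notebook
sample = "Python is popular. I learn python because Python is readable."

# Ordinary characters match themselves, and matching is case-sensitive by default
exact = re.findall(r'Python', sample)

# The re.IGNORECASE flag relaxes the case requirement
any_case = re.findall(r'Python', sample, flags=re.IGNORECASE)

print(exact)     # ['Python', 'Python']
print(any_case)  # ['Python', 'python', 'Python']
```

Note that the lowercase 'python' is only picked up once the flag is supplied.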

- **Special characters allow you to create generic patterns that describe a whole family of strings, more like a 'closest match'.**

For example, if you want to match an email address, you cannot specify an exact match since people have different email addresses. However, there is a pattern that you can use to your benefit: an email address always has an '**@**' symbol in the middle, and many addresses (including the ones in this example) end with '**.com**'. Let's see how we can find this pattern.


Create some sample text

In [4]:
#sample text with emails
txt ='''Hello from shubhamg199630@gmail.com 
to priya@yahoo.com about the meeting @2PM'''

To find the provided pattern we use the *re.search()* function.

In [5]:
plain_text = re.search('Hello', txt) #find the word 'Hello' in the above text

To find the word 'Hello' we simply used the word itself as the expression.

The *re.search()* function returns a match object for the first occurrence of the provided expression, which holds the matched text as well as its start and end indexes.

In [9]:
plain_text #print out the result

<re.Match object; span=(0, 5), match='Hello'>

In [20]:
print(plain_text.group()) #print just the match

Hello


In [21]:
print(plain_text.span()) #print indexes where the match was found

(0, 5)
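
One detail worth knowing: when the expression is not found, *re.search()* returns None instead of a match object, so calling *.group()* on the result would raise an error. A minimal sketch of a safe check, using a search term ('Goodbye') chosen here only because it is absent from the text:

```python
import re

txt = 'Hello from shubhamg199630@gmail.com \nto priya@yahoo.com about the meeting @2PM'

# 'Goodbye' does not occur in txt, so re.search() returns None
missing = re.search(r'Goodbye', txt)
print(missing)  # None

# Guard against None before calling .group() or .span()
if missing is None:
    result = 'no match'
else:
    result = missing.group()
print(result)  # no match
```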


Now let's try a more complex pattern and find the email addresses in the text.

In [12]:
email = re.search(r'\S+@{1}\S+[.com]{1}',txt) #regex to find an email address (raw string keeps the backslashes intact)

In [14]:
print(email) #found the first occurring email

<re.Match object; span=(11, 35), match='shubhamg199630@gmail.com'>


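Let's understand the regex a little bit.

- \S : Matches any single non-whitespace character.
- \+ : Repeats the preceding \S one or more times.
- @ : Exact match for the '@' symbol.
- {1} : Repeats the preceding '@' exactly once. This is the default, so it can be omitted.
- \S : Again matches a non-whitespace character.
- \+ : Again requires at least one such character.
- [.com] : A character class that matches any single one of the characters '.', 'c', 'o' or 'm'. It does not match the literal text '.com'.
- {1} : Repeats the preceding character class exactly once.

The pattern still finds the addresses here because the greedy \S+ consumes most of the domain and the final 'm' satisfies the character class. To require the literal ending '.com', escape the dot with a backslash and drop the square brackets.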
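
To make the role of the square brackets concrete, the sketch below compares the pattern used in this notebook with a stricter variant that escapes the dot. The test strings and variable names are hypothetical, and *re.fullmatch()* is used so that each string must match from start to end.

```python
import re

# Hypothetical test strings, chosen to show the difference
candidates = ['user@site.com', 'user@site.org', 'user@room']

loose = r'\S+@{1}\S+[.com]{1}'   # notebook pattern: ends in ONE of '.', 'c', 'o', 'm'
strict = r'\S+@\S+\.com'         # escaped dot: ends in the literal text '.com'

loose_hits = [c for c in candidates if re.fullmatch(loose, c)]
strict_hits = [c for c in candidates if re.fullmatch(strict, c)]

print(loose_hits)   # ['user@site.com', 'user@room']
print(strict_hits)  # ['user@site.com']
```

The loose pattern accepts 'user@room' simply because it ends in 'm', which is why the stricter form is usually the safer choice.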

### Find all occurrences

If we want to extract all occurrences of the provided regex, we use the *re.findall()* function.

In [17]:
emails = re.findall(r'\S+@{1}\S+[.com]{1}', txt)

In [18]:
print(emails)

['shubhamg199630@gmail.com', 'priya@yahoo.com']
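
*re.findall()* returns only the matched strings. If the positions are needed as well, *re.finditer()* is a possible alternative, since it yields one match object per occurrence. A short sketch (the variable name `found` is just for illustration):

```python
import re

txt = 'Hello from shubhamg199630@gmail.com \nto priya@yahoo.com about the meeting @2PM'

# re.finditer() yields a Match object for every occurrence,
# so both the text and its indexes are available
found = [(m.group(), m.span()) for m in re.finditer(r'\S+@{1}\S+[.com]{1}', txt)]

for address, span in found:
    print(address, span)
# shubhamg199630@gmail.com (11, 35)
# priya@yahoo.com (40, 55)
```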


### Substitute Expression

We can substitute every match of the given expression with a string of our own using the *re.sub()* function.

In [22]:
substituted_string = re.sub(r'\S+@{1}\S+[.com]{1}', '', txt) #remove emails from the given text.

In [23]:
print(substituted_string)

Hello from  
to  about the meeting @2PM


Job done!

This feature can be used to redact documents.

Say we want to remove emails from a text so that no confidential contact information is exposed.

We can simply substitute each address with an \<email\> tag.

In [24]:
redacted = re.sub(r'\S+@{1}\S+[.com]{1}', '<email>', txt)

In [25]:
print(redacted)

Hello from <email> 
to <email> about the meeting @2PM
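
The replacement passed to *re.sub()* does not have to be a fixed string: it can also be a function that receives each match object and returns the text to put in its place. As a sketch of a softer redaction, the hypothetical helper `mask_local_part` below hides the part before the '@' but keeps the domain.

```python
import re

txt = 'Hello from shubhamg199630@gmail.com \nto priya@yahoo.com about the meeting @2PM'

def mask_local_part(match):
    """Hypothetical helper: replace the part before '@' with asterisks."""
    local, domain = match.group().split('@', 1)
    return '*' * len(local) + '@' + domain

# re.sub() calls mask_local_part once per match and inserts its return value
masked = re.sub(r'\S+@{1}\S+[.com]{1}', mask_local_part, txt)
print(masked)
# Hello from **************@gmail.com 
# to *****@yahoo.com about the meeting @2PM
```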


### Multiple expressions in a single line

Say we want to find the email addresses as well as the time mentioned for the meeting.

We can combine two expressions with the OR operator '|', which tells Python to match either expression1 or expression2 at each position in the text.


In [27]:
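re.findall(r'\S+@{1}\S+[.com]{1}|@(?:1[0-2]|[1-9])(?:AM|PM)', txt) #non-capturing groups (?:...) so that findall returns the whole match

['shubhamg199630@gmail.com', 'priya@yahoo.com', '@2PM']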
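
A point to watch with *re.findall()*: if the pattern contains a capturing group written as (...), the function returns the content of that group rather than the whole match, and an empty string for matches where the group took no part. Writing the group as (?:...) keeps the grouping without capturing. The sketch below shows the difference on a made-up sentence; the shorthand \d (any digit) is used here for brevity and is not part of the patterns above.

```python
import re

# Hypothetical text with two meeting times
sample = 'meeting @2PM, call @11AM'

# A capturing group (...) makes findall() return only the group's content
with_group = re.findall(r'@\d{1,2}(AM|PM)', sample)

# A non-capturing group (?:...) keeps the grouping but returns the whole match
without_group = re.findall(r'@\d{1,2}(?:AM|PM)', sample)

print(with_group)     # ['PM', 'AM']
print(without_group)  # ['@2PM', '@11AM']
```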