## *About Regex (Regular Expression)*

Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as “Does this string match the pattern?”, or “Is there a match for the pattern anywhere in this string?”. You can also use REs to modify a string or to split it apart in various ways.

For more information, see https://docs.python.org/3/howto/regex.html#introduction
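
The two questions quoted above map onto different functions in the `re` module. As a minimal sketch, using a made-up sample string and pattern (both are illustrative assumptions, not part of the notebook below), `re.fullmatch` is one way to ask whether the whole string matches, while `re.search` asks whether a match exists anywhere:

```python
import re

# Hypothetical sample data, used only for this illustration.
sample = "order id: 4521"
digits = r"\d+"

# "Does this string match the pattern?" -> the whole string must match.
whole = re.fullmatch(digits, sample)

# "Is there a match anywhere in this string?" -> scan for the first match.
anywhere = re.search(digits, sample)

print(whole)             # None, because the string also contains letters
print(anywhere.group())  # 4521
```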

In [1]:
import re

In [2]:
content = """
My name is Saurav Prasad I have done my graduation in chemistry, 
and my masters is in Operational Research I m working as a data scientist for an MNC.
I swtiched from chemistry to mathematically loaded field.I like to dance, 
and I was the vice-president of the dance scoiety of my college.My phone number is 634-458-6815
"""

In [3]:
#defining the pattern
pattern = "masters"

In [4]:
# match object for the first occurrence of the pattern
search = re.search(pattern, content)

In [5]:
# span() returns the (start, end) indices of the match
search.span()

(74, 81)

In [6]:
# start() returns the index where the match begins
search.start()

74

In [7]:
# end() returns the index just past the last matched character
search.end()

81
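
The indices returned by `span()` can be used directly as a slice, and `re.search` returns `None` when nothing matches. A small sketch, assuming a shortened stand-in for the `content` string (the name `sentence` is hypothetical):

```python
import re

# Short stand-in for the `content` string defined above.
sentence = "and my masters is in Operational Research"

found = re.search("masters", sentence)
if found is not None:
    start, end = found.span()
    # Slicing with the span gives back the matched text.
    print(sentence[start:end] == found.group())  # True

# re.search returns None when nothing matches, so guard before calling methods.
missing = re.search("doctorate", sentence)
print(missing is None)  # True
```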

In [8]:
pattern = "chemistry"

# findall returns every non-overlapping occurrence as a list of strings
matches = re.findall(pattern, content)
len(matches)

2

In [9]:
for match in re.finditer(pattern, content):
    print(match.span())

(55, 64)
(169, 178)
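
`findall` returns plain strings, while `finditer` yields match objects that also carry positions. Both are case-sensitive by default. The sketch below, on a made-up stand-in sentence, shows how the `re.IGNORECASE` flag changes what is found:

```python
import re

# Stand-in sentence; the mixed-case spelling is only an illustration.
line = "from chemistry to ChemisTRY"

print(re.findall("chemistry", line))                 # ['chemistry']
print(re.findall("chemistry", line, re.IGNORECASE))  # ['chemistry', 'ChemisTRY']

# finditer yields match objects, so positions are available too.
spans = [m.span() for m in re.finditer("chemistry", line, re.IGNORECASE)]
print(spans)  # [(5, 14), (18, 27)]
```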


## *Generalised Pattern*

In [10]:
pattern = r"\d\d\d-\d\d\d-\d\d\d\d"

match = re.search(pattern,content)

match.span()

(311, 323)

In [11]:
match.group()

'634-458-6815'
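
One caveat worth keeping in mind: `re.search` looks for the pattern anywhere, so the matched digits may be part of a longer number. A small sketch, using made-up strings, of how `\b` word boundaries can tighten the pattern:

```python
import re

loose = r"\d\d\d-\d\d\d-\d\d\d\d"
strict = r"\b\d\d\d-\d\d\d-\d\d\d\d\b"

# Made-up string where the digits are embedded in a longer number.
embedded = "id 99634-458-681577"

print(re.search(loose, embedded).group())  # 634-458-6815
print(re.search(strict, embedded))         # None

# A standalone number is still found by the stricter pattern.
print(re.search(strict, "call 634-458-6815.").group())  # 634-458-6815
```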

In [12]:
# using character classes (\d), quantifiers ({3}, {4}) and capture groups
pattern = r"(\d{3})-(\d{3})-(\d{4})"
match1 = re.search(pattern,content)
match1.group()

'634-458-6815'

In [13]:
print(match1.group(1))
print(match1.group(2))
print(match1.group(3))

634
458
6815
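
Numbered groups can also be given names, which tends to read better once a pattern has several parts. A sketch with `re.compile` and named groups; the group names `area`, `prefix` and `line` are illustrative labels, not something the notebook defines:

```python
import re

# Group names (area, prefix, line) are illustrative labels.
phone = re.compile(r"(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})")

m = phone.search("My phone number is 634-458-6815")
print(m.groups())       # ('634', '458', '6815')
print(m.group("area"))  # 634
print(m.groupdict())    # {'area': '634', 'prefix': '458', 'line': '6815'}
```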


## *Text Cleaning*

In [15]:
text = """
My name is Saurav        Prasad I've done my graduation in chemistry, 
and my masters is         in Operational Research I m working as a data scientist for an MNC.
I swtiched from ChemisTRY to MATHEmatically loaded field.I like to dance, 
and I      was the vice-president of the dance scoiety of my college.
My phone number is    585-334-6815.
I dont give #$%^&*   about what people think.
I just want to travel to ROME(Italy).
my website is    http://www.datasciencenovice.com
my mail    address is datasciencenovice@gmail.com
"""

In [16]:
print(text)


My name is Saurav        Prasad I've done my graduation in chemistry, 
and my masters is         in Operational Research I m working as a data scientist for an MNC.
I swtiched from ChemisTRY to MATHEmatically loaded field.I like to dance, 
and I      was the vice-president of the dance scoiety of my college.
My phone number is    585-334-6815.
I dont give #$%^&*   about what people think.
I just want to travel to ROME(Italy).
my website is    http://www.datasciencenovice.com
my mail    address is datasciencenovice@gmail.com



In [17]:
# collapsing every run of whitespace (spaces, tabs, newlines) into a single space
text = re.sub(r"\s+", " ", text)
print(text)

 My name is Saurav Prasad I've done my graduation in chemistry, and my masters is in Operational Research I m working as a data scientist for an MNC. I swtiched from ChemisTRY to MATHEmatically loaded field.I like to dance, and I was the vice-president of the dance scoiety of my college. My phone number is 585-334-6815. I dont give #$%^&* about what people think. I just want to travel to ROME(Italy). my website is http://www.datasciencenovice.com my mail address is datasciencenovice@gmail.com 
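
Because `\s` also matches newlines, the substitution above flattens the text onto one line and leaves a single space at each end. A short sketch on a made-up string (`raw` is a hypothetical name) showing that behaviour, plus an alternative that keeps line breaks:

```python
import re

raw = "\nMy name   is\nSaurav  \n"

collapsed = re.sub(r"\s+", " ", raw)
print(repr(collapsed))          # ' My name is Saurav '
print(repr(collapsed.strip()))  # 'My name is Saurav'

# To keep line breaks, collapse only runs of spaces and tabs.
kept_lines = re.sub(r"[ \t]+", " ", raw)
print(repr(kept_lines))         # '\nMy name is\nSaurav \n'
```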


In [18]:
# replacing each URL with the placeholder word URL
text = re.sub(r"http\S+","URL",text)
print(text)

 My name is Saurav Prasad I've done my graduation in chemistry, and my masters is in Operational Research I m working as a data scientist for an MNC. I swtiched from ChemisTRY to MATHEmatically loaded field.I like to dance, and I was the vice-president of the dance scoiety of my college. My phone number is 585-334-6815. I dont give #$%^&* about what people think. I just want to travel to ROME(Italy). my website is URL my mail address is datasciencenovice@gmail.com 
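
The `http\S+` pattern is deliberately simple: it consumes everything up to the next whitespace, including trailing punctuation, and it skips links that start with `www.`. The sketch below illustrates this on a made-up sentence. The looser alternative pattern is an assumption for illustration, not a complete URL matcher:

```python
import re

note = "see (http://example.com), or www.example.org for details"

simple = re.sub(r"http\S+", "URL", note)
print(simple)  # see (URL or www.example.org for details

# Illustrative alternative: also accept www. links and stop at ")" or ",".
looser = re.sub(r"(?:https?://|www\.)[^\s),]+", "URL", note)
print(looser)  # see (URL), or URL for details
```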


In [19]:
# masking the phone number with the placeholder XXX
pattern = r"(\d{3})-(\d{3})-(\d{4})"
text = re.sub(pattern,"XXX",text)
print(text)

 My name is Saurav Prasad I've done my graduation in chemistry, and my masters is in Operational Research I m working as a data scientist for an MNC. I swtiched from ChemisTRY to MATHEmatically loaded field.I like to dance, and I was the vice-president of the dance scoiety of my college. My phone number is XXX. I dont give #$%^&* about what people think. I just want to travel to ROME(Italy). my website is URL my mail address is datasciencenovice@gmail.com 


In [20]:
# replacing Gmail addresses with the placeholder E_MAIL (the dot is escaped so it matches a literal ".")
text = re.sub(r"\S+@gmail\.com", "E_MAIL", text)
print(text)

 My name is Saurav Prasad I've done my graduation in chemistry, and my masters is in Operational Research I m working as a data scientist for an MNC. I swtiched from ChemisTRY to MATHEmatically loaded field.I like to dance, and I was the vice-president of the dance scoiety of my college. My phone number is XXX. I dont give #$%^&* about what people think. I just want to travel to ROME(Italy). my website is URL my mail address is E_MAIL 
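
Two details are worth illustrating here. In a pattern, a bare `.` matches any character, so a literal dot should be written as `\.`. The step above also only covers Gmail addresses. The sketch below uses made-up addresses, and the `EMAIL` pattern is a simplified assumption, far from a full address validator:

```python
import re

# A bare "." matches any character, so escape literal dots.
print(bool(re.search(r"\S+@gmail.com", "user@gmailXcom")))   # True
print(bool(re.search(r"\S+@gmail\.com", "user@gmailXcom")))  # False

# Simplified, illustrative pattern for addresses on any domain.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

line = "write to datasciencenovice@gmail.com or someone@example.org today"
masked = EMAIL.sub("E_MAIL", line)
print(masked)  # write to E_MAIL or E_MAIL today
```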


In [21]:
#lower casing
text = text.lower()
print(text)

 my name is saurav prasad i've done my graduation in chemistry, and my masters is in operational research i m working as a data scientist for an mnc. i swtiched from chemistry to mathematically loaded field.i like to dance, and i was the vice-president of the dance scoiety of my college. my phone number is xxx. i dont give #$%^&* about what people think. i just want to travel to rome(italy). my website is url my mail address is e_mail 


In [22]:
# expanding the contraction "i've" to "i have"
text = re.sub("i've","i have",text)
print(text)

 my name is saurav prasad i have done my graduation in chemistry, and my masters is in operational research i m working as a data scientist for an mnc. i swtiched from chemistry to mathematically loaded field.i like to dance, and i was the vice-president of the dance scoiety of my college. my phone number is xxx. i dont give #$%^&* about what people think. i just want to travel to rome(italy). my website is url my mail address is e_mail 


In [23]:
# removing special characters
text = re.sub("[#%^&*()$]","",text)

In [24]:
print(text)

 my name is saurav prasad i have done my graduation in chemistry, and my masters is in operational research i m working as a data scientist for an mnc. i swtiched from chemistry to mathematically loaded field.i like to dance, and i was the vice-president of the dance scoiety of my college. my phone number is xxx. i dont give  about what people think. i just want to travel to romeitaly. my website is url my mail address is e_mail 
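
Deleting the characters outright glues neighbouring words together (`rome(italy)` becomes `romeitaly`) and leaves a double space where the symbols were. One possible variation, sketched on a made-up snippet, is to substitute a space and then collapse the whitespace again:

```python
import re

snippet = "i dont give #$%^&* about rome(italy)"

# Swap each special character for a space, then collapse the extra spaces.
spaced = re.sub(r"[#%^&*()$]", " ", snippet)
tidy = re.sub(r"\s+", " ", spaced).strip()
print(tidy)  # i dont give about rome italy
```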


In [25]:
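def clean_text(text):
    """Return a cleaned, lower-cased copy of text."""

    # collapse runs of whitespace and replace URLs with a placeholder
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"http\S+", "URL", text)

    # mask phone numbers of the form ddd-ddd-dddd
    pattern = r"(\d{3})-(\d{3})-(\d{4})"
    text = re.sub(pattern, "XXX", text)

    # mask Gmail addresses (escaped dot matches a literal "."), then lower-case
    text = re.sub(r"\S+@gmail\.com", "E_MAIL", text)
    text = text.lower()

    # expand the contraction and strip special characters
    text = re.sub("i've", "i have", text)
    text = re.sub("[#%^&*()$]", "", text)

    return text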
    

In [26]:
text = """
My name is Saurav        Prasad I've done my graduation in chemistry, 
and my masters is         in Operational Research I m working as a data scientist for an MNC.
I swtiched from ChemisTRY to MATHEmatically loaded field.I like to dance, 
and I      was the vice-president of the dance scoiety of my college.
My phone number is    585-334-6815.
I dont give #$%^&*   about what people think.
I just want to travel to ROME(Italy).
my website is    http://www.datasciencenovice.com
my mail    address is datasciencenovice@gmail.com
"""

In [27]:
text = clean_text(text)
print(text)

 my name is saurav prasad i have done my graduation in chemistry, and my masters is in operational research i m working as a data scientist for an mnc. i swtiched from chemistry to mathematically loaded field.i like to dance, and i was the vice-president of the dance scoiety of my college. my phone number is xxx. i dont give  about what people think. i just want to travel to romeitaly. my website is url my mail address is e_mail 
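
When the same cleaning runs over many strings, the patterns can be compiled once and applied in order. The following is a sketch of that idea, not a replacement for the function above. The names `SUBSTITUTIONS` and `clean_text_compiled` are hypothetical, and the input string is made up:

```python
import re

# Ordered (pattern, replacement) pairs mirroring the steps used above.
SUBSTITUTIONS = [
    (re.compile(r"\s+"), " "),
    (re.compile(r"http\S+"), "URL"),
    (re.compile(r"\d{3}-\d{3}-\d{4}"), "XXX"),
    (re.compile(r"\S+@gmail\.com"), "E_MAIL"),
]

def clean_text_compiled(text):
    """Hypothetical variant of clean_text that reuses precompiled patterns."""
    for pattern, replacement in SUBSTITUTIONS:
        text = pattern.sub(replacement, text)
    text = text.lower()
    text = text.replace("i've", "i have")
    return re.sub(r"[#%^&*()$]", "", text)

result = clean_text_compiled("I've a site   http://example.com and 585-334-6815")
print(result)  # i have a site url and xxx
```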
