## Parsing text using regular expressions

Python's re module accepts the following flags:
```
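• re.I: Ignores case when matching.
• re.L: Makes \w, \W, \b, \B and case-insensitive matching depend on the current locale (bytes patterns only).
• re.M: Makes ^ and $ match at the start and end of every line, not only at the start and end of the whole string.
• re.S: Makes the dot (.) match any character, including a newline.
• re.U: Uses Unicode matching rules for \w, \W, \b, \B, \d, \D, \s and \S. This is the default for str patterns in Python 3.
• re.X: Allows whitespace and comments inside the pattern, so the regex can be written in a more readable format.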
```
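
The flags listed above can be tried out with a short sketch like the one below. The sample string and the variable names are made up for illustration.

```python
import re

text = "First line\nsecond LINE"

# re.I: case-insensitive matching
ignore_case = re.findall(r"line", text, flags=re.I)

# re.M: ^ matches at the start of every line, not only at the start of the string
line_starts = re.findall(r"^\w+", text, flags=re.M)

# re.S: the dot also matches the newline character
across_lines = re.search(r"First.*LINE", text, flags=re.S)

# re.X: whitespace and comments inside the pattern are ignored
number = re.compile(r"""
    \d+        # integer part
    (\.\d+)?   # optional fractional part
""", flags=re.X)

print(ignore_case)                           # ['line', 'LINE']
print(line_starts)                           # ['First', 'second']
print(across_lines is not None)              # True
print(number.fullmatch("3.14") is not None)  # True
```

Without re.S, the third search returns None because the dot stops at the newline.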



Regular expressions’ functionality:
```
• Match a single character that is either a or b: Regex: [ab]
• Match any single character except a and b: Regex: [^ab]
• Match any single character in the range a to z: Regex: [a-z]
• Match any single character outside the range a to z: Regex: [^a-z]
• Match any single character in a to z or A to Z: Regex: [a-zA-Z]
• Any whitespace character: Regex: \s
• Any non-whitespace character: Regex: \S
• Any digit: Regex: \d
• Any non-digit: Regex: \D
• Any non-word character: Regex: \W
• Any word character (letter, digit or underscore): Regex: \w
• Either match a or b: Regex: (a|b)
• The occurrence of a is either zero or one: Regex: a? ; (? matches zero or one occurrence, but not more)
• The occurrence of a is zero times or more: Regex: a* ; (* matches zero or more occurrences)
• The occurrence of a is one time or more: Regex: a+ ; (+ matches one or more occurrences)
• Exactly match three occurrences of a: Regex: a{3}
• Match 3 or more consecutive occurrences of a: Regex: a{3,}
• Match between 3 and 6 consecutive occurrences of a: Regex: a{3,6}
• Start of the string: Regex: ^
• End of the string: Regex: $
• Match word boundary: Regex: \b
• Non-word boundary: Regex: \B
```
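
A few of the constructs above are sketched below with re.findall(). The sample strings and variable names are made up for illustration.

```python
import re

sample = "cat bat 2024 aaaa"

either_c_or_b = re.findall(r"[cb]at", sample)        # character set
digits = re.findall(r"\d+", sample)                  # one or more digits
exactly_three = re.findall(r"a{3}", sample)          # exactly three a's in a row
optional_u = re.findall(r"colou?r", "color colour")  # u occurs zero or one time
whole_word = re.findall(r"\bat\b", "cat at bat")     # word boundaries on both sides
first_word = re.findall(r"^\w+", sample)             # anchored at the start
last_word = re.findall(r"\w+$", sample)              # anchored at the end

print(either_c_or_b)  # ['cat', 'bat']
print(digits)         # ['2024']
print(exactly_three)  # ['aaa']
print(optional_u)     # ['color', 'colour']
print(whole_word)     # ['at']
print(first_word)     # ['cat']
print(last_word)      # ['aaaa']
```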

The re.match() and re.search() functions look for a pattern in a string.
The resulting match can then be processed according to the requirements of the application.
```
• re.match(): This checks for a match only at the beginning of the string. If it finds the pattern at the beginning
 of the input string, it returns a match object; otherwise, it returns None.
• re.search(): This checks for a match anywhere in the string. It returns a match object for the first occurrence of the pattern,
or None if the pattern does not occur. To get all occurrences, use re.findall() or re.finditer().
```
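
The difference between re.match() and re.search() can be seen in a small sketch like this one, with re.findall() included for comparison. The sample string and variable names are made up for illustration.

```python
import re

text = "regex makes parsing text easier; regex is compact"

at_start = re.match(r"regex", text)        # the pattern is at position 0
not_at_start = re.match(r"parsing", text)  # "parsing" is not at the beginning
first_hit = re.search(r"parsing", text)    # found anywhere in the string
all_hits = re.findall(r"regex", text)      # every non-overlapping occurrence

print(at_start.group(), at_start.span())  # regex (0, 5)
print(not_at_start)                       # None
print(first_hit.span())                   # (12, 19)
print(all_hits)                           # ['regex', 'regex']
```

Because both functions can return None, it is safer to check the result before calling .group() or .span() on it.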

Tokenizing

In [1]:
import re
sen='I like this book very much.'
re.split(r'\s+', sen)

['I', 'like', 'this', 'book', 'very', 'much.']
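
Splitting on whitespace leaves the full stop attached to the last token. One possible alternative, sketched here as an assumption rather than as part of the original notebook, is to collect runs of word characters with re.findall():

```python
import re

sen = 'I like this book very much.'

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped
tokens = re.findall(r'\w+', sen)

print(tokens)  # ['I', 'like', 'this', 'book', 'very', 'much']
```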

Extracting email IDs

In [2]:
doc = "For more details please mail us at: xyz@abc.com, pqr@mno.com and also at abc@edu.com.np"
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', doc)
for address in addresses:
 print(address)

xyz@abc.com
pqr@mno.com
abc@edu.com.np


Replacing email IDs

In [3]:
doc = "For more details please mail us at xyz@abc.com or you can visit at xyz@gmail.com"
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'abc@email.com', doc)
print(new_email_address)

For more details please mail us at abc@email.com or you can visit at abc@email.com
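
The pattern above captures the user name and the domain in two groups, although the replacement does not use them. A minimal sketch of how a group can be reused through a backreference is shown below; the masking string `***` and the variable name `masked` are made up for illustration.

```python
import re

doc = "For more details please mail us at xyz@abc.com or you can visit at xyz@gmail.com"

# \2 in the replacement refers to the second group, i.e. the domain part
masked = re.sub(r'([\w.-]+)@([\w.-]+)', r'***@\2', doc)

print(masked)
# For more details please mail us at ***@abc.com or you can visit at ***@gmail.com
```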


Extracting data from the ebook and performing regex

In [4]:
import re
import requests
url = 'https://www.gutenberg.org/files/2638/2638-0.txt'
# Function to download the book and extract part of its text
def get_book(url):
    # Send an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Look for the marker that ends the metadata at the beginning of the book
    start_match = re.search(r"\*\*\* The Project Gutenberg eBook .*\*\*\*", raw)
    if start_match is not None:
        print(start_match.group())
        start = start_match.end()
    else:
        # Marker not found: keep the text from the very beginning
        print("Start marker not found")
        start = 0
    # Stop at the first occurrence of "II"
    stop = re.search(r"II", raw).start()
    # Keep the text between start and stop
    text = raw[start:stop]
    return text

# Preprocessing: replace every run of characters other than letters, digits and dots with a space, then lowercase
def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()

# Call the functions above
book = get_book(url)
processed_book = preprocess(book)
print(processed_book)

Start marker not found
 the project gutenberg ebook of the idiot by fyodor dostoyevsky this ebook is for the use of anyone anywhere in the united states and most other parts of the world at no cost and with almost no restrictions whatsoever. you may copy it give it away or re use it under the terms of the project gutenberg license included with this ebook or online at www.gutenberg.org. if you are not located in the united states you will have to check the laws of the country where you are located before using this ebook. title the idiot author fyodor dostoyevsky translator eva martin release date may 2001 ebook 2638 most recently updated june 21 2021 language english character set encoding utf 8 produced by martin adamson david widger with corrections by andrew sly start of the project gutenberg ebook the idiot the idiot by fyodor dostoyevsky translated by eva martin contents part i part 


In [24]:
import re
import requests
url = 'https://www.gutenberg.org/files/2638/2638-0.txt'
# Function to download the book and extract part of its text
def get_book(url):
    raw = requests.get(url).text
    # Skip the metadata before the "*** START OF THE PROJECT GUTENBERG EBOOK ... ***" marker
    start = re.search(r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()
    # Stop at the first occurrence of the word "particular"
    stop = re.search(r"particular", raw).start()
    # Keep the text between start and stop
    text = raw[start:stop]
    return text

# Preprocessing: replace every run of characters other than letters, digits and dots with a space, then lowercase
def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+', ' ', sentence).lower()

# Call the functions above
book = get_book(url)
processed_book = preprocess(book)
print(processed_book)

 the idiot by fyodor dostoyevsky translated by eva martin contents part i part ii part iii part iv part i i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this 


Count the number of times "the" appears in the processed book

In [25]:
len(re.findall(r'the', processed_book))

8
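
The pattern r'the' also matches inside longer words such as "other". If only the whole word should be counted, word boundaries (\b) can be added. The sketch below uses a made-up sample sentence, not the book text:

```python
import re

sample = "the other day the weather was fine, then the rain came"

# Counts every occurrence of the three letters, even inside other words
substring_hits = len(re.findall(r'the', sample))

# \b on both sides restricts the match to the standalone word
whole_word_hits = len(re.findall(r'\bthe\b', sample))

print(substring_hits)   # 6
print(whole_word_hits)  # 3
```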

Replace every "i" that comes after whitespace (\s) and is followed by whitespace with "I"

In [27]:
processed_book = re.sub(r'\si\s', " I ", processed_book)
print(processed_book)

 the idiot by fyodor dostoyevsky translated by eva martin contents part I part ii part iii part iv part I i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this 
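
Because \s consumes the whitespace on both sides, two standalone "i" tokens in a row share one space and only the first is replaced. A possible alternative, sketched here on a made-up sample string, is to use a lookbehind and a lookahead, which check for whitespace without consuming it:

```python
import re

sample = "so i i think i am"

# \s on both sides consumes the surrounding spaces, so the second "i" is skipped
consuming = re.sub(r'\si\s', ' I ', sample)

# Lookbehind and lookahead check for whitespace without consuming it
lookaround = re.sub(r'(?<=\s)i(?=\s)', 'I', sample)

print(consuming)   # so I i think I am
print(lookaround)  # so I I think I am
```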


Find all occurrences of text in the format "a to z--a to z", i.e. two words consisting of letters and digits separated by a double hyphen

In [32]:
book=requests.get("https://www.gutenberg.org/files/2638/2638-0.txt").text

In [34]:
re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)

['one--the', 'away--you']
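
Note that * also allows zero characters on either side, so a double hyphen standing on its own would match as well. If a word is required on both sides, + can be used instead. A minimal sketch on a made-up sample string:

```python
import re

sample = "one--the other -- away--you"

# * allows empty words, so the bare "--" is returned too
loose = re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', sample)

# + requires at least one character on each side of the double hyphen
strict = re.findall(r'[a-zA-Z0-9]+--[a-zA-Z0-9]+', sample)

print(loose)   # ['one--the', '--', 'away--you']
print(strict)  # ['one--the', 'away--you']
```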