### Example: Regular Expressions

In [1]:
import re

##### Let us examine a use case for split()

Sometimes an input text document is very large, and we need to
break it down into sentences for downstream NLP tasks.
We can try to split such documents on white space that follows common
sentence-ending punctuation (e.g. a full stop or a question mark).
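To see why a plain punctuation-based split is not enough, here is a minimal sketch. The `sample` text and the `naive` pattern are made up for illustration and are not part of the notebook's data. The pattern splits on any white space that follows a full stop or question mark, so the title "Mr." is wrongly treated as the end of a sentence.

```python
import re

# Illustrative sample text (an assumption, not the notebook's input).
sample = "Mr. Smith paid 1.5 million dollars. Did he mind?"

# Naive rule: split on any whitespace preceded by "." or "?".
naive = re.split(r"(?<=[.?])\s", sample)
print(naive)
# ['Mr.', 'Smith paid 1.5 million dollars.', 'Did he mind?']
```

The decimal in "1.5" survives only because no white space follows the dot; abbreviations and titles need extra rules.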

In [2]:
string = """Mr. Smith bought cheapsite.com for 1.5 million dollars. 
He paid a lot for it. Did he mind? Adam Jones thinks he didn't. 
In any case, this isn't true... Well, with a probability of .9, it isn't! Do you agree?"""

In [3]:
# Raw string, so that \w, \. and \s reach the regex engine unchanged
pattern = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"

In [4]:
result = re.split(pattern, string)
print(result)

['Mr. Smith bought cheapsite.com for 1.5 million dollars.', '\nHe paid a lot for it.', 'Did he mind?', "Adam Jones thinks he didn't.", "\nIn any case, this isn't true...", "Well, with a probability of .9, it isn't! Do you agree?"]
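The pattern is dense, so it may help to read it piece by piece. The sketch below writes the same pattern with `re.VERBOSE`, which lets each part carry a comment. The name `verbose_pattern` and the short test sentences are assumptions introduced here for illustration.

```python
import re

verbose_pattern = re.compile(r"""
    (?<!\w\.\w.)        # not preceded by an abbreviation such as 'e.g.'
    (?<![A-Z][a-z]\.)   # not preceded by a title such as 'Mr.' or 'Dr.'
    (?<=\.|\?)          # preceded by a full stop or a question mark
    \s                  # the single whitespace character to split on
""", re.VERBOSE)

print(verbose_pattern.split("Mr. Smith left. He came back? Yes."))
# ['Mr. Smith left.', 'He came back?', 'Yes.']

print(verbose_pattern.split("Use tools, e.g. regex. Done."))
# ['Use tools, e.g. regex.', 'Done.']
```

All three lookbehinds are zero-width, so only the white space itself is consumed by the split and the punctuation stays attached to its sentence.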


#### Let us see how to apply the compile() method with split()

In [5]:
pattern = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s"
compiled_pattern = re.compile(pattern)

In [6]:
result = compiled_pattern.split(string)
print(result)

['Mr. Smith bought cheapsite.com for 1.5 million dollars.', '\nHe paid a lot for it.', 'Did he mind?', "Adam Jones thinks he didn't.", "\nIn any case, this isn't true...", "Well, with a probability of .9, it isn't! Do you agree?"]
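Compiling is mostly a convenience when the same pattern is applied many times, since the compiled object can be reused. A minimal sketch, assuming a small hypothetical list of `documents` (not part of the notebook's data):

```python
import re

# Compile once, then reuse the same object for every document.
sentence_boundary = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s")

# Hypothetical documents, for illustration only.
documents = ["Dr. Lee arrived. She was late.", "Is it true? Yes."]

sentences_per_doc = [sentence_boundary.split(doc) for doc in documents]
print(sentences_per_doc)
# [['Dr. Lee arrived.', 'She was late.'], ['Is it true?', 'Yes.']]
```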


In [11]:

new_result = []
for sentence in result:
    # Remove the newline left over at the start of some sentences
    cleaned = re.sub(r"\n", "", sentence)
    new_result.append(cleaned)

print(new_result)

['Mr. Smith bought cheapsite.com for 1.5 million dollars.', 'He paid a lot for it.', 'Did he mind?', "Adam Jones thinks he didn't.", "In any case, this isn't true...", "Well, with a probability of .9, it isn't! Do you agree?"]
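For this kind of cleanup a regular expression is not strictly required. As an alternative sketch, `str.strip()` in a list comprehension removes leading and trailing white space from each item. Unlike the `re.sub` call, it leaves any newline in the middle of a sentence untouched. The `raw_sentences` list is a made-up stand-in for the split result.

```python
# Made-up stand-in for the output of the split step.
raw_sentences = ["Mr. Smith paid a lot.", "\nHe did not mind.", "\nDid he?"]

# strip() trims whitespace (including "\n") at both ends of each string.
clean_sentences = [s.strip() for s in raw_sentences]
print(clean_sentences)
# ['Mr. Smith paid a lot.', 'He did not mind.', 'Did he?']
```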


In [9]:
for sentence in result:
    print(sentence)

Mr. Smith bought cheapsite.com for 1.5 million dollars.

He paid a lot for it.
Did he mind?
Adam Jones thinks he didn't.

In any case, this isn't true...
Well, with a probability of .9, it isn't! Do you agree?
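Note that the last item still holds two sentences: the pattern only looks for a full stop or a question mark before the white space, so "isn't!" is not treated as a sentence end. One possible extension, which is an assumption here and not part of the original pattern, is to add the exclamation mark to the final lookbehind:

```python
import re

# Same pattern as above, with "!" added to the sentence-ending characters.
extended_pattern = re.compile(r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.?!])\s")

last_item = "Well, with a probability of .9, it isn't! Do you agree?"
extended_result = extended_pattern.split(last_item)
print(extended_result)
# ["Well, with a probability of .9, it isn't!", 'Do you agree?']
```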
