
# Cleaning our Text Data using Regular Expression and NLTK

- To clean our text data, let us start with a simple case in which the text file is already fairly clean. We open the file using Python's built-in open() function, which here takes two arguments: the file name and the mode.
- 'rt': the 'r' stands for read mode and the 't' for text mode.
- We could also have opened the file in binary mode by passing 'rb' in place of 'rt'.


In [15]:
file = open('poem.txt', 'rt') 
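
To see the difference between the two modes, here is a minimal sketch. It uses a throwaway file (the name `sample.txt` and its contents are made up for this illustration) and a `with` block, which closes the file automatically:

```python
import os
import tempfile

# Hypothetical sample file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("Nájera")

# Text mode: the bytes are decoded and we get a str back.
with open(path, "rt", encoding="utf-8") as f:
    as_text = f.read()

# Binary mode: nothing is decoded, we get the raw bytes.
with open(path, "rb") as f:
    as_bytes = f.read()

print(type(as_text).__name__, len(as_text))    # str 6
print(type(as_bytes).__name__, len(as_bytes))  # bytes 7
```

The lengths differ because 'á' is a single character but takes two bytes in UTF-8.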

- Next we read the contents of the file into a variable using the read() method. We could also have limited how much is read: read(33) reads at most the first 33 characters of the file. Note that in text mode the count is in characters, not bytes. The two coincide only for single-byte characters such as plain ASCII; accented letters like 'é' take more than one byte in UTF-8.

In [16]:
text = file.read()
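
As a small illustration of read() with a size argument, the sketch below uses in-memory streams from the `io` module as stand-ins for a file opened in text mode and in binary mode (an assumption made so the example runs without `poem.txt`):

```python
import io

name = "Gutiérrez"

# In-memory stand-ins for a file opened in text mode and in binary mode.
text_stream = io.StringIO(name)
byte_stream = io.BytesIO(name.encode("utf-8"))

text_chunk = text_stream.read(5)  # at most 5 characters
byte_chunk = byte_stream.read(5)  # at most 5 bytes

print(repr(text_chunk))  # 'Gutié'
print(byte_chunk)        # b'Guti\xc3'
```

The binary read stops in the middle of 'é', which is why counting in bytes is risky for text that is not pure ASCII.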

In [17]:
file.close()

In [18]:
print(text)

 by Manuel Gutiérrez Nájera

I want to die as the day declines, 
at high sea and facing the sky, 
while agony seems like a dream 
and my soul like a bird that can fly. 

To hear not, at this last moment, 
once alone with sky and sea, 
any more voices nor weeping prayers 
than the majestic beating of the waves. 

To die when the sad light retires 
its golden network from the green waves 
to be like the sun that slowly expires; 
something very luminous that fades. 

To die, and die young, before 
fleeting time removes the gentle crown, 
while life still says: "I'm yours" 
though we know with our hearts that she lies. 


In [19]:
words  = text.split()

In [20]:
print(words)

['by', 'Manuel', 'Gutiérrez', 'Nájera', 'I', 'want', 'to', 'die', 'as', 'the', 'day', 'declines,', 'at', 'high', 'sea', 'and', 'facing', 'the', 'sky,', 'while', 'agony', 'seems', 'like', 'a', 'dream', 'and', 'my', 'soul', 'like', 'a', 'bird', 'that', 'can', 'fly.', 'To', 'hear', 'not,', 'at', 'this', 'last', 'moment,', 'once', 'alone', 'with', 'sky', 'and', 'sea,', 'any', 'more', 'voices', 'nor', 'weeping', 'prayers', 'than', 'the', 'majestic', 'beating', 'of', 'the', 'waves.', 'To', 'die', 'when', 'the', 'sad', 'light', 'retires', 'its', 'golden', 'network', 'from', 'the', 'green', 'waves', 'to', 'be', 'like', 'the', 'sun', 'that', 'slowly', 'expires;', 'something', 'very', 'luminous', 'that', 'fades.', 'To', 'die,', 'and', 'die', 'young,', 'before', 'fleeting', 'time', 'removes', 'the', 'gentle', 'crown,', 'while', 'life', 'still', 'says:', '"I\'m', 'yours"', 'though', 'we', 'know', 'with', 'our', 'hearts', 'that', 'she', 'lies.']


- As we can see, the issue with a plain split() is that it separates words only on whitespace, so punctuation such as commas stays attached to the words.

- To overcome this we could use a regular expression (regex) that splits the text wherever it finds characters that are not word characters. Word characters are letters, digits and '_' (in Python 3 this includes accented letters such as 'é', not just a-z, A-Z and 0-9).
- We need to import the 're' module to use regular expressions.
- \W is a regex special sequence that matches any non-word character, and \W+ matches a run of one or more of them.

In [21]:
import re
words_using_regex_split = re.split(r'\W+', text)
print(words_using_regex_split)

['', 'by', 'Manuel', 'Gutiérrez', 'Nájera', 'I', 'want', 'to', 'die', 'as', 'the', 'day', 'declines', 'at', 'high', 'sea', 'and', 'facing', 'the', 'sky', 'while', 'agony', 'seems', 'like', 'a', 'dream', 'and', 'my', 'soul', 'like', 'a', 'bird', 'that', 'can', 'fly', 'To', 'hear', 'not', 'at', 'this', 'last', 'moment', 'once', 'alone', 'with', 'sky', 'and', 'sea', 'any', 'more', 'voices', 'nor', 'weeping', 'prayers', 'than', 'the', 'majestic', 'beating', 'of', 'the', 'waves', 'To', 'die', 'when', 'the', 'sad', 'light', 'retires', 'its', 'golden', 'network', 'from', 'the', 'green', 'waves', 'to', 'be', 'like', 'the', 'sun', 'that', 'slowly', 'expires', 'something', 'very', 'luminous', 'that', 'fades', 'To', 'die', 'and', 'die', 'young', 'before', 'fleeting', 'time', 'removes', 'the', 'gentle', 'crown', 'while', 'life', 'still', 'says', 'I', 'm', 'yours', 'though', 'we', 'know', 'with', 'our', 'hearts', 'that', 'she', 'lies', '']
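
The empty strings at both ends of the list above appear because the text starts and ends with non-word characters, so re.split() finds a separator with nothing before or after it. One way around this, sketched here on a short excerpt of the poem, is to match the words with re.findall() instead of splitting on the gaps:

```python
import re

sample = " by Manuel Gutiérrez Nájera\n\nI want to die as the day declines, \n"

# Splitting on runs of non-word characters leaves '' at both ends.
split_tokens = re.split(r"\W+", sample)

# Matching runs of word characters returns only the words.
found_tokens = re.findall(r"\w+", sample)

print(split_tokens[0] == "", split_tokens[-1] == "")  # True True
print(found_tokens[:4])  # ['by', 'Manuel', 'Gutiérrez', 'Nájera']
```

Both calls treat the apostrophe as a separator, so a contraction is still split into two tokens.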


In [22]:
words_using_regex_split.count("like")  # the string 'like' occurs 3 times

3

- However, the regex splits "I'm" into 'I' and 'm', which is not great. Instead we can combine split() with a regex substitution (re.sub()) that strips the punctuation from each token. The result we are looking for is a list of words without any whitespace or punctuation. For this we use a constant called 'string.punctuation'. Let's explore it; before using it we have to import the string module.

In [24]:
import string
print(string.punctuation)

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [26]:
re_punc = re.compile('[%s]' % re.escape(string.punctuation))

In [27]:

stripped = [re_punc.sub('', w) for w in words]

In [28]:
print(stripped)

['by', 'Manuel', 'Gutiérrez', 'Nájera', 'I', 'want', 'to', 'die', 'as', 'the', 'day', 'declines', 'at', 'high', 'sea', 'and', 'facing', 'the', 'sky', 'while', 'agony', 'seems', 'like', 'a', 'dream', 'and', 'my', 'soul', 'like', 'a', 'bird', 'that', 'can', 'fly', 'To', 'hear', 'not', 'at', 'this', 'last', 'moment', 'once', 'alone', 'with', 'sky', 'and', 'sea', 'any', 'more', 'voices', 'nor', 'weeping', 'prayers', 'than', 'the', 'majestic', 'beating', 'of', 'the', 'waves', 'To', 'die', 'when', 'the', 'sad', 'light', 'retires', 'its', 'golden', 'network', 'from', 'the', 'green', 'waves', 'to', 'be', 'like', 'the', 'sun', 'that', 'slowly', 'expires', 'something', 'very', 'luminous', 'that', 'fades', 'To', 'die', 'and', 'die', 'young', 'before', 'fleeting', 'time', 'removes', 'the', 'gentle', 'crown', 'while', 'life', 'still', 'says', 'Im', 'yours', 'though', 'we', 'know', 'with', 'our', 'hearts', 'that', 'she', 'lies']
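
The pattern above wraps the escaped punctuation characters in square brackets, so it matches any single punctuation character. As an alternative that this notebook does not use, the same stripping can be done without a regex through str.translate(). A small sketch on a few tokens from the poem, showing that both approaches give the same result:

```python
import re
import string

words = ["declines,", "sky,", "fly.", "says:", '"I\'m', 'yours"']

# Regex approach: a character class built from string.punctuation.
re_punc = re.compile("[%s]" % re.escape(string.punctuation))
stripped_regex = [re_punc.sub("", w) for w in words]

# Alternative: a translation table that deletes every punctuation character.
table = str.maketrans("", "", string.punctuation)
stripped_translate = [w.translate(table) for w in words]

print(stripped_regex)  # ['declines', 'sky', 'fly', 'says', 'Im', 'yours']
print(stripped_regex == stripped_translate)  # True
```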


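- ***string.printable*** is another constant: a string containing all the ASCII characters that are considered printable (digits, letters, punctuation and whitespace). We can use it in the same way to separate those characters from everything else.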

In [30]:
print(string.printable)

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 	



In [32]:
re_printable = re.compile('[%s]' % re.escape(string.printable))
non_printable_stripped = [re_printable.sub('', w) for w in words]
print(non_printable_stripped)

['', '', 'é', 'á', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


- As we can see, removing every character that is in string.printable leaves only the characters that are not in it, here the accented letters 'é' and 'á', and an empty string for every other word. Let us remove the empty strings.

In [38]:
non_printable_stripped1 = [i for i in non_printable_stripped if i!='' ]
print(non_printable_stripped1)

['é', 'á']
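
To go the other way and keep only the characters that are in string.printable, we can negate the character class with ^. This is a sketch under the assumption that dropping non-ASCII characters is acceptable for the task; for this poem it would damage the author's name:

```python
import re
import string

# Negated class: matches any character NOT listed in string.printable.
re_not_printable = re.compile("[^%s]" % re.escape(string.printable))

words = ["Gutiérrez", "Nájera", "declines,"]
ascii_only = [re_not_printable.sub("", w) for w in words]

print(ascii_only)  # ['Gutirrez', 'Njera', 'declines,']
```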


- Next we convert all of our words to lower case using the lower() method. Upper case works the same way with upper().

In [40]:
lowered_words = [word.lower() for word in stripped]
upper_words = [word.upper() for word in stripped]
print(lowered_words)
print(upper_words)

['by', 'manuel', 'gutiérrez', 'nájera', 'i', 'want', 'to', 'die', 'as', 'the', 'day', 'declines', 'at', 'high', 'sea', 'and', 'facing', 'the', 'sky', 'while', 'agony', 'seems', 'like', 'a', 'dream', 'and', 'my', 'soul', 'like', 'a', 'bird', 'that', 'can', 'fly', 'to', 'hear', 'not', 'at', 'this', 'last', 'moment', 'once', 'alone', 'with', 'sky', 'and', 'sea', 'any', 'more', 'voices', 'nor', 'weeping', 'prayers', 'than', 'the', 'majestic', 'beating', 'of', 'the', 'waves', 'to', 'die', 'when', 'the', 'sad', 'light', 'retires', 'its', 'golden', 'network', 'from', 'the', 'green', 'waves', 'to', 'be', 'like', 'the', 'sun', 'that', 'slowly', 'expires', 'something', 'very', 'luminous', 'that', 'fades', 'to', 'die', 'and', 'die', 'young', 'before', 'fleeting', 'time', 'removes', 'the', 'gentle', 'crown', 'while', 'life', 'still', 'says', 'im', 'yours', 'though', 'we', 'know', 'with', 'our', 'hearts', 'that', 'she', 'lies']
['BY', 'MANUEL', 'GUTIÉRREZ', 'NÁJERA', 'I', 'WANT', 'TO', 'DIE', 'AS',

# Cleaning using NLTK

Until now we have been cleaning our text using Python's split() function and regular expressions. Let us now dive into NLTK, the Natural Language Toolkit.

In the script below we use sent_tokenize to split the text into sentences. It looks for sentence boundaries such as a full stop at the end of a sentence, not for the \n escape sequence, so the line breaks inside each sentence are kept in the output.

In [47]:
import nltk
nltk.download('punkt')
from nltk import sent_tokenize
sentences = sent_tokenize(text)
print(sentences)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[' by Manuel Gutiérrez Nájera\n\nI want to die as the day declines, \nat high sea and facing the sky, \nwhile agony seems like a dream \nand my soul like a bird that can fly.', 'To hear not, at this last moment, \nonce alone with sky and sea, \nany more voices nor weeping prayers \nthan the majestic beating of the waves.', 'To die when the sad light retires \nits golden network from the green waves \nto be like the sun that slowly expires; \nsomething very luminous that fades.', 'To die, and die young, before \nfleeting time removes the gentle crown, \nwhile life still says: "I\'m yours" \nthough we know with our hearts that she lies.']
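
Notice in the output above that each sentence ends at a full stop and still contains its \n line breaks. The sketch below imitates that behaviour with a deliberately naive regex. It is only a stand-in to show the idea and is not the algorithm that sent_tokenize uses:

```python
import re

sample = (
    "and my soul like a bird that can fly. \n\n"
    "To hear not, at this last moment, \n"
    "once alone with sky and sea, "
)

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", sample.strip())

print(len(sentences))          # 2
print("\n" in sentences[1])    # True
```

A rule this simple would break on abbreviations such as "Dr.", which is one reason to prefer a dedicated sentence tokenizer.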


- word_tokenize splits the text into word tokens. It also takes care of punctuation, separating each mark into a token of its own. Contractions are split too, which is not always what we want: "I'm" becomes 'I' and "'m".

In [48]:
from nltk import word_tokenize
words_tokenized_using_nltk = word_tokenize(text)
print(words_tokenized_using_nltk)

['by', 'Manuel', 'Gutiérrez', 'Nájera', 'I', 'want', 'to', 'die', 'as', 'the', 'day', 'declines', ',', 'at', 'high', 'sea', 'and', 'facing', 'the', 'sky', ',', 'while', 'agony', 'seems', 'like', 'a', 'dream', 'and', 'my', 'soul', 'like', 'a', 'bird', 'that', 'can', 'fly', '.', 'To', 'hear', 'not', ',', 'at', 'this', 'last', 'moment', ',', 'once', 'alone', 'with', 'sky', 'and', 'sea', ',', 'any', 'more', 'voices', 'nor', 'weeping', 'prayers', 'than', 'the', 'majestic', 'beating', 'of', 'the', 'waves', '.', 'To', 'die', 'when', 'the', 'sad', 'light', 'retires', 'its', 'golden', 'network', 'from', 'the', 'green', 'waves', 'to', 'be', 'like', 'the', 'sun', 'that', 'slowly', 'expires', ';', 'something', 'very', 'luminous', 'that', 'fades', '.', 'To', 'die', ',', 'and', 'die', 'young', ',', 'before', 'fleeting', 'time', 'removes', 'the', 'gentle', 'crown', ',', 'while', 'life', 'still', 'says', ':', '``', 'I', "'m", 'yours', "''", 'though', 'we', 'know', 'with', 'our', 'hearts', 'that', 'she

- Let us now filter the punctuation out of our result. We use the isalpha() method to do that, as shown in the script below. Note that this removes not only the punctuation tokens but also tokens like "'m".

In [50]:
words_tokenized_without_punc = [word for word in words_tokenized_using_nltk if word.isalpha()]
print(words_tokenized_without_punc)

['by', 'Manuel', 'Gutiérrez', 'Nájera', 'I', 'want', 'to', 'die', 'as', 'the', 'day', 'declines', 'at', 'high', 'sea', 'and', 'facing', 'the', 'sky', 'while', 'agony', 'seems', 'like', 'a', 'dream', 'and', 'my', 'soul', 'like', 'a', 'bird', 'that', 'can', 'fly', 'To', 'hear', 'not', 'at', 'this', 'last', 'moment', 'once', 'alone', 'with', 'sky', 'and', 'sea', 'any', 'more', 'voices', 'nor', 'weeping', 'prayers', 'than', 'the', 'majestic', 'beating', 'of', 'the', 'waves', 'To', 'die', 'when', 'the', 'sad', 'light', 'retires', 'its', 'golden', 'network', 'from', 'the', 'green', 'waves', 'to', 'be', 'like', 'the', 'sun', 'that', 'slowly', 'expires', 'something', 'very', 'luminous', 'that', 'fades', 'To', 'die', 'and', 'die', 'young', 'before', 'fleeting', 'time', 'removes', 'the', 'gentle', 'crown', 'while', 'life', 'still', 'says', 'I', 'yours', 'though', 'we', 'know', 'with', 'our', 'hearts', 'that', 'she', 'lies']
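
To make the effect of isalpha() concrete, here is a small sketch on a hand-picked list of tokens taken from the word_tokenize output above. It shows which tokens are kept and which are dropped:

```python
tokens = ["says", ":", "``", "I", "'m", "yours", "''", "Gutiérrez", "fly", "."]

# isalpha() is True only when every character in the token is a letter.
kept = [t for t in tokens if t.isalpha()]
dropped = [t for t in tokens if not t.isalpha()]

print(kept)     # ['says', 'I', 'yours', 'Gutiérrez', 'fly']
print(dropped)  # [':', '``', "'m", "''", '.']
```

Accented letters count as letters, so 'Gutiérrez' is kept, while "'m" is dropped because of its apostrophe.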


### Removing stopwords from the text files


Next we learn how to remove stop words from our text using the NLTK library.

In [52]:
from nltk.corpus import stopwords
nltk.download('stopwords')
stop_words = stopwords.words('english')
print(stop_words)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'ea

In [53]:
tokenized_and_no_stopwords = [i for i in words_tokenized_without_punc if i not in stop_words]
print(tokenized_and_no_stopwords)

['Manuel', 'Gutiérrez', 'Nájera', 'I', 'want', 'die', 'day', 'declines', 'high', 'sea', 'facing', 'sky', 'agony', 'seems', 'like', 'dream', 'soul', 'like', 'bird', 'fly', 'To', 'hear', 'last', 'moment', 'alone', 'sky', 'sea', 'voices', 'weeping', 'prayers', 'majestic', 'beating', 'waves', 'To', 'die', 'sad', 'light', 'retires', 'golden', 'network', 'green', 'waves', 'like', 'sun', 'slowly', 'expires', 'something', 'luminous', 'fades', 'To', 'die', 'die', 'young', 'fleeting', 'time', 'removes', 'gentle', 'crown', 'life', 'still', 'says', 'I', 'though', 'know', 'hearts', 'lies']


- As we can see, some of the stop words have not been removed ('I' and 'To', for example). This is because the stop words in the list are all lower case. So let's convert the words to lower case first and then apply the stop word removal again.

In [54]:
tokenized_and_no_stopwords = [i.lower() for i in tokenized_and_no_stopwords]
print(tokenized_and_no_stopwords)

['manuel', 'gutiérrez', 'nájera', 'i', 'want', 'die', 'day', 'declines', 'high', 'sea', 'facing', 'sky', 'agony', 'seems', 'like', 'dream', 'soul', 'like', 'bird', 'fly', 'to', 'hear', 'last', 'moment', 'alone', 'sky', 'sea', 'voices', 'weeping', 'prayers', 'majestic', 'beating', 'waves', 'to', 'die', 'sad', 'light', 'retires', 'golden', 'network', 'green', 'waves', 'like', 'sun', 'slowly', 'expires', 'something', 'luminous', 'fades', 'to', 'die', 'die', 'young', 'fleeting', 'time', 'removes', 'gentle', 'crown', 'life', 'still', 'says', 'i', 'though', 'know', 'hearts', 'lies']


In [56]:
tokenized_and_no_stopwords = [ word for word in tokenized_and_no_stopwords if word not in stop_words]
print(tokenized_and_no_stopwords)

['manuel', 'gutiérrez', 'nájera', 'want', 'die', 'day', 'declines', 'high', 'sea', 'facing', 'sky', 'agony', 'seems', 'like', 'dream', 'soul', 'like', 'bird', 'fly', 'hear', 'last', 'moment', 'alone', 'sky', 'sea', 'voices', 'weeping', 'prayers', 'majestic', 'beating', 'waves', 'die', 'sad', 'light', 'retires', 'golden', 'network', 'green', 'waves', 'like', 'sun', 'slowly', 'expires', 'something', 'luminous', 'fades', 'die', 'die', 'young', 'fleeting', 'time', 'removes', 'gentle', 'crown', 'life', 'still', 'says', 'though', 'know', 'hearts', 'lies']
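
The lowercasing and the stop word filter can also be done in a single pass. The sketch below assumes a small hand-picked subset of the NLTK stop word list so that it runs without downloading the corpus, and stores it in a set, which makes each membership check fast:

```python
# Hypothetical subset of NLTK's English stop words, for illustration only.
stop_words = {"i", "to", "the", "and", "as", "at", "by", "my", "a", "that", "while"}

tokens = ["I", "want", "To", "die", "as", "the", "day", "declines"]

# Lowercase each token, then keep it only if it is not a stop word.
filtered = [t.lower() for t in tokens if t.lower() not in stop_words]

print(filtered)  # ['want', 'die', 'day', 'declines']
```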


- And there it is! We have removed all the unnecessary words.

- As we have seen, the pipeline for cleaning the text file is as follows:
<ol>
  <li>load the raw text</li>
  <li>split the tokens</li>
  <li>convert to lowercase</li>
  <li>remove punctuation from the tokens</li>
  <li>filter out remaining tokens that are not alphabetic</li>
  <li>filter out tokens that are stopwords</li>
</ol>
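
The steps above can be sketched as a single function. This version makes two simplifying assumptions so that it runs with the standard library alone: tokens are split on whitespace instead of with word_tokenize, and `STOP_WORDS` is a small hand-picked subset of the NLTK list. The names `clean_text`, `STOP_WORDS` and `RE_PUNC` are made up for this example:

```python
import re
import string

# Hypothetical subset of NLTK's English stop words, for illustration only.
STOP_WORDS = frozenset({"to", "and", "before", "the", "a", "i", "while"})
RE_PUNC = re.compile("[%s]" % re.escape(string.punctuation))


def clean_text(raw, stop_words=STOP_WORDS):
    """Apply the cleaning steps to a raw string and return the tokens."""
    tokens = raw.split()                               # split the tokens
    tokens = [t.lower() for t in tokens]               # convert to lowercase
    tokens = [RE_PUNC.sub("", t) for t in tokens]      # remove punctuation
    tokens = [t for t in tokens if t.isalpha()]        # keep alphabetic tokens
    return [t for t in tokens if t not in stop_words]  # drop stop words


cleaned = clean_text("To die, and die young, before \nfleeting time removes the gentle crown, ")
print(cleaned)
# ['die', 'die', 'young', 'fleeting', 'time', 'removes', 'gentle', 'crown']
```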

# Putting everything we have learned into one script for a clean text document

In [67]:
# importing the necessary libraries
import re
import string
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

In [68]:
nltk.download('stopwords')
stop_words = stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [71]:
# the with block closes the file automatically
with open('poem.txt', 'rt') as test_file:
    script = test_file.read()

words = word_tokenize(script)
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
punc_stripped_script = [re_punc.sub('',w) for w in words]
low_punc_strip_tok_script = [word.lower() for word in punc_stripped_script ]
low_punc_strip_tok_script = [word for word in low_punc_strip_tok_script if word!='']
nostop_low_nopunc_tok = [ word for word in low_punc_strip_tok_script if word not in stop_words]
print(len(words))
print(len(nostop_low_nopunc_tok))


131
61
