<p> <center> <a href="../Start_Here.ipynb">Home Page</a> </center> </p>

 
<div>
    <span style="float: left; width: 33%; text-align: left;"><a href="Overview.ipynb">Previous Notebook</a></span>
    <span style="float: left; width: 33%; text-align: center;">
        <a href="Overview.ipynb">1</a>
        <a >2</a>
        <a href="QandA_data_processing.ipynb">3</a>
        <a href="Exercise.ipynb">4</a>
        <a href="Summary.ipynb">5</a>
    </span>
    <span style="float: left; width: 33%; text-align: right;"><a href="QandA_data_processing.ipynb">Next Notebook</a></span>
</div>

# Common Preprocessing Techniques for Raw Text Data 

---

## Goal

The goal of this notebook is to learn some common NLP preprocessing techniques useful for formatting raw text data. 

## Background

We shall apply 5 preprocessing and cleaning techniques to unstructured text data. Our sample text is an example from the Natural Questions (NQ) dataset that we saw in the previous notebook, from which the `document_text` column is extracted for illustration. We treat `document_text` as a raw text passage sourced from the web, to be used to build a question-answer text file for creating a dataset in `SQUAD` json format. 

The first step is to fetch the text data and examine it.

**Fetch Text Data**

In [None]:
import gzip
import json

count = 1    # number of records to read
num_row = 0
input_file_path = '../source_code/data/v1.0-simplified_simplified-nq-train.jsonl.gz'

text_data = ''
with gzip.open(input_file_path, 'rb') as file:
    for line in file:
        # each line of the .jsonl file is one UTF-8 encoded JSON record
        data_rows = json.loads(line.decode('utf8', 'strict'))

        # keep only the raw passage text
        text_data = data_rows['document_text']

        print(data_rows)
        num_row += 1
        if num_row == count:
            break
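
The same read can be written a little more compactly. The sketch below is one possible variant, not the notebook's own method: `read_first_records` is a hypothetical helper name, and the two-row demo file holds made-up content so that the example runs without the NQ download.

```python
import gzip
import json
import os
import tempfile
from itertools import islice


def read_first_records(path, n=1):
    """Return the first n JSON records from a gzip-compressed JSON Lines file."""
    # 'rt' mode decodes bytes to text, so no manual .decode() call is needed.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in islice(handle, n)]


# Tiny stand-in file with made-up rows (an assumption for illustration only).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
    handle.write(json.dumps({"document_text": "<P> Hello world . </P>"}) + "\n")
    handle.write(json.dumps({"document_text": "<P> Second row . </P>"}) + "\n")

records = read_first_records(demo_path, n=1)
demo_text = records[0]["document_text"]
print(demo_text)
```

`islice` stops after `n` lines, so the rest of a large file is never decompressed or parsed.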


In [None]:
text_data

From the fetched data, we are only interested in `document_text` as the source of our raw text. As shown in the cell above, the text is noisy: it contains `html tags`, `symbols`, `special characters`, and `non-English` characters. Let’s start the preprocessing by removing the HTML tags:

### Removal of HTML tags

This technique is especially useful when the text data is scraped from websites. The HTML tags are removed using a regular expression.

In [None]:
import numpy as np
import pandas as pd
import re
import string

In [None]:
html_reg_exp = re.compile('<.*?>')
text_data1 = html_reg_exp.sub(r'', text_data)

text_data1
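
To see what the pattern does on a small input, here is a minimal sketch on a made-up string. `strip_tags` is a hypothetical helper, and the whitespace clean-up is an extra step that the cell above does not perform.

```python
import re

TAG_RE = re.compile(r"<.*?>")   # non-greedy: each match stops at the first '>'
SPACE_RE = re.compile(r"\s+")


def strip_tags(text):
    """Remove anything that looks like an HTML tag, then tidy the whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return SPACE_RE.sub(" ", without_tags).strip()


sample = "<H1> Email marketing </H1> <P> It is a form of <B>direct</B> marketing . </P>"
cleaned = strip_tags(sample)
print(cleaned)

# A greedy pattern would match from the first '<' to the last '>' in one go.
greedy = re.sub(r"<.*>", "", sample)
print(repr(greedy))
```

The `?` after `.*` is what keeps the text between tags. A regular expression is adequate for simple markup like this; for arbitrary web pages an HTML parser is usually the safer choice.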

Now the HTML tags have been removed. The next step is to remove URLs from the text data.

### Removal of URL

If you look closely at the output text of the previous cell, you can spot a URL like the one given below:

```bash
.......Definitions and Implementation Under the CAN -- SPAM Act ; Final Rule '' ( PDF ) . FTC.gov . May 21 , 2008 .   Retrieved from `` https://en.wikipedia.org/w/index.php?title=Email_marketing&oldid=814071202 '' Categories :   Advertising by medium   Email.......................................................................
```
We want to remove `https://en.wikipedia.org/w/index.php?title=Email_marketing&oldid=814071202` using a regular expression. Run the next cell to remove the URL.

In [None]:
url_reg_exp = re.compile(r'https?://\S+|www\.\S+')
text_data2 =  url_reg_exp.sub(r'', text_data1)

text_data2

Now we have successfully removed the URL. Note that URLs should not always be removed from text; the decision depends on what you intend to do with the text data. When creating a corpus or QA dataset in which no question requires an answer containing a URL, removing URLs is justified; otherwise, skip this step.
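
The pattern only matches text that starts with `http://`, `https://`, or `www.`. The sketch below applies the same expression to a short string; `www.example.com` is a placeholder added for illustration and does not appear in the dataset.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")

sample = (
    "Retrieved from `` https://en.wikipedia.org/w/index.php?title=Email_marketing&oldid=814071202 '' "
    "Categories : see www.example.com , FTC.gov ."
)
no_urls = URL_RE.sub("", sample)
print(no_urls)
```

Bare domains such as `FTC.gov` are left in place because they carry neither a scheme nor a `www.` prefix. If those should go as well, the pattern has to be extended for your data.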

### Removal of Punctuation

Looking through our last output, the next step is to remove unwanted punctuation while retaining commas and full stops. One way to do this is to take the Python constant `string.punctuation`, build a translation table with `str.maketrans`, and apply it with `str.translate`. However, we must be careful because `string.punctuation` contains ``` !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ ```, which includes `.` and `,`, so we use a customized punctuation string instead. Whether this technique is appropriate also depends on the use case of the text data. Let’s run the cells below to remove the unwanted punctuation.

In [None]:
PUNCT_2_RM = string.punctuation

PUNCT_2_RM

Exclude `.` and `,` from the punctuation

In [None]:
# every character in string.punctuation except '.' and ','
CUSTOM_PUNCT_2_RM = "".join(ch for ch in string.punctuation if ch not in ".,")

In [None]:
text_data3= text_data2.translate(str.maketrans('', '', CUSTOM_PUNCT_2_RM))

# alternative: strips every non-word, non-space character, including '.' and ','
#text_data3 = re.sub(r'[^\w\s]', '', text_data2)

text_data3
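
A quick check on a made-up sentence shows the effect of the translation table, including a side effect on hyphenated words. The names `keep`, `to_remove`, and `table` are chosen for this sketch only.

```python
import string

keep = ".,"
# Build the removal set from string.punctuation minus the characters to keep.
to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
table = str.maketrans("", "", to_remove)

sample = "Opt-in lists (see above), cost $13 per user!"
stripped = sample.translate(table)
print(stripped)
```

Because `-` is deleted rather than replaced by a space, `Opt-in` becomes `Optin`. Mapping such characters to a space instead is an option when word boundaries matter.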

### Removal of non-English Words

In our text data, non-English words such as `ಕನ್ನಡ `, `日本 語`, ` فارسی `, and `Русский` have to be removed because our use case considers English text only. One of the easiest ways to do that is to keep only the tokens found in the `nltk word corpus`. The `spaCy` library is another option for the same purpose. Run the cell below:

In [None]:
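# Disabled spaCy alternative, kept for reference only. `doc._.Language` is not a built-in
# spaCy attribute; a language-detection extension has to register it before this can run.
'''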
import spacy
from spacy.language import Language
nlp = spacy.load('en_core_web_sm')
#nlp.add_pipe("ner", source=spacy.load("en_core_web_sm"))
text = 'This is an english text.'
doc = nlp(text)
# document level language detection. Think of it like average language of the document!
print(doc._.Language)
'''
import string
import nltk 
nltk.download('words')
words = set(nltk.corpus.words.words())

text_data4 = " ".join(w for w in nltk.wordpunct_tokenize(text_data3) if w.lower() in words or not w.isalpha())
text_data4


From the output, we can see that the non-English words have been removed, but some unknown symbols are left behind. These symbols can be located by finding their indexes within the text and then slicing them out. However, the index returned sometimes needs adjusting by 1 to 5 positions to hit the exact boundary. There are several other ways to remove non-English words, depending on your Python experience. The next three cells show the index-based approach:

In [None]:
index_of_start_sym = text_data4.find(' াং')
index_of_start_sym

In [None]:
index_of_end_sym =text_data4.find(' ు')
index_of_end_sym

In [None]:
#12672
index_of_end_sym = index_of_end_sym + 4
text_data5 = text_data4[0:index_of_start_sym] + text_data4[index_of_end_sym:]

text_data5
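
The index-based slicing above depends on the exact sample text. One alternative, sketched here under the assumption that only ASCII characters are wanted, is to drop every non-ASCII character with a regular expression. The sample string is made up.

```python
import re

NON_ASCII_RE = re.compile(r"[^\x00-\x7F]+")

sample = "Languages ಕನ್ನಡ 日本語 فارسی Русский Edit links"
# Replace each non-ASCII run with a space, then collapse repeated whitespace.
ascii_only = re.sub(r"\s+", " ", NON_ASCII_RE.sub(" ", sample)).strip()
print(ascii_only)

# Caveat: accented Latin letters are non-ASCII too and are removed as well.
print(NON_ASCII_RE.sub("", "caf\u00e9"))
```

This needs no hand-tuned offsets, but it is blunt: names and loanwords with accents lose characters, so it only suits text where that loss is acceptable.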

For question answering purposes, further preprocessing at this juncture may distort the meaning of sentences or paragraphs in the text. It is important to understand the text content well enough to know whether the figures and numbers in it are useful (for `quantification`, `date purposes`, `percentages`, `ratio`, etc.) before removing some or all of them. The text clearly contains several runs of repeated full stops (.), which can be removed or replaced with a single one. The text could also be spell-checked to improve the correctness of its sentences and paragraphs.


Running the cell below removes runs of three full stops `...` by replacing them with nothing:

In [None]:
text_data6 = text_data5.replace('...', '')
text_data6

Running the next cell replaces three spaced full stops ` . . . ` within the text with a single `.`:

In [None]:
text_data7 = text_data6.replace('. . .', '.')
text_data7
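
The two `replace` calls above handle exactly three full stops. A single regular expression can collapse a run of any length, spaced or not, into one full stop. Note that this is a variation on the cells above rather than an equivalent: it keeps one `.` where the first cell removes `...` entirely. The sample string is made up.

```python
import re

# One full stop followed by one or more further full stops, optionally separated by whitespace.
DOT_RUN_RE = re.compile(r"\.(?:\s*\.)+")

sample = "Email marketing . . . is popular ..... with companies .. today ."
collapsed = DOT_RUN_RE.sub(".", sample)
print(collapsed)
```

A single full stop is left untouched because the pattern requires at least two. An intentional ellipsis is collapsed as well, so check whether that matters for your text.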

The text data now looks better than it did when we started. Because text preprocessing depends on the content of the text and on its use case, we briefly touch on other important text preprocessing techniques below.

## Other Text Preprocessing Techniques 

### Word Correction

Checking spelling is an important task in text preprocessing before building a dataset to train a model or run data analysis. Misspelt words, including those caused by typos, should be replaced with the correct ones. Two ways to achieve that are the `Jaccard distance method` from the `NLTK python` library and the `pyspellchecker` library. However, automatic spelling correction is not perfect. For example, you may intend to write `contain` but type `contan`; the word may then be corrected to `constant` rather than `contain`.


Please run the next two cells:

In [None]:
#credit to: https://www.geeksforgeeks.org/correcting-words-using-nltk-in-python/ where this code was adapted

import nltk
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.corpus import words

nltk.download('words')
word_list = words.words()

typo_words = 'some of the sentnce cntain some errores'.split()
for word in typo_words:
    # compare the character bigrams of the typo against every dictionary word
    # that starts with the same letter, and print the closest match
    temp = [(jaccard_distance(set(ngrams(word, 2)), set(ngrams(w, 2))), w) for w in word_list if w[0] == word[0]]
    print(min(temp, key=lambda val: val[0])[1])



**Apply spell checker**

In [None]:
from spellchecker import SpellChecker

spell = SpellChecker()
def correct_spellings(text):
    corrected_text = []
    misspelled_words = spell.unknown(text.split())
    for word in text.split():
        if word in misspelled_words:
            # correction() can return None when no candidate is found; keep the original word then
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)
        
text = "some of the sentnce cntain some errores"
correct_spellings(text)
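
The `contan` example can be worked through by hand. The sketch below uses plain Python sets instead of NLTK, with the usual Jaccard distance over character bigrams; `bigrams` and `jaccard_distance` are local helper names defined here.

```python
def bigrams(word):
    """Return the set of adjacent character pairs in a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two non-empty sets."""
    return 1 - len(a & b) / len(a | b)


typo = "contan"
distances = {
    candidate: jaccard_distance(bigrams(typo), bigrams(candidate))
    for candidate in ("contain", "constant")
}
for candidate, distance in distances.items():
    print(candidate, round(distance, 3))
```

All five bigrams of `contan` also occur in `constant`, giving a distance of 2/7, while `contain` shares only four of them, giving 3/7. The bigram method therefore prefers `constant`, even though `contain` is the more plausible intended word.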

### Removal of Emojis

When text data is scraped from social media, it is likely to contain emojis that are not needed during data analysis or training, so they should be removed. For more information on how to remove `symbols & pictures`, `transport & map symbols`, `flags (iOS)` and `Chinese char`, visit [slowkow/remove-emoji.py GitHub](https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b).

In [None]:
emojis1 = ' 🔥🔥😀 😃 😄 😁 😆 '
emojis2 = ' 😅 😂 👍👀😍😱🤪🥂'

# insert emojis1 at position 1000 and append emojis2, without dropping any of the original text
text_data8 = text_data7[:1000] + emojis1 + text_data7[1000:] + emojis2
text_data8

In [None]:
#source: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b

emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # very broad range; also covers CJK (Chinese) characters
                               u"\U0001f926-\U0001f937"  # gestures
                               u"\U00010000-\U0010ffff"  # all supplementary-plane characters
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"  # zero-width joiner
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  # variation selector-16
                               u"\u3030"
                               "]+", flags=re.UNICODE)
text_data9 = emoji_pattern.sub(r'', text_data8)

text_data9 
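
Some ranges in the pattern above are very broad, so it removes far more than emojis. If other scripts must survive, a narrower pattern is one option. The ranges below are a hand-picked assumption for this sketch and not an exhaustive emoji list; the sample string is made up.

```python
import re

NARROW_EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u200d\ufe0f"           # zero-width joiner, variation selector-16
    "]+"
)

sample = "Great offer 🔥🔥 日本語 text 😀 thanks 👍🥂"
without_emoji = re.sub(r"\s+", " ", NARROW_EMOJI_RE.sub("", sample)).strip()
print(without_emoji)

# For comparison: one of the broad ranges from the cell above also removes CJK text.
broad_range = re.compile("[\U000024C2-\U0001F251]+")
print(repr(broad_range.sub("", "日本語")))
```

Which behaviour is right depends on the use case: for English-only text the broad pattern does two jobs at once, while multilingual text needs the narrower ranges.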

### Tokenization

Tokenization is the process of breaking up a text into sentences or words. Each word or sentence in a text is considered a token. Tokenization allows a more detailed analysis of text data because the text is broken into smaller units. The examples below cover sentence tokenization and word tokenization.

**Sentence tokenization:**

In [None]:
import nltk
nltk.download('punkt')  # note: newer NLTK releases may ask for 'punkt_tab' instead

text = text_data9 

text_into_sentences = nltk.sent_tokenize(text.lower())
print("Total sentences: ", len(text_into_sentences))
text_into_sentences


Let’s take the first sentence and tokenize it into words. You can also tokenize the entire text into words.  

**Word tokenization:**

In [None]:
sentence_into_words = nltk.word_tokenize(text_into_sentences[0].lower())
print ("Total words: ",len(sentence_into_words))
sentence_into_words
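
To make the idea concrete without any downloads, here is a rough regular-expression approximation of both steps. It is not how NLTK tokenizes; `split_sentences` and `split_words` are hypothetical helpers, and the sample text is made up.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Return runs of word characters, plus each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", sentence)


sample = "Email marketing is cheap. Is it effective? Many firms think so!"
sentences = split_sentences(sample)
words_in_first = split_words(sentences[0])
print(sentences)
print(words_in_first)
```

A rule this simple splits after abbreviations such as `Dr.` as well, which is one reason to prefer a trained tokenizer like the one used in the cells above for real text.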

### Removal of stopwords

Stopwords are words that occur very frequently in a language but carry little meaning on their own for most text analysis tasks, although they still matter for part-of-speech (POS) tagging in sentences. They are usually determiners, pronouns, and some verbs and adverbs, such as `a`, `his`, `herself`, `the`, `in`, and `out`. Stopwords exist not only in English but also in other languages. The cell below lists the English stopwords.


In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')
", ".join(stopwords.words('english'))

In [None]:
#no_stopword_text = " ".join([word for word in text.split() if str(word).strip() not in stopwords.words('english')])
english_stopwords = set(stopwords.words('english'))  # build the set once instead of once per word
no_stopword_text = " ".join([word for word in str(text_into_sentences[0]).split() if word not in english_stopwords])
print('original text: {}'.format(text_into_sentences[0]))
print('stopword removed: {}'.format(no_stopword_text))
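
The filtering step itself is a set lookup per word. The sketch below shows it with a tiny hand-picked stopword set, which is an assumption for illustration and much shorter than the NLTK list; `remove_stopwords` is a hypothetical helper and the sentence is made up.

```python
# Tiny hand-picked set for illustration only.
DEMO_STOPWORDS = {"a", "an", "the", "is", "of", "in", "to", "and", "it"}


def remove_stopwords(text, stopword_set):
    """Drop every whitespace-separated word whose lowercase form is a stopword."""
    return " ".join(w for w in text.split() if w.lower() not in stopword_set)


sample = "Email marketing is the act of sending a commercial message to a group"
filtered = remove_stopwords(sample, DEMO_STOPWORDS)
print(filtered)
```

Lowercasing before the lookup matters when the text has not already been lowercased. Stopword lists often include negations such as `not`, so removing them can change the meaning of a sentence, which is worth checking for question answering data.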

### Stemming

Stemming is the process of reducing inflected words to a common stem by stripping suffixes, which mostly affects words in past and continuous forms. For example, `dancing` is reduced toward its root `dance`, although the Porter stemmer actually outputs the truncated stem `danc`. The most widely used stemming algorithm is the Porter stemmer, available in the `nltk` library. Stems are often not valid words; for example, `navigation` is reduced to `navig`. Run the cell below to see the effect of stemming on our text `(no_stopword_text)`.

In [None]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

# stem each word and join the results with single spaces
stem_words = " ".join(stemmer.stem(word) for word in no_stopword_text.split())

print('original text: {}'.format(no_stopword_text))
print('stemmed text: {}'.format(stem_words))
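
To see why stems are often not real words, consider a deliberately naive suffix stripper. This is a toy illustration, not the Porter algorithm; the suffix list and the `min_stem_len` rule are assumptions made for this sketch.

```python
SUFFIXES = ("ation", "ing", "ed", "es", "s")  # checked in this order


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided enough of the word is left."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


stems = {w: naive_stem(w) for w in ("dancing", "navigation", "emails", "sending", "sing")}
for word, stem in stems.items():
    print(word, "->", stem)
```

Blind suffix removal turns `dancing` into `danc` and `navigation` into `navig`. The length check is the only thing that stops `sing` from being cut down to `s`.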

### Lemmatization

Because stemming can produce truncated or incorrect stems, lemmatization is used to map words to their dictionary form (lemma) instead. Lemmatization uses the part of speech (POS) of a word as the context in which to lemmatize it. By default, the POS is set to noun. See [sudalairaj kumar](https://www.kaggle.com/code/sudalairajkumar/getting-started-with-text-preprocessing) on [Kaggle](https://www.kaggle.com/) for a more in-depth treatment. Now, you can run the cell below to see how our previous text is lemmatized without chopping off letters from words.

In [None]:
#Code adapted from: https://www.kaggle.com/code/sudalairajkumar/getting-started-with-text-preprocessing

from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')



lemmatizer = WordNetLemmatizer()
wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV}

pos_tagged_text = nltk.pos_tag(no_stopword_text.split())
lemma_words = " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

print('original text: {}'.format(no_stopword_text))
print('lemmatized text: {}'.format(lemma_words))
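
The role of the POS tag can be shown with a toy lookup. In the sketch below, `TOY_LEMMAS` is a hypothetical mini-dictionary standing in for WordNet, and the tags are typed by hand rather than produced by a tagger.

```python
# Same idea as wordnet_map in the cell above, written with the one-letter codes
# ('n', 'v', 'a', 'r') that WordNet uses for noun, verb, adjective and adverb.
TAG_PREFIX_TO_POS = {"N": "n", "V": "v", "J": "a", "R": "r"}

# Hypothetical mini-dictionary, for illustration only.
TOY_LEMMAS = {
    ("meeting", "v"): "meet",
    ("sent", "v"): "send",
    ("messages", "n"): "message",
    ("better", "a"): "good",
}


def toy_lemmatize(word, penn_tag):
    """Look up a lemma using the word plus a POS derived from a Penn Treebank tag."""
    pos = TAG_PREFIX_TO_POS.get(penn_tag[:1], "n")  # unknown tags fall back to noun
    return TOY_LEMMAS.get((word, pos), word)


tagged = [("meeting", "NN"), ("meeting", "VBG"), ("sent", "VBD"), ("messages", "NNS"), ("better", "JJR")]
lemmas = [toy_lemmatize(word, tag) for word, tag in tagged]
print(lemmas)
```

The same surface form `meeting` stays unchanged as a noun but maps to `meet` as a verb, which is why the tagger output is passed to the lemmatizer instead of relying on the noun default.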

We have learnt how to apply different NLP preprocessing techniques to text data. The next phase is to build a SQuAD-format JSON file from the processed text data.

## References

- https://www.geeksforgeeks.org/correcting-words-using-nltk-in-python/
- https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
- https://www.kaggle.com/code/sudalairajkumar/getting-started-with-text-preprocessing
---
## Licensing

Copyright © 2022 OpenACC-Standard.org. This material is released by OpenACC-Standard.org, in collaboration with NVIDIA Corporation, under the Creative Commons Attribution 4.0 International (CC BY 4.0). These materials include references to hardware and software developed by other entities; all applicable licensing and copyrights apply.

