# **Text Preprocessing in Natural Language Processing (NLP)**

We want to use machine learning models to analyze text data and predict text similarities. To prepare our dataset for this, we take some specific steps to clean the text data so that the analysis is faster and more efficient. We use **Natural Language Processing (NLP)** for this. NLP is a branch of Data Science that helps to analyze text data.

I will follow some of the steps suggested in chapter 4 of Anandarajan's book ("Practical Text Analytics" **[Link](https://link.springer.com/chapter/10.1007/978-3-319-95663-3_4)**).

*   Each instance in the collected text data is known as a document and should have a unique identifier.
*   Many of those documents together build a document collection (known as a corpus).

The text preprocessing steps proposed by this author are the following:

1.   Unitize & Tokenize
2.   Standardize & Cleanse
3.   Stop Word Removal
4.   Stem or Lemmatize



## **Some Common Text Preprocessing Steps**

Text preprocessing is an important step to take before starting the text analysis because it removes unnecessary information from the raw text data, which makes the analysis smoother. **This process usually takes longer than the analysis itself.**


## **1. Unitize and Tokenize:**


*   We choose the unit of text to analyze. It can be a word or a group of words.
*   Tokenization is the process of splitting the text into smaller units.
*   Depending on the problem that we want to solve, we can choose to use word or sentence tokenization.
*   In this case, we do not consider the grammar or the word order when we build a quantitative representation of the text data. This is known as the bag-of-words model: it represents a text by the words it contains, ignoring grammar and word order but keeping the multiplicity (count) of each word.
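
To make the bag-of-words idea concrete, here is a minimal sketch using only the Python standard library. The helper `bag_of_words` and the two sample documents are illustrative assumptions, not part of the book's method; tokens are obtained by a plain whitespace split.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Count how often each whitespace-separated token occurs, ignoring order."""
    return Counter(text.lower().split())

# Two hypothetical documents with the same words in a different order
doc_a = "the contract terms evolve and the laws evolve"
doc_b = "the laws evolve and the contract terms evolve"

bow_a = bag_of_words(doc_a)
bow_b = bag_of_words(doc_b)

print(bow_a["evolve"])  # 2: the multiplicity of each word is kept
print(bow_a == bow_b)   # True: the word order is not
```

Both documents get the same representation because only the counts are stored.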

**N-grams Model:** 

*  N-grams are tokens made of contiguous word sequences of a certain length $n$. For example, if $n=1$, we have a 1-gram (unigram), which is a token composed of one word. If $n=2$, we have 2-grams (bigrams), which are tokens composed of two consecutive words, and so on.

*  An example of a bigram: 

**Text example:** "Contract terms evolve in response to their environments, including new laws."

Here we show each token in bold text. Punctuation marks are counted as words here, so the comma and the period also form bigrams:

*  1st token: **Contract terms** evolve in response to their environments, including new laws.
*  2nd token: Contract **terms evolve** in response to their environments, including new laws.
*  3rd token: Contract terms **evolve in** response to their environments, including new laws.
*  4th token: Contract terms evolve **in response** to their environments, including new laws.
*  5th token: Contract terms evolve in **response to** their environments, including new laws.
*  6th token: Contract terms evolve in response **to their** environments, including new laws.
*  7th token: Contract terms evolve in response to **their environments**, including new laws.
*  8th token: Contract terms evolve in response to their **environments,** including new laws.
*  9th token: Contract terms evolve in response to their environments **, including** new laws.
*  10th token: Contract terms evolve in response to their environments, **including new** laws.
*  11th token: Contract terms evolve in response to their environments, including **new laws**.
*  12th token: Contract terms evolve in response to their environments, including new **laws.**
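
The twelve bigrams listed above can be reproduced with a short sketch. The helpers `tokenize` and `ngrams` are hypothetical names introduced for illustration, and the regular expression is a simple stand-in for a real tokenizer: it treats every word and every punctuation mark as one token.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split text into words and single punctuation marks (simple regex stand-in)."""
    return re.findall(r"\w+|[^\w\s]", text)

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "Contract terms evolve in response to their environments, including new laws."
tokens = tokenize(text)
bigrams = ngrams(tokens, 2)

print(len(tokens), len(bigrams))  # 13 12
print(bigrams[0])                 # ('Contract', 'terms')
print(bigrams[-1])                # ('laws', '.')
```

Calling `ngrams(tokens, 1)` gives the unigrams and `ngrams(tokens, 3)` the trigrams, so the same function covers any value of $n$.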

## **2. Standardization and Cleaning:**

In this step, we clean the tokens so that we do not run into problems when we count the multiplicity of the tokens. For example, two tokens can contain the same word, one in uppercase and the other in lowercase. They are the same word, but they would be counted as different tokens because their letter case differs. Converting tokens to a common form, for example by lowercasing them and removing special characters, is also known as normalization.

Some of the steps to consider are the following:

*  Convert the words in the text to lower case.
*  Remove numbers, punctuation, special characters, links, and hashtags, among others.
*  Remove any extra (white) space.

**Text example:** "**C**ontract terms evolve in response to their environments**,** including new laws**.**"

**Clean Text example:** "contract terms evolve in response to their environments including new laws"
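
The cleaning steps above can be sketched with regular expressions. The function name `clean_text` and the exact patterns are assumptions made for illustration; in particular, the sketch assumes plain English text, because it keeps only the letters a to z.

```python
import re

def clean_text(text: str) -> str:
    """Lowercase the text and strip links, hashtags, digits, punctuation and extra spaces."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove links
    text = re.sub(r"#\w+", " ", text)          # remove hashtags
    text = re.sub(r"[^a-z\s]", " ", text)      # remove numbers, punctuation, other symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse extra white space

raw = "Contract terms evolve in response to their environments, including new laws."
print(clean_text(raw))
# contract terms evolve in response to their environments including new laws
```

The order of the substitutions matters: links and hashtags are removed first, because stripping punctuation earlier would leave fragments such as `https` or the hashtag word in the text.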

## **3. Stop Word Removal:**


Stopwords are common words that do not add valuable information to the analysis, such as *a*, *about*, *an*, *as*, and *that*, among others.

A good list of stopwords in English (and other languages) can be found in this **[link](https://www.ranks.nl/stopwords)**.

**Text example:** "contract terms evolve **in** response **to their** environments including **new** laws"

**Clean Text example:** "contract terms evolve response environments including laws"

**Comment:**

*  Some projects have terms that occur with high frequency but do not add any value to the analysis. In this case, we build our own stopword dictionary.
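
As a rough sketch of stop word removal, the snippet below filters tokens against a set. `STOP_WORDS` is a tiny illustrative set chosen to match the example above, not the full list from the link, and the project-specific term added at the end is purely hypothetical.

```python
# Tiny illustrative stop word set (an assumption, not a complete list)
STOP_WORDS = {"a", "about", "an", "as", "in", "new", "that", "their", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word set."""
    return [token for token in tokens if token not in stop_words]

text = "contract terms evolve in response to their environments including new laws"
kept = remove_stop_words(text.split())
print(" ".join(kept))
# contract terms evolve response environments including laws

# Building our own dictionary: add a hypothetical high-frequency project term
custom_stop_words = STOP_WORDS | {"contract"}
custom_kept = remove_stop_words(text.split(), custom_stop_words)
print(" ".join(custom_kept))
# terms evolve response environments including laws
```

Using a set rather than a list keeps each membership check fast, which matters once the corpus is large.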

## **4. Stem or Lemmatize:**

We need to consider two concepts:

* **Syntax:** It studies sentence structure, grammar and parts of speech. 

* **Semantics:** It studies the meaning of the sentences. It covers synonymy (same meaning for two different words) and polysemy (single word with multiple meanings).

### **Stemming**

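* It reduces words to their root form (stem).
* This groups different forms of a word into a single token, which reduces the number of unique tokens.
* Words with the same root often, but not always, share the same meaning.
* Whether to stem depends on the project, since stemming can introduce errors and we must decide whether we are willing to accept them.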
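
To illustrate both the benefit and the risk, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm or any library stemmer; the suffix list and the function `naive_stem` are assumptions made for this example only.

```python
# Suffixes are tried in this order; the first match is removed
SUFFIXES = ("ing", "ed", "es", "s", "e")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["evolve", "evolves", "evolved", "evolving", "laws", "news"]
stems = [naive_stem(word) for word in words]
print(stems)
# ['evolv', 'evolv', 'evolv', 'evolv', 'law', 'new']
```

The four forms of "evolve" collapse into one token, which is the intended effect. The last word shows the kind of error mentioned above: "news" is reduced to "new", although the two words do not share a meaning.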

### **Lemmatization**

*  Stemming has an issue when a word has multiple meanings.
*  Lemmatization takes care of this issue by using a morphological analysis of the words.
*  Lemmatization groups tokens considering their part of speech.
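
The role of the part of speech can be shown with a toy lookup. The table `LEMMA_TABLE`, its tag names, and the function `lemmatize` are hypothetical and hold only a handful of entries; a real lemmatizer would rely on a full dictionary and a morphological analysis instead.

```python
# Toy lookup keyed by (word, part of speech); entries are illustrative only
LEMMA_TABLE = {
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("evolved", "VERB"): "evolve",
    ("laws", "NOUN"): "law",
}

def lemmatize(word: str, pos: str) -> str:
    """Return the lemma for (word, pos), or the lowercased word if it is unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())

print(lemmatize("meeting", "VERB"))  # meet
print(lemmatize("meeting", "NOUN"))  # meeting
print(lemmatize("Laws", "NOUN"))     # law
```

The same surface form "meeting" maps to two different lemmas depending on its part of speech, which a stemmer working on the characters alone cannot distinguish.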

# **Code**

Here is an example of applying tokenization in Python (please see the Text_Preprocessing_code for more examples):

```python
!pip install nltk
```

```python
# Tokenization
import nltk

# Download the Punkt tokenizer models used by sent_tokenize and word_tokenize.
# Note: recent NLTK releases may need nltk.download('punkt_tab') instead.
nltk.download('punkt')

from nltk.tokenize import sent_tokenize, word_tokenize
```

```
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
```

```python
# Example of a text with 3 sentences
text_data = "This is the first sentence. Then, it comes the second sentence. Finally, this is the last sentence."

# sent_tokenize divides the text into sentences
nltk_tokens1 = sent_tokenize(text_data)
print(nltk_tokens1)
```

```
['This is the first sentence.', 'Then, it comes the second sentence.', 'Finally, this is the last sentence.']
```

```python
# word_tokenize divides the text into words and punctuation marks
nltk_tokens2 = word_tokenize(text_data)
print(nltk_tokens2)
```

```
['This', 'is', 'the', 'first', 'sentence', '.', 'Then', ',', 'it', 'comes', 'the', 'second', 'sentence', '.', 'Finally', ',', 'this', 'is', 'the', 'last', 'sentence', '.']
```
