# <center> <font size = 24 color = 'steelblue'> <b>Text Pre-Processing

## Overview:

The goal is to understand and implement the essential steps of text preprocessing using Python. The notebook covers data cleaning tasks such as tokenization, case normalization, spelling correction, POS tagging, and NER. It also covers stemming, lemmatization, and the removal of noise such as stopwords, URLs, punctuation, and emoticons, to prepare text data for NLP applications.

<div class="alert alert-block alert-info">
    
<font size = 4> 

**By the end of this notebook you will be able to:**
- Understand the steps involved in text preprocessing
- Implement text preprocessing using Python

# <a id= 't0'> 
<font size = 4>
    
**Table of contents:**<br>
[1. Installation and import of necessary packages](#t1)<br>
[2. Download the necessary corpus from NLTK](#t2)<br>
[3. Data cleaning steps](#t3)<br>
> [3.1 Tokenization](#t3.1)<br>
> [3.2 Changing case](#t3.2)<br>
> [3.3 Spelling correction](#t3.3)<br>
> [3.4 POS Tagging](#t3.4)<br>
> [3.5 Named entity recognition (NER)](#t3.5)<br>
> [3.6 Stemming and Lemmatization](#t3.6)<br>
>> [a. Stemming](#3a)<br>
>> [b. Lemmatization](#3b)<br>

> [3.7 Noise entity removal](#t3.7)<br>
>> [a. Remove stopwords](#a)<br>
>> [b. Remove urls](#b)<br>
>> [c. Remove punctuations](#c)<br>
>> [d. Remove emoticons](#d)<br>

##### <a id = 't1'>
<font size = 10 color = 'midnightblue'> <b>Installation and import of necessary packages

In [1]:
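!pip install nltk==3.8.1
!pip install spacy==3.5.1
# `re` and `string` are part of the Python standard library and need no installation
!python -m spacy download en_core_web_sm
!pip install svgling
import spacy.cli
# Download the large English model that is loaded later with spacy.load
spacy.cli.download("en_core_web_lg")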


In [2]:
import nltk
import spacy
import re
from string import punctuation

[top](#t0)

##### <a id = 't2'>
<font size = 10 color = 'midnightblue'> <b>Download necessary corpus and models from nltk

In [3]:
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
nltk.download('words')
nltk.download('maxent_ne_chunker')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /voc/work/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /voc/work/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /voc/work/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package words to /voc/work/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /voc/work/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package stopwords to /voc/work/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /voc/work/nltk_data...


True

<div class="alert alert-block alert-info">
    
<font size = 4> 

**Note:**
    
- A `LookupError` is raised whenever a required corpus or model is missing, for example when a function depends on a resource that has not been downloaded.
- Use `nltk.download(<name of the corpus/model>)` to download the missing resource.



<font size = 6 color = seagreen> <b> Import the necessary corpus

In [4]:
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

##### <a id = 't3'>
<font size = 10 color = 'midnightblue'> <b> Data cleaning steps

<font size = 6 color = seagreen> <b> <center> Let's start by defining a custom text for preprocessing.<br>
<font size = 6 color = seagreen> <center>This text contains emoticons, punctuation, URLs, etc.

In [5]:
text = """Embracing life's challenges is like navigating a journey. 🚀
Stay motivated, overcome hurdles, and explore new paths to success!
Check out inspiring stories at https://motivationalhub.com for an extra boost!"""
print(text)

Embracing life's challenges is like navigating a journey. 🚀
Stay motivated, overcome hurdles, and explore new paths to success!
Check out inspiring stories at https://motivationalhub.com for an extra boost!


[top](#t0)

<a id = 't3.1'>
<font size = 6 color = pwdrblue>  <b>Tokenization 

<div class="alert alert-block alert-success" style="font-size: 16px;"> <!-- Set font size using CSS -->
    <ul>
        <li>Tokenization is the process of breaking down text into smaller components, known as tokens.</li>
        <li>Tokens can be:
            <ul>
                <li>Words</li>
                <li>Phrases</li>
                <li>Symbols</li>
                <li>Other meaningful elements</li>
            </ul>
        </li>
        <li>Tokenization is a foundational step in natural language processing (NLP).</li>
        <li>It transforms unstructured text into a structured format that algorithms can analyze and manipulate.</li>
        <li> Sentence tokenization further divides text into individual sentences for better context understanding.</li> <!-- Added line for sentence tokenization -->
    </ul>
</div>


</div>


![Image Description](https://labcontent.simplicdn.net/data-content/content-assets/Data_and_AI/NLP/tokenization.png)


<font size = 5 color = seagreen>  <b>Word tokenization

In [6]:
word_tokens = nltk.word_tokenize(text)
print(word_tokens)

['Embracing', 'life', "'s", 'challenges', 'is', 'like', 'navigating', 'a', 'journey', '.', '🚀', 'Stay', 'motivated', ',', 'overcome', 'hurdles', ',', 'and', 'explore', 'new', 'paths', 'to', 'success', '!', 'Check', 'out', 'inspiring', 'stories', 'at', 'https', ':', '//motivationalhub.com', 'for', 'an', 'extra', 'boost', '!']


<font size = 5 color = seagreen>  <b>Sentence tokenization

In [7]:
sentences = nltk.sent_tokenize(text)
for i in range(len(sentences)):
    print(f"{i}:  {sentences[i]}")

0:  Embracing life's challenges is like navigating a journey.
1:  🚀
Stay motivated, overcome hurdles, and explore new paths to success!
2:  Check out inspiring stories at https://motivationalhub.com for an extra boost!
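
To give a rough feel for the decisions a sentence tokenizer has to make, here is a naive rule-based splitter. It is only an illustrative sketch (the function name `naive_sent_split` is hypothetical) and is not how `nltk.sent_tokenize` works internally.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (illustrative only)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sample = "Stay motivated! Explore new paths. Visit https://motivationalhub.com today."
for i, sentence in enumerate(naive_sent_split(sample)):
    print(f"{i}:  {sentence}")

# Abbreviations show the limits of this rule: "Inc." is treated as a sentence end.
print(naive_sent_split("Apple Inc. announced it."))
```

The dot inside the URL is not followed by whitespace, so the URL stays intact, but abbreviations such as "Inc." are split incorrectly, which is one reason to prefer a trained tokenizer.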


<div class="alert alert-block alert-success">
<font size = 4> 

**However, `nltk.word_tokenize` can split URLs into several tokens (as in the word tokenization output above), which complicates later cleaning steps. A simple whitespace split with `str.split()` can be more helpful in this context.**


In [8]:
word_tokens = text.split()
print(word_tokens)

['Embracing', "life's", 'challenges', 'is', 'like', 'navigating', 'a', 'journey.', '🚀', 'Stay', 'motivated,', 'overcome', 'hurdles,', 'and', 'explore', 'new', 'paths', 'to', 'success!', 'Check', 'out', 'inspiring', 'stories', 'at', 'https://motivationalhub.com', 'for', 'an', 'extra', 'boost!']


<div class="alert alert-block alert-success">
<font size = 4> 

- <b>This also creates word tokens but keeps emoticons, URLs, user handles, and hashtags together for further analysis. Note that punctuation stays attached to the neighbouring word (for example `'journey.'`).
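
A middle ground between the two approaches is a small regex tokenizer that keeps URLs whole while still separating punctuation from words. The pattern and the name `regex_tokenize` below are assumptions made for illustration, not part of NLTK.

```python
import re

# Order matters: URLs first, then words (with an optional apostrophe part),
# then any single remaining non-space character (punctuation, emoji).
TOKEN_RE = re.compile(r"https?://\S+|\w+(?:'\w+)?|\S")

def regex_tokenize(text):
    return TOKEN_RE.findall(text)

sample = "Check out life's stories at https://motivationalhub.com 🚀 today!"
print(regex_tokenize(sample))
```

One caveat of this sketch: punctuation that directly follows a URL (for example a trailing comma) is captured as part of the URL, because `\S+` runs until the next whitespace.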


[top](#t0)

<a id = 't3.2'>
<font size = 6 color = pwdrblue>  <b>Changing the case.

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Changing the case is a text normalization step.
- It gives every word a uniform representation and reduces the vocabulary size.
- Consistent casing also eases text matching, entity recognition, search, and retrieval.
- Normalizing the case reduces redundancy and helps the ML model generalize better.
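
The effect on vocabulary size can be seen on a tiny made-up sentence (the sample string is an assumption for illustration):

```python
tokens = "Stay motivated and stay curious Stay".split()

vocab_original = set(tokens)                 # 'Stay' and 'stay' count separately
vocab_lowered = {t.lower() for t in tokens}  # both collapse to 'stay'

print(len(vocab_original), len(vocab_lowered))
```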

In [9]:
words_lower_case = text.lower().split()
print(words_lower_case)

['embracing', "life's", 'challenges', 'is', 'like', 'navigating', 'a', 'journey.', '🚀', 'stay', 'motivated,', 'overcome', 'hurdles,', 'and', 'explore', 'new', 'paths', 'to', 'success!', 'check', 'out', 'inspiring', 'stories', 'at', 'https://motivationalhub.com', 'for', 'an', 'extra', 'boost!']


[top](#t0)

<a id = 't3.3'>
<font size = 6 color = pwdrblue>  <b>Spelling correction

<div class="alert alert-block alert-success">
<font size = 4> 

- Spelling correction improves text quality and avoids miscommunication.
- It supports language models and embeddings by mapping misspelled words to known vocabulary.
- It helps reduce ambiguity and handle out-of-vocabulary words.

**For spelling correction we use:**
 - `nltk.edit_distance`, which calculates the Levenshtein edit distance between two strings.
 - Each token in the text is compared against the English vocabulary available in NLTK, and the vocabulary word with the smallest distance is taken as the correction.
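
To make the distance measure concrete, here is a minimal pure-Python sketch of the Levenshtein distance using dynamic programming. The function name `levenshtein` is hypothetical; in the notebook itself `nltk.edit_distance` plays this role.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute (or match)
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
print(levenshtein("paths", "patas"))
```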

In [10]:
# Tokenize the text
word_tokens = text.lower().split()

In [11]:
# Get list of English words
words = nltk.corpus.words.words()

In [12]:
print("Total number of words in the vocabulary : ", len(words))

Total number of words in the vocabulary :  236736


In [13]:
# Correct spelling of each word
corrected_tokens = []
for token in word_tokens:
    # Find the word with the lowest distance and replace it
    corrected_token = min(words, key=lambda x: nltk.edit_distance(x, token))
    corrected_tokens.append(corrected_token)
print("Corrected tokens:", corrected_tokens)

Corrected tokens: ['embracing', 'life', 'challenge', 'is', 'like', 'navigation', 'a', 'journey', 'A', 'stay', 'motivate', 'overcome', 'hurdies', 'and', 'explore', 'new', 'patas', 'to', 'success', 'check', 'out', 'inspiring', 'storied', 'at', 'motivational', 'for', 'an', 'extra', 'boost']
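
The output above shows the limits of the brute-force approach: valid words that are missing from the vocabulary are changed (`'paths'` becomes `'patas'`), and even the emoji is replaced by `'A'`. A more cautious variant could leave non-alphabetic tokens and known words untouched and only accept sufficiently close matches. The sketch below uses the standard library's `difflib.get_close_matches`, which ranks by a similarity ratio rather than Levenshtein distance, together with a tiny hypothetical vocabulary; both are assumptions for illustration.

```python
import difflib

# Tiny illustrative vocabulary (hypothetical); in the notebook this would be
# the much larger NLTK word list.
VOCAB = ["journey", "success", "explore", "motivated", "paths", "stories"]
VOCAB_SET = set(VOCAB)

def correct_token(token, cutoff=0.8):
    # Leave URLs, emoji, numbers and punctuated tokens alone.
    if not token.isalpha():
        return token
    # Known words need no correction.
    if token in VOCAB_SET:
        return token
    # Accept the best candidate only if it is similar enough.
    matches = difflib.get_close_matches(token, VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else token

tokens = ["jurney", "paths", "🚀", "https://motivationalhub.com", "xyzzy"]
print([correct_token(t) for t in tokens])
```

The `cutoff` value of 0.8 is an arbitrary choice here; a lower value corrects more aggressively at the risk of changing words that were already right.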


[top](#t0)

<a id = 't3.4'>
<font size = 6 color = pwdrblue>  <b>POS Tagging

<div class="alert alert-block alert-success">
<font size = 4> 
    
Part-of-Speech tagging involves assigning words in a text corpus to specific parts of speech based on their definitions and contextual usage.

In [14]:
# Tokenize the text
word_tokens = text.split()

In [15]:
# Part-of-speech tagging can be done using pos_tag function of nltk.
tagged = nltk.pos_tag(word_tokens)

In [16]:
print(tagged)

[('Embracing', 'VBG'), ("life's", 'NN'), ('challenges', 'NNS'), ('is', 'VBZ'), ('like', 'IN'), ('navigating', 'VBG'), ('a', 'DT'), ('journey.', 'NN'), ('🚀', 'NNP'), ('Stay', 'NNP'), ('motivated,', 'VBZ'), ('overcome', 'JJ'), ('hurdles,', 'NN'), ('and', 'CC'), ('explore', 'VB'), ('new', 'JJ'), ('paths', 'NNS'), ('to', 'TO'), ('success!', 'VB'), ('Check', 'NNP'), ('out', 'RP'), ('inspiring', 'VBG'), ('stories', 'NNS'), ('at', 'IN'), ('https://motivationalhub.com', 'NN'), ('for', 'IN'), ('an', 'DT'), ('extra', 'JJ'), ('boost!', 'NN')]
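
Once tokens are tagged, the tags can be used to filter or summarise the text. The sketch below works on a few `(token, tag)` pairs copied from the output above, so it runs without NLTK; the variable names are hypothetical.

```python
from collections import Counter

# A few (token, tag) pairs copied from the tagger output above.
tagged_sample = [("Embracing", "VBG"), ("challenges", "NNS"), ("is", "VBZ"),
                 ("a", "DT"), ("new", "JJ"), ("paths", "NNS"), ("extra", "JJ")]

# Keep tokens whose Penn Treebank tag starts with "NN" (nouns).
nouns = [word for word, tag in tagged_sample if tag.startswith("NN")]

# Count how often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged_sample)

print(nouns)
print(tag_counts)
```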


[top](#t0)

<a id = 't3.5'>
<font size = 6 color = pwdrblue>  <b>Named Entity Recognition (NER)

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Named entity recognition (NER) is a natural language processing (NLP) technique that involves identifying and classifying entities (objects, places, people, organizations, dates, monetary values, percentages, etc.) in text.
- Named entities can belong to various categories, such as:

|**Entity Object**| **Meaning** |
|-|-|
|Person |Individual names of people.|
|Location| Places, cities, countries, etc.|
|Organization | Names of companies, institutions, etc.|
|Date | Temporal expressions like dates and times.|
|Money| Currency amounts.|
|Percent| Percentage values.|

<font size = 5 color = seagreen> <b><center> Let's consider a different example text to understand named entity recognition

In [17]:
text_example = "In 2019, Apple Inc. announced the launch of the iPhone 11 at their headquarters in Cupertino, California, with Tim Cook, the CEO, presenting the new features."
print(text_example)

In 2019, Apple Inc. announced the launch of the iPhone 11 at their headquarters in Cupertino, California, with Tim Cook, the CEO, presenting the new features.


In [18]:
# tokenize the text
word_tokens = text_example.split()

In [19]:
# get the pos tags
tagged = nltk.pos_tag(word_tokens)
print(tagged)

[('In', 'IN'), ('2019,', 'CD'), ('Apple', 'NNP'), ('Inc.', 'NNP'), ('announced', 'VBD'), ('the', 'DT'), ('launch', 'NN'), ('of', 'IN'), ('the', 'DT'), ('iPhone', 'NN'), ('11', 'CD'), ('at', 'IN'), ('their', 'PRP$'), ('headquarters', 'NNS'), ('in', 'IN'), ('Cupertino,', 'NNP'), ('California,', 'NNP'), ('with', 'IN'), ('Tim', 'NNP'), ('Cook,', 'NNP'), ('the', 'DT'), ('CEO,', 'NNP'), ('presenting', 'VBG'), ('the', 'DT'), ('new', 'JJ'), ('features.', 'NN')]


In [20]:
named_entities = nltk.ne_chunk(tagged)
print(named_entities)

(S
  In/IN
  2019,/CD
  Apple/NNP
  Inc./NNP
  announced/VBD
  the/DT
  launch/NN
  of/IN
  the/DT
  (ORGANIZATION iPhone/NN)
  11/CD
  at/IN
  their/PRP$
  headquarters/NNS
  in/IN
  Cupertino,/NNP
  California,/NNP
  with/IN
  (PERSON Tim/NNP)
  Cook,/NNP
  the/DT
  CEO,/NNP
  presenting/VBG
  the/DT
  new/JJ
  features./NN)


<font size = 5 color = seagreen> <b>Named entity recognition can also be implemented using the spaCy package

In [21]:
# Load the pre-trained English language model
nlp = spacy.load("en_core_web_lg")

In [22]:
# Run the pipeline on the text to create a Doc object
doc = nlp(text_example)

In [23]:
# Extract named entities
entities = [(ent.text, ent.label_) for ent in doc.ents]

In [24]:
# Print the named entities
print(entities)

[('2019', 'DATE'), ('Apple Inc.', 'ORG'), ('11', 'CARDINAL'), ('Cupertino', 'GPE'), ('California', 'GPE'), ('Tim Cook', 'PERSON')]
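
A common next step is to group the extracted entities by label. The following sketch uses the `(text, label)` pairs copied from the spaCy output above, so it does not need the model to be loaded; the variable names are hypothetical.

```python
from collections import defaultdict

# (text, label) pairs copied from the spaCy output above.
entities_sample = [("2019", "DATE"), ("Apple Inc.", "ORG"), ("11", "CARDINAL"),
                   ("Cupertino", "GPE"), ("California", "GPE"), ("Tim Cook", "PERSON")]

# Collect the entity texts under their label.
by_label = defaultdict(list)
for ent_text, label in entities_sample:
    by_label[label].append(ent_text)

print(dict(by_label))
```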


[top](#t0)

<a id = 't3.6'>
<font size = 6 color = pwdrblue>  <b>Stemming and Lemmatization

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Stemming and lemmatization are techniques used in NLP and text mining to reduce words to their base or root forms, simplifying the process of analysis and text understanding.

In [25]:
text = """Embracing life's challenges is like navigating a journey. 🚀
Stay motivated, overcome hurdles, and explore new paths to success!
Check out inspiring stories at https://motivationalhub.com for an extra boost!"""

<font size = 5 color = seagreen> <b>Let's start by tokenising the text

In [26]:
word_tokens = text.lower().split()

<a id = '3a'>
<font size = 5 color = seagreen> <b> Stemming

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Stemming is the process of removing suffixes or prefixes from words to obtain their root or base form, known as the stem. The goal is to reduce related words to a common form, even if that form is not a valid word.
- The Porter stemmer is one of the most widely used stemming techniques.

In [27]:
# create stemmer object
stemmer = nltk.stem.PorterStemmer()

In [28]:
# stem each token
stemmed_tokens = [stemmer.stem(token) for token in word_tokens]

In [29]:
print("Stemmed tokens:", stemmed_tokens)

Stemmed tokens: ['embrac', "life'", 'challeng', 'is', 'like', 'navig', 'a', 'journey.', '🚀', 'stay', 'motivated,', 'overcom', 'hurdles,', 'and', 'explor', 'new', 'path', 'to', 'success!', 'check', 'out', 'inspir', 'stori', 'at', 'https://motivationalhub.com', 'for', 'an', 'extra', 'boost!']
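
In the output above, tokens with trailing punctuation such as `'motivated,'` and `'hurdles,'` are left unstemmed, because the suffix rules no longer see the word ending. The toy suffix stripper below illustrates the idea of rule-based stemming and of stripping punctuation first. It is a deliberately simplified sketch with a made-up rule set, not the Porter algorithm.

```python
from string import punctuation

SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order (illustrative rule set)

def toy_stem(token):
    word = token.strip(punctuation)  # drop leading/trailing punctuation first
    for suffix in SUFFIXES:
        # Require a stem of at least three characters to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(t) for t in ["navigating", "motivated,", "hurdles,", "paths", "is"]])
```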


<a id = '3b'>
<font size = 5 color = seagreen> <b> Lemmatization

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Lemmatization is the process of reducing words to their base or dictionary form, known as the lemma.
- Lemmatization considers the context and meaning of a word and produces valid words.
- NLTK provides a WordNet-based lemmatizer.



<div class="alert alert-block alert-info" style="font-size: 16px; margin-top: 20px;">
    <h4>What is WordNet?</h4>
    <p>WordNet is a large lexical database of English, where words are grouped into sets of synonyms called synsets. Each synset contains a word or phrase and provides definitions and examples of usage. It is widely used in natural language processing (NLP) for tasks such as semantic analysis and word sense disambiguation.</p>
    <p>NLTK provides a WordNet-based lemmatizer that utilizes this resource to reduce words to their base or dictionary form (lemma), which is essential for tasks like text normalization.</p>
</div>


In [30]:
# Create lemmatizer object
lemmatizer = nltk.stem.WordNetLemmatizer()

In [None]:
# Lemmatize each token
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in word_tokens]

In [None]:
print("Lemmatized tokens:", lemmatized_tokens)

<div class="alert alert-block alert-success">
<font size = 4> 
    
**Using PoS tagging in lemmatization**
  - To implement PoS-tag-based lemmatization, we pass the PoS tag of each word to the lemmatizer.
  - To achieve this, we first need to map the Penn Treebank PoS tags to WordNet PoS tags.
  - The function below performs this mapping:

In [None]:
# pos tag mapping
def pos_tagger(nltk_tag):
    if nltk_tag.startswith('J'):
        return wordnet.ADJ
    elif nltk_tag.startswith('V'):
        return wordnet.VERB
    elif nltk_tag.startswith('N'):
        return wordnet.NOUN
    elif nltk_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
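
The same mapping can also be written as a lookup table. The sketch below relies on the fact that the WordNet POS constants in NLTK are single-character strings (`wordnet.ADJ == 'a'`, `wordnet.VERB == 'v'`, `wordnet.NOUN == 'n'`, `wordnet.ADV == 'r'`), so it runs without importing NLTK; the names `TAG_PREFIX_TO_WORDNET` and `pos_tagger_lookup` are hypothetical.

```python
# First letter of the Penn Treebank tag -> WordNet POS letter.
TAG_PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def pos_tagger_lookup(nltk_tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return TAG_PREFIX_TO_WORDNET.get(nltk_tag[:1], "n")

for tag in ["VBG", "JJ", "NNS", "RB", "IN"]:
    print(tag, "->", pos_tagger_lookup(tag))
```

Defaulting to noun mirrors the `else` branch of the function above and matches the default POS of the WordNet lemmatizer.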

In [None]:
# Get the pos tag
tagged = nltk.pos_tag(word_tokens)

In [None]:


# Get the root word for each of the tokens using their corresponding pos-tags
lemma_sent = []
for word, tag in tagged:
    new_tag = pos_tagger(tag)
    lemma = lemmatizer.lemmatize(word, new_tag)
    lemma_sent.append(lemma)

In [None]:
print(f"Original sentence : \n{text}")

In [None]:
print(f"Lemmatized sentence : \n{' '.join(lemma_sent)}")

[top](#t0)

<a id = 't3.7'>
<font size = 6 color = pwdrblue>  <b>Noise entity removal

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Noise entity removal in NLP involves the identification and removal of irrelevant or undesired entities from a given text.
- Noise entities can be entities that are not relevant to the analysis or entities that add unnecessary complexity to the task at hand.

In [None]:
text = """Embracing life's challenges is like navigating a journey. 🚀
Stay motivated, overcome hurdles, and explore new paths to success!
Check out inspiring stories at https://motivationalhub.com for an extra boost!"""

In [None]:
# Tokenization
word_tokens = text.lower().split()

In [None]:
# PoS tagging
tagged = nltk.pos_tag(word_tokens)

In [None]:
# Lemmatization
lemma_sent = []
for word, tag in tagged:
    new_tag = pos_tagger(tag)
    lemma = lemmatizer.lemmatize(word, new_tag)
    lemma_sent.append(lemma)

<a id = 'a'>
<font size = 5 color = seagreen> <b> a. Remove stopwords

<div class="alert alert-block alert-success">
<font size = 4> 

- Identify and remove common stopwords (e.g., "is," "the," "and") that do not carry much semantic meaning.
- This can help in focusing on more meaningful entities.

In [None]:
# Obtain the list of stopwords from the corpus
stp_wrds_eng = stopwords.words('english')
print(stp_wrds_eng)

In [None]:
# Removing stopwords
text_clean = [w for w in lemma_sent if w not in stp_wrds_eng]
print(f"Lemmatized : \n{' '.join(lemma_sent)}")
print(f"Cleaned  : \n{' '.join(text_clean)}")
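
Because the tokens come from a whitespace split, a stopword followed by punctuation (for example `"and,"`) would not match the stopword list exactly. A possible refinement, sketched here with a small hand-picked stopword set (a hypothetical subset, not the NLTK list), compares a lowercased, punctuation-stripped form of each token and uses a `set` for fast membership tests.

```python
from string import punctuation

# Small hand-picked stopword set for illustration (hypothetical subset).
STOPWORDS = {"is", "a", "the", "and", "to", "at", "for", "an"}

def remove_stopwords(tokens, stop_set=STOPWORDS):
    kept = []
    for token in tokens:
        # Compare on a lowercased, punctuation-stripped form so that
        # tokens like "The" or "and," are still recognised.
        if token.lower().strip(punctuation) not in stop_set:
            kept.append(token)
    return kept

print(remove_stopwords(["Life", "is", "a", "journey.", "The", "end,", "and,"]))
```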

<a id = 'b'>
<font size = 5 color = seagreen> <b> b. Remove URLs

<div class="alert alert-block alert-success">
<font size = 4> 
    
- URLs are not essential for many kinds of analysis, so they are often removed.
- We can use a regular expression to identify and remove them.

In [None]:
# Identify URLs with the pattern 'http\S+' (matches both http and https) and replace them with an empty string
text_clean = re.sub(r'http\S+', '', ' '.join(text_clean), flags=re.MULTILINE)
print(text_clean)
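
The pattern above only catches URLs that start with `http`. A slightly broader sketch, assuming that addresses beginning with `www.` should be removed as well, could look like this (the pattern and the helper name `remove_urls` are illustrative choices):

```python
import re

# Match http://, https:// or www. followed by any run of non-space characters.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", flags=re.IGNORECASE)

def remove_urls(text):
    without_urls = URL_RE.sub("", text)
    # Collapse the repeated whitespace left behind by the removal.
    return re.sub(r"\s{2,}", " ", without_urls).strip()

print(remove_urls("stories at https://motivationalhub.com for an extra boost"))
print(remove_urls("visit www.example.com or HTTP://EXAMPLE.ORG now"))
```

Note that the whitespace clean-up also turns line breaks that are part of a longer whitespace run into a single space, which may or may not be wanted depending on the task.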

<a id = 'c'>
<font size = 5 color = seagreen> <b> c. Remove punctuations

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Punctuation is not always useful for analysis, so it is often removed as well.
- The list of punctuation characters can be obtained from the `string` module.

In [None]:
# text_clean is a string at this point, so iterate over its characters
text_clean = [ch for ch in text_clean if ch not in punctuation]
print(f"Cleaned  : \n{''.join(text_clean)}")

<div class="alert alert-block alert-info">
<font size = 4> 

**Note:**
- In sentiment analysis, punctuation such as ! or ? may be significant.
- Text cleaning steps should be customised based on the analysis objective.
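
One way to make punctuation removal configurable is to let the caller list the characters to keep. The sketch below uses `str.translate`; the helper name `strip_punctuation` and its `keep` parameter are assumptions introduced for illustration.

```python
from string import punctuation

def strip_punctuation(text, keep=""):
    """Remove ASCII punctuation except the characters listed in `keep`."""
    to_remove = "".join(ch for ch in punctuation if ch not in keep)
    # The third argument of str.maketrans lists characters to delete.
    return text.translate(str.maketrans("", "", to_remove))

print(strip_punctuation("Stay motivated, explore new paths to success!"))
print(strip_punctuation("Really?! That's great, isn't it?", keep="!?"))
```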



<a id = 'd'>
<font size = 5 color = seagreen> <b>d. Remove emoticons

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Much of the text from social media is filled with emoticons.
- Handling emoticons is therefore a necessary part of an NLP pipeline.
- They may be removed directly for simplicity.
- This can be achieved with a regex that specifies the Unicode range of these characters, as given below:

In [None]:
# Match any character outside the Basic Multilingual Plane, where most emoji live
RE_EMOJI = re.compile('[\U00010000-\U0010ffff]', flags=re.UNICODE)
def strip_emoji(text):
    return RE_EMOJI.sub(r'', text)

In [None]:
# Use function to remove emoticons from text
text_clean = strip_emoji(''.join(text_clean))
print(text_clean)

<div class="alert alert-block alert-info">
<font size = 4> 

**Note :**
 - Emoticons may instead be replaced with a textual description of their intended meaning.
 - For example, 😀 translates to "happy face".
 - This approach is used by the VADER sentiment package in the data cleaning steps of sentiment analysis.
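
As a rough sketch of such a replacement, the standard library's `unicodedata` module can supply the official Unicode name of each character. This is an assumption chosen for illustration and not the lexicon that VADER uses, so the descriptions differ (😀 becomes "grinning face" rather than "happy face"). The helper name `emoji_to_text` is hypothetical.

```python
import re
import unicodedata

RE_EMOJI = re.compile('[\U00010000-\U0010ffff]')

def emoji_to_text(text):
    def describe(match):
        # Fall back to an empty string when the code point has no Unicode name.
        name = unicodedata.name(match.group(), "")
        return name.lower()
    return RE_EMOJI.sub(describe, text)

print(emoji_to_text("journey 🚀"))
print(emoji_to_text("great 😀"))
```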



[top](#t0)