
**Text Pre-processing**

Text preprocessing refers to a series of techniques used to clean, transform and prepare raw textual data into a format that is suitable for NLP or ML tasks.

*   Removing punctuation such as . , ! $ ( ) * % @
*   Removing URLs
*   Removing Stop words
*   Lower casing
*   Tokenization
*   Stemming
*   Lemmatization
*   Numeric Token Removal
*   Whitespace Removal
*   Spell Checking and Correction
*   Text Normalization
*   Entity Recognition and Masking
*   Removing HTML Tags and Special Characters

**Customer Support on Twitter**

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
import spacy
import string
pd.options.mode.chained_assignment = None

full_df = pd.read_csv("/content/drive/MyDrive/NLP/Dataset/twcs.csv", nrows=5000)
df = full_df[["text"]]
df["text"] = df["text"].astype(str)
full_df.head()

Unnamed: 0,tweet_id,author_id,inbound,created_at,text,response_tweet_id,in_response_to_tweet_id
0,1,sprintcare,False,Tue Oct 31 22:10:47 +0000 2017,@115712 I understand. I would like to assist y...,2.0,3.0
1,2,115712,True,Tue Oct 31 22:11:45 +0000 2017,@sprintcare and how do you propose we do that,,1.0
2,3,115712,True,Tue Oct 31 22:08:27 +0000 2017,@sprintcare I have sent several private messag...,1.0,4.0
3,4,sprintcare,False,Tue Oct 31 21:54:49 +0000 2017,@115712 Please send us a Private Message so th...,3.0,5.0
4,5,115712,True,Tue Oct 31 21:49:35 +0000 2017,@sprintcare I did.,4.0,6.0


**Lower Casing**

Lowercasing is a common preprocessing step in Natural Language Processing (NLP) tasks. It involves converting all letters in a text to lowercase.
* **Normalization:** Normalizes the text by ensuring that words with the same characters but different cases are treated as identical. For example, "Hello" and "hello" would become the same word after lowercasing.
* **Reducing Vocabulary Size:** Lowercasing reduces the complexity of the vocabulary by collapsing words with different cases into a single token. This can help in improving the efficiency of NLP models and reducing the amount of data required for training.
* **Consistency:** Lowercasing ensures consistency in text processing and analysis. It prevents discrepancies that may arise from variations in capitalization.
* **Avoiding Duplication:** Lowercasing prevents duplication of words due to differences in capitalization. For example, "Apple" and "apple" would be treated as the same word after lowercasing, avoiding redundant entries in datasets or analyses.
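To make the vocabulary-size point concrete, here is a minimal sketch that counts distinct whitespace tokens before and after lowercasing. The sample sentences are made up for illustration and are not taken from the dataset.

```python
# Hypothetical sample texts, not taken from the dataset
samples = [
    "Please send us a Private Message",
    "please send a private message",
    "Apple support replied",
    "I ate an apple",
]

def vocabulary(texts):
    """Return the set of whitespace-separated tokens across all texts."""
    return {token for text in texts for token in text.split()}

vocab_raw = vocabulary(samples)
vocab_lower = vocabulary(text.lower() for text in samples)

print(len(vocab_raw), len(vocab_lower))  # 16 12
```

Note the trade-off in the last two samples: after lowercasing, the company name "Apple" and the fruit "apple" become the same token, which may or may not be what a given task wants.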



In [2]:
df["text_lower"] = df["text"].str.lower()
df.head()

Unnamed: 0,text,text_lower
0,@115712 I understand. I would like to assist y...,@115712 i understand. i would like to assist y...
1,@sprintcare and how do you propose we do that,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...,@sprintcare i have sent several private messag...
3,@115712 Please send us a Private Message so th...,@115712 please send us a private message so th...
4,@sprintcare I did.,@sprintcare i did.


**Removal of Punctuations**

Punctuation marks such as periods, commas, question marks, and exclamation points often carry little semantic meaning on their own, so many pipelines remove them to simplify text analysis. In tweets, note that this also strips the "@" and "#" markers from mentions and hashtags.

* **Noise Reduction:** Punctuation marks can add noise to the text data without contributing significantly to the content's meaning. Removing them can help reduce this noise.
* **Tokenization:** In tokenization, where text is split into individual tokens (words or subwords), punctuation marks are often considered as separate tokens. Removing them can improve the tokenization process by avoiding unnecessary token proliferation.
* **Normalization:** Removing punctuation marks can contribute to text normalization, making it easier to process and analyze text consistently.
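For tweets it can be useful to keep the "@" and "#" markers while dropping the rest of the punctuation. The sketch below shows one way to do that; the KEEP constant and the function name are assumptions made for this example, not part of the original notebook.

```python
import string

# Keep "@" and "#" so mentions and hashtags stay recognizable (an assumption
# about what matters downstream; adjust KEEP as needed).
KEEP = "@#"
PUNCT_TABLE = str.maketrans(
    "", "", "".join(c for c in string.punctuation if c not in KEEP)
)

def remove_punctuation_keep_markers(text):
    """Delete ASCII punctuation except the characters listed in KEEP."""
    return text.translate(PUNCT_TABLE)

print(remove_punctuation_keep_markers("@sprintcare I did. #help!!"))
# @sprintcare I did #help
```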

In [3]:
# drop the new column created in last cell
df.drop(["text_lower"], axis=1, inplace=True)

PUNCT_TO_REMOVE = string.punctuation
def remove_punctuation(text):
    """custom function to remove the punctuation"""
    return text.translate(str.maketrans('', '', PUNCT_TO_REMOVE))

df["text_wo_punct"] = df["text"].apply(lambda text: remove_punctuation(text))
df.head()

Unnamed: 0,text,text_wo_punct
0,@115712 I understand. I would like to assist y...,115712 I understand I would like to assist you...
1,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...,sprintcare I have sent several private message...
3,@115712 Please send us a Private Message so th...,115712 Please send us a Private Message so tha...
4,@sprintcare I did.,sprintcare I did


**Removal of stopwords**

Stopwords, such as "the," "is," "at," and "in," occur frequently in text but typically do not carry significant semantic meaning.

* **Reducing Noise:** Stopwords are frequently occurring words that may not contribute much to the overall meaning of a text. Removing them can reduce noise and focus on more meaningful words.
* **Improving Efficiency:** Stopword removal can reduce the size of the vocabulary and the computational complexity of NLP tasks such as text classification or clustering.
* **Improving Model Performance:** By removing uninformative words, stopword removal can help some NLP models focus on more informative terms.

In [9]:
import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
print(", ".join(stop_words))

i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, must

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [10]:
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

df["text_wo_stop"] = df["text_wo_punct"].apply(lambda text: remove_stopwords(text))
df.head()

Unnamed: 0,text,text_wo_punct,text_wo_stop
0,@115712 I understand. I would like to assist y...,115712 I understand I would like to assist you...,115712 I understand I would like assist We wou...
1,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose
2,@sprintcare I have sent several private messag...,sprintcare I have sent several private message...,sprintcare I sent several private messages one...
3,@115712 Please send us a Private Message so th...,115712 Please send us a Private Message so tha...,115712 Please send us Private Message assist J...
4,@sprintcare I did.,sprintcare I did,sprintcare I
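In the output above, capitalized words such as "I" and "We" survive because the NLTK stopword list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive variant follows; it uses a small hand-written stopword set so that it runs without downloading anything, and the function name is hypothetical.

```python
# Small illustrative subset; in the notebook STOPWORDS comes from NLTK
SAMPLE_STOPWORDS = {"i", "we", "to", "you", "do", "how", "and", "that", "did"}

def remove_stopwords_ci(text, stopwords=SAMPLE_STOPWORDS):
    """Drop tokens whose lowercase form is a stopword; keep other tokens as-is."""
    return " ".join(
        word for word in str(text).split() if word.lower() not in stopwords
    )

print(remove_stopwords_ci("sprintcare I did"))
# sprintcare
print(remove_stopwords_ci("sprintcare and how do you propose we do that"))
# sprintcare propose
```

Another thing to keep in mind is that the NLTK list contains contractions with apostrophes (for example "don't"), which no longer match once punctuation has been removed from the text.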


**Removal of Frequent words**

In the previous preprocessing step, we removed stopwords based on general language information. In a domain-specific corpus, however, there may be other frequent words that are of little importance to us.

This step removes the most frequent words in the given corpus. With a weighting scheme such as TF-IDF, words that appear in most documents are down-weighted automatically, so this step is often unnecessary there.
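To see why TF-IDF takes care of frequent words, here is a small sketch of the inverse document frequency part, using the plain log(N / df) variant. The mini-corpus is invented for illustration, and libraries usually add smoothing terms, so their exact numbers will differ.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, already tokenized by whitespace
docs = [
    "please dm us your account number",
    "please dm us the order id",
    "please send the tracking number",
]

n_docs = len(docs)
# Document frequency: in how many documents each word appears
doc_freq = Counter(word for doc in docs for word in set(doc.split()))

# One common IDF variant: log(N / df)
idf = {word: math.log(n_docs / df) for word, df in doc_freq.items()}

for word in ("please", "dm", "tracking"):
    print(word, round(idf[word], 3))
# please 0.0
# dm 0.405
# tracking 1.099
```

A word that occurs in every document ("please") gets a weight of zero, which has the same effect as removing it.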

In [11]:
from collections import Counter
cnt = Counter()
for text in df["text_wo_stop"].values:
    for word in text.split():
        cnt[word] += 1

cnt.most_common(10)

[('I', 1437),
 ('us', 752),
 ('DM', 514),
 ('help', 479),
 ('Please', 376),
 ('We', 338),
 ('Hi', 293),
 ('Thanks', 287),
 ('get', 279),
 ('please', 247)]

In [12]:
FREQWORDS = set([w for (w, wc) in cnt.most_common(10)])
def remove_freqwords(text):
    """custom function to remove the frequent words"""
    return " ".join([word for word in str(text).split() if word not in FREQWORDS])

df["text_wo_stopfreq"] = df["text_wo_stop"].apply(lambda text: remove_freqwords(text))
df.head()

Unnamed: 0,text,text_wo_punct,text_wo_stop,text_wo_stopfreq
0,@115712 I understand. I would like to assist y...,115712 I understand I would like to assist you...,115712 I understand I would like assist We wou...,115712 understand would like assist would need...
1,@sprintcare and how do you propose we do that,sprintcare and how do you propose we do that,sprintcare propose,sprintcare propose
2,@sprintcare I have sent several private messag...,sprintcare I have sent several private message...,sprintcare I sent several private messages one...,sprintcare sent several private messages one r...
3,@115712 Please send us a Private Message so th...,115712 Please send us a Private Message so tha...,115712 Please send us Private Message assist J...,115712 send Private Message assist Just click ...
4,@sprintcare I did.,sprintcare I did,sprintcare I,sprintcare


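**Removal of Rare words**

This step mirrors the previous one, but removes the rarest words in the corpus instead of the most frequent ones.
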
In [13]:
# Drop the two columns which are no longer needed
df.drop(["text_wo_punct", "text_wo_stop"], axis=1, inplace=True)

n_rare_words = 10
RAREWORDS = set([w for (w, wc) in cnt.most_common()[:-n_rare_words-1:-1]])
def remove_rarewords(text):
    """custom function to remove the rare words"""
    return " ".join([word for word in str(text).split() if word not in RAREWORDS])

df["text_wo_stopfreqrare"] = df["text_wo_stopfreq"].apply(lambda text: remove_rarewords(text))
df.head()

Unnamed: 0,text,text_wo_stopfreq,text_wo_stopfreqrare
0,@115712 I understand. I would like to assist y...,115712 understand would like assist would need...,115712 understand would like assist would need...
1,@sprintcare and how do you propose we do that,sprintcare propose,sprintcare propose
2,@sprintcare I have sent several private messag...,sprintcare sent several private messages one r...,sprintcare sent several private messages one r...
3,@115712 Please send us a Private Message so th...,115712 send Private Message assist Just click ...,115712 send Private Message assist Just click ...
4,@sprintcare I did.,sprintcare,sprintcare
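Taking the last 10 entries of the frequency list picks a somewhat arbitrary subset when many words are tied at a count of 1. An alternative is to drop every word below a minimum count. The sketch below assumes a threshold of 2 and uses made-up texts; both are illustrative choices.

```python
from collections import Counter

# Hypothetical texts standing in for a preprocessed column
texts = ["reset password help", "password reset link", "help resetting pasword"]

counts = Counter(word for text in texts for word in text.split())

MIN_COUNT = 2  # assumed threshold; tune per corpus
rare_words = {word for word, n in counts.items() if n < MIN_COUNT}

def remove_rare(text):
    """Keep only words that occur at least MIN_COUNT times in the corpus."""
    return " ".join(word for word in text.split() if word not in rare_words)

print(sorted(rare_words))
# ['link', 'pasword', 'resetting']
print([remove_rare(t) for t in texts])
# ['reset password help', 'password reset', 'help']
```

A threshold like this also tends to catch misspellings such as "pasword", which rarely repeat.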


**Stemming**

Stemming is a text processing technique that reduces words to a base or root form, called the stem. It works by stripping affixes (in practice mostly suffixes) with heuristic rules, so that variations of the same word are treated as the same token. The stem is not guaranteed to be a dictionary word.

*   "running" becomes "run"
*   "dogs" becomes "dog"
*   "swimming" becomes "swim"


1. Text Normalization: For instance, in sentiment analysis, treating "love," "loved," and "loving" as the same stem helps in capturing the sentiment more consistently.
2. Improves Information Retrieval: For example, a search for "running shoes" can also retrieve documents that mention "run" or "runs."
3. Reduces Feature Dimensionality: For instance, without stemming, "run," "runs," and "running" would be separate features, leading to redundancy and increased complexity. Irregular forms such as "ran" are generally not handled by suffix stripping.
4. Speeds Up Processing: Simplifies text processing pipelines and speeds up computation by reducing the number of word variations that need to be processed.
5. Enhances Text Analysis: Aids tasks such as topic modeling, sentiment analysis, and document clustering by grouping similar words together.
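To illustrate how rule-based suffix stripping behaves, here is a deliberately simple toy stemmer. It is not the Porter algorithm (which applies several ordered rule phases with extra conditions); the suffix list and the minimum stem length are assumptions chosen for this example.

```python
# Toy suffix-stripping stemmer for illustration only. It is NOT the Porter
# algorithm used in the notebook.
SUFFIXES = ("ing", "ed", "es", "s", "e")  # checked in this order

def toy_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Require a stem of at least 3 letters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run"
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiouls":
                stem = stem[:-1]
            return stem
    return word

for w in ["running", "dogs", "loved", "private", "ran"]:
    print(w, "->", toy_stem(w))
# running -> run
# dogs -> dog
# loved -> lov
# private -> privat
# ran -> ran
```

Even this toy version shows the two typical side effects of stemming: stems that are not real words ("lov", "privat") and irregular forms that are left untouched ("ran").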

In [14]:
from nltk.stem.porter import PorterStemmer

# Drop the two columns
df.drop(["text_wo_stopfreq", "text_wo_stopfreqrare"], axis=1, inplace=True)

stemmer = PorterStemmer()
def stem_words(text):
    return " ".join([stemmer.stem(word) for word in text.split()])

df["text_stemmed"] = df["text"].apply(lambda text: stem_words(text))
df.head()

Unnamed: 0,text,text_stemmed
0,@115712 I understand. I would like to assist y...,@115712 i understand. i would like to assist y...
1,@sprintcare and how do you propose we do that,@sprintcar and how do you propos we do that
2,@sprintcare I have sent several private messag...,@sprintcar i have sent sever privat messag and...
3,@115712 Please send us a Private Message so th...,@115712 pleas send us a privat messag so that ...
4,@sprintcare I did.,@sprintcar i did.


We can see that words like private and propose have the e at the end chopped off by stemming. This is not intended. What can we do about that? We can use lemmatization in such cases.

Also, the Porter stemmer is designed for the English language. If we are working with other languages, we can use the Snowball stemmer. The supported languages for the Snowball stemmer are:

In [15]:
from nltk.stem.snowball import SnowballStemmer
SnowballStemmer.languages

('arabic',
 'danish',
 'dutch',
 'english',
 'finnish',
 'french',
 'german',
 'hungarian',
 'italian',
 'norwegian',
 'porter',
 'portuguese',
 'romanian',
 'russian',
 'spanish',
 'swedish')

**Lemmatization**

Lemmatization involves analyzing words in their context and converting them into their base forms, which are actual words found in the language's dictionary. It uses morphological analysis to achieve this, taking into account factors like part of speech (POS) tags, grammatical rules, and semantic meaning.
*   "running" becomes "run"
*   "better" becomes "good"
*   "mice" becomes "mouse"


1. **Accurate Representation:** Since lemmatization considers the context and part of speech of words, the resulting lemma is a valid word in the language, which is important for tasks like text generation, translation, and semantic analysis.
2. **Improved Information Retrieval:** Lemmatization improves the retrieval of relevant documents by reducing words to their canonical forms. Variations of the same word are treated as the same entity, leading to better search results and user experience.
3. **Enhanced Text Analysis:** Lemmatization is beneficial in text analysis tasks such as sentiment analysis, topic modeling, and named entity recognition.
4. **Reduced Vocabulary Size:** Mapping inflected forms to a single lemma shrinks the vocabulary, which can lead to faster and more efficient processing, especially in machine learning models where feature dimensionality is a concern.
5. **Maintains Semantic Meaning:** Unlike stemming, which may produce non-words or lose the meaning of words, lemmatization returns dictionary forms. This matters for tasks that require a deeper understanding of language semantics, such as question answering systems or chatbots.
6. **Part of Speech Tagging:** Lemmatization often involves part of speech tagging, which provides additional information about the grammatical role of words in a sentence. This information can be valuable in syntactic analysis, grammar checking, and text parsing tasks.



In [18]:
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df["text_lemmatized"] = df["text"].apply(lambda text: lemmatize_words(text))
df.head()

[nltk_data] Downloading package wordnet to /root/nltk_data...


Unnamed: 0,text,text_stemmed,text_lemmatized
0,@115712 I understand. I would like to assist y...,@115712 i understand. i would like to assist y...,@115712 I understand. I would like to assist y...
1,@sprintcare and how do you propose we do that,@sprintcar and how do you propos we do that,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...,@sprintcar i have sent sever privat messag and...,@sprintcare I have sent several private messag...
3,@115712 Please send us a Private Message so th...,@115712 pleas send us a privat messag so that ...,@115712 Please send u a Private Message so tha...
4,@sprintcare I did.,@sprintcar i did.,@sprintcare I did.


In [19]:
lemmatizer.lemmatize("running")

'running'

It returned running unchanged instead of converting it to the root form run. This is because the lemmatizer depends on the POS tag to find the correct lemma, and it treats every word as a noun by default. Now let us lemmatize again by providing the POS tag for the word.

In [20]:
lemmatizer.lemmatize("running", "v") # v for verb

'run'

Now we are getting the root form run. So we also need to provide the POS tag of the word along with the word for lemmatizer in nltk. Depending on the POS, the lemmatizer may return different results.

In [21]:
print("Word is : stripes")
print("Lemma result for verb : ",lemmatizer.lemmatize("stripes", 'v'))
print("Lemma result for noun : ",lemmatizer.lemmatize("stripes", 'n'))

Word is : stripes
Lemma result for verb :  strip
Lemma result for noun :  stripe


Now let us redo the lemmatization process for our dataset.

In [24]:
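from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# nltk.pos_tag needs a tagger model; the resource name depends on the NLTK
# version (newer releases use 'averaged_perceptron_tagger_eng')
nltk.download('averaged_perceptron_tagger')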

lemmatizer = WordNetLemmatizer()
wordnet_map = {"N":wordnet.NOUN, "V":wordnet.VERB, "J":wordnet.ADJ, "R":wordnet.ADV}
def lemmatize_words(text):
    pos_tagged_text = nltk.pos_tag(text.split())
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

df["text_lemmatized"] = df["text"].apply(lambda text: lemmatize_words(text))
df.head()

Unnamed: 0,text,text_stemmed,text_lemmatized
0,@115712 I understand. I would like to assist y...,@115712 i understand. i would like to assist y...,@115712 I understand. I would like to assist y...
1,@sprintcare and how do you propose we do that,@sprintcar and how do you propos we do that,@sprintcare and how do you propose we do that
2,@sprintcare I have sent several private messag...,@sprintcar i have sent sever privat messag and...,@sprintcare I have send several private messag...
3,@115712 Please send us a Private Message so th...,@115712 pleas send us a privat messag so that ...,@115712 Please send u a Private Message so tha...
4,@sprintcare I did.,@sprintcar i did.,@sprintcare I did.


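We can now see that in the third row, sent got converted to send since we provided the POS tag for lemmatization. Note also that in the fourth row us became u: pronouns fall back to the noun default, and the noun rules strip the trailing s.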
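The key step in the cell above is the mapping from Penn Treebank tags (as returned by nltk.pos_tag) to the four WordNet POS categories, based on the first letter of the tag. The standalone sketch below isolates that mapping so it can be inspected without NLTK; the function name is hypothetical, and the single-letter values are the ones behind wordnet.NOUN, wordnet.VERB, wordnet.ADJ and wordnet.ADV.

```python
# WordNet POS constants are single letters; these match the values of
# nltk.corpus.wordnet's NOUN, VERB, ADJ and ADV.
WORDNET_POS = {"N": "n", "V": "v", "J": "a", "R": "r"}

def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag (e.g. 'VBD', 'JJ') to a WordNet POS letter."""
    return WORDNET_POS.get(tag[:1], default)

for tag in ["NN", "VBD", "JJR", "RB", "PRP", "IN"]:
    print(tag, "->", penn_to_wordnet(tag))
# NN -> n
# VBD -> v
# JJR -> a
# RB -> r
# PRP -> n
# IN -> n
```

Tags outside the four groups, such as PRP (pronoun) or IN (preposition), fall back to the noun default. A stricter variant could skip lemmatization for those tags instead.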

**Removal of Emojis**

Removing emojis from text data can be important in various natural language processing tasks, especially when emojis are not relevant or can interfere with the analysis.

In [25]:
# Reference : https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    """custom function to remove emojis"""
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"  # dingbats
                           u"\U000024C2-\U0001F251"  # very broad range: also covers CJK and many other non-emoji characters
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

remove_emoji("game is on 🔥🔥")

'game is on '

In [26]:
remove_emoji("Hilarious😂")

'Hilarious'
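One caveat with the pattern above is that its last range, U+24C2 to U+1F251, spans a large part of Unicode and therefore also removes CJK text and many other non-emoji characters. The sketch below compares that range with a narrower pattern; the exact set of ranges to keep is a judgment call and the names BROAD and NARROW are made up for this example.

```python
import re

# The last range from the function above, on its own
BROAD = re.compile("[\U000024C2-\U0001F251]+")

# A narrower alternative that leaves out the broad range (an assumption
# about which ranges are wanted; extend as needed)
NARROW = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U00002702-\U000027B0"  # dingbats (partial)
    "]+"
)

sample = "日本語 🔥"
print(repr(BROAD.sub("", sample)))   # ' 🔥'   (the Japanese text is removed)
print(repr(NARROW.sub("", sample)))  # '日本語 ' (only the emoji is removed)
```

For multilingual data, a narrower pattern like this one, or a dedicated emoji library, is usually the safer choice.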

**Removal of Emoticons**

Removing emoticons from text data follows a similar approach to removing emojis. Emoticons are usually represented by specific character sequences, such as ":)", ":(", etc.

*   Using Regular Expressions (Regex)
*   String Replacement




In [27]:
# Thanks : https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
EMOTICONS = {
    u":‑\)":"Happy face or smiley",
    u":\)":"Happy face or smiley",
    u":-\]":"Happy face or smiley",
    u":\]":"Happy face or smiley",
    u":-3":"Happy face smiley",
    u":3":"Happy face smiley",
    u":->":"Happy face smiley",
    u":>":"Happy face smiley",
    u"8-\)":"Happy face smiley",
    u":o\)":"Happy face smiley",
    u":-\}":"Happy face smiley",
    u":\}":"Happy face smiley",
    u":-\)":"Happy face smiley",
    u":c\)":"Happy face smiley",
    u":\^\)":"Happy face smiley",
    u"=\]":"Happy face smiley",
    u"=\)":"Happy face smiley",
    u":‑D":"Laughing, big grin or laugh with glasses",
    u":D":"Laughing, big grin or laugh with glasses",
    u"8‑D":"Laughing, big grin or laugh with glasses",
    u"8D":"Laughing, big grin or laugh with glasses",
    u"X‑D":"Laughing, big grin or laugh with glasses",
    u"XD":"Laughing, big grin or laugh with glasses",
    u"=D":"Laughing, big grin or laugh with glasses",
    u"=3":"Laughing, big grin or laugh with glasses",
    u"B\^D":"Laughing, big grin or laugh with glasses",
    u":-\)\)":"Very happy",
    u":‑\(":"Frown, sad, andry or pouting",
    u":-\(":"Frown, sad, andry or pouting",
    u":\(":"Frown, sad, andry or pouting",
    u":‑c":"Frown, sad, andry or pouting",
    u":c":"Frown, sad, andry or pouting",
    u":‑<":"Frown, sad, andry or pouting",
    u":<":"Frown, sad, andry or pouting",
    u":‑\[":"Frown, sad, andry or pouting",
    u":\[":"Frown, sad, andry or pouting",
    u":-\|\|":"Frown, sad, andry or pouting",
    u">:\[":"Frown, sad, andry or pouting",
    u":\{":"Frown, sad, andry or pouting",
    u":@":"Frown, sad, andry or pouting",
    u">:\(":"Frown, sad, andry or pouting",
    u":'‑\(":"Crying",
    u":'\(":"Crying",
    u":'‑\)":"Tears of happiness",
    u":'\)":"Tears of happiness",
    u"D‑':":"Horror",
    u"D:<":"Disgust",
    u"D:":"Sadness",
    u"D8":"Great dismay",
    u"D;":"Great dismay",
    u"D=":"Great dismay",
    u"DX":"Great dismay",
    u":‑O":"Surprise",
    u":O":"Surprise",
    u":‑o":"Surprise",
    u":o":"Surprise",
    u":-0":"Shock",
    u"8‑0":"Yawn",
    u">:O":"Yawn",
    u":-\*":"Kiss",
    u":\*":"Kiss",
    u":X":"Kiss",
    u";‑\)":"Wink or smirk",
    u";\)":"Wink or smirk",
    u"\*-\)":"Wink or smirk",
    u"\*\)":"Wink or smirk",
    u";‑\]":"Wink or smirk",
    u";\]":"Wink or smirk",
    u";\^\)":"Wink or smirk",
    u":‑,":"Wink or smirk",
    u";D":"Wink or smirk",
    u":‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"X‑P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"XP":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":Þ":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"d:":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"=p":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u">:P":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u":‑/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":-[.]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u">:/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=/":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=[(\\\)]":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u"=L":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":S":"Skeptical, annoyed, undecided, uneasy or hesitant",
    u":‑\|":"Straight face",
    u":\|":"Straight face",
    u":$":"Embarrassed or blushing",
    u":‑x":"Sealed lips or wearing braces or tongue-tied",
    u":x":"Sealed lips or wearing braces or tongue-tied",
    u":‑#":"Sealed lips or wearing braces or tongue-tied",
    u":#":"Sealed lips or wearing braces or tongue-tied",
    u":‑&":"Sealed lips or wearing braces or tongue-tied",
    u":&":"Sealed lips or wearing braces or tongue-tied",
    u"O:‑\)":"Angel, saint or innocent",
    u"O:\)":"Angel, saint or innocent",
    u"0:‑3":"Angel, saint or innocent",
    u"0:3":"Angel, saint or innocent",
    u"0:‑\)":"Angel, saint or innocent",
    u"0:\)":"Angel, saint or innocent",
    u":‑b":"Tongue sticking out, cheeky, playful or blowing a raspberry",
    u"0;\^\)":"Angel, saint or innocent",
    u">:‑\)":"Evil or devilish",
    u">:\)":"Evil or devilish",
    u"\}:‑\)":"Evil or devilish",
    u"\}:\)":"Evil or devilish",
    u"3:‑\)":"Evil or devilish",
    u"3:\)":"Evil or devilish",
    u">;\)":"Evil or devilish",
    u"\|;‑\)":"Cool",
    u"\|‑O":"Bored",
    u":‑J":"Tongue-in-cheek",
    u"#‑\)":"Party all night",
    u"%‑\)":"Drunk or confused",
    u"%\)":"Drunk or confused",
    u":-###..":"Being sick",
    u":###..":"Being sick",
    u"<:‑\|":"Dump",
    u"\(>_<\)":"Troubled",
    u"\(>_<\)>":"Troubled",
    u"\(';'\)":"Baby",
    u"\(\^\^>``":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(\^_\^;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(~_~;\) \(・\.・;\)":"Nervous or Embarrassed or Troubled or Shy or Sweat drop",
    u"\(-_-\)zzz":"Sleeping",
    u"\(\^_-\)":"Wink",
    u"\(\(\+_\+\)\)":"Confused",
    u"\(\+o\+\)":"Confused",
    u"\(o\|o\)":"Ultraman",
    u"\^_\^":"Joyful",
    u"\(\^_\^\)/":"Joyful",
    u"\(\^O\^\)／":"Joyful",
    u"\(\^o\^\)／":"Joyful",
    u"\(__\)":"Kowtow as a sign of respect, or dogeza for apology",
    u"_\(\._\.\)_":"Kowtow as a sign of respect, or dogeza for apology",
    u"<\(_ _\)>":"Kowtow as a sign of respect, or dogeza for apology",
    u"<m\(__\)m>":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(__\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"m\(_ _\)m":"Kowtow as a sign of respect, or dogeza for apology",
    u"\('_'\)":"Sad or Crying",
    u"\(/_;\)":"Sad or Crying",
    u"\(T_T\) \(;_;\)":"Sad or Crying",
    u"\(;_;":"Sad of Crying",
    u"\(;_:\)":"Sad or Crying",
    u"\(;O;\)":"Sad or Crying",
    u"\(:_;\)":"Sad or Crying",
    u"\(ToT\)":"Sad or Crying",
    u";_;":"Sad or Crying",
    u";-;":"Sad or Crying",
    u";n;":"Sad or Crying",
    u";;":"Sad or Crying",
    u"Q\.Q":"Sad or Crying",
    u"T\.T":"Sad or Crying",
    u"QQ":"Sad or Crying",
    u"Q_Q":"Sad or Crying",
    u"\(-\.-\)":"Shame",
    u"\(-_-\)":"Shame",
    u"\(一一\)":"Shame",
    u"\(；一_一\)":"Shame",
    u"\(=_=\)":"Tired",
    u"\(=\^\·\^=\)":"cat",
    u"\(=\^\·\·\^=\)":"cat",
    u"=_\^=	":"cat",
    u"\(\.\.\)":"Looking down",
    u"\(\._\.\)":"Looking down",
    u"\^m\^":"Giggling with hand covering mouth",
    u"\(\・\・?":"Confusion",
    u"\(?_?\)":"Confusion",
    u">\^_\^<":"Normal Laugh",
    u"<\^!\^>":"Normal Laugh",
    u"\^/\^":"Normal Laugh",
    u"\（\*\^_\^\*）" :"Normal Laugh",
    u"\(\^<\^\) \(\^\.\^\)":"Normal Laugh",
    u"\(^\^\)":"Normal Laugh",
    u"\(\^\.\^\)":"Normal Laugh",
    u"\(\^_\^\.\)":"Normal Laugh",
    u"\(\^_\^\)":"Normal Laugh",
    u"\(\^\^\)":"Normal Laugh",
    u"\(\^J\^\)":"Normal Laugh",
    u"\(\*\^\.\^\*\)":"Normal Laugh",
    u"\(\^—\^\）":"Normal Laugh",
    u"\(#\^\.\^#\)":"Normal Laugh",
    u"\（\^—\^\）":"Waving",
    u"\(;_;\)/~~~":"Waving",
    u"\(\^\.\^\)/~~~":"Waving",
    u"\(-_-\)/~~~ \($\·\·\)/~~~":"Waving",
    u"\(T_T\)/~~~":"Waving",
    u"\(ToT\)/~~~":"Waving",
    u"\(\*\^0\^\*\)":"Excited",
    u"\(\*_\*\)":"Amazed",
    u"\(\*_\*;":"Amazed",
    u"\(\+_\+\) \(@_@\)":"Amazed",
    u"\(\*\^\^\)v":"Laughing,Cheerful",
    u"\(\^_\^\)v":"Laughing,Cheerful",
    u"\(\(d[-_-]b\)\)":"Headphones,Listening to music",
    u'\(-"-\)':"Worried",
    u"\(ーー;\)":"Worried",
    u"\(\^0_0\^\)":"Eyeglasses",
    u"\(\＾ｖ\＾\)":"Happy",
    u"\(\＾ｕ\＾\)":"Happy",
    u"\(\^\)o\(\^\)":"Happy",
    u"\(\^O\^\)":"Happy",
    u"\(\^o\^\)":"Happy",
    u"\)\^o\^\(":"Happy",
    u":O o_O":"Surprised",
    u"o_0":"Surprised",
    u"o\.O":"Surpised",
    u"\(o\.o\)":"Surprised",
    u"oO":"Surprised",
    u"\(\*￣m￣\)":"Dissatisfied",
    u"\(‘A`\)":"Snubbed or Deflated"
}

In [28]:
def remove_emoticons(text):
    emoticon_pattern = re.compile(u'(' + u'|'.join(k for k in EMOTICONS) + u')')
    return emoticon_pattern.sub(r'', text)

remove_emoticons("Hello :-)")

'Hello '

In [29]:
remove_emoticons("I am sad :(")

'I am sad '

**Conversion of Emoticon to Words**

Converting emoticons to words can be useful in text processing tasks where you want to replace emoticons with their textual representations for better analysis or understanding.

In [30]:
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
    return text

text = "Hello :-) :-)"
convert_emoticons(text)

'Hello Happy_face_smiley Happy_face_smiley'

In [31]:
text = "I am sad :()"
convert_emoticons(text)

'I am sad Frown_sad_andry_or_poutingConfusion'
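The trailing "Confusion" in the output above appears to come from the dictionary key \(?_?\): the question marks are meant literally, but as a regex they make the preceding characters optional, so the pattern matches a bare ")". A minimal sketch of a safer variant follows. It stores emoticons as literal strings, escapes them with re.escape, and tries longer emoticons first. The small dictionary and the function name are illustrative only.

```python
import re

# Small illustrative subset written as literal emoticons (not regex patterns)
EMOTICON_WORDS = {
    ":-)": "Happy_face_smiley",
    ":)": "Happy_face_smiley",
    ":-(": "Frown_sad_angry_or_pouting",
    ":(": "Frown_sad_angry_or_pouting",
    ":-))": "Very_happy",
}

# Longest first so ":-))" wins over ":-)"; re.escape makes every character literal
_pattern = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICON_WORDS, key=len, reverse=True))
)

def convert_emoticons_literal(text):
    """Replace each known emoticon with its description in a single pass."""
    return _pattern.sub(lambda m: EMOTICON_WORDS[m.group()], text)

print(convert_emoticons_literal("I am sad :()"))
# I am sad Frown_sad_angry_or_pouting)
print(convert_emoticons_literal("Great :-))"))
# Great Very_happy
```

Doing the replacement in one pass also avoids a later pattern matching inside text that an earlier replacement has just inserted.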

**Conversion of Emoji to Words**

Converting emojis to words can be useful for various text processing tasks, especially when you want to replace emojis with their textual representations for better analysis or understanding.

*   Using a Dictionary : One way to convert emojis to words is by creating a dictionary that maps emojis to their corresponding textual representations.
*   Using Regular Expressions (Regex) : Another approach is to use regular expressions to match emojis and replace them with their corresponding words.

In [33]:
def convert_emojis_to_words(text):
    emoji_dict = {
        "😊": "smile",
        "😢": "sad",
        "😂": "laugh",
        # Add more mappings as needed
    }
    words = text.split()
    converted_words = [emoji_dict[word] if word in emoji_dict else word for word in words]
    return ' '.join(converted_words)

text_with_emojis = "Hello! How are you? 😊 😂"
text_with_words = convert_emojis_to_words(text_with_emojis)
print(text_with_words)  # Output: Hello! How are you? smile laugh

Hello! How are you? smile laugh


In [34]:
import re

def convert_emojis_to_words(text):
    emoji_patterns = {
        r'😊': 'smile',
        r'😢': 'sad',
        r'😂': 'laugh',
        # Add more patterns as needed
    }
    for pattern, word in emoji_patterns.items():
        text = re.sub(pattern, word, text)
    return text

text_with_emojis = "Hello! How are you? 😊 😂"
text_with_words = convert_emojis_to_words(text_with_emojis)
print(text_with_words)  # Output: Hello! How are you? smile laugh

Hello! How are you? smile laugh
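Both versions above need a hand-maintained mapping. As a rough alternative, the standard library's unicodedata module can supply the official Unicode name of each symbol. The sketch below treats every character in the Unicode category "So" (Symbol, other) as an emoji, which is a heuristic: that category also contains non-emoji symbols such as the copyright sign. The function name is hypothetical.

```python
import unicodedata

def emojis_to_names(text):
    """Replace characters in Unicode category 'So' with their lowercase names."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            # Pad with spaces so the name does not fuse with adjacent words
            pieces.append(f" {name.lower().replace(' ', '_')} " if name else ch)
        else:
            pieces.append(ch)
    # Collapse the padding (and any other repeated whitespace)
    return " ".join("".join(pieces).split())

print(emojis_to_names("Hilarious😂"))
# Hilarious face_with_tears_of_joy
```

Unlike the split-based dictionary version, this also handles emojis that are attached directly to a word.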


**Removal of URLs**

Removing URLs from text data is a common preprocessing step in NLP tasks to clean the text and focus on the textual content without including web links.


*   Using Regular Expressions (Regex): Regular expressions are effective for pattern matching, making them suitable for identifying and removing URLs from text data.
*   Using BeautifulSoup (for HTML content): If your text data contains HTML content and you want to remove URLs embedded within HTML tags, you can use BeautifulSoup along with regex to achieve this.



In [35]:
def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)

In [36]:
text = "Driverless AI NLP blog post on https://www.h2o.ai/blog/detecting-sarcasm-is-difficult-but-ai-may-have-an-answer/"
remove_urls(text)

'Driverless AI NLP blog post on '

In [37]:
text = "Please refer to link http://lnkd.in/ecnt5yC for the paper"
remove_urls(text)

'Please refer to link  for the paper'

In [38]:
text = "Want to know more. Checkout www.h2o.ai for additional information"
remove_urls(text)

'Want to know more. Checkout  for additional information'
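The outputs above keep a double space where the URL used to be. A small follow-up step, sketched below with the same URL pattern, collapses repeated whitespace and trims the ends. The function name is made up for this example.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def remove_urls_clean(text):
    """Remove URLs, then collapse repeated whitespace and trim the ends."""
    without_urls = URL_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without_urls).strip()

print(remove_urls_clean("Please refer to link http://lnkd.in/ecnt5yC for the paper"))
# Please refer to link for the paper
```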

**Removal of HTML Tags**

Another common preprocessing technique that comes in handy in many places is the removal of HTML tags. This is especially useful when we scrape data from different websites, since we might end up with HTML strings as part of our text.

In [39]:
def remove_html(text):
    html_pattern = re.compile('<.*?>')
    return html_pattern.sub(r'', text)

text = """<div>
<h1> H2O</h1>
<p> AutoML</p>
<a href="https://www.h2o.ai/products/h2o-driverless-ai/"> Driverless AI</a>
</div>"""

print(remove_html(text))


 H2O
 AutoML
 Driverless AI
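
A regex that strips tags leaves HTML entities such as `&amp;` untouched. The sketch below, using only the standard library, removes tags first and then decodes entities with `html.unescape`. The function name `strip_tags_and_entities` and the sample string are made up for illustration.

```python
import html
import re

TAG_PATTERN = re.compile(r'<[^>]+>')

def strip_tags_and_entities(text):
    """Remove HTML tags, then decode entities such as &amp; and &lt;."""
    # Strip tags before unescaping, so a decoded "<" is never mistaken for a tag
    without_tags = TAG_PATTERN.sub('', text)
    return html.unescape(without_tags)

print(strip_tags_and_entities("<p>Tom &amp; Jerry &lt;3</p>"))
# Output: Tom & Jerry <3
```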



In [40]:
from bs4 import BeautifulSoup

def remove_html(text):
    return BeautifulSoup(text, "lxml").text

text = """<div>
<h1> H2O</h1>
<p> AutoML</p>
<a href="https://www.h2o.ai/products/h2o-driverless-ai/"> Driverless AI</a>
</div>
"""

print(remove_html(text))


 H2O
 AutoML
 Driverless AI




**Chat Words Conversion**

Chat words, also known as internet slang or text abbreviations, are often used in informal communication such as messaging, social media, and online forums. Converting chat words to their full forms or standard English equivalents can be useful for text normalization and improving readability.

In [41]:
def convert_chat_words(text):
    chat_words_dict = {
        'lol': 'laugh out loud',
        'brb': 'be right back',
        'omg': 'oh my god',
        # Add more mappings as needed
    }
    # Splitting on whitespace leaves punctuation attached, so a token like "omg," is not matched
    words = text.split()
    converted_words = [chat_words_dict.get(word.lower(), word) for word in words]
    return ' '.join(converted_words)

text_with_chat_words = "omg, that's hilarious! lol brb"
text_with_full_forms = convert_chat_words(text_with_chat_words)
print(text_with_full_forms)

omg, that's hilarious! laugh out loud be right back


In [42]:
import re

def convert_chat_words(text):
    chat_words_dict = {
        'lol': 'laugh out loud',
        'brb': 'be right back',
        'omg': 'oh my god',
        # Add more mappings as needed
    }
    # Escape the keys so entries containing regex metacharacters are matched literally
    pattern = re.compile(r'\b(' + '|'.join(map(re.escape, chat_words_dict)) + r')\b', re.IGNORECASE)
    return pattern.sub(lambda x: chat_words_dict[x.group().lower()], text)

text_with_chat_words = "omg, that's hilarious! lol brb"
text_with_full_forms = convert_chat_words(text_with_chat_words)
print(text_with_full_forms)

oh my god, that's hilarious! laugh out loud be right back


**Spelling Correction**

Spelling correction is essential for improving the quality and readability of text data, especially in natural language processing tasks. Python provides several libraries and techniques to perform spelling correction effectively. One popular library for this purpose is pyspellchecker, a pure-Python spell checker that suggests corrections using word frequencies and edit distance.

In [43]:
!pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.1-py3-none-any.whl (6.8 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.8.1


In [44]:
from spellchecker import SpellChecker

spell = SpellChecker()
def correct_spellings(text):
    corrected_text = []
    words = text.split()
    misspelled_words = spell.unknown(words)
    for word in words:
        if word in misspelled_words:
            # correction() can return None when no candidate is found; keep the original word then
            corrected_text.append(spell.correction(word) or word)
        else:
            corrected_text.append(word)
    return " ".join(corrected_text)

text = "speling correctin"
correct_spellings(text)

'spelling correcting'

In [45]:
text = "thnks for readin the notebook"
correct_spellings(text)

'thanks for reading the notebook'
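
To give a feel for how a dictionary-based corrector can work, here is a minimal standard-library sketch of the common "one edit away, most frequent wins" idea. It is not the pyspellchecker implementation. The vocabulary `WORD_FREQ` and its counts are made up for illustration, and the helper names `edits1` and `correct` are assumptions.

```python
import string

# Hypothetical toy vocabulary with made-up frequency counts (illustration only)
WORD_FREQ = {"spelling": 50, "correction": 40, "thanks": 80, "reading": 60, "the": 1000}

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in WORD_FREQ:
        return word
    candidates = [w for w in edits1(word) if w in WORD_FREQ]
    return max(candidates, key=WORD_FREQ.get) if candidates else word

print(correct("speling"))  # Output: spelling
print(correct("thnks"))    # Output: thanks
```

A real checker relies on a much larger frequency list and usually searches two edits away as well, which is why its suggestions depend heavily on the underlying word counts.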

**Numeric Token Removal**

In [46]:
import re

def remove_numeric_tokens(text):
    # Define a regular expression pattern to match numeric tokens
    pattern = r'\b\d+\b'
    # Use re.sub() to replace numeric tokens with an empty string
    cleaned_text = re.sub(pattern, '', text)
    return cleaned_text

# Example usage
text_with_numerics = "There are 123 apples and 456 oranges."
cleaned_text = remove_numeric_tokens(text_with_numerics)
print(cleaned_text)  # Output: There are  apples and  oranges. (the removed numbers leave doubled spaces)

There are  apples and  oranges.
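
Removing the digits leaves the surrounding spaces behind. A small sketch of one way to tidy this up, assuming you also want to drop numbers with thousands separators or decimals: the function name `remove_numeric_tokens_clean` and the extended pattern are illustrative, not part of the original notebook.

```python
import re

# Digits optionally followed by ",ddd" or ".ddd" groups, e.g. 123, 1,250.75
NUMERIC_TOKEN = re.compile(r'\b\d+(?:[.,]\d+)*\b')

def remove_numeric_tokens_clean(text):
    """Remove numeric tokens, then collapse the whitespace they leave behind."""
    without_numbers = NUMERIC_TOKEN.sub('', text)
    return re.sub(r'\s+', ' ', without_numbers).strip()

print(remove_numeric_tokens_clean("There are 123 apples and 456 oranges."))
# Output: There are apples and oranges.
```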


**Whitespace Removal**

To remove whitespace (including spaces, tabs, and newline characters) from a string in Python, you can use the str.replace() method or regular expressions from the re module.

In [47]:
def remove_whitespace(text):
    # Use str.replace() to replace whitespace characters with an empty string
    cleaned_text = text.replace(" ", "").replace("\t", "").replace("\n", "")
    return cleaned_text

# Example usage
text_with_whitespace = "  Hello \t\n World!  "
cleaned_text = remove_whitespace(text_with_whitespace)
print(cleaned_text)  # Output: HelloWorld!


HelloWorld!
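
The regular-expression route mentioned above can be sketched as follows. The first helper deletes every whitespace character, like the `str.replace()` version. The second only collapses runs of whitespace into a single space, which is often what you want when word boundaries must survive. Both function names are illustrative.

```python
import re

def remove_all_whitespace(text):
    """Delete every whitespace character (spaces, tabs, newlines)."""
    return re.sub(r'\s+', '', text)

def normalize_whitespace(text):
    """Collapse runs of whitespace to one space and trim both ends."""
    return re.sub(r'\s+', ' ', text).strip()

sample = "  Hello \t\n World!  "
print(remove_all_whitespace(sample))  # Output: HelloWorld!
print(normalize_whitespace(sample))   # Output: Hello World!
```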


**Input Text Spell Checking and Correction**

In Natural Language Processing (NLP), spell checking and correction help ensure the accuracy and quality of textual data. One way to do this is with TextBlob, whose correct() method returns a copy of the text with likely misspellings replaced.

In [49]:
from textblob import TextBlob

def correct_spelling_textblob(text):
    blob = TextBlob(text)
    corrected_text = str(blob.correct())
    return corrected_text

# Example usage
text_with_spelling_errors = "I am goinng to the maret."
corrected_text = correct_spelling_textblob(text_with_spelling_errors)
print(corrected_text)

I am going to the market.


**Text Normalization**

Text normalization refers to the process of converting text data into a standardized format, making it easier to process and analyze. This process typically involves several steps such as lowercasing, removing punctuation, handling contractions, expanding abbreviations, and more.

In [54]:
import re

def normalize_text(text):
    # Lowercase the text
    text = text.lower()

    # Handle contractions first, while the apostrophes are still present
    contractions_dict = {
        "ain't": "is not",
        "aren't": "are not",
        "can't": "cannot",
        # Add more contractions as needed
    }
    for contraction, expansion in contractions_dict.items():
        text = re.sub(r'\b' + re.escape(contraction) + r'\b', expansion, text)

    # Expand abbreviations
    abbreviations_dict = {
        "lol": "laugh out loud",
        "brb": "be right back",
        # Add more abbreviations as needed
    }
    for abbreviation, expansion in abbreviations_dict.items():
        text = re.sub(r'\b' + re.escape(abbreviation) + r'\b', expansion, text)

    # Remove punctuation (done after contractions, which rely on the apostrophe)
    text = re.sub(r'[^\w\s]', '', text)

    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Example usage
text_to_normalize = "I ain't going to the store, BRB!"
normalized_text = normalize_text(text_to_normalize)
print(normalized_text)  # Output: i is not going to the store be right back

i is not going to the store be right back


**Entity Recognition and Masking**

Entity recognition and masking involve identifying sensitive or confidential information, such as names, locations, dates, and other entities, and replacing it with placeholders or masks to protect privacy and confidentiality. In Python, you can achieve this using libraries like spaCy and regular expressions.

In [55]:
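import spacy

# Load the model once; reloading it on every call is slow
nlp = spacy.load("en_core_web_sm")

def mask_entities(text):
    doc = nlp(text)

    masked_text = text
    # Replace from the last entity to the first so earlier character offsets stay valid
    for ent in reversed(doc.ents):
        masked_text = masked_text[:ent.start_char] + "<ENTITY>" + masked_text[ent.end_char:]

    return masked_text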

# Example usage
text_with_entities = "John Smith lives in New York and works for ABC Corp."
masked_text = mask_entities(text_with_entities)
print(masked_text)  # Output: <ENTITY> lives in <ENTITY> and works for <ENTITY>


<ENTITY> lives in <ENTITY> and works for <ENTITY>
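
The regular-expression side of masking can be sketched with the standard library alone. The two patterns below are deliberately simple illustrations (assumptions, not production-grade validators): one for email addresses and one for phone numbers written as three groups of digits. The function name `mask_with_regex` and the `<EMAIL>` and `<PHONE>` placeholders are likewise hypothetical.

```python
import re

# Illustrative patterns only; real-world emails and phone numbers vary widely
EMAIL_PATTERN = re.compile(r'\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b')
PHONE_PATTERN = re.compile(r'\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b')

def mask_with_regex(text):
    """Mask email addresses and simple 3-3-4 digit phone numbers."""
    text = EMAIL_PATTERN.sub('<EMAIL>', text)
    text = PHONE_PATTERN.sub('<PHONE>', text)
    return text

print(mask_with_regex("Contact jane.doe@example.com or call 555-123-4567."))
# Output: Contact <EMAIL> or call <PHONE>.
```

Pattern-based masking works well for rigidly formatted identifiers, while a statistical model is better suited to names and places.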


**Removing HTML Tags and Special Characters**

To remove HTML tags and special characters from a text string in Python, you can use libraries like BeautifulSoup for HTML tag removal and regular expressions (regex) for special character removal.

In [56]:
from bs4 import BeautifulSoup

def remove_html_tags(text):
    soup = BeautifulSoup(text, 'html.parser')
    cleaned_text = soup.get_text(separator=' ')
    return cleaned_text

# Example usage
html_text = "<p>This is <strong>bold</strong> and <em>italic</em> text.</p>"
cleaned_text = remove_html_tags(html_text)
print(cleaned_text)  # Output: This is  bold  and  italic  text. (separator=' ' adds a space at each tag boundary)

This is  bold  and  italic  text.
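
For the special-character half of this step, a regex sketch might look like the following. It assumes that "special" means anything other than ASCII letters, digits, whitespace, and a small set of punctuation you choose to keep. The function name `remove_special_characters` and the default `keep` set are illustrative. Note that this simple character class also drops accented and other non-ASCII letters.

```python
import re

def remove_special_characters(text, keep=".,!?'"):
    """Drop every character that is not an ASCII letter, digit, whitespace, or in keep."""
    pattern = re.compile(r"[^A-Za-z0-9\s" + re.escape(keep) + r"]")
    return pattern.sub('', text)

print(remove_special_characters("Price: $100 #sale @store!"))
# Output: Price 100 sale store!
```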


Thank you...! Keep learning..!