
* **What is Text Preprocessing?**

**Text preprocessing in Natural Language Processing (NLP) involves a set of tasks and techniques aimed at cleaning and transforming raw text data into a format that is suitable for analysis and machine learning models. It plays a crucial role in enhancing the quality and effectiveness of NLP applications by addressing various challenges associated with unstructured text.**
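As a rough preview of how such cleaning steps can be chained, here is a minimal sketch. The function name `preprocess`, the order of the steps, and the sample sentence are illustrative assumptions, not a fixed recipe.

```python
import re
import string

def preprocess(text):
    """Chain a few simple cleaning steps (the order is an assumption)."""
    text = text.lower()                                  # lowercase everything
    text = re.sub(r"<.*?>", " ", text)                   # drop HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # drop URLs
    # Delete every ASCII punctuation character
    text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()                                  # whitespace tokens

print(preprocess("A <b>GREAT</b> film! See https://example.com for more."))
# ['a', 'great', 'film', 'see', 'for', 'more']
```

Each of these steps is covered individually in the sections that follow.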

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


**N.B. Please be aware that I will apply several procedures here, and these may not be directly relevant to the current dataset. Nevertheless, I will implement them for demonstration purposes. If your dataset necessitates these procedures for preprocessing, you can readily adapt and apply them to your specific dataset.**

# Step 01: Lowercasing

Consider two sentences: "Blue sky and puffy clouds are so pleasant to watch" and "The sky is blue." The word "blue" refers to the same color in both; the only difference is that one starts with a capital B and the other with a lowercase b. An algorithm that compares strings exactly will nevertheless treat 'Blue' and 'blue' as two different words. To avoid this unnecessary complexity, the first thing we do is lowercase the text.


Lowercasing is a common early step in text preprocessing for Natural Language Processing (NLP). It converts all letters in a text to lowercase, so that words differing only in case, such as "Blue" and "blue", share one representation. This keeps the vocabulary smaller and simplifies subsequent analyses.

In [3]:
df['review'] = df['review'].str.lower()

# After lowercasing all the reviews
df['review'].head()

0    one of the other reviewers has mentioned that ...
1    a wonderful little production. <br /><br />the...
2    i thought this was a wonderful way to spend ti...
3    basically there's a family where a little boy ...
4    petter mattei's "love in the time of money" is...
Name: review, dtype: object
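To make the effect on the vocabulary concrete, the following sketch counts tokens in the two example sentences with and without lowercasing. The helper `count_tokens` is a hypothetical name, and the tokenization (strip periods, split on whitespace) is deliberately crude.

```python
from collections import Counter

sentences = [
    "Blue sky and puffy clouds are so pleasant to watch",
    "The sky is blue.",
]

def count_tokens(texts, lowercase):
    """Count whitespace-separated tokens, optionally lowercasing first."""
    counts = Counter()
    for t in texts:
        if lowercase:
            t = t.lower()
        counts.update(t.replace(".", "").split())
    return counts

raw = count_tokens(sentences, lowercase=False)
low = count_tokens(sentences, lowercase=True)

print(raw["Blue"], raw["blue"])  # 1 1  (counted as two separate words)
print(low["blue"])               # 2    (merged into one entry)
print(len(raw), len(low))        # 13 12
```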

# Step 02: Removing HTML tags

Eliminating HTML tags is a crucial preprocessing step in NLP because HTML tags, primarily used for web page formatting, often introduce noise and structural information that is irrelevant to language analysis tasks. The focus in NLP is typically on extracting meaningful content from the text rather than the webpage structure. By removing HTML tags, we ensure consistency in text representation, prevent confusion for NLP models, facilitate improved tokenization, and enhance overall readability. 

**N.B. Please note that moving forward, we will be employing Regular Expressions (RegEx). Therefore, it is essential to have a moderate understanding of Regular Expressions before delving into the realm of text preprocessing in Natural Language Processing (NLP).**

In [4]:
import re 
def remove_html_tags(text):
    pattern = re.compile('<.*?>')
    return pattern.sub(r'',text)
df['review'] = df['review'].apply(remove_html_tags)

In [5]:
# After removing all the html tags
df['review'].head()

0    one of the other reviewers has mentioned that ...
1    a wonderful little production. the filming tec...
2    i thought this was a wonderful way to spend ti...
3    basically there's a family where a little boy ...
4    petter mattei's "love in the time of money" is...
Name: review, dtype: object
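A regular expression is quick, but it does not decode HTML entities such as `&amp;`. As an alternative, here is a small sketch using `html.parser.HTMLParser` from the standard library; the class name `_TextExtractor`, the function name `strip_html`, and the sample string are assumptions for illustration.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text between tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for text content; entities are already decoded here
        self.parts.append(data)

def strip_html(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)

sample = "great movie.<br /><br />loved the cast &amp; crew"
print(strip_html(sample))  # great movie.loved the cast & crew
```

Like the regex version, this joins the text on both sides of a tag without adding a space, so you may want to insert one if tags such as `<br />` separate sentences in your data.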

# Step 03: Removing URL


Removing URLs from text data is a common preprocessing step in NLP for several reasons. URLs, or web links, often carry information related to web formatting and navigation that is irrelevant or even detrimental to certain NLP tasks. By eliminating URLs, the focus shifts to the textual content, allowing NLP models to extract meaningful patterns without interference from web-specific elements. URLs may introduce noise, complicate tokenization, and affect the overall consistency of the text representation. 

**There are no URLs in this particular dataset, so this step is for demonstration purposes only. If you encounter URLs in your own dataset, you can follow this procedure.**

In [6]:
def remove_url(text):
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub(r'',text)

# Let's consider a text 
text1 = "The official website of Google is https://www.google.com/"
# Now we are applying the function we have made earlier in this text
remove_urls = remove_url(text1)
print(remove_urls)

The official website of Google is 
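Links are not always written with a scheme, so it is worth checking a bare `www.` link as well. The sketch below uses a pattern in which the dot after `www` is escaped so that it matches a literal dot. The name `strip_urls` and the placeholder domains are assumptions.

```python
import re

# Either a scheme-prefixed link or a bare "www." link, up to the next whitespace
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

def strip_urls(text):
    return URL_PATTERN.sub("", text)

print(strip_urls("Docs live at www.example.com and https://example.org/page today"))
# Docs live at  and  today
```

Note that the surrounding spaces are left in place, so a later whitespace-normalization step may be useful.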


# Step 04: Removing Punctuation


Punctuation marks are symbols such as periods, commas, exclamation points, and question marks that are used in written language to indicate pauses, boundaries, or the tone of a sentence. In NLP, removing punctuation is a common preprocessing step. Punctuation marks may introduce noise to the text, and their presence can affect the accuracy and efficiency of various NLP tasks. When analyzing text, NLP models often focus on the semantics of words and sentences rather than the syntactic nuances introduced by punctuation. By eliminating punctuation, the text becomes more uniform, tokenization becomes simpler, and the overall representation of the language becomes cleaner.

In [7]:
import string
punctuation_list = string.punctuation
print(f'The punctuations are {punctuation_list}')

The punctuations are !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [8]:
def remove_punctuation(text):
    for char in punctuation_list:
        text = text.replace(char,'')
    return text

text2 = "In a bustling city, the streets were filled with people rushing to catch trains,buses & taxis!"
rem_punct = remove_punctuation(text2)
print(rem_punct)

In a bustling city the streets were filled with people rushing to catch trainsbuses  taxis
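Notice that "trains,buses" collapsed into "trainsbuses" above, because the comma was deleted rather than replaced. The sketch below shows two variants built on `str.translate`: one that deletes punctuation in a single pass, and one that replaces it with a space and then collapses repeated whitespace. The function names are hypothetical.

```python
import re
import string

def strip_punctuation(text):
    """Delete every ASCII punctuation character in one pass."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

def punctuation_to_space(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return re.sub(r"\s+", " ", text.translate(table)).strip()

sample = "people rushing to catch trains,buses & taxis!"
print(strip_punctuation(sample))     # people rushing to catch trainsbuses  taxis
print(punctuation_to_space(sample))  # people rushing to catch trains buses taxis
```

Which variant is preferable depends on the data; replacing with a space is usually safer when punctuation appears without surrounding whitespace.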


# Step 05: Removing Chat Words

Chat words, also known as chat language or text speak, are informal and abbreviated words and expressions commonly used in online communication, particularly in chat rooms, messaging apps, and social media platforms. In NLP, handling chat words is a useful preprocessing step for several reasons. Chat words can introduce noise and ambiguity into the text data, making it harder for NLP models to interpret and analyze the content. These informal expressions often deviate from standard grammar and vocabulary, leading to a less consistent dataset. One way to handle them, used below, is to replace each abbreviation with its full expansion, which moves the text toward a more standardized and formal representation of language.

In [9]:
chat_words ={
    'AFAIK': 'As Far As I Know',
    'AFK': 'Away From Keyboard',
    'ASAP': 'As Soon As Possible',
    'ATK': 'At The Keyboard',
    'ATM': 'At The Moment',
    'A3': 'Anytime, Anywhere, Anyplace',
    'BAK': 'Back At Keyboard',
    'BBL': 'Be Back Later',
    'BBS': 'Be Back Soon',
    'BFN': 'Bye For Now',
    'B4N': 'Bye For Now',
    'BRB': 'Be Right Back',
    'BRT': 'Be Right There',
    'BTW': 'By The Way',
    'B4': 'Before',
    'CU': 'See You',
    'CUL8R': 'See You Later',
    'CYA': 'See You',
    'FAQ': 'Frequently Asked Questions',
    'FC': 'Fingers Crossed',
    'FWIW': 'For What It\'s Worth',
    'FYI': 'For Your Information',
    'GAL': 'Get A Life',
    'GG': 'Good Game',
    'GN': 'Good Night',
    'GMTA': 'Great Minds Think Alike',
    'GR8': 'Great!',
    'G9': 'Genius',
    'IC': 'I See',
    'ICQ': 'I Seek you (also a chat program)',
    'ILU': 'I Love You',
    'IMHO': 'In My Honest/Humble Opinion',
    'IMO': 'In My Opinion',
    'IOW': 'In Other Words',
    'IRL': 'In Real Life',
    'KISS': 'Keep It Simple, Stupid',
    'LDR': 'Long Distance Relationship',
    'LTNS': 'Long Time No See',
    'L8R': 'Later',
    'MTE': 'My Thoughts Exactly',
    'M8': 'Mate',
    'NRN': 'No Reply Necessary',
    'OIC': 'Oh I See',
    'PITA': 'Pain In The A..',
    'PRT': 'Party',
    'PRW': 'Parents Are Watching',
    'QPSA?': 'Que Pasa?',
    'ROFL': 'Rolling On The Floor Laughing',
    'ROFLOL': 'Rolling On The Floor Laughing Out Loud',
    'ROTFLMAO': 'Rolling On The Floor Laughing My A.. Off',
    'SK8': 'Skate',
    'STATS': 'Your sex and age',
    'ASL': 'Age, Sex, Location',
    'THX': 'Thank You',
    'TTFN': 'Ta-Ta For Now!',
    'TTYL': 'Talk To You Later',
    'U': 'You',
    'U2': 'You Too',
    'U4E': 'Yours For Ever',
    'WB': 'Welcome Back',
    'WTF': 'What The F...',
    'WTG': 'Way To Go!',
    'WUF': 'Where Are You From?',
    'W8': 'Wait...',
    '7K': 'Sick:-D Laugher',
    'TFW': 'That feeling when',
    'MFW': 'My face when',
    'MRW': 'My reaction when',
    'IFYP': 'I feel your pain',
    'LOL': 'Laughing out loud',
    'TNTL': 'Trying not to laugh',
    'JK': 'Just kidding',
    'IDC': 'I don’t care',
    'ILY': 'I love you',
    'IMU': 'I miss you',
    'ADIH': 'Another day in hell',
    'ZZZ': 'Sleeping, bored, tired',
    'WYWH': 'Wish you were here',
    'TIME': 'Tears in my eyes',
    'BAE': 'Before anyone else',
    'FIMH': 'Forever in my heart',
    'BSAAW': 'Big smile and a wink',
    'BWL': 'Bursting with laughter',
    'LMAO': 'Laughing my a** off',
    'BFF': 'Best friends forever',
    'CSL': 'Can’t stop laughing'
}


In [10]:
def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return " ".join(new_text)

In [11]:
a = chat_conversion("LOL this is very funny")
b = chat_conversion("TTYL , I'm in a hurry")
c = chat_conversion("Send him the mail ASAP")
print(a)
print(b)
print(c)

Laughing out loud this is very funny
Talk To You Later , I'm in a hurry
Send him the mail As Soon As Possible
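Splitting on whitespace misses abbreviations that have punctuation attached, such as "ASAP," or "brb!". One possible workaround is a word-boundary regular expression, sketched below with a hypothetical three-entry table `SLANG` that stands in for the full dictionary.

```python
import re

# Hypothetical mini table for illustration only
SLANG = {
    "ASAP": "As Soon As Possible",
    "BRB": "Be Right Back",
    "IMO": "In My Opinion",
}

# \b anchors the match at word boundaries, so trailing punctuation is ignored
_SLANG_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, SLANG)) + r")\b", re.IGNORECASE
)

def expand_slang(text):
    # Look up the matched word in uppercase, since the keys are uppercase
    return _SLANG_RE.sub(lambda m: SLANG[m.group(0).upper()], text)

print(expand_slang("Reply asap, brb!"))
# Reply As Soon As Possible, Be Right Back!
```

Be careful with very short keys (for example a single letter): case-insensitive matching can expand ordinary words or initials that were never meant as chat abbreviations.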


# Step 06: Spell Checker

**Libraries commonly used for spell checking include TextBlob, pyenchant, and pyspellchecker (imported as `spellchecker`).**

In [12]:
from textblob import TextBlob
incorrect_text = "Suddnly, the weathr became very cold and it startd to snow unexpectedly."
txtblb = TextBlob(incorrect_text)
txtblb.correct().string

'Suddenly, the weather became very cold and it started to snow unexpectedly.'
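To give a feel for what a spell checker does internally, here is a toy sketch that picks the closest word from a known vocabulary using `difflib` from the standard library. This is not how TextBlob works internally; the tiny `VOCAB` list, the function name `correct_word`, and the `cutoff` value are assumptions chosen for the demonstration.

```python
from difflib import get_close_matches

# Hypothetical tiny vocabulary; a real checker uses a large word list
VOCAB = ["suddenly", "the", "weather", "became", "very", "cold", "started", "snow"]

def correct_word(word, vocab=VOCAB, cutoff=0.8):
    """Return the most similar vocabulary word, or the word itself if none is close."""
    matches = get_close_matches(word.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print([correct_word(w) for w in ["suddnly", "weathr", "startd", "tiger"]])
# ['suddenly', 'weather', 'started', 'tiger']
```

Words with no sufficiently similar entry, like "tiger" here, are returned unchanged.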

# Step 07: Removing Stopwords

Stopwords in English are commonly used words that are often considered to be of little value in text analysis due to their high frequency and general utility in constructing sentences. 

Examples of stop words in English include:

* Common Pronouns: I, me, he, she, it, you, they, we, us, them.
* Prepositions: in, on, at, by, with, under, over, between, through, etc.
* Conjunctions: and, or, but, so, for, nor, yet.
* Articles: a, an, the.
* Common Verbs: is, am, are, was, were, be, being, been, have, has, had, do, does, did

Stop words are often removed before tasks such as sentiment analysis, but for POS (part-of-speech) tagging we keep them, because the tagger relies on the full sentence structure. So whether to remove them depends on the task.

In [13]:
from nltk.corpus import stopwords
stopwords.words('english') # Stopwords in English
stopwords.words('french') # Stopwords in French
stopwords.words('spanish') # Stopwords in Spanish

['de',
 'la',
 'que',
 'el',
 'en',
 'y',
 'a',
 'los',
 'del',
 'se',
 'las',
 'por',
 'un',
 'para',
 'con',
 'no',
 'una',
 'su',
 'al',
 'lo',
 'como',
 'más',
 'pero',
 'sus',
 'le',
 'ya',
 'o',
 'este',
 'sí',
 'porque',
 'esta',
 'entre',
 'cuando',
 'muy',
 'sin',
 'sobre',
 'también',
 'me',
 'hasta',
 'hay',
 'donde',
 'quien',
 'desde',
 'todo',
 'nos',
 'durante',
 'todos',
 'uno',
 'les',
 'ni',
 'contra',
 'otros',
 'ese',
 'eso',
 'ante',
 'ellos',
 'e',
 'esto',
 'mí',
 'antes',
 'algunos',
 'qué',
 'unos',
 'yo',
 'otro',
 'otras',
 'otra',
 'él',
 'tanto',
 'esa',
 'estos',
 'mucho',
 'quienes',
 'nada',
 'muchos',
 'cual',
 'poco',
 'ella',
 'estar',
 'estas',
 'algunas',
 'algo',
 'nosotros',
 'mi',
 'mis',
 'tú',
 'te',
 'ti',
 'tu',
 'tus',
 'ellas',
 'nosotras',
 'vosotros',
 'vosotras',
 'os',
 'mío',
 'mía',
 'míos',
 'mías',
 'tuyo',
 'tuya',
 'tuyos',
 'tuyas',
 'suyo',
 'suya',
 'suyos',
 'suyas',
 'nuestro',
 'nuestra',
 'nuestros',
 'nuestras',
 'vuestro'

In [14]:
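from nltk.corpus import stopwords

# Build the lookup set once; calling stopwords.words() for every word is slow
english_stopwords = set(stopwords.words('english'))

def remove_stopwords(text):
    new_text = []
    for word in text.split():
        # Compare in lowercase so that "I" and "It" are matched as well
        if word.lower() not in english_stopwords:
            new_text.append(word)
    return ' '.join(new_text)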


text = "I was roaming around a jungle while I saw a tiger.It was sleeping under a tree. I waited there for three hours"
remove_stopwords(text)

'roaming around jungle saw tiger.It sleeping tree. waited three hours'
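One caution for sentiment tasks: NLTK's English list includes negation words such as "not", and removing them can flip the meaning of a review. The sketch below uses a hypothetical mini stopword set (far smaller than NLTK's) and a `keep` parameter to show the idea of protecting negations.

```python
# Hypothetical mini stopword list; NLTK's English list is much longer
STOPWORDS = {"i", "was", "a", "the", "it", "not", "this", "is"}
NEGATIONS = {"not", "no", "never"}

def drop_stopwords(text, keep=frozenset()):
    """Remove stopwords, except any words listed in `keep`."""
    blocked = STOPWORDS - set(keep)
    return " ".join(w for w in text.split() if w.lower() not in blocked)

review = "This is not a good movie"
print(drop_stopwords(review))                   # good movie
print(drop_stopwords(review, keep=NEGATIONS))   # not good movie
```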

# Step 08: Remove emojis 😊

In [15]:
def remove_emojis(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F700-\U0001F77F"  # alchemical symbols
                               u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               u"\U0001FA00-\U0001FA6F"  # Chess Symbols
                               u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               u"\U00002702-\U000027B0"  # Dingbats
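                               # The original range \U000024C2-\U0001F251 also matched CJK and other scripts
                               u"\U000024C2"             # circled Latin capital letter M
                               u"\U00002600-\U000026FF"  # Miscellaneous Symbols
                               u"\U0001F1E6-\U0001F1FF"  # regional indicators (flags)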
                               "]+", flags=re.UNICODE)
    
    return emoji_pattern.sub(r'', text)

text_with_emojis = "Hello! 😊 This is a sample text with emojis. 🌟"
text_without_emojis = remove_emojis(text_with_emojis)

print("Original text:", text_with_emojis)
print("Text without emojis:", text_without_emojis)


Original text: Hello! 😊 This is a sample text with emojis. 🌟
Text without emojis: Hello!  This is a sample text with emojis. 
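Explicit code point ranges need updating whenever new emoji are added to Unicode. A rougher alternative, sketched here, filters on the Unicode general category with the standard `unicodedata` module. The function name `strip_symbols` is hypothetical. This is an approximation: the category "So" (Symbol, other) also covers non-emoji symbols such as ©, and it does not cover modifiers or joiners used in some emoji sequences.

```python
import unicodedata

def strip_symbols(text):
    """Drop every character whose Unicode category is 'So' (Symbol, other)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")

print(strip_symbols("Hello! \U0001F60A This is a sample. \U0001F31F"))
# Hello!  This is a sample. 
```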


# Step 09: Tokenization

Tokenization is a fundamental process in NLP that involves breaking down a given text into individual units, referred to as tokens. These tokens can be words, phrases, or other meaningful elements, depending on the granularity of the analysis. The primary goal of tokenization is to transform unstructured text into a format that is suitable for further analysis and processing.

In easier words, Tokenization is a preprocessing step designed to make raw text understandable and manageable for machines. By breaking down the text into smaller units (tokens), machines can effectively analyze and interpret the language. Each token represents a meaningful unit, such as a word or a phrase, allowing the machine to grasp the structures and patterns within the text.

**A. Using the Split function**

In [16]:
# Word tokenization
sentence1 = "I am going to Australia"
word_tokens = sentence1.split()
print(word_tokens)

['I', 'am', 'going', 'to', 'Australia']


In [17]:
# Sentence tokenization
sentence2 = "I am going to Australia.I will stay there for 5 days.I hope the tour will be great"
sentence_tokens = sentence2.split('.')
print(sentence_tokens)

['I am going to Australia', 'I will stay there for 5 days', 'I hope the tour will be great']


In [18]:
# Problems with using the split function (in word tokenization)
sentence3 = "I am going to Australia!"
sentence3.split()

['I', 'am', 'going', 'to', 'Australia!']

**In the provided code, the use of the split() function resulted in tokenizing "Australia" as "Australia!" due to the presence of an exclamation mark. Consequently, the machine would treat "Australia!" and "Australia" as distinct words. This creates an issue because, in subsequent instances, if the machine encounters the word "Australia" without an exclamation mark, it may not recognize it as the same entity. So using split() function becomes problematic in such cases.**

In [19]:
# Problems with using the split function (in sentence tokenization)

sentence4 = "Where do you live? I haven't seen you here before"
sentence4.split('.')

["Where do you live? I haven't seen you here before"]

**In sentence tokenization, the two sentences were returned as a single token because we split only on '.', and the first sentence ends with '?'.**

**B. Using Regular Expression**

In [20]:
sentence3 = "I am going to Australia!"
tokens = re.findall(r"\w+", sentence3)
print(tokens)

['I', 'am', 'going', 'to', 'Australia']
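The pattern above silently drops punctuation. If you want to keep punctuation as separate tokens, or split sentences on '.', '!' and '?', slightly richer patterns can help. The sketch below is one possible approach; the function names are hypothetical, and the patterns still fail on abbreviations, contractions, and email addresses.

```python
import re

def regex_word_tokens(text):
    # A run of word characters, or any single non-space, non-word character
    return re.findall(r"\w+|[^\w\s]", text)

def regex_sentences(text):
    # Split on whitespace that follows '.', '!' or '?'
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(regex_word_tokens("I am going to Australia!"))
# ['I', 'am', 'going', 'to', 'Australia', '!']

print(regex_sentences("Where do you live? I haven't seen you here before"))
# ['Where do you live?', "I haven't seen you here before"]
```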



While regular expressions handle punctuation somewhat better than the split() function, they still have their limitations. A more effective approach for tokenization is to use specialized libraries. Here we'll use the NLTK and Spacy libraries, which are designed for natural language processing tasks and provide more robust tokenization.

**C. Using NLTK (Natural Language ToolKit)**

NLTK is a powerful Python library designed for working with human language data. NLTK provides access to a vast collection of corpora, lexical resources, and algorithms, making it a valuable resource.

In [21]:
from nltk.tokenize import sent_tokenize,word_tokenize
sentence3 = "I am going to Australia!"
word_tokenize(sentence3)

['I', 'am', 'going', 'to', 'Australia', '!']

In [22]:
sentence4 = "Where do you live? I haven't seen you here before"
sent_tokenize(sentence4)

['Where do you live?', "I haven't seen you here before"]

In [23]:
# Let's try the NLTK library on some trickier sentences
sentence5 = "I have a Ph.D in A.I"
sentence6 = "We're here to help! Mail us at abc@gmail.com"
sentence7 = "A 5km ride cost $10.50"

In [24]:
word_tokenize(sentence5)

['I', 'have', 'a', 'Ph.D', 'in', 'A.I']

PERFECT!

In [25]:
word_tokenize(sentence6)

['We',
 "'re",
 'here',
 'to',
 'help',
 '!',
 'Mail',
 'us',
 'at',
 'abc',
 '@',
 'gmail.com']

In [26]:
word_tokenize(sentence7)

['A', '5km', 'ride', 'cost', '$', '10.50']

So we can see that NLTK does not give perfect results either for sentences 6 and 7: it splits the email address into three tokens and leaves "5km" as a single token. But it certainly performs better than split() and the regular expression.

A library called Spacy is great for tokenization! Let's see!

Spacy is a leading open-source library for advanced NLP tasks in Python. It is designed for efficient and production-ready processing of large volumes of text data. Spacy excels in tasks such as tokenization, part-of-speech tagging, named entity recognition, and dependency parsing, and it is known for its speed and accuracy.

**D. Using Spacy**

In [27]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [28]:
doc1 = nlp(sentence5)
doc2 = nlp(sentence6)
doc3 = nlp(sentence7)

In [29]:
for token in doc1:
    print(token)

I
have
a
Ph
.
D
in
A.I


In [30]:
for token in doc2:
    print(token)

We
're
here
to
help
!
Mail
us
at
abc@gmail.com


In [31]:
for token in doc3:
    print(token)

A
5
km
ride
cost
$
10.50


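**So Spacy handles these examples differently from NLTK: it keeps the email address as one token and separates "5" from "km". It is not perfect either, since it breaks "Ph.D" into three tokens, but overall it gives the most useful tokens of the four methods shown here.**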

# Step 10: Stemming

Stemming is a text normalization technique used in NLP and information retrieval to reduce words to their base or root form, known as the "stem." The objective of stemming is to group words with similar meanings together by removing prefixes or suffixes, thus treating variations of a word as a common root.

For example, stemming transforms words like "running" and "runs" to the common stem "run." Irregular forms such as "ran" are generally left unchanged, because rule-based stemmers only strip affixes.

In [32]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

stem_txt1 = "walking walks walked walk"
stem_words(stem_txt1) # Every form is reduced to the stem "walk"

'walk walk walk walk'
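To see why suffix stripping works for regular forms but not irregular ones, here is a deliberately naive stemmer. It is not the Porter algorithm; the three-suffix rule list and the name `toy_stem` are assumptions made for illustration.

```python
# Hypothetical rule list, far smaller than Porter's rule set
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([toy_stem(w) for w in ["walking", "walks", "walked", "walk", "ran", "running"]])
# ['walk', 'walk', 'walk', 'walk', 'ran', 'runn']
```

The toy rules leave "ran" untouched and turn "running" into "runn". The Porter stemmer has an extra rule for doubled consonants, so it gives "run" for "running", but irregular forms like "ran" still need a dictionary-based approach.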

In [33]:
stem_txt2 = "probably my alltime favorite story is a story of sacrifice and dedication to a noble cause."
stem_words(stem_txt2)

'probabl my alltim favorit stori is a stori of sacrific and dedic to a nobl cause.'

**The problem with stemming is that it sometimes produces stems that are not real words in the language, such as "probabl", "alltim", "favorit", and "stori". In such cases we can use lemmatization instead. Lemmatization works similarly to stemming, but it returns the dictionary form (lemma) of the word, which is a meaningful word.
Lemmatization is slower than stemming because it has to look up a valid base form, whereas stemming simply strips affixes without checking whether the result is meaningful. So if the output will be shown to a user and meaningfulness matters, use lemmatization. If the output is not shown to the user and speed is the main concern, use stemming.**

# Step 11: Lemmatization

In [34]:
nlp = spacy.load("en_core_web_sm")
sentence = "The cats are running and jumping on the roof."
doc = nlp(sentence)
print("{0:15} {1:15}".format("Word", "Lemma"))
for token in doc:
    print("{0:15} {1:15}".format(token.text, token.lemma_))


Word            Lemma          
The             the            
cats            cat            
are             be             
running         run            
and             and            
jumping         jump           
on              on             
the             the            
roof            roof           
.               .              
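Conceptually, a lemmatizer maps each word form to a dictionary entry instead of stripping suffixes. The toy sketch below uses a hand-written lookup table to show the idea. It is not how Spacy works internally; the table `LEMMA_TABLE` and the function `toy_lemmatize` are hypothetical, and real lemmatizers rely on large lexicons and part-of-speech information.

```python
# Hypothetical lookup table for illustration only
LEMMA_TABLE = {
    "cats": "cat",
    "are": "be",
    "running": "run",
    "ran": "run",
    "jumping": "jump",
}

def toy_lemmatize(text):
    """Map each lowercase token to its lemma, falling back to the token itself."""
    return [LEMMA_TABLE.get(w.lower(), w.lower()) for w in text.split()]

print(toy_lemmatize("The cats ran"))
# ['the', 'cat', 'run']
```

Unlike a suffix-stripping stemmer, the lookup handles the irregular form "ran" and always returns a real word, at the cost of needing an entry for every form.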




Check out my previous works:
* [NLP tutorial 1: A detailed introduction to NLP](https://www.kaggle.com/code/faizulislam19095/a-detailed-introduction-to-nlp)
* [Lung Cancer Prediction- EDA+SMOTE+ML modeling](https://www.kaggle.com/code/faizulislam19095/lung-cancer-prediction-eda-smote-modeling)