# Introduction

Text preprocessing is a crucial step in Natural Language Processing (NLP) that involves transforming raw text into a format suitable for analysis. This process helps improve the performance of machine learning models by reducing noise and ensuring that the data is consistent and relevant.

# Load data
Let's use the IMDB movie reviews dataset to practice text preprocessing.

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.read_csv("/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv")

In [None]:
df.shape

In [None]:
df.sample(5)

# 1. Lowercasing

The goal is to convert all characters in the text to lowercase so that the dataset is uniform and consistent. This removes variation caused by capitalization, which would otherwise make the same word appear as several distinct tokens, because string comparison in Python (and in most tokenizers) is case-sensitive.

It ensures that words like "Apple," "apple," and "APPLE" are treated as the same word.

In [None]:
# Sample text
text = "Hello World! I'm learning text preprocessing."

# Convert to lowercase
lowercase_text = text.lower()

print(lowercase_text)

In [None]:
df['review'][45]

In [None]:
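# Assign the result back; str.lower() returns a new Series and does not modify df in place
df['review'] = df['review'].str.lower()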

In [None]:
df
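
For text that is not purely English, `lower()` may not be enough to make two spellings compare equal. The short sketch below contrasts it with Python's built-in `str.casefold()`; the German example words are only an illustration and are not part of the IMDB data.

```python
# Compare lower() and casefold() on a word that has a special lowercase form
words = ["Straße", "STRASSE"]

lowered = [w.lower() for w in words]
folded = [w.casefold() for w in words]

print(lowered)  # ['straße', 'strasse']
print(folded)   # ['strasse', 'strasse']
```

For an English-only dataset such as the movie reviews, `lower()` is normally sufficient.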

# 2. Remove HTML Tags

In many real-world applications, text data is often extracted from web pages, which can contain HTML tags. These tags are useful for structuring content but are irrelevant when processing text for NLP tasks. Removing HTML tags is, therefore, a key preprocessing step to clean the data and make it usable for text analysis.

HTML tags don't provide any useful information for text analysis and can interfere with the accuracy of NLP models.

Better Focus on Content: By removing these tags, we focus only on the actual text content, which helps the model capture meaningful information.

Here’s an example of raw text with HTML tags:

<p>Hello, <strong>world!</strong> Welcome to <a href='https://example.com'>NLP</a> tutorials.</p>


In [None]:
import re

def remove_html_tags(text):
    # Regular expression pattern to match HTML tags
    pattern = re.compile('<.*?>')
    
    # Replace the HTML tags with an empty string
    return re.sub(pattern, '', text)


In [None]:
html_text = "<p>Hello, <strong>world!</strong> Welcome to <a href='https://example.com'>NLP</a> tutorials.</p>"
html_text

In [None]:
cleaned_text = remove_html_tags(html_text)
print(cleaned_text)

In [None]:
df['review'] = df['review'].apply(remove_html_tags)

In [None]:
df['review']
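
A regular expression removes the tags but leaves HTML entities such as `&amp;` untouched. As an alternative, here is a minimal sketch built on Python's standard `html.parser` module that keeps only the text nodes and decodes entities along the way. The class name `_TextExtractor` and the function `strip_html` are names chosen for this illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into "&"
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text that is still buffered
    return "".join(parser.parts)


sample = "<p>Fish &amp; chips <b>rock</b>!</p>"
print(strip_html(sample))  # Fish & chips rock!
```

Whether this is worth the extra code depends on how messy the markup is; for simple tags the regex version is usually enough.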

# 3. Remove URLs

URLs, while essential for linking content on the web, typically don't contribute meaningful linguistic information for most NLP tasks, such as text classification, sentiment analysis, or language modeling. 

Instead, they add noise and can confuse machine learning models. By removing URLs, we help the model focus on the core text, making it easier to extract patterns and meaning. 
In most cases, removing URLs enhances the overall quality and performance of NLP models, unless the task specifically requires analyzing hyperlinks.

In [None]:
import re

def remove_urls(text):
    # Regular expression pattern to match URLs
    url_pattern = re.compile(r'http[s]?://\S+|www\.\S+')
    
    # Replace URLs with an empty string
    return re.sub(url_pattern, '', text)

In [None]:
text_with_urls = "Check out my portfolio at https://example.com or visit www.example.org for more details."
print(text_with_urls)

In [None]:
cleaned_text = remove_urls(text_with_urls)
print(cleaned_text)

In [None]:
# If there are any links in our dataset, remove them
df['review'] = df['review'].apply(remove_urls)

In [None]:
df['review']
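
When the task does care that a link was present, one option is to replace each URL with a placeholder token instead of deleting it. The sketch below reuses the same pattern; the function name `replace_urls` and the `<URL>` placeholder are assumptions made for this example.

```python
import re

# Same idea as the pattern above: http(s) links or anything starting with "www."
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def replace_urls(text, placeholder="<URL>"):
    """Swap every URL for a fixed placeholder token."""
    return URL_PATTERN.sub(placeholder, text)


example = "Check out my portfolio at https://example.com or visit www.example.org for more details."
print(replace_urls(example))
# Check out my portfolio at <URL> or visit <URL> for more details.
```

Note that a later punctuation-removal step would strip the angle brackets, so a plain token such as `URL` may be the safer placeholder in a full pipeline.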

# 4. Remove Punctuation

Punctuation is irrelevant in many NLP tasks, so removing it simplifies the text.

It matters mainly because of its effect on tokenization.

Punctuation marks are important for human readability but generally carry little meaning in most NLP tasks.
When left in the text, they can lead to an unnecessarily large or fragmented vocabulary.
For instance, "hello!" and "hello" would be treated as different tokens if punctuation is not removed, despite having the same meaning.
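
To make the vocabulary effect concrete, here is a small sketch that counts distinct whitespace tokens before and after stripping punctuation. The sample sentence is made up for the illustration.

```python
import string
from collections import Counter

text = "hello! hello world. Hello, world"

# Vocabulary when punctuation stays attached to the words
raw_vocab = Counter(text.lower().split())

# Vocabulary after removing ASCII punctuation
clean = text.lower().translate(str.maketrans("", "", string.punctuation))
clean_vocab = Counter(clean.split())

print(len(raw_vocab), dict(raw_vocab))
print(len(clean_vocab), dict(clean_vocab))
```

The first line reports five distinct tokens, the second only two ("hello" and "world").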

In [5]:
import string
string.punctuation  # the ASCII punctuation characters defined by Python

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [6]:
def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

In [7]:
# Example
english_text = "Hello, world! How's everything going?"
cleaned_text = remove_punctuation(english_text)
print(cleaned_text)

Hello world Hows everything going


In [8]:
import re

def remove_nepali_punctuation(text):
    # Defining both English and Nepali punctuation marks
    nepali_punctuation = r'[।,!?–—"\'\‘\’\“\”\(\)\[\]\{\}:;]'
    
    # Use regex to remove the defined punctuation marks
    return re.sub(nepali_punctuation, '', text)

In [9]:
# Example
nepali_text = "यो एक परीक्षण वाक्य हो। के तपाईँलाई थाहा छ?"
cleaned_nepali_text = remove_nepali_punctuation(nepali_text)
print(cleaned_nepali_text)

यो एक परीक्षण वाक्य हो के तपाईँलाई थाहा छ


In [10]:
# Applying this in our dataset
df['review'] = df['review'].apply(remove_punctuation)
df['review']


# 5. Chat word treatment

In Natural Language Processing (NLP) tasks involving chat data, handling chat-specific language is crucial for accurate analysis. Chat messages are typically informal, often filled with abbreviations, slang, emotions, inconsistent grammar, and punctuation variations. Properly treating these "chat words" is an important preprocessing step.

Here are some common challenges and strategies for dealing with chat words:

Abbreviations and Slang:

Chat data often includes abbreviations like "u" for "you," "lol" for "laugh out loud," or "omg" for "oh my god."
Solution: Use a mapping or a dictionary to expand common abbreviations to their full forms.


Misspellings and Typos:

Chat text often contains misspellings due to fast typing, like "thnks" instead of "thanks."
Solution: Apply spelling correction techniques or use models trained to handle noisy text.

Repetitions:

Users often exaggerate for emphasis, such as "soooo good" or "noooo way."
Solution: Normalize repeated characters (e.g., reducing "soooo" to "so").


In [11]:
#source: https://github.com/rishabhverma17/sms_slang_translator/blob/master/slang.txt
chat_abbreviations = {
    "AFAIK": "As Far As I Know",
    "AFK": "Away From Keyboard",
    "ASAP": "As Soon As Possible",
    "ATK": "At The Keyboard",
    "ATM": "At The Moment",
    "A3": "Anytime, Anywhere, Anyplace",
    "BAK": "Back At Keyboard",
    "BBL": "Be Back Later",
    "BBS": "Be Back Soon",
    "BFN": "Bye For Now",
    "B4N": "Bye For Now",
    "BRB": "Be Right Back",
    "BRT": "Be Right There",
    "BTW": "By The Way",
    "B4": "Before",
    "CU": "See You",
    "CUL8R": "See You Later",
    "CYA": "See You",
    "FAQ": "Frequently Asked Questions",
    "FC": "Fingers Crossed",
    "FWIW": "For What It's Worth",
    "FYI": "For Your Information",
    "GAL": "Get A Life",
    "GG": "Good Game",
    "GN": "Good Night",
    "GMTA": "Great Minds Think Alike",
    "GR8": "Great!",
    "G9": "Genius",
    "IC": "I See",
    "ICQ": "I Seek You (also a chat program)",
    "ILU": "I Love You",
    "IMHO": "In My Honest/Humble Opinion",
    "IMO": "In My Opinion",
    "IOW": "In Other Words",
    "IRL": "In Real Life",
    "KISS": "Keep It Simple, Stupid",
    "LDR": "Long Distance Relationship",
    "LMAO": "Laugh My A.. Off",
    "LOL": "Laughing Out Loud",
    "LTNS": "Long Time No See",
    "L8R": "Later",
    "MTE": "My Thoughts Exactly",
    "M8": "Mate",
    "NRN": "No Reply Necessary",
    "OIC": "Oh I See",
    "PITA": "Pain In The A..",
    "PRT": "Party",
    "PRW": "Parents Are Watching",
    "QPSA": "Que Pasa?",
    "ROFL": "Rolling On The Floor Laughing",
    "ROFLOL": "Rolling On The Floor Laughing Out Loud",
    "ROTFLMAO": "Rolling On The Floor Laughing My A.. Off",
    "SK8": "Skate",
    "STATS": "Your Sex and Age",
    "ASL": "Age, Sex, Location",
    "THX": "Thank You",
    "TTFN": "Ta-Ta For Now!",
    "TTYL": "Talk To You Later",
    "U": "You",
    "U2": "You Too",
    "U4E": "Yours For Ever",
    "WB": "Welcome Back",
    "WTF": "What The F...",
    "WTG": "Way To Go!",
    "WUF": "Where Are You From?",
    "W8": "Wait...",
    "7K": "Sick:-D Laugher",
    "TFW": "That Feeling When",
    "MFW": "My Face When",
    "MRW": "My Reaction When",
    "IFYP": "I Feel Your Pain",
    "TNTL": "Trying Not To Laugh",
    "JK": "Just Kidding",
    "IDC": "I Don’t Care",
    "ILY": "I Love You",
    "IMU": "I Miss You",
    "ADIH": "Another Day In Hell",
    "ZZZ": "Sleeping, Bored, Tired",
    "WYWH": "Wish You Were Here",
    "TIME": "Tears In My Eyes",
    "BAE": "Before Anyone Else",
    "FIMH": "Forever In My Heart",
    "BSAAW": "Big Smile And A Wink",
    "BWL": "Bursting With Laughter",
    "LMAO": "Laughing My A** Off",
    "BFF": "Best Friends Forever",
    "CSL": "Can’t Stop Laughing"
}

In [71]:
typos_corrections = {
    "teh": "the",
    "recieve": "receive",
    "occuring": "occurring",
    "adress": "address",
    "tommorow": "tomorrow",
    "becuase": "because",
    "definately": "definitely",
    "seperate": "separate",
    "untill": "until",
    "embarass": "embarrass",
    "neccessary": "necessary",
    "wich": "which",
    "thier": "their",
    "wierd": "weird",
    # "affect" and "loose" are valid words, so they are not mapped here:
    # rewriting them blindly would corrupt correct sentences.
    "alot": "a lot",
    "suprise": "surprise",
    "occured": "occurred",
    "accomodate": "accommodate",
    "tehy": "they",
    "bruh": "brother",
    "beleive": "believe",
    "enviroment": "environment",
    "restraunt": "restaurant"
}

In [72]:
# Common chat repetitions
chat_repetitions = {
    "soooo": "so",
    "heyyyy": "hey",
    "yesss": "yes",
    "noooo": "no",
    "woooowwwww": "wow",
    "whyyy": "why",
    "cyaaa": "cya",
    "hiii": "hi",
    "okkk": "ok",
    "loool": "lol",
    "yaaa": "yeah"
}

In [73]:
def normalize_repetitions(text):
    # Collapse any run of a repeated character to at most two (e.g., "soooo" -> "soo")
    return re.sub(r'(.)\1+', r'\1\1', text)

In [74]:
def correct_typos(text, typos_dict):
    """Correct typos based on the dictionary."""
    for typo, correct in typos_dict.items():
        # re.escape guards against typos that contain regex metacharacters
        text = re.sub(r'\b' + re.escape(typo) + r'\b', correct, text)  # Replace whole words only
    return text

In [83]:
chat_processing_dict = {**chat_abbreviations, **chat_repetitions, **typos_corrections}

def preprocess_chat(text):
    new_text = []
    
    # Lowercase the dictionary keys so that the lookup is case-insensitive
    chat_processing_dict_lower = {k.lower(): v for k, v in chat_processing_dict.items()}
    for word in text.split():
        # Convert the word to lowercase for case-insensitive matching
        if word.lower() in chat_processing_dict_lower:
            new_text.append(chat_processing_dict_lower[word.lower()])
        else:
            new_text.append(word)
    
    return " ".join(new_text)

In [85]:

# Example usage
example_chat = "heyyyy bruh come home ASAP"
processed_chat = preprocess_chat(example_chat)
print(processed_chat)

hey brother come home As Soon As Possible
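
Splitting on whitespace means a word with punctuation attached, such as "ASAP!", does not match its dictionary key. One possible workaround, sketched below, is to substitute on word characters only so that the surrounding punctuation is preserved. The `slang` mapping here is a tiny illustrative subset, and `expand_chat_words` is a name invented for this example.

```python
import re

# Small illustrative mapping; a real project would use a fuller list
slang = {"asap": "as soon as possible", "brb": "be right back", "u": "you"}


def expand_chat_words(text, mapping):
    """Replace each run of word characters that has an entry in `mapping`."""
    def swap(match):
        word = match.group(0)
        # Case-insensitive lookup; unknown words are returned unchanged
        return mapping.get(word.lower(), word)

    return re.sub(r"\w+", swap, text)


print(expand_chat_words("BRB, call u ASAP!", slang))
# be right back, call you as soon as possible!
```

Because `\w+` also splits contractions at the apostrophe, very short keys (for example a single letter) should be added to the mapping with care.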


# 6. Spelling Correction

Several libraries offer spelling correction, but they mostly work only for common words.
For regional languages or specialized domains, it is usually better to build your own spell checker.

In [98]:
from textblob import TextBlob


sentence = "seeveral ggenerations of late king's family aare ddestroyed in the saame mannner"
textBlob = TextBlob(sentence)
textBlob.correct().string

"several generations of late king's family are destroyed in the same manner"
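
As a rough idea of what a custom spell checker could look like, here is a minimal sketch in the style of the well-known edit-distance approach: generate every string one edit away from the input and keep the most frequent candidate found in a domain vocabulary. The vocabulary and its frequencies are hypothetical; in practice they would be counted from your own corpus.

```python
from collections import Counter

# Hypothetical domain vocabulary with made-up frequencies
vocab = Counter({"tokenizer": 50, "lemma": 30, "corpus": 40, "stemmer": 20})

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings that are one delete, transpose, replace or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return the most frequent known word within one edit, else the input."""
    if word in vocab:
        return word
    candidates = [w for w in edits1(word) if w in vocab]
    return max(candidates, key=vocab.get) if candidates else word


print([correct(w) for w in ["tokeniser", "corpsu", "lema", "xyz"]])
# ['tokenizer', 'corpus', 'lemma', 'xyz']
```

For a non-Latin script, `LETTERS` would need to be replaced with the characters of that script.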

# 7. Removing Stop words

Stopwords are common words like "is," "the," "and," or "to" that occur frequently in text but carry little meaning on their own. In natural language processing (NLP) tasks such as sentiment analysis or document classification, removing stopwords reduces noise and can improve model performance.

However, in tasks like part-of-speech (POS) tagging, stopwords should not be removed, because the tagger relies on them for grammatical context.



In [107]:
from nltk.corpus import stopwords

print("---\nEnglish Stopwords\n----")
print(stopwords.words('english'))
print("---\nNepali Stopwords\n----")
print(stopwords.words('nepali'))

---
English Stopwords
----
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'ow


In [116]:
from nltk.corpus import stopwords


def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))  # a set gives fast membership checks
    filtered_words = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(filtered_words)

In [118]:
text = "This is a simple example to demonstrate removing stopwords from a text"
cleaned_text = remove_stopwords(text)
print(cleaned_text)

simple example demonstrate removing stopwords text


In [120]:
text = "I will go to the park in the evening"
cleaned_text = remove_stopwords(text)
print(cleaned_text) 

go park evening
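
NLTK's English list includes negations such as "no", "nor" and "not" (they appear in the output above), and dropping them can flip the meaning of a review. The sketch below shows one way to keep negations while removing the rest. To stay self-contained it uses a tiny hand-written stop-word set, which is an assumption for this example and not NLTK's full list.

```python
# Tiny illustrative stop-word set (an assumption, not NLTK's full list)
STOP_WORDS = {"i", "this", "is", "a", "the", "was", "not", "no", "and", "it", "did"}
NEGATIONS = {"not", "no", "nor", "never"}


def remove_stopwords_keep_negations(text, stop_words=STOP_WORDS):
    """Drop stop words, but leave negation words in place."""
    to_drop = stop_words - NEGATIONS
    return " ".join(w for w in text.split() if w.lower() not in to_drop)


review = "This movie was not good and I did not like it"
print(remove_stopwords_keep_negations(review))
# movie not good not like
```

With plain stop-word removal the same review would become "movie good like", which reads as positive.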


# 8. Handling Emojis

Handling emojis in text is important because emojis convey sentiment, emotions, and sometimes additional context in communication. 

Depending on your use case, you can choose to either remove, replace, or convert emojis.



In [122]:
import re

def remove_emojis(text):
    # Unicode range for emojis
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U00002702-\U000027B0"  # other symbols
        "\U000024C2-\U0001F251"  # very broad range: also matches CJK and many other non-emoji characters
        "]+", flags=re.UNICODE)
    
    return emoji_pattern.sub(r'', text)

In [127]:
text = "I am happy 😊, but a little tired 😴"
clean_text = remove_emojis(text)
clean_text

'I am happy , but a little tired '

In [132]:
# Replacing emojis with their meaning

import emoji

def replace_emojis(text):
    return emoji.demojize(text, delimiters=("", ""))

# Example usage
text = "I am happy 😊, but a little tired 😴"
processed_text = replace_emojis(text)
print(processed_text)

I am happy smiling_face_with_smiling_eyes, but a little tired sleeping_face


In [135]:
# Using a sentiment map to replace emoji with sentiments
emoji_sentiment_map = {
    "😊": "happy",
    "😴": "tired",
    "😡": "angry",
    "😂": "laughing",
    "😢": "sad"
}

def convert_emojis_to_sentiment(text):
    for emoji_char, sentiment in emoji_sentiment_map.items():
        text = text.replace(emoji_char, sentiment)
    return text

# Example usage
text = "I am 😊, but a little 😴"
processed_text = convert_emojis_to_sentiment(text)
print(processed_text) 


I am happy, but a little tired


# 9. Tokenization

Tokenization is the process of breaking down text into smaller units called "tokens." These tokens can be words, subwords, or characters, depending on the approach. Tokenization is a crucial step in text preprocessing because it transforms unstructured text data into a format that can be easily understood and analyzed by machine learning models.
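
To illustrate the difference between these granularities, here is a toy sketch of character-level tokenization and a greedy longest-match subword split. The `##` continuation prefix follows a common convention, but the vocabulary `toy_vocab` and both function names are invented for this example; this is not the implementation used by any particular library.

```python
def char_tokenize(text):
    """Character-level tokens: every character is its own token."""
    return list(text)


def greedy_subword_tokenize(word, vocab):
    """Greedy longest-match split; non-initial pieces carry a '##' prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece of the vocabulary fits
        tokens.append(piece)
        start = end
    return tokens


toy_vocab = {"token", "##ization", "##ize", "##s", "un", "##known"}

print(char_tokenize("NLP"))                                # ['N', 'L', 'P']
print(greedy_subword_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(greedy_subword_tokenize("tokens", toy_vocab))        # ['token', '##s']
print(greedy_subword_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Real subword tokenizers learn their vocabulary from a corpus; the hand-written set above only serves to show the splitting step.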

### Tokenization Tools:
#### NLTK (Natural Language Toolkit): 
Provides simple word and sentence tokenizers.
#### spaCy: 
A more advanced NLP library that handles tokenization efficiently.
#### Hugging Face Tokenizers: 
Designed for fast, memory-efficient tokenization in deep learning models like BERT, GPT, etc.

In [143]:
# using python split function

# word
text = "Tokenization is important for NLP tasks."
print(text.split())

# sentence
text = "Tokenization is the first step in NLP. It splits text into smaller parts."
print(text.split("."))

['Tokenization', 'is', 'important', 'for', 'NLP', 'tasks.']
['Tokenization is the first step in NLP', ' It splits text into smaller parts', '']


Plain `split()` has its limitations. As seen above, **tasks.** (with the trailing period) becomes a token, and it will not match a later **tasks** token. This is a problem for any step that counts or compares words.

In [144]:
# Regular Expressions

import re

def tokenize(text):
    # Regular expression for matching words, numbers, and punctuation
    pattern = r'\w+|[^\w\s]'
    return re.findall(pattern, text)



In [145]:
# Example
text = "Tokenization is hard! Let's break it down: 48+20 = 68."
tokens = tokenize(text)
print(tokens)

['Tokenization', 'is', 'hard', '!', 'Let', "'", 's', 'break', 'it', 'down', ':', '48', '+', '20', '=', '68', '.']


### Using NLTK

In [138]:
# Word tokenization

from nltk.tokenize import word_tokenize

text = "Tokenization is important for NLP tasks."
tokens = word_tokenize(text)
print(tokens) 

['Tokenization', 'is', 'important', 'for', 'NLP', 'tasks', '.']


In [141]:
# Sentence tokenization

from nltk.tokenize import sent_tokenize

text = "Tokenization is the first step in NLP. It splits text into smaller parts."
sentences = sent_tokenize(text)

sentences


['Tokenization is the first step in NLP.',
 'It splits text into smaller parts.']

### Using spaCy
spaCy is a popular choice for tokenization: its rule-based tokenizer handles punctuation and contractions and is part of every language pipeline.

In [148]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [150]:
text = "Tokenization is the first step in NLP."
tok1 = nlp(text)

for tok in tok1:
    print(tok)

Tokenization
is
the
first
step
in
NLP
.


# 10. Stemming

Stemming is a text normalization technique in Natural Language Processing (NLP) that **reduces words to their base or root form.** The root form (called a "*stem*") may not always be a valid word but still represents the core meaning. 

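For example, words like "running," "runs," and "run" are all reduced to "**run**" in their stemmed form. Irregular forms such as "ran" are usually left unchanged, because a stemmer only strips affixes by rule.

- Most widely used in **information retrieval systems**, such as search engines like Google.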

In [155]:
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

In [162]:
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

In [165]:
# words = ["running", "runs", "runner", "easily", "fairly"]
sentence1 = "Running man runs like he has runned for years."

print(stem_words(sentence1))

run man run like he ha run for years.


In [171]:
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

stemmer = SnowballStemmer("english", ignore_stopwords=True)
text = "There is nothing either good or bad but thinking makes it so."
words = word_tokenize(text)
stemmed_words = [stemmer.stem(word) for word in words]

print("Original:", text)
print("Tokenized:", words)
print("Stemmed:", stemmed_words)

Original: There is nothing either good or bad but thinking makes it so.
Tokenized: ['There', 'is', 'nothing', 'either', 'good', 'or', 'bad', 'but', 'thinking', 'makes', 'it', 'so', '.']
Stemmed: ['there', 'is', 'noth', 'either', 'good', 'or', 'bad', 'but', 'think', 'make', 'it', 'so', '.']


The output of stemming may not always be a valid English word.
- If speed matters and the output is never shown to users, go with stemming. Otherwise, **lemmatization** is the way to go.
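
To see why stems are often not real words, consider this deliberately naive suffix stripper. It is a sketch for illustration only, far simpler than the Porter or Snowball algorithms used above, and the suffix list is an arbitrary assumption.

```python
# Arbitrary suffix list, checked in this order (not the Porter rules)
SUFFIXES = ("ing", "ly", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["thinking", "makes", "nothing", "quickly", "ran"]:
    print(w, "->", naive_stem(w))
```

Here "thinking" becomes "think", but "makes" becomes "mak" and "nothing" becomes "noth", while the irregular form "ran" is left as it is. Rule-based stripping is fast precisely because it never consults a dictionary.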

# 11. Lemmatization

Lemmatization is another text normalization technique in Natural Language Processing (NLP) that reduces words to their base or dictionary form, called a lemma. Unlike stemming, lemmatization considers the context of a word, using linguistic knowledge such as the word's part of speech, and ensures the result is a valid word in the language.

For example, words like "am," "is," "are," and "were" are all lemmatized to "be" because they are different forms of the same verb.

In [None]:
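import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import word_tokenize, pos_tag

# Download only the resources this cell needs instead of opening the interactive downloader.
# Newer NLTK releases may also require "punkt_tab" and "averaged_perceptron_tagger_eng".
for resource in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(resource)
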
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:         
        return wordnet.NOUN
       
def lemmatize_passage(text):
    words = word_tokenize(text)
    pos_tags = pos_tag(words)
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(tag)) for word, tag in pos_tags]
    lemmatized_sentence = ' '.join(lemmatized_words)
    return lemmatized_sentence
 
text = "There is nothing either good or bad but thinking makes it so."
result = lemmatize_passage(text)
 
print("Original:", text)
print("Tokenized:", word_tokenize(text))
print("Lemmatized:", result)

In [180]:
import nltk
print(nltk.data.path)

['/root/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']


In [None]:
import nltk
from nltk.stem import WordNetLemmatizer

# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()

# Example words
words = ["running", "ran", "better", "feet", "geese", "are"]

# Apply lemmatization (without a POS tag, lemmatize() treats every word as a noun)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print(lemmatized_words)


# Conclusion

Text preprocessing is a crucial step in Natural Language Processing (NLP) that significantly influences the quality and effectiveness of models. It involves a series of techniques aimed at transforming raw text into a structured format suitable for analysis and modeling.
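
As a closing illustration, the individual steps can be chained into one function. The sketch below uses only the standard library and re-declares simplified versions of a few steps from this notebook; the names `PIPELINE`, `preprocess` and `collapse_whitespace` are assumptions for this example. Unlike the earlier `remove_html_tags`, this version replaces tags with a space so that words on either side of `<br />` do not get glued together.

```python
import re
import string


def lowercase(text):
    return text.lower()


def remove_html_tags(text):
    # Replace tags with a space so neighbouring words stay separated
    return re.sub(r"<.*?>", " ", text)


def remove_urls(text):
    return re.sub(r"https?://\S+|www\.\S+", "", text)


def remove_punctuation(text):
    return text.translate(str.maketrans("", "", string.punctuation))


def collapse_whitespace(text):
    return " ".join(text.split())


# Order matters: URLs must go before punctuation, or "https://" is mangled first
PIPELINE = [remove_html_tags, remove_urls, lowercase, remove_punctuation, collapse_whitespace]


def preprocess(text, steps=PIPELINE):
    """Apply each step in turn and return the cleaned text."""
    for step in steps:
        text = step(text)
    return text


raw = "<p>Great movie!<br />See https://example.com for MORE.</p>"
print(preprocess(raw))  # great movie see for more
```

Steps such as stop-word removal, stemming or lemmatization can be appended to the list in the same way, depending on what the downstream task needs.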