# Text Data Preprocessing Notebook 📝

## Introduction 🚀

Welcome to the "Text Data Preprocessing" notebook (02_clean_text) 📊. In this notebook, we will focus on the essential steps of text data preprocessing. Text data often requires various transformations to make it suitable for analysis and modeling. We'll perform tasks such as handling HTML entities, removing special characters, converting emojis, and more to prepare our text data for further analysis.

## Table of Contents 📑

- [Importing Libraries](#importing-libraries)
- [Loading Data](#loading-data)
- [Loading Preprocessing Resources](#loading-preprocessing-resources)
- [Text Preprocessing](#text-preprocessing)
- [Sample Data Transformation](#sample-data-transformation)
- [Full Dataset Processing](#full-dataset-processing)
- [Conclusion](#conclusion)

## Importing Libraries 📚

Let's start by importing the necessary libraries for our text preprocessing tasks.

In [1]:
import html
import re

import emoji
import pandas as pd
from tqdm import tqdm

## Loading Data 📊
We load the text data from a Parquet file (the `.br` extension suggests Brotli compression) and enable `tqdm` progress bars for pandas.

In [2]:
tqdm.pandas()
pd.set_option('display.max_colwidth', None)

parquet_file_path = './../data/Text_dataset.br'
df = pd.read_parquet(parquet_file_path, engine='pyarrow')
df.head()

Unnamed: 0,text
0,family mormon have never tried explain them they still stare puzzled from time time like some kind strange creature nonetheless they have come admire for the patience calmness equanimity acceptance and compassion have developed all the things buddhism teaches
1,buddhism has very much lot compatible with christianity especially considering that sin and suffering are almost the same thing suffering caused wanting things shouldn want going about getting things the wrong way christian this would mean wanting things that don coincide with god will and wanting things that coincide but without the aid jesus buddhism could also seen proof god all mighty will and omnipotence certainly christians are lucky have one such christ there side but what about everyone else well many christians believe god grace salvation and buddhism god way showing grace upon others would also help study the things jesus said and see how buddha has made similar claims such rich man getting into heaven joke basically advocating that should rid ourselves material possessions fact distinctly remembered jesus making someone cry because that someone asked what achieve salvation and jesus replied with live like buddhist very very roughly translated also point out that buddha rarely spoke anything about god theory personally because knew well enough leave that jesus and mohamed who came later just remember conflict difference opinion but education can fun involving and enlightening easier teach something than prove right like intelligent design
2,seriously don say thing first all they won get its too complex explain normal people anyway and they are dogmatic then doesn matter what you say see mechante post and for any reason you decide later life move from buddhism and that doesn suit you identity though you still get keep all the wisdom then your family will treat you like you went through weird hippy phase for while there didncha and you never hear the end pro tip don put one these your wall jpg
3,what you have learned yours and only yours what you want teach different focus the goal not the wrapping paper buddhism can passed others without word about the buddha
4,for your own benefit you may want read living buddha living christ thich nhat hanh you might find any subsequent discussions with your loved ones easier you are able articulate some the parallels that exist between buddhism and christianity don surprised they react negatively for having lost you treat them with compassion and deserved understanding although they may indeed display signs being hurt your new path properly sharing with them way that may alleviate their fear something they may perceive wrong the very least alien their beliefs may help allowing them the long run accept although not necessarily agree with your decision regardless where they end you have make your own way


## Loading Preprocessing Resources 📚
We will load lookup tables (CSV files) that map abbreviations, apostrophe contractions, and emoticons to replacement text, and convert each one into a dictionary.

In [3]:
abbreviations_df = pd.read_csv("./../data/Text-Preprocessing-Data/abbreviations.csv") 
apostrophe_df = pd.read_csv("./../data/Text-Preprocessing-Data/apostrophe.csv")
emoticons_df = pd.read_csv("./../data/Text-Preprocessing-Data/emoticons.csv")

# Create dictionaries from the loaded resources
abbreviations_dict = dict(abbreviations_df.values)
apostrophe_dict = dict(apostrophe_df.values)
emoticons_dict = dict(emoticons_df.values)
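
`dict(df.values)` works when a table has exactly two columns and no missing cells. The sketch below shows a slightly more defensive variant on an in-memory CSV. The column names and rows are hypothetical, since the contents of the real resource files are not shown in this notebook.

```python
import io

import pandas as pd

# Hypothetical two-column lookup table; the last row has a missing value.
csv_text = "short,expanded\nasap,as soon as possible\ntv,television\nbrb,\n"
lookup_df = pd.read_csv(io.StringIO(csv_text))

# Drop incomplete rows and force both columns to str so that every
# key and value can safely be used in string replacement later.
clean_df = lookup_df.dropna().astype(str)
lookup = dict(zip(clean_df.iloc[:, 0], clean_df.iloc[:, 1]))

print(lookup)
```

Dropping incomplete rows up front keeps `NaN` values out of the dictionary, which would otherwise raise a `TypeError` when passed to a string replacement.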

## Text Preprocessing 🧹

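We'll perform various text preprocessing steps: decoding HTML entities, stripping HTML tags, user mentions, and links, converting emoticons and emojis to text, expanding contractions and abbreviations, and removing punctuation and digits.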

In [4]:
def lookup_dict(text: str, dictionary: dict) -> str:
    """
    Replace whitespace-delimited tokens in the text with values from a dictionary.

    Only whole tokens are replaced, so a key such as "tv" does not alter
    longer words that merely contain it (e.g. "atv").

    Args:
        text (str): The text whose tokens should be looked up.
        dictionary (dict): A mapping from token to replacement string.

    Returns:
        str: The text with every matching token replaced by its dictionary value.
    """
    # \S+ matches one token at a time, so the surrounding whitespace is preserved
    return re.sub(r"\S+", lambda m: dictionary.get(m.group(0), m.group(0)), text)
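
Dictionary lookups of this kind are sensitive to how matches are bounded. The following sketch uses a hypothetical one-entry dictionary to compare calling `str.replace` on the whole string with replacing whole tokens only. The names `toy_dict`, `naive`, and `token_level` are illustrative and not part of the notebook.

```python
import re

toy_dict = {"tv": "television"}  # hypothetical entry
text = "my atv and tv"

# Whole-string replacement: every occurrence of the key is rewritten,
# including the one inside the longer word "atv".
naive = text
for word in text.split():
    if word in toy_dict:
        naive = naive.replace(word, toy_dict[word])

# Token-level replacement: only tokens that equal a key are rewritten.
token_level = re.sub(r"\S+", lambda m: toy_dict.get(m.group(0), m.group(0)), text)

print(naive)        # my atelevision and television
print(token_level)  # my atv and television
```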

In [5]:
def preprocessing(input_text: str) -> str:
    """Preprocess the input text for natural language processing.

    Args:
        input_text (str): The input text to be preprocessed.

    Returns:
        str: The preprocessed text.
    """
    # Step A: Convert HTML entities (e.g. &lt; &gt; &amp;) to their characters
    text = html.unescape(input_text)
    # Step B: Remove HTML tags
    text = re.sub(r"<.*?>", "", text)
    # Step C: Remove "@user" mentions
    text = re.sub(r"@\w*", "", text)
    # Step D: Remove http and https links
    text = re.sub(r"https?://\S+", "", text)
    # Step E: Emoticon lookup
    text = lookup_dict(text, emoticons_dict)
    # Step F: Convert emojis to their text names, padded with spaces
    text = emoji.demojize(text, delimiters=(" ", " "))
    # Step G: Convert the text to lowercase
    text = text.lower()
    # Step H: Apostrophe lookup (normalize curly apostrophes first)
    text = text.replace("’", "'")
    text = lookup_dict(text, apostrophe_dict)
    # Step I: Abbreviation (short word) lookup
    text = lookup_dict(text, abbreviations_dict)
    # Step J: Replace every character other than a-z (punctuation, digits, symbols) with a space
    text = re.sub(r"[^a-z]", " ", text)
    # Step K: Collapse runs of whitespace into a single space (a leading or trailing space may remain)
    text = re.sub(r"\s+", " ", text)
    return text
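
To see what the regex-based steps do on their own, here is a minimal sketch that runs only the standard-library stages (A to D, G, J, and K) on a made-up string. The dictionary lookups and the emoji step are left out on purpose, so this is a partial illustration and not a substitute for `preprocessing`.

```python
import html
import re

raw = "<p>Tom &amp; Jerry: see https://example.com @user 100%!</p>"

text = html.unescape(raw)                 # A: "&amp;" becomes "&"
text = re.sub(r"<.*?>", "", text)         # B: drop the <p> and </p> tags
text = re.sub(r"@\w*", "", text)          # C: drop the "@user" mention
text = re.sub(r"https?://\S+", "", text)  # D: drop the link
text = text.lower()                       # G: lowercase
text = re.sub(r"[^a-z]", " ", text)       # J: non-letters become spaces
text = re.sub(r"\s+", " ", text)          # K: collapse whitespace runs

print(repr(text))          # 'tom jerry see '
print(repr(text.strip()))  # 'tom jerry see'
```

The trailing space is left over from the punctuation and digits at the end of the input. Calling `.strip()` on the result removes it if downstream code expects trimmed text.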

## Sample Data Transformation 📄

Let's apply the text preprocessing functions to a sample of the dataset to see the results.

In [6]:
sample_df = df.sample(20, random_state=42)
sample_df['clean_text'] = sample_df['text'].progress_apply(preprocessing)
sample_df.head(20)

100%|██████████| 20/20 [00:00<00:00, 101.50it/s]


Unnamed: 0,text,clean_text
712626,☹️\n#NowPlaying “Bad (feat. Rihanna) [Remix]” by @Wale on #Anghami https://t.co/iHCmOj9wU3,frowning face nowplaying bad feat rihanna remix by on anghami
21935,floor test tomorrow nomination anglo indian assembly yedduyurappa must not make any major policy decisions meanwhile supremecourt,floor test tomorrow nomination anglo indian assembly yedduyurappa must not make any major policy decisions meanwhile supremecourt
324532,"@Dizdi17 SP has regularly lied to people about our proposal not getting past the clerks. We are meant to believe that having torpedoed it with MPs LCAG then put our settlement proposal forward?\nSteve still doesn’t seem to grasp the idea was to change the law, not ask HMRC for concessions https://t.co/qPFNVQFxqb",sp has regularly lied to people about our proposal not getting past the clerks we are meant to believe that having torpedoed it with mps lcag then put our settlement proposal forward steve still does not seem to grasp the idea was to change the law not ask hmrc for concessions
566900,@KattyKay_ Why do you keep watching? At this point anyone watching tv is the problem,why do you keep watching at this point anyone watching television is the problem
764931,I’m not no criminal baby dismiss me you ain’t gotta keep bringing me back ij hete,i am not no criminal baby dismiss me you are not got to keep bringing me back i am joking hete
528705,Another fact for the libs. The Supreme Court does NOT make law. So... #RowvWade wasn’t legitimate to start.,another fact for the libs the supreme court does not make law so rowvwade was not legitimate to start
325000,@BobbyKingNoon @_Tonnnyy @BIessedDiamond @MameronCagee502 @combat_insider @redpahR He never said it was the best version of Dustin,he never said it was the best version of dustin
550895,"@EllenCoughlan @nhs_quality @TonyRoberts9 @HealthFdn Thanks again Ellen, and a great pleasure to 'meet' you via Skype today. Hope we'll catch up again somewhere soon",thanks again ellen and a great pleasure to meet you via skype today hope we will catch up again somewhere soon
338023,Criminal. https://t.co/G0scKkIsqk,criminal
242761,think till she critical modi invincible ’ start worry the day she starts supporting him,think till she critical modi invincible start worry the day she starts supporting him


We can also test the preprocessing function on a single example:

In [7]:
input_text = """                               \n \t
    <p>This is an example text with some special characters &amp; symbols. 
    It contains @user mentions, links like https://example.com, 
    and emoticons like :-), as well as some emojis: 😄👍.</p>
    Let's test the preprocessing steps asap !
    """
print(f"{preprocessing(input_text) = }")

preprocessing(input_text) = ' this is an example text with some special characters symbols it contains mentions links like and emoticons like as well as some emojis grinning face with smiling eyes thumbs up let us test the preprocessing steps as soon as possible '
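
In the output above, the emoticon `:-)` disappears instead of being replaced by a description. One plausible cause (an assumption, since the contents of `emoticons.csv` are not shown) is that the whitespace-delimited token is `:-),` with the comma attached, so an exact-match lookup misses it. The sketch below uses a hypothetical helper, `lookup_with_trailing_punct`, and a hypothetical one-entry dictionary to show one way to retry the lookup without trailing punctuation.

```python
import re

toy_emoticons = {":-)": "smiley"}  # hypothetical entry


def lookup_with_trailing_punct(text: str, dictionary: dict) -> str:
    """Replace tokens found in the dictionary, retrying without trailing punctuation."""

    def replace_token(match: re.Match) -> str:
        token = match.group(0)
        if token in dictionary:
            return dictionary[token]
        # Retry with trailing sentence punctuation stripped, then re-attach it.
        core = token.rstrip(".,!?;")
        if core in dictionary:
            return dictionary[core] + token[len(core):]
        return token

    return re.sub(r"\S+", replace_token, text)


result = lookup_with_trailing_punct("emoticons like :-), as well", toy_emoticons)
print(result)  # emoticons like smiley, as well
```

Stripping characters such as `;` could in principle damage emoticons that end with them, so the set of stripped characters would need to be checked against the real emoticon list.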


## Full Dataset Processing 🚀

If you'd like to run this text preprocessing on the full dataset, you can find the code in the [src/utils/clean_text.py](src/utils/clean_text.py) file. Simply execute the script to preprocess the entire dataset.

## Conclusion 📝
This notebook has covered the crucial steps of text preprocessing, including handling HTML entities, user mentions, links, emoticons, and more. The resulting clean text is now ready for use in various NLP tasks, such as sentiment analysis, topic modeling, or text classification.

Happy text preprocessing! 🧼✨