# Text Data Preprocessing Notebook 📝

## Introduction 🚀

Welcome to the "Text Data Preprocessing" notebook (02_clean_text) 📊. Raw text usually needs several transformations before it is suitable for analysis and modeling. In this notebook we decode HTML entities, strip user mentions and links, convert emoticons and emojis to words, expand contractions and abbreviations, and replace special characters and digits with spaces.

## Table of Contents 📑

- [Importing Libraries](#importing-libraries)
- [Loading Data](#loading-data)
- [Loading Preprocessing Resources](#loading-preprocessing-resources)
- [Text Preprocessing](#text-preprocessing)
- [Sample Data Transformation](#sample-data-transformation)
- [Full Dataset Processing](#full-dataset-processing)
- [Conclusion](#conclusion)

## Importing Libraries 📚

Let's start by importing the necessary libraries for our text preprocessing tasks.

In [1]:
import html
import re

import emoji
import pandas as pd
from tqdm import tqdm

## Loading Data 📊
We load the text data from a Parquet file. Calling `tqdm.pandas()` first registers `progress_apply`, which we use later to show a progress bar while preprocessing.

In [2]:
tqdm.pandas()
pd.set_option('display.max_colwidth', None)

parquet_file_path = './../data/Text_dataset.br'
df = pd.read_parquet(parquet_file_path, engine='pyarrow')
df.head()

Unnamed: 0,text
0,family mormon have never tried explain them they still stare puzzled from time time like some kind strange creature nonetheless they have come admire for the patience calmness equanimity acceptance and compassion have developed all the things buddhism teaches
1,buddhism has very much lot compatible with christianity especially considering that sin and suffering are almost the same thing suffering caused wanting things shouldn want going about getting things the wrong way christian this would mean wanting things that don coincide with god will and wanting things that coincide but without the aid jesus buddhism could also seen proof god all mighty will and omnipotence certainly christians are lucky have one such christ there side but what about everyone else well many christians believe god grace salvation and buddhism god way showing grace upon others would also help study the things jesus said and see how buddha has made similar claims such rich man getting into heaven joke basically advocating that should rid ourselves material possessions fact distinctly remembered jesus making someone cry because that someone asked what achieve salvation and jesus replied with live like buddhist very very roughly translated also point out that buddha rarely spoke anything about god theory personally because knew well enough leave that jesus and mohamed who came later just remember conflict difference opinion but education can fun involving and enlightening easier teach something than prove right like intelligent design
2,seriously don say thing first all they won get its too complex explain normal people anyway and they are dogmatic then doesn matter what you say see mechante post and for any reason you decide later life move from buddhism and that doesn suit you identity though you still get keep all the wisdom then your family will treat you like you went through weird hippy phase for while there didncha and you never hear the end pro tip don put one these your wall jpg
3,what you have learned yours and only yours what you want teach different focus the goal not the wrapping paper buddhism can passed others without word about the buddha
4,for your own benefit you may want read living buddha living christ thich nhat hanh you might find any subsequent discussions with your loved ones easier you are able articulate some the parallels that exist between buddhism and christianity don surprised they react negatively for having lost you treat them with compassion and deserved understanding although they may indeed display signs being hurt your new path properly sharing with them way that may alleviate their fear something they may perceive wrong the very least alien their beliefs may help allowing them the long run accept although not necessarily agree with your decision regardless where they end you have make your own way


## Loading Preprocessing Resources 📚
We load three lookup tables from CSV files (abbreviations, apostrophe contractions, and emoticons) and convert each one into a dictionary.

In [3]:
abbreviations_df = pd.read_csv("./../data/Text-Preprocessing-Data/abbreviations.csv") 
apostrophe_df = pd.read_csv("./../data/Text-Preprocessing-Data/apostrophe.csv")
emoticons_df = pd.read_csv("./../data/Text-Preprocessing-Data/emoticons.csv")

# Create dictionaries from the loaded resources
abbreviations_dict = dict(abbreviations_df.values)
apostrophe_dict = dict(apostrophe_df.values)
emoticons_dict = dict(emoticons_df.values)
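
As a minimal sketch of what `dict(df.values)` produces for a two-column table, the snippet below builds the same kind of mapping with only the standard library. The CSV content and its column names are made up for illustration; the real files may use different headers and entries.

```python
import csv
import io

# Hypothetical two-column CSV; the real apostrophe.csv may look different.
csv_text = "contraction,expansion\nit's,it is\ndidn't,did not\n"

reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row, as pd.read_csv does by default

# Each row is [key, value], so the rows map directly onto a lookup table.
lookup = {key: value for key, value in reader}
print(lookup)
```

The conversion relies on each file having exactly two columns: the first becomes the key and the second the replacement.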

## Text Preprocessing 🧹

We'll perform several text preprocessing steps: decoding HTML entities, removing mentions and links, converting emoticons and emojis to words, expanding contractions and abbreviations, and more.

In [4]:
def lookup_dict(text: str, dictionary: dict) -> str:
    """
    Replace words in the text that appear as keys in a dictionary.

    Args:
        text (str): The text to search for dictionary keys.
        dictionary (dict): A mapping from words to their replacements.

    Returns:
        str: The text with matched words replaced by their corresponding values.
    """
    for word in text.split():
        if word in dictionary:
            # Note: str.replace substitutes every occurrence of `word`,
            # including occurrences inside longer words.
            text = text.replace(word, dictionary[word])
    return text
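
Because `str.replace` works on substrings, a short key can also match inside longer words. The sketch below contrasts that behavior with a token-wise variant. `lookup_dict_whole_words` and the `demo_dict` entry are hypothetical names introduced here for illustration; they are not part of the notebook's pipeline.

```python
import re


def lookup_dict_whole_words(text: str, dictionary: dict) -> str:
    """Replace only whitespace-delimited tokens found in the dictionary."""
    # Split on whitespace but keep the separators so spacing is preserved.
    parts = re.split(r"(\s+)", text)
    return "".join(dictionary.get(part, part) for part in parts)


# Hypothetical entry, for illustration only.
demo_dict = {"we": "whatever"}

# str.replace touches every occurrence of the substring "we":
substring_result = "we were late".replace("we", demo_dict["we"])
print(substring_result)

# Token-wise replacement only touches the standalone word:
token_result = lookup_dict_whole_words("we were late", demo_dict)
print(token_result)
```

The trade-off is that a token attached to punctuation, such as `(it's`, is no longer matched by the token-wise variant, so the two approaches can give different results on real text.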

In [5]:
def preprocessing(input_text: str) -> str:
    """Preprocess the input text for natural language processing.

    Args:
        input_text (str): The input text to be preprocessed.

    Returns:
        str: The preprocessed text.
    """
    # Step A: Convert HTML entities (e.g. &lt; &gt; &amp;) to characters
    text = html.unescape(input_text)
    # Step B: Remove "@user" mentions
    text = re.sub(r"@\w*", "", text)
    # Step C: Remove http and https links
    text = re.sub(r"https?://\S+", "", text)
    # Step D: Replace emoticons with their descriptions
    text = lookup_dict(text, emoticons_dict)
    # Step E: Convert emojis to their text names
    text = emoji.demojize(text)
    # Step F: Convert all text to lowercase
    text = text.lower()
    # Step G: Expand apostrophe contractions
    text = lookup_dict(text, apostrophe_dict)
    # Step H: Expand abbreviations (short words)
    text = lookup_dict(text, abbreviations_dict)
    # Step I: Replace punctuation, special characters, and digits with a space
    text = re.sub(r"[^a-z]", " ", text)
    # Step J: Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    return text
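
To see the regular-expression steps in isolation, here is a rough sketch that runs steps A, B, C, F, I, and J on a made-up string using only the standard library. The emoji and dictionary lookups are left out so the snippet runs without the `emoji` package or the CSV resources; the input text and URL are invented for illustration.

```python
import html
import re

raw = "@user Check &lt;this&gt; out: https://example.com/page It's GREAT &amp; fun!!! 100%"

text = html.unescape(raw)                 # "&lt;" -> "<", "&amp;" -> "&"
text = re.sub(r"@\w*", "", text)          # drop the "@user" mention
text = re.sub(r"https?://\S+", "", text)  # drop the link
text = text.lower()
text = re.sub(r"[^a-z]", " ", text)       # anything other than a-z becomes a space
text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
print(repr(text))
```

Here "it's" ends up as "it s" because the apostrophe lookup was skipped, which suggests why the contraction step has to run before punctuation is replaced.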

## Sample Data Transformation 📄

Let's apply the text preprocessing functions to a sample of the dataset to see the results.

In [6]:
sample_df = df.sample(20)
sample_df['clear_text'] = sample_df['text'].progress_apply(preprocessing)
sample_df.head(20)

100%|██████████| 20/20 [00:00<00:00, 666.49it/s]


Unnamed: 0,text,clear_text
69005,"I had watched this as a kid but, not being much of a Jerry Lewis fan, I had completely forgotten it (not that it's in any way memorable). The film revolves around impersonation (which seems to be in the curriculum of every comic star!) - in this case a German officer - and, while not as bad as Leonard Maltin claims (awarding it a BOMB rating), it's not exactly classic stuff either - certainly leagues behind Chaplin's THE GREAT DICTATOR (1940), even if comparably narcissistic! Ironically, the scenes prior to the appearance of the would-be wacky General offer more felicities than the rather forced humor at Nazi expense! <br /><br />The film was really Lewis' last gasp during his heyday; in fact, this proved to be his last vehicle to be released for 10 years (it's painfully apparent here that his particular brand of foolishness wouldn't pass muster in the age of Mel Brooks and Woody Allen)!",i had watched this as a kid but not being much of a jerry lewis fan i had completely forgotten it not that it is in any where are you memorable the film revolves around impersonation which seems to be in the curriculum of every comic star in this case a german officer and while not as bad as leonard maltin claims awarding it a bomb rating it is not exactly classic stuff either certainly leagues behind chaplin s the great dictator even if comparably narcissistic ironically the scenes prior to the appearance of the would be wacky general offer more felicities than the rather forced humor at nazi expense br br the film was really lewis last gasp during his heyday in fact this proved to be his last vehicle to be released for years it is painfully apparent here that his particular brand of foolishness would not pass muster in the age of mel brooks and woody allen
1131240,@BlockFi Sorry I had to remove funds due to the poor communication recently and market turmoil. When will BIA be available for American clients? Any updates on that please?,sorry i had to remove funds due to the poor communication recently and market turmoil when will bia be available for american clients any updates on that please
1111502,@GreaseMonkey42 @SharonGF_NBCT @RepMTG Sometimes I feel like they are lost souls. They believe everything the LameStream Media tells them. Wait for the laws to come into effect that will allow us to sue LameStream Media when they don’t tell the truth! Then their eyes will start opening I hope!,sometimes i feel like they are lost souls they believe everything the lamestream media tells them wait for the laws to come into effect that will allow us to sue lamestream media when they don t tell the truth then their eyes will start opening i hope
705407,"@leeandrew31 @Dane99_ I didn't think we would match last years total either, looks like we might now",i did not think whatever would match last years total either looks like whatever might now
718767,@Assade_Cule @Adaminho_FCB Ça ne répond tjs pas à ma question.,a ne r pond tjs pas mom alert master of arts question
765311,@AstrosFan86 🤣🤣🤣🤣 best SS in Astros history Carlos Correa El Capitan #1,rolling on the floor laughing rolling on the floor laughing rolling on the floor laughing rolling on the floor laughing best screen shot in astros history carlos correa el capitan
206011,modi work 365 days for country but every election time rahul yuvoraj priyanka gandhi yuvrani goes poor house eat temple and take photo after finish vote they back italy other country for long vacation they come just for vote,modi work days for country but every election time rahul yuvoraj priyanka gandhi yuvrani goes poor house eat temple and take photo after finish vote they back i trust and love you other country for long vacation they come just for vote
25842,most def not paying for her bundle farm,most def not paying for her bundle farm
1106588,@just23things @nctzenbase Skincare dream juga ada kan? S0methinc,skincare dream juga ada kan s methinc
925672,"@TnDawg34 @ironminja @OriginalRoddick @JAWestman If you DO genuinely want to reduce abortions. Do 3 things: increase access to contraception, improve financial situation of many women especially in cities, provide adequate housing prices and maternity/paternity leave so they can raise a child. Banning it fixes non of these",if you do genuinely want to reduce abortions do things increase access to contraception improve financial situation of many women especially in cities provide adequate housing prices and maternity paternity leave so they can raise a child banning it fixes non of these


We can also test the preprocessing on a single example.

In [7]:
test_text = "@Dbgtiscanon @SupiotUgo @TheStarWarsAcad Jake lloyd especially man poor kid! 😫😭"
print(f"{preprocessing(test_text)=}")

preprocessing(test_text)=' jake lloyd especially man poor kid tired face loudly crying face '
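
The result above keeps one space at each end, because removed mentions and trailing emoji delimiters turn into whitespace that is collapsed but not trimmed. The sketch below reproduces the last two steps on the intermediate string and shows what an optional `strip()` would do. The `demojized` value assumes `emoji.demojize` yields `:tired_face:` and `:loudly_crying_face:` for the two emojis; the `strip()` call is a suggestion, not part of the original function.

```python
import re

# Assumed intermediate text after mention removal, demojize, and lowercasing.
demojized = "   jake lloyd especially man poor kid! :tired_face::loudly_crying_face:"

text = re.sub(r"[^a-z]", " ", demojized)  # ":" , "_" and "!" become spaces
text = re.sub(r"\s+", " ", text)          # runs collapse to a single space
stripped = text.strip()                   # optional: trim the edges

print(repr(text))
print(repr(stripped))
```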


## Full Dataset Processing 🚀

If you'd like to run this text preprocessing on the full dataset, you can find the code in the [src/utils/clean_text.py](src/utils/clean_text.py) file. Simply execute the script to preprocess the entire dataset.

## Conclusion 📝
This notebook has covered the crucial steps of text preprocessing, including handling HTML entities, user mentions, links, emoticons, and more. The resulting clean text is now ready for use in various NLP tasks, such as sentiment analysis, topic modeling, or text classification.

Happy text preprocessing! 🧼✨