## Text Preprocessing

### Introduction

This notebook cleans the text of each review. Raw reviews often contain repeated characters, emojis, and stray punctuation, all of which add noise for later analysis.

### Setup and Initial Data Viewing

First, we load the dataset we cleaned up earlier and view its first few rows.

In [2]:
# Importing the pandas library to work with dataframes
import pandas as pd

# Loading the cleaned data from a CSV file
df = pd.read_csv('../DATASETS/cleaned_data.csv')

# Displaying the first few rows to get a sense of the data
df.head()

Unnamed: 0,content,sentiment
0,Plsssss stoppppp giving screen limit like when...,negative
1,Good,positive
2,👍👍,positive
3,Good,neutral
4,"App is useful to certain phone brand ,,,,it is...",negative


### Importing Necessary Libraries for Text Processing

We will use the `emoji` library to convert emojis into text descriptions, the `re` module for the regex operations that do most of the cleaning, and the `contractions` library to expand contractions such as "isn't":

In [4]:
# Importing libraries for handling text data
import emoji  # For converting emojis to text
import re  # For regex operations
import contractions  # For expanding contractions

### Defining the Text Cleaning Function

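We define a single function that applies the cleaning steps in order: lowercasing, emoji conversion, contraction expansion, URL and HTML tag removal, punctuation removal, repeated-character reduction, and whitespace normalization: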

In [5]:
def text_cleaner(text):
    # Convert text to lowercase to standardize it
    text = text.lower()

    # Replace emojis with a text description to capture their sentiment
    text = emoji.demojize(text, delimiters=(" ", " "))

    # Expand contractions to aid in the uniformity of the text
    text = contractions.fix(text)

    # Remove URLs as they are not useful for sentiment analysis
    text = re.sub(r'http\S+', '', text)

    # Remove HTML tags to clean up the text
    text = re.sub(r'<.*?>', '', text)

    # Replace non-word characters with a space to separate words better
    text = re.sub(r'[^\w\s]', ' ', text)

    # Reduce characters repeated more than twice to two to reduce exaggeration
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)

    # Normalize whitespace: collapse runs into a single space and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text
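
To see what each regex step contributes, here is a minimal sketch that replays only the `re`-based stages of the function above, in the same order, and keeps every intermediate string. The helper name `trace_regex_stages` and the sample sentence are made up for illustration; the emoji and contraction steps are left out so the snippet needs nothing beyond the standard library.

```python
import re

# Regex-only stages of text_cleaner, in the same order as the function above.
# Each entry is (label, pattern, replacement).
REGEX_STAGES = [
    ("remove URLs", r'http\S+', ''),
    ("remove HTML tags", r'<.*?>', ''),
    ("punctuation to spaces", r'[^\w\s]', ' '),
    ("collapse repeats", r'(.)\1{2,}', r'\1\1'),
    ("normalize whitespace", r'\s+', ' '),
]

def trace_regex_stages(text):
    """Hypothetical helper: return (label, text) after every stage."""
    steps = [("lowercase", text.lower())]
    for label, pattern, replacement in REGEX_STAGES:
        previous = steps[-1][1]
        steps.append((label, re.sub(pattern, replacement, previous)))
    # Final trim, matching the .strip() call in text_cleaner
    steps.append(("strip", steps[-1][1].strip()))
    return steps

# Made-up sample containing a URL, HTML tags, punctuation, and repeats
sample = "Sooooo <b>slow</b>!!! see http://example.com/help"
for label, value in trace_regex_stages(sample):
    print(f"{label:>22}: {value!r}")
```

The last line printed is `'soo slow see'`. Note that the repeat-collapsing stage also shortens the run of spaces left behind by the removed punctuation, which the whitespace step then reduces to a single space.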

### Testing the Text Cleaning Function

It's helpful to test the function with a sample text to ensure it performs as expected:

In [6]:
# Testing the text cleaner with a sample text
text_cleaner("This iSn't Gooood!!!")  # Expect lowercase text, the contraction expanded, repeats collapsed, punctuation removed

'this is not good'
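
The sample above ends up as a correctly spelled word only because "good" already has a double "o". The repeat rule caps runs at two characters; it does not restore spelling. A small sketch isolating that one regex (the helper name `collapse_repeats` is ours, not part of the notebook):

```python
import re

def collapse_repeats(text):
    # Same pattern as in text_cleaner: any character repeated
    # three or more times is reduced to exactly two.
    return re.sub(r'(.)\1{2,}', r'\1\1', text)

for word in ["gooood", "plsssss", "stoppppp", "sooo", "good"]:
    print(word, "->", collapse_repeats(word))
```

Here "gooood" becomes "good", but "plsssss" becomes "plss" and "sooo" becomes "soo", so elongated words are normalized to a consistent form rather than to a dictionary word.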

### Applying the Text Cleaning Function

Next, we apply the text cleaning function to the entire dataset:

In [7]:
# Applying the cleaning function to each review in the dataset
df['content_cleaned'] = df['content'].apply(text_cleaner)
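
One caveat: `text_cleaner` calls `text.lower()`, which fails on anything that is not a string. pandas represents empty cells as a float `NaN`, so a missing review would raise an `AttributeError`. We have not checked whether this dataset contains any, so the following is only a defensive sketch; `safe_apply` and its `default` parameter are hypothetical names, and `str.lower` stands in for `text_cleaner` to keep the snippet self-contained.

```python
def safe_apply(cleaner, value, default=""):
    """Hypothetical wrapper: run `cleaner` only on real strings."""
    if not isinstance(value, str):
        # Covers float NaN, None, and any other non-text value
        return default
    return cleaner(value)

# Stand-in for text_cleaner so the example runs on its own
demo_cleaner = str.lower

values = ["Good", float("nan"), None]
print([safe_apply(demo_cleaner, v) for v in values])
```

In the notebook this could be used as `df['content'].apply(lambda v: safe_apply(text_cleaner, v))`.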

The dataset now has a new column called `content_cleaned`. Here are the first few rows of the preprocessed data:

In [10]:
df.head()

Unnamed: 0,content,sentiment,content_cleaned
0,Plsssss stoppppp giving screen limit like when...,Negative,plss stopp giving screen limit like when you a...
1,Good,Positive,good
2,👍👍,Positive,thumbs_up thumbs_up
3,Good,Neutral,good
4,"App is useful to certain phone brand ,,,,it is...",Negative,app is useful to certain phone brand it is not...


### Displaying Some Preprocessed Reviews

To verify the cleaning process, let's look at some before-and-after examples of the cleaned text:

In [8]:
# Printing some original and cleaned reviews to compare
for i in range(1000, 1010):
    print('Original:', df.loc[i, 'content'])
    print('Cleaned:', df.loc[i, 'content_cleaned'])
    print()

Original: Great App💙
Cleaned: great app blue_heart

Original: Great audio and video
Cleaned: great audio and video

Original: Every time I request a movie that I would like to watch I get the same "Sorry, we don't have that one." Then I am referred to movies like "Three Stooges Conquer Space" or something like that.
Cleaned: every time i request a movie that i would like to watch i get the same sorry we do not have that one then i am referred to movies like three stooges conquer space or something like that

Original: Best ott app
Cleaned: best ott app

Original: The in-app help call feature on Netflix is an absolute disaster. This was hands down the worst customer service experience I have ever had. First off, once you initiate the help call within the app, there's no way to go back. You’re trapped in an endless loop with no escape, forced to deal with whatever horrors await. The app essentially hijacks your device, and you can't access any other features while on the call. The repres

### Saving the Preprocessed Data

Finally, we save the preprocessed text data for further analysis or modeling:

In [11]:
# Saving the preprocessed data to a new CSV file
df.to_csv('../DATASETS/preprocessed_text.csv', index=False)
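
One thing worth checking around this step: a review made up only of punctuation cleans to an empty string, and pandas reads empty CSV fields back as `NaN` by default. The sketch below shows how such rows could be spotted. It uses a simplified, regex-only stand-in for `text_cleaner` (the name `regex_only_clean` and the sample reviews are assumptions for illustration), so it covers the punctuation case only.

```python
import re

def regex_only_clean(text):
    # Assumption: mirrors just the punctuation and whitespace steps
    # of text_cleaner, not the emoji or contraction handling.
    text = re.sub(r'[^\w\s]', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()

# Made-up reviews; two of them contain punctuation only
reviews = ["Good", "!!!", "...", "Great app"]
cleaned = [regex_only_clean(r) for r in reviews]

# Collect (row position, original text) for every review that cleaned to ""
empty = [(i, reviews[i]) for i, c in enumerate(cleaned) if c == ""]
print(empty)
```

Rows found this way can be dropped or reviewed by hand before modeling, depending on what the later steps expect.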