## Text Preprocessing

### Introduction

In this notebook, we'll clean up the text of the user reviews. Reviews often contain repeated characters, emojis, and stray punctuation, all of which make the data harder to work with. The techniques below prepare the text for analysis, and you can reuse them in your own projects.

### Setup and Initial Data Viewing

As always, we'll start by loading the cleaned dataset and taking a quick look at it.

In [37]:
# Importing the pandas library to work with dataframes
import pandas as pd

# Loading the cleaned data from a CSV file
df = pd.read_csv('../DATASETS/cleaned_data.csv')

# Displaying the first few rows to get a sense of the data
df.head()

Unnamed: 0,content,sentiment
0,Plsssss stoppppp giving screen limit like when...,negative
1,Good,positive
2,👍👍,positive
3,Good,neutral
4,"App is useful to certain phone brand ,,,,it is...",negative


As you can see from the first few rows, the dataset contains repeated emojis, repeated characters, and stray punctuation marks, all of which need to be cleaned up before we can do any further analysis.
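
To get a rough feel for how common each kind of noise is, a few heuristics can flag it per review. This is a minimal sketch using only the standard library: the `samples` list holds illustrative fragments modelled on the rows above (not the full reviews), `noise_flags` is a hypothetical helper, and the emoji check relies on the Unicode category `So`, which also covers non-emoji symbols such as ©.

```python
import re
import string
import unicodedata

# Same letter three or more times in a row, e.g. "sss"
LETTER_RUN = re.compile(r"([^\W\d_])\1{2,}")
# Same ASCII punctuation mark two or more times in a row, e.g. ",,"
PUNCT_RUN = re.compile("([" + re.escape(string.punctuation) + r"])\1+")


def noise_flags(text):
    """Return which kinds of noise a string appears to contain."""
    return {
        "repeated_letters": bool(LETTER_RUN.search(text)),
        "repeated_punct": bool(PUNCT_RUN.search(text)),
        # Rough emoji check: Unicode category "So" (Symbol, other)
        "emoji": any(unicodedata.category(ch) == "So" for ch in text),
    }


# Illustrative fragments only, modelled on the rows shown above
samples = ["Plsssss stoppppp", "Good", "👍👍", "phone brand ,,,,it"]

for sample in samples:
    print(repr(sample), noise_flags(sample))
```

Applied to the full `content` column, the same helper would give per-category counts, which can help decide which cleaning steps matter most.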

### Importing Necessary Libraries for Text Processing

To clean the text effectively, we'll use three libraries:

- The `emoji` library converts emojis into text descriptions.
- The `re` module from Python's standard library provides regular-expression operations for finding and replacing patterns in text.
- The `contractions` library expands contractions (like "isn't" to "is not"), which helps standardize the text.

In [38]:
# Importing libraries for handling text data
import emoji  # For converting emojis to text
import re  # For regex operations
import contractions  # For expanding contractions

### Defining the Text Cleaning Function

We'll create a function that systematically cleans the text by following these steps:

1. **Convert the text to lowercase for consistency.**  
   *Example:*  
   `"This is GREAT!"` → `"this is great!"`

2. **Replace emojis with descriptive text.**  
   *Example:*  
   `"I'm so happy 😊"` → `"i'm so happy smiling_face_with_smiling_eyes"`

3. **Expand contractions to standardize the language.**  
   *Example:*  
   `"I'll be there"` → `"i will be there"`

4. **Replace non-word characters with spaces to improve word separation.**  
   *Example:*  
   `"Hello!!! How is it going?"` → `"hello how is it going"`

5. **Reduce exaggerated characters to a standard form.**  
   *Example:*  
   `"Soooo goooood!!!"` → `"soo good"`

6. **Remove consecutive repeated words to avoid redundancy.**  
   *Example:*  
   `"This is is amazing"` → `"this is amazing"`

7. **Normalize white spaces for clean and properly spaced text.**  
   *Example:*  
   `"Too    many    spaces!"` → `"too many spaces"`
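
The regex-based steps (4 to 7) can be traced one at a time with only the standard library. The sketch below is illustrative: `REGEX_STEPS` and `trace_regex_steps` are hypothetical names, the patterns are the ones used in the cleaning function in this notebook, and steps 2 and 3 are skipped because they need the `emoji` and `contractions` libraries.

```python
import re

# Steps 4-7 as (label, pattern, replacement)
REGEX_STEPS = [
    ("4 non-word to space", r"[^\w\s]", " "),
    ("5 squeeze repeats", r"(.)\1{2,}", r"\1\1"),
    ("6 drop repeated words", r"\b(\w+)(?:\s+\1)+\b", r"\1"),
    ("7 normalize spaces", r"\s+", " "),
]


def trace_regex_steps(text):
    """Apply step 1 and steps 4-7 in order, recording the text after each."""
    text = text.lower()
    history = [("1 lowercase", text)]
    for label, pattern, replacement in REGEX_STEPS:
        text = re.sub(pattern, replacement, text)
        history.append((label, text))
    history.append(("strip ends", text.strip()))
    return history


for label, value in trace_regex_steps("Soooo goooood!!! This is is amazing"):
    print(f"{label:<22} {value!r}")
# Final line shows: 'soo good this is amazing'
```

Printing the intermediate strings with `repr` makes the leftover spaces from step 4 visible, which is why whitespace normalization comes last.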

In [39]:
def text_cleaner(text):
    """Return a normalized, lowercase version of a review string."""
    # 1. Convert text to lowercase
    text = text.lower()

    # 2. Replace each emoji with its text name (e.g. thumbs_up) to keep its sentiment;
    #    spaces as delimiters keep adjacent emoji names apart
    text = emoji.demojize(text, delimiters=(" ", " "))

    # 3. Expand contractions ("isn't" -> "is not")
    text = contractions.fix(text)

    # 4. Replace punctuation and symbols with spaces to separate words better
    #    (underscores count as word characters, so emoji names stay intact)
    text = re.sub(r'[^\w\s]', ' ', text)

    # 5. Collapse any character repeated three or more times down to two
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)

    # 6. Remove consecutive repeated words
    text = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', text)

    # 7. Collapse runs of whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text
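
One side effect of the repeated-character pattern is worth knowing: because `(.)` matches any character, runs of digits are shortened too, so a number such as 1000 loses a zero. The sketch below compares the pattern from the function with a letters-only variant. `LETTER_RUN` and `squeeze` are hypothetical names for illustration, not part of the notebook's pipeline.

```python
import re

ANY_CHAR_RUN = re.compile(r"(.)\1{2,}")        # pattern used in text_cleaner
LETTER_RUN = re.compile(r"([^\W\d_])\1{2,}")   # hypothetical letters-only variant


def squeeze(pattern, text):
    """Shorten every run matched by pattern to two characters."""
    return pattern.sub(r"\1\1", text)


sample = "goooood deal for 1000 dollars"
print(squeeze(ANY_CHAR_RUN, sample))  # good deal for 100 dollars
print(squeeze(LETTER_RUN, sample))    # good deal for 1000 dollars
```

Whether this matters depends on the task: for sentiment labels the exact amounts rarely carry signal, but if numbers are meaningful in your data, the letters-only class may be the safer choice.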

### Testing the Text Cleaning Function

Before applying the function to the whole dataset, it's a good idea to test it on a sample text to make sure it behaves as expected:

In [40]:
# Testing the text cleaner with a sample text
text_cleaner("This....iSn't GoOoOood!!!")  # Expect lowercase text with the contraction expanded and the punctuation and repeated letters removed

'this is not good'

### Applying the Text Cleaning Function

At this point, we can apply the text cleaning function to the entire dataset:

In [41]:
# Applying the cleaning function to each review in the dataset
df['content_cleaned'] = df['content'].apply(text_cleaner)
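
`apply` passes each cell to the function as it is, and pandas represents a missing cell as a float `NaN`, which has no `.lower()` method. The dataset here has already been cleaned, so this may never come up, but a small guard makes the step robust. This is a minimal sketch: `apply_safely` is a hypothetical helper, and `str.lower` stands in for `text_cleaner` so the example runs without extra libraries.

```python
def apply_safely(cleaner, value):
    """Run cleaner on strings; map missing or non-string values to ''."""
    if isinstance(value, str):
        return cleaner(value)
    return ""


# str.lower is a stand-in for text_cleaner
print(repr(apply_safely(str.lower, "GOOD")))        # 'good'
print(repr(apply_safely(str.lower, float("nan"))))  # ''
```

In the notebook, this would be used as `df['content'].apply(lambda v: apply_safely(text_cleaner, v))`.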

The dataset now has a new column called `content_cleaned`. Let's take a look at it.

In [42]:
df.head()

Unnamed: 0,content,sentiment,content_cleaned
0,Plsssss stoppppp giving screen limit like when...,negative,plss stopp giving screen limit like when you a...
1,Good,positive,good
2,👍👍,positive,thumbs_up
3,Good,neutral,good
4,"App is useful to certain phone brand ,,,,it is...",negative,app is useful to certain phone brand it is not...


### Displaying Some Preprocessed Reviews

To verify the cleaning process, let's look at some before-and-after examples of the cleaned text:

In [43]:
# Printing some original and cleaned reviews to compare
for i in range(40, 45):
    print('Original:', df.loc[i, 'content'])
    print('Cleaned:', df.loc[i, 'content_cleaned'])
    print()

Original: hhaha
Cleaned: hhaha

Original: The bloodline Roman Reigns Jackie Redmond wwe
Cleaned: the bloodline roman reigns jackie redmond wwe

Original: Have had netflix since it first went digital. Canceled membership today. I'm not paying $8 extra dollars a month on top of the $16 I already pay for my 90 year old grandma to be able to watch now and then. In addition....now I can't even use my own account that I pay for on other TVs when I'm traveling for work or at a friends house? Get lost. I'll give Nana my other streaming passwords. Greedy nonsense.
Cleaned: have had netflix since it first went digital canceled membership today i am not paying 8 extra dollars a month on top of the 16 i already pay for my 90 year old grandma to be able to watch now and then in addition now i cannot even use my own account that i pay for on other tvs when i am traveling for work or at a friends house get lost i will give nana my other streaming passwords greedy nonsense

Original: No me gustó lo de

### Saving the Preprocessed Data

Finally, we save the preprocessed text data for further analysis or modeling:

In [44]:
# Saving the preprocessed data to a new CSV file
df.to_csv('../DATASETS/preprocessed_text.csv', index=False)