## Text Preprocessing

### Introduction

In this notebook, we'll look at how to clean up the text of user reviews. Reviews often contain repeated characters, stray emoticons, and unnecessary punctuation, all of which make the data harder to work with. The techniques below cover cleaning and preparing text data for analysis, and you can adapt them to your own projects.

### Setup and Initial Data Viewing

As always, we'll start by loading the cleaned dataset and taking a quick look at it.

In [24]:
# Importing pandas
import pandas as pd

# Loading the cleaned data
df = pd.read_csv('../DATASETS/cleaned_data.csv')

# Displaying the first few rows to get a sense of the data
df.head()

| Unnamed: 0 | content | score |
| --- | --- | --- |
| 0 | Plsssss stoppppp giving screen limit like when... | 2 |
| 1 | Good | 5 |
| 2 | 👍👍 | 5 |
| 3 | Good | 3 |
| 4 | App is useful to certain phone brand ,,,,it is... | 1 |


As the first few rows show, the reviews contain repeated emojis, repeated characters, and stray punctuation marks, all of which need to be cleaned up before we can do any further analysis.

### Importing Necessary Libraries for Text Processing

To clean up the text effectively, we'll use three libraries:

- The `emoji` library helps us interpret and convert emojis into text descriptions.
- The `re` module, part of Python's standard library, provides tools for regex operations, which are crucial for identifying and replacing patterns in text.
- The `contractions` library will be used to expand contractions (like "isn't" to "is not"), which helps standardize the text.

In [25]:
# Importing libraries for handling text data
import emoji  # For converting emojis to text
import re  # For regex operations
import contractions  # For expanding contractions
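
If you want to see the idea behind emoji-to-text conversion without installing anything, Python's standard `unicodedata` module gives a rough approximation. This is a swapped-in technique and not how `emoji.demojize` works: it replaces every character in the Unicode "Symbol, other" category with its official name, so the names differ (`thumbs_up_sign` rather than `thumbs_up`), non-emoji symbols in that category are replaced too, and multi-character emoji sequences are not handled. `describe_symbols` is a hypothetical helper name.

```python
import unicodedata

def describe_symbols(text):
    """Replace each 'Symbol, other' character with its lowercased Unicode name."""
    pieces = []
    for char in text:
        if unicodedata.category(char) == "So":
            # Fall back to an empty name if the character has none
            name = unicodedata.name(char, "")
            pieces.append(" " + name.lower().replace(" ", "_") + " ")
        else:
            pieces.append(char)
    return "".join(pieces)

# "\U0001F44D" is the thumbs-up emoji
print(describe_symbols("good \U0001F44D").split())  # ['good', 'thumbs_up_sign']
```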

### Defining the Text Cleaning Function

We'll create a function that systematically cleans the text by following these steps:

1. **Convert the text to lowercase for consistency.**  
   *Example:*  
   `"This is GREAT!"` → `"this is great!"`

2. **Replace emojis with descriptive text.**  
   *Example:*  
   `"I'm so happy 👍"` → `"i'm so happy thumbs_up"`

3. **Expand contractions to standardize the language.**  
   *Example:*  
   `"I'll be there"` → `"i will be there"`

4. **Replace non-word characters with spaces to improve word separation.**  
   *Example:*  
   `"Hello!!! How's it going?"` → `"hello how s it going"`

5. **Collapse any character repeated three or more times down to two.**  
   *Example:*  
   `"Soooo goooood!!!"` → `"soo good"`

6. **Remove consecutive repeated words to avoid redundancy.**  
   *Example:*  
   `"This is is amazing"` → `"this is amazing"`

7. **Normalize white spaces for clean and properly spaced text.**  
   *Example:*  
   `"Too    many    spaces!"` → `"too many spaces"`

**NOTE**: There is no "one-size-fits-all" solution to text preprocessing. The steps above are suggestions only and you should adapt them to the specific needs of your project.
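
The order of the steps matters. The sketch below uses only the standard `re` module and a tiny hand-written contraction map (`TINY_CONTRACTIONS` and `expand` are hypothetical stand-ins for the `contractions` library, not its API) to show what happens when punctuation is stripped before contractions are expanded:

```python
import re

# Hypothetical two-entry map for illustration only; the real
# `contractions` library covers far more forms.
TINY_CONTRACTIONS = {"isn't": "is not", "i'll": "i will"}

def expand(text):
    """Expand the contractions listed in TINY_CONTRACTIONS."""
    for short, full in TINY_CONTRACTIONS.items():
        text = text.replace(short, full)
    return text

def strip_punctuation(text):
    """Replace anything that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", " ", text)

sample = "this isn't good"

expand_first = strip_punctuation(expand(sample))
strip_first = expand(strip_punctuation(sample))

print(expand_first)  # this is not good
print(strip_first)   # this isn t good
```

Once the apostrophe is gone, the contraction can no longer be recognized, which is why expansion comes before punctuation removal in the list above.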

In [26]:
def text_cleaner(text):
    
    # Convert text to lowercase
    text = text.lower()

    # Replace emojis with a text description to capture their sentiment
    text = emoji.demojize(text, delimiters=(" ", " "))

    # Expand contractions
    text = contractions.fix(text)

    # Replace non-word characters with a space to separate words better
    text = re.sub(r'[^\w\s]', ' ', text)

    # Collapse any character repeated three or more times down to two
    # (the pattern matches every character, so digits and spaces are affected too)
    text = re.sub(r'(.)\1{2,}', r'\1\1', text)

    # Remove consecutive repeated words
    text = re.sub(r'\b(\w+)(?:\s+\1)+\b', r'\1', text)

    # Normalize white spaces
    text = re.sub(r'\s+', ' ', text).strip()  

    return text
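
To see what each regex step contributes, it can help to print the intermediate strings. The sketch below reuses the four patterns from the function above in the same order; the emoji and contraction steps are left out so that it runs with the standard library alone, and `REGEX_STEPS` and `trace_regex_steps` are illustrative names.

```python
import re

# The four regex steps from text_cleaner, in the same order.
REGEX_STEPS = [
    ("strip punctuation", r"[^\w\s]", " "),
    ("collapse repeated chars", r"(.)\1{2,}", r"\1\1"),
    ("drop repeated words", r"\b(\w+)(?:\s+\1)+\b", r"\1"),
    ("normalize whitespace", r"\s+", " "),
]

def trace_regex_steps(text):
    """Apply each regex step in turn and record the intermediate strings."""
    history = []
    for label, pattern, replacement in REGEX_STEPS:
        text = re.sub(pattern, replacement, text)
        history.append((label, text))
    return text.strip(), history

final, history = trace_regex_steps("soooo good good!!!")
for label, intermediate in history:
    # repr() makes leftover spaces visible
    print(f"{label:>24}: {intermediate!r}")
print(final)  # soo good
```

Printing with `repr` shows the trailing spaces left behind by the punctuation step, which the final whitespace normalization and `strip()` remove.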

### Testing the Text Cleaning Function

It's a good idea to test this function on a sample of text before applying it to the entire dataset, to confirm that it behaves as expected:

In [27]:
# Testing the text cleaner on a sample with mixed case, a contraction,
# stray punctuation, repeated characters, a repeated word, and repeated emojis
text_cleaner("This....iSn't GoOoOood good!!! 👎👎👎")

'this is not good thumbs_down'
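
One side effect worth knowing about: because the pattern `(.)\1{2,}` matches any character, it also shortens runs of digits, so a number like `1000` becomes `100`. If numbers matter in your data, one possible variant restricts the collapse to letters. This is an assumption on our part and not what the function above does; `ANY_CHAR_RUN` and `LETTER_RUN` are illustrative names.

```python
import re

# Pattern used in text_cleaner: any character repeated three or more times.
ANY_CHAR_RUN = re.compile(r"(.)\1{2,}")

# Hypothetical alternative: letters only
# ([^\W\d_] means "word character, but not a digit or underscore").
LETTER_RUN = re.compile(r"([^\W\d_])\1{2,}")

sample = "soooo expensive, 1000 rupees"

any_char_result = ANY_CHAR_RUN.sub(r"\1\1", sample)
letter_result = LETTER_RUN.sub(r"\1\1", sample)

print(any_char_result)  # soo expensive, 100 rupees
print(letter_result)    # soo expensive, 1000 rupees
```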

### Applying the Text Cleaning Function

Now that we've defined and tested the text cleaning function, we can apply it to the entire dataset. The cleaned text goes into a new column named `content_cleaned`, so the original `content` column stays available for comparison.

In [28]:
# Applying the cleaning function to each review in the dataset
df['content_cleaned'] = df['content'].apply(text_cleaner)
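
Note that `text_cleaner` calls `.lower()` on its input, so it raises an `AttributeError` if a row holds a missing value (pandas represents those as a float `NaN`). If your copy of the data might contain such rows, one option is a small guard around the cleaner. In the sketch below, `safe_clean` is a hypothetical helper and `simple_cleaner` is a stand-in for `text_cleaner` so that the block runs on its own.

```python
import re

def simple_cleaner(text):
    """Stand-in for text_cleaner so this block runs without extra libraries."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def safe_clean(value, cleaner=simple_cleaner):
    """Clean strings; map anything else (None, NaN, numbers) to an empty string."""
    if not isinstance(value, str):
        return ""
    return cleaner(value)

rows = ["  Good   APP ", None, float("nan")]
cleaned_rows = [safe_clean(row) for row in rows]
print(cleaned_rows)  # ['good app', '', '']
```

In the notebook, this could be used as `df['content'].apply(lambda value: safe_clean(value, text_cleaner))`.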

The dataset now has the new `content_cleaned` column:

In [29]:
df.head()

| Unnamed: 0 | content | score | content_cleaned |
| --- | --- | --- | --- |
| 0 | Plsssss stoppppp giving screen limit like when... | 2 | plss stopp giving screen limit like when you a... |
| 1 | Good | 5 | good |
| 2 | 👍👍 | 5 | thumbs_up |
| 3 | Good | 3 | good |
| 4 | App is useful to certain phone brand ,,,,it is... | 1 | app is useful to certain phone brand it is not... |
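
Some reviews can end up empty after cleaning, for example when the original text consists only of punctuation. It may be worth counting such rows before any modeling. The sketch below works on a plain list with a regex-only version of the cleaner (no emoji or contraction handling) so that it runs with the standard library; `regex_only_cleaner` and the sample reviews are made up for illustration.

```python
import re

def regex_only_cleaner(text):
    """Lowercasing plus the regex steps of text_cleaner."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    text = re.sub(r"\b(\w+)(?:\s+\1)+\b", r"\1", text)
    return re.sub(r"\s+", " ", text).strip()

# Made-up reviews; two of them contain nothing but punctuation
reviews = ["Good", "!!!", "???  ...", "Not bad"]
cleaned = [regex_only_cleaner(review) for review in reviews]

# Positions of reviews that became empty strings
empty_positions = [i for i, text in enumerate(cleaned) if text == ""]

print(cleaned)          # ['good', '', '', 'not bad']
print(empty_positions)  # [1, 2]
```

On the DataFrame, a comparable check would be `(df['content_cleaned'] == '').sum()`.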


### Displaying Some Preprocessed Reviews

Comparing a few original reviews with their cleaned versions shows how the function transforms the text and helps confirm that it works as expected.

In [30]:
# Printing some original and cleaned reviews to compare
for i in range(80, 85):
    print('Original:', df.loc[i, 'content'])
    print('Cleaned:', df.loc[i, 'content_cleaned'])
    print()

Original: Account sharing? Lol. Very expensive subscription. Very limited quality choices. Not worth it. Just pirate.
Cleaned: account sharing lol very expensive subscription very limited quality choices not worth it just pirate

Original: Netflix is ​​a great medium for movies and TV shows. I hope more great movies and series will come on Netflix.
Cleaned: netflix is a great medium for movies and tv shows i hope more great movies and series will come on netflix

Original: Worst very Very Very bad 👎 😕 😞 😑
Cleaned: worst very bad thumbs_down confused_face disappointed_face expressionless_face

Original: Because you can watch movies and relax while it plays and I can just give this 5 starts for everything it does
Cleaned: because you can watch movies and relax while it plays and i can just give this 5 starts for everything it does

Original: Back then Netflix was my favorite app because it had alot of newly released movies but now There's no good movies and the ones that is newly release

### Saving the Preprocessed Data

Finally, we save the preprocessed text data for further analysis or modeling:

In [31]:
# Saving the preprocessed data to a new CSV file
df.to_csv('../DATASETS/preprocessed_text.csv', index=False)