# Data Preprocessing

Based on the initial EDA and the dataset's characteristics, minimal preprocessing should suffice:

1. Convert text to lowercase
2. Remove extra whitespace
3. No need for complex preprocessing due to:
    - Small and balanced dataset size
    - Clear sentiment patterns
    - Well-distributed classes
4. Keep stopwords for now. Removing them may reduce token noise, but we are fine-tuning a BERT model, which may benefit from the context these common words provide, so we start experimenting with a simple processing approach.
5. Remove punctuation. Combined with keeping stopwords, this can simplify the data without losing essential context. The code below also strips digits.

Simple preprocessing might help preserve important sentiment indicators. This dataset size is suitable for quick experimentation and model iteration.


## Import required libraries

In [7]:
import pandas as pd
import os
import re
import string

In [8]:
df_raw = pd.read_csv('../data/raw/Restaurant_Reviews.tsv', sep='\t')

## Preprocess the Review

In [9]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Remove newlines and extra whitespace
    text = ' '.join(text.split())
    
    return text
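
As a quick sanity check, the steps can be run on a few sample strings. This is a minimal sketch: the sample strings are made up (not taken from the dataset), and the function is restated only so the block runs on its own.

```python
import re
import string

def preprocess_text(text):
    """Same steps as the function above, restated so this block is self-contained."""
    text = text.lower()                                                # lowercase
    text = re.sub(r'\d+', '', text)                                    # drop digits
    text = text.translate(str.maketrans('', '', string.punctuation))  # drop punctuation
    return ' '.join(text.split())                                      # collapse whitespace

# Made-up sample strings for illustration only
samples = [
    "Wow... Loved this place!! 10/10",
    "The food wasn't good.",
    "  Service   was\nslow  ",
]

cleaned = [preprocess_text(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(repr(before), '->', repr(after))
```

Note that stripping apostrophes merges contractions ("wasn't" becomes "wasnt"), which may be tokenized differently from the original form. This is worth keeping in mind if negation turns out to matter for sentiment.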


In [10]:
def process_raw_data(df):
    # Work on a copy so the caller's raw DataFrame is left unchanged
    df = df.copy()
    # Apply preprocess_text to every entry of the 'Review' column
    df['Review'] = df['Review'].apply(preprocess_text)
    return df

In [11]:
# Function to save the processed DataFrame
def save_processed_data(df, file_name, folder_path='../data/processed'):
    os.makedirs(folder_path, exist_ok=True)

    file_path = os.path.join(folder_path, file_name)
    df.to_csv(file_path, index=False)
    print(f"Processed data saved to {file_path}")

In [12]:
# call the helper functions
df_processed = process_raw_data(df_raw)
save_processed_data(df_processed, 'processed_data.csv')

Processed data saved to ../data/processed/processed_data.csv
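
One thing that may be worth checking after saving: a review made up only of digits and punctuation (for example "10/10") becomes an empty string after preprocessing, and pandas reads empty CSV fields back as NaN by default. The sketch below illustrates this with a made-up two-row frame and an in-memory buffer instead of the real file. The `Liked` column name and the presence of such reviews in the actual dataset are assumptions.

```python
import io
import pandas as pd

# Made-up frame standing in for the processed data; the second review is empty.
# The 'Liked' label column is an assumed name for illustration.
df_demo = pd.DataFrame({'Review': ['great food', ''], 'Liked': [1, 0]})

# Write to an in-memory buffer with the same to_csv call used for the file on disk
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)

# Default read: the empty field is parsed as NaN
buffer.seek(0)
df_default = pd.read_csv(buffer)
n_missing = int(df_default['Review'].isna().sum())

# keep_default_na=False keeps the empty field as an empty string
buffer.seek(0)
df_keep = pd.read_csv(buffer, keep_default_na=False)
reviews_kept = df_keep['Review'].tolist()

print(n_missing, reviews_kept)
```

If any rows come back as missing, they can be dropped or loaded with `keep_default_na=False` before tokenization, depending on how the downstream training code handles empty input.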
