### Preprocessing and preparing the text data for machine learning ###

Cleaning Text Data: wrote a clean_text function that converts all text to lowercase, replaces tabs and line breaks with spaces, and removes non-alphanumeric characters, a standard way to normalize text.
Text Representation: used TfidfVectorizer to convert the cleaned text into numerical TF-IDF features, capping the vocabulary at 1000 terms (max_features=1000) for efficiency.
Data Splitting: partitioned the data into training and testing sets, with 20% reserved for testing to evaluate the model's performance later. The split happens before vectorizing, so the vectorizer is fit on the training text only.

In [3]:
# Data Pre-processing
import pandas as pd
import re

# Loading the dataset for further processing
df = pd.read_csv('/Users/adese/Downloads/Singapore_Airlines_Reviews New.csv')

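def clean_text(text):
    # Treat missing or non-string values (e.g. NaN) as empty text
    if not isinstance(text, str):
        return ''
    text = text.lower()
    # Collapse tabs and line breaks into a single space
    text = re.sub(r'[\t\n\r]+', ' ', text)
    # Drop everything except letters, digits, and whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text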

df['cleaned_text'] = df['text'].apply(clean_text)
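
As a quick sanity check, the cleaning steps can be tried on a single string. The sketch below repeats the same three steps so it runs on its own; the sample review is made up for illustration and is not taken from the dataset.

```python
import re

def clean_text(text):
    # Same three steps as the notebook's clean_text, repeated so this runs standalone
    text = text.lower()
    text = re.sub(r'[\t\n\r]+', ' ', text)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    return text

# Hypothetical review text, made up for illustration
sample = "Great flight!\nCrew was friendly, 10/10."
cleaned = clean_text(sample)
print(cleaned)  # great flight crew was friendly 1010
```

Note that removing punctuation also joins "10/10" into "1010", so tokens that rely on symbols lose their meaning after cleaning.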


In [5]:
# Text representation for the model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

vectorizer = TfidfVectorizer(max_features=1000)

# Splitting the data into training and testing sets before vectorizing to prevent data leakage
X_train, X_test, y_train, y_test = train_test_split(df['cleaned_text'], df['rating'], test_size=0.2, random_state=42)

# Vectorizing the text: fit on the training set only, then reuse that vocabulary on the test set
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
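
To see what the split and the feature cap produce, the same pattern can be run on a small toy corpus. This is only a sketch: the texts, ratings, and the max_features=5 value below are made up for illustration and are not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus and ratings, made up for illustration
texts = [
    "great flight and friendly crew",
    "terrible delay and rude staff",
    "comfortable seat and good food",
    "lost luggage and long delay",
    "friendly crew and good food",
    "rude staff and terrible food",
    "great seat and smooth flight",
    "long delay and cold food",
    "smooth flight and great food",
    "cold cabin and rude crew",
]
ratings = [5, 1, 4, 1, 5, 1, 5, 2, 5, 2]

# Split first, so the test reviews never influence the vocabulary or IDF weights
X_train, X_test, y_train, y_test = train_test_split(
    texts, ratings, test_size=0.2, random_state=42
)

# Small cap chosen only so the effect is visible on a tiny corpus
vectorizer = TfidfVectorizer(max_features=5)
X_train_vec = vectorizer.fit_transform(X_train)  # learns vocabulary and IDF from training text
X_test_vec = vectorizer.transform(X_test)        # reuses them; unseen test words are ignored

print(len(X_train), len(X_test))            # 8 2
print(X_train_vec.shape, X_test_vec.shape)  # (8, 5) (2, 5)
```

Both matrices have the same number of columns because the test set is projected onto the vocabulary learned from the training set.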
