#### Week 11 Exercise 11.2

Author: Rex Gayas

Course & Section: DSC360-T301 Data Mining: Text Analytics an (2243-1)

Date: 25 FEB 2024

#### Data Loading and Preparation

In [9]:
import pandas as pd

# Load the dataset
file_path = r'D:\ALPHA\Dynamic Folder\Bellevue\Winter 2023\Data Mining\Week 11\archive\hotel-reviews.csv'
hotel_reviews = pd.read_csv(file_path)

# Display the first few entries to understand the dataset structure
hotel_reviews.head()


Unnamed: 0,User_ID,Description,Browser_Used,Device_Used,Is_Response
0,id10326,The room was kind of clean but had a VERY stro...,Edge,Mobile,not happy
1,id10327,I stayed at the Crown Plaza April -- - April -...,Internet Explorer,Mobile,not happy
2,id10328,I booked this hotel through Hotwire at the low...,Mozilla,Tablet,not happy
3,id10329,Stayed here with husband and sons on the way t...,InternetExplorer,Desktop,happy
4,id10330,My girlfriends and I stayed here to celebrate ...,Edge,Tablet,not happy


To build the deep learning model, the Description column serves as the input and Is_Response as the target variable. The first task is to encode Is_Response as a binary variable, with “happy” mapped to 1 and “not happy” mapped to 0.

#### Data Preprocessing

In [10]:
import re
from sklearn.model_selection import train_test_split

# Preprocessing text data
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation and numbers
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove excess whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Clean the descriptions
hotel_reviews['Cleaned_Description'] = hotel_reviews['Description'].apply(clean_text)

# Encode the target variable
hotel_reviews['Is_Response'] = hotel_reviews['Is_Response'].map({'happy': 1, 'not happy': 0})

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
    hotel_reviews['Cleaned_Description'], 
    hotel_reviews['Is_Response'], 
    test_size=0.2, 
    random_state=42
)

# Check the results
X_train.head(), y_train.head()


(5031     we stayed in a suite and found it spacious eno...
 28166    my boyfriend and i were given a day trip to ne...
 11229    we stayed at the hotel for a week in sept over...
 7346     i stayed here for my st birthday with other pe...
 6316     the location of this hotel is great and thats ...
 Name: Cleaned_Description, dtype: object,
 5031     1
 28166    1
 11229    1
 7346     1
 6316     0
 Name: Is_Response, dtype: int64)

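A text-cleaning function was defined and applied to the “Description” column: it lowercases each review, removes punctuation and digits, and collapses extra whitespace. The target variable “Is_Response” was encoded as a binary label, and the data was split 80/20 into training and test sets.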
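To make the effect of these two steps concrete, here is a small sketch that runs the same cleaning logic on a made-up review and applies the same label mapping to a few hypothetical labels. The sample sentence and the `labels` Series are illustrative assumptions, not rows from the dataset. The second part shows that `Series.map` returns NaN for any label the dictionary does not cover, so counting NaN values is a quick way to check the encoding.

```python
import re

import pandas as pd


def clean_text(text):
    """Same cleaning steps as the cell above."""
    text = text.lower()                        # lowercase
    text = re.sub(r'[^a-z\s]', '', text)       # drop punctuation and digits
    text = re.sub(r'\s+', ' ', text).strip()   # collapse whitespace
    return text


# Made-up review, not taken from the dataset
sample = "The room was VERY clean!!  2 stars."
cleaned = clean_text(sample)
print(cleaned)  # the room was very clean stars

# Hypothetical labels, including one spelling the mapping does not cover
labels = pd.Series(['happy', 'not happy', 'Happy'])
encoded = labels.map({'happy': 1, 'not happy': 0})

# Unmapped labels become NaN, so this count should be 0 on clean data
n_unmapped = int(encoded.isna().sum())
print(n_unmapped)  # 1
```

Running the same `isna().sum()` check on the real `Is_Response` column after mapping would confirm that every row received a 0 or 1 before training.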

#### Universal Sentence Encoder Embedding

In [11]:
import pandas as pd
import numpy as np
import re
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.model_selection import train_test_split

# Load the dataset
df = pd.read_csv('D:\\ALPHA\\Dynamic Folder\\Bellevue\\Winter 2023\\Data Mining\\Week 11\\archive\\hotel-reviews.csv')

# Clean function redefined
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply cleaning function to the 'Description' column
df['Description'] = df['Description'].apply(clean_text)

# Encode the target variable 'Is_Response'
df['Is_Response'] = df['Is_Response'].map({'happy': 1, 'not happy': 0})

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(df['Description'], df['Is_Response'], test_size=0.2, random_state=42)

# Load the Universal Sentence Encoder's TF Hub module
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
embed = hub.load(module_url)

# Function to create sentence embeddings with batch processing
def get_sentence_embeddings(sentences, batch_size=128):
    all_embeddings = []
    for i in range(0, len(sentences), batch_size):
        batch_sentences = sentences[i:i+batch_size]
        batch_embeddings = embed(batch_sentences)
        all_embeddings.append(batch_embeddings.numpy())
    return np.vstack(all_embeddings)

# Convert sentences in training and testing sets into embeddings
X_train_embeddings = get_sentence_embeddings(X_train.tolist())
X_test_embeddings = get_sentence_embeddings(X_test.tolist())


In [12]:
# Print the shape of the embeddings to confirm their dimensions
print(X_train_embeddings.shape)
print(X_test_embeddings.shape)

# Print a small part of the embeddings to see their actual values
print(X_train_embeddings[:2])  # Prints the first two embeddings from the training set
print(X_test_embeddings[:2])   # Prints the first two embeddings from the testing set


(31145, 512)
(7787, 512)
[[-0.02875476 -0.01960756  0.00453075 ...  0.04044104  0.04497837
   0.05902657]
 [-0.03446922 -0.03135321 -0.01324397 ...  0.04881332 -0.00763081
   0.05488839]]
[[-0.03608064  0.02838305 -0.02954326 ...  0.01310635 -0.01404778
   0.04856294]
 [-0.06717499 -0.04941348 -0.02446725 ... -0.01657945  0.04211202
   0.0501585 ]]


Because the Universal Sentence Encoder (USE) is memory-intensive, the sentences are embedded in batches of 128 rather than all at once, which keeps memory use bounded. The output shows 31,145 training samples and 7,787 test samples, each represented by a 512-dimensional embedding vector.
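The batching pattern can be tried without downloading the encoder. The sketch below swaps in a hypothetical `fake_embed` function that returns zero vectors of width 512, so only the slicing and stacking logic is exercised. The names `fake_embed` and `embed_in_batches` are assumptions made for illustration. The last lines work out how many batches of 128 the reported sample counts require.

```python
import math

import numpy as np

EMBED_DIM = 512  # embedding width reported in the output above


def fake_embed(batch):
    """Hypothetical stand-in for the TF Hub encoder: one zero vector per sentence."""
    return np.zeros((len(batch), EMBED_DIM), dtype=np.float32)


def embed_in_batches(sentences, embed_fn, batch_size=128):
    """Embed a list of sentences chunk by chunk and stack the results."""
    chunks = []
    for start in range(0, len(sentences), batch_size):
        chunks.append(embed_fn(sentences[start:start + batch_size]))
    return np.vstack(chunks)


sentences = [f"review {i}" for i in range(300)]  # made-up inputs
embeddings = embed_in_batches(sentences, fake_embed)
print(embeddings.shape)  # (300, 512)

# Number of encoder calls needed for the reported split sizes
train_batches = math.ceil(31145 / 128)
test_batches = math.ceil(7787 / 128)
print(train_batches, test_batches)  # 244 61
```

Only one batch of sentences is passed to the encoder at a time, so the peak memory used by the encoder depends on `batch_size` rather than on the size of the dataset. The stacked result still has to fit in memory.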

#### Model Building, Training, and Evaluation

In [13]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Define the model
model = Sequential([
    Dense(256, activation='relu', input_shape=(512,)),
    Dropout(0.5),
    Dense(64, activation='relu'),
    Dense(1, activation='sigmoid')  # Using sigmoid for binary classification
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train_embeddings, y_train, epochs=10, validation_split=0.1)

# Evaluate the model on the test set
test_loss, test_accuracy = model.evaluate(X_test_embeddings, y_test)
print('Test Accuracy:', test_accuracy)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Test Accuracy: 0.867343008518219


The trained model was evaluated on the test set, yielding an accuracy of approximately 86.73%. Memory limitations were a concern throughout, especially when generating embeddings for the entire dataset: after kernel crashes occurred, batch processing was adopted to keep memory use manageable. As next steps, the model could be tuned by adjusting hyperparameters such as the layer sizes, dropout rate, or number of epochs. K-fold cross-validation would help assess how consistent the accuracy is across different subsets of the data, and error analysis of the misclassified reviews could show where the mistakes originate.
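As a rough illustration of the K-fold idea mentioned above, the sketch below runs 5-fold cross-validation with scikit-learn. To keep it light and self-contained, it uses randomly generated features and a logistic regression classifier as stand-ins. Both are assumptions for illustration and are not the USE embeddings or the Keras model from this notebook. In practice, the loop body would build and fit a fresh copy of the network on each fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption): 200 samples, 16 features, binary labels
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 16))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in kf.split(X):
    # A fresh model per fold, so no fold sees weights trained on its own data
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[val_idx], y[val_idx]))

# Mean shows typical accuracy; the standard deviation shows consistency
print(len(fold_scores))
print(np.mean(fold_scores), np.std(fold_scores))
```

A small standard deviation across folds would suggest that the single test-set accuracy reported above does not depend on one particular split.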