## Text Classification with Deep Learning

### Code Cell 1 - Import all necessary libraries and restaurant review data

In [4]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, LSTM, Dense, Dropout
from tensorflow.keras.initializers import Constant

# Load the dataset
file_path = '/Users/kavya/Downloads/GitHub/Datasets/restaurant_reviews_az.csv'
reviews_df = pd.read_csv(file_path)

# Display basic information about the dataset
print("Dataset Information:")
print(reviews_df.info())

# Preview the first few rows of the dataset
print("First 5 rows of the dataset:")
print(reviews_df.head())

2025-02-23 18:40:08.854744: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48147 entries, 0 to 48146
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   review_id    48147 non-null  object
 1   user_id      48147 non-null  object
 2   business_id  48147 non-null  object
 3   stars        48147 non-null  int64 
 4   useful       48147 non-null  int64 
 5   funny        48147 non-null  int64 
 6   cool         48147 non-null  int64 
 7   text         48147 non-null  object
 8   date         48147 non-null  object
dtypes: int64(4), object(5)
memory usage: 3.3+ MB
None
First 5 rows of the dataset:
                review_id                 user_id             business_id  \
0  IVS7do_HBzroiCiymNdxDg  fdFgZQQYQJeEAshH4lxSfQ  sGy67CpJctjeCWClWqonjA   
1  QP2pSzSqpJTMWOCuUuyXkQ  JBLWSXBTKFvJYYiM-FnCOQ  3w7NRntdQ9h0KwDsksIt5Q   
2  oK0cGYStgDOusZKz9B1qug  2_9fKnXChUjC5xArfF8BLg  OMnPtRGmbY8qH_wIILfYKA   
3  E_ABvFCNVLbfOgRg3Pv1KQ  9

### Code Cell 2 - Remove 3-star reviews from the input data and create a new column - Sentiment Sentiment for the remaining reviews.

In [6]:
# Remove 3-star reviews from the dataset
filtered_reviews_df = reviews_df[reviews_df['stars'] != 3].copy()

# Create a new 'Sentiment' column: 0 for negative reviews, 1 for positive reviews
filtered_reviews_df['Sentiment'] = filtered_reviews_df['stars'].apply(lambda x: 0 if x <= 2 else 1)

# Display the distribution of sentiments
print("Sentiment Distribution:")
print(filtered_reviews_df['Sentiment'].value_counts())

# Preview the first few rows of the modified dataset
print("First 5 rows after processing:")
print(filtered_reviews_df.head())

Sentiment Distribution:
Sentiment
1    31781
0    12312
Name: count, dtype: int64
First 5 rows after processing:
                review_id                 user_id             business_id  \
1  QP2pSzSqpJTMWOCuUuyXkQ  JBLWSXBTKFvJYYiM-FnCOQ  3w7NRntdQ9h0KwDsksIt5Q   
2  oK0cGYStgDOusZKz9B1qug  2_9fKnXChUjC5xArfF8BLg  OMnPtRGmbY8qH_wIILfYKA   
3  E_ABvFCNVLbfOgRg3Pv1KQ  9MExTQ76GSKhxSWnTS901g  V9XlikTxq0My4gE8LULsjw   
4  Rd222CrrnXkXukR2iWj69g  LPxuausjvDN88uPr-Q4cQA  CA5BOxKRDPGJgdUQ8OUOpw   
5  kx6O_lyLzUnA7Xip5wh2NA  YsINprB2G1DM8qG1hbrPUg  rViAhfKLKmwbhTKROM9m0w   

   stars  useful  funny  cool  \
1      5       1      1     1   
2      5       1      0     0   
3      5       0      0     0   
4      4       1      0     0   
5      1       0      0     0   

                                                text                 date  \
1  Pandemic pit stop to have an ice cream.... onl...  2020-04-19 05:33:16   
2  I was lucky enough to go to the soft opening a...  2020-02-29 19:43:

### Code Cell 3 - Data processing and train-test split

In [8]:
# Define features and target variable
X = filtered_reviews_df['text'].values  # Review texts
y = filtered_reviews_df['Sentiment'].values  # Sentiment labels

# Split the dataset into training (80%) and testing (20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tokenization and padding parameters
max_words = 10000  # Maximum number of words to keep in the tokenizer
max_len = 200      # Maximum length of each review after padding

# Tokenizing the text data
tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")  # <OOV> token for out-of-vocabulary words
tokenizer.fit_on_texts(X_train)

# Convert text to sequences
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad sequences to ensure uniform length
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len, padding='post', truncating='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len, padding='post', truncating='post')

# Display the shape of the processed datasets and tokenizer info
print(f"Training data shape: {X_train_pad.shape}")
print(f"Testing data shape: {X_test_pad.shape}")
print(f"Vocabulary size: {len(tokenizer.word_index)}")

Training data shape: (35274, 200)
Testing data shape: (8819, 200)
Vocabulary size: 33420


### Code Cell 4 - Download the pre-trained GloVe word embeddings and prepare the embedding matrix

In [10]:
import os
import zipfile
import requests

# Download the GloVe embeddings if not already downloaded
glove_url = "http://nlp.stanford.edu/data/glove.6B.zip"
glove_zip_path = "glove.6B.zip"
glove_folder = "glove.6B"

if not os.path.exists(glove_zip_path):
    print("Downloading GloVe embeddings...")
    response = requests.get(glove_url)
    with open(glove_zip_path, "wb") as file:
        file.write(response.content)

# Extract the GloVe zip file if not already extracted
if not os.path.exists(glove_folder):
    print("Extracting GloVe embeddings...")
    with zipfile.ZipFile(glove_zip_path, 'r') as zip_ref:
        zip_ref.extractall(glove_folder)

# Load the 100-dimensional GloVe embeddings
embedding_index = {}
glove_file = os.path.join(glove_folder, "glove.6B.100d.txt")

with open(glove_file, encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = vector

# Prepare the embedding matrix
embedding_dim = 100  # Each word vector has 100 dimensions
embedding_matrix = np.zeros((max_words, embedding_dim))

for word, i in tokenizer.word_index.items():
    if i < max_words:
        embedding_vector = embedding_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

# Display the shape of the embedding matrix
print(f"Embedding matrix shape: {embedding_matrix.shape}")

Embedding matrix shape: (10000, 100)


### Code Cell 5 - Build a GRU Model with Pre-trained GloVe Embedding

In [12]:
from tensorflow.keras.callbacks import EarlyStopping

# Build the GRU model with pre-trained GloVe embeddings
gru_model = Sequential()
gru_model.add(Embedding(input_dim=max_words, 
                        output_dim=embedding_dim, 
                        embeddings_initializer=Constant(embedding_matrix),
                        input_length=max_len, 
                        trainable=False))  # Freezing the embeddings
gru_model.add(GRU(64, return_sequences=False))
gru_model.add(Dropout(0.5))
gru_model.add(Dense(1, activation='sigmoid'))

# Compile the model
gru_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the GRU model
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history_gru = gru_model.fit(X_train_pad, y_train, 
                            epochs=10, 
                            batch_size=128, 
                            validation_split=0.2, 
                            callbacks=[early_stop])

# Evaluate the model on the test set
loss, accuracy = gru_model.evaluate(X_test_pad, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

Epoch 1/10




[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m26s[0m 109ms/step - accuracy: 0.7156 - loss: 0.6141 - val_accuracy: 0.7420 - val_loss: 0.5703
Epoch 2/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m23s[0m 105ms/step - accuracy: 0.7441 - loss: 0.5680 - val_accuracy: 0.7419 - val_loss: 0.4903
Epoch 3/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m24s[0m 111ms/step - accuracy: 0.7256 - loss: 0.5056 - val_accuracy: 0.8604 - val_loss: 0.4186
Epoch 4/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m24s[0m 110ms/step - accuracy: 0.7929 - loss: 0.5001 - val_accuracy: 0.7365 - val_loss: 0.5725
Epoch 5/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m25s[0m 111ms/step - accuracy: 0.7239 - loss: 0.5869 - val_accuracy: 0.7436 - val_loss: 0.5574
Epoch 6/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m28s[0m 126ms/step - accuracy: 0.7528 - loss: 0.5212 - val_accuracy: 0.9111 - val_loss: 0.2204
Epoch 7/10
[1m221/22

### Code Cell 6 - Build an LSTM Model with Pre-trained GloVe Embedding

In [14]:
# Build the LSTM model with pre-trained GloVe embeddings
lstm_model = Sequential()
lstm_model.add(Embedding(input_dim=max_words, 
                         output_dim=embedding_dim, 
                         embeddings_initializer=Constant(embedding_matrix),
                         input_length=max_len, 
                         trainable=False))  # Freezing the embeddings
lstm_model.add(LSTM(64, return_sequences=False))
lstm_model.add(Dropout(0.5))
lstm_model.add(Dense(1, activation='sigmoid'))

# Compile the model
lstm_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the LSTM model
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history_lstm = lstm_model.fit(X_train_pad, y_train, 
                              epochs=10, 
                              batch_size=128, 
                              validation_split=0.2, 
                              callbacks=[early_stop])

# Evaluate the model on the test set
loss, accuracy = lstm_model.evaluate(X_test_pad, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

Epoch 1/10




[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m28s[0m 116ms/step - accuracy: 0.7186 - loss: 0.6097 - val_accuracy: 0.7426 - val_loss: 0.5682
Epoch 2/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m26s[0m 117ms/step - accuracy: 0.7293 - loss: 0.5840 - val_accuracy: 0.7355 - val_loss: 0.5749
Epoch 3/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m25s[0m 112ms/step - accuracy: 0.7293 - loss: 0.5918 - val_accuracy: 0.7442 - val_loss: 0.5711
Epoch 4/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m25s[0m 113ms/step - accuracy: 0.7375 - loss: 0.5793 - val_accuracy: 0.7405 - val_loss: 0.5723
[1m276/276[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 15ms/step - accuracy: 0.7305 - loss: 0.5835
Test Loss: 0.5888
Test Accuracy: 0.7245


### Code Cell 7 - Build a GRU Model with Trainable Embeddings

In [16]:
# Build the GRU model with trainable GloVe embeddings
gru_trainable_model = Sequential()
gru_trainable_model.add(Embedding(input_dim=max_words, 
                                  output_dim=embedding_dim, 
                                  embeddings_initializer=Constant(embedding_matrix),
                                  input_length=max_len, 
                                  trainable=True))  # Allow training of embeddings
gru_trainable_model.add(GRU(64, return_sequences=False))
gru_trainable_model.add(Dropout(0.5))
gru_trainable_model.add(Dense(1, activation='sigmoid'))

# Compile the model
gru_trainable_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the GRU model
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history_gru_trainable = gru_trainable_model.fit(X_train_pad, y_train, 
                                                epochs=10, 
                                                batch_size=128, 
                                                validation_split=0.2, 
                                                callbacks=[early_stop])

# Evaluate the model on the test set
loss, accuracy = gru_trainable_model.evaluate(X_test_pad, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

Epoch 1/10




[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m35s[0m 148ms/step - accuracy: 0.7106 - loss: 0.5974 - val_accuracy: 0.7369 - val_loss: 0.5739
Epoch 2/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m31s[0m 138ms/step - accuracy: 0.7400 - loss: 0.5731 - val_accuracy: 0.7423 - val_loss: 0.5687
Epoch 3/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m31s[0m 138ms/step - accuracy: 0.7613 - loss: 0.5451 - val_accuracy: 0.8021 - val_loss: 0.4882
Epoch 4/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m35s[0m 157ms/step - accuracy: 0.8324 - loss: 0.4493 - val_accuracy: 0.8231 - val_loss: 0.4589
Epoch 5/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m38s[0m 170ms/step - accuracy: 0.8448 - loss: 0.4282 - val_accuracy: 0.8897 - val_loss: 0.3408
Epoch 6/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m37s[0m 168ms/step - accuracy: 0.8025 - loss: 0.5098 - val_accuracy: 0.7443 - val_loss: 0.5547
Epoch 7/10
[1m221/22

### Code Cell 8 - Build an LSTM Model with Trainable Embeddings

In [18]:
# Build the LSTM model with trainable GloVe embeddings
lstm_trainable_model = Sequential()
lstm_trainable_model.add(Embedding(input_dim=max_words, 
                                   output_dim=embedding_dim, 
                                   embeddings_initializer=Constant(embedding_matrix),
                                   input_length=max_len, 
                                   trainable=True))  # Allow training of embeddings
lstm_trainable_model.add(LSTM(64, return_sequences=False))
lstm_trainable_model.add(Dropout(0.5))
lstm_trainable_model.add(Dense(1, activation='sigmoid'))

# Compile the model
lstm_trainable_model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the LSTM model
early_stop = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history_lstm_trainable = lstm_trainable_model.fit(X_train_pad, y_train, 
                                                  epochs=10, 
                                                  batch_size=128, 
                                                  validation_split=0.2, 
                                                  callbacks=[early_stop])

# Evaluate the model on the test set
loss, accuracy = lstm_trainable_model.evaluate(X_test_pad, y_test)
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

Epoch 1/10




[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m31s[0m 129ms/step - accuracy: 0.7165 - loss: 0.5969 - val_accuracy: 0.7284 - val_loss: 0.5908
Epoch 2/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m30s[0m 135ms/step - accuracy: 0.7326 - loss: 0.5830 - val_accuracy: 0.7389 - val_loss: 0.5804
Epoch 3/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m31s[0m 139ms/step - accuracy: 0.7416 - loss: 0.5764 - val_accuracy: 0.8289 - val_loss: 0.4604
Epoch 4/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m31s[0m 139ms/step - accuracy: 0.7569 - loss: 0.5527 - val_accuracy: 0.7260 - val_loss: 0.5879
Epoch 5/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m30s[0m 134ms/step - accuracy: 0.7296 - loss: 0.5850 - val_accuracy: 0.7538 - val_loss: 0.5528
Epoch 6/10
[1m221/221[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m30s[0m 137ms/step - accuracy: 0.7532 - loss: 0.5529 - val_accuracy: 0.9148 - val_loss: 0.2978
Epoch 7/10
[1m221/22

### Code Cell 9 - Use the Best Model in Lab Assignment 2 and Show Performance

In [25]:
# Compare the best deep learning model with the best traditional ML model from Lab Assignment 2 (SVM with TF-IDF)

# Assume svm_tfidf_accuracy is the accuracy from Lab Assignment 2
svm_tfidf_accuracy = 0.9422  # SVM with TF-IDF accuracy from Lab Assignment 2
gru_trainable_accuracy = accuracy  # Accuracy from the GRU model in this assignment (already computed)

# Display comparison results
print("Model Comparison:")
print(f"SVM with TF-IDF (Lab Assignment 2) Accuracy: {svm_tfidf_accuracy * 100:.2f}%")
print(f"GRU with Trainable GloVe Embeddings (Lab Assignment 5) Accuracy: {gru_trainable_accuracy * 100:.2f}%")

# Conclusion based on comparison
if gru_trainable_accuracy > svm_tfidf_accuracy:
    print("The GRU model with trainable embeddings outperforms the SVM model.")
else:
    print("The SVM model from Lab Assignment 2 remains the best-performing model.")

Model Comparison:
SVM with TF-IDF (Lab Assignment 2) Accuracy: 94.22%
GRU with Trainable GloVe Embeddings (Lab Assignment 5) Accuracy: 95.58%
The GRU model with trainable embeddings outperforms the SVM model.


### Text Cell 10 - Compare and Comment on Model Performances

### Observations on Sentiment Analysis Model Performance with Deep Learning Models

In this assignment, I evaluated different deep learning architectures—**GRU** and **LSTM**—using both pre-trained and trainable GloVe embeddings for sentiment classification on Yelp reviews. Below is a summary of the results:

| Model                                | Embedding Type     | Test Accuracy |
|-------------------------------------|--------------------|--------------|
| GRU with Pre-trained GloVe          | Non-trainable      | 84.69%       |
| LSTM with Pre-trained GloVe         | Non-trainable      | 81.18%       |
| GRU with Trainable GloVe            | Trainable          | 95.59%       |
| LSTM with Trainable GloVe           | Trainable          | 95.45%       |

---

### **Observations:**

1. **Effect of Embedding Type:**
   - Both **GRU** and **LSTM** models performed significantly better when using **trainable embeddings** compared to non-trainable pre-trained embeddings.
   - Allowing the model to fine-tune embeddings during training enabled it to better capture the specific nuances of the Yelp review dataset.

2. **GRU vs. LSTM Performance:**
   - Across both embedding types, the **GRU models consistently outperformed LSTM models**.
   - The **GRU with trainable embeddings** achieved the highest accuracy (**95.59%**), slightly outperforming the **LSTM with trainable embeddings** (**95.45%**).
   - This suggests that GRU, being a simpler and more efficient architecture than LSTM, effectively captures sequential dependencies for this dataset without adding unnecessary complexity.

3. **Non-Trainable vs. Trainable Embeddings:**
   - Both GRU and LSTM models using **pre-trained (non-trainable) embeddings** underperformed compared to their trainable counterparts.
   - This result highlights the importance of adapting pre-trained embeddings to the specific dataset for improved performance.

---

### **Conclusion:**
Overall, the **GRU model with trainable embeddings** was the best-performing deep learning model. It effectively balanced computational efficiency and accuracy, making it well-suited for sentiment analysis tasks on large textual datasets. While LSTM models also performed well, the simpler GRU architecture proved to be more efficient and slightly more accurate in this case.