# Sentiment Analysis on Yelp Review Dataset

In this tutorial notebook, I am going to learn from Delip Rao and Brian McMahan's book "Natural Language Processing with PyTorch" and modify it to further understand it. We will be building a Sentiment Analyzer for the Yelp Review Dataset. 

First, we have to split the dataset into three sets: Training, Validation and Testing. 

From training dataset, our model will derive parameters, with validation set, our model can make decisions (by selecting among hyperparameters) and the testing set for final evaluation.

I have downloaded the Dataset from this [link](https://www.kaggle.com/datasets/ilhamfp31/yelp-review-dataset)

I am storing this under the `/data` folder under the name `yelp_review`

## 1. Data Preprocessing

In [1]:
# Import Statements
import pandas as pd
import numpy as np
import re
import collections
import string

In [2]:
from collections import Counter
from torch.utils.data import Dataset, DataLoader

In [3]:
np.random.seed(42)

In [4]:
data_base_dir = "data/yelp_review/"
train_dataset = pd.read_csv(data_base_dir+"train.csv", header = None)
test_dataset = pd.read_csv(data_base_dir+"test.csv", header = None)

We have to see the distribution of data, having uneven data will make our model more biased.

In [5]:
train_dataset.columns = ["Rating", "Review"]
test_dataset.columns = ["Rating", "Review"]

In [6]:
train_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560000 entries, 0 to 559999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Rating  560000 non-null  int64 
 1   Review  560000 non-null  object
dtypes: int64(1), object(1)
memory usage: 8.5+ MB


There are 560,000 review in our dataset. Let us see the distribution for Positive reviews and Negative reviews.

In [7]:
train_dataset["Rating"].value_counts()

1    280000
2    280000
Name: Rating, dtype: int64

It looks like we have equal distribution, we need to take a subset of this Dataset, about `10%` with the same distribution. Before we do that, let us check the distribution of `test_dataset`.

In [8]:
test_dataset["Rating"].value_counts()

2    19000
1    19000
Name: Rating, dtype: int64

Even `test_dataset` has the same equal distribution, we want to be able to create three sets: Train, Val, Test. 

In [9]:
main_dataset = pd.concat([train_dataset, test_dataset], ignore_index= True)
main_dataset.head()

Unnamed: 0,Rating,Review
0,1,"Unfortunately, the frustration of being Dr. Go..."
1,2,Been going to Dr. Goldberg for over 10 years. ...
2,1,I don't know what Dr. Goldberg was like before...
3,1,I'm writing this review to give you a heads up...
4,2,All the food is great here. But the best thing...


In [10]:
main_dataset["Rating"].value_counts()

1    299000
2    299000
Name: Rating, dtype: int64

I have combined both the Training dataset and the Testing dataset. We are going to create three new subsets from these. 

`Train - 70%, Val - 15%, Test - 15%`

In [11]:
main_dataset = main_dataset.sample(frac=1).reset_index(drop=True)

In [12]:
main_dataset.head()

Unnamed: 0,Rating,Review
0,2,I'm so glad my friends told me to go here!! Wo...
1,1,I was really looking forward to trying this pl...
2,1,"I didnt know \""sh*tty wok\"" from south park ex..."
3,2,I was looking for a pizza delivery place w/ mo...
4,2,Solid breakfast food.


In [13]:
main_dataset = main_dataset[:int(0.1*len(main_dataset))]

In [14]:
split = []
dataset_size = len(main_dataset)
train_rec = int(0.7 * dataset_size)
test_rec = int(0.15 * dataset_size)
val_rec = dataset_size - train_rec - test_rec
for i in range(train_rec):
    split.append("train")
for i in range(test_rec):
    split.append("test")
for i in range(val_rec):
    split.append("val")
    
print(len(split))

59800


In [15]:
main_dataset["split"] = split

In [16]:
main_dataset["split"].value_counts()

train    41860
test      8970
val       8970
Name: split, dtype: int64

In [17]:
main_dataset["Rating"].value_counts()

1    29990
2    29810
Name: Rating, dtype: int64

In [18]:
def cleaning_dataset(review):
    review = review.lower()
    review = re.sub(r'([.,?!])', r' \1', review)
    review = re.sub(r'([^a-zA-Z.,!?])', r' ', review)
    return review

In [19]:
main_dataset["Review"] = main_dataset["Review"].apply(cleaning_dataset)

In [20]:
main_dataset.head()

Unnamed: 0,Rating,Review,split
0,2,i m so glad my friends told me to go here ! ! ...,train
1,1,i was really looking forward to trying this pl...,train
2,1,i didnt know sh tty wok from south park ex...,train
3,2,i was looking for a pizza delivery place w mo...,train
4,2,solid breakfast food .,train


In [21]:
main_dataset.to_csv("data/yelp_review/reviews.csv")

## 2. Vocabulary
Think of this as two dictionaries, one mapping tokens to ids and another mappings ids to tokens.

In [22]:
class Vocabulary(object):
    # Extracts vocab for mapping
    def __init__(self, token_to_idx = None, add_unk = True, unk_token = "<UNK>"):
        # token_to_idx is pre-existent mapping of tokens to index
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        
        self._idx_to_token = {idx : token for token, idx in self._token_to_idx.items()}
        
        self._add_unk = add_unk
        self._unk_token = unk_token
        
        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token)
            
    def to_serializable(self):
        # returns a serializable dictionary (ordered)
        return {
            "token_to_idx" : self._token_to_idx,
            "add_unk" : self._add_unk,
            "unk_token" : self._unk_token
        }
    
    @classmethod
    def from_serializable(cls, contents):
        # Instantiate a vocab from serialized dictionary
        return cls(**contents)
    
    def add_token(self, token):
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index
    
    def lookup_token(self, token):
        if self._add_unk:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]
    
    def lookup_index(self, index):
        if index not in self._idx_to_token:
            raise KeyError("Index not present")
        else:
            return self._idx_to_token[index]
    
    def __str__(self):
        return f"<Vocabulary(size={len(self)})>"
    
    def __len__(self):
        return len(self._token_to_idx)
        

## 3. Vectorizer
Converts reviews (texts) to vectors. It does Collapsed One Hot representation. We don't really account for the semantic meaning or the number of occurences (we do have a CUT_OFF). We only care if the word is present in the review, not how many times.

In [23]:
class ReviewVectorizer(object):
    def __init__(self, review_vocab, rating_vocab):
        self.review_vocab = review_vocab # Maps words to integers
        self.rating_vocab = rating_vocab # Maps class labels to integers
        
    def vectorize(self, review):
        one_hot = np.zeros(len(self.review_vocab), dtype = np.float32)
        
        for token in review.split(" "):
            if token not in string.punctuation:
                one_hot[self.review_vocab.lookup_token(token)] = 1
                
        return one_hot
    
    @classmethod
    def from_df(cls, df, cutoff = 25):
        review_vocab = Vocabulary(add_unk= True)
        rating_vocab = Vocabulary(add_unk= False)
        
        # Add ratings to rating_vocab
        for rating in sorted(set(df.Rating)):
            rating_vocab.add_token(rating)
            
        # Cross threshold add word
        word_counts = Counter()
        for review in df.Review:
            for word in review.split(" "):
                if word not in string.punctuation:
                    word_counts[word] += 1
            
        for word, count in word_counts.items():
            if count > cutoff:
                review_vocab.add_token(word)
                
        return cls(review_vocab, rating_vocab)
    
    
    @classmethod
    def from_serializable(cls, contents):
        review_vocab = Vocabulary.from_serializable(contents["review_vocab"])
        rating_vocab = Vocabulary.from_serializable(contents["rating_vocab"])
        
        return cls(review_vocab = review_vocab, rating_vocab = rating_vocab)
    
    def to_serializable(self):
        # Serializable dictionary for caching (IDK what this means, to be honest)
        return {
            "review_vocab" : self.review_vocab.to_serializable(),
            "rating_vocab" : self.rating_vocab.to_serializable()
        }

## 4. Dataset

We denote the entry point method (from where the data flows in) with the `@classmethod` decorator. From what I comprehend, Decorators are essentially function passed into another function. It extends the behaviour of the function that is getting passed in without explicitly modifying the code. Decorators wrap a function and modify its behaviour. 

In [24]:
class YelpReviewDataset(Dataset):
    def __init__(self, df, vectorizer):
        self.df = df
        self._vectorizer = vectorizer
        
        # We are going to split our dataset into the three sets: Train, Val and Test
        # Train Split
        self.train_df = self.df[self.df.split == "train"]
        self.train_size = len(self.train_df)
        
        # Val Split
        self.val_df = self.df[self.df.split == "val"]
        self.val_size = len(self.val_df)
        
        # Test Split
        self.test_df = self.df[self.df.split == "test"]
        self.test_size = len(self.test_df)
        
        self._lookup_dict = {"train" : (self.train_df, self.train_size),
                             "val" : (self.val_df, self.val_size),
                             "test" : (self.test_df, self.test_size)}
        
        # By default, we will be in Train set
        self.set_split("train")
        
    @classmethod
    def load_dataset_and_make_vectorizer(cls,csv_location):
        df = pd.read_csv(csv_location)
        train_review_df = df[df.split=='train']
        return cls(df, ReviewVectorizer.from_df(train_review_df))

    
    def get_vectorizer(self):
        return self._vectorizer
    
    def set_split(self, split = "train"):
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]
        
    def __len__(self):
        # This function specifies the size of the dataset
        return self._target_size
    
    def __getitem__(self, index):
        # Takes in index and returns features and labels
        row = self._target_df.iloc[index]
        
        review_vector = self._vectorizer.vectorize(row["Review"])
        
        rating_index = self._vectorizer.rating_vocab.lookup_token(row["Rating"])
        
        return {"x_data" : review_vector, "y_target" : rating_index}
    
    def get_num_batches(self, batch_size):
        return len(self) // batch_size

## 5. DataLoader
We are grouping the data points for Batch Training. Our dataloader in called `generate_batches`.

In [25]:
def generate_batches(dataset, batch_size, shuffle = True, drop_last = True, device = "cpu"):
    dataloader = DataLoader(dataset = dataset, batch_size = batch_size, shuffle=shuffle, drop_last=drop_last)
    
    for data_dict in dataloader:
        # Port it to device
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device)
        yield out_data_dict

## 6. A simple Perceptron
One hidden layer with sigmoid activation function

In [26]:
import torch.nn as nn
import torch.nn.functional as F

In [27]:
class ReviewClassifier(nn.Module):
    def __init__(self, num_features):
        # num_features is size of the input feature vector
        super(ReviewClassifier, self).__init__()
        self.fc1 = nn.Linear(in_features=num_features, out_features= 1)
    def forward(self, x_in, apply_sigmoid = False):
        y_out = self.fc1(x_in).squeeze()
        if apply_sigmoid:
            y_out = F.sigmoid(y_out)
        return y_out

## 7. Training Setup and Training Loop

In [45]:
from argparse import Namespace
import torch

In [46]:
args = Namespace(
    # Data and Path
    frequency_cuttoff = 25, 
    model_state_file = "model.pth",
    csv_location = "data/yelp_review/reviews.csv",
    save_dir = "/",
    # Training hyperparams
    batch_size = 128,
    early_stopping_criteria = 5,
    learning_rate = 0.001,
    num_epochs = 100,
    seed = 42
)

In [47]:
import torch.optim as optim

In [48]:
def make_train_state(args):
    return {
        "epoch_index" : 0,
        "train_loss" : [],
        "train_acc" : [],
        "val_loss" : [],
        "val_acc" : [],
        "test_loss" : -1,
        "test_acc" : -1,
    }

train_state = make_train_state(args)

In [49]:
args.device = "mps"

In [50]:
# First the Dataset and Vectorizer
dataset = YelpReviewDataset.load_dataset_and_make_vectorizer(args.csv_location)
vectorizer = dataset.get_vectorizer()

In [51]:
# Model
classifier = ReviewClassifier(num_features= len(vectorizer.review_vocab))
classifier = classifier.to(args.device)

In [52]:
# Loss function and Optimizer
loss_func = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(classifier.parameters(), lr = args.learning_rate)

In [53]:
def compute_accuracy(y_pred, y_target):
    y_target = y_target.cpu()
    y_pred_indices = (torch.sigmoid(y_pred)>0.5).cpu().long()#.max(dim=1)[1]
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100

In [54]:
# Training Loop
for epoch_index in range(args.num_epochs):
    train_state["epoch_index"] = epoch_index
    
    # Iterating over training dataset
    dataset.set_split("train")
    
    batch_generator = generate_batches(dataset, batch_size= args.batch_size, device= args.device)
    
    running_loss = 0.0
    running_acc = 0.0
    classifier.train()
    
    for batch_index, batch_dict in enumerate(batch_generator):
        # Training routine is 5 steps
        # Step 1: Zero the gradients
        optimizer.zero_grad()
        
        # Step 2: Compute output
        y_pred = classifier(x_in = batch_dict["x_data"].float())
        
        # Step 3: Compute Loss
        loss = loss_func(y_pred, batch_dict["y_target"].float())
        loss_batch = loss.item()
        running_loss += (loss_batch - running_loss) / (batch_index + 1)
        
        # Step 4: Use loss to produce gradients
        loss.backward()
        
        # Step 5: Use optimizer to take gradient step
        optimizer.step()
        
        # Computing accuracy
        acc_batch = compute_accuracy(y_pred, batch_dict["y_target"])
        running_acc += (acc_batch - running_acc) / (batch_index + 1)
        
    train_state["train_loss"].append(running_loss)
    train_state["train_acc"].append(running_acc)
    
    # Iterating over val dataset
    dataset.set_split("val")
    
    batch_generator = generate_batches(dataset, batch_size= args.batch_size, device= args.device)
    
    running_loss = 0.0
    running_acc = 0.0
    classifier.train()
    
    for batch_index, batch_dict in enumerate(batch_generator):
        # Training routine is 5 steps
        # Step 1: Zero the gradients
        optimizer.zero_grad()
        
        # Step 2: Compute output
        y_pred = classifier(x_in = batch_dict["x_data"].float())
        
        # Step 3: Compute Loss
        loss = loss_func(y_pred, batch_dict["y_target"].float())
        loss_batch = loss.item()
        running_loss += (loss_batch - running_loss) / (batch_index + 1)
        
        # Step 4: Use loss to produce gradients
        loss.backward()
        
        # Step 5: Use optimizer to take gradient step
        optimizer.step()
        
        # Computing accuracy
        acc_batch = compute_accuracy(y_pred, batch_dict["y_target"])
        running_acc += (acc_batch - running_acc) / (batch_index + 1)
        
    train_state["val_loss"].append(running_loss)
    train_state["val_acc"].append(running_acc)
    
    

In [55]:
train_state

{'epoch_index': 99,
 'train_loss': [0.4710985072162175,
  0.3074474175679938,
  0.2562686162986522,
  0.22865879604029013,
  0.21080981354465544,
  0.19784240704884212,
  0.18809375834483258,
  0.18008661867099438,
  0.17350872462495753,
  0.16779210413905823,
  0.16288448984105303,
  0.15863256498214312,
  0.1548691922371541,
  0.15126183722453018,
  0.14826552239488763,
  0.14525991347860492,
  0.14262602970414215,
  0.14018622068090172,
  0.13796529454102222,
  0.1359060880149907,
  0.13382846572712664,
  0.13194971761754531,
  0.13029966166168913,
  0.12855416096587433,
  0.12700061440832386,
  0.125540496103625,
  0.12408057980763444,
  0.12275315807500024,
  0.1214307383819275,
  0.12019194511053033,
  0.11905493744469564,
  0.11784644658457977,
  0.11667327708151724,
  0.11569835105483686,
  0.11463970232994192,
  0.11356221071077052,
  0.11267241256864056,
  0.11175121456977062,
  0.1110219381840769,
  0.10996307916825339,
  0.10920918667945291,
  0.10835462413295323,
  0.10762

In [56]:
dataset.set_split('test')
batch_generator = generate_batches(dataset, 
                                   batch_size=args.batch_size, 
                                   device=args.device)
running_loss = 0.
running_acc = 0.
classifier.eval()

for batch_index, batch_dict in enumerate(batch_generator):
    # compute the output
    y_pred = classifier(x_in=batch_dict['x_data'].float())

    # compute the loss
    loss = loss_func(y_pred, batch_dict['y_target'].float())
    loss_t = loss.item()
    running_loss += (loss_t - running_loss) / (batch_index + 1)

    # compute the accuracy
    acc_t = compute_accuracy(y_pred, batch_dict['y_target'])
    running_acc += (acc_t - running_acc) / (batch_index + 1)

train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc

In [57]:
print(train_state["test_loss"])
print(train_state["test_acc"])

0.3128017302070344
90.90401785714286


In [60]:
def predict_rating(review, classifier, vectorizer, decision_threshold=0.5):
    """Predict the rating of a review
    
    Args:
        review (str): the text of the review
        classifier (ReviewClassifier): the trained model
        vectorizer (ReviewVectorizer): the corresponding vectorizer
        decision_threshold (float): The numerical boundary which separates the rating classes
    """
    review = cleaning_dataset(review)
    
    vectorized_review = torch.tensor(vectorizer.vectorize(review))
    result = classifier(vectorized_review.view(1, -1))
    
    probability_value = F.sigmoid(result).item()
    index = 1
    if probability_value < decision_threshold:
        index = 0

    return vectorizer.rating_vocab.lookup_index(index)

In [65]:
test_review = "this is out of the world"

classifier = classifier.cpu()
prediction = predict_rating(test_review, classifier, vectorizer, decision_threshold=0.5)
print("{} -> {}".format(test_review, prediction))

this is out of the world -> 2
