# Sentiment Analysis on Yelp Review Dataset

In this tutorial notebook, I work through an example from Delip Rao and Brian McMahan's book "Natural Language Processing with PyTorch" and modify it to understand it better. We will build a sentiment analyzer for the Yelp Review Dataset.

First, we have to split the dataset into three sets: training, validation, and testing.

The model learns its parameters from the training set, the validation set guides decisions about the model (such as choosing among hyperparameters), and the testing set is held out for the final evaluation.

I have downloaded the dataset from this [link](https://www.kaggle.com/datasets/ilhamfp31/yelp-review-dataset).

I am storing it in the `data` folder under the name `yelp_review`.

## 1. Data Preprocessing

In [1]:
# Import Statements
import pandas as pd
import numpy as np
import re
import collections

In [2]:
np.random.seed(42)

In [3]:
data_base_dir = "data/yelp_review/"
train_dataset = pd.read_csv(data_base_dir+"train.csv", header = None)
test_dataset = pd.read_csv(data_base_dir+"test.csv", header = None)
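
As a side note, the column names can also be assigned while loading, using the `names` parameter of `pd.read_csv`. The sketch below uses two made-up rows in an in-memory buffer instead of the real `train.csv`, so `sample_csv` and `sample_df` are illustrative names only.

```python
import io

import pandas as pd

# Two made-up rows in the same headerless "rating,review" layout as train.csv.
sample_csv = io.StringIO(
    '1,"Terrible service, never again."\n'
    '2,"Great food and friendly staff."\n'
)

# header=None says the first row is data, not a header;
# names= supplies the column labels at load time.
sample_df = pd.read_csv(sample_csv, header=None, names=["Rating", "Review"])
print(sample_df.columns.tolist())
```

This is equivalent to loading with `header=None` and setting `.columns` afterwards.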

Next, we check the class distribution. An imbalanced dataset would bias the model toward the majority class.

In [4]:
train_dataset.columns = ["Rating", "Review"]
test_dataset.columns = ["Rating", "Review"]

In [5]:
train_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 560000 entries, 0 to 559999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   Rating  560000 non-null  int64 
 1   Review  560000 non-null  object
dtypes: int64(1), object(1)
memory usage: 8.5+ MB


There are 560,000 reviews in the training set. Let us see the distribution of negative (`Rating` 1) and positive (`Rating` 2) reviews.

In [6]:
train_dataset["Rating"].value_counts()

1    280000
2    280000
Name: Rating, dtype: int64

The two classes are perfectly balanced. We will take a subset of this dataset, about `10%`, with roughly the same distribution. Before we do that, let us check the distribution of `test_dataset`.

In [7]:
test_dataset["Rating"].value_counts()

2    19000
1    19000
Name: Rating, dtype: int64

`test_dataset` is balanced as well. We want to create our own three sets (train, val, test), so we start by concatenating the two DataFrames.

In [8]:
main_dataset = pd.concat([train_dataset, test_dataset], ignore_index= True)
main_dataset.head()

|   | Rating | Review |
|---|--------|--------|
| 0 | 1 | Unfortunately, the frustration of being Dr. Go... |
| 1 | 2 | Been going to Dr. Goldberg for over 10 years. ... |
| 2 | 1 | I don't know what Dr. Goldberg was like before... |
| 3 | 1 | I'm writing this review to give you a heads up... |
| 4 | 2 | All the food is great here. But the best thing... |


In [9]:
main_dataset["Rating"].value_counts()

1    299000
2    299000
Name: Rating, dtype: int64

I have combined the training and testing datasets. We will now shuffle the combined data, keep 10% of it, and create three new subsets from that sample.

`Train - 70%, Val - 15%, Test - 15%`

In [10]:
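# Shuffle all rows; with no random_state given, pandas draws from NumPy's global RNG, which was seeded above
main_dataset = main_dataset.sample(frac=1).reset_index(drop=True)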

In [11]:
main_dataset.head()

|   | Rating | Review |
|---|--------|--------|
| 0 | 2 | I'm so glad my friends told me to go here!! Wo... |
| 1 | 1 | I was really looking forward to trying this pl... |
| 2 | 1 | I didnt know \"sh*tty wok\" from south park ex... |
| 3 | 2 | I was looking for a pizza delivery place w/ mo... |
| 4 | 2 | Solid breakfast food. |


In [12]:
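# Keep the first 10% of the shuffled rows; .copy() avoids a SettingWithCopyWarning when adding columns later
main_dataset = main_dataset[:int(0.1*len(main_dataset))].copy()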

In [13]:
# The rows are already shuffled, so we assign splits by position:
# the first 70% become train, the next 15% test, and the remainder val.
dataset_size = len(main_dataset)
train_rec = int(0.7 * dataset_size)
test_rec = int(0.15 * dataset_size)
val_rec = dataset_size - train_rec - test_rec  # remainder, so the counts sum to dataset_size
split = ["train"] * train_rec + ["test"] * test_rec + ["val"] * val_rec

print(len(split))

59800


In [14]:
main_dataset["split"] = split

In [15]:
main_dataset["split"].value_counts()

train    41860
test      8970
val       8970
Name: split, dtype: int64
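
With the `split` column in place, each subset can be pulled out by filtering on it. Here is a minimal sketch on a toy DataFrame. The helper name `get_split` and the toy rows are my own and not from the book.

```python
import pandas as pd


def get_split(df, split_name):
    """Return the rows of df whose "split" column equals split_name."""
    return df[df["split"] == split_name].reset_index(drop=True)


# Toy stand-in for main_dataset, with made-up reviews.
toy = pd.DataFrame(
    {
        "Rating": [1, 2, 1, 2],
        "Review": ["bad", "good", "awful", "great"],
        "split": ["train", "train", "test", "val"],
    }
)

train_df = get_split(toy, "train")
val_df = get_split(toy, "val")
print(len(train_df), len(val_df))
```

Resetting the index gives every subset its own 0-based row numbers, which is convenient if rows are later accessed by position.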

In [16]:
main_dataset["Rating"].value_counts()

1    29990
2    29810
Name: Rating, dtype: int64
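
The 10% subset is close to balanced (29,990 vs 29,810) but not exactly, because it is simply the first 10% of the shuffled rows. If an exact class balance matters, one option is to sample within each class instead. The sketch below shows the idea on a toy DataFrame. It assumes pandas 1.1 or newer for `groupby(...).sample`, and `toy` and `subset` are illustrative names.

```python
import pandas as pd

# Toy stand-in for main_dataset: 50 reviews per class.
toy = pd.DataFrame(
    {
        "Rating": [1] * 50 + [2] * 50,
        "Review": [f"review {i}" for i in range(100)],
    }
)

# Draw 10% from each Rating group, so both classes contribute equally.
subset = toy.groupby("Rating").sample(frac=0.1, random_state=42)

# Shuffle again: groupby returns the rows grouped by class, and the
# train/test/val labels in this notebook are assigned by row position.
subset = subset.sample(frac=1, random_state=42).reset_index(drop=True)

print(subset["Rating"].value_counts())
```

The second shuffle matters here. Without it, the positional 70/15/15 split would put almost all of one class into the training set.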

In [17]:
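def cleaning_dataset(review):
    # Lowercase so that "Good" and "good" become the same token
    review = review.lower()
    # Put a space before punctuation so it can be split off as its own token
    review = re.sub(r'([.,?!])', r' \1', review)
    # Replace every character that is not a letter or one of . , ! ? with a space
    review = re.sub(r'([^a-zA-Z.,!?])', r' ', review)
    return review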

In [19]:
main_dataset["Review"] = main_dataset["Review"].apply(cleaning_dataset)

In [21]:
main_dataset.head()

|   | Rating | Review | split |
|---|--------|--------|-------|
| 0 | 2 | i m so glad my friends told me to go here ! ! ... | train |
| 1 | 1 | i was really looking forward to trying this pl... | train |
| 2 | 1 | i didnt know sh tty wok from south park ex... | train |
| 3 | 2 | i was looking for a pizza delivery place w mo... | train |
| 4 | 2 | solid breakfast food . | train |
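
One detail about `cleaning_dataset`: it replaces each disallowed character with its own space, so a run of such characters (for example the `/ ` in `w/ mo`) leaves consecutive spaces in the text. That is harmless if the reviews are later tokenized with `str.split()`. If single spaces are preferred, a possible variant collapses each run into one space. The name `clean_review` is my own, and this is only a sketch of one way to do it.

```python
import re


def clean_review(review):
    """Lowercase, pad punctuation with spaces, collapse everything else."""
    review = review.lower()
    # Surround . , ! ? with spaces so each mark becomes its own token.
    review = re.sub(r"([.,!?])", r" \1 ", review)
    # Replace each run of other non-letter characters with a single space.
    review = re.sub(r"[^a-zA-Z.,!?]+", " ", review)
    return review.strip()


print(clean_review("I'm so glad!! Wow."))  # i m so glad ! ! wow .
print(clean_review("pizza w/ more"))       # pizza w more
```

Unlike `cleaning_dataset`, this variant also adds a space after each punctuation mark and strips leading and trailing whitespace.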
