**Initialization**
* I run these three lines at the top of each of my Notebooks. The first two reload imported modules automatically before code is executed, which prevents problems with stale code when reloading and reworking the same Project or Problem. The third line renders visualizations inline within the Notebook.

In [1]:
#@ Initialization:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

**Importing the Libraries and Dependencies**
* I have imported all the Libraries and Dependencies required for this Project in a single cell.

In [2]:
#@ Importing the Libraries and Dependencies:
import numpy as np
import pandas as pd
import re
import collections

from argparse import Namespace
from IPython.display import display 

**Getting the Data**
* I have used Google Colab for this Project, so the process of downloading and reading the Data might be different on other platforms. I have used the [**Yelp Reviews Dataset**](https://www.kaggle.com/yelp-dataset/yelp-dataset) for this Project. In 2015, Yelp held a contest to predict the Rating of a Restaurant given its Reviews. Zhang, Zhao, and LeCun simplified the Dataset by converting the Ratings into Sentiments, viz. Negative Sentiment for 1 and 2 star Ratings and Positive Sentiment for 3 and 4 star Ratings. In the raw files used below, the `rating` column therefore holds only two values: 1 for Negative and 2 for Positive. The Dataset is split into 560,000 Training Samples and 38,000 Testing Samples.
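
* The labelling rule described above can be sketched as a small function. This is only an illustration of the rule, assuming star Ratings from 1 to 4; the name `stars_to_sentiment` is hypothetical and is not used anywhere else in this Notebook.

```python
# Hypothetical helper illustrating the star-to-sentiment rule described above.
def stars_to_sentiment(stars):
  """Map a 1 to 4 star Rating to a Sentiment label; reject anything else."""
  if stars in (1, 2):
    return "negative"
  if stars in (3, 4):
    return "positive"
  raise ValueError(f"rating {stars} is outside the 1 to 4 range covered by the rule")

labels = [stars_to_sentiment(s) for s in (1, 2, 3, 4)]
print(labels)
```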

In [3]:
#@ Getting the Dataset:
args = Namespace(
    raw_train_dataset = "/content/drive/My Drive/Colab Notebooks/Reviews/raw_train.csv",
    raw_test_dataset = "/content/drive/My Drive/Colab Notebooks/Reviews/raw_test.csv",
    proportion_subset_of_train = 0.1,
    train_proportion = 0.7,
    val_proportion = 0.15,
    test_proportion = 0.15,   
    output_munged = "/content/drive/My Drive/Colab Notebooks/Reviews/reviews_with_splits_lite.csv",
    seed = 1337
)

#@ Reading the Raw Dataset:
train_reviews = pd.read_csv(args.raw_train_dataset, header=None, names=["rating", "review"])
train_reviews = train_reviews[~pd.isnull(train_reviews.review)]
test_reviews = pd.read_csv(args.raw_test_dataset, header=None, names=["rating", "review"])
test_reviews = test_reviews[~pd.isnull(test_reviews.review)]

#@ Inspecting the DataFrame:
display(train_reviews.head())
print(" ")
display(test_reviews.head())

Unnamed: 0,rating,review
0,1,"Unfortunately, the frustration of being Dr. Go..."
1,2,Been going to Dr. Goldberg for over 10 years. ...
2,1,I don't know what Dr. Goldberg was like before...
3,1,I'm writing this review to give you a heads up...
4,2,All the food is great here. But the best thing...


 


Unnamed: 0,rating,review
0,1,Ordered a large Mango-Pineapple smoothie. Stay...
1,2,Quite a surprise! \n\nMy wife and I loved thi...
2,1,"First I will say, this is a nice atmosphere an..."
3,2,I was overall pretty impressed by this hotel. ...
4,1,Video link at bottom review. Worst service I h...


**Processing the Dataset**

In [4]:
#@ Creating the Subset of the Reviews Dataset:
by_rating = collections.defaultdict(list)                     # Grouping the Reviews by their Rating.
for _, row in train_reviews.iterrows():
  by_rating[row.rating].append(row.to_dict())

review_subset = []
for _, item_list in sorted(by_rating.items()):
  n_total = len(item_list)
  n_subset = int(args.proportion_subset_of_train * n_total)
  review_subset.extend(item_list[:n_subset])

#@ Creating the DataFrame:
review_subset = pd.DataFrame(review_subset)

#@ Inspecting the DataFrame:
review_subset.head()

Unnamed: 0,rating,review
0,1,"Unfortunately, the frustration of being Dr. Go..."
1,1,I don't know what Dr. Goldberg was like before...
2,1,I'm writing this review to give you a heads up...
3,1,Wing sauce is like water. Pretty much a lot of...
4,1,Owning a driving range inside the city limits ...
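
* Looping over 560,000 rows with `iterrows()` can be slow. A minimal sketch of a vectorized selection that follows the same rule (keep the first 10% of each Rating, in the original order) is shown here on toy Data. The names `toy_reviews`, `proportion` and `subset_fast` are made up for this illustration, and I have not timed it against the loop above.

```python
import pandas as pd

# Toy stand-in for train_reviews (hypothetical data, not taken from the Yelp files).
toy_reviews = pd.DataFrame({
  "rating": [1, 2] * 10,
  "review": [f"review {i}" for i in range(20)],
})
proportion = 0.1

# Rows to keep per Rating: int(proportion * group size), as in the loop above.
group_sizes = toy_reviews.groupby("rating")["rating"].transform("size")
n_keep = (proportion * group_sizes).astype(int)

# cumcount() numbers the rows inside each Rating group in their original order.
position_in_group = toy_reviews.groupby("rating").cumcount()

# Keep the leading rows of each group, then order the groups by Rating.
subset_fast = (
  toy_reviews[position_in_group < n_keep]
  .sort_values("rating", kind="mergesort")   # mergesort is stable, so row order within a Rating is kept
  .reset_index(drop=True)
)
print(subset_fast)
```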


In [5]:
#@ Performing the Basic EDA:
display(train_reviews.rating.value_counts())                   # Rating counts in the full Training Data.
print(" ")
display(review_subset.rating.value_counts())                   # Rating counts in the 10% Subset.
print(" ")
display(set(review_subset.rating))                             # Unique Ratings in the Subset.

2    280000
1    280000
Name: rating, dtype: int64

 


2    28000
1    28000
Name: rating, dtype: int64

 


{1, 2}

**Processing the DataFrame**
* Creating Training, Validation and Testing Splits in the DataFrame.

In [6]:
#@ Splitting the Subset by Rating to create New Training, Validation and Testing Splits:
by_rating = collections.defaultdict(list)
for _, row in review_subset.iterrows():
  by_rating[row.rating].append(row.to_dict())

#@ Creating the Split Data:
final_list = []
np.random.seed(args.seed)
for _, item_list in sorted(by_rating.items()):
  np.random.shuffle(item_list)                                     # Shuffling the Data randomly.
  n_total = len(item_list)
  n_train = int(args.train_proportion * n_total)
  n_val = int(args.val_proportion * n_total)
  n_test = int(args.test_proportion * n_total)
  #@ Giving the Data point a split Attribute:
  for item in item_list[:n_train]:
    item["split"] = "train"
  for item in item_list[n_train:n_train+n_val]:
    item["split"] = "val"
  for item in item_list[n_train+n_val:n_train+n_val+n_test]:
    item["split"] = "test" 
  #@ Adding to the Final List:
  final_list.extend(item_list)

#@ Creating the Final DataFrame:
final_reviews = pd.DataFrame(final_list)

#@ Inspecting the Final Result:
display(final_reviews.head())                                     # Inspecting the DataFrame.
print(" ")
display(final_reviews.split.value_counts())                        # Inspecting the Training, Validation and Testing Data.

Unnamed: 0,rating,review,split
0,1,Terrible place to work for I just heard a stor...,train
1,1,"3 hours, 15 minutes-- total time for an extrem...",train
2,1,My less than stellar review is for service. ...,train
3,1,I'm granting one star because there's no way t...,train
4,1,The food here is mediocre at best. I went afte...,train


 


train    39200
val       8400
test      8400
Name: split, dtype: int64
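
* The three `int()` truncations in the cell above add up to the group size here because 28,000 Reviews per Rating divide evenly into 70%, 15% and 15%. For other sizes some rows could be left without a split. A minimal sketch of one way to avoid that, assuming the Testing split may simply take whatever remains; `split_labels` is a hypothetical helper and not part of the original code.

```python
def split_labels(n_total, train_proportion=0.7, val_proportion=0.15):
  """Return one split name per item for n_total already-shuffled items.

  Hypothetical helper: the test split takes the remainder after train and val,
  so every item gets a label even when the sizes do not divide evenly.
  """
  n_train = int(train_proportion * n_total)
  n_val = int(val_proportion * n_total)
  n_test = n_total - n_train - n_val
  return ["train"] * n_train + ["val"] * n_val + ["test"] * n_test

# Same group size as in this Notebook: 28,000 Reviews per Rating.
labels = split_labels(28000)
counts = {name: labels.count(name) for name in ("train", "val", "test")}
print(counts)

# A size that does not divide evenly still gets one label per item.
odd_labels = split_labels(10)
print(len(odd_labels))
```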

**Cleaning the Data**
* I will clean the Data minimally for all the Splits: the text is lowercased, whitespace is added around the Punctuation symbols `.,!?`, and every other character that is not a letter (digits included) is replaced with a single space.

In [7]:
#@ Cleaning the Data:
def preprocess_text(text):
  text = text.lower()                                      # Converting into Lowercase.
  text = re.sub(r"([.,!?])", r" \1 ", text)                # Adding whitespace around Punctuation.
  text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)             # Replacing all other non-letter symbols with a space.
  return text

#@ Processing the Review Column:
final_reviews["review"] = final_reviews.review.apply(preprocess_text)

#@ Processing the Rating Column:
final_reviews["rating"] = final_reviews.rating.apply({1:"negative", 2:"positive"}.get)

#@ Inspecting the DataFrame:
final_reviews.head(7)

Unnamed: 0,rating,review,split
0,negative,terrible place to work for i just heard a stor...,train
1,negative,"hours , minutes total time for an extremely s...",train
2,negative,my less than stellar review is for service . w...,train
3,negative,i m granting one star because there s no way t...,train
4,negative,the food here is mediocre at best . i went aft...,train
5,negative,n n nwe looked at our entertainment book for ...,train
6,negative,i had an appointment that was made months in a...,train
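
* Row 5 of the output above begins with `n n nwe`. This suggests that the raw Reviews contain literal backslash-n sequences: the cleaning step strips the backslash but keeps the letter `n`. A minimal sketch of one possible fix, assuming it is acceptable to treat those sequences as whitespace before cleaning; `preprocess_text_no_escapes` is a hypothetical variant and was not used to produce the outputs in this Notebook.

```python
import re

def preprocess_text(text):
  # Same cleaning steps as in the cell above.
  text = text.lower()
  text = re.sub(r"([.,!?])", r" \1 ", text)
  text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
  return text

def preprocess_text_no_escapes(text):
  # Hypothetical variant: turn literal backslash-n sequences into spaces first.
  text = text.replace("\\n", " ")
  return preprocess_text(text)

# The two backslashes below produce a literal backslash followed by the letter n.
sample = "Quite a surprise! \\n\\nMy wife and I loved it."
cleaned_original = preprocess_text(sample)
cleaned_variant = preprocess_text_no_escapes(sample)
print(cleaned_original)
print(cleaned_variant)
```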


In [8]:
#@ Saving the Processed Data:
final_reviews.to_csv(args.output_munged, index=False)