# Classifying Yelp Reviews

# Example: Classifying Sentiment of Restaurant Reviews
Dataset:
- Using the Yelp Datasets <br/>

Which pairs reviews with their sentiment labels (positive or negative), from scratch and using some utils of pytorch Dataloader and Dataset

## The Dataset Pytorch Class

Pytorch provides an abstraction for the dataset by providing a `Dataset` class. The `Dataset` class is an abstract iterator. When using PyTorch with a new dataset, you must first subclass (or inherit from) the `Dataset` class and implement these methods:

- `__getitem__()`
- `__len__()`

This creates a conceptual pact that allows various PyTorch utilities to work with our dataset (`Dataloader`).

## The vectorizer Class

It's a class that handles the conversion from review text to a vector of numbers representing the review. Only through some vectorization step can a neural network interact with text data. The overall desing pattern is to implement a dataset class that handles the vectorization logic for one data point. Then, PyTorch's `DataLoader`will create minibatches by sampling and collating from the dataset.

## The Vocabulary Class
The first Stage in going from text to vectorized minibatch is to map each token to a numerical version of itself. The standard methodology is to have a bijection (an inverse function or a mapping that can be reversed) between the tokens and integers. In python, this is simple two dictionaries. We encapsulate this bijection into a `Vocabulary`class, this class not only manage this bijection - allowing the user to add new tokens and have the index autoincrement - but also handles a special token called `UNK`, which stands for "unknown". By using the `UNK` token, we can handle tokens at test time that where never seen in training.

## The pipeline

So the pipeline is to construct a `Vocabulary`class to handles tokens and vocabularies, a `Vectorizer`class to handle the representation of each review, and the `Dataset` class inherited from PyTorch to use the `DataLoader` utilities.

In [1]:
import os
import re
import torch
import pickle

import torch.optim as optim
import torch.nn.functional as F
import torch.nn as nn
import pandas as pd
import numpy as np

from torch.utils.data import DataLoader, Dataset
from sklearn.preprocessing import LabelBinarizer
from sklearn.feature_extraction.text import CountVectorizer
from tqdm.notebook import tqdm as tqdm_notebook
from tqdm import tqdm as tqdm_console
from argparse import Namespace

# Args

In [2]:
def set_seed_everywhere(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)

def handle_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)

In [3]:
args = Namespace(
    # Data and Path information
    frequency_cutoff=25,
    model_state_file='model{0}.pth',
    review_csv='data/yelp/reviews_with_splits_lite.csv',
    # review_csv='data/yelp/reviews_with_splits_full.csv',
    save_dir='model_storage/',
    vectorizer_file='vectorizer.json',
    # No Model hyper parameters
    # Training hyper parameters
    batch_size=128,
    early_stopping_criteria=5,
    learning_rate=0.001,
    num_epochs=100,
    seed=1337,
    # Runtime options
    catch_keyboard_interrupt=True,
    cuda=True,
    expand_filepaths_to_save_dir=True,
    reload_from_files=False,
)

if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)

    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)
    
    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))
    
# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False

print("Using CUDA: {}".format(args.cuda))

args.device = torch.device("cuda" if args.cuda else "cpu")

# Set seed for reproducibility
set_seed_everywhere(args.seed, args.cuda)

# handle dirs
handle_dirs(args.save_dir)

Expanded filepaths: 
	model_storage/vectorizer.json
	model_storage/model{0}.pth
Using CUDA: False


# Load Data

In [4]:
reviews = pd.read_csv(args.review_csv)

In [5]:
reviews.head()

Unnamed: 0,rating,review,split
0,negative,"on a recent visit to las vegas , my friends an...",train
1,positive,"excellent food ! we had the pompedoro , chicke...",train
2,positive,a great little glimpse back into old vegas . t...,train
3,positive,i was in phoenix for a couple of days for a co...,train
4,positive,what a treasure ! i have been doing yoga for y...,train


In [6]:
train_df = reviews[reviews.split == "train"].copy()
val_df = reviews[reviews.split == "val"].copy()
test_df = reviews[reviews.split == "test"].copy()

# Train Vectorizer and Label Binarizer

In [7]:
vectorizer = CountVectorizer(binary=True, min_df=args.frequency_cutoff)
vectorizer.fit(train_df.review.values)

CountVectorizer(binary=True, min_df=25)

In [8]:
le = LabelBinarizer()
le.fit(train_df.rating.values)

LabelBinarizer()

In [9]:
vars(le)

{'neg_label': 0,
 'pos_label': 1,
 'sparse_output': False,
 'y_type_': 'binary',
 'sparse_input_': False,
 'classes_': array(['negative', 'positive'], dtype='<U8')}

In [10]:
list(vars(vectorizer).keys())

['input',
 'encoding',
 'decode_error',
 'strip_accents',
 'preprocessor',
 'tokenizer',
 'analyzer',
 'lowercase',
 'token_pattern',
 'stop_words',
 'max_df',
 'min_df',
 'max_features',
 'ngram_range',
 'vocabulary',
 'binary',
 'dtype',
 'fixed_vocabulary_',
 '_stop_words_id',
 'stop_words_',
 'vocabulary_']

# Dataset

In [15]:
class ReviewDataset(Dataset):
    def __init__(self, reviews_df, vectorizer, label_encoder):
        self.vectorizer = vectorizer
        self.data = reviews_df.review.values
        self.label = label_encoder.transform(reviews_df.rating.values).squeeze()
    
    def __len__(self):
        return len(self.label)
    
    def __getitem__(self, idx):
        return self.vectorizer.transform([self.data[idx]]).toarray(), self.label[idx]

In [16]:
train_dataset = ReviewDataset(train_df, vectorizer, le)
val_dataset = ReviewDataset(val_df, vectorizer, le)
test_dataset = ReviewDataset(test_df, vectorizer, le)

# DataLoader

In [18]:
train_loader = DataLoader(train_dataset, shuffle=True, batch_size=args.batch_size, drop_last=True)
val_loader = DataLoader(val_dataset, shuffle=True, batch_size=args.batch_size, drop_last=True)
test_loader = DataLoader(test_dataset, shuffle=True, batch_size=args.batch_size, drop_last=True)

# Model

In [21]:
class ReviewClassifier(nn.Module):
    """ a simple perceptron based classifier """
    def __init__(self, num_features):
        """
        Args:
            num_features (int): the size of the input feature vector
        """
        super(ReviewClassifier, self).__init__()
        self.fc1 = nn.Linear(in_features=num_features, 
                             out_features=1)

    def forward(self, x_in, apply_sigmoid=False):
        """The forward pass of the classifier
        
        Args:
            x_in (torch.Tensor): an input data tensor. 
                x_in.shape should be (batch, num_features)
            apply_sigmoid (bool): a flag for the sigmoid activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch,)
        """
        y_out = self.fc1(x_in.squeeze()).squeeze()
        if apply_sigmoid:
            y_out = torch.sigmoid(y_out)
        return y_out

# Training Loop

## Utility functions

In [22]:
def make_train_state(args):
    return {'stop_early': False,
            'early_stopping_step': 0,
            'early_stopping_best_val': 1e8,
            'learning_rate': args.learning_rate,
            'epoch_index': 0,
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'test_loss': -1,
            'test_acc': -1,
            'model_filename': args.model_state_file}

def update_train_state(args, model, train_state):
    """Handle the training state updates.

    Components:
     - Early Stopping: Prevent overfitting.
     - Model Checkpoint: Model is saved if the model is better

    :param args: main arguments
    :param model: model to train
    :param train_state: a dictionary representing the training state values
    :returns:
        a new train_state
    """

    # Save one model at least
    if train_state['epoch_index'] == 0:
        torch.save(model.state_dict(), train_state['model_filename'])
        train_state['stop_early'] = False

    # Save model if performance improved
    elif train_state['epoch_index'] >= 1:
        loss_tm1, loss_t = train_state['val_loss'][-2:]

        # If loss worsened
        if loss_t >= train_state['early_stopping_best_val']:
            # Update step
            train_state['early_stopping_step'] += 1
        # Loss decreased
        else:
            # Save the best model
            if loss_t < train_state['early_stopping_best_val']:
                torch.save(model.state_dict(), train_state['model_filename'])

            # Reset early stopping step
            train_state['early_stopping_step'] = 0

        # Stop early ?
        train_state['stop_early'] = \
            train_state['early_stopping_step'] >= args.early_stopping_criteria

    return train_state

def compute_accuracy(y_pred, y_target):
    y_target = y_target.cpu()
    y_pred_indices = (torch.sigmoid(y_pred)>0.5).cpu().long()#.max(dim=1)[1]
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100

## Initialization

In [23]:
classifier = ReviewClassifier(num_features=len(vectorizer.vocabulary_))

In [24]:
classifier

ReviewClassifier(
  (fc1): Linear(in_features=7164, out_features=1, bias=True)
)

In [25]:
data_loaders = {"train": train_loader, "val": val_loader, "test": test_loader}
datasets = {"train": train_dataset, "val": val_dataset, "test": test_dataset}

In [26]:
classifier = classifier.to(args.device)

loss_func = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer=optimizer,
                                                 mode='min', factor=0.5,
                                                 patience=1)

train_state = make_train_state(args)

epoch_bar = tqdm_notebook(desc='training routine', 
                          total=args.num_epochs,
                          position=0)

train_bar = tqdm_notebook(desc='split=train',
                          total=len(data_loaders["train"]), 
                          position=1, 
                          leave=True)

val_bar = tqdm_notebook(desc='split=val',
                        total=len(data_loaders["val"]), 
                        position=1, 
                        leave=True)

try:
    for epoch_index in range(args.num_epochs):
        train_state['epoch_index'] = epoch_index

        # Iterate over training dataset

        running_loss = 0.0
        running_acc = 0.0
        classifier.train()

        for batch_index, batch_dict in enumerate(data_loaders["train"]):
            # the training routine is these 5 steps:

            # --------------------------------------
            # step 1. zero the gradients
            optimizer.zero_grad()

            # step 2. compute the output
            y_pred = classifier(x_in=batch_dict[0].float())

            # step 3. compute the loss
            loss = loss_func(y_pred, batch_dict[1].float())
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # step 4. use loss to produce gradients
            loss.backward()

            # step 5. use optimizer to take gradient step
            optimizer.step()
            # -----------------------------------------
            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict[1])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

            # update bar
            train_bar.set_postfix(loss=running_loss, 
                                  acc=running_acc, 
                                  epoch=epoch_index)
            train_bar.update()

        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)

        # Iterate over val dataset
        running_loss = 0.
        running_acc = 0.
        classifier.eval()
        
        with torch.no_grad():
            for batch_index, batch_dict in enumerate(data_loaders["val"]):

                # compute the output
                y_pred = classifier(x_in=batch_dict[0].float())

                # step 3. compute the loss
                loss = loss_func(y_pred, batch_dict[1].float())
                loss_t = loss.item()
                running_loss += (loss_t - running_loss) / (batch_index + 1)

                # compute the accuracy
                acc_t = compute_accuracy(y_pred, batch_dict[1])
                running_acc += (acc_t - running_acc) / (batch_index + 1)

                val_bar.set_postfix(loss=running_loss, 
                                    acc=running_acc, 
                                    epoch=epoch_index)
                val_bar.update()

        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)

        train_state = update_train_state(args=args, model=classifier,
                                         train_state=train_state)

        scheduler.step(train_state['val_loss'][-1])

        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()

        if train_state['stop_early']:
            break

        train_bar.n = 0
        val_bar.n = 0
except KeyboardInterrupt:
    print("Exiting loop")

HBox(children=(FloatProgress(value=0.0, description='training routine', style=ProgressStyle(description_width=…

HBox(children=(FloatProgress(value=0.0, description='split=train', max=306.0, style=ProgressStyle(description_…

HBox(children=(FloatProgress(value=0.0, description='split=val', max=65.0, style=ProgressStyle(description_wid…

In [27]:
# compute the loss & accuracy on the test set using the best available model

classifier.load_state_dict(torch.load(train_state['model_filename']))
classifier = classifier.to(args.device)


running_loss = 0.
running_acc = 0.
classifier.eval()

with torch.no_grad():
    for batch_index, batch_dict in enumerate(test_loader):
        # compute the output
        y_pred = classifier(x_in=batch_dict[0].float())

        # compute the loss
        loss = loss_func(y_pred, batch_dict[1].float())
        loss_t = loss.item()
        running_loss += (loss_t - running_loss) / (batch_index + 1)

        # compute the accuracy
        acc_t = compute_accuracy(y_pred, batch_dict[1])
        running_acc += (acc_t - running_acc) / (batch_index + 1)

train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc

In [28]:
print("Test loss: {:.3f}".format(train_state['test_loss']))
print("Test Accuracy: {:.2f}".format(train_state['test_acc']))

Test loss: 0.207
Test Accuracy: 92.01


## Inference

In [29]:
def predict_rating(review, classifier, vectorizer, le, decision_threshold=0.5):
    """Predict the rating of a review
    
    Args:
        review (str): the text of the review
        classifier (ReviewClassifier): the trained model
        vectorizer (ReviewVectorizer): the corresponding vectorizer
        decision_threshold (float): The numerical boundary which separates the rating classes
    """
    vector = vectorizer.transform([review]).toarray()
    vectorized_review = torch.tensor(vector)
    result = classifier(vectorized_review.float())
    
    probability_value = torch.sigmoid(result).item()
    index = 1
    if probability_value < decision_threshold:
        index = 0
    index = np.array([index])
    return le.inverse_transform(index)[0]

In [33]:
test_review = "This books is meh"

classifier = classifier.cpu()
prediction = predict_rating(test_review, classifier, vectorizer, le, decision_threshold=0.5)
print("{} -> {}".format(test_review, prediction))

This books is meh -> negative


## Interpretability

In [34]:
classifier.fc1.weight.shape

torch.Size([1, 7164])

In [35]:
lookup_index = dict((idx, vocab) for vocab, idx in vectorizer.vocabulary_.items())

In [36]:
# Sort weights
fc1_weights = classifier.fc1.weight.detach()[0]
_, indices = torch.sort(fc1_weights, dim=0, descending=True)
indices = indices.numpy().tolist()

# Top 20 words
print("Influential words in Positive Reviews:")
print("--------------------------------------")
for i in range(20):
    print(lookup_index[indices[i]])
    
print("====\n\n\n")

# Top 20 negative words
print("Influential words in Negative Reviews:")
print("--------------------------------------")
indices.reverse()
for i in range(20):
    print(lookup_index[indices[i]])

Influential words in Positive Reviews:
--------------------------------------
excellent
amazing
delicious
disappoint
outstanding
perfection
perfect
awesome
yum
incredible
downside
ngreat
fantastic
superb
great
heaven
perfectly
hooked
love
wonderful
====



Influential words in Negative Reviews:
--------------------------------------
worst
horrible
terrible
bland
awful
meh
tasteless
mediocre
poisoning
disgusting
poor
disappointment
eh
overpriced
inedible
lacked
rude
disappointing
sucks
waste
