# Yelp reviews Classification/Sentiment Analysis






### Reference Material:
- [Facebook RoBERTa](https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/)
- [RoBERTa GitHub](https://github.com/pytorch/fairseq/tree/master/examples/roberta)
- [Hugging Face Transformers](https://huggingface.co/transformers/)



In [2]:
# Importing the libraries needed
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import torch
import seaborn as sns
import transformers
import json
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
from transformers import RobertaModel, RobertaTokenizer
import logging
logging.basicConfig(level=logging.ERROR)
import os

In [5]:
# Setting up the device for GPU usage
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'
device

'cuda'

In [4]:
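# Load and prepare the dataset. The CSV has a header row (Star_Rating, Review),
# so the default header handling is used instead of header=None.
train = pd.read_csv('yelp_train.csv')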

In [6]:
#display 
train.head()

Unnamed: 0,Star_Rating,Review
0,1,I got 'new' tires from them and within two wee...
1,1,Don't waste your time. We had two different p...
2,1,All I can say is the worst! We were the only 2...
3,1,I have been to this restaurant twice and was d...
4,1,Food was NOT GOOD at all! My husband & I ate h...


In [7]:
train.shape

(50000, 2)

In [8]:
#Display Classes 
train['Star_Rating'].unique()

array([1, 3, 2, 4, 5])

In [9]:
#dataset Overview 
train.describe()

Unnamed: 0,Star_Rating
count,50000.0
mean,3.0
std,1.414228
min,1.0
25%,2.0
50%,3.0
75%,4.0
max,5.0


In [10]:
new_df = train[['Review', 'Star_Rating']]

`Dataset` and `DataLoader` are PyTorch constructs that define how the data is pre-processed and how it is fed to the neural network.

#### *SentimentData* Dataset Class
- Accepts the dataframe as input and generates the tokenized output that the RoBERTa model uses for training.
- Uses the RoBERTa tokenizer to tokenize the text in the `Review` column of the dataframe.
- The tokenizer's `encode_plus` method performs the tokenization and returns the necessary outputs: `input_ids`, `attention_mask`, and `token_type_ids`, which the class exposes as `ids`, `mask`, and `token_type_ids`.
- The value in the `Star_Rating` column is returned as the target.
- The *SentimentData* class is used to create two datasets, one for training and one for validation.
- *Training Dataset* is used to fine-tune the model: **80% of the original data**.
- *Validation Dataset* is used to evaluate the performance of the model: the remaining **20%**, which the model has not seen during training.
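
To make the `ids` and `attention_mask` outputs concrete, here is a minimal pure-Python sketch of fixed-length padding and truncation. It does not call the Hugging Face tokenizer: `pad_and_mask` is a hypothetical helper, the token ids are made up, and the handling of the `<s>`/`</s>` special tokens during truncation is ignored. The padding id of 1 is taken from the `padding_idx=1` shown in the RoBERTa embedding layer of the model printout.

```python
def pad_and_mask(token_ids, max_len, pad_id=1):
    """Truncate or right-pad token_ids to max_len and build the attention mask.

    pad_id=1 is assumed here because RoBERTa uses index 1 for padding.
    """
    ids = list(token_ids[:max_len])          # truncate long inputs
    mask = [1] * len(ids)                    # 1 = real token
    n_pad = max_len - len(ids)
    ids += [pad_id] * n_pad                  # pad short inputs
    mask += [0] * n_pad                      # 0 = padding, ignored by attention
    return ids, mask


# Made-up token ids for a short review, padded to length 8
ids, mask = pad_and_mask([0, 713, 16, 372, 2], max_len=8)
print(ids)   # [0, 713, 16, 372, 2, 1, 1, 1]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Because every example ends up with the same length, the tensors in a batch can be stacked without any further work.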

#### Dataloader
- `DataLoader` is used to create the training and validation dataloaders, which feed data to the neural network in batches of a defined size.
- The training and validation dataloaders are used in the training and validation parts of the flow, respectively.
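
As a rough illustration of what a dataloader does with `batch_size` and `shuffle`, the sketch below groups dataset indices into batches in plain Python. `iter_batches` is a hypothetical helper written for this example, not part of PyTorch, and it leaves out everything else a real `DataLoader` handles (collation into tensors, worker processes).

```python
import math
import random


def iter_batches(n_items, batch_size, shuffle=True, seed=0):
    """Yield lists of dataset indices, one list per batch."""
    indices = list(range(n_items))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, n_items, batch_size):
        yield indices[start:start + batch_size]


batches = list(iter_batches(n_items=10, batch_size=4))
print([len(b) for b in batches])   # [4, 4, 2]

# Steps per epoch for 40,000 training rows and a batch size of 8
print(math.ceil(40000 / 8))        # 5000
```

The second number matches the `5000it` reported by the progress bar at the end of each training epoch.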

In [11]:
# Defining some key variables that will be used later on in the training
MAX_LEN = 256
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
# EPOCHS = 1
LEARNING_RATE = 1e-05
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', truncation=True, do_lower_case=True)


In [12]:
class SentimentData(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.Review
        self.targets = self.data.Star_Rating
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

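        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding='max_length',   # replaces the deprecated pad_to_max_length=True
            truncation=True,        # cut reviews longer than max_len so every example has the same length
            return_token_type_ids=True
        )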
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }

In [13]:
train_size = 0.8
train_data=new_df.sample(frac=train_size,random_state=200)
test_data=new_df.drop(train_data.index).reset_index(drop=True)
train_data = train_data.reset_index(drop=True)


print("FULL Dataset: {}".format(new_df.shape))
print("TRAIN Dataset: {}".format(train_data.shape))
print("TEST Dataset: {}".format(test_data.shape))

training_set = SentimentData(train_data, tokenizer, MAX_LEN)
testing_set = SentimentData(test_data, tokenizer, MAX_LEN)

FULL Dataset: (50000, 2)
TRAIN Dataset: (40000, 2)
TEST Dataset: (10000, 2)


In [14]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

 - Create a neural network with the `RobertaClass`.
 - This network has the RoBERTa language model followed by a `Linear` pre-classifier layer, a `ReLU`, a `dropout` layer, and finally a `Linear` layer that produces the final outputs.
 - The data is fed to the RoBERTa language model as defined in the dataset.
 - The final layer's outputs are compared to the `Star_Rating` label to determine the accuracy of the model's predictions.
 - The `Loss Function` and `Optimizer` are defined in a later cell.
 - The `Loss Function` calculates the difference between the output produced by the model and the actual label.
 - The `Optimizer` updates the weights of the neural network to improve its performance.
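
The loss used in this notebook is `torch.nn.CrossEntropyLoss`. For a single example it amounts to taking the softmax of the output scores and then the negative log of the probability assigned to the true class. The sketch below reproduces that per-example formula in plain Python; `cross_entropy` is a hypothetical helper and the scores are made up for illustration.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example: -log(softmax(logits)[target])."""
    m = max(logits)                                   # subtract max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


logits = [0.0, 0.2, 0.1, 0.3, 2.5, 0.4]   # made-up scores for 6 outputs
print(cross_entropy(logits, target=4))    # small: the highest score is the target
print(cross_entropy(logits, target=1))    # larger: the target has a low score
```

By default the PyTorch loss averages this quantity over the examples in a batch, which is the value that `loss.item()` returns in the training loop.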

In [15]:
# Create NN with RobertaClass
class RobertaClass(torch.nn.Module):
    def __init__(self):
        super(RobertaClass, self).__init__()
        self.l1 = RobertaModel.from_pretrained("roberta-base")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        # 6 outputs: star ratings 1-5 are used directly as class indices, so index 0 is never a target
        self.classifier = torch.nn.Linear(768, 6)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

In [17]:
#Model Instance 
model = RobertaClass()
model.to(device)

RobertaClass(
  (l1): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, eleme

- The dataloader passes data to the model in batches of the configured batch size.
- The loss value is used to optimize the weights of the network.
- Every 1000 steps, the running loss and accuracy are printed to the console (the log label says "per 5000 steps", but the code checks `_%1000==0`).
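
The values printed during training are cumulative averages since the start of the epoch, not averages over the most recent steps only. The sketch below mirrors that bookkeeping with made-up per-batch numbers; `running_metrics` is a hypothetical helper, not a function from the notebook.

```python
def running_metrics(batch_losses, batch_correct, batch_sizes):
    """Return (mean loss per step, accuracy in percent) after each step."""
    tr_loss, n_correct, n_examples = 0.0, 0, 0
    history = []
    for step, (loss, correct, size) in enumerate(
            zip(batch_losses, batch_correct, batch_sizes), start=1):
        tr_loss += loss
        n_correct += correct
        n_examples += size
        history.append((tr_loss / step, n_correct * 100 / n_examples))
    return history


# Made-up values for three batches of 8 examples
history = running_metrics([0.5, 0.75, 1.0], [6, 4, 5], [8, 8, 8])
print(history[0])   # (0.5, 75.0)
print(history[-1])  # (0.75, 62.5)
```

This is why the first logged accuracy of an epoch is a multiple of 12.5: it is computed from a single batch of 8 examples.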

In [18]:
# Creating the loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

In [19]:
def calcuate_accuracy(preds, targets):
    n_correct = (preds==targets).sum().item()
    return n_correct

In [22]:
# Defining the training function on the 80% of the dataset for fine-tuning the RoBERTa model
def train(epoch):
    tr_loss = 0
    n_correct = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    model.train()
    for _,data in tqdm(enumerate(training_loader, 0)):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        outputs = model(ids, mask, token_type_ids)
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accuracy(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)
        
        if _%1000==0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples 
            print(f"Training Loss per 5000 steps: {loss_step}")
            print(f"Training Accuracy per 5000 steps: {accu_step}")

        optimizer.zero_grad()
        loss.backward()
        # # When using GPU
        optimizer.step()

    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Training Loss Epoch: {epoch_loss}")
    print(f"Training Accuracy Epoch: {epoch_accu}")

    return 

In [24]:
EPOCHS = 5
for epoch in range(EPOCHS):
    train(epoch)

1it [00:00,  3.88it/s]

Training Loss per 5000 steps: 0.5478341579437256
Training Accuracy per 5000 steps: 75.0


1001it [03:54,  4.27it/s]

Training Loss per 5000 steps: 0.6268075251436377
Training Accuracy per 5000 steps: 72.9020979020979


2001it [07:47,  4.27it/s]

Training Loss per 5000 steps: 0.6317974003567093
Training Accuracy per 5000 steps: 72.73863068465766


3001it [11:42,  4.24it/s]

Training Loss per 5000 steps: 0.6391525830935733
Training Accuracy per 5000 steps: 72.47584138620459


4001it [15:36,  4.25it/s]

Training Loss per 5000 steps: 0.6391693873238308
Training Accuracy per 5000 steps: 72.53811547113222


5000it [19:29,  4.27it/s]
0it [00:00, ?it/s]

The Total Accuracy for Epoch 0: 72.48
Training Loss Epoch: 0.643178297418356
Training Accuracy Epoch: 72.48
Training Loss per 5000 steps: 1.0259482860565186
Training Accuracy per 5000 steps: 62.5


1001it [03:53,  4.27it/s]

Training Loss per 5000 steps: 0.5315319439703292
Training Accuracy per 5000 steps: 78.30919080919081


2001it [07:45,  4.27it/s]

Training Loss per 5000 steps: 0.5322275059386112
Training Accuracy per 5000 steps: 77.89855072463769


3001it [11:37,  4.31it/s]

Training Loss per 5000 steps: 0.5376669921499815
Training Accuracy per 5000 steps: 77.59496834388537


4001it [15:29,  4.33it/s]

Training Loss per 5000 steps: 0.5398781116948489
Training Accuracy per 5000 steps: 77.48687828042989


5000it [19:21,  4.30it/s]
0it [00:00, ?it/s]

The Total Accuracy for Epoch 1: 77.31
Training Loss Epoch: 0.5433348312035203
Training Accuracy Epoch: 77.31
Training Loss per 5000 steps: 0.3759306073188782
Training Accuracy per 5000 steps: 87.5


1001it [03:52,  4.29it/s]

Training Loss per 5000 steps: 0.4324629028903676
Training Accuracy per 5000 steps: 82.74225774225775


2001it [07:45,  4.35it/s]

Training Loss per 5000 steps: 0.4360211518281761
Training Accuracy per 5000 steps: 82.24012993503248


3001it [11:36,  4.39it/s]

Training Loss per 5000 steps: 0.44249465640402164
Training Accuracy per 5000 steps: 81.95184938353881


4001it [15:27,  4.32it/s]

Training Loss per 5000 steps: 0.4465283959828386
Training Accuracy per 5000 steps: 81.7701824543864


5000it [19:18,  4.32it/s]
0it [00:00, ?it/s]

The Total Accuracy for Epoch 2: 81.6375
Training Loss Epoch: 0.4492165217794478
Training Accuracy Epoch: 81.6375
Training Loss per 5000 steps: 0.29504650831222534
Training Accuracy per 5000 steps: 87.5


1001it [03:52,  4.31it/s]

Training Loss per 5000 steps: 0.3418266547436302
Training Accuracy per 5000 steps: 86.83816183816184


2001it [07:43,  4.33it/s]

Training Loss per 5000 steps: 0.35065433554079517
Training Accuracy per 5000 steps: 86.3255872063968


3001it [11:36,  4.31it/s]

Training Loss per 5000 steps: 0.35515509799659073
Training Accuracy per 5000 steps: 86.04631789403533


4001it [15:28,  4.32it/s]

Training Loss per 5000 steps: 0.3616630637758011
Training Accuracy per 5000 steps: 85.88477880529868


5000it [19:19,  4.31it/s]
0it [00:00, ?it/s]

The Total Accuracy for Epoch 3: 85.7
Training Loss Epoch: 0.3648388702150434
Training Accuracy Epoch: 85.7
Training Loss per 5000 steps: 0.12773264944553375
Training Accuracy per 5000 steps: 100.0


1001it [03:51,  4.35it/s]

Training Loss per 5000 steps: 0.2887675750602435
Training Accuracy per 5000 steps: 88.5989010989011


2001it [07:41,  4.33it/s]

Training Loss per 5000 steps: 0.29034208481491863
Training Accuracy per 5000 steps: 88.81809095452275


3001it [11:31,  4.37it/s]

Training Loss per 5000 steps: 0.29241657504760366
Training Accuracy per 5000 steps: 88.63295568143953


4001it [15:23,  4.32it/s]

Training Loss per 5000 steps: 0.2935765413271803
Training Accuracy per 5000 steps: 88.62159460134966


5000it [19:13,  4.33it/s]

The Total Accuracy for Epoch 4: 88.4775
Training Loss Epoch: 0.296093401164189
Training Accuracy Epoch: 88.4775





### Validating the Model

During the validation stage we pass the unseen data (the testing dataset) to the model. This step determines how well the model performs on unseen data. This unseen data is the 20% of `yelp_train.csv` that was separated during the dataset creation stage.
During the validation stage the weights of the model are not updated. Only the final output is compared to the actual value, and this comparison is used to calculate the accuracy of the model.
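
The two mechanisms that keep validation from changing the model are `model.eval()` and `torch.no_grad()`. The sketch below shows their effect on a tiny stand-in network; `toy_model` is an assumption made for illustration and is not the RoBERTa classifier defined earlier.

```python
import torch

# Tiny stand-in model with dropout; not the RoBERTa classifier
torch.manual_seed(0)
toy_model = torch.nn.Sequential(
    torch.nn.Linear(4, 4),
    torch.nn.Dropout(0.3),
    torch.nn.Linear(4, 3),
)
x = torch.randn(2, 4)

toy_model.eval()                      # disables dropout
with torch.no_grad():                 # no graph is built, so no gradients
    out_a = toy_model(x)
    out_b = toy_model(x)

print(out_a.requires_grad)            # False
print(torch.equal(out_a, out_b))      # True: with dropout off, repeated passes agree
```

Since no gradients are recorded and the optimizer is never stepped, the weights stay exactly as they were after training.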


In [None]:
def valid(model, testing_loader):
    model.eval()
    n_correct = 0
    tr_loss = 0
    nb_tr_steps = 0
    nb_tr_examples = 0
    with torch.no_grad():
        for _, data in tqdm(enumerate(testing_loader, 0)):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
            # Keep the (batch_size, num_outputs) shape so torch.max(..., dim=1) also works for a batch of one
            outputs = model(ids, mask, token_type_ids)
            loss = loss_function(outputs, targets)
            tr_loss += loss.item()
            big_val, big_idx = torch.max(outputs.data, dim=1)
            n_correct += calcuate_accuracy(big_idx, targets)

            nb_tr_steps += 1
            nb_tr_examples+=targets.size(0)
            
            # Running metrics are logged every 5000 steps (the printed label says "per 100 steps")
            if _%5000==0:
                loss_step = tr_loss/nb_tr_steps
                accu_step = (n_correct*100)/nb_tr_examples
                print(f"Validation Loss per 100 steps: {loss_step}")
                print(f"Validation Accuracy per 100 steps: {accu_step}")
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    print(f"Validation Loss Epoch: {epoch_loss}")
    print(f"Validation Accuracy Epoch: {epoch_accu}")
    
    return epoch_accu


In [None]:
acc = valid(model, testing_loader)
print("Accuracy on test data = %0.2f%%" % acc)

3it [00:00, 23.17it/s]

Validation Loss per 100 steps: 0.6547155380249023
Validation Accuracy per 100 steps: 75.0


5004it [03:13, 25.87it/s]

Validation Loss per 100 steps: 0.736690901120772
Validation Accuracy per 100 steps: 69.22115576884623


7803it [05:01, 25.89it/s]

Validation Loss Epoch: 0.7332612214877096
Validation Accuracy Epoch: 69.46687171600666
Accuracy on test data = 69.47%





<a id='section07'></a>
### Saving the Trained Model Artifacts for inference

This is the final step in fine-tuning the model.

The model and its vocabulary are saved locally. These files can then be used to run inference on new reviews.

In [None]:
#Save local copy of model 
output_model_file = 'pytorch_roberta_sentiment.bin'
output_vocab_file = './'

model_to_save = model
torch.save(model_to_save, output_model_file)
tokenizer.save_vocabulary(output_vocab_file)

print('All files saved')
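
The cell above pickles the whole model object. A commonly recommended alternative in PyTorch is to save only the weights (the `state_dict`) and load them into a freshly constructed model of the same architecture. The sketch below shows that round trip on a small stand-in layer; `toy_model` and the file name `toy_weights.bin` are assumptions for illustration only.

```python
import os
import tempfile

import torch

toy_model = torch.nn.Linear(4, 3)     # stand-in for the fine-tuned model

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_weights.bin")   # hypothetical file name
    torch.save(toy_model.state_dict(), path)           # weights only

    restored = torch.nn.Linear(4, 3)                   # same architecture
    restored.load_state_dict(torch.load(path))
    restored.eval()

x = torch.ones(1, 4)
with torch.no_grad():
    same = torch.equal(toy_model(x), restored(x))
print(same)
```

For the model in this notebook, the same pattern would mean constructing `RobertaClass()` first and then calling `load_state_dict` on it.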