# Sentiment Analysis with ParsBERT

## The NVIDIA System Management Interface (nvidia-smi) is a command-line utility, built on top of the NVIDIA Management Library (NVML), intended to aid in the management and monitoring of NVIDIA GPU devices.

In [None]:
!nvidia-smi

## Install & import Libraries

In [3]:
# Import required packages (add any other packages you need here)

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

import plotly.graph_objects as go

from tqdm.notebook import tqdm

import os
import re
import collections

## Dataset

### Load the data

In [None]:
!git clone https://github.com/SBU-CE/Deep-Learning.git
data = pd.read_csv('/content/Deep-Learning/spring-2022/assignments/project-3/taghche_5000.csv', encoding='utf-8')
data = data[['comment', 'rate']]
data.head()

In [None]:
# Some rows have an out-of-range rate (6 or more) or a missing value.
# A more careful repair is possible; for simplicity,
# these rows are simply dropped here.
data['rate'] = data['rate'].apply(lambda r: r if r < 6 else None)

data = data.dropna(subset=['rate'])
data = data.dropna(subset=['comment'])
data = data.drop_duplicates(subset=['comment'], keep='first')
data = data.reset_index(drop=True)

### Normalization / Preprocessing

**<font color=red> For simplicity, transform the rate, which ranges from 0.0 to 5.0, into a binary label: negative (0) or positive (1). A rate below 3.0 is labeled negative; otherwise it is labeled positive. Store the result in a `label` column.</font>**
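
As a minimal sketch of the thresholding rule above, assuming the rates are plain numbers: `rate_to_label` is a hypothetical helper name, not part of the assignment template.

```python
def rate_to_label(rate, threshold=3.0):
    """Map a numeric rate to 0 (negative) or 1 (positive)."""
    # rates strictly below the threshold are negative, everything else is positive
    return 0 if rate < threshold else 1


print([rate_to_label(r) for r in (0.0, 2.5, 3.0, 5.0)])
```

On the DataFrame, the same function could be applied with `data['rate'].apply(rate_to_label)`.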

In [None]:
##############################################################################################
#                                       Your Code                                            #
##############################################################################################

**<font color=red> Cleaning is the final step in this section. Your cleaning method should include these steps:</font>**

**<font color=red>- fixing unicodes</font>**

**<font color=red>- removing special items such as phone numbers, emails, URLs, new lines, ...</font>**

**<font color=red>- cleaning HTMLs</font>**

**<font color=red>- normalizing</font>**

**<font color=red>- removing emojis</font>**

**<font color=red>- removing extra spaces, hashtags</font>**
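
The steps above can be sketched with the standard library alone. This is an illustration under assumptions, not a reference solution: `clean_sketch` and the regular expressions are hypothetical, the patterns are deliberately rough, and Persian-specific normalization (for example with a library such as hazm) is left out.

```python
import html
import re
import unicodedata

URL_RE = re.compile(r'https?://\S+|www\.\S+')
EMAIL_RE = re.compile(r'\S+@\S+\.\S+')
PHONE_RE = re.compile(r'\+?\d[\d\-\s]{7,}\d')   # rough pattern, an assumption
TAG_RE = re.compile(r'<[^>]+>')
EMOJI_RE = re.compile(
    '['
    '\U0001F300-\U0001FAFF'   # pictographs, emoticons, transport, extended symbols
    '\u2600-\u27BF'           # miscellaneous symbols and dingbats
    '\uFE0F'                  # variation selector that follows some emojis
    ']+'
)


def clean_sketch(text):
    text = unicodedata.normalize('NFKC', text)   # fix unicode variants
    text = html.unescape(text)                   # decode entities such as &amp;
    text = TAG_RE.sub(' ', text)                 # drop HTML tags
    text = URL_RE.sub(' ', text)                 # drop URLs
    text = EMAIL_RE.sub(' ', text)               # drop emails
    text = PHONE_RE.sub(' ', text)               # drop phone-like digit runs
    text = EMOJI_RE.sub(' ', text)               # drop emojis
    text = text.replace('#', ' ')                # keep hashtag words, drop the sign
    text = re.sub(r'\s+', ' ', text)             # collapse new lines and extra spaces
    return text.strip()


print(clean_sketch('<b>Great   book</b> &amp; more\nhttps://example.com #fun'))
```

The order matters: entities are decoded before tags are stripped, and whitespace is collapsed last so that every removal leaves at most one space behind.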

In [None]:
def cleaning(text):
    text = text.strip()

    ##############################################################################################
    #                                       Your Code                                            #
    ##############################################################################################
    
    
    return text

In [None]:
# cleaning comments
data['cleaned_comment'] = data['comment'].apply(cleaning)

**<font color=red> Calculate the length of each comment in words</font>**

In [None]:
##############################################################################################
#                                       Your Code                                            #
##############################################################################################

**<font color=red> Remove comments with fewer than 3 words or more than 256 words</font>**
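
A minimal sketch of the word-length filter, assuming words are separated by whitespace. `word_count` and `keep_comment` are hypothetical helper names.

```python
def word_count(text):
    """Number of whitespace-separated words in a comment."""
    return len(text.split())


def keep_comment(text, min_words=3, max_words=256):
    """True when the comment length lies in [min_words, max_words]."""
    return min_words <= word_count(text) <= max_words


comments = ['ok', 'a fairly short comment', 'too short']
print([c for c in comments if keep_comment(c)])
```

On the DataFrame, the counts could be stored with `data['cleaned_comment'].apply(word_count)` and the rows filtered with a boolean mask built from `keep_comment`.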

In [None]:
##############################################################################################
#                                       Your Code                                            #
##############################################################################################

In [None]:
data = data[['cleaned_comment', 'label']]
data.columns = ['comment', 'label']
data.head()

### Handling Unbalanced Data

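**<font color=red> Because the data is unbalanced, you should balance it. Plot a bar chart of the label distribution over the comments both before and after balancing. Store the balanced DataFrame in `new_data` and the list of distinct label values in `labels`; the split cell below reads both.</font>**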
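
One common way to balance is random undersampling: every class is cut down to the size of the smallest one. The sketch below is written against plain Python lists so that it runs without the dataset; `undersample` and `label_of` are hypothetical names, and undersampling is only one option (oversampling is another).

```python
import collections
import random


def undersample(rows, label_of, seed=1):
    """Randomly keep the same number of rows for every label."""
    groups = collections.defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)

    # size of the smallest class decides how many rows each class keeps
    n = min(len(group) for group in groups.values())

    rng = random.Random(seed)
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced


rows = [('good', 1)] * 6 + [('bad', 0)] * 2
balanced = undersample(rows, label_of=lambda row: row[1])
print(collections.Counter(label for _, label in balanced))
```

With pandas, the same idea can be expressed by grouping on `label` and sampling the minority-class count from each group.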

In [None]:
##############################################################################################
#                                       Your Code                                            #
##############################################################################################

## Train, Validation, Test Split

To obtain a model that generalizes well, and given the size of the data, we split the cleaned dataset into train, valid, and test sets. In this tutorial, I use a ratio of **0.1** for both the *valid* and *test* sets. For splitting, I use `train_test_split` from scikit-learn, stratifying on the label to preserve the class balance in each set.
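
Note that the second split is taken from what remains after the first one, so the validation set ends up as 0.1 of 0.9 of the data. The short calculation below makes the resulting shares explicit; `split_fractions` is a hypothetical helper used only for this arithmetic.

```python
def split_fractions(test_size=0.1, valid_size=0.1):
    """Share of the full dataset that ends up in train, valid, and test."""
    test = test_size
    valid = (1 - test_size) * valid_size   # valid is carved out of the remainder
    train = 1 - test - valid
    return train, valid, test


train_share, valid_share, test_share = split_fractions()
print(round(train_share, 2), round(valid_share, 2), round(test_share, 2))
```

So roughly 81% of the rows go to training, 9% to validation, and 10% to testing.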

In [None]:
new_data['label_id'] = new_data['label'].apply(lambda t: labels.index(t))

train, test = train_test_split(new_data, test_size=0.1, random_state=1, stratify=new_data['label'])
train, valid = train_test_split(train, test_size=0.1, random_state=1, stratify=train['label'])

train = train.reset_index(drop=True)
valid = valid.reset_index(drop=True)
test = test.reset_index(drop=True)

x_train, y_train = train['comment'].values.tolist(), train['label_id'].values.tolist()
x_valid, y_valid = valid['comment'].values.tolist(), valid['label_id'].values.tolist()
x_test, y_test = test['comment'].values.tolist(), test['label_id'].values.tolist()

print(train.shape)
print(valid.shape)
print(test.shape)

![BERT INPUTS](https://res.cloudinary.com/m3hrdadfi/image/upload/v1595158991/kaggle/bert_inputs_w8rith.png)

As you may know, the BERT model input is the sum of three embeddings:
- Token embeddings: WordPiece token vocabulary (WordPiece is another word segmentation algorithm, similar to BPE)
- Segment embeddings: for sentence pairs [A-B], each token is marked $E_A$ or $E_B$ to indicate whether it belongs to the first sentence or the second one.
- Position embeddings: specify the position of each token in the sequence
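
To make the three inputs concrete, here is a small sketch that builds the token sequence, the segment ids, and the position ids for already-tokenized text. It assumes BERT's usual `[CLS]` and `[SEP]` special tokens; `build_bert_inputs` is a hypothetical helper, and a real tokenizer additionally maps tokens to vocabulary ids and pads to a fixed length.

```python
def build_bert_inputs(tokens_a, tokens_b=None):
    """Return (tokens, segment_ids, position_ids) for one or two sentences."""
    # sentence A, wrapped in [CLS] ... [SEP], belongs to segment 0
    tokens = ['[CLS]'] + list(tokens_a) + ['[SEP]']
    segment_ids = [0] * len(tokens)

    # sentence B and its closing [SEP] belong to segment 1
    if tokens_b:
        tokens += list(tokens_b) + ['[SEP]']
        segment_ids += [1] * (len(tokens_b) + 1)

    # positions simply count tokens from the start of the sequence
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_bert_inputs(['my', 'dog'], ['he', 'barks'])
print(tokens)
print(segment_ids)
print(position_ids)
```

For single-comment classification, as in this notebook, there is no sentence B, so every segment id is 0.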

## PyTorch

In [None]:
# Import required packages (add any other packages you need here)

import torch
import torch.nn as nn
import torch.nn.functional as F

### Configuration

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f'device: {device}')

train_on_gpu = torch.cuda.is_available()

if not train_on_gpu:
    print('CUDA is not available.  Training on CPU ...')
else:
    print('CUDA is available!  Training on GPU ...')

In [None]:
# general config
MAX_LEN = 128
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 16
TEST_BATCH_SIZE = 16

EPOCHS = 10
EVERY_EPOCH = 1000
LEARNING_RATE = 2e-5
CLIP = 0.0

OUTPUT_PATH = '/content/bert-fa-base-uncased-sentiment-taaghceh/pytorch_model.bin'

os.makedirs(os.path.dirname(OUTPUT_PATH), exist_ok=True)

In [None]:
# build lookup tables: label -> id and id -> label

label2id = {label: i for i, label in enumerate(labels)}
id2label = {v: k for k, v in label2id.items()}

print(f'label2id: {label2id}')
print(f'id2label: {id2label}')

**<font color=red> Setup the Tokenizer and Configuration</font>**

In [None]:
##############################################################################################
#                                       Your Code                                            #
##############################################################################################

### Input Embeddings

### Dataset

In [None]:
class TaaghcheDataset(torch.utils.data.Dataset):
    """ Create a PyTorch dataset for Taaghche. """

    def __init__(self, tokenizer, comments, targets=None, label_list=None, max_len=128):
        self.comments = comments
        self.targets = targets
        self.has_target = isinstance(targets, list) or isinstance(targets, np.ndarray)

        self.tokenizer = tokenizer
        self.max_len = max_len

        
        self.label_map = {label: i for i, label in enumerate(label_list)} if isinstance(label_list, list) else {}
    
    def __len__(self):
        return len(self.comments)

    def __getitem__(self, item):
        comment = str(self.comments[item])

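        if self.has_target:
            # map string labels through label_map; labels that are already numeric ids pass through unchanged
            raw_target = self.targets[item]
            target = self.label_map.get(str(raw_target), raw_target)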

        encoding = self.tokenizer.encode_plus(
            comment,
            add_special_tokens=True,
            truncation=True,
            max_length=self.max_len,
            return_token_type_ids=True,
            padding='max_length',
            return_attention_mask=True,
            return_tensors='pt')
        
        inputs = {
            'comment': comment,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'token_type_ids': encoding['token_type_ids'].flatten(),
        }

        if self.has_target:
            inputs['targets'] = torch.tensor(target, dtype=torch.long)
        
        return inputs


def create_data_loader(x, y, tokenizer, max_len, batch_size, label_list):
    dataset = TaaghcheDataset(
        comments=x,
        targets=y,
        tokenizer=tokenizer,
        max_len=max_len, 
        label_list=label_list)
    
    return torch.utils.data.DataLoader(dataset, batch_size=batch_size)

In [None]:
label_list = ['negative', 'positive']
train_data_loader = create_data_loader(train['comment'].to_numpy(), train['label'].to_numpy(), tokenizer, MAX_LEN, TRAIN_BATCH_SIZE, label_list)
valid_data_loader = create_data_loader(valid['comment'].to_numpy(), valid['label'].to_numpy(), tokenizer, MAX_LEN, VALID_BATCH_SIZE, label_list)
test_data_loader = create_data_loader(test['comment'].to_numpy(), None, tokenizer, MAX_LEN, TEST_BATCH_SIZE, label_list)

### Model

**<font color=red> Complete forward function</font>**

In [None]:
class SentimentModel(nn.Module):

    def __init__(self, config):
        super(SentimentModel, self).__init__()

        self.bert = BertModel.from_pretrained(MODEL_NAME_OR_PATH)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
    
    def forward(self):
        ##############################################################################################
        #                                       Your Code                                            #
        ##############################################################################################
        return None 

In [None]:
import torch, gc

gc.collect()
torch.cuda.empty_cache()
pt_model = None

!nvidia-smi

In [None]:
pt_model = SentimentModel(config=config)
pt_model = pt_model.to(device)

print('pt_model', type(pt_model))

### Training

**<font color=red> Complete functions</font>**

In [None]:
def acc_and_f1(y_true, y_pred, average='weighted'):
    # Define Accuracy and F1-score.
    # Note: eval_cb reads the result as score['acc'], so return a dict with at least an 'acc' key.
    ##############################################################################################
    #                                       Your Code                                            #
    ##############################################################################################
    return None, None

def y_loss(y_true, y_pred, losses):
    y_true = torch.stack(y_true).cpu().detach().numpy()
    y_pred = torch.stack(y_pred).cpu().detach().numpy()
    y = [y_true, y_pred]
    loss = np.mean(losses)

    return y, loss


def eval_op(model, data_loader, loss_fn):
    model.eval()

    losses = []
    y_pred = []
    y_true = []

    with torch.no_grad():
        for dl in tqdm(data_loader, total=len(data_loader), desc="Evaluation... "):

            # Define input_ids, attention_mask, token_type_ids, targets
            ##############################################################################################
            #                                       Your Code                                            #
            ##############################################################################################


            # move tensors to GPU if CUDA is available
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            token_type_ids = token_type_ids.to(device)
            targets = targets.to(device)

            # compute predicted outputs by passing inputs to the model
            ##############################################################################################
            #                                       Your Code                                            #
            ##############################################################################################
            
            # convert output probabilities to predicted class
            ##############################################################################################
            #                                       Your Code                                            #
            ##############################################################################################

            # calculate the batch loss
            ##############################################################################################
            #                                       Your Code                                            #
            ##############################################################################################

            # accumulate all the losses
            losses.append(loss.item())

            y_pred.extend(preds)
            y_true.extend(targets)
    
    eval_y, eval_loss = y_loss(y_true, y_pred, losses)
    return eval_y, eval_loss


def train_op(model, 
             data_loader, 
             loss_fn, 
             optimizer, 
             scheduler, 
             step=0, 
             print_every_step=100, 
             eval=False,
             eval_cb=None,
             eval_loss_min=np.inf,
             eval_data_loader=None, 
             clip=0.0):
    
    model.train()

    losses = []
    y_pred = []
    y_true = []

    for dl in tqdm(data_loader, total=len(data_loader), desc="Training... "):
        step += 1

        # Define input_ids, attention_mask, token_type_ids, targets
        ##############################################################################################
        #                                       Your Code                                            #
        ##############################################################################################

        # move tensors to GPU if CUDA is available
        input_ids = input_ids.to(device)
        attention_mask = attention_mask.to(device)
        token_type_ids = token_type_ids.to(device)
        targets = targets.to(device)

        # clear the gradients of all optimized variables
        optimizer.zero_grad()

        # compute predicted outputs by passing inputs to the model
        ##############################################################################################
        #                                       Your Code                                            #
        ##############################################################################################
        
        # convert output probabilities to predicted class
        ##############################################################################################
        #                                       Your Code                                            #
        ##############################################################################################

        # calculate the batch loss
        ##############################################################################################
        #                                       Your Code                                            #
        ##############################################################################################

        # accumulate all the losses
        losses.append(loss.item())

        # compute gradient of the loss with respect to model parameters
        loss.backward()

        # `clip_grad_norm_` rescales the gradients so their total norm does not exceed `clip`, guarding against exploding gradients.
        if clip > 0.0:
            nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)

        # perform optimization step
        optimizer.step()

        # perform scheduler step
        scheduler.step()

        y_pred.extend(preds)
        y_true.extend(targets)

        if eval and step % print_every_step == 0:
            # running train metrics are only needed when reporting
            train_y, train_loss = y_loss(y_true, y_pred, losses)
            train_score = acc_and_f1(train_y[0], train_y[1], average='weighted')

            eval_y, eval_loss = eval_op(model, eval_data_loader, loss_fn)
            eval_score = acc_and_f1(eval_y[0], eval_y[1], average='weighted')

            # eval_op puts the model in eval mode; switch back before the next batch
            model.train()

            if callable(eval_cb):
                eval_loss_min = eval_cb(model, step, train_score, train_loss, eval_score, eval_loss, eval_loss_min)

    train_y, train_loss = y_loss(y_true, y_pred, losses)

    return train_y, train_loss, step, eval_loss_min

**<font color=red> Define Optimizer, Scheduler & Loss Function</font>**

In [None]:
#######################################Your Code#############################################
#optimizer = 
#scheduler =
#loss_fn =                                                                                 
##############################################################################################

step = 0
eval_loss_min = np.inf
history = collections.defaultdict(list)


def eval_callback(epoch, epochs, output_path):
    def eval_cb(model, step, train_score, train_loss, eval_score, eval_loss, eval_loss_min):
        statement = ''
        statement += 'Epoch: {}/{}...'.format(epoch, epochs)
        statement += 'Step: {}...'.format(step)
        
        statement += 'Train Loss: {:.6f}...'.format(train_loss)
        statement += 'Train Acc: {:.3f}...'.format(train_score['acc'])

        statement += 'Valid Loss: {:.6f}...'.format(eval_loss)
        statement += 'Valid Acc: {:.3f}...'.format(eval_score['acc'])

        print(statement)

        if eval_loss <= eval_loss_min:
            print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(
                eval_loss_min,
                eval_loss))
            
            torch.save(model.state_dict(), output_path)
            eval_loss_min = eval_loss
        
        return eval_loss_min


    return eval_cb


**<font color=red> Complete Training & Plot Loss and Accuracy Diagram</font>**

In [None]:
for epoch in tqdm(range(1, EPOCHS + 1), desc="Epochs... "):

    # Define train_y, train_loss, step, eval_loss_min using train_op
    ##############################################################################################
    #                                       Your Code                                            #
    ##############################################################################################
    
    # Define train_score using acc_and_f1
    ##############################################################################################
    #                                       Your Code                                            #
    ##############################################################################################
    
    # Define eval_y, eval_loss using eval_op
    ##############################################################################################
    #                                       Your Code                                            #
    ##############################################################################################
    
    # Define eval_score using acc_and_f1
    ##############################################################################################
    #                                       Your Code                                            #
    ##############################################################################################
    
    # Save Accuracy and Loss values
    ##############################################################################################
    #                                       Your Code                                            #
    ##############################################################################################


# Diagram
##############################################################################################
#                                       Your Code                                            #
##############################################################################################

### Prediction

**<font color=red> Complete function</font>**

In [None]:
def predict(model, comments, tokenizer, max_len=128, batch_size=32):
    data_loader = create_data_loader(comments, None, tokenizer, max_len, batch_size, None)
    
    predictions = []
    prediction_probs = []

    
    model.eval()
    with torch.no_grad():
        for dl in tqdm(data_loader, position=0):

            # Define input_ids, attention_mask, token_type_ids
            ##############################################################################################
            #                                       Your Code                                            #
            ##############################################################################################

            # move tensors to GPU if CUDA is available
            input_ids = input_ids.to(device)
            attention_mask = attention_mask.to(device)
            token_type_ids = token_type_ids.to(device)
            
            # compute predicted outputs by passing inputs to the model
            ##############################################################################################
            #                                       Your Code                                            #
            ##############################################################################################
            
            # convert output probabilities to predicted class
            ##############################################################################################
            #                                       Your Code                                            #
            ##############################################################################################

            predictions.extend(preds)
            prediction_probs.extend(F.softmax(outputs, dim=1))

    predictions = torch.stack(predictions).cpu().detach().numpy()
    prediction_probs = torch.stack(prediction_probs).cpu().detach().numpy()

    return predictions, prediction_probs

In [None]:
test_comments = test['comment'].to_numpy()
preds, probs = predict(pt_model, test_comments, tokenizer, max_len=128)

print(preds.shape, probs.shape)

**<font color=red> Evaluate Your Model using f1-score & Precision & Recall</font>**
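
As a reminder of what these metrics measure, here is a minimal pure-Python sketch for the binary case. `precision_recall_f1` is a hypothetical helper; it treats label 1 (positive) as the class of interest and returns 0.0 when a ratio is undefined.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # precision: how many predicted positives are correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    # recall: how many actual positives were found
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


print(precision_recall_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

In practice, `sklearn.metrics` provides the same quantities, for example through `classification_report`.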

In [None]:
##############################################################################################
#                                       Your Code                                            #
##############################################################################################