# Sentiment Analysis with BERT

> TL;DR In this notebook, we fine-tune BERT for sentiment analysis. We perform the required text preprocessing (special tokens, padding, and attention masks) and build a Sentiment Classifier using the Transformers library by Hugging Face as well as Pytorch.

Goal (official): To get a model that can classify text based on seniment, and hopefully it performs well on neutral text, which can be problematic for some models.

Goal (actual): To understand the entire pipeline better, become more skilled, and perhaps get a few ideas on how to perform some personal project ideas that I have. Also, For GPUs to go "Brrrrrrrrrrrrrr". (duh)

- Evaluate the model on test data
- Predict sentiment on raw text

In [1]:
!nvidia-smi

Thu Apr  6 01:14:10 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A10          On   | 00000000:06:00.0 Off |                    0 |
|  0%   31C    P8    20W / 150W |      0MiB / 23028MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [2]:
!nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_Mar__8_18:18:20_PST_2022
Cuda compilation tools, release 11.6, V11.6.124
Build cuda_11.6.r11.6/compiler.31057947_0


In [3]:
import torch
torch.cuda.is_available()

True

## Setup

We'll need to import the following:

In [4]:
!pip install -q transformers


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [5]:
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup

import torch
from torch import nn, optim
from torch.nn.utils import clip_grad_norm_
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.utils import shuffle

import numpy as np
import pandas as pd

from collections import defaultdict  
# a dictionary-like object that initializes nonexistent keys, if called, to a default pre-chosen value

from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

  from pandas.core.computation.check import NUMEXPR_INSTALLED


device(type='cuda', index=0)

We can use the watermark extension for Jupyter notebooks here to view the versions of th most important libraries we'll be using. In case of a code error, version incompatibility is one of the usual suspects.

In [6]:
!pip install -q -U watermark 
# -q means "quiet". This means it won't output anything 


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


In [7]:
# reload the watermark extension for Jupyter
%reload_ext watermark

# use watermark to display the version info for the specified packages
%watermark -v -p numpy,pandas,torch,transformers

Python implementation: CPython
Python version       : 3.8.10
IPython version      : 7.13.0

numpy       : 1.23.4
pandas      : 1.5.1
torch       : 1.12.1
transformers: 4.27.4



## Data Exploration

In this section, we will:

0. Download the [selected] dataset from Kaggle using the kaggle API.
1. Load our dataset using Pandas.
2. View the basic structure of the dataset: `head()`\\`sample()`, `shape`, `info()`.
3. Check for missing data or ...
4. Check if the dataset already has clear sentiment classes defined, if not, create a way (here: a function) to establish those classes and decide on their names.

In [8]:
!pip install -q -U kaggle


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3[0m[39;49m -> [0m[32;49m23.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m


We follow the instructions found on [Kaggle's website](https://www.kaggle.com/docs/api) to get our API token and authenticate it. We should know there was no issue with the authentication if we can run `import kaggle` successfully without errors.

In [9]:
# import kaggle

Huh, seems like it works. This coding stuff must be real.

Now, let's download our dataset. Using the convenient **"Copy API command"** option available on every Kaggle dataset page, we apply:

In [10]:
# !kaggle datasets download -d jillanisofttech/amazon-product-reviews

The above code needs a `!` at the beginning, because what we installed above and are using to interact with the Kaggle API is actually a [CLI](https://en.wikipedia.org/wiki/Command-line_interface) tool.

Running the above cell tells us that our zip file `amazon-product-reviews.zip` has been downloaded to the current directory.

Now, let's unzip this zip file.

Problem: I don't know/remember how to unzip using Python.

Solution: Stack Overflow.

In [11]:
# import zipfile
# with zipfile.ZipFile('amazon-product-reviews.zip', 'r') as zip_file:
#     zip_file.extractall('')

In [12]:
# import os
# os.remove('amazon-product-reviews.zip')

We can see that we now have a `Reviews.csv` file in our directory. Let's use Pandas to load - and take a look at - this baby:

In [13]:
import pandas as pd
df = pd.read_csv('Reviews.csv')
print(df.shape)
df.sample(5)

(568454, 10)


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
119015,119016,B003YF1188,A3MB83ALNB3O4Z,Ann,0,0,5,1350432000,I am a fan of Stonewall Kitchen!,I was so happy that Amazon carries this produc...
120975,120976,B001EQ57KW,A2MM5OQCXV4BQ1,GAD,0,0,5,1338422400,"Good for snacking, great for baking!","First, I love the fact that Go Raw processes t..."
306052,306053,B002R89LOE,AY1EF0GOH80EK,Natasha Stryker,7,7,2,1276819200,Am I just crazy?,I feel like I am taking crazy pills when I rea...
485495,485496,B001RVFERK,AJFXMVJTGGHTY,"Wade Osborne ""Wade Osborne""",0,0,4,1312761600,great chips,Pop Chips are the best chips I've had that are...
490320,490321,B001E5DZJ8,A334K3EPD2H467,Sarah Norman,0,0,5,1314662400,Tastes great,These taste great. I add 1 or 2 cubes to the ...


All we really care about here are two columns: "Text" and "Score".

"Summary" and "HelpfulnessDenominator" can be useful to take a look at too.

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 568454 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568454 non-null  int64 
 1   ProductId               568454 non-null  object
 2   UserId                  568454 non-null  object
 3   ProfileName             568438 non-null  object
 4   HelpfulnessNumerator    568454 non-null  int64 
 5   HelpfulnessDenominator  568454 non-null  int64 
 6   Score                   568454 non-null  int64 
 7   Time                    568454 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568454 non-null  object
dtypes: int64(5), object(5)
memory usage: 43.4+ MB


We have a fantastically complete dataset here! ~570K rows that are almost all complete, with a handful of missing values in the entire thing. Cool.

The two columns we care about the most, `Text` and `Score` are complete as well. We get to live to fight (or clean data) another day!

 However, i'm interested in filtering out those rows with null values in them in the "Summary" column. Maybe we could that column later, so I'd like to drop these rows now:

In [15]:
df = df.dropna(axis='index', how='all', subset=['Summary'])

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 568427 entries, 0 to 568453
Data columns (total 10 columns):
 #   Column                  Non-Null Count   Dtype 
---  ------                  --------------   ----- 
 0   Id                      568427 non-null  int64 
 1   ProductId               568427 non-null  object
 2   UserId                  568427 non-null  object
 3   ProfileName             568411 non-null  object
 4   HelpfulnessNumerator    568427 non-null  int64 
 5   HelpfulnessDenominator  568427 non-null  int64 
 6   Score                   568427 non-null  int64 
 7   Time                    568427 non-null  int64 
 8   Summary                 568427 non-null  object
 9   Text                    568427 non-null  object
dtypes: int64(5), object(5)
memory usage: 47.7+ MB


In [17]:
len(df.Text.unique())

393576

In [18]:
df.drop_duplicates(subset='Text', keep='first', inplace=True)

In [19]:
len(df)

393576

In [20]:
df.Score.value_counts()

5    250716
4     56042
1     36275
3     29752
2     20791
Name: Score, dtype: int64

## Data Preprocessing

In this section, we will:

1. Create a function to turn the ratings which are currently numbers into 3 sentiment classes. 
2. Create a balanced dataset by making sure we have roughly equal neutral scores to +ve and -ve ones.
3. Decide on a max. sequence length (in tokens) and trim longer texts to balance training cost vs. accuracy.
4. Create our Pytorch Dataset class.
5. Create our Pytorch Dataloaders class/function (which is better?)
6. Split the dataset into training, validation and test datasets, and create dataloaders for them.
7. Decide what batch size to use, and examine one of our batches before moving on to the training section.

Shuffle!

In [21]:
def score2sentiment(score):
    if score < 3 :
        return 'negative'
    elif score > 3 :
        return 'positive'
    else:
        return 'neutral'

In [22]:
df['Sentiment'] = df.Score.apply(score2sentiment)

In [23]:
df['Sentiment'].value_counts()

positive    306758
negative     57066
neutral      29752
Name: Sentiment, dtype: int64

In [24]:
positive_df = df[(df.Sentiment == 'positive')]
negative_df = df[(df.Sentiment == 'negative')]
neutral_df = df[(df.Sentiment == 'neutral')]

balanced_df = pd.concat([positive_df.head(29_750), negative_df.head(29_750), neutral_df.head(29_750)])

In [25]:
balanced_df.Sentiment.value_counts()

positive    29750
negative    29750
neutral     29750
Name: Sentiment, dtype: int64

In [26]:
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

In [56]:
tokens_lengths = []

for review in tqdm(balanced_df.Text):
    tokens = tokenizer.encode_plus(review,
                                   add_special_tokens=True,
                                   return_length=True)

    tokens_lengths.append(tokens.length)

100%|██████████| 89250/89250 [01:23<00:00, 1064.05it/s]


In [28]:
great_tokens_lengths = [x for x in tokens_lengths if x > 512]

In [29]:
(len(great_tokens_lengths) / len(tokens_lengths)) * 100

1.4980392156862745

Max len = 512.

In [30]:
MAX_LEN = 512

Let's take a look at counts: Let's create a more balanced dataset:

In [31]:
pd.value_counts(df.Score)

5    250716
4     56042
1     36275
3     29752
2     20791
Name: Score, dtype: int64

In [32]:
from torch.utils.data import Dataset

sentiment2number = {"negative" : 0, "neutral" : 1, "positive" : 2}

class AmazonReviewDataset(Dataset):
    
    def __init__(self, df, tokenizer,):
        self.df = df
        self.reviews = df.Text.to_numpy()
        self.sentiments = df.Sentiment.to_numpy()
        self.tokenizer = tokenizer
    
    # We also need to define __len__() and __getitem__():
    
    def __len__(self):
        return len(self.df['Id'])
    
    def __getitem__(self, idx):
        
        review = self.reviews[idx]
        sentiment = self.sentiments[idx]
        
        sentiment = sentiment2number[sentiment]
        sentiment = torch.tensor(sentiment, dtype=torch.long)
        
        tokens = self.tokenizer.encode_plus(review,
                                             add_special_tokens=True,
                                             max_length=512,
                                             padding='max_length',
                                             truncation=True,
                                             return_token_type_ids=False,
                                             return_tensors='pt')

        return {"review": review, "input_ids": tokens['input_ids'].flatten(), 
                "attention_mask": tokens['attention_mask'].flatten(), "sentiment": sentiment}
    
# we must first transform sentiment to a number, then to type torch.long as that is required for classification
# .flatten() turns our ...

In [33]:
from torch.utils.data import DataLoader

def create_dataloader(dataset, tokenizer):
    processed_ds = AmazonReviewDataset(dataset, tokenizer)
    
    return DataLoader(processed_ds, batch_size=16, shuffle=True)

In [34]:
shuffled_df = shuffle(balanced_df)

train_df, val_df = train_test_split(shuffled_df, train_size=0.75)
val_df, test_df = train_test_split(val_df, test_size=0.5)

In [35]:
train_dl = create_dataloader(train_df, tokenizer)
val_dl = create_dataloader(val_df, tokenizer)
test_dl = create_dataloader(test_df, tokenizer)

Let's have a look at an example batch from our training data loader:

In [36]:
batch = next(iter(train_dl))

print(batch.keys())
print(batch['input_ids'].shape)
print(batch['attention_mask'].shape)
print(batch['sentiment'].shape)

dict_keys(['review', 'input_ids', 'attention_mask', 'sentiment'])
torch.Size([16, 512])
torch.Size([16, 512])
torch.Size([16])


## Defining Training Process

Instead of using the Sentiment Analysis helper built for BERT which comes with the Transformers library, we'll use the basic `DistilBertModel` and build our sentiment classifier on top of it.

1. Load the base model (cased).
2. run it on the sample text from previous section.
4. Show pre-trained BERT's ability to classify our sample text.
3. Build a Sentiment Analysis wrapper around BERT using Pytorch.
5. Decide what loss function, optimizer, scheduler to use as well as the rest of the hyper-parameters.
6. Create our training epoch function.
7. Create our inference function.
8. Finish with our training loop.
9. View progress during fine-tuning.

In [37]:
bert_model = BertModel.from_pretrained('bert-base-cased', return_dict=False)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


And try to use it on the encoding of our sample text:

Shape **without** using `.flatten()`: `(batch_size, 1, sequence_length)` = `(16, 1, 512)`.

Shape **with** using `.flatten()`: `(batch_size, sequence_length)` = `(16, 512)`.

We can use `.flatten()` within `DataLoader`, presumably as it already has information about the batch size and preprocesses accordingly.

We shouldn't use `.flatten()` for casual inference, as the model expects an input of a fixed length and shape (?) (Here: 512, since this is a BERT-model). 

In [38]:
sample_input_ids = torch.reshape(batch["input_ids"][0], (1, 512))
sample_attention_mask = torch.reshape(batch["attention_mask"][0], (1, 512))

sample = {'input_ids':sample_input_ids, 'attention_mask': sample_attention_mask}
raw, pooled = bert_model(**sample)

In [39]:
raw

tensor([[[ 0.4075,  0.2071,  0.1190,  ..., -0.4108,  0.0408, -0.0849],
         [ 0.2758, -0.6349,  0.2261,  ..., -0.2547,  0.1820,  0.0482],
         [ 0.5432, -0.6583,  0.1456,  ...,  0.6100,  0.6814, -0.2100],
         ...,
         [ 0.1340,  0.2772, -0.2694,  ...,  0.1435,  0.1378,  0.2156],
         [ 0.1539,  0.2833, -0.3036,  ...,  0.1528,  0.1631,  0.2584],
         [ 0.0829,  0.2475, -0.2725,  ...,  0.1568,  0.2055,  0.2240]]],
       grad_fn=<NativeLayerNormBackward0>)

The `raw` is a sequence of hidden states of the last layer of the model. Obtaining the pooled output `pooled` is done by applying the [BertPooler](https://github.com/huggingface/transformers/blob/edf0582c0be87b60f94f41c659ea779876efc7be/src/transformers/modeling_bert.py#L426) on the last hidden state `raw`.

You can think of `pooled` as a summary of the content, according to BERT. Albeit, you might try and do better. Let's look at the shape of the output:

In [40]:
raw.shape, pooled.shape

(torch.Size([1, 512, 768]), torch.Size([1, 768]))

In [41]:
class SentimentClassifier(nn.Module):
    
    def __init__(self, num_classes):
        super(SentimentClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-cased', return_dict=False)
        self.drop = nn.Dropout(p=0.25)
        self.output_layer = nn.Linear(768, num_classes)
        # self.softmax = nn.Softmax(dim=1) # Why 1? to apply it among classes and not batches. Tell me more.
        
    def forward(self, input_ids, attention_mask):
        raw, pooled = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.drop(pooled)
        fine_tuned = self.output_layer(pooled)
        # classified = self.softmax(fine_tuned)
        
        return fine_tuned

In [42]:
num_classes = len(df.Sentiment.unique())

classifier = SentimentClassifier(num_classes)
classifier = classifier.to(device)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Our classifier delegates most of the heavy lifting to the DeBERTa model. We use a dropout layer for some regularization and a fully-connected layer for our output. **Note that we're returning the raw output of the last layer since that is required for the cross-entropy loss function in PyTorch to work.**

This should work like any other PyTorch model. Let's create an instance and move it to the GPU:

In [43]:
number2sentiment = {0: "negative", 1: "neutral", 2: "positive"}

sample = {'input_ids':sample_input_ids.to(device), 'attention_mask': sample_attention_mask.to(device)}

pooled = classifier(**sample)
probs = F.softmax(pooled, dim=1)

print(batch['review'][0])
print(batch['sentiment'][0])

result = number2sentiment[torch.argmax(probs).item()]

result

BEWARE  I recieved my box of 25 assorted Nonnie's Biscotti from Amazon . Every single biscotti was broken in 1-3 pieces. I am disgusted completely as these were to be favors of a wedding . I will never buy from this company again.  BEWARE
tensor(0)


'neutral'

In [44]:
# output.detach_()
# del output
# torch.cuda.empty_cache()

In [45]:
# torch.cuda.memory_allocated()

### Training

We need to define:

1. Epochs.
2. Learning rate
3. Optimizer.
4. Scheduler.
5. Loss function.

In [46]:
EPOCHS = 3
lr = 5e-5

num_steps = EPOCHS * len(train_dl)

optimizer = optim.AdamW(classifier.parameters(), lr=lr)

scheduler = get_linear_schedule_with_warmup(optimizer=optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=num_steps)

loss_fn = nn.CrossEntropyLoss().to(device)

How come those?

In [47]:
def train_one_epoch(model, dataloader, loss_fn, optimizer, scheduler, device):
    # 1. Set the model in training mode
    model = model.train()
    
    # 2. Initialize the variables for tracking loss and correct predictions
    loss_history = []
    correct_predictions = 0
    
    # 3. Iterate over the batches in the data loader
    for batch in tqdm(dataloader):
        
        # a. Move the input and target tensors to the specified device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        targets = batch['sentiment'].to(device)
        
        # b. Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        # only returns the processed 'pooled', as we defined in our S.C. class
        
        values, predictions = torch.max(outputs, dim=1) 
        # specifying the dim. changes the returns of torch.max()
        
        loss = loss_fn(outputs, targets)
        
        # c. Update the loss and correct predictions variables
        loss_history.append(loss.item())
        correct_predictions += torch.sum(predictions == targets)
        # torch compares values for numbers, regardless of dtype
        
        # d. Backward pass
        loss.backward()
        clip_grad_norm_(parameters=model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        
    # 4. Compute the accuracy and average loss
    accuracy = correct_predictions.double()/(len(dataloader) * 16)
    average_loss = np.mean(loss_history)
        
    return accuracy, average_loss

Training the model should look familiar, except for two things. The scheduler gets called every time a batch is fed to the model. We're avoiding exploding gradients by clipping the gradients of the model using [clip_grad_norm_](https://pytorch.org/docs/stable/nn.html#clip-grad-norm).

Let's write another one that helps us evaluate the model on a given data loader:

In [48]:
def evaluate_model(model, dataloader, loss_fn, device):
    # 1. Set the model in evaluation mode
    model = model.eval()
    
    # 2. Initialize the variables for tracking loss and correct predictions
    loss_history = []
    correct_predictions = 0
    
    with torch.no_grad():
        # 3. Iterate over the batches in the data loader
        for batch in tqdm(dataloader):

            # a. Move the input and target tensors to the specified device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            targets = batch['sentiment'].to(device)

            # b. Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)        
            values, predictions = torch.max(outputs, dim=1)        
            loss = loss_fn(outputs, targets)

            # c. Update the loss and correct predictions variables
            loss_history.append(loss.item())
            correct_predictions += torch.sum(predictions == targets)

        # 4. Compute the accuracy and average loss
        accuracy = correct_predictions.double()/(len(dataloader) * 16)
        average_loss = np.mean(loss_history)

    return accuracy, average_loss

Using those two, we can write our training loop. We'll also store the training history:

In [49]:
# 1.a) initialize a dict. with default values that are lists
history = defaultdict(list)

# 1.b) store the highest val. acc. seen so far
best_accuracy = 0

# 2. loop through epochs
for epoch in range(EPOCHS):

    # a. Print the current epoch number
    print(f"Epoch {epoch + 1} out of {EPOCHS}.")
    print('-' * 10)

    # b. Train the model and print output
    train_acc, train_loss = train_one_epoch(classifier, train_dl, loss_fn, optimizer, scheduler, device)
    print(f"Accuracy on training dataset: {train_acc}")
    print(f"Loss on training dataset: {train_loss}")
    print('-' * 5)

    # c. Evaluate on the validation dataset and print output
    val_acc, val_loss = evaluate_model(classifier, val_dl, loss_fn, device)
    print(f"Accuracy on validation dataset: {val_acc}")
    print(f"Loss on validation dataset: {val_loss}")
    print('-' * 5)

    # d. Document the training and validation accuracy and loss in history
    history['train_acc'].append(train_acc)
    history['train_loss'].append(train_loss)
    history['val_acc'].append(val_acc)
    history['val_loss'].append(val_loss)

    # If current val_acc > best_accuracy, model state is saved to file
    if val_acc > best_accuracy:
        torch.save(classifier.state_dict(), 'best_model_state.bin')
        best_accuracy = val_acc

Epoch 1 out of 3.
----------


100%|██████████| 4184/4184 [39:46<00:00,  1.75it/s]


Accuracy on training dataset: 0.7467286089866156
Loss on training dataset: 0.5931135959189606
-----


100%|██████████| 698/698 [02:30<00:00,  4.65it/s]


Accuracy on validation dataset: 0.7821454154727794
Loss on validation dataset: 0.53433882863525
-----
Epoch 2 out of 3.
----------


100%|██████████| 4184/4184 [39:54<00:00,  1.75it/s]


Accuracy on training dataset: 0.833144120458891
Loss on training dataset: 0.4173259054600879
-----


100%|██████████| 698/698 [02:30<00:00,  4.65it/s]


Accuracy on validation dataset: 0.7933381088825214
Loss on validation dataset: 0.4978796033653028
-----
Epoch 3 out of 3.
----------


100%|██████████| 4184/4184 [39:53<00:00,  1.75it/s]


Accuracy on training dataset: 0.9087595602294455
Loss on training dataset: 0.25616456334308646
-----


100%|██████████| 698/698 [02:30<00:00,  4.64it/s]

Accuracy on validation dataset: 0.7905623209169054
Loss on validation dataset: 0.6446215840600922
-----





## Evaluation

Here, we will:
1. Write a function to test how well the model generalizes, by using the test dataset, which is data it didn't train on.
2. Evaluate the model's performance using classification report.
3. Try it on raw input.

In [54]:
def get_predictions_from_dl(model, dataloader):
    
    # 1. Set the model to evaluation mode
    model = model.eval()
    
    # 2. Initialize the lists for ins, true_outs, preds and confidence 
    reviews = []
    predictions = []
    confidence_levels = []
    true_sentiments = []
    
    with torch.no_grad():
        # 3. Iterate over the batches in the data loader
        for batch in tqdm(dataloader):
            # a. Move the model input tensors to the specified device,
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            # keeping the ins & outs of the dataset unmoved
            texts = batch['review']
            # using a different variable name to not overwrite the list
            sentiments = batch['sentiment']

            # b. Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            vals, preds = torch.max(outputs, dim=1)

            # c. Get confidence levels with softmax 
            confidence_lvls = F.softmax(outputs, dim=1)

            # d. Update the lists
            reviews.extend(texts)
            predictions.extend(preds)
            confidence_levels.extend(confidence_lvls)
            true_sentiments.extend(sentiments)

    # 4. Stack, move back to cpu and return everything
    predictions = torch.stack(predictions).cpu()
    confidence_levels = torch.stack(confidence_levels).cpu()
    true_sentiments = torch.stack(true_sentiments).cpu()
    
    return reviews, predictions, confidence_levels, true_sentiments

This is similar to the evaluation function, except ...

In [55]:
test_reviews, test_predictions, test_confidence_lvls, test_sentiments = get_predictions_from_dl(classifier, test_dl)

100%|██████████| 698/698 [02:19<00:00,  4.99it/s]


Let's have a look at the classification report

In [58]:
classification_report(test_sentiments, test_predictions)

'              precision    recall  f1-score   support\n\n           0       0.79      0.78      0.78      3706\n           1       0.70      0.72      0.71      3785\n           2       0.89      0.88      0.88      3666\n\n    accuracy                           0.79     11157\n   macro avg       0.79      0.79      0.79     11157\nweighted avg       0.79      0.79      0.79     11157\n'

In [60]:
idx = 57

review = test_reviews[idx]
confidence_lvls = test_confidence_lvls[idx]

print(review)
conf_lvls = {number2sentiment[idx] : confidence_lvls[idx] for idx in range(3)}
conf_lvls

The price is good, size is perfect...but my dog doesn't like them at all. Would not buy again sorry. Not the products fault


{'negative': tensor(0.7084),
 'neutral': tensor(0.2869),
 'positive': tensor(0.0047)}

### Predicting on Raw Text

In [94]:
Text = r"""Very easy game. Explained in a minute. Anyone can play it. Very fast laps. The limit could be another two or three centimeters shorter,
then it will “click” faster! I did it this way. It can be packed up to a small size and is almost indestructible.
But it can easily lead to frustration if you play with too much ambition. You also have to be able to lose!"""

In [95]:
score = 4

In [96]:
encoding = tokenizer.encode_plus(Text, add_special_tokens=True, max_length=512,
                                 padding='max_length', truncation=True,
                                 return_token_type_ids=False, return_tensors='pt')

In [97]:
input_ids = encoding['input_ids'].to(device)
attention_mask = encoding['attention_mask'].to(device)

pooled = classifier(input_ids, attention_mask)

confidence = F.softmax(pooled, dim=1).tolist()[0]
confidence = {number2sentiment[idx] : confidence[idx] for idx in range(3)}

val, prediction = torch.max(pooled, dim=1)
prediction = number2sentiment[prediction.item()]

print(confidence)
prediction

{'negative': 0.001904321019537747, 'neutral': 0.023873839527368546, 'positive': 0.97422194480896}


'positive'

## Summary

## References

- [Huggingface Transformers](https://huggingface.co/transformers/)
- 