# Semantic Classification Using BERT

#### Setup

In [1]:
!pip install -q -U watermark

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
markdown 3.3.6 requires importlib-metadata>=4.4; python_version < "3.10", but you have importlib-metadata 2.1.3 which is incompatible.


In [2]:
!pip install git+https://github.com/huggingface/transformers


Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-rs9hzkis
  Running command git clone -q https://github.com/huggingface/transformers /tmp/pip-req-build-rs9hzkis
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Buildin

In [3]:
%reload_ext watermark
%watermark -v -p numpy,pandas,torch,transformers

Python implementation: CPython
Python version       : 3.7.13
IPython version      : 5.5.0

numpy       : 1.21.6
pandas      : 1.3.5
torch       : 1.11.0+cu113
transformers: 4.20.0.dev0



In [4]:
#@title Setup & Config
import transformers
from transformers import BertModel, BertTokenizer, AdamW, get_linear_schedule_with_warmup
import torch

import numpy as np
import pandas as pd
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap

from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)


rcParams['figure.figsize'] = 12, 8

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

device(type='cpu')

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
class_names = ['negative', 'positive']


# 1C. Data Preprocessing
- Tokenization
- Case folding (built into the uncased BERT tokenizer)
- Stop words are usually NOT removed, because they carry context such as negation; BERT's transformer layers are pre-trained on "real and clean" running text, so the input should look the same
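To make the stop-word point concrete, here is a minimal sketch of how removing stop words can erase a negation. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical and hand-written for illustration; they are not taken from this notebook or from any library.

```python
# Illustrative stop-word list (an assumption, not a library's list).
STOP_WORDS = {"it's", "is", "the", "not", "we", "a"}

def remove_stop_words(text):
    """Lower-case the text and drop every word found in STOP_WORDS."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

bearish = remove_stop_words("It's not gonna rise")
bullish = remove_stop_words("It's gonna rise")

print(bearish)
print(bullish)
# After filtering, the two opposite statements become identical.
print(bearish == bullish)
```

Once "not" is filtered out, the classifier has no way to tell the two messages apart, which is why the raw text is usually passed to BERT unchanged.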

`encode_plus()` handles the encoding and does the following:
- Adds the [CLS] token at the beginning of the sentence
- Adds the [SEP] token at the end of the sentence
- Pads the sentence with [PAD] tokens so that its total length equals the maximum length
- Replaces tokens that are not in the tokenizer's vocabulary with the [UNK] (unknown) token
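The steps above can be sketched in plain Python. This is only an approximation for intuition: the `VOCAB` dictionary, its IDs, and the `encode_sketch` helper are made up for this example, and the real tokenizer uses WordPiece sub-word splitting, so it falls back to [UNK] far less often than this whitespace-based version.

```python
# Toy vocabulary for illustration only; real BERT token IDs differ.
VOCAB = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3,
         "the": 4, "volume": 5, "is": 6, "flying": 7, "high": 8}

def encode_sketch(text, max_length):
    """Mimic the listed encoding steps on whitespace-split tokens."""
    tokens = text.lower().split()        # case folding + naive split
    tokens = tokens[: max_length - 2]    # leave room for [CLS] and [SEP]
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    # Words missing from the vocabulary map to [UNK].
    ids = [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]
    attention_mask = [1] * len(ids)      # 1 marks a real token
    n_pad = max_length - len(ids)
    ids += [VOCAB["[PAD]"]] * n_pad
    attention_mask += [0] * n_pad        # 0 marks padding
    return {"input_ids": ids, "attention_mask": attention_mask}

enc = encode_sketch("The Volume is flying high hodl", max_length=10)
print(enc["input_ids"])
print(enc["attention_mask"])
```

The attention mask is what later lets the model ignore the padded positions.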




In [7]:
# Load the pretrained BERT tokenizer; the uncased variant lower-cases (case-folds) the input
PRE_TRAINED_MODEL_NAME = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

sample_txt = 'The Volume is flying high we gonna moon'
tokens = tokenizer.tokenize(sample_txt)
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(f' Sentence: {sample_txt}')
print(f'   Tokens: {tokens}')
print(f'Token IDs: {token_ids}')

 Sentence: The Volume is flying high we gonna moon
   Tokens: ['the', 'volume', 'is', 'flying', 'high', 'we', 'gonna', 'moon']
Token IDs: [1996, 3872, 2003, 3909, 2152, 2057, 6069, 4231]


In [8]:
# Encoding
encoding = tokenizer.encode_plus(
  sample_txt,
  max_length=75,
  truncation=True,         # Explicitly truncate to max_length
  add_special_tokens=True, # Add '[CLS]' and '[SEP]'
  return_token_type_ids=False,
  padding='max_length',    # Replaces the deprecated pad_to_max_length=True
  return_attention_mask=True,
  return_tensors='pt',  # Return PyTorch tensors
)


# 2A. Solution

In [9]:
class DiscordDataset(Dataset):
    """Wraps raw messages and labels and tokenizes each message on access."""

    def __init__(self, messages, targets, tokenizer, max_len):
        self.messages = messages
        self.targets = targets
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.messages)

    def __getitem__(self, item):
        message = str(self.messages[item])
        target = self.targets[item]
        encoding = self.tokenizer.encode_plus(
            message,
            add_special_tokens=True,
            max_length=self.max_len,
            truncation=True,          # Cut messages longer than max_len
            return_token_type_ids=False,
            padding='max_length',     # Replaces the deprecated pad_to_max_length=True
            return_attention_mask=True,
            return_tensors='pt',
        )

        return {
            'message_text': message,
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'targets': torch.tensor(target, dtype=torch.long),
        }

# 2C. BERT model Import

In [10]:
class SentimentClassifier(nn.Module):
    def __init__(self, n_classes):
        super(SentimentClassifier, self).__init__()
        self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME, return_dict=False)
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        _, pooled_output = self.bert(
          input_ids=input_ids,
          attention_mask=attention_mask
        )
        output = self.drop(pooled_output)
        return self.out(output)

In [11]:
# Instantiate the classifier with one output neuron per class (bearish/bullish)
model = SentimentClassifier(len(class_names))
model = model.to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


# 2F. BERT model Validation (Implementation)

In [12]:
def eval_model(model, data_loader, loss_fn, device, n_examples):
    # Switch to evaluation mode; no backpropagation is performed, this only monitors performance
    model = model.eval()
    losses = []
    correct_predictions = 0

    with torch.no_grad():
        for d in data_loader:
            # Get the input IDs, the attention mask (which marks the padding), and the labels
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)

            # Calculate the outputs (logits)
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )

            # The predicted class is the index of the largest logit
            _, preds = torch.max(outputs, dim=1)
            loss = loss_fn(outputs, targets)

            # Accumulate the correct predictions and the loss for every batch
            correct_predictions += torch.sum(preds == targets)
            losses.append(loss.item())

    return correct_predictions.double() / n_examples, np.mean(losses)
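The accumulation logic of `eval_model` can be illustrated without PyTorch. The sketch below uses plain lists in place of tensors; the helper names `argmax` and `accuracy_over_batches` and the two batches of logits are invented for this example.

```python
def argmax(row):
    """Index of the largest value in a list of logits."""
    return max(range(len(row)), key=lambda i: row[i])

def accuracy_over_batches(batches, n_examples):
    """Accumulate correct predictions across all batches, then normalise."""
    correct = 0
    for logits, targets in batches:
        preds = [argmax(row) for row in logits]
        # This update must run once per batch, inside the loop.
        correct += sum(p == t for p, t in zip(preds, targets))
    return correct / n_examples

# Two made-up batches of (logits, targets) for a two-class problem.
batches = [
    ([[0.2, 1.5], [2.0, -1.0]], [1, 0]),   # both predictions correct
    ([[0.3, 0.1], [-0.5, 0.4]], [1, 1]),   # first wrong, second correct
]
print(accuracy_over_batches(batches, n_examples=4))  # 0.75
```

If the counting step sat outside the `for` loop, only the final batch would contribute, and dividing by the full `n_examples` would understate the accuracy.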

# 3A. Evaluation

In [14]:
def get_predictions(model, data_loader):
    model = model.eval()
    message_texts = []
    predictions = []
    prediction_probs = []
    real_values = []

    with torch.no_grad():
        for d in data_loader:
            texts = d["message_text"]
            input_ids = d["input_ids"].to(device)
            attention_mask = d["attention_mask"].to(device)
            targets = d["targets"].to(device)

            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            _, preds = torch.max(outputs, dim=1)
            # Convert the logits to class probabilities that sum to 1
            probs = F.softmax(outputs, dim=1)

            message_texts.extend(texts)
            predictions.extend(preds)
            prediction_probs.extend(probs)
            real_values.extend(targets)

    predictions = torch.stack(predictions).cpu()
    prediction_probs = torch.stack(prediction_probs).cpu()
    real_values = torch.stack(real_values).cpu()
    return message_texts, predictions, prediction_probs, real_values
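The predictions and true labels returned by `get_predictions` are typically summarised in a confusion matrix. The setup cell imports `confusion_matrix` from scikit-learn for that purpose; the dependency-free sketch below only illustrates the same layout (rows are true classes, columns are predicted classes). The `confusion_counts` helper and the label lists are made up for illustration and are not results from this notebook.

```python
def confusion_counts(y_true, y_pred, n_classes):
    """matrix[t][p] counts the examples with true class t predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

class_names = ['negative', 'positive']
# Made-up labels for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

matrix = confusion_counts(y_true, y_pred, len(class_names))
for name, row in zip(class_names, matrix):
    print(f'{name:>8}: {row}')
```

The off-diagonal cells show which class the model confuses with which, something a single accuracy figure hides.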

#### After Training Evaluation Raw Text & Obscure Examples

In [23]:
MAX_LEN = 75

In [18]:
model = SentimentClassifier(len(class_names))
model = model.to(device)


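# map_location lets a checkpoint saved on a GPU load on the current device (CPU here)
model.load_state_dict(torch.load("/content/drive/MyDrive/TEAMPROJECT_NLU/Model_BERT/80ACC_bert_state.bin", map_location=device))
model.eval()

test_data_loader = torch.load("/content/drive/MyDrive/TEAMPROJECT_NLU/Model_BERT/80_dataloader.pth")
len_test = 109  # Number of examples in the test set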





In [19]:
loss_fn = nn.CrossEntropyLoss().to(device)

In [20]:
test_acc, _ = eval_model(
  model,
  test_data_loader,
  loss_fn,
  device,
  len_test
)

test_acc.item()


0.7431192660550459
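As a rough aid to reading this figure, the sketch below converts the accuracy back into a count of correct predictions and attaches a binomial standard error. The standard-error step is an addition made here, not something the notebook computes, and it assumes the test examples are independent.

```python
import math

test_acc = 0.7431192660550459   # value reported by eval_model above
len_test = 109                  # size of the test set used above

# Number of correctly classified test messages
n_correct = round(test_acc * len_test)
# Binomial standard error of the accuracy estimate
std_err = math.sqrt(test_acc * (1 - test_acc) / len_test)

print(f'{n_correct} of {len_test} correct')
print(f'accuracy = {test_acc:.3f} +/- {std_err:.3f} (one standard error)')
```

With a test set this small, differences of a few percentage points between runs should be interpreted with caution.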

In [21]:
message = input("Enter message: ")


Enter message: really good art


In [24]:
encoded_message = tokenizer.encode_plus(
  message,
  max_length=MAX_LEN,
  truncation=True,       # Cut messages longer than MAX_LEN
  add_special_tokens=True,
  return_token_type_ids=False,
  padding='max_length',  # Replaces the deprecated pad_to_max_length=True
  return_attention_mask=True,
  return_tensors='pt',
)



# Problematic Examples

"How can you be happy woth this price" - identified as bullish with high confidence, but it is actually bearish because the price is really low.

"It's not gonna rise" - the negation is not recognised, due to stop-word removal.

"Price not good" - again classified as bullish, due to a lack of such cases in the training data.

"If it reaches 50, I'd be delighted" - bullish with low confidence. The message implies a bullish stance, but the model classified it as bullish almost by chance (51% bullish vs 49% bearish).

"I wish it was going up instead" - clearly bearish, since the author wishes for the opposite of what is actually happening (the price is dropping).

In [25]:
input_ids = encoded_message['input_ids'].to(device)
attention_mask = encoded_message['attention_mask'].to(device)
output = model(input_ids, attention_mask)

# The predicted class is the index of the largest logit
_, prediction = torch.max(output, dim=1)

# Sigmoid scores each class independently, so the two values need not sum to 1
sm = torch.nn.Sigmoid()
probabilities = sm(output)

print(f'Message text: {message}')
print(f'Predicted Status is: {class_names[prediction]}')
print(f'Probabilities for negative/positive are: {probabilities}')

Message text: really good art
Predicted Status is: positive
Probabilities for negative/positive are: tensor([[0.1867, 0.8374]], grad_fn=<SigmoidBackward0>)
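Note that the two scores above do not sum to 1, because a sigmoid treats each output independently. Assuming the model was trained with the same `nn.CrossEntropyLoss` that is used for evaluation, a softmax over the two logits is the more conventional way to obtain class probabilities (in PyTorch, `F.softmax(output, dim=1)`). The sketch below compares the two in plain Python; the logits are illustrative values chosen to roughly reproduce the sigmoid scores above, not the model's actual outputs.

```python
import math

def sigmoid(x):
    """Logistic function applied to a single logit."""
    return 1.0 / (1.0 + math.exp(-x))

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative logits (an assumption), roughly matching the scores above.
logits = [-1.47, 1.64]

sigmoid_scores = [sigmoid(x) for x in logits]
softmax_probs = softmax(logits)

print([round(s, 4) for s in sigmoid_scores], 'sum =', round(sum(sigmoid_scores), 4))
print([round(p, 4) for p in softmax_probs], 'sum =', round(sum(softmax_probs), 4))
```

Both functions preserve the ordering of the logits, so the predicted class is unchanged; only the reported confidence differs.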
