# Oxford Man Institute NLP Tutorial 

## 1. Introduction 

There are several ways of performing sentiment classification on a document or article, ranging from word-counts to modern Transformer-based Language Models. In this tutorial we will take you through a range of classification techniques:
- Loughran & McDonald financial sentiment dictionary
- Naive Bayes Classifier
- BERT out of the box
- BERT fine-tuned on general sentiment datasets
- FinBERT 
    - BERT that has been trained on positive and negative financial documents

## 2. Traditional sentiment analysis

### Import packages and load dictionaries

In [2]:
import numpy as np
import re

### Loughran & McDonald classifier

Loughran & McDonald released their master dictionary in 2011 in conjunction with their paper “When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks". The dictionary lists a number of words and includes negative, positive, uncertainty, litigious, strong modal, weak modal, and constraining tags. 

There are several shortcomings to this simplistic approach:

- **Some words don't appear in the dictionary (fall, rise, etc.)**
- **Some words are negative/positive given the context they are written (profit, expenditure, etc.)**
- **Simple counts of words don't necessarily infer the overall sentiment**
    - *Hatred for football has always confused me; there are so many haters who attack the sport, but I have always loved it.* - 3 negative words and 1 positive word.


We have taken the words that have a negative and positive tag for our classifier:

In [38]:
lmdict = np.load('data/LoughranMcDonald_dict.npy', allow_pickle='TRUE').item()
print('Some examples of negative words: ', lmdict['Negative'][:5])
print('Some examples of positive words: ', lmdict['Positive'][:5])

Some examples of negative words:  ['abandon', 'abandoned', 'abandoning', 'abandonment', 'abandonments']
Some examples of negative words:  ['able', 'abundance', 'abundant', 'acclaimed', 'accomplish']


In [30]:
lmdict['Positive'][:5]

['able', 'abundance', 'abundant', 'acclaimed', 'accomplish']

Check to see if a word appears in the dictionary:

In [25]:
word = 'fall'

if word in lmdict['Negative']:
    print(f'Yes, {word} is a Negative word in the Loughran & McDonald dictionary')
elif word in lmdict['Positive']:
    print(f'Yes, {word} is Positive word in the Loughran & McDonald dictionary')
else:
    print(f'No, {word} is not in the Loughran & McDonald dictionary')

No, fall is not in the Loughran & McDonald dictionary


Negation is another challenge that emerges using this approach. A techy fix is to check if the word is preceeded by a negating word in our list:

In [32]:
negate = ["aint", "arent", "cannot", "cant", "couldnt", "darent", "didnt", "doesnt", "ain't", "aren't", "can't",
          "couldn't", "daren't", "didn't", "doesn't", "dont", "hadnt", "hasnt", "havent", "isnt", "mightnt", "mustnt",
          "neither", "don't", "hadn't", "hasn't", "haven't", "isn't", "mightn't", "mustn't", "neednt", "needn't",
          "never", "none", "nope", "nor", "not", "nothing", "nowhere", "oughtnt", "shant", "shouldnt", "wasnt",
          "werent", "oughtn't", "shan't", "shouldn't", "wasn't", "weren't", "without", "wont", "wouldnt", "won't",
          "wouldn't", "rarely", "seldom", "despite", "no", "nobody"]

In [33]:
def negated(word):
    """
    Determine if preceding word is a negation word
    """
    if word.lower() in negate:
        return True
    else:
        return False

This function counts the number of negative and positive words in a document and performs a negation check to switch the polarity of words that are preceeded by a word in the *negate* list.

In [35]:
def tone_count_with_negation_check(dict, article):
    """
    Count positive and negative words with negation check. Account for simple negation only for positive words.
    Simple negation is taken to be observations of one of negate words occurring within three words
    preceding a positive words.
    """
    pos_count = 0
    neg_count = 0
 
    pos_words = []
    neg_words = []
 
    input_words = re.findall(r'\b([a-zA-Z]+n\'t|[a-zA-Z]+\'s|[a-zA-Z]+)\b', article.lower())
 
    word_count = len(input_words)
 
    for i in range(0, word_count):
        if input_words[i] in dict['Negative']:
            neg_count += 1
            neg_words.append(input_words[i])
        if input_words[i] in dict['Positive']:
            if i >= 3:
                if negated(input_words[i - 1]) or negated(input_words[i - 2]) or negated(input_words[i - 3]):
                    neg_count += 1
                    neg_words.append(input_words[i] + ' (with negation)')
                else:
                    pos_count += 1
                    pos_words.append(input_words[i])
            elif i == 2:
                if negated(input_words[i - 1]) or negated(input_words[i - 2]):
                    neg_count += 1
                    neg_words.append(input_words[i] + ' (with negation)')
                else:
                    pos_count += 1
                    pos_words.append(input_words[i])
            elif i == 1:
                if negated(input_words[i - 1]):
                    neg_count += 1
                    neg_words.append(input_words[i] + ' (with negation)')
                else:
                    pos_count += 1
                    pos_words.append(input_words[i])
            elif i == 0:
                pos_count += 1
                pos_words.append(input_words[i])
 
    print('The results with negation check:', end='\n\n')
    print('The # of positive words:', pos_count)
    print('The # of negative words:', neg_count)
    print('The list of found positive words:', pos_words)
    print('The list of found negative words:', neg_words)
    print('\n', end='')
 
    results = [word_count, pos_count, neg_count, pos_words, neg_words]
 
    return results
 
    
# A sample output
article = '''Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing not able abandon'''
 
tone_count_with_negation_check(lmdict, article)

The results with negation check:

The # of positive words: 0
The # of negative words: 2
The list of found positive words: []
The list of found negative words: ['able (with negation)', 'abandon']



[26, 0, 2, [], ['able (with negation)', 'abandon']]

## 3. BERT classification

In [6]:
# Import all dependencies 
from datasets import load_dataset
import pandas as pd
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader
from transformers import  TrainingArguments, Trainer, EarlyStoppingCallback
from transformers import AutoConfig
import warnings
warnings.filterwarnings("ignore")

# Import the dataset from huggingfaces' dataset repository
fin_dataset = load_dataset('financial_phrasebank', 'sentences_50agree')
df = pd.DataFrame(fin_dataset['train']) # send  it to a pandas dataframe

Reusing dataset financial_phrasebank (/Users/danielagorduza/.cache/huggingface/datasets/financial_phrasebank/sentences_50agree/1.0.0/a6d468761d4e0c8ae215c77367e1092bead39deb08fbf4bffd7c0a6991febbf0)


  0%|          | 0/1 [00:00<?, ?it/s]

In [7]:
### Tokenizer 

In [8]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [9]:
#how does the tokenizer work ? 
print('\nThis is our input sentence : \n Hi my name is BERT and I am overjoyed  to meet you ! \n')

out = tokenizer(['Hi my name is BERT and I am overjoyed  to meet you ! '],
          max_length=64,padding="max_length", truncation=True,return_tensors='pt')
print('These are the outputs of the tokenizer:\n')
print(out)

print('\nThese inputs correspond to the original sentence with separation and padding thrown in :\n')
print([tokenizer.decode(i) for i in out['input_ids']])


This is our input sentence : 
 Hi my name is BERT and I am overjoyed  to meet you ! 

These are the outputs of the tokenizer:

{'input_ids': tensor([[  101,  8790,  1139,  1271,  1110,   139,  9637,  1942,  1105,   146,
          1821,  1166, 18734,  1174,  1106,  2283,  1128,   106,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
     

In [10]:
# Now that we covered  the tokenizer lets introduce the other building block : the model 

print('this is our model : \n')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased')
layers = [i for i in model.parameters()]
print('\n First layer shape (vocabulary size) : \n ',layers[0].shape,
'\n Last layer shape (prediction task output shape) : \n ',layers[-1].shape)

this is our model : 



Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b


 First layer shape (vocabulary size) : 
  torch.Size([28996, 768]) 
 Last layer shape (prediction task output shape) : 
  torch.Size([2])


In [11]:
# basic forward propagation of our BERT model 
print('This is our forward propagation syntax. \n We feed in a tokenized text and receive the \n predicted  logits over the 2 classes : \n')
model.forward(**out)

This is our forward  propagation syntax. 
 We feed in a tokenized text and receive the 
 predicted  logits over the 2 classes : 



SequenceClassifierOutput(loss=None, logits=tensor([[-0.3459, -0.1475]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

In [12]:
# Working with BERT hands-on 

In [13]:
#  define tokenizer & model 
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

# turn the configuration for a 3 sentiment classification task
config = AutoConfig.from_pretrained('bert-base-cased')
config.num_labels = 3

model = AutoModelForSequenceClassification.from_config(config)


In [14]:
train, test = train_test_split(df, test_size=0.25, random_state=96)
test, val = train_test_split(test, test_size=0.4, random_state=96)

In [15]:
# Defining a Dataset object to put our data in


class BERTTutorialDataset(Dataset):
    """
    Special dataset class built on top of the torch Dataset class
    useful to have memory efficient dataloading tokenization batching and trainning.
    
    Huggingface can use these types of dataset as inputs and run all trainning/prediction on them. 
    """
    def __init__(self, input_data, sentiment_targets, tokenizer, max_len):
        """
        Basic generator function for the class.
        -----------------
        input_data : array
            Numpy array of string  input text to use for downstream task 
        sentiment_targets : 
            Numpy array of integers indexed in  the pytorch style of [0,C-1] with C being the total number of classes
            In our example this means the target sentiments should range from 0 to 2. 
        tokenizer  : Huggingface tokenizer 
            The huggingface tokenizer to use
        max_len : 
            The truncation length of the tokenizer 
        -------------------
        
        Returns : 
        
            Tokenized text with inputs, attentions and labels, ready for the Training script. 
        """
        self.input_data = input_data
        self.sentiment_targets = sentiment_targets
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __len__(self):
        """
        Function required by torch huggingface to batch efficiently
        """
        return len(self.input_data)
    
    def __getitem__(self, item):
        text = str(self.input_data[item])
        target = self.sentiment_targets[item]
        # only difference with the previuous tokenization step is the encode-plus for special tokens
        encoding = self.tokenizer.encode_plus(
          text,
          add_special_tokens=True,
          max_length=self.max_len,
          return_token_type_ids=False,
          padding='max_length',
          return_attention_mask=True,
          return_tensors='pt',
          truncation = True
        )
        return {
          'text': text,
          'input_ids': encoding['input_ids'].flatten(),
          'attention_mask': encoding['attention_mask'].flatten(),
          'labels': torch.tensor(target, dtype=torch.long)
        }

In [16]:
# Creating our train-val-test datasets
MAX_LEN = 32
train_ds = BERTTutorialDataset(
    input_data=train['sentence'].to_numpy(),
        sentiment_targets=train['label'].to_numpy(),
        tokenizer=tokenizer,
        max_len=MAX_LEN
    )
val_ds = BERTTutorialDataset(
    input_data=val['sentence'].to_numpy(),
        sentiment_targets=val['label'].to_numpy(),
        tokenizer=tokenizer,
        max_len=MAX_LEN
    )

test_ds = BERTTutorialDataset(
    input_data=test['sentence'].to_numpy(),
        sentiment_targets=test['label'].to_numpy(),
        tokenizer=tokenizer,
        max_len=MAX_LEN
    )


In [17]:
# Define some accuracy measure ( helpful for the early stopping )
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

def compute_metrics(p):
    """
    Function to calculate accuracies and losses for the validation from the predicted outputs
    This is neccessary for the early stopping. 
    """
    pred, labels = p
    pred = np.argmax(pred, axis=1)
    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred, average='macro')
    precision = precision_score(y_true=labels, y_pred=pred, average='macro')
    f1 = f1_score(y_true=labels, y_pred=pred, average='macro')    
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


In [18]:
# Define trainning arguments 
training_args = TrainingArguments('BERT_TUTORIAL_MODEL', overwrite_output_dir=True, evaluation_strategy="steps", 
                                  num_train_epochs=3, weight_decay=0.005,learning_rate=1e-4,
                                  eval_steps=10,metric_for_best_model='accuracy',
                                 per_device_train_batch_size=128, per_device_eval_batch_size=128,
                                 load_best_model_at_end = True, save_total_limit=2, save_steps=10,no_cuda=True
                             )
trainer = Trainer(
    model =model, args=training_args, train_dataset=train_ds, eval_dataset=val_ds,
    callbacks = [EarlyStoppingCallback(early_stopping_patience=20)], compute_metrics=compute_metrics
)

In [19]:
trainer.train()

Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1,Runtime,Samples Per Second
10,No log,0.902163,0.637113,0.212371,0.333333,0.259446,25.1862,19.257
20,No log,0.882709,0.637113,0.212371,0.333333,0.259446,25.0924,19.329
30,No log,0.94681,0.637113,0.212371,0.333333,0.259446,25.0788,19.339
40,No log,0.930707,0.637113,0.212371,0.333333,0.259446,25.1253,19.303
50,No log,0.892915,0.637113,0.212371,0.333333,0.259446,24.8064,19.551
60,No log,0.885687,0.637113,0.212371,0.333333,0.259446,24.659,19.668
70,No log,0.884307,0.637113,0.212371,0.333333,0.259446,31.2608,15.515
80,No log,0.889867,0.637113,0.212371,0.333333,0.259446,26.1874,18.52


TrainOutput(global_step=87, training_loss=0.9643055707558819, metrics={'train_runtime': 2767.9424, 'train_samples_per_second': 0.031, 'total_flos': 226718157361536, 'epoch': 3.0})

In [20]:
trainer.evaluate(test_ds)

{'eval_loss': 0.9398260712623596,
 'eval_accuracy': 0.5955983493810179,
 'eval_precision': 0.19853278312700595,
 'eval_recall': 0.3333333333333333,
 'eval_f1': 0.24885057471264369,
 'eval_runtime': 41.0251,
 'eval_samples_per_second': 17.721,
 'epoch': 3.0}

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns
predictions = trainer.predict(test_ds)

In [None]:
output = np.argmax(predictions.predictions,1)
sns.heatmap(confusion_matrix(test.label.values,output))#,labels = ['1','-1','0']

## 4. Visualisation

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from transformers_interpret import SequenceClassificationExplainer

To visualise which words in each phrase are the most important for the prediction we will use the python package transformers_interpret 

In [None]:
fin_model_name = "ProsusAI/finbert"
model_name = "textattack/bert-base-uncased-SST-2"


fin_model = AutoModelForSequenceClassification.from_pretrained(fin_model_name)
fin_tokenizer = AutoTokenizer.from_pretrained(fin_model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [None]:
# With both the model and tokenizer initialized we are now able to get explanations on an example text.
cls_explainer = SequenceClassificationExplainer(model,
                                                tokenizer)

fin_cls_explainer = SequenceClassificationExplainer(fin_model,
                                                    fin_tokenizer)

In [None]:
word_attributions = cls_explainer("Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing")
word_attributions = fin_cls_explainer("Pharmaceuticals group Orion Corp reported a fall in its third-quarter earnings that were hit by larger expenditures on R&D and marketing")

In [None]:
cls_explainer.predicted_class_name

In [None]:
bert_vis = cls_explainer.visualize()

In [None]:
fin_bert_vis = fin_cls_explainer.visualize()