# FinBERT Example Notebook

This notebook shows how to train and use the FinBERT pre-trained language model for financial sentiment analysis.

## Modules 

In [2]:
from pathlib import Path
import shutil
import os
import logging
import sys
sys.path.append('..')

from textblob import TextBlob
from pprint import pprint
from sklearn.metrics import classification_report

from transformers import AutoModelForSequenceClassification

from finbert.finbert import *
import finbert.utils as tools

%load_ext autoreload
%autoreload 2

project_dir = Path.cwd().parent
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated


In [3]:
logging.basicConfig(format = '%(asctime)s - %(levelname)s - %(name)s -   %(message)s',
                    datefmt = '%m/%d/%Y %H:%M:%S',
                    level = logging.ERROR)

## Prepare the model

### Setting path variables:
1. `lm_path`: the path to the pre-trained language model (not needed if vanilla BERT is used).
2. `cl_path`: the path where the classification model is saved.
3. `cl_data_path`: the path of the directory that contains the data files of `train.csv`, `validation.csv`, `test.csv`.
---

In the initialization of `bertmodel`, we can use either the original pre-trained weights from Google, by passing `bm = 'bert-base-uncased'`, or our further pre-trained language model, by passing `bm = lm_path`.


---
All model configuration is controlled through the `config` variable.

In [4]:
lm_path = 'ProsusAI/finbert'
cl_path = 'finbert-sentiment'
cl_data_path = project_dir/'..'/'Data'/'throughput'/'A_B'
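
Before training, it can help to confirm that `cl_data_path` really contains the three expected files. The following is a minimal sketch; `missing_data_files` is a hypothetical helper for illustration and is not part of the `finbert` package.

```python
from pathlib import Path

# File names taken from the description of `cl_data_path` above.
EXPECTED_FILES = ("train.csv", "validation.csv", "test.csv")


def missing_data_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]
```

Calling `missing_data_files(cl_data_path)` should return an empty list when the directory is complete.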

### Configuring training parameters

You can find explanations of the training parameters in the class docstrings.

In [5]:
# Remove any previous contents of cl_path; a missing directory is not an error
shutil.rmtree(cl_path, ignore_errors=True)

bertmodel = AutoModelForSequenceClassification.from_pretrained(lm_path,cache_dir=None, num_labels=3)


config = Config(   data_dir=cl_data_path,
                   bert_model=bertmodel,
                   num_train_epochs=6,
                   model_dir=cl_path,
                   max_seq_length = 48,
                   train_batch_size = 32,
                   learning_rate = 2e-5,
                   output_mode='classification',
                   warm_up_proportion=0.2,
                   local_rank=-1,
                   discriminate=True,
                   gradual_unfreeze=True)
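
The `discriminate` option refers to discriminative fine-tuning, in which lower encoder layers are trained with smaller learning rates than higher ones. The sketch below only illustrates the idea: the decay factor `1.2` and the helper name `layerwise_learning_rates` are assumptions made for this example and are not taken from this notebook, so consult the class docstrings for the values `finbert` actually uses.

```python
def layerwise_learning_rates(base_lr, num_layers=12, decay=1.2):
    """Assign each encoder layer a learning rate that shrinks toward the input.

    The top layer (index num_layers - 1) gets base_lr; each layer below it
    is divided by `decay` once more. decay=1.2 is an illustrative assumption.
    """
    return [base_lr / (decay ** (num_layers - 1 - i)) for i in range(num_layers)]


rates = layerwise_learning_rates(2e-5)
for layer_index in (0, 11):
    print(f"layer {layer_index:2d}: lr = {rates[layer_index]:.2e}")
```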

`FinBert` is the main class that encapsulates all the functionality; its instance is named `finbert` here. The list of class labels should be passed to the `prepare_model` method through the `label_list` parameter.

In [17]:
finbert = FinBert(config)
#finbert.base_model = lm_path
finbert.base_model = 'bert-base-uncased'
finbert.config.discriminate=True
finbert.config.gradual_unfreeze=True

In [18]:
finbert.prepare_model(label_list=['positive','negative','neutral'])

05/20/2022 15:55:23 - INFO - finbert.finbert -   device: cpu n_gpu: 0, distributed training: False, 16-bits training: False
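
The integer classes `0`, `1` and `2` that appear in the classification report further down correspond to positions in `label_list`. A minimal sketch of that mapping, assuming labels are indexed in the order they are passed (the variable names are illustrative):

```python
label_list = ['positive', 'negative', 'neutral']

# Assumed convention: a label's id is its position in label_list.
label_to_id = {label: i for i, label in enumerate(label_list)}
id_to_label = {i: label for label, i in label_to_id.items()}

print(label_to_id)  # {'positive': 0, 'negative': 1, 'neutral': 2}
```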


## Fine-tune the model

In [19]:
# Get the training examples
train_data = finbert.get_data('train')

In [20]:
model = finbert.create_the_model()

### [Optional] Fine-tune only a subset of the model
The variable `freeze` sets how many of the 12 encoder layers are frozen, counting from the bottom; the embeddings are frozen as well. You can skip this part if you want to fine-tune the whole model.

<span style="color:red">Important: </span>
Execute this step if you want a shorter training time at the expense of accuracy.

In [21]:
# This is for fine-tuning a subset of the model.

freeze = 3

for param in model.bert.embeddings.parameters():
    param.requires_grad = False
    
for i in range(freeze):
    for param in model.bert.encoder.layer[i].parameters():
        param.requires_grad = False
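
To see how much of the model a given `freeze` value leaves trainable, one can compare trainable and total parameter counts. The helper below is a sketch and not part of `finbert`; it relies only on the standard PyTorch conventions that a module exposes `parameters()` and that each parameter has `numel()` and `requires_grad`.

```python
def count_parameters(model):
    """Return (trainable, total) parameter counts for a PyTorch-style module."""
    trainable = 0
    total = 0
    for param in model.parameters():
        count = param.numel()
        total += count
        # Frozen parameters have requires_grad set to False.
        if param.requires_grad:
            trainable += count
    return trainable, total
```

Running `count_parameters(model)` before and after the freezing loop shows how many parameters were taken out of training.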

### Training

In [22]:
trained_model = finbert.train(train_examples = train_data, model = model)

Token indices sequence length is longer than the specified maximum sequence length for this model (939 > 512). Running this sequence through the model will result in indexing errors
05/20/2022 15:55:33 - INFO - finbert.utils -   *** Example ***
05/20/2022 15:55:33 - INFO - finbert.utils -   guid: train-1
05/20/2022 15:55:33 - INFO - finbert.utils -   tokens: [CLS] thanks ed revenue for the third quarter increased 1 % to ##in ##que ##ncy forecast continues to suggest flat to lower loss rates in 2018 . this trend allowed us to slightly drop our reserve rate during the quarter while maintaining 12 months of forward coverage [SEP]
05/20/2022 15:55:33 - INFO - finbert.utils -   input_ids: 101 4283 3968 6599 2005 1996 2353 4284 3445 1015 1003 2000 2378 4226 9407 19939 4247 2000 6592 4257 2000 2896 3279 6165 1999 2760 1012 2023 9874 3039 2149 2000 3621 4530 2256 3914 3446 2076 1996 4284 2096 8498 2260 2706 1997 2830 6325 102
05/20/2022 15:55:33 - INFO - finbert.utils -   attention_mask: 1 1 1

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]

05/20/2022 15:55:42 - INFO - finbert.utils -   *** Example ***
05/20/2022 15:55:42 - INFO - finbert.utils -   guid: validation-1
05/20/2022 15:55:42 - INFO - finbert.utils -   tokens: [CLS] thank you hello , everyone , and welcome to eco ##lab , and market environment , we expect to deliver strong adjusted dil ##uted eps growth in 2017 . and now , here ' s doug baker with some comments thanks that concludes our formal remarks [SEP]
05/20/2022 15:55:42 - INFO - finbert.utils -   input_ids: 101 4067 2017 7592 1010 3071 1010 1998 6160 2000 17338 20470 1010 1998 3006 4044 1010 2057 5987 2000 8116 2844 10426 29454 12926 20383 3930 1999 2418 1012 1998 2085 1010 2182 1005 1055 8788 6243 2007 2070 7928 4283 2008 14730 2256 5337 12629 102
05/20/2022 15:55:42 - INFO - finbert.utils -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
05/20/2022 15:55:42 - INFO - finbert.utils -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

Validating:   0%|          | 0/2 [00:00<?, ?it/s]

Validation losses: [2.1675195693969727]
No best model found


Epoch:  17%|████████████▊                                                                | 1/6 [00:12<01:03, 12.72s/it]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]


Validating:   0%|          | 0/2 [00:00<?, ?it/s]

Validation losses: [2.1675195693969727, 1.1219532489776611]


Epoch:  33%|█████████████████████████▋                                                   | 2/6 [00:29<01:00, 15.20s/it]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]


Validating:   0%|          | 0/2 [00:00<?, ?it/s]

Validation losses: [2.1675195693969727, 1.1219532489776611, 1.0972586274147034]


Epoch:  50%|██████████████████████████████████████▌                                      | 3/6 [00:49<00:51, 17.21s/it]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]


Validating:   0%|          | 0/2 [00:00<?, ?it/s]

Validation losses: [2.1675195693969727, 1.1219532489776611, 1.0972586274147034, 1.0836488604545593]


Epoch:  67%|███████████████████████████████████████████████████▎                         | 4/6 [01:12<00:39, 19.52s/it]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]


Validating:   0%|          | 0/2 [00:00<?, ?it/s]

Validation losses: [2.1675195693969727, 1.1219532489776611, 1.0972586274147034, 1.0836488604545593, 1.0793660283088684]


Epoch:  83%|████████████████████████████████████████████████████████████████▏            | 5/6 [01:46<00:24, 24.95s/it]

Iteration:   0%|          | 0/6 [00:00<?, ?it/s]


Validating:   0%|          | 0/2 [00:00<?, ?it/s]

Validation losses: [2.1675195693969727, 1.1219532489776611, 1.0972586274147034, 1.0836488604545593, 1.0793660283088684, 1.0793660283088684]


Epoch: 100%|█████████████████████████████████████████████████████████████████████████████| 6/6 [02:15<00:00, 22.58s/it]
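
The validation losses printed above drop sharply after the first epoch and then flatten out. A small sketch for picking the best epoch from such a list (the values are copied from the log; `best_epoch` is an illustrative helper, not a `finbert` function):

```python
validation_losses = [
    2.1675195693969727,
    1.1219532489776611,
    1.0972586274147034,
    1.0836488604545593,
    1.0793660283088684,
    1.0793660283088684,
]


def best_epoch(losses):
    """Return the 0-based index of the first epoch with the lowest loss."""
    return min(range(len(losses)), key=losses.__getitem__)


epoch = best_epoch(validation_losses)
print(f"best epoch: {epoch + 1}, loss: {validation_losses[epoch]:.4f}")
# best epoch: 5, loss: 1.0794
```

The last two values are identical, so the sixth epoch brought no further improvement on the validation set.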


## Test the model

`finbert.evaluate` returns a DataFrame that contains the true label and the logit values for each example.

In [23]:
test_data = finbert.get_data('test')

In [24]:
results = finbert.evaluate(examples=test_data, model=trained_model)

05/20/2022 15:57:50 - INFO - finbert.utils -   *** Example ***
05/20/2022 15:57:50 - INFO - finbert.utils -   guid: test-1
05/20/2022 15:57:50 - INFO - finbert.utils -   tokens: [CLS] thank you , doug , and good morning we ' re to informing the market of our strategy by that time we continue to believe that we will be able to lower our pharmaceutical cost by more than $ 3 billion annually in 2020 and beyond [SEP]
05/20/2022 15:57:50 - INFO - finbert.utils -   input_ids: 101 4067 2017 1010 8788 1010 1998 2204 2851 2057 1005 2128 2000 21672 1996 3006 1997 2256 5656 2011 2008 2051 2057 3613 2000 2903 2008 2057 2097 2022 2583 2000 2896 2256 13859 3465 2011 2062 2084 1002 1017 4551 6604 1999 12609 1998 3458 102
05/20/2022 15:57:50 - INFO - finbert.utils -   attention_mask: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
05/20/2022 15:57:50 - INFO - finbert.utils -   token_type_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

Testing:   0%|          | 0/2 [00:00<?, ?it/s]

### Prepare the classification report

In [25]:
def report(df, cols=['label','prediction','logits']):
    """Print loss, accuracy and a classification report.

    cols holds the column names for the true labels, the predicted
    labels and the logits, in that order.
    """
    cs = CrossEntropyLoss(weight=finbert.class_weights)
    loss = cs(torch.tensor(list(df[cols[2]])), torch.tensor(list(df[cols[0]])))
    print("Loss:{0:.2f}".format(loss))
    print("Accuracy:{0:.2f}".format((df[cols[0]] == df[cols[1]]).sum() / df.shape[0]))
    print("\nClassification Report:")
    print(classification_report(df[cols[0]], df[cols[1]]))

In [26]:
results['prediction'] = results.predictions.apply(lambda x: np.argmax(x,axis=0))

In [27]:
report(results,cols=['labels','prediction','predictions'])

Loss:1.13
Accuracy:0.27

Classification Report:
              precision    recall  f1-score   support

           0       0.39      0.28      0.33        25
           1       0.29      0.21      0.24        24
           2       0.15      0.50      0.23         6

    accuracy                           0.27        55
   macro avg       0.28      0.33      0.27        55
weighted avg       0.32      0.27      0.28        55
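
The `macro avg` and `weighted avg` rows differ only in how the per-class scores are combined. The sketch below recomputes both for recall from the rounded values in the table above, so the results agree with the report only up to rounding.

```python
# Per-class recall and support, copied from the report above.
recall = [0.28, 0.21, 0.50]
support = [25, 24, 6]

# Macro average: unweighted mean over classes.
macro_recall = sum(recall) / len(recall)

# Weighted average: mean over classes, weighted by support.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(f"macro avg recall:    {macro_recall:.2f}")
print(f"weighted avg recall: {weighted_recall:.2f}")
```

Weighted-average recall is the same quantity as overall accuracy, which is why both are reported as 0.27, while the small third class (support 6, recall 0.50) lifts the macro average to 0.33.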





### Get predictions

With the `predict` function, given a piece of text, we split it into a list of sentences and then predict sentiment for each sentence. The output is written into a dataframe. Predictions are represented in three different columns: 

1) `logit`: probabilities for each class

2) `prediction`: predicted label

3) `sentiment_score`: sentiment score calculated as: probability of positive - probability of negative
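
As a sketch of how these three values relate, the snippet below turns raw model outputs into class probabilities with a softmax and then derives the predicted label and the sentiment score. The input numbers are made up for illustration, and the label order assumes the `label_list` passed to `prepare_model` earlier.

```python
import math

labels = ['positive', 'negative', 'neutral']


def softmax(values):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


raw_scores = [-1.2, 2.3, 0.4]  # hypothetical model outputs for one sentence
probs = softmax(raw_scores)

prediction = labels[probs.index(max(probs))]
sentiment_score = probs[0] - probs[1]  # P(positive) - P(negative)

print(prediction, round(sentiment_score, 2))
```

The score lies between -1 and 1: values near -1 indicate a confidently negative sentence, values near 1 a confidently positive one.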

Below we analyze a paragraph taken from [this](https://www.economist.com/finance-and-economics/2019/01/03/a-profit-warning-from-apple-jolts-markets) article in The Economist. For comparison, we also include the sentiment predicted by TextBlob.
> Later that day Apple said it was revising down its earnings expectations in the fourth quarter of 2018, largely because of lower sales and signs of economic weakness in China. The news rapidly infected financial markets. Apple’s share price fell by around 7% in after-hours trading and the decline was extended to more than 10% when the market opened. The dollar fell by 3.7% against the yen in a matter of minutes after the announcement, before rapidly recovering some ground. Asian stockmarkets closed down on January 3rd and European ones opened lower. Yields on government bonds fell as investors fled to the traditional haven in a market storm.

In [16]:
text = "Later that day Apple said it was revising down its earnings expectations in \
the fourth quarter of 2018, largely because of lower sales and signs of economic weakness in China. \
The news rapidly infected financial markets. Apple’s share price fell by around 7% in after-hours \
trading and the decline was extended to more than 10% when the market opened. The dollar fell \
by 3.7% against the yen in a matter of minutes after the announcement, before rapidly recovering \
some ground. Asian stockmarkets closed down on January 3rd and European ones opened lower. \
Yields on government bonds fell as investors fled to the traditional haven in a market storm."

In [17]:
cl_path = project_dir/'models'/'classifier_model'/'finbert-sentiment'
model = AutoModelForSequenceClassification.from_pretrained(cl_path, cache_dir=None, num_labels=3)

404 Client Error: Not Found for url: https://huggingface.co/C:%5CUsers%5Cpole1%5CPycharmProjects%5CNLP%5CfinBERT-master%5Cmodels%5Cclassifier_model%5Cfinbert-sentiment/resolve/main/config.json


OSError: Can't load config for 'C:\Users\pole1\PycharmProjects\NLP\finBERT-master\models\classifier_model\finbert-sentiment'. Make sure that:

- 'C:\Users\pole1\PycharmProjects\NLP\finBERT-master\models\classifier_model\finbert-sentiment' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'C:\Users\pole1\PycharmProjects\NLP\finBERT-master\models\classifier_model\finbert-sentiment' is the correct path to a directory containing a config.json file
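
The error above means that no `config.json` was found under `cl_path`, so the path was looked up as a Hugging Face Hub model identifier instead. A minimal sketch of a check that could be run before calling `from_pretrained`; `has_saved_model` is a hypothetical helper, and it tests only for `config.json`, the file named in the error message.

```python
from pathlib import Path


def has_saved_model(model_dir):
    """Return True if model_dir is a directory containing config.json."""
    model_dir = Path(model_dir)
    return model_dir.is_dir() and (model_dir / "config.json").is_file()
```

If `has_saved_model(cl_path)` is false, `cl_path` does not point at a directory where a trained classifier was saved.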



In [None]:
import nltk
nltk.download('punkt')

In [None]:
result = predict(text,model)

In [None]:
blob = TextBlob(text)
result['textblob_prediction'] = [sentence.sentiment.polarity for sentence in blob.sentences]
result

In [None]:
print(f'Average sentiment is {result.sentiment_score.mean():.2f}.')

Here is another example:

In [None]:
text2 = "Shares in the spin-off of South African e-commerce group Naspers surged more than 25% \
in the first minutes of their market debut in Amsterdam on Wednesday. Bob van Dijk, CEO of \
Naspers and Prosus Group poses at Amsterdam's stock exchange, as Prosus begins trading on the \
Euronext stock exchange in Amsterdam, Netherlands, September 11, 2019. REUTERS/Piroschka van de Wouw \
Prosus comprises Naspers’ global empire of consumer internet assets, with the jewel in the crown a \
31% stake in Chinese tech titan Tencent. There is 'way more demand than is even available, so that’s \
good,' said the CEO of Euronext Amsterdam, Maurice van Tilburg. 'It’s going to be an interesting \
hour of trade after opening this morning.' Euronext had given an indicative price of 58.70 euros \
per share for Prosus, implying a market value of 95.3 billion euros ($105 billion). The shares \
jumped to 76 euros on opening and were trading at 75 euros at 0719 GMT."

In [None]:
result2 = predict(text2,model)
blob = TextBlob(text2)
result2['textblob_prediction'] = [sentence.sentiment.polarity for sentence in blob.sentences]

In [None]:
result2

In [None]:
print(f'Average sentiment is {result2.sentiment_score.mean():.2f}.')