In this lab, I work with the financial_phrasebank dataset, a collection of sentences from English-language financial news annotated for sentiment (positive, negative, or neutral). The dataset targets sentiment classification in the financial domain and comes in four configurations that differ by annotator agreement; for example, sentences_50agree contains 4846 sentences on which at least 50% of the annotators agreed. There is no predefined train/validation/test split: each configuration exposes a single train split. For this analysis I chose sentences_allagree, the configuration with 100% annotator agreement. The sentences were extracted from news articles about companies listed on OMX Helsinki, and the labels are encoded as 0 for negative, 1 for neutral, and 2 for positive.
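
As a small sketch of the label encoding described above, the mapping can be written as a pair of dictionaries. The names ID2LABEL, LABEL2ID, and CONFIGS are my own and do not appear elsewhere in the notebook; the configuration names are the ones published with the dataset.

```python
# Class ids used by financial_phrasebank, as described above
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}

# The four configurations, from loosest to strictest annotator agreement
CONFIGS = [
    "sentences_50agree",
    "sentences_66agree",
    "sentences_75agree",
    "sentences_allagree",
]

print(ID2LABEL[2])
print(LABEL2ID["negative"])
```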

In [1]:
!pip install tensorflow
!pip install transformers



In [2]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6


In [3]:
import numpy as np
import pandas as pd
import sklearn

import tensorflow as tf
import transformers
#tqdm is a progress bar
from tqdm import tqdm

In [4]:
from datasets import load_dataset
import pandas as pd

# Load the dataset from Hugging Face with the chosen configuration
dataset = load_dataset('financial_phrasebank', 'sentences_allagree')

# Convert the dataset to a Pandas DataFrame
df1 = pd.DataFrame(dataset['train'])

Downloading builder script:   0%|          | 0.00/6.04k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/13.7k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/8.88k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/682k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2264 [00:00<?, ? examples/s]

In [5]:
df1.head()

Unnamed: 0,sentence,label
0,"According to Gran , the company has no plans t...",1
1,"For the last quarter of 2010 , Componenta 's n...",2
2,"In the third quarter of 2010 , net sales incre...",2
3,Operating profit rose to EUR 13.1 mn from EUR ...,2
4,"Operating profit totalled EUR 21.1 mn , up fro...",2


In [6]:
duplicate_rows = df1[df1.duplicated(keep=False)]
duplicate_rows

Unnamed: 0,sentence,label
518,The issuer is solely responsible for the conte...,1
519,The issuer is solely responsible for the conte...,1
625,The report profiles 614 companies including ma...,1
626,The report profiles 614 companies including ma...,1
928,Ahlstrom 's share is quoted on the NASDAQ OMX ...,1
929,Ahlstrom 's share is quoted on the NASDAQ OMX ...,1
1026,SSH Communications Security Corporation is hea...,1
1027,SSH Communications Security Corporation is hea...,1
1408,The company serves customers in various indust...,1
1409,The company serves customers in various indust...,1


In [7]:
df1.drop_duplicates(keep='first', inplace=True)

In [8]:
# Save the deduplicated DataFrame df1 to disk in CSV format (the file name has no extension)
df1.to_csv('BERT1', index=False)

In [9]:
df1.shape

(2259, 2)
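
The shape is consistent with the earlier cells: 2264 rows were loaded, the duplicate listing shows five pairs, and keeping the first row of each pair removes five rows. The sketch below mimics keep='first' in plain Python on a few made-up rows; the helper name and the toy sentences are hypothetical and only illustrate the rule.

```python
def drop_duplicates_keep_first(rows):
    """Keep the first occurrence of each row and drop later repeats."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

# Toy (sentence, label) rows, not taken from the dataset
rows = [
    ("Profit rose.", 2),
    ("The issuer is responsible.", 1),
    ("The issuer is responsible.", 1),
    ("Sales fell.", 0),
]
deduped = drop_duplicates_keep_first(rows)
print(len(rows), "->", len(deduped))

# Row count from the notebook: 2264 loaded, 5 duplicate pairs, one row dropped per pair
remaining = 2264 - 5
print(remaining)
```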

In [None]:
# Import the BERT classifier and tokenizer classes, plus the input helper classes
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

# Load the pretrained BERT model with a 3-class head, and the matching tokenizer
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=3)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model.summary()

Model: "tf_bert_for_sequence_classification_11"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_459 (Dropout)       multiple                  0         
                                                                 
 classifier (Dense)          multiple                  2307      
                                                                 
Total params: 109484547 (417.65 MB)
Trainable params: 109484547 (417.65 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
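
The parameter counts in the summary can be checked by hand. This sketch assumes the bert-base hidden size of 768, so the dense classifier holds one weight per hidden unit per class plus one bias per class; the variable names are mine.

```python
HIDDEN_SIZE = 768   # hidden size of bert-base models
NUM_LABELS = 3      # negative, neutral, positive

# Dense layer: weight matrix (768 x 3) plus bias vector (3)
classifier_params = HIDDEN_SIZE * NUM_LABELS + NUM_LABELS

# Encoder count as reported by model.summary()
encoder_params = 109482240
total_params = encoder_params + classifier_params

print(classifier_params)
print(total_params)
```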


In [None]:
train = df1[:1800]
test = df1[1800:]
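
The slice above splits the rows in their stored order. I have not checked how the rows are ordered, so as a precaution here is a sketch of a shuffled, stratified split that keeps the label mix similar in both parts. It is an alternative to the positional slice, not what the cells in this notebook ran; stratified_split and its parameters are hypothetical names.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with a similar label mix in both parts."""
    by_label = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_label[lab].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for lab in sorted(by_label):
        idxs = by_label[lab]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])

    # Shuffle again so that classes are interleaved within each part
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx

# Toy label column: 10 negative, 60 neutral, 30 positive
toy_labels = [0] * 10 + [1] * 60 + [2] * 30
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

With the DataFrame from this notebook, the index lists could be applied as df1.iloc[train_idx] and df1.iloc[test_idx].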

In [None]:
def convert_data_to_examples(train, test, sentence, label):
    """Wrap each row of the train and test DataFrames in an InputExample."""
    def to_example(row):
        # guid is a globally unique ID for bookkeeping, unused in this case
        return InputExample(guid=None, text_a=row[sentence], label=row[label])

    train_InputExamples = train.apply(to_example, axis=1)
    validation_InputExamples = test.apply(to_example, axis=1)

    return train_InputExamples, validation_InputExamples

train_InputExamples, validation_InputExamples = convert_data_to_examples(train,  test, 'sentence',  'label')

In [None]:
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in tqdm(examples):
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,    # add '[CLS]' and '[SEP]'
            max_length=max_length,
            padding='max_length',       # pad on the right up to max_length (replaces the deprecated pad_to_max_length=True)
            truncation=True,            # truncate inputs longer than max_length
            return_token_type_ids=True,
            return_attention_mask=True,
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],input_dict["token_type_ids"], input_dict['attention_mask'])
        features.append(InputFeatures( input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label) )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )


DATA_COLUMN = 'sentence'
LABEL_COLUMN = 'label'
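
To make the padding and truncation step concrete, here is a plain-Python sketch of what the tokenizer output looks like for a fixed max_length. It assumes the bert-base-uncased special-token ids (101 for [CLS], 102 for [SEP], 0 for [PAD]); pad_and_mask is a hypothetical helper and the other token ids are placeholders, not real vocabulary lookups.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids in bert-base-uncased

def pad_and_mask(token_ids, max_length):
    """Add special tokens, truncate, then pad on the right to max_length."""
    body = token_ids[: max_length - 2]          # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)       # 1 marks real tokens

    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad               # 0 marks padding
    token_type_ids = [0] * max_length           # single-sentence input
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }

short = pad_and_mask([2000, 2001, 2002], max_length=8)
long = pad_and_mask(list(range(2000, 2020)), max_length=8)
print(short["input_ids"])
print(short["attention_mask"])
print(long["input_ids"])
```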

In [None]:
train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(16).repeat(2)

100%|██████████| 1800/1800 [00:02<00:00, 806.90it/s] 
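
Because the training pipeline calls repeat(2) after batching and the model is later fitted with epochs=2, each training sentence is visited four times in total. The arithmetic below is a sketch using the numbers from these cells; the variable names are mine.

```python
import math

n_train = 1800      # rows in the training slice
batch_size = 16
repeats = 2         # from .repeat(2)
epochs = 2          # from model.fit(..., epochs=2)

batches_per_pass = math.ceil(n_train / batch_size)
last_batch_size = n_train - (batches_per_pass - 1) * batch_size
steps_per_epoch = batches_per_pass * repeats
passes_over_data = repeats * epochs

print(batches_per_pass, last_batch_size, steps_per_epoch, passes_over_data)
```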


In [None]:
validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(16)

100%|██████████| 459/459 [00:00<00:00, 902.84it/s]


In [None]:
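# The learning rate used here (3e-8) is far smaller than the 2e-5 to 5e-5 range commonly used
# to fine-tune BERT, so two epochs change the weights very little.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-8),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])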

model.fit(train_data, epochs=2, validation_data=validation_data)

Epoch 1/2
Epoch 2/2


<keras.src.callbacks.History at 0x7cccb424b310>
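
The loss above takes integer class ids and raw logits (from_logits=True), so it applies softmax internally before taking the negative log-probability of the true class. The following plain-Python sketch shows that computation for a single example; the function names and logits are illustrative, and it is not the Keras implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(logits, label):
    """Negative log-probability of the integer class id `label`."""
    return -math.log(softmax(logits)[label])

# A model that has learned nothing gives equal logits: loss is ln(3)
uniform_loss = sparse_categorical_crossentropy([0.0, 0.0, 0.0], label=2)

# A confident, correct prediction gives a much smaller loss
confident_loss = sparse_categorical_crossentropy([-2.0, 0.0, 4.0], label=2)

print(round(uniform_loss, 4))
print(confident_loss < uniform_loss)
```

A training loss that stays near ln(3), about 1.0986, would suggest that the three-class head is still close to guessing uniformly.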

In [None]:
pred_sentences = [
    "A new tech startup revolutionizes the market with record-breaking profits, boosting investor confidence",
    "Investors eagerly anticipate the upcoming earnings report, expecting positive results",
    "Market fluctuations create uncertainty among traders and investors, challenging market stability",
    "The Federal Reserve's decision to raise interest rates sparks significant reactions in bond markets",
    "Analysts express strong optimism for the tech sector, attributing it to innovative product sales",
    "Economic indicators raise concerns about a potential recession looming in the near future",
    "The company's stock takes a nosedive following a disappointing quarterly report, worrying investors",
    "Investors seek refuge in safe-haven assets amidst rising geopolitical tensions",
    "The company's declaration of bankruptcy leads to widespread job losses and economic turmoil"]

In [None]:
# Tokenize the sentences before sending them to the trained model
tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
# tf_outputs[0] holds the logits; softmax over the last axis turns them into class probabilities
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
# The argmax index is the class id itself: 0 = negative, 1 = neutral, 2 = positive
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
    print(pred_sentences[i], ": ", label[i])

A new tech startup revolutionizes the market with record-breaking profits, boosting investor confidence :  0
Investors eagerly anticipate the upcoming earnings report, expecting positive results :  2
Market fluctuations create uncertainty among traders and investors, challenging market stability :  0
The Federal Reserve's decision to raise interest rates sparks significant reactions in bond markets :  0
Analysts express strong optimism for the tech sector, attributing it to innovative product sales :  2
Economic indicators raise concerns about a potential recession looming in the near future :  2
The company's stock takes a nosedive following a disappointing quarterly report, worrying investors :  0
Investors seek refuge in safe-haven assets amidst rising geopolitical tensions :  0
The company's declaration of bankruptcy leads to widespread job losses and economic turmoil :  0
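
Printing names instead of numeric ids makes the predictions easier to read. This sketch assumes the class ids from the dataset description (0 negative, 1 neutral, 2 positive) and uses made-up logits; predict_label is a hypothetical helper. Since softmax preserves ordering, taking the argmax of the logits gives the same class as taking it over the probabilities.

```python
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def predict_label(logits):
    """Return the sentiment name for the largest logit."""
    class_id = max(range(len(logits)), key=lambda i: logits[i])
    return ID2LABEL[class_id]

# Made-up logits for three sentences, one row per sentence
example_logits = [
    [-1.2, 0.3, 2.1],
    [1.7, 0.2, -0.9],
    [0.1, 0.8, 0.4],
]
names = [predict_label(row) for row in example_logits]
print(names)
```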
