<a href="https://colab.research.google.com/github/Tonio-V98T/Kaibutsu/blob/main/Production_8A_FinBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FinBERT: Final**

## **Global Settings**

Install libraries

In [None]:
%%capture
!pip install transformers datasets evaluate xformers

Load libraries

In [None]:
from datasets import Dataset, DatasetDict
from scipy.special import softmax
from torch.utils.data import DataLoader
from tqdm.auto import tqdm
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, pipeline)
import datasets
import datetime
import evaluate
import huggingface_hub
import numpy as np
import pandas as pd
import torch
import transformers

Set global printing options

In [None]:
# Set printing options within the whole environment
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('expand_frame_repr', False)

Define computing device

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

cuda:0


## **Data**

Mount Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Load ordered dataset (as in prototype 7)

In [None]:
filepath = "/content/drive/MyDrive/Kaibutsu/db_desc_ordered.csv"

db_desc = pd.read_csv(filepath)
db_desc.drop(db_desc.columns[[0]], axis=1, inplace=True)

print(db_desc.iloc[0:5, :])

         Date                    Company                                                                                                                      Desc
0  2023-01-02  Builders FirstSource Inc.                                      4 Top Long-Term Stocks For 2023: 3 New Picks Join Google (Plus A Bonus Rule Breaker)
1  2023-01-03  Builders FirstSource Inc.                     Advisor Group Inc. boosted its holdings in shares of Builders FirstSource by 26.6% in the 3rd quarter
2  2023-01-03  Builders FirstSource Inc.                                                         Builders FirstSource Inc.: or reduced their stakes in the company
3  2023-01-03  Builders FirstSource Inc.                                                                                     Builders FirstSource Stock Down 0.5 %
4  2023-01-03  Builders FirstSource Inc.  Builders FirstSource Inc.: The company reported $5.20 earnings per share for the quarter, topping the consensus estimate


Convert the DataFrame to a Dataset, which is required to use the "map()" method

In [None]:
db_sa = Dataset.from_dict(db_desc)
print(db_sa)

Dataset({
    features: ['Date', 'Company', 'Desc'],
    num_rows: 31737
})


## **Data preprocessing**

Load tokenizer and data collator

In [None]:
%%capture
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Define custom tokenization function

In [None]:
def tokenize_function(dataset):
    # With batched=True, dataset["Desc"] is a list of strings, so the tokenizer
    # can process the whole batch in a single call. No padding is applied here:
    # the data collator pads each batch dynamically later on.
    tokens = tokenizer(dataset["Desc"], truncation=True)

    return {"input_ids" : tokens["input_ids"],
            "token_type_ids" : tokens["token_type_ids"],
            "attention_mask" : tokens["attention_mask"]}

Apply tokenization

In [None]:
original_columns = db_sa.column_names
tokenized_sentences = db_sa.map(tokenize_function,
                                batched = True,
                                remove_columns = original_columns)

Map:   0%|          | 0/31737 [00:00<?, ? examples/s]

Tokenization check

In [None]:
print(tokenized_sentences, "\n",
      tokenized_sentences["input_ids"][0:3], "\n",
      tokenized_sentences["token_type_ids"][0:3], "\n",
      tokenized_sentences["attention_mask"][0:3])

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 31737
}) 
 [[101, 1018, 2327, 2146, 1011, 2744, 15768, 2005, 16798, 2509, 1024, 1017, 2047, 11214, 3693, 8224, 1006, 4606, 1037, 6781, 3627, 24733, 1007, 102], [101, 8619, 2177, 4297, 1012, 28043, 2049, 9583, 1999, 6661, 1997, 16472, 2034, 6499, 3126, 3401, 2011, 2656, 1012, 1020, 1003, 1999, 1996, 3822, 4284, 102], [101, 16472, 2034, 6499, 3126, 3401, 4297, 1012, 1024, 2030, 4359, 2037, 7533, 1999, 1996, 2194, 102]] 
 [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]] 
 [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]


**DEBUG ONLY: to check that the order of the sentences was kept**

In [None]:
# debug only
# decode() rebuilds one sentence from one list of ids; batch_decode() expects a list of such lists
tokenizer.decode([101, 1018, 2327, 2146, 1011, 2744, 15768, 2005, 16798, 2509, 1024, 1017, 2047, 11214, 3693, 8224, 1006, 4606, 1037, 6781, 3627, 24733, 1007, 102], skip_special_tokens=True)

Use a dataloader to prepare batches

In [None]:
dataloader = DataLoader(tokenized_sentences, batch_size = 16,
                        collate_fn = data_collator)
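
The data collator pads each batch only up to the length of its longest sentence, rather than to a fixed global length. The following is a minimal pure-Python sketch of that idea, not the actual `DataCollatorWithPadding` implementation. The helper name `pad_batch` is hypothetical, and the pad id of 0 is an assumption that matches the zeros visible in the padded batch printed by the check cell.

```python
from typing import Dict, List


def pad_batch(features: List[Dict[str, List[int]]],
              pad_token_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad every sequence in a batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": []}

    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Padding positions get the pad id, segment id 0, and attention 0
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["token_type_ids"].append(f["token_type_ids"] + [0] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)

    return batch


# Two toy "sentences" of different lengths (ids are illustrative)
features = [
    {"input_ids": [101, 1018, 2327, 102],
     "token_type_ids": [0, 0, 0, 0],
     "attention_mask": [1, 1, 1, 1]},
    {"input_ids": [101, 102],
     "token_type_ids": [0, 0],
     "attention_mask": [1, 1]},
]

padded = pad_batch(features)
print(padded["input_ids"])
print(padded["attention_mask"])
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed to the model together with the input ids.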

Check

In [None]:
# debug only

temp = next(iter(dataloader))
print(temp)
print(type(temp))
#print(temp.items())

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


{'input_ids': tensor([[  101,  1018,  2327,  2146,  1011,  2744, 15768,  2005, 16798,  2509,
          1024,  1017,  2047, 11214,  3693,  8224,  1006,  4606,  1037,  6781,
          3627, 24733,  1007,   102,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0],
        [  101,  8619,  2177,  4297,  1012, 28043,  2049,  9583,  1999,  6661,
          1997, 16472,  2034,  6499,  3126,  3401,  2011,  2656,  1012,  1020,
          1003,  1999,  1996,  3822,  4284,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0],
        [  101, 16472,  2034,  6499,  3126,  3401,  4297,  1012,  1024,  2030,
          4359,  2037,  7533,  1999,  1996,  2194,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0],
        [  101, 16472,  2034,  6499,  3126,  3401,  4518,  2091,  1014,  1012,
          1019,  1003,   102,     0,     0,     0,     0,    

## **Model setup**

Load model

In [None]:
%%capture
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert").to(device)

## **Inference**

Process the inputs

In [None]:
print(model.generation_config)

None


In [None]:
outputs = []

model.eval()
with torch.no_grad():
    # Wrapping the dataloader in tqdm advances the progress bar once per batch
    for batch in tqdm(dataloader):
        outputs_temp = model(input_ids = batch["input_ids"].to(device),
                             token_type_ids = batch["token_type_ids"].to(device),
                             attention_mask = batch["attention_mask"].to(device))
        outputs.append(outputs_temp)

print(outputs[0:4])

  0%|          | 0/1984 [00:00<?, ?it/s]


[SequenceClassifierOutput(loss=None, logits=tensor([[ 0.0984, -1.9341,  2.0447],
        [ 2.0180, -2.1471, -1.1404],
        [-1.5233,  0.6568,  1.7242],
        [-1.7823,  2.8259, -0.8059],
        [ 2.0154, -1.5227, -1.6379],
        [ 0.1896,  1.2642, -2.0788],
        [ 0.7737, -1.5080,  0.4280],
        [-1.0280,  1.1669, -0.1524],
        [-1.2686,  1.6720, -0.1076],
        [-1.2449,  1.6704, -0.2486],
        [ 2.0741, -2.0544, -1.3821],
        [-0.1123, -2.0769,  2.4339],
        [-1.5595,  2.0414,  0.4150],
        [ 1.7921, -2.0592, -1.0333],
        [-0.6071, -0.2591,  1.0123],
        [-1.6071,  2.2644, -0.0152]], device='cuda:0'), hidden_states=None, attentions=None), SequenceClassifierOutput(loss=None, logits=tensor([[-1.6703,  2.8214, -0.9254],
        [ 1.9805, -1.0615, -1.8579],
        [ 0.9748, -1.7796,  0.4614],
        [ 0.1858,  0.0106, -0.8916],
        [ 0.3742, -2.4239,  2.1724],
        [ 2.0030, -1.8632, -1.4554],
        [-1.8737,  2.9315, -0.4484],
     

## **Outputs post-processing**

Extract logits

In [None]:
logits = []

for output in outputs:
    logits.append(output.logits)

print(logits[0])

tensor([[ 0.0984, -1.9341,  2.0447],
        [ 2.0180, -2.1471, -1.1404],
        [-1.5233,  0.6568,  1.7242],
        [-1.7823,  2.8259, -0.8059],
        [ 2.0154, -1.5227, -1.6379],
        [ 0.1896,  1.2642, -2.0788],
        [ 0.7737, -1.5080,  0.4280],
        [-1.0280,  1.1669, -0.1524],
        [-1.2686,  1.6720, -0.1076],
        [-1.2449,  1.6704, -0.2486],
        [ 2.0741, -2.0544, -1.3821],
        [-0.1123, -2.0769,  2.4339],
        [-1.5595,  2.0414,  0.4150],
        [ 1.7921, -2.0592, -1.0333],
        [-0.6071, -0.2591,  1.0123],
        [-1.6071,  2.2644, -0.0152]], device='cuda:0')


**[DEBUG ONLY: for one batch of logits]** Extract predicted class and corresponding probability

In [None]:
print(model.config.id2label, "\n")
print(model.config.num_labels)

predicted_class = []
predicted_prob = []


batch_logits = logits[0]
print(batch_logits)

# convert tensor batch to list
batch_logits_list = batch_logits.tolist()
print(batch_logits_list)

for individual_logits in batch_logits_list:
    # numerical classification
    classification_id = np.argmax(individual_logits)
    print(classification_id)

    # string classification
    classification_str = model.config.id2label[classification_id]
    print(classification_str)

    # get predicted probs for each class
    predicted_probabilities = softmax(individual_logits)
    print(predicted_probabilities)

    # highest prob (prob of predicted class)
    highest_probability = max(predicted_probabilities)
    print(highest_probability, "\n")

    # append results
    predicted_class.append(classification_str)
    predicted_prob.append(highest_probability)

# create dictionary
predictions_dict = {"Class" : predicted_class,
                    "Probability" : predicted_prob, }

print(predictions_dict)

# create dataframe to export
predictions = pd.DataFrame.from_dict(predictions_dict)
print(predictions)

tensor([[ 0.0984, -1.9341,  2.0447],
        [ 2.0180, -2.1471, -1.1404],
        [-1.5233,  0.6568,  1.7242],
        [-1.7823,  2.8259, -0.8059],
        [ 2.0154, -1.5227, -1.6379],
        [ 0.1896,  1.2642, -2.0788],
        [ 0.7737, -1.5080,  0.4280],
        [-1.0280,  1.1669, -0.1524]], device='cuda:0')
[[0.09844227135181427, -1.9341180324554443, 2.0447239875793457], [2.0180320739746094, -2.1471097469329834, -1.14035964012146], [-1.5233162641525269, 0.6567901372909546, 1.7242425680160522], [-1.7823010683059692, 2.82588791847229, -0.8059448003768921], [2.0154333114624023, -1.5226938724517822, -1.6378545761108398], [0.1896168291568756, 1.2642391920089722, -2.078840494155884], [0.7737278938293457, -1.508023977279663, 0.42797476053237915], [-1.027980923652649, 1.166912317276001, -0.1524118185043335]]
2
neutral
[0.12294677 0.01610599 0.86094724]
0.8609472394672112 

0
positive
[0.94516034 0.01467599 0.04016367]
0.9451603402650302 

2
neutral
[0.02810986 0.24869489 0.72319525]
0.723

Extract predicted class and corresponding probability

In [None]:
predicted_class = []
predicted_prob = []


for batch_logits in logits:

    # convert tensor batch to list
    batch_logits_list = batch_logits.tolist()

    for individual_logits in batch_logits_list:
        # numerical classification
        classification_id = np.argmax(individual_logits)

        # string classification
        classification_str = model.config.id2label[classification_id]

        # get predicted probs for each class
        predicted_probabilities = softmax(individual_logits)

        # highest prob (prob of predicted class)
        highest_probability = max(predicted_probabilities)

        # append results
        predicted_class.append(classification_str)
        predicted_prob.append(highest_probability)

# create dictionary
predictions_dict = {"Class" : predicted_class,
                    "Probability" : predicted_prob, }

# create dataframe to export
predictions = pd.DataFrame.from_dict(predictions_dict)
print(predictions[0:8])

      Class  Probability
0   neutral     0.860947
1  positive     0.945160
2   neutral     0.723195
3  negative     0.964844
4  positive     0.947891
5  negative     0.726345
6  positive     0.552550
7  negative     0.725329
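
The row-by-row loop above can also be written as one vectorized pass. The sketch below is an alternative under stated assumptions, not the notebook's method: the helper name `classify_logits` is hypothetical, the label order is inferred from the printed outputs (index 0 is positive, 1 is negative, 2 is neutral) rather than read from `model.config.id2label`, and the sample logits are copied from the first batch shown earlier. In the notebook the batches are GPU tensors, so each would first need something like `.cpu().numpy()`.

```python
import numpy as np

# Assumption: label order inferred from the notebook's printed predictions
ID2LABEL = {0: "positive", 1: "negative", 2: "neutral"}


def classify_logits(batches):
    """Return the predicted label and its probability for every row of every batch."""
    # Stack all batches into one (n_sentences, n_classes) array
    all_logits = np.concatenate(
        [np.asarray(b, dtype=np.float64) for b in batches], axis=0)

    # Softmax per row; subtracting the row maximum keeps exp() numerically stable
    shifted = all_logits - all_logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)

    class_ids = probs.argmax(axis=1)
    labels = [ID2LABEL[int(i)] for i in class_ids]
    top_probs = probs.max(axis=1)
    return labels, top_probs


# Rows 0, 1 and 3 of the first batch of logits printed above
batches = [
    [[0.0984, -1.9341, 2.0447], [2.0180, -2.1471, -1.1404]],
    [[-1.7823, 2.8259, -0.8059]],
]

labels, top_probs = classify_logits(batches)
print(labels)
print(top_probs)
```

Because the input logits are rounded to four decimals, the resulting probabilities agree with the table above only to about three or four decimal places.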


## **Export dataset**

Concatenate to sentiment dataset

In [None]:
production_finbert = pd.concat([db_desc, predictions], axis = "columns")
print(production_finbert.iloc[0:10, :])

         Date                    Company                                                                                                                                      Desc     Class  Probability
0  2023-01-02  Builders FirstSource Inc.                                                      4 Top Long-Term Stocks For 2023: 3 New Picks Join Google (Plus A Bonus Rule Breaker)   neutral     0.860947
1  2023-01-03  Builders FirstSource Inc.                                     Advisor Group Inc. boosted its holdings in shares of Builders FirstSource by 26.6% in the 3rd quarter  positive     0.945160
2  2023-01-03  Builders FirstSource Inc.                                                                         Builders FirstSource Inc.: or reduced their stakes in the company   neutral     0.723195
3  2023-01-03  Builders FirstSource Inc.                                                                                                     Builders FirstSource Stock Down 0.5 %  negative    
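
`pd.concat` with `axis = "columns"` aligns rows by index label, not by position, so the join above is only correct when both frames have the same number of rows and the same index. The sketch below shows one way to guard against a silent misalignment. The helper name `attach_predictions` is hypothetical and the toy frames are illustrative, not taken from the dataset.

```python
import pandas as pd


def attach_predictions(source: pd.DataFrame,
                       predictions: pd.DataFrame) -> pd.DataFrame:
    """Join predictions to their source rows by position, failing loudly on a size mismatch."""
    if len(source) != len(predictions):
        raise ValueError(
            f"Row count mismatch: {len(source)} sentences "
            f"vs {len(predictions)} predictions")

    # Resetting both indexes makes the column-wise concat positional
    return pd.concat(
        [source.reset_index(drop=True), predictions.reset_index(drop=True)],
        axis="columns")


# Toy frames whose indexes deliberately differ
source = pd.DataFrame({"Desc": ["sentence a", "sentence b"]}, index=[5, 9])
predictions = pd.DataFrame({"Class": ["neutral", "positive"],
                            "Probability": [0.86, 0.95]})

joined = attach_predictions(source, predictions)
print(joined)
```

Without the index reset, the toy frames above would produce four rows padded with missing values instead of two complete ones.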

Export as .csv

In [None]:
filename = 'production_finbert.csv'
production_finbert.to_csv('/content/drive/MyDrive/Kaibutsu/' + filename)
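
By default `to_csv` also writes the DataFrame index as an unnamed first column, which is likely why the loading cell earlier drops column 0 after reading the file. The sketch below shows the `index=False` variant as an optional alternative, using an in-memory buffer instead of the Drive path. The toy frame is illustrative.

```python
import io

import pandas as pd

frame = pd.DataFrame({"Class": ["neutral", "positive"],
                      "Probability": [0.86, 0.95]})

# Write without the index column, then read the same text back
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(reloaded.columns))
```

With `index=False` the reloaded frame has exactly the original columns, so no column needs to be dropped when a later notebook reads the file.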

## **References**

DataLoader with Hugging Face Datasets: (https://huggingface.co/docs/datasets/use_with_pytorch#use-with-pytorch)

DataLoader class in PyTorch: (https://pytorch.org/docs/stable/data.html)

Padding and its attributes: (https://huggingface.co/docs/transformers/main/en/pad_truncation)

Batch encoding class: (https://huggingface.co/transformers/v3.4.0/_modules/transformers/tokenization_utils_base.html)

BERT for sequence classification, see its forward method: (https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertForSequenceClassification)


**For the next notebook**

Text generation: (https://huggingface.co/docs/transformers/main_classes/text_generation)

Text generation strategies: (https://huggingface.co/docs/transformers/generation_strategies#default-text-generation-configuration)

