Reference doc: https://huggingface.co/docs/transformers/tasks/sequence_classification


- Show the runtime used (T4)
- Upload the financial_sentiment_data.csv file to Colab

In [1]:
pip install transformers datasets evaluate accelerate



Importing required libraries

In [1]:
import numpy as np
import pandas as pd

https://www.kaggle.com/datasets/sbhatti/financial-sentiment-analysis?select=data.csv

In [2]:
financial_data = pd.read_csv("/content/financial_sentiment_data.csv")

financial_data.sample(10)

Unnamed: 0,Sentence,Sentiment
1959,Entering long Lockheed Martin $LMT at Thursday...,positive
4058,long $SDLP $CPGX $TWTR $BITA $LABU $FB $TSLA $...,positive
194,Stora is due to release its fourth-quarter and...,neutral
877,Our customers include companies in the energy ...,neutral
680,"The one dark spot on the horizon , however , w...",neutral
3902,According to Saarioinen 's Managing Director I...,neutral
4905,15 September 2010 - Finnish electrical compone...,neutral
5485,Net cash from operating activities was a negat...,neutral
5335,The EA Reng group posted sales of approximatel...,neutral
4460,"After non-recurring items of EUR 177mn , profi...",neutral


Checking the dimensions of the data

In [3]:
financial_data.shape

(5842, 2)

In [4]:
financial_data["Sentiment"].value_counts()

Sentiment,count
neutral,3130
positive,1852
negative,860
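The counts above show that the classes are imbalanced, which matters when reading the accuracy numbers later. A minimal sketch, using only the counts printed above, of the class shares and of the accuracy a trivial "always predict the most frequent class" baseline would reach (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {"neutral": 3130, "positive": 1852, "negative": 860}

total = sum(counts.values())

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in shares.items():
    print(f"{label:>8}: {share:.1%}")
print(f"majority baseline ({majority_label}): {baseline_accuracy:.1%}")
```

Any fine-tuned model should beat this baseline accuracy to be considered useful.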


In [5]:
# Map each sentiment string to an integer class id
financial_data["label"] = financial_data["Sentiment"].map({"negative": 0, "neutral": 1, "positive": 2})

financial_data.sample(10)

Unnamed: 0,Sentence,Sentiment,label
4941,The company booked April-June new orders worth...,positive,2
158,"In future , the company intends to look for kn...",positive,2
5805,"The new office , located in Shenzhen , will st...",positive,2
3650,"The total scope of the project is about 38,000...",neutral,1
4058,long $SDLP $CPGX $TWTR $BITA $LABU $FB $TSLA $...,positive,2
439,The prices of stainless steel also rose in Eur...,neutral,1
1302,"In the Baltic countries , Atria 's target is o...",neutral,1
5266,Nokia has inaugurated its manufacturing plant ...,neutral,1
353,The new majority owners of Aspocomp Thailand C...,neutral,1
2818,The facility consists of a seven year bullet t...,neutral,1


Converting the pandas DataFrame to a Hugging Face Dataset

In [6]:
from datasets import Dataset

financial_ds = Dataset.from_pandas(financial_data)

financial_ds

Dataset({
    features: ['Sentence', 'Sentiment', 'label'],
    num_rows: 5842
})

class_encode_column casts the given column to datasets.features.ClassLabel and updates the table. A ClassLabel column is needed for the stratified split in the next step.

In [7]:
financial_ds = financial_ds.class_encode_column("label")

Stringifying the column:   0%|          | 0/5842 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/5842 [00:00<?, ? examples/s]

Splitting the dataset into a train and a test set with the train_test_split method, stratified by label so both splits keep the same class proportions

In [8]:
financial_ds = financial_ds.train_test_split(test_size = 0.2, stratify_by_column = 'label', seed = 123)

financial_ds

DatasetDict({
    train: Dataset({
        features: ['Sentence', 'Sentiment', 'label'],
        num_rows: 4673
    })
    test: Dataset({
        features: ['Sentence', 'Sentiment', 'label'],
        num_rows: 1169
    })
})
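To make the idea of stratification concrete, here is a rough pure-Python sketch. It is not the algorithm the datasets library uses; the helper stratified_split and its per-class rounding are assumptions for illustration, so the exact row counts can differ slightly from the split above.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label, indices in sorted(by_class.items()):
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels with the same class counts as the dataset above.
labels = [0] * 860 + [1] * 3130 + [2] * 1852
train_idx, test_idx = stratified_split(labels, test_size=0.2, seed=123)

test_counts = Counter(labels[i] for i in test_idx)
for label in sorted(test_counts):
    print(label, round(test_counts[label] / len(test_idx), 3))
```

The printed shares in the test set match the shares in the full data, which is the point of stratifying.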

Looking at a few training instances: positive, neutral, and negative examples can all be seen

In [11]:
financial_ds['train'][1]

{'Sentence': '$YHOO A breakout above $29.83 would constitute a technical entry for the short term trader. http://stks.co/jkUF',
 'Sentiment': 'positive',
 'label': 2}

In [13]:
financial_ds['train'][2]

{'Sentence': 'New Chairman of the Board of Directors , Mr Chaim Katzman , will give a presentation and answer questions .',
 'Sentiment': 'neutral',
 'label': 1}

In [14]:
financial_ds['train'][26]

{'Sentence': '@chessNwine: $IWM 30-Minute Chart. Small caps threatening descending triangle breakdown under $110.20.  http://stks.co/r0KKm',
 'Sentiment': 'negative',
 'label': 0}

https://huggingface.co/roberta-base

NOTE that this is a model pre-trained using masked language modeling (MLM), and we are going to fine-tune it on a classification task.

Here we load the roberta-base tokenizer from the Hugging Face Hub

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

Creating a preprocessing function that tokenizes the text and truncates sequences so they are no longer than the roberta-base maximum input length (512 tokens)

In [10]:
def preprocess_function(examples):
    return tokenizer(examples["Sentence"], truncation = True)

In [11]:
tokenizer.vocab_size

50265

To apply the preprocessing function over the entire dataset, use 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once:

In [12]:
tokenized_financial_ds = financial_ds.map(preprocess_function, batched = True)

Map:   0%|          | 0/4673 [00:00<?, ? examples/s]

Map:   0%|          | 0/1169 [00:00<?, ? examples/s]
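With batched=True, the preprocessing function receives a dictionary of columns where each value is a list, not a single row. The sketch below mimics that calling convention in plain Python; batched_map and toy_tokenize are hypothetical stand-ins, not the datasets or tokenizer implementation.

```python
def toy_tokenize(texts):
    # Hypothetical stand-in for a tokenizer: whitespace split, one list per text.
    return {"tokens": [text.split() for text in texts]}

def preprocess_function(examples):
    # With batched=True, examples["Sentence"] is a list of strings.
    return toy_tokenize(examples["Sentence"])

def batched_map(columns, fn, batch_size):
    """Apply fn to slices of a column dict and add the columns it returns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}

data = {
    "Sentence": ["Profit rose .", "Sales fell sharply .", "No change ."],
    "label": [2, 0, 1],
}
result = batched_map(data, preprocess_function, batch_size=2)
print(result["tokens"])
```

The original columns are kept and the returned columns are added next to them, which is why the tokenized dataset still contains Sentence, Sentiment and label.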

In [13]:
financial_ds["train"][45]

{'Sentence': '( ADP News ) - Feb 4 , 2009 - Finnish broadband data communication systems and solutions company Teleste Oyj ( HEL : TLT1V ) said today its net profit decreased to EUR 5.5 million ( USD 7.2 m ) for 2008 from EUR 9.4 million for 200',
 'Sentiment': 'negative',
 'label': 0}

The same instance after tokenization, with input_ids and attention_mask added

In [14]:
tokenized_financial_ds["train"][45]

{'Sentence': '( ADP News ) - Feb 4 , 2009 - Finnish broadband data communication systems and solutions company Teleste Oyj ( HEL : TLT1V ) said today its net profit decreased to EUR 5.5 million ( USD 7.2 m ) for 2008 from EUR 9.4 million for 200',
 'Sentiment': 'negative',
 'label': 0,
 'input_ids': [0, 1640, 4516, 510, 491, 4839, 111, 1927, 204, 2156,
  2338, 111, 21533, 11451, 414, 4358, 1743, 8, 2643, 138,
  5477, 13967, 17311, 267, 36, 39509, 4832, 27017, 565, 134,
  846, 4839, 26, 452, 63, 1161, 1963, 8065, 7, 10353,
  195, 4, 245, 153, 36, 6775, 262, 4, 176, 475,
  4839, 13, 2266, 31, 10353, 361, 4, 306, 153, 13,
  1878, 2],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1, 1, 1

Decoding the token ids to get back the input text (note the added <s> and </s> special tokens)

In [15]:
tokenizer.decode(tokenized_financial_ds["train"][45]["input_ids"])

'<s>( ADP News ) - Feb 4, 2009 - Finnish broadband data communication systems and solutions company Teleste Oyj ( HEL : TLT1V ) said today its net profit decreased to EUR 5.5 million ( USD 7.2 m ) for 2008 from EUR 9.4 million for 200</s>'

Now creating a batch of examples using DataCollatorWithPadding. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In NLP, when processing sequences of text (like sentences or paragraphs), it's common to batch these sequences together for efficient training. However, since these sequences can vary in length, DataCollatorWithPadding automatically pads all sequences in a batch to match the length of the longest sequence in that batch. This ensures that all sequences in a batch have the same length, which is a requirement for most deep learning models.

In [16]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer = tokenizer, return_tensors = "tf")
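As a rough illustration of dynamic padding, the sketch below pads a small batch to its own longest sequence and builds the matching attention mask. It is not the DataCollatorWithPadding implementation; pad_batch is a hypothetical helper, and the pad id of 1 is an assumption (check tokenizer.pad_token_id for the real value).

```python
PAD_ID = 1  # assumption: pad token id; check tokenizer.pad_token_id for the real value

def pad_batch(batch_input_ids, pad_id=PAD_ID):
    """Pad every sequence to the longest one in this batch only."""
    max_len = max(len(ids) for ids in batch_input_ids)
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[0, 1640, 4516, 2], [0, 111, 2], [0, 8, 2643, 138, 5477, 2]])
for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
    print(ids, mask)
```

Here the batch is padded to length 6, not to the 512-token maximum, which is where the efficiency gain comes from.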

Including a metric during training is often helpful for evaluating your model’s performance. We can quickly load an evaluation metric with the 🤗 Evaluate library. For this task, we load the accuracy metric.

In [17]:
import evaluate

accuracy = evaluate.load("accuracy")

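Then creating a function that takes the model outputs (logits) and the labels, picks the highest-scoring class for each example, and passes the predictions and labels to compute to calculate the accuracy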

In [18]:
import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)

    return accuracy.compute(predictions = predictions, references = labels)
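To see what compute_metrics does step by step, here is a small NumPy-only sketch with made-up logits. It computes accuracy directly instead of calling the evaluate library, so the numbers and names are illustrative only.

```python
import numpy as np

# Hypothetical logits for four examples and three classes (negative, neutral, positive).
logits = np.array([
    [2.1, 0.3, -1.0],
    [0.2, 1.5, 0.1],
    [-0.5, 0.4, 1.9],
    [0.9, 1.1, 0.2],
])
labels = np.array([0, 1, 2, 0])

# The predicted class is the index of the largest logit in each row.
predictions = np.argmax(logits, axis=1)

# Accuracy is the fraction of predictions that match the labels.
accuracy_value = float((predictions == labels).mean())
print(predictions, accuracy_value)
```

Three of the four rows are predicted correctly here, so the accuracy is 0.75.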

Before you start training your model, create a map of the expected ids to their labels with id2label and label2id.

In [19]:
id2label = {0: "negative", 1: "neutral", 2: "positive"}

label2id = {"negative": 0, "neutral": 1, "positive": 2}
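The two dictionaries must be exact inverses of each other and must agree with the integer labels created from the Sentiment column earlier. A small sketch of one way to keep them in sync by deriving one from the other (sentiment_to_label is an illustrative name for the mapping used earlier):

```python
id2label = {0: "negative", 1: "neutral", 2: "positive"}

# Derive the inverse mapping instead of typing it by hand.
label2id = {label: idx for idx, label in id2label.items()}

# The mapping applied to the Sentiment column earlier in the notebook.
sentiment_to_label = {"negative": 0, "neutral": 1, "positive": 2}

# If these disagree, the model would report the wrong label names.
print(label2id == sentiment_to_label)
```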

To fine-tune a model in TensorFlow, we start by setting up an optimizer function, a learning rate schedule, and some training hyperparameters.

create_optimizer creates an AdamWeightDecay optimizer together with its learning rate schedule

In [33]:
from transformers import create_optimizer
import tensorflow as tf

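batch_size = 16
num_epochs = 5

# Number of full batches in one pass over the training split
batches_per_epoch = len(tokenized_financial_ds["train"]) // batch_size

# The schedule length is based on num_epochs; note that model.fit below uses epochs = 2
total_train_steps = int(batches_per_epoch * num_epochs)

optimizer, schedule = create_optimizer(init_lr = 2e-6, num_warmup_steps = 0, num_train_steps = total_train_steps)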
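The step arithmetic above can be checked by hand. The sketch below recomputes it for the 4673-row train split and prints the learning rate at each epoch boundary, assuming the schedule decays linearly from init_lr to zero over total_train_steps with no warmup. That decay shape is an assumption about the default behavior, and linear_decay_lr is a hypothetical helper.

```python
train_rows = 4673   # size of the train split shown earlier
batch_size = 16
num_epochs = 5
init_lr = 2e-6

batches_per_epoch = train_rows // batch_size
total_train_steps = batches_per_epoch * num_epochs

def linear_decay_lr(step):
    """Learning rate at a given step, assuming linear decay from init_lr to 0."""
    progress = min(step, total_train_steps) / total_train_steps
    return init_lr * (1.0 - progress)

for epoch in range(num_epochs + 1):
    step = epoch * batches_per_epoch
    print(f"after epoch {epoch}: step {step}, lr {linear_decay_lr(step):.2e}")
```

Under this assumption, stopping after 2 of the 5 scheduled epochs leaves the learning rate at 60% of init_lr instead of decaying it to zero.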

In [34]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels = 3, id2label = id2label, label2id = label2id
)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForSequenceClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFRobertaForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.dense.weight', 'classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predicti

Converting train and validation datasets to the tf.data.Dataset format with prepare_tf_dataset():

In [35]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_financial_ds["train"],
    shuffle = True,
    batch_size = batch_size,
    collate_fn = data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_financial_ds["test"],
    shuffle = False,
    batch_size = batch_size,
    collate_fn = data_collator,
)

Configuring the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don’t need to specify one unless you want to.

In [36]:
import tensorflow as tf

model.compile(optimizer = optimizer)
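For intuition about what such a default loss looks like, here is a NumPy sketch of cross-entropy computed from logits and integer labels. That the model's built-in loss is of this form is an assumption; the notebook only says a default task-relevant loss exists, and sparse_cross_entropy is a hypothetical helper.

```python
import numpy as np

def sparse_cross_entropy(logits, labels):
    """Per-example cross-entropy from raw logits and integer class labels."""
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to the true class of each row.
    return -log_probs[np.arange(len(labels)), labels]

# Hypothetical logits: one confident correct row, one completely undecided row.
logits = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
labels = np.array([0, 2])

losses = sparse_cross_entropy(logits, labels)
print(losses)
```

The undecided row has a loss of log(3), the value for a uniform guess over three classes, and the confident correct row has a lower loss.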

The last two things to set up before we start training are computing the accuracy from the predictions and providing a way to push the model to the Hub. Both can be done with Keras callbacks.

Passing the compute_metrics function to KerasMetricCallback:

In [37]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn = compute_metrics, eval_dataset = tf_validation_set)

Pushing to the Hub requires a token with write permission
- After running the notebook cell, go to https://huggingface.co/
- In the top-right corner, click on the account avatar -> Settings
- Go to Access Tokens and create a Write token
- Copy and paste it into the login prompt



In [25]:
from huggingface_hub import notebook_login

notebook_login()


Specifying where to push your model and tokenizer in the PushToHubCallback (commented out here; the model and tokenizer are pushed with push_to_hub after training instead)

In [28]:
# from transformers.keras_callbacks import PushToHubCallback

# push_to_hub_callback = PushToHubCallback(
#     output_dir = "financial_text_sentiment_classification_model",
#     tokenizer = tokenizer,
# )

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
/content/financial_text_sentiment_classification_model is already a clone of https://huggingface.co/jinxxx123/financial_text_sentiment_classification_model. Make sure you pull the latest changes with `repo.git_pull()`.


- Click on the Hugging Face repository link above
- Show the empty model card, files and versions (should be empty), community, and settings
- Keep this open in another tab

Starting to train our model. Call fit with your training and validation datasets, the number of epochs, and your callbacks to fine-tune the model:

In [38]:
model.fit(
    x = tf_train_set, validation_data = tf_validation_set,
    epochs = 2, callbacks = [metric_callback]
)

Epoch 1/2
Epoch 2/2


<tf_keras.src.callbacks.History at 0x7ded76286bf0>

https://huggingface.co/jinxxx123

In [39]:
model.push_to_hub("financial_text_sentiment_classification_model")

tf_model.h5:   0%|          | 0.00/499M [00:00<?, ?B/s]

In [40]:
tokenizer.push_to_hub("financial_text_sentiment_classification_model")

README.md:   0%|          | 0.00/1.87k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/jinxxx123/financial_text_sentiment_classification_model/commit/6359c837e5fdd082cb52ed3ce259e0b2a0fed7b0', commit_message='Upload tokenizer', commit_description='', oid='6359c837e5fdd082cb52ed3ce259e0b2a0fed7b0', pr_url=None, pr_revision=None, pr_num=None)


- Go back to the model and show that the model has been pushed (make sure you refresh)
- Show the model card, files and versions (should NOT be empty), community, and settings

In [46]:
from transformers import pipeline

text = """
  The stock market experienced significant volatility today, with the Dow Jones Industrial Average
  dropping over 500 points in early trading before recovering slightly by the close of the session.
"""

classifier = pipeline("sentiment-analysis", model = model, tokenizer = tokenizer)

classifier(text)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'negative', 'score': 0.5824766159057617}]

In [47]:
text = """
  The central bank announced a cut in interest rates, aiming to stimulate economic
  growth by making borrowing cheaper for both consumers and businesses.
"""

classifier(text)

[{'label': 'positive', 'score': 0.8606800436973572}]

In [48]:
text = """
  Inflation rates have been steadily increasing, prompting concerns among economists about
  the potential for reduced purchasing power and higher costs of living.
"""

classifier(text)

[{'label': 'negative', 'score': 0.48690879344940186}]
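A score below 0.5, as in the output above, is possible because there are three classes. The sketch below shows how a label and score could be derived from raw logits, assuming the pipeline applies a softmax and reports the top class. The logits are made up and to_prediction is a hypothetical helper, not the pipeline's code.

```python
import math

id2label = {0: "negative", 1: "neutral", 2: "positive"}

def to_prediction(logits):
    """Turn raw logits into a {'label', 'score'} dict via softmax and argmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits where negative barely beats neutral.
prediction = to_prediction([0.6, 0.5, -0.4])
print(prediction)
```

With three classes the top score only has to exceed the other two, so a low score signals a prediction the model is not confident about.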

In [49]:
!ls -l

total 736
-rw-r--r-- 1 root root 745117 Sep  1 05:04 financial_sentiment_data.csv
drwxr-xr-x 3 root root   4096 Sep  1 06:16 financial_text_sentiment_classification_model
drwxr-xr-x 1 root root   4096 Aug 29 13:22 sample_data


In [50]:
text = "The financial advisor recommended a balanced approach to investing, combining both growth and income strategies."

classifier = pipeline("sentiment-analysis", model = "jinxxx123/financial_text_sentiment_classification_model")

classifier(text)

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at jinxxx123/financial_text_sentiment_classification_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.
Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'label': 'positive', 'score': 0.6122643947601318}]

In [51]:
text = "Strong earnings reports drive the stock to new highs, boosting investor confidence."

classifier(text)

[{'label': 'positive', 'score': 0.9129520654678345}]