This tutorial demonstrates one way to derive sentiment from social media posts. We use a transformer model named FinBERT, which was trained on financial text and has three possible outputs: positive, negative, or neutral. The tutorial requires downloading roughly 500MB of neural network files. Classification and training are faster on a graphics card (GPU). The code is based on PyTorch, which often lags behind the latest CUDA release, so check which CUDA versions your PyTorch build supports before installing.
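
To see whether PyTorch can find a GPU before running anything heavy, a quick check along the following lines may help. This is only a sketch: the names `has_gpu`, `device`, and `pipeline_device` are illustrative and do not appear in the tutorial's own code.

```python
import torch

# True only if this PyTorch build has CUDA support and a usable GPU/driver is found.
has_gpu = torch.cuda.is_available()
device = "cuda" if has_gpu else "cpu"

# The transformers pipeline accepts a device index: 0 is the first GPU, -1 is the CPU.
pipeline_device = 0 if has_gpu else -1

# torch.version.cuda is None for CPU-only builds of PyTorch.
print(f"PyTorch {torch.__version__}, CUDA build: {torch.version.cuda}, device: {device}")
```

If a GPU is found, passing `device=pipeline_device` to `pipeline(...)` should run classification on it.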

These are the imports needed to run the sentiment analysis. Use "pip install" for any packages that are not available on your system.

In [1]:
import html
import re

import numpy as np
import pandas as pd
# datasets.Dataset is the class used below (Dataset.from_pandas)
from datasets import Dataset
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          pipeline, TrainingArguments, Trainer)



We will use the function below to pre-parse the tweets. Calling html.unescape converts HTML entities (for example, &amp;) back into the characters they represent. Add the command print(post) before and after the line that calls html.unescape to get a better idea of how it works. Next, we split the post into words and use regular expressions to replace words that might cause overfitting (such as the ticker of a firm that is discussed a lot, or a user who is always bullish) with generic placeholders.

In [2]:
def lite_parser(post):
    # convert HTML entities (e.g. &amp;) back to plain characters
    post = html.unescape(post)

    # look at each word
    t_post = post.split()

    # make links, mentions, cashtags, and numbers uniform
    t_post = [re.sub(r'^http.*', '#link', z) for z in t_post]
    t_post = [re.sub(r'^@.*', '@mention', z) for z in t_post]
    t_post = [re.sub(r'^\$.*', '$cashtag', z) for z in t_post]
    # note: this pattern needs at least two digits, so a lone digit is left as-is
    t_post = [re.sub(r'^\d+\.*\d+', '#number', z) for z in t_post]

    # blunt instrument used to remove things that are not in the set of symbols below
    # more can be added here
    t_post = [re.sub(r"[^a-zA-Z@#$0-9.,!?']", ' ', z) for z in t_post]

    # remove blanks and rejoin the words into a single string
    return ' '.join(z.strip() for z in t_post if z.strip())
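
To make the parsing steps concrete, the sketch below walks a single made-up post through the same kind of substitutions. The post, ticker, user name, and URL are placeholders invented for illustration, and the variable names (`substitutions`, `parsed`) are not part of the tutorial's code.

```python
import html
import re

# Made-up post for illustration; the ticker, user, and URL are placeholders.
raw_post = "$XYZ up 12.5% in 5 days &amp; @some_user posted https://example.com/chart"

# html.unescape turns entities such as &amp; back into the characters they stand for.
unescaped = html.unescape(raw_post)

# Start-of-word substitutions mirroring the ones in lite_parser, applied word by word.
substitutions = [
    (r'^http.*', '#link'),
    (r'^@.*', '@mention'),
    (r'^\$.*', '$cashtag'),
    (r'^\d+\.*\d+', '#number'),
]
words = unescaped.split()
for pattern, placeholder in substitutions:
    words = [re.sub(pattern, placeholder, w) for w in words]

# Replace any remaining character outside the allowed set with a space, then drop blanks.
words = [re.sub(r"[^a-zA-Z@#$0-9.,!?']", ' ', w).strip() for w in words]
parsed = ' '.join(w for w in words if w)

print(unescaped)  # $XYZ up 12.5% in 5 days & @some_user posted https://example.com/chart
print(parsed)     # $cashtag up #number in 5 days @mention posted #link
```

Note that the lone digit "5" survives, because the number pattern only matches when at least two digits are present.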

You can choose from many different pretrained models on huggingface.co. For this tutorial, the transformer FinBERT, tuned by ProsusAI (available here: https://huggingface.co/ProsusAI/finbert), is used to classify social media posts as positive, negative, or neutral. Researchers can also build models from scratch using PyTorch or TensorFlow. Because the model is hosted on Hugging Face, you can pass its identifier (the path relative to huggingface.co, here ProsusAI/finbert) and it will download to your computer. We use pipeline to wrap our neural network in a manner that makes the classification code neater.

In [3]:
tokenizer = AutoTokenizer.from_pretrained('ProsusAI/finbert')
model = AutoModelForSequenceClassification.from_pretrained('ProsusAI/finbert', num_labels = 3)
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

Read the file at in/jfr_sample.txt. This file has posts and their associated sentiment: 0 is negative, 1 is positive, and 2 is neutral. Use the xmap dictionary to connect the sample data with the model's labels. Use classification_report along with lists of the predicted and manually classified values to test how well the model works with your data.

In [5]:
df = pd.read_csv('in/jfr_sample.txt', delimiter='\t')

# use lite_parser on each line of the input file
df['texts'] = df['texts'].apply(lite_parser)

# classified is a json-like data structure with the results of the classification
classified = nlp(df['texts'].to_list())

# map sentiment to jfr_sample label
xmap = {'neutral': 2,'positive': 1,'negative': 0}
y_predict = [xmap[z['label']] for z in classified]
y_true = df['label'].to_list()

# Get model performance
report = classification_report(y_pred=np.array(y_predict),y_true=np.array(y_true))
print(report)

              precision    recall  f1-score   support

           0       0.70      0.28      0.40       115
           1       0.77      0.20      0.32       115
           2       0.39      0.90      0.54       115

    accuracy                           0.46       345
   macro avg       0.62      0.46      0.42       345
weighted avg       0.62      0.46      0.42       345
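
The pipeline returns one dict per post with a 'label' and a 'score'. The sketch below uses a hand-written stand-in for that output (the labels and scores are invented, not model results) to show how xmap turns label names into the sample file's coding, and how per-class precision and recall follow from the two lists. The helper functions `precision` and `recall` are illustrative; the tutorial itself relies on classification_report.

```python
# Hypothetical stand-in for the pipeline output: one dict per post.
classified = [
    {'label': 'neutral', 'score': 0.91},
    {'label': 'positive', 'score': 0.78},
    {'label': 'neutral', 'score': 0.66},
    {'label': 'negative', 'score': 0.84},
]
# Hypothetical manual labels in the jfr_sample coding (0 negative, 1 positive, 2 neutral).
y_true = [2, 1, 0, 0]

xmap = {'neutral': 2, 'positive': 1, 'negative': 0}
y_predict = [xmap[z['label']] for z in classified]


def precision(cls):
    # share of posts predicted as cls that really belong to cls
    predicted = [t for t, p in zip(y_true, y_predict) if p == cls]
    return sum(t == cls for t in predicted) / len(predicted)


def recall(cls):
    # share of posts that belong to cls that were predicted as cls
    actual = [p for t, p in zip(y_true, y_predict) if t == cls]
    return sum(p == cls for p in actual) / len(actual)


print(y_predict)                 # [2, 1, 2, 0]
print(precision(2), recall(2))   # 0.5 1.0
print(precision(0), recall(0))   # 1.0 0.5
```

The report above shows the same pattern for class 2 (recall 0.90, precision 0.39), which suggests that the untuned model labels most posts as neutral.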



The performance could probably be better. The initial model wasn't designed for social media. Instead of creating a new model, we can fine-tune this model to work better with our data. First, we add the placeholder tokens produced by lite_parser to the vocabulary; a simple way to add words is shown below. Care should be taken to ensure that the embeddings for these words are actually being trained, since they start as random numbers (equivalent to noise).

In [6]:
tokenizer.add_tokens(['#link', '@mention', '$cashtag', '#number'])
model.resize_token_embeddings(len(tokenizer))

Embedding(30526, 768)
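
One way to check that newly added embedding rows can be trained is to confirm that the embedding matrix requires gradients and that a backward pass reaches the new rows. The sketch below does this on a tiny, randomly initialized BERT so that nothing has to be downloaded; the configuration values (vocabulary of 100, hidden size 32, and so on) are arbitrary assumptions and have nothing to do with FinBERT's real dimensions.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

torch.manual_seed(0)

# Tiny random model for illustration only; these sizes are arbitrary assumptions.
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=1,
                    num_attention_heads=2, intermediate_size=64, num_labels=3)
model = BertForSequenceClassification(config)
model.eval()  # switch off dropout so the check is deterministic

# Grow the vocabulary by four entries, as the tutorial does for its four placeholders.
old_vocab = config.vocab_size
embeddings = model.resize_token_embeddings(old_vocab + 4)
print(embeddings)  # Embedding(104, 32)

# Run one backward pass on a sequence that contains the first new token id.
input_ids = torch.tensor([[1, old_vocab, 2]])
labels = torch.tensor([0])
loss = model(input_ids=input_ids, labels=labels).loss
loss.backward()

grad = model.get_input_embeddings().weight.grad
new_row_gets_gradient = bool(grad[old_vocab].abs().sum() > 0)
unused_row_gets_gradient = bool(grad[old_vocab + 1].abs().sum() > 0)

print(model.get_input_embeddings().weight.requires_grad)
print(new_row_gets_gradient, unused_row_gets_gradient)
```

A new row only receives a gradient when its token actually appears in a training batch, so placeholders that are rare in the fine-tuning data will stay close to their initial values.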

Print the labels in the downloaded model and align the numerical coding of the new sentiment data with the model's coding.

In [7]:
# check model labels and align the fine-tuning data labels with them (here, positive and negative are swapped)
print(model.config.id2label.items())
sent_map = {1:0,0:1,2:2}
df['label'] = df['label'].map(sent_map)

dict_items([(0, 'positive'), (1, 'negative'), (2, 'neutral')])
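
Rather than writing the mapping by hand, it can be derived from the two codings, which avoids mistakes if a different model orders its labels differently. The sketch below assumes the id2label values printed above and the sample file coding described earlier (0 negative, 1 positive, 2 neutral); the name `sample_coding` is illustrative.

```python
# Labels as printed from model.config.id2label.
id2label = {0: 'positive', 1: 'negative', 2: 'neutral'}

# Coding used in in/jfr_sample.txt, as described earlier in the tutorial.
sample_coding = {0: 'negative', 1: 'positive', 2: 'neutral'}

# Invert the model's coding, then look up each sample label by name.
label2id = {name: idx for idx, name in id2label.items()}
sent_map = {sample_id: label2id[name] for sample_id, name in sample_coding.items()}

print(sent_map)  # {0: 1, 1: 0, 2: 2}
```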


This segment is copied from: https://github.com/yya518/FinBERT/blob/master/finetune.ipynb

Below, we split the input data into training, validation, and holdout test sets and convert the data to the native PyTorch format.

The output notes that the model ignores the raw text of the posts and the index of the dataset. This makes sense because the model works on numerical translations of the posts: token ids that map to a vocabulary whose entries are represented as vectors of numbers, plus other vectors (such as the attention mask) that help the model interpret context.

In [8]:

# split data
df_train, df_test = train_test_split(df, stratify=df['label'], test_size=0.1, random_state=42)
df_train, df_val = train_test_split(df_train, stratify=df_train['label'],test_size=0.1, random_state=42)

# translate data
dataset_train = Dataset.from_pandas(df_train)
dataset_val = Dataset.from_pandas(df_val)
dataset_test = Dataset.from_pandas(df_test)

dataset_train = dataset_train.map(lambda e: tokenizer(e['texts'], truncation=True, padding='max_length', max_length=128), batched=True)
dataset_val = dataset_val.map(lambda e: tokenizer(e['texts'], truncation=True, padding='max_length', max_length=128), batched=True)
dataset_test = dataset_test.map(lambda e: tokenizer(e['texts'], truncation=True, padding='max_length' , max_length=128), batched=True)

dataset_train.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
dataset_val.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])
dataset_test.set_format(type='torch', columns=['input_ids', 'token_type_ids', 'attention_mask', 'label'])

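# train/save model
# accuracy on the validation set is the metric used to pick the best checkpoint
def compute_metrics(eval_pred):
    # eval_pred holds the raw model outputs (logits) and the true labels
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {'accuracy': accuracy_score(labels, predictions)}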

args = TrainingArguments(
        output_dir = 'models/',
        evaluation_strategy = 'epoch',
        save_strategy = 'epoch',
        learning_rate=2e-5,
        per_device_train_batch_size=32,
        per_device_eval_batch_size=32,
        num_train_epochs=5,
        weight_decay=0.01,
        load_best_model_at_end=True,
        metric_for_best_model='accuracy',
)

trainer = Trainer(
        model=model,                         # the instantiated 🤗 Transformers model to be trained
        args=args,                  # training arguments, defined above
        train_dataset=dataset_train,         # training dataset
        eval_dataset=dataset_val,            # evaluation dataset
        compute_metrics=compute_metrics,
        tokenizer=tokenizer
)

trainer.train()  

100%|██████████| 1/1 [00:00<00:00, 25.71ba/s]
100%|██████████| 1/1 [00:00<00:00, 199.97ba/s]
100%|██████████| 1/1 [00:00<00:00, 166.75ba/s]
The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, texts. If __index_level_0__, texts are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 279
  Num Epochs = 5
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 45
 20%|██        | 9/45 [00:57<03:34,  5.95s/it]The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, texts. If __index_level_0__, texts are not expected by `BertForSequenceClassification.forward`,  you can 

{'eval_loss': 1.0919320583343506, 'eval_accuracy': 0.5483870967741935, 'eval_runtime': 2.1976, 'eval_samples_per_second': 14.106, 'eval_steps_per_second': 0.455, 'epoch': 1.0}


Model weights saved in models/checkpoint-9\pytorch_model.bin
tokenizer config file saved in models/checkpoint-9\tokenizer_config.json
Special tokens file saved in models/checkpoint-9\special_tokens_map.json
 40%|████      | 18/45 [01:57<02:41,  5.99s/it]The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, texts. If __index_level_0__, texts are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 31
  Batch size = 32
                                               
 40%|████      | 18/45 [01:59<02:41,  5.99s/it]Saving model checkpoint to models/checkpoint-18
Configuration saved in models/checkpoint-18\config.json


{'eval_loss': 1.0437190532684326, 'eval_accuracy': 0.4838709677419355, 'eval_runtime': 2.2428, 'eval_samples_per_second': 13.822, 'eval_steps_per_second': 0.446, 'epoch': 2.0}


Model weights saved in models/checkpoint-18\pytorch_model.bin
tokenizer config file saved in models/checkpoint-18\tokenizer_config.json
Special tokens file saved in models/checkpoint-18\special_tokens_map.json
 60%|██████    | 27/45 [02:57<01:48,  6.00s/it]The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, texts. If __index_level_0__, texts are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 31
  Batch size = 32
                                               
 60%|██████    | 27/45 [03:00<01:48,  6.00s/it]Saving model checkpoint to models/checkpoint-27
Configuration saved in models/checkpoint-27\config.json


{'eval_loss': 1.1979701519012451, 'eval_accuracy': 0.45161290322580644, 'eval_runtime': 2.2303, 'eval_samples_per_second': 13.9, 'eval_steps_per_second': 0.448, 'epoch': 3.0}


Model weights saved in models/checkpoint-27\pytorch_model.bin
tokenizer config file saved in models/checkpoint-27\tokenizer_config.json
Special tokens file saved in models/checkpoint-27\special_tokens_map.json
 80%|████████  | 36/45 [03:57<00:53,  5.99s/it]The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, texts. If __index_level_0__, texts are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 31
  Batch size = 32
                                               
 80%|████████  | 36/45 [04:00<00:53,  5.99s/it]Saving model checkpoint to models/checkpoint-36
Configuration saved in models/checkpoint-36\config.json


{'eval_loss': 1.1972733736038208, 'eval_accuracy': 0.4838709677419355, 'eval_runtime': 2.1965, 'eval_samples_per_second': 14.113, 'eval_steps_per_second': 0.455, 'epoch': 4.0}


Model weights saved in models/checkpoint-36\pytorch_model.bin
tokenizer config file saved in models/checkpoint-36\tokenizer_config.json
Special tokens file saved in models/checkpoint-36\special_tokens_map.json
100%|██████████| 45/45 [04:57<00:00,  6.00s/it]The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: __index_level_0__, texts. If __index_level_0__, texts are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 31
  Batch size = 32
                                               
100%|██████████| 45/45 [04:59<00:00,  6.00s/it]Saving model checkpoint to models/checkpoint-45
Configuration saved in models/checkpoint-45\config.json


{'eval_loss': 1.1989370584487915, 'eval_accuracy': 0.4838709677419355, 'eval_runtime': 2.2012, 'eval_samples_per_second': 14.083, 'eval_steps_per_second': 0.454, 'epoch': 5.0}


Model weights saved in models/checkpoint-45\pytorch_model.bin
tokenizer config file saved in models/checkpoint-45\tokenizer_config.json
Special tokens file saved in models/checkpoint-45\special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from models/checkpoint-9 (score: 0.5483870967741935).
100%|██████████| 45/45 [05:01<00:00,  6.70s/it]

{'train_runtime': 301.6007, 'train_samples_per_second': 4.625, 'train_steps_per_second': 0.149, 'train_loss': 0.6538941701253255, 'epoch': 5.0}





TrainOutput(global_step=45, training_loss=0.6538941701253255, metrics={'train_runtime': 301.6007, 'train_samples_per_second': 4.625, 'train_steps_per_second': 0.149, 'train_loss': 0.6538941701253255, 'epoch': 5.0})
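
The numbers in the training log can be reproduced with a little arithmetic. The sketch below assumes that each 10% split is rounded up to a whole number of posts, which is consistent with the 279 training and 31 validation examples reported in the log; the helper `ceil_div` is illustrative.

```python
n_posts = 345      # total support in the classification report
batch_size = 32    # per_device_train_batch_size
epochs = 5         # num_train_epochs


def ceil_div(a, b):
    # integer division rounded up
    return -(-a // b)


n_test = ceil_div(n_posts, 10)   # 10% holdout, rounded up
n_rest = n_posts - n_test
n_val = ceil_div(n_rest, 10)     # 10% of the remainder for validation
n_train = n_rest - n_val

steps_per_epoch = ceil_div(n_train, batch_size)
total_steps = steps_per_epoch * epochs

# Validation accuracy per epoch, copied from the log, converted to a count of correct posts.
eval_accuracy = [0.5483870967741935, 0.4838709677419355, 0.45161290322580644,
                 0.4838709677419355, 0.4838709677419355]
correct_per_epoch = [round(acc * n_val) for acc in eval_accuracy]

print(n_test, n_val, n_train)        # 35 31 279
print(steps_per_epoch, total_steps)  # 9 45
print(correct_per_epoch)             # [17, 15, 14, 15, 15]
```

With only 31 validation posts, the gap between the best and worst epoch is three posts, so the choice of best checkpoint here rests on very little evidence.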

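The weighted average f1-score went from 0.42 to 0.83 in my model. However, this evaluation of the fine-tuned model blends data that we used to train the model with holdout data, so the score overstates how well the model generalizes. To evaluate on the holdout data only, add the command df_test.to_csv('out/holdout.csv') to the prior step to save the holdout data, and load that file in this step instead of the full sample. Note that the cell below loads the final checkpoint (checkpoint-45), not the checkpoint that scored best on the validation set (checkpoint-9).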

In [9]:
tokenizer = AutoTokenizer.from_pretrained('models/checkpoint-45/')
model = AutoModelForSequenceClassification.from_pretrained('models/checkpoint-45/', num_labels = 3)
nlp = pipeline('sentiment-analysis', model=model, tokenizer=tokenizer)

df = pd.read_csv('in/jfr_sample.txt', delimiter='\t')
df['texts'] = df['texts'].apply(lite_parser)

x = nlp(df['texts'].to_list())

xmap = {'neutral': 2,'positive': 1,'negative': 0}

y_predict = [xmap[z['label']] for z in x]
y_true = df['label'].to_list()

report = classification_report(y_pred=np.array(y_predict),y_true=np.array(y_true))
print(report)

loading file models/checkpoint-45/vocab.txt
loading file models/checkpoint-45/tokenizer.json
loading file models/checkpoint-45/added_tokens.json
loading file models/checkpoint-45/special_tokens_map.json
loading file models/checkpoint-45/tokenizer_config.json
loading configuration file models/checkpoint-45/config.json
Model config BertConfig {
  "_name_or_path": "models/checkpoint-45/",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "positive",
    "1": "negative",
    "2": "neutral"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "negative": 1,
    "neutral": 2,
    "positive": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token

              precision    recall  f1-score   support

           0       0.90      0.83      0.86       115
           1       0.81      0.80      0.80       115
           2       0.78      0.85      0.82       115

    accuracy                           0.83       345
   macro avg       0.83      0.83      0.83       345
weighted avg       0.83      0.83      0.83       345
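
The holdout-only evaluation suggested earlier can be sketched as follows. Everything in the block is a stand-in: the toy `df_test`, the temporary directory used in place of out/, and the hand-written `predicted` list that replaces the call to nlp(...). One detail to watch is that the labels in df_test were already remapped with sent_map before the split, so they follow the model's coding (0 positive, 1 negative, 2 neutral) and should be compared through the model's label2id rather than through xmap.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_test; texts are already parsed and labels use the model's coding.
df_test = pd.DataFrame(
    {'texts': ['$cashtag looks strong', 'selling $cashtag', 'earnings call at #number'],
     'label': [0, 1, 2]},
    index=[12, 40, 7],
)

# Save the holdout rows (the tutorial suggests out/holdout.csv; a temp dir is used here).
out_dir = tempfile.mkdtemp()
holdout_path = os.path.join(out_dir, 'holdout.csv')
df_test.to_csv(holdout_path)

# Load them back in the evaluation step; index_col=0 restores the original row index.
holdout = pd.read_csv(holdout_path, index_col=0)

# Hypothetical predictions standing in for nlp(holdout['texts'].to_list()).
predicted = [
    {'label': 'positive', 'score': 0.90},
    {'label': 'negative', 'score': 0.80},
    {'label': 'positive', 'score': 0.60},
]

# Compare in the model's coding, matching model.config.label2id.
label2id = {'positive': 0, 'negative': 1, 'neutral': 2}
y_predict = [label2id[z['label']] for z in predicted]
y_true = holdout['label'].to_list()

accuracy = sum(p == t for p, t in zip(y_predict, y_true)) / len(y_true)
print(y_predict, y_true, accuracy)
```

In the real notebook, y_predict and y_true built this way can be passed to classification_report exactly as in the cell above.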

