# **Final Project: Fine-Tuning BERT Base Model Weights on StockTwits Data Set**

FRE-GY 7773: Machine Learning in Financial Engineering

Professor Sandeep Jain

Done by Sasha Agapiev (aba439) for 12/13/2022

**METHODOLOGY**

In this notebook, we import the BERT Base model and fine-tune its weights on a data set that is relevant to our project. BERT is an open-source machine learning model developed by Google with a large number of pre-trained weights (~110M). To perform well on a specific task, these weights need to be adapted through transfer learning, that is, fine-tuned on a task-specific data set. Since our trading strategy relies on sentiment analysis of stock tweets, the data set we use for training is the Labeled StockTwits Sentiment Data Set from Kaggle user Adeyoyin Temidayo [(found here)](https://www.kaggle.com/code/adeyoyintemidayo/stock-data-eda-and-prediction/data).


The first step is to import the Python packages needed to set up a BERT model, namely the Transformers package, which provides BertTokenizer and TFBertForSequenceClassification, the BERT model with a classification head that assigns a label to an input text sequence.

Then, we import the StockTwits sentiment training/testing data and convert it to encodable examples using the Transformers InputExample/InputFeatures classes. This is done in the convert_data_to_examples() helper function. We then feed these encodable representations of the StockTwits training/testing examples into the convert_examples_to_tf_dataset() helper function, which creates encoded embedding representations (token IDs, attention masks, and token type IDs). These encoded representations contain the same information as the original StockTwits training/testing data, but in a form that a BERT model can understand (i.e., train its weights on).
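
To make the encoding step more concrete, here is a minimal pure-Python sketch of the padding and masking that happens when a tokenized tweet is brought to a fixed length. The helper name `pad_and_mask` and the three middle token IDs are hypothetical. The values 101, 102, and 0 are assumed to be the `[CLS]`, `[SEP]`, and `[PAD]` IDs of the `bert-base-uncased` vocabulary. The real tokenizer also splits text into WordPiece tokens, which is omitted here.

```python
def pad_and_mask(token_ids, max_length=8, pad_id=0):
    """Truncate or right-pad token IDs and build the matching attention mask."""
    # Keep at most max_length tokens (simplified: the real tokenizer also
    # makes sure the final [SEP] token survives truncation).
    ids = list(token_ids[:max_length])
    # 1 marks a real token, 0 marks padding that the model should ignore.
    mask = [1] * len(ids)
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad
    mask += [0] * n_pad
    # Single-sentence inputs use segment 0 for every position.
    token_type_ids = [0] * max_length
    return {"input_ids": ids, "attention_mask": mask, "token_type_ids": token_type_ids}


# Hypothetical IDs for a short tweet: [CLS] w1 w2 w3 [SEP]
encoded = pad_and_mask([101, 4518, 2003, 2039, 102], max_length=8)
print(encoded["input_ids"])       # [101, 4518, 2003, 2039, 102, 0, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every example ends up with the same length, which is what allows the examples to be stacked into batches later on.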

The next step is to compile the base TFBertForSequenceClassification model using the Adam optimizer, sparse categorical cross-entropy (computed from logits) as the loss function, and sparse categorical accuracy as the evaluation metric. To update the weights, we fit the compiled model to the encoded embedding representations of the StockTwits training data for two epochs, using the encoded testing data for validation. Because the training dataset is also repeated twice within each epoch, the model sees every training example four times in total. Note that the training step takes a very long time because of the sheer number of trainable parameters, so using a GPU is strongly recommended if possible.
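
As a rough illustration of the loss and metric named above, the sketch below computes sparse categorical cross-entropy from raw logits and sparse categorical accuracy in plain Python. The function names and the example logits are hypothetical and chosen for illustration only. Keras performs the same kind of computation on tensors, averaged over each batch.

```python
import math


def sparse_categorical_crossentropy_from_logits(logits, label):
    """Loss for one example: -log(softmax(logits)[label])."""
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def sparse_categorical_accuracy(batch_logits, labels):
    """Fraction of examples whose highest logit sits at the true label index."""
    hits = sum(
        1
        for logits, y in zip(batch_logits, labels)
        if max(range(len(logits)), key=logits.__getitem__) == y
    )
    return hits / len(labels)


# Hypothetical logits for three tweets; columns are (class 0, class 1).
batch_logits = [[-1.2, 2.3], [0.4, -0.1], [0.0, 0.0]]
labels = [1, 1, 0]  # integer class indices, hence "sparse"

losses = [
    sparse_categorical_crossentropy_from_logits(logits, y)
    for logits, y in zip(batch_logits, labels)
]
mean_loss = sum(losses) / len(losses)
accuracy = sparse_categorical_accuracy(batch_logits, labels)
print(mean_loss, accuracy)
```

The "from logits" variant matters here because the classification head outputs unnormalized scores rather than probabilities.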

Finally, once the model has been trained we save its weights to disk so we can access them in our TradingStrategy notebook. When all is said and done, the BERT model should achieve a training accuracy of around 95% and a validation accuracy of around 83%.



***The process for importing the BERT sequence prediction model and for fine-tuning this model on a labeled dataset is heavily inspired by Orhan G. Yalçın's [Medium article](https://towardsdatascience.com/sentiment-analysis-in-10-minutes-with-bert-and-hugging-face-294e8a04b671).***



~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

**PART 1: Importing the Relevant Packages and Libraries**

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~


In [None]:
# The Transformers package, which provides the BERT implementation, requires a separate installation using pip
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


In [None]:
# These are all standard imports for machine learning and natural language processing notebooks
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore')

In [None]:
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

# Get the TFBertForSequenceClassification model and its pre-trained weights from the Transformers package
model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
# Then get the matching BERT tokenizer (same checkpoint name, so the vocabulary agrees with the model)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Getting a sense of the model architecture using the summary method.
# Expect roughly 110 million parameters in total, of which about 1,500 belong to the
# dense classifier head. All of them are trainable (which is the whole point of transfer learning)
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
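
The parameter counts in the summary can be reproduced by hand. The sketch below does the arithmetic under the assumption that `bert-base-uncased` uses a vocabulary of 30,522 tokens, 512 positions, 2 token types, a hidden size of 768, an intermediate size of 3,072, and 12 encoder layers. These configuration values are not printed in this notebook, and the variable names are chosen for illustration.

```python
# Assumed bert-base-uncased configuration values (not printed in this notebook).
VOCAB, MAX_POS, TYPE_VOCAB = 30522, 512, 2
HIDDEN, INTERMEDIATE, LAYERS, NUM_LABELS = 768, 3072, 12, 2


def dense(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # one scale and one shift vector

# Word, position, and token-type embeddings, followed by a LayerNorm.
embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm

encoder_layer = (
    4 * dense(HIDDEN, HIDDEN)        # query, key, value, attention output
    + layer_norm
    + dense(HIDDEN, INTERMEDIATE)    # feed-forward expansion
    + dense(INTERMEDIATE, HIDDEN)    # feed-forward projection
    + layer_norm
)
pooler = dense(HIDDEN, HIDDEN)

bert_params = embeddings + LAYERS * encoder_layer + pooler
classifier_params = dense(HIDDEN, NUM_LABELS)

print(bert_params)                      # 109482240
print(classifier_params)                # 1538
print(bert_params + classifier_params)  # 109483778
```

The 1,538 parameters of the classifier are a 768 x 2 weight matrix plus 2 biases, which is the only part of the model that starts from random initialization.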


In [None]:
# ~
# It would be beneficial to have a way to save the trained models to disk to avoid re-training
# every time we run the notebook, especially for models like BERT which have over a hundred
# million trainable parameters
# ~
import os
from tensorflow.keras.models import load_model

modelName = "Bert Trained Model"
model_path = os.path.join(".", modelName)

def saveModel(model, model_path):
    try:
        os.makedirs(model_path)
    except OSError:
        print("Directory {dir:s} already exists, files will be over-written.".format(dir=model_path))

    # Save JSON config to disk
    json_config = model.to_json()
    with open(os.path.join(model_path, 'config.json'), 'w') as json_file:
        json_file.write(json_config)
    # Save weights to disk
    model.save_weights(os.path.join(model_path, 'weights.h5'))

    print("Model saved in directory {dir:s}; create an archive of this directory and submit with your assignment.".format(dir=model_path))

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

**PART 2: Importing and processing the StockTwits Labeled Data to enable BERT model training**

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~

In [None]:
# First import the actual StockTwits tweets and their labels into a DataFrame
df = pd.read_csv('stock_data.csv', header=None, skiprows=[0])
# Then split the tweets into a training and testing set
train, test = train_test_split(df, test_size=0.2, random_state=42)

In [None]:
# Add logical key names to the DataFrame axes and map the labels to {0, 1}
# (set_axis returns a new DataFrame; its inplace argument was removed in pandas 2.0)
train = train.set_axis(['DATA_COLUMN', 'LABEL_COLUMN'], axis='columns')
train['LABEL_COLUMN'] = (train['LABEL_COLUMN'] == 1).astype(int)
test = test.set_axis(['DATA_COLUMN', 'LABEL_COLUMN'], axis='columns')
test['LABEL_COLUMN'] = (test['LABEL_COLUMN'] == 1).astype(int)

In [None]:
# ~
# convert_data_to_examples() is a helper function which takes the following
# arguments:
#         train: The DataFrame of training data (StockTwits tweets with their corresponding labels)
#         test: The DataFrame of testing data (StockTwits tweets with their corresponding labels)
#         DATA_COLUMN: The key of the tweet column in both DataFrames
#         LABEL_COLUMN: The key of the label column in both DataFrames
# The function converts each row of the training/testing data into an encodable InputExample
# object, which will later be used to create encoded embedding representations. It returns
# two pandas Series of such InputExample objects: train_InputExamples (for the training data)
# and validation_InputExamples (for the testing data).
# ~
def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN):
  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN],
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN],
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)

  return train_InputExamples, validation_InputExamples



In [None]:
# ~
# convert_examples_to_tf_dataset() is a helper function that takes the following
# arguments:
#         examples: A list of labelled training/testing examples, represented as transformers.InputExample objects
#         tokenizer: A BERT tokenizer, taken from the Transformers package
#         max_length: The max length of each encoded example (I set this to 128; note that the limit counts tokens, not characters)
# The function creates encoded embedding representations of the InputExamples in
# the examples list and stores them in a dataset that can be fed into the BERT
# model for training purposes. The function returns this dataset, as produced by
# the nested generator gen().
# ~
def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding='max_length', # pads to the right by default (replaces the deprecated pad_to_max_length=True)
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )

In [None]:
# Calling the two functions to convert the StockTwits Labeled training/testing DataFrames
# into encoded embedding representations (tf.data.Dataset objects called 'train_data' and 'validation_data')
train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, 'DATA_COLUMN', 'LABEL_COLUMN')

train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
# Shuffle within a 100-example buffer, batch, and repeat the data twice.
# Combined with epochs=2 in model.fit, repeat(2) means four passes over the training data.
train_data = train_data.shuffle(100).batch(32).repeat(2)

validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(32) # Batching only; validation data does not need to be shuffled
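
The effect of chaining `batch(32)` and `repeat(2)` on the length of one epoch can be worked out with simple arithmetic. The sketch below assumes a hypothetical training-set size of 4,000 examples, since the actual row count is not shown in this notebook. The helper name `batches_per_epoch` is also hypothetical.

```python
import math


def batches_per_epoch(n_examples, batch_size=32, repeats=2):
    """Number of batches yielded by dataset.batch(batch_size).repeat(repeats)."""
    # batch() comes before repeat(), so every repetition ends with its own
    # (possibly smaller) final batch.
    return math.ceil(n_examples / batch_size) * repeats


n_train = 4000  # hypothetical training-set size, for illustration only
epochs = 2      # value passed to model.fit in this notebook

steps = batches_per_epoch(n_train)
passes_over_data = 2 * epochs  # repeat(2) inside each of the 2 epochs
print(steps)             # 250
print(passes_over_data)  # 4
```

Dropping `repeat(2)` and setting `epochs=4` would give the same number of passes, with validation metrics reported after each one.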

In [None]:
# Compiling the BERT model using the framework described in the 'METHODOLOGY' section
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])
# Fitting the compiled BERT model to the encoded embedding representations of StockTwits Labeled data
# using two epochs of iterations
model.fit(train_data, epochs=2, validation_data=validation_data)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f073e9b93d0>
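
The `clipnorm=1.0` argument passed to the optimizer limits the size of each gradient before the weights are updated. The following is a simplified sketch of that idea on a single gradient represented as a flat Python list. The function name `clip_by_norm` and the example values are hypothetical, and Keras applies the clipping separately to the gradient of each weight tensor.

```python
import math


def clip_by_norm(gradient, clipnorm=1.0):
    """Rescale one gradient (a flat list here) so its L2 norm is at most clipnorm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= clipnorm:
        # Already within the limit: leave the gradient unchanged.
        return list(gradient)
    scale = clipnorm / norm
    # The direction is preserved; only the length shrinks.
    return [g * scale for g in gradient]


clipped = clip_by_norm([3.0, 4.0])    # norm 5.0 is scaled down to norm 1.0
unchanged = clip_by_norm([0.3, 0.4])  # norm 0.5 is already within the limit
print(clipped, unchanged)
```

Clipping of this kind is a common safeguard when fine-tuning large models, because a single unusually large gradient can otherwise undo much of the pre-trained structure.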

In [None]:
# Saving the fine-tuned BERT model and its weights to disk so it can later be used
# in the TradingStrategy notebook to perform sentiment analyses on a wide range
# of stock-related tweets
saveModel(model, model_path)

Model saved in directory ./Bert Trained Model; create an archive of this directory and submit with your assignment.
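
When the fine-tuned model is later used for sentiment analysis, its raw outputs are logits that still have to be turned into a label. The sketch below shows one way to do this in plain Python. The function names, the example logits, and the label names "negative" and "positive" are assumptions for illustration, with index 1 corresponding to the rows labelled 1 in PART 2.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Assumed naming for the 0/1 encoding used in PART 2.
LABELS = {0: "negative", 1: "positive"}


def interpret(logits):
    """Return the predicted label name and the probability assigned to it."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical logits for one tweet, ordered as (class 0, class 1).
label, confidence = interpret([-1.0, 3.0])
print(label, round(confidence, 3))
```

The probability can double as a confidence score, for example to ignore tweets whose predicted sentiment is close to 50/50.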
