# Description

> This notebook further pre-trains BERTweet on a new dataset composed only of tweets about cryptocurrencies, especially Bitcoin.
 The dataset contains 16912 tweets.
 We first pre-processed the tweets as the BERTweet authors did (normalizing each tweet), then replaced topic-related words with generic keywords.


> For instance, "Bitcoin" was replaced by `<bitcn>`.
After this preprocessing, the BERTweet model was further pre-trained on the dataset with the Masked Language Modelling (MLM) objective: each token is selected with probability 15% and replaced by either the mask token or another token.
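
The masking step can be sketched in plain Python. This is only an illustration, assuming the usual BERT-style split that `DataCollatorForLanguageModeling` applies by default: a selected token becomes the mask 80% of the time, a random token 10% of the time, and stays unchanged otherwise. The name `mask_tokens_sketch`, the toy vocabulary and the example sentence are hypothetical; the real collator works on token ids and skips special tokens.

```python
import random


def mask_tokens_sketch(tokens, vocab, mask_token="<mask>", mlm_probability=0.15, rng=None):
    """Return (corrupted tokens, labels) following a BERT-style masking scheme."""
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)  # None means "no prediction at this position"
    for i, token in enumerate(tokens):
        # Select each token for prediction with probability mlm_probability
        if rng.random() >= mlm_probability:
            continue
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = mask_token         # 80%: replace by the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace by a random token
        # remaining 10%: keep the original token
    return corrupted, labels


# Toy data, made up for the illustration
toy_tokens = "the market is very quiet today".split()
toy_vocab = ["market", "quiet", "today", "very", "green"]
corrupted, labels = mask_tokens_sketch(toy_tokens, toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

The model is trained to predict the original token only at the positions where `labels` is not `None`.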


# Necessary imports

In [None]:
!pip install pandas

!pip install torch

!pip install transformers

!pip install emoji

!pip install numpy

In [None]:
# Imports
import pandas as pd
import torch
import numpy as np
from transformers import AutoModelForMaskedLM,AutoTokenizer,AutoModelForSequenceClassification
import emoji

# Functions used in preprocessing

In [None]:
# Dictionary mapping each generic token (key) to the substrings it should replace (value)
dict_tok_subs = {'<bitcn>':['bitc','Bitc','BITC','Btc','btc','BTC'], '<coin>':['#coin','coin','Coin','COIN'], '<address>':['Address', 'address', 'ADDRESS'], '<block>':['Blockchain', 'blockchain', 'BLOCKCHAIN', 'Block Chain', 'block chain', 'BLOCK CHAIN'], '<confirmation>': ['Confirmation', 'confirmation', 'CONFIRMATION'], '<cryptography>':['Cryptography', 'cryptography', 'CRYPTOGRAPHY'], '<doublespend>': ['doublespend', 'Doublespend', 'DOUBLESPEND', 'double spend', 'Double Spend', 'DOUBLE SPEND'],
                    '<hashrate>': ['Hash Rate', 'HASH RATE', 'hash rate'], '<mining>': ['MINING', 'mining', 'Mining'], '<p2p>' : ['p2p', 'P2P', 'peer-to-peer', 'Peer-to-peer'], '<privatekey>': ['Private Key', 'private key', 'PRIVATE KEY','privatekey', 'PRIVATEKEY', 'Privatekey'],
                    '<signature>':['SIGNATURE', 'Signature', 'signature'] , '<wallet>':['Wallet', 'WALLET', 'wallet'],
                 '<price>': ['price', 'PRICE', 'Price'], '<buy>': ['buy', 'BUY', 'Buy'], '<pump>': ['pump', 'PUMP', 'Pump'],
                 '<profit>': ['PROFIT', 'profit', 'Profit'], '<volume>': ['volume', 'Volume', 'VOLUME'],
                 '<etf>': ['ETF', 'etf', 'Etf'], '<bull>': ['bull', 'Bull', 'BULL'], '<sell>': ['sell', 'SELL', 'Sell'],
                 '<top>': ['top', 'TOP', 'Top'], '<win>': ['win', 'WIN', 'Win'], '<moon>': ['moon', 'MOON', 'Moon'],
                 '<signal>': ['signal', 'SIGNAL', 'Signal'], '<long>': ['long', 'LONG', 'Long'], '<chart>': ['CHART', 'chart', 'Chart'],
                 '<alts>': ['alts', 'ALTS', 'Alts'], '<hodl>': ['hodl', 'HODL', 'Hodl'], '<support>': ['support', 'SUPPORT', 'Support'],
                 '<short>': ['short', 'Short', 'SHORT'], '<drop>': ['drop', 'DROP', 'Drop'], '<project>': ['project', 'PROJECT', 'Project'],
                 '<bullish>': ['bullish', 'Bullish', 'BULLISH'], '<fall>': ['fall', 'Fall', 'FALL'], '<dump>': ['dump', 'DUMP', 'Dump'],
                 '<bear>': ['bear', 'Bear', 'BEAR'], '<resistance>': ['resistance', 'RESISTANCE', 'Resistance'], '<opportunity>': ['opportunity', 'OPPORTUNITY', 'Opportunity'],
                 '<stop-loss>': ['stop-loss', 'stop loss', 'STOP-LOSS'],
                 '<chain>': ['chain', 'Chain', 'CHAIN'], '<hold>': ['hold', 'Hold', 'HOLD'], '<future>': ['future', 'FUTURE', 'Future'],
                 '<value>': ['value', 'Value', 'VALUE'], '<trader>': ['trader', 'Trader', 'TRADER'], '<nft>': ['nft', 'NFT', 'Nft'],
                 '<launch>': ['launch', 'Launch', 'LAUNCH'], '<fiat>': ['fiat', 'Fiat', 'FIAT'], '<liquid>': ['liquid', 'Liquid', 'LIQUID'],
                 '<scam>': ['scam', 'Scam', 'SCAM']}


In [None]:
# Get the list of tokens that should be added
list_tokens = list(dict_tok_subs.keys())

In [None]:
def replace_by_token(lst_of_tweets, tokens_dictionary):
  """Replace words that are significant for Bitcoin tweets by generic tokens.

    A word (a space-separated chunk of the tweet) is replaced as a whole as soon
    as it contains one of the substrings listed for a token.

    Parameters
    ----------
    lst_of_tweets : list of str
      The tweets to process.
    tokens_dictionary : dict
      Maps each replacement token to the list of significant substrings
      that trigger it.
      The order in which the keys are placed in the dictionary matters,
      as words that could be assigned to two different tokens will be replaced by
      the token that shows up first in the dictionary.

    Returns
    -------
    list of str
      The tweets in which the significant words of each tweet are replaced.
  """
  list_tweets = []
  # Loop through every tweet
  for text in lst_of_tweets:
    splits = text.split(" ")

    # Go through every word of the tweet
    for split in range(len(splits)):
        # Go through every replacement token
        for key in tokens_dictionary:
            # Loop through the substrings that are associated with this token
            for possible_string in tokens_dictionary[key]:
                if possible_string in splits[split]:
                    splits[split] = key

    text_tok = ' '.join(splits)
    list_tweets.append(text_tok)

  return list_tweets
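
To make the behaviour concrete, here is a small self-contained sketch of the same replacement logic applied to two made-up tweets. The name `replace_by_token_sketch`, the reduced dictionary and the example tweets are hypothetical and only serve the illustration.

```python
def replace_by_token_sketch(tweets, tokens_dictionary):
    """Compact re-statement of the replacement logic, for illustration only."""
    replaced = []
    for text in tweets:
        words = text.split(" ")
        for i in range(len(words)):
            for token, substrings in tokens_dictionary.items():
                # The whole word is swapped as soon as it contains one substring
                if any(substring in words[i] for substring in substrings):
                    words[i] = token
        replaced.append(" ".join(words))
    return replaced


# Reduced dictionary: "<bull>" is listed before "<bullish>"
demo_dict = {
    "<bitcn>": ["bitc", "Bitc", "BTC"],
    "<price>": ["price"],
    "<bull>": ["bull", "Bull"],
    "<bullish>": ["bullish", "Bullish"],
}
demo_tweets = ["Bitcoin price is up!", "Feeling Bullish on #BTC"]
demo_output = replace_by_token_sketch(demo_tweets, demo_dict)
print(demo_output)
```

In this sketch "Bullish" ends up as `<bull>` rather than `<bullish>`, because `<bull>` comes first in the dictionary and its substring "Bull" already matches. Hashtags are absorbed as well: "#BTC" becomes `<bitcn>`.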


# Loading the model serving as a basis

In [None]:
# Load BERTweet and register the additional tokens with the tokenizer
bertweet = AutoModelForMaskedLM.from_pretrained("vinai/bertweet-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", additional_special_tokens=list_tokens)
# Grow the embedding matrix so that every new token gets its own row
bertweet.resize_token_embeddings(len(tokenizer))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Embedding(64051, 768)
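
The size printed above can be sanity-checked against the number of tokens we added. The sketch below simply counts the keys of our dictionary and subtracts that count from the 64051 rows reported; it assumes that none of the new tokens was already present in the base vocabulary.

```python
# Keys of dict_tok_subs, copied from the dictionary defined earlier
token_keys = [
    "<bitcn>", "<coin>", "<address>", "<block>", "<confirmation>",
    "<cryptography>", "<doublespend>", "<hashrate>", "<mining>", "<p2p>",
    "<privatekey>", "<signature>", "<wallet>", "<price>", "<buy>", "<pump>",
    "<profit>", "<volume>", "<etf>", "<bull>", "<sell>", "<top>", "<win>",
    "<moon>", "<signal>", "<long>", "<chart>", "<alts>", "<hodl>",
    "<support>", "<short>", "<drop>", "<project>", "<bullish>", "<fall>",
    "<dump>", "<bear>", "<resistance>", "<opportunity>", "<stop-loss>",
    "<chain>", "<hold>", "<future>", "<value>", "<trader>", "<nft>",
    "<launch>", "<fiat>", "<liquid>", "<scam>",
]

embedding_rows = 64051              # first dimension of the printed Embedding
added_tokens = len(set(token_keys)) # dictionary keys are unique
implied_base_vocab = embedding_rows - added_tokens
print(added_tokens, implied_base_vocab)
```

Under that assumption, the 50 added tokens account for the difference between the resized embedding matrix and a base vocabulary of 64001 entries.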

In [None]:
# Check that the tokens have been added to the tokenizer and the model
print("<bitcn>" in tokenizer.get_vocab())

# A new token should map to a single id
special_token_id = tokenizer.convert_tokens_to_ids(["<bitcn>"])
print(special_token_id)

# Check that the token is kept whole when it appears inside a sentence
query = "Hey this is a <bitcn> token"
p_output = tokenizer.encode(query)
print(p_output)

# Encoding the token alone should show the same id between the start and end tokens
print(tokenizer.encode("<bitcn>"))

In [None]:
# Load the dataset used in pre-training
from google.colab import drive
drive.mount("/content/gdrive")

dataset = pd.read_csv("/content/gdrive/My Drive/NLP Group Project (2022)/tweet_datasets/btc_tweet_20000_without_Sha_label - btc_tweet_20000.csv")
data_used_in_pretraining = dataset.iloc[1000:]
print(len(data_used_in_pretraining))
tweets = data_used_in_pretraining["text"]

Mounted at /content/gdrive
16912


In [None]:
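# Normalize all the tweets using the BERTweet procedure
normalized_tweets = []
for tweet in range(len(tweets)):
    normalized_tweets.append(tokenizer.normalizeTweet(str(tweets.iloc[tweet])))
    # Report progress every 100 tweets
    if tweet % 100 == 0:
        print(f"Normalized {tweet} tweets so far")
# Keep the raw and the normalized tweets side by side in a DataFrame
tweets = tweets.to_frame(name="text")
tweets["normalized"] = normalized_tweets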
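
As a rough picture of what `normalizeTweet` does, the sketch below reproduces only two of its rules: user mentions become `@USER` and links become `HTTPURL`. It assumes plain whitespace splitting, whereas the actual method uses a tweet-specific tokenizer and also converts emojis to text with the `emoji` package; both are left out here. `normalize_tweet_sketch` is a hypothetical name.

```python
def normalize_tweet_sketch(tweet):
    """Simplified stand-in for BERTweet's normalization (mentions and links only)."""
    normalized = []
    for token in tweet.split():
        lowered = token.lower()
        if token.startswith("@"):
            normalized.append("@USER")      # anonymize user mentions
        elif lowered.startswith("http") or lowered.startswith("www"):
            normalized.append("HTTPURL")    # replace links by a placeholder
        else:
            normalized.append(token)
    return " ".join(normalized)


# Made-up tweet for the illustration
example = normalize_tweet_sketch("@alice check https://t.co/abc Bitcoin  now")
print(example)
```

For real data, the tokenizer's own `normalizeTweet` should be preferred, as done in the cell above.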

In [None]:
# Pre-process the tweets using the function that replaces words by generic tokens
tweets_tokenized = replace_by_token(tweets["normalized"],dict_tok_subs)


In [None]:
# Save the normalized and preprocessed tweets in a csv in case we re-use them
tweets_tokenized_df = pd.DataFrame(tweets_tokenized)
tweets_tokenized_df.to_csv("/content/gdrive/My Drive/NLP Group Project (2022)/pre-training_and_fine-tuning_bertweet/tweets_normalized_df.txt", index = False,header = False)
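
One caveat with saving through `to_csv`: with its default quoting, a tweet that contains a comma or a double quote is wrapped in quotation marks, and `LineByLineTextDataset` then reads those extra characters as part of the text. A possible alternative, sketched here under the assumption that every tweet should occupy exactly one line, is to write the file directly. The helper name `write_one_tweet_per_line`, the sample tweets and the temporary path are hypothetical.

```python
import os
import tempfile


def write_one_tweet_per_line(tweets, path):
    """Write each tweet on its own line, without any CSV quoting."""
    with open(path, "w", encoding="utf-8") as handle:
        for tweet in tweets:
            # Collapse whitespace so that a tweet can never span several lines
            handle.write(" ".join(tweet.split()) + "\n")


# Made-up tweets containing a comma and double quotes
sample = ['<bitcn> is up, again', 'he said "<hodl>"']
path = os.path.join(tempfile.mkdtemp(), "tweets.txt")
write_one_tweet_per_line(sample, path)

with open(path, encoding="utf-8") as handle:
    lines = handle.read().splitlines()
print(lines)
```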

In [None]:
# Load the dataset in Hugging Face format and train the model on it
from transformers import LineByLineTextDataset
from transformers import Trainer, TrainingArguments
from transformers import DataCollatorForLanguageModeling

# Read the file written by the previous cell, one tweet per line
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/content/gdrive/My Drive/NLP Group Project (2022)/pre-training_and_fine-tuning_bertweet/tweets_normalized_df.txt",
    block_size=64,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="/content/gdrive/My Drive/NLP Group Project (2022)/bertweet-retrained",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=48,
    save_steps=500,
    save_total_limit=2,
    seed=1
)

trainer = Trainer(
    model=bertweet,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)

trainer.train()

trainer.save_model("/content/gdrive/My Drive/NLP Group Project (2022)/bertweet-retrained")
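
A quick back-of-the-envelope check of how many optimization steps this configuration runs and at which steps it writes a checkpoint. It assumes a single device, no gradient accumulation, and that each of the 16912 tweets yields exactly one training example; the variable names are ours.

```python
import math

num_examples = 16912   # size of the pre-training set printed earlier
batch_size = 48        # per_device_train_batch_size
num_epochs = 3         # num_train_epochs
save_steps = 500       # save_steps

# The last, smaller batch of each epoch still counts as one step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
checkpoint_steps = list(range(save_steps, total_steps + 1, save_steps))

print(steps_per_epoch, total_steps, checkpoint_steps)
```

Under these assumptions, training runs for 1059 steps and writes `checkpoint-500` and `checkpoint-1000`, which both fit within `save_total_limit=2`.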

In [None]:
# A checkpoint is saved every 500 optimization steps. If training is interrupted, it can be resumed from a checkpoint the following way
from transformers import DataCollatorForLanguageModeling
from transformers import LineByLineTextDataset
from transformers import Trainer, TrainingArguments

bertweet_retrained = AutoModelForMaskedLM.from_pretrained("/content/gdrive/My Drive/NLP Group Project (2022)/bertweet-retrained/checkpoint-500")
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base",additional_special_tokens =list_tokens)

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/content/gdrive/My Drive/NLP Group Project (2022)/pre-training_and_fine-tuning_bertweet/tweets_normalized_df.txt",
    block_size=64,
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

training_args = TrainingArguments(
    output_dir="/content/gdrive/My Drive/NLP Group Project (2022)/bertweet-retrained",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=48,
    save_steps=500,
    save_total_limit=2,
    seed=1
)

trainer = Trainer(
    model=bertweet_retrained,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset
)

trainer.train(resume_from_checkpoint="/content/gdrive/My Drive/NLP Group Project (2022)/bertweet-retrained/checkpoint-500")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
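
Rather than hard-coding `checkpoint-500`, the most recent checkpoint folder can be looked up automatically. The sketch below does this with the standard library only, assuming the checkpoints are stored as `checkpoint-<step>` folders inside the output directory; `find_latest_checkpoint_sketch` is a hypothetical helper and the demo uses empty temporary folders. `transformers.trainer_utils` also provides `get_last_checkpoint`, which should serve the same purpose.

```python
import os
import re
import tempfile


def find_latest_checkpoint_sketch(output_dir):
    """Return the path of the checkpoint-<step> folder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


# Demo on empty temporary folders that mimic the output directory layout
demo_dir = tempfile.mkdtemp()
for name in ("checkpoint-500", "checkpoint-1000", "runs"):
    os.makedirs(os.path.join(demo_dir, name))
latest = find_latest_checkpoint_sketch(demo_dir)
print(os.path.basename(latest))
```

The returned path can then be used both to load the model and as the checkpoint to resume from.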


In [None]:
# When it is done training, you can save the model running the following command
trainer.save_model("/content/gdrive/My Drive/NLP Group Project (2022)/bertweet-retrained")

# Testing after pre-training


> To check that the model was pre-trained correctly, run the following cell and see which word our model proposes for the missing word (written `<mask>` in our string).

In [None]:
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="/content/gdrive/My Drive/NLP Group Project (2022)/bertweet-retrained",
    tokenizer=tokenizer
)
fill_mask("The price of <mask> !")
