This notebook demonstrates sentiment prediction with the BERTweet model.
In the last cell, you can pass in a list of tweets and see the sentiment predicted for each one.

#Turn on GPU please


In [None]:
!pip install emoji
!pip install transformers

Collecting emoji
  Downloading emoji-1.7.0.tar.gz (175 kB)
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-1.7.0-py3-none-any.whl size=171046 sha256=21f246eb028daf83581b75024c74620532fa8d1edd650a323d97c8956e66ea1e
  Stored in directory: /root/.cache/pip/wheels/8a/4e/b6/57b01db010d17ef6ea9b40300af725ef3e210cb1acfb7ac8b6
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-1.7.0
Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_

In [None]:
from transformers import AutoTokenizer,AutoModelForSequenceClassification
import torch
from torch import nn

In [None]:
from google.colab import drive

drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [None]:
#%% Dictionary mapping each domain-specific token (key) to the substrings (values) it should replace
dict_tok_subs = {'<bitcn>':['bitc','Bitc','BITC','Btc','btc','BTC'], '<coin>':['#coin','coin','Coin','COIN'], '<address>':['Address', 'address', 'ADDRESS'], '<block>':['Blockchain', 'blockchain', 'BLOCKCHAIN', 'Block Chain', 'block chain', 'BLOCK CHAIN'], '<confirmation>': ['Confirmation', 'confirmation', 'CONFIRMATION'], '<cryptography>':['Cryptography', 'cryptography', 'CRYPTOGRAPHY'], '<doublespend>': ['doublespend', 'Doublespend', 'DOUBLESPEND', 'double spend', 'Double Spend', 'DOUBLE SPEND'],
                    '<hashrate>': ['Hash Rate', 'HASH RATE', 'hash rate'], '<mining>': ['MINING', 'mining', 'Mining'], '<p2p>' : ['p2p', 'P2P', 'peer-to-peer', 'Peer-to-peer'], '<privatekey>': ['Private Key', 'private key', 'PRIVATE KEY','privatekey', 'PRIVATEKEY', 'Privatekey'],
                    '<signature>':['SIGNATURE', 'Signature', 'signature'] , '<wallet>':['Wallet', 'WALLET', 'wallet'],
                 '<price>': ['price', 'PRICE', 'Price'], '<buy>': ['buy', 'BUY', 'Buy'], '<pump>': ['pump', 'PUMP', 'Pump'],
                 '<profit>': ['PROFIT', 'profit', 'Profit'], '<volume>': ['volume', 'Volume', 'VOLUME'],
                 '<etf>': ['ETF', 'etf', 'Etf'], '<bull>': ['bull', 'Bull', 'BULL'], '<sell>': ['sell', 'SELL', 'Sell'],
                 '<top>': ['top', 'TOP', 'Top'], '<win>': ['win', 'WIN', 'Win'], '<moon>': ['moon', 'MOON', 'Moon'],
                 '<signal>': ['signal', 'SIGNAL', 'Signal'], '<long>': ['long', 'LONG', 'Long'], '<chart>': ['CHART', 'chart', 'Chart'],
                 '<alts>': ['alts', 'ALTS', 'Alts'], '<hodl>': ['hodl', 'HODL', 'Hodl'], '<support>': ['support', 'SUPPORT', 'Support'],
                 '<short>': ['short', 'Short', 'SHORT'], '<drop>': ['drop', 'DROP', 'Drop'], '<project>': ['project', 'PROJECT', 'Project'],
                 '<bullish>': ['bullish', 'Bullish', 'BULLISH'], '<fall>': ['fall', 'Fall', 'FALL'], '<dump>': ['dump', 'DUMP', 'Dump'],
                 '<bear>': ['bear', 'Bear', 'BEAR'], '<resistance>': ['resistance', 'RESISTANCE', 'Resistance'], '<opportunity>': ['opportunity', 'OPPORTUNITY', 'Opportunity'],
                 '<stop-loss>': ['stop-loss', 'stop loss', 'STOP-LOSS'],
                 '<chain>': ['chain', 'Chain', 'CHAIN'], '<hold>': ['hold', 'Hold', 'HOLD'], '<future>': ['future', 'FUTURE', 'Future'],
                 '<value>': ['value', 'Value', 'VALUE'], '<trader>': ['trader', 'Trader', 'TRADER'], '<nft>': ['nft', 'NFT', 'Nft'],
                 '<launch>': ['launch', 'Launch', 'LAUNCH'], '<fiat>': ['fiat', 'Fiat', 'FIAT'], '<liquid>': ['liquid', 'Liquid', 'LIQUID'],
                 '<scam>': ['scam', 'Scam', 'SCAM']}
                 
#%% Get the list of tokens that should be added
list_tokens = list(dict_tok_subs.keys())
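
Because replacement works by substring and the key order decides which token wins, a token can become unreachable when all of its substrings already contain a substring of an earlier key. In `dict_tok_subs`, for instance, `'<bull>'` precedes `'<bullish>'`. The sketch below is one possible way to detect such cases; `find_shadowed_tokens` and the `toy` dictionary are hypothetical names introduced for illustration, and the check assumes that the first matching key wins.

```python
def find_shadowed_tokens(tokens_dictionary):
    """Return {token: earlier_token} for tokens that can never be emitted.

    A token is reported when every one of its substrings contains a
    substring of an earlier key, so the earlier key always matches first.
    """
    shadowed = {}
    earlier = []  # (substring, token) pairs from the keys visited so far
    for token, substrings in tokens_dictionary.items():
        blockers = []
        for candidate in substrings:
            # First earlier token whose substring occurs inside this candidate.
            hit = next((tok for sub, tok in earlier if sub in candidate), None)
            blockers.append(hit)
        if blockers and all(b is not None for b in blockers):
            shadowed[token] = blockers[0]
        earlier.extend((sub, token) for sub in substrings)
    return shadowed


# Toy dictionary (hypothetical) that mirrors the key ordering of the notebook.
toy = {
    "<bull>": ["bull", "Bull", "BULL"],
    "<top>": ["top", "TOP", "Top"],
    "<bullish>": ["bullish", "Bullish", "BULLISH"],
    "<stop-loss>": ["stop-loss", "STOP-LOSS"],
    "<moon>": ["moon", "Moon", "MOON"],
}
print(find_shadowed_tokens(toy))
# {'<bullish>': '<bull>', '<stop-loss>': '<top>'}
```

Moving the more specific keys ahead of the general ones would make them reachable, but it would also change the preprocessing the model was trained with, so a check like this is probably best used as a diagnostic only.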

In [None]:
def replace_by_token(lst_of_tweets, tokens_dictionary):
  """Replace words that are significant for bitcoin tweets by tokens.

    Parameters
    ----------
    lst_of_tweets : list of str
      The tweets to process.
    tokens_dictionary : dict
      Maps each token to the list of significant substrings that trigger it.
      Matching is by substring on each space-separated word, so multi-word
      entries such as 'hash rate' never match.
      The order of the keys in the dictionary matters: a word that could be
      assigned to two different tokens is replaced by the token that shows
      up first in the dictionary.

    Returns
    -------
    output : list of str
      The tweets with each significant word replaced by its token.
  """
  list_tweets = []
  # loop through every tweet
  for text in lst_of_tweets:
    splits = text.split(" ")

    for split in range(len(splits)):
        # go through every token
        for key in tokens_dictionary:
            # loop through the substrings associated with each token
            for possible_string in tokens_dictionary[key]:
                if possible_string in splits[split]:
                    splits[split]=key

    text_tok = ' '.join(splits)
    list_tweets.append(text_tok)

  return list_tweets
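
To make the matching behaviour of `replace_by_token` concrete, here is a small self-contained sketch. `replace_words` is a condensed restatement of the function above, and the toy dictionary and tweets are made up for illustration. It shows three consequences of the design: the whole word is replaced (including attached punctuation), any word that merely contains a substring is replaced, and entries containing a space are never matched.

```python
def replace_words(tweets, tokens_dictionary):
    """Condensed restatement of replace_by_token, for illustration only."""
    out = []
    for text in tweets:
        words = text.split(" ")
        for i in range(len(words)):
            for token, substrings in tokens_dictionary.items():
                # Substring test: the whole word becomes the token on a hit.
                if any(s in words[i] for s in substrings):
                    words[i] = token
        out.append(" ".join(words))
    return out


# Hypothetical toy dictionary and tweets.
toy_tokens = {
    "<bitcn>": ["btc", "BTC"],
    "<win>": ["win", "Win"],
    "<hashrate>": ["hash rate"],
}
toy_tweets = ["sell BTC now", "open a window", "hash rate is up", "btc, winning!"]

for before, after in zip(toy_tweets, replace_words(toy_tweets, toy_tokens)):
    print(f"{before!r} -> {after!r}")
# 'sell BTC now' -> 'sell <bitcn> now'
# 'open a window' -> 'open a <win>'
# 'hash rate is up' -> 'hash rate is up'
# 'btc, winning!' -> '<bitcn> <win>'
```

The second line shows a false positive ("window") and the third a missed multi-word entry. Both follow directly from splitting on spaces and testing substrings.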

In [None]:
bertweet_tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base",additional_special_tokens =list_tokens)

Downloading:   0%|          | 0.00/558 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/824k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.03M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [None]:
# Build the model from the retrained BERTweet checkpoint with a 3-class head
bertweet_model = AutoModelForSequenceClassification.from_pretrained("/content/gdrive/My Drive/NLP Group Project (2022)/bertweet-retrained",
                                                      num_labels = 3,
                                                      output_attentions = False,
                                                      output_hidden_states = False,
                                                      problem_type="multi_label_classification"
)
# Load the weights of the best fine-tuned model.
# map_location="cpu" lets the checkpoint load even if it was saved from a GPU;
# the model is never moved to the GPU here, so the demo below runs on the CPU.
bertweet_path = "/content/gdrive/My Drive/NLP Group Project (2022)/pre-training_and_fine-tuning_bertweet/best_bertweet.pt"
bertweet_model.load_state_dict(torch.load(bertweet_path, map_location="cpu"))

Some weights of the model checkpoint at /content/gdrive/My Drive/NLP Group Project (2022)/bertweet-retrained were not used when initializing RobertaForSequenceClassification: ['lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at /content/gdrive/My Drive/NLP Group Project (2022)/bertweet-retrained and are newly initialized: ['classifier

In [None]:
def bertweet_preprocess(text):
  res = bertweet_tokenizer.normalizeTweet(text)
  return res

In [None]:
def bertweet_demo(clean_tweet):
  """
  Args:
    clean_tweet : list of tweets (strings), e.g. cleaned with the
                  preprocess_dataset function

  Returns:
    predictions : np.array of predicted labels in {-1, 0, 1}
                  (the class probabilities are printed, not returned)
  """
  clean_tweet = [bertweet_preprocess(i) for i in clean_tweet]
  clean_tweet = replace_by_token(clean_tweet,dict_tok_subs)

  tokens = bertweet_tokenizer.batch_encode_plus(
      clean_tweet,
      padding='max_length',
      max_length = 64,
      truncation=True,                 
      add_special_tokens = True, # Add '<s>' and '</s>'
      return_attention_mask = True
  )
  seq = torch.tensor(tokens['input_ids'])
  mask = torch.tensor(tokens['attention_mask'])
  bertweet_model.eval()
  with torch.no_grad():
    preds = bertweet_model(seq,token_type_ids=None,attention_mask=mask,return_dict=True)
    m = nn.Softmax(dim=1)
    output = m(preds['logits']).numpy()
  predictions = output.argmax(axis=1) - 1
  print("Probabilities")
  print("")
  print(output)
  print("")
  print("Predictions")
  print("")
  print(predictions)
  return predictions
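
The last step of `bertweet_demo` turns the three logits of each tweet into probabilities with a softmax, takes the index of the largest one, and subtracts 1 so that the labels are -1, 0 and 1. The sketch below reproduces that post-processing with the standard library only. The logits are made-up values, `softmax` and `to_label` are hypothetical helper names, and reading -1 / 0 / 1 as negative / neutral / positive is an assumption that the notebook does not state explicitly.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_label(probabilities):
    """Index of the largest probability, shifted from {0, 1, 2} to {-1, 0, 1}."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return best - 1


# Made-up logits for two tweets (one row per tweet, one column per class).
example_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
for row in example_logits:
    probs = softmax(row)
    print([round(p, 3) for p in probs], "->", to_label(probs))
```

Since softmax preserves the ordering of the logits, the argmax could be taken on the logits directly. The softmax is only needed for the printed probabilities.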

Test the model by replacing the strings below with your own tweets
---

In [None]:
_ = bertweet_demo(['buy the dip','HODL','sell btc','SEC impose regularization'])