# Tweet sentiment extraction
#### Training and inference model based on RoBERTa

Inspired by [this](https://www.kaggle.com/cdeotte/tensorflow-roberta-0-705) Kaggle notebook

### Load libraries
* **Pandas** and **NumPy** for data handling and numerical computation
* **TensorFlow** for machine learning
* **scikit-learn** (StratifiedKFold) for splitting the data into folds with balanced class distributions
* **transformers** (from Hugging Face), an NLP library with TensorFlow 2.0 support
* **tokenizers** (from Hugging Face), implementations of modern tokenizers
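
To make "balanced" concrete: a stratified split keeps the share of each sentiment roughly the same in every fold. The sketch below is a hand-rolled illustration of that idea in plain Python, not scikit-learn's implementation of StratifiedKFold; the function name `stratified_folds` and the toy labels are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits, seed=777):
    """Assign sample indices to folds so each fold keeps the label proportions."""
    rng = random.Random(seed)

    # Group the sample indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Shuffle each group, then deal its indices to the folds in turn.
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds


# Toy data: 10 positive, 10 negative and 20 neutral tweets (hypothetical).
labels = ['positive'] * 10 + ['negative'] * 10 + ['neutral'] * 20
for fold in stratified_folds(labels, n_splits=5):
    print(Counter(labels[i] for i in fold))
```

With these toy labels every fold receives 2 positive, 2 negative and 4 neutral tweets, which mirrors the 1:1:2 ratio of the whole set.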

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
from sklearn.model_selection import StratifiedKFold
from transformers import RobertaConfig, TFRobertaModel
import tokenizers
import os

##### Global variables

In [2]:
path = os.getcwd()

### Initialize tokenizer
A tokenizer is an algorithm that transforms text into numeric token IDs that a neural network can process.

**ByteLevelBPETokenizer**
A BPE (Byte-Pair Encoding) tokenizer has a vocabulary made of single letters and groups of letters. To build the vocabulary, we start with every letter as a token and repeatedly merge the pairs of adjacent tokens that occur most frequently in the data set. However, if the base alphabet has to cover every Unicode character, the vocabulary can grow too large. To avoid this, a byte-level tokenizer uses the bytes of the UTF-8 encoding as base tokens instead of letters, so the base alphabet has only 256 symbols.
The constructor requires two files as arguments: `merges` lists the learned merge rules, and `vocab` contains (key, value) pairs in which keys are tokens and values are the integer IDs fed to the neural network.

The files used in this experiment are available on the [Hugging Face website](https://huggingface.co/roberta-base/tree/main).
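
The merge procedure described above can be sketched on a toy corpus. This is a simplified character-level illustration, not the `tokenizers` implementation; `learn_bpe_merges` and the sample words are hypothetical. The last line shows why bytes make a compact base alphabet: any character becomes a short sequence of values between 0 and 255.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words."""
    # Every word starts as a sequence of single-character tokens.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each pair of adjacent tokens occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Rewrite the corpus with the most frequent pair fused into one token.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


print(learn_bpe_merges(["aab", "aab", "aac"], num_merges=2))
# [('a', 'a'), ('aa', 'b')]
print(list("é".encode("utf-8")))
# [195, 169]
```

In this toy corpus the pair `('a', 'a')` occurs three times, so it is merged first; the new token `'aa'` then pairs most often with `'b'`.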

In [3]:
tokenizer = tokenizers.ByteLevelBPETokenizer(
    vocab=path + '/vocab.json',
    merges=path + '/merges.txt',
    lowercase=True,  # Lowercase all text before tokenizing
    add_prefix_space=True  # Prepend a space so the first word is tokenized like any other word
)

# One-hot label vector for each sentiment class
sentiment_id = {'positive': [1, 0, 0], 'negative': [0, 0, 1], 'neutral': [0, 1, 0]}

train_set = pd.read_csv(path+'/train.csv').fillna('')
train_set.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative


### Loading training set

We start by initializing some auxiliary variables to parse the tweets.

* **max_length**: the maximum size of a tokenized tweet.
* **ct**: the number of tweets in the training set.
* **input_ids**: the ID of each token the tokenizer produced. Token IDs are the numerical representations of the tokens that make up a sequence. Unused positions keep the padding ID 1.
* **attention_mask**: as tweets have different lengths, attention_mask marks which positions hold real tokens (1) and should be attended to by the model, and which are padding (0) and should be ignored.
* **token_type_ids**: some models are used for tasks such as question answering, where token type IDs in BERT-style models mark which segment of the input is the question and which is the answer. We do not need this distinction here, so we leave them all zero.
* **sent_class**: the one-hot encoded sentiment: positive, negative or neutral.
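
As a minimal sketch of how one row of `input_ids` and `attention_mask` is laid out, the helper below pads a single tokenized tweet in plain Python. The special IDs (0 for start, 2 for end, 1 for padding) follow the values used in the encoding loop of this notebook; the function name `pad_encoding` and the sample token IDs are made up for illustration.

```python
def pad_encoding(token_ids, max_length, bos_id=0, eos_id=2, pad_id=1):
    """Return (input_ids, attention_mask) for one sequence, padded to max_length."""
    # Wrap the tokens with the start and end markers.
    ids = [bos_id] + list(token_ids) + [eos_id]
    if len(ids) > max_length:
        raise ValueError("sequence is longer than max_length")

    # Pad the IDs and mark real tokens with 1, padding with 0.
    n_pad = max_length - len(ids)
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask


row_ids, row_mask = pad_encoding([10, 11, 12], max_length=8)
print(row_ids)   # [0, 10, 11, 12, 2, 1, 1, 1]
print(row_mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

The explicit length check is a choice of this sketch: a tweet with more than `max_length - 2` tokens does not fit in one row once the start and end markers are added.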

In [5]:
max_length = 96
ct = train_set.shape[0]
input_ids = np.ones((ct, max_length), dtype='int32')
attention_mask = np.zeros((ct, max_length), dtype='int32')
token_type_ids = np.zeros((ct, max_length),dtype='int32')
sent_class = np.zeros((ct, 3), dtype='int32')

for i in range(ct):
    
    # Normalize whitespace and prepend a space to the tweet
    text1 = " "+" ".join(train_set.loc[i, 'text'].split())

    # Encode the text into token IDs
    enc = tokenizer.encode(text1)

    # Wrap the IDs with the start (0) and end (2) tokens and fill the arrays
    input_ids[i, :len(enc.ids)+2] = [0] + enc.ids + [2]
    attention_mask[i, :len(enc.ids)+2] = 1
    sent_class[i, :] = sentiment_id[train_set.loc[i, 'sentiment']]

### Loading testing set

We must encode our testing set the same way we encoded our training set. The variables here have names analogous to those in the previous section.

In [6]:
test_set = pd.read_csv(path+'/test.csv').fillna('')

ct = test_set.shape[0]
input_ids_test = np.ones((ct, max_length), dtype='int32')
attention_mask_test = np.zeros((ct, max_length), dtype='int32')
token_type_ids_test = np.zeros((ct, max_length),dtype='int32')
sent_class_test = np.zeros((ct, 3), dtype='int32')

# Encode the test tweets exactly like the training tweets. The sentiment
# labels from the csv file are kept to score the predictions later.
for k in range(ct):
    text1 = " " + " ".join(test_set.loc[k, 'text'].split())
    enc = tokenizer.encode(text1)
    input_ids_test[k, :len(enc.ids)+2] = [0] + enc.ids + [2]
    sent_class_test[k, :] = sentiment_id[test_set.loc[k, 'sentiment']]
    attention_mask_test[k, :len(enc.ids)+2] = 1

## Model

In [8]:
def build_model():
    ids = tf.keras.layers.Input((max_length,), dtype=tf.int32)
    att = tf.keras.layers.Input((max_length,), dtype=tf.int32)
    tok = tf.keras.layers.Input((max_length,), dtype=tf.int32)

    config = RobertaConfig.from_pretrained(path+'/config-roberta-base.json')
    bert_model = TFRobertaModel.from_pretrained(path+'/pretrained-roberta-base.h5',config=config)
    x = bert_model(ids,attention_mask=att,token_type_ids=tok)
        
    # Classification head on top of the per-token hidden states x[0]
    x3 = tf.keras.layers.Dropout(0.1)(x[0])
    x3 = tf.keras.layers.Conv1D(8, 2, padding='same')(x3)
    x3 = tf.keras.layers.LeakyReLU()(x3)
    x3 = tf.keras.layers.Conv1D(4, 2, padding='same')(x3)
    x3 = tf.keras.layers.Flatten()(x3)
    x3 = tf.keras.layers.Dense(8, activation='relu')(x3)
    x3 = tf.keras.layers.Dense(3, activation='relu')(x3)
    # Softmax turns the three outputs into class probabilities
    x3 = tf.keras.layers.Activation('softmax')(x3)
    

    model = tf.keras.models.Model(inputs=[ids, att, tok], outputs=[x3])
    optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
    model.compile(loss='categorical_crossentropy', optimizer=optimizer)

    return model

## Model Summary

In [9]:
ver_model = build_model()
ver_model.summary()

All model checkpoint layers were used when initializing TFRobertaModel.

All the layers of TFRobertaModel were initialized from the model checkpoint at /home/lucas/Documents/PO-240/sentiment-extraction/pretrained-roberta-base.h5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 96)]         0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 96)]         0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            [(None, 96)]         0                                            
__________________________________________________________________________________________________
tf_roberta_model (TFRobertaMode TFBaseModelOutputWit 124645632   input_1[0][0]                    
                                                                 input_2[0][0]         

## Training and Test

In [11]:
VER = 'v1'; DISPLAY = 1  # Set DISPLAY=1 for interactive (verbose) output
oof_class = np.zeros((input_ids.shape[0], 3))
ct = test_set.shape[0]


skf = StratifiedKFold(n_splits=5,shuffle=True,random_state=777)
for fold,(idxT,idxV) in enumerate(skf.split(input_ids,train_set.sentiment.values)):

    print('#'*25)
    print('### FOLD %i'%(fold+1))
    print('#'*25)
    
    K.clear_session()
    model = build_model()
        
    sv = tf.keras.callbacks.ModelCheckpoint(
        '%s-roberta-ext-%i.h5'%(VER,fold), monitor='val_loss', verbose=1, save_best_only=True,
        save_weights_only=True, mode='auto', save_freq='epoch')

    model.fit([input_ids[idxT,], attention_mask[idxT,], token_type_ids[idxT,]], [sent_class[idxT,]], 
        epochs=3, batch_size=32, verbose=DISPLAY, callbacks=[sv],
        validation_data=([input_ids[idxV,],attention_mask[idxV,],token_type_ids[idxV,]], 
        [sent_class[idxV,]]))
    
    print('Loading model...')
    model.load_weights('%s-roberta-ext-%i.h5'%(VER,fold))
    

    print('Predicting OOF...')
    oof_class[idxV, ] = model.predict([input_ids[idxV,],attention_mask[idxV,],token_type_ids[idxV,]],verbose=DISPLAY)
    
    print('Predicting test ...')
    preds = model.predict([input_ids_test,attention_mask_test,token_type_ids_test],verbose=DISPLAY)
    
    # Display fold accuracy (fraction of tweets classified correctly)
    val_pred = np.argmax(oof_class[idxV], axis=1)
    val_true = np.argmax(sent_class[idxV], axis=1)
    print('Validation right =', float(np.mean(val_pred == val_true)))

    test_pred = np.argmax(preds, axis=1)
    test_true = np.argmax(sent_class_test, axis=1)
    print('Testing set right = ', float(np.mean(test_pred == test_true)))
    print()

#########################
### FOLD 1
#########################


All model checkpoint layers were used when initializing TFRobertaModel.

All the layers of TFRobertaModel were initialized from the model checkpoint at /home/lucas/Documents/PO-240/sentiment-extraction/pretrained-roberta-base.h5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaModel for predictions without further training.


Epoch 1/3


ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted:  OOM when allocating tensor with shape[32,96,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node functional_1/tf_roberta_model/roberta/encoder/layer_._4/intermediate/activation/mul_1 (defined at /home/lucas/anaconda3/envs/iczin/lib/python3.8/site-packages/transformers/activations_tf.py:16) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[gradient_tape/functional_1/tf_roberta_model/roberta/embeddings/token_type_embeddings/embedding_lookup/Reshape/_46]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted:  OOM when allocating tensor with shape[32,96,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node functional_1/tf_roberta_model/roberta/encoder/layer_._4/intermediate/activation/mul_1 (defined at /home/lucas/anaconda3/envs/iczin/lib/python3.8/site-packages/transformers/activations_tf.py:16) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_30699]

Errors may have originated from an input operation.
Input Source operations connected to node functional_1/tf_roberta_model/roberta/encoder/layer_._4/intermediate/activation/mul_1:
 functional_1/tf_roberta_model/roberta/encoder/layer_._4/intermediate/activation/mul (defined at /home/lucas/anaconda3/envs/iczin/lib/python3.8/site-packages/transformers/activations_tf.py:14)	
 functional_1/tf_roberta_model/roberta/encoder/layer_._4/intermediate/dense/BiasAdd (defined at /home/lucas/anaconda3/envs/iczin/lib/python3.8/site-packages/transformers/models/roberta/modeling_tf_roberta.py:374)

Input Source operations connected to node functional_1/tf_roberta_model/roberta/encoder/layer_._4/intermediate/activation/mul_1:
 functional_1/tf_roberta_model/roberta/encoder/layer_._4/intermediate/activation/mul (defined at /home/lucas/anaconda3/envs/iczin/lib/python3.8/site-packages/transformers/activations_tf.py:14)	
 functional_1/tf_roberta_model/roberta/encoder/layer_._4/intermediate/dense/BiasAdd (defined at /home/lucas/anaconda3/envs/iczin/lib/python3.8/site-packages/transformers/models/roberta/modeling_tf_roberta.py:374)

Function call stack:
train_function -> train_function
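
The trace above fails while allocating a float tensor of shape [32, 96, 3072], where 32 is the batch size and 96 is `max_length`. As a rough estimate of what that means in memory, the sketch below computes the size of that one tensor, assuming 4 bytes per float32 element. The helper `tensor_mebibytes` is hypothetical, and the total footprint of training is likely far larger, since many such activations are kept alive for backpropagation.

```python
def tensor_mebibytes(shape, bytes_per_element=4):
    """Size in MiB of a dense tensor with the given shape (float32 by default)."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * bytes_per_element / 2**20


# Size of the failing activation for a few candidate batch sizes.
for batch_size in (32, 16, 8):
    print(batch_size, tensor_mebibytes((batch_size, 96, 3072)))
# 32 36.0
# 16 18.0
# 8 9.0
```

Since the size scales linearly with the batch dimension, passing a smaller `batch_size` to `model.fit` is one option worth trying when this error appears; whether that is enough depends on the GPU.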
