# Tweet sentiment extraction
#### Training and inference model based on RoBERTa

Inspired by [this Kaggle notebook](https://www.kaggle.com/cdeotte/tensorflow-roberta-0-705).

### Load libraries
* **Pandas** and **NumPy** for data handling and numerical computation
* **TensorFlow** for machine learning
* **Sklearn** (StratifiedKFold) for splitting the data into folds that preserve the class distribution
* **transformers** (from Hugging Face) NLP library for TensorFlow 2.0
* **tokenizers** (from Hugging Face) implementation of modern tokenizers

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as K
from sklearn.model_selection import StratifiedKFold
from transformers import *
import tokenizers
import os
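
To make the idea of "balanced" folds concrete, here is a minimal pure-Python sketch of stratification. It is a simplified stand-in for illustration, not how scikit-learn implements `StratifiedKFold`; the helper name `stratified_fold_ids` and the toy labels are assumptions made for this example.

```python
from collections import Counter, defaultdict

def stratified_fold_ids(labels, n_splits):
    """Assign each sample a fold index so every fold gets a similar label mix.

    Hypothetical helper for illustration only: no shuffling, and not the
    algorithm used by scikit-learn's StratifiedKFold.
    """
    fold_of = [0] * len(labels)
    seen = defaultdict(int)  # number of samples of each label placed so far
    for i, label in enumerate(labels):
        fold_of[i] = seen[label] % n_splits  # deal each class out round-robin
        seen[label] += 1
    return fold_of

# Toy sentiment column (made-up data, not the competition file)
labels = ["neutral"] * 6 + ["negative"] * 3 + ["positive"] * 3
folds = stratified_fold_ids(labels, n_splits=3)

# Label counts of the samples held out in each fold
for k in range(3):
    held_out = Counter(l for l, f in zip(labels, folds) if f == k)
    print(k, dict(held_out))
```

Each fold ends up with the same 2:1:1 ratio of neutral, negative, and positive samples as the full toy set, which is the property a stratified split is meant to preserve.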

##### Global variables

In [2]:
path = os.getcwd()

### Initialize tokenizer
A tokenizer is an algorithm that transforms text into tokens, each mapped to an integer id that a neural network can take as input.

**ByteLevelBPETokenizer**
A BPE (Byte-Pair Encoding) tokenizer has a vocabulary made up of single characters and character sequences. To build the vocabulary, we start with every character as its own token and repeatedly merge the pair of adjacent tokens that occurs most often in the data set. However, if the base tokens are Unicode characters, the base vocabulary can get very large. The byte-level variant avoids this by starting from bytes instead of characters, so the base vocabulary has only 256 entries.
The constructor takes two files as arguments. ```merges``` lists the token pairs to merge, and ```vocab``` contains (key, value) pairs in which the keys are tokens and the values are the integer ids fed to the neural network.
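
The merge procedure described above can be sketched in a few lines of plain Python. This is a toy illustration under simplifying assumptions, not the `tokenizers` implementation: the function name `train_byte_bpe` is hypothetical, words are split on whitespace, and the real byte-level tokenizer additionally remaps bytes to printable characters and treats spaces specially.

```python
from collections import Counter

def train_byte_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of each word."""
    # Each word starts as a tuple of single-byte tokens.
    words = Counter(
        tuple(bytes([b]) for b in w.encode("utf-8")) for w in corpus.split()
    )
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of tokens occurs.
        pairs = Counter()
        for tokens, freq in words.items():
            for pair in zip(tokens, tokens[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties go to first seen
        merges.append(best)
        # Replace every occurrence of the best pair with one merged token.
        merged_words = Counter()
        for tokens, freq in words.items():
            out, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                    out.append(tokens[i] + tokens[i + 1])
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words
    return merges

# A non-ASCII character is simply several byte tokens in the base vocabulary.
print(list("é".encode("utf-8")))  # [195, 169]

merges = train_byte_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [(b'l', b'o'), (b'lo', b'w')]
```

After two merges the frequent word `low` has become a single token, while rarer suffixes such as `er` and `est` are still spelled out byte by byte.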

For this experiment, the files are available on the [Hugging Face website](https://huggingface.co/roberta-base/tree/main).
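
As a rough guide to what the two files look like, the sketch below parses miniature in-memory stand-ins for them. The contents are made up for illustration and are not taken from the real RoBERTa files; the helper names `parse_vocab` and `parse_merges` are hypothetical. It assumes the usual layout: a JSON object mapping tokens to ids, and a text file with a `#version` header followed by one space-separated merge pair per line.

```python
import json

# Made-up miniature stand-ins for vocab.json and merges.txt (not the real files).
vocab_text = '{"<s>": 0, "</s>": 1, "l": 2, "o": 3, "w": 4, "lo": 5, "low": 6}'
merges_text = "#version: 0.2\nl o\nlo w\n"

def parse_vocab(text):
    """Return the token -> id mapping stored as a JSON object."""
    return json.loads(text)

def parse_merges(text):
    """Return merge rules in file order, skipping the version header."""
    lines = text.splitlines()
    if lines and lines[0].startswith("#version"):
        lines = lines[1:]
    rules = []
    for line in lines:
        if not line:
            continue
        left, right = line.split(" ")
        rules.append((left, right))
    return rules

vocab = parse_vocab(vocab_text)
merges = parse_merges(merges_text)
id_to_token = {i: t for t, i in vocab.items()}  # reverse map, handy for decoding

print(vocab["low"], merges, id_to_token[5])
```

In practice the tokenizer reads both files itself; the reverse map is only shown because turning ids back into tokens needs the same table read in the other direction.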

In [5]:
tokenizer = tokenizers.ByteLevelBPETokenizer(
    vocab=path + '/vocab.json',
    merges=path + '/merges.txt',
    lowercase=True,        # Lowercase the text before tokenizing
    add_prefix_space=True  # Prepend a space so the first word is tokenized like the others
)

# Token ids for the three sentiment labels in the RoBERTa vocabulary
sentiment_id = {'positive': 1313, 'negative': 2430, 'neutral': 7974}

train = pd.read_csv(path+'/train.csv').fillna('')
train.head()

Unnamed: 0,textID,text,selected_text,sentiment
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative
3,9642c003ef,what interview! leave me alone,leave me alone,negative
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative
