# **RoBERTa** - Training
This notebook fine-tunes a sentiment classifier, starting from the pre-trained RoBERTa model (`roberta-base`) provided by the `transformers` library.

## Model description
 1. RoBERTa + Dropout + Linear dense + Dropout + Linear classifier
 2. Cross-entropy loss
 3. Fine-tuning of RoBERTa
 4. Adam with weight decay optimizer (https://arxiv.org/abs/1711.05101)
 5. Cosine schedule
 6. Preprocessing pipeline: _'standard'_
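
The cross-entropy loss listed above can be illustrated on a single sample with two class scores. This is a minimal sketch in plain Python, not the loss code used by `roberta_model`; the function name `cross_entropy` and the example logits are made up for illustration.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    # Subtract the max logit before exponentiating, for numerical stability
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Two classes: index 0 = negative, index 1 = positive (illustrative logits)
uncertain = cross_entropy([0.0, 0.0], target=1)   # equal scores for both classes
confident = cross_entropy([-2.0, 3.0], target=1)  # high score on the true class
wrong = cross_entropy([-2.0, 3.0], target=0)      # high score on the wrong class
print(uncertain, confident, wrong)
```

The loss is small when the true class gets the larger logit and grows as the model becomes confidently wrong, which is the signal used to fine-tune the classifier.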

## Notes
**GPU is required**

## Credits
Some ideas were taken from https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/

## Reproducibility
After running this notebook, you will obtain the model used for submission **#108663** on AIcrowd, which achieved the following scores:

| Accuracy | F1 |
|:---:|:---:|
| 90.0% | 90.1% |

## Set up

In [2]:
# Libraries that could be missing
!pip install numpy
!pip install torch
!pip install transformers
!pip install wordsegment
!pip install nltk

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/db/98c3ea1a78190dac41c0127a063abf92bd01b4b0b6970a6db1c2f5b66fa0/transformers-4.0.1-py3-none-any.whl (1.4MB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=774acd17a86697167c1

In [3]:
import transformers
from transformers import AutoTokenizer, RobertaForSequenceClassification, AdamW, get_cosine_schedule_with_warmup

In [4]:
transformers.logging.set_verbosity_info()

In [5]:
import numpy as np
from numpy.random import RandomState
import torch
import torch.nn as nn

In [6]:
# Contains preprocessing functions
from preprocessing_v6 import *
# Contains all the functions related to the model
from roberta_model import *

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## GPU check

In [7]:
assert torch.cuda.is_available(), "A CUDA-enabled GPU is required to execute this notebook (in a reasonable amount of time)"

In [8]:
print("GPU detected:", torch.cuda.get_device_properties('cuda:0'))

GPU detected: _CudaDeviceProperties(name='Tesla P100-PCIE-16GB', major=6, minor=0, total_memory=16280MB, multi_processor_count=56)


In [9]:
gpu = torch.device('cuda:0')

## Load components

In [None]:
bert_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

In [11]:
bert_tokenizer.add_prefix_space = False

In [12]:
# Test preprocessing
sample_sentence = "that's a #verybad sentence <user> <url> youre gonna love it. lemme know what u think :-/"
print("Testing preprocessing & tokenizer...")
print("Original sentence:", sample_sentence)
print("Processed sentence:", bert_tokenizer.tokenize(apply_preprocessing(bert_tokenizer, sample_sentence)))

Testing preprocessing & tokenizer...
Original sentence: that's a #verybad sentence <user> <url> youre gonna love it. lemme know what u think :-/
Processed sentence: ['that', 'Ġis', 'Ġa', 'Ġvery', 'Ġbad', 'Ġsentence', 'Ġ<', 'user', '>', 'Ġ<', 'url', '>', 'Ġyou', 'Ġare', 'Ġgoing', 'Ġto', 'Ġlove', 'Ġit', '.', 'Ġlet', 'Ġme', 'Ġknow', 'Ġwhat', 'Ġyou', 'Ġthink', 'Ġ:', '-', '/']
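
In the output above, the `Ġ` prefix marks a token that was preceded by a space in RoBERTa's byte-level BPE vocabulary. The output also shows that the preprocessing expands contractions and slang before tokenization. The real rules live in `preprocessing_v6` and are not shown in this notebook, so the following is only a rough sketch of that kind of normalization; `EXPANSIONS`, `expand_tokens` and `strip_hashtag_sign` are hypothetical names, and hashtag word segmentation (`#verybad` to `very bad`) is deliberately left out.

```python
import re

# Hypothetical lookup table; the actual rules in preprocessing_v6 may differ
EXPANSIONS = {
    "that's": "that is",
    "youre": "you are",
    "gonna": "going to",
    "lemme": "let me",
    "u": "you",
}

def expand_tokens(text):
    """Replace whole whitespace-delimited tokens that appear in EXPANSIONS."""
    return " ".join(EXPANSIONS.get(tok, tok) for tok in text.split())

def strip_hashtag_sign(text):
    """Drop the leading '#' of hashtags; splitting the hashtag into words is not attempted."""
    return re.sub(r"#(\w+)", r"\1", text)

sample = "that's a #verybad sentence youre gonna love it. lemme know what u think"
normalized = expand_tokens(strip_hashtag_sign(sample))
print(normalized)
```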


In [None]:
bert_model = RobertaForSequenceClassification.from_pretrained("roberta-base")

In [14]:
# Initialize random state (for reproducibility)
rng = RandomState(124)

## Define parameters

In [40]:
# Max number of tokens in each tweet
MAX_LENGTH = 200
# Batch size
BATCH_SIZE = 32
# Number of epochs of training
EPOCHS = 2
# Filename of predictions
PREDICTIONS_FILENAME = "predictions.csv"
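
`MAX_LENGTH` fixes the number of tokens per tweet, which means shorter sequences have to be padded and longer ones cut. How `SentimentDataset` does this is not shown here, so the snippet below is a hedged sketch of the general idea only: `pad_or_truncate` is a hypothetical helper, and the token ids and `pad_id` value are placeholders.

```python
def pad_or_truncate(token_ids, max_len, pad_id):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    ids = list(token_ids[:max_len])           # cut sequences that are too long
    mask = [1] * len(ids)                     # 1 = real token
    padding = max_len - len(ids)
    # 0 in the mask tells the model to ignore the padded positions
    return ids + [pad_id] * padding, mask + [0] * padding

# Placeholder ids; a real tokenizer supplies both the ids and its own pad id
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_len=5, pad_id=1)
long_ids, long_mask = pad_or_truncate(list(range(10)), max_len=4, pad_id=1)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

A real tokenizer usually also preserves the closing special token when truncating, which this sketch does not attempt.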

## Import data

In [16]:
# Download negative small
# !wget https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDQyeURtYWFXMzZoMnVEeGc_ZT1IMnhQ/root/content -O neg_small.txt
# Download positive small
# !wget https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDQxYUNPOENKdTBrX19hY2c_ZT1WNW5Y/root/content -O pos_small.txt

In [17]:
# Download negative full dataset
!wget https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDQ0eDZMdDI5WXBlVXYyZGc_ZT1ZZDJn/root/content -O neg_full.txt

--2020-12-16 13:23:07--  https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDQ0eDZMdDI5WXBlVXYyZGc_ZT1ZZDJn/root/content
Resolving api.onedrive.com (api.onedrive.com)... 13.107.42.12
Connecting to api.onedrive.com (api.onedrive.com)|13.107.42.12|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://d6fldw.db.files.1drv.com/y4mT677vxr3BV8LZI_OqX1DbpwG5d-npIsv2PqEvg4r49RqRI9-QE__dsFsHCdTdnNP-IkyKZpesyPoVdD_kAWg6MQzntgpFWy1saRspUrpOnctnGcSlvikJjOFHtMM8laRD96sUbU0t_1sPyMUHjdAD1iy2w7_TLMAX3ig614_7AkB-b2utLC3cHtP0X4uier5OGQv-NqKuA8ZPUODIjN0vw/train_neg_full_u.txt [following]
--2020-12-16 13:23:07--  https://d6fldw.db.files.1drv.com/y4mT677vxr3BV8LZI_OqX1DbpwG5d-npIsv2PqEvg4r49RqRI9-QE__dsFsHCdTdnNP-IkyKZpesyPoVdD_kAWg6MQzntgpFWy1saRspUrpOnctnGcSlvikJjOFHtMM8laRD96sUbU0t_1sPyMUHjdAD1iy2w7_TLMAX3ig614_7AkB-b2utLC3cHtP0X4uier5OGQv-NqKuA8ZPUODIjN0vw/train_neg_full_u.txt
Resolving d6fldw.db.files.1drv.com (d6fldw.db.files.1drv.com)

In [18]:
# Download positive full dataset
!wget https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDQzcTc3QmNPbUdIWHQ3TXc_ZT01ejdG/root/content -O pos_full.txt

--2020-12-16 13:23:19--  https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDQzcTc3QmNPbUdIWHQ3TXc_ZT01ejdG/root/content
Resolving api.onedrive.com (api.onedrive.com)... 13.107.42.12
Connecting to api.onedrive.com (api.onedrive.com)|13.107.42.12|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://pknvjw.db.files.1drv.com/y4mB6CuxdVyCAa1f_jeMpRk339mvvQqxnZRyH6_43ppSYfXDvaPBdmK92XPj-ptktso47h95B_PKWCJw0Yy5yxj6_pF5I5eRggN0bTDBdc9NkAGry8mcM3jdkFMlp4TRx76UK-2-KMAMX2cG5Hmi9tKLomHJrTrQ1WfC6KoqiueRMA_-IcQZIFUYbsBWxGJiM16U5uTVSurs_j8ejysi5y-vw/train_pos_full_u.txt [following]
--2020-12-16 13:23:20--  https://pknvjw.db.files.1drv.com/y4mB6CuxdVyCAa1f_jeMpRk339mvvQqxnZRyH6_43ppSYfXDvaPBdmK92XPj-ptktso47h95B_PKWCJw0Yy5yxj6_pF5I5eRggN0bTDBdc9NkAGry8mcM3jdkFMlp4TRx76UK-2-KMAMX2cG5Hmi9tKLomHJrTrQ1WfC6KoqiueRMA_-IcQZIFUYbsBWxGJiM16U5uTVSurs_j8ejysi5y-vw/train_pos_full_u.txt
Resolving pknvjw.db.files.1drv.com (pknvjw.db.files.1drv.com)

In [19]:
# Read every negative tweet (one per line, trailing newline kept)
with open("neg_full.txt", "r") as f:
    neg_train = f.readlines()

In [20]:
# Read every positive tweet (one per line, trailing newline kept)
with open("pos_full.txt", "r") as f:
    pos_train = f.readlines()

In [21]:
print("Dataset loaded. Size: \t negative %d \t positive %d" % (len(neg_train), len(pos_train)))

Dataset loaded. Size: 	 negative 1142838 	 positive 1127644


#### Re-establish balance between classes

In [22]:
# Truncate the larger class so that both classes have the same number of samples
n_per_class = min(len(neg_train), len(pos_train))
neg_train = neg_train[:n_per_class]
pos_train = pos_train[:n_per_class]

In [23]:
assert len(neg_train) == len(pos_train)

#### Trim and shuffle

In [24]:
# Select the number of samples (from each class) to use for training
samples_num_by_cat = 1_120_000

In [25]:
neg_train = neg_train[:samples_num_by_cat]
pos_train = pos_train[:samples_num_by_cat]

In [26]:
train_labels = np.concatenate([[0] * len(neg_train), [1] * len(pos_train)])
train_data = np.concatenate([neg_train, pos_train])

In [None]:
shuffling = np.arange(0, train_data.shape[0])
len(shuffling)

In [28]:
rng.shuffle(shuffling)

In [29]:
train_labels = train_labels[shuffling]
train_data = train_data[shuffling]
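
The cells above shuffle an array of indices once and use it to reorder both the tweets and the labels, so each tweet keeps its label. The toy example below sketches the same idea with the standard-library `random` module instead of NumPy; the texts and variable names are invented for illustration.

```python
import random

# Toy data: the text itself tells us which label it should carry
texts = ["neg a", "neg b", "pos a", "pos b"]
labels = [0, 0, 1, 1]

# Shuffle one list of indices, then apply it to both sequences
order = list(range(len(texts)))
random.Random(124).shuffle(order)

shuffled_texts = [texts[i] for i in order]
shuffled_labels = [labels[i] for i in order]
print(list(zip(shuffled_texts, shuffled_labels)))
```

Shuffling the two sequences separately would break the pairing, which is why a single permutation is shared.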

In [None]:
split = rng.choice(
    ["train", "val"],
    size=len(train_data),
    p=[.9, .1]
)
split
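
Note that `rng.choice` with `p=[.9, .1]` assigns each sample to a split independently, so the validation set holds roughly, not exactly, 10% of the data. The sketch below reproduces that behaviour with the standard-library `random` module rather than NumPy; `random_split` is a hypothetical helper and the sample count is arbitrary.

```python
import random

def random_split(n_samples, val_fraction, seed):
    """Assign each sample to 'train' or 'val' independently with a fixed probability."""
    rng = random.Random(seed)
    return ["val" if rng.random() < val_fraction else "train" for _ in range(n_samples)]

assignment = random_split(10_000, val_fraction=0.1, seed=124)
n_val = assignment.count("val")
n_train = len(assignment) - n_val
print("train:", n_train, "val:", n_val)
```

If an exact 90/10 split were needed, slicing the already shuffled arrays at a fixed index would be an alternative.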

In [31]:
bert_x_data = train_data[split == "train"]
bert_labels = train_labels[split == "train"]

In [32]:
train_dataset = SentimentDataset(
    train_data[split == "train"], 
    train_labels[split == "train"], 
    tokenizer=bert_tokenizer, 
    max_len=MAX_LENGTH
)

In [33]:
train_loader = get_loader(train_dataset, batch_size=BATCH_SIZE)

In [None]:
print("Random sample:")
train_dataset[1]

In [35]:
val_dataset = SentimentDataset(
    train_data[split == "val"], 
    train_labels[split == "val"], 
    tokenizer=bert_tokenizer, 
    max_len=MAX_LENGTH
)

In [36]:
val_loader = get_loader(val_dataset, batch_size=BATCH_SIZE)

In [None]:
print("Training set size: %d \t Validation set size: %d" % (len(train_dataset), len(val_dataset)))

## Train the model

In [38]:
bert_classification = RobertaSimple(bert_model)
bert_classification = bert_classification.to(gpu)

In [41]:
# Initialize the optimizer
optimizer = AdamW(bert_classification.parameters(), lr=2e-5, correct_bias=False)
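
For intuition, here is a scalar sketch of an Adam update with decoupled weight decay, the idea described in the paper linked in the model description. It omits bias correction, in the spirit of `correct_bias=False`, and is not the `transformers` source code. The cell above does not pass `weight_decay`, so the default of the installed `transformers` version applies; the value `0.01` below, like the other hyperparameters, is purely illustrative.

```python
import math

def adamw_step(p, grad, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    """One update of a single scalar parameter p; returns the new (p, m, v)."""
    # Exponential moving averages of the gradient and of its square
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Adam step without bias correction
    p = p - lr * m / (math.sqrt(v) + eps)
    # Decoupled weight decay: shrink the parameter directly,
    # instead of adding an L2 term to the gradient
    p = p - lr * weight_decay * p
    return p, m, v

# With a zero gradient only the decay term acts on the parameter
decayed, _, _ = adamw_step(p=1.0, grad=0.0, m=0.0, v=0.0, lr=0.1)
# With a positive gradient the parameter moves in the negative direction
stepped, m1, v1 = adamw_step(p=1.0, grad=0.5, m=0.0, v=0.0, lr=0.1)
print(decayed, stepped)
```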

In [42]:
# Initialize the scheduler
tot_steps = EPOCHS * len(train_loader)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=tot_steps
)
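
With `num_warmup_steps=0`, the cosine schedule starts at the full learning rate and decays it to zero over `tot_steps`. The function below is a plain-Python sketch of the multiplier applied to the base learning rate for a half-cosine schedule with optional linear warmup. It is written from the formula rather than copied from `transformers`, and `cosine_lr_multiplier` is a hypothetical name.

```python
import math

def cosine_lr_multiplier(step, num_training_steps, num_warmup_steps=0):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # Half a cosine period: 1 at progress 0, 0 at progress 1
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

base_lr = 2e-5      # same base learning rate as the optimizer cell
total_steps = 100   # illustrative value, not the real number of steps
for step in (0, 25, 50, 75, 100):
    print(step, base_lr * cosine_lr_multiplier(step, total_steps))
```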

In [None]:
# Train the model; after each epoch, report the validation result and save a checkpoint
for epch in range(EPOCHS):
  print("EPOCH: ", epch)
  print("\t Train: ", train_epoch(bert_classification, train_loader, optimizer, gpu, scheduler, len(train_dataset)))
  print("\t Validation: ", eval_model(bert_classification, val_loader, gpu))
  save_model("RoBERTa_preproc_" + str(epch) + "epch", bert_classification)

In [None]:
eval_model(bert_classification, val_loader, gpu)

## Predict

In [None]:
!wget -O test_data.txt https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDR5Q3hoWXM4T2FJd1JLenc_ZT1hSXh0/root/content

In [None]:
submission_idxs, submission_labels, _ = prepare_submission(bert_classification, bert_tokenizer, gpu, batch_size=BATCH_SIZE, max_len=MAX_LENGTH, test_filename="test_data.txt")
submission_idxs, submission_labels

In [None]:
write_submission(PREDICTIONS_FILENAME, submission_idxs, submission_labels)
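
`write_submission` is defined in `roberta_model` and its output format is not shown in this notebook. As a rough sketch of what such a helper might do, the example below writes ids and predictions to CSV. The column names `Id` and `Prediction`, the mapping of class 0 to `-1` and class 1 to `1`, and the function name `write_submission_sketch` are all assumptions and should be checked against the competition's submission format.

```python
import csv
import io

def write_submission_sketch(handle, ids, labels):
    """Write one 'Id,Prediction' row per sample (assumed format, see the note above)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Id", "Prediction"])
    for idx, label in zip(ids, labels):
        # Assumed mapping: class 1 (positive) -> 1, class 0 (negative) -> -1
        writer.writerow([idx, 1 if label == 1 else -1])

# Write to an in-memory buffer with made-up ids and labels
buffer = io.StringIO()
write_submission_sketch(buffer, ids=[1, 2, 3], labels=[0, 1, 1])
print(buffer.getvalue())
```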