# **RoBERTa** - Predictions
This notebook loads our fine-tuned RoBERTa model and computes predictions on the test data.

## Model description
 1. RoBERTa + Linear head
 2. CrossEntropy Loss
 3. Fine-tuned RoBERTa
 4. Preprocessing pipeline _'standard'_
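
As a rough illustration of items 1 and 2 above, the sketch below computes the logits of a linear head and the cross-entropy loss for a single example in plain Python. The feature vector, weights, and bias are made-up toy values and the function names are hypothetical; the actual model and loss come from `transformers` and `roberta_model`, whose code is not shown in this notebook.

```python
import math


def linear_head(features, weights, bias):
    """Map a feature vector to one logit per class (logit = w . x + b)."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Toy values, chosen only for illustration
features = [0.5, -1.0, 2.0]
weights = [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.4]]
bias = [0.0, 0.1]

logits = linear_head(features, weights, bias)
loss = cross_entropy(logits, target=1)
print(logits, loss)
```

The loss shrinks as the logit of the target class grows relative to the others, which is what fine-tuning optimizes.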

## Notes
A GPU is **not** required, but it helps. Computing predictions with a batch size of 32 takes around 2 GB of dedicated memory.

## Credits
Some ideas were taken from https://curiousily.com/posts/sentiment-analysis-with-bert-and-hugging-face-using-pytorch-and-python/

## Reproducibility
The predictions obtained with this notebook match those of submission **#108663** on AIcrowd:

| Accuracy | F1 |
|:---:|:---:|
| 90.0% | 90.1% |
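
For reference, the two metrics in the table can be computed as in the sketch below. It assumes a binary task with `1` as the positive class, which this notebook does not state, and the toy labels are invented; how AIcrowd computes its scores is not shown here.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1 of the positive class) for two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Invented toy labels, only to exercise the function
acc, f1 = accuracy_and_f1([1, 1, -1, -1, 1], [1, -1, -1, 1, 1])
print(f"accuracy={acc:.3f} f1={f1:.3f}")
```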



## Set up

In [1]:
# Libraries that could be missing (uncomment to install)
# !pip install numpy
# !pip install torch
# !pip install transformers
# !pip install wordsegment
# !pip install nltk

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/db/98c3ea1a78190dac41c0127a063abf92bd01b4b0b6970a6db1c2f5b66fa0/transformers-4.0.1-py3-none-any.whl (1.4MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manylinux2010_x86_64.whl (2.9MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp36-none-any.whl size=893261 sha256=989b3998c9641

In [1]:
import transformers
from transformers import AutoTokenizer, RobertaForSequenceClassification

In [2]:
transformers.logging.set_verbosity_info()

In [3]:
import numpy as np
from numpy.random import RandomState
import torch  # used explicitly in the GPU check

In [4]:
# Contains preprocessing functions
from preprocessing_v6 import *
# Contains all the functions related to the model
from roberta_model import *

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## GPU check

In [5]:
# Use GPU if possible
if torch.cuda.is_available():
  used_device = torch.device('cuda:0')
  print("Using GPU:", torch.cuda.get_device_properties('cuda:0'))
else:
  used_device = torch.device('cpu')
  print("Using CPU")

Using CPU


## Load components

In [6]:
bert_tokenizer = AutoTokenizer.from_pretrained("roberta-base")

loading configuration file https://huggingface.co/roberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

loading file https://huggingface.co/roberta-base/resolve/main/vocab.json from cache at /root/.cache/huggingface/transformers/d3ccdbfeb9aaa747ef20432d4976c32ee3fa69663b379deb253ccfce2bb1f

In [7]:
bert_tokenizer.add_prefix_space = False

In [8]:
# Test preprocessing
sample_sentence = "that's a #verybad sentence <user> <url> youre gonna love it. lemme know what u think :-/"
print("Testing preprocessing & tokenizer...")
print("Original sentence:", sample_sentence)
print("Processed sentence:", bert_tokenizer.tokenize(apply_preprocessing(bert_tokenizer, sample_sentence)))

Testing preprocessing & tokenizer...
Original sentence: that's a #verybad sentence <user> <url> youre gonna love it. lemme know what u think :-/
Processed sentence: ['that', 'Ġis', 'Ġa', 'Ġvery', 'Ġbad', 'Ġsentence', 'Ġ<', 'user', '>', 'Ġ<', 'url', '>', 'Ġyou', 'Ġare', 'Ġgoing', 'Ġto', 'Ġlove', 'Ġit', '.', 'Ġlet', 'Ġme', 'Ġknow', 'Ġwhat', 'Ġyou', 'Ġthink', 'Ġ:', '-', '/']
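
The output above shows slang and contractions being expanded before tokenization. A minimal sketch of that kind of normalization follows. The replacement table and the `normalize_tweet` name are hypothetical and are not the contents of `preprocessing_v6`; hashtag segmentation (`#verybad` to `very bad`) is not reproduced here, so the sketch only strips the `#`.

```python
# Hypothetical replacement table, built from the sample sentence above
REPLACEMENTS = {
    "that's": "that is",
    "youre": "you are",
    "gonna": "going to",
    "lemme": "let me",
    "u": "you",
}


def normalize_tweet(text, replacements=REPLACEMENTS):
    """Expand known slang/contractions token by token and strip leading '#'."""
    out = []
    for tok in text.split():
        if tok.startswith("#") and len(tok) > 1:
            tok = tok[1:]  # keep the hashtag text, drop the marker
        out.append(replacements.get(tok.lower(), tok))
    return " ".join(out)


print(normalize_tweet("that's a #verybad sentence youre gonna love it. lemme know what u think"))
```

A lookup on whitespace-separated tokens is the simplest option; it misses tokens with attached punctuation, such as `u?`.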


In [9]:
bert_model = RobertaForSequenceClassification.from_pretrained("roberta-base")

loading configuration file https://huggingface.co/roberta-base/resolve/main/config.json from cache at /root/.cache/huggingface/transformers/733bade19e5f0ce98e6531021dd5180994bb2f7b8bd7e80c7968805834ba351e.35205c6cfc956461d8515139f0f8dd5d207a2f336c0c3a83b4bc8dca3518e37b
Model config RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "type_vocab_size": 1,
  "vocab_size": 50265
}

loading weights file https://huggingface.co/roberta-base/resolve/main/pytorch_model.bin from cache at /root/.cache/huggingface/transformers/51ba668f7ff34e7cdfa9561e8361747738113878850a7d

## Define the parameters

In [12]:
# Max number of tokens in each tweet
MAX_LENGTH = 200
# Batch size
BATCH_SIZE = 32
# Filename of predictions
PREDICTIONS_FILENAME = "predictions.csv"
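
To illustrate what a fixed `MAX_LENGTH` implies, the sketch below pads or truncates a list of token ids to a given length and builds the matching attention mask. The helper name is hypothetical, and whether `prepare_submission` does exactly this is an assumption. The default `pad_id=1` matches `pad_token_id` in the config printed above. Real tokenizers usually keep the end-of-sequence token when truncating, which this simplified version does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id=1):
    """Return (ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids[:max_length])   # cut sequences that are too long
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


ids, mask = pad_or_truncate([0, 713, 2], max_length=5)
print(ids, mask)
```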

## Load fine-tuned model

In [13]:
# Download our fine-tuned model
!wget -O RoBERTa_finetuned_std.pth https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3UvcyFBclREZ3U5ejdJT1ZqcU1ndVZGUzV0RDdJeTFIUUE_ZT1wRW9q/root/content

--2020-12-16 13:05:38--  https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3UvcyFBclREZ3U5ejdJT1ZqcU1ndVZGUzV0RDdJeTFIUUE_ZT1wRW9q/root/content
Resolving api.onedrive.com (api.onedrive.com)... 13.107.42.12
Connecting to api.onedrive.com (api.onedrive.com)|13.107.42.12|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://d2pbrw.db.files.1drv.com/y4mTId-TFwJFbJrCdmNdgHvky3s_t7WCLJx3wcUulxrXyjT9_RJKVEQxIxQVR5Y99Fd90fVgWIF0mmtqwBr00q3P3wRnvPwkuvMhm05cyvFNTKJpaOiSXbLlwWcrJU06vXGyaejGjfO8WtfENUReYnkUIsjlXaNb98C8xfRGcRqoF7GYIpGVCajyt3TI1nFGznHXixq2LoVGWvZR4YPGvhUSQ/RoBERTa_preproc_std_1epch.pth [following]
--2020-12-16 13:05:38--  https://d2pbrw.db.files.1drv.com/y4mTId-TFwJFbJrCdmNdgHvky3s_t7WCLJx3wcUulxrXyjT9_RJKVEQxIxQVR5Y99Fd90fVgWIF0mmtqwBr00q3P3wRnvPwkuvMhm05cyvFNTKJpaOiSXbLlwWcrJU06vXGyaejGjfO8WtfENUReYnkUIsjlXaNb98C8xfRGcRqoF7GYIpGVCajyt3TI1nFGznHXixq2LoVGWvZR4YPGvhUSQ/RoBERTa_preproc_std_1epch.pth
Resolving d2pbrw.db.files.1drv.com (d2pbrw.

In [13]:
reloaded_model = load_model("RoBERTa_finetuned_std", bert_model, used_device)

Model loaded


## Predict

In [15]:
# Download test_data.txt
!wget -O test_data.txt https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDR5Q3hoWXM4T2FJd1JLenc_ZT1hSXh0/root/content

--2020-12-16 13:11:48--  https://api.onedrive.com/v1.0/shares/u!aHR0cHM6Ly8xZHJ2Lm1zL3QvcyFBclREZ3U5ejdJT1ZqcDR5Q3hoWXM4T2FJd1JLenc_ZT1hSXh0/root/content
Resolving api.onedrive.com (api.onedrive.com)... 13.107.42.12
Connecting to api.onedrive.com (api.onedrive.com)|13.107.42.12|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://43loag.db.files.1drv.com/y4mXjx_iztEzPDr2_yEraDIdBKOZgu7urAv8l930TMSuHIzGvuuFoS5EfKK4GgVbyI14jS0zrrS931mmVdpBJy7ijfAC-JdaNmzUA6UaAGlRPgFHOMpuv1AGSx8mXlfcvy3wvWFXD_SU74GTlcsVczeKhgAYKm143iI_FhQ3xJt4LHHaGElsHNgoLfjIrFmv55BCkb-Wn44B_ej_zp_5Xu4yg/test_data.txt [following]
--2020-12-16 13:11:49--  https://43loag.db.files.1drv.com/y4mXjx_iztEzPDr2_yEraDIdBKOZgu7urAv8l930TMSuHIzGvuuFoS5EfKK4GgVbyI14jS0zrrS931mmVdpBJy7ijfAC-JdaNmzUA6UaAGlRPgFHOMpuv1AGSx8mXlfcvy3wvWFXD_SU74GTlcsVczeKhgAYKm143iI_FhQ3xJt4LHHaGElsHNgoLfjIrFmv55BCkb-Wn44B_ej_zp_5Xu4yg/test_data.txt
Resolving 43loag.db.files.1drv.com (43loag.db.files.1drv.com)... 13.107.42.

In [None]:
submission_idxs, submission_labels, _ = prepare_submission(reloaded_model, bert_tokenizer, used_device, batch_size=BATCH_SIZE, max_len=MAX_LENGTH, test_filename="test_data.txt")
submission_idxs, submission_labels

Loading file...
Content: [1, 2] ['sea doo pro sea scooter ( sports with the portable sea-doo seascootersave air , stay longer in the water and ... <url>\n', "<user> shucks well i work all week so now i can't come cheer you on ! oh and put those batteries in your calculator ! ! !\n"]
Create dataloader...
Generating predictions...
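
The prediction step processes the test tweets in batches and turns each row of logits into a label. The sketch below shows both ideas in plain Python. The helper names are hypothetical, the logits are invented, and the mapping of class indices to the labels `-1` and `1` is an assumption; the actual logic lives in `prepare_submission`, which is not shown here.

```python
def batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def logits_to_labels(logits_batch, class_labels=(-1, 1)):
    """Pick the class with the highest logit for every row."""
    labels = []
    for logits in logits_batch:
        best = max(range(len(logits)), key=lambda i: logits[i])
        labels.append(class_labels[best])
    return labels


sizes = [len(b) for b in batches(list(range(70)), 32)]
labels = logits_to_labels([[0.2, 1.5], [2.0, -1.0]])
print(sizes, labels)
```

The last batch may be smaller than the batch size, so downstream code should not assume a fixed batch dimension.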


In [None]:
write_submission(PREDICTIONS_FILENAME, submission_idxs, submission_labels)
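
For readers without access to `write_submission`, a minimal stand-in could look like the sketch below. The function name and the `Id`/`Prediction` column headers are assumptions, not taken from this notebook; check the competition's sample submission for the exact format.

```python
import csv


def write_predictions_csv(path, ids, labels):
    """Write one (id, label) pair per row under an assumed two-column header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])  # assumed header names
        writer.writerows(zip(ids, labels))
```

Calling `write_predictions_csv("predictions.csv", [1, 2], [1, -1])` would produce a header line followed by the rows `1,1` and `2,-1`.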