# 1. Setup

In [4]:
import sys
import os
import nltk
import pandas as pd
import warnings

# Manually set the path to the parent directory
parent_dir = os.path.abspath('..')
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

from utility.paths import DataPath
from preprocessing import Preprocessing
from models.roberta import RoBERTa
from tqdm.auto import tqdm

warnings.filterwarnings("ignore")

# Download the required NLTK resources without printing progress messages
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)

**Because implementing the RoBERTa model involves a fair amount of logic, we encapsulate it in a dedicated class in [roberta.py](../models/roberta.py). This notebook mainly presents our results, which keeps it concise and clear.**

RoBERTa is an optimized version of BERT. It is pretrained on a substantially larger dataset and uses dynamic masking instead of static masking: the masked positions are re-sampled each time a sequence is seen rather than being fixed once during preprocessing. It also uses a byte-level BPE tokenizer.
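To make the difference between static and dynamic masking concrete, here is a minimal sketch in plain Python. It is a simplification for illustration only: the helper `mask_tokens`, the example sentence, and the masking probability are assumptions, and real pretraining masks subword tokens with additional replacement rules that are omitted here.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, rng, prob=0.3):
    """Replace each token with MASK independently with probability `prob`."""
    return [MASK if rng.random() < prob else tok for tok in tokens]


tokens = "the quick brown fox jumps over the lazy dog".split()

# Static masking: the pattern is sampled once and reused in every epoch.
static_view = mask_tokens(tokens, random.Random(0))
static_epochs = [static_view for _ in range(3)]

# Dynamic masking: a fresh pattern is sampled every time the sequence is seen.
dynamic_rng = random.Random(0)
dynamic_epochs = [mask_tokens(tokens, dynamic_rng) for _ in range(3)]

for epoch, (s, d) in enumerate(zip(static_epochs, dynamic_epochs), start=1):
    print(f"epoch {epoch} static : {' '.join(s)}")
    print(f"epoch {epoch} dynamic: {' '.join(d)}")
```

With static masking the model sees the same masked positions in every epoch, whereas dynamic masking exposes it to different prediction targets for the same sentence.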

We did not have enough resources to train RoBERTa Large, but we would expect it to be slightly more accurate than the base version.

# 2. Data loading

In [None]:
# Create the preprocessing objects for the training and test data.
train_prep = Preprocessing([DataPath.TRAIN_NEG_FULL, DataPath.TRAIN_POS_FULL])
test_prep = Preprocessing([DataPath.TEST], is_test=True)

In [5]:
# Declare params for RoBERTa.
MAX_LEN = 128
BATCH_SIZE = 32
EPOCHS = 3

In [6]:
# Declare RoBERTa model.
roberta = RoBERTa(weight_path="/content/drive/MyDrive/ML Project 2/weights/roberta",
                  submission_path="/content/drive/MyDrive/ML Project 2/submissions/roberta",
                  max_length=MAX_LEN,
                  is_weight=True)

config.json:   0%|          | 0.00/481 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at /content/drive/MyDrive/ML Project 2/weights/roberta.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


# 3. Data preprocessing

Now, we will preprocess the data to ensure it is clean before beginning the training process.

In [None]:
# Retrieve the preprocessing steps declared in the RoBERTa class and apply them
# to both the train and test data.
for step in tqdm(roberta.preprocessing(), desc="Preprocessing train data"):
    getattr(train_prep, step)()

for step in tqdm(roberta.preprocessing(is_train=False), desc="Preprocessing test data"):
    getattr(test_prep, step)()

Preprocessing train data:   0%|          | 0/7 [00:00<?, ?it/s]

Executing: `drop_duplicates`
Executing: `remove_tag`
Executing: `strip`
Executing: `remove_ellipsis`
Executing: `reconstruct_emoji`


100%|██████████| 2268591/2268591 [00:18<00:00, 120802.38it/s]


Executing: `remove_extra_space`


100%|██████████| 2268591/2268591 [00:02<00:00, 1108200.90it/s]


Executing: `remove_space_around_emoji`
Executing: `remove_extra_space`


100%|██████████| 2268591/2268591 [00:02<00:00, 1107346.35it/s]


Preprocessing test data:   0%|          | 0/6 [00:00<?, ?it/s]

Executing: `remove_tag`
Executing: `strip`
Executing: `remove_ellipsis`
Executing: `reconstruct_emoji`


100%|██████████| 10000/10000 [00:00<00:00, 118059.62it/s]


Executing: `remove_extra_space`


100%|██████████| 10000/10000 [00:00<00:00, 793834.51it/s]

Executing: `remove_space_around_emoji`
Executing: `remove_extra_space`


100%|██████████| 10000/10000 [00:00<00:00, 885341.21it/s]
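The preprocessing loop above looks up each step by name with `getattr` and calls it on the preprocessing object. The following sketch illustrates that dispatch pattern with a toy class. `ToyPreprocessing` and `toy_steps` are hypothetical stand-ins, not the project's `Preprocessing` class or `RoBERTa.preprocessing()`, and only three of the logged steps are imitated.

```python
class ToyPreprocessing:
    """Hypothetical stand-in that holds a list of texts and mutates it in place."""

    def __init__(self, texts):
        self.texts = list(texts)

    def drop_duplicates(self):
        # dict.fromkeys keeps the first occurrence and preserves order.
        self.texts = list(dict.fromkeys(self.texts))

    def strip(self):
        self.texts = [t.strip() for t in self.texts]

    def remove_extra_space(self):
        self.texts = [" ".join(t.split()) for t in self.texts]


def toy_steps(is_train=True):
    """Return step names in order; duplicates are only dropped for training data."""
    steps = ["strip", "remove_extra_space"]
    return ["drop_duplicates"] + steps if is_train else steps


raw = ["hi  there ", "hi  there ", " bye"]

train_prep_toy = ToyPreprocessing(raw)
for step in toy_steps():
    getattr(train_prep_toy, step)()  # look up the method by name, then call it

test_prep_toy = ToyPreprocessing(raw)
for step in toy_steps(is_train=False):
    getattr(test_prep_toy, step)()

print(train_prep_toy.texts)
print(test_prep_toy.texts)
```

Keeping the step list in the model class means each model can declare its own pipeline while the notebook loop stays the same. Test data skips `drop_duplicates`, which matches the 7 versus 6 steps in the log, presumably because every test row needs a prediction.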


In [None]:
# Retrieve the preprocessed df.
train_data = train_prep.__get__()
test_data = test_prep.__get__()

In [None]:
# Export the dataframes. The training frame is shuffled first.
train_data = train_data.sample(frac=1)
train_data.to_csv(DataPath.BERT_TRAIN, index=False)

test_data.to_csv(DataPath.BERT_TEST, index=False)

In [None]:
# Read the dataframe
train_df = pd.read_csv(DataPath.BERT_TRAIN)
train_df.dropna(inplace=True)

| Unnamed: 0 | text | label |
|---|---|---|
| 0 | seen while surfing : what a solution ! if you ... | 1.0 |
| 1 | something n40 would prevent ! smh i feel for d... | 0.0 |
| 2 | nah betch u owe me so i want food and i get to... | 1.0 |
| 3 | mine always get eaten before they get going . ... | 0.0 |
| 4 | i never got to go i'm 23 & still dream about i... | 0.0 |


In [None]:
# Create X and y to feed into RoBERTa
X, y = train_df['text'].values, train_df['label'].values
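One reason a `dropna` call is useful after reloading a CSV is that pandas writes an empty string as an empty field and reads it back as `NaN`. Whether this is what motivated the call in this notebook is an assumption; the sketch below only demonstrates the round-trip behaviour on a made-up two-row frame, using an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Made-up example: the second text became empty after cleaning.
df = pd.DataFrame({"text": ["good day", ""], "label": [1, 0]})

# Round-trip through CSV in memory.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The empty string comes back as a missing value.
n_missing = int(reloaded["text"].isna().sum())
cleaned = reloaded.dropna()

print(f"missing texts after reload: {n_missing}")
print(f"rows kept after dropna: {len(cleaned)}")
```

Dropping such rows before extracting `X` and `y` avoids passing non-string values to the tokenizer.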

# 4. Training RoBERTa

We can now begin training the RoBERTa model.

In [None]:
# Start the training process
roberta.train(X, y, batch_size=BATCH_SIZE, epochs=EPOCHS)

Tokenizing data:   0%|          | 0/2041717 [00:00<?, ?it/s]

Tokenizing data:   0%|          | 0/226858 [00:00<?, ?it/s]

Training steps: 191409
Model summary
Model: "tf_roberta_for_sequence_classification_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 roberta (TFRobertaMainLaye  multiple                  124055040 
 r)                                                              
                                                                 
 classifier (TFRobertaClass  multiple                  592130    
 ificationHead)                                                  
                                                                 
Total params: 124647170 (475.49 MB)
Trainable params: 124647170 (475.49 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Fitting model
Epoch 1/3


Generating features:   0%|          | 0/2041717 [00:00<?, ?it/s]

  63804/Unknown - 13244s 207ms/step - loss: 0.3021 - accuracy: 0.8677

Generating features:   0%|          | 0/226858 [00:00<?, ?it/s]

Epoch 2/3


Generating features:   0%|          | 0/2041717 [00:00<?, ?it/s]



Generating features:   0%|          | 0/226858 [00:00<?, ?it/s]

Epoch 3/3


Generating features:   0%|          | 0/2041717 [00:00<?, ?it/s]



Generating features:   0%|          | 0/226858 [00:00<?, ?it/s]
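The counts in the training log above can be cross-checked with a little arithmetic. The sketch below assumes that the reported "Training steps" value is the number of full batches per epoch multiplied by the number of epochs. That assumption is inferred from the logged numbers, not taken from `roberta.py`.

```python
import math

# Figures copied from the training log and the parameter cell.
n_train = 2_041_717
n_val = 226_858
batch_size = 32
epochs = 3

full_batches_per_epoch = n_train // batch_size       # last partial batch excluded
batches_per_epoch = math.ceil(n_train / batch_size)  # last partial batch included
total_steps = full_batches_per_epoch * epochs
val_fraction = n_val / (n_train + n_val)

print(f"full batches per epoch: {full_batches_per_epoch}")
print(f"batches per epoch (with partial batch): {batches_per_epoch}")
print(f"total training steps: {total_steps}")
print(f"validation fraction: {val_fraction:.4f}")
```

Under this assumption the total matches the logged 191409 steps, the per-epoch count including the final partial batch matches the 63804 shown in the progress line, and the validation set corresponds to a 10% hold-out.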

# 5. Submission

In [9]:
# Read preprocessed test data
test_df = pd.read_csv(DataPath.BERT_TEST)

# Retrieve `text` column for predicting
X_test = test_df["text"]

# Make the prediction
roberta.predict(X_test)

Generating predictions:   0%|          | 0/10000 [00:00<?, ?it/s]

Saving predictions


This submission to AIcrowd achieved the following accuracy scores:

- First score = `0.899`
- Secondary score = `0.899`

You can access the results here:

- CSV output file: [test_predictions_RoBERTa.csv](./test_predictions_RoBERTa.csv)
- AIcrowd submission ID: **#247500**

Using the same setup as for BERT, RoBERTa Base reached a validation accuracy of 90.7%, about 1 percentage point above BERT. However, this gain did not carry over to the AIcrowd submission.