# 1. Setup

In [8]:
import sys
import os
import pandas as pd
import warnings

# Manually set the path to the parent directory
parent_dir = os.path.abspath('..')
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

from utility.paths import DataPath
from preprocessing import Preprocessing
from models.bert import BERT
from tqdm.auto import tqdm

warnings.filterwarnings("ignore")

**Because the BERT implementation is fairly involved, all of the necessary logic is encapsulated in a dedicated `BERT` class in [bert.py](../models/bert.py). This notebook is used mainly to present our results, which keeps it concise and clear.**

BERT comes in two main variants, *BERT Base* and *BERT Large*, which differ in size, complexity, and computational requirements. The key differences are:

- BERT Base: Consists of 12 transformer blocks (layers), 768 hidden units (size of the representation vector for each token), and 12 attention heads. This results in a total of about 110 million parameters.

- BERT Large: More complex with 24 transformer blocks, 1024 hidden units, and 16 attention heads. This increases the total parameters to about 340 million.

We will conduct tests on both models and compare their respective results.
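
As a rough sanity check on these figures, the parameter counts can be reproduced from the layer sizes alone. The sketch below assumes the standard uncased BERT configuration (a 30,522-token WordPiece vocabulary, 512 positions, 2 segment types, and a feed-forward width of four times the hidden size). None of these values come from this notebook, and `bert_param_count` is a hypothetical helper, not part of `bert.py`.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT encoder (embeddings + layers + pooler).

    The defaults are assumptions matching the standard uncased checkpoints.
    """
    intermediate = 4 * hidden  # feed-forward width, assumed to be 4x hidden

    # Token, position and segment embeddings, plus one LayerNorm (scale + bias).
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + 2 * hidden

    # Self-attention: query, key, value and output projections (weights + biases).
    attention = 4 * (hidden * hidden + hidden)
    # Feed-forward: hidden -> intermediate -> hidden (weights + biases).
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    # Two LayerNorms per transformer block, each with scale + bias.
    layer_norms = 2 * (2 * hidden)
    per_layer = attention + feed_forward + layer_norms

    # Pooler: one dense layer applied to the [CLS] representation.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT Base:  {base:,} parameters")
print(f"BERT Large: {large:,} parameters")
```

Under these assumptions, BERT Base comes out at 109,482,240 parameters, which matches the `bert (TFBertMainLayer)` row of the model summary printed during training. BERT Large comes out at roughly 335 million, close to the commonly quoted 340 million. The number of attention heads does not enter the count, because the heads split the hidden dimension rather than adding weights.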

# 2. Data loading

In [9]:
# Create the preprocessing objects for the training and test data.
train_prep = Preprocessing([DataPath.TRAIN_NEG_FULL, DataPath.TRAIN_POS_FULL])
test_prep = Preprocessing([DataPath.TEST], is_test=True)

In [10]:
# Declare params for BERT.
MAX_LEN = 128
BATCH_SIZE = 32
EPOCHS = 3

In [None]:
# Declare BERT model.
bert = BERT(weight_path=DataPath.BERT_WEIGHT,
            submission_path=DataPath.BERT_SUBMISSION,
            max_length=MAX_LEN)
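
`MAX_LEN` is passed to the model as `max_length`. How `bert.py` uses it is not shown in this notebook. A common convention, assumed in the sketch below, is to truncate longer token sequences and pad shorter ones so that every example has the same length, with an attention mask marking the real tokens. `pad_or_truncate` and the token ids are hypothetical and for illustration only.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = list(token_ids[:max_length])              # truncate long sequences
    n_pad = max_length - len(kept)                   # padding needed for short ones
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical token ids; a real tokenizer would produce these from text.
short_ids, short_mask = pad_or_truncate([101, 2023, 2003, 102], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1, 201)), max_length=128)

print(short_ids, short_mask)
print(len(long_ids), sum(long_mask))
```

This is a simplification: real tokenizers typically also keep the closing special token when truncating. The practical trade-off is that a larger `max_length` preserves more of each tweet but increases memory use and time per batch.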

# 3. Data preprocessing

Now, we will preprocess the data to ensure it is clean before beginning the training process.

In [18]:
# Retrieve the preprocessing steps declared in the BERT class and apply them to both the train and test data.
for step in tqdm(bert.preprocessing(), desc="Preprocessing train data"):
    getattr(train_prep, step)()

for step in tqdm(bert.preprocessing(is_train=False), desc="Preprocessing test data"):
    getattr(test_prep, step)()

Preprocessing train data:   0%|          | 0/7 [00:00<?, ?it/s]

Executing: `drop_duplicates`
Executing: `remove_tag`
Executing: `strip`
Executing: `remove_ellipsis`
Executing: `reconstruct_emoji`


100%|██████████| 2268591/2268591 [00:18<00:00, 120802.38it/s]


Executing: `remove_extra_space`


100%|██████████| 2268591/2268591 [00:02<00:00, 1108200.90it/s]


Executing: `remove_space_around_emoji`
Executing: `remove_extra_space`


100%|██████████| 2268591/2268591 [00:02<00:00, 1107346.35it/s]


Preprocessing test data:   0%|          | 0/6 [00:00<?, ?it/s]

Executing: `remove_tag`
Executing: `strip`
Executing: `remove_ellipsis`
Executing: `reconstruct_emoji`


100%|██████████| 10000/10000 [00:00<00:00, 118059.62it/s]


Executing: `remove_extra_space`


100%|██████████| 10000/10000 [00:00<00:00, 793834.51it/s]

Executing: `remove_space_around_emoji`





Executing: `remove_extra_space`


100%|██████████| 10000/10000 [00:00<00:00, 885341.21it/s]
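
The two loops above rely on `bert.preprocessing()` returning method names, which `getattr` then resolves on the `Preprocessing` object. The sketch below shows the same name-based dispatch on a toy class. `TinyPreprocessing` and its two steps are hypothetical stand-ins, not the project's `Preprocessing` implementation.

```python
import re


class TinyPreprocessing:
    """Toy stand-in that applies named cleaning steps to a list of texts."""

    def __init__(self, texts):
        self.texts = list(texts)

    def strip(self):
        # Remove leading and trailing whitespace.
        self.texts = [t.strip() for t in self.texts]

    def remove_extra_space(self):
        # Collapse runs of whitespace into a single space.
        self.texts = [re.sub(r"\s+", " ", t) for t in self.texts]


steps = ["strip", "remove_extra_space"]  # steps run in the order listed
prep = TinyPreprocessing(["  hello    world ", "a  b"])
for step in steps:
    getattr(prep, step)()  # look the method up by name, then call it

print(prep.texts)
```

Keeping the pipeline as a list of names lets each model declare its own steps without changing the preprocessing class. In the output above, for instance, the training pipeline additionally runs `drop_duplicates`, which is skipped for the test data.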


In [19]:
# Retrieve the preprocessed df.
train_data = train_prep.__get__()
test_data = test_prep.__get__()

In [20]:
# Export the dataframes. The training frame is shuffled first.
train_data = train_data.sample(frac=1)
train_data.to_csv(DataPath.BERT_TRAIN, index=False)

test_data.to_csv(DataPath.BERT_TEST, index=False)

In [12]:
# Read the dataframe
train_df = pd.read_csv(DataPath.BERT_TRAIN)
train_df.dropna(inplace=True)

| | text | label |
|---|---|---|
| 0 | awh , and he changes the background | 0.0 |
| 1 | i just wanna be with youu | 0.0 |
| 2 | just a little #rt for me ! ! ! pleeeaaasee ! ! | 0.0 |
| 3 | <3 rt " with chilling ... feel like love enuh " | 1.0 |
| 4 | job lead : engineer 1 / 2 - nerc ( nerc cip ) ... | 0.0 |
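
One likely reason for the `dropna` call after reading the file is the CSV round trip: a tweet that becomes an empty string during preprocessing is written as an empty field and read back as `NaN`. The sketch below reproduces this on a hypothetical three-row frame, using an in-memory buffer instead of `DataPath.BERT_TRAIN`.

```python
import io

import pandas as pd

# Hypothetical frame: the second tweet ended up empty after cleaning.
df = pd.DataFrame({"text": ["good morning", "", "so tired"],
                   "label": [1, 0, 0]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # the empty string is written as an empty field
buffer.seek(0)

reloaded = pd.read_csv(buffer)   # empty fields are parsed as NaN by default
print(reloaded["text"].isna().sum())

cleaned = reloaded.dropna()      # drop rows with a missing text or label
print(len(cleaned))
```

Without this step, missing texts would reach the tokenizer as floating-point `NaN` values rather than strings.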


In [14]:
# Create X and y to feed into BERT
X, y = train_df['text'].values, train_df['label'].values

# 4. Training BERT

We can now begin training the BERT model.

In [15]:
# Start the training process
bert.train(X, y, batch_size=BATCH_SIZE, epochs=EPOCHS)

Tokenizing data:   0%|          | 0/2041717 [00:00<?, ?it/s]

Tokenizing data:   0%|          | 0/226858 [00:00<?, ?it/s]

Training steps: 191409
Model summary
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 bert (TFBertMainLayer)      multiple                  109482240 
                                                                 
 dropout_37 (Dropout)        multiple                  0         
                                                                 
 classifier (Dense)          multiple                  1538      
                                                                 
Total params: 109483778 (417.65 MB)
Trainable params: 109483778 (417.65 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Fitting model
Epoch 1/3


Generating features:   0%|          | 0/2041717 [00:00<?, ?it/s]

  63804/Unknown - 13108s 205ms/step - loss: 0.2956 - accuracy: 0.8699

Generating features:   0%|          | 0/226858 [00:00<?, ?it/s]

Epoch 2/3


Generating features:   0%|          | 0/2041717 [00:00<?, ?it/s]



Generating features:   0%|          | 0/226858 [00:00<?, ?it/s]

Epoch 3/3


Generating features:   0%|          | 0/2041717 [00:00<?, ?it/s]



Generating features:   0%|          | 0/226858 [00:00<?, ?it/s]

Saving weights


After `3 epochs`, the BERT model reaches an accuracy of `0.89`. The loss appears to have converged by then, so 3 epochs seem to be sufficient.
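
The counts in the training log are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below uses only numbers printed in the log and the parameters declared earlier. The interpretation (a 90/10 train/validation split, and a total step count based on full batches only) is an inference from those numbers, not something confirmed by `bert.py`.

```python
import math

N_TRAIN, N_VAL = 2_041_717, 226_858   # sizes shown by the two "Tokenizing data" bars
BATCH_SIZE, EPOCHS = 32, 3            # values declared earlier in the notebook

val_fraction = N_VAL / (N_TRAIN + N_VAL)
full_batches = N_TRAIN // BATCH_SIZE                 # batches with exactly 32 examples
batches_per_epoch = math.ceil(N_TRAIN / BATCH_SIZE)  # includes the final partial batch
total_steps = full_batches * EPOCHS

print(f"validation fraction: {val_fraction:.3f}")
print(f"batches per epoch:   {batches_per_epoch}")
print(f"total steps:         {total_steps}")
```

`total_steps` evaluates to 191,409, matching the `Training steps` line, and `batches_per_epoch` to 63,804, matching the Keras progress counter for the first epoch. At roughly 205 ms per step, one epoch takes about 3.6 hours.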

# 5. Submission

In [17]:
# Read preprocessed test data
test_df = pd.read_csv(DataPath.BERT_TEST)

# Retrieve `text` column for predicting
X_test = test_df["text"]

# Make the prediction
bert.predict(X_test)

Generating predictions:   0%|          | 0/10000 [00:00<?, ?it/s]

For each variant of BERT, we made a submission to AIcrowd:

**BERT Base:**

- First Score = `0.896`
- Secondary Score = `0.897`

You can access the results here:

- CSV output file: [test_predictions_BERT_base.csv](./test_predictions_BERT_base.csv)
- AIcrowd submission id: **#247317**

**BERT Large:**

- First Score = `0.900`
- Secondary Score = `0.900`

You can access the results here:

- CSV output file: [test_predictions_BERT_large.csv](./test_predictions_BERT_large.csv)
- AIcrowd submission id: **#247323**

The BERT model significantly outperforms the GRU, showing an improvement of over 3.5%. The difference between the BERT Base and BERT Large variants, however, is marginal: +0.004 on the first score and +0.003 on the secondary score.