# Script to run BERT

In [34]:
from models import *
from preprocessing import Preprocessing
from tqdm.auto import tqdm

In [18]:
TRAIN_NEG_FULL = "./data/train_neg_full.txt"
TRAIN_POS_FULL = "./data/train_pos_full.txt"

TRAIN_NEG = "./data/train_neg.txt"
TRAIN_POS = "./data/train_pos.txt"

TEST_DATA = "./data/test_data.txt"

BERT_TRAIN_PREP = "./data/preprocessed/bert/train.csv"
BERT_TEST_PREP = "./data/preprocessed/bert/test.csv"

## Preprocessing

In [35]:
train_prep = Preprocessing([TRAIN_NEG, TRAIN_POS])
test_prep = Preprocessing([TEST_DATA], is_test=True)

In [36]:
BERT_WEIGHT = "./weights/bert"
BERT_SUBMISSION = "./submissions/bert"
MAX_LEN = 128

bert = Bert(weight_path=BERT_WEIGHT,
            submission_path=BERT_SUBMISSION,
            max_length=MAX_LEN)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [37]:
for step in tqdm(bert.preprocessing(), desc="Preprocessing train data"):
    getattr(train_prep, step)()

Preprocessing train data:   0%|          | 0/7 [00:00<?, ?it/s]

Executing: `drop_duplicates`
Executing: `remove_tag`
Executing: `strip`
Executing: `remove_ellipsis`
Executing: `reconstruct_emoji`
Executing: `remove_extra_space`



100%|██████████| 181307/181307 [00:00<00:00, 508442.21it/s]


Executing: `remove_space_before_symbol`



100%|██████████| 181307/181307 [00:00<00:00, 259298.16it/s]


Executing: `remove_extra_space`



100%|██████████| 181307/181307 [00:00<00:00, 532609.15it/s]
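
The loop above runs each preprocessing step by looking up a method on `train_prep` by name with `getattr`. The internals of `Preprocessing` are not shown in this notebook, so the following is only a minimal sketch of that dispatch pattern: `TinyPreprocessing` and `run_steps` are hypothetical stand-ins, and the three step bodies are assumptions about what methods with these names might do.

```python
class TinyPreprocessing:
    """Hypothetical stand-in for the notebook's Preprocessing class."""

    def __init__(self, texts):
        self.texts = list(texts)

    def strip(self):
        # Remove leading and trailing whitespace from every text.
        self.texts = [t.strip() for t in self.texts]

    def remove_extra_space(self):
        # Collapse runs of whitespace into a single space.
        self.texts = [" ".join(t.split()) for t in self.texts]

    def drop_duplicates(self):
        # dict.fromkeys keeps the first occurrence of each text, in order.
        self.texts = list(dict.fromkeys(self.texts))


def run_steps(prep, steps):
    """Call each named method on `prep`, mirroring the getattr loop above."""
    for step in steps:
        print(f"Executing: `{step}`")
        getattr(prep, step)()
    return prep.texts


result = run_steps(
    TinyPreprocessing(["  hi   there ", "hi there", "bye"]),
    ["strip", "remove_extra_space", "drop_duplicates"],
)
print(result)  # ['hi there', 'bye']
```

Because the steps are plain strings, the order matters: in this sketch, deduplicating before normalizing whitespace would leave both `hi there` variants in place.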


In [38]:
train_df = train_prep.__get__()

In [33]:
train_df["text"]

0         vinco tresorpack 6 ( difficulty 10 of 10 objec...
1         glad i dot have taks tomorrow ! ! #thankful #s...
2         1-3 vs celtics in the regular season = were fu...
3         i could actually kill that girl i'm so sorry !...
4                i find that very hard to believe im afraid
                                ...                        
181242                                 hey gina what's up ?
181243    sas 9.1 . 3 and 9.2 , east 5 , s-plus 8 , stat...
181244    um gord ... i just read your profile . i'm not...
181245    i'm so excited for tomorrow ! look out for two...
181246    i always wondered what the job application is ...
Name: text, Length: 181247, dtype: object

In [28]:
text = "vinco tresorpack 6 ( difficulty 10 of 10 object : disassemble and reassemble the wooden pieces this beautiful wo ."


def _find_unmatched_parentheses(text):
    open_stack = []  # Stack to keep track of indices of '('
    unmatched_indices = []  # List to store indices of unmatched parentheses

    for i, char in enumerate(text):
        if char == '(':
            open_stack.append(i)  # Push the index of '(' onto the stack
        elif char == ')':
            if open_stack:
                open_stack.pop()  # Pop the last '(' as it's a matched pair
            else:
                unmatched_indices.append(i)  # Unmatched ')'

    # Add remaining indices from the stack to unmatched_indices
    unmatched_indices.extend(open_stack)

    return sorted(unmatched_indices)


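def _add_colon(text: str) -> str:
    """Insert ':' before every unmatched parenthesis, e.g. '(' becomes ':('."""
    unmatched_indices = _find_unmatched_parentheses(text)
    if not unmatched_indices:
        return text

    char_t = list(text)

    # Indices are sorted ascending, so each earlier insertion shifts the
    # remaining targets right by one; `index + i` compensates for that.
    for i, index in enumerate(unmatched_indices):
        char_t.insert(index + i, ':')

    return "".join(char_t)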


_add_colon(text)

'vinco tresorpack 6 :( difficulty 10 of 10 object : disassemble and reassemble the wooden pieces this beautiful wo .'
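
The example above only exercises an unmatched `(`. As a hedged sketch, the same idea can be written as a single function and checked against a few more inputs. `add_colon_single_pass` is a hypothetical name, not part of the project, and it is intended to behave like `_add_colon`.

```python
def add_colon_single_pass(text: str) -> str:
    """Prefix every unmatched '(' or ')' with ':' (hypothetical variant)."""
    open_stack = []    # indices of '(' still waiting for a ')'
    unmatched = set()  # indices of parentheses with no partner

    for i, char in enumerate(text):
        if char == "(":
            open_stack.append(i)
        elif char == ")":
            if open_stack:
                open_stack.pop()  # matched pair, nothing to mark
            else:
                unmatched.add(i)  # ')' with no earlier '('

    unmatched.update(open_stack)  # every '(' left over is unmatched

    # Rebuild the string once instead of inserting into a list repeatedly.
    return "".join(":" + c if i in unmatched else c for i, c in enumerate(text))


print(add_colon_single_pass("so happy )"))     # so happy :)
print(add_colon_single_pass("( balanced )"))   # ( balanced )
print(add_colon_single_pass("missed it ( ("))  # missed it :( :(
```

Rebuilding the string in one pass avoids the index bookkeeping that repeated `list.insert` calls require, at the cost of one set lookup per character.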

In [30]:
train_df["text"].apply(_add_colon)

0         vinco tresorpack 6 :( difficulty 10 of 10 obje...
1         glad i dot have taks tomorrow ! ! #thankful #s...
2         1-3 vs celtics in the regular season = were fu...
3         i could actually kill that girl i'm so sorry !...
4                i find that very hard to believe im afraid
                                ...                        
181302                                 hey gina what's up ?
181303    sas 9.1 . 3 and 9.2 , east 5 , s-plus 8 , stat...
181304    um gord ... i just read your profile . i'm not...
181305    i'm so excited for tomorrow ! look out for two...
181306    i always wondered what the job application is ...
Name: text, Length: 181307, dtype: object

## Training