<a href="https://colab.research.google.com/github/KarissaChan1/rocket-nuggets/blob/main/BERT_applications.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

BERT and Transformers Documentation in PyTorch: https://pytorch.org/hub/huggingface_pytorch-transformers/

Reading up on BERT: https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

Interesting research/application paper using BERT for the Dark Web: https://arxiv.org/abs/2305.08596


Example of how to use pre-trained BERT to generate contextual word (token) embeddings

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import torch  # needed below for torch.tensor and torch.no_grad
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

text = "Hello, how are you?"

# Tokenize input text
tokens = tokenizer.tokenize(text)
print(tokens)
print("Length tokens: ", len(tokens))  # 6

# Add special tokens and padding
tokens_spec = ['[CLS]'] + tokens + ['[SEP]']
print("Length after adding special tokens: ", len(tokens_spec))  # 8

max_length = 10  # Define the desired max_length
padding_length = max_length - len(tokens_spec)  # 2

if padding_length > 0:
    tokens_padded = tokens_spec + ['[PAD]'] * padding_length
else:
    tokens_padded = tokens_spec

print("Length after padding: ",len(tokens_padded)) #10

# Convert tokens to input IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens_padded)
print("Tokens: ",input_ids)

# generate attention mask
attention_mask = [1] * len(tokens_spec)  # Set attention mask to 1 for all input tokens
attention_mask += [0] * padding_length  # Set attention to 0 for padding tokens
print("Attention mask: ",attention_mask)

input_ids = torch.tensor(input_ids).unsqueeze(0) # add batch dimension (1)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)

# get word embeddings using pretrained BERT model
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)
    embeddings = outputs.last_hidden_state

print(embeddings)
print(embeddings.size())

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


['hello', ',', 'how', 'are', 'you', '?']
Length tokens:  6
Length after adding special tokens:  8
Length after padding:  10
Tokens:  [101, 7592, 1010, 2129, 2024, 2017, 1029, 102, 0, 0]
Attention mask:  [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
tensor([[[-0.0824,  0.0667, -0.2880,  ..., -0.3566,  0.1960,  0.5381],
         [ 0.0310, -0.1448,  0.0952,  ..., -0.1560,  1.0151,  0.0947],
         [-0.8935,  0.3240,  0.4184,  ..., -0.5498,  0.2853,  0.1149],
         ...,
         [ 0.5570, -0.1080, -0.2412,  ...,  0.2817, -0.3996, -0.1882],
         [-0.0117,  0.1051,  0.4211,  ..., -0.0783,  0.1717, -0.2015],
         [-0.2910,  0.0458,  0.2346,  ...,  0.1788,  0.0796, -0.1221]]])
torch.Size([1, 10, 768])

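The manual steps in the cell above (add special tokens, pad, build the attention mask) can be wrapped in one small helper. The following is a minimal sketch in plain Python, not part of the `transformers` API: the name `pad_with_mask` is hypothetical, and the truncation of over-long inputs is an added assumption, since the original cell only handles inputs shorter than `max_length`.

```python
def pad_with_mask(tokens, max_length, pad_token="[PAD]"):
    """Add [CLS]/[SEP], then pad or truncate to exactly max_length tokens."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    # Assumption: over-long inputs are truncated so both special tokens still fit.
    body = tokens[: max_length - 2]
    with_special = ["[CLS]"] + body + ["[SEP]"]
    padding_length = max_length - len(with_special)
    padded = with_special + [pad_token] * padding_length
    # 1 marks a real token, 0 marks padding.
    attention_mask = [1] * len(with_special) + [0] * padding_length
    return padded, attention_mask


tokens = ["hello", ",", "how", "are", "you", "?"]
padded, mask = pad_with_mask(tokens, max_length=10)
print(padded)
print(mask)
```

The padded token list can then be passed to `tokenizer.convert_tokens_to_ids` exactly as in the cell above.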

A very simple example of fine-tuning a BERT model on a task-specific dataset
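The fine-tuning cell below iterates over `train_dataloader` and `validation_dataloader` without defining them. One possible way to build them is sketched here, assuming the text has already been encoded to fixed-length ID lists. The class name `EncodedTextDataset`, the toy labels, and the batch size are all made up for illustration, and the validation loader reuses the training data only to keep the sketch short.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class EncodedTextDataset(Dataset):
    """Hypothetical wrapper that serves already-encoded examples as dicts."""

    def __init__(self, input_ids, attention_mask, labels):
        self.input_ids = torch.tensor(input_ids, dtype=torch.long)
        self.attention_mask = torch.tensor(attention_mask, dtype=torch.long)
        self.labels = torch.tensor(labels, dtype=torch.long)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        # Keys match what the training loop reads from each batch.
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
            "labels": self.labels[idx],
        }


# Toy data: the encoded sentence from the embedding example, repeated 4 times.
ids = [101, 7592, 1010, 2129, 2024, 2017, 1029, 102, 0, 0]
mask = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
dataset = EncodedTextDataset([ids] * 4, [mask] * 4, labels=[0, 1, 0, 1])  # made-up labels

train_dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
validation_dataloader = DataLoader(dataset, batch_size=2)  # a real split would use held-out data

batch = next(iter(train_dataloader))
print(batch["input_ids"].shape, batch["labels"].shape)
```

The default collate function of `DataLoader` stacks the per-example dicts into one dict of batched tensors, which is why `batch['input_ids']` works in the loop.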

In [None]:
from transformers import BertTokenizer, BertForSequenceClassification
import torch
from torch.optim import AdamW  # the AdamW class in transformers is deprecated; use the PyTorch optimizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

###################################
# Prepare the training data:
# Load your task-specific dataset and split it into training examples.
# Tokenize the text data using the BERT tokenizer and convert it into tokenized input tensors. Add special tokens and PAD.
# Convert the labels/targets of your dataset into numerical values suitable for the specific task (e.g., class indices for classification tasks).
# Load train and validation sets using DataLoader
###################################

# Parameters
optimizer = AdamW(model.parameters(), lr=1e-5)  # Choose an appropriate learning rate
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # Use GPU if available
num_epochs = 20

model.to(device)
model.train() # set model in training mode

# Train loop:
for epoch in range(num_epochs):
    for batch in train_dataloader:
        optimizer.zero_grad()
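        inputs = batch['input_ids'].to(device)
        # Assumes each batch also provides 'attention_mask' so that [PAD] tokens are ignored
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(inputs, attention_mask=attention_mask, labels=labels)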
        loss = outputs.loss
        loss.backward()
        optimizer.step()

# Evaluate the model
model.eval()
with torch.no_grad():
    for batch in validation_dataloader:
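        inputs = batch['input_ids'].to(device)
        # Assumes each batch also provides 'attention_mask' so that [PAD] tokens are ignored
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(inputs, attention_mask=attention_mask, labels=labels)

        # Evaluate metrics or compute accuracy here (predictions come from outputs.logits)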

torch.save(model.state_dict(), 'fine_tuned_bert_model.pth')

# To load the model later (map_location lets a GPU-saved checkpoint load on a CPU-only machine)
model.load_state_dict(torch.load('fine_tuned_bert_model.pth', map_location=device))
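The evaluation loop above leaves the metric computation open. One simple option is classification accuracy, sketched below. The helper name `batch_accuracy_counts` is hypothetical, and the logits are hand-written stand-ins rather than real model output.

```python
import torch


def batch_accuracy_counts(logits, labels):
    """Return (number correct, number of examples) for one batch."""
    # The predicted class is the index of the largest logit in each row.
    predictions = logits.argmax(dim=-1)
    correct = (predictions == labels).sum().item()
    return correct, labels.numel()


# Stand-in logits for 3 examples and 2 classes (not real model output).
logits = torch.tensor([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2]])
labels = torch.tensor([0, 1, 1])
correct, total = batch_accuracy_counts(logits, labels)
print(f"accuracy: {correct / total:.2f}")
```

Inside the validation loop, the same helper could be called with `outputs.logits` and `labels`, summing the counts across batches and dividing once at the end.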
