BERT - Bidirectional Encoder Representations from Transformers - is a state-of-the-art natural language processing model that has produced impressive results in a variety of text-based tasks. However, it is a pre-trained model and often requires fine-tuning on specific tasks to achieve optimal performance.

To fine-tune a BERT model on a special set of numerical data with random values missing, we'll need to follow these steps:

1. Prepare the Data: We first need to prepare our data in a usable format for BERT. This involves converting our numerical dataset into a textual format that BERT can understand. For example, we can convert our input data to a tab-separated sequence of values, where the missing values are represented as `[MASK]`.

2. Load the Pre-Trained Model: Next, we load a pre-trained BERT model that corresponds to our specific problem and task. There are several pre-trained BERT models available in the Hugging Face transformer library, such as bert-base-uncased, bert-large-uncased, and others.

3. Fine-Tune the Model: After preparing our data and loading the pre-trained model, we can fine-tune the model on our specific task. This involves training the model on our numerical data while updating the weights of the pre-trained BERT model based on the gradients of our dataset. We typically use a GPU to speed up the process of fine-tuning the model, as it is a computationally expensive task.

4. Evaluate the Model: Once we have fine-tuned our model, we evaluate its performance on a separate testing dataset. This allows us to verify the accuracy of our fine-tuned model and determine its suitability for our problem.

Step 1: Prepare the data
- Convert the numerical data to a text format that BERT can understand.
- Randomly remove values from the data and replace them with `[MASK]` tokens.
- Split the dataset into training and testing sets.

Step 2: Load the pre-trained model
- Import the pre-trained BERT model from the Hugging Face transformer library.
- Define the hyperparameters for the model, such as the learning rate, batch size, and number of epochs.
- Instantiate the model and load the pre-trained weights.

Step 3: Fine-tune the model
- Using the training dataset, fine-tune the pre-trained BERT model and update its weights.
- Use a GPU to speed up the training process, if possible.
- Monitor the loss and accuracy of the model during training to determine when to stop.

Step 4: Evaluate the model
- Use the testing dataset to evaluate the accuracy of the fine-tuned BERT model.
- Use metrics such as accuracy, precision, recall, and F1 score to measure the performance of the model.

### Import Packages

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import train_test_split

# torch.set_num_threads(2)

from transformers import (
    BertModel,
    BertTokenizer,
    TFBertForMaskedLM,
    BertForMaskedLM,
    AdamW,
    get_linear_schedule_with_warmup,
)

import sys
print(sys.executable)
from platform import python_version
print('python ' + python_version())
print(pd.__name__, pd.__version__)
print(np.__name__, np.__version__)
print(torch.__name__, torch.__version__)


  from .autonotebook import tqdm as notebook_tqdm


/Users/ask/anaconda3/envs/STA208_BERT/bin/python
python 3.9.16
pandas 1.5.3
numpy 1.24.3
torch 1.13.1


### Data Processing

#### Loading All Training/Testing Data

In [2]:
# Load the numerical data you want to train BERT on
df = pd.read_csv("data.csv")

# Define the name of the column that you want to move to the end of the DataFrame
column_name = "x_e_out [-]"

# Select the column and drop it from the DataFrame
column_to_move = df[column_name]
col = df.drop(column_name, axis=1, inplace=True)

# Append the column back to the end of the DataFrame
df[column_name] = column_to_move

# select only rows where x_e out exists
data = df[~df["x_e_out [-]"].isna()]

# start with a small data set for speed
data = data[0:100]
data = data.reset_index(drop=True)

# Convert numerical values to string format to match BERT input requirement
data = data.astype(str)

data["sequence"] = ""

# Concatenate all the values in a row into a single string using the column names
# Iterate through rows and columns
for index, row in data.iterrows():
    string = ""
    for column in data.columns:
        if column == "x_e_out [-]": 
            string += column + ": " + '[MASK]' + " "
            continue
        if column == "sequence" or column == 'id':
            continue
        string += column + ": " + str(row[column]) + " "
    masked_string = string.strip()
    data["sequence"][index] = masked_string

data.describe

<bound method NDFrame.describe of      id        author geometry pressure [MPa] mass_flux [kg/m2-s] D_e [mm]  \
0     0      Thompson     tube            7.0              3770.0      nan   
1     1      Thompson     tube            nan              6049.0     10.3   
2     2      Thompson      nan          13.79              2034.0      7.7   
3     3          Beus  annulus          13.79              3679.0      5.6   
4     5           nan      nan          17.24              3648.0      nan   
..  ...           ...      ...            ...                 ...      ...   
95  130      Thompson     tube          13.79              1356.0      7.8   
96  131      Thompson      nan           3.45              3838.0     10.3   
97  132           nan     tube          18.27              2197.0      3.0   
98  133  Richenderfer    plate            0.2              5600.0      nan   
99  134      Thompson     tube          18.96              3458.0      1.9   

   D_h [mm] length [mm] chf_e

### Torch Masked BERT

In [33]:
# Step 1: Instantiate the tokenizer, and attention-mask the tokens
dropout_rate = 0.3
MAX_LENGTH = 128

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained("bert-base-uncased",
                                        return_dict=False,
                                        pad_token_id = 0
                                        )
model.train()

sequences = data["sequence"]
x_e_out = data['x_e_out [-]']

token_encoder_fun = lambda row: tokenizer.encode(row,
                                                 add_special_tokens=True,
                                                 padding="max_length",
                                                 max_length=MAX_LENGTH
                                                 )

tokenized_data = sequences.apply(token_encoder_fun)
tokenized_data = tokenized_data.reset_index(drop=True)

tokenized_labels = x_e_out.apply(token_encoder_fun)
tokenized_labels = tokenized_labels.reset_index(drop=True)

input_data = torch.tensor(tokenized_data)
labels = torch.tensor(tokenized_labels)

generator = torch.Generator().manual_seed(42)

X_train, X_test, y_train, y_test = train_test_split(tokenized_data,
                                                    tokenized_labels,
                                                    test_size=0.1,
                                                    random_state=42)

# stuck here, need to make these pd.Series -> torch.tensor
X_train = torch.tensor(X_train.values)
y_train = torch.tensor(y_train.values)
X_test = torch.tensor(X_test.values)
y_test = torch.tensor(y_test.values)

# Create attention mask
""" attention_mask = torch.empty( ( len(X_train), MAX_LENGTH ) )
for i in range(len(X_train)):
    input_ids = X_train[i, :]
    row_mask = [int(token_id.item() > 0) for token_id in input_ids]
    row_mask = torch.tensor(row_mask).unsqueeze(0)
    attention_mask[i] = row_mask
 """
# Step 2: Define the masked language modeling (MLM) loss function
mask_token_id = tokenizer.mask_token_id

def mlm_loss(prediction_scores, labels):
    masked_prediction_scores = torch.masked_select(prediction_scores, labels != mask_token_id )
    masked_labels = torch.masked_select(labels, labels != mask_token_id)
    return torch.nn.functional.cross_entropy(masked_prediction_scores, masked_labels)

# Step 3: Define the optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Step 4: Train the model
num_epochs = 1

for epoch in range(num_epochs):
    for input_tok_seq, masked_ans_tok in zip(X_train, y_train):
        outputs = model(input_ids=input_tok_seq, labels=y_test)
        loss = outputs[0]
        loss = mlm_loss(outputs, y_test)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Step 5: Evaluate the model
model.eval()
val_loader = (X_test, y_test)
with torch.no_grad():
    for input_ids,  in val_loader:
        #prediction_scores = model(input_ids=input_ids, attention_mask=input_masks, token_type_ids=input_segments)[0]
        loss = mlm_loss(prediction_scores, labels)
        # do something with the predictions and labels (e.g. calculate accuracy)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

### 

In [2]:
import torch
from transformers import BertModel, BertTokenizer

# Load the BERT model
#model = BertModel.from_pretrained("bert-base-uncased")

# Load the BERT tokenizer
#tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

'''
# Create the optimizer
optimizer = torch.optim.AdamW(masked_model.parameters(), lr=0.0001)

# Train the model
for epoch in range(10):
    for batch in tokenized_data:
        # Get the input and output tensors
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        labels = batch["labels"]

        # Forward pass
        outputs = masked_model(input_ids, attention_mask=attention_mask)
'''
        # Compute the loss
        loss = outputs[0]

        # Backpropagate the loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# Save the model
masked_model.save_pretrained("masked_bert_model")


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).

#### Reference

In [None]:
from transformers import BertTokenizer, BertForMaskedLM
from torch.nn import functional as F
import torch

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased", return_dict=True)
text = "where are my " + tokenizer.mask_token + "?"
input = tokenizer.encode_plus(text, return_tensors="pt")
mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)
output = model(**input)
logits = output.logits
softmax = F.softmax(logits, dim=-1)
mask_word = softmax[0, mask_index, :]
top_10 = torch.topk(mask_word, 10, dim=1)[1][0]
for token in top_10:
    word = tokenizer.decode([token])
    new_sentence = text.replace(tokenizer.mask_token, word)
    print(new_sentence)

In [7]:
# Load the pretrained BERT model and tokenizer to convert data to tokenize-able format
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# model = BertForMaskedLM.from_pretrained("bert-base-uncased", return_dict=True)

sequences = data["sequence"]

MAX_LENGTH = 128
tokenized_data = sequences.apply(
    (
        lambda row: tokenizer.encode(
            row, add_special_tokens=True, padding="max_length", max_length=MAX_LENGTH
        )
    )
)

tokenized_data = tokenized_data.reset_index(drop=True)
input_data = torch.tensor(tokenized_data)

# Define the optimizer to be used to train the model
dropout_rate = 0.3
model = BERTSequenceImputer(dropout_rate=dropout_rate)
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)

# Train the model over a set number of epochs
epochs = 4
for epoch in range(epochs):
    epoch_loss = 0
    for i in range(len(input_data)):
        # Reset gradients
        model.zero_grad()

        # Forward pass
        input_ids = input_data[i, :]

        attention_mask = [int(token_id.item() > 0) for token_id in input_ids]
        attention_mask = torch.tensor(attention_mask).unsqueeze(0)

        input_ids = input_ids.unsqueeze(0)

        # input_ids[[0]][0][0].item()

        # attention_mask = torch.tensor(attention_mask).unsqueeze(0)

        y_pred = model.forward(input_ids=input_ids, attention_mask=attention_mask)
#        model.forward()

        # Compute loss
        y_true = torch.tensor([float(labels[i])])
        loss_func = torch.nn.BCEWithLogitsLoss()
        loss = loss_func(y_pred.view(-1), y_true.view(-1))
        epoch_loss += loss.item()

        # Backward pass
        loss.backward()

        # Update weights
        optimizer.step()

    # Print loss per epoch
    print(f"Epoch: {epoch+1}, Loss: {epoch_loss}")

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


RuntimeError: mat1 and mat2 shapes cannot be multiplied (1x30522 and 768x1)

In [35]:
a = pd.DataFrame(X_train) 
a = torch.tensor(a.values)

TypeError: can't convert np.ndarray of type numpy.object_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

In [41]:
a.index

Int64Index([18, 30, 73, 33, 90,  4, 76, 77, 12, 31, 55, 88, 26, 42, 69, 15, 40,
            96,  9, 72, 11, 47, 85, 28, 93,  5, 66, 65, 35, 16, 49, 34,  7, 95,
            27, 19, 81, 25, 62, 13, 24,  3, 17, 38,  8, 78,  6, 64, 36, 89, 56,
            99, 54, 43, 50, 67, 46, 68, 61, 97, 79, 41, 58, 48, 98, 57, 75, 32,
            94, 59, 63, 84, 37, 29,  1, 52, 21,  2, 23, 87, 91, 74, 86, 82, 20,
            60, 71, 14, 92, 51],
           dtype='int64')