BERT - Bidirectional Encoder Representations from Transformers - is a state-of-the-art natural language processing model that has produced impressive results in a variety of text-based tasks. However, it is a pre-trained model and often requires fine-tuning on specific tasks to achieve optimal performance.

To fine-tune a BERT model on a special set of numerical data with random values missing, we'll need to follow these steps:

1. Prepare the Data: We first need to prepare our data in a usable format for BERT. This involves converting our numerical dataset into a textual format that BERT can understand. For example, we can convert our input data to a tab-separated sequence of values, where the missing values are represented as `[MASK]`.

2. Load the Pre-Trained Model: Next, we load a pre-trained BERT model that corresponds to our specific problem and task. There are several pre-trained BERT models available in the Hugging Face transformer library, such as bert-base-uncased, bert-large-uncased, and others.

3. Fine-Tune the Model: After preparing our data and loading the pre-trained model, we can fine-tune the model on our specific task. This involves training the model on our numerical data while updating the weights of the pre-trained BERT model based on the gradients of our dataset. We typically use a GPU to speed up the process of fine-tuning the model, as it is a computationally expensive task.

4. Evaluate the Model: Once we have fine-tuned our model, we evaluate its performance on a separate testing dataset. This allows us to verify the accuracy of our fine-tuned model and determine its suitability for our problem.

Step 1: Prepare the data
- Convert the numerical data to a text format that BERT can understand.
- Randomly remove values from the data and replace them with `[MASK]` tokens.
- Split the dataset into training and testing sets.

Step 2: Load the pre-trained model
- Import the pre-trained BERT model from the Hugging Face transformer library.
- Define the hyperparameters for the model, such as the learning rate, batch size, and number of epochs.
- Instantiate the model and load the pre-trained weights.

Step 3: Fine-tune the model
- Using the training dataset, fine-tune the pre-trained BERT model and update its weights.
- Use a GPU to speed up the training process, if possible.
- Monitor the loss and accuracy of the model during training to determine when to stop.

Step 4: Evaluate the model
- Use the testing dataset to evaluate the accuracy of the fine-tuned BERT model.
- Use metrics such as accuracy, precision, recall, and F1 score to measure the performance of the model.

### Import Packages

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split, SequentialSampler

import platform

from sklearn.model_selection import train_test_split

# torch.set_num_threads(2)

from transformers import (
    BertModel,
    BertTokenizer,
    TFBertForMaskedLM,
    BertForMaskedLM,
    AdamW,
    get_linear_schedule_with_warmup,
)
 
import sys
print(sys.executable)
print("py platform: {}".format(platform.platform())) #Python Platform: macOS-13.3.1-arm64-arm-64bit
from platform import python_version
print('python ' + python_version())
print(pd.__name__, pd.__version__)
print(np.__name__, np.__version__)
print(torch.__name__, torch.__version__)
has_gpu = torch.cuda.is_available()
has_mps = getattr(torch,'has_mps',False)
device = "mps" if has_mps \
    else "gpu" if has_gpu else "cpu"

print("GPU is", "AVAILABLE" if has_gpu else "NOT available")#GPU is NOT AVAILABLE
print("MPS is", "AVAILABLE" if has_mps else "NOT available") #MPS is AVAILABLE
 
print(f"Target device is {device}") #Target device is mps

/Users/ask/anaconda3/envs/STA208_BERT/bin/python
py platform: macOS-13.4-arm64-arm-64bit
python 3.9.16
pandas 1.5.3
numpy 1.24.3
torch 2.0.1
GPU is NOT available
MPS is AVAILABLE
Target device is mps


### Data Processing

#### Load Evaluation Data

In [10]:
# Load the numerical data you want to train BERT on
df1 = pd.read_csv("data.csv")
#df2 = pd.read_csv('Data_CHF_Zhao_2020_ATE.csv')
frames = [df1]
df = pd.concat(frames)

# Define the name of the column that you want to move to the end of the DataFrame
column_name = "x_e_out [-]"

# Select the column and drop it from the DataFrame
column_to_move = df[column_name]
col = df.drop(column_name, axis=1, inplace=True)

# Append the column back to the end of the DataFrame
df[column_name] = column_to_move

# select only rows where x_e out DOES exist
data_test = df[df["x_e_out [-]"].isna()]

# start with a small data set for speed
#data = data[0:100]
data_test = data_test.reset_index(drop=True)

# Convert numerical values to string format to match BERT input requirement
data_test = data_test.astype(str)

data_test["sequence"] = ""

# Concatenate all the values in a row into a single string using the column names
# Iterate through rows and columns
for index, row in data_test.iterrows():
    string = ""
    for column in data_test.columns:
        if column == "x_e_out [-]": 
            # make a mask with 7 sequential tokens
            string += column + ": " + '[MASK]'*7 + " "
            continue
        if column == "sequence" or column == 'id':
            continue
        string += column + ": " + str(row[column]) + " "
    masked_string = string.strip()
    data_test["sequence"][index] = masked_string

data_test.describe

<bound method NDFrame.describe of           id        author geometry pressure [MPa] mass_flux [kg/m2-s]  \
0          4           nan     tube          13.79               686.0   
1          7        Peskov     tube           18.0               750.0   
2         10      Thompson     tube            nan                 nan   
3         12      Thompson      nan           6.89              7500.0   
4         23          Beus  annulus          15.51              1355.0   
...      ...           ...      ...            ...                 ...   
10410  31633      Thompson     tube          11.03                 nan   
10411  31634  Richenderfer    plate           1.01              2000.0   
10412  31637   Weatherhead     tube          13.79               688.0   
10413  31640           nan      nan          13.79                 nan   
10414  31642      Thompson     tube           6.89              3825.0   

      D_e [mm] D_h [mm] length [mm] chf_exp [MW/m2] x_e_out [-]  \
0         

#### Loading All Training/Testing Data

In [None]:
# Load the numerical data you want to train BERT on
#df1 = pd.read_csv("data.csv")
df2 = pd.read_csv('Data_CHF_Zhao_2020_ATE.csv')
frames = [df2]
df = pd.concat(frames)

# Define the name of the column that you want to move to the end of the DataFrame
column_name = "x_e_out [-]"

# Select the column and drop it from the DataFrame
column_to_move = df[column_name]
col = df.drop(column_name, axis=1, inplace=True)

# Append the column back to the end of the DataFrame
df[column_name] = column_to_move

# select only rows where x_e out exists
data = df[~df["x_e_out [-]"].isna()]

# start with a small data set for speed
#data = data[0:100]
data = data.reset_index(drop=True)

# Convert numerical values to string format to match BERT input requirement
data = data.astype(str)

data["sequence"] = ""

# Concatenate all the values in a row into a single string using the column names
# Iterate through rows and columns
for index, row in data.iterrows():
    string = ""
    for column in data.columns:
        if column == "x_e_out [-]": 
            # make a mask with 7 sequential tokens
            string += column + ": " + '[MASK]'*4 + " "
            continue
        if column == "sequence" or column == 'id':
            continue
        string += column + ": " + str(row[column]) + " "
    masked_string = string.strip()
    data["sequence"][index] = masked_string

data.describe

### Train/Optimize 
BERT Masked LM in PyTorch

In [3]:
# Step 1: Make the tokenizer
MAX_LENGTH = 128

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

token_encoder_fun = lambda row: tokenizer.encode(row,
                                                 add_special_tokens=True,
                                                 padding="max_length",
                                                 #return_tensors='pt',
                                                 max_length=MAX_LENGTH,
                                                 )


In [None]:
model = BertForMaskedLM.from_pretrained("bert-base-uncased",
                                        return_dict=True,
                                        pad_token_id = 0
                                        ).to(device)

model.train()

sequences = data["sequence"]
x_e_out = data['x_e_out [-]']

tokenized_data = sequences.apply(token_encoder_fun)
tokenized_data = tokenized_data.reset_index(drop=True)

tokenized_labels = x_e_out.apply(token_encoder_fun)
tokenized_labels = tokenized_labels.reset_index(drop=True)

input_ids = torch.tensor(tokenized_data)
labels = torch.tensor(tokenized_labels)

attention_masks = torch.empty( ( len(input_ids), MAX_LENGTH ) )

# Generate attention masks
for i in range(len(input_ids)):
    tokens = input_ids[i, :]
    row_mask = [int(token_id.item() > 0) for token_id in tokens]
    row_mask = torch.tensor(row_mask).unsqueeze(0)
    attention_masks[i] = row_mask
    
generator = torch.Generator().manual_seed(42)

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, labels)

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.9 * len(dataset))
test_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

# The DataLoader needs to know our batch size for training. 
# For fine-tuning BERT on a specific task, the authors recommend a batch 
# size of 16 or 32.
batch_size = 32

from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
test_dataloader = DataLoader(
            test_dataset, # The validation samples.
            sampler = SequentialSampler(test_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )
# Step 2: Define the masked language modeling (MLM) loss function
mask_token_id = tokenizer.mask_token_id

# Step 3: Define the optimizer and loss fn
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Step 4: Train the model
num_epochs = 10

for epoch in range(num_epochs):
    for batch in train_dataloader:
        input_ids, attention_mask, labels = [x.to(device) for x in batch]

        # Reset gradients
        optimizer.zero_grad()

        # Forward pass
        outputs = model(input_ids, 
                        attention_mask=attention_mask, 
                        labels=labels)
        loss = outputs.loss

        # Backward pass
        loss.backward()

        # Update the weights
        optimizer.step()

    # Print the loss and accuracy of the model for this epoch
    print(f"Epoch {epoch+1}, Loss: {loss:.3f}")

# Set model to evaluate mode
model.eval()

# Evaluate the model on the testing dataset
accuracy = 0
with torch.no_grad():
  for batch in test_dataloader:
    input_ids, attention_mask, labels = [x.to(device) for x in batch]

    # Get the logits for the masked tokens
    #outputs = model(input_ids, attention_mask=attention_mask)
    outputs = model(input_ids, 
                    attention_mask=attention_mask, 
                    labels=labels)
    logits = outputs.logits

    # Get the predictions for the masked tokens
    _, predictions = torch.max(logits, dim=2)

    # Calculate the accuracy of the model on this batch
    accuracy += np.sum(predictions.cpu().numpy() == labels.cpu().numpy())

# Print the accuracy of the model
print(f"Accuracy: {accuracy / test_size:.3f}")
torch.save(model, 'bert_fine-tuned_5.sav')

#### Predict Using Fine-Tuned BERT

90/10 train/test split

##### 7 Sequential `[MASK]` tokens
- `bert_fine-tuned_1.sav`
  - epochs=2, training: data.csv
#
- `bert_fine-tuned_2.sav`
  - epochs=2, training: data.csv + Data_CHF
#
- `bert_fine-tuned_3.sav`
  - epochs=10, training: data + Data_CHF
#
- `bert_fine-tuned_4.sav`
  - epochs=5, training: Data_CHF
#
- `bert_fine-tuned_5.sav`
  - epochs=10, training: Data_CHF

##### 4 Sequential `[MASK]` tokens

In [78]:
mask_token_id = tokenizer.mask_token_id

model_name = 'bert_fine-tuned_3.sav'
model = torch.load(model_name).to('mps')

sequences = data_test["sequence"]
x_e_out = data_test['x_e_out [-]']

tokenized_data = sequences.apply(token_encoder_fun)
tokenized_data = tokenized_data.reset_index(drop=True)

input_ids = torch.tensor(tokenized_data)

attention_masks = torch.empty( ( len(input_ids), MAX_LENGTH ) )

for i in range(len(input_ids)):
    tokens = input_ids[i, :]
    row_mask = [int(token_id.item() > 0) for token_id in tokens]
    row_mask = torch.tensor(row_mask).unsqueeze(0)
    attention_masks[i] = row_mask

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks)

# For validation the order doesn't matter, so we'll just read them sequentially.
prediction_dataloader = DataLoader(
            dataset, # The validation samples.
            sampler = SequentialSampler(dataset), # Pull out batches sequentially.
        )

for batch in prediction_dataloader:
    input_ids, attention_mask = [x.to(device) for x in batch]

    # Get the logits for the masked tokens
    #outputs = model(input_ids, attention_mask=attention_mask)
    token_logits = model(input_ids, 
                    attention_mask=attention_mask).logits

    # Find the location of [MASK] and extract its logits
    mask_token_index = np.argwhere(input_ids.cpu().numpy() == mask_token_id)

    # Empty list of decoded
    s = []

    # loop over mask indices and ID, replace tokens
    for tok_idx in mask_token_index:
        mask_token_logits = token_logits[0, tok_idx, :]
    
        top_tokens = np.argsort(-mask_token_logits.cpu().detach())[:6].tolist()
        # Remove special tokens that don't contain human-relevant information ([CLS] and [SEP])
        top_tokens = [tok for tok in top_tokens if all([tok!=0, tok !=101, tok !=102])]
        
        # Choose the most likely remaining token and replace masked token
        input_ids[0][tok_idx[1]] = top_tokens[0][0]

    # Trim PSD, CLS, SEP tokens    
    trim_toks = [ t for t in input_ids[0] if all([t!=0, t !=101, t !=102]) ]

    s = s.append(tokenizer.decode(trim_toks))


### Appendix

***

#### Old Snippets

In [None]:
#torch.save(model, 'path/to/model')

#saved_model = torch.load('path/to/model')

#### Console/Testing

In [8]:
for tok in top_tokens:
    if tok != 101 and tok != 102:
        print(tokenizer.decode(tok)) 

[ P A D ]
# # 8
2 5 9
# # 3 3
# # 7


In [48]:
for i in mask_token_index:
    print(i[1])

80
81
82
83
84
85
86


In [52]:
a = [ t for t in input_ids[0] if all([t!=0, t !=101, t !=102]) ]

In [55]:
tokenizer.decode(a)

'author : beus geometry : nan pressure [ mpa ] : 13. 79 mass _ flux [ kg / m2 - s ] : 2730. 0 d _ e [ mm ] : 5. 6 d _ h [ mm ] : 15. 2 length [ mm ] : 2134. 0 chf _ exp [ mw / m2 ] : 1. 6 x _ e _ out [ - ] : [MASK] [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]'

In [77]:
top_tokens[0][0]

101