BERT - Bidirectional Encoder Representations from Transformers - is a state-of-the-art natural language processing model that has produced impressive results in a variety of text-based tasks. However, it is a pre-trained model and often requires fine-tuning on specific tasks to achieve optimal performance.

To fine-tune a BERT model on a special set of numerical data with random values missing, we'll need to follow these steps:

1. Prepare the Data: We first need to prepare our data in a usable format for BERT. This involves converting our numerical dataset into a textual format that BERT can understand. For example, we can convert our input data to a tab-separated sequence of values, where the missing values are represented as `[MASK]`.

2. Load the Pre-Trained Model: Next, we load a pre-trained BERT model that corresponds to our specific problem and task. There are several pre-trained BERT models available in the Hugging Face transformer library, such as bert-base-uncased, bert-large-uncased, and others.

3. Fine-Tune the Model: After preparing our data and loading the pre-trained model, we can fine-tune the model on our specific task. This involves training the model on our numerical data while updating the weights of the pre-trained BERT model based on the gradients of our dataset. We typically use a GPU to speed up the process of fine-tuning the model, as it is a computationally expensive task.

4. Evaluate the Model: Once we have fine-tuned our model, we evaluate its performance on a separate testing dataset. This allows us to verify the accuracy of our fine-tuned model and determine its suitability for our problem.

Step 1: Prepare the data
- Convert the numerical data to a text format that BERT can understand.
- Randomly remove values from the data and replace them with `[MASK]` tokens.
- Split the dataset into training and testing sets.

Step 2: Load the pre-trained model
- Import the pre-trained BERT model from the Hugging Face transformer library.
- Define the hyperparameters for the model, such as the learning rate, batch size, and number of epochs.
- Instantiate the model and load the pre-trained weights.

Step 3: Fine-tune the model
- Using the training dataset, fine-tune the pre-trained BERT model and update its weights.
- Use a GPU to speed up the training process, if possible.
- Monitor the loss and accuracy of the model during training to determine when to stop.

Step 4: Evaluate the model
- Use the testing dataset to evaluate the accuracy of the fine-tuned BERT model.
- Use metrics such as accuracy, precision, recall, and F1 score to measure the performance of the model.

### Import Packages

In [1]:
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
import pandas as pd
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split, SequentialSampler, RandomSampler

#
from tqdm.notebook import tqdm, trange

from platform import python_version, platform

from sklearn.model_selection import train_test_split

# torch.set_num_threads(2)

from transformers import (
    BertModel,
    BertTokenizer,
    TFBertForMaskedLM,
    BertForMaskedLM,
    AdamW,
    get_linear_schedule_with_warmup,
)
 
from sys import executable
print(executable)
print("platform: {}".format(platform())) #Python Platform: macOS-13.3.1-arm64-arm-64bit
print('python ' + python_version())
print(pd.__name__, pd.__version__)
print(np.__name__, np.__version__)
print(torch.__name__, torch.__version__)
has_gpu = torch.cuda.is_available()
has_mps = getattr(torch,'has_mps',False)
device = "mps" if has_mps \
    else "gpu" if has_gpu else "cpu"

print("GPU is", "AVAILABLE" if has_gpu else "NOT available")#GPU is NOT AVAILABLE
print("MPS is", "AVAILABLE" if has_mps else "NOT available") #MPS is AVAILABLE
 
print("target device is {}".format(device)) #Target device is mps

/Users/ask/anaconda3/envs/STA208_BERT/bin/python
platform: macOS-13.4-arm64-arm-64bit
python 3.9.16
pandas 1.5.3
numpy 1.24.3
torch 2.0.1
GPU is NOT available
MPS is AVAILABLE
target device is mps


### Data Processing

#### Load Evaluation Data

In [2]:
""" # Load the numerical data you want to train BERT on
df1 = pd.read_csv("data.csv")
#df2 = pd.read_csv('Data_CHF_Zhao_2020_ATE.csv')
frames = [df1]
df = pd.concat(frames)

# Define the name of the column that you want to move to the end of the DataFrame
column_name = "x_e_out [-]"

# Select the column and drop it from the DataFrame
column_to_move = df[column_name]
col = df.drop(column_name, axis=1, inplace=True)

# Append the column back to the end of the DataFrame
df[column_name] = column_to_move

# select only rows where x_e out DOES exist
data_test = df[df["x_e_out [-]"].isna()]

# start with a small data set for speed
data_test = data_test[0:10]
data_test = data_test.reset_index(drop=True)

# Convert numerical values to string format to match BERT input requirement
data_test = data_test.astype(str)

data_test["sequence"] = ""

# Concatenate all the values in a row into a single string using the column names
# Iterate through rows and columns
for index, row in data_test.iterrows():
    string = ""
    for column in data_test.columns:
        if column == "x_e_out [-]": 
            # make a mask with 4 sequential tokens
            string += column + ": " + '[MASK]'*4 + " "
            continue
        if column == "sequence" or column == 'id':
            continue
        string += column + ": " + str(row[column]) + " "
    masked_string = string.strip()
    data_test["sequence"][index] = masked_string

data_test.describe
print(data_test["sequence"][0]) """

' # Load the numerical data you want to train BERT on\ndf1 = pd.read_csv("data.csv")\n#df2 = pd.read_csv(\'Data_CHF_Zhao_2020_ATE.csv\')\nframes = [df1]\ndf = pd.concat(frames)\n\n# Define the name of the column that you want to move to the end of the DataFrame\ncolumn_name = "x_e_out [-]"\n\n# Select the column and drop it from the DataFrame\ncolumn_to_move = df[column_name]\ncol = df.drop(column_name, axis=1, inplace=True)\n\n# Append the column back to the end of the DataFrame\ndf[column_name] = column_to_move\n\n# select only rows where x_e out DOES exist\ndata_test = df[df["x_e_out [-]"].isna()]\n\n# start with a small data set for speed\ndata_test = data_test[0:10]\ndata_test = data_test.reset_index(drop=True)\n\n# Convert numerical values to string format to match BERT input requirement\ndata_test = data_test.astype(str)\n\ndata_test["sequence"] = ""\n\n# Concatenate all the values in a row into a single string using the column names\n# Iterate through rows and columns\nfor inde

#### Loading All Training/Testing Data

In [19]:
# Load the numerical data you want to train BERT on
df1 = pd.read_csv("data.csv")
df2 = pd.read_csv('Data_CHF_Zhao_2020_ATE.csv')
frames = [df1, df2]
df = pd.concat(frames)

# Define the name of the column that you want to move to the end of the DataFrame
column_name = "x_e_out [-]"

# Select the column and drop it from the DataFrame
column_to_move = df[column_name]
col = df.drop(column_name, axis=1, inplace=True)

# Append the column back to the end of the DataFrame
df[column_name] = column_to_move

# select only rows where x_e out exists
#data = df[df["x_e_out [-]"].isna()]
data = df[~df["x_e_out [-]"].isna()]

# start with a small data set for speed
#data = data[0:1000]
data = data.reset_index(drop=True)

# Convert numerical values to string format to match BERT input requirement
data = data.astype(str)

data["sequence"] = ""

# Concatenate all the values in a row into a single string using the column names
# Iterate through rows and columns
for index, row in data.iterrows():
    string = ""
    for column in data.columns:
        if column == "x_e_out [-]": 
            # Do not add mask tokens, simply ignore
            #string += column + ": " + '[MASK]'*4 + " "
            continue
        if column == "sequence" or column == 'id':
            continue
        string += column + ": " + str(row[column]) + " "
    masked_string = string.strip()
    data["sequence"][index] = masked_string

data.describe

<bound method NDFrame.describe of          id        author geometry pressure [MPa] mass_flux [kg/m2-s]  \
0         0      Thompson     tube            7.0              3770.0   
1         1      Thompson     tube            nan              6049.0   
2         2      Thompson      nan          13.79              2034.0   
3         3          Beus  annulus          13.79              3679.0   
4         5           nan      nan          17.24              3648.0   
...     ...           ...      ...            ...                 ...   
23089  1861  Richenderfer    plate           1.01              1500.0   
23090  1862  Richenderfer    plate           1.01              1500.0   
23091  1863  Richenderfer    plate           1.01              2000.0   
23092  1864  Richenderfer    plate           1.01              2000.0   
23093  1865  Richenderfer    plate           1.01              2000.0   

      D_e [mm] D_h [mm] length [mm] chf_exp [MW/m2] x_e_out [-]  \
0          nan     10.

### Train/Optimize 
BERT Masked LM in PyTorch

In [20]:
# Step 1: Make the tokenizer
MAX_LENGTH = 128

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

token_encoder_fun = lambda row: tokenizer.encode(row,
                                                 add_special_tokens=True,
                                                 padding="max_length",
                                                 #return_tensors='pt',
                                                 max_length=MAX_LENGTH,
                                                 )


In [22]:
# Prepare data as Ax = B
# A
sequences = data["sequence"]
# B
x_e_out = data['x_e_out [-]']

tokenized_data = sequences.apply(token_encoder_fun)
tokenized_data = tokenized_data.reset_index(drop=True)

input_ids = torch.tensor(tokenized_data)

attention_masks = torch.empty( ( len(input_ids), MAX_LENGTH ) )

# Generate attention masks
for i in trange(len(input_ids)):
    tokens = input_ids[i, :]
    row_mask = [int(token_id.item() > 0) for token_id in tokens]
    row_mask = torch.tensor(row_mask).unsqueeze(0)
    attention_masks[i] = row_mask

  0%|          | 0/23094 [00:00<?, ?it/s]

In [49]:
x_e_out = data['x_e_out [-]'].astype(float)
x_e_out = torch.tensor(x_e_out, dtype=torch.float32).reshape(-1,1)

generator = torch.Generator().manual_seed(42)

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids, attention_masks, x_e_out)

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.9 * len(dataset))
test_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, test_dataset = random_split(dataset, [train_size, test_size])

train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
test_dataloader = DataLoader(
            test_dataset, # The validation samples.
            sampler = SequentialSampler(test_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

AssertionError: Size mismatch between tensors

In [88]:
class BertHeatFlux(torch.nn.Module):
    def __init__(self):
        super(BertHeatFlux, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased').to(device)
        self.dropout = torch.nn.Dropout(0.3)
        # x[1].shape[-1]
        self.linear = torch.nn.Linear(768, 1)
    def forward(self, input_ids, attention_masks):
        ouput = self.bert(input_ids, attention_masks)
        pooled_output = output[1]
        dropout = self.dropout(pooled_output)
        out = self.linear(dropout)
        return out

In [89]:
model = BertHeatFlux().to(device)
batch_size = 32

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [34]:
# Step 3: Define the optimizer and loss fn

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = torch.nn.MSELoss()

# Step 4: Train the model
num_epochs = 1

for epoch in trange(num_epochs):
    for batch in tqdm(train_dataloader):
        input_ids, attention_mask, labels = [x.to(device) for x in batch]

        # Reset gradients
        optimizer.zero_grad()

        # Forward pass
        predictions = model(input_ids, 
                        attention_mask)
        loss = loss_fn(predictions, labels)
        
        # Backward pass
        loss.backward()

        #torch.nn.utils.clip_grad

        # Update the weights
        optimizer.step()

    # Print the loss and accuracy of the model for this epoch
    print(f"Epoch {epoch+1}, Loss: {loss:.3f}")

torch.save(model, 'bert_fine-tuned-7.sav')

  0%|          | 0/1 [00:00<?, ?it/s]

Epoch 1, Loss: 0.002


90/10 train/test split

##### 7 Sequential `[MASK]` tokens
- `bert_fine-tuned_1.sav`
  - epochs=2, training: data.csv
#
- `bert_fine-tuned_2.sav`
  - epochs=2, training: data.csv + Data_CHF
#
- `bert_fine-tuned_3.sav`
  - epochs=10, training: data + Data_CHF
#
- `bert_fine-tuned_4.sav`
  - epochs=5, training: Data_CHF
#
- `bert_fine-tuned_5.sav`
  - epochs=10, training: Data_CHF

##### 4 Sequential `[MASK]` tokens
#
- `bert_fine-tuned_6.sav`
  - epochs=5, training: data + Data_CHF

In [79]:

# Evaluate the model on the testing dataset
accuracy = 0
with torch.no_grad():
  for batch in tqdm(test_dataloader):
    input_ids, attention_mask, labels = [x.to(device) for x in batch]

    # Get the logits for the masked tokens
    #outputs = model(input_ids, attention_mask=attention_mask)
    outputs = model(input_ids, 
                    attention_mask)
    
    # Calculate the accuracy of the model on this batch
    loss += np.mean(np.square(outputs.cpu() - labels.cpu()))

# Print the accuracy of the model
print(f"MSE: {np.sqrt(loss/test_samples) :.3f}")
torch.save(model, 'bert_fine-tuned-7.sav')

  0%|          | 0/33 [00:00<?, ?it/s]

TypeError: mean() received an invalid combination of arguments - got (out=NoneType, dtype=NoneType, axis=NoneType, ), but expected one of:
 * (*, torch.dtype dtype)
 * (tuple of ints dim, bool keepdim, *, torch.dtype dtype)
 * (tuple of names dim, bool keepdim, *, torch.dtype dtype)


#### Predict Using Fine-Tuned BERT

In [78]:
# Load the numerical data you want to train BERT on
df1 = pd.read_csv("data.csv")
#df2 = pd.read_csv('Data_CHF_Zhao_2020_ATE.csv')
frames = [df1]
df = pd.concat(frames)

# Define the name of the column that you want to move to the end of the DataFrame
column_name = "x_e_out [-]"

# Select the column and drop it from the DataFrame
column_to_move = df[column_name]
col = df.drop(column_name, axis=1, inplace=True)

# Append the column back to the end of the DataFrame
df[column_name] = column_to_move

# select only rows where x_e out DOES NOT exist
data_test = df[df["x_e_out [-]"].isna()]

# start with a small data set for speed
data_test = data_test[890:]
data_test = data_test.reset_index(drop=True)

# Convert numerical values to string format to match BERT input requirement
data_test = data_test.astype(str)

data_test["sequence"] = ""

# Concatenate all the values in a row into a single string using the column names
# Iterate through rows and columns
for index, row in data_test.iterrows():
    string = ""
    for column in data_test.columns:
        if column == "x_e_out [-]": 
            # make a mask with 4 sequential tokens
            string += column + ": " + '[MASK]'*4 + " "
            continue
        if column == "sequence" or column == 'id':
            continue
        string += column + ": " + str(row[column]) + " "
    masked_string = string.strip()
    data_test["sequence"][index] = masked_string

#data_test.describe


sequences = data_test["sequence"]
x_e_out = data_test['x_e_out [-]']
x_e_out = data_test['x_e_out [-]'].astype(float)
x_e_out = torch.tensor(x_e_out, dtype=torch.float32).reshape(-1,1)

tokenized_data = sequences.apply(token_encoder_fun)
tokenized_data = tokenized_data.reset_index(drop=True)

input_ids = torch.tensor(tokenized_data)
#labels = torch.tensor(tokenized_labels)

attention_masks = torch.empty( ( len(input_ids), MAX_LENGTH ) )

# Generate attention masks
for i in trange(len(input_ids)):
    tokens = input_ids[i, :]
    row_mask = [int(token_id.item() > 0) for token_id in tokens]
    row_mask = torch.tensor(row_mask).unsqueeze(0)
    attention_masks[i] = row_mask

# For validation the order doesn't matter, so we'll just read them sequentially.
eval_dataset = TensorDataset(input_ids, attention_masks, x_e_out)

eval_dataloader = DataLoader(
            dataset, # The validation samples.
            sampler = SequentialSampler(test_dataset), # Pull out batches sequentially
        )

Exception ignored in: <function tqdm.__del__ at 0x29fb14ee0>
Traceback (most recent call last):
  File "/Users/ask/anaconda3/envs/STA208_BERT/lib/python3.9/site-packages/tqdm/std.py", line 1145, in __del__
    self.close()
  File "/Users/ask/anaconda3/envs/STA208_BERT/lib/python3.9/site-packages/tqdm/notebook.py", line 283, in close
    self.disp(bar_style='danger', check_delay=False)
AttributeError: 'tqdm_notebook' object has no attribute 'disp'


  0%|          | 0/9525 [00:00<?, ?it/s]

In [74]:
predictions = torch.empty( ( len(input_ids), 1 ) ).to(device)
i = 0
for batch in eval_dataloader:
    input_ids, attention_mask, labels = [x.to(device) for x in tqdm(batch)]

    # Forward pass
    pred = model(input_ids, 
                    attention_mask)
    
    # Print the loss and accuracy of the model for this epoch
    predictions[i] = pred
    i += 1
    #print(pred)

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/3 [00:00<?, ?it/s]

IndexError: index 1 is out of bounds for dimension 0 with size 1

In [62]:
np.savetxt('batch2.txt', predictions.cpu().detach().numpy())

### Appendix

***

#### Old Snippets

#### Console/Testing

In [None]:
train_dataset[0]

(tensor([  101,  3166,  1024, 16660, 10988,  1024,  7270,  3778,  1031,  6131,
          2050,  1033,  1024,  1020,  1012,  6535,  3742,  1035, 19251,  1031,
          4705,  1013, 25525,  1011,  1055,  1033,  1024, 18432,  1012,  1014,
          1040,  1035,  1041,  1031,  3461,  1033,  1024,  2184,  1012,  1022,
          1040,  1035,  1044,  1031,  3461,  1033,  1024,  2184,  1012,  1022,
          3091,  1031,  3461,  1033,  1024,  4724,  2475,  1012,  1014, 10381,
          2546,  1035,  4654,  2361,  1031, 12464,  1013, 25525,  1033,  1024,
          1018,  1012,  1018,  1060,  1035,  1041,  1035,  2041,  1031,  1011,
          1033,  1024,   103,   103,   103,   103,   102,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,  

In [None]:
i = torch.argmax(probs)
print(i)

tensor(0, device='mps:0')


In [None]:
a = [ t for t in input_ids[0] if all([t!=0, t !=101, t !=102]) ]

In [None]:
tokenizer.decode(102)

'[ S E P ]'

In [87]:
model = BertModel.from_pretrained('bert-base-uncased').to(device)
print(model.named_modules())

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


<generator object Module.named_modules at 0x4351c05f0>
