# Training and Implementing BERT to Label Clippings

## 1) Overview

Using the code in 05_random_sample.ipynb and 06_label_training_data.ipynb, I created a training dataset (saved as [training_data.csv](https://raw.githubusercontent.com/MatthewKollmer/us_lynching_victims/refs/heads/main/training_data.csv)) and a test set (saved as [test_data.csv](https://raw.githubusercontent.com/MatthewKollmer/us_lynching_victims/refs/heads/main/test_data.csv)). With these two datasets, I explored several text classification methods (TF-IDF logistic regression, bag-of-words logistic regression, and BERT-base). Ultimately, fine-tuning BERT yielded the best results, so I classified the rest of our data with BERT.

This notebook showcases our BERT classification tests and results. It also shows how I ran our fine-tuned BERT model on the rest of the data three times, creating multiple BERT classifications that I averaged for our final probability labels. All those steps are explained below. However, this notebook omits the numerous tests using logistic regression, tests with different versions of the training data, and tests across different values for BERT's parameters. While those less accurate attempts and explorations were important for us to land on our final methods, they're not necessary to understand the resultant data of classified US lynching victim reports. That's why I'm omitting them here, but I may revisit the whole process in other notebooks down the line.

In [None]:
import torch
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader
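from transformers import BertTokenizer, BertForSequenceClassification
# AdamW is imported from torch.optim because the transformers implementation is deprecated.
# Note: torch.optim.AdamW defaults to weight_decay=0.01, whereas transformers.AdamW defaulted to 0.0.
from torch.optim import AdamW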
from sklearn.metrics import classification_report
from tqdm import tqdm
import os
import glob

## 2) Loading BERT and Data

If you're trying to replicate our processes, please review the following parameters to ensure they're compatible with your device. In particular, you'll need to check your torch.device() compatibility. I'm working on an M-series MacBook Pro, so my device is 'mps' (short for 'Metal Performance Shaders'). If you're working on a PC with an NVIDIA GPU, you should try 'cuda'. If your machine has no supported GPU, you should try 'cpu'. Be warned, though: this code will take an unreasonable amount of time without a GPU.
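
If you'd rather not hard-code the device, the choice can be made from availability checks. The sketch below is a minimal illustration and not part of the original notebook: pick_device_name() is a hypothetical helper, and it takes the two availability flags as plain booleans so the logic is easy to follow and test.

```python
def pick_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Return the device string to pass to torch.device()."""
    if cuda_available:
        return 'cuda'  # NVIDIA GPU
    if mps_available:
        return 'mps'   # Apple-silicon GPU (Metal Performance Shaders)
    return 'cpu'       # fallback: runs anywhere, but slowly for BERT


# Example with hard-coded flags: an M-series Mac has MPS but no CUDA.
print(pick_device_name(cuda_available=False, mps_available=True))
```

In the notebook, the two flags would come from torch.cuda.is_available() and torch.backends.mps.is_available(), and the returned string would be passed to torch.device().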

In [None]:
# load data, BERT, and tokenizer
train_df = pd.read_csv('https://raw.githubusercontent.com/MatthewKollmer/us_lynching_victims/refs/heads/main/training_data.csv')
test_df = pd.read_csv('test_data/test_data.csv')
# device here basically tells your computer what part of the hardware should handle running BERT. GPUs are necessary.
device = torch.device('mps')
# I'm using bert-base-uncased. There are other models and BERT variations available here, but BERT base is well documented and capable for our task.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
model.to(device)

## 3) Transforming Data for BERT

In order to use BERT for classification, it's necessary to convert textual data into numbers. This is just how the model is designed to read the data, so we convert 'no' labels to 0 and 'yes' labels to 1, then convert them back for easier human reading after BERT completes its classifications.

In [None]:
numerize_labels = {'no': 0, 'yes': 1}
textualize_labels = {0: 'no', 1: 'yes'}
train_labels_numeric = train_df['case_match'].map(numerize_labels).values
test_labels = test_df['case_match'].map(numerize_labels).values
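
One thing to watch with .map(): any label that isn't a key in numerize_labels (for example 'Yes' with a capital letter, or a blank cell) becomes NaN rather than raising an error. A quick check before training can catch that. This is only a sketch with made-up labels; find_unmapped() is a hypothetical helper, not something from the notebook.

```python
numerize_labels = {'no': 0, 'yes': 1}


def find_unmapped(labels, mapping):
    """Return the distinct labels (as text) that have no entry in the mapping."""
    return sorted({str(label) for label in labels if label not in mapping})


example_labels = ['yes', 'no', 'Yes', 'no', 'maybe']  # made-up values for illustration
print(find_unmapped(example_labels, numerize_labels))
```

With the real data, the call would be find_unmapped(train_df['case_match'], numerize_labels); an empty list means every row maps cleanly to 0 or 1.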

BERT also takes its input as tensors. In a nutshell, these are multi-dimensional numeric arrays that PyTorch can move onto the GPU. The following function, tokenize_clippings(), takes the clippings in our data, tokenizes them, and packs the resulting token ids, attention masks, and labels into tensors.

In [None]:
# This is a tricky part. BERT needs tensor data to make its predictions. This function turns the strings in 'clippings' into tensor data for our training purposes.
def tokenize_clippings(clippings, labels, max_length=155): 
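    # max_length counts BERT tokens (word pieces plus [CLS] and [SEP]), not words.
    # I set it to 155 to correspond to the word length of our clippings; anything longer is truncated.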
    input_ids_list = []
    attention_mask_list = []
    token_type_ids_list = []

    for clip in clippings:
        encoding = tokenizer.encode_plus(
            clip,
            add_special_tokens=True,
            max_length=max_length,
            truncation=True,
            padding='max_length',
            return_tensors='pt',
            return_token_type_ids=True,
            return_attention_mask=True,
        )
        input_ids_list.append(encoding['input_ids'].flatten())
        attention_mask_list.append(encoding['attention_mask'].flatten())
        token_type_ids_list.append(encoding['token_type_ids'].flatten())

    input_ids = torch.stack(input_ids_list)
    attention_masks = torch.stack(attention_mask_list)
    token_type_ids = torch.stack(token_type_ids_list)
    labels_tensor = torch.tensor(labels, dtype=torch.long)

    return TensorDataset(input_ids, attention_masks, token_type_ids, labels_tensor)
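
To make the padding and truncation concrete, here is a toy sketch in plain Python that mimics what max_length, truncation=True, and padding='max_length' do to one clipping. The word-piece ids are made up, and pad_or_truncate() is a hypothetical helper rather than part of the transformers API; the special-token ids (101, 102, 0) are the ones bert-base-uncased uses for [CLS], [SEP], and [PAD].

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # [CLS], [SEP], [PAD] in bert-base-uncased


def pad_or_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    body = token_ids[: max_length - 2]         # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)      # 1 = real token
    padding = max_length - len(input_ids)
    input_ids = input_ids + [PAD_ID] * padding
    attention_mask = attention_mask + [0] * padding  # 0 = padding, ignored by BERT
    return input_ids, attention_mask


short_clip = [7, 8, 9]               # made-up ids: shorter than max_length, so it gets padded
long_clip = [7, 8, 9, 10, 11, 12]    # made-up ids: longer than max_length, so it gets cut
print(pad_or_truncate(short_clip, max_length=6))
print(pad_or_truncate(long_clip, max_length=6))
```

Every clipping ends up the same length, which is what lets torch.stack() combine them into a single tensor.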

With tokenize_clippings(), we create tensors for BERT:

In [None]:
train_dataset = tokenize_clippings(train_df['clippings'].values, train_labels_numeric, max_length=155)
test_dataset = tokenize_clippings(test_df['clippings'].values, test_labels, max_length=155)

Using DataLoader, I'm feeding our training and test data to BERT in batches. This is just so it doesn't try to process the entire dataset at once (which would be an incredible strain on my computer).

In [None]:
# A loader so I don't fry my pooter
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

I also set the model optimizer and learning rate. After numerous tests, I found a learning rate of 4e-5 does best for recall. There is slight degradation to precision, but I address that in other steps below.

In [None]:
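# weight_decay=0.0 is set explicitly so the optimizer behaves the same whether AdamW comes from transformers or torch.optim
optimizer = AdamW(model.parameters(), lr=4e-5, weight_decay=0.0)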

## 4) Fine-Tuning BERT for Our Classification Task

With the data prepared for BERT, I'm now ready to fine-tune the model for our classification task. The following loop does so with progress bars to monitor the process and the average loss. Just a heads up: this step can take some time, depending on your device. On my M3 Max MacBook Pro, it takes about 8 minutes. On older devices with less GPU power, it will take much longer.

One thing I've monitored is the average loss in each training epoch. In this context, loss is the cross-entropy between the model's predicted class probabilities and the true 'yes'/'no' labels in the training data, so lower values mean the predictions fit the labels more closely. If the loss decreases between epochs, it means the model is learning something about the training data. Each time I ran the model, the first epoch saw an average loss of roughly 0.23. By the fifth epoch, the average loss always dropped to roughly 0.05.

In [None]:
# commence training
epochs = 5 # I've tried 3, 4, 5, and 6. The more epochs, the better the model does (up to five where it plateaus)
model.train()

for epoch in range(epochs):
    total_loss = 0
    progress_bar = tqdm(train_loader, desc=f'Epoch {epoch+1}/{epochs}', unit='batch')

    for batch in progress_bar:
        input_ids, attention_mask, token_type_ids, labels = [t.to(device) for t in batch]
        optimizer.zero_grad()
        outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()
        loss.backward()
        optimizer.step()
        progress_bar.set_postfix({'loss': loss.item()})

    avg_loss = total_loss / len(train_loader)
    print(f'epoch {epoch+1} - avg loss: {avg_loss:.4f}')
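
For a sense of what the loss number means, here is a small worked example of cross-entropy for a single clipping, which is the kind of loss BertForSequenceClassification computes for a two-label task like this one. The logits are made up for illustration, and cross_entropy() is a hypothetical helper written in plain Python, not the PyTorch function the model uses internally.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


logits = [2.0, -1.0]  # made-up scores for the classes [no, yes]
print(round(cross_entropy(logits, label=0), 4))  # true label is 'no'
print(round(cross_entropy(logits, label=1), 4))  # true label is 'yes'
```

With these scores the model strongly favors 'no': if 'no' is correct the loss is close to 0.05, and if 'yes' is correct the loss is close to 3.05. The per-batch loss in the training loop is the mean of such values over the clippings in the batch.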

## 5) Evaluating BERT

Before running our BERT classifier on the entirety of our data, I tested it on our test set.

In [None]:
# let's see how it does on our test data
model.eval()
predictions = []
all_labels = []
progress_bar = tqdm(test_loader, desc='Evaluation', unit='batch')

with torch.no_grad():
    for batch in progress_bar:
        input_ids, attention_mask, token_type_ids, labels = [t.to(device) for t in batch]
        outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        logits = outputs.logits
        _, preds = torch.max(logits, dim=1)
        predictions.extend(preds.cpu().tolist())
        all_labels.extend(labels.cpu().tolist())

Here's how I reported the results:

In [None]:
predicted_labels = [textualize_labels[p] for p in predictions]
test_df['predicted_case_match'] = predicted_labels
print(classification_report(test_df['case_match'], test_df['predicted_case_match']))

Across the three fine-tuning runs, the model reached an overall accuracy of 93-95% on the test set. I found that lower learning rates (2e-5 in particular) yielded higher overall accuracy by raising precision, but they lowered recall for 'yes' labels. While it may seem counterintuitive, I chose the higher learning rate despite its lower precision because it consistently resulted in higher recall (above 90%). Recall is more important to us because I want to identify as many references to lynchings as possible. Since I fine-tuned the model three separate times and got consistent results each time, I also have a method for combining the three sets of BERT predictions into final classifications (described below).

Results with learning rate at 4e-5, max_length 155, and epochs 5:

FIRST IMPLEMENTATION (BERT_1):

              precision    recall  f1-score   support

          no       0.99      0.93      0.96       161
         yes       0.77      0.95      0.85        39

    accuracy                           0.94       200

SECOND IMPLEMENTATION (BERT_2):

              precision    recall  f1-score   support

          no       0.99      0.92      0.95       161
         yes       0.74      0.95      0.83        39

    accuracy                           0.93       200
    
THIRD IMPLEMENTATION (BERT_3):

              precision    recall  f1-score   support

          no       1.00      0.94      0.97       161
         yes       0.81      1.00      0.90        39

    accuracy                           0.95       200
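
For readers less familiar with these metrics, the sketch below shows how precision, recall, and F1 for the 'yes' class follow from raw counts. The counts are illustrative: I picked them to be consistent with the rounded BERT_3 report (39 'yes' rows, all of them found), but they are not taken from the notebook's output, and precision_recall_f1() is a hypothetical helper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)  # of the rows predicted 'yes', how many were right
    recall = tp / (tp + fn)     # of the true 'yes' rows, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative counts for the 'yes' class (assumed, not read from the notebook)
tp, fp, fn = 39, 9, 0
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
```

The example makes the trade-off visible: no 'yes' clipping is missed (recall 1.00), at the cost of some 'no' clippings being flagged as 'yes' (precision 0.81).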


## 6) Classifying Aggregate Data

After each fine-tuning run, I applied the model to the aggregate data and saved its classifications in a separate column (BERT_1, BERT_2, and BERT_3). The following code demonstrates how to apply the fine-tuned model.

- It starts by creating a progress bar. 
- It checks whether the given column already exists in each csv and skips the file if so (in case you have to pause and come back, since this code takes a while). 
- It creates a mask of just the rows where the 'clippings' column contains text. This is so the model doesn't try to guess at blank rows. It labels any blank rows as 'no'.
- It prepares the clippings in each csv for BERT.
- It saves the BERT predictions in a new column.

In [None]:
# count the total rows across csvs so I can keep track of progress
directory = 'name_clusters'
csv_files = glob.glob(os.path.join(directory, '*.csv'))
total_rows = 0
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    total_rows += len(df)

In [None]:
# I ran this code three times, changing the 'BERT_{}' to the corresponding runs, thereby creating three columns with separate BERT classifications (BERT_1, BERT_2, BERT_3)
progress_bar = tqdm(total=total_rows, desc='Rows', unit='rows')

for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    # check BERT iteration here
    if 'BERT_3' in df.columns:
        progress_bar.update(len(df))
        continue
        
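    # build the mask before overwriting the column: astype(str) turns missing values
    # into the lowercase string 'nan', which notnull() and a comparison with 'NaN' both let through
    clippings_text = df['clippings'].astype(str).str.strip()
    mask_valid = (df['clippings'].notnull() & (clippings_text != '') & (clippings_text.str.lower() != 'nan'))
    df['clippings'] = df['clippings'].astype(str)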
    df_valid = df[mask_valid].copy()
    df_invalid = df[~mask_valid].copy()
    # check BERT iteration here
    df_invalid['BERT_3'] = 'no'

    if df_valid.empty:
        progress_bar.update(len(df))
        df_merged = pd.concat([df_valid, df_invalid], axis=0).sort_index()
        df_merged.to_csv(csv_file, index=False)
        continue

    dummy_labels = [0] * len(df_valid)
    clippings_dataset = tokenize_clippings(df_valid['clippings'].values, dummy_labels, max_length=155)
    clippings_loader = DataLoader(clippings_dataset, batch_size=16, shuffle=False)

    model.eval()
    predictions = []

    with torch.no_grad():
        for batch in clippings_loader:
            input_ids, attention_mask, token_type_ids, _ = [t.to(device) for t in batch]
            outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            _, preds = torch.max(outputs.logits, dim=1)
            predictions.extend(preds.cpu().tolist())
            progress_bar.update(len(input_ids))

    predicted_labels = [textualize_labels[p] for p in predictions]
    # check BERT iteration here
    df_valid['BERT_3'] = predicted_labels

    df_merged = pd.concat([df_valid, df_invalid], axis=0).sort_index()
    df_merged.to_csv(csv_file, index=False)

progress_bar.close()
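
A note on the blank-row mask: when pandas casts a missing value to a string with astype(str), the result is the lowercase text 'nan', so a blank check has to account for that spelling. The standalone sketch below shows the same behavior with a plain Python float; is_blank() is a hypothetical helper for illustration, not a function from the notebook.

```python
def is_blank(value) -> bool:
    """True for missing values, empty or whitespace-only text, and any spelling of 'nan'."""
    if value is None:
        return True
    if isinstance(value, float) and value != value:  # NaN is the only value not equal to itself
        return True
    text = str(value).strip()
    return text == '' or text.lower() == 'nan'


missing = float('nan')
print(str(missing))           # the text a missing value becomes after a str cast
print(str(missing) == 'NaN')  # an exact, case-sensitive comparison against 'NaN'
print(is_blank(str(missing)))
```

Checking for blanks before the cast, or comparing case-insensitively afterwards, keeps empty rows from being sent to the model as if they were real clippings.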

## 7) Averaging Predictions for Final Results

After running BERT each time, I assessed the results. First I calculated how many clippings BERT classified as yes (that is, 'yes, the clipping contains reference to lynching').

In [None]:
total_yes = 0

for csv_file in csv_files:
    df = pd.read_csv(csv_file, usecols=['BERT_3'])
    file_yes_count = (df['BERT_3'] == 'yes').sum()
    total_yes += file_yes_count

print(f'Total yes count: {total_yes}')

### Each Run: 

- BERT_1 has 76,653 clippings labelled as containing references to lynchings (about 17% of the data)

- BERT_2 has 81,256 clippings labelled as containing references to lynchings (about 18% of the data)

- BERT_3 has 77,297 clippings labelled as containing references to lynchings (about 17% of the data)

Then I calculated the average correspondence: the share of rows on which all three BERT runs made the same prediction.

In [None]:
overall_matches = 0
overall_rows = 0

for csv_file in csv_files:
    df = pd.read_csv(csv_file, usecols=['BERT_1', 'BERT_2', 'BERT_3'])
    total = len(df)
    matches = ((df['BERT_1'] == df['BERT_2']) & (df['BERT_2'] == df['BERT_3'])).sum()
    overall_matches += matches
    overall_rows += total

# compute the rate once, after every file has been counted
average_correspondence = overall_matches / overall_rows
print(f'Average correspondence: {average_correspondence:.2%}')
# Average correspondence: 94.82%

The average correspondence was 94.82%.

Next, I created a final probability label from the three BERT predictions. In the function below, I give every row with an empty 'clippings' value a probability of 'unknown'. If all three BERT runs classified the clipping as 'yes', I labelled its probability 'high'. If two runs labelled it 'yes', the probability is 'medium'. If only one run labelled it 'yes', I set the probability to 'low'. If no run labelled it 'yes', the probability label is 'unlikely'.

In [None]:
def calculate_probability(row):
    if row[['clippings']].isnull().any() or row['clippings'] == 'NaN':
        return 'unknown'
    
    yes_count = sum(row[col] == 'yes' for col in ['BERT_1','BERT_2','BERT_3']) 
    if yes_count == 3:
        return 'high'
    elif yes_count == 2:
        return 'medium'
    elif yes_count == 1:
        return 'low'
    else:
        return 'unlikely'

In [None]:
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    df['probability'] = df.apply(calculate_probability, axis=1)
    df.to_csv(csv_file, index=False)

Finally, to review how many clippings received each label, I counted the totals and printed them:

In [None]:
overall_counts = pd.Series(dtype=int)

for csv_file in csv_files:
    df = pd.read_csv(csv_file, usecols=['probability'])    
    file_counts = df['probability'].value_counts()
    overall_counts = overall_counts.add(file_counts, fill_value=0)

# add() with fill_value produces floats, so cast back to whole numbers for display
overall_counts = overall_counts.sort_values(ascending=False).astype(int)
print(overall_counts)

Counts per probability label:

- unlikely:    273,440
- unknown:      88,462
- high:         67,482
- low:          14,210
- medium:        9,257
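
As a sanity check, these counts can be tied back to the correspondence figure reported earlier. By definition, 'low' and 'medium' are exactly the rows where the three runs disagreed, so the agreement rate should be one minus their share of all rows. The sketch below uses only the numbers printed above; it assumes that rows labelled 'unknown' received the same label in all three runs.

```python
# Counts copied from the list above
counts = {
    'unlikely': 273_440,
    'unknown': 88_462,
    'high': 67_482,
    'low': 14_210,
    'medium': 9_257,
}

total_rows = sum(counts.values())
split_rows = counts['low'] + counts['medium']  # rows where the runs did not all agree
agreement = 1 - split_rows / total_rows

print(f'total rows: {total_rows:,}')
print(f'agreement:  {agreement:.2%}')
```

The result matches the 94.82% average correspondence computed directly from the BERT_1, BERT_2, and BERT_3 columns.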