# Fine-Tuning BERT for Location Mention Recognition

This notebook demonstrates the process of fine-tuning a BERT model to recognize and categorize location mentions in text using the IDRISI dataset. The task at hand is a type of Named Entity Recognition (NER) where the goal is to identify and classify location names, such as countries, cities, or landmarks, within a given text.

We utilize the BILOU (Begin, Inside, Last, Outside, Unit) labeling scheme, which provides detailed annotations of entity boundaries. Fine-tuning BERT with these structured labels allows the model to leverage its deep contextual understanding to perform highly accurate token classification, essential for detecting location mentions in diverse textual data.

The notebook is structured as follows:
1. **Setup and Installation**: Install and import the necessary libraries.
2. **Data Ingestion and Preprocessing**: Load the IDRISI dataset and prepare it for modeling, including tokenization and label mapping.
3. **Modeling Preparation**: Create custom datasets, define label mappings, and set up the BERT model for token classification.
4. **Fine-Tuning**: Train the BERT model on the labeled data, optimizing for accuracy in location mention recognition.
5. **Evaluation**: Assess the performance of the fine-tuned model using the Word Error Rate Metric.

By the end of this notebook, we will have a BERT model that is specifically fine-tuned to recognize and classify location mentions, which can be applied to various real-world applications like geolocation in social media, news analytics, and more.

# Setup

In [None]:
!pip install transformers jiwer pandas accelerate -U

In [2]:
import os
import re
from collections import Counter

import pandas as pd
import jiwer
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForTokenClassification, Trainer, TrainingArguments, DataCollatorForTokenClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Helpers

In [3]:
def ingest_idrisi_data(bilou_base_dir='/kaggle/input/idrisi-location-mention/LMR/data/EN/gold-random-bilou/'):
    sentences, labels = [], []
    for root, dirs, files in os.walk(bilou_base_dir):
        for file in files:
            if file.endswith('.txt'):
                file_path = os.path.join(root, file)
                with open(file_path, 'r') as f:
                    current_sentence, current_labels = [], []
                    for line in f:
                        word_label = line.strip().split()
                        if len(word_label) == 2:
                            word, label = word_label
                            current_sentence.append(word)
                            current_labels.append(label)
                        elif len(current_sentence) > 0:
                            sentences.append(' '.join(current_sentence))
                            labels.append(','.join(current_labels))
                            current_sentence, current_labels = [], []
                    if len(current_sentence) > 0:
                        sentences.append(' '.join(current_sentence))
                        labels.append(','.join(current_labels))
    return pd.DataFrame({'sentence': sentences, 'word_labels': labels})



def tokenize_and_preserve_labels(sentence, text_labels, tokenizer):
    tokenized_sentence, labels = [], []
    for word, label in zip(sentence.split(), text_labels.split(",")):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        tokenized_sentence.extend(tokenized_word)
        labels.extend([label] * n_subwords)
    return tokenized_sentence, labels

class CustomDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        sentence = self.data.sentence[index]  
        word_labels = self.data.word_labels[index]  
        tokenized_sentence, labels = tokenize_and_preserve_labels(sentence, word_labels, self.tokenizer)
        
        tokenized_sentence = ["[CLS]"] + tokenized_sentence + ["[SEP]"]
        labels.insert(0, "O")
        labels.insert(-1, "O")

        if len(tokenized_sentence) > self.max_len:
            tokenized_sentence = tokenized_sentence[:self.max_len]
            labels = labels[:self.max_len]
        else:
            tokenized_sentence += ['[PAD]'] * (self.max_len - len(tokenized_sentence))
            labels += ["O"] * (self.max_len - len(labels))

        attn_mask = [1 if tok != '[PAD]' else 0 for tok in tokenized_sentence]
        ids = self.tokenizer.convert_tokens_to_ids(tokenized_sentence)
        label_ids = [label2id[label] for label in labels]
        
        return {
            'input_ids': torch.tensor(ids, dtype=torch.long),
            'attention_mask': torch.tensor(attn_mask, dtype=torch.long),
            'labels': torch.tensor(label_ids, dtype=torch.long)
        }
    
    def __len__(self):
        return len(self.data)
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels.flatten(), preds.flatten(), average='weighted')
    acc = accuracy_score(labels.flatten(), preds.flatten())
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

def infer_on_sentences(sentences, model, tokenizer, max_len=300, with_extra=False):
    # Put the model in evaluation mode
    model.eval()
    
    results = []
    extra_results = []
    
    for sentence in tqdm(sentences):
        # Tokenize the sentence and prepare input for the model
        tokenized_sentence = tokenizer(
            sentence.split(),
            is_split_into_words=True,
            return_offsets_mapping=False,
            padding='max_length',
            truncation=True,
            max_length=max_len,
            return_tensors="pt"
        )
        
        # Move tensors to the correct device
        input_ids = tokenized_sentence['input_ids'].to(device)
        attention_mask = tokenized_sentence['attention_mask'].to(device)
        
        # Get predictions
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=2)  # Get the index of the highest logit for each token
        
        # Convert predictions to labels
        pred_labels = [id2label[pred.item()] for pred in predictions[0]]
        
        # Get the original tokens from input_ids
        tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
        
        # Filter out tokens with the 'O' label and concatenate them
        filtered_tokens = [
            token for token, label in zip(tokens, pred_labels)
            if label != 'O' and token not in ['[CLS]', '[SEP]', '[PAD]']
        ]
        filtered_labels = [
            label for token, label in zip(tokens, pred_labels)
            if label != 'O' and token not in ['[CLS]', '[SEP]', '[PAD]']
        ]
        
        results.append(" ".join(filtered_tokens))
        extra_results.append(filtered_labels)

    if with_extra:
        return results, extra_results

    return results

def calculate_performance_metric(df, col1='location', col2='prediction'):

    # Function to calculate WER for each row
    def calculate_wer(row):
        return jiwer.wer(str(row[col1]), str(row[col2]))

    # Calculate WER for each row
    df['WER'] = df.apply(calculate_wer, axis=1)

    # Calculate the average WER
    average_wer = df['WER'].mean()

    return df, average_wer

def clean_text(text):
    # Define a dictionary of replacements
    replacements = {
        ",": " ",
        "@": "",
        ".": "",
        ";": "",
        "-": " ",
        "_": "",
        "#": "",
        "##": ""
    }
    
    cleaned_text = text
    for k, v in replacements.items():
        cleaned_text = cleaned_text.replace(k, v)

    return cleaned_text

def clean_prediction(row, raw_prediction_col='prediction_raw'):
    prediction = row[raw_prediction_col]
    prediction = prediction.replace(" ##", "")
    if prediction.startswith("##"):
        prediction = " ".join(prediction.split()[1:])

    cleaned_text = clean_text(row['text'])
    lower_upper_map = {k.lower(): k for k in cleaned_text.split()}

    for k, v in lower_upper_map.items():
        prediction = prediction.replace(k, v)

    replacements = {
        "U S .": "",
        "L . A .": "L.A.",
        "P R . P R .": "P.R.",
        "N C . N C": "N.C.",
        "u . s .": "U.S.",
        "s . c .": "S.C.",
        "n . c . n . c": "N.C.",
        "n . c .": "N.C.",
        "d . c .": "D.C.",
        "n c . n c": "N.C.",
        '. r . p . r .': "P.R.",
        "u s .": "U.S.",

        " sc": "",
        " St": "",
        " -": "",
        " .": "",
        " _": "",
    }
    cleaned_prediction = prediction
    for k, v in replacements.items():
        cleaned_prediction = cleaned_prediction.replace(k, v)

    prediction_words = cleaned_prediction.split()
    if len(prediction_words) > 5:
        cleaned_prediction = Counter(cleaned_prediction.split()).most_common(1)[0][0]

    if len(set(prediction_words)) == 1:
        cleaned_prediction = prediction_words[0] 

    return cleaned_prediction

# Data preparation

The IDRISI dataset is used because it contains annotated location mentions in text, which is essential for training models to recognize named entities, specifically locations. The data is provided in the BILOU format (Begin, Inside, Last, Outside, Unit), a variant of the IOB (Inside, Outside, Begin) format.

In [4]:
# Load IDRISI data
data = ingest_idrisi_data()
data = data[["sentence", "word_labels"]].drop_duplicates().reset_index(drop=True)

# Split the dataset into training and testing sets
train_size = 0.9999  # High percentage for training data as we've already run a classic split before - this is equivalent to finetuning on the whole dataset
train_dataset = data.sample(frac=train_size, random_state=200)
test_dataset = data.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)

In [5]:
data.head()

Unnamed: 0,sentence,word_labels
0,RT @pzf : ECUADOR EARTHQUAKE : - At least 250 ...,"O,O,O,U-CTRY,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O"
1,RT @brk_news_now : FOX : ECUADOR ROCKED : Magn...,"O,O,O,O,O,U-CTRY,O,O,O,O,O,O,O,O,O"
2,Independent : Video shows the moment earthquak...,"O,O,O,O,O,O,O,O,O,O,O,O,O,U-CTRY"
3,RT @telesurenglish : Gracias Venezuela : Count...,"O,O,O,O,U-CTRY,O,O,O,O,O,O,U-CTRY,O,O,O,O,O"
4,RT @Emergency_Life : ἞A἞8Ἱ7 # Ecuador # Earthq...,"O,O,O,O,O,U-CTRY,O,O,O,O,O,O,O,O,O,O,O,O,O,O,O..."


# Modeling preparation

## Prepare custom label mappings

Before fine-tuning, it’s essential to map the location mention labels from the BILOU format to a format that BERT can understand. This involves converting categorical labels (e.g., `B-CITY`,`B-CNTY`, `B-CONT`) into integer IDs, which the model will use during training. This mapping is critical because BERT outputs logits for each token, which are then converted back to these labels.

In [6]:
# Extract unique tags from word labels
tags = set(",".join(data.word_labels).split(','))

# Create label to ID and ID to label mappings
label2id = {k: v for v, k in enumerate(tags)}
id2label = {v: k for v, k in enumerate(tags)}

## Setup the model and tokenizer

In [7]:
# Initialize the tokenizer using a pre-trained BERT model
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

# Load a pre-trained BERT model for token classification with the custom label mappings
model = BertForTokenClassification.from_pretrained(
    "bert-large-uncased",
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id
)
model.to(device)

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]



model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024

A custom dataset class is created to handle the input data, applying tokenization and ensuring that sequences are properly padded or truncated to fit the model’s expected input size. The `DataCollatorForTokenClassification` from the Hugging Face `transformers` library is used to dynamically pad batches during training, making the process efficient and preventing data leakage between samples.

In [8]:
# Initialize data collator for token classification
data_collator = DataCollatorForTokenClassification(tokenizer)

In [9]:
# Define maximum sequence length for tokenization
MAX_LEN = 300

# Create custom datasets for training and testing
training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN)
testing_set = CustomDataset(test_dataset, tokenizer, MAX_LEN)

In [10]:
# Define training parameters
TRAIN_BATCH_SIZE = 16
VALID_BATCH_SIZE = 8
EPOCHS = 10 # Increase to 3 or more

The fine-tuning process involves setting up the `Trainer` class from the `transformers` library, which simplifies the training loop, handles model optimization, and tracks metrics like accuracy, precision, recall, and F1-score. We specify training arguments such as the number of epochs, batch size, learning rate, and the device (GPU or CPU). The model is trained to minimize the loss function, adjusting its weights based on the labeled data to improve its predictions.

In [11]:
# Set up training arguments for the Trainer API
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=VALID_BATCH_SIZE,
    warmup_steps=25,
    weight_decay=0.001,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="steps",
    eval_steps=25,
    save_steps=50,
    save_total_limit=2,
    gradient_accumulation_steps=4,  # Accumulate gradients for larger effective batch size
    fp16=True,  # Enable mixed precision training for faster computation
    report_to=["none"] #set this to true if you have a WANDB API key
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_set,
    eval_dataset=testing_set,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics  # Function to compute metrics during evaluation
)

  self.scaler = torch.cuda.amp.GradScaler(**kwargs)


# Fine-tuning

For this task, we fine-tune the `BertForTokenClassification` model, a variant of BERT designed for sequence tagging tasks like Named Entity Recognition (NER). Fine-tuning involves taking a pre-trained BERT model and adapting it to our specific task—location mention recognition—by training it further on our labeled dataset. This step leverages the knowledge BERT has from its initial pre-training on a vast corpus while specializing it for identifying location mentions.

In [12]:
trainer.train()

Step,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
25,0.2673,0.069474,0.99,0.985025,0.9801,0.99
50,0.0313,0.050041,0.99,0.985025,0.9801,0.99
75,0.0285,0.039596,0.99,0.986683,0.983389,0.99
100,0.0197,0.03192,0.99,0.98751,0.985034,0.99
125,0.017,0.027783,0.99,0.98751,0.985034,0.99
150,0.0152,0.024249,0.993333,0.992006,0.99085,0.993333
175,0.0155,0.018874,0.995,0.99334,0.993358,0.995
200,0.0136,0.024671,0.993333,0.992006,0.99085,0.993333
225,0.0128,0.02598,0.995,0.99334,0.993358,0.995
250,0.0111,0.022319,0.993333,0.992006,0.99085,0.993333


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_pr

TrainOutput(global_step=2570, training_loss=0.01950127827148551, metrics={'train_runtime': 15754.4987, 'train_samples_per_second': 10.439, 'train_steps_per_second': 0.163, 'total_flos': 8.9507210801472e+16, 'train_loss': 0.01950127827148551, 'epoch': 10.0})

# Measuring average WER

Now that the model is trained, let's make inference on train data to evaluate against the custom metric.

In [13]:
train = pd.read_csv("/kaggle/input/dattttt/dat/Train.csv")

train = train[~train['text'].isna()]

In [14]:
# implement in batches later
train_predictions = infer_on_sentences(train.text.to_list(), trainer.model, tokenizer)
train['prediction_raw'] = train_predictions

100%|██████████| 16448/16448 [15:23<00:00, 17.80it/s]


In [15]:
train.to_csv('news.csv', index=False)

In [16]:
df, average_wer = calculate_performance_metric(train, col2='prediction_raw')
average_wer

1.28802675855813

Can we do better?

Here we will perform some post-inference cleaning,using the `clean_prediction` function defined in the helpers section. Nothing fancy, just a bunch of heuristics.

In [17]:
train['prediction_clean'] = train.apply(clean_prediction, axis=1)

In [18]:
df, average_wer = calculate_performance_metric(train, col2='prediction_clean')
average_wer

0.5019827050651466

# Submission

In [19]:
test = pd.read_csv("/kaggle/input/dattttt/dat/Test.csv")

In [20]:
# implement in batches later
test_predictions = infer_on_sentences(test.text.to_list(), model, tokenizer)

100%|██████████| 2942/2942 [02:45<00:00, 17.81it/s]


In [21]:
test['prediction_raw'] = test_predictions
test['prediction'] = test.apply(clean_prediction, axis=1)
test['prediction'] = test['prediction'].replace("", " ")

In [22]:
test[['tweet_id', 'prediction']].to_csv("bert-large-uncased-fine-tuned-1-epoch+huristic-cleaning.csv", index=False)

# Some improvement ideas
- Obviously, improving the fine-tuning for better results
- Fine tunning other models
- using the location hierarchy to re-order output
- cleaning tweets and predictions using a grammar correction model such as t5-base-grammar-correction
- ensemble modeling...

# Extra - pushing the model into the hub 