# Location Mention Recognition - BERT fine-tuning

This project develops an automated process for recognizing toponyms (place, area, and street names) in microblogging posts. The aim is to help authorities determine the specific locations where resources such as medical aid and food should be sent.

The microblogging data consists of Twitter (X) posts, on which a Location Mention Recognition (LMR) system is built by fine-tuning BERT.

0. Setup
1. Read Data
2. Clean Dataset
3. Text and Label Tokenization
4. Load BERT and Define Optimizer
5. Define Training and Evaluation Functions
6. Training (Fine-Tuning)
7. Evaluation
8. Conclusion

## 0. Setup


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# imports
import numpy as np
import pandas as pd
import requests
from google.colab import userdata
import torch
from transformers import BertTokenizer, BertTokenizerFast, AutoTokenizer, BertForTokenClassification, pipeline, BertModel
import warnings
import re
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
from tqdm import tqdm
from torch import nn
import evaluate
import matplotlib.pyplot as plt
from jiwer import wer
import plotly.graph_objs as go

In [6]:
if torch.cuda.is_available():
    print("GPU is available")
else:
    print("GPU is not available")

device = 'cuda' if torch.cuda.is_available() else 'cpu'

GPU is available


In [7]:
# setup
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_colwidth', None)
warnings.filterwarnings('ignore')

## 1. Read Data

In [None]:
train_df = pd.read_csv('/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/data/Train.csv')
train_df.head(3)

Unnamed: 0,tweet_id,text,location
0,ID_1001136212718088192,,EllicottCity
1,ID_1001136696589631488,"Flash floods struck a Maryland city on Sunday, washing out streets and tossing cars like bath toys.",Maryland
2,ID_1001136950345109504,State of emergency declared for Maryland flooding: via @YouTube,Maryland


In [None]:
test_df = pd.read_csv('/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/data/Test.csv')
test_df.head(3)

Unnamed: 0,tweet_id,text
0,ID_1001154804658286592,"What is happening to the infrastructure in New England? It isnt global warming, its misappropriated funds being abused that shouldve been used maintaining their infrastructure that couldve protected them from floods! Like New Orleans. Their mayor went to 👇#Maryland #floods"
1,ID_1001155505459486720,"SOLDER MISSING IN FLOOD.. PRAY FOR EDDISON HERMOND! PRAY FOR ELLICOTT CITY, MARYLAND! #PrayForEddisonHermond #PrayForEllicottCity"
2,ID_1001155756371136512,"RT @TIME: Police searching for missing person after devastating 1,000-year flood in Ellicott City, Maryland"


## 2. Clean Dataset

In [None]:
# drop NaN
train_df = train_df.dropna()

## 3. Text and Label Tokenization
- Sources:
  - https://www.analyticsvidhya.com/blog/2022/11/comprehensive-guide-to-bert/
  - https://www.kaggle.com/code/thanish/bert-for-token-classification-ner-tutorial
- Tokenize the training sentence
- Tokenize the location label
- For each token of the tokenized label, loop over the tokenized training sentence; wherever there is a match, label the token 'B', otherwise 'O'
- Convert alphabetic labels to numerical labels:
  O -> 0
  B -> 1
  padding -> -100 (ignored by the loss and by the accuracy computation)
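
The labelling steps above can be sketched on plain word tokens. This is a minimal illustration that assumes whitespace tokenization instead of BERT's WordPiece tokenizer; `label_tokens` and `LABEL_TO_ID` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical mapping for illustration only
LABEL_TO_ID = {"O": 0, "B": 1}

def label_tokens(text_tokens, location_tokens):
    """Mark every text token that also occurs in the location as 'B', else 'O'."""
    location_set = set(location_tokens)
    return ["B" if tok in location_set else "O" for tok in text_tokens]

# Whitespace tokens stand in for WordPiece tokens in this sketch
text_tokens = "state of emergency declared for maryland flooding".split()
location_tokens = "maryland".split()

alpha_labels = label_tokens(text_tokens, location_tokens)
numeric_labels = [LABEL_TO_ID[label] for label in alpha_labels]

print(alpha_labels)
print(numeric_labels)
```

Note that the match is made per token, not per span: every occurrence of a location token in the text is marked, wherever it appears.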

In [8]:
# Load pre-trained BERT model and tokenizer
#tokenizer = AutoTokenizer.from_pretrained('bert-base-cased', use_fast=False)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')


In [9]:
# return a label array of the same length as the tokenized text
# 1 -> token is present in the tokenized location, 0 -> any other token
# -100 -> padding token (id 0), ignored by the loss
def get_binary_location_labels(tokenized_text, tokenized_location):

  binary_location_labels = [0]*len(tokenized_text)

  for location_i in range(len(tokenized_location)):
    for text_j in range(len(tokenized_text)):
      if tokenized_text[text_j]==0:
        binary_location_labels[text_j]=-100
      elif tokenized_text[text_j] == tokenized_location[location_i]:
        binary_location_labels[text_j]=1

  return binary_location_labels

In [10]:
# tokenize text
# create binary label for text
def tokenize_and_label(text, location=None):

  tokenized_text_dict = tokenizer(text, truncation=True, padding='max_length', max_length=450)

  if location is not None:
    tokenized_location_dict = tokenizer(location, truncation=True, padding='max_length', max_length=450)
    tokenized_text_dict['labels']=get_binary_location_labels(tokenized_text_dict['input_ids'], tokenized_location_dict['input_ids'])

  tokenized_text_dict['input_ids'] = torch.tensor(tokenized_text_dict['input_ids']).squeeze(0)
  tokenized_text_dict['attention_mask'] = torch.tensor(tokenized_text_dict['attention_mask']).squeeze(0)
  tokenized_text_dict['token_type_ids'] = torch.tensor(tokenized_text_dict['token_type_ids']).squeeze(0)

  if location is not None:
    tokenized_text_dict['labels'] = torch.tensor(tokenized_text_dict['labels']).squeeze(0)

  return tokenized_text_dict
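
One consequence of tokenizing the location with the same call as the text is that the location ids also contain `[CLS]`, `[SEP]` and `[PAD]`, so the `[CLS]` and `[SEP]` positions of the text end up labelled 1. The sketch below illustrates this with hand-written ids: 0, 101 and 102 are the `[PAD]`, `[CLS]` and `[SEP]` ids of `bert-base-uncased`, while the 7000-range ids are made up. `match_labels` mirrors the matching rule of `get_binary_location_labels`, and `strip_special` is a hypothetical helper showing one possible variant, not the approach used in this notebook.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special-token ids in bert-base-uncased

def match_labels(text_ids, location_ids):
    """Same rule as get_binary_location_labels: pad -> -100, match -> 1, else 0."""
    location_set = set(location_ids)
    return [-100 if t == PAD_ID else int(t in location_set) for t in text_ids]

def strip_special(ids):
    """Hypothetical helper: drop [PAD], [CLS] and [SEP] before matching."""
    return [i for i in ids if i not in (PAD_ID, CLS_ID, SEP_ID)]

# Made-up word ids: 7001..7004 stand for text words, 7003 is the location word
text_ids = [CLS_ID, 7001, 7002, 7003, 7004, SEP_ID, PAD_ID, PAD_ID]
location_ids = [CLS_ID, 7003, SEP_ID, PAD_ID, PAD_ID, PAD_ID, PAD_ID, PAD_ID]

labels_as_is = match_labels(text_ids, location_ids)
labels_stripped = match_labels(text_ids, strip_special(location_ids))

print(labels_as_is)     # [CLS] and [SEP] positions are labelled 1
print(labels_stripped)  # only the location word is labelled 1
```

This is consistent with `predict_locations` in section 7, which removes the special-token strings from the decoded predictions.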

- We use 450 tokens as the maximum sequence length (BERT accepts at most 512 positions). Longer texts, including any in the test dataset, are truncated to this length.
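
To make the effect of `truncation=True, padding='max_length'` concrete, here is a simplified sketch on lists of ids. `pad_or_truncate` is a hypothetical helper that only mimics the idea (keep `[SEP]` as the last real token when truncating, then fill with `[PAD]`); the real tokenizer handles this internally, and the 7000-range ids are made up.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102  # special-token ids in bert-base-uncased

def pad_or_truncate(ids, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    if len(ids) > max_length:
        # Cut the sequence but keep the closing [SEP]
        ids = ids[:max_length - 1] + [SEP_ID]
    attention_mask = [1] * len(ids) + [0] * (max_length - len(ids))
    input_ids = ids + [PAD_ID] * (max_length - len(ids))
    return input_ids, attention_mask

short_ids = [CLS_ID, 7001, 7002, SEP_ID]                   # 4 ids
long_ids = [CLS_ID] + list(range(7001, 7011)) + [SEP_ID]   # 12 ids

short_result = pad_or_truncate(short_ids, max_length=6)
long_result = pad_or_truncate(long_ids, max_length=6)

print(short_result)
print(long_result)
```

The attention mask is 0 on the padded positions, which is also where the labels are set to -100.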

In [None]:
train_df.head(3)

Unnamed: 0,tweet_id,text,location
1,ID_1001136696589631488,"Flash floods struck a Maryland city on Sunday, washing out streets and tossing cars like bath toys.",Maryland
2,ID_1001136950345109504,State of emergency declared for Maryland flooding: via @YouTube,Maryland
3,ID_1001137334056833024,"Other parts of Maryland also saw significant damage from Sundays storms including this Baltimore city neighborhood, #Dundalk and #Catonsville. Rain totals spanned from 1 to 10 inches across Maryland: #ECFlood",Baltimore Maryland


In [None]:
test_text = 'Flash floods struck a Maryland city on Sunday, washing out streets and tossing cars like bath toys.'
test_location = 'Maryland'

In [None]:
#tokenize_and_label(test_text)

The Train dataset is split into three datasets:
1. X_train: used in fine-tuning (training phase)
2. X_val: used in fine-tuning (validation phase)
3. X_eval_test: held out to evaluate the fine-tuned model

The Test dataset is unlabelled; the predictions on this dataset will be submitted for the data challenge.
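
The two consecutive splits used in this notebook (`test_size=0.2`, then `test_size=0.25` on the remainder) amount to roughly 60% / 20% / 20% of the labelled rows. The arithmetic can be checked with a small sketch; `split_sizes` is a hypothetical helper that assumes the test part of a fractional split is rounded up, and the row count is the sum of the three shapes printed by the notebook.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of the (train, test) parts for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_rows = 11849  # 7109 + 2370 + 2370, the shapes printed by the notebook

n_rest, n_val = split_sizes(n_rows, 0.2)           # first split: validation set
n_train, n_eval_test = split_sizes(n_rest, 0.25)   # second split: held-out evaluation set

print(n_train, n_val, n_eval_test)
```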


In [None]:
# hold out 20% of the labelled data as the validation set
X_train, X_val, y_train, y_val = train_test_split(
    train_df, train_df['location'], test_size=0.2, random_state=42)

# split the remaining 80% into train (60% of total) and held-out evaluation (20% of total)
X_train, X_eval_test, y_train, y_eval_test = train_test_split(
    X_train, X_train['location'], test_size=0.25, random_state=42)

In [None]:
#torch.save(X_train, '/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/data/lmr_portfolio_split/train_split_df.pkl')
#torch.save(X_val, '/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/data/lmr_portfolio_split/val_split_df.pkl')
#torch.save(X_eval_test, '/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/data/lmr_portfolio_split/test_split_df.pkl')

In [None]:
print('X_train shape: ' + str(X_train.shape))
print('X_val shape: ' + str(X_val.shape))
print('X_eval_test shape: ' + str(X_eval_test.shape))

X_train shape: (7109, 3)
X_val shape: (2370, 3)
X_eval_test shape: (2370, 3)


In [None]:
# tokenize train dataset
tokenized_train_dataset = X_train.apply(lambda x: tokenize_and_label(x['text'], x['location']), axis=1) #4 mins
tokenized_train_dataset.reset_index(drop=True, inplace=True)

In [None]:
# tokenize validation dataset
tokenized_val_dataset = X_val.apply(lambda x: tokenize_and_label(x['text'], x['location']), axis=1) # 1 min
tokenized_val_dataset.reset_index(drop=True, inplace=True)

In [None]:
# Convert datasets into DataLoader objects
train_dataloader = DataLoader(tokenized_train_dataset, sampler=RandomSampler(tokenized_train_dataset), batch_size=8)
val_dataloader = DataLoader(tokenized_val_dataset, sampler=SequentialSampler(tokenized_val_dataset), batch_size=16)

## 4. Load BERT and Define Optimizer

In [None]:
# Load pre-trained BERT model
model = BertForTokenClassification.from_pretrained('bert-base-uncased', num_labels=2)  # 2 labels for binary classification
model.to(device)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12

In [None]:
epochs = 5

# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Total number of training steps
total_steps = len(train_dataloader) * epochs

# Scheduler to decrease learning rate linearly
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# Loss function is included in the model
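
The learning-rate schedule configured above can be written out explicitly. The sketch below is a plain-Python illustration of linear warmup followed by linear decay, not the library code; `linear_lr` is a hypothetical name. The 889 batches per epoch correspond to 7109 training rows with a batch size of 8.

```python
def linear_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

batches_per_epoch = 889                   # 7109 training rows, batch size 8
epochs = 5
total_steps = batches_per_epoch * epochs  # 4445 optimizer updates

# Learning rate at the start of training and at the end of each epoch
for epoch in range(epochs + 1):
    step = epoch * batches_per_epoch
    print(epoch, linear_lr(step, base_lr=2e-5, total_steps=total_steps))
```

With no warmup steps, the rate starts at 2e-5 and reaches 0 at the last update, so later epochs make progressively smaller updates.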

## 5. Define Training and Evaluation Functions

In [None]:
# Training function
def train_model(model, train_dataloader, optimizer, scheduler, device):
    model.train()
    total_loss = 0

    # Loop through batches
    for batch in tqdm(train_dataloader, desc="Training"):

        # Move batch to device (GPU/CPU)
        batch = {k: v.to(device) for k, v in batch.items()}

        # Forward pass
        outputs = model(**batch)
        loss = outputs.loss

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()

        # Accumulate loss
        total_loss += loss.item()

    avg_loss = total_loss / len(train_dataloader)
    return avg_loss

In [None]:
# Validation function
def evaluate_model(model, val_dataloader, device):
    model.eval()
    total_loss = 0
    correct_preds = 0
    total_preds = 0

    with torch.no_grad():
        for batch in tqdm(val_dataloader, desc="Validating"):
            batch = {k: v.to(device) for k, v in batch.items()}
            outputs = model(**batch)
            loss = outputs.loss
            logits = outputs.logits

            total_loss += loss.item()

            predictions = torch.argmax(logits, dim=-1)
            labels = batch['labels']

            # Calculate accuracy, ignoring padding (-100 labels)
            mask = labels != -100
            correct_preds += (predictions[mask] == labels[mask]).sum().item()
            total_preds += mask.sum().item()

    avg_loss = total_loss / len(val_dataloader)
    accuracy = correct_preds / total_preds
    return avg_loss, accuracy
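
Since location tokens are usually a small share of a tweet, token accuracy can look high even when some location tokens are missed. A minimal sketch, on made-up labels, applies the same masking rule and adds precision and recall for class 1; `masked_scores` is a hypothetical helper and is not part of the notebook.

```python
def masked_scores(predictions, labels, ignore_index=-100):
    """Accuracy, precision and recall for class 1, skipping positions labelled ignore_index."""
    pairs = [(p, y) for p, y in zip(predictions, labels) if y != ignore_index]
    correct = sum(p == y for p, y in pairs)
    true_pos = sum(p == 1 and y == 1 for p, y in pairs)
    pred_pos = sum(p == 1 for p, _ in pairs)
    actual_pos = sum(y == 1 for _, y in pairs)
    accuracy = correct / len(pairs)
    precision = true_pos / pred_pos if pred_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, precision, recall

# Made-up example: 10 real tokens (2 of them locations) followed by 2 padding positions
labels      = [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, -100, -100]
predictions = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1]

scores = masked_scores(predictions, labels)
print(scores)
```

In this toy case the accuracy is 0.9 although only half of the location tokens are recovered; predictions on padded positions do not count.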

## 6. Training (Fine-Tuning)

In [None]:
# evaluate model before training
initial_train_loss, initial_train_accuracy = evaluate_model(model, train_dataloader, device)
initial_val_loss, initial_val_accuracy = evaluate_model(model, val_dataloader, device)

Validating: 100%|██████████| 889/889 [03:10<00:00,  4.66it/s]
Validating: 100%|██████████| 149/149 [01:08<00:00,  2.19it/s]


In [None]:
print('initial model train loss: ' +str(initial_train_loss))
print('initial model validation loss: ' +str(initial_val_loss))

initial model train loss: 0.6111351655894646
initial model validation loss: 0.6119939812877834


In [None]:
# about 10 mins per epoch
train_losses = []
val_losses = []
val_accuracies = []

def last3_val_losses_identical(val_losses):
  # True when the last three validation losses are all equal
  if len(val_losses) < 3:
    return False
  else:
    return len(set(val_losses[-3:])) == 1

epoch = 0
while epoch<epochs and (not last3_val_losses_identical(val_losses)):

    print(f"Epoch {epoch + 1}/{epochs}")

    # Train
    train_loss = train_model(model, train_dataloader, optimizer, scheduler, device)
    train_losses.append(train_loss)
    print(f"Training loss: {train_loss}")

    # Validate
    val_loss, val_accuracy = evaluate_model(model, val_dataloader, device)
    val_losses.append(val_loss)
    val_accuracies.append(val_accuracy)
    print(f"Validation loss: {val_loss}, Validation accuracy: {val_accuracy}")

    # save model at each epoch
    #torch.save(model, '/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/models/model_lmr_portfolio_'+str(epoch)+'.pkl')
    #torch.save(train_losses, '/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/losses/train_losses.pkl')
    #torch.save(val_losses, '/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/losses/val_losses.pkl')
    #torch.save(val_accuracies, '/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/losses/val_accuracies.pkl')

    epoch += 1

Epoch 1/5


Training: 100%|██████████| 889/889 [09:43<00:00,  1.52it/s]


Training loss: 0.06290301339175967


Validating: 100%|██████████| 149/149 [01:08<00:00,  2.16it/s]


Validation loss: 0.04136069510882133, Validation accuracy: 0.9853441157899849
Epoch 2/5


Training: 100%|██████████| 889/889 [09:52<00:00,  1.50it/s]


Training loss: 0.03389111061712609


Validating: 100%|██████████| 149/149 [01:08<00:00,  2.17it/s]


Validation loss: 0.04114269589749009, Validation accuracy: 0.9852712613531497
Epoch 3/5


Training: 100%|██████████| 889/889 [09:53<00:00,  1.50it/s]


Training loss: 0.022607712716069643


Validating: 100%|██████████| 149/149 [01:08<00:00,  2.18it/s]


Validation loss: 0.04346761188076047, Validation accuracy: 0.9860483753460586
Epoch 4/5


Training: 100%|██████████| 889/889 [09:53<00:00,  1.50it/s]


Training loss: 0.01522593843425963


Validating: 100%|██████████| 149/149 [01:08<00:00,  2.18it/s]


Validation loss: 0.04920674306371768, Validation accuracy: 0.9865947836223226
Epoch 5/5


Training: 100%|██████████| 889/889 [09:53<00:00,  1.50it/s]


Training loss: 0.010324653461263844


Validating: 100%|██████████| 149/149 [01:08<00:00,  2.18it/s]


Validation loss: 0.05633065418024616, Validation accuracy: 0.9864126475302346


In [None]:
# preparing data to plot
epochs_range = np.array(range(epochs)) + 1
epochs_range = np.append([0], epochs_range)

train_losses = np.append([initial_train_loss], train_losses)
val_losses = np.append([initial_val_loss], val_losses)

In [None]:
fig = go.Figure()

# min loss
min_idx = val_losses.argmin()
min_val = val_losses[min_idx]
min_val_epoch = min_idx

fig.add_trace(go.Scatter(y=train_losses, name='Training Loss'))
fig.add_trace(go.Scatter(y=val_losses, name='Validation Loss'))
fig.add_trace(go.Scatter(x=[min_val_epoch], y=[min_val], mode='markers', name='Minimum Validation Loss'))

# add title
fig.update_layout(title='Training and Validation Loss',
                  xaxis_title='Epoch',
                  yaxis_title='Loss')
fig.show()


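From the above curves, the minimum validation loss (0.0411) was obtained on the second epoch. Afterwards the validation loss rises while the training loss keeps decreasing, which suggests overfitting. The model saved after the second epoch (file index 1) is therefore loaded.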
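
The choice of epoch can be reproduced from the validation losses printed during training. The sketch below also shows a patience-based early-stopping rule as an alternative to the stopping condition used in the training loop; it is a different technique from the one in this notebook, and `best_epoch` and `should_stop` are hypothetical names.

```python
# Validation losses printed during training (epochs 1 to 5), rounded to 5 decimals
val_losses = [0.04136, 0.04114, 0.04347, 0.04921, 0.05633]

def best_epoch(losses):
    """1-based epoch with the lowest validation loss."""
    return min(range(len(losses)), key=losses.__getitem__) + 1

def should_stop(losses, patience=2):
    """Stop once the last `patience` epochs did not improve on the best earlier loss."""
    if len(losses) <= patience:
        return False
    return min(losses[-patience:]) >= min(losses[:-patience])

best = best_epoch(val_losses)
stop_flags = [should_stop(val_losses[:n]) for n in range(1, len(val_losses) + 1)]

print(best)
print(stop_flags)
```

Under this rule with a patience of 2, training would have stopped after the fourth epoch, and the second-epoch checkpoint would still be the one selected.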

In [12]:
model=torch.load('/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/models/model_lmr_portfolio_1.pkl')

## 7. Evaluation

Evaluation metric: Word Error Rate (WER)
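
WER compares a predicted word sequence with a reference word sequence: `WER = (S + D + I) / N`, where S, D and I are the numbers of substituted, deleted and inserted words and N is the number of words in the reference. The sketch below is a plain-Python illustration, assuming the score is aggregated over the whole corpus (total errors divided by total reference words); `word_edit_distance` and `corpus_wer` are hypothetical names, and the notebook itself relies on the `evaluate` library.

```python
def word_edit_distance(reference_words, hypothesis_words):
    """Minimum number of word substitutions, deletions and insertions (Levenshtein distance)."""
    previous = list(range(len(hypothesis_words) + 1))
    for i, ref_word in enumerate(reference_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]

def corpus_wer(references, predictions):
    """Total word errors divided by total reference words over all pairs."""
    errors = sum(word_edit_distance(r.split(), p.split())
                 for r, p in zip(references, predictions))
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words

references = ["california", "featherston st wellington"]
predictions = ["california", "wellington featherston st"]

score = corpus_wer(references, predictions)
print(score)
```

The second pair contains the same three words in a different order and still counts two errors, so WER penalizes predictions whose word order differs from the label.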

In [13]:
def predict_locations(model, dataloader):
  # Decode, for each text, the tokens that the fine-tuned model labels as location (class 1)
  model.eval()  # Set the model to evaluation mode

  # Move the model to the device
  model = model.to(device)

  # String forms of the special tokens, removed from the decoded predictions
  cls_token = tokenizer.decode(tokenizer.cls_token_id).replace(' ', '')
  pad_token = tokenizer.decode(tokenizer.pad_token_id).replace(' ', '')
  sep_token = tokenizer.decode(tokenizer.sep_token_id).replace(' ', '')

  predicted_locations = []

  # Make predictions (forward pass)
  with torch.no_grad():  # Disable gradient calculations for inference

    for batch in tqdm(dataloader, desc="Predicting"):
      batch = {k: v.to(device) for k, v in batch.items()}
      outputs = model(**batch)
      logits = outputs.logits

      # argmax over the logits gives the same class as argmax over the softmax probabilities
      batch_predicted_classes = torch.argmax(logits, dim=-1).cpu().numpy()

      input_ids = batch['input_ids'].cpu().numpy()

      # Keep only the token ids predicted as class 1 and decode them back to text
      batch_predicted_locations = [tokenizer.decode([input_ids[i][j] for j in range(len(batch_predicted_classes[i])) if batch_predicted_classes[i][j]==1]) for i in range(len(batch_predicted_classes))]
      batch_predicted_locations = [str.strip(string.replace(cls_token, '').replace(pad_token, '').replace(sep_token, '')) for string in batch_predicted_locations]

      predicted_locations.extend(batch_predicted_locations)

  return predicted_locations

In [None]:
eval_test_df = pd.DataFrame({'text': X_eval_test['text'], 'location': X_eval_test['location']})
eval_test_df.reset_index(drop=True, inplace=True)
eval_test_df.head(3)

Unnamed: 0,text,location
0,Wife of @StephenCurry30 responds to Trump’s attacks by asking him to donate to Mexico earthquake victims. That’s how you use the limelight!,Mexico
1,"Kerala needs your help, we are sending medicines for people affected by flood. If you want to help, please send them to our office: You can send: OTC medicines, sanitary pads and dry food items. Let’s do our bit 🙏🏽 #KeralaFloods #KeralaFloodRelief",Kerala
2,@TheEllenShow Ecuador needs the help of everyone. Please!,Ecuador


In [15]:
# tokenize data
tokenized_eval_test_dataset = eval_test_df.apply(lambda x: tokenize_and_label(x['text']), axis=1) # 1 min

In [16]:
# load dataset
eval_test_dataloader = DataLoader(tokenized_eval_test_dataset, batch_size=16)

In [None]:
eval_test_predicted_locations = predict_locations(model, eval_test_dataloader)

Predicting: 100%|██████████| 149/149 [01:02<00:00,  2.37it/s]


In [18]:
eval_test_df = pd.DataFrame({'text': X_eval_test['text'], 'location': X_eval_test['location'], 'predicted_location': eval_test_predicted_locations})

In [None]:
print('percentage missing locations in labels:')
print(str((len(eval_test_df[eval_test_df['location']==''])/len(eval_test_df))*100) + '%')

print('percentage missing locations in predictions:')
print(str((len(eval_test_df[eval_test_df['predicted_location']==''])/len(eval_test_df))*100) + '%')

percentage missing locations in labels:
0.0%
percentage missing locations in predictions:
0.25316455696202533%


In [21]:
# use a distinct name so the metric object does not shadow `wer` imported from jiwer
wer_metric = evaluate.load("wer")
word_error_rate = wer_metric.compute(references=eval_test_df['location'].apply(str.lower).tolist(), predictions=eval_test_df['predicted_location'].apply(str.lower).tolist())
print('word error rate: ' + str(word_error_rate))

word error rate: 0.5135883241066935


In [None]:
eval_test_df.sample(20)

Unnamed: 0,text,location,predicted_location
19506,"The enormity of this tragedy isn’t comprehensible. Death toll in Northern California wildfire grows to 76 w/ 1,276 still missing.",California,california
44656,RT @SamSachdevaNZ: More #eqnz pics from Wellington - few buildings on Featherston St seem to be concerning engineers,Featherston St Wellington,wellington featherston st
25105,RT @NatashaFatah: Death Toll in Africa Climbs Past 600 After Cyclone #CycloneIdai,Africa,africa
44310,Mum-in-law living at Leithfield Beach scared but okay – the whole place has been evacuated on the tsunami warning. #eqnz,Leithfield Beach,leithfield beach
37296,"Salute to these young doctors who have volunteered from hospitals of #SaraiAlamgir, #Jhelum, #Kharian & #Gujrat to come to #Mirpur hospital to help their colleagues. Majority of injured are kids & women as they were taken out of collapsed houses. #Earthquake",Jhelum SaraiAlamgir Kharian Mirpur Gujrat,saraialamgir jhelum kharian gujrat mirpur hospital
37170,"90 percent electricity restored in #Mirpur and other #earthquake affected areas according to electricity department , moreover 2 days vacations announced in all educational institutions in district Mirpur #AJK #Kashmir",Kashmir Mirpur AJK,mirpur mirpur ajk kashmir
12207,Emergency workers from East Tennessee this morning heading over to South Carolina to assist during and after hurricane Florence,East Tennessee South Carolina,east tennessee south carolina
17895,California wildfires: Death toll rises to 31 with 200 missing - BBC News via @nuzzel,California,california
64260,21 kids among dead found in school after Mexico earthquake:,Mexico,mexico
31337,Governor Brian Kemp has declared a State of Emergency for 12 counties in Southern Georgia. Governor Henry McMaster has declared a State of Emergency for the entire state of South Carolina. Dorian remains a major category 4 hurricane.,South Carolina Georgia,georgia south carolina


## 8. Conclusion


Fine-tuning the pretrained BERT model on the given dataset gave the following results on the held-out evaluation split:
- Percentage of tweets with no predicted location: 0.253%
- Word Error Rate: 0.514

The given labels sometimes contain errors or are incomplete, as in the following examples, where the model's prediction may be more accurate than the label.

| text              |  label | predicted location |
| :---------------: | :------: | :----:|
| RT @the_chill_son: I am active duty military, w/ supplies, trying to get to St. Croix. Currently flying to St. Thomas on Sunday. Help! #stc        |   St. Croix   | st. croix. st. thomas. |
| DEVELOPING: Hurricane Matthew lashes Haiti & Cuba, amid States of Emergency along US east coast. @Miguelnbc reports now on @NBCNightlyNews.           |   Haiti Cuba   | haiti cuba us |
| Whoa. Craven County (includes New Bern, isnt a county with beachfront areas) has ordered a mandatory evacuation #HurricaneFlorence    |  Whoa. Craven New Bern   | craven county new bern |

