# Location Mention Recognition - BERT Named-Entity Recognition
This project develops an automated process for recognizing toponyms (place, area, and street names) in microblogging posts. The aim is to help authorities pinpoint where to send resources such as medical aid and food.

The microblogging data consists of Twitter (X) posts, on which a Location Mention Recognition (LMR) system is built.

We use a pretrained BERT NER model. Since no training is done, we run the model over the whole train dataset to get location predictions and then evaluate them against the labelled locations.

0. Setup
1. Read Data
2. Clean Dataset
3. Classification using BERT NER
4. Evaluation

## 0. Setup


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# imports (only the names used in this notebook)
import warnings

import evaluate
import pandas as pd
import torch
from datasets import Dataset
from transformers import BertTokenizer, BertForTokenClassification, pipeline

In [None]:
if torch.cuda.is_available():
    print("GPU is available")
else:
    print("GPU is not available")

device = 'cuda' if torch.cuda.is_available() else 'cpu'

GPU is available


In [None]:
# setup
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_colwidth', None)
warnings.filterwarnings('ignore')

## 1. Read Data

In [None]:
train_df = pd.read_csv('/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/data/Train.csv')
train_df.head(3)

Unnamed: 0,tweet_id,text,location
0,ID_1001136212718088192,,EllicottCity
1,ID_1001136696589631488,"Flash floods struck a Maryland city on Sunday, washing out streets and tossing cars like bath toys.",Maryland
2,ID_1001136950345109504,State of emergency declared for Maryland flooding: via @YouTube,Maryland


In [None]:
test_df = pd.read_csv('/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/data/Test.csv')
test_df.head(3)

Unnamed: 0,tweet_id,text
0,ID_1001154804658286592,"What is happening to the infrastructure in New England? It isnt global warming, its misappropriated funds being abused that shouldve been used maintaining their infrastructure that couldve protected them from floods! Like New Orleans. Their mayor went to 👇#Maryland #floods"
1,ID_1001155505459486720,"SOLDER MISSING IN FLOOD.. PRAY FOR EDDISON HERMOND! PRAY FOR ELLICOTT CITY, MARYLAND! #PrayForEddisonHermond #PrayForEllicottCity"
2,ID_1001155756371136512,"RT @TIME: Police searching for missing person after devastating 1,000-year flood in Ellicott City, Maryland"


## 2. Clean Dataset

In [None]:
# drop NaN
train_df = train_df.dropna()
train_df.reset_index(drop=True, inplace=True)

## 3. Classification using BERT Named-Entity Recognition

In [None]:
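# Load the pretrained tokenizer and the BERT model fine-tuned for NER on CoNLL-2003
# BertTokenizer is the slow (pure Python) tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# move the model to the device selected in the setup section instead of hardcoding 'cuda'
model_ner = BertForTokenClassification.from_pretrained('dbmdz/bert-large-cased-finetuned-conll03-english').to(device)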

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
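# aggregation_strategy="simple" is the current spelling of the deprecated grouped_entities=True
# device: 0 selects the first GPU, -1 falls back to the CPU
bert_pipeline = pipeline("ner", model=model_ner, tokenizer=tokenizer, device=0 if device == 'cuda' else -1, aggregation_strategy="simple")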

In [None]:
# extract only location entities from the BERT NER results
# note that the BERT pipeline with entity grouping enabled does not support batch processing,
# so each text is passed to the pipeline on its own
def extract_location(text):
    if text == '':
        return ''

    result = bert_pipeline(text)
    element_location = ''

    # keep LOC entities the model is reasonably confident about (score > 0.8)
    for entity in result:
        if entity['entity_group'] == 'LOC' and entity['score'] > 0.8:
            element_location = element_location + entity['word'] + ' '

    return element_location.rstrip()
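
The filtering step can be tried without loading the model. The sketch below runs the same rule (label `LOC`, score above 0.8) over a hand-written list shaped like the pipeline's grouped output. The entities, scores, and the helper name `keep_locations` are made up for illustration and are not produced by the notebook.

```python
# Hand-written stand-in for the grouped NER output of one text.
# The words and scores are invented for illustration only.
sample_entities = [
    {"entity_group": "LOC", "score": 0.99, "word": "Ellicott City"},
    {"entity_group": "ORG", "score": 0.97, "word": "TIME"},
    {"entity_group": "LOC", "score": 0.55, "word": "Main Street"},
    {"entity_group": "LOC", "score": 0.98, "word": "Maryland"},
]


def keep_locations(entities, min_score=0.8):
    """Return the words of LOC entities scoring above min_score, space-joined."""
    words = [
        entity["word"]
        for entity in entities
        if entity["entity_group"] == "LOC" and entity["score"] > min_score
    ]
    return " ".join(words)


print(keep_locations(sample_entities))  # Ellicott City Maryland
```

The low-confidence `Main Street` entry is dropped by the threshold and the `ORG` entry by the label check, so a stricter threshold trades missed locations for fewer spurious ones.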

In [None]:
# apply extract_location to every text in a batch
# (the texts are still processed one by one, not in parallel)
def extract_location_batch(batch):
    batch_locations = []

    for text in batch['text']:
        batch_locations.append(extract_location(text))

    batch['predicted_locations'] = batch_locations

    return batch
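
When `map` is called with `batched=True`, the function receives a dict that maps each column name to a list of values, and a new column has to supply one value per input row. A minimal sketch of that contract on a plain dict, using a toy word-count function in place of the NER call (the name `add_word_count`, the column `n_words`, and the tweet IDs are hypothetical):

```python
def add_word_count(batch):
    """Toy batched function: adds one new column with one value per input row."""
    batch["n_words"] = [len(text.split()) for text in batch["text"]]
    return batch


# A batch maps each column name to a list of values of equal length.
toy_batch = {
    "tweet_id": ["ID_1", "ID_2"],
    "text": ["Flood warning for Alachua County", "Search continues in Mexico City"],
}

result = add_word_count(toy_batch)
print(result["n_words"])  # [5, 5]
```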

In [None]:
# Convert the DataFrame to a Hugging Face Dataset so the extraction can be applied with map
train_dataset = Dataset.from_pandas(train_df)

In [None]:
train_dataset = train_dataset.map(extract_location_batch, batched=True)

Map:   0%|          | 0/11849 [00:00<?, ? examples/s]

In [None]:
train_df = train_dataset.to_pandas()
train_df.head(3)

Unnamed: 0,tweet_id,text,location,predicted_locations
0,ID_1001136696589631488,"Flash floods struck a Maryland city on Sunday, washing out streets and tossing cars like bath toys.",Maryland,Maryland
1,ID_1001136950345109504,State of emergency declared for Maryland flooding: via @YouTube,Maryland,Maryland
2,ID_1001137334056833024,"Other parts of Maryland also saw significant damage from Sundays storms including this Baltimore city neighborhood, #Dundalk and #Catonsville. Rain totals spanned from 1 to 10 inches across Maryland: #ECFlood",Baltimore Maryland,Maryland Baltimore Dundalk Catonsville Maryland


## 4. Evaluation

In [None]:
# percentage missing values
print('percentage missing locations in labels:')
print(str((len(train_df[train_df['location']==''])/len(train_df))*100))

print('percentage missing locations in predictions:')
print(str((len(train_df[train_df['predicted_locations']==''])/len(train_df))*100))

percentage missing locations in labels:
0.0
percentage missing locations in predictions:
16.153261878639547
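
The percentage above is the share of rows whose prediction is the empty string. A minimal sketch on a made-up list (the helper name `percent_empty` and the toy values are ours), followed by the conversion of the reported percentage back into a row count, assuming the 11,849 rows shown in the `Map` progress line:

```python
def percent_empty(values):
    """Percentage of entries that are the empty string."""
    return 100 * sum(value == "" for value in values) / len(values)


toy_predictions = ["Maryland", "", "Athens Greece", "", "Matara Matara"]
print(percent_empty(toy_predictions))  # 40.0

# Reported share of empty predictions, converted back to a number of rows
print(round(11849 * 16.153261878639547 / 100))  # 1914
```

So roughly 1,914 tweets received no location prediction above the 0.8 score threshold, even though every one of them has a labelled location.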


In [None]:
# word error rate between the lowercased labels and predictions
wer_metric = evaluate.load("wer")
word_error_rate = wer_metric.compute(references=train_df['location'].str.lower().tolist(), predictions=train_df['predicted_locations'].str.lower().tolist())
print('word error rate: ' + str(word_error_rate))

word error rate: 0.5641190429108167
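
To help read this number, here is a small pure-Python sketch of the usual word error rate definition: word-level edit distance (substitutions, deletions, insertions) summed over all pairs and divided by the total number of reference words. It is an illustration under that assumption, not the code of the `evaluate` metric, and the function names are ours. The example pairs are lowercased label/prediction pairs taken from the rows shown in this notebook.

```python
def word_edit_distance(reference, prediction):
    """Minimum number of word substitutions, deletions and insertions."""
    ref, hyp = reference.split(), prediction.split()
    # previous[j] is the distance between the ref words seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def corpus_wer(references, predictions):
    """Total word errors divided by total reference words."""
    errors = sum(word_edit_distance(r, p) for r, p in zip(references, predictions))
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words


print(word_edit_distance("maryland", "maryland"))            # 0
print(word_edit_distance("greece athens", "athens greece"))  # 2
print(word_edit_distance("butte county", ""))                # 2
print(word_edit_distance("matara", "matara matara"))         # 1
```

Under this definition, word order and repetition count as errors: `athens greece` against the label `greece athens` contains both correct locations but scores two errors on two reference words. Part of the reported rate therefore reflects ordering and duplicate mentions rather than missed locations.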


In [None]:
train_df.sample(20)

Unnamed: 0,tweet_id,text,location,predicted_locations
9784,ID_907233421604397056,#Beaufort residents should remain indoors & off roadways & bridges. Wind gusts have reported at more than 60 miles. #Irma #sctweets #SCWX,Beaufort,Beaufort
2451,ID_1061384573370552320,"Map listing areas of fire and specific house damage throughout Malibu, Thousand Oaks, and everywhere else affected #woosleyfire #pointdume",Malibu Thousand Oaks,Malibu Thousand Oaks
11136,ID_912353188749217792,RT ReutersLive: LIVE: Search and rescue continues in Mexico City after earthquake,Mexico City,Mexico City
3173,ID_1067586036459601920,This #GivingTuesday we’re excited to announce our support of the Butte County Community and join fellow Brewers across the world by brewing #ResilienceIPA All proceeds will be donated to the @SierraNevada Camp Fire Relief Fund. Learn more: #ButteStrong,Butte County,
272,ID_1002650883979694080,Today begins the 2018 Hurricane Season for Maryland and our team is focused on supporting recovery efforts from two recent severe flooding events. @MDMEMA #partofthesolution,Maryland,Maryland
743,ID_1022481819684483072,Irishman missing in Athens wildfires confirmed dead #Greece #news,Greece Athens,Athens Greece
9156,ID_901684388282454016,Another 36 Sustainment Brigade @36thInfantryDiv @TXMilitary convoy headed to help Texans who are most in need #Harvey #TXARNG #AlwaysReady,Texans,TX
5857,ID_1177068053043060736,Islamic Relief Pakistan visited the affected areas of #Jattlan and responded in #Earthquake. Proud to be a part of Rapid Need Assesment. #ReachingtheUnreached,Pakistan,
8918,ID_870557998246227968,Hiru in Matara for giving donations for displaced person who are living in Matara #floodSL,Matara,Matara Matara
10021,ID_908498416212246528,21:11 : NWS-JAX has Continued a Flood Warning () for Alachua County until 09:11 PM. The,Alachua County,Alachua County


Running the pretrained BERT Named Entity Recognition model on the train dataset, with no fine-tuning, gave the following results:
- Tweets with no predicted location: 16.15%
- Word Error Rate: 0.564


