# Location Mention Recognition - BERT Named-Entity Recognition
This project develops an automated process for recognizing toponyms (place, area, and street names) in microblogging posts. The aim is to help authorities pinpoint the specific locations to which resources such as medical aid and food should be sent.

The microblogging data consists of Twitter (X) posts, on which a Location Mention Recognition (LMR) system is built.

0. Setup
1. Read Data
2. Clean Dataset
3. Classification using BERT NER
4. Evaluation

## 0. Setup


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# imports
import pandas as pd
import requests
from google.colab import userdata
import torch
from transformers import BertTokenizer, BertTokenizerFast, BertForTokenClassification, pipeline
import warnings
import re
import evaluate
from jiwer import wer
from parallel_pandas import ParallelPandas


In [None]:
if torch.cuda.is_available():
    print("GPU is available")
else:
    print("GPU is not available")

GPU is available


In [None]:
#initialize parallel-pandas
ParallelPandas.initialize(n_cpu=4, split_factor=4, disable_pr_bar=False)

In [None]:
# setup
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_colwidth', None)
warnings.filterwarnings('ignore')

## 1. Read Data

In [None]:
train_df = pd.read_csv('/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/data/Train.csv')
train_df.head(3)

Unnamed: 0,tweet_id,text,location
0,ID_1001136212718088192,,EllicottCity
1,ID_1001136696589631488,"Flash floods struck a Maryland city on Sunday, washing out streets and tossing cars like bath toys.",Maryland
2,ID_1001136950345109504,State of emergency declared for Maryland flooding: via @YouTube,Maryland


In [None]:
test_df = pd.read_csv('/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/data/Test.csv')
test_df.head(3)

Unnamed: 0,tweet_id,text
0,ID_1001154804658286592,"What is happening to the infrastructure in New England? It isnt global warming, its misappropriated funds being abused that shouldve been used maintaining their infrastructure that couldve protected them from floods! Like New Orleans. Their mayor went to #Maryland #floods"
1,ID_1001155505459486720,"SOLDER MISSING IN FLOOD.. PRAY FOR EDDISON HERMOND! PRAY FOR ELLICOTT CITY, MARYLAND! #PrayForEddisonHermond #PrayForEllicottCity"
2,ID_1001155756371136512,"RT @TIME: Police searching for missing person after devastating 1,000-year flood in Ellicott City, Maryland"


## 2. Clean Dataset

In [None]:
# drop NaN
train_df = train_df.dropna()

## 3. Classification using BERT Named-Entity Recognition

In [None]:
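# Load the tokenizer and a BERT model already fine-tuned for NER on CoNLL-2003.
# BertTokenizer is the slow (pure-Python) tokenizer, so no use_fast flag is needed.
# Note: the tokenizer is loaded from 'bert-base-cased' rather than from the model
# checkpoint; loading both from the same checkpoint would be the safer choice.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

model_ner = BertForTokenClassification.from_pretrained('dbmdz/bert-large-cased-finetuned-conll03-english').to('cuda')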

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
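# aggregation_strategy="simple" replaces the deprecated grouped_entities=True:
# consecutive tokens with the same entity type are merged into one entity group
bert_pipeline = pipeline("ner", model=model_ner, tokenizer=tokenizer, device=0, aggregation_strategy="simple")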

In [None]:
# extract only location entities from BERT NER results
def extract_location(text):
    # nothing to tag in an empty tweet
    if text == '':
        return ''

    result = bert_pipeline(text)
    element_location = ''

    # keep entity groups tagged LOC whose confidence score exceeds 0.8
    for dic_element in result:
        if dic_element['entity_group'] == 'LOC' and dic_element['score'] > 0.8:
            element_location = element_location + dic_element['word'] + ' '

    return element_location.rstrip()
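
To make the filtering rule concrete, here is a minimal sketch that applies the same `LOC` and score-threshold test to a hand-written list shaped like the grouped output of the `ner` pipeline, so it runs without loading the model. The helper name `locations_from_entities`, the `drop_fragments` option, and the mock scores are illustrative assumptions, not part of the notebook. Dropping words that begin with `##` is one possible way to keep stray word-piece fragments out of the predictions.

```python
def locations_from_entities(entities, min_score=0.8, drop_fragments=True):
    """Join the words of confident LOC entities into one space-separated string.

    `entities` is assumed to be a list of dicts with the keys 'entity_group',
    'score' and 'word', as returned by a grouped NER pipeline.
    """
    kept = []
    for ent in entities:
        # same rule as extract_location: LOC only, score strictly above the threshold
        if ent["entity_group"] != "LOC" or ent["score"] <= min_score:
            continue
        word = ent["word"]
        # hypothetical extra step: skip leftover word-piece fragments such as '##ray'
        if drop_fragments and word.startswith("##"):
            continue
        kept.append(word)
    return " ".join(kept)


# Hand-written mock output (scores are made up for illustration)
mock_entities = [
    {"entity_group": "LOC", "score": 0.99, "word": "Baltimore"},
    {"entity_group": "ORG", "score": 0.95, "word": "Red Cross"},
    {"entity_group": "LOC", "score": 0.55, "word": "Dundalk"},
    {"entity_group": "LOC", "score": 0.91, "word": "##ray"},
]

print(locations_from_entities(mock_entities))                        # Baltimore
print(locations_from_entities(mock_entities, drop_fragments=False))  # Baltimore ##ray
```

Raising or lowering `min_score` trades missed locations against spurious ones, so the 0.8 threshold is worth tuning against the labelled training data.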

In [None]:
# run NER on every tweet (takes about 4 minutes on the Colab GPU)
train_df['predicted_locations'] = train_df['text'].apply(extract_location)

In [None]:
train_df.head(3)

Unnamed: 0,tweet_id,text,location,predicted_locations
1,ID_1001136696589631488,"Flash floods struck a Maryland city on Sunday, washing out streets and tossing cars like bath toys.",Maryland,Maryland
2,ID_1001136950345109504,State of emergency declared for Maryland flooding: via @YouTube,Maryland,Maryland
3,ID_1001137334056833024,"Other parts of Maryland also saw significant damage from Sundays storms including this Baltimore city neighborhood, #Dundalk and #Catonsville. Rain totals spanned from 1 to 10 inches across Maryland: #ECFlood",Baltimore Maryland,Maryland Baltimore Dundalk Catonsville Maryland


## 4. Evaluation

In [None]:
print('percentage missing locations in labels:')
print(str((len(train_df[train_df['location']==''])/len(train_df))*100))

print('percentage missing locations in predictions:')
print(str((len(train_df[train_df['predicted_locations']==''])/len(train_df))*100))

percentage missing locations in labels:
0.0
percentage missing locations in predictions:
16.153261878639547


In [None]:
# named wer_metric so it does not shadow the `wer` function imported from jiwer
wer_metric = evaluate.load("wer")
word_error_rate = wer_metric.compute(references=train_df['location'].tolist(), predictions=train_df['predicted_locations'].tolist())
print('word error rate: ' + str(word_error_rate))

word error rate: 0.5662942456001582
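
For reference, word error rate counts the word-level substitutions, deletions, and insertions needed to turn the reference into the prediction, divided by the number of reference words. It is therefore sensitive to word order and can exceed 1. The pure-Python sketch below is an illustrative re-implementation under that definition; the function names are hypothetical and it is not the code that `evaluate` runs. It reuses the Baltimore row from the training output as a worked example.

```python
def word_edit_distance(reference, prediction):
    """Word-level Levenshtein distance between two whitespace-separated strings."""
    ref = reference.split()
    hyp = prediction.split()
    # prev[j] = edits needed to turn the ref words seen so far into the first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a predicted word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(references, predictions):
    """Total edits over all pairs divided by the total number of reference words."""
    total_words = sum(len(r.split()) for r in references)
    if total_words == 0:
        raise ValueError("references contain no words")
    errors = sum(word_edit_distance(r, p) for r, p in zip(references, predictions))
    return errors / total_words


# Label 'Baltimore Maryland' vs. a prediction with three extra words:
# 3 insertions / 2 reference words = 1.5
print(corpus_wer(
    ["Baltimore Maryland"],
    ["Maryland Baltimore Dundalk Catonsville Maryland"],
))
```

Because every extra or reordered word counts as an error, a prediction that contains all the correct locations plus a few additional ones can still score worse than an empty prediction.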


In [None]:
train_df.sample(20)

Unnamed: 0,tweet_id,text,location,predicted_locations
45785,ID_870318387712319488,RT @ColomboPageNews: #Australia hands over flood aid to #SriLanka navy #srilanka #FloodSL @AusHCSriLanka .,Australia srilanka SriLanka,Colombo Australia
69839,ID_913345638447886336,#HurricaneMaria victims from #PuertoRico arrive in #NJ #njmorningshow @News12NJ,PuertoRico NJ,
39610,ID_728961497179815936,"Celebs, PM Trudeau Offer Support For Those Affected By Fort McMurray Wildfire Crisis",Fort McMurray,##ray
65666,ID_910717683544592384,Dozens of children and adults are still missing after a school collapsed in Mexico following a 7.1 magnitude earthquake:,Mexico,Mexico
46041,ID_874589898350743552,#SriLanka was hit by the worst flood since 2003. Support @hfhslorg disaster response & rebuilding efforts at:,SriLanka,SriLanka
41766,ID_768760370177708032,Volunteers distributing food near #Amatrice #ItalyEarthquake #terremoto,Amatrice,Italy
71062,ID_913928111066157056,Help Hurricane Maria Puerto Rico victims: Drop off needed supplies at HORC or donate to the American Red Cross through the Towns website.,Puerto Rico,Puerto Rico
1562,ID_1022584826828935168,#Greece wildfire that killed 82 people was started by arson,Greece,Greece
9241,ID_1036242305731051520,"A time-lapse video of volunteers in Kochi loading back-to-home-kits, containing food and other essential items, to be distributed in flood-affected houses in the suburbs tonight. #KeralaFloods",Kochi,Kochi
13069,ID_1040712364788785152,"HURRICANE YETI IS MAKING LANDFALL TONIGHT AT 9:30 in Belmont, NC! Friends Sports Bar is the place to be tonight! We got Beer, Food, Liquor and Wine! All the Hurricane Prep stuff you need.",Belmont NC,Belmont NC


Applying the pretrained BERT Named-Entity Recognition model, without any fine-tuning, gave the following results on the training set:
- Tweets with no predicted location: 16.15%
- Word Error Rate: 0.566 (about 56.6%)

The next step is to fine-tune the BERT token-classification model on this dataset.



