# Location Mention Recognition - Result Analysis
To extract location mentions from a given text, two methods have been explored. The first uses an off-the-shelf BERT Named-Entity Recognition (NER) model and keeps only its location entities; the second fine-tunes a pretrained BERT model on the given dataset.

The results of the two methods are evaluated and analysed below.

0. Setup
1. Read Data
2. BERT NER Predictions
3. Fine-Tuned BERT Predictions
4. Result Analysis

## 0. Setup


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# imports
import numpy as np
import pandas as pd
import requests
from google.colab import userdata
import torch
from transformers import BertTokenizer, BertTokenizerFast, AutoTokenizer, BertForTokenClassification, pipeline, BertModel
import warnings
import re
from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup
from tqdm import tqdm
from torch import nn
import evaluate
import matplotlib.pyplot as plt
from jiwer import wer
import plotly.graph_objs as go
from datasets import Dataset

In [None]:
if torch.cuda.is_available():
    print("GPU is available")
else:
    print("GPU is not available")
device = 'cuda' if torch.cuda.is_available() else 'cpu'

GPU is available


In [None]:
# setup
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_colwidth', None)
warnings.filterwarnings('ignore')

## 1. Read Data

In [None]:
# load data which wasn't used for training nor validation
data_df = torch.load('/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/data/lmr_portfolio_split/test_split_df.pkl')

In [None]:
# reset index
data_df.reset_index(drop=True, inplace=True)
data_df.drop(['tweet_id'], axis=1, inplace=True)
data_df.head()

Unnamed: 0,tweet_id,text,location
0,ID_911783155463372800,Wife of @StephenCurry30 responds to Trump’s attacks by asking him to donate to Mexico earthquake victims. That’s how you use the limelight!,Mexico
1,ID_1031091729577988096,"Kerala needs your help, we are sending medicines for people affected by flood. If you want to help, please send them to our office: You can send: OTC medicines, sanitary pads and dry food items. Let’s do our bit 🙏🏽 #KeralaFloods #KeralaFloodRelief",Kerala
2,ID_722162800471248896,@TheEllenShow Ecuador needs the help of everyone. Please!,Ecuador
3,ID_912411687474667520,RT @DemSocialists: Please support this solidarity relief fund for Puerto Ricos most vulnerable communities after the hurricanes .,Puerto Ricos
4,ID_911740845535170560,"RT @CavaliersViews: #ClevelandCavaliers #ALLinCLE #AllforOne Hurricane Marias death toll, crisis grow in Puerto Rico",Puerto Rico


## 2. BERT NER Predictions

In [None]:
# Load pre-trained BERT model and tokenizer
ner_tokenizer = BertTokenizer.from_pretrained('bert-base-cased', use_fast=False)
model_ner = BertForTokenClassification.from_pretrained('dbmdz/bert-large-cased-finetuned-conll03-english').to('cuda')

Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
bert_pipeline = pipeline("ner", model=model_ner, tokenizer=ner_tokenizer, device=0, grouped_entities=True)

In [None]:
# extract only location entities from BERT NER results
def extract_location(text):
    if text == '':
        return ''

    result = bert_pipeline(text)
    element_location = ''

    # keep entities tagged as locations with a confidence score above 0.8
    for dic_element in result:
        if dic_element['entity_group'] == 'LOC' and dic_element['score'] > 0.8:
            element_location = element_location + dic_element['word'] + ' '

    return element_location.rstrip()

In [None]:
# wrapper for Dataset.map: adds the NER prediction to a single example
def extract_location_batch(batch):
    batch['NER_predicted_locations'] = extract_location(batch['text'])
    return batch

In [None]:
# Convert DataFrame to Hugging Face Dataset so the extraction can be applied with Dataset.map
dataset = Dataset.from_pandas(data_df)

In [None]:
dataset = dataset.map(extract_location_batch)

Map:   0%|          | 0/2370 [00:00<?, ? examples/s]

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
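
The warning above appears because the pipeline is called once per row. A minimal sketch of batching follows, assuming `ner` is any callable that takes a list of strings and returns one list of entity dicts per string. Hugging Face pipelines accept a list of texts and a `batch_size` argument, but the shape returned for a single-element list has varied between versions, so this should be checked against the installed one. `extract_locations_batched` and `fake_ner` are hypothetical names used only for illustration.

```python
def extract_locations_batched(texts, ner, batch_size=32, threshold=0.8):
    """Return one space-joined location string per input text."""
    results = [''] * len(texts)
    # Skip empty texts, mirroring extract_location above
    indexed = [(i, t) for i, t in enumerate(texts) if t != '']
    for start in range(0, len(indexed), batch_size):
        chunk = indexed[start:start + batch_size]
        entity_lists = ner([t for _, t in chunk])
        for (i, _), entities in zip(chunk, entity_lists):
            words = [e['word'] for e in entities
                     if e['entity_group'] == 'LOC' and e['score'] > threshold]
            results[i] = ' '.join(words)
    return results


def fake_ner(batch):
    # Stand-in for the real pipeline: tags the word "Mexico" as a location
    return [[{'entity_group': 'LOC', 'score': 0.99, 'word': 'Mexico'}]
            if 'Mexico' in text else [] for text in batch]


print(extract_locations_batched(['Help Mexico', '', 'No place here'], fake_ner, batch_size=2))
```

With the real pipeline, `ner` could be something like `lambda batch: bert_pipeline(batch, batch_size=32)`; whether this is faster depends on the GPU and on the tweet lengths.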


In [None]:
data_df = dataset.to_pandas()
data_df.head(3)

Unnamed: 0,tweet_id,text,location,NER_predicted_locations
0,ID_911783155463372800,Wife of @StephenCurry30 responds to Trump’s attacks by asking him to donate to Mexico earthquake victims. That’s how you use the limelight!,Mexico,Mexico
1,ID_1031091729577988096,"Kerala needs your help, we are sending medicines for people affected by flood. If you want to help, please send them to our office: You can send: OTC medicines, sanitary pads and dry food items. Let’s do our bit 🙏🏽 #KeralaFloods #KeralaFloodRelief",Kerala,Kerala
2,ID_722162800471248896,@TheEllenShow Ecuador needs the help of everyone. Please!,Ecuador,Ecuador


## 3. Fine-Tuned BERT Predictions


In [None]:
# load fine-tuned model
fine_tuned_bert = torch.load('/content/drive/MyDrive/data_challenges/zindi_microsoft_LMR_challenge/models/model_lmr_portfolio_1.pkl')

In [None]:
fine_tuned_bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

In [None]:
def predict_locations(model, dataloader):
  # The fine-tuned model classifies every token: 1 = part of a location, 0 = not
  model.eval()  # Set the model to evaluation mode

  # Move the model to the device
  model = model.to(device)

  # Special tokens as strings, so that they can be stripped from the decoded predictions
  cls_token = fine_tuned_bert_tokenizer.decode(fine_tuned_bert_tokenizer.cls_token_id).replace(' ', '')
  pad_token = fine_tuned_bert_tokenizer.decode(fine_tuned_bert_tokenizer.pad_token_id).replace(' ', '')
  sep_token = fine_tuned_bert_tokenizer.decode(fine_tuned_bert_tokenizer.sep_token_id).replace(' ', '')

  predicted_locations = []

  # Make predictions (forward pass)
  with torch.no_grad():  # Disable gradient calculations for inference
    for batch in tqdm(dataloader, desc="Predicting"):
      batch = {k: v.to(device) for k, v in batch.items()}
      logits = model(**batch).logits

      # argmax over the class dimension (same result on logits as on softmax probabilities)
      batch_predicted_classes = torch.argmax(logits, dim=-1).cpu().numpy()
      input_ids = batch['input_ids'].cpu().numpy()

      # Decode only the tokens predicted as class 1, in the order they appear in the text
      batch_predicted_locations = [
          fine_tuned_bert_tokenizer.decode([token_id for token_id, label in zip(ids, classes) if label == 1])
          for ids, classes in zip(input_ids, batch_predicted_classes)
      ]
      batch_predicted_locations = [
          string.replace(cls_token, '').replace(pad_token, '').replace(sep_token, '').strip()
          for string in batch_predicted_locations
      ]

      predicted_locations.extend(batch_predicted_locations)

  return predicted_locations

In [None]:
# return a binary array of the same length of the tokenized text
# 1s correspond to the tokens which are present in the tokenized location
def get_binary_location_labels(tokenized_text, tokenized_location):

  binary_location_labels = [0]*len(tokenized_text)

  for location_i in range(len(tokenized_location)):
    for text_j in range(len(tokenized_text)):
      if tokenized_text[text_j]==0:
        binary_location_labels[text_j]=-100
      elif tokenized_text[text_j] == tokenized_location[location_i]:
        binary_location_labels[text_j]=1

  return binary_location_labels

# tokenize text
# create binary label for text
def tokenize_and_label(text, location=None):

  tokenized_text_dict = fine_tuned_bert_tokenizer(text, truncation=True, padding='max_length', max_length=450)

  if location is not None:
    tokenized_location_dict = fine_tuned_bert_tokenizer(location, truncation=True, padding='max_length', max_length=450)
    tokenized_text_dict['labels']=get_binary_location_labels(tokenized_text_dict['input_ids'], tokenized_location_dict['input_ids'])

  tokenized_text_dict['input_ids'] = torch.tensor(tokenized_text_dict['input_ids']).squeeze(0)
  tokenized_text_dict['attention_mask'] = torch.tensor(tokenized_text_dict['attention_mask']).squeeze(0)
  tokenized_text_dict['token_type_ids'] = torch.tensor(tokenized_text_dict['token_type_ids']).squeeze(0)

  if location is not None:
    tokenized_text_dict['labels'] = torch.tensor(tokenized_text_dict['labels']).squeeze(0)

  return tokenized_text_dict
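
A toy run of the labelling rule above may help when reading the predictions later on. The sketch below is a condensed re-implementation (hypothetical name `label_tokens`) that is intended to behave like `get_binary_location_labels` for padded inputs. The ids 101, 102 and 0 are the `[CLS]`, `[SEP]` and `[PAD]` ids of `bert-base-uncased`; the ids 7 and 8 are made-up stand-ins for "kerala" and some other word.

```python
def label_tokens(text_ids, location_ids, pad_id=0):
    """1 if the text token also occurs in the tokenized location, -100 for padding, else 0."""
    location_set = {t for t in location_ids if t != pad_id}
    return [-100 if t == pad_id else int(t in location_set) for t in text_ids]


CLS, SEP, PAD = 101, 102, 0   # special token ids of bert-base-uncased
KERALA, OTHER = 7, 8          # made-up ids, for illustration only

text_ids = [CLS, KERALA, OTHER, KERALA, SEP, PAD, PAD]
location_ids = [CLS, KERALA, SEP, PAD, PAD, PAD, PAD]

print(label_tokens(text_ids, location_ids))
# [1, 1, 0, 1, 1, -100, -100]
```

Two things stand out in this toy case: every occurrence of a location token is labelled 1, and `[CLS]` and `[SEP]` are labelled 1 as well because the tokenized location contains them. If the model was trained on labels built this way, that would be consistent with `predict_locations` having to strip special tokens and with repeated words in its output.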

In [None]:
# tokenize data
tokenized_dataset = data_df.apply(lambda x: tokenize_and_label(x['text']), axis=1) # 1 min

In [None]:
# load dataset
dataset_dataloader = DataLoader(tokenized_dataset, batch_size=16)

In [None]:
fine_tuned_bert_predicted_locations = predict_locations(fine_tuned_bert, dataset_dataloader)

Predicting: 100%|██████████| 149/149 [01:04<00:00,  2.30it/s]


In [None]:
data_df['fine_tuned_BERT_predicted_locations'] = fine_tuned_bert_predicted_locations

In [None]:
data_df.head(3)

Unnamed: 0,tweet_id,text,location,NER_predicted_locations,fine_tuned_BERT_predicted_locations
0,ID_911783155463372800,Wife of @StephenCurry30 responds to Trump’s attacks by asking him to donate to Mexico earthquake victims. That’s how you use the limelight!,Mexico,Mexico,mexico
1,ID_1031091729577988096,"Kerala needs your help, we are sending medicines for people affected by flood. If you want to help, please send them to our office: You can send: OTC medicines, sanitary pads and dry food items. Let’s do our bit 🙏🏽 #KeralaFloods #KeralaFloodRelief",Kerala,Kerala,kerala kerala kerala
2,ID_722162800471248896,@TheEllenShow Ecuador needs the help of everyone. Please!,Ecuador,Ecuador,ecuador


## 4. Result Analysis


**Percentage missing values**

In [None]:
print('percentage missing locations in labels:')
print(str((len(data_df[data_df['location']==''])/len(data_df))*100) + '%')

print('percentage missing locations in NER predictions:')
print(str((len(data_df[data_df['NER_predicted_locations']==''])/len(data_df))*100) + '%')

print('percentage missing locations in fine-tuned BERT predictions:')
print(str((len(data_df[data_df['fine_tuned_BERT_predicted_locations']==''])/len(data_df))*100) + '%')

percentage missing locations in labels:
0.0%
percentage missing locations in NER predictions:
15.949367088607595%
percentage missing locations in fine-tuned BERT predictions:
0.25316455696202533%


**Word Error Rates**

In [None]:
wer = evaluate.load("wer")
ner_wer = wer.compute(references=data_df['location'].apply(str.lower).tolist(), predictions=data_df['NER_predicted_locations'].apply(str.lower).tolist())
print('word error rate NER predicted locations: ' + str(ner_wer))

fine_tuned_bert_wer = wer.compute(references=data_df['location'].apply(str.lower).tolist(), predictions=data_df['fine_tuned_BERT_predicted_locations'].apply(str.lower).tolist())
print('word error rate fine-tuned BERT predicted locations: ' + str(fine_tuned_bert_wer))

word error rate NER predicted locations: 0.5636638147961751
word error rate fine-tuned BERT predicted locations: 0.5135883241066935


**Result Summary**

|  | BERT NER | BERT Fine-tuned |
|----------|----------|----------|
| % missing values    | 15.949   | 0.253   |
| word error rate    | 0.564   | 0.514   |

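At first sight this is odd: the NER model leaves 15.9% of the tweets without a prediction, yet its word error rate is close to that of the fine-tuned model. An empty prediction costs one deletion per reference word, so such a large share of missing predictions would be expected to push the error rate well above that of the fine-tuned model.

This is investigated below.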

In [None]:
# calculating individual word error rates
data_df['ner_wer'] = data_df.apply(lambda x: wer.compute(references=[x['location'].lower()], predictions=[x['NER_predicted_locations'].lower()]), axis=1)
data_df['fine_tuned_wer'] = data_df.apply(lambda x: wer.compute(references=[x['location'].lower()], predictions=[x['fine_tuned_BERT_predicted_locations'].lower()]), axis=1)
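
For reading the per-row values, it helps to recall that word error rate is the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words, so it can exceed 1 when the prediction contains extra words. The function below is a plain-Python sketch of that definition (hypothetical name `word_error_rate`), not the implementation used by `evaluate`.

```python
def word_error_rate(reference, prediction):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), prediction.split()
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    # Assumes a non-empty reference (no label is empty in this dataset)
    return prev[-1] / len(ref)


print(word_error_rate('kerala', 'kerala kerala kerala'))    # two insertions over one word
print(word_error_rate('utuado puerto rico', 'puerto rico'))  # one deletion over three words
print(word_error_rate('fort mcmurray', ''))                  # empty prediction
```

An empty prediction therefore always scores 1.0, while a prediction with repeated words can score higher than that.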

**Missing predictions for NER**

In [None]:
data_df[data_df['NER_predicted_locations']==''].head(10)

Unnamed: 0,text,location,NER_predicted_locations,fine_tuned_BERT_predicted_locations,ner_wer,fine_tuned_wer
5,Hundreds of #Marlborough homes may face a second night without power: #nzearthquake #eqnz #NewZealand,Marlborough NewZealand,,marlborough newzealand,1.0,0.0
15,Continue to support disaster relief in Fort McMurray. Text FIRES to 45678 to donate $10 towards relief efforts.,Fort McMurray,,fort mcmurray,1.0,0.0
18,Lots of incredible stories out there of selfless response - adding one more. This one is of Seva Bharti volunteers going out there with large tyres to keep afloat and help people. Not trained for this but doing what they can. #KeralaFloods,Seva Bharti,,b,1.0,1.0
68,#HurricaneFlorence Help Needed. Ponies in neck deep water Greenevers #NC @reddogsusie @Gdad1 @chortletown @leighjalland @Indigo_Pho13 @RuthDBourdet @ruthmen @WinglessBird_ @BadBoyEM @msmorgan1968 @Janetlynne211 @samjarvis49 @Freedom4Horses @jr3597,NC,,nc,1.0,0.0
75,RT @Chirand33986739: Chimanimani needs your help #CycloneIdaiZW,Chimanimani,,chimanimani,1.0,0.0
76,RT @phunphunphun: It says a lot about the people of #Attawapiskat - raising money to help #ymmfire Great kindness & consideration. <3,Attawapiskat,,attawapiskat,1.0,0.0
77,17 Ray Morgan Company employees and their families have lost their homes in the Northern California Camp Fire. Donate to the GoFundMe to help these employees: via @aslawetsky,California,,california,1.0,0.0
82,@SenSanders @SenatorCollins @RepBarbaraLee @RepMaxineWaters @SenSchumer please help 🇵🇷 #PuertoRico now. #RepealJonesAct #HurricaneMaria,PuertoRico,,puertorico,1.0,0.0
90,#HurricaneMaria victims from #PuertoRico arrive in #NJ #njmorningshow @News12NJ,PuertoRico NJ,,puertorico nj,1.0,0.0
98,Canadian communities mobilise to help wildfire victims,Canadian,,canadian,1.0,0.0


In [None]:
data_df[data_df['NER_predicted_locations']!=''].head(10)

Unnamed: 0,text,location,NER_predicted_locations,fine_tuned_BERT_predicted_locations,ner_wer,fine_tuned_wer
0,Wife of @StephenCurry30 responds to Trump’s attacks by asking him to donate to Mexico earthquake victims. That’s how you use the limelight!,Mexico,Mexico,mexico,0.0,0.0
1,"Kerala needs your help, we are sending medicines for people affected by flood. If you want to help, please send them to our office: You can send: OTC medicines, sanitary pads and dry food items. Let’s do our bit 🙏🏽 #KeralaFloods #KeralaFloodRelief",Kerala,Kerala,kerala kerala kerala,0.0,2.0
2,@TheEllenShow Ecuador needs the help of everyone. Please!,Ecuador,Ecuador,ecuador,0.0,0.0
3,RT @DemSocialists: Please support this solidarity relief fund for Puerto Ricos most vulnerable communities after the hurricanes .,Puerto Ricos,Puerto Rico,puerto ricos,0.5,0.0
4,"RT @CavaliersViews: #ClevelandCavaliers #ALLinCLE #AllforOne Hurricane Marias death toll, crisis grow in Puerto Rico",Puerto Rico,Puerto Rico,puerto rico,0.0,0.0
6,"My hometown, Utuado, Puerto Rico was destroyed by Hurricane Maria. Please help. Donate what you can, thanks.",Utuado Puerto Rico,Utuado Puerto Rico,puerto rico,0.0,0.333333
7,"@kobebryant you were going to help for a good cause in Puerto Rico w/ The Autism Training Center, but we need help now with Hurricane Maria.",Puerto Rico,Puerto Rico,puerto rico,0.0,0.0
8,RT @CBSNewYork: Hurricane Maria regains strength back to Category 3 storm after leaving path of destruction in Puerto Rico,Puerto Rico,Puerto Rico,puerto rico,0.0,0.0
9,"RT @breakingstorm: Embassy of Haiti in Washington, DC, confirms 5 Hurricane Matthew-related deaths on island",Washington Haiti,Haiti Washington DC,haiti washington dc,1.0,1.0
10,"Unity is strength, when there is teamwork and collaboration, Wonderful things can be achieved #Zimbabwe Warriors making a plea to the nation to keep on helping the Victims of #CycloneIdai @FIFPro @taytbells @CastleLagerPSL @mikemadoda @kadewere44 @AlecMudimu @FIFProAfrica",Zimbabwe,F,zimbabwe,1.0,0.0


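**Observations**

No single cause for the missing NER predictions is evident from these samples. Several of the missed locations appear only inside hashtags (`#Marlborough`, `#PuertoRico`, `#NC`), but others are written in plain text (`Fort McMurray`, `California`). Possible contributors are the 0.8 score threshold and entities being tagged with a type other than `LOC`; neither has been verified here.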
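
One way to narrow down the empty NER predictions would be to look at the raw pipeline output for those rows and tally why nothing passed the filter. The sketch below assumes the entity dicts have the `entity_group`, `score` and `word` keys used in `extract_location`; the function name `diagnose_empty_prediction` and the sample entities are hypothetical and made up for illustration.

```python
from collections import Counter


def diagnose_empty_prediction(entities, threshold=0.8):
    """Say why the LOC filter with this score threshold yields no location."""
    if not entities:
        return 'no entities found'
    locations = [e for e in entities if e['entity_group'] == 'LOC']
    if not locations:
        return 'entities found, none tagged LOC'
    if all(e['score'] <= threshold for e in locations):
        return 'LOC found, score not above threshold'
    return 'not empty'


# Hypothetical pipeline outputs, made up for illustration only
samples = [
    [],
    [{'entity_group': 'ORG', 'score': 0.95, 'word': 'Fort McMurray'}],
    [{'entity_group': 'LOC', 'score': 0.62, 'word': 'California'}],
    [{'entity_group': 'LOC', 'score': 0.99, 'word': 'Mexico'}],
]
reasons = Counter(diagnose_empty_prediction(s) for s in samples)
print(reasons)
```

On the real data, the same function could be applied to `bert_pipeline(text)` for each row whose `NER_predicted_locations` is empty.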


**Missing locations for fine-tuned model**

In [None]:
len(data_df[data_df['fine_tuned_BERT_predicted_locations']==''])

6

In [None]:
data_df[data_df['fine_tuned_BERT_predicted_locations']==''].head(10)

Unnamed: 0,text,location,NER_predicted_locations,fine_tuned_BERT_predicted_locations,ner_wer,fine_tuned_wer
642,Winds are strengthening as storm force gales and heavy rain expected in quake-affected areas. Updates: #eqnz,gales,,,1.0,1.0
834,Still we need international help!! #EcuadorEarthquake @CNNEE @BBC @TIME @CancilleriaEc @Seguridad_Ec @Riesgos_Ec,EcuadorEarthquake,Riesgos,,1.0,1.0
1088,@MatutuLewis @edmnangagwa Those came through Red Cross #redcross #cycloneIDAI,redcross,,,1.0,1.0
1571,"Often in any disaster situation, the lack of food and the access to basic essentials is the most pressing problem. Throughout our relief operations during the #Keralaflood, Sewa volunteers made sure that something as",Keralaflood,,,1.0,1.0
1590,You can help victims of Hurricane Irma and Hurricane Harvey by donating to the #RedCross.,Hurricane Harvey,,,1.0,1.0
2238,"Vice President Constantino Chiwenga commends the military, Gvt Departments, citizens, corporate world and development partners who have been on the front lines and have been leading in the rescue, recovery, repair and rehabilitation efforts in #CycloneIdai hit areas.",Chiwenga,,,1.0,1.0


**Observations**
- The fine-tuned model returns an empty prediction for only 6 samples.
- In all 6 samples the label itself is incorrect or questionable (for example `gales`, `redcross` or `Chiwenga`), so an empty prediction is arguably not a miss.

**WER higher for NER than fine-tuned BERT**





In [None]:
# ignore missing predictions
# they have already been analysed
data_df = data_df[data_df['fine_tuned_BERT_predicted_locations']!='']
data_df = data_df[data_df['NER_predicted_locations']!='']

In [None]:
cols = ['text', 'location', 'NER_predicted_locations', 'fine_tuned_BERT_predicted_locations']

In [None]:
data_df[data_df['ner_wer']>data_df['fine_tuned_wer']][cols].head(30)

Unnamed: 0,text,location,NER_predicted_locations,fine_tuned_BERT_predicted_locations
3,RT @DemSocialists: Please support this solidarity relief fund for Puerto Ricos most vulnerable communities after the hurricanes .,Puerto Ricos,Puerto Rico,puerto ricos
10,"Unity is strength, when there is teamwork and collaboration, Wonderful things can be achieved #Zimbabwe Warriors making a plea to the nation to keep on helping the Victims of #CycloneIdai @FIFPro @taytbells @CastleLagerPSL @mikemadoda @kadewere44 @AlecMudimu @FIFProAfrica",Zimbabwe,F,zimbabwe
13,@billiejoe If any of you live in Texas- I am helping host a meet in Hurst for Hurricane Harvey relief !,Texas-,Texas Hurst,texas
20,"NBC News: Six Dead at Florida Nursing Home After Irma Cut Power This is really sad, RIP.",Florida,Florida Nursing Home,florida
22,"Hurricane Matthew pummels Haiti and Cuba, evacuations ordered in US: Port-au-Prince: Hurricane Matthew pummel",Haiti US Cuba,Haiti Cuba US Port - au - Prince,haiti cuba us
25,"#eqnz UPDATE: New Zealand earthquake raised to 7.8 magnitude, first tsunami waves detected and fears for kaikoura",New Zealand kaikoura,New Zealand,new zealand kaikoura
29,"In adlux conv centre food n water is available.but no1 approached till nw .plz pass it to all rescue camps near by Adlux Convention Center Angamaly, Cable Junction, Ernakulam,National Highway 47, Karukutty, 683576 04842612527 #KeralaFloods",Adlux Convention Center Angamaly Karukutty Ernakulam,Center Angamaly Cable Junction Ernakulam Karukutty,adlux adlux angamaly ernakulam karukutty
33,"Troops have reached to the affected areas of #Mirpur, #Jatlan and #Jarikas. Relief and rescue operations are underway. #ISPR #earthquake #Pakistan",Mirpur Jarikas Jatlan Pakistan,# Mirpur Jatlan Jarikas Pakistan,mirpur jatlan jarikas pakistan
37,"All of these major disaster are happening because that Nibiru system is near and it has a strong thresh hold on planet Earth. ⚡️ Heavy rain and flash flooding devastate Ellicott City, Maryland”",Ellicott City Maryland,Earth Ellicott City Maryland,ellicott city maryland
42,"Gov. Brown, fire-besieged California hit back at Trump over blame tweet. Governor Brown also asks White House for disaster relief funds for the state. The death toll is now at 25. #CaliforniaFires #VeteransDay #HippysResist",California,California White House,california california


In [None]:
data_df['location'].apply(lambda x: '#' in x).any()

False

**Observations**
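- The NER results make sense and are often more precise than the labels (for example `Texas Hurst` where the label is `Texas-`).
- A `#` from a hashtag in the text sometimes remains in the NER results (for example `# Mirpur Jatlan Jarikas Pakistan`), whereas the labels never contain one.
- `#` characters can also be present as subword markers.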



**WER higher for fine-tuned BERT than NER**

In [None]:
data_df[data_df['ner_wer']<data_df['fine_tuned_wer']][cols].head(30)

Unnamed: 0,text,location,NER_predicted_locations,fine_tuned_BERT_predicted_locations
1,"Kerala needs your help, we are sending medicines for people affected by flood. If you want to help, please send them to our office: You can send: OTC medicines, sanitary pads and dry food items. Let’s do our bit 🙏🏽 #KeralaFloods #KeralaFloodRelief",Kerala,Kerala,kerala kerala kerala
6,"My hometown, Utuado, Puerto Rico was destroyed by Hurricane Maria. Please help. Donate what you can, thanks.",Utuado Puerto Rico,Utuado Puerto Rico,puerto rico
11,RT @globaltimesnews: Death toll rises to 233 and at least 500 others injured in Ecuador’s 7.8-magnitude #EcuadorEarthquake,Ecuador,Ecuador,ecuador ecuador
21,"RT @kala_cw: Urgent If someone can provide a vehicle to send some donations to Ratnapura, Please call Saranga 0713589423. #lka #FloodSL",Ratnapura,Ratnapura,ratnapura saranga
32,RT @LaibaDOTpk: PNS Zulfiquar reaches flood-hit Sri Lanka with relief goods #disasterrelief #floods #Latest #Pakistan #PakistanNavy #srila,Sri Lanka Pakistan,Sri Lanka Pakistan Pakistan,sri lanka pakistan pakistan srila
39,Now you can donate to Karnataka Chief Ministers Calamity Relief Fund through Paytm also. Do make your contributions & spread the word. Lets help Karnataka Govt to rebuild the life of flood affected people in Kodagu. #IndiaForKodagu #KodaguFloodRelief #KodaguFloods,Karnataka Kodagu,Karnataka Kodagu,karnataka karnataka kodagukodagu kodagu kodagu
44,#BREAKING: At least 1 person reported dead as a result of Hurricane Dorian. 7-year-old Lachino Mcintosh drowned while trying to evacuate with his family in The Bahamas. @OANN,The Bahamas,The Bahamas,bahamas
45,"Delhi Govt to open donation centres at all SDM Offices in Delhi. People are requested to donate clothes, blankets and bed sheets. Delhi Govt to send water bottles, biscuits & dry food packets in bulk to Kerala.",Delhi Kerala,Delhi Kerala,delhi delhi delhi kerala
53,"RT @DFID_UK: Every £ you donate will be matched by the UK government, through #UKAidMatch #UKaid #CycloneIdai",UK,UK UK,uk uk uk uk
57,"My father died my brother was injured we have no house to stay but I am happy to be assisting my community that was affected. Despite suffering tragedy and loss as a result of #CycloneIdai, Wadzanai is volunteering to help others impacted by the storm in #Zimbabwe.",Zimbabwe,Zimbabwe,wa zimbabwe


**Observations**
- Most of the fine-tuned results repeat a word once for each time it occurs in the text (for example `kerala kerala kerala`).
- The words in the fine-tuned predictions follow the order in which they appear in the text, which is not always the case for the labels.
- The fine-tuned results sometimes also contain words that do not refer to a location (for example `saranga` or `wa`).
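
Because word error rate is sensitive to word order, a label such as `Washington Haiti` scores 1.0 against the prediction `haiti washington dc` even though both location words were found. As a complementary view, an order-insensitive score can be computed from word counts. The sketch below is one such option (a bag-of-words F1, hypothetical name `bag_of_words_f1`); it is not a metric used elsewhere in this notebook.

```python
from collections import Counter


def bag_of_words_f1(reference, prediction):
    """F1 over word counts; ignores word order but still penalises extra words."""
    ref = Counter(reference.lower().split())
    hyp = Counter(prediction.lower().split())
    overlap = sum((ref & hyp).values())  # words matched, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


print(bag_of_words_f1('Washington Haiti', 'haiti washington dc'))
print(bag_of_words_f1('Kerala', 'kerala kerala kerala'))
```

Repeated words still lower the score through precision, so this view separates the ordering effect from the repetition effect rather than hiding either.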

**Removing hashtags from BERT NER:**

In [None]:
def remove_hashtags(words):
  return words.replace('#', '')

In [None]:
predictions=data_df['NER_predicted_locations'].apply(remove_hashtags).apply(str.lower)

ner_no_hashtags_wer = wer.compute(references=data_df['location'].apply(str.lower).tolist(), predictions=predictions.tolist())
print('word error rate: NER predicted locations, no hashtags: ' + str(ner_no_hashtags_wer))

word error rate: NER predicted locations, no hashtags: 0.49391657010428736
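
`remove_hashtags` deletes every `#`, which leaves subword pieces as separate words. If subword markers show up as `##`-prefixed pieces (an assumption about the pipeline output that should be checked on the real predictions), a variant could glue those pieces back together first. `clean_ner_words` is a hypothetical name, and the `Ka ##iko ##ura` string is a made-up example, not taken from the results.

```python
import re


def clean_ner_words(words):
    # Glue WordPiece continuation pieces ("Ka ##iko") back onto the previous piece
    joined = re.sub(r'\s*##', '', words)
    # Drop any remaining '#' left over from hashtags in the tweet
    without_hash = joined.replace('#', '')
    # Collapse the whitespace that the removals leave behind
    return ' '.join(without_hash.split())


print(clean_ner_words('# Mirpur Jatlan Jarikas Pakistan'))
print(clean_ner_words('New Zealand Ka ##iko ##ura'))
```

This would not necessarily reproduce the word error rate reported above, since it changes how some predictions are split into words.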


**Removing repeated words from fine-tuned BERT:**

In [None]:
def remove_repeating_words(words):
  return ' '.join(list(set(words.split(' '))))

In [None]:
predictions=data_df['fine_tuned_BERT_predicted_locations'].apply(remove_repeating_words).apply(str.lower)

fine_tuned_bert_wer = wer.compute(references=data_df['location'].apply(str.lower).tolist(), predictions=predictions.tolist())
print('word error rate: fine-tuned BERT predicted locations, no repetitions: ' + str(fine_tuned_bert_wer))

word error rate: fine-tuned BERT predicted locations, no repetitions: 0.46523754345307067
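
One caveat about `remove_repeating_words`: a Python `set` does not preserve word order, and the iteration order of a set of strings can differ between runs. Since word error rate depends on word order, the figure above may not be exactly reproducible. An order-preserving variant is sketched below (hypothetical name `remove_repeating_words_ordered`); it keeps the first occurrence of each word and would not necessarily give the same rate as reported.

```python
def remove_repeating_words_ordered(words):
    # dict.fromkeys keeps the first occurrence of each word, in original order
    return ' '.join(dict.fromkeys(words.split()))


print(remove_repeating_words_ordered('delhi delhi delhi kerala'))
print(remove_repeating_words_ordered('karnataka karnataka kodagukodagu kodagu kodagu'))
```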


**Result Summary**

- Percentage of missing predictions

| Model             | Missing predictions (%) |
|-------------------|-------------------------|
| BERT NER          | 15.949                  |
| BERT Fine-tuned   | 0.253                   |

- Word Error Rate

| Model                           | Word Error Rate |
|---------------------------------|-----------------|
| BERT NER                        | 0.564           |
| BERT NER no hashtags            | 0.494           |
| BERT Fine-Tuned                 | 0.514           |
| BERT Fine-Tuned no repetitions  | 0.465           |

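- When missing predictions do not have a high impact on the use case, the BERT NER model may be advantageous: it requires no fine-tuning, can be put in place in less time, and has a word error rate that is not far from that of the fine-tuned model.
- The fine-tuned model has the advantage of far fewer missing predictions (0.253% against 15.949%) and a lower word error rate.
- The "no hashtags" and "no repetitions" rates were computed after the rows with an empty prediction from either model had been dropped, so they are not directly comparable with the two rates computed on all rows.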