<div style="padding: 0.5em; background-color: #1876d1; color: #fff; font-weight: bold; font-size: 1.4em;">
    [Approach 3]  Location Mention Recognition - NER BERT Transformer
</div>

In this Jupyter notebook, we use Named Entity Recognition (NER) to extract location mentions from tweets on X (formerly Twitter) posted during emergency situations.

Note :
* Do NER
* Try BERT Model
* Extract Location Mention

---
<b>#Microsoft Learn Challenge, #Zindi, #Hamad Bin Khalifa University </b>

### **Importing Libraries**

In [1]:
#!pip install simpletransformers
#!pip install pyspellchecker
#!pip install stanza
#!pip install nltk
#!pip install python-dotenv
#!pip install werpy
#!pip install wandb

In [2]:
# general utils
import werpy
import numpy as np
import pandas as pd
import seaborn as sns
import stanza, os, sys
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from simpletransformers.ner import NERModel, NERArgs
pd.set_option('display.max_colwidth', 500)

# utils setup
current_directory = os.getcwd()
root_directory = os.path.abspath(os.path.join(current_directory, os.pardir))
sys.path.append(root_directory)

# logging
import wandb
os.environ["WANDB_NOTEBOOK_NAME"] = "transformers_3.ipynb"

# custom utils
from utils.io import Predictions
from utils.metrics import LMR_Metrics
from utils.io import LMR_BILOU_Scrapper, LMR_JSON_Scrapper
from utils.preprocessing import Preprocess

### **Exploring Data**

The provided Train.csv contains many missing values, so we fetch the data from the original source.

In [3]:
LMR_JSON_Scrapper(output_dir="../data/self_scrapped/raw").run()

Processing dataset: california_wildfires_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.17file/s]


Processing dataset: canada_wildfires_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.35file/s]


Processing dataset: cyclone_idai_2019


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  1.58file/s]


Processing dataset: ecuador_earthquake_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.51file/s]


Processing dataset: greece_wildfires_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.39file/s]


Processing dataset: hurricane_dorian_2019


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  1.74file/s]


Processing dataset: hurricane_florence_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.14file/s]


Processing dataset: hurricane_harvey_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.34file/s]


Processing dataset: hurricane_irma_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.28file/s]


Processing dataset: hurricane_maria_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.42file/s]


Processing dataset: hurricane_matthew_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  1.99file/s]


Processing dataset: italy_earthquake_aug_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.52file/s]


Processing dataset: kaikoura_earthquake_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.46file/s]


Processing dataset: kerala_floods_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.35file/s]


Processing dataset: maryland_floods_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.74file/s]


Processing dataset: midwestern_us_floods_2019


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.48file/s]


Processing dataset: pakistan_earthquake_2019


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.54file/s]


Processing dataset: puebla_mexico_earthquake_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.38file/s]


Processing dataset: srilanka_floods_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.27file/s]

Processing complete.





- Let's concatenate our datasets

In [4]:
train_dfs = []
dev_dfs   = []
test_dfs  = []
path_dfs  = "../data/self_scrapped/raw"
for filename in os.listdir(path_dfs):
    if filename.endswith(".csv"):
        file_path = os.path.join(path_dfs, filename)
        if filename.startswith("train"):
            df = pd.read_csv(file_path)
            train_dfs.append(df)
        elif filename.startswith("dev"):
            df = pd.read_csv(file_path)
            dev_dfs.append(df)
        elif filename.startswith("test_unlabeled"):
            df = pd.read_csv(file_path)
            test_dfs.append(df)

df_train = pd.concat(train_dfs, ignore_index=True) if train_dfs else pd.DataFrame()
df_test  = pd.concat(test_dfs, ignore_index=True) if test_dfs else pd.DataFrame()
df_dev   = pd.concat(dev_dfs, ignore_index=True) if dev_dfs else pd.DataFrame()

print("TRAIN SHAPE: ", df_train.shape)
print("TEST  SHAPE: ", df_test.shape)
print("DEV   SHAPE: ", df_dev.shape)

TRAIN SHAPE:  (14392, 3)
TEST  SHAPE:  (4066, 3)
DEV   SHAPE:  (2056, 3)


- The tweets contain hashtags, non-ASCII characters, stop words, and similar noise, so we have to clean the data.

In [5]:
df_train.head(20)

Unnamed: 0,tweet_id,text,location_mentions
0,ID_1022420413882744832,Nearly half of #houses checked in #fire-stricken areas deemed #uninhabitable #GO #PrayForGreece #PrayForAthens #AthensFires 🇬🇷,
1,ID_1021778661895294976,RT @anadoluagency: #Greece: Death toll from wildfires hits 74,Greece=>COUNTRY
2,ID_1022015997740503042,When the essence of cooperation meets the sad reality of lifeThe IPA partner country offers financial aid to Greece to handle disaster @InterregIPACBC #Greecefires,Greece=>COUNTRY
3,ID_1022557424585240576,We are live from the Lureio Idrima the orphanage and nursing home operared by the nuns of the Holy Trinity Monastery that was destroyed by the fire in Neos Voutzas. Here too the scene is apocalyptic.,Holy Trinity Monastery=>HUMAN-MADE POINT-OF-INTEREST * Neos Voutzas=>NEIGHBORHOOD
4,ID_1021749412639457280,RT @AP: Greek prime minister declares 3-day national mourning period for dozens killed by wildfires near Athens.,Athens=>CITY
5,ID_1024662121391579136,#Greece vows to speed up destruction of illegal property after #wildfires,Greece=>COUNTRY
6,ID_1022441664697298944,"In Mati, hundreds of humanitarian volunteers have joined relief efforts following Greece’s most deadly wildfires in a decade",Mati=>CITY * Greece=>COUNTRY
7,ID_1024974395759108097,State Minister #Flambouraris offers thanks for assistance to #Greecefires victims,
8,ID_1022473412034416640,"Unfortunately theres the first confirmed fatality, hope to be the last, among tourists at the #AthensFires: an Irishman in honeymoon. So sorry, R.I.P 😢 #Mati #Attica #wildfires #PrayForGreece #PrayForAthens #Πυρκαγια #Αττικη #ματι",Attica=>CITY
9,ID_1022371614091096064,"A special account has been opened at the Bank of Greece for donations in support of fire victims. Account number 23/2341195169, IBAN GR4601000230000002341195169) available for foreign states, businesses and individuals from Greece and abroad to provide their financial support",Greece=>COUNTRY * Greece=>COUNTRY


- There are some sentences without a location mention. We need to look closer. It could be normal if there is no corresponding location found in the tweet, or it might be an error from the labeling task. Note that for the test set, it is normal for all location_mentions to be NaN. (😎 Yeah, we have to predict this value).

In [6]:
print(df_train.isnull().sum())
print(df_test.isnull().sum())
print(df_dev.isnull().sum())

tweet_id                0
text                    0
location_mentions    4026
dtype: int64
tweet_id                0
text                    0
location_mentions    4066
dtype: int64
tweet_id               0
text                   0
location_mentions    573
dtype: int64
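The sample rows above store labels in `location_mentions` as `text=>TYPE` pairs joined by ` * `. As a minimal sketch (the helper name `parse_location_mentions` is hypothetical and is not part of the project's `utils` package), such a cell could be split into pairs like this, with a missing value treated as "no location":

```python
import math

def parse_location_mentions(value):
    """Split 'Mati=>CITY * Greece=>COUNTRY' into [('Mati', 'CITY'), ('Greece', 'COUNTRY')].

    Hypothetical helper: the separator conventions are inferred from the
    sample rows shown above, not from the project's own parsing code.
    """
    # Missing labels arrive as None or NaN when read with pandas.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return []
    pairs = []
    for chunk in str(value).split(" * "):
        text, sep, label = chunk.rpartition("=>")
        if sep:  # skip malformed chunks that have no '=>'
            pairs.append((text.strip(), label.strip()))
    return pairs

example = parse_location_mentions("Mati=>CITY * Greece=>COUNTRY")
```

Applied with `df_train["location_mentions"].apply(parse_location_mentions)`, this would make it easier to inspect which tweets truly lack a location and which may be labeling errors.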


### **Preprocessing Data**

- Remove special characters
- Handle hashtags and user tags
- Remove stop words
- Tokenization
- Lemmatization
- BILOU tagging
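The first cleaning steps can be sketched with plain regular expressions. The functions below are hypothetical stand-ins that work on a single string; the notebook's own `Preprocess` methods operate on DataFrames and their implementation is not shown here, so the exact rules (for example, splitting CamelCase hashtags) are assumptions.

```python
import re

def strip_non_ascii(text):
    # Drop every character that cannot be encoded as ASCII (emoji, accents, ...).
    return text.encode("ascii", errors="ignore").decode("ascii")

def strip_user_tags(text):
    # Remove @mentions, including a trailing colon as in "RT @AP:".
    return re.sub(r"@\w+:?", "", text)

def expand_hashtags(text):
    # Assumed rule: "#PrayForGreece" -> "Pray For Greece" (drop '#', split CamelCase).
    def split_camel(match):
        return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", match.group(1))
    return re.sub(r"#(\w+)", split_camel, text)

def clean_tweet(text):
    text = strip_non_ascii(text)
    text = strip_user_tags(text)
    text = expand_hashtags(text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_tweet("RT @AP: Floods hit #PrayForGreece 😢")
```

Keeping the hashtag words instead of deleting them matters here, because hashtags such as `#Greece` often carry the location itself.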

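##### **<> BILOU Tagging**

BILOU stands for Begin, Inside, Last, Outside, and Unit. It extends the BIO scheme for tagging tokens (words or subwords) in a sequence to identify entities within the text. Each token is assigned a tag that marks it as the first token of a multi-token entity (B), a middle token (I), the last token (L), a single-token entity (U), or outside any entity (O). This is the scheme produced by `build_bilou_encoding` and seen in the `U-`, `B-` and `L-` labels below.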
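The labels that appear later in this notebook (`U-COUNTRY`, `B-CITY`, `L-CITY`) follow the BILOU variant of this scheme, which adds Last and Unit tags. Below is a minimal sketch of such an encoder; `bilou_encode` is a hypothetical function written for illustration and assumes tokens are already lowercased and mentions match tokens exactly.

```python
def bilou_encode(tokens, mentions):
    """Tag tokens with BILOU labels.

    tokens:   list of lowercased tokens
    mentions: list of (mention_text, label) pairs
    Only the first untagged occurrence of each mention is tagged.
    """
    tags = ["O"] * len(tokens)
    for text, label in mentions:
        span = text.lower().split()
        n = len(span)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in tags[start:start + n])
            if tokens[start:start + n] == span and window_free:
                if n == 1:
                    tags[start] = f"U-{label}"          # single-token entity
                else:
                    tags[start] = f"B-{label}"          # first token
                    for i in range(start + 1, start + n - 1):
                        tags[i] = f"I-{label}"          # middle tokens
                    tags[start + n - 1] = f"L-{label}"  # last token
                break
    return tags

tokens = ["fire", "in", "neos", "voutzas", "near", "athens"]
mentions = [("Neos Voutzas", "NEIGHBORHOOD"), ("Athens", "CITY")]
tags = bilou_encode(tokens, mentions)
```

Here `neos voutzas` receives `B-NEIGHBORHOOD` and `L-NEIGHBORHOOD`, while the single-token `athens` receives `U-CITY`.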

In [7]:
# TRAIN
train_path = "../data/transformed/train.tag.csv"
if not os.path.exists(train_path):
    df_train = Preprocess.remove_non_ascii(df_train, column_name='text')
    df_train = Preprocess.remove_usertag(df_train, column_name='text')
    df_train = Preprocess.reformat_hashtag(df_train, column_name='text')
    df_train = Preprocess.remove_stop_words(df_train, column_name='text', new_col="text_transformed", transformation=[
        "tokenize", "lemma", "lower"], save_in="../data/transformed/train.lemma.csv")
    df_tag_train = Preprocess.build_bilou_encoding(df_train, text_col="text_transformed", save_in=train_path)
else:
    df_tag_train = pd.read_csv(train_path)

In [8]:
# DEV
dev_path = "../data/transformed/dev.tag.csv"
if not os.path.exists(dev_path):
    df_dev = Preprocess.remove_non_ascii(df_dev, column_name='text')
    df_dev = Preprocess.remove_usertag(df_dev, column_name='text')
    df_dev = Preprocess.reformat_hashtag(df_dev, column_name='text')
    df_dev = Preprocess.remove_stop_words(df_dev, column_name='text', new_col="text_transformed", transformation=[
        "tokenize", "lemma", "lower"], save_in="../data/transformed/dev.lemma.csv")
    df_tag_dev = Preprocess.build_bilou_encoding(df_dev, text_col="text_transformed", save_in=dev_path)
else:
    df_tag_dev = pd.read_csv(dev_path)

In [9]:
# TEST
test_path = "../data/transformed/test.lemma.csv"
if not os.path.exists(test_path):
    df_test = Preprocess.remove_non_ascii(df_test, column_name='text')
    df_test = Preprocess.remove_usertag(df_test, column_name='text')
    df_test = Preprocess.reformat_hashtag(df_test, column_name='text')
    df_test = Preprocess.remove_stop_words(df_test, column_name='text', new_col="text_transformed", transformation=[
        "tokenize", "lemma", "lower"], save_in=test_path)
else:
    df_test = pd.read_csv(test_path)

In [10]:
df_tag_train.head(30)

Unnamed: 0.1,Unnamed: 0,sentence_id,words,labels
0,0,684,nearly,O
1,1,684,half,O
2,2,684,of,O
3,3,684,house,O
4,4,684,check,O
5,5,684,in,O
6,6,684,fire,O
7,7,684,stricken,O
8,8,684,area,O
9,9,684,deem,O


### **Prepare training, dev and test data**

In [11]:
df_tag_train["sentence_id"] = LabelEncoder().fit_transform(df_tag_train["sentence_id"])
df_tag_dev["sentence_id"]   = LabelEncoder().fit_transform(df_tag_dev["sentence_id"])

In [12]:
df_tag_train.head()

Unnamed: 0.1,Unnamed: 0,sentence_id,words,labels
0,0,684,nearly,O
1,1,684,half,O
2,2,684,of,O
3,3,684,house,O
4,4,684,check,O


In [13]:
X_train  = df_tag_train[["sentence_id", "words"]]
X_test   = df_tag_dev[["sentence_id", "words"]]
y_train  = df_tag_train["labels"]
y_test   = df_tag_dev["labels"]

train_data = pd.DataFrame({"sentence_id": X_train["sentence_id"], "words": X_train["words"], "labels": y_train})
test_data = pd.DataFrame({"sentence_id": X_test["sentence_id"], "words": X_test["words"], "labels": y_test})

train_data

Unnamed: 0,sentence_id,words,labels
0,684,nearly,O
1,684,half,O
2,684,of,O
3,684,house,O
4,684,check,O
...,...,...,...
292721,5841,the,O
292722,5841,ddrc,O
292723,5841,patient,O
292724,5841,preparedness,O


### **Model Training**

- Let's count the NER labels

In [14]:
label = pd.concat([df_tag_train, df_tag_dev])["labels"].unique().tolist()
label_counts = pd.concat([df_tag_train, df_tag_dev])["labels"].value_counts().reset_index()
label_counts.columns = ["Label", "Frequency"]
display(label_counts)

Unnamed: 0,Label,Frequency
0,O,314421
1,U-COUNTRY,4575
2,U-STATE,4298
3,U-CITY,2822
4,B-CITY,1050
5,L-CITY,1050
6,B-COUNTRY,653
7,L-COUNTRY,653
8,L-ISLAND,572
9,B-ISLAND,572


- Let's define the model **Args** and the hyperparameter optimisation approach

In [15]:
# hyperparameters

sweep_config = {
    "method": "bayes",  # grid, random
    "metric": {"name": "wer", "goal": "minimize"},
    "parameters": {
        "num_train_epochs": {"values": [1, 2, 3, 5, 8]},
        "learning_rate": {"min": 5e-5, "max": 4e-4},
    },
}

- Initialize a W&B sweep with the config defined earlier.

In [16]:
# sweep_id = wandb.sweep(sweep_config, project="LMR")
#%%capture
# wandb.init(project="LMR", name="Location-Mention-Recognition")

- Model args

In [35]:
model_args = NERArgs()

# general
model_args.num_train_epochs=4
model_args.evaluate_during_training = True
model_args.overwrite_output_dir = True
model_args.train_batch_size = 64
model_args.eval_batch_size = 32
model_args.labels_list = label
model_args.use_multiprocessing = True
# model_args.wandb_project = "LMR"

# for early stopping
# model_args.use_early_stopping = True
# model_args.early_stopping_delta = 0.01
# model_args.early_stopping_metric = "wer"
# model_args.early_stopping_metric_minimize = True  # wer is an error rate: lower is better
# model_args.early_stopping_patience = 5
model_args.evaluate_during_training_steps = 1000

In [36]:
model = NERModel(
    "bert", 
    "bert-large-uncased", #bert-large-uncased
    use_cuda=False,
    args=model_args, 
)

# Train the model
print('### TRAINING')
model.train_model(
    train_data, 
    eval_data=test_data, 
    wer=LMR_Metrics.wer_type
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### TRAINING




  0%|          | 0/17 [00:00<?, ?it/s]

Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Running Epoch 1 of 4:   0%|          | 0/225 [00:00<?, ?it/s]



  0%|          | 0/5 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/65 [00:00<?, ?it/s]



Running Epoch 2 of 4:   0%|          | 0/225 [00:00<?, ?it/s]

In [None]:
# Evaluate the model
print('### EVALUATION')
result, model_outputs, wrong_preds = model.eval_model(test_data, wer=LMR_Metrics.wer_type)

### EVALUATION




  0%|          | 0/5 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/65 [00:00<?, ?it/s]

In [None]:
result

{'eval_loss': 0.08372374755831866,
 'precision': 0.7946026986506747,
 'recall': 0.8038422649140546,
 'f1_score': 0.7991957778336265,
 'wer': 0.35615196468537486}
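The `wer` entry is not defined in this notebook. Assuming it stands for word error rate (the notebook imports `werpy`, which suggests so), the standard definition is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain-Python illustration of that definition; how `LMR_Metrics.wer_type` adapts it to tag sequences is not shown here, and the handling of an empty reference is our own convention.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # Convention chosen for this sketch: 0 if both are empty, else 1.
        return 0.0 if not hyp else 1.0
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)

score = word_error_rate("ellicott city maryland", "city maryland")
```

In this example one of three reference words is missing from the hypothesis, so the score is one third. Lower is better, which matches the `"goal": "minimize"` setting in the sweep configuration.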

- Quick prediction

In [None]:
predictions, raw_outputs = model.predict([
    "elicott city, maryland, struck by catastrophic flooding; 1 missing.",
    "memorial day weekend floods ravage maryland town"
])



  0%|          | 0/1 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
predictions

[[{'elicott': 'O'},
  {'city,': 'L-CITY'},
  {'maryland,': 'U-STATE'},
  {'struck': 'O'},
  {'by': 'O'},
  {'catastrophic': 'O'},
  {'flooding;': 'O'},
  {'1': 'O'},
  {'missing.': 'O'}],
 [{'memorial': 'O'},
  {'day': 'O'},
  {'weekend': 'O'},
  {'floods': 'O'},
  {'ravage': 'O'},
  {'maryland': 'U-STATE'},
  {'town': 'O'}]]

### **Make prediction for Context**

In [None]:
# Get Data and Preprocess
# df_context = pd.read_csv('../data/provided/Test.csv')
# df_context = Preprocess.remove_special_characters(df_context, column_name='text')
# df_context['text'] = df_context['text'].apply(lambda x: Preprocess.treat_hashtags(x))
# df_context['text'] = df_context['text'].apply(lambda x: Preprocess.correct_spelling(x))
# #df_context['text'] = df_context['text'].apply(lambda x: Preprocess.remove_stop_words(x))
# df_context.to_csv("../data/provided/Test-processed.csv")

df_context = pd.read_csv('../data/provided/Test.csv')
df_context = Preprocess.remove_non_ascii(df_context, column_name='text')
df_context = Preprocess.remove_usertag(df_context, column_name='text')
df_context = Preprocess.reformat_hashtag(df_context, column_name='text')
df_context = Preprocess.remove_stop_words(df_context, column_name='text', new_col="text_transformed", transformation=[
    "tokenize", "lemma", "lower"], save_in="../data/provided/Test-processed.csv")

#df_context = pd.read_csv('../data/provided/Test-processed.csv')

ids = df_context["tweet_id"].values
tweets = df_context["text_transformed"].values

# Make prediction
predictions, raw_outputs = model.predict(tweets)

100%|██████████| 2942/2942 [03:56<00:00, 12.44it/s]


  0%|          | 0/6 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/92 [00:00<?, ?it/s]

In [None]:
# Extract Location Mention based on model output
results = []
for sentence in predictions:
    result = " ".join([word for d in sentence for word, tag in d.items() if tag != 'O'])
    if result == "":
        result = " "
    results.append(result)

Predictions.to_csv(ids, results)

Saved predictions to ../submissions/submission_9.csv
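The extraction above joins every non-`O` word into one string, so separate locations are merged and trailing punctuation (as in `city,`) is kept. As a possible refinement, the sketch below groups tokens into separate mentions using the BILOU prefixes and strips surrounding punctuation. `extract_locations` is a hypothetical helper; it assumes the list-of-single-item-dicts format shown in the quick prediction output.

```python
import string

def extract_locations(sentence_prediction):
    """Group BILOU-tagged tokens into a list of location strings."""
    spans, current = [], []

    def flush():
        # Close any open multi-token span.
        if current:
            spans.append(" ".join(current))
            current.clear()

    for item in sentence_prediction:
        word, tag = next(iter(item.items()))
        word = word.strip(string.punctuation)
        prefix = tag.split("-", 1)[0]
        if tag == "O" or not word:
            flush()
        elif prefix == "U":      # single-token entity
            flush()
            spans.append(word)
        elif prefix == "B":      # start of a new entity
            flush()
            current.append(word)
        elif prefix == "I":      # continuation
            current.append(word)
        elif prefix == "L":      # last token closes the entity
            current.append(word)
            flush()
    flush()
    return spans

sample = [{"elicott": "O"}, {"city,": "L-CITY"}, {"maryland,": "U-STATE"}, {"struck": "O"}]
locations = extract_locations(sample)
```

On the quick prediction example this keeps `city` and `maryland` as two separate mentions. Whether the submission format expects them joined by a space is a competition detail not covered by this sketch.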


In [None]:
### END