<div style="padding: 0.5em; background-color: #1876d1; color: #fff; font-weight: bold; font-size: 1.4em;">
    [Approach 3]  Location Mention Recognition - NER BERT Transformer
</div>

In this Jupyter notebook, we use Named Entity Recognition (NER) to extract location mentions from tweets posted on X (formerly Twitter) during emergency situations.

Notes:
* Perform NER
* Try a BERT model
* Extract location mentions

---
<b>#Microsoft Learn Challenge, #Zindi, #Hamad Bin Khalifa University </b>

### **Importing Libraries**

In [1]:
#!pip install simpletransformers
#!pip install pyspellchecker
#!pip install stanza
#!pip install nltk
#!pip install python-dotenv
#!pip install wandb

In [2]:
# general utils
import numpy as np
import pandas as pd
import stanza, os, sys
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from simpletransformers.ner import NERModel, NERArgs

# utils setup
current_directory = os.getcwd()
root_directory = os.path.abspath(os.path.join(current_directory, os.pardir))
sys.path.append(root_directory)

# custom utils
from utils.io import Predictions
from utils.io import LMR_BILOU_Scrapper
from utils.preprocessing import Preprocess

### **Exploring Data**

The provided Train.csv contains many missing values, so we rebuild the training data from the original source.

In [2]:
LMR_BILOU_Scrapper(output_dir="../data/self_scrapped/bilou").run()

Processing dataset: california_wildfires_2018


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.21file/s]


Processing dataset: canada_wildfires_2016


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.34file/s]


Processing dataset: cyclone_idai_2019


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.36file/s]


Processing dataset: ecuador_earthquake_2016


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.71file/s]


Processing dataset: greece_wildfires_2018


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.51file/s]


Processing dataset: hurricane_dorian_2019


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.33file/s]


Processing dataset: hurricane_florence_2018


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.25file/s]


Processing dataset: hurricane_harvey_2017


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.25file/s]


Processing dataset: hurricane_irma_2017


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.38file/s]


Processing dataset: hurricane_maria_2017


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.57file/s]


Processing dataset: hurricane_matthew_2016


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.70file/s]


Processing dataset: italy_earthquake_aug_2016


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.75file/s]


Processing dataset: kaikoura_earthquake_2016


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.53file/s]


Processing dataset: kerala_floods_2018


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.11file/s]


Processing dataset: maryland_floods_2018


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  3.01file/s]


Processing dataset: midwestern_us_floods_2019


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.36file/s]


Processing dataset: pakistan_earthquake_2019


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.56file/s]


Processing dataset: puebla_mexico_earthquake_2017


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.50file/s]


Processing dataset: srilanka_floods_2017


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  3.07file/s]

Processing complete.





- Let's concatenate our datasets

In [3]:
train_dfs = []
dev_dfs   = []
path_dfs  = "../data/self_scrapped/bilou/"
for filename in os.listdir(path_dfs):
    if filename.endswith(".csv"):
        file_path = os.path.join(path_dfs, filename)
        if filename.startswith("train"):
            df = pd.read_csv(file_path)
            train_dfs.append(df)
        elif filename.startswith("dev"):
            df = pd.read_csv(file_path)
            dev_dfs.append(df)

df_train = pd.concat(train_dfs, ignore_index=True) if train_dfs else pd.DataFrame()
df_dev = pd.concat(dev_dfs, ignore_index=True) if dev_dfs else pd.DataFrame()

print("TRAIN SHAPE: ", df_train.shape)
print("DEV   SHAPE: ", df_dev.shape)

TRAIN SHAPE:  (363376, 3)
DEV   SHAPE:  (52038, 3)


In [4]:
df_train.head(30)

Unnamed: 0,id_sentence,word,tag
0,GREECE_WILDFIRES_2018_0,Nearly,O
1,GREECE_WILDFIRES_2018_0,half,O
2,GREECE_WILDFIRES_2018_0,of,O
3,GREECE_WILDFIRES_2018_0,#,O
4,GREECE_WILDFIRES_2018_0,houses,O
5,GREECE_WILDFIRES_2018_0,checked,O
6,GREECE_WILDFIRES_2018_0,in,O
7,GREECE_WILDFIRES_2018_0,#,O
8,GREECE_WILDFIRES_2018_0,fire,O
9,GREECE_WILDFIRES_2018_0,-,O


In [5]:
print(df_train.isnull().sum())
print(df_dev.isnull().sum())
df_train.dropna(inplace=True)

id_sentence    0
word           6
tag            0
dtype: int64
id_sentence    0
word           0
tag            0
dtype: int64
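
One possible explanation for the 6 missing `word` values (an assumption, not verified against the scraped files) is that `pd.read_csv` converts tokens such as `NA` or `null` to `NaN` by default. The sketch below uses a made-up three-token sample to show how `keep_default_na=False` keeps such tokens as literal strings, so those rows would not have to be dropped.

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the scraped files (not real data).
csv_text = "id_sentence,word,tag\nS_0,Flooding,O\nS_0,in,O\nS_0,NA,O\n"

# Default parsing: the literal token "NA" is read as a missing value.
df_default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False disables the built-in NaN markers, so "NA" stays a string.
df_literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(df_default["word"].isnull().sum())  # 1
print(df_literal["word"].isnull().sum())  # 0
```

Note that with `keep_default_na=False`, genuinely empty fields are read as empty strings rather than `NaN`, so they would need to be filtered separately.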


In [6]:
df_train = Preprocess.remove_non_ascii(df_train, column_name='word')
df_dev   = Preprocess.remove_non_ascii(df_dev, column_name='word')

In [7]:
#Preprocess.treat_hashtags("#EddisonHermond missing after catastrophic flood hits #EllicottCity #Maryland; damage believed worse than 20")
#Preprocess.remove_stop_words("#EddisonHermond missing after catastrophic flood hits #EllicottCity #Maryland; damage believed worse than 20")
#Preprocess.correct_spelling("#EddisonHermond missings after catastrophic flood hits #EllicottCity #Maryland; damage believed worse than 20")

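### **BIO / BILOU Tagging**

BIO stands for Begin, Inside, and Outside. It’s a method for tagging tokens (words or subwords) in a sequence to identify entities within the text. Each token in the text is assigned a tag that indicates whether it is at the beginning of an entity, inside an entity, or outside of any entity. The dataset used here follows the BILOU extension of this scheme, which adds L (the Last token of a multi-token entity) and U (a Unit, i.e. single-token, entity), as seen in tags such as `B-CITY`, `L-CITY`, and `U-STAT`.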
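
The tags in this dataset (listed in the model training section) include `L-` and `U-` prefixes, so they follow the BILOU variant of the scheme. As a minimal sketch of how such tags are assigned, the hypothetical helper `bilou_tags` below (not part of the notebook's `utils` package) converts token spans into BILOU tags. The sample tokens and spans are made up for illustration.

```python
def bilou_tags(tokens, spans):
    """Assign BILOU tags given (start, end_exclusive, entity_type) token spans."""
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, ent_type in spans:
        if end - start == 1:
            # Single-token entity: Unit
            tags[start] = f"U-{ent_type}"
        else:
            # Multi-token entity: Begin, optional Inside tokens, Last
            tags[start] = f"B-{ent_type}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{ent_type}"
            tags[end - 1] = f"L-{ent_type}"
    return tags


# Illustrative sentence: a two-token city followed by a one-token state.
tokens = ["Ellicott", "City", ",", "Maryland", "struck", "by", "flooding"]
spans = [(0, 2, "CITY"), (3, 4, "STAT")]

for token, tag in zip(tokens, bilou_tags(tokens, spans)):
    print(f"{token:10s} {tag}")
```

Here `Ellicott City` receives `B-CITY` and `L-CITY`, `Maryland` receives `U-STAT`, and all other tokens receive `O`.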

In [8]:
df_train["id_sentence"] = LabelEncoder().fit_transform(df_train["id_sentence"])
df_dev["id_sentence"]   = LabelEncoder().fit_transform(df_dev["id_sentence"])
df_train["tag"]         = df_train["tag"].str.upper()
df_dev["tag"]           = df_dev["tag"].str.upper()

In [9]:
df_train.head()

Unnamed: 0,id_sentence,word,tag
0,3542,Nearly,O
1,3542,half,O
2,3542,of,O
3,3542,#,O
4,3542,houses,O


### **Prepare training, dev and test data**

In [10]:
# X = df_tag[["tweet_id", "word"]]
# y = df_tag["label"]
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train  = df_train[["id_sentence", "word"]]
X_test   = df_dev[["id_sentence", "word"]]
y_train  = df_train["tag"]
y_test   = df_dev["tag"]

train_data = pd.DataFrame({"sentence_id": X_train["id_sentence"], "words": X_train["word"], "labels": y_train})
test_data = pd.DataFrame({"sentence_id": X_test["id_sentence"], "words": X_test["word"], "labels": y_test})

train_data

Unnamed: 0,sentence_id,words,labels
0,3542,Nearly,O
1,3542,half,O
2,3542,of,O
3,3542,#,O
4,3542,houses,O
...,...,...,...
363371,5121,Preparedness,O
363372,5121,Plan,O
363373,5121,.,O
363374,5121,#,O


#### **Model Training**

In [11]:
label = pd.concat([df_train, df_dev])["tag"].unique().tolist()
label

['O',
 'U-CTRY',
 'B-HPOI',
 'I-HPOI',
 'L-HPOI',
 'B-NBHD',
 'L-NBHD',
 'U-CITY',
 'U-CONT',
 'U-STAT',
 'B-ISL',
 'L-ISL',
 'U-ISL',
 'U-OTHR',
 'B-CITY',
 'L-CITY',
 'B-NPOI',
 'L-NPOI',
 'U-NBHD',
 'B-CNTY',
 'I-CNTY',
 'L-CNTY',
 'B-OTHR',
 'L-OTHR',
 'U-DIST',
 'B-DIST',
 'L-DIST',
 'I-CITY',
 'B-CTRY',
 'L-CTRY',
 'U-HPOI',
 'I-DIST',
 'B-STAT',
 'L-STAT',
 'I-NBHD',
 'U-CNTY',
 'I-NPOI',
 'B-ST',
 'L-ST',
 'U-NPOI',
 'I-OTHR',
 'I-ST',
 'I-CTRY',
 'B-CONT',
 'L-CONT',
 'I-STAT',
 'U-ST',
 'I-ISL']

In [12]:
# logging
import wandb
wandb.init(project="lmr-bilou", name="lmr-bilou")

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: genereux-akotenou (genereux-akotenou-local). Use `wandb login --relogin` to force relogin


In [16]:
model_args = NERArgs()
model_args.num_train_epochs = 1
model_args.learning_rate = 4e-4
model_args.overwrite_output_dir = True
model_args.train_batch_size = 32
model_args.eval_batch_size = 32
model_args.labels_list = label

model_args.logging_steps = 50
model_args.save_steps = 100
model_args.evaluate_during_training = True


model_args.wandb_project = "lmr-bilou"
model_args.wandb_kwargs = {}


In [17]:
model = NERModel('bert', "bert-base-cased", args=model_args, labels=label, use_cuda=False)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [18]:
model.train_model(train_data, eval_data=test_data)
wandb.finish()

  0%|          | 0/17 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 1 of 1:   0%|          | 0/450 [00:00<?, ?it/s]

In [22]:
result, model_outputs, wrong_preds = model.eval_model(test_data)

  0%|          | 0/5 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/65 [00:00<?, ?it/s]



In [23]:
predictions, raw_outputs = model.predict([
    "Elicott City, Maryland, struck by catastrophic flooding; 1 missing.",
    "Memorial Day weekend floods ravage Maryland town"
])

  0%|          | 0/1 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

In [24]:
predictions

[[{'Elicott': 'B-CITY'},
  {'City,': 'L-CITY'},
  {'Maryland,': 'U-STAT'},
  {'struck': 'O'},
  {'by': 'O'},
  {'catastrophic': 'O'},
  {'flooding;': 'O'},
  {'1': 'O'},
  {'missing.': 'O'}],
 [{'Memorial': 'O'},
  {'Day': 'O'},
  {'weekend': 'O'},
  {'floods': 'O'},
  {'ravage': 'O'},
  {'Maryland': 'U-STAT'},
  {'town': 'O'}]]
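
The output above is a list of `{word: tag}` dictionaries per sentence, and words keep their trailing punctuation (for example `City,`). As a hedged sketch, the hypothetical helper `extract_mentions` below groups consecutive BILOU tags into typed location mentions. Stripping punctuation from each word is an assumption about the desired output format, not something the notebook does.

```python
import string


def extract_mentions(sentence_preds):
    """Group one sentence of {word: tag} dicts into (mention, entity_type) pairs."""
    mentions = []
    current = []         # words of the mention currently being built
    current_type = None  # entity type of that mention

    def flush():
        # Close the mention in progress, if any.
        nonlocal current, current_type
        if current:
            mentions.append((" ".join(current), current_type))
        current, current_type = [], None

    for pair in sentence_preds:
        word, tag = next(iter(pair.items()))
        if tag == "O":
            flush()
            continue
        prefix, ent_type = tag.split("-", 1)
        # Assumption: surrounding punctuation (including '#') is not part of the mention.
        word = word.strip(string.punctuation)
        # B- and U- always start a new mention; so does a change of entity type.
        if prefix in ("B", "U") or ent_type != current_type:
            flush()
        if word:
            current.append(word)
            current_type = ent_type
        # L- and U- end the mention.
        if prefix in ("L", "U"):
            flush()
    flush()
    return mentions


# First sentence of the prediction output shown above (truncated).
sample = [
    {"Elicott": "B-CITY"},
    {"City,": "L-CITY"},
    {"Maryland,": "U-STAT"},
    {"struck": "O"},
    {"by": "O"},
]
print(extract_mentions(sample))
```

Grouping by tag prefix keeps `Elicott City` together as one mention and separates it from `Maryland`, which a plain join over all non-`O` words would not do.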

### **Make predictions on the test data**

In [25]:
# Get Data and Preprocess
df_context = pd.read_csv('../data/provided/Test.csv')
df_context = Preprocess.remove_non_ascii(df_context, column_name='text')
# df_context['text'] = df_context['text'].apply(lambda x: Preprocess.treat_hashtags(x))
# df_context['text'] = df_context['text'].apply(lambda x: Preprocess.correct_spelling(x))
# #df_context['text'] = df_context['text'].apply(lambda x: Preprocess.remove_stop_words(x))
# df_context.to_csv("../data/provided/Test-processed.csv")

#df_context = pd.read_csv('../data/provided/Test-processed-2.csv')

ids = df_context["tweet_id"].values
tweets = df_context["text"].values

# Make prediction
predictions, raw_outputs = model.predict(tweets)

  0%|          | 0/6 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/92 [00:00<?, ?it/s]

In [26]:
# Extract Location Mention based on model output
results = []
for sentence in predictions:
    result = " ".join([word for d in sentence for word, tag in d.items() if tag != 'O'])
    if result == "":
        result = " " 
    results.append(result)

Predictions.to_csv(ids, results)

Saved predictions to ../submissions/submission_7.csv


In [None]:
df_context = pd.read_csv('../data/provided/Test-processed.csv')
df_context

In [None]:
### END