<div style="padding: 0.5em; background-color: #1876d1; color: #fff; font-weight: bold; font-size: 1.4em;">
    [Approach 3]  Location Mention Recognition - NER BERT Transformer
</div>

In this Jupyter notebook, we use Named Entity Recognition (NER) to extract location mentions from tweets on X (formerly Twitter) posted during emergency situations.

Notes:
* Do NER
* Try a BERT model
* Extract location mentions
* The BIO format is very specific. It requires understanding whether a token is at the “Beginning” (B), “Inside” (I), or “Outside” (O) of a particular entity label. This would add unnecessary complexity when formulating a task description prompt, so the idea is to try an IO formatting approach.
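
As a minimal sketch of the IO idea noted above, the helper below collapses BIO or BILOU tags into IO tags by replacing every positional prefix with `I-`. The function name `to_io` is hypothetical and is not part of this notebook's `utils` package.

```python
def to_io(tags):
    """Collapse BIO/BILOU tags into IO tags (hypothetical helper)."""
    io_tags = []
    for tag in tags:
        if tag == "O":
            io_tags.append("O")
        else:
            # Drop the positional prefix (B, I, L, U) and keep the entity type.
            _, _, entity_type = tag.partition("-")
            io_tags.append(f"I-{entity_type}")
    return io_tags


bilou = ["B-CITY", "L-CITY", "U-STATE", "O"]
print(to_io(bilou))  # ['I-CITY', 'I-CITY', 'I-STATE', 'O']
```

The trade-off is that IO tags cannot separate two adjacent entities of the same type, because the boundary information carried by the `B-` prefix is discarded.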

---
<b>#Microsoft Learn Challenge, #Zindi, #Hamad Bin Khalifa University </b>

### **Importing Libraries**

In [1]:
#!pip install simpletransformers
#!pip install pyspellchecker
#!pip install stanza
#!pip install nltk
#!pip install python-dotenv
#!pip install werpy
#!pip install wandb
#!pip install deep_utils

In [2]:
# general utils
import werpy
import numpy as np
import pandas as pd
import seaborn as sns
import stanza, os, sys, re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from simpletransformers.ner import NERModel, NERArgs
pd.set_option('display.max_colwidth', 300)

# utils setup
current_directory = os.getcwd()
root_directory = os.path.abspath(os.path.join(current_directory, os.pardir))
sys.path.append(root_directory)

# logging
import wandb
os.environ["WANDB_NOTEBOOK_NAME"] = "transformers_3_BIO_and_IO.ipynb"

# custom utils
from utils.io import Predictions
from utils.metrics import LMR_Metrics
from utils.io import LMR_BILOU_Scrapper, LMR_JSON_Scrapper
from utils.preprocessing import Preprocess
from utils.stratify import MultiLabelNERStratify

### **Exploring Data**

The provided Train.csv contains many missing values, so we have to fetch the data from the original source.

In [3]:
#LMR_JSON_Scrapper(output_dir="../data/self_scrapped/raw").run()

- Let's concatenate our datasets

In [4]:
train_dfs = []
dev_dfs   = []
path_dfs  = "../data/self_scrapped/raw"
for filename in os.listdir(path_dfs):
    if filename.endswith(".csv"):
        file_path = os.path.join(path_dfs, filename)
        if filename.startswith("train"):
            df = pd.read_csv(file_path)
            train_dfs.append(df)
        elif filename.startswith("dev"):
            df = pd.read_csv(file_path)
            dev_dfs.append(df)

df_train = pd.concat(train_dfs, ignore_index=True) if train_dfs else pd.DataFrame()
df_dev   = pd.concat(dev_dfs, ignore_index=True) if dev_dfs else pd.DataFrame()

df_train = pd.concat([df_train, df_dev])
print("TRAIN SHAPE: ", df_train.shape)

TRAIN SHAPE:  (16448, 3)


- Use an augmented DataFrame

In [5]:
df = pd.read_csv('../data/provided/TrainEncoded.csv')

def parse_location_mentions(location_mentions):
    location_dict = {}
    if pd.notna(location_mentions):
        parts = location_mentions.split(' * ')
        for part in parts:
            location, loc_type = part.split('=>')
            location_dict[location.strip()] = loc_type.strip()
    return location_dict

location_type_dict = {}
for location in df_train['location_mentions'].dropna():
    location_type_dict.update(parse_location_mentions(location))

def label_location(row, location_type_dict):
    location = row['location']
    labeled_locations = []
    words = location.split()
    
    while words:
        for i in range(len(words), 0, -1):
            sub_location = ' '.join(words[:i]).strip()
            if sub_location in location_type_dict:
                labeled_locations.append(f"{sub_location}=>{location_type_dict[sub_location]}")
                words = words[i:]
                break
        else:
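            # No known location starts at words[0]: skip that word and keep
            # scanning, so later mentions in the string are not lost.
            words = words[1:]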
    if labeled_locations:
        return ' * '.join(labeled_locations)
    else:
        return None

df['location_mentions'] = df.apply(lambda row: label_location(row, location_type_dict), axis=1)
df.head(60)

Unnamed: 0,tweet_id,text,location,location_mentions
0,ID_1001136212718088192,EllicottCity is known for its vibrant art scene.,EllicottCity,EllicottCity=>CITY
1,ID_1001136696589631488,"Flash floods struck a Maryland city on Sunday, washing out streets and tossing cars like bath toys.",Maryland,Maryland=>STATE
2,ID_1001136950345109504,State of emergency declared for Maryland flooding: via @YouTube,Maryland,Maryland=>STATE
3,ID_1001137334056833024,"Other parts of Maryland also saw significant damage from Sundays storms including this Baltimore city neighborhood, #Dundalk and #Catonsville. Rain totals spanned from 1 to 10 inches across Maryland: #ECFlood",Baltimore Maryland,Baltimore=>CITY * Maryland=>STATE
4,ID_1001138374923579392,"Catastrophic Flooding Slams Ellicott City, Maryland; Water Rescues Reported - The Weather Channel via @GoogleNews",Ellicott City Maryland,Ellicott City=>CITY * Maryland=>STATE
5,ID_1001138377717157888,"WATCH: 1 missing after flash #FLOODING devastates Ellicott City, Maryland #GPWX",Ellicott City Maryland,Ellicott City=>CITY * Maryland=>STATE
6,ID_1001139323075416064,The scenic spots in Ellicott City Maryland are perfect for photography.,Ellicott City Maryland,Ellicott City=>CITY * Maryland=>STATE
7,ID_1001140017207459840,Maryland has a variety of historical landmarks.,Maryland,Maryland=>STATE
8,ID_1001140276377935872,The local food markets in Maryland are a feast for the senses.,Maryland,Maryland=>STATE
9,ID_1001140804503601152,Baltimore is a popular destination for both relaxation and adventure.,Baltimore,Baltimore=>CITY


### **Preprocessing Data**

- Remove special characters
- Treat hashtags and user tags
- Remove stop words
- Tokenization
- Stemming
- BIO Tagging
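
The first cleaning steps in this list can be sketched with regular expressions. This is only an illustration under assumed rules (drop non-ASCII characters, drop `@user` mentions, keep the hashtag word without its `#`). The notebook's actual `Preprocess` helpers live in `utils.preprocessing` and may behave differently, and the name `clean_tweet` is hypothetical.

```python
import re


def clean_tweet(text):
    """Hypothetical cleaner with assumed rules, not the notebook's Preprocess class."""
    text = text.encode("ascii", errors="ignore").decode("ascii")  # drop non-ASCII
    text = re.sub(r"@\w+", "", text)        # remove user tags
    text = re.sub(r"#(\w+)", r"\1", text)   # keep the hashtag word, drop '#'
    text = re.sub(r"\s+", " ", text)        # collapse repeated whitespace
    return text.strip()


print(clean_tweet("Flooding in #EllicottCity via @GoogleNews"))
# Flooding in EllicottCity via
```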

##### **<> BIO Tagging**

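BIO stands for Begin, Inside, and Outside. It’s a method for tagging tokens (words or subwords) in a sequence to identify entities within the text. Each token is assigned a tag that indicates whether it is at the beginning of an entity, inside an entity, or outside of any entity. The encoding built later in this notebook is the BILOU variant, which adds Last (L) and Unit (U) tags, as the `U-CITY` and `L-ISLAND` labels in the outputs show.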
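
To make the tagging scheme concrete, here is a small sketch that assigns BIO tags to a tokenized sentence given a dictionary of known mentions. The helper `bio_tag` and its phrase-matching strategy are assumptions for illustration; the notebook itself relies on `Preprocess.build_bilou_encoding`, whose implementation is not shown here.

```python
def bio_tag(tokens, entities):
    """Tag tokens with BIO labels (hypothetical helper).

    entities maps a lower-cased phrase to its type,
    e.g. {"ellicott city": "CITY"}.
    """
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # Try longer phrases first so multi-word mentions win over shorter overlaps.
    for phrase, etype in sorted(entities.items(), key=lambda kv: -len(kv[0].split())):
        words = phrase.split()
        n = len(words)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            window_free = all(tag == "O" for tag in tags[start:start + n])
            if lowered[start:start + n] == words and window_free:
                tags[start] = f"B-{etype}"
                for k in range(start + 1, start + n):
                    tags[k] = f"I-{etype}"
    return tags


tokens = "flooding devastates ellicott city maryland".split()
print(bio_tag(tokens, {"ellicott city": "CITY", "maryland": "STATE"}))
# ['O', 'O', 'B-CITY', 'I-CITY', 'B-STATE']
```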

In [6]:
df = Preprocess.remove_non_ascii(df, column_name='text')
df = Preprocess.remove_usertag(df, column_name='text')
df = Preprocess.reformat_hashtag(df, column_name='text')
df = Preprocess.remove_prefix(df, df_type="train", text_column='text')
df = Preprocess.reformat_useless_char(df, column_name='text')
df.head(60)

Unnamed: 0,tweet_id,text,location,location_mentions
0,ID_1001136212718088192,EllicottCity is known for its vibrant art scene.,EllicottCity,EllicottCity=>CITY
1,ID_1001136696589631488,Flash floods struck a Maryland city on Sunday washing out streets and tossing cars like bath toys.,Maryland,Maryland=>STATE
2,ID_1001136950345109504,State of emergency declared for Maryland flooding: via,Maryland,Maryland=>STATE
3,ID_1001137334056833024,Other parts of Maryland also saw significant damage from Sundays storms including this Baltimore city neighborhood Dundalk and Catonsville. Rain totals spanned from 1 to 10 inches across Maryland: ECFlood,Baltimore Maryland,Baltimore=>CITY * Maryland=>STATE
4,ID_1001138374923579392,Catastrophic Flooding Slams Ellicott City Maryland Water Rescues Reported - The Weather Channel via,Ellicott City Maryland,Ellicott City=>CITY * Maryland=>STATE
5,ID_1001138377717157888,1 missing after flash FLOODING devastates Ellicott City Maryland GPWX,Ellicott City Maryland,Ellicott City=>CITY * Maryland=>STATE
6,ID_1001139323075416064,The scenic spots in Ellicott City Maryland are perfect for photography.,Ellicott City Maryland,Ellicott City=>CITY * Maryland=>STATE
7,ID_1001140017207459840,Maryland has a variety of historical landmarks.,Maryland,Maryland=>STATE
8,ID_1001140276377935872,The local food markets in Maryland are a feast for the senses.,Maryland,Maryland=>STATE
9,ID_1001140804503601152,Baltimore is a popular destination for both relaxation and adventure.,Baltimore,Baltimore=>CITY


In [7]:
lemma_path = "../data/new/train.encoded2.lemma.csv"
if not os.path.exists(lemma_path):
    df_ = Preprocess.remove_stop_words(df, column_name='text', new_col="text_transformed", transformation=[
        "tokenize", "lemma-", "lower"
    ], save_in=lemma_path)
else:
    df_ = pd.read_csv(lemma_path)

In [8]:
# Substitution: replace the raw text with the transformed text

df = df_.drop(columns=['text'])
df = df.rename(columns={'text_transformed': 'text'})
df = df.dropna(subset=['text'])

In [9]:
print(df.shape)
print(df.isnull().sum())

(43375, 4)
tweet_id                0
location                0
location_mentions    1831
text                    0
dtype: int64


- Train dev split

In [10]:
df_idx, ner_classes = MultiLabelNERStratify.process_location_mentions(df)
train_idx, test_idx, train_label_freq, test_label_freq = MultiLabelNERStratify.stratify_train_test_split_multi_label(
    df_idx.tweet_id, 
    np.vstack(df_idx.location_array_freq.values), 
    test_size=0.2
)

# Filter the original DataFrame based on the tweet_id column
train_idx_list = train_idx.tolist() if hasattr(train_idx, 'tolist') else list(train_idx)
test_idx_list = test_idx.tolist() if hasattr(test_idx, 'tolist') else list(test_idx)
df_train = df[df['tweet_id'].isin(train_idx_list)]
df_dev   = df[df['tweet_id'].isin(test_idx_list)]

# print the split sizes
print("TRAIN SHAPE: ", df_train.shape)
print("DEV SHAPE: ", df_dev.shape)

TRAIN SHAPE:  (34123, 4)
DEV SHAPE:  (9252, 4)
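
`MultiLabelNERStratify` is a custom utility whose code is not shown in this notebook. As a simplified stand-in, and not a reproduction of it, the sketch below groups tweets by the set of entity types they contain and splits each group with the same ratio. The function name `stratified_split` and the grouping rule are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(mentions_by_id, test_size=0.2, seed=0):
    """Split tweet ids so each label combination keeps roughly the same ratio.

    mentions_by_id maps tweet_id -> "Name=>TYPE * Name=>TYPE" (or None).
    Simplified stand-in, not the notebook's MultiLabelNERStratify.
    """
    groups = defaultdict(list)
    for tweet_id, mentions in mentions_by_id.items():
        types = set()
        if mentions:
            for part in mentions.split(" * "):
                types.add(part.rsplit("=>", 1)[1].strip())
        groups[tuple(sorted(types))].append(tweet_id)

    rng = random.Random(seed)
    train_ids, test_ids = [], []
    for key in sorted(groups):
        ids = groups[key]
        rng.shuffle(ids)
        n_test = round(len(ids) * test_size)
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return train_ids, test_ids


data = {f"ID_{i}": "Maryland=>STATE" for i in range(10)}
data.update({f"ID_{i}": "Baltimore=>CITY * Maryland=>STATE" for i in range(10, 20)})
train_ids, test_ids = stratified_split(data)
print(len(train_ids), len(test_ids))  # 16 4
```

Grouping by exact label combination is cruder than a frequency-based multi-label stratification, and rare combinations with very few tweets may end up entirely in the training split.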


In [11]:
df_tag_train = Preprocess.build_bilou_encoding(df_train, text_col="text", save_in="../data/new/train.encoded.bilou.tag.csv")
df_tag_dev   = Preprocess.build_bilou_encoding(df_dev, text_col="text", save_in="../data/new/dev.encoded.bilou.tag.csv")

df_tag_train.head(5)

Unnamed: 0,sentence_id,words,labels
0,ID_1001136212718088192,ellicottcity,U-CITY
1,ID_1001136212718088192,is,O
2,ID_1001136212718088192,known,O
3,ID_1001136212718088192,for,O
4,ID_1001136212718088192,its,O


In [12]:
grouped = df_tag_train.groupby('sentence_id').agg({
    'words': lambda x: ' '.join(x),
    'labels': lambda x: ' '.join(x)
}).reset_index()
with open('../data/new/train.preprocessed_sentences.txt', 'w') as f_sentences:
    for sentence in grouped['words']:
        f_sentences.write(sentence + '\n')
with open('../data/new/train.preprocessed_labels.txt', 'w') as f_labels:
    for label in grouped['labels']:
        f_labels.write(label + '\n')

grouped = df_tag_dev.groupby('sentence_id').agg({
    'words': lambda x: ' '.join(x),
    'labels': lambda x: ' '.join(x)
}).reset_index()
with open('../data/new/dev.preprocessed_sentences.txt', 'w') as f_sentences:
    for sentence in grouped['words']:
        f_sentences.write(sentence + '\n')
with open('../data/new/dev.preprocessed_labels.txt', 'w') as f_labels:
    for label in grouped['labels']:
        f_labels.write(label + '\n')

### **Prepare training, dev and test data**

In [13]:
df_tag_train["sentence_id"] = LabelEncoder().fit_transform(df_tag_train["sentence_id"])
df_tag_dev["sentence_id"]   = LabelEncoder().fit_transform(df_tag_dev["sentence_id"])

In [14]:
df_tag_train.head()

Unnamed: 0,sentence_id,words,labels
0,0,ellicottcity,U-CITY
1,0,is,O
2,0,known,O
3,0,for,O
4,0,its,O


In [15]:
X_train  = df_tag_train[["sentence_id", "words"]]
X_test   = df_tag_dev[["sentence_id", "words"]]
y_train  = df_tag_train["labels"]
y_test   = df_tag_dev["labels"]

train_data = pd.DataFrame({"sentence_id": X_train["sentence_id"], "words": X_train["words"], "labels": y_train})
test_data = pd.DataFrame({"sentence_id": X_test["sentence_id"], "words": X_test["words"], "labels": y_test})

train_data

Unnamed: 0,sentence_id,words,labels
0,0,ellicottcity,U-CITY
1,0,is,O
2,0,known,O
3,0,for,O
4,0,its,O
...,...,...,...
406450,34121,fellow,O
406451,34121,people,O
406452,34121,in,O
406453,34121,mexico,U-COUNTRY


#### **Model Training**

- Let's count the NER labels

In [16]:
label = pd.concat([df_tag_train, df_tag_dev])["labels"].unique().tolist()
label_counts = pd.concat([df_tag_train, df_tag_dev])["labels"].value_counts().reset_index()
label_counts.columns = ["Label", "Frequency"]
display(label_counts)

Unnamed: 0,Label,Frequency
0,O,455787
1,U-STATE,18685
2,U-COUNTRY,11105
3,U-CITY,7490
4,B-ISLAND,3029
5,L-ISLAND,3029
6,B-STATE,2061
7,L-STATE,2061
8,L-COUNTRY,2053
9,B-COUNTRY,2053


- Let's define the model **Args** and the hyperparameter optimisation approach

In [17]:
model_args = NERArgs()

# general
model_args.evaluate_during_training = True
model_args.overwrite_output_dir = True
model_args.train_batch_size = 64
model_args.eval_batch_size = 32
model_args.labels_list = label
model_args.use_multiprocessing = True
model_args.num_train_epochs = 1
model_args.learning_rate = 4e-4

# for early stopping
# model_args.use_early_stopping = True
# model_args.early_stopping_delta = 0.01
# model_args.early_stopping_metric = "wer"
# model_args.early_stopping_metric_minimize = False
# model_args.early_stopping_patience = 5
# model_args.wandb_project = "LMR-IO"
model_args.evaluate_during_training_steps = 1000

- Train

In [19]:
model = NERModel(
    "bert", 
    "rsuwaileh/IDRISI-LMR-EN-random-typebased", #bert-base-cased - #bert-large-uncased
    use_cuda=False,
    args=model_args, 
    ignore_mismatched_sizes=True
)

# Train the model
print('\n### TRAINING')
model.train_model(
    train_data, 
    eval_data=test_data, 
    wer=LMR_Metrics.wer_type
)

Some weights of the model checkpoint at rsuwaileh/IDRISI-LMR-EN-random-typebased were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at rsuwaileh/IDRISI-LMR-EN-random-typebased and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([49, 1024]) in the checkpoint and torch.Size([48, 1024]) in the model instantia



### TRAINING



- Eval

In [19]:
# Evaluate the model
#print('\n### EVALUATION')
#result, model_outputs, wrong_preds = model.eval_model(test_data, wer=LMR_Metrics.wer_type)

In [20]:
#result

- Quick prediction

In [21]:
predictions, raw_outputs = model.predict([
    "Elicott City, Maryland, struck by catastrophic flooding; 1 missing.".lower(),
    "Memorial Day weekend floods ravage Maryland town".lower()
])


In [22]:
predictions

[[{'elicott': 'B-CITY'},
  {'city,': 'I-CITY'},
  {'maryland,': 'B-STATE'},
  {'struck': 'O'},
  {'by': 'O'},
  {'catastrophic': 'O'},
  {'flooding;': 'O'},
  {'1': 'O'},
  {'missing.': 'O'}],
 [{'memorial': 'O'},
  {'day': 'O'},
  {'weekend': 'O'},
  {'floods': 'O'},
  {'ravage': 'O'},
  {'maryland': 'B-STATE'},
  {'town': 'O'}]]
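
The output above is a list of sentences, each a list of single-key dictionaries mapping a token to its tag. A possible way to turn that structure into typed mentions is sketched below. The helper `extract_mentions` is hypothetical; it strips surrounding punctuation from tokens and merges a `B-` token with the `I-` or `L-` tokens of the same type that follow it.

```python
import string


def extract_mentions(sentence_prediction):
    """Group tagged tokens into (text, type) mentions (hypothetical helper).

    sentence_prediction is one sentence from the prediction list: a list of
    single-key dicts such as {'maryland,': 'B-STATE'}.
    """
    mentions, current_words, current_type = [], [], None
    for token in sentence_prediction:
        word, tag = next(iter(token.items()))
        word = word.strip(string.punctuation)
        prefix, _, etype = tag.partition("-")
        starts_new = prefix in ("B", "U") or etype != current_type
        if tag == "O" or starts_new:
            # Close the mention that was being built, if any.
            if current_words:
                mentions.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if tag != "O" and word:
            current_words.append(word)
            current_type = etype
    if current_words:
        mentions.append((" ".join(current_words), current_type))
    return mentions


sentence = [{'elicott': 'B-CITY'}, {'city,': 'I-CITY'},
            {'maryland,': 'B-STATE'}, {'struck': 'O'}]
print(extract_mentions(sentence))
# [('elicott city', 'CITY'), ('maryland', 'STATE')]
```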

### **Make prediction for Context**

In [23]:
df_context = pd.read_csv('../data/provided/Test.csv')
df_context = Preprocess.remove_non_ascii(df_context, column_name='text')
df_context = Preprocess.remove_usertag(df_context, column_name='text')
df_context = Preprocess.reformat_hashtag(df_context, column_name='text')
df_context = Preprocess.remove_prefix(df_context, df_type="test", text_column='text')
df_context = Preprocess.reformat_useless_char(df_context, column_name='text')
df_context.head(10)

Unnamed: 0,tweet_id,text
0,ID_1001154804658286592,What is happening to the infrastructure in New England? It isnt global warming its misappropriated funds being abused that shouldve been used maintaining their infrastructure that couldve protected them from floods! Like New Orleans. Their mayor went to 7Maryland floods
1,ID_1001155505459486720,SOLDER MISSING IN FLOOD.. PRAY FOR EDDISON HERMOND! PRAY FOR ELLICOTT CITY MARYLAND!
2,ID_1001155756371136512,Police searching for missing person after devastating 1000-year flood in Ellicott City Maryland
3,ID_1001159445194399744,Flash Flood Tears Through Maryland Town For Second Time In Two Years Less than two years after what had been called a once in a 1000 years flood in 2016 Ellicott City Md. sees its historic downtown ravaged anew. One man remains missing. from Flas
4,ID_1001164907587538944,Ellicott City FLOODING Pictures: Maryland Governor Declares State of Emergency After Severe Flash FLOODS
5,ID_1001178904617476096,Our Harts gos out to a Fellow Soldier missing in Maryland as he was HELPING Others the fast moving Water Consuming him. Well all on the island are praying for the missing Soldier in Maryland.
6,ID_1001179909245587456,CRAZY VIDEO. Roaring flash floods struck a Maryland city Sunday that had been wracked by similar devastation two years ago its main street turned into a raging river that reached the first floor of some buildings and swept away parked cars authorities and witnesses say.
7,ID_1001180876548591616,I liked a video BREAKING: Devastating flooding strikes Ellicott City Maryland
8,ID_1001182906130280448,Thank you to the first responders who are taking swift action to aid the Ellicott City community. The entire Maryland Delegation is working with to bring in federal resources. If youre in the area please follow the guidance of local authorities and .
9,ID_1001185240256311296,Ellicott City floods: Maryland officials assess destruction search for missing man in 1000-year flood. via


In [24]:
lemma_path = "../data/new/test.encoded2.lemma.csv"
if not os.path.exists(lemma_path):
    df_context = Preprocess.remove_stop_words(df_context, column_name='text', new_col="text_transformed", transformation=[
        "tokenize", "lemma-", "lower"
    ], save_in=lemma_path)
else:
    df_context = pd.read_csv(lemma_path)

100%|██████████| 2942/2942 [04:54<00:00,  9.97it/s]  


In [25]:
df_context.isnull().sum()

tweet_id            0
text                0
text_transformed    0
dtype: int64

In [26]:
ids = df_context["tweet_id"].values
tweets = df_context["text_transformed"].values

# Make prediction
predictions, raw_outputs = model.predict(tweets)

# Save submission file
results = []
for sentence in predictions:
    result = " ".join([word for d in sentence for word, tag in d.items() if tag != 'O'])
    if result == "":
        result = " "
    results.append(result)


In [27]:
Predictions.to_csv(ids, results)

Saved predictions to ../submissions/submission_14.csv


In [40]:
# Some Quick postprocessing
def remove_duplicate_words(file_path):
    df = pd.read_csv(file_path)
    def remove_duplicates(location):
        if pd.isna(location):
            return location
        words = location.split()
        unique_words = list(dict.fromkeys(words))
        loc = ' '.join(unique_words)
        return loc if loc != '' else ' '
    df['location'] = df['location'].apply(remove_duplicates)
    return df

# Usage
df_cleaned = remove_duplicate_words('../submissions/submission_13.csv')
df_cleaned.to_csv('../submissions/submission_13_post.csv', index=False)

In [None]:
### END