<div style="padding: 0.5em; background-color: #1876d1; color: #fff; font-weight: bold; font-size: 1.4em;">
    [Approach 3]  Location Mention Recognition - NER BERT Transformer
</div>

In this Jupyter notebook, we use Named Entity Recognition (NER) to extract location mentions from tweets posted on X (formerly Twitter) during emergency situations.

Notes:
* Do NER
* Try a BERT model
* Extract location mentions
* The BIO format is very specific. It requires understanding whether a token is at the “Beginning” (B), “Inside” (I), or “Outside” (O) of a particular entity label. This would add unnecessary complexity when formulating a task description prompt. The idea is to try an IO formatting approach instead.
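
To make the IO idea concrete, here is a minimal sketch of collapsing BIO tags into IO tags. The helper name `bio_to_io` is ours and only illustrates the general scheme; it is not necessarily what `Preprocess.build_io_encoding` does internally.

```python
def bio_to_io(tags):
    """Collapse BIO tags to IO: 'B-X' and 'I-X' both become 'I-X'; 'O' is unchanged."""
    return ["I-" + tag[2:] if tag[:2] in ("B-", "I-") else tag for tag in tags]


bio = ["B-CITY", "I-CITY", "O", "B-STATE"]
io = bio_to_io(bio)
print(io)  # ['I-CITY', 'I-CITY', 'O', 'I-STATE']
```

The trade-off is that IO can no longer separate two adjacent entities of the same type, since the "B-" boundary marker is gone.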

---
<b>#Microsoft Learn Challenge, #Zindi, #Hamad Bin Khalifa University </b>

### **Importing Libraries**

In [1]:
#!pip install simpletransformers
#!pip install pyspellchecker
#!pip install stanza
#!pip install nltk
#!pip install python-dotenv
#!pip install werpy
#!pip install wandb

In [2]:
# general utils
import werpy
import numpy as np
import pandas as pd
import seaborn as sns
import stanza, os, sys
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from simpletransformers.ner import NERModel, NERArgs
pd.set_option('display.max_colwidth', 300)

# utils setup
current_directory = os.getcwd()
root_directory = os.path.abspath(os.path.join(current_directory, os.pardir))
sys.path.append(root_directory)

# logging
import wandb
os.environ["WANDB_NOTEBOOK_NAME"] = "transformers_3_BIO_and_IO.ipynb"

# custom utils
from utils.io import Predictions
from utils.metrics import LMR_Metrics
from utils.io import LMR_BILOU_Scrapper, LMR_JSON_Scrapper
from utils.preprocessing import Preprocess

### **Exploring Data**

The provided Train.csv contains many missing values, so we have to get the data from the original source.

In [3]:
#LMR_JSON_Scrapper(output_dir="../data/self_scrapped/raw").run()

- Let's concatenate our datasets

In [33]:
train_dfs = []
dev_dfs   = []
test_dfs  = []
path_dfs  = "../data/self_scrapped/raw"
for filename in os.listdir(path_dfs):
    if filename.endswith(".csv"):
        file_path = os.path.join(path_dfs, filename)
        if filename.startswith("train"):
            df = pd.read_csv(file_path)
            train_dfs.append(df)
        elif filename.startswith("dev"):
            df = pd.read_csv(file_path)
            dev_dfs.append(df)
        elif filename.startswith("test_unlabeled"):
            df = pd.read_csv(file_path)
            test_dfs.append(df)

df_train = pd.concat(train_dfs, ignore_index=True) if train_dfs else pd.DataFrame()
df_test  = pd.concat(test_dfs, ignore_index=True) if test_dfs else pd.DataFrame()
df_dev   = pd.concat(dev_dfs, ignore_index=True) if dev_dfs else pd.DataFrame()

print("TRAIN SHAPE: ", df_train.shape)
print("TEST  SHAPE: ", df_test.shape)
print("DEV   SHAPE: ", df_dev.shape)

TRAIN SHAPE:  (14392, 3)
TEST  SHAPE:  (4066, 3)
DEV   SHAPE:  (2056, 3)


In [31]:
df_train = Preprocess.build_bilou_encoding(df_train, text_col="text", save_in=None)
df_dev = Preprocess.build_bilou_encoding(df_dev, text_col="text", save_in=None)

In [32]:
# Train
train_dataset = df_train.groupby('sentence_id').agg({
    'words': lambda x: ' '.join(x),
    'labels': lambda x: ' '.join(x)
}).reset_index()
train_dataset = train_dataset.drop(columns=['sentence_id'])
train_dataset.to_csv('../data/new/kaggle/train_bilou.csv', index=False)

# Dev
test_dataset = df_dev.groupby('sentence_id').agg({
    'words': lambda x: ' '.join(x),
    'labels': lambda x: ' '.join(x)
}).reset_index()
test_dataset = test_dataset.drop(columns=['sentence_id'])
test_dataset.to_csv('../data/new/kaggle/dev_bilou.csv', index=False)

In [11]:
df_

Unnamed: 0,sentence_id,words,labels
0,ID_1001136212718088192,EllicottCity,B-CITY
1,ID_1001136212718088192,is,O
2,ID_1001136212718088192,known,O
3,ID_1001136212718088192,for,O
4,ID_1001136212718088192,its,O
...,...,...,...
532265,ID_916205068281462784,fellow,O
532266,ID_916205068281462784,people,O
532267,ID_916205068281462784,in,O
532268,ID_916205068281462784,Mexico,B-COUNTRY


In [12]:
df_tag_train = df_

- We observe that the sentences contain hashtags, non-ASCII characters, stop words, ...; we have to clean the data.

In [None]:
df_train.head(5)

Unnamed: 0,tweet_id,text,location_mentions
0,ID_1022420413882744832,Nearly half of #houses checked in #fire-stricken areas deemed #uninhabitable #GO #PrayForGreece #PrayForAthens #AthensFires ἞C἟7,
1,ID_1021778661895294976,RT @anadoluagency: #Greece: Death toll from wildfires hits 74,Greece=>COUNTRY
2,ID_1022015997740503042,When the essence of cooperation meets the sad reality of lifeThe IPA partner country offers financial aid to Greece to handle disaster @InterregIPACBC #Greecefires,Greece=>COUNTRY
3,ID_1022557424585240576,We are live from the Lureio Idrima the orphanage and nursing home operared by the nuns of the Holy Trinity Monastery that was destroyed by the fire in Neos Voutzas. Here too the scene is apocalyptic.,Holy Trinity Monastery=>HUMAN-MADE POINT-OF-INTEREST * Neos Voutzas=>NEIGHBORHOOD
4,ID_1021749412639457280,RT @AP: Greek prime minister declares 3-day national mourning period for dozens killed by wildfires near Athens.,Athens=>CITY


- There are some sentences without a location mention. We need to look closer. It could be normal if there is no corresponding location found in the tweet, or it might be an error from the labeling task. Note that for the test set, it is normal for all location_mentions to be NaN. (😎 Yeah, we have to predict this value).

In [None]:
print(df_train.isnull().sum())
print(df_test.isnull().sum())
print(df_dev.isnull().sum())

tweet_id                0
text                    0
location_mentions    4026
dtype: int64
tweet_id                0
text                    0
location_mentions    4066
dtype: int64
tweet_id               0
text                   0
location_mentions    573
dtype: int64
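
Before deciding whether an empty `location_mentions` is a labeling error, it helps to parse the column into structured pairs. The sketch below assumes the `Text=>TYPE * Text=>TYPE` layout seen in the sample rows above; the helper name `parse_location_mentions` is hypothetical, and missing values are treated as "no mention".

```python
import math


def parse_location_mentions(raw):
    """Split 'Text=>TYPE * Text=>TYPE' into a list of (text, type) pairs."""
    # Missing values arrive as NaN (a float) after pd.read_csv, or as None / empty string.
    if raw is None or (isinstance(raw, float) and math.isnan(raw)) or not str(raw).strip():
        return []
    pairs = []
    for chunk in str(raw).split(" * "):
        text, sep, loc_type = chunk.rpartition("=>")
        if sep:  # skip malformed chunks that contain no '=>'
            pairs.append((text.strip(), loc_type.strip()))
    return pairs


print(parse_location_mentions("Holy Trinity Monastery=>HUMAN-MADE POINT-OF-INTEREST * Neos Voutzas=>NEIGHBORHOOD"))
print(parse_location_mentions(float("nan")))
```

Applied with `df_train["location_mentions"].apply(parse_location_mentions)`, this would give one list per tweet, so tweets without mentions can be inspected by hand.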


### **Preprocessing Data**

- Remove special characters
- Handle hashtags and user tags
- Remove stop words
- Tokenization
- Stemming
- BIO Tagging

##### **<> BIO Tagging**

BIO stands for Begin, Inside, and Outside. It’s a method for tagging tokens (words or subwords) in a sequence to identify entities within the text. Each token in the text is assigned a tag that indicates whether it is at the beginning of an entity, inside an entity, or outside of any entity.
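
As an illustration of the definition above, here is a minimal BIO tagger working on whitespace tokens and exact matches. The function `bio_tag` is a hypothetical helper written for this example; the actual encoding in `utils.preprocessing.Preprocess` may tokenize and match differently.

```python
def bio_tag(tokens, mentions):
    """Tag tokens with BIO labels given (mention_text, entity_type) pairs."""
    labels = ["O"] * len(tokens)
    for mention, entity_type in mentions:
        span = mention.split()
        n = len(span)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            # Tag only exact matches that do not overlap an already tagged span.
            if window == span and all(lab == "O" for lab in labels[start:start + n]):
                labels[start] = "B-" + entity_type
                for k in range(start + 1, start + n):
                    labels[k] = "I-" + entity_type
    return labels


tokens = "fire near Neos Voutzas and Athens".split()
labels = bio_tag(tokens, [("Neos Voutzas", "NEIGHBORHOOD"), ("Athens", "CITY")])
for token, label in zip(tokens, labels):
    print(token, label)
```

The first token of each mention gets `B-`, the following tokens get `I-`, and everything else stays `O`.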

In [None]:
# TRAIN
train_path = "../data/transformed/train.io2-tag.csv"
if not os.path.exists(train_path):
    df_train = Preprocess.remove_non_ascii(df_train, column_name='text')
    # df_train = Preprocess.remove_usertag(df_train, column_name='text')
    # df_train = Preprocess.reformat_hashtag(df_train, column_name='text')
    # df_train = Preprocess.remove_stop_words(df_train, column_name='text', new_col="text_transformed", transformation=[
    #     "tokenize", "lemma", "lower"], save_in="../data/transformed/train.lemma.csv")
    df_tag_train = Preprocess.build_io_encoding(df_train, text_col="text", save_in=train_path)
else:
    df_tag_train = pd.read_csv(train_path)

In [13]:
# DEV
dev_path = "../data/transformed/dev.io2-tag.csv"
if not os.path.exists(dev_path):
    df_dev = Preprocess.remove_non_ascii(df_dev, column_name='text')
    # df_dev = Preprocess.remove_usertag(df_dev, column_name='text')
    # df_dev = Preprocess.reformat_hashtag(df_dev, column_name='text')
    # df_dev = Preprocess.remove_stop_words(df_dev, column_name='text', new_col="text_transformed", transformation=[
    #     "tokenize", "lemma", "lower"], save_in="../data/transformed/dev.lemma.csv")
    df_tag_dev = Preprocess.build_io_encoding(df_dev, text_col="text", save_in=dev_path)
else:
    df_tag_dev = pd.read_csv(dev_path)

In [14]:
# TEST
test_path = "../data/transformed/test.lemma.csv"
if not os.path.exists(test_path):
    df_test = Preprocess.remove_non_ascii(df_test, column_name='text')
    df_test = Preprocess.remove_usertag(df_test, column_name='text')
    df_test = Preprocess.reformat_hashtag(df_test, column_name='text')
    df_test = Preprocess.remove_stop_words(df_test, column_name='text', new_col="text_transformed", transformation=[
        "tokenize", "lemma", "lower"], save_in=test_path)
else:
    df_test = pd.read_csv(test_path)

In [None]:
df_tag_train.head(30)

Unnamed: 0,sentence_id,words,labels
0,ID_1022420413882744832,Nearly,O
1,ID_1022420413882744832,half,O
2,ID_1022420413882744832,of,O
3,ID_1022420413882744832,houses,O
4,ID_1022420413882744832,checked,O
5,ID_1022420413882744832,in,O
6,ID_1022420413882744832,fire,O
7,ID_1022420413882744832,stricken,O
8,ID_1022420413882744832,areas,O
9,ID_1022420413882744832,deemed,O


### **Prepare training, dev and test data**

In [20]:
df_tag_train["sentence_id"] = LabelEncoder().fit_transform(df_tag_train["sentence_id"])
df_tag_dev["sentence_id"]   = LabelEncoder().fit_transform(df_tag_dev["sentence_id"])

In [23]:
df_tag_train.head()

Unnamed: 0,sentence_id,words,labels
0,0,EllicottCity,B-CITY
1,0,is,O
2,0,known,O
3,0,for,O
4,0,its,O


In [24]:
X_train  = df_tag_train[["sentence_id", "words"]]
X_test   = df_tag_dev[["sentence_id", "words"]]
y_train  = df_tag_train["labels"]
y_test   = df_tag_dev["labels"]

train_data = pd.DataFrame({"sentence_id": X_train["sentence_id"], "words": X_train["words"], "labels": y_train})
test_data = pd.DataFrame({"sentence_id": X_test["sentence_id"], "words": X_test["words"], "labels": y_test})

train_data

Unnamed: 0,sentence_id,words,labels
0,0,EllicottCity,B-CITY
1,0,is,O
2,0,known,O
3,0,for,O
4,0,its,O
...,...,...,...
532265,43459,fellow,O
532266,43459,people,O
532267,43459,in,O
532268,43459,Mexico,B-COUNTRY


#### **Model Training**

- Let's count the NER labels

In [25]:
label = pd.concat([df_tag_train, df_tag_dev])["labels"].unique().tolist()
label_counts = pd.concat([df_tag_train, df_tag_dev])["labels"].value_counts().reset_index()
label_counts.columns = ["Label", "Frequency"]
display(label_counts)

Unnamed: 0,Label,Frequency
0,O,511322
1,B-STATE,22004
2,B-COUNTRY,13583
3,B-CITY,9933
4,B-ISLAND,6386
5,I-ISLAND,3287
6,I-COUNTRY,1873
7,I-CITY,1851
8,I-STATE,1574
9,B-COUNTY,663


- Let's define the model **Args** and the hyperparameter optimisation approach

In [None]:
# hyperparameters

sweep_config = {
    "method": "bayes",  # grid, random
    "metric": {"name": "wer", "goal": "minimize"},
    "parameters": {
        "num_train_epochs": {"values": [1, 2]},
        # "learning_rate": {"min": 5e-5, "max": 4e-4},
    },
}
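
The sweep minimizes a metric called `wer`. The implementation behind `LMR_Metrics.wer_type` is not shown in this notebook, so the sketch below only assumes the textbook word error rate: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. The function name `word_error_rate` is ours.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    # Guard against an empty reference to avoid dividing by zero.
    return prev[-1] / max(len(ref), 1)


print(word_error_rate("Ellicott City Maryland", "Ellicott City Maryland"))
print(word_error_rate("Ellicott City Maryland", "Maryland"))
```

A perfect prediction scores 0, and missing two of three reference words scores 2/3, which is why the sweep goal is set to `minimize`.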

- Initialize a W&B sweep with the config defined earlier.

In [None]:
sweep_id = wandb.sweep(sweep_config, project="LMR-IO2")
#%%capture
# wandb.init(project="LMR", name="Location-Mention-Recognition")

Create sweep with ID: qf80j295
Sweep URL: https://wandb.ai/genereux-akotenou-local/LMR-IO2/sweeps/qf80j295


- Model args

In [None]:
model_args = NERArgs()

# general
model_args.evaluate_during_training = True
model_args.overwrite_output_dir = True
model_args.train_batch_size = 32
model_args.eval_batch_size = 16
model_args.labels_list = label
model_args.use_multiprocessing = True
model_args.wandb_project = "LMR-IO2"

# for early stopping
# model_args.use_early_stopping = True
# model_args.early_stopping_delta = 0.01
# model_args.early_stopping_metric = "wer"
# model_args.early_stopping_metric_minimize = False
# model_args.early_stopping_patience = 3
model_args.evaluate_during_training_steps = 1000

In [None]:
def train_eval():
    wandb.init(name="Location-Mention-Recognition")
    model = NERModel(
        "bert", 
        "bert-base-cased", #bert-large-uncased
        use_cuda=False,
        args=model_args, 
        sweep_config=wandb.config)

    # Train the model
    print('### TRAINING')
    # train_data1, _ = train_test_split(train_data, test_size=0.99998)
    model.train_model(
        train_data, 
        eval_data=test_data, 
        wer=LMR_Metrics.wer_type
    )
    
    # Evaluate the model
    print('### EVALUATION')
    result, model_outputs, wrong_preds = model.eval_model(test_data, wer=LMR_Metrics.wer_type)

    # Log metrics to wandb
    wandb.log({"eval_result": result, "model_outputs": model_outputs})

    # Finish the wandb run (wandb.join() is deprecated in favor of wandb.finish())
    wandb.finish()

In [None]:
#%%capture
wandb.agent(sweep_id, train_eval)

[34m[1mwandb[0m: Agent Starting Run: z6i5a9zx with config:
[34m[1mwandb[0m: 	learning_rate: 0.00023034296211370871
[34m[1mwandb[0m: 	num_train_epochs: 3


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### TRAINING


[34m[1mwandb[0m: Ctrl + C detected. Stopping sweep.


- Let's define the model **Args** again, this time with fixed hyperparameters instead of a sweep

In [26]:
model_args = NERArgs()

# general
model_args.evaluate_during_training = True
model_args.overwrite_output_dir = True
model_args.train_batch_size = 64
model_args.eval_batch_size = 32
model_args.labels_list = label
model_args.use_multiprocessing = True
model_args.num_train_epochs = 2
model_args.learning_rate = 4e-4

# for early stopping
# model_args.use_early_stopping = True
# model_args.early_stopping_delta = 0.01
# model_args.early_stopping_metric = "wer"
# model_args.early_stopping_metric_minimize = False
# model_args.early_stopping_patience = 5
# model_args.wandb_project = "LMR-IO"
model_args.evaluate_during_training_steps = 1000

In [10]:
import pandas as pd
from sklearn.model_selection import train_test_split

def stratified_split_ner(df, entity_col="labels", test_size=0.2):
    # Create a column flagging whether the sentence contains any entity tag
    df['has_entity'] = df[entity_col].apply(lambda x: int(any(tag.startswith('B-') or tag.startswith('I-') for tag in x.split())))
    
    # Perform a stratified split based on entity presence
    print(df)
    train, test = train_test_split(df, test_size=test_size, stratify=df['has_entity'])
    
    return train, test

# Usage
df = pd.DataFrame({
    'sentence_id': ['ID_1', 'ID_2', 'ID_3', 'ID_4'],
    'words': [['I', 'love', 'Paris'], ['He', 'went', 'to', 'London'], ['This', 'is', 'Berlin'], ['New', 'York', 'City']],
    'labels': ['O O B-LOCATION', 'O O O B-LOCATION', 'O O B-LOCATION', 'B-LOCATION I-LOCATION I-LOCATION']
})

train, test = stratified_split_ner(df)


  sentence_id                   words                            labels  \
0        ID_1        [I, love, Paris]                    O O B-LOCATION   
1        ID_2  [He, went, to, London]                  O O O B-LOCATION   
2        ID_3      [This, is, Berlin]                    O O B-LOCATION   
3        ID_4       [New, York, City]  B-LOCATION I-LOCATION I-LOCATION   

   has_entity  
0           1  
1           1  
2           1  
3           1  


In [8]:
train

Unnamed: 0,sentence_id,words,labels,has_entity
1,ID_2,"[He, went, to, London]",O O O B-LOCATION,1
0,ID_1,"[I, love, Paris]",O O B-LOCATION,1
3,ID_4,"[New, York, City]",B-LOCATION I-LOCATION I-LOCATION,1


In [9]:
test

Unnamed: 0,sentence_id,words,labels,has_entity
2,ID_3,"[This, is, Berlin]",O O B-LOCATION,1


- Train

In [27]:
model = NERModel(
    "bert", 
    "bert-base-cased", #bert-large-uncased
    use_cuda=False,
    args=model_args, 
)

# Train the model
print('\n### TRAINING')
model.train_model(
    train_data, 
    eval_data=test_data, 
    wer=LMR_Metrics.wer_type
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



### TRAINING


  0%|          | 0/17 [00:00<?, ?it/s]

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Running Epoch 1 of 2:   0%|          | 0/680 [00:00<?, ?it/s]

  0%|          | 0/5 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/65 [00:00<?, ?it/s]

Running Epoch 2 of 2:   0%|          | 0/680 [00:00<?, ?it/s]

  0%|          | 0/5 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/65 [00:00<?, ?it/s]

  0%|          | 0/5 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/65 [00:00<?, ?it/s]

(1360,
 defaultdict(list,
             {'global_step': [680, 1000, 1360],
              'train_loss': [0.013492707163095474,
               0.08003070205450058,
               0.01464491244405508],
              'eval_loss': [0.44833912138755505,
               0.4386223302437709,
               0.4364334000990941],
              'precision': [0.652946273830156,
               0.706430568499534,
               0.7135036496350365],
              'recall': [0.732620320855615,
               0.7369956246961594,
               0.7603305785123967],
              'f1_score': [0.6904925544100802,
               0.7213894837021175,
               0.736173217227583],
              'wer': [0.8983337382489047,
               0.7721650940722028,
               0.8305438006735043]}))
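
As a quick sanity check on the numbers above, the reported `f1_score` at the last step should be the harmonic mean of the reported precision and recall. The values below are copied from the training output.

```python
# Values copied from the last evaluation step (global_step 1360) above.
precision = 0.7135036496350365
recall = 0.7603305785123967

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))
```

Note also that `wer` is lowest at step 1000 (about 0.772) and rises again at step 1360, so the final checkpoint is not the best one by that metric.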

- Eval

In [None]:
# Evaluate the model
#print('\n### EVALUATION')
#result, model_outputs, wrong_preds = model.eval_model(test_data, wer=LMR_Metrics.wer_type)

  0%|          | 0/5 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/65 [00:00<?, ?it/s]

In [None]:
#result

{'eval_loss': 0.06285895355618917,
 'precision': 0.8029925187032418,
 'recall': 0.813953488372093,
 'f1_score': 0.8084358523725833,
 'wer': 0.41041911334521874}

- Quick prediction

In [28]:
predictions, raw_outputs = model.predict([
    "Elicott City, Maryland, struck by catastrophic flooding; 1 missing.",
    "Memorial Day weekend floods ravage Maryland town"
])

  0%|          | 0/1 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

In [29]:
predictions

[[{'Elicott': 'O'},
  {'City,': 'O'},
  {'Maryland,': 'B-STATE'},
  {'struck': 'O'},
  {'by': 'O'},
  {'catastrophic': 'O'},
  {'flooding;': 'O'},
  {'1': 'O'},
  {'missing.': 'O'}],
 [{'Memorial': 'O'},
  {'Day': 'O'},
  {'weekend': 'O'},
  {'floods': 'O'},
  {'ravage': 'O'},
  {'Maryland': 'B-STATE'},
  {'town': 'O'}]]

### **Make prediction for Context**

In [30]:
# Get Data and Preprocess
# df_context = pd.read_csv('../data/provided/Test.csv')
# df_context = Preprocess.remove_special_characters(df_context, column_name='text')
# df_context['text'] = df_context['text'].apply(lambda x: Preprocess.treat_hashtags(x))
# df_context['text'] = df_context['text'].apply(lambda x: Preprocess.correct_spelling(x))
# #df_context['text'] = df_context['text'].apply(lambda x: Preprocess.remove_stop_words(x))
# df_context.to_csv("../data/provided/Test-processed.csv")

df_context = pd.read_csv('../data/provided/Test.csv')
df_context = Preprocess.remove_non_ascii(df_context, column_name='text')
# df_context = Preprocess.remove_usertag(df_context, column_name='text')
# df_context = Preprocess.reformat_hashtag(df_context, column_name='text')
# df_context = Preprocess.remove_stop_words(df_context, column_name='text', new_col="text_transformed", transformation=[
#     "tokenize", "lemma", "lower"], save_in="../data/provided/Test-processed.csv")

#df_context = pd.read_csv('../data/provided/Test-processed.csv')

ids = df_context["tweet_id"].values
tweets = df_context["text"].values #text_transformed

# Make prediction
predictions, raw_outputs = model.predict(tweets)

  0%|          | 0/6 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/92 [00:00<?, ?it/s]

In [31]:
# Extract Location Mention based on model output
results = []
for sentence in predictions:
    result = " ".join([word for d in sentence for word, tag in d.items() if tag != 'O'])
    if result == "":
        result = " "
    results.append(result)

Predictions.to_csv(ids, results)

Saved predictions to ../submissions/submission_12.csv
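
The loop above joins every non-`O` word of a tweet into one string. If separate mentions are needed, consecutive tagged tokens can be grouped into spans instead. The sketch below assumes the `model.predict` output format shown earlier (a list of one-entry `{word: tag}` dicts per sentence); `extract_mentions` is a hypothetical helper, not part of `utils`.

```python
def extract_mentions(sentence):
    """Group consecutive tagged tokens into (text, entity_type) mentions."""
    mentions, words, current = [], [], None
    for token in sentence:
        for word, tag in token.items():
            entity = tag[2:] if tag != "O" else None
            # Close the open mention on 'O', on a new 'B-' tag, or on a type change.
            if words and (entity is None or tag.startswith("B-") or entity != current):
                mentions.append((" ".join(words), current))
                words = []
            if entity is not None:
                words.append(word)
                current = entity
    if words:
        mentions.append((" ".join(words), current))
    return mentions


sentence = [{"Ellicott": "B-CITY"}, {"City,": "I-CITY"}, {"Maryland,": "B-STATE"}, {"struck": "O"}]
print(extract_mentions(sentence))
```

Punctuation attached to the words is kept here on purpose, so it still needs a cleanup step afterwards.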


In [None]:
### END

In [36]:
def postprocessing(path="../submissions/submission_12.csv"):
    df = pd.read_csv(path)
    remove_chars = ['"', ',', '.', ')', '[', ']', '(', '#', '?']
    translation_table = str.maketrans('', '', ''.join(remove_chars))
    df['location'] = df['location'].apply(lambda x: x.translate(translation_table).strip())
    df.loc[df.location.apply(len) < 2, 'location'] = ' '
    return df

df_cleaned = postprocessing()
df_cleaned.to_csv("../submissions/submission_12-processed.csv", index=False)
df_cleaned.head()

Unnamed: 0,tweet_id,location
0,ID_1001154804658286592,New England New Orleans
1,ID_1001155505459486720,
2,ID_1001155756371136512,Ellicott City Maryland
3,ID_1001159445194399744,Maryland Ellicott City
4,ID_1001164907587538944,Ellicott City Maryland
