<div style="padding: 0.5em; background-color: #1876d1; color: #fff; font-weight: bold; font-size: 1.4em;">
    [Approach 4]  Location Mention Recognition - NER BERT Transformer
</div>

In this Jupyter notebook, we use Named Entity Recognition (NER) to extract location mentions from tweets posted on X (formerly Twitter) during emergency situations.

Note :
* Do NER
* Try BERT Model
* Extract Location Mention

---
<b>#Microsoft Learn Challenge, #Zindi, #Hamad Bin Khalifa University </b>

### **Importing Libraries**

In [3]:
#!pip install simpletransformers
#!pip install pyspellchecker
#!pip install stanza
#!pip install nltk
#!pip install python-dotenv
#!pip install werpy
#!pip install wandb
#!pip install transformers jiwer accelerate -U
#!pip install tf-keras
#!pip install transformers[torch]
#!pip install accelerate -U

In [1]:
# general utils
import werpy
import numpy as np
import pandas as pd
import seaborn as sns
import stanza, os, sys
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from simpletransformers.ner import NERModel, NERArgs
pd.set_option('display.max_colwidth', 500)

# utils setup
current_directory = os.getcwd()
root_directory = os.path.abspath(os.path.join(current_directory, os.pardir))
sys.path.append(root_directory)

# logging
import wandb
os.environ["WANDB_NOTEBOOK_NAME"] = "transformers_3.ipynb"

# custom utils
from utils.io import Predictions
from utils.metrics import LMR_Metrics
from utils.io import LMR_BILOU_Scrapper, LMR_JSON_Scrapper
from utils.preprocessing import Preprocess

In [2]:
import re
from collections import Counter
# import jiwer
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForTokenClassification, Trainer, TrainingArguments, DataCollatorForTokenClassification
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from tqdm import tqdm
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

### **Exploring Data**

The provided Train.csv contains many missing values, so we rebuild the dataset from the original source.

In [3]:
LMR_JSON_Scrapper(output_dir="../data/self_scrapped/raw").run()

Processing dataset: california_wildfires_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.29file/s]


Processing dataset: canada_wildfires_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.46file/s]


Processing dataset: cyclone_idai_2019


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.26file/s]


Processing dataset: ecuador_earthquake_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  1.80file/s]


Processing dataset: greece_wildfires_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.47file/s]


Processing dataset: hurricane_dorian_2019


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.29file/s]


Processing dataset: hurricane_florence_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.26file/s]


Processing dataset: hurricane_harvey_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.30file/s]


Processing dataset: hurricane_irma_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.44file/s]


Processing dataset: hurricane_maria_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.46file/s]


Processing dataset: hurricane_matthew_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  1.84file/s]


Processing dataset: italy_earthquake_aug_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.76file/s]


Processing dataset: kaikoura_earthquake_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.48file/s]


Processing dataset: kerala_floods_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.08file/s]


Processing dataset: maryland_floods_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.17file/s]


Processing dataset: midwestern_us_floods_2019


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.26file/s]


Processing dataset: pakistan_earthquake_2019


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.52file/s]


Processing dataset: puebla_mexico_earthquake_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  1.91file/s]


Processing dataset: srilanka_floods_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.62file/s]

Processing complete.





- Let's concatenate our datasets

In [3]:
train_dfs = []
dev_dfs   = []
test_dfs  = []
path_dfs  = "../data/self_scrapped/raw"
for filename in os.listdir(path_dfs):
    if filename.endswith(".csv"):
        file_path = os.path.join(path_dfs, filename)
        if filename.startswith("train"):
            df = pd.read_csv(file_path)
            train_dfs.append(df)
        elif filename.startswith("dev"):
            df = pd.read_csv(file_path)
            dev_dfs.append(df)
        elif filename.startswith("test_unlabeled"):
            df = pd.read_csv(file_path)
            test_dfs.append(df)

df_train = pd.concat(train_dfs, ignore_index=True) if train_dfs else pd.DataFrame()
df_test  = pd.concat(test_dfs, ignore_index=True) if test_dfs else pd.DataFrame()
df_dev   = pd.concat(dev_dfs, ignore_index=True) if dev_dfs else pd.DataFrame()

print("TRAIN SHAPE: ", df_train.shape)
print("TEST  SHAPE: ", df_test.shape)
print("DEV   SHAPE: ", df_dev.shape)

TRAIN SHAPE:  (14392, 3)
TEST  SHAPE:  (4066, 3)
DEV   SHAPE:  (2056, 3)


- The sentences contain hashtags, non-ASCII characters, stop words, and so on, so we have to clean the data.

In [4]:
df_train.head(20)

Unnamed: 0,tweet_id,text,location_mentions
0,ID_1022420413882744832,Nearly half of #houses checked in #fire-stricken areas deemed #uninhabitable #GO #PrayForGreece #PrayForAthens #AthensFires 🇬🇷,
1,ID_1021778661895294976,RT @anadoluagency: #Greece: Death toll from wildfires hits 74,Greece=>COUNTRY
2,ID_1022015997740503042,When the essence of cooperation meets the sad reality of lifeThe IPA partner country offers financial aid to Greece to handle disaster @InterregIPACBC #Greecefires,Greece=>COUNTRY
3,ID_1022557424585240576,We are live from the Lureio Idrima the orphanage and nursing home operared by the nuns of the Holy Trinity Monastery that was destroyed by the fire in Neos Voutzas. Here too the scene is apocalyptic.,Holy Trinity Monastery=>HUMAN-MADE POINT-OF-INTEREST * Neos Voutzas=>NEIGHBORHOOD
4,ID_1021749412639457280,RT @AP: Greek prime minister declares 3-day national mourning period for dozens killed by wildfires near Athens.,Athens=>CITY
5,ID_1024662121391579136,#Greece vows to speed up destruction of illegal property after #wildfires,Greece=>COUNTRY
6,ID_1022441664697298944,"In Mati, hundreds of humanitarian volunteers have joined relief efforts following Greece’s most deadly wildfires in a decade",Mati=>CITY * Greece=>COUNTRY
7,ID_1024974395759108097,State Minister #Flambouraris offers thanks for assistance to #Greecefires victims,
8,ID_1022473412034416640,"Unfortunately theres the first confirmed fatality, hope to be the last, among tourists at the #AthensFires: an Irishman in honeymoon. So sorry, R.I.P 😢 #Mati #Attica #wildfires #PrayForGreece #PrayForAthens #Πυρκαγια #Αττικη #ματι",Attica=>CITY
9,ID_1022371614091096064,"A special account has been opened at the Bank of Greece for donations in support of fire victims. Account number 23/2341195169, IBAN GR4601000230000002341195169) available for foreign states, businesses and individuals from Greece and abroad to provide their financial support",Greece=>COUNTRY * Greece=>COUNTRY


- There are some sentences without a location mention. We need to look closer. It could be normal if there is no corresponding location found in the tweet, or it might be an error from the labeling task. Note that for the test set, it is normal for all location_mentions to be NaN. (😎 Yeah, we have to predict this value).

In [5]:
print(df_train.isnull().sum())
print(df_test.isnull().sum())
print(df_dev.isnull().sum())

tweet_id                0
text                    0
location_mentions    4026
dtype: int64
tweet_id                0
text                    0
location_mentions    4066
dtype: int64
tweet_id               0
text                   0
location_mentions    573
dtype: int64
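
The `location_mentions` column shown in the sample above packs several mentions into one string, separated by ` * `, with each mention written as `text=>TYPE`. Before deciding whether an empty value is a labeling gap, it can help to parse that string into pairs. The helper below is a minimal sketch under that assumption; `parse_location_mentions` is a hypothetical name and not part of the project's `utils` package.

```python
import math

def parse_location_mentions(raw):
    """Split 'Mati=>CITY * Greece=>COUNTRY' into [(mention, type), ...]."""
    # Missing values arrive as NaN (a float) when read with pandas, or as None / "".
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return []
    if not str(raw).strip():
        return []
    pairs = []
    for chunk in str(raw).split(" * "):
        # rpartition keeps any earlier '=>' inside the mention text itself
        mention, sep, loc_type = chunk.rpartition("=>")
        if sep:  # skip malformed chunks that lack the '=>' separator
            pairs.append((mention.strip(), loc_type.strip()))
    return pairs

print(parse_location_mentions("Mati=>CITY * Greece=>COUNTRY"))
print(parse_location_mentions(float("nan")))
```

Applied with `df_train["location_mentions"].apply(parse_location_mentions)`, this would give one list per tweet, with an empty list for the rows counted as null above.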


### **Preprocessing Data**

- Remove non-ASCII characters
- Treat hashtags and user tags
- Remove stop words
- Tokenization
- Lemmatization and lowercasing
- BILOU tagging

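##### **<> BIO / BILOU Tagging**

BIO stands for Begin, Inside, and Outside. It is a method for tagging tokens (words or subwords) in a sequence to identify entities within the text. Each token is assigned a tag that indicates whether it is at the beginning of an entity, inside an entity, or outside of any entity. The preprocessing below uses the BILOU extension of this scheme (`build_bilou_encoding`), which adds L for the last token of a multi-token entity and U for a single-token entity; this is why the labels look like `B-CITY`, `L-CITY`, and `U-STATE`.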
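
To make the tagging scheme concrete, here is a small sketch of how the BILOU-style labels seen later in this notebook (`B-CITY`, `L-CITY`, `U-STATE`) could be assigned by exact token matching. `bilou_tags` is a hypothetical helper written for illustration; the project's own `Preprocess.build_bilou_encoding` may work differently.

```python
def bilou_tags(tokens, mentions):
    """Tag tokens with BILOU labels given (mention_tokens, type) pairs."""
    tags = ["O"] * len(tokens)
    for mention_tokens, loc_type in mentions:
        n = len(mention_tokens)
        if n == 0:
            continue  # nothing to match
        for start in range(len(tokens) - n + 1):
            window = tokens[start:start + n]
            # Only tag positions that no earlier mention has claimed
            free = all(t == "O" for t in tags[start:start + n])
            if window == mention_tokens and free:
                if n == 1:
                    tags[start] = f"U-{loc_type}"          # Unit: single-token entity
                else:
                    tags[start] = f"B-{loc_type}"          # Begin
                    for i in range(start + 1, start + n - 1):
                        tags[i] = f"I-{loc_type}"          # Inside
                    tags[start + n - 1] = f"L-{loc_type}"  # Last
    return tags

tokens = "catastrophic flooding slam ellicott city maryland water".split()
mentions = [(["ellicott", "city"], "CITY"), (["maryland"], "STATE")]
print(list(zip(tokens, bilou_tags(tokens, mentions))))
```

Exact matching is the simplest choice; it misses mentions whose surface form changed during lemmatization, which is one reason a real pipeline may need something more tolerant.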

In [6]:
# TRAIN
train_path = "../data/transformed/train.tag.csv"
if not os.path.exists(train_path):
    df_train = Preprocess.remove_non_ascii(df_train, column_name='text')
    df_train = Preprocess.remove_usertag(df_train, column_name='text')
    df_train = Preprocess.reformat_hashtag(df_train, column_name='text')
    df_train = Preprocess.remove_stop_words(df_train, column_name='text', new_col="text_transformed", transformation=[
        "tokenize", "lemma", "lower"], save_in="../data/transformed/train.lemma.csv")
    df_tag_train = Preprocess.build_bilou_encoding(df_train, text_col="text_transformed", save_in=train_path)
else:
    df_tag_train = pd.read_csv(train_path)

In [7]:
# DEV
dev_path = "../data/transformed/dev.tag.csv"
if not os.path.exists(dev_path):
    df_dev = Preprocess.remove_non_ascii(df_dev, column_name='text')
    df_dev = Preprocess.remove_usertag(df_dev, column_name='text')
    df_dev = Preprocess.reformat_hashtag(df_dev, column_name='text')
    df_dev = Preprocess.remove_stop_words(df_dev, column_name='text', new_col="text_transformed", transformation=[
        "tokenize", "lemma", "lower"], save_in="../data/transformed/dev.lemma.csv")
    df_tag_dev = Preprocess.build_bilou_encoding(df_dev, text_col="text_transformed", save_in=dev_path)
else:
    df_tag_dev = pd.read_csv(dev_path)

In [8]:
# TEST
test_path = "../data/transformed/test.lemma.csv"
if not os.path.exists(test_path):
    df_test = Preprocess.remove_non_ascii(df_test, column_name='text')
    df_test = Preprocess.remove_usertag(df_test, column_name='text')
    df_test = Preprocess.reformat_hashtag(df_test, column_name='text')
    df_test = Preprocess.remove_stop_words(df_test, column_name='text', new_col="text_transformed", transformation=[
        "tokenize", "lemma", "lower"], save_in=test_path)
else:
    df_test = pd.read_csv(test_path)

In [9]:
df_tag_train.head(30)

Unnamed: 0.1,Unnamed: 0,sentence_id,words,labels
0,0,684,nearly,O
1,1,684,half,O
2,2,684,of,O
3,3,684,house,O
4,4,684,check,O
5,5,684,in,O
6,6,684,fire,O
7,7,684,stricken,O
8,8,684,area,O
9,9,684,deem,O


### **Prepare training, dev and test data**

In [10]:
df_tag_train["sentence_id"] = LabelEncoder().fit_transform(df_tag_train["sentence_id"])
df_tag_dev["sentence_id"]   = LabelEncoder().fit_transform(df_tag_dev["sentence_id"])

In [11]:
df_tag_train.head()

Unnamed: 0.1,Unnamed: 0,sentence_id,words,labels
0,0,684,nearly,O
1,1,684,half,O
2,2,684,of,O
3,3,684,house,O
4,4,684,check,O


In [12]:
X_train  = df_tag_train[["sentence_id", "words"]]
X_test   = df_tag_dev[["sentence_id", "words"]]
y_train  = df_tag_train["labels"]
y_test   = df_tag_dev["labels"]

train_data = pd.DataFrame({"sentence_id": X_train["sentence_id"], "words": X_train["words"], "labels": y_train})
test_data = pd.DataFrame({"sentence_id": X_test["sentence_id"], "words": X_test["words"], "labels": y_test})

train_data

Unnamed: 0,sentence_id,words,labels
0,684,nearly,O
1,684,half,O
2,684,of,O
3,684,house,O
4,684,check,O
...,...,...,...
292721,5841,the,O
292722,5841,ddrc,O
292723,5841,patient,O
292724,5841,preparedness,O


- We have to collapse the words of each sentence into a single string and do the same for the labels.

In [13]:
# TRAINSET
train_data['labels_list'] = train_data['labels']
train_dataset = train_data.groupby('sentence_id').agg({
    'words': ' '.join,
    'labels': ','.join,
    'labels_list': lambda x: list(x) 
}).reset_index()
train_dataset.rename(columns={'words': 'sentence', 'labels': 'word_labels'}, inplace=True)

# DEVSET
test_data['labels_list'] = test_data['labels']
test_dataset = test_data.groupby('sentence_id').agg({
    'words': ' '.join,
    'labels': ','.join,
    'labels_list': lambda x: list(x) 
}).reset_index()
test_dataset.rename(columns={'words': 'sentence', 'labels': 'word_labels'}, inplace=True)

In [14]:
train_dataset.head()

Unnamed: 0,sentence_id,sentence,word_labels,labels_list
0,0,flash flood strike a maryland city on sunday wash out street and toss car like bath toy,"O,O,O,O,U-STATE,O,O,O,O,O,O,O,O,O,O,O,O","[O, O, O, O, U-STATE, O, O, O, O, O, O, O, O, O, O, O, O]"
1,1,state of emergency declare for maryland flooding via,"O,O,O,O,O,U-STATE,O,O","[O, O, O, O, O, U-STATE, O, O]"
2,2,other part of maryland also see significant damage from sunday storm include this baltimore city neighborhood dundalk and catonsville rain total span from 1 to 10 inch across maryland ecflood,"O,O,O,U-STATE,O,O,O,O,O,O,O,O,O,U-CITY,O,O,O,O,O,O,O,O,O,O,O,O,O,O,U-STATE,O","[O, O, O, U-STATE, O, O, O, O, O, O, O, O, O, U-CITY, O, O, O, O, O, O, O, O, O, O, O, O, O, O, U-STATE, O]"
3,3,catastrophic flooding slam ellicott city maryland water rescues report the weather channel via,"O,O,O,B-CITY,L-CITY,U-STATE,O,O,O,O,O,O,O","[O, O, O, B-CITY, L-CITY, U-STATE, O, O, O, O, O, O, O]"
4,4,watch 1 miss after flash flooding devastate ellicott city maryland gpwx,"O,O,O,O,O,O,O,B-CITY,L-CITY,U-STATE,O","[O, O, O, O, O, O, O, B-CITY, L-CITY, U-STATE, O]"


#### **Modeling preparation**

In [15]:
# Extract unique tags from word labels
tags = pd.concat([df_tag_train, df_tag_dev])["labels"].unique().tolist()

# Create label to ID and ID to label mappings
label2id = {tag: idx for idx, tag in enumerate(tags)}
id2label = {idx: tag for idx, tag in enumerate(tags)}

#### **Setup the model and tokenizer**

In [16]:
# Initialize the tokenizer using a pre-trained BERT model
tokenizer = BertTokenizer.from_pretrained("bert-large-uncased")

In [17]:
# Load a pre-trained BERT model for token classification with the custom label mappings
model = BertForTokenClassification.from_pretrained(
    "bert-large-uncased",
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id
)
model.to(device)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-large-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertForTokenClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 1024, padding_idx=0)
      (position_embeddings): Embedding(512, 1024)
      (token_type_embeddings): Embedding(2, 1024)
      (LayerNorm): LayerNorm((1024,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-23): 24 x BertLayer(
          (attention): BertAttention(
            (self): BertSdpaSelfAttention(
              (query): Linear(in_features=1024, out_features=1024, bias=True)
              (key): Linear(in_features=1024, out_features=1024, bias=True)
              (value): Linear(in_features=1024, out_features=1024, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=1024, out_features=1024, bias=True)
              (LayerNorm): LayerNorm((1024

- A custom dataset class is created to handle the input data, applying tokenization and ensuring that sequences are properly padded or truncated to fit the model’s expected input size. 
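
The alignment step described here can be sketched without loading BERT: each word label is repeated for every subword piece, special tokens get `O`, and the sequence is then truncated or padded to a fixed length. `toy_wordpiece` is a hypothetical stand-in for the real tokenizer (it just cuts words into fixed-size chunks), and `encode` is an illustrative name, not the notebook's class.

```python
def toy_wordpiece(word, piece_len=4):
    """Hypothetical stand-in for a WordPiece tokenizer: fixed-size chunks."""
    if not word:
        return []
    pieces = [word[i:i + piece_len] for i in range(0, len(word), piece_len)]
    return [pieces[0]] + ["##" + p for p in pieces[1:]]

def encode(words, word_labels, max_len, pad_label="O"):
    """Return (tokens, labels, attention_mask), each of length max_len."""
    tokens, labels = ["[CLS]"], ["O"]
    for word, label in zip(words, word_labels):
        pieces = toy_wordpiece(word)
        tokens.extend(pieces)
        labels.extend([label] * len(pieces))  # every subword inherits the word label
    tokens.append("[SEP]")
    labels.append("O")

    # Truncate long sequences, then pad short ones up to max_len
    tokens, labels = tokens[:max_len], labels[:max_len]
    n_pad = max_len - len(tokens)
    attention_mask = [1] * len(tokens) + [0] * n_pad
    tokens = tokens + ["[PAD]"] * n_pad
    labels = labels + [pad_label] * n_pad
    return tokens, labels, attention_mask

toks, labs, mask = encode(
    ["ellicott", "city", "maryland"], ["B-CITY", "L-CITY", "U-STATE"], max_len=10
)
for row in zip(toks, labs, mask):
    print(row)
```

One design point worth noting: padding positions are labelled `O` here, as in the notebook. A common alternative is to give them an ignore index so they do not contribute to the loss or the metrics.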

In [18]:
data_collator = DataCollatorForTokenClassification(tokenizer)

In [19]:
df_temp = pd.concat([train_dataset, test_dataset])
df_temp['labels_list_length'] = df_temp['labels_list'].apply(len)
min_length = df_temp['labels_list_length'].min()
max_length = df_temp['labels_list_length'].max()

print("Min: ", min_length, "; Max: ", max_length)

Min:  1 ; Max:  60


In [20]:
def tokenize_and_preserve_labels(sentence, text_labels, tokenizer):
    tokenized_sentence, labels = [], []
    for word, label in zip(sentence.split(), text_labels.split(",")):
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)
        tokenized_sentence.extend(tokenized_word)
        labels.extend([label] * n_subwords)
    return tokenized_sentence, labels
    
class CustomDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.data)
        
    def __getitem__(self, index):
        sentence = self.data.sentence[index]  
        word_labels = self.data.word_labels[index]  
        tokenized_sentence, labels = tokenize_and_preserve_labels(sentence, word_labels, self.tokenizer)
        
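        tokenized_sentence = ["[CLS]"] + tokenized_sentence + ["[SEP]"]
        # [CLS] and [SEP] carry no entity, so both get the "O" label
        labels.insert(0, "O")
        labels.append("O")  # insert(-1, ...) would land before the last word's label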

        if len(tokenized_sentence) > self.max_len:
            tokenized_sentence = tokenized_sentence[:self.max_len]
            labels = labels[:self.max_len]
        else:
            tokenized_sentence += ['[PAD]'] * (self.max_len - len(tokenized_sentence))
            labels += ["O"] * (self.max_len - len(labels))

        attn_mask = [1 if tok != '[PAD]' else 0 for tok in tokenized_sentence]
        ids = self.tokenizer.convert_tokens_to_ids(tokenized_sentence)
        label_ids = [label2id[label] for label in labels]
        
        return {
            'input_ids': torch.tensor(ids, dtype=torch.long),
            'attention_mask': torch.tensor(attn_mask, dtype=torch.long),
            'labels': torch.tensor(label_ids, dtype=torch.long)
        }

In [21]:
MAX_LEN = 60+20
training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN)
testing_set = CustomDataset(test_dataset, tokenizer, MAX_LEN)

In [22]:
training_set.__getitem__(1)

{'input_ids': tensor([  101,  2110,  1997,  5057, 13520,  2005,  5374,  9451,  3081,   102,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0]),
 'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0]),
 'labels': tensor([ 0,  0,  0,  0,  0,  0, 15,  0,  0,  0

In [23]:
# Define training parameters
TRAIN_BATCH_SIZE = 32
VALID_BATCH_SIZE = 16
EPOCHS = 1

- The fine-tuning process sets up the Trainer class from the transformers library, which simplifies the training loop, handles model optimization, and reports the metrics returned by `compute_metrics` (accuracy, precision, recall, and F1-score). We specify training arguments such as the number of epochs, batch sizes, warmup steps, weight decay, and evaluation frequency; the learning rate is left at the library default. The model is trained to minimize the loss function, adjusting its weights based on the labeled data to improve its predictions.

In [25]:
# Helper
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels.flatten(), preds.flatten(), average='weighted')
    acc = accuracy_score(labels.flatten(), preds.flatten())
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }

# Training arguments for the Trainer API
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=TRAIN_BATCH_SIZE,
    per_device_eval_batch_size=VALID_BATCH_SIZE,
    warmup_steps=25,
    weight_decay=0.001,
    logging_dir='./logs',
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=25,
    save_steps=50,
    save_total_limit=2,
    gradient_accumulation_steps=4,  # Accumulate gradients for a larger effective batch size (32 x 4 = 128 per device)
    fp16=False,  # Mixed precision is disabled; set to True on a supported GPU for faster training
    report_to=["none"]  # Set to ["wandb"] to log to Weights & Biases (requires an API key)
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=training_set,
    eval_dataset=testing_set,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)
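
Because almost all tokens are labelled `O` (and padding is labelled `O` as well), weighted token-level scores can look high even when few locations are found. As a complementary check, the sketch below computes micro precision, recall, and F1 only over location tags. It works on plain lists of tag strings; `location_token_scores` is a hypothetical helper, not the `compute_metrics` function used by the Trainer above.

```python
def location_token_scores(gold, pred, ignore=("O",)):
    """Micro precision/recall/F1 over tokens, skipping tags listed in `ignore`."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if g == p:
            if g not in ignore:
                tp += 1          # correct location tag
        else:
            if p not in ignore:
                fp += 1          # predicted a location tag that is wrong
            if g not in ignore:
                fn += 1          # missed (or mislabelled) a gold location tag
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# A model that predicts "O" everywhere on a mostly-"O" sentence
gold = ["O"] * 16 + ["B-CITY", "L-CITY", "U-STATE", "O"]
pred = ["O"] * 20
accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print("token accuracy:", accuracy)
print("location scores:", location_token_scores(gold, pred))
```

In this toy case token accuracy is 0.85 while every location score is 0, which is the gap this kind of metric is meant to expose.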

#### **Fine-tuning**

For this task, we fine-tune the BertForTokenClassification model, a variant of BERT designed for sequence tagging tasks like Named Entity Recognition (NER). Fine-tuning involves taking a pre-trained BERT model and adapting it to our specific task—location mention recognition—by training it further on our labeled dataset. This step leverages the knowledge BERT has from its initial pre-training on a vast corpus while specializing it for identifying location mentions.

#### **Model Training**

In [26]:
trainer.train()

  0%|          | 0/112 [00:00<?, ?it/s]

#### **Make WER Inference**

In [None]:
def infer_on_sentences(sentences, model, tokenizer, max_len=80, with_extra=False):
    # Put the model in evaluation mode
    model.eval()
    
    results = []
    extra_results = []
    
    for sentence in tqdm(sentences):
        # Tokenize the sentence and prepare input for the model
        tokenized_sentence = tokenizer(
            sentence.split(),
            is_split_into_words=True,
            return_offsets_mapping=False,
            padding='max_length',
            truncation=True,
            max_length=max_len,
            return_tensors="pt"
        )
        
        # Move tensors to the correct device
        input_ids = tokenized_sentence['input_ids'].to(device)
        attention_mask = tokenized_sentence['attention_mask'].to(device)
        
        # Get predictions
        with torch.no_grad():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=2)  # Get the index of the highest logit for each token
        
        # Convert predictions to labels
        pred_labels = [id2label[pred.item()] for pred in predictions[0]]
        
        # Get the original tokens from input_ids
        tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
        
        # Filter out tokens with the 'O' label and concatenate them
        filtered_tokens = [
            token for token, label in zip(tokens, pred_labels)
            if label != 'O' and token not in ['[CLS]', '[SEP]', '[PAD]']
        ]
        filtered_labels = [
            label for token, label in zip(tokens, pred_labels)
            if label != 'O' and token not in ['[CLS]', '[SEP]', '[PAD]']
        ]
        
        results.append(" ".join(filtered_tokens))
        extra_results.append(filtered_labels)

    if with_extra:
        return results, extra_results

    return results
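
The "WER" in this section's heading is word error rate: the word-level edit distance between the predicted location string and the reference, divided by the number of reference words. The function below is a minimal pure-Python sketch of that definition, useful for sanity-checking scores. It is an assumption that this matches what `werpy` or `LMR_Metrics.wer_type` compute in detail, and the handling of an empty reference is a convention chosen here for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        # Convention assumed here: 0 if both are empty, otherwise 1
        return 0.0 if not hyp else 1.0

    # prev[j] = edit distance between the first i-1 ref words and first j hyp words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

print(word_error_rate("ellicott city maryland", "ellicott city maryland"))
print(word_error_rate("ellicott city maryland", "ellicott maryland"))
print(word_error_rate("greece", "athens greece"))
```

Note that WER can exceed 1 when the hypothesis contains many extra words, so over-predicting locations is penalized as well as missing them.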

- Let's count the NER labels

In [16]:
label = pd.concat([df_tag_train, df_tag_dev])["labels"].unique().tolist()
label_counts = pd.concat([df_tag_train, df_tag_dev])["labels"].value_counts().reset_index()
label_counts.columns = ["Label", "Frequency"]
display(label_counts)

Unnamed: 0,Label,Frequency
0,O,314421
1,U-COUNTRY,4575
2,U-STATE,4298
3,U-CITY,2822
4,B-CITY,1050
5,L-CITY,1050
6,B-COUNTRY,653
7,L-COUNTRY,653
8,B-ISLAND,572
9,L-ISLAND,572


- Let's define the model **Args** and the hyperparameter optimisation approach

In [17]:
# hyperparameters

sweep_config = {
    "method": "bayes",  # grid, random
    "metric": {"name": "wer", "goal": "minimize"},
    "parameters": {
        "num_train_epochs": {"values": [1, 2, 3, 5, 8]},
        "learning_rate": {"min": 5e-5, "max": 4e-4},
    },
}

- Initialize a W&B sweep with the config defined earlier.

In [18]:
sweep_id = wandb.sweep(sweep_config, project="LMR-HPC")
#%%capture
# wandb.init(project="LMR", name="Location-Mention-Recognition")

wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

  ········


wandb: Appending key for api.wandb.ai to your netrc file: /home/genereux.akotenou/.netrc


Create sweep with ID: fbynau3c
Sweep URL: https://wandb.ai/genereux-akotenou-local/LMR-HPC/sweeps/fbynau3c


- Model args

In [19]:
model_args = NERArgs()

# general
model_args.evaluate_during_training = True
model_args.overwrite_output_dir = True
model_args.train_batch_size = 64
model_args.eval_batch_size = 32
model_args.labels_list = label
model_args.use_multiprocessing = True
model_args.wandb_project = "LMR"

# for early stopping (currently disabled)
"""model_args.use_early_stopping = True
model_args.early_stopping_delta = 0.01
model_args.early_stopping_metric = "wer"
model_args.early_stopping_metric_minimize = False
model_args.early_stopping_patience = 5"""
model_args.evaluate_during_training_steps = 1000

In [20]:
def train_eval():
    wandb.init(name="Location-Mention-Recognition-HPC")
    model = NERModel(
        "bert", 
        "bert-base-cased", 
        use_cuda=False,
        args=model_args, 
        sweep_config=wandb.config)

    # Train the model
    print('### TRAINING')
    # train_data1, _ = train_test_split(train_data, test_size=0.99998)
    model.train_model(
        train_data, 
        eval_data=test_data, 
        wer=LMR_Metrics.wer_type
    )
    
    # Evaluate the model
    print('### EVALUATION')
    result, model_outputs, wrong_preds = model.eval_model(test_data, wer=LMR_Metrics.wer_type)

    # Log metrics to wandb
    wandb.log({"eval_result": result, "model_outputs": model_outputs})

    # Close the run and sync to W&B (wandb.join() is the deprecated name)
    wandb.finish()

In [None]:
#%%capture
wandb.agent(sweep_id, train_eval)

wandb: Agent Starting Run: t5notlaa with config:
wandb: 	learning_rate: 0.00033320406223994023
wandb: 	num_train_epochs: 2
wandb: Currently logged in as: genereux-akotenou (genereux-akotenou-local). Use `wandb login --relogin` to force relogin


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


### TRAINING


  0%|          | 0/29 [00:00<?, ?it/s]

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]




Running Epoch 1 of 2:   0%|          | 0/225 [00:00<?, ?it/s]

In [None]:
result

{'eval_loss': 0.08742212575106394,
 'precision': 0.7850278199291857,
 'recall': 0.7846309403437816,
 'f1_score': 0.7848293299620733}

- Quick prediction

In [None]:
predictions, raw_outputs = model.predict([
    "Elicott City, Maryland, struck by catastrophic flooding; 1 missing.",
    "Memorial Day weekend floods ravage Maryland town"
])

  0%|          | 0/1 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
predictions

[[{'Elicott': 'B-CITY'},
  {'City,': 'L-CITY'},
  {'Maryland,': 'O'},
  {'struck': 'O'},
  {'by': 'O'},
  {'catastrophic': 'O'},
  {'flooding;': 'O'},
  {'1': 'O'},
  {'missing.': 'O'}],
 [{'Memorial': 'O'},
  {'Day': 'O'},
  {'weekend': 'O'},
  {'floods': 'O'},
  {'ravage': 'O'},
  {'Maryland': 'O'},
  {'town': 'O'}]]

### **Make prediction for Context**

In [None]:
# Get Data and Preprocess
# df_context = pd.read_csv('../data/provided/Test.csv')
# df_context = Preprocess.remove_special_characters(df_context, column_name='text')
# df_context['text'] = df_context['text'].apply(lambda x: Preprocess.treat_hashtags(x))
# df_context['text'] = df_context['text'].apply(lambda x: Preprocess.correct_spelling(x))
# #df_context['text'] = df_context['text'].apply(lambda x: Preprocess.remove_stop_words(x))
# df_context.to_csv("../data/provided/Test-processed.csv")

df_context = pd.read_csv('../data/provided/Test.csv')
df_context = Preprocess.remove_non_ascii(df_context, column_name='text')
df_context = Preprocess.remove_usertag(df_context, column_name='text')
df_context = Preprocess.reformat_hashtag(df_context, column_name='text')
df_context = Preprocess.remove_stop_words(df_context, column_name='text', new_col="text_transformed", transformation=[
    "tokenize", "lemma", "lower"], save_in="../data/provided/Test-processed.csv")

#df_context = pd.read_csv('../data/provided/Test-processed.csv')

ids = df_context["tweet_id"].values
tweets = df_context["text_transformed"].values

# Make prediction
predictions, raw_outputs = model.predict(tweets)

100%|██████████| 2942/2942 [05:19<00:00,  9.21it/s]


  0%|          | 0/6 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/30 [00:00<?, ?it/s]

In [None]:
# Extract Location Mention based on model output
results = []
for sentence in predictions:
    result = " ".join([word for d in sentence for word, tag in d.items() if tag != 'O'])
    if result == "":
        result = " "
    results.append(result)

Predictions.to_csv(ids, results)

Saved predictions to ../submissions/submission_8.csv
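
The extraction above joins every non-`O` word of a tweet into one string. If separate mentions with their types are needed, the BILOU prefixes can be used to group words into spans. The sketch below assumes the prediction format shown earlier (a list of single-entry `{word: tag}` dicts per sentence); `extract_mentions` is a hypothetical helper, and dropping spans that start with `B-` but never reach an `L-` tag is a design choice made here, not something the notebook does.

```python
def extract_mentions(sentence):
    """Group BILOU-tagged words into (mention_text, type) pairs."""
    mentions, current, current_type = [], [], None
    for item in sentence:
        for word, tag in item.items():
            prefix, _, loc_type = tag.partition("-")
            if prefix == "U":
                mentions.append((word, loc_type))       # single-word mention
                current, current_type = [], None
            elif prefix == "B":
                current, current_type = [word], loc_type  # open a new span
            elif prefix in ("I", "L") and current and loc_type == current_type:
                current.append(word)
                if prefix == "L":                       # close the span
                    mentions.append((" ".join(current), loc_type))
                    current, current_type = [], None
            else:
                current, current_type = [], None        # "O" or inconsistent tag
    return mentions

example = [
    {"Elicott": "B-CITY"}, {"City,": "L-CITY"}, {"Maryland,": "O"},
    {"struck": "O"}, {"by": "O"}, {"catastrophic": "O"},
]
print(extract_mentions(example))
```

Punctuation attached to words (such as the comma in `City,`) is kept as-is, so a cleanup step may still be needed before scoring.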


In [None]:
### END