<div style="padding: 0.5em; background-color: #1876d1; color: #fff; font-weight: bold; font-size: 1.4em;">
    [Approach 3]  Location Mention Recognition - NER BERT Transformer
</div>

In this Jupyter notebook, we use Named Entity Recognition (NER) to extract location mentions from emergency-related tweets posted on X (formerly Twitter).

Note :
* Do NER
* Try BERT Model
* Extract Location Mention

---
<b>#Microsoft Learn Challenge, #Zindi, #Hamad Bin Khalifa University </b>

### **Importing Libraries**

In [5]:
#!pip install simpletransformers
#!pip install pyspellchecker
#!pip install stanza
#!pip install nltk
#!pip install python-dotenv



In [1]:
# general utils
import numpy as np
import pandas as pd
import stanza, os, sys
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from simpletransformers.ner import NERModel, NERArgs
pd.set_option('display.max_colwidth', 1000)

# utils setup
current_directory = os.getcwd()
root_directory = os.path.abspath(os.path.join(current_directory, os.pardir))
sys.path.append(root_directory)

# custom utils
from utils.io import Predictions
from utils.io import LMR_BILOU_Scrapper, LMR_JSON_Scrapper
from utils.preprocessing import Preprocess

### **Exploring Data**

The provided Train.csv contains many missing values, so we fetch the data from the original source.

In [2]:
LMR_JSON_Scrapper(output_dir="../data/self_scrapped/raw").run()

Processing dataset: california_wildfires_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.38file/s]


Processing dataset: canada_wildfires_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.17file/s]


Processing dataset: cyclone_idai_2019


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.28file/s]


Processing dataset: ecuador_earthquake_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.50file/s]


Processing dataset: greece_wildfires_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.19file/s]


Processing dataset: hurricane_dorian_2019


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.37file/s]


Processing dataset: hurricane_florence_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.37file/s]


Processing dataset: hurricane_harvey_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.50file/s]


Processing dataset: hurricane_irma_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.44file/s]


Processing dataset: hurricane_maria_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.36file/s]


Processing dataset: hurricane_matthew_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  1.92file/s]


Processing dataset: italy_earthquake_aug_2016


Extracting Files : 100%|██████████| 3/3 [00:02<00:00,  1.42file/s]


Processing dataset: kaikoura_earthquake_2016


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.49file/s]


Processing dataset: kerala_floods_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.33file/s]


Processing dataset: maryland_floods_2018


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.64file/s]


Processing dataset: midwestern_us_floods_2019


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.36file/s]


Processing dataset: pakistan_earthquake_2019


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.63file/s]


Processing dataset: puebla_mexico_earthquake_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.46file/s]


Processing dataset: srilanka_floods_2017


Extracting Files : 100%|██████████| 3/3 [00:01<00:00,  2.87file/s]

Processing complete.





- Let's concatenate our datasets.

In [2]:
train_dfs = []
dev_dfs   = []
test_dfs  = []
path_dfs  = "../data/self_scrapped/raw"
for filename in os.listdir(path_dfs):
    if filename.endswith(".csv"):
        file_path = os.path.join(path_dfs, filename)
        if filename.startswith("train"):
            df = pd.read_csv(file_path)
            train_dfs.append(df)
        elif filename.startswith("dev"):
            df = pd.read_csv(file_path)
            dev_dfs.append(df)
        elif filename.startswith("test_unlabeled"):
            df = pd.read_csv(file_path)
            test_dfs.append(df)

df_train = pd.concat(train_dfs, ignore_index=True) if train_dfs else pd.DataFrame()
df_test  = pd.concat(test_dfs, ignore_index=True) if test_dfs else pd.DataFrame()
df_dev   = pd.concat(dev_dfs, ignore_index=True) if dev_dfs else pd.DataFrame()

print("TRAIN SHAPE: ", df_train.shape)
print("TEST  SHAPE: ", df_test.shape)
print("DEV   SHAPE: ", df_dev.shape)

TRAIN SHAPE:  (14392, 3)
TEST  SHAPE:  (4066, 3)
DEV   SHAPE:  (2056, 3)


- The tweets contain hashtags, non-ASCII characters, stop words, and so on, so the data has to be cleaned.

In [3]:
df_train.head(30)

Unnamed: 0,tweet_id,text,location_mentions
0,ID_1022420413882744832,Nearly half of #houses checked in #fire-stricken areas deemed #uninhabitable #GO #PrayForGreece #PrayForAthens #AthensFires ἞C἟7,
1,ID_1021778661895294976,RT @anadoluagency: #Greece: Death toll from wildfires hits 74,Greece=>COUNTRY
2,ID_1022015997740503042,When the essence of cooperation meets the sad reality of lifeThe IPA partner country offers financial aid to Greece to handle disaster @InterregIPACBC #Greecefires,Greece=>COUNTRY
3,ID_1022557424585240576,We are live from the Lureio Idrima the orphanage and nursing home operared by the nuns of the Holy Trinity Monastery that was destroyed by the fire in Neos Voutzas. Here too the scene is apocalyptic.,Holy Trinity Monastery=>HUMAN-MADE POINT-OF-INTEREST * Neos Voutzas=>NEIGHBORHOOD
4,ID_1021749412639457280,RT @AP: Greek prime minister declares 3-day national mourning period for dozens killed by wildfires near Athens.,Athens=>CITY
5,ID_1024662121391579136,#Greece vows to speed up destruction of illegal property after #wildfires,Greece=>COUNTRY
6,ID_1022441664697298944,"In Mati, hundreds of humanitarian volunteers have joined relief efforts following Greece’s most deadly wildfires in a decade",Mati=>CITY * Greece=>COUNTRY
7,ID_1024974395759108097,State Minister #Flambouraris offers thanks for assistance to #Greecefires victims,
8,ID_1022473412034416640,"Unfortunately theres the first confirmed fatality, hope to be the last, among tourists at the #AthensFires: an Irishman in honeymoon. So sorry, R.I.P ὢ2 #Mati #Attica #wildfires #PrayForGreece #PrayForAthens #Πυρκαγια #Αττικη #ματι",Attica=>CITY
9,ID_1022371614091096064,"A special account has been opened at the Bank of Greece for donations in support of fire victims. Account number 23/2341195169, IBAN GR4601000230000002341195169) available for foreign states, businesses and individuals from Greece and abroad to provide their financial support",Greece=>COUNTRY * Greece=>COUNTRY


- There are some sentences without a location mention. We need to look closer. It could be normal if there is no corresponding location found in the tweet, or it might be an error from the labeling task. Note that for the test set, it is normal for all location_mentions to be NaN. (😎 Yeah, we have to predict this value).

In [4]:
print(df_train.isnull().sum())
print(df_test.isnull().sum())
print(df_dev.isnull().sum())
#df_train.dropna(inplace=True)

tweet_id                0
text                    0
location_mentions    4026
dtype: int64
tweet_id                0
text                    0
location_mentions    4066
dtype: int64
tweet_id               0
text                   0
location_mentions    573
dtype: int64
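Rows with an empty `location_mentions` are easier to inspect once the field is parsed into (mention, type) pairs. The sketch below assumes the format seen in the preview above, with mentions separated by ` * ` and `=>` between the text and its type; `parse_location_mentions` is a hypothetical helper, not part of `utils`.

```python
def parse_location_mentions(field):
    """Split 'Mati=>CITY * Greece=>COUNTRY' into [('Mati', 'CITY'), ('Greece', 'COUNTRY')]."""
    # NaN (a float) or an empty cell means no annotated location
    if not isinstance(field, str) or not field.strip():
        return []
    pairs = []
    for item in field.split(" * "):
        # rpartition keeps any earlier '=>' inside the mention text itself
        mention, sep, loc_type = item.rpartition("=>")
        if sep:
            pairs.append((mention.strip(), loc_type.strip()))
    return pairs


print(parse_location_mentions("Mati=>CITY * Greece=>COUNTRY"))
print(parse_location_mentions(float("nan")))
```

Applied with `df_train["location_mentions"].apply(parse_location_mentions)`, this would make it possible to count tweets with no mention separately from tweets with several.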


### **Preprocessing Data**

- Remove special characters
- Treat hashtags and user tags
- Remove stop words
- Tokenization
- Lemmatization
- BIO/BILOU formatting
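The first cleaning steps can be sketched on plain strings as follows. This is only an illustration: the real `Preprocess` methods work on DataFrames and their implementation is not shown here, so the functions below are hypothetical stand-ins that merely borrow the same names.

```python
import re


def remove_non_ascii(text):
    # drop every character that cannot be encoded as ASCII (emoji, curly quotes, ...)
    return text.encode("ascii", errors="ignore").decode("ascii")


def remove_usertag(text):
    # remove @username mentions
    return re.sub(r"@\w+", "", text)


def reformat_hashtag(text):
    # keep the hashtag word but drop the leading '#'
    return re.sub(r"#(\w+)", r"\1", text)


def clean_tweet(text):
    text = remove_non_ascii(text)
    text = remove_usertag(text)
    text = reformat_hashtag(text)
    # collapse the whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()


print(clean_tweet("RT @anadoluagency: #Greece: Death toll from wildfires hits 74"))
```

Hashtags are kept as words rather than deleted because they often carry the location itself (for example `#Greece`).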

In [7]:
df_train = Preprocess.remove_non_ascii(df_train, column_name='text')
df_train = Preprocess.remove_usertag(df_train, column_name='text')
df_train = Preprocess.reformat_hashtag(df_train, column_name='text')
df_train_ = df_train[:50].copy()  # explicit copy of the first 50 rows, so new columns are not written to a slice of df_train
df_train_ = Preprocess.remove_stop_words(df_train_, column_name='text', new_col="text-transformed", transformation=[
    "tokenize", "lemma", "lower"], save_in="../data/transformed/train.lemma.csv")
df_tag_train_ = Preprocess.build_bilou_encoding(df_train_, text_col="text", save_in="../data/transformed/train.tag.csv")

df_tag_train_.head(60)

100%|██████████| 50/50 [00:03<00:00, 13.04it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.loc[:, new_col] = df[column_name].progress_apply(process_text)


Unnamed: 0,sentence_id,words,labels
0,ID_1022420413882744832,Nearly,O
1,ID_1022420413882744832,half,O
2,ID_1022420413882744832,of,O
3,ID_1022420413882744832,houses,O
4,ID_1022420413882744832,checked,O
5,ID_1022420413882744832,in,O
6,ID_1022420413882744832,fire,O
7,ID_1022420413882744832,stricken,O
8,ID_1022420413882744832,areas,O
9,ID_1022420413882744832,deemed,O


In [6]:
df_train_.head(30)

Unnamed: 0,tweet_id,text,location_mentions,text-transformed
0,ID_1022420413882744832,Nearly half of houses checked in fire-stricken areas deemed uninhabitable GO C7,,nearly half of house check in fire stricken area deem uninhabitable go c7
1,ID_1021778661895294976,RT : Greece: Death toll from wildfires hits 74,Greece=>COUNTRY,rt greece death toll from wildfire hit 74
2,ID_1022015997740503042,When the essence of cooperation meets the sad reality of lifeThe IPA partner country offers financial aid to Greece to handle disaster Greecefires,Greece=>COUNTRY,when the essence of cooperation meet the sad reality of life the ipa partner country offer financial aid to greece to handle disaster greecefires
3,ID_1022557424585240576,We are live from the Lureio Idrima the orphanage and nursing home operared by the nuns of the Holy Trinity Monastery that was destroyed by the fire in Neos Voutzas. Here too the scene is apocalyptic.,Holy Trinity Monastery=>HUMAN-MADE POINT-OF-INTEREST * Neos Voutzas=>NEIGHBORHOOD,we be live from the lureio idrima the orphanage and nursing home operare by the nun of the holy trinity monastery that be destroy by the fire in neos voutzas here too the scene be apocalyptic
4,ID_1021749412639457280,RT : Greek prime minister declares 3-day national mourning period for dozens killed by wildfires near Athens.,Athens=>CITY,rt greek prime minister declare 3 day national mourn period for dozen kill by wildfire near athens
5,ID_1024662121391579136,Greece vows to speed up destruction of illegal property after wildfires,Greece=>COUNTRY,greece vow to speed up destruction of illegal property after wildfire
6,ID_1022441664697298944,"In Mati, hundreds of humanitarian volunteers have joined relief efforts following Greeces most deadly wildfires in a decade",Mati=>CITY * Greece=>COUNTRY,in mati hundred of humanitarian volunteer have join relief effort follow greeces most deadly wildfire in a decade
7,ID_1024974395759108097,State Minister Flambouraris offers thanks for assistance to Greecefires victims,,state minister flambouraris offer thanks for assistance to greecefires victim
8,ID_1022473412034416640,"Unfortunately theres the first confirmed fatality, hope to be the last, among tourists at the : an Irishman in honeymoon. So sorry, R.I.P 2 Mati Attica wildfires",Attica=>CITY,unfortunately there the first confirm fatality hope to be the last among tourist at the a irishman in honeymoon so sorry r.i.p 2 mati attica wildfire
9,ID_1022371614091096064,"A special account has been opened at the Bank of Greece for donations in support of fire victims. Account number 23/2341195169, IBAN GR4601000230000002341195169) available for foreign states, businesses and individuals from Greece and abroad to provide their financial support",Greece=>COUNTRY * Greece=>COUNTRY,a special account have be open at the bank of greece for donation in support of fire victim account number 23/2341195169 iban gr4601000230000002341195169 ) available for foreign state business and individual from greece and abroad to provide their financial support


In [10]:
df_tag_train_[60:100]

Unnamed: 0,sentence_id,words,labels
60,ID_1022557424585240576,of,O
61,ID_1022557424585240576,the,O
62,ID_1022557424585240576,Holy,O
63,ID_1022557424585240576,Trinity,O
64,ID_1022557424585240576,Monastery,O
65,ID_1022557424585240576,that,O
66,ID_1022557424585240576,was,O
67,ID_1022557424585240576,destroyed,O
68,ID_1022557424585240576,by,O
69,ID_1022557424585240576,the,O


In [8]:
df_dev   = Preprocess.remove_special_characters(df_dev, column_name='word')

In [10]:
#Preprocess.treat_hashtags("#EddisonHermond missing after catastrophic flood hits #EllicottCity #Maryland; damage believed worse than 20")
#Preprocess.remove_stop_words("#EddisonHermond missing after catastrophic flood hits #EllicottCity #Maryland; damage believed worse than 20")
#Preprocess.correct_spelling("#EddisonHermond missings after catastrophic flood hits #EllicottCity #Maryland; damage believed worse than 20")

### **BIO Tagging**

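BIO stands for Begin, Inside, and Outside. It is a scheme for tagging tokens (words or subwords) in a sequence to identify entities in the text: each token receives a tag indicating whether it begins an entity, lies inside one, or is outside any entity. The labels used in this notebook follow the BILOU extension of BIO, which adds L (the Last token of a multi-token entity) and U (Unit, a single-token entity), as in `B-CITY`, `L-CITY`, and `U-STAT`.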
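To make the tagging scheme concrete, here is a minimal sketch of BILOU encoding for one tokenized tweet. It assumes whitespace tokens and exact token matches; `bilou_encode` is a hypothetical function and not the implementation behind `Preprocess.build_bilou_encoding`. The type codes (`HPOI`, `NBHD`, `CTRY`) are taken from the label list used later in this notebook.

```python
def bilou_encode(tokens, mentions):
    """tokens: list of words; mentions: list of (mention_text, type_code) pairs."""
    labels = ["O"] * len(tokens)
    for mention, code in mentions:
        parts = mention.split()
        n = len(parts)
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            # match the mention on tokens that have not been labelled yet,
            # so a repeated mention is assigned to the next free occurrence
            if tokens[start:start + n] == parts and all(labels[i] == "O" for i in span):
                if n == 1:
                    labels[start] = f"U-{code}"          # single-token entity
                else:
                    labels[start] = f"B-{code}"          # first token
                    for i in range(start + 1, start + n - 1):
                        labels[i] = f"I-{code}"          # middle tokens
                    labels[start + n - 1] = f"L-{code}"  # last token
                break
    return labels


tokens = "fire in Neos Voutzas near the Holy Trinity Monastery".split()
mentions = [("Holy Trinity Monastery", "HPOI"), ("Neos Voutzas", "NBHD")]
for token, label in zip(tokens, bilou_encode(tokens, mentions)):
    print(token, label)
```

Exact matching is the main limitation of this sketch: a mention followed by punctuation (`Athens.`) would stay `O` unless the text is tokenized or cleaned first.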

In [11]:
df_train["id_sentence"] = LabelEncoder().fit_transform(df_train["id_sentence"])
df_dev["id_sentence"]   = LabelEncoder().fit_transform(df_dev["id_sentence"])
df_train["tag"]         = df_train["tag"].str.upper()
df_dev["tag"]           = df_dev["tag"].str.upper()

In [12]:
df_train.head()

Unnamed: 0,id_sentence,word,tag
0,3542,Nearly,O
1,3542,half,O
2,3542,of,O
3,3542,#,O
4,3542,houses,O


### **Prepare training, dev and test data**

In [13]:
# X = df_tag[["tweet_id", "word"]]
# y = df_tag["label"]
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train  = df_train[["id_sentence", "word"]]
X_test   = df_dev[["id_sentence", "word"]]
y_train  = df_train["tag"]
y_test   = df_dev["tag"]

train_data = pd.DataFrame({"sentence_id": X_train["id_sentence"], "words": X_train["word"], "labels": y_train})
test_data = pd.DataFrame({"sentence_id": X_test["id_sentence"], "words": X_test["word"], "labels": y_test})

train_data

Unnamed: 0,sentence_id,words,labels
0,3542,Nearly,O
1,3542,half,O
2,3542,of,O
3,3542,#,O
4,3542,houses,O
...,...,...,...
363371,5121,Preparedness,O
363372,5121,Plan,O
363373,5121,.,O
363374,5121,#,O


### **Model Training**

In [14]:
label = pd.concat([df_train, df_dev])["tag"].unique().tolist()
label

['O',
 'U-CTRY',
 'B-HPOI',
 'I-HPOI',
 'L-HPOI',
 'B-NBHD',
 'L-NBHD',
 'U-CITY',
 'U-CONT',
 'U-STAT',
 'B-ISL',
 'L-ISL',
 'U-ISL',
 'U-OTHR',
 'B-CITY',
 'L-CITY',
 'B-NPOI',
 'L-NPOI',
 'U-NBHD',
 'B-CNTY',
 'I-CNTY',
 'L-CNTY',
 'B-OTHR',
 'L-OTHR',
 'U-DIST',
 'B-DIST',
 'L-DIST',
 'I-CITY',
 'B-CTRY',
 'L-CTRY',
 'U-HPOI',
 'I-DIST',
 'B-STAT',
 'L-STAT',
 'I-NBHD',
 'U-CNTY',
 'I-NPOI',
 'B-ST',
 'L-ST',
 'U-NPOI',
 'I-OTHR',
 'I-ST',
 'I-CTRY',
 'B-CONT',
 'L-CONT',
 'I-STAT',
 'U-ST',
 'I-ISL']

In [15]:
model_args = NERArgs()
model_args.num_train_epochs = 1
model_args.learning_rate = 1e-4
model_args.overwrite_output_dir = True
model_args.train_batch_size = 32
model_args.eval_batch_size = 32
model_args.labels_list = label
#model_args.lazy_loading = True

In [16]:
model = NERModel('bert', "bert-base-cased", args=model_args, labels=label, use_cuda=False)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
model.train_model(train_data, eval_data=test_data)

  0%|          | 0/17 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 1 of 1:   0%|          | 0/450 [00:00<?, ?it/s]

(450, 0.15278238881586326)

In [18]:
result, model_outputs, wrong_preds = model.eval_model(test_data)

  0%|          | 0/5 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/65 [00:00<?, ?it/s]



In [19]:
predictions, raw_outputs = model.predict([
    "Elicott City, Maryland, struck by catastrophic flooding; 1 missing.",
    "Memorial Day weekend floods ravage Maryland town"
])

  0%|          | 0/1 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

In [20]:
predictions

[[{'Elicott': 'B-CITY'},
  {'City,': 'L-CITY'},
  {'Maryland,': 'U-STAT'},
  {'struck': 'O'},
  {'by': 'O'},
  {'catastrophic': 'O'},
  {'flooding;': 'O'},
  {'1': 'O'},
  {'missing.': 'O'}],
 [{'Memorial': 'O'},
  {'Day': 'O'},
  {'weekend': 'O'},
  {'floods': 'O'},
  {'ravage': 'O'},
  {'Maryland': 'U-STAT'},
  {'town': 'O'}]]

### **Make prediction for Context**

In [21]:
# Get Data and Preprocess
# df_context = pd.read_csv('../data/provided/Test.csv')
# df_context = Preprocess.remove_special_characters(df_context, column_name='text')
# df_context['text'] = df_context['text'].apply(lambda x: Preprocess.treat_hashtags(x))
# df_context['text'] = df_context['text'].apply(lambda x: Preprocess.correct_spelling(x))
# #df_context['text'] = df_context['text'].apply(lambda x: Preprocess.remove_stop_words(x))
# df_context.to_csv("../data/provided/Test-processed.csv")

df_context = pd.read_csv('../data/provided/Test-processed.csv')

ids = df_context["tweet_id"].values
tweets = df_context["text"].values

# Make prediction
predictions, raw_outputs = model.predict(tweets)

  0%|          | 0/5 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/78 [00:00<?, ?it/s]

In [22]:
# Extract Location Mention based on model output
results = []
for sentence in predictions:
    result = " ".join([word for d in sentence for word, tag in d.items() if tag != 'O'])
    results.append(result)

Predictions.to_csv(ids, results)

Saved predictions to ../submissions/submission_3.csv
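Joining every non-`O` token, as in the cell above, merges separate mentions into one string and keeps attached punctuation such as `City,`. A possible refinement is sketched below. It assumes the prediction format shown earlier (a list of single-key `{word: tag}` dicts per sentence) with BILOU tags; `extract_mentions` is a hypothetical helper, and whether the submission format prefers separate mentions is not established here.

```python
import string


def extract_mentions(sentence):
    """Group BILOU-tagged tokens of one predicted sentence into location mentions."""
    mentions, current = [], []

    def flush():
        # close the mention being built, if any
        if current:
            mentions.append(" ".join(current))
            current.clear()

    for token_dict in sentence:
        for word, tag in token_dict.items():
            if tag == "O":
                flush()
                continue
            prefix = tag.split("-", 1)[0]
            if prefix in ("B", "U"):
                flush()  # a new entity starts here
            cleaned = word.strip(string.punctuation)
            if cleaned:
                current.append(cleaned)
            if prefix in ("L", "U"):
                flush()  # the entity ends on this token
    flush()
    return mentions


sample = [{"Elicott": "B-CITY"}, {"City,": "L-CITY"}, {"Maryland,": "U-STAT"}, {"struck": "O"}]
print(extract_mentions(sample))
```

The per-tweet result could then be joined with a space to match the original output, for example `" ".join(extract_mentions(sentence))`.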


In [26]:
df_context = pd.read_csv('../data/provided/Test-processed.csv')
df_context

Unnamed: 0.1,Unnamed: 0,tweet_id,text
0,1,ID_1001155505459486720,SOLDER MISSING IN flood PRAY FOR EDDISON HERMO...
1,2,ID_1001155756371136512,ut timer Police searching for missing person a...
2,4,ID_1001164907587538944,Ellicott City FLOODING pictures marchland Gove...
3,5,ID_1001178904617476096,@CBSNews Our Harts goy out to a Fellow Soldier...
4,6,ID_1001179909245587456,CRAZY video Roaring flash floods struck a marc...
...,...,...,...
2491,2935,ID_914995385743167488,Hurricane Maria left devastating damage in pre...
2492,2936,ID_915002992214110208,Artificial Intelligence Raises pe Million for ...
2493,2938,ID_915026957758328832,@HannahStocking I live the Mexico earthquake a...
2494,2939,ID_915253441726889984,ut @GlobalCalgary: watch National Taco Day in ...


In [None]:
### END