<div style="padding: 0.5em; background-color: #1876d1; color: #fff; font-weight: bold; font-size: 1.4em;">
    [Approach 3]  Location Mention Recognition - NER BERT Transformer
</div>

In this Jupyter notebook, we use Named Entity Recognition (NER) to extract location mentions from tweets posted on X (formerly Twitter) during emergency situations.

Notes:
* Perform NER
* Try a BERT model
* Extract location mentions

---
<b>#Microsoft Learn Challenge, #Zindi, #Hamad Bin Khalifa University </b>

### **Importing Libraries**

In [None]:
!pip install simpletransformers stanza

In [1]:
# general utils
import numpy as np
import pandas as pd
import stanza, os, sys
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from simpletransformers.ner import NERModel, NERArgs

# utils setup
current_directory = os.getcwd()
root_directory = os.path.abspath(os.path.join(current_directory, os.pardir))
sys.path.append(root_directory)

# custom utils
from utils.io import Predictions
from utils.io import LMR_Scrapper

### **Exploring Data**

The provided Train.csv contains many missing values, so we have to fetch the data from its original source.

In [2]:
LMR_Scrapper(output_dir="../data/self_scrappe/").run()

Processing dataset: california_wildfires_2018


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.29file/s]


Processing dataset: canada_wildfires_2016


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.49file/s]


Processing dataset: cyclone_idai_2019


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.25file/s]


Processing dataset: ecuador_earthquake_2016


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.53file/s]


Processing dataset: greece_wildfires_2018


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.44file/s]


Processing dataset: hurricane_dorian_2019


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.20file/s]


Processing dataset: hurricane_florence_2018


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.26file/s]


Processing dataset: hurricane_harvey_2017


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.41file/s]


Processing dataset: hurricane_irma_2017


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.54file/s]


Processing dataset: hurricane_maria_2017


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.44file/s]


Processing dataset: hurricane_matthew_2016


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.70file/s]


Processing dataset: italy_earthquake_aug_2016


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.78file/s]


Processing dataset: kaikoura_earthquake_2016


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.58file/s]


Processing dataset: kerala_floods_2018


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.32file/s]


Processing dataset: maryland_floods_2018


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.70file/s]


Processing dataset: midwestern_us_floods_2019


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.33file/s]


Processing dataset: pakistan_earthquake_2019


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.41file/s]


Processing dataset: puebla_mexico_earthquake_2017


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.40file/s]


Processing dataset: srilanka_floods_2017


Extracting Files : 100%|██████████| 2/2 [00:00<00:00,  2.83file/s]

Processing complete.





- Let's load our dataset

In [28]:
df_raw = pd.read_csv('../data/Train.csv')
df_raw.head(20)

Unnamed: 0,tweet_id,text,location
0,ID_1001136212718088192,,EllicottCity
1,ID_1001136696589631488,"Flash floods struck a Maryland city on Sunday,...",Maryland
2,ID_1001136950345109504,State of emergency declared for Maryland flood...,Maryland
3,ID_1001137334056833024,Other parts of Maryland also saw significant d...,Baltimore Maryland
4,ID_1001138374923579392,"Catastrophic Flooding Slams Ellicott City, Mar...",Ellicott City Maryland
5,ID_1001138377717157888,WATCH: 1 missing after flash #FLOODING devasta...,Ellicott City Maryland
6,ID_1001139323075416064,,Ellicott City Maryland
7,ID_1001139644023693312,,
8,ID_1001140017207459840,,Maryland
9,ID_1001140276377935872,,Maryland


In [38]:
print("Shape: ", df_raw.shape, "\n------------------")
df_raw.isnull().sum()

Shape:  (73072, 3) 
------------------


tweet_id        0
text        56624
location    29612
dtype: int64

In [39]:
#df_raw.fillna(method='ffill', inplace=True)
# Keep only rows that have both text and location
df_raw.dropna(inplace=True)
df_raw.shape

(11849, 3)

### **BIO Tagging**

BIO stands for Begin, Inside, and Outside. It’s a method for tagging tokens (words or subwords) in a sequence to identify entities within the text. Each token in the text is assigned a tag that indicates whether it is at the beginning of an entity, inside an entity, or outside of any entity.
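
To make the scheme concrete, here is a small sketch with hand-written tokens and tags (illustrative only, not output from this notebook). The helper `is_valid_bio` is a hypothetical name introduced here; it checks the usual BIO convention that an `I-` tag must continue an entity opened by a `B-` tag of the same type.

```python
# Hand-written example, assuming a single entity type called GEO.
tokens = ["Flooding", "hits", "Ellicott", "City", "in", "Maryland"]
tags = ["O", "O", "B-GEO", "I-GEO", "O", "B-GEO"]


def is_valid_bio(tags):
    """Return True if every I- tag continues an entity of the same type."""
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            # An I- tag is only valid right after B- or I- of the same type.
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                return False
        previous = tag
    return True


for token, tag in zip(tokens, tags):
    print(f"{token:10s} {tag}")

print(is_valid_bio(tags))
print(is_valid_bio(["O", "I-GEO"]))
```

Here "Ellicott City" is one two-token entity and "Maryland" a separate one-token entity; the second call shows a sequence that breaks the convention because `I-GEO` follows `O`.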

In [75]:
nlp = stanza.Pipeline(lang='en', processors='tokenize', verbose=False)

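def generate_bio_tags(text, location):
    """Tokenize `text` and tag every token that also appears in `location`."""
    doc = nlp(text)
    tokens = []
    tags = []
    loc_words = location.split()

    for sentence in doc.sentences:
        for token in sentence.tokens:
            token_text = token.text
            tokens.append(token_text)

            # Matching is done word by word: only the first word of `location`
            # gets B-geo; any other matching word gets I-geo, adjacent or not.
            if token_text in loc_words:
                if loc_words[0] == token_text:
                    tags.append('B-geo')
                else:
                    tags.append('I-geo')
            else:
                tags.append('O')

    return tokens, tags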

# Apply to each row of the training set
results = []
for index, row in df_raw.iterrows():
    tweet_id = row['tweet_id']
    text = row['text']
    location = row['location']
    tokens, tags = generate_bio_tags(text, location)
    for token, tag in zip(tokens, tags):
        results.append({'tweet_id': tweet_id, 'word': token, 'label': tag})

# Build the token-level DataFrame: integer sentence ids, upper-case labels
df_tag = pd.DataFrame(results)
df_tag["tweet_id"] = LabelEncoder().fit_transform(df_tag["tweet_id"])
df_tag["label"]    = df_tag["label"].str.upper()

In [76]:
df_tag.head(20)

Unnamed: 0,tweet_id,word,label
0,0,Flash,O
1,0,floods,O
2,0,struck,O
3,0,a,O
4,0,Maryland,B-GEO
5,0,city,O
6,0,on,O
7,0,Sunday,O
8,0,",",O
9,0,washing,O
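
The tagger above gives `B-GEO` only to the first word of the `location` string, so a second location such as "Maryland" in "Ellicott City Maryland" would receive `I-GEO` even when it stands alone. One possible alternative, sketched below under the assumption that tokens are already available as a list, starts a new entity whenever a matching token follows a non-matching one. The function name `tag_by_contiguity` is hypothetical and not part of the notebook.

```python
def tag_by_contiguity(tokens, location):
    """Tag runs of adjacent location words: first token B-GEO, the rest I-GEO."""
    loc_words = set(location.split())
    tags = []
    inside = False  # True while the previous token was part of a location
    for token in tokens:
        if token in loc_words:
            tags.append("I-GEO" if inside else "B-GEO")
            inside = True
        else:
            tags.append("O")
            inside = False
    return tags


# Hand-written tokens for illustration (not produced by stanza here).
tokens = ["Catastrophic", "Flooding", "Slams", "Ellicott", "City", ",", "Maryland"]
print(tag_by_contiguity(tokens, "Ellicott City Maryland"))
```

A limitation of this sketch is that two distinct locations written next to each other with no token in between would be merged into one entity.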


### **Prepare training and test data**

In [96]:
X = df_tag[["tweet_id", "word"]]
y = df_tag["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

train_data = pd.DataFrame({"sentence_id": X_train["tweet_id"], "words": X_train["word"], "labels": y_train})
test_data = pd.DataFrame({"sentence_id": X_test["tweet_id"], "words": X_test["word"], "labels": y_test})

train_data

Unnamed: 0,sentence_id,words,labels
3673,152,Volunteer,O
162856,5019,5,O
261904,9678,Irma,O
97087,3119,drive,O
282994,10770,the,O
...,...,...,...
37028,1254,to,O
84592,2691,are,O
9826,358,#,O
238547,8508,your,O
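
Note that `train_test_split` is applied here to the token table, so words from the same tweet can land in both sets and the sentences seen by the model are fragmented. A minimal sketch of a sentence-level split follows; it assumes only a list of sentence ids, one per token, and the helper name `split_sentence_ids` is hypothetical.

```python
import random


def split_sentence_ids(sentence_ids, test_size=0.2, seed=42):
    """Split unique sentence ids into disjoint train and test sets."""
    unique_ids = sorted(set(sentence_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_size))
    test_ids = set(unique_ids[:n_test])
    train_ids = set(unique_ids[n_test:])
    return train_ids, test_ids


# Toy data: one sentence id per token, as in the `tweet_id` column.
token_sentence_ids = [0, 0, 0, 1, 1, 2, 2, 2, 3, 4, 4]
train_ids, test_ids = split_sentence_ids(token_sentence_ids)
print(sorted(train_ids), sorted(test_ids))
```

With the notebook's `df_tag`, the resulting sets could be applied with `df_tag[df_tag["tweet_id"].isin(train_ids)]` and the equivalent for `test_ids`, which keeps each tweet's tokens together and in order. scikit-learn's `GroupShuffleSplit` offers a similar group-aware split.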


### **Model Training**

In [97]:
label = df_tag["label"].unique().tolist()
label

['O', 'B-GEO', 'I-GEO']

In [98]:
model_args = NERArgs()
model_args.num_train_epochs = 1
model_args.learning_rate = 1e-4
model_args.overwrite_output_dir = True
model_args.train_batch_size = 32
model_args.eval_batch_size = 32
model_args.labels_list = label

In [99]:
model = NERModel('bert', "bert-base-cased", args=model_args, use_cuda=False)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [100]:
model.train_model(train_data, eval_data=test_data)

  0%|          | 0/17 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 1 of 1:   0%|          | 0/371 [00:00<?, ?it/s]

(371, 0.0998445101222902)

In [101]:
result, model_outputs, wrong_preds = model.eval_model(test_data)

  0%|          | 0/17 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/364 [00:00<?, ?it/s]
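
The `result` returned by `eval_model` is not displayed in this notebook. For reference, NER quality is commonly judged at the entity level, where a prediction counts only if the whole span matches. The sketch below is a pure-Python illustration of that idea on hand-written tags; `extract_spans` and `span_f1` are hypothetical helpers, entity types are ignored because there is only one (`GEO`), and this is not claimed to be the exact metric the library computes.

```python
def extract_spans(tags):
    """Return the set of (start, end) spans, end exclusive, found in BIO tags."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        # A B- tag always opens a span; a stray I- tag is leniently treated as one.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            if start is not None:
                spans.add((start, i))
            start = i
        elif tag == "O":
            if start is not None:
                spans.add((start, i))
            start = None
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def span_f1(gold_tags, pred_tags):
    """Entity-level precision, recall and F1 based on exact span matches."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-GEO", "I-GEO", "O", "B-GEO", "O"]
pred = ["O", "B-GEO", "O", "B-GEO", "O"]
print(span_f1(gold, pred))
```

In this toy case one of the two gold entities is matched exactly, while the partially recovered first entity counts as both a miss and a false alarm.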

In [102]:
predictions, raw_outputs = model.predict([
    "Elicott City, Maryland, struck by catastrophic flooding; 1 missing.",
    "Memorial Day weekend floods ravage Maryland town"
])

  0%|          | 0/1 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]

In [103]:
predictions

[[{'Elicott': 'O'},
  {'City,': 'I-GEO'},
  {'Maryland,': 'I-GEO'},
  {'struck': 'O'},
  {'by': 'O'},
  {'catastrophic': 'O'},
  {'flooding;': 'O'},
  {'1': 'O'},
  {'missing.': 'O'}],
 [{'Memorial': 'O'},
  {'Day': 'O'},
  {'weekend': 'O'},
  {'floods': 'O'},
  {'ravage': 'O'},
  {'Maryland': 'B-GEO'},
  {'town': 'O'}]]

### **Make predictions for the test set**

In [111]:
df_context = pd.read_csv('../data/Test.csv')
ids = df_context["tweet_id"].values
tweets = df_context["text"].values

In [116]:
predictions, raw_outputs = model.predict(tweets)

  0%|          | 0/6 [00:00<?, ?it/s]

Running Prediction:   0%|          | 0/92 [00:00<?, ?it/s]

In [118]:
results = []
for sentence in predictions:
    result = " ".join([word for d in sentence for word, tag in d.items() if tag != 'O'])
    results.append(result)
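
The join above keeps any punctuation attached to the predicted words (for example `City,` and `Maryland,` in the sample predictions earlier). If the submission is expected to contain bare location words, which is an assumption here, a small post-processing step could strip it. The helper `locations_from_prediction` below is hypothetical and works on the list-of-dicts format shown in the prediction output.

```python
import string


def locations_from_prediction(sentence):
    """Join all non-O words of one predicted sentence, stripping punctuation."""
    words = []
    for item in sentence:
        for word, tag in item.items():
            if tag != "O":
                # Remove leading/trailing punctuation such as ',' or '#'.
                cleaned = word.strip(string.punctuation)
                if cleaned:
                    words.append(cleaned)
    return " ".join(words)


# Excerpt of the first sample prediction shown earlier in the notebook.
sample = [{"Elicott": "O"}, {"City,": "I-GEO"}, {"Maryland,": "I-GEO"}, {"struck": "O"}]
print(locations_from_prediction(sample))
```

Note that stripping punctuation also removes a leading `#`, so hashtags would be reduced to plain words; whether that is desirable depends on how the reference locations are written.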

In [123]:
Predictions.to_csv(ids, results)

Saved predictions to submissions/submission_2.csv
