<div style="padding: 0.5em; background-color: #1876d1; color: #fff; font-weight: bold; font-size: 1.4em;">
    [Approach 1]  Location Mention Recognition - NER BERT Approach
</div>

In this Jupyter notebook, we use Named Entity Recognition (NER) to extract location mentions from tweets posted on X (formerly Twitter) during emergency situations.

Note :
* Do NER
* Try BERT Model
* Extract Location Mention

---
<b>#Microsoft Learn Challenge, #Zindi, #Hamad Bin Khalifa University </b>

<!-- 4 types of NER systems
Dictionary-based. Dictionary-based NER systems reference terms listed in dictionaries to identify their presence in text. Dictionaries can be any collection of words related to a specific field or domain. You can create one yourself or use public sources such as databases. 

Rule-based. Rule-based NER systems rely on a set of instructions for extracting named entities from text. You must create the rules based on two types of instruction: pattern-based rules, which relate to word forms and structure, and context-based rules like “if a contraction such as Mr. or Ms. precedes a name, then that contraction is the person’s honorific title.” These rules can also be combined with dictionaries.

Machine learning-based. Machine learning-based NER systems are based on statistical models designed to identify entity names. To develop an ML-based NER system, the machine learning model must be trained on annotated documents. Annotated documents have explanations that help the machine learn to produce entity names based on instruction and past experiences.

Hybrid systems. Hybrid NER systems combine more than one of the approaches listed above. 
 -->


### **Importing Libraries**

In [2]:
!pip install "keras<3.0.0" "tensorflow<2.16" "tf-models-official<2.16" mediapipe-model-maker

Collecting keras<3.0.0
  Downloading keras-2.15.0-py3-none-any.whl.metadata (2.4 kB)
Collecting tensorflow<2.16
  Downloading tensorflow-2.13.1-cp38-cp38-macosx_12_0_arm64.whl.metadata (2.6 kB)
Collecting tf-models-official<2.16
  Downloading tf_models_official-2.15.0-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting mediapipe-model-maker
  Downloading mediapipe_model_maker-0.2.1.4-py3-none-any.whl.metadata (1.7 kB)
INFO: pip is looking at multiple versions of tensorflow to determine which version is compatible with other requirements. This could take a while.
Collecting tensorflow<2.16
  Downloading tensorflow-2.13.0-cp38-cp38-macosx_12_0_arm64.whl.metadata (2.6 kB)
Collecting tensorflow-macos==2.13.0 (from tensorflow<2.16)
  Downloading tensorflow_macos-2.13.0-cp38-cp38-macosx_12_0_arm64.whl.metadata (3.2 kB)
Collecting absl-py>=1.0.0 (from tensorflow-macos==2.13.0->tensorflow<2.16)
  Using cached absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting astunparse>=1.6.0 (from ten

In [1]:
import numpy as np
import pandas as pd
import stanza
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten, Dense, Dropout
from tensorflow.keras.models import Model

ModuleNotFoundError: No module named 'numpy'

### **Exploring Data**

In [55]:
df_train = pd.read_csv('../data/Train.csv')
df_train.head()

|   | tweet_id | text | location |
|---|---|---|---|
| 0 | ID_1001136212718088192 |  | EllicottCity |
| 1 | ID_1001136696589631488 | Flash floods struck a Maryland city on Sunday,... | Maryland |
| 2 | ID_1001136950345109504 | State of emergency declared for Maryland flood... | Maryland |
| 3 | ID_1001137334056833024 | Other parts of Maryland also saw significant d... | Baltimore Maryland |
| 4 | ID_1001138374923579392 | Catastrophic Flooding Slams Ellicott City, Mar... | Ellicott City Maryland |


In [56]:
df_train.isnull().sum()

tweet_id        0
text        56624
location    29612
dtype: int64

In [57]:
df_train.dropna(inplace=True)

### **BIO Tagging**

BIO stands for Begin, Inside, and Outside. It’s a method for tagging tokens (words or subwords) in a sequence to identify entities within the text. Each token in the text is assigned a tag that indicates whether it is at the beginning of an entity, inside an entity, or outside of any entity.
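To make the tagging scheme concrete, here is a minimal sketch in plain Python. It assumes simple whitespace tokenization and an invented example sentence; the helper names `bio_tag` and `decode_spans` are hypothetical and not part of any library. The first helper assigns one tag per token, and the second shows how the tags can be read back into entity strings.

```python
def bio_tag(tokens, entity_tokens):
    """Return one B/I/O tag per token for every exact occurrence of entity_tokens."""
    tags = ["O"] * len(tokens)
    n = len(entity_tokens)
    if n == 0:
        return tags
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entity_tokens:
            tags[i] = "B"                      # first token of the entity
            for j in range(i + 1, i + n):
                tags[j] = "I"                  # remaining tokens of the entity
    return tags


def decode_spans(tokens, tags):
    """Rebuild entity strings from a B/I/O tag sequence."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:                        # close the previous entity
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:                                  # 'O' (or a stray 'I') ends the entity
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Invented example sentence, tokenized on whitespace
tokens = "Flooding slams Ellicott City in Maryland".split()
tags = bio_tag(tokens, ["Ellicott", "City"])
print(list(zip(tokens, tags)))
print(decode_spans(tokens, tags))
```

Because there is exactly one tag per token, the two lists can be zipped, padded, and encoded together, which is what the later steps of the notebook rely on.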

In [58]:
nlp = stanza.Pipeline(lang='en', processors='tokenize', verbose=False)

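def generate_bio_tags(text, location):
    """Tokenize `text` and return (tokens, tags) with exactly one B/I/O tag per token."""
    doc = nlp(text)
    tokens = [token.text for sentence in doc.sentences for token in sentence.tokens]
    tags = ['O'] * len(tokens)
    loc_words = location.split()
    n = len(loc_words)

    # Slide a window over the tokens and tag every contiguous, exact match of the
    # location phrase: 'B' on its first token, 'I' on the remaining ones.
    # Tagging in place keeps len(tags) == len(tokens), also for multi-word locations.
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == loc_words:
            tags[i] = 'B'
            tags[i + 1:i + n] = ['I'] * (n - 1)

    return tokens, tags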


# Apply the function to each row in the TrainSet
df_train['temp_tuple'] = df_train.apply(lambda row: generate_bio_tags(row['text'], row['location']), axis=1)
df_tokens = df_train['temp_tuple'].apply(pd.Series)
df_tokens.columns = ['tokens', 'bio_tags']
df_train = pd.concat([df_train.drop(columns=['temp_tuple']), df_tokens], axis=1)

In [59]:
df_train.head()

|   | tweet_id | text | location | tokens | bio_tags |
|---|---|---|---|---|---|
| 1 | ID_1001136696589631488 | Flash floods struck a Maryland city on Sunday,... | Maryland | [Flash, floods, struck, a, Maryland, city, on,... | [O, O, O, O, B, O, O, O, O, O, O, O, O, O, O, ... |
| 2 | ID_1001136950345109504 | State of emergency declared for Maryland flood... | Maryland | [State, of, emergency, declared, for, Maryland... | [O, O, O, O, O, B, O, O, O, O, O] |
| 3 | ID_1001137334056833024 | Other parts of Maryland also saw significant d... | Baltimore Maryland | [Other, parts, of, Maryland, also, saw, signif... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |
| 4 | ID_1001138374923579392 | Catastrophic Flooding Slams Ellicott City, Mar... | Ellicott City Maryland | [Catastrophic, Flooding, Slams, Ellicott, City... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |
| 5 | ID_1001138377717157888 | WATCH: 1 missing after flash #FLOODING devasta... | Ellicott City Maryland | [WATCH, :, 1, missing, after, flash, #FLOODING... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O] |


### **PAD Data**

In [60]:
max_len = df_train['tokens'].apply(len).max()
df_train['tokens_padded'] = pad_sequences(df_train['tokens'], maxlen=max_len, dtype=object, padding='post', truncating='post', value='PAD').tolist()
df_train['bio_tags_padded'] = pad_sequences(df_train['bio_tags'], maxlen=max_len, dtype=object, padding='post', truncating='post', value='O').tolist()

df_train.head()


|   | tweet_id | text | location | tokens | bio_tags | tokens_padded | bio_tags_padded |
|---|---|---|---|---|---|---|---|
| 1 | ID_1001136696589631488 | Flash floods struck a Maryland city on Sunday,... | Maryland | [Flash, floods, struck, a, Maryland, city, on,... | [O, O, O, O, B, O, O, O, O, O, O, O, O, O, O, ... | [Flash, floods, struck, a, Maryland, city, on,... | [O, O, O, O, B, O, O, O, O, O, O, O, O, O, O, ... |
| 2 | ID_1001136950345109504 | State of emergency declared for Maryland flood... | Maryland | [State, of, emergency, declared, for, Maryland... | [O, O, O, O, O, B, O, O, O, O, O] | [State, of, emergency, declared, for, Maryland... | [O, O, O, O, O, B, O, O, O, O, O, O, O, O, O, ... |
| 3 | ID_1001137334056833024 | Other parts of Maryland also saw significant d... | Baltimore Maryland | [Other, parts, of, Maryland, also, saw, signif... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [Other, parts, of, Maryland, also, saw, signif... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |
| 4 | ID_1001138374923579392 | Catastrophic Flooding Slams Ellicott City, Mar... | Ellicott City Maryland | [Catastrophic, Flooding, Slams, Ellicott, City... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [Catastrophic, Flooding, Slams, Ellicott, City... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |
| 5 | ID_1001138377717157888 | WATCH: 1 missing after flash #FLOODING devasta... | Ellicott City Maryland | [WATCH, :, 1, missing, after, flash, #FLOODING... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O] | [WATCH, :, 1, missing, after, flash, #FLOODING... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... |


### **Encode Tokens and BIO Tags**

In [62]:
# BIO Encoding
tag2idx = {'O': 0, 'B': 1, 'I': 2}
df_train['bio_tags_encoded'] = df_train['bio_tags_padded'].apply(lambda x: [tag2idx[tag] for tag in x])

# Token Encoding
tokenizer = Tokenizer(oov_token='OOV')
tokenizer.fit_on_texts(df_train['tokens_padded'])
df_train['tokens_encoded'] = tokenizer.texts_to_sequences(df_train['tokens_padded'])

df_train.head()

|   | tweet_id | text | location | tokens | bio_tags | tokens_padded | bio_tags_padded | bio_tags_encoded | tokens_encoded |
|---|---|---|---|---|---|---|---|---|---|
| 1 | ID_1001136696589631488 | Flash floods struck a Maryland city on Sunday,... | Maryland | [Flash, floods, struck, a, Maryland, city, on,... | [O, O, O, O, B, O, O, O, O, O, O, O, O, O, O, ... | [Flash, floods, struck, a, Maryland, city, on,... | [O, O, O, O, B, O, O, O, O, O, O, O, O, O, O, ... | [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [200, 96, 463, 14, 138, 73, 27, 482, 8, 5106, ... |
| 2 | ID_1001136950345109504 | State of emergency declared for Maryland flood... | Maryland | [State, of, emergency, declared, for, Maryland... | [O, O, O, O, O, B, O, O, O, O, O] | [State, of, emergency, declared, for, Maryland... | [O, O, O, O, O, B, O, O, O, O, O, O, O, O, O, ... | [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [107, 10, 83, 355, 12, 138, 57, 9, 106, 13, 89... |
| 3 | ID_1001137334056833024 | Other parts of Maryland also saw significant d... | Baltimore Maryland | [Other, parts, of, Maryland, also, saw, signif... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [Other, parts, of, Maryland, also, saw, signif... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [219, 323, 10, 138, 244, 1076, 1007, 39, 22, 2... |
| 4 | ID_1001138374923579392 | Catastrophic Flooding Slams Ellicott City, Mar... | Ellicott City Maryland | [Catastrophic, Flooding, Slams, Ellicott, City... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [Catastrophic, Flooding, Slams, Ellicott, City... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [302, 57, 1219, 310, 73, 8, 138, 167, 89, 951,... |
| 5 | ID_1001138377717157888 | WATCH: 1 missing after flash #FLOODING devasta... | Ellicott City Maryland | [WATCH, :, 1, missing, after, flash, #FLOODING... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O] | [WATCH, :, 1, missing, after, flash, #FLOODING... | [O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [276, 9, 159, 132, 31, 200, 6490, 1376, 310, 7... |


<div style="padding: 0.5em; background-color: #ececec; color: #000; font-weight: bold; font-size: 1.2em;">
    Machine learning-based NER Model for LMR
</div>

#### **Model 1 - FeedForward**

In [93]:
X = np.array(df_train['tokens_encoded'].tolist())
y = np.array(df_train['bio_tags_encoded'].tolist())
y_one_hot = tf.keras.utils.to_categorical(y, num_classes=3)



#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


In [94]:
y[0]

array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0])

In [96]:
y_one_hot[0].shape

(137, 3)

In [None]:
from tensorflow.keras.layers import Reshape, Softmax

INPUT_SHAPE = max_len   # padded sequence length (tokens per tweet)
NUM_CLASS   = 3         # O, B, I
vocab_size = len(tokenizer.word_index) + 1
embedding_dim = 128

# Model (functional API)
input_layer = Input(shape=(INPUT_SHAPE,))
embedding_layer = Embedding(input_dim=vocab_size, output_dim=embedding_dim)(input_layer)
flatten_layer = Flatten()(embedding_layer)
dense_1 = Dense(128, activation='relu')(flatten_layer)
dropout_1 = Dropout(0.5)(dense_1)
dense_2 = Dense(64, activation='relu')(dropout_1)
# One logit per (token position, class); reshape before the softmax so that
# probabilities are normalized per token rather than across the whole sequence
logits = Dense(INPUT_SHAPE * NUM_CLASS)(dense_2)
logits = Reshape((INPUT_SHAPE, NUM_CLASS))(logits)
output_layer = Softmax(axis=-1)(logits)
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

In [99]:
from tf2crf import CRF

ModuleNotFoundError: No module named 'keras.src.engine'

In [3]:
import numpy as np
np.random.rand(2, 4, 8).astype(np.float32)

array([[[0.9259407 , 0.03039424, 0.0869724 , 0.9362894 , 0.159234  ,
         0.02058137, 0.33528695, 0.1239636 ],
        [0.40570205, 0.36699   , 0.6836332 , 0.935297  , 0.07573988,
         0.6030935 , 0.25253534, 0.26845726],
        [0.26470143, 0.07290941, 0.5822046 , 0.00982825, 0.22411057,
         0.7851651 , 0.82800126, 0.4819368 ],
        [0.7842525 , 0.6323881 , 0.4326343 , 0.30710718, 0.15193187,
         0.44470468, 0.4869712 , 0.34016693]],

       [[0.8027908 , 0.9547014 , 0.01100268, 0.32435906, 0.40750113,
         0.3338801 , 0.37195256, 0.1948911 ],
        [0.702745  , 0.77574563, 0.47705722, 0.45453975, 0.34168258,
         0.3638236 , 0.24401966, 0.46358928],
        [0.94289315, 0.00827615, 0.11387574, 0.76900065, 0.93496716,
         0.35411477, 0.527251  , 0.44217628],
        [0.11547489, 0.41354004, 0.9950221 , 0.77645296, 0.18145017,
         0.7558336 , 0.84219056, 0.3814326 ]]], dtype=float32)

In [97]:
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow_addons.layers import CRF

num_classes = 3  # O, B, I

# Define the input layer for the model (max_len is the padded sentence length)
inputs = layers.Input(shape=(max_len,), dtype=tf.int32)

# Embedding layer to convert tokens into dense vectors
embedding = layers.Embedding(input_dim=vocab_size, output_dim=embedding_dim)(inputs)

# Bi-LSTM layer
bi_lstm = layers.Bidirectional(layers.LSTM(units=128, return_sequences=True))(embedding)

# Dense layer before CRF
dense = layers.TimeDistributed(layers.Dense(64, activation="relu"))(bi_lstm)

# CRF layer
crf = CRF(units=num_classes)
output = crf(dense)

# Create the model
model = tf.keras.Model(inputs=inputs, outputs=output)

# Compile the model
# NOTE: this cell never got past the import, so `crf.loss` and `crf.accuracy`
# are untested assumptions about the CRF layer's API
model.compile(optimizer='adam', loss=crf.loss, metrics=[crf.accuracy])

# Summary of the model
model.summary()



TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

 The versions of TensorFlow you are currently using is 2.16.1 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons


ModuleNotFoundError: No module named 'keras.src.engine'