# Twitter Named Entity Recognition Case Study

### About
Twitter is a microblogging and social networking service on which users post and interact with messages known as "tweets". On average, around 6,000 tweets are posted every second, which corresponds to over 350,000 tweets per minute and roughly 500 million tweets per day.

### Problem statement 
Twitter wants to automatically tag and analyze tweets to better understand trends and topics without depending on the hashtags that users add. Many users do not use hashtags, or use wrong or misspelled ones, so the goal is to sidestep this problem with a system that recognizes the important content of a tweet directly.

### Objective
Named Entity Recognition (NER) is an important subtask of information extraction that seeks to locate and recognise named entities.
We need to train models that will be able to identify the various named entities.

### Data
The dataset is annotated with 10 fine-grained NER categories: person, geo-location, company, facility, product, music artist, movie, sports team, TV show and other. It was extracted from English-language tweets and is stored as text files in CoNLL format.
The CoNLL format is a text file with one token per line, with sentences separated by an empty line. The first field on each line is the token and the last field is its label.
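As a minimal illustration of this layout, the sketch below parses a small hand-written snippet. The sample tokens and tags are made up for illustration and are not taken from the dataset; tab-separated columns are an assumption, and the name `parse_conll` is hypothetical.

```python
# Hypothetical two-sentence sample: token<TAB>label, blank line between sentences
sample = "Watching\tO\nBreaking\tB-movie\nDawn\tI-movie\n\nHello\tO\nAlice\tB-person\n"

def parse_conll(text):
    """Return a list of sentences, each a list of (token, label) pairs."""
    sentences = []
    for block in text.strip().split("\n\n"):
        pairs = []
        for line in block.split("\n"):
            fields = line.split("\t")
            # first field is the token, last field is the label
            pairs.append((fields[0], fields[-1]))
        sentences.append(pairs)
    return sentences

parsed = parse_conll(sample)
print(parsed[0])
```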

In [64]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

from warnings import filterwarnings
filterwarnings('ignore')

In [65]:
import os
root_path = os.path.abspath(os.path.join(os.getcwd(),os.pardir))
data_path = os.path.join(root_path,'data')
train_data_path = os.path.join(data_path,'wnut 16.txt.conll')
test_data_path = os.path.join(data_path,'wnut 16test.txt.conll')

## Getting the data

In [66]:
# reading the training and test files
with open(train_data_path,'r',encoding='utf-8') as f:
    train_raw = f.read()
with open(test_data_path,'r',encoding='utf-8') as f:
    test_raw = f.read()

In [67]:
# creating a function to format the data
def extract_ner_from_conll(conll_data):
    # Split the data into sentences based on empty lines
    sentences = [sentence.strip() for sentence in conll_data.strip().split('\n\n')]
    ner_data = []

    for sentence in sentences:
        tokenised_sentence = []
        for token_entity in sentence.split('\n'):
            token, entity = token_entity.split('\t')
            tokenised_sentence.append((token,entity))
        ner_data.append(tokenised_sentence)

    return ner_data

In [68]:
# preprocessing the raw files
train_data = extract_ner_from_conll(train_raw)
test_data = extract_ner_from_conll(test_raw)

In [69]:
# checking sentences after preprocessing
print(train_data[0])

[('@SammieLynnsMom', 'O'), ('@tg10781', 'O'), ('they', 'O'), ('will', 'O'), ('be', 'O'), ('all', 'O'), ('done', 'O'), ('by', 'O'), ('Sunday', 'O'), ('trust', 'O'), ('me', 'O'), ('*wink*', 'O')]


## EDA

In [70]:
# number of words in the vocabulary and length of sentences in the training data

sentence_lengths = list()
word_set = set()
for sentence in train_data:
    sentence_lengths.append(len(sentence))
    for word in sentence:
        word_set.add(word[0])
        
NUM_WORDS = len(word_set)+2 # +2 to include padding and out of vocabulary
print(f"Number of unique words in training data (including padding and OOV token) = {NUM_WORDS}")
print(f"Maximum sentence length = {max(sentence_lengths)}")
print(f"Minimum sentence length = {min(sentence_lengths)}")

Number of unique words in training data (including padding and OOV token) = 10588
Maximum sentence length = 39
Minimum sentence length = 1


In [71]:
# since the max sentence length is 39, we will use a length of 45 in our model to leave headroom for longer inputs at inference
SENTENCE_LENGTH = 45
# we will keep the embedding dimensions to be 50 since the number of datapoints is small
EMBEDDING_DIMS = 50

In [72]:
# number of entities
entity_set = set()
for sentence in train_data:
    for word in sentence:
        entity_set.add(word[1])
        
NUM_ENTITIES = len(entity_set)
print(f"Number of unique entities in training data = {NUM_ENTITIES}")

Number of unique entities in training data = 21
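
The count of 21 is consistent with a BIO tagging scheme over the 10 categories listed in the Data section: one `B-` (begin) and one `I-` (inside) tag per category, plus a single `O` tag for tokens outside any entity. The sketch below reproduces that arithmetic; the category spellings are placeholders and may not match the exact tag strings in the files.

```python
# the 10 categories from the Data section (spellings are placeholders)
categories = ["person", "geo-location", "company", "facility", "product",
              "music artist", "movie", "sports team", "tv show", "other"]

# one B- and one I- tag per category, plus the single outside tag
tags = ["O"] + [f"{prefix}-{category}" for category in categories for prefix in ("B", "I")]
print(len(tags))
```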


## Data preparation

In [73]:
import re
import string
punctuations = string.punctuation

In [74]:
# create a function to prepare the data to be fed into the model

def clean_text(text):
    return re.sub("[^A-Za-z0-9]+",'',str(text).lower())

def prepare_data(text_data):
    
    # initialize empty lists for sentences and entities
    sentences = []
    entities = []
    
    for sentence in text_data:
        
        # initialize empty lists for sentence text and corresponding entities
        word_list = []
        entity_list = []
        
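        for token in sentence:
            word = token[0]
            entity = token[1]
            cleaned_word = clean_text(word)
            # skip tokens that are empty after cleaning (punctuation, symbols),
            # so that words and entities stay aligned once joined with whitespace
            if not cleaned_word:
                continue
            word_list.append(cleaned_word)
            entity_list.append(entity)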
        
        sentences.append(word_list)
        entities.append(entity_list)
    
    # create a single string for each sentence and entity by joining elements with whitespace
    sentences = np.array([' '.join(sentence) for sentence in sentences])
    entities = np.array([' '.join(entity) for entity in entities])
    
    return (sentences,entities)

In [None]:
# since the train file has few datapoints, we merge both files and create our own train/test split
data = train_data + test_data

xdata, ydata = prepare_data(data)

from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(xdata, ydata, test_size=0.2, random_state=42)

In [76]:
# checking sentences after conversion
print(f"Before conversion\n{' '.join([word[0] for word in data[0]])}\n")
print(f"After conversion\n{xdata[0]}\n")
print(f"Entities\n{ydata[0]}")

Before conversion
@SammieLynnsMom @tg10781 they will be all done by Sunday trust me *wink*

After conversion
sammielynnsmom tg10781 they will be all done by sunday trust me wink

Entities
O O O O O O O O O O O O
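
Because each sentence and its labels are stored as whitespace-joined strings, the two only stay aligned if they split into the same number of items. A small check like the one below can flag sentences where that does not hold, for example if a token becomes empty during cleaning. The function name `find_misaligned` and the toy inputs are hypothetical.

```python
def find_misaligned(sentences, entities):
    """Return indices where the word count and the label count differ."""
    bad = []
    for i, (sentence, labels) in enumerate(zip(sentences, entities)):
        if len(sentence.split()) != len(labels.split()):
            bad.append(i)
    return bad

# toy inputs: the second sentence lost a token (note the double space) but kept its label
toy_x = ["they will be done", "hello  world"]
toy_y = ["O O O O", "O O O"]
print(find_misaligned(toy_x, toy_y))
```

In this notebook the same check could be run as `find_misaligned(xdata, ydata)`.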


## Bidirectional LSTM + CRF

In [77]:
from tensorflow.keras.layers import Input, TextVectorization, Embedding, Bidirectional, LSTM, TimeDistributed, Dense, Lambda
from tensorflow.keras.models import Model
from tensorflow_addons.layers import CRF
from tensorflow_addons.losses import SigmoidFocalCrossEntropy

In [78]:
# adapt the vectorizer layer before modeling
sentence_tokenizer = TextVectorization(max_tokens=NUM_WORDS, output_sequence_length=SENTENCE_LENGTH, standardize='lower', name='sentence_tokenizer')
sentence_tokenizer.adapt(xtrain)

In [79]:
# define a function to build and compile model
def build_lstm_model(name='bi-lstm+crf'):
    
    # input layer for getting sentences
    sentence_input = Input(shape=(1,), dtype=tf.string, name='sentence_input')
    
    # tokenizing the sentence
    x = sentence_tokenizer(sentence_input)
    
    # creating embeddings for each token in sentence
    embeddings = Embedding(
        input_dim = NUM_WORDS,
        output_dim = EMBEDDING_DIMS,
        mask_zero = True,
        name = 'word_embedding'
    )(x)
    
    # stacking two bidirectional LSTMs
    output_sequence = Bidirectional(LSTM(40, return_sequences=True), name='lstm_1')(embeddings)
    output_sequence = Bidirectional(LSTM(40, return_sequences=True), name='lstm_2')(output_sequence)
    
    # passing the sequence through dense layer to compress the information
    dense_sequence = TimeDistributed(Dense(25, activation='relu'), name='dense')(output_sequence)
    
    # passing the dense sequences through crf layer
    predicted_sequence, potentials, sequence_length, crf_kernel = CRF(NUM_ENTITIES, name='crf')(dense_sequence)
    
    # define the train model
    training_model = Model(sentence_input, potentials)
    
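    # compile the model
    # note: `potentials` are unnormalised scores; SigmoidFocalCrossEntropy treats its
    # input as probabilities unless from_logits=True is passed
    training_model.compile(
        loss=SigmoidFocalCrossEntropy(),
        optimizer='adam'
    )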
    
    # create an inference model
    inference_model = Model(sentence_input, predicted_sequence)
    
    return training_model, inference_model

In [80]:
# creating the training and inference models
lstm_training_model, lstm_inference_model = build_lstm_model()

In [81]:
lstm_training_model.summary()

Model: "model_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 sentence_input (InputLayer  [(None, 1)]               0         
 )                                                               
                                                                 
 sentence_tokenizer (TextVe  (None, 45)                0         
 ctorization)                                                    
                                                                 
 word_embedding (Embedding)  (None, 45, 50)            529400    
                                                                 
 lstm_1 (Bidirectional)      (None, 45, 80)            29120     
                                                                 
 lstm_2 (Bidirectional)      (None, 45, 80)            38720     
                                                                 
 dense (TimeDistributed)     (None, 45, 25)            2025

In [82]:
# prepare the entity data

# tokenizing the entities
# note: max_tokens also counts the padding and [UNK] slots, so with max_tokens=NUM_ENTITIES
# the least frequent tags fall back to the [UNK] index
entity_tokenizer = TextVectorization(max_tokens=NUM_ENTITIES, output_sequence_length=SENTENCE_LENGTH, standardize='lower', name='entity_tokenizer')
entity_tokenizer.adapt(ytrain)
ytrain_tokenized = entity_tokenizer(ytrain)
ytest_tokenized = entity_tokenizer(ytest)

# one-hot encoding the entity tokens
entity_ohe = Lambda(lambda x: tf.one_hot(x, NUM_ENTITIES))
ytrain_tokenized = entity_ohe(ytrain_tokenized)
ytest_tokenized = entity_ohe(ytest_tokenized)

In [None]:
# defining callbacks
# both callbacks monitor val_loss, so fit() needs validation data for them to take effect
mc = tf.keras.callbacks.ModelCheckpoint(
    'models/lstm.h5',
    monitor="val_loss",
    save_best_only=True,
)
es = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    min_delta=0,
    patience=5,
    start_from_epoch=5,
)
callbacks=[mc,es]

In [25]:
history = lstm_training_model.fit(xtrain, ytrain_tokenized, epochs=5, batch_size=16, callbacks=callbacks)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [None]:
# loading the best model
lstm_training_model = tf.keras.models.load_model('models/lstm.h5')

In [26]:
lstm_training_model.evaluate(xtest,ytest_tokenized)



0.07284174114465714
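
The value returned by `evaluate` is a loss, which is hard to read as tagging quality. One complementary check is token-level accuracy computed only on non-padding positions. The sketch below is pure Python with made-up index sequences; it assumes index 0 marks padding, as in the `TextVectorization` output, and the name `masked_token_accuracy` is hypothetical.

```python
def masked_token_accuracy(true_ids, pred_ids, pad_id=0):
    """Fraction of non-padding positions where the predicted id matches the true id."""
    correct = 0
    total = 0
    for true_seq, pred_seq in zip(true_ids, pred_ids):
        for t, p in zip(true_seq, pred_seq):
            if t == pad_id:
                continue  # skip padded positions
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0

# made-up label ids for two padded sentences
y_true = [[2, 2, 3, 0, 0], [2, 4, 0, 0, 0]]
y_pred = [[2, 2, 2, 2, 2], [2, 4, 2, 2, 2]]
print(masked_token_accuracy(y_true, y_pred))
```

Since most tokens carry the `O` tag, a per-tag breakdown would be more informative than a single accuracy figure; this sketch only shows the masking step.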

In [27]:
# TODO: pass validation data to fit() so that the val_loss callbacks have something to monitor

### Inferencing the Bi-LSTM+CRF model

In [28]:
print(xtest[0])
print(ytest[0])
# note: these are character counts of the joined strings, not token counts
print(len(xtest[0]))
print(len(ytest[0]))

new orleans mother s day parade shooting one of the people hurt was a 10yearold girl what the hell is wrong with people
B-other I-other I-other I-other I-other I-other O O O O O O O O O O O O O O O O O
119
81


In [29]:
ytest[2].split()

['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-movie', 'O', 'O']

In [30]:
# pick a random test sentence and compare actual vs. predicted labels
def infer_lstm():
    idx = np.random.choice(len(xtest),1)
    text = xtest[idx]
    pred_ids = lstm_inference_model.predict(text)
    act_labels = ytest[idx]
    # the model only sees the first SENTENCE_LENGTH tokens
    text_len = min(len(text[0].split()),SENTENCE_LENGTH)
    
    # map predicted indices back to (lower-cased) label strings
    pred_labels = np.asarray(entity_tokenizer.get_vocabulary())[list(pred_ids[0])]
    result = pd.DataFrame(
        {
            'text':text[0].split()[:text_len],
            'actual_labels':act_labels[0].split()[:text_len],
            'predicted_labels':pred_labels[:text_len]
        }
    )
    display(result)

In [31]:
infer_lstm()



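|  | text | actual_labels | predicted_labels |
|---|---|---|---|
| 0 | rt | O | o |
| 1 | abc3340 | O | b-tvshow |
| 2 | police | O | o |
| 3 | conducting | O | b-tvshow |
| 4 | a | O | o |
| 5 | homicide | O | b-person |
| 6 | investigation | O | o |
| 7 | on | O | b-person |
| 8 | avenue | O | o |
| 9 | v | O | b-person |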


## BERT

In [83]:
from transformers import TFBertForTokenClassification
from transformers import BertTokenizer

In [84]:
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

In [85]:
# preprocessing data for bert

def preprocess_bert(text,labels):
    
    text_list = []
    label_list = []
    
    for i in range(len(text)):
        tokenized_words = []
        tokenized_labels = []
        words = text[i].split()
        entities = labels[i].split()
        sentence_len = len(words)
        for j in range(sentence_len):
            tokenized_word = bert_tokenizer.tokenize(words[j])
            tokenized_words.extend(tokenized_word)
            tokenized_label = [entities[j]]*len(tokenized_word)
            tokenized_labels.extend(tokenized_label)
        text_list.append(' '.join(tokenized_words))
        label_list.append(' '.join(tokenized_labels))
    return text_list,label_list

In [98]:
train_bert_tokenized_sentences, train_bert_tokenized_labels = preprocess_bert(xtrain,ytrain)
test_bert_tokenized_sentences, test_bert_tokenized_labels = preprocess_bert(xtest,ytest)

In [99]:
print(train_bert_tokenized_sentences[0])
print(train_bert_tokenized_labels[0])

sam ##mie ##lynn ##smo ##m t ##g ##10 ##7 ##8 ##1 they will be all done by sunday trust me wink
O O O O O O O O O O O O O O O O O O O O O
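
In `preprocess_bert`, every word piece inherits its word's label unchanged, so a word tagged `B-movie` that splits into three pieces yields three consecutive `B-movie` tags. One common alternative, shown here as a sketch and not as this notebook's method, keeps `B-` on the first piece and switches the remaining pieces to `I-`. The helper name `expand_label` is hypothetical.

```python
def expand_label(label, n_pieces):
    """Spread one word-level BIO label over n_pieces word pieces."""
    if n_pieces <= 0:
        return []
    if label.startswith("B-"):
        # only the first piece begins the entity; the rest continue it
        return [label] + ["I-" + label[2:]] * (n_pieces - 1)
    return [label] * n_pieces

print(expand_label("B-movie", 3))
print(expand_label("O", 2))
```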


In [100]:
# adapt the vectorizer layer before modeling
sentence_indexer_bert = TextVectorization(max_tokens=NUM_WORDS, output_sequence_length=SENTENCE_LENGTH, standardize='lower', name='sentence_indexer_bert')
sentence_indexer_bert.adapt(train_bert_tokenized_sentences)
train_bert_sentence_index = sentence_indexer_bert(train_bert_tokenized_sentences)
test_bert_sentence_index = sentence_indexer_bert(test_bert_tokenized_sentences)
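
Note that the indices produced here come from a vocabulary that `TextVectorization` learns from this dataset, so they will generally not match the rows of BERT's pretrained embedding table. An alternative is to look each word piece up in the tokenizer's own vocabulary, which `bert_tokenizer.convert_tokens_to_ids` does in the `transformers` library. The sketch below illustrates the idea with a tiny made-up vocabulary so that it runs without downloading a model; the ids are not the real `bert-base-uncased` ids, and `pieces_to_ids` is a hypothetical helper.

```python
# toy stand-in for the pretrained WordPiece vocabulary (ids are illustrative only)
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "they": 2, "will": 3, "sam": 4, "##mie": 5}

def pieces_to_ids(tokenized_sentence, vocab, max_len, pad_id=0, unk_id=1):
    """Map whitespace-separated word pieces to ids, then truncate or pad to max_len."""
    ids = [vocab.get(piece, unk_id) for piece in tokenized_sentence.split()]
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

print(pieces_to_ids("sam ##mie they will wink", toy_vocab, max_len=8))
```

With the real tokenizer, `bert_tokenizer.convert_tokens_to_ids(sentence.split())` would take the place of the dictionary lookup.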

In [101]:
train_token_type_ids = np.zeros(shape=(len(train_bert_tokenized_sentences),SENTENCE_LENGTH))
test_token_type_ids = np.zeros(shape=(len(test_bert_tokenized_sentences),SENTENCE_LENGTH))

In [102]:
train_attn_mask = np.zeros(shape=(len(train_bert_tokenized_sentences),SENTENCE_LENGTH))
test_attn_mask = np.zeros(shape=(len(test_bert_tokenized_sentences),SENTENCE_LENGTH))
for i in range(len(train_bert_tokenized_sentences)):
    length = min(len(train_bert_tokenized_sentences[i].split()),SENTENCE_LENGTH)
    train_attn_mask[i,:length] = 1
for i in range(len(test_bert_tokenized_sentences)):
    length = min(len(test_bert_tokenized_sentences[i].split()),SENTENCE_LENGTH)
    test_attn_mask[i,:length] = 1

In [103]:
# prepare the entity data

# tokenizing the entities
entity_indexer_bert = TextVectorization(max_tokens=NUM_ENTITIES, output_sequence_length=SENTENCE_LENGTH, standardize='lower', name='entity_tokenizer')
entity_indexer_bert.adapt(ytrain)
train_bert_label_index = entity_indexer_bert(train_bert_tokenized_labels)
test_bert_label_index = entity_indexer_bert(test_bert_tokenized_labels)

In [92]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.optimizers import Adam

In [240]:
# build bert model
encoder = TFBertForTokenClassification.from_pretrained('bert-base-uncased',name='bert_layer')
def build_bert_model():

    # getting inputs to the model
    input_sentence_ids = Input(shape=(SENTENCE_LENGTH,), dtype=tf.int32)
    input_token_type_ids = Input(shape=(SENTENCE_LENGTH,), dtype=tf.int32)
    input_attn_ids = Input(shape=(SENTENCE_LENGTH,), dtype=tf.int32)

    # sending inputs through the bert model; [0] selects the logits of its
    # token-classification head (2 labels by default, since num_labels is not set)
    embeddings = encoder(
        input_ids = input_sentence_ids,
        token_type_ids = input_token_type_ids,
        attention_mask = input_attn_ids,
    )[0]

    # mapping those logits to NUM_ENTITIES scores with a linear dense layer (softmax will be applied during the loss calculation)
    output_logits = Dense(NUM_ENTITIES,activation='linear')(embeddings)

    # compiling the model
    model = Model(
        inputs = [input_sentence_ids, input_token_type_ids, input_attn_ids],
        outputs = [output_logits]
    )

    model.compile(
        loss = SparseCategoricalCrossentropy(from_logits=True),
        optimizer = Adam()
    )

    return model

All PyTorch model weights were used when initializing TFBertForTokenClassification.

Some weights or buffers of the TF 2.0 model TFBertForTokenClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [241]:
bert_model = build_bert_model()
bert_model.summary()

Model: "model_8"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
 input_28 (InputLayer)       [(None, 45)]                 0         []                            
                                                                                                  
 input_30 (InputLayer)       [(None, 45)]                 0         []                            
                                                                                                  
 input_29 (InputLayer)       [(None, 45)]                 0         []                            
                                                                                                  
 bert_layer (TFBertForToken  TFTokenClassifierOutput(lo   1088931   ['input_28[0][0]',            
 Classification)             ss=None, logits=(None, 45,   86         'input_30[0][0]',      

In [None]:
# defining callbacks
# both callbacks monitor val_loss, so fit() needs validation data for them to take effect
mc = tf.keras.callbacks.ModelCheckpoint(
    'models/bert.h5',
    monitor="val_loss",
    save_best_only=True,
)
es = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    min_delta=0,
    patience=5,
    start_from_epoch=5,
)
callbacks=[mc,es]

In [96]:
# fitting the bert model
history = bert_model.fit(
    [train_bert_sentence_index, train_token_type_ids, train_attn_mask],
    train_bert_label_index,
    epochs=1,
    batch_size=16,
    callbacks=callbacks
)



In [None]:
# loading the best model
bert_model = tf.keras.models.load_model('models/bert.h5')

### BERT Inference

In [114]:
predictions = bert_model.predict([test_bert_sentence_index,test_token_type_ids,test_attn_mask])



In [266]:
def get_joined_labels(tokenized_text,tokenized_labels):
    """Collapse word-piece labels back to one label per original word.

    Pieces starting with "##" are merged with the preceding piece; a merged word
    takes the first label in its group that is not 'o', or 'o' if there is none.
    """
    groups = []
    for piece, label in zip(tokenized_text.split(), tokenized_labels.split()):
        if piece.startswith("##") and groups:
            # continuation piece: belongs to the current word
            groups[-1].append(label)
        else:
            # start of a new word
            groups.append([label])

    joined_labels = []
    for group in groups:
        non_o = [label for label in group if label != 'o']
        joined_labels.append(non_o[0] if non_o else 'o')
    return ' '.join(joined_labels)

# tokenized_labels
'breaking dawn ##returns to vancouver on january 11th http ##bit ##ly ##db ##dm ##s ##8'

'breaking dawn ##returns to vancouver on january 11th http ##bit ##ly ##db ##dm ##s ##8'

In [196]:
# pick a random test sentence and compare actual vs. predicted labels per word piece
def infer_bert():
    idx = np.random.choice(len(test_bert_sentence_index))

    # only the first SENTENCE_LENGTH word pieces were fed to the model
    pieces = test_bert_tokenized_sentences[idx].split()[:SENTENCE_LENGTH]
    labels_act = test_bert_tokenized_labels[idx].split()[:SENTENCE_LENGTH]
    # most likely label index per position, mapped back to (lower-cased) label strings
    label_pred = np.argmax(predictions[idx], axis=-1)
    label_pred = np.asarray(entity_indexer_bert.get_vocabulary())[list(label_pred)]
    result = pd.DataFrame(
        {
            'text':pieces,
            'actual_labels':labels_act,
            'predicted_labels':label_pred[:len(pieces)]
        }
    )
    display(result)

In [243]:
infer_bert()

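|  | text | actual_labels | predicted_labels |
|---|---|---|---|
| 0 | rt | O | o |
| 1 | hey | O | o |
| 2 | ##ife | O | o |
| 3 | ##elli | O | o |
| 4 | ##ke | O | o |
| 5 | im | O | o |
| 6 | single | O | o |
| 7 | bc | O | o |
| 8 | i | O | o |
| 9 | didn | O | o |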


In [109]:
test_bert_sentence_index[0]

<tf.Tensor: shape=(45,), dtype=int64, numpy=
array([ 124, 5009, 1278,   17,   21, 2257, 1450,   73,   15,    2,  144,
       1302,   45,    5,  240,  820, 1370,  423,   62,    2,  886,   20,
        635,   34,  144,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0])>

In [None]:
# convert all the data objects into dataframe