# Twitter Named Entity Recognition Case Study

### About
Twitter is a microblogging and social networking service on which users post and interact with messages known as "tweets". On average, around 6,000 tweets are posted every second, which corresponds to over 350,000 tweets per minute and about 500 million tweets per day.

### Problem statement 
Twitter wants to automatically tag and analyze tweets to better understand trends and topics without depending on the hashtags that users add. Many users omit hashtags or use wrong or misspelled ones, so the goal is a system that recognizes the important content of a tweet directly from its text.

### Objective
Named Entity Recognition (NER) is an important subtask of information extraction that seeks to locate and recognise named entities.
We need to train models that will be able to identify the various named entities.

### Data
The dataset is annotated with 10 fine-grained NER categories: person, geo-location, company, facility, product, music artist, movie, sports team, TV show, and other. It was extracted from English-language tweets and is stored as a text file in CoNLL format.
The CoNLL format is a text file with one word per line, with sentences separated by an empty line. The first field on each line is the word and the last field is the label.
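
To make the layout concrete, the sketch below parses a tiny hand-written snippet in this format. The sample tokens and the `B-geo-loc` label are made up for illustration and are not taken from the dataset, and the tab separator is an assumption that matches the parsing code used further down in this notebook. The helper name `parse_conll` is hypothetical.

```python
# Hypothetical two-sentence sample: one "token<TAB>label" pair per line,
# sentences separated by an empty line.
sample_conll = "I\tO\nlove\tO\nParis\tB-geo-loc\n\nHello\tO\nworld\tO\n"

def parse_conll(text):
    """Return a list of sentences, each a list of (token, label) tuples."""
    sentences = []
    current = []
    for line in text.splitlines():
        if not line.strip():
            # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        # first field is the token, last field is the label
        current.append((fields[0], fields[-1]))
    if current:
        sentences.append(current)
    return sentences

parsed = parse_conll(sample_conll)
print(parsed)
```

Reading the first and last field, rather than unpacking exactly two, keeps the sketch usable if a file carries extra columns between the word and the label.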

In [2]:
# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf

import warnings
warnings.filterwarnings('ignore')

In [23]:
import os
root_path = os.path.abspath(os.path.join(os.getcwd(),os.pardir))
data_path = os.path.join(root_path,'data')
train_data_path = os.path.join(data_path,'wnut 16.txt.conll')
test_data_path = os.path.join(data_path,'wnut 16test.txt.conll')

## Getting the data

In [50]:
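# reading the training and test files
# (explicit UTF-8 so that emojis and accented characters are decoded the same way on every platform)
with open(train_data_path,'r',encoding='utf-8') as f:
    train_raw = f.read()
with open(test_data_path,'r',encoding='utf-8') as f:
    test_raw = f.read()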

In [51]:
# creating a function to format the data
def extract_ner_from_conll(conll_data):
    # Split the data into sentences based on empty lines
    sentences = [sentence.strip() for sentence in conll_data.strip().split('\n\n')]
    ner_data = []

    for sentence in sentences:
        tokenised_sentence = []
        for token_entity in sentence.split('\n'):
            token, entity = token_entity.split('\t')
            tokenised_sentence.append((token,entity))
        ner_data.append(tokenised_sentence)

    return ner_data

In [52]:
# preprocessing the raw files
train_data = extract_ner_from_conll(train_raw)
test_data = extract_ner_from_conll(test_raw)

In [68]:
# checking sentences after preprocessing
print(train_data[0])

[('@SammieLynnsMom', 'O'), ('@tg10781', 'O'), ('they', 'O'), ('will', 'O'), ('be', 'O'), ('all', 'O'), ('done', 'O'), ('by', 'O'), ('Sunday', 'O'), ('trust', 'O'), ('me', 'O'), ('*wink*', 'O')]


## EDA

In [98]:
# number of words in the vocabulary and length of sentences in the training data

sentence_lengths = list()
word_set = set()
for sentence in train_data:
    sentence_lengths.append(len(sentence))
    for word in sentence:
        word_set.add(word[0])

NUM_WORDS = len(word_set)+2 # +2 to include padding and out of vocabulary
print(f"Number of unique words in training data (including padding and OOV token) = {NUM_WORDS}")
print(f"Maximum sentence length = {max(sentence_lengths)}")
print(f"Minimum sentence length = {min(sentence_lengths)}")

Number of unique words in training data (including padding and OOV token) = 10588
Maximum sentence length = 39
Minimum sentence length = 1


In [97]:
# since the max sentence length is 39, we use a length of 45 in our model to leave headroom for longer sentences at inference time
SENTENCE_LENGTH = 45
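
A fixed length means shorter sentences are padded and longer ones are cut. The sketch below shows one way this could look on a plain token list. The `<PAD>` symbol and the helper name `pad_or_truncate` are hypothetical and do not appear elsewhere in this notebook.

```python
PAD_TOKEN = "<PAD>"   # hypothetical padding symbol
MAX_LEN = 45          # mirrors SENTENCE_LENGTH

def pad_or_truncate(tokens, max_len=MAX_LEN, pad=PAD_TOKEN):
    """Cut the list at max_len, then right-pad it up to max_len."""
    clipped = tokens[:max_len]
    return clipped + [pad] * (max_len - len(clipped))

short_example = pad_or_truncate(["trust", "me"])
long_example = pad_or_truncate([f"w{i}" for i in range(60)])
print(len(short_example), len(long_example))
```

Whatever is done to the tokens has to be done to the label sequence as well, otherwise tokens and tags no longer line up.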

In [76]:
# number of entities
entity_set = set()
for sentence in train_data:
    for word in sentence:
        entity_set.add(word[1])
        
NUM_ENTITIES = len(entity_set)
print(f"Number of unique entities in training data = {NUM_ENTITIES}")

Number of unique entities in training data = 21
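
The count of 21 is what one would expect if the 10 categories are tagged with a BIO scheme: a `B-` (begin) and an `I-` (inside) tag per category, plus `O` for tokens outside any entity. The sketch below reproduces that arithmetic. The BIO assumption and the category spellings are placeholders and may differ from the exact label strings in the file; comparing against `entity_set` would confirm them.

```python
# Placeholder spellings of the 10 categories (assumption, check against entity_set)
categories = ["person", "geo-loc", "company", "facility", "product",
              "musicartist", "movie", "sportsteam", "tvshow", "other"]

bio_tags = ["O"]                 # tokens outside any entity
for category in categories:
    bio_tags.append(f"B-{category}")   # first token of an entity
    bio_tags.append(f"I-{category}")   # following tokens of the same entity

print(len(bio_tags))  # 1 + 2 * 10 = 21
```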


In [None]:
# since there are 21 unique entity tags, we treat this as the number of all possible tags
NUM_ENTITIES

## Data preparation

In [87]:
# create a function to prepare the data to be fed into the model
def prepare_data(text_data):
    
    # initialize empty lists for sentences and entities
    sentences = []
    entities = []
    
    for sentence in text_data:
        
        # initialize empty lists for sentence text and corresponding entities
        word_list = []
        entity_list = []
        
        for token in sentence:
            word_list.append(token[0])
            entity_list.append(token[1])
        
        sentences.append(word_list)
        entities.append(entity_list)
    
    # create a single string for each sentence and entity by joining elements with whitespace
    sentences = [' '.join(sentence) for sentence in sentences]
    entities = [' '.join(entity) for entity in entities]
    
    return (sentences,entities)

In [88]:
# call the prepare_data function on the train and test data
xtrain, ytrain = prepare_data(train_data)
xtest, ytest = prepare_data(test_data)

In [90]:
# checking sentences after conversion
print(xtrain[0])
print(ytrain[0])

@SammieLynnsMom @tg10781 they will be all done by Sunday trust me *wink*
O O O O O O O O O O O O
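
Because words and tags are now two separate space-joined strings, it can be worth checking that both still have the same number of items. A minimal sketch of such a check follows, using the example printed above. The helper name `is_aligned` is hypothetical.

```python
def is_aligned(sentence, tags):
    """True if the sentence and its tag string split into the same number of items."""
    return len(sentence.split(" ")) == len(tags.split(" "))

example_x = "@SammieLynnsMom @tg10781 they will be all done by Sunday trust me *wink*"
example_y = "O O O O O O O O O O O O"
print(is_aligned(example_x, example_y))
```

A token that itself contains a space would break this check, which makes it a cheap way to spot such cases before training.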


In [None]:
# try keras text vectorization layer

In [93]:
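from tensorflow.keras.layers import TextVectorization

# NUM_WORDS already includes the padding and OOV slots, so no extra +2 is needed
temp_layer = TextVectorization(max_tokens=NUM_WORDS, output_sequence_length=40)
# quick experiment: adapt on the first sentence only, so words from other sentences map to the OOV index 1
temp_layer.adapt(xtrain[:1])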


In [94]:
temp_layer(xtrain[:2])

<tf.Tensor: shape=(2, 40), dtype=int64, numpy=
array([[ 8,  6,  5,  3, 12, 13, 10, 11,  7,  4,  9,  2,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0],
       [ 1,  1,  1,  1,  1,  1,  1,  1,  1,  1, 12,  1,  1,  1,  1,  1,
         1,  1, 12,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0]], dtype=int64)>
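
In the output above, 0 is the padding index and 1 is the out-of-vocabulary index, which is why the second sentence is mostly 1s. The same idea can be sketched in plain Python, which makes the mapping explicit. This is an illustration only, not the layer's implementation: the names `build_vocab` and `encode` and the `<PAD>`/`<OOV>` symbols are hypothetical, and ids are assigned in order of first appearance rather than by frequency.

```python
def build_vocab(sentences):
    """Map each whitespace-separated token to an integer id (0 = padding, 1 = OOV)."""
    vocab = {"<PAD>": 0, "<OOV>": 1}
    for sentence in sentences:
        for token in sentence.split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(sentence, vocab, max_len):
    """Turn a sentence into a fixed-length list of ids, padding with 0."""
    ids = [vocab.get(token, 1) for token in sentence.split()][:max_len]
    return ids + [0] * (max_len - len(ids))

toy_train = ["they will be done", "trust me"]
toy_vocab = build_vocab(toy_train)
encoded = encode("they will trust nobody", toy_vocab, max_len=6)
print(encoded)
```

Unlike `TextVectorization`, whose default `standardize="lower_and_strip_punctuation"` lowercases text and removes punctuation, this sketch leaves tokens such as `*wink*` untouched, so each token keeps exactly one id and stays aligned with its tag.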