<a href="https://colab.research.google.com/github/Sujan-Sawant/Twitter-Named-Entity-Recognition-NER-NLP-/blob/main/Twitter_(NER)_case_study.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **<font color = "red">Problem Statement</font>**

- Twitter is a microblogging and social networking service on which users post and interact with messages known as "tweets". On average, around 6,000 tweets are posted every second, which corresponds to over 350,000 tweets per minute and about 500 million tweets per day.

- Twitter wants to automatically tag and analyze tweets to better understand trends and topics without depending on the hashtags that users choose. Many users do not use hashtags, or use wrong or misspelled ones, so the goal is a system that recognizes the important content of a tweet directly.

- Named Entity Recognition (NER) is an important subtask of information extraction that seeks to locate and recognise named entities.

- You need to train models that will be able to identify the various named entities.



## <font color = "green">**Data Description**</font>

- The dataset is annotated with 10 fine-grained NER categories: person, geo-location, company, facility, product, music artist, movie, sports team, TV show and other. It was extracted from English-language tweets and is stored as a text file in CoNLL format.

- The CoNLL format is a text file with one word per line, with sentences separated by an empty line. The first token on each line is the word and the last token is its label.


- Consider the two sentences below:

**1. Harry Potter was a student living in London**

**2. Albus Dumbledore went to the Disney World**

These two sentences can be prepared in a CoNLL-formatted text file as follows.

 - Harry B-PER

 - Potter I-PER

 - was O

 - a O

 - student O

 - living O

 - in O

 - London B-geo-loc

--------------------------------

 - Albus B-PER

 - Dumbledore I-PER

 - went O

 - to O

 - the O

 - Disney B-facility

 - World I-facility
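
The layout described above can be sketched in a few lines of Python. This is only an illustration: the helper names `to_conll` and `from_conll` are hypothetical and are not part of the dataset tooling, and the tags are the ones used in the example sentences.

```python
# Illustrative helpers (hypothetical names): write and read CoNLL-style text.
def to_conll(sentences):
    """Emit one 'token tag' pair per line, with a blank line between sentences."""
    blocks = ["\n".join(f"{tok} {tag}" for tok, tag in sent) for sent in sentences]
    return "\n\n".join(blocks) + "\n"


def from_conll(text):
    """Parse CoNLL-style text back into lists of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:                      # blank line closes the current sentence
            if current:
                sentences.append(current)
                current = []
        else:                              # first field is the token, last is the tag
            current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


example = [
    [("Harry", "B-PER"), ("Potter", "I-PER"), ("was", "O"), ("a", "O"),
     ("student", "O"), ("living", "O"), ("in", "O"), ("London", "B-geo-loc")],
    [("Albus", "B-PER"), ("Dumbledore", "I-PER"), ("went", "O"), ("to", "O"),
     ("the", "O"), ("Disney", "B-facility"), ("World", "I-facility")],
]

conll_text = to_conll(example)
print(conll_text)
```

Parsing the generated text with `from_conll` gives back the same two sentences, which is a quick way to check that the blank-line separator is handled correctly.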

In [1]:
! gdown 1ege3gkDEzfTrJYrs8iGFBWNiAzx3egnb
! gdown 1vosdOEqepdYc4c83eHWow4RO57MTcgg4

Downloading...
From: https://drive.google.com/uc?id=1ege3gkDEzfTrJYrs8iGFBWNiAzx3egnb
To: /content/wnut 16test.txt.conll
100% 635k/635k [00:00<00:00, 22.1MB/s]
Downloading...
From: https://drive.google.com/uc?id=1vosdOEqepdYc4c83eHWow4RO57MTcgg4
To: /content/wnut 16.txt.conll
100% 403k/403k [00:00<00:00, 54.4MB/s]


## **Imports & Setup**

In [37]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

from gensim.models import Word2Vec


## **Load CoNLL Dataset**

In [41]:
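def load_conll(path):
    """Read a CoNLL file into a DataFrame; empty strings mark sentence boundaries."""
    words, labels = [], []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                # Blank line: sentence boundary
                words.append("")
                labels.append("")
            else:
                # First field is the token, last field is the tag
                # (also works if a line carries extra columns)
                words.append(parts[0])
                labels.append(parts[-1])
    return pd.DataFrame({"word": words, "label": labels})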

train_df = load_conll("/content/wnut 16.txt.conll")
test_df  = load_conll("/content/wnut 16test.txt.conll")



In [40]:
train_df.head()


|   | word | label |
|---|------|-------|
| 0 | @SammieLynnsMom | O |
| 1 | @tg10781 | O |
| 2 | they | O |
| 3 | will | O |
| 4 | be | O |


In [42]:
test_df.head()

|   | word | label |
|---|------|-------|
| 0 | New | B-other |
| 1 | Orleans | I-other |
| 2 | Mother | I-other |
| 3 | 's | I-other |
| 4 | Day | I-other |


## **Convert Tokens to Sentences**

In [43]:
def to_sentences(df):
    sentences, labels = [], []
    sentence, label_seq = [], []

    for word, label in zip(df["word"], df["label"]):
        if word == "":
            if sentence:
                sentences.append(sentence)
                labels.append(label_seq)
                sentence, label_seq = [], []
        else:
            sentence.append(word)
            label_seq.append(label)

    if sentence:
        sentences.append(sentence)
        labels.append(label_seq)

    return sentences, labels

train_sentences, train_labels = to_sentences(train_df)
test_sentences, test_labels = to_sentences(test_df)

print(train_sentences[0])
print(train_labels[0])


['@SammieLynnsMom', '@tg10781', 'they', 'will', 'be', 'all', 'done', 'by', 'Sunday', 'trust', 'me', '*wink*']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
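
Before modelling, it can help to look at how the tags are distributed, because NER corpora are typically dominated by `O`. The sketch below counts tags over a list of label sequences. `tag_distribution` is a hypothetical helper and `sample_labels` is made-up data with the same structure as `train_labels`.

```python
from collections import Counter


def tag_distribution(label_seqs):
    """Return {tag: share of all tokens}, most frequent tag first."""
    counts = Counter(tag for seq in label_seqs for tag in seq)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.most_common()}


# Hypothetical stand-in for train_labels (a list of tag lists).
sample_labels = [
    ["O", "O", "O", "O", "B-person", "I-person"],
    ["O", "B-geo-loc", "O", "O"],
]

for tag, share in tag_distribution(sample_labels).items():
    print(f"{tag:12s} {share:.2f}")
```

Calling `tag_distribution(train_labels)` on the real data would show how much of the corpus is `O`, which is useful context when reading token-level accuracy.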


## **Train Word2Vec Embeddings**

In [44]:
train_sentences = [[str(token) for token in sent] for sent in train_sentences]
test_sentences  = [[str(token) for token in sent] for sent in test_sentences]

w2v_model = Word2Vec(
    sentences=train_sentences,
    vector_size=100,
    window=5,
    min_count=1,
    workers=4
)

w2v_model.wv["UI"][:10]


array([-0.00399294,  0.00270391, -0.00261349,  0.00754211,  0.01248802,
       -0.01020205, -0.00255211,  0.02059236,  0.00125127, -0.01658982],
      dtype=float32)

## **Tokenization & Padding**

In [45]:
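# Keep the original casing so the word index matches the Word2Vec vocabulary,
# which was trained on the unmodified tokens (lower=True would lowercase them
# and cased tokens would then miss their vectors in the embedding lookup).
tokenizer = Tokenizer(lower=False, oov_token="<OOV>")
tokenizer.fit_on_texts(train_sentences)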

X_train = tokenizer.texts_to_sequences(train_sentences)
X_test  = tokenizer.texts_to_sequences(test_sentences)

MAXLEN = 100
X_train = pad_sequences(X_train, maxlen=MAXLEN, padding="post")
X_test  = pad_sequences(X_test, maxlen=MAXLEN, padding="post")


In [46]:
X_train

array([[2905, 2906,   87, ...,    0,    0,    0],
       [ 180,   12,   98, ...,    0,    0,    0],
       [  96, 1040, 2908, ...,    0,    0,    0],
       ...,
       [  16, 9065,    8, ...,    0,    0,    0],
       [9068,   74,   10, ...,    0,    0,    0],
       [  73,   65, 9069, ...,    0,    0,    0]], dtype=int32)
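
To make the two steps above concrete, here is a minimal pure-Python sketch of what the index mapping and post-padding do. It is not the Keras implementation. `build_index` and `encode_and_pad` are hypothetical names, and the sketch assumes the conventions used above: id 0 for padding, id 1 for the OOV token, more frequent tokens getting smaller ids, and over-long sequences keeping their last `maxlen` items (the `pad_sequences` default).

```python
def build_index(sentences, oov_token="<OOV>"):
    """Map tokens to ids by descending frequency; 0 = padding, 1 = OOV."""
    counts = {}
    for sent in sentences:
        for tok in sent:
            counts[tok] = counts.get(tok, 0) + 1
    # sorted() is stable, so ties keep first-seen order
    ordered = sorted(counts, key=lambda t: -counts[t])
    index = {oov_token: 1}
    for i, tok in enumerate(ordered, start=2):
        index[tok] = i
    return index


def encode_and_pad(sentences, index, maxlen, oov_token="<OOV>"):
    """Replace tokens by ids, then pad with zeros on the right up to maxlen."""
    rows = []
    for sent in sentences:
        ids = [index.get(tok, index[oov_token]) for tok in sent][-maxlen:]
        rows.append(ids + [0] * (maxlen - len(ids)))
    return rows


toy_train = [["the", "cat", "sat"], ["the", "dog"]]
toy_index = build_index(toy_train)
print(toy_index)
print(encode_and_pad([["the", "bird", "sat"]], toy_index, maxlen=5))
```

The unseen token `bird` is mapped to the OOV id, and the two trailing zeros are the padding that `mask_zero=True` is meant to ignore later on.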

## **Encode Labels**

In [47]:
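# Collect the distinct tags with a set comprehension (avoids the quadratic sum(list, []))
unique_tags = sorted({tag for seq in train_labels for tag in seq})
label2id = {label: idx for idx, label in enumerate(unique_tags)}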
id2label = {v: k for k, v in label2id.items()}
num_tags = len(label2id)

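def encode_labels(labels):
    # Map each tag to its integer id, then right-pad every sequence to MAXLEN.
    # Note: the padding value 0 is also the id of the first tag in label2id;
    # padded positions are told apart only by the mask_zero=True setting
    # on the Embedding layer of the model.
    encoded = [[label2id[tag] for tag in seq] for seq in labels]
    return pad_sequences(encoded, maxlen=MAXLEN, padding="post")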

y_train = encode_labels(train_labels)
y_test  = encode_labels(test_labels)

y_train = tf.keras.utils.to_categorical(y_train, num_classes=num_tags)
y_test  = tf.keras.utils.to_categorical(y_test, num_classes=num_tags)
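
Because `pad_sequences` pads with 0 and the label ids above also start at 0, padded positions share an id with the first real tag. One way to keep them apart, sketched here as an assumption and not as the notebook's method, is to reserve id 0 for a dedicated padding label. The names `PAD_LABEL`, `build_label_maps` and `encode_labels_padded` are hypothetical.

```python
PAD_LABEL = "<PAD>"  # hypothetical name for the reserved padding label


def build_label_maps(label_seqs):
    """Assign id 0 to padding and ids 1..N to the sorted real tags."""
    tags = sorted({tag for seq in label_seqs for tag in seq})
    label2id = {PAD_LABEL: 0}
    label2id.update({tag: i for i, tag in enumerate(tags, start=1)})
    id2label = {i: tag for tag, i in label2id.items()}
    return label2id, id2label


def encode_labels_padded(label_seqs, label2id, maxlen):
    """Encode tags as ids and right-pad each sequence with the padding id."""
    rows = []
    for seq in label_seqs:
        ids = [label2id[tag] for tag in seq][-maxlen:]
        rows.append(ids + [label2id[PAD_LABEL]] * (maxlen - len(ids)))
    return rows


toy_labels = [["B-person", "I-person", "O"], ["O", "B-geo-loc"]]
toy_label2id, toy_id2label = build_label_maps(toy_labels)
print(toy_label2id)
print(encode_labels_padded(toy_labels, toy_label2id, maxlen=4))
```

With this layout the number of output classes is one larger than the number of real tags, so the size of the softmax layer would need to follow `len(label2id)`.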


In [48]:
y_train

array([[[0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 0., 1.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 0., 1.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]],

       ...,

       [[0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 0., 1.],
        [0., 0., 0., ..., 0., 0., 1.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0.

## **Build Embedding Matrix**

In [49]:
EMB_DIM = 100
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1, EMB_DIM))

for word, idx in tokenizer.word_index.items():
    if word in w2v_model.wv:
        embedding_matrix[idx] = w2v_model.wv[word]
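
Rows of the matrix stay at zero for every word that is not found in the Word2Vec vocabulary, so it is worth checking how many words actually received a vector. The sketch below uses plain dictionaries as stand-ins for `tokenizer.word_index` and `w2v_model.wv`. The helper name and the toy data are hypothetical, and the toy data deliberately shows what a casing mismatch between the two vocabularies would look like.

```python
def embedding_coverage(word_index, vectors):
    """Return (share of indexed words that have a vector, list of missing words)."""
    missing = [w for w in word_index if w not in vectors]
    found = len(word_index) - len(missing)
    return found / max(len(word_index), 1), missing


# Hypothetical stand-ins: a lowercased word index and a cased vector table.
toy_word_index = {"<OOV>": 1, "sunday": 2, "trust": 3, "me": 4}
toy_vectors = {"Sunday": [0.1, 0.2], "trust": [0.3, 0.4], "me": [0.5, 0.6]}

coverage, missing = embedding_coverage(toy_word_index, toy_vectors)
print(f"coverage: {coverage:.2f}, missing: {missing}")
```

On the real objects the same check would be `embedding_coverage(tokenizer.word_index, w2v_model.wv)`, since both support the `in` operator used here. The OOV token is expected to be missing. Any other missing word means that its row in the frozen embedding layer is all zeros.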


## **BiLSTM NER Model**

In [50]:
inputs = tf.keras.Input(shape=(MAXLEN,))

embedding = layers.Embedding(
    input_dim=len(tokenizer.word_index) + 1,
    output_dim=EMB_DIM,
    weights=[embedding_matrix],
    trainable=False,
    mask_zero=True
)(inputs)

x = layers.Bidirectional(
    layers.LSTM(64, return_sequences=True)
)(embedding)

outputs = layers.TimeDistributed(
    layers.Dense(num_tags, activation="softmax")
)(x)

model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"]
)



In [51]:
model.summary()
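
The summary output is not reproduced here, but the trainable parameter count of the layers above can be worked out by hand. The sketch below uses the standard LSTM and dense-layer formulas. `NUM_TAGS = 21` is an assumption (10 categories with a `B-` and an `I-` tag each, plus `O`); the real value is `num_tags` and depends on the tags present in the training file.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input plus one bias, for every output unit."""
    return (input_dim + 1) * units


EMB_DIM, LSTM_UNITS = 100, 64
NUM_TAGS = 21  # assumption: 10 categories x (B-, I-) + O

bilstm_total = 2 * lstm_params(EMB_DIM, LSTM_UNITS)    # forward + backward LSTM
head_total = dense_params(2 * LSTM_UNITS, NUM_TAGS)    # Dense over the concatenated states

print("BiLSTM parameters:", bilstm_total)
print("Dense parameters: ", head_total)
```

The embedding layer adds one 100-dimensional row per vocabulary entry, but those weights are frozen (`trainable=False`) and therefore count as non-trainable.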

## **Model Training**

In [52]:
history = model.fit(
    X_train,
    y_train,
    batch_size=32,
    epochs=5,
    validation_split=0.1
)


Epoch 1/5
68/68 ━━━━━━━━━━━━━━━━━━━━ 27s 251ms/step - accuracy: 0.2284 - loss: 0.9949 - val_accuracy: 0.1756 - val_loss: 0.3638
Epoch 2/5
68/68 ━━━━━━━━━━━━━━━━━━━━ 14s 209ms/step - accuracy: 0.1866 - loss: 0.3815 - val_accuracy: 0.1756 - val_loss: 0.3578
Epoch 3/5
68/68 ━━━━━━━━━━━━━━━━━━━━ 22s 232ms/step - accuracy: 0.1833 - loss: 0.3749 - val_accuracy: 0.1756 - val_loss: 0.3531
Epoch 4/5
68/68 ━━━━━━━━━━━━━━━━━━━━ 20s 223ms/step - accuracy: 0.1832 - loss: 0.3613 - val_accuracy: 0.1756 - val_loss: 0.3457
Epoch 5/5
68/68 ━━━━━━━━━━━━━━━━━━━━ 20s 212ms/step - accuracy: 0.1846 - loss: 0.3532 - val_accuracy: 0.1756 - val_loss: 0.3377
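
The loss decreases across epochs while the validation accuracy stays at 0.1756. One plausible reading, stated here as an assumption that has not been verified against this run, is that the accuracy is averaged over all 100 positions of each padded sequence, so that padding drags the figure down. A token accuracy that skips padded positions can be sketched as follows. `masked_accuracy` is a hypothetical helper that works on integer label ids (for example `argmax` of the one-hot arrays).

```python
def masked_accuracy(y_true, y_pred, token_ids, pad_id=0):
    """Share of correctly tagged tokens, ignoring positions whose input id is pad_id."""
    correct = total = 0
    for true_row, pred_row, tok_row in zip(y_true, y_pred, token_ids):
        for t, p, tok in zip(true_row, pred_row, tok_row):
            if tok == pad_id:          # padded position: not a real token
                continue
            total += 1
            correct += int(t == p)
    return correct / total if total else 0.0


# Toy example: 3 real tokens followed by 2 padded positions.
toy_tokens = [[5, 6, 7, 0, 0]]
toy_true = [[4, 4, 1, 0, 0]]
toy_pred = [[4, 4, 4, 4, 4]]   # a model that predicts class 4 everywhere

unmasked = sum(t == p for t, p in zip(toy_true[0], toy_pred[0])) / len(toy_true[0])
print("unmasked accuracy:", unmasked)
print("masked accuracy:  ", masked_accuracy(toy_true, toy_pred, toy_tokens))
```

In the toy example the same predictions score 0.4 when padding is counted and about 0.67 when it is not, which shows how strongly the share of padding can affect the reported number.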


In [53]:
test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test Accuracy:", test_acc)


121/121 ━━━━━━━━━━━━━━━━━━━━ 9s 75ms/step - accuracy: 0.1452 - loss: 0.6670
Test Accuracy: 0.145332470536232
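
Token accuracy is a coarse measure for NER because most tokens are `O`. Results are more commonly reported as entity-level precision, recall and F1, where a predicted entity counts only if its span and type match the gold annotation exactly. The following is a minimal sketch of that evaluation on BIO tag sequences. The helper names are hypothetical, and it would be applied to tag strings recovered via `id2label`, not to the one-hot arrays.

```python
def bio_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO sequence (end exclusive)."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):      # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(true_seqs, pred_seqs):
    """Exact-match entity precision, recall and F1 over parallel tag sequences."""
    tp = fp = fn = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        gold, pred = bio_spans(true_tags), bio_spans(pred_tags)
        tp += len(gold & pred)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


toy_gold = [["B-person", "I-person", "O", "B-geo-loc"]]
toy_pred = [["B-person", "I-person", "O", "O"]]
print(entity_f1(toy_gold, toy_pred))
```

A model that predicts `O` for every token gets an entity-level F1 of 0 under this measure, even though its token accuracy can look high.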
