## Assignment 3 - Named Entity Recognition

In this assignment, we will build a Named Entity Recognition (NER) model and then use it to tag new data.

More on Named Entity Recognition:

https://blog.paralleldots.com/data-science/named-entity-recognition-milestone-models-papers-and-technologies/

https://blog.paralleldots.com/product/applications-named-entity-recognition-api/

### Steps:

**1. Import the data**

**2. Build the model**

**3. Pick a dataset to run the model on**

**4. Build a function to load new data and print the tags**

Your web application will load short pieces of text (such as tweets or headlines) and tag each one with the named entities it contains.

*What you will be graded on:*

1. Ability to build a model on word and tag data

2. Ability to use the model to predict on new data and display that prediction

*The model will be based on:*
1. Embeddings from words
2. Embeddings from tag inputs

### Step 1: Importing the data

Below is some code to get you started. As in the part-of-speech tagging example, you will have to write code to:

0. Split your data into a train/test set (use an 80/20 or 90/10 split, since we will later apply this model to an entirely separate set of data)
1. Find the set of all words
2. Find the set of all tags
3. Make a dictionary of words to index and entity tag to index

In [1]:
import pandas as pd
import numpy as np
### NER DATASET IS FOUND IN THE COURSE REPO
data = pd.read_csv("ner_dataset.csv", encoding="latin1")
# Forward-fill missing values so every row carries its sentence label
data = data.ffill()
data.head(10)



Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O
5,Sentence: 1,through,IN,O
6,Sentence: 1,London,NNP,B-geo
7,Sentence: 1,to,TO,O
8,Sentence: 1,protest,VB,O
9,Sentence: 1,the,DT,O


In [5]:
#!pip install tensorflow



In [2]:
import tensorflow



In [3]:
import keras

Using TensorFlow backend.


In [4]:
len(data)

1048575

In [4]:
#the set of all words
words = list(set(data["Word"].values))

In [5]:
n_words = len(words)

In [6]:
n_words

35178

In [7]:
#the set of all tags
tags = list(set(data["Tag"].values))

In [8]:
n_tags = len(tags)

In [9]:
n_tags

17

In [10]:
#dictionary of words to index
word_to_id = {w: i + 1 for i, w in enumerate(words)}

In [11]:
word_to_id 

{'compilation': 1,
 'Berruti': 2,
 'printed': 3,
 'parliamentarian': 4,
 'EST': 5,
 'Camille': 6,
 'concessionary': 7,
 'waters': 8,
 'Greg': 9,
 'clubs': 10,
 'bereavement': 11,
 'costs': 12,
 'al-Obeidi': 13,
 'Koizumi': 14,
 'perimeter': 15,
 'ranks': 16,
 'Presse': 17,
 'bite': 18,
 'telegram': 19,
 'three-thousand': 20,
 'ox-stall': 21,
 'haze-hit': 22,
 'currents': 23,
 'Armed': 24,
 'feminism': 25,
 'starving': 26,
 '1986': 27,
 'Hadassah': 28,
 'Zuloaga': 29,
 'Political': 30,
 'Laurence': 31,
 'Schumacher': 32,
 'nurtured': 33,
 'Storm': 34,
 '12-month': 35,
 'disorder': 36,
 'Vahidi': 37,
 'Use': 38,
 'hammered': 39,
 'Significant': 40,
 'Quartet': 41,
 'Csongrad': 42,
 'satellites': 43,
 'vines': 44,
 'Zoeggeler': 45,
 'war-ravaged': 46,
 'cleaned': 47,
 'spokesperson': 48,
 'Abayi': 49,
 'trauma': 50,
 'toy': 51,
 'radiotherapy': 52,
 'Curling': 53,
 'mogul': 54,
 'explained': 55,
 'Rams': 56,
 'US-VISIT': 57,
 'repeatedly': 58,
 'locales': 59,
 'fruit': 60,
 'First': 61,
 

In [12]:
#dictionary of tag to index
tag_to_id = {t: i for i, t in enumerate(tags)}

In [13]:
tag_to_id 

{'I-gpe': 0,
 'B-art': 1,
 'I-org': 2,
 'B-eve': 3,
 'O': 4,
 'B-nat': 5,
 'B-org': 6,
 'I-per': 7,
 'I-tim': 8,
 'B-gpe': 9,
 'I-eve': 10,
 'I-art': 11,
 'B-geo': 12,
 'I-nat': 13,
 'B-per': 14,
 'I-geo': 15,
 'B-tim': 16}

### Step 1a: Formatting the data
Data will need to be

1. Indexed
2. Limited by vocabulary (i.e., replace tokens with UNKNOWN if they are too rare; choose a reasonable limit based on your survey of the data and on model performance)
3. Padded
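
As one possible way to approach the three formatting steps, the sketch below counts word frequencies, keeps only words seen at least `min_count` times, and reserves index 0 for padding and index 1 for an unknown-word token. The names `build_vocab`, `encode`, `PAD_ID`, `UNK_ID` and the threshold value are illustrative assumptions, not part of the starter code, and the toy sentences only stand in for the real dataset.

```python
from collections import Counter

PAD_ID = 0  # assumed: index reserved for padding
UNK_ID = 1  # assumed: index reserved for rare or unseen words


def build_vocab(sentences, min_count=2):
    """Map each word seen at least `min_count` times to an index >= 2."""
    counts = Counter(w for s in sentences for w in s)
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i + 2 for i, w in enumerate(kept)}


def encode(sentence, vocab, max_len):
    """Index a sentence, replace rare words with UNK_ID, then pad or truncate."""
    ids = [vocab.get(w, UNK_ID) for w in sentence[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Toy data standing in for the tokenized sentences of the NER dataset
toy_sentences = [
    ["London", "is", "big"],
    ["Paris", "is", "big", "too"],
]
vocab = build_vocab(toy_sentences, min_count=2)
encoded = encode(["London", "is", "big"], vocab, max_len=5)
print(vocab)
print(encoded)
```

With this layout, the `Embedding` layer would need `input_dim=len(vocab) + 2`, and unseen words in new text map to `UNK_ID` instead of the padding index.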

In [14]:
data.groupby('Sentence #').groups

{'Sentence: 1': Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
             17, 18, 19, 20, 21, 22, 23],
            dtype='int64'),
 'Sentence: 10': Int64Index([196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208,
             209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220],
            dtype='int64'),
 'Sentence: 100': Int64Index([2275, 2276, 2277, 2278, 2279, 2280, 2281, 2282, 2283, 2284, 2285,
             2286, 2287, 2288, 2289, 2290, 2291, 2292, 2293, 2294, 2295, 2296,
             2297, 2298, 2299, 2300, 2301, 2302, 2303, 2304, 2305, 2306],
            dtype='int64'),
 'Sentence: 1000': Int64Index([22066, 22067, 22068, 22069, 22070, 22071, 22072, 22073, 22074,
             22075, 22076],
            dtype='int64'),
 'Sentence: 10000': Int64Index([218268, 218269, 218270, 218271, 218272, 218273, 218274, 218275,
             218276, 218277, 218278, 218279, 218280, 218281, 218282, 218283,
             218284, 218285, 218286, 2

In [15]:
ssss=data.groupby('Sentence #').apply(lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())])

In [16]:
sentencesss = [s for s in ssss]

In [17]:
len(sentencesss)

47959

In [18]:
from keras.preprocessing.sequence import pad_sequences
X = [[word_to_id[w[0]] for w in s] for s in sentencesss]

In [19]:
max_len=100
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=0)
X 

array([[20319, 26198, 31397, ...,     0,     0,     0],
       [32491, 18163, 27455, ...,     0,     0,     0],
       [32459,  3293, 26606, ...,     0,     0,     0],
       ...,
       [19356,  2244, 27938, ...,     0,     0,     0],
       [13241,  4954, 30907, ...,     0,     0,     0],
       [13450, 15319, 20645, ...,     0,     0,     0]], dtype=int32)

In [20]:
y = [[tag_to_id[w[2]] for w in s] for s in sentencesss]

In [21]:
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=tag_to_id["O"])


In [22]:
y

array([[ 4,  4,  4, ...,  4,  4,  4],
       [ 9,  4,  4, ...,  4,  4,  4],
       [ 4,  4, 16, ...,  4,  4,  4],
       ...,
       [ 4, 12,  4, ...,  4,  4,  4],
       [ 4,  4,  4, ...,  4,  4,  4],
       [ 4,  6,  2, ...,  4,  4,  4]], dtype=int32)

In [23]:
from keras.utils import to_categorical
y = [to_categorical(i, num_classes=n_tags) for i in y]

In [24]:
y

[array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 1.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([

In [25]:
import sklearn
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


In [26]:
y_test

[array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 1., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]), array([

### Step 2. Build the model

Here we will build a Bidirectional LSTM-CRF model using the `Bidirectional` wrapper from Keras and the `CRF` layer from keras-contrib.

**Documentation and source code:**

https://keras.io/layers/wrappers/#bidirectional

https://github.com/keras-team/keras-contrib

Fit your model with a validation split of 0.1; feel free to use as many epochs as you like. Base your predictions on both the input words **and** the tags from previous words, as in the POS example.

After building your model, evaluate its performance on your test set in two ways: compare your predicted output to the actual tags (*at least 3 examples*), and calculate the averaged precision and recall for your tags.

In [14]:
!pip install git+https://www.github.com/keras-team/keras-contrib.git


Collecting git+https://www.github.com/keras-team/keras-contrib.git
  Cloning https://www.github.com/keras-team/keras-contrib.git to /tmp/pip-lc2fth9v-build
Installing collected packages: keras-contrib
  Running setup.py install for keras-contrib ... done
Successfully installed keras-contrib-2.0.8


In [27]:
from keras.layers import Activation
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from keras_contrib.layers import CRF

In [28]:
input = Input(shape=(max_len,))
model = Embedding(input_dim=n_words + 1, output_dim=20,
                  input_length=max_len, mask_zero=True)(input)
model = Dropout(0.1)(model)
model = Bidirectional(LSTM(units=50, return_sequences=True,
                           recurrent_dropout=0.1))(model)  
model = TimeDistributed(Dense(50, activation="relu"))(model)  
crf = CRF(n_tags)  
out = crf(model) #output layer  
model = Model(input, out)
model.compile(optimizer="rmsprop", loss=crf.loss_function, metrics=[crf.accuracy])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 100, 20)           703580    
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 20)           0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 100)          28400     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 100, 50)           5050      
_________________________________________________________________
crf_1 (CRF)                  (None, 100, 17)           1190      
Total params: 738,220
Trainable params: 738,220
Non-trainable params: 0
_________________________________________________________________


In [29]:
input

<tf.Tensor 'input_1:0' shape=(?, 100) dtype=float32>

In [30]:
model.fit(x_train, np.array(y_train), batch_size=32, epochs=5,
                    validation_split=0.1, verbose=1)
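# Note: this evaluates on the training data; pass x_test and np.array(y_test) for held-out accuracy
loss, accuracy = model.evaluate(x_train, np.array(y_train), verbose=1)
print('Accuracy: %f' % (accuracy*100))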

Train on 34530 samples, validate on 3837 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Accuracy: 97.291147


In [44]:
pp=model.predict(x_test)

Compare your predicted output to the actual tags (at least 3 examples) and calculate the averaged precision and recall for your tags.

In [51]:
from sklearn.metrics import classification_report
# Note: both arrays span all max_len positions, so padded positions are scored too
print(classification_report(y_test[1000], pp[1000]))

             precision    recall  f1-score   support

          0       0.00      0.00      0.00         0
          1       0.00      0.00      0.00         0
          2       0.00      0.00      0.00         0
          3       0.00      0.00      0.00         0
          4       1.00      0.11      0.20        98
          5       0.00      0.00      0.00         0
          6       0.00      0.00      0.00         0
          7       0.00      0.00      0.00         0
          8       0.00      0.00      0.00         0
          9       0.00      0.00      0.00         0
         10       0.00      0.00      0.00         0
         11       0.00      0.00      0.00         0
         12       1.00      1.00      1.00         2
         13       0.00      0.00      0.00         0
         14       0.00      0.00      0.00         0
         15       0.00      0.00      0.00         0
         16       0.00      0.00      0.00         0

avg / total       1.00      0.13      0.22  



In [81]:
i = 1000
p = model.predict(np.array([x_test[i]]))
p = np.argmax(p, axis=-1)
true = np.argmax(y_test[i], -1)
print("{:15} {:5} {}".format("Word", "True", "Pred"))
print(30 * "-")
for w, t, pred in zip(x_test[i], true, p[0]):
    if w != 0:
        print("{:15}: {:5} {}".format(words[w-1], tags[t], tags[pred]))

Word            True  Pred
------------------------------
The            : O     O
rebels         : O     O
are            : O     O
demanding      : O     O
Kashmir        : B-geo B-geo
's             : O     O
independence   : O     O
or             : O     O
its            : O     O
merger         : O     O
with           : O     O
Pakistan       : B-geo B-geo
.              : O     O


In [52]:
print(classification_report(y_test[2000], pp[2000]))

             precision    recall  f1-score   support

          0       0.00      0.00      0.00         0
          1       0.00      0.00      0.00         0
          2       0.00      0.00      0.00         0
          3       0.00      0.00      0.00         0
          4       1.00      0.21      0.35        99
          5       0.00      0.00      0.00         0
          6       0.00      0.00      0.00         0
          7       0.00      0.00      0.00         0
          8       0.00      0.00      0.00         0
          9       0.00      0.00      0.00         0
         10       0.00      0.00      0.00         0
         11       0.00      0.00      0.00         0
         12       1.00      1.00      1.00         1
         13       0.00      0.00      0.00         0
         14       0.00      0.00      0.00         0
         15       0.00      0.00      0.00         0
         16       0.00      0.00      0.00         0

avg / total       1.00      0.22      0.36  



In [82]:
i = 2000
p = model.predict(np.array([x_test[i]]))
p = np.argmax(p, axis=-1)
true = np.argmax(y_test[i], -1)
print("{:10} {:5} {}".format("Word", "True", "Pred"))
print(30 * "-")
for w, t, pred in zip(x_test[i], true, p[0]):
    if w != 0:
        print("{:10}: {:5} {}".format(words[w-1], tags[t], tags[pred]))

Word       True  Pred
------------------------------
Dozens    : O     O
of        : O     O
people    : O     O
have      : O     O
been      : O     O
killed    : O     O
or        : O     O
are       : O     O
missing   : O     O
after     : O     O
parts     : O     O
of        : O     O
Afghanistan: B-geo B-geo
were      : O     O
hit       : O     O
by        : O     O
heavy     : O     O
snowfall  : O     O
and       : O     O
icy       : O     O
conditions: O     O
.         : O     O


In [53]:
print(classification_report(y_test[5000], pp[5000]))

             precision    recall  f1-score   support

          0       0.00      0.00      0.00         0
          1       0.00      0.00      0.00         0
          2       1.00      1.00      1.00         2
          3       0.00      0.00      0.00         0
          4       1.00      0.24      0.38        93
          5       0.00      0.00      0.00         0
          6       1.00      1.00      1.00         1
          7       0.00      0.00      0.00         0
          8       0.00      0.00      0.00         0
          9       0.00      0.00      0.00         0
         10       0.00      0.00      0.00         0
         11       0.00      0.00      0.00         0
         12       1.00      1.00      1.00         2
         13       0.00      0.00      0.00         0
         14       0.00      0.00      0.00         0
         15       0.00      0.00      0.00         0
         16       1.00      1.00      1.00         2

avg / total       1.00      0.29      0.43  



In [83]:
i = 5000
p = model.predict(np.array([x_test[i]]))
p = np.argmax(p, axis=-1)
true = np.argmax(y_test[i], -1)
print("{:10} {:5} {}".format("Word", "True", "Pred"))
print(30 * "-")
for w, t, pred in zip(x_test[i], true, p[0]):
    if w != 0:
        print("{:10}: {:5} {}".format(words[w-1], tags[t], tags[pred]))

Word       True  Pred
------------------------------
A         : O     O
military  : O     O
spokesman : O     O
said      : O     O
Saturday  : B-tim B-tim
the       : O     O
airmen    : O     O
-         : O     O
all       : O     O
based     : O     O
at        : O     O
Keesler   : B-org B-org
Air       : I-org I-org
Base      : I-org I-org
in        : O     O
Biloxi    : B-geo B-geo
,         : O     O
Mississippi: B-geo B-geo
-         : O     O
would     : O     O
begin     : O     O
flying    : O     O
home      : O     O
over      : O     O
the       : O     O
next      : O     O
two       : B-tim B-tim
weeks     : O     O
.         : O     O
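
Because the per-sentence reports above score all 100 positions, padding included, their recall for tag 4 (`O`) understates how well the real tokens are tagged. As a rough sketch of the averaged precision and recall the assignment asks for, the function below pools all non-padded tokens and macro-averages over the tags that occur. It is plain Python on toy data; `macro_precision_recall` and its arguments are hypothetical names, and in the notebook the inputs would be `x_test` together with the argmax of `y_test` and of `model.predict(x_test)`.

```python
def macro_precision_recall(word_ids, true_tags, pred_tags, pad_id=0):
    """Macro-averaged precision and recall over non-padded tokens.

    Each argument is a list of equal-length sequences, one per sentence.
    """
    tp, fp, fn = {}, {}, {}
    for ws, ts, ps in zip(word_ids, true_tags, pred_tags):
        for w, t, p in zip(ws, ts, ps):
            if w == pad_id:  # skip padded positions
                continue
            if t == p:
                tp[t] = tp.get(t, 0) + 1
            else:
                fp[p] = fp.get(p, 0) + 1
                fn[t] = fn.get(t, 0) + 1
    labels = sorted(set(tp) | set(fp) | set(fn))
    if not labels:
        return 0.0, 0.0
    precisions, recalls = [], []
    for lab in labels:
        hits = tp.get(lab, 0)
        predicted = hits + fp.get(lab, 0)
        actual = hits + fn.get(lab, 0)
        precisions.append(hits / predicted if predicted else 0.0)
        recalls.append(hits / actual if actual else 0.0)
    return sum(precisions) / len(labels), sum(recalls) / len(labels)


# Toy example: two padded sentences, tag ids 4 ("O") and 12 ("B-geo")
toy_words = [[5, 6, 7, 0, 0], [8, 9, 0, 0, 0]]
toy_true = [[4, 12, 4, 4, 4], [12, 4, 4, 4, 4]]
toy_pred = [[4, 12, 12, 0, 0], [12, 4, 0, 0, 0]]
precision, recall = macro_precision_recall(toy_words, toy_true, toy_pred)
print("precision: %.3f  recall: %.3f" % (precision, recall))
```

Macro-averaging gives every tag equal weight, so rare entity tags count as much as `O`; a support-weighted average is another reasonable choice.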


### Step 3. Pick a dataset

Pick a dataset that has short text, similar to the sentences you just tagged. Headlines and tweets are good choices.

https://www.kaggle.com/datasets?sortBy=relevance&group=public&search=news&page=1&pageSize=20&size=all&filetype=all&license=all

In [59]:
#abcnews headline
data2 = pd.read_csv("abcnews-date-text.csv")
data2.head(10)

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers
5,20030219,ambitious olsson wins triple jump
6,20030219,antic delighted with record breaking barca
7,20030219,aussie qualifier stosur wastes four memphis match
8,20030219,aust addresses un security council over iraq
9,20030219,australia is locked into war timetable opp


### Step 4. Tag your new data!

Modify the **ent_tagger function** that combines words and tags from your original dataset so that it can also load new text from your new dataset and output the tags predicted by your trained model alongside the text. Have the function load five random texts from your data and output the tagged text.

In [72]:
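def ent_tagger(index):
    # Load one headline and split it into tokens on single spaces
    sn = data2["headline_text"][index]
    snn = sn.split(" ")
    # Caveat: words missing from word_to_id (including lowercase forms of
    # capitalized training words) map to 0, the padding index, so the model
    # treats them as padding; tags printed for such words may not be meaningful
    x_test2 = pad_sequences(sequences=[[word_to_id.get(w, 0) for w in snn]],
                            padding="post", value=0, maxlen=max_len)
    p = model.predict(np.array([x_test2[0]]))
    p = np.argmax(p, axis=-1)
    print("{:10} {}".format("Word", "Prediction"))
    print(30 * "-")
    for w, pred in zip(snn, p[0]):
        print("{:10}: {:5}".format(w, tags[pred]))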

In [73]:
ent_tagger(2)

Word       Prediction
------------------------------
a         : O    
g         : I-gpe
calls     : O    
for       : O    
infrastructure: O    
protection: O    
summit    : O    


In [74]:
ent_tagger(5)

Word       Prediction
------------------------------
ambitious : O    
olsson    : I-gpe
wins      : O    
triple    : O    
jump      : O    


In [75]:
ent_tagger(1000)

Word       Prediction
------------------------------
death     : O    
toll      : O    
hits      : O    
41        : O    
during    : O    
bangladeshs: I-gpe
local     : O    


In [76]:
ent_tagger(2000)

Word       Prediction
------------------------------
gas       : O    
fired     : O    
power     : O    
station   : O    
planned   : O    
for       : O    
illawarra : I-gpe


In [77]:
ent_tagger(5000)

Word       Prediction
------------------------------
sri       : I-gpe
lanka     : I-gpe
hoping    : O    
for       : O    
new       : O    
zealand   : I-gpe
defeat    : O    
