# Deliverable 2

- Deliverable 2 is a named entity recognition (NER) system.


## Overview of the data

Source: https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus


Essential info about entities:

```
geo = Geographical Entity
org = Organization
per = Person
gpe = Geopolitical Entity
tim = Time indicator
art = Artifact
eve = Event
nat = Natural Phenomenon
```


In [5]:
import pandas as pd
import numpy as np
%matplotlib inline

import matplotlib
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm

The data is located in the `data` folder.

In [6]:
data = pd.read_csv("data/ner_dataset.csv", encoding="latin1")

In [7]:
data.head(70)

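|     | Sentence #  | Word          | POS | Tag   |
|-----|-------------|---------------|-----|-------|
| 0   | Sentence: 1 | Thousands     | NNS | O     |
| 1   | NaN         | of            | IN  | O     |
| 2   | NaN         | demonstrators | NNS | O     |
| 3   | NaN         | have          | VBP | O     |
| 4   | NaN         | marched       | VBN | O     |
| ... | ...         | ...           | ... | ...   |
| 65  | NaN         | Hyde          | NNP | B-geo |
| 66  | NaN         | Park          | NNP | I-geo |
| 67  | NaN         | .             | .   | O     |
| 68  | Sentence: 4 | Police        | NNS | O     |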


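Only the first word of each sentence carries a "Sentence: k" label; the remaining rows are NaN. The goal is to fill every row with the "Sentence: k" label of the sentence it belongs to.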

In [8]:
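# NaN marks rows without a sentence label; convert every id to a string
# so the list can be sorted (NaN becomes "nan").
sentences = sorted(str(s) for s in data["Sentence #"].unique())
len(sentences)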

47960

In [9]:
sentences[0:3]

['Sentence: 1', 'Sentence: 10', 'Sentence: 100']
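
The list above is sorted lexicographically, which is why 'Sentence: 10' comes right after 'Sentence: 1'. If numeric order is ever needed, one option is a sort key that parses the number out of the id. This is a minimal sketch on a small hand-written list; the helper name `sentence_number` and the list `ids` are assumptions made for illustration.

```python
def sentence_number(sentence_id):
    """Return the integer k from an id of the form 'Sentence: k'."""
    return int(sentence_id.split(":")[1])

# Hypothetical sample of ids, in lexicographic order.
ids = ["Sentence: 1", "Sentence: 10", "Sentence: 100", "Sentence: 2"]

# Sort by the parsed number instead of by the raw string.
ids_numeric = sorted(ids, key=sentence_number)
print(ids_numeric)
```

The helper would fail on the "nan" entry, so that entry would have to be filtered out first.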

In [10]:
set(data["Tag"])

{'B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim',
 'O'}

In [11]:
for tag in set(data["Tag"]):
    print("\nTAG:",tag)
    print(data[data["Tag"] == tag]["Word"][0:10])


TAG: I-gpe
1225    States
1264     Korea
2713      Binh
2932     Ababa
3466      City
5241     Lanka
5313     Korea
5361     Korea
5370     Korea
5390     Korea
Name: Word, dtype: object

TAG: B-org
97             Labor
154    International
215             IAEA
234         European
248             U.N.
328        Bilfinger
359      Royal-Dutch
370            Shell
543               al
597               al
Name: Word, dtype: object

TAG: I-per
271         Mahmoud
272     Ahmadinejad
332         Horbach
444       Abdullahi
445           Yusuf
446           Ahmad
966        Muhammad
974          Khayam
1106     Faridullah
1107           Khan
Name: Word, dtype: object

TAG: B-nat
2723       H5N1
4554       H5N1
5044       Jing
5073       Jing
5606       H5N1
12506      SARS
12508    Severe
13162       HIV
13164      AIDS
22260      AIDS
Name: Word, dtype: object

TAG: B-gpe
18     British
102    English
113    Britain
126    British
173       Iran
181       Iran
196    Iranian
238       U

How many sentences do we have? The list has 47960 entries including "nan", so the ids should run from 1 to 47959:

In [12]:
"Sentence: 47959" in sentences, "Sentence: 47960" in sentences

(True, False)

## Indexing Sentences

In [13]:
sentence_formatter = "Sentence: {}"
sentence_formatter.format(0) in sentences

False

In [14]:
sentence_formatter = "Sentence: {}"
sentence_formatter.format(1) in sentences

True

In [15]:
i = 1
sentence_id      = sentence_formatter.format(i)
sentence_id_next = sentence_formatter.format(i+1)
sentence_id, sentence_id_next

('Sentence: 1', 'Sentence: 2')

In [16]:
print(data.index[data["Sentence #"] == sentence_id])
print(data.index[data["Sentence #"] == sentence_id_next])

Int64Index([0], dtype='int64')
Int64Index([24], dtype='int64')


In [17]:
start = data.index[data["Sentence #"] == sentence_id][0]
end   =  data.index[data["Sentence #"] == sentence_id_next][0]
start, end

(0, 24)

In [18]:
# .loc avoids chained assignment; label slices are inclusive, hence end - 1.
data.loc[start:end - 1, "Sentence #"] = sentence_id

In [19]:
data["Sentence #"][start:end]

0     Sentence: 1
1     Sentence: 1
2     Sentence: 1
3     Sentence: 1
4     Sentence: 1
5     Sentence: 1
6     Sentence: 1
7     Sentence: 1
8     Sentence: 1
9     Sentence: 1
10    Sentence: 1
11    Sentence: 1
12    Sentence: 1
13    Sentence: 1
14    Sentence: 1
15    Sentence: 1
16    Sentence: 1
17    Sentence: 1
18    Sentence: 1
19    Sentence: 1
20    Sentence: 1
21    Sentence: 1
22    Sentence: 1
23    Sentence: 1
Name: Sentence #, dtype: object

## Selecting a subset and writing an identifier

In [20]:
data = pd.read_csv("data/ner_dataset.csv", encoding="latin1")

last_n = 2000
end   = data.index[data["Sentence #"] == sentence_formatter.format(last_n)][0]

In [21]:
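# Keep the rows before "Sentence: 2000"; copy so later assignments
# modify this frame and not a view of the original.
data = data[0:end].copy()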

In [22]:
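# The set still contains NaN (rows not yet filled), so this count is the
# number of sentence ids plus one.
n_sentences = len(list(set(data["Sentence #"])))
first_n = 1
last_n = last_n - 1  # last sentence id present in the subset
print(n_sentences)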

2000


In [23]:
%%time
sentence_formatter = "Sentence: {}"

# Fill the rows of each sentence with its id; the first row of the next
# sentence marks where the current one ends.
for s_id in range(first_n, last_n):
    print("current {}/{}".format(s_id, last_n), end="\r")
    sentence_id = sentence_formatter.format(s_id)
    sentence_id_next = sentence_formatter.format(s_id + 1)
    start = data.index[data["Sentence #"] == sentence_id][0]
    end   = data.index[data["Sentence #"] == sentence_id_next][0]
    data.loc[start:end - 1, "Sentence #"] = sentence_id

# The last sentence has no successor, so it runs to the end of the frame.
sentence_id = sentence_formatter.format(last_n)
start = data.index[data["Sentence #"] == sentence_id][0]
end   = data.shape[0]
data.loc[start:end - 1, "Sentence #"] = sentence_id


CPU times: user 12 s, sys: 105 ms, total: 12.1 s
Wall time: 11.6 s
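
The loop above needs about 12 seconds for this subset because it scans the whole column twice per sentence. Since each id sits on the first row of its sentence and the rows below are NaN, a forward fill should produce the same labelling in a single pass. The sketch below shows the idea on a toy frame (`toy` and its values are made up for illustration); it has not been timed against the real data.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the layout of ner_dataset.csv (values are illustrative).
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", np.nan, np.nan, "Sentence: 2", np.nan],
    "Word": ["Thousands", "of", "demonstrators", "Police", "said"],
})

# Propagate the last seen sentence id down to the rows that lack one.
toy["Sentence #"] = toy["Sentence #"].ffill()
print(toy["Sentence #"].tolist())
```

On the real frame the equivalent call would be `data["Sentence #"] = data["Sentence #"].ffill()`, which also removes the need to treat the last sentence separately.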


## Building X and Y

In [24]:
n_sentences

2000

In [25]:
X = []
Y = []

sentence_formatter = "Sentence: {}"

# n_sentences also counted the NaN entry, so the ids run from 1 to n_sentences - 1.
for i in range(1,n_sentences):
    s = sentence_formatter.format(i)
    X.append(list(data[data["Sentence #"]==s]["Word"].values))
    Y.append(list(data[data["Sentence #"]==s]["Tag"].values))
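
The loop above filters the full frame twice for every sentence. Once the ids are filled in, a single `groupby` can collect the words and tags of each sentence in one pass. This is a sketch on a toy frame, not a drop-in replacement: `toy`, `X_toy` and `Y_toy` are illustrative names, and it assumes every row already has its sentence id.

```python
import pandas as pd

# Toy frame with the sentence ids already filled in (values are illustrative).
toy = pd.DataFrame({
    "Sentence #": ["Sentence: 1", "Sentence: 1", "Sentence: 2", "Sentence: 2"],
    "Word": ["London", "protest", "Hyde", "Park"],
    "Tag": ["B-geo", "O", "B-geo", "I-geo"],
})

# sort=False keeps the sentences in order of first appearance instead of
# sorting the ids lexicographically.
grouped = toy.groupby("Sentence #", sort=False)
X_toy = grouped["Word"].apply(list).tolist()
Y_toy = grouped["Tag"].apply(list).tolist()
print(X_toy)
print(Y_toy)
```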

In [26]:
i = 0
xy = ["{}/{}".format(x,y) for x,y in zip(X[i],Y[i])]
" ".join(xy)

'Thousands/O of/O demonstrators/O have/O marched/O through/O London/B-geo to/O protest/O the/O war/O in/O Iraq/B-geo and/O demand/O the/O withdrawal/O of/O British/B-gpe troops/O from/O that/O country/O ./O'
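
The tags follow the usual B-/I-/O convention: `B-` opens an entity, `I-` continues it and `O` marks tokens outside any entity (for example Hyde/B-geo Park/I-geo in the data preview). To read entities rather than individual tags, the tokens can be grouped into spans. The function below is a minimal sketch; `extract_entities` is a hypothetical helper, and dropping `I-` tags that have no matching open entity is a design choice made here, not something the dataset prescribes.

```python
def extract_entities(words, tags):
    """Group B-/I- tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any open one first.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the currently open entity.
            current_words.append(word)
        else:
            # "O" (or an I- tag with no matching open entity) closes the entity.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    # Close an entity that runs to the end of the sentence.
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities

# Illustrative sentence, not taken verbatim from the corpus.
words = ["protesters", "in", "Hyde", "Park", "and", "London", "."]
tags = ["O", "O", "B-geo", "I-geo", "O", "B-geo", "O"]
print(extract_entities(words, tags))
```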

In [27]:
def build_word_to_pos(X):

    word_to_pos = {}
    i = 0
    for s in X:
        for w in s:
            if w not in word_to_pos:
                word_to_pos[w] = i
                i +=1
                
    pos_to_word = {v: k for k, v in word_to_pos.items()}
    return word_to_pos, pos_to_word
            
def build_tag_to_pos(Y):
    tag_to_pos = {}
    i = 0
    for s in Y:
        for t in s:
            if t not in tag_to_pos:
                tag_to_pos[t] = i
                i +=1
    pos_to_tag = {v: k for k, v in tag_to_pos.items()}

    return tag_to_pos, pos_to_tag

In [28]:
word_to_pos, pos_to_word = build_word_to_pos(X)
tag_to_pos, pos_to_tag  = build_tag_to_pos(Y)

len(word_to_pos), len(tag_to_pos)

(7047, 17)

In [29]:
tag_to_pos

{'O': 0,
 'B-geo': 1,
 'B-gpe': 2,
 'B-per': 3,
 'I-geo': 4,
 'B-org': 5,
 'I-org': 6,
 'B-tim': 7,
 'B-art': 8,
 'I-art': 9,
 'I-per': 10,
 'I-gpe': 11,
 'I-tim': 12,
 'B-nat': 13,
 'B-eve': 14,
 'I-eve': 15,
 'I-nat': 16}

In [30]:
#X = [[word_to_pos[w] for w in s] for s in X]
#Y = [[tag_to_pos[t] for t in s] for s in Y]

In [31]:
X = [[w for w in s] for s in X]
Y = [[t for t in s] for s in Y]

In [32]:
from sklearn.model_selection import train_test_split

In [33]:
train_idx, val_idx = train_test_split(np.arange(len(X)), test_size=0.2, random_state=42)

In [34]:
X_train = [X[i] for i in train_idx]
Y_train = [Y[i] for i in train_idx]
X_val = [X[i] for i in val_idx]
Y_val = [Y[i] for i in val_idx]

## HMM

In [35]:
from HMM import HMM

In [36]:
hmm = HMM(word_to_pos, tag_to_pos)

In [37]:
hmm.fit(X_train, Y_train)

  return {"emission":   np.log(probs["emission"]),
  "transition": np.log(probs["transition"]),
  "final":      np.log(probs["final"]),
  "initial":    np.log(probs["initial"])}


#### Train accuracy

In [38]:
Y_hat = []
for x in tqdm(X_train):
    Y_hat.append(hmm.predict_labels(x))

correct = 0
total   = 0
for y,y_hat in zip(Y_train,Y_hat):
    for y_k, y_hat_k in zip(y,y_hat):
        total +=1
        if y_hat_k == y_k:
            correct +=1

print("Accuracy posterior decode train data", correct/total)


  state_posteriors[:, pos] = log_f_x[:, pos] + log_b_x[:, pos] - log_likelihood



Accuracy posterior decode train data 0.9699934768427919


#### Validation accuracy

In [39]:
Y_hat = []
for x in tqdm(X_val):
    Y_hat.append(hmm.predict_labels(x))

correct = 0
total   = 0
for y,y_hat in zip(Y_val,Y_hat):
    for y_k, y_hat_k in zip(y,y_hat):
        total +=1
        if y_hat_k == y_k:
            correct +=1

print("Accuracy posterior decode validation data", correct/total)



Accuracy posterior decode validation data 0.8725318121983326
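
In the sample sentence shown earlier, 21 of the 24 tokens are tagged O, so plain token accuracy can look high even when entity tokens are missed. A complementary number is the accuracy restricted to tokens whose gold tag is not O. The helper below is a sketch: `token_accuracy` and its `ignore_tag` parameter are assumptions introduced here, and the toy sequences are not model output.

```python
def token_accuracy(Y_true, Y_pred, ignore_tag=None):
    """Token-level accuracy; tokens whose gold tag equals ignore_tag are skipped."""
    correct = 0
    total = 0
    for y_true, y_pred in zip(Y_true, Y_pred):
        for gold, pred in zip(y_true, y_pred):
            if gold == ignore_tag:
                continue
            total += 1
            correct += int(gold == pred)
    # Return NaN when no token was counted, to avoid dividing by zero.
    return correct / total if total else float("nan")

# Toy gold and predicted tag sequences (illustrative only).
gold = [["O", "O", "B-geo", "O"], ["B-per", "I-per", "O", "O"]]
pred = [["O", "O", "B-geo", "O"], ["B-per", "O", "O", "O"]]

print(token_accuracy(gold, pred))                  # all tokens
print(token_accuracy(gold, pred, ignore_tag="O"))  # entity tokens only
```

With the notebook's variables, the call would look like `token_accuracy(Y_val, Y_hat, ignore_tag="O")`, assuming `hmm.predict_labels` returns tags in the same representation as `Y_val`.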


## Structured perceptron

In [54]:
import skseq
from skseq.sequences import sequence
from skseq.sequences.sequence import Sequence
from skseq.sequences.sequence_list import SequenceList
from skseq.sequences.label_dictionary import LabelDictionary
import skseq.sequences.structured_perceptron as spc
import skseq.sequences.id_feature  # used below to build the feature mapper
import time


In [55]:
def generate_sequence_list(X, y, word_to_pos, tag_to_pos):
    # Generate x and y dicts
    x_dict = LabelDictionary(word_to_pos.keys())
    y_dict = LabelDictionary(tag_to_pos.keys())
    # Generate SequenceList
    seq_list = SequenceList(x_dict, y_dict)
    # Add words/tags to sequencelist
    for i in range(len(X)):
        seq_list.add_sequence(X[i], y[i], x_dict, y_dict)
    return seq_list

In [56]:
train_seq = generate_sequence_list(X_train, Y_train, word_to_pos, tag_to_pos)
val_seq = generate_sequence_list(X_val, Y_val, word_to_pos, tag_to_pos)

In [57]:
feature_mapper = skseq.sequences.id_feature.IDFeatures(train_seq)
feature_mapper.build_features()

### Train perceptron

In [59]:
sp = spc.StructuredPerceptron(word_to_pos, tag_to_pos, feature_mapper)

In [60]:
%%time
num_epochs = 15
sp.fit(feature_mapper.dataset, num_epochs)

Epoch: 0 Accuracy: 0.781871
Epoch: 1 Accuracy: 0.835900
Epoch: 2 Accuracy: 0.867949
Epoch: 3 Accuracy: 0.883973
Epoch: 4 Accuracy: 0.909442
Epoch: 5 Accuracy: 0.920786
Epoch: 6 Accuracy: 0.920559
Epoch: 7 Accuracy: 0.936782
Epoch: 8 Accuracy: 0.942965
Epoch: 9 Accuracy: 0.946624
Epoch: 10 Accuracy: 0.947049
Epoch: 11 Accuracy: 0.947730
Epoch: 12 Accuracy: 0.951332
Epoch: 13 Accuracy: 0.956068
Epoch: 14 Accuracy: 0.957458
CPU times: user 2min 53s, sys: 513 ms, total: 2min 53s
Wall time: 2min 52s


### Make predictions

In [61]:
p = "Egypt had been asked to write Asia for Angel ."
# Placeholder labels (all 0): the true tags of this sentence are unknown.
new_seq = skseq.sequences.sequence.Sequence(x=p.split(), y=[0 for w in p.split()])
new_seq


Egypt/0 had/0 been/0 asked/0 to/0 write/0 Asia/0 for/0 Angel/0 ./0 

In [62]:
sp.viterbi_decode(new_seq)[0].to_words(train_seq,
                                       only_tag_translation=True)

'Egypt/B-geo had/O been/O asked/O to/O write/O Asia/B-geo for/O Angel/O ./O '

### Evaluate performance

In [64]:
# Make predictions for the various sequences using the trained model.
pred_train = sp.viterbi_decode_corpus(train_seq)
pred_val = sp.viterbi_decode_corpus(val_seq)

In [65]:
def evaluate_corpus(sequences, sequences_predictions):
    """Evaluate classification accuracy at corpus level, comparing with
    gold standard."""
    total = 0.0
    correct = 0.0
    for i, sequence in enumerate(sequences):
        pred = sequences_predictions[i]
        for j, y_hat in enumerate(pred.y):
            if sequence.y[j] == y_hat:
                correct += 1
            total += 1
    return correct / total

In [66]:
# Evaluate and print accuracies
eval_train = evaluate_corpus(train_seq.seq_list, pred_train)
eval_val = evaluate_corpus(val_seq.seq_list, pred_val)
print("SP -  Accuracy Train: %.3f Validation: %.3f"%(eval_train, eval_val))

SP -  Accuracy Train: 0.976 Validation: 0.943


### Save the model

In [67]:
sp.save_model("perceptron_15_iter")

### Load existing model

In [68]:
sp2 = spc.StructuredPerceptron(word_to_pos, tag_to_pos, feature_mapper)
sp2.load_model(dir="perceptron_15_iter")