# NLP: Delivery 2, Train Models

## Montse Comas, Blai Ras and Fritz Pere Nobbe

## Named Entity Recognition

The objective of this project is to understand in detail the structured perceptron
algorithm applied to Named Entity Recognition (NER). NER is useful in many contexts,
from information retrieval to question answering systems. The goal is not to achieve
the best possible results, but to understand every detail of a simple solution.


#### Imports <a id='imports'></a>

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

#Vector operations and data management
import pandas as pd
import scipy
import numpy as np

#Folder management
import os,sys,inspect

#Model and other data type saving
import pickle

#Printing styling
import pprint
from IPython.display import display, HTML

#Plot management
import seaborn as sns
import matplotlib.pyplot as plt

#Skseq
import skseq
from skseq.sequences import sequence
from skseq.sequences.sequence import Sequence
from skseq.sequences.sequence_list import SequenceList
from skseq.sequences.label_dictionary import LabelDictionary
import skseq.sequences.structured_perceptron as spc
from skseq.sequences import extended_feature

#Metrics
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

import utils.utils as utils

#### Saving helpers

In [2]:
def save(name,obj):
    #Pickle obj into the file at path name
    with open(name, "wb") as f:
        pickle.dump(obj, f)
def load(name):
    #Return the object pickled in the file at path name
    with open(name, 'rb') as f:
        return pickle.load(f)
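
As a quick sanity check of the helpers above, the following sketch pickles a small list and reads it back. The toy data and the temporary file name are made up for illustration; only `pickle`, `os` and `tempfile` from the standard library are used.

```python
import os
import pickle
import tempfile


def save(name, obj):
    """Pickle `obj` into the file at path `name`."""
    with open(name, "wb") as f:
        pickle.dump(obj, f)


def load(name):
    """Return the object pickled in the file at path `name`."""
    with open(name, "rb") as f:
        return pickle.load(f)


# Toy data standing in for X_train (hypothetical values).
toy_sentences = [["Hello", "world"], ["Honiara", "is", "quiet"]]

# Write to a temporary directory so nothing is left behind on disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_sentences.pkl")
    save(path, toy_sentences)
    restored = load(path)

print(restored == toy_sentences)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.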

In [3]:
train = pd.read_csv("data/train_data_ner.csv")
test = pd.read_csv("data/test_data_ner.csv")
TINY_TEST,y_true_tiny = utils.get_tiny_test()


# Train/Test set up <a id='train/test'></a>



In [4]:
#Load the datasets if you don't want to wait 5 minutes

X_train = load('fitted_models/X_train.pkl')
y_train = load('fitted_models/y_train.pkl')
X_test = load('fitted_models/X_test.pkl')
y_test = load('fitted_models/y_test.pkl')

Building the datasets from scratch takes about 5 minutes:

In [5]:
#Train vectors (list of lists)
X_train = [] #Contains the sentences (words)
y_train = [] #Contains the tags (aka target)

#The column "sentence_id" is not sequential, so we collect the unique ids and iterate over them
valid_ids_train = train.sentence_id.unique()

for sentence in valid_ids_train:
    X_train.append(list(train[train["sentence_id"]==sentence]["words"].values))
    y_train.append(list(train[train["sentence_id"]==sentence]["tags"].values))

In [6]:
#Test vectors (list of lists)
X_test = [] #Contains the sentences (words)
y_test = [] #Contains the tags (aka target)

#The column "sentence_id" is not sequential, so we collect the unique ids and iterate over them
valid_ids_test = test.sentence_id.unique()

for sentence in valid_ids_test:
    X_test.append(list(test[test["sentence_id"]==sentence]["words"].values))
    y_test.append(list(test[test["sentence_id"]==sentence]["tags"].values))
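
The loops above filter the whole DataFrame once per sentence id, which is likely why this step takes minutes. A possible alternative, sketched here on a made-up toy frame with the same three column names, is a single `groupby` pass. This is only a sketch and has not been run on the real data.

```python
import pandas as pd

# Toy frame with the same three columns the notebook reads (values are made up).
toy = pd.DataFrame({
    "sentence_id": [7, 7, 3, 3, 3],
    "words": ["Hello", "London", "She", "left", "."],
    "tags": ["O", "B-geo", "O", "O", "O"],
})

# sort=False keeps the groups in order of first appearance, like .unique() does.
grouped = toy.groupby("sentence_id", sort=False)

# One list of words and one list of tags per sentence.
X_toy = grouped["words"].apply(list).tolist()
y_toy = grouped["tags"].apply(list).tolist()

print(X_toy)
print(y_toy)
```

Grouping touches every row once, whereas the boolean-mask approach scans the full frame for each sentence.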

In [10]:
#Saving the datasets:

# with open("fitted_models/X_train.pkl", "wb") as f:
#     pickle.dump(X_train, f)
# with open("fitted_models/y_train.pkl", "wb") as f:
#     pickle.dump(y_train, f)
# with open("fitted_models/X_test.pkl", "wb") as f:
#     pickle.dump(X_test, f)
# with open("fitted_models/y_test.pkl", "wb") as f:
#     pickle.dump(y_test, f)

In [5]:
#Example of a train & test set sentence/tags combination

i = 1595

for X_word, y_tag in zip(X_train[i],y_train[i]):
    print(X_word+"/"+y_tag,end=" ")
print("\n")
for X_word, y_tag in zip(X_test[i],y_test[i]):
    print(X_word+"/"+y_tag,end=" ")

Honiara/B-geo is/O reported/O to/O be/O quiet/O Saturday/B-tim ,/O with/O Australian/B-gpe troops/O patrolling/O the/O streets/O ./O 

At/O the/O end/B-tim of/I-tim three/I-tim months/O ,/O the/O Mind/O Reader/O lost/O his/O money/O ./O 
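
The tags printed above follow a B-/I-/O scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. To make that reading concrete, here is a small sketch that groups tagged tokens into entity spans. The helper name `bio_to_spans` is hypothetical and is not part of `skseq` or `utils`.

```python
def bio_to_spans(words, tags):
    """Group BIO tags into (entity_type, text) spans.

    An I- tag that does not continue a span of the same type starts a new span.
    """
    spans = []
    current_type, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, ent_type = tag.partition("-")
        continues = prefix == "I" and ent_type == current_type
        # Close the open span when the current token does not extend it.
        if current_type is not None and not continues:
            spans.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        if prefix == "B" or (prefix == "I" and not continues):
            current_type, current_words = ent_type, [word]
        elif continues:
            current_words.append(word)
    # Flush a span that runs to the end of the sentence.
    if current_type is not None:
        spans.append((current_type, " ".join(current_words)))
    return spans


# Opening tokens of the test sentence printed above.
words = ["At", "the", "end", "of", "three", "months"]
tags = ["O", "O", "B-tim", "I-tim", "I-tim", "O"]
print(bio_to_spans(words, tags))
```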

# Corpus Creation <a id=corpus></a>

The function below creates two dictionaries:

* word_pos_dict: maps every unique word of the training set (key) to a unique integer index (value), e.g., {'David': 0, 'Hello': 1}

* tag_pos_dict: maps every unique tag of the training set (key) to a unique integer index (value), e.g., {'O': 0, 'B-geo': 1}

It runs in less than a second.

In [6]:
def corpus(X_train, y_train):
    """Index every unique word and tag by order of first appearance."""
    #X_train word:position
    i = 0
    word_pos_dict = {}
    for sentence in X_train:
        for word in sentence:
            if word not in word_pos_dict:
                word_pos_dict[word] = i
                i+=1
    #y_train tag:position
    i = 0
    tag_pos_dict = {}
    for sentence in y_train:
        for tag in sentence:
            if tag not in tag_pos_dict:
                tag_pos_dict[tag] = i
                i +=1

    return word_pos_dict, tag_pos_dict

In [7]:
corpus_word_dict, corpus_tag_dict = corpus(X_train, y_train)

In [8]:
corpus_tag_dict

{'O': 0,
 'B-geo': 1,
 'B-gpe': 2,
 'B-tim': 3,
 'B-org': 4,
 'I-geo': 5,
 'B-per': 6,
 'I-per': 7,
 'I-org': 8,
 'B-art': 9,
 'I-art': 10,
 'I-tim': 11,
 'I-gpe': 12,
 'B-nat': 13,
 'I-nat': 14,
 'B-eve': 15,
 'I-eve': 16}
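
The indices above reflect the order in which each tag first appears in the training data. The following toy sketch reproduces that numbering rule on made-up sentences. `build_index` is a hypothetical helper that applies the same rule as `corpus`, written with `dict.setdefault`.

```python
def build_index(sequences):
    """Map each distinct item to the order in which it first appears."""
    index = {}
    for sequence in sequences:
        for item in sequence:
            # setdefault only inserts when the key is new, so len(index)
            # is always the next unused integer.
            index.setdefault(item, len(index))
    return index


# Made-up toy data with the same shape as X_train / y_train.
X_toy = [["Hello", "London"], ["Hello", "Paris"]]
y_toy = [["O", "B-geo"], ["O", "B-geo"]]

word_index = build_index(X_toy)
tag_index = build_index(y_toy)
print(word_index)
print(tag_index)
```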

# Training sequence creation

We proceed to create the sequence list from the training corpus.

Building it takes about 10 minutes, so the result has been stored in a pickle:

In [9]:
train_seq = load("fitted_models/train_seq.pkl")

In [None]:
#Building the sequence list from scratch (slow; not needed if the pickle above was loaded)
train_seq = SequenceList(LabelDictionary(corpus_word_dict), LabelDictionary(corpus_tag_dict))
for words,tags in zip(X_train,y_train):
    train_seq.add_seq_cython(words,tags, LabelDictionary(corpus_word_dict), LabelDictionary(corpus_tag_dict))

In [10]:
print(train_seq.__dict__.keys())
print(type(train_seq))

dict_keys(['x_dict', 'y_dict', 'seq_list'])
<class 'skseq.sequences.sequence_list.SequenceList'>


In [11]:
#Saving the sequence

# save("fitted_models/train_seq.pkl", train_seq)

In [12]:
print(train_seq[1595],"\n")
print(train_seq[1595].to_words(sequence_list = train_seq))

6249/1 184/0 1636/0 7/0 543/0 4694/0 24/3 31/0 54/0 66/2 17/0 6257/0 9/0 1759/0 21/0  

Honiara/B-geo is/O reported/O to/O be/O quiet/O Saturday/B-tim ,/O with/O Australian/B-gpe troops/O patrolling/O the/O streets/O ./O 
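
The first printout shows each token as `word_id/tag_id`, and the second maps the ids back to strings. The sketch below mimics that round trip with plain dictionaries. It is not the `skseq` implementation, and the tiny vocabularies are made up for illustration.

```python
# Made-up vocabularies; the real ones are corpus_word_dict and corpus_tag_dict.
word_to_id = {"Honiara": 0, "is": 1, "quiet": 2}
tag_to_id = {"O": 0, "B-geo": 1}

# Inverse lookups, needed to turn ids back into strings.
id_to_word = {i: w for w, i in word_to_id.items()}
id_to_tag = {i: t for t, i in tag_to_id.items()}


def encode(words, tags):
    """Replace every word and tag by its integer id."""
    return [(word_to_id[w], tag_to_id[t]) for w, t in zip(words, tags)]


def show_ids(pairs):
    """Format encoded pairs as 'word_id/tag_id'."""
    return " ".join(f"{w}/{t}" for w, t in pairs)


def show_words(pairs):
    """Format encoded pairs as 'word/tag' using the inverse lookups."""
    return " ".join(f"{id_to_word[w]}/{id_to_tag[t]}" for w, t in pairs)


pairs = encode(["Honiara", "is", "quiet"], ["B-geo", "O", "O"])
print(show_ids(pairs))
print(show_words(pairs))
```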


# SP training using default features

We proceed to create the feature mapper using only the given default features.

In [17]:
feature_mapper = skseq.sequences.id_feature.IDFeatures(train_seq)
# get features
feature_mapper.build_features()
pprint.pprint(list(feature_mapper.__dict__.keys()))

['feature_dict',
 'feature_list',
 'add_features',
 'dataset',
 'node_feature_cache',
 'initial_state_feature_cache',
 'final_state_feature_cache',
 'edge_feature_cache']


In [18]:
list(feature_mapper.feature_dict)[0:10]

['init_tag:O',
 'id:Thousands::O',
 'id:of::O',
 'prev_tag:O::O',
 'id:demonstrators::O',
 'id:have::O',
 'id:marched::O',
 'id:through::O',
 'id:London::B-geo',
 'prev_tag:O::B-geo']
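
The names listed above suggest three templates: `init_tag:<tag>` for the first tag, `id:<word>::<tag>` for each word/tag pair, and `prev_tag:<previous tag>::<tag>` for each transition. The sketch below rebuilds such names for one sentence and numbers them by first appearance. It is a guess at the naming scheme based only on this output, not the `IDFeatures` code. The final-state name `final_prev_tag:<tag>` is an assumption, since it does not appear in the listing.

```python
def id_feature_names(words, tags):
    """Return the feature names fired by one tagged sentence, in order."""
    names = [f"init_tag:{tags[0]}"]
    for pos, (word, tag) in enumerate(zip(words, tags)):
        names.append(f"id:{word}::{tag}")                      # emission
        if pos > 0:
            names.append(f"prev_tag:{tags[pos - 1]}::{tag}")   # transition
    names.append(f"final_prev_tag:{tags[-1]}")  # assumed name, not shown above
    return names


def build_feature_dict(sentences):
    """Number every distinct feature name by order of first appearance."""
    feature_dict = {}
    for words, tags in sentences:
        for name in id_feature_names(words, tags):
            feature_dict.setdefault(name, len(feature_dict))
    return feature_dict


# Sentence reconstructed from the feature names listed above.
words = ["Thousands", "of", "demonstrators", "have", "marched", "through", "London"]
tags = ["O", "O", "O", "O", "O", "O", "B-geo"]

toy_feature_dict = build_feature_dict([(words, tags)])
print(list(toy_feature_dict)[0:10])
```

Under these assumptions the first ten names come out in the same order as in the listing above, with `prev_tag:O::O` at index 3.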

In [19]:
id_seq = 1

print("Initial features:",     feature_mapper.feature_list[id_seq][0])
print("Transition features:",  feature_mapper.feature_list[id_seq][1])
print("Final features:",       feature_mapper.feature_list[id_seq][2])
print("Emission features:",    feature_mapper.feature_list[id_seq][3])

Initial features: [[0]]
Transition features: [[3], [32], [34], [3], [3], [3], [3], [9], [11], [3], [3], [3], [3], [44], [46], [3], [3], [3], [3], [3], [3], [3], [3], [3], [3], [3], [3], [3], [9], [58], [59]]
Final features: [[28]]
Emission features: [[29], [30], [31], [33], [35], [36], [15], [13], [37], [38], [39], [40], [41], [42], [43], [45], [47], [48], [10], [5], [49], [10], [50], [51], [52], [53], [54], [15], [55], [56], [57], [27]]


We create the perceptron from the word dictionary, the tag dictionary and the feature mapper:

In [23]:
sp = spc.StructuredPerceptron(corpus_word_dict, corpus_tag_dict, feature_mapper)
sp.num_epochs = 5

In [24]:
sp.get_num_states(), sp.get_num_observations()

(17, 31979)

In [86]:
%%time
num_epochs = 15
sp.fit(feature_mapper.dataset, num_epochs)

Epoch: 0 Accuracy: 0.893815
Epoch: 1 Accuracy: 0.931674
Epoch: 2 Accuracy: 0.940913
Epoch: 3 Accuracy: 0.946175
Epoch: 4 Accuracy: 0.950018
Epoch: 5 Accuracy: 0.952577
Epoch: 6 Accuracy: 0.954425
Epoch: 7 Accuracy: 0.956033
Epoch: 8 Accuracy: 0.957185
Epoch: 9 Accuracy: 0.958481
Epoch: 10 Accuracy: 0.959217
Epoch: 11 Accuracy: 0.960524
Epoch: 12 Accuracy: 0.961121
Epoch: 13 Accuracy: 0.961207
Epoch: 14 Accuracy: 0.961983
Wall time: 1h 19min 23s
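
To make the training loop less of a black box, here is a minimal structured perceptron on toy data: decode each sentence with Viterbi under the current weights, and on a mistake add the gold features and subtract the predicted ones. This is a simplified sketch, not the `skseq` implementation: it uses string-keyed weights, has no final-state feature, and does no weight averaging. All names and data in it are made up for illustration.

```python
from collections import defaultdict


def sequence_features(words, tags):
    """Feature names for one tagged sentence (simplified ID-style templates)."""
    feats = [f"init_tag:{tags[0]}"]
    for pos, (word, tag) in enumerate(zip(words, tags)):
        feats.append(f"id:{word}::{tag}")
        if pos > 0:
            feats.append(f"prev_tag:{tags[pos - 1]}::{tag}")
    return feats


def viterbi(words, tagset, weights):
    """Return the highest-scoring tag sequence under the current weights."""
    # best[pos][tag] = (score of the best path ending in tag, previous tag)
    best = [{t: (weights[f"init_tag:{t}"] + weights[f"id:{words[0]}::{t}"], None)
             for t in tagset}]
    for pos in range(1, len(words)):
        column = {}
        for t in tagset:
            prev = max(tagset,
                       key=lambda p: best[pos - 1][p][0] + weights[f"prev_tag:{p}::{t}"])
            score = (best[pos - 1][prev][0] + weights[f"prev_tag:{prev}::{t}"]
                     + weights[f"id:{words[pos]}::{t}"])
            column[t] = (score, prev)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tagset, key=lambda t: best[-1][t][0])
    path = [tag]
    for pos in range(len(words) - 1, 0, -1):
        tag = best[pos][tag][1]
        path.append(tag)
    return path[::-1]


def train(data, tagset, num_epochs):
    """Plain (non-averaged) structured perceptron training."""
    weights = defaultdict(float)
    for epoch in range(num_epochs):
        mistakes = total = 0
        for words, gold in data:
            pred = viterbi(words, tagset, weights)
            if pred != gold:
                for f in sequence_features(words, gold):
                    weights[f] += 1.0   # reward the gold structure
                for f in sequence_features(words, pred):
                    weights[f] -= 1.0   # penalise the wrong prediction
            mistakes += sum(p != g for p, g in zip(pred, gold))
            total += len(gold)
        print(f"Epoch: {epoch} Accuracy: {1 - mistakes / total:f}")
    return weights


# Made-up toy corpus.
tagset = ["O", "B-geo"]
data = [
    (["London", "is", "big"], ["B-geo", "O", "O"]),
    (["She", "left", "London"], ["O", "O", "B-geo"]),
]
weights = train(data, tagset, num_epochs=3)
print(viterbi(["She", "left", "London"], tagset, weights))
```

The accuracy printed per epoch in this sketch is measured per token on the training data while the weights are still being updated, so it is not an estimate of performance on unseen text.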


In [87]:
#Model saving

# sp.save_model("fitted_models/01_Default_Features")

In [28]:
#Model loading
sp_default = spc.StructuredPerceptron(corpus_word_dict, corpus_tag_dict, feature_mapper)
sp_default.load_model(dir="fitted_models/01_Default_Features")
sp_default.parameters

39802

# SP training using added, custom features

Feature mapper creation:

In [13]:
added_feature_mapper = skseq.sequences.extended_feature.ExtendedFeatures(train_seq) 
# get features
added_feature_mapper.build_features()
pprint.pprint(list(added_feature_mapper.__dict__.keys()))

['feature_dict',
 'feature_list',
 'add_features',
 'dataset',
 'node_feature_cache',
 'initial_state_feature_cache',
 'final_state_feature_cache',
 'edge_feature_cache']


In order to see the new features, we group them in the following categories:

In [14]:
added = ["capi","point","ending","prep"]

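We can then list them. The match is a plain substring test, so word-identity features such as 'id:capital::O' also show up under "capi":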

In [15]:
for index,feat_type in enumerate(added):
    print(str(index)+". "+feat_type,end="\n")
    print([feature for feature in list(added_feature_mapper.feature_dict.keys()) if feat_type in feature],end="\n\n")

0. capi
['capi_ini::O', 'capi_ini::B-geo', 'capi_ini::B-gpe', 'capi_ini::B-tim', 'capi_ini::B-org', 'capi_ini::I-geo', 'capi_any::B-geo', 'capi_ini::B-per', 'capi_ini::I-per', 'capi_any::O', 'capi_ini::I-org', 'capi_ini::B-art', 'capi_any::B-art', 'capi_ini::I-art', 'id:capital::O', 'capi_any::B-org', 'capi_any::I-tim', 'capi_any::I-org', 'capi_any::B-tim', 'capi_ini::I-tim', 'id:capita::O', 'capi_any::B-per', 'capi_ini::I-gpe', 'capi_any::B-gpe', 'capi_any::I-per', 'capi_ini::B-nat', 'capi_ini::I-nat', 'capi_ini::B-eve', 'capi_ini::I-eve', 'capi_any::B-nat', 'capi_any::I-art', 'capi_any::I-geo', 'capi_any::B-eve', 'id:capitals::O', 'id:capitalize::O', 'id:decapitated::O', 'capi_any::I-eve', 'id:capitalist::O', 'id:capital-intensive::O', 'id:landscaping::O', 'id:per-capita::O', 'id:capitalization::O', 'id:capitalized::O', 'id:escaping::O', 'capi_any::I-gpe', 'id:capitalism::O', 'capi_any::I-nat', 'id:anti-capitalist::O', 'id:capitol::O']

1. point
['inside_point::O', 'inside_point::B-g
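
The implementation of `ExtendedFeatures` is not shown in this notebook, so the following is only a guess at what such features might test, based on the names `capi_ini`, `capi_any` and `inside_point` in the output and on the category labels "ending" and "prep". The exact conditions, the `ending_...` and `prep` feature names, the preposition list and the suffix length are all assumptions.

```python
# Hypothetical preposition list and suffix length; the notebook does not show them.
PREPOSITIONS = {"in", "on", "at", "of", "to", "from", "with", "by"}
SUFFIX_LENGTH = 3


def extra_feature_names(word, tag):
    """Guess at extra emission features for one word/tag pair."""
    names = []
    if word[:1].isupper():
        names.append(f"capi_ini::{tag}")        # starts with a capital letter
    if any(ch.isupper() for ch in word):
        names.append(f"capi_any::{tag}")        # contains a capital anywhere
    if "." in word[1:-1]:
        names.append(f"inside_point::{tag}")    # a dot strictly inside the word
    if len(word) > SUFFIX_LENGTH:
        # Assumed name for a suffix feature.
        names.append(f"ending_{word[-SUFFIX_LENGTH:].lower()}::{tag}")
    if word.lower() in PREPOSITIONS:
        names.append(f"prep::{tag}")            # assumed name for a preposition feature
    return names


print(extra_feature_names("London", "B-geo"))
print(extra_feature_names("U.S.", "B-geo"))
print(extra_feature_names("of", "O"))
```

Features like these fire for words never seen in training, which plain `id:<word>::<tag>` features cannot do.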

Model creation:

In [16]:
sp_added = spc.StructuredPerceptron(corpus_word_dict, corpus_tag_dict, added_feature_mapper)
sp_added.num_epochs = 5

In [17]:
sp_added.get_num_states(), sp_added.get_num_observations()

(17, 31979)

In [18]:
%%time
num_epochs = 15
sp_added.fit(added_feature_mapper.dataset, num_epochs)

Epoch: 0 Accuracy: 0.932321
Epoch: 1 Accuracy: 0.946742
Epoch: 2 Accuracy: 0.950434
Epoch: 3 Accuracy: 0.953119
Epoch: 4 Accuracy: 0.955183
Epoch: 5 Accuracy: 0.956330
Epoch: 6 Accuracy: 0.957800
Epoch: 7 Accuracy: 0.958352
Epoch: 8 Accuracy: 0.959670
Epoch: 9 Accuracy: 0.960568
Epoch: 10 Accuracy: 0.960595
Epoch: 11 Accuracy: 0.961291
Epoch: 12 Accuracy: 0.961944
Epoch: 13 Accuracy: 0.962345
Epoch: 14 Accuracy: 0.962734
CPU times: user 1h 48s, sys: 10.1 s, total: 1h 58s
Wall time: 1h 1min 51s


In [19]:
#Model saving

sp_added.save_model("fitted_models/02_Added_Features.pkl")

In [20]:
#Model loading
sp_added = spc.StructuredPerceptron(corpus_word_dict, corpus_tag_dict, added_feature_mapper)
sp_added.load_model(dir="fitted_models/02_Added_Features.pkl")
sp_added.parameters

array([ 20.066667,  11.533333, -17.6     , ..., -11.933333,   0.733333,
         0.      ])