# Neural Network for Spanish Named Entity Recognition  

This Jupyter Notebook is based on **Kamal Raj**'s NER with Bidirectional LSTM-CNNs implementation, available on GitHub: https://github.com/kamalkraj/Named-Entity-Recognition-with-Bidirectional-LSTM-CNNs


**Version: -v_1.2-**

Release notes:

    - Spanish word embeddings are used: 
        GloVe embeddings from SBWC; #dimensions=300, #vectors=855380.
    - The preprocessing of the input data for prediction was modified and improved. The model can now predict the I-(PER/LOC/ORG/MISC) tags.
    For 50 epochs:
    - Accuracy: ~80 
    - No Spanish-specific processing is applied; it was found to be detrimental when combined with the embeddings.
    - Time: approx. 38 min.

    For 100 epochs: 
    - Accuracy: ~84
    - No Spanish-specific processing is applied; it was found to be detrimental when combined with the embeddings.
    - Time: approx. 1 hr 20 min.
   
 

Training was performed on:

    DESKTOP-0UQLV13
    Processor: Intel Core i7-6700HQ CPU 2.6GHz 
    RAM: 16GB
    OS: Windows 10 Home Single x64
    Storage type: SSD

    

Requires:

    unidecode
    numpy (pip install --upgrade numpy)
    nltk (pip install --upgrade nltk)
    * Download the NLTK punkt and stopwords data:
    * >> import nltk 
    * >> nltk.download('stopwords')
    * >> nltk.download('punkt')
    * For more information: https://www.nltk.org/data.html 
    random (Python standard library)
    tensorflow 1.13.1 (pip install --upgrade tensorflow) *As of 10 April 2019 it does not work with Python 3.7.
    * For more information: https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class01_intro_python.ipynb 
    keras (pip install --upgrade keras) 








The NER task can be formulated as: 

_Given a sequence of tokens (words, and possibly punctuation symbols), provide a tag from a predefined set of tags for each token in the sequence._

For the NER task, some common entity types, which essentially serve as the tags, are:
- Persons
- Locations
- Organizations
- Expressions of time
- Quantities
- Monetary values 

Furthermore, to distinguish consecutive entities with the same tag, the BIO tagging scheme is used. "B" stands for the beginning of an entity, 
"I" stands for the continuation of an entity, and "O" means the absence of an entity. Example with punctuation dropped:

    Bernhard        B-PER
    Riemann         I-PER
    Carl            B-PER
    Friedrich       I-PER
    Gauss           I-PER
    and             O
    Leonhard        B-PER
    Euler           I-PER

In the example above PER means person tag, and "B-" and "I-" are prefixes identifying beginnings and continuations of the entities. Without such prefixes, it is impossible to separate Bernhard Riemann from Carl Friedrich Gauss.
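
To make the role of the prefixes concrete, here is a minimal sketch that groups BIO tags back into entity spans. The helper name `bio_to_spans` is hypothetical and not part of the notebook; it treats an `I-` tag whose type differs from the current entity as the start of a new entity, which is one common convention among several.

```python
def bio_to_spans(tokens, tags):
    """Group (token, tag) pairs into (entity_type, text) spans. Hypothetical helper."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        starts_new = prefix == "B" or (prefix == "I" and etype != current_type)
        if prefix == "O" or starts_new:
            # Close the entity that was being collected, if any.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix in ("B", "I"):
            current_type = etype
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Bernhard", "Riemann", "Carl", "Friedrich", "Gauss", "and", "Leonhard", "Euler"]
tags = ["B-PER", "I-PER", "B-PER", "I-PER", "I-PER", "O", "B-PER", "I-PER"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

With the `B-` prefix on `Carl`, the first two names come out as separate `PER` spans; if every token were tagged plain `PER`, the five leading tokens would be indistinguishable from a single entity.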




In [1]:
# NumPy for numerical arrays
# Keras for the model and the neural network layers
import numpy as np
from keras.models import Model
from keras.layers import TimeDistributed,Conv1D,Dense,Embedding,Input,Dropout,LSTM,Bidirectional,MaxPooling1D,Flatten,concatenate
from keras.utils import Progbar
from keras.preprocessing.sequence import pad_sequences
from keras.initializers import RandomUniform
import unidecode
import string 

Using TensorFlow backend.


In [2]:

# Read a CoNLL-style text file and split it into sentences, each a list of [word, tag] pairs.
def readfile(filename):
    '''
    read file
    return format (one such list per sentence):
    [ ['EU', 'B-ORG'], ['rejects', 'O'], ['German', 'B-MISC'], ['call', 'O'], ['to', 'O'], ['boycott', 'O'], ['British', 'B-MISC'], ['lamb', 'O'], ['.', 'O'] ]
    '''
    sentences = []
    sentence = []
    # 'utf-8-sig' strips the byte order mark ('ï»¿') at the start of the file; `with` closes the file when done
    with open(filename, encoding='utf-8-sig') as f:
        for line in f:
            # A blank line or a -DOCSTART- marker ends the current sentence
            if len(line)==0 or line.startswith('-DOCSTART') or line[0]=="\n":
                if len(sentence) > 0:     
                    sentences.append(sentence)
                    sentence = []
                continue
            splits = line.split(' ')
            #splits[0] = unidecode.unidecode(splits[0]) # Remove special characters from Spanish
            #splits[0] = splits[0].lower() # Lowercase the words 
            #splits[0] = splits[0].translate(str.maketrans('', '', string.punctuation)) # Remove punctuation 
            splits[-1] = splits[-1].replace('\n', '').replace('\r', '') # Remove line breaks from the tag
            if splits[0] != '':
                sentence.append([splits[0],splits[-1]])

    # Keep the last sentence when the file does not end with a blank line
    if len(sentence) >0: 
        sentences.append(sentence)
        sentence = []
    return sentences

In [3]:
# Read the 3 sets ************************************************* PATH ************************************************
# CoNLL 2002 dataset for Spanish, which is divided into train, test and valid (dev) sets. Each row contains a word and its tag
# https://github.com/teropa/nlp/tree/master/resources/corpora/conll2002 
trainSentences = readfile("tidy_data/train.txt")
devSentences = readfile("tidy_data/valid.txt")
testSentences = readfile("tidy_data/test.txt")

In [4]:
print(len(trainSentences))

8323


In [5]:
trainSentences[0]

[['Melbourne', 'B-LOC'],
 ['(', 'O'],
 ['Australia', 'B-LOC'],
 [')', 'O'],
 [',', 'O'],
 ['25', 'O'],
 ['may', 'O'],
 ['(', 'O'],
 ['EFE', 'B-ORG'],
 [')', 'O'],
 ['.', 'O']]

In [6]:
devSentences[0]

[['Sao', 'B-LOC'],
 ['Paulo', 'I-LOC'],
 ['(', 'O'],
 ['Brasil', 'B-LOC'],
 [')', 'O'],
 [',', 'O'],
 ['23', 'O'],
 ['may', 'O'],
 ['(', 'O'],
 ['EFECOM', 'B-ORG'],
 [')', 'O'],
 ['.', 'O']]

In [7]:
testSentences[0]

[['La', 'B-LOC'],
 ['Coruña', 'I-LOC'],
 [',', 'O'],
 ['23', 'O'],
 ['may', 'O'],
 ['(', 'O'],
 ['EFECOM', 'B-ORG'],
 [')', 'O'],
 ['.', 'O']]

In [8]:
# Add the list of characters of each word: [word, tag] becomes [word, chars, tag]
def addCharInformatioin(Sentences):
    for i,sentence in enumerate(Sentences):
        for j,data in enumerate(sentence):
            chars = [c for c in data[0]]
            Sentences[i][j] = [data[0],chars,data[1]]
    return Sentences

In [9]:
trainSentences = addCharInformatioin(trainSentences)
devSentences = addCharInformatioin(devSentences)
testSentences = addCharInformatioin(testSentences)

In [10]:
trainSentences[0]

[['Melbourne', ['M', 'e', 'l', 'b', 'o', 'u', 'r', 'n', 'e'], 'B-LOC'],
 ['(', ['('], 'O'],
 ['Australia', ['A', 'u', 's', 't', 'r', 'a', 'l', 'i', 'a'], 'B-LOC'],
 [')', [')'], 'O'],
 [',', [','], 'O'],
 ['25', ['2', '5'], 'O'],
 ['may', ['m', 'a', 'y'], 'O'],
 ['(', ['('], 'O'],
 ['EFE', ['E', 'F', 'E'], 'B-ORG'],
 [')', [')'], 'O'],
 ['.', ['.'], 'O']]

In [11]:
devSentences[0]

[['Sao', ['S', 'a', 'o'], 'B-LOC'],
 ['Paulo', ['P', 'a', 'u', 'l', 'o'], 'I-LOC'],
 ['(', ['('], 'O'],
 ['Brasil', ['B', 'r', 'a', 's', 'i', 'l'], 'B-LOC'],
 [')', [')'], 'O'],
 [',', [','], 'O'],
 ['23', ['2', '3'], 'O'],
 ['may', ['m', 'a', 'y'], 'O'],
 ['(', ['('], 'O'],
 ['EFECOM', ['E', 'F', 'E', 'C', 'O', 'M'], 'B-ORG'],
 [')', [')'], 'O'],
 ['.', ['.'], 'O']]

In [12]:
testSentences[0]

[['La', ['L', 'a'], 'B-LOC'],
 ['Coruña', ['C', 'o', 'r', 'u', 'ñ', 'a'], 'I-LOC'],
 [',', [','], 'O'],
 ['23', ['2', '3'], 'O'],
 ['may', ['m', 'a', 'y'], 'O'],
 ['(', ['('], 'O'],
 ['EFECOM', ['E', 'F', 'E', 'C', 'O', 'M'], 'B-ORG'],
 [')', [')'], 'O'],
 ['.', ['.'], 'O']]

In [13]:
# 1. Creates the label set (set of tags)
# 2. Collects the lowercased words of the train, dev and test sets (dict keys used as a set)
labelSet = set()
words = {}

for dataset in [trainSentences, devSentences, testSentences]:
    for sentence in dataset:
        for token,char,label in sentence:
            labelSet.add(label)
            words[token.lower()] = True

In [14]:
labelSet

{'B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O'}

In [15]:
words

{'melbourne': True,
 '(': True,
 'australia': True,
 ')': True,
 ',': True,
 '25': True,
 'may': True,
 'efe': True,
 '.': True,
 '-': True,
 'el': True,
 'abogado': True,
 'general': True,
 'del': True,
 'estado': True,
 'daryl': True,
 'williams': True,
 'subrayó': True,
 'hoy': True,
 'la': True,
 'necesidad': True,
 'de': True,
 'tomar': True,
 'medidas': True,
 'para': True,
 'proteger': True,
 'al': True,
 'sistema': True,
 'judicial': True,
 'australiano': True,
 'frente': True,
 'a': True,
 'una': True,
 'página': True,
 'internet': True,
 'que': True,
 'imposibilita': True,
 'cumplimiento': True,
 'los': True,
 'principios': True,
 'básicos': True,
 'ley': True,
 'petición': True,
 'tiene': True,
 'lugar': True,
 'después': True,
 'un': True,
 'juez': True,
 'tribunal': True,
 'supremo': True,
 'victoria': True,
 'se': True,
 'viera': True,
 'forzado': True,
 'disolver': True,
 'jurado': True,
 'popular': True,
 'y': True,
 'suspender': True,
 'proceso': True,
 'ante': True,
 

In [16]:
# Gives the labels a numerical id.
# :: Create a mapping for the labels ::
label2Idx = {}
for label in labelSet:
    label2Idx[label] = len(label2Idx)

In [17]:
label2Idx

{'B-ORG': 0,
 'I-ORG': 1,
 'I-PER': 2,
 'I-LOC': 3,
 'B-LOC': 4,
 'B-MISC': 5,
 'B-PER': 6,
 'O': 7,
 'I-MISC': 8}
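
The numbering above follows the iteration order of a Python `set` of strings, which can change between interpreter runs because string hashing is randomized by default. The saved `idx2Label` file therefore has to come from the same run that trained the model. As a sketch of an alternative (not what the notebook does), sorting the labels before numbering them gives the same mapping on every run:

```python
label_set = {'B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O'}

# Sorting before enumerating makes the ids independent of set iteration order.
label2idx_sorted = {label: idx for idx, label in enumerate(sorted(label_set))}
idx2label_sorted = {idx: label for label, idx in label2idx_sorted.items()}

print(label2idx_sorted)
```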

In [18]:
# Look up table
# :: Hard coded case lookup ::
case2Idx = {'numeric': 0, 'allLower':1, 'allUpper':2, 'initialUpper':3, 'other':4, 'mainly_numeric':5, 'contains_digit': 6, 'PADDING_TOKEN':7}
caseEmbeddings = np.identity(len(case2Idx), dtype='float32')

In [19]:
case2Idx

{'numeric': 0,
 'allLower': 1,
 'allUpper': 2,
 'initialUpper': 3,
 'other': 4,
 'mainly_numeric': 5,
 'contains_digit': 6,
 'PADDING_TOKEN': 7}

In [20]:
caseEmbeddings

array([[1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)

In [21]:
# :: Read in word embeddings ::
word2Idx = {}
wordEmbeddings = []
# *********************************************************************************************** PATH *************************************************
# GloVe embeddings from SBWC
# https://github.com/uchile-nlp/spanish-word-embeddings

In [22]:
# Build wordEmbeddings from the embeddings file, keeping only the words that also appear in the dataset vocabulary (`words`).
# Note: each word is represented as a vector.
with open("word_embeddings/SBW-vectors-300-min5.txt", encoding="utf-8") as fEmbeddings:
    next(fEmbeddings)  # skip the first line (header)
    for line in fEmbeddings:
        split = line.strip().split(' ')
        word = split[0]

        if len(word2Idx) == 0: # Add the padding and unknown tokens first
            word2Idx["PADDING_TOKEN"] = len(word2Idx)
            vector = np.zeros(len(split)-1) # Zero vector for the 'PADDING' word
            wordEmbeddings.append(vector)

            word2Idx["UNKNOWN_TOKEN"] = len(word2Idx)
            vector = np.random.uniform(-0.25, 0.25, len(split)-1)
            wordEmbeddings.append(vector)

        if split[0].lower() in words:
            vector = np.array([float(num) for num in split[1:]])
            wordEmbeddings.append(vector)
            word2Idx[split[0]] = len(word2Idx)

    wordEmbeddings = np.array(wordEmbeddings)
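
The loop above keeps a vector whenever the lowercased word occurs in the dataset vocabulary, but stores it under its original casing, so `El` and `el` can both end up in `word2Idx`. A minimal, self-contained sketch of that filtering on a made-up three-line embedding text (the words and 3-dimensional vectors are invented for illustration):

```python
# Invented stand-in for a text embedding file: "word v1 v2 v3" per line.
fake_embedding_lines = [
    "el 0.1 0.2 0.3",
    "El 0.4 0.5 0.6",
    "perro 0.7 0.8 0.9",   # not in the vocabulary below, so it is skipped
]
vocabulary = {"el": True}  # lowercased dataset words, as in `words`

word_to_index = {"PADDING_TOKEN": 0, "UNKNOWN_TOKEN": 1}
vectors = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # placeholders for the two special tokens

for line in fake_embedding_lines:
    parts = line.strip().split(" ")
    word, values = parts[0], [float(v) for v in parts[1:]]
    if word.lower() in vocabulary:
        # The index of a word equals the row of its vector.
        word_to_index[word] = len(word_to_index)
        vectors.append(values)

print(word_to_index)
```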

In [23]:
wordEmbeddings

array([[ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [-0.12417509,  0.08496462, -0.06926975, ..., -0.05228813,
         0.0231604 , -0.18764412],
       [-0.029648  ,  0.011336  ,  0.019949  , ..., -0.128057  ,
        -0.004917  ,  0.062628  ],
       ...,
       [-0.005399  , -0.018904  ,  0.00199   , ..., -0.023052  ,
         0.038082  ,  0.024057  ],
       [-0.010136  , -0.020782  , -0.041017  , ..., -0.079891  ,
        -0.037598  ,  0.017487  ],
       [ 0.095675  , -0.068076  , -0.067965  , ..., -0.012334  ,
         0.012556  ,  0.001487  ]])

In [24]:
wordEmbeddings.shape[0]

54819

In [25]:
wordEmbeddings.shape[1]

300

In [26]:
char2Idx = {"PADDING":0, "UNKNOWN":1}
for c in " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZáéíóúñäëïöüÁÉÍÓÚÄËÏÖÜÃÂñÑàèìòùÀÈÌÒÙ.,-_()[]{}¡!¿?:;#'\"/\\%$`&=*+@^~|‘´·»³©\xadº±¼\xa0":
    char2Idx[c] = len(char2Idx)

In [27]:
# characters and their index (value)
char2Idx 

{'PADDING': 0,
 'UNKNOWN': 1,
 ' ': 2,
 '0': 3,
 '1': 4,
 '2': 5,
 '3': 6,
 '4': 7,
 '5': 8,
 '6': 9,
 '7': 10,
 '8': 11,
 '9': 12,
 'a': 13,
 'b': 14,
 'c': 15,
 'd': 16,
 'e': 17,
 'f': 18,
 'g': 19,
 'h': 20,
 'i': 21,
 'j': 22,
 'k': 23,
 'l': 24,
 'm': 25,
 'n': 26,
 'o': 27,
 'p': 28,
 'q': 29,
 'r': 30,
 's': 31,
 't': 32,
 'u': 33,
 'v': 34,
 'w': 35,
 'x': 36,
 'y': 37,
 'z': 38,
 'A': 39,
 'B': 40,
 'C': 41,
 'D': 42,
 'E': 43,
 'F': 44,
 'G': 45,
 'H': 46,
 'I': 47,
 'J': 48,
 'K': 49,
 'L': 50,
 'M': 51,
 'N': 52,
 'O': 53,
 'P': 54,
 'Q': 55,
 'R': 56,
 'S': 57,
 'T': 58,
 'U': 59,
 'V': 60,
 'W': 61,
 'X': 62,
 'Y': 63,
 'Z': 64,
 'á': 65,
 'é': 66,
 'í': 67,
 'ó': 68,
 'ú': 69,
 'ñ': 88,
 'ä': 71,
 'ë': 72,
 'ï': 73,
 'ö': 74,
 'ü': 75,
 'Á': 76,
 'É': 77,
 'Í': 78,
 'Ó': 79,
 'Ú': 80,
 'Ä': 81,
 'Ë': 82,
 'Ï': 83,
 'Ö': 84,
 'Ü': 85,
 'Ã': 86,
 'Â': 87,
 'Ñ': 88,
 'à': 89,
 'è': 90,
 'ì': 91,
 'ò': 92,
 'ù': 93,
 'À': 94,
 'È': 95,
 'Ì': 96,
 'Ò': 97,
 'Ù': 98,
 '.': 99
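
In the listing above `ñ` and `Ñ` share index 88 and index 70 is never used. This follows from `ñ` appearing twice in the alphabet string: the second occurrence reassigns `ñ` to `len(char2Idx)` without adding a new key, so the next new character receives the same number. The sketch below reproduces the effect on a shortened, made-up alphabet and shows one possible way to avoid it, by skipping characters already seen. It is an illustration only, not a change the notebook makes.

```python
alphabet = "abñcñÑd"  # shortened example; 'ñ' is deliberately repeated

# Same construction as in the notebook: a repeated character is re-assigned.
naive = {"PADDING": 0, "UNKNOWN": 1}
for c in alphabet:
    naive[c] = len(naive)

# Variant that keeps the first index of each character.
deduplicated = {"PADDING": 0, "UNKNOWN": 1}
for c in alphabet:
    if c not in deduplicated:
        deduplicated[c] = len(deduplicated)

print(naive["ñ"], naive["Ñ"])                # both characters collide
print(deduplicated["ñ"], deduplicated["Ñ"])  # distinct indices
```

Changing the mapping alters the indices the character embeddings are trained on, so a variant like this would only make sense before training, not with an already saved model.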

In [28]:
# words and their index (value)
word2Idx

{'PADDING_TOKEN': 0,
 'UNKNOWN_TOKEN': 1,
 'de': 2,
 'la': 3,
 'en': 4,
 'el': 5,
 'y': 6,
 'que': 7,
 'a': 8,
 'los': 9,
 'del': 10,
 'las': 11,
 'se': 12,
 'por': 13,
 'un': 14,
 'con': 15,
 'para': 16,
 'una': 17,
 'su': 18,
 'al': 19,
 'no': 20,
 'es': 21,
 'El': 22,
 'como': 23,
 'La': 24,
 'más': 25,
 'En': 26,
 'lo': 27,
 'o': 28,
 'sobre': 29,
 'sus': 30,
 'ha': 31,
 'fue': 32,
 'entre': 33,
 'este': 34,
 'Los': 35,
 'también': 36,
 'años': 37,
 'dos': 38,
 'pero': 39,
 'son': 40,
 'han': 41,
 'esta': 42,
 'le': 43,
 'A': 44,
 'parte': 45,
 'ser': 46,
 'Estados': 47,
 'está': 48,
 'ya': 49,
 'año': 50,
 'hasta': 51,
 'desde': 52,
 'contra': 53,
 'sin': 54,
 'e': 55,
 'Se': 56,
 'Las': 57,
 'si': 58,
 'todos': 59,
 'cuando': 60,
 'donde': 61,
 'Comisión': 62,
 'otros': 63,
 'tiene': 64,
 'durante': 65,
 'todo': 66,
 'países': 67,
 'Naciones': 68,
 'Unidas': 69,
 'muy': 70,
 'personas': 71,
 'así': 72,
 'Consejo': 73,
 'puede': 74,
 'desarrollo': 75,
 'Por': 76,
 'era': 77,
 'Gen

In [29]:
### Pad the character indices of every word to a fixed length
def padding(Sentences):
    # Fixed number of characters per word; it must match the char_input shape of the model (52)
    maxlen = 52
    for i,sentence in enumerate(Sentences):
        # sentence[2] holds one list of character indices per word
        Sentences[i][2] = pad_sequences(Sentences[i][2],maxlen,padding='post')
    return Sentences
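
The effect of the `pad_sequences` call (target length 52, `padding='post'`) on the per-word character indices can be sketched in plain Python. The helper `pad_post` is hypothetical; it mirrors the Keras behaviour used here, where zeros are appended on the right and sequences longer than the target length lose their leading elements (`truncating='pre'` is the Keras default). A target length of 5 is used to keep the output short.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad on the right with `value`; truncate from the front when too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                      # keep the last `maxlen` items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# Character indices for the words "may" and "(" as they appear in the notebook output.
short = pad_post([[25, 13, 37], [103]], maxlen=5)
long_word = pad_post([[1, 2, 3, 4, 5, 6, 7]], maxlen=5)
print(short)
print(long_word)
```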

In [30]:
# classify the word according to the caseLookup ( numeric, mainly_numeric, allLower, allUpper, initialUpper, contains_digit) 
def getCasing(word, caseLookup):   
    casing = 'other'
    #Get number of digits in word
    numDigits = 0
    for char in word:
        if char.isdigit():
            numDigits += 1
            
    digitFraction = numDigits / float(len(word))
    
    if word.isdigit(): #Is a digit
        casing = 'numeric'
    elif digitFraction > 0.5:
        casing = 'mainly_numeric'
    elif word.islower(): #All lower case
        casing = 'allLower'
    elif word.isupper(): #All upper case
        casing = 'allUpper'
    elif word[0].isupper(): #is a title, initial char upper, then all lower
        casing = 'initialUpper'
    elif numDigits > 0:
        casing = 'contains_digit'
    
   
    return caseLookup[casing]
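
The order of the checks matters: `str.islower()` and `str.isupper()` ignore digits, so a token such as `abc1` is classified as `allLower`, and `contains_digit` is reached only by mixed-case tokens. Below is a compact restatement of the same rules, returning the category name rather than its index, applied to a few sample tokens (the function name `casing_name` is hypothetical):

```python
def casing_name(word):
    """Same decision order as getCasing, returning the category name."""
    num_digits = sum(ch.isdigit() for ch in word)
    if word.isdigit():
        return 'numeric'
    if num_digits / float(len(word)) > 0.5:
        return 'mainly_numeric'
    if word.islower():
        return 'allLower'
    if word.isupper():
        return 'allUpper'
    if word[0].isupper():
        return 'initialUpper'
    if num_digits > 0:
        return 'contains_digit'
    return 'other'

for token in ["25", "A320", "may", "EFE", "Melbourne", "abc1", "aB3", "("]:
    print(token, casing_name(token))
```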

In [31]:
# Convert each sentence into lists of word, casing, character and label indices
def createMatrices(sentences, word2Idx, label2Idx, case2Idx,char2Idx):
    unknownIdx = word2Idx['UNKNOWN_TOKEN']
    paddingIdx = word2Idx['PADDING_TOKEN']    
        
    dataset = []
    
    wordCount = 0
    unknownWordCount = 0
    
    for sentence in sentences:
        wordIndices = []    
        caseIndices = []
        charIndices = []
        labelIndices = []
        
        for word,char,label in sentence:  
            wordCount += 1
            # if the word is in the list of words to index, then index it (verify with the lower cased word)
            if word in word2Idx:
                wordIdx = word2Idx[word]
            elif word.lower() in word2Idx:
                wordIdx = word2Idx[word.lower()]                 
            else: # else tag it as unknown
                wordIdx = unknownIdx
                unknownWordCount += 1
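            charIdx = []
            for x in char:
                # Fall back to UNKNOWN for characters that are missing from char2Idx
                charIdx.append(char2Idx.get(x, char2Idx['UNKNOWN']))
            # Store the word, casing, character and label indices of this token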
            wordIndices.append(wordIdx)
            caseIndices.append(getCasing(word, case2Idx)) #Call getCasing
            charIndices.append(charIdx)
            labelIndices.append(label2Idx[label])
           
        dataset.append([wordIndices, caseIndices, charIndices, labelIndices]) 
        
    return dataset

In [32]:
# Convert the train/dev/test sets to index form and pad their character indices

train_set = padding(createMatrices(trainSentences,word2Idx,  label2Idx, case2Idx,char2Idx))
dev_set = padding(createMatrices(devSentences,word2Idx, label2Idx, case2Idx,char2Idx))
test_set = padding(createMatrices(testSentences, word2Idx, label2Idx, case2Idx,char2Idx))


In [33]:
trainSentences[0]

[['Melbourne', ['M', 'e', 'l', 'b', 'o', 'u', 'r', 'n', 'e'], 'B-LOC'],
 ['(', ['('], 'O'],
 ['Australia', ['A', 'u', 's', 't', 'r', 'a', 'l', 'i', 'a'], 'B-LOC'],
 [')', [')'], 'O'],
 [',', [','], 'O'],
 ['25', ['2', '5'], 'O'],
 ['may', ['m', 'a', 'y'], 'O'],
 ['(', ['('], 'O'],
 ['EFE', ['E', 'F', 'E'], 'B-ORG'],
 [')', [')'], 'O'],
 ['.', ['.'], 'O']]

In [34]:
train_set[0]

[[11707, 1, 1439, 1, 1, 1, 16426, 1, 6411, 1, 1],
 [3, 4, 3, 4, 4, 0, 1, 4, 2, 4, 4],
 array([[ 51,  17,  24,  14,  27,  33,  30,  26,  17,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
        [103,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
        [ 39,  33,  31,  32,  30,  13,  24,  21,  13,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
        [104,   0,   0,

In [35]:
# save the words and labels to index as dict types
idx2Label = {v: k for k, v in label2Idx.items()}
#***************************************PATH***********************
np.save("model_data/idx2Label.npy",idx2Label)
np.save("model_data/word2Idx.npy",word2Idx)

In [36]:
# Group the sentences of each set by length; batch_len holds the cumulative end index of each length group (used later to build mini-batches)
def createBatches(data):
    l = []
    for i in data:
        l.append(len(i[0]))
    l = set(l)
    batches = []
    batch_len = []
    z = 0
    for i in l:
        for batch in data:
            if len(batch[0]) == i:
                batches.append(batch)
                z += 1
        batch_len.append(z)
    return batches,batch_len
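
`createBatches` reorders the data so that sentences of equal length are contiguous and records the cumulative end index of each group. Slicing the data at those indices, as `iterate_minibatches` does later, yields mini-batches that can be stacked into rectangular arrays without padding the sentence dimension. A minimal sketch of the same idea on toy data follows; the helper name and the data are invented, and unlike the notebook it visits the lengths in sorted order.

```python
def group_by_length(samples):
    """Return samples ordered by length plus the cumulative end index of each length group."""
    ordered, ends = [], []
    for length in sorted({len(s) for s in samples}):
        ordered.extend(s for s in samples if len(s) == length)
        ends.append(len(ordered))
    return ordered, ends

toy = [[1, 2, 3], [4], [5, 6, 7], [8, 9], [10]]
ordered, ends = group_by_length(toy)

# Slice the reordered data at the cumulative indices to obtain the mini-batches.
start = 0
batches = []
for end in ends:
    batches.append(ordered[start:end])
    start = end
print(ends)
print(batches)
```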

In [37]:
train_batch,train_batch_len = createBatches(train_set)
dev_batch,dev_batch_len = createBatches(dev_set)
test_batch,test_batch_len = createBatches(test_set)

In [38]:
#train_batch_len

Start with TensorFlow. Remember that TF first constructs a graph and then runs it. TF automatically determines the best construction, taking each node's requirements into account.

In [39]:
# Create a tensor for the inputs
words_input = Input(shape=(None,),dtype='int32',name='words_input')

In [40]:
# Create a tensor of the embeddings using the words embeddings and feeding with the words_input tensor
words = Embedding(input_dim=wordEmbeddings.shape[0], output_dim=wordEmbeddings.shape[1],  weights=[wordEmbeddings], trainable=False)(words_input)



In [41]:
# Create a tensor of casing input
casing_input = Input(shape=(None,), dtype='int32', name='casing_input')

In [42]:
#Create a tensor of the casing using the words embeddings and feeding with the casing_input tensor
casing = Embedding(output_dim=caseEmbeddings.shape[1], input_dim=caseEmbeddings.shape[0], weights=[caseEmbeddings], trainable=False)(casing_input)

Character-level input, embedding and CNN layers of the model:

In [43]:
character_input=Input(shape=(None,52,),name='char_input')

In [44]:
embed_char_out=TimeDistributed(Embedding(len(char2Idx),30,embeddings_initializer=RandomUniform(minval=-0.5, maxval=0.5)), name='char_embedding')(character_input)

In [45]:
# Apply dropout (rate 0.5) to the character embeddings
dropout= Dropout(0.5)(embed_char_out)



In [46]:
conv1d_out= TimeDistributed(Conv1D(kernel_size=3, filters=30, padding='same',activation='tanh', strides=1))(dropout)

In [47]:
# max pool of the convolutional
maxpool_out=TimeDistributed(MaxPooling1D(52))(conv1d_out)

In [48]:
# Flatten the max-pooled CNN output so that each word has a single character feature vector
char = TimeDistributed(Flatten())(maxpool_out)
char = Dropout(0.5)(char)

In [49]:
output = concatenate([words, casing,char])
output = Bidirectional(LSTM(200, return_sequences=True, dropout=0.50, recurrent_dropout=0.25))(output)
output = TimeDistributed(Dense(len(label2Idx), activation='softmax'))(output)

Model. Includes a summary of the model.

In [50]:
model = Model(inputs=[words_input, casing_input,character_input], outputs=[output])
model.compile(loss='sparse_categorical_crossentropy', optimizer='nadam')
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
char_input (InputLayer)         (None, None, 52)     0                                            
__________________________________________________________________________________________________
char_embedding (TimeDistributed (None, None, 52, 30) 4260        char_input[0][0]                 
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, None, 52, 30) 0           char_embedding[0][0]             
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, None, 52, 30) 2730        dropout_1[0][0]                  
__________________________________________________________________________________________________
time_distr
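
The parameter counts in the summary can be checked by hand. An embedding layer has `vocabulary_size * embedding_dim` weights, and a `Conv1D` layer has `kernel_size * input_channels * filters` weights plus one bias per filter. The short calculation below uses the values from the cells above (30-dimensional character embeddings, kernel size 3, 30 filters). The character vocabulary size of 142 is inferred from the reported 4260 parameters rather than stated in the notebook.

```python
embedding_dim = 30
kernel_size, filters = 3, 30

char_vocab_size = 4260 // embedding_dim          # inferred from the summary
embedding_params = char_vocab_size * embedding_dim
conv_params = kernel_size * embedding_dim * filters + filters

print(char_vocab_size, embedding_params, conv_params)
```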

In [51]:
#Number of epochs
epochs = 150

In [52]:
# Yield one mini-batch per length group, slicing the dataset at the cumulative indices in batch_len
def iterate_minibatches(dataset,batch_len): 
    start = 0
    for i in batch_len:
        tokens = []
        caseing = []
        char = []
        labels = []
        data = dataset[start:i]
        start = i
        for dt in data:
            t,c,ch,l = dt
            l = np.expand_dims(l,-1)
            tokens.append(t)
            caseing.append(c)
            char.append(ch)
            labels.append(l)
        yield np.asarray(labels),np.asarray(tokens),np.asarray(caseing),np.asarray(char)

In [53]:
#Training of n epochs 
for epoch in range(epochs):    
    print("Epoch %d/%d"%(epoch,epochs))
    a = Progbar(len(train_batch_len))
    for i,batch in enumerate(iterate_minibatches(train_batch,train_batch_len)):
        labels, tokens, casing,char = batch       
        model.train_on_batch([tokens, casing,char], labels)
        a.update(i)
    a.update(i+1)
    print(' ')

Epoch 0/150
 
Epoch 1/150
 
...
 
Epoch 149/150
 


In [54]:
# Saving the model
model.save("model_data/model.h5")

Evaluating model accuracy, using precision, recall and F1 on the dev and test sets.

In [55]:
def tag_dataset(dataset):
    correctLabels = []
    predLabels = []
    b = Progbar(len(dataset))
    for i,data in enumerate(dataset):    
        tokens, casing,char, labels = data
        tokens = np.asarray([tokens])     
        casing = np.asarray([casing])
        char = np.asarray([char])
        pred = model.predict([tokens, casing,char], verbose=False)[0]   
        pred = pred.argmax(axis=-1) #Predict the classes            
        correctLabels.append(labels)
        predLabels.append(pred)
        b.update(i)
    b.update(i+1)
    return predLabels, correctLabels

In [56]:
# Compute precision, recall and F1. Call tag_dataset first to get the predicted and correct labels for the dataset
def compute_f1(predictions, correct, idx2Label): 
    label_pred = []    
    for sentence in predictions:
        label_pred.append([idx2Label[element] for element in sentence])
        
    label_correct = []    
    for sentence in correct:
        label_correct.append([idx2Label[element] for element in sentence])
            
    
    #print label_pred
    #print label_correct
    
    prec = compute_precision(label_pred, label_correct)
    rec = compute_precision(label_correct, label_pred)
    
    f1 = 0
    if (rec+prec) > 0:
        f1 = 2.0 * prec * rec / (prec + rec)
        
    return prec, rec, f1

In [57]:
def compute_precision(guessed_sentences, correct_sentences):
    assert(len(guessed_sentences) == len(correct_sentences))
    correctCount = 0
    count = 0
    
    
    for sentenceIdx in range(len(guessed_sentences)):
        guessed = guessed_sentences[sentenceIdx]
        correct = correct_sentences[sentenceIdx]
        assert(len(guessed) == len(correct))
        idx = 0
        while idx < len(guessed):
            if guessed[idx][0] == 'B': #A new chunk starts
                count += 1
                
                if guessed[idx] == correct[idx]:
                    idx += 1
                    correctlyFound = True
                    
                    while idx < len(guessed) and guessed[idx][0] == 'I': #Scan until it no longer starts with I
                        if guessed[idx] != correct[idx]:
                            correctlyFound = False
                        
                        idx += 1
                    
                    if idx < len(guessed):
                        if correct[idx][0] == 'I': #The chunk in correct was longer
                            correctlyFound = False
                        
                    
                    if correctlyFound:
                        correctCount += 1
                else:
                    idx += 1
            else:  
                idx += 1
    
    precision = 0
    if count > 0:    
        precision = float(correctCount) / count
        
    return precision

In [58]:
#   Performance on dev dataset        
predLabels, correctLabels = tag_dataset(dev_batch)        
pre_dev, rec_dev, f1_dev = compute_f1(predLabels, correctLabels, idx2Label)
print("Dev-Data: Prec: %.3f, Rec: %.3f, F1: %.3f" % (pre_dev, rec_dev, f1_dev))

Dev-Data: Prec: 0.811, Rec: 0.813, F1: 0.812


In [59]:
#   Performance on test dataset       
predLabels, correctLabels = tag_dataset(test_batch)        
pre_test, rec_test, f1_test= compute_f1(predLabels, correctLabels, idx2Label)
print("Test-Data: Prec: %.3f, Rec: %.3f, F1: %.3f" % (pre_test, rec_test, f1_test))

Test-Data: Prec: 0.849, Rec: 0.853, F1: 0.851
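
As a consistency check, F1 is the harmonic mean of precision and recall, `F1 = 2 * P * R / (P + R)`. Recomputing it from the rounded precision and recall printed above reproduces the reported values to three decimals. The notebook computes F1 from the unrounded numbers, so exact agreement is not guaranteed in general.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

print("Dev:  %.3f" % f1_score(0.811, 0.813))
print("Test: %.3f" % f1_score(0.849, 0.853))
```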


Testing the model on new text

In [60]:
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords

Defining class for testing.

In [68]:
import numpy as np
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences
from nltk import word_tokenize

class Parser:

    def __init__(self):
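        # :: Hard coded char lookup; it must be identical to the char2Idx used for training, otherwise the character indices do not match the trained embeddings ::
        self.char2Idx = {"PADDING":0, "UNKNOWN":1}
        for c in " 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZáéíóúñäëïöüÁÉÍÓÚÄËÏÖÜÃÂñÑàèìòùÀÈÌÒÙ.,-_()[]{}¡!¿?:;#'\"/\\%$`&=*+@^~|‘´·»³©\xadº±¼\xa0":
            self.char2Idx[c] = len(self.char2Idx)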
        # :: Hard coded case lookup ::
        self.case2Idx = {'numeric': 0, 'allLower':1, 'allUpper':2, 'initialUpper':3, 'other':4, 'mainly_numeric':5, 'contains_digit': 6, 'PADDING_TOKEN':7}

    def load_models(self, loc=None):
        if not loc:
            loc = os.path.join(os.path.expanduser('~'), '.ner_model')
        self.model = load_model(os.path.join(loc,"model.h5"))
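        # loading word2Idx (np.save pickled the dict, so recent NumPy versions need allow_pickle=True to load it)
        self.word2Idx = np.load(os.path.join(loc,"word2Idx.npy"), allow_pickle=True).item()
        # loading idx2Label
        self.idx2Label = np.load(os.path.join(loc,"idx2Label.npy"), allow_pickle=True).item()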

    def getCasing(self,word, caseLookup):   
        casing = 'other'
        
        numDigits = 0
        for char in word:
            if char.isdigit():
                numDigits += 1
                
        digitFraction = numDigits / float(len(word))
        
        if word.isdigit(): #Is a digit
            casing = 'numeric'
        elif digitFraction > 0.5:
            casing = 'mainly_numeric'
        elif word.islower(): #All lower case
            casing = 'allLower'
        elif word.isupper(): #All upper case
            casing = 'allUpper'
        elif word[0].isupper(): #is a title, initial char upper, then all lower
            casing = 'initialUpper'
        elif numDigits > 0:
            casing = 'contains_digit'  
        return caseLookup[casing]

    def createTensor(self,sentence, word2Idx,case2Idx,char2Idx):
        unknownIdx = word2Idx['UNKNOWN_TOKEN']
    
        wordIndices = []    
        caseIndices = []
        charIndices = []
            
        for word,char in sentence:  
            word = str(word)
            if word in word2Idx:
                wordIdx = word2Idx[word]
            elif word.lower() in word2Idx:
                wordIdx = word2Idx[word.lower()]                 
            else:
                wordIdx = unknownIdx
            charIdx = []
            for x in char:
                if x in char2Idx.keys():
                    charIdx.append(char2Idx[x])
                else:
                    charIdx.append(char2Idx['UNKNOWN'])   
            wordIndices.append(wordIdx)
            caseIndices.append(self.getCasing(word, case2Idx))
            charIndices.append(charIdx)
            
        return [wordIndices, caseIndices, charIndices]

    def addCharInformation(self, sentence):
        return [[word, list(str(word))] for word in sentence]

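    def padding(self,Sentence):
        # Pad each word's character indices with trailing zeros to length 52.
        # Words longer than 52 characters lose their leading characters (Keras default truncating='pre').
        Sentence[2] = pad_sequences(Sentence[2],52,padding='post')
        return Sentence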

    def predict(self,Sentence):
        # Tokenize, and keep the raw tokens so they can be paired with the predicted tags.
        Sentence = words = word_tokenize(Sentence)
        Sentence = self.addCharInformation(Sentence)
        Sentence = self.padding(self.createTensor(Sentence,self.word2Idx,self.case2Idx,self.char2Idx))
        tokens, casing, char = Sentence
        # Wrap each input in a list to add a batch dimension of size 1.
        tokens = np.asarray([tokens])
        casing = np.asarray([casing])
        char = np.asarray([char])
        pred = self.model.predict([tokens, casing, char], verbose=False)[0]
        # Pick the highest-scoring label index for each token.
        pred = pred.argmax(axis=-1)
        pred = [self.idx2Label[x].strip() for x in pred]

        return list(zip(words, pred))
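
`predict` returns one `(token, tag)` pair per token, with tags in the B-/I- scheme. A minimal sketch of how such pairs could be merged into whole entities is shown here. `group_entities` is a hypothetical helper that is not part of the original notebook, and treating a stray `I-` tag as the start of a new entity is an assumption.

```python
def group_entities(tagged_tokens):
    """Merge (token, tag) pairs in B-/I- format into (entity_text, entity_type) spans."""
    entities = []
    current_tokens, current_type = [], None

    for token, tag in tagged_tokens:
        # 'B-PER' -> ('B', '-', 'PER'); 'O' -> ('O', '', '')
        prefix, _, label = tag.partition('-')

        # An I- tag of the same type extends the open span.
        if prefix == 'I' and label == current_type:
            current_tokens.append(token)
            continue

        # Any other tag closes the open span, if there is one.
        if current_tokens:
            entities.append((' '.join(current_tokens), current_type))
            current_tokens, current_type = [], None

        # B-X starts a new span; a stray I-X is also treated as a start (assumption).
        if prefix in ('B', 'I') and label:
            current_tokens, current_type = [token], label

    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities


sample = [('Marcelo', 'B-PER'), ('Ebrard', 'I-PER'), ('va', 'O'),
          ('Palacio', 'B-LOC'), ('Nacional', 'I-LOC')]
print(group_entities(sample))
# [('Marcelo Ebrard', 'PER'), ('Palacio Nacional', 'LOC')]
```

Applied to the result of `p.predict(...)`, this would yield one entry per entity instead of one per token.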
       

In [69]:
p = Parser()
p.load_models("model_data/")

In [70]:
from nltk import sent_tokenize
text_file = open("Input_sample.txt").read()
token_sent = sent_tokenize(text_file)

In [71]:
print(token_sent)

['MÉXICO.—Marcelo Ebrard va a renunciar a la Secretaría de Relaciones Exteriores, según versiones periodísticas que hoy fueron desmentidas por la SRE.', '“En el ámbito de la fantasía”\n\nRoberto Velasco Álvarez, vocero de la Cancillería, aseguró a través de Twitter que es totalmente falsa la versión que circuló sobre el tema.', 'El evento referido, señaló, sólo ocurrió en el ámbito de la fantasía, según lo publicado por El Universal.', 'Primero Gertz Moreno\n\nAyer lunes también circuló la versión sobre una supuesta renuncia del Fiscal General de la República, Alejandro Gertz Manero, por cuestiones de salud.', 'También fue desmentido por la dependencia.', 'Hoy en importante firma\n\nEsta mañana, en Palacio Nacional, con el presidente Andrés Manuel López Obrador como testigo de honor, Michelle Bachelet, alta comisionada de Naciones Unidas para los Derechos Humanos, y el canciller Marcelo Ebrard firmaron el Acuerdo para la formación en materia de derechos humanos y operación de acuerdo a
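
The printed list shows that `sent_tokenize` keeps a headline attached to the sentence that follows it (for example `'Primero Gertz Moreno\n\nAyer lunes ...'`), presumably because the headline has no closing punctuation. One possible workaround, sketched here as an assumption and not taken from the original notebook, is to split the text on line breaks first and tokenize each block on its own. `split_blocks` is a hypothetical helper name, and the sample string is a shortened version of the input.

```python
def split_blocks(text):
    """Return the non-empty lines of `text`, stripped, in their original order."""
    return [line.strip() for line in text.splitlines() if line.strip()]


sample = "Primero Gertz Moreno\n\nAyer lunes también circuló la versión."
print(split_blocks(sample))
# ['Primero Gertz Moreno', 'Ayer lunes también circuló la versión.']
```

Each block could then be passed to `sent_tokenize(block, language='spanish')`. NLTK's default language is English, so naming the language explicitly may also help with Spanish abbreviations.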

Input: 

In [72]:
print(text_file)

MÉXICO.—Marcelo Ebrard va a renunciar a la Secretaría de Relaciones Exteriores, según versiones periodísticas que hoy fueron desmentidas por la SRE.
“En el ámbito de la fantasía”

Roberto Velasco Álvarez, vocero de la Cancillería, aseguró a través de Twitter que es totalmente falsa la versión que circuló sobre el tema.

El evento referido, señaló, sólo ocurrió en el ámbito de la fantasía, según lo publicado por El Universal.
Primero Gertz Moreno

Ayer lunes también circuló la versión sobre una supuesta renuncia del Fiscal General de la República, Alejandro Gertz Manero, por cuestiones de salud. También fue desmentido por la dependencia.
Hoy en importante firma

Esta mañana, en Palacio Nacional, con el presidente Andrés Manuel López Obrador como testigo de honor, Michelle Bachelet, alta comisionada de Naciones Unidas para los Derechos Humanos, y el canciller Marcelo Ebrard firmaron el Acuerdo para la formación en materia de derechos humanos y operación de acuerdo a estándares internacio

In [85]:
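import unidecode

outlist = []
for t in token_sent:
    # Transliterate to ASCII before prediction (e.g. 'MÉXICO' -> 'MEXICO').
    t = unidecode.unidecode(t)
    outlist.append(p.predict(t))

# Keep only tokens tagged as part of an entity, i.e. any tag other than 'O'.
# Comparing the tag itself avoids dropping a token whose text happens to be 'O'.
to_out = []
for s in outlist:
    for word, tag in s:
        if tag != 'O':
            print((word, tag))
            to_out.append((word, tag))

# Write one "word tag " pair per line.
with open('Output_sample.txt', 'w') as f:
    for word, tag in to_out:
        f.write("\n%s %s " % (word, tag))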

('MEXICO.', 'B-ORG')
('Marcelo', 'B-PER')
('Ebrard', 'I-PER')
('Secretaria', 'B-ORG')
('de', 'I-ORG')
('Relaciones', 'I-MISC')
('Exteriores', 'I-MISC')
('SRE', 'B-MISC')
('Roberto', 'B-PER')
('Velasco', 'I-PER')
('Alvarez', 'I-PER')
('Cancilleria', 'B-PER')
('Twitter', 'B-LOC')
('senalo', 'I-MISC')
('El', 'B-ORG')
('Universal', 'I-ORG')
('Gertz', 'B-PER')
('Moreno', 'I-PER')
('Ayer', 'I-PER')
('Fiscal', 'B-MISC')
('General', 'I-MISC')
('de', 'I-ORG')
('la', 'I-ORG')
('Republica', 'I-ORG')
('Alejandro', 'B-PER')
('Gertz', 'I-PER')
('Manero', 'I-PER')
('Tambien', 'B-PER')
('Esta', 'B-MISC')
('Palacio', 'B-LOC')
('Nacional', 'I-LOC')
('Andres', 'B-PER')
('Manuel', 'I-PER')
('Lopez', 'I-PER')
('Obrador', 'I-PER')
('Michelle', 'I-PER')
('Bachelet', 'I-PER')
('Naciones', 'B-MISC')
('Unidas', 'I-MISC')
('para', 'I-ORG')
('los', 'I-MISC')
('Derechos', 'I-MISC')
('Humanos', 'I-MISC')
('Marcelo', 'B-PER')
('Ebrard', 'I-PER')
('Acuerdo', 'B-MISC')
('Guardia', 'B-ORG')
('Nacional', 'I-ORG')
('Ebra