# LSTM part-of-speech tagging and supertagging for the French Treebank

This notebook trains a part-of-speech tagger and supertagger for the French Treebank using a vanilla bidirectional LSTM network.

Run the following cell to load the Keras packages.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import sys

from keras.models import Model, load_model
from keras.layers import Bidirectional, Dense, Input, Dropout, LSTM, Activation, TimeDistributed, BatchNormalization, concatenate, Concatenate
from keras.layers.embeddings import Embedding
from keras.constraints import max_norm
from keras import regularizers
from keras.preprocessing import sequence
from keras.utils import to_categorical
from keras.initializers import glorot_uniform
from keras import backend as K
from sklearn.model_selection import train_test_split

from grail_data_utils import *

%matplotlib inline

np.random.seed(1)

Using TensorFlow backend.


### Read the TLGbank file

In [2]:
# very small initial part of corpus (only file aa1)
# X, Y1, Y2, Z, vocabulary, vnorm, partsofspeech1, partsofspeech2, superset, maxLen = read_maxentdata('aa1.txt')

In [3]:
# small initial part of corpus (files aa1, aa2, ab2 and ae1)
# number of sentences, train: 1195, test: 398, dev: 399  
# X, Y1, Y2, Z, vocabulary, vnorm, partsofspeech1, partsofspeech2, superset, maxLen = read_maxentdata('aa1_ae1.txt')

In [4]:
# entire corpus
# number of sentences, train: 9449, test: 3150, dev: 3150
X, Y1, Y2, Z, vocabulary, vnorm, partsofspeech1, partsofspeech2, superset, maxLen = read_maxentdata('m2.txt')

In [5]:
numClasses = len(partsofspeech2)+1
numSuperClasses = len(superset)+1

print()
print("Longest sentence   : ", maxLen)
print("Number of POS tags : ", numClasses)
print("Number of supertags: ", numSuperClasses)



Longest sentence   :  266
Number of POS tags :  32
Number of supertags:  891


## 1. Split the input into train/dev/test

Split the full training set into 60% train, 20% dev and 20% test.

In [6]:
# split the training data into the standard 60% train, 20% dev, 20% test 
X_train, X_testdev, Y_train, Y_testdev = train_test_split(X, Y2, test_size=0.4)
X_test, X_dev, Y_test, Y_dev = train_test_split(X_testdev, Y_testdev, test_size=0.5)
print("Train: ", X_train.shape)
print("Test:  ", X_test.shape)
print("Dev:   ", X_dev.shape)


Train:  (9449,)
Test:   (3150,)
Dev:    (3150,)
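The sizes above can be checked by hand. The sketch below assumes that a fractional `test_size` is rounded up to a whole number of sentences and that the remainder goes to the other part; `split_sizes` is a hypothetical helper written for illustration, not part of scikit-learn or of this notebook.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size, rounding the second part up."""
    n_second = math.ceil(n_samples * test_size)
    return n_samples - n_second, n_second

# total number of sentences, taken from the shapes printed above
n_total = 9449 + 3150 + 3150

# first split: 60% train, 40% held out
n_train, n_testdev = split_sizes(n_total, 0.4)
# second split: the held-out 40% is halved into test and dev
n_test, n_dev = split_sizes(n_testdev, 0.5)

print(n_train, n_test, n_dev)
```

Splitting off 40% and then halving it is what yields the 60/20/20 proportions.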


## 2. Create auxiliary mappings

Create mappings from supertags and the two sets of part-of-speech tags to integers and back.

In [7]:
# create mappings for the two POS tagsets and for the supertags

super_to_index, index_to_super = indexify(superset)
pos1_to_index, index_to_pos1 = indexify(partsofspeech1)
pos2_to_index, index_to_pos2 = indexify(partsofspeech2)
print(pos2_to_index)

{'PUN': 1, 'PRO:REL': 2, 'VER:cond': 3, 'PRO:IND': 4, 'ADJ': 5, 'ADV': 6, 'KON': 7, 'INT': 8, 'VER:impe': 9, 'DET:ART': 10, 'VER:simp': 11, 'VER:pper': 12, 'ABR': 13, 'PRO': 14, 'NAM': 15, 'PRO:DEM': 16, 'VER:subp': 17, 'VER:futu': 18, 'NOM': 19, 'DET:POS': 20, 'NUM': 21, 'SYM': 22, 'PRO:POS': 23, 'VER:infi': 24, 'VER:ppre': 25, 'PUN:cit': 26, 'VER:pres': 27, 'PRP': 28, 'PRP:det': 29, 'VER:impf': 30, 'PRO:PER': 31}
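The `indexify` helper comes from `grail_data_utils` and its source is not shown here. As a rough sketch of what such a helper might do, assuming indices start at 1 so that 0 stays free for padding (which matches the output above), one could write the following. The name `indexify_sketch` is hypothetical.

```python
def indexify_sketch(items):
    """Map each item to a 1-based index and back; index 0 is left unused for padding."""
    item_to_index = {item: i for i, item in enumerate(items, start=1)}
    index_to_item = {i: item for item, i in item_to_index.items()}
    return item_to_index, index_to_item

# small illustration with three of the POS tags shown above
tag_to_index, index_to_tag = indexify_sketch(['PUN', 'PRO:REL', 'VER:cond'])
print(tag_to_index)
print(index_to_tag[2])
```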


## 3. Obtain the word vector information

We are using a shell call to the compiled fastText code to produce a file _vectors.txt_ with the relevant vectors. 

### 3.1. Feature vectors

#### Suffixes

In [None]:
french_suffixes = read_suffixes('suffixes.txt')
print(len(french_suffixes))

In [None]:
suffix_vector("seraient", french_suffixes)

#### Manually designed features

In [None]:
print(word_features("ABCD"))
print(word_features("Abcd"))
print(word_features("1234"))
print(word_features("*%"))
print(word_features("Ab-cd"))
print(word_features("-t-il"))
print(word_features("Contre"))
print(word_features("dans"))
print(word_features("anti-"))
print(word_features("et"))
print(word_features("ou"))
print(word_features("-t-il"))
print(word_features("-il"))
print(word_features("-tu"))
print(word_features("eussent"))

### 3.2. Sending the vocabulary through the fasttext executable

Write the vocabulary to an output file, then pass it to the fastText executable to produce the relevant word embeddings for our text. Since the fastText model is over 5 GB, the shell call can take some time.

In [None]:
with open("vocab.txt", 'w') as vocab_file:
    for w in vnorm:
        print(w, file=vocab_file)

Shell call to `fasttext` for my Macbook Air

In [None]:
!/Users/moot/Software/fastText-master/fasttext print-word-vectors /Users/moot/Corpus/wiki.fr/wiki.fr.bin < vocab.txt > vectors.txt

Shell call to `fasttext` for my MacBook Pro, with `wiki.fr.bin` on an external drive

In [None]:
!/Users/moot/Software/fastText-master/fasttext print-word-vectors /Volumes/LaCie/Corpus/fastText/wiki/wiki.fr.bin < vocab.txt > vectors.txt

#### Combine all vector information

Combine all information from fastText, the suffixes, the manually designed features and the features for digits (which are not in fastText) to produce combined feature vectors for all words.

In [None]:
word_to_index, index_to_word, word_to_vec_map = read_vecs('vectors.txt', vnorm, vocabulary)

#### Save auxiliary mapping to files

Use pickle to save the auxiliary dictionaries to files. This avoids having to generate them from scratch when using the model. 

In [None]:
import pickle

def save_obj(obj, name):
    with open(name + '.pkl', 'wb+') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name):
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)

In [None]:
save_obj(word_to_index, 'word_to_index')
save_obj(index_to_word, 'index_to_word')
save_obj(word_to_vec_map, 'word_to_vec_map')
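A short round-trip check of the two pickle helpers, as a sketch only: the toy dictionary and the temporary directory are assumptions made for illustration. Note that `load_obj` returns the object, so its result has to be assigned to a variable.

```python
import os
import pickle
import tempfile

def save_obj(obj, name):
    # write the object to <name>.pkl
    with open(name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name):
    # read the object back from <name>.pkl and return it
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)

toy_word_to_index = {'est': 1, 'le': 2}    # toy data, not the real vocabulary

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'word_to_index')
    save_obj(toy_word_to_index, path)
    restored = load_obj(path)              # assign the return value

print(restored == toy_word_to_index)
```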

### 3.3. Alternative word embeddings using CWindow

Instead of the fastText embeddings, we can use the `wang2vec` embeddings which should be more appropriate for
syntactic applications.

In [21]:
from gensim.models import KeyedVectors

def remove_prefix(text, prefix):
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

wv = KeyedVectors.load_word2vec_format('../wang2vec/frwiki_cwindow50_10.bin', binary=True)
veclength = 50


Compute `word_to_vec_map` for all words in the vocabulary using only the `cwindow` embeddings

In [22]:
word_to_vec_map = {}
unknowns = set()
invoc = 0

for w in vocabulary:
    wn = normalize_word(w)
    wr = remove_prefix(wn, "-t-")
    wr = remove_prefix(wr, "-")
    try:
        vec = wv[wr]
        invoc = invoc + 1
    except KeyError:
        unknowns.add(w)
        vec = np.zeros(veclength)
    word_to_vec_map[w] = vec

print(unknowns)
print(len(unknowns))
print(invoc)

{'1645', 'Wonder', 'Nursultan', 'au-delà', 'Giancardo', 'ISSC', 'anti-', 'VMS', 'Obnubilées', '42', 'Hypotheken', '285', 'nord-est', 'OFP', 'politiquement', '5.760', 'Rio-de-Janeiro', 'échaudées', 'baby-sitter', '171,5', 'CIO', 'allègrement', 'Jonh', '1er', '1915', 'Ernewein', 'Marceau', 'Adnan', '4,21', 'OPZZ', '740', '164', 'Newsweek', '390', 'surpiquer', 'Ebbene', '2009', 'Alphandery', 'Strauss-Kahn', 'celles-ci', '4,45', 'repointent', 'Briand', 'NV', 'garden', 'Chanteloup-les-Vignes', 'Officiellement', '9,3', 'Taxil', 'INFINT', 'Loira', 'Goetzfrid', 'Djebel-Ali', '67,9', 'Brighton', 'Manet', 'Guinée-Equatoriale', 'Duisbourg', 'Motors', 'injustement', 'chauves-souris', 'Daily', 'life', 'Haulage', '98', '40', '15,5', 'Agnelli', '2,9', '1999', 'Rizzoli', 'après-midi', '1.974,55', 'mini-rencontres', 'SOFIREM', '310', 'DSF', 'N1', '16,33', 'P', 'Marie-Marvingt', 'Kia', '38', 'Winterthur', '1914-1918', 'Esterel', 'isolément', 'Lopez', '1869', 'Afflelou', 'roupillait', '30', 'Saint-Honoré
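The lookup logic of the cell above can be tried out in isolation. This is a sketch under stated assumptions: a plain dictionary `toy_vectors` stands in for the `KeyedVectors` object, the `normalize_word` step from `grail_data_utils` is skipped, and `lookup` is a hypothetical helper name.

```python
def remove_prefix(text, prefix):
    if text.startswith(prefix):
        return text[len(prefix):]
    return text

def lookup(word, vectors, dim):
    """Return (vector, found); unknown words get an all-zero vector."""
    stripped = remove_prefix(word, "-t-")
    stripped = remove_prefix(stripped, "-")
    if stripped in vectors:
        return vectors[stripped], True
    return [0.0] * dim, False

# toy 2-dimensional embeddings, for illustration only
toy_vectors = {"il": [0.1, 0.2], "tu": [0.3, 0.4]}

for w in ["-t-il", "-tu", "1645"]:
    print(w, lookup(w, toy_vectors, 2))
```

Clitic forms such as `-t-il` are looked up under their bare form, while numbers and rare names fall back to the zero vector, which is why many of the unknown words listed above are digits and proper nouns.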

Compute `word_to_vec_map` for all words in the vocabulary using the `cwindow` embeddings, plus the suffix and
custom feature information

In [23]:
word_to_vec_map = {}
unknowns = set()
invoc = 0

for w in vocabulary:
    wn = normalize_word(w)
    wr = remove_prefix(wn, "-t-")
    wr = remove_prefix(wr, "-")
    try:
        emb = wv[wr]
        invoc = invoc + 1
        features = word_features(w)
    except KeyError:
        unknowns.add(w)
        emb = np.zeros(veclength)
        features = word_features(w, unknown=True)
    suffix = suffix_vector(wn)
    word_to_vec_map[w] = np.concatenate((emb,suffix,features))


Compute `word_to_index` and `index_to_word` for the entire vocabulary 

In [24]:
word_to_index, index_to_word = indexify(vocabulary)

### 3.4. The Embedding layer

In Keras, the embedding matrix is represented as a "layer", and maps positive integers (indices corresponding to words) into dense vectors of fixed size (the embedding vectors). It can be trained or initialized with a pretrained embedding. In this part, we create an [Embedding()](https://keras.io/layers/embeddings/) layer in Keras, and initialize it with the pretrained word vectors computed earlier in the notebook. 

The `Embedding()` layer takes an integer matrix of size (batch size, max input length) as input. This corresponds to sentences converted into lists of indices (integers).

The largest integer (i.e. word index) in the input should be no larger than the vocabulary size. The layer outputs an array of shape (batch size, max input length, dimension of word vectors).

We first convert all our training sentences into lists of indices, and then zero-pad all these lists so that their length is the length of the longest sentence. 

**TODO**: I'd like to test whether it makes any difference to add the `</s>` end tag to the end of each sentence. 

Run the following cell to see what `sentences_to_indices()` produces.

In [25]:
sentences_to_indices(X_train, word_to_index, maxLen)

array([[  4816.,  24327.,  25013., ...,      0.,      0.,      0.],
       [  5571.,  11400.,   7950., ...,      0.,      0.,      0.],
       [  6609.,  29463.,  13376., ...,      0.,      0.,      0.],
       ..., 
       [  1126.,  28292.,  13915., ...,      0.,      0.,      0.],
       [ 10367.,   8928.,  28429., ...,      0.,      0.,      0.],
       [  5571.,  19012.,  17529., ...,      0.,      0.,      0.]])
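The conversion and zero-padding step can be illustrated on a toy example. The function below is a sketch only, not the `sentences_to_indices` from `grail_data_utils`: the name `pad_to_indices`, the toy vocabulary and the use of plain integer lists (the real helper returns a float array) are assumptions.

```python
def pad_to_indices(sentences, item_to_index, max_len):
    """Replace each word by its index and right-pad every sentence with zeros."""
    rows = []
    for sentence in sentences:
        row = [item_to_index[w] for w in sentence[:max_len]]
        row += [0] * (max_len - len(row))   # 0 is the padding index
        rows.append(row)
    return rows

toy_word_to_index = {"yves": 1, "acceptera": 2, "le": 3, "lait": 4}
toy_sentences = [["yves", "acceptera", "le", "lait"], ["le", "lait"]]

rows = pad_to_indices(toy_sentences, toy_word_to_index, 6)
print(rows)
```

Every row has the same length, so the result can be passed to the network as a single rectangular batch.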

We now build the `Embedding()` layer for use with Keras, using pre-trained word vectors. After this layer is built, we can pass the output of `sentences_to_indices()` to it as an input, and the `Embedding()` layer will return the word embeddings for a sentence. 

We use the following steps:
1. Initialize the embedding matrix as a numpy array of zeroes with the correct shape.
2. Fill in the embedding matrix with all the word embeddings extracted from `word_to_vec_map`.
3. Define the Keras embedding layer using [Embedding()](https://keras.io/layers/embeddings/). Make this layer non-trainable by setting `trainable = False` when calling `Embedding()`; setting `trainable = True` would allow the optimization algorithm to modify the values of the word embeddings. 
4. Set the embedding weights to be equal to the embedding matrix 

In [26]:

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in the pre-trained word vectors.
    
    Arguments:
    word_to_vec_map -- dictionary mapping words to their vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """
    
    vocab_len = len(word_to_index) + 2                  # extra rows beyond the vocabulary; index 0 is reserved for padding
    emb_dim = word_to_vec_map["est"].shape[0]      # dimensionality of the word vectors, read off a known word
    
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len,emb_dim))
    
    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define the Keras embedding layer with the correct input/output sizes. It is non-trainable (trainable=False),
    # and mask_zero=True makes downstream layers skip the zero padding.
    embedding_layer = Embedding(vocab_len,emb_dim,trainable=False,mask_zero=True)

    # Build the embedding layer; this is required before setting its weights. Do not modify the "None".
    embedding_layer.build((None,))
    
    # Set the weights of the embedding layer to the embedding matrix. The layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])
    
    return embedding_layer

In [27]:
embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
print("weights[0][2][1] =", embedding_layer.get_weights()[0][2][1])

weights[0][2][1] = -0.179958



## 4. Building the Part-of-Speech tagger

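We now build the POS-tagger model using the previously built embedding layer and feed its output to an LSTM network with 128 units. Note that the code below uses a single forward `LSTM` layer; wrapping it in the imported `Bidirectional` wrapper would give 128 units in each direction. 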



In [None]:
# POS_model

def POS_model(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the graph for the part-of-speech tagger model
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its fastText vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary

    Returns:
    model -- a model instance in Keras
    """
    
    # Define sentence_indices as the input of the graph; it has shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape = input_shape, dtype = 'int32')
    
    # Create the embedding layer initialized with the pretrained word vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    # Propagate sentence_indices through the embedding layer to get the embeddings
    embeddings = embedding_layer(sentence_indices)   
    
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # returning a batch of sequences.
    X = LSTM(128, return_sequences=True)(embeddings)
    X = BatchNormalization()(X)
    Y = Dropout(0.5)(X)
    # Add a (time distributed) Dense layer with a softmax activation
    Y = TimeDistributed(Dense(numClasses, activation='softmax'))(Y)
    
    # Create Model instance which converts sentence_indices into Y.
    model = Model(inputs=sentence_indices,outputs=Y)
        
    return model

Run the following cell to create the model and check its summary. The input length is `maxLen`, the length of the longest sentence (266 for the full corpus). In the summary, the word embeddings are counted as non-trainable parameters, since the embedding layer is created with `trainable=False`; the weights of the LSTM and dense layers are trainable. 

In [None]:
model = POS_model((maxLen,), word_to_vec_map, word_to_index)
model.summary()
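The layer sizes reported by `model.summary()` can be cross-checked by hand. The sketch below uses the standard parameter-count formulas for an LSTM layer and a dense layer. The input dimension of 663 is the embedding size mentioned later in this notebook and is an assumption for this particular model; the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, output_dim):
    # weight matrix plus bias vector
    return input_dim * output_dim + output_dim

emb_dim = 663        # assumed size of the combined word vectors
units = 128          # LSTM units in POS_model
num_classes = 32     # number of POS tags printed earlier (including padding)

n_lstm = lstm_params(emb_dim, units)
n_dense = dense_params(units, num_classes)
print(n_lstm, n_dense)
```

The embedding layer itself contributes `vocab_len * emb_dim` non-trainable parameters, which depends on the vocabulary size.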

As usual, after creating a model in Keras, you need to compile it and define which loss, optimizer and metrics you want to use. Compile the model using `categorical_crossentropy` loss, the `adam` optimizer and `['accuracy']` metrics:

In [None]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

It's time to train the model. The POS `model` takes as input an array of shape (`m`, `max_len`) and outputs probability vectors of shape (`m`, `max_len`, `number of classes`), one distribution per word. We thus have to convert `X_train` (sentences as lists of words) to `X_train_indices` (sentences as zero-padded lists of word indices), and `Y_train` (tag sequences) to `Y_train_oh` (tags as one-hot vectors).

In [None]:
X_train_indices = lists_to_indices(X_train, word_to_index, maxLen)
Y_train_indices = lists_to_indices(Y_train, pos2_to_index, maxLen)
Y_train_oh = to_categorical(Y_train_indices, num_classes=numClasses)

In [None]:
print(Y_train_indices[1])
print(Y_train_oh[1])
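To make the one-hot step concrete, here is a pure-Python sketch of what the conversion does on a single padded sentence. It is an illustration under assumptions, not the Keras `to_categorical` implementation; `one_hot` and `one_hot_sequences` are hypothetical names and the tag indices are toy values.

```python
def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position index."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

def one_hot_sequences(index_seqs, num_classes):
    # one one-hot vector per position, including the padding positions
    return [[one_hot(int(i), num_classes) for i in seq] for seq in index_seqs]

toy_tag_indices = [[19, 27, 0]]     # two tags followed by one padding position
encoded = one_hot_sequences(toy_tag_indices, 32)

print(len(encoded), len(encoded[0]), len(encoded[0][0]))
print(encoded[0][2][:3])
```

The padding index 0 becomes a class of its own, which is why `numClasses` is one more than the number of POS tags.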

In [None]:
X_dev_indices = lists_to_indices(X_dev, word_to_index, max_len = maxLen)
Y_dev_indices = lists_to_indices(Y_dev, pos2_to_index, max_len = maxLen)
Y_dev_oh = to_categorical(Y_dev_indices, num_classes = numClasses)

Fit the Keras model on `X_train_indices` and `Y_train_oh`. We will use `epochs = 30` and `batch_size = 32`.

In [None]:
history = model.fit(X_train_indices, Y_train_oh, epochs = 30, batch_size = 32, shuffle=True, validation_data=(X_dev_indices,Y_dev_oh))

The exact accuracy you get may differ slightly between runs. The following cells plot the loss and accuracy on the training and development sets per epoch, and then evaluate the model on the development set. 

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model train vs validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

In [None]:
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model train vs validation accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

In [None]:
loss, acc = model.evaluate(X_dev_indices, Y_dev_oh)
print()
print("Dev accuracy = ", acc)

You should get a test accuracy of about 94.8% for a vanilla model using only aa1.txt.
A vanilla POS model on the full training set gets a dev accuracy of 98.50%!

In [None]:
# predict on the development sentences so that the predictions line up with Y_dev
X_dev_indices = lists_to_indices(X_dev, word_to_index, maxLen)
pred = model.predict(X_dev_indices)

plot_confusion_matrix(Y_dev, pred)

In [None]:
# This code allows you to see the mislabelled examples

y_dev_oh = to_categorical(Y_dev_indices, num_classes = numClasses)
X_dev_indices = lists_to_indices(X_dev, word_to_index, maxLen)
pred = model.predict(X_dev_indices)

correct = 0
wrong = 0


for i in range(len(X_dev)):
    for j in range(len(X_dev[i])):
        num = np.argmax(pred[i][j])
        if(num != Y_dev_indices[i][j]):
            wrong = wrong + 1
            print('Expected POS tag: '+ X_dev[i][j] + '|' + Y_dev[i][j] + ' prediction: '+ X_dev[i][j] + '|' + index_to_pos2[num])
        else:
            correct = correct + 1
total = wrong + correct
print("Total  : ", total)
print("Correct: ", correct)
print("Wrong  : ", wrong)

cpct = (100*correct)/total
wpct = (100*wrong)/total
print("Correct %: ", cpct)
print("Wrong   %: ", wpct)
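The counting loop above only looks at the real words of each sentence, never at the padding positions. The same idea in a small, self-contained sketch, with toy predictions and hypothetical helper names (`argmax`, `token_accuracy`):

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

def token_accuracy(pred, gold_indices, lengths):
    """Count correct predictions over the first `length` positions of each sentence."""
    correct = total = 0
    for probs, gold, n in zip(pred, gold_indices, lengths):
        for j in range(n):              # padding positions beyond n are ignored
            total += 1
            if argmax(probs[j]) == gold[j]:
                correct += 1
    return correct, total

# one toy sentence of two words padded to length 3, with three classes
toy_pred = [[[0.1, 0.7, 0.2], [0.2, 0.5, 0.3], [0.9, 0.05, 0.05]]]
toy_gold = [[1, 2, 0]]
toy_lengths = [2]

correct, total = token_accuracy(toy_pred, toy_gold, toy_lengths)
print(correct, total, 100 * correct / total)
```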

### POStagger results on development set

| tagset | LSTM units | batchnorm | dropout | epochs | results |
|:-----|---------:|:----------:|-------:|-----:|----------:|
| tt | 128 | no | 0 |  50 | 98.50 |
| tt | 128 | yes | 0.5 | 30 | 98.76 |


In [None]:
model.save('tt_pos.h5')

## 5. Training the Supertagger



### Prepare the training and development data

We split the data as before, only using Z (supertags) instead of Y2 (treetagger POS tagset) as the target.

In [28]:
# split the training data into the standard 60% train, 20% dev, 20% test 
X_train, X_testdev, Y_super_train, Y_super_testdev = train_test_split(X, Z, test_size=0.4)
X_test, X_dev, Y_super_test, Y_super_dev = train_test_split(X_testdev, Y_super_testdev, test_size=0.5)
print("Train: ", X_train.shape)
print("Test:  ", X_test.shape)
print("Dev:   ", X_dev.shape)

Train:  (9449,)
Test:   (3150,)
Dev:    (3150,)


#### Prepare training data

Transform the training data into the form most convenient for the supertag model

In [29]:
X_train_indices = lists_to_indices(X_train, word_to_index, maxLen)
Y_super_train_indices = lists_to_indices(Y_super_train, super_to_index, maxLen)
Y_super_train_oh = to_categorical(Y_super_train_indices, num_classes=numSuperClasses)

#### Prepare development data

Do the same for the development data. The development data allows us to check for over/underfitting.

In [30]:
X_dev_indices = lists_to_indices(X_dev, word_to_index, max_len = maxLen)
Y_super_dev_indices = lists_to_indices(Y_super_dev, super_to_index, max_len = maxLen)
Y_super_dev_oh = to_categorical(Y_super_dev_indices, num_classes = numSuperClasses)

### Define the model

We define the structure of the model

In [None]:
# Super_model
# this is a direct supertag model not using the part-of-speech tags

def Super_model(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the direct supertagger model's graph
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its fastText vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary

    Returns:
    model -- a model instance in Keras
    """
    
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape = input_shape, dtype = 'int32')
    
    # Create the embedding layer initialized with the pretrained word vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices)   
    
    # Propagate the embeddings through an LSTM layer with 256-dimensional hidden state
    # returning a batch of sequences.
    X = LSTM(256, return_sequences=True)(embeddings) 
    X = BatchNormalization()(X)
    X = Dropout(0.5)(X)

#    merged = concatenate([embeddings,X])
#    X = LSTM(128, return_sequences=True)(merged) 
#    X = BatchNormalization()(X)
#    X = Dropout(0.5)(X)

    # Add a (time distributed) Dense layer followed by a softmax activation
    X = TimeDistributed(Dense(numSuperClasses, activation='softmax'))(X)
    
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices,outputs=X)
        
    return model

In [None]:
supermodel = Super_model((maxLen,), word_to_vec_map, word_to_index)
supermodel.summary()

In [None]:
supermodel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
history = supermodel.fit(X_train_indices, Y_super_train_oh, epochs = 50, batch_size = 64, shuffle=True, validation_data=(X_dev_indices,Y_super_dev_oh))

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model train vs validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

In [None]:
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model train vs validation accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

In [None]:
loss, acc = supermodel.evaluate(X_dev_indices, Y_super_dev_oh)
print()
print("Dev accuracy = ", acc)

In [None]:
supermodel.save('supertagger.h5')

In [None]:
# This code allows you to see the mislabelled examples

y_dev_oh = to_categorical(Y_super_dev_indices, num_classes = numSuperClasses)
X_dev_indices = lists_to_indices(X_dev, word_to_index, maxLen)
pred = supermodel.predict(X_dev_indices)

correct = 0
wrong = 0

# log every mislabelled word; the file is closed automatically
with open('superlog_raw.txt', 'w') as f:
    for i in range(len(X_dev)):
        for j in range(len(X_dev[i])):
            num = np.argmax(pred[i][j])
            if num != Y_super_dev_indices[i][j]:
                wrong = wrong + 1
                f.write(X_dev[i][j]+"|"+Y_super_dev[i][j]+"|"+index_to_super[num]+"\n")
                print('Expected supertag: '+ X_dev[i][j] + '|' + Y_super_dev[i][j] + ' prediction: '+ X_dev[i][j] + '|' + index_to_super[num])
            else:
                correct = correct + 1
total = wrong + correct

print("Total  : ", total)
print("Correct: ", correct)
print("Wrong  : ", wrong)

cpct = (100*correct)/total
wpct = (100*wrong)/total
print("Correct %: ", cpct)
print("Wrong   %: ", wpct)

In [None]:
y_dev_oh = to_categorical(Y_super_dev_indices, num_classes = numSuperClasses)
X_dev_indices = lists_to_indices(X_dev, word_to_index, maxLen)
pred = supermodel.predict(X_dev_indices)

# write the predicted and the correct supertag for every word to separate files
with open('super_out.txt', 'w') as fo, open('super_correct.txt', 'w') as fc:
    for i in range(len(X_dev)):
        for j in range(len(X_dev[i])):
            num = np.argmax(pred[i][j])
            fc.write(X_dev[i][j]+"|"+Y_super_dev[i][j]+"\n")
            fo.write(X_dev[i][j]+"|"+index_to_super[num]+"\n")

### Supertagger results on development set


Vanilla LSTM model (new feature set 20180309)

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 256 | yes | .5 |  10 | 80.66 | 81.12 |
| cwindow | 50 | 256 | yes | .5 |  30 | 85.43 | 82.04 |
| cwindow | 50 | 256 | yes | .5 |  50 |  |  |

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 128 | yes | .5 |  10 | 79.57 | 80.53 |
| cwindow | 50 | 128 | yes | .5 |  30 | 82.44 | 81.93 |
| cwindow | 50 | 128 | yes | .5 |  50 | 84.16 | 82.01 |


Vanilla LSTM model (new feature set 20180309)

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 128 | yes | .5 |  10 | 79.42 | 80.70 |
| cwindow | 50 | 128 | yes | .5 |  30 | 82.40 | 81.72 |
| cwindow | 50 | 128 | yes | .5 |  50 | 84.15 | 82.12 |


Vanilla LSTM model (new feature set 20180308)

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 128 | yes | .5 |  10 | 79.18 | 80.01 |
| cwindow | 50 | 128 | yes | .5 |  30 | 82.13 | 81.35 |
| cwindow | 50 | 128 | yes | .5 |  50 | 83.78 | 81.72 |

Vanilla LSTM model (old feature set)

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | results |
|:-----|-----:|---------:|:----------:|-------:|----:|----------:|
| fastText | 200 | 128 | yes | .2 |  50 | 80.21 |
| cwindow | 50 | 128 | yes | .2 | 50 | 75.82 |
| cwindow | 50 | 128 | yes | .4 | 50 | 79.25 |
| cwindow | 50 | 128 | yes | .5 | 50 | 79.59  |
| cwindow | 50 | 128 | no | .5 |  10 | 73.60 |
| cwindow | 50 | 128 | yes | .5 |  10 | 78.27 |
| cwindow | 50 | 128 | no | .5 |  30 | 77.89 |
| cwindow | 50 | 128 | yes | .5 |  30 | 79.64 |

Second LSTM layer

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | results |
|:-----|-----:|---------:|:----------:|-------:|----:|----------:|
| cwindow | 50 | 2 * 128 | yes | .5 | 10 | 78.32  |
| cwindow | 50 | 2 * 128 | yes | .5 | 30 | 80.10  |
| cwindow | 50 | 2 * 128 | yes | .5 | 50 |  80.55 |

Second LSTM layer plus forward mapping of the word embeddings

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | results |
|:-----|-----:|---------:|:----------:|-------:|----:|----------:|
| cwindow | 50 | 2 * 128 | yes | .5 | 10 |  78.50 |
| cwindow | 50 | 2 * 128 | yes | .5 | 30 |  80.45 |
| cwindow | 50 | 2 * 128 | yes | .5 | 50 |  80.34 |

In [None]:
tag_sequence("yves acceptera le lait", model, word_to_index, index_to_pos2, maxLen)

In [None]:
tag_sequence("yves acceptera le lait", superposmodel, word_to_index, index_to_super, maxLen)

In [None]:
print_tagged(X_dev, model, word_to_index, index_to_pos2, maxLen)

In [None]:
print_tagged(X_dev[1:5], supermodel, word_to_index, index_to_super, maxLen)

In [None]:
print_tagged_beta(X_dev[1:4], superposmodel, 0.1, word_to_index, index_to_super, maxLen)

In [None]:
eval_beta(X_dev, Y_super_dev, supermodel, word_to_index, super_to_index, index_to_super, 0.01, maxLen)

## Combined part-of-speech and supertagger

### Prepare training and development data

In [18]:
# split the training data into the standard 60% train, 20% dev, 20% test 
X_train, X_testdev, Y_super_train, Y_super_testdev = train_test_split(X, Z, test_size=0.4)
X_test, X_dev, Y_super_test, Y_super_dev = train_test_split(X_testdev, Y_super_testdev, test_size=0.5)
print("Train: ", X_train.shape)
print("Test:  ", X_test.shape)
print("Dev:   ", X_dev.shape)

Train:  (9449,)
Test:   (3150,)
Dev:    (3150,)


In [19]:
X_train_indices = lists_to_indices(X_train, word_to_index, maxLen)
Y_super_train_indices = lists_to_indices(Y_super_train, super_to_index, maxLen)
Y_super_train_oh = to_categorical(Y_super_train_indices, num_classes=numSuperClasses)

In [20]:
X_dev_indices = lists_to_indices(X_dev, word_to_index, max_len = maxLen)
Y_super_dev_indices = lists_to_indices(Y_super_dev, super_to_index, max_len = maxLen)
Y_super_dev_oh = to_categorical(Y_super_dev_indices, num_classes = numSuperClasses)

### Define and train the model

In [None]:
# Super_pos_model
# this is a combined model: a supertagger which uses the output of the part-of-speech model

def Super_pos_model(input_shape, pos_model, word_to_vec_map, word_to_index):
    """
    Function creating the combined supertag/part-of-speech model's graph.
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    pos_model -- the part-of-speech model to incorporate
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its fastText vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary 

    Returns:
    model -- a model instance in Keras
    """
    
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape = input_shape, dtype = 'int32')
    
    # Create the embedding layer pretrained with fastText vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices)
    
    # get probability distribution over parts_of_speech from pos_model
    parts_of_speech = pos_model(sentence_indices)
    
    # concatenate with the embeddings
    merged = concatenate([parts_of_speech,embeddings])
    
    # Propagate the concatenated vectors through an LSTM layer with 128-dimensional hidden state
    # returning a batch of sequences.
    X = LSTM(128, return_sequences=True)(merged) 
    X = BatchNormalization()(X)
    X = Dropout(0.5)(X)
    
    # Add a (time distributed) Dense layer followed by a softmax activation
    X = TimeDistributed(Dense(numSuperClasses, activation='softmax'))(X)
    
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices,outputs=X)
        
    return model

In [None]:
superposmodel = Super_pos_model((maxLen,), model, word_to_vec_map, word_to_index)
superposmodel.summary()
print(superposmodel.layers[1])
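The `concatenate` step in `Super_pos_model` joins, for every word, the probability distribution over POS tags with the word vector. A pure-Python sketch of the resulting shape follows; the sizes 32 and 663 are the ones this notebook mentions for the POS output and the embedding layer, the zero-filled inputs are placeholders, and `concat_per_token` is a hypothetical helper.

```python
def concat_per_token(pos_probs, embeddings):
    """Concatenate, for every token, the POS distribution with the word vector."""
    return [[p + e for p, e in zip(sent_p, sent_e)]
            for sent_p, sent_e in zip(pos_probs, embeddings)]

pos_dim, emb_dim = 32, 663

# one placeholder sentence of two tokens
pos_probs = [[[0.0] * pos_dim for _ in range(2)]]
embeddings = [[[0.0] * emb_dim for _ in range(2)]]

merged = concat_per_token(pos_probs, embeddings)
print(len(merged), len(merged[0]), len(merged[0][0]))
```

The last dimension is the sum of the two input sizes, so the LSTM that follows sees a 695-dimensional vector per word.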

In [None]:
superposmodel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
history = superposmodel.fit(X_train_indices, Y_super_train_oh, epochs = 20, batch_size = 32, shuffle=True, validation_data=(X_dev_indices,Y_super_dev_oh))

### Output progress

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model train vs validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

In [None]:
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model train vs validation accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

In [None]:
superposmodel.save('superpostagger.h5')

## Two-level supertag model

Use the probabilities over the supertags produced by the first model as input to a bi-directional LSTM. We should probably add the probabilities over the part-of-speech tags and/or the word vector information as well. This means adding the output of layer `model_1` (POS probabilities, 32 dimensions), `embedding_3` (word vectors, 663 dimensions) or `concatenate_1` (both combined, 695 dimensions), respectively.
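
As a rough check on the dimensions involved, the sketch below concatenates per-word feature vectors the way `concatenate` joins tensors along the last axis. It uses plain Python lists instead of Keras tensors; the helper name `concat_features` is hypothetical, and the sizes are the ones quoted above (32 POS probabilities, 663 embedding dimensions) plus the 891 supertag classes reported at the start of the notebook.

```python
# Feature sizes quoted in this notebook
N_POS = 32      # output of layer model_1
N_EMB = 663     # output of layer embedding_3
N_SUPER = 891   # number of supertag classes (numSuperClasses)

def concat_features(*vectors):
    """Join several per-word feature vectors into one (last-axis concatenation)."""
    merged = []
    for vec in vectors:
        merged.extend(vec)
    return merged

# dummy feature vectors for a single word
pos_probs = [0.0] * N_POS
embedding = [0.0] * N_EMB
super_probs = [0.0] * N_SUPER

pos_emb = concat_features(pos_probs, embedding)       # like concatenate_1
pos_super = concat_features(pos_probs, super_probs)   # POS + supertags
stacked = concat_features(pos_emb, super_probs)       # embeddings + POS + supertags

print(len(pos_emb), len(pos_super), len(stacked))     # 695 923 1586
```

These three widths correspond to the inputs of the second-level LSTM in the variants defined below: supertags only (891), POS plus supertags (923), and the full stack (1586).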

#### Load previously defined model and associated dictionaries

In [None]:
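# keep the loaded dictionaries; assumes load_obj returns the stored object
word_to_vec_map = load_obj('word_to_vec_map')
word_to_index = load_obj('word_to_index')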
superposmodel = load_model('superpostagger.h5')

In [None]:
def Super_two_model(input_shape, super_pos_model, word_to_vec_map, word_to_index):
    """
    Function creating the graph of a supertag model over the output probabilities of another supertag model.
    This is the simplest way to do this, using only the output but none of the internal activations
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    super_pos_model -- the trained supertag model whose output probabilities serve as input
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its fastText vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary 

    Returns:
    model -- a model instance in Keras
    """
     
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape = input_shape, dtype = 'int32')

    # get probability distribution over supertags from super_pos_model
    supertags = super_pos_model(sentence_indices)
#    posout = supertags.layers['model_1'].output
#    embout = supertags.layers['embedding_3'].output

 
    # concatenate with the embeddings
#    merged = concatenate([posout,supertags])
    
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # returning a batch of sequences.
    X = LSTM(128, return_sequences=True)(supertags) 
#    X = BatchNormalization()(X)
    X = Dropout(0.5)(X)
    
    # Add a (time distributed) Dense layer followed by a softmax activation
    X = TimeDistributed(Dense(numSuperClasses, activation='softmax'))(X)
    
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices,outputs=X)
        
    return model

In [None]:
supertwomodel = Super_two_model((maxLen,), superposmodel, word_to_vec_map, word_to_index)
supertwomodel.summary()

In [None]:
supertwomodel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
history = supertwomodel.fit(X_train_indices, Y_super_train_oh, epochs = 30, batch_size = 32, shuffle=True, validation_data=(X_dev_indices,Y_super_dev_oh))

In [None]:
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model train vs validation loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

In [None]:
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model train vs validation accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='upper right')
plt.show()

In [None]:
supertwomodel.save('super2tagger.h5')

In [None]:
del supertwomodel

### Supertagger results on development set

| LSTM units | batchnorm | dropout | epochs | results |
|---------:|:----------:|-------:|----:|----------:|
| 128 | yes | .5 |  20 | 80.98 |
| 128 | no | .5 |  30 |  |



#### Combine POS-tagger with supertagger output

In [None]:
from keras import backend as K

In [None]:
def Superpos_two_model(input_shape, super_pos_model, word_to_vec_map, word_to_index):
    """
    Function creating the graph of a supertag model over the output probabilities of another supertag model.
    This version combines the supertagger output with the output of the POS-tagger
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    super_pos_model -- the trained supertag model, which contains the POS-tagger as layer `model_1`
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its fastText vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary 

    Returns:
    model -- a model instance in Keras
    """

    
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape = input_shape, dtype = 'int32')

    # get probability distribution over supertags from super_pos_model
    supertags = super_pos_model(sentence_indices)
    
    # reuse the POS-tagger sub-model nested inside super_pos_model (layer `model_1`)
    # to get the probability distribution over parts of speech for the same input
    posout = super_pos_model.get_layer('model_1')(sentence_indices)
#    embout = supertags.layers['embedding_3'].output

 
    # concatenate the POS probabilities with the supertag probabilities
    merged = concatenate([posout,supertags])
    
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # returning a batch of sequences.
    X = LSTM(128, return_sequences=True)(merged) 
#    X = BatchNormalization()(X)
    X = Dropout(0.5)(X)
    
    # Add a (time distributed) Dense layer followed by a softmax activation
    X = TimeDistributed(Dense(numSuperClasses, activation='softmax'))(X)
    
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices,outputs=X)
        
    return model

In [None]:
superposmodel.summary()

In [None]:
superpos2model = Superpos_two_model((maxLen,), superposmodel, word_to_vec_map, word_to_index)
superpos2model.summary()

#### Combine word embedding, POS-tagger and supertagger output

In [None]:
def Superstack_model(input_shape, super_pos_model, word_to_vec_map, word_to_index):
    """
    Function creating the graph of a supertag model over the output probabilities of another supertag model.
    This version combines the supertagger output with the word embeddings and the output of the POS-tagger
    
    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    super_pos_model -- the trained supertag model, which contains layer `concatenate_1` (POS probabilities + embeddings)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its fastText vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary 

    Returns:
    model -- a model instance in Keras
    """

   
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape = input_shape, dtype = 'int32')

    # get probability distribution over supertags from super_pos_model
    supertags = super_pos_model(sentence_indices)
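#    posout = supertags.layers['model_1'].output
#    embout = supertags.layers['embedding_3'].output
    # expose the output of layer `concatenate_1` (POS probabilities + word embeddings)
    feature_model = Model(inputs=super_pos_model.input,
                          outputs=super_pos_model.get_layer('concatenate_1').output)
    pos_emb = feature_model(sentence_indices)
 
    # concatenate POS probabilities and embeddings with the supertag probabilities
    merged = concatenate([pos_emb,supertags])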
    
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # returning a batch of sequences.
    X = LSTM(128, return_sequences=True)(merged) 
#    X = BatchNormalization()(X)
    X = Dropout(0.5)(X)
    
    # Add a (time distributed) Dense layer followed by a softmax activation
    X = TimeDistributed(Dense(numSuperClasses, activation='softmax'))(X)
    
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices,outputs=X)
        
    return model

### Auxiliary memory management functions

`show_mem_usage()` shows the (somewhat larger) objects in the notebook memory, which can then be manually deleted using `del`. 

In [None]:
import sys

def show_mem_usage():
    '''Displays memory usage from inspection
    of global variables in this notebook'''
    gl = sys._getframe(1).f_globals
    # maps object id to [size, name1, name2, ...]; avoids shadowing the builtin vars()
    sizes = {}
    for k,v in list(gl.items()):
        # for pandas dataframes
        if hasattr(v, 'memory_usage'):
            mem = v.memory_usage(deep=True)
            if not np.isscalar(mem):
                mem = mem.sum()
            sizes.setdefault(id(v),[mem]).append(k)
            # size already recorded; skip the generic case so the name is not listed twice
            continue
        # work around for a bug (pd.Panel only exists in older pandas versions)
        elif hasattr(pd, 'Panel') and isinstance(v,pd.Panel):
            v = v.values
        sizes.setdefault(id(v),[sys.getsizeof(v)]).append(k)
    total = 0
    for k,(value,*names) in sizes.items():
        if value>1e6:
            print(names,"%.3fMB"%(value/1e6))
        total += value
    print("%.3fMB"%(total/1e6))

In [None]:
show_mem_usage()
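
`sys.getsizeof` reports only the shallow size of an object, which is why `show_mem_usage()` handles pandas objects separately through `memory_usage(deep=True)`. The following minimal sketch ranks the entries of a namespace dictionary by shallow size; the helper `largest_objects` and the toy namespace are hypothetical and only illustrate the idea.

```python
import sys

def largest_objects(namespace, top=3):
    """Return (name, shallow size in bytes) pairs, largest first."""
    sizes = [(name, sys.getsizeof(obj)) for name, obj in namespace.items()]
    sizes.sort(key=lambda pair: pair[1], reverse=True)
    return sizes[:top]

toy_namespace = {
    'big_list': list(range(100000)),
    'small_int': 1,
    'short_text': 'abc',
}

ranking = largest_objects(toy_namespace)
for name, size in ranking:
    print(name, "%.3fMB" % (size / 1e6))
```

In the notebook itself, `largest_objects(globals())` would give a comparable overview, with the caveat that the contents of containers are not counted.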

#### Garbage collection

Call garbage collection `gc.collect()` to free unused memory.

In [None]:
import gc

gc.collect()
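
`gc.collect()` matters mostly for objects caught in reference cycles, which reference counting alone cannot free; it returns the number of unreachable objects it found. A minimal sketch with a hypothetical `Node` class:

```python
import gc

class Node:
    """Toy object that can point at another Node."""
    def __init__(self):
        self.other = None

def make_cycle():
    a, b = Node(), Node()
    a.other, b.other = b, a    # a and b now keep each other alive

gc.collect()                   # start from a clean state
gc.disable()                   # avoid an automatic collection in between
make_cycle()                   # the cycle is unreachable once the call returns
freed = gc.collect()
gc.enable()
print("unreachable objects found:", freed)
```

The exact count depends on the Python version, so the sketch prints it rather than assuming a value.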

## Compute prefixes and postfixes

Auxiliary code to compute all prefixes and postfixes (suffixes) in the vocabulary that occur at least `Cutoff` times. The constant `*OOR*` is used for out-of-range affixes (such as the four-character prefix of a three-letter word), and `*UNK*` is used for affixes that were never seen or were seen fewer than `Cutoff` times.

In [11]:
import re

Cutoff = 2

def trim_dict(d, min_count=Cutoff):
    for k,v in list(d.items()):
        if v < min_count:
            del d[k]
    d['*UNK*'] = 1
    d['*OOR*'] = 1
    return d

suffixcount1={}
suffixcount2={}
suffixcount3={}
suffixcount4={}
suffixcount5={}
suffixcount6={}
suffixcount7={}
prefixcount1={}
prefixcount2={}
prefixcount3={}
prefixcount4={}

for word in vocabulary:
    # convert to lower case and replace all digits by '9'
    word = word.lower()
    word = re.sub(r'[0-8]', '9', word)
    # take suffixes of length 7 or smaller (to distinguish between 'eraient' and 'aient')    
    suf1 = word[-1:]
    suf2 = word[-2:]
    suf3 = word[-3:]
    suf4 = word[-4:]
    suf5 = word[-5:]
    suf6 = word[-6:]
    suf7 = word[-7:]
    # take prefixes of length 4 or smaller    
    pref1 = word [:1]
    pref2 = word [:2]
    pref3 = word [:3]
    pref4 = word [:4]
    
    # update all counters for the computed affixes
    if len(suf1) > 0:
        if suf1 not in suffixcount1:
            suffixcount1[suf1] = 1
        else:
            suffixcount1[suf1] += 1

    if len(suf2) > 1: 
        if suf2 not in suffixcount2:
            suffixcount2[suf2] = 1
        else:
            suffixcount2[suf2] += 1

    if len(suf3) > 2: 
        if suf3 not in suffixcount3:
            suffixcount3[suf3] = 1
        else:
            suffixcount3[suf3] += 1

    if len(suf4) > 3: 
        if suf4 not in suffixcount4:
            suffixcount4[suf4] = 1
        else:
            suffixcount4[suf4] += 1

    if len(suf5) > 4: 
        if suf5 not in suffixcount5:
            suffixcount5[suf5] = 1
        else:
            suffixcount5[suf5] += 1
    if len(suf6) > 5: 
        if suf6 not in suffixcount6:
            suffixcount6[suf6] = 1
        else:
            suffixcount6[suf6] += 1
    if len(suf7) > 6: 
        if suf7 not in suffixcount7:
            suffixcount7[suf7] = 1
        else:
            suffixcount7[suf7] += 1
    if len(pref1) > 0:
        if pref1 not in prefixcount1:
            prefixcount1[pref1] = 1
        else:
            prefixcount1[pref1] += 1

    if len(pref2) > 1:
        if pref2 not in prefixcount2:
            prefixcount2[pref2] = 1
        else:
            prefixcount2[pref2] += 1

    if len(pref3) > 2:
        if pref3 not in prefixcount3:
            prefixcount3[pref3] = 1
        else:
            prefixcount3[pref3] += 1
    if len(pref4) > 3:
        if pref4 not in prefixcount4:
            prefixcount4[pref4] = 1
        else:
            prefixcount4[pref4] += 1


suffixcount1 = trim_dict(suffixcount1)
suffixcount2 = trim_dict(suffixcount2)
suffixcount3 = trim_dict(suffixcount3)
suffixcount4 = trim_dict(suffixcount4)
suffixcount5 = trim_dict(suffixcount5)
suffixcount6 = trim_dict(suffixcount6)
suffixcount7 = trim_dict(suffixcount7)

prefixcount1 = trim_dict(prefixcount1)
prefixcount2 = trim_dict(prefixcount2)
prefixcount3 = trim_dict(prefixcount3)
prefixcount4 = trim_dict(prefixcount4)

suffix1 = set(suffixcount1.keys())
suffix2 = set(suffixcount2.keys())
suffix3 = set(suffixcount3.keys())
suffix4 = set(suffixcount4.keys())
suffix5 = set(suffixcount5.keys())
suffix6 = set(suffixcount6.keys())
suffix7 = set(suffixcount7.keys())

prefix1 = set(prefixcount1.keys())
prefix2 = set(prefixcount2.keys())
prefix3 = set(prefixcount3.keys())
prefix4 = set(prefixcount4.keys())
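
The counting loop above repeats the same pattern for every affix length. As an alternative sketch (not the code used for the experiments), the same counts can be collected with `collections.Counter`, keyed by affix length. The function name `count_affixes` and the toy vocabulary are hypothetical; the normalisation and the `*UNK*`/`*OOR*` entries follow the cell above.

```python
import re
from collections import Counter

def count_affixes(vocab, max_len, suffix=False, min_count=2):
    """Map each affix length to a dict of affixes seen at least min_count times."""
    counts = {n: Counter() for n in range(1, max_len + 1)}
    for word in vocab:
        # same normalisation as above: lower case, all digits become '9'
        word = re.sub(r'[0-8]', '9', word.lower())
        for n in range(1, max_len + 1):
            if len(word) >= n:
                affix = word[-n:] if suffix else word[:n]
                counts[n][affix] += 1
    trimmed = {}
    for n, counter in counts.items():
        kept = {a: c for a, c in counter.items() if c >= min_count}
        kept['*UNK*'] = 1
        kept['*OOR*'] = 1
        trimmed[n] = kept
    return trimmed

toy_vocab = ['parlaient', 'mangeaient', 'chat', '1984', '2017']
suffixes = count_affixes(toy_vocab, max_len=7, suffix=True)
print(sorted(suffixes[5]))   # ['*OOR*', '*UNK*', 'aient']
```

With this layout, `set(suffixes[n])` plays the role of `suffix1` to `suffix7`, and calling the function with `suffix=False, max_len=4` gives the prefix sets.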



In [12]:
p1_to_integer, integer_to_p1 = indexify(prefix1)
p2_to_integer, integer_to_p2 = indexify(prefix2)
p3_to_integer, integer_to_p3 = indexify(prefix3)
p4_to_integer, integer_to_p4 = indexify(prefix4)

s1_to_integer, integer_to_s1 = indexify(suffix1)
s2_to_integer, integer_to_s2 = indexify(suffix2)
s3_to_integer, integer_to_s3 = indexify(suffix3)
s4_to_integer, integer_to_s4 = indexify(suffix4)
s5_to_integer, integer_to_s5 = indexify(suffix5)
s6_to_integer, integer_to_s6 = indexify(suffix6)
s7_to_integer, integer_to_s7 = indexify(suffix7)

In [13]:
def word_to_prefvec(word, alen, afset, af_to_int):
    # one-hot vector for the prefix of length alen (idx avoids shadowing the builtin int)
    if len(word) >= alen:
        pref = word[:alen]
        if pref in afset:
            idx = af_to_int[pref]
        else:
            idx = af_to_int['*UNK*']
    else:
        idx = af_to_int['*OOR*']
    return to_categorical(idx, len(afset)+1)


In [14]:
def word_to_sufvec(word, alen, afset, af_to_int):
    # one-hot vector for the suffix of length alen (idx avoids shadowing the builtin int)
    if len(word) >= alen:
        suf = word[-alen:]
        if suf in afset:
            idx = af_to_int[suf]
        else:
            idx = af_to_int['*UNK*']
    else:
        idx = af_to_int['*OOR*']
    return to_categorical(idx, len(afset)+1)
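
To make the three cases concrete, here is a minimal sketch of the lookup performed by `word_to_prefvec` and `word_to_sufvec`, written with plain Python lists instead of `to_categorical`. The index mapping below is made up for illustration; in the notebook it comes from `indexify`, whose exact numbering is not shown here (the extra slot in `len(afset)+1` suggests that indices may start at 1, which is an assumption).

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at position index."""
    vec = [0] * size
    vec[index] = 1
    return vec

def suffix_vector(word, length, af_to_int):
    """One-hot encode the suffix of the given length, with *OOR* and *UNK* fallbacks."""
    if len(word) < length:
        key = '*OOR*'              # word is too short to have this suffix
    else:
        key = word[-length:]
        if key not in af_to_int:
            key = '*UNK*'          # suffix was trimmed or never seen
    return one_hot(af_to_int[key], len(af_to_int) + 1)

# hypothetical index mapping for suffixes of length 3
s3_toy = {'*OOR*': 1, '*UNK*': 2, 'ent': 3, 'ait': 4}

print(suffix_vector('parlaient', 3, s3_toy))  # known suffix 'ent' -> [0, 0, 0, 1, 0]
print(suffix_vector('chat', 3, s3_toy))       # unseen suffix 'hat' -> [0, 0, 1, 0, 0]
print(suffix_vector('le', 3, s3_toy))         # too short           -> [0, 1, 0, 0, 0]
```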


In [15]:
def word_to_prefix_vector(word):
    p1 = word_to_prefvec(word, 1, prefix1, p1_to_integer)
    p2 = word_to_prefvec(word, 2, prefix2, p2_to_integer)
    p3 = word_to_prefvec(word, 3, prefix3, p3_to_integer)
    p4 = word_to_prefvec(word, 4, prefix4, p4_to_integer)
    return np.concatenate((p1,p2,p3,p4))

def word_to_suffix_vector(word):
    s1 = word_to_sufvec(word, 1, suffix1, s1_to_integer)
    s2 = word_to_sufvec(word, 2, suffix2, s2_to_integer)
    s3 = word_to_sufvec(word, 3, suffix3, s3_to_integer)
    s4 = word_to_sufvec(word, 4, suffix4, s4_to_integer)
    s5 = word_to_sufvec(word, 5, suffix5, s5_to_integer)
    s6 = word_to_sufvec(word, 6, suffix6, s6_to_integer)
    s7 = word_to_sufvec(word, 7, suffix7, s7_to_integer)
    return np.concatenate((s1,s2,s3,s4,s5,s6,s7))

def word_to_affix_vector(word):
    p1 = word_to_prefvec(word, 1, prefix1, p1_to_integer)
    p2 = word_to_prefvec(word, 2, prefix2, p2_to_integer)
    p3 = word_to_prefvec(word, 3, prefix3, p3_to_integer)
    p4 = word_to_prefvec(word, 4, prefix4, p4_to_integer)
    s1 = word_to_sufvec(word, 1, suffix1, s1_to_integer)
    s2 = word_to_sufvec(word, 2, suffix2, s2_to_integer)
    s3 = word_to_sufvec(word, 3, suffix3, s3_to_integer)
    s4 = word_to_sufvec(word, 4, suffix4, s4_to_integer)
    s5 = word_to_sufvec(word, 5, suffix5, s5_to_integer)
    s6 = word_to_sufvec(word, 6, suffix6, s6_to_integer)
    s7 = word_to_sufvec(word, 7, suffix7, s7_to_integer)
    return np.concatenate((p1,p2,p3,p4,s1,s2,s3,s4,s5,s6,s7))

In [16]:
def compute_affixes(vocab):
    
    word_to_suffix = {}
    word_to_prefix = {}

    for word in vocab:
        w = word.lower()
        w = re.sub(r'[0-8]', '9', w)
        pvec = word_to_prefix_vector(w)
        svec = word_to_suffix_vector(w)
        word_to_prefix[word] = pvec
        word_to_suffix[word] = svec
        
    return word_to_prefix, word_to_suffix    

In [17]:
word_to_prefix, word_to_suffix = compute_affixes(vocabulary)

In [None]:
# Super_model
# this is a direct supertag model not using the part-of-speech tags

def Super_affix_model(input_shape, word_to_vec_map, word_to_prefix, word_to_suffix, word_to_index):
    """
    Function creating the direct supertagger model's graph
    
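    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its fastText vector representation
    word_to_prefix -- dictionary mapping every word in a vocabulary to its concatenated one-hot prefix vector
    word_to_suffix -- dictionary mapping every word in a vocabulary to its concatenated one-hot suffix vector
    word_to_index -- dictionary mapping from words to their indices in the vocabulary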

    Returns:
    model -- a model instance in Keras
    """
    
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape = input_shape, dtype = 'int32')
    
    # Create the embedding layer pretrained with fastText vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    prefix_emb = pretrained_embedding_layer(word_to_prefix, word_to_index)
    suffix_emb = pretrained_embedding_layer(word_to_suffix, word_to_index)
    
    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices)   
    
    pref = prefix_emb(sentence_indices)
    suff = suffix_emb(sentence_indices)
    P = Dense(32)(pref)
    P = Dropout(0.5)(P)
    S = Dense(32)(suff)
    S = Dropout(0.5)(S)
    merged = concatenate([embeddings,P,S])
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # returning a batch of sequences.
    X = LSTM(128, return_sequences=True)(merged) 
    X = BatchNormalization()(X)
    X = Dropout(0.5)(X)
    X = LSTM(128, return_sequences=True)(X) 
    X = BatchNormalization()(X)
    X = Dropout(0.5)(X)

    # Add a (time distributed) Dense layer followed by a softmax activation
    X = TimeDistributed(Dense(32,kernel_constraint=max_norm(5.),kernel_regularizer=regularizers.l2(0.0001)))(X)
    X = TimeDistributed(Dense(numSuperClasses, activation='softmax',kernel_constraint=max_norm(5.),kernel_regularizer=regularizers.l2(0.0001)))(X)
    
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices,outputs=X)
        
    return model

In [35]:
# Super_model
# this is a direct supertag model not using the part-of-speech tags

def Super_affix_model(input_shape, word_to_vec_map, word_to_prefix, word_to_suffix, word_to_index, Reg):
    """
    Function creating the direct supertagger model's graph
    
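    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its fastText vector representation
    word_to_prefix -- dictionary mapping every word in a vocabulary to its concatenated one-hot prefix vector
    word_to_suffix -- dictionary mapping every word in a vocabulary to its concatenated one-hot suffix vector
    word_to_index -- dictionary mapping from words to their indices in the vocabulary
    Reg -- kernel regularizer applied to the LSTM and Dense layers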

    Returns:
    model -- a model instance in Keras
    """
    
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape = input_shape, dtype = 'int32')
    
    # Create the embedding layer pretrained with fastText vectors
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)
    
    prefix_emb = pretrained_embedding_layer(word_to_prefix, word_to_index)
    suffix_emb = pretrained_embedding_layer(word_to_suffix, word_to_index)
    
    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices)   
    
    pref = prefix_emb(sentence_indices)
    suff = suffix_emb(sentence_indices)
    P = Dense(32)(pref)
    S = Dense(32)(suff)
    merged = concatenate([embeddings,P,S])
    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # returning a batch of sequences.
    #  recurrent_dropout=0.2
    # kernel_constraint=maxnorm(3)  # recommended for layer before dropout
    X = LSTM(128, dropout=0.5, recurrent_activation='sigmoid', kernel_regularizer=Reg, return_sequences=True)(merged) 
#    X = BatchNormalization()(X)
    X = LSTM(128, recurrent_activation='sigmoid', kernel_regularizer=Reg, return_sequences=True)(X) 
#    X = BatchNormalization()(X)

    # Add a (time distributed) Dense layer followed by a softmax activation
    X = TimeDistributed(Dense(32,kernel_constraint=max_norm(5.),kernel_regularizer=Reg))(X)
    X = TimeDistributed(Dense(numSuperClasses, activation='softmax',kernel_constraint=max_norm(5.),kernel_regularizer=Reg))(X)
    X = Dropout(.2)(X)
    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices,outputs=X)
        
    return model

In [39]:
supermodel = Super_affix_model((maxLen,), word_to_vec_map, word_to_prefix, word_to_suffix, word_to_index, regularizers.l2(0.000001))
supermodel.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 266)          0                                            
__________________________________________________________________________________________________
embedding_9 (Embedding)         (None, 266, 5788)    175387976   input_3[0][0]                    
__________________________________________________________________________________________________
embedding_10 (Embedding)        (None, 266, 14983)   454014866   input_3[0][0]                    
__________________________________________________________________________________________________
embedding_8 (Embedding)         (None, 266, 429)     12999558    input_3[0][0]                    
__________________________________________________________________________________________________
dense_9 (D

In [40]:
supermodel.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
history = supermodel.fit(X_train_indices, Y_super_train_oh, epochs = 50, batch_size = 32, shuffle=True, validation_data=(X_dev_indices,Y_super_dev_oh))

Train on 9449 samples, validate on 3150 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
2112/9449 [=====>........................] - ETA: 10:24 - loss: 0.7252 - acc: 0.8294


### Evaluation

Two-layer 128-unit model with L2 norm of 10<sup>-5</sup>, and only one dropout layer of .5 for the input.

| embedding | dimension | LSTM units | batchnorm | dropout | L2 | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|-----:|----------:|
| cwindow | 50 | 128+128 | no | .5+0 | 10<sup>-5</sup> | 10 | | |

Two-layer 128-unit model with a dropout of .5 for the two LSTM layers, and a dropout of .5 for the dense affix layers. Added a second 32-unit `Dense` layer and added regularization (L2 of 0.0001) plus a constraint (`max_norm` of 5) to the final two `Dense` layers.

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 128+128 | yes | .5+.5 |  10 | 80.64 | 81.76 |
| cwindow | 50 | 128+128 | yes | .5+.5 |  20 | 82.80 | 82.29 |
| cwindow | 50 | 128+128 | yes | .5+.5 |  30 | 84.08 | 82.60 |
| cwindow | 50 | 128+128 | yes | .5+.5 |  30 | 85.32 | 82.69 |
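
For readers unfamiliar with the two mechanisms mentioned above, the following sketch shows their effect on a single weight vector in plain Python. It is an illustration only, not Keras code: `max_norm` rescales a weight vector whose L2 norm exceeds the limit, and the L2 regularizer adds a penalty proportional to the sum of squared weights to the loss. The helper names are hypothetical, and the small epsilon that Keras uses for numerical stability is left out.

```python
import math

def apply_max_norm(weights, max_value=5.0):
    """Rescale weights so that their L2 norm is at most max_value."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_value:
        return list(weights)
    scale = max_value / norm
    return [w * scale for w in weights]

def l2_penalty(weights, factor=0.0001):
    """Penalty added to the loss by an L2 regularizer with the given factor."""
    return factor * sum(w * w for w in weights)

incoming = [6.0, 8.0]                 # L2 norm 10, above the limit of 5
print(apply_max_norm(incoming))       # rescaled to [3.0, 4.0], norm 5
print(apply_max_norm([0.3, 0.4]))     # norm 0.5, left unchanged
print(l2_penalty(incoming))           # 0.0001 times the sum of squares (100)
```

The constraint caps individual weight vectors outright, while the penalty only nudges the optimiser towards smaller weights, which is why the two are often combined.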


Two-layer 128-unit model with a dropout of .5 for the two LSTM layers, and a dropout of .5 for the dense affix layers. It reaches a maximum of 83.00 on the development data, but then overfits and stagnates.

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 128+128 | yes | .5+.5 |  10 | 81.16 | 82.04 |
| cwindow | 50 | 128+128 | yes | .5+.5 | 20 | 83.09 | 82.82 |
| cwindow | 50 | 128+128 | yes | .5+.5 | 30 | 84.34 | 82.91 |

Two-layer 128-unit model with a dropout of .5 for the two LSTM layers, and a dropout of .4 for the dense affix layers

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 128+128 | yes | .5+.5 |  10 | 81.41 | 82.27 |
| cwindow | 50 | 128+128 | yes | .5+.5 |  20 | 83.66 | 82.83 |
| cwindow | 50 | 128+128 | yes | .5+.5 |  30 | 85.79 | 82.77 |

Two-layer 128-unit model with a dropout of .5 for the two LSTM layers, and a dropout of .2 for the dense affix layers. Although the results for the development set are slightly higher than other settings at their peak (close to 82.5), there is still a considerable amount of overfitting after 20 epochs.

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 128+128 | yes | .5+.5 |  10 | 81.98 | 82.14 |
| cwindow | 50 | 128+128 | yes | .5+.5 |  20 | 84.67 | 82.08 |

Two-layer 32-unit model with dropout of .2

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 32+32 | yes | .2+.2 |  10 | 81.43 | 80.82 |
| cwindow | 50 | 32+32 | yes | .2+.2 |  30 | 85.63 | 80.89 |
| cwindow | 50 | 32+32 | yes | .2+.2 |  50 | 88.24 | 79.75 |

Two-layer 32-unit model with dropout of .5. Appears to underfit after 10 epochs: progress on the training data is very slow, and although the development data scores somewhat better (as usual), performance is still nowhere near good.

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 32+32 | yes | .5+.5 |  10 | 73.36 | 78.06 |


Retraining 64-unit model with dropout of .6

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 64 | yes | .6 |  10 | 81.97 | 81.05 |

Retraining 64-unit model with dropout of .5

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 64 | yes | .5 |  10 | 83.35 | 81.11 |



Since the 128-unit LSTM model showed significant overfitting, this run trains a smaller 64-unit model with a lower dropout (.2). It still overfits considerably, even after only 10 epochs.

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 64 | yes | .2 |  10 | 85.66 | 80.85 |

First attempt with affix features. Clear overfitting from 10 epochs on.

| embedding | dimension | LSTM units | batchnorm | dropout | epochs | train | devel |
|:-----|-----:|---------:|:----------:|-------:|----:|-----:|----------:|
| cwindow | 50 | 128 | yes | .5 |  10 | 84.84 | 81.17 |
| cwindow | 50 | 128 | yes | .5 |  30 | 91.56 | 79.65 |
| cwindow | 50 | 128 | yes | .5 |  50 | 94.41 | 78.97 |
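
Several of the tables above show development accuracy peaking while training accuracy keeps rising. A small helper like the hypothetical `best_epoch` below makes that peak and the train/dev gap explicit. It is a sketch that takes lists shaped like `history.history['acc']` and `history.history['val_acc']`; the example numbers are the rows for epochs 10, 20 and 30 of the run with a dropout of .4 on the dense affix layers.

```python
def best_epoch(epochs, train_acc, dev_acc):
    """Return (epoch, dev accuracy, train/dev gap) at the best dev accuracy."""
    best = max(range(len(dev_acc)), key=lambda i: dev_acc[i])
    gap = round(train_acc[best] - dev_acc[best], 2)
    return epochs[best], dev_acc[best], gap

epochs = [10, 20, 30]
train_acc = [81.41, 83.66, 85.79]
dev_acc = [82.27, 82.83, 82.77]

print(best_epoch(epochs, train_acc, dev_acc))   # (20, 82.83, 0.83)
```

In Keras, the `EarlyStopping` callback can stop training automatically near this point by monitoring the validation metric.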



In [None]:
weights = supermodel.trainable_weights

In [None]:
get_gradients = supermodel.optimizer.get_gradients(supermodel.total_loss, weights)

In [None]:
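input_tensors = [
    # input data
    supermodel.inputs[0],
    # how much to weight each sample by
    supermodel.sample_weights[0],
    # labels
    supermodel.targets[0],
    # train or test mode
    K.learning_phase()
]

# backend function mapping the input tensors to the gradients of the loss
grad_fct = K.function(inputs=input_tensors, outputs=get_gradients)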


In [None]:
print(weights[0])
print(np.max(K.get_value(weights[0])))
print(np.min(K.get_value(weights[0])))

print(weights[1])
print(np.max(K.get_value(weights[1])))
print(np.min(K.get_value(weights[1])))

print(weights[2])
print(np.max(K.get_value(weights[2])))
print(np.min(K.get_value(weights[2])))

print(weights[3])
print(np.max(K.get_value(weights[3])))
print(np.min(K.get_value(weights[3])))



In [None]:
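def average_gradient_norm(grad_fct, data, num_steps=32):
    """Average the global L2 norm of the gradients over num_steps batches.

    grad_fct -- backend function returning the gradients for
                [inputs, sample weights, labels, learning phase]
    data -- generator yielding (inputs, labels) batches
    """
    steps = 0
    total_norm = 0
    s_w = None
    while steps < num_steps:
        X_batch, y_batch = next(data)
        # set sample weights to one
        # for every input
        if s_w is None:
            s_w = np.ones(X_batch.shape[0])

        # learning phase 0 = test mode (dropout disabled)
        gradients = grad_fct([X_batch, s_w, y_batch, 0])
        total_norm += np.sqrt(np.sum([np.sum(np.square(g)) for g in gradients]))
        steps += 1

    return total_norm / float(steps)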
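
The quantity accumulated in the loop above is the global L2 norm of all gradients: the square root of the sum of squared entries across every weight tensor. A minimal sketch in plain Python, with nested lists standing in for the gradient arrays (the helper names `flatten` and `global_norm` are hypothetical):

```python
import math

def flatten(values):
    """Yield every number in an arbitrarily nested list."""
    for v in values:
        if isinstance(v, (list, tuple)):
            yield from flatten(v)
        else:
            yield v

def global_norm(gradients):
    """Square root of the sum of squared entries over all gradient tensors."""
    return math.sqrt(sum(g * g for g in flatten(gradients)))

# two toy "gradient tensors": a 1x2 matrix and a bias vector
toy_gradients = [[[3.0, 4.0]], [12.0]]
print(global_norm(toy_gradients))   # sqrt(9 + 16 + 144) = 13.0
```

A norm that grows steadily over training would point towards exploding gradients, while a norm close to zero suggests that the weights have stopped changing.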