# Parts of Speech Tagger using Deep Learning

In my previous trial, I build a POS tagger that uses Hidden Markov Model based probabilistic model. Here, I will be building POS tagger that uses deep learning models e.g. LSTM, Bi-LSTM,GRU. I will break down whole project into steps so that you can understand it better and I can debug better. I can think of dividing it into four steps : 1.) Get word embeddings  2.) Prepare data for training 3.) Train and 4.) Test (Evaluate the result)

Step 1 : I will be using Glove word embedding developed at Stanford. One can use any word embedding like word2vec developed by Google. 

# Step 1 : Dealing with Word embeddings

In [1]:
import os
import pickle
import sys
import numpy as np
from sklearn.metrics import confusion_matrix

Download the glove.6B.100d.txt file from https://nlp.stanford.edu/projects/glove/ . This file consists of word separated by it's embedding. 


In [2]:
word_embedding = {} #This is a dictionary whose key will be word and value will be it's embedding
word_embedding_file = open('glove.6B/glove.6B.100d.txt', encoding="utf8") # This just opend the file 'glove.6B.100d.txt'

for line in word_embedding_file:
    temp = line.split() # Convert each line into a list of elements. First ele is word and rest is embedding. 
    word = temp[0]                                    # Word is copied
    embedding = np.asarray(temp[1:], dtype='float32') # Embedding is copied
    word_embedding[word] = embedding # word is made the key of the dictionary and embedding is made the value of the dictionary


After opening the file, iterate through each line in it. Note that each line is in the form of a word separated by it's embedding by space e.g. play 12.9 10.5 11.8 192.8 ... 
So, line.split() just converts above example into a list \[ play, 12.9, 10.5, 11.8, 192.8, ...\]

In [3]:
if not os.path.exists('PickledWordEmbeddings/'):
    os.makedirs('PickledWordEmbeddings/')
    print("Does not exist.. so making the directory")

Check if PickledWordEmbeddings directory exists or not. If PickledWordEmbeddings does not exist make a directory with this name.

In [4]:
with open('PickledWordEmbeddings/Glove.pkl', 'wb') as f:
    pickle.dump(word_embedding, f)

Open the Glove.pkl file (if this file does not exist then open() funtion will create a file with this name) in PickledWordEmbeddings/ directory and dump the dictionary (word_embedding) we created in this file. Note that we'll have to fetch these embeddings in a dictionary only when we load the Glove.pkl file.

Step 2 : Now we are done with saving the word embeddings on the disk, we move on to prepare the data required for training the neural network model. We'll get the data from the famous Brown Corpus. We can get this data from nltk library from Stanford (https://www.nltk.org/data.html : visit this link to know more about downloading).

# Brown Corpus preprocessing

In [5]:
list_of_files = os.listdir('brown/') # lists names of all files present in the brown/ folder.

temp_corpus = '' # Contains all lines from all files separated by \n

for file in list_of_files[0:100]:  # There are around 500 files but I am just using 100 for the sake of making it small in scale
    with open('brown/' + file) as f:
        temp_corpus = temp_corpus + '\n' + f.read() # Here actual reading of content of file occured

corpus = temp_corpus.split('\n')

In [6]:
sentences_words = [] # each element is a list of words present in a single sentence 
sentences_tags = [] # each element is a list of tags present in a single sentence

#We need words in each line separately as we are going to feed a sentence in the neural network not an individual word
for line in corpus:
    if(len(line)>0):
        words_in_a_line = [] #temporary storage for words in a line
        tags_in_a_line = [] #temporary storage for tags in a line
        for word in line.split():
            try:            
                w, tag = word.split('/')
            except: # If the w/tag form is not present by mistake
                break

            words_in_a_line.append(w.lower())
            tags_in_a_line.append(tag)
        
        sentences_words.append(words_in_a_line) 
        sentences_tags.append(tags_in_a_line)

# print(sentences_words[0])
# print(sentences_tags[0])


In [7]:
#including words from the test file into voacb
list_of_files = os.listdir('test/') # lists names of all files present in the brown/ folder.

temp_corpus = '' # Contains all lines from all files separated by \n

for file in list_of_files:  # There are around 500 files but I am just using 100 for the sake of making it small in scale
    with open('test/' + file) as f:
        temp_corpus = temp_corpus + '\n' + f.read() # Here actual reading of content of file occured

corpus = temp_corpus.split('\n')

sentences_words_temp = sentences_words # each element is a list of words present in a single sentence 
sentences_tags_temp = sentences_tags # each element is a list of tags present in a single sentence

#We need words in each line separately as we are going to feed a sentence in the neural network not an individual word
for line in corpus:
    if(len(line)>0):
        words_in_a_line = [] #temporary storage for words in a line
        tags_in_a_line = [] #temporary storage for tags in a line
        for word in line.split():
            try:            
                w, tag = word.split('/')
            except: # If the w/tag form is not present by mistake
                break

            words_in_a_line.append(w.lower())
            tags_in_a_line.append(tag)
        
        sentences_words_temp.append(words_in_a_line) 
        sentences_tags_temp.append(tags_in_a_line)

# print(sentences_words[0])
# print(sentences_tags[0])


vocab_words = set(sum(sentences_words_temp, [])) #flattening of list is being done followed by finding unique words
vocab_tags = set(sum(sentences_tags_temp, [])) # This gives total number of tags 

# assert len(X_train) == len(Y_train)

word2int = {}
int2word = {}

word2int = dict((w, i) for i, w in enumerate(vocab_words))
int2word = dict((i, w) for i, w in enumerate(vocab_words))

tag2int = {}
int2tag = {}

tag2int = dict((t, i) for i, t in enumerate(vocab_tags))
int2tag = dict((i, t) for i, t in enumerate(vocab_tags))

sentences_words_num = [[word2int[word] for word in sentence] for sentence in sentences_words]
sentences_tags_num = [[tag2int[word] for word in sentence] for sentence in sentences_tags]

# print('sample X_train_numberised: ', sentences_words_num[0], '\n')
# print('sample Y_train_numberised: ', sentences_tags_num[0], '\n')

X_train_numberised = np.asarray(sentences_words_num)
Y_train_numberised = np.asarray(sentences_tags_num)

pickle_files = [sentences_words_num, sentences_tags_num, word2int, int2word, tag2int, int2tag]

if not os.path.exists('PickledData/'):
    print('PickledData/ is created to save pickled glove file')
    os.makedirs('PickledData/')

with open('PickledData/data.pkl', 'wb') as f:
    pickle.dump(pickle_files, f)

print('Saved as pickle file')

Saved as pickle file


# Training

In [8]:
from keras.preprocessing.sequence import pad_sequences
from keras.utils.np_utils import to_categorical

from keras.layers import Embedding
from keras.layers import Dense, Input
from keras.layers import TimeDistributed
from keras.layers import LSTM, Bidirectional
from keras.models import Model

from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# PARAMETERS ================
MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 100
TEST_SPLIT = 0.2
VALIDATION_SPLIT = 0.2
BATCH_SIZE = 32


with open('PickledData/data.pkl', 'rb') as f:
    X, y, word2int, int2word, tag2int, int2tag = pickle.load(f)


def generator(all_X, all_y, n_classes, batch_size=BATCH_SIZE):
    num_samples = len(all_X)

    while True:

        for offset in range(0, num_samples, batch_size):
            
            X = all_X[offset:offset+batch_size]
            y = all_y[offset:offset+batch_size]

            y = to_categorical(y, num_classes=n_classes)


            yield shuffle(X, y)


n_tags = len(tag2int)
print(n_tags)
X = pad_sequences(X, maxlen=MAX_SEQUENCE_LENGTH)
y = pad_sequences(y, maxlen=MAX_SEQUENCE_LENGTH)

# y = to_categorical(y, num_classes=len(tag2int) + 1)

# print('TOTAL TAGS', len(tag2int))
# print('TOTAL WORDS', len(word2int))

# shuffle the data
X, y = shuffle(X, y)

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SPLIT,random_state=42)

# split training data into train and validation
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=VALIDATION_SPLIT, random_state=1)

n_train_samples = X_train.shape[0]
n_val_samples = X_val.shape[0]
n_test_samples = X_test.shape[0]

# print('We have %d TRAINING samples' % n_train_samples)
# print('We have %d VALIDATION samples' % n_val_samples)
# print('We have %d TEST samples' % n_test_samples)

# make generators for training and validation
train_generator = generator(all_X=X_train, all_y=y_train, n_classes=n_tags + 1)
validation_generator = generator(all_X=X_val, all_y=y_val, n_classes=n_tags + 1)



with open('PickledWordEmbeddings/Glove.pkl', 'rb') as f:
	embeddings_index = pickle.load(f)

# print('Total %s word vectors.' % len(embeddings_index))

# + 1 to include the unkown word
embedding_matrix = np.random.random((len(word2int) + 1, EMBEDDING_DIM))

for word, i in word2int.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embeddings_index will remain unchanged and thus will be random.
        embedding_matrix[i] = embedding_vector

# print('Embedding matrix shape', embedding_matrix.shape)
# print('X_train shape', X_train.shape)

embedding_layer = Embedding(len(word2int) + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)
sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)

l_lstm = Bidirectional(LSTM(64, return_sequences=True))(embedded_sequences)
preds = TimeDistributed(Dense(n_tags + 1, activation='softmax'))(l_lstm)
model = Model(sequence_input, preds)


model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

# print("model fitting - Bidirectional LSTM")
model.summary()

model.fit_generator(train_generator, 
                     steps_per_epoch=n_train_samples//BATCH_SIZE,
                     validation_data=validation_generator,
                     validation_steps=n_val_samples//BATCH_SIZE,
                     epochs=2,
                     verbose=2,
                     workers=4)

if not os.path.exists('Models/'):
    print('MAKING DIRECTORY Models/ to save model file')
    os.makedirs('Models/')

train = True

if train:
    model.save('Models/model.h5')
    print('MODEL SAVED in Models/ as model.h5')
else:
    from keras.models import load_model
    model = load_model('Models/model.h5')

y_test = to_categorical(y_test, num_classes=n_tags+1)
test_results = model.evaluate(X_test, y_test, verbose=0)
print('TEST LOSS %f \nTEST ACCURACY: %f' % (test_results[0], test_results[1]))



Using TensorFlow backend.


294
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 100, 100)          2029700   
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 128)          84480     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 100, 295)          38055     
Total params: 2,152,235
Trainable params: 2,152,235
Non-trainable params: 0
_________________________________________________________________
Epoch 1/2
 - 95s - loss: 0.5567 - acc: 0.8934 - val_loss: 0.2039 - val_acc: 0.9569
Epoch 2/2
 - 88s - loss: 0.1301 - acc: 0.9708 - val_loss: 0.0926 - val_acc: 0.9770
MODEL SAVED in Models/ as model.h5
TEST LOSS 0.088579 
TEST ACCURACY: 0.977693


# Testing

In [9]:
from keras.models import load_model


with open('PickledData/data.pkl', 'rb') as f:
    X_train, Y_train, word2int, int2word, tag2int, int2tag = pickle.load(f)

    del X_train
    del Y_train


with open('test_pickle_file/test_data.pkl', 'rb') as f:
    test_sentence, y_actual_tag, frequent_pos = pickle.load(f)

y_pred_tag  = []  
    
for sentence in test_sentence:
#     sentence = 'he is a good boy'.split()

    tokenized_sentence = []

    for word in sentence:
        tokenized_sentence.append(word2int[word])

    tokenized_sentence = np.asarray([tokenized_sentence])
    padded_tokenized_sentence = pad_sequences(tokenized_sentence, maxlen=100)

    # print('The sentence is ', sentence)
    # print('The tokenized sentence is ',tokenized_sentence)
    # print('The padded tokenized sentence is ', padded_tokenized_sentence)

    model = load_model('Models/model.h5')

    prediction = model.predict(padded_tokenized_sentence)

#     print(prediction.shape)
    out = []
    for i in range(100-len(sentence),100):
        out.append(int2tag[np.argmax(prediction[0][i])])
        
    y_pred_tag.append(out)
    
y_pred_tag = sum(y_pred_tag,[]) 


actual = []
predicted = []

print("10 Most frequent tags!")
print(frequent_pos)

for idx2, pos in enumerate(y_actual_tag):
    if pos in frequent_pos:
        actual.append(pos)
        predicted.append(y_pred_tag[idx2])

temp_labels = sorted(list(set(actual+predicted)))
print(temp_labels)
cm = confusion_matrix(actual, predicted, temp_labels)
print(cm)



10 Most frequent tags!
['nn', 'in', 'at', 'jj', 'vb', 'ppss', 'nns', 'ppo', 'rb', 'vbd']
['at', 'in', 'jj', 'nn', 'nns', 'ppo', 'ppss', 'rb', 'vb', 'vbd', 'vbn']
[[26  0  0  0  0  0  0  0  0  0  0]
 [ 0 28  0  0  0  0  0  0  0  0  0]
 [ 0  0 11  2  0  0  0  1  0  0  1]
 [ 0  0  0 32  0  0  0  0  0  0  0]
 [ 0  0  0  1 11  0  0  0  0  0  0]
 [ 0  0  0  0  0 10  1  0  0  0  0]
 [ 0  0  0  0  0  0 14  0  0  0  0]
 [ 0  0  0  1  0  0  0  9  0  0  0]
 [ 0  0  1  0  0  0  0  0 13  0  0]
 [ 0  0  0  0  0  0  0  1  0  8  1]
 [ 0  0  0  0  0  0  0  0  0  0  0]]
