### **Objective:**

The process of classifying words into their __parts of speech__ and labeling them accordingly is known as **part-of-speech tagging** or **POS tagging**. This project assigns each word its POS tag using several recurrent neural network (RNN) models.

The analysis involves the following steps.
1. Preprocessing data
2. Using Word Embeddings
3. Building Vanilla RNN model
4. Building LSTM model
5. Building GRU model
6. Building Bidirectional LSTM model
7. Model Evaluation

In [1]:
# Load the following libraries.

import warnings
warnings.filterwarnings("ignore")
import numpy as np
from matplotlib import pyplot as plt
from nltk.corpus import brown
from nltk.corpus import treebank
from nltk.corpus import conll2000
import seaborn as sns
from gensim.models import KeyedVectors
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.layers import TimeDistributed
from tensorflow.keras.layers import LSTM, GRU, Bidirectional, SimpleRNN, RNN
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from nltk.data import find

### **1. Data Preprocessing**

In [2]:
# Load POS tagged corpora from NLTK.

treebank_corpus = treebank.tagged_sents(tagset='universal')

brown_corpus = brown.tagged_sents(tagset='universal')

conll_corpus = conll2000.tagged_sents(tagset='universal')

tagged_sentences = treebank_corpus + brown_corpus + conll_corpus

In [3]:
# Let's look at the data.

tagged_sentences[11]

### **Divide data in words (X) and tags (Y)**

Since this is a **many-to-many** problem, each data point will be a different sentence of the corpora.

Each data point will have multiple words in the **input sequence**. This is what we will refer to as **X**.

Each word will have its corresponding tag in the **output sequence**. This is what we will refer to as **Y**.

Sample dataset:

|                    X                        |                 Y                |
|---------------------------------------------|----------------------------------|
|   Mr. Vinken is chairman of Elsevier        |   NOUN NOUN VERB NOUN ADP NOUN   |
|     We have no useful information           |      PRON VERB DET ADJ NOUN      |
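
As a minimal sketch of this split, the toy example below separates (word, tag) pairs into parallel word and tag lists. The names `toy_tagged_sentences` and `split_words_and_tags` are hypothetical and introduced only for illustration; the data mirrors the two sample rows in the table above.

```python
# Hypothetical toy data in the (word, tag) format returned by NLTK's tagged_sents().
toy_tagged_sentences = [
    [("Mr.", "NOUN"), ("Vinken", "NOUN"), ("is", "VERB"),
     ("chairman", "NOUN"), ("of", "ADP"), ("Elsevier", "NOUN")],
    [("We", "PRON"), ("have", "VERB"), ("no", "DET"),
     ("useful", "ADJ"), ("information", "NOUN")],
]


def split_words_and_tags(tagged_sentences):
    """Return parallel lists of word sequences (X) and tag sequences (Y)."""
    X, Y = [], []
    for sentence in tagged_sentences:
        X.append([word for word, _ in sentence])
        Y.append([tag for _, tag in sentence])
    return X, Y


toy_X, toy_Y = split_words_and_tags(toy_tagged_sentences)
print(toy_X[1])  # ['We', 'have', 'no', 'useful', 'information']
print(toy_Y[1])  # ['PRON', 'VERB', 'DET', 'ADJ', 'NOUN']
```

Each input sequence and its output sequence have the same length, which is what makes the problem many-to-many.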

In [4]:
X = [] # Store input sequence
Y = [] # Store output sequence

for sentence in tagged_sentences:
    X_sentence = []
    Y_sentence = []
    for entity in sentence:         
        X_sentence.append(entity[0])  # entity[0] contains the word
        Y_sentence.append(entity[1])  # entity[1] contains corresponding tag
        
    X.append(X_sentence)
    Y.append(Y_sentence)

In [5]:
num_words = len(set([word.lower() for sentence in X for word in sentence]))

num_tags   = len(set([word.lower() for sentence in Y for word in sentence]))

print("Total number of tagged sentences: {}".format(len(X)))

print("Vocabulary size: {}".format(num_words))

print("Total number of tags: {}".format(num_tags))

In [6]:
# Let's look at the first data point.

print('sample X: ', X[0], '\n')

print('sample Y: ', Y[0], '\n')

In [7]:
# Ensure that the length of input sequence equals the output sequence.

print("Length of first input sequence  : {}".format(len(X[0])))

print("Length of first output sequence : {}".format(len(Y[0])))

### **Vectorise X and Y**

#### Encode X and Y to integer values

We'll use the Tokenizer class from the Keras library to encode each text sequence as an integer sequence.
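
To make the encoding step concrete, here is a simplified pure-Python stand-in for the idea: tokens are lowercased, ranked by frequency, and numbered from 1 so that 0 stays free for padding. This is a sketch of the principle, not the Keras implementation; `fit_word_index`, `encode` and the toy sentences are hypothetical names used only here.

```python
from collections import Counter


def fit_word_index(sequences):
    """Map each lowercased token to an integer, most frequent token first."""
    counts = Counter(token.lower() for seq in sequences for token in seq)
    # Index 0 is left unused so that it can later serve as the padding value.
    return {token: i for i, (token, _) in enumerate(counts.most_common(), start=1)}


def encode(sequences, word_index):
    """Replace each known token by its integer index; unknown tokens are dropped."""
    return [[word_index[t.lower()] for t in seq if t.lower() in word_index]
            for seq in sequences]


toy_sentences = [["The", "cat", "sat"], ["the", "dog"]]
toy_index = fit_word_index(toy_sentences)
print(toy_index)                         # {'the': 1, 'cat': 2, 'sat': 3, 'dog': 4}
print(encode(toy_sentences, toy_index))  # [[1, 2, 3], [1, 4]]
```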

In [8]:
# Encode X

word_tokenizer = Tokenizer() 

word_tokenizer.fit_on_texts(X)

X_encoded = word_tokenizer.texts_to_sequences(X)  

In [9]:
# Encode Y

tag_tokenizer = Tokenizer()

tag_tokenizer.fit_on_texts(Y)

Y_encoded = tag_tokenizer.texts_to_sequences(Y)

In [11]:
# Let's look at the first encoded data point.

print("** Raw data point **", "\n", "-"*100, "\n")
print('X: ', X[0], '\n')
print('Y: ', Y[0], '\n')
print()
print("** Encoded data point **", "\n", "-"*100, "\n")
print('X: ', X_encoded[0], '\n')
print('Y: ', Y_encoded[0], '\n')

In [12]:
# Make sure that each input sequence has the same length as its output sequence.

different_length = [1 if len(x_seq) != len(y_seq) else 0 for x_seq, y_seq in zip(X_encoded, Y_encoded)]

print("{} sentences have different input-output lengths.".format(sum(different_length)))

### Pad sequences

The next step after encoding the data is to **define the sequence lengths**. As of now, the sentences present in the data are of various lengths. We need to either pad short sentences or truncate long sentences to a fixed length. This fixed length, however, is a **hyperparameter**.
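
The effect of padding and truncation can be sketched in plain Python. The helper below is a hypothetical illustration (`pad_pre_truncate_post` is not a Keras function); it mimics the combination used later in this notebook, where zeros are added at the front of short sequences and long sequences lose their tail.

```python
def pad_pre_truncate_post(sequence, maxlen, value=0):
    """Left-pad short sequences with `value`; cut long ones at the end."""
    truncated = sequence[:maxlen]              # like truncating="post"
    padding = [value] * (maxlen - len(truncated))
    return padding + truncated                 # like padding="pre"


print(pad_pre_truncate_post([7, 8, 9], 5))           # [0, 0, 7, 8, 9]
print(pad_pre_truncate_post([1, 2, 3, 4, 5, 6], 4))  # [1, 2, 3, 4]
```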

In [13]:
# Check length of longest sentence.

lengths = [len(seq) for seq in X_encoded]

print("Length of longest sentence: {}".format(max(lengths)))

In [15]:
# View various lengths of sentences with the help of boxplot.

sns.boxplot(x=lengths)

plt.show()
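
One way to reason about a candidate maximum length, alongside the boxplot, is to check what fraction of sentences would fit without truncation. The sketch below uses a hypothetical helper `coverage` and made-up toy lengths; in the notebook it could be applied to `lengths`.

```python
def coverage(lengths, max_len):
    """Fraction of sequences that fit within max_len without truncation."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


toy_lengths = [3, 5, 40, 120]      # made-up sentence lengths
print(coverage(toy_lengths, 100))  # 0.75
```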

In [16]:
MAX_SEQ_LENGTH = 100  # Sequences greater than 100 in length will be truncated.

X_padded = pad_sequences(X_encoded, maxlen = MAX_SEQ_LENGTH, padding = "pre", truncating = "post")

Y_padded = pad_sequences(Y_encoded, maxlen = MAX_SEQ_LENGTH, padding = "pre", truncating = "post")

In [17]:
# Print the first sequence.

print(X_padded[0], "\n"*3)

print(Y_padded[0])

In [18]:
# Assign padded sequences to X and Y.

X, Y = X_padded, Y_padded

### Word embeddings

Currently, each word and each tag is encoded as an integer. 

We'll use a more sophisticated technique to represent the input words (X) using what's known as **word embeddings**.

However, to represent each tag in Y, we'll simply use a **one-hot encoding** scheme, since there are only 13 classes in the dataset (the 12 universal tags plus the padding index 0) and the network will have no trouble learning its own representation of them.

### Use word embeddings for input sequences (X)

In [19]:
# Using word2vec.

path = str(find('models/word2vec_sample/pruned.word2vec.txt'))

# Load word2vec using the following function present in the gensim library.

word2vec = KeyedVectors.load_word2vec_format(path, binary = False)

In [20]:
# Check word2vec effectiveness.

word2vec.most_similar(positive = ["King", "Woman"], negative = ["Man"])
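
The analogy query above can be illustrated with a simplified pure-Python sketch: add the positive vectors, subtract the negative ones, and return the remaining word whose vector has the highest cosine similarity to the result. The two-dimensional `toy_vectors` are invented for illustration, and `analogy` is a hypothetical helper that skips details of the gensim implementation such as vector normalisation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the query words."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


# Invented 2-D vectors: first axis ~ "royalty", second axis ~ "gender".
toy_vectors = {
    "king": [1.0, 1.0], "queen": [1.0, -1.0],
    "man": [0.0, 1.0], "woman": [0.0, -1.0], "apple": [-1.0, 0.5],
}
print(analogy(toy_vectors, positive=["king", "woman"], negative=["man"]))  # queen
```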

In [21]:
# Assign word vectors from word2vec model.

EMBEDDING_SIZE  = 300

VOCABULARY_SIZE = len(word_tokenizer.word_index) + 1

# Create an empty embedding matrix.

embedding_weights = np.zeros((VOCABULARY_SIZE, EMBEDDING_SIZE))

# Create a word to index dictionary mapping.

word2id = word_tokenizer.word_index

# Copy vectors from word2vec model to the words present in corpus.

for word, index in word2id.items():
    try:
        embedding_weights[index, :] = word2vec[word]
    except KeyError:
        pass
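
The loop above silently skips words that are missing from word2vec, leaving their rows at zero. A minimal sketch of the same idea that also records the missing words is shown below. `build_embedding_matrix` is a hypothetical helper, and the toy dictionary stands in for the word2vec model.

```python
import numpy as np


def build_embedding_matrix(word_index, vectors, dim):
    """Copy known vectors into a (vocab + 1, dim) matrix and list the words without a vector."""
    matrix = np.zeros((len(word_index) + 1, dim))  # row 0 is reserved for padding
    missing = []
    for word, index in word_index.items():
        if word in vectors:
            matrix[index] = vectors[word]
        else:
            missing.append(word)                   # row stays all zeros
    return matrix, missing


toy_word_index = {"the": 1, "cat": 2, "zzz": 3}
toy_vectors = {"the": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6]}
toy_matrix, toy_missing = build_embedding_matrix(toy_word_index, toy_vectors, dim=3)
print(toy_matrix.shape)  # (4, 3)
print(toy_missing)       # ['zzz']
```

Printing the share of missing words gives a quick sense of how much of the vocabulary actually benefits from the pre-trained vectors.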

In [22]:
# Check embedding dimension.

print("Embeddings shape: {}".format(embedding_weights.shape))

In [23]:
# Let's look at an embedding of a word.

embedding_weights[word_tokenizer.word_index['joy']]

### Use one-hot encoding for output sequences (Y)

In [24]:
# Use Keras' to_categorical function to one-hot encode Y.

Y = to_categorical(Y)

In [25]:
# Print the shape of the one-hot encoded Y.

print(Y.shape)
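
To make the resulting three-dimensional shape concrete, here is a pure-Python sketch of one-hot encoding on toy data. `one_hot` is a hypothetical helper written for illustration, not the Keras function, and the toy tag indices are made up.

```python
def one_hot(sequences, num_classes):
    """Turn each integer in each sequence into a vector with a single 1.0."""
    return [[[1.0 if k == index else 0.0 for k in range(num_classes)]
             for index in seq]
            for seq in sequences]


toy_Y = [[0, 2, 1], [0, 0, 3]]   # 2 samples, 3 timesteps, class ids 0..3
toy_one_hot = one_hot(toy_Y, num_classes=4)
print(len(toy_one_hot), len(toy_one_hot[0]), len(toy_one_hot[0][0]))  # 2 3 4
print(toy_one_hot[0][1])  # [0.0, 0.0, 1.0, 0.0]
```

The three numbers correspond to (#samples, #timesteps, #classes), the same layout as `Y.shape`.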

### Split data into training, validation and testing sets

In [26]:
# Split entire data into training and testing sets.

TEST_SIZE = 0.15

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = TEST_SIZE, random_state = 100)

In [27]:
# Split training data into training and validation sets.

VALID_SIZE = 0.15

X_train, X_validation, Y_train, Y_validation = train_test_split(X_train, Y_train, test_size = VALID_SIZE, random_state = 100)
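
Because the validation set is carved out of the remaining training data, its share of the full dataset is smaller than 15%. The arithmetic can be sketched as follows; `split_fractions` is a hypothetical helper.

```python
def split_fractions(test_size, valid_size):
    """Return (train, validation, test) fractions of the full dataset for a two-step split."""
    test = test_size
    valid = (1.0 - test_size) * valid_size  # validation is taken from what remains
    train = 1.0 - test - valid
    return train, valid, test


train_frac, valid_frac, test_frac = split_fractions(0.15, 0.15)
print(round(train_frac, 4), round(valid_frac, 4), round(test_frac, 4))  # 0.7225 0.1275 0.15
```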

In [28]:
# Print number of samples in each set.

print("TRAINING DATA")
print('Shape of input sequences: {}'.format(X_train.shape))
print('Shape of output sequences: {}'.format(Y_train.shape))
print("-"*50)
print("VALIDATION DATA")
print('Shape of input sequences: {}'.format(X_validation.shape))
print('Shape of output sequences: {}'.format(Y_validation.shape))
print("-"*50)
print("TESTING DATA")
print('Shape of input sequences: {}'.format(X_test.shape))
print('Shape of output sequences: {}'.format(Y_test.shape))

Before using an RNN, we must make sure the dimensions of the data are what an RNN expects. In general, an RNN expects the following shapes:

Shape of X:
(#samples, #timesteps, #features)

Shape of Y:
(#samples, #timesteps, #features)

![RNN tensor shape](./jupyter resources/rnn_tensor.png)

The exact shapes fed to an RNN vary with the type of architecture. Since the problem we're working on has a many-to-many architecture, both the input and the output include the number of timesteps, which is simply the sequence length.

Notice that the tensor X doesn't have the third dimension, that is, the number of features. That's because we're going to use word embeddings before feeding the data to an RNN. When you use the Embedding() layer in Keras as the first layer of the model, the training data is automatically converted from (#samples, #timesteps) to (#samples, #timesteps, #features), where #features is the embedding dimension.

With the embedding layer we therefore only need to shape the data to (#samples, #timesteps), which is what we have done. However, note that you'll need to shape it to (#samples, #timesteps, #features) yourself if you don't use the Embedding() layer in Keras.
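
The shape change performed by an embedding lookup can be sketched with NumPy indexing. This is only an illustration of the idea, not the Keras layer itself; `toy_weights` and `toy_X` are made-up arrays.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_weights = rng.normal(size=(6, 3))  # hypothetical: 6 vocabulary indices, 3 features each

toy_X = np.array([[0, 0, 4, 2],        # 2 samples, 4 timesteps, integer word ids
                  [0, 1, 3, 5]])

toy_embedded = toy_weights[toy_X]      # each id is replaced by its row of toy_weights
print(toy_X.shape)                     # (2, 4)
print(toy_embedded.shape)              # (2, 4, 3)
```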

### 3. Building Vanilla RNN model

### Uninitialised fixed embeddings
First, let's try running a vanilla RNN. For this RNN we won't use the pre-trained word embeddings; we'll use randomly initialised embeddings instead. Moreover, we won't update the embedding weights during training.

In [29]:
# Total number of tags.

NUM_CLASSES = Y.shape[2]

In [30]:
# Create architecture.

rnn_model = Sequential()

# Create embedding layer.
rnn_model.add(Embedding(input_dim     =  VOCABULARY_SIZE,         
                        output_dim    =  EMBEDDING_SIZE,          
                        input_length  =  MAX_SEQ_LENGTH,          
                        trainable     =  False                    
))

# Add an RNN layer which contains 64 RNN cells.

rnn_model.add(SimpleRNN(64, 
              return_sequences = True))

# Add time distributed (output at each timestep) layer.
                        
rnn_model.add(TimeDistributed(Dense(NUM_CLASSES, activation='softmax')))

### Compile model

In [31]:
rnn_model.compile(loss      =  'categorical_crossentropy',
                  optimizer =  'adam',
                  metrics   =  ['acc'])

In [32]:
# Check summary of the model.

rnn_model.summary()
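
The parameter counts reported by the summary can be checked by hand. The sketch below assumes the Keras defaults (a bias term in every layer) and uses hypothetical helper names. The values 300, 64 and 13 are the embedding size, the number of RNN units and the number of classes mentioned in this notebook, while the vocabulary size of 1000 is made up.

```python
def embedding_params(vocab_size, embedding_dim):
    """One vector of length embedding_dim per vocabulary index."""
    return vocab_size * embedding_dim


def simple_rnn_params(input_dim, units):
    """Input kernel + recurrent kernel + bias."""
    return units * (input_dim + units + 1)


def dense_params(input_dim, units):
    """Weight matrix + bias; TimeDistributed shares these across timesteps."""
    return units * (input_dim + 1)


print(embedding_params(1000, 300))  # 300000 (hypothetical vocabulary of 1000)
print(simple_rnn_params(300, 64))   # 23360
print(dense_params(64, 13))         # 845
```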

### Fit model

In [33]:
rnn_training = rnn_model.fit(X_train, Y_train, batch_size = 128, epochs = 10, validation_data = (X_validation, Y_validation))

In [34]:
# Visualise training history.

plt.plot(rnn_training.history['acc'])
plt.plot(rnn_training.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc = "lower right")
plt.show()

### Uninitialised trainable embeddings

In [35]:
# Create architecture.

rnn_model = Sequential()

# Create embedding layer.

rnn_model.add(Embedding(input_dim     =  VOCABULARY_SIZE,         
                        output_dim    =  EMBEDDING_SIZE,          
                        input_length  =  MAX_SEQ_LENGTH,          
                        trainable     =  True                     
))

# Add an RNN layer which contains 64 RNN cells.

rnn_model.add(SimpleRNN(64, 
              return_sequences = True 
))

# Add time distributed (output at each timestep) layer.

rnn_model.add(TimeDistributed(Dense(NUM_CLASSES, activation = 'softmax')))

### Compile model

In [36]:
rnn_model.compile(loss      =  'categorical_crossentropy',
                  optimizer =  'adam',
                  metrics   =  ['acc'])

In [37]:
# Check summary of the model.

rnn_model.summary()

### Fit model

In [38]:
rnn_training = rnn_model.fit(X_train, Y_train, batch_size = 128, epochs = 10, validation_data = (X_validation, Y_validation))

In [39]:
# Visualise training history.

plt.plot(rnn_training.history['acc'])
plt.plot(rnn_training.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc = "lower right")
plt.show()

### Using pre-trained embedding weights

In [40]:
# Create architecture.

rnn_model = Sequential()

# Create embedding layer.

rnn_model.add(Embedding(input_dim     =  VOCABULARY_SIZE,         
                        output_dim    =  EMBEDDING_SIZE,          
                        input_length  =  MAX_SEQ_LENGTH,          
                        weights       = [embedding_weights],     
                        trainable     =  True                     
))

# Add an RNN layer which contains 64 RNN cells.

rnn_model.add(SimpleRNN(64, 
              return_sequences = True  
))

# Add time distributed (output at each timestep) layer.

rnn_model.add(TimeDistributed(Dense(NUM_CLASSES, activation = 'softmax')))

### Compile model

In [41]:
rnn_model.compile(loss      =  'categorical_crossentropy',
                  optimizer =  'adam',
                  metrics   =  ['acc'])

In [42]:
# Check summary of the model.

rnn_model.summary()

### Fit model

In [43]:
rnn_training = rnn_model.fit(X_train, Y_train, batch_size = 128, epochs = 10, validation_data = (X_validation, Y_validation))

In [44]:
# Visualise training history.

plt.plot(rnn_training.history['acc'])
plt.plot(rnn_training.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc = "lower right")
plt.show()

### 4. Building LSTM model

We'll use pre-trained word embeddings in the following models and allow them to be updated during training as well.

### Create model architecture

In [45]:
# Create architecture.

lstm_model = Sequential()

lstm_model.add(Embedding(input_dim     = VOCABULARY_SIZE,         
                         output_dim    = EMBEDDING_SIZE,         
                         input_length  = MAX_SEQ_LENGTH,          
                         weights       = [embedding_weights],     
                         trainable     = True                      
))

lstm_model.add(LSTM(64, return_sequences = True))

lstm_model.add(TimeDistributed(Dense(NUM_CLASSES, activation = 'softmax')))

### Compile model

In [46]:
lstm_model.compile(loss      =  'categorical_crossentropy',
                   optimizer =  'adam',
                   metrics   =  ['acc'])

In [47]:
# Check summary of the model.

lstm_model.summary()

### Fit model

In [48]:
lstm_training = lstm_model.fit(X_train, Y_train, batch_size = 128, epochs = 10, validation_data = (X_validation, Y_validation))

In [49]:
# Visualise training history.

plt.plot(lstm_training.history['acc'])
plt.plot(lstm_training.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc = "lower right")
plt.show()

### 5. Building GRU model

### Create model architecture

In [50]:
# Create architecture.

gru_model = Sequential()

gru_model.add(Embedding(input_dim     = VOCABULARY_SIZE,
                        output_dim    = EMBEDDING_SIZE,
                        input_length  = MAX_SEQ_LENGTH,
                        weights       = [embedding_weights],
                        trainable     = True
))

gru_model.add(GRU(64, return_sequences = True))

gru_model.add(TimeDistributed(Dense(NUM_CLASSES, activation = 'softmax')))

### Compile model

In [51]:
gru_model.compile(loss      =  'categorical_crossentropy',
                  optimizer =  'adam',
                  metrics   =  ['acc'])

In [52]:
# Check summary of model.

gru_model.summary()

### Fit model

In [53]:
gru_training = gru_model.fit(X_train, Y_train, batch_size = 128, epochs = 10, validation_data = (X_validation, Y_validation))

In [54]:
# Visualise training history.

plt.plot(gru_training.history['acc'])
plt.plot(gru_training.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc = "lower right")
plt.show()

### 6. Building Bidirectional LSTM model

### Create model architecture

In [55]:
# Create architecture.

bidirect_model = Sequential()

bidirect_model.add(Embedding(input_dim     = VOCABULARY_SIZE,
                             output_dim    = EMBEDDING_SIZE,
                             input_length  = MAX_SEQ_LENGTH,
                             weights       = [embedding_weights],
                             trainable     = True
))

bidirect_model.add(Bidirectional(LSTM(64, return_sequences = True)))

bidirect_model.add(TimeDistributed(Dense(NUM_CLASSES, activation = 'softmax')))

### Compile model

In [56]:
bidirect_model.compile(loss      =  'categorical_crossentropy',
                       optimizer =  'adam',
                       metrics   =  ['acc'])

In [57]:
# Check summary of model.

bidirect_model.summary()
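
The recurrent layers differ mainly in how many gate-specific weight sets they hold, which shows up in the parameter counts of the summaries. The sketch below assumes the Keras defaults (bias terms enabled, and `reset_after=True` for the GRU, which adds a second bias vector); the helper names are hypothetical.

```python
def lstm_params(input_dim, units):
    """Four weight sets (input, forget, cell, output), each with kernel + recurrent kernel + bias."""
    return 4 * units * (input_dim + units + 1)


def gru_params(input_dim, units, reset_after=True):
    """Three weight sets (update, reset, candidate); reset_after=True uses two bias vectors."""
    biases = 2 if reset_after else 1
    return 3 * units * (input_dim + units + biases)


def bidirectional_params(layer_params):
    """A forward and a backward copy of the wrapped layer."""
    return 2 * layer_params


print(lstm_params(300, 64))                        # 93440
print(gru_params(300, 64))                         # 70272
print(bidirectional_params(lstm_params(300, 64)))  # 186880
```

Since the bidirectional wrapper concatenates the forward and backward outputs by default, the following Dense layer sees 128 features per timestep instead of 64.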

### Fit model

In [58]:
bidirect_training = bidirect_model.fit(X_train, Y_train, batch_size = 128, epochs = 10, validation_data = (X_validation, Y_validation))

In [59]:
# Visualise training history.

plt.plot(bidirect_training.history['acc'])
plt.plot(bidirect_training.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc = "lower right")
plt.show()

### 7. Model evaluation

In [60]:
loss, accuracy = rnn_model.evaluate(X_test, Y_test, verbose = 1)

print("Loss: {0},\nAccuracy: {1}".format(loss, accuracy))

In [61]:
loss, accuracy = lstm_model.evaluate(X_test, Y_test, verbose = 1)

print("Loss: {0},\nAccuracy: {1}".format(loss, accuracy))

In [62]:
loss, accuracy = gru_model.evaluate(X_test, Y_test, verbose = 1)

print("Loss: {0},\nAccuracy: {1}".format(loss, accuracy))

In [63]:
loss, accuracy = bidirect_model.evaluate(X_test, Y_test, verbose = 1)

print("Loss: {0},\nAccuracy: {1}".format(loss, accuracy))
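
As far as the code above shows, the embedding layers do not mask the padding index, so the accuracy reported by `evaluate()` is averaged over all timesteps of the padded sequences, padding included. A minimal sketch of an accuracy that ignores padded positions is given below. `masked_accuracy` is a hypothetical helper, and it assumes that class 0 is the padding class, as produced by the padding and one-hot steps earlier.

```python
import numpy as np


def masked_accuracy(y_true_one_hot, y_pred_probs, pad_class=0):
    """Accuracy over timesteps whose true class is not the padding class."""
    true_ids = np.argmax(y_true_one_hot, axis=-1)
    pred_ids = np.argmax(y_pred_probs, axis=-1)
    mask = true_ids != pad_class
    return float(np.mean(pred_ids[mask] == true_ids[mask]))


# Toy example: one sequence, two padded positions, one of two real tags predicted correctly.
toy_true = np.eye(4)[np.array([[0, 0, 1, 2]])]
toy_pred = np.eye(4)[np.array([[0, 0, 1, 3]])]

plain = float(np.mean(np.argmax(toy_true, -1) == np.argmax(toy_pred, -1)))
print(plain)                                # 0.75
print(masked_accuracy(toy_true, toy_pred))  # 0.5
```

In the notebook, this could be called as `masked_accuracy(Y_test, bidirect_model.predict(X_test))` to compare the models on real tokens only.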

### Conclusions:

The test-set accuracies of the different RNN models are as follows:

1. Vanilla RNN model - 98.99%
2. LSTM model - 99.11%
3. GRU model - 99.09%
4. Bidirectional LSTM model - 99.32%

The Bidirectional LSTM model performs best for POS tagging, with an accuracy of 99.32%, compared with the other RNN models.