### PoS - Part of Speech Tagging 

As promised in the previous notebook, this notebook shows how to use the `LSTM` layer together with an `Embedding` layer whose weights are initialized from pretrained `word2vec` vectors.


**Note:** The rest of the notebook remains the same as the previous one; the only difference is that this time we load the data from a file called `pos.csv`.

### Part of Speech Tagging (PoS)

PoS tagging is the process of assigning each word in a sentence its part of speech (noun, verb, adjective and so on).
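As a small illustration, the snippet below pairs the first few words of the opening sentence in our dataset with their tags. It is plain Python, and the names `sample_words`, `sample_tags` and `tagged` are chosen here for illustration only.

```python
# First six words of the opening sentence in pos.csv and their tags.
sample_words = ["The", "Fulton", "County", "Grand", "Jury", "said"]
sample_tags = ["DET", "NOUN", "NOUN", "ADJ", "NOUN", "VERB"]

# PoS tagging is a sequence-labelling task: one tag per word, in the same order.
tagged = list(zip(sample_words, sample_tags))
for word, tag in tagged:
    print(f"{word:<8}{tag}")
```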

### Imports



In [37]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

import pandas as pd

from sklearn.model_selection import train_test_split
from nltk.corpus import brown, treebank, conll2000

import os, time, re, string, random, nltk
tf.__version__

'2.6.0'

### Data loading

We are going to load the data from a `csv` file named `pos.csv` stored in our Google Drive.

In [38]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [39]:
file_path = "/content/drive/My Drive/NLP Data/pos-datasets/english/pos.csv"
os.path.exists(file_path)

True

In [40]:
dataframe = pd.read_csv(file_path)
dataframe.head(5)

Unnamed: 0,sentence,tags
0,The Fulton County Grand Jury said Friday an in...,DET NOUN NOUN ADJ NOUN VERB NOUN DET NOUN ADP ...
1,The jury further said in term-end presentments...,DET NOUN ADV VERB ADP NOUN NOUN ADP DET NOUN A...
2,The September-October term jury had been charg...,DET NOUN NOUN NOUN VERB VERB VERB ADP NOUN ADJ...
3,`` Only a relative handful of such reports was...,. ADV DET ADJ NOUN ADP ADJ NOUN VERB VERB . . ...
4,The jury said it did find that many of Georgia...,DET NOUN VERB PRON VERB VERB ADP ADJ ADP NOUN ...


### Dataset creation

Each row in our dataset has a sentence and a matching string of tags, where each tag corresponds to the word at the same position in the sentence.

In [41]:
X = dataframe.sentence.values
y = dataframe.tags.values

print(X[0])
print(y[0])

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
DET NOUN NOUN ADJ NOUN VERB NOUN DET NOUN ADP NOUN ADJ NOUN NOUN VERB . DET NOUN . ADP DET NOUN VERB NOUN .


In [42]:
X = list(map(lambda x: x.split(" "), list(X)))
y = list(map(lambda x: x.split(" "), list(y)))

### Data Statistics

In [43]:
words = []
tags = []

for sentence in X:
  for word in sentence:
    words.append(word.lower())

for tag in y:
  for t in tag:
    tags.append(t.lower())
num_words = len(set(words))
num_tags = len(set(tags))

In [44]:
set(tags)

{'.',
 'adj',
 'adp',
 'adv',
 'conj',
 'det',
 'noun',
 'num',
 'pron',
 'prt',
 'verb',
 'x'}

In [45]:
print("Total number of tagged sentences: {}".format(len(X)))
print("Vocabulary size: {}".format(num_words))
print("Total number of tags: {}".format(num_tags))

Total number of tagged sentences: 72202
Vocabulary size: 59448
Total number of tags: 12


### Checking examples

In [46]:
print("Sample x: ", X[0], "\n")
print("Sample y: ", y[0], "\n")

Sample x:  ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'] 

Sample y:  ['DET', 'NOUN', 'NOUN', 'ADJ', 'NOUN', 'VERB', 'NOUN', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADJ', 'NOUN', 'NOUN', 'VERB', '.', 'DET', 'NOUN', '.', 'ADP', 'DET', 'NOUN', 'VERB', 'NOUN', '.'] 



### Text vectorization

We are going to use the `Tokenizer` class to encode each sentence (a sequence of words) as a sequence of integers.

In [47]:
tokenizer = keras.preprocessing.text.Tokenizer(split=' ')
tokenizer.fit_on_texts(X)

In [48]:
tag_tokenizer = keras.preprocessing.text.Tokenizer(split=' ')
tag_tokenizer.fit_on_texts(y)

In [49]:
tag_tokenizer.word_index

{'.': 3,
 'adj': 6,
 'adp': 4,
 'adv': 7,
 'conj': 9,
 'det': 5,
 'noun': 1,
 'num': 11,
 'pron': 8,
 'prt': 10,
 'verb': 2,
 'x': 12}
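The indices above follow word frequency: the most frequent token gets index `1`, the next one `2`, and so on, while `0` is left free for padding. The snippet below is a minimal pure-Python sketch of that idea, not the Keras implementation; `build_word_index` and `texts_to_sequences` are hypothetical helper names used only here.

```python
from collections import Counter

def build_word_index(texts):
    """Map each lower-cased token to a 1-based rank by descending frequency."""
    counts = Counter(token.lower() for tokens in texts for token in tokens)
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known token by its integer index (unknown tokens are dropped)."""
    return [[word_index[t.lower()] for t in tokens if t.lower() in word_index]
            for tokens in texts]

toy_tags = [["DET", "NOUN", "VERB", "NOUN"], ["NOUN", "VERB"]]
toy_index = build_word_index(toy_tags)
toy_sequences = texts_to_sequences(toy_tags, toy_index)

print(toy_index)      # {'noun': 1, 'verb': 2, 'det': 3}
print(toy_sequences)  # [[3, 1, 2, 1], [1, 2]]
```

In the toy data `noun` occurs three times, `verb` twice and `det` once, which is why they receive the indices 1, 2 and 3.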

Now we can convert our tokens to sequences.

In [50]:
sentences_sequences = tokenizer.texts_to_sequences(X)
tags_sequences = tag_tokenizer.texts_to_sequences(y)

### Checking a single example.

In [51]:
len(tags_sequences[3]), len(sentences_sequences[3])

(37, 37)

Let's convert tag tokens back to word representations.

In [52]:
print("Y[0]: ", y[0])
print("tags_sequences[0]: ", tags_sequences[0])
print("sequences_to_tags[0]: ", 
      tag_tokenizer.sequences_to_texts([tags_sequences[0]]))

Y[0]:  ['DET', 'NOUN', 'NOUN', 'ADJ', 'NOUN', 'VERB', 'NOUN', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADJ', 'NOUN', 'NOUN', 'VERB', '.', 'DET', 'NOUN', '.', 'ADP', 'DET', 'NOUN', 'VERB', 'NOUN', '.']
tags_sequences[0]:  [5, 1, 1, 6, 1, 2, 1, 5, 1, 4, 1, 6, 1, 1, 2, 3, 5, 1, 3, 4, 5, 1, 2, 1, 3]
sequences_to_tags[0]:  ['det noun noun adj noun verb noun det noun adp noun adj noun noun verb . det noun . adp det noun verb noun .']


### Checking if the inputs and outputs have the same length.


In [53]:
different_length = [1 if len(sent_seq) != len(tag_seq) else 0 for sent_seq, tag_seq in zip(sentences_sequences, tags_sequences)]
print("{} sentences have disparate input-output lengths.".format(sum(different_length)))

0 sentences have disparate input-output lengths.


### Padding sequences

Since the sentences have different lengths, we pad their sequences to a common length. The longest sentence has 271 tokens, but we cap `MAX_LENGTH` at 100: shorter sequences are padded with zeros at the end and longer ones are truncated at the end.

In [54]:
lengths = [len(seq) for seq in tags_sequences]
MAX_LENGTH = max(lengths)
print(f"Longest sentence: {MAX_LENGTH}")

MAX_LENGTH = 100 # we are going to set the max-length to 100

Longest sentence: 271


In [55]:
padded_sentences = keras.preprocessing.sequence.pad_sequences(
    sentences_sequences,
    maxlen=MAX_LENGTH,
    padding="post",
    truncating="post"
)
padded_tags = keras.preprocessing.sequence.pad_sequences(
    tags_sequences,
    maxlen=MAX_LENGTH,
    padding="post",
    truncating="post"
)

Checking a single example of the padded sequences.

In [56]:
print(padded_sentences[0], "\n"*2)
print(padded_tags[0])

[    1  5731   778  2326  1842    39   853    34  1944     4 16831   379
  1343  1523  1116    12    67   569    14     9    89 10208   252   205
     3     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0] 


[5 1 1 6 1 2 1 5 1 4 1 6 1 1 2 3 5 1 3 4 5 1 2 1 3 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
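To make the effect of `padding="post"` and `truncating="post"` concrete, here is a minimal pure-Python sketch of the same behaviour. `pad_post` and `count_truncated` are hypothetical helpers written for this illustration; the notebook itself relies on `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Cut each sequence to `maxlen` from the end, then right-pad with `value`."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]                            # truncating="post"
        padded.append(seq + [value] * (maxlen - len(seq)))  # padding="post"
    return padded

def count_truncated(sequences, maxlen):
    """Number of sequences that lose tokens when capped at `maxlen`."""
    return sum(len(seq) > maxlen for seq in sequences)

toy_sequences = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
print(pad_post(toy_sequences, maxlen=5))         # [[1, 2, 3, 0, 0], [4, 5, 6, 7, 8]]
print(count_truncated(toy_sequences, maxlen=5))  # 1
```

Calling something like `count_truncated(sentences_sequences, MAX_LENGTH)` in the notebook would show how many sentences lose words because of the cap at 100.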


### One-hot Encode `padded_tags` labels

In [57]:
padded_tags = keras.utils.to_categorical(padded_tags)
padded_tags.shape

(72202, 100, 13)
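The last dimension is 13 rather than 12 because the padding index `0` counts as a class of its own next to the 12 tags. The snippet below is a small pure-Python sketch of one-hot encoding under that assumption; `one_hot` is a hypothetical helper, not the Keras function.

```python
def one_hot(indices, num_classes):
    """Turn a list of integer class ids into a list of one-hot rows."""
    return [[1.0 if i == idx else 0.0 for i in range(num_classes)]
            for idx in indices]

# 12 tags (indices 1..12) plus the padding index 0 give 13 classes.
num_classes = 12 + 1
encoded = one_hot([5, 1, 0], num_classes)

for row in encoded:
    print(row)
```

Each row has exactly one `1.0`, at the position of the tag index, so a padded tag sequence of length 100 becomes a `100 x 13` matrix.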

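### Splitting the sets

We are going to use the `sklearn` `train_test_split` function to split our data into train, test and validation sets. Note that the second call below splits the full `padded_sentences` and `padded_tags` again with the same `random_state`, so the validation set comes out identical to the test set; passing `X_train` and `y_train` to the second call instead would give a separate validation set.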

In [58]:
X_train, X_test, y_train, y_test = train_test_split(
   padded_sentences, padded_tags, random_state=42, test_size=.15
)
X_train, X_valid, y_train, y_valid = train_test_split(
   padded_sentences, padded_tags, random_state=42, test_size=.15
)

### Counting examples

In [59]:
print("training: ", len(X_train))
print("testing: ", len(X_test))
print("validation: ", len(X_valid))

training:  61371
testing:  10831
validation:  10831
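The counts above add up to more than the 72202 sentences we have, because the test and validation sets overlap. The following is a dependency-free sketch of a three-way split with disjoint sets; `three_way_split` is a hypothetical helper, and in the notebook a similar result could be obtained by calling `train_test_split` a second time on `X_train` and `y_train`.

```python
import random

def three_way_split(n_samples, test_size=0.15, valid_size=0.15, seed=42):
    """Return disjoint (train, valid, test) lists of sample indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Carve out the test set first.
    n_test = int(n_samples * test_size)
    test_idx, rest = indices[:n_test], indices[n_test:]

    # The validation share is taken from what is left after the test split.
    n_valid = int(len(rest) * valid_size)
    valid_idx, train_idx = rest[:n_valid], rest[n_valid:]
    return train_idx, valid_idx, test_idx

train_idx, valid_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With disjoint sets the three counts always sum to the number of samples, which is a quick sanity check worth running after any split.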


### A simple RNN with word Embeddings

As mentioned previously, we are going to use pretrained `word2vec` vectors. To download them we run the following code cell:


In [60]:
import gensim.downloader as api

word2vec = api.load('word2vec-google-news-300')

vec_king = word2vec['king']

In [61]:
vec_king

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

### Theory behind word vectors.

Words with similar meanings are closer to each other in the vector space.
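Closeness between word vectors is commonly measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors (not real `word2vec` values) purely to show the computation; `cosine_similarity` and `toy_vectors` are names introduced for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors invented for illustration; real word2vec vectors have 300 dimensions.
toy_vectors = {
    "king":   [0.90, 0.80, 0.10],
    "queen":  [0.85, 0.90, 0.15],
    "banana": [0.10, 0.00, 0.95],
}

sim_king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_king_banana = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print(f"king vs queen:  {sim_king_queen:.3f}")
print(f"king vs banana: {sim_king_banana:.3f}")
```

With the loaded model, `word2vec.similarity("king", "queen")` should give the corresponding value for the real vectors.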


### Number of classes

In [62]:
n_classes = y_train.shape[-1]
n_classes

13

### Hyper parameters

In [63]:
VOCAB_SIZE = len(tokenizer.word_index) + 1
EMBEDDING_SIZE = 300 # each word in word2vec model is represented using a 300 dimensional vector
MAX_SEQUENCE_LENGTH = 100
VOCAB_SIZE

59449

### Creating an embedding matrix that suits our data.

In [64]:
# Initialize the embedding weights with zeros
embedding_weigths = np.zeros([
   VOCAB_SIZE, EMBEDDING_SIZE
])

# Getting the string to integer mapping
stoi = tokenizer.word_index

# Copy the vectors from word2vec into our embedding matrix.
# Words that are missing from word2vec keep their all-zero row.
for word, index in stoi.items():
  try:
    embedding_weigths[index, :] = word2vec[word]
  except KeyError:
    pass


### Checking the dimensions of our embedding weights

In [65]:
print("Embeddings shape: {}".format(embedding_weigths.shape))

Embeddings shape: (59449, 300)
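The shape alone does not tell us how many rows were actually filled, since words that are missing from `word2vec` silently stay at zero. Below is a small sketch for measuring that coverage. It uses a toy dictionary in place of the real vectors, and `embedding_coverage` is a hypothetical helper; we assume the same membership test (`word in word2vec`) works on the loaded model.

```python
def embedding_coverage(word_index, pretrained):
    """Return the fraction of vocabulary words found in `pretrained`
    together with the list of words that are missing."""
    missing = [word for word in word_index if word not in pretrained]
    covered = 1.0 - len(missing) / len(word_index)
    return covered, missing

# Toy stand-ins for tokenizer.word_index and the word2vec model.
toy_word_index = {"the": 1, "jury": 2, "term-end": 3, "``": 4}
toy_pretrained = {"the": [0.1, 0.2], "jury": [0.3, 0.4]}

covered, missing = embedding_coverage(toy_word_index, toy_pretrained)
print(f"coverage: {covered:.0%}")  # coverage: 50%
print("missing:", missing)         # missing: ['term-end', '``']
```

Because our tokenizer lower-cases every word and keeps punctuation tokens, some vocabulary entries may have no match in the pretrained vectors, so checking the coverage is a useful sanity step.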


#### Checking a single word in our embedding weights

In [66]:
embedding_weigths[tokenizer.word_index['the']]

array([ 0.08007812,  0.10498047,  0.04980469,  0.0534668 , -0.06738281,
       -0.12060547,  0.03515625, -0.11865234,  0.04394531,  0.03015137,
       -0.05688477, -0.07617188,  0.01287842,  0.04980469, -0.08496094,
       -0.06347656,  0.00628662, -0.04321289,  0.02026367,  0.01330566,
       -0.01953125,  0.09277344, -0.171875  , -0.00131989,  0.06542969,
        0.05834961, -0.08251953,  0.0859375 , -0.00318909,  0.05859375,
       -0.03491211, -0.0123291 , -0.0480957 , -0.00302124,  0.05639648,
        0.01495361, -0.07226562, -0.05224609,  0.09667969,  0.04296875,
       -0.03540039, -0.07324219,  0.03271484, -0.06176758,  0.00787354,
        0.0035553 , -0.00878906,  0.0390625 ,  0.03833008,  0.04443359,
        0.06982422,  0.01263428, -0.00445557, -0.03320312, -0.04272461,
        0.09765625, -0.02160645, -0.0378418 ,  0.01190186, -0.01391602,
       -0.11328125,  0.09326172, -0.03930664, -0.11621094,  0.02331543,
       -0.01599121,  0.02636719,  0.10742188, -0.00466919,  0.09

### Building an RNN using pretrained weights.

In [69]:
rnn_model = keras.Sequential([
    keras.layers.Embedding(
      VOCAB_SIZE, EMBEDDING_SIZE, input_length=MAX_SEQUENCE_LENGTH,
      trainable = True,
      weights = [embedding_weigths]
    ),
    keras.layers.LSTM(64, return_sequences=True),
    keras.layers.TimeDistributed(
        keras.layers.Dense(n_classes, activation="softmax")
    )
], name="lstm_rnn")

rnn_model.summary()

Model: "lstm_rnn"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 300)          17834700  
_________________________________________________________________
lstm (LSTM)                  (None, 100, 64)           93440     
_________________________________________________________________
time_distributed_2 (TimeDist (None, 100, 13)           845       
Total params: 17,928,985
Trainable params: 17,928,985
Non-trainable params: 0
_________________________________________________________________
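The parameter counts in the summary can be checked by hand. The snippet below reproduces them from the layer sizes used above, using the standard formulas for an embedding table, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer. The lower-case variable names are introduced here so that they do not overwrite the notebook's constants.

```python
vocab_size = 59449      # VOCAB_SIZE
embedding_size = 300    # EMBEDDING_SIZE
lstm_units = 64
n_tag_classes = 13      # n_classes

# One 300-dimensional row per vocabulary index.
embedding_params = vocab_size * embedding_size

# Four gates, each with input weights, recurrent weights and a bias vector.
lstm_params = 4 * (lstm_units * (embedding_size + lstm_units) + lstm_units)

# Weights plus biases of the dense layer applied at every time step.
dense_params = lstm_units * n_tag_classes + n_tag_classes

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
# 17834700 93440 845 17928985
```

Almost all of the trainable parameters sit in the embedding layer, which is why initializing it from pretrained vectors matters.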


### Model training

In [70]:
rnn_model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])

rnn_model.fit(
    X_train, y_train, batch_size=128, 
    epochs=2, validation_data=(X_valid, y_valid)
)

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7fcb26f6d8d0>

### Evaluating the model

In [71]:
rnn_model.evaluate(X_test, y_test, verbose=1)



[0.02823222056031227, 0.9903010129928589]
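The accuracy above is averaged over all 100 positions of every sequence, padding included, so it is likely higher than the accuracy on real words alone. The sketch below shows one way to ignore padded positions; `masked_accuracy` and `plain_accuracy` are hypothetical helpers working on integer tag ids, with `0` assumed to be the padding id.

```python
def plain_accuracy(true_ids, pred_ids):
    """Accuracy over every position, padding included."""
    pairs = [(t, p) for t_seq, p_seq in zip(true_ids, pred_ids)
             for t, p in zip(t_seq, p_seq)]
    return sum(t == p for t, p in pairs) / len(pairs)

def masked_accuracy(true_ids, pred_ids, pad_id=0):
    """Accuracy over the positions whose true tag is not padding."""
    pairs = [(t, p) for t_seq, p_seq in zip(true_ids, pred_ids)
             for t, p in zip(t_seq, p_seq) if t != pad_id]
    return sum(t == p for t, p in pairs) / len(pairs) if pairs else 0.0

# Toy example: three real tokens (one of them wrong) followed by padding.
toy_true = [[5, 1, 2, 0, 0, 0]]
toy_pred = [[5, 1, 1, 0, 0, 0]]

print(f"plain:  {plain_accuracy(toy_true, toy_pred):.3f}")   # plain:  0.833
print(f"masked: {masked_accuracy(toy_true, toy_pred):.3f}")  # masked: 0.667
```

In the notebook the integer ids could be obtained by taking `argmax` over the last axis of `y_test` and of `rnn_model.predict(X_test)`.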

### Model Inference

Now we are ready to predict tags. We are going to perform the following steps in the `tokenize_and_pad_sequences` function:
1. tokenize the sentence
2. convert the tokenized sentence to its integer representation
3. pad the tokenized sentence and pass it to the model
4. get the predictions and convert them back to `tags`.

In [72]:
sent = "The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place ."
tags = 'DET NOUN NOUN ADJ NOUN VERB NOUN DET NOUN ADP NOUN ADJ NOUN NOUN VERB . DET NOUN . ADP DET NOUN VERB NOUN .'.split(" ")

In [73]:
def tokenize_and_pad_sequences(sent):

  if isinstance(sent, str):
    tokens = sent.split(" ")
  else:
    tokens = sent
  tokens = [t.lower() for t in tokens]
  sequences = tokenizer.texts_to_sequences([tokens])
  padded_sequences = keras.preprocessing.sequence.pad_sequences(
    sequences,
    maxlen=MAX_LENGTH,
    padding="post",
    truncating="post"
  )
  predictions = rnn_model.predict(padded_sequences)
  predictions = tf.argmax(predictions, axis=-1).numpy().astype("int32")

  return  tag_tokenizer.sequences_to_texts(predictions)

In [74]:
pred_tags = tokenize_and_pad_sequences(sent)

print("word\t\t\ttag\tpred-tag\t")
print("-"*40)
for word, tag, pred_tag in zip(sent.split(" "), tags, pred_tags[0].split(" ")):
  print(f"{word}\t\t\t{tag}\t{pred_tag.upper()}\t")

word			tag	pred-tag	
----------------------------------------
The			DET	DET	
Fulton			NOUN	NOUN	
County			NOUN	NOUN	
Grand			ADJ	ADJ	
Jury			NOUN	NOUN	
said			VERB	VERB	
Friday			NOUN	NOUN	
an			DET	DET	
investigation			NOUN	NOUN	
of			ADP	ADP	
Atlanta's			NOUN	NOUN	
recent			ADJ	ADJ	
primary			NOUN	ADJ	
election			NOUN	NOUN	
produced			VERB	VERB	
``			.	.	
no			DET	DET	
evidence			NOUN	NOUN	
''			.	.	
that			ADP	ADP	
any			DET	DET	
irregularities			NOUN	NOUN	
took			VERB	VERB	
place			NOUN	NOUN	
.			.	.	


### Conclusion
We achieved our goal of loading the `word2vec` vectors into the embedding layer of an LSTM RNN model. In the next notebook we are going to look at how to perform the same task using a `Bi-LSTM` (Bidirectional LSTM). The model architecture will remain the same; we only swap the LSTM for a `Bi-LSTM`, still with `word2vec` vectors in the embedding layer.