### PoS - Part of Speech Tagging 

In this series of notebook we are going to make use of sevaral model achitecture to perform PoS Part of Speech Tagging using tensorflow.

### Part of Speech Tagging (PoS)

This is a process of classifying words into their part of speech.

### Imports



In [24]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

import pandas as pd

from sklearn.model_selection import train_test_split
from nltk.corpus import brown, treebank, conll2000

import os, time, re, string, random, nltk
tf.__version__

'2.6.0'

### Data loading

We are going to make use of the `nltk` (Natural Language Tool Kit) to create our dataset from the imported corpus.

In [10]:
for i in ['brown', "treebank", "conll2000", 'universal_tagset']:
  nltk.download(i)


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.


In [13]:
tagged_sentences = brown.tagged_sents(tagset='universal') + treebank.tagged_sents(tagset='universal') + conll2000.tagged_sents(tagset='universal')

In [15]:
len(tagged_sentences)

72202

In [18]:
tagged_sentences[7]

[('Merger', 'NOUN'), ('proposed', 'VERB')]

### Dataset creation

Since this is a many to may problem, each data point will be different sentence of the corpra. Each data point will have multiple words and multiple output for example:

```
X = [Mr Vinken is chairman of Elsevier]
Y = [NOUN NOUN VERB NOUN ADP NOUN]
```

In [19]:
X, Y = [], []

for sent in tagged_sentences:
  sentence = []
  tags = []
  for s, t in sent:
    sentence.append(s)
    tags.append(t)

  X.append(sentence)
  Y.append(tags)

print("done!")

done!


### Saving this as a `csv` file.

In [22]:
sentences_tags_pairs = []

for sents, tgs in zip(X, Y):
  sentences_tags_pairs.append([" ".join(sents), " ".join(tgs)])

In [23]:
sentences_tags_pairs[:2]

[["The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .",
  'DET NOUN NOUN ADJ NOUN VERB NOUN DET NOUN ADP NOUN ADJ NOUN NOUN VERB . DET NOUN . ADP DET NOUN VERB NOUN .'],
 ["The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .",
  'DET NOUN ADV VERB ADP NOUN NOUN ADP DET NOUN ADJ NOUN . DET VERB ADJ NOUN ADP DET NOUN . . VERB DET NOUN CONJ NOUN ADP DET NOUN ADP NOUN . ADP DET NOUN ADP DET DET NOUN VERB VERB .']]

In [26]:
dataframe = pd.DataFrame(sentences_tags_pairs, columns=[
    "sentence", "tags"])

dataframe.to_csv("pos.csv", index=False)
print("saved.")

saved.


### Data Statistics

In [27]:
num_words = len(set([word.lower() 
for sentence in X for word in sentence]))

num_tags = len(set([word.lower() 
for sentence in Y for word in sentence]))

In [28]:
print("Total number of tagged sentences: {}".format(len(X)))
print("Vocabulary size: {}".format(num_words))
print("Total number of tags: {}".format(num_tags))

Total number of tagged sentences: 72202
Vocabulary size: 59448
Total number of tags: 12


### Checking examples

In [29]:
print("Sample x: ", X[0], "\n")
print("Sample y: ", Y[0], "\n")

Sample x:  ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'] 

Sample y:  ['DET', 'NOUN', 'NOUN', 'ADJ', 'NOUN', 'VERB', 'NOUN', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADJ', 'NOUN', 'NOUN', 'VERB', '.', 'DET', 'NOUN', '.', 'ADP', 'DET', 'NOUN', 'VERB', 'NOUN', '.'] 



### Text vectorization

We are going to use the `Tokenizer` class to encode text from sequences to sequence of integers.

In [31]:
tokenizer = keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(X)

In [32]:
tag_tokenizer = keras.preprocessing.text.Tokenizer()
tag_tokenizer.fit_on_texts(Y)

In [33]:
tag_tokenizer.word_index

{'.': 3,
 'adj': 6,
 'adp': 4,
 'adv': 7,
 'conj': 9,
 'det': 5,
 'noun': 1,
 'num': 11,
 'pron': 8,
 'prt': 10,
 'verb': 2,
 'x': 12}

Now we can convert our tokens to sequences.

In [37]:
sentences_sequences = tokenizer.texts_to_sequences(X)
tags_sequences = tag_tokenizer.texts_to_sequences(Y)

### Checking a single example.

In [38]:
print(sentences_sequences[0])
tags_sequences[0]

[1, 5731, 778, 2326, 1842, 39, 853, 34, 1944, 4, 16831, 379, 1343, 1523, 1116, 12, 67, 569, 14, 9, 89, 10208, 252, 205, 3]


[5, 1, 1, 6, 1, 2, 1, 5, 1, 4, 1, 6, 1, 1, 2, 3, 5, 1, 3, 4, 5, 1, 2, 1, 3]

Let's convert tag tokens back to word representations.

In [40]:
print("Y[0]: ", Y[0])
print("tags_sequences[0]: ", tags_sequences[0])
print("sequences_to_tags[0]: ", 
      tag_tokenizer.sequences_to_texts([tags_sequences[0]]))

Y[0]:  ['DET', 'NOUN', 'NOUN', 'ADJ', 'NOUN', 'VERB', 'NOUN', 'DET', 'NOUN', 'ADP', 'NOUN', 'ADJ', 'NOUN', 'NOUN', 'VERB', '.', 'DET', 'NOUN', '.', 'ADP', 'DET', 'NOUN', 'VERB', 'NOUN', '.']
tags_sequences[0]:  [5, 1, 1, 6, 1, 2, 1, 5, 1, 4, 1, 6, 1, 1, 2, 3, 5, 1, 3, 4, 5, 1, 2, 1, 3]
sequences_to_tags[0]:  ['det noun noun adj noun verb noun det noun adp noun adj noun noun verb . det noun . adp det noun verb noun .']


### Checking if the inputs and outputs have the same length.


In [42]:
different_length = [1 if len(input) != len(output) else 0 for input, output in zip(tags_sequences, sentences_sequences)]
print("{} sentences have disparate input-output lengths.".format(sum(different_length)))

0 sentences have disparate input-output lengths.


### Padding sequences

Since the sentences has various length we are going to pad the sequences of these sentences to the longest sentence. We will make sure that these sequences are padded to have the same length.

In [47]:
lengths = [len(seq) for seq in tagged_sentences]
MAX_LENGTH = max(lengths)
print(f"Longest sentence: {MAX_LENGTH}")

MAX_LENGTH = 100 # we are going to set the max-length to 100

Longest sentence: 271


In [48]:
padded_sentences = keras.preprocessing.sequence.pad_sequences(
    sentences_sequences,
    maxlen=MAX_LENGTH,
    padding="post",
    truncating="post"
)
padded_tags = keras.preprocessing.sequence.pad_sequences(
    tags_sequences,
    maxlen=MAX_LENGTH,
    padding="post",
    truncating="post"
)

Checking the a single example of the padded sequence.

In [49]:
print(padded_sentences[0], "\n"*2)
print(padded_tags[0])

[    1  5731   778  2326  1842    39   853    34  1944     4 16831   379
  1343  1523  1116    12    67   569    14     9    89 10208   252   205
     3     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0] 


[5 1 1 6 1 2 1 5 1 4 1 6 1 1 2 3 5 1 3 4 5 1 2 1 3 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


### One-hot Encode `padded_tags` labels

In [54]:
padded_tags = keras.utils.to_categorical(padded_tags)
padded_tags.shape

(72202, 100, 13)

### Set's spliting.

We are then going to split the data into 3 sets using the `sklearn` `train_test_split` method to split our data train and test sets.

In [55]:
X_train, X_test, y_train, y_test = train_test_split(
   padded_sentences, padded_tags, random_state=42, test_size=.15
)
X_train, X_valid, y_train, y_valid = train_test_split(
   padded_sentences, padded_tags, random_state=42, test_size=.15
)

### Counting examples

In [57]:
print("training: ", len(X_train))
print("testing: ", len(X_test))
print("validation: ", len(X_valid))

training:  61371
testing:  10831
validation:  10831


### A simple RNN

We are going to create a simple RNN without word embeddings.

In [61]:
n_classes = y_train.shape[-1]
n_classes

13

In [66]:
VOCAB_SIZE = len(tokenizer.word_index) + 1
EMBEDDING_SIZE = 300
MAX_SEQUENCE_LENGTH = 100
VOCAB_SIZE

59449

In [68]:
rnn_model = keras.Sequential([
    keras.layers.Embedding(
      VOCAB_SIZE, EMBEDDING_SIZE, input_length=MAX_SEQUENCE_LENGTH,
      trainable = True
    ),
    keras.layers.SimpleRNN(64, return_sequences=True),
    keras.layers.TimeDistributed(
        keras.layers.Dense(n_classes, activation="softmax")
    )
], name="simple_rnn")

rnn_model.summary()

Model: "simple_rnn"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 300)          17834700  
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 100, 64)           23360     
_________________________________________________________________
time_distributed (TimeDistri (None, 100, 13)           845       
Total params: 17,858,905
Trainable params: 17,858,905
Non-trainable params: 0
_________________________________________________________________


### Model training

In [70]:
rnn_model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])

rnn_model.fit(
    X_train, y_train, batch_size=128, 
    epochs=10, validation_data=(X_valid, y_valid)
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fdc77ff2d90>

### Evaluating the model

In [72]:
rnn_model.evaluate(X_test, y_test, verbose=1)



[0.036777108907699585, 0.989393413066864]

### Model Inference

Now we are ready to make predictions of our tags. We are going to perform the following steps in the `make_prediction` function.
1. tokenize the sentence
2. convert the tokenized sentence to integer representation
3. padd the tokenized sentences and pass them to the model
4. get the predictions and we convert the predictions back to `tags`.

In [75]:
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [76]:
sent = "The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place ."
tags = 'DET NOUN NOUN ADJ NOUN VERB NOUN DET NOUN ADP NOUN ADJ NOUN NOUN VERB . DET NOUN . ADP DET NOUN VERB NOUN .'.split(" ")

In [99]:
def tokenize_and_pad_sequences(sent):

  if isinstance(sent, str):
    tokens = sent.split(" ")
  else:
    tokens = sent
  tokens = [t.lower() for t in tokens]
  sequences = tokenizer.texts_to_sequences([tokens])
  padded_sequnces = keras.preprocessing.sequence.pad_sequences(
    sequences,
    maxlen=MAX_LENGTH,
    padding="post",
    truncating="post"
  )
  predictions = rnn_model.predict(padded_sequnces)
  predictions = tf.argmax(predictions, axis=-1).numpy().astype("int32")

  return  tag_tokenizer.sequences_to_texts(predictions)

In [110]:
pred_tags = tokenize_and_pad_sequences(sent)

print("word\t\t\ttag\tpred-tag\t")
print("-"*40)
for word, tag, pred_tag in zip(sent.split(" "), tags, pred_tags[0].split(" ")):
  print(f"{word}\t\t\t{tag}\t{pred_tag.upper()}\t")

word			tag	pred-tag	
----------------------------------------
The			DET	DET	
Fulton			NOUN	NOUN	
County			NOUN	NOUN	
Grand			ADJ	ADJ	
Jury			NOUN	NOUN	
said			VERB	VERB	
Friday			NOUN	NOUN	
an			DET	DET	
investigation			NOUN	NOUN	
of			ADP	ADP	
Atlanta's			NOUN	NOUN	
recent			ADJ	ADJ	
primary			NOUN	NOUN	
election			NOUN	NOUN	
produced			VERB	VERB	
``			.	.	
no			DET	DET	
evidence			NOUN	NOUN	
''			.	.	
that			ADP	ADP	
any			DET	DET	
irregularities			NOUN	NOUN	
took			VERB	VERB	
place			NOUN	NOUN	
.			.	.	


### Conclusion
We have implemented our **Deep Learning** model that perform POS tagging. In the next notebook we are going to have a look at how we can make use of the `word2vec` vectors and load them in our embeding layer.