# Part of Speech Tagging in PyTorch

## Part 1: Dataset Exploration

In [1]:
from src.data_module.udpos_dataset import UDPOS


In [2]:
udpos = UDPOS()

### 1.1. Dataset Shape

Let's check the number of examples in each dataset split.

In [3]:
print(f"Number of training examples: {len(udpos.train)}")
print(f"Number of validation examples: {len(udpos.val)}")
print(f"Number of testing examples: {len(udpos.test)}")

Number of training examples: 12543
Number of validation examples: 2002
Number of testing examples: 2077


This is what each example looks like. It has two fields:
* The `text` field holds the tokens and is used as the input feature for the POS tagger model.
* The `udtags` field holds one Universal Dependencies POS tag per token and is used as the label.

In [4]:
example = vars(udpos.train.examples[0])
print("Text\n", ' '.join(example['text']))
print("UD Tags\n", example['udtags'])

Text
 al - zaman : american forces killed shaikh abdullah al - ani , the preacher at the mosque in the town of qaim , near the syrian border .
UD Tags
 ['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT']
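
Since every token carries exactly one tag, the two lists line up position by position. A minimal sketch of that pairing, using the first seven tokens and tags from the output above (the names `tokens`, `tags`, and `pairs` are illustrative and not part of the `UDPOS` class):

```python
# First seven tokens and tags of the training example printed above.
tokens = ["al", "-", "zaman", ":", "american", "forces", "killed"]
tags = ["PROPN", "PUNCT", "PROPN", "PUNCT", "ADJ", "NOUN", "VERB"]

# A tagger predicts one label per token, so both lists must have equal length.
assert len(tokens) == len(tags)

pairs = list(zip(tokens, tags))
for token, tag in pairs:
    print(f"{token:<10} {tag}")
```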


### 1.2. Analyzing Vocabulary

The fields are already preprocessed: the text has been tokenized and a vocabulary has been built for each field. Let's check the vocabulary size of each field.

In [5]:
print(f"Unique tokens in TEXT vocabulary: {len(udpos.TEXT.vocab)}")
print(f"Unique tokens in UD_TAG vocabulary: {len(udpos.UD_TAGS.vocab)}")

Unique tokens in TEXT vocabulary: 8866
Unique tokens in UD_TAG vocabulary: 18
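
A vocabulary maps each token to an integer index. The sketch below shows one common way to build such a mapping with `collections.Counter`. The `build_vocab` helper, the `min_freq` cutoff, and the placement of `<unk>` and `<pad>` at the front are assumptions for illustration and may differ from what `UDPOS` does internally.

```python
from collections import Counter


def build_vocab(sentences, min_freq=1, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) for tokens that occur at least `min_freq` times."""
    freqs = Counter(token for sentence in sentences for token in sentence)
    # Special tokens first, then the remaining tokens by descending frequency
    # (ties broken alphabetically so the result is deterministic).
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, n in ordered if n >= min_freq]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


sentences = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
itos, stoi = build_vocab(sentences, min_freq=2)

# Tokens below the cutoff fall back to the <unk> index.
encoded = [stoi.get(tok, stoi["<unk>"]) for tok in ["the", "dog", "sat"]]
print(itos)
print(encoded)
```

With a frequency cutoff like this, the vocabulary can be smaller than the number of distinct tokens in the corpus, and rare tokens decode as `<unk>`.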


Let's see what the top 10 most common tokens in `text` are.

In [6]:
TOP = 10

print(f"Top {TOP} most common tokens in text are as follows:")

for i, (token, count) in enumerate(udpos.TEXT.vocab.freqs.most_common(TOP)):
    print(f"{i+1:>2}) {token:<5} has count {count:>5}")

Top 10 most common tokens in text are as follows:
 1) the   has count  9076
 2) .     has count  8640
 3) ,     has count  7021
 4) to    has count  5137
 5) and   has count  5002
 6) a     has count  3782
 7) of    has count  3622
 8) i     has count  3379
 9) in    has count  3112
10) is    has count  2239


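Let's check how many distinct POS tags there are in `udtags`. Note that the vocabulary size also counts special tokens, so the 18 entries reported are the 17 tags listed below plus the `<pad>` token used for padding.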

In [7]:
unique_pos_tags = len(udpos.UD_TAGS.vocab)

print(f"There are {unique_pos_tags} in total which are as follows")

for i, (tag, count) in enumerate(udpos.UD_TAGS.vocab.freqs.most_common()):
    print(f"{i+1:>2}) {tag:>6} ({count})")

There are 18 in total which are as follows
 1)   NOUN (34781)
 2)  PUNCT (23679)
 3)   VERB (23081)
 4)   PRON (18577)
 5)    ADP (17638)
 6)    DET (16285)
 7)  PROPN (12946)
 8)    ADJ (12477)
 9)    AUX (12343)
10)    ADV (10548)
11)  CCONJ (6707)
12)   PART (5567)
13)    NUM (3999)
14)  SCONJ (3843)
15)      X (847)
16)   INTJ (688)
17)    SYM (599)
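
The counts above are quite imbalanced, which is worth keeping in mind when evaluating a tagger. As a rough illustration, the snippet below converts the printed counts into relative frequencies and derives the accuracy of a hypothetical baseline that always predicts the most frequent tag. The `tag_counts` dictionary simply restates the output above; nothing here calls into `UDPOS`.

```python
# Tag frequencies copied from the output above.
tag_counts = {
    "NOUN": 34781, "PUNCT": 23679, "VERB": 23081, "PRON": 18577,
    "ADP": 17638, "DET": 16285, "PROPN": 12946, "ADJ": 12477,
    "AUX": 12343, "ADV": 10548, "CCONJ": 6707, "PART": 5567,
    "NUM": 3999, "SCONJ": 3843, "X": 847, "INTJ": 688, "SYM": 599,
}

total = sum(tag_counts.values())
shares = {tag: count / total for tag, count in tag_counts.items()}
majority_tag = max(tag_counts, key=tag_counts.get)

for tag, share in shares.items():
    print(f"{tag:>6} {share:6.2%}")

print(f"Always predicting {majority_tag} is right for "
      f"{shares[majority_tag]:.1%} of training tokens")
```

Any trained model should comfortably beat this majority-class figure; if it does not, something is likely wrong in the data pipeline or the training loop.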


### 1.3. Batch Iterator

Our model will use vectorized computation to be efficient. For that, we need to iterate over the dataset in batches.

In [8]:
train_iter, val_iter, test_iter = udpos.get_iterators()

Let's see what the iterator looks like.

In [14]:
for batch in train_iter:
    print(batch)
    break


[torchtext.data.batch.Batch of size 64 from UDPOS]
	[.text]:[torch.LongTensor of size 65x64]
	[.udtags]:[torch.LongTensor of size 65x64]


Each batch has two fields, `text` and `udtags`, both of shape $65 \times 64$ (number of steps $\times$ batch size). Here's what this means:

* **Number of steps**: 65. Each sequence is 65 tokens long after padding; shorter sentences are filled with `<pad>`.
* **Batch size**: 64. The batch contains 64 different sequences.
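
To make this time-first layout concrete, here is a small pure-Python sketch of how variable-length index sequences can be padded and arranged with time as the first dimension. The `pad_and_stack` helper and the padding index `1` are assumptions for illustration; the iterators returned by `get_iterators()` handle this internally.

```python
def pad_and_stack(sequences, pad_idx=1):
    """Pad sequences to equal length and return them time-major.

    The result has shape (max_len, batch_size): row t holds the token
    index at timestep t for every sequence in the batch.
    """
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]
    # zip(*padded) transposes (batch_size, max_len) -> (max_len, batch_size).
    return [list(step) for step in zip(*padded)]


batch = pad_and_stack([[5, 8, 9, 4], [7, 3], [6, 2, 4]])
print(len(batch), "timesteps x", len(batch[0]), "sequences")
for step in batch:
    print(step)
```

Reading the result row by row gives one timestep across all sequences, which is the order a recurrent model consumes it in.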

Let's look at a breakdown of one row.

In [20]:
for batch in train_iter:
    text, tags = batch.text.T, batch.udtags.T

    print("Text:", " ".join(udpos.TEXT.vocab.itos[i]
                            for i in text[0]))
    print("Tags:", " ".join(udpos.UD_TAGS.vocab.itos[i]
                            for i in tags[0]))

    break

Text: at weekend press conferences in salt lake city and phoenix , fbi and state officials said jeffs " is considered armed and dangerous and may be traveling with armed <unk> . " <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>
Tags: ADP NOUN NOUN NOUN ADP PROPN PROPN PROPN CCONJ PROPN PUNCT PROPN CCONJ NOUN NOUN VERB PROPN PUNCT AUX VERB ADJ CCONJ ADJ CCONJ AUX AUX VERB ADP ADJ NOUN PUNCT PUNCT <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>


After transposing the field values, their shape becomes $64 \times 65$, so each row is one example sentence, which is why the printed sentence reads naturally. The trailing `<pad>` tokens fill the sentence up to the batch's sequence length. In the LSTM model, we'll instead iterate over timesteps and process a batch of examples at each step.

## Part 2: Model Creation

In [1]:
import torch


Let's first specify some hyperparameter values. These determine the dimensions of the model's layers.

In [2]:
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
BATCH_SIZE = 64

### 2.1. Base LSTM Model

In [3]:
from src.module.lstm import LSTM

In [4]:
lstm = LSTM(num_inputs=EMBEDDING_DIM, num_hiddens=HIDDEN_DIM)

In [5]:
print("List of parameters making up the LSTM model")

for name, params in lstm.named_parameters():
    print(f"{name:<5} with shape {params.shape}")

List of parameters making up the LSTM model
W_xf  with shape torch.Size([100, 128])
W_hf  with shape torch.Size([128, 128])
b_f   with shape torch.Size([128])
W_xi  with shape torch.Size([100, 128])
W_hi  with shape torch.Size([128, 128])
b_i   with shape torch.Size([128])
W_xo  with shape torch.Size([100, 128])
W_ho  with shape torch.Size([128, 128])
b_o   with shape torch.Size([128])
W_xc  with shape torch.Size([100, 128])
W_hc  with shape torch.Size([128, 128])
b_c   with shape torch.Size([128])
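
The parameter names suggest the standard LSTM formulation, with one input weight, one hidden weight, and one bias per gate. Assuming `src.module.lstm` follows that formulation (the module's source is not shown here), a single timestep with input $X_t \in \mathbb{R}^{64 \times 100}$ and previous states $H_{t-1}, C_{t-1} \in \mathbb{R}^{64 \times 128}$ would compute:

$$
\begin{aligned}
F_t &= \sigma(X_t W_{xf} + H_{t-1} W_{hf} + b_f) \\
I_t &= \sigma(X_t W_{xi} + H_{t-1} W_{hi} + b_i) \\
O_t &= \sigma(X_t W_{xo} + H_{t-1} W_{ho} + b_o) \\
\tilde{C}_t &= \tanh(X_t W_{xc} + H_{t-1} W_{hc} + b_c) \\
C_t &= F_t \odot C_{t-1} + I_t \odot \tilde{C}_t \\
H_t &= O_t \odot \tanh(C_t)
\end{aligned}
$$

The shapes work out: $X_t W_{xf}$ is $(64 \times 100)(100 \times 128)$ and $H_{t-1} W_{hf}$ is $(64 \times 128)(128 \times 128)$, so every gate, and therefore $C_t$ and $H_t$, has shape $64 \times 128$.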


In [6]:
dummy_inputs = torch.randn((65, BATCH_SIZE, EMBEDDING_DIM))
outputs, (H, C) = lstm(dummy_inputs)

In [7]:
print("Number of outputs:", len(outputs))
print("Shape of Hidden State:", H.shape)
print("Shape of Memory Cell State:", C.shape)

Number of outputs: 65
Shape of Hidden State: torch.Size([64, 128])
Shape of Memory Cell State: torch.Size([64, 128])


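The output dimensions match the expected values: one output per timestep (65), and hidden and memory cell states of shape (batch size, hidden size) = (64, 128). This is a sanity check on shapes only; it does not by itself verify the gate computations.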
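
A complementary check that needs no forward pass is to compare the parameter count against a closed-form value. The sketch below assumes the four-gate layout listed above (one input weight, one hidden weight, and one bias per gate); `lstm_param_count` is a hypothetical helper, not part of `src.module.lstm`.

```python
def lstm_param_count(num_inputs, num_hiddens, num_gates=4):
    """Parameters of one LSTM layer with separate W_x*, W_h*, b_* per gate."""
    per_gate = (
        num_inputs * num_hiddens      # W_x*: (num_inputs, num_hiddens)
        + num_hiddens * num_hiddens   # W_h*: (num_hiddens, num_hiddens)
        + num_hiddens                 # b_*:  (num_hiddens,)
    )
    return num_gates * per_gate


expected = lstm_param_count(num_inputs=100, num_hiddens=128)
print(expected)
```

If the module registers exactly the twelve tensors listed above, `sum(p.numel() for p in lstm.parameters())` should give the same number.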

### 2.2. Bidirectional LSTM

In [8]:
from src.module.bi_lstm import BiLSTM

In [9]:
bi_lstm = BiLSTM(num_inputs=EMBEDDING_DIM, num_hiddens=HIDDEN_DIM)

In [10]:
print("List of parameters making up the BiLSTM model")

for name, params in bi_lstm.named_parameters():
    print(f"{name:<20} with shape {params.shape}")

List of parameters making up the BiLSTM model
forward_lstm.W_xf    with shape torch.Size([100, 128])
forward_lstm.W_hf    with shape torch.Size([128, 128])
forward_lstm.b_f     with shape torch.Size([128])
forward_lstm.W_xi    with shape torch.Size([100, 128])
forward_lstm.W_hi    with shape torch.Size([128, 128])
forward_lstm.b_i     with shape torch.Size([128])
forward_lstm.W_xo    with shape torch.Size([100, 128])
forward_lstm.W_ho    with shape torch.Size([128, 128])
forward_lstm.b_o     with shape torch.Size([128])
forward_lstm.W_xc    with shape torch.Size([100, 128])
forward_lstm.W_hc    with shape torch.Size([128, 128])
forward_lstm.b_c     with shape torch.Size([128])
backward_lstm.W_xf   with shape torch.Size([100, 128])
backward_lstm.W_hf   with shape torch.Size([128, 128])
backward_lstm.b_f    with shape torch.Size([128])
backward_lstm.W_xi   with shape torch.Size([100, 128])
backward_lstm.W_hi   with shape torch.Size([128, 128])
backward_lstm.b_i    with shape torch.Size([128])
...

In [11]:
dummy_inputs = torch.randn((65, BATCH_SIZE, EMBEDDING_DIM))
outputs, (f_h, b_h) = bi_lstm(dummy_inputs)

In [12]:
print("Number of outputs:", len(outputs))
print("Shape of Forward Hidden State:", f_h[0].shape)
print("Shape of Forward Memory Cell:", f_h[1].shape)
print("Shape of backward Hidden State:", b_h[0].shape)
print("Shape of backward Memory Cell:", b_h[1].shape)

Number of outputs: 65
Shape of Forward Hidden State: torch.Size([64, 128])
Shape of Forward Memory Cell: torch.Size([64, 128])
Shape of backward Hidden State: torch.Size([64, 128])
Shape of backward Memory Cell: torch.Size([64, 128])
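
For intuition about what the two directions contribute, the sketch below runs a toy recurrence over a sequence left to right and right to left, then aligns both results per timestep. The running-sum `toy_cell` stands in for an LSTM cell, and the way the two directions are paired here is an assumption; the actual `BiLSTM` in `src.module.bi_lstm` may combine them differently.

```python
def toy_cell(x, h):
    """Stand-in for an LSTM step: the new state is the old state plus the input."""
    return h + x


def scan(inputs, cell, h0=0):
    """Apply `cell` over `inputs` in order, returning the state after each step."""
    outputs, h = [], h0
    for x in inputs:
        h = cell(x, h)
        outputs.append(h)
    return outputs


def bidirectional_scan(inputs, cell):
    forward = scan(inputs, cell)
    # Run over the reversed sequence, then flip back so that index t in both
    # lists refers to the same input position.
    backward = scan(inputs[::-1], cell)[::-1]
    return list(zip(forward, backward))


paired = bidirectional_scan([1, 2, 3, 4], toy_cell)
print(paired)
```

At each position the forward value summarizes everything up to and including that token, while the backward value summarizes that token and everything after it. This two-sided context is what a bidirectional tagger gains over a unidirectional one.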
