# Part of Speech Tagging in PyTorch

## Part 1: Dataset Exploration

In [1]:
from src.data_module.udpos_dataset import UDPOS



In [2]:
udpos = UDPOS()

### 1.1. Dataset Shape

Let's check the number of examples in each dataset split.

In [3]:
print(f"Number of training examples: {len(udpos.train)}")
print(f"Number of validation examples: {len(udpos.val)}")
print(f"Number of testing examples: {len(udpos.test)}")

Number of training examples: 12543
Number of validation examples: 2002
Number of testing examples: 2077


This is what each example looks like. It has two fields:
* The `text` field holds the tokens and will be used as the input feature for the POS tagger model.
* The `udtags` field holds one tag per token and will be used as the labels.

In [4]:
example = vars(udpos.train.examples[0])
print("Text\n", ' '.join(example['text']))
print("UD Tags\n", example['udtags'])

Text
 al - zaman : american forces killed shaikh abdullah al - ani , the preacher at the mosque in the town of qaim , near the syrian border .
UD Tags
 ['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT']


### 1.2. Analyzing Vocabulary

The fields are already preprocessed, meaning the tokenization step has been performed. Let's check the vocabulary size of each field.

In [5]:
print(f"Unique tokens in TEXT vocabulary: {len(udpos.TEXT.vocab)}")
print(f"Unique tokens in UD_TAG vocabulary: {len(udpos.UD_TAGS.vocab)}")

Unique tokens in TEXT vocabulary: 8866
Unique tokens in UD_TAG vocabulary: 18


Let's see what the top 10 most common tokens in `text` are.

In [6]:
TOP = 10

print(f"Top {TOP} most common tokens in text are as follows:")

for i, (token, count) in enumerate(udpos.TEXT.vocab.freqs.most_common(TOP)):
    print(f"{i+1:>2}) {token:<5} has count {count:>5}")

Top 10 most common tokens in text are as follows:
 1) the   has count  9076
 2) .     has count  8640
 3) ,     has count  7021
 4) to    has count  5137
 5) and   has count  5002
 6) a     has count  3782
 7) of    has count  3622
 8) i     has count  3379
 9) in    has count  3112
10) is    has count  2239
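
As a rough illustration of what a vocabulary with `freqs` and `itos` attributes does, here is a minimal sketch that builds one from scratch with `collections.Counter`. The toy sentences and the choice of special tokens (`<unk>`, `<pad>`) are assumptions made for this example; the actual `UDPOS` class may build its vocabulary differently.

```python
from collections import Counter

# Toy tokenized corpus; these sentences are illustrative, not taken from UDPOS.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat", "."],
    ["the", "dog", "barked", "."],
]

# Count how often each token occurs across all sentences.
freqs = Counter(token for sentence in sentences for token in sentence)

# Index-to-string list: special tokens first (an assumption of this sketch),
# followed by the observed tokens in order of descending frequency.
itos = ["<unk>", "<pad>"] + [token for token, _ in freqs.most_common()]

# String-to-index lookup, the inverse of itos.
stoi = {token: index for index, token in enumerate(itos)}

print(len(itos))              # 10: eight distinct tokens plus two specials
print(freqs.most_common(2))   # [('the', 3), ('.', 2)]
```

Note that the special tokens count toward the vocabulary size but never appear in `freqs`, so the two numbers can differ.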


Let's check out how many distinct POS tags there are in `udtags`.

In [7]:
unique_pos_tags = len(udpos.UD_TAGS.vocab)

print(f"There are {unique_pos_tags} in total which are as follows")

for i, (tag, count) in enumerate(udpos.UD_TAGS.vocab.freqs.most_common()):
    print(f"{i+1:>2}) {tag:>6} ({count})")

There are 18 in total which are as follows
 1)   NOUN (34781)
 2)  PUNCT (23679)
 3)   VERB (23081)
 4)   PRON (18577)
 5)    ADP (17638)
 6)    DET (16285)
 7)  PROPN (12946)
 8)    ADJ (12477)
 9)    AUX (12343)
10)    ADV (10548)
11)  CCONJ (6707)
12)   PART (5567)
13)    NUM (3999)
14)  SCONJ (3843)
15)      X (847)
16)   INTJ (688)
17)    SYM (599)

Only 17 tags are listed even though the vocabulary size is 18. The extra vocabulary entry is most likely the padding token, which belongs to the vocabulary but has no frequency count.


### 1.3. Batch Iterator

Our model will use vectorized computation to be efficient. For that, we need to iterate over the dataset in batches.

Let's see what the iterator looks like.

In [8]:
for batch in udpos.train_dataloader():
    print(batch[0].shape, batch[1].shape)
    break

torch.Size([62, 64]) torch.Size([62, 64])


In [9]:
for batch in udpos.val_dataloader():
    print(batch[0].shape, batch[1].shape)
    break

torch.Size([1, 64]) torch.Size([1, 64])


Each batch contains two tensors, `text` and `udtags`, both of shape $x \times 64$. Here's what this means:

* **Number of steps**: $x$. There are $x$ sequential token positions (timesteps); this value varies from batch to batch.
* **Batch size**: 64. There are 64 different sequences in the batch.

Let's decode the first row of one batch.

In [10]:
for batch in udpos.train_dataloader():
    text, tags = batch

    print("Text:", " ".join(udpos.TEXT.vocab.itos[i]
                            for i in text[0]))
    print("Tags:", " ".join(udpos.UD_TAGS.vocab.itos[i]
                            for i in tags[0]))

    break

Text: thank that they you will service i sorry when this vince have i it because lorie www are will 3 we the and <unk> tiffany i from the a an i i power there most sorry <unk> juan margaret aquarius the it game little sounds thank absolute we we we john that we like you mark the criminal if take he five " one
Tags: VERB DET PRON PRON AUX NOUN PRON INTJ ADV DET PROPN VERB PRON PRON ADP X PROPN AUX AUX X PRON DET CCONJ X PROPN PRON SCONJ DET DET DET PRON PRON NOUN PRON ADJ ADJ NOUN PROPN PROPN PROPN DET PRON NOUN ADV VERB VERB ADJ PRON PRON PRON PROPN DET PRON ADP PRON PROPN DET ADJ SCONJ VERB PRON NUM PUNCT NUM


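Row 0 of the batch holds the first token of each of the 64 sequences, which is why the printed text does not read as a sentence. Transposing the field values changes their shape to $64 \times x$, so that each row is one example sentence and prints as readable text. In the LSTM model, we'll iterate over timesteps and process a batch of examples at each step.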
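
The difference between the time-major layout and per-sentence rows can be seen in a small sketch. The token ids and `PAD_IDX` below are made up for illustration, and `pad_sequence` is used here only as a stand-in for whatever batching the `UDPOS` dataloader performs internally.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy sequences of token ids with different lengths (ids are made up).
sequences = [
    torch.tensor([4, 9, 2]),
    torch.tensor([7, 3]),
    torch.tensor([5, 8, 6, 2]),
]
PAD_IDX = 1  # assumed padding index for this sketch

# pad_sequence defaults to time-major output: [max_len, batch_size].
batch = pad_sequence(sequences, padding_value=PAD_IDX)
print(batch.shape)        # torch.Size([4, 3])
print(batch[0].tolist())  # first token of every sequence: [4, 7, 5]

# Transposing gives [batch_size, max_len]; each row is one padded sequence.
rows = batch.t()
print(rows[1].tolist())   # second sequence with padding: [7, 3, 1, 1]
```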

## Part 2: Model Creation

In [11]:
import torch

Let's first specify some hyperparameter values. These will dictate the architectural constraints of our model.

In [12]:
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
BATCH_SIZE = 64
NUM_LAYERS = 2

### 2.1. Base LSTM Model

In [13]:
from src.module.lstm import LSTM

In [14]:
model = LSTM(num_inputs=EMBEDDING_DIM, num_hiddens=HIDDEN_DIM)

In [15]:
print("List of parameters making up the LSTM model")

for name, params in model.named_parameters():
    print(f"{name:<5} with shape {params.shape}")

List of parameters making up the LSTM model
W_xf  with shape torch.Size([100, 128])
W_hf  with shape torch.Size([128, 128])
b_f   with shape torch.Size([128])
W_xi  with shape torch.Size([100, 128])
W_hi  with shape torch.Size([128, 128])
b_i   with shape torch.Size([128])
W_xo  with shape torch.Size([100, 128])
W_ho  with shape torch.Size([128, 128])
b_o   with shape torch.Size([128])
W_xc  with shape torch.Size([100, 128])
W_hc  with shape torch.Size([128, 128])
b_c   with shape torch.Size([128])


In [16]:
dummy_inputs = torch.randn((65, BATCH_SIZE, EMBEDDING_DIM))
outputs, (H, C) = model(dummy_inputs)

In [17]:
print("Number of outputs:", len(outputs))
print("Shape of Hidden State:", H.shape)
print("Shape of Memory Cell State:", C.shape)

Number of outputs: 65
Shape of Hidden State: torch.Size([64, 128])
Shape of Memory Cell State: torch.Size([64, 128])


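The output dimensions match the expected values: one output per timestep, and hidden and memory cell states of shape (batch size, hidden size). This is a basic sanity check on the implementation rather than a proof of correctness.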
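
To make the parameter list above more concrete, here is a minimal sketch of a single LSTM time step written with the same parameter names and shapes. It assumes the standard LSTM gate equations; the code in `src.module.lstm` is not shown in this notebook and may differ in details such as initialization. The helper `lstm_step` and the `params` dictionary are hypothetical names used only for this example.

```python
import torch


def lstm_step(x, h, c, p):
    """Run one LSTM time step with the standard gate equations."""
    f = torch.sigmoid(x @ p["W_xf"] + h @ p["W_hf"] + p["b_f"])  # forget gate
    i = torch.sigmoid(x @ p["W_xi"] + h @ p["W_hi"] + p["b_i"])  # input gate
    o = torch.sigmoid(x @ p["W_xo"] + h @ p["W_ho"] + p["b_o"])  # output gate
    c_tilde = torch.tanh(x @ p["W_xc"] + h @ p["W_hc"] + p["b_c"])  # candidate
    c = f * c + i * c_tilde   # update the memory cell
    h = o * torch.tanh(c)     # compute the new hidden state
    return h, c


num_inputs, num_hiddens, batch_size, steps = 100, 128, 64, 65

# Randomly initialized parameters with the shapes listed above.
params = {}
for gate in "fioc":
    params[f"W_x{gate}"] = torch.randn(num_inputs, num_hiddens) * 0.01
    params[f"W_h{gate}"] = torch.randn(num_hiddens, num_hiddens) * 0.01
    params[f"b_{gate}"] = torch.zeros(num_hiddens)

inputs = torch.randn(steps, batch_size, num_inputs)
H = torch.zeros(batch_size, num_hiddens)
C = torch.zeros(batch_size, num_hiddens)

# Iterate over timesteps; each x has shape [batch_size, num_inputs].
outputs = []
for x in inputs:
    H, C = lstm_step(x, H, C, params)
    outputs.append(H)

print(len(outputs), H.shape, C.shape)
```

The loop produces 65 outputs and final states of shape `[64, 128]`, matching the shapes printed above.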

### 2.2. Bidirectional LSTM

In [18]:
from src.module.bi_lstm import BiLSTM

In [19]:
model = BiLSTM(num_inputs=EMBEDDING_DIM, num_hiddens=HIDDEN_DIM)

In [20]:
print("List of parameters making up the LSTM model")

for name, params in model.named_parameters():
    print(f"{name:<20} with shape {params.shape}")

List of parameters making up the LSTM model
forward_lstm.W_xf    with shape torch.Size([100, 128])
forward_lstm.W_hf    with shape torch.Size([128, 128])
forward_lstm.b_f     with shape torch.Size([128])
forward_lstm.W_xi    with shape torch.Size([100, 128])
forward_lstm.W_hi    with shape torch.Size([128, 128])
forward_lstm.b_i     with shape torch.Size([128])
forward_lstm.W_xo    with shape torch.Size([100, 128])
forward_lstm.W_ho    with shape torch.Size([128, 128])
forward_lstm.b_o     with shape torch.Size([128])
forward_lstm.W_xc    with shape torch.Size([100, 128])
forward_lstm.W_hc    with shape torch.Size([128, 128])
forward_lstm.b_c     with shape torch.Size([128])
backward_lstm.W_xf   with shape torch.Size([100, 128])
backward_lstm.W_hf   with shape torch.Size([128, 128])
backward_lstm.b_f    with shape torch.Size([128])
backward_lstm.W_xi   with shape torch.Size([100, 128])
backward_lstm.W_hi   with shape torch.Size([128, 128])
backward_lstm.b_i    with shape torch.Size([12

In [21]:
outputs, (f_h, b_h) = model(dummy_inputs)

In [22]:
print("Outputs shape:", len(outputs))
print("Shape of Forward Hidden State:", f_h[0].shape)
print("Shape of Forward Memory Cell:", f_h[1].shape)
print("Shape of backward Hidden State:", b_h[0].shape)
print("Shape of backward Memory Cell:", b_h[1].shape)

Outputs shape: 65
Shape of Forward Hidden State: torch.Size([64, 128])
Shape of Forward Memory Cell: torch.Size([64, 128])
Shape of backward Hidden State: torch.Size([64, 128])
Shape of backward Memory Cell: torch.Size([64, 128])


The output, hidden state, and memory cell shapes are as expected for both directions, which is a basic sanity check on the implementation.
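
The idea behind a bidirectional layer can be sketched with PyTorch's built-in `nn.LSTM` as a stand-in for the custom modules: one LSTM reads the sequence left to right, a second one reads it right to left, and the per-timestep outputs are concatenated. This is an assumption about the general technique, not a description of how `src.module.bi_lstm` is written, and it ignores padding for simplicity.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Two independent LSTMs, one per direction (stand-ins for the custom modules).
forward_lstm = nn.LSTM(input_size=100, hidden_size=128)
backward_lstm = nn.LSTM(input_size=100, hidden_size=128)

inputs = torch.randn(65, 64, 100)  # [steps, batch_size, features]

# Forward direction: process timesteps 0, 1, ..., T-1.
fwd_out, (fwd_h, fwd_c) = forward_lstm(inputs)

# Backward direction: reverse time, run the LSTM, then flip the outputs back
# so that position t in both output tensors refers to the same token.
bwd_out, (bwd_h, bwd_c) = backward_lstm(torch.flip(inputs, dims=[0]))
bwd_out = torch.flip(bwd_out, dims=[0])

# Concatenate along the feature dimension: 128 + 128 = 256 features per token.
combined = torch.cat([fwd_out, bwd_out], dim=-1)
print(combined.shape)  # torch.Size([65, 64, 256])
```

After flipping back, the backward LSTM's final hidden state corresponds to position 0 of the sequence, because that is the last token it reads.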

### 2.3. Deep LSTMs

In [23]:
from src.module.deep_lstm import DeepLSTM

In [24]:
model = DeepLSTM(num_inputs=EMBEDDING_DIM,
                 num_hiddens=HIDDEN_DIM,
                 num_layers=NUM_LAYERS,
                 bidirectional=False)

In [25]:
print("List of parameters making up the LSTM model")

for name, params in model.named_parameters():
    print(f"{name:<20} with shape {params.shape}")

List of parameters making up the LSTM model
layers.0.W_xf        with shape torch.Size([100, 128])
layers.0.W_hf        with shape torch.Size([128, 128])
layers.0.b_f         with shape torch.Size([128])
layers.0.W_xi        with shape torch.Size([100, 128])
layers.0.W_hi        with shape torch.Size([128, 128])
layers.0.b_i         with shape torch.Size([128])
layers.0.W_xo        with shape torch.Size([100, 128])
layers.0.W_ho        with shape torch.Size([128, 128])
layers.0.b_o         with shape torch.Size([128])
layers.0.W_xc        with shape torch.Size([100, 128])
layers.0.W_hc        with shape torch.Size([128, 128])
layers.0.b_c         with shape torch.Size([128])
layers.1.W_xf        with shape torch.Size([128, 128])
layers.1.W_hf        with shape torch.Size([128, 128])
layers.1.b_f         with shape torch.Size([128])
layers.1.W_xi        with shape torch.Size([128, 128])
layers.1.W_hi        with shape torch.Size([128, 128])
layers.1.b_i         with shape torch.Size([12

In [26]:
outputs, Hs = model(dummy_inputs)

In [27]:
print("Output shape:", outputs.shape)

for i, h in enumerate(Hs):
    print(f"Layer {i+1}: Shape of Hidden State:", h[0].shape)
    print(f"Layer {i+1}: Shape of Memory Cell:", h[1].shape)

Output shape: torch.Size([65, 64, 128])
Layer 1: Shape of Hidden State: torch.Size([64, 128])
Layer 1: Shape of Memory Cell: torch.Size([64, 128])
Layer 2: Shape of Hidden State: torch.Size([64, 128])
Layer 2: Shape of Memory Cell: torch.Size([64, 128])


### 2.4. Deep Bidirectional LSTM

In [28]:
model = DeepLSTM(num_inputs=EMBEDDING_DIM,
                 num_hiddens=HIDDEN_DIM,
                 num_layers=NUM_LAYERS,
                 bidirectional=True)

In [29]:
print("List of parameters making up the LSTM model")

for name, params in model.named_parameters():
    print(f"{name:<20} with shape {params.shape}")

List of parameters making up the LSTM model
layers.0.forward_lstm.W_xf with shape torch.Size([100, 128])
layers.0.forward_lstm.W_hf with shape torch.Size([128, 128])
layers.0.forward_lstm.b_f with shape torch.Size([128])
layers.0.forward_lstm.W_xi with shape torch.Size([100, 128])
layers.0.forward_lstm.W_hi with shape torch.Size([128, 128])
layers.0.forward_lstm.b_i with shape torch.Size([128])
layers.0.forward_lstm.W_xo with shape torch.Size([100, 128])
layers.0.forward_lstm.W_ho with shape torch.Size([128, 128])
layers.0.forward_lstm.b_o with shape torch.Size([128])
layers.0.forward_lstm.W_xc with shape torch.Size([100, 128])
layers.0.forward_lstm.W_hc with shape torch.Size([128, 128])
layers.0.forward_lstm.b_c with shape torch.Size([128])
layers.0.backward_lstm.W_xf with shape torch.Size([100, 128])
layers.0.backward_lstm.W_hf with shape torch.Size([128, 128])
layers.0.backward_lstm.b_f with shape torch.Size([128])
layers.0.backward_lstm.W_xi with shape torch.Size([100, 128])
layers

In [30]:
outputs, Hs = model(dummy_inputs)

In [31]:
print("Output shape:", outputs.shape)

for i, h in enumerate(Hs):
    print(f"Layer {i+1}: Shape of Forward Hidden State:", h[0][0].shape)
    print(f"Layer {i+1}: Shape of Forward Memory Cell:", h[0][1].shape)
    print(f"Layer {i+1}: Shape of Backward Hidden State:", h[1][0].shape)
    print(f"Layer {i+1}: Shape of Backward Memory Cell:", h[1][1].shape)

Output shape: torch.Size([65, 64, 256])
Layer 1: Shape of Forward Hidden State: torch.Size([64, 128])
Layer 1: Shape of Forward Memory Cell: torch.Size([64, 128])
Layer 1: Shape of Backward Hidden State: torch.Size([64, 128])
Layer 1: Shape of Backward Memory Cell: torch.Size([64, 128])
Layer 2: Shape of Forward Hidden State: torch.Size([64, 128])
Layer 2: Shape of Forward Memory Cell: torch.Size([64, 128])
Layer 2: Shape of Backward Hidden State: torch.Size([64, 128])
Layer 2: Shape of Backward Memory Cell: torch.Size([64, 128])


### 2.5. POS Tagger

In [32]:
from src.module.pos_tagger import PosTagger

In [33]:
model = PosTagger(num_inputs=len(udpos.TEXT.vocab),
                  embedding_dim=100,
                  num_hiddens=128,
                  num_outputs=len(udpos.UD_TAGS.vocab),
                  bidirectional=True,
                  num_layers=2,
                  padding_idx=udpos.TEXT.vocab[udpos.TEXT.pad_token])



In [34]:
for batch in udpos.train_dataloader():
    X, y = batch
    y_hat = model(X)
    
    print(f"Features.shape: {X.shape}")
    print(f"Labels.shape: {y.shape}")
    print(f"Predictions.shape: {y_hat.shape}")
    
    y_hat = y_hat.reshape(-1, y_hat.shape[-1])
    y = y.reshape(-1)
    
    loss = model.loss(y_hat, y)
    acc = model.accuracy(y_hat, y)
    
    print(f"Loss: {loss.item()}")
    print(f"Accuracy: {acc.item()}")
    
    break

Features.shape: torch.Size([70, 64])
Labels.shape: torch.Size([70, 64])
Predictions.shape: torch.Size([70, 64, 18])
Loss: 2.8674864768981934
Accuracy: 0.00977426115423441
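
The reshape-then-score pattern used in the cell above can be sketched with `torch.nn.functional.cross_entropy`. The internals of `model.loss` and `model.accuracy` are not shown in this notebook, so whether they skip padding positions is unknown; the `ignore_index` handling and the `PAD_TAG_IDX` value below are assumptions of this example. The sketch also shows why an untrained tagger over 18 classes is expected to start with a loss near $\ln 18 \approx 2.89$, which is consistent with the 2.867 reported above.

```python
import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
steps, batch_size, num_tags = 70, 64, 18
PAD_TAG_IDX = 0  # assumed index of the padding tag in this sketch

# All-zero logits mean a uniform distribution over the 18 tags.
logits = torch.zeros(steps, batch_size, num_tags)
labels = torch.randint(1, num_tags, (steps, batch_size))
labels[-10:, :32] = PAD_TAG_IDX  # pretend half the sequences are shorter

# Flatten [steps, batch, tags] to [steps * batch, tags], labels to 1-D.
flat_logits = logits.reshape(-1, num_tags)
flat_labels = labels.reshape(-1)

# Padding positions are excluded from the average via ignore_index.
loss = F.cross_entropy(flat_logits, flat_labels, ignore_index=PAD_TAG_IDX)

# Accuracy computed only over non-padding positions.
mask = flat_labels != PAD_TAG_IDX
predictions = flat_logits.argmax(dim=-1)
accuracy = (predictions[mask] == flat_labels[mask]).float().mean()

print(f"loss={loss.item():.3f}  ln(18)={math.log(num_tags):.3f}")
```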




## Part 3: Training Mechanism

Training consists of two steps per epoch.

1. **Training step**: Go through the entire training dataset in batches, feeding each batch to the model and updating its parameters.
2. **Validation step**: After each training step, evaluate the model on the validation set without updating its parameters. This only measures how well the model does on unseen data.
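
The two steps above can be sketched as a generic PyTorch loop. This is not the code of `train_epochs`, which is not shown in this notebook; the linear model, the synthetic data, and the helper names `make_batches` and `run_epoch` are placeholders chosen to keep the example small and runnable.

```python
import torch
from torch import nn

torch.manual_seed(0)


def make_batches(num_batches):
    """Create synthetic (features, labels) batches as stand-in data."""
    return [(torch.randn(32, 10), torch.randint(0, 3, (32,)))
            for _ in range(num_batches)]


train_batches, val_batches = make_batches(8), make_batches(2)
model = nn.Linear(10, 3)  # placeholder model, not the POS tagger
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()


def run_epoch(batches, train):
    """Run one pass over the batches; update parameters only if train=True."""
    model.train(train)
    total = 0.0
    # Gradients are tracked only during the training step.
    with torch.set_grad_enabled(train):
        for X, y in batches:
            loss = loss_fn(model(X), y)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total += loss.item()
    return total / len(batches)


history = []
for epoch in range(3):
    train_loss = run_epoch(train_batches, train=True)   # training step
    val_loss = run_epoch(val_batches, train=False)      # validation step
    history.append((train_loss, val_loss))
    print(f"Epoch {epoch + 1:02}: train {train_loss:.3f} | val {val_loss:.3f}")
```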

In [35]:
from src.utils.loops import train_epochs

In [36]:
udpos = UDPOS()
model = PosTagger(num_inputs=len(udpos.TEXT.vocab),
                  embedding_dim=100,
                  num_hiddens=64,
                  num_outputs=len(udpos.UD_TAGS.vocab),
                  bidirectional=True,
                  num_layers=1,
                  padding_idx=udpos.TEXT.vocab[udpos.TEXT.pad_token])

In [37]:
losses, accuracies = train_epochs(model, udpos, 10)

Epoch: 01 in (36.30) secs
	Train Loss: 0.771 | Train Acc: 84.51%
	 Val. Loss: 1.260 |  Val. Acc: 66.52%
Epoch: 02 in (35.75) secs
	Train Loss: 0.228 | Train Acc: 93.31%
	 Val. Loss: 0.827 |  Val. Acc: 77.41%
Epoch: 03 in (35.72) secs
	Train Loss: 0.149 | Train Acc: 95.54%
	 Val. Loss: 0.642 |  Val. Acc: 80.64%
Epoch: 04 in (35.56) secs
	Train Loss: 0.106 | Train Acc: 96.80%
	 Val. Loss: 0.544 |  Val. Acc: 83.42%
Epoch: 05 in (35.40) secs
	Train Loss: 0.082 | Train Acc: 97.52%
	 Val. Loss: 0.480 |  Val. Acc: 85.25%
Epoch: 06 in (35.20) secs
	Train Loss: 0.068 | Train Acc: 97.98%
	 Val. Loss: 0.435 |  Val. Acc: 86.41%
Epoch: 07 in (35.14) secs
	Train Loss: 0.057 | Train Acc: 98.30%
	 Val. Loss: 0.399 |  Val. Acc: 87.71%


KeyboardInterrupt: 