# Part of Speech Tagging in PyTorch

## Part 1: Dataset Exploration

In [1]:
from src.data_module.udpos_dataset import UDPOS



In [2]:
udpos = UDPOS()

downloading en-ud-v2.zip


.data\udpos\en-ud-v2.zip: 100%|██████████| 688k/688k [00:00<00:00, 1.52MB/s]


extracting


.vector_cache\glove.6B.zip: 862MB [10:35, 1.36MB/s]                               
100%|█████████▉| 399999/400000 [00:22<00:00, 17564.07it/s]


Let's check the number of examples in each dataset split.

In [3]:
print(f"Number of training examples: {len(udpos.train)}")
print(f"Number of validation examples: {len(udpos.val)}")
print(f"Number of testing examples: {len(udpos.test)}")

Number of training examples: 12543
Number of validation examples: 2002
Number of testing examples: 2077
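
To put these counts in proportion, the split sizes can be converted to shares of the whole corpus. The snippet below is a minimal sketch that hard-codes the three counts printed above instead of reading them from `udpos`; the `split_sizes` and `split_shares` names are illustrative only.

```python
# Split sizes copied from the output above (not read from the dataset object).
split_sizes = {"train": 12543, "validation": 2002, "test": 2077}

total = sum(split_sizes.values())

# Share of each split relative to the whole corpus.
split_shares = {name: size / total for name, size in split_sizes.items()}

for name, share in split_shares.items():
    print(f"{name:<10} {split_sizes[name]:>6} examples ({share:.1%})")
```

Roughly three quarters of the sentences fall in the training split, with the remainder divided almost evenly between validation and test.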


This is what each example looks like. Each example has two fields:
* The `text` field holds the tokens and will be used as the input features for the POS tagger model.
* The `udtags` field holds one tag per token and will be used as the labels.

In [11]:
example = vars(udpos.train.examples[0])
print("Text\n", ' '.join(example['text']))
print("UD Tags\n", example['udtags'])

Text
 al - zaman : american forces killed shaikh abdullah al - ani , the preacher at the mosque in the town of qaim , near the syrian border .
UD Tags
 ['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN', 'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'ADP', 'DET', 'ADJ', 'NOUN', 'PUNCT']
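
Since the two fields line up position by position, each token can be paired with its tag. The sketch below hard-codes the example shown above so that it runs without the dataset; with the real object, `example['text']` and `example['udtags']` would take the place of the literals.

```python
# Tokens and tags copied from the first training example shown above.
text = ("al - zaman : american forces killed shaikh abdullah al - ani , "
        "the preacher at the mosque in the town of qaim , near the syrian border .")
tokens = text.split()
tags = ['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN',
        'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP',
        'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'ADP',
        'DET', 'ADJ', 'NOUN', 'PUNCT']

# A sequence-labeling example must have exactly one tag per token.
assert len(tokens) == len(tags)

# Pair each token with its tag and show the first few pairs.
pairs = list(zip(tokens, tags))
for token, tag in pairs[:7]:
    print(f"{token:<10} {tag}")
```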


The fields are already preprocessed: tokenization has been performed, and the example above suggests the text has also been lowercased. Let's check the vocabulary size for each field.

In [5]:
print(f"Unique tokens in TEXT vocabulary: {len(udpos.TEXT.vocab)}")
print(f"Unique tokens in UD_TAG vocabulary: {len(udpos.UD_TAGS.vocab)}")

Unique tokens in TEXT vocabulary: 8866
Unique tokens in UD_TAG vocabulary: 18
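
A vocabulary of this kind is typically a mapping from each distinct token to an integer index, often with special entries for unknown and padding tokens and a minimum-frequency cutoff. The sketch below illustrates the idea in plain Python on toy data; the `build_vocab` helper, the `<unk>`/`<pad>` specials, and the `min_freq` value are assumptions for illustration, not the settings used by `UDPOS`.

```python
from collections import Counter


def build_vocab(sentences, min_freq=1, specials=("<unk>", "<pad>")):
    """Map tokens to indices (hypothetical helper, not part of UDPOS)."""
    freqs = Counter(token for sentence in sentences for token in sentence)
    itos = list(specials)
    # Most frequent tokens first; ties are broken alphabetically.
    for token, count in sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_freq:
            itos.append(token)
    stoi = {token: index for index, token in enumerate(itos)}
    return stoi, freqs


# Toy corpus, used only to demonstrate the mechanics.
toy_sentences = [
    ["the", "cat", "sat", "."],
    ["the", "dog", "sat", "."],
]
stoi, freqs = build_vocab(toy_sentences, min_freq=2)

# Tokens seen fewer than min_freq times fall back to the <unk> index.
cat_index = stoi.get("cat", stoi["<unk>"])
print(len(stoi), cat_index)
```

With a cutoff like this, the vocabulary size can be noticeably smaller than the number of distinct tokens in the corpus, because rare tokens all share the unknown-token index.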


Let's look at the 10 most common tokens in the `text` field.

In [18]:
TOP = 10

print(f"Top {TOP} most common tokens in text are as follows:")

for i, (token, count) in enumerate(udpos.TEXT.vocab.freqs.most_common(TOP)):
    print(f"{i+1:>2}) {token:<5} has count {count:>5}")

Top 10 most common tokens in text are as follows:
 1) the   has count  9076
 2) .     has count  8640
 3) ,     has count  7021
 4) to    has count  5137
 5) and   has count  5002
 6) a     has count  3782
 7) of    has count  3622
 8) i     has count  3379
 9) in    has count  3112
10) is    has count  2239
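
The `most_common` method used here is also available on any `collections.Counter`, so the same approach works for tags. As a small sketch, the snippet below counts the tags of the first training example shown earlier (hard-coded so it runs on its own); applying the same counting to every example in `udpos.train` would give the corpus-wide tag distribution.

```python
from collections import Counter

# Tags copied from the first training example shown earlier.
tags = ['PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'ADJ', 'NOUN', 'VERB', 'PROPN',
        'PROPN', 'PROPN', 'PUNCT', 'PROPN', 'PUNCT', 'DET', 'NOUN', 'ADP',
        'DET', 'NOUN', 'ADP', 'DET', 'NOUN', 'ADP', 'PROPN', 'PUNCT', 'ADP',
        'DET', 'ADJ', 'NOUN', 'PUNCT']

tag_counts = Counter(tags)

# Print the three most frequent tags in this single sentence.
for rank, (tag, count) in enumerate(tag_counts.most_common(3), start=1):
    print(f"{rank}) {tag:<5} has count {count}")
```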
