# Beginner's Guide to Torchtext

In [None]:
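# Note: Field, BucketIterator and TabularDataset belong to the legacy torchtext API.
# They live in torchtext.data up to 0.8, moved to torchtext.legacy.data in 0.9, and were removed in 0.12.
import torchtext
from torchtext import datasets
from torchtext.data import Field, get_tokenizer, BucketIterator, TabularDataset
from torchtext.datasets import LanguageModelingDataset, WikiText2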

Torchtext Documentation: https://torchtext.readthedocs.io/en/latest/ <br>
Other tutorial Links:
https://dzlab.github.io/dltips/en/pytorch/torchtext-datasets/

### 1. Setting a Field object

**Purpose of a Field object:**<br>
Intuitively, a Field sets the rules for preprocessing the input text and builds a vocabulary from the words found in the data.

**Things you can do with a Field object**
- 1.1. apply self-defined or external tokenizers to split strings into word tokens
- 1.2. automatically convert word tokens (strings) to indices (ints)
- 1.3. automatically add SOS (start-of-sentence) or EOS (end-of-sentence) tokens to input strings
- 1.4. convert text to lowercase
- 1.5. determine whether to pad sentences to a fixed length or leave them at variable length

**Things you CAN'T do with a Field object:**<br>
Print batches of text data. Remember, a Field defines how to tokenize, preprocess, and label your data and how to arrange a vocabulary; it does not store the data itself.

#### 1.0. Setting up a default Field object

In [None]:
TEXT = Field()

#### 1.1 Applying Tokenizer of choice

In [None]:
tokenizer = get_tokenizer('basic_english')  

In [None]:
TEXT = Field(tokenize=tokenizer)
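
To make the tokenizer's role concrete, the sketch below approximates in plain Python what a simple English tokenizer does: lowercase the string and separate punctuation from words. `simple_tokenize` is a hypothetical helper written for illustration only. It is not torchtext's `basic_english` implementation, which applies its own normalization rules.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("It is a truth, universally acknowledged!")
print(tokens)
```

Since `tokenize` accepts a callable that maps a string to a list of tokens, a function like this could be passed as `Field(tokenize=simple_tokenize)`.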

#### 1.2 Converting word tokens to indices

In [None]:
TEXT = Field(use_vocab=True)

#### 1.3. Adding SOS and EOS tokens to input strings

In [None]:
TEXT = Field(init_token='<SOS>', eos_token='<EOS>')

#### 1.4. Converting text to lowercase

In [None]:
TEXT = Field(lower=True)

#### 1.5. Fixed-Length Sequencing

In [None]:
TEXT = Field(fix_length=40) # shorter sequences are padded, longer ones are truncated
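
The effect of a fixed length can be sketched in plain Python. `pad_or_truncate` is a hypothetical helper, not a torchtext function, and it ignores SOS/EOS tokens for simplicity. `<pad>` is used here because it is the default padding token of a Field.

```python
def pad_or_truncate(tokens, fix_length, pad_token="<pad>"):
    """Return exactly `fix_length` tokens: cut long inputs, pad short ones."""
    clipped = tokens[:fix_length]
    return clipped + [pad_token] * (fix_length - len(clipped))

short = pad_or_truncate(["a", "short", "sentence"], fix_length=5)
long = pad_or_truncate(["one", "two", "three", "four", "five", "six"], fix_length=5)
print(short)  # two pad tokens are appended
print(long)   # the sixth token is dropped
```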

#### 1.6 Combining all

In [None]:
TEXT = Field(
    tokenize=tokenizer,
    use_vocab=True,
    init_token='<SOS>',
    eos_token='<EOS>',
    lower=True,
)
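
As a rough mental model of what the combined settings do to one input string, here is a plain-Python sketch. `preprocess` is a hypothetical function, and the whitespace tokenizer stands in for whichever tokenizer is configured. The order shown (tokenize, lowercase, then wrap in special tokens) is an assumption about the pipeline rather than a copy of torchtext's code.

```python
def preprocess(text, tokenize=str.split, lower=True,
               init_token="<SOS>", eos_token="<EOS>"):
    """Tokenize, optionally lowercase, then wrap the tokens in SOS/EOS markers."""
    tokens = tokenize(text)
    if lower:
        tokens = [token.lower() for token in tokens]
    # special tokens are added after lowercasing, so they keep their casing
    return [init_token] + tokens + [eos_token]

processed = preprocess("Mr. Darcy smiled")
print(processed)
```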

### 2. Creating a Dataset object

Here we will create a dataset for language modelling. The input text data will be tokenized and preprocessed according to our Field settings.

Ingredients!
- a Field object used to store the vocabulary of the text file
- the path to a text file
- an appropriate Dataset class

**Things you can do with a Dataset object**
- 2.1. print examples from the text

**Datasets for various purposes**
- 2.2. Language modelling (WikiText2)
- 2.3. Sentiment analysis (SST)

#### 2.0 Loading a text file into a Dataset

In [None]:
lm_data = LanguageModelingDataset(path='datasets/pg1342.txt',
                                  text_field=TEXT)

#### 2.1. Get examples from text

In [None]:
examples = lm_data.examples
print(f"Number of tokens : {len(examples[0].text)}")
print(f"First 10 tokens : {examples[0].text[:10]}")
print(f"Last 10 tokens : {examples[0].text[-10:]}")
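
The code above only looks at `examples[0]` because a language-modelling dataset holds the whole file as one long token stream. The sketch below mimics that idea in plain Python on a tiny in-memory string. `to_token_stream` is a hypothetical helper, and marking each line break with `<eos>` is an assumption about the dataset's default behaviour.

```python
raw_text = "It is a truth universally acknowledged\nthat a single man\n"

def to_token_stream(text, tokenize=str.split, eos_token="<eos>"):
    """Concatenate the tokens of every line into a single list."""
    stream = []
    for line in text.splitlines():
        stream.extend(tokenize(line.lower()))
        stream.append(eos_token)  # mark where the line ended
    return stream

stream = to_token_stream(raw_text)
print(f"Number of tokens : {len(stream)}")
print(f"First 5 tokens : {stream[:5]}")
```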

#### 2.2. Built-in dataset for language modelling (WikiText2)

In [None]:
TEXT_wiki = Field(
    tokenize=tokenizer,
    use_vocab=True,
    init_token='<SOS>',
    eos_token='<EOS>',
    lower=True,
    #fix_length=
)

# split into train, val, test
train, val, test = WikiText2.splits(text_field=TEXT_wiki)

#### 2.3. Loading Sentiment Analysis Dataset

In [None]:
TEXT_sst = Field(tokenize=tokenizer, init_token='<SOS>', eos_token='<EOS>',
                 lower=True)
LABEL_sst = Field(sequential=False)

# split into train, val, test
train, val, test = datasets.SST.splits(text_field=TEXT_sst, label_field=LABEL_sst)

### 3. Using a Vocab object

Now you can build a vocabulary from the words in the text file and store it in your predefined Field object, TEXT. First call .build_vocab() on the Field with your dataset as input. You can then access the result as TEXT.vocab, which is a Vocab object also defined by TorchText. Here is a list of the features provided by Vocab.<br>

**Things you can do with a Vocab object**
- 3.1. View vocabulary information (size, frequency of words)
- 3.2. View the created string-to-index dict (stoi) and index-to-string list (itos)
- 3.3. Create purpose-specific vocabularies (requires a Counter object)
- 3.4. Load external word embeddings
- 3.5. Easily handle unknown words

#### 3.0 Building vocabulary

In [None]:
TEXT.build_vocab(lm_data) # use dataset as input
vocabulary = TEXT.vocab

#### 3.1 Retrieving vocabulary information (size, frequency of words, etc.)

In [None]:
print(f"Vocabulary size : {len(vocabulary)}")
print(f"10 most frequent words : {vocabulary.freqs.most_common(10)}")

#### 3.2 Creating 'token to index' and 'index to token' mappings

In [None]:
print(f"First 10 words of vocab mapping : {vocabulary.itos[0:10]}\n")
print(f"First 10 words of text data: {lm_data.examples[0].text[:10]}\n")
print(f"Index of the first word : {vocabulary.stoi[lm_data.examples[0].text[0]]}")
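
To see how the two mappings relate, here is a plain-Python sketch that builds them from a `Counter`. The special tokens and the ordering (most frequent first, ties broken alphabetically) are assumptions chosen for illustration. This is not torchtext's own code.

```python
from collections import Counter

tokens = "the cat sat on the mat the end".split()
freqs = Counter(tokens)

specials = ["<unk>", "<pad>"]  # assumed special tokens, placed first
# most frequent words first, ties broken alphabetically
ordered = sorted(freqs.items(), key=lambda pair: (-pair[1], pair[0]))

itos = specials + [word for word, _ in ordered]           # index -> string (a list)
stoi = {word: index for index, word in enumerate(itos)}   # string -> index (a dict)

print(itos)
print(stoi["the"])
```

The two structures are inverses of each other: `itos[stoi[word]] == word` for every word in the vocabulary.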

#### 3.3. Create purpose-specific vocabularies (requires a Counter object)

In [None]:
counter = vocabulary.freqs #frequency of the original vocabulary created by Field

In [None]:
len(vocabulary)

In [None]:
from torchtext.vocab import Vocab

In [None]:
vocab2 = Vocab(counter=counter,min_freq=10) # discard words appearing less than 10 times
vocab3 = Vocab(counter=counter,max_size=100000) # set max number of words for a vocabulary

print(len(vocabulary))
print(len(vocab2))
print(len(vocab3))
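
The effect of the two parameters on vocabulary size can be sketched in plain Python. `build_itos` is a hypothetical helper and the counts are made up. The sketch assumes that special tokens are added on top of `max_size` rather than counted within it.

```python
from collections import Counter

def build_itos(counter, min_freq=1, max_size=None, specials=("<unk>", "<pad>")):
    """Keep words seen at least `min_freq` times, capped at `max_size` words."""
    kept = [word for word, count in counter.most_common() if count >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]  # most_common() already puts frequent words first
    return list(specials) + kept

counter = Counter({"the": 50, "of": 30, "darcy": 12, "humbug": 2})  # toy counts

full = build_itos(counter)
frequent_only = build_itos(counter, min_freq=10)
capped = build_itos(counter, max_size=2)
print(len(full), len(frequent_only), len(capped))
```

A cap larger than the number of distinct words changes nothing, so a vocabulary built with a very large `max_size` has the same length as the original.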

#### 3.4. Loading external word embeddings

In [None]:
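GLOVE = Field()
lang2 = datasets.LanguageModelingDataset(path='datasets/pg1342.txt',
                                         text_field=GLOVE)

# 3.4.1. loading embeddings while building a Field's vocabulary
# (the pretrained GloVe vectors are downloaded on first use)
GLOVE.build_vocab(lang2, vectors='glove.6B.50d')

# 3.4.2. loading embeddings into a specific Vocab object
vocab2.load_vectors(vectors='glove.6B.50d')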

In [None]:
print("Word embedding size: ", vocab2.vectors.size())
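
The printed size has one row per vocabulary entry and one column per embedding dimension. The plain-Python sketch below shows that alignment with made-up three-dimensional vectors. Giving words without a pretrained vector an all-zero row is an assumption about the default behaviour.

```python
# hypothetical toy vectors; real GloVe vectors here would have 50 dimensions
pretrained = {"the": [0.1, 0.2, 0.3], "cat": [0.4, 0.5, 0.6]}
dim = 3

itos = ["<unk>", "<pad>", "the", "cat", "humbahumba"]

# row i holds the vector for itos[i]; words with no pretrained vector get zeros
vectors = [pretrained.get(word, [0.0] * dim) for word in itos]

print("Word embedding size: ", (len(vectors), len(vectors[0])))
```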

#### 3.5. Handling unknown words

In [None]:
unknown_word = "humbahumba"
print("Index for unknown word %s: %d" %(unknown_word, vocab2.stoi[unknown_word]))
print("Token for unknown word: ", vocab2.itos[vocab2.stoi[unknown_word]])
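
One way to get this fallback behaviour is a dictionary with a default value. The sketch below is a plain-Python illustration under the assumption that `<unk>` sits at index 0. It is not a description of how Vocab is implemented internally.

```python
from collections import defaultdict

itos = ["<unk>", "<pad>", "pride", "prejudice"]

# defaultdict(int) returns 0 for missing keys, which is the index of <unk> here
stoi = defaultdict(int)
stoi.update({word: index for index, word in enumerate(itos)})

known_index = stoi["pride"]
unknown_index = stoi["humbahumba"]
print(known_index, unknown_index, itos[unknown_index])
```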

#### 4.0 Load Dataset

In [None]:
fields = {
    'text': ('text',TEXT)
}

In [None]:
# splits() returns a tuple with one dataset per file given,
# so unpack the single training set
(train_data,) = TabularDataset.splits(
    path='',
    train='train.json',
    format='json',
    fields=fields
)
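
For the `json` format, the file is expected to contain one JSON object per line, and the `fields` dictionary maps a JSON key to an attribute name (and a Field). The sketch below imitates only the key-to-attribute part in plain Python. The sample records and the `id` key are made up for illustration.

```python
import json

# two hypothetical records, one JSON object per line
json_lines = '{"text": "I loved it", "id": 1}\n{"text": "Not for me", "id": 2}\n'

# JSON key -> attribute name, mirroring the shape of the `fields` dict
key_to_attr = {"text": "text"}

examples = []
for line in json_lines.splitlines():
    record = json.loads(line)
    # keys that are not listed in the mapping (here "id") are ignored
    examples.append({attr: record[key] for key, attr in key_to_attr.items()})

print(examples)
```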

### Building the Vocabulary

In [None]:
TEXT.build_vocab(train_data)

### Creating Iterators

To create iterators, we call BucketIterator.splits with the datasets, a batch size, and a lambda that tells TorchText which key to use for sorting the validation/test sets (the training set is shuffled every epoch).

Finally, we can then iterate over batches of the datasets using those iterators.

In [None]:
import torch
device = torch.device(
  'cuda' if torch.cuda.is_available() else 'cpu'
)


In [None]:
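# create iterators for train/valid/test datasets
# (valid_data and test_data are assumed to be loaded the same way as train_data)
# leaving `sort` unset keeps the defaults: training data is shuffled,
# validation/test data is sorted by sort_key
train_it, valid_it, test_it = BucketIterator.splits(
  (train_data, valid_data, test_data),
  sort_key = lambda x: len(x.text),  # group examples of similar length
  batch_size = 32,
  device = device
)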

# iterate over training
for batch in train_it:
  pass
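
The reason for grouping examples of similar length is that every batch is padded to its longest sequence. The plain-Python sketch below counts the padding tokens needed with and without sorting by length. The helper names and sentence lengths are made up, and the sketch leaves out the shuffling that a real bucketing iterator also does.

```python
def make_batches(sequences, batch_size):
    """Split a list of sequences into consecutive batches."""
    return [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]

def padding_needed(batches):
    """Count pad tokens required to bring each batch up to its longest sequence."""
    total = 0
    for batch in batches:
        longest = max(len(seq) for seq in batch)
        total += sum(longest - len(seq) for seq in batch)
    return total

lengths = [3, 12, 4, 11, 2, 13]
sentences = [["tok"] * n for n in lengths]

unsorted_padding = padding_needed(make_batches(sentences, batch_size=3))
sorted_padding = padding_needed(make_batches(sorted(sentences, key=len), batch_size=3))
print(unsorted_padding, sorted_padding)
```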
