## Tokens to Integers
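- conceptual sketch
```python
# Toy dataset: any long sequence of tokens works here
dataset = "the cat sat on the mat".split()
current_idx = 0
word2idx = {}
for word in dataset:
    if word not in word2idx:
        # First time we see this word: give it the next free index
        word2idx[word] = current_idx
        current_idx += 1
```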

We usually do not want real words to start at index 0. Instead, we reserve two indices: one for padding (the empty positions in shorter sequences) and one for unknown words (for example, words in the test set that never appeared in the training set).
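The reserved indices can be sketched as follows. This is a minimal illustration, not a library API: the token strings `<pad>` and `<unk>`, the choice of 0 for padding and 1 for unknown, and the helper names `build_vocab` and `encode` are all assumptions made for this example.

```python
# Assumed convention for this sketch: index 0 = padding, index 1 = unknown.
PAD, UNK = "<pad>", "<unk>"

def build_vocab(tokens):
    """Map each distinct token to an integer, after the two reserved slots."""
    word2idx = {PAD: 0, UNK: 1}
    for word in tokens:
        if word not in word2idx:
            word2idx[word] = len(word2idx)  # next free index
    return word2idx

def encode(tokens, word2idx):
    """Convert tokens to integers; unseen tokens fall back to the UNK index."""
    return [word2idx.get(word, word2idx[UNK]) for word in tokens]

# Hypothetical training tokens
train_tokens = "free prize call now free entry".split()
word2idx = build_vocab(train_tokens)

# "me" never appeared in the training tokens, so it maps to the unknown index
encoded = encode("call me now".split(), word2idx)
print(word2idx)
print(encoded)  # [4, 1, 5]
```

Because the vocabulary is built from the training data only, any new word at test time is mapped to the single unknown index instead of raising an error.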

## Constant-length sequences (at least within a batch)

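All sequences in a batch must have the same length so they can be stacked into a single N x T array, so shorter sequences are padded. There are two types of padding:
- Post-padding: the padding goes after the real tokens
- Pre-padding: the padding goes before the real tokens
    - often preferred for RNNs: the real tokens end up closest to the final hidden state, so their signal is less affected by the vanishing gradient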
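The difference between the two padding types is easiest to see on a small batch. The sketch below is plain Python, not torchtext; the helper name `pad_sequences`, the `pad_first` flag, and the use of 0 as the padding index are assumptions for illustration.

```python
def pad_sequences(sequences, pad_idx=0, pad_first=True):
    """Pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_idx] * (max_len - len(seq))
        # Pre-padding puts the filler in front, post-padding puts it behind
        padded.append(filler + list(seq) if pad_first else list(seq) + filler)
    return padded

# Hypothetical batch of already-encoded sequences
batch = [[4, 7, 9], [5], [8, 2]]

pre = pad_sequences(batch, pad_first=True)
post = pad_sequences(batch, pad_first=False)
print(pre)   # [[4, 7, 9], [0, 0, 5], [0, 8, 2]]
print(post)  # [[4, 7, 9], [5, 0, 0], [8, 2, 0]]
```

With pre-padding, the last time step of every row holds a real token, which is what a many-to-one RNN reads its output from.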

## Summary
1. Your data might arrive in an unexpected form; convert it to CSV first.
2. Tokenize the text.
3. Map each token to a unique integer.
4. Add padding (per batch, for improved efficiency).
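To see why padding per batch helps, the following sketch counts how many positions (real tokens plus padding) have to be processed under each strategy. The sequence lengths are made up, and the helper `padded_cells` is a hypothetical name used only here.

```python
def padded_cells(lengths, batch_size, per_batch=True):
    """Count total positions (tokens + padding) after padding."""
    batches = [lengths[i:i + batch_size]
               for i in range(0, len(lengths), batch_size)]
    if per_batch:
        # Each batch is padded only up to its own longest sequence
        return sum(len(b) * max(b) for b in batches)
    # Every sequence is padded up to the longest one in the whole dataset
    return len(lengths) * max(lengths)

# Hypothetical sequence lengths, sorted so similar lengths share a batch
lengths = sorted([3, 50, 4, 5, 48, 6])

global_cells = padded_cells(lengths, batch_size=2, per_batch=False)
batch_cells = padded_cells(lengths, batch_size=2, per_batch=True)
print(global_cells)  # 300
print(batch_cells)   # 120
```

The 116 real tokens are the same in both cases; only the amount of padding changes. Sorting by length before batching is what keeps the per-batch maximum small.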

## Goal : Text Classification (many-to-one)
- input : sequence of words.
- output : a single label
- example : spam detection
- for __torchtext__, we want the dataset to be a CSV (1 column for label, 1 column for text)
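Getting the data into that two-column shape might look like the sketch below, which uses only Python's standard `csv` module. The example messages, the `label_to_int` mapping, and the helper `to_csv` are assumptions for illustration. The columns are written as text first and label second so that they line up with a field list of the form `[('data', ...), ('label', ...)]`; swap them if your field list is ordered differently.

```python
import csv
import io

# Hypothetical raw examples; real data would come from your own source files
raw_examples = [
    ("spam", "Free entry! Call now, to claim your prize"),
    ("ham", 'He said "see you at 5", ok?'),
]

# Assumed label encoding: numeric labels need no vocabulary later on
label_to_int = {"ham": 0, "spam": 1}

def to_csv(examples, handle):
    """Write (label, text) pairs as a two-column CSV with a header row."""
    writer = csv.writer(handle)  # handles quoting of commas and quotes
    writer.writerow(["data", "label"])
    for label, text in examples:
        writer.writerow([text, label_to_int[label]])

# An in-memory buffer keeps the sketch self-contained; for a real file use
# open("spam.csv", "w", newline="") instead
buffer = io.StringIO()
to_csv(raw_examples, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back shows the text survives the round trip unchanged
rows = list(csv.reader(io.StringIO(csv_text)))
```

Letting `csv.writer` do the quoting matters for text data, since messages routinely contain commas and quotation marks.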

```python
# Legacy API; in torchtext 0.9 to 0.11 it lives at torchtext.legacy.data
import torchtext.data as ttd
```

```python
TEXT = ttd.Field(
    sequential=True,   # the text is a sequence of tokens
    batch_first=True,  # batches have shape N x T, not T x N
    lower=True,        # lowercase everything
    pad_first=True     # pre-padding
)
# The label is a single value, not a sequence. use_vocab=False skips the
# string-to-integer mapping, so the labels in the CSV must already be integers.
LABEL = ttd.Field(sequential=False, use_vocab=False, is_target=True)
```

```python
# Torchtext only knows how to read a tabular dataset
dataset = ttd.TabularDataset(
    path='spam.csv',
    format='csv',  # one of: csv, tsv, json
    skip_header=True,
    fields=[('data', TEXT), ('label', LABEL)]  # same order as the CSV columns
)

train_dataset, test_dataset = dataset.split(split_ratio=0.7)  # 0.7 is also the default

TEXT.build_vocab(train_dataset)  # assigns a unique integer to each token
vocab = TEXT.vocab
# vocab.stoi: string to index
# vocab.itos: index to string
# We need itos in order to convert integers back to words.
```

### Iterator

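```python
import torch

# Assumption: device was not defined earlier, so use the GPU if one is available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, test_iter = ttd.Iterator.splits(
    (train_dataset, test_dataset),
    sort_key=lambda x: len(x.data),  # used to sort examples by text length
    batch_sizes=(32, 256),  # one batch size per dataset, in order: (train, test)
    device=device
)
```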