<a href="https://colab.research.google.com/github/InhyeokYoo/NLP/blob/master/utils/1.%20torchtext/1_torchtext_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Torchtext tutorial

This notebook is a short tutorial on `torchtext`: defining fields, loading a dataset, building a vocabulary, and iterating over mini-batches.

A Korean article covering this notebook is also available.

In [None]:
!pip install --upgrade torchtext # upgrade torchtext

Collecting torchtext
  Downloading https://files.pythonhosted.org/packages/f2/17/e7c588245aece7aa93f360894179374830daf60d7ed0bbb59332de3b3b61/torchtext-0.6.0-py3-none-any.whl (64kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)

In [None]:
# get dataset
!wget https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv

--2020-07-21 07:15:06--  https://raw.githubusercontent.com/LawrenceDuan/IMDb-Review-Analysis/master/IMDb_Reviews.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 65862309 (63M) [text/plain]
Saving to: ‘IMDb_Reviews.csv’


2020-07-21 07:15:08 (64.3 MB/s) - ‘IMDb_Reviews.csv’ saved [65862309/65862309]



# 1. Field

In [None]:
from torchtext.data import Field

In [None]:
# Create Field objects for a text classification task.
TEXT = Field(sequential=True,
             use_vocab=True,
             tokenize=str.split,
             lower=True,
             batch_first=True,
             fix_length=20)

LABEL = Field(sequential=False,
              use_vocab=False,
              is_target=True)

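The `TEXT` field describes how the raw review text is turned into model input. Because the text is a sequence of tokens, we set `sequential`, `use_vocab`, `tokenize`, and the other preprocessing options: each review is split on whitespace, lowercased, and padded or truncated to 20 tokens (`fix_length=20`), and batches are returned with the batch dimension first (`batch_first=True`).

The `LABEL` field, on the other hand, holds the label of each example. The labels are numeric, so no tokenization or vocabulary is needed (`sequential=False`, `use_vocab=False`).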
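
To make the effect of these options concrete, here is a minimal pure-Python sketch of the per-example processing that the `TEXT` settings describe. The helper names `preprocess` and `pad_to_fixed_length` are hypothetical and are not part of `torchtext`; the sketch only assumes that tokens are lowercased after tokenization and that short sequences are padded on the right with `<pad>`.

```python
def preprocess(text, tokenize=str.split, lower=True):
    """Tokenize a raw string, then optionally lowercase every token."""
    tokens = tokenize(text)
    if lower:
        tokens = [tok.lower() for tok in tokens]
    return tokens


def pad_to_fixed_length(tokens, fix_length=20, pad_token="<pad>"):
    """Truncate or right-pad a token list to exactly fix_length items."""
    tokens = tokens[:fix_length]
    return tokens + [pad_token] * (fix_length - len(tokens))


tokens = preprocess("I saw this movie, and at times I was bored")
padded = pad_to_fixed_length(tokens, fix_length=12)
print(padded)
```

Note that whitespace tokenization keeps punctuation attached to words (`'movie,'`), which is also visible in the dataset output later in this notebook.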

# 2. Dataset

You can load the train, test, and validation sets either one at a time or all together.

In [None]:
from torchtext.data import TabularDataset

`TabularDataset` defines a Dataset of columns stored in CSV, TSV, or JSON format.

It takes the following parameters:
- `path` (str): Path to the data file.
- `format` (str): The format of the data file. One of "CSV", "TSV", or "JSON" (case-insensitive).
- `fields` (list(tuple(str, Field)) or dict[str: tuple(str, Field)]): If using a list, the format must be CSV or TSV, and the values of the list should be tuples of (name, field). The fields should be in the same order as the columns in the CSV or TSV file, while tuples of (name, None) represent columns that will be ignored. If using a dict, the keys should be a subset of the JSON keys or CSV/TSV columns, and the values should be tuples of (name, field). Keys not present in the input dictionary are ignored. This allows the user to rename columns from their JSON/CSV/TSV key names and also enables selecting a subset of columns to load.
- `skip_header` (bool): Whether to skip the first line of the input file.
- `csv_reader_params` (dict): Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.
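
The list form of `fields` can be illustrated with a small standard-library sketch. It is not how `TabularDataset` is implemented: `load_columns` is a hypothetical helper, the "fields" are plain callables standing in for `Field` objects, and the three-column CSV is made-up sample data. It only shows the mapping rules described above: tuples follow the column order, `(name, None)` drops a column, and `skip_header` discards the first line.

```python
import csv
import io


def load_columns(file_obj, fields, skip_header=False):
    """Read CSV rows into dicts, keeping only columns whose field is not None.

    `fields` is a list of (name, field) tuples in column order; each field is
    any callable applied to the raw cell (a stand-in for a torchtext Field).
    """
    reader = csv.reader(file_obj)
    if skip_header:
        next(reader)  # discard the header row
    examples = []
    for row in reader:
        example = {}
        for (name, field), cell in zip(fields, row):
            if field is not None:
                example[name] = field(cell)
        examples.append(example)
    return examples


# Made-up sample data with a header row and an extra "id" column.
raw = io.StringIO("id,review,sentiment\n7,Great movie,1\n8,Too long,0\n")
fields = [("id", None), ("text", str.split), ("label", int)]
examples = load_columns(raw, fields, skip_header=True)
print(examples)
```

Without `skip_header=True`, the header line would be read as an ordinary example, so it is worth checking whether a CSV file starts with column names.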



In [None]:
# Load a dataset
train_data = TabularDataset(path='IMDb_Reviews.csv', format='csv', fields=[('text', TEXT), ('label', LABEL)])

In [None]:
# Load the train/validation/test sets together with TabularDataset.splits.
# Pass the file names of the validation and test sets as well.
# The same file is reused for all three splits here purely for demonstration.
# Note: splits returns the datasets in the order (train, validation, test).
train_data, val_data, test_data = TabularDataset.splits(
    path='', format='csv',
    train='IMDb_Reviews.csv', validation='IMDb_Reviews.csv', test='IMDb_Reviews.csv',
    fields=[('text', TEXT), ('label', LABEL)])

`TabularDataset` loads the file(s) and preprocesses each example in the way we defined in the `Field` objects.

In [None]:
print(f"The train set is {len(train_data)}")

The train set is 50001


Using `vars(train_data[1])`, we can see that each example in `train_data` has `text` and `label` attributes, named after the entries of the `fields` parameter passed to `TabularDataset`.

In [None]:
print(vars(train_data[1]))

{'text': ['my', 'family', 'and', 'i', 'normally', 'do', 'not', 'watch', 'local', 'movies', 'for', 'the', 'simple', 'reason', 'that', 'they', 'are', 'poorly', 'made,', 'they', 'lack', 'the', 'depth,', 'and', 'just', 'not', 'worth', 'our', 'time.<br', '/><br', '/>the', 'trailer', 'of', '"nasaan', 'ka', 'man"', 'caught', 'my', 'attention,', 'my', 'daughter', 'in', "law's", 'and', "daughter's", 'so', 'we', 'took', 'time', 'out', 'to', 'watch', 'it', 'this', 'afternoon.', 'the', 'movie', 'exceeded', 'our', 'expectations.', 'the', 'cinematography', 'was', 'very', 'good,', 'the', 'story', 'beautiful', 'and', 'the', 'acting', 'awesome.', 'jericho', 'rosales', 'was', 'really', 'very', 'good,', "so's", 'claudine', 'barretto.', 'the', 'fact', 'that', 'i', 'despised', 'diether', 'ocampo', 'proves', 'he', 'was', 'effective', 'at', 'his', 'role.', 'i', 'have', 'never', 'been', 'this', 'touched,', 'moved', 'and', 'affected', 'by', 'a', 'local', 'movie', 'before.', 'imagine', 'a', 'cynic', 'like', 'me

In [None]:
print(train_data.fields.items())

dict_items([('text', <torchtext.data.field.Field object at 0x7fbdf3dfb828>), ('label', <torchtext.data.field.Field object at 0x7fbdf3dfb7f0>)])


# 3. Vocabulary

After preprocessing, we need **integer encoding**, which maps each unique word to an integer. To do that, we first build a vocabulary with `build_vocab()`.

In [None]:
TEXT.build_vocab(train_data, min_freq=10, max_size=1000)

`build_vocab()` constructs the `Vocab` object for this field from one or more datasets. The parameters are:
- Positional arguments: Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
- Remaining keyword arguments: Passed to the constructor of `Vocab`. Here, `min_freq=10` drops tokens that occur fewer than 10 times, and `max_size=1000` keeps at most the 1,000 most frequent tokens.

In [None]:
print(f'The length of the Vocab is {len(TEXT.vocab)}')

The length of the Vocab is 1002
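
The length is 1002 rather than 1000 because the special tokens `<unk>` and `<pad>` are added on top of `max_size`. The following is a simplified sketch of that logic, not the actual `Vocab` implementation: `build_vocab_sketch` is a hypothetical function, and it assumes tokens are ranked by descending frequency with ties broken alphabetically.

```python
from collections import Counter


def build_vocab_sketch(token_lists, min_freq=1, max_size=None,
                       specials=("<unk>", "<pad>")):
    """Return (itos, stoi) with specials first, then the most frequent tokens."""
    counter = Counter(tok for tokens in token_lists for tok in tokens)
    for special in specials:
        counter.pop(special, None)  # specials are added separately
    # Sort alphabetically, then by descending frequency (stable sort keeps ties alphabetical).
    words = sorted(counter.items(), key=lambda kv: kv[0])
    words.sort(key=lambda kv: kv[1], reverse=True)
    limit = None if max_size is None else max_size + len(specials)
    itos = list(specials)
    for word, freq in words:
        if freq < min_freq or (limit is not None and len(itos) >= limit):
            break
        itos.append(word)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


corpus = [["the", "movie", "was", "good"],
          ["the", "movie", "was", "bad"],
          ["the", "plot"]]
itos, stoi = build_vocab_sketch(corpus, min_freq=2, max_size=2)
print(itos)
print(stoi.get("plot", stoi["<unk>"]))  # out-of-vocabulary tokens fall back to <unk>
```

With `max_size=2` the sketch keeps two regular tokens plus the two specials, which mirrors how `max_size=1000` leads to a vocabulary of 1002 entries.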


The vocab has two attributes, `stoi` and `itos`.

`stoi` is a collections.defaultdict instance mapping token strings to numerical identifiers.

`itos` is a list of token strings indexed by their numerical identifiers.

In [None]:
print(TEXT.vocab.stoi)
print(TEXT.vocab.itos)

defaultdict(<bound method Vocab._default_unk_index of <torchtext.vocab.Vocab object at 0x7fbdcb1ca6d8>>, {'<unk>': 0, '<pad>': 1, 'the': 2, 'a': 3, 'and': 4, 'of': 5, 'to': 6, 'is': 7, 'in': 8, 'i': 9, 'this': 10, 'that': 11, 'it': 12, '/><br': 13, 'was': 14, 'as': 15, 'with': 16, 'for': 17, 'but': 18, 'on': 19, 'movie': 20, 'are': 21, 'his': 22, 'not': 23, 'you': 24, 'film': 25, 'have': 26, 'he': 27, 'be': 28, 'at': 29, 'one': 30, 'by': 31, 'an': 32, 'they': 33, 'from': 34, 'all': 35, 'who': 36, 'like': 37, 'so': 38, 'just': 39, 'or': 40, 'has': 41, 'about': 42, 'her': 43, "it's": 44, 'if': 45, 'some': 46, 'out': 47, 'what': 48, 'very': 49, 'when': 50, 'there': 51, 'more': 52, 'would': 53, 'even': 54, 'my': 55, 'good': 56, 'she': 57, 'their': 58, 'only': 59, 'no': 60, 'really': 61, 'had': 62, 'up': 63, 'can': 64, 'which': 65, 'see': 66, 'were': 67, 'than': 68, 'we': 69, '-': 70, 'been': 71, 'get': 72, 'into': 73, 'will': 74, 'much': 75, 'because': 76, 'story': 77, 'how': 78, 'most': 7

# 4. Iterator

In [None]:
from torchtext.data import Iterator

In [None]:
batch_size = 5
train_loader = Iterator(dataset=train_data, batch_size=batch_size)

In [None]:
print(f'# of minibatches in the training set: {len(train_loader)}') # 50001 / 5, rounded up

# of minibatches in the training set: 10001
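
The count of 10001 follows from simple arithmetic, assuming the iterator keeps the final, smaller batch instead of dropping it. A quick check with the numbers from this notebook:

```python
import math

num_examples = 50001  # size of the train set reported above
batch_size = 5

# The last partial batch still counts, so round up.
num_batches = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (num_batches - 1) * batch_size
print(num_batches, last_batch_size)
```

So 10000 batches hold 5 examples each, and one final batch holds the single remaining example.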


In [None]:
batch = next(iter(train_loader))
print(batch)


[torchtext.data.batch.Batch of size 5]
	[.text]:[torch.LongTensor of size 5x20]
	[.label]:[torch.LongTensor of size 5]


PyTorch's `DataLoader` typically yields each mini-batch as plain tensors. `Iterator`, however, yields each mini-batch as a `torchtext.data.batch.Batch` object whose attributes are named after the fields.

In [None]:
print(batch.text)
print(batch.label)

tensor([[  9, 199,  10, 152,   4,  29, 982,   9,  14,   0,   0,  10,  20,   0,
           0,   0,   0,   2,   0,  17],
        [  9, 300, 357, 142,  10,  20,  14,   0,   0,   5,   2,   0,  14,  32,
           0,  25,  70,   3,   0,   0],
        [ 79,   5,  10, 955,   0,   0,  15,   3, 635, 530,   5,   2,   0,  20,
          16,   3, 595, 136,   0,   5],
        [634, 139,  12,   0,  30,   5,   2, 239, 128, 125,  96,   4,  12,  14,
          96,  31,   2, 204,  11,  96],
        [394,   0,   4, 204,   0, 315, 336,  16,  22, 103, 480,   0,   4,   0,
           0,   0, 381,   2,   0, 315]])
tensor([0, 0, 1, 0, 1])


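`batch` has the attributes `text` and `label`. They contain the tensors of the texts and labels in the batch, respectively. Each tensor has 5 rows, one per example in the mini-batch, and each row of `batch.text` holds 20 token indices because of `fix_length=20`. Index 0 is `<unk>`, which stands in for every token that is not in the vocabulary.

We can convert `batch.text` back into text.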

In [None]:
f = lambda x: TEXT.vocab.itos[x]
for tensors in batch.text:
    text = list(map(f, tensors.tolist()))
    print(" ".join(text))

i saw this movie, and at times, i was <unk> <unk> this movie <unk> <unk> <unk> <unk> the <unk> for
i completely understand why this movie was <unk> <unk> of the <unk> was an <unk> film - a <unk> <unk>
most of this political <unk> <unk> as a mostly run of the <unk> movie with a somewhat better <unk> of
i'll say it <unk> one of the worst films ever made and it was made by the director that made
star <unk> and director <unk> -- along with his two favorite <unk> and <unk> <unk> <unk> doing the <unk> --


# Custom DataLoader for NLP

In [None]:
from collections import defaultdict

import torch
import torch.utils.data as torchdata


class CustomDataset(torchdata.Dataset):
    def __init__(self, path='', format_='\t', pad_idx=1):
        self.flatten = lambda x: [tkn for s in x for tkn in s]
        # Preprocessing: each line holds a sentence and a label separated by `format_`
        with open(path, 'r') as file:
            data = file.read().splitlines()
            data = [line.split(format_) for line in data]

        # Tokenization
        sentences, labels = list(zip(*data))
        all_tokens = [s.split() for s in sentences]
        labels = [int(l) for l in labels]

        # Build vocabulary; unseen tokens map to 0, the index of '<unk>'
        unique_tokens = sorted(set(self.flatten(all_tokens)))
        self.vocab_stoi = defaultdict(int)
        self.vocab_stoi['<unk>'] = 0
        self.vocab_stoi['<pad>'] = 1
        # Start at 2 so that indices are contiguous and match positions in vocab_itos
        for i, token in enumerate(unique_tokens, 2):
            self.vocab_stoi[token] = i
        self.vocab_itos = sorted(self.vocab_stoi, key=self.vocab_stoi.get)

        # Numericalize all tokens
        all_tokens_numerical = [[self.vocab_stoi[tkn] for tkn in s] for s in all_tokens]

        self.x = all_tokens_numerical
        self.y = labels
        self.pad_idx = pad_idx  # should match the '<pad>' index above

    def __getitem__(self, index):
        # Return the (token ids, label) pair at the given index
        return [self.x[index], self.y[index]]

    def __len__(self):
        # Number of examples in the dataset
        return len(self.x)

    def custom_collate_fn(self, data):
        """
        Custom 'collate_fn' for 'torchdata.DataLoader': pads the variable-length
        examples of a batch to the length of the longest one.
        """
        texts, labels = list(zip(*data))
        max_len = max([len(s) for s in texts])
        texts = [s + [self.pad_idx] * (max_len - len(s)) if len(s) < max_len else s for s in texts]
        return torch.LongTensor(texts), torch.LongTensor(labels)
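
The padding step in `custom_collate_fn` can be tried in isolation. The sketch below repeats the same logic in plain Python, without tensors, on a made-up batch of three examples. `pad_batch` is a hypothetical standalone function, not part of the class above.

```python
def pad_batch(batch, pad_idx=1):
    """Pad every token-id list in the batch to the length of the longest one."""
    texts, labels = zip(*batch)
    max_len = max(len(s) for s in texts)
    padded = [s + [pad_idx] * (max_len - len(s)) for s in texts]
    return padded, list(labels)


# Made-up batch of (token ids, label) pairs with different lengths.
batch = [([4, 9, 2], 0), ([7], 1), ([5, 3], 1)]
padded, labels = pad_batch(batch)
print(padded)
print(labels)
```

With the class itself, the method would typically be handed to the loader through the `collate_fn` argument, for example `torchdata.DataLoader(dataset, batch_size=5, collate_fn=dataset.custom_collate_fn)`, where `dataset` is an assumed `CustomDataset` instance.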

# Reference

Allen Nie's article: ["A Tutorial on Torchtext"](http://anie.me/On-Torchtext/)

simonjisu's notebook: [TorchText Tutorials](https://github.com/simonjisu/pytorch_tutorials/blob/master/00_Basic_Utils/01_TorchText.ipynb)

Wonjoon's wikidocs book: [PyTorch로 시작하는 딥 러닝 입문 (Introduction to Deep Learning with PyTorch)](https://wikidocs.net/60314)

yunjey's Github: [data_loader.py](https://github.com/yunjey/seq2seq-dataloader/blob/master/data_loader.py)