-
Okay, I think I figured out how to adapt the datamodule to resolve this issue. It seems to be working now.
-
@tshu-w Sure, no problem. Here is the code for the datamodule below. I did not have to do anything special for the training loop, etc. I think the key is to build and preprocess the data in the `Dataset`, and then just use the `DataModule` to create the train/val/test splits. The other thing I noticed is that the new API docs make a big deal out of the collate function, but in this working example the collate function is pretty simple. I am not sure whether that has implications for training on multiple machines, but at least the code works. Sorry the example is a bit complicated; hopefully you will find it helpful. I am going to see if I can write up some simpler code for a PL example model.

```python
import os
import platform
import random
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from zipfile import ZipFile

import gdown
import numpy as np
import pandas as pd
import pytorch_lightning as pl
import torch
import torch.optim as optim
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset, random_split
from torchtext.data.utils import get_tokenizer, ngrams_iterator
from torchtext.utils import download_from_url, extract_archive
from torchtext.vocab import Vocab, build_vocab_from_iterator
from tqdm.auto import tqdm
class StanfordSentimentTreeBank(Dataset):
    """The Stanford Sentiment Treebank dataset.

    Stanford Sentiment Treebank V1.0

    This is the dataset of the paper:

    Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning,
    Andrew Ng and Christopher Potts
    Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)

    If you use this dataset in your research, please cite the above paper.

    @incollection{SocherEtAl2013:RNTN,
        title = {{Parsing With Compositional Vector Grammars}},
        author = {Richard Socher and Alex Perelygin and Jean Wu and Jason Chuang
                  and Christopher Manning and Andrew Ng and Christopher Potts},
        booktitle = {{EMNLP}},
        year = {2013}
    }

    This file includes:

    1. original_rt_snippets.txt contains 10,605 processed snippets from the
       original pool of Rotten Tomatoes HTML files. Please note that some
       snippets may contain multiple sentences.
    2. dictionary.txt contains all phrases and their IDs, separated by a
       vertical line |
    3. sentiment_labels.txt contains all phrase ids and the corresponding
       sentiment labels, separated by a vertical line.
       Note that you can recover the 5 classes by mapping the positivity
       probability using the following cut-offs:
       [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
       for very negative, negative, neutral, positive, very positive,
       respectively. Please note that phrase ids and sentence ids are not
       the same.
    4. SOStr.txt and STree.txt encode the structure of the parse trees.
       STree encodes the trees in a parent pointer format. Each line
       corresponds to each sentence in the datasetSentences.txt file. The
       Matlab code of this paper will show you how to read this format if
       you are not familiar with it.
    5. datasetSentences.txt contains the sentence index, followed by the
       sentence string separated by a tab. These are the sentences of the
       train/dev/test sets.
    6. datasetSplit.txt contains the sentence index (corresponding to the
       index in the datasetSentences.txt file) followed by the set label,
       separated by a comma:
       1 = train
       2 = test
       3 = dev

    Please note that the datasetSentences.txt file has more sentences/lines
    than original_rt_snippets.txt. Each row in the latter represents a
    snippet as shown on RT, whereas the former is each sub-sentence as
    determined by the Stanford parser.

    For comparing research and training models, please use the provided
    train/dev/test splits.
    """

    ORIG_URL = "http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip"
    DATASET_NAME = "StanfordSentimentTreeBank"
    URL = 'https://drive.google.com/uc?id=1urNi0Rtp9XkvkxxeKytjl1WoYNYUEoPI'
    OUTPUT = 'sst_dataset.zip'
    def __init__(self,
                 root,
                 vocab=None,
                 text_transforms=None,
                 label_transforms=None,
                 split='train',
                 ngrams=1,
                 use_transformed_dataset=True):
        """Initiate the text-classification dataset.

        Args:
            root: directory where the downloaded dataset zip is cached.
            vocab: Vocabulary object used for the dataset; built from the
                training split when None.
            text_transforms: extra sentence-level text transforms (e.g.
                augmentations), applied after tokenization.
            label_transforms: transforms applied to the label.
            split: one of 'train' or 'test'.
            ngrams: produce up to this order of n-grams from each token list.
        """
        super().__init__()
        if split not in ['train', 'test']:
            raise ValueError(f'split must be one of ["train", "test"]; got unknown split {split!r}')
        self.vocab = vocab
        gdown.cached_download(self.URL, Path(root) / self.OUTPUT)
        self.generate_sst_dataset(split, Path(root) / self.OUTPUT)
        tokenizer = get_tokenizer("basic_english")
        # The text transform can only work at the sentence level;
        # the rest of tokenization and vocab lookup is done by this class.
        self.text_transform = sequential_transforms(tokenizer,
                                                    self.ngrams_func(ngrams))

        # Define special symbols and indices.
        UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
        # Make sure the tokens are in order of their indices, so they are
        # inserted into the vocab correctly.
        special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

        def build_vocab(data, transforms):
            def apply_transforms(data):
                for line in data:
                    yield transforms(line)
            return build_vocab_from_iterator(apply_transforms(data),
                                             min_freq=1,
                                             specials=special_symbols,
                                             special_first=True)

        if self.vocab is None:
            # The vocab is always built on the train split.
            self.vocab = build_vocab(self.dataset_train["phrase"],
                                     self.text_transform)
            self.vocab.set_default_index(self.vocab["<unk>"])

        if text_transforms is not None:
            self.text_transform = self.sequential_transforms(
                self.text_transform,
                text_transforms,
                self.vocab_func(self.vocab),
                self.totensor(dtype=torch.long),
            )
        else:
            self.text_transform = self.sequential_transforms(
                self.text_transform,
                self.vocab_func(self.vocab),
                self.totensor(dtype=torch.long),
            )
        self.label_transform = self.sequential_transforms(self.totensor(dtype=torch.long))

    def generate_sst_dataset(self, split, dataset_file):
        with ZipFile(dataset_file) as datasetzip:
            with datasetzip.open('sst_dataset/sst_dataset_augmented.csv') as f:
                dataset = pd.read_csv(f, index_col=0)
        self.dataset_orig = dataset.copy()
        # Train on the train (1) and dev (3) splits, stacking the augmented
        # columns (synonym replacement, back-translation) on top of the
        # cleaned phrases.
        dataset_train_raw = dataset[dataset['splitset_label'].isin([1, 3])]
        self.dataset_train = pd.concat([
            dataset_train_raw[['phrase_cleaned', 'sentiment_values']].rename(columns={"phrase_cleaned": 'phrase'}),
            dataset_train_raw[['synonym_sentences', 'sentiment_values']].rename(columns={"synonym_sentences": 'phrase'}),
            dataset_train_raw[['backtranslated', 'sentiment_values']].rename(columns={"backtranslated": 'phrase'}),
        ], ignore_index=True)
        if split == 'train':
            self.dataset = self.dataset_train.copy()
        else:
            self.dataset = dataset[dataset['splitset_label'].isin([2])] \
                [['phrase_cleaned', 'sentiment_values']] \
                .rename(columns={"phrase_cleaned": 'phrase'}) \
                .reset_index(drop=True)

    @staticmethod
    def discretize_label(label):
        # Map the positivity probability onto the 5 SST classes.
        if label <= 0.2: return 0
        if label <= 0.4: return 1
        if label <= 0.6: return 2
        if label <= 0.8: return 3
        return 4

    def __getitem__(self, idx):
        text = self.text_transform(self.dataset['phrase'].iloc[idx])
        label = self.label_transform(self.dataset['sentiment_values'].iloc[idx])
        return label, text

    def __len__(self):
        return len(self.dataset)

    @staticmethod
    def get_labels():
        return ['very negative', 'negative', 'neutral', 'positive', 'very positive']

    def get_vocab(self):
        return self.vocab

    @property
    def collator_fn(self):
        def collate_fn(batch):
            pad_idx = self.get_vocab()['<pad>']
            labels, sequences = zip(*batch)
            labels = torch.stack(labels)
            lengths = torch.LongTensor([len(sequence) for sequence in sequences])
            # Pad every sequence in the batch to the length of the longest one.
            sequences = torch.nn.utils.rnn.pad_sequence(sequences,
                                                        padding_value=pad_idx,
                                                        batch_first=True)
            return labels, sequences, lengths
        return collate_fn

    # Thin wrappers so the module-level transform helpers are also
    # reachable through the class.
    @staticmethod
    def vocab_func(vocab):
        return vocab_func(vocab)

    @staticmethod
    def totensor(dtype):
        return totensor(dtype)

    @staticmethod
    def ngrams_func(ngrams):
        return ngrams_func(ngrams)

    @staticmethod
    def sequential_transforms(*transforms):
        return sequential_transforms(*transforms)

class SSTDataModule(pl.LightningDataModule):
    """DataModule for SST: train/val/test splits and transforms."""

    name = "stanford_sentiment_treebank"

    def __init__(
            self,
            data_dir: str = '.',
            val_split: int = 1000,
            num_workers: int = 2,
            batch_size: int = 64,
            pin_memory=True):
        """
        Args:
            data_dir: where to save/load the data
            val_split: how many training samples to hold out for validation
            num_workers: how many workers to use for loading data
            batch_size: desired batch size
            pin_memory: whether to copy batches into page-locked memory
        """
        super().__init__()
        self.data_dir = data_dir
        self.val_split = val_split
        self.num_workers = num_workers
        self.batch_size = batch_size
        self.pin_memory = pin_memory
        self.dataset_train = ...
        self.dataset_val = ...
        self.dataset_test = ...
        self.SST = StanfordSentimentTreeBank

    def prepare_data(self):
        """Downloads and caches the SST files in `data_dir`."""
        self.SST(self.data_dir)

    def setup(self, stage: Optional[str] = None):
        """Split the train dataset into train and validation sets."""
        train_trans, test_trans = self.default_transforms
        train_dataset = self.SST(self.data_dir, split='train', **train_trans)
        test_dataset = self.SST(self.data_dir, split='test', **test_trans)
        train_length = len(train_dataset)
        self.raw_dataset_train = train_dataset
        self.raw_dataset_test = test_dataset
        self.dataset_train, self.dataset_val = random_split(
            train_dataset, [train_length - self.val_split, self.val_split])
        self.dataset_test = test_dataset

    def train_dataloader(self):
        """SST train set, with a subset removed for validation."""
        loader = DataLoader(
            self.dataset_train,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
            collate_fn=self.collator_fn,
        )
        return loader

    def val_dataloader(self):
        """SST val set, a held-out subset of the training split."""
        loader = DataLoader(
            self.dataset_val,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
            collate_fn=self.collator_fn,
        )
        return loader

    def test_dataloader(self):
        """SST test set, using the provided test split."""
        loader = DataLoader(
            self.dataset_test,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
            collate_fn=self.collator_fn,
        )
        return loader

    def get_vocab(self):
        return self.raw_dataset_train.get_vocab()

    @property
    def default_transforms(self):
        # Augmentations are only applied to the training split.
        train_transforms = {
            'text_transforms': StanfordSentimentTreeBank.sequential_transforms(
                random_deletion,
                random_swap
            ),
            'label_transforms': None
        }
        test_transforms = {
            'text_transforms': None,
            'label_transforms': None
        }
        return train_transforms, test_transforms

    @property
    def collator_fn(self):
        return self.raw_dataset_train.collator_fn

#############################################
## Data Augmentations
#############################################

def random_deletion(words, p=0.1):
    """Drop each token independently with probability p."""
    if len(words) == 1:  # nothing to delete from a single word
        return words
    remaining = list(filter(lambda x: random.uniform(0, 1) > p, words))
    if len(remaining) == 0:  # if nothing is left, keep one random word
        return [random.choice(words)]
    return remaining


def random_swap(sentence, n=3, p=0.1):
    """Attempt up to n swaps of random token pairs, each with probability 1 - p."""
    length = range(len(sentence))
    n = min(n, len(sentence))
    for _ in range(n):
        if random.uniform(0, 1) > p:
            idx1, idx2 = random.choices(length, k=2)
            sentence[idx1], sentence[idx2] = sentence[idx2], sentence[idx1]
    return sentence


################################################################
## Torchtext experimental functions.
## Replace these when they become part of stable torchtext.
################################################################

def vocab_func(vocab):
    def func(tok_iter):
        return [vocab[tok] for tok in tok_iter]
    return func


def totensor(dtype):
    def func(ids_list):
        return torch.tensor(ids_list).to(dtype)
    return func


def ngrams_func(ngrams):
    def func(token_list):
        return list(ngrams_iterator(token_list, ngrams))
    return func


def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

if __name__ == "__main__":
    dataset = StanfordSentimentTreeBank(root='.', split='train')
    ds_next = next(iter(dataset))
    print(dataset.get_vocab()["<bos>"])

    datamodule = SSTDataModule()
    datamodule.setup()
    tloader = datamodule.train_dataloader()

    loader = DataLoader(
        dataset,
        batch_size=32,
        shuffle=False,
        num_workers=2,
        pin_memory=True,
        collate_fn=dataset.collator_fn,
    )
```
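As a quick sanity check of the collate function, something like this (a sketch; it assumes the SST zip from the gdown URL above is already cached in the working directory) shows the shapes it produces:

```python
# Sketch of a smoke test for the collator above; assumes the dataset
# zip has been downloaded/cached in the current directory.
dm = SSTDataModule(data_dir='.')
dm.prepare_data()
dm.setup()

labels, sequences, lengths = next(iter(dm.train_dataloader()))
print(labels.shape)     # (batch_size,)
print(sequences.shape)  # (batch_size, max_len), padded with <pad>
print(lengths.shape)    # (batch_size,), original lengths before padding
```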
-
Have you taken a look at the
-
Hey folks, I was wondering if there was a good strategy to handle this problem. The ubiquitous `torchtext` package has undergone a bunch of changes in the last six months, and the API has changed quite a lot. The changes are very nice and make the package more consistent with `pytorch` standards, but there are still some tricky things to work out.

One problem is that `torchtext` now exposes its datasets as iterators, instead of the old `BucketIterator`s that people were used to. The problem with the new datasets is that the iterators do not recycle after a pass through the entire iterator: once the user runs through an entire epoch, the iterator is exhausted, and any further attempt to access it raises `StopIteration`. This is how iterators normally work, but for training over multiple epochs it creates a problem. The solution is something like the code below: you basically need to recreate the DataLoader after each epoch. The original snippet is from pytorch/text#1447.
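To be clear, the snippet below is a rough paraphrase of that idea, not the exact code from the issue; `AG_NEWS` is just a stand-in for any iterator-style torchtext dataset:

```python
# Rough paraphrase of the idea from pytorch/text#1447, not the exact code.
# AG_NEWS is only a stand-in iterator-style torchtext dataset.
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS
from torchtext.data.functional import to_map_style_dataset

def make_train_loader():
    train_iter = AG_NEWS(split="train")  # a fresh iterator on every call
    return DataLoader(train_iter, batch_size=8)

for epoch in range(3):
    train_loader = make_train_loader()  # recreate the loader each epoch
    for labels, texts in train_loader:
        pass  # training step goes here

# Alternative: materialize the stream once as a map-style dataset,
# which can then be iterated any number of times (and shuffled).
train_dataset = to_map_style_dataset(AG_NEWS(split="train"))
reusable_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
```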
So in PL language, it seems like recreating this `DataLoader` needs to happen in the `on_train_epoch_start()` method. I am not sure if the PL team has already encountered this issue, but I was trying to find a workaround. I experimented a bit with a hook in the `pl.LightningModule`, but my attempt does not currently work and fails with a `'CombinedLoaderIterator' object has no attribute 'sampler'` error.

Of course, whatever the fix is, it will be needed for the validation and testing dataloaders too (a sketch of one possible Lightning-side direction follows below). But would this be the way to go, or is there some other way that is preferred? Just wanted to check before I went too far down the rabbit hole.
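For what it's worth, here is a sketch of one Lightning-side direction (an assumption on my part, not a confirmed fix): `Trainer` can rebuild the dataloaders every epoch via `reload_dataloaders_every_n_epochs` (older releases used `reload_dataloaders_every_epoch=True`), so `train_dataloader()` only needs to return a fresh iterator each time it is called. The module name here is hypothetical:

```python
# Sketch only: LitTextClassifier is a hypothetical module, and the
# Trainer flag assumes a Lightning version that supports
# reload_dataloaders_every_n_epochs (older versions used
# reload_dataloaders_every_epoch=True).
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS  # stand-in iterator-style dataset

class LitTextClassifier(pl.LightningModule):
    def train_dataloader(self):
        # Called again on every reload, so each epoch sees a fresh,
        # un-exhausted iterator instead of the dead one.
        return DataLoader(AG_NEWS(split="train"), batch_size=8)

    # training_step, configure_optimizers, etc. omitted from the sketch

trainer = pl.Trainer(reload_dataloaders_every_n_epochs=1)
```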