-
Okay, I think I figured out how to adapt the datamodule to resolve this issue. It seems to be working now.
-
@tshu-w Sure, no problem. Here is the code for the datamodule below. I did not have to do anything special for the training loop, etc. I think the key is to build and preprocess the data in the `Dataset`, and then just use the `DataModule` to create the train/val/test splits. The other thing I noticed is that the new API docs make a big deal out of the collate function, but in this working example the collate function is pretty simple. I am not sure whether that has implications for training on multiple machines, but at least the code works. Sorry the example is a bit complicated; hopefully you will find it helpful. I am going to see if I can write up some simpler code for a PL example model.

```python
import os
import platform
import random
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple
from zipfile import ZipFile

import gdown
import numpy as np
import pandas as pd
import pytorch_lightning as pl
import torch
import torch.optim as optim
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, Dataset, random_split
from torchtext.data.utils import get_tokenizer, ngrams_iterator
from torchtext.utils import download_from_url, extract_archive
from torchtext.vocab import Vocab, build_vocab_from_iterator
from tqdm.auto import tqdm
class StanfordSentimentTreeBank(Dataset):
    """The Stanford Sentiment Treebank dataset.

    Stanford Sentiment Treebank V1.0

    This is the dataset of the paper:

    Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank
    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning,
    Andrew Ng and Christopher Potts
    Conference on Empirical Methods in Natural Language Processing (EMNLP 2013)

    If you use this dataset in your research, please cite the above paper.

    @incollection{SocherEtAl2013:RNTN,
        title = {{Parsing With Compositional Vector Grammars}},
        author = {Richard Socher and Alex Perelygin and Jean Wu and Jason Chuang
                  and Christopher Manning and Andrew Ng and Christopher Potts},
        booktitle = {{EMNLP}},
        year = {2013}
    }

    This file includes:

    1. original_rt_snippets.txt contains 10,605 processed snippets from the
       original pool of Rotten Tomatoes HTML files. Please note that some
       snippets may contain multiple sentences.
    2. dictionary.txt contains all phrases and their IDs, separated by a
       vertical line |
    3. sentiment_labels.txt contains all phrase ids and the corresponding
       sentiment labels, separated by a vertical line.
       Note that you can recover the 5 classes by mapping the positivity
       probability using the following cut-offs:
       [0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0]
       for very negative, negative, neutral, positive, very positive,
       respectively. Please note that phrase ids and sentence ids are not
       the same.
    4. SOStr.txt and STree.txt encode the structure of the parse trees.
       STree encodes the trees in a parent pointer format. Each line
       corresponds to each sentence in the datasetSentences.txt file. The
       Matlab code of this paper will show you how to read this format if
       you are not familiar with it.
    5. datasetSentences.txt contains the sentence index, followed by the
       sentence string separated by a tab. These are the sentences of the
       train/dev/test sets.
    6. datasetSplit.txt contains the sentence index (corresponding to the
       index in the datasetSentences.txt file) followed by the set label,
       separated by a comma:
       1 = train
       2 = test
       3 = dev

    Please note that the datasetSentences.txt file has more sentences/lines
    than original_rt_snippets.txt. Each row in the latter represents a
    snippet as shown on RT, whereas the former is each sub-sentence as
    determined by the Stanford parser.

    For comparing research and training models, please use the provided
    train/dev/test splits.
    """

    ORIG_URL = "http://nlp.stanford.edu/~socherr/stanfordSentimentTreebank.zip"
    DATASET_NAME = "StanfordSentimentTreeBank"
    URL = 'https://drive.google.com/uc?id=1urNi0Rtp9XkvkxxeKytjl1WoYNYUEoPI'
    OUTPUT = 'sst_dataset.zip'
    def __init__(self,
                 root,
                 vocab=None,
                 text_transforms=None,
                 label_transforms=None,
                 split='train',
                 ngrams=1,
                 use_transformed_dataset=True):
        """Initiate the text-classification dataset.

        Args:
            root: directory where the downloaded dataset zip is cached.
            vocab: Vocabulary object used for the dataset; built from the
                training split when None.
            text_transforms: extra sentence-level text transforms (e.g.
                augmentations), applied after tokenization.
            label_transforms: transforms applied to the label.
            split: one of 'train' or 'test'.
            ngrams: produce up to this order of n-grams from each token list.
        """
        super().__init__()
        if split not in ['train', 'test']:
            raise ValueError(f'split must be one of ["train", "test"]; got unknown split {split!r}')
        self.vocab = vocab
        gdown.cached_download(self.URL, Path(root) / self.OUTPUT)
        self.generate_sst_dataset(split, Path(root) / self.OUTPUT)
        tokenizer = get_tokenizer("basic_english")
        # The text transform can only work at the sentence level;
        # the rest of tokenization and vocab lookup is done by this class.
        self.text_transform = sequential_transforms(tokenizer,
                                                    self.ngrams_func(ngrams))

        # Define special symbols and indices.
        UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
        # Make sure the tokens are in order of their indices, so they are
        # inserted into the vocab correctly.
        special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

        def build_vocab(data, transforms):
            def apply_transforms(data):
                for line in data:
                    yield transforms(line)
            return build_vocab_from_iterator(apply_transforms(data),
                                             min_freq=1,
                                             specials=special_symbols,
                                             special_first=True)

        if self.vocab is None:
            # The vocab is always built on the train split.
            self.vocab = build_vocab(self.dataset_train["phrase"],
                                     self.text_transform)
            self.vocab.set_default_index(self.vocab["<unk>"])

        if text_transforms is not None:
            self.text_transform = self.sequential_transforms(
                self.text_transform,
                text_transforms,
                self.vocab_func(self.vocab),
                self.totensor(dtype=torch.long),
            )
        else:
            self.text_transform = self.sequential_transforms(
                self.text_transform,
                self.vocab_func(self.vocab),
                self.totensor(dtype=torch.long),
            )
        self.label_transform = self.sequential_transforms(self.totensor(dtype=torch.long))

    def generate_sst_dataset(self, split, dataset_file):
        with ZipFile(dataset_file) as datasetzip:
            with datasetzip.open('sst_dataset/sst_dataset_augmented.csv') as f:
                dataset = pd.read_csv(f, index_col=0)
        self.dataset_orig = dataset.copy()
        # Train on the train (1) and dev (3) splits, stacking the augmented
        # columns (synonym replacement, back-translation) on top of the
        # cleaned phrases.
        dataset_train_raw = dataset[dataset['splitset_label'].isin([1, 3])]
        self.dataset_train = pd.concat([
            dataset_train_raw[['phrase_cleaned', 'sentiment_values']].rename(columns={"phrase_cleaned": 'phrase'}),
            dataset_train_raw[['synonym_sentences', 'sentiment_values']].rename(columns={"synonym_sentences": 'phrase'}),
            dataset_train_raw[['backtranslated', 'sentiment_values']].rename(columns={"backtranslated": 'phrase'}),
        ], ignore_index=True)
        if split == 'train':
            self.dataset = self.dataset_train.copy()
        else:
            self.dataset = dataset[dataset['splitset_label'].isin([2])] \
                [['phrase_cleaned', 'sentiment_values']] \
                .rename(columns={"phrase_cleaned": 'phrase'}) \
                .reset_index(drop=True)

    @staticmethod
    def discretize_label(label):
        # Map the positivity probability onto the 5 SST classes.
        if label <= 0.2: return 0
        if label <= 0.4: return 1
        if label <= 0.6: return 2
        if label <= 0.8: return 3
        return 4

    def __getitem__(self, idx):
        text = self.text_transform(self.dataset['phrase'].iloc[idx])
        label = self.label_transform(self.dataset['sentiment_values'].iloc[idx])
        return label, text

    def __len__(self):
        return len(self.dataset)

    @staticmethod
    def get_labels():
        return ['very negative', 'negative', 'neutral', 'positive', 'very positive']

    def get_vocab(self):
        return self.vocab

    @property
    def collator_fn(self):
        def collate_fn(batch):
            pad_idx = self.get_vocab()['<pad>']
            labels, sequences = zip(*batch)
            labels = torch.stack(labels)
            lengths = torch.LongTensor([len(sequence) for sequence in sequences])
            # Pad every sequence in the batch to the length of the longest one.
            sequences = torch.nn.utils.rnn.pad_sequence(sequences,
                                                        padding_value=pad_idx,
                                                        batch_first=True)
            return labels, sequences, lengths
        return collate_fn

    # Thin wrappers so the module-level transform helpers are also
    # reachable through the class.
    @staticmethod
    def vocab_func(vocab):
        return vocab_func(vocab)

    @staticmethod
    def totensor(dtype):
        return totensor(dtype)

    @staticmethod
    def ngrams_func(ngrams):
        return ngrams_func(ngrams)

    @staticmethod
    def sequential_transforms(*transforms):
        return sequential_transforms(*transforms)

class SSTDataModule(pl.LightningDataModule):
    """DataModule for SST: train/val/test splits and transforms."""

    name = "stanford_sentiment_treebank"

    def __init__(
            self,
            data_dir: str = '.',
            val_split: int = 1000,
            num_workers: int = 2,
            batch_size: int = 64,
            pin_memory=True):
        """
        Args:
            data_dir: where to save/load the data
            val_split: how many training samples to hold out for validation
            num_workers: how many workers to use for loading data
            batch_size: desired batch size
            pin_memory: whether to copy batches into page-locked memory
        """
        super().__init__()
        self.data_dir = data_dir
        self.val_split = val_split
        self.num_workers = num_workers
        self.batch_size = batch_size
        self.pin_memory = pin_memory
        self.dataset_train = ...
        self.dataset_val = ...
        self.dataset_test = ...
        self.SST = StanfordSentimentTreeBank

    def prepare_data(self):
        """Downloads and caches the SST files in `data_dir`."""
        self.SST(self.data_dir)

    def setup(self, stage: Optional[str] = None):
        """Split the train dataset into train and validation sets."""
        train_trans, test_trans = self.default_transforms
        train_dataset = self.SST(self.data_dir, split='train', **train_trans)
        test_dataset = self.SST(self.data_dir, split='test', **test_trans)
        train_length = len(train_dataset)
        self.raw_dataset_train = train_dataset
        self.raw_dataset_test = test_dataset
        self.dataset_train, self.dataset_val = random_split(
            train_dataset, [train_length - self.val_split, self.val_split])
        self.dataset_test = test_dataset

    def train_dataloader(self):
        """SST train set, with a subset removed for validation."""
        loader = DataLoader(
            self.dataset_train,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
            collate_fn=self.collator_fn,
        )
        return loader

    def val_dataloader(self):
        """SST val set, a held-out subset of the training split."""
        loader = DataLoader(
            self.dataset_val,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
            collate_fn=self.collator_fn,
        )
        return loader

    def test_dataloader(self):
        """SST test set, using the provided test split."""
        loader = DataLoader(
            self.dataset_test,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
            collate_fn=self.collator_fn,
        )
        return loader

    def get_vocab(self):
        return self.raw_dataset_train.get_vocab()

    @property
    def default_transforms(self):
        # Augmentations are only applied to the training split.
        train_transforms = {
            'text_transforms': StanfordSentimentTreeBank.sequential_transforms(
                random_deletion,
                random_swap
            ),
            'label_transforms': None
        }
        test_transforms = {
            'text_transforms': None,
            'label_transforms': None
        }
        return train_transforms, test_transforms

    @property
    def collator_fn(self):
        return self.raw_dataset_train.collator_fn

#############################################
## Data Augmentations
#############################################

def random_deletion(words, p=0.1):
    """Drop each token independently with probability p."""
    if len(words) == 1:  # nothing to delete from a single word
        return words
    remaining = list(filter(lambda x: random.uniform(0, 1) > p, words))
    if len(remaining) == 0:  # if nothing is left, keep one random word
        return [random.choice(words)]
    return remaining


def random_swap(sentence, n=3, p=0.1):
    """Attempt up to n swaps of random token pairs, each with probability 1 - p."""
    length = range(len(sentence))
    n = min(n, len(sentence))
    for _ in range(n):
        if random.uniform(0, 1) > p:
            idx1, idx2 = random.choices(length, k=2)
            sentence[idx1], sentence[idx2] = sentence[idx2], sentence[idx1]
    return sentence


################################################################
## Torchtext experimental functions.
## Replace these when they become part of stable torchtext.
################################################################

def vocab_func(vocab):
    def func(tok_iter):
        return [vocab[tok] for tok in tok_iter]
    return func


def totensor(dtype):
    def func(ids_list):
        return torch.tensor(ids_list).to(dtype)
    return func


def ngrams_func(ngrams):
    def func(token_list):
        return list(ngrams_iterator(token_list, ngrams))
    return func


def sequential_transforms(*transforms):
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

if __name__ == "__main__":
    dataset = StanfordSentimentTreeBank(root='.', split='train')
    ds_next = next(iter(dataset))
    print(dataset.get_vocab()["<bos>"])

    datamodule = SSTDataModule()
    datamodule.setup()
    tloader = datamodule.train_dataloader()

    loader = DataLoader(
        dataset,
        batch_size=32,
        shuffle=False,
        num_workers=2,
        pin_memory=True,
        collate_fn=dataset.collator_fn,
    )
```
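As a quick sanity check of the collate function, something like this (a sketch; it assumes the SST zip from the gdown URL above is already cached in the working directory) shows the shapes it produces:

```python
# Sketch of a smoke test for the collator above; assumes the dataset
# zip has been downloaded/cached in the current directory.
dm = SSTDataModule(data_dir='.')
dm.prepare_data()
dm.setup()

labels, sequences, lengths = next(iter(dm.train_dataloader()))
print(labels.shape)     # (batch_size,)
print(sequences.shape)  # (batch_size, max_len), padded with <pad>
print(lengths.shape)    # (batch_size,), original lengths before padding
```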
-
Have you taken a look at the
-
Hey folks, I was wondering if there was a good strategy to handle this problem. The ubiquitous `torchtext` package has undergone a bunch of changes in the last six months, and the API has changed quite a lot. The changes are very nice and make the package more consistent with `pytorch` standards, but there are still some tricky things to work out.

One problem is that `torchtext` now exposes its datasets as iterators, instead of the old `BucketIterator`s that people were used to. The problem with the new datasets is that the iterators do not recycle after a pass through the entire iterator: once the user runs through an entire epoch, the iterator is exhausted, and any further attempt to access it raises `StopIteration`. This is how iterators normally work, but for training over multiple epochs it creates a problem. The solution is something like the code below: you basically need to recreate the DataLoader after each epoch. The original snippet is from pytorch/text#1447.
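To be clear, the snippet below is a rough paraphrase of that idea, not the exact code from the issue; `AG_NEWS` is just a stand-in for any iterator-style torchtext dataset:

```python
# Rough paraphrase of the idea from pytorch/text#1447, not the exact code.
# AG_NEWS is only a stand-in iterator-style torchtext dataset.
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS
from torchtext.data.functional import to_map_style_dataset

def make_train_loader():
    train_iter = AG_NEWS(split="train")  # a fresh iterator on every call
    return DataLoader(train_iter, batch_size=8)

for epoch in range(3):
    train_loader = make_train_loader()  # recreate the loader each epoch
    for labels, texts in train_loader:
        pass  # training step goes here

# Alternative: materialize the stream once as a map-style dataset,
# which can then be iterated any number of times (and shuffled).
train_dataset = to_map_style_dataset(AG_NEWS(split="train"))
reusable_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
```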
So in PL language, it seems like recreating this `DataLoader` needs to happen in the `on_train_epoch_start()` method. I am not sure if the PL team has already encountered this issue, but I was trying to find a workaround. I experimented a bit with a hook in the `pl.LightningModule`, but my attempt does not currently work and fails with a `'CombinedLoaderIterator' object has no attribute 'sampler'` error.

Of course, whatever the fix is, it will be needed for the validation and testing dataloaders too (a sketch of one possible Lightning-side direction follows below). But would this be the way to go, or is there some other way that is preferred? Just wanted to check before I went too far down the rabbit hole.
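For what it's worth, here is a sketch of one Lightning-side direction (an assumption on my part, not a confirmed fix): `Trainer` can rebuild the dataloaders every epoch via `reload_dataloaders_every_n_epochs` (older releases used `reload_dataloaders_every_epoch=True`), so `train_dataloader()` only needs to return a fresh iterator each time it is called. The module name here is hypothetical:

```python
# Sketch only: LitTextClassifier is a hypothetical module, and the
# Trainer flag assumes a Lightning version that supports
# reload_dataloaders_every_n_epochs (older versions used
# reload_dataloaders_every_epoch=True).
import pytorch_lightning as pl
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS  # stand-in iterator-style dataset

class LitTextClassifier(pl.LightningModule):
    def train_dataloader(self):
        # Called again on every reload, so each epoch sees a fresh,
        # un-exhausted iterator instead of the dead one.
        return DataLoader(AG_NEWS(split="train"), batch_size=8)

    # training_step, configure_optimizers, etc. omitted from the sketch

trainer = pl.Trainer(reload_dataloaders_every_n_epochs=1)
```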