

In [None]:
!pip uninstall -y torch torchdata torchvision torchtext torchaudio fastai
!pip install portalocker
!pip install --pre torch torchdata -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

In [4]:
!pip install torchtext

# Step 1: Create a dataset object

In [6]:
import torch

In [16]:
from torchtext.datasets import IMDB
train_iter, test_iter = IMDB(split=('train', 'test'))

## Check dimension and content

In [11]:
# Materialize the iterable dataset into a list to count its examples
print(len(list(train_iter)))

25000


In [12]:
# Get the number of examples in the training set
num_train_examples = sum(1 for _ in train_iter)

# Get the number of examples in the test set
num_test_examples = sum(1 for _ in test_iter)

print("Number of training examples:", num_train_examples)
print("Number of test examples:", num_test_examples)

Number of training examples: 25000
Number of test examples: 25000


To inspect the raw data, create an iterator over the dataset with iter() and call next() on it. Each example is a (label, text) tuple.
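One detail matters when inspecting examples this way: every call to iter() starts a new pass, so next(iter(...)) keeps returning the first example. The minimal sketch below uses a plain Python list as a hypothetical stand-in for the dataset (the name examples and its contents are made up for illustration) to show the difference between restarting and reusing an iterator.

```python
# Hypothetical stand-in for the dataset: a list of (label, text) pairs.
examples = [(1, "first review"), (2, "second review"), (1, "third review")]

# A new iterator is created on each line, so both calls return the first example.
a = next(iter(examples))
b = next(iter(examples))
print(a == b)

# Reusing a single iterator advances through the data instead.
it = iter(examples)
c = next(it)
d = next(it)
print(c, d)
```

To step through several examples, keep one iterator in a variable and call next() on it repeatedly.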

In [97]:
# Each call to iter() starts a fresh pass over the dataset, so repeating
# this line returns the same first example rather than advancing
next(iter(train_iter))

(1,
 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far betwee

In [101]:
# Collect the label of every training example
label_list = [label for label, _ in train_iter]

In [102]:
unique_values = set(label_list)
print(unique_values)

{1, 2}


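You can also split the dataset into subsets (for example, training and validation) using torch.utils.data.random_split(dataset, lengths). random_split needs a dataset with a known length, so convert the iterable dataset to a list or another map-style dataset first.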
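As a minimal sketch of such a split, the example below applies random_split to a small list of (label, text) pairs. The list toy_dataset and the 8/2 split are hypothetical stand-ins chosen for illustration; with the real data you would pass something like list(train_iter) and your own lengths.

```python
import torch
from torch.utils.data import random_split

# Hypothetical stand-in for list(train_iter): ten (label, text) pairs.
toy_dataset = [(1 + i % 2, f"review number {i}") for i in range(10)]

# A seeded generator makes the split reproducible.
generator = torch.Generator().manual_seed(0)
train_subset, valid_subset = random_split(toy_dataset, [8, 2], generator=generator)

print(len(train_subset), len(valid_subset))
print(valid_subset[0])  # a (label, text) pair drawn from toy_dataset
```

Both results are Subset objects that index back into the original data, so they can be passed to a DataLoader like any other map-style dataset.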


# Step 2: Build the data processing pipeline
tokenizer -> vocab -> word vector

## Tokenizer

The tokenizer argument of get_tokenizer specifies which tokenizer to use:

*   "basic_english": lowercases the text, separates common punctuation marks from words, and splits on whitespace. Characters it does not treat as punctuation stay inside the token (see 'chi@nese' in the example below).
*   "spacy": uses the spaCy tokenizer, which requires the spacy package and a language model to be installed. It applies language-specific rules, such as splitting contractions and separating punctuation from words.
*   A custom tokenizer function: you can also pass in a custom tokenizer function that takes a string as input and returns a list of tokens.
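A custom tokenizer only has to be a callable that maps a string to a list of tokens. The sketch below is one hypothetical example: the name simple_tokenizer and its regular expression are assumptions made for illustration and are not part of torchtext. It lowercases the text, keeps simple contractions such as "don't" together, and emits every other punctuation character as its own token.

```python
import re

# Hypothetical pattern: a run of letters/digits with an optional 's / 't style
# suffix, or any single character that is neither whitespace nor alphanumeric.
_TOKEN_RE = re.compile(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]")

def simple_tokenizer(text):
    """Return a list of lowercase tokens for the given string."""
    return _TOKEN_RE.findall(text.lower())

print(simple_tokenizer("I love my hometown!"))
print(simple_tokenizer("Don't stop, it's great!"))
```

Unlike the "basic_english" behaviour shown in this notebook, this sketch would split 'chi@nese' into three tokens, which illustrates how much the choice of tokenizer affects the vocabulary.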

In [25]:
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('basic_english')

In [26]:
tokenizer('I am 1 Chi@nese and I love my hometown!')

['i', 'am', '1', 'chi@nese', 'and', 'i', 'love', 'my', 'hometown', '!']

In [28]:
tokenizer(next(iter(train_iter))[1])[:5]

['i', 'rented', 'i', 'am', 'curious-yellow']

## Vocabulary
Build a vocabulary from the raw training dataset using build_vocab_from_iterator.

This function accepts an iterator that yields lists (or iterators) of tokens. You can also pass special symbols, such as "<unk>", to be added to the vocabulary.
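To make the idea concrete, here is a pure-Python sketch of roughly what building a vocabulary from an iterator involves: count the tokens, put the special symbols first, then assign indices by descending frequency. The helper names build_toy_vocab and yield_toy_tokens and the toy sentences are hypothetical, and the tie-breaking rule is an assumption; this is an illustration, not torchtext's implementation.

```python
from collections import Counter

def build_toy_vocab(token_lists, specials=("<unk>",), min_freq=1):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Sort by frequency (high to low); break ties alphabetically (an assumption).
    ordered = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
    itos = list(specials) + [
        tok for tok, freq in ordered if freq >= min_freq and tok not in specials
    ]
    return {tok: idx for idx, tok in enumerate(itos)}

def yield_toy_tokens(texts):
    # Same shape as yield_tokens: one list of tokens per text.
    for text in texts:
        yield text.lower().split()

toy_texts = ["the movie was great", "the plot was thin", "great cast"]
toy_vocab = build_toy_vocab(yield_toy_tokens(toy_texts))
print(toy_vocab)
```

Because the special symbol is inserted first, it always receives index 0, which is convenient when it later serves as the fallback for unknown tokens.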

In [29]:
from torchtext.vocab import build_vocab_from_iterator

In [30]:
train_iter = iter(IMDB(split='train'))

def yield_tokens(data_iter):
    for _, text in data_iter:
        yield tokenizer(text)



In [76]:
vocab = build_vocab_from_iterator(iterator=yield_tokens(train_iter), specials=["<unk>"])

In [31]:
type(vocab)

torchtext.vocab.vocab.Vocab

The vocabulary block converts a list of tokens into integers.

In [77]:
vocab(['here', 'is', 'an', 'example','of'])

[131, 9, 40, 464, 6]

In [None]:
vocab(['here', 'is', 'an', 'example','of','Zhe'])

In [79]:
vocab.set_default_index(vocab["<unk>"])

In [80]:
vocab(['here', 'is', 'an', 'example','of','Zhe'])

[131, 9, 40, 464, 6, 0]
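The cells above show that looking up an unseen token such as 'Zhe' only works once a default index is set. The sketch below mimics that behaviour with a small hypothetical class, ToyVocab, written only for illustration; the exception type it raises is an assumption rather than a statement about torchtext's internals.

```python
class ToyVocab:
    """Hypothetical token-to-index lookup with an optional default index."""

    def __init__(self, stoi):
        self.stoi = dict(stoi)
        self.default_index = None

    def set_default_index(self, index):
        self.default_index = index

    def __getitem__(self, token):
        if token in self.stoi:
            return self.stoi[token]
        if self.default_index is None:
            # Assumed error type for this sketch.
            raise RuntimeError(f"Token {token!r} not found and no default index is set")
        return self.default_index

    def __call__(self, tokens):
        return [self[token] for token in tokens]

toy = ToyVocab({"<unk>": 0, "here": 1, "is": 2})

try:
    toy(["here", "Zhe"])
    failed = False
except RuntimeError:
    failed = True
print("lookup failed before setting a default:", failed)

toy.set_default_index(toy["<unk>"])
print(toy(["here", "is", "Zhe"]))
```

Setting the default to the index of "<unk>" means every out-of-vocabulary token is mapped to the same placeholder.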

### Alternative: build the vocabulary using Counter

In [39]:
from torchtext.datasets import IMDB
train_IMDB, test_IMDB = IMDB(split=('train', 'test')) 

In [122]:
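from collections import Counter
# Import the factory under another name so it does not shadow the
# `vocab` object built earlier with build_vocab_from_iterator
from torchtext.vocab import vocab as build_vocab

train_iter = IMDB(split='train')
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
# Keep tokens seen at least 10 times; the specials take indices 0, 1 and 2
vocab_imdb = build_vocab(counter, min_freq=10, specials=('<unk>', '<BOS>', '<EOS>'))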

In [46]:
type(vocab_imdb)

torchtext.vocab.vocab.Vocab

In [47]:
vocab_imdb(['here', 'is', 'an', 'example'])

[971, 54, 197, 3455]

In [None]:
vocab_imdb(['here', 'is', 'an', 'example','of','Zhe'])

In [124]:
vocab_imdb.set_default_index(0)

In [125]:
vocab_imdb(['here', 'is', 'an', 'example','of','Zhe'])

[971, 54, 197, 3455, 11, 0]

In [126]:
vocab_imdb["<EOS>"]

2

In [73]:
print("The length of the new vocab is", len(vocab_imdb))

print("The index of '' is", vocab_imdb[''])

print("The token at index 971 is", vocab_imdb.lookup_token(971))

The length of the new vocab is 20438
The index of '' is 0
The token at index 971 is here


# Step 3: Generate data batch and iterator

In [84]:
# Convert raw text to a list of vocabulary indices
text_pipeline = lambda x: vocab(tokenizer(x))
# Map the dataset's labels {1, 2} to {0, 1}
label_pipeline = lambda x: int(x) - 1

In [86]:
from torch.utils.data import DataLoader

In [87]:
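def collate_batch(batch):
    # Labels, token-index tensors, and text lengths (preceded by a leading 0)
    label_list, text_list, offsets = [], [], [0]

    for (_label, _text) in batch:
        label_list.append(label_pipeline(_label))

        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)

        offsets.append(processed_text.size(0))

    label_list = torch.tensor(label_list, dtype=torch.int64)
    # The cumulative sum of the lengths gives the start position of each text
    # in the concatenated tensor; the last length is not needed
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
    # Concatenate all texts in the batch into one 1-D tensor
    text_list = torch.cat(text_list)

    return label_list, text_list, offsets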

In [88]:
train_iter = IMDB(split='train')
train_list = list(train_iter)
batch_size = 8  # A batch size of 8
train_dataloader = DataLoader(train_list, batch_size=batch_size, shuffle=True,
                              collate_fn=collate_batch)

In [94]:
print(next(iter(train_dataloader)))

(tensor([1, 1, 1, 0, 1, 0, 0, 1]), tensor([  53,  163, 2104,  ..., 1335, 3933,    2]), tensor([   0,   91,  195,  806,  986, 1627, 1967, 2196]))
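The third tensor in this output holds the offsets: the position in the flat token tensor where each text of the batch begins. The pure-Python sketch below reproduces that bookkeeping on a hypothetical toy batch (the token indices are made up) and shows that the offsets are enough to recover the individual texts.

```python
from itertools import accumulate

# Hypothetical batch of three texts that are already converted to indices.
texts = [[5, 9, 2], [7, 1], [4, 4, 8, 3]]

# Concatenate all token indices into one flat list.
flat = [idx for text in texts for idx in text]

# Offsets are the running total of the lengths, shifted so the first is 0.
lengths = [len(text) for text in texts]
offsets = [0] + list(accumulate(lengths))[:-1]
print(offsets)  # [0, 3, 5]

# Consecutive offsets delimit each text inside the flat list.
bounds = offsets + [len(flat)]
recovered = [flat[start:end] for start, end in zip(bounds, bounds[1:])]
print(recovered == texts)  # True
```

This layout avoids padding altogether, which is why the flat tensors printed in this section have a different length for every batch.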


In [104]:
i=0
for (label,content,offsets) in (iter(train_dataloader)):
  print(content.size())
  i+=1
  if i==5:
    break

torch.Size([1339])
torch.Size([1844])
torch.Size([1469])
torch.Size([1853])
torch.Size([2255])


In [130]:
text_transform_imdb = lambda x: [vocab_imdb['<BOS>']] + [vocab_imdb[token] for token in tokenizer(x)] + [vocab_imdb['<EOS>']]
label_transform_imdb = label_pipeline

In [None]:
print(text_pipeline('I love you'))
print(text_transform_imdb('I love you'))

In [131]:

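from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

def collate_batch(batch):
    label_list, text_list = [], []
    for (_label, _text) in batch:
        label_list.append(label_transform_imdb(_label))
        processed_text = torch.tensor(text_transform_imdb(_text))
        text_list.append(processed_text)
    # pad_sequence returns a (max_length, batch_size) tensor by default.
    # Caution: vocab_imdb has no padding special, so index 3 is an ordinary
    # token; add a '<PAD>' special to the vocabulary for a dedicated padding index.
    return torch.tensor(label_list), pad_sequence(text_list, padding_value=3.0)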

train_iter = IMDB(split='train')
train_dataloader = DataLoader(list(train_iter), batch_size=8, shuffle=True, 
                              collate_fn=collate_batch)

In [132]:
print(next(iter(train_dataloader)))

(tensor([1, 1, 1, 1, 0, 0, 1, 1]), tensor([[   1,    1,    1,  ...,    1,    1,    1],
        [1936,   13, 6240,  ...,    3,   87,    3],
        [  54,   20, 5405,  ...,  545,   43,  467],
        ...,
        [   3,    3,    3,  ...,    3,    3,   17],
        [   3,    3,    3,  ...,    3,    3,  400],
        [   3,    3,    3,  ...,    3,    3,    2]]))
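The padded tensor above has one column per example, which is the default layout of pad_sequence. The small sketch below shows the two layouts side by side on hypothetical sequences; PAD_IDX = 0 is an arbitrary padding index chosen only for this example and is not the value used in this notebook.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_IDX = 0  # hypothetical padding index for this sketch
sequences = [torch.tensor([1, 5, 6, 2]), torch.tensor([1, 7, 2])]

# Default layout: (max_length, batch_size), one column per sequence.
time_major = pad_sequence(sequences, padding_value=PAD_IDX)

# batch_first=True gives (batch_size, max_length), one row per sequence.
batch_major = pad_sequence(sequences, batch_first=True, padding_value=PAD_IDX)

print(time_major.shape)   # torch.Size([4, 2])
print(batch_major.shape)  # torch.Size([2, 4])
print(batch_major)
```

Which layout to use depends on what the downstream model expects, so it is worth checking the shape of one batch before training.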


In [133]:
i=0
for (label,content) in (iter(train_dataloader)):
  print(content.size())
  print(content)
  i+=1
  if i==5:
    break

torch.Size([594, 8])
tensor([[   1,    1,    1,  ...,    1,    1,    1],
        [2681,    3,   38,  ...,   38,   22,  766],
        [  40,  741,   54,  ...,   54,   13,   13],
        ...,
        [8841,    3,    3,  ...,    3,    3,    3],
        [  24,    3,    3,  ...,    3,    3,    3],
        [   2,    3,    3,  ...,    3,    3,    3]])
torch.Size([908, 8])
tensor([[    1,     1,     1,  ...,     1,     1,     1],
        [   38,    33,   466,  ...,  7382,   129,   108],
        [   54,   264, 16450,  ...,  2834,     3,    11],
        ...,
        [    3,     3,     3,  ..., 12419,     3,     3],
        [    3,     3,     3,  ...,    24,     3,     3],
        [    3,     3,     3,  ...,     2,     3,     3]])
torch.Size([640, 8])
tensor([[    1,     1,     1,  ...,     1,     1,     1],
        [   13,   197,    13,  ...,     3,   384, 13770],
        [ 4965,  1004,  9479,  ...,    82,    13,  6918],
        ...,
        [   17,     3,     3,  ...,     3,     3,     3],
    

In [135]:
i=0
for (label,content) in (iter(train_dataloader)):
  print(content.size())
  i+=1
  if i==5:
    break

torch.Size([423, 8])
torch.Size([426, 8])
torch.Size([306, 8])
torch.Size([738, 8])
torch.Size([257, 8])



# Step 4: Iterate batch to train a model