
# 1. Word Embeddings
Let's get a basic understanding of word tokenisation, i.e. representing a word with a unique numeric token.

In [1]:
corpus = "hello hello there new world"
words = corpus.split()
print(words)

['hello', 'hello', 'there', 'new', 'world']


Manually construct a simple dictionary containing the words in the sentence

In [2]:
word_dict = {"hello":0, "new":1, "there":2, "world":3 }
word_dict

# More common usage:
# word_dict = {word: idx for idx, word in enumerate(sorted(set(words)))} # sorting is optional; it only changes which index each word gets
# More common name for word_dict: word2idx

{'hello': 0, 'new': 1, 'there': 2, 'world': 3}

Get the index for each word in the sentence by looking it up in the dictionary

In [3]:
indices = [word_dict[w] for w in words]
indices

[0, 0, 2, 1, 3]

Convert the word indices of the input sentence to a tensor

In [4]:
import torch

input_tensor = torch.LongTensor(indices)  # word indices are integers, so a Long (int64) tensor is normally used

input_tensor

tensor([0, 0, 2, 1, 3])

### Create Embedding Layer
The embedding layer is a 2-D matrix of shape `(n_vocab x embedding_dimension)`. If we want to get the embeddings (word vectors) for a particular sentence, supply a list of indices of the sentence as input to the embedding layer. Each index in the input list maps to the specific row of the embedding layer matrix (word vector). The output shape after applying the input list of indices to the embedding layer is another 2-D matrix of shape `(n_words x embedding_dimension)`.

In [5]:
import torch
from torch.nn import Embedding

# Create the embeddings for the word list that we have: "hello hello there new world"
num_embeddings = len(word_dict) # num_embeddings: length of vocab
embedding_dim = 3 # embedding_dim: size of embedding vector for each word

emb_layer = Embedding(num_embeddings, embedding_dim) 
embeddings = emb_layer(input_tensor)

print(f"Embedding vector: {embeddings}")
print(f"Vector shape: {embeddings.shape}")

Embedding vector: tensor([[-0.6986,  0.8836,  0.8226],
        [-0.6986,  0.8836,  0.8226],
        [-0.3557, -1.0070,  0.2364],
        [-0.9076,  0.2229, -0.8248],
        [-0.0817, -0.0118, -1.6573]], grad_fn=<EmbeddingBackward0>)
Vector shape: torch.Size([5, 3])


In [6]:
# Q1. Why are the embeddings the same for the first two rows?
# A1. Both rows correspond to the same word, "hello" (index 0), so both look up the same row of the embedding matrix.

The PyTorch built-in `Embedding` layer comes with randomly initialized weights that are updated with gradient descent as your model learns to map input indices to some kind of output. 
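
To make the two points above concrete, here is a minimal sketch (the seed, learning rate and variable names are arbitrary choices for illustration). It checks that calling the layer is the same as indexing rows of its weight matrix, and that one optimiser step changes those randomly initialised weights.

```python
import torch
from torch.nn import Embedding

torch.manual_seed(0)                       # arbitrary seed, only for repeatability
emb = Embedding(4, 3)                      # 4 words in the vocab, 3 dimensions each
idx = torch.LongTensor([0, 0, 2, 1, 3])    # "hello hello there new world"

looked_up = emb(idx)
# Calling the layer is a row lookup in emb.weight
same_as_indexing = torch.equal(looked_up, emb.weight[idx])

# One gradient-descent step on a dummy loss updates the weights
before = emb.weight.detach().clone()
optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)
looked_up.sum().backward()
optimizer.step()
weights_changed = not torch.equal(before, emb.weight.detach())

print(same_as_indexing, weights_changed)
```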

### Exercise
Create embeddings for the sentence `"hello world"` by passing suitable parameters to the `Embedding` class.
- you should look up the indices for the words in this sentence in `word_dict` and convert it to a tensor
- each word should be converted into a 12-dimensional vector
- print the embeddings for this sentence and its shape

In [7]:

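# Q2. Insert your code here
words2 = 'hello world'.split()
idx2 = [word_dict[w] for w in words2]
input_tensor2 = torch.LongTensor(idx2)

# Note: this reuses emb_layer from above (embedding_dim=3), so the vectors below are 3-dimensional.
# The 12-dimensional vectors the exercise asks for need a new layer: Embedding(len(word_dict), 12).
embeddings2 = emb_layer(input_tensor2)
print(embeddings2)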

tensor([[-0.6986,  0.8836,  0.8226],
        [-0.0817, -0.0118, -1.6573]], grad_fn=<EmbeddingBackward0>)
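
The exercise asks for 12-dimensional vectors, which requires a layer created with `embedding_dim=12`. A minimal sketch follows. The names `emb12` and `out12` are made up here, and the values are random, so only the shape is meaningful.

```python
import torch
from torch.nn import Embedding

word_dict = {"hello": 0, "new": 1, "there": 2, "world": 3}
idx = torch.LongTensor([word_dict[w] for w in "hello world".split()])

# A fresh layer: one 12-dimensional row per vocabulary word
emb12 = Embedding(num_embeddings=len(word_dict), embedding_dim=12)
out12 = emb12(idx)

print(out12)         # random values, different on every run
print(out12.shape)   # 2 words x 12 dimensions
```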


### Padding
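- `padding_idx` marks one index as padding: its embedding row is initialised to zeros and receives no gradient updates. Out-of-vocabulary or unknown words can be mapped to it when we don't care about their gradients.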

In [8]:
num_embs = 6
emb_dim = 4

emb2 = Embedding(num_embs, emb_dim, padding_idx=3)
emb2.weight

Parameter containing:
tensor([[-0.9240, -0.0861,  0.0707, -0.9295],
        [-2.5480, -1.2619, -1.0553,  0.2260],
        [ 0.9777,  1.2833,  0.3440, -0.8356],
        [ 0.0000,  0.0000,  0.0000,  0.0000],
        [ 0.0445, -2.4306,  1.5638, -0.5372],
        [-0.8052, -1.8966,  0.6132, -0.0118]], requires_grad=True)

In [9]:
# Q3. What is the effect of applying padding_idx=3 to the embedding?
# A3. The row at index 3 (the fourth row) is initialised to all zeros and receives no gradient updates,
#     so the padding token always maps to a zero vector and its embedding stays fixed during training.
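
A short sketch of how to check this behaviour. The dummy loss and the chosen indices are arbitrary, for illustration only. After a backward pass, the gradient of the padding row is zero while the other looked-up rows receive a gradient.

```python
import torch
from torch.nn import Embedding

emb = Embedding(6, 4, padding_idx=3)
out = emb(torch.LongTensor([1, 3, 5]))   # index 3 is the padding index
out.sum().backward()                     # dummy loss: sum of all outputs

pad_row_is_zero = bool(torch.all(emb.weight[3] == 0))
pad_grad_is_zero = bool(torch.all(emb.weight.grad[3] == 0))
other_grad = emb.weight.grad[1]          # gradient for an ordinary row

print(pad_row_is_zero, pad_grad_is_zero, other_grad)
```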

### Pre-trained Embeddings
Often it is better to use pretrained embeddings that do not update but instead are frozen. [GloVe](https://nlp.stanford.edu/projects/glove/) embeddings are one of the most popular pretrained word embeddings in use. The best performing embeddings are their Common Crawl embeddings with 840B tokens; however, they take very long to download. We'll download the Wikipedia embeddings with 6B tokens.

[Source](https://github.com/A-Jacobson/CNN_Sentence_Classification/blob/master/WordVectors.ipynb)
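
As a side note on freezing: PyTorch provides `nn.Embedding.from_pretrained`, whose `freeze` argument controls whether the weights are updated during training. The sketch below uses a tiny made-up weight matrix rather than real GloVe vectors.

```python
import torch
from torch import nn

# Made-up "pretrained" vectors: 3 words x 2 dimensions (not real GloVe values)
pretrained = torch.tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

frozen = nn.Embedding.from_pretrained(pretrained, freeze=True)              # weights stay fixed
trainable = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)  # weights can be fine-tuned

print(frozen.weight.requires_grad, trainable.weight.requires_grad)
print(frozen(torch.LongTensor([1])))
```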

In [10]:
# Takes 3+ mins
!wget http://nlp.stanford.edu/data/glove.6B.zip 
!unzip glove.6B.zip
!ls -lat

--2022-09-22 01:50:07--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2022-09-22 01:50:07--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2022-09-22 01:50:07--  https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


202

We will use the embeddings with 50 dimensions contained in `glove.6B.50d.txt`

In [11]:
from torch import nn
from torch.autograd import Variable
import torch
import numpy as np

### Helper Function: create Glove dictionary
`load_glove()` reads the GloVe embeddings text file line by line and creates a dictionary mapping words to vectors. For `glove.6B.50d.txt` this dictionary has 400k words, each mapped to a 50-dimensional vector. We can use this to check the values of our PyTorch embedding layer.

In [12]:
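## Note to self: this part is not fully clear yet; ask about it in the study session
def load_glove(path):
    """
    creates a dictionary mapping words to vectors from a file in glove format.
    """
    glove = {}
    with open(path, encoding="utf-8") as f:
        for line in f:  # each line has the form: word v1 v2 ... vN
            values = line.split()
            word = values[0]  # the first field is the word
            vector = np.array(values[1:], dtype='float32')  # the remaining fields are the vector
            glove[word] = vector
    return glove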

### Helper Function: get embeddings for a sentence
`load_glove_embeddings` takes a dictionary mapping words to indexes (must be computed from your training corpus) and returns a matrix of embeddings which we can use to initialize a PyTorch embedding layer. Words that do not appear in the GloVe file keep an all-zero row.

In [13]:
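## Note to self: this part is not fully clear yet; ask about it in the study session
def load_glove_embeddings(path, word2idx, embedding_dim=50):
    # rows for words that are missing from the GloVe file stay all-zero
    embeddings = np.zeros((len(word2idx), embedding_dim))
    with open(path, encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            index = word2idx.get(word)  # None if the word is not in our vocabulary
            if index is not None:  # a plain `if index:` would wrongly skip the word at index 0
                vector = np.array(values[1:], dtype='float32')
                embeddings[index] = vector
    return torch.from_numpy(embeddings).float()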
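
One subtlety in a lookup like `word2idx.get(word)`: index 0 is falsy in Python, so a truthiness test (`if index:`) behaves differently from an explicit `if index is not None:`. The sketch below shows the difference on a tiny in-memory file in GloVe text format. The two vectors and the three-word vocabulary are made up for illustration.

```python
import io
import numpy as np

# Two made-up 3-dimensional vectors in GloVe text format (not real GloVe values)
fake_glove = io.StringIO("cow 0.1 0.2 0.3\nmoon 0.4 0.5 0.6\n")
word2idx = {"cow": 0, "moon": 1, "zebra": 2}   # "zebra" is absent from the file

truthy = np.zeros((len(word2idx), 3), dtype="float32")
explicit = np.zeros((len(word2idx), 3), dtype="float32")
for line in fake_glove:
    word, *values = line.split()
    index = word2idx.get(word)
    vector = np.array(values, dtype="float32")
    if index:                  # truthiness test: index 0 is skipped
        truthy[index] = vector
    if index is not None:      # explicit test: index 0 is filled
        explicit[index] = vector

print(truthy[0])     # stays all-zero although "cow" is in the file
print(explicit[0])   # holds the vector for "cow"
```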

In [14]:
glove_path = 'glove.6B.50d.txt'
%time glove = load_glove(glove_path)  ## %time times this single statement; %%time times a whole cell (both report CPU and wall time)

CPU times: user 4.1 s, sys: 318 ms, total: 4.42 s
Wall time: 4.44 s


In [15]:
len(glove)

400000

### Toy Example

In [16]:
corpus = 'the cow jumped over the moon.'
vocab = set(corpus.split()) # compute vocab: 6 tokens, 5 unique ('the' appears twice; 'moon.' keeps its full stop)
word2idx = {word: idx for idx, word in enumerate(vocab)} # create word index (set order is arbitrary)

In [17]:
vocab

{'cow', 'jumped', 'moon.', 'over', 'the'}

In [18]:
word2idx

{'moon.': 0, 'over': 1, 'cow': 2, 'jumped': 3, 'the': 4}

In [19]:
toy_embeddings = load_glove_embeddings(glove_path, word2idx)
# 5 words x 50 embedding dimensions
toy_embeddings.shape

torch.Size([5, 50])

In [20]:
nn.Parameter(toy_embeddings)

Parameter containing:
tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,
          0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00,  0.0000e+00],
        [ 1.2972e-01,  8.8073e-02,  2.4375e-01,  7.8102e-02, -1.2783e-01,
          2.7831e-01, -4.8693e-01,  1.9649e-01, -3.9558e-01, -2.8362e-01,
         -4.7425e-01, -5.9317e-01, -5.8804e-01, -3.1702e-01,  4.9593e-01,
          8.759

In [21]:
toy_embeddings.size()

torch.Size([5, 50])

In [22]:
toy_embedding = Embedding(toy_embeddings.size(0), toy_embeddings.size(1)) # create a 5 x 50 embedding layer
toy_embedding.weight = nn.Parameter(toy_embeddings) # copy in the GloVe-based embedding weights
toy_embedding 

Embedding(5, 50)

Get the embedding for "cow"

In [23]:
idx = word2idx['cow']
toy_embedding(Variable(torch.LongTensor([idx])))

# Variable and Tensor were merged in 2018 (PyTorch 0.4), so everything is now a Tensor.
# Previously, Variable was responsible for automatic gradient computation,
# but Tensor now provides that functionality itself.
# Variable can still be called, but it simply returns a Tensor, so there is no need to use it.
# Whether gradients are tracked is controlled by the Tensor's requires_grad flag.

tensor([[ 0.6125, -0.4817, -0.7420, -0.5520, -0.0076,  1.6101, -0.8856, -0.8198,
          1.5144, -0.2280,  0.5537, -0.1839,  0.7049, -0.3693,  1.0668,  1.1077,
          0.1971,  0.2473, -0.6840,  0.5475, -0.0383, -0.7899,  0.6113,  0.3147,
          0.5021, -1.6535, -0.4278,  1.0404,  0.2943, -0.3689,  1.3148, -0.1844,
          0.0928,  0.7757, -0.5484, -0.1464,  0.5113,  0.0472,  0.4178, -0.1832,
         -0.4420, -0.2524, -0.3359,  0.3096,  1.9192,  0.3396, -0.2734, -0.0132,
          0.6497, -0.8586]], grad_fn=<EmbeddingBackward0>)

Check that it is the same as Glove's vector

In [24]:
glove.get('cow') # look up the GloVe embedding of the word 'cow' (dict.get returns None if the word is missing)

array([ 0.61253 , -0.48167 , -0.74199 , -0.55203 , -0.007596,  1.6101  ,
       -0.88565 , -0.81981 ,  1.5144  , -0.22804 ,  0.55367 , -0.18392 ,
        0.7049  , -0.36931 ,  1.0668  ,  1.1077  ,  0.19709 ,  0.24731 ,
       -0.68395 ,  0.5475  , -0.038255, -0.78989 ,  0.61131 ,  0.31473 ,
        0.50215 , -1.6535  , -0.42782 ,  1.0404  ,  0.29429 , -0.36889 ,
        1.3148  , -0.18443 ,  0.092753,  0.77572 , -0.54845 , -0.14645 ,
        0.51128 ,  0.047248,  0.41781 , -0.18324 , -0.44197 , -0.25237 ,
       -0.3359  ,  0.3096  ,  1.9192  ,  0.3396  , -0.27341 , -0.01316 ,
        0.64974 , -0.85857 ], dtype=float32)

# 2. Working with `torchtext` Tokenizer
First: 
  - Install compatible versions of `torch` and `torchtext`
  - Install `torchdata`
  - Restart Runtime if required (`Runtime-->Restart runtime` or click on the RESTART RUNTIME button in the code cell)

In [25]:
!pip install torch==1.11.0 torchtext==0.12.0 torchdata

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting torch==1.11.0
  Downloading torch-1.11.0-cp37-cp37m-manylinux1_x86_64.whl (750.6 MB)
Collecting torchtext==0.12.0
  Downloading torchtext-0.12.0-cp37-cp37m-manylinux1_x86_64.whl (10.4 MB)
Collecting torchdata
  Downloading torchdata-0.4.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.4 MB)
Collecting urllib3>=1.25
  Downloading urllib3-1.26.12-py2.py3-none-any.whl (140 kB)
Collecting torchdata
  Downloading torchdata-0.4.0-cp37-cp37m-manylinux2014_x86_64.whl (4.4 MB)
  Downloading torchdata-0.3.0-py3-none-any.whl (47 kB)

View the versions installed (**AFTER** restarting runtime)

In [1]:
!pip3 show torch # should be 1.11.0

Name: torch
Version: 1.11.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /usr/local/lib/python3.7/dist-packages
Requires: typing-extensions
Required-by: torchvision, torchtext, torchdata, torchaudio, fastai


In [2]:
!pip3 show torchtext # should be 0.12.0

Name: torchtext
Version: 0.12.0
Summary: Text utilities and datasets for PyTorch
Home-page: https://github.com/pytorch/text
Author: PyTorch core devs and James Bradbury
Author-email: jekbradbury@gmail.com
License: BSD
Location: /usr/local/lib/python3.7/dist-packages
Requires: numpy, tqdm, requests, torch
Required-by: 


## 2.1. Small corpus of manually created sentences
- `torchtext` has several tokenisers, but "basic english" should be enough for this simple example
- The vocabulary (`vocab`) is a dictionary that maps each token (word) to its corresponding index

In [3]:
import torch

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

sample_text = ["The quick brown FOX jumps over the lazy dog.",
               "A wizard's job: is to vex chumps quickly in fog?",
               "Brown jars prevented the mixture from freezing too quickly...",
               "How vexingly quick daft zebras jump!",
               "When 'zombies' arrive, quickly fax Judge Pat ;-)"]

tokenizer = get_tokenizer('basic_english')

def yield_tokens():
  for s in sample_text:
    tokens = tokenizer(s)
    print(tokens)
    yield tokens

# yield vs. return
# return hands back a finished object such as a list; a function that uses yield returns a generator.
# A generator produces its items one at a time, on demand, instead of building them all in advance.
# Iterables such as lists are convenient because we can go through them as often as we like, but
# they keep every value in memory, which is a poor fit for large data.
# Generators are iterators, but they do not keep all values in memory:
# each value is generated on the fly when it is requested, so a generator
# can only be iterated over once.

token_generator = yield_tokens() # can now iterate through token_generator

vocab = build_vocab_from_iterator(token_generator, specials=["<unk>"]) # add the special token "<unk>" (unknown)
vocab.set_default_index(vocab["<unk>"]) # this index is returned when an out-of-vocabulary (OOV) token is queried


['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
['a', 'wizard', "'", 's', 'job', 'is', 'to', 'vex', 'chumps', 'quickly', 'in', 'fog', '?']
['brown', 'jars', 'prevented', 'the', 'mixture', 'from', 'freezing', 'too', 'quickly', '.', '.', '.']
['how', 'vexingly', 'quick', 'daft', 'zebras', 'jump', '!']
['when', "'", 'zombies', "'", 'arrive', ',', 'quickly', 'fax', 'judge', 'pat', '-', ')']
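
The difference between a list and a generator can be seen in a few lines of plain Python. This is a minimal sketch, and `count_up` is a made-up helper: a generator is exhausted after one pass, whereas a list can be traversed again.

```python
def count_up(n):
    # Generator function: values are produced lazily, one per request
    for i in range(n):
        yield i

gen = count_up(3)
first_pass = list(gen)    # consumes the generator
second_pass = list(gen)   # nothing left to yield

as_list = list(range(3))  # a list can be traversed repeatedly
list_twice = [list(as_list), list(as_list)]

print(first_pass, second_pass, list_twice)
```

This is why `token_generator` above can be handed to `build_vocab_from_iterator()` only once; a second vocab would need a fresh call to `yield_tokens()`.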


#### Understanding the `build_vocab_from_iterator` function
`torchtext.vocab.build_vocab_from_iterator(iterator: Iterable, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True, max_tokens: Optional[int] = None) → torchtext.vocab.vocab.Vocab`

Builds a `Vocab` from an iterator and returns the `Vocab` object.

Parameters:

- `iterator` – Iterator used to build Vocab. Must yield list or iterator of tokens.
- `min_freq` – The minimum frequency needed to include a token in the vocabulary.
- `specials` – Special symbols to add. The order of supplied tokens will be preserved.
- `special_first` – Indicates whether to insert symbols at the beginning or at the end.
- `max_tokens` – If provided, creates the vocab from the max_tokens - len(specials) most frequent tokens.

See the [torchtext documentation](https://pytorch.org/text/stable/vocab.html#torchtext.vocab.build_vocab_from_iterator) for details.
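
To get a feel for what the function produces, here is a plain-Python sketch that needs no `torchtext`. It is not the library's implementation: the ordering rule (specials first, then tokens by descending frequency, ties broken alphabetically) is an assumption inferred from the indices this notebook prints, and `build_vocab_sketch` is a made-up name. The token lists are the ones printed by the cell above.

```python
from collections import Counter

token_lists = [
    ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.'],
    ['a', 'wizard', "'", 's', 'job', 'is', 'to', 'vex', 'chumps', 'quickly', 'in', 'fog', '?'],
    ['brown', 'jars', 'prevented', 'the', 'mixture', 'from', 'freezing', 'too', 'quickly', '.', '.', '.'],
    ['how', 'vexingly', 'quick', 'daft', 'zebras', 'jump', '!'],
    ['when', "'", 'zombies', "'", 'arrive', ',', 'quickly', 'fax', 'judge', 'pat', '-', ')'],
]

def build_vocab_sketch(token_lists, specials=("<unk>",), min_freq=1):
    # Count every token across all sentences
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Assumed ordering: most frequent first, alphabetical among equal counts
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, n in ordered
                             if n >= min_freq and tok not in specials]
    return {tok: idx for idx, tok in enumerate(itos)}

stoi = build_vocab_sketch(token_lists)
print(stoi["<unk>"], stoi["."], stoi["quickly"], stoi["jump"])
```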




In [4]:
# Q4. Based on the vocab, state three things that this tokenizer (torchtext "basic_english") does
# apart from splitting by space
# A4. The basic_english tokenizer normalises the text in the following steps:
# 1. Convert the source text to lower case
# 2. Add a space before and after single quote, period, comma, left paren, right paren, exclamation mark and question mark
# 3. Replace colon, semicolon and <br /> with a space
# 4. Remove double quotes
# 5. Split on whitespace
# Reference: https://jamesmccaffrey.wordpress.com/2021/06/23/tokenizing-text-using-the-basic-english-algorithm/
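
The steps in A4 can be sketched with regular expressions. This is an approximation written from the list above, not the `torchtext` source, and `basic_english_sketch` is a made-up name. On two of the sample sentences it reproduces the tokens printed earlier.

```python
import re

# (pattern, replacement) pairs following the rules listed in A4
_RULES = [
    (re.compile(r"'"), " ' "),
    (re.compile(r'"'), ""),
    (re.compile(r"\."), " . "),
    (re.compile(r"<br />"), " "),
    (re.compile(r","), " , "),
    (re.compile(r"\("), " ( "),
    (re.compile(r"\)"), " ) "),
    (re.compile(r"!"), " ! "),
    (re.compile(r"\?"), " ? "),
    (re.compile(r";"), " "),
    (re.compile(r":"), " "),
]

def basic_english_sketch(text):
    text = text.lower()                          # step 1: lower-case
    for pattern, replacement in _RULES:          # steps 2-4: pad, replace, remove
        text = pattern.sub(replacement, text)
    return text.split()                          # step 5: split on whitespace

print(basic_english_sketch("A wizard's job: is to vex chumps quickly in fog?"))
print(basic_english_sketch("When 'zombies' arrive, quickly fax Judge Pat ;-)"))
```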


Convert strings in vocab to indexes

In [5]:
# String to index (stoi) # Dictionary mapping tokens to indices.

vocab.get_stoi()

# Index to string (itos)
# vocab.get_itos()

{'zombies': 43,
 'zebras': 42,
 'wizard': 41,
 'when': 40,
 'vexingly': 39,
 'vex': 38,
 'too': 37,
 'to': 36,
 'jars': 25,
 'a': 12,
 'chumps': 14,
 'quick': 6,
 'arrive': 13,
 'is': 24,
 'job': 26,
 'freezing': 20,
 'pat': 33,
 'fox': 19,
 '!': 7,
 'judge': 27,
 '<unk>': 0,
 "'": 2,
 'fog': 18,
 ',': 9,
 'mixture': 31,
 '-': 10,
 's': 35,
 'quickly': 3,
 ')': 8,
 '.': 1,
 'brown': 5,
 'how': 22,
 'jump': 28,
 'the': 4,
 'dog': 16,
 'fax': 17,
 'in': 23,
 'daft': 15,
 'from': 21,
 'prevented': 34,
 '?': 11,
 'jumps': 29,
 'lazy': 30,
 'over': 32}

Utilities on Individual Tokens and Indexes

In [6]:
vocab['jump']

28

Look up the indices of several tokens at once by passing them in a list

In [7]:
vocab.lookup_indices(["wizard", "fox"])

[41, 19]

Look up the tokens for a list of indices

In [8]:
vocab.lookup_tokens([0,5,13])

['<unk>', 'brown', 'arrive']

Tokenise other sentences using the vocab we created
- Prepare the text processing pipeline with the tokenizer and vocabulary
- The text pipeline converts a text string into a list of integers based on the lookup table defined in the vocabulary. 
- The label pipeline converts the label into integers.

In [9]:
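text_pipeline = lambda x: vocab(tokenizer(x))  # tokenise the string, then map each token to its vocab index
label_pipeline = lambda x: int(x) - 1  # convert a label string such as "1" to a zero-based integer (0)
## Note to self: this part is not fully clear yet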


In [10]:
text_pipeline('here is a warm example')

[0, 24, 12, 0, 0]

In [11]:
# Q5. Explain what each integer in the list returned by text_pipeline means.
# A5. Each integer is the vocab index of the corresponding token.
# 0 is the index of "<unk>": the word ('here', 'warm', 'example') was not in the vocab built from the sample sentences
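
The same lookup can be imitated with a plain dictionary. This sketch assumes only the three vocab entries needed for the example (taken from the `get_stoi()` output above) and uses a simple lower-case-and-split in place of the real tokenizer.

```python
# Excerpt of the vocab printed by vocab.get_stoi() above
stoi_excerpt = {"<unk>": 0, "a": 12, "is": 24}

def text_pipeline_sketch(text):
    # Unknown tokens fall back to the index of "<unk>", like set_default_index()
    return [stoi_excerpt.get(tok, stoi_excerpt["<unk>"]) for tok in text.lower().split()]

print(text_pipeline_sketch("here is a warm example"))
```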

## 2.2. Reading movie reviews from file
- Adapted from [HERE](https://jamesmccaffrey.wordpress.com/2021/06/18/serving-up-pytorch-training-data-using-the-dataloader-collate_fn-parameter/).
- The file contains the review score (0 for negative, 1 for positive) and the actual review in each line, separated by a comma.


In [12]:
# Download the small text file containing the movie reviews
!wget 'https://docs.google.com/uc?export=download&id=1QjyaIEL4H4opnPh9NhLcfU6Ws8qGC2kR' -O reviews.txt


--2022-09-22 01:56:50--  https://docs.google.com/uc?export=download&id=1QjyaIEL4H4opnPh9NhLcfU6Ws8qGC2kR
Resolving docs.google.com (docs.google.com)... 173.194.197.100, 173.194.197.139, 173.194.197.101, ...
Connecting to docs.google.com (docs.google.com)|173.194.197.100|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-14-a0-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/0fv2elvtnbrf7avuvk6vaeqd058gv9r8/1663811775000/18412391637135455491/*/1QjyaIEL4H4opnPh9NhLcfU6Ws8qGC2kR?e=download&uuid=1b1de97c-90c9-4065-a88c-311023334253 [following]
--2022-09-22 01:56:50--  https://doc-14-a0-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/0fv2elvtnbrf7avuvk6vaeqd058gv9r8/1663811775000/18412391637135455491/*/1QjyaIEL4H4opnPh9NhLcfU6Ws8qGC2kR?e=download&uuid=1b1de97c-90c9-4065-a88c-311023334253
Resolving doc-14-a0-docs.googleusercontent.com (doc-14-a0-docs.googleusercontent.com)... 173.194.198.132, 2607:

In [13]:
# Q6. Locate the name of this file and where it is saved
# A6. /content/reviews.txt

# To do: verify this

In [14]:
import torch
import torchtext as tt
import collections

device = torch.device("cpu")

# data file (reviews.txt) looks like:
# 0, This was a BAD movie.
# 1, I liked this film! Highly recommended.
# 0, Don't waste your time - a real dud
# 1, Good film. Great acting.
# 0, This was a waste of talent.
# 1, Great movie. Good acting.
# ...

Make Vocab
- Need to tokenise on the review words
- Need to format the text in the file
- Tokenise the review text using `get_tokenizer()`  and build the vocab using `build_vocab_from_iterator()`.
- We will use a [`collections.Counter()`](https://docs.python.org/3/library/collections.html#collections.Counter) object to collect the tokens of the tokenised reviews, which are then passed to `build_vocab_from_iterator()`.

In [15]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

def yield_tok(counter_obj, tokenizer):
  # Iterating over a Counter yields each unique token once, so every token
  # reaches build_vocab_from_iterator() with frequency 1 (the counts are not used).
  for c in counter_obj:
    yield tokenizer(c)

def make_vocab(fn):
  tokeniser = get_tokenizer("basic_english")  # local
  counter_obj = collections.Counter()
  with open(fn, "r") as f: # the file is closed automatically when the block ends
    for line in f:
      line = line.strip() # remove leading and trailing whitespace
      txt = line.split(",", 1)[1] # split on the first "," only: index 0 is the label, index 1 is the review
      split_and_lowered = tokeniser(txt)
      counter_obj.update(split_and_lowered)
  token_generator = yield_tok(counter_obj, tokeniser) # build_vocab_from_iterator() requires an iterable of token lists

  vocab = build_vocab_from_iterator(token_generator, specials=["<unk>"], min_freq=1)
  vocab.set_default_index(vocab["<unk>"])

  print(f"Vocab: {vocab.get_stoi()}")
  return vocab

In [16]:
vocab = make_vocab('reviews.txt')
vocab

Vocab: {'your': 40, 'would': 39, 'dud': 12, 'good': 15, 'for': 14, 'don': 11, 'highly': 17, 'recommended': 28, 'penny': 26, 'spend': 30, 'not': 23, 'waste': 38, 'afternoon': 6, 'bad': 10, '!': 1, '<unk>': 0, 'real': 27, 'awful': 9, "'": 2, '-': 3, 'time': 36, 'film': 13, 'all': 7, '.': 4, 't': 32, 'at': 8, 'great': 16, 'movie': 21, 'i': 18, 'talent': 33, 'just': 19, 'was': 37, 'liked': 20, 'this': 35, 'of': 24, 'a': 5, 'sunday': 31, 'on': 25, 'nap': 22, 'single': 29, 'terrible': 34}


Vocab()

In [17]:
vocab.lookup_tokens([1])

['!']

In [None]:
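# Q7. Why is the item at index 1 returned by the line split?
# A7. Each line has the form "label, review". After splitting on the comma, index 0 is the label
#     and index 1 is the review text, which is the part that needs to be tokenised.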
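
Each line of the reviews file stores `label, review`, so index 0 of the split is the label and index 1 is the review. The sketch below uses a made-up line (it is not from `reviews.txt`) to show how the split behaves when the review itself contains a comma.

```python
line = "1, Funny, warm and well acted\n"   # hypothetical review containing a comma

# Splitting on every comma cuts the review short
naive_review = line.strip().split(",")[1]

# Splitting on the first comma only keeps the whole review
label, review = line.strip().split(",", 1)

print(repr(naive_review))
print(repr(label), repr(review))
```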

In [18]:
# globals are needed for the collate_fn() function
g_tokenizer = get_tokenizer("basic_english")  # global tokenizer
g_vocab = make_vocab("./reviews.txt")  # global vocabulary


Vocab: {'your': 40, 'would': 39, 'dud': 12, 'good': 15, 'for': 14, 'don': 11, 'highly': 17, 'recommended': 28, 'penny': 26, 'spend': 30, 'not': 23, 'waste': 38, 'afternoon': 6, 'bad': 10, '!': 1, '<unk>': 0, 'real': 27, 'awful': 9, "'": 2, '-': 3, 'time': 36, 'film': 13, 'all': 7, '.': 4, 't': 32, 'at': 8, 'great': 16, 'movie': 21, 'i': 18, 'talent': 33, 'just': 19, 'was': 37, 'liked': 20, 'this': 35, 'of': 24, 'a': 5, 'sunday': 31, 'on': 25, 'nap': 22, 'single': 29, 'terrible': 34}


In [23]:
# Q8. How would you get a test string's tokens using g_tokenizer and g_vocab? (CODE)
test = g_tokenizer("That movie is awesome")
print(test)
print(g_vocab.lookup_indices(test))

['that', 'movie', 'is', 'awesome']
[0, 21, 0, 0]


Prepare Data for DataLoader
- Create list of tuples from each line in file

In [24]:
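def make_data_list(fname):
  # get all data into one big list of (label, review) tuples
  # result will be passed to DataLoader
  result = []
  with open(fname, "r") as f:  # the file is closed automatically
    for line in f:
      line = line.strip()
      parts = line.split(",", 1)  # split on the first comma only, so commas inside a review are kept
      pair = (parts[0], parts[1]) # (label, review) tuple; avoids shadowing the built-in name `tuple`
      result.append(pair)
  return result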

Custom Collate Function
- Convert the labels to int and then to a tensor
- Tokenise the review text and convert the token indices to a tensor
- Compute the offset at which each review starts in the concatenated tensor

In [25]:
# rearrange a batch and compute offsets
# needs a global vocab and tokenizer
def collate_data(batch):
  label_list, review_list, offset_list = [], [], [0]  # offsets start with a leading 0
  # print("\n")
  for (label, review) in batch:  # each batch item is a (label, review) tuple
    label_list.append(int(label))  # string to int, then append to a list
    r_idxs = [g_vocab[tok] for tok in g_tokenizer(review)]  # list of token indices
    r_idxs = torch.tensor(r_idxs, dtype=torch.int64)  # to tensor
    review_list.append(r_idxs)
    print(f'Review: {review}')
    offset_list.append(len(r_idxs)) # store this review's token count; cumsum below turns counts into start offsets

  print(f"Label list {label_list}")
  label_list = torch.tensor(label_list, dtype=torch.int64).to(device)  # convert to tensor
  print(f'Offsets before cumsum: {offset_list}')
  offset_list = torch.tensor(offset_list[:-1]).cumsum(dim=0).to(device)    # drop the last count, then cumulative sum = start index of each review
  print(f'Offsets AFTER cumsum: {offset_list}')
  review_list = torch.cat(review_list).to(device)  # concatenate all reviews in the batch into one 1-D tensor

  return (label_list, review_list, offset_list)

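Quick illustration of [`cumsum()`](https://numpy.org/doc/stable/reference/generated/numpy.cumsum.html) (the link is to NumPy's documentation; the PyTorch tensor method used here computes the same cumulative sum along the given `dim`)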

In [26]:
x = torch.arange(0, 6).view(2, 3)
print(x)
print(f'Cumulative sum in 1st dim: \n{x.cumsum(dim=0)}')
print(f'Cumulative sum in 2nd dim: \n{x.cumsum(dim=1)}')

tensor([[0, 1, 2],
        [3, 4, 5]])
Cumulative sum in 1st dim: 
tensor([[0, 1, 2],
        [3, 5, 7]])
Cumulative sum in 2nd dim: 
tensor([[ 0,  1,  3],
        [ 3,  7, 12]])


### Call all the functions
- Prepare data from file
- Feed data to DataLoader
- Print the batches


In [27]:
batch_size = 3
print("Begin DataLoader demo using text from file")

print("\nLoading train data into tuples: ")
data_list = make_data_list("./reviews.txt")
print(data_list)

print("\nCreating DataLoader from tuples ")
train_loader = torch.utils.data.DataLoader(data_list, \
  batch_size=batch_size, shuffle=False, collate_fn=collate_data)

print("\nWorking with batches (size = 3): ")
for b_ix, (labels, reviews, offsets) in enumerate(train_loader):
  print("==========")
  print(f"BATCH  : {b_ix}")
  print("Labels : ", end=""); print(labels)
  print("Reviews: ", end=""); print(reviews)
  print("Offsets: ", end=""); print(offsets)
  print("====================================\n\n")

print("\nEnd demo")

Begin DataLoader demo using text from file

Loading train data into tuples: 
[('0', ' This was a BAD movie.'), ('1', ' I liked this film! Highly recommended.'), ('0', ' Just awful'), ('1', ' Good film'), ('0', " Don't waste your time - A real dud"), ('0', ' Terrible!'), ('1', ' Great movie.'), ('0', ' This was a waste of talent.'), ('1', ' Not bad at all.'), ('0', ' Would not spend a single penny on this'), ('0', ' Not bad for a Sunday afternoon nap')]

Creating DataLoader from tuples 

Working with batches (size = 3): 
Review:  This was a BAD movie.
Review:  I liked this film! Highly recommended.
Review:  Just awful
Label list [0, 1, 0]
Offsets before cumsum: [0, 6, 8, 2]
Offsets AFTER cumsum: tensor([ 0,  6, 14])
BATCH  : 0
Labels : tensor([0, 1, 0])
Reviews: tensor([35, 37,  5, 10, 21,  4, 18, 20, 35, 13,  1, 17, 28,  4, 19,  9])
Offsets: tensor([ 0,  6, 14])


Review:  Good film
Review:  Don't waste your time - A real dud
Review:  Terrible!
Label list [1, 0, 0]
Offsets before cumsu

In [None]:
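# Q9. Based on the output, explain what this line of code does in collate_fn():
# offset_list = torch.tensor(offset_list[:-1]).cumsum(dim=0).to(device)

# A9. Before this line, offset_list holds a leading 0 followed by the token count of each review, e.g. [0, 6, 8, 2].
# Dropping the last count ([:-1]) and taking the cumulative sum gives the index at which each review
# starts in the concatenated review tensor, e.g. [0, 6, 14].
# .to(device) then moves the offsets tensor to the chosen device (the CPU here).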
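
The arithmetic behind the offsets can be replayed in plain Python with the numbers printed for batch 0 above. This is a sketch for illustration only: it uses `itertools.accumulate` in place of the tensor `cumsum`, and the variable names are made up.

```python
from itertools import accumulate

# Values printed for batch 0 above
lengths = [6, 8, 2]   # token count of each review
flat_reviews = [35, 37, 5, 10, 21, 4, 18, 20, 35, 13, 1, 17, 28, 4, 19, 9]

# Leading 0, then the running total of all counts except the last one
offsets = [0] + list(accumulate(lengths))[:-1]

# The offsets are enough to cut the flat tensor back into separate reviews
bounds = offsets + [len(flat_reviews)]
reviews = [flat_reviews[start:end] for start, end in zip(bounds, bounds[1:])]

print(offsets)
print(reviews)
```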