## CS 224N Lecture 3: Word Window Classification

### PyTorch Exploration

### Author: Matthew Lamm

In [None]:
import pprint
import torch
import torch.nn as nn
pp = pprint.PrettyPrinter()

## Our Data

The task at hand is to assign a label of 1 to words in a sentence that correspond to a LOCATION, and a label of 0 to everything else.

In this simplified example, we only ever see spans of length 1.

In [None]:
# Lowercase each sentence, then split it on whitespace
train_sents = [s.lower().split() for s in ["we 'll always have Paris",
                                           "I live in Germany",
                                           "He comes from Denmark",
                                           "The capital of Denmark is Copenhagen"]]
train_sents

[['we', "'ll", 'always', 'have', 'paris'],
 ['i', 'live', 'in', 'germany'],
 ['he', 'comes', 'from', 'denmark'],
 ['the', 'capital', 'of', 'denmark', 'is', 'copenhagen']]

In [None]:
train_labels = [[0, 0, 0, 0, 1],
                [0, 0, 0, 1],
                [0, 0, 0, 1],
                [0, 0, 0, 1, 0, 1]]

In [None]:
assert all([len(train_sents[i]) == len(train_labels[i]) for i in range(len(train_sents))])

In [None]:
test_sents = [s.lower().split() for s in ["She comes from Paris"]]
test_labels = [[0, 0, 0, 1]]

assert all([len(test_sents[i]) == len(test_labels[i]) for i in range(len(test_sents))])

## Creating a dataset of batched tensors.

PyTorch (like other deep learning frameworks) is optimized to work on __tensors__, which can be thought of as a generalization of vectors and matrices with arbitrarily large rank.
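To make "rank" concrete, here is a minimal sketch that builds tensors of rank 0 through 3 and inspects their number of dimensions and sizes. The variable names are illustrative only.

```python
import torch

scalar = torch.tensor(3.0)               # rank 0: a single number
vector = torch.tensor([1.0, 2.0, 3.0])   # rank 1
matrix = torch.zeros(2, 3)               # rank 2
batch = torch.zeros(4, 2, 3)             # rank 3: e.g. a batch of four 2x3 matrices

ranks = [t.dim() for t in (scalar, vector, matrix, batch)]
print(ranks)          # [0, 1, 2, 3]
print(batch.size())   # torch.Size([4, 2, 3])
```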

Here we'll go over how to translate data to a list of vocabulary indices, and how to construct *batch tensors* out of the data for easy input to our model.

We'll use the *torch.utils.data.DataLoader* object to handle batching and iteration.

### Converting tokenized sentence lists to vocabulary indices.

Let's assume we have the following vocabulary:

In [None]:
id_2_word = ["<pad>", "<unk>", "we", "always", "have", "paris",
              "i", "live", "in", "germany",
              "he", "comes", "from", "denmark",
              "the", "of", "is", "copenhagen"]
id_2_word

['<pad>',
 '<unk>',
 'we',
 'always',
 'have',
 'paris',
 'i',
 'live',
 'in',
 'germany',
 'he',
 'comes',
 'from',
 'denmark',
 'the',
 'of',
 'is',
 'copenhagen']

In [None]:
instance = train_sents[0]
print(instance)

['we', "'ll", 'always', 'have', 'paris']


In [None]:
word_2_id = {w:i for i,w in enumerate(id_2_word)}

In [None]:
def convert_tokens_to_inds(sentence, word_2_id):
    return [word_2_id.get(t, word_2_id["<unk>"]) for t in sentence]

In [None]:
token_inds = convert_tokens_to_inds(instance, word_2_id)
#pp.pprint(token_inds)
token_inds

[2, 1, 3, 4, 5]

Let's convince ourselves that worked:

In [None]:
print([id_2_word[tok_idx] for tok_idx in token_inds])

['we', '<unk>', 'always', 'have', 'paris']


### Padding for windows.

In the word window classifier, for each word in the sentence we want to get the +/- n window around the word, where 0 <= n < len(sentence).

In order for such windows to be defined for words at the beginning and ends of the sentence, we actually want to insert padding around the sentence before converting to indices:

In [None]:
# Pad the sentence with window_size pad tokens on each side
def pad_sentence_for_window(sentence, window_size, pad_token="<pad>"):
    return [pad_token]*window_size + sentence + [pad_token]*window_size 

In [None]:
window_size = 2
instance = pad_sentence_for_window(train_sents[0], window_size)
print(instance)

['<pad>', '<pad>', 'we', "'ll", 'always', 'have', 'paris', '<pad>', '<pad>']
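To see why the padding makes every window well defined, the following pure-Python sketch slices one window per original word out of the padded sentence. The helper `word_windows` is hypothetical and only for illustration; the model later does the same thing on tensors.

```python
def pad_sentence_for_window(sentence, window_size, pad_token="<pad>"):
    return [pad_token] * window_size + sentence + [pad_token] * window_size

def word_windows(sentence, window_size, pad_token="<pad>"):
    """Hypothetical helper: one window of 2*window_size + 1 tokens per original word."""
    padded = pad_sentence_for_window(sentence, window_size, pad_token)
    width = 2 * window_size + 1
    # word j of the original sentence sits at position j + window_size of the
    # padded sentence, so its window starts at position j
    return [padded[j:j + width] for j in range(len(sentence))]

sentence = ["i", "live", "in", "germany"]
windows = word_windows(sentence, window_size=2)
for w in windows:
    print(w)
# ['<pad>', '<pad>', 'i', 'live', 'in']
# ['<pad>', 'i', 'live', 'in', 'germany']
# ['i', 'live', 'in', 'germany', '<pad>']
# ['live', 'in', 'germany', '<pad>', '<pad>']
```

The center token of each window (index `window_size`) is the word being classified.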


Let's make sure this works with our vocabulary:

In [None]:
for sent in train_sents:
    tok_idxs = convert_tokens_to_inds(pad_sentence_for_window(sent, window_size), word_2_id)
    print(sent)
    print("sent len:", len(sent))
    print(tok_idxs)
    print("tok len:", len(tok_idxs))
    print([id_2_word[idx] for idx in tok_idxs])
    print("-"*100)

['we', "'ll", 'always', 'have', 'paris']
sent len: 5
[0, 0, 2, 1, 3, 4, 5, 0, 0]
tok len: 9
['<pad>', '<pad>', 'we', '<unk>', 'always', 'have', 'paris', '<pad>', '<pad>']
----------------------------------------------------------------------------------------------------
['i', 'live', 'in', 'germany']
sent len: 4
[0, 0, 6, 7, 8, 9, 0, 0]
tok len: 8
['<pad>', '<pad>', 'i', 'live', 'in', 'germany', '<pad>', '<pad>']
----------------------------------------------------------------------------------------------------
['he', 'comes', 'from', 'denmark']
sent len: 4
[0, 0, 10, 11, 12, 13, 0, 0]
tok len: 8
['<pad>', '<pad>', 'he', 'comes', 'from', 'denmark', '<pad>', '<pad>']
----------------------------------------------------------------------------------------------------
['the', 'capital', 'of', 'denmark', 'is', 'copenhagen']
sent len: 6
[0, 0, 14, 1, 15, 13, 16, 17, 0, 0]
tok len: 10
['<pad>', '<pad>', 'the', '<unk>', 'of', 'denmark', 'is', 'copenhagen', '<pad>', '<pad>']
----------------

### Batching sentences together with a DataLoader

When we train our model, we rarely update with respect to a single training instance at a time, because a single instance provides a very noisy estimate of the global loss's gradient. We instead construct small *batches* of data, and update parameters for each batch. 

Given some batch size, we want to construct batch tensors out of the word index lists we've just created with our vocab.

For each length B list of inputs, we'll have to:

    (1) Add window padding to sentences in the batch like we just saw.
    (2) Add additional padding so that each sentence in the batch is the same length.
    (3) Make sure our labels are in the desired format.

At the level of the dataset we want:

    (4) Easy shuffling, because shuffling from one training epoch to the next gets rid of 
        pathological batches that are tough to learn from.
    (5) Making sure we shuffle inputs and their labels together!
    
PyTorch provides us with an object *torch.utils.data.DataLoader* that gets us (4) and (5). All that's required of us is to specify a *collate_fn* that tells it how to do (1), (2), and (3). 
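As a minimal sketch of that division of labor, the toy example below uses a hypothetical `toy_collate` function and made-up data in which every input's tokens equal its label, so any input/label mismatch after shuffling would be easy to spot. The `-1` padding value is an arbitrary choice for this illustration.

```python
import torch
from torch.utils.data import DataLoader

# Toy data: every token in an input equals that input's label.
inputs = [[0, 0], [1], [2, 2, 2], [3]]
labels = [0, 1, 2, 3]

def toy_collate(batch):
    """Hypothetical collate_fn: receives a list of (input, label) pairs."""
    xs, ys = zip(*batch)
    max_len = max(len(x) for x in xs)
    # right-pad with -1 so every row in the batch has the same length
    padded = [x + [-1] * (max_len - len(x)) for x in xs]
    return torch.tensor(padded), torch.tensor(ys)

loader = DataLoader(list(zip(inputs, labels)),
                    batch_size=2,
                    shuffle=True,
                    collate_fn=toy_collate)

seen = []
for x, y in loader:
    # after shuffling, the first token of each row still matches its label
    assert torch.equal(x[:, 0], y)
    seen.extend(y.tolist())
print(sorted(seen))  # [0, 1, 2, 3]
```

Because the dataset is a list of (input, label) pairs, the DataLoader shuffles pairs rather than inputs and labels separately.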

In [None]:
train_labels[0]

[0, 0, 0, 0, 1]

In [None]:
l = torch.LongTensor(train_labels[0])
pp.pprint(("raw train label instance", l))
print(l.size())


('raw train label instance', tensor([0, 0, 0, 0, 1]))
torch.Size([5])


In [None]:
one_hots = torch.zeros((2, len(l)))
pp.pprint(("unfilled label instance", one_hots))
print(one_hots.size())

('unfilled label instance',
 tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0.]]))
torch.Size([2, 5])


In [None]:
one_hots[1] = l
pp.pprint(("one-hot labels", one_hots))

('one-hot labels', tensor([[0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1.]]))


In [None]:
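# ~ on a uint8 tensor is a bitwise NOT over the range 0~255,
# so 0 becomes 255 and 1 becomes 254 (not 1 and 0)
~l.byte()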

tensor([255, 255, 255, 255, 254], dtype=torch.uint8)

In [None]:
l_not = ~l.byte()
one_hots[0] = l_not
pp.pprint(("one-hot labels", one_hots))

('one-hot labels',
 tensor([[255., 255., 255., 255., 254.],
        [  0.,   0.,   0.,   0.,   1.]]))
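The 255 and 254 entries in the output above arise because `~` applied to a uint8 tensor is a bitwise NOT in recent PyTorch versions, so the first row is not a valid indicator for class 0. A minimal sketch of two ways to get genuine one-hot rows, assuming the intent is a 0/1 complement of the label vector:

```python
import torch
import torch.nn.functional as F

l = torch.LongTensor([0, 0, 0, 0, 1])

# Option 1: take the logical complement on a boolean tensor
one_hots = torch.zeros((2, len(l)))
one_hots[1] = l
one_hots[0] = (~l.bool()).float()
print(one_hots)
# tensor([[1., 1., 1., 1., 0.],
#         [0., 0., 0., 0., 1.]])

# Option 2: built-in one-hot encoding, shape (len(l), num_classes)
alt = F.one_hot(l, num_classes=2).float()
print(alt.size())  # torch.Size([5, 2])
```

Option 2 directly produces the (sentence length, 2) layout that the collate function below builds column by column.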


In [None]:
from torch.utils.data import DataLoader
from functools import partial

In [None]:
def my_collate(data, window_size, word_2_id):
    """
    For some chunk of sentences and labels
        -add window padding
        -pad for lengths using pad_sequence
        -convert our labels to one-hots
        -return padded inputs, one-hot labels, and lengths
    """
    
    x_s, y_s = zip(*data)
    print("x_s:", x_s) #(['we', "'ll", 'always', 'have', 'paris'], ['the', 'capital', 'of', 'denmark', 'is', 'copenhagen'])
    print("y_s:", y_s) #([0, 0, 0, 0, 1], [0, 0, 0, 1, 0, 1])

    # deal with input sentences as we've seen
    # pad window_size tokens before and after each sentence; sentence lengths still differ at this point
    window_padded = [convert_tokens_to_inds(pad_sentence_for_window(sentence, window_size), word_2_id) 
                                                                                    for sentence in x_s]
    print("window_padded:", window_padded)

    # append zeros to each list of token ids in the batch so that they are all the same length:
    # the longest sentence in the batch sets the length, and shorter sentences are right-padded
    padded = nn.utils.rnn.pad_sequence([torch.LongTensor(t) for t in window_padded], batch_first=True)
    print("batch_padded:", padded)

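    # convert labels to one-hots
    # each label list becomes a (len(y), 2) matrix: column 0 should flag class 0, column 1 flags class 1
    """
    [0, 0, 0, 1] // 4*1
    -> [[255.,   0.], // 4*2: column 0 is ~label on uint8, a bitwise NOT, hence 255/254 rather than 1/0
         [255.,   0.],
         [255.,   0.],
         [254.,   1.]]
    -> [[255.,   0.], // 5*2: padded to the max length in the batch
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [  0.,   0.]]
    """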
    labels = []
    lengths = []
    for y in y_s:
        lengths.append(len(y))
        label = torch.zeros((len(y),2 ))
        true = torch.LongTensor(y) 
        false = ~true.byte()
        label[:, 0] = false
        label[:, 1] = true
        labels.append(label)
        print("y: {} -> label:{}".format(y, label))
    
    # pad the label matrices the same way, up to the longest sentence in the batch
    padded_labels = nn.utils.rnn.pad_sequence(labels, batch_first=True)
    print("padded_labels:", padded_labels)
    
    return padded.long(), padded_labels, torch.LongTensor(lengths)
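The collate function leans on `nn.utils.rnn.pad_sequence` twice: once for token indices and once for label matrices. A small sketch of its behavior on made-up sequences (the values are arbitrary):

```python
import torch
import torch.nn as nn

# 1-D sequences of different lengths are right-padded with zeros
seqs = [torch.LongTensor([5, 6, 7]), torch.LongTensor([8])]
padded = nn.utils.rnn.pad_sequence(seqs, batch_first=True)
print(padded)
# tensor([[5, 6, 7],
#         [8, 0, 0]])

# 2-D label matrices are padded along their first (length) dimension
label_mats = [torch.ones(3, 2), torch.ones(1, 2)]
padded_labels = nn.utils.rnn.pad_sequence(label_mats, batch_first=True)
print(padded_labels.size())  # torch.Size([2, 3, 2])
```

The padded label rows are all zeros, which is what later lets the loss ignore padded positions.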

In [None]:
# Pair each sentence with its label list using zip(), and collect the (sentence, labels) pairs into a list
list(zip(train_sents,train_labels))

[(['we', "'ll", 'always', 'have', 'paris'], [0, 0, 0, 0, 1]),
 (['i', 'live', 'in', 'germany'], [0, 0, 0, 1]),
 (['he', 'comes', 'from', 'denmark'], [0, 0, 0, 1]),
 (['the', 'capital', 'of', 'denmark', 'is', 'copenhagen'], [0, 0, 0, 1, 0, 1])]

In [None]:
# shuffle=True is good practice for train loaders.
# Use functools.partial to construct a partially populated collate function
example_loader = DataLoader(list(zip(train_sents, 
                                    train_labels)), 
                                    batch_size=2, 
                                    shuffle=True, 
                                    collate_fn=partial(my_collate, window_size=2, word_2_id=word_2_id))

In [None]:
for batched_input, batched_labels, batch_lengths in example_loader:
    pp.pprint(("inputs", batched_input, batched_input.size()))
    print("-"*10)
    pp.pprint(("labels", batched_labels, batched_labels.size()))
    print("-"*10)
    pp.pprint(("batch_lengths", batch_lengths))
    print("-"*100)

x_s: (['the', 'capital', 'of', 'denmark', 'is', 'copenhagen'], ['we', "'ll", 'always', 'have', 'paris'])
y_s: ([0, 0, 0, 1, 0, 1], [0, 0, 0, 0, 1])
window_padded: [[0, 0, 14, 1, 15, 13, 16, 17, 0, 0], [0, 0, 2, 1, 3, 4, 5, 0, 0]]
batch_padded: tensor([[ 0,  0, 14,  1, 15, 13, 16, 17,  0,  0],
        [ 0,  0,  2,  1,  3,  4,  5,  0,  0,  0]])
y: [0, 0, 0, 1, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.],
        [255.,   0.],
        [254.,   1.]])
y: [0, 0, 0, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.]])
padded_labels: tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [255.,   0.],
         [254.,   1.]],

        [[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [  0.,   0.]]])
('inputs',
 tensor([[ 0,  0, 14,  1, 15, 13, 16, 17,  0, 

## Modeling

### Thinking through vectorization of word windows.
Before we go ahead and build our model, let's think about the first thing it needs to do to its inputs.

We're passed batches of sentences. For each sentence i in the batch, for each word j in the sentence, we want to construct a single tensor out of the embeddings surrounding word j in the +/- n window.

Thus, the first thing we're going to need is a (B, L, 2N+1) tensor of token indices.

Within a batch:
- sentence index: i
- word index within the sentence: j
- window size: n

A *terrible* but nevertheless informative *iterative* solution looks something like the following: we iterate through the elements of our (dummy) batch, then through the non-padded word positions in each element, and construct a window for each of those positions:

In [None]:
dummy_input = torch.zeros(2, 8).long()
dummy_input

tensor([[0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0]])

In [None]:
torch.arange(1,9) # starts at 1 and stops before the end value (here 9)
torch.arange(1,12)

tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [None]:
# view(n, m): n*m must equal the number of elements in the tensor
torch.arange(1,9).view(2,4)

tensor([[1, 2, 3, 4],
        [5, 6, 7, 8]])

In [None]:
# leave window_size=2 padding positions on each side
dummy_input[:,2:-2] = torch.arange(1,9).view(2,4)
pp.pprint(dummy_input)

tensor([[0, 0, 1, 2, 3, 4, 0, 0],
        [0, 0, 5, 6, 7, 8, 0, 0]])


In [None]:
dummy_output = [[[dummy_input[i, j-2+k].item() for k in range(2*2+1)] 
                                                     for j in range(2, 6)] 
                                                            for i in range(2)] 
dummy_output

[[[0, 0, 1, 2, 3], [0, 1, 2, 3, 4], [1, 2, 3, 4, 0], [2, 3, 4, 0, 0]],
 [[0, 0, 5, 6, 7], [0, 5, 6, 7, 8], [5, 6, 7, 8, 0], [6, 7, 8, 0, 0]]]

In [None]:
# each row is the window centered on one word, with 2 context tokens on each side
# the batch holds 2 elements (each element is a sentence)
dummy_output = torch.LongTensor(dummy_output)
print(dummy_output.size())
pp.pprint(dummy_output)

torch.Size([2, 4, 5])
tensor([[[0, 0, 1, 2, 3],
         [0, 1, 2, 3, 4],
         [1, 2, 3, 4, 0],
         [2, 3, 4, 0, 0]],

        [[0, 0, 5, 6, 7],
         [0, 5, 6, 7, 8],
         [5, 6, 7, 8, 0],
         [6, 7, 8, 0, 0]]])


*Technically* it works: for each element in the batch, for each word in the original sentence (ignoring window padding), we've got the 5 token indices centered at that word. But in practice it will be crazy slow.

Instead, we ideally want to find the right tensor operation in the PyTorch arsenal. Here, that happens to be __Tensor.unfold__.

In [None]:
dummy_input.unfold(1, 2*2+1, 1)

tensor([[[0, 0, 1, 2, 3],
         [0, 1, 2, 3, 4],
         [1, 2, 3, 4, 0],
         [2, 3, 4, 0, 0]],

        [[0, 0, 5, 6, 7],
         [0, 5, 6, 7, 8],
         [5, 6, 7, 8, 0],
         [6, 7, 8, 0, 0]]])
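As a sanity check, the sketch below compares `unfold` against an explicit loop on the same dummy input. The arguments to `unfold` are the dimension to slide over, the window size, and the step; `half_window` is a name introduced here for illustration.

```python
import torch

half_window = 2
x = torch.zeros(2, 8).long()
x[:, half_window:-half_window] = torch.arange(1, 9).view(2, 4)

# slide a window of size 2*half_window + 1 along dimension 1 with step 1
windows = x.unfold(1, 2 * half_window + 1, 1)

# loop-based reference: one window per non-padded position
B, L = x.size()
expected = torch.stack([
    torch.stack([x[i, j - half_window:j + half_window + 1]
                 for j in range(half_window, L - half_window)])
    for i in range(B)
])

assert torch.equal(windows, expected)
print(windows.size())  # torch.Size([2, 4, 5])
```

The second dimension shrinks from L to L - 2*half_window, which is exactly the number of original (non-padding) positions.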

### A model in full.

In PyTorch, we implement models by extending the nn.Module class. Minimally, this requires implementing an *\_\_init\_\_* function and a *forward* function.

In *\_\_init\_\_* we want to store model parameters (weights) and hyperparameters (dimensions).
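A minimal sketch of that pattern, using a hypothetical `TinyClassifier` that is much simpler than the model built below: layers assigned as attributes in `__init__` are registered as parameters automatically, and calling the module runs `forward`.

```python
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Hypothetical minimal module, for illustration only."""

    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.in_dim = in_dim                           # hyperparameter
        self.linear = nn.Linear(in_dim, num_classes)   # parameters (weights and bias)

    def forward(self, x):
        return self.linear(x)

tiny = TinyClassifier(in_dim=4, num_classes=2)
out = tiny(torch.zeros(3, 4))        # calling the module invokes forward
names = [name for name, _ in tiny.named_parameters()]
print(out.size())   # torch.Size([3, 2])
print(names)        # ['linear.weight', 'linear.bias']
```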


In [None]:
#Unit Test
"""
Embedding layer 
-model holds an embedding for each word in our vocab
-sets aside a special index in the embedding matrix for padding vector (of zeros)
-by default, embeddings are parameters (so gradients pass through them)
"""
vocab_size = 18
embed_dim = 25
pad_idx = 0
freeze_embeddings = False
embed_layer = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx) # 18x25 weight matrix; row pad_idx is the all-zero padding vector and receives no gradient updates

if freeze_embeddings:
   embed_layer.weight.requires_grad = False

print("embed_layer:", embed_layer)
print("embed_layer weights:", embed_layer.weight)

embed_layer: Embedding(18, 25, padding_idx=0)
embed_layer weights: Parameter containing:
tensor([[ 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
          0.0000],
        [ 0.1283,  0.4550,  0.5272, -1.3682,  0.5429, -1.9266, -0.0044,  0.0167,
         -0.5833, -1.1719,  0.2619, -2.2598, -0.8888, -2.1736, -1.5393, -1.1363,
         -2.2966, -0.6285, -1.4136,  0.9921, -0.7904, -0.1863, -1.2309, -1.3202,
          0.4553],
        [-0.6184,  0.7419, -2.1573,  0.4889,  1.5273, -1.5545,  1.4465,  0.2914,
          0.8162, -0.6225, -0.7145, -0.4996, -1.3032,  0.6618, -0.5499,  0.4323,
          0.2821, -0.6551,  2.2377, -0.1332, -0.4633,  0.8047, -0.6729,  1.7358,
          0.4042],
        [ 0.5512, -1.0479, -0.8720,  0.6667,  0.5202,  0.2001, -0.3286,  1.6962,
          0.2439,  1.8662, -1.6066, -0.0498,
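The first row of the weight matrix above is all zeros because of `padding_idx`. A small sketch, on a made-up 5-word vocabulary, showing that this row starts at zero and also receives zero gradient, so it stays a zero vector during training:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embed = nn.Embedding(5, 3, padding_idx=0)

# a toy "sentence" that mixes padding (index 0) with real tokens (2 and 4)
tokens = torch.LongTensor([[0, 2, 0, 4]])
embed(tokens).sum().backward()

pad_row_is_zero = bool((embed.weight[0] == 0).all())
pad_grad_is_zero = bool((embed.weight.grad[0] == 0).all())
used_grad_nonzero = bool((embed.weight.grad[2] != 0).all())
print(pad_row_is_zero, pad_grad_is_zero, used_grad_nonzero)  # True True True
```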

In [None]:
#Unit Test
"""
Hidden layer
-we want to map embedded word windows of dim window_size*embed_dim to a hidden layer
 (here window_size is the full window width, 2*half_window+1).
-nn.Sequential allows you to efficiently specify sequentially structured models
    -first the linear transformation is invoked on the embedded word windows
    -next the nonlinear transformation tanh is invoked.
"""
window_size = 5
embed_dim = 25
hidden_dim = 25
linear = nn.Linear(window_size*embed_dim,hidden_dim)
hidden_layer = nn.Sequential(nn.Linear(window_size*embed_dim,hidden_dim),nn.Tanh())
print(linear.weight)
print(linear.weight.size()) # 25x125

Parameter containing:
tensor([[-0.0783,  0.0529,  0.0459,  ..., -0.0060, -0.0758,  0.0245],
        [ 0.0469, -0.0776,  0.0787,  ...,  0.0096, -0.0386, -0.0562],
        [-0.0655, -0.0791,  0.0295,  ..., -0.0405,  0.0273,  0.0667],
        ...,
        [-0.0437,  0.0461,  0.0226,  ...,  0.0835, -0.0285,  0.0582],
        [-0.0269, -0.0240,  0.0064,  ..., -0.0045, -0.0094,  0.0382],
        [ 0.0639, -0.0365, -0.0192,  ...,  0.0361,  0.0224,  0.0671]],
       requires_grad=True)
torch.Size([25, 125])


In [None]:
"""
L:= window-padded sentence length
S:= window_size = 2*half_window+1 = 5

config = {"batch_size": 4, #B
          "half_window": 2, 
          "embed_dim": 25,  #D
          "hidden_dim": 25, #H
          "num_classes": 2,
          "freeze_embeddings": False,
         }
learning_rate = .0002
num_epochs = 10000
model = SoftmaxWordWindowClassifier(config, len(word_2_id))
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
"""

In [None]:
"""
   Let B:= batch_size
      L:= window-padded sentence length
      D:= self.embed_dim
      S:= self.window_size
      H:= self.hidden_dim
"""
class SoftmaxWordWindowClassifier(nn.Module):
    """
    A one-layer, binary word-window classifier.
    """
    def __init__(self, config, vocab_size, pad_idx=0):
        super(SoftmaxWordWindowClassifier, self).__init__()
        """
        Instance variables.
        """
        self.window_size = 2*config["half_window"]+1 # full window width: 5
        self.embed_dim = config["embed_dim"] # 25-dimensional embeddings
        self.hidden_dim = config["hidden_dim"] # 25 hidden units
        self.num_classes = config["num_classes"] # 2-class classification
        self.freeze_embeddings = config["freeze_embeddings"] # False
        
        """
        Embedding layer
        -model holds an embedding for each word in our vocab
        -sets aside a special index in the embedding matrix for padding vector (of zeros)
        -by default, embeddings are parameters (so gradients pass through them)
        """
        self.embed_layer = nn.Embedding(vocab_size, self.embed_dim, padding_idx=pad_idx)
        if self.freeze_embeddings:
            self.embed_layer.weight.requires_grad = False
        
        """
        Hidden layer
        -we want to map embedded word windows of dim self.window_size*self.embed_dim to a hidden layer
         (self.window_size is the full window width, 2*half_window+1).
        -nn.Sequential allows you to efficiently specify sequentially structured models
            -first the linear transformation is invoked on the embedded word windows
            -next the nonlinear transformation tanh is invoked.
        """
        self.hidden_layer = nn.Sequential(nn.Linear(self.window_size*self.embed_dim, 
                                                    self.hidden_dim), 
                                          nn.Tanh())
        
        """
        Output layer
        -we want to map elements of the hidden layer (of size self.hidden_dim) to a number of classes.
        """
        self.output_layer = nn.Linear(self.hidden_dim, self.num_classes)
        
        """
        Softmax
        -The final step of the softmax classifier: mapping final hidden layer to class scores.
        -PyTorch has both logsoftmax and softmax functions (and many others)
        -since our loss is the negative LOG likelihood, we use logsoftmax
        -technically you can take the softmax and then take the log, but PyTorch's implementation
         is optimized to avoid numerical underflow issues.
        """
        self.log_softmax = nn.LogSoftmax(dim=2)
        
    def forward(self, inputs):
        """
        Let B:= batch_size
            L:= window-padded sentence length
            D:= self.embed_dim
            S:= self.window_size
            H:= self.hidden_dim
            
        inputs: a (B, L) tensor of token indices
        """
        B, L = inputs.size()
        
        """
        Reshaping.
        Takes in a (B, L) LongTensor
        Outputs a (B, L~, S) LongTensor
        """
        # First, get our word windows for each word in our input.
        token_windows = inputs.unfold(1, self.window_size, 1)
        _, adjusted_length, _ = token_windows.size()
        
        # Good idea to do internal tensor-size sanity checks, at the least in comments!
        assert token_windows.size() == (B, adjusted_length, self.window_size)
        
        """
        Embedding.
        Takes in a torch.LongTensor of size (B, L~, S) 
        Outputs a (B, L~, S, D) FloatTensor.
        """
        embedded_windows = self.embed_layer(token_windows)
        
        """
        Reshaping.
        Takes in a (B, L~, S, D) FloatTensor.
        Resizes it into a (B, L~, S*D) FloatTensor.
        -1 argument "infers" what the last dimension should be based on leftover axes.
        """
        embedded_windows = embedded_windows.view(B, adjusted_length, -1)
        
        """
        Layer 1.
        Takes in a (B, L~, S*D) FloatTensor.
        Maps it to a (B, L~, H) FloatTensor
        """
        layer_1 = self.hidden_layer(embedded_windows)
        
        """
        Layer 2
        Takes in a (B, L~, H) FloatTensor.
        Maps it to a (B, L~, 2) FloatTensor.
        """
        output = self.output_layer(layer_1)
        
        """
        Softmax.
        Takes in a (B, L~, 2) FloatTensor of unnormalized class scores.
        Outputs a (B, L~, 2) FloatTensor of (log-)normalized class scores.
        """
        output = self.log_softmax(output)
        
        return output
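The shape comments in `forward` can be checked end to end with standalone layers. The sketch below mirrors the same steps outside the class, on random token indices; the dimensions follow the config used in this notebook, and the variable names are illustrative.

```python
import torch
import torch.nn as nn

B, L, half_window, D, H, C = 2, 8, 2, 25, 25, 2
S = 2 * half_window + 1
vocab_size = 18

embed = nn.Embedding(vocab_size, D, padding_idx=0)
hidden = nn.Sequential(nn.Linear(S * D, H), nn.Tanh())
out_layer = nn.Linear(H, C)

inputs = torch.randint(0, vocab_size, (B, L))     # (B, L) token indices
windows = inputs.unfold(1, S, 1)                  # (B, L~, S)
embedded = embed(windows)                         # (B, L~, S, D)
flat = embedded.view(B, windows.size(1), -1)      # (B, L~, S*D)
log_probs = torch.log_softmax(out_layer(hidden(flat)), dim=2)  # (B, L~, C)

print(windows.size(), embedded.size(), flat.size(), log_probs.size())
```

Here L~ = L - 2*half_window = 4, and exponentiating the output gives per-word class probabilities that sum to 1.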

### Training.

Now that we've got a model, we have to train it.

In [None]:
def loss_function(outputs, labels, lengths):
    """Computes negative LL loss on a batch of model predictions."""
    B, L, num_classes = outputs.size()
    num_elems = lengths.sum().float()
        
    # get only the values with non-zero labels
    loss = outputs*labels
    
    # rescale average
    return -loss.sum() / num_elems
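Multiplying log-probabilities by one-hot labels picks out the log-probability of the correct class at each position, and all-zero label rows (padding) contribute nothing. The sketch below checks this against `F.nll_loss`, assuming the labels are genuine 0/1 one-hots; the random scores stand in for model outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
log_probs = torch.log_softmax(torch.randn(1, 4, 2), dim=2)   # (B, L~, 2)
y = torch.LongTensor([[0, 0, 0, 1]])                          # class index per word
one_hot = F.one_hot(y, num_classes=2).float()                 # (B, L~, 2), entries in {0, 1}
lengths = torch.LongTensor([4])

# same computation as loss_function above
masked = -(log_probs * one_hot).sum() / lengths.sum().float()

# reference: mean negative log-likelihood over the 4 words
reference = F.nll_loss(log_probs.view(-1, 2), y.view(-1))

assert torch.allclose(masked, reference)
print(masked.item(), reference.item())
```

With labels that are not 0/1 valued, the product no longer selects a single log-probability per word and the result is not a negative log-likelihood.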

In [None]:
def train_epoch(loss_function, optimizer, model, train_data):
    
    ## For each batch, we must reset the gradients
    ## stored by the model.   
    total_loss = 0
    idx_batch = 0
    # the DataLoader yields the data one batch at a time
    for batch, labels, lengths in train_data: #  train_data: torch.utils.data.DataLoader
        print("batch {}:".format(idx_batch))
        print("batch:", batch)
        print("labels:", labels)
        print("lengths:", lengths)

        # clear gradients
        optimizer.zero_grad()

        # invoke the model on the batch to compute its outputs
        # (modules are in training mode by default)
        outputs = model(batch)
        print("batch outputs:", outputs)

        # compute loss w.r.t. batch
        loss = loss_function(outputs, labels, lengths)
        print("batch loss:", loss)
        
        # pass gradients back, starting on loss value (backpropagation)
        loss.backward()

        # update parameters with the optimizer
        optimizer.step() 

        total_loss += loss.item()
        idx_batch += 1
        print("="*100)
    
    # return the total to keep track of how you did this time around
    return total_loss
    

In [None]:
config = {"batch_size": 4,
          "half_window": 2,
          "embed_dim": 25,
          "hidden_dim": 25,
          "num_classes": 2,
          "freeze_embeddings": False,
         }
learning_rate = .0002
num_epochs = 2  # reduced from 10000

In [None]:
#vocab_size = 18
len(word_2_id)
word_2_id

{'<pad>': 0,
 '<unk>': 1,
 'always': 3,
 'comes': 11,
 'copenhagen': 17,
 'denmark': 13,
 'from': 12,
 'germany': 9,
 'have': 4,
 'he': 10,
 'i': 6,
 'in': 8,
 'is': 16,
 'live': 7,
 'of': 15,
 'paris': 5,
 'the': 14,
 'we': 2}

In [None]:
model = SoftmaxWordWindowClassifier(config, len(word_2_id))

In [None]:
model.parameters() # returns a generator; wrap it in list(...) to inspect the values

generator
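Since `parameters()` only returns a generator, one way to look inside is `named_parameters()`, which yields (name, tensor) pairs. A minimal sketch on a small stand-in model (not the classifier above):

```python
import torch.nn as nn

# stand-in model for illustration; any nn.Module works the same way
toy_model = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 2))

shapes = {name: tuple(p.size()) for name, p in toy_model.named_parameters()}
num_params = sum(p.numel() for p in toy_model.parameters())

for name, shape in shapes.items():
    print(name, shape)
# 0.weight (3, 4)
# 0.bias (3,)
# 2.weight (2, 3)
# 2.bias (2,)
print(num_params)  # 23
```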

In [None]:
# pass the model's parameters to the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
optimizer

SGD (
Parameter Group 0
    dampening: 0
    lr: 0.0002
    momentum: 0
    nesterov: False
    weight_decay: 0
)
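With momentum, dampening, and weight decay all at 0 as printed above, a single SGD step reduces to `p = p - lr * grad`. A small sketch verifying this on a toy parameter (the loss and learning rate are arbitrary choices for the example):

```python
import torch

lr = 0.1
w = torch.tensor([1.0, -2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=lr)

loss = (w ** 2).sum()        # gradient of this loss is 2*w
opt.zero_grad()
loss.backward()

expected = w.detach() - lr * w.grad   # manual update, computed before the step
opt.step()

assert torch.allclose(w.detach(), expected)
print(w.detach())  # tensor([ 0.8000, -1.6000])
```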

In [None]:
train_loader = torch.utils.data.DataLoader(list(zip(train_sents, train_labels)), 
                                           batch_size=2, 
                                           shuffle=True, 
                                           collate_fn=partial(my_collate, window_size=2, word_2_id=word_2_id))

In [None]:
losses = []
for epoch in range(num_epochs):
    print("epoch {}:".format(epoch))
    epoch_loss = train_epoch(loss_function, optimizer, model, train_loader)
    if epoch % 100 == 0:
        losses.append(epoch_loss)
    print("*"*100)
print(losses)

epoch 0:
x_s: (['the', 'capital', 'of', 'denmark', 'is', 'copenhagen'], ['i', 'live', 'in', 'germany'])
y_s: ([0, 0, 0, 1, 0, 1], [0, 0, 0, 1])
window_padded: [[0, 0, 14, 1, 15, 13, 16, 17, 0, 0], [0, 0, 6, 7, 8, 9, 0, 0]]
batch_padded: tensor([[ 0,  0, 14,  1, 15, 13, 16, 17,  0,  0],
        [ 0,  0,  6,  7,  8,  9,  0,  0,  0,  0]])
y: [0, 0, 0, 1, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.],
        [255.,   0.],
        [254.,   1.]])
y: [0, 0, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.]])
padded_labels: tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [255.,   0.],
         [254.,   1.]],

        [[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [  0.,   0.],
         [  0.,   0.]]])
batch 0:
batch: tensor([[ 0,  0, 14,  1, 15, 13, 16, 17,  0,  0],
        [ 0,  0,  6

In [None]:
"""
epoch 0:
x_s: (['the', 'capital', 'of', 'denmark', 'is', 'copenhagen'], ['i', 'live', 'in', 'germany'])
y_s: ([0, 0, 0, 1, 0, 1], [0, 0, 0, 1])
window_padded: [[0, 0, 14, 1, 15, 13, 16, 17, 0, 0], [0, 0, 6, 7, 8, 9, 0, 0]]
batch_padded: tensor([[ 0,  0, 14,  1, 15, 13, 16, 17,  0,  0],
        [ 0,  0,  6,  7,  8,  9,  0,  0,  0,  0]])
y: [0, 0, 0, 1, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.],
        [255.,   0.],
        [254.,   1.]])
y: [0, 0, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.]])
padded_labels: tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [255.,   0.],
         [254.,   1.]],

        [[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [  0.,   0.],
         [  0.,   0.]]])
batch 0:
batch: tensor([[ 0,  0, 14,  1, 15, 13, 16, 17,  0,  0],
        [ 0,  0,  6,  7,  8,  9,  0,  0,  0,  0]])
labels: tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [255.,   0.],
         [254.,   1.]],

        [[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [  0.,   0.],
         [  0.,   0.]]])
lengths: tensor([6, 4])
batch outputs: tensor([[[-0.5967, -0.7999],
         [-0.5707, -0.8326],
         [-1.1202, -0.3948],
         [-0.4604, -0.9971],
         [-0.4353, -1.0414],
         [-0.8197, -0.5808]],

        [[-0.6273, -0.7637],
         [-0.4523, -1.0111],
         [-0.8326, -0.5707],
         [-0.9057, -0.5179],
         [-0.7767, -0.6160],
         [-0.9085, -0.5160]]], grad_fn=<LogSoftmaxBackward>)
batch loss: tensor(173.9259, grad_fn=<DivBackward0>)
====================================================================================================
x_s: (['he', 'comes', 'from', 'denmark'], ['we', "'ll", 'always', 'have', 'paris'])
y_s: ([0, 0, 0, 1], [0, 0, 0, 0, 1])
window_padded: [[0, 0, 10, 11, 12, 13, 0, 0], [0, 0, 2, 1, 3, 4, 5, 0, 0]]
batch_padded: tensor([[ 0,  0, 10, 11, 12, 13,  0,  0,  0],
        [ 0,  0,  2,  1,  3,  4,  5,  0,  0]])
y: [0, 0, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.]])
y: [0, 0, 0, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.]])
padded_labels: tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [  0.,   0.]],

        [[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.]]])
batch 1:
batch: tensor([[ 0,  0, 10, 11, 12, 13,  0,  0,  0],
        [ 0,  0,  2,  1,  3,  4,  5,  0,  0]])
labels: tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [  0.,   0.]],

        [[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.]]])
lengths: tensor([4, 5])
batch outputs: tensor([[[-0.7121, -0.6745],
         [-0.8696, -0.5432],
         [-0.6325, -0.7577],
         [-0.6608, -0.7266],
         [-0.6235, -0.7680]],

        [[-0.5920, -0.8057],
         [-0.4362, -1.0398],
         [-0.6645, -0.7227],
         [-0.5730, -0.8298],
         [-0.7106, -0.6760]]], grad_fn=<LogSoftmaxBackward>)
batch loss: tensor(165.7885, grad_fn=<DivBackward0>)
====================================================================================================
****************************************************************************************************
epoch 1:
x_s: (['we', "'ll", 'always', 'have', 'paris'], ['the', 'capital', 'of', 'denmark', 'is', 'copenhagen'])
y_s: ([0, 0, 0, 0, 1], [0, 0, 0, 1, 0, 1])
window_padded: [[0, 0, 2, 1, 3, 4, 5, 0, 0], [0, 0, 14, 1, 15, 13, 16, 17, 0, 0]]
batch_padded: tensor([[ 0,  0,  2,  1,  3,  4,  5,  0,  0,  0],
        [ 0,  0, 14,  1, 15, 13, 16, 17,  0,  0]])
y: [0, 0, 0, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.]])
y: [0, 0, 0, 1, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.],
        [255.,   0.],
        [254.,   1.]])
padded_labels: tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [  0.,   0.]],

        [[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [255.,   0.],
         [254.,   1.]]])
batch 0:
batch: tensor([[ 0,  0,  2,  1,  3,  4,  5,  0,  0,  0],
        [ 0,  0, 14,  1, 15, 13, 16, 17,  0,  0]])
labels: tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [  0.,   0.]],

        [[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.],
         [255.,   0.],
         [254.,   1.]]])
lengths: tensor([5, 6])
batch outputs: tensor([[[-0.5351, -0.8809],
         [-0.3822, -1.1468],
         [-0.5705, -0.8330],
         [-0.4945, -0.9413],
         [-0.6319, -0.7584],
         [-0.7449, -0.6439]],

        [[-0.5207, -0.9017],
         [-0.4983, -0.9353],
         [-0.9472, -0.4908],
         [-0.3760, -1.1602],
         [-0.3984, -1.1129],
         [-0.6978, -0.6885]]], grad_fn=<LogSoftmaxBackward>)
batch loss: tensor(140.3917, grad_fn=<DivBackward0>)
====================================================================================================
x_s: (['he', 'comes', 'from', 'denmark'], ['i', 'live', 'in', 'germany'])
y_s: ([0, 0, 0, 1], [0, 0, 0, 1])
window_padded: [[0, 0, 10, 11, 12, 13, 0, 0], [0, 0, 6, 7, 8, 9, 0, 0]]
batch_padded: tensor([[ 0,  0, 10, 11, 12, 13,  0,  0],
        [ 0,  0,  6,  7,  8,  9,  0,  0]])
y: [0, 0, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.]])
y: [0, 0, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.]])
padded_labels: tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.]],

        [[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.]]])
batch 1:
batch: tensor([[ 0,  0, 10, 11, 12, 13,  0,  0],
        [ 0,  0,  6,  7,  8,  9,  0,  0]])
labels: tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.]],

        [[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.]]])
lengths: tensor([4, 4])
batch outputs: tensor([[[-0.5954, -0.8014],
         [-0.7166, -0.6702],
         [-0.4940, -0.9421],
         [-0.5418, -0.8715]],

        [[-0.5155, -0.9094],
         [-0.3853, -1.1402],
         [-0.6989, -0.6874],
         [-0.7854, -0.6087]]], grad_fn=<LogSoftmaxBackward>)
batch loss: tensor(150.8819, grad_fn=<DivBackward0>)
====================================================================================================
****************************************************************************************************
[339.7144470214844]
"""

### Prediction.

In [None]:
test_loader = torch.utils.data.DataLoader(list(zip(test_sents, test_labels)), 
                                           batch_size=1, 
                                           shuffle=False, 
                                           collate_fn=partial(my_collate, window_size=2, word_2_id=word_2_id))
test_loader

<torch.utils.data.dataloader.DataLoader at 0x7f63349a1eb8>

In [None]:
for batched_input, batched_labels, batch_lengths in test_loader:
    pp.pprint(("inputs", batched_input, batched_input.size()))
    print("-"*10)
    pp.pprint(("labels", batched_labels, batched_labels.size()))
    print("-"*10)
    pp.pprint(("batch_lengths", batch_lengths))
    print("-"*100)

x_s: (['she', 'comes', 'from', 'paris'],)
y_s: ([0, 0, 0, 1],)
window_padded: [[0, 0, 1, 11, 12, 5, 0, 0]]
batch_padded: tensor([[ 0,  0,  1, 11, 12,  5,  0,  0]])
y: [0, 0, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.]])
padded_labels: tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.]]])
('inputs', tensor([[ 0,  0,  1, 11, 12,  5,  0,  0]]), torch.Size([1, 8]))
----------
('labels',
 tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.]]]),
 torch.Size([1, 4, 2]))
----------
('batch_lengths', tensor([4]))
----------------------------------------------------------------------------------------------------
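
The label tensors printed above hold 255 and 254 in the first column, where a one-hot encoding would hold 1 and 0. That pattern is what a bitwise NOT on a `uint8` tensor produces in recent PyTorch versions, so one plausible cause (an assumption, since the collate function is defined outside this section) is a negative-class column built with `~`. A minimal sketch of the effect, and of building the labels with `torch.nn.functional.one_hot` instead:

```python
import torch
import torch.nn.functional as F

y = [0, 0, 0, 1]  # gold labels for "she comes from paris"

# Bitwise NOT on uint8 flips all eight bits: 0 -> 255, 1 -> 254.
as_bytes = torch.tensor(y, dtype=torch.uint8)
flipped = ~as_bytes
print(flipped)

# Hypothetical alternative: let one_hot build both columns at once.
# one_hot expects an int64 tensor, which torch.tensor(y) is by default.
one_hot = F.one_hot(torch.tensor(y), num_classes=2).float()
print(one_hot)
print(torch.argmax(one_hot, dim=1))
```

With labels in this form, taking the argmax over the class dimension would recover the gold label sequence.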


In [None]:
for test_instance, labs, _ in test_loader:
    print("test instance:", test_instance)
    print("test labs", labs)
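    # Call the module directly rather than .forward() so registered hooks also run.
    outputs = model(test_instance)
    print("test outputs:", outputs)

    # dim=2 is the class dimension: pick the most likely label for each token.
    print("argmax(outputs):", torch.argmax(outputs, dim=2))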
    print("argmax(labs):", torch.argmax(labs, dim=2))
    print("-"*100)

x_s: (['she', 'comes', 'from', 'paris'],)
y_s: ([0, 0, 0, 1],)
window_padded: [[0, 0, 1, 11, 12, 5, 0, 0]]
batch_padded: tensor([[ 0,  0,  1, 11, 12,  5,  0,  0]])
y: [0, 0, 0, 1] -> label:tensor([[255.,   0.],
        [255.,   0.],
        [255.,   0.],
        [254.,   1.]])
padded_labels: tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.]]])
test instance: tensor([[ 0,  0,  1, 11, 12,  5,  0,  0]])
test labs tensor([[[255.,   0.],
         [255.,   0.],
         [255.,   0.],
         [254.,   1.]]])
test outputs: tensor([[[-0.5031, -0.9280],
         [-0.7167, -0.6702],
         [-0.3982, -1.1134],
         [-0.4939, -0.9422]]], grad_fn=<LogSoftmaxBackward>)
argmax(outputs): tensor([[0, 1, 0, 0]])
argmax(labs): tensor([[0, 0, 0, 0]])
----------------------------------------------------------------------------------------------------
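
The model outputs are log-probabilities (note the `LogSoftmaxBackward` grad function), so exponentiating each row gives class probabilities that sum to about 1, and the argmax picks the predicted label. The sketch below reads the printed values back by hand to show which word the model tags as a location. The variable names are illustrative and not part of the notebook.

```python
import math

tokens = ["she", "comes", "from", "paris"]
gold = [0, 0, 0, 1]

# Log-probabilities copied from the "test outputs" printout above.
log_probs = [[-0.5031, -0.9280],
             [-0.7167, -0.6702],
             [-0.3982, -1.1134],
             [-0.4939, -0.9422]]

# Predicted label = index of the larger log-probability in each row.
predicted = [row.index(max(row)) for row in log_probs]

for token, row, pred, label in zip(tokens, log_probs, predicted, gold):
    p_location = math.exp(row[1])  # probability of the LOCATION class
    print(f"{token:>6}  P(location)={p_location:.3f}  predicted={pred}  gold={label}")

# Words the model tags as LOCATION.
predicted_locations = [t for t, p in zip(tokens, predicted) if p == 1]
print(predicted_locations)
```

Here the model tags `comes` rather than `paris`. The printed `argmax(labs)` is all zeros because column 0 of the label tensor holds 254 or 255 in every row, so it does not reflect the gold labels `[0, 0, 0, 1]`.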
