<a href="https://colab.research.google.com/github/eunjiWon/SoftwareDefectPredictionMetricUsingDeepLearning/blob/master/LSTM_Implementation_(sequence_models_tutorial).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#SEQUENCE MODELS AND LONG-SHORT TERM MEMORY NETWORKS

A recurrent neural network is a network that maintains some kind of state. For example, its output could be used as part of the next input, so that information can propogate along as the network passes over the sequence. In the case of an LSTM, for each element in the sequence, there is a corresponding hidden state h1, which in principle can contain information from arbitrary points earlier in the sequence. We can use the hidden state to predict words in a language model, part-of-speech tags, and a myriad of other things.

# LSTM's in Pytorch
Before getting to the example, note a few things. Pytorch's LSTM expects all of its inputs to be 3D tensors. The semantics of the axes of these tensors is important. The first axis is the sequence itself, the second indexes instances in the mini-batch, and the sequence itself, and the third indexes elements of the input. We haven't discussed mini-batching, so lets just ignore that and assume we will always have just 1 dimension on the second axis. If we want to run the sequence model over the sentence "The cow jumped", our input should look like

Let's see a quick example.

In [0]:
# Author: Robert Guthrie
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f4ef462aaf0>

In [0]:
lstm = nn.LSTM(3, 3) # Input dim is 3, output dim is 3
inputs = [torch.randn(1, 3) for _ in range(5)] # make a sequence of length 5
print(inputs) # (1, 3) 사이즈를 가진 tensor가 5개 생김
# initialize the hidden state.
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))
for i in inputs:
  # after each step, hidden contains the hidden state.
  out, hiddden = lstm(i.view(1, 1, -1), hidden)
# initialize the hidden state.
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))
for i in inputs:
  # after each step, hidden contains the hidden state.
  out, hiddden = lstm(i.view(1, 1, -1), hidden)
print("out: ", out)
print("hidden: ", hidden)

# Add the extra 2nd dimension
print("inputs: ", inputs)
inputs = torch.cat(inputs).view(len(inputs), 1, -1) # 5개의 텐서에서 하나의 텐서로 바뀜
print("inputs: ", inputs)
hidden = (torch.randn(1, 1, 3), torch.randn(1, 1, 3))  # clean out hidden state
out, hidden = lstm(inputs, hidden)
print("out: ", out)
print("hidden: ", hidden)


[tensor([[ 1.7153, -1.1099,  0.3573]]), tensor([[-0.3369, -0.1951,  1.6927]]), tensor([[-0.9122, -0.1971,  0.6831]]), tensor([[ 0.6439, -1.5348,  0.1530]]), tensor([[-0.3657,  0.0245, -0.0813]])]
out:  tensor([[[-0.3555,  0.3641, -0.1959]]], grad_fn=<StackBackward>)
hidden:  (tensor([[[0.0478, 1.3501, 0.0977]]]), tensor([[[-1.4379,  1.8068, -1.2562]]]))
inputs:  [tensor([[ 1.7153, -1.1099,  0.3573]]), tensor([[-0.3369, -0.1951,  1.6927]]), tensor([[-0.9122, -0.1971,  0.6831]]), tensor([[ 0.6439, -1.5348,  0.1530]]), tensor([[-0.3657,  0.0245, -0.0813]])]
inputs:  tensor([[[ 1.7153, -1.1099,  0.3573]],

        [[-0.3369, -0.1951,  1.6927]],

        [[-0.9122, -0.1971,  0.6831]],

        [[ 0.6439, -1.5348,  0.1530]],

        [[-0.3657,  0.0245, -0.0813]]])
out:  tensor([[[ 0.3059,  0.3641,  0.0463]],

        [[ 0.0652,  0.2791, -0.1788]],

        [[-0.1173,  0.2761, -0.0856]],

        [[-0.3872,  0.3872, -0.2095]],

        [[-0.3846,  0.2223, -0.1080]]], grad_fn=<StackBackward>)

###view function
```
x = torch.randn(4, 4)
x.size()
y = x.view(16)
y.size()
z = x.view(-1, 8)
z.size()
x.size()
```
inputs의 size는 torch.Size([5, 1, 3]) 이다.
```
for i in inputs:
  i.view(1, 1, -1)
```
위에 처럼 해주면 i의 size는 각각 torch.Size([1, 1, 3])이 된다. 


# Example: An LSTM for Part-of-Speech Tagging
In this section, we will use an LSTM to get part of speech tags. (참고로 POS tagging은 문장 내 단어들의 품사를 식별하여 태그를 붙여주는 것을 말한다. tuple의 형태로 출력되며 (단어, 태그)로 출력된다.) We will not use Viterbi (비터비 알고리즘은 히든 스테이트의 최적 시퀀스를 찾기 위한 다이나믹 프로그래밍 기법의 일종임) or Forward-Backward of anything like that.



In [0]:
# Prepare data
def prepare_sequence(seq, to_ix): # turn seq into tensors of word indices
  idxs = [to_ix[w] for w in seq]
  print("idxs: ", idxs) # e.g., "idxs: [5, 6, 7, 8]"
  return torch.tensor(idxs, dtype=torch.long)

training_data = [
                 ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
                 ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
word_to_ix = {}
for sent, tags in training_data:
  for word in sent:
    if word not in word_to_ix:
      word_to_ix[word] = len(word_to_ix) # index를 부여하기 위해 word_to_ix의 사이즈를 이용
# print("word_to_ix: ", word_to_ix)
tag_to_ix = {"DET": 0, "NN": 1, "V": 2} # manually indexing
# print("tag_to_ix: ", tag_to_ix)

# These will usually be more like 32 or 64 dimensional. ???
# We will keep them small. so we can see how the weights change as we train. ???
EMBEDDING_DIM = 6
HIDDEN_DIM = 6 

word_to_ix:  {'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}
tag_to_ix:  {'DET': 0, 'NN': 1, 'V': 2}


In [0]:
# Create the model
class LSTMTagger(nn.Module):

  def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
    super(LSTMTagger, self).__init__()
    self.hidden_dim = hidden_dim

    self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

    # The LSTM takes word embeddings as inputs, and outputs hidden states with dimensionality hidden_dim.
    self.lstm = nn.LSTM(embedding_dim, hidden_dim)

    # The linear layer that maps from hidden state space to tag space
    # tag가 output이라서 
    self.hidden2tag = nn.Linear(hidden_dim, tagset_size)
  
  def forward(self, sentence):
    # print("sentence size: ", sentence.size())
    embeds = self.word_embeddings(sentence) # here is an actual embedding execution 
    lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1)) # 하나의 텐서로 만들어 주는 듯...
    # print("lstm_out.view(len(sentence), -1): ", lstm_out.view(len(sentence), -1))
    tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1)) # target_size parameter는 어디서 넘겨주는거지?
    tag_scores = F.log_softmax(tag_space, dim=1)
    # print("tag_scores: ", tag_scores)
    return tag_scores


In [0]:
# Train the model
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1) # SGD는 학습데이터 1개마다 가중치를 업데이트하기 때문에 전체 학습테이터가 N개면 epoch이 N번

# See what the scores are before training
# Note that element i, j of the output is the score for tag j for word i.
# Here we don't need to train, so the code is wrapped in torch.no_grad() 
# 왜 train 필요없는거지지? 간단해서???
with torch.no_grad(): # 또한 with torch.no_grad(): 로 코드 블럭을 감싸서 autograd가 .requires_grad=True 인 Tensor들의 연산 기록을 추적하는 것을 멈출 수 있습니다.
  inputs = prepare_sequence(training_data[0][0], word_to_ix) # make a tensor
  tag_scores = model(inputs) # 알아서 forward 함수가 실행되는건가보다...
  print("Before training tag_scores: ", tag_scores)

for epoch in range(300): # again, normally you would NOT do 300 epochs, it is toy data
  for sentence, tags in training_data:
    # Step 1. Remember that Pytorch accumulates gradients.
    # We need to clear them out before each instance
    model.zero_grad()
    # Step 2. Get our inputs ready for the network, that is, turn them into tensors of word indices.
    sentence_in = prepare_sequence(sentence, word_to_ix)
    targets = prepare_sequence(tags, tag_to_ix) # label
    # Step 3. Run our forward pass.
    tag_scores = model(sentence_in)
    # Step 4. Compute the loss, gradients, and update the parameters by calling optimizer.step()
    loss = loss_function(tag_scores, targets) # 위에서 nn.NLLLoss()로 정의했음 
    # test 할 때도도 loss를 계산할 수 있겠지?
    # print("training... loss: ", loss)
    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()  
    optimizer.step() # parameters을 업데이트함


In [41]:
# See what the scores are after training
with torch.no_grad():
  inputs = prepare_sequence(training_data[0][0], word_to_ix)
  print(inputs)
  tag_scores = model(inputs)
  targets = prepare_sequence(training_data[0][1], tag_to_ix) # label
  print(targets)
  loss = loss_function(tag_scores, targets)
  print("After training tag_scores: ", tag_scores)
  print("After training loss: ", loss)

idxs:  [0, 1, 2, 3, 4]
tensor([0, 1, 2, 3, 4])
idxs:  [0, 1, 2, 0, 1]
tensor([0, 1, 2, 0, 1])
After training tag_scores:  tensor([[-0.1047, -2.9359, -3.0729],
        [-3.3204, -0.0398, -5.8491],
        [-2.9641, -4.9341, -0.0606],
        [-0.0447, -3.9176, -3.7365],
        [-3.8830, -0.0220, -6.7731]])
After training loss:  tensor(0.0544)
