<a href="https://colab.research.google.com/github/SeunghyunKim00/ML/blob/main/Transformer_HW.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## [STAT 38193-01] 2176074 김승현 Homework

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

import torchvision.transforms as transforms
from torchvision.datasets import CIFAR100
from torch.utils.data import DataLoader

import numpy as np

## data loader

path = './datasets/'

transform = transforms.Compose([transforms.ToTensor()])

train_data = CIFAR100(root=path,train=True,transform=transform,download=True)
test_data = CIFAR100(root=path,train=False,transform=transform,download=True)

batch_size = 100

train_loader = DataLoader(dataset=train_data,batch_size=batch_size,shuffle=True,num_workers=0)
test_loader = DataLoader(dataset=test_data,batch_size=batch_size,shuffle=False,num_workers=0)

input_shape = train_data[0][0].shape
output_shape = len(train_data.classes)
print()


Downloading https://www.cs.toronto.edu/~kriz/cifar-100-python.tar.gz to ./datasets/cifar-100-python.tar.gz


100%|██████████| 169001437/169001437 [00:03<00:00, 48766150.44it/s]


Extracting ./datasets/cifar-100-python.tar.gz to ./datasets/
Files already downloaded and verified



**Positional Encoding**

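In `__init__`, we build the values that do not depend on the input.

`pos` is a vector of positions from 0 up to the maximum sequence length, and `self.pos_enc` is a zero matrix that will hold the encoding for each position.

Following the paper, a for-loop fills the even-indexed columns with the sine function and the odd-indexed columns with the cosine function.


In `forward`, the positional encoding, which already lives on the same device as the input `x`, is added to `x`.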
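The column-by-column loop described above can also be written without a Python loop. The following is a minimal sketch, assuming the same sine/cosine layout as the paper; the function name `sinusoidal_encoding` is a hypothetical helper, not part of the homework template.

```python
import torch

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Return a [max_len, d_model] table of sinusoidal position encodings."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # [max_len, 1]
    even_idx = torch.arange(0, d_model, 2, dtype=torch.float32)     # 0, 2, 4, ...
    div = torch.pow(torch.tensor(10000.0), even_idx / d_model)      # one divisor per sin/cos pair
    table = torch.zeros(max_len, d_model)
    table[:, 0::2] = torch.sin(pos / div)                           # even columns use sine
    table[:, 1::2] = torch.cos(pos / div[: d_model // 2])           # odd columns use cosine
    return table

table = sinusoidal_encoding(max_len=512, d_model=16)
print(table.shape)  # torch.Size([512, 16])
```

At position 0 every sine column is 0 and every cosine column is 1, which is a quick sanity check on the layout.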


In [None]:
class PositionalEncoding(nn.Module):
# refer to Section 3.5 in the paper

    def __init__(self,device,max_len=512,d_model=16):
        super().__init__()
        # fill out here
        # how should we fill out self.pos_enc?
        pos =  torch.arange(0, max_len).to(device) # 512-1
        self.pos_enc = torch.zeros(max_len,d_model,requires_grad=False).to(device) # 512-16 shape zero matrix
        for i in range(d_model):
            if i%2 == 0:
                self.pos_enc[:,i] = torch.sin(pos/(10000**(i/d_model)))
            else:
                self.pos_enc[:,i] = torch.cos(pos/(10000**((i-1)/d_model)))

    def forward(self,x):
        # fill out here
        """
        x: transformed input embedding where x.shape = [batch_size, seq_len, data_dim]
        """
        # slice the table so inputs shorter than max_len also broadcast correctly
        pos_emb = x + self.pos_enc[:x.shape[1], :]

        return pos_emb

**Scale Dot Product Attention**

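In `forward`, the cases with and without a `mask` input are handled separately, so the module behaves differently in the encoder layer and the decoder layer.

In both cases, the attention score is the matrix product of the query and key, divided by the square root of the model dimension `d_model` (the paper scales by `sqrt(d_k)` instead). Since `q` and `k` are 4-dimensional tensors, the key is transposed over its last two dimensions before the multiplication.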

**[CIFAR100 data shape]**

100-4-512-4 -> 100-4-4-512

100-4-512-4 @ 100-4-4-512 -> 100-4-512-512 : attention score shape

If the mask produced by the `Masking` module below is not `None`, part of the attention score must be hidden.
The use of `masked_fill` follows the code in section 1.3 Scale Dot Product Attention of https://github.com/hyunwoongko/transformer.

Softmax is applied to the masked attention score, and the result is matrix-multiplied with the value to return the attention value.
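The shape bookkeeping above can be checked on a small random tensor. This is a minimal sketch, assuming a standalone function (hypothetical name `scaled_dot_product_attention`) that scales by `sqrt(d_k)` as in the paper, whereas the class in this notebook scales by `sqrt(d_model)`.

```python
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: [batch, heads, seq_len, d_k]; mask must broadcast to [batch, heads, seq_len, seq_len]."""
    d_k = q.shape[-1]
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5   # [batch, heads, seq_len, seq_len]
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e10)            # hide disallowed positions
    weights = torch.softmax(scores, dim=-1)                      # each query row sums to 1
    return torch.matmul(weights, v), weights

torch.manual_seed(0)
q = torch.randn(2, 4, 8, 4)   # small stand-in for the 100-4-512-4 tensors
k = torch.randn(2, 4, 8, 4)
v = torch.randn(2, 4, 8, 4)
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape)       # torch.Size([2, 4, 8, 4])
print(weights.shape)   # torch.Size([2, 4, 8, 8])
```

The attention weights have one row per query position and one column per key position, which is why the softmax runs over the last axis.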

In [None]:
class ScaledDotProductAttention(nn.Module):
# refer to Section 3.2.1 and Fig 2 (left) in the paper

    def __init__(self,d_model=16):
        super().__init__()
        # there is nothing to do here
        self.d_model = d_model

    def forward(self,q,k,v,mask=None):
        # fill out here
        # compute attention value based on transformed query, key, value where mask is given conditionally
        """
        q, k, v = transformed query, key, value
        q.shape, k.shape, v.shape = [batch_size, num_head, seq_len, d_k=d_model/num_head]
        mask = masking matrix, if the index has value False, kill the value; else, leave the value
        """
        attention_score = torch.matmul(q,k.transpose(-2,-1)) # 100-4-512-4 -> 100-4-4-512 -> matmul 100-4-512-512
        attention_score = attention_score/(self.d_model**(0.5)) # 100-4-512-512

        if mask is not None:
            # masked_fill: positions where the mask is 0 are replaced with -1e10; positions to keep are 1 in the mask
            attention_score = attention_score.masked_fill(mask == 0, -1e10)
            # attention_score shape : 100-4-512-512

        attention_score = torch.softmax(attention_score, dim = -1) # softmax over the key dimension

        attention_value = torch.matmul(attention_score ,v)

        return attention_value

**MultiHeadAttention**

init

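A fully connected linear layer for each input is created as `self.lin_q`, `self.lin_k`, and `self.lin_v`; the `ScaledDotProductAttention` class above is stored as `self.attention`; and the final linear transformation is created as `self.lin_o`.

forward

The incoming `q`, `k`, and `v` are each transformed and then reshaped to 100-4-512-4.



### Revisions

To go from 100-512-16 to 100-4-512-4, a direct `reshape` does not give the intended result: the factor of 2^2 is taken out of the 512 (sequence) axis rather than out of the feature axis, so sequence positions get mixed into the head axis. The tensor must first be reshaped to 100-512-4-4 and then transposed to 100-4-512-4.

The heads are concatenated the same way, in reverse.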
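The difference between the two approaches is easy to see on a tiny tensor whose entries are just their own flat indices. This is an illustrative sketch with made-up sizes (batch 2, sequence length 6), not the notebook's actual data.

```python
import torch

batch, seq_len, num_head, d_k = 2, 6, 4, 4
x = torch.arange(batch * seq_len * num_head * d_k, dtype=torch.float32)
x = x.reshape(batch, seq_len, num_head * d_k)                       # [2, 6, 16]

# Direct reshape: only re-chunks memory, so sequence positions leak into the head axis.
wrong = x.reshape(batch, num_head, seq_len, d_k)

# Split the feature axis into heads first, then swap the head and sequence axes.
right = x.reshape(batch, seq_len, num_head, d_k).transpose(1, 2)    # [2, 4, 6, 4]

# Head 1 at position 0 should hold features 4..7 of position 0.
print(torch.equal(right[0, 1, 0], x[0, 0, 4:8]))   # True
print(torch.equal(wrong[0, 1, 0], x[0, 0, 4:8]))   # False

# Concatenating the heads mirrors the split: transpose back, then merge the last two axes.
restored = right.transpose(1, 2).reshape(batch, seq_len, num_head * d_k)
print(torch.equal(restored, x))                    # True
```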

In [None]:
class MultiHeadAttention(nn.Module):
# refer to Section 3.2.2 and Fig 2 (right) in the paper
    def __init__(self,d_model=16,num_head=4):
        super().__init__()
        # fill out the rest
        # refer to

        assert d_model % num_head == 0, "check if d_model is divisible by num_head"

        self.d_model = d_model
        self.num_head = num_head
        self.d_k = d_model//num_head

        # fully connected linear layers applied first to the query, key, and value
        self.lin_q = nn.Linear(self.d_model, self.d_model)
        self.lin_k = nn.Linear(self.d_model, self.d_model)
        self.lin_v = nn.Linear(self.d_model, self.d_model)

        self.attention = ScaledDotProductAttention(d_model = d_model)

        self.lin_o = nn.Linear(self.d_model,self.d_model)

    def forward(self,q,k,v,mask=None):
        # fill out here
        # compute multi-head attention value
        # here, query, key, value are pre-transformed, so you need to transform them in this module
        """
        q, k, v = pre-transformed query, key, value
        q.shape, k.shape, v.shape = [batch_size, seq_len, d_model]
        mask = masking matrix, if the index has value False, kill the value; else, leave the value
        """
        batch_size = q.shape[0] # read the batch size from the input instead of the global variable

        Q = self.lin_q(q) # 100-512-16
        K = self.lin_k(k)
        V = self.lin_v(v)

        # [batch_size, num_heads, seq_len, d_k] -> required by ScaledDotProductAttention
        # split the feature axis into heads first, then swap the head and sequence axes;
        # a direct reshape to (batch_size, num_head, -1, d_k) would mix the dimensions
        Q = Q.reshape(batch_size, -1, self.num_head, self.d_k).transpose(1,2) # 100-4-512-4
        K = K.reshape(batch_size, -1, self.num_head, self.d_k).transpose(1,2)
        V = V.reshape(batch_size, -1, self.num_head, self.d_k).transpose(1,2)

        attention = self.attention(Q,K,V,mask = mask) # attention value by Scaled dot-product attention

        # concatenate the heads back into the original layout: 100-4-512-4 -> 100-512-4-4 -> 100-512-16
        concat_attention = attention.transpose(1,2).reshape(batch_size, -1, self.d_model)
        output = self.lin_o(concat_attention)

        return output

**Position-wise Feed Forward**

The position-wise feed-forward network maps d_model -> d_ff -> d_model using `nn.Linear` layers with a ReLU in between.

In [None]:
class PositionwiseFeedForwardNetwork(nn.Module):
# refer to Section 3.3 in the paper
# do not use torch.nn.Conv1d

    def __init__(self,d_model=16,d_ff=32):
        super().__init__()
        # fill out here
        self.W1 = nn.Linear(d_model,d_ff)
        self.W2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self,x):
        # fill out here
        temp = self.relu(self.W1(x))
        output = self.W2(temp)
        return output

**Masking**

`Masking` builds a lower-triangular matrix sized to the sequence length; later, entries equal to 1 are kept and entries equal to 0 are masked out.

Because every sequence in this dataset has the same length, no padding mask is used.
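A small example makes the effect of the lower-triangular mask concrete. This is a minimal sketch with a sequence length of 4 and all-zero scores (both chosen only for illustration), so the softmax output shows exactly which positions each step may attend to.

```python
import torch

seq_len = 4
mask = torch.tril(torch.ones(seq_len, seq_len))   # 1 = keep, 0 = hide
print(mask)

# With equal scores everywhere, each row spreads its weight uniformly
# over the current and earlier positions; future positions get weight 0.
scores = torch.zeros(seq_len, seq_len)
weights = torch.softmax(scores.masked_fill(mask == 0, -1e10), dim=-1)
print(weights)
```

Row `i` of `weights` assigns `1/(i+1)` to each of the first `i+1` positions, so position 0 attends only to itself and the last position attends to everything.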


In [None]:
class Masking(nn.Module):

    def __init__(self, device):
        super().__init__()
        self.device = device

    def forward(self, x):

        x_len = x.shape[1]
        # lower-triangular (causal) mask created directly on the target device
        x_mask = torch.tril(torch.ones((x_len, x_len), device=self.device))

        return x_mask

**LayerNormalization**

The implementation follows the Layer Norm code in https://github.com/hyunwoongko/transformer.
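Since `torch.nn.LayerNorm` is off limits for the homework, it is still useful as a reference to test against. The sketch below assumes a hypothetical standalone function `layer_norm` that normalizes over the last axis with the population variance, and compares it with `torch.nn.functional.layer_norm`.

```python
import torch
import torch.nn.functional as F

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize over the last axis, then apply the learnable scale and shift."""
    mean = x.mean(-1, keepdim=True)
    var = x.var(-1, unbiased=False, keepdim=True)   # population variance, as in layer norm
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

torch.manual_seed(0)
x = torch.randn(2, 5, 16)                            # small stand-in for 100-512-16
gamma, beta = torch.ones(16), torch.zeros(16)

ours = layer_norm(x, gamma, beta)
reference = F.layer_norm(x, (16,), weight=gamma, bias=beta, eps=1e-5)
print(torch.allclose(ours, reference, atol=1e-4))
```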

In [None]:
class LayerNormalization(nn.Module):
# do not use torch.nn.LayerNorm

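    # d_model is the first argument so that LayerNormalization(d_model), as called by the
    # encoder and decoder layers, sets the feature size; device is optional and unused
    def __init__(self,d_model=16,eps=1e-5,device=None):
        super().__init__()
        # fill out here
        self.eps = eps
        self.d_model = d_model
        # learnable scale (gamma) and shift (beta), adapted from hyunwoongko/transformer
        self.gamma = nn.Parameter(torch.ones(d_model),requires_grad = True)
        self.beta = nn.Parameter(torch.zeros(d_model), requires_grad = True)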


    def forward(self,x):
        # fill out here
        # input is the feed-forward output: 100-512-16
        mean = x.mean(-1, keepdim = True)
        var = x.var(-1, unbiased = False, keepdim = True) # population variance; recent torch.var versions appear to replace `unbiased` with `correction`

        normed = (x-mean)/torch.sqrt(var+self.eps)

        normed = self.gamma*normed + self.beta

        return normed

**EncoderLayer**

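The multi-head attention, feed-forward network, and layer normalization modules built above are each instantiated in `__init__`. The first version also created three `nn.Linear` layers at the start of the encoder block to turn the input into q, k, and v (removed, as noted under Revisions).

`forward` runs the instances created in `__init__` in order. Dropout is applied after the multi-head attention and after the position-wise feed-forward network.

### Revisions
1. Removed the linear operations in front of multi-head attention, so the parameter count should drop (by two parameter tensors for each removed layer).
2. Dropout has no parameters, so a single instance is enough.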

In [None]:
class EncoderLayer(nn.Module):
# refer to Section 3.1 and Figure 1 in the paper
# this is a single encoder block consists of the following
# multi-head attention, positionwise feed forward network, residual connections, layer normalizations

    def __init__(self,d_model=16,num_head=4,d_ff=32,drop_prob=.1):
        super().__init__()
        # fill out here

        self.dropout = nn.Dropout(drop_prob)

        self.multiheadattention = MultiHeadAttention(d_model = d_model, num_head = num_head)
        self.ffn = PositionwiseFeedForwardNetwork(d_model, d_ff)
        self.layernorm1 = LayerNormalization(d_model)
        self.layernorm2 = LayerNormalization(d_model)

        # the linear operations at the start of the encoder block were removed


    def forward(self,enc):
        # fill out here

        #hidden = enc + self.multiheadattention(Q,K,V) # 100-512-16
        hidden = self.multiheadattention(enc,enc,enc,None)
        hidden = enc + self.dropout(hidden)
        hidden = self.layernorm1(hidden)

        ffn_hid = self.dropout(self.ffn(hidden))
        hidden = hidden + ffn_hid
        output = self.layernorm2(hidden)

        return output


**DecoderLayer**

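The decoder layer contains two multi-head attentions: a self-attention over the decoder input (the values generated so far), and an attention between the encoder output and the query produced by that first attention.

Layer normalization follows each attention and the feed-forward network, so three layer norms are defined in `__init__`. The first version also defined linear layers for the decoder input and for the encoder-decoder attention (removed, as noted under Revisions).

As in the encoder, the masked multi-head self-attention, the encoder-decoder multi-head attention, and the feed-forward network are computed in sequence.

### Revisions
1. `MultiHeadAttention` already performs the linear transformation, so the layer does not need to do it again. All `nn.Linear` layers that used to sit in front of the multi-head attention were removed.

2. Dropout has no parameters, so a single instance is enough.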

In [None]:
class DecoderLayer(nn.Module):
# refer to Section 3.1 and Figure 1 in the paper
# this is a single decoder block consists of the following
# masked multi-head attention, multi-head attention, positionwise feed forward network, residual connections, layer normalizations

    def __init__(self, d_model=16,num_head=4,d_ff=32,drop_prob=.1):
        super().__init__()
        # fill out here
        self.dropout = nn.Dropout(drop_prob)

        self.dec_in_mha = MultiHeadAttention(d_model = d_model, num_head = num_head)
        self.enc_dec_mha = MultiHeadAttention(d_model = d_model, num_head = num_head)
        self.ffn = PositionwiseFeedForwardNetwork(d_model, d_ff)

        self.layernorm1 = LayerNormalization(d_model)
        self.layernorm2 = LayerNormalization(d_model)
        self.layernorm3 = LayerNormalization(d_model)


    def forward(self,enc_output,dec,dec_mask):
        # fill out here
        #dec_in_qkv = [linear(dec) for linear in self.lin_qkv]

        temp = self.dec_in_mha(dec,dec,dec,mask = dec_mask)
        temp = self.dropout(temp)
        temp = temp + dec
        dec_query = self.layernorm1(temp)


        hidden = self.enc_dec_mha(dec_query, enc_output, enc_output)
        hidden = self.dropout(hidden)
        hidden = dec_query + hidden
        hidden = self.layernorm2(hidden)

        ffn_hidden = self.ffn(hidden)
        ffn_hidden = self.dropout(ffn_hidden)
        hidden = hidden + ffn_hidden
        output = self.layernorm3(hidden)

        return output

**Encoder**

The encoder block is repeated `num_layer` times. To do this, `num_layer` separate encoder blocks must be created in `__init__`.

`self.lin_layer` produces the input embedding, and the positional encoding is applied, so the steps that precede the encoder blocks are included here.

A for-loop implements the `num_layer` iterations.
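Creating the blocks in `__init__` only works as intended if they are stored in a container that PyTorch tracks. The sketch below uses two hypothetical classes, with `nn.Linear` standing in for an encoder block, to show that a plain Python list hides the layers from `parameters()` while `nn.ModuleList` registers them.

```python
import torch.nn as nn

class PlainListStack(nn.Module):
    def __init__(self, num_layer=3, d_model=16):
        super().__init__()
        # a plain list is invisible to nn.Module, so these layers are never registered
        self.layers = [nn.Linear(d_model, d_model) for _ in range(num_layer)]

class ModuleListStack(nn.Module):
    def __init__(self, num_layer=3, d_model=16):
        super().__init__()
        # nn.ModuleList registers every layer, so optimizers and .to(device) can reach them
        self.layers = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(num_layer)])

def count_parameters(module):
    return sum(p.numel() for p in module.parameters())

print(count_parameters(PlainListStack()))    # 0
print(count_parameters(ModuleListStack()))   # 816 = 3 * (16*16 + 16)
```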

In [None]:
class Encoder(nn.Module):
# refer to Section 3.1 and Figure 1 in the paper
# this is a whole encoder, i.e., the left side of Figure 1, consists of the following as well
# input embedding, positional encoding

    def __init__(self,device,input_dim=3,num_layer=3,max_len=512,d_model=16,num_head=4,d_ff=32,drop_prob=.1):
        super().__init__()

        self.lin_layer = nn.Linear(input_dim, d_model)  # transform the input into the hidden dim with single linear transformation
        # fill out here
        self.num_layer = num_layer

        self.positional_encoding = PositionalEncoding(device, max_len =  max_len, d_model = d_model)

        self.encoder_layers = nn.ModuleList([EncoderLayer(d_model = d_model, num_head = num_head, d_ff = d_ff, drop_prob = drop_prob).to(device) for _ in range(num_layer)])

    def forward(self,x):
        # fill out here

        enc = self.lin_layer(x)
        hidden = self.positional_encoding(enc)

        for layer in self.encoder_layers:
            hidden = layer(hidden)

        return hidden

**Decoder**

As in the encoder, the decoder block is iterated `num_layer` times.

The embedding and positional encoding that precede the decoder blocks are included, and a for-loop iterates over the `num_layer` blocks.

The final output is passed through a linear layer so that it has the same dimension as the input.

The training code below uses `nn.BCEWithLogitsLoss(reduction='sum')`, so the decoder must return raw logits. For that reason no softmax is applied at the end.
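The reason the decoder returns logits is that `nn.BCEWithLogitsLoss` applies the sigmoid internally. A minimal sketch on random tensors (shapes chosen only for illustration) shows that it matches an explicit sigmoid followed by `nn.BCELoss`:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logit = torch.randn(2, 4, 3)    # raw decoder output, no activation applied
target = torch.rand(2, 4, 3)    # pixel intensities in [0, 1], as ToTensor() produces

with_logits = nn.BCEWithLogitsLoss(reduction='sum')(logit, target)
manual = nn.BCELoss(reduction='sum')(torch.sigmoid(logit), target)
print(torch.allclose(with_logits, manual, atol=1e-4))
```

Applying a sigmoid or softmax inside the model as well would squash the outputs twice, which is why the final layer is left linear.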

In [None]:
class Decoder(nn.Module):
# refer to Section 3.1 and Figure 1 in the paper
# this is a whole decoder, i.e., the right side of Figure 1, consists of the following as well
# input embedding, positional encoding, linear classifier

    def __init__(self,device,input_dim=3,num_layer=3,max_len=512,d_model=16,num_head=4,d_ff=32,drop_prob=.1):
        super().__init__()
        # fill out here
        # self.encoder = Encoder(input_dim, num_layer, max_len, d_model, num_head, d_ff, drop_prob)

        self.num_layer = num_layer

        self.lin_layer = nn.Linear(input_dim, d_model)

        self.positional_encoding = PositionalEncoding(device = device,max_len = max_len, d_model = d_model)
        self.decoder_layers = nn.ModuleList([DecoderLayer(d_model,num_head,d_ff,drop_prob) for _ in range(num_layer)]).to(device)

        self.o_lin = nn.Linear(d_model, input_dim)

    def forward(self,enc_output,y,y_mask):
        # fill out here

        dec = self.lin_layer(y)
        hidden = self.positional_encoding(dec)

        for layer in self.decoder_layers:
            hidden = layer(enc_output, hidden, y_mask)

        output = self.o_lin(hidden) # 100-512-3

        return output

**Transformer**

`Transformer` combines the `Encoder` and `Decoder` built above, creates the mask, and passes it to the decoder.

In [None]:
class Transformer(nn.Module):
# refer to Section 3.1 and Figure 1 in the paper
# sum up encoder and decoder

    def __init__(self,device,input_dim=3,num_layer=3,max_len=512,d_model=16,num_head=4,d_ff=32,drop_prob=.1):
        super().__init__()
        # fill out here
        self.encoder = Encoder(device,input_dim,num_layer,max_len,d_model,num_head,d_ff,drop_prob)
        self.decoder = Decoder(device,input_dim,num_layer,max_len,d_model,num_head,d_ff,drop_prob)
        self.masking = Masking(device)

    def forward(self,x,y):
        # fill out here
        enc_output = self.encoder(x)
        mask_y = self.masking(y)
        dec_output = self.decoder(enc_output,y,mask_y)

        return dec_output

## **Train & Test**

num_param: 30563

Epoch 0 Train: **0.632709** w/ Learning Rate: 0.00049

Epoch 0 Test: **0.564997**

Epoch 29 Train: **0.540526** w/ Learning Rate: 0.00204

Epoch 29 Test: **0.539083**
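The learning rates in this summary are consistent with the warmup schedule implemented by `ScheduledOptimizer` in the cell below. The sketch reproduces them under the assumption of 500 optimizer steps per epoch (50,000 CIFAR-100 training images at batch size 100); the function name `warmup_lr` is hypothetical.

```python
def warmup_lr(step_num, d_model=16, warmup_steps=4000):
    """Linear warmup for warmup_steps steps, then inverse square-root decay."""
    return d_model ** -0.5 * min(step_num ** -0.5, step_num * warmup_steps ** -1.5)

steps_per_epoch = 50000 // 100                        # assumption: no partial batches

print(round(warmup_lr(1 * steps_per_epoch), 5))       # 0.00049, after epoch 0
print(round(warmup_lr(30 * steps_per_epoch), 5))      # 0.00204, after epoch 29
```

The rate peaks at step 4000, which falls at the end of epoch 7, and decays afterwards.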

In [None]:
class ScheduledOptimizer:

    def __init__(self,optimizer,d_model=16,warmup_steps=4000):
        self.optimizer = optimizer
        self.d_model = d_model
        self.warmup_steps = warmup_steps
        self.step_num = 0

    def zero_grad(self):
        self.optimizer.zero_grad()

    def update_parameter_and_learning_rate(self):
        self.optimizer.step()
        self.step_num += 1
        self.lr = self.d_model**(-.5) * min(self.step_num**(-.5),self.step_num*self.warmup_steps**(-1.5))
        for param_group in self.optimizer.param_groups:
            param_group['lr'] = self.lr



device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

model = Transformer(device=device,input_dim=3,num_layer=3,max_len=512,d_model=16,num_head=4,d_ff=64,drop_prob=.1).to(device)
loss = nn.BCEWithLogitsLoss(reduction='sum')
optimizer = torch.optim.Adam(model.parameters(),betas=(.9,.98),eps=1e-9)
scheduled_optimizer = ScheduledOptimizer(optimizer,d_model=16)


num_epoch = 15
train_loss_list, test_loss_list = list(), list()

total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print("num_param:", total_params)

for i in range(num_epoch):

    ## train
    model.train()

    total_loss = 0
    count = 0

    for batch_idx, (image, label) in enumerate(train_loader):

        image = image.reshape(-1,3,1024).transpose(1,2)  # flatten each 32x32 image into a sequence of 1024 pixels
        x, y = image[:,:512,:].to(device), image[:,512:,:].to(device) # first 512 pixels are the input, last 512 are the target

        # batch = 100
        # y_ : a 100-1-3 zero tensor, followed by y with its last step along dim=1 dropped
        y_ = torch.zeros([batch_size,1,3],requires_grad=False).to(device)
        y_ = torch.cat([y_,y[:,:-1,:]],dim=1)
        # the leading zero step acts as the initial decoder input,
        # and the last target step is dropped so the decoder input is the target shifted right by one

        logit = model.forward(x,y_)
        cost = loss(logit,y)/(3*512)

        total_loss += cost.item()

        scheduled_optimizer.zero_grad()
        cost.backward()
        scheduled_optimizer.update_parameter_and_learning_rate()

    ave_loss = total_loss/len(train_data)
    train_loss_list.append(ave_loss)

    if i % 1 == 0:
        print("\nEpoch %d Train: %.6f w/ Learning Rate: %.5f"%(i,ave_loss,scheduled_optimizer.lr))

    ## test
    model.eval()

    total_loss = 0
    count = 0

    with torch.no_grad():
        for batch_idx, (image, label) in enumerate(test_loader):

            image = image.reshape(-1,3,1024).transpose(1,2)
            x, y = image[:,:512,:].to(device), image[:,512:,:].to(device)

            y_ = torch.zeros([batch_size,1,3],requires_grad=False).to(device)
            y_ = torch.cat([y_,y[:,:-1,:]],dim=1)

            logit = model.forward(x,y_)
            cost = loss(logit, y)/(3*512)

            total_loss += cost.item()

    ave_loss = total_loss/len(test_data)
    test_loss_list.append(ave_loss)

    if i % 1 == 0:
        print("Epoch %d Test: %.6f"%(i,ave_loss))


num_param: 23219

Epoch 0 Train: 0.613294 w/ Learning Rate: 0.00049
Epoch 0 Test: 0.565276

Epoch 1 Train: 0.560341 w/ Learning Rate: 0.00099
Epoch 1 Test: 0.555663

Epoch 2 Train: 0.556542 w/ Learning Rate: 0.00148
Epoch 2 Test: 0.554102

Epoch 3 Train: 0.555291 w/ Learning Rate: 0.00198
Epoch 3 Test: 0.553805

Epoch 4 Train: 0.554372 w/ Learning Rate: 0.00247
Epoch 4 Test: 0.552239

Epoch 5 Train: 0.552991 w/ Learning Rate: 0.00296
Epoch 5 Test: 0.551938

Epoch 6 Train: 0.551559 w/ Learning Rate: 0.00346
Epoch 6 Test: 0.549474

Epoch 7 Train: 0.549977 w/ Learning Rate: 0.00395
Epoch 7 Test: 0.547790

Epoch 8 Train: 0.548020 w/ Learning Rate: 0.00373
Epoch 8 Test: 0.545080

Epoch 9 Train: 0.546260 w/ Learning Rate: 0.00354
Epoch 9 Test: 0.542930

Epoch 10 Train: 0.543705 w/ Learning Rate: 0.00337
Epoch 10 Test: 0.540571

Epoch 11 Train: 0.542127 w/ Learning Rate: 0.00323
Epoch 11 Test: 0.539427

Epoch 12 Train: 0.541049 w/ Learning Rate: 0.00310
Epoch 12 Test: 0.538388

Epoch 13 Train

References

https://arxiv.org/pdf/1706.03762.pdf

https://github.com/hyunwoongko/transformer/tree/master

https://paul-hyun.github.io/transformer-02/

In [None]:
for i in range(num_epoch):

    ## train
    model.train()

    total_loss = 0
    count = 0

    for batch_idx, (image, label) in enumerate(train_loader):

        image = image.reshape(-1,3,1024).transpose(1,2)  # flatten each 32x32 image into a sequence of 1024 pixels
        x, y = image[:,:512,:].to(device), image[:,512:,:].to(device) # first 512 pixels are the input, last 512 are the target

        # batch = 100
        # y_ : a 100-1-3 zero tensor, followed by y with its last step along dim=1 dropped
        y_ = torch.zeros([batch_size,1,3],requires_grad=False).to(device)
        y_ = torch.cat([y_,y[:,:-1,:]],dim=1)
        # the leading zero step acts as the initial decoder input,
        # and the last target step is dropped so the decoder input is the target shifted right by one

        logit = model.forward(x,y_)
        cost = loss(logit,y)/(3*512)

        total_loss += cost.item()

        scheduled_optimizer.zero_grad()
        cost.backward()
        scheduled_optimizer.update_parameter_and_learning_rate()

    ave_loss = total_loss/len(train_data)
    train_loss_list.append(ave_loss)

    if i % 1 == 0:
        print("\nEpoch %d Train: %.6f w/ Learning Rate: %.5f"%(i,ave_loss,scheduled_optimizer.lr))

    ## test
    model.eval()

    total_loss = 0
    count = 0

    with torch.no_grad():
        for batch_idx, (image, label) in enumerate(test_loader):

            image = image.reshape(-1,3,1024).transpose(1,2)
            x, y = image[:,:512,:].to(device), image[:,512:,:].to(device)

            y_ = torch.zeros([batch_size,1,3],requires_grad=False).to(device)
            y_ = torch.cat([y_,y[:,:-1,:]],dim=1)

            logit = model.forward(x,y_)
            cost = loss(logit, y)/(3*512)

            total_loss += cost.item()

    ave_loss = total_loss/len(test_data)
    test_loss_list.append(ave_loss)

    if i % 1 == 0:
        print("Epoch %d Test: %.6f"%(i+15,ave_loss))



Epoch 0 Train: 0.539596 w/ Learning Rate: 0.00280
Epoch 15 Test: 0.537779

Epoch 1 Train: 0.539368 w/ Learning Rate: 0.00271
Epoch 16 Test: 0.537460

Epoch 2 Train: 0.539189 w/ Learning Rate: 0.00264
Epoch 17 Test: 0.537362

Epoch 3 Train: 0.539071 w/ Learning Rate: 0.00256
Epoch 18 Test: 0.537162

Epoch 4 Train: 0.538988 w/ Learning Rate: 0.00250
Epoch 19 Test: 0.537236

Epoch 5 Train: 0.538910 w/ Learning Rate: 0.00244
Epoch 20 Test: 0.537067

Epoch 6 Train: 0.538822 w/ Learning Rate: 0.00238
Epoch 21 Test: 0.536994

Epoch 7 Train: 0.538755 w/ Learning Rate: 0.00233
Epoch 22 Test: 0.536921

Epoch 8 Train: 0.538689 w/ Learning Rate: 0.00228
Epoch 23 Test: 0.536911

Epoch 9 Train: 0.538639 w/ Learning Rate: 0.00224
Epoch 24 Test: 0.536851

Epoch 10 Train: 0.538583 w/ Learning Rate: 0.00219
Epoch 25 Test: 0.537087

Epoch 11 Train: 0.538540 w/ Learning Rate: 0.00215
Epoch 26 Test: 0.536792

Epoch 12 Train: 0.538501 w/ Learning Rate: 0.00211
Epoch 27 Test: 0.536806

Epoch 13 Train: 0.538