<a href="https://colab.research.google.com/github/Shopping-Yuan/ML2021HW/blob/Shopping_vscode_branch/HW5/HW05_modified.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

------
###Part 0 setting and installing package
------
###Part 1 preparing data set
------
######load data file
######clean data
######pick up line pairs
######tokenize : using sentencepiece
######make data set
------
###Part 2 make model
------
######positional encoding layer
######multihead attention layer
######encoder layer(s)
######decoder layer(s)
######transformer layer
------
###Part 3 training and validation process
------
######Noam optimizer
######label smoothing
######beam search
######bleu
######training and validation function
######main function
------


setting
======
>Here are all parameters using in this project.

In [1]:
setting = {
# information of the path of dataset
"data_info" : {
    "document":"/content",
    "raw_file_name":"/ted2020.tgz",
    "unzip_path":"/train_dev/",
    "source":{
        "lang":"en",
        "raw_data_path":"/train_dev/raw.en",
        "clean_data_path":"/train_dev/clean_en.txt",
        "tokenized_train_data":"/train_dev/tokenized_train_data_en.txt",
        "tokenized_val_data":"/train_dev/tokenized_val_data_en.txt"
        },
    "target":{
        "lang":"zh",
        "raw_data_path":"/train_dev/raw.zh",
        "clean_data_path":"/train_dev/clean_zh.txt",
        "tokenized_train_data":"/train_dev/tokenized_train_data_zh.txt",
        "tokenized_val_data":"/train_dev/tokenized_val_data_zh.txt"
        }
},
# tokenized setting for spm
"tokenized_setting" : {
    "vocab_size" : 8000,
    "character_coverage" : 1,
    "model_type" : "bpe", # "unigram",
    "input_sentence_size" : 400000,
    "shuffle_input_sentence" : True,
    "normalization_rule_name" : "nmt_nfkc_cf",
    "pad_id":0,
    "unk_id":1,
    "bos_id":2,
    "eos_id":3,
    "max_l":400
},
# model structure setting
"model" : {
      "encoder_embedding_dimension" : 256,
      "decoder_embedding_dimension" : 256,
      "feedforward_dimension" : 2048,
      "num_heads" : 2,
      "dropout_p" : 0.0,
      "layer_num" : 6
},

# setting in training and validation process ,
# including optimization setting.
"training_hparas" : {
    "total_step" : 40000,
    "do_valid_step" : 4000,
    "early_stop_step" : 2,
    "train_batch_size" : 40,
    "valid_batch_size" : 100,
    "workers" : 0,
    "label_smoothing" : 0.1,
    "beam_num" : 2,
    "optimization":{
        "factor" : 2,
        "warmup"  : 4000,
        "optimizer" : {
                "lr" : 0,
                "betas" : (0.9, 0.98),
                "eps" : 1e-9,
                "weight_decay" : 0.0001
                }
            },
    "model_saving_path" : "/content/model.pth"
}

}


installing package
------

In [24]:
# used in part 1
!pip install sentencepiece
# used in part 1 and 3
!pip install tqdm
# used in part 2
!pip install torchinfo
# used in part 3
!pip install torcheval



preparing data set
=============

load data file
-------------
>Here I load dataset from my drive,  
>but it also can be download from the link below.

In [3]:
# step 1 : download dataset from drive to google colab
# original dataset is in "https://mega.nz/#!vEcTCISJ!3Rw0eHTZWPpdHBTbQEqBDikDEdFPr7fI8WxaXK9yZ9U"

path_doc = setting["data_info"]["document"]
rawdata_file_name = setting["data_info"]["raw_file_name"]
rawdata_file_path = path_doc + rawdata_file_name
unzip_path = path_doc + setting["data_info"]["unzip_path"]

# mount drive
from google.colab import drive
drive_path = path_doc + "/drive"
drive_name = "/MyDrive"
drive.mount(drive_path)

# copy file from drive
import shutil
shutil.copyfile(drive_path + drive_name + rawdata_file_name, rawdata_file_path)

# step 2 : unzip dataset
import tarfile
# open file
file = tarfile.open(rawdata_file_path)
# extracting file
file.extractall(unzip_path)
file.close()

Mounted at /content/drive


clean data
------
>First each dataset (source or target) is clean  
>seperately, change to halfwidth and remove/replace  
>some kind of punctuations.

>Also because the number of sentences in one line may be  
>different in line pairs of source and target set (its an error),  
>some special punctuations is add to the end of sentences  
>for the next process dealing with these problem by  
>using sentence pairs instead of lines pairs to form datasets.



In [4]:
import unicodedata
import string
import re
# convert fullwidth to halfwidth
def to_halfwidth(string):
  return "".join(unicodedata.normalize('NFKC',letter) for letter in string)
def clean_s_zh(s):
    s = to_halfwidth(s)
    # step 1 : delete — _
    delete = " _()[]"
    delete_rules = s.maketrans("","",delete)
    s = s.translate(delete_rules)

    # step 2 : replace “” with ""
    to_be_replace = '“”'
    replace = '""'
    replace_dict = dict(zip(to_be_replace,replace))

    # step 3 : add **END** before and after punctuation

    """
    The number of sentences in one line may be different
    in line pairs of source and target set.
    "**END**" is add after "。!?" and ".!?", which can be used
    to check if the number of sentence in the pair are equal
    in the next process.
    also in english, "." may be use in abbreviation,
    these different use must be identified.

    """

    punctuation = "。!?"
    for char in punctuation:
      replace_dict[char] = char + "**END**"

    replace_rules = s.maketrans(replace_dict)
    s = s.translate(replace_rules)

    zh_list = s.strip("\n").split("\n")

    return zh_list

def clean_s_en(s):
    s = to_halfwidth(s)

    replace_dict = {}

    delete = "-()[]"
    for char in delete:
      replace_dict[char] = ""

    punctuation = "!?"
    for char in punctuation:
      replace_dict[char] = char + "**END**"
    replace_rules = s.maketrans(replace_dict)
    s = s.translate(replace_rules)

    # Identify if "." is used in abbreviation,
    # if not, add "**END**" after it.
    pattern = re.compile(r"(?<!([.\s\r\n\f][a-zA-Z]))[.]")
    s = pattern.sub("**END**",s)

    # test pattern
    # pattern = re.compile(r"(?<!([.\s\r\n\f][a-zA-Z]))[.]")
    # result = pattern.sub("**END**","There are many people in U.S. w.r.t. in Taiwan.Thank you.")

    en_list = s.strip("\n").split("\n")

    return en_list

pick up line pairs
------
>pick up line pairs has equal number of sentences and  
>split them into sentences to form sourse/target dataset.  
>Remove sentences with too many words for training and validation.

In [5]:
# using "**END**" to split line pairs to check if they have equal sentence
def divide_by_END(s):
    list_s = []
    for line_string in s.strip("**END**").split("**END**"):
      if line_string not in [""," "]:
         list_s.append(line_string)
    return(list_s)
'''
warning : devide_en_again function is apply just beacause
in "this" dataset english sentences end with ":" or ";"
sometimes not splited well.
If the dataset is change, this part may need to be
eliminated or modified.
'''
def devide_en_again(s,punctuation = ":;"):
    replace_dict = {}
    for char in punctuation:
      replace_dict[char] = char + "**END**"

    replace_rules_src = s.maketrans(replace_dict)
    new_s = divide_by_END(s.translate(replace_rules_src))
    return new_s

# remove "sentence" if it is too long.
def remove_too_long(src_list,tgt_list,threshold):
    too_long_src = 0
    too_long_tgt = 0
    remove = False
    new_s = []
    new_t = []
    for i in range(len(src_list)):
      if ((len(src_list[i])>threshold)):
        remove = True
        too_long_src += 1
      if (len(tgt_list[i])>threshold):
        remove = True
        too_long_tgt += 1
      if remove == False:
        new_s.append(src_list[i])
        new_t.append(tgt_list[i])
      else :
        remove = False
    return(new_s,new_t,too_long_src,too_long_tgt)

# pick up good line pairs for traning and validation model
def check_data_pairs(src_list,tgt_list,threshold):
    index = 0
    new_src_list = []
    new_tgt_list = []

    same = 0
    add_next = 0
    split_again = 0
    not_use = 0

    while(index < len(src_list)):

      src = divide_by_END(src_list[index])
      tgt = divide_by_END(tgt_list[index])

      # case 1 : src is as long as tgt , finished.
      if len(src) == len(tgt):
        new_src_list += src
        new_tgt_list += tgt
        same += 1
        index += 1

      else :
        # if it is not the last one : both src and tgt add next sentence
        if index != len(src_list)-1:
          src_add_next = divide_by_END(src_list[index] + src_list[index+1])
          tgt_add_next = divide_by_END(tgt_list[index] + tgt_list[index+1])
          # case 2 : src_add_next is as long as tgt_add_next , finished.
          if len(src_add_next) == len(tgt_add_next):
            new_src_list += src_add_next
            new_tgt_list += tgt_add_next
            add_next += 2
            index += 2

          # using new punctuation to divide tgt (english) sentence.
          # note that this part could cause negative effects if the dataset is change.
          else :
            src_add_next = devide_en_again(src_list[index] + src_list[index+1])
            # case 3 : src_add_next is as long as tgt_add_next , finished.
            if len(src_add_next) == len(tgt_add_next):
              new_src_list += src_add_next
              new_tgt_list += tgt_add_next
              split_again +=2
              index += 2

            # case 4 : sentence will not be used.
            else :
              not_use += 1
              # if to_do == 1 :
              #   print(index,src_add_next,tgt_add_next,len(src_add_next),len(tgt_add_next))
              index += 1

        # if it is the last one
        else :
          not_use += 1
          index += 1
    # print information
    print(f"The original total number of line is {index}.")
    print(f"The number of line pairs have the equal sentences is {same}.")
    print(f"The number of line pairs have the equal sentences after combine the next lines is {add_next}.")
    print(f"The number of line pairs have the equal sentences after combine the next lines"+\
       f"and resplit english lines using :; is {split_again}.")
    print(f"The number of line we don't use is {not_use}.")
    print(f"Note that {index} = {same}+{add_next}+{split_again}+{not_use}.")

    # remove long lines
    print(f"The total number of sentence pairs before remove long sentences is {len(new_src_list)}.")
    new_src_list,new_tgt_list,too_long_src,too_long_tgt = remove_too_long(new_src_list,new_tgt_list,threshold)
    print(f"The finally total number of sentence pairs using is {len(new_src_list)}.")
    print(f"Note that {len(new_src_list)} are the number of sentence pairs, not line pairs")

    return(new_src_list,new_tgt_list)

load and clean data
------

In [6]:
# load and clean data
def load_file(path,function):
    with open(path, "r") as f:
      data = f.read()
      return function(data)
# saving to new path
def clean_data_and_save(
    path_doc,raw_src_path,raw_tgt_path,
    clean_src_path,clean_tgt_path,threshold
    ):
    raw_src_path = path_doc + raw_src_path
    raw_tgt_path = path_doc + raw_tgt_path
    src = load_file(raw_src_path,clean_s_en),
    tgt = load_file(raw_tgt_path,clean_s_zh),
    # src , tgt are tuples with only one term : src_list, tgt_list
    src_list = src[0]
    tgt_list = tgt[0]
    clean_src_list, clean_tgt_list = check_data_pairs(src_list,tgt_list,threshold)
    with open(path_doc + clean_src_path, "w") as f:
      f.write("\n".join(clean_src_list))
    with open(path_doc + clean_tgt_path, "w") as f:
      f.write("\n".join(clean_tgt_list))
# test clean_data_and_save
# clean_data_and_save(
#     path_doc = setting["data_info"]["document"],
#     raw_src_path = setting["data_info"]["source"]["raw_data_path"],
#     raw_tgt_path = setting["data_info"]["target"]["raw_data_path"],
#     clean_src_path = setting["data_info"]["source"]["clean_data_path"],
#     clean_tgt_path = setting["data_info"]["target"]["clean_data_path"],
#     threshold = setting["tokenized_setting"]["max_l"]
# )

tokenize
------
>using sentencepiece to tokenize sentences,  
>first make the english/chinese dictionary separately,  
>then use these dict to encode sentence pair in dataset,  
>including add bos/eos/padding to tokenized sentences.  
>Finally split then into train/val set and save.

In [3]:
import sentencepiece as spm
import numpy as np
from tqdm import tqdm
import torch.utils.data as data
def tokenized(clean_data_path,
       vocab_size,
       lang,
       tokenized_setting
       ):
  model_prefix = f"spm_{vocab_size}_{lang}"
  spm.SentencePieceTrainer.train(
      input=clean_data_path,
      **tokenized_setting,
      model_prefix=model_prefix,
  )
  return(model_prefix)

def get_tokenizers(path_doc,vocab_size,src_lang,tgt_lang):
  src_tokenizer = spm.SentencePieceProcessor(model_file = path_doc + f"/spm_{vocab_size}_{src_lang}" +".model")
  tgt_tokenizer = spm.SentencePieceProcessor(model_file = path_doc + f"/spm_{vocab_size}_{tgt_lang}" +".model")
  return src_tokenizer,tgt_tokenizer

def bos_eos_padding(dataset,
          max_l,
          src_tokenizer,
          tgt_tokenizer
          ):


  padding_src = []
  padding_tgt = []
  len_s = 0
  len_t = 0
  for src,tgt in dataset:
    s = src_tokenizer.encode(src, out_type=int)
    s = np.append(s,[3])
    s = np.append([2],np.pad(s,(0, max_l-len(s)-1), constant_values = 0))
    padding_src.append(s)

    t = tgt_tokenizer.encode(tgt, out_type=int)
    t = np.append(t,[3])
    t = np.append([2],np.pad(t,(0, max_l-len(t)-1), constant_values = 0))
    padding_tgt.append(t)

  return(list(zip(padding_src,padding_tgt)))
# test SentencePieceProcessor and bos_eos_padding
# s_src = spm.SentencePieceProcessor(model_file="/content/spm8000_en.model")
# s_src.encode("hello world!", out_type=int)
# bos_eos_padding([("hello world","_哈囉")],5,10)

def data_set_preparing(path_doc,
            clean_src_path,
            clean_tgt_path,
            max_l,
            src_tokenizer,
            tgt_tokenizer,
            st_train_path,
            st_val_path,
            tt_train_path,
            tt_val_path,
            ):
    src_set = []
    tgt_set = []

    with open(path_doc+clean_src_path,"r") as in_f :
      for line in tqdm(in_f):
        src_set.append(line)
    with open(path_doc+clean_tgt_path,"r") as in_f :
      for line in tqdm(in_f):
        tgt_set.append(line)

    dataset = list(zip(src_set,tgt_set))
    dataset = bos_eos_padding(dataset,max_l,src_tokenizer,tgt_tokenizer)
    train_set, valid_set = data.random_split(dataset,[0.99,0.01])
    # print(train_set[0][0])

    with open(path_doc + st_train_path, 'w') as out_f:
      for line_pair in tqdm(train_set):
        out_f.write(" ".join(str(x) for x in line_pair[0])+"\n")
    with open(path_doc + st_val_path, 'w') as out_f:
      for line_pair in tqdm(valid_set):
        out_f.write(" ".join(str(x) for x in line_pair[0])+"\n")
    with open(path_doc + tt_train_path, 'w') as out_f:
      for line_pair in tqdm(train_set):
        out_f.write(" ".join(str(x) for x in line_pair[1])+"\n")
    with open(path_doc + tt_val_path, 'w') as out_f:
      for line_pair in tqdm(valid_set):
        out_f.write(" ".join(str(x) for x in line_pair[1])+"\n")

In [4]:
def tokenized_data(vocab_size,tokenized_setting,max_l,path_doc,clean_src_path,
          clean_tgt_path,src_lang,tgt_lang,st_train_path,st_val_path,
          tt_train_path,tt_val_path,):
  tokenized(path_doc + clean_src_path,vocab_size,src_lang,tokenized_setting)
  tokenized(path_doc + clean_tgt_path,vocab_size,tgt_lang,tokenized_setting)
  src_tokenizer,tgt_tokenizer = get_tokenizers(path_doc,vocab_size,src_lang,tgt_lang)
  data_set_preparing(path_doc,clean_src_path,clean_tgt_path,max_l,src_tokenizer,
           tgt_tokenizer,st_train_path,st_val_path,tt_train_path,tt_val_path)
  return src_tokenizer,tgt_tokenizer

# test tokenized_data()
# src_tokenizer,tgt_tokenizer = tokenized_data(
#     vocab_size = setting["tokenized_setting"]["vocab_size"],
#     tokenized_setting = {k:setting["tokenized_setting"][k] for k in \
#               set(list(setting["tokenized_setting"].keys()))-{"vocab_size","max_l"}},
#     max_l = setting["tokenized_setting"]["max_l"],
#     path_doc = setting["data_info"]["document"],
#     clean_src_path = setting["data_info"]["source"]["clean_data_path"],
#     clean_tgt_path = setting["data_info"]["target"]["clean_data_path"],
#     src_lang = setting["data_info"]["source"]["lang"],
#     tgt_lang = setting["data_info"]["target"]["lang"],
#     st_train_path = setting["data_info"]["source"]["tokenized_train_data"],
#     st_val_path = setting["data_info"]["source"]["tokenized_val_data"],
#     tt_train_path = setting["data_info"]["target"]["tokenized_train_data"],
#     tt_val_path = setting["data_info"]["target"]["tokenized_val_data"])

make data set
------
> Using tokenized data to make dataset.  
> Classmethod : padding_mask_batch which  
> where the key padding mask is constucted  
> also defined here.

In [5]:
import torch
from tqdm import tqdm
import numpy as np
from torch.utils.data import Dataset

class myDataset(Dataset):
  def __init__(self,src_path,tgt_path):

    self.src_path = src_path
    self.tgt_path = tgt_path

    src_list = []
    with open(self.src_path,"r") as f :
      d_l = f.readlines()
      for line in tqdm(d_l):
        int_list = [int(i) for i in line.split()]
        src_list.append(int_list)
    self.src = torch.LongTensor(src_list)

    tgt_list = []
    with open(self.tgt_path,"r") as f :
      l_l = f.readlines()
      for line in tqdm(l_l):
        int_list = [int(i) for i in line.split()]
        tgt_list.append(int_list)
    self.tgt = torch.LongTensor(tgt_list)

  def __len__(self):
    return len(self.src)

  def __getitem__(self, index):
    return self.src[index], self.tgt[index]

  # make key padding mask
  @classmethod
  def padding_mask_batch(cls,batch,pad_id):
    """Collate a batch of data."""
    src, tgt = zip(*batch)
    src = torch.stack(src)
    tgt = torch.stack(tgt)
    src_padding = (src == pad_id)
    tgt_padding = (tgt == pad_id)

    return src, tgt , src_padding, tgt_padding
# test myDataset
# data = []
# with open("/content/train_dev/tokenized_train_data_en.txt","r") as f :
#   d_l = f.readlines()
#   for line in tqdm(d_l):
#     int_list = [int(i) for i in line.split()]
#     data.append(int_list)
# print(data[0])

In [6]:
from torch.utils.data import DataLoader
import gc
def get_data_set(train_batch_size,valid_batch_size,num_workers,path_doc,
         st_train_path,st_val_path,tt_train_path,tt_val_path,pad_id):

  train_set = myDataset(src_path = path_doc + st_train_path,
              tgt_path = path_doc + tt_train_path,
              )
  valid_set = myDataset(src_path = path_doc + st_val_path,
              tgt_path = path_doc + tt_val_path,
              )
  train_loader = DataLoader(
    train_set,
    batch_size = train_batch_size,
    shuffle = True,
    num_workers = num_workers,
    pin_memory = True,
    collate_fn = lambda x : myDataset.padding_mask_batch(x,
                   pad_id = pad_id)
  )
  valid_loader = DataLoader(
    valid_set,
    batch_size = valid_batch_size,
    num_workers = num_workers,
    pin_memory = True,
    collate_fn = lambda x : myDataset.padding_mask_batch(x,
                   pad_id = pad_id)
  )
  del train_set,valid_set
  gc.collect()
  return train_loader,valid_loader
# test get_data_set()
# train_set,valid_set = get_data_set()
# batch = next(iter(valid_set))
# src,tgt,src_mask,tgt_mask = batch
# print(src.shape)

make model
======
positional encoding layer
------
>The first layer is embedding layer, where each integers  
>in encoder sentence will be represent by a vector.   
>I use build-in class in pytorch to finish these part,    
>and combine it with encoder layers to form my encoder.

>The layer below is the second layer :positional encoding layer  
>in this layer the position infomation is add to each "word"  
>in the sentence.
>Here I use parameters instead of constant as  
>position infomation so they will change during training process.

In [7]:
import torch
import torch.nn as nn
class Positional_Encoding(nn.Module):
    def __init__(self,max_sentence_length,embedding_dimension):
      super().__init__()
      self.dropout = nn.Dropout(0.1)
      self.encoding_values = nn.Parameter(nn.init.normal_(torch.empty(max_sentence_length,1, embedding_dimension)))
    def forward(self, x):
        # the shape of x : [batch,length,e_dim]
        # the shape of self.encoding_values : [batch,length,e_dim]
        x = x + self.encoding_values.unsqueeze(0)
        x = x.squeeze(-2)
        return self.dropout(x)

multihead attention layer
------


In [8]:
import torch.nn.functional as F
import torchvision
import math
from torchinfo import summary
# This part is modify from pytorch : torch.nn.functional.scaled_dot_product_attention
# Efficient implementation equivalent to the following:
class Scaled_Dot_Product_Attention(nn.Module):
    def __init__(self,max_sentence_length,dropout_p):
      super().__init__()
      self.dropout_p = dropout_p
      self.max_l = max_sentence_length
      attn_bias = torch.zeros(self.max_l, self.max_l)
      temp_mask = torch.ones(self.max_l, self.max_l, dtype=torch.bool).tril(diagonal=0)
      attn_bias = attn_bias.masked_fill_(temp_mask.logical_not(), float("-inf"))
      self.register_buffer("attn_bias",attn_bias)

    def forward(self, is_last_batch, query, key, value, padding_mask=None, is_causal=False, scale=None) -> torch.Tensor:
      # Efficient implementation equivalent to the following:

      scale_factor = 1 / math.sqrt(query.size(-1)) if scale is None else scale
      attn_weight = query @ key.transpose(-2, -1) * scale_factor

      if is_causal:
        if is_last_batch:
          self.attn_bias = self.attn_bias[:query.size(-2),:query.size(-2)]
        self.attn_bias.to(query.dtype)
        attn_weight += self.attn_bias

      if padding_mask is not None:
          if padding_mask.dtype == torch.bool:
            padding_mask = torch.zeros_like(padding_mask,dtype = float).masked_fill_(padding_mask, (float("-inf")))

          padding_mask = padding_mask.unsqueeze(0).unsqueeze(0)
          padding_mask.to(query.dtype)

          attn_weight = attn_weight.transpose(-4,-2)
          attn_weight += padding_mask
          attn_weight = attn_weight.transpose(-4,-2)

      attn_weight = torch.softmax(attn_weight, dim=-1)
      attn_weight = torch.dropout(attn_weight, self.dropout_p, train=True)
      return attn_weight @ value
# test scaled_dot_product_attention
# t = torch.rand([2,3,4,5])
# mask = torch.tensor([[False,False,True,True],[False,True,False,True]],dtype = torch.bool)
# print(scaled_dot_product_attention("cpu",t,t,t,padding_mask= mask, is_causal=True))
# from torch.nn.functional import scaled_dot_product_attention
class My_MultiHeadedAttention(nn.Module):
    def __init__(self, max_sentence_length, kv_input_dimension, embedding_dimension, num_heads, dropout_p, if_decoder = False):
        '''
        embedding_dimension = input dimension
        note that there are residual sublayers in MultiHeadedAttention
        '''
        super().__init__()
        assert embedding_dimension % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.max_l = max_sentence_length
        self.kv_d = kv_input_dimension
        self.d = embedding_dimension
        self.num_heads = num_heads
        self.dropout_p = dropout_p
        self.is_causal = if_decoder
        self.sdpa = Scaled_Dot_Product_Attention(self.max_l,self.dropout_p)
        self.linear_for_q = nn.Linear(self.d, self.d)
        self.linear_for_kv = nn.Linear(self.kv_d, 2 * self.d)
        self.linear_out_project = nn.Linear(self.d, self.d)

    def forward(self, is_last_batch, q_input_data, kv_input_data , padding_mask = None):

        query = self.linear_for_q(q_input_data)
        key, value = self.linear_for_kv(kv_input_data).split(self.d,dim = -1)

        query,key,value = \
          map(lambda x : x.view(x.size(0),x.size(1),self.num_heads,self.d//self.num_heads),[query,key,value])
        query,key,value = \
          map(lambda x : x.transpose(-2,-3),[query,key,value])

        x = self.sdpa(is_last_batch,query,key,value,padding_mask = padding_mask,is_causal = self.is_causal)
        x = x.transpose(-2,-3).contiguous()
        x = x.view(x.size(0),x.size(1),self.d)
        x = self.linear_out_project(x)

        return x
# test My_MultiHeadedAttention
# model = My_MultiHeadedAttention(64,128,2,0.0)
# q_input = torch.rand(32,400,128)
# kv_input = torch.rand(32,400,64)
# mask = (torch.FloatTensor(32,400).uniform_() > 0.8)
# print(model(q_input,kv_input,mask).size())
# print(summary(model,device = "cpu",q_input_data = q_input, kv_input_data = kv_input,padding_mask = mask))

encoder layer(s)
------

In [9]:
import math
class My_Encoder_Layer(nn.Module):
  def __init__(self,max_sentence_length,embedding_dimension,feedforward_dimension,num_heads,dropout_p):
    super().__init__()
    self.max_l = max_sentence_length
    self.emb_dim = embedding_dimension
    self.fwd_dim = feedforward_dimension
    self.num_heads = num_heads
    self.dropout_p = dropout_p

    self.attention = My_MultiHeadedAttention(self.max_l, self.emb_dim, self.emb_dim, self.num_heads, self.dropout_p)
    self.layer_norm_attn = nn.LayerNorm(self.emb_dim)
    self.drop_out_attn_layernorm = nn.Dropout(self.dropout_p)

    self.feedforward = nn.Sequential(
    nn.Linear(self.emb_dim,self.fwd_dim),
    nn.ReLU(),
    nn.Linear(self.fwd_dim,self.emb_dim)
    )
    self.layer_norm_feedforward = nn.LayerNorm(self.emb_dim)
    self.drop_out_feedforward_layernorm = nn.Dropout(self.dropout_p)


  def forward(self,is_last_batch,x,padding_mask):
    x = x + self.attention(is_last_batch,x,x,padding_mask)
    x = self.layer_norm_attn(x)

    x = self.drop_out_attn_layernorm(x)

    x = x + self.feedforward(x)
    x = self.layer_norm_feedforward(x)
    x = self.drop_out_feedforward_layernorm(x)

    return x
# test My_Encoder_Layer
# model = My_Encoder_Layer("cpu",128,256,2,0.0)
# input = torch.rand((32,400,128))
# mask = (torch.FloatTensor(32,400).uniform_() > 0.8)
# print(model(input,mask).size())
# print(summary(model,input_data = input,padding_mask = mask))
# print(model.state_dict().keys())
class My_Encoder(nn.Module):
  def __init__(self,max_sentence_length,dictionary_length,embedding_dimension,feedforward_dimension,
         padding_idx, num_heads, dropout_p, layer_num):
    super().__init__()
    self.max_l = max_sentence_length
    self.dict_l = dictionary_length
    self.emb_dim = embedding_dimension
    self.fwd_dim = feedforward_dimension
    self.padding_idx = padding_idx
    self.num_heads = num_heads
    self.dropout_p = dropout_p
    self.layer_num = layer_num

    self.encoder_embedding = nn.Embedding(self.dict_l,self.emb_dim,self.padding_idx)
    self.positional_encoding = Positional_Encoding(self.max_l,self.emb_dim)
    self.encoder = nn.ModuleList([My_Encoder_Layer(self.max_l,self.emb_dim,self.fwd_dim,\
                    self.num_heads,self.dropout_p) for i in range(layer_num)])

  def forward(self,is_last_batch,input,padding_mask):
    x = self.encoder_embedding(input.unsqueeze(-1))* math.sqrt(self.emb_dim)
    x = self.positional_encoding(x)

    for index,module in enumerate(self.encoder):
      if index == 0:
        x = module(is_last_batch,x,padding_mask)
      else:
        x = module(is_last_batch,x,None)
    return x
# test My_Encoder
# model = My_Encoder("cpu",400,8000,128,256,0,2,0.0,2)
# input = torch.randint(0,7999,(32,400),dtype = torch.long)
# mask = (torch.FloatTensor(32,400).uniform_() > 0.8)
# print(model(input,mask).size())
# print(summary(model,input_data = input,padding_mask = mask))
# print(model.state_dict().keys())

decoder layer(s)
------

In [10]:
import math
class My_Decoder_Layer(nn.Module):
  def __init__(self,max_sentence_length,encoder_embedding_dimension,embedding_dimension,feedforward_dimension,num_heads,dropout_p):
    super().__init__()
    self.max_l = max_sentence_length
    self.encoder_dim = encoder_embedding_dimension
    self.emb_dim = embedding_dimension
    self.fwd_dim = feedforward_dimension
    self.num_heads = num_heads
    self.dropout_p = dropout_p

    self.self_attention = My_MultiHeadedAttention \
     (self.max_l,self.emb_dim,self.emb_dim, num_heads = self.num_heads,\
     dropout_p = self.dropout_p, if_decoder = True)
    self.layer_norm_sa = nn.LayerNorm(self.emb_dim)
    self.drop_out_sa = nn.Dropout(0)

    self.feedforward_sa = nn.Sequential(
    nn.Linear(self.emb_dim,self.fwd_dim),
    nn.ReLU(),
    nn.Linear(self.fwd_dim,self.emb_dim)
    )
    self.layer_norm_sa_fw = nn.LayerNorm(self.emb_dim)
    self.drop_out_sa_fw = nn.Dropout(0)

    self.cross_attention = My_MultiHeadedAttention \
    (self.max_l,self.encoder_dim, self.emb_dim, num_heads = self.num_heads,
    dropout_p = self.dropout_p, if_decoder = True)
    self.layer_norm_ca = nn.LayerNorm(self.emb_dim)
    self.drop_out_ca = nn.Dropout(0)

    self.feedforward_ca = nn.Sequential(
    nn.Linear(self.emb_dim,self.fwd_dim),
    nn.ReLU(),
    nn.Linear(self.fwd_dim,self.emb_dim)
    )
    self.layer_norm_ca_fw = nn.LayerNorm(self.emb_dim)
    self.drop_out_ca_fw = nn.Dropout(0)

  def forward(self,is_last_batch,encoder_input,input,padding_mask):

    x = input + self.self_attention(is_last_batch,input,input,padding_mask)
    x = self.layer_norm_sa(x)
    x = self.drop_out_sa(x)

    x = x + self.feedforward_sa(x)
    x = self.layer_norm_sa_fw(x)
    x = self.drop_out_sa_fw(x)

    x = x + self.cross_attention(is_last_batch,x,encoder_input,padding_mask)
    x = self.layer_norm_ca(x)
    x = self.drop_out_ca(x)

    x = x + self.feedforward_ca(x)
    x = self.layer_norm_ca_fw(x)
    x = self.drop_out_ca_fw(x)

    return x
class My_Decoder(nn.Module):
  def __init__(self,max_sentence_length, dictionary_length, encoder_embedding_dimension,
         embedding_dimension, feedforward_dimension, padding_idx, num_heads, dropout_p, layer_num):
    super().__init__()
    self.max_l = max_sentence_length
    self.dict_l = dictionary_length
    self.encoder_dim = encoder_embedding_dimension
    self.emb_dim = embedding_dimension
    self.fwd_dim = feedforward_dimension
    self.padding_idx = padding_idx
    self.num_heads = num_heads
    self.dropout_p = dropout_p
    self.layer_num = layer_num

    self.decoder_embedding = nn.Embedding(self.dict_l,self.emb_dim,padding_idx=self.padding_idx)
    self.positional_encoding = Positional_Encoding(self.max_l,self.emb_dim)
    self.decoder = nn.ModuleList([My_Decoder_Layer(self.max_l,self.encoder_dim,self.emb_dim,\
                    self.fwd_dim,self.num_heads,self.dropout_p) for i in range(self.layer_num)])
    # self.encoder = My_Encoder_Layer(self.emb_dim,self.fwd_dim)

    self.generator = nn.Linear(self.emb_dim,self.dict_l)

  def forward(self,is_last_batch,encoder_input,input,padding_mask):
    x = self.decoder_embedding(input.unsqueeze(-1))* math.sqrt(self.emb_dim)
    x = self.positional_encoding(x)
    # x = self.encoder(x,padding_mask)
    for index,module in enumerate(self.decoder):
      if index == 0:
        x = module(is_last_batch,encoder_input,x,padding_mask)
      else:
        x = module(is_last_batch,encoder_input,x,None)
    x = self.generator(x)
    x = F.log_softmax(x,dim = -1)
    return x
# test My_Decoder
# model = My_Decoder("cpu",400,8000,128,64,256,0,2,0.0,2)
# encoder_input = torch.rand(32,400,128)
# input = torch.randint(0,7999,(32,400),dtype = torch.long)
# mask = (torch.FloatTensor(32,400).uniform_() > 0.8)
# print(model(encoder_input = encoder_input,input = input, padding_mask = mask).size())
# print(summary(model,encoder_input = encoder_input,input = input, padding_mask = mask))
# print(model.state_dict().keys())

transformer layer
------

In [11]:
class My_Transformer(nn.Module):
  def __init__(self,max_sentence_length,dictionary_length,padding_idx,
         encoder_embedding_dimension,decoder_embedding_dimension,
         feedforward_dimension,num_heads,dropout_p,layer_num):
    super().__init__()
    self.max_l = max_sentence_length
    self.dict_l = dictionary_length
    self.padding_idx = padding_idx
    self.en_dim = encoder_embedding_dimension
    self.de_dim = decoder_embedding_dimension
    self.fw_dim = feedforward_dimension
    self.num_heads = num_heads
    self.dropout_p = dropout_p
    self.layer_num = layer_num
    self.encoder = My_Encoder \
     (self.max_l,self.dict_l,self.en_dim,self.fw_dim,
      self.padding_idx,self.num_heads,self.dropout_p,self.layer_num)
    self.decoder = My_Decoder \
     (self.max_l,self.dict_l,self.en_dim,self.de_dim,self.fw_dim,
      self.padding_idx,self.num_heads,self.dropout_p,self.layer_num)

  def forward(self,is_last_batch,src,tgt,src_mask,tgt_mask):
    memory = self.encoder(is_last_batch,src,src_mask)
    outputs = self.decoder(is_last_batch,memory,tgt,tgt_mask)
    return outputs

def build_model(max_sentence_length,dictionary_length,padding_idx,encoder_embedding_dimension,
         decoder_embedding_dimension,feedforward_dimension,num_heads,dropout_p,layer_num):
  return My_Transformer(max_sentence_length,dictionary_length,padding_idx,
              encoder_embedding_dimension,decoder_embedding_dimension,
              feedforward_dimension,num_heads,dropout_p,layer_num)
# test My_Transformer
# model = My_Transformer("cpu",400,8000,0,128,64,256,2,0,2)
# src = torch.randint(0,8000,(32,400),dtype = torch.long)
# tgt = torch.randint(0,8000,(32,400),dtype = torch.long)
# src_mask = torch.cat(((torch.FloatTensor(32,200).uniform_() > 1),(torch.FloatTensor(32,200).uniform_() > 0.15)),dim =1)
# tgt_mask = torch.cat(((torch.FloatTensor(32,100).uniform_() > 1),(torch.FloatTensor(32,300).uniform_() > 0.15)),dim =1)
# out = model(src,tgt,src_mask,tgt_mask)
# print(out.size(),out.dim(),out[0][0])
# print(summary(model,src = src,tgt = tgt,src_mask = src_mask,tgt_mask = tgt_mask))
# print(model.state_dict().keys())

# test build_model
# model = build_model()
# batch = next(iter(train_set))
# src, tgt, src_mask, tgt_mask = batch
# print(type(src),src.shape)
# print(summary(model,src = src,tgt = tgt,src_mask = src_mask,tgt_mask = tgt_mask))
# outputs = model(src,tgt,src_mask,tgt_mask)
# print(outputs.shape)

training and validation process
======
Noam optimizer
------

In [12]:
# reference : https://nlp.seas.harvard.edu/2018/04/03/attention.html
class NoamOpt:
    def __init__(self,dictionary_length,factor,warmup,optimizer):
        self.dict_len = dictionary_length
        self.factor = factor
        self.warmup = warmup
        self.optimizer = optimizer
        self._step = 0
        self._rate = 0
    def step(self):
        self._step += 1
        self._rate = self.factor *(self.dict_len ** (-0.5) * \
        min(self._step ** (-0.5), self._step * self.warmup ** (-1.5)))

        self.optimizer.param_groups[0]["lr"] = self._rate
        self.optimizer.step()
    def zero_grad(self):
        return self.optimizer.zero_grad()
# test NoamOpt:
# x = torch.optim.Adam(model.parameters(), lr=0, betas=(0.9, 0.98), eps=1e-9)
# x.param_groups[0]["lr"]

label smoothing
------

In [13]:
import torch
import torch.nn as nn
class LabelSmoothedCrossEntropyCriterion(nn.Module):
  def __init__(self,batch_size,dictionary_length,padding_id,smoothing):
        super().__init__()
        self.dict_len = dictionary_length
        self.smoothing = smoothing
        self.padding_id = padding_id
        shift = torch.full(size = (batch_size,1), dtype = torch.long, fill_value = self.padding_id)
        self.register_buffer("shift",shift)
  def forward(self, is_last_batch, outputs , label):

    # step1 : when using label in validation, shift is needed.
    # label_shift : {type : tensor , shape : batch  X (max_sentence_length-1)
    # value : int}
    label_shift = label[:,1:]
    # shift : {type : tensor , shape : batch  X 1 ,value : self.padding_id}

    # label_shift : {type : tensor , shape : batch  X max_sentence_length
    # value : int}
    if is_last_batch:
      label_shift = torch.cat((label_shift,self.shift[:label.size(0),:]),dim = 1)

    else:
      label_shift = torch.cat((label_shift,self.shift),dim = 1)

    # step2 : convert label to onehot tensor, then apply label smoothing
    # label_onehot : {type : tensor , shape : batch  X max_sentence_length X dictionary_length
    # value : 0 or 1}
    label_onehot = F.one_hot(label_shift,self.dict_len).float()
    # add : {type : float}
    add = self.smoothing / (self.dict_len)
    # label_onehot : {type : tensor , shape : batch  X max_sentence_length X dictionary_length
    # value : add or 1+add}
    label_onehot += add
    # label_smoothed : {type : tensor , shape : batch  X max_sentence_length X dictionary_length
    # value : add or 1+add-self.smoothing}
    label_smoothed = label_onehot.masked_fill_((label_onehot > 1),float(1-self.smoothing+add))

    '''
    Question : Is padding really needed?
    '''
    # step3 : use padding mask to ignore to loss from padding id, then calculate loss.
    # loss : {type : tensor , shape : batch  X max_sentence_length X 1, value : float}
    loss = -1*torch.sum((outputs*label_smoothed),dim = -1)
    # label_padding_mask {type : tensor , shape : batch  X max_sentence_length, value : bool}
    label_padding_mask = (label == self.padding_id)
    # mask_loss : {type : tensor , shape : batch  X max_sentence_length,
    # value : 0 or add or 1+add-self.smoothing}
    mask_loss = loss.masked_fill_(label_padding_mask,0)
    # # ignore_index_number : {type : int}
    # ignore_index_number = (mask_loss == 0).sum().item()
    # avg_loss : {type : int}
    # avg_loss = mask_loss.sum()/(mask_loss.size(0)*mask_loss.size(1)-ignore_index_number)
    avg_loss = mask_loss.sum()/mask_loss.size(0)
    return(avg_loss)

# test LabelSmoothedCrossEntropyCriterion
# cal1 = LabelSmoothedCrossEntropyCriterion()
# print(cal1(outputs,tgt))

# ignore_index not work correctly
# def LabelSmoothedCrossEntropy(outputs , label,dictionary_length,smooth,padding_id):
#   print(outputs.shape)
#   print(label.shape)
#   label_onehot = label.transpose(-1,-2).squeeze()
#   outputs = outputs.transpose(-1,-2)
#   cal_loss = nn.CrossEntropyLoss(ignore_index = padding_idx,reduction = "mean", label_smoothing=smooth)
#   return cal_loss(outputs,label_onehot)

In [14]:
# see https://arxiv.org/pdf/1512.00567.pdf page 7

#Ref 1 : Hong-Yi Li ML2021 HW5

# class LabelSmoothedCrossEntropyCriterion(nn.Module):
#     def __init__(self, smoothing, ignore_index=None, reduce=True):
#         super().__init__()
#         self.smoothing = smoothing
#         self.ignore_index = ignore_index
#         self.reduce = reduce

#     def forward(self, lprobs, target):
#         if target.dim() == lprobs.dim() - 1:
#             target = target.unsqueeze(-1)
#         # nll: Negative log likelihood，the cross-entropy when target is one-hot. following line is same as F.nll_loss
#         nll_loss = -lprobs.gather(dim=-1, index=target)
#         #  reserve some probability for other labels. thus when calculating cross-entropy,
#         # equivalent to summing the log probs of all labels
#         smooth_loss = -lprobs.sum(dim=-1, keepdim=True)
#         if self.ignore_index is not None:
#             pad_mask = target.eq(self.ignore_index)
#             nll_loss.masked_fill_(pad_mask, 0.0)
#             smooth_loss.masked_fill_(pad_mask, 0.0)
#         else:
#             nll_loss = nll_loss.squeeze(-1)
#             smooth_loss = smooth_loss.squeeze(-1)
#         if self.reduce:
#             nll_loss = nll_loss.sum()
#             smooth_loss = smooth_loss.sum()
#         # when calculating cross-entropy, add the loss of other labels
#         eps_i = self.smoothing / lprobs.size(-1)
#         loss = (1.0 - self.smoothing) * nll_loss + eps_i * smooth_loss
#         return loss

#Ref 2 : By hemingkx : https://github.com/hemingkx/ChineseNMT

# class LabelSmoothing(nn.Module):
#     """Implement label smoothing."""

#     def __init__(self, size, padding_idx, smoothing=0.0):
#         super(LabelSmoothing, self).__init__()
#         self.criterion = nn.KLDivLoss(size_average=False)
#         self.padding_idx = padding_idx
#         self.confidence = 1.0 - smoothing
#         self.smoothing = smoothing
#         self.size = size
#         self.true_dist = None


#     def forward(self, x, target):
#         assert x.size(1) == self.size
#         true_dist = x.data.clone()
#         true_dist.fill_(self.smoothing / (self.size - 2))
#         true_dist.scatter_(1, target.data.unsqueeze(1), self.confidence)
#         true_dist[:, self.padding_idx] = 0
#         mask = torch.nonzero(target.data == self.padding_idx)
#         if mask.dim() > 0:
#             true_dist.index_fill_(0, mask.squeeze(), 0.0)
#         self.true_dist = true_dist
#         return self.criterion(x, Variable(true_dist, requires_grad=False))

beam search
------

In [15]:
from tqdm import tqdm
class Decode_With_Beam_Search(nn.Module):
    def __init__(self,batch_size,model,beam_num,max_sentence_length,
           dictionary_length,bos_id,padding_id):
      super().__init__()
      self.batch_size = batch_size
      self.model = model
      self.beam_num = beam_num
      self.max_sentence_length = max_sentence_length
      self.dictionary_length = dictionary_length
      self.bos_id = bos_id
      self.padding_id = padding_id
      # decoder_input : {type : tensor , shape : Batch X 1 , value : bos_id}
      decoder_input = torch.full(size = (self.batch_size,1),fill_value = self.bos_id)
      self.register_buffer("decoder_input",decoder_input)
      # repeat : {type : tensor , shape : Batch ,value : beam_num}
      # each row repeat beam_num times before concatenate
      repeat = torch.full([self.batch_size],fill_value = self.beam_num)
      self.register_buffer("repeat",repeat)
      # decoder_probability {type : tensor , shape : Batch X beam_num X 1, value : 0.1}
      decoder_probability = torch.full(size = (self.batch_size,self.beam_num,1),fill_value = 0.0)
      self.register_buffer("decoder_probability",decoder_probability)

      # padding : {type : tensor , shape : (Batch X beam_num) X (max_sentence_length-(id+1)) ,value : int}
      padding = torch.full(size = (batch_size*self.beam_num,self.max_sentence_length),fill_value = self.padding_id)
      self.register_buffer("padding",padding)

      # row : {type : tensor , shape : batch X 1, value : [[0],[1],[2],...]}
      row = torch.tensor(range(self.batch_size)).unsqueeze(1)
      self.register_buffer("row",row)

    def forward(self,is_last_batch,src,src_mask):

      if is_last_batch:
        batch = src.size(0)
      else :
        batch = self.batch_size

      if self.beam_num > batch:
        beam_num = batch
      else :
        beam_num = self.beam_num

      decoder_input = self.decoder_input[:batch,:]
      repeat = self.repeat[:batch]
      decoder_probability = self.decoder_probability[:batch,:,]
      padding = self.padding[:batch*beam_num,:]
      row = self.row[:batch,:]

      # decoder_beam_expand : {type : tensor , shape : (Batch X beam_num) X 1 ,value : bos_id}
      decoder_beam_expand = torch.repeat_interleave(decoder_input,repeat,dim=0)

      # memory : {type : tensor , shape : Batch X max_sentence_length X encoder_output_dim ,value : arbitary float}
      memory = self.model.encoder(is_last_batch,src,src_mask)
      # memory_beam_expand : {type : tensor ,
      # shape : (Batch X n_beam) X max_sentence_length X encoder_output_dim ,value : float}
      memory_beam_expand = torch.repeat_interleave(memory,repeat,dim=0)

      gc.collect()

      for id in range(self.max_sentence_length-1):

        # decoder_n_beam : {type : tensor , shape : (Batch X beam_num) X (id+1) ,value : int}
        # decoder_probability {type : tensor , shape : Batch X beam_num X 1 , value : log_softmax probability}
        new_decoder_beam_expand , new_decoder_probability = self.get_next_word(is_last_batch,memory_beam_expand,
        decoder_beam_expand,decoder_probability,id,batch,beam_num,padding,repeat,row)

        decoder_beam_expand,decoder_probability = new_decoder_beam_expand , new_decoder_probability
        if id%10 ==0:
          print(new_decoder_beam_expand[0])
        gc.collect()

      # out_beam_expand : {type : tensor , shape : Batch X beam_num X (max_sentence_length) ,value : 0 or 1}
      decoder_beam_expand = decoder_beam_expand.view(batch,beam_num,self.max_sentence_length)
      # max_probability : {type : tensor , shape :  Batch  X 1 ,value : int(max prob index)}
      max_probability = torch.argmax(input = decoder_probability,dim = 1)
      # max_probability_expand : {type : tensor , shape :  Batch  X 1 X max_sentence_length ,
      # value : [[A,A,A....],[B,B,B...],...](A,B are int)}
      max_probability_expand = max_probability.expand(batch, self.max_sentence_length).unsqueeze(1)
      # decoder_out : {type : tensor , shape :  Batch X max_sentence_length ,
      # value : [[int,int,...],[int,int...],...]}
      decoder_out =  torch.gather(input = decoder_beam_expand ,dim = 1, index = max_probability_expand).squeeze(1)

      print(decoder_out[0])
      return decoder_out,F.one_hot(decoder_out,self.dictionary_length).float()

    def get_next_word(self,is_last_batch,memory,out,out_probability,id,batch,beam_num,padding,repeat,row):
      # padding : {type : tensor , shape : (Batch X beam_num) X (max_sentence_length-(id+1)) ,value : int}
      padding = padding[:,:self.max_sentence_length-(id+1)]
      # out_padding : {type : tensor , shape : (Batch X beam_num) X max_sentence_length,
      # value : [[bos_id],[any_id],...[padding_id],....] X Batch}
      out_padding = torch.cat((out,padding),dim = 1)
      # tgt_padding : {type : tensor , shape : (Batch X beam_num) X max_sentence_length ,value: bool}
      tgt_padding = (out_padding == self.padding_id).squeeze(-1)
      # out_add : {type : tensor , shape : Batch X beam_num X dictionary_length ,value : int}
      out_add = self.model.decoder(is_last_batch,memory,out_padding,tgt_padding)[:,id,:]\
            .view(batch,beam_num,self.dictionary_length)
      # out_n_beam : {type : tensor , shape : (Batch X beam_num) X (id+1) ,value : int}
      # out_probability {type : tensor , shape : Batch X beam_num X 1 , value : log_softmax probability}
      out , out_probability = self.beam_search_one_step(batch,beam_num,repeat,row,out,out_probability,out_add)

      gc.collect()
      return(out , out_probability)

    def beam_search_one_step(self,batch,beam_num,repeat,row,sentences,p_sentences,n_beam_output):
    # sentences : {type : tensor , shape : (batch X beam_num) X now_sentences_length X 1 value : int}
    # p_sentences : {type : tensor , shape : batch X beam_num X 1 value : log_softmax probability}
    # n_beam_output : {type : tensor , shape : batch X beam_num X dictionary_length,
    # value : [P1,P2,P3...] X beam_num times (Pk in [0,1])}

      '''
      TO DO : (set beam num = K)
      for every batch:
      expand sentences(total number = K) K times (so there are K-square sentences),then concat with
      the index of top K consequence of each beam(total K beams) in n_beam_output (so there are also K-square values).
      '''
      # sentences : {type : tensor , shape : batch X beam_num X now_sentences_length value : int}
      sentences = sentences.view(batch,beam_num,-1)
      # repeat : {type : tensor , shape : beam_num ,value : beam_num}
      # each row repeat beam_num times before concatenate
      repeat = repeat[:beam_num]
      # sentences_expand : {type : tensor , shape : batch X (beam_num X beam_num) X now_sentences_length ,
      # value : [[[A,B...] X beam_num times,[C,D...] X beam_num times}...] A,B,C,D...are int}
      sentences_expand = torch.repeat_interleave(sentences,repeat,dim=1)

      # topk_prob : {type : tensor , shape : batch X beam_num X beam_num, value : log_softmax probability}
      # topk_index : {type : tensor , shape : batch X beam_num X beam_num, value : int}
      topk_prob, topk_index = torch.topk(n_beam_output,dim = -1,k = beam_num)

      # topk_index : {type : tensor , shape : batch X (beam_num X beam_num) X 1, value : int}
      topk_index = topk_index.view(batch,-1,1)
      # sentences : {type : tensor , shape : batch X (beam_num X beam_num) X (now_sentences_length+1), value : int}
      sentences_expand = torch.cat((sentences_expand,topk_index),dim = -1)
      '''
      TO DO :
      multipies p_sentences with the probability of top K consequence of each beam(total K beams) in n_beam_output
      (so there are also K-square values).

      The final step is to choose Top K consequence from K-square sentences by using p_sentences.
      '''

      # p_sentences : {type : tensor , shape : batch X (beam_num X beam_num),
      # value : [P1,P2,P3...] X beam_num times (Pk is log_softmax probability)}
      p_sentences = (p_sentences+topk_prob).view(batch,-1)
      # p_sentences : {type : tensor , shape : batch X beam_num, value : log_softmax probability}
      # p_index : {type : tensor , shape : batch X beam_num, value : int}
      p_sentences, p_index = torch.topk(p_sentences, dim = 1, k = beam_num)
      p_sentences = p_sentences.unsqueeze(-1)
      # row : {type : tensor , shape : batch X 1, value : [[0],[1],[2],...]}
      # sentences : {type : tensor , shape : batch X beam_num X (now_sentences_length+1), value : log_softmax probability}
      new_sentences = sentences_expand[row, p_index].view(batch*beam_num,-1)
      sentences.data = new_sentences.data
      gc.collect()
      return sentences,p_sentences

# test decode_with_beam_search
# batch = 3
# beam_num = 2
# sentences = torch.randint(0,8000,(batch*beam_num,5))
# p_sentences = torch.log(torch.rand((batch , beam_num , 1)))
# n_beam_output = torch.rand((batch , beam_num , 8000))
# print(sentences,p_sentences,n_beam_output)
# print(beam_search_one_step(sentences,p_sentences,n_beam_output))
# repeat = torch.full([beam_num],fill_value = beam_num)
# sentences_expand = torch.repeat_interleave(sentences.view(batch,beam_num,-1),repeat,dim=1)
# print(sentences_expand,sentences_expand.shape)
# decode_model = Decode_With_Beam_Search(32,model,2,400,8000,2,0)
# outputs_in_word,outputs = decode_model(False,src,src_mask)
# print(output[0])

In [16]:
from tqdm import tqdm
def beam_search_one_step(device,sentences,p_sentences,n_beam_output):
    # sentences : {type : tensor , shape : (batch X beam_num) X now_sentences_length X 1 value : int}
    # p_sentences : {type : tensor , shape : batch X beam_num X 1 value : log_softmax probability}
    # n_beam_output : {type : tensor , shape : batch X beam_num X dictionary_length,
    # value : [P1,P2,P3...] X beam_num times (Pk in [0,1])}

    '''
    TO DO : (set beam num = K)
    for every batch:
    expand sentences(total number = K) K times (so there are K-square sentences),then concat with
    the index of top K consequence of each beam(total K beams) in n_beam_output (so there are also K-square values).
    '''
    batch = n_beam_output.size(0)
    beam_num = n_beam_output.size(1)
    # sentences : {type : tensor , shape : batch X beam_num X now_sentences_length value : int}
    sentences = sentences.view(batch,beam_num,-1)
    # repeat : {type : tensor , shape : beam_num ,value : beam_num}
    # each row repeat beam_num times before concatenate
    repeat = torch.full([beam_num],fill_value = beam_num)
    repeat = repeat.to(device)
    # sentences_expand : {type : tensor , shape : batch X (beam_num X beam_num) X now_sentences_length ,
    # value : [[[A,B...] X beam_num times,[C,D...] X beam_num times}...] A,B,C,D...are int}
    sentences_expand = torch.repeat_interleave(sentences,repeat,dim=1)

    # topk_prob : {type : tensor , shape : batch X beam_num X beam_num, value : log_softmax probability}
    # topk_index : {type : tensor , shape : batch X beam_num X beam_num, value : int}
    topk_prob, topk_index = torch.topk(n_beam_output,dim = -1,k = beam_num)

    # topk_index : {type : tensor , shape : batch X (beam_num X beam_num) X 1, value : int}
    topk_index = topk_index.view(batch,-1,1)
    # sentences : {type : tensor , shape : batch X (beam_num X beam_num) X (now_sentences_length+1), value : int}
    sentences_expand = torch.cat((sentences_expand,topk_index),dim = -1)
    '''
    TO DO :
    multipies p_sentences with the probability of top K consequence of each beam(total K beams) in n_beam_output
    (so there are also K-square values).

    The final step is to choose Top K consequence from K-square sentences by using p_sentences.
    '''

    # p_sentences : {type : tensor , shape : batch X (beam_num X beam_num),
    # value : [P1,P2,P3...] X beam_num times (Pk is log_softmax probability)}
    p_sentences = (p_sentences+topk_prob).view(batch,-1)
    # p_sentences : {type : tensor , shape : batch X beam_num, value : log_softmax probability}
    # p_index : {type : tensor , shape : batch X beam_num, value : int}
    p_sentences, p_index = torch.topk(p_sentences, dim = 1, k = beam_num)
    p_sentences = p_sentences.unsqueeze(-1)
    # row : {type : tensor , shape : batch X 1, value : [[0],[1],[2],...]}
    row = torch.tensor(range(batch)).unsqueeze(1)
    row = row.to(device)
    # sentences : {type : tensor , shape : batch X beam_num X (now_sentences_length+1), value : log_softmax probability}
    new_sentences = sentences_expand[row, p_index].view(batch*beam_num,-1)
    sentences.data = new_sentences.data
    gc.collect()
    return sentences,p_sentences

def get_next_word(model,is_last_batch,device,memory,out,out_probability,id,batch,beam_num,max_sentence_length,dictionary_length,padding_id):
    # padding : {type : tensor , shape : (Batch X beam_num) X (max_sentence_length-(id+1)) ,value : int}
    padding = torch.full(size = (batch*beam_num,max_sentence_length-(id+1)),fill_value = padding_id)
    padding = padding.to(device)
    # out_padding : {type : tensor , shape : (Batch X beam_num) X max_sentence_length,
    # value : [[bos_id],[any_id],...[padding_id],....] X Batch}
    out_padding = torch.cat((out,padding),dim = 1)
    # tgt_padding : {type : tensor , shape : (Batch X beam_num) X max_sentence_length ,value: bool}
    tgt_padding = (out_padding == padding_id).squeeze(-1)
    # out_add : {type : tensor , shape : Batch X beam_num X dictionary_length ,value : int}
    out_add = model.decoder(is_last_batch,memory,out_padding,tgt_padding)[:,id,:].view(batch,beam_num,dictionary_length)
    # out_n_beam : {type : tensor , shape : (Batch X beam_num) X (id+1) ,value : int}
    # out_probability {type : tensor , shape : Batch X beam_num X 1 , value : log_softmax probability}
    out , out_probability = beam_search_one_step(device,out,out_probability,out_add)

    gc.collect()
    return(out , out_probability)

def decode_with_beam_search(device,is_last_batch,model,src,src_mask,beam_num,max_sentence_length,
               dictionary_length,bos_id,padding_id):
    with torch.no_grad():
      batch = src.size(0)
      # decoder_input : {type : tensor , shape : Batch X 1 , value : bos_id}
      decoder_input = torch.full(size = (batch,1),fill_value = bos_id)
      decoder_input = decoder_input.to(device)
      # repeat : {type : tensor , shape : Batch ,value : beam_num}
      # each row repeat beam_num times before concatenate
      repeat = torch.full([batch],fill_value = beam_num)
      repeat = repeat.to(device)
      # decoder_beam_expand : {type : tensor , shape : (Batch X beam_num) X 1 ,value : bos_id}
      decoder_beam_expand = torch.repeat_interleave(decoder_input,repeat,dim=0)

      # decoder_probability {type : tensor , shape : Batch X beam_num X 1, value : 0.1}
      decoder_probability = torch.full(size = (batch,beam_num,1),fill_value = 0.0)
      decoder_probability = decoder_probability.to(device)

      # memory : {type : tensor , shape : Batch X max_sentence_length X encoder_output_dim ,value : arbitary float}
      memory = model.encoder(is_last_batch,src,src_mask)
      # memory_beam_expand : {type : tensor ,
      # shape : (Batch X n_beam) X max_sentence_length X encoder_output_dim ,value : float}
      memory_beam_expand = torch.repeat_interleave(memory,repeat,dim=0)

      gc.collect()

      for id in range(max_sentence_length-1):

        # decoder_n_beam : {type : tensor , shape : (Batch X beam_num) X (id+1) ,value : int}
        # decoder_probability {type : tensor , shape : Batch X beam_num X 1 , value : log_softmax probability}
        new_decoder_beam_expand , new_decoder_probability = \
        get_next_word(model,is_last_batch,device,memory_beam_expand,decoder_beam_expand,
               decoder_probability,id,batch,beam_num,max_sentence_length,dictionary_length,padding_id)

        decoder_beam_expand,decoder_probability = new_decoder_beam_expand , new_decoder_probability

        gc.collect()

      # out_beam_expand : {type : tensor , shape : Batch X beam_num X (max_sentence_length) ,value : 0 or 1}
      decoder_beam_expand = decoder_beam_expand.view(batch,beam_num,max_sentence_length)
      # max_probability : {type : tensor , shape :  Batch  X 1 ,value : int(max prob index)}
      max_probability = torch.argmax(input = decoder_probability,dim = 1)
      # max_probability_expand : {type : tensor , shape :  Batch  X 1 X max_sentence_length ,
      # value : [[A,A,A....],[B,B,B...],...](A,B are int)}
      max_probability_expand = max_probability.expand(batch, max_sentence_length).unsqueeze(1)
      # decoder_out : {type : tensor , shape :  Batch X max_sentence_length ,
      # value : [[int,int,...],[int,int...],...]}
      decoder_out =  torch.gather(input = decoder_beam_expand ,dim = 1, index = max_probability_expand).squeeze(1)
    print(decoder_out[0])
    return decoder_out,F.one_hot(decoder_out,dictionary_length).float()

# test decode_with_beam_search
# batch = 3
# beam_num = 2
# sentences = torch.randint(0,8000,(batch*beam_num,5))
# p_sentences = torch.log(torch.rand((batch , beam_num , 1)))
# n_beam_output = torch.rand((batch , beam_num , 8000))
# print(sentences,p_sentences,n_beam_output)
# print(beam_search_one_step(sentences,p_sentences,n_beam_output))
# repeat = torch.full([beam_num],fill_value = beam_num)
# sentences_expand = torch.repeat_interleave(sentences.view(batch,beam_num,-1),repeat,dim=1)
# print(sentences_expand,sentences_expand.shape)
# output = decode_with_beam_search(model,src,src_mask,2,400,8000,)
# print(output[0])

bleu
------

In [17]:
import numpy as np
from torcheval.metrics.functional.text import bleu
def get_bleu_score(outputs,tgt,tgt_tokenizer,eos_id):
    outputs = np.array(outputs.detach().tolist())
    outputs = [x[:np.nonzero(x == eos_id)[0][0]].tolist() if len(np.nonzero(x == eos_id)[0])>0 \
              else x.tolist() for x in outputs ]

    outputs_decode = tgt_tokenizer.decode(outputs)
    out = [outputs_decode[i] for i in range(len(outputs_decode)) if len(outputs_decode[i])>= 4]
    print(out)
    out = [" ".join(list(x)) for x in out]
    print(out)
    tgt_decode = tgt_tokenizer.decode(tgt.detach().tolist())
    tgt = [tgt_decode[i] for i in range(len(outputs_decode)) if len(outputs_decode[i])>= 4]
    print(tgt)
    tgt = [" ".join(list(x)) for x in tgt]
    print(tgt)
    return bleu.bleu_score(out, tgt, n_gram=4).detach().item()
# test bleu
# test_tokenizer = spm.SentencePieceProcessor(model_file = "/content/spm_8000_zh.model")
# candidates = torch.tensor([[21,3,9,99,42],[5,78,89,3,31]])
# references = torch.tensor([[18,5,9,3,42],[3,5,78,89,50]])
# get_bleu_score(candidates,references,test_tokenizer,3)

train and validation function
------

In [18]:
from tqdm import tqdm
import torch
import torch.nn as nn
from torch.optim import AdamW

def train_one_epoch(device,model,loss_calculator,is_last_batch,
          src,tgt,src_mask,tgt_mask,dictionary_length,
          optimizer):

    outputs = model(is_last_batch,src,tgt,src_mask,tgt_mask)

    loss = loss_calculator(is_last_batch,outputs,tgt)
    loss.backward()

    optimizer.step()
    optimizer.zero_grad()
    return loss.detach().item(),outputs[0].detach()

def valid(device,model,loss_calculator,batch_size_setting,valid_loader,beam_num,max_sentence_length,
      dictionary_length,bos_id,eos_id,pad_id,tgt_tokenizer):
    batch_loss = []
    batch_bleu_score = []
    with torch.no_grad():
      for val_batch in tqdm(valid_loader,desc="valid_step", unit=" step"):
        src,tgt,src_mask,tgt_mask = val_batch
        src,tgt,src_mask = src.to(device),tgt.to(device),src_mask.to(device)

        batch_size = src.size(0)

        is_last_batch = False
        if batch_size != batch_size_setting:
          is_last_batch = True
        decode_model = Decode_With_Beam_Search(batch_size,model,beam_num,max_sentence_length,
                            dictionary_length,bos_id,pad_id)
        decode_model.to(device)
        outputs_in_word,outputs = decode_model(is_last_batch,src,src_mask)
        # outputs_in_word,outputs = decode_with_beam_search(device,is_last_batch,model,src,src_mask,beam_num,\
        #       max_sentence_length,dictionary_length,bos_id,pad_id)


        loss = loss_calculator(is_last_batch,outputs,tgt)

        bleu_score = get_bleu_score(outputs_in_word,tgt,tgt_tokenizer,eos_id)

        batch_loss.append(loss)
        batch_bleu_score.append(bleu_score)

      avg_valid_loss = batch_loss.sum()/len(batch_loss).detach().item()
      avg_bleu_score = batch_bleu_score.sum()/len(batch_bleu_score).detach().item()

    return avg_valid_loss,avg_bleu_score

main function
------

In [19]:
from tqdm import tqdm
def main(setting,dataset_is_prepare = False):

    # set random seed
    myseed = 1
    np.random.seed(myseed)
    torch.manual_seed(myseed)
    if torch.cuda.is_available():
      torch.cuda.manual_seed_all(myseed)

    # data set & tokenizer
    if not dataset_is_prepare:
        clean_data_and_save(
        path_doc = setting["data_info"]["document"],
        raw_src_path = setting["data_info"]["source"]["raw_data_path"],
        raw_tgt_path = setting["data_info"]["target"]["raw_data_path"],
        clean_src_path = setting["data_info"]["source"]["clean_data_path"],
        clean_tgt_path = setting["data_info"]["target"]["clean_data_path"],
        threshold = setting["tokenized_setting"]["max_l"])

        src_tokenizer,tgt_tokenizer = tokenized_data(
            vocab_size = setting["tokenized_setting"]["vocab_size"],
            tokenized_setting = {k:setting["tokenized_setting"][k] for k in \
                      set(list(setting["tokenized_setting"].keys()))-{"vocab_size","max_l"}},
            max_l = setting["tokenized_setting"]["max_l"],
            path_doc = setting["data_info"]["document"],
            clean_src_path = setting["data_info"]["source"]["clean_data_path"],
            clean_tgt_path = setting["data_info"]["target"]["clean_data_path"],
            src_lang = setting["data_info"]["source"]["lang"],
            tgt_lang = setting["data_info"]["target"]["lang"],
            st_train_path = setting["data_info"]["source"]["tokenized_train_data"],
            st_val_path = setting["data_info"]["source"]["tokenized_val_data"],
            tt_train_path = setting["data_info"]["target"]["tokenized_train_data"],
            tt_val_path = setting["data_info"]["target"]["tokenized_val_data"])
    else:
        src_tokenizer,tgt_tokenizer = get_tokenizers(
            path_doc = setting["data_info"]["document"],
            vocab_size = setting["tokenized_setting"]["vocab_size"],
            src_lang = setting["data_info"]["source"]["lang"],
            tgt_lang = setting["data_info"]["target"]["lang"],)

    # data loader
    train_batch_size_setting = setting["training_hparas"]["train_batch_size"]
    valid_batch_size_setting = setting["training_hparas"]["valid_batch_size"]
    train_loader,valid_loader = get_data_set(
        train_batch_size = train_batch_size_setting,
        valid_batch_size = valid_batch_size_setting,
        num_workers = setting["training_hparas"]["workers"],
        path_doc = setting["data_info"]["document"],
        st_train_path = setting["data_info"]["source"]["tokenized_train_data"],
        st_val_path = setting["data_info"]["source"]["tokenized_val_data"],
        tt_train_path = setting["data_info"]["target"]["tokenized_train_data"],
        tt_val_path = setting["data_info"]["target"]["tokenized_val_data"],
        pad_id = setting["tokenized_setting"]["pad_id"])
    train_iter = iter(train_loader)
    valid_iter = iter(valid_loader)

    # model
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model =  build_model(
          max_sentence_length = setting["tokenized_setting"]["max_l"],
          dictionary_length = setting["tokenized_setting"]["vocab_size"],
          padding_idx = setting["tokenized_setting"]["pad_id"],
          encoder_embedding_dimension = setting["model"]["encoder_embedding_dimension"],
          decoder_embedding_dimension = setting["model"]["decoder_embedding_dimension"],
          feedforward_dimension = setting["model"]["feedforward_dimension"],
          num_heads = setting["model"]["num_heads"],
          dropout_p = setting["model"]["dropout_p"],
          layer_num = setting["model"]["layer_num"])

    model = model.to(device)

    train_loss_calculator = LabelSmoothedCrossEntropyCriterion(
                batch_size = train_batch_size_setting,
                dictionary_length = setting["tokenized_setting"]["vocab_size"],
                padding_id = setting["tokenized_setting"]["pad_id"],
                smoothing = setting["training_hparas"]["label_smoothing"])

    valid_loss_calculator = LabelSmoothedCrossEntropyCriterion(
            batch_size = valid_batch_size_setting,
            dictionary_length = setting["tokenized_setting"]["vocab_size"],
            padding_id = setting["tokenized_setting"]["pad_id"],
            smoothing = 0)

    train_loss_calculator,valid_loss_calculator = \
    train_loss_calculator.to(device),valid_loss_calculator.to(device)

    # optimizer
    optimizer = torch.optim.Adam(model.parameters(), **(setting["training_hparas"]["optimization"]["optimizer"]))

    Noam_optimizer = NoamOpt(
             dictionary_length = setting["tokenized_setting"]["vocab_size"],
             factor = setting["training_hparas"]["optimization"]["factor"],
             warmup = setting["training_hparas"]["optimization"]["warmup"],
             optimizer = optimizer)

    # step
    total_step = setting["training_hparas"]["total_step"]
    early_stop_epoch = setting["training_hparas"]["early_stop_step"]
    do_valid_steps = setting["training_hparas"]["do_valid_step"]
    early_stop_count = 0
    progress_bar = tqdm(total = do_valid_steps, desc="train_step", unit=" step")

    # output datas
    train_loss_every_batchs = []
    valid_loss = []
    bleu_score = []
    best_bleu_score = 0

    for step in range(total_step):

      # training
      # iter batch
      try:
        train_batch = next(train_iter)
      except StopIteration:
        train_iter = iter(train_loader)
        train_batch = next(train_iter)

      # compute batch loss and update parameters in model
      model.train()

      src,tgt,src_mask,tgt_mask = train_batch
      src,tgt,src_mask,tgt_mask = src.to(device),tgt.to(device),\
                     src_mask.to(device),tgt_mask.to(device)
      batch_size = src.size(0)

      is_last_batch = False
      if batch_size != train_batch_size_setting:
        is_last_batch = True

      train_loss, test_sentence = train_one_epoch(
              device = device,
              model = model,
              loss_calculator = train_loss_calculator,
              is_last_batch = is_last_batch,
              src = src,
              tgt = tgt,
              src_mask = src_mask,
              tgt_mask = tgt_mask,
              dictionary_length = setting["tokenized_setting"]["vocab_size"],
              optimizer = Noam_optimizer)

      train_loss_every_batchs.append(train_loss)
      if (step+1) % (do_valid_steps//20) == 0:
        print(train_loss_every_batchs[-1])
        print(tgt_tokenizer.decode(torch.argmax(test_sentence,dim = -1).tolist()))
        print(tgt_tokenizer.decode(tgt.detach().tolist()))

      progress_bar.update()
      if (step+1) % do_valid_steps == 0:

        print(train_loss_every_batchs[-1])

        progress_bar.close()

        model.eval()
        avg_val_loss,avg_bleu_score = valid(
                        device = device,
                        model = model,
                        loss_calculator = valid_loss_calculator,
                        batch_size_setting = valid_batch_size_setting,
                        valid_loader = valid_loader,
                        beam_num = setting["training_hparas"]["beam_num"],
                        max_sentence_length = setting["tokenized_setting"]["max_l"],
                        dictionary_length = setting["tokenized_setting"]["vocab_size"],
                        bos_id = setting["tokenized_setting"]["bos_id"],
                        eos_id = setting["tokenized_setting"]["eos_id"],
                        pad_id = setting["tokenized_setting"]["pad_id"],
                        tgt_tokenizer = tgt_tokenizer)
        valid_loss.append(avg_val_loss)
        bleu_score.append(avg_bleu_score)

        # print avg loss
        print(f"average train loss = {sum(train_loss_every_batchs[-1*do_valid_steps:-1])/len(do_valid_steps):.4f}")
        print(f"average valid loss = {valid_loss[-1]:.4f}")
        print(f"average valid loss = {bleu_score[-1]:.4f}")

        # saving model and check early stop criterion
        if bleu_score[-1] > best_bleu_score:
          torch.save(model.state_dict(), setting["tokenized_setting"]["model_saving_path"])
        else :
          early_stop_count += 1

        if early_stop_count == early_stop_epoch:
          break

        progress_bar = tqdm(total = do_valid_steps, desc="train_step", unit=" step")

    progress_bar.close()

    return train_loss_every_batchs,valid_loss,bleu_score

In [None]:
main(setting,dataset_is_prepare = True)
# gc.collect()

100%|██████████| 384064/384064 [00:31<00:00, 12375.50it/s]
100%|██████████| 384064/384064 [00:34<00:00, 11268.07it/s]
100%|██████████| 3879/3879 [00:00<00:00, 13999.97it/s]
100%|██████████| 3879/3879 [00:00<00:00, 16465.11it/s]
train_step:   5%|▌         | 200/4000 [02:59<59:46,  1.06 step/s]

137.8590087890625
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
['他是一位年輕的丈夫,一位半職業的棒球選手,也是一個紐約市的消防員。', '所以,我們反而必須多花時間發展人文、社會學和社會科學,修辭、哲學、倫理,因為這些知識構成我們的背景涵養,對大數據非常重要,也因為這能幫助我們更會思辨,', '我就是無法相信,我的父親,我年輕時的阿多尼斯,我親愛的朋友,會認為這樣的生命還值得活下去。', '第一天:「把它放到你的口袋中。', '我覺得這件事也一樣。', '很遺憾,這只是眾多事件的其中一件。', '而也不是所有的遷移都是自主的。', '第一:我沒有印度口音,我有的是巴勒斯坦的,好嗎?', '你們喜歡這種被加上標籤的感覺嗎?', '」', '我們開了另一個會議', '我們要讓選擇「左」的「協調者」獲得比較高的獎勵', '公約文件定義「難民」為:國民離鄉背景、無法回歸家園,是出於被迫害的恐懼,', '嫌惡是一種情緒,它結合了驚訝、尷尬,還有一些厭惡感,就像是不知道你的雙手要做什麼好。', '但我想我們忽略了某些告訴我們可以做的事的事實', '觀察使我們現在能夠', '」', '各位可以試想一下,如果我不這麽想,可能我們便沒有辦法從哥本哈根的爛攤子裏走出來。', '那麼─所有這些東拉西扯的東西應該歸到何處呢?', '我試著消失到jr的眼睛裏面,但是jr的所有作品裏面的模特眼睛都特別大,', '更糟的是,當你問:「你是否

train_step:  10%|█         | 400/4000 [06:06<56:14,  1.07 step/s]

138.47463989257812
,,,,,,,,,,,,,,,,。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。。
['所以我們花了九年的時間,讓政府相信當地有很多被石油污染的企鵝。', '公司、銀行,', '第一:從最重要的部分開始寫。', '時間是早上三點。', '與其等待,我乾脆自己做一個,只用了一張紙', '他的守靈日剛好發生交通罷工。', '但那就是次要的了,', '男性文化對於持續不斷的男性對女性及兒童施暴的悲劇已經有太多的沉默,不是嗎?', '我,很幸運地,幾年以後遇到了zackieachmat"治療"活動的創辦人他是個很棒的遊説家和社會運動領導', '有馬爾地夫的人跟你聯繫時,你要做的第一件事,就是告訴他,你要現場勘查。', '克:我的意思是,有些其他計畫是從零開始。', '抗瘧疾蚊帳其實是功臣之一', '現在,你仍然可以看到她在販售各種口味的烤玉米和不同的零食等等。', '所以,千禧世代黑人、褐色人種、所有有色人種的故事,都需要被說出來,也需要被傾聽。', '這六年來,我懷抱著不屈不饒的態度,向這個體系注入樂觀主義,無論媒體會提出甚麼樣的質疑——我現在變得更會處理這些質疑了——而且,無論對立面有著怎樣的證據。', '至少,我希望他們只是在賺取廣告收入。', '如果一點漆的行進速度恰好,它撞上其中的任一物體都可能令其作廢。', '事實證明,常常。', '因此在地球最為寒冷的地區他們實質上是在冰箱裏面工作', '這讓我想到稍早michael在做的平衡

train_step:  15%|█▌        | 600/4000 [09:13<53:11,  1.07 step/s]

124.4197769165039
,,,,,,???。???。。。?。。。???。。。。?????。?。。?。。?。。?。?。。。。?。?。。。。。?。???????????。?。。?????。???。???。。??????????????????????????????????。????。???????。??????????????????。。??。??????????
['你能想出方法讓你們全都脫困嗎?', '你們能夠想像這樣的畫面嗎?', '非洲還在發生些什麽?', '其中一件事情是,當我們把事情從簡單變成複雜是當我們想要更多。', '現代的女人比男人要負擔更多的家務,例如烹飪和打掃。', '但了解crispr技術提供了我們一個工具可以來做這些改變是很重要的一旦那些知識變成能取得後。', '這只是要告訴你在腦幹紅色的區塊,那裏有,簡單來說,所有的小方塊會對應到模組那是可以真的做出腦圖譜的我們內部不同的面向,及我們身體不同的面向。', '貓鼬就住在那裡。', '嗯,可能大家都會叫做法蘭克福-黑比諾', '我們如何辦到呢?', '數位id掃瞄器取代了人工驗票,加速了上車的流程,網路上還採用了人工智慧,讓旅行路線能最佳化。', '他們確實創造了新領域,探索一些領域你可能認為無人駕駛飛機只有軍隊在製造其實有一整群人在製造無人駕駛飛機或車輛就是你可以利用程式控制,讓它自主飛行而不用控制桿之類的,就可以控制它的飛行路徑', '正當我在繪圖板上醞釀這想法的時候,電話鈴響了。', '謝謝。', '」或者,「他住在哪裡?', '在漫長的太空旅程中,根本無法沿途取得任何碳,所以必須想辦法在艙裡,將碳回收再利用。', '所以要真正了解多世代職場的美好,我想,我們只需要認識人們的真我。', '或是疾病的表徵', '兩人也同時創造「放射性」一詞', '應該這樣說,我們能夠多早預知事件的發生,取決於幾個主要的因素。', '幾天後,我們在他家見面。', '我一直說,向本地診所提供藥物的非政府組織可能會幫他兒子恢復健康。', '這是地球的尺寸。', '」', '破懷你人際關係的方式,甚至是憤怒恐怖的模樣。', '我的病人可以因此受惠', '因為它無所爭,也就不會有所怨尤。', '但這點將會改變。', '統整這份報告的是國際山地綜合開發中心,縮寫為icimod,位在尼泊爾。', '我們把它稱

train_step:  20%|██        | 800/4000 [12:20<50:03,  1.07 step/s]

140.36119079589844
我們,,,的的的的的的的的的。。。。。。。。。。。。。。。。。。。。。。
['我們引進財政法規讓財政預算與石油價格脫勾', '5點鐘:任務該是英雄面對最大考驗惡夢成真的時候了', '這是被變換成藝術的醫療工具。', '我們創造出了小型自駕式機器人,能在安靜的巷弄及人行道中找到方向,用行走的速度行進,還有安全的貨倉可以運送你的食物和用品。', '沒人來,動力不會出現,你永遠不會感覺想去做。', '他的女朋友大概也蠻生氣他的。', '和別人相處時要投入、心要在。', '不是去想說要製造什麼,而是為了思考而製造。', '歷史上,經過衝突後重建的地區,有40%在十年之內又再次發生了衝突', '我看看--搞一下牌,我才不能--', 'rives:所以當我回神的時候,我發現我有了一個興趣,我不知道我想不想要,但它得到大眾迴響。', '它在水中。', '我曾經有過許多的答案去試著回答這個問題。', '看看這些生物,它們到處游,它們在找地方吃東西和繁殖。', '現在,你們並不需要我給太多合成快樂的例子', '幾件事情會隨之而生。', '像死胡同,原地打轉一樣像塞車,談話遇到僵局一樣', '然而,我必須說,對我而言,溝通的過程並不是輕而易舉的。', '崔西寫信說,她是五個孩子的媽,也很享受居家的時間,但她剛經歷一場離婚必需回去工作賺錢,加上她真的很想把工作帶回她的生命中,因為她熱愛工作。', '完全的線上評分機制、同儕互動以及討論版都是我們必須努力的。', '好,讓我們弄清楚吧。', '瓦拉杰村座落的位置非常接近耶路撒冷那裡的人面臨跟布德鲁斯村非常相似的困境', '我們實際上是在做社會裡的道德選擇說我們不要珊瑚礁', '但這只是整個生意裏面的一丁點。', '什麼會令你恐懼?', '要有這麼多直行的機率,對有個c的每種物種,或有個t的每種物種,在隨機的狀況下,是無限小的。', '沒有什麽真正的你。', '是時候突破那些錯誤的詮釋了這些詮釋把不平等歸結於個人問題卻大幅地忽略了財富的優勢', '有這麼一個"名人錄"。', '如果你乘飛機橫跨美國,這就相當於你的飛行高度。', '去問任何一位天文物理學。', '我會把這個當作我的科學遺產', '我們建造了一個小型的工作台高度可以調整,矮小的學生也能參與', '我出生就帶了一種罕見的視覺障礙就是"全色盲"我

train_step:  25%|██▌       | 1000/4000 [15:27<46:57,  1.06 step/s]

155.09397888183594
,,,,,,,,,,,,,,,,,,,,的,,的,的
['2005年十月的時候第一批的七個貸款都還清後我和matt將網站的"試用版"字樣移除', '大家都是如此,許多隨我們過活的故事甚至不是源於我們。', '我們得要支持這些人,他們現在不只是在拯救人命,也要靠他們,才能在衝突結束後把受傷的社區重新縫合,協助它們療癒。', '計算著所量測到水落下的概念可能使您忽然有所心得"oh,布魯克林是多麼的大,從布魯克林到曼哈頓的距離是個明顯的例子,這東邊流域下游是如此之大。', '他選擇用這樣的結構來設計屋頂的其中一個原因,便是他驚訝的發現,你竟然可以用這樣少的材料建造這樣強壯的結構,而且只需要靠幾個點來支撐。', '把最大矩形的面積減掉最小矩形的面積。', '釷是天然產生的核燃料,在地球的地殼中,比鈾還要常見四倍。', '當我是小孩子時,明確地說是在高中時,我被告知我會困在新的世界經濟大海中,除非我懂日文。', '比較易揮發的啡色芥末種子、一些白酒,', '之後,我們的政府保證永遠全額資助。', '有往上跑的氣泡,然後最上面是用凸起的磁磚做成的泡沫。', '到了第三天,迦納變得喜怒無常', '帶著正確的工具和正確的方法到一個國家,並衝滿活力地去執行防疫工作,那麼你可以做到局部的根除,', '或許他沒有足夠的錢或許他有家庭問題或許他喜歡的女孩不喜歡他', '不過有件事情一直在困擾著我,', '你也能改變世界', '黎巴嫩人請鼓掌黎巴嫩人。', 'ah:當然,請。', '我們強調,靠單一國家的力量是沒用的你必須動員所有的國家。', '想想你的家鄉。', '這是我們的信仰,我們會對我們的信仰忠誠。', '沒錯,化身人物是一個表現真實自我的方式我們可以成為的最英勇且理想化的樣子', '我們已經騎了五個半小時,我們來到我喜愛的部份:爬坡,我愛爬坡。', '那是100年來這個城市最低的投票率,', '所以我開始尋找更有創意的方式來把科技知識介紹給學生。', '親愛的朋友們,咱們鼓起勇氣吧。', '我們一同分析了西岸局勢,挑選出100個身處險境的家庭:它們在關卡旁,在軍隊基地邊,緊挨著定居點。', '都是因為石油;這是事實你知我知天下人都知道。', '」', '但是受到創傷的人會感覺不到這種不朽感。', '你為什麼不該去惹未接觸之印地安人,這就是他們的

train_step:  28%|██▊       | 1132/4000 [17:31<44:46,  1.07 step/s]

In [None]:
# torch.save({
#             'epoch': epoch,
#             'model_state_dict': model.state_dict(),
#             'optimizer_state_dict': optimizer.state_dict(),
#             'loss': loss,
#             ...
#             }, PATH)
# model = TheModelClass(*args, **kwargs)
# optimizer = TheOptimizerClass(*args, **kwargs)

# checkpoint = torch.load(PATH)
# model.load_state_dict(checkpoint['model_state_dict'])
# optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
# epoch = checkpoint['epoch']
# loss = checkpoint['loss']

# model.eval()
# # - or -
# model.train()