Data: [Google Drive folder](https://drive.google.com/drive/folders/16KmyvvVynQIfWWa7343z6XL6elWSWlCr?usp=sharing) `NLP_C4_W3_Colabs`. Choose 'Add shortcut to My Drive' so that the folder appears under `My Drive`.


##  Downloading and loading dependencies



In [1]:
!pip -q install trax


In [2]:
import pickle
import string
import ast
import numpy as np
import trax 
from trax.supervised import decoding
import textwrap 
# Will come in handy later, for wrapping long printed text.
wrapper = textwrap.TextWrapper(width=70)


## Mounting data


In [3]:
from google.colab import drive
drive.mount('/content/drive/', force_remount=True)

Mounted at /content/drive/


In [4]:
path = "/content/drive/My Drive/NLP_C4_W3_Colabs"


## Preprocessing Data



In [5]:
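# Each line of data.txt holds one Python dict literal; parse them all
# and close the file when done.
with open(path + "/data/data.txt") as data_file:
    example_jsons = list(map(ast.literal_eval, data_file))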

natural_language_texts = [example_json['text'] for example_json in example_jsons]

PAD, EOS, UNK = 0, 1, 2
 
def detokenize(np_array):
  return trax.data.detokenize(
      np_array,
      vocab_type = 'sentencepiece',
      vocab_file = 'sentencepiece.model',
      vocab_dir = path + "/models/")
 
def tokenize(s):
  return next(trax.data.tokenize(
      iter([s]),
      vocab_type = 'sentencepiece',
      vocab_file = 'sentencepiece.model',
      vocab_dir = path + "/models/"))
 
vocab_size = trax.data.vocab_size(
    vocab_type = 'sentencepiece',
    vocab_file = 'sentencepiece.model',
    vocab_dir = path + "/models/")

def get_sentinels(vocab_size):
    sentinels = {}

    for i, char in enumerate(reversed(string.ascii_letters), 1):

        decoded_text = detokenize([vocab_size - i]) 
        
        # Sentinels run from <Z> down to <a>.
        sentinels[decoded_text] = f'<{char}>'
        
    return sentinels

sentinels = get_sentinels(vocab_size)   


def pretty_decode(encoded_str_list, sentinels=sentinels):
    # If already a string, just do the replacements.
    if isinstance(encoded_str_list, (str, bytes)):
        for token, char in sentinels.items():
            encoded_str_list = encoded_str_list.replace(token, char)
        return encoded_str_list
  
    # Otherwise, detokenize first and then prettify the result.
    return pretty_decode(detokenize(encoded_str_list))


inputs_targets_pairs = []

# Read the precomputed input/target pairs from a file.
with open(path + "/data/inputs_targets_pairs_file.txt", 'rb') as fp:
    inputs_targets_pairs = pickle.load(fp)  


def display_input_target_pairs(inputs_targets_pairs):
    for i, inp_tgt_pair in enumerate(inputs_targets_pairs, 1):
      inps, tgts = inp_tgt_pair
      inps, tgts = pretty_decode(inps), pretty_decode(tgts)
      print(f'[{i}]\n'
            f'inputs:\n{wrapper.fill(text=inps)}\n\n'
            f'targets:\n{wrapper.fill(text=tgts)}\n\n\n\n')      
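
`get_sentinels` above relies on the last 52 vocabulary IDs being reserved for sentinels. The sketch below shows only the ID-to-label arithmetic, without loading the SentencePiece model. The helper name `sentinel_id_to_label` is hypothetical, and the vocabulary size of 32000 is an assumption taken from `input_vocab_size` in the model definition.

```python
import string


def sentinel_id_to_label(vocab_size):
    """Map the last 52 vocabulary IDs to the labels '<Z>' ... '<a>'.

    Hypothetical helper: it mirrors the loop in get_sentinels but keys the
    result by token ID instead of by the detokenized text.
    """
    mapping = {}
    # reversed(string.ascii_letters) yields 'Z', 'Y', ..., 'A', 'z', ..., 'a'.
    for i, char in enumerate(reversed(string.ascii_letters), 1):
        mapping[vocab_size - i] = f'<{char}>'
    return mapping


# Assumed vocabulary size (matches input_vocab_size = 32000 in the model).
labels = sentinel_id_to_label(32000)
print(labels[31999], labels[31998], labels[31948])  # <Z> <Y> <a>
```

The highest ID gets `<Z>`, so the first masked span in an example is labeled `<Z>`, the next `<Y>`, and so on.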

In [6]:
display_input_target_pairs(inputs_targets_pairs)

[1]
inputs:
Beginners BBQ <Z> Taking <Y> in Missoula! <X> want to get better <W>
making delicious <V>? You will have the opportunity, put this on <U>
calendar now <T> Thursday, September 22nd<S> World Class BBQ Champion,
Tony Balay from Lonestar Smoke Rangers. He<R> be <Q> a beginner<P>
class for everyone <O> wants to<N> better <M> their <L> skills. He
will teach you <K> you need to know <J> compete in  <I> KCBS BBQ
competition, including techniques, recipes,<H>s, meat selection<G>
trimming, plus smoker <F> information. The cost to be in the class is
$35 per person<E> for spectator<D> is free. Included in the cost will
be either a t-shirt or apron and you will<C> tasting samples of each
meat that <B>.

targets:
<Z> Class <Y> Place <X> Do you <W> at <V> BBQ <U> your <T>.<S> join<R>
will <Q> teaching<P> level <O> who<N> get <M> with <L> culinary <K>
everything <J> to <I>a<H> timeline<G> and <F> and fire<E>, and<D>s
it<C> be <B> is prepared




[2]
inputs:
Discussion <Z> ' <Y> X Lion (10.
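
The pairs shown above were precomputed and loaded from a file. As a rough illustration of how such a pair relates to the original text, here is a minimal sketch that masks whole words at fixed positions. The function `mask_spans` and the choice of masked positions are hypothetical; the real data is masked at the subword-token level and at random positions.

```python
import string


def mask_spans(tokens, masked_positions):
    """Replace each run of masked tokens with one sentinel (<Z>, <Y>, ...).

    Returns (inputs, targets): inputs keep the unmasked tokens plus the
    sentinels, targets hold each sentinel followed by the tokens it hides.
    """
    sentinels = (f'<{c}>' for c in reversed(string.ascii_letters))
    inputs, targets = [], []
    previous_masked = False
    for position, token in enumerate(tokens):
        if position in masked_positions:
            if not previous_masked:
                # Start of a new masked span: emit the next sentinel once.
                sentinel = next(sentinels)
                inputs.append(sentinel)
                targets.append(sentinel)
            targets.append(token)
            previous_masked = True
        else:
            inputs.append(token)
            previous_masked = False
    return ' '.join(inputs), ' '.join(targets)


words = 'Beginners BBQ Class Taking Place in Missoula!'.split()
inputs, targets = mask_spans(words, masked_positions={2, 4})
print(inputs)   # Beginners BBQ <Z> Taking <Y> in Missoula!
print(targets)  # <Z> Class <Y> Place
```

This reproduces the opening of example `[1]`: the input keeps the surrounding words and the target lists what each sentinel hides.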


## Load the model pre-trained with BERT loss



In [7]:
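# Initialize the model in predict mode. The hyperparameters must match
# the architecture of the saved weights loaded in the next cell.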
model = trax.models.Transformer(
    d_ff = 4096,
    d_model = 1024,
    max_len = 2048,
    n_heads = 16,
    dropout = 0.1,
    input_vocab_size = 32000,
    n_encoder_layers = 24,
    n_decoder_layers = 24,
    mode='predict')  

In [None]:
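# Dummy int32 signature of shape (batch, length) = (1, 1), one per model input.
shape11 = trax.shapes.ShapeDtype((1, 1), dtype=np.int32)
# Load only the pre-trained weights from the checkpoint.
model.init_from_file(path + '/models/model.pkl.gz',
                     weights_only=True, input_signature=(shape11, shape11))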

In [11]:
print(model)

Serial_in2_out2[
  Select[0,1,1]_in2_out3
  Branch_out2[
    []
    Serial[
      PaddingMask(0)
    ]
  ]
  Cache_in2_out2[
    Serial_in2_out2[
      Embedding_32000_1024
      Dropout
      PositionalEncoding
      Serial_in2_out2[
        Branch_in2_out3[
          None
          Serial_in2_out2[
            LayerNorm
            Serial_in2_out2[
              _in2_out2
              Serial_in2_out2[
                Select[0,0,0]_out3
                Serial_in4_out2[
                  _in4_out4
                  Serial_in4_out2[
                    Parallel_in3_out3[
                      Dense_1024
                      Dense_1024
                      Dense_1024
                    ]
                    PureAttention_in4_out2
                    Dense_1024
                  ]
                  _in2_out2
                ]
              ]
              _in2_out2
            ]
            Dropout
          ]
        ]
        Add_in2
      ]
      Serial[
        Branch_out2[
      


## Decoding


In [16]:

c4_input = inputs_targets_pairs[3][0]
c4_target = inputs_targets_pairs[3][1]

print('pretty_decoded input: \n\n', pretty_decode(c4_input))
print('\npretty_decoded target: \n\n', pretty_decode(c4_target))
print('\nc4_input:\n\n', c4_input)
print('\nc4_target:\n\n', c4_target)
print(len(c4_target))
print(len(pretty_decode(c4_target)))

pretty_decoded input: 

 How many back <Z>s per day for <Y> site? Discussion in 'Black Hat SEO <X> by Omoplat <W>, Dec 3, 2010. 1) <V> a newly created site, what's the max # back <U>s per day I should do to be <T>? 2) how long do I have to let my site age before I can start<S> blinks? I did about <R> profiles every 24 hours <Q> 10 days for one of<P> sites which had a brand new domain. There is three backlinks for <O> of these<N> profile so thats 18 000 backlinks <M> 24 hours and nothing happened in terms of being penalized or sandboxed. This is now <L> 3 <K> ago and the site is ranking on first page for a lot of my targeted keywords <J> build more you can in starting but <I> manual<H> and not spammy type<G> manual + relevant to the <F>.. then after 1 month you can make a big<E>.. Wow, dude, you built 18k backlinks a day on a brand new<D>? How quickly<C> rank up? What <B> of competition/searches did those keywords have?

pretty_decoded target: 

 <Z>link <Y> new <X>' started <W>a <V> fo

Run the cell below to decode the masked input with the pre-trained model.

In [20]:
# Adjust the temperature between 0 and 1; 0.0 always picks the most likely token.
output = decoding.autoregressive_sample(model, inputs=np.array(c4_input)[None, :],
                                        temperature=0.0, max_length=50)
print(wrapper.fill(pretty_decode(output[0])))

<Z>o <I>o<H>cra<G> a rhy <F> Attached metallic elastic waist<E> with
O-ring. Printed<D>. Printed<C> and <B> costume <A> and <z>cra hat.
Printed on a soft cotton
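
To get a feel for what the `temperature` argument controls, here is a minimal pure-Python sketch of temperature sampling via the Gumbel-max trick. It is an illustration under stated assumptions, not necessarily how Trax implements decoding; the function `sample_with_temperature` and the toy probabilities are hypothetical.

```python
import math
import random


def sample_with_temperature(log_probs, temperature, rng=random):
    """Pick an index from log-probabilities using the Gumbel-max trick.

    Gumbel noise scaled by `temperature` is added to each log-probability
    and the index with the highest score wins. With temperature 0.0 the
    noise vanishes and the result is a plain argmax.
    """
    scores = []
    for log_prob in log_probs:
        # Keep u strictly inside (0, 1) so both logarithms are defined.
        u = rng.uniform(1e-6, 1.0 - 1e-6)
        gumbel = -math.log(-math.log(u))
        scores.append(log_prob + temperature * gumbel)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy distribution over three tokens (assumed values, for illustration only).
log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]

greedy = {sample_with_temperature(log_probs, temperature=0.0) for _ in range(100)}
print(greedy)  # {0}
```

At `temperature=0.0` every call returns the most likely index, which is why the decoded text above is deterministic. Raising the temperature toward 1 lets less likely tokens through more often.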
