Eileen Zhang 2020/8/29

# RNN

Dataset  
$$\big\{\left(\mathbf{x}_t,\mathbf{y}_t\right)\big\}_{t=1}^T$$
where  
the input at time step $t$ is $\mathbf{x}_t=\left(x_t^{\left(1\right)},x_t^{\left(2\right)},\dots,x_t^{\left(n\right)}\right)^{\top}\in\mathbb{R}^n$,  
and the output at time step $t$ is $\mathbf{y}_t=\left(y_t^{\left(1\right)},y_t^{\left(2\right)},\dots,y_t^{\left(m\right)}\right)^{\top}\in\mathbb{R}^m$.

Structure of the recurrent neural network model
$$
\begin{cases}
\mathbf{h}_t=f\left(\mathbf{W}\cdot\mathbf{x}_t+\mathbf{U}\cdot\mathbf{h}_{t-1}\right) \\
\mathbf{y}_t=f\left(\mathbf{V}\cdot\mathbf{h}_t\right)
\end{cases}
$$  
where $\mathbf{h}_t$ is the hidden state, $f\left(\cdot\right)$ is a nonlinear activation function, and $\mathbf{U},\mathbf{W},\mathbf{V}$ are the model parameters.
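The recurrence above can be sketched in a few lines of NumPy. This is only an illustration under assumptions that are not in the original: `tanh` stands in for $f$, the sizes `T, n, k, m` are arbitrary, the weights are random, and the initial hidden state is zero.

```python
import numpy as np

def rnn_forward(xs, W, U, V, h0, f=np.tanh):
    """Apply h_t = f(W x_t + U h_{t-1}) and y_t = f(V h_t) along a sequence."""
    h = h0
    hs, ys = [], []
    for x in xs:                  # xs has shape (T, n)
        h = f(W @ x + U @ h)      # new hidden state, shape (k,)
        hs.append(h)
        ys.append(f(V @ h))       # output at this step, shape (m,)
    return np.stack(hs), np.stack(ys)

rng = np.random.default_rng(0)
T, n, k, m = 5, 3, 4, 2           # illustrative sizes (assumed)
xs = rng.normal(size=(T, n))
W = rng.normal(size=(k, n))
U = rng.normal(size=(k, k))
V = rng.normal(size=(m, k))

hs, ys = rnn_forward(xs, W, U, V, h0=np.zeros(k))
print(hs.shape, ys.shape)
```

The same three matrices are reused at every time step; only the hidden state changes as the sequence is consumed.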

![rnn.gif](../data/rnn.gif)

**GRU algorithm**

![gru.png](../data/gru.png)

![RNN-vs-LSTM-vs-GRU-1024x308.png](../data/RNN-vs-LSTM-vs-GRU-1024x308.png)

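**Units in a GRU**: within a single time step the units are computed in parallel and independently of one another. They are coupled only through the recurrent weights, which feed the previous hidden state into every unit, and through the gradient of the final softmax.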

![gru_unit.png](../data/gru_units.png)
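To make the role of the units concrete, here is a minimal NumPy sketch of one GRU step. It rests on assumptions that are not in the original: biases are omitted, the parameter names in `p` are made up for illustration, and the final blend uses the convention `z * h_prev + (1 - z) * h_cand` (some references swap the roles of `z` and `1 - z`).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU step for a single example; every vector has one entry per unit."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)             # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)             # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev))  # candidate state
    return z * h_prev + (1.0 - z) * h_cand                  # blend old state and candidate

rng = np.random.default_rng(0)
n, units = 3, 4                                             # illustrative sizes (assumed)
# W* matrices map the input (n) to units; U* matrices map the previous state to units
p = {name: rng.normal(size=(units, n if name[0] == "W" else units))
     for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

x = rng.normal(size=n)
h = gru_step(x, np.zeros(units), p)
print(h.shape)
```

The gates and the blend are elementwise, so each unit updates its own entry, while the `U*` matrices let every unit read the whole previous state.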

# seq2seq

![seq2seq.png](../data/seq2seq.png)

#  seq2seq + attention

![s2s_attention.png](../data/s2s_attention.png)

- EO: the encoder outputs at every position
- H: the decoder hidden state at a given step
- FC: a fully connected layer
- X: one input to the decoder

- [Bahdanau attention] score = FC(tanh(FC(EO) + FC(H)))
- [Luong attention] score = EO\*W\*H


- attention_weights = softmax(score,axis = 1)
- context = sum(attention_weights * EO, axis = 1)
- final_input = concat(context,embed(x))
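The steps listed above can be sketched in NumPy. This is an illustration under assumptions that are not in the original: the FC layers are bias-free matrices `W1`, `W2`, `v`, the Luong matrix is called `W`, all weights are random, and `embedded_x` is a stand-in for `embed(x)`.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bahdanau_score(EO, H, W1, W2, v):
    # EO: (batch, length, units), H: (batch, units)
    # FC(tanh(FC(EO) + FC(H))); H gets a length axis so it broadcasts over positions
    return np.tanh(EO @ W1 + (H @ W2)[:, None, :]) @ v     # (batch, length, 1)

def luong_score(EO, H, W):
    # EO * W * H, the multiplicative form
    return EO @ W @ H[:, :, None]                           # (batch, length, 1)

def attend(score, EO):
    weights = softmax(score, axis=1)          # normalize over encoder positions
    context = (weights * EO).sum(axis=1)      # weighted sum, (batch, units)
    return context, weights

rng = np.random.default_rng(0)
batch, length, units, attn = 2, 5, 4, 3       # illustrative sizes (assumed)
EO = rng.normal(size=(batch, length, units))
H = rng.normal(size=(batch, units))
W1 = rng.normal(size=(units, attn))
W2 = rng.normal(size=(units, attn))
v = rng.normal(size=(attn, 1))
W = rng.normal(size=(units, units))

context_b, weights_b = attend(bahdanau_score(EO, H, W1, W2, v), EO)
context_l, weights_l = attend(luong_score(EO, H, W), EO)

embedded_x = rng.normal(size=(batch, 6))      # stand-in for embed(x)
final_input = np.concatenate([context_b, embedded_x], axis=-1)
print(context_b.shape, weights_b.shape, final_input.shape)
```

Both scoring functions produce one score per encoder position; everything after the score (softmax, weighted sum, concatenation) is shared.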

# seq2seq + attention implementation

Note: the following was run on a GPU in Google Colab.

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

In [2]:
import io
import re

In [3]:
import jieba

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
file = "/content/drive/My Drive/cmn.txt"

## Data processing

### Inspecting and preprocessing the data

In [6]:
def preprocess_sentence(w):
    w = w.lower().strip()

    # Insert a space between a word and the punctuation mark that follows it
    # https://stackoverflow.com/questions/3645931/python-padding-punctuation-with-white-spaces-keeping-punctuation
    w = re.sub(r"([?.!,¿])", r" \1 ", w)
    # Collapse runs of spaces (and double quotes) into a single space
    w = re.sub(r'[" "]+', " ", w)

    return w.strip()

In [7]:
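def read_txt(path):
    # Each line holds tab-separated fields; the caller keeps the first two (English, Chinese)
    with io.open(path, encoding='UTF-8') as f:
        lines = f.read().strip().split('\n')

    word_pairs = [[preprocess_sentence(w) for w in l.split('\t')] for l in lines]

    # Transpose the rows into one tuple per column
    return zip(*word_pairs)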

In [9]:
en_data, cn_data, _ = read_txt(file)

In [10]:
cn_data[-5:]

('我母亲的法语比我父亲的英语要好，所以他们通常用法语交流。',
 '汤姆不知如何翻译“计算机”一词，因为同他谈话的人从未见过一台。',
 '即使是现在，我偶尔还是想见到你。不是今天的你，而是我记忆中曾经的你。',
 '你很容易把母语说得通顺流畅，却很容易把非母语说得不自然。',
 '如果一個人在成人前沒有機會習得目標語言，他對該語言的認識達到母語者程度的機會是相當小的。')

In [11]:
cn_data = [" ".join(jieba.cut(x, cut_all=False)) for x in cn_data]

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.738 seconds.
Prefix dict has been built successfully.


In [12]:
cn_data[-5:]

['我 母亲 的 法语 比 我 父亲 的 英语 要 好 ， 所以 他们 通常 用 法语 交流 。',
 '汤姆 不知 如何 翻译 “ 计算机 ” 一词 ， 因为 同 他 谈话 的 人 从未见过 一台 。',
 '即使 是 现在 ， 我 偶尔 还是 想 见到 你 。 不是 今天 的 你 ， 而是 我 记忆 中 曾经 的 你 。',
 '你 很 容易 把 母语 说 得 通顺 流畅 ， 却 很 容易 把 非 母语 说 得 不 自然 。',
 '如果 一個 人 在 成人 前 沒 有 機會習 得 目標 語言 ， 他 對 該 語言 的 認識 達 到 母語者 程度 的 機會 是 相當 小 的 。']

In [13]:
# Add start and end markers to each sentence
# so the model knows when to start and stop predicting
en_data = ['<start> ' + w + ' <end>' for w in en_data]
cn_data = ['<start> ' + w + ' <end>' for w in cn_data]

In [14]:
en_data[-5:],cn_data[-5:]

(['<start> my mother speaks french better than my father speaks english , so they usually speak to each other in french . <end>',
  "<start> tom didn't know how to translate the word computer because the people he was talking to had never seen one . <end>",
  "<start> even now , i occasionally think i'd like to see you . not the you that you are today , but the you i remember from the past . <end>",
  "<start> it's very easy to sound natural in your own native language , and very easy to sound unnatural in your non-native language . <end>",
  "<start> if a person has not had a chance to acquire his target language by the time he's an adult , he's unlikely to be able to reach native speaker level in that language . <end>"],
 ['<start> 我 母亲 的 法语 比 我 父亲 的 英语 要 好 ， 所以 他们 通常 用 法语 交流 。 <end>',
  '<start> 汤姆 不知 如何 翻译 “ 计算机 ” 一词 ， 因为 同 他 谈话 的 人 从未见过 一台 。 <end>',
  '<start> 即使 是 现在 ， 我 偶尔 还是 想 见到 你 。 不是 今天 的 你 ， 而是 我 记忆 中 曾经 的 你 。 <end>',
  '<start> 你 很 容易 把 母语 说 得 通顺 流畅 ， 却 很 容易 把 非 母语 说 得 不 自

In [15]:
lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')

In [59]:
lang_tokenizer.fit_on_texts(cn_data[:3])
tensor = lang_tokenizer.texts_to_sequences(cn_data[:3])

In [60]:
tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,
                                                         padding='post')

In [61]:
tensor

array([[1, 4, 2, 3, 0, 0, 0],
       [1, 5, 2, 3, 0, 0, 0],
       [1, 6, 7, 8, 9, 2, 3]], dtype=int32)

In [62]:
lang_tokenizer.word_index

{'<end>': 3,
 '<start>': 1,
 '。': 2,
 '你': 6,
 '你好': 5,
 '嗨': 4,
 '用': 7,
 '的': 9,
 '跑': 8}
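What the tokenizer and padding steps do can be mimicked in plain Python. This is a sketch of the idea, not the Keras implementation: the helper names `fit_vocab` and `to_padded` are made up, and the three toy sentences are reconstructed from the `word_index` and `tensor` printed above.

```python
from collections import Counter

def fit_vocab(texts):
    """Assign ids 1, 2, ... by descending frequency; id 0 is reserved for padding."""
    counts = Counter(tok for t in texts for tok in t.split(" "))
    # sorted() is stable, so tokens with equal counts keep first-seen order
    ordered = sorted(counts, key=lambda tok: -counts[tok])
    return {tok: i + 1 for i, tok in enumerate(ordered)}

def to_padded(texts, word_index):
    """Map tokens to ids, then right-pad every sequence with 0 to the longest length."""
    seqs = [[word_index[tok] for tok in t.split(" ")] for t in texts]
    width = max(len(s) for s in seqs)
    return [s + [0] * (width - len(s)) for s in seqs]

texts = ["<start> 嗨 。 <end>",
         "<start> 你好 。 <end>",
         "<start> 你 用 跑 的 。 <end>"]

word_index = fit_vocab(texts)
padded = to_padded(texts, word_index)
for row in padded:
    print(row)
```

Because 0 never appears in `word_index`, it can later be used as a mask value to ignore padded positions in the loss.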

### Creating the dataset

In [16]:
def tokenize(txt):
    tokenizer = tf.keras.preprocessing.text.Tokenizer(
      filters='')
    tokenizer.fit_on_texts(txt)
    tensor = tokenizer.texts_to_sequences(txt)
    tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,padding='post')
    return tensor, tokenizer

In [17]:
en_tensor, en_tokenizer = tokenize(en_data)
cn_tensor, cn_tokenizer = tokenize(cn_data)

In [18]:
en_tensor_length, cn_tensor_length = en_tensor.shape[1], cn_tensor.shape[1]

In [19]:
en_tensor_length, cn_tensor_length

(36, 32)

In [20]:
from sklearn.model_selection import train_test_split

In [21]:
en_tensor_train, en_tensor_val, cn_tensor_train, cn_tensor_val = train_test_split(en_tensor, cn_tensor, test_size=0.2)

In [22]:
en_tensor_train.shape, en_tensor_val.shape

((18755, 36), (4689, 36))

In [23]:
BUFFER_SIZE = len(en_tensor_train)
BATCH_SIZE = 64
steps_per_epoch = BUFFER_SIZE//BATCH_SIZE

vocab_en_size = len(en_tokenizer.word_index)+1
vocab_cn_size = len(cn_tokenizer.word_index)+1

dataset = tf.data.Dataset.from_tensor_slices((en_tensor_train, cn_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [24]:
example_en_batch, example_cn_batch = next(iter(dataset))
example_en_batch.shape, example_cn_batch.shape

(TensorShape([64, 36]), TensorShape([64, 32]))


# Encoder

In [26]:
class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, enc_units):
        super(Encoder, self).__init__()
        self.enc_units = enc_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                   return_sequences=True,  # return the hidden state at every step: [batch, x_length, units]
                                   return_state=True,  # also return the last hidden state: [batch, units]
                                   recurrent_initializer='glorot_uniform')
    def call(self, x, init_hidden):
        x = self.embedding(x)
        hiddens, last = self.gru(x, initial_state = init_hidden)
        return hiddens, last

In [27]:
embedding_dim = 256
units = 1000

In [28]:
hidden_initializer = tf.zeros((BATCH_SIZE, units))

In [29]:
encoder = Encoder(vocab_en_size, embedding_dim, units)

In [30]:
sample_encoder_output, sample_encoder_lst = encoder(example_en_batch,hidden_initializer)

In [31]:
example_en_batch.shape

TensorShape([64, 36])

In [32]:
sample_encoder_output.shape,sample_encoder_lst.shape

(TensorShape([64, 36, 1000]), TensorShape([64, 1000]))

# Bahdanau Attention


EO: the encoder outputs at every position

H: the decoder hidden state at a given step

[Bahdanau attention] score = FC(tanh(FC(EO) + FC(H)))

In [33]:
class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super(BahdanauAttention, self).__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)
    
    def call(self, last_output, hiddens):
        last_output_with_time_axis = tf.expand_dims(last_output, 1)
        score = self.V(tf.nn.tanh(self.W1(hiddens) + self.W2(last_output_with_time_axis)))

        # shape:(batch_size, length, 1)
        attention_weights = tf.nn.softmax(score, axis=1)
        
        # context_vector.shape :(batch_size, length, units)
        context_vector = attention_weights * hiddens

        # context_vector.shape :(batch_size, units)
        context_vector = tf.reduce_sum(context_vector, axis=1)
        
        return context_vector, attention_weights


In [34]:
attention = BahdanauAttention(10)
attention_result, attention_weights = attention(sample_encoder_lst, sample_encoder_output)

In [35]:
attention_result.shape,attention_weights.shape

(TensorShape([64, 1000]), TensorShape([64, 36, 1]))

In [36]:
class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, dec_units):
        super(Decoder, self).__init__()
        self.dec_units = dec_units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.dec_units,
                                   return_sequences=True,
                                   return_state=True,
                                   recurrent_initializer='glorot_uniform')
        self.fc = tf.keras.layers.Dense(vocab_size)
        # used for attention
        self.attention = BahdanauAttention(self.dec_units)
    
    def call(self, x, en_last_output, en_hiddens):
        # en_last_output shape == (batch_size, hidden_size); used as the attention query
        # en_hiddens shape == (batch_size, max_length, hidden_size)
        context_vector, attention_weights = self.attention(en_last_output, en_hiddens)
        # x shape after passing through embedding == (batch_size, 1, embedding_dim)
        x = self.embedding(x)
        # x shape after concatenation == (batch_size, 1, embedding_dim + hidden_size)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)

        # passing the concatenated vector to the GRU
        output, state = self.gru(x)
        # output shape == (batch_size * 1, hidden_size)
        output = tf.reshape(output, (-1, output.shape[2]))

        # x shape == (batch_size, vocab)
        x = self.fc(output)

        return x, state, attention_weights


In [37]:
embedding_dim = 256
units = 1000
decoder = Decoder(vocab_cn_size, embedding_dim, units)

In [38]:
sample_decoder_output, _, _ = decoder(tf.expand_dims(example_cn_batch[...,1], -1),
                                      sample_encoder_lst, sample_encoder_output)

In [86]:
sample_decoder_output

<tf.Tensor: shape=(64, 14490), dtype=float32, numpy=
array([[-8.9625141e-04, -2.0650481e-03,  1.4418727e-04, ...,
        -4.8619256e-04,  7.0079265e-04,  1.7073995e-04],
       [-2.3152698e-03, -2.8430824e-03, -9.4862672e-04, ...,
         4.8295050e-03,  3.3216036e-04,  1.9666724e-04],
       [-8.1448827e-04, -2.1365557e-03,  1.3786015e-05, ...,
        -6.9960975e-04,  7.5361761e-04, -5.3609720e-06],
       ...,
       [-1.7005488e-03,  1.8617237e-03,  1.4689261e-03, ...,
         3.3776503e-04,  3.3742061e-04, -1.4049484e-03],
       [-5.2604056e-04, -4.9776379e-05, -1.6909594e-03, ...,
         4.0681385e-03, -8.1996905e-04, -5.3315624e-03],
       [-1.2560968e-03,  7.0743688e-04, -1.1438805e-03, ...,
        -1.3315986e-03, -9.7000704e-04, -7.1235624e-04]], dtype=float32)>

# Loss Function

In [39]:
optimizer = tf.keras.optimizers.Adam()

In [40]:
from tensorflow.keras.losses import SparseCategoricalCrossentropy

In [1]:
loss_object = SparseCategoricalCrossentropy(from_logits=True, reduction='none')

@tf.function
def train_step(inp, targ, enc_hidden):
    loss = 0

    with tf.GradientTape() as tape:
        enc_hiddens, enc_last = encoder(inp, enc_hidden)
        # the attention query starts from the encoder's last state, as in translate()
        dec_hidden = enc_last

        dec_input = tf.expand_dims([cn_tokenizer.word_index['<start>']] * BATCH_SIZE, 1)

        targ_len = targ.shape[1]

        # teacher forcing: the ground-truth token becomes the next decoder input
        for t in range(1, targ_len):
            predictions, dec_hidden, _ = decoder(dec_input, dec_hidden, enc_hiddens)
            mask = tf.cast(tf.math.logical_not(tf.math.equal(targ[:, t], 0)), dtype=tf.float32)
            loss_ = loss_object(targ[:, t], predictions)
            loss_ *= mask  # zero out the loss at padded positions
            loss_ = tf.reduce_mean(loss_)
            loss += loss_
            dec_input = tf.expand_dims(targ[:, t], 1)
    variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))
    loss_mean = (loss / int(targ_len))

    return loss_mean
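The effect of the padding mask in the per-step loss can be checked on a toy batch with NumPy. This is a sketch under assumptions that are not in the original: the helper `sparse_ce_from_logits` is a hand-written stand-in for the Keras loss, the logits are all zero, and `token_mean` is shown only as an alternative normalization for comparison.

```python
import numpy as np

def sparse_ce_from_logits(targets, logits):
    """Per-example cross-entropy: -log softmax(logits)[target]."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]

targets = np.array([2, 0, 1])            # id 0 marks a padded position
logits = np.zeros((3, 4))                # uniform predictions over 4 classes

per_example = sparse_ce_from_logits(targets, logits)   # each entry equals log(4)
mask = (targets != 0).astype(np.float32)
masked = per_example * mask              # padded positions contribute 0

batch_mean = masked.mean()               # mean over the whole batch, pads included in the divisor
token_mean = masked.sum() / mask.sum()   # alternative: mean over real tokens only
print(batch_mean, token_mean)
```

Averaging over the whole batch keeps the divisor fixed, so steps where most sentences have already ended contribute a smaller loss; dividing by `mask.sum()` instead would weight every real token equally.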

# Training

In [42]:
import time

In [43]:
%%time
EPOCHS = 20
enc_hidden_initializer = tf.zeros((BATCH_SIZE, units))

for epoch in range(EPOCHS):
    total_loss = 0
    for (batch_id, (inp, targ)) in enumerate(dataset.take(steps_per_epoch)):
        batch_loss = train_step(inp, targ, enc_hidden_initializer)
        total_loss += batch_loss
    print('Epoch {} Loss {:.4f}'.format(epoch + 1,total_loss / steps_per_epoch))


Epoch 1 Loss 1.3683
Epoch 2 Loss 1.1301
Epoch 3 Loss 0.9585
Epoch 4 Loss 0.8010
Epoch 5 Loss 0.6365
Epoch 6 Loss 0.4723
Epoch 7 Loss 0.3324
Epoch 8 Loss 0.2292
Epoch 9 Loss 0.1584
Epoch 10 Loss 0.1117
Epoch 11 Loss 0.0814
Epoch 12 Loss 0.0613
Epoch 13 Loss 0.0472
Epoch 14 Loss 0.0387
Epoch 15 Loss 0.0326
Epoch 16 Loss 0.0277
Epoch 17 Loss 0.0255
Epoch 18 Loss 0.0235
Epoch 19 Loss 0.0233
Epoch 20 Loss 0.0245
CPU times: user 42min 40s, sys: 22min 27s, total: 1h 5min 8s
Wall time: 1h 17min 35s


# Translation

In [104]:
def translate(inputs):
    """Greedily decode one tokenized English sentence (batch size 1)."""
    result = ''
    # add a batch dimension: (length,) -> (1, length)
    inputs = tf.expand_dims(inputs, 0)
    hidden = [tf.zeros((1, units))]
    enc_out, enc_hidden = encoder(inputs, hidden)
    dec_hidden = enc_hidden
    dec_input = tf.expand_dims([cn_tokenizer.word_index['<start>']], 1)

    for t in range(cn_tensor_length):
        predictions, dec_hidden, attention_weights = decoder(dec_input, dec_hidden, enc_out)

        predicted_id = tf.argmax(predictions, 1).numpy()[0]
        # id 0 is the padding id and has no entry in index_word
        word = cn_tokenizer.index_word.get(predicted_id, '')

        result += word + ' '

        if word == '<end>':
            return result

        # feed the predicted id back in as the next decoder input
        dec_input = tf.expand_dims([predicted_id], 0)

    return result
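The decoding loop above is a greedy search: at each step it keeps only the highest-scoring token and feeds it back in. The control flow can be isolated in plain Python. Everything here is hypothetical and for illustration only: `greedy_decode`, `step_fn`, and the toy scorer `toy_step` are not part of the notebook.

```python
def greedy_decode(step_fn, start_id, end_id, max_len):
    """Feed the previous prediction back in until end_id appears or max_len is reached."""
    token, state, out = start_id, None, []
    for _ in range(max_len):
        scores, state = step_fn(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)   # argmax over the vocabulary
        out.append(token)
        if token == end_id:
            break
    return out

def toy_step(token, state):
    """Hypothetical stand-in for a decoder: always scores (token + 1) highest."""
    vocab = 5
    scores = [0.0] * vocab
    scores[(token + 1) % vocab] = 1.0
    return scores, state

decoded = greedy_decode(toy_step, start_id=1, end_id=4, max_len=10)
truncated = greedy_decode(toy_step, start_id=1, end_id=4, max_len=2)
print(decoded, truncated)
```

The `max_len` bound plays the role of `cn_tensor_length` in `translate`: it guarantees termination even if the end token is never predicted.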

In [120]:
def tensor_to_str(tensor,en_tokenizer):
    str_lst = ' '.join(en_tokenizer.sequences_to_texts(np.expand_dims(tensor,-1)))
    return str_lst.split('<end>')[0]

In [121]:
tensor_to_str(en_tensor_val[0],en_tokenizer)

'<start> she was frequently late for school . '

In [122]:
translate(en_tensor_val[0])

'她 經常上 學遲 到 。 <end> '

In [128]:
for i in range(10):
    print(tensor_to_str(en_tensor_val[i],en_tokenizer))
    print(translate(en_tensor_val[i]))

<start> she was frequently late for school . 
她 經常上 學遲 到 。 <end> 
<start> man is mortal . 
人 都 是 綠燈 。 <end> 
<start> tom can't remember anything . 
湯姆 不能 再 做 任何 事 。 <end> 
<start> all of you are familiar with the truth of the story . 
你們 所有 的 故事 故事 都 很 开心 。 <end> 
<start> your memory is good . 
您 记性 很 好 。 <end> 
<start> the dog attacked the little boy . 
這 隻 男孩 救 了 這個 小女孩 。 <end> 
<start> smoking is not allowed here . 
不允許 在 這裡 。 <end> 
<start> is it cheaper to call after 9:00 ? 
最近 的 时候 後 的 时候 是 去 钓鱼 嗎 ？ <end> 
<start> he comes back from sydney today . 
他 今天 会 下雪 。 <end> 
<start> i do not want any bananas at all . 
我 一個 香蕉 也 不要 。 <end> 
