# Assignment 5

## 1. Review the lecture material and read the corresponding papers.

## 2. Answer the following theory questions

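### 2.1 What is an autoencoder?
* An autoencoder is a neural network trained to reconstruct its own input. The encoder compresses the input into a compact latent representation, and the decoder reconstructs the input from that representation.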
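
The idea can be sketched with a tiny linear autoencoder in NumPy. This is only an illustration under assumed settings: the synthetic data, the layer sizes, the learning rate, and all variable names are hypothetical and do not come from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 8-dimensional data that actually lies on a 2-dimensional subspace
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing

# Encoder maps 8 -> 2, decoder maps 2 -> 8 (both linear, no bias)
W_enc = rng.normal(scale=0.1, size=(8, 2))
W_dec = rng.normal(scale=0.1, size=(2, 8))

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared error between the input and its reconstruction."""
    return np.mean((X @ W_enc @ W_dec - X) ** 2)

initial_loss = reconstruction_loss(X, W_enc, W_dec)

lr = 0.01
for _ in range(500):
    code = X @ W_enc                  # compressed representation
    X_hat = code @ W_dec              # reconstruction
    err = (X_hat - X) / len(X)        # squared-error gradient, up to a constant factor
    grad_dec = code.T @ err
    grad_enc = X.T @ (err @ W_dec.T)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
codes = X @ W_enc                     # 2-D codes produced by the trained encoder
print(initial_loss, final_loss, codes.shape)
```

The network never sees labels; the only training signal is how well the decoder recovers the input from the 2-D code.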

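### 2.2 What are the differences between greedy search and beam search?
* Greedy search: at every decoding step, emit the single token with the highest probability.
* Beam search: at every decoding step, keep the k highest-scoring partial sequences (k is the beam width), expand each of them by one token, and again keep the k best by cumulative probability. The selection is deterministic, not random; greedy search is the special case k = 1.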
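
A toy example makes the difference concrete. The probability table below is invented for illustration (it is not produced by any model in this assignment), and `next_token_probs` and `beam_search` are hypothetical helper names.

```python
import math

def next_token_probs(prefix):
    """Hypothetical conditional distribution over the next token given a prefix."""
    table = {
        (): {"A": 0.5, "B": 0.4, "C": 0.1},
        ("A",): {"A": 0.4, "B": 0.3, "C": 0.3},
        ("B",): {"A": 0.9, "B": 0.05, "C": 0.05},
        ("C",): {"A": 0.3, "B": 0.3, "C": 0.4},
    }
    return table[prefix]

def beam_search(n_steps, beam_width):
    """Return the best (sequence, log-probability) pair found with the given beam width."""
    beams = [((), 0.0)]  # each beam is (prefix, cumulative log-probability)
    for _ in range(n_steps):
        candidates = []
        for prefix, score in beams:
            for token, p in next_token_probs(prefix).items():
                candidates.append((prefix + (token,), score + math.log(p)))
        # Keep only the beam_width highest-scoring partial sequences
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

greedy_seq, greedy_logp = beam_search(n_steps=2, beam_width=1)  # greedy search
beam_seq, beam_logp = beam_search(n_steps=2, beam_width=2)

print(greedy_seq, math.exp(greedy_logp))
print(beam_seq, math.exp(beam_logp))
```

With this table, greedy search commits to "A" at the first step and ends with a sequence of probability 0.2, while a beam of width 2 also keeps "B" alive and finds ("B", "A") with probability 0.36.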

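### 2.3 What is the intuition of the attention mechanism?
* At each decoding step, the model scores how relevant every encoder position (source word) is to the output being generated, and uses the resulting weights to form a weighted summary of the encoder states. The decoder is therefore not limited to a single fixed-length vector, which helps with long sentences and typically improves accuracy, and it can also speed up training.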
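
As a rough sketch, one decoding step of dot-product attention could look like the following. The random arrays stand in for real encoder and decoder states, and the sizes (5 source positions, hidden size 4) are arbitrary assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(5, 4))  # one hidden vector per source position
decoder_state = rng.normal(size=(4,))     # current decoder hidden state (the query)

scores = encoder_states @ decoder_state   # relevance of each source position
weights = softmax(scores)                 # attention weights, non-negative and summing to 1
context = weights @ encoder_states        # weighted summary passed to the decoder

print(weights.round(3), context.shape)
```

The weights are recomputed at every decoding step, so the decoder can look at different source positions for different output tokens.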

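### 2.4 What is the disadvantage of the word embeddings introduced in previous lectures?
* They cannot handle unknown (out-of-vocabulary) words.
* They operate at the word level, so lower-level information such as word roots is hard to reuse across words.
* Switching to another language requires a new embedding matrix.
* They cannot be used to initialize the most recently developed model architectures.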
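
The first two points can be illustrated with a small sketch. The vocabulary and the helper `char_ngrams` are hypothetical; splitting words into character n-grams is shown here only as one possible way to expose subword information, not as a method covered in the lecture.

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams, with < and > as boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

vocab = {"happiness", "happy", "cat"}  # hypothetical word-level vocabulary
query = "unhappiness"

in_vocab = query in vocab              # a word-level lookup has nothing for this word
shared = set(char_ngrams(query)) & set(char_ngrams("happiness"))

print(char_ngrams("cat"))
print(in_vocab, sorted(shared))
```

A word-level embedding table has no row for "unhappiness", even though the word shares most of its character n-grams with "happiness", which is in the vocabulary.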

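### 2.5 Briefly describe what self-attention is and what multi-head attention is.
* Self-attention applies the attention mechanism within a single sequence: for every word it computes attention scores against all other words in the sequence. The scores yield a weighted representation, which is then passed through a feed-forward network to produce a new representation that makes use of the context.
* Multi-head attention combines several self-attention structures. Each head learns features in a different representation subspace and may focus on slightly different things, which gives the model more capacity.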
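
A minimal NumPy sketch of both ideas follows, using scaled dot-product attention. The projection matrices are random placeholders for learned weights, and the sizes and function names are assumptions made for this example; the feed-forward network mentioned above is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over one sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every position scored against every position
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

def multi_head_attention(X, heads, W_o):
    """Run several attention heads, concatenate their outputs, and mix them with W_o."""
    outputs = [self_attention(X, *head)[0] for head in heads]
    return np.concatenate(outputs, axis=-1) @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads

X = rng.normal(size=(seq_len, d_model))
# One (W_q, W_k, W_v) triple per head, each projecting d_model -> d_head
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model))

single_out, single_weights = self_attention(X, *heads[0])
multi_out = multi_head_attention(X, heads, W_o)
print(single_out.shape, multi_out.shape)
```

Because each head has its own projections, the heads can assign different attention weights to the same sequence before their outputs are merged.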

## 3. Building a Chinese-English machine translation model (using an encoder-decoder model)

![](https://media.geeksforgeeks.org/wp-content/uploads/seq2seq.png)

### 3.1 [Download the Chinese-English translation dataset](http://www.manythings.org/anki/)
Look for Chinese (Mandarin) - English cmn-eng.zip (22075 Chinese-English sentence pairs).

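### 3.2 Data processing: encoder input, decoder input and output
1. Convert each sentence to a one-hot encoding, character by character.     
2. The LSTM needs 3-D input of shape [n_samples, timesteps, one-hot features].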
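
The two steps can be tried on a toy example before running them on the full dataset. The two sentence pairs below are made up for illustration, and the variable names are hypothetical; `'\t'` and `'\n'` are assumed to mark the start and end of each target sentence.

```python
import numpy as np

# Two made-up (source, target) pairs; targets are wrapped in start/end markers
pairs = [("Hi.", "\t嗨。\n"), ("Run!", "\t跑！\n")]

src_chars = sorted({c for s, _ in pairs for c in s})
tgt_chars = sorted({c for _, t in pairs for c in t})
src_index = {c: i for i, c in enumerate(src_chars)}
tgt_index = {c: i for i, c in enumerate(tgt_chars)}

max_src = max(len(s) for s, _ in pairs)
max_tgt = max(len(t) for _, t in pairs)

# 3-D arrays of shape [n_samples, timesteps, one-hot features]
enc_in = np.zeros((len(pairs), max_src, len(src_chars)))
dec_in = np.zeros((len(pairs), max_tgt, len(tgt_chars)))
dec_out = np.zeros((len(pairs), max_tgt, len(tgt_chars)))

for n, (src, tgt) in enumerate(pairs):
    for t, c in enumerate(src):
        enc_in[n, t, src_index[c]] = 1.0
    for t, c in enumerate(tgt):
        dec_in[n, t, tgt_index[c]] = 1.0
        if t > 0:
            # The decoder target is the decoder input shifted one step ahead
            dec_out[n, t - 1, tgt_index[c]] = 1.0

print(enc_in.shape, dec_in.shape, dec_out.shape)
```

Shorter sentences are padded with all-zero rows, and the one-step shift between `dec_in` and `dec_out` is what lets the decoder learn to predict the next character from the previous ones.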

In [6]:
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model, load_model
from tensorflow.keras.utils import plot_model
import pandas as pd
import numpy as np

N_UNITS = 256
BATCH_SIZE = 64
EPOCH = 50
NUM_SAMPLES = 10000


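data_path = 'cmn.txt'
# Tab-separated file: English sentence, Chinese sentence, and a third column that is not used here
df = pd.read_table(data_path, header=None).iloc[:NUM_SAMPLES, :]
df.columns = ['inputs', 'targets', 'others']

# '\t' marks the start of a target sequence and '\n' marks its end
df['targets'] = df['targets'].apply(lambda x: '\t' + x + '\n')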

input_texts = df.inputs.values.tolist()
target_texts = df.targets.values.tolist()

input_characters = sorted(list(set(df.inputs.unique().sum())))
target_characters = sorted(list(set(df.targets.unique().sum())))

INPUT_LENGTH = max([len(i) for i in input_texts])
OUTPUT_LENGTH = max([len(i) for i in target_texts])
INPUT_FEATURE_LENGTH = len(input_characters)
OUTPUT_FEATURE_LENGTH = len(target_characters)

encoder_input = np.zeros((NUM_SAMPLES, INPUT_LENGTH, INPUT_FEATURE_LENGTH))
decoder_input = np.zeros((NUM_SAMPLES, OUTPUT_LENGTH, OUTPUT_FEATURE_LENGTH))
decoder_output = np.zeros((NUM_SAMPLES, OUTPUT_LENGTH, OUTPUT_FEATURE_LENGTH))

input_dict = {char:index for index,char in enumerate(input_characters)}
input_dict_reverse = {index:char for index,char in enumerate(input_characters)}
target_dict = {char:index for index,char in enumerate(target_characters)}
target_dict_reverse = {index:char for index,char in enumerate(target_characters)}

# One-hot encode every character of every input sentence
for seq_index, seq in enumerate(input_texts):
    for char_index, char in enumerate(seq):
        encoder_input[seq_index, char_index, input_dict[char]] = 1

# One-hot encode the targets; decoder_output is decoder_input shifted one step ahead
for seq_index, seq in enumerate(target_texts):
    for char_index, char in enumerate(seq):
        decoder_input[seq_index, char_index, target_dict[char]] = 1.0
        if char_index > 0:
            decoder_output[seq_index, char_index - 1, target_dict[char]] = 1.0

  


In [10]:
print(encoder_input[0][1])

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]


### 3.3 Building the encoder-decoder model
1. Model training    
2. Model inference     
3. Model prediction and displaying the results   

In [2]:
def create_model(n_input, n_output, n_units):
    # Encoder: keep only the final hidden and cell states
    encoder_input = Input(shape=(None, n_input))
    encoder = LSTM(n_units, return_state=True)
    _, encoder_h, encoder_c = encoder(encoder_input)
    encoder_state = [encoder_h, encoder_c]

    # Decoder: starts from the encoder state and outputs a character distribution at every step
    decoder_input = Input(shape=(None, n_output))
    decoder = LSTM(n_units, return_sequences=True, return_state=True)
    decoder_output, _, _ = decoder(decoder_input,
                                   initial_state=encoder_state)
    decoder_dense = Dense(n_output, activation='softmax')
    decoder_output = decoder_dense(decoder_output)

    # Training model: (source sequence, target sequence) -> target sequence shifted by one step
    model = Model([encoder_input, decoder_input], decoder_output)

    # Inference encoder: source sequence -> [h, c]
    encoder_infer = Model(encoder_input, encoder_state)

    # Inference decoder: reuses the trained layers, with the state passed in explicitly
    decoder_state_input_h = Input(shape=(n_units,))
    decoder_state_input_c = Input(shape=(n_units,))
    decoder_state_input = [decoder_state_input_h, decoder_state_input_c]

    decoder_infer_output, decoder_infer_state_h, decoder_infer_state_c = decoder(
        decoder_input, initial_state=decoder_state_input)
    decoder_infer_state = [decoder_infer_state_h, decoder_infer_state_c]
    decoder_infer_output = decoder_dense(decoder_infer_output)
    decoder_infer = Model([decoder_input] + decoder_state_input,
                          [decoder_infer_output] + decoder_infer_state)

    return model, encoder_infer, decoder_infer


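def predict_chinese(source, encoder_inference, decoder_inference, n_steps, features):
    # Encode the source sentence into the initial decoder state
    state = encoder_inference.predict(source)
    # Start decoding from the start-of-sequence marker '\t'
    predict_seq = np.zeros((1, 1, features))
    predict_seq[0, 0, target_dict['\t']] = 1
    output = ''

    # Greedy decoding: feed the most probable character back in at every step
    for i in range(n_steps):
        yhat, h, c = decoder_inference.predict([predict_seq] + state)
        char_index = np.argmax(yhat[0, -1, :])
        char = target_dict_reverse[char_index]
        output += char
        state = [h, c]
        predict_seq = np.zeros((1, 1, features))
        predict_seq[0, 0, char_index] = 1
        # Stop once the end-of-sequence marker '\n' has been produced
        if char == '\n':
            break
    return output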

In [3]:
model_train, encoder_infer, decoder_infer = create_model(
    INPUT_FEATURE_LENGTH,
    OUTPUT_FEATURE_LENGTH,
    N_UNITS)

model_train.compile(optimizer='rmsprop', loss='categorical_crossentropy')

validation_split = 0.2
model_train.fit([encoder_input,decoder_input],
    decoder_output,
    batch_size=BATCH_SIZE,
    epochs=EPOCH,
    validation_split=validation_split)

Train on 8000 samples, validate on 2000 samples
Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x20992dc4c50>

In [4]:
for i in range(1000,1100):
    test = encoder_input[i:i+1,:,:] 
    out = predict_chinese(test,encoder_infer,decoder_infer,OUTPUT_LENGTH,OUTPUT_FEATURE_LENGTH)
    print(input_texts[i])
    print(out)

Stop grumbling.
停止大声不会说。

Stop resisting!
停止吧。

Summer is over.
夏天过去了。

Take your time.
你可以慢慢来。

Take your time.
你可以慢慢来。

That was wrong.
那是不喜欢的。

That's a shame.
那是一個正確的。

That's logical.
那是一個好的計劃。

That's my coat.
那是我的。

That's perfect.
那是一個正常。

That's too bad.
那不太好。

That's too bad.
那不太好。

That's too bad.
那不太好。

The birds sang.
这个男孩子。

The flag is up.
这个男孩子在吃面包。

The phone rang.
这个男孩子吃面包。

Their eyes met.
那些狗都很大。

These are pens.
這些是筆。

They hated Tom.
他们没看。

They have jobs.
他们有孩子。

They let me go.
他们亲吻了。

They love that.
他们没看。

They trust Tom.
他們會出敗。

They want more.
他们不喜欢我。

They want this.
他们没看。

They were good.
他们不喜欢我。

This is a book.
这是一个好。

This is my bag.
这是我的自行车。

Tom can change.
汤姆不傻。

Tom can't swim.
汤姆不会游泳。

Tom has a plan.
汤姆没有狗。

Tom is a rabbi.
汤姆是个骗子。

Tom is no fool.
汤姆不傻。

Tom isn't dumb.
汤姆不傻。

Tom looks pale.
汤姆走了。

Tom loves dogs.
汤姆走了。

Tom turned red.
汤姆睡着了。

Tom walked out.
湯姆會等。

Tom was crying.
汤姆不傻。

Tom won't stop.
汤姆不会游泳。

Tom's fearless.
汤姆很抱。

Tom's la

![](https://stickershop.line-scdn.net/stickershop/v1/product/3624648/LINEStorePC/main.png;compress=true)