[View in Colaboratory](https://colab.research.google.com/github/KushalVenkatesh/botkit/blob/master/Implementing_tf_seq2seq.ipynb)

Preparing The Training Data:

Let us start by preparing the data.

I will be doing all of this in Google Colab.

I will be using the NLTK library, which requires some “pre-warming”:

In [10]:
import nltk

# Download every NLTK package in the 'tokenizers' category (e.g. punkt)
dwlr = nltk.downloader.Downloader()

for pkg in dwlr.packages():
    if pkg.subdir == 'tokenizers':
        dwlr.download(pkg.id)

[nltk_data] Downloading package punkt to /content/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Let’s create a brand new notebook with Python 3 support.

As soon as the notebook was created, I had to change the runtime type and set it to GPU.
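
One quick way to confirm that the GPU runtime is actually attached is sketched below. It assumes the nvidia-smi tool is on the PATH whenever a GPU is available; that assumption and the helper name gpu_visible are illustrative, not part of the original notebook.

```python
import shutil
import subprocess


def gpu_visible():
    """Return True if nvidia-smi is installed and lists at least one GPU."""
    # Assumption: a GPU runtime ships the nvidia-smi binary on PATH.
    if shutil.which("nvidia-smi") is None:
        return False
    result = subprocess.run(
        ["nvidia-smi", "-L"],  # -L prints one line per detected GPU
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return result.returncode == 0 and bool(result.stdout.strip())


print("GPU visible:", gpu_visible())
```

If this prints False, the runtime type is probably still set to CPU.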

Here I will be using the following to demonstrate neural machine translation (NMT).

In [0]:
!rm -rf dialog_converter

At this point one may ask: okay, I see what this is all about, but how is it possible to train a model for free? Google recently announced that they are giving one Nvidia K80 GPU for 12 hours for free with their new service, Colab. Essentially, Colab is a custom version of Jupyter Notebook.

So it looks like I have both components:

the model that I want to train, and
the service with the Nvidia K80 that I will be using for the actual training.

Here I begin the journey…

Now I can clone the repository and run the logic that prepares the training data:

In [12]:
!git clone https://github.com/b0noI/dialog_converter.git

Cloning into 'dialog_converter'...
remote: Counting objects: 140, done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 140 (delta 6), reused 16 (delta 5), pack-reused 122
Receiving objects: 100% (140/140), 15.13 MiB | 28.01 MiB/s, done.
Resolving deltas: 100% (73/73), done.


In [0]:
!cd dialog_converter

In [14]:
!git checkout b9cc7b7d82a959c80e5048b18e956841233c7688

fatal: Not a git repository (or any of the parent directories): .git


In [15]:
!python ./converter.py

python3: can't open file './converter.py': [Errno 2] No such file or directory


In [16]:
!ls

datalab  dialog_converter  input  input.pre  input.pre.bpe  nltk_data  output


In [0]:
!cd dialog_converter/

In [18]:
!ls

datalab  dialog_converter  input  input.pre  input.pre.bpe  nltk_data  output


In [0]:
!cd dialog_converter

In [20]:
!ls

datalab  dialog_converter  input  input.pre  input.pre.bpe  nltk_data  output


In [21]:
!pwd

/content


In [22]:
%%bash

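UsageError: %%bash is a cell magic, but the cell body is empty.

None of the attempts above worked because each “!” command runs in its own subshell, so “!cd dialog_converter” does not change the working directory for later cells. Putting all the commands into a single “%%bash” cell runs them in one shell, so the “cd” takes effect for the lines that follow it.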


In [15]:
%%bash
rm -rf dialog_converter
git clone https://github.com/b0noI/dialog_converter.git
cd dialog_converter
git checkout b9cc7b7d82a959c80e5048b18e956841233c7688
python ./converter.py
ls

converter.py
LICENSE
movie_lines.txt
pre_processing.py
README.md
test.a
test.b
train.a
train.b


Cloning into 'dialog_converter'...
Note: checking out 'b9cc7b7d82a959c80e5048b18e956841233c7688'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at b9cc7b7... Updated preprocessing


Now I have raw dialog data for training. One of the key differences this time around is that I am going to have a more sophisticated dictionary. To explain what is different about the new dictionary:

Learning a model based on words has a couple of drawbacks. Because NMT models output a probability distribution over words, they can become very slow with a large number of possible words. If I include misspellings and derived words in our vocabulary, the number of possible words is essentially infinite, and I need to impose an artificial limit on how many of the most common words I want our model to handle. This is also called the vocabulary size and is typically set to something in the range of 10,000 to 100,000. Another drawback of training on word tokens is that the model does not learn about common “stems” of words. For example, it would consider “loved” and “loving” as completely separate classes despite their common root.
One way to handle the open vocabulary issue is to learn subword units for a given text. For example, the word “loved” may be split up into “lov” and “ed”, while “loving” would be split up into “lov” and “ing”. This allows the model to generalize to new words, while also resulting in a smaller vocabulary size. There are several techniques for learning such subword units, including Byte Pair Encoding (BPE), which is what I used here. To generate a BPE for a given text, I can follow the instructions in the official subword-nmt repository.
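
To make the idea of learning subword units concrete, here is a minimal toy sketch of BPE learning. It is not the subword-nmt implementation; the function names, the “</w>” end-of-word marker and the tiny word-frequency table are assumptions made purely for illustration. The loop repeatedly merges the most frequent adjacent pair of symbols.

```python
import collections
import re


def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with one merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda _m: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Return the ordered list of merges learned from a {word: count} table."""
    # Each word starts as space-separated characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        if pairs[best] < 2:
            break  # stop when no pair occurs at least twice
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges


# Hypothetical word counts, for illustration only.
merges = learn_bpe({"loved": 5, "loving": 4, "lover": 3}, num_merges=3)
print(merges)  # [('l', 'o'), ('lo', 'v'), ('lov', 'e')]
```

After only three merges the shared stem “lov” has already become a single symbol, which is the effect described above.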

So our next logical steps are:

Get all the required software that can learn a BPE vocabulary from training text.
Convert the training data to BPE and create a vocabulary.
Convert all text with the vocabulary.

In order to get the software that can learn a BPE vocabulary from training text, I will fetch subword-nmt by cloning its repository at a fixed commit.
To do so, I run the following command in the cell:

In [16]:
%%bash
rm -rf subword-nmt
git clone https://github.com/b0noI/subword-nmt.git
cd subword-nmt
git checkout dbe97c8f95f14d06b2e46b8053e2e2f9b9bf804e

Cloning into 'subword-nmt'...
Note: checking out 'dbe97c8f95f14d06b2e46b8053e2e2f9b9bf804e'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b <new-branch-name>

HEAD is now at dbe97c8... bug2 output removed


Converting Training Data To BPE and Creating a Vocabulary

I will need to execute this in the following 3 steps.

Step 1: This step creates a vocabulary from the input training data and the specified vocabulary size (-s 50000). It creates code.bpe, which holds the learned BPE merge operations. It also writes the most frequent tokens from the training data, along with their frequencies, to the files vocab.train.bpe.{a,b}.

In [17]:
%%bash
# Create unique words (vocabulary) from training data
subword-nmt/learn_joint_bpe_and_vocab.py --input dialog_converter/train.a dialog_converter/train.b -s 50000 -o code.bpe --write-vocabulary vocab.train.bpe.a vocab.train.bpe.b

no pair has frequency >= 2. Stopping


Step 2: Our training data has a few tabs, which are not needed in the vocabulary, so let’s clean the vocabulary.

In [0]:
%%bash
# Delete every vocabulary line that contains a tab
sed -i '/\t/d' ./vocab.train.bpe.a
sed -i '/\t/d' ./vocab.train.bpe.b

Step 3: The output files vocab.train.bpe.{a,b} contain a list of words along with their frequencies. tf-seq2seq takes its input as a plain set of words, so I can get rid of the frequencies.

In [0]:
%%bash
# Keep only the first space-separated column (the token), dropping the frequency
cut -d' ' -f1 vocab.train.bpe.a > revocab.train.bpe.a
cut -d' ' -f1 vocab.train.bpe.b > revocab.train.bpe.b
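
For readers who prefer Python to sed and cut, Steps 2 and 3 can be sketched as a single function. The helper name clean_vocab_lines and the sample lines are hypothetical; the sketch only assumes the “token frequency” layout (separated by a single space) described above.

```python
def clean_vocab_lines(lines):
    """Drop entries that contain a tab and keep only the token column."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if "\t" in line:
            continue  # same effect as: sed '/\t/d'
        tokens.append(line.split(" ")[0])  # same effect as: cut -d' ' -f1
    return tokens


# Hypothetical vocabulary lines, for illustration only.
sample = ["the 120\n", "lov@@ 9\n", "a\tb 3\n", "ing 7\n"]
print(clean_vocab_lines(sample))  # ['the', 'lov@@', 'ing']
```

To process a real file, pass an open file handle (for example open("vocab.train.bpe.a")) instead of the sample list.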

Converting all text with the vocabulary:

The previous cells created the BPE codes and a dictionary for each raw file. Now I can apply these dictionaries to our raw files:

In [0]:
%%bash
subword-nmt/apply_bpe.py -c code.bpe --vocabulary vocab.train.bpe.a --vocabulary-threshold 5 < dialog_converter/train.a > train.bpe.a
subword-nmt/apply_bpe.py -c code.bpe --vocabulary vocab.train.bpe.b --vocabulary-threshold 5 < dialog_converter/train.b > train.bpe.b
subword-nmt/apply_bpe.py -c code.bpe --vocabulary vocab.train.bpe.a --vocabulary-threshold 5 < dialog_converter/test.a > test.bpe.a
subword-nmt/apply_bpe.py -c code.bpe --vocabulary vocab.train.bpe.b --vocabulary-threshold 5 < dialog_converter/test.b > test.bpe.b
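
What applying BPE does to a single word can be sketched as follows. This is a simplified illustration, not the apply_bpe.py implementation: it replays a hypothetical list of merges in order and ignores the vocabulary threshold. The “@@” suffix marks a piece that continues into the next one, matching tokens such as “.@@” that show up in the vocabulary.

```python
def apply_merges(word, merges):
    """Segment one word by replaying learned merges in order."""
    symbols = list(word) + ["</w>"]  # "</w>" is an assumed end-of-word marker
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Remove the end-of-word marker, whether it stands alone or was merged.
    if symbols[-1] == "</w>":
        symbols.pop()
    elif symbols[-1].endswith("</w>"):
        symbols[-1] = symbols[-1][: -len("</w>")]
    # Every piece except the last gets the continuation marker.
    return "@@ ".join(symbols)


# Hypothetical merge list, for illustration only.
example_merges = [("l", "o"), ("lo", "v"), ("e", "d"), ("i", "n"), ("in", "g")]
print(apply_merges("loved", example_merges))   # lov@@ ed
print(apply_merges("loving", example_merges))  # lov@@ ing
```

Both words now share the piece “lov@@”, so the model sees their common stem as the same token.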

Preparation for training:

Step 1: Downloading the nmt model

In [21]:
%%bash
rm -rf /content/nmt_model
rm -rf nmt
git clone https://github.com/tensorflow/nmt/

Cloning into 'nmt'...


Step 2: Moving all the files required for training to one place. This includes the training data, the test data and the vocabulary (the version that contains only the set of words).

In [0]:
%%bash
mkdir -p /content/nmt_model
cp dialog_converter/train.a /content/nmt_model
cp dialog_converter/train.b /content/nmt_model
cp dialog_converter/test.a /content/nmt_model
cp dialog_converter/test.b /content/nmt_model
cp revocab.train.bpe.a /content/nmt_model
cp revocab.train.bpe.b /content/nmt_model
cp train.bpe.a /content/nmt_model
cp test.bpe.a /content/nmt_model
cp train.bpe.b /content/nmt_model
cp test.bpe.b /content/nmt_model

Starting the training:

In [0]:
!cd nmt && python3 -m nmt.nmt \
    --src=a --tgt=b \
    --vocab_prefix=/content/nmt_model/revocab.train.bpe \
    --train_prefix=/content/nmt_model/train.bpe \
    --dev_prefix=/content/nmt_model/test.bpe \
    --test_prefix=/content/nmt_model/test.bpe \
    --out_dir=/content/nmt_model \
    --num_train_steps=45000000 \
    --steps_per_stats=100000 \
    --num_layers=2 \
    --num_units=128 \
    --batch_size=16 \
    --num_gpus=1 \
    --dropout=0.2 \
    --learning_rate=0.2 \
    --metrics=bleu

# Job id 0
# hparams:
  src=a
  tgt=b
  train_prefix=/content/nmt_model/train.bpe
  dev_prefix=/content/nmt_model/test.bpe
  test_prefix=/content/nmt_model/test.bpe
  out_dir=/content/nmt_model
# Vocab file /content/nmt_model/revocab.train.bpe.a exists
The first 3 vocab words [.@@, ,, .] are not [<unk>, <s>, </s>]
# Vocab file /content/nmt_model/revocab.train.bpe.b exists
The first 3 vocab words [.@@, ., ,] are not [<unk>, <s>, </s>]
  saving hparams to /content/nmt_model/hparams
  saving hparams to /content/nmt_model/best_bleu/hparams
  attention=
  attention_architecture=standard
  avg_ckpts=False
  batch_size=16
  beam_width=0
  best_bleu=0
  best_bleu_dir=/content/nmt_model/best_bleu
  check_special_token=True
  colocate_gradients_with_ops=True
  decay_scheme=
  dev_prefix=/content/nmt_model/test.bpe
  dropout=0.2
  embed_prefix=None
  encoder_type=uni
  eos=</s>
  epoch_step=0
  forget_bias=1.0
  infer_batch_size=32
  init_op=uniform
  init_weight=0.1
  learning_rate=0.2
 

  created eval model with fresh parameters, time 0.12s
  eval dev: perplexity 32310.67, time 23s, Sun Jun 24 19:55:53 2018.
  eval test: perplexity 32310.67, time 23s, Sun Jun 24 19:56:16 2018.
2018-06-24 19:56:16.650384: I tensorflow/core/kernels/lookup_util.cc:373] Table trying to initialize from file /content/nmt_model/revocab.train.bpe.b is already initialized.
2018-06-24 19:56:16.650764: I tensorflow/core/kernels/lookup_util.cc:373] Table trying to initialize from file /content/nmt_model/revocab.train.bpe.b is already initialized.
2018-06-24 19:56:16.650971: I tensorflow/core/kernels/lookup_util.cc:373] Table trying to initialize from file /content/nmt_model/revocab.train.bpe.a is already initialized.
  created infer model with fresh parameters, time 0.06s
# Start step 0, lr 0.2, Sun Jun 24 19:56:16 2018
# Init train iterator, skipping 0 elements


There are several things to note here:

num_train_steps: the number of steps the network will take before stopping. I made it large, since it is better to have to stop the training manually than to have the network stop when I do not expect it to;
steps_per_stats: how often the network outputs stats. Keep in mind that outputting stats takes time, so I need to find a balance between printing them too often and training the model completely in the dark;
metrics: the logic for computing the distance between 2 sentences in order to evaluate model quality;
The important thing here is to use “!” rather than “%%bash”. “%%bash” waits until the cell has finished executing before showing any output, which here means almost forever because of the training. By contrast, “!” shows the output as it is produced.
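
To give a feel for what a metric like BLEU measures, here is a toy sentence-level sketch: clipped n-gram precision (up to bigrams) multiplied by a brevity penalty. It is a simplified illustration with hypothetical names; the exact scoring used by the nmt package may differ.

```python
import collections
import math


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a list of tokens."""
    return collections.Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def simple_bleu(reference, hypothesis, max_n=2):
    """Toy BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Penalize hypotheses that are shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(sum(log_precisions) / max_n)


print(simple_bleu("how are you", "how are you"))        # 1.0
print(simple_bleu("how are you", "fine thanks"))        # 0.0
print(simple_bleu("how are you today", "how are you"))  # below 1.0 because the reply is too short
```

A higher score means the generated reply shares more word sequences with the reference reply.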

As soon as the training finishes its first epoch, I can start chatting with the model.

Be sure to kill the training if it is still in progress; otherwise I will not be able to chat with the model.

Next, I copy and paste the following code into a file (let’s say chat.sh) under the nmt directory and run it like ./chat.sh <path to the model>. I might need to change the permissions with chmod +x chat.sh.

In [0]:
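%%bash
# The paths below assume pre_processing.py, subword-nmt, code.bpe,
# vocab.train.bpe.a and nmt all live under $HOME; adjust them if your
# files are elsewhere (the earlier cells in this notebook worked under /content).
pwd
cd nmt
touch /content/output
chat () {
   # Save the user's sentence, then pre-process and BPE-encode it
   echo "$1" > /content/input
   python3 "$HOME/pre_processing.py" /content/input > /content/input.pre
   "$HOME/subword-nmt/apply_bpe.py" -c "$HOME/code.bpe" --vocabulary "$HOME/vocab.train.bpe.a" --vocabulary-threshold 5 < /content/input.pre > /content/input.pre.bpe
   # Run inference with the trained model and print its reply
   cd "$HOME/nmt"
   python3 -m nmt.nmt --out_dir=/content/nmt_model --inference_input_file=/content/input.pre.bpe --inference_output_file=/content/output > /dev/null 2>&1
   cat /content/output
}
chat "hi"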