# Word-level review generation

This part of the project uses word embeddings for word-level review generation. Because the full vocabulary is large (typically tens of thousands of words) and computational resources are limited, only the 6,000 most frequent words are kept. The embedded sequence is fed into a stack of two gated recurrent unit (GRU) layers. Finally, the GRU output is linearly projected onto the vocabulary and passed through a softmax to yield a probability for each word.

To regularize the model, a dropout probability of 0.2 is applied to the recurrent state inside each GRU layer (`recurrent_dropout=0.2` in the code). A dropout probability of 0.5 is applied between the word embedding layer and the GRU, and between the GRU and the fully connected output layer.
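
The vocabulary restriction described above can be illustrated with a small, self-contained sketch. It is not the notebook's actual mechanism (the notebook relies on gensim's `max_final_vocab` argument); the helper names `build_vocab` and `replace_rare` and the toy corpus are assumptions made for illustration only.

```python
from collections import Counter

def build_vocab(tokenized_sentences, max_vocab):
    """Return the max_vocab most frequent words (hypothetical helper)."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    return [word for word, _ in counts.most_common(max_vocab)]

def replace_rare(sentence, vocab):
    """Map every word outside the vocabulary to the <UNK> token."""
    kept = set(vocab)
    return [word if word in kept else '<UNK>' for word in sentence]

# Toy corpus, assumed for illustration
toy_corpus = [
    ['the', 'food', 'was', 'great'],
    ['the', 'service', 'was', 'great'],
    ['the', 'sashimi', 'was', 'fresh'],
]

toy_vocab = build_vocab(toy_corpus, max_vocab=3)
toy_replaced = replace_rare(toy_corpus[2], toy_vocab)
print(toy_vocab)
print(toy_replaced)
```

With a cap of 3 words, only `the`, `was` and `great` survive, so the rarer words in the third sentence collapse to `<UNK>`, which is the same effect that produces the `<UNK>` tokens in the generated samples.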

**Sample generated texts:**

- If you're looking for a small place to take a good time at the end of the night. I would recommend this place to anyone to try the <unk\>, and you won't be disappointed. Weeknight, and the food is great.
- My husband and I were looking for a little <unk\>. The pizza was delicious, the shrimp was good and the chicken was good! My friend had the chicken and the shrimp and it was delicious. The chips were great and the food was amazing!
- This is a great place to eat. I've had a few people here. The place is very nice and the service is exceptional. I would recommend the food and food.
- Very good, the food was delicious. I had a great meal and the food was great. I'm so glad I had the same thing that I had. I'm not sure if I was going to get a drink.
- A few years ago, and I was really excited to try for my first time. I'm glad I would be going back to the place. The food was good, the food was good, but the food was pretty good.
- I was told that the other employee came out and said they were. We sat in the front desk for our table. They brought us the food and the waiter was very friendly and nice.
 
(The samples above are unedited model output apart from formatting.)
 

In [0]:
import argparse
import time
import os
import random
import sys
import re
import gensim

import numpy as np

In [3]:
from keras import backend as K
from keras.layers.embeddings import Embedding
from keras import layers
from keras.layers import Dense, LSTM, GRU, Dropout, TimeDistributed
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.callbacks import LambdaCallback, ModelCheckpoint, CSVLogger, LearningRateScheduler
from keras.models import Sequential, load_model
from keras.optimizers import RMSprop

Using TensorFlow backend.


In [0]:
def load_text(filename):
    with open(filename, encoding='utf-8') as f:
        sentences = f.readlines()

    sentences = [sentence.lower() for sentence in sentences]
    sentences = [re.sub(r'([\.\,\?\!\']+)', r' \1 ', sentence) for sentence in sentences]
    sentences = [re.sub(r'["#\$%&()\*\+\-/:;\<\=\>@\\^_`\{\|\}~\t\n]', '', sentence) for sentence in sentences]
    sentences = [re.sub(r'\s+', ' ', sentence) for sentence in sentences]
    sentences = [sentence.split() for sentence in sentences]

    print('{} sentences loaded. Sample:""'.format(len(sentences)))
    print(' '.join(random.choice(sentences)))
    
    return sentences

## 1. Load pretrained word embedding

In [0]:
word_model = gensim.models.Word2Vec.load("word2vec2.model")

### Training of the word embedding (optional)

The code below loads 100,000 lines of text and trains a word2vec model on them. Because `sg` is not set, gensim uses its default CBOW architecture rather than skip-gram.

In [85]:
sentences = load_text('input_100000.txt')
sentences = [sentence for sentence in sentences if len(sentence) >= 2]

100068 sentences loaded. Sample:""
nail came out really cute , but they were extremely shorthanded . i went in with 2 other people and after over 2 hours in the place my nails and toes were finished while the other 2 were still waiting to be seated .


In [92]:
# Pretrain word embedding

embedding_size = 128
print('\nTraining word2vec...')
word_model = gensim.models.Word2Vec(sentences, size=embedding_size, max_final_vocab=6000, window=5, iter=1000)
pretrained_weights = word_model.wv.vectors
vocab_size, embedding_size = pretrained_weights.shape
print('vocab_size: {} embedding size: {}'.format(vocab_size, embedding_size))


Training word2vec...
vocab_size: 5959 embedding size: 128


In [0]:
word_model.save("word2vec2.model")
# word_model.train(sentences, total_examples=word_model.corpus_count, epochs=100)

In [8]:
for word in ['sushi', 'beef', 'toppings', 'green']:
  most_similar = ', '.join('%s (%.2f)' % (similar, dist) for similar, dist in word_model.wv.most_similar(word)[:8])
  print('  %s -> %s' % (word, most_similar))

  sushi -> sashimi (0.67), pho (0.59), food (0.58), nigiri (0.57), ramen (0.54), poke (0.48), seafood (0.48), dimsum (0.45)
  beef -> pork (0.62), chicken (0.57), steak (0.52), shrimp (0.52), meat (0.50), lamb (0.48), soup (0.48), veal (0.46)
  toppings -> ingredients (0.61), flavors (0.60), veggies (0.58), sides (0.57), meats (0.55), sauces (0.52), choices (0.51), topping (0.50)
  green -> lemon (0.48), black (0.47), red (0.45), pineapple (0.44), fritters (0.43), roasted (0.42), cayenne (0.42), bean (0.42)




## 2. Prepare data

During tokenization, index 0 is reserved for padding and index 1 for `<UNK>` (unknown words).
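
As a rough sketch of this index layout, the toy example below shifts every real word by two so that indices 0 and 1 stay free. The three-word vocabulary and the names `encode`, `decode`, `PAD_IDX`, `UNK_IDX` and `OFFSET` are hypothetical and stand in for the word2vec vocabulary used in the notebook.

```python
PAD_IDX = 0   # reserved for padding / masking
UNK_IDX = 1   # reserved for out-of-vocabulary words
OFFSET = 2    # real words start after the two reserved slots

# Hypothetical frequency-ordered vocabulary
toy_index2word = ['the', 'food', 'great']
toy_word2index = {word: i for i, word in enumerate(toy_index2word)}

def encode(word):
    """Word -> integer id, falling back to UNK_IDX for unknown words."""
    if word in toy_word2index:
        return toy_word2index[word] + OFFSET
    return UNK_IDX

def decode(idx):
    """Integer id -> word, handling both reserved ids explicitly."""
    if idx == PAD_IDX:
        return '<PAD>'
    if idx == UNK_IDX:
        return '<UNK>'
    return toy_index2word[idx - OFFSET]

toy_ids = [encode(w) for w in ['the', 'sashimi', 'great']]
toy_words = [decode(i) for i in toy_ids + [PAD_IDX]]
print(toy_ids)
print(toy_words)
```

Handling the padding id explicitly in `decode` matters: subtracting the offset from 0 would give a negative list index and silently return a word from the end of the vocabulary.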

In [0]:
pretrained_weights = word_model.wv.vectors

In [0]:
# Row 0: all-zero vector for the padding/mask index
pretrained_weights_one = np.insert(pretrained_weights, 0, 0, axis=0)

# Row 1: <UNK>, initialised to the mean of the pretrained vectors
embedding_w = np.insert(pretrained_weights_one, 0, 0, axis=0)
embedding_w[1] = np.mean(pretrained_weights, axis=0)

# vocab_size now counts the two reserved rows as well
vocab_size, embedding_size = embedding_w.shape

In [0]:
def word2idx(word):
    if word == '<UNK>' or word not in word_model.wv.vocab:
        return 1
    return word_model.wv.vocab[word].index + 2
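
def idx2word(idx):
    if idx == 0:
        # index 0 is the padding slot; idx - 2 would wrap around to the end of the vocabulary
        return '<PAD>'
    if idx == 1:
        return '<UNK>'
    return word_model.wv.index2word[idx - 2]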

### Text Preparation and Tokenization

The following code tokenizes the text in `input_100000.txt`.

It also splits lines longer than `max_len + 1` tokens into smaller chunks (the extra token provides the shifted target). As a result, the dataset contains about 275k lines instead of 100k.

Lines and trailing chunks shorter than `min_len` tokens are dropped.
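
The splitting, padding and pruning rules can be seen on a toy sequence. This is a minimal sketch with deliberately small numbers; `chunk_and_pad` is a hypothetical helper name, not a function from the notebook.

```python
def chunk_and_pad(token_ids, length, min_len, pad_idx=0):
    """Cut a token list into fixed-size chunks, pad the tail, drop short tails."""
    chunks = []
    while len(token_ids) > length:
        chunks.append(token_ids[:length])
        token_ids = token_ids[length:]
    if len(token_ids) >= min_len:
        # right-pad the remaining tokens up to the fixed length
        chunks.append(token_ids + [pad_idx] * (length - len(token_ids)))
    return chunks

toy_ids = list(range(2, 13))   # 11 toy token ids: 2..12

kept_tail = chunk_and_pad(toy_ids, length=4, min_len=2)
dropped_tail = chunk_and_pad(toy_ids, length=4, min_len=4)
print(kept_tail)
print(dropped_tail)
```

In the first call the three leftover tokens are long enough to be kept and padded with one zero; in the second call the same tail falls below `min_len` and is discarded.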

In [73]:
# Dataset Prep for training

max_len = 50
min_len = 10

sentences = load_text('input_100000.txt')
sentences = [sentence for sentence in sentences if len(sentence) >= min_len]

100068 sentences loaded. Sample:""
always manage a good time here . good steam and hot tub . the dark areas are fun and the sling can support a great amount of weight . nice guys and the roof is a sunny refuge . i will be back .


In [97]:
tokenized = [[word2idx(w) for w in sentence] for sentence in sentences]

tokenized_m = []
length = max_len + 1 # for x and y
for sentence in tokenized:
    while len(sentence) > length:
        tokenized_m.append(sentence[:length])
        sentence = sentence[length:]
    if len(sentence) >= min_len:
        tokenized_m.append(sentence + [0] * (length  - len(sentence)))

tokenized_m = np.array(tokenized_m)
print(tokenized_m.shape)

(275288, 51)


### Training Dataset

Each row of the target data set `y` is the corresponding row of the input data set `x` shifted one position forward, i.e. `train_y[:, i]` equals `train_x[:, i+1]` for every `i` except the last position of each row, whose target comes from the extra token kept in `tokenized_m`.

During training, the cross entropy loss for **every** word (rather than merely the last one) in `y` will be taken into consideration.
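
To make the shifted targets and the per-word loss concrete, here is a small pure-Python sketch. The uniform "model output" and the helper names are hypothetical, and skipping time steps whose input is padding is an assumption about what `mask_zero=True` is meant to achieve, not a statement about Keras internals.

```python
import math

def shift_pairs(row):
    """Split one padded row into (input, target), with the target one step ahead."""
    return row[:-1], row[1:]

def mean_token_cross_entropy(step_probs, targets, inputs, pad_idx=0):
    """Average -log p(target) over all time steps whose input is not padding."""
    losses = [
        -math.log(probs[target])
        for probs, target, token in zip(step_probs, targets, inputs)
        if token != pad_idx
    ]
    return sum(losses) / len(losses)

toy_row = [5, 3, 4, 0]                 # three word ids followed by one pad
toy_x, toy_y = shift_pairs(toy_row)

# Hypothetical model output: a uniform distribution over 6 classes at every step
uniform = [[1.0 / 6] * 6 for _ in toy_x]
toy_loss = mean_token_cross_entropy(uniform, toy_y, toy_x)
print(toy_x, toy_y, toy_loss)
```

Every position contributes one term to the loss, so a 50-token row yields up to 50 training signals instead of a single one for the final word.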

In [99]:
tokenized_len = tokenized_m.shape[1]
train_x = tokenized_m[:, :tokenized_len-1]
train_y = tokenized_m[:, 1:]
print('train_x shape:', train_x.shape)
print('train_y shape:', train_y.shape)

train_x shape: (275288, 50)
train_y shape: (275288, 50)


## 3. Training

In [100]:
K.clear_session()

max_words = vocab_size

# build the model: two-layer GRU
print('Build model...')
model = Sequential()
model.add(Embedding(input_dim=vocab_size,
                    output_dim=embedding_size,
                    input_length=max_len,
                    weights=[embedding_w],
                    mask_zero=True,
#                     trainable=False,
                   ))
model.add(Dropout(0.5))
model.add(GRU(embedding_size, recurrent_dropout=0.2, return_sequences=True))
model.add(GRU(embedding_size, recurrent_dropout=0.2, return_sequences=True))
model.add(Dropout(0.5))
# model.add(Dense(vocab_size, activation='softmax'))
model.add(TimeDistributed(Dense(vocab_size, activation='softmax')))

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

model.summary()

Build model...
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 50, 128)           763008    
_________________________________________________________________
dropout_1 (Dropout)          (None, 50, 128)           0         
_________________________________________________________________
gru_1 (GRU)                  (None, 50, 128)           98688     
_________________________________________________________________
gru_2 (GRU)                  (None, 50, 128)           98688     
_________________________________________________________________
dropout_2 (Dropout)          (None, 50, 128)           0         
_________________________________________________________________
time_distributed_1 (TimeDist (None, 50, 5961)          768969    
=================================================================
Total params: 1,729,353
Trainable params: 1,729,353
Non-trainable params: 0
_________________________________________________________________

In [0]:
def sample(preds, temperature=1.0, exclude_unk=False):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    if exclude_unk:
        for i in range(3):
            probas = np.random.multinomial(1, preds, 1)
            result = np.argmax(probas)
            if result > 1:
                break
    else:
        probas = np.random.multinomial(1, preds, 1)
        result = np.argmax(probas) 
        
    return result

def on_epoch_end(epoch, _):
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    for diversity in [0.2, 0.6]:
        print('----- diversity:', diversity)

        generated = ''
        sentence_list = [[1]]
        for _ in range(2):
            sentence_list.append([random.randrange(2, 250)])

        for sentence in sentence_list:
            generated = sentence

            for i in range(20):
                x_pred = np.zeros((1, max_len))
                for t, word in enumerate(sentence):
                    x_pred[0, t] = word

                preds = model.predict(np.array(x_pred), verbose=0)
                next_index = sample(preds[-1, i], diversity)

                sentence.append(next_index)

            print(' '.join([idx2word(w) for w in sentence]))

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
save_callback = ModelCheckpoint(os.path.join('model',
                                             "weights.{epoch:d}-{loss:.2f}.hdf5"),
                                monitor='loss', period=5)
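
The `temperature` argument of `sample` (called `diversity` in the callback) rescales the predicted distribution before a word is drawn. The sketch below reproduces only that rescaling step on a made-up three-word distribution, without any random sampling; `apply_temperature` and `toy_probs` are hypothetical names.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability vector: divide the log-probabilities by the temperature, then renormalise."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_probs = [0.5, 0.3, 0.2]   # made-up next-word distribution

sharp = apply_temperature(toy_probs, 0.2)       # low temperature
unchanged = apply_temperature(toy_probs, 1.0)   # temperature 1 leaves it as is

for name, dist in [('0.2', sharp), ('1.0', unchanged)]:
    print(name, [round(p, 3) for p in dist])
```

A temperature below 1 concentrates probability on the most likely word, which makes the generated text more repetitive but safer; values closer to 1 keep more of the model's original uncertainty.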

In [0]:
model.fit(train_x, np.expand_dims(train_y, -1), batch_size=1024, epochs=5,
          callbacks=[
                     print_callback, 
                     save_callback,
                    ])

## 4. Generation

In [175]:
! ls model

weights.10-6.00.hdf5  weights.25-4.51.hdf5  weights.5-6.08.hdf5
weights.15-4.74.hdf5  weights.30-4.47.hdf5
weights.20-4.58.hdf5  weights.35-4.44.hdf5


In [0]:
model = load_model('model/weights.35-4.44.hdf5')

In [170]:
generated = ''
sentence_list = [['this'], ['my'], ['it'],
                ['a'], ['many'], ['the'],
                ['if'], ['so'], ['in'],
                ['i'], ['we'], ['very']]

for sentence in sentence_list:
    generated = sentence

    for i in range(max_len - len(sentence)):
        x_pred = np.zeros((1, max_len))
        for t, word in enumerate(sentence):
            x_pred[0, t] = word2idx(word)

        preds = model.predict(x_pred, verbose=0)
#         print(preds.shape)
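        # read the prediction at the last filled position, so seeds longer than one word also work
        next_index = sample(preds[-1, len(sentence) - 1], 0.5, True)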
#         next_index = np.argmax(preds[-1, i])
        next_word = idx2word(next_index)

        sentence.append(next_word)
#     print(' '.join([idx2word(np.argmax(wv)) for wv in preds[-1,:20,:]]))
    print(' '.join(sentence))
    print(' ')

this location is a shame , and i ' m not sure what we ' ve ever had . i love the <UNK> . i think it ' s a big fan of a <UNK> and one of the best restaurants in vegas . my favorite is that the <UNK>
 
my husband and i have been wanting to try the chicken and the chicken . the rice is good , but the food is great ! i ' m a big fan of the fish and beef and the sweet potato salad . the food is fantastic and i definitely
 
it was a bit pricey . the ambiance was nice , the staff was nice and helpful . we had a great experience . i ordered the <UNK> and a <UNK> and i loved the chicken . the salad was fresh and the chicken was outstanding . the chicken was
 
a few times . the food is excellent and the service is excellent . i think it ' s a must try . weeknight and a friendly place . weeknight is not a lot of food , but it ' s a great place to hang out . weeknight ,
 
many . the service was excellent , the food was great . i ' ve been going to the <UNK> <UNK> for a few times and i ' m a bit su