# Deceptive Review Generation: LSTM Generative Model

This is our first notebook hosted on Google Colab, so the formatting may look a bit strange. We are moving to Colab because Google AI has made TPUs free to use there. This opens the door to a lot of exciting possibilities, since hardware is much less of a limitation.

---

In this notebook, I will attempt to create a generative language model that learns how to generate realistic reviews. Let's get started.

---

The first thing we must do is authenticate this Colab session with Google, so that we can then confirm our TPU is connected and working. Note that the cell below only performs the authentication step.

In [0]:
import datetime
import json
import os
import pprint
import random
import string
import sys
import tensorflow as tf
from google.colab import auth

auth.authenticate_user()
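
A TPU connectivity check could look like the minimal sketch below. It assumes the Colab runtime exposes the TPU address through the `COLAB_TPU_ADDR` environment variable, which was the case for older Colab TPU runtimes and may not hold for newer ones. The helper name `tpu_address` is hypothetical and not part of any library.

```python
import os

def tpu_address(environ=os.environ):
    """Return the gRPC address of the attached TPU, or None if none is found.

    Assumption: the runtime sets COLAB_TPU_ADDR to "host:port" when a TPU is attached.
    """
    addr = environ.get('COLAB_TPU_ADDR')
    return 'grpc://' + addr if addr else None

address = tpu_address()
print('TPU address:', address if address else 'no TPU found')
```

If this prints "no TPU found", the runtime type probably needs to be switched to TPU in the Colab runtime settings before continuing.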

Next, let's begin the experiment. Because this notebook runs on Colab rather than locally, we have to do things a little differently.

I ran the NYC data through our Protobuf processor locally and saved the output to a .txt file. In the future we can come up with a better solution (such as using actual Protobuf objects instead of the messy string parsing below), but right now I just want to test Colab.

In [2]:
import os
import re
import random
import sys
import numpy as np
import pandas as pd
import keras
from keras import layers
from keras import optimizers
# EarlyStopping is used when the callbacks are built further down.
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau

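# Download the dataset file (Keras caches it after the first call)
# and split it into raw review records.
def download_and_load_dataset():
  dataset = tf.keras.utils.get_file(
      fname="normalizedNYCYelp.txt",
      origin="https://storage.googleapis.com/lucas0/imdb_classification/normalizedNYCYelp.txt",
      extract=False)
  # Use a context manager so the file handle is closed after reading.
  with open(dataset) as f:
    dfile = f.read()
  # Records are separated by a newline followed by ", ".
  reviews = dfile.split('\n, ')
  return reviews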

reviews = download_and_load_dataset()

data = {}
data['review'] = []
data['deceptive'] = []

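for x in reviews:
  # The review text is on the first line of each record, after the first ': '.
  # Strip double quotes and surrounding whitespace.
  data['review'].append(x.split('\n')[0].split(': ')[1].replace('"', '').strip())
  # Records that contain a 'label: ' field are marked 0; all others are marked 1.
  data['deceptive'].append(0 if 'label: ' in x else 1)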

dataDict = pd.DataFrame.from_dict(data)
print(len(dataDict))

Using TensorFlow backend.


160933


In [3]:
# Keep only reviews of at most 250 characters.
mask = (dataDict['review'].str.len() < 251)
shortReviews = dataDict.loc[mask]
print(len(shortReviews))
# Shuffle the rows, then write them to disk so training can read the text in chunks.
short_reviews = shortReviews.sample(frac=1).reset_index(drop=True)
filename = '/short_reviews.txt'
short_reviews.to_csv(filename, header=None, index=None, sep=' ')
with open(filename) as f:
    text = f.read()
print(len(text))
chars = sorted(set(text))
print('Unique characters:', len(chars))
# Map each character to its integer index in the sorted vocabulary.
char_indices = {char: i for i, char in enumerate(chars)}
maxlen = 60
step = 1
char_indices

54689
7622574
Unique characters: 94


{'\n': 0,
 ' ': 1,
 '!': 2,
 '"': 3,
 '#': 4,
 '$': 5,
 '%': 6,
 '&': 7,
 "'": 8,
 '(': 9,
 ')': 10,
 '*': 11,
 '+': 12,
 ',': 13,
 '-': 14,
 '.': 15,
 '/': 16,
 '0': 17,
 '1': 18,
 '2': 19,
 '3': 20,
 '4': 21,
 '5': 22,
 '6': 23,
 '7': 24,
 '8': 25,
 '9': 26,
 ':': 27,
 ';': 28,
 '=': 29,
 '?': 30,
 '@': 31,
 'A': 32,
 'B': 33,
 'C': 34,
 'D': 35,
 'E': 36,
 'F': 37,
 'G': 38,
 'H': 39,
 'I': 40,
 'J': 41,
 'K': 42,
 'L': 43,
 'M': 44,
 'N': 45,
 'O': 46,
 'P': 47,
 'Q': 48,
 'R': 49,
 'S': 50,
 'T': 51,
 'U': 52,
 'V': 53,
 'W': 54,
 'X': 55,
 'Y': 56,
 'Z': 57,
 '[': 58,
 '\\': 59,
 ']': 60,
 '^': 61,
 '_': 62,
 '`': 63,
 'a': 64,
 'b': 65,
 'c': 66,
 'd': 67,
 'e': 68,
 'f': 69,
 'g': 70,
 'h': 71,
 'i': 72,
 'j': 73,
 'k': 74,
 'l': 75,
 'm': 76,
 'n': 77,
 'o': 78,
 'p': 79,
 'q': 80,
 'r': 81,
 's': 82,
 't': 83,
 'u': 84,
 'v': 85,
 'w': 86,
 'x': 87,
 'y': 88,
 'z': 89,
 '{': 90,
 '|': 91,
 '}': 92,
 '~': 93}

In [0]:
# getDataFromChunk is needed to process large datasets like ours one chunk at a time.
# If your sample is smaller than about 1 million characters, you can train on the whole thing at once.

def getDataFromChunk(txtChunk, maxlen=60, step=1):
    sentences = []
    next_chars = []
    # Slide a window of maxlen characters over the chunk;
    # the target is the character that follows each window.
    for i in range(0, len(txtChunk) - maxlen, step):
        sentences.append(txtChunk[i : i + maxlen])
        next_chars.append(txtChunk[i + maxlen])
    print('nb sequences:', len(sentences))
    print('Vectorization...')
    # One-hot encode inputs and targets. The builtin bool is used because
    # the np.bool alias was removed in NumPy 1.24.
    X = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
    y = np.zeros((len(sentences), len(chars)), dtype=bool)
    for i, sentence in enumerate(sentences):
        for t, char in enumerate(sentence):
            X[i, t, char_indices[char]] = 1
        # Set the target once per sequence rather than once per character.
        y[i, char_indices[next_chars[i]]] = 1
    return [X, y]
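
To make the windowing in `getDataFromChunk` concrete, here is a small sketch on a toy string. The names `windows`, `toy_text`, `toy_chars`, and `toy_index` are made up for this illustration, and a window length of 4 is used instead of 60 so the result is easy to read.

```python
import numpy as np

def windows(text, maxlen, step=1):
    """Return (inputs, targets): each input is maxlen characters long and
    each target is the single character that follows its input."""
    starts = range(0, len(text) - maxlen, step)
    inputs = [text[i:i + maxlen] for i in starts]
    targets = [text[i + maxlen] for i in starts]
    return inputs, targets

toy_text = "good food"
toy_chars = sorted(set(toy_text))
toy_index = {c: i for i, c in enumerate(toy_chars)}

inputs, targets = windows(toy_text, maxlen=4)
print(list(zip(inputs, targets)))

# One-hot encode: X has shape (sequences, maxlen, vocabulary), y has shape (sequences, vocabulary).
X = np.zeros((len(inputs), 4, len(toy_chars)), dtype=bool)
y = np.zeros((len(inputs), len(toy_chars)), dtype=bool)
for i, window in enumerate(inputs):
    for t, ch in enumerate(window):
        X[i, t, toy_index[ch]] = True
    y[i, toy_index[targets[i]]] = True
print(X.shape, y.shape)
```

The first pair is `('good', ' ')`: the model sees four characters and is trained to predict the fifth. With `maxlen=60` and 94 characters, each training sequence in the notebook becomes a 60 x 94 boolean matrix, which is why large texts have to be vectorized chunk by chunk.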

In [5]:
model = keras.models.Sequential()
model.add(layers.LSTM(1024, input_shape=(maxlen, len(chars)), return_sequences=True))
# Only the first layer needs input_shape; later layers infer it from the layer before.
model.add(layers.LSTM(1024))
model.add(layers.Dense(len(chars), activation='softmax'))
model.summary()

Instructions for updating:
Colocations handled automatically by placer.
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 60, 1024)          4583424   
_________________________________________________________________
lstm_2 (LSTM)                (None, 1024)              8392704   
_________________________________________________________________
dense_1 (Dense)              (None, 94)                96350     
Total params: 13,072,478
Trainable params: 13,072,478
Non-trainable params: 0
_________________________________________________________________
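
As a sanity check on the summary above, the parameter counts can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with an input kernel, a recurrent kernel, and a bias; the helper names `lstm_params` and `dense_params` are made up for this illustration.

```python
def lstm_params(input_dim, units):
    # Four gates, each with weights for the input and the previous hidden state, plus a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-output pair, plus one bias per output.
    return input_dim * units + units

vocab, hidden = 94, 1024
counts = [
    lstm_params(vocab, hidden),    # first LSTM reads one-hot characters
    lstm_params(hidden, hidden),   # second LSTM reads the first LSTM's outputs
    dense_params(hidden, vocab),   # softmax layer over the vocabulary
]
print(counts, sum(counts))
```

These match the 4,583,424, 8,392,704, and 96,350 parameters reported by `model.summary()`, for a total of 13,072,478.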


In [0]:
optimizer = keras.optimizers.Adam(lr=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

In [0]:
# Save the weights every time the loss improves, so training can be left to run unattended.
# Also halve the learning rate when the loss plateaus, and stop early if it stops improving.

filepath="Mar-4-all-{epoch:02d}-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.5,
              patience=1, min_lr=0.00001)
early_stopping = EarlyStopping(monitor='loss', patience=6, restore_best_weights=True)
callbacks_list = [checkpoint, reduce_lr, early_stopping]

In [0]:
def sample(preds, temperature=1.0):
    '''
    Sample an index from the probability distribution `preds`,
    reweighted by `temperature`. A very small temperature almost
    always picks the index with the highest probability; a larger
    temperature makes the choice more random.
    '''
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
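
The effect of temperature is easier to see on a small distribution. The sketch below repeats the reweighting step from `sample` without the random draw; `reweight` and `probs` are names made up for this illustration.

```python
import numpy as np

def reweight(preds, temperature):
    """Rescale a probability distribution by temperature and renormalize it."""
    logits = np.log(np.asarray(preds, dtype='float64')) / temperature
    exp_preds = np.exp(logits)
    return exp_preds / exp_preds.sum()

probs = [0.6, 0.3, 0.1]
for temperature in (0.2, 1.0, 2.0):
    print(temperature, np.round(reweight(probs, temperature), 3))
```

At a temperature of 1.0 the distribution is unchanged. Below 1.0 the most likely character takes nearly all of the probability mass, which should give safer but more repetitive text, while above 1.0 the distribution flattens and the output becomes more varied and more error-prone.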

In [0]:
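# Make four passes over the file, training on one 90,000-character chunk at a time.
for iteration in range(1, 5):
    print()
    print('-' * 50)
    print('Iteration', iteration)
    with open("/short_reviews.txt") as f:
        # f.read(90000) returns "" at end of file, which stops the iterator.
        for chunk in iter(lambda: f.read(90000), ""):
            X, y = getDataFromChunk(chunk)
            model.fit(X, y, batch_size=128, epochs=1, callbacks=callbacks_list)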


--------------------------------------------------
Iteration 1
nb sequences: 89940
Vectorization...
Epoch 1/1

Epoch 00001: loss improved from inf to 2.31078, saving model to Mar-4-all-01-2.3108.hdf5
nb sequences: 89940
Vectorization...
Epoch 1/1

Epoch 00001: loss improved from 2.31078 to 1.72143, saving model to Mar-4-all-01-1.7214.hdf5
nb sequences: 89940
Vectorization...
Epoch 1/1

Epoch 00001: loss improved from 1.72143 to 1.52597, saving model to Mar-4-all-01-1.5260.hdf5
nb sequences: 89940
Vectorization...
Epoch 1/1

Epoch 00001: loss improved from 1.52597 to 1.42405, saving model to Mar-4-all-01-1.4241.hdf5
nb sequences: 89940
Vectorization...
Epoch 1/1

Epoch 00001: loss improved from 1.42405 to 1.36622, saving model to Mar-4-all-01-1.3662.hdf5
nb sequences: 89940
Vectorization...
Epoch 1/1

Epoch 00001: loss improved from 1.36622 to 1.31831, saving model to Mar-4-all-01-1.3183.hdf5
nb sequences: 89940
Vectorization...
Epoch 1/1

Epoch 00001: loss did not improve from 1.31831
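
The training loop reads the file in fixed-size pieces using the two-argument form of `iter`, which keeps calling a function until it returns a sentinel value. Here is a minimal sketch of that idiom on an in-memory file; `read_chunks` and `fake_file` are names made up for this illustration, and the sizes are scaled down from 90,000.

```python
import io

def read_chunks(handle, chunk_size):
    """Return an iterator over successive chunk_size-character strings.
    Iteration stops when read() returns the empty string at end of file."""
    return iter(lambda: handle.read(chunk_size), "")

fake_file = io.StringIO("a" * 250)
sizes = [len(chunk) for chunk in read_chunks(fake_file, 100)]
print(sizes)

# Number of training sequences each chunk would yield with a window of 60 and step 1.
maxlen = 60
print([max(size - maxlen, 0) for size in sizes])
```

This is why the log reports 89,940 sequences per chunk: 90,000 characters minus the 60-character window. Two caveats follow from the same arithmetic: windows that would span a chunk boundary are never seen, and a final chunk of 60 characters or fewer yields no sequences at all, so it may be worth skipping such a chunk before calling `model.fit`.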