This notebook uses deep learning (CNN, GRU, LSTM) to predict user ratings of movies based on user reviews. 

In [1]:
import pandas as pd
import numpy as np
import json
from time import time
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import text_to_word_sequence

Using TensorFlow backend.


Data pre-processing.

In [None]:
def get_list_of_dicts(fname):
    # The file holds one JSON object per line; close the handle when done.
    with open(fname, "rt") as f:
        return [json.loads(line) for line in f]

raw_data = get_list_of_dicts("Amazon_Instant_Video_5.json")
data = pd.DataFrame(raw_data).loc[:, ["reviewerID", "reviewText", "asin", "overall"]]

data.head()

Unnamed: 0,reviewerID,reviewText,asin,overall
0,A11N155CW1UV02,I had big expectations because I love English ...,B000H00VBQ,2.0
1,A3BC8O2KCL29V2,I highly recommend this series. It is a must f...,B000H00VBQ,5.0
2,A60D5HQFOTSOM,This one is a real snoozer. Don't believe anyt...,B000H00VBQ,1.0
3,A1RJPIGRSNX4PW,Mysteries are interesting. The tension betwee...,B000H00VBQ,4.0
4,A16XRPF40679KG,"This show always is excellent, as far as briti...",B000H00VBQ,5.0


In [None]:
def add_user_reviews(x):
    # Every other review written by this user (the current movie's review is excluded)
    ur = user_reviews.loc[x["reviewerID"]].drop(x["asin"]).values.tolist()
    # Every other review of this movie (the current user's review is excluded)
    mr = movie_reviews.loc[x["asin"]].drop(x["reviewerID"]).values.tolist()
    x["userReviews"] = " ".join(r[0] for r in ur)
    x["movieReviews"] = " ".join(r[0] for r in mr)
    return x

user_item_review = data.drop("reviewText", axis=1)
user_reviews = pd.pivot_table(data, index=["reviewerID", "asin"], aggfunc=lambda x: x).drop("overall", axis=1)  
movie_reviews = pd.pivot_table(data, index=["asin", "reviewerID"], aggfunc=lambda x: x).drop("overall", axis=1)

df = user_item_review.apply(add_user_reviews, axis=1)
df.head()

Unnamed: 0,reviewerID,asin,overall,userReviews,movieReviews
0,A11N155CW1UV02,B000H00VBQ,2.0,I really like the characters and the actors. I...,"This show always is excellent, as far as briti..."
1,A3BC8O2KCL29V2,B000H00VBQ,5.0,This is one good show. It is interesting in it...,I had big expectations because I love English ...
2,A60D5HQFOTSOM,B000H00VBQ,1.0,I watched this a couple of weeks ago. There ar...,I had big expectations because I love English ...
3,A1RJPIGRSNX4PW,B000H00VBQ,4.0,"The acting was excellent. The acting, the rel...",I had big expectations because I love English ...
4,A16XRPF40679KG,B000H00VBQ,5.0,As many people said this show kept getting bet...,I had big expectations because I love English ...


In [None]:
# Train-test split 
test_size = 0.005

# sample a fraction test_size of the users (0.005 = 0.5%)
unique_users = df.loc[:, "reviewerID"].unique()
users_size = len(unique_users)
test_idx = np.random.choice(users_size, size=int(users_size * test_size), replace=False)

# get test users
test_users = unique_users[test_idx]

# everyone else is a training user
train_users = np.delete(unique_users, test_idx)

test = df[df["reviewerID"].isin(test_users)]
train = df[df["reviewerID"].isin(train_users)]

unique_test_movies = test["asin"].unique()

# Drop training rows whose movie also appears in the test set. Keeping
# train and test disjoint in both users and movies means we are forced
# to discard some data entirely.
train = train[~train["asin"].isin(unique_test_movies)]

train.head()

Unnamed: 0,reviewerID,asin,overall,userReviews,movieReviews
0,A11N155CW1UV02,B000H00VBQ,2.0,I really like the characters and the actors. I...,"This show always is excellent, as far as briti..."
1,A3BC8O2KCL29V2,B000H00VBQ,5.0,This is one good show. It is interesting in it...,I had big expectations because I love English ...
2,A60D5HQFOTSOM,B000H00VBQ,1.0,I watched this a couple of weeks ago. There ar...,I had big expectations because I love English ...
3,A1RJPIGRSNX4PW,B000H00VBQ,4.0,"The acting was excellent. The acting, the rel...",I had big expectations because I love English ...
4,A16XRPF40679KG,B000H00VBQ,5.0,As many people said this show kept getting bet...,I had big expectations because I love English ...
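
The split above holds out whole users and then removes their movies from the training set. The following is a minimal sketch of the same idea on a made-up toy frame (the helper `split_by_user` and all values are hypothetical, not part of the notebook). It checks that train and test share neither users nor movies and counts how many rows the movie filter discards.

```python
import pandas as pd

# Toy frame with the same key columns as `df`; the values are invented.
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u2", "u2", "u3", "u4"],
    "asin":       ["m1", "m2", "m1", "m3", "m3", "m4"],
    "overall":    [5.0, 3.0, 4.0, 2.0, 1.0, 5.0],
})

def split_by_user(frame, test_users):
    """Hold out every row of `test_users`, then drop training rows
    whose movie also occurs in the held-out rows."""
    is_test = frame["reviewerID"].isin(test_users)
    test = frame[is_test]
    train = frame[~is_test]
    train = train[~train["asin"].isin(test["asin"].unique())]
    return train, test

toy_train, toy_test = split_by_user(toy, ["u1"])

shared_users = set(toy_train["reviewerID"]) & set(toy_test["reviewerID"])
shared_movies = set(toy_train["asin"]) & set(toy_test["asin"])
rows_discarded = len(toy) - len(toy_train) - len(toy_test)
print(shared_users, shared_movies, rows_discarded)
```

Running the same two set intersections on the real `train` and `test` frames is a cheap way to confirm the split behaves as intended.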


Embed the reviews with pre-trained GloVe word vectors.

The pre-trained GloVe model is downloadable at
https://nlp.stanford.edu/projects/glove/
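
As a rough illustration of what the lookup involves, the sketch below parses the GloVe text layout (a token followed by its float components on each line) from an in-memory string. The 3-dimensional vectors and the helper name `parse_glove` are invented for the example. Note that the lookup is case-sensitive and the glove.6B vocabulary is uncased, so a capitalised token such as "The" falls back to the padding vector unless the text is lowercased first.

```python
import io
import numpy as np

# Invented 3-dimensional vectors in the GloVe text layout.
fake_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "movie 0.4 0.5 0.6\n"
)

def parse_glove(handle):
    """Build a token -> float32 vector dictionary from GloVe-style lines."""
    table = {}
    for line in handle:
        token, *values = line.split()
        table[token] = np.asarray(values, dtype="float32")
    return table

vectors = parse_glove(fake_glove)
dim = len(next(iter(vectors.values())))
pad = np.zeros(dim, dtype="float32")

# "The" is not in the table (only "the" is), so it maps to the zero vector.
tokens = "The movie".split()
looked_up = [vectors.get(token, pad) for token in tokens]
print(looked_up)
```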

In [None]:
import os.path
# Functions that map review text to sequences of GloVe word vectors
def init_embeddings_map(fname):
    with open(os.path.join("glove.6B", fname), encoding="utf8") as glove:
        return {l[0]: np.asarray(l[1:], dtype="float32") for l in
                [line.split() for line in glove]}

def get_embed_func(i_len, u_len, pad_value, embedding_map):
    def embed_text(text, max_len):
        # Truncate to max_len tokens, map unknown words to pad_value,
        # then right-pad with pad_value up to max_len
        words = text.split()[:max_len]
        vectors = [embedding_map.get(word, pad_value) for word in words]
        return vectors + [pad_value] * (max_len - len(vectors))

    def embed(row):
        row["userReviews"] = embed_text(row["userReviews"], u_len)
        row["movieReviews"] = embed_text(row["movieReviews"], i_len)
        return row
    return embed

emb_size = 50 #or 100, 200, 300
embedding_map = init_embeddings_map("glove.6B." + str(emb_size) + "d.txt")

user_sizes = df.loc[:, "userReviews"].apply(lambda x: x.split()).apply(len)
item_sizes = df.loc[:, "movieReviews"].apply(lambda x: x.split()).apply(len)

u_ptile = 40
i_ptile = 15
u_len = int(np.percentile(user_sizes, u_ptile))
i_len = int(np.percentile(item_sizes, i_ptile))

embedding_fn = get_embed_func(i_len, u_len, np.array([0.0] * emb_size), embedding_map)
    
train_embedded = train.apply(embedding_fn, axis=1)
test_embedded = test.apply(embedding_fn, axis=1)

print(u_len, i_len) # size of input in deep neural networks, useful to set parameters
train_embedded.head()

241 722


Unnamed: 0,reviewerID,asin,overall,userReviews,movieReviews
0,A11N155CW1UV02,B000H00VBQ,2.0,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...","[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."
1,A3BC8O2KCL29V2,B000H00VBQ,5.0,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...","[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."
2,A60D5HQFOTSOM,B000H00VBQ,1.0,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...","[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."
3,A1RJPIGRSNX4PW,B000H00VBQ,4.0,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...","[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."
4,A16XRPF40679KG,B000H00VBQ,5.0,"[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...","[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,..."
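
The input length of each tower comes from a percentile of the token counts, and every text is then cut or zero-padded to that length. Here is a small sketch of that step with invented lengths and a hypothetical helper `pad_or_truncate`. It shows that the result stacks into a single array of shape (samples, max_len, emb_size), which is the shape the models below expect.

```python
import numpy as np

emb_size = 4                      # toy embedding width, not the notebook's 50
lengths = [2, 5, 9, 12, 30]       # invented token counts per document
max_len = int(np.percentile(lengths, 40))

pad = np.zeros(emb_size)

def pad_or_truncate(vectors, max_len, pad):
    """Cut a list of word vectors to max_len, then right-pad with `pad`."""
    clipped = vectors[:max_len]
    return clipped + [pad] * (max_len - len(clipped))

# Each document is a list of all-ones vectors so padding is easy to spot.
docs = [[np.ones(emb_size)] * n for n in lengths]
batch = np.array([pad_or_truncate(doc, max_len, pad) for doc in docs])

truncated = sum(n > max_len for n in lengths)
print(max_len, batch.shape, truncated)
```

A lower percentile keeps the input tensors small but truncates more documents, so the percentile is a trade-off between memory and how much review text the model sees.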


Deep learning models

In [None]:
import tensorflow as tf
from keras.models import Model
from keras.callbacks import EarlyStopping, TensorBoard, ModelCheckpoint
from keras.layers import Conv1D, GRU, LSTM, MaxPooling1D, Flatten
from keras.layers import Input, Dense, Dropout
from keras.layers import Add, Dot, Concatenate

In [None]:
def cnn_tower(max_len, embedding_size, hidden_size, filters=4, kernel_size=10):
    input_layer = Input(shape=(max_len, embedding_size))
    tower = Conv1D(filters=filters, kernel_size=kernel_size, activation="tanh")(input_layer)
    tower = MaxPooling1D()(tower)
    tower = Conv1D(filters=filters, kernel_size=kernel_size, activation="tanh")(tower)
    tower = MaxPooling1D()(tower)
    tower = Flatten()(tower)
    tower = Dense(hidden_size, activation="relu")(tower)
    tower = Dropout(0.4)(tower)
    return input_layer, tower
    
def CNN_model(embedding_size, hidden_size, u_len, i_len):
    inputU, towerU = cnn_tower(u_len, embedding_size, hidden_size)
    inputM, towerM = cnn_tower(i_len, embedding_size, hidden_size)
    joined = Concatenate()([towerU, towerM])
    outNeuron = Dense(1)(joined)
    dotproduct = Dot(axes=1)([towerU, towerM])
    output_layer = Add()([outNeuron, dotproduct])
        
    model = Model(inputs=[inputU, inputM], outputs=[output_layer])
    return model

hidden_size = 64

model_cnn = CNN_model(emb_size, hidden_size, u_len, i_len)
model_cnn.compile(optimizer='Adam', loss='mse')
model_cnn.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 241, 50)      0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 722, 50)      0                                            
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 232, 4)       2004        input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, 713, 4)       2004        input_2[0][0]                    
__________________________________________________________________________________________________
max_poolin
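
The shapes and parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras defaults used in `cnn_tower` (no padding, stride 1 for `Conv1D`, pool size 2 for `MaxPooling1D`); the helper names are invented. Only the first convolution of each tower is visible in the truncated summary above, so the later lengths are derived values rather than copied output.

```python
def conv1d_output_len(input_len, kernel_size):
    """Sequence length after an unpadded, stride-1 convolution."""
    return input_len - kernel_size + 1

def maxpool1d_output_len(input_len, pool_size=2):
    """Sequence length after unpadded pooling with stride == pool_size."""
    return (input_len - pool_size) // pool_size + 1

def conv1d_param_count(in_channels, kernel_size, filters):
    """Kernel weights plus one bias per filter."""
    return in_channels * kernel_size * filters + filters

def tower_lengths(max_len, kernel_size=10):
    """Lengths after input, conv, pool, conv, pool."""
    lengths = [max_len]
    for _ in range(2):
        lengths.append(conv1d_output_len(lengths[-1], kernel_size))
        lengths.append(maxpool1d_output_len(lengths[-1]))
    return lengths

user_lengths = tower_lengths(241)
movie_lengths = tower_lengths(722)
first_conv_params = conv1d_param_count(50, 10, 4)
print(user_lengths, movie_lengths, first_conv_params)
```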

In [None]:
batch_size = 32
epochs = 20

user_reviews = np.array(list(train_embedded.loc[:, "userReviews"]))
movie_reviews = np.array(list(train_embedded.loc[:, "movieReviews"]))

train_inputs = [user_reviews, movie_reviews]
train_outputs = train_embedded.loc[:, "overall"]

tensorboard = TensorBoard(log_dir="cnn_log")
earlystop = EarlyStopping(monitor='val_loss', patience=3)
checkpoint = ModelCheckpoint('cnn_weights.{epoch:02d}-{val_loss:.2f}.h5', monitor='val_loss', save_best_only=True)
train_history = model_cnn.fit(train_inputs, train_outputs, callbacks=[tensorboard, earlystop, checkpoint], 
                              validation_split=0.05, batch_size=batch_size, epochs=epochs)

model_cnn.save("cnn.h5")

Train on 24469 samples, validate on 1288 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20


In [None]:
def gru_tower(max_len, embedding_size, hidden_size, rnn_hidden_size):
    input_layer = Input(shape=(max_len, embedding_size))
    tower = GRU(rnn_hidden_size, activation="tanh")(input_layer)
    tower = Dense(hidden_size, activation="relu")(tower)
    tower = Dropout(0.4)(tower)
    return input_layer, tower
    
def GRU_model(embedding_size, hidden_size, rnn_hidden_size, u_len, i_len):
    inputU, towerU = gru_tower(u_len, embedding_size, hidden_size, rnn_hidden_size)
    inputM, towerM = gru_tower(i_len, embedding_size, hidden_size, rnn_hidden_size)
    joined = Concatenate()([towerU, towerM])
    outNeuron = Dense(1)(joined)
    dotproduct = Dot(axes=1)([towerU, towerM])
    output_layer = Add()([outNeuron, dotproduct])
        
    model = Model(inputs=[inputU, inputM], outputs=[output_layer])
    return model

hidden_size = 64
rnn_hidden_size = 64

model_gru = GRU_model(emb_size, hidden_size, rnn_hidden_size, u_len, i_len)
model_gru.compile(optimizer='Adam', loss='mse')
model_gru.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 241, 50)      0                                            
__________________________________________________________________________________________________
input_4 (InputLayer)            (None, 722, 50)      0                                            
__________________________________________________________________________________________________
gru_1 (GRU)                     (None, 64)           22080       input_3[0][0]                    
__________________________________________________________________________________________________
gru_2 (GRU)                     (None, 64)           22080       input_4[0][0]                    
__________________________________________________________________________________________________
dense_4 (D
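
All three models share the same output head: a single linear neuron over the concatenated tower outputs, added to the dot product of the two towers. A NumPy sketch of that computation follows, with random stand-ins for the tower outputs and the `Dense(1)` weights (every value here is invented, and the sizes are smaller than the notebook's).

```python
import numpy as np

rng = np.random.default_rng(0)
batch, hidden_size = 3, 4

# Stand-ins for the outputs of the user and movie towers.
tower_u = rng.normal(size=(batch, hidden_size))
tower_m = rng.normal(size=(batch, hidden_size))

# Stand-ins for the Dense(1) kernel and bias.
w = rng.normal(size=(2 * hidden_size, 1))
b = np.zeros(1)

# Dense(1) over Concatenate()([towerU, towerM])
linear_term = np.concatenate([tower_u, tower_m], axis=1) @ w + b

# Dot(axes=1)([towerU, towerM]): one inner product per sample
interaction = np.sum(tower_u * tower_m, axis=1, keepdims=True)

# Add()([outNeuron, dotproduct])
prediction = linear_term + interaction
print(prediction.shape)
```

The linear term lets each tower contribute to the rating on its own, while the dot product models how well a particular user and movie match.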

In [None]:
batch_size = 32
epochs = 20

tensorboard = TensorBoard(log_dir="gru_log")
earlystop = EarlyStopping(monitor='val_loss', patience=3)
checkpoint = ModelCheckpoint('gru_weights.{epoch:02d}-{val_loss:.2f}.h5', monitor='val_loss', save_best_only=True)
train_history = model_gru.fit(train_inputs, train_outputs, callbacks=[tensorboard, earlystop, checkpoint], 
                              validation_split=0.05, batch_size=batch_size, epochs=epochs)

model_gru.save("gru.h5")

Train on 24469 samples, validate on 1288 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20


In [None]:
# LSTM model
def lstm_tower(max_len, embedding_size, hidden_size, rnn_hidden_size):
    input_layer = Input(shape=(max_len, embedding_size))
    tower = LSTM(rnn_hidden_size, activation="tanh")(input_layer)
    tower = Dense(hidden_size, activation="relu")(tower)
    tower = Dropout(0.4)(tower)
    return input_layer, tower
    
def LSTM_model(embedding_size, hidden_size, rnn_hidden_size, u_len, i_len):
    inputU, towerU = lstm_tower(u_len, embedding_size, hidden_size, rnn_hidden_size)
    inputM, towerM = lstm_tower(i_len, embedding_size, hidden_size, rnn_hidden_size)
    joined = Concatenate()([towerU, towerM])
    outNeuron = Dense(1)(joined)
    dotproduct = Dot(axes=1)([towerU, towerM])
    output_layer = Add()([outNeuron, dotproduct])
        
    model = Model(inputs=[inputU, inputM], outputs=[output_layer])
    return model


hidden_size = 64
rnn_hidden_size = 64

model_lstm = LSTM_model(emb_size, hidden_size, rnn_hidden_size, u_len, i_len)
model_lstm.compile(optimizer='Adam', loss='mse')
model_lstm.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_5 (InputLayer)            (None, 241, 50)      0                                            
__________________________________________________________________________________________________
input_6 (InputLayer)            (None, 722, 50)      0                                            
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 64)           29440       input_5[0][0]                    
__________________________________________________________________________________________________
lstm_2 (LSTM)                   (None, 64)           29440       input_6[0][0]                    
__________________________________________________________________________________________________
dense_7 (D
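
The recurrent layers' parameter counts in the GRU and LSTM summaries follow from one formula: each gate has an input kernel, a recurrent kernel, and a bias. The sketch below (the helper name is invented) reproduces both numbers, assuming a single bias vector per gate, which matches the counts printed here. Keras versions whose GRU defaults to `reset_after=True` are understood to carry an extra bias per gate, so their GRU count would be larger; treat that as an assumption to verify.

```python
def rnn_param_count(input_dim, units, gates):
    """Parameters of a gated recurrent layer with one bias vector per gate.

    Per gate: input kernel (input_dim x units), recurrent kernel
    (units x units), and a bias of length units.
    """
    return gates * (units * (input_dim + units) + units)

gru_params = rnn_param_count(input_dim=50, units=64, gates=3)
lstm_params = rnn_param_count(input_dim=50, units=64, gates=4)
print(gru_params, lstm_params)
```

The count does not depend on the sequence length, which is why the user tower (241 steps) and the movie tower (722 steps) report identical numbers.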

In [14]:
batch_size = 32
epochs = 30

tensorboard = TensorBoard(log_dir="lstm_log")
earlystop = EarlyStopping(monitor='val_loss', patience=3)
checkpoint = ModelCheckpoint('lstm_weights.{epoch:02d}-{val_loss:.2f}.h5', monitor='val_loss', save_best_only=True)
train_history = model_lstm.fit(train_inputs, train_outputs, callbacks=[tensorboard, earlystop, checkpoint], 
                              validation_split=0.05, batch_size=batch_size, epochs=epochs)

model_lstm.save("lstm.h5")

Train on 24469 samples, validate on 1288 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30


Prediction and errors

In [15]:
user_reviews = np.array(list(test_embedded.loc[:, "userReviews"]))
movie_reviews = np.array(list(test_embedded.loc[:, "movieReviews"]))
test_inputs = [user_reviews, movie_reviews]

true_rating = np.array(list(test_embedded.loc[:, "overall"])).reshape((-1, 1))

predictions_cnn = model_cnn.predict(test_inputs)
predictions_gru = model_gru.predict(test_inputs)
predictions_lstm = model_lstm.predict(test_inputs)

error_cnn = np.square(predictions_cnn - true_rating)
print("Test MSE for CNN model :", np.average(error_cnn))

error_gru = np.square(predictions_gru - true_rating)
print("Test MSE for GRU model :", np.average(error_gru))

error_lstm = np.square(predictions_lstm - true_rating)
print("Test MSE for LSTM model :", np.average(error_lstm))

Test MSE for CNN model : 1.606608528314209
Test MSE for GRU model : 1.5976189643735184
Test MSE for LSTM model : 1.5789539424967107
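
The `reshape((-1, 1))` on the true ratings matters: `predict` returns an array of shape (n, 1), and subtracting a flat array of shape (n,) would broadcast to an (n, n) matrix and silently give a wrong average. The toy example below (values invented) shows the difference and also converts the MSE to an RMSE, which is in rating units.

```python
import numpy as np

predictions = np.array([[4.5], [3.0], [1.5]])   # shape (3, 1), like model.predict
ratings = np.array([5.0, 3.0, 1.0])             # shape (3,)

# Without the reshape, broadcasting compares every prediction with every rating.
wrong = np.square(predictions - ratings)

# With the reshape, each prediction is compared with its own rating.
right = np.square(predictions - ratings.reshape((-1, 1)))

mse = right.mean()
rmse = np.sqrt(mse)
print(wrong.shape, right.shape, mse, rmse)
```

By the same conversion, the test MSE values above correspond to an RMSE of roughly 1.26 to 1.27 stars.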


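The data is very sparse, and the models would likely benefit from more training epochs and a larger dataset, so the resulting MSEs are not very satisfying. The differences between models are also small (under 0.03 in MSE), and the test set covers only 0.5% of users. We can still compare the models and draw some tentative conclusions.

1. The RNN models (GRU and LSTM) score slightly better than the CNN. A possible reason is that reviews are sequential data.
2. LSTM scores slightly better than GRU. More epochs may improve performance further.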