# Keras on Yelp dataset

This is the final day's work. Let's see if we can apply what we've learned to a new problem for which I already have machine learning metrics. We're going to do sentiment analysis on the Yelp dataset from (I believe) 2014, which contains review text and a five-point star rating.

This was my grad school final project and the lexicon-based approach took a long time to get working. I'm interested to see if I can condense weeks of work into a single day.

Note that I'm modifying the IMDB sentiment code for this particular problem.

In [101]:
'''Train RNN on Yelp dataset'''

from __future__ import print_function
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Conv1D, MaxPooling1D
from keras.datasets import imdb
import keras
import pandas as pd
import numpy as np

In [12]:
imdb

<module 'keras.datasets.imdb' from '/home/jasen/anaconda3/envs/keras/lib/python2.7/site-packages/keras/datasets/imdb.pyc'>

OK, so upon checking imdb.py we see that the load_data() function does almost all the heavy lifting for importing and formatting the dataset. I'm going to add that particular function below for reference.

In [15]:
data = pd.read_csv("./Yelp/yelp_academic_all.csv")
data.head()



Unnamed: 0.1,Unnamed: 0,user_id,review_id,review_text,business_id,review_stars,review_date,review_type,user_review_count,user_type,...,business_categories,latitude,longitude,business_name,business_review_count,city,state,full_address,business_stars,business_type
0,0,Xqd0DzHaiyRqVH3WRG7hzg,15SdjuK7DmYqUAj6rjGowg,dr. goldberg offers everything i look for in a...,vcNAWiLM4dR7D2nwwJ7nCA,5.0,2007-05-17,review,270,user,...,"['Doctors', 'Health & Medical']",33.499313,-111.984,"Eric Goldberg, MD",9,Phoenix,AZ,"4840 E Indian School Rd\nSte 101\nPhoenix, AZ ...",3.5,business
1,1,H1kH6QZV7Le4zqTRNxoZow,RF6UnRTtG7tWMcrO2GEoAg,"Unfortunately, the frustration of being Dr. Go...",vcNAWiLM4dR7D2nwwJ7nCA,2.0,2010-03-22,review,14,user,...,"['Doctors', 'Health & Medical']",33.499313,-111.984,"Eric Goldberg, MD",9,Phoenix,AZ,"4840 E Indian School Rd\nSte 101\nPhoenix, AZ ...",3.5,business
2,2,zvJCcrpm2yOZrxKffwGQLA,-TsVN230RCkLYKBeLsuz7A,Dr. Goldberg has been my doctor for years and ...,vcNAWiLM4dR7D2nwwJ7nCA,4.0,2012-02-14,review,1446,user,...,"['Doctors', 'Health & Medical']",33.499313,-111.984,"Eric Goldberg, MD",9,Phoenix,AZ,"4840 E Indian School Rd\nSte 101\nPhoenix, AZ ...",3.5,business
3,3,KBLW4wJA_fwoWmMhiHRVOA,dNocEAyUucjT371NNND41Q,Been going to Dr. Goldberg for over 10 years. ...,vcNAWiLM4dR7D2nwwJ7nCA,4.0,2012-03-02,review,16,user,...,"['Doctors', 'Health & Medical']",33.499313,-111.984,"Eric Goldberg, MD",9,Phoenix,AZ,"4840 E Indian School Rd\nSte 101\nPhoenix, AZ ...",3.5,business
4,4,zvJCcrpm2yOZrxKffwGQLA,ebcN2aqmNUuYNoyvQErgnA,Got a letter in the mail last week that said D...,vcNAWiLM4dR7D2nwwJ7nCA,4.0,2012-05-15,review,1446,user,...,"['Doctors', 'Health & Medical']",33.499313,-111.984,"Eric Goldberg, MD",9,Phoenix,AZ,"4840 E Indian School Rd\nSte 101\nPhoenix, AZ ...",3.5,business


For our current purposes, I only want to look at the review text and star rating.  I'm going to skip the review date for now.

In [16]:
data = data[["review_text","review_stars"]]
data.columns = ["x","y"]
data.head()

Unnamed: 0,x,y
0,dr. goldberg offers everything i look for in a...,5.0
1,"Unfortunately, the frustration of being Dr. Go...",2.0
2,Dr. Goldberg has been my doctor for years and ...,4.0
3,Been going to Dr. Goldberg for over 10 years. ...,4.0
4,Got a letter in the mail last week that said D...,4.0


In [17]:
#We have over 1.5 million samples.  Let's train with the standard 80/20 guideline
len(data)

1569265
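
The 80/20 guideline can be turned into a concrete split index. This is a minimal sketch that assumes a plain integer sample count; `split_index` is a hypothetical helper, not part of pandas or Keras.

```python
def split_index(n_samples, train_percent=80):
    """Return the row index where the training portion ends."""
    if not 0 < train_percent < 100:
        raise ValueError("train_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding surprises
    return n_samples * train_percent // 100


n_samples = 1569265           # row count reported by len(data)
cut = split_index(n_samples)  # rows [0, cut) train, rows [cut, n_samples) test
print(cut)
print(n_samples - cut)
```

Any cut near this value gives roughly the same proportions, so rounding it to a convenient number is fine.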

In [79]:
#Shuffle the dataset, sample the entire thing randomly and return without the original index
seed=123
df = data.copy().sample(frac=1,random_state=seed).reset_index(drop=True)
df = df[(df['y']>0)&(df['y']!=12.0)]
df.head()

Unnamed: 0,x,y
0,My family LOVES Taste of Italy in Goodyear! We...,5.0
1,Waterslide Hyperdrive\nBy The Rue\n\nSunday I ...,4.0
2,Nice little spot with a coin-op pool table.,4.0
3,"I've driven past this place about 50 times, th...",5.0
4,"I was craving for something cold and icy, I as...",3.0


In [80]:
#Two samples had weird values for the ratings; the filter in the previous cell got rid of those
len(df)

1569263
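
Rather than hard-coding the bad values, a more general check lists every rating that falls outside the expected scale. The sketch below uses made-up ratings; `invalid_ratings` and `VALID_STARS` are hypothetical names introduced only for illustration.

```python
from collections import Counter

# Assumed valid scale: whole stars from 1 to 5
VALID_STARS = {1.0, 2.0, 3.0, 4.0, 5.0}


def invalid_ratings(ratings):
    """Count rating values that fall outside the 1-5 star scale."""
    counts = Counter(ratings)
    return {value: n for value, n in counts.items() if value not in VALID_STARS}


# Made-up ratings for illustration only
sample = [5.0, 4.0, 0.0, 3.0, 12.0, 5.0]
print(invalid_ratings(sample))
```

With a DataFrame, the same idea would presumably be applied to the rating column before filtering.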

In [81]:
from keras.preprocessing import text

In [82]:
#First off, some of the dataset seems to have only float values, so we need to make sure they're all strings for the tokenizer
df['x'] = df['x'].apply(str)

#Then we tokenize
tokenizer = text.Tokenizer()
tokenizer.fit_on_texts(df['x'])

In [83]:
#Alright, let's see if the tokenizer did its job!
df['x_tokenized'] = tokenizer.texts_to_sequences(df['x'])
df['x_tokenized'].head()

0    [13, 344, 1384, 241, 7, 3254, 11, 8161, 17, 47...
1    [25538, 146658, 72, 1, 7166, 649, 3, 192, 13, ...
2           [73, 90, 341, 14, 4, 5897, 7474, 461, 169]
3    [92, 3834, 560, 16, 29, 57, 531, 178, 942, 3, ...
4    [3, 6, 1047, 10, 176, 470, 2, 7205, 3, 214, 1,...
Name: x_tokenized, dtype: object

### ^ Happy dance ^

Let's check #2 there - it seems weird that one would be so short

In [84]:
df['x'][2]

'Nice little spot with a coin-op pool table.'

Cool, so that makes sense!
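
To make the tokenizer output less of a black box, here is a rough pure-Python approximation of what a frequency-ranked word index does. It is a sketch, not the Keras implementation: `split_words`, `fit_word_index` and `to_sequence` are hypothetical helpers, and the regular expression only approximates the Keras default punctuation filter.

```python
import re
from collections import Counter


def split_words(text):
    """Lowercase and split on anything that is not a letter, digit or apostrophe."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_word_index(texts):
    """Map each word to a rank: 1 is the most frequent word, 0 stays free for padding."""
    counts = Counter(word for text in texts for word in split_words(text))
    # Ties are broken alphabetically here so the result is deterministic
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def to_sequence(text, word_index):
    """Replace each known word with its index; unknown words are skipped."""
    return [word_index[w] for w in split_words(text) if w in word_index]


review = 'Nice little spot with a coin-op pool table.'
print(split_words(review))  # "coin-op" splits into two tokens
```

Because the hyphen is treated as a separator, the short review above yields nine tokens, which matches the length of its tokenized row.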

I'm going to leave the cell below in, but would note that simply casting these as ints was a bad idea: the 1-5 ratings didn't line up with numpy's zero-based indexing, which caused some problems.

In [153]:
#The ratings are actually ints, so I'll cast them that way
df['y'] = df['y'].apply(int)

#Yelp's scale goes from 1-5, but it makes things difficult for us later because numpy counts from 0
#This is a sloppy fix for that
df['y'] = df['y'] - 1
df.head()

Unnamed: 0,x,y,x_tokenized
0,My family LOVES Taste of Italy in Goodyear! We...,4,"[13, 344, 1384, 241, 7, 3254, 11, 8161, 17, 47..."
1,Waterslide Hyperdrive\nBy The Rue\n\nSunday I ...,3,"[25538, 146658, 72, 1, 7166, 649, 3, 192, 13, ..."
2,Nice little spot with a coin-op pool table.,3,"[73, 90, 341, 14, 4, 5897, 7474, 461, 169]"
3,"I've driven past this place about 50 times, th...",4,"[92, 3834, 560, 16, 29, 57, 531, 178, 942, 3, ..."
4,"I was craving for something cold and icy, I as...",2,"[3, 6, 1047, 10, 176, 470, 2, 7205, 3, 214, 1,..."


Now that we have the data in the same format that the load_data() function is expecting, let's just make a few small changes to it and run this stuff through.

In [87]:
import warnings  # needed by the legacy `nb_words` warning below

def load_data(x_train, labels_train, x_test, labels_test, num_words=None, skip_top=0,
              maxlen=None, seed=123,
              start_char=1, oov_char=2, index_from=3, **kwargs):
    """Loads the YELP dataset.

    # Arguments
        x_train, x_test: arrays of tokenized reviews (lists of word indices).
        labels_train, labels_test: arrays of the matching labels.
        num_words: max number of words to include. Words are ranked
            by how often they occur (in the training set) and only
            the most frequent words are kept
        skip_top: skip the top N most frequently occurring words
            (which may not be informative).
        maxlen: drop sequences of this length or longer.
        seed: random seed for sample shuffling.
        start_char: The start of a sequence will be marked with this character.
            Set to 1 because 0 is usually the padding character.
        oov_char: words that were cut out because of the `num_words`
            or `skip_top` limit will be replaced with this character.
        index_from: index actual words with this index and higher.

    # Returns
        Tuple of Numpy arrays: `(x_train, y_train), (x_test, y_test)`.

    # Raises
        ValueError: in case `maxlen` is so low
            that no input sequence could be kept.

    Note that the 'out of vocabulary' character is only used for
    words that were present in the training set but are not included
    because they're not making the `num_words` cut here.
    Words that were not seen in the training set but are in the test set
    have simply been skipped.
    """
    # Legacy support
    if 'nb_words' in kwargs:
        warnings.warn('The `nb_words` argument in `load_data` '
                      'has been renamed `num_words`.')
        num_words = kwargs.pop('nb_words')
    if kwargs:
        raise TypeError('Unrecognized keyword arguments: ' + str(kwargs))


    np.random.seed(seed)
    np.random.shuffle(x_train)
    np.random.seed(seed)
    np.random.shuffle(labels_train)

    np.random.seed(seed * 2)
    np.random.shuffle(x_test)
    np.random.seed(seed * 2)
    np.random.shuffle(labels_test)

    xs = np.concatenate([x_train, x_test])
    labels = np.concatenate([labels_train, labels_test])

    if start_char is not None:
        xs = [[start_char] + [w + index_from for w in x] for x in xs]
    elif index_from:
        xs = [[w + index_from for w in x] for x in xs]

    if maxlen:
        new_xs = []
        new_labels = []
        for x, y in zip(xs, labels):
            if len(x) < maxlen:
                new_xs.append(x)
                new_labels.append(y)
        xs = new_xs
        labels = new_labels
    if not xs:
        raise ValueError('After filtering for sequences shorter than maxlen=' +
                         str(maxlen) + ', no sequence was kept. '
                         'Increase maxlen.')
    if not num_words:
        num_words = max([max(x) for x in xs])

    # by convention, use 2 as OOV word
    # reserve 'index_from' (=3 by default) characters:
    # 0 (padding), 1 (start), 2 (OOV)
    if oov_char is not None:
        xs = [[oov_char if (w >= num_words or w < skip_top) else w for w in x] for x in xs]
    else:
        new_xs = []
        for x in xs:
            nx = []
            for w in x:
                # keep only the words inside the allowed index range
                if skip_top <= w < num_words:
                    nx.append(w)
            new_xs.append(nx)
        xs = new_xs

    x_train = np.array(xs[:len(x_train)])
    y_train = np.array(labels[:len(x_train)])

    x_test = np.array(xs[len(x_train):])
    y_test = np.array(labels[len(x_train):])

    return (x_train, y_train), (x_test, y_test)
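
The index bookkeeping in load_data() is easier to follow on a single tiny sequence. The sketch below covers only the shift, start marker and out-of-vocabulary steps, assuming `skip_top=0`; `reindex` is a hypothetical helper, not part of Keras.

```python
def reindex(seq, num_words, start_char=1, oov_char=2, index_from=3):
    """Shift word indices, prepend the start marker and mask rare words."""
    # Indices 0 (padding), 1 (start) and 2 (OOV) are reserved, so every
    # real word index moves up by index_from
    shifted = [start_char] + [w + index_from for w in seq]
    # Anything at or above num_words becomes the OOV marker
    return [w if w < num_words else oov_char for w in shifted]


print(reindex([1, 5, 20000], num_words=10))  # [1, 4, 8, 2]
```

The rare word with index 20000 ends up as 2, while the two frequent words keep their shifted indices.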

In [157]:
#Ok, so actually thinking about it - maybe I shouldn't make such a huge leap right off the bat.  Let's train on a smaller
# sample to start.
x_train = np.array(df['x_tokenized'][:50000])
y_train = np.array(df['y'][:50000])
x_test = np.array(df['x_tokenized'][50000:70000])
y_test = np.array(df['y'][50000:70000])

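#Cap the vocabulary at 20000 words so no index exceeds the Embedding layer's input size (max_features)
(x_train, y_train), (x_test, y_test) = load_data(x_train,y_train,x_test,y_test,num_words=20000)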

#This part took a little while to figure out. We have to convert our labels to one-hot (categorical) arrays.
#This should be fairly simple, but to_categorical treats each label as a zero-based index,
#so a rating of '5' pointed at index 5 rather than index 4.
#We fixed this by subtracting 1 from every rating and it worked
y_train_cat = keras.utils.to_categorical(y_train, num_classes=5)
y_test_cat = keras.utils.to_categorical(y_test, num_classes=5)
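
The off-by-one issue described in the comments above is easy to reproduce without Keras. This is a minimal sketch of one-hot encoding with zero-based labels; `one_hot` is a hypothetical stand-in for illustration, not the Keras function itself.

```python
def one_hot(labels, num_classes):
    """Turn zero-based integer labels into one-hot rows."""
    rows = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError("label %r is outside 0..%d" % (label, num_classes - 1))
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


stars = [5, 1, 3]                 # ratings on the 1-5 scale
labels = [s - 1 for s in stars]   # shift to 0-4 so 5 stars maps to index 4
print(one_hot(labels, num_classes=5))
```

Passing the unshifted rating 5 with `num_classes=5` raises an error in this sketch, which is the same mismatch that subtracting 1 resolves.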

In [156]:
# Embedding
max_features = 20000
maxlen = 100
embedding_size = 128

# Convolution
kernel_size = 5
filters = 64
pool_size = 5

# LSTM
lstm_output_size = 70

# Training
batch_size = 4000
epochs = 2
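
These settings determine how many timesteps reach the LSTM. The sketch below works that out for a 'valid' convolution with stride 1, followed by an optional max-pooling step whose stride is assumed equal to the pool size; both function names are hypothetical.

```python
def conv1d_valid_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with no padding ('valid')."""
    return (length - kernel_size) // stride + 1


def max_pool_length(length, pool_size):
    """Output length of max pooling, assuming stride == pool_size and no padding."""
    return length // pool_size


steps = conv1d_valid_length(100, kernel_size=5)  # maxlen=100, kernel_size=5
print(steps)                                     # 96
print(max_pool_length(steps, pool_size=5))       # 19
```

So without pooling the LSTM processes 96 timesteps per review, and pooling with size 5 would cut that to 19.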


In [158]:
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')

model = Sequential()

model.add(Embedding(max_features, embedding_size, input_length=maxlen))
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
#model.add(MaxPooling1D(pool_size=pool_size))
model.add(LSTM(lstm_output_size))
model.add(Dropout(0.5))
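model.add(Dense(5))
#Each review has exactly one star rating, so use softmax with categorical
#cross-entropy rather than sigmoid with binary cross-entropy
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='RMSprop',
              metrics=['accuracy'])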

print('Train...')
model.fit(x_train, y_train_cat,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test_cat))
score, acc = model.evaluate(x_test, y_test_cat, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

50000 train sequences
20000 test sequences
Pad sequences (samples x time)
x_train shape: (50000, 100)
x_test shape: (20000, 100)
Build model...
Train...
Train on 50000 samples, validate on 20000 samples
Epoch 1/2
Epoch 2/2
Test accuracy: 0.589500048071
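
The padded shapes above come from forcing every review to exactly maxlen tokens. Here is a minimal sketch of that behaviour, assuming the Keras defaults of padding and truncating at the front of the sequence; `pad_pre` is a hypothetical helper.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen items of long ones."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    seq = list(seq)[-maxlen:]                     # drop tokens from the front
    return [value] * (maxlen - len(seq)) + seq    # fill the front with padding


print(pad_pre([7, 8, 9], maxlen=5))           # [0, 0, 7, 8, 9]
print(pad_pre([1, 2, 3, 4, 5, 6], maxlen=4))  # [3, 4, 5, 6]
```

With maxlen=100, long reviews are therefore represented only by their final 100 tokens.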


# Victory! Sort of!

We got it running! It does pretty well with the limited dataset: roughly 59% accuracy on a 5-class problem with only 50,000 training examples and 2 epochs isn't bad. So let's go for the full dataset and see how we do.

In [159]:
#Split the dataset - roughly 80/20
x_train = np.array(df['x_tokenized'][:1255000])
y_train = np.array(df['y'][:1255000])
x_test = np.array(df['x_tokenized'][1255000:])
y_test = np.array(df['y'][1255000:])
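#Cap the vocabulary at 20000 words so no index exceeds the Embedding layer's input size (max_features)
(x_train, y_train), (x_test, y_test) = load_data(x_train,y_train,x_test,y_test,num_words=20000)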

y_train_cat = keras.utils.to_categorical(y_train, num_classes=5)
y_test_cat = keras.utils.to_categorical(y_test, num_classes=5)

In [160]:
# Embedding
max_features = 20000
maxlen = 100
embedding_size = 128

# Convolution
kernel_size = 5
filters = 64
pool_size = 4

# LSTM
lstm_output_size = 70

# Training
batch_size = 60
epochs = 2

print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')

print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

print('Build model...')

model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
model.add(Dropout(0.25))
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
model.add(MaxPooling1D(pool_size=pool_size))
model.add(LSTM(lstm_output_size))
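model.add(Dense(5))
#Softmax gives one probability distribution over the five ratings,
#which is what categorical cross-entropy expects
model.add(Activation('softmax'))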

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(x_train, y_train_cat,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_test, y_test_cat))
score, acc = model.evaluate(x_test, y_test_cat, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

1255000 train sequences
314263 test sequences
Pad sequences (samples x time)
x_train shape: (1255000, 100)
x_test shape: (314263, 100)
Build model...
Train...
Train on 1255000 samples, validate on 314263 samples
Epoch 1/2
Epoch 2/2
Test accuracy: 0.651320762624
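
Since each review has exactly one rating, it helps to see how the choice of output activation changes what the five outputs mean. The sketch below compares the two on made-up scores; it is plain Python for illustration, not the Keras implementation.

```python
import math


def sigmoid(scores):
    """Squash each score independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-s)) for s in scores]


def softmax(scores):
    """Turn scores into one probability distribution that sums to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.5, -1.0, 0.0]  # made-up raw outputs for five classes
print(sum(sigmoid(scores)))          # independent values, not a distribution
print(sum(softmax(scores)))          # sums to 1 up to rounding
```

Both pick the same top class here, but only the softmax outputs can be read as probabilities over mutually exclusive ratings.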
