# Yelp Review
## Business Problem
In today’s information era, even consumers who are not technologically savvy can quickly find information about a business. This is great for customers, as it allows them to identify the “snake oil salesmen” (a metaphor for people who sell fraudulent goods). The motivation behind this project is to develop a tool that shows businesses, in this case restaurants, which aspects of their business are driving their reviews, without them having to crawl through potentially hundreds of text reviews. Is it their price? Their food? Their service? Businesses that know this can get ahead of the competition by improving the areas that lead to less positive reviews.

For example, a restaurant that knows it has four stars on Yelp, along with a breakdown of which aspects of the establishment contributed to those four stars, might see that it loses marks on service. This may encourage management to retrain the waiting staff in an effort to improve the service score and thus the overall Yelp score. Such a tool would be extremely useful to restaurants: when did you last go to a restaurant without reading some online reviews beforehand?

## The Data
The data, as discussed above, is text review data. In this training stage, the data set comes from two sources. One is a snapshot of about 10,000 records from a larger database on which the trained model will be deployed. This snapshot was manually coded with one of the following categories:
* Food#Positive
* Food#Negative
* Price#Positive
* Price#Negative
* Quality#Positive
* Quality#Negative
* Restaurant#Positive
* Restaurant#Negative
* Location#Positive
* Location#Negative
* Service#Positive
* Service#Negative

This was done for validation purposes. The other 1,600 records come from a Yelp Kaggle competition; their categories were recoded to match the categories above.
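
Each category combines an aspect and a sentiment separated by `#`, which is what makes a per-aspect breakdown possible. The sketch below shows one way such labels could be split and tallied. The helper names `split_label` and `aspect_breakdown` and the example labels are hypothetical and are not part of the notebook.

```python
from collections import Counter


def split_label(label):
    """Split an 'ASPECT#SENTIMENT' label into a lower-cased (aspect, sentiment) pair."""
    aspect, sentiment = label.split("#", 1)
    return aspect.strip().lower(), sentiment.strip().lower()


def aspect_breakdown(labels):
    """Count how many positive and negative labels each aspect received."""
    counts = {}
    for label in labels:
        aspect, sentiment = split_label(label)
        counts.setdefault(aspect, Counter())[sentiment] += 1
    return counts


# Hypothetical labels for one restaurant, for illustration only
example_labels = ["SERVICE#POSITIVE", "SERVICE#NEGATIVE", "SERVICE#NEGATIVE", "FOOD#POSITIVE"]
breakdown = aspect_breakdown(example_labels)
for aspect, tally in sorted(breakdown.items()):
    print(aspect, dict(tally))
```

A tally like this is the kind of summary that would tell a restaurant whether service or food is pulling its score down.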

In [1]:
# Step 1 Read in data
# Step 2 Preprocess text data
# Step 3 Word Embedding
# Step 4 Deep Learning

In [2]:
import re

import pandas as pd
import matplotlib.pyplot as plt
from numpy import array
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Keras pieces used to encode the text and build the network
from keras.preprocessing import sequence  # packaged preprocessing helpers
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.layers.embeddings import Embedding

# English stop word list used in the preprocessing steps below
stop = stopwords.words('english')

Using TensorFlow backend.


In [3]:
yelp=pd.read_csv('all_data20180608.csv')

In [4]:
yelp.head()

   Unnamed: 0          category  \
0           0  SERVICE#POSITIVE   
1           1  SERVICE#POSITIVE   
2           2  SERVICE#POSITIVE   
3           3  SERVICE#POSITIVE   
4           4  SERVICE#POSITIVE   

## Text Analytics
Text analytics means examining text that was written by, or about, customers. You find patterns and topics of interest, and then take practical action based on what you learn.

Text analytics can be performed manually, but doing so is inefficient. Text analytics software therefore uses text mining and natural language processing algorithms to find meaning in large amounts of text, which is what this project attempts to do.

### PreProcessing
Text data needs different preprocessing steps from ordinary tabular data, but the end goal is more or less the same: a normalised, purely numeric data set that contains less noise.

#### Lower Case 
This step puts the text on an equal footing. The simplest way to explain it: one wouldn't want to count *Everything* and *everything* as two separate words.

In [5]:
# PreProcessing
#step 1 lower case
#step 2 punctuation
#step 3 stop word
#step 4 common word removal
#step 5 rare word removal
#step 6 token
#step 7 stemming
#step 8 lemma

In [6]:
# step 1: lower-case every word (this also collapses repeated whitespace)
yelp['lower'] = yelp['text'].apply(lambda x: " ".join(word.lower() for word in x.split()))
yelp['lower'].head()

0    my friend gabi, i love your cute parisian inte...
1     had a good waiter, all the staff were very cool.
2    my only regret is not catching the name of our...
3    lotus of siam did not disappoint, the service ...
4    his name is carlos if you ever want to request...
Name: lower, dtype: object

#### Removing Punctuation
Again, this keeps the text on an equal footing and prevents items such as full stops from adding noise.

In [7]:
#step 2
from nltk.tokenize import RegexpTokenizer
reg_tok = RegexpTokenizer(r'\w+')#+ is one or more
yelp['no_punc'] = yelp['lower'].apply(lambda x: ' '.join(reg_tok.tokenize(x)))
yelp.no_punc.head()

0    my friend gabi i love your cute parisian inter...
1       had a good waiter all the staff were very cool
2    my only regret is not catching the name of our...
3    lotus of siam did not disappoint the service w...
4    his name is carlos if you ever want to request...
Name: no_punc, dtype: object

#### Stop Words
Stop words are those words which are filtered out before further processing of text, since these words contribute little to overall meaning, given that they are generally the most common words in a language. For instance, "the," "and," and "a," while all required words in a particular passage, don't generally contribute greatly to one's understanding of content. 

In [8]:
#step 3
yelp['no_stop'] = yelp['no_punc'].apply(lambda x: " ".join(word for word in x.split() if word not in stop))
yelp.no_stop.head()

0    friend gabi love cute parisian interior dim li...
1                               good waiter staff cool
2    regret catching name server best experienced f...
3        lotus siam disappoint service great attentive
4          name carlos ever want request service great
Name: no_stop, dtype: object

#### Common and Rare Words
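Removing the most common and the rarest words can sometimes add a small amount of value to a model. Very common words such as "good" or "food" appear in so many reviews that they often do little to distinguish one category from another, while very rare words appear too seldom for the model to learn from.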

In [9]:
# join all rows, split into words, count them and keep the 20 most frequent
freq = pd.Series(' '.join(yelp['no_stop']).split()).value_counts()[:20]
freq
# looking at these, we actually want to keep them, so common word removal (step 4) is skipped

food          3504
good          1925
buffet        1556
service       1554
great         1365
place         1111
vegas          882
like           764
restaurant     656
one            642
get            641
best           635
really         625
quality        611
price          596
would          552
time           539
go             539
selection      470
better         463
dtype: int64

In [10]:
# step 5: find the 600 least frequent words
rare = pd.Series(' '.join(yelp['no_stop']).split()).value_counts()[-600:]


In [11]:
# step 5 (continued): drop those rare words from every review
rare = set(rare.index)
yelp['no_rare'] = yelp['no_stop'].apply(lambda x: " ".join(word for word in x.split() if word not in rare))
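
The cell above removes a fixed number of words (the last 600 in the frequency table). An alternative, shown here only as a sketch and not as the notebook's method, is to drop every word that falls below a frequency threshold. The function name `remove_rare_words` and the `min_count` parameter are assumptions made for this illustration.

```python
from collections import Counter


def remove_rare_words(docs, min_count=2):
    """Drop every word that occurs fewer than min_count times across all docs."""
    counts = Counter(word for doc in docs for word in doc.split())
    keep = {word for word, n in counts.items() if n >= min_count}
    return [" ".join(w for w in doc.split() if w in keep) for doc in docs]


# Tiny illustrative corpus, not taken from the real data set
docs = ["good waiter staff cool", "good service great", "service great attentive"]
cleaned = remove_rare_words(docs, min_count=2)
print(cleaned)
# ['good', 'good service great', 'service great']
```

A threshold has the advantage that words with the same frequency are always treated the same way, whereas a fixed cut-off of 600 may split a group of equally rare words arbitrarily.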


In [12]:
# Demo only: spelling correction with TextBlob
from textblob import TextBlob
# Not used in the rest of the pipeline; shown on the first five reviews for reference
yelp['no_stop'][:5].apply(lambda x: str(TextBlob(x).correct()))

0    friend gave love cut parisian interior dim lig...
1                               good waiter staff cool
2    regret catching name server best experienced f...
3         lots siam disappoint service great attentive
4           name carlo ever want request service great
Name: no_stop, dtype: object

#### Tokenising
Tokenization is a step which splits longer strings of text into smaller pieces, or tokens. Larger chunks of text can be tokenized into sentences, sentences can be tokenized into words, etc. Further processing is generally performed after a piece of text has been appropriately tokenized. Tokenization is also referred to as text segmentation or lexical analysis. Sometimes segmentation is used to refer to the breakdown of a large chunk of text into pieces larger than words (e.g. paragraphs or sentences), while tokenization is reserved for the breakdown process which results exclusively in words.
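
The two levels described above, sentences and words, can be illustrated with a deliberately naive sketch built on regular expressions from the standard library. It is not the Treebank tokenizer used in the next cell, and the function names `sentence_tokenize` and `word_tokenize` are chosen here for illustration.

```python
import re


def sentence_tokenize(text):
    """Naive sentence split: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(sentence):
    """Naive word split: keep runs of word characters and drop punctuation."""
    return re.findall(r"\w+", sentence)


review = "Had a good waiter. All the staff were very cool!"
sentences = sentence_tokenize(review)
words = [word_tokenize(s) for s in sentences]
print(sentences)
print(words)
```

Real tokenizers handle abbreviations, contractions and similar edge cases that this sketch ignores.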

In [13]:
#step 6
from nltk.tokenize.treebank import TreebankWordTokenizer
_word_tokenize = TreebankWordTokenizer()
yelp['token'] = yelp['no_rare'].apply(lambda x: ' '.join(_word_tokenize.tokenize(x)))
yelp.token.head()

0    friend gabi love cute parisian interior dim li...
1                               good waiter staff cool
2    regret catching name server best experienced f...
3        lotus siam disappoint service great attentive
4          name carlos ever want request service great
Name: token, dtype: object

#### Stemming
Stemming is the process of eliminating affixes (suffixes, prefixes, infixes, circumfixes) from a word in order to obtain a word stem. For example, *running* -> *run*.

In [14]:
#step 7
from nltk.stem.snowball import SnowballStemmer
st = SnowballStemmer("english")
yelp['stemed']=yelp['token'].apply(lambda x: " ".join([st.stem(word) for word in x.split()]))
yelp.stemed.head()

0    friend gabi love cute parisian interior dim li...
1                               good waiter staff cool
2    regret catch name server best experienc far tr...
3            lotus siam disappoint servic great attent
4            name carlo ever want request servic great
Name: stemed, dtype: object

#### Lemmatization
Lemmatization is related to stemming. The difference is that, rather than only stripping affixes, lemmatization maps each word to its canonical form, known as the word's lemma.

In [15]:
#step 8
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
yelp['lemma']=yelp['stemed'].apply(lambda x: " ".join([wordnet_lemmatizer.lemmatize(word) for word in x.split()]))
yelp.lemma.head(20)

0     friend gabi love cute parisian interior dim li...
1                                good waiter staff cool
2     regret catch name server best experienc far tr...
3             lotus siam disappoint servic great attent
4             name carlo ever want request servic great
5                               room beauti server good
6     servic quick price ok get pretti darn good san...
7                                 good servic good food
8     say locat decor lotus siam never life find bet...
9                              servic snappi food tasti
10    came month ago food ok initi encount cashier g...
11                       hostess waitress friend attent
12                     shout boy wesley host cool peopl
13                            waitress awesom help ball
14     servic great busi afternoon outdoor set look day
15    arriv 3pm weekday prompt seat busi patio time ...
16    happi help take mani pictur request alway kept...
17    item order mon ami gabi oyster du jour 15 

### Word Embedding 
Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with one dimension per word to a continuous vector space with a much lower dimension.

Methods to generate this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge base methods, and explicit representation in terms of the context in which words appear. Embeddings are needed because computers do not understand words; in layman's terms, they convert words to numbers.
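
To make "converting words to numbers" concrete, here is a minimal sketch using only the standard library. It assigns each word an integer id, pads the sequence to a fixed length, and looks up one small vector per id. The names `build_vocab` and `encode`, the tiny dimension of 3 and the random values are assumptions for illustration; in a trained model the vectors are learned rather than random.

```python
import random


def build_vocab(docs):
    """Assign each distinct word an integer id, reserving 0 for padding."""
    vocab = {}
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def encode(doc, vocab, max_length):
    """Convert a document to ids, then cut at the end or pad with 0 to max_length."""
    ids = [vocab[w] for w in doc.split()][:max_length]
    return ids + [0] * (max_length - len(ids))


docs = ["good waiter staff cool", "good servic good food"]
vocab = build_vocab(docs)

embedding_dim = 3  # tiny on purpose; the value is illustrative
rng = random.Random(0)
# One row of random numbers per id; a real model would learn these during training
embedding_table = [[rng.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in range(len(vocab) + 1)]

encoded = encode(docs[1], vocab, max_length=6)
vectors = [embedding_table[i] for i in encoded]
print(encoded)
# [1, 5, 1, 6, 0, 0]
```

Both occurrences of "good" map to id 1 and therefore to the same vector, which is the property that lets a model share what it learns about a word across reviews.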

In [16]:
# Prepping the word embedding: longest review (in characters) and dictionary size

yelp.lemma.str.len().max()

596
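
Note that `str.len()` measures string length, so 596 is a count of characters, whereas the padding length used for the network is measured in words. The sketch below shows the difference on a few sample strings; the helper name `length_stats` is hypothetical. In pandas, the word-level equivalent would be along the lines of `yelp.lemma.str.split().str.len().max()`.

```python
def length_stats(docs):
    """Return (longest length in characters, longest length in words)."""
    max_chars = max(len(doc) for doc in docs)
    max_words = max(len(doc.split()) for doc in docs)
    return max_chars, max_words


# A few short reviews in the same style as the lemma column
docs = ["good waiter staff cool", "servic snappi food tasti", "room beauti server good"]
max_chars, max_words = length_stats(docs)
print(max_chars, max_words)
# 24 4
```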

In [17]:
from collections import Counter

count=Counter(" ".join(yelp.lemma).split(" ")).items()
# print(sorted(count))

In [18]:
#length of dictionary
len(count)


5246

In [19]:
#longest sentence
print(max(yelp.lemma, key=len))

like singl littl dish put tast portion deep fri broccoli chees casserol surpris favorit american plate love littl bucket tater tot waffl fri mini fri basket piec fri chicken sweet potato fri brisket nice outsid like option bbq sauc red velvet whoopi pie soft point flavor authent dessert varieti cupcak cooki bread pud uniqu gelato flavor made order crepe sugar free dessert ton choos midlight good amount empti spot item look good guess popular ran lowlight shrimp cold one tast bit fishi hot one head overlook spici fri fish excit dish great probabl sit meat dri item great buffet other mediocr


In [20]:
# embeddings = tf.Variable(
#     tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
# embed = tf.nn.embedding_lookup(embeddings, train_inputs)

In [21]:

# define class labels
encoder = LabelEncoder()
encoder.fit(yelp.category)
encoded_Y = encoder.transform(yelp.category)
# convert integers to dummy variables (i.e. one hot encoded)
dummy_y = np_utils.to_categorical(encoded_Y)

# integer encode the documents; one_hot hashes each word to an index below vocab_size,
# so vocab_size is set above the 5,246 distinct words to limit collisions
vocab_size = 6000
encoded_docs = [one_hot(d, vocab_size) for d in yelp.lemma]
#print(encoded_docs)
# pad (or truncate) every document to max_length words
max_length = 130
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

# define the model
model = Sequential()
model.add(Embedding(vocab_size, 32, input_length=max_length))
model.add(Flatten())
# one output unit per category; softmax because each review carries exactly one label
model.add(Dense(12, activation='softmax'))
#model.add(Dropout(0.3))
# compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# hold out 20% of the data for validation, then fit the model
X_train, X_test, y_train, y_test = train_test_split(padded_docs, dummy_y, test_size=0.2)
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=30, batch_size=64)



[[ 406 3963 5829 ...    0    0    0]
 [1579 5282   27 ...    0    0    0]
 [4362 4502 2098 ...    0    0    0]
 ...
 [4270 2852  513 ...    0    0    0]
 [4270 2852  513 ...    0    0    0]
 [4270 2852  513 ...    0    0    0]]
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 130, 32)           192000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 4160)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 12)                49932     
Total params: 241,932
Trainable params: 241,932
Non-trainable params: 0
_________________________________________________________________
None
Train on 10041 samples, validate on 2511 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 1

<keras.callbacks.History at 0x1030a6d8>

In [22]:
# embedding_vecor_length = 32
# model2 = Sequential()
# model2.add(Embedding(vocab_size, embedding_vecor_length,input_length=max_length))
# from keras.layers import LSTM
# model2.add(LSTM(100))
# model2.add(Dense(12, activation='sigmoid'))
# model2.add(Dense(12, activation='sigmoid'))
# model2.add(Dense(12, activation='sigmoid'))
# model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# print(model2.summary())
# model2.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=64)

In [23]:
# from keras.layers import Dropout
# embedding_vecor_length = 32
# model3 = Sequential()
# model3.add(Embedding(vocab_size, embedding_vecor_length,input_length=max_length))

# model3.add(Dense(12, activation='sigmoid')) 
# model3.add(Dropout(0.3))
# model3.add(Dense(12, activation='sigmoid'))
# model3.add(Dropout(0.3))
# model3.add(Dense(12, activation='sigmoid'))
# model3.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# print(model3.summary())
# model3.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=5, batch_size=64)

In [24]:
def create_model():
    # create model
    model = Sequential()
    model.add(Embedding(vocab_size, 32, input_length=max_length))
    model.add(Flatten())
    model.add(Dense(12, activation='softmax'))  # softmax: one label per review
    
    # compile the model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
    return model
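
The parameter counts reported by `model.summary()` for this architecture can be checked by hand. The arithmetic below uses the values from the notebook (vocabulary size 6000, embedding size 32, sequence length 130, 12 output units); only the variable names are chosen here for illustration.

```python
vocab_size, embedding_dim, max_length, n_classes = 6000, 32, 130, 12

embedding_params = vocab_size * embedding_dim          # one 32-value vector per word id
flatten_units = max_length * embedding_dim             # 130 positions x 32 values each
dense_params = flatten_units * n_classes + n_classes   # weights plus one bias per class
total_params = embedding_params + dense_params

print(embedding_params, flatten_units, dense_params, total_params)
# 192000 4160 49932 241932
```

Roughly four fifths of the parameters sit in the embedding layer, so the vocabulary size is the main lever on model size.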

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV


model = KerasClassifier(build_fn=create_model,verbose=0)
# define the grid search parameters
# learn_rate = [0.001, 0.01, 0.1, 0.2, 0.3]
# momentum = [0.0, 0.2, 0.4, 0.6, 0.8, 0.9]
epochs=[1,3,5]
batch_size=[32,64,128,256]
param_grid = dict(epochs=epochs,batch_size=batch_size)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)  # n_jobs=-1 can hang when Keras models are trained in parallel worker processes

grid_result = grid.fit(X_train, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
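
To get a feel for the cost of this search, the sketch below enumerates the grid with the standard library. The number of cross-validation folds is an assumption: `GridSearchCV` uses its own default when `cv` is not passed, and that default differs between scikit-learn versions, so `cv_folds` here is only a placeholder.

```python
from itertools import product

epochs = [1, 3, 5]
batch_size = [32, 64, 128, 256]
cv_folds = 3  # assumption: the actual default depends on the scikit-learn version

# Every (epochs, batch_size) pair the grid search will try
combinations = [dict(epochs=e, batch_size=b) for e, b in product(epochs, batch_size)]
total_fits = len(combinations) * cv_folds

print(len(combinations), total_fits)
# 12 36
```

Each fit trains a fresh network, so adding another hyperparameter such as the learning rate multiplies the run time rather than adding to it.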