[Reference](https://github.com/msahamed/yelp_comments_classification_nlp)

# <b>Introduction</b>

* Yelp reviews contain a lot of metadata that can be mined to infer meaning, business attributes, and sentiment.
* For simplicity, I classify the review comments into two classes: positive or negative.
* The polarity dataset used here stores a label of 1 or 2 in the `stars` column; reviews labeled 1 are treated as negative and all others as positive.
* Because every review carries a label, this is a supervised learning problem.
* To build and train the model, I first tokenize the text and convert it to integer sequences.
* Each review comment is limited to 50 tokens.
* As a result, texts shorter than 50 tokens are padded with zeros, and longer ones are truncated.
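
The padding and truncation rule in the last two bullets can be sketched in plain Python. This is a simplified stand-in for what Keras `pad_sequences` does with its default settings (`padding='pre'`, `truncating='pre'`), not the library code itself; the helper name `pad_or_truncate` and the token ids are made up for illustration.

```python
def pad_or_truncate(seq, maxlen, pad_value=0):
    """Left-pad short sequences with pad_value; keep the last maxlen items of long ones."""
    if len(seq) >= maxlen:
        return list(seq[-maxlen:])           # 'pre' truncation drops the earliest tokens
    return [pad_value] * (maxlen - len(seq)) + list(seq)


short_review = [12, 7, 301]                  # hypothetical token ids of a 3-word review
long_review = list(range(1, 61))             # hypothetical token ids of a 60-word review

print(pad_or_truncate(short_review, 5))      # [0, 0, 12, 7, 301]
print(pad_or_truncate(long_review, 50)[:3])  # [11, 12, 13]
```

With these defaults the model sees the end of a long review rather than its beginning, which is worth keeping in mind when reading the predictions later on.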

## <b>Import libraries</b>

In [3]:
# Keras
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation
# from keras.layers.embeddings import Embedding
from tensorflow.keras.layers import Embedding

## Plot
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)
import matplotlib.pyplot as plt

# NLTK
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Other
import re
import string
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\2joon\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Import Data

In [2]:
# import wget
# wget.download('https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz')

In [3]:
# please unzip manually and control directory

## <b>Data Processing</b>

In [4]:
# train = pd.read_csv('train.csv', sep = '|', , error_bad_lines=False)
train=pd.read_csv('D:/yelp_review_polarity_csv/train.csv', names = ['stars', 'text'])
test=pd.read_csv('D:/yelp_review_polarity_csv/test.csv', names = ['stars', 'text'])

In [5]:
train.head()

Unnamed: 0,stars,text
0,1,"Unfortunately, the frustration of being Dr. Go..."
1,2,Been going to Dr. Goldberg for over 10 years. ...
2,1,I don't know what Dr. Goldberg was like before...
3,1,I'm writing this review to give you a heads up...
4,2,All the food is great here. But the best thing...


In [6]:
# train = train.dropna()
# train = train[train.stars.apply(lambda x: x.isnumeric())]
# train = train[train.stars.apply(lambda x: x !="")]
# train = train[train.text.apply(lambda x: x !="")]

# train.describe()

### Convert star labels into two classes (positive = 1 and negative = 0)

Since the main purpose is to identify positive or negative comments, I convert the `stars` column into a binary label. In the polarity CSV loaded above, `stars` takes the values 1 and 2:

- (1) Positive (label 1): comments with `stars` not equal to 1
- (2) Negative (label 0): comments with `stars` equal to 1

In [7]:
# labels = train['stars'].map(lambda x : 1 if int(x) > 3 else 0)
train_labels = train['stars'].map(lambda x : 0 if int(x) == 1 else 1)
test_labels = test['stars'].map(lambda x : 0 if int(x) == 1 else 1)

train['labels']=train_labels
test['labels']=test_labels

In [8]:
y_train=np.array(train_labels)
y_test=np.array(test_labels)

In [9]:
y_test

array([1, 0, 1, ..., 0, 0, 0], dtype=int64)

### Tokenize text data

- To limit the computational cost, I keep only the 20,000 most frequent words.
- First, tokenize the comments, then convert them into integer sequences.
- Each comment is limited to 50 tokens (`maxlen=50`).
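
The tokenization steps above can be sketched without Keras. The two helpers below are hypothetical names, and the code is only a rough imitation of `Tokenizer(num_words=...)`: words are ranked by frequency starting at 1 (0 is reserved for padding), and words whose rank is not below `num_words` are dropped when texts are converted to sequences.

```python
from collections import Counter


def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_ids(texts, word_index, num_words):
    """Map each text to a list of ids, dropping words whose rank is >= num_words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


reviews = ["good food good servic", "bad food", "good place"]
word_index = build_word_index(reviews)
print(word_index["good"], word_index["food"])           # 1 2
print(texts_to_ids(reviews, word_index, num_words=3))   # [[1, 2, 1], [2], [1]]
```

Rare words simply disappear from the sequences, so a review made up mostly of uncommon words can end up much shorter than its original length.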

In [10]:
def clean_text(text):
    
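    ## Punctuation is handled by the regex substitutions further down.
    ## The original call below did not strip punctuation: str.translate()
    ## needs a table built with str.maketrans(), so it is left disabled.
    # text = text.translate(string.punctuation)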
    
    ## Convert words to lower case and split them
    text = text.lower().split()
    
    ## Remove stop words
    stops = set(stopwords.words("english"))
    text = [w for w in text if w not in stops and len(w) >= 3]
    
    text = " ".join(text)

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)
    
    text = text.split()
    stemmer = SnowballStemmer('english')
    stemmed_words = [stemmer.stem(word) for word in text]
    text = " ".join(stemmed_words)

    return text

In [11]:
train['text'] = train['text'].map(clean_text)
test['text'] = test['text'].map(clean_text)

In [5]:
vocabulary_size = 20000

# tokenizer
tokenizer = Tokenizer(num_words=vocabulary_size) # 20000
tokenizer.fit_on_texts(train['text'])


In [13]:
# sequences = tokenizer.texts_to_sequences(test['text'])
train_data=pad_sequences(tokenizer.texts_to_sequences(train['text']), maxlen=50)
test_data=pad_sequences(tokenizer.texts_to_sequences(test['text']), maxlen=50)

In [14]:
print("train_data.shape: ", train_data.shape)
print("test_data.shape: ", test_data.shape)

train_data.shape:  (560000, 50)
test_data.shape:  (38000, 50)


## <b>Build neural network with LSTM</b>

### Network Architecture

- The network starts with an embedding layer.
- This layer maps each integer token to a dense vector, which lets the network represent words in a meaningful way.
- Its first argument, 20000, is the size of our vocabulary, and its second, 100, is the dimension of the embeddings.
- The third parameter, `input_length=50`, is the length of each comment sequence.
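
To make the embedding step concrete, here is a small NumPy sketch of the lookup. It is an illustration only: the random matrix stands in for weights the network would learn, and the token ids are invented.

```python
import numpy as np

vocabulary_size, embedding_dim, sequence_length = 20000, 100, 50

rng = np.random.default_rng(0)
# One row per token id; random values stand in for learned weights.
embedding_matrix = rng.normal(size=(vocabulary_size, embedding_dim)).astype("float32")

# A padded review: zeros on the left, hypothetical token ids on the right.
review_ids = np.zeros(sequence_length, dtype=int)
review_ids[-3:] = [12, 7, 301]

# The lookup is plain row indexing: each id selects its 100-dimensional vector.
review_vectors = embedding_matrix[review_ids]
print(review_vectors.shape)  # (50, 100)
```

So each 50-token comment becomes a 50 x 100 matrix before it reaches the LSTM.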

In [15]:
model=Sequential()
model.add(Embedding(20000, 100, input_length=50))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
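
As a rough size check, the number of trainable parameters in this model can be worked out by hand. The helper functions are hypothetical, and the formulas assume the standard layer definitions with biases enabled; I have not compared the total against `model.summary()`.

```python
def embedding_params(vocab_size, dim):
    # One vector of length dim per token id.
    return vocab_size * dim


def lstm_params(input_dim, units):
    # Four weight sets (input, forget, and output gates plus the cell candidate),
    # each with input weights, recurrent weights, and a bias vector.
    return 4 * (input_dim * units + units * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


total = embedding_params(20000, 100) + lstm_params(100, 100) + dense_params(100, 1)
print(total)  # 2080501
```

Almost all of the parameters sit in the embedding table, which is one reason the vocabulary is capped at 20,000 words.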

### Train the network

In [16]:
model.fit(train_data, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1a8585a1dc0>

In [17]:
model.evaluate(test_data, y_test)



[0.281419962644577, 0.9179736971855164]

## Build neural network with LSTM and CNN
- The LSTM model worked well. 
- However, it is slow to train.
- One way to shorten the training time is to add a convolutional layer in front of the LSTM.
- Convolutional Neural Networks (CNN) come from image processing. They pass a “filter” over the data and calculate a higher-level representation. 
- They have been shown to work surprisingly well for text, even though they have none of the sequence processing ability of LSTMs.
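
One plausible reason the convolutional front end helps with speed is that it shortens the sequence the LSTM has to step through. The sketch below works out the sequence lengths for the layer sizes used in this notebook (kernel size 5, pool size 4), assuming the Keras defaults of no padding, a convolution stride of 1, and a pooling stride equal to the pool size. The function names are made up for illustration.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution with no padding ('valid')."""
    return (input_length - kernel_size) // stride + 1


def maxpool1d_output_length(input_length, pool_size):
    """Output length of max pooling with stride equal to pool_size and no padding."""
    return (input_length - pool_size) // pool_size + 1


after_conv = conv1d_output_length(50, kernel_size=5)
after_pool = maxpool1d_output_length(after_conv, pool_size=4)
print(after_conv, after_pool)  # 46 11
```

Under these assumptions the LSTM processes 11 time steps per review instead of 50.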

In [18]:
def create_conv_model():
    model_conv = Sequential()
    model_conv.add(Embedding(vocabulary_size, 100, input_length=50))
    model_conv.add(Dropout(0.2))
    model_conv.add(Conv1D(64, 5, activation='relu'))
    model_conv.add(MaxPooling1D(pool_size=4))
    model_conv.add(LSTM(100))
    model_conv.add(Dense(1, activation='sigmoid'))
    model_conv.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model_conv 

### Train the network

In [19]:
model_conv = create_conv_model()
model_conv.fit(train_data, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1a858ba9be0>

In [20]:
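# Evaluate the CNN + LSTM model. The original cell called model.evaluate(...),
# which re-scored the plain LSTM and repeated its result
# ([0.281419962644577, 0.9179736971855164]), so no CNN + LSTM score was recorded.
model_conv.evaluate(test_data, y_test)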

## Word embedding visualization

- In this subsection, I want to visualize word embedding weights obtained from trained models.
- Word embeddings with 100 dimensions are first reduced to 2 dimensions using t-SNE.
- TensorFlow has a dedicated tool for visualizing embeddings, but here I only want to look at the relationships between words.
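
A 2-D t-SNE plot can distort distances, so a quick numeric cross-check is sometimes useful. The sketch below is not part of the original notebook: it ranks rows of an embedding matrix by cosine similarity to a chosen row, using toy 2-D vectors in place of the trained 100-dimensional weights. The function name `nearest_neighbours` is hypothetical.

```python
import numpy as np


def nearest_neighbours(embeddings, index, top_k=3):
    """Return the top_k row indices most similar to row `index` by cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)   # avoid division by zero
    similarity = unit @ unit[index]
    similarity[index] = -np.inf                       # exclude the query row itself
    return np.argsort(-similarity)[:top_k].tolist()


# Toy vectors: row 1 points almost the same way as row 0, row 2 is orthogonal, row 3 opposite.
toy = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
print(nearest_neighbours(toy, 0, top_k=2))  # [1, 2]
```

Applied to the trained weights, the returned row numbers are token ids and can be mapped back to words through the tokenizer's word index.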

### Get embedding weights from the trained models

In [21]:
lstm_embds = model.layers[0].get_weights()[0]
conv_embds = model_conv.layers[0].get_weights()[0]

### Get word list 

In [22]:
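# Row i of the embedding matrix belongs to token index i. Index 0 is reserved
# for padding, so a placeholder keeps the labels aligned with the rows.
word_list = ['<pad>'] + list(tokenizer.word_index)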

### Scatter plot of first two components of TSNE

In [23]:
def plot_words(data, start, stop, step):
    trace = go.Scatter(
        x = data[start:stop:step, 0], 
        y = data[start:stop:step, 1],
        mode = 'markers',
        text= word_list[start:stop:step]
    )
    layout = dict(title= 't-SNE 1 vs t-SNE 2',
                  yaxis = dict(title='t-SNE 2'),
                  xaxis = dict(title='t-SNE 1'),
                  hovermode= 'closest')
    fig = dict(data = [trace], layout= layout)
    py.iplot(fig)

In [24]:
number_of_words = 100

In [25]:
lstm_tsne_embds = TSNE(n_components=2).fit_transform(lstm_embds)
plot_words(lstm_tsne_embds, 0, number_of_words, 1)


The default initialization in TSNE will change from 'random' to 'pca' in 1.2.


The default learning rate in TSNE will change from 200.0 to 'auto' in 1.2.



In [26]:
conv_tsne_embds = TSNE(n_components=2).fit_transform(conv_embds)
plot_words(conv_tsne_embds, 0, number_of_words, 1)





## Sample Test

In [1]:
# positive review
review_sample="Thankfully there has been no monkeying around with the formidably tall gâteau Basque, which is flavored with rum and served with a sparkling orange puddle of Cara Cara marmalade.\
The genius of traditional Spanish cooking lies in knowing when to leave well enough alone. It’s a principle the bartenders at El Quijote could stand to study. Cocktails that originally called for two or three ingredients get five or six; the kalimotxo, a blend of red wine and cola that is one of Spain’s great party tricks, has wine, rum and two kinds of amaro when it just needs a Coke.\
The more-is-more approach works better with the sangria; infused with cinnamon and spiked with balsamic vinegar, it goes down something like a chilled mulled wine, and is a huge improvement over its predecessor. So, I suspect, is the wine list, which is brief but manages to rope in a fair sampling of modern winemakers like Ramón Jané and more traditional outfits like C.V.N.E.\
I miss the sprawling, sheltering atmosphere of the old El Quijote, but not much else. Toward the end, even El Quijote’s Ford administration prices weren’t quite enough to make anyone forget that a number of restaurants served far better Spanish food. Now it is one of them, and that’s OK."

In [2]:
review_sample

'Thankfully there has been no monkeying around with the formidably tall gâteau Basque, which is flavored with rum and served with a sparkling orange puddle of Cara Cara marmalade.The genius of traditional Spanish cooking lies in knowing when to leave well enough alone. It’s a principle the bartenders at El Quijote could stand to study. Cocktails that originally called for two or three ingredients get five or six; the kalimotxo, a blend of red wine and cola that is one of Spain’s great party tricks, has wine, rum and two kinds of amaro when it just needs a Coke.The more-is-more approach works better with the sangria; infused with cinnamon and spiked with balsamic vinegar, it goes down something like a chilled mulled wine, and is a huge improvement over its predecessor. So, I suspect, is the wine list, which is brief but manages to rope in a fair sampling of modern winemakers like Ramón Jané and more traditional outfits like C.V.N.E.I miss the sprawling, sheltering atmosphere of the old 

In [80]:
df=pd.DataFrame({'text':[clean_text(review_sample)]})
df.head()

Unnamed: 0,text
0,thank monkey around formid tall g teau basqu f...


In [81]:
model.predict(pad_sequences(tokenizer.texts_to_sequences(df['text']), maxlen=50))



array([[0.00051905]], dtype=float32)

In [82]:
model_conv.predict(pad_sequences(tokenizer.texts_to_sequences(df['text']), maxlen=50))



array([[0.52031463]], dtype=float32)
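
Both outputs above are sigmoid probabilities for the positive class (label 1). A minimal sketch of turning them into labels follows; the 0.5 threshold and the helper name `to_label` are assumptions, not something the notebook defines.

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to a class name; the threshold is an assumed default."""
    return "positive" if probability >= threshold else "negative"


print(to_label(0.00051905))  # negative (LSTM output above)
print(to_label(0.52031463))  # positive (CNN + LSTM output above)
```

The two models disagree on this sample. Since the input is cut to 50 tokens by `pad_sequences`, each model only sees part of the cleaned review.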

In [83]:
## Save Models

In [84]:
from keras.models import load_model

model.save('lstm_model.h5')
model_conv.save('lstm_conv_model.h5')

In [None]:
# model = load_model('lstm_model.h5')

In [None]:
import pickle

# saving
with open('tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

# loading
with open('tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)