[Reference](https://github.com/msahamed/yelp_comments_classification_nlp)

# <b>Introduction</b>

In this project, I classify reviews from the Yelp round-10 dataset. The reviews carry a lot of metadata that can be mined to infer meaning, business attributes, and sentiment. For simplicity, I classify the review comments into two classes: positive or negative. Reviews with more than three stars are regarded as positive, while reviews with three stars or fewer are negative. The problem is therefore a supervised learning task. To build and train the models, I first tokenize the text and convert it to sequences. Each review comment is limited to 50 words: texts shorter than 50 words are padded with zeros, and longer ones are truncated. After processing the review comments, I train three models in three different ways:

<li> Model-1: A neural network with a single embedding layer and an LSTM layer.
<li> Model-2: The same as Model-1, with an extra 1D convolutional layer added on top of the LSTM layer to reduce the training time.
<li> Model-3: The same network architecture as Model-2, but with the pre-trained 100-dimensional GloVe word embeddings as initial input.

Since there are about 1.6 million input comments, the models take a while to train. To reduce the training time, I limit training to three epochs. After three epochs, it is evident that Model-2 is better in both training time and validation accuracy.

## <b>Project Outline</b>

In this project I cover the following:

<li> Download data from Yelp and process it
<li> Build neural network with LSTM
<li> Build neural network with LSTM and CNN
<li> Use pre-trained GloVe word embeddings
<li> Word Embeddings from Word2Vec

## <b>Import libraries</b>

In [1]:
# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation, Embedding

## Plot
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)
import matplotlib.pyplot as plt

# NLTK
import nltk
# nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Other
import re
import string
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

## Import Data

In [2]:
import wget
# wget.download('https://s3.amazonaws.com/fast-ai-nlp/yelp_review_polarity_csv.tgz')

In [3]:
# Unzip the archive manually and adjust the file paths below to match your directory

## <b>Data Processing</b>

In [4]:
# train = pd.read_csv('train.csv', sep = '|', error_bad_lines=False)
train=pd.read_csv('D:/yelp_review_polarity_csv/train.csv', names = ['stars', 'text'])
test=pd.read_csv('D:/yelp_review_polarity_csv/test.csv', names = ['stars', 'text'])

In [5]:
train.head()

| | stars | text |
|---|---|---|
| 0 | 1 | Unfortunately, the frustration of being Dr. Go... |
| 1 | 2 | Been going to Dr. Goldberg for over 10 years. ... |
| 2 | 1 | I don't know what Dr. Goldberg was like before... |
| 3 | 1 | I'm writing this review to give you a heads up... |
| 4 | 2 | All the food is great here. But the best thing... |


In [6]:
# train = train.dropna()
# train = train[train.stars.apply(lambda x: x.isnumeric())]
# train = train[train.stars.apply(lambda x: x !="")]
# train = train[train.text.apply(lambda x: x !="")]

# train.describe()

### Convert five classes into two classes (positive = 1 and negative = 0)

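Since the main purpose is to identify positive or negative comments, the five star categories are reduced to two classes:

- (1) Positive: comments with stars > 3
- (2) Negative: comments with stars <= 3

The polarity CSV loaded above already stores a binary label in the `stars` column (1 = negative, 2 = positive), so the code below only remaps those values to 0 and 1 instead of applying the threshold itself.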
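The threshold rule described above can be sketched in a few lines. This is only an illustration on made-up ratings: `stars_to_label`, its `threshold` parameter, and the `ratings` list are hypothetical names, not part of the notebook.

```python
def stars_to_label(stars, threshold=3):
    """Return 1 (positive) when stars > threshold, otherwise 0 (negative)."""
    return 1 if int(stars) > threshold else 0

# Hypothetical five-star ratings, for illustration only
ratings = [1, 2, 3, 4, 5]
labels = [stars_to_label(s) for s in ratings]
print(labels)  # [0, 0, 0, 1, 1]
```

With a five-star column in a DataFrame, the same function could be passed to `Series.map`.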

In [7]:
# labels = train['stars'].map(lambda x : 1 if int(x) > 3 else 0)
train_labels = train['stars'].map(lambda x : 0 if int(x) == 1 else 1)
test_labels = test['stars'].map(lambda x : 0 if int(x) == 1 else 1)

train['labels']=train_labels
test['labels']=test_labels

In [8]:
y_train=np.array(train_labels)
y_test=np.array(test_labels)

In [9]:
y_test

array([1, 0, 1, ..., 0, 0, 0], dtype=int64)

### Tokenize text data

- Because of the computational expense, I use the top 20000 unique words.
- First, tokenize the comments, then convert them into sequences.
- I keep 50 words per comment to limit the sequence length.
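To make the three steps above concrete, here is a simplified imitation of what the tokenizer and the padding step do, written with plain Python and NumPy. It is a sketch, not Keras's implementation: the helper names (`build_word_index`, `to_sequences`, `pad`) and the toy texts are made up, and it assumes the Keras defaults of padding and truncating at the start of a sequence.

```python
from collections import Counter
import numpy as np

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))  # ties broken alphabetically
    return {word: i for i, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace each word by its index and drop words whose index is not below num_words."""
    return [[word_index[w] for w in text.split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

def pad(sequences, maxlen):
    """Left-pad short sequences with zeros and keep the last maxlen tokens of long ones."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]
        if trimmed:
            out[row, -len(trimmed):] = trimmed
    return out

# Toy comments, for illustration only
texts = ["good food good service good", "bad food"]
word_index = build_word_index(texts)
padded = pad(to_sequences(texts, word_index, num_words=4), maxlen=3)
print(word_index)  # {'good': 1, 'food': 2, 'bad': 3, 'service': 4}
print(padded)      # [[2 1 1], [0 3 2]] as a 2x3 array
```

In the toy run, `service` falls outside the vocabulary limit and is dropped, the first comment is cut to its last three tokens, and the second is padded with one leading zero.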

In [10]:
# Build the stop-word set and the stemmer once instead of on every call
STOPS = set(stopwords.words("english"))
STEMMER = SnowballStemmer('english')

def clean_text(text):

    ## Punctuation is left in place here so the contraction rules below
    ## still match; the regex substitutions take care of it

    ## Convert words to lower case and split them
    text = text.lower().split()

    ## Remove stop words and words shorter than three characters
    text = [w for w in text if w not in STOPS and len(w) >= 3]

    text = " ".join(text)

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)

    ## Stem each remaining word
    text = text.split()
    stemmed_words = [STEMMER.stem(word) for word in text]
    text = " ".join(stemmed_words)

    return text

vocabulary_size = 20000

# A single tokenizer shared by train and test, so both use the same word index
tokenizer = Tokenizer(num_words=vocabulary_size)

def Get_Token_Sequence(df, fit=False):
    df['text'] = df['text'].map(clean_text)

    # Fit the vocabulary on the training data only
    if fit:
        tokenizer.fit_on_texts(df['text'])

    sequences = tokenizer.texts_to_sequences(df['text'])

    return pad_sequences(sequences, maxlen=50)

In [11]:
train_data = Get_Token_Sequence(train, fit=True)
test_data = Get_Token_Sequence(test)

In [12]:
print("train_data.shape: ", train_data.shape)
print("test_data.shape: ", test_data.shape)

train_data.shape:  (560000, 50)
test_data.shape:  (38000, 50)


## <b>Build neural network with LSTM</b>

### Network Architecture

- The network starts with an embedding layer.
- The layer expands each token into a larger dense vector, allowing the network to represent a word in a meaningful way.
- The layer takes 20000 as the first argument, which is the size of our vocabulary, and 100 as the second argument, which is the dimension of the embeddings.
- The third parameter is the input_length of 50, which is the length of each comment sequence.
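What the embedding layer does with these three numbers can be illustrated with plain NumPy. The sketch below uses a random matrix as a stand-in for the trainable weights and arbitrary token ids as fake comments; it only shows the shapes involved, not how Keras trains or stores the layer.

```python
import numpy as np

vocabulary_size, embedding_dim, sequence_length = 20000, 100, 50

rng = np.random.default_rng(0)

# Stand-in for the trainable weight matrix: one 100-dimensional row per vocabulary index
embedding_matrix = rng.normal(size=(vocabulary_size, embedding_dim)).astype("float32")

# Two fake padded comments of token ids (arbitrary values, for illustration only)
batch = rng.integers(0, vocabulary_size, size=(2, sequence_length))

# The lookup is plain row indexing: every token id is replaced by its row
embedded = embedding_matrix[batch]

print(embedded.shape)         # (2, 50, 100)
print(embedding_matrix.size)  # 2000000 weights in the matrix
```

Each comment of 50 token ids becomes a 50 x 100 array, which is the input the LSTM layer receives.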

In [13]:
model = Sequential()
model.add(Embedding(20000, 100, input_length=50))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

### Train the network

In [None]:
model.fit(train_data, y_train, validation_split=0.4, epochs=3)

Epoch 1/3

In [None]:
model.evaluate(test_data, y_test)

In [None]:
test_data2=Get_Token_Sequence(pd.read_excel('test.xlsx'))
model.evaluate(test_data2, np.array([1]))

## Word embedding visualization

- In this subsection, I want to visualize word embedding weights obtained from trained models.
- Word embeddings with 100 dimensions are first reduced to 2 dimensions using t-SNE.
- TensorFlow has an excellent tool for visualizing embeddings, but here I just want to visualize the word relationships.

### Get embedding weights from the LSTM model

In [None]:
lstm_embds = model.layers[0].get_weights()[0]

### Get word list 

In [None]:
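# Row 0 of the embedding matrix is reserved for padding, so a placeholder
# keeps word_list[i] aligned with embedding row i
word_list = ['<pad>']
for word, i in sorted(tokenizer.word_index.items(), key=lambda item: item[1]):
    word_list.append(word)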

### Scatter plot of first two components of TSNE

In [None]:
def plot_words(data, start, stop, step):
    trace = go.Scatter(
        x = data[start:stop:step,0], 
        y = data[start:stop:step, 1],
        mode = 'markers',
        text= word_list[start:stop:step]
    )
    layout = dict(title= 't-SNE 1 vs t-SNE 2',
                  yaxis = dict(title='t-SNE 2'),
                  xaxis = dict(title='t-SNE 1'),
                  hovermode= 'closest')
    fig = dict(data = [trace], layout= layout)
    py.iplot(fig)

In [None]:
number_of_words = 100

In [None]:
lstm_tsne_embds = TSNE(n_components=2).fit_transform(lstm_embds)
plot_words(lstm_tsne_embds, 0, number_of_words, 1)