# Recommender Systems Based On Amazon Review

#### Teammates: Nivetitha Ramachandar, Zhirou Zhang, Supriya Tiwari, Jiamin Zhu, Shruti Deshpande

## Contents

Here is the outline of what we're going to do in this notebook.
- Problem Statement and Objective
- Overview and Benefits of a review-based engine
- Dataset Overview
- Data cleaning steps 
- Sentiment Analysis 
- Word2vec model 
- Recommendation Results
- Search Engine

## Dataset

Here is the basic information about our dataset.

- reviewerID: ID of the reviewer
- asin: ID of the product
- reviewerName: name of the reviewer
- helpful: helpfulness rating of the review, e.g. 2/3
- reviewText:  text of the review
- overall: rating score from 1 to 5
- summary: summary of the review
- unixReviewTime: time of the review (unix time)
- reviewTime: time of the review (raw)

In [None]:
import pandas as pd

In [None]:
data = pd.read_json("reviews_Clothing_Shoes_and_Jewelry_5.json",lines=True)

In [None]:
data.head()

In [None]:
data.columns

In [None]:
data.shape

## Preparing data

For this analysis, we'll extract only the review text from the data and store it in a file.

In [None]:
%%time
USE_PREMADE_REVIEWS_TEXT = True

from os import path
reviews_text_filepath = 'medium/reviews_text.txt'
if not USE_PREMADE_REVIEWS_TEXT:
    with open(reviews_text_filepath, 'w') as f:
        for review in data.reviewText.values:
            # if the row lacks a review, skip it.
            if pd.isnull(review):
                continue
            # keep one review per line: flatten any line breaks inside the review
            f.write(review.replace('\n', ' ') + '\n')
else:
    assert path.exists(reviews_text_filepath)

Let's build a simple function which helps us read each line from the reviews text file.

In [None]:
def read_reviews(filepath):
    """
    helper function to read in the file and yield each line at a time.
    """
    with open(filepath) as f:
        for review in f:
            yield review

In [None]:
from itertools import islice
def retrieve_review(sample_num):
    """
    get a specific review from reviews text file and return it.
    """
    return next(islice(read_reviews(reviews_text_filepath), sample_num, sample_num+1))
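
To see why the generator plus islice combination is convenient, here is a minimal, self-contained sketch of the same pattern. It uses an in-memory list (`toy_reviews`, made up for illustration) in place of the reviews file, and the helper names are hypothetical stand-ins for the functions above.

```python
from itertools import islice

def read_lines(lines):
    """Yield items one at a time, standing in for reading a file line by line."""
    for line in lines:
        yield line

def retrieve_line(lines, index):
    """Return the item at position `index` without building a list of all items."""
    return next(islice(read_lines(lines), index, index + 1))

# made-up reviews, one per "line"
toy_reviews = ["great shoes\n", "too small\n", "love this watch\n"]
third = retrieve_line(toy_reviews, 2)
print(third.strip())
```

Note that `next` raises StopIteration when the index is past the end of the stream, so a sample number larger than the number of reviews fails rather than returning an empty string.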

Let's take a sample review and see what it looks like.

In [None]:
sample_review = retrieve_review(200)
sample_review

## Text processing with spaCy

We'll use spaCy to normalize reviews.

In [None]:
%%time
import spacy
# load english vocabulary and language models. This takes some time.
nlp = spacy.load('en')

In [None]:
def lemmatize(line):
    """
    return the lemma of each token, skipping punctuation and whitespace tokens.
    """
    return [token.lemma_ for token in line 
                      if not token.is_punct and not token.is_space]

Let's see how well spaCy did. Here's a normalized version of the sample review above. You can see that many words have been lowercased and lemmatized.

In [None]:
sample_review_normalized = lemmatize(nlp(sample_review))
' '.join(sample_review_normalized)


We now normalize all the reviews we have. This takes a while.

In [None]:
%%time
USE_PREMADE_SENTENCES_NORMALIZED = True

sentences_normalized_filepath = 'medium/sentences_normalized.txt'


if not USE_PREMADE_SENTENCES_NORMALIZED:
    with open(sentences_normalized_filepath, 'w') as f:
        for review_parsed in nlp.pipe(read_reviews(reviews_text_filepath)):
            for sentence_parsed in review_parsed.sents:
                lemmas = lemmatize(sentence_parsed)
                f.write(' '.join(lemmas) + '\n')
else:
    assert path.exists(sentences_normalized_filepath)


## Phrase modeling


Some words are often used together and take on a special meaning when combined. We call them 'phrases'. We're now going to find bigram/trigram phrases in the reviews.

To do so, we turn to gensim, a popular Python NLP library, and in particular its Phrases class.

In [None]:
pip install -U gensim

In [None]:
from gensim.models import Phrases

We take the normalized texts from the previous section, and build a bigram model upon them.



In [None]:
%%time
USE_PREMADE_BIGRAM_MODEL = True

bigram_model_filepath = 'medium/bigram_model.dms'

# gensim's LineSentence provides a convenient way to iterate over lines in a text file.
# it yields one line at a time, which saves memory, and it works well with other gensim components.
from gensim.models.word2vec import LineSentence
# we treat the normalized sentences as unigram sentences, i.e. no phrase modeling has been applied yet.
unigram_sentences = LineSentence(sentences_normalized_filepath)

if not USE_PREMADE_BIGRAM_MODEL:    
    
    bigram_model = Phrases(unigram_sentences)
    bigram_model.save(bigram_model_filepath)
    
else:
    bigram_model = Phrases.load(bigram_model_filepath)

In [None]:
sample_review_bigram = bigram_model[sample_review_normalized]
' '.join(sample_review_bigram)
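
To give some intuition for how a pair of words gets promoted to a phrase, here is a simplified sketch of a collocation score of the form (count(a b) - min_count) * vocab_size / (count(a) * count(b)). This is modeled on our understanding of the default scoring in gensim's Phrases, but it is an illustration only, not gensim's implementation: the function name, the toy sentences and the use of the unigram vocabulary size are all assumptions.

```python
from collections import Counter

def phrase_scores(sentences, min_count=1):
    """Score adjacent word pairs; higher means the pair co-occurs more than chance suggests."""
    unigrams = Counter(w for s in sentences for w in s)
    bigrams = Counter(pair for s in sentences for pair in zip(s, s[1:]))
    vocab_size = len(unigrams)  # simplification: unigram vocabulary only
    return {
        pair: (n - min_count) * vocab_size / (unigrams[pair[0]] * unigrams[pair[1]])
        for pair, n in bigrams.items()
    }

# made-up, already lemmatized sentences
toy_sentences = [
    ["new", "york", "be", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "be", "far"],
    ["i", "be", "big"],
]
scores = phrase_scores(toy_sentences)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

Pairs whose score exceeds a threshold are joined into a single token (for example new_york); pairs seen only min_count times score zero and are never joined.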

We process all the normalized texts in the same way.

In [None]:
%%time
USE_PREMADE_BIGRAM_SENTENCES = True

bigram_sentences_filepath = 'medium/bigram_sentences.txt'

if not USE_PREMADE_BIGRAM_SENTENCES:
    
    with open(bigram_sentences_filepath, 'w') as f:
        for unigram_sentence in unigram_sentences:
            bigram_sentence = bigram_model[unigram_sentence]
            f.write(' '.join(bigram_sentence) + '\n')
else:
    assert path.exists(bigram_sentences_filepath)

Let's go one step further and build a trigram phrase model on top of the bigram output. This lets us combine two bigram phrases, or one unigram and one bigram phrase, into a longer phrase.

In [None]:
%%time
USE_PREMADE_TRIGRAM_MODEL = True

trigram_model_filepath = 'medium/trigram_model.dms'

from gensim.models.word2vec import LineSentence
from gensim.models import Phrases

if not USE_PREMADE_TRIGRAM_MODEL:
    
    bigram_sentences = LineSentence(bigram_sentences_filepath)
    trigram_model = Phrases(bigram_sentences)
    trigram_model.save(trigram_model_filepath)

else:
    trigram_model = Phrases.load(trigram_model_filepath)

In [None]:
spacy.cli.download("en")

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download en

In [None]:
import spacy
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
nlp = spacy.load('en')

In [None]:
%%time
USE_PREMADE_REVIEWS_FOR_LDA = True

reviews_for_lda_filepath = 'medium/reviews_for_lda.txt'

if not USE_PREMADE_REVIEWS_FOR_LDA:
    
    with open(reviews_for_lda_filepath, 'w') as f:
        
        for review_parsed in nlp.pipe(read_reviews(reviews_text_filepath)):
            
            unigram_review = lemmatize(review_parsed)
            bigram_review = bigram_model[unigram_review]
            trigram_review = trigram_model[bigram_review]
            # remove stop words
            trimmed_review = [lemma for lemma in trigram_review 
                              if lemma not in spacy.lang.en.STOP_WORDS and lemma != '-PRON-']
            f.write(' '.join(trimmed_review) + '\n')
else:
    assert path.exists(reviews_for_lda_filepath)

In [None]:
%%time
USE_PREMADE_SENTENCES_FOR_WORD2VEC = True

sentences_for_word2vec_filepath = 'medium/sentences_for_word2vec.txt'

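if not USE_PREMADE_SENTENCES_FOR_WORD2VEC:
    
    # create the iterator here, because the trigram cell only defines it when it rebuilds the model
    bigram_sentences = LineSentence(bigram_sentences_filepath)
    with open(sentences_for_word2vec_filepath, 'w') as f:
        for bigram_sentence in bigram_sentences: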
            trigram_sentence = trigram_model[bigram_sentence]
            # remove stop words
            trimmed_sentence = [lemma for lemma in trigram_sentence 
                                if lemma not in spacy.lang.en.STOP_WORDS and lemma != '-PRON-']
            f.write(' '.join(trimmed_sentence) + '\n')
else:
    assert path.exists(sentences_for_word2vec_filepath)

## Topic modeling with LDA

Topic modeling automatically discovers topics in a collection of documents (reviews, in this case). We'll now run Latent Dirichlet Allocation (LDA), a standard topic modeling method, on our reviews.

In [None]:
from gensim.corpora import Dictionary, MmCorpus


First, we need to compile our dictionary.



In [None]:
%%time
USE_PREMADE_DICTIONARY = True

dictionary_filepath = 'medium/dictionary.dict'

if not USE_PREMADE_DICTIONARY:
    
    reviews_for_lda = LineSentence(reviews_for_lda_filepath)
    dictionary = Dictionary(reviews_for_lda)
    dictionary.filter_extremes(no_below=10, no_above=0.4)
    dictionary.compactify()
    
    dictionary.save(dictionary_filepath)
else:
    dictionary = Dictionary.load(dictionary_filepath)
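
The filter_extremes call above prunes the vocabulary by document frequency. Here is a minimal sketch of that idea in plain Python, assuming the semantics we rely on: keep a token only if it appears in at least no_below documents and in at most a no_above fraction of all documents. The function name and toy documents are made up for illustration.

```python
from collections import Counter

def filter_by_doc_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most a `no_above` fraction of docs."""
    # count each token once per document
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)

toy_docs = [
    ["shoe", "fit", "great"],
    ["shoe", "small"],
    ["shoe", "fit", "well"],
    ["shoe", "great", "color"],
]
kept = filter_by_doc_frequency(toy_docs, no_below=2, no_above=0.75)
print(kept)
```

In this toy corpus "shoe" is dropped because it appears in every document, and the words seen in only one document are dropped as too rare.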

Then, we build a corpus which we'll use when performing LDA.



In [None]:
%%time
USE_PREMADE_CORPUS = True

corpus_filepath = 'medium/corpus.mm'

if not USE_PREMADE_CORPUS:
    
    def make_bow_corpus(filepath):
        """
        generator function to read in reviews from the file
        and output a bag-of-words represention of the text
        """
        for review in LineSentence(filepath):
            yield dictionary.doc2bow(review)
            
    MmCorpus.serialize(corpus_filepath, make_bow_corpus(reviews_for_lda_filepath))
    
review_corpus = MmCorpus(corpus_filepath)
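
A bag-of-words vector is just a list of (token id, count) pairs. The sketch below shows the idea with a hand-rolled vocabulary; the helper names are hypothetical, and gensim's Dictionary may assign ids in a different order than this simple first-seen scheme.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens outside the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["shoe", "fit", "shoe"], ["watch", "fit"]]
token2id = build_vocab(toy_docs)
bow = to_bow(["shoe", "shoe", "fit", "sandal"], token2id)
print(bow)
```

The unknown word "sandal" is silently dropped, which mirrors what happens to words removed from the dictionary by the frequency filter.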

Finally, we turn to gensim's LdaMulticore class, which parallelizes LDA training across CPU cores to speed it up.

In [None]:
from gensim.models import LdaMulticore


In [None]:
%%time
USE_PREMADE_LDA = True

lda_filepath = 'medium/lda.dms'

if not USE_PREMADE_LDA:
    
    # number of workers should be set to your number of physical cores minus one
    lda = LdaMulticore(review_corpus,
                           num_topics=20,
                           id2word=dictionary,
                           workers=2)
    lda.save(lda_filepath)
else:
    lda = LdaMulticore.load(lda_filepath)

You can inspect a specific topic from the model by its index. There are no names for topics, though, because LDA is an unsupervised learning algorithm. Instead, you can see the words associated with the topic.

In [None]:
lda.show_topic(0)

## Sentiment analysis

Next, we run sentiment analysis on the reviews.

In [None]:
# spaCy stuff
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download en

import spacy 

nlp = spacy.load("en")

In [None]:
reviews_df = data
reviews_df.head()

The overall customer rating ranges from 1 to 5. To simplify the problem, we split the ratings into two categories:

- bad reviews: overall rating < 3
- good reviews: overall rating >= 3

In [None]:
# create the label
reviews_df["is_bad_review"] = reviews_df["overall"].apply(lambda x: 1 if x < 3 else 0)
# select only relevant columns
reviews_df = reviews_df[["reviewText", "is_bad_review"]]
reviews_df.head()

In [None]:
# Reviews data is sampled in order to speed up computations.
reviews_df = reviews_df.sample(frac = 0.1, replace = False, random_state=42)

If a customer leaves no negative or positive comment, the text appears as "No Negative" or "No Positive" in the dataset, so we remove those placeholders.

In [None]:
# remove 'No Negative' or 'No Positive' from text
reviews_df["reviewText"] = reviews_df["reviewText"].apply(lambda x: x.replace("No Negative", "").replace("No Positive", ""))

In [None]:
# return the wordnet object value corresponding to the POS tag
from nltk.corpus import wordnet

def get_wordnet_pos(pos_tag):
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
import string
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer

# build these once instead of on every call
STOP_WORDS_EN = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

def clean_text(text):
    # lowercase the text
    text = text.lower()
    # tokenize on spaces and strip surrounding punctuation
    text = [word.strip(string.punctuation) for word in text.split(" ")]
    # remove words that contain numbers
    text = [word for word in text if not any(c.isdigit() for c in word)]
    # remove stop words
    text = [x for x in text if x not in STOP_WORDS_EN]
    # remove empty tokens
    text = [t for t in text if len(t) > 0]
    # pos tag text
    pos_tags = pos_tag(text)
    # lemmatize each word according to its part of speech
    text = [LEMMATIZER.lemmatize(t[0], get_wordnet_pos(t[1])) for t in pos_tags]
    # remove words with only one letter
    text = [t for t in text if len(t) > 1]
    # join all tokens back into one string
    text = " ".join(text)
    return text

# clean text data
reviews_df["review_clean"] = reviews_df["reviewText"].apply(lambda x: clean_text(x))

While cleaning the reviewText data, we did the following:

- lowercase the text
- tokenize the text (split it into words) and remove the punctuation
- remove words that contain numbers
- remove stop words such as 'the', 'a', 'this', etc.
- remove empty tokens and one-letter words
- Part-Of-Speech (POS) tagging: assign a tag to every word to mark whether it is a noun, a verb, etc., then map that tag to the matching WordNet category
- lemmatize the text: transform every word into its root form (e.g. rooms -> room, slept -> sleep)

In [None]:
reviews_df.head()

In [None]:
import nltk
nltk.download('vader_lexicon')

We use VADER to conduct sentiment analysis. VADER uses a lexicon of words to decide which ones are positive or negative. It also takes the context of each sentence into account when computing the sentiment scores. For each text, VADER returns 4 values:

- a neutrality score
- a positivity score
- a negativity score
- an overall score that summarizes the previous scores

We will integrate those 4 values as features in our dataset.

In [None]:
# add sentiment analysis columns
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
reviews_df["sentiments"] = reviews_df["reviewText"].apply(lambda x: sid.polarity_scores(x))
reviews_df = pd.concat([reviews_df.drop(['sentiments'], axis=1), reviews_df['sentiments'].apply(pd.Series)], axis=1)
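
To make the overall (compound) score less of a black box, here is a toy sketch of how a sum of word valences can be squashed into the range -1 to 1. It assumes the normalization x / sqrt(x^2 + alpha) with alpha = 15, which is our understanding of what VADER does; the three-word lexicon is hypothetical, and the sketch ignores VADER's rules for negation, punctuation and capitalization.

```python
import math

# hypothetical valence scores; the real VADER lexicon is much larger
TOY_LEXICON = {"love": 3.0, "great": 3.0, "terrible": -2.5}

def normalize_compound(raw_score, alpha=15):
    """Squash an unbounded valence sum into (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)

def toy_compound(text):
    """Sum the valences of known words, then normalize the total."""
    raw = sum(TOY_LEXICON.get(word, 0.0) for word in text.lower().split())
    return normalize_compound(raw)

good = toy_compound("I love this great shoe")
bad = toy_compound("terrible fit")
print(round(good, 3), round(bad, 3))
```

Because of the squashing, adding more positive words pushes the score toward 1 but never past it, which is why long glowing reviews cluster near the top of the scale.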

Next, we add some simple metrics for every text:
- number of characters in the text
- number of words in the text

In [None]:
# add number of characters column
reviews_df["nb_chars"] = reviews_df["reviewText"].apply(lambda x: len(x))

# add number of words column
reviews_df["nb_words"] = reviews_df["reviewText"].apply(lambda x: len(x.split(" ")))

The next step is to extract a vector representation for every review. Gensim's Word2Vec creates a numerical vector for every word in the corpus from the contexts in which the word appears.

Each text can likewise be transformed into a numerical vector (Doc2Vec). Similar texts get similar representations, which is why we can use those vectors as training features.

We first train a Doc2Vec model on our text data. Applying this model to our reviews gives us those representation vectors.

In [None]:
# create doc2vec vector columns
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(doc, [i]) for i, doc in enumerate(reviews_df["review_clean"].apply(lambda x: x.split(" ")))]

# train a Doc2Vec model with our text data
model = Doc2Vec(documents, vector_size=5, window=2, min_count=1, workers=4)

# transform each document into a vector data
doc2vec_df = reviews_df["review_clean"].apply(lambda x: model.infer_vector(x.split(" "))).apply(pd.Series)
doc2vec_df.columns = ["doc2vec_vector_" + str(x) for x in doc2vec_df.columns]
reviews_df = pd.concat([reviews_df, doc2vec_df], axis=1)

Finally, we add the TF-IDF (Term Frequency - Inverse Document Frequency) values for every word and every document.

To limit the size of the final output, we only add TF-IDF columns for words that appear in at least 10 different texts.

In [None]:
# add tf-idfs columns
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(min_df = 10)
tfidf_result = tfidf.fit_transform(reviews_df["review_clean"]).toarray()
tfidf_df = pd.DataFrame(tfidf_result, columns = tfidf.get_feature_names())
tfidf_df.columns = ["word_" + str(x) for x in tfidf_df.columns]
tfidf_df.index = reviews_df.index
reviews_df = pd.concat([reviews_df, tfidf_df], axis=1)
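
For readers who want to see the arithmetic behind TF-IDF, here is a small plain-Python sketch. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is our understanding of TfidfVectorizer's default behavior; the function name and toy documents are made up, and the min_df filter is left out.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return the sorted vocabulary and one L2-normalized TF-IDF row per document."""
    n_docs = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    doc_freq = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # guard against empty docs
        rows.append([w / norm for w in weights])
    return vocab, rows

toy_docs = [["shoe", "fit"], ["shoe", "small"], ["shoe", "shoe", "great"]]
vocab, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the first toy document, "fit" gets a higher weight than "shoe" even though both occur once, because "shoe" appears in every document and is therefore less informative.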

In [None]:
reviewnew_df = reviews_df[reviews_df.columns[0:6]]

In [None]:
reviewnew_df.head()

Here is an example comparing a raw review with its cleaned version:

In [None]:
reviewnew_df['reviewText'].values[0]

In [None]:
reviewnew_df['review_clean'].values[0]

To understand our data better, let's explore it further:

In [None]:
# show is_bad_review distribution
reviews_df["is_bad_review"].value_counts(normalize = True)

This result shows that our dataset is highly imbalanced: fewer than 10% of our reviews are considered negative.

In [None]:
# highest positive sentiment reviews (with at least 5 words)
reviews_df[reviews_df["nb_words"] >= 5].sort_values("pos", ascending = False)[["reviewText", "pos"]].head(10)

Most of the top positive reviews do correspond to good feedback.

In [None]:
# highest negative sentiment reviews (with at least 5 words)
reviews_df[reviews_df["nb_words"] >= 5].sort_values("neg", ascending = False)[["reviewText", "neg"]].head(10)

The most negative reviews all correspond to bad feedback.

In [None]:
# plot sentiment distribution for positive and negative reviews

import seaborn as sns

for x in [0, 1]:
    subset = reviews_df[reviews_df['is_bad_review'] == x]
    
    # Draw the density plot
    if x == 0:
        label = "Good reviews"
    else:
        label = "Bad reviews"
    sns.distplot(subset['compound'], hist = False, label = label)

The graph shows the distribution of compound sentiment scores for good and bad reviews. VADER rates most good reviews as very positive, whereas bad reviews tend to have lower compound scores.

## Word vector modeling with Word2Vec


Word vector modeling (also called word embedding) transforms words into vectors, which lets us do arithmetic with them. Word2Vec was proposed by researchers at Google in 2013, and you can find a Python implementation of the model in ... gensim (of course!).

In [None]:
from gensim.models import Word2Vec


In [None]:
%%time
USE_PREMADE_WORD2VEC = True

word2vec_filepath = 'medium/word2vec_model.dms'

if not USE_PREMADE_WORD2VEC:
    
    sentences_for_word2vec = LineSentence(sentences_for_word2vec_filepath)
    
    # initiate the model with 100-dimensional vectors, a window of 5 words before and after each focus word,
    # and skip-gram training (sg=1); the constructor also runs the initial training (model.iter epochs)
    model = Word2Vec(sentences_for_word2vec, size=100, window=5, min_count=5, sg=1)
    
    # run 9 more training passes, each of model.iter epochs
    for _ in range(9):
        model.train(sentences_for_word2vec,epochs=model.iter, total_examples=model.corpus_count)

    model.save(word2vec_filepath)
else:
    model = Word2Vec.load(word2vec_filepath)
model.init_sims()

In [None]:
print('train() has been called {} times so far.'.format(model.train_count))

We transformed each word in our reviews into a 100-dimensional vector. Wondering what they look like? Here are the word vectors as a pandas dataframe.



In [None]:
# take word vectors of most frequent words.
num_words = 2000
word_embeddings = pd.DataFrame(model.wv.syn0norm[:num_words, :], index=model.wv.index2word[:num_words])
word_embeddings.head(10)

In [None]:
model.wv.most_similar(positive=['adidas'], topn=5)


In [None]:
model.wv.most_similar(positive=['shoe'], negative=['ugly'], topn=5)
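
Under the hood, a similarity query of this kind boils down to cosine similarity between vectors. The sketch below illustrates the idea with made-up 2-dimensional vectors: it adds the unit vectors of the positive words, subtracts those of the negative words, and ranks the remaining words by cosine similarity to the result. The function names and vectors are hypothetical, and this is a simplified picture rather than gensim's code.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity: dot product of the two unit vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def most_similar(vectors, positive, negative=(), topn=5):
    """Rank words by cosine similarity to (sum of positives - sum of negatives)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    exclude = set(positive) | set(negative)
    ranked = sorted(
        ((w, cosine(query, v)) for w, v in vectors.items() if w not in exclude),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:topn]

# hypothetical 2-d embeddings, for illustration only
toy_vectors = {
    "adidas": [0.9, 0.1],
    "nike": [0.8, 0.2],
    "puma": [0.7, 0.4],
    "necklace": [0.1, 0.9],
}
neighbours = most_similar(toy_vectors, positive=["adidas"], topn=2)
print(neighbours)
```

The query words themselves are excluded from the results, which is why searching for a brand returns other brands rather than the brand itself.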


## Search Engine

Ask the user for a search keyword.

In [None]:
search_item = input("Please enter a keyword for the product: ")

Use the most_similar method to retrieve the words most similar to the keyword.

In [None]:
search_word = model.wv.most_similar(positive=[search_item], topn=5)

In [None]:
search_word

In [None]:
target_list=[]

In [None]:
for x in search_word:
    target_list.append(x[0]) 

This list contains the words most closely related to the user's keyword.

In [None]:
target_list

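Phrase tokens join their words with underscores, so we replace the underscores with spaces to get the final list of related terms.

In [None]:
target_list2=[]
# replace the underscores that join phrase tokens with spaces
for x in target_list:
    x=x.replace("_"," ")
    target_list2.append(x)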
target_list2

Show the cleaned reviews that contain one word from the target list (here, the fourth).

In [None]:
df1 = reviewnew_df[reviewnew_df['review_clean'].str.contains(target_list2[3])]

In [None]:
df1.nlargest(10, 'pos')

Find every review that contains any word in the target list.

In [None]:
# collect the reviews matching each related word, then combine them into one dataframe
matches = [reviewnew_df[reviewnew_df['review_clean'].str.contains(x)] for x in target_list2]
target_df = pd.concat(matches)
target_df = target_df.sort_values("pos")
target_df

In [None]:
final_df=target_df.nlargest(10, 'pos')

In [None]:
final_df

In [None]:
final_df = final_df.drop_duplicates(subset='reviewText', keep='first')


After removing duplicates, we are left with the reviews that have the highest pos scores from the sentiment analysis.

In [None]:
final_df

Look the selected reviews up in the original dataframe to recover their product IDs.

In [None]:
# select the original rows whose review text appears in the final results
data_original = data[data['reviewText'].isin(final_df['reviewText'])]
data_original

Output the IDs of the suggested products.

In [None]:
data_original['asin']
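
Putting the search steps together, the whole lookup can be sketched as one small function. This version works on plain dictionaries instead of dataframes so that it runs on its own; the function name, the toy reviews and their pos scores are made up for illustration. Unlike the cells above, it removes duplicates before taking the top results, so it can return a full list of distinct reviews.

```python
def search_reviews(reviews, related_terms, top_n=10):
    """Return up to `top_n` distinct reviews mentioning any related term, most positive first."""
    matches = [
        r for r in reviews
        if any(term in r["review_clean"] for term in related_terms)
    ]
    seen, unique = set(), []
    for review in sorted(matches, key=lambda r: r["pos"], reverse=True):
        if review["reviewText"] not in seen:
            seen.add(review["reviewText"])
            unique.append(review)
    return unique[:top_n]

# made-up reviews with made-up sentiment scores
toy_reviews = [
    {"asin": "A1", "reviewText": "Love these running shoes!", "review_clean": "love running shoe", "pos": 0.62},
    {"asin": "B2", "reviewText": "Nice sneaker, fits well.", "review_clean": "nice sneaker fit well", "pos": 0.48},
    {"asin": "C3", "reviewText": "Pretty necklace.", "review_clean": "pretty necklace", "pos": 0.70},
    {"asin": "A1", "reviewText": "Love these running shoes!", "review_clean": "love running shoe", "pos": 0.62},
]
suggested = [r["asin"] for r in search_reviews(toy_reviews, ["running shoe", "sneaker"])]
print(suggested)
```

The necklace review has the highest pos score but is not returned, because it mentions none of the related terms.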