# Analysing Amazon Product Reviews Using LDA Topic Modelling

With the boom in online shopping and the accompanying influx of reviews, understanding user experience is becoming increasingly challenging. Reviews speak volumes about a product, the seller and local partners. However, making sense of such a large volume of customer feedback can be tricky. This tutorial shows how to load and structure product reviews so that you can draw useful insights from them.

For our use case here, we will be using reviews of Amazon Echo.

## Load the libraries

In [5]:
import re # We clean text using regex
import csv # To read the csv
from collections import defaultdict # For accumulating values
from nltk.corpus import stopwords # To remove stopwords
from gensim import corpora # To create corpus and dictionary for the LDA model
from gensim.models import LdaModel # To use the LDA model
import pandas as pd
import nltk


### Loading Amazon Echo Review Data


A sample dataset of Amazon Echo reviews is available [here](https://www.scrapehero.com/wp/wp-content/uploads/blog_attachments/reviews.zip).


In [6]:
fileContents = defaultdict(list)
with open('reviews_sample.csv', 'r') as f:
    reader = csv.DictReader(f)
    for row in reader: # read a row as {column1: value1, column2: value2,...}
        for (k,v) in row.items(): # go over each column name and value 
            fileContents[k].append(v) 

Extract just the reviews into a list:



In [7]:
reviews = fileContents['review_body']

In [8]:
data = pd.DataFrame(reviews)
data.head()

| | 0 |
|---|---|
| 0 | It’s ok. Doesn’t know much. Can’t answer most ... |
| 1 | So easy to use. |
| 2 | Alexa is an entertaining assistant to have aro... |
| 3 | Fun and addictive. |
| 4 | I think google home assist is much smarter and... |


In [9]:
reviews

['It’s ok. Doesn’t know much. Can’t answer most of my questions. Should have gotten a google mini instead',
 'So easy to use.',
 'Alexa is an entertaining assistant to have around.',
 'Fun and addictive.',
 'I think google home assist is much smarter and works better. Although this is also great - kids love that they can do their own programs',
 'This was my 7th Echo dot. I have them everywhere. I bought this one for my sister. She loves it as much as I do.',
 'Great product! I can’t wait to see what the next few generations will be like. This is a good entry to voice commands but far from AI that is really useful.',
 "I personally only use it for playing Spotify, didn't really use it for anything else.",
 'Perfect for an alarm clock and fun for music and general knowledge.',
 'If you live alone, this is the perfect thing for you. I just lost my dog, and was feeling so lonely. Then my Dot arrived in the mail! I am no longer alone. Alexa talks to me, and I can actually have a conversati

## Cleaning Up The Data


### Punctuation

In [10]:
reviews = [re.sub(r'[^\w\s]', '', str(item)) for item in reviews] # drop every character that is neither a word character nor whitespace
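
As a quick check of what this pattern does, the sketch below applies the same substitution to a made-up review (the sample string is an assumption, not a row from the dataset). `[^\w\s]` matches any character that is neither a word character nor whitespace, so punctuation, including the curly apostrophe, is removed.

```python
import re

# Hypothetical sample text, used only to illustrate the regex
sample = "It’s ok. Doesn’t know much!"

# Same substitution as in the cell above
cleaned = re.sub(r'[^\w\s]', '', sample)
print(cleaned)
```

This is why tokens such as `doesnt` and `cant` show up in the dictionary later on.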


### Stop-words
The reviews contain a lot of words that aren’t really necessary for our study. These are called stopwords. We will remove them from the text while converting the reviews to tokens.

In [11]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/nathanamar/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [12]:
stop_words = set(stopwords.words('english')) # a separate name avoids shadowing the imported stopwords module


Let’s remove those stopwords while splitting each review into a list of the words that matter.

In [13]:
texts = [[word for word in document.lower().split() if word not in stop_words] for document in reviews]
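
To see the effect of this step in isolation, here is a minimal sketch on two short reviews. It uses a tiny hand-written stopword set (`toy_stop_words`, an assumption made so the example runs without downloading the NLTK list); the real list is much longer.

```python
# Hypothetical, deliberately tiny stopword set; NLTK's English list is far larger
toy_stop_words = {"is", "an", "to", "have", "and", "so"}

# Two reviews with punctuation already stripped
toy_reviews = [
    "So easy to use",
    "Alexa is an entertaining assistant to have around",
]

# Lowercase, split on whitespace, and keep only non-stopwords
toy_texts = [
    [word for word in document.lower().split() if word not in toy_stop_words]
    for document in toy_reviews
]
print(toy_texts)
```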


### Taking out the less frequent words


One of the easiest markers of how important a word is in a text (stopwords are the exception) is how many times it occurs. A word that occurs only once in the whole collection tells the model little about any topic, so let’s remove those words.

In [14]:
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 1] for text in texts]
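
The same filtering can be sketched on a toy corpus. This version uses `collections.Counter` instead of `defaultdict(int)`, which is a stylistic swap rather than the approach used above; the token lists are made up for illustration.

```python
from collections import Counter

# Hypothetical tokenised reviews
toy_texts = [
    ["echo", "dot", "great"],
    ["echo", "sound", "great"],
    ["alarm", "echo"],
]

# Count how often each token occurs across the whole collection
frequency = Counter(token for text in toy_texts for token in text)

# Keep only tokens that occur more than once
filtered = [[token for token in text if frequency[token] > 1] for text in toy_texts]
print(filtered)
```

Note that a document can shrink to very few tokens (or none) after this step, which is worth keeping in mind with short reviews.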

## Begin processing


### Turning our text to dictionary


A dictionary in the context of machine learning is a mapping between words and their integer ids. We know that a machine can’t understand words and documents as they are. So we split and vectorize them. 

In [15]:
dictionary = corpora.Dictionary(texts)


If you print the dictionary, you can see the number of unique tokens it contains.

In [16]:
print(dictionary)


Dictionary(196 unique tokens: ['answer', 'cant', 'doesnt', 'google', 'instead']...)


In [17]:
corpus = [dictionary.doc2bow(text) for text in texts]


In [18]:
print(corpus[:1])

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]]


`doc2bow` counts the number of occurrences of each distinct word, converts the word to its integer id and returns the result as a sparse vector. The corpus is therefore a list of documents, each of which is a list of tuples of the form [(word id, number of occurrences), … ].

So if a document in the corpus reads [(0,1),(1,4)], the word with id 0 occurred once and the word with id 1 occurred four times in that document. Now that we have our reviews in a form the machine can work with, let’s get to finding topics in them.
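
To make the bag-of-words format concrete, here is a pure-Python sketch that mimics the idea. It is not gensim's implementation: the helper names `build_vocab` and `to_bow` are hypothetical, and gensim may assign different ids to the same words.

```python
from collections import Counter

def build_vocab(texts):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Return sorted (token id, count) pairs for one tokenised document."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())

# Hypothetical tokenised reviews
toy_texts = [["alexa", "music", "alexa"], ["music", "alarm"]]

token2id = build_vocab(toy_texts)
bows = [to_bow(text, token2id) for text in toy_texts]
print(token2id)
print(bows)
```

In the first toy document, `alexa` (id 0) appears twice and `music` (id 1) once, so its vector is `[(0, 2), (1, 1)]`.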

## What is an LDA Model?

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. Latent Dirichlet Allocation (LDA) is one such model: it treats each document as a mixture of topics and each topic as a distribution over words, with Dirichlet priors on both distributions.

Let’s go with nine topics for now. The number of topics you choose is largely a guess. The model assumes the corpus contains that many topics. You can use gensim's `CoherenceModel` to compare candidate values and pick a better one.

In [19]:
NUM_TOPICS = 9 # This is an assumption. 
ldamodel = LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary, passes=15) # This might take some time.

## Insights
### Extracting Topics from your model

Let’s look at the topics. Note that you might not get exactly the results shown here. The objective function for LDA is non-convex, so it has many local optima. In layman’s terms, training does not find one single best solution: each run converges to a locally optimal solution that depends on its random starting point, and no given run is guaranteed to outperform a run from a different starting point.

In [20]:
topics = ldamodel.show_topics()
for topic in topics:
    print(topic)

(0, '0.028*"worth" + 0.028*"sound" + 0.028*"music" + 0.028*"bought" + 0.028*"better" + 0.028*"room" + 0.028*"could" + 0.028*"timer" + 0.028*"speaker" + 0.028*"gift"')
(1, '0.063*"alexa" + 0.032*"thing" + 0.032*"ordered" + 0.032*"im" + 0.022*"one" + 0.022*"also" + 0.022*"like" + 0.022*"play" + 0.022*"night" + 0.022*"actually"')
(2, '0.053*"works" + 0.038*"one" + 0.038*"set" + 0.031*"alarm" + 0.030*"get" + 0.030*"alexa" + 0.024*"amazon" + 0.023*"price" + 0.023*"kids" + 0.023*"device"')
(3, '0.046*"great" + 0.037*"dot" + 0.037*"echo" + 0.027*"alexa" + 0.023*"well" + 0.019*"home" + 0.019*"connection" + 0.016*"product" + 0.015*"amazon" + 0.015*"little"')
(4, '0.036*"like" + 0.036*"good" + 0.036*"small" + 0.036*"assistant" + 0.036*"still" + 0.036*"love" + 0.036*"list" + 0.036*"dont" + 0.036*"addition" + 0.004*"get"')
(5, '0.075*"use" + 0.046*"easy" + 0.045*"fun" + 0.031*"product" + 0.031*"anything" + 0.031*"love" + 0.031*"time" + 0.031*"learning" + 0.031*"still" + 0.016*"son"')
(6, '0.036*"w

In [21]:
word_dict = {}
for i in range(NUM_TOPICS):
    words = ldamodel.show_topic(i, topn=20) # list of (word, probability) pairs
    word_dict['Topic # ' + '{:02d}'.format(i+1)] = [word for word, _ in words]
pd.DataFrame(word_dict)

| | Topic # 01 | Topic # 02 | Topic # 03 | Topic # 04 | Topic # 05 | Topic # 06 | Topic # 07 | Topic # 08 | Topic # 09 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | worth | alexa | works | great | like | use | would | alexa | love |
| 1 | sound | thing | one | dot | good | easy | good | use | much |
| 2 | music | ordered | set | echo | small | fun | alexa | bluetooth | great |
| 3 | bought | im | alarm | alexa | assistant | product | music | love | alexa |
| 4 | better | one | get | well | still | anything | great | laptop | echo |
| 5 | room | also | alexa | home | love | love | amazon | via | quality |
| 6 | could | like | amazon | connection | list | time | ask | cant | excellent |
| 7 | timer | play | price | product | dont | learning | link | phone | like |
| 8 | speaker | night | kids | amazon | addition | still | apps | mobile | isnt |
| 9 | gift | actually | device | little | get | son | make | every | also |


## Visualization using pyLDAvis


pyLDAvis is designed to help users interpret the topics in a topic model that has been fit to a corpus of text data, by showing them visually. Let’s see ours.

In [22]:
import pyLDAvis.gensim # To visualise the LDA model; in pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models


In [23]:
lda_display = pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary, sort_topics=False)
pyLDAvis.display(lda_display)




Each bubble in the left-hand plot represents a topic. The larger the bubble, the more prevalent that topic is.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of clustered in one quadrant.

A model with too many topics will typically have many small, overlapping bubbles clustered in one region of the chart.

If you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.

In [24]:
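# Compute the per-word likelihood bound (this is what gensim's log_perplexity returns)
print('\nPerplexity: ', ldamodel.log_perplexity(corpus))  # a log-scale bound, not the perplexity itself: higher (closer to zero) is better, and perplexity = 2 ** (-bound), where lower is better.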




Perplexity:  -5.805102263178053
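
The number printed above is a per-word likelihood bound on a log scale rather than the perplexity itself. gensim's training log reports perplexity as 2 raised to the negative of this bound, so the conversion can be sketched as follows (the value is copied from the output above; the variable names are assumptions).

```python
# Value printed by ldamodel.log_perplexity(corpus) in the run above
per_word_bound = -5.805102263178053

# Convert the log-scale bound to a perplexity estimate
perplexity = 2 ** (-per_word_bound)
print(round(perplexity, 1))
```

This gives a perplexity of roughly 56 for this run. When comparing models on the same corpus, a lower perplexity (equivalently, a bound closer to zero) indicates a better fit.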
