# Text Mining Online Amazon Reviews using Topic Modeling

Our question: How can we analyze a large number of text documents, such as online product reviews, using NLP?

An example on Amazon review data:
Ratings alone do not give a complete picture of the products we wish to purchase. Online product reviews fill this gap and are a great source of information for consumers. From the seller's point of view, online reviews can be used to gauge consumers' feedback on the products or services they are selling. However, since these reviews are often overwhelming in number and in the amount of information they contain, we need an intelligent system that helps both consumers and sellers. This system will serve two purposes:

1. Enable consumers to quickly extract the key topics covered by the reviews without having to go through all of them
2. Help the sellers get consumer feedback in the form of topics, extracted from the consumer reviews

To solve this problem, we will use the concept of Topic Modeling using Latent Dirichlet Allocation (LDA) on Amazon review data. You can download it from this website: 

### What is Topic Modeling?

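Topic modeling is an unsupervised technique that discovers the abstract topics occurring in a collection of documents. It is one of the most popular NLP techniques, with several real-world applications such as dimensionality reduction, text summarization, and recommendation engines.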

### Why should we use Topic Modeling?

Topic modeling tries to automatically identify the topics present in a text object, such as a document, and to uncover hidden patterns in a text corpus. Topic modeling can be used for multiple purposes, including:

- Document clustering
- Organizing large blocks of textual data
- Information retrieval from unstructured text
- Feature selection

Our aim here is to extract a small number of groups of important words from the reviews. These groups of words are the topics, and they help us ascertain what the consumers are actually talking about in the reviews. The purpose of this notebook is to demonstrate the application of LDA to raw, crowd-generated text data.
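To make the idea of "topics as groups of words" more concrete, here is a minimal sketch of the generative view that LDA assumes: each document has its own mixture over topics, and each word is produced by first picking a topic and then picking a word from that topic. The topics, their word probabilities, and the helper names below are invented for illustration and are not learned from the review data. For simplicity the topic-word distributions are fixed here, whereas LDA treats them as random as well.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, length, rng):
    """Generate a toy document: pick a topic per word, then a word from that topic."""
    theta = sample_dirichlet(alpha, rng)  # document-specific topic mixture
    names = list(topics)
    words = []
    for _ in range(length):
        topic = rng.choices(names, weights=theta)[0]
        vocab = list(topics[topic])
        probs = list(topics[topic].values())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics, made up for this sketch (not taken from the model output)
toy_topics = {
    "kids": {"toddler": 0.4, "song": 0.3, "monkey": 0.3},
    "quality": {"great": 0.5, "easy": 0.3, "worth": 0.2},
}

rng = random.Random(0)
theta, doc = generate_document(toy_topics, alpha=[0.5, 0.5], length=8, rng=rng)
print(theta)  # topic proportions for this document, summing to 1
print(doc)    # eight words drawn from the two toy topics
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the per-topic word probabilities.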

Let's first load all the necessary libraries:

In [44]:
import nltk
from nltk import FreqDist
nltk.download('stopwords')
from nltk.corpus import stopwords
stop = stopwords.words('english')
from textblob import Word


import pandas as pd
import numpy as np
import datapreprocessing as dp

import re
import spacy
import gensim
from gensim import corpora

# libraries for visualization
#import pyLDAvis
#import pyLDAvis.gensim
import matplotlib.pyplot as plt
import seaborn as sns

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/dimitriwilhelm/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Reading the data

In [11]:
df_Apps = dp.getDF('Apps_for_Android_5.json')
df_Apps.head()

| Unnamed: 0 | reviewerID | asin | reviewerName | helpful | reviewText | overall | summary | unixReviewTime | reviewTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | A1N4O8VOJZTDVB | B004A9SDD8 | Annette Yancey | [1, 1] | Loves the song, so he really couldn't wait to ... | 3.0 | Really cute | 1383350400 | 11 2, 2013 |
| 1 | A2HQWU6HUKIEC7 | B004A9SDD8 | Audiobook lover "Kathy" | [0, 0] | Oh, how my little grandson loves this app. He'... | 5.0 | 2-year-old loves it | 1323043200 | 12 5, 2011 |
| 2 | A1SXASF6GYG96I | B004A9SDD8 | Barbara Gibbs | [0, 0] | I found this at a perfect time since my daught... | 5.0 | Fun game | 1337558400 | 05 21, 2012 |
| 3 | A2B54P9ZDYH167 | B004A9SDD8 | Brooke Greenstreet "Babylove" | [3, 4] | My 1 year old goes back to this game over and ... | 5.0 | We love our Monkeys! | 1354752000 | 12 6, 2012 |
| 4 | AFOFZDTX5UC6D | B004A9SDD8 | C. Galindo | [1, 1] | There are three different versions of the song... | 5.0 | This is my granddaughters favorite app on my K... | 1391212800 | 02 1, 2014 |


Our data contains the following columns:

- reviewerID
- asin
- reviewerName
- helpful
- reviewText
- overall
- summary
- unixReviewTime
- reviewTime

For our analysis, we create a new dataframe that contains only the review text column.

In [19]:
df_review = pd.DataFrame(df_Apps.reviewText)

In [41]:
# function to plot most frequent terms
def freq_words(x, terms=30):
    # x is an iterable of review strings, e.g. df_review['reviewText']
    all_words = " ".join(x).split()

    # FreqDist maps each word to its count; the accessor is keys(), not key()
    fdist = FreqDist(all_words)
    words_df = pd.DataFrame({'word': list(fdist.keys()), 'count': list(fdist.values())})

    # selecting the `terms` most frequent words
    d = words_df.nlargest(columns="count", n=terms)
    plt.figure(figsize=(12, 10))
    sns.barplot(x="word", y="count", data=d)
    plt.show()
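
The counting step can also be checked without any plotting. NLTK's `FreqDist` is a subclass of `collections.Counter`, so the following sketch reproduces the same word counts with the standard library only. The function name `top_terms` and the toy reviews are made up for illustration.

```python
from collections import Counter

def top_terms(reviews, terms=2):
    """Return the `terms` most frequent words across an iterable of review strings."""
    counts = Counter(" ".join(reviews).split())
    return counts.most_common(terms)

# Toy reviews, invented for this sketch
toy_reviews = ["love song love play", "little grandson love", "song"]
print(top_terms(toy_reviews))  # [('love', 3), ('song', 2)]
```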

## Data Preprocessing

In [26]:
# function to clean our data
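def clean_data(data):
    """
    Clean the 'reviewText' column of `data` in place: lower-case the text,
    strip non-letter characters, remove stop words and the most frequent word,
    lemmatize, and drop short words.
    """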
    # Lower case
    # Transform our review into lower case. This avoids having multiple copies of the same words
    data['reviewText'] = data['reviewText'].apply(
        lambda x: " ".join( x.lower() for x in x.split() )
    )
    
    # Removing Punctuation, Numbers and Special Characters
    # They add no useful information for this analysis, so dropping them reduces the size of the data
    # regex=True is required: newer pandas versions treat the pattern as a literal string by default
    data['reviewText'] = data['reviewText'].str.replace('[^a-zA-Z#]', ' ', regex=True)
    
    # Removal of Stop Words, i.e. words that occur commonly in the language in general
    # We use the predefined English stop word list from nltk for this
    data['reviewText'] = data['reviewText'].apply(
        lambda x: " ".join( x for x in x.split() if x not in stop)
    )
    
    # Removing commonly occurring words from our text data
    # value_counts() sorts by frequency, so [:1] selects the single most frequent word
    freq = pd.Series(" ".join( data['reviewText'] ).split()).value_counts()[:1]
    # Let's remove it, as its presence will not be of any use in analyzing our text data
    freq = list(freq.index)
    data['reviewText'] = data['reviewText'].apply(
        lambda x: " ".join( x for x in x.split() if x not in freq)
    )
    
    # Remove rare words (disabled)
    # Let's check the 10 most rarely occurring words in our text data
    #freq_1 = pd.Series(" ".join( data['reviewText'] ).split()).value_counts()[-10:]
    # Let's remove these words, as their presence will not be of any use in analyzing our text data
    #freq_1 = list(freq_1.index)
    #data['reviewText'] = data['reviewText'].apply(
    #    lambda x: " ".join(x for x in x.split() if x not in freq_1)
    #)
    
    # Stemming (disabled), i.e. removing suffixes like "ing", "ly", etc. with a simple rule-based approach.
    # For this purpose, we would use PorterStemmer from the NLTK library (from nltk.stem import PorterStemmer)
    #st = PorterStemmer()
    #data['reviewText'] = data['reviewText'].apply(
    #    lambda x: " ".join([ st.stem(word) for word in x.split() ])
    #)
   
    # Lemmatization
    # Lemmatization is more effective than stemming because it converts the word into its root word,
    # rather than just stripping the suffixes. We usually prefer lemmatization over stemming.
    data['reviewText'] = data['reviewText'].apply(
        lambda x: " ".join([ Word(word).lemmatize() for word in x.split() ])
    )
    
    # Remove short words (keep only words longer than 3 characters)
    data['reviewText'] = data['reviewText'].apply(
        lambda x: " ".join([w for w in x.split() if len(w) > 3])
    )
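
To see what these steps do to a single review, here is a minimal standard-library sketch of the same pipeline applied to one string. It is a simplified stand-in, not the notebook's function: the tiny stop word set is an assumption (the notebook uses NLTK's full English list), and the frequent-word removal and lemmatization steps are left out.

```python
import re

# Tiny stand-in stop word list (an assumption; NLTK's English list is much longer)
TOY_STOP_WORDS = {"the", "this", "is", "a", "and", "it", "my", "so", "he", "to"}

def clean_review(text, stop_words=TOY_STOP_WORDS, min_len=4):
    """Lower-case, strip non-letters, drop stop words and words shorter than min_len."""
    text = text.lower()
    text = re.sub(r"[^a-z#]", " ", text)  # same character class as in clean_data
    tokens = [t for t in text.split() if t not in stop_words]
    tokens = [t for t in tokens if len(t) >= min_len]
    return " ".join(tokens)

cleaned = clean_review("Loves the song, so he really couldn't wait to play it 2 times!")
print(cleaned)  # loves song really couldn wait play times
```

Note how the apostrophe in "couldn't" splits the word into "couldn" and "t"; the short fragment is then dropped by the length filter.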

In [27]:
clean_data(df_review)

In [45]:
freq_words(df_review['reviewText'])

## Building an LDA model

We will start by creating the term dictionary of our corpus, in which every unique term is assigned an index. Note that the cells below build the dictionary and the model from the first 10 reviews only.

In [55]:
df_review.reviewText = df_review.reviewText.apply(lambda x: x.split())

In [67]:
df_review.reviewText[:10]

0    [love, song, really, wait, play, little, inter...
1    [little, grandson, love, always, asking, monke...
2    [found, perfect, time, since, daughter, favori...
3    [year, back, simple, easy, toddler, even, caug...
4    [three, different, version, song, keep, occupi...
5    [cute, great, little, love, think, funny, kick...
6    [watch, great, grandson, week, hard, keep, mon...
7    [wild, crazy, little, love, singing, song, fiv...
8    [love, love, love, going, different, apps, cam...
9    [cute, alot, item, move, would, awesome, said,...
Name: reviewText, dtype: object

In [68]:
dictionary = corpora.Dictionary(df_review.reviewText[:10])

Next we convert the list of reviews into a Document Term Matrix using the dictionary prepared above.

In [69]:
doc_term_matrix = [dictionary.doc2bow(rev) for rev in df_review.reviewText[:10]]
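
Conceptually, the dictionary maps each unique token to an integer id, and the bag-of-words form of a review is a list of `(token_id, count)` pairs. The following sketch mimics that idea in plain Python. The helper names are hypothetical, and the ids are assigned in first-seen order, which is an illustrative choice and not necessarily the order gensim uses.

```python
from collections import Counter

def build_token_ids(documents):
    """Assign an integer id to every unique token, in first-seen order."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token for token in doc if token in token2id)
    return sorted((token2id[token], n) for token, n in counts.items())

# Toy tokenized reviews, invented for this sketch
docs = [["love", "song", "love", "play"], ["little", "grandson", "love"]]
token2id = build_token_ids(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)  # {'love': 0, 'song': 1, 'play': 2, 'little': 3, 'grandson': 4}
print(bows)      # [[(0, 2), (1, 1), (2, 1)], [(0, 1), (3, 1), (4, 1)]]
```

Word order is discarded in this representation; only how often each term occurs in a review is kept, which is all that LDA uses.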

In [70]:
# Creating the LDA model
LDA = gensim.models.ldamodel.LdaModel

# Build LDA model
lda_model = LDA(corpus=doc_term_matrix, id2word = dictionary, num_topics=5, random_state=1, chunksize=1000, passes=50)

Let's try it. We print out the topics that our LDA model has learned:

In [71]:
lda_model.print_topics()

[(0,
  '0.054*"little" + 0.037*"year" + 0.020*"singing" + 0.020*"five" + 0.020*"easy" + 0.020*"crazy" + 0.020*"country" + 0.020*"full" + 0.020*"included" + 0.020*"player"'),
 (1,
  '0.034*"monkey" + 0.023*"keep" + 0.023*"occupied" + 0.023*"click" + 0.023*"light" + 0.023*"ring" + 0.023*"great" + 0.014*"song" + 0.013*"going" + 0.013*"toddler"'),
 (2,
  '0.031*"cute" + 0.031*"item" + 0.031*"different" + 0.031*"move" + 0.031*"moved" + 0.031*"alot" + 0.031*"said" + 0.031*"awesome" + 0.031*"would" + 0.031*"voice"'),
 (3,
  '0.057*"love" + 0.057*"little" + 0.055*"song" + 0.046*"play" + 0.024*"really" + 0.024*"different" + 0.024*"cute" + 0.013*"operate" + 0.013*"variety" + 0.013*"highly"'),
 (4,
  '0.046*"monkey" + 0.031*"love" + 0.031*"grandson" + 0.031*"little" + 0.017*"great" + 0.017*"long" + 0.017*"thing" + 0.017*"five" + 0.017*"worth" + 0.017*"well"')]
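
Each topic is printed as a weighted list of words, where each weight is the probability of that word within the topic. To work with these values programmatically, the strings can be split into `(word, weight)` pairs. The sketch below does this for topic 3 from the output above; the helper `parse_topic` is hypothetical. gensim's `LdaModel.show_topic` returns such pairs directly, so parsing is only needed when just the printed text is available.

```python
import re

def parse_topic(topic_string):
    """Split a printed topic such as '0.057*"love" + ...' into (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# Topic 3 as printed by lda_model.print_topics() above
topic_3 = ('0.057*"love" + 0.057*"little" + 0.055*"song" + 0.046*"play" + '
           '0.024*"really" + 0.024*"different" + 0.024*"cute" + 0.013*"operate" + '
           '0.013*"variety" + 0.013*"highly"')

terms = parse_topic(topic_3)
print(terms[:3])  # [('love', 0.057), ('little', 0.057), ('song', 0.055)]

# Probability mass covered by the ten words shown (the rest of the vocabulary shares the remainder)
shown_mass = sum(weight for _, weight in terms)
```

Because the model was trained on only 10 reviews, many words share identical weights, so the topics should be read as a demonstration rather than as stable findings.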

## Topics Visualization

In [None]:
# Visualize the topics
#pyLDAvis.enable_notebook()
#vis = pyLDAvis.gensim.prepare(lda_model, doc_term_matrix, dictionary)
#vis

## Summary

Information retrieval saves us from the labor of going through product reviews one by one. It gives us a fair idea of what other consumers are saying about the product.

### Other Methods to Leverage Online Reviews

Apart from topic modeling, there are many other NLP methods for analyzing and understanding online reviews. Some ideas:

- Text Summarization: Summarize the reviews into a paragraph or a few bullet points
- Entity Recognition: Extract entities from the reviews and identify which products are most popular (or unpopular) among the consumers
- Identify Emerging Trends: Based on the timestamp of the reviews, new and emerging topics or entities can be identified. This would enable us to figure out which products are becoming popular and which are losing their grip on the market
- Sentiment Analysis: It tells us whether the reviews are positive, neutral or negative. For sellers/retailers, understanding the sentiment of the reviews can be helpful in improving their products and services.
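
As a small starting point for the trend idea, the `unixReviewTime` column can be turned into calendar periods and counted. The sketch below uses the five timestamps from the rows shown earlier and groups them by year; the function name is hypothetical, and counting reviews per year is only a first step, since real trend detection would compare topic or entity counts across periods.

```python
from collections import Counter
from datetime import datetime, timezone

def reviews_per_year(unix_times):
    """Count reviews per calendar year (UTC) from Unix timestamps in seconds."""
    years = (datetime.fromtimestamp(ts, tz=timezone.utc).year for ts in unix_times)
    return Counter(years)

# unixReviewTime values of the first five rows shown in the data preview
sample_times = [1383350400, 1323043200, 1337558400, 1354752000, 1391212800]
per_year = reviews_per_year(sample_times)
print(sorted(per_year.items()))  # [(2011, 1), (2012, 2), (2013, 1), (2014, 1)]
```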