# Topic Analysis of Review Data 

## DESCRIPTION

Help a leading mobile brand understand the voice of the customer by analyzing the reviews of their product on Amazon and the topics that customers are talking about. You will perform topic modeling on specific parts of speech and then interpret the emerging topics.

## Problem Statement: 

A popular mobile phone brand, Lenovo, has launched its budget smartphone in the Indian market. The client wants to understand the voice of the customer (VOC) on the product. This will be useful not just to evaluate the current product but also to get some direction for developing the product pipeline. The client is particularly interested in the different aspects that customers care about. Product reviews by customers on a leading e-commerce site should provide a good view.

## Domain:

Amazon reviews for a leading phone brand

## Analysis to be done:

POS tagging, topic modeling using LDA, and topic interpretation

## Content: 

## Dataset:

‘K8 Reviews v0.2.csv’

## Columns:

* Sentiment: The sentiment of the review (4- and 5-star reviews are positive; 1- and 2-star reviews are negative)

* Reviews: The main text of the review

## Steps to perform:

* Discover the topics in the reviews and present them to the business in a consumable format. Employ techniques in syntactic processing and topic modeling.

* Perform specific cleanup, apply POS tagging, and restrict the text to relevant POS tags; then perform topic modeling using LDA. Finally, give business-friendly names to the topics and make a table for the business.

## Tasks: 

1. Read the .csv file using Pandas. Take a look at the top few records.

2. Normalize casings for the review text and extract the text into a list for easier manipulation.

3. Tokenize the reviews using NLTK's word_tokenize function.

4. Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

5. For the topic model, we want to include only nouns.

6. Find out all the POS tags that correspond to nouns.

7. Limit the data to only terms with these tags.

8. Lemmatize. 

9. Different forms of the terms need to be treated as one.

10. There is no need to provide a POS tag to the lemmatizer for now.

11. Remove stopwords and punctuation (if there are any). 

12. Create a topic model using LDA on the cleaned-up data with 12 topics.

13. Print out the top terms for each topic.

14. What is the coherence of the model with the c_v metric?

15. Analyze the topics through the business lens.

16. Determine which of the topics can be combined.

17. Create a topic model using LDA with what you think is the optimal number of topics.

18. What is the coherence of the model?

19. The business should be able to interpret the topics.

20. Name each of the identified topics.

21. Create a table with the topic name and the top 10 terms in each to present to the business.

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import nltk

In [2]:
%matplotlib inline

In [3]:
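# Opens the interactive downloader. This session fetches the full 'all' collection,
# although the cells below only use the tokenizer, POS tagger, WordNet, and stopword data.
nltk.download()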

NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> d

Download which package (l=list; x=cancel)?
  Identifier> all
    Downloading collection 'all'
       | 
       | Downloading package abc to /root/nltk_data...
       |   Unzipping corpora/abc.zip.
       | Downloading package alpino to /root/nltk_data...
       |   Unzipping corpora/alpino.zip.
       | Downloading package biocreative_ppi to /root/nltk_data...
       |   Unzipping corpora/biocreative_ppi.zip.
       | Downloading package brown to /root/nltk_data...
       |   Unzipping corpora/brown.zip.
       | Downloading package brown_tei to /root/nltk_data...
       |   Unzipping corpora/brown_tei.zip.
       | Downloading package cess_cat to /root/nltk_data...
       |   Unzipping corpora/cess_cat.zip.
       | Downloading package

True

In [4]:
from google.colab import drive
drive.mount("./gdrive")

Mounted at ./gdrive


1. Read the .csv file using Pandas. Take a look at the top few records.


In [5]:
path = "./gdrive/MyDrive/datasets/Topic Analysis of Review Data/K8 Reviews v0.2.csv"

In [6]:
df = pd.read_csv(path)
df.head()

Unnamed: 0,sentiment,review
0,1,Good but need updates and improvements
1,0,"Worst mobile i have bought ever, Battery is dr..."
2,1,when I will get my 10% cash back.... its alrea...
3,1,Good
4,0,The worst phone everThey have changed the last...


2. Normalize casings for the review text and extract the text into a list for easier manipulation.

In [7]:
df["review"] = df["review"].str.lower()
df.head()

Unnamed: 0,sentiment,review
0,1,good but need updates and improvements
1,0,"worst mobile i have bought ever, battery is dr..."
2,1,when i will get my 10% cash back.... its alrea...
3,1,good
4,0,the worst phone everthey have changed the last...


3. Tokenize the reviews using NLTK's word_tokenize function.


In [8]:
from nltk.tokenize import sent_tokenize, word_tokenize

In [9]:
df["word_tokens"] = df["review"].apply(lambda x: word_tokenize(x))
df["sent_tokens"] = df["review"].apply(lambda x: sent_tokenize(x))
df.head(10)

Unnamed: 0,sentiment,review,word_tokens,sent_tokens
0,1,good but need updates and improvements,"[good, but, need, updates, and, improvements]",[good but need updates and improvements]
1,0,"worst mobile i have bought ever, battery is dr...","[worst, mobile, i, have, bought, ever, ,, batt...","[worst mobile i have bought ever, battery is d..."
2,1,when i will get my 10% cash back.... its alrea...,"[when, i, will, get, my, 10, %, cash, back, .....",[when i will get my 10% cash back.... its alre...
3,1,good,[good],[good]
4,0,the worst phone everthey have changed the last...,"[the, worst, phone, everthey, have, changed, t...",[the worst phone everthey have changed the las...
5,0,only i'm telling don't buyi'm totally disappoi...,"[only, i, 'm, telling, do, n't, buyi, 'm, tota...",[only i'm telling don't buyi'm totally disappo...
6,1,"phone is awesome. but while charging, it heats...","[phone, is, awesome, ., but, while, charging, ...","[phone is awesome., but while charging, it hea..."
7,0,the battery level has worn down,"[the, battery, level, has, worn, down]",[the battery level has worn down]
8,0,it's over hitting problems...and phone hanging...,"[it, 's, over, hitting, problems, ..., and, ph...",[it's over hitting problems...and phone hangin...
9,0,a lot of glitches dont buy this thing better g...,"[a, lot, of, glitches, dont, buy, this, thing,...",[a lot of glitches dont buy this thing better ...


4. Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

In [10]:
df["word_tags"] = df["word_tokens"].apply(lambda x: nltk.pos_tag(x))
df.head(10)

Unnamed: 0,sentiment,review,word_tokens,sent_tokens,word_tags
0,1,good but need updates and improvements,"[good, but, need, updates, and, improvements]",[good but need updates and improvements],"[(good, JJ), (but, CC), (need, VBP), (updates,..."
1,0,"worst mobile i have bought ever, battery is dr...","[worst, mobile, i, have, bought, ever, ,, batt...","[worst mobile i have bought ever, battery is d...","[(worst, JJS), (mobile, NN), (i, NN), (have, V..."
2,1,when i will get my 10% cash back.... its alrea...,"[when, i, will, get, my, 10, %, cash, back, .....",[when i will get my 10% cash back.... its alre...,"[(when, WRB), (i, NN), (will, MD), (get, VB), ..."
3,1,good,[good],[good],"[(good, JJ)]"
4,0,the worst phone everthey have changed the last...,"[the, worst, phone, everthey, have, changed, t...",[the worst phone everthey have changed the las...,"[(the, DT), (worst, JJS), (phone, NN), (everth..."
5,0,only i'm telling don't buyi'm totally disappoi...,"[only, i, 'm, telling, do, n't, buyi, 'm, tota...",[only i'm telling don't buyi'm totally disappo...,"[(only, RB), (i, JJ), ('m, VBP), (telling, VBG..."
6,1,"phone is awesome. but while charging, it heats...","[phone, is, awesome, ., but, while, charging, ...","[phone is awesome., but while charging, it hea...","[(phone, NN), (is, VBZ), (awesome, JJ), (., .)..."
7,0,the battery level has worn down,"[the, battery, level, has, worn, down]",[the battery level has worn down],"[(the, DT), (battery, NN), (level, NN), (has, ..."
8,0,it's over hitting problems...and phone hanging...,"[it, 's, over, hitting, problems, ..., and, ph...",[it's over hitting problems...and phone hangin...,"[(it, PRP), ('s, VBZ), (over, IN), (hitting, V..."
9,0,a lot of glitches dont buy this thing better g...,"[a, lot, of, glitches, dont, buy, this, thing,...",[a lot of glitches dont buy this thing better ...,"[(a, DT), (lot, NN), (of, IN), (glitches, NNS)..."


5. For the topic model, we want to include only nouns.


6. Find out all the POS tags that correspond to nouns.

7. Limit the data to only terms with these tags.



In [11]:
noun_tags = ["NN", "NNP", "NNS", "NNPS"]

In [12]:
df["word_tags"]

0        [(good, JJ), (but, CC), (need, VBP), (updates,...
1        [(worst, JJS), (mobile, NN), (i, NN), (have, V...
2        [(when, WRB), (i, NN), (will, MD), (get, VB), ...
3                                             [(good, JJ)]
4        [(the, DT), (worst, JJS), (phone, NN), (everth...
                               ...                        
14670    [(i, RB), (really, RB), (like, IN), (the, DT),...
14671    [(the, DT), (lenovo, JJ), (k8, NN), (note, NN)...
14672    [(awesome, JJ), (gaget.., NN), (@, NN), (this,...
14673    [(this, DT), (phone, NN), (is, VBZ), (nice, JJ...
14674    [(good, JJ), (product, NN), (but, CC), (the, D...
Name: word_tags, Length: 14675, dtype: object

In [13]:
df["only_nouns"] = df["word_tags"].apply(lambda x: [i[0] for i in x if i[1] in noun_tags])
df.head(10)

Unnamed: 0,sentiment,review,word_tokens,sent_tokens,word_tags,only_nouns
0,1,good but need updates and improvements,"[good, but, need, updates, and, improvements]",[good but need updates and improvements],"[(good, JJ), (but, CC), (need, VBP), (updates,...","[updates, improvements]"
1,0,"worst mobile i have bought ever, battery is dr...","[worst, mobile, i, have, bought, ever, ,, batt...","[worst mobile i have bought ever, battery is d...","[(worst, JJS), (mobile, NN), (i, NN), (have, V...","[mobile, i, battery, hell, backup, hours, uses..."
2,1,when i will get my 10% cash back.... its alrea...,"[when, i, will, get, my, 10, %, cash, back, .....",[when i will get my 10% cash back.... its alre...,"[(when, WRB), (i, NN), (will, MD), (get, VB), ...","[i, %, cash, january..]"
3,1,good,[good],[good],"[(good, JJ)]",[]
4,0,the worst phone everthey have changed the last...,"[the, worst, phone, everthey, have, changed, t...",[the worst phone everthey have changed the las...,"[(the, DT), (worst, JJS), (phone, NN), (everth...","[phone, everthey, phone, problem, amazon, phon..."
5,0,only i'm telling don't buyi'm totally disappoi...,"[only, i, 'm, telling, do, n't, buyi, 'm, tota...",[only i'm telling don't buyi'm totally disappo...,"[(only, RB), (i, JJ), ('m, VBP), (telling, VBG...","[camerawaste, money]"
6,1,"phone is awesome. but while charging, it heats...","[phone, is, awesome, ., but, while, charging, ...","[phone is awesome., but while charging, it hea...","[(phone, NN), (is, VBZ), (awesome, JJ), (., .)...","[phone, reason, k8]"
7,0,the battery level has worn down,"[the, battery, level, has, worn, down]",[the battery level has worn down],"[(the, DT), (battery, NN), (level, NN), (has, ...","[battery, level]"
8,0,it's over hitting problems...and phone hanging...,"[it, 's, over, hitting, problems, ..., and, ph...",[it's over hitting problems...and phone hangin...,"[(it, PRP), ('s, VBZ), (over, IN), (hitting, V...","[problems, phone, hanging, problems, note, sta..."
9,0,a lot of glitches dont buy this thing better g...,"[a, lot, of, glitches, dont, buy, this, thing,...",[a lot of glitches dont buy this thing better ...,"[(a, DT), (lot, NN), (of, IN), (glitches, NNS)...","[lot, glitches, thing, options]"
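
The filter in cell 13 relies on the four Penn Treebank noun tags. As a minimal sketch, the same selection can be written as a small reusable function and checked on a hand-written tagged sentence. The function name `keep_nouns` is hypothetical, and the tags for the tokens that are truncated in the output above are assumed for illustration rather than copied from the tagger.

```python
# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def keep_nouns(tagged_tokens):
    """Return the tokens whose POS tag is one of the noun tags."""
    return [token for token, tag in tagged_tokens if tag in NOUN_TAGS]


# Hand-written sample in the (token, tag) format produced by nltk.pos_tag.
# The tags for "updates", "and", and "improvements" are assumed for illustration.
sample = [
    ("good", "JJ"), ("but", "CC"), ("need", "VBP"),
    ("updates", "NNS"), ("and", "CC"), ("improvements", "NNS"),
]
nouns = keep_nouns(sample)
print(nouns)
```

Note that the tagger output above labels the lowercased pronoun `i` as `NN` in several reviews, so a tag-based filter alone lets it through; it has to be removed in the stopword step.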


8. Lemmatize


9. Different forms of the terms need to be treated as one.


10. There is no need to provide a POS tag to the lemmatizer for now.


11. Remove stopwords and punctuation (if there are any). 


In [14]:
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from string import punctuation

In [15]:
stop_words = stopwords.words("english")
lemmatizer = WordNetLemmatizer()

In [16]:
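# Displays the nouns with stopwords and single punctuation tokens removed.
# The lemma is used only for the stopword check, so tokens keep their original form,
# and the result is not assigned back: df["only_nouns"] is unchanged for later cells.
df["only_nouns"].apply(lambda x: [w for w in x if (lemmatizer.lemmatize(w) not in stop_words) and (w not in punctuation)])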

0                                  [updates, improvements]
1        [mobile, battery, hell, backup, hours, uses, i...
2                                        [cash, january..]
3                                                       []
4        [phone, everthey, phone, problem, amazon, phon...
                               ...                        
14670                   [phone, everything, whater, phone]
14671    [k8, note, pictures, camera, body, bit, hand, ...
14672                                     [gaget.., price]
14673                     [phone, processing, camera, mod]
14674                                  [product, pakeging]
Name: only_nouns, Length: 14675, dtype: object
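
The output above still contains unlemmatized forms such as `updates` and tokens with trailing punctuation such as `january..`. One possible way to cover all of tasks 8 to 11 in a single pass is sketched below. This is an assumption about how the cleanup could be done, not the notebook's method: `clean_tokens`, `toy_lemmatize`, and the three-word stopword set are hypothetical stand-ins so that the example runs without NLTK data.

```python
from string import punctuation


def clean_tokens(tokens, lemmatize, stop_words):
    """Strip edge punctuation, lemmatize, and drop stopwords and empty tokens."""
    cleaned = []
    for token in tokens:
        token = token.strip(punctuation)  # "january.." -> "january", "%" -> ""
        if not token:
            continue
        lemma = lemmatize(token)
        if lemma not in stop_words:
            cleaned.append(lemma)
    return cleaned


def toy_lemmatize(word):
    """Hypothetical stand-in for a real lemmatizer: drops a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


toy_stop_words = {"i", "the", "a"}  # assumed, far smaller than NLTK's list

row = ["i", "%", "cash", "january..", "updates"]
result = clean_tokens(row, toy_lemmatize, toy_stop_words)
print(result)
```

With the objects from cell 15, the call would look like `df["only_nouns"].apply(lambda toks: clean_tokens(toks, lemmatizer.lemmatize, set(stop_words)))`, with the result assigned to a column so that later cells can use it.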

12. Create a topic model using LDA on the cleaned-up data with 12 topics.


In [17]:
from gensim.models.ldamodel import LdaModel
from gensim import corpora

In [18]:
id2word = corpora.Dictionary(df["only_nouns"])

In [19]:
corpus = [id2word.doc2bow(x) for x in df["only_nouns"]]
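
For readers unfamiliar with the bag-of-words format, the following pure-Python sketch shows the idea behind `Dictionary` and `doc2bow`: each distinct token gets an integer id, and each document becomes a list of `(id, count)` pairs. It illustrates the concept only and is not gensim's implementation; the function names are hypothetical, and the ids gensim assigns may differ.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Convert a token list into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


docs = [["battery", "level"], ["phone", "battery", "phone"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(vocab)
print(bows)
```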

In [20]:
lda = LdaModel(corpus=corpus, 
               id2word=id2word,
               num_topics=12)

In [21]:
from pprint import pprint

13. Print out the top terms for each topic.


In [22]:
pprint(lda.print_topics())

[(0,
  '0.039*"screen" + 0.032*"call" + 0.023*"phone" + 0.021*"cast" + '
  '0.020*"glass" + 0.017*"option" + 0.014*"excellent" + 0.014*"i" + '
  '0.012*"feature" + 0.010*"gorilla"'),
 (1,
  '0.183*"product" + 0.038*"charger" + 0.038*"battery" + 0.029*"phone" + '
  '0.026*"i" + 0.026*"time" + 0.020*"handset" + 0.019*"hours" + 0.018*"turbo" '
  '+ 0.016*"delivery"'),
 (2,
  '0.193*"mobile" + 0.074*"money" + 0.037*"waste" + 0.029*"value" + 0.027*"i" '
  '+ 0.022*"product" + 0.014*"phone" + 0.011*"notification" + 0.011*"battery" '
  '+ 0.010*"features"'),
 (3,
  '0.122*"problem" + 0.072*"i" + 0.054*"phone" + 0.050*"heating" + '
  '0.022*"battery" + 0.017*"issues" + 0.016*"screen" + 0.012*"days" + '
  '0.011*"use" + 0.010*"image"'),
 (4,
  '0.087*"note" + 0.054*"k8" + 0.051*"i" + 0.032*"lenovo" + 0.025*"phone" + '
  '0.019*"update" + 0.018*"sim" + 0.017*"battery" + 0.016*"jio" + '
  '0.014*"time"'),
 (5,
  '0.106*"battery" + 0.070*"phone" + 0.028*"life" + 0.022*"usage" + '
  '0.017*"process

In [41]:
lda_doc = lda[corpus]
lda_doc[2]

[(0, 0.016666818),
 (1, 0.016667465),
 (2, 0.016666764),
 (3, 0.0166667),
 (4, 0.01666678),
 (5, 0.016666759),
 (6, 0.46845302),
 (7, 0.36487824),
 (8, 0.016666677),
 (9, 0.016667234),
 (10, 0.016666904),
 (11, 0.01666669)]
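
Review 2 is split mainly between topics 6 and 7, while the remaining topics share a small, nearly uniform probability. A short sketch of picking the dominant topic from such a distribution follows; the values are copied from the output above, and `dominant_topic` is a hypothetical helper name.

```python
# Topic distribution for review 2, copied from the output above.
doc_topics = [
    (0, 0.016666818), (1, 0.016667465), (2, 0.016666764), (3, 0.0166667),
    (4, 0.01666678), (5, 0.016666759), (6, 0.46845302), (7, 0.36487824),
    (8, 0.016666677), (9, 0.016667234), (10, 0.016666904), (11, 0.01666669),
]


def dominant_topic(topic_probs):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_probs, key=lambda pair: pair[1])


topic_id, prob = dominant_topic(doc_topics)
print(topic_id, round(prob, 3))
```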

In [24]:
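# log_perplexity returns the per-word likelihood bound (a log-scale value), not the perplexity itself.
print('Perplexity: ', lda.log_perplexity(corpus))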

Perplexity:  -7.3590376019000825
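
The negative number above is a per-word likelihood bound rather than a perplexity. gensim's own log messages convert the bound with base 2, so the value on the usual perplexity scale can be recovered as sketched here, using the figure recorded above.

```python
# Per-word likelihood bound printed by lda.log_perplexity(corpus) above.
per_word_bound = -7.3590376019000825

# Perplexity on the usual scale: 2 raised to the negative bound (lower is better).
perplexity = 2 ** (-per_word_bound)
print(f"Perplexity: {perplexity:.1f}")
```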


14. What is the coherence of the model with the c_v metric?


In [25]:
from gensim.models import CoherenceModel

In [26]:
coherence = CoherenceModel(model=lda, texts=df["only_nouns"], dictionary=id2word, coherence='c_v')
coherence = coherence.get_coherence()
print('Coherence Score: ', coherence)

Coherence Score:  0.5070463723638798


17. Create a topic model using LDA with what you think is the optimal number of topics.


In [27]:
lda_6_topics = LdaModel(corpus=corpus, 
                        id2word=id2word,
                        num_topics=6)

In [29]:
pprint(lda_6_topics.print_topics())

[(0,
  '0.103*"product" + 0.100*"camera" + 0.047*"quality" + 0.029*"money" + '
  '0.016*"features" + 0.015*"mode" + 0.014*"battery" + 0.013*"value" + '
  '0.013*"waste" + 0.012*"price"'),
 (1,
  '0.031*"phone" + 0.024*"call" + 0.021*"network" + 0.018*"speaker" + '
  '0.015*"time" + 0.014*"i" + 0.013*"camera" + 0.012*"app" + 0.011*"screen" + '
  '0.011*"issues"'),
 (2,
  '0.062*"problem" + 0.031*"heating" + 0.025*"phone" + 0.020*"battery" + '
  '0.018*"hai" + 0.014*"problems" + 0.012*"time" + 0.011*"lenovo" + '
  '0.010*"handset" + 0.009*"%"'),
 (3,
  '0.083*"phone" + 0.061*"i" + 0.031*"note" + 0.023*"lenovo" + 0.021*"issue" + '
  '0.019*"amazon" + 0.017*"service" + 0.016*"k8" + 0.011*"product" + '
  '0.010*"network"'),
 (4,
  '0.158*"phone" + 0.088*"battery" + 0.036*"camera" + 0.024*"price" + '
  '0.017*"backup" + 0.017*"performance" + 0.013*"i" + 0.013*"charger" + '
  '0.012*"quality" + 0.011*"life"'),
 (5,
  '0.107*"mobile" + 0.063*"i" + 0.021*"device" + 0.018*"note" + '
  '0.016*"pr

18. What is the coherence of the model?


In [28]:
coherence = CoherenceModel(model=lda_6_topics, texts=df["only_nouns"], dictionary=id2word, coherence='c_v')
coherence = coherence.get_coherence()
print('Coherence Score: ', coherence)

Coherence Score:  0.5396455824193315
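
The two recorded runs give a c_v coherence of about 0.507 for 12 topics and 0.540 for 6 topics. To compare more candidates systematically, a loop such as the one sketched below could be used. `best_num_topics` and `score_fn` are hypothetical names: `score_fn` stands for any callable that trains a model with `k` topics and returns its coherence. Here it is stubbed with the two recorded scores so that the example runs without gensim.

```python
def best_num_topics(candidates, score_fn):
    """Score each candidate topic count and return (best_k, all_scores)."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Stub scorer built from the two coherence values recorded in this notebook.
# In practice score_fn would train LdaModel(num_topics=k) and return
# CoherenceModel(..., coherence="c_v").get_coherence().
recorded = {12: 0.5070463723638798, 6: 0.5396455824193315}

best_k, scores = best_num_topics([6, 12], recorded.get)
print(best_k, scores)
```

Because LDA training is stochastic, passing a fixed `random_state` to `LdaModel` makes scores from different runs easier to compare.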


19. The business should be able to interpret the topics.
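
As a starting point for the business table, the top terms can be pulled out of a `print_topics()` string and paired with a name. The sketch below uses topic 0 of the 6-topic model, copied from the output above. The topic name is a hypothetical example of a business-friendly label, not a result of the analysis, and `top_terms` is a hypothetical helper.

```python
import re

# Topic 0 of the 6-topic model, copied from the print_topics() output above.
topic_string = (
    '0.103*"product" + 0.100*"camera" + 0.047*"quality" + 0.029*"money" + '
    '0.016*"features" + 0.015*"mode" + 0.014*"battery" + 0.013*"value" + '
    '0.013*"waste" + 0.012*"price"'
)


def top_terms(topic_str):
    """Extract the quoted terms from a print_topics()-style string, in order."""
    return re.findall(r'"([^"]+)"', topic_str)


# Hypothetical business-friendly name; choosing it is a judgment call.
topic_names = {0: "Camera quality and value for money"}

terms = top_terms(topic_string)
print(f"{topic_names[0]} | {', '.join(terms)}")
```

When the model object is available, `lda_6_topics.show_topic(topic_id, topn=10)` returns the `(term, weight)` pairs directly, which avoids parsing strings.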
