## Review Project Analysis

Help a leading mobile brand understand the voice of the customer by analyzing the reviews of their product on Amazon and the topics that customers are talking about. You will perform topic modeling on specific parts of speech and then interpret the emerging topics.

Problem Statement: 

A popular mobile phone brand, Lenovo, has launched its budget smartphone in the Indian market. The client wants to understand the VOC (voice of the customer) on the product. This will be useful not just to evaluate the current product, but also to get some direction for developing the product pipeline. The client is particularly interested in the different aspects that customers care about. Product reviews by customers on a leading e-commerce site should provide a good view.

Domain: Amazon reviews for a leading phone brand

Analysis to be done: POS tagging, topic modeling using LDA, and topic interpretation

# 1. Read the .csv file using Pandas. Take a look at the top few records.

In [1]:
import nltk
import pandas as pd

#Library for tokenization
from nltk.tokenize import word_tokenize

#Library for Lemmatizer
from nltk.stem import WordNetLemmatizer

#Stop words, punctuation
from nltk.corpus import stopwords
from string import punctuation

#Library for Gensim -LDA Model
import gensim
from gensim import corpora
from gensim import models
from gensim.models import CoherenceModel

In [3]:
reviews=pd.read_csv(r"K8 Reviews v0.2.csv")

In [4]:
reviews.head()

Unnamed: 0,sentiment,review
0,1,Good but need updates and improvements
1,0,"Worst mobile i have bought ever, Battery is dr..."
2,1,when I will get my 10% cash back.... its alrea...
3,1,Good
4,0,The worst phone everThey have changed the last...


# 2. Normalize casings for the review text and extract the text into a list for easier manipulation.
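
Step 2 has no cell of its own in this notebook; the lower-casing happens inside the tokenization loop of step 3. As a minimal sketch of doing it separately, assuming the reviews are available as a plain sequence of strings (the helper name `normalize_reviews` and the sample data are illustrative, not part of the original notebook):

```python
def normalize_reviews(raw_reviews):
    """Lower-case every review and return the results as a plain list.

    Non-string entries (for example NaN from an empty CSV cell) are skipped,
    which is an assumption of this sketch; the notebook does not handle them.
    """
    return [text.lower() for text in raw_reviews if isinstance(text, str)]


# Stand-in for reviews.review.values from the cell above
sample_reviews = ["Good but need updates and improvements", "Good", float("nan")]
reviews_lower = normalize_reviews(sample_reviews)
print(reviews_lower)
```

Skipping entries shifts list positions relative to the `sentiment` column, so this form is only suitable when the labels are not needed afterwards.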

# 3. Tokenize the reviews using NLTK's word_tokenize function.

In [5]:
reviews_tokenized = []

for sent in reviews.review.values:
    reviews_tokenized.append(word_tokenize(sent.lower()))


In [6]:
reviews_tokenized

[['good', 'but', 'need', 'updates', 'and', 'improvements'],
 ['worst',
  'mobile',
  'i',
  'have',
  'bought',
  'ever',
  ',',
  'battery',
  'is',
  'draining',
  'like',
  'hell',
  ',',
  'backup',
  'is',
  'only',
  '6',
  'to',
  '7',
  'hours',
  'with',
  'internet',
  'uses',
  ',',
  'even',
  'if',
  'i',
  'put',
  'mobile',
  'idle',
  'its',
  'getting',
  'discharged.this',
  'is',
  'biggest',
  'lie',
  'from',
  'amazon',
  '&',
  'lenove',
  'which',
  'is',
  'not',
  'at',
  'all',
  'expected',
  ',',
  'they',
  'are',
  'making',
  'full',
  'by',
  'saying',
  'that',
  'battery',
  'is',
  '4000mah',
  '&',
  'booster',
  'charger',
  'is',
  'fake',
  ',',
  'it',
  'takes',
  'at',
  'least',
  '4',
  'to',
  '5',
  'hours',
  'to',
  'be',
  'fully',
  'charged.do',
  "n't",
  'know',
  'how',
  'lenovo',
  'will',
  'survive',
  'by',
  'making',
  'full',
  'of',
  'us.please',
  'don',
  ';',
  't',
  'go',
  'for',
  'this',
  'else',
  'you',
  'will
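
The output above contains fused tokens such as `'discharged.this'` and `'charged.do'`, which arise when a reviewer omits the space after a full stop. One possible fix, sketched here as an assumption rather than as part of the original pipeline, is to insert the missing space before tokenizing. The helper name `split_fused_sentences` is hypothetical, and the pattern would also split strings such as domain names, so it should be checked against the data first.

```python
import re

# A letter, a single full stop, then another letter with no space in between.
# Digits are deliberately excluded so that values like "4.5" stay intact.
FUSED_SENTENCE = re.compile(r"(?<=[A-Za-z])\.(?=[A-Za-z])")


def split_fused_sentences(text):
    """Insert a space after a full stop that is squeezed between two letters."""
    return FUSED_SENTENCE.sub(". ", text)


sample = "its getting discharged.this is biggest lie, backup is 4.5 hours"
print(split_fused_sentences(sample))
```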

# 4. Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

In [6]:
tagged_reviews = [nltk.pos_tag(tokens) for tokens in reviews_tokenized]


In [7]:
tagged_reviews

[[('good', 'JJ'),
  ('but', 'CC'),
  ('need', 'VBP'),
  ('updates', 'NNS'),
  ('and', 'CC'),
  ('improvements', 'NNS')],
 [('worst', 'JJS'),
  ('mobile', 'NN'),
  ('i', 'NN'),
  ('have', 'VBP'),
  ('bought', 'VBN'),
  ('ever', 'RB'),
  (',', ','),
  ('battery', 'NN'),
  ('is', 'VBZ'),
  ('draining', 'VBG'),
  ('like', 'IN'),
  ('hell', 'NN'),
  (',', ','),
  ('backup', 'NN'),
  ('is', 'VBZ'),
  ('only', 'RB'),
  ('6', 'CD'),
  ('to', 'TO'),
  ('7', 'CD'),
  ('hours', 'NNS'),
  ('with', 'IN'),
  ('internet', 'JJ'),
  ('uses', 'NNS'),
  (',', ','),
  ('even', 'RB'),
  ('if', 'IN'),
  ('i', 'JJ'),
  ('put', 'VBP'),
  ('mobile', 'JJ'),
  ('idle', 'NN'),
  ('its', 'PRP$'),
  ('getting', 'VBG'),
  ('discharged.this', 'NN'),
  ('is', 'VBZ'),
  ('biggest', 'JJS'),
  ('lie', 'NN'),
  ('from', 'IN'),
  ('amazon', 'NN'),
  ('&', 'CC'),
  ('lenove', 'NN'),
  ('which', 'WDT'),
  ('is', 'VBZ'),
  ('not', 'RB'),
  ('at', 'IN'),
  ('all', 'DT'),
  ('expected', 'VBN'),
  (',', ','),
  ('they', 'PRP'),


# 5. For the topic model, we want to include only nouns.

1 Find out all the POS tags that correspond to nouns.

2 Limit the data to only terms with these tags.
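
In the Penn Treebank tag set used by NLTK's default tagger, the noun tags are NN, NNS, NNP, and NNPS. A small sketch of the filtering step, using a hand-copied sample of the tagged output above instead of calling the tagger (the names `NOUN_TAGS` and `keep_nouns` are illustrative):

```python
# Penn Treebank noun tags: singular, plural, proper singular, proper plural
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def keep_nouns(tagged_sentence):
    """Return only the (token, tag) pairs whose tag is a noun tag."""
    return [(token, tag) for token, tag in tagged_sentence if tag in NOUN_TAGS]


# First review from the tagged output shown earlier
sample = [("good", "JJ"), ("but", "CC"), ("need", "VBP"),
          ("updates", "NNS"), ("and", "CC"), ("improvements", "NNS")]
print(keep_nouns(sample))
```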

In [8]:
noun_reviews = []

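# Keep only (token, tag) pairs whose Penn Treebank tag is a noun tag (NN, NNS, NNP, NNPS)
for tagged_sentence in tagged_reviews:
    noun_reviews.append([(token, tag) for token, tag in tagged_sentence if tag.startswith("NN")])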
    
noun_reviews

[[('updates', 'NNS'), ('improvements', 'NNS')],
 [('mobile', 'NN'),
  ('i', 'NN'),
  ('battery', 'NN'),
  ('hell', 'NN'),
  ('backup', 'NN'),
  ('hours', 'NNS'),
  ('uses', 'NNS'),
  ('idle', 'NN'),
  ('discharged.this', 'NN'),
  ('lie', 'NN'),
  ('amazon', 'NN'),
  ('lenove', 'NN'),
  ('battery', 'NN'),
  ('charger', 'NN'),
  ('hours', 'NNS'),
  ('don', 'NN')],
 [('i', 'NN'), ('%', 'NN'), ('cash', 'NN'), ('..', 'NN')],
 [],
 [('phone', 'NN'),
  ('everthey', 'NN'),
  ('phone', 'NN'),
  ('problem', 'NN'),
  ('amazon', 'NN'),
  ('phone', 'NN'),
  ('amazon', 'NN')],
 [('camerawaste', 'NN'), ('money', 'NN')],
 [('phone', 'NN'),
  ('allot', 'NN'),
  ('..', 'NNP'),
  ('reason', 'NN'),
  ('k8', 'NNS')],
 [('battery', 'NN'), ('level', 'NN')],
 [('problems', 'NNS'),
  ('phone', 'NN'),
  ('hanging', 'NN'),
  ('problems', 'NNS'),
  ('note', 'NN'),
  ('station', 'NN'),
  ('ahmedabad', 'NN'),
  ('years', 'NNS'),
  ('phone', 'NN'),
  ('lenovo', 'NN')],
 [('lot', 'NN'), ('glitches', 'NNS'), ('thing',

# 6. Lemmatize. 

1 Different forms of the terms need to be treated as one.

2 No need to provide POS tag to lemmatizer for now.

In [9]:
lemmatized_review=[]
lemm = WordNetLemmatizer()

for sentence in noun_reviews:
    lemmatized_review.append([lemm.lemmatize(noun[0]) for noun in sentence])

# 7. Remove stopwords and punctuation (if there are any). 

In [11]:
cleansed_reviews=[]

my_stopwords = list(stopwords.words('english'))+list(punctuation)
my_stopwords.append("..")

for sentence in lemmatized_review:
    cleansed_reviews.append([noun for noun in sentence if noun not in my_stopwords])

In [12]:
cleansed_reviews

[['update', 'improvement'],
 ['mobile',
  'battery',
  'hell',
  'backup',
  'hour',
  'us',
  'idle',
  'discharged.this',
  'lie',
  'amazon',
  'lenove',
  'battery',
  'charger',
  'hour'],
 ['cash'],
 [],
 ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon'],
 ['camerawaste', 'money'],
 ['phone', 'allot', 'reason', 'k8'],
 ['battery', 'level'],
 ['problem',
  'phone',
  'hanging',
  'problem',
  'note',
  'station',
  'ahmedabad',
  'year',
  'phone',
  'lenovo'],
 ['lot', 'glitch', 'thing', 'option'],
 ['wrost'],
 ['phone', 'charger', 'damage', 'month'],
 ['item', 'battery', 'life'],
 ['battery', 'problem', 'motherboard', 'problem', 'month', 'mobile', 'life'],
 ['phone', 'slim', 'battry', 'backup', 'screen'],
 ['headset'],
 ['time'],
 ['product',
  'prize',
  'range',
  'specification',
  'comparison',
  'mobile',
  'range',
  'phone',
  'seal',
  'credit',
  'card',
  'deal',
  'amazon'],
 ['battery', 'solution', 'battery', 'life'],
 ['smartphone'],
 [],
 ['gal
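
Appending `".."` to the stopword list removes only that exact string; longer runs such as `"...."` still pass through and show up among the top topic terms later in the notebook. One way to handle this more generally, offered as a sketch and not as the notebook's method, is to drop any token that contains no letter or digit (the helper name `has_alnum` is hypothetical):

```python
def has_alnum(token):
    """True if the token contains at least one letter or digit."""
    return any(ch.isalnum() for ch in token)


sample = ["cash", "..", "....", "%", "k8", "battery"]
filtered = [tok for tok in sample if has_alnum(tok)]
print(filtered)
```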

# 8. Create a topic model using LDA on the cleaned-up data with 12 topics.

Print out the top terms for each topic.

What is the coherence of the model with the c_v metric?

In [13]:
NUM_TOPICS = 12
dictionary = corpora.Dictionary(cleansed_reviews)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in cleansed_reviews]
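
`corpora.Dictionary` assigns an integer id to every distinct token, and `doc2bow` turns each document into a list of `(token_id, count)` pairs. The following pure-Python stand-in illustrates the idea on the first two cleaned reviews; it does not call gensim, and the function name `build_bow` is an assumption made for this sketch. New tokens are numbered alphabetically within each document, a choice made here so that the ids line up with the `token2id` listing in this notebook.

```python
from collections import Counter


def build_bow(documents):
    """Return (token2id, corpus) where corpus holds (token_id, count) pairs."""
    token2id = {}
    corpus = []
    for doc in documents:
        counts = Counter(doc)
        # Give unseen tokens the next free ids, alphabetically within the document
        for token in sorted(t for t in counts if t not in token2id):
            token2id[token] = len(token2id)
        corpus.append(sorted((token2id[t], n) for t, n in counts.items()))
    return token2id, corpus


docs = [
    ["update", "improvement"],
    ["mobile", "battery", "hell", "backup", "hour", "us", "idle",
     "discharged.this", "lie", "amazon", "lenove", "battery", "charger", "hour"],
]
token2id, corpus = build_bow(docs)
print(token2id["battery"], corpus[1])
```

Repeated words such as "battery" and "hour" end up with a count of 2 in the second document, which is the information LDA works from.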

In [14]:
dictionary.token2id 

{'improvement': 0,
 'update': 1,
 'amazon': 2,
 'backup': 3,
 'battery': 4,
 'charger': 5,
 'discharged.this': 6,
 'hell': 7,
 'hour': 8,
 'idle': 9,
 'lenove': 10,
 'lie': 11,
 'mobile': 12,
 'us': 13,
 'cash': 14,
 'everthey': 15,
 'phone': 16,
 'problem': 17,
 'camerawaste': 18,
 'money': 19,
 'allot': 20,
 'k8': 21,
 'reason': 22,
 'level': 23,
 'ahmedabad': 24,
 'hanging': 25,
 'lenovo': 26,
 'note': 27,
 'station': 28,
 'year': 29,
 'glitch': 30,
 'lot': 31,
 'option': 32,
 'thing': 33,
 'wrost': 34,
 'damage': 35,
 'month': 36,
 'item': 37,
 'life': 38,
 'motherboard': 39,
 'battry': 40,
 'screen': 41,
 'slim': 42,
 'headset': 43,
 'time': 44,
 'card': 45,
 'comparison': 46,
 'credit': 47,
 'deal': 48,
 'prize': 49,
 'product': 50,
 'range': 51,
 'seal': 52,
 'specification': 53,
 'solution': 54,
 'smartphone': 55,
 'galery': 56,
 'speaker': 57,
 'camera': 58,
 'features.excelent': 59,
 'speed.excellent': 60,
 'call': 61,
 'cast': 62,
 'hotspot': 63,
 'wifi': 64,
 'cable': 65,
 

In [15]:
lda_model = models.LdaModel(corpus=doc_term_matrix, num_topics=NUM_TOPICS, id2word=dictionary, random_state=31)

In [16]:
# Print the 12 highest-weighted terms of every topic
for idx in range(NUM_TOPICS):
    print("\nTopic #%s:" % idx, lda_model.print_topic(idx, 12))


Topic #0: 0.136*"charger" + 0.057*"delivery" + 0.045*"phone" + 0.043*"turbo" + 0.039*"budget" + 0.025*"charging" + 0.023*"expectation" + 0.019*"till" + 0.018*"plz" + 0.017*"cable" + 0.016*"need" + 0.015*"order"

Topic #1: 0.074*"heat" + 0.046*"performance" + 0.033*"month" + 0.032*"system" + 0.032*"phone" + 0.031*"battery" + 0.021*"problem" + 0.020*"thanks" + 0.020*"purchase" + 0.019*"condition" + 0.018*"hour" + 0.018*"...."

Topic #2: 0.130*"price" + 0.109*"...." + 0.086*"phone" + 0.042*"range" + 0.036*"everything" + 0.033*"camera" + 0.023*"feature" + 0.020*"worth" + 0.019*"product" + 0.019*"smartphone" + 0.017*"specification" + 0.013*"pls"

Topic #3: 0.059*"phone" + 0.039*"call" + 0.027*"sim" + 0.026*"network" + 0.025*"camera" + 0.022*"h" + 0.020*"video" + 0.016*"quality" + 0.016*"feature" + 0.015*"screen" + 0.015*"photo" + 0.015*"display"

Topic #4: 0.281*"phone" + 0.037*"amazon" + 0.033*"time" + 0.023*"issue" + 0.021*"superb" + 0.016*"replacement" + 0.015*"dolby" + 0.015*"problem" 

In [17]:
lda_coherence_score = CoherenceModel(model=lda_model, texts=cleansed_reviews, dictionary=dictionary, coherence='c_v')
coherence_lda = lda_coherence_score.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.49463728236998034


# 9. Analyze the topics through the business lens.

1 Determine which of the topics can be combined.

Based on the top terms of the 12-topic model, several topics overlap and can be combined into the following six major categories:

Battery Problems - Topic # 0, 8

Phone Performance - Topic # 1, 7

Pricing - Topic # 2, 6, 9

Hardware Issues - Topic # 3

Provider's Service - Topic # 4, 5, 11

Phone Accessories - Topic # 10

These six categories summarize the main themes that customers raise about the product, according to the LDA model results.
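
The grouping above can also be written down as a lookup table, which makes it easy to check that every one of the 12 topics is assigned exactly once. This is only a sketch; the dictionary name `CATEGORY_TOPICS` is illustrative, and the assignments are copied from the list above.

```python
CATEGORY_TOPICS = {
    "Battery Problems": [0, 8],
    "Phone Performance": [1, 7],
    "Pricing": [2, 6, 9],
    "Hardware Issues": [3],
    "Provider's Service": [4, 5, 11],
    "Phone Accessories": [10],
}

# Invert the mapping: topic id -> category name
topic_to_category = {
    topic_id: category
    for category, topic_ids in CATEGORY_TOPICS.items()
    for topic_id in topic_ids
}

# Every topic id from 0 to 11 should appear exactly once
assigned = sorted(t for ids in CATEGORY_TOPICS.values() for t in ids)
print(assigned == list(range(12)), topic_to_category[9])
```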

# 10. Create a topic model using LDA with what you think is the optimal number of topics.

What is the coherence of the model?

In [18]:
# Train one model per candidate topic count (1 to 19) and report its c_v coherence
for i in range(1, 20):
    lda_model2 = models.LdaModel(corpus=doc_term_matrix, num_topics=i, id2word=dictionary, random_state=31)
    lda_coherence_score2 = CoherenceModel(model=lda_model2, texts=cleansed_reviews, dictionary=dictionary, coherence='c_v')
    coherence_lda2 = lda_coherence_score2.get_coherence()
    print('\n Iteration {} Coherence Score: {}'.format(i, coherence_lda2))
        


 Iteration 1 Coherence Score: 0.4530977719924728

 Iteration 2 Coherence Score: 0.5420413053840256

 Iteration 3 Coherence Score: 0.517223904502231

 Iteration 4 Coherence Score: 0.4932724626383746

 Iteration 5 Coherence Score: 0.5296196266830859

 Iteration 6 Coherence Score: 0.5612164418353448

 Iteration 7 Coherence Score: 0.5444517913780446

 Iteration 8 Coherence Score: 0.5352033283337323

 Iteration 9 Coherence Score: 0.5155521159925478

 Iteration 10 Coherence Score: 0.5027658134599642

 Iteration 11 Coherence Score: 0.4943050244058666

 Iteration 12 Coherence Score: 0.49463728236998034

 Iteration 13 Coherence Score: 0.5067517543400267

 Iteration 14 Coherence Score: 0.5124270575030481

 Iteration 15 Coherence Score: 0.5284891406927803

 Iteration 16 Coherence Score: 0.5086458544214055

 Iteration 17 Coherence Score: 0.5022565482463999

 Iteration 18 Coherence Score: 0.48506598649201993

 Iteration 19 Coherence Score: 0.506188812618673
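
Reading the sweep output, the highest c_v coherence occurs at 6 topics. A short sketch of picking that value programmatically, with the scores copied from the output above and rounded to four decimals (the variable names are illustrative):

```python
# num_topics -> c_v coherence, rounded from the printed sweep output
coherence_by_k = {
    1: 0.4531, 2: 0.5420, 3: 0.5172, 4: 0.4933, 5: 0.5296,
    6: 0.5612, 7: 0.5445, 8: 0.5352, 9: 0.5156, 10: 0.5028,
    11: 0.4943, 12: 0.4946, 13: 0.5068, 14: 0.5124, 15: 0.5285,
    16: 0.5086, 17: 0.5023, 18: 0.4851, 19: 0.5062,
}

# Pick the topic count with the highest coherence
best_k = max(coherence_by_k, key=coherence_by_k.get)
print(best_k, coherence_by_k[best_k])
```

Coherence is a heuristic and this sweep uses a single random seed, so neighbouring values such as 7 may be nearly as defensible; here the maximum also happens to match the six merged categories from step 9.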


In [19]:
# Use 6 topics, the value with the highest c_v coherence (about 0.561) in the sweep above
lda_model = models.LdaModel(corpus=doc_term_matrix, num_topics=6, id2word=dictionary, random_state=31)

In [20]:
for idx in range(6):
    print("\nTopic #%s:" % idx, lda_model.print_topic(idx, 12))


Topic #0: 0.025*"charger" + 0.019*"super" + 0.018*"box" + 0.017*"piece" + 0.015*"device" + 0.015*"earphone" + 0.013*"awesome" + 0.011*"star" + 0.011*"hai" + 0.008*"gud" + 0.008*"bill" + 0.007*"ko"

Topic #1: 0.119*"battery" + 0.032*"hour" + 0.028*"backup" + 0.021*"problem" + 0.020*"phone" + 0.019*"charge" + 0.016*"issue" + 0.015*"charger" + 0.015*"h" + 0.014*"heat" + 0.013*"time" + 0.012*"month"

Topic #2: 0.122*"camera" + 0.081*"phone" + 0.057*"battery" + 0.057*"quality" + 0.038*"performance" + 0.030*"price" + 0.029*"...." + 0.024*"feature" + 0.015*"everything" + 0.014*"life" + 0.013*"backup" + 0.012*"budget"

Topic #3: 0.060*"phone" + 0.032*"camera" + 0.019*"screen" + 0.018*"issue" + 0.016*"network" + 0.015*"battery" + 0.015*"speaker" + 0.014*"call" + 0.014*"sim" + 0.012*"quality" + 0.011*"feature" + 0.011*"display"

Topic #4: 0.159*"phone" + 0.057*"mobile" + 0.037*"problem" + 0.032*"note" + 0.029*"issue" + 0.020*"time" + 0.019*"k8" + 0.018*"lenovo" + 0.018*"amazon" + 0.015*"heating

# 11. The business should be able to interpret the topics.

Name each of the identified topics.

Create a table with the topic name and the top 10 terms in each to present to the business.

In [21]:
topics = lda_model.show_topics(formatted=False)

In [22]:
Categories=["Phone Accessories","Battery Problems", "Phone Performance", "Hardware Issues", "Provider's Service", "Pricing"]

In [23]:
g= pd.DataFrame(topics)
Topic_DF= g[1]

In [24]:
Table = []
for topic in Topic_DF:
        Table.append([word[0] for word in topic])
    
Table[0]

['charger',
 'super',
 'box',
 'piece',
 'device',
 'earphone',
 'awesome',
 'star',
 'hai',
 'gud']

In [25]:
pd.set_option("display.max_rows", None, "display.max_columns", None)

DataFrame= pd.DataFrame({'Topic': Categories, 'Top Terms':Table})

In [26]:
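# Styler.hide_index() was removed in pandas 2.0; hide(axis="index") replaces it (pandas >= 1.4)
DataFrame.style.hide(axis="index")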

Topic,Top Terms
Phone Accessories,"['charger', 'super', 'box', 'piece', 'device', 'earphone', 'awesome', 'star', 'hai', 'gud']"
Battery Problems,"['battery', 'hour', 'backup', 'problem', 'phone', 'charge', 'issue', 'charger', 'h', 'heat']"
Phone Performance,"['camera', 'phone', 'battery', 'quality', 'performance', 'price', '....', 'feature', 'everything', 'life']"
Hardware Issues,"['phone', 'camera', 'screen', 'issue', 'network', 'battery', 'speaker', 'call', 'sim', 'quality']"
Provider's Service,"['phone', 'mobile', 'problem', 'note', 'issue', 'time', 'k8', 'lenovo', 'amazon', 'heating']"
Pricing,"['product', 'phone', 'camera', 'money', 'quality', 'price', 'service', 'lenovo', 'waste', 'day']"
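
For a business audience, the term lists may read better as comma-separated text than as Python list literals. A minimal sketch, assuming lists shaped like `Categories` and `Table` from the cells above (only two rows are copied here for brevity, and the lower-case variable names are illustrative):

```python
categories = ["Battery Problems", "Hardware Issues"]
table = [
    ["battery", "hour", "backup", "problem", "phone",
     "charge", "issue", "charger", "h", "heat"],
    ["phone", "camera", "screen", "issue", "network",
     "battery", "speaker", "call", "sim", "quality"],
]

# One row per topic, with the ten terms joined into a single readable string
rows = [
    {"Topic": name, "Top Terms": ", ".join(terms)}
    for name, terms in zip(categories, table)
]
for row in rows:
    print(f"{row['Topic']}: {row['Top Terms']}")
```

The `rows` list can be passed to `pd.DataFrame(rows)` to obtain the same two-column layout as the table above.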
