# Project 1: Review Project Analysis (submitted by Manjeet Singh)

## DESCRIPTION

Help a leading mobile brand understand the voice of the customer by analyzing the Amazon reviews of its product and the topics that customers are talking about. You will perform topic modeling on specific parts of speech and then interpret the emerging topics.

## Problem Statement: 

Lenovo, a popular mobile phone brand, has launched its budget smartphone in the Indian market. The client wants to understand the VOC (voice of the customer) on the product. This will be useful not just to evaluate the current product, but also to get some direction for developing the product pipeline. The client is particularly interested in the different aspects that customers care about. Product reviews by customers on a leading e-commerce site should provide a good view.

## Domain: Amazon reviews for a leading phone brand

Analysis to be done: POS tagging, topic modeling using LDA, and topic interpretation

Content: 

Dataset: ‘K8 Reviews v0.2.csv’

Columns:

Sentiment: The sentiment against the review (4,5 star reviews are positive, 1,2 are negative)

Reviews: The main text of the review

In [83]:
###################################Tasks###########################################: 

#--1] Read the .csv file using Pandas. Take a look at the top few records.

#--2] Normalize casings for the review text and extract the text into a list for easier manipulation.

#--3] Tokenize the reviews using NLTK's word_tokenize function.

#--4] Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

#--5] For the topic model, we want to include only nouns.

#------- 5a] Find out all the POS tags that correspond to nouns.

#------- 5b] Limit the data to only terms with these tags.

#--6] Lemmatize. 

#------- 6a] Different forms of the terms need to be treated as one.

#------- 6b] No need to provide POS tag to lemmatizer for now.

#--7] Remove stopwords and punctuation (if there are any). 

#--8] Create a topic model using LDA on the cleaned-up data with 12 topics.

#-------8a] Print out the top terms for each topic.

#-------8b] What is the coherence of the model with the c_v metric?

#--9] Analyze the topics through the business lens.

#-------9a] Determine which of the topics can be combined.

#--10] Create topic model using LDA with what you think is the optimal number of topics

#-------10a] What is the coherence of the model?

#--11] The business should be able to interpret the topics.

#-------11a] Name each of the identified topics.

#-------11b] Create a table with the topic name and the top 10 terms in each to present to the business.

In [223]:
# -- 0] Importing all the Modules/Packages

In [250]:
# Standard library
import re
from string import punctuation

# Third-party packages
import pandas as pd
import nltk
import gensim
import gensim.corpora as corpora
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from gensim.models import ldamodel
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import pyLDAvis
import pyLDAvis.gensim  # the visualisation cells below call pyLDAvis.gensim.prepare
import seaborn as sns

In [220]:
# --1] Read the .csv file using Pandas. Take a look at the top few records.

In [87]:
DataRevs = pd.read_csv("K8 Reviews v0.2.csv")
DataRevs.head()

Unnamed: 0,sentiment,review
0,1,Good but need updates and improvements
1,0,"Worst mobile i have bought ever, Battery is dr..."
2,1,when I will get my 10% cash back.... its alrea...
3,1,Good
4,0,The worst phone everThey have changed the last...


In [221]:
#--2] Normalize casings for the review text and extract the text into a list for easier manipulation.

In [176]:
DataRevs_lCase = [sent.lower() for sent in DataRevs.review.values]
DataRevs_lCase[0]

'good but need updates and improvements'

In [222]:
#--3] Tokenize the reviews using NLTK's word_tokenize function.

In [177]:
Token_Revs = [word_tokenize(sent) for sent in DataRevs_lCase]
Token_Revs[0]

['good', 'but', 'need', 'updates', 'and', 'improvements']

In [224]:
#--4] Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

In [178]:
nltk.pos_tag(Token_Revs[0])

[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

In [179]:
Tagged_Revs = [nltk.pos_tag(tokens) for tokens in Token_Revs]
Tagged_Revs[0]

[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

In [225]:
#--5] For the topic model, we want to include only nouns.

#------- 5a] Find out all the POS tags that correspond to nouns.

#------- 5b] Limit the data to only terms with these tags.

In [181]:
Tupple_Tagged = nltk.pos_tag(['great'])
Tupple_Tagged[0]

('great', 'JJ')

In [182]:
Noun_Revs=[]

In [183]:
for sent in Tagged_Revs:
    Noun_Revs.append([token for token in sent if re.search("NN.*", token[1])])

Noun_Revs[0]

[('updates', 'NNS'), ('improvements', 'NNS')]
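
Task 5a asks which POS tags correspond to nouns. In the Penn Treebank tag set, which the tags in the output above follow, the noun tags are `NN`, `NNS`, `NNP` and `NNPS`. The sketch below is one possible way to make that list explicit instead of relying on the `"NN.*"` pattern. `NOUN_TAGS`, `keep_nouns` and `sample_tagged` are illustrative names, not part of the notebook, and the sample is hand-copied from the tagging output so the block runs without the NLTK tagger.

```python
# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def keep_nouns(tagged_sentence):
    """Return only the (token, tag) pairs whose tag is a noun tag."""
    return [(token, tag) for token, tag in tagged_sentence if tag in NOUN_TAGS]


# Hand-tagged sample copied from the POS-tagging output above.
sample_tagged = [
    ("good", "JJ"), ("but", "CC"), ("need", "VBP"),
    ("updates", "NNS"), ("and", "CC"), ("improvements", "NNS"),
]

print(keep_nouns(sample_tagged))
```

For this sample the explicit set keeps the same two tokens as the regex filter; the difference is only that the set cannot accidentally match some other tag that happens to contain `NN`.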

In [226]:
#--6] Lemmatize. 

#------- 6a] Different forms of the terms need to be treated as one.

#------- 6b] No need to provide POS tag to lemmatizer for now.

In [184]:
Init_lemm = WordNetLemmatizer()

In [185]:
Lemm_Revs=[]

In [186]:
for sent in Noun_Revs:
    Lemm_Revs.append([Init_lemm.lemmatize(word[0]) for word in sent])    

Lemm_Revs[0]

['update', 'improvement']

In [228]:
#--7] Remove stopwords and punctuation (if there are any). 

In [99]:
Init_stopW = stopwords.words("english")

In [234]:
list(punctuation)

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

In [187]:
Init_stopW_upd = Init_stopW + list(punctuation) + ["..."] + [".."]

In [188]:
SW_Remv_Revs=[]

In [189]:
for sent in Lemm_Revs:
    SW_Remv_Revs.append([term for term in sent if term not in Init_stopW_upd])

In [190]:
SW_Remv_Revs

[['update', 'improvement'],
 ['mobile',
  'battery',
  'hell',
  'backup',
  'hour',
  'us',
  'idle',
  'discharged.this',
  'lie',
  'amazon',
  'lenove',
  'battery',
  'charger',
  'hour'],
 ['cash', 'january..'],
 [],
 ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon'],
 ['camerawaste', 'money'],
 ['phone', 'reason', 'k8'],
 ['battery', 'level'],
 ['problem',
  'phone',
  'hanging',
  'problem',
  'note',
  'station',
  'ahmedabad',
  'year',
  'phone',
  'lenovo'],
 ['lot', 'glitch', 'thing', 'option'],
 ['wrost'],
 ['phone', 'charger', 'damage', 'month'],
 ['item', 'battery', 'life'],
 ['battery', 'problem', 'motherboard', 'problem', 'month', 'mobile', 'life'],
 ['phone', 'slim', 'battry', 'backup', 'screen'],
 ['headset'],
 ['time'],
 ['product',
  'prize',
  'range',
  'specification',
  'comparison',
  'mobile',
  'range',
  'phone',
  'seal',
  'credit',
  'card',
  'deal',
  'amazon..'],
 ['battery', 'solution', 'battery', 'life'],
 ['smartphone'],
 [],


In [191]:
SW_Remv_Revs[1]

['mobile',
 'battery',
 'hell',
 'backup',
 'hour',
 'us',
 'idle',
 'discharged.this',
 'lie',
 'amazon',
 'lenove',
 'battery',
 'charger',
 'hour']
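
The cleaned output still contains glued tokens such as `'discharged.this'`, `'january..'` and `'amazon..'`, because the punctuation sits directly against the words in the raw review. A minimal sketch of a pre-tokenization clean-up follows, assuming the text has already been lower-cased. `split_glued_punctuation` is a hypothetical helper, not something the notebook defines, and it does not repair words joined without any punctuation (for example `'everthey'`).

```python
import re


def split_glued_punctuation(text):
    """Separate punctuation that is squeezed between or after words."""
    # Replace runs of two or more dots ("..", "....") with a single space.
    text = re.sub(r"\.{2,}", " ", text)
    # Put spaces around punctuation that sits directly between two letters.
    text = re.sub(r"(?<=[a-z])([.!?,;:]+)(?=[a-z])", r" \1 ", text)
    # Collapse any repeated whitespace created by the steps above.
    return re.sub(r"\s+", " ", text).strip()


print(split_glued_punctuation("idle discharged.this is a lie"))
print(split_glued_punctuation("cash back.... its already 15 january.."))
```

Applying a step like this to each review before `word_tokenize` would let the tokenizer see `discharged` and `this` as separate words. Decimal numbers such as `0.2` are left alone because the second pattern requires letters on both sides.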

In [229]:
#--8] Create a topic model using LDA on the cleaned-up data with 12 topics.

#-------8a] Print out the top terms for each topic.

#-------8b] What is the coherence of the model with the c_v metric?

In [192]:
Var_id2word_Revs = corpora.Dictionary(SW_Remv_Revs)
Var_texts_Revs = SW_Remv_Revs

In [193]:
Corpus_Revs = [Var_id2word_Revs.doc2bow(text) for text in Var_texts_Revs]
print(Corpus_Revs[200])

[(426, 1), (427, 1), (428, 1), (429, 1)]
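
Each entry in the corpus is a bag of words: a list of `(term_id, count)` pairs for one review. The pure-Python sketch below illustrates the idea on two of the cleaned reviews shown earlier. `build_vocab` and `to_bow` are illustrative helpers, and the ids they assign will generally differ from the ids gensim's `Dictionary` assigns.

```python
from collections import Counter


def build_vocab(docs):
    """Assign each distinct term an integer id in order of first appearance."""
    vocab = {}
    for doc in docs:
        for term in doc:
            vocab.setdefault(term, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Return sorted (term_id, count) pairs for the terms of doc found in vocab."""
    counts = Counter(vocab[term] for term in doc if term in vocab)
    return sorted(counts.items())


docs = [["update", "improvement"], ["battery", "solution", "battery", "life"]]
vocab = build_vocab(docs)
print(to_bow(docs[1], vocab))
```

The repeated term `battery` collapses into a single pair with count 2, which is the information LDA actually consumes; word order is discarded.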


In [206]:
Revs_LDA_Model_basic_12T_Rnd42 = gensim.models.ldamodel.LdaModel(corpus=Corpus_Revs,
                                           id2word=Var_id2word_Revs,
                                           num_topics=12, 
                                           random_state=42,
                                           passes=10,
                                           per_word_topics=True)

In [209]:
Revs_LDA_Model_12Topics_Rnd150 = gensim.models.ldamodel.LdaModel(corpus=Corpus_Revs,
                                           id2word=Var_id2word_Revs,
                                           num_topics=12, 
                                           random_state=150,
                                           passes=10,
                                           per_word_topics=True)

In [201]:
print(Revs_LDA_Model_basic_12T_Rnd42.print_topics())

[(0, '0.381*"mobile" + 0.023*"problem" + 0.023*"notification" + 0.017*"heat" + 0.016*"cell" + 0.016*"message" + 0.011*"hang" + 0.011*"rate" + 0.010*"whatsapp" + 0.009*"call"'), (1, '0.267*"battery" + 0.105*"problem" + 0.055*"backup" + 0.055*"heating" + 0.052*"issue" + 0.037*"performance" + 0.036*"hour" + 0.032*"day" + 0.030*"time" + 0.029*"life"'), (2, '0.062*"handset" + 0.051*"software" + 0.041*"box" + 0.032*"contact" + 0.030*"update" + 0.026*"set" + 0.023*"star" + 0.023*"option" + 0.022*"item" + 0.020*"purchase"'), (3, '0.080*"phone" + 0.049*"amazon" + 0.044*"service" + 0.030*"lenovo" + 0.030*"day" + 0.029*"issue" + 0.027*"problem" + 0.026*"time" + 0.022*"delivery" + 0.019*"experience"'), (4, '0.135*"feature" + 0.076*"camera" + 0.048*"mode" + 0.037*"video" + 0.027*"android" + 0.025*"stock" + 0.023*"depth" + 0.019*"gallery" + 0.018*"volta" + 0.017*"thanks"'), (5, '0.439*"product" + 0.090*"charger" + 0.018*"earphone" + 0.016*"turbo" + 0.016*"buy" + 0.016*"piece" + 0.015*"awesome" + 0.0

In [207]:
Revs_Coh_LDA_Model_basic_12T_Rnd42 = CoherenceModel(model=Revs_LDA_Model_basic_12T_Rnd42, texts=Var_texts_Revs, dictionary=Var_id2word_Revs, coherence='c_v')
Revs_Score_Coh_LDA_12T_Rnd42 = Revs_Coh_LDA_Model_basic_12T_Rnd42.get_coherence()

print('\nCoherence Score --12T- Rnd42-->: ', Revs_Score_Coh_LDA_12T_Rnd42)



Coherence Score --12T- Rnd42-->:  0.5560767730635368


In [210]:
Revs_Coh_LDA_Model_12Topics_Rnd150 = CoherenceModel(model=Revs_LDA_Model_12Topics_Rnd150, texts=Var_texts_Revs, dictionary=Var_id2word_Revs, coherence='c_v')
Revs_Score_Coh_LDA_12T_Rnd150 = Revs_Coh_LDA_Model_12Topics_Rnd150.get_coherence()

print('\nCoherence Score --12T- Rnd150-->: ', Revs_Score_Coh_LDA_12T_Rnd150)



Coherence Score --12T- Rnd150-->:  0.5252625972736661


In [212]:
Revs_LDA_Model_8Topics_Rnd42 = gensim.models.ldamodel.LdaModel(corpus=Corpus_Revs,
                                           id2word=Var_id2word_Revs,
                                           num_topics=8, 
                                           random_state=42,
                                           passes=10,
                                           per_word_topics=True)


In [213]:
Revs_LDA_Model_8Topics_Rnd150 = gensim.models.ldamodel.LdaModel(corpus=Corpus_Revs,
                                           id2word=Var_id2word_Revs,
                                           num_topics=8, 
                                           random_state=150,
                                           passes=10,
                                           per_word_topics=True)


In [214]:
Revs_Coh_LDA_Model_8Topics_Rnd42 = CoherenceModel(model=Revs_LDA_Model_8Topics_Rnd42, texts=Var_texts_Revs, dictionary=Var_id2word_Revs, coherence='c_v')
Revs_Score_Coh_LDA_8Topics_Rnd42 = Revs_Coh_LDA_Model_8Topics_Rnd42.get_coherence()
print('\nCoherence Score--8T- Rnd42-->:', Revs_Score_Coh_LDA_8Topics_Rnd42)



Coherence Score--8T- Rnd42-->: 0.5470127061130555


In [215]:
Revs_Coh_LDA_Model_8Topics_Rnd150 = CoherenceModel(model=Revs_LDA_Model_8Topics_Rnd150, texts=Var_texts_Revs, dictionary=Var_id2word_Revs, coherence='c_v')
Revs_Score_Coh_LDA_8Topics_Rnd150 = Revs_Coh_LDA_Model_8Topics_Rnd150.get_coherence()
print('\nCoherence Score 8T Rnd150: ', Revs_Score_Coh_LDA_8Topics_Rnd150)


Coherence Score 8T Rnd150:  0.5261495802619411
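
The cells above repeat the same train-and-score steps for each topic count and seed. One way to organise that comparison is a small sweep helper that takes a scoring function and tries every combination. This is only a sketch: `sweep` and `fake_score` are hypothetical names, and `fake_score` is a stand-in that does not compute a real coherence value.

```python
from itertools import product


def sweep(score_fn, topic_counts, seeds):
    """Score every (num_topics, seed) pair and return results sorted best first."""
    results = [
        (score_fn(k, seed), k, seed)
        for k, seed in product(topic_counts, seeds)
    ]
    return sorted(results, reverse=True)


def fake_score(num_topics, seed):
    # Stand-in scorer for illustration only; NOT a real coherence value.
    return round(1.0 / (1 + abs(num_topics - 6)) + seed * 1e-6, 6)


ranking = sweep(fake_score, topic_counts=[5, 6, 7, 8, 12], seeds=[42, 150])
best_score, best_k, best_seed = ranking[0]
print(best_k, best_seed)
```

In this notebook, `score_fn` would train an `LdaModel` with the given `num_topics` and `random_state` and return `CoherenceModel(...).get_coherence()`. Using the same seeds for every topic count keeps the comparison like for like.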


In [230]:
#--9] Analyze the topics through the business lens.

#-------9a] Determine which of the topics can be combined.

In [160]:
lda_display_8T_Rnd150 = pyLDAvis.gensim.prepare(Revs_LDA_Model_8Topics_Rnd150, Corpus_Revs, Var_id2word_Revs, sort_topics=False)
pyLDAvis.display(lda_display_8T_Rnd150)

In [216]:
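# 12-topic view; uses the seed-42 model, the better-scoring of the two 12-topic runs above.
lda_display12 = pyLDAvis.gensim.prepare(Revs_LDA_Model_basic_12T_Rnd42, Corpus_Revs, Var_id2word_Revs, sort_topics=False)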
pyLDAvis.display(lda_display12)

In [217]:
#### ---- try to further visually analyse and improve the coherence score

In [231]:
#--10] Create topic model using LDA with what you think is the optimal number of topics

#-------10a] What is the coherence of the model?

In [237]:
lda_model_6try = gensim.models.ldamodel.LdaModel(corpus=Corpus_Revs,
                                           id2word=Var_id2word_Revs,
                                           num_topics=6, 
                                           random_state=150,
                                           passes=10,
                                           per_word_topics=True)

In [238]:
coherence_model_ldaX = CoherenceModel(model=lda_model_6try, texts=Var_texts_Revs, dictionary=Var_id2word_Revs, coherence='c_v')
coherence_ldaX = coherence_model_ldaX.get_coherence()
print('\nCoherence Score: ', coherence_ldaX)


Coherence Score:  0.5924798285087691


In [239]:
lda_display_X6 = pyLDAvis.gensim.prepare(lda_model_6try, Corpus_Revs, Var_id2word_Revs, sort_topics=False)
pyLDAvis.display(lda_display_X6)

In [75]:
#### Trying with 5 topics

In [240]:
lda_model_5try = gensim.models.ldamodel.LdaModel(corpus=Corpus_Revs,
                                           id2word=Var_id2word_Revs,
                                           num_topics=5, 
                                           random_state=150,
                                           passes=10,
                                           per_word_topics=True)

In [241]:
coherence_model_ldaX5 = CoherenceModel(model=lda_model_5try, texts=Var_texts_Revs, dictionary=Var_id2word_Revs, coherence='c_v')
coherence_ldaX5 = coherence_model_ldaX5.get_coherence()
print('\nCoherence Score: ', coherence_ldaX5)


Coherence Score:  0.5209107199882128


In [242]:
lda_display_X5 = pyLDAvis.gensim.prepare(lda_model_5try, Corpus_Revs, Var_id2word_Revs, sort_topics=False)
pyLDAvis.display(lda_display_X5)

In [243]:
lda_model_7try = gensim.models.ldamodel.LdaModel(corpus=Corpus_Revs,
                                           id2word=Var_id2word_Revs,
                                           num_topics=7, 
                                           random_state=500,
                                           passes=10,
                                           per_word_topics=True)

In [244]:
coherence_model_ldaX7 = CoherenceModel(model=lda_model_7try, texts=Var_texts_Revs, dictionary=Var_id2word_Revs, coherence='c_v')
coherence_ldaX7 = coherence_model_ldaX7.get_coherence()
print('\nCoherence Score: ', coherence_ldaX7)


Coherence Score:  0.5347779202679346


#### Analysis result: of the models tried, the six-topic model (random_state=150) gives the best c_v coherence score, 0.5925 ####
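
To make the comparison easy to check, the coherence scores printed above can be collected in one place and ranked. The values below are copied from the cell outputs; the dictionary name `recorded` is just an illustrative choice. The runs do not all share the same seed (the 7-topic model used `random_state=500`, and the 5- and 6-topic models were only run with 150), so the ranking is indicative rather than a controlled comparison.

```python
# (num_topics, random_state) -> c_v coherence, copied from the printed outputs above.
recorded = {
    (12, 42): 0.5560767730635368,
    (12, 150): 0.5252625972736661,
    (8, 42): 0.5470127061130555,
    (8, 150): 0.5261495802619411,
    (6, 150): 0.5924798285087691,
    (5, 150): 0.5209107199882128,
    (7, 500): 0.5347779202679346,
}

# Pick the configuration with the highest recorded score.
best_key = max(recorded, key=recorded.get)

# Print all runs from best to worst.
for (k, seed), score in sorted(recorded.items(), key=lambda kv: kv[1], reverse=True):
    print(f"num_topics={k:>2}  random_state={seed:>3}  c_v={score:.4f}")
print("best:", best_key)
```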

In [233]:
#--11] The business should be able to interpret the topics.

#-------11a] Name each of the identified topics.

#-------11b] Create a table with the topic name and the top 10 terms in each to present to the business.

In [248]:
x = lda_model_6try.show_topics(formatted=False)
Grp_topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in x]

for Var_topic,Var_words in Grp_topics_words:
    print(str(Var_topic)+ "::"+ str(Var_words))
print()


0::['battery', 'phone', 'problem', 'issue', 'heating', 'network', 'time', 'backup', 'day', 'hour']
1::['camera', 'quality', 'phone', 'battery', 'performance', 'feature', 'processor', 'display', 'ram', 'mode']
2::['phone', 'product', 'price', 'range', 'feature', 'lenovo', 'issue', 'month', 'superb', 'performance']
3::['mobile', 'screen', 'charger', 'music', 'glass', 'everything', 'feature', 'video', 'lenovo', 'dolby']
4::['product', 'amazon', 'service', 'device', 'time', 'phone', 'speaker', 'delivery', 'problem', 'customer']
5::['note', 'money', 'k8', 'lenovo', 'waste', 'phone', 'value', 'hai', 'h', 'handset']



# Business Interpretations

The six topics, in the order printed above, can be named as follows:

1. Phone Performance
2. Overall General Phone Features
3. Pricing
4. Product Accessories
5. Service
6. Value

# Observations

1. Keeping only the nouns loses context: it is not clear whether a term is mentioned positively or negatively. For example:
   - Network: good network or bad network?
   - Backup: does the phone give a good backup or not?

2. Segregating the reviews into positive and negative sentiment before modeling would have helped achieve a better output.

3. A bigger dataset would have helped the model perform better.
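
One possible way to recover the context that observation 1 says is missing is to keep adjective and noun pairs as single terms, so that "good network" and "bad network" stay distinct. The sketch below works on a hand-tagged sample so it runs without the NLTK tagger. `adjective_noun_pairs` and `sample` are hypothetical, and this approach was not tried in the notebook.

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}


def adjective_noun_pairs(tagged_sentence):
    """Return 'adjective_noun' terms for each adjective directly followed by a noun."""
    pairs = []
    # Walk over consecutive (current, next) token pairs.
    for (word, tag), (next_word, next_tag) in zip(tagged_sentence, tagged_sentence[1:]):
        if tag in ADJ_TAGS and next_tag in NOUN_TAGS:
            pairs.append(f"{word}_{next_word}")
    return pairs


# Hypothetical hand-tagged review fragment, for illustration only.
sample = [
    ("bad", "JJ"), ("network", "NN"), ("but", "CC"),
    ("good", "JJ"), ("backup", "NN"),
]

print(adjective_noun_pairs(sample))
```

Terms like these could be fed to the dictionary in place of, or alongside, the bare nouns; whether that improves coherence on this dataset would need to be tested.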




# Conclusion: after visually analysing the results and filtering the overlaps, the 6-topic model gives the best c_v coherence score, 0.5925