# Tasks

1. Normalize case
2. Tokenize (using word_tokenize from NLTK)
3. POS tagging using the NLTK pos tagger
4. For the topic model, we would want to include only nouns
 - First, find out all the POS tags that correspond to nouns
 - Limit the data to only terms with these tags
5. Lemmatize (you want different forms of the terms to be treated as one, don't worry about providing POS tag to lemmatizer for now)
6. Remove stop words and punctuation (if there are any at all after the POS tagging)
7. Create a topic model using LDA on the cleaned up data with 12 topics
 - choose the topic model parameters carefully
 - what is the perplexity of the model?
 - what is the coherence of the model?
8. Analyze the topics, which pairs of topics can be combined?
9. Create topic model using LDA with what you think is the optimal number of topics
 - choose the topic model parameters carefully
 - is the perplexity better now?
 - is the coherence better now?
10. The business finally needs to be able to interpret the topics
 - name each of the identified topics
 - create a table with the topic name and the top 10 terms in each to present to business

#### Importing all necessary packages

In [2]:
import warnings
warnings.filterwarnings("ignore")

# Importing the usual utilities
import numpy as np, pandas as pd
import re, random, os, string

from pprint import pprint #pretty print
import matplotlib.pyplot as plt
%matplotlib inline

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
import pyLDAvis.gensim  # newer pyLDAvis releases rename this module to pyLDAvis.gensim_models



#### Loading the necessary files

In [3]:
reviews0 = pd.read_csv("K8 Reviews v0.2.csv")
reviews0.tail()

Unnamed: 0,sentiment,review
14670,1,"I really like the phone, Everything is working..."
14671,1,The Lenovo K8 Note is awesome. It takes best p...
14672,1,Awesome Gaget.. @ this price
14673,1,This phone is nice processing will be successf...
14674,1,Good product but the pakeging was not enough.


## Task 1. Normalize case

In [4]:
# Making a list of lower-cased reviews
Normalize_rev_lower = [rev.lower() for rev in reviews0.review.values]
Normalize_rev_lower[0:3]

['good but need updates and improvements',
 "worst mobile i have bought ever, battery is draining like hell, backup is only 6 to 7 hours with internet uses, even if i put mobile idle its getting discharged.this is biggest lie from amazon & lenove which is not at all expected, they are making full by saying that battery is 4000mah & booster charger is fake, it takes at least 4 to 5 hours to be fully charged.don't know how lenovo will survive by making full of us.please don;t go for this else you will regret like me.",
 'when i will get my 10% cash back.... its already 15 january..']

In [5]:
# Checking whether any review contains a link
# (str.find returns -1 when absent, so >= 0 also catches a link at position 0)
reviews0[reviews0['review'].apply(lambda x: x.find('http') >= 0)]

Unnamed: 0,sentiment,review
3893,0,Turbo charging has stopped after 5 days. It ta...


In [6]:
# Removing the link from the review text only (assigning to the whole row would overwrite every column)
reviews0.loc[3893, 'review'] = re.sub(r'http\S+', '', Normalize_rev_lower[3893])

In [7]:
# Rebuilding the list of lower-cased reviews after the link clean-up
Normalize_rev_lower = [rev.lower() for rev in reviews0.review.values]
Normalize_rev_lower[0:3]

['good but need updates and improvements',
 "worst mobile i have bought ever, battery is draining like hell, backup is only 6 to 7 hours with internet uses, even if i put mobile idle its getting discharged.this is biggest lie from amazon & lenove which is not at all expected, they are making full by saying that battery is 4000mah & booster charger is fake, it takes at least 4 to 5 hours to be fully charged.don't know how lenovo will survive by making full of us.please don;t go for this else you will regret like me.",
 'when i will get my 10% cash back.... its already 15 january..']
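
Instead of patching a single row by index, the whole column can be cleaned in one pass. The sketch below is only an illustration: it assumes a small made-up DataFrame called `toy` (not the real review file) with a placeholder URL, and uses pandas string methods to strip links before lower-casing.

```python
import pandas as pd

# Toy data standing in for the review file (made-up rows, placeholder URL)
toy = pd.DataFrame({
    "sentiment": [0, 1],
    "review": ["Charging stopped. See http://example.com/thread for details",
               "Good Phone"],
})

# Flag the rows that contain a link anywhere in the text (position 0 included)
has_link = toy["review"].str.contains("http", regex=False)

# Strip every link, collapse the leftover whitespace, then normalize case
clean = (toy["review"]
         .str.replace(r"http\S+", "", regex=True)
         .str.replace(r"\s+", " ", regex=True)
         .str.strip()
         .str.lower())

print(has_link.tolist())
print(clean.tolist())
```

Only the `review` column is touched, so the other columns of each row stay as they were.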

## Task 2. Tokenize (using word_tokenize from NLTK)

In [8]:
# Tokenizing each normalized review into words
reviews_word_token = [word_tokenize(sent) for sent in Normalize_rev_lower]
reviews_word_token[0]

['good', 'but', 'need', 'updates', 'and', 'improvements']

## Task 3. POS tagging using the NLTK pos tagger

In [9]:
#checking pos tag
nltk.pos_tag(reviews_word_token[0])

[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

In [10]:
# POS-tagging every tokenized review (each token becomes a (term, tag) tuple)
reviews_pos_tagged = [nltk.pos_tag(tokens) for tokens in reviews_word_token]
reviews_pos_tagged[0]

[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

## Task 4. For the topic model, we would want to include only nouns
 - First, find out all the POS tags that correspond to nouns
 - Limit the data to only terms with these tags


Note that for each term, the POS tagger returns a tuple: the first element is the term and the second is its tag.

In [11]:
# Keeping only the tokens tagged as nouns (Penn Treebank tags NN, NNS, NNP, NNPS)
reviews_noun=[]
for sent in reviews_pos_tagged:
    reviews_noun.append([token for token in sent if re.search("NN.*", token[1])])
reviews_noun[0:3]

[[('updates', 'NNS'), ('improvements', 'NNS')],
 [('mobile', 'NN'),
  ('i', 'NN'),
  ('battery', 'NN'),
  ('hell', 'NN'),
  ('backup', 'NN'),
  ('hours', 'NNS'),
  ('uses', 'NNS'),
  ('idle', 'NN'),
  ('discharged.this', 'NN'),
  ('lie', 'NN'),
  ('amazon', 'NN'),
  ('lenove', 'NN'),
  ('battery', 'NN'),
  ('charger', 'NN'),
  ('hours', 'NNS'),
  ('don', 'NN')],
 [('i', 'NN'), ('%', 'NN'), ('cash', 'NN'), ('january..', 'NN')]]

You'll need to extract the tag from each resulting tuple, of course, and then limit the data to the desired tags.
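
As a small illustration of the two sub-steps (find the noun tags, then filter on them), the sketch below works on a few (term, tag) pairs taken from the outputs above. It uses an explicit set of tags rather than a regular expression; the variable names are chosen for this example only.

```python
# (term, tag) pairs taken from the outputs above
tagged_reviews = [
    [("good", "JJ"), ("but", "CC"), ("need", "VBP"),
     ("updates", "NNS"), ("and", "CC"), ("improvements", "NNS")],
    [("i", "NN"), ("%", "NN"), ("cash", "NN"), ("january..", "NN")],
]

# Step 1: collect every tag that occurs, then keep those that mark nouns
all_tags = {tag for review in tagged_reviews for _, tag in review}
noun_tags = {tag for tag in all_tags if tag.startswith("NN")}

# Step 2: keep only the terms whose tag is one of the noun tags
nouns = [[term for term, tag in review if tag in noun_tags]
         for review in tagged_reviews]

print(sorted(noun_tags))
print(nouns)
```

Note that the tagger labels tokens such as 'i' and '%' as nouns here, so the noun filter alone does not remove them; that is left to the stop word and punctuation step.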

In [12]:
# Extracting the terms (dropping the tags) from the noun tuples
only_nowns=[]
for sent in reviews_noun:
    nowns_row_wise=[]
    for tup in sent:
        nowns_row_wise.append(tup[0])
    only_nowns.append(nowns_row_wise)
only_nowns[0:5]

[['updates', 'improvements'],
 ['mobile',
  'i',
  'battery',
  'hell',
  'backup',
  'hours',
  'uses',
  'idle',
  'discharged.this',
  'lie',
  'amazon',
  'lenove',
  'battery',
  'charger',
  'hours',
  'don'],
 ['i', '%', 'cash', 'january..'],
 [],
 ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon']]

## Task 5. Lemmatize
 - you want different forms of the terms to be treated as one
 - don't worry about providing POS tag to lemmatizer for now

In [13]:
# Lemmatizing each noun to its base form (no POS tag passed, so the lemmatizer's default noun POS applies)
lemm = WordNetLemmatizer()
reviews_lemm=[]
for sent in only_nowns:
    reviews_lemm.append([lemm.lemmatize(word) for word in sent])

In [14]:
#checking the output data
reviews_lemm[0:2]

[['update', 'improvement'],
 ['mobile',
  'i',
  'battery',
  'hell',
  'backup',
  'hour',
  'us',
  'idle',
  'discharged.this',
  'lie',
  'amazon',
  'lenove',
  'battery',
  'charger',
  'hour',
  'don']]

## Task 6. Remove stop words and punctuation (if there are any at all after the POS tagging)

Use the standard NLTK stop word list together with the punctuation marks.

In [15]:
from string import punctuation
from nltk.corpus import stopwords
stop_nltk = stopwords.words("english")

In [16]:
stop_updated = stop_nltk + list(punctuation) + ["..."] + [".."]
reviews_sw_removed=[]
for sent in reviews_lemm:
    reviews_sw_removed.append([term for term in sent if term not in stop_updated])

In [17]:
reviews_sw_removed[3893]

['turbo',
 'charging',
 'day',
 'hour',
 'phone',
 'lenovo',
 'charger',
 'battery',
 'backup',
 'hour',
 'charge',
 'usage',
 'user',
 'forum',
 'complaint',
 'backup',
 'phone',
 'step',
 'support',
 'issue',
 'signal',
 'sim',
 'provider',
 'phone',
 'battery',
 'saver.the',
 'deca',
 'core',
 'processor',
 'phone',
 'mah',
 'battery',
 'backup',
 'phone',
 'point',
 'horsepower',
 'battery',
 'backup',
 'power',
 'wonder',
 'phone',
 'discount.the',
 'review',
 'amazon',
 'link',
 'phone']
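
The output above still contains fused tokens such as 'saver.the' and 'discount.the': a missing space after a full stop keeps the tokenizer from splitting them, so the stop word check never sees 'the' on its own. One possible extra cleaning pass, sketched below, splits tokens on punctuation and then filters again. The helper name `split_fused` and the tiny stop list are assumptions for illustration; the notebook itself uses the full NLTK list.

```python
import re
from string import punctuation

# Tiny illustrative stop list (an assumption; the notebook uses NLTK's English list)
stop_words = {"the", "this", "i", "don"}

def split_fused(tokens, stop_words):
    """Split tokens on punctuation and drop stop words and empty pieces."""
    splitter = re.compile("[" + re.escape(punctuation) + "]+")
    cleaned = []
    for token in tokens:
        for piece in splitter.split(token):
            if piece and piece not in stop_words:
                cleaned.append(piece)
    return cleaned

sample = ["battery", "saver.the", "discount.the", "january..", "%"]
print(split_fused(sample, stop_words))
```

Splitting on every punctuation mark is a blunt choice: it would also break up hyphenated terms, so whether it helps depends on the vocabulary.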

## Task 7. Create a topic model using LDA on the cleaned up data with 12 topics
 - what is the coherence of the model?
 
 Use gensim for this task

In [18]:
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.models import ldamodel



In [19]:
# Creating the dictionary that maps each unique term to an integer id
id2word = corpora.Dictionary(reviews_sw_removed)
texts = reviews_sw_removed
# Converting every review into a bag of words: a list of (term id, count) pairs
corpus = [id2word.doc2bow(text) for text in texts]

In [20]:
#Checking the corpus
print(corpus[100])

[(14, 2), (51, 1), (243, 1), (244, 1), (245, 1)]


In [21]:
#Checking the data of corpus
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]

[[('improvement', 1), ('update', 1)]]
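
To make the (id, count) pairs above concrete, here is a plain-Python sketch of what a bag-of-words conversion computes. It is not gensim's implementation: the helper names `build_vocab` and `to_bow` are hypothetical, and ids are assigned in first-seen order here, which need not match gensim's numbering.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to every distinct term, in first-seen order."""
    vocab = {}
    for doc in docs:
        for term in doc:
            vocab.setdefault(term, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (term id, count) pairs for the known terms of one document."""
    counts = Counter(vocab[term] for term in doc if term in vocab)
    return sorted(counts.items())

# Two short made-up documents
docs = [["update", "improvement"],
        ["battery", "backup", "battery", "charger"]]
vocab = build_vocab(docs)
print(to_bow(docs[1], vocab))
```

The repeated term 'battery' shows up once in the result with a count of 2, which is the same shape as the `(14, 2)` pair in the corpus output above.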

In [22]:
%%time
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=12, 
                                           random_state=42,
                                           passes=10,
                                           per_word_topics=True)

Wall time: 21 s


In [23]:
#checking the topics
pprint(lda_model.print_topics())

[(0,
  '0.250*"battery" + 0.054*"backup" + 0.045*"day" + 0.045*"hour" + '
  '0.044*"charger" + 0.031*"time" + 0.028*"charge" + 0.028*"life" + '
  '0.026*"problem" + 0.024*"issue"'),
 (1,
  '0.231*"camera" + 0.092*"quality" + 0.080*"phone" + 0.049*"performance" + '
  '0.036*"battery" + 0.022*"processor" + 0.018*"mode" + 0.012*"depth" + '
  '0.012*"speed" + 0.012*"picture"'),
 (2,
  '0.415*"mobile" + 0.038*"box" + 0.019*"item" + 0.017*"expectation" + '
  '0.016*"cable" + 0.014*"facility" + 0.012*"plz" + 0.012*"bug" + 0.011*"cost" '
  '+ 0.010*"phn"'),
 (3,
  '0.162*"money" + 0.062*"value" + 0.053*"superb" + 0.037*"smartphone" + '
  '0.030*"specification" + 0.028*"buy" + 0.027*"super" + 0.027*"gb" + '
  '0.024*"ram" + 0.018*"fast"'),
 (4,
  '0.264*"product" + 0.067*"amazon" + 0.030*"delivery" + 0.026*"hai" + '
  '0.023*"return" + 0.021*"h" + 0.019*"replacement" + 0.018*"service" + '
  '0.018*"lenovo" + 0.017*"customer"'),
 (5,
  '0.066*"call" + 0.035*"option" + 0.029*"handset" + 0.027*"hr

In [24]:
%%time
coherence_model_lda = CoherenceModel(model=lda_model, texts=reviews_sw_removed, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.5426220809226854
Wall time: 8.03 s
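
The task list also asks for the perplexity, which is not computed above. In gensim, `LdaModel.log_perplexity(corpus)` returns a per-word likelihood bound rather than the perplexity itself, and gensim's own log message reports the perplexity estimate as 2 raised to the negative of that bound. The helper below sketches that conversion; its name is hypothetical and the bound in the usage line is a made-up placeholder, not a result from this model.

```python
def perplexity_from_bound(per_word_bound):
    """Convert a per-word likelihood bound into a perplexity estimate (2 ** -bound)."""
    return 2.0 ** (-per_word_bound)

# With the fitted model one would obtain the bound with, for example:
#   bound = lda_model.log_perplexity(corpus)
# The value below is a made-up placeholder, not an output of this notebook.
example_bound = -7.0
print(perplexity_from_bound(example_bound))
```

Lower perplexity is better, so a bound closer to zero indicates a better fit on the evaluated documents.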


## Task 8. Analyze the topics, which pairs of topics can be combined?
 - you can assume that if a pair of topics has very similar top terms, they are very close and can be combined
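
The "very similar top terms" rule can be made measurable, for instance with the Jaccard overlap of two topics' top-term sets. The sketch below is one possible way to do this, not the method used in this notebook; the three term lists are short made-up examples and the 0.3 threshold is an arbitrary assumption.

```python
from itertools import combinations

def jaccard(a, b):
    """Share of terms two lists have in common, relative to all their distinct terms."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Made-up top terms for three topics (illustrative only)
top_terms = {
    0: ["battery", "backup", "hour", "charger"],
    1: ["battery", "charger", "hour", "heat"],
    2: ["camera", "quality", "picture", "mode"],
}

threshold = 0.3  # arbitrary cut-off chosen for this sketch
candidates = []
for i, j in combinations(top_terms, 2):
    score = jaccard(top_terms[i], top_terms[j])
    if score >= threshold:
        candidates.append((i, j))
        print(f"topics {i} and {j} overlap: {score:.2f}")
```

With a fitted model, the term lists could come from `show_topics(formatted=False)` instead of being typed in by hand.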

In [25]:
%%time
for i in range(5,12):
    lda_modeli = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=i, 
                                           random_state=42,
                                           passes=10,
                                           per_word_topics=True)
    coherence_model_lda = CoherenceModel(model=lda_modeli, texts=reviews_sw_removed, dictionary=id2word, coherence='c_v')
    coherence_lda = coherence_model_lda.get_coherence()
    print('topics no ',i,'Coherence Score: ', coherence_lda)

topics no  5 Coherence Score:  0.5251528821543048
topics no  6 Coherence Score:  0.5245406128917262
topics no  7 Coherence Score:  0.5563400136317252
topics no  8 Coherence Score:  0.5573730599122244
topics no  9 Coherence Score:  0.5771882389357432
topics no  10 Coherence Score:  0.5752965487587095
topics no  11 Coherence Score:  0.5402446563227769
Wall time: 3min 34s


**As the loop shows, coherence is highest (about 0.577) with 9 topics, so we build the final model with 9 topics.**
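
The best topic count can also be picked in code rather than by eye. The short sketch below uses the coherence scores printed above, rounded to four decimals; the variable names are chosen for this example.

```python
# Coherence scores from the sweep output above (rounded to four decimals)
coherence_by_k = {
    5: 0.5252, 6: 0.5245, 7: 0.5563, 8: 0.5574,
    9: 0.5772, 10: 0.5753, 11: 0.5402,
}

# Topic count with the highest coherence
best_k = max(coherence_by_k, key=coherence_by_k.get)
print(best_k, coherence_by_k[best_k])
```

The gap between 9 and 10 topics is small (about 0.002), so a different random seed could plausibly change the ranking of those two.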

### Looking at the topics and their terms, the following can be combined:

**Topics 2 and 5 possibly talk about 'pricing'  
Topics 4, 6 and 10 closely talk about 'battery related issues'  
Topics 3 and 11 vaguely talk about 'performance'**

## Task 9. Create topic model using LDA with what you think is the optimal number of topics

 - is the coherence better now?

In [26]:
lda_modeli9 = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=9, 
                                           random_state=42,
                                           passes=10,
                                           per_word_topics=True)
   

Listing the top 10 terms of each topic in the model

In [27]:
x = lda_modeli9.show_topics(formatted=False,num_words=10)
topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in x]
df_topics =pd.DataFrame(topics_words,columns =['Topic ID','Topics'])
df_topics.head(2)

Unnamed: 0,Topic ID,Topics
0,0,"[battery, phone, backup, day, hour, issue, tim..."
1,1,"[camera, quality, performance, battery, phone,..."


In [28]:
for topic,words in topics_words:
    print(str(topic)+ "::"+ str(words))
print()

0::['battery', 'phone', 'backup', 'day', 'hour', 'issue', 'time', 'life', 'problem', 'charge']
1::['camera', 'quality', 'performance', 'battery', 'phone', 'mode', 'everything', 'depth', 'clarity', 'photo']
2::['mobile', 'service', 'center', 'specification', 'cost', 'centre', 'month', 'lenovo', 'processor', 'purchase']
3::['problem', 'money', 'note', 'issue', 'k8', 'heating', 'device', 'waste', 'value', 'charger']
4::['product', 'amazon', 'delivery', 'service', 'customer', 'return', 'lenovo', 'replacement', 'day', 'time']
5::['phone', 'call', 'time', 'issue', 'network', 'problem', 'sim', 'update', 'software', 'handset']
6::['note', 'screen', 'speaker', 'sound', 'feature', 'music', 'camera', 'display', 'lenovo', 'glass']
7::['hai', 'h', 'heat', 'ho', 'excellent', 'k', 'bhi', 'hi', 'superb', 'ye']
8::['phone', 'price', 'range', 'feature', 'budget', 'box', 'headphone', 'superb', 'earphone', 'headset']



In [29]:
pyLDAvis.enable_notebook()

In [30]:
%%time
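# Note: lda_model is the 12-topic model; pass lda_modeli9 here to visualize the 9-topic model instead
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)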

Wall time: 4min 47s


In [31]:
vis

In [32]:
from gensim.models import LsiModel

In [33]:
lsamodel = LsiModel(corpus, num_topics=9, id2word = id2word)  # train model
print(lsamodel.show_topics(num_topics=9, num_words=2,formatted=False))

[(0, [('phone', 0.8297863570288482), ('camera', 0.2823636478813544)]), (1, [('camera', 0.5852816670863007), ('phone', -0.5158954773068307)]), (2, [('camera', 0.6528325578028571), ('battery', -0.5523657385514037)]), (3, [('product', -0.6917433151739936), ('battery', 0.5287808509381974)]), (4, [('note', -0.5829068698642121), ('product', 0.5480949767436817)]), (5, [('note', -0.48554388128581116), ('issue', 0.4795369357090081)]), (6, [('issue', 0.6840945975852147), ('problem', -0.6724311171588435)]), (7, [('mobile', -0.8378712596999436), ('problem', 0.38358999978606345)]), (8, [('quality', -0.922469610859012), ('camera', 0.2591717485469043)])]


In [34]:
x = lsamodel.show_topics(formatted=False,num_words=10)
topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in x]
for topic,words in topics_words:
    print(str(topic)+ "::"+ str(words))
print()

0::['phone', 'camera', 'battery', 'issue', 'note', 'quality', 'problem', 'day', 'lenovo', 'time']
1::['camera', 'phone', 'battery', 'quality', 'product', 'mobile', 'note', 'issue', 'backup', 'performance']
2::['camera', 'battery', 'product', 'issue', 'problem', 'quality', 'day', 'mobile', 'hour', 'time']
3::['product', 'battery', 'note', 'lenovo', 'problem', 'issue', 'amazon', 'mobile', 'service', 'k8']
4::['note', 'product', 'mobile', 'k8', 'lenovo', 'issue', 'battery', 'camera', 'phone', 'feature']
5::['note', 'issue', 'problem', 'product', 'mobile', 'battery', 'k8', 'network', 'camera', 'heating']
6::['issue', 'problem', 'mobile', 'heating', 'note', 'time', 'cell', 'lenovo', 'product', 'software']
7::['mobile', 'problem', 'note', 'issue', 'price', 'device', 'feature', 'k8', 'heating', 'product']
8::['quality', 'camera', 'time', 'product', 'note', 'mobile', 'sound', 'hai', 'video', 'picture']



## Task 10. The business finally needs to be able to interpret the topics
 - name each of the identified topics
 - create a table with the topic name and the top 10 terms in each to present to business

# Conclusion:
Comparing the LDA and LSA models, the LSA model looks more promising.

For the LDA model, the topics are identified as:

- Topic 0, **issue in backup, phone, charging**: ['battery', 'phone', 'backup', 'day', 'hour', 'issue', 'time', 'life', 'problem', 'charge']
- Topic 1, **camera**: ['camera', 'quality', 'performance', 'battery', 'phone', 'mode', 'everything', 'depth', 'clarity', 'photo']
- Topic 2, **money**: ['mobile', 'service', 'center', 'specification', 'cost', 'centre', 'month', 'lenovo', 'processor', 'purchase']
- Topic 3, **charger**: ['problem', 'money', 'note', 'issue', 'k8', 'heating', 'device', 'waste', 'value', 'charger']
- Topic 4, **amazon delivery**: ['product', 'amazon', 'delivery', 'service', 'customer', 'return', 'lenovo', 'replacement', 'day', 'time']
- Topic 5, **handset problem**: ['phone', 'call', 'time', 'issue', 'network', 'problem', 'sim', 'update', 'software', 'handset']
- Topic 6, **speaker problem**: ['note', 'screen', 'speaker', 'sound', 'feature', 'music', 'camera', 'display', 'lenovo', 'glass']
- Topic 7, **not sure, possibly heating problem**: ['hai', 'h', 'heat', 'ho', 'excellent', 'k', 'bhi', 'hi', 'superb', 'ye']
- Topic 8, **phone quality**: ['phone', 'price', 'range', 'feature', 'budget', 'box', 'headphone', 'superb', 'earphone', 'headset']
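
Task 10 asks for a table of topic names and top terms. A minimal pandas sketch follows, using three of the LDA topics listed above; the column names are assumptions, and the remaining topics would be added the same way.

```python
import pandas as pd

# Top 10 terms for three of the LDA topics, copied from the output above
topic_terms = {
    1: ['camera', 'quality', 'performance', 'battery', 'phone',
        'mode', 'everything', 'depth', 'clarity', 'photo'],
    4: ['product', 'amazon', 'delivery', 'service', 'customer',
        'return', 'lenovo', 'replacement', 'day', 'time'],
    5: ['phone', 'call', 'time', 'issue', 'network',
        'problem', 'sim', 'update', 'software', 'handset'],
}
# Names assigned to those topics in the conclusion
topic_names = {1: 'camera', 4: 'amazon delivery', 5: 'handset problem'}

# One row per topic: id, business-facing name, and the terms as a single string
topic_table = pd.DataFrame({
    'Topic ID': list(topic_terms),
    'Topic name': [topic_names[t] for t in topic_terms],
    'Top 10 terms': [', '.join(terms) for terms in topic_terms.values()],
})
print(topic_table.to_string(index=False))
```

Joining the terms into one string keeps each topic on a single row, which tends to be easier to paste into a slide or spreadsheet.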

For the LSA model, the topics are identified as:

- Topic 0, **issue in backup, phone, charging**: ['phone', 'camera', 'battery', 'issue', 'note', 'quality', 'problem', 'day', 'lenovo', 'time']
- Topic 1, **issue in performance**: ['camera', 'phone', 'battery', 'quality', 'product', 'mobile', 'note', 'issue', 'backup', 'performance']
- Topic 2, **same feature as above**: ['camera', 'battery', 'product', 'issue', 'problem', 'quality', 'day', 'mobile', 'hour', 'time']
- Topic 3, **same issue as above**: ['product', 'battery', 'note', 'lenovo', 'problem', 'issue', 'amazon', 'mobile', 'service', 'k8']
- Topic 4, **same problem as topic 1**: ['note', 'product', 'mobile', 'k8', 'lenovo', 'issue', 'battery', 'camera', 'phone', 'feature']
- Topic 5, **heating problem**: ['note', 'issue', 'problem', 'product', 'mobile', 'battery', 'k8', 'network', 'camera', 'heating']
- Topic 6, **handset problem**: ['issue', 'problem', 'mobile', 'heating', 'note', 'time', 'cell', 'lenovo', 'product', 'software']
- Topic 7, **heating**: ['mobile', 'problem', 'note', 'issue', 'price', 'device', 'feature', 'k8', 'heating', 'product']
- Topic 8, **phone quality**: ['quality', 'camera', 'time', 'product', 'note', 'mobile', 'sound', 'hai', 'video', 'picture']