# COURSE:   PGP [AI&ML]

## Learner :  Chaitanya Kumar Battula
## Module  : NLP
## Topic   :  Topic Analysis of Review Data.

# Tasks

## 1)	Read the .csv file using Pandas. Take a look at the top few records.
## 2)	Normalize casings for the review text and extract the text into a list for easier manipulation.
## 3)	Tokenize the reviews using NLTK's word_tokenize function.
## 4)	Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.
## 5)	For the topic model, we want to include only nouns.
#####  5.1) Find out all the POS tags that correspond to nouns.
##### 	5.2) Limit the data to only terms with these tags.
## 6)	Lemmatize. 
##### 6.1) Different forms of the terms need to be treated as one.
##### 6.2) No need to provide POS tag to lemmatizer for now.
## 7)	Remove stopwords and punctuation (if there are any). 
## 8)	Create a topic model using LDA on the cleaned up data with 12 topics.
##### 	8.1) Print out the top terms for each topic.
##### 	8.2) What is the coherence of the model with the c_v metric?
## 9)	Analyze the topics through the business lens.
##### 	Determine which of the topics can be combined.
## 10)	Create topic model using LDA with what you think is the optimal number of topics
##### 	What is the coherence of the model?
## 11)	The business should be able to interpret the topics.
##### 11.1) Name each of the identified topics.
##### 11.2) Create a table with the topic name and the top 10 terms in each to present to the business.


#              PROJECT  CODE   STARTS   FROM  HERE 

In [1]:
import pandas as pd
import numpy as np

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from string import punctuation

from pprint import pprint

import re

import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

import pyLDAvis
import pyLDAvis.gensim  

import matplotlib.pyplot as plt
%matplotlib inline

# Task-1


#####  Task  1-1]
1)	Read the .csv file using Pandas. Take a look at the top few records.

In [2]:
Df= pd.read_csv('K8 Reviews v0.2.csv')
Df.head()

Unnamed: 0,sentiment,review
0,1,Good but need updates and improvements
1,0,"Worst mobile i have bought ever, Battery is dr..."
2,1,when I will get my 10% cash back.... its alrea...
3,1,Good
4,0,The worst phone everThey have changed the last...


#####   Task  1.2         Understand Data  

In [3]:
number_of_rows = len(Df.index)
number_of_cols = len(Df.columns)



print("<< OUTPUT >>")
print("Number Of rows:",  number_of_rows)
print('Number of columns :', number_of_cols)
print()
print("Data Info:-")
Df.info()

<< OUTPUT >>
Number Of rows: 14675
Number of columns : 2

Data Info:-
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14675 entries, 0 to 14674
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   sentiment  14675 non-null  int64 
 1   review     14675 non-null  object
dtypes: int64(1), object(1)
memory usage: 229.4+ KB


#####    Task  1.3    Data Sampling 

In [4]:
# Sampling 0.01%  of data randomly 
print("<< OUTPUT >>")
print("Data sampling:-")
rows = Df.sample(frac =0.0001)

# display
rows

<< OUTPUT >>
Data sampling:-


Unnamed: 0,sentiment,review
14140,0,"Faced Heating issue, hanging issue, sound is n..."


In [5]:
# Sample 3 rows at random
print("<< OUTPUT >>")
print("Data sampling:-")

rows = Df.sample(n = 3)
rows

<< OUTPUT >>
Data sampling:-


Unnamed: 0,sentiment,review
14381,1,Good product at good price...
10589,1,Very good
11788,0,Not good


In [6]:
# Sample any one row randomly
print("<< OUTPUT >>")
print("Data sampling:-")


Df.review.sample()

<< OUTPUT >>
Data sampling:-


1194    Product quality is not as good as it describes...
Name: review, dtype: object

In [7]:
print("<< OUTPUT >>")
print("Data sampling:-")

Df.review.sample().values[0]


<< OUTPUT >>
Data sampling:-


'Camera Quality is good not super'

# Task- 2  



## 2.1]     Extract the text into a list for easier manipulation.

In [8]:
Reviews = Df['review'].tolist()

print("<< OUTPUT >>")
print("All the text is put in a List!!")
print("Verifying if the data is a List or not:")
print("Reviews Object is a", type(Reviews))

<< OUTPUT >>
All the text is put in a List!!
Verifying if the data is a List or not:
Reviews Object is a <class 'list'>


## 2.2]    Normalize casings for the review text 

In [9]:
def To_LowerCase(Reviews):
    Normalized = [Review.lower() for Review in Reviews]
    return Normalized

Reviews_LowerCase = To_LowerCase(Reviews)



print("<< OUTPUT >>")
print("All the text converted to Lower Cases. !!")
print()
print("Verifying if the text is in lower case or not:")
print("Before Normalizing:\n", Reviews[0])
print()
print("After Normalizing:\n",  Reviews_LowerCase[0])

<< OUTPUT >>
All the text converted to Lower Cases. !!

Verifying if the text is in lower case or not:
Before Normalizing:
 Good but need updates and improvements

After Normalizing:
 good but need updates and improvements


# Task- 3
## Tokenize the reviews using NLTK's word_tokenize function.

In [10]:
def Tokenize(Reviews):
    # Split each lower-cased review into word tokens with NLTK's word_tokenize
    Tokens = [word_tokenize(Review) for Review in Reviews]
    return Tokens


Reviews_Tokens  = Tokenize(Reviews_LowerCase)



print("<< OUTPUT >>")
print("All the text converted to tokens !!")
print()
print("Verifying if tokenized:")
print("Before Tokenizing:\n", Reviews_LowerCase[0])
print()
print("After Tokenizing:\n",  Reviews_Tokens[0])

<< OUTPUT >>
All the text converted to tokens !!

Verifying if tokenized:
Before Tokenizing:
 good but need updates and improvements

After Tokenizing:
 ['good', 'but', 'need', 'updates', 'and', 'improvements']


# Task.4
## Perform parts-of-speech tagging on each sentence using the NLTK POS tagger.

In [11]:
def POS_Tagging(Reviews):
    Taggs = [nltk.pos_tag(Review) for Review in Reviews]
    return Taggs
    
    
 

Reviews_Tagged  = POS_Tagging(Reviews_Tokens)


print("<< OUTPUT >>")
print("All the text tagged with parts of speech: !!")
print()
print("\t-:Verifying if tagged:-")
print("Before Tagging:\n", Reviews_Tokens[0])
print()
print("After Tagging:\n",  Reviews_Tagged[0])

<< OUTPUT >>
All the text tagged with parts of speech: !!

	-:Verifying if tagged:-
Before Tagging:
 ['good', 'but', 'need', 'updates', 'and', 'improvements']

After Tagging:
 [('good', 'JJ'), ('but', 'CC'), ('need', 'VBP'), ('updates', 'NNS'), ('and', 'CC'), ('improvements', 'NNS')]


# 5)	For the topic model, we want to include only nouns.


#####   5.1) Find out all the POS tags that correspond to nouns.
##### 	5.2) Limit the data to only terms with these tags. 

In [12]:

def Nouns(tagged):
    Big_List=[]
    for WTList in tagged:
        Small_List = []
        for (Word, Tag) in WTList:
            # Penn Treebank noun tags are NN, NNS, NNP and NNPS;
            # only singular common nouns (NN) are kept here.
            if (Tag == 'NN'):
                Small_List.append(Word)
        Big_List.append(Small_List)
      
    return Big_List   


Reviews_Nouns = Nouns(Reviews_Tagged)




i = 4
print("<< OUTPUT >>")
print("All the nouns extracted successfully !!!")
print()
print("\t-:Verifying in the 5th review, if nouns alone are extracted:-")
print()
print("Prior Extracting Nouns:\n\t", Reviews_Tagged[i])
print()
print("After Extracting Nouns:\n\t", Reviews_Nouns[i])

<< OUTPUT >>
All the nouns extracted successfully !!!

	-:Verifying in the 5th review, if nouns alone are extracted:-

Prior Extracting Nouns:
	 [('the', 'DT'), ('worst', 'JJS'), ('phone', 'NN'), ('everthey', 'NN'), ('have', 'VBP'), ('changed', 'VBN'), ('the', 'DT'), ('last', 'JJ'), ('phone', 'NN'), ('but', 'CC'), ('the', 'DT'), ('problem', 'NN'), ('is', 'VBZ'), ('still', 'RB'), ('same', 'JJ'), ('and', 'CC'), ('the', 'DT'), ('amazon', 'NN'), ('is', 'VBZ'), ('not', 'RB'), ('returning', 'VBG'), ('the', 'DT'), ('phone', 'NN'), ('.highly', 'RB'), ('disappointing', 'JJ'), ('of', 'IN'), ('amazon', 'NN')]

After Extracting Nouns:
	 ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon']
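
Task 5.1 asks for all POS tags that correspond to nouns. In the Penn Treebank tag set used by the NLTK tagger these are `NN`, `NNS`, `NNP` and `NNPS`. The following is a minimal sketch of a filter that keeps all four. The helper name `keep_all_nouns` is hypothetical, and the input is the tagged first review shown in the Task 4 output. It is not the filter that produced the results in this notebook, which keeps only `NN`.

```python
# Sketch: keep every Penn Treebank noun tag (NN, NNS, NNP, NNPS).
# `keep_all_nouns` is a hypothetical helper, not part of the notebook code.

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def keep_all_nouns(tagged_reviews):
    """Return, for each tagged review, only the words tagged as nouns."""
    return [
        [word for word, tag in review if tag in NOUN_TAGS]
        for review in tagged_reviews
    ]


# Tagged first review, copied from the POS-tagging output.
sample = [[('good', 'JJ'), ('but', 'CC'), ('need', 'VBP'),
           ('updates', 'NNS'), ('and', 'CC'), ('improvements', 'NNS')]]

all_nouns = keep_all_nouns(sample)
print(all_nouns)
```

With the `NN`-only filter this review yields an empty list, whereas the wider tag set keeps its plural nouns.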


# 6)	Lemmatize. 
##### 6.1) Different forms of the terms need to be treated as one.
##### 6.2) No need to provide POS tag to lemmatizer for now.

In [13]:
def Lemmatize(Nouns):
    L = WordNetLemmatizer()
    Big_List=[]
    for NounList in Nouns:
        Small_List = []
        for Noun in NounList:
            Lemmatized = L.lemmatize(Noun)
            Small_List.append(Lemmatized)
        
        #print(nouns_List)
        Big_List.append(Small_List)
      
    return Big_List   




Reviews_Lemmatized = Lemmatize(Reviews_Nouns)



i = 7
print("<< OUTPUT >>")
print("All the nouns are lemmatized !!!")
print()
print("\t-:Verifying if nouns are lemmatized:-")

print("Prior Lemmatizing :", Reviews_Nouns[i])
print()
print("After Lemmatizing :", Reviews_Lemmatized[i])

<< OUTPUT >>
All the nouns are lemmatized !!!

	-:Verifying if nouns are lemmatized:-
Prior Lemmatizing : ['battery', 'level']

After Lemmatizing : ['battery', 'level']


# Topic:
## 7)	Remove stopwords and punctuation (if there are any). 

#### Removing various punctuations 

In [14]:
def Remove_NonAlpha(Reviews):
    cleaned = [word for word in Reviews if word.isalpha()]
    return  cleaned


Reviews_No_Punctuations = [Remove_NonAlpha(Reviews) for Reviews in Reviews_Lemmatized]




i = 2
print("<< OUTPUT >>")
print("All Punctuations Removed !!!")
print()
print("\t-:Verifying if punctuations are removed or not:-")

print("Prior Removing Punctuations :\n", Reviews_Lemmatized[i])
print()
print("After Removing Punctuations:\n", Reviews_No_Punctuations[i])



<< OUTPUT >>
All Punctuations Removed !!!

	-:Verifying if punctuations are removed or not:-
Prior Removing Punctuations :
 ['i', '%', 'cash', '..']

After Removing Punctuations:
 ['i', 'cash']


##### Removing Stopwords 

In [15]:
stop_words = stopwords.words("english")



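def Remove_stops(Reviews):
    # Caution: this is called on the list of token lists, so each `Review`
    # is a list and never equals a stop-word string; tokens such as 'i' remain.
    Removed = [Review for Review in Reviews if Review not in stop_words]
    return Removed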

Cleaned_Reviews =  Remove_stops(Reviews_No_Punctuations)



In [16]:
MostFrequentWord = ['phone']



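def Remove_MostFrequentWord(Reviews):
    # Caution: `Reviews` is the list of token lists, so each `Review` is a list
    # and never matches 'phone'; the token survives in the output below.
    Removed = [Review for Review in Reviews if Review not in MostFrequentWord]
    return Removed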

Cleaned_Reviews =  Remove_MostFrequentWord(Cleaned_Reviews)
Cleaned_Reviews

[[],
 ['mobile',
  'i',
  'battery',
  'hell',
  'backup',
  'idle',
  'lie',
  'amazon',
  'lenove',
  'battery',
  'charger',
  'don'],
 ['i', 'cash'],
 [],
 ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon'],
 ['camerawaste', 'money'],
 ['phone', 'allot', 'reason'],
 ['battery', 'level'],
 ['phone', 'hanging', 'note', 'station', 'ahmedabad', 'phone', 'lenovo'],
 ['lot', 'thing'],
 ['wrost'],
 ['phone', 'charger', 'damage'],
 ['item', 'battery', 'life'],
 ['battery', 'problem', 'motherboard', 'problem', 'mobile', 'life'],
 ['phone', 'slim', 'battry', 'backup', 'screen'],
 ['headset'],
 ['time', 'i'],
 ['product',
  'prize',
  'range',
  'specification',
  'comparison',
  'mobile',
  'range',
  'i',
  'phone',
  'seal',
  'credit',
  'card',
  'i',
  'deal',
  'amazon'],
 ['battery', 'battery', 'life'],
 ['smartphone'],
 [],
 ['galery', 'problem', 'speaker', 'phone'],
 ['camera', 'battery'],
 ['product'],
 ['product', 'camera', 'o', 'battery', 'phone', 'product'],
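
The output above still contains tokens such as 'i' and 'phone', because the two filters compare whole reviews (lists) against the word lists rather than individual tokens. Below is a minimal sketch of a token-level filter. `remove_words` is a hypothetical helper, and `STOP_WORDS` is a small hand-written stand-in for `stopwords.words("english")` so that the example runs without the NLTK corpus. Applying such a filter would change every downstream result in this notebook.

```python
# Sketch: remove stop words and over-frequent words token by token.
# STOP_WORDS is a small hypothetical stand-in for stopwords.words("english").

STOP_WORDS = {"i", "the", "a", "is", "and", "but"}
FREQUENT_WORDS = {"phone"}


def remove_words(reviews, unwanted):
    """Drop every token that appears in `unwanted` from each review."""
    return [[tok for tok in review if tok not in unwanted] for review in reviews]


# Three token lists copied from the output above.
reviews = [
    ['i', 'cash'],
    ['phone', 'everthey', 'phone', 'problem', 'amazon', 'phone', 'amazon'],
    ['time', 'i'],
]

cleaned = remove_words(reviews, STOP_WORDS | FREQUENT_WORDS)
print(cleaned)
```

Using sets for the unwanted words keeps each membership test cheap, which matters when the same check runs over every token of roughly 14,000 reviews.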


# 8)	Create a topic model using LDA on the cleaned up data with 12 topics.
##### 	8.1) Print out the top terms for each topic.
##### 	8.2) What is the coherence of the model with the c_v metric?



In [17]:
# Create Dictionary
dictionary = corpora.Dictionary(Cleaned_Reviews)


# Create Corpus
texts = Cleaned_Reviews


# Term Document Frequency
corpus = [dictionary.doc2bow(text) for text in texts]

# View
print("<< OUTPUT >>")
print("\t-:Viewing a sample of the bag-of-words corpus:-")
print()
print("Bag-of-words (id, count) pairs:", corpus[:2])
print("Words for ids 0 and 1:\n", [dictionary[0],dictionary[1]] )

[[(dictionary[id], freq) for id, freq in cp] for cp in corpus[:2]]


<< OUTPUT >>
	-:Viewing a sample of the bag-of-words corpus:-

Bag-of-words (id, count) pairs: [[], [(0, 1), (1, 1), (2, 2), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]]
Words for ids 0 and 1:
 ['amazon', 'backup']


[[],
 [('amazon', 1),
  ('backup', 1),
  ('battery', 2),
  ('charger', 1),
  ('don', 1),
  ('hell', 1),
  ('i', 1),
  ('idle', 1),
  ('lenove', 1),
  ('lie', 1),
  ('mobile', 1)]]
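
To make the (id, count) pairs above easier to read, here is a plain-Python sketch of the bag-of-words idea behind `doc2bow`. It does not use gensim. Assigning ids to the sorted unique tokens of a single review is an assumption made for illustration, and the token list is the second review from the cleaned output.

```python
from collections import Counter

# Sketch of the bag-of-words idea behind Dictionary.doc2bow, in plain Python.
# The id assignment (sorted unique tokens) is an assumption for illustration.

tokens = ['mobile', 'i', 'battery', 'hell', 'backup', 'idle', 'lie',
          'amazon', 'lenove', 'battery', 'charger', 'don']

# Map each distinct token to an integer id.
token2id = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}

# Count how often each id occurs in the review.
counts = Counter(token2id[tok] for tok in tokens)
bow = sorted(counts.items())
print(bow)
```

Each pair holds a token id and the number of times that token occurs in the review, so 'battery', which appears twice, is the only entry with a count of 2.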

In [18]:
# Build LDA model

NUM_TOPICS = 12
Lda_model_1 = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word = dictionary,
                                           num_topics=NUM_TOPICS, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [19]:
# Print the Keyword in the 12 topics
print("<< OUTPUT >>")
print("\t-:Viewing keywords from topics:-")


pprint(Lda_model_1.print_topics())
doc_lda = Lda_model_1[corpus]

<< OUTPUT >>
	-:Viewing keywords from topics:-
[(0,
  '0.286*"battery" + 0.079*"time" + 0.038*"charger" + 0.032*"life" + '
  '0.028*"waste" + 0.028*"update" + 0.026*"charge" + 0.025*"speed" + '
  '0.025*"use" + 0.021*"turbo"'),
 (1,
  '0.112*"day" + 0.109*"device" + 0.091*"mode" + 0.061*"usage" + 0.050*"depth" '
  '+ 0.044*"mark" + 0.033*"thing" + 0.029*"model" + 0.026*"effect" + '
  '0.025*"mah"'),
 (2,
  '0.157*"issue" + 0.098*"amazon" + 0.041*"h" + 0.041*"app" + 0.041*"hai" + '
  '0.037*"charging" + 0.029*"buy" + 0.023*"refund" + 0.022*"cast" + '
  '0.022*"ho"'),
 (3,
  '0.367*"product" + 0.264*"mobile" + 0.062*"display" + 0.041*"superb" + '
  '0.015*"super" + 0.015*"excellent" + 0.013*"hand" + 0.013*"lag" + '
  '0.011*"selfie" + 0.011*"resolution"'),
 (4,
  '0.102*"heating" + 0.099*"network" + 0.059*"processor" + 0.050*"sim" + '
  '0.045*"music" + 0.039*"lot" + 0.032*"ram" + 0.024*"card" + 0.023*"jio" + '
  '0.022*"star"'),
 (5,
  '0.254*"money" + 0.210*"backup" + 0.088*"value" + 0

# 9)	Analyze the topics through the business lens.
##### 	Determine which of the topics can be combined.



In [20]:
word_dict = {}
for i in range(NUM_TOPICS):
    words = Lda_model_1.show_topic(i, topn = 20)
    word_dict["Topic # " + "{}".format(i+1)] = [i[0] for i in words]

print("<< OUTPUT >>")
print("\t-:Viewing keywords from topics:-")    
pd.DataFrame(word_dict)    

<< OUTPUT >>
	-:Viewing keywords from topics:-


Unnamed: 0,Topic # 1,Topic # 2,Topic # 3,Topic # 4,Topic # 5,Topic # 6,Topic # 7,Topic # 8,Topic # 9,Topic # 10,Topic # 11,Topic # 12
0,battery,day,issue,product,heating,money,quality,screen,i,problem,phone,camera
1,time,device,amazon,mobile,network,backup,delivery,call,note,performance,price,heat
2,charger,mode,h,display,processor,value,everything,option,lenovo,service,sound,speaker
3,life,usage,app,superb,sim,work,bit,replacement,range,software,voice,video
4,waste,depth,hai,super,music,worth,clarity,glass,month,customer,anything,system
5,update,mark,charging,excellent,lot,volume,light,support,please,center,processing,apps
6,charge,thing,buy,hand,ram,fingerprint,box,volta,feature,purchase,recorder,flash
7,speed,model,refund,lag,card,nice,front,internet,handset,touch,help,till
8,use,effect,cast,selfie,jio,net,smartphone,set,budget,care,game,killer
9,turbo,mah,ho,resolution,star,wastage,photo,gorilla,dolby,look,body,user


###  Determine which of the topics can be combined.

Topics 3, 4, 5, 6 and 12 could be combined, for two reasons:

1) Their circles overlap and form a cluster in the pyLDAvis plot.
2) They have common keywords such as heat, display, speaker and video.

These topics relate to complaints about the product.

In [21]:
import pyLDAvis.gensim

Lda_display = pyLDAvis.gensim.prepare(Lda_model_1, corpus, dictionary, sort_topics=False)
pyLDAvis.display(Lda_display)

In [22]:
print("<< OUTPUT >>")

# Compute Perplexity
print('\nPerplexity: ', Lda_model_1.log_perplexity(corpus))  # per-word log-likelihood bound; perplexity = 2**(-bound), so a value closer to zero is better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=Lda_model_1, texts=Cleaned_Reviews , dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

<< OUTPUT >>

Perplexity:  -7.1395742271151565

Coherence Score:  0.4129194953003848


# 10)	Create topic model using LDA with what you think is the optimal number of topics
##### 	What is the coherence of the model?


# 10.1       Build a Model with 6 Topics

In [23]:
NUM_TOPICS = 6
Lda_model_2 = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word = dictionary,
                                           num_topics=NUM_TOPICS, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

In [24]:
topics = Lda_model_2.show_topics()


print("<< OUTPUT >>")
print("\t-:Viewing keywords from topics:-")    
print()

for topic in topics:
    print(topic)
    print()
    
    


<< OUTPUT >>
	-:Viewing keywords from topics:-

(0, '0.353*"phone" + 0.126*"i" + 0.046*"price" + 0.041*"time" + 0.039*"lenovo" + 0.018*"display" + 0.017*"processor" + 0.017*"software" + 0.014*"update" + 0.014*"charge"')

(1, '0.178*"battery" + 0.126*"product" + 0.081*"problem" + 0.036*"heating" + 0.030*"backup" + 0.027*"day" + 0.026*"device" + 0.024*"charger" + 0.022*"range" + 0.022*"mode"')

(2, '0.070*"amazon" + 0.068*"screen" + 0.029*"h" + 0.027*"video" + 0.026*"charging" + 0.024*"budget" + 0.023*"replacement" + 0.023*"glass" + 0.023*"stock" + 0.022*"android"')

(3, '0.122*"note" + 0.048*"service" + 0.035*"sound" + 0.031*"delivery" + 0.024*"hai" + 0.023*"month" + 0.022*"please" + 0.022*"feature" + 0.022*"handset" + 0.019*"dolby"')

(4, '0.065*"network" + 0.045*"call" + 0.033*"sim" + 0.033*"waste" + 0.029*"speaker" + 0.026*"app" + 0.026*"lot" + 0.025*"option" + 0.022*"customer" + 0.020*"support"')

(5, '0.215*"camera" + 0.099*"mobile" + 0.088*"quality" + 0.057*"issue" + 0.054*"perfor

In [25]:
word_dict = {}
for i in range(NUM_TOPICS):
    words = Lda_model_2.show_topic(i, topn = 20)
    word_dict["Topic # " + "{}".format(i+1)] = [i[0] for i in words]

print("<< OUTPUT >>")
print("\t-:Viewing keywords from topics:-")
pd.DataFrame(word_dict)

<< OUTPUT >>
	-:Viewing keywords from topics:-


Unnamed: 0,Topic # 1,Topic # 2,Topic # 3,Topic # 4,Topic # 5,Topic # 6
0,phone,battery,amazon,note,network,camera
1,i,product,screen,service,call,mobile
2,price,problem,h,sound,sim,quality
3,time,heating,video,delivery,waste,issue
4,lenovo,backup,charging,hai,speaker,performance
5,display,day,budget,month,app,money
6,processor,device,replacement,please,lot,heat
7,software,charger,glass,feature,option,life
8,update,range,stock,handset,customer,superb
9,charge,mode,android,dolby,support,value


In [26]:
import pyLDAvis.gensim

Lda_display = pyLDAvis.gensim.prepare(Lda_model_2, corpus, dictionary, sort_topics=False)
pyLDAvis.display(Lda_display)

In [27]:
print("<< OUTPUT >>")

# Compute Perplexity
print('\nPerplexity: ', Lda_model_2.log_perplexity(corpus))  # per-word log-likelihood bound; perplexity = 2**(-bound), so a value closer to zero is better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=Lda_model_2, texts=Cleaned_Reviews , dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

<< OUTPUT >>

Perplexity:  -6.364815884849992

Coherence Score:  0.4580549208202087


## 10.2     Build a Model with 18 topics 

In [28]:
NUM_TOPICS = 18
Lda_model_3 = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word = dictionary,
                                           num_topics=NUM_TOPICS, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)



topics = Lda_model_3.show_topics()

print("<< OUTPUT >>")
print("\t-:Viewing keywords from topics:-")    
print()


for topic in topics:
    print(topic)
    print()

<< OUTPUT >>
	-:Viewing keywords from topics:-

(7, '0.446*"update" + 0.061*"oreo" + 0.000*"software" + 0.000*"heat" + 0.000*"condition" + 0.000*"glass" + 0.000*"rating" + 0.000*"zero" + 0.000*"minus" + 0.000*"user"')

(14, '0.310*"service" + 0.204*"sim" + 0.136*"customer" + 0.097*"center" + 0.034*"one" + 0.000*"care" + 0.000*"jio" + 0.000*"company" + 0.000*"support" + 0.000*"heat"')

(3, '0.707*"lenovo" + 0.000*"hai" + 0.000*"k" + 0.000*"feature" + 0.000*"ko" + 0.000*"software" + 0.000*"refund" + 0.000*"chor" + 0.000*"h" + 0.000*"support"')

(5, '0.705*"performance" + 0.021*"cash" + 0.000*"heat" + 0.000*"everything" + 0.000*"ram" + 0.000*"company" + 0.000*"clarity" + 0.000*"budget" + 0.000*"software" + 0.000*"sensor"')

(8, '0.662*"price" + 0.057*"ok" + 0.049*"point" + 0.000*"ram" + 0.000*"purchase" + 0.000*"software" + 0.000*"dolby" + 0.000*"feature" + 0.000*"everything" + 0.000*"clarity"')

(2, '0.634*"product" + 0.164*"amazon" + 0.048*"buy" + 0.044*"work" + 0.013*"focus" + 0.000*"r

In [29]:
import pyLDAvis.gensim

Lda_display = pyLDAvis.gensim.prepare(Lda_model_3, corpus, dictionary, sort_topics=False)
pyLDAvis.display(Lda_display)

In [30]:
print("<< OUTPUT >>")

# Compute Perplexity
print('\nPerplexity: ', Lda_model_3.log_perplexity(corpus))  # per-word log-likelihood bound; perplexity = 2**(-bound), so a value closer to zero is better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=Lda_model_3, texts=Cleaned_Reviews , dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

<< OUTPUT >>

Perplexity:  -13.10255490116688

Coherence Score:  0.40705347424212474
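
With three models trained, their c_v scores can be compared side by side. The sketch below only collects the coherence values printed in this notebook and picks the largest; the variable names are hypothetical. Coherence is one criterion among several, and how easily the business can interpret the topics still has to be judged separately.

```python
# Sketch: compare the c_v coherence scores reported for the three models.
# Scores are copied from the outputs in this notebook; higher c_v is better.

coherence_by_num_topics = {
    6: 0.4580549208202087,
    12: 0.4129194953003848,
    18: 0.40705347424212474,
}

# Pick the topic count with the highest coherence.
best_k = max(coherence_by_num_topics, key=coherence_by_num_topics.get)

for k, score in sorted(coherence_by_num_topics.items()):
    print(f"{k:>2} topics: c_v = {score:.3f}")
print("Highest coherence:", best_k, "topics")
```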


# 11)	The business should be able to interpret the topics.

##### 11.1) Name each of the identified topics.


In [31]:
word_dict = {}
for i in range(NUM_TOPICS):
    words = Lda_model_3.show_topic(i, topn = 10)
    word_dict["Topic # " + "{}".format(i+1)] = [i[0] for i in words]

print("<< OUTPUT >>")
print("\t-:Viewing keywords from topics:-")    
pd.DataFrame(word_dict)

<< OUTPUT >>
	-:Viewing keywords from topics:-


Unnamed: 0,Topic # 1,Topic # 2,Topic # 3,Topic # 4,Topic # 5,Topic # 6,Topic # 7,Topic # 8,Topic # 9,Topic # 10,Topic # 11,Topic # 12,Topic # 13,Topic # 14,Topic # 15,Topic # 16,Topic # 17,Topic # 18
0,charger,camera,product,lenovo,time,performance,quality,update,price,heating,speed,money,note,phone,service,problem,i,battery
1,processor,mobile,amazon,hai,network,cash,reason,oreo,ok,option,speaker,screen,issue,call,sim,superb,day,backup
2,waste,mark,buy,k,stock,heat,video,software,point,memory,month,lot,life,range,customer,standby,device,delivery
3,charge,smartphone,work,feature,android,everything,software,heat,ram,model,handset,value,hour,sound,center,software,mode,headset
4,music,signal,focus,ko,cost,ram,everything,condition,purchase,choice,u,charging,item,system,one,heat,display,software
5,use,weight,replacement,software,need,company,light,glass,software,anyone,deal,card,offer,voice,care,h,usage,heat
6,turbo,o,heat,refund,class,clarity,image,rating,dolby,casting,centre,cast,hanging,look,jio,video,depth,bit
7,return,dont,please,chor,recording,budget,sensor,zero,feature,ram,specification,tv,ka,wifi,company,bit,star,everything
8,hr,hell,app,h,hardware,software,flash,minus,everything,heat,prize,app,disappointment,cable,support,everything,thing,please
9,core,idle,software,support,rate,sensor,clarity,user,clarity,set,heat,software,hai,recorder,heat,connection,mah,app


# 11.2) Create a table with the topic name and the top 10 terms in each to present to the business.

Topic # 1 : Issues with Charger

Topic # 2 : Camera issues

Topic # 3 : Issues with product replacement

Topic # 4 : Matters related to refunding

Topic # 5 : Customers' expectations

Topic # 6 : Phone heating up

Topic # 7 : Overall quality related

Topic # 8 : Phone heating up

Topic # 9 : Satisfied customers' opinions

Topic # 10 : Phone heating up

Topic # 11 : Processor speed related

Topic # 12 : Phone heating up

Topic # 13 : Phone heating up

Topic # 14 : Wi-Fi related issues & phone heating up

Topic # 15 : Phone heating up

Topic # 16 : Phone heating up

Topic # 17 : Display & phone heating up

Topic # 18 : Battery backup & phone heating up
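
Task 11.2 asks for a single table that pairs each topic name with its top 10 terms. A minimal sketch of one way to assemble it with the standard library is shown below. Only the first two topics are included, with names and terms copied from the lists above, and the variable names are hypothetical. The remaining topics would be added in the same way.

```python
import csv
import io

# Sketch: pair each topic name with its top 10 terms in one table.
# Names and terms for the first two topics are copied from this notebook.

topic_names = {
    1: "Issues with Charger",
    2: "Camera issues",
}
top_terms = {
    1: ["charger", "processor", "waste", "charge", "music",
        "use", "turbo", "return", "hr", "core"],
    2: ["camera", "mobile", "mark", "smartphone", "signal",
        "weight", "o", "dont", "hell", "idle"],
}

# Write the table as CSV text so it can be saved or shared as-is.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Topic", "Name", "Top 10 terms"])
for topic_id, name in topic_names.items():
    writer.writerow([f"Topic # {topic_id}", name, ", ".join(top_terms[topic_id])])

table_csv = buffer.getvalue()
print(table_csv)
```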

# Learners comments

1) In Model-1, tried 12 topics and found the result reasonably comfortable to interpret.

2) In Model-2, reduced the topics by 50%, from the initial 12 to 6, and found the result difficult to interpret from a business angle.

3) In Model-3, increased the topics by 50%, from the initial 12 to 18, and observed that it is easier to name the topics.

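4) Also observed that, as the number of topics increases, the value reported by log_perplexity becomes more negative (-6.36, -7.14 and -13.10 for 6, 12 and 18 topics). Because this value is a per-word likelihood bound and perplexity is 2^(-bound), the perplexity itself rises rather than falls.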

# END OF THE PROJECT