TASK

1 - Read the .csv file using Pandas. Take a look at the top few records.

2 - Normalize casings for the review text and extract the text into a list for easier manipulation.

3 - Tokenize the reviews using NLTK's word_tokenize function.

4 - Perform part-of-speech tagging on each sentence using the NLTK POS tagger.

5 - For the topic model, we want to include only nouns. (a) Find out all the POS tags that correspond to nouns. (b) Limit the data to only terms with these tags.

6 - Lemmatize. (a) Different forms of the terms need to be treated as one. (b) No need to provide POS tag to lemmatizer for now.

7 - Remove stopwords and punctuation (if there are any).

8 - Create a topic model using LDA on the cleaned-up data with 12 topics. (a) Print out the top terms for each topic. (b) What is the coherence of the model with the c_v metric?

9 - Analyze the topics through the business lens. (a) Determine which of the topics can be combined.

10 - Create a topic model using LDA with what you think is the optimal number of topics. (a) What is the coherence of the model?

11 - The business should be able to interpret the topics. (a) Name each of the identified topics. (b) Create a table with the topic name and the top 10 terms in each to present to the business.

In [1]:
import pandas as pd
import nltk
import os
import matplotlib.pyplot as plt

In [2]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


**2 - Normalize casings for the review text and extract the text into a list for easier manipulation.**

In [3]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from textblob import TextBlob
import seaborn as sns
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
topic_data = pd.read_csv('/content/drive/MyDrive/nltk/Topic_analysis/K8_Reviews_v0.2.csv')
topic_data.head()

| Unnamed: 0 | sentiment | review |
| --- | --- | --- |
| 0 | 1 | Good but need updates and improvements |
| 1 | 0 | Worst mobile i have bought ever, Battery is dr... |
| 2 | 1 | when I will get my 10% cash back.... its alrea... |
| 3 | 1 | Good |
| 4 | 0 | The worst phone everThey have changed the last... |


In [5]:
topic_data_lower = [sent.lower() for sent in topic_data.review.values]
topic_data_lower[0]

'good but need updates and improvements'

**3 - Tokenize the reviews using NLTK's word_tokenize function.**

In [6]:
reviews_token = [word_tokenize(sent) for sent in topic_data_lower]
reviews_token[0]

['good', 'but', 'need', 'updates', 'and', 'improvements']
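Some reviews run sentences together without a space, which is why a token such as 'discharged.this' shows up in later output. The following is a minimal sketch of one possible pre-tokenization fix, assuming it is acceptable to insert a space after any period that sits between two lowercase letters. The helper name `split_fused_sentences` is hypothetical, and the rule would also split strings such as domain names, so treat it as a starting point only.

```python
import re

def split_fused_sentences(text):
    """Insert a space after a period that sits directly between two lowercase letters."""
    # Assumption: the text is already lowercased, as in topic_data_lower.
    return re.sub(r"(?<=[a-z])\.(?=[a-z])", ". ", text)

sample = "battery discharged.this is a lie"
fixed = split_fused_sentences(sample)
print(fixed)
```

The cleaned string could then be passed to `word_tokenize` in place of the raw review text.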

**4 - Perform part-of-speech tagging on each sentence using the NLTK POS tagger.**

In [7]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [8]:
blob = TextBlob(topic_data_lower[0])
blob.tags

[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

In [9]:
nltk.pos_tag(reviews_token[0])

[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

#### **5 - For the topic model, we want to include only nouns.**

**(a) Find out all the POS tags that correspond to nouns.**
**(b) Limit the data to only terms with these tags.**

In [10]:
nltk.download('tagsets')

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Package tagsets is already up-to-date!


True

In [11]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [12]:
tagged_tuple = nltk.pos_tag(['great'])
tagged_tuple[0]

('great', 'JJ')

In [13]:
print(tagged_tuple[0][0])
print(tagged_tuple[0][1])

great
JJ


In [14]:
reviews_tagged = [nltk.pos_tag(tokens) for tokens in reviews_token]
reviews_tagged[0]

[('good', 'JJ'),
 ('but', 'CC'),
 ('need', 'VBP'),
 ('updates', 'NNS'),
 ('and', 'CC'),
 ('improvements', 'NNS')]

In [15]:
import re
reviews_noun=[]
for sent in reviews_tagged:
  # Keep only tokens whose tag starts with "NN" (NN, NNS, NNP, NNPS).
  reviews_noun.append([token for token in sent if re.match("NN", token[1])])
reviews_noun[1]

[('mobile', 'NN'),
 ('i', 'NN'),
 ('battery', 'NN'),
 ('hell', 'NN'),
 ('backup', 'NN'),
 ('hours', 'NNS'),
 ('uses', 'NNS'),
 ('idle', 'NN'),
 ('discharged.this', 'NN'),
 ('lie', 'NN'),
 ('amazon', 'NN'),
 ('lenove', 'NN'),
 ('battery', 'NN'),
 ('charger', 'NN'),
 ('hours', 'NNS'),
 ('don', 'NN')]
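The noun filter can also be written without a regular expression. Below is a small sketch, assuming the four Penn Treebank noun tags (NN, NNS, NNP, NNPS) are the ones of interest; the names `NOUN_TAGS` and `keep_nouns` are illustrative and not part of the notebook.

```python
# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def keep_nouns(tagged_sentence):
    """Return only the (word, tag) pairs whose tag is a noun tag."""
    return [(word, tag) for word, tag in tagged_sentence if tag in NOUN_TAGS]

# Tagged tokens of the first review, copied from the output above.
tagged = [("good", "JJ"), ("but", "CC"), ("need", "VBP"),
          ("updates", "NNS"), ("and", "CC"), ("improvements", "NNS")]

nouns = keep_nouns(tagged)
print(nouns)
```

An explicit tag set makes it obvious which tags are kept, whereas a pattern such as "NN.*" relies on no other tag containing "NN".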

**6 - Lemmatize.**
(a) Different forms of the terms need to be treated as one.
(b) No need to provide POS tag to lemmatizer for now.

In [16]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [17]:
lemma = WordNetLemmatizer()
reviews_lemma = []
for sent in reviews_noun:
  reviews_lemma.append([lemma.lemmatize(word[0]) for word in sent])
reviews_lemma[1]

['mobile',
 'i',
 'battery',
 'hell',
 'backup',
 'hour',
 'us',
 'idle',
 'discharged.this',
 'lie',
 'amazon',
 'lenove',
 'battery',
 'charger',
 'hour',
 'don']

**7 - Remove stopwords and punctuation (if there are any).**

In [18]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [19]:
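# NLTK has no 'punctuation' package, so this download fails (see the error below).
# Punctuation characters come from Python's string module instead.
nltk.download('punctuation')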

[nltk_data] Error loading punctuation: Package 'punctuation' not found
[nltk_data]     in index


False

In [20]:
from string import punctuation
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [21]:
stop_updated = stop_words + list(punctuation) + ["..."] + [".."]
reviews_sw_removed=[]
for sent in reviews_lemma:
    reviews_sw_removed.append([term for term in sent if term not in stop_updated])

In [22]:
reviews_sw_removed[1]

['mobile',
 'battery',
 'hell',
 'backup',
 'hour',
 'us',
 'idle',
 'discharged.this',
 'lie',
 'amazon',
 'lenove',
 'battery',
 'charger',
 'hour']
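The stop list above only covers single punctuation characters plus "..." and "..", so longer runs such as "...." can still reach the topic model. Here is a hedged sketch of a stricter filter that drops any token made up entirely of punctuation. `BASIC_STOP_WORDS` is a small hypothetical stand-in for the NLTK list, used only to keep the example self-contained.

```python
from string import punctuation

# Hypothetical stand-in for stopwords.words('english'); not the real NLTK list.
BASIC_STOP_WORDS = {"i", "the", "and", "but", "is"}

def is_punctuation_only(term):
    """True when every character of the term is a punctuation character."""
    return all(ch in punctuation for ch in term)

def clean(tokens, stop_words=BASIC_STOP_WORDS):
    """Drop stop words and tokens that consist only of punctuation."""
    return [t for t in tokens if t not in stop_words and not is_punctuation_only(t)]

tokens = ["money", "....", "waste", "i", ".....", "battery"]
cleaned = clean(tokens)
print(cleaned)
```

In the notebook, the same check could be applied inside the list comprehension that builds `reviews_sw_removed`.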

**8 - Create a topic model using LDA on the cleaned-up data with 12 topics.** (a) Print out the top terms for each topic. (b) What is the coherence of the model with the c_v metric?

In [23]:
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.models import ldamodel

In [24]:
id2word = corpora.Dictionary(reviews_sw_removed)
texts = reviews_sw_removed
corpus = [id2word.doc2bow(text) for text in texts]

In [25]:
print(corpus[300])

[(4, 1), (33, 1), (36, 1), (38, 1), (58, 1), (132, 1), (188, 1), (201, 1), (226, 1), (302, 1), (472, 2), (596, 1), (597, 1), (598, 1)]
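Each pair in the output above is (token id, count within the review). The sketch below mimics that bag-of-words shape in plain Python to show what the pairs mean. It is not gensim's implementation, and the ids it assigns will differ from those in `id2word`; `build_vocab` and `to_bow` are hypothetical helper names.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())

docs = [["mobile", "battery", "hour", "battery", "charger", "hour"],
        ["camera", "battery"]]
vocab = build_vocab(docs)
bow = to_bow(docs[0], vocab)
print(bow)
```

Here "battery" and "hour" each occur twice in the first document, so their ids carry a count of 2.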


In [26]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=12,
                                           random_state=42,
                                           passes=10,
                                           per_word_topics=True)

In [27]:
print(lda_model.print_topics())

[(0, '0.138*"mobile" + 0.040*"call" + 0.036*"screen" + 0.031*"feature" + 0.030*"option" + 0.020*"music" + 0.017*"software" + 0.016*"app" + 0.015*"video" + 0.015*"card"'), (1, '0.151*"money" + 0.128*"...." + 0.071*"waste" + 0.056*"value" + 0.046*"glass" + 0.038*"speaker" + 0.024*"gorilla" + 0.022*"set" + 0.022*"ok" + 0.020*"piece"'), (2, '0.216*"note" + 0.113*"k8" + 0.090*"lenovo" + 0.030*"sound" + 0.023*"dolby" + 0.020*"killer" + 0.018*"gallery" + 0.018*"system" + 0.018*"atmos" + 0.018*"excellent"'), (3, '0.078*"phone" + 0.040*"day" + 0.038*"amazon" + 0.035*"service" + 0.034*"issue" + 0.027*"time" + 0.027*"lenovo" + 0.026*"battery" + 0.024*"month" + 0.023*"device"'), (4, '0.280*"product" + 0.176*"problem" + 0.080*"network" + 0.075*"issue" + 0.066*"heating" + 0.021*"jio" + 0.021*"sim" + 0.019*"volta" + 0.010*"connection" + 0.009*"signal"'), (5, '0.093*"heat" + 0.070*"....." + 0.052*"processor" + 0.038*"everything" + 0.038*"budget" + 0.031*"...." + 0.030*"core" + 0.025*"display" + 0.017*

**What is the coherence of the model with the c_v metric?**

In [28]:
coherence_model_lda = CoherenceModel(model=lda_model, texts=reviews_sw_removed, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)


Coherence Score:  0.48926964040228144


### **9 - Analyze the topics through the business lens. (a) Determine which of the topics can be combined.**

In [29]:
# show_topics returns only 10 topics by default; pass num_topics=-1 to list all 12.
x = lda_model.show_topics(formatted=False)
topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in x]

In [30]:
for topic,words in topics_words:
    print(str(topic)+ "::"+ str(words))
print()

5::['heat', '.....', 'processor', 'everything', 'budget', '....', 'core', 'display', 'cell', 'hr']
9::['camera', 'battery', 'quality', 'phone', 'performance', 'backup', 'issue', 'life', 'day', 'mode']
1::['money', '....', 'waste', 'value', 'glass', 'speaker', 'gorilla', 'set', 'ok', 'piece']
2::['note', 'k8', 'lenovo', 'sound', 'dolby', 'killer', 'gallery', 'system', 'atmos', 'excellent']
0::['mobile', 'call', 'screen', 'feature', 'option', 'music', 'software', 'app', 'video', 'card']
6::['range', 'price', 'work', 'mobile', 'specification', 'super', '......', 'bit', 'cam', 'k']
3::['phone', 'day', 'amazon', 'service', 'issue', 'time', 'lenovo', 'battery', 'month', 'device']
4::['product', 'problem', 'network', 'issue', 'heating', 'jio', 'sim', 'volta', 'connection', 'signal']
7::['charger', 'hai', 'handset', 'box', 'turbo', 'charge', 'plz', 'hi', 'cable', 'bhi']
8::['price', 'superb', 'buy', 'headphone', 'thanks', 'worth', 'feature', 'smartphone', 'expectation', 'offer']



### Possible topics from the terms present
1. Topic 0 : Overall features
2. Topic 1 : Product accessories review
3. Topic 2 : Product specification
4. Topic 3 : Amazon service
5. Topic 5 : Product technology
6. Topic 6 : Range of product
7. Topic 7 : Accessories
8. Topic 8 : Positive review
9. Topic 9 : Normal phone features
10. Topic 11 : Overall review of a product
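One way to support the decision about which topics to combine is to measure how many top terms two topics share. The sketch below computes the Jaccard overlap for three of the topics printed above. The choice of Jaccard similarity is an assumption made for illustration, not something the notebook prescribes.

```python
from itertools import combinations

# Top-10 terms copied from the 12-topic model output above.
topics = {
    3: ["phone", "day", "amazon", "service", "issue", "time",
        "lenovo", "battery", "month", "device"],
    4: ["product", "problem", "network", "issue", "heating", "jio",
        "sim", "volta", "connection", "signal"],
    9: ["camera", "battery", "quality", "phone", "performance", "backup",
        "issue", "life", "day", "mode"],
}

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two term lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

for (i, terms_i), (j, terms_j) in combinations(topics.items(), 2):
    shared = sorted(set(terms_i) & set(terms_j))
    print(i, j, round(jaccard(terms_i, terms_j), 2), shared)
```

Pairs with a high overlap are candidates for merging, although the final call still depends on how the business reads the terms.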

In [31]:
!pip install pyLDAvis



In [32]:
import pyLDAvis
# In pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models.
import pyLDAvis.gensim
import pickle

In [33]:
pyLDAvis.enable_notebook()



In [34]:
LDAvis = pyLDAvis.gensim.prepare(lda_model,corpus,id2word)
LDAvis



### **10 - Create a topic model using LDA with what you think is the optimal number of topics. (a) What is the coherence of the model?**

In [35]:
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.models import ldamodel



In [36]:
id2word = corpora.Dictionary(reviews_sw_removed)
texts_2 = reviews_sw_removed
corpus = [id2word.doc2bow(text) for text in texts_2]



In [37]:
print(corpus[200])

[(36, 1), (143, 1), (314, 1), (415, 1), (416, 1)]




In [38]:
lda_topic_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=8,
                                           random_state=42,
                                           passes=10,
                                           per_word_topics=True)



In [39]:
print(lda_topic_model.print_topics())

[(0, '0.188*"mobile" + 0.026*"charging" + 0.023*"hour" + 0.020*"charger" + 0.019*"charge" + 0.018*"battery" + 0.015*"turbo" + 0.014*"hr" + 0.013*"card" + 0.012*"notification"'), (1, '0.092*"money" + 0.043*"waste" + 0.034*"value" + 0.033*"screen" + 0.028*"glass" + 0.028*"speaker" + 0.026*"call" + 0.025*"handset" + 0.021*"box" + 0.020*"headphone"'), (2, '0.067*"note" + 0.056*"camera" + 0.051*"quality" + 0.038*"k8" + 0.031*"feature" + 0.023*"lenovo" + 0.023*"sound" + 0.019*"phone" + 0.016*"music" + 0.013*"speaker"'), (3, '0.183*"phone" + 0.027*"day" + 0.025*"issue" + 0.023*"time" + 0.022*"battery" + 0.020*"lenovo" + 0.017*"month" + 0.017*"problem" + 0.017*"service" + 0.013*"update"'), (4, '0.199*"product" + 0.095*"problem" + 0.049*"network" + 0.040*"issue" + 0.037*"heating" + 0.024*"amazon" + 0.018*"sim" + 0.016*"return" + 0.013*"...." + 0.013*"delivery"'), (5, '0.122*"camera" + 0.114*"battery" + 0.082*"phone" + 0.043*"performance" + 0.035*"quality" + 0.028*"backup" + 0.020*"...." + 0.016



In [40]:
coherence_model_lda = CoherenceModel(model=lda_topic_model, texts=reviews_sw_removed, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)




Coherence Score:  0.5645540248584124
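The notebook compares two topic counts by hand: 12 topics scored about 0.489 and 8 topics about 0.565 on c_v. The comparison can be generalized as a sweep over candidate counts. This is a minimal sketch, assuming a scoring function that, in the notebook, would train an `LdaModel` with the given `num_topics` and return its c_v coherence. Here a lookup of the two recorded scores stands in for that function, and `best_num_topics` is a hypothetical helper name.

```python
def best_num_topics(candidates, score_fn):
    """Score each candidate topic count and return the best one with all scores."""
    scores = {k: score_fn(k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores

# Coherence scores recorded in this notebook, used as a stand-in for retraining.
recorded = {12: 0.48926964040228144, 8: 0.5645540248584124}

best_k, scores = best_num_topics([8, 12], recorded.__getitem__)
print(best_k, scores[best_k])
```

Among the two counts tried here, 8 topics has the higher coherence. A wider sweep would be needed before calling any value optimal.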


In [41]:
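# Note: this inspects lda_model (the 12-topic model), so the output below lists its topics.
# Use lda_topic_model.show_topics(formatted=False) to inspect the 8-topic model instead.
x = lda_model.show_topics(formatted=False)
topics_words = [(tp[0], [wd[0] for wd in tp[1]]) for tp in x]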



In [42]:
for topic,words in topics_words:
    print(str(topic)+ "::"+ str(words))
print()

9::['camera', 'battery', 'quality', 'phone', 'performance', 'backup', 'issue', 'life', 'day', 'mode']
2::['note', 'k8', 'lenovo', 'sound', 'dolby', 'killer', 'gallery', 'system', 'atmos', 'excellent']
10::['phone', 'h', 'ram', 'hang', 'gb', 'game', 'ho', 'u', 'lot', 'interface']
11::['feature', 'delivery', 'time', 'star', 'experience', 'camera', 'condition', 'cost', 'class', 'awesome']
3::['phone', 'day', 'amazon', 'service', 'issue', 'time', 'lenovo', 'battery', 'month', 'device']
1::['money', '....', 'waste', 'value', 'glass', 'speaker', 'gorilla', 'set', 'ok', 'piece']
4::['product', 'problem', 'network', 'issue', 'heating', 'jio', 'sim', 'volta', 'connection', 'signal']
0::['mobile', 'call', 'screen', 'feature', 'option', 'music', 'software', 'app', 'video', 'card']
7::['charger', 'hai', 'handset', 'box', 'turbo', 'charge', 'plz', 'hi', 'cable', 'bhi']
6::['range', 'price', 'work', 'mobile', 'specification', 'super', '......', 'bit', 'cam', 'k']





In [43]:
import pyLDAvis
import pyLDAvis.gensim
import pickle



In [44]:
pyLDAvis.enable_notebook()
LDAvis = pyLDAvis.gensim.prepare(lda_topic_model,corpus,id2word)
LDAvis



These are the topic headings:

- topic1 - phone features
- topic2 - Amazon services
- topic3 - normal mobile specification
- topic4 - phone
- topic5 - product review
- topic6 - lenovo model
- topic7 - product price
- topic8 - shipping
- topic9 - product durability
- topic10 - battery charger
- topic11 - overall specification
- topic12 - overall review


#### **Create a table with the topic name and the top 10 terms in each to present to the business.**

In [45]:
import numpy as np



In [54]:
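def print_topics(topic, topics_words):
  # Prints its arguments and returns None; it does not build a table.
  print(topic, topics_words)


In [69]:
# `topic` is the loop variable left over from In [42], so Topics_table is None here.
Topics_table = print_topics(topic, topics_words)
Topics_table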

6 [(9, ['camera', 'battery', 'quality', 'phone', 'performance', 'backup', 'issue', 'life', 'day', 'mode']), (2, ['note', 'k8', 'lenovo', 'sound', 'dolby', 'killer', 'gallery', 'system', 'atmos', 'excellent']), (10, ['phone', 'h', 'ram', 'hang', 'gb', 'game', 'ho', 'u', 'lot', 'interface']), (11, ['feature', 'delivery', 'time', 'star', 'experience', 'camera', 'condition', 'cost', 'class', 'awesome']), (3, ['phone', 'day', 'amazon', 'service', 'issue', 'time', 'lenovo', 'battery', 'month', 'device']), (1, ['money', '....', 'waste', 'value', 'glass', 'speaker', 'gorilla', 'set', 'ok', 'piece']), (4, ['product', 'problem', 'network', 'issue', 'heating', 'jio', 'sim', 'volta', 'connection', 'signal']), (0, ['mobile', 'call', 'screen', 'feature', 'option', 'music', 'software', 'app', 'video', 'card']), (7, ['charger', 'hai', 'handset', 'box', 'turbo', 'charge', 'plz', 'hi', 'cable', 'bhi']), (6, ['range', 'price', 'work', 'mobile', 'specification', 'super', '......', 'bit', 'cam', 'k'])]
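The output above is a raw list rather than a table. Below is a minimal sketch of how the topic names and top terms could be laid out for the business, using three of the topics shown. The labels in `topic_names` are hypothetical examples, not names settled on in this notebook, and `to_markdown_table` is an illustrative helper.

```python
# Hypothetical labels for illustration; the real names are a business decision.
topic_names = {
    9: "Camera and battery performance",
    4: "Network and heating problems",
    7: "Charger and accessories",
}

# (topic id, top terms) pairs copied from the output above.
topics_words = [
    (9, ["camera", "battery", "quality", "phone", "performance",
         "backup", "issue", "life", "day", "mode"]),
    (4, ["product", "problem", "network", "issue", "heating",
         "jio", "sim", "volta", "connection", "signal"]),
    (7, ["charger", "hai", "handset", "box", "turbo",
         "charge", "plz", "hi", "cable", "bhi"]),
]

def to_markdown_table(topics_words, topic_names):
    """Render topic id, topic name and top 10 terms as a Markdown table."""
    lines = ["| Topic | Name | Top 10 terms |", "| --- | --- | --- |"]
    for topic_id, words in sorted(topics_words):
        name = topic_names.get(topic_id, "unnamed")
        lines.append(f"| {topic_id} | {name} | {', '.join(words[:10])} |")
    return "\n".join(lines)

table = to_markdown_table(topics_words, topic_names)
print(table)
```

The same rows could be loaded into a pandas DataFrame if a notebook-rendered table is preferred.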


