## Assignment 3                      -- Dan Hua Li
1.	(20 points) Perform latent semantic analysis (LSA), non-negative matrix factorization (NMF), and latent Dirichlet allocation (LDA) using scikit-learn and Gensim on the ***Amazon5000reviews dataset***. Set the number of topics = 5. Follow the same steps and parameter settings as those in the ***topic-modeling-animals-food-weather*** example (e.g., print out the top 10 words for each topic). Specific steps include:
- a.	Add both ‘ha’ and ‘wa’ to the stop word list.
- b.	scikit-learn for LSA, NMF, and LDA.
- c.	scikit-learn grid search for the best number of topics k for LDA, based on the perplexity score, with the grid of k = [2,3,4,5,6,7,8,10,12,14,16,18,20] (this search may take about 10-20 minutes).
- d.	Gensim for LSI, NMF, and LDA.
- e.	Gensim tqdm search for the best number of topics k for LDA, based on the c_v coherence score, with k ranging from 2 to 20 (consecutively).

In [2]:
#pip install --upgrade gensim
import pandas as pd
import numpy as np
from time import time

textdf = pd.read_csv('Amazon5000reviews.csv', sep=',')
textdf.info()
textdf.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 1 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Reviews  5000 non-null   object
dtypes: object(1)
memory usage: 39.2+ KB


,Reviews
0,I thought it would be as big as small paper bu...
1,This kindle is light and easy to use especiall...
2,Didnt know how much i'd use a kindle so went f...
3,I am 100 happy with my purchase. I caught it o...
4,Solid entry level Kindle. Great for kids. Gift...


In [3]:
# Tokenization with sklearn and nltk: set to lower case, remove stop words, and lemmatize words
import nltk
import string
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

def lemma_tokenizer(doc): # tokenize and lemmatize a single document
    doc = ''.join([ch for ch in doc if ch not in string.punctuation]) # remove punctuation
    tokens = nltk.word_tokenize(doc)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]

nltk_stopwords = nltk.corpus.stopwords.words('english') # use nltk's English stopwords list
tf = CountVectorizer(tokenizer=lemma_tokenizer, stop_words=nltk_stopwords) # default lowercase
tf_sparse = tf.fit_transform(textdf.Reviews)
tf_dictionary = tf.get_feature_names_out()
print(tf_dictionary)
tf_sparse



['0' '1' '10' ... '‚äúthings' '‚äúthings‚äù' '‚ù§ô∏è']


<5000x5442 sparse matrix of type '<class 'numpy.int64'>'
	with 72870 stored elements in Compressed Sparse Row format>

In [4]:
# the lemmatizer reduces 'has' to 'ha' and 'was' to 'wa', so add both to the stop word list
nltk_stopwords.append('ha')
nltk_stopwords.append('wa')
tf = CountVectorizer(tokenizer=lemma_tokenizer, stop_words=nltk_stopwords) # default lowercase
tf_sparse = tf.fit_transform(textdf.Reviews)
tf_dictionary = tf.get_feature_names_out()
print(tf_dictionary)
tf_sparse



['0' '1' '10' ... '‚äúthings' '‚äúthings‚äù' '‚ù§ô∏è']


<5000x5440 sparse matrix of type '<class 'numpy.int64'>'
	with 71557 stored elements in Compressed Sparse Row format>
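The vocabulary printed above still contains tokens such as `‚äúthings‚äù`. These look like curly quotes that were stored as UTF-8, decoded as Mac Roman, and then lowercased by the vectorizer. Below is a minimal sketch of one possible repair, assuming that encoding mix-up is really what happened. `repair_mojibake` is a hypothetical helper (not part of pandas or NLTK), and it has to run on the raw text before any lowercasing, because lowercasing changes the bytes the characters map back to.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Mac Roman.

    Assumption: the garbling came from a UTF-8 / Mac Roman mix-up.
    Text that cannot be round-tripped is returned unchanged.
    """
    try:
        return text.encode("mac_roman").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repair_mojibake("I don‚Äôt mind"))       # I don’t mind
print(repair_mojibake("‚Äúthings‚Äù"))          # “things”
print(repair_mojibake("plain ascii text"))     # unchanged
```

If the assumption holds for this dataset, something like `textdf['Reviews'].map(repair_mojibake)` before vectorizing would be one way to apply it. The outputs in this notebook were produced without this step.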

In [5]:
tf_dense = tf_sparse.toarray() # convert sparse to dense matrix
pd.DataFrame(tf_dense, columns=tf_dictionary)

,0,1,10,100,1000,101,1012,1013,105,1080,...,‚äúalexa‚äù,‚äúbest‚äù,‚äúdropping,‚äúdualbattery,‚äúshow‚äù,‚äúskills‚äù,‚äústar,‚äúthings,‚äúthings‚äù,‚ù§ô∏è
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(tokenizer=lemma_tokenizer, stop_words=nltk_stopwords) # default lowercase
tfidf_sparse = tfidf.fit_transform(textdf.Reviews)
tfidf_dictionary = tfidf.get_feature_names_out()
tfidf_sparse



<5000x5440 sparse matrix of type '<class 'numpy.float64'>'
	with 71557 stored elements in Compressed Sparse Row format>

In [7]:
from sklearn.decomposition import TruncatedSVD
lsa = TruncatedSVD(n_components=5)
lsa

In [8]:
lsa_tf_topics = lsa.fit_transform(tf_sparse)
lsa_tf_topics.shape

(5000, 5)

In [9]:
lsa.components_.shape

(5, 5440)

In [10]:
# print top terms for each topic
def print_top_terms(model, vocabulary, n_top_terms):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([vocabulary[i]
                             for i in topic.argsort()[:-n_top_terms - 1:-1]])
        print(message)
    print()

print('LSA topics based on term-document matrix:')
print_top_terms(lsa, tf_dictionary, 10)

LSA topics based on term-document matrix:
Topic #0: tablet device amazon use screen one great apps kindle love
Topic #1: device magazine amazon app screen apps photo button menu home
Topic #2: kindle charge oasis book read would cover one day reading
Topic #3: echo show alexa music sound dot device video home see
Topic #4: great tablet would echo work price good charge oasis sound
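The slice `topic.argsort()[:-n_top_terms - 1:-1]` in `print_top_terms` is compact but easy to misread. The sketch below walks through it on a toy weight vector; the vocabulary and weights are made up for illustration.

```python
import numpy as np

vocabulary = np.array(["book", "echo", "kindle", "tablet"])  # toy vocabulary (hypothetical)
weights = np.array([0.1, 0.9, 0.3, 0.7])                     # stands in for one row of components_


def top_terms(weights, vocabulary, n_top_terms):
    order = weights.argsort()            # indices sorted by ascending weight
    top = order[:-n_top_terms - 1:-1]    # walk backwards from the end, taking n_top_terms indices
    return [str(vocabulary[i]) for i in top]


print(top_terms(weights, vocabulary, 2))  # ['echo', 'tablet']
```

Because the ranking uses the raw weights, terms with large negative loadings (which LSA components can have) never appear in these lists, even when they matter for the topic.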



In [11]:
# sklearn for non-negative matrix factorization (NMF)
from sklearn.decomposition import NMF
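# alpha_W and l1_ratio control regularization. In recent scikit-learn versions the penalty on W
# is scaled by the number of features, so alpha_W=.1 can be strong enough to shrink every
# component to zero (all topics then print the same terms).
nmf = NMF(n_components=5, random_state=1, alpha_W=.1, l1_ratio=.5)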
nmf

In [12]:
nmf.fit_transform(tf_sparse)
print('NMF topics based on term-document matrix:')
print_top_terms(nmf, tf_dictionary, 10)

NMF topics based on term-document matrix:
Topic #0: ‚ù§ô∏è exposed expirence explained explaining explanatory explore explored exploring exponentially
Topic #1: ‚ù§ô∏è exposed expirence explained explaining explanatory explore explored exploring exponentially
Topic #2: ‚ù§ô∏è exposed expirence explained explaining explanatory explore explored exploring exponentially
Topic #3: ‚ù§ô∏è exposed expirence explained explaining explanatory explore explored exploring exponentially
Topic #4: ‚ù§ô∏è exposed expirence explained explaining explanatory explore explored exploring exponentially



In [13]:
nmf.fit_transform(tfidf_sparse)
print('NMF topics based on tfidf matrix:')
print_top_terms(nmf, tfidf_dictionary, 10)

NMF topics based on tfidf matrix:
Topic #0: ‚ù§ô∏è exposed expirence explained explaining explanatory explore explored exploring exponentially
Topic #1: ‚ù§ô∏è exposed expirence explained explaining explanatory explore explored exploring exponentially
Topic #2: ‚ù§ô∏è exposed expirence explained explaining explanatory explore explored exploring exponentially
Topic #3: ‚ù§ô∏è exposed expirence explained explaining explanatory explore explored exploring exponentially
Topic #4: ‚ù§ô∏è exposed expirence explained explaining explanatory explore explored exploring exponentially



In [14]:
# sklearn for latent Dirichlet allocation (LDA)
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(
    n_components=5, random_state=1, learning_method='online', learning_offset=50.)
lda

In [15]:
lda.fit_transform(tf_sparse)
print('LDA topics based on term-document matrix:')
print_top_terms(lda, tf_dictionary, 10)

LDA topics based on term-document matrix:
Topic #0: music alexa love echo enjoy video play family question home
Topic #1: tablet love great one use bought kindle easy kid fire
Topic #2: kindle model user memory storage card page ereader voyage friendly
Topic #3: great work echo product use show easy sound good recommend
Topic #4: book read reader reading web email facebook ease shop want



In [16]:
lda.fit_transform(tfidf_sparse)
print('LDA topics based on tfidf matrix:')
print_top_terms(lda, tfidf_dictionary, 10)

LDA topics based on tfidf matrix:
Topic #0: great love tablet use easy bought good one kindle price
Topic #1: better beginner good upgrade position could located allow kindle father
Topic #2: keeping smooth asesome reallllllllllllllllllllllllllllllllllllllllllllllly travelling grandson8loves advertised ui pricereally active
Topic #3: tableti 6yr author title recipient performed fundraiser competitive prize usewould
Topic #4: class replaces nexus 3yr increase cousin font goodexcellent feachers screenlike



In [17]:
# sklearn grid search for the best number of topics for LDA based on perplexity score
from sklearn.model_selection import GridSearchCV
param_grid = {'n_components': [2,3,4,5,6,7,8,10,12,14,16,18,20]}
lda = LatentDirichletAllocation(random_state=1, learning_method='online', learning_offset=50.)
lda_grid = GridSearchCV(lda, param_grid)
lda_grid.fit(tf_sparse)

In [18]:
print("Best model's params: ", lda_grid.best_params_)
print("Best log likelihood score: ", lda_grid.best_score_)
print("Model perplexity: ", lda_grid.best_estimator_.perplexity(tf_sparse))
cvresult = lda_grid.cv_results_
for mean_test_score, params in zip(cvresult['mean_test_score'], cvresult['params']):
    print(mean_test_score, params)
# The grid search selects k = 2 (highest held-out log-likelihood); two topics are probably too coarse to be useful.

Best model's params:  {'n_components': 2}
Best log likelihood score:  -121263.99937418592
Model perplexity:  973.3272661707207
-121263.99937418592 {'n_components': 2}
-123541.91503155879 {'n_components': 3}
-125219.1148578464 {'n_components': 4}
-126932.75652981366 {'n_components': 5}
-127402.07122772327 {'n_components': 6}
-130354.38193482 {'n_components': 7}
-129932.3043603483 {'n_components': 8}
-133897.86389593122 {'n_components': 10}
-134332.02489993474 {'n_components': 12}
-136257.83153047334 {'n_components': 14}
-137876.81880141242 {'n_components': 16}
-140247.02924032547 {'n_components': 18}
-142645.58511122147 {'n_components': 20}
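The task asks for a search based on perplexity, while `GridSearchCV` with no `scoring` argument ranks models by the estimator's `score`, an approximate log-likelihood on each held-out fold. To my understanding of scikit-learn's LDA, perplexity is `exp(-log_likelihood / total_token_count)`, so the two rankings should usually agree. The sketch below states that relation and shows a scorer that ranks by perplexity directly. Both function names are hypothetical helpers.

```python
import math


def perplexity_from_log_likelihood(log_likelihood, n_tokens):
    """Per-token perplexity from a total (approximate) log-likelihood."""
    return math.exp(-log_likelihood / n_tokens)


def neg_perplexity_scorer(estimator, X, y=None):
    """Scorer for GridSearchCV: higher is better, so return negative perplexity."""
    return -estimator.perplexity(X)


# Toy numbers (not from the review data): 1000 tokens, log-likelihood of -6900
print(perplexity_from_log_likelihood(-6900.0, 1000))  # exp(6.9), roughly 992
```

Passing `scoring=neg_perplexity_scorer` to `GridSearchCV(lda, param_grid, ...)` would make the search follow the assignment wording literally; the results above were obtained with the default scoring.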


In [19]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(
    n_components=2, random_state=1, learning_method='online', learning_offset=50.)
lda

In [20]:
lda.fit_transform(tfidf_sparse)
print('LDA topics based on tfidf matrix:')
print_top_terms(lda, tfidf_dictionary, 10)

LDA topics based on tfidf matrix:
Topic #0: echo alexa music show sound home great family love speaker
Topic #1: tablet great love easy use bought good kindle one price



#### Gensim 

In [2]:
# data processing for gensim
t0 = time()
stopwds = nltk.corpus.stopwords.words('english')
stopwds.append('ha')  # add 'ha' to stopword list for removal
stopwds.append('wa')
wtk = nltk.tokenize.RegexpTokenizer(r'\w+')
wnl = nltk.stem.wordnet.WordNetLemmatizer()

def tokenize_docs(docs):
    tokened_docs = []
    for doc in docs:
        doc = doc.lower()
        doc_tokens = [token.strip() for token in wtk.tokenize(doc)]
        doc_tokens = [wnl.lemmatize(token) for token in doc_tokens if not token.isnumeric()]
        doc_tokens = [token for token in doc_tokens if len(token) > 1]
        doc_tokens = [token for token in doc_tokens if token not in stopwds]
        doc_tokens = list(filter(None, doc_tokens))
        if doc_tokens:
            tokened_docs.append(doc_tokens)
    return tokened_docs

tokened_docs = tokenize_docs(textdf.Reviews)
print("Computing time: %0.3f seconds." % (time() - t0))
print('\nTokenized documents:\n', tokened_docs[:50])  # shows a little

### Output:>

Computing time: 1.490 seconds.

Tokenized documents:
 [['thought', 'would', 'big', 'small', 'paper', 'turn', 'like', 'palm', 'think', 'small', 'read', 'comfortable', 'regular', 'kindle', 'would', 'definitely', 'recommend', 'paperwhite', 'instead'], ['kindle', 'light', 'easy', 'use', 'especially', 'beach'], ['didnt', 'know', 'much', 'use', 'kindle', 'went', 'lower', 'end', 'im', 'happy', 'even', 'little', 'dark'], ['happy', 'purchase', 'caught', 'sale', 'really', 'good', 'price', 'normally', 'real', 'book', 'person', 'year', 'old', 'love', 'ripping', 'page', 'kindle', 'prevents', 'extremely', 'portable', 'fit', 'better', 'purse', 'giant', 'book', 'loaded', 'lot', 'book', 'finish', 'one', 'start', 'another', 'without', 'go', 'store', 'serf', 'need', 'picked', 'one', 'paperwhite', 'price', 'unbeatable', 'difference', 'could', 'see', 'one', 'backlit', 'simple', 'book', 'light', 'dollar', 'tree', 'solves', 'issue', 'second', 'kindle', 'first', 'old', 'keyboard', 'model', 'put', 'fell', 'love', 'keyboard', 'lol', 'likely', 'last'], ['solid', 'entry', 'level', 'kindle', 'great', 'kid', 'gifted', 'kid', 'friend', 'love', 'use', 'read', 'ipads', 'battery', 'good', 'higher', 'model', 'bit', 'better'], ['make', 'excellent', 'ebook', 'reader', 'expect', 'much', 'device', 'except', 'read', 'basic', 'ebooks', 'good', 'thing', 'cheap', 'good', 'read', 'sun'], ['ordered', 'daughter', 'black', 'paperwhite', 'love', 'read', 'quite', 'bit', 'larger', 'book', 'driving', 'crazy', 'hold', 'laying', 'wanting', 'take', 'book', 'vacation', 'lugging', 'around', 'thick', 'paperback', 'throw', 'bag', 'read', 'anywhere', 'light', 'weight', 'easy', 'use', 'batter

In [3]:
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary(tokened_docs)
print(dictionary)
print('\n', list(dictionary.items()))  

#### Output: >

A selection of the output is shown below:

Dictionary<4650 unique tokens: ['big', 'comfortable', 'definitely', 'instead', 'kindle']...>

 [(0, 'big'), (1, 'comfortable'), (2, 'definitely'), (3, 'instead'), (4, 'kindle'), (5, 'like'), (6, 'palm'), (7, 'paper'), (8, 'paperwhite'), (9, 'read'), (10, 'recommend'), (11, 'regular'), (12, 'small'), (13, 'think'), (14, 'thought'), (15, 'turn'), (16, 'would'), (17, 'beach'), (18, 'easy'), (19, 'especially'), (20, 'light'), (21, 'use'), (22, 'dark'), (23, 'didnt'), (24, 'end'), (25, 'even'), (26, 'happy'), (27, 'im'), (28, 'know'), (29, 'little'), (30, 'lower'), (31, 'much'), (32, 'went'), (33, 'another'), (34, 'backlit'), (35, 'better'), (36, 'book'), (37, 'caught'), (38, 'could'), (39, 'difference'), (40, 'dollar'), (41, 'extremely'), (42, 'fell'), (43, 'finish'), (44, 'first'), (45, 'fit'), (46, 'giant'), (47, 'go'), (48, 'good'), (49, 'issue'), (50, 'keyboard'), (51, 'last'), (52, 'likely'), (53, 'loaded'), (54, 'lol'), (55, 'lot'), (56, 'love'),

In [4]:
bow_tf = [dictionary.doc2bow(doc) for doc in tokened_docs]  # bow for 'bag of words'
print(bow_tf[:50])   # each document as a list of (term_id, frequency) pairs

### Output:>

[[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 1), (14, 1), (15, 1), (16, 2)], [(4, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1)], [(4, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1)], [(4, 2), (8, 1), (20, 1), (26, 1), (33, 1), (34, 1), (35, 1), (36, 4), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 2), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 2), (57, 1), (58, 1), (59, 1), (60, 2), (61, 3), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 2), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 1)], [(4, 1), (9, 1), (21, 1), (35, 1), (48, 1), (56, 1), (57, 1), (86, 1), (87, 1), (88, 1), (89, 1), (90, 1), (91, 1), (92, 1), (93, 1), (94, 2), (95, 1), (96, 1)], [(9, 2), (31, 1), (48, 2), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 1), (105, 1), (106, 1), (107, 1), (108, 1)], [(8, 1), (9, 3), (18, 1), (20, 1), (21, 1), (36, 4), (51, 1), (56, 1), (86, 1), (87, 1), (99, 2), (105, 1), (106, 1), (109, 1), (110, 1), (111, 1), (112, 1), (113, 2), (114, 1), (115, 1), (116, 1), (117, 1), (118, 1), (119, 1), (120, 2), (121, 2), (122, 1), (123, 2), (124, 1), (125, 1), (126, 1), (127, 2), (128, 1), (129, 1), (130, 2), (131, 1), (132, 1), (133, 1), (134, 2), (135, 1), (136, 2), (137, 1), (138, 1), (139, 1), (140, 1), (141, 1), (142, 1), (143, 1), (144, 1), (145, 1), (146, 1)], [(4, 1), (86, 1), (147, 1), (148, 1), (149, 1), (150, 1), (151, 1), (152, 1)], [(4, 1), (57, 1), (100, 1), (110, 1), (153, 1), (154, 1), (155, 1), (156, 1), (157, 1)], [(25, 1), (158, 1), (159, 1), (160, 1), (161, 1), (162, 1), (163, 1), (164, 1), (165, 1)], [(9, 1), (18, 1), (41, 1), (48, 1), (51, 1), (72, 2), (86, 1), (102, 1), (166, 1), (167, 1), (168, 1), 
(169, 1), (170, 1), (171, 1), (172, 1), (173, 1), (174, 1), (175, 1), (176, 1), (177, 1),
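To make the `(term_id, frequency)` pairs above easier to read, here is a plain-Python sketch of what a bag-of-words conversion does. It is not gensim's implementation; `build_token2id` and `to_bow` are hypothetical helpers that only imitate the behaviour visible in the output (ids assigned per document in alphabetical order, pairs sorted by id).

```python
from collections import Counter


def build_token2id(docs):
    """Assign integer ids to tokens, document by document, alphabetically within each."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Convert one tokenized document to sorted (term_id, frequency) pairs."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


docs = [["kindle", "small", "read", "small"], ["kindle", "light", "easy"]]
token2id = build_token2id(docs)
print(token2id)                             # {'kindle': 0, 'read': 1, 'small': 2, 'easy': 3, 'light': 4}
print([to_bow(d, token2id) for d in docs])  # [[(0, 1), (1, 1), (2, 2)], [(0, 1), (3, 1), (4, 1)]]
```

The pair `(2, 2)` in the first toy document plays the same role as `(12, 2)` in the real output: the term with that id occurs twice in the review.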

### Gensim for LSI

In [24]:
# gensim for latent semantic indexing (LSI)
from gensim.models.lsimodel import LsiModel
NumberOfTopics = 5
lsi = LsiModel(corpus=bow_tf, num_topics=NumberOfTopics, id2word=dictionary)
print(lsi)

LsiModel<num_terms=4650, num_topics=5, decay=1.0, chunksize=20000>


In [25]:
# print term-topic matrix (top 10 terms in each topic), the U matrix in X = U*S*V^T
for i, topic in lsi.print_topics(num_words=10):
    print('\nTopic', i)
    print(topic)


Topic 0
0.320*"tablet" + 0.290*"device" + 0.279*"amazon" + 0.233*"screen" + 0.220*"use" + 0.179*"one" + 0.169*"apps" + 0.157*"great" + 0.149*"kindle" + 0.135*"love"

Topic 1
0.303*"kindle" + -0.298*"device" + 0.287*"love" + 0.265*"great" + -0.191*"magazine" + 0.179*"would" + -0.172*"amazon" + 0.168*"bought" + -0.156*"apps" + -0.155*"app"

Topic 2
-0.407*"kindle" + 0.367*"great" + 0.274*"echo" + 0.215*"love" + -0.206*"book" + -0.200*"charge" + 0.187*"tablet" + -0.181*"read" + -0.181*"oasis" + -0.136*"äôt"

Topic 3
0.621*"tablet" + -0.473*"echo" + -0.262*"show" + -0.190*"alexa" + -0.140*"music" + 0.130*"great" + 0.119*"kid" + -0.117*"sound" + -0.116*"device" + 0.097*"price"

Topic 4
0.745*"love" + -0.393*"great" + -0.188*"tablet" + 0.147*"old" + 0.134*"year" + 0.133*"bought" + -0.125*"echo" + -0.124*"would" + -0.100*"price" + -0.099*"work"


In [26]:
# separate the terms with positive weight from those with negative weight
for i in range(NumberOfTopics):
    print('Topic', i)
    print('-'*50)
    g1 = []
    g2 = []
    for term, wt in lsi.show_topic(i, topn=10):
        if wt >= 0: g1.append((term, round(wt, 3)))
        else:       g2.append((term, round(wt, 3)))
    print('Gropu +:', g1)
    print('-'*50)
    print('Group -:', g2)
    print('='*50)

Topic 0
--------------------------------------------------
Gropu +: [('tablet', 0.32), ('device', 0.29), ('amazon', 0.279), ('screen', 0.233), ('use', 0.22), ('one', 0.179), ('apps', 0.169), ('great', 0.157), ('kindle', 0.149), ('love', 0.135)]
--------------------------------------------------
Group -: []
Topic 1
--------------------------------------------------
Gropu +: [('kindle', 0.303), ('love', 0.287), ('great', 0.265), ('would', 0.179), ('bought', 0.168)]
--------------------------------------------------
Group -: [('device', -0.298), ('magazine', -0.191), ('amazon', -0.172), ('apps', -0.156), ('app', -0.155)]
Topic 2
--------------------------------------------------
Gropu +: [('great', 0.367), ('echo', 0.274), ('love', 0.215), ('tablet', 0.187)]
--------------------------------------------------
Group -: [('kindle', -0.407), ('book', -0.206), ('charge', -0.2), ('read', -0.181), ('oasis', -0.181), ('äôt', -0.136)]
Topic 3
--------------------------------------------------
Grop

In [27]:
# print document-topic matrix, the V matrix in X = U*S*V^T
from gensim.matutils import corpus2dense
term_topic = lsi.projection.u  # left singular vectors U
singular_values = lsi.projection.s  # singular values S
# lsi[bow_tf] projects the documents into topic space (V scaled by S);
# dividing by the singular values recovers the right singular vectors
topic_document = (corpus2dense(lsi[bow_tf], len(singular_values)).T / singular_values).T
term_topic.shape, singular_values.shape, topic_document.shape

((4650, 5), (5,), (5, 5000))
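The division by the singular values in the cell above can be checked on a small example. This NumPy sketch (toy random matrix, not the review data) projects documents onto the top-k left singular vectors and confirms that rescaling by the singular values gives back the right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))   # toy term-document matrix: 6 terms x 4 documents
k = 2                    # number of topics to keep

U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Projecting documents onto the topics gives S_k * V_k^T (one column per document)
projected = U_k.T @ X

# Dividing each topic row by its singular value recovers V_k^T
recovered_Vt = (projected.T / s_k).T

print(np.allclose(recovered_Vt, Vt_k))  # True
```

In the notebook, `lsi[bow_tf]` plays the role of `projected` and `lsi.projection.s` the role of `s_k`.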

In [28]:
document_topics = pd.DataFrame(np.round(topic_document.T, 4),
                               columns=['Topic '+str(i) for i in range(NumberOfTopics)])
document_topics.head()

,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4
0,0.0074,0.0148,-0.0171,-0.0051,-0.0084
1,0.0051,0.0085,-0.0071,0.0003,0.0039
2,0.0056,0.0064,-0.0068,0.0004,0.0031
3,0.0287,0.0399,-0.0307,0.0063,0.041
4,0.0105,0.0207,0.0036,0.0091,0.0083


### Gensim for NMF

In [29]:
# gensim for non-negative matrix factorization (NMF)
from gensim.models.nmf import Nmf
nmf_model = Nmf(corpus=bow_tf, num_topics=NumberOfTopics, id2word=dictionary, random_state=1)

In [30]:
# print term-topic matrix (top 10 terms in each topic), the W matrix in X = W*H
# all coefficients are non-negative
for i, topic in nmf_model.print_topics(num_words=10):
    print('\nTopic', i)
    print(topic)


Topic 0
0.072*"tablet" + 0.063*"great" + 0.017*"price" + 0.016*"work" + 0.014*"amazon" + 0.013*"game" + 0.012*"use" + 0.011*"kid" + 0.010*"play" + 0.009*"good"

Topic 1
0.057*"love" + 0.019*"bought" + 0.018*"year" + 0.017*"old" + 0.015*"use" + 0.015*"time" + 0.013*"like" + 0.013*"kid" + 0.012*"one" + 0.011*"tablet"

Topic 2
0.030*"device" + 0.025*"amazon" + 0.019*"screen" + 0.017*"magazine" + 0.016*"app" + 0.014*"apps" + 0.014*"alexa" + 0.014*"home" + 0.010*"photo" + 0.009*"show"

Topic 3
0.031*"one" + 0.027*"echo" + 0.019*"screen" + 0.018*"sound" + 0.014*"device" + 0.013*"use" + 0.012*"show" + 0.012*"great" + 0.012*"light" + 0.011*"see"

Topic 4
0.044*"kindle" + 0.019*"book" + 0.018*"read" + 0.016*"charge" + 0.016*"would" + 0.015*"oasis" + 0.012*"äôt" + 0.011*"cover" + 0.009*"day" + 0.009*"amazon"


In [31]:
# print document-topic matrix, the H matrix (transpose) in X = W*H
topic_document_H = corpus2dense(nmf_model.get_document_topics(bow_tf), NumberOfTopics)
document_topic_H_df = pd.DataFrame(np.round(topic_document_H.T, 4),
                                   columns=['Topic '+str(i) for i in range(NumberOfTopics)])
document_topic_H_df.head()

,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4
0,0.0,0.0,0.0,0.1706,0.8294
1,0.0,0.0,0.0,0.3129,0.6871
2,0.0,0.237,0.0,0.1585,0.6045
3,0.0,0.4241,0.0,0.1731,0.4028
4,0.2164,0.4564,0.0,0.0,0.3272


### Gensim for LDA

In [32]:
# gensim for latent Dirichlet allocation (LDA)
from gensim.models.ldamodel import LdaModel
lda_model = LdaModel(corpus=bow_tf, num_topics=NumberOfTopics, id2word=dictionary, random_state=1)
print(lda_model)

LdaModel<num_terms=4650, num_topics=5, decay=0.5, chunksize=2000>


In [33]:
# print term-topic matrix (top 10 terms in each topic)
for i, topic in lda_model.print_topics(num_words=10):
    print('\nTopic', i)
    print(topic)


Topic 0
0.023*"kindle" + 0.014*"tablet" + 0.013*"year" + 0.013*"book" + 0.013*"old" + 0.012*"love" + 0.012*"bought" + 0.010*"use" + 0.009*"game" + 0.009*"amazon"

Topic 1
0.039*"tablet" + 0.033*"great" + 0.017*"use" + 0.016*"one" + 0.015*"amazon" + 0.015*"work" + 0.014*"good" + 0.011*"apps" + 0.011*"easy" + 0.011*"price"

Topic 2
0.035*"use" + 0.029*"tablet" + 0.026*"easy" + 0.012*"like" + 0.012*"good" + 0.012*"love" + 0.011*"book" + 0.010*"reading" + 0.008*"great" + 0.008*"device"

Topic 3
0.047*"love" + 0.033*"great" + 0.024*"kindle" + 0.024*"tablet" + 0.018*"one" + 0.018*"bought" + 0.014*"price" + 0.012*"fire" + 0.012*"kid" + 0.012*"book"

Topic 4
0.016*"screen" + 0.015*"like" + 0.013*"better" + 0.011*"would" + 0.008*"price" + 0.008*"one" + 0.008*"echo" + 0.007*"doe" + 0.007*"get" + 0.007*"fire"


In [35]:
# print document-topic matrix, the theta parameter in LDA model
topic_document_theta = corpus2dense(lda_model.get_document_topics(bow_tf), NumberOfTopics)
document_topic_theta_df = pd.DataFrame(np.round(topic_document_theta.T, 4),
                                       columns=['Topic '+str(i) for i in range(NumberOfTopics)])
document_topic_theta_df.head()

,Topic 0,Topic 1,Topic 2,Topic 3,Topic 4
0,0.0102,0.0102,0.0102,0.959,0.0104
1,0.0293,0.0289,0.8831,0.0296,0.0291
2,0.7667,0.0148,0.1888,0.0147,0.0149
3,0.3309,0.0,0.1674,0.4955,0.0
4,0.473,0.0102,0.0102,0.2764,0.2301
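A common next step with a document-topic matrix is to label each document with its dominant topic. The sketch below does this for the five rows shown above, with the values copied from the table. The exact zeros in row 3 are probably topics that `get_document_topics` dropped for falling below its minimum probability, which is an assumption about the default setting.

```python
import numpy as np

# First five rows of the document-topic matrix shown above
theta = np.array([
    [0.0102, 0.0102, 0.0102, 0.9590, 0.0104],
    [0.0293, 0.0289, 0.8831, 0.0296, 0.0291],
    [0.7667, 0.0148, 0.1888, 0.0147, 0.0149],
    [0.3309, 0.0000, 0.1674, 0.4955, 0.0000],
    [0.4730, 0.0102, 0.0102, 0.2764, 0.2301],
])

dominant_topic = theta.argmax(axis=1)   # index of the largest weight per document
dominant_weight = theta.max(axis=1)     # the weight itself

for doc_id, (topic, weight) in enumerate(zip(dominant_topic, dominant_weight)):
    print(f"document {doc_id}: topic {topic} (weight {weight:.3f})")
```

Documents 3 and 4 have dominant weights below 0.5, so a single label hides a fair amount of mixing there.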


In [36]:
# get the topics with the highest coherence score based on top 10 terms
topics_coherences = lda_model.top_topics(bow_tf, topn=10)  # default coherence='u_mass'

# compute the average score for the top topics
avg_coherence_score = np.mean([item[1] for item in topics_coherences])
print('Average coherence score:', avg_coherence_score)

Average coherence score: -2.028441029653849
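For intuition about why the `u_mass` score is negative, here is a simplified document co-occurrence coherence in plain Python. It follows the usual UMass idea (log of smoothed co-document frequency over the document frequency of the higher-ranked term, averaged over term pairs). It is a sketch under those assumptions, not gensim's implementation, which differs in details such as smoothing. The toy documents are made up.

```python
import math
from itertools import combinations


def umass_coherence(top_terms, docs):
    """Average log((D(w_hi, w_lo) + 1) / D(w_hi)) over ordered pairs of top terms.

    Assumes every term in top_terms occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*terms):
        # number of documents containing all the given terms
        return sum(all(t in d for t in terms) for d in doc_sets)

    pair_scores = []
    for i, j in combinations(range(len(top_terms)), 2):
        w_hi, w_lo = top_terms[i], top_terms[j]   # w_hi is ranked above w_lo
        pair_scores.append(math.log((doc_freq(w_hi, w_lo) + 1) / doc_freq(w_hi)))
    return sum(pair_scores) / len(pair_scores)


docs = [["kindle", "book", "read"], ["kindle", "book"],
        ["echo", "music"], ["echo", "alexa", "music"]]

print(umass_coherence(["kindle", "book"], docs))  # terms that co-occur: higher score
print(umass_coherence(["kindle", "echo"], docs))  # terms that never co-occur: lower score
```

Terms that rarely share a document push the ratio below 1 and the log below 0, so scores closer to zero indicate more coherent topics under this measure.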


#### Show topics

In [38]:
# other ways to show the results with top 10 terms
topics_with_wts = [item[0] for item in topics_coherences]
print('LDA topics with weights:')
for i, topic in enumerate(topics_with_wts):
    print('\nTopic', i)
    print([(term, round(wt, 3)) for wt, term in topic])

LDA topics with weights:

Topic 0
[('kindle', 0.023), ('tablet', 0.014), ('year', 0.013), ('book', 0.013), ('old', 0.013), ('love', 0.012), ('bought', 0.012), ('use', 0.01), ('game', 0.009), ('amazon', 0.009)]

Topic 1
[('tablet', 0.039), ('great', 0.033), ('use', 0.017), ('one', 0.016), ('amazon', 0.015), ('work', 0.015), ('good', 0.014), ('apps', 0.011), ('easy', 0.011), ('price', 0.011)]

Topic 2
[('use', 0.035), ('tablet', 0.029), ('easy', 0.026), ('like', 0.012), ('good', 0.012), ('love', 0.012), ('book', 0.011), ('reading', 0.01), ('great', 0.008), ('device', 0.008)]

Topic 3
[('love', 0.047), ('great', 0.033), ('kindle', 0.024), ('tablet', 0.024), ('one', 0.018), ('bought', 0.018), ('price', 0.014), ('fire', 0.012), ('kid', 0.012), ('book', 0.012)]

Topic 4
[('screen', 0.016), ('like', 0.015), ('better', 0.013), ('would', 0.011), ('price', 0.008), ('one', 0.008), ('echo', 0.008), ('doe', 0.007), ('get', 0.007), ('fire', 0.007)]


In [41]:
# term-by-topic table of (term, weight) pairs; kept under its own name so the next table does not overwrite it
topics_wts_df = pd.DataFrame([[(term, round(wt, 3)) for (wt, term) in topic] for topic in topics_with_wts],
                             columns=['Term '+str(i) for i in range(0, 10)],
                             index=['Topic '+str(t) for t in range(0, lda_model.num_topics)]).T

pd.set_option('display.max_colwidth', None)
# one row per topic, with its top terms joined into a single string
topics_df = pd.DataFrame([', '.join([term for wt, term in topic]) for topic in topics_with_wts],
                         columns=['Terms per Topic'],
                         index=['Topic '+str(t) for t in range(0, lda_model.num_topics)])
topics_df

,Terms per Topic
Topic 0,"kindle, tablet, year, book, old, love, bought, use, game, amazon"
Topic 1,"tablet, great, use, one, amazon, work, good, apps, easy, price"
Topic 2,"use, tablet, easy, like, good, love, book, reading, great, device"
Topic 3,"love, great, kindle, tablet, one, bought, price, fire, kid, book"
Topic 4,"screen, like, better, would, price, one, echo, doe, get, fire"


In [43]:
# compute four different coherence scores based on all terms
from gensim.models import CoherenceModel
cv_coherence_model_lda = CoherenceModel(model=lda_model,
                                        corpus=bow_tf, 
                                        texts=tokened_docs, 
                                        dictionary=dictionary, 
                                        coherence='c_v')
print('Average coherence score (c_v):', cv_coherence_model_lda.get_coherence())

umass_coherence_model_lda = CoherenceModel(model=lda_model,
                                           corpus=bow_tf,
                                           texts=tokened_docs,
                                           dictionary=dictionary,
                                           coherence='u_mass')
print('Average coherence score (u_mass):', umass_coherence_model_lda.get_coherence())

uci_coherence_model_lda = CoherenceModel(model=lda_model,
                                         corpus=bow_tf,
                                         texts=tokened_docs,
                                         dictionary=dictionary,
                                         coherence='c_uci')
print('Average coherence score (c_uci):', uci_coherence_model_lda.get_coherence())

npmi_coherence_model_lda = CoherenceModel(model=lda_model,
                                          corpus=bow_tf,
                                          texts=tokened_docs,
                                          dictionary=dictionary,
                                          coherence='c_npmi')
print('Average coherence score (c_npmi):', npmi_coherence_model_lda.get_coherence())

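# note: log_perplexity returns a per-word likelihood bound (hence the negative value), not the perplexity itself
perplexity = lda_model.log_perplexity(bow_tf)
print('Model perplexity:', perplexity)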

Average coherence score (c_v): 0.31110491219621467
Average coherence score (u_mass): -2.312433545281084
Average coherence score (c_uci): -0.0651867048688011
Average coherence score (c_npmi): 0.0009089531016212293
Model perplexity: -7.075235269703249
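The value printed as "Model perplexity" is negative because `log_perplexity` returns a per-word likelihood bound. As far as I know, gensim reports `2 ** (-bound)` as its perplexity estimate in the logs. The sketch below applies that conversion to the number above; `perplexity_from_bound` is a hypothetical helper.

```python
def perplexity_from_bound(per_word_bound):
    """Convert gensim's per-word likelihood bound to a perplexity estimate."""
    return 2 ** (-per_word_bound)


per_word_bound = -7.075235269703249   # value printed by the cell above
perplexity = perplexity_from_bound(per_word_bound)
print(round(perplexity, 1))           # roughly 135
```

This figure is not directly comparable with the scikit-learn perplexity reported earlier, since the two libraries use different vocabularies here and compute the bound differently.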


In [51]:
# gensim tqdm search for the best number of topics for LDA based on coherence score
from tqdm import tqdm
def topic_model_coherence_generator(corpus, texts, dictionary,
                                    start_topic_count, end_topic_count, step):
    models = []
    coherence_scores = []
    for number_of_topics in tqdm(range(start_topic_count, end_topic_count+1, step)):
        # use the function arguments rather than the notebook-level globals
        lda_model = LdaModel(corpus=corpus,
                             num_topics=number_of_topics,
                             id2word=dictionary,
                             random_state=1)
        coherence_model_lda = CoherenceModel(model=lda_model,
                                             corpus=corpus,
                                             texts=texts,
                                             dictionary=dictionary,
                                             coherence='c_v')
        coherence_score = coherence_model_lda.get_coherence()
        coherence_scores.append(coherence_score)
        models.append(lda_model)
    return models, coherence_scores

lda_models, coherence_scores = topic_model_coherence_generator(corpus=bow_tf,
                                                               texts=tokened_docs,
                                                               dictionary=dictionary,
                                                               start_topic_count=2,
                                                               end_topic_count=20,
                                                               step=1)

100%|██████████████████████████████████████████████████████████████████████████| 19/19 [04:40<00:00, 14.75s/it]


In [53]:
# display the search results
coherence_df = pd.DataFrame({'Number of Topics': range(2, 21, 1), #(start_topic_ct, end_topic_ct+1, step)
                             'Coherence Score': np.round(coherence_scores, 4)})
coherence_df.sort_values(by=['Coherence Score'], ascending=False)

,Number of Topics,Coherence Score
9,11,0.3164
6,8,0.316
5,7,0.3158
12,14,0.3148
8,10,0.3135
10,12,0.3112
3,5,0.3111
15,17,0.3087
1,3,0.3063
13,15,0.3032


In [None]:
-end- 