# Topic Modeling for Everybody with Google Colab

**Super simple topic modeling using both the Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) algorithms.**

This Google Colab notebook makes topic modeling accessible to everybody. Text data is loaded from a Google Sheet, and topics are derived with both NMF and LDA. Only simple form entry is required to set:

* the name of the Google Sheet
* the number of topics to be generated
* the number of top words and top documents to print for each topic





In [1]:
#@title Install pyLDAvis (specific version for Google Colab)
!pip install pyLDAvis==2.1.2

Collecting pyLDAvis==2.1.2
  Downloading pyLDAvis-2.1.2.tar.gz (1.6 MB)
     |████████████████████████████████| 1.6 MB 5.2 MB/s
Collecting funcy
  Downloading funcy-1.17-py2.py3-none-any.whl (33 kB)
Building wheels for collected packages: pyLDAvis
  Building wheel for pyLDAvis (setup.py) ... done
  Created wheel for pyLDAvis: filename=pyLDAvis-2.1.2-py2.py3-none-any.whl size=97738 sha256=cb233eba405f92b6add6e46e4a352a930e23304056f6bd3ec25e93862c9af23a
  Stored in directory: /root/.cache/pip/wheels/3b/fb/41/e32e5312da9f440d34c4eff0d2207b46dc9332a7b931ef1e89
Successfully built pyLDAvis
Installing collected packages: funcy, pyLDAvis
Successfully installed funcy-1.17 pyLDAvis-2.1.2


In [2]:
from google.colab import drive
import os
drive.mount('/content/drive')
os.chdir("/content/drive/MyDrive/NLP Project")  # absolute path, so re-running this cell still works
path="./splitData"


Mounted at /content/drive


In [4]:
import pandas as pd

In [5]:
df_censored=pd.read_pickle(path+"/censored.pickle")

In [6]:
df_censored.head()

Unnamed: 0,mid,text,created_at,deleted_last_seen,permission_denied,guse
0,mkTL1pvXTN,这王某人与中国青年报怎么这么有缘啊。uRLOGZLR4： 原来是个大流氓,2012-01-07 22:29:56,2012-03-03 03:35:43.903155,True,"[0.006624801550060511, 0.031350668519735336, 0..."
1,m5PIm1JzG8,经常看到有代表四处调研，你代表的是哪个行业，就好好调研一下本行业的问题好不好？自己的工作都做...,2012-01-05 09:00:03,2012-01-30 21:36:44.229791,True,"[0.03378220275044441, -0.007479660212993622, 0..."
2,mJGNyWD5fj,刚在宝宝的书架上发现这本书，翻了一下，历史事实描述一派胡言。问宝宝，这本书的内容你们考试吗？...,2012-01-02 12:45:58,False,True,"[0.009367837570607662, -0.03505528345704079, -..."
3,m3e5KqAhFh,应该找uRLOWL0ZX： 去影他相！新年流流揸公车出来行，仲要违章，真是岂有此理！,2012-01-03 18:15:24,2012-01-30 19:15:54.549843,True,"[-0.02759631723165512, -0.009455885738134384, ..."
4,mkTLyULvfH,崔永元：微博春晚节目——天津快板：竹板这么一打啊，啥也不能说，不能说食品，不能说动车，不能说...,2012-01-07 12:02:02,2012-02-28 11:08:24.583212,True,"[0.002411450492218137, 0.025735968723893166, 0..."


In [7]:
documents=df_censored["text"].to_list()

In [8]:
len(documents)

134417

In [None]:
# #@title Load and preview data from a Google Sheet

# gc = gspread.authorize(GoogleCredentials.get_application_default())

# worksheet = gc.open(googlesheet_filename).sheet1

# # get_all_values gives a list of rows.
# rows = worksheet.get_all_values()

# # convert the 2nd column values to a list
# documents = []
# for row in rows[1:]:
#   documents.append(row[1])
  
# #print(documents)

# # Convert to a DataFrame and render.
# import pandas as pd
# dataset_df = pd.DataFrame.from_records(rows)
# dataset_df.head(n=data_rows_to_preview)


Unnamed: 0,0,1
0,id,text
1,1,Human machine interface for Lab ABC computer a...
2,2,A survey of user opinion of computer system re...
3,3,The EPS user interface management system
4,4,System and human system engineering testing of...
5,5,Relation of user-perceived response time to er...
6,6,"The generation of random, binary, unordered trees"
7,7,The intersection graph of paths in trees
8,8,Graph minors IV: Widths of trees and quasi-ord...
9,9,Graph minors: A survey




---




In [10]:
#@title Set topic modeling algorithm arguments

no_topics = 3 #@param {type:"integer"}

no_top_words = 4 #@param {type:"integer"}

no_top_documents = 3 #@param {type:"integer"}

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import numpy as np
import jieba
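
To make the three settings concrete: `no_topics` fixes how many topics the models produce, while `no_top_words` and `no_top_documents` are ranking cut-offs applied to the fitted topic-word and document-topic weights. The following is a minimal sketch of that ranking step in plain Python. The toy `vocab`, `docs`, `H`, and `W` values and the helper names `top_indices` and `summarize_topics` are made up for illustration and are not part of the notebook.

```python
def top_indices(weights, n):
    """Return the indices of the n largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return order[:n]


def summarize_topics(H, W, vocab, docs, no_top_words, no_top_documents):
    """For each topic, return (top words, top documents).

    H has one row per topic and one column per vocabulary term.
    W has one row per document and one column per topic.
    """
    summary = []
    for topic_idx, topic in enumerate(H):
        words = [vocab[i] for i in top_indices(topic, no_top_words)]
        doc_weights = [row[topic_idx] for row in W]
        top_docs = [docs[i] for i in top_indices(doc_weights, no_top_documents)]
        summary.append((words, top_docs))
    return summary


# Hypothetical toy data, chosen only to illustrate the ranking.
vocab = ["graph", "trees", "user", "system"]
docs = [
    "Graph minors: A survey",
    "The EPS user interface management system",
    "The intersection graph of paths in trees",
]
H = [
    [0.9, 0.7, 0.0, 0.1],  # topic 0 weights over vocab
    [0.0, 0.1, 0.8, 0.6],  # topic 1 weights over vocab
]
W = [
    [0.8, 0.0],  # document 0 weights over topics
    [0.1, 0.9],  # document 1
    [0.7, 0.1],  # document 2
]

for topic_idx, (words, top_docs) in enumerate(
    summarize_topics(H, W, vocab, docs, no_top_words=2, no_top_documents=1)
):
    print("Topic %d:" % topic_idx, words, top_docs)
```

Raising `no_top_words` or `no_top_documents` only lengthens the printed lists; it does not change the fitted model.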


In [11]:
#@title Run NMF

def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        print(" ".join([ (feature_names[i] + " (" + str(topic[i].round(2)) + ")")
          for i in topic.argsort()[:-no_top_words - 1:-1]]))
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print(str(doc_index) + ". " + documents[doc_index])

# NMF is able to use tf-idf

# Adapt the vectorizer to Chinese text: custom stop-word list and jieba tokenization
# https://investigate.ai/text-analysis/using-tf-idf-with-chinese/
chinese_stopwords = ["、","。","〈","〉","《","》","一","一些","一何","一切","一则","一方面","一旦","一来","一样","一般","一转眼","七","万一","三","上","上下","下","不","不仅","不但","不光","不单","不只","不外乎","不如","不妨","不尽","不尽然","不得","不怕","不惟","不成","不拘","不料","不是","不比","不然","不特","不独","不管","不至于","不若","不论","不过","不问","与","与其","与其说","与否","与此同时","且","且不说","且说","两者","个","个别","中","临","为","为了","为什么","为何","为止","为此","为着","乃","乃至","乃至于","么","之","之一","之所以","之类","乌乎","乎","乘","九","也","也好","也罢","了","二","二来","于","于是","于是乎","云云","云尔","五","些","亦","人","人们","人家","什","什么","什么样","今","介于","仍","仍旧","从","从此","从而","他","他人","他们","他们们","以","以上","以为","以便","以免","以及","以故","以期","以来","以至","以至于","以致","们","任","任何","任凭","会","似的","但","但凡","但是","何","何以","何况","何处","何时","余外","作为","你","你们","使","使得","例如","依","依据","依照","便于","俺","俺们","倘","倘使","倘或","倘然","倘若","借","借傥然","假使","假如","假若","做","像","儿","先不先","光是","全体","全部","八","六","兮","共","关于","关于具体地说","其","其一","其中","其二","其他","其余","其它","其次","具体地说","具体说来","兼之","内","再","再其次","再则","再有","再者","再者说","再说","冒","冲","况且","几","几时","凡","凡是","凭","凭借","出于","出来","分","分别","则","则甚","别","别人","别处","别是","别的","别管","别说","到","前后","前此","前者","加之","加以","即","即令","即使","即便","即如","即或","即若","却","去","又","又及","及","及其","及至","反之","反而","反过来","反过来说","受到","另","另一方面","另外","另悉","只","只当","只怕","只是","只有","只消","只要","只限","叫","叮咚","可","可以","可是","可见","各","各个","各位","各种","各自","同","同时","后","后者","向","向使","向着","吓","吗","否则","吧","吧哒","含","吱","呀","呃","呕","呗","呜","呜呼","呢","呵","呵呵","呸","呼哧","咋","和","咚","咦","咧","咱","咱们","咳","哇","哈","哈哈","哉","哎","哎呀","哎哟","哗","哟","哦","哩","哪","哪个","哪些","哪儿","哪天","哪年","哪怕","哪样","哪边","哪里","哼","哼唷","唉","唯有","啊","啐","啥","啦","啪达","啷当","喂","喏","喔唷","喽","嗡","嗡嗡","嗬","嗯","嗳","嘎","嘎登","嘘","嘛","嘻","嘿","嘿嘿","四","因","因为","因了","因此","因着","因而","固然","在","在下","在于","地","基于","处在","多","多么","多少","大","大家","她","她们","好","如","如上","如上所述","如下","如何","如其","如同","如是","如果","如此","如若","始而","孰料","孰知","宁","宁可","宁愿","宁肯","它","它们","对","对于","对待","对方","对比","将","小","尔","尔后","尔尔","尚且","就","就是","就是了","就是说","就算","就要","尽","尽管","尽管如此","岂但","己","已","已矣","巴","巴巴","年","并","并且","庶乎","庶几",
"开外","开始","归","归齐","当","当地","当然","当着","彼","彼时","彼此","往","待","很","得","得了","怎","怎么","怎么办","怎么样","怎奈","怎样","总之","总的来看","总的来说","总的说来","总而言之","恰恰相反","您","惟其","慢说","我","我们","或","或则","或是","或曰","或者","截至","所","所以","所在","所幸","所有","才","才能","打","打从","把","抑或","拿","按","按照","换句话说","换言之","据","据此","接着","故","故此","故而","旁人","无","无宁","无论","既","既往","既是","既然","日","时","时候","是","是以","是的","更","曾","替","替代","最","月","有","有些","有关","有及","有时","有的","望","朝","朝着","本","本人","本地","本着","本身","来","来着","来自","来说","极了","果然","果真","某","某个","某些","某某","根据","欤","正值","正如","正巧","正是","此","此地","此处","此外","此时","此次","此间","毋宁","每","每当","比","比及","比如","比方","没奈何","沿","沿着","漫说","焉","然则","然后","然而","照","照着","犹且","犹自","甚且","甚么","甚或","甚而","甚至","甚至于","用","用来","由","由于","由是","由此","由此可见","的","的确","的话","直到","相对而言","省得","看","眨眼","着","着呢","矣","矣乎","矣哉","离","秒","竟而","第","等","等到","等等","简言之","管","类如","紧接着","纵","纵令","纵使","纵然","经","经过","结果","给","继之","继后","继而","综上所述","罢了","者","而","而且","而况","而后","而外","而已","而是","而言","能","能否","腾","自","自个儿","自从","自各儿","自后","自家","自己","自打","自身","至","至于","至今","至若","致","般的","若","若夫","若是","若果","若非","莫不然","莫如","莫若","虽","虽则","虽然","虽说","被","要","要不","要不是","要不然","要么","要是","譬喻","譬如","让","许多","论","设使","设或","设若","诚如","诚然","该","说","说来","请","诸","诸位","诸如","谁","谁人","谁料","谁知","贼死","赖以","赶","起","起见","趁","趁着","越是","距","跟","较","较之","边","过","还","还是","还有","还要","这","这一来","这个","这么","这么些","这么样","这么点儿","这些","这会儿","这儿","这就是说","这时","这样","这次","这般","这边","这里","进而","连","连同","逐步","通过","遵循","遵照","那","那个","那么","那么些","那么样","那些","那会儿","那儿","那时","那样","那般","那边","那里","都","鄙人","鉴于","针对","阿","除","除了","除外","除开","除此之外","除非","随","随后","随时","随着","难道说","零","非","非但","非徒","非特","非独","靠","顺","顺着","首先","︿","！","＃","＄","％","＆","（","）","＊","＋","，","０","１","２","３","４","５","６","７","８","９","：","；","＜","＞","？","＠","［","］","｛","｜","｝","～","￥"]

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words=chinese_stopwords, tokenizer=jieba.lcut)
tfidf = tfidf_vectorizer.fit_transform(documents)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

# Run NMF
nmf_model = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
nmf_W = nmf_model.transform(tfidf)
nmf_H = nmf_model.components_

print("NMF Topics")
display_topics(nmf_H, nmf_W, tfidf_feature_names, documents, no_top_words, no_top_documents)
print("--------------")



Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
Loading model cost 1.062 seconds.
Prefix dict has been built successfully.


NMF Topics
Topic 0:
转发 (13.71) 微博 (13.45) 轉發 (0.19) ukn (0.16)
73634. 转发微博
102303. 转发微博
79129. 转发微博
Topic 1:
  (23.88) ukn (5.01) 话筒 (0.79) 转 (0.61)
91018.  純呀么純
129406. uFABE3MG： 哈哈
94713. XVIII 大！
Topic 2:
… (10.04) ” (2.96) “ (2.95) 中国 (0.92)
89729. …………
56287. 嘘…
105787. …
--------------
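
In the output above, several top terms are whitespace or punctuation (a blank token, `…`, `”`, `“`), which the stop-word list does not cover. One possible remedy, sketched here under the assumption that such tokens carry no topical meaning for this corpus, is to drop them after tokenization. The helper names `is_noise` and `filter_tokens` are hypothetical; only the standard-library `unicodedata` module is used.

```python
import unicodedata


def is_noise(token):
    """True if the token is empty or consists only of whitespace,
    punctuation (Unicode category P*), symbols (S*), or separators (Z*)."""
    return all(
        ch.isspace() or unicodedata.category(ch)[0] in "PSZ"
        for ch in token
    )


def filter_tokens(tokens):
    """Keep only tokens that contain at least one non-noise character."""
    return [t for t in tokens if not is_noise(t)]


# Hypothetical token list resembling the terms seen in the NMF output.
print(filter_tokens(["转发", " ", "…", "”", "“", "中国", "…………"]))
```

It could be plugged in as `tokenizer=lambda text: filter_tokens(jieba.lcut(text))` in the `TfidfVectorizer` call; whether that yields more meaningful topics would need to be checked by re-running the cell.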


In [16]:
# #@title Visualise NMF with pyLDAvis

# import pyLDAvis.sklearn

# pyLDAvis.enable_notebook()

# pyLDAvis_data = pyLDAvis.sklearn.prepare(nmf_model, tfidf, tfidf_vectorizer)
# # Visualization can be displayed in the notebook
# pyLDAvis.display(pyLDAvis_data)

In [14]:
#@title Run LDA

# LDA can only use raw term counts because it is a probabilistic graphical model

# Reuse the chinese_stopwords list defined in the NMF cell above.

# tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words=chinese_stopwords, tokenizer=jieba.lcut)

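# Note: no tokenizer is passed here, so CountVectorizer falls back to its default
# regex tokenization, which does not segment Chinese text into words. Pass
# tokenizer=jieba.lcut (as in the NMF cell) to segment with jieba instead.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words=chinese_stopwords)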
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names()

# Run LDA
lda_model = LatentDirichletAllocation(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

print("LDA Topics")
display_topics(lda_H, lda_W, tf_feature_names, documents, no_top_words, no_top_documents)



LDA Topics
Topic 0:
回复ukn (1240.57) 吃惊 (492.36) 求证 (465.45) 支持 (423.21)
86507. 【新三字经】每两会 必热点 房价高 上青天 孩子苦 上学难 看病贵 泪涟涟 官财产 捂得严 贫富差 日渐宽 CPI 高运转 贪与腐 年复年 众代表 非民间 不官商 即大款 偶贤人 被禁言 好提案 落地难 雷人语 炸晕咱 娱乐化 笑翻天 劝诸位 清醒点 话语权 不在咱 安守己 品甘甜 活下去 亦很难 只留下 一声叹
38133. 【新三字经】每两会 必热点 房价高 上青天 孩子苦 上学难 看病贵 泪涟涟 官财产 捂得严 贫富差 日渐宽 CPI 高运转 贪与腐 年复年 众代表 非民间 不官商 即大款 偶贤人 被禁言 好提案 落地难 雷人语 炸晕咱 娱乐化 笑翻天 劝诸位 清醒点 话语权 不在咱 安守己 品甘甜 活下去 亦很难 只留下 一声叹
3918. 自学英语，应付老外：专栏作家、编辑、共产党、主席、选票、民主、自由、文明、小说、文化、时事、政治、改革、农民、独裁者、安理会、人权、游行、革命、国民党、西藏、台湾、诺贝尔和平奖、13亿、爱未来、dl喇嘛、拆迁、虎妈、甄子丹、监狱、金陵十三钗、游行、小平邓、熊猫、刘xb。真难！
Topic 1:
ulcmy22pk (831.89) 蜡烛 (772.03) uvgjj1db0 (763.79) 偷笑 (597.18)
35271. Have you ever taken any taxi in London Can you as a passenger lower the windows or open the doors from within the taxi I am not titfortatting but serious If I were the taxi driver I would choose to take control of the windows and the doors for my own safety
83645. Have you ever taken any taxi in London Can you as a passenger lower the windows or open the doors from within 
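
The LDA terms above include long unsegmented strings such as `回复ukn`, which is what one would expect when no tokenizer is passed: scikit-learn's `CountVectorizer` then lowercases the text and applies its default `token_pattern`, `(?u)\b\w\w+\b`. The sketch below reproduces that regex with the standard `re` module to show the effect. The sample strings are made up, and `default_tokens` is a hypothetical helper, not a scikit-learn function.

```python
import re

# Same regex as CountVectorizer's default token_pattern:
# runs of two or more word characters between word boundaries.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def default_tokens(text):
    """Approximate the default CountVectorizer analysis: lowercase, then regex split."""
    return DEFAULT_TOKEN_PATTERN.findall(text.lower())


# Chinese characters count as word characters, so an unspaced run of Chinese
# and Latin text stays together as one token until punctuation or a space.
print(default_tokens("回复UKN： 哈哈"))
print(default_tokens("Topic modeling is fun"))
```

Passing `tokenizer=jieba.lcut`, as the NMF cell does, would segment the text into words instead, which also gives the stop-word list a chance to match.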

In [15]:
#@title Visualise LDA with pyLDAvis

import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

pyLDAvis_data = pyLDAvis.sklearn.prepare(lda_model, tf, tf_vectorizer)
# Visualization can be displayed in the notebook
pyLDAvis.display(pyLDAvis_data)

