Latent Dirichlet Allocation (LDA) is a topic model: for each document in a corpus, it expresses the document's topics as a probability distribution.
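
To make "topics as a probability distribution" concrete, here is a minimal sketch of the generative story LDA assumes, using only the standard library. The two toy topics, their word probabilities, and the helper names `sample_dirichlet` and `generate_document` are illustrative assumptions, not part of the notebook or of gensim.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, rng):
    """Generate one document following LDA's generative story.

    topic_word: list of {word: probability} dicts, one per topic (toy values).
    alpha: Dirichlet prior over topics.
    """
    theta = sample_dirichlet(alpha, rng)  # document-topic distribution
    words = []
    for _ in range(length):
        # Pick a topic for this word position according to theta
        z = rng.choices(range(len(theta)), weights=theta)[0]
        vocab = list(topic_word[z])
        # Pick a word according to the chosen topic's word distribution
        w = rng.choices(vocab, weights=[topic_word[z][v] for v in vocab])[0]
        words.append(w)
    return theta, words

rng = random.Random(0)
topic_word = [
    {"冬奥": 0.5, "北京": 0.3, "加油": 0.2},  # toy "Winter Olympics" topic
    {"电影": 0.5, "导演": 0.3, "票房": 0.2},  # toy "film" topic
]
theta, words = generate_document(topic_word, alpha=[0.5, 0.5], length=8, rng=rng)
print([round(p, 3) for p in theta])  # the document's topic mixture
print(words)
```

Training an LDA model runs this story in reverse: given only the words, it estimates the per-document mixture `theta` and the per-topic word distributions.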

# 1. Reproducing the in-class example code

## 1.1 Data import and preprocessing

In [3]:
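import pandas as pd     # data frames
import numpy as np     # arrays
import re     # regular expressions
import jieba     # Chinese word segmentation
import matplotlib.pyplot as plt     # plotting
from gensim import corpora, models
import pyLDAvis     # interactive LDA visualization
import pyLDAvis.gensim as gensimvis     # newer pyLDAvis releases name this module pyLDAvis.gensim_models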

In [14]:
df = pd.read_csv('text_analysis_weibo.csv', index_col = 0)
df_weibo = df[:1000].copy()     # explicit copy, so later column assignments do not raise SettingWithCopyWarning

In [17]:
df_weibo.head()

Unnamed: 0,标题/微博内容,点赞,转发,评论,账号昵称UID加密,粉丝数,关注数,地域
0,#高校通报教师图书馆打电话声音过大出言不逊#公道自在人心，谣言自在人心 ​​,0,0,0,a2331b38901d62d2d9a20529177ef3b3,0,22,湖北
1,转发C,0,0,0,d6dc4470f51fce93cc0cbad8abf55a75,0,33,广西
2,【#刘雨昕运动者联濛#河山覆冰雪，健儿迎冬奥[金牌]全能唱跳不设限，运动联濛开新年🇨🇳 期待...,0,0,0,372bc4782eb442b88035f920a7c1a68e,6,85,广东
3,丁程鑫//@丁程鑫后援会官博:#丁程鑫[超话]# ✨#丁程鑫 二十成金筑梦鑫世界# 大年初一...,0,0,0,6fe0d482bd3e78a3483e2a1d57f14ef2,75,1012,广东
4,诶，你们真不要脸诶。。。没资格宣传奥运。。。抵制抵制！,0,0,0,872380d71d6ee9130e8b49d331f2baa9,0,10,广东


### 1.1.1 Removing symbols and digits

In [18]:
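def remove_nums(text):
    # Keep only characters in the range U+4E00..U+9FA5 (common Chinese characters);
    # digits, Latin letters, punctuation and emoji are all dropped.
    nonums = re.sub('[^\u4e00-\u9fa5]+', '', text)
    return nonums

test = df_weibo['标题/微博内容'][0]
remove_nums(test)[:100]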

'高校通报教师图书馆打电话声音过大出言不逊公道自在人心谣言自在人心'
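
The pattern in `remove_nums` removes more than its name suggests. The sketch below applies the same character class to a few made-up strings to show what survives; `keep_chinese` and the sample strings are illustrative names and values only.

```python
import re

# Same character class as remove_nums: anything outside U+4E00..U+9FA5 is deleted.
CJK_ONLY = re.compile(r'[^\u4e00-\u9fa5]+')

def keep_chinese(text):
    """Return text with every non-Chinese character removed."""
    return CJK_ONLY.sub('', text)

samples = ['转发C', '冬奥2022，加油!', 'Beijing 北京 #冬奥#']
for s in samples:
    print(repr(s), '->', repr(keep_chinese(s)))
```

Because whitespace and punctuation are removed too, neighbouring phrases are glued together before segmentation, which can produce odd tokens at the former boundaries.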

### 1.1.2 Word segmentation

In [19]:
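# Load the Chinese stop-word list (customizable). A set makes `in` match whole words
# rather than substrings of the file contents.
with open('stopwords.txt', encoding = 'utf-8') as f:
    stopwords = {line.strip() for line in f if line.strip()}

def clean_text(text):
    words = jieba.lcut(text)
    words = [w for w in words if w not in stopwords and w!='[' and w!=']']
    return ' '.join(words)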

test = df_weibo['标题/微博内容'][0]
clean_text(test)

'高校 通报 教师 图书馆 打电话 声音 过大 出言不逊 公道 人心 谣言 人心 \u200b \u200b'
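
How the stop-word list is stored changes what gets filtered: `in` against a string is a substring test, while `in` against a set matches whole entries only. A small sketch with a made-up three-word list (`stopword_text` is a hypothetical stand-in for the contents of `stopwords.txt`):

```python
# Hypothetical stop-word file content, one word per line
stopword_text = "我们\n但是\n一个\n"

# Whole-word lookup: split the file content into a set of entries
stopword_set = set(stopword_text.split())

words = ['我', '喜欢', '一个', '个']

# Substring test: '我' and '个' are dropped because they occur inside '我们' and '一个'
kept_by_string = [w for w in words if w not in stopword_text]

# Set membership: only the exact entry '一个' is dropped
kept_by_set = [w for w in words if w not in stopword_set]

print(kept_by_string)
print(kept_by_set)
```

With a string, any token that happens to appear inside some stop word is silently discarded, so single-character tokens are especially affected.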

In [21]:
df_weibo['标题/微博内容'] = df_weibo['标题/微博内容'].astype(str)
df_weibo['微博内容分词'] = df_weibo['标题/微博内容'].apply(remove_nums)
df_weibo['微博内容分词'] = df_weibo['微博内容分词'].apply(clean_text)
df_weibo['微博内容分词'] = df_weibo['微博内容分词'].apply(lambda x: x.split())
df_weibo.head()

Unnamed: 0,标题/微博内容,点赞,转发,评论,账号昵称UID加密,粉丝数,关注数,地域,微博内容分词
0,#高校通报教师图书馆打电话声音过大出言不逊#公道自在人心，谣言自在人心 ​​,0,0,0,a2331b38901d62d2d9a20529177ef3b3,0,22,湖北,"[高校, 通报, 教师, 图书馆, 打电话, 声音, 过大, 出言不逊, 公道, 人心, 谣..."
1,转发C,0,0,0,d6dc4470f51fce93cc0cbad8abf55a75,0,33,广西,[转发]
2,【#刘雨昕运动者联濛#河山覆冰雪，健儿迎冬奥[金牌]全能唱跳不设限，运动联濛开新年🇨🇳 期待...,0,0,0,372bc4782eb442b88035f920a7c1a68e,6,85,广东,"[刘雨昕, 运动, 濛, 河山, 覆, 冰雪, 健儿, 冬奥, 金牌, 全能, 唱, 跳, ..."
3,丁程鑫//@丁程鑫后援会官博:#丁程鑫[超话]# ✨#丁程鑫 二十成金筑梦鑫世界# 大年初一...,0,0,0,6fe0d482bd3e78a3483e2a1d57f14ef2,75,1012,广东,"[丁程鑫, 丁程鑫, 后援会, 官博丁, 程鑫, 超话, 丁程鑫, 二十, 成金筑梦鑫, 世..."
4,诶，你们真不要脸诶。。。没资格宣传奥运。。。抵制抵制！,0,0,0,872380d71d6ee9130e8b49d331f2baa9,0,10,广东,"[诶, 不要脸, 诶, 资格, 宣传, 奥运, 抵制, 抵制]"


## 1.2 LDA

In [22]:
dictionary = corpora.Dictionary(df_weibo['微博内容分词'])     # build the token-to-id dictionary from the segmented text
corpus = [dictionary.doc2bow(text) for text in df_weibo['微博内容分词']]     # convert each document to bag-of-words form
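
As a rough picture of what these two lines produce, the sketch below builds a token-to-id mapping and converts token lists into `(token_id, count)` pairs in plain Python. It is a simplified stand-in, not gensim's implementation; `build_vocab` and `to_bow` are hypothetical helpers, and the ids gensim assigns may differ.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs; tokens not in the vocabulary are skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

docs = [['公道', '人心', '谣言', '人心'], ['转发']]
token2id = build_vocab(docs)
bows = [to_bow(d, token2id) for d in docs]
print(token2id)
print(bows)
```

Word order is discarded at this step: LDA only sees how often each token occurs in each document.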

In [23]:
# Train the LDA model
lda_model = models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)

In [24]:
# Inspect the topics
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.015*"冬奥" + 0.011*"冬奥会" + 0.011*"加油" + 0.010*"哈哈哈" + 0.010*"北京"')
(1, '0.030*"刘雨昕" + 0.028*"运动" + 0.027*"冬奥" + 0.025*"濛" + 0.020*"北京"')
(2, '0.010*"真源" + 0.010*"定律" + 0.006*"墩" + 0.006*"教师" + 0.006*"冰雪"')
(3, '0.022*"少年" + 0.020*"北京" + 0.017*"时代" + 0.014*"团" + 0.014*"宋亚轩"')
(4, '0.112*"转发" + 0.018*"冬奥" + 0.016*"北京" + 0.012*"墩" + 0.011*"冬奥会"')


In [26]:
df_weibo['微博内容分词'].iloc[0]

['高校', '通报', '教师', '图书馆', '打电话', '声音', '过大', '出言不逊', '公道', '人心', '谣言', '人心']

In [27]:
for index, score in sorted(lda_model[corpus[0]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.6364753842353821	 
Topic: 0.010*"真源" + 0.010*"定律" + 0.006*"墩" + 0.006*"教师" + 0.006*"冰雪" + 0.005*"罗云熙" + 0.005*"湖北" + 0.005*"工业" + 0.005*"大学" + 0.005*"图书馆"

Score: 0.3172453045845032	 
Topic: 0.022*"少年" + 0.020*"北京" + 0.017*"时代" + 0.014*"团" + 0.014*"宋亚轩" + 0.010*"春晚" + 0.009*"分享" + 0.009*"感谢" + 0.009*"台" + 0.008*"冬奥"

Score: 0.015479739755392075	 
Topic: 0.112*"转发" + 0.018*"冬奥" + 0.016*"北京" + 0.012*"墩" + 0.011*"冬奥会" + 0.008*"中国" + 0.008*"冰墩" + 0.005*"加油" + 0.005*"肖战" + 0.004*"打打"

Score: 0.015407940372824669	 
Topic: 0.015*"冬奥" + 0.011*"冬奥会" + 0.011*"加油" + 0.010*"哈哈哈" + 0.010*"北京" + 0.008*"蔡" + 0.008*"徐坤" + 0.007*"真的" + 0.007*"中国" + 0.006*"苦涩"

Score: 0.015391631983220577	 
Topic: 0.030*"刘雨昕" + 0.028*"运动" + 0.027*"冬奥" + 0.025*"濛" + 0.020*"北京" + 0.015*"肖战" + 0.012*"未来" + 0.011*"周深" + 0.011*"抱" + 0.011*"加油"


In [29]:
documents = df_weibo['微博内容分词'].values

In [30]:
# Function to infer topics for a document
def infer_topics(lda_model, document):
    bow = dictionary.doc2bow(document)
    topics = lda_model.get_document_topics(bow)
    return topics

# Print topics for each document
for i, doc in enumerate(documents[:10]):
    doc_topics = infer_topics(lda_model, doc)
    print(f"Document {i+1}:")
    print(doc_topics)
    print()

Document 1:
[(0, 0.015407942), (1, 0.015391632), (2, 0.6364743), (3, 0.3172465), (4, 0.015479599)]

Document 2:
[(0, 0.10000017), (1, 0.10000015), (2, 0.10000019), (3, 0.100000225), (4, 0.59999925)]

Document 3:
[(1, 0.97027034)]

Document 4:
[(3, 0.96890634)]

Document 5:
[(0, 0.02242358), (1, 0.022276886), (2, 0.022255206), (3, 0.022271894), (4, 0.9107724)]

Document 6:
[(0, 0.10000017), (1, 0.10000015), (2, 0.10000019), (3, 0.100000225), (4, 0.59999925)]

Document 7:
[(4, 0.9725118)]

Document 8:
[(0, 0.10000017), (1, 0.10000015), (2, 0.10000019), (3, 0.100000225), (4, 0.59999925)]

Document 9:
[(0, 0.011991359), (1, 0.63988423), (2, 0.011958053), (3, 0.011883147), (4, 0.32428318)]

Document 10:
[(0, 0.53702444), (1, 0.3618879), (2, 0.03370243), (3, 0.033528008), (4, 0.033857238)]
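
Some documents above list only one topic. As far as I know, gensim omits topics whose probability falls below the model's `minimum_probability` (0.01 by default), so the rest of the mass is hidden rather than zero. The sketch below mimics that filtering and picks the dominant topic; `filter_topics`, `dominant_topic` and the `doc3_full` values are hypothetical illustrations, not gensim calls or real model output.

```python
def filter_topics(dist, minimum_probability=0.01):
    """Keep only (topic_id, probability) pairs at or above the threshold."""
    return [(k, p) for k, p in dist if p >= minimum_probability]

def dominant_topic(dist):
    """Return the id of the topic with the highest probability."""
    return max(dist, key=lambda kp: kp[1])[0]

# Document 1 as printed above
doc1 = [(0, 0.015407942), (1, 0.015391632), (2, 0.6364743), (3, 0.3172465), (4, 0.015479599)]

# Hypothetical full distribution for a document dominated by a single topic
doc3_full = [(0, 0.0074), (1, 0.9703), (2, 0.0074), (3, 0.0074), (4, 0.0075)]

print(dominant_topic(doc1))
print(filter_topics(doc3_full))
```

To see every topic for every document, `get_document_topics` accepts a `minimum_probability` argument that can be set to 0.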



## 1.3 Visualization

In [31]:
# n_jobs=1 runs the preparation in a single process, which can avoid hangs or errors on larger datasets
lda_vis = gensimvis.prepare(lda_model, corpus, dictionary, n_jobs=1)
# Display the visualization
pyLDAvis.display(lda_vis)

In [32]:
# Export the visualization to HTML
pyLDAvis.save_html(lda_vis, 'weibosample.html')

# 2. Self-collected data

## 2.1 Data import and preprocessing

In [35]:
df_bad = pd.read_csv('热辣滚烫差评影评.csv', index_col = 0)     # negative reviews
df_bad.head()

Unnamed: 0_level_0,Name,Rate,Vote,Time,Comment,Location
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
['4085534670'],['小侠来了'],['2'],['14224'],['2024-02-10 11:24:07'],['一个比99％都有钱的人，富态胖再怎么减，也没法还原丧人生的感觉，偏偏她还想试图感动我们，...,['浙江']
['4085530064'],['彩云之南'],['1'],['5499'],['2024-02-10 11:16:57'],['如芒刺背，如鲠在喉，如坐针毡！绝对是春节档最拉垮的一部电影，虽然其它的几部我还没来得及看！'],['河南']
['4085568644'],['CC就是屁股哥'],['1'],['4436'],['2024-02-10 12:08:01'],['从去年就开始买各种自媒体说贾玲减肥的消息宣传，真是恶心死了。你为了电影减肥增肥不是太正常...,['山东']
['4087934911'],['Season'],['2'],['3261'],['2024-02-12 00:35:30'],['减肥的过程我看到了，但是我没看到电影。'],['福建']
['4088237906'],['太阳留住我'],['2'],['2802'],['2024-02-12 10:53:02'],['这也能有7.9的评分，在座各位都有责任。'],['江西']


In [36]:
df_mid = pd.read_csv('热辣滚烫中评影评.csv', index_col = 0)     # neutral reviews
df_mid.head()

Unnamed: 0_level_0,Name,Rate,Vote,Time,Comment,Location
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
['4085535392'],['友心'],['3'],['19379'],['2024-02-10 11:25:13'],['看点真的只有贾玲减肥'],['山西']
['4085604652'],['Boringbbbb'],['3'],['7133'],['2024-02-10 12:48:58'],['成功的明星电影而非电影，从热搜词条上也可见一斑。总的来说是贾玲个人魅力大于剧作本身。前半...,['福建']
['4085117805'],['楠朋友'],['3'],['6424'],['2024-02-10 22:56:17'],['《李焕英》破了30亿，答应观众要减肥。可是减肥很痛苦的，不能白减！减肥又费时间，这样就会...,['浙江']
['4086974503'],['溺死的鱼'],['3'],['3763'],['2024-02-11 12:04:48'],['贾玲赢了，但乐莹没有赢。花一年专心做一件让自己变好的事情，虽然很燃，但本身就很奢侈。贾玲...,['广东']
['4088915499'],['Xaviera'],['3'],['2779'],['2024-02-12 20:55:35'],['电影，请用电影的方式去赢；演员，也请用演技去赢。如果不是，那等于从来都没有赢。'],['北京']


In [37]:
df_good = pd.read_csv('热辣滚烫好评影评.csv', index_col = 0)     # positive reviews
df_good.head()

Unnamed: 0_level_0,Name,Rate,Vote,Time,Comment,Location
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
['4085785724'],['吴吴'],['4'],['43954'],['2024-02-10 15:54:19'],['她拍这部片子的目的根本不是说她瘦了这个事，是告诉你，无论你这个人活的有多烂、多失败、多差...,['湖北']
['4087285023'],['aiaiaixiahe'],['4'],['20586'],['2024-02-11 17:22:12'],['心疼贾玲，太强了。最有意思的是，影院一个男的看完出来说，“这种我一个月就练出来了”。太经...,['浙江']
['4086125600'],['Donuts🕳️'],['4'],['14065'],['2024-02-10 20:16:54'],['支持女的上桌，大上特上，大赚特赚，气死破防的人'],['贵州']
['4076452698'],['真珠'],['5'],['14901'],['2024-02-10 11:22:46'],['看完想大喊：姐！'],['安徽']
['4087374348'],['张春[阿卡纳]'],['5'],['4523'],['2024-02-11 18:37:56'],['一个女孩在关系里受尽屈辱历尽心死后，选择以训练自己身体的方式，独自前行，只拥抱了自己的对...,['福建']


In [46]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the three frames in the same order
merged_df = pd.concat([df_bad, df_mid, df_good])
merged_df

Unnamed: 0_level_0,Name,Rate,Vote,Time,Comment,Location
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
['4085534670'],['小侠来了'],['2'],['14224'],['2024-02-10 11:24:07'],['一个比99％都有钱的人，富态胖再怎么减，也没法还原丧人生的感觉，偏偏她还想试图感动我们，...,['浙江']
['4085530064'],['彩云之南'],['1'],['5499'],['2024-02-10 11:16:57'],['如芒刺背，如鲠在喉，如坐针毡！绝对是春节档最拉垮的一部电影，虽然其它的几部我还没来得及看！'],['河南']
['4085568644'],['CC就是屁股哥'],['1'],['4436'],['2024-02-10 12:08:01'],['从去年就开始买各种自媒体说贾玲减肥的消息宣传，真是恶心死了。你为了电影减肥增肥不是太正常...,['山东']
['4087934911'],['Season'],['2'],['3261'],['2024-02-12 00:35:30'],['减肥的过程我看到了，但是我没看到电影。'],['福建']
['4088237906'],['太阳留住我'],['2'],['2802'],['2024-02-12 10:53:02'],['这也能有7.9的评分，在座各位都有责任。'],['江西']
...,...,...,...,...,...,...
['4086114610'],['Y'],['5'],['21'],['2024-02-10 20:08:32'],['散场后的电梯里，有个小女孩说：“妈妈我也想打拳，想变得厉害！” 挺好，送你一朵小红花'],['北京']
['4093115216'],['杨采薇（郑版）'],['5'],['45'],['2024-02-15 22:20:51'],['讨论的内容是恶性的，叙事方式却没有任何不适。开年封神榜的男性进行封闭式训练会换来不停的褒...,['云南']
['4086485001'],['caesius'],['5'],['36'],['2024-02-10 23:33:28'],['作为一枚半年减了40斤的曾经的大胖子，看完电影在影院哭到不能自已。减肥再苦哪有胖的时候遭...,['湖北']
['4085784170'],['乔峰'],['4'],['34'],['2024-02-10 15:52:57'],['所以瘦了真的能拥有刀刻般的下颌线对吗？'],['四川']
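
Every cell in the table above appears wrapped as a one-element list such as `['浙江']`. Assuming these are stored in the CSV as strings (the displayed output suggests so, but the file itself is not shown), a helper along these lines could unwrap them before further analysis. `unwrap` is a hypothetical name, not something the notebook defines.

```python
import ast

def unwrap(cell):
    """Turn a string such as "['浙江']" into '浙江'; return anything else unchanged."""
    if isinstance(cell, str) and cell.startswith('[') and cell.endswith(']'):
        try:
            parsed = ast.literal_eval(cell)  # safely parse the Python literal
        except (ValueError, SyntaxError):
            return cell                      # not a valid literal: leave as-is
        if isinstance(parsed, list) and len(parsed) == 1:
            return parsed[0]
    return cell

print(unwrap("['浙江']"))
print(unwrap("['2']"))
print(unwrap("减肥的过程我看到了"))
```

Applied column by column, for example with `merged_df['Comment'].map(unwrap)`, this would strip the brackets and quotes before the text reaches the cleaning steps below.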


### 2.1.1 Removing symbols and digits

In [50]:
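def remove_nums(text):
    # Keep only characters in the range U+4E00..U+9FA5 (common Chinese characters)
    nonums = re.sub('[^\u4e00-\u9fa5]+', '', text)
    return nonums

test = merged_df['Comment'].iloc[0]     # the index holds ID strings, so select the first row by position
remove_nums(test)[:100]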

'一个比都有钱的人富态胖再怎么减也没法还原丧人生的感觉偏偏她还想试图感动我们再赚一波按这票房以后都翻拍吧这多舒服啊'

### 2.1.2 Word segmentation

In [52]:
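# Load the Chinese stop-word list (customizable). A set makes `in` match whole words
# rather than substrings of the file contents.
with open('stopwords.txt', encoding = 'utf-8') as f:
    stopwords = {line.strip() for line in f if line.strip()}

def clean_text(text):
    words = jieba.lcut(text)
    words = [w for w in words if w not in stopwords and w!='[' and w!=']']
    return ' '.join(words)

test = merged_df['Comment'].iloc[0]     # select the first row by position
clean_text(test)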

"' 一个 99 ％ 有钱 富态 胖再 减 没法 还原 丧 人生 感觉 偏偏 想 试图 感动 赚 一波 票房 翻拍 这多 舒服 '"

In [53]:
merged_df['Comment'] = merged_df['Comment'].astype(str)
merged_df['分词'] = merged_df['Comment'].apply(remove_nums)
merged_df['分词'] = merged_df['分词'].apply(clean_text)
merged_df['分词'] = merged_df['分词'].apply(lambda x: x.split())
merged_df.head()

Unnamed: 0_level_0,Name,Rate,Vote,Time,Comment,Location,分词
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
['4085534670'],['小侠来了'],['2'],['14224'],['2024-02-10 11:24:07'],['一个比99％都有钱的人，富态胖再怎么减，也没法还原丧人生的感觉，偏偏她还想试图感动我们，...,['浙江'],"[一个, 有钱, 富态, 胖再, 减, 没法, 还原, 丧, 人生, 感觉, 偏偏, 想, ..."
['4085530064'],['彩云之南'],['1'],['5499'],['2024-02-10 11:16:57'],['如芒刺背，如鲠在喉，如坐针毡！绝对是春节档最拉垮的一部电影，虽然其它的几部我还没来得及看！'],['河南'],"[如芒刺背, 如鲠在喉, 如坐针毡, 春节, 档, 最拉垮, 一部, 电影, 几部, 来得及]"
['4085568644'],['CC就是屁股哥'],['1'],['4436'],['2024-02-10 12:08:01'],['从去年就开始买各种自媒体说贾玲减肥的消息宣传，真是恶心死了。你为了电影减肥增肥不是太正常...,['山东'],"[买, 媒体, 贾玲, 减肥, 消息, 宣传, 恶心, 死, 电影, 减肥, 增肥, 工作,..."
['4087934911'],['Season'],['2'],['3261'],['2024-02-12 00:35:30'],['减肥的过程我看到了，但是我没看到电影。'],['福建'],"[减肥, 过程, 电影]"
['4088237906'],['太阳留住我'],['2'],['2802'],['2024-02-12 10:53:02'],['这也能有7.9的评分，在座各位都有责任。'],['江西'],"[评分, 在座, 责任]"


## 2.2 LDA

In [54]:
dictionary = corpora.Dictionary(merged_df['分词'])     # build the token-to-id dictionary from the segmented text
corpus = [dictionary.doc2bow(text) for text in merged_df['分词']]     # convert each document to bag-of-words form

In [55]:
# Train the LDA model
lda_model = models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)

In [56]:
# Inspect the topics
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

(0, '0.012*"贾玲" + 0.011*"电影" + 0.011*"一个" + 0.008*"减肥" + 0.007*"女性"')
(1, '0.032*"电影" + 0.022*"贾玲" + 0.011*"减肥" + 0.007*"营销" + 0.007*"导演"')
(2, '0.014*"贾玲" + 0.012*"电影" + 0.010*"减肥" + 0.007*"真的" + 0.006*"一个"')
(3, '0.027*"贾玲" + 0.025*"电影" + 0.014*"减肥" + 0.011*"一个" + 0.007*"女性"')
(4, '0.033*"电影" + 0.014*"贾玲" + 0.013*"减肥" + 0.013*"女性" + 0.010*"导演"')


In [57]:
merged_df['分词'].iloc[0]

['一个',
 '有钱',
 '富态',
 '胖再',
 '减',
 '没法',
 '还原',
 '丧',
 '人生',
 '感觉',
 '偏偏',
 '想',
 '试图',
 '感动',
 '赚',
 '一波',
 '票房',
 '翻拍',
 '舒服']

In [58]:
for index, score in sorted(lda_model[corpus[0]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.9592868089675903	 
Topic: 0.032*"电影" + 0.022*"贾玲" + 0.011*"减肥" + 0.007*"营销" + 0.007*"导演" + 0.005*"想" + 0.005*"一个" + 0.005*"女性" + 0.004*"拍" + 0.004*"真诚"

Score: 0.010225319303572178	 
Topic: 0.033*"电影" + 0.014*"贾玲" + 0.013*"减肥" + 0.013*"女性" + 0.010*"导演" + 0.006*"营销" + 0.006*"一部" + 0.005*"赢" + 0.005*"拍" + 0.004*"一个"

Score: 0.010199504904448986	 
Topic: 0.014*"贾玲" + 0.012*"电影" + 0.010*"减肥" + 0.007*"真的" + 0.006*"一个" + 0.005*"一部" + 0.005*"故事" + 0.004*"喜欢" + 0.004*"乐莹" + 0.004*"励志"

Score: 0.010166424326598644	 
Topic: 0.027*"贾玲" + 0.025*"电影" + 0.014*"减肥" + 0.011*"一个" + 0.007*"女性" + 0.006*"导演" + 0.006*"真的" + 0.006*"故事" + 0.005*"赢" + 0.005*"拳击"

Score: 0.010121949017047882	 
Topic: 0.012*"贾玲" + 0.011*"电影" + 0.011*"一个" + 0.008*"减肥" + 0.007*"女性" + 0.007*"真的" + 0.006*"赢" + 0.006*"喜欢" + 0.005*"导演" + 0.004*"拍"


In [59]:
documents = merged_df['分词'].values

In [60]:
# Function to infer topics for a document
def infer_topics(lda_model, document):
    bow = dictionary.doc2bow(document)
    topics = lda_model.get_document_topics(bow)
    return topics

# Print topics for each document
for i, doc in enumerate(documents[:10]):
    doc_topics = infer_topics(lda_model, doc)
    print(f"Document {i+1}:")
    print(doc_topics)
    print()

Document 1:
[(0, 0.010121939), (1, 0.9592936), (2, 0.010199284), (3, 0.010166407), (4, 0.010218801)]

Document 2:
[(0, 0.018245654), (1, 0.018386792), (2, 0.018390859), (3, 0.018338833), (4, 0.9266379)]

Document 3:
[(0, 0.01456082), (1, 0.014674733), (2, 0.01439759), (3, 0.9416898), (4, 0.0146771055)]

Document 4:
[(0, 0.051182475), (1, 0.79444456), (2, 0.05162153), (3, 0.051588986), (4, 0.05116244)]

Document 5:
[(0, 0.795256), (1, 0.05200187), (2, 0.05187774), (3, 0.05020303), (4, 0.050661355)]

Document 6:
[(0, 0.9737437)]

Document 7:
[(0, 0.016909221), (1, 0.017233893), (2, 0.3139225), (3, 0.017039452), (4, 0.6348949)]

Document 8:
[(0, 0.033510372), (1, 0.03391033), (2, 0.033634577), (3, 0.03405751), (4, 0.86488724)]

Document 9:
[(0, 0.01183363), (1, 0.011947391), (2, 0.011893205), (3, 0.01187864), (4, 0.9524471)]

Document 10:
[(0, 0.028944764), (1, 0.029035505), (2, 0.8834175), (3, 0.029366516), (4, 0.029235702)]



## 2.3 Visualization

In [61]:
# n_jobs=1 runs the preparation in a single process, which can avoid hangs or errors on larger datasets
lda_vis = gensimvis.prepare(lda_model, corpus, dictionary, n_jobs=1)
# Display the visualization
pyLDAvis.display(lda_vis)

In [63]:
# Export the visualization to HTML
# pyLDAvis.save_html(lda_vis, 'hot.html')