## Introduction to Topic Modeling using Scikit-Learn

As for the dataset, you can use your own or download the publicly available [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). It consists of approximately 20,000 newsgroup documents spread across 20 different newsgroups.

The dataset is available in three variations:
* [20news-19997.tar.gz](http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz): the original, unmodified 20 Newsgroups data set.
* [20news-bydate.tar.gz](http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz): sorted by date, with duplicates and some headers removed, and split into train and test folders.
* [20news-18828.tar.gz](http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz): duplicates removed, with headers reduced to From and Subject only.

I am using the `20news-18828` dataset in this tutorial. 
To keep things simple and short, I am going to use only **5 topics out of 20**:
`['rec.sport.hockey', 'soc.religion.christian', 'talk.politics.mideast', 'comp.graphics', 'sci.crypt']`

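### Data preprocessing
* First, extract `20news-18828.tar.gz`.
* Run the code below to collect the documents belonging to `base_topics` into a single file, `data.json`, with the matching labels in `label.json`.
* Later on, you can also experiment with more categories.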

In [11]:
import glob
import json

# The 5 topics used in this tutorial
base_topics = ['rec.sport.hockey', 'soc.religion.christian', 'talk.politics.mideast', 'comp.graphics', 'sci.crypt']

data = []
labels = []

for topic in base_topics:
    # Collect every file in this topic's directory
    for path in glob.glob(f'./20news-18828/{topic}/*'):
        # cp1252 is the default encoding on English-language MS Windows installations
        with open(path, 'r', encoding='cp1252') as f:
            data.append(f.read())
            labels.append(topic)

# json.dumps() serializes a Python object to a JSON-formatted string
with open('data.json', 'w', encoding='utf8') as f:
    f.write(json.dumps(data))
with open('label.json', 'w', encoding='utf8') as f:
    f.write(json.dumps(labels))

### Import the required libraries

In [12]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# Two topic models: LDA and truncated SVD
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
import numpy as np
import json
import random

### Load the dataset from file
* After loading, the data is usually shuffled once.

In [13]:
X = []
y = []

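# Load the data
with open('data.json', 'r', encoding='utf8') as f:
    X = json.loads(f.read())
with open('label.json', 'r', encoding='utf8') as f:
    y = json.loads(f.read())

# Shuffle the data. Re-seeding with the same value before each call applies
# the same permutation to X and y, so texts and labels stay aligned.
randnum = random.randint(0,100)
random.seed(randnum)
random.shuffle(X)
random.seed(randnum)
random.shuffle(y)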
print(X[:1])
print(y[:1])

["From: shadow@r-node.hub.org (Jay Chu)\nSubject: Lindros will be traded!!!\n\nTrue rumor.  Fact!  A big three way deal!\n\nEric Lindros going to Ottawa Senators.  And Senators get $15mill from\nMontreal.\n\nMontreal gets Alexander Daigle (the first round pick from Senators)\n\nPhilly gets Damphousse, Bellow, Patrick Roy and a draft pick.\n\n-- \n        ______                shadow@r-node.gts.org\n       | |__| |   If it's there and you can see it       - it's real\n       |  ()  |   If it's there and you can't see it     - it's transparent\n       |______|   If it's not there and you can't see it - you erased it!\n"]
['rec.sport.hockey']
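
The double-seeding trick above works because both lists have the same length. As an alternative, here is a minimal sketch that shuffles the pairs themselves, so the alignment does not depend on re-seeding. The helper name `shuffle_together` and the toy data are hypothetical and not part of the original notebook.

```python
import random


def shuffle_together(texts, labels, seed=0):
    """Return shuffled copies of two equal-length lists, keeping pairs aligned."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    pairs = list(zip(texts, labels))
    # A private Random instance leaves the global random state untouched
    random.Random(seed).shuffle(pairs)
    if not pairs:
        return [], []
    shuffled_texts, shuffled_labels = zip(*pairs)
    return list(shuffled_texts), list(shuffled_labels)


# Toy data: each document ends with the letter used as its label
docs = ["doc a", "doc b", "doc c", "doc d"]
tags = ["a", "b", "c", "d"]
new_docs, new_tags = shuffle_together(docs, tags, seed=42)
```

Passing a fixed `seed` also makes the shuffle reproducible between runs, which the `random.randint` seed in the cell above does not.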


### Split into training and test sets

In [14]:
from sklearn.model_selection import train_test_split # function for splitting data to train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
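
To make the hold-out split concrete, here is a rough pure-Python sketch of what a shuffled 80/20 split does. It is an illustration only, not scikit-learn's implementation, and it will not reproduce the same partition as `train_test_split` for a given seed. The function name `simple_train_test_split` is hypothetical.

```python
import math
import random


def simple_train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle the indices once, then cut them into a test part and a train part."""
    n = len(X)
    n_test = math.ceil(test_size * n)  # size of the held-out part, rounded up
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


# Toy data: document i carries label i, so alignment is easy to check
toy_X = [f"doc {i}" for i in range(10)]
toy_y = list(range(10))
X_tr, X_te, y_tr, y_te = simple_train_test_split(toy_X, toy_y, test_size=0.2, seed=0)
```

Because the same index lists are used for both `X` and `y`, each text keeps its label on either side of the split.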

## Generating vector (bag-of-words) representations of the text
* `CountVectorizer` converts a collection of text documents into a matrix of token counts. The token count is sometimes referred to as term frequency.
* `TfidfVectorizer` instead converts the documents into a matrix of TF-IDF features, which down-weights terms that occur in many documents.
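
To show how the two representations differ, here is a small worked sketch of TF-IDF weighting in plain Python. It assumes the smoothed formula that `TfidfVectorizer` uses by default, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row. The toy documents and whitespace tokenization are assumptions; scikit-learn's own tokenizer behaves differently (for instance, it drops single-character tokens).

```python
import math
from collections import Counter

# Toy corpus, tokenized on whitespace (an assumption for brevity)
docs = [
    "the puck hit the goal",
    "the key encrypts the file",
    "the goal of the game",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)
vocab = sorted({t for doc in tokenized for t in doc})

# Document frequency: the number of documents that contain each term
df = Counter(t for doc in tokenized for t in set(doc))


def tfidf_row(tokens):
    """Raw count times smoothed idf, then L2-normalize the row."""
    counts = Counter(tokens)
    raw = [counts[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


rows = [tfidf_row(doc) for doc in tokenized]

# In the first document, "puck" (in 1 document) outweighs "goal" (in 2 documents)
puck_weight = rows[0][vocab.index("puck")]
goal_weight = rows[0][vocab.index("goal")]
```

Both words occur once in the first document, so a count matrix would treat them equally; the idf factor is what separates them.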

In [15]:
# Limit the number of words used as features
n_features = 1000

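- Parameters
    - `max_df`: either a float in the range [0.0, 1.0] or an int with no range restriction; the default is 1.0. It acts as an upper threshold when the vocabulary is built: a term whose document frequency is greater than `max_df` is left out. A float is interpreted as a proportion of the documents in the corpus, an int as an absolute number of documents. The parameter is ignored if a `vocabulary` is supplied.
    - `min_df`: the counterpart of `max_df`; a term whose document frequency is lower than `min_df` is left out.
    - `ngram_range=(1, 2)`: use both single words and pairs of adjacent words as features. For example, "I love you" yields the unigrams "I", "love", "you" and the bigrams "I love", "love you".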
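
The parameters above can be illustrated with a short pure-Python sketch of n-gram extraction and document-frequency filtering. This is a simplified stand-in for what `CountVectorizer` does, not its implementation: the helper names are hypothetical, tokens come from a plain whitespace split, and stop-word removal and `max_features` are left out.

```python
from collections import Counter


def word_ngrams(tokens, ngram_range=(1, 2)):
    """Join every run of n adjacent tokens, for each n in the inclusive range."""
    lo, hi = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(lo, hi + 1)
        for i in range(len(tokens) - n + 1)
    ]


def filter_vocabulary(tokenized_docs, min_df=2, max_df=0.95, ngram_range=(1, 2)):
    """Keep terms whose document frequency lies between min_df and max_df."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # set() so that a term is counted once per document
        doc_freq.update(set(word_ngrams(tokens, ngram_range)))
    # Floats are proportions of the corpus, ints are absolute document counts
    max_count = max_df * n_docs if isinstance(max_df, float) else max_df
    min_count = min_df * n_docs if isinstance(min_df, float) else min_df
    return sorted(t for t, c in doc_freq.items() if min_count <= c <= max_count)


print(word_ngrams("i love you".split()))
# ['i', 'love', 'you', 'i love', 'love you']

toy_docs = ["the game was good", "the game was long", "the key was lost", "the team won"]
kept = filter_vocabulary([d.split() for d in toy_docs])
print(kept)
# ['game', 'game was', 'the game', 'was']
```

In the toy corpus, "the" appears in all 4 documents, which exceeds `0.95 * 4 = 3.8`, so `max_df` removes it; terms seen in only one document fall below `min_df=2`.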

In [16]:
# Represent the text as count vectors with CountVectorizer
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english', ngram_range=(1, 2))
tf = tf_vectorizer.fit_transform(X_train)
# get_feature_names_out() (scikit-learn >= 1.0) replaces get_feature_names(), which was removed in 1.2
features = pd.DataFrame(tf.toarray(), columns=tf_vectorizer.get_feature_names_out())
features.head()

Unnamed: 0,00,000,02,10,100,11,12,128,13,14,...,writes article,written,wrong,wrote,xv,year,years,yes,york,young
0,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [17]:
# Represent the text as TF-IDF vectors with TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, 
                                   max_features=n_features, stop_words='english', ngram_range=(1, 2))
tfidf = tfidf_vectorizer.fit_transform(X_train)
# get_feature_names_out() (scikit-learn >= 1.0) replaces get_feature_names(), which was removed in 1.2
features = pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
features.head()

Unnamed: 0,00,000,02,10,100,11,12,128,13,14,...,writes article,written,wrong,wrote,xv,year,years,yes,york,young
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.147727,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.131053,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.050566,0.0,0.0,0.0,...,0.04659,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.061037,0.0,0.0,0.0
4,0.0,0.048738,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.039226,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Classification with a naive Bayes classifier

In [18]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()  # build the classifier
nb.fit(tf, y_train)  # train it on the training set
X_test_tf = tf_vectorizer.transform(X_test)  # vectorize the test set with the fitted vocabulary
y_predict = nb.predict(X_test_tf)  # predict labels for the test set

from sklearn.metrics import classification_report
print("Metrics on the test set:\n", classification_report(y_test, y_predict))

Metrics on the test set:
                         precision    recall  f1-score   support

         comp.graphics       0.93      0.92      0.93       199
      rec.sport.hockey       0.98      0.98      0.98       197
             sci.crypt       0.93      0.94      0.93       198
soc.religion.christian       0.94      0.95      0.94       196
 talk.politics.mideast       0.94      0.92      0.93       190

              accuracy                           0.94       980
             macro avg       0.94      0.94      0.94       980
          weighted avg       0.94      0.94      0.94       980
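
For readers unfamiliar with the columns of this report, here is a minimal sketch of how per-class precision, recall, F1 and support are derived from true and predicted labels. The function name `per_class_report` and the toy labels are hypothetical; the numbers it produces are unrelated to the results above.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} from two label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1  # correct prediction for class t
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[t] += 1  # t was the truth but was missed
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = (precision, recall, f1, actual)
    return report


toy_true = ["hockey", "hockey", "crypt", "crypt", "crypt"]
toy_pred = ["hockey", "crypt", "crypt", "crypt", "hockey"]
toy_report = per_class_report(toy_true, toy_pred)
```

Here `support` is simply the number of true examples of each class, which is why the support column of the report sums to the size of the test set.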



In [19]:
nb.fit(tfidf, y_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
y_predict = nb.predict(X_test_tfidf)

from sklearn.metrics import classification_report
print("Metrics on the test set:\n", classification_report(y_test, y_predict))

Metrics on the test set:
                         precision    recall  f1-score   support

         comp.graphics       0.93      0.94      0.93       199
      rec.sport.hockey       0.98      0.98      0.98       197
             sci.crypt       0.93      0.96      0.95       198
soc.religion.christian       0.96      0.95      0.96       196
 talk.politics.mideast       0.98      0.94      0.96       190

              accuracy                           0.96       980
             macro avg       0.96      0.96      0.96       980
          weighted avg       0.96      0.96      0.96       980



## Building topic models for the text

In [20]:
n_features = 1000
n_components = 5
n_top_words = 20

In [21]:
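# Hand-assigned label for each learned component, in component order.
# The order of the components is arbitrary, so check these lists against each
# model's top words and reorder them if they do not match.
lsa_topics = ['soc.religion.christian', 'sci.crypt', 'rec.sport.hockey', 'talk.politics.mideast', 'comp.graphics']
lda_topics = ['talk.politics.mideast', 'rec.sport.hockey', 'soc.religion.christian', 'sci.crypt', 'comp.graphics']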

### Latent Semantic Analysis
`scikit-learn` also provides a dimensionality reduction model called truncated singular value decomposition (`TruncatedSVD`).

When `TruncatedSVD` is fitted on count or TF-IDF matrices, the technique is known as latent semantic analysis (LSA).

In [22]:
lsa = TruncatedSVD(n_components=n_components, random_state=1, algorithm='arpack').fit(tfidf)
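
As a rough illustration of what the fitted model contains, the sketch below applies a plain NumPy SVD to a tiny document-term matrix. The matrix is hypothetical, and the sketch uses a full dense SVD rather than the `arpack` solver; the point is only that the top `k` right singular vectors play the role of `components_`, and projecting documents onto them corresponds to `transform`.

```python
import numpy as np

# Hypothetical document-term count matrix: 4 documents x 5 terms.
# Documents 0-1 share one group of terms, documents 2-3 another.
X_toy = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

k = 2  # number of latent topics to keep

# Thin SVD: X_toy = U @ diag(s) @ Vt, singular values sorted in descending order
U, s, Vt = np.linalg.svd(X_toy, full_matrices=False)

components = Vt[:k]                  # shape (k, n_terms): one row of term weights per topic
doc_coords = X_toy @ components.T    # shape (n_docs, k): documents in topic space
```

Note that no centering is applied before the decomposition, which is what allows the same approach to work directly on sparse count or TF-IDF matrices. Term weights and document coordinates can be negative, unlike LDA's topic proportions.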

### LatentDirichletAllocation (LDA)
LDA is a generative probabilistic model for discovering abstract topics in collections of discrete data such as text corpora.

Fit LDA on `CountVectorizer` output rather than `TfidfVectorizer` output, because LDA models term counts. Fitting it on TF-IDF weights causes rare words to be sampled disproportionately, so they have a greater influence on the final topic distribution.

In [23]:
lda = LatentDirichletAllocation(n_components=n_components, random_state=1).fit(tf)
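
To give some intuition for the per-document topic proportions that `lda.transform` returns, here is a deliberately simplified sketch. It assumes scikit-learn's default document-topic prior of `1 / n_components`, and it treats every in-vocabulary word as assigned wholly to one topic, which is a simplification of the actual variational update. The function name `approx_topic_proportions` is hypothetical.

```python
def approx_topic_proportions(words_per_topic, doc_topic_prior=None):
    """Prior pseudo-count plus assigned word counts per topic, normalized to sum to 1."""
    n_topics = len(words_per_topic)
    # Assumed default: a symmetric prior of 1 / n_topics
    alpha = 1.0 / n_topics if doc_topic_prior is None else doc_topic_prior
    gamma = [alpha + count for count in words_per_topic]
    total = sum(gamma)
    return [g / total for g in gamma]


# A short text with 3 in-vocabulary words, all attributed to the topic at index 2
proportions = approx_topic_proportions([0, 0, 3, 0, 0])
print([round(p, 3) for p in proportions])
# [0.05, 0.05, 0.8, 0.05, 0.05]
```

Under these assumptions, a very short text can never reach a proportion close to 1 for any topic, because the prior always reserves some mass for the others. The proportions are non-negative and sum to 1.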

### Inspecting the learned topics

In [24]:
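def get_model_topics(model, vectorizer, topics, n_top_words=n_top_words):
    word_dict = {}
    # Vocabulary terms (get_feature_names_out() requires scikit-learn >= 1.0)
    feature_names = vectorizer.get_feature_names_out()
    # Let |V| be the vocabulary size and |T| the number of topics (components).
    # model.components_ is a matrix of shape (|T|, |V|); each row holds one
    # topic's weights over the vocabulary, which shows which terms contribute
    # most to that topic. The loop yields each topic's index (0 to 4 here)
    # together with its row of weights.
    for topic_idx, topic in enumerate(model.components_):
        # argsort() returns indices in ascending order of weight; the slice
        # [:-n_top_words - 1:-1] walks backwards from the end and keeps
        # n_top_words indices, i.e. the highest-weighted terms, largest first
        top_features_ind = topic.argsort()[:-n_top_words - 1:-1]
        top_features = [feature_names[i] for i in top_features_ind]  # look up the terms
        word_dict[topics[topic_idx]] = top_features  # one column per topic label

    return pd.DataFrame(word_dict)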
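
The slice `[:-n_top_words - 1:-1]` is easy to misread, so here is a small standalone sketch of the same indexing on a plain Python list. The weights and terms are made up; `sorted(range(...), key=...)` stands in for NumPy's `argsort()`.

```python
# Hypothetical term weights for one topic, with the matching vocabulary
weights = [0.1, 0.7, 0.05, 0.4, 0.25]
terms = ["edu", "game", "puck", "team", "hockey"]
n_top = 3

# Indices ordered from smallest to largest weight, like argsort()
ascending = sorted(range(len(weights)), key=lambda i: weights[i])
print(ascending)
# [2, 0, 4, 3, 1]

# Walk backwards from the end and stop after n_top items
top_indices = ascending[:-n_top - 1:-1]
print(top_indices)
# [1, 3, 4]

top_terms = [terms[i] for i in top_indices]
print(top_terms)
# ['game', 'team', 'hockey']
```

The stop value `-n_top - 1` is exclusive, which is why the extra `- 1` is needed to get exactly `n_top` indices.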

In [25]:
get_model_topics(lsa, tfidf_vectorizer, lsa_topics)

Unnamed: 0,soc.religion.christian,sci.crypt,rec.sport.hockey,talk.politics.mideast,comp.graphics
0,edu,key,game,armenian,graphics
1,com,clipper,hockey,israel,thanks
2,writes,chip,team,armenians,image
3,god,encryption,ca,turkish,edu
4,people,com,espn,israeli,files
5,article,keys,games,armenia,file
6,don,escrow,nhl,jews,program
7,know,clipper chip,season,turks,ac
8,just,government,play,turkey,format
9,think,netcom,edu,arab,ftp


In [26]:
get_model_topics(lda, tf_vectorizer, lda_topics)

Unnamed: 0,talk.politics.mideast,rec.sport.hockey,soc.religion.christian,sci.crypt,comp.graphics
0,edu,com,edu,edu,armenian
1,game,key,image,god,turkish
2,team,clipper,graphics,people,armenians
3,hockey,encryption,jpeg,don,people
4,ca,chip,file,think,jews
5,10,government,available,writes,armenia
6,play,writes,mail,know,said
7,games,edu,com,just,turkey
8,year,db,ftp,like,war
9,11,use,software,article,government
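
In the LDA table above, the hand-written column labels do not line up with the learned topics: the column labelled `talk.politics.mideast` lists hockey terms, `rec.sport.hockey` lists encryption terms, `soc.religion.christian` lists graphics terms, `sci.crypt` lists religion terms, and `comp.graphics` lists Middle East terms. Since component order is arbitrary, one way to avoid ordering the labels by hand is to derive them from the training labels. The sketch below is one possible approach, not part of the original notebook; `label_components` is a hypothetical helper and the scores are toy values.

```python
from collections import Counter, defaultdict


def label_components(doc_topic_scores, true_labels):
    """Map each component index to the most common true label among the
    documents whose highest score falls on that component."""
    votes = defaultdict(Counter)
    for scores, label in zip(doc_topic_scores, true_labels):
        best = max(range(len(scores)), key=lambda k: scores[k])
        votes[best][label] += 1
    return {k: counts.most_common(1)[0][0] for k, counts in votes.items()}


# Toy scores for 4 documents over 2 components
toy_scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
toy_labels = ["sci.crypt", "sci.crypt", "comp.graphics", "comp.graphics"]
mapping = label_components(toy_scores, toy_labels)
print(mapping)
# {0: 'sci.crypt', 1: 'comp.graphics'}
```

With the objects from this notebook, the call would be along the lines of `label_components(lda.transform(tf), y_train)`. A component that no document prefers is absent from the result, and two components can receive the same label, so the mapping is still worth checking against the top words.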


### Inference with the learned models

In [27]:
def get_inference(model, vectorizer, topics, text, threshold):
    # Assign a topic to an arbitrary input text
    v_text = vectorizer.transform([text])  # vectorize the text
    score = model.transform(v_text)  # score the text against each topic
    labels = set()
    for i in range(len(score[0])):
        # Only topics scoring above the threshold are candidate labels
        if score[0][i] > threshold:
            labels.add(topics[i])

    if not labels:
        return 'None', -1, set()

    return topics[np.argmax(score)], score, labels

In [28]:
text = 'you should use either jpeg or png files for it'
topic, score, labels = get_inference(lsa, tfidf_vectorizer, lsa_topics, text, 0)
print(topic)
print(score)

comp.graphics
[[ 0.0525437   0.05812372  0.01217637 -0.0393242   0.14331403]]


In [29]:
topic, score, labels = get_inference(lda, tf_vectorizer, lda_topics, text, 0)
print(topic)
print(score)

soc.religion.christian
[[0.050002   0.05052469 0.79926921 0.05011649 0.05008762]]

The LDA scores place about 0.80 of the weight on the component at index 2, whose top words in the LDA topic table (`image`, `graphics`, `jpeg`, `file`) are graphics terms. The printed label `soc.religion.christian` therefore comes from the ordering of the hand-written `lda_topics` list, not from the model confusing the two subjects.


### Classification using the learned features

In [32]:
lda_vec = lda.transform(tf)
nb.fit(lda_vec, y_train)

lda_test_vec = lda.transform(X_test_tf)
y_predict = nb.predict(lda_test_vec)

from sklearn.metrics import classification_report
print("Metrics on the test set:\n", classification_report(y_test, y_predict))

Metrics on the test set:
                         precision    recall  f1-score   support

         comp.graphics       0.90      0.88      0.89       199
      rec.sport.hockey       0.96      0.94      0.95       197
             sci.crypt       0.92      0.92      0.92       198
soc.religion.christian       0.70      0.90      0.79       196
 talk.politics.mideast       0.92      0.68      0.78       190

              accuracy                           0.87       980
             macro avg       0.88      0.87      0.87       980
          weighted avg       0.88      0.87      0.87       980



### Based on the experimental results above, analyze the suitability of the LDA model