## Introduction to Topic Modeling using Scikit-Learn

As for the dataset, you can use your own or download the publicly available [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/). It consists of approximately 20,000 newsgroup documents.

The dataset comes in three versions:
* [20news-19997.tar.gz](http://qwone.com/~jason/20Newsgroups/20news-19997.tar.gz): the original, unmodified 20 Newsgroups data set.
* [20news-bydate.tar.gz](http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz): sorted by date, with duplicates and some headers removed, and split into train and test folders.
* [20news-18828.tar.gz](http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz): duplicates removed, with only the From and Subject headers kept.

I am using the `20news-18828` dataset in this tutorial. 
To keep things simple and short, I am going to use only **5 topics out of 20**:
`['rec.sport.hockey', 'soc.religion.christian', 'talk.politics.mideast', 'comp.graphics', 'sci.crypt']`

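### Data preprocessing
* First, extract `20news-18828.tar.gz`.
* Run the code below to collect the documents of the categories in `base_topics` into a single file, `data.json` (the matching labels go to `label.json`).
* You can try more categories later.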
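The extraction step can also be done from Python. The following is a minimal sketch using the standard-library `tarfile` module; the helper name `extract_archive` and the destination directory are assumptions made for illustration, not part of the original notebook.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir.

    Returns the sorted top-level entry names found in the archive.
    (Hypothetical helper, not part of the original notebook.)
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Record the first path component of every member before extracting.
        top_level = {
            Path(member.name).parts[0]
            for member in tar.getmembers()
            if Path(member.name).parts
        }
        tar.extractall(dest)
    return sorted(top_level)


# Example call, assuming the archive sits in the working directory:
# extract_archive("20news-18828.tar.gz", ".")
```

Only extract archives from sources you trust, since `extractall` writes whatever paths the archive contains.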

In [1]:
import glob
import json

base_topics = ['rec.sport.hockey', 'soc.religion.christian', 'talk.politics.mideast', 'comp.graphics', 'sci.crypt']

data = []
labels = []

# Collect every document of each selected newsgroup together with its label.
for topic in base_topics:
    for path in glob.glob(f'./20news-18828/{topic}/*'):
        with open(path, 'r', encoding='cp1252') as f:
            data.append(f.read())
            labels.append(topic)

# Store texts and labels as two parallel JSON lists.
with open('data.json', 'w', encoding='utf8') as f:
    json.dump(data, f)
with open('label.json', 'w', encoding='utf8') as f:
    json.dump(labels, f)

### Import the required libraries

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
import numpy as np
import json
import random

### Load the dataset from file
* After loading, the data is usually shuffled once.

In [3]:
X = []
y = []
with open('data.json', 'r', encoding='utf8') as f:
    X = json.loads(f.read())
with open('label.json', 'r', encoding='utf8') as f:
    y = json.loads(f.read())
    
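# Shuffle X and y with the same permutation: re-seeding before each call
# makes random.shuffle reorder both equal-length lists identically.
randnum = random.randint(0,100)
random.seed(randnum)
random.shuffle(X)
random.seed(randnum)
random.shuffle(y)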
print(X[:1])
print(y[:1])

["From: bc744@cleveland.Freenet.Edu (Mark Ira Kaufman)\nSubject: About this 'Center for Policy Research'...\n\n\n   I have read numerous posts over a period of several months, by\nthis anti-Israel fanatic, hiding in the shadow of the respectable\nsounding name of the 'Center for Policy Research.'  Obviously, it\nis no research center of any kind, unless 'researching' published\ndocuments to find material to use against Israel makes it so.  \n\n   Labeling a propaganda mill a research center is not surprising\nin itself.  That is simply part of the propaganda process.  I was\ncurious if anyone knew who this anti-Israel fanatic hiding behind\nhis phoney 'research center' name is.  Is he an Arab?  Is he some\ntypical anti-semite hiding behind a veneer of 'anti-zionism?'  Is\nhe some Jew who perhaps lived in Israel and just couldn't make it\nthere, and is now taking his failure out on Israel?  \n\n   Let's shed some light on this clown once and for all.  It will\nhelp put his nonsense in t
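The shuffle above keeps texts and labels aligned by re-seeding the global generator before each call. An alternative sketch, shown here on toy data, shuffles the (text, label) pairs together with a local generator; the helper name `shuffle_together` and the fixed `seed` are assumptions for illustration.

```python
import random


def shuffle_together(texts, labels, seed=0):
    """Return shuffled copies of texts and labels, keeping each pair aligned."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    pairs = list(zip(texts, labels))
    # A local Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(pairs)
    if not pairs:
        return [], []
    shuffled_texts, shuffled_labels = zip(*pairs)
    return list(shuffled_texts), list(shuffled_labels)


# Toy data: each document ends with the letter used as its label.
docs = ["doc a", "doc b", "doc c", "doc d"]
tags = ["a", "b", "c", "d"]
shuffled_docs, shuffled_tags = shuffle_together(docs, tags, seed=42)
print(list(zip(shuffled_docs, shuffled_tags)))
```

Passing a fixed seed also makes the order reproducible between runs, which the random seed drawn in the notebook cell does not.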

### Split into training and test sets

In [4]:
from sklearn.model_selection import train_test_split # function for splitting data to train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

## Generating bag-of-words representations of the text
* `CountVectorizer` converts a collection of text documents into a matrix of token counts, with one row per document and one column per term. The token count is also called the term frequency.
* `TfidfVectorizer` instead converts the documents into a matrix of TF-IDF features, which down-weights terms that occur in many documents.
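To make the difference concrete, here is a small sketch on a three-sentence toy corpus (the sentences are invented for illustration and are not taken from the dataset). It uses both vectorizers with their default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, invented for illustration only.
toy_docs = [
    "the game was a great game",
    "the key encrypts the message",
    "the team won the game",
]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs)    # sparse matrix of raw term counts

tfidf_vec = TfidfVectorizer()
weights = tfidf_vec.fit_transform(toy_docs)   # sparse matrix of TF-IDF weights

# "game" occurs twice in the first document, so its count there is 2.
game_col = count_vec.vocabulary_["game"]
print(counts[0, game_col])

# "the" occurs in every document, so its inverse document frequency
# is lower than that of "team", which occurs in only one.
print(tfidf_vec.idf_[tfidf_vec.vocabulary_["the"]])
print(tfidf_vec.idf_[tfidf_vec.vocabulary_["team"]])
```

Both matrices have the same shape and vocabulary; only the cell values differ.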

In [5]:
# Limit the number of terms used as features
n_features = 1000

In [6]:
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english', ngram_range=(1, 2))
tf = tf_vectorizer.fit_transform(X_train)
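# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0.
features = pd.DataFrame(tf.toarray(), columns=tf_vectorizer.get_feature_names_out())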
features.head()

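|   | 00 | 000 | 01 | 02 | 10 | 100 | 11 | 12 | 128 | 13 | ... | writing | written | wrong | wrote | year | years | years ago | yes | york | young |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |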


In [7]:
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, 
                                   max_features=n_features, stop_words='english', ngram_range=(1, 2))
tfidf = tfidf_vectorizer.fit_transform(X_train)
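# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0.
features = pd.DataFrame(tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())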
features.head()

Unnamed: 0,00,000,01,02,10,100,11,12,128,13,...,writing,written,wrong,wrote,year,years,years ago,yes,york,young
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.065268,0.0,0.123587,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.13638,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.141576,0.0,0.0,0.0,0.0,0.0
3,0.0,0.158867,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.123731,0.0,0.0,0.0,0.0,0.0


### Classification with a naive Bayes classifier

In [8]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(tf, y_train)
X_test_tf = tf_vectorizer.transform(X_test)
y_predict = nb.predict(X_test_tf)

from sklearn.metrics import classification_report
print("Metrics on the test set:\n", classification_report(y_test, y_predict))

Metrics on the test set:
                         precision    recall  f1-score   support

         comp.graphics       0.90      0.96      0.93       204
      rec.sport.hockey       0.97      0.96      0.97       196
             sci.crypt       0.95      0.94      0.94       188
soc.religion.christian       0.96      0.90      0.93       193
 talk.politics.mideast       0.93      0.93      0.93       199

              accuracy                           0.94       980
             macro avg       0.94      0.94      0.94       980
          weighted avg       0.94      0.94      0.94       980



In [9]:
nb.fit(tfidf, y_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
y_predict = nb.predict(X_test_tfidf)

from sklearn.metrics import classification_report
print("Metrics on the test set:\n", classification_report(y_test, y_predict))

Metrics on the test set:
                         precision    recall  f1-score   support

         comp.graphics       0.87      0.96      0.91       204
      rec.sport.hockey       0.96      0.97      0.97       196
             sci.crypt       0.96      0.93      0.95       188
soc.religion.christian       0.97      0.91      0.94       193
 talk.politics.mideast       0.96      0.95      0.95       199

              accuracy                           0.94       980
             macro avg       0.95      0.94      0.94       980
          weighted avg       0.94      0.94      0.94       980



## Building topic models of the text

In [10]:
n_features = 1000
n_components = 5
n_top_words = 20

In [11]:
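# Label attached to each component index (position i names component i).
# The fitted components are unlabeled and their order is arbitrary, so these
# lists must be checked against each component's top words after fitting.
lsa_topics = ['soc.religion.christian', 'sci.crypt', 'rec.sport.hockey', 'talk.politics.mideast', 'comp.graphics']
lda_topics = ['talk.politics.mideast', 'rec.sport.hockey', 'soc.religion.christian', 'sci.crypt', 'comp.graphics']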

### Latent Semantic Analysis
`scikit-learn` also provides a dimensionality reduction model called truncated singular value decomposition (`TruncatedSVD`).

When `TruncatedSVD` is fitted on a term-count or TF-IDF matrix, the technique is known as latent semantic analysis (LSA).

In [12]:
lsa = TruncatedSVD(n_components=n_components, random_state=1, algorithm='arpack').fit(tfidf)
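As a rough illustration of what the fitted model contains, the sketch below runs LSA on a four-sentence toy corpus (invented for illustration) with two components. It uses the default solver rather than `arpack`, which is a simplification for the toy case.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
toy_docs = [
    "the team won the hockey game",
    "the players lost the game",
    "the key encrypts the message",
    "the chip stores the encryption key",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_docs)

# Two latent components; random_state fixes the randomized solver.
toy_lsa = TruncatedSVD(n_components=2, random_state=0)
doc_coords = toy_lsa.fit_transform(toy_tfidf)

print(doc_coords.shape)           # one row per document, one column per component
print(toy_lsa.components_.shape)  # one row per component, one column per term
print(np.round(doc_coords, 3))
```

Each row of `components_` holds signed term weights, and the document coordinates are signed as well, so LSA scores are not probabilities and can be negative.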

### Latent Dirichlet Allocation (LDA)
LDA is a generative probabilistic model for identifying abstract topics in collections of discrete data such as text corpora.

Use `CountVectorizer` rather than `TfidfVectorizer` when fitting LDA, since LDA models raw term counts. Fitting LDA on TF-IDF features up-weights rare words disproportionately, so they have a greater influence on the final topic distribution.

In [13]:
lda = LatentDirichletAllocation(n_components=n_components, random_state=1).fit(tf)
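The following sketch fits LDA on a toy corpus (invented for illustration) to show the shape and meaning of its outputs. With only four short sentences the learned topics are not meaningful; the point is the structure of the result.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
toy_docs = [
    "the team won the hockey game",
    "the players lost the game",
    "the key encrypts the message",
    "the chip stores the encryption key",
]

# LDA is fitted on raw term counts, as recommended above.
toy_counts = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(toy_counts)
doc_topics = toy_lda.transform(toy_counts)

print(doc_topics.shape)               # one row per document, one column per topic
print(np.round(doc_topics, 3))
print(doc_topics.sum(axis=1))         # each row is a distribution over topics
print(toy_lda.components_.shape)      # one row per topic, one column per term
```

Unlike LSA coordinates, each row returned by `transform` is non-negative and sums to 1, so it can be read as a distribution over topics.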

### Inspecting the learned topics

In [14]:
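def get_model_topics(model, vectorizer, topics, n_top_words=n_top_words):
    word_dict = {}
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0.
    feature_names = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the n_top_words largest weights, in descending order.
        top_features_ind = topic.argsort()[:-n_top_words - 1:-1]
        top_features = [feature_names[i] for i in top_features_ind]
        # The column name is the label listed for this component index.
        word_dict[topics[topic_idx]] = top_features

    return pd.DataFrame(word_dict)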
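The slice `argsort()[:-n_top_words - 1:-1]` in `get_model_topics` is compact but easy to misread. This small sketch applies it to a made-up weight vector and vocabulary (both are assumptions for illustration) to show that it returns the indices of the largest weights in descending order.

```python
import numpy as np

# Made-up component weights and vocabulary, for illustration only.
weights = np.array([0.1, 0.7, 0.05, 0.4, 0.3])
vocab = ["edu", "key", "com", "chip", "clipper"]
k = 3

# argsort() sorts ascending; the reversed slice walks back from the end
# and stops after k items, giving the k largest in descending order.
top_idx = weights.argsort()[:-k - 1:-1]
top_words = [vocab[i] for i in top_idx]
print(top_words)  # ['key', 'chip', 'clipper']

# An equivalent spelling when all weights are distinct.
alt_idx = np.argsort(-weights)[:k]
```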

In [15]:
get_model_topics(lsa, tfidf_vectorizer, lsa_topics)

|   | soc.religion.christian | sci.crypt | rec.sport.hockey | talk.politics.mideast | comp.graphics |
|---|---|---|---|---|---|
| 0 | edu | key | game | israel | graphics |
| 1 | com | clipper | team | israeli | edu |
| 2 | god | chip | hockey | jews | image |
| 3 | writes | encryption | ca | armenian | thanks |
| 4 | people | com | espn | turkish | file |
| 5 | article | keys | games | armenians | files |
| 6 | don | escrow | nhl | arab | program |
| 7 | just | government | season | armenia | ac |
| 8 | know | clipper chip | players | jewish | 3d |
| 9 | like | netcom | play | arabs | gif |


In [16]:
get_model_topics(lda, tf_vectorizer, lda_topics)

|   | talk.politics.mideast | rec.sport.hockey | soc.religion.christian | sci.crypt | comp.graphics |
|---|---|---|---|---|---|
| 0 | edu | god | edu | edu | turkish |
| 1 | com | edu | people | image | armenian |
| 2 | key | people | israel | graphics | 10 |
| 3 | writes | jesus | writes | file | armenians |
| 4 | game | church | jews | jpeg | 25 |
| 5 | clipper | think | article | use | 11 |
| 6 | encryption | christ | said | mail | 12 |
| 7 | article | believe | israeli | com | 14 |
| 8 | chip | know | just | available | armenia |
| 9 | team | christian | like | files | 20 |
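Because the column names come from a hand-written label list, it can help to cross-check that list against the true training labels. One possible approach, sketched here on toy arrays, assigns each component the most common true label among the documents it dominates; the helper name `majority_labels` and the toy data are assumptions for illustration, not part of the original notebook.

```python
from collections import Counter

import numpy as np


def majority_labels(doc_topic, true_labels):
    """Map each component index to the most common true label among the
    documents whose highest score falls on that component."""
    dominant = np.argmax(doc_topic, axis=1)
    mapping = {}
    for component in range(doc_topic.shape[1]):
        members = [lab for lab, d in zip(true_labels, dominant) if d == component]
        # A component that dominates no document gets no label.
        mapping[component] = Counter(members).most_common(1)[0][0] if members else None
    return mapping


# Toy document-topic matrix and labels, for illustration only.
toy_doc_topic = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.3, 0.7],
    [0.2, 0.8],
    [0.6, 0.4],
])
toy_labels = ["sci.crypt", "sci.crypt", "comp.graphics", "comp.graphics", "comp.graphics"]
print(majority_labels(toy_doc_topic, toy_labels))
# {0: 'sci.crypt', 1: 'comp.graphics'}
```

With the notebook's objects, a call such as `majority_labels(lda.transform(tf), y_train)` would give a data-driven mapping to compare against `lda_topics`. The argmax step suits LDA's non-negative scores better than LSA's signed ones.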


### Inference with the learned models

In [17]:
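def get_inference(model, vectorizer, topics, text, threshold):
    # Vectorize the single input text and project it onto the model's components.
    v_text = vectorizer.transform([text])
    score = model.transform(v_text)
    # Collect every topic whose score exceeds the threshold.
    labels = set()
    for i in range(len(score[0])):
        if score[0][i] > threshold:
            labels.add(topics[i])

    # No topic passed the threshold: return sentinel values.
    if not labels:
        return 'None', -1, set()

    # Otherwise return the best-scoring topic, the raw scores, and all matching labels.
    return topics[np.argmax(score[0])], score, labels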

In [18]:
text = 'you should use either jpeg or png files for it'
topic, score, labels = get_inference(lsa, tfidf_vectorizer, lsa_topics, text, 0)
print(topic)
print(score)

comp.graphics
[[ 0.05302177  0.05875669  0.00225079 -0.03293454  0.14861712]]


In [19]:
topic, score, labels = get_inference(lda, tf_vectorizer, lda_topics, text, 0)
print(topic)
print(score)

sci.crypt
[[0.05027019 0.05013133 0.05006827 0.7995194  0.05001081]]


### Classification using the learned features

In [20]:
lda_vec = lda.transform(tf)
nb.fit(lda_vec, y_train)

lda_test_vec = lda.transform(X_test_tf)
y_predict = nb.predict(lda_test_vec)

from sklearn.metrics import classification_report
print("Metrics on the test set:\n", classification_report(y_test, y_predict))

Metrics on the test set:
                         precision    recall  f1-score   support

         comp.graphics       0.88      0.89      0.89       204
      rec.sport.hockey       0.61      0.81      0.70       196
             sci.crypt       0.65      0.42      0.51       188
soc.religion.christian       0.91      0.88      0.90       193
 talk.politics.mideast       0.88      0.92      0.90       199

              accuracy                           0.79       980
             macro avg       0.79      0.78      0.78       980
          weighted avg       0.79      0.79      0.78       980



### Based on the experimental results above, analyze the suitability of the LDA model