## BERT-Based Combined Topic Models (CTM)

Contextualized topic models extract topics from documents by combining the expressive power of BERT's context-aware document embeddings with the unsupervised learning ability of conventional topic models. The Combined Topic Model (CTM) is one kind of contextualized topic model.
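
As a rough illustration of what "combining" means here, the sketch below builds the two views of a single document, a bag-of-words count vector and a dense embedding, and joins them into one input vector. The vocabulary, the document, and the embedding values are all made up for illustration, and `bag_of_words` is a hypothetical helper; the library combines the two views inside its neural inference network, and its internals may differ from this simple concatenation.

```python
def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    tokens = doc.split()
    return [tokens.count(word) for word in vocab]

# Toy data (assumptions, not taken from the DBpedia sample)
toy_vocab = ["highway", "ontario", "photographer"]
toy_doc = "highway ontario highway"
toy_embedding = [0.12, -0.40, 0.33, 0.08]  # stand-in for a sentence embedding

bow_vector = bag_of_words(toy_doc, toy_vocab)
combined_input = bow_vector + toy_embedding  # list concatenation of both views

print(bow_vector)
print(len(combined_input))
```

In the real pipeline the embedding has 768 dimensions and the bag of words has one entry per vocabulary word, which is why both sizes are passed to the model later on.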

In [26]:
# pip install contextualized-topic-models==2.2.0
import pandas as pd
import urllib.request
import nltk
from nltk.corpus import stopwords
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

#### 1) Loading the data

In [39]:
# Download the data
urllib.request.urlretrieve("https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt", filename="dbpedia_sample_abstract_20k_unprep.txt")

text_file = "dbpedia_sample_abstract_20k_unprep.txt"

#### 2) Preprocessing

In [40]:
# Read one document per line; the context manager closes the file afterwards
with open(text_file, encoding="utf-8") as f:
    documents = [line.strip() for line in f]
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()

In [41]:
# Documents before preprocessing
unpreprocessed_corpus[:2]

['The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades,it was not until The Niagara Frontier International Gateway Study was published by the Ministry',
 "Monte Zucker (died March 15, 2007) was an American photographer. He specialized in wedding photography, entering it as a profession in 1947. In the 1970s he operated a studio in Silver Spring, Maryland. Later he lived in Florida. He was Brides Magazine's Wedding Photographer of the Year for 1990 and"]

In [42]:
# Documents after normalization (preprocessing)
preprocessed_documents[:2]

['mid peninsula highway proposed across peninsula canadian province ontario although highway connecting hamilton fort south international study published ministry',
 'died march american photographer specialized photography operated studio silver spring maryland later lived florida magazine photographer year']

In [44]:
# Total vocabulary size
# The vocabulary_size argument of WhiteSpacePreprocessing() defaults to 2000
print('Size of the vocabulary used for the bag of words:', len(vocab))

Size of the vocabulary used for the bag of words: 2000
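
To make the effect of a vocabulary cap concrete, here is a minimal sketch that keeps only the most frequent words and drops everything else from each document. `limit_vocabulary` is a hypothetical helper written for this illustration, not the implementation used by `WhiteSpacePreprocessing`, which also handles stopwords and other cleanup.

```python
from collections import Counter

def limit_vocabulary(docs, vocabulary_size):
    """Keep only the `vocabulary_size` most frequent words in each document."""
    counts = Counter(word for doc in docs for word in doc.split())
    vocab = {word for word, _ in counts.most_common(vocabulary_size)}
    filtered = [" ".join(w for w in doc.split() if w in vocab) for doc in docs]
    return filtered, sorted(vocab)

# Toy corpus (an assumption for illustration)
toy_docs = [
    "highway province highway ontario",
    "photographer studio photographer florida",
    "highway photographer",
]
filtered_docs, capped_vocab = limit_vocabulary(toy_docs, vocabulary_size=2)

print(capped_vocab)
print(filtered_docs)
```

Rare words such as `province` or `studio` disappear from the bag-of-words view, while the contextual embedding is still computed from the full, unprocessed text.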


In [45]:
tp = TopicModelDataPreparation("paraphrase-distilroberta-base-v1")
training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)


In [46]:
len(tp.vocab)

2000

#### 3) Training the Combined TM

In [47]:
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=50, num_epochs=20)
ctm.fit(training_dataset)

Epoch: [20/20]	 Seen Samples: [400000/400000]	Train Loss: 135.73821025390626	Time: 0:00:22.563363: : 20it [07:53, 23.67s/it]
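
The `Seen Samples` counter in the log is consistent with simple arithmetic, assuming the corpus holds 20,000 documents (inferred from the `20k` in the data file name, not stated by the log itself) and that each epoch passes over every document once.

```python
num_documents = 20_000  # assumption: inferred from "20k" in the data file name
num_epochs = 20         # value passed to CombinedTM above

seen_samples = num_documents * num_epochs
print(seen_samples)
```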


#### 4) Results and visualization

In [48]:
# Results: the top 5 words of each topic
ctm.get_topic_lists(5)

[['american', 'known', 'music', 'best', 'born'],
 ['french', 'century', 'de', 'died', 'king'],
 ['born', 'world', 'summer', 'olympics', 'medal'],
 ['released', 'album', 'music', 'band', 'live'],
 ['km', 'north', 'mi', 'south', 'west'],
 ['states', 'united', 'county', 'state', 'new'],
 ['politician', 'member', 'served', 'john', 'born'],
 ['de', 'century', 'greek', 'king', 'roman'],
 ['cup', 'championship', 'held', 'tournament', 'edition'],
 ['family', 'found', 'species', 'genus', 'native'],
 ['american', 'football', 'team', 'college', 'head'],
 ['member', 'members', 'election', 'state', 'council'],
 ['district', 'also', 'iran', 'province', 'population'],
 ['university', 'professor', 'born', 'american', 'law'],
 ['district', 'county', 'population', 'census', 'town'],
 ['use', 'used', 'either', 'process', 'usually'],
 ['research', 'university', 'established', 'education', 'international'],
 ['family', 'found', 'species', 'native', 'mm'],
 ['season', 'team', 'division', 'league', 'football

In [49]:
# Visualization
lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

Sampling: [10/10]: : 10it [02:43, 16.38s/it]


In [50]:
import pyLDAvis as vis

lda_vis_data = ctm.get_ldavis_data_format(tp.vocab, training_dataset, n_samples=10)

ctm_pd = vis.prepare(**lda_vis_data)
vis.display(ctm_pd)

Sampling: [10/10]: : 10it [02:36, 15.61s/it]


#### 5) Prediction
You can take any document and check which topic was assigned to it. For example, let's predict the topic of the first preprocessed document, which is about a peninsula.

In [51]:
topics_predictions = ctm.get_thetas(training_dataset, n_samples=5) # get all the topic predictions

Sampling: [5/5]: : 5it [01:16, 15.38s/it]


In [52]:
# First preprocessed document
print(preprocessed_documents[0])

mid peninsula highway proposed across peninsula canadian province ontario although highway connecting hamilton fort south international study published ministry


In [53]:
import numpy as np
topic_number = np.argmax(topics_predictions[0]) # get the predicted topic ID of the first document

In [57]:
print(topic_number)

29


In [58]:
ctm.get_topic_lists(5)[29]  # top words of topic 29, the first document's topic ID

['state', 'located', 'south', 'miles', 'river']

In [59]:
ctm.get_topic_lists(5)[topic_number] # same lookup using the predicted topic_number

['state', 'located', 'south', 'miles', 'river']
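
Taking the argmax keeps only the single most likely topic. When a document plausibly mixes several themes, it can help to look at the few highest-probability topics instead. The sketch below does this on a made-up four-topic distribution; `top_topics` and `toy_theta` are hypothetical names for illustration and are not part of the library.

```python
def top_topics(theta, k=3):
    """Return the k (topic_id, probability) pairs with the highest probability."""
    ranked = sorted(enumerate(theta), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Toy document-topic distribution (an assumption, not model output)
toy_theta = [0.05, 0.10, 0.60, 0.25]

# Equivalent of the argmax step: index of the largest probability
best_topic = max(range(len(toy_theta)), key=toy_theta.__getitem__)

print(best_topic)
print(top_topics(toy_theta, k=2))
```

With the trained model, one row of `topics_predictions` could presumably be passed in place of `toy_theta`.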