In [1]:
!pip install bertopic[visualization]


## 1. BERTopic

BERTopic is a topic modeling technique that combines BERT embeddings with class-based TF-IDF to build dense, easily interpretable clusters while keeping the important words in each topic description. The BERTopic algorithm proceeds in three main steps.
<br />
<br />

### 1) Embed the text data with SBERT
<br />
Documents are embedded with SBERT (Sentence-BERT). By default, BERTopic uses one of the following pretrained models.
<br />
<br />

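- 'paraphrase-MiniLM-L6-v2': an SBERT model trained on English data (the English default in early releases; later releases appear to default to 'all-MiniLM-L6-v2' instead)

- 'paraphrase-multilingual-MiniLM-L12-v2': a multilingual SBERT model trained on more than 50 languages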
<br />

### 2) Cluster the documents
<br />
UMAP reduces the dimensionality of the embeddings, and HDBSCAN then clusters the reduced embeddings, producing clusters of semantically similar documents.
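
The clustering step can be customised by handing BERTopic its own UMAP and HDBSCAN instances. The sketch below is one way to do that. The parameter values are assumptions that are believed to mirror the defaults BERTopic uses, `build_topic_model` is a hypothetical helper, and `random_state=42` is added only so that UMAP (and therefore the resulting topics) is reproducible between runs.

```python
# Assumed settings: believed to match BERTopic's defaults, but check them
# against the version you have installed.
UMAP_PARAMS = dict(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine")
HDBSCAN_PARAMS = dict(
    min_cluster_size=10,
    metric="euclidean",
    cluster_selection_method="eom",
    prediction_data=True,  # needed so new documents can be assigned later
)


def build_topic_model(random_state=42):
    """Build a BERTopic model with explicit UMAP and HDBSCAN sub-models."""
    # Imported lazily so the settings above can be inspected without
    # loading the heavy dependencies.
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    umap_model = UMAP(random_state=random_state, **UMAP_PARAMS)
    hdbscan_model = HDBSCAN(**HDBSCAN_PARAMS)
    return BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
```

Calling `build_topic_model().fit_transform(docs)` should then behave much like a plain `BERTopic()`, except that the dimensionality reduction uses a fixed seed.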
<br />
<br />

### 3) Create the topic representations
<br />
In the final step, the topics are extracted with class-based TF-IDF.
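
To make this step concrete, here is a minimal sketch of class-based TF-IDF in plain Python. It assumes the weighting usually given for c-TF-IDF, W(x, c) = tf(x, c) · log(1 + A / f(x)), where tf(x, c) is the normalized frequency of word x in cluster c, f(x) is the frequency of x across all clusters, and A is the average number of words per cluster. The whitespace tokenizer, the `c_tf_idf` and `top_words` helpers, and the toy clusters are illustrative assumptions; BERTopic's own implementation works on sparse count matrices and differs in detail.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score words per cluster with a class-based TF-IDF weighting.

    `clusters` maps a cluster id to the list of documents assigned to it.
    All documents of a cluster are treated as one long "class document".
    """
    # Term counts per cluster (naive lowercase whitespace tokenization).
    counts = {
        cid: Counter(word for doc in docs for word in doc.lower().split())
        for cid, docs in clusters.items()
    }
    # f(x): frequency of each word across all clusters.
    total = Counter()
    for counter in counts.values():
        total.update(counter)
    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for cid, counter in counts.items():
        n_words = sum(counter.values())
        scores[cid] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in counter.items()
        }
    return scores


def top_words(scores, cid, n=3):
    """Return the n highest-scoring words of one cluster."""
    ranked = sorted(scores[cid].items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Toy clusters, made up for illustration.
clusters = {
    0: ["team won the game", "game went to overtime", "team lost the game"],
    1: ["the encryption key is secret", "chip stores the key", "key escrow chip"],
}
scores = c_tf_idf(clusters)
print(top_words(scores, 0, n=2))  # ['game', 'team']
print(top_words(scores, 1, n=2))  # ['key', 'chip']
```

Words that are frequent inside one cluster but rare elsewhere score highest, while a word such as "the" that appears in every cluster is pushed down by the logarithmic term.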

In [2]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
docs[:5]

["\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n",
 'My brother is in the market for a high-performance video card that supports\nVESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:\n\n  - Diamond Stealth Pro Local Bus\n\n  - Orchid Farenheit 1280\n\n  - ATI Graphics Ultra Pro\n\n  - Any other high-per

In [3]:
print(len(docs))

18846


In [4]:
# Create a BERTopic model and pass a list of strings to fit_transform to run topic modeling.
model = BERTopic()
topics, probabilities = model.fit_transform(docs)

print('Number of topic assignments (one per document): ', len(topics))
print('Topic number of the first document: ', topics[0])

Number of topic assignments (one per document):  18846
Topic number of the first document:  0


In [5]:
model.get_topic_info()

|     | Topic | Count | Name |
| --- | --- | --- | --- |
| 0 | -1 | 6750 | -1_to_the_and_of |
| 1 | 0 | 1819 | 0_game_team_games_he |
| 2 | 1 | 573 | 1_key_clipper_chip_encryption |
| 3 | 2 | 527 | 2_ites_cheek_yep_huh |
| 4 | 3 | 470 | 3_israel_israeli_jews_arab |
| ... | ... | ... | ... |
| 211 | 210 | 10 | 210_jesus_law_gentiles_paul |
| 212 | 211 | 10 | 211_letter_your_letters_boss |
| 213 | 212 | 10 | 212_religion_religious_wars_history |
| 214 | 213 | 10 | 213_icon_icons_click_box |


In [6]:
model.get_topic_info()['Count'].sum()

18846
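
The sum matches the 18846 documents because the outliers collected under Topic -1 are counted too. A quick sketch of how large that outlier share is follows. `outlier_share` is a hypothetical helper, `toy_topics` is a made-up stand-in for the list returned by fit_transform, and the last line reuses the counts from the get_topic_info() output above.

```python
from collections import Counter


def outlier_share(topics):
    """Return the fraction of documents assigned to the outlier topic (-1)."""
    counts = Counter(topics)
    return counts[-1] / len(topics)


# Made-up topic assignments standing in for the output of fit_transform.
toy_topics = [0, -1, 1, 0, -1, 2, 0, -1]
print(outlier_share(toy_topics))  # 0.375

# The same ratio using the counts reported above: 6750 outliers of 18846 documents.
print(f"{6750 / 18846:.1%}")  # 35.8%
```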

In [7]:
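# In the get_topic_info() output above, Topic -1 is the largest topic. -1 collects all outlier documents that were not assigned to any topic.
# Inspect the top words of topic 5 together with their c-TF-IDF scores.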
model.get_topic(5)

[('xterm', 0.013925810439811162),
 ('echo', 0.011925040728282895),
 ('x11r5', 0.011275269004701821),
 ('server', 0.011178156567704163),
 ('error', 0.011140850827502578),
 ('xdm', 0.00843679286644767),
 ('sun', 0.008329774715673685),
 ('display', 0.007781619897128317),
 ('running', 0.007554317532121254),
 ('set', 0.00742540193070729)]
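
Each entry is a (word, score) pair, sorted by score. Judging from the Name column of get_topic_info() above (for example 0_game_team_games_he), a topic's name looks like its number followed by its top four words joined with underscores. The sketch below rebuilds such a label from the first four pairs shown above; `topic_label` is a hypothetical helper, not part of BERTopic.

```python
def topic_label(topic_id, word_scores, n_words=4):
    """Join a topic number and its top words into a Name-style label."""
    words = [word for word, _score in word_scores[:n_words]]
    return "_".join([str(topic_id), *words])


# First four (word, score) pairs from the get_topic(5) output above.
topic_5 = [
    ("xterm", 0.013925810439811162),
    ("echo", 0.011925040728282895),
    ("x11r5", 0.011275269004701821),
    ("server", 0.011178156567704163),
]
print(topic_label(5, topic_5))  # 5_xterm_echo_x11r5_server
```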

In [8]:
# Visualize the topics (intertopic distance map)
model.visualize_topics()

In [9]:
# Bar charts of the top word scores per topic
model.visualize_barchart()

In [10]:
# Heatmap of the similarity between topics
model.visualize_heatmap()

In [11]:
# Predict the topic of a single document (here, the first document of the corpus)
new_doc = docs[0]
print(new_doc)



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!




In [12]:
topics, probs = model.transform([new_doc])
print('Predicted topic number: ', topics)


Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.



Predicted topic number:  [0]


In [None]:
# # Let BERTopic reduce the number of topics automatically, or set it manually
# model = BERTopic(nr_topics='auto')
# model = BERTopic(nr_topics=20)

# # Save and load the model
# model.save('my_topics_model')
# loaded_model = BERTopic.load('my_topics_model')