1) Understanding the newsgroup data

In [1]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD


In [2]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
documents = dataset.data
print('Number of samples:', len(documents))


Number of samples: 11314


In [3]:
documents[1]


"\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap of faith, Jimmy.  Your logic runs out\nof steam!\n\n\n\n\n\n\n\nJim,\n\nSorry I can't pity you, Jim.  And I'm sorry that you have these feelings of\ndenial about the faith you need to get by.  Oh well, just pretend that it will\nall end happily ever after anyway.  Maybe if you start a new newsgroup,\nalt.atheist.hard, you won't be bummin' so much?\n\n\n\n\n\n\nBye-Bye, Big Jim.  Don't forget your Flintstone's Chewables!  :) \n--\nBake Timmons, III"

In [4]:
print(dataset.target_names)


['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


2) Text preprocessing

In [5]:
news_df = pd.DataFrame({'document':documents})
# Replace every character that is not an ASCII letter with a space
# (regex=True is required: since pandas 2.0 the pattern is treated as a literal string by default)
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z]", " ", regex=True)
# Drop short words (length 3 or less)
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
# Lowercase every word
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())


In [6]:
news_df['clean_doc'][1]


'yeah expect people read actually accept hard atheism need little leap faith jimmy your logic runs steam sorry pity sorry that have these feelings denial about faith need well just pretend that will happily ever after anyway maybe start newsgroup atheist hard bummin much forget your flintstone chewables bake timmons'
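The three cleaning steps can be checked on a single string without pandas. This is a minimal sketch using only the standard library; `clean_text` and `min_len` are illustrative names that do not appear in the notebook.

```python
import re

def clean_text(text, min_len=4):
    """Replace non-letters with spaces, drop words shorter than min_len, lowercase."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    kept = [w for w in letters_only.split() if len(w) >= min_len]
    return " ".join(kept).lower()

# Opening words of documents[1], shortened for illustration
sample = "Yeah, do you expect people to read the FAQ, etc.?"
cleaned = clean_text(sample)
print(cleaned)
```

The result matches the first words of `news_df['clean_doc'][1]` shown above.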

In [8]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [9]:
# Load the English stop-word list from NLTK.
stop_words = stopwords.words('english')
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split()) # tokenize
# Remove stop words.
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])


In [10]:
print(tokenized_doc[1])


['yeah', 'expect', 'people', 'read', 'actually', 'accept', 'hard', 'atheism', 'need', 'little', 'leap', 'faith', 'jimmy', 'logic', 'runs', 'steam', 'sorry', 'pity', 'sorry', 'feelings', 'denial', 'faith', 'need', 'well', 'pretend', 'happily', 'ever', 'anyway', 'maybe', 'start', 'newsgroup', 'atheist', 'hard', 'bummin', 'much', 'forget', 'flintstone', 'chewables', 'bake', 'timmons']


3) Building the TF-IDF matrix

In [11]:
# Detokenize (join the tokens back into a single string)
detokenized_doc = []
for i in range(len(news_df)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)

news_df['clean_doc'] = detokenized_doc


In [12]:
news_df['clean_doc'][1]


'yeah expect people read actually accept hard atheism need little leap faith jimmy logic runs steam sorry pity sorry feelings denial faith need well pretend happily ever anyway maybe start newsgroup atheist hard bummin much forget flintstone chewables bake timmons'

In [13]:
vectorizer = TfidfVectorizer(stop_words='english', max_features= 1000, # keep the top 1,000 terms
max_df = 0.5, smooth_idf=True)

X = vectorizer.fit_transform(news_df['clean_doc'])

# Check the shape of the TF-IDF matrix
print('Shape of the TF-IDF matrix:',X.shape)


Shape of the TF-IDF matrix: (11314, 1000)
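To make the weighting concrete, here is a hand-rolled sketch of the smoothed formula that scikit-learn documents for `smooth_idf=True`: idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. It is not the library's implementation and it leaves out `max_df`, `max_features`, and stop-word filtering. The `tfidf` helper and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocab, rows) with smoothed-idf weights and L2-normalized rows."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({w for doc in tokenized for w in doc})
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        vec = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        rows.append([v / norm for v in vec])
    return vocab, rows

toy_docs = ["drive disk drive", "game team season", "disk card video"]
vocab, matrix = tfidf(toy_docs)
for row in matrix:
    print([round(v, 3) for v in row])
```

In the first toy document, `drive` outweighs `disk`: it occurs twice there and in no other document, while `disk` also appears in the third.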


4) Topic modeling

In [14]:
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=122)
svd_model.fit(X)
len(svd_model.components_)


20

In [17]:
import numpy as np

In [18]:
np.shape(svd_model.components_)


(20, 1000)

In [20]:
terms = vectorizer.get_feature_names_out() # Vocabulary: the 1,000 terms kept by the vectorizer.

def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx+1), [(feature_names[i], topic[i].round(5)) for i in topic.argsort()[:-n - 1:-1]])
get_topics(svd_model.components_,terms)


Topic 1: [('like', 0.21386), ('know', 0.20046), ('people', 0.19293), ('think', 0.17805), ('good', 0.15128)]
Topic 2: [('thanks', 0.32888), ('windows', 0.29088), ('card', 0.18069), ('drive', 0.17455), ('mail', 0.15111)]
Topic 3: [('game', 0.37064), ('team', 0.32443), ('year', 0.28154), ('games', 0.2537), ('season', 0.18419)]
Topic 4: [('drive', 0.53324), ('scsi', 0.20165), ('hard', 0.15628), ('disk', 0.15578), ('card', 0.13994)]
Topic 5: [('windows', 0.40399), ('file', 0.25436), ('window', 0.18044), ('files', 0.16078), ('program', 0.13894)]
Topic 6: [('chip', 0.16114), ('government', 0.16009), ('mail', 0.15625), ('space', 0.1507), ('information', 0.13562)]
Topic 7: [('like', 0.67086), ('bike', 0.14236), ('chip', 0.11169), ('know', 0.11139), ('sounds', 0.10371)]
Topic 8: [('card', 0.46633), ('video', 0.22137), ('sale', 0.21266), ('monitor', 0.15463), ('offer', 0.14643)]
Topic 9: [('know', 0.46047), ('card', 0.33605), ('chip', 0.17558), ('government', 0.1522), ('video', 0.14356)]
Topic 10
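The slice `topic.argsort()[:-n - 1:-1]` in `get_topics` is compact but easy to misread: `argsort()` returns indices in ascending order of weight, and the slice walks backwards from the end for `n` items, so it yields the indices of the `n` largest weights, largest first. A plain-Python sketch of the same selection follows; `top_terms` and the toy weights are illustrative assumptions, not values from the model.

```python
def top_terms(weights, feature_names, n=5):
    """Return the n (term, weight) pairs with the largest weights, largest first."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(feature_names[i], weights[i]) for i in order[:n]]

# Toy topic vector over a five-term vocabulary
toy_terms = ["card", "drive", "game", "scsi", "team"]
toy_topic = [0.14, 0.53, 0.01, 0.20, 0.02]
top3 = top_terms(toy_topic, toy_terms, n=3)
print(top3)
```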

LDA practice

1) Integer encoding and building the vocabulary

In [21]:
tokenized_doc[:5]


0    [well, sure, story, seem, biased, disagree, st...
1    [yeah, expect, people, read, actually, accept,...
2    [although, realize, principle, strongest, poin...
3    [notwithstanding, legitimate, fuss, proposal, ...
4    [well, change, scoring, playoff, pool, unfortu...
Name: clean_doc, dtype: object

In [22]:
from gensim import corpora
dictionary = corpora.Dictionary(tokenized_doc)
corpus = [dictionary.doc2bow(text) for text in tokenized_doc]
print(corpus[1]) # Print the second document; the first document has index 0


[(52, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 2), (67, 1), (68, 1), (69, 1), (70, 1), (71, 2), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 2), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1), (85, 2), (86, 1), (87, 1), (88, 1), (89, 1)]


In [23]:
print(dictionary[66])


faith


In [24]:
len(dictionary)


64281
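Each pair in `corpus[1]` is `(token id, count in this document)`; for example `(66, 2)` says that `faith` occurs twice. The idea can be sketched in plain Python as below. This is not gensim's implementation, and the ids it assigns will generally differ from those of `corpora.Dictionary`. `build_vocab` and `to_bow` are hypothetical helper names.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

toy_docs = [["faith", "need", "faith"], ["need", "game"]]
token2id = build_vocab(toy_docs)
bow0 = to_bow(toy_docs[0], token2id)
print(token2id)
print(bow0)
```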

2) Training the LDA model

In [25]:
import gensim
NUM_TOPICS = 20 # 20 topics, k=20
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)


(0, '0.016*"available" + 0.014*"software" + 0.013*"image" + 0.011*"graphics"')
(1, '0.025*"windows" + 0.016*"window" + 0.013*"using" + 0.010*"server"')
(2, '0.022*"health" + 0.018*"medical" + 0.012*"disease" + 0.010*"patients"')
(3, '0.023*"game" + 0.021*"team" + 0.016*"games" + 0.015*"play"')
(4, '0.022*"space" + 0.010*"bike" + 0.009*"nasa" + 0.007*"engine"')
(5, '0.011*"people" + 0.010*"would" + 0.007*"jesus" + 0.006*"believe"')
(6, '0.015*"armenian" + 0.015*"jews" + 0.013*"turkish" + 0.012*"armenians"')
(7, '0.023*"university" + 0.020*"research" + 0.020*"center" + 0.019*"science"')
(8, '0.018*"government" + 0.010*"public" + 0.009*"encryption" + 0.009*"people"')
(9, '0.014*"program" + 0.014*"president" + 0.011*"space" + 0.007*"office"')
(10, '0.019*"would" + 0.014*"like" + 0.013*"think" + 0.011*"good"')
(11, '0.050*"file" + 0.028*"output" + 0.028*"entry" + 0.020*"program"')
(12, '0.026*"thanks" + 0.022*"please" + 0.021*"anyone" + 0.021*"would"')
(13, '0.017*"israel" + 0.011*"israeli"

In [26]:
print(ldamodel.print_topics())


[(0, '0.016*"available" + 0.014*"software" + 0.013*"image" + 0.011*"graphics" + 0.009*"version" + 0.008*"also" + 0.008*"data" + 0.007*"package" + 0.006*"format" + 0.006*"includes"'), (1, '0.025*"windows" + 0.016*"window" + 0.013*"using" + 0.010*"server" + 0.010*"display" + 0.010*"problem" + 0.009*"file" + 0.009*"motif" + 0.009*"files" + 0.009*"program"'), (2, '0.022*"health" + 0.018*"medical" + 0.012*"disease" + 0.010*"patients" + 0.009*"among" + 0.008*"study" + 0.008*"medicine" + 0.007*"drug" + 0.007*"drugs" + 0.007*"gordon"'), (3, '0.023*"game" + 0.021*"team" + 0.016*"games" + 0.015*"play" + 0.013*"season" + 0.010*"hockey" + 0.010*"period" + 0.010*"players" + 0.009*"league" + 0.009*"year"'), (4, '0.022*"space" + 0.010*"bike" + 0.009*"nasa" + 0.007*"engine" + 0.007*"launch" + 0.007*"cars" + 0.005*"road" + 0.005*"power" + 0.005*"miles" + 0.005*"shuttle"'), (5, '0.011*"people" + 0.010*"would" + 0.007*"jesus" + 0.006*"believe" + 0.006*"many" + 0.006*"think" + 0.005*"even" + 0.005*"true" 

3) Visualizing LDA

In [33]:
pip install pyLDAvis


In [34]:
pip install gensim

In [36]:
pip install --upgrade pandas

In [38]:
pip install --upgrade gensim pyldavis

In [41]:
pip install gensim==3.8.3


Collecting gensim==3.8.3
  Downloading gensim-3.8.3.tar.gz (23.4 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gensim
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> See above for output.

  note: This error originates from a subprocess, and is likely not a problem with pip.
  Building wheel for gensim (setup.py) ... error
  ERROR: Failed building wheel for gensim
  Running setup.py clean for gensim
Failed to build gensim
ERROR: Could not build wheels for gensim, which is required to install pyproject.toml-based projects

In [42]:
import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(vis)




BrokenProcessPool: ignored

In [43]:
for i, topic_list in enumerate(ldamodel[corpus]):
    if i==5:
        break
    print(f'Topic proportions of document {i}: {topic_list}')


Topic proportions of document 0: [(4, 0.13122064), (5, 0.3542902), (6, 0.13644736), (10, 0.11109903), (13, 0.16143198), (14, 0.094018765)]
Topic proportions of document 1: [(0, 0.031110212), (3, 0.08073408), (5, 0.4496509), (7, 0.055765793), (10, 0.33870876), (15, 0.025576591)]
Topic proportions of document 2: [(5, 0.12693569), (6, 0.075742394), (7, 0.11277084), (8, 0.03509439), (10, 0.4175312), (13, 0.22061537)]
Topic proportions of document 3: [(8, 0.37302494), (10, 0.26549032), (12, 0.06228346), (14, 0.16139883), (16, 0.12604046)]
Topic proportions of document 4: [(3, 0.26817322), (8, 0.22222552), (10, 0.31826603), (11, 0.0955529), (12, 0.06799089)]


In [44]:
def make_topictable_per_doc(ldamodel, corpus):
    rows = []

    # Walk through the documents: i is the document number, topic_list its topic proportions.
    for i, topic_list in enumerate(ldamodel[corpus]):
        doc = topic_list[0] if ldamodel.per_word_topics else topic_list
        # Sort the topics of each document by proportion, highest first.
        # e.g. before sorting, document 0: (topic 2, 48.5%), (topic 8, 25%), (topic 10, 5%), (topic 12, 21.5%)
        # e.g. after sorting, document 0:  (topic 2, 48.5%), (topic 8, 25%), (topic 12, 21.5%), (topic 10, 5%)
        doc = sorted(doc, key=lambda x: (x[1]), reverse=True)

        if doc:
            # After sorting, the first entry is the topic with the highest proportion.
            topic_num, prop_topic = doc[0]
            # Store the dominant topic, its proportion, and the full list of topic proportions.
            rows.append([int(topic_num), round(prop_topic, 4), topic_list])

    # Build the DataFrame once from the collected rows.
    # DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0.
    return pd.DataFrame(rows)


In [45]:
topictable = make_topictable_per_doc(ldamodel, corpus)
topictable = topictable.reset_index() # turn the index into a column that serves as the document number
topictable.columns = ['Document no.', 'Dominant topic', 'Dominant topic proportion', 'Topic proportions']
topictable[:10]


   Document no.  Dominant topic  Dominant topic proportion  Topic proportions
0             0               5                     0.3543  [(4, 0.1312204), (5, 0.3543026), (6, 0.1364478...
1             1               5                     0.4499  [(0, 0.03111312), (3, 0.080735005), (5, 0.4498...
2             2              10                     0.4175  [(5, 0.1269831), (6, 0.07573116), (7, 0.112768...
3             3               8                      0.373  [(8, 0.37302428), (10, 0.2654892), (12, 0.0622...
4             4              10                     0.3184  [(3, 0.26817578), (8, 0.22222382), (10, 0.3184...
5             5               5                     0.3069  [(5, 0.30686975), (10, 0.2217948), (13, 0.1840...
6             6              16                     0.3339  [(0, 0.059928235), (1, 0.101623565), (7, 0.042...
7             7              10                     0.4213  [(5, 0.30226162), (10, 0.42133245), (13, 0.098...
8             8              18                     0.3419  [(2, 0.13520256), (5, 0.07558674), (9, 0.15734...
9             9              10                     0.6918  [(4, 0.06495174), (8, 0.029125297), (10, 0.691...
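The table reduces each document to its dominant topic, that is, the `(topic, proportion)` pair with the largest proportion. Finding that pair does not need a full sort. A minimal sketch, using the topic proportions printed for document 0 in cell In [43]; `dominant_topic` is an illustrative helper name.

```python
def dominant_topic(topic_list):
    """Return the (topic_id, proportion) pair with the largest proportion."""
    return max(topic_list, key=lambda pair: pair[1])

# Topic proportions printed for document 0 in cell In [43]
doc0 = [(4, 0.13122064), (5, 0.3542902), (6, 0.13644736),
        (10, 0.11109903), (13, 0.16143198), (14, 0.094018765)]
topic_id, proportion = dominant_topic(doc0)
print(topic_id, round(proportion, 4))
```

The small difference from the table value (0.3543) is expected: the table was produced by a separate inference pass, and the inferred proportions vary slightly between passes, as the two outputs above show.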


Latent Dirichlet Allocation (LDA) with scikit-learn

In [47]:
import pandas as pd
import urllib.request
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Save the CSV under an explicit name in the working directory, then read it from that path.
urllib.request.urlretrieve("https://raw.githubusercontent.com/ukairia777/tensorflow-nlp-tutorial/main/19.%20Topic%20Modeling%20(LDA%2C%20BERT-Based)/dataset/abcnews-date-text.csv", filename="abcnews-date-text.csv")

# on_bad_lines='skip' replaces error_bad_lines=False, which was removed in pandas 2.0.
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')
print('Number of headlines:',len(data))


Number of headlines: 63765


In [48]:
print(data.head(5))


   publish_date                                      headline_text
0      20030219  aba decides against community broadcasting lic...
1      20030219     act fire witnesses must be aware of defamation
2      20030219     a g calls for infrastructure protection summit
3      20030219           air nz staff in aust strike for pay rise
4      20030219      air nz strike to affect australian travellers




In [49]:
text = data[['headline_text']].copy() # take a copy so later column assignments do not raise SettingWithCopyWarning
text.head(5)


                                       headline_text
0  aba decides against community broadcasting lic...
1     act fire witnesses must be aware of defamation
2     a g calls for infrastructure protection summit
3           air nz staff in aust strike for pay rise
4      air nz strike to affect australian travellers


In [52]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [53]:
text['headline_text'] = text.apply(lambda row: nltk.word_tokenize(row['headline_text']), axis=1) # tokenize each headline


In [54]:
print(text.head(5))


                                       headline_text
0  [aba, decides, against, community, broadcastin...
1  [act, fire, witnesses, must, be, aware, of, de...
2  [a, g, calls, for, infrastructure, protection,...
3  [air, nz, staff, in, aust, strike, for, pay, r...
4  [air, nz, strike, to, affect, australian, trav...




In [55]:
stop_words = stopwords.words('english')
text['headline_text'] = text['headline_text'].apply(lambda x: [word for word in x if word not in (stop_words)])




In [56]:
print(text.head(5))


                                       headline_text
0   [aba, decides, community, broadcasting, licence]
1    [act, fire, witnesses, must, aware, defamation]
2     [g, calls, infrastructure, protection, summit]
3          [air, nz, staff, aust, strike, pay, rise]
4  [air, nz, strike, affect, australian, travellers]




In [58]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [59]:
lemmatizer = WordNetLemmatizer() # create the lemmatizer once instead of once per word
text['headline_text'] = text['headline_text'].apply(lambda x: [lemmatizer.lemmatize(word, pos='v') for word in x])
print(text.head(5))


                                       headline_text
0       [aba, decide, community, broadcast, licence]
1      [act, fire, witness, must, aware, defamation]
2      [g, call, infrastructure, protection, summit]
3          [air, nz, staff, aust, strike, pay, rise]
4  [air, nz, strike, affect, australian, travellers]




In [60]:
tokenized_doc = text['headline_text'].apply(lambda x: [word for word in x if len(word) > 3])
print(tokenized_doc[:5])




0       [decide, community, broadcast, licence]
1      [fire, witness, must, aware, defamation]
2    [call, infrastructure, protection, summit]
3                   [staff, aust, strike, rise]
4      [strike, affect, australian, travellers]
Name: headline_text, dtype: object


In [61]:
# Detokenize (join the tokens back into a single string)
detokenized_doc = []
for i in range(len(text)):
    t = ' '.join(tokenized_doc[i])
    detokenized_doc.append(t)

# Store the result back in text['headline_text']
text['headline_text'] = detokenized_doc


In [62]:
text['headline_text'][:5]




0       decide community broadcast licence
1       fire witness must aware defamation
2    call infrastructure protection summit
3                   staff aust strike rise
4      strike affect australian travellers
Name: headline_text, dtype: object

In [63]:
# Keep the top 1,000 terms
vectorizer = TfidfVectorizer(stop_words='english', max_features= 1000)
X = vectorizer.fit_transform(text['headline_text'])

# Check the shape of the TF-IDF matrix
print('Shape of the TF-IDF matrix:',X.shape)


Shape of the TF-IDF matrix: (63765, 1000)


In [64]:
lda_model = LatentDirichletAllocation(n_components=10,learning_method='online',random_state=777,max_iter=1)




In [65]:
lda_top = lda_model.fit_transform(X)




In [66]:
print(lda_model.components_)
print(lda_model.components_.shape)


[[1.00003318e-01 1.00001639e-01 1.00016888e-01 ... 1.00006345e-01
  1.00004810e-01 1.00003815e-01]
 [1.00007590e-01 1.00014600e-01 1.00001961e-01 ... 1.00003155e-01
  1.00005718e-01 1.00008361e-01]
 [1.00006739e-01 1.00004432e-01 1.00053101e-01 ... 5.52441343e+01
  1.74802587e+02 1.00009464e-01]
 ...
 [1.00773298e-01 4.43789608e+01 4.25930303e+01 ... 1.00003633e-01
  1.00005146e-01 1.00004055e-01]
 [1.00005143e-01 1.00003436e-01 1.00020368e-01 ... 1.00004344e-01
  1.00006233e-01 1.00006718e-01]
 [6.34125389e+01 1.00002171e-01 1.00013766e-01 ... 1.00007038e-01
  1.00005309e-01 1.00021957e-01]]
(10, 1000)




In [68]:
# Vocabulary: the 1,000 terms kept by the vectorizer.
terms = vectorizer.get_feature_names_out()

def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx+1), [(feature_names[i], topic[i].round(2)) for i in topic.argsort()[:-n - 1:-1]])

get_topics(lda_model.components_,terms)


Topic 1: [('crash', 326.67), ('minister', 318.94), ('open', 277.44), ('health', 274.98), ('case', 263.41)]
Topic 2: [('world', 385.75), ('help', 305.51), ('work', 302.34), ('bomb', 280.7), ('year', 241.52)]
Topic 3: [('claim', 430.44), ('sydney', 357.22), ('rise', 259.88), ('centre', 248.72), ('talk', 246.21)]
Topic 4: [('police', 963.33), ('report', 463.41), ('probe', 445.52), ('seek', 402.01), ('fund', 396.46)]
Topic 5: [('face', 555.85), ('court', 535.06), ('warn', 524.57), ('public', 275.76), ('urge', 263.01)]
Topic 6: [('plan', 781.83), ('boost', 370.63), ('consider', 366.59), ('attack', 355.83), ('home', 335.32)]
Topic 7: [('charge', 563.96), ('govt', 531.29), ('group', 336.59), ('death', 331.44), ('murder', 316.57)]
Topic 8: [('hospital', 346.56), ('make', 276.89), ('lead', 275.67), ('strike', 270.05), ('india', 253.4)]
Topic 9: [('council', 517.35), ('search', 243.2), ('defend', 230.37), ('bush', 226.75), ('welcome', 220.34)]
Topic 10: [('kill', 639.09), ('miss', 386.66), ('eng
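The values in `lda_model.components_` are not probabilities. scikit-learn describes them as pseudo-counts, which is why the top terms carry weights in the hundreds. The many entries near 0.1 appear to be the prior alone, assuming the default `topic_word_prior` of `1 / n_components` applies here. Dividing each row by its sum turns it into a per-topic word distribution. A plain-Python sketch on toy numbers follows; `normalize_rows` and `toy_components` are illustrative and not taken from the fitted model.

```python
def normalize_rows(components):
    """Divide each row by its sum so that every topic row sums to 1."""
    normalized = []
    for row in components:
        total = sum(row)
        normalized.append([v / total for v in row])
    return normalized

# Two toy topics over a four-term vocabulary (pseudo-count scale)
toy_components = [[0.1, 44.4, 42.6, 0.1],
                  [63.4, 0.1, 0.1, 0.1]]
topic_word = normalize_rows(toy_components)
for row in topic_word:
    print([round(v, 3) for v in row])
```

Normalizing does not change the ranking of terms within a topic, so the top terms printed by `get_topics` stay the same.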



Keyword extraction with BERT: KeyBERT

In [69]:
!pip install sentence_transformers


Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0 (from sentence_transformers)
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting sentencepiece (from sentence_transformers)
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0 (from sentence_transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)

In [70]:
import numpy as np
import itertools

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer




In [72]:
doc = """
         Supervised learning is the machine learning task of
         learning a function that maps an input to an output based
         on example input-output pairs.[1] It infers a function
         from labeled training data consisting of a set of
         training examples.[2] In supervised learning, each
         example is a pair consisting of an input object
         (typically a vector) and a desired output value (also
         called the supervisory signal). A supervised learning
         algorithm analyzes the training data and produces an
         inferred function, which can be used for mapping new
         examples. An optimal scenario will allow for the algorithm
         to correctly determine the class labels for unseen
         instances. This requires the learning algorithm to
         generalize from the training data to unseen situations
         in a 'reasonable' way (see inductive bias).
      """




In [73]:
# Extract three-word phrases (trigrams) as keyword candidates
n_gram_range = (3, 3)
stop_words = "english"

count = CountVectorizer(ngram_range=n_gram_range, stop_words=stop_words).fit([doc])
candidates = count.get_feature_names_out()

print('Number of trigrams:',len(candidates))
print('First five trigrams:',candidates[:5])


Number of trigrams: 72
First five trigrams: ['algorithm analyzes training' 'algorithm correctly determine'
 'algorithm generalize training' 'allow algorithm correctly'
 'analyzes training data']


In [75]:
model = SentenceTransformer('distilbert-base-nli-mean-tokens')
doc_embedding = model.encode([doc])
candidate_embeddings = model.encode(candidates)



In [76]:
top_n = 5
distances = cosine_similarity(doc_embedding, candidate_embeddings)
keywords = [candidates[index] for index in distances.argsort()[0][-top_n:]]
print(keywords)


['algorithm analyzes training', 'learning algorithm generalize', 'learning machine learning', 'learning algorithm analyzes', 'algorithm generalize training']
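Because `argsort()` sorts in ascending order, the list above runs from the least to the most similar of the five best candidates, so the best match comes last. The ranking itself is just cosine similarity between the document vector and each candidate vector. The sketch below reproduces the idea in plain Python with toy 3-dimensional vectors standing in for the sentence embeddings; `cosine`, `top_n_keywords`, and all vector values are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_n_keywords(doc_vec, cand_vecs, names, top_n):
    """Return the top_n candidate names, most similar to the document first."""
    scored = sorted(zip(names, cand_vecs),
                    key=lambda pair: cosine(doc_vec, pair[1]),
                    reverse=True)
    return [name for name, _ in scored[:top_n]]

# Toy vectors, not real embeddings
doc_vec = [1.0, 1.0, 0.0]
cand_vecs = [[1.0, 0.9, 0.0], [0.0, 0.1, 1.0], [1.0, 0.0, 0.0]]
names = ["supervised learning algorithm", "unseen class labels", "training data set"]
best = top_n_keywords(doc_vec, cand_vecs, names, top_n=2)
print(best)
```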




In [77]:
def max_sum_sim(doc_embedding, candidate_embeddings, words, top_n, nr_candidates):
    # Similarity between the document and each candidate phrase
    distances = cosine_similarity(doc_embedding, candidate_embeddings)

    # Pairwise similarity between candidate phrases
    distances_candidates = cosine_similarity(candidate_embeddings,
                                            candidate_embeddings)

    # Keep the nr_candidates phrases most similar to the document.
    # Use the words argument rather than the global candidates variable.
    words_idx = list(distances.argsort()[0][-nr_candidates:])
    words_vals = [words[index] for index in words_idx]
    distances_candidates = distances_candidates[np.ix_(words_idx, words_idx)]

    # Among those, find the combination of top_n phrases that are least similar to each other
    min_sim = np.inf
    candidate = None
    for combination in itertools.combinations(range(len(words_idx)), top_n):
        sim = sum([distances_candidates[i][j] for i in combination for j in combination if i != j])
        if sim < min_sim:
            candidate = combination
            min_sim = sim

    return [words_vals[idx] for idx in candidate]


In [78]:
max_sum_sim(doc_embedding, candidate_embeddings, candidates, top_n=5, nr_candidates=10)




['requires learning algorithm',
 'signal supervised learning',
 'learning function maps',
 'algorithm analyzes training',
 'learning machine learning']

In [79]:
max_sum_sim(doc_embedding, candidate_embeddings, candidates, top_n=5, nr_candidates=20)




['set training examples',
 'generalize training data',
 'requires learning algorithm',
 'supervised learning algorithm',
 'learning machine learning']

In [80]:
def mmr(doc_embedding, candidate_embeddings, words, top_n, diversity):

    # Similarity between each candidate phrase and the document
    word_doc_similarity = cosine_similarity(candidate_embeddings, doc_embedding)

    # Pairwise similarity between candidate phrases
    word_similarity = cosine_similarity(candidate_embeddings)

    # Start with the index of the phrase most similar to the document.
    # e.g. if candidate 2 is the most similar, keywords_idx = [2]
    keywords_idx = [np.argmax(word_doc_similarity)]

    # Indices of all remaining candidates.
    # e.g. if candidate 2 was chosen, candidates_idx = [0, 1, 3, 4, 5, 6, 7, 8, 9, 10, ...]
    candidates_idx = [i for i in range(len(words)) if i != keywords_idx[0]]

    # The best keyword is already chosen, so repeat the step below top_n - 1 times.
    # e.g. with top_n = 5 the loop runs 4 times.
    for _ in range(top_n - 1):
        candidate_similarities = word_doc_similarity[candidates_idx, :]
        target_similarities = np.max(word_similarity[candidates_idx][:, keywords_idx], axis=1)

        # Compute the MMR score of every remaining candidate
        mmr_scores = (1-diversity) * candidate_similarities - diversity * target_similarities.reshape(-1, 1)
        mmr_idx = candidates_idx[np.argmax(mmr_scores)]

        # Update the selected keywords and the remaining candidates
        keywords_idx.append(mmr_idx)
        candidates_idx.remove(mmr_idx)

    return [words[idx] for idx in keywords_idx]


In [81]:
mmr(doc_embedding, candidate_embeddings, candidates, top_n=5, diversity=0.2)




['algorithm generalize training',
 'supervised learning algorithm',
 'learning machine learning',
 'learning algorithm analyzes',
 'learning algorithm generalize']

In [82]:
mmr(doc_embedding, candidate_embeddings, candidates, top_n=5, diversity=0.7)




['algorithm generalize training',
 'labels unseen instances',
 'new examples optimal',
 'determine class labels',
 'supervised learning algorithm']
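The effect of `diversity` is easier to see on numbers small enough to check by hand. The sketch below applies the same scoring rule as `mmr` above, (1 - diversity) * similarity to the document minus diversity * the highest similarity to an already selected keyword, to three toy candidates in plain Python. `mmr_select` and the similarity values are assumptions for illustration. Candidates 0 and 1 are near-duplicates; candidate 2 is less relevant but different.

```python
def mmr_select(doc_sim, cand_sim, top_n, diversity):
    """Return the indices chosen by Maximal Marginal Relevance, in selection order."""
    # Start from the candidate most similar to the document
    selected = [max(range(len(doc_sim)), key=lambda i: doc_sim[i])]
    remaining = [i for i in range(len(doc_sim)) if i != selected[0]]
    while len(selected) < top_n and remaining:
        def score(i):
            # Redundancy: highest similarity to any keyword already selected
            redundancy = max(cand_sim[i][j] for j in selected)
            return (1 - diversity) * doc_sim[i] - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

doc_sim = [0.9, 0.85, 0.3]        # similarity of each candidate to the document
cand_sim = [[1.0, 0.95, 0.1],     # pairwise similarity between candidates
            [0.95, 1.0, 0.2],
            [0.1, 0.2, 1.0]]

low = mmr_select(doc_sim, cand_sim, top_n=2, diversity=0.2)
high = mmr_select(doc_sim, cand_sim, top_n=2, diversity=0.7)
print(low, high)
```

With `diversity=0.2` the second pick is the near-duplicate (score 0.49 against 0.22). With `diversity=0.7` the redundancy penalty dominates and the dissimilar candidate wins (0.02 against -0.41), which mirrors the shift between the two outputs above.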

BERT-based Combined Topic Models (CTM)

In [86]:
pip install contextualized-topic-models==2.2.0


Collecting contextualized-topic-models==2.2.0
  Downloading contextualized_topic_models-2.2.0-py2.py3-none-any.whl (33 kB)
Collecting ipywidgets==7.5.1 (from contextualized-topic-models==2.2.0)
  Downloading ipywidgets-7.5.1-py2.py3-none-any.whl (121 kB)
Collecting ipython==7.16.1 (from contextualized-topic-models==2.2.0)
  Downloading ipython-7.16.1-py3-none-any.whl (785 kB)
Collecting jedi>=0.10 (from ipython==7.16.1->contextualized-topic-models==2.2.0)
  Downloading jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Collecting widgetsnbextension~=3.5.0 (from ipywidgets==7.5.1->contextualized-topic-models==2.2.0)
  Downloading widg

In [87]:
# Download the data
!wget https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt


--2023-11-06 17:02:04--  https://raw.githubusercontent.com/vinid/data/master/dbpedia_sample_abstract_20k_unprep.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6208417 (5.9M) [text/plain]
Saving to: ‘dbpedia_sample_abstract_20k_unprep.txt’


2023-11-06 17:02:04 (58.5 MB/s) - ‘dbpedia_sample_abstract_20k_unprep.txt’ saved [6208417/6208417]



In [88]:
!head -n 3 dbpedia_sample_abstract_20k_unprep.txt


The Mid-Peninsula Highway is a proposed freeway across the Niagara Peninsula in the Canadian province of Ontario. Although plans for a highway connecting Hamilton to Fort Erie south of the Niagara Escarpment have surfaced for decades,it was not until The Niagara Frontier International Gateway Study was published by the Ministry
Monte Zucker (died March 15, 2007) was an American photographer. He specialized in wedding photography, entering it as a profession in 1947. In the 1970s he operated a studio in Silver Spring, Maryland. Later he lived in Florida. He was Brides Magazine's Wedding Photographer of the Year for 1990 and
Henry Howard, 13th Earl of Suffolk, 6th Earl of Berkshire (8 August 1779 – 10 August 1779) was a British peer, the son of Henry Howard, 12th Earl of Suffolk. His father died on 7 March 1779, leaving behind his pregnant widow. The Earldom of Suffolk became dormant until she


In [89]:
text_file = "dbpedia_sample_abstract_20k_unprep.txt"


In [90]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing
import nltk


In [94]:
nltk.download('stopwords')

# Read one document per line; the context manager closes the file handle
with open(text_file, encoding="utf-8") as f:
    documents = [line.strip() for line in f]
sp = WhiteSpacePreprocessing(documents, stopwords_language='english')
preprocessed_documents, unpreprocessed_corpus, vocab = sp.preprocess()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


AttributeError: ignored
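
The cell above stops with a truncated AttributeError, so the result of the preprocessing step never appears in the notebook. As a rough stand-in, the sketch below shows the kind of work a whitespace preprocessor typically does: lowercase the text, strip punctuation, drop stopwords, keep only the most frequent words, and discard documents that end up empty. It does not call or reproduce the library's WhiteSpacePreprocessing. The names simple_whitespace_preprocess, TOY_STOPWORDS and vocabulary_size are hypothetical, and the tiny stopword set replaces the NLTK list used in the notebook.

```python
import string
from collections import Counter

# Hypothetical toy stopword list; the notebook relies on NLTK's English list.
TOY_STOPWORDS = {"a", "an", "the", "is", "was", "of", "in", "and", "to", "he", "it"}


def simple_whitespace_preprocess(documents, stopwords=TOY_STOPWORDS,
                                 vocabulary_size=2000):
    """Return (preprocessed_docs, retained_raw_docs, vocab) for a list of strings."""
    # Map every punctuation character to a space before splitting on whitespace.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokenized = []
    for doc in documents:
        tokens = doc.lower().translate(table).split()
        tokenized.append([t for t in tokens
                          if t not in stopwords and not t.isdigit()])

    # Keep only the `vocabulary_size` most frequent words across the corpus.
    counts = Counter(t for tokens in tokenized for t in tokens)
    vocab = {word for word, _ in counts.most_common(vocabulary_size)}

    preprocessed, retained = [], []
    for raw, tokens in zip(documents, tokenized):
        kept = [t for t in tokens if t in vocab]
        if kept:  # drop documents that became empty
            preprocessed.append(" ".join(kept))
            retained.append(raw)
    return preprocessed, retained, sorted(vocab)


sample = [
    "Monte Zucker was an American photographer.",
    "The Mid-Peninsula Highway is a proposed freeway in Ontario.",
    "The, and... of!",
]
pre, raw, toy_vocab = simple_whitespace_preprocess(sample)
print(pre)
print(len(raw), len(toy_vocab))
```

The third sample consists only of stopwords, so it is dropped. Returning the retained raw documents next to the cleaned ones keeps the two lists aligned, which matters when the cleaned text feeds a bag-of-words model and the raw text feeds a sentence encoder.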

BERTopic

In [95]:
!pip install bertopic[visualization]


Collecting bertopic[visualization]
  Downloading bertopic-0.15.0-py2.py3-none-any.whl (143 kB)
Collecting hdbscan>=0.8.29 (from bertopic[visualization])
  Downloading hdbscan-0.8.33.tar.gz (5.2 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting umap-learn>=0.5.0 (from bertopic[visualization])
  Downloading umap-learn-0.5.4.tar.gz (90 kB)
  Preparing metadata (setup.py) ... done
Collecting cython<3,>=0.27 (from hdbscan>=0.8.29->bertopic[visualization])
  Using cached Cython-0

In [96]:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups


In [97]:
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']
docs[:5]


["\n\nI am sure some bashers of Pens fans are pretty confused about the lack\nof any kind of posts about the recent Pens massacre of the Devils. Actually,\nI am  bit puzzled too and a bit relieved. However, I am going to put an end\nto non-PIttsburghers' relief with a bit of praise for the Pens. Man, they\nare killing those Devils worse than I thought. Jagr just showed you why\nhe is much better than his regular season stats. He is also a lot\nfo fun to watch in the playoffs. Bowman should let JAgr have a lot of\nfun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final\nregular season game.          PENS RULE!!!\n\n",
 'My brother is in the market for a high-performance video card that supports\nVESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:\n\n  - Diamond Stealth Pro Local Bus\n\n  - Orchid Farenheit 1280\n\n  - ATI Graphics Ultra Pro\n\n  - Any other high-per

In [98]:
model = BERTopic()
topics, probabilities = model.fit_transform(docs)

print('Number of topic assignments (one per document):', len(topics))
print('Topic number of the first document:', topics[0])


Number of topic assignments (one per document): 18846
Topic number of the first document: 0


In [99]:
model.get_topic_info()


Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,6374,-1_to_of_the_is,"[to, of, the, is, you, and, in, for, it, that]",[Since electrical wiring questions do turn up ...
1,0,1837,0_game_team_games_he,"[game, team, games, he, players, season, hocke...",[The FLYERS team that can beat any team on any...
2,1,574,1_key_clipper_chip_encryption,"[key, clipper, chip, encryption, keys, escrow,...",[The following document summarizes the Clipper...
3,2,526,2_idjits_ites_cheek_dancing,"[idjits, ites, cheek, dancing, yep, huh, ken, ...","[\nYep.\n, ites:, \nDancing With Idjits.\n\n\n]"
4,3,477,3_card_monitor_video_vga,"[card, monitor, video, vga, drivers, screen, m...",[I have a Radius Precision Color 24x video car...
...,...,...,...,...,...
210,209,10,209_dod_denizens_doom_motorcycle,"[dod, denizens, doom, motorcycle, muck, recmot...",[This is probably a stupid question but as I a...
211,210,10,210_weight_fat_muscle_chromium,"[weight, fat, muscle, chromium, pills, eat, di...",[Gordon Banks:\n\nThis certainly describes my ...
212,211,10,211_mission_sail_solar_orbit,"[mission, sail, solar, orbit, propulsion, plut...",[\nIf you've got a good propulsion system that...
213,212,10,212_moscow_aviation_russian_kaliningrad,"[moscow, aviation, russian, kaliningrad, insti...","[\nCorrection, and some more info: The Kalinin..."


In [100]:
model.get_topic_info()['Count'].sum()


18846
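
The sum matches the number of documents, so each document receives exactly one topic number, including the 6374 documents placed in topic -1, which BERTopic reserves for outliers. The sketch below checks the same bookkeeping on a hypothetical toy assignment list (toy_topics is made up for illustration) and then computes the outlier share from the two counts shown in the notebook output.

```python
from collections import Counter

# Hypothetical toy list standing in for the `topics` returned by fit_transform.
toy_topics = [-1, 0, 0, 1, -1, 2, 0, -1]

counts = Counter(toy_topics)
# One topic per document, so the per-topic counts add up to the corpus size.
assert sum(counts.values()) == len(toy_topics)

toy_outlier_share = counts[-1] / len(toy_topics)
print(counts.most_common())
print(f"toy outlier share: {toy_outlier_share:.3f}")

# Same ratio using the counts printed by get_topic_info() in the notebook.
notebook_outlier_share = 6374 / 18846
print(f"notebook outlier share: {notebook_outlier_share:.3f}")
```

Roughly a third of the corpus lands in the outlier topic in this run, which is worth keeping in mind when reading the topic sizes.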

In [101]:
model.get_topic(5)


[('fbi', 0.01562680938843602),
 ('batf', 0.013057121824315093),
 ('koresh', 0.01174963047213605),
 ('fire', 0.011299384921070226),
 ('compound', 0.009276368338129419),
 ('they', 0.008562092329417455),
 ('gas', 0.008241257334233964),
 ('were', 0.0074046400715348445),
 ('was', 0.00717689477734736),
 ('adl', 0.006866080959303942)]
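
The scores next to each word are class-based TF-IDF (c-TF-IDF) weights. The sketch below is a minimal pure-Python illustration, assuming the weighting described in the BERTopic documentation: the L1-normalized frequency of a word inside a topic, multiplied by log(1 + A / f), where A is the average number of words per topic and f is the word's frequency over all topics. The function class_tfidf and the toy corpus are hypothetical, and this is not the library's implementation.

```python
import math
from collections import Counter


def class_tfidf(docs_per_topic):
    """docs_per_topic maps topic id -> list of token lists.

    Returns a dict mapping topic id -> {word: weight}.
    """
    # Merge all documents of a topic into one bag of words per topic.
    class_counts = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in docs_per_topic.items()
    }
    # f: frequency of each word across all topics.
    total_per_word = Counter()
    for bag in class_counts.values():
        total_per_word.update(bag)
    # A: average number of words per topic.
    avg_words_per_class = sum(total_per_word.values()) / len(class_counts)

    weights = {}
    for topic, bag in class_counts.items():
        n_words = sum(bag.values())
        weights[topic] = {
            word: (cnt / n_words)
            * math.log(1 + avg_words_per_class / total_per_word[word])
            for word, cnt in bag.items()
        }
    return weights


toy = {
    0: [["fbi", "fire", "they"], ["fbi", "compound", "they"]],
    1: [["game", "team", "they"], ["game", "season", "they"]],
}
w = class_tfidf(toy)
for word, score in sorted(w[0].items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(score, 3))
```

In the toy corpus "they" is as frequent as "fbi" inside topic 0 but also appears in topic 1, so it receives a lower weight. The weight stays above zero, which is consistent with common words such as 'they', 'were' and 'was' still showing up in the list for topic 5 above.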

In [102]:
model.visualize_topics()


In [103]:
model.visualize_heatmap()




In [104]:
model = BERTopic(nr_topics=20)
topics, probabilities = model.fit_transform(docs)

model.visualize_topics()




In [105]:
model = BERTopic(nr_topics="auto")
topics, probabilities = model.fit_transform(docs)

model.get_topic_info()




Unnamed: 0,Topic,Count,Name,Representation,Representative_Docs
0,-1,6533,-1_the_to_of_and,"[the, to, of, and, is, in, for, that, it, you]",[Frequently-asked questions about the OPEN LOO...
1,0,6656,0_the_of_to_and,"[the, of, to, and, is, in, that, it, for, you]","[\n\nSo far, you have presented your opinions ..."
2,1,1840,1_game_team_he_the,"[game, team, he, the, games, in, was, players,...",[\nI agree and disagree. John is saying that ...
3,2,648,2_car_bike_the_it,"[car, bike, the, it, and, cars, to, you, my, on]","[\n\n*nnnnnnnng* Thank you for playing, I cann..."
4,3,452,3_fbi_batf_they_koresh,"[fbi, batf, they, koresh, the, fire, was, were...",[This was posted by Lyn Bates to the firearms-...
...,...,...,...,...,...
80,79,12,79_corn_seizure_cereals_seizures,"[corn, seizure, cereals, seizures, she, sugar,...",[\n Path: news.larc.nasa.gov!darwin.sura.net...
81,80,12,80_plutonium_nuclear_clancy_weapons,"[plutonium, nuclear, clancy, weapons, bomb, re...","[\tHate to mess up your point, but it is incre..."
82,81,12,81_42_tea_question_number,"[42, tea, question, number, answer, universe, ...","[: Well,\n: \n: 42 is 101010 binary, and who w..."
83,82,11,82_surgery_hospital_doctors_famous,"[surgery, hospital, doctors, famous, medical, ...","[Hello,\n\n Just one quick question:\n ..."


In [106]:
new_doc = docs[0]
print(new_doc)




I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killing those Devils worse than I thought. Jagr just showed you why
he is much better than his regular season stats. He is also a lot
fo fun to watch in the playoffs. Bowman should let JAgr have a lot of
fun in the next couple of games since the Pens are going to beat the pulp out of Jersey anyway. I was very disappointed not to see the Islanders lose the final
regular season game.          PENS RULE!!!






In [107]:
topics, probs = model.transform([new_doc])
print('Predicted topic number:', topics)


Predicted topic number: [1]


In [108]:
model.save("my_topics_model")
BerTopic_model = BERTopic.load("my_topics_model")


