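# Topic Modeling
Let's run topic modeling on the same data we used earlier to draw the word cloud.<br>
For the concepts behind topic modeling, this section follows the explanation on [ratsgo's blog](https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/06/01/LDA/).<br>
<br>
Topic modeling is a machine learning algorithm: the machine itself discovers abstract 'topics'<br>
that its human designer had not identified.<br>
It therefore needs a large amount of data.<br>
Since the emphasis this time is on hands-on practice, we will simply reuse the data from before.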

## Reading the Data
Unlike before, here we use the openpyxl library, which lets us work with the file as if we were opening it in Excel.

In [1]:
from openpyxl import load_workbook
from collections import defaultdict
Theme="songpa" # change to a suitable name
RawFile='NewsResult_20200229-20200530.xlsx'
wb = load_workbook(RawFile)
ws = wb.active

  warn("Workbook contains no default style, apply openpyxl's default")


If you open the Excel file, the *keywords* are in column *O*.<br>

In [None]:
# Read the words in the keyword column (column O) as one list per row, skipping the header
words = ws['O']
words_list = [x.value.split(',') for x in words[1:]]
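
The list comprehension above assumes that every cell in column O holds a string. As a hedged sketch, a small helper could tolerate empty cells (whose `value` is `None` in openpyxl) and stray whitespace around keywords. The name `split_keywords` and the sample values are hypothetical and only serve as illustration:

```python
def split_keywords(cell_value):
    """Split a comma-separated keyword string into a list of clean tokens.

    Hypothetical helper: returns an empty list for empty cells (None)
    and drops tokens that are empty after stripping whitespace.
    """
    if cell_value is None:
        return []
    return [token.strip() for token in str(cell_value).split(',') if token.strip()]


# Toy values standing in for the cells of column O (not the real data)
sample_cells = ["아파트,서울, 강남", None, "후보,,민주당"]
sample_words_list = [split_keywords(value) for value in sample_cells]
print(sample_words_list)
```

With the worksheet from above, this could be used as `[split_keywords(x.value) for x in words[1:]]`. Empty cells then become empty documents, so the document count stays the same.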

Preprocessing is kept simple: we only remove words that appear just once in the whole text (the BIGKinds data is, in effect, already preprocessed).

In [4]:
# Remove words that appear only once in the whole corpus
frequency = defaultdict(int)
for text in words_list:
    for token in text:
        frequency[token] += 1
words_list = [[token for token in text if frequency[token] > 1] for text in words_list]

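## Building the Topic Model
[gensim](https://radimrehurek.com/gensim/) is a very useful library that provides a range of<br>
machine learning algorithms for natural language processing, including topic modeling and word embeddings.<br>
We will use it to build an LDA model.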

In [5]:
import logging
logging.basicConfig(format = '%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gensim import corpora
from gensim.models import LdaModel

### Setting the Parameters
In topic modeling, the number of topics is a hyperparameter chosen by the analyst.<br>
How many topics suit this corpus best?<br>
There are ways to estimate this, the best known being the **Perplexity** and **Coherence** scores ([see the Coredot blog](https://coredottoday.github.io/2018/09/17/%EB%AA%A8%EB%8D%B8-%ED%8C%8C%EB%9D%BC%EB%AF%B8%ED%84%B0-%ED%8A%9C%EB%8B%9D/)).<br>
For now, we will just pick an arbitrary number.
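
To give a feel for what a coherence score measures, here is a minimal pure-Python sketch of a UMass-style coherence: it rewards topics whose top words tend to appear in the same documents. The function name `umass_style_coherence` and the toy documents are assumptions made for illustration. This is not gensim's implementation, and its scores are not comparable with those from a library.

```python
import math
from itertools import combinations


def umass_style_coherence(top_words, documents):
    """Sum of log((D(earlier, later) + 1) / D(earlier)) over word pairs.

    D(w) is the number of documents containing w, and D(w1, w2) the number
    containing both. `top_words` is ordered from most to least probable.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_count(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        denominator = doc_count(earlier)
        if denominator == 0:
            continue  # skip words that never occur in the documents
        score += math.log((doc_count(earlier, later) + 1) / denominator)
    return score


# Toy documents (illustrative only, not the news data)
toy_docs = [
    ["apartment", "price", "seoul"],
    ["apartment", "price"],
    ["election", "candidate"],
    ["election", "candidate", "seoul"],
]

coherent = umass_style_coherence(["apartment", "price"], toy_docs)
mixed = umass_style_coherence(["apartment", "candidate"], toy_docs)
print(coherent, mixed)  # the first score is higher than the second
```

In practice one would fit models for several values of K and compare their scores, treating a higher coherence as a hint that the topics are easier to interpret.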

In [6]:
K = 5 # arbitrary number of topics for now
iterations = 1000 # number of iterations
random_state = 1 # random seed, fixed for reproducibility
fname="lda_"+Theme+"K"+str(K)+"R"+str(random_state)+"I"+str(iterations) # arbitrary model name

Topic modeling starts by building a dictionary that maps each unique token to an integer id.

In [7]:
dictionary = corpora.Dictionary(words_list)
dictionary.save(fname+'_dictionary.pkl')

2020-05-30 03:47:36,770 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2020-05-30 03:47:37,379 : INFO : built Dictionary(19006 unique tokens: ['0.02', '0.02%', '0.06%', '0.10%', '18억']...) from 1916 documents (total 443957 corpus positions)
2020-05-30 03:47:37,381 : INFO : saving Dictionary object under lda_songpaK5R1I1000_dictionary.pkl, separately None
2020-05-30 03:47:37,397 : INFO : saved lda_songpaK5R1I1000_dictionary.pkl


Next, we convert the corpus into a **Bag of Words (BOW) model** representation suitable for training.

In [8]:
corpus = [dictionary.doc2bow(text) for text in words_list]
corpora.MmCorpus.serialize(fname+'_corpus.mm', corpus)

2020-05-30 03:49:21,575 : INFO : storing corpus in Matrix Market format to lda_songpaK5R1I1000_corpus.mm
2020-05-30 03:49:21,576 : INFO : saving sparse matrix to lda_songpaK5R1I1000_corpus.mm
2020-05-30 03:49:21,578 : INFO : PROGRESS: saving document #0
2020-05-30 03:49:21,787 : INFO : PROGRESS: saving document #1000
2020-05-30 03:49:21,958 : INFO : saved 1916x19006 matrix, density=0.717% (261116/36415496)
2020-05-30 03:49:21,960 : INFO : saving MmCorpus index to lda_songpaK5R1I1000_corpus.mm.index
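
As a rough illustration of the bag-of-words format, the sketch below mimics the idea in plain Python: each document becomes a list of `(token_id, count)` pairs, and word order is discarded. The helper `toy_doc2bow` and the toy vocabulary are hypothetical; gensim's `Dictionary.doc2bow` assigns its own ids. The last lines recompute the density reported in the log above from its own numbers.

```python
from collections import Counter


def toy_doc2bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for the tokens found in token2id."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hypothetical vocabulary and document, for illustration only
token2id = {"아파트": 0, "서울": 1, "후보": 2}
document = ["서울", "아파트", "서울", "강남"]  # "강남" is not in the vocabulary
print(toy_doc2bow(document, token2id))  # [(0, 1), (1, 2)]

# Density from the log: non-zero entries / (documents x unique tokens)
n_docs, n_tokens, n_nonzero = 1916, 19006, 261116
density = n_nonzero / (n_docs * n_tokens)
print(f"{density:.3%}")  # 0.717%
```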


Then we build the actual model.

In [9]:
lda = LdaModel(corpus, 
               num_topics=K, 
               id2word=dictionary,
               random_state=random_state, 
               iterations=iterations)
lda.save(fname)

2020-05-30 03:49:40,216 : INFO : using symmetric alpha at 0.2
2020-05-30 03:49:40,218 : INFO : using symmetric eta at 0.2
2020-05-30 03:49:40,229 : INFO : using serial LDA version on this node
2020-05-30 03:49:40,246 : INFO : running online (single-pass) LDA training, 5 topics, 1 passes over the supplied corpus of 1916 documents, updating model once every 1916 documents, evaluating perplexity every 1916 documents, iterating 1000x with a convergence threshold of 0.001000
2020-05-30 03:49:57,457 : INFO : -10.490 per-word bound, 1438.6 perplexity estimate based on a held-out corpus of 1916 documents with 443957 words
2020-05-30 03:49:57,459 : INFO : PROGRESS: pass 0, at document #1916/1916
2020-05-30 03:50:14,052 : INFO : topic #0 (0.200): 0.021*"아파트" + 0.014*"서울" + 0.010*"강남" + 0.008*"가격" + 0.007*"거래" + 0.006*"하락" + 0.006*"부동산" + 0.005*"주택" + 0.005*"지역" + 0.005*"단지"
2020-05-30 03:50:14,053 : INFO : topic #1 (0.200): 0.018*"후보" + 0.014*"지역" + 0.013*"서울" + 0.009*"강남" + 0.009*"민주당" + 0.007*
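
Two numbers in the training log can be checked by hand. With the default symmetric prior, gensim sets alpha to 1/K, which gives 0.2 for K = 5, and the perplexity estimate it prints is 2 raised to the negative per-word bound. The sketch below redoes that arithmetic with the values from the log. The variable names are chosen for illustration, and the small gap from the logged 1438.6 comes from the bound being rounded to three decimals.

```python
K = 5                     # number of topics used above
per_word_bound = -10.490  # value printed in the training log

alpha = 1.0 / K
perplexity = 2 ** (-per_word_bound)

print(alpha)                 # 0.2
print(round(perplexity, 1))  # close to the logged 1438.6
```

A lower perplexity indicates that the model assigns higher probability to the evaluated documents.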

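### Viewing and Saving the Results
In a good topic model, the topics share few overlapping words<br>
and each topic shows a clearly distinct character.<br>
Are you satisfied with the results?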

In [10]:
lda.print_topics(-1,20)

2020-05-30 03:56:51,024 : INFO : topic #0 (0.200): 0.021*"아파트" + 0.014*"서울" + 0.010*"강남" + 0.008*"가격" + 0.007*"거래" + 0.006*"하락" + 0.006*"부동산" + 0.005*"주택" + 0.005*"지역" + 0.005*"단지" + 0.005*"송파" + 0.005*"코로나19" + 0.005*"시장" + 0.004*"정부" + 0.004*"후보" + 0.004*"송파구" + 0.004*"분양" + 0.004*"기준" + 0.003*"규제" + 0.003*"조사"
2020-05-30 03:56:51,026 : INFO : topic #1 (0.200): 0.018*"후보" + 0.014*"지역" + 0.013*"서울" + 0.009*"강남" + 0.009*"민주당" + 0.007*"아파트" + 0.006*"의원" + 0.006*"상승" + 0.005*"송파" + 0.005*"총선" + 0.005*"정부" + 0.005*"선거" + 0.005*"통합" + 0.005*"경기" + 0.005*"가격" + 0.004*"대표" + 0.004*"통합당" + 0.004*"코로나19" + 0.004*"하락" + 0.003*"주택"
2020-05-30 03:56:51,028 : INFO : topic #2 (0.200): 0.016*"서울" + 0.008*"아파트" + 0.006*"송파" + 0.006*"통합" + 0.006*"송파구" + 0.006*"지역" + 0.005*"대표" + 0.005*"위원장" + 0.005*"후보" + 0.005*"코로나19" + 0.004*"선거" + 0.004*"부동산" + 0.004*"경기" + 0.004*"강남" + 0.004*"주택" + 0.004*"총선" + 0.004*"지원" + 0.003*"서비스" + 0.003*"의원" + 0.003*"민주"
2020-05-30 03:56:51,030 : INFO : topic #3 (0.200): 0.

[(0,
  '0.021*"아파트" + 0.014*"서울" + 0.010*"강남" + 0.008*"가격" + 0.007*"거래" + 0.006*"하락" + 0.006*"부동산" + 0.005*"주택" + 0.005*"지역" + 0.005*"단지" + 0.005*"송파" + 0.005*"코로나19" + 0.005*"시장" + 0.004*"정부" + 0.004*"후보" + 0.004*"송파구" + 0.004*"분양" + 0.004*"기준" + 0.003*"규제" + 0.003*"조사"'),
 (1,
  '0.018*"후보" + 0.014*"지역" + 0.013*"서울" + 0.009*"강남" + 0.009*"민주당" + 0.007*"아파트" + 0.006*"의원" + 0.006*"상승" + 0.005*"송파" + 0.005*"총선" + 0.005*"정부" + 0.005*"선거" + 0.005*"통합" + 0.005*"경기" + 0.005*"가격" + 0.004*"대표" + 0.004*"통합당" + 0.004*"코로나19" + 0.004*"하락" + 0.003*"주택"'),
 (2,
  '0.016*"서울" + 0.008*"아파트" + 0.006*"송파" + 0.006*"통합" + 0.006*"송파구" + 0.006*"지역" + 0.005*"대표" + 0.005*"위원장" + 0.005*"후보" + 0.005*"코로나19" + 0.004*"선거" + 0.004*"부동산" + 0.004*"경기" + 0.004*"강남" + 0.004*"주택" + 0.004*"총선" + 0.004*"지원" + 0.003*"서비스" + 0.003*"의원" + 0.003*"민주"'),
 (3,
  '0.016*"서울" + 0.009*"아파트" + 0.007*"송파" + 0.005*"지역" + 0.005*"코로나19" + 0.005*"상승" + 0.004*"강남" + 0.004*"가격" + 0.004*"경기" + 0.004*"단지" + 0.004*"확진자" + 0.004*"정부" + 0.00
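
One hedged way to put a number on the overlap is the Jaccard similarity between the top words of each pair of topics (shared words divided by all distinct words). The sketch below parses topic strings in the format returned by `print_topics`. The three strings are the first ten terms of topics 0 to 2 copied from the output above, and the helper names `top_words` and `jaccard` are hypothetical.

```python
import re
from itertools import combinations


def top_words(topic_string):
    """Extract the quoted words from a string like '0.021*"아파트" + ...'."""
    return re.findall(r'\*"([^"]+)"', topic_string)


def jaccard(words_a, words_b):
    """Shared words divided by all distinct words of the two lists."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b)


# First ten terms of topics 0-2, copied from the output above
topic_strings = {
    0: '0.021*"아파트" + 0.014*"서울" + 0.010*"강남" + 0.008*"가격" + 0.007*"거래" + '
       '0.006*"하락" + 0.006*"부동산" + 0.005*"주택" + 0.005*"지역" + 0.005*"단지"',
    1: '0.018*"후보" + 0.014*"지역" + 0.013*"서울" + 0.009*"강남" + 0.009*"민주당" + '
       '0.007*"아파트" + 0.006*"의원" + 0.006*"상승" + 0.005*"송파" + 0.005*"총선"',
    2: '0.016*"서울" + 0.008*"아파트" + 0.006*"송파" + 0.006*"통합" + 0.006*"송파구" + '
       '0.006*"지역" + 0.005*"대표" + 0.005*"위원장" + 0.005*"후보" + 0.005*"코로나19"',
}

# Jaccard similarity for every pair of topics
overlap = {
    (i, j): jaccard(top_words(topic_strings[i]), top_words(topic_strings[j]))
    for i, j in combinations(sorted(topic_strings), 2)
}
for pair, score in overlap.items():
    print(pair, round(score, 3))
```

In this sample, topics 1 and 2 share five of their ten top words, which may suggest that K = 5 does not separate these themes cleanly.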

#### Saving the Results to a Text File

In [11]:
topics = lda.print_topics(-1,20)
feat_fname = fname+'_feats.txt'
with open(feat_fname, 'w', encoding='utf-8') as text_file:  # UTF-8 so the Korean terms are written correctly on any platform
    for topic_num, features in topics:
        text_file.write("Topic={0} \n {1} \n".format(topic_num, features))


## Visualization
Visualizing the model makes the results clearer.

In [12]:
import pyLDAvis
import pyLDAvis.gensim  # newer pyLDAvis releases renamed this module to pyLDAvis.gensim_models
import matplotlib.pyplot as plt

In [13]:
vis = pyLDAvis.gensim.prepare(lda,corpus,dictionary,sort_topics=False)

2020-05-30 04:00:11,852 : INFO : NumExpr defaulting to 4 threads.


In [14]:
pyLDAvis.display(vis)

In [None]:
pyLDAvis.save_html(vis, fname+'.html')