
# Latent Dirichlet Allocation
Notes on Latent Dirichlet Allocation (LDA).

(References)
- https://ratsgo.github.io/from%20frequency%20to%20semantics/2017/06/01/LDA/
- https://wikidocs.net/30708
- 머신러닝 교과서, pp. 309-314

## Overview


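Topic modeling is the task of assigning topics to unlabeled text.
For example, consider assigning a category to each news article based on its content. (Because no labels are given, this is a clustering problem.)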

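LDA (Latent Dirichlet Allocation) is, at its core, a probabilistic generative model that finds groups of words that frequently appear together across many documents. It starts from the idea that a topic can be represented by the words that occur frequently in it.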

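### LDA via the BoW matrix
LDA decomposes the bag-of-words (BoW) matrix it receives as input into two matrices:
- a document-topic matrix
- a word-topic matrix

The two matrices are learned so that their product reconstructs the input matrix with as little error as possible.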
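
As a rough illustration of this factorization, the sketch below uses small made-up matrices. The names `doc_topic`, `topic_word`, and `bow`, and all of the numbers, are hypothetical and not taken from the review data. Each row of the document-topic matrix is a topic mixture, each row of the topic-word matrix is a word distribution, and their product gives a per-document word distribution that can be compared with the row-normalized BoW counts.

```python
import numpy as np

# Hypothetical example: 3 documents, 2 topics, 4 vocabulary words.
doc_topic = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.5, 0.5]])                   # each row sums to 1
topic_word = np.array([[0.50, 0.40, 0.05, 0.05],
                       [0.05, 0.05, 0.40, 0.50]])    # each row sums to 1

# Reconstructed word probabilities per document: shape (3, 4).
reconstructed = doc_topic @ topic_word

# Made-up BoW counts, normalized per document so they are comparable.
bow = np.array([[5, 4, 1, 0],
                [1, 0, 4, 5],
                [3, 2, 2, 3]], dtype=float)
bow_normalized = bow / bow.sum(axis=1, keepdims=True)

# Total absolute difference between the data and the reconstruction.
reconstruction_error = np.abs(bow_normalized - reconstructed).sum()
print(reconstructed.shape)
print(reconstructed.sum(axis=1))
print(round(reconstruction_error, 3))
```

Fitting LDA can be thought of as searching for the two factor matrices that make this kind of error small, although the actual objective is probabilistic rather than a plain absolute difference.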


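## LDA in Scikit-Learn

Let's start with a concrete implementation example.

The goal is to decompose a movie review dataset and group the reviews into several topics.
To do that, the dataset first has to be loaded.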

In [5]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/movie_review.csv')

In [6]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


Now let's build the BoW matrix.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(stop_words='english',
                        max_df = 0.1,
                        max_features = 5000)

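Setting max_df = 0.1 excludes words that appear in more than 10% of the documents, since words that occur too often are unlikely to be tied to any particular topic.

Setting max_features = 5000 keeps only the 5,000 most frequent of the remaining words. Limiting the dimensionality this way is intended to reduce overfitting and to help LDA inference.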
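
The effect of the two parameters can be seen on a tiny corpus. This is only a sketch: the four documents below are invented for illustration, and the thresholds (0.5 and 2) are scaled down from the 0.1 and 5000 used above so that the filtering is visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "movie" appears in every document.
toy_docs = [
    "movie great acting",
    "movie terrible plot",
    "movie great music",
    "movie boring plot",
]

# max_df=0.5 drops terms found in more than 50% of the documents.
vec_max_df = CountVectorizer(max_df=0.5)
vec_max_df.fit(toy_docs)
vocab_max_df = set(vec_max_df.vocabulary_)

# max_features=2 additionally keeps only the 2 most frequent remaining terms.
vec_limited = CountVectorizer(max_df=0.5, max_features=2)
vec_limited.fit(toy_docs)
vocab_limited = set(vec_limited.vocabulary_)

print(sorted(vocab_max_df))
print(sorted(vocab_limited))
```

Here "movie" is removed by max_df because it occurs in all four documents, and max_features then keeps "great" and "plot", the only remaining words that occur twice.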

In [8]:
X = count.fit_transform(df['review'].values)

In [9]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components = 10,
                                random_state = 999,
                                learning_method='batch')
X_topics = lda.fit_transform(X)

- With n_components = 10, the model learns 10 topics over the 5,000-word vocabulary.
- With learning_method = 'batch', LDA uses all of the training data in each iteration.
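
To make the shapes of the results concrete, here is a minimal sketch on an invented four-document corpus (the documents and the `toy_` names are hypothetical). It uses the same estimator and the same `learning_method='batch'` setting, only with 2 topics instead of 10. The exact topic proportions depend on the random seed, so only shapes and row sums are inspected.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus with two obvious themes.
toy_docs = [
    "ghost house killer scream",
    "killer ghost dead house",
    "family mother father kids",
    "kids family father home",
]
toy_X = CountVectorizer().fit_transform(toy_docs)

toy_lda = LatentDirichletAllocation(n_components=2,
                                    random_state=0,
                                    learning_method='batch')
toy_doc_topics = toy_lda.fit_transform(toy_X)

# One row per document, one column per topic; each row sums to about 1.
print(toy_doc_topics.shape)
print(toy_doc_topics.sum(axis=1))
# One row per topic, one column per vocabulary word.
print(toy_lda.components_.shape)
```

The value returned by `fit_transform` plays the role of the document-topic matrix, and `components_` plays the role of the topic-word matrix.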

In [10]:
lda.components_.shape

(10, 5000)

In [15]:
n_topic_words = 5
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
  print('Topic %d:' % (topic_idx + 1))
  # argsort is ascending, so walk backwards to get the highest-weighted words
  top_word_ids = topic.argsort()[:-n_topic_words - 1:-1]
  print(" ".join([feature_names[i] for i in top_word_ids]))

Topic 1:
music audience feel cinema beautiful
Topic 2:
family kids mother father girl
Topic 3:
action game fight fi sci
Topic 4:
book version read original novel
Topic 5:
comedy worst script awful minutes
Topic 6:
role performance plays john comedy
Topic 7:
war american men women woman
Topic 8:
series episode tv dvd episodes
Topic 9:
horror guy dead house killer
Topic 10:
worst money waste minutes wasn


Let's check whether the topics separate the reviews well.

In [21]:
# Topic 9 (column index 8) is the horror topic; sort reviews by its weight, highest first
horror_movie = X_topics[:, 8].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror_movie[:5]):
  print('horror movie #%d' % (iter_idx+1))
  print(df['review'][movie_idx][:200], ',,,')

horror movie #1
Why you ask does this man claim to have the truth behind the existence of the almighty? Well its deductive logic my friends, you see I know God exists because Satan does, how else would my poor eyes h ,,,
horror movie #2
Wow. Just wow. Never before have i seen a horror movie in which it seemed like a bad self insert fanfic that somebody wrote one day in 20 minutes. And then i happened to come upon "Lady Frankenstein". ,,,
horror movie #3
***SPOILERS*** ***SPOILERS*** Some bunch of Afrikkaner-Hillbilly types are out in the desert looking for Diamonds when they find a hard mound in the middle of a sandy desert area. Spoilers: The dumbes ,,,
horror movie #4
That is what this movie is. Good God the special effects suck in this movie. It is difficult for anything to suck more than this movie's plot, but the special effects manage to pull it off. Let me try ,,,
horror movie #5
This straight to video cheap flick is based on a true story. I don't doubt it. Doesn't mean it's par

In [20]:
trash_movie = X_topics[:, 9].argsort()[::-1]
for iter_idx, movie_idx in enumerate(trash_movie[:5]):
  print('trash movie #%d' % (iter_idx+1))
  print(df['review'][movie_idx][:200], ',,,')

trash movie #1
This movie is the biggest waste of nine dollars that I've spent in a very, very long time. If you knew how often I went to the movies you'd probably say, that's hard to imagine, but never-the-less, it ,,,
trash movie #2
I rented this movie from my local library and thought it might be good considering I like this type of movie and considering who was in it but boy was I wrong. The acting stunk, the fight scenes were  ,,,
trash movie #3
This movie is stupid and i hate it!!! i turned it off before it reached half i hate this movie. Amitabh sucks in this movie i wanna throw eggs at the person who directed this movie. This movie is stup ,,,
trash movie #4
IT IS A PIECE OF CRAP! not funny at all. during the whole movie nothing ever happens. i almost fell asleep, which in my case happens only if a movie is rally bad. (that is why it didn't get 1 (awful)  ,,,
trash movie #5
I sat down to watch this movie with my friends with very low expectations. My expectations were no where 

In [19]:
# Topic 5 (column index 4); iterate over trash_movie2, not the earlier trash_movie
trash_movie2 = X_topics[:, 4].argsort()[::-1]
for iter_idx, movie_idx in enumerate(trash_movie2[:5]):
  print('trash movie2 #%d' % (iter_idx+1))
  print(df['review'][movie_idx][:200], ',,,')

trash movie2 #1
Do all spoof films require pure stupidity and a lack of ANY sort of intelligence whatsoever to the humour? Is there even just a single genuinely FUNNY parody film anymore? All I see are zero-quality f ,,,
trash movie2 #2
I haven't actually finished the film. You may say that in this case I have no right to review it, especially so negatively. But I do, only because I stopped it on account of I couldn't watch anymore.. ,,,
trash movie2 #3
I am shocked to see that this movie has been given more than two stars by some people. They must either be kidding or be totally blind for the art of acting, directing and other flaws of the movie.<br ,,,
trash movie2 #4
This movie really shows its age. The print I saw was terrible due to age, but it is possible that there are better prints out there. However, this was not the major problem with the movie. The problem ,,,
trash movie2 #5
This movie really shows its age. The print I saw was terrible due to age, but it is possible that th

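Unfortunately topics 5 and 10 overlap (both are dominated by words from negative reviews such as "worst" and "minutes"), but overall the topics are well separated and the reviews are assigned to fitting topics. Because of the nature of the source data, however, it is hard to tell which movie a given review is about.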
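
One way to put a number on this kind of overlap is to compare the topic rows pairwise. The sketch below uses cosine similarity, which is an illustrative choice and not part of the original analysis. The helper name `topic_cosine_similarity` and the toy matrix are hypothetical.

```python
import numpy as np

def topic_cosine_similarity(components):
    """Pairwise cosine similarity between topic rows."""
    components = np.asarray(components, dtype=float)
    # Normalize each topic row to a word distribution, then to unit length.
    distributions = components / components.sum(axis=1, keepdims=True)
    unit = distributions / np.linalg.norm(distributions, axis=1, keepdims=True)
    return unit @ unit.T

# Hypothetical pseudo-counts for 3 topics over 4 words;
# topics 0 and 1 put most of their weight on the same words.
toy_components = np.array([[9.0, 8.0, 1.0, 1.0],
                           [8.0, 9.0, 1.0, 2.0],
                           [1.0, 1.0, 9.0, 8.0]])
similarity = topic_cosine_similarity(toy_components)
print(np.round(similarity, 2))
```

In the notebook, the same function could be applied to `lda.components_`; a large entry at row 4, column 9 (topics 5 and 10) would be consistent with the overlap noted above.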

## Interpreting the LDA model

A more detailed explanation of LDA follows.

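### Basic assumptions and procedure of LDA

LDA is an algorithm that reverse-engineers the process by which documents are written.
So let's first look at how a document is assumed to be generated.

- The mixture of topics used in the document is drawn from a probability distribution.
- Each word is then chosen according to that mixture: a topic is picked from the mixture, and a word is picked from that topic's word distribution.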
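
The two generative steps above can be sketched directly with NumPy. Everything concrete here is an assumption made for illustration: the four-word vocabulary, the two topic-word distributions, and the Dirichlet parameter `alpha` are invented and are not values learned from the review data.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["ghost", "killer", "family", "kids"]       # hypothetical vocabulary
# Hypothetical topic-word distributions (one row per topic).
topic_word = np.array([[0.45, 0.45, 0.05, 0.05],    # "horror"-like topic
                       [0.05, 0.05, 0.45, 0.45]])   # "family"-like topic
alpha = np.array([0.5, 0.5])   # assumed Dirichlet prior over topics

def generate_document(n_words):
    # Step 1: draw this document's topic mixture.
    theta = rng.dirichlet(alpha)
    words = []
    for _ in range(n_words):
        # Step 2a: draw a topic for this word from the mixture.
        topic = rng.choice(len(theta), p=theta)
        # Step 2b: draw the word from that topic's word distribution.
        word_idx = rng.choice(len(vocab), p=topic_word[topic])
        words.append(vocab[word_idx])
    return theta, words

theta, words = generate_document(8)
print(np.round(theta, 2))
print(" ".join(words))
```

LDA inference runs in the opposite direction: it observes only the words and tries to recover the mixtures and the topic-word distributions.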

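In other words, **LDA assumes that the words in a document are generated by combining the word distributions of the topics with the topic distribution of the document.**
Accordingly, LDA proceeds as follows.

- The user specifies the number of topics.
- Every word is assigned to one of the topics (initially at random).
- Each word w is then treated as if its assignment might be wrong and is reassigned based on the following two proportions:
  - the proportion of words in document d that are assigned to topic t
  - the proportion of topic t's assignments that belong to word w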
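
The reassignment procedure above corresponds to collapsed Gibbs sampling, and a minimal sketch of it is given below. Several things are assumptions: the function name `gibbs_lda`, the smoothing values `alpha` and `beta`, the iteration count, and the toy corpus are all invented for illustration. Note also that scikit-learn's LatentDirichletAllocation fits the model with variational Bayes rather than Gibbs sampling, so this is not the code path used in the cells above.

```python
import numpy as np

def gibbs_lda(docs, n_topics, n_vocab, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler; docs is a list of lists of word ids."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), n_topics))   # words in doc d assigned to topic t
    topic_word = np.zeros((n_topics, n_vocab))    # times word w is assigned to topic t
    topic_total = np.zeros(n_topics)              # total words assigned to topic t

    # Randomly assign every word occurrence to one topic.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = rng.integers(n_topics, size=len(doc))
        assignments.append(z_d)
        for w, t in zip(doc, z_d):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
            topic_total[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment, treating it as unknown.
                t = assignments[d][i]
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1

                # (share of topic t in doc d) * (share of word w in topic t);
                # the per-document denominator is the same for every t, so it is omitted.
                p = ((doc_topic[d] + alpha)
                     * (topic_word[:, w] + beta)
                     / (topic_total + beta * n_vocab))
                t = rng.choice(n_topics, p=p / p.sum())

                # Record the newly sampled assignment.
                assignments[d][i] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word

# Hypothetical corpus: word ids 0-1 and word ids 2-3 never co-occur.
toy_docs = [[0, 1, 0, 1], [1, 0, 0], [2, 3, 2, 3], [3, 2, 2]]
doc_topic, topic_word = gibbs_lda(toy_docs, n_topics=2, n_vocab=4)
print(doc_topic)
print(topic_word)
```

The two count arrays returned here hold exactly the quantities named in the last two bullets: how many words of each document sit in each topic, and how often each word is assigned to each topic.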