# Topic Modeling

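* Topic modeling is a technique for discovering the topics present in a collection of documents.
* It rests on the intuition that documents about a particular topic use particular words more often.
* For example, documents about dogs mention dog breeds and dog traits more often than other documents do.
* The two most widely used topic modeling methods are Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA).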

## Latent Semantic Analysis

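* Latent Semantic Analysis (LSA) is mainly used for semantic search over document indexes.
* It is also known as Latent Semantic Indexing (LSI).
* The goal of LSA is to discover the latent topics that underlie documents and words.
* It assumes that these latent topics drive the distribution of words in the documents.

* LSA method
  + The document-term matrix built from the document collection is decomposed by singular value decomposition (SVD) into three factors: a term-topic matrix, a topic-importance matrix, and a topic-document matrix.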
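
The decomposition can be illustrated on a toy matrix. The sketch below is a minimal example that assumes NumPy is available. The matrix values and the names `doc_term`, `doc_topic`, `topic_importance`, and `topic_term` are made up for illustration. Because the rows here are documents, SVD returns the document-topic factor first; the term-topic and topic-document matrices named above are the transposes of `topic_term` and `doc_topic`.

```python
import numpy as np

# Toy document-term matrix (assumed data): rows are documents, columns are terms.
terms = ["dog", "breed", "bark", "file", "program"]
doc_term = np.array([
    [2, 1, 1, 0, 0],   # two documents about dogs
    [1, 2, 0, 0, 0],
    [0, 0, 0, 2, 1],   # two documents about software
    [0, 0, 0, 1, 2],
], dtype=float)

num_topics = 2

# Thin SVD: doc_term == U @ diag(S) @ Vt, with S sorted in descending order.
U, S, Vt = np.linalg.svd(doc_term, full_matrices=False)

# Keep only the strongest `num_topics` components.
doc_topic = U[:, :num_topics]        # documents x topics
topic_importance = S[:num_topics]    # one weight (singular value) per topic
topic_term = Vt[:num_topics, :]      # topics x terms

# Low-rank approximation of the original matrix from the kept topics.
approx = doc_topic @ np.diag(topic_importance) @ topic_term

for t in range(num_topics):
    loadings = ", ".join(f"{terms[i]}={topic_term[t, i]:.2f}" for i in range(len(terms)))
    print(f"topic {t} (importance {topic_importance[t]:.2f}): {loadings}")
```

Each kept topic loads on only one group of terms here because the toy matrix is block-structured; real corpora give much less clean topics.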

## Latent Dirichlet Allocation

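* Latent Dirichlet Allocation (LDA) is one of the most widely used topic modeling algorithms.

* LDA procedure
  1. The user chooses the number of topics and passes it to the algorithm.
  2. Every word is initially assigned to one of the topics (typically at random).
  3. For every word w in every document, the topic is reassigned according to $p(t|d)$ and $p(w|t)$, and this step is repeated. Each reassignment assumes that only the current word has the wrong topic and that every other word is assigned correctly.

* $p(t|d)$: the proportion of words in document d that are assigned to topic t
* A topic that is common among the other words of the document is likely to be the topic of the current word.

* $p(w|t)$: of all word assignments to topic t across the corpus, the proportion that come from word w
* A topic that word w is frequently assigned to in other documents is likely to be its topic here as well.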
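
A single reassignment step can be sketched with plain counts. This is a simplified illustration, not a full LDA implementation: the data and the helper name `topic_scores` are hypothetical, the Dirichlet smoothing terms used in practice are left out, and the sketch picks the highest-scoring topic, whereas a real sampler draws a topic at random in proportion to the scores.

```python
from collections import Counter

# Hypothetical current state: each document is a list of (word, topic) pairs.
docs = [
    [("dog", 0), ("breed", 0), ("dog", 1), ("bark", 0)],
    [("file", 1), ("program", 1), ("dog", 0)],
]
num_topics = 2


def topic_scores(docs, doc_idx, word_idx, num_topics):
    """Score every topic for one word, treating all other assignments as correct."""
    word = docs[doc_idx][word_idx][0]

    # Topic counts inside the document, leaving out the word being reassigned.
    in_doc = Counter(t for i, (_, t) in enumerate(docs[doc_idx]) if i != word_idx)
    others_in_doc = max(len(docs[doc_idx]) - 1, 1)

    # Corpus-wide counts, again leaving out the word being reassigned.
    topic_total = Counter()    # number of words assigned to each topic
    word_in_topic = Counter()  # number of times `word` is assigned to each topic
    for d, doc in enumerate(docs):
        for i, (w, t) in enumerate(doc):
            if d == doc_idx and i == word_idx:
                continue
            topic_total[t] += 1
            if w == word:
                word_in_topic[t] += 1

    scores = []
    for t in range(num_topics):
        p_t_d = in_doc[t] / others_in_doc
        p_w_t = word_in_topic[t] / topic_total[t] if topic_total[t] else 0.0
        scores.append(p_t_d * p_w_t)
    return scores


# Reassign the second "dog" in document 0, which currently has topic 1.
scores = topic_scores(docs, doc_idx=0, word_idx=2, num_topics=num_topics)
new_topic = max(range(num_topics), key=lambda t: scores[t])
print(scores, new_topic)
```

In this toy state the other words of document 0 all belong to topic 0, and "dog" is assigned to topic 0 everywhere else, so both factors favor topic 0.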

## **1. Load data**
20 newsgroups dataset (classification) <br>
https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html

In [67]:
from sklearn.datasets import fetch_20newsgroups

dataset=fetch_20newsgroups(shuffle=True, random_state=1)
dataset.data  # headers, footers, and quotes are still included here and should be removed

['From: ab4z@Virginia.EDU ("Andi Beyer")\nSubject: Re: Israeli Terrorism\nOrganization: University of Virginia\nLines: 15\n\nWell i\'m not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n',
 'F

In [68]:
dataset=fetch_20newsgroups(shuffle=True, random_state=1, remove=("headers","footers","quotes"))
documents=dataset.data  # headers, footers, and quotes are now stripped
print("## Number of documents: ", len(documents))
print("## First document\n", documents[0])

## Number of documents:  11314
## First document
 Well i'm not sure about the story nad it did seem biased. What
I disagree with is your statement that the U.S. Media is out to
ruin Israels reputation. That is rediculous. The U.S. media is
the most pro-israeli media in the world. Having lived in Europe
I realize that incidences such as the one described in the
letter have occured. The U.S. media as a whole seem to try to
ignore them. The U.S. is subsidizing Israels existance and the
Europeans are not (at least not to the same degree). So I think
that might be a reason they report more clearly on the
atrocities.
	What is a shame is that in Austria, daily reports of
the inhuman acts commited by Israeli soldiers and the blessing
received from the Government makes some of the Holocaust guilt
go away. After all, look how the Jews are treating other races
when they got power. It is unfortunate.



### **1-2. Data Frame**

In [69]:
import pandas as pd

article_df=pd.DataFrame({"article":documents})
article_df.head()

| | article |
|---|---|
| 0 | Well i'm not sure about the story nad it did s... |
| 1 | \n\n\n\n\n\n\nYeah, do you expect people to re... |
| 2 | Although I realize that principle is not one o... |
| 3 | Notwithstanding all the legitimate fuss about ... |
| 4 | Well, I will have to change the scoring on my ... |


### **1-3. Explore data**

In [70]:
print("##total number of article##\n")
article_df["article"].shape[0]

##total number of article##



11314

In [71]:
print("##unique txt##\n")
article_df["article"].unique()

##unique txt##



array(["Well i'm not sure about the story nad it did seem biased. What\nI disagree with is your statement that the U.S. Media is out to\nruin Israels reputation. That is rediculous. The U.S. media is\nthe most pro-israeli media in the world. Having lived in Europe\nI realize that incidences such as the one described in the\nletter have occured. The U.S. media as a whole seem to try to\nignore them. The U.S. is subsidizing Israels existance and the\nEuropeans are not (at least not to the same degree). So I think\nthat might be a reason they report more clearly on the\natrocities.\n\tWhat is a shame is that in Austria, daily reports of\nthe inhuman acts commited by Israeli soldiers and the blessing\nreceived from the Government makes some of the Holocaust guilt\ngo away. After all, look how the Jews are treating other races\nwhen they got power. It is unfortunate.\n",
       "\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a

In [72]:
article_df.replace("",float("NaN"),inplace=True)
article_df.isnull().any()

article    True
dtype: bool

In [73]:

article_df.dropna(inplace=True)
print("##total number of article after drop NaN##\n")
article_df["article"].shape[0]

##total number of article after drop NaN##



11096

## **2. Preprocessing**

### **2-1.Tokenization & Clean**

In [74]:
import re
import string
import nltk
from nltk.corpus import stopwords
from gensim.parsing.preprocessing import preprocess_string
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [75]:
# Get the punctuation characters from string.punctuation
print("## Punctuation to remove ##\n", string.punctuation, "\n")


# Build a dictionary that maps each punctuation character to an empty string
print("## dictionary (punctuation -> empty string) ##")
delete_dict = {sp_character: '' for sp_character in string.punctuation}
delete_dict[" "] = " "   # keep spaces
delete_dict["\t"] = " "  # tab -> single space
delete_dict["\n"] = " "  # newline -> single space
print(delete_dict, "\n")

# str.maketrans converts each single-character key to its Unicode code point
print("## dictionary (code point -> replacement) ##")
table = str.maketrans(delete_dict)
print(table, "\n")

# Apply the translation table to the first document
print("## First document after translation ##\n", documents[0].translate(table))

## Punctuation to remove ##
 !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 

## dictionary (punctuation -> empty string) ##
{'!': '', '"': '', '#': '', '$': '', '%': '', '&': '', "'": '', '(': '', ')': '', '*': '', '+': '', ',': '', '-': '', '.': '', '/': '', ':': '', ';': '', '<': '', '=': '', '>': '', '?': '', '@': '', '[': '', '\\': '', ']': '', '^': '', '_': '', '`': '', '{': '', '|': '', '}': '', '~': '', ' ': ' ', '\t': ' ', '\n': ' '} 

## dictionary (code point -> replacement) ##
{33: '', 34: '', 35: '', 36: '', 37: '', 38: '', 39: '', 40: '', 41: '', 42: '', 43: '', 44: '', 45: '', 46: '', 47: '', 58: '', 59: '', 60: '', 61: '', 62: '', 63: '', 64: '', 91: '', 92: '', 93: '', 94: '', 95: '', 96: '', 123: '', 124: '', 125: '', 126: '', 32: ' ', 9: ' ', 10: ' '} 

## First document after translation ##
 Well im not sure about the story nad it did seem biased What I disagree with is your statement that the US Media is out to ruin Israels reputation That is rediculous The US media is the most proisraeli media in the world Having lived in Europe I realize

In [76]:
stop_words=stopwords.words("english")
print("## Number of stop words : ",len(stop_words))
stop_words

## Number of stop words :  179


['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [77]:
ex_txt="apple"
print("## Is txt numeric:", ex_txt.isdigit())
print("## Length of txt:", len(ex_txt))

## Is txt numeric: False
## Length of txt: 5


In [78]:
# Wrap each cleaning step in a function

def remve_punctuation(text):  # remove punctuation and lowercase the text
    delete_dict = {sp_character: '' for sp_character in string.punctuation}
    delete_dict[' '] = ' '    # keep spaces
    delete_dict["\n"] = " "   # newline -> single space
    delete_dict["\t"] = " "   # tab -> single space
    table = str.maketrans(delete_dict)
    text1 = text.translate(table)
    return text1.lower()


def remove_stopword(text):  # remove stop words
    text_split = text.split(' ')
    rem_text = " ".join([i for i in text_split if i not in stop_words])
    return rem_text


def remove_number(text):  # remove tokens made up only of digits
    text_split = text.split(' ')
    rem_text = " ".join([i for i in text_split if (not i.isdigit())])
    return rem_text

def remove_short(text):  # drop words of three characters or fewer (articles, interjections, ...)
    text_split = text.split(' ')
    rem_text = " ".join([i for i in text_split if (len(i)>3)])
    return rem_text

def preprocessing(text):  # tokenize with gensim's default filters, which include stemming
    return preprocess_string(text)
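
The helpers above split on a single space, so runs of spaces produce empty strings that are only filtered out later by the length check. As a rough alternative, the steps could be combined into one pass with `str.split()`, which splits on any whitespace. This is a sketch, not the notebook's method: `clean_text`, `PUNCT_TABLE`, and `SAMPLE_STOP_WORDS` are hypothetical names, and the small stop-word set stands in for the NLTK list so the example is self-contained.

```python
import string

# Stand-in for stopwords.words("english"); an assumption for self-containment.
SAMPLE_STOP_WORDS = {"i", "to", "this", "as", "the", "is", "it", "a"}

# Three-argument maketrans: delete every punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text, stop_words=SAMPLE_STOP_WORDS, min_len=4):
    """Strip punctuation, lowercase, then drop stop words, digits, and short tokens."""
    # split() with no argument handles spaces, tabs, and newlines alike.
    tokens = text.translate(PUNCT_TABLE).lower().split()
    kept = [
        tok for tok in tokens
        if tok not in stop_words and not tok.isdigit() and len(tok) >= min_len
    ]
    return " ".join(kept)


cleaned = clean_text("I'd like to see this info as well:\n\t880 nm (a bit)")
print(cleaned)
```

With the real stop-word list this could be applied as `article_df["article"].apply(clean_text, stop_words=set(stop_words))`; a set also makes the membership test faster than a list.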

In [79]:
print("## Article length before cleaning: ", len(article_df["article"].iloc[20]))

## Article length before cleaning:  668


In [80]:
print("## Remove punctuation")
article_df["article"]=article_df["article"].apply(remve_punctuation)
article_df["article"].iloc[20]

## Remove punctuation


'     id like to see this info as well  as for wavelength i think youre primarily going to find two  880 nm  a bit andor 950 nm  a bit  usually it is about 10 nm either way  the two most common i have seen were 880 and 950 but i have also heard of 890 and 940 im not sure that the 10 nm one way or another will make a great deal of difference   another suggestion  find a brand of tv that uses an ir remote and go look at the sams photofact for it  you can often find some very detailed schematics and parts list for not only the receiver but the transmitter as well including carrier freq specs and tone decoding specs if the system uses that'

In [81]:
print("## Remove stop words")
article_df["article"]=article_df["article"].apply(remove_stopword)
article_df["article"].iloc[20]
# stop words such as "to", "this", and "as" are gone

## Remove stop words


'     id like see info well  wavelength think youre primarily going find two  880 nm  bit andor 950 nm  bit  usually 10 nm either way  two common seen 880 950 also heard 890 940 im sure 10 nm one way another make great deal difference   another suggestion  find brand tv uses ir remote go look sams photofact  often find detailed schematics parts list receiver transmitter well including carrier freq specs tone decoding specs system uses'

In [82]:
print("## Remove numbers")
article_df["article"]=article_df["article"].apply(remove_number)
article_df["article"].iloc[20]

## Remove numbers


'     id like see info well  wavelength think youre primarily going find two  nm  bit andor nm  bit  usually nm either way  two common seen also heard im sure nm one way another make great deal difference   another suggestion  find brand tv uses ir remote go look sams photofact  often find detailed schematics parts list receiver transmitter well including carrier freq specs tone decoding specs system uses'

In [83]:
print("## Remove short words")
article_df["article"]=article_df["article"].apply(remove_short)
article_df["article"].iloc[20]

## Remove short words


'like info well wavelength think youre primarily going find andor usually either common seen also heard sure another make great deal difference another suggestion find brand uses remote look sams photofact often find detailed schematics parts list receiver transmitter well including carrier freq specs tone decoding specs system uses'

In [84]:
print("## Article length after cleaning: ", len(article_df["article"].iloc[20]))

## Article length after cleaning:  333


In [85]:
tokenized_article=article_df["article"].apply(preprocessing).to_list()
tokenized_article

[['sure',
  'stori',
  'bias',
  'disagre',
  'statement',
  'media',
  'ruin',
  'israel',
  'reput',
  'redicul',
  'media',
  'proisra',
  'media',
  'world',
  'live',
  'europ',
  'realiz',
  'incid',
  'describ',
  'letter',
  'occur',
  'media',
  'ignor',
  'subsid',
  'israel',
  'exist',
  'european',
  'degre',
  'think',
  'reason',
  'report',
  'clearli',
  'atroc',
  'shame',
  'austria',
  'daili',
  'report',
  'inhuman',
  'act',
  'commit',
  'isra',
  'soldier',
  'bless',
  'receiv',
  'govern',
  'make',
  'holocaust',
  'guilt',
  'awai',
  'look',
  'jew',
  'treat',
  'race',
  'power',
  'unfortun'],
 ['yeah',
  'expect',
  'peopl',
  'read',
  'actual',
  'accept',
  'hard',
  'atheism',
  'need',
  'littl',
  'leap',
  'faith',
  'jimmi',
  'logic',
  'run',
  'steam',
  'sorri',
  'piti',
  'sorri',
  'feel',
  'denial',
  'faith',
  'need',
  'pretend',
  'happili',
  'mayb',
  'start',
  'newsgroup',
  'altatheisthard',
  'wont',
  'bummin',
  'byeby',
  

In [86]:
## Drop documents that have too few tokens left (two or fewer)
article_txt = [tokens for tokens in tokenized_article if len(tokens) > 2]
print("## Documents remaining after dropping short ones: ", len(article_txt))

## Documents remaining after dropping short ones:  10928


## Topic Modeling with Gensim

In [87]:
#!pip install --upgrade gensim



In [89]:
import gensim
from gensim import corpora

In [92]:
dic=corpora.Dictionary(article_txt)
corpus=[dic.doc2bow(text) for text in article_txt]
print(corpus[1])

[(50, 1), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 2), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 2), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 2), (80, 1), (81, 1), (82, 1), (83, 1), (84, 1)]
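
Each pair in the output above is a token id and the number of times that token occurs in the document. The idea can be sketched in plain Python as follows. This only illustrates the bag-of-words representation and is not how gensim implements `Dictionary` or `doc2bow`; the helper names `build_vocab` and `to_bow` and the toy documents are hypothetical, and gensim may assign ids in a different order.

```python
from collections import Counter


def build_vocab(token_lists):
    """Assign an integer id to each distinct token, in first-seen order."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_bow(tokens, vocab):
    """Convert a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[tok] for tok in tokens if tok in vocab)
    return sorted(counts.items())


toy_docs = [["media", "report", "media"], ["faith", "need", "faith", "report"]]
vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, vocab) for doc in toy_docs]
print(vocab)
print(toy_corpus)
```

Word order is discarded in this representation; only the counts per document are passed on to the topic model.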


### `LsiModel` for Latent Semantic Analysis

In [None]:
from gensim.models.coherencemodel import CoherenceModel

min_topics,max_topics=20,25
co

In [93]:
from gensim.models import LsiModel

In [94]:
LSI=LsiModel(corpus, num_topics=20,id2word=dic)
topics=LSI.print_topics()
topics
                   

[(0,
  '-1.000*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxax" + -0.008*"mgvgvgvgvgvgvgvgvgvgvgvgvgvgvgv" + -0.005*"maxaxaxaxaxaxaxaxaxaxaxaxaxax" + -0.003*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxaxq" + -0.002*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxf" + -0.002*"mqaxaxaxaxaxaxaxaxaxaxaxaxaxax" + -0.001*"maxaxaxaxaxaxaxaxaxaxaxaxaxasqq" + -0.001*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxqq" + -0.001*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxasq" + -0.001*"maxaxaxaxaxaxaxaxaxaxaxaxaxaxqqf"'),
 (1,
  '0.392*"file" + 0.191*"program" + 0.158*"imag" + 0.125*"peopl" + 0.125*"avail" + 0.119*"inform" + 0.116*"includ" + 0.116*"entri" + 0.114*"work" + 0.111*"dont"'),
 (2,
  '-0.454*"file" + 0.216*"peopl" + 0.210*"know" + 0.192*"said" + 0.176*"dont" + 0.158*"think" + -0.157*"entri" + 0.154*"stephanopoulo" + -0.140*"imag" + 0.129*"go"'),
 (3,
  '-0.412*"file" + -0.288*"entri" + 0.242*"imag" + 0.167*"avail" + 0.138*"wire" + 0.136*"data" + 0.122*"version" + -0.116*"onam" + 0.109*"window" + -0.102*"said"'),
 (4,
  '0.618*"wire" + 0.250*"ground" + 0.188*"circ

### `LdaModel` for Latent Dirichlet Allocation

## Visualizing Topic Models