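1. Latent Semantic Analysis (LSA)

The DTM (document-term matrix) and TF-IDF are both bag-of-words representations, so they do not take word meaning (topics) into account. LSA is an alternative that tries to uncover the latent meaning of words.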



2. Singular Value Decomposition (SVD)

When A is an m x n matrix, SVD factorizes it as A = UΣV^T.

* U: m x m orthogonal matrix
* V^T: transpose of n x n orthogonal matrix
* Σ (written as S in the code): m x n rectangular diagonal matrix


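LSA uses truncated SVD, which discards some of the vectors in each of the three matrices.

* U, V^T: keep only the first t columns of U and the first t rows of V^T (t is a hyperparameter, i.e. a value the user chooses directly, reflecting the number of topics to find)

* Σ: keep only the diagonal entries σ1 through σt

A large t retains more of the varied meaning in the original A, but only a small t removes noise (information with low explanatory power).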
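The trade-off in choosing t can be made concrete with a small sketch. The helper name `reconstruction_error` is hypothetical (it is not part of NumPy). It truncates the SVD to t components and reports the Frobenius-norm gap between A and its rank-t reconstruction, which should equal the square root of the sum of the squared discarded singular values.

```python
import numpy as np

def reconstruction_error(A, t):
    """Frobenius norm of A minus its rank-t truncated-SVD reconstruction (illustrative helper)."""
    U, s, VT = np.linalg.svd(A, full_matrices=False)
    A_t = U[:, :t] @ np.diag(s[:t]) @ VT[:t, :]
    return np.linalg.norm(A - A_t)

# A small 4-document x 9-word count matrix.
A = np.array([[0, 0, 0, 1, 0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 2, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0, 1, 1]], dtype=float)

singular_values = np.linalg.svd(A, compute_uv=False)
errors = [reconstruction_error(A, t) for t in range(1, 5)]
for t, err in zip(range(1, 5), errors):
    # Second number: norm of the singular values that were discarded at this t.
    discarded = np.sqrt(np.sum(singular_values[t:] ** 2))
    print(t, round(float(err), 4), round(float(discarded), 4))
```

The error shrinks as t grows and reaches (numerically) zero once t equals the rank of A, so a smaller t deliberately trades fidelity for a more compact representation.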

3. LSA practice

In [None]:
import numpy as np
A = np.array([[0,0,0,1,0,1,1,0,0], [0,0,0,1,1,0,1,0,0], [0,1,1,0,2,0,0,0,0], [1,0,0,0,0,0,0,1,1]])
np.shape(A) # 4 rows, 9 columns

(4, 9)

In [None]:
# Full SVD (the code uses S in place of Σ)
# VT: the transpose of V

U, s, VT = np.linalg.svd(A)
print(U.round(2)) # round to two decimal places
np.shape(U)

[[-0.24  0.75  0.   -0.62]
 [-0.51  0.44 -0.    0.74]
 [-0.83 -0.49 -0.   -0.27]
 [-0.   -0.    1.    0.  ]]


(4, 4)

In [None]:
print(s.round(2))
np.shape(s) # svd returns a 1-D array of singular values, not a diagonal matrix

[2.69 2.05 1.73 0.77]


(4,)

In [None]:
S = np.zeros((4,9))
S[:4, :4] = np.diag(s) # place the singular values on the diagonal
print(S.round(2))
np.shape(S)
# The singular values on the diagonal of S are in descending order.

[[2.69 0.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.   2.05 0.   0.   0.   0.   0.   0.   0.  ]
 [0.   0.   1.73 0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.77 0.   0.   0.   0.   0.  ]]


(4, 9)

In [None]:
print(VT.round(2))
np.shape(VT)

[[-0.   -0.31 -0.31 -0.28 -0.8  -0.09 -0.28 -0.   -0.  ]
 [ 0.   -0.24 -0.24  0.58 -0.26  0.37  0.58 -0.   -0.  ]
 [ 0.58 -0.    0.    0.   -0.    0.   -0.    0.58  0.58]
 [ 0.   -0.35 -0.35  0.16  0.25 -0.8   0.16 -0.   -0.  ]
 [-0.   -0.78 -0.01 -0.2   0.4   0.4  -0.2   0.    0.  ]
 [-0.29  0.31 -0.78 -0.24  0.23  0.23  0.01  0.14  0.14]
 [-0.29 -0.1   0.26 -0.59 -0.08 -0.08  0.66  0.14  0.14]
 [-0.5  -0.06  0.15  0.24 -0.05 -0.05 -0.19  0.75 -0.25]
 [-0.5  -0.06  0.15  0.24 -0.05 -0.05 -0.19 -0.25  0.75]]


(9, 9)

In [None]:
np.allclose(A, np.dot(np.dot(U,S), VT).round(2))

# The product of the full SVD factors reproduces A

True

In [None]:
# With t=2

S = S[:2, :2]
print(S.round(2))

[[2.69 0.  ]
 [0.   2.05]]


In [None]:
U=U[:,:2] # keep only t columns (number of documents x number of topics)
print(U.round(2))
# Each row is a document vector expressed numerically in the latent semantic space.

[[-0.24  0.75]
 [-0.51  0.44]
 [-0.83 -0.49]
 [-0.   -0.  ]]


In [None]:
VT = VT[:2, :] # keep only t rows (number of topics x number of words)
print(VT.round(2))
# Each column is a word vector in the latent semantic space; each row corresponds to a topic.

[[-0.   -0.31 -0.31 -0.28 -0.8  -0.09 -0.28 -0.   -0.  ]
 [ 0.   -0.24 -0.24  0.58 -0.26  0.37  0.58 -0.   -0.  ]]


In [None]:
A_prime = np.dot(np.dot(U,S), VT)
print(A)
print(A_prime.round(2))

[[0 0 0 1 0 1 1 0 0]
 [0 0 0 1 1 0 1 0 0]
 [0 1 1 0 2 0 0 0 0]
 [1 0 0 0 0 0 0 1 1]]
[[ 0.   -0.17 -0.17  1.08  0.12  0.62  1.08 -0.   -0.  ]
 [ 0.    0.2   0.2   0.91  0.86  0.45  0.91  0.    0.  ]
 [ 0.    0.93  0.93  0.03  2.05 -0.17  0.03  0.    0.  ]
 [ 0.    0.    0.    0.    0.   -0.    0.    0.    0.  ]]
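Since the rows of the truncated U act as document vectors, one common way to use them is to compare documents by cosine similarity in the latent space. The following is a minimal sketch under that assumption. The scaling by singular values and the `cosine` helper are illustrative choices, not something the notebook itself does. The fourth document is left out because its t=2 vector is numerically zero.

```python
import numpy as np

A = np.array([[0, 0, 0, 1, 0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 2, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0, 1, 1]], dtype=float)

t = 2
U, s, VT = np.linalg.svd(A, full_matrices=False)
# Scale each latent dimension by its singular value to get document coordinates.
doc_vectors = U[:, :t] * s[:t]

def cosine(u, v):
    """Cosine similarity between two 1-D vectors (illustrative helper)."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_01 = cosine(doc_vectors[0], doc_vectors[1])  # documents 0 and 1 share two words
sim_02 = cosine(doc_vectors[0], doc_vectors[2])  # documents 0 and 2 share no words
print(round(sim_01, 2), round(sim_02, 2))
```

Documents 0 and 1 should come out far more similar than documents 0 and 2, which is the kind of grouping the latent dimensions are meant to capture.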


Practice

1. Newsgroup data (bundled with scikit-learn)


In [None]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers','footers','quotes'))
documents = dataset.data # get the data attribute from the Bunch
len(documents)

11314

In [None]:
type(documents)

list

In [None]:
type(dataset)

sklearn.utils.Bunch

In [None]:
documents[1]

"\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap of faith, Jimmy.  Your logic runs out\nof steam!\n\n\n\n\n\n\n\nJim,\n\nSorry I can't pity you, Jim.  And I'm sorry that you have these feelings of\ndenial about the faith you need to get by.  Oh well, just pretend that it will\nall end happily ever after anyway.  Maybe if you start a new newsgroup,\nalt.atheist.hard, you won't be bummin' so much?\n\n\n\n\n\n\nBye-Bye, Big Jim.  Don't forget your Flintstone's Chewables!  :) \n--\nBake Timmons, III"

2. Text preprocessing

In [None]:
df = pd.DataFrame({'A':[4,1,5], 'B':[9,4,6]})
df

Unnamed: 0,A,B
0,4,9
1,1,4
2,5,6


In [None]:
df = pd.DataFrame([[4,9], [1,4], [5,6]], columns = ['A','B'])
df

Unnamed: 0,A,B
0,4,9
1,1,4
2,5,6


In [None]:
# 1. Apply a function's return value to every value in column A
# 2. Apply a function's return value to the entire df

In [None]:
# 1

def plusone(x):
  x += 1
  return x

df['A'].apply(plusone) # x receives each value in the column
# Applied to a single column, so the result is a Series

0    5
1    2
2    6
Name: A, dtype: int64

In [None]:
df
# apply alone does not modify df.

Unnamed: 0,A,B
0,4,9
1,1,4
2,5,6


In [None]:
df['A'] = df['A'].apply(plusone)
df

Unnamed: 0,A,B
0,5,9
1,2,4
2,6,6


In [None]:
df = pd.DataFrame({'A':[4,1,5], 'B':[9,4,6]})
df['A'] = df['A'].apply(lambda x: x+1)
df 

# In other words, the result must be assigned back for the change to take effect in the DataFrame.

Unnamed: 0,A,B
0,5,9
1,2,4
2,6,6
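The cells above only demonstrate the first case (one column). For the second case, applying a function to the entire DataFrame, a minimal sketch could look like this. It relies on `DataFrame.apply` passing each column to the function as a Series; the name `df_plus` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [4, 1, 5], 'B': [9, 4, 6]})

# DataFrame.apply passes each column (a Series) to the function,
# so col + 1 adds one to every value in that column.
df_plus = df.apply(lambda col: col + 1)

print(df_plus)
print(df)  # the original is unchanged until the result is assigned back
```

As with the single-column case, `df` keeps its old values unless the result is assigned back, e.g. `df = df.apply(...)`.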


In [None]:
documents[1]

"\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap of faith, Jimmy.  Your logic runs out\nof steam!\n\n\n\n\n\n\n\nJim,\n\nSorry I can't pity you, Jim.  And I'm sorry that you have these feelings of\ndenial about the faith you need to get by.  Oh well, just pretend that it will\nall end happily ever after anyway.  Maybe if you start a new newsgroup,\nalt.atheist.hard, you won't be bummin' so much?\n\n\n\n\n\n\nBye-Bye, Big Jim.  Don't forget your Flintstone's Chewables!  :) \n--\nBake Timmons, III"

In [None]:
import re

news_df = pd.DataFrame({'document':documents})
# column name = 'document', values = documents

# Remove special characters: replace every non-alphabetic character with a space
# (regex=True is needed because newer pandas versions treat the pattern literally by default)
news_df['clean_doc'] = news_df['document'].str.replace('[^a-zA-Z]', ' ', regex=True)

# Remove words of length 3 or less
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
# The list comprehension keeps only the words that satisfy the condition,
# and ' '.join() puts a space between them so they do not run together.

# Convert everything to lowercase
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())

news_df['clean_doc'][1]

'yeah expect people read actually accept hard atheism need little leap faith jimmy your logic runs steam sorry pity sorry that have these feelings denial about faith need well just pretend that will happily ever after anyway maybe start newsgroup atheist hard bummin much forget your flintstone chewables bake timmons'
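To see what each of the three cleaning steps does in isolation, the same logic can be sketched on a single string with the standard `re` module. The function name `clean_text` and the sample sentence are made up for illustration.

```python
import re

def clean_text(text):
    """Apply the three cleaning steps to one string (illustrative helper)."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)                  # non-letters -> space
    long_words = [w for w in letters_only.split() if len(w) > 3]   # drop words of length <= 3
    return ' '.join(long_words).lower()                            # rejoin and lowercase

sample = "Hello, World!! It's a test of LSA."
cleaned = clean_text(sample)
print(cleaned)  # hello world test
```

Note that the length filter also removes short but meaningful tokens such as "LSA", which is a side effect worth keeping in mind when choosing the threshold.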

In [None]:
# Tokenization and stop-word removal

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())
# x (each value) must be split first so that stop words can be filtered out in the comprehension.
tokenized_doc = tokenized_doc.apply(lambda x: [word for word 
                                               in x if word not in stop_words])


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
print(tokenized_doc[1])

['yeah', 'expect', 'people', 'read', 'actually', 'accept', 'hard', 'atheism', 'need', 'little', 'leap', 'faith', 'jimmy', 'logic', 'runs', 'steam', 'sorry', 'pity', 'sorry', 'feelings', 'denial', 'faith', 'need', 'well', 'pretend', 'happily', 'ever', 'anyway', 'maybe', 'start', 'newsgroup', 'atheist', 'hard', 'bummin', 'much', 'forget', 'flintstone', 'chewables', 'bake', 'timmons']


3. Building the TF-IDF matrix

In [None]:
# TfidfVectorizer takes untokenized text as input,
# so the tokens need to be detokenized first

print(len(news_df))
print(len(news_df['clean_doc']))
# len(dataframe): number of rows

11314
11314


In [None]:
news_df

Unnamed: 0,document,clean_doc
0,Well i'm not sure about the story nad it did s...,well sure about story seem biased what disagre...
1,"\n\n\n\n\n\n\nYeah, do you expect people to re...",yeah expect people read actually accept hard a...
2,Although I realize that principle is not one o...,although realize that principle your strongest...
3,Notwithstanding all the legitimate fuss about ...,notwithstanding legitimate fuss about this pro...
4,"Well, I will have to change the scoring on my ...",well will have change scoring playoff pool unf...
...,...,...
11309,"Danny Rubenstein, an Israeli journalist, will ...",danny rubenstein israeli journalist will speak...
11310,\n,
11311,\nI agree. Home runs off Clemens are always m...,agree home runs clemens always memorable kinda...
11312,I used HP DeskJet with Orange Micros Grappler ...,used deskjet with orange micros grappler syste...


In [None]:
detokenized_doc = []
for i in range(len(news_df)):
  t = ' '.join(tokenized_doc[i]) # join the tokens of document i back into one string
  detokenized_doc.append(t) # collect it in the list

news_df['clean_doc'] = detokenized_doc

In [None]:
news_df['clean_doc'][1]
# Detokenization complete, with stop words removed

'yeah expect people read actually accept hard atheism need little leap faith jimmy logic runs steam sorry pity sorry feelings denial faith need well pretend happily ever anyway maybe start newsgroup atheist hard bummin much forget flintstone chewables bake timmons'

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000,max_df=0.5,smooth_idf=True)
# max_features = 1000: keep only the top 1000 terms by term frequency
# max_df = 0.5 : ignore terms that appear in more than 50% of documents
# the default is 1.0, i.e. no terms are ignored

X = vectorizer.fit_transform(news_df['clean_doc'])
X.shape

(11314, 1000)
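The effect of `max_df` can be illustrated without scikit-learn. The sketch below is a plain-Python imitation of the rule stated in the comment (drop terms whose document frequency is strictly above the threshold). It is not scikit-learn's actual implementation, and the helper name `filter_by_max_df` and the toy documents are hypothetical.

```python
from collections import Counter

def filter_by_max_df(docs, max_df=0.5):
    """Return the sorted vocabulary after dropping terms whose
    document frequency (as a fraction of documents) exceeds max_df."""
    n_docs = len(docs)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(term for term, df in doc_freq.items() if df / n_docs <= max_df)

docs = ["apple banana", "apple cherry", "apple dog", "banana dog"]
vocab = filter_by_max_df(docs, max_df=0.5)
print(vocab)
```

Here "apple" occurs in 3 of 4 documents (75%) and is dropped, while "banana" and "dog" sit exactly at 50% and are kept.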

4. Topic modeling

In [None]:
# There are 20 news categories, so run truncated SVD with n_components=20

from sklearn.decomposition import TruncatedSVD
svd_model = TruncatedSVD(n_components=20, n_iter=100, random_state=122)
# n_iter : number of iterations for randomized(default) SVD solver

svd_model.fit(X)
len(svd_model.components_) # svd_model.components_ corresponds to VT


20

In [None]:
import numpy as np
np.shape(svd_model.components_)
# size: number of topics x number of words

(20, 1000)

In [None]:
terms = vectorizer.get_feature_names_out() # get_feature_names() is no longer available in recent scikit-learn versions

def get_topics(components, feature_names, n=5):
  for idx, topic in enumerate(components):
    print('topic %d' % (idx+1), [(feature_names[i], topic[i].round(5)) for i in topic.argsort()[:-n - 1:-1]])

get_topics(svd_model.components_, terms)

topic 1 [('like', 0.21386), ('know', 0.20046), ('people', 0.19293), ('think', 0.17805), ('good', 0.15128)]
topic 2 [('thanks', 0.32888), ('windows', 0.29088), ('card', 0.18069), ('drive', 0.17455), ('mail', 0.15111)]
topic 3 [('game', 0.37064), ('team', 0.32443), ('year', 0.28154), ('games', 0.2537), ('season', 0.18419)]
topic 4 [('drive', 0.53324), ('scsi', 0.20165), ('hard', 0.15628), ('disk', 0.15578), ('card', 0.13994)]
topic 5 [('windows', 0.40399), ('file', 0.25436), ('window', 0.18044), ('files', 0.16078), ('program', 0.13894)]
topic 6 [('chip', 0.16114), ('government', 0.16009), ('mail', 0.15625), ('space', 0.1507), ('information', 0.13562)]
topic 7 [('like', 0.67086), ('bike', 0.14236), ('chip', 0.11169), ('know', 0.11139), ('sounds', 0.10371)]
topic 8 [('card', 0.46633), ('video', 0.22137), ('sale', 0.21266), ('monitor', 0.15463), ('offer', 0.14643)]
topic 9 [('know', 0.46047), ('card', 0.33605), ('chip', 0.17558), ('government', 0.1522), ('video', 0.14356)]
topic 10 [('good'
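The slice `topic.argsort()[:-n - 1:-1]` inside `get_topics` is compact but easy to misread. A minimal sketch with made-up weights and hypothetical term names shows what it selects: the indices of the n largest values, largest first.

```python
import numpy as np

weights = np.array([0.1, 0.5, 0.3, 0.9])
feature_names = ['alpha', 'beta', 'gamma', 'delta']  # hypothetical terms
n = 2

order = weights.argsort()      # indices sorted by ascending weight
top_n = order[:-n - 1:-1]      # walk backwards from the end, taking n indices
top_terms = [(feature_names[i], float(weights[i])) for i in top_n]
print(top_terms)
```

An equivalent and arguably more readable spelling is `weights.argsort()[::-1][:n]`.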



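Latent Dirichlet Allocation (LDA)

A representative algorithm for topic modeling.

Example:

Document 1: I eat apples and bananas

Document 2: We like cute puppies

Document 3: My adorable, cute puppy eats bananas

Assume the number of topics (a hyperparameter chosen by the user) is 2 and that a preprocessed DTM is given as input.

<Topic distribution of each document>

Document 1: topic A 100%
Document 2: topic B 100%
Document 3: topic B 60%, topic A 40%

<Word distribution of each topic>

Topic A: apple 20%, banana 40%, eat 40%, cute 0%, puppy 0%, adorable 0%, like 0%

Topic B: apple 0%, banana 0%, eat 0%, cute 33%, puppy 33%, adorable 16%, like 16%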



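Assumptions

LDA assumes that each document was written through the following process.

1. Decide the number of words to use in the document.
e.g. 5 words

2. Decide the mixture of topics to use, based on probability.
e.g. puppy 60%, fruit 40%

3. Pick a topic according to those probabilities, then pick a word according to that topic's word distribution. (Repeat this step until the document is complete.)

e.g. pick the puppy topic with 60% probability, then pick the word "puppy" with 33% probability.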
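The assumed writing process can be sketched as a tiny sampler. The topic mixture and word weights below reuse the percentages from the example (with English glosses of the example words); the function name `generate_document` and the fixed seed are illustrative assumptions.

```python
import random

# Topic mixture for one document and word weights per topic (percentages from the example).
topic_mix = {'puppy': 0.6, 'fruit': 0.4}
topic_words = {
    'fruit': {'apple': 20, 'banana': 40, 'eat': 40},
    'puppy': {'cute': 33, 'puppy': 33, 'adorable': 16, 'like': 16},
}

def generate_document(n_words, rng):
    """Sample n_words by first drawing a topic, then a word from that topic."""
    words = []
    for _ in range(n_words):
        topic = rng.choices(list(topic_mix), weights=list(topic_mix.values()))[0]
        dist = topic_words[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return words

rng = random.Random(0)  # fixed seed so the sketch is repeatable
doc = generate_document(5, rng)
print(doc)
```

`random.choices` normalizes the weights internally, so the topic B percentages do not need to sum to exactly 100.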



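Under these assumptions, LDA traces the process above in reverse.

1. The user tells the algorithm the number of topics (k).

2. Every word is assigned a random topic. (These assignments may be wrong.)

3. Each word is then reassigned a topic based on two criteria:
* p(t|d): the proportion of words in document d currently assigned to topic t
* p(w|t): the proportion of topic t's assignments, across all documents, that belong to word w

Repeating this eventually leads to a converged state.

e.g. d1) apple:B, banana:B, apple:???, dog:A, dog:A

p(t|d1): A 50%, B 50%

p(w|t): the other apple is already assigned to B, so this apple is more likely to be assigned to topic B.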
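The reassignment of the unknown apple in d1 can be worked through numerically. This is a deliberately simplified sketch: it uses counts from d1 only (a real sampler counts p(w|t) over all documents) and omits the Dirichlet smoothing terms that practical implementations add, so the zero score for topic A is an artifact of that simplification.

```python
from collections import Counter

# Current assignments in d1, leaving out the apple whose topic is being re-sampled.
assignments = [('apple', 'B'), ('banana', 'B'), ('dog', 'A'), ('dog', 'A')]
target_word = 'apple'
topics = ['A', 'B']

topic_counts = Counter(topic for _, topic in assignments)
word_topic_counts = Counter(assignments)

scores = {}
for t in topics:
    p_t_given_d = topic_counts[t] / len(assignments)                    # share of d1 assigned to t
    p_w_given_t = word_topic_counts[(target_word, t)] / topic_counts[t]  # share of t that is 'apple'
    scores[t] = p_t_given_d * p_w_given_t

best_topic = max(scores, key=scores.get)
print(scores, best_topic)
```

Topic B wins because both factors are non-zero for it, matching the intuition in the example.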

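LSA (Latent Semantic Analysis): reduces the dimensionality of the DTM and groups closely related words into topics.

LDA (Latent Dirichlet Allocation): extracts topics by combining the probability that a word belongs to a topic with the probability that a topic is present in a document.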

1. Integer encoding and building the vocabulary

In [5]:
# Preprocessing
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(random_state=1, remove=('headers','footers','quotes'))
# random_state=1: fixes the seed used for shuffling so the result is the same on every run.
# headers, footers and quoted text are stripped from each post
documents = dataset.data
len(documents)

11314

In [6]:
documents[1]

"\n\n\n\n\n\n\nYeah, do you expect people to read the FAQ, etc. and actually accept hard\natheism?  No, you need a little leap of faith, Jimmy.  Your logic runs out\nof steam!\n\n\n\n\n\n\n\nJim,\n\nSorry I can't pity you, Jim.  And I'm sorry that you have these feelings of\ndenial about the faith you need to get by.  Oh well, just pretend that it will\nall end happily ever after anyway.  Maybe if you start a new newsgroup,\nalt.atheist.hard, you won't be bummin' so much?\n\n\n\n\n\n\nBye-Bye, Big Jim.  Don't forget your Flintstone's Chewables!  :) \n--\nBake Timmons, III"

In [8]:
import re
import pandas as pd
news_df = pd.DataFrame({'document': documents})
news_df['clean_doc'] = news_df['document'].str.replace('[^a-zA-Z]', ' ', regex=True) # regex=True: newer pandas treats the pattern literally by default
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())

news_df['clean_doc'][1]

'yeah expect people read actually accept hard atheism need little leap faith jimmy your logic runs steam sorry pity sorry that have these feelings denial about faith need well just pretend that will happily ever after anyway maybe start newsgroup atheist hard bummin much forget your flintstone chewables bake timmons'

In [9]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# Note: stop_words is loaded but not applied in this cell, so stop words remain in tokenized_doc.
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [10]:
print(tokenized_doc)

0        [well, sure, about, story, seem, biased, what,...
1        [yeah, expect, people, read, actually, accept,...
2        [although, realize, that, principle, your, str...
3        [notwithstanding, legitimate, fuss, about, thi...
4        [well, will, have, change, scoring, playoff, p...
                               ...                        
11309    [danny, rubenstein, israeli, journalist, will,...
11310                                                   []
11311    [agree, home, runs, clemens, always, memorable...
11312    [used, deskjet, with, orange, micros, grappler...
11313    [argument, with, murphy, scared, hell, when, c...
Name: clean_doc, Length: 11314, dtype: object


In [11]:
tokenized_doc[:5]

0    [well, sure, about, story, seem, biased, what,...
1    [yeah, expect, people, read, actually, accept,...
2    [although, realize, that, principle, your, str...
3    [notwithstanding, legitimate, fuss, about, thi...
4    [well, will, have, change, scoring, playoff, p...
Name: clean_doc, dtype: object

In [45]:
# Get (word_id, word_frequency) pairs with gensim's corpora.Dictionary().doc2bow()

from gensim import corpora # corpora: plural of corpus (a collection of texts prepared for analysis)
dictionary = corpora.Dictionary(tokenized_doc)  # Dictionary: encapsulates the mapping between normalized words and their integer ids.
corpus = [dictionary.doc2bow(text) for text in tokenized_doc]
print(corpus[1])

[(0, 1), (2, 1), (20, 1), (60, 2), (66, 1), (72, 2), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 2), (85, 1), (86, 1), (87, 1), (88, 1), (89, 2), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 2), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 2), (105, 1), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1)]


In [48]:
# Get (word_id, frequency) pairs
from gensim.corpora import Dictionary
dictionary = Dictionary(tokenized_doc) # maps each unique word to an integer id
corpus = [dictionary.doc2bow(text) for text in tokenized_doc] # convert each tokenized document to a bag of words (frequency per word id)
print(corpus[1]) # each corpus entry is a list of (word id, frequency) pairs

[(0, 1), (2, 1), (20, 1), (60, 2), (66, 1), (72, 2), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 1), (83, 1), (84, 2), (85, 1), (86, 1), (87, 1), (88, 1), (89, 2), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 1), (97, 2), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1), (103, 1), (104, 2), (105, 1), (106, 1), (107, 1), (108, 1), (109, 1), (110, 1)]
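What `doc2bow` produces can be imitated in plain Python to make the (word id, frequency) format less opaque. This is only a conceptual sketch: the helpers `build_vocab` and `to_bow` are hypothetical, and ids here are assigned in first-seen order, which is not necessarily how gensim assigns them.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each unique token, in first-seen order."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Convert a token list to a sorted list of (token id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

docs = [['sorry', 'faith', 'sorry'], ['faith', 'logic']]
token2id = build_vocab(docs)
bow = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bow)
```

A count of 2 in a pair, such as `(60, 2)` in the output above, simply means that word occurs twice in the document.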


In [25]:
print(dictionary.token2id)



In [13]:
print(dictionary[0])
# word_id 0 = 'about'

about


In [14]:
len(dictionary)
# number of unique words in the dictionary

64365

2. Training the LDA model

In [16]:
import gensim
from gensim.models.ldamodel import LdaModel
ldamodel = LdaModel(corpus, num_topics = 20, id2word = dictionary, passes=15)

In [17]:
import gensim
NUM_TOPICS = 20 # 20 topics, k=20
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
# passes: number of passes over the corpus during training
# id2word: mapping from word ids to words, so topics can be shown as words
topics = ldamodel.print_topics(num_words=4)
for topic in topics:
    print(topic)

(0, '0.010*"nrhj" + 0.007*"wwiz" + 0.006*"bxom" + 0.006*"gizw"')
(1, '0.016*"that" + 0.011*"their" + 0.009*"people" + 0.008*"from"')
(2, '0.019*"game" + 0.017*"team" + 0.013*"year" + 0.013*"games"')
(3, '0.042*"they" + 0.027*"were" + 0.015*"there" + 0.014*"said"')
(4, '0.022*"will" + 0.014*"that" + 0.013*"this" + 0.011*"government"')
(5, '0.016*"armenian" + 0.015*"turkish" + 0.014*"were" + 0.012*"armenians"')
(6, '0.045*"that" + 0.021*"have" + 0.020*"this" + 0.016*"they"')
(7, '0.032*"space" + 0.013*"nasa" + 0.009*"data" + 0.007*"from"')
(8, '0.026*"that" + 0.015*"jesus" + 0.011*"from" + 0.009*"christian"')
(9, '0.008*"from" + 0.007*"with" + 0.006*"will" + 0.006*"national"')
(10, '0.027*"output" + 0.025*"entry" + 0.020*"file" + 0.015*"program"')
(11, '0.010*"compass" + 0.007*"wave" + 0.007*"maine" + 0.006*"conductor"')
(12, '0.017*"file" + 0.016*"this" + 0.011*"window" + 0.011*"that"')
(13, '0.012*"from" + 0.011*"available" + 0.010*"information" + 0.008*"mail"')
(14, '0.013*"jpeg" + 0.

3. Visualization (pyLDAvis)

In [28]:
!pip install pyLDAvis



In [40]:
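import pyLDAvis # needed so that the name pyLDAvis itself is bound for the calls below
import pyLDAvis.gensim_models as models
pyLDAvis.enable_notebook() # render the D3 visualization inline in the notebook
vis = models.prepare(ldamodel, corpus, dictionary)
pyLDAvis.display(vis)
# corpus - list of (id, frequency) pairs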

TypeError: ignored

4. Viewing the topic distribution per document

In [41]:
# The word distribution per topic was checked above; the topic distribution per document has not been checked yet.

for i, topic_list in enumerate(ldamodel[corpus]):
  if i == 5:
    break
  print('Document', i, 'topic proportions:', topic_list)

# Each pair is (topic id, proportion), e.g. in document 1, topic 3 accounts for about 2%.

Document 0 topic proportions: [(1, 0.36409152), (4, 0.044686798), (5, 0.06237396), (6, 0.4475007), (9, 0.055516638), (15, 0.01769082)]
Document 1 topic proportions: [(3, 0.0247517), (6, 0.42970002), (8, 0.101952225), (15, 0.29948777), (17, 0.020588238), (19, 0.10979459)]
Document 2 topic proportions: [(1, 0.2877936), (6, 0.6151155), (9, 0.08659703)]
Document 3 topic proportions: [(3, 0.035233036), (4, 0.3216232), (6, 0.38214737), (14, 0.0120269945), (15, 0.24081722)]
Document 4 topic proportions: [(2, 0.26751658), (6, 0.4213391), (10, 0.05496909), (14, 0.23117523)]


In [51]:
ldamodel[corpus][1]

[(3, 0.02475179),
 (6, 0.4296854),
 (8, 0.10194742),
 (15, 0.29949108),
 (17, 0.020588236),
 (19, 0.10981061)]
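Given a list of (topic id, proportion) pairs like the one above, the dominant topic is simply the pair with the largest proportion. A minimal sketch using the values printed for document 1:

```python
# (topic id, proportion) pairs for document 1, copied from the output above.
doc_topics = [(3, 0.02475179), (6, 0.4296854), (8, 0.10194742),
              (15, 0.29949108), (17, 0.020588236), (19, 0.10981061)]

# Sort by proportion, largest first, and take the head of the list.
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)
dominant_topic, proportion = ranked[0]
print(dominant_topic, round(proportion, 4))
```

Topics that do not appear in the list at all were presumably below the model's reporting threshold for this document.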

In [56]:
def topictable_perdoc(ldamodel, corpus):
  # DataFrame.append was removed in pandas 2.0, so rows are collected in a list first.
  rows = []
  for topic_list in ldamodel[corpus]:
    doc = topic_list[0] if ldamodel.per_word_topics else topic_list
    doc = sorted(doc, key=lambda x: x[1], reverse=True) # sort by proportion, descending
    if doc:
      topic_num, distribution = doc[0] # the topic with the largest share
      rows.append([int(topic_num), round(float(distribution), 4), topic_list])

  return pd.DataFrame(rows)


Practice 2 - LDA with scikit-learn

1. Loading the data

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
import pandas as pd
filename = '/content/abcnews-date-text.csv'
data = pd.read_csv(filename)
data.head(5)

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [13]:
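text = data[['headline_text']] # double brackets: build a new DataFrame from the selected column (adding .copy() here would avoid the SettingWithCopyWarning shown in later cells)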
text.head(5)

Unnamed: 0,headline_text
0,aba decides against community broadcasting lic...
1,act fire witnesses must be aware of defamation
2,a g calls for infrastructure protection summit
3,air nz staff in aust strike for pay rise
4,air nz strike to affect australian travellers


2. Text preprocessing

In [14]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [15]:
import nltk
text['headline_text'] = text['headline_text'].apply(lambda x: nltk.word_tokenize(x))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [16]:
print(text.head(5))

                                       headline_text
0  [aba, decides, against, community, broadcastin...
1  [act, fire, witnesses, must, be, aware, of, de...
2  [a, g, calls, for, infrastructure, protection,...
3  [air, nz, staff, in, aust, strike, for, pay, r...
4  [air, nz, strike, to, affect, australian, trav...


In [20]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) # build the set once instead of reloading the list for every word
text['headline_text'][:5] = text['headline_text'][:5].apply(lambda x: [word for word in x if word not in stop_words])



In [21]:
print(text.head(5))

                                       headline_text
0   [aba, decides, community, broadcasting, licence]
1    [act, fire, witnesses, must, aware, defamation]
2     [g, calls, infrastructure, protection, summit]
3          [air, nz, staff, aust, strike, pay, rise]
4  [air, nz, strike, affect, australian, travellers]


In [24]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [25]:
# Lemmatization: reduce each verb to its base form (e.g. third person to first person, past to present tense)
from nltk.stem import WordNetLemmatizer
text['headline_text'][:5] = text['headline_text'][:5].apply(lambda x:[WordNetLemmatizer().lemmatize(word, pos='v') for word in x])
# pos (part of speech) tag: words are treated as nouns by default, so pos='v' is set explicitly
print(text.head(5))

                                       headline_text
0       [aba, decide, community, broadcast, licence]
1      [act, fire, witness, must, aware, defamation]
2      [g, call, infrastructure, protection, summit]
3          [air, nz, staff, aust, strike, pay, rise]
4  [air, nz, strike, affect, australian, travellers]




In [27]:
tokenized_doc = text['headline_text'][:5].apply(lambda x: [word for word in x if len(word) > 3])
print(tokenized_doc[:5])

0       [decide, community, broadcast, licence]
1      [fire, witness, must, aware, defamation]
2    [call, infrastructure, protection, summit]
3                   [staff, aust, strike, rise]
4      [strike, affect, australian, travellers]
Name: headline_text, dtype: object


3. Building the TF-IDF matrix

In [28]:
# Detokenization
detokenized_doc = []
for i in range(len(text[:5])):
  t = ' '.join(tokenized_doc[:5][i])
  detokenized_doc.append(t)

text['headline_text'][:5] = detokenized_doc # store the result back



In [29]:
text['headline_text'][:5]

0       decide community broadcast licence
1       fire witness must aware defamation
2    call infrastructure protection summit
3                   staff aust strike rise
4      strike affect australian travellers
Name: headline_text, dtype: object

In [42]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words = 'english', max_features = 1000)
X = vectorizer.fit_transform(text['headline_text'][:5])
print(X)

# TF-IDF weights over a vocabulary of 17 terms

  (0, 9)	0.5
  (0, 4)	0.5
  (0, 5)	0.5
  (0, 6)	0.5
  (1, 7)	0.5773502691896258
  (1, 3)	0.5773502691896258
  (1, 16)	0.5773502691896258
  (2, 14)	0.5773502691896258
  (2, 10)	0.5773502691896258
  (2, 8)	0.5773502691896258
  (3, 11)	0.5233582502695435
  (3, 13)	0.4222421409859579
  (3, 1)	0.5233582502695435
  (3, 12)	0.5233582502695435
  (4, 15)	0.5233582502695435
  (4, 2)	0.5233582502695435
  (4, 0)	0.5233582502695435
  (4, 13)	0.4222421409859579
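The weights printed above can be reproduced by hand, which helps clarify what `smooth_idf` and the default L2 normalization do. The sketch below assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 and raw term counts as tf. The token lists are an assumption too: they are what appears to remain after the vectorizer's built-in English stop-word filter, inferred from documents 1 and 2 having only three non-zero entries each.

```python
import math
from collections import Counter

# Assumed tokens per document after stop-word filtering (inferred from the printed matrix).
docs = [
    ['decide', 'community', 'broadcast', 'licence'],
    ['witness', 'aware', 'defamation'],
    ['infrastructure', 'protection', 'summit'],
    ['staff', 'aust', 'strike', 'rise'],
    ['strike', 'affect', 'australian', 'travellers'],
]

n_docs = len(docs)
doc_freq = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """TF-IDF with smoothed idf and L2 normalization for one token list."""
    counts = Counter(doc)
    raw = {t: tf * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
           for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[3])
print(weights)
```

"strike" appears in two documents, so its idf is lower and its weight in document 3 comes out smaller than that of "staff", "aust" and "rise".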


4. Topic modeling

In [43]:
from sklearn.decomposition import LatentDirichletAllocation
lda_model = LatentDirichletAllocation(n_components = 10, learning_method = 'online', random_state = 777, max_iter = 1)
# learning_method: method for updating components
# random_state: fixed so that every run gives the same result
# max_iter: maximum number of iterations
lda_top = lda_model.fit_transform(X)

In [44]:
print(lda_model.components_)
# 10 topics were requested, so 10 rows are printed (one per topic)

[[0.79187958 0.76431241 0.90515156 0.89313138 0.69517227 0.91345964
  0.87435459 0.99313632 0.88676516 0.78476756 1.01362058 0.85101851
  0.86690775 0.79241809 1.15550038 0.86697262 0.90985291]
 [0.83406494 0.89954591 0.71110762 0.77578291 0.97133842 1.04655739
  0.9114273  0.72056717 0.73268541 0.98570917 0.81702034 0.67560954
  0.8564272  0.85306345 0.7810675  0.8421721  0.92849379]
 [0.8913782  0.90193066 0.86644916 0.86295243 0.82268479 0.79853011
  0.8723072  0.86253133 0.88290082 0.80713839 0.79890343 1.09751132
  1.00929894 0.96938762 0.87505749 0.76487199 0.75792083]
 [0.71628429 0.81108858 0.76003454 0.83427839 0.92840428 0.87344855
  0.87425228 0.80783069 0.79699077 0.81114103 0.78038769 0.95050004
  0.925239   0.88296289 0.8703788  0.90518624 0.91576783]
 [0.82165433 0.84537341 0.85951565 0.63961542 0.82853134 0.78613315
  0.79602198 0.85585715 0.81473848 0.82997762 0.79425326 0.92116303
  0.95399711 0.74992575 0.977438   0.88756472 0.79619525]
 [0.88594961 0.87035907 0.9640
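The rows of `components_` are unnormalized weights rather than probabilities. Dividing a row by its sum turns it into a word distribution for that topic. A minimal sketch using the first row printed above (the variable names are illustrative):

```python
import numpy as np

# First row of lda_model.components_ as printed above (topic 1, 17 terms).
topic_1 = np.array([0.79187958, 0.76431241, 0.90515156, 0.89313138, 0.69517227, 0.91345964,
                    0.87435459, 0.99313632, 0.88676516, 0.78476756, 1.01362058, 0.85101851,
                    0.86690775, 0.79241809, 1.15550038, 0.86697262, 0.90985291])

topic_1_dist = topic_1 / topic_1.sum()   # normalize so the weights sum to 1
top_index = int(topic_1_dist.argmax())   # index of the most probable term
print(top_index, topic_1_dist.round(3))
```

With only five short headlines and `max_iter = 1`, the weights are nearly uniform, so the resulting topics should not be read as meaningful.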

In [62]:
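terms = vectorizer.get_feature_names_out() # get_feature_names() is no longer available in recent scikit-learn versions
def get_topics(components, feature_names, n=5):
  for idx, topic in enumerate(components):
    # j iterates over the indices of the n largest weights, largest first
    print('Topic %d:' % (idx+1), [(feature_names[j], topic[j].round(2)) for j in topic.argsort()[:-n -1:-1]])
get_topics(lda_model.components_, terms)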

Topic 1: [('summit', 1.16), ('protection', 1.01), ('defamation', 0.99), ('community', 0.91), ('witness', 0.91)]
Topic 2: [('community', 1.05), ('licence', 0.99), ('broadcast', 0.97), ('witness', 0.93), ('decide', 0.91)]
Topic 3: [('rise', 1.1), ('staff', 1.01), ('strike', 0.97), ('aust', 0.9), ('affect', 0.89)]
Topic 4: [('rise', 0.95), ('broadcast', 0.93), ('staff', 0.93), ('witness', 0.92), ('travellers', 0.91)]
Topic 5: [('summit', 0.98), ('staff', 0.95), ('rise', 0.92), ('travellers', 0.89), ('australian', 0.86)]
Topic 6: [('travellers', 1.0), ('australian', 0.96), ('strike', 0.95), ('affect', 0.89), ('summit', 0.88)]
Topic 7: [('defamation', 1.05), ('witness', 0.94), ('protection', 0.93), ('australian', 0.92), ('aware', 0.85)]
Topic 8: [('strike', 0.98), ('community', 0.89), ('licence', 0.87), ('witness', 0.86), ('staff', 0.86)]
Topic 9: [('strike', 0.93), ('broadcast', 0.93), ('community', 0.89), ('witness', 0.88), ('protection', 0.88)]
Topic 10: [('defamation', 1.0), ('licence',

