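# LSA (Latent Semantic Analysis)
- DTM and TF-IDF, both built on Bag of Words (BoW), turn text into numbers using only word frequencies. As a result they cannot take word meaning into account (that is, they ignore topics).
- LSA is a method for drawing out latent meaning. It is also called LSI (Latent Semantic Indexing).
- Singular value decomposition (SVD) is central to LSA.
  - $A = U \Sigma V^{T}$
  - Keeping every element of each matrix gives the Full SVD; keeping only the top t singular values gives the Truncated SVD. This t is the hyperparameter that sets the number of topics to look for.
  - A large t retains more of the varied meaning in the original matrix, whereas a small t removes noise (only the most explanatory information is kept).

- LSA applies Truncated SVD to a DTM or TF-IDF matrix to reduce its dimensionality and draw out the latent meaning of words.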
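
The trade-off in choosing t can be checked numerically. The following is a minimal sketch, assuming a small 4×9 toy document-term matrix; the helper name `truncated_reconstruction` is made up for illustration. It rebuilds A from only the top t singular values and reports the Frobenius-norm error, which shrinks as t grows.

```python
import numpy as np

# Toy document-term matrix (4 documents x 9 terms), assumed for illustration.
A = np.array([[0, 0, 0, 1, 0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 2, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0, 1, 1]], dtype=float)

def truncated_reconstruction(A, t):
    """Rebuild A from its top-t singular values (hypothetical helper)."""
    U, s, VT = np.linalg.svd(A, full_matrices=False)
    return U[:, :t] @ np.diag(s[:t]) @ VT[:t, :]

errors = []
for t in range(1, 5):
    # Frobenius norm of the difference between A and its rank-t approximation
    err = np.linalg.norm(A - truncated_reconstruction(A, t))
    errors.append(err)
    print(f"t={t}: reconstruction error = {err:.3f}")
```

With t equal to the rank of A the error drops to (numerically) zero; smaller t trades exactness for a more compact representation.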


### 1. Full SVD

In [1]:
import numpy as np

A = np.array([[0,0,0,1,0,1,1,0,0],[0,0,0,1,1,0,1,0,0],[0,1,1,0,2,0,0,0,0],[1,0,0,0,0,0,0,1,1]])
print('Shape of the DTM: ', np.shape(A))

Shape of the DTM:  (4, 9)


In [2]:
A

array([[0, 0, 0, 1, 0, 1, 1, 0, 0],
       [0, 0, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 1, 0, 2, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 1]])

In [3]:
U,s,VT = np.linalg.svd(A, full_matrices= True)
print('Matrix U :')
print(U.round(2))
print('Shape of matrix U :',np.shape(U))

Matrix U :
[[-0.24  0.75  0.   -0.62]
 [-0.51  0.44 -0.    0.74]
 [-0.83 -0.49 -0.   -0.27]
 [-0.   -0.    1.    0.  ]]
Shape of matrix U : (4, 4)


In [4]:
print('Singular values :')
print(s.round(2))
print('Shape of the singular value vector :',np.shape(s))


Singular values :
[2.69 2.05 1.73 0.77]
Shape of the singular value vector : (4,)


In [5]:
S = np.zeros((4,9))

S[:4,:4] = np.diag(s)
print('Diagonal matrix S :')
print(S.round(2))

print('Shape of the diagonal matrix :')
print(np.shape(S))

Diagonal matrix S :
[[2.69 0.   0.   0.   0.   0.   0.   0.   0.  ]
 [0.   2.05 0.   0.   0.   0.   0.   0.   0.  ]
 [0.   0.   1.73 0.   0.   0.   0.   0.   0.  ]
 [0.   0.   0.   0.77 0.   0.   0.   0.   0.  ]]
Shape of the diagonal matrix :
(4, 9)


In [6]:
print('Orthogonal matrix VT :')
print(VT.round(2))

print('Shape of the orthogonal matrix VT :')
print(np.shape(VT))


Orthogonal matrix VT :
[[-0.   -0.31 -0.31 -0.28 -0.8  -0.09 -0.28 -0.   -0.  ]
 [ 0.   -0.24 -0.24  0.58 -0.26  0.37  0.58 -0.   -0.  ]
 [ 0.58 -0.    0.    0.   -0.    0.   -0.    0.58  0.58]
 [ 0.   -0.35 -0.35  0.16  0.25 -0.8   0.16 -0.   -0.  ]
 [-0.   -0.78 -0.01 -0.2   0.4   0.4  -0.2   0.    0.  ]
 [-0.29  0.31 -0.78 -0.24  0.23  0.23  0.01  0.14  0.14]
 [-0.29 -0.1   0.26 -0.59 -0.08 -0.08  0.66  0.14  0.14]
 [-0.5  -0.06  0.15  0.24 -0.05 -0.05 -0.19  0.75 -0.25]
 [-0.5  -0.06  0.15  0.24 -0.05 -0.05 -0.19 -0.25  0.75]]
Shape of the orthogonal matrix VT :
(9, 9)


In [7]:
# allclose() returns True when the two arrays are element-wise equal within a tolerance.
np.allclose(A, np.dot(np.dot(U,S),VT).round(2) )

True

### 2. Truncated SVD

In [8]:
S

array([[2.68731789, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 2.04508425, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 1.73205081, 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.77197992, 0.        ,
        0.        , 0.        , 0.        , 0.        ]])

In [9]:
# Keep only the top 2 singular values
S = S[:2,:2]

print('Diagonal matrix S :')
print(S.round(2))


Diagonal matrix S :
[[2.69 0.  ]
 [0.   2.05]]


In [10]:
U = U[:,:2]
print('Matrix U :')
print(U.round(2))


Matrix U :
[[-0.24  0.75]
 [-0.51  0.44]
 [-0.83 -0.49]
 [-0.   -0.  ]]


In [11]:
VT = VT[:2,:]
print('Orthogonal matrix VT :')
print(VT.round(2))

Orthogonal matrix VT :
[[-0.   -0.31 -0.31 -0.28 -0.8  -0.09 -0.28 -0.   -0.  ]
 [ 0.   -0.24 -0.24  0.58 -0.26  0.37  0.58 -0.   -0.  ]]


In [12]:
A_prime = np.dot(np.dot(U,S),VT)
print(A)
print()
print(A_prime.round(2))

[[0 0 0 1 0 1 1 0 0]
 [0 0 0 1 1 0 1 0 0]
 [0 1 1 0 2 0 0 0 0]
 [1 0 0 0 0 0 0 1 1]]

[[ 0.   -0.17 -0.17  1.08  0.12  0.62  1.08 -0.   -0.  ]
 [ 0.    0.2   0.2   0.91  0.86  0.45  0.91  0.    0.  ]
 [ 0.    0.93  0.93  0.03  2.05 -0.17  0.03  0.    0.  ]
 [ 0.    0.    0.    0.    0.    0.    0.    0.    0.  ]]


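- The original matrix A has shape (4, 9): 4 documents over a vocabulary of 9 words.
- The truncated U has shape (4, 2), i.e. number of documents × number of topics (t). In other words, each document is represented by 2 values. Each row of U is a `document vector`, a numeric representation of that document's latent meaning.
- The truncated VT has shape (2, 9), i.e. number of topics × number of words. Each column of VT can be seen as a `word vector`, a numeric representation of that word's latent meaning.
- With these document and word vectors we can compute the similarity between documents, the similarity between words, and the similarity between a word (query) and documents.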
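
As a rough illustration of the similarity computations mentioned above, the sketch below takes the rows of the truncated U as document vectors and the columns of the truncated VT as word vectors, then compares them with cosine similarity. The `cosine` helper and its zero-vector guard are assumptions added for the example; scaling the vectors by the singular values first is a common variant that is not shown here.

```python
import numpy as np

# Same toy 4x9 document-term matrix as above.
A = np.array([[0, 0, 0, 1, 0, 1, 1, 0, 0],
              [0, 0, 0, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 2, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 0, 0, 1, 1]], dtype=float)

U, s, VT = np.linalg.svd(A, full_matrices=False)
t = 2
doc_vectors = U[:, :t]        # one row per document, shape (4, 2)
word_vectors = VT[:t, :].T    # one row per word, shape (9, 2)

def cosine(u, v, eps=1e-12):
    """Cosine similarity; returns 0.0 when either vector is numerically zero."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > eps else 0.0

sim_docs_0_1 = cosine(doc_vectors[0], doc_vectors[1])     # documents sharing several words
sim_docs_0_3 = cosine(doc_vectors[0], doc_vectors[3])     # document 3 is ~0 in the top-2 space
sim_words_3_6 = cosine(word_vectors[3], word_vectors[6])  # words with identical columns in A

print(round(sim_docs_0_1, 3), sim_docs_0_3, round(sim_words_3_6, 3))
```

Document 3 shares no words with the others and is captured by the third singular value, so with t = 2 its vector is essentially zero; this is one concrete way in which a small t discards information.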

#### 1) Understanding the newsgroup data

In [13]:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

In [14]:
dataset = fetch_20newsgroups(shuffle=True, random_state=1,remove=('headers','footers','quotes'))
documents = dataset.data
print(f'Number of samples: {len(documents)}')

Number of samples: 11314


In [15]:
dir(dataset)

['DESCR', 'data', 'filenames', 'target', 'target_names']

In [16]:
# The group each news post belongs to.
dataset.target

array([17,  0, 17, ...,  9,  4,  9])

In [17]:
dataset.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

#### 2) Text preprocessing

##### ① Character cleanup

In [18]:
news_df = pd.DataFrame({'documents':documents})

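# Remove everything except letters. regex=True is passed explicitly because pandas 2.0+ treats
# the pattern as a literal string by default; the punctuation still visible in the outputs
# recorded below suggests the pattern was not applied as a regex in the original run.
news_df['clean_doc'] = news_df['documents'].str.replace("[^a-zA-Z]", " ", regex=True)

# Remove short words (length 3 or less)
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda row: ' '.join([w for w in row.split() if len(w)>3]))

# Lowercase all words
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda row: row.lower())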
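
To see what the three cleaning steps do to a single string, here is a small standard-library sketch. The helper name `clean_text` and the sample sentence are made up for illustration; the regular expression and the length threshold are the ones used in the cell above.

```python
import re

def clean_text(text):
    """Apply the three cleaning steps to one string (illustrative helper)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)                 # non-letters -> spaces
    long_words = [w for w in letters_only.split() if len(w) > 3]   # drop words of length <= 3
    return " ".join(long_words).lower()                            # lowercase the result

sample = "Yeah, do you expect people to read the FAQ, etc.?"
cleaned = clean_text(sample)
print(cleaned)  # yeah expect people read
```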

In [19]:
news_df['clean_doc'][1]

"yeah, expect people read faq, etc. actually accept hard atheism? need little leap faith, jimmy. your logic runs steam! jim, sorry can't pity you, jim. sorry that have these feelings denial about faith need well, just pretend that will happily ever after anyway. maybe start newsgroup, alt.atheist.hard, won't bummin' much? bye-bye, jim. don't forget your flintstone's chewables! bake timmons,"

##### ② Tokenization & stopword removal

In [20]:
# nltk.download('stopwords')

In [21]:
# Load the English stopword list from NLTK.
stop_words = stopwords.words('english')
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [22]:
# Tokenization
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())
tokenized_doc

0        [well, sure, about, story, seem, biased., what...
1        [yeah,, expect, people, read, faq,, etc., actu...
2        [although, realize, that, principle, your, str...
3        [notwithstanding, legitimate, fuss, about, thi...
4        [well,, will, have, change, scoring, playoff, ...
                               ...                        
11309    [danny, rubenstein,, israeli, journalist,, wil...
11310                                                   []
11311    [agree., home, runs, clemens, always, memorabl...
11312    [used, deskjet, with, orange, micros, grappler...
11313    [^^^^^^, argument, with, murphy., scared, hell...
Name: clean_doc, Length: 11314, dtype: object

In [23]:
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])
tokenized_doc

0        [well, sure, story, seem, biased., disagree, s...
1        [yeah,, expect, people, read, faq,, etc., actu...
2        [although, realize, principle, strongest, poin...
3        [notwithstanding, legitimate, fuss, proposal,,...
4        [well,, change, scoring, playoff, pool., unfor...
                               ...                        
11309    [danny, rubenstein,, israeli, journalist,, spe...
11310                                                   []
11311    [agree., home, runs, clemens, always, memorabl...
11312    [used, deskjet, orange, micros, grappler, syst...
11313    [^^^^^^, argument, murphy., scared, hell, came...
Name: clean_doc, Length: 11314, dtype: object

#### 3) Building the TF-IDF matrix
- By default, TfidfVectorizer takes untokenized text as input.
- So we first detokenize, joining the tokens back into strings.

##### ① Detokenization

In [24]:
news_df

Unnamed: 0,documents,clean_doc
0,Well i'm not sure about the story nad it did s...,well sure about story seem biased. what disagr...
1,"\n\n\n\n\n\n\nYeah, do you expect people to re...","yeah, expect people read faq, etc. actually ac..."
2,Although I realize that principle is not one o...,although realize that principle your strongest...
3,Notwithstanding all the legitimate fuss about ...,notwithstanding legitimate fuss about this pro...
4,"Well, I will have to change the scoring on my ...","well, will have change scoring playoff pool. u..."
...,...,...
11309,"Danny Rubenstein, an Israeli journalist, will ...","danny rubenstein, israeli journalist, will spe..."
11310,\n,
11311,\nI agree. Home runs off Clemens are always m...,agree. home runs clemens always memorable. kin...
11312,I used HP DeskJet with Orange Micros Grappler ...,used deskjet with orange micros grappler syste...


In [25]:
# Detokenization: join each token list back into a single space-separated string
news_df['clean_doc'] = tokenized_doc.apply(' '.join)

In [26]:
news_df

Unnamed: 0,documents,clean_doc
0,Well i'm not sure about the story nad it did s...,well sure story seem biased. disagree statemen...
1,"\n\n\n\n\n\n\nYeah, do you expect people to re...","yeah, expect people read faq, etc. actually ac..."
2,Although I realize that principle is not one o...,"although realize principle strongest points, w..."
3,Notwithstanding all the legitimate fuss about ...,"notwithstanding legitimate fuss proposal, much..."
4,"Well, I will have to change the scoring on my ...","well, change scoring playoff pool. unfortunate..."
...,...,...
11309,"Danny Rubenstein, an Israeli journalist, will ...","danny rubenstein, israeli journalist, speaking..."
11310,\n,
11311,\nI agree. Home runs off Clemens are always m...,agree. home runs clemens always memorable. kin...
11312,I used HP DeskJet with Orange Micros Grappler ...,used deskjet orange micros grappler system6.0....


##### ② Creating the TF-IDF matrix

In [27]:
vectorizer = TfidfVectorizer(stop_words='english',
                             max_features=1000,  # keep only the 1,000 most frequent terms
                             max_df=0.5,         # ignore terms that appear in more than 50% of the documents
                             smooth_idf=True)

X = vectorizer.fit_transform(news_df['clean_doc'])  # X = TF-IDF-weighted document-term matrix

# Check the shape of the TF-IDF matrix
print(f'Shape of the TF-IDF matrix: {X.shape}')


Shape of the TF-IDF matrix: (11314, 1000)
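
For reference, the weighting behind `smooth_idf=True` can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults: idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The small count matrix is made up for illustration.

```python
import numpy as np

# Made-up term counts: 3 documents x 3 terms.
counts = np.array([[2, 1, 0],
                   [0, 1, 1],
                   [0, 1, 3]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                 # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed idf

tfidf = counts * idf                          # weight raw counts by idf
tfidf = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalize each row

print(idf.round(3))
print(tfidf.round(3))
```

The term that occurs in every document gets the minimum idf of 1, while rarer terms are weighted up.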


#### 4) Topic modeling
- Now decompose the TF-IDF matrix into several matrices.
- Truncated SVD is used for this.

In [28]:
svd_model = TruncatedSVD(n_components=20, algorithm='randomized',
                         n_iter=100, random_state=122)

svd_model.fit(X)
len(svd_model.components_) # svd_model.components_ corresponds to VT.

20

In [29]:
np.shape(svd_model.components_) # (number of topics t) x (number of words)

(20, 1000)

In [30]:
svd_model.components_[0].argsort()

array([166,  48, 397,  45,  33,  32, 385,  57, 143,  23,  44,  47, 131,
       203,  37,   6,  24, 948,  38, 717,  56, 167,  49, 522, 664,   5,
       266, 398, 265, 975, 142, 130, 596, 819,  13, 594,  74, 851, 149,
        43, 558,  36, 508, 477,   4,  53,  31, 855,   2,  40,  46, 879,
         3,  75, 330, 226,  27,  35, 123,  28, 551, 756,  60,  54, 776,
       579, 391,  20,  52,  26,  18,  17, 109, 921, 757, 947, 835, 712,
       916, 348, 104, 580, 824, 933, 872,  86, 994,  29, 284, 628, 843,
       355, 555, 973,  11, 810, 865, 301,  85, 237, 816, 383, 242, 635,
        16, 624, 804,  39, 323, 868, 638, 278, 907, 151,  19, 122,  51,
       436, 914, 711, 964, 840, 966, 146,  67, 705, 633, 112, 595,  42,
       749, 359, 119, 120,  79, 124, 772, 715, 834, 369, 320, 475, 615,
       465,  84, 976, 632, 230, 903,  55,  68, 541, 409, 556, 327, 787,
       228, 663, 754, 926, 170, 100, 229, 642,  81,  70, 963, 862, 331,
        50, 428, 967, 125, 445, 238, 252,  92, 353, 103, 767, 89

In [31]:
svd_model.components_[0].argsort()[:-5-1:-1]

array([512, 486, 653, 894, 404], dtype=int64)
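
The slice `[:-5-1:-1]` is easy to misread, so here is a tiny sketch with made-up weights: `argsort()` returns indices in ascending order of value, and `[:-n-1:-1]` walks backwards from the end to pick the n largest, biggest first.

```python
import numpy as np

weights = np.array([0.10, 0.70, 0.05, 0.40, 0.25])  # made-up topic weights for 5 words

order = weights.argsort()      # indices sorted by ascending weight
n = 3
top_n = order[:-n - 1:-1]      # last n indices, reversed: largest weight first

print(order)   # [2 0 4 3 1]
print(top_n)   # [1 3 4]
```

An equivalent spelling is `np.argsort(-weights)[:n]`, which some readers find clearer.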

In [32]:
terms = vectorizer.get_feature_names_out()
# Retrieve the vocabulary (1,000 words) from the vectorizer used for the TF-IDF transform above.
# terms = array of the words used in the TF-IDF transform

def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print("Topic %d:" % (idx+1), [(feature_names[i], topic[i].round(5)) for i in topic.argsort()[:-n - 1:-1]])
get_topics(svd_model.components_,terms)


Topic 1: [('like', 0.2085), ('know', 0.19656), ('people', 0.1912), ('think', 0.17523), ('good', 0.14902)]
Topic 2: [('thanks', 0.31338), ('windows', 0.27934), ('card', 0.17289), ('drive', 0.16141), ('mail', 0.14507)]
Topic 3: [('game', 0.36553), ('team', 0.3133), ('year', 0.28465), ('games', 0.23048), ('season', 0.17026)]
Topic 4: [('edu', 0.50341), ('thanks', 0.25409), ('mail', 0.1758), ('com', 0.11498), ('email', 0.11166)]
Topic 5: [('edu', 0.49934), ('drive', 0.24972), ('com', 0.10645), ('sale', 0.10616), ('soon', 0.09199)]
Topic 6: [('drive', 0.40102), ('thanks', 0.34667), ('know', 0.27592), ('scsi', 0.13765), ('mail', 0.11332)]
Topic 7: [('chip', 0.21565), ('government', 0.20249), ('like', 0.17148), ('encryption', 0.14654), ('clipper', 0.14478)]
Topic 8: [('like', 0.64668), ('edu', 0.31439), ('bike', 0.12683), ('know', 0.12403), ('think', 0.11547)]
Topic 9: [('card', 0.3572), ('sale', 0.17543), ('00', 0.17496), ('video', 0.16994), ('good', 0.15574)]
Topic 10: [('card', 0.45093), (

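# LDA (Latent Dirichlet Allocation)
- Topic modeling: the process of discovering topics in a collection of documents.
- LDA takes a BoW-based DTM or a TF-IDF matrix as input, which means LDA ignores word order.

Procedure
1. Choose the number of topics k. The k topics are assumed to be distributed across all M documents.
2. Assign every word to one of the k topics.
3. For every word in every document, repeat the reassignment step iteratively: treating all other assignments as correct, reassign the word's topic based on how common each topic is in its document and how strongly the word is associated with each topic.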
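
The three steps above can be sketched as a toy collapsed Gibbs sampler. This is an illustrative assumption, not the method used later in this notebook: scikit-learn and gensim both fit LDA with variational Bayes. The function name `gibbs_lda`, the priors `alpha` and `beta`, and the tiny corpus of word ids are all made up for the example.

```python
import numpy as np

def gibbs_lda(docs, vocab_size, k, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    doc_topic = np.zeros((len(docs), k))      # words in document d assigned to topic t
    topic_word = np.zeros((k, vocab_size))    # times word w is assigned to topic t
    topic_total = np.zeros(k)                 # total words assigned to topic t

    # Step 2: give every word occurrence a random initial topic.
    assignments = []
    for d, doc in enumerate(docs):
        z = rng.integers(k, size=len(doc))
        assignments.append(z)
        for w, t in zip(doc, z):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
            topic_total[t] += 1

    # Step 3: repeatedly resample each word's topic, holding all others fixed.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = assignments[d][i]
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1
                # proportional to p(topic | document) * p(word | topic)
                p = (doc_topic[d] + alpha) * (topic_word[:, w] + beta) / (topic_total + vocab_size * beta)
                t = rng.choice(k, p=p / p.sum())
                assignments[d][i] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word

# Made-up corpus: documents 0-1 use words 0-2, documents 2-3 use words 3-5.
docs = [[0, 1, 0, 1, 2], [1, 0, 2, 0], [3, 4, 5, 4], [5, 3, 4, 3, 5]]
doc_topic, topic_word = gibbs_lda(docs, vocab_size=6, k=2)
print(doc_topic)
print(topic_word)
```

The returned arrays are raw assignment counts; adding the priors and normalizing the rows would turn them into per-document topic proportions and per-topic word distributions.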


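Differences between LDA and LSA
- LSA: reduces the dimensionality of the DTM and groups words that lie close together in the reduced space into topics.
- LDA: extracts topics by jointly estimating the probability that a word belongs to a given topic and the probability that a given topic is present in a document.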
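
The way LDA combines the two probabilities can be written as p(word | document) = Σ_t p(topic t | document) · p(word | topic t). The sketch below shows this as a matrix product; the numbers in `theta` and `phi` are made up for illustration.

```python
import numpy as np

# p(topic | document): 2 documents x 2 topics, each row sums to 1 (made-up values)
theta = np.array([[0.9, 0.1],
                  [0.2, 0.8]])

# p(word | topic): 2 topics x 3 words, each row sums to 1 (made-up values)
phi = np.array([[0.5, 0.4, 0.1],
                [0.1, 0.2, 0.7]])

# p(word | document): mix the topic-word distributions by each document's topic shares
doc_word = theta @ phi

print(doc_word)
```

Each row of `doc_word` is again a probability distribution over words, which is what LDA tries to match against the observed word counts.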

### 1. LDA with scikit-learn

#### 1) Understanding the data

In [41]:
import pandas as pd
import urllib.request
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

data = pd.read_csv('https://raw.githubusercontent.com/ukairia777/tensorflow-nlp-tutorial/main/19.%20Topic%20Modeling%20(LDA%2C%20BERT-Based)/dataset/abcnews-date-text.csv')
print(f'Number of news headlines: {len(data)}')


Number of news headlines: 1082168


In [43]:
data.head()

Unnamed: 0,publish_date,headline_text
0,20030219,aba decides against community broadcasting lic...
1,20030219,act fire witnesses must be aware of defamation
2,20030219,a g calls for infrastructure protection summit
3,20030219,air nz staff in aust strike for pay rise
4,20030219,air nz strike to affect australian travellers


In [44]:
text = data[['headline_text']]
text

Unnamed: 0,headline_text
0,aba decides against community broadcasting lic...
1,act fire witnesses must be aware of defamation
2,a g calls for infrastructure protection summit
3,air nz staff in aust strike for pay rise
4,air nz strike to affect australian travellers
...,...
1082163,when is it ok to compliment a womans smile a g...
1082164,white house defends trumps tweet
1082165,winter closes in on tasmania as snow ice falls
1082166,womens world cup australia wins despite atapat...


#### 2) Text preprocessing
- Stopword removal
- Lemmatization
- Removing short words

##### ① Tokenization

In [45]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning on the sliced frame
text = text.assign(headline_text=text['headline_text'].apply(nltk.word_tokenize))
text


Unnamed: 0,headline_text
0,"[aba, decides, against, community, broadcastin..."
1,"[act, fire, witnesses, must, be, aware, of, de..."
2,"[a, g, calls, for, infrastructure, protection,..."
3,"[air, nz, staff, in, aust, strike, for, pay, r..."
4,"[air, nz, strike, to, affect, australian, trav..."
...,...
1082163,"[when, is, it, ok, to, compliment, a, womans, ..."
1082164,"[white, house, defends, trumps, tweet]"
1082165,"[winter, closes, in, on, tasmania, as, snow, i..."
1082166,"[womens, world, cup, australia, wins, despite,..."


##### ② Stopword removal

In [46]:
stop_words = set(stopwords.words('english'))  # set lookups are faster than list lookups
text = text.assign(headline_text=text['headline_text'].apply(lambda x: [word for word in x if word not in stop_words]))
text


Unnamed: 0,headline_text
0,"[aba, decides, community, broadcasting, licence]"
1,"[act, fire, witnesses, must, aware, defamation]"
2,"[g, calls, infrastructure, protection, summit]"
3,"[air, nz, staff, aust, strike, pay, rise]"
4,"[air, nz, strike, affect, australian, travellers]"
...,...
1082163,"[ok, compliment, womans, smile, guide]"
1082164,"[white, house, defends, trumps, tweet]"
1082165,"[winter, closes, tasmania, snow, ice, falls]"
1082166,"[womens, world, cup, australia, wins, despite,..."


##### ③ Removing short words

In [47]:
# Keep only words longer than 3 characters
text = text.assign(headline_text=text['headline_text'].apply(lambda x: [word for word in x if len(word)>3]))
text


Unnamed: 0,headline_text
0,"[decides, community, broadcasting, licence]"
1,"[fire, witnesses, must, aware, defamation]"
2,"[calls, infrastructure, protection, summit]"
3,"[staff, aust, strike, rise]"
4,"[strike, affect, australian, travellers]"
...,...
1082163,"[compliment, womans, smile, guide]"
1082164,"[white, house, defends, trumps, tweet]"
1082165,"[winter, closes, tasmania, snow, falls]"
1082166,"[womens, world, australia, wins, despite, atap..."


#### 3) Building the TF-IDF matrix

In [48]:
text['headline_text']

0                [decides, community, broadcasting, licence]
1                 [fire, witnesses, must, aware, defamation]
2                [calls, infrastructure, protection, summit]
3                                [staff, aust, strike, rise]
4                   [strike, affect, australian, travellers]
                                 ...                        
1082163                   [compliment, womans, smile, guide]
1082164               [white, house, defends, trumps, tweet]
1082165              [winter, closes, tasmania, snow, falls]
1082166    [womens, world, australia, wins, despite, atap...
1082167         [youtube, stunt, death, foreshadowed, tweet]
Name: headline_text, Length: 1082168, dtype: object

##### ① Detokenization
- TfidfVectorizer takes untokenized text as input,
- so the tokens need to be joined back into strings.

In [53]:
# Join each token list back into a single space-separated string
text = text.assign(headline_text=text['headline_text'].apply(' '.join))


In [54]:
text

Unnamed: 0,headline_text
0,decides community broadcasting licence
1,fire witnesses must aware defamation
2,calls infrastructure protection summit
3,staff aust strike rise
4,strike affect australian travellers
...,...
1082163,compliment womans smile guide
1082164,white house defends trumps tweet
1082165,winter closes tasmania snow falls
1082166,womens world australia wins despite atapattus


##### ② Creating the TF-IDF matrix

In [55]:
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(text['headline_text'])

print(f'Shape of the TF-IDF matrix: {X.shape}')

Shape of the TF-IDF matrix: (1082168, 1000)


#### 4) Topic modeling

In [57]:
lda_model = LatentDirichletAllocation(n_components=10, # use 10 topics
                                      learning_method = 'online',
                                      random_state=777,
                                      max_iter=1)


In [58]:
lda_top = lda_model.fit_transform(X)

In [59]:
print(lda_model.components_) # per-topic word weights: one row per topic (10), one column per vocabulary word (1,000)
print(lda_model.components_.shape)

[[1.00000524e-01 1.00000945e-01 1.00003031e-01 ... 1.00010614e-01
  1.00003498e-01 1.00001781e-01]
 [1.00003244e-01 1.00006985e-01 1.00018485e-01 ... 1.00009374e-01
  1.00004140e-01 1.00002710e-01]
 [1.00000613e-01 1.00001003e-01 1.00001719e-01 ... 1.00010032e-01
  1.00003629e-01 1.00002942e-01]
 ...
 [1.00002359e-01 1.00002223e-01 1.00008046e-01 ... 2.25245378e+03
  1.00004394e-01 1.00001389e-01]
 [1.00001729e-01 1.00001155e-01 3.56084764e+03 ... 1.00006474e-01
  1.00003539e-01 1.00001515e-01]
 [1.00001008e-01 1.00001333e-01 1.00004934e-01 ... 1.00015214e-01
  1.00003583e-01 1.00001347e-01]]
(10, 1000)
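
The entries of `components_` are unnormalized pseudo-counts rather than probabilities. Values sitting at about 0.1 are essentially just the prior (scikit-learn's default `topic_word_prior` is 1 / n_components, which is 0.1 for 10 topics), while large values mark words that were actually assigned to the topic. A minimal sketch of turning such a matrix into per-topic word distributions, using a made-up stand-in for `lda_model.components_`:

```python
import numpy as np

# Made-up stand-in for lda_model.components_ (2 topics x 4 words).
components = np.array([[0.1, 3560.8, 0.1, 120.0],
                       [2252.5, 0.1, 0.1, 0.1]])

# Divide each row by its sum so that every topic becomes a distribution over words.
topic_word_dist = components / components.sum(axis=1, keepdims=True)

print(topic_word_dist.round(4))
print(topic_word_dist.argmax(axis=1))  # most probable word index per topic
```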


In [61]:
terms = vectorizer.get_feature_names_out()

def get_topics(components, feature_names, n=5):
    for idx, topic in enumerate(components):
        print(f'Topic: {idx+1}, {[(feature_names[i], topic[i].round(2)) for i in topic.argsort()[:-n-1:-1]]}')

get_topics(lda_model.components_, terms)

Topic: 1, [('trump', 10983.42), ('state', 4433.33), ('league', 4147.04), ('open', 4035.79), ('funding', 3547.71)]
Topic: 2, [('court', 7730.97), ('world', 7084.24), ('canberra', 6440.95), ('interview', 5813.89), ('country', 5368.42)]
Topic: 3, [('police', 12780.13), ('government', 9184.56), ('sydney', 8885.32), ('woman', 5633.17), ('north', 5261.69)]
Topic: 4, [('election', 8013.99), ('south', 6909.67), ('test', 3978.44), ('missing', 3629.12), ('power', 3624.58)]
Topic: 5, [('melbourne', 7830.58), ('years', 5286.89), ('calls', 4858.96), ('final', 3758.86), ('accused', 3747.29)]
Topic: 6, [('australia', 14376.71), ('death', 6201.75), ('2016', 5731.04), ('turnbull', 4364.71), ('people', 4166.9)]
Topic: 7, [('adelaide', 7079.47), ('perth', 6705.92), ('charged', 5845.25), ('dies', 4730.09), ('indigenous', 4334.57)]
Topic: 8, [('australian', 11564.41), ('queensland', 7981.64), ('year', 5768.28), ('brisbane', 5103.73), ('tasmania', 4820.45)]
Topic: 9, [('coast', 5606.92), ('tasmanian', 5034.

### 2. LDA with gensim

#### 1) Integer encoding and building the vocabulary

In [64]:
tokenized_doc


0        [well, sure, story, seem, biased., disagree, s...
1        [yeah,, expect, people, read, faq,, etc., actu...
2        [although, realize, principle, strongest, poin...
3        [notwithstanding, legitimate, fuss, proposal,,...
4        [well,, change, scoring, playoff, pool., unfor...
                               ...                        
11309    [danny, rubenstein,, israeli, journalist,, spe...
11310                                                   []
11311    [agree., home, runs, clemens, always, memorabl...
11312    [used, deskjet, orange, micros, grappler, syst...
11313    [^^^^^^, argument, murphy., scared, hell, came...
Name: clean_doc, Length: 11314, dtype: object

In [71]:
from gensim import corpora

dictionary = corpora.Dictionary(tokenized_doc) # maps each unique token to an integer id
corpus =[dictionary.doc2bow(text) for text in tokenized_doc]




In [73]:
print(corpus[1]) 
print()
print(dictionary[66])

[(59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1), (66, 1), (67, 1), (68, 1), (69, 1), (70, 1), (71, 1), (72, 1), (73, 1), (74, 1), (75, 1), (76, 1), (77, 1), (78, 1), (79, 1), (80, 1), (81, 1), (82, 2), (83, 1), (84, 1), (85, 1), (86, 1), (87, 1), (88, 1), (89, 2), (90, 1), (91, 1), (92, 1), (93, 1), (94, 1), (95, 1), (96, 2), (97, 1), (98, 1), (99, 1), (100, 1), (101, 1), (102, 1)]

bye-bye,


- corpus[1] is the second news post.
- In it, the word encoded as integer 66 appears once. Which word is 66?
- Looking up dictionary[66] shows that it is 'bye-bye,' (the trailing comma is part of the token).
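
To make the (word id, count) format concrete, here is a simplified standard-library stand-in for what `Dictionary` and `doc2bow` produce. The helpers `build_vocab` and `to_bow` are hypothetical, and the order in which ids are assigned here is an assumption that may differ from gensim's.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each unique token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Convert one tokenized document into sorted (token id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

# Made-up miniature corpus
docs = [["well", "sure", "story"], ["yeah,", "expect", "people", "expect"]]
token2id = build_vocab(docs)
bow = to_bow(docs[1], token2id)

print(token2id)
print(bow)  # [(3, 1), (4, 2), (5, 1)]
```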

In [74]:
len(dictionary)

181856

#### 2) Training the LDA model

In [77]:
import gensim
NUM_TOPICS = 20 # 20 topics
lda_model = gensim.models.ldamodel.LdaModel(corpus, num_topics=NUM_TOPICS, id2word=dictionary,
                                            passes=15) # passes = number of passes over the corpus during training
topics = lda_model.print_topics(num_words=4) # show only the top 4 words per topic
for topic in topics:
    print(topic)

(0, '0.007*"cubs" + 0.004*"francis" + 0.003*"trace" + 0.003*"fleet"')
(1, '0.012*"space" + 0.008*"available" + 0.007*"information" + 0.006*"data"')
(2, '0.005*"outlets" + 0.003*"carson" + 0.003*"want," + 0.003*"outlet"')
(3, '0.012*"government" + 0.012*"encryption" + 0.010*"security" + 0.007*"chip"')
(4, '0.030*"armenian" + 0.026*"turkish" + 0.020*"armenians" + 0.012*"greek"')
(5, '0.013*"henrik]" + 0.003*"karabakh" + 0.003*"terminals" + 0.002*"authorities."')
(6, '0.008*"power" + 0.005*"bike" + 0.005*"ground" + 0.004*"----"')
(7, '0.004*"prophecy" + 0.004*"captain" + 0.004*"patents" + 0.003*"intentional"')
(8, '0.013*"period" + 0.011*"power" + 0.008*"---------------" + 0.008*"toronto"')
(9, '0.005*"people" + 0.004*"government" + 0.004*"state" + 0.004*"president"')
(10, '0.003*"injuries" + 0.003*"mild" + 0.003*"hong" + 0.003*"0.333"')
(11, '0.008*"people" + 0.007*"jesus" + 0.005*"believe" + 0.005*"many"')
(12, '0.008*"gordon" + 0.008*"surrender" + 0.008*"-------------------------------

#### 3) Visualizing LDA

In [78]:
!pip install pyLDAvis





Checking the word distribution of each topic

In [79]:

import pyLDAvis.gensim_models

pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)

#### 4) Topic distribution per document

In [86]:
for i, topic_list in enumerate(lda_model[corpus]):
    if i == 5:
        break
    print(f'Topic proportions of document {i}:', topic_list)

Topic proportions of document 0: [(6, 0.015467612), (9, 0.33660227), (11, 0.2843388), (13, 0.21866275), (14, 0.062374968), (19, 0.071025945)]
Topic proportions of document 1: [(9, 0.21830307), (11, 0.12236565), (13, 0.5655474), (15, 0.07420644)]
Topic proportions of document 2: [(9, 0.26269913), (10, 0.017646022), (11, 0.058571972), (13, 0.56194925), (18, 0.03446153), (19, 0.052785996)]
Topic proportions of document 3: [(3, 0.21696894), (5, 0.116726324), (6, 0.05509893), (8, 0.015176914), (9, 0.048764464), (13, 0.3845344), (15, 0.14096539), (18, 0.013017756)]
Topic proportions of document 4: [(13, 0.32925636), (16, 0.63736343)]


(number, probability) = the topic id and the share of the document attributed to that topic
- For document 0, (9, 0.3366..) means that topic 9 accounts for about 33.7% of the document.
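
Picking the dominant topic out of such a list takes one line. The sketch below reuses the values printed for document 0 above. Note that the listed shares add up to slightly less than 1, because gensim omits topics whose probability falls below `minimum_probability` (0.01 by default).

```python
# (topic id, share) pairs copied from the output for document 0
doc_topics = [(6, 0.015467612), (9, 0.33660227), (11, 0.2843388),
              (13, 0.21866275), (14, 0.062374968), (19, 0.071025945)]

# Dominant topic = the pair with the largest share
top_topic, top_share = max(doc_topics, key=lambda pair: pair[1])

# Full ranking, largest share first
ranked = sorted(doc_topics, key=lambda pair: pair[1], reverse=True)

print(top_topic, round(top_share, 4))   # 9 0.3366
print([topic for topic, _ in ranked])   # [9, 11, 13, 19, 14, 6]
print(round(sum(share for _, share in doc_topics), 4))
```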

In [99]:
def make_topictable_per_doc(ldamodel, corpus):
    rows = []

    # Pull out the topic distribution of each document, one document at a time.
    for topic_list in ldamodel[corpus]:
        # With per_word_topics=True the model returns a tuple whose first item is the distribution.
        doc = topic_list[0] if ldamodel.per_word_topics else topic_list
        # Sort the topics of this document by share, largest first.
        # e.g. before: (topic 2, 48.5%), (topic 8, 25%), (topic 10, 5%), (topic 12, 21.5%)
        #      after:  (topic 2, 48.5%), (topic 8, 25%), (topic 12, 21.5%), (topic 10, 5%)
        doc = sorted(doc, key=lambda x: x[1], reverse=True)
        if not doc:
            continue  # no topic reported for this document

        # After sorting, the first entry is the dominant topic.
        # Store the dominant topic, its share, and the full topic distribution.
        topic_num, prop_topic = doc[0]
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_list])

    # Building the frame from a list of rows avoids DataFrame.append, which was removed in pandas 2.0.
    return pd.DataFrame(rows)


In [102]:
topictable = make_topictable_per_doc(lda_model, corpus)
topictable = topictable.reset_index() # turn the index into a column that serves as the document number
topictable.columns = ['Document No.', 'Dominant topic', 'Share of dominant topic', 'Share of each topic']
topictable[:10]