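# 20 Newsgroups Topic Modeling
- Load the data for 8 of the 20 newsgroup topics and convert it to count-based feature vectors
- LDA is applied only to the output of a count-based vectorizer (`CountVectorizer`)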

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Select 8 topics: motorcycles, baseball, graphics, X Window System, Middle East politics, Christianity, electronics, and medicine (the full dataset has 20)
cats = ['rec.motorcycles', 'rec.sport.baseball', 'comp.graphics', 'comp.windows.x',
        'talk.politics.mideast', 'soc.religion.christian', 'sci.electronics', 'sci.med']

# Fetch only the categories listed in cats above
# by passing cats to the categories parameter of fetch_20newsgroups()
news_df = fetch_20newsgroups(subset = 'all', remove = ('headers', 'footers', 'quotes'),
                             categories = cats, random_state = 0)

# LDA is applied only to count-based vectorizer output
count_vect = CountVectorizer(max_df = 0.95, max_features = 1000, min_df = 2, stop_words = 'english', ngram_range = (1, 2))
feat_vect = count_vect.fit_transform(news_df.data)
print('CountVectorizer Shape:', feat_vect.shape)

CountVectorizer Shape: (7862, 1000)
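To make the effect of `ngram_range` and `min_df` easier to see, here is a minimal sketch on a three-sentence toy corpus. The sentences are made up for illustration and are not taken from the 20 Newsgroups data; the vocabulary is read from the fitted `vocabulary_` mapping so the snippet does not depend on a particular scikit-learn release.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only, not from the 20 Newsgroups data).
docs = [
    "the pitcher threw the ball",
    "the batter hit the ball",
    "the engine of the motorcycle",
]

# ngram_range=(1, 2) keeps single words and adjacent word pairs;
# min_df=2 drops every term that appears in fewer than 2 documents.
vect = CountVectorizer(ngram_range=(1, 2), min_df=2)
counts = vect.fit_transform(docs)

# vocabulary_ maps each kept term to its column index; sort terms by that index.
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)             # terms that survived the min_df filter
print(counts.toarray())  # one row per document, one column per term
```

Only terms shared by at least two documents survive, which is the same filtering that `min_df = 2` performs in the cell above, just at a much smaller scale.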


**Create the LDA object and fit it on the count feature matrix**

In [2]:
lda = LatentDirichletAllocation(n_components = 8, random_state = 0) # set the number of topics to 8
lda.fit(feat_vect)

LatentDirichletAllocation(n_components=8, random_state=0)

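**Inspect the word weights for each topic**  
- The `components_` attribute of the fitted `lda` object holds, for each topic, a weight for every word feature
- Its shape is (number of topics) x (number of word features)
    - Here there are 8 topics, and each topic has a weight for each of the 1000 word features
- The values in `components_` are unnormalized pseudo-counts: roughly, how many times each word was assigned to each topic. Dividing each row by its sum gives that topic's word distribution
- The larger the value, the more weight the word carries in the topic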

In [3]:
print(lda.components_.shape)
lda.components_

(8, 1000)


array([[2.46251560e+02, 1.18842248e+02, 1.51715288e+02, ...,
        1.00147234e+02, 7.63673375e+01, 1.17028758e+02],
       [1.25033020e-01, 1.25052288e-01, 1.25003012e-01, ...,
        1.10644583e+02, 1.51405141e-01, 5.09788954e+01],
       [1.25103419e-01, 1.25075224e-01, 1.25082214e-01, ...,
        6.72008817e+01, 1.25138615e-01, 2.48516614e+00],
       ...,
       [1.05055615e+02, 4.94858011e-01, 2.52075927e+01, ...,
        1.80695744e+01, 1.25115936e-01, 8.33321314e+00],
       [1.25147502e-01, 2.27058083e+02, 5.45176328e+00, ...,
        1.41751120e+00, 7.67217701e+01, 4.49861794e+01],
       [1.25096012e-01, 4.05666840e+00, 1.25049904e-01, ...,
        1.63821915e+02, 1.25049991e-01, 1.49550227e-01]])
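Because the values above are pseudo-counts rather than probabilities, it can help to normalize each row before comparing topics. The sketch below uses a small made-up matrix as a stand-in for `lda.components_` (the numbers are toy values chosen for illustration); the same division should apply to the real attribute.

```python
import numpy as np

# Stand-in for lda.components_: 2 topics x 4 words of unnormalized pseudo-counts.
# Toy values, not taken from the fitted model.
components = np.array([
    [6.0, 3.0, 0.5, 0.5],
    [0.125, 0.125, 5.0, 2.75],
])

# Divide each row by its sum so every topic becomes a distribution over words.
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(topic_word_probs.round(3))
print(topic_word_probs.sum(axis=1))  # each row now sums to 1
```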

**Inspect the top words for each topic**

In [4]:
def display_topic_words(model, feature_names, no_top_words):
    for topic_index, topic in enumerate(model.components_):
        print('\nTopic #', topic_index)

        # Sort the topic's row of components_ by value, largest first, and keep the array indexes
        topic_word_indexes = topic.argsort()[::-1] # descending order
        top_indexes = topic_word_indexes[:no_top_words]
        
        # For each index in top_indexes, look up the word in feature_names and concatenate the results with join()
        feature_concat = ' + '.join([str(feature_names[i]) + '*' + str(round(topic[i], 1)) for i in top_indexes])                
        print(feature_concat)

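# Get the names of all word features from the CountVectorizer object with get_feature_names_out()
# (scikit-learn 1.0+; earlier releases used get_feature_names(), which was removed in 1.2)
feature_names = count_vect.get_feature_names_out()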

# Show only the 15 highest-weighted words for each topic
display_topic_words(lda, feature_names, 15)


Topic # 0
year*729.8 + said*697.2 + don*585.9 + didn*519.9 + know*480.2 + game*478.7 + just*470.3 + time*467.4 + went*430.7 + people*423.2 + think*408.4 + did*388.2 + like*385.6 + say*381.8 + home*374.4

Topic # 1
god*2036.0 + people*949.8 + jesus*687.8 + church*659.0 + think*633.1 + believe*625.1 + christ*549.7 + say*539.7 + does*521.7 + don*480.7 + christian*473.4 + know*450.2 + christians*434.9 + bible*426.1 + faith*415.3

Topic # 2
know*892.5 + does*680.0 + thanks*656.4 + like*429.4 + question*342.2 + information*341.0 + help*317.9 + time*288.5 + post*284.4 + advance*274.9 + book*274.2 + just*263.6 + looking*256.2 + group*253.2 + read*249.4

Topic # 3
edu*1681.5 + com*805.4 + graphics*779.5 + mail*521.5 + ftp*480.6 + information*446.1 + available*445.8 + data*445.3 + pub*442.9 + list*411.8 + computer*384.1 + send*339.5 + software*339.3 + ca*294.2 + 3d*290.1

Topic # 4
israel*837.6 + jews*722.7 + jewish*518.2 + israeli*476.1 + dos dos*401.1 + arab*386.1 + turkish*382.1 + people*364

**Inspect the topic distribution for each document**
- Calling `transform()` on the `lda` object returns the topic distribution of each document

In [5]:
doc_topics = lda.transform(feat_vect)
print(doc_topics.shape)
print(doc_topics[:3])

(7862, 8)
[[0.01392621 0.01392753 0.90257079 0.01389129 0.01389917 0.01389072
  0.01398844 0.01390584]
 [0.16469595 0.00212157 0.53426711 0.00212271 0.00212121 0.00212044
  0.17359772 0.11895329]
 [0.00544169 0.00544092 0.00545121 0.00543707 0.0054391  0.23298243
  0.00543968 0.7343679 ]]
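Each row printed above is a probability distribution over the 8 topics, so a simple way to summarize a document is to take its highest-probability topic. The sketch below copies the three printed rows into an array (named `doc_topics_sample` here for illustration) and checks both points; with the full result one would presumably call `doc_topics.argmax(axis=1)` instead.

```python
import numpy as np

# The three rows printed above, copied from the notebook output.
doc_topics_sample = np.array([
    [0.01392621, 0.01392753, 0.90257079, 0.01389129,
     0.01389917, 0.01389072, 0.01398844, 0.01390584],
    [0.16469595, 0.00212157, 0.53426711, 0.00212271,
     0.00212121, 0.00212044, 0.17359772, 0.11895329],
    [0.00544169, 0.00544092, 0.00545121, 0.00543707,
     0.0054391, 0.23298243, 0.00543968, 0.7343679],
])

# Each row should sum to (approximately) 1.
row_sums = doc_topics_sample.sum(axis=1)

# Index of the most probable topic for each document.
dominant_topic = doc_topics_sample.argmax(axis=1)

print(row_sums)
print(dominant_topic)
```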


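**Print the topic distribution for each document**
- First build a name for each 20 Newsgroups document
- The `filenames` attribute of the data returned by `fetch_20newsgroups()` holds the file path of every document
- Each entry in `filenames` is an absolute path, so split it on `'\\'` (the Windows path separator) and join the last two components to form the document name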

In [6]:
def get_filename_list(newsdata):
    filename_list = []

    for file in newsdata.filenames:
        # Keep only the last two path components: the newsgroup directory and the file ID.
        # Splitting on '\\' assumes Windows-style paths.
        filename_temp = file.split('\\')[-2:]
        filename = '.'.join(filename_temp)
        filename_list.append(filename)

    return filename_list

filename_list = get_filename_list(news_df)
print("Number of filenames:", len(filename_list), "First 10 filenames:", filename_list[:10])

Number of filenames: 7862 First 10 filenames: ['soc.religion.christian.20630', 'sci.med.59422', 'comp.graphics.38765', 'comp.graphics.38810', 'sci.med.59449', 'comp.graphics.38461', 'comp.windows.x.66959', 'rec.motorcycles.104487', 'sci.electronics.53875', 'sci.electronics.53617']
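The function above splits on `'\\'`, which only works when the paths use Windows separators. A separator-independent variant can be sketched with `pathlib`. The helper name `doc_label` and both example paths are hypothetical and only illustrate the idea; they are not the actual cache location used by `fetch_20newsgroups()`.

```python
from pathlib import PurePosixPath, PureWindowsPath

def doc_label(path):
    """Return '<newsgroup>.<file id>' from the last two components of a path object."""
    return '.'.join(path.parts[-2:])

# Hypothetical example paths, one per separator style.
win_path = PureWindowsPath(r'C:\data\20news_home\20news-bydate-test\sci.med\59422')
posix_path = PurePosixPath('/data/20news_home/20news-bydate-test/sci.med/59422')

print(doc_label(win_path))
print(doc_label(posix_path))
```

On a real run, wrapping each entry of `news_df.filenames` in `pathlib.Path` before calling the helper should give the same labels on any operating system.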


**Build a DataFrame to inspect the topic distribution of each document**

In [7]:
import pandas as pd 

topic_names = ['Topic #' + str(i) for i in range(0, 8)]
doc_topic_df = pd.DataFrame(data = doc_topics, columns = topic_names, index = filename_list)
doc_topic_df.head(20)

| | Topic #0 | Topic #1 | Topic #2 | Topic #3 | Topic #4 | Topic #5 | Topic #6 | Topic #7 |
|---|---|---|---|---|---|---|---|---|
| soc.religion.christian.20630 | 0.013926 | 0.013928 | 0.902571 | 0.013891 | 0.013899 | 0.013891 | 0.013988 | 0.013906 |
| sci.med.59422 | 0.164696 | 0.002122 | 0.534267 | 0.002123 | 0.002121 | 0.00212 | 0.173598 | 0.118953 |
| comp.graphics.38765 | 0.005442 | 0.005441 | 0.005451 | 0.005437 | 0.005439 | 0.232982 | 0.00544 | 0.734368 |
| comp.graphics.38810 | 0.005438 | 0.005445 | 0.09498 | 0.336259 | 0.005441 | 0.157695 | 0.00544 | 0.3893 |
| sci.med.59449 | 0.085576 | 0.006593 | 0.289101 | 0.006589 | 0.006591 | 0.006587 | 0.006591 | 0.592371 |
| comp.graphics.38461 | 0.008342 | 0.008349 | 0.008349 | 0.008344 | 0.008338 | 0.210398 | 0.008337 | 0.739542 |
| comp.windows.x.66959 | 0.041676 | 0.041667 | 0.041679 | 0.041685 | 0.041672 | 0.334088 | 0.415862 | 0.041671 |
| rec.motorcycles.104487 | 0.211533 | 0.004813 | 0.004817 | 0.004815 | 0.004822 | 0.00481 | 0.004827 | 0.759563 |
| sci.electronics.53875 | 0.245428 | 0.008932 | 0.008933 | 0.008955 | 0.008937 | 0.008936 | 0.00894 | 0.700939 |
| sci.electronics.53617 | 0.041695 | 0.041714 | 0.041742 | 0.041667 | 0.041688 | 0.041785 | 0.041703 | 0.708007 |
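One possible next step is to average the topic distribution per newsgroup, to see which learned topic each newsgroup leans toward. The sketch below rebuilds four rows of the table above by hand and assumes the index labels follow the `<newsgroup>.<file id>` pattern; the names `sample`, `newsgroup`, `mean_by_group` and `top_topic` are illustrative. Four rows are far too few to draw conclusions from, so this only shows the mechanics.

```python
import pandas as pd

topic_names = ['Topic #' + str(i) for i in range(8)]

# Four rows copied from the table above.
sample = pd.DataFrame(
    [
        [0.164696, 0.002122, 0.534267, 0.002123, 0.002121, 0.00212, 0.173598, 0.118953],
        [0.005442, 0.005441, 0.005451, 0.005437, 0.005439, 0.232982, 0.00544, 0.734368],
        [0.085576, 0.006593, 0.289101, 0.006589, 0.006591, 0.006587, 0.006591, 0.592371],
        [0.008342, 0.008349, 0.008349, 0.008344, 0.008338, 0.210398, 0.008337, 0.739542],
    ],
    columns=topic_names,
    index=['sci.med.59422', 'comp.graphics.38765', 'sci.med.59449', 'comp.graphics.38461'],
)

# Strip the trailing file ID to recover the newsgroup name of each row.
newsgroup = pd.Index([label.rsplit('.', 1)[0] for label in sample.index], name='newsgroup')

# Average the topic distribution within each newsgroup, then pick the largest topic.
mean_by_group = sample.groupby(newsgroup).mean()
top_topic = mean_by_group.idxmax(axis=1)

print(mean_by_group.round(3))
print(top_topic)
```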
