# Chapter 8. Text Analysis

## Topic Modeling

Topic modeling is the task of **discovering the hidden topics in a collection of documents**. <br>
Two techniques commonly used for ML-based topic modeling are LSA (Latent Semantic Analysis) and LDA (Latent Dirichlet Allocation). <br>
sklearn provides LDA-based topic modeling through the LatentDirichletAllocation class.

### 20 Newsgroups Topic Modeling

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Extract 8 topics: motorcycles, baseball, graphics, X Window System, Middle East, Christianity, electronics, and medicine
cats = ['rec.motorcycles', 'rec.sport.baseball', 'comp.graphics', 'comp.windows.x',
        'talk.politics.mideast', 'soc.religion.christian', 'sci.electronics', 'sci.med'  ]

# Load only the categories listed in cats by passing cats to the categories parameter of fetch_20newsgroups()
news_df = fetch_20newsgroups(subset = 'all',remove = ('headers', 'footers', 'quotes'), 
                            categories = cats, random_state = 0)

# LDA works on word counts, so apply only a count-based vectorizer (CountVectorizer)
count_vect = CountVectorizer(max_df = 0.95, max_features = 1000, min_df = 2, stop_words = 'english', ngram_range = (1,2))
feat_vect = count_vect.fit_transform(news_df.data)
print('CountVectorizer Shape:', feat_vect.shape)

CountVectorizer Shape: (7862, 1000)
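
The effect of `max_df`, `min_df`, and `ngram_range` is easier to see on a tiny corpus. The following is a minimal sketch, assuming three made-up sentences (the `docs` list is hypothetical and not part of the 20 Newsgroups data). Terms that occur in more than 95% of the documents are dropped, terms that occur in fewer than two documents are dropped, and two-word phrases are added as features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
docs = [
    "the engine of the bike is loud",
    "the pitcher threw the ball",
    "the bike engine needs oil",
]

# max_df=0.95 removes terms found in more than 95% of documents ("the"),
# min_df=2 removes terms found in fewer than 2 documents,
# ngram_range=(1, 2) adds bigrams such as "the bike" alongside single words
vect = CountVectorizer(max_df=0.95, min_df=2, ngram_range=(1, 2))
X = vect.fit_transform(docs)

vocab = sorted(vect.vocabulary_)
print(vocab)    # ['bike', 'engine', 'the bike']
print(X.shape)  # (3, 3)
```

In the cell above, `max_features = 1000` additionally keeps only the 1000 most frequent terms, which is why the second dimension of the shape is 1000.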


In [2]:
count_vect.get_feature_names_out()

array(['00', '000', '01', '02', '03', '04', '05', '10', '100', '11', '12',
       '128', '13', '14', '15', '16', '17', '18', '19', '1990', '1991',
       '1992', '1993', '20', '200', '21', '22', '23', '24', '24 bit',
       '25', '256', '26', '27', '28', '29', '30', '300', '31', '32', '35',
       '3d', '40', '44', '50', '500', '60', '80', '800', '90', '91', '92',
       '93', 'ability', 'able', 'ac', 'accept', 'accepted', 'access',
       'according', 'act', 'action', 'actions', 'acts', 'actually', 'add',
       'added', 'addition', 'address', 'adl', 'advance', 'age', 'ago',
       'agree', 'aids', 'al', 'allow', 'american', 'amiga', 'analysis',
       'anonymous', 'anonymous ftp', 'answer', 'answers', 'anti',
       'anybody', 'apartment', 'apparently', 'appear', 'appears',
       'application', 'applications', 'apply', 'appreciate',
       'appreciated', 'approach', 'appropriate', 'april', 'arab', 'arabs',
       'archive', 'area', 'areas', 'aren', 'argic', 'argument', 'armenia',
  

**Create the LDA object and fit it on the count feature vectors**

In [5]:
lda = LatentDirichletAllocation(n_components = 8, random_state = 0)
lda.fit(feat_vect)

In [6]:
print(lda.components_.shape)
lda.components_

(8, 1000)


array([[3.60992018e+01, 1.35626798e+02, 2.15751867e+01, ...,
        3.02911688e+01, 8.66830093e+01, 6.79285199e+01],
       [1.25199920e-01, 1.44401815e+01, 1.25045596e-01, ...,
        1.81506995e+02, 1.25097844e-01, 9.39593286e+01],
       [3.34762663e+02, 1.25176265e-01, 1.46743299e+02, ...,
        1.25105772e-01, 3.63689741e+01, 1.25025218e-01],
       ...,
       [3.60204965e+01, 2.08640688e+01, 4.29606813e+00, ...,
        1.45056650e+01, 8.33854413e+00, 1.55690009e+01],
       [1.25128711e-01, 1.25247756e-01, 1.25005143e-01, ...,
        9.17278769e+01, 1.25177668e-01, 3.74575887e+01],
       [5.49258690e+01, 4.47009532e+00, 9.88524814e+00, ...,
        4.87048440e+01, 1.25034678e-01, 1.25074632e-01]])
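
Each row of `components_` holds one topic and each column one word. The scikit-learn documentation describes these values as pseudo-counts that become a per-topic word distribution once each row is normalized. The sketch below shows that normalization on a small hypothetical matrix (the numbers are illustrative, not taken from the fitted model). The many values close to 0.125 in the output above are probably the default topic-word prior of `1 / n_components`, although that reading is an assumption.

```python
import numpy as np

# Hypothetical pseudo-counts laid out like lda.components_
# (rows = topics, columns = words); values are illustrative only
components = np.array([
    [36.1, 135.6, 0.125, 30.3],
    [0.125, 14.4, 181.5, 0.125],
])

# Divide each row by its sum so every topic becomes a probability
# distribution over the vocabulary
topic_word_dist = components / components.sum(axis=1, keepdims=True)

print(topic_word_dist.round(3))
print(topic_word_dist.sum(axis=1))
```

With the fitted model, the same expression applied to `lda.components_` would give an `(8, 1000)` array whose rows each sum to 1.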

**Inspect the key words of each topic**

In [8]:
def display_topic_words(model, feature_names, no_top_words):
    for topic_index, topic in enumerate(model.components_):
        print('Topic', topic_index)

        # Sort this components_ row in descending order of value and get the corresponding array indexes
        topic_word_indexes = topic.argsort()[::-1]
        top_indexes = topic_word_indexes[:no_top_words]
        
        # Look up the word for each index in top_indexes in feature_names and join the words into one string
        feature_concat = ' '.join([str(feature_names[i]) for i in top_indexes])
        #feature_concat = ' + '.join([str(feature_names[i])+'*'+str(round(topic[i],1)) for i in top_indexes])                
        print(feature_concat)

# Get the names of all word features in the CountVectorizer object via get_feature_names_out()
feature_names = count_vect.get_feature_names_out()

# Show only the 15 most relevant words for each topic
display_topic_words(lda, feature_names, 15)

# For reference, the 8 source categories: motorcycles, baseball, graphics, X Window System, Middle East, Christianity, electronics, and medicine

Topic 0
year 10 game medical health team 12 20 disease cancer 1993 games years patients good
Topic 1
don just like know people said think time ve didn right going say ll way
Topic 2
image file jpeg program gif images output format files color entry 00 use bit 03
Topic 3
like know don think use does just good time book read information people used post
Topic 4
armenian israel armenians jews turkish people israeli jewish government war dos dos turkey arab armenia 000
Topic 5
edu com available graphics ftp data pub motif mail widget software mit information version sun
Topic 6
god people jesus church believe christ does christian say think christians bible faith sin life
Topic 7
use dos thanks windows using window does display help like problem server need know run


**Inspect the topic distribution of individual documents**

Calling transform() on the lda object returns the topic distribution of each document.

In [9]:
doc_topics = lda.transform(feat_vect)
print(doc_topics.shape)
print(doc_topics[:3])

(7862, 8)
[[0.01389701 0.01394362 0.01389104 0.48221844 0.01397882 0.01389205
  0.01393501 0.43424401]
 [0.27750436 0.18151826 0.0021208  0.53037189 0.00212129 0.00212102
  0.00212113 0.00212125]
 [0.00544459 0.22166575 0.00544539 0.00544528 0.00544039 0.00544168
  0.00544182 0.74567512]]
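
Each row of `doc_topics` is a probability distribution over the 8 topics, so it sums to 1 and its largest entry marks the dominant topic. The sketch below re-enters the three rows printed above as a literal array (named `doc_topics_head` here for illustration) and checks both properties.

```python
import numpy as np

# The three rows printed above, copied by hand
doc_topics_head = np.array([
    [0.01389701, 0.01394362, 0.01389104, 0.48221844,
     0.01397882, 0.01389205, 0.01393501, 0.43424401],
    [0.27750436, 0.18151826, 0.0021208, 0.53037189,
     0.00212129, 0.00212102, 0.00212113, 0.00212125],
    [0.00544459, 0.22166575, 0.00544539, 0.00544528,
     0.00544039, 0.00544168, 0.00544182, 0.74567512],
])

# Every row should sum to (approximately) 1
row_sums = doc_topics_head.sum(axis=1)

# Index of the highest-probability topic per document
dominant_topic = doc_topics_head.argmax(axis=1)
print(dominant_topic)  # [3 3 7]
```

Applied to the full array, `doc_topics.argmax(axis=1)` would give one dominant topic index for each of the 7862 documents.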


**Display the per-document topic distribution**

In [11]:
def get_filename_list(newsdata):
    filename_list = []

    for file in newsdata.filenames:
        # print(file)
        # NOTE: splitting on '\\' assumes Windows-style paths. With '/'-separated paths
        # (as in the output below) nothing is split and the full path is kept.
        filename_temp = file.split('\\')[-2:]
        filename = '.'.join(filename_temp)
        filename_list.append(filename)

    return filename_list

filename_list = get_filename_list(news_df)
print("Number of filenames:", len(filename_list), "First 10 filenames:", filename_list[:10])

Number of filenames: 7862 First 10 filenames: ['/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/soc.religion.christian/20630', '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-test/sci.med/59422', '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/38765', '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/38810', '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-test/sci.med/59449', '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38461', '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/comp.windows.x/66959', '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/rec.motorcycles/104487', '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/sci.electronics/53875', '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/sci.electronics/53617']


In [12]:
news_df.filenames

array(['/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/soc.religion.christian/20630',
       '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-test/sci.med/59422',
       '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/38765',
       ...,
       '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/rec.sport.baseball/102656',
       '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/sci.electronics/53606',
       '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/talk.politics.mideast/76505'],
      dtype='<U97')
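
The paths above use `/` as the separator, so splitting on `'\\'` leaves them unchanged and the full path ends up as the document label. A minimal sketch of a separator-agnostic alternative follows. The helper name `short_doc_name` and the Windows-style example path are hypothetical and only illustrate the idea of keeping the last two path components.

```python
import re

def short_doc_name(path):
    """Return 'category.doc_id' from a path that uses '/' or '\\' separators."""
    # Split on either separator and keep the last two components
    parts = re.split(r'[\\/]', path)
    return '.'.join(parts[-2:])

# A path taken from the output above
posix_path = '/Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-test/sci.med/59422'
print(short_doc_name(posix_path))    # sci.med.59422

# Hypothetical Windows-style path, for comparison
windows_path = r'C:\data\20news-bydate-test\sci.med\59422'
print(short_doc_name(windows_path))  # sci.med.59422
```

Using such a helper in place of `file.split('\\')` would give short labels like `sci.med.59422` on either platform. The outputs shown in this section were produced with the original function and therefore keep the full paths.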

**Build a DataFrame to inspect the per-document topic distribution**

In [13]:
import pandas as pd 

topic_names = ['Topic'+ str(i) for i in range(0, 8)]
doc_topic_df = pd.DataFrame(data = doc_topics, columns = topic_names, index = filename_list)
doc_topic_df.head(20)

| | Topic0 | Topic1 | Topic2 | Topic3 | Topic4 | Topic5 | Topic6 | Topic7 |
|---|---|---|---|---|---|---|---|---|
| /Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/soc.religion.christian/20630 | 0.013897 | 0.013944 | 0.013891 | 0.482218 | 0.013979 | 0.013892 | 0.013935 | 0.434244 |
| /Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-test/sci.med/59422 | 0.277504 | 0.181518 | 0.002121 | 0.530372 | 0.002121 | 0.002121 | 0.002121 | 0.002121 |
| /Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/38765 | 0.005445 | 0.221666 | 0.005445 | 0.005445 | 0.00544 | 0.005442 | 0.005442 | 0.745675 |
| /Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-test/comp.graphics/38810 | 0.005439 | 0.005441 | 0.005449 | 0.578959 | 0.00544 | 0.388387 | 0.005442 | 0.005442 |
| /Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-test/sci.med/59449 | 0.006584 | 0.552 | 0.006587 | 0.408485 | 0.006585 | 0.006585 | 0.006588 | 0.006585 |
| /Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/comp.graphics/38461 | 0.008342 | 0.008352 | 0.182622 | 0.767314 | 0.008335 | 0.008341 | 0.008343 | 0.008351 |
| /Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/comp.windows.x/66959 | 0.372861 | 0.041667 | 0.37702 | 0.041668 | 0.041703 | 0.041703 | 0.041667 | 0.041711 |
| /Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/rec.motorcycles/104487 | 0.225351 | 0.674669 | 0.004814 | 0.07592 | 0.004812 | 0.004812 | 0.004812 | 0.00481 |
| /Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/sci.electronics/53875 | 0.008944 | 0.836686 | 0.008932 | 0.008941 | 0.008935 | 0.109691 | 0.008932 | 0.008938 |
| /Users/1001l1000/scikit_learn_data/20news_home/20news-bydate-train/sci.electronics/53617 | 0.041733 | 0.04172 | 0.708081 | 0.041742 | 0.041671 | 0.041669 | 0.041699 | 0.041686 |
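
Once the distributions are in a DataFrame, a natural next step is to label each document with its dominant topic and to average the distributions per newsgroup. The sketch below does this for four rows copied from the table above. The shortened `category/doc_id` index labels and the names `sample_df`, `dominant_topic`, and `category` are assumptions made for readability, not part of the original notebook.

```python
import pandas as pd

topic_names = ['Topic' + str(i) for i in range(0, 8)]

# Four rows copied from the table above, with shortened (hypothetical) index labels
rows = {
    'sci.med/59422': [0.277504, 0.181518, 0.002121, 0.530372, 0.002121, 0.002121, 0.002121, 0.002121],
    'comp.graphics/38765': [0.005445, 0.221666, 0.005445, 0.005445, 0.00544, 0.005442, 0.005442, 0.745675],
    'comp.graphics/38810': [0.005439, 0.005441, 0.005449, 0.578959, 0.00544, 0.388387, 0.005442, 0.005442],
    'sci.med/59449': [0.006584, 0.552, 0.006587, 0.408485, 0.006585, 0.006585, 0.006588, 0.006585],
}
sample_df = pd.DataFrame.from_dict(rows, orient='index', columns=topic_names)

# Column label of the highest-probability topic for each document
sample_df['dominant_topic'] = sample_df[topic_names].idxmax(axis=1)

# Newsgroup name taken from the part of the label before '/'
sample_df['category'] = sample_df.index.str.split('/').str[0]

# Average topic distribution per newsgroup
mean_by_category = sample_df.groupby('category')[topic_names].mean()

print(sample_df[['dominant_topic', 'category']])
print(mean_by_category.round(3))
```

On the full `doc_topic_df`, a per-category average like this would be one way to see which learned topics line up with which of the 8 newsgroups.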
