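In this section we discuss $\textbf{topic modeling}$: assigning topics to unlabeled text. For example, given a dataset of movie reviews, we may want to infer which kinds of movies the reviews are about (in other words, which categories the reviews can be grouped into). A commonly used technique is $\textbf{Latent Dirichlet Allocation}$ (LDA).
LDA looks for groups of words that frequently appear together across documents. It takes the bag-of-words model discussed earlier as input and decomposes it into two new matrices: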

1. A document-to-topic matrix
2. A word-to-topic matrix

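The decomposition is chosen so that multiplying the two matrices reproduces the input (the bag-of-words matrix) as closely as possible. This is why the number of topics, a hyperparameter, must be specified beforehand.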
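To make the shapes of the two matrices concrete, here is a minimal sketch on a toy corpus. The four sentences and the names `docs`, `bow`, and `lda_toy` are made up for illustration and are not part of the movie review dataset. Note that scikit-learn stores the second matrix as topics by words (one row per topic) in `components_`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, for illustration only
docs = [
    "soldiers fought in the war",
    "the war ended and soldiers returned",
    "the ghost haunted the old house",
    "a haunted house full of ghosts",
]

bow = CountVectorizer().fit_transform(docs)  # bag-of-words matrix
n_docs, n_words = bow.shape

# The number of topics (2 here) has to be chosen up front
lda_toy = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda_toy.fit_transform(bow)  # document-to-topic matrix
topic_word = lda_toy.components_        # topic-to-word weights

print("bag-of-words:", bow.shape)        # (n_docs, n_words)
print("doc-topic:   ", doc_topic.shape)  # (n_docs, n_topics)
print("topic-word:  ", topic_word.shape) # (n_topics, n_words)

# Each row of the document-to-topic matrix is a distribution over topics
print(doc_topic.sum(axis=1))
```

The inner dimension shared by the two matrices is the number of topics, which is why it has to be fixed before fitting.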

In [1]:
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


In [2]:
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english',
                        max_df=.1,  # drop words that occur in more than 10% of the documents (a hyperparameter)
                        max_features=5000)  # keep only the 5000 most frequent words as features
X = count.fit_transform(df['review'].values)
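The effect of `max_df` is easier to see on a tiny example. The sketch below uses four made-up sentences (not from `movie_data.csv`) and a threshold of 0.6 instead of 0.1, simply because the toy corpus is so small. Words whose document frequency exceeds the threshold are dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, for illustration only
docs = [
    "the movie was great",
    "the movie was bad",
    "the plot was thin",
    "the acting was superb",
]

# Drop every word that appears in more than 60% of the documents
cv = CountVectorizer(max_df=0.6)
cv.fit(docs)

# "the" and "was" occur in all 4 documents and are removed;
# "movie" occurs in 2 of 4 (50%) and is kept
print(sorted(cv.vocabulary_))
```

In the notebook cell above, the same idea is applied with a much stricter threshold (10%), on the assumption that words appearing in many reviews say little about any particular topic.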

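We assume the reviews can be split into ten topics. With learning_method='batch', the LDA estimator bases its estimate on all available training data (the bag-of-words matrix) in each iteration. This is slower than the alternative 'online' learning method, but it can lead to more accurate results.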

In [4]:
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10, # 10 topics
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)

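The components_ attribute of the lda object is a 10x5000 matrix: one row per topic, holding an importance weight for each of the 5000 words. The columns follow the vocabulary order of the vectorizer and are not sorted by importance, so ranking the words requires an explicit argsort.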

In [5]:
lda.components_.shape

(10, 5000)
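As a rough sketch of how to read this matrix, the example below fits LDA on a made-up toy corpus (the names `docs`, `cv`, and `lda_toy` are hypothetical), maps column indices back to words through the vectorizer's `vocabulary_`, and normalizes each row so it can be read as a distribution over words for that topic.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical toy corpus, for illustration only
docs = [
    "soldiers fought in the war",
    "the war ended and soldiers returned",
    "the ghost haunted the old house",
    "a haunted house full of ghosts",
]

cv = CountVectorizer()
bow = cv.fit_transform(docs)
lda_toy = LatentDirichletAllocation(n_components=2, random_state=0).fit(bow)

# Column j of components_ belongs to the word whose vocabulary index is j
index_to_word = {idx: word for word, idx in cv.vocabulary_.items()}

# Normalize each row so that it sums to 1 (a word distribution per topic)
weights = lda_toy.components_
topic_word_dist = weights / weights.sum(axis=1)[:, np.newaxis]

for k, row in enumerate(topic_word_dist):
    best = int(row.argmax())
    print("Topic %d: top word = %s (p = %.2f)"
          % (k + 1, index_to_word[best], row[best]))
```

The raw values in `components_` are unnormalized weights, so only their relative sizes within a row matter when ranking words.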

Let's look at the 10 most important words for each topic.

In [8]:
n_top_words = 10
# scikit-learn >= 1.0; on older versions use count.get_feature_names()
feature_names = count.get_feature_names_out()

for topic_idx, topic in enumerate(lda.components_):
    print("Topic %d:" % (topic_idx + 1))
    print(" ".join([feature_names[i]
                    for i in topic.argsort()\
                        [:-n_top_words - 1:-1]]))

Topic 1:
worst minutes awful script stupid terrible money waste budget ll
Topic 2:
family mother father children girl loved kids watched friends feel
Topic 3:
american war dvd music tv history german black early america
Topic 4:
human audience cinema art sense feel viewer game different camera
Topic 5:
police guy car dead murder wife goes town killed crime
Topic 6:
horror house sex girl woman blood gore creepy scary night
Topic 7:
role performance comedy actor performances plays played play john excellent
Topic 8:
series episode war episodes tv season action star king western
Topic 9:
book version original read novel disney effects fi sci animation
Topic 10:
action fight guy guys cool fun minutes music nice fighting
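The slice `argsort()[:-n_top_words - 1:-1]` used in the loop above can look cryptic. The following sketch walks through it with plain NumPy on made-up weights; the array `weights` and the list `words` are hypothetical stand-ins for one row of `lda.components_` and for `feature_names`.

```python
import numpy as np

# Hypothetical weights for five words in a single topic
weights = np.array([0.2, 3.1, 0.7, 5.0, 1.4])
words = ["a", "b", "c", "d", "e"]
n_top = 3

# argsort returns the indices that would sort the weights in ascending order
ascending = weights.argsort()
print(ascending)   # [0 2 4 1 3]

# Walking backwards from the end and stopping after n_top items
# yields the indices of the largest weights, largest first
top_idx = ascending[:-n_top - 1:-1]
print(top_idx)     # [3 1 4]

top_words = [words[i] for i in top_idx]
print(top_words)   # ['d', 'b', 'e']
```

An equivalent and arguably more readable spelling is `weights.argsort()[::-1][:n_top]`.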


Based on reading the 10 most important words for each topic, we may guess that the LDA identified the following topics:
    
1. Generally bad movies (not really a topic category)
2. Movies about families
3. War movies
4. Art movies
5. Crime movies
6. Horror movies
7. Comedies
8. Movies somehow related to TV shows
9. Movies based on books
10. Action movies

To check whether the categories make sense, let's print the first 300 characters of the 3 reviews that score highest for the war movie category (category 3 at index position 2):

In [10]:
# Review indices sorted by their weight for topic 3 (index 2), highest first
war = X_topics[:, 2].argsort()[::-1]

for iter_idx, movie_idx in enumerate(war[:3]):
    print('\nWar movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')


War movie #1:
In the 1980s in wrestling the world was simple. Hulk Hogan would take on Roddy Piper, or Bobby Heenan's cronies or Ted DiBiase and come out victorious more often than not. Occasionally he would get an ally like Randy Savage in 1988, but mostly it was all about Hulk Hogan vs Bobby Heenan, and that's  ...

War movie #2:
Hollywood Hotel was the last movie musical that Busby Berkeley directed for Warner Bros. His directing style had changed or evolved to the point that this film does not contain his signature overhead shots or huge production numbers with thousands of extras. By the last few years of the Thirties, sw ...

War movie #3:
This is a movie about the music that is currently being played in Istanbul. Istanbul was the center of the two Old World superpowers, the Byzantine Empire and the Ottoman Empire. Today, it is a megalopolis of almost 10 million. So it is to no ones surprise that a lot of music is being played in Ista ...


Hmm... let's try topic 1 (index position 0) instead.

In [12]:
# Review indices sorted by their weight for topic 1 (index 0), highest first
bad = X_topics[:, 0].argsort()[::-1]

for iter_idx, movie_idx in enumerate(bad[:3]):
    print('\nGenerally bad movies movie #%d:' % (iter_idx + 1))
    print(df['review'][movie_idx][:300], '...')


Generally bad movies movie #1:
I went into this with my hopes up, by twenty minutes into the movie I couldn't have been more let down. Despite thinking that this would be another horribly bad remake, I kept my hopes high that maybe...just maybe someone would get it right this time around. Sadly, Prom Night is about on the same qu ...

Generally bad movies movie #2:
It used to be that video distributors like Sub Rosa and Brain Damage Films would release low-budget, shot-on-video horror films to a select market of gorehounds that ate them up with glee. That's acceptable to me, because you could see these movies from a mile away with their shoddy box art and chee ...

Generally bad movies movie #3:
It seems like anybody can make a movie nowadays. It's like all you need is a camera, a group of people to be your cast and crew, a script, and a little money and walla you have a movie. Problem is that talent isn't always part of this equation and often times these kind of low budget films tur