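In this chapter we learn how to analyze the sentiment of a text (positive or negative). Before that, we need to learn how to convert words into numbers so that a computer can work with them. We start with the following code, which converts the words in three sentences into numbers.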

In [2]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
docs = np.array([
        'The sun is shining',
        'The weather is sweet',
        'The sun is shining, the weather is sweet, and one and one is two'])
bag = count.fit_transform(docs)

In [3]:
print(count.vocabulary_)

{'the': 6, 'sun': 4, 'is': 1, 'shining': 3, 'weather': 8, 'sweet': 5, 'and': 0, 'one': 2, 'two': 7}


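So these three sentences contain 9 distinct words in total, indexed 0 through 8. CountVectorizer assigns the indices in alphabetical order, not by frequency. Each sentence can then be represented by a vector $a$ whose length equals the vocabulary size (9 here); its component $a_t$ is the number of times the word with index $t$ occurs in that sentence. This count is called the term frequency of term $t$ in document $d$, written $tf(t,d)$.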

In [4]:
print(bag.toarray())

[[0 1 0 1 1 0 1 0 0]
 [0 1 0 0 0 1 1 0 1]
 [2 3 2 1 1 1 2 1 1]]
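
The counting behind this output can be sketched by hand. The snippet below is a minimal pure-Python approximation, not scikit-learn's implementation: the helper names tokenize and count_vector are introduced here for illustration, and the regular expression is meant to mirror CountVectorizer's default token pattern (words of two or more characters).

```python
import re
from collections import Counter

docs = [
    'The sun is shining',
    'The weather is sweet',
    'The sun is shining, the weather is sweet, and one and one is two',
]

def tokenize(text):
    # Lowercase the text and keep runs of two or more word characters.
    return re.findall(r'\b\w\w+\b', text.lower())

# Give every distinct word an index, in alphabetical order.
all_words = sorted({word for doc in docs for word in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(all_words)}

def count_vector(text):
    # One component per vocabulary entry: how often that word occurs.
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in all_words]

bag = [count_vector(doc) for doc in docs]
for row in bag:
    print(row)
```

Under these assumptions the printed rows match the array above, and vocabulary matches the dictionary printed by count.vocabulary_.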


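An "n-gram" model encodes text in groups of n consecutive words. For example, a 1-gram model of "the sun is shining" yields "the", "sun", "is", "shining", while a 2-gram model yields "the sun", "sun is", "is shining". In addition, words that occur too frequently in a text may not carry much meaning: for instance "is", "的", or a swear phrase repeated constantly in the 館長 livestream transcript dataset. We define the term frequency-inverse document frequency (tf-idf) as $$tfidf(t,d)=tf(t,d)\times idf(t,d)$$
where $idf(t,d)$ is the inverse document frequency, $$idf(t,d)=\log\frac{n_d}{1+df(d,t)}$$
where $n_d$ is the total number of documents and $df(d,t)$ is the number of documents $d$ that contain the term $t$.
This definition lowers the weight of frequently occurring words in later processing. We compute tf-idf with scikit-learn's TfidfTransformer (cell In [5]).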
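
As a brief aside on the n-gram grouping described above, here is a minimal pure-Python sketch. The function name ngrams is made up for this illustration and the tokens are obtained with a plain whitespace split.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'the sun is shining'.split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['the', 'sun', 'is', 'shining']
print(bigrams)   # ['the sun', 'sun is', 'is shining']
```

In scikit-learn the same grouping is selected through the vectorizer's ngram_range parameter, for example ngram_range=(2, 2) for bigrams only.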

In [5]:
from sklearn.feature_extraction.text import TfidfTransformer

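# scikit-learn's formula differs slightly from ours: with smooth_idf=True it uses
# idf = ln((1 + n_d) / (1 + df)) + 1 and then L2-normalizes each row. The idea is the same.
tfidf = TfidfTransformer(use_idf=True,
                         norm='l2',
                         smooth_idf=True)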
print(tfidf.fit_transform(count.fit_transform(docs))
      .toarray())

[[0.         0.43370786 0.         0.55847784 0.55847784 0.
  0.43370786 0.         0.        ]
 [0.         0.43370786 0.         0.         0.         0.55847784
  0.43370786 0.         0.55847784]
 [0.50238645 0.44507629 0.50238645 0.19103892 0.19103892 0.19103892
  0.29671753 0.25119322 0.19103892]]
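
These numbers can be checked by hand. The sketch below uses the smoothed idf that scikit-learn documents for smooth_idf=True, $idf(t)=\ln\frac{1+n_d}{1+df(t)}+1$, followed by L2 normalization of each row. It is written in plain Python as an illustration; the names doc_freq and l2_normalize are introduced here and are not part of any library.

```python
import math

# Raw term counts copied from the bag-of-words output
# (columns: and, is, one, shining, sun, sweet, the, two, weather).
counts = [
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0, 1],
    [2, 3, 2, 1, 1, 1, 2, 1, 1],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term appears.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency (natural logarithm).
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in doc_freq]

def l2_normalize(vector):
    # Scale the vector so that its Euclidean length is 1.
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

tfidf_rows = [l2_normalize([tf * w for tf, w in zip(row, idf)])
              for row in counts]

for row in tfidf_rows:
    print([round(v, 8) for v in row])
```

For the first sentence, "is" and "the" appear in all three documents and get idf 1, while "sun" and "shining" appear in two and get about 1.29. After normalization this gives roughly 0.434 and 0.558, in line with the first row above.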


Loading the movie review dataset

In [6]:
import pandas as pd

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.head(3)

Unnamed: 0,review,sentiment
0,"In 1974, the teenager Martha Moxley (Maggie Gr...",1
1,OK... so... I really like Kris Kristofferson a...,0
2,"***SPOILER*** Do not read this, if you think a...",0


First, we use the regular expression library to strip all HTML markup. We do keep emoticons, however, such as :)

In [7]:
import re

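def preprocessor(text):
    text = re.sub(r'<[^>]*>', '', text)  # strip HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',  # collect emoticons
                           text)
    # Lowercase, replace runs of non-word characters with a single space,
    # then append the emoticons with the "nose" hyphen removed.
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text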

In [9]:
df['review'] = df['review'].apply(preprocessor)
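
To see what this cleaning step does to a single string, the sketch below repeats the preprocessor function and runs it on a made-up review snippet. The sample text is an assumption for illustration and is not taken from movie_data.csv.

```python
import re

def preprocessor(text):
    # Remove HTML tags such as <br />.
    text = re.sub(r'<[^>]*>', '', text)
    # Remember emoticons before punctuation is stripped.
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters into spaces,
    # and append the emoticons without their hyphen.
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text

sample = '<br />This movie was GREAT :-) really!'
cleaned = preprocessor(sample)
print(cleaned)  # this movie was great really :)
```

The tag and the punctuation disappear, the text is lowercased, and the emoticon is moved to the end of the string in the normalized form :).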

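Next, we import the nltk library (we will not use it just yet). Its Porter stemmer reduces inflected word forms to a common stem, and its stop-word list lets us remove words that carry little meaning.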

In [11]:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [12]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\USER\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [16]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')[-10:]
 if w not in stop]  # demonstrate the effect

['runner', 'like', 'run', 'run', 'lot']

Next, we train a model with a machine learning algorithm. First, we split the data into a training set and a test set.

In [13]:
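# .loc slices by label and includes the end label, so stop at 24999
# to keep row 25000 out of the training set.
X_train = df.loc[:24999, 'review'].values
y_train = df.loc[:24999, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values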
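
One detail of this kind of split is worth checking: pandas .loc slices by label and includes the end label, whereas .iloc slices by position and excludes it. The sketch below illustrates the difference on a small made-up DataFrame (the name toy and its contents are assumptions for illustration).

```python
import pandas as pd

toy = pd.DataFrame({'review': ['a', 'b', 'c', 'd', 'e'],
                    'sentiment': [1, 0, 1, 0, 1]})

# Label-based slicing includes the end label: rows 0, 1, 2 and 3.
loc_rows = toy.loc[:3, 'review'].values

# Position-based slicing excludes the end position: rows 0, 1 and 2.
iloc_rows = toy.iloc[:3]['review'].values

# Slicing from label 3 onward starts at row 3 in both styles.
rest_rows = toy.loc[3:, 'review'].values

print(len(loc_rows), len(iloc_rows), len(rest_rows))  # 4 3 2
```

With the default integer index, a split written as .loc[:k] and .loc[k:] therefore places row k in both parts.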

We use grid search (GridSearchCV) to find the best hyperparameter configuration for logistic regression.

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf':[False],
               'vect__norm':[None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

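# liblinear supports both the 'l1' and 'l2' penalties searched over above;
# the default solver in recent scikit-learn versions does not support 'l1'.
lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0,
                                                solver='liblinear'))])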

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)
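
The size of the search can be worked out before starting it. The sketch below records only how many options each key of param_grid offers (the name option_counts is introduced here for illustration) and multiplies them out. Each dictionary in the list is searched separately, so their candidate counts add up.

```python
from math import prod

# Number of options per key in each of the two parameter dictionaries.
option_counts = [
    {'vect__ngram_range': 1, 'vect__stop_words': 2, 'vect__tokenizer': 2,
     'clf__penalty': 2, 'clf__C': 3},
    {'vect__ngram_range': 1, 'vect__stop_words': 2, 'vect__tokenizer': 2,
     'vect__use_idf': 1, 'vect__norm': 1,
     'clf__penalty': 2, 'clf__C': 3},
]

# Candidates within one dictionary multiply; dictionaries add.
candidates = sum(prod(grid.values()) for grid in option_counts)

# Every candidate is fitted once per cross-validation fold (cv=5).
fits = candidates * 5

print(candidates, fits)  # 48 240
```

With 240 fits on 25,000 reviews, and half of them using the slower stemming tokenizer, a long running time is to be expected.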

In [18]:
gs_lr_tfidf.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 21.1min
[Parallel(n_jobs=-1)]: Done 192 tasks      | elapsed: 108.9min
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed: 138.5min finished


GridSearchCV(cv=5, error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=False,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                         

In [19]:
print('Best parameter set: %s ' % gs_lr_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_lr_tfidf.best_score_)

Best parameter set: {'clf__C': 10.0, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 1), 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x000001D80E7B86A8>} 
CV Accuracy: 0.897


Now let us predict on the test set and evaluate the result.

In [20]:
clf = gs_lr_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.899
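
A fitted pipeline of this kind can also label new, unseen text. Because the model trained above is not available outside the notebook, the sketch below fits a tiny stand-in pipeline on four made-up sentences. The data and the names toy_clf, train_texts and new_reviews are all assumptions for illustration, and no particular prediction should be read into such a small example.

```python
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training data: 1 = positive, 0 = negative.
train_texts = ['a wonderful and moving film',
               'great acting and a great story',
               'a dull and boring plot',
               'terrible acting and a boring story']
train_labels = [1, 1, 0, 0]

# Same two steps as the pipeline above: tf-idf features, then
# logistic regression.
toy_clf = Pipeline([('vect', TfidfVectorizer()),
                    ('clf', LogisticRegression(random_state=0,
                                               solver='liblinear'))])
toy_clf.fit(train_texts, train_labels)

new_reviews = ['a great and wonderful story', 'boring and terrible']

# predict returns one class label per input text;
# predict_proba returns one probability per class.
predictions = toy_clf.predict(new_reviews)
probabilities = toy_clf.predict_proba(new_reviews)

print(predictions)
print(probabilities.round(3))
```

With the real model, the same calls would be made on clf, passing reviews that have first gone through preprocessor.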
