
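# Course Overview

## Sentiment Analysis
1. Dictionary method
2. Supervised analysis: relies on text that has already been labeled and used for training
3. Unsupervised analysis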

## Training and Test Sets
1. 80%: training set
2. 20%: test set
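
The 80/20 split above can be sketched with scikit-learn's `train_test_split`. The toy texts, the `toy_` names, and the `random_state` value are assumptions made for illustration; they are not part of the course data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 short texts with alternating labels
toy_texts = [f"document {i}" for i in range(10)]
toy_labels = ["pos", "neg"] * 5

# test_size=0.2 holds out 20% for testing; stratify keeps the pos/neg ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, stratify=toy_labels, random_state=42
)

print(len(X_train), len(X_test))
```

Stratifying is optional, but with very small data sets it avoids a test set that contains only one class.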

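## Model Training and Evaluation Workflow
1. Build the feature matrix X:\
preprocess and vectorize the training and test sets.
2. Build the list of y labels (pos | neg)
3. Train the model
4. Run the model on the test data
5. Evaluate the results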
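
The five steps above might look like this end to end. This is only a minimal sketch on made-up sentences; the choice of `CountVectorizer` and `MultinomialNB` and all `toy_` names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical toy data
toy_train = ["good great film", "great acting loved it",
             "bad boring film", "terrible boring plot"]
toy_y_train = ["pos", "pos", "neg", "neg"]           # step 2: y labels
toy_test = ["great film", "boring plot"]
toy_y_test = ["pos", "neg"]

vectorizer = CountVectorizer()
X_tr = vectorizer.fit_transform(toy_train)           # step 1: fit the vocabulary on training text only
X_te = vectorizer.transform(toy_test)                #         reuse the same vocabulary for the test text

clf_nb = MultinomialNB().fit(X_tr, toy_y_train)      # step 3: train
toy_preds = clf_nb.predict(X_te)                     # step 4: run on the test data
print(toy_preds, accuracy_score(toy_y_test, toy_preds))  # step 5: evaluate
```

Note that the vectorizer is fitted on the training set and only applied to the test set, so no information from the test data leaks into the features.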

## A Handy Tool: VADER

# Scikit-Learn: fit and predict, Step by Step

In [None]:
## Slide 2 — Separate Steps Version
## Scikit-Learn: vectorization and model training, step by step

mydir = "/content/dissents/"

from nltk.corpus.reader.plaintext import CategorizedPlaintextCorpusReader
import re

# Define the document pattern and the categories (pos / neg)
# \w covers letters, digits, and underscores
documentPattern = r'[\w\s.]+\.txt'
categoryPattern = r'.*(pos|neg).*'

myCorpus2 = CategorizedPlaintextCorpusReader(
    mydir,
    documentPattern,
    cat_pattern=categoryPattern
)

# Convert every document into one string and collect them in a list
strCorpus = []
for file in myCorpus2.fileids():
    doc = myCorpus2.raw(file)
    doc = re.sub(r"\s+", " ", doc)  # raw string avoids the invalid-escape warning
    strCorpus.append(doc)


## Building X: Term-Frequency Matrix + One-Hot Encoding

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Binarizer

# Build the term-frequency matrix (count matrix)
freq = CountVectorizer()
corpus = freq.fit_transform(strCorpus)

# One-hot encoding (turn counts into 0/1)
onehot = Binarizer()
documents = onehot.fit_transform(corpus.toarray())
documents


array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 1, 1, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 1, 1, 1]])

## Building y: Document Labels

In [None]:
import numpy as np

labels = []
for doc in myCorpus2.fileids():
    labels.append(myCorpus2.categories(doc))

labels = np.array(labels).ravel()
print(labels)

['neg' 'neg' 'neg' 'pos' 'pos' 'pos']


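## Training a Naive Bayes Model

- Naive Bayes updates the prior probability of each class according to the evidence it observes.
- alpha=0.1 (additive smoothing; alpha=1 is Laplace smoothing)
  - Prevents zero probabilities when a word never occurs in a given class.
  - Smaller → closer to the raw counts
  - Larger → a "smoother" model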
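
To make the effect of `alpha` concrete, here is a small sketch of the additive-smoothing estimate, P(word | class) = (count + alpha) / (total + alpha × vocabulary size). The helper `smoothed_word_probs` and the counts are hypothetical and exist only for this illustration.

```python
import numpy as np

def smoothed_word_probs(counts, alpha):
    """Additive smoothing: (count + alpha) / (total + alpha * vocab_size)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * counts.size)

toy_counts = [3, 1, 0]  # hypothetical counts of three words within one class
for alpha in (0.1, 1.0, 10.0):
    print(alpha, smoothed_word_probs(toy_counts, alpha).round(3))
```

With a small `alpha` the unseen word keeps a probability close to zero; with a large `alpha` the three probabilities move toward each other, which is the "smoother" behaviour described in the list.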




In [None]:
from sklearn.naive_bayes import MultinomialNB

model1 = MultinomialNB(alpha=0.1, class_prior=[0.4, 0.6])
model1.fit(documents, labels)
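# documents is the one-hot matrix built above with CountVectorizer + Binarizer
# labels holds the class labels, e.g. ['neg', 'pos', ...]


# Predict on the same data (in practice, predict on a held-out test set)
model1.predict(documents)


# This line makes the model:
#   1. read each document's feature vector
#   2. compute the posterior probability of each class
#   3. pick the class with the largest posterior (pos or neg)

# What is the posterior probability?
# posterior probability = the probability that a document belongs to a class,
# as estimated by the model after it has seen the document
# posterior ∝ prior × the evidence contributed by each word in the document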

array(['neg', 'neg', 'neg', 'pos', 'pos', 'pos'], dtype='<U3')
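
The posterior probabilities behind `predict` can be inspected with `predict_proba`. The sketch below uses a made-up 4 × 3 one-hot matrix rather than the dissent corpus; all `toy_` names are assumptions.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical one-hot matrix: 4 documents x 3 words
toy_X = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]])
toy_y = np.array(["pos", "pos", "neg", "neg"])

toy_nb = MultinomialNB(alpha=0.1).fit(toy_X, toy_y)
toy_proba = toy_nb.predict_proba(toy_X)   # one row per document, one column per class
print(toy_nb.classes_)                    # column order of toy_proba
print(toy_proba.round(3))

# predict() returns the class whose posterior is largest in each row
toy_picked = toy_nb.classes_[toy_proba.argmax(axis=1)]
print(toy_picked)
```

Each row sums to 1, and taking the arg-max of a row reproduces what `predict` returns.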

### Alternative: Pipeline (vectorization + model in one object)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

model2 = Pipeline([
    ('vectorizerTfIdf', TfidfVectorizer()),  # TfidfVectorizer() turns the text into a TF-IDF feature matrix
    ('bayes', MultinomialNB())  # MultinomialNB() reads the TF-IDF features and trains a Naive Bayes classifier
])

model2.steps
model2.named_steps["bayes"]

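# model2.steps and model2.named_steps inspect the contents of the Pipeline:
# - model2.steps lists every step in the model (the two defined above)
# - model2.named_steps["bayes"] retrieves one specific step (here the Naive Bayes estimator, i.e. the second step)


model2.fit(strCorpus, labels)  # fit (train)
model2.predict(strCorpus)  # predict (on the same data)
model2.score(strCorpus, labels)  # score (compute accuracy)


# 1.0 (100% accuracy, measured on the same data used for training)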

1.0

# Iqual Ch. 10

## STEP 1: Reading and Preprocessing

In [None]:
import os, re, string
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
import nltk
nltk.download('punkt_tab')
nltk.download('wordnet')


snowball = SnowballStemmer("english")
wordnet = WordNetLemmatizer()

dissents = []
dirlist = os.listdir(mydir)

# Preprocess each dissent
for entry in dirlist:
    infile = mydir + entry

    # Skip system folders (such as .ipynb_checkpoints)
    if os.path.isdir(infile):
        continue

    # Skip files that are not .txt
    if not entry.endswith(".txt"):
        continue

    with open(infile) as f:
        txt = f.read()

    txt = re.sub("[\\s]+", " ", txt).lower()
    tokens = word_tokenize(txt)

    cleaned = []
    for w in tokens:
        if re.search("^[a-z]+$", w):
            w = wordnet.lemmatize(w)
            w = snowball.stem(w)
            cleaned.append(w)

    dissents.append(cleaned)


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


## STEP 2: Building the Document-Term Matrix (Manual Version)

In [None]:
from collections import Counter
import numpy as np

# Build the vocabulary
vocab = set()
for dissent in dissents:
    vocab.update(dissent)

print(len(vocab))  # number of features

# Count term frequencies for each document
dissentFreqs = []
for dissent in dissents:
    tf = Counter(dissent)  # counts how often each token occurs in the list
    row = [tf[token] for token in vocab]  # a Counter returns 0 for missing tokens
    dissentFreqs.append(row)

freqMat = np.array(dissentFreqs)  # stack the per-document counts into a 2-D array
print(freqMat)
print(freqMat)


# 492 = vocabulary size: 492 distinct tokens appear across the documents

492
[[ 1  0 11 ...  0  2  1]
 [ 0  0  0 ...  0  0  0]
 [ 0  0  0 ...  1  0  0]
 [ 0  0  0 ...  0  0  0]
 [ 0  1  0 ...  0  0  0]
 [ 0  0  0 ...  0  0  0]]
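
As a cross-check, the manual construction should agree with `CountVectorizer` when both use the same tokens. The sketch below uses made-up token lists in place of the stemmed dissents and sorts the vocabulary so that the column order is reproducible; the `toy_` names are assumptions.

```python
from collections import Counter
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-tokenized documents (stand-ins for the stemmed dissents)
toy_docs = [["court", "law", "court"], ["law", "dissent"], ["dissent", "dissent", "court"]]

# Manual version with a sorted vocabulary
toy_vocab = sorted({tok for doc in toy_docs for tok in doc})
manual = np.array([[Counter(doc)[tok] for tok in toy_vocab] for doc in toy_docs])

# CountVectorizer version; a callable analyzer means the input is used as already-tokenized lists
cv = CountVectorizer(analyzer=lambda doc: doc)
auto = cv.fit_transform(toy_docs).toarray()

print(toy_vocab)
print(manual)
print((manual == auto).all())
```

`CountVectorizer` orders its columns alphabetically, which is why the manual vocabulary is sorted here before comparing.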


# Example: Classifying Movie Reviews

## STEP 1: Reading the Training / Test Data

In [None]:
import os

base = "Week9/"
folders = [
    "train/pos/", "train/neg/",
    "test/pos/", "test/neg/"
]

# Create the folders
for f in folders:
    os.makedirs(base + f, exist_ok=True)

# A few short example reviews
pos_reviews = [
    "I absolutely loved this movie. The acting was fantastic!",
    "A touching and beautifully filmed story.",
    "Great characters and a powerful ending. Highly recommend.",
    "This movie made me smile the entire time.",
    "A masterpiece. Wonderful soundtrack and stunning visuals.",
]

neg_reviews = [
    "This movie was boring and way too long.",
    "Terrible acting and the plot made no sense.",
    "I regret watching this. Complete waste of time.",
    "The script was weak and the pacing was awful.",
    "One of the worst movies I have seen this year.",
]

# Split the reviews into train/test (first 3 per class for training, last 2 for testing)
def save_reviews(reviews, folder, prefix):
    for i, text in enumerate(reviews):
        with open(base + folder + f"{prefix}_{i}.txt", "w") as f:
            f.write(text)

# train
save_reviews(pos_reviews[:3], "train/pos/", "pos")
save_reviews(neg_reviews[:3], "train/neg/", "neg")

# test
save_reviews(pos_reviews[3:], "test/pos/", "pos")
save_reviews(neg_reviews[3:], "test/neg/", "neg")

print("Small IMDB-style dataset created!")


Small IMDB-style dataset created!


In [None]:
! pip install unidecode

Collecting unidecode
  Downloading Unidecode-1.4.0-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.4.0-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.4.0


In [None]:
# Read the review texts and prepare the y labels (0 = pos, 1 = neg)
from unidecode import unidecode

trainText, testText = [], []

print("Reading training data")
for folder, label in [("train/pos/", 0), ("train/neg/", 1)]:  # read the training data
    for file in os.listdir("/content/Week9/" + folder):
        if file.endswith(".txt"):
            with open("/content/Week9/" + folder + file) as f:
                txt = f.read()
            trainText.append(unidecode(txt))

# Count the pos / neg training files
num_posTrain = len(os.listdir("/content/Week9/train/pos/"))
num_negTrain = len(os.listdir("/content/Week9/train/neg/"))

# Build targetTrain (0 = pos, 1 = neg)
targetTrain = [0] * num_posTrain + [1] * num_negTrain

print("Reading test data")
for folder, label in [("test/pos/", 0), ("test/neg/", 1)]:  # read every .txt file in test/pos/ and test/neg/ into testText
    for file in os.listdir("/content/Week9/" + folder):
        if file.endswith(".txt"):
            with open("/content/Week9/" + folder + file) as f:
                txt = f.read()
            testText.append(unidecode(txt))


Reading training data
Reading test data
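
One way to keep texts and labels aligned is to record the label at the moment each file is read, instead of rebuilding the label list from file counts afterwards. The helper `read_labeled` below is hypothetical, and the demo runs on a throwaway directory that only mimics the `Week9/` layout.

```python
import os
import tempfile

def read_labeled(root, split, classes=(("pos", 0), ("neg", 1))):
    """Read every .txt under root/split/<class>/ and return (texts, labels)."""
    texts, labels = [], []
    for name, label in classes:
        folder = os.path.join(root, split, name)
        for fname in sorted(os.listdir(folder)):      # sorted for a reproducible order
            if fname.endswith(".txt"):
                with open(os.path.join(folder, fname), encoding="utf-8") as f:
                    texts.append(f.read())
                labels.append(label)                  # label stored next to its text
    return texts, labels

# Demo on a temporary directory with the same train/<class>/ structure
with tempfile.TemporaryDirectory() as root:
    for name, text in [("pos", "Loved it."), ("neg", "Awful.")]:
        os.makedirs(os.path.join(root, "train", name))
        with open(os.path.join(root, "train", name, f"{name}_0.txt"), "w", encoding="utf-8") as f:
            f.write(text)
    demo_texts, demo_labels = read_labeled(root, "train")

print(demo_texts, demo_labels)
```

Because only `.txt` files contribute a label, stray entries such as `.ipynb_checkpoints` cannot shift the labels out of step with the texts.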


## STEP 2: BoW Preprocessing Function

In [None]:
from nltk.stem.porter import PorterStemmer  # used by BoW() below

def BoW(text):
    tokenized = [word_tokenize(doc) for doc in text]

    # Remove punctuation
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    cleaned_docs = []
    for review in tokenized:
        cleaned = [regex.sub('', w) for w in review if regex.sub('', w)]
        cleaned_docs.append(cleaned)

    # Stemming
    porter = PorterStemmer()
    final_docs = [" ".join(porter.stem(w) for w in doc) for doc in cleaned_docs]

    return final_docs

## STEP 3: Vectorizing, Building y, and Training the Model

In [None]:
from nltk.stem.porter import PorterStemmer

train_docs = BoW(trainText)  # preprocess the text before converting it to TF-IDF features
tfidf = TfidfVectorizer(min_df=1)  # TF-IDF vectorizer (min_df=1: keep terms that appear in at least 1 document)
trainData = tfidf.fit_transform(train_docs)  # fit: learn the vocabulary and IDF weights; transform: turn the text into numeric vectors

test_docs = BoW(testText)
testData = tfidf.transform(test_docs)

# Build the labels
targetTrain = [0]*num_posTrain + [1]*(len(trainText)-num_posTrain)  # [0] * num_posTrain repeats 0 num_posTrain times, e.g. [0] * 5 is [0, 0, 0, 0, 0]
half = len(testText) // 2
targetTest = [0]*half + [1]*(len(testText)-half)

# [0]*half labels the first half of testText as pos (0)
# [1]*(len(testText)-half) labels the second half as neg (1)
# This assumes the pos files were read first and that both classes have the same number of files.


## STEP 4-1: Evaluating a Naive Bayes Model

In [None]:
from sklearn.naive_bayes import GaussianNB  # import the Gaussian Naive Bayes model

gnb = GaussianNB()  # create an untrained classifier
gnb.fit(trainData.toarray(), targetTrain)  # train on the training data; toarray() is needed because GaussianNB does not accept sparse matrices
pred = gnb.predict(testData.toarray())  # predict the test data with the trained model

print("Mislabeled test points:", (np.array(targetTest) != pred).sum())

# Compare targetTest with pred element by element:
#   prediction != truth -> True (an error)
#   prediction == truth -> False (correct)
# Because we use !=, every mismatch is marked True, so summing the Trues counts the "mislabeled test points".


Mislabeled test points: 2
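
Beyond a raw error count, `sklearn.metrics` can report accuracy and show which class the errors fall in. The labels and predictions below are invented for illustration and are not the notebook's actual results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions (0 = pos, 1 = neg)
toy_true = [0, 0, 1, 1]
toy_hat = [0, 1, 1, 1]

toy_mislabeled = (np.array(toy_true) != np.array(toy_hat)).sum()
toy_acc = accuracy_score(toy_true, toy_hat)
toy_cm = confusion_matrix(toy_true, toy_hat)   # rows = true class, columns = predicted class

print("Mislabeled:", toy_mislabeled)
print("Accuracy:", toy_acc)
print(toy_cm)
```

In this toy case the single error is a pos review predicted as neg, which the confusion matrix shows in the first row, second column.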


## STEP 4-2: Support Vector Machine (SVM) Model

In [None]:
print(targetTrain)

[0, 0, 0, 1, 1, 1]


In [None]:
from sklearn import svm

clf = svm.SVC()  # create an SVM classifier
pred2 = clf.fit(trainData.toarray(), targetTrain).predict(testData.toarray())

# fit(trainData.toarray(), targetTrain): train the SVM model
# .predict(testData.toarray()): predict the test data with the SVM
# pred2 stores the predictions

print("Mislabeled points:", (np.array(targetTest) != pred2).sum())

Mislabeled points: 2
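
Unlike `GaussianNB`, `SVC` accepts SciPy sparse matrices, so the TF-IDF output can be passed in without `.toarray()`. The sketch below uses made-up reviews; the linear kernel and all `toy_` names are assumptions chosen for this example, not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Hypothetical toy reviews (0 = pos, 1 = neg)
toy_train_reviews = ["loved the acting", "wonderful touching story",
                     "boring and awful", "terrible weak plot"]
toy_targets = [0, 0, 1, 1]
toy_test_reviews = ["wonderful acting", "awful plot"]

toy_vec = TfidfVectorizer()
X_train_sparse = toy_vec.fit_transform(toy_train_reviews)   # SciPy sparse matrix
X_test_sparse = toy_vec.transform(toy_test_reviews)

toy_svc = SVC(kernel="linear")            # kernel choice is an assumption for this sketch
toy_svc.fit(X_train_sparse, toy_targets)  # sparse input is passed directly
toy_svm_pred = toy_svc.predict(X_test_sparse)
print(toy_svm_pred)
```

Keeping the matrix sparse matters little for six documents, but it saves memory once the vocabulary grows to thousands of terms.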


# Dictionary Method


## 1. Bing Liu

In [None]:
nltk.download('opinion_lexicon')
# Sentiment lexicon that ships with NLTK:
# positive words (e.g. good, excellent, amazing, beautiful, ...)
# negative words (e.g. bad, terrible, ugly, hate, ...)

from nltk.corpus import opinion_lexicon

pos_list = set(opinion_lexicon.positive())  # load the positive word list
neg_list = set(opinion_lexicon.negative())  # load the negative word list

def sentiment1(text):
    score = 0  # the sentiment score starts at 0
    words = [w.lower() for w in word_tokenize(text)]
    for w in words:
        if w in pos_list:
            score += 1
        elif w in neg_list:
            score -= 1
    return score

docSenti1 = [sentiment1(doc) for doc in strCorpus]  # run sentiment1() on every document in strCorpus
print(docSenti1)


[nltk_data] Downloading package opinion_lexicon to /root/nltk_data...
[nltk_data]   Unzipping corpora/opinion_lexicon.zip.


[-6, -3, -4, 20, -2, 1]
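
A raw count like this tends to grow with document length, so a long document can score high simply because it has more words. One possible adjustment is to divide by the number of tokens. The sketch below uses tiny made-up lexicons and whitespace tokenization rather than the Bing Liu lists and `word_tokenize`; every name in it is hypothetical.

```python
# Hypothetical mini-lexicons; the notebook uses the full Bing Liu lists instead
toy_pos = {"good", "great", "love"}
toy_neg = {"bad", "boring", "hate"}

def toy_sentiment(text):
    """Return (raw score, score per token) using whitespace tokenization."""
    words = text.lower().split()
    raw = sum((w in toy_pos) - (w in toy_neg) for w in words)
    per_token = raw / len(words) if words else 0.0
    return raw, per_token

toy_raw, toy_norm = toy_sentiment("A great film but a boring boring ending")
print(toy_raw, toy_norm)  # -1 -0.125
```

The per-token score makes documents of different lengths easier to compare, at the cost of hiding how much sentiment-bearing text there is in total.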


## 2. VADER (Another Sentiment Lexicon)

In [None]:
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl.metadata (572 bytes)
Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

def sentiment_vader(text):
    return analyzer.polarity_scores(text)

docSenti2 = [sentiment_vader(doc) for doc in strCorpus]
docSenti2

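# To keep only one of the scores, e.g. negative
docSenti3 = [sentiment_vader(doc)["neg"] for doc in strCorpus]
print(docSenti3)


# The values below are the "neg" scores
# 0.195: roughly 19.5% of the text is rated as negative
# 0.039: almost no negative sentiment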

[0.195, 0.267, 0.139, 0.039, 0.196, 0.038]
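
Besides `neg`, `neu`, and `pos`, `polarity_scores()` also returns a `compound` value between -1 and 1. A common convention, described in the VADER README, treats `compound >= 0.05` as positive and `compound <= -0.05` as negative. The helper and the score dictionaries below are hypothetical and only mimic the shape of VADER's output.

```python
def label_from_compound(scores, threshold=0.05):
    """Map a VADER-style score dict to 'pos', 'neg', or 'neu' via its compound value."""
    compound = scores["compound"]
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"

# Hypothetical score dicts in the shape returned by polarity_scores()
toy_scores = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.84},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.62},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
toy_vader_labels = [label_from_compound(s) for s in toy_scores]
print(toy_vader_labels)  # ['pos', 'neg', 'neu']
```

Applied to `docSenti2`, such a function would give one label per document that can be compared with the pos/neg categories of the corpus.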


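> The two dictionary methods report, respectively:
> 1. A summed positive/negative sentiment score for each whole document
> 2. The proportion of positive/negative sentiment in each document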