# Bag-of-words

Transforming the text data into numeric form

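Why: <br/>
Machine learning and deep learning models cannot process raw text directly, but they can process numeric features.

## Introduction

Bag-of-words is a representation that records how many times (or how often) each word occurs in a document or in a collection of documents (a corpus).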

## Example

![](Image/Image1.jpg)

Note that a bag-of-words representation, which stores only word frequencies, **records neither word order nor grammar rules**. This is why it is called a **bag** of words.
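
To make the idea concrete, here is a minimal sketch that builds a bag-of-words table by hand with the standard library. The two documents are made up for illustration, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter

# Two toy documents (made up for illustration)
docs = ["the cat sat on the mat", "the dog sat"]

# Count tokens per document; whitespace splitting is a simplification
counts = [Counter(doc.split()) for doc in docs]

# The shared, sorted vocabulary becomes the column order
vocab = sorted(set(token for c in counts for token in c))

# One row of counts per document, one column per token
matrix = [[c[token] for token in vocab] for c in counts]

print(vocab)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Each row only says how often a token occurs, not where it occurs in the document.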

## Dataset used for the implementation

Reviews of Amazon products:

There are two columns: score (1 = positive, 0 = negative) and the actual review text

![](Image/Image2.jpg)

---

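With code, the data above will be transformed into the following form:

![](Image/Image3.jpg)

The columns are all the tokens, and the values are the number of times each token occurs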

---

## Bag-of-words in Python

This is done with the **CountVectorizer** class from **sklearn.feature_extraction.text**

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# import the review data
data = pd.read_csv("Datasets/amazon_reviews_sample.csv")

data.head()

Unnamed: 0.1,Unnamed: 0,score,review
0,0,1,Stuning even for the non-gamer: This sound tr...
1,1,1,The best soundtrack ever to anything.: I'm re...
2,2,1,Amazing!: This soundtrack is my favorite musi...
3,3,1,Excellent Soundtrack: I truly like this sound...
4,4,1,"Remember, Pull Your Jaw Off The Floor After H..."


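Use CountVectorizer to convert all reviews into bag-of-words

CountVectorizer accepts the parameter **max_features = N**, which keeps only the N tokens with the highest total count across the corpus as columns (the documents contain a very large number of distinct words, so using all of them as columns could make the data too large)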
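
The selection that max_features performs can be sketched in plain Python. This is only an illustration of the idea on a made-up corpus, not scikit-learn's implementation; the regular expression mimics CountVectorizer's default of lowercased tokens with two or more word characters.

```python
import re
from collections import Counter

# Toy corpus (made up for illustration)
docs = ["good good movie", "bad movie", "good movie plot"]

# Total count of every token across all documents
totals = Counter()
for doc in docs:
    totals.update(re.findall(r"\b\w\w+\b", doc.lower()))

# Keep only the N most frequent tokens; they become the columns
N = 2
top_tokens = sorted(token for token, _ in totals.most_common(N))
print(top_tokens)  # ['good', 'movie']
```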

In [8]:
# initialize a CountVectorizer object
vect = CountVectorizer(max_features = 1000)

# fit the data and construct the tokens as columns
vect.fit(data.review)

# transform the review column into bag-of-words
X = vect.transform(data.review)

print(type(X))
print(X)

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 34)	1
  (0, 50)	2
  (0, 54)	1
  (0, 74)	1
  (0, 88)	1
  (0, 98)	1
  (0, 129)	1
  (0, 272)	2
  (0, 274)	1
  (0, 326)	1
  (0, 337)	1
  (0, 343)	2
  (0, 344)	1
  (0, 385)	1
  (0, 386)	1
  (0, 388)	2
  (0, 433)	1
  (0, 452)	5
  (0, 496)	1
  (0, 543)	1
  (0, 565)	2
  (0, 580)	1
  (0, 590)	2
  (0, 614)	1
  (0, 631)	1
  :	:
  (9999, 448)	2
  (9999, 452)	3
  (9999, 504)	1
  (9999, 513)	1
  (9999, 534)	2
  (9999, 554)	1
  (9999, 556)	1
  (9999, 567)	3
  (9999, 584)	1
  (9999, 590)	1
  (9999, 598)	2
  (9999, 617)	1
  (9999, 624)	1
  (9999, 625)	1
  (9999, 731)	1
  (9999, 762)	1
  (9999, 849)	1
  (9999, 850)	3
  (9999, 856)	1
  (9999, 860)	1
  (9999, 864)	4
  (9999, 871)	1
  (9999, 874)	5
  (9999, 933)	2
  (9999, 965)	1


X is a sparse matrix: it stores only the non-zero elements of the matrix, together with their row and column numbers

To turn the sparse matrix into an ordinary (dense) array, use the toarray method

In [9]:
my_array = X.toarray()
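
Conceptually, the conversion fills a zero matrix with the stored entries. The sketch below uses a plain dictionary as a stand-in for the sparse matrix; the entries and the shape are made up, and SciPy's actual storage format is different.

```python
# Sparse entries in the same "(row, column) -> value" form that print(X) shows
entries = {(0, 1): 2, (0, 3): 1, (1, 0): 5}
n_rows, n_cols = 2, 4

# Start from all zeros, then write each stored value into its position
dense = [[0] * n_cols for _ in range(n_rows)]
for (row, col), value in entries.items():
    dense[row][col] = value

print(dense)  # [[0, 2, 0, 1], [5, 0, 0, 0]]
```

The dense form stores every zero explicitly, so it can need far more memory than the sparse form when the vocabulary is large.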

Finally, convert my_array into a dataframe

In [10]:
X_df = pd.DataFrame(my_array, columns = vect.get_feature_names())

In [11]:
X_df.head()

Unnamed: 0,10,100,12,15,1984,20,30,40,451,50,...,wrong,wrote,year,years,yes,yet,you,young,your,yourself
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,2,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,3,0,1,0


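As shown above, the CountVectorizer object exposes the column names through its **get_feature_names** method; here they are the 1000 tokens with the highest counts. (In scikit-learn 1.0 and later the method is named get_feature_names_out, and get_feature_names was removed in version 1.2.)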

In [12]:
print(vect.get_feature_names())



---

In-class exercise 1:

In [13]:
annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

In [14]:
# Build the vectorizer and fit it
anna_vect = CountVectorizer()
anna_vect.fit(annak)

CountVectorizer()

In [18]:
# Create the bow representation
anna_bow = anna_vect.transform(annak)

# Print the bag-of-words result 
print(anna_bow.toarray())

# Print the bag-of-words dataframe
print(pd.DataFrame(anna_bow.toarray(), columns = anna_vect.get_feature_names()))

[[1 1 1 0 1 0 1 0 0 0 0 0 0]
 [0 0 0 1 0 1 0 1 1 1 1 2 1]]
   alike  all  are  every  families  family  happy  in  is  its  own  unhappy  \
0      1    1    1      0         1       0      1   0   0    0    0        0   
1      0    0    0      1         0       1      0   1   1    1    1        2   

   way  
0    0  
1    1  


---

---

# N-grams

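Because bag-of-words does not record the order of the words in a document, in some cases it fails to capture the meaning of a sentence.

For example:
1. I am happy, not sad.
2. I am sad, not happy.

These two sentences contain the same words with the same frequencies, yet their meanings are very different, so word order has to be taken into account.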
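
A quick check of this claim, as a minimal sketch using only the standard library (the helper name `bag` is made up for this example):

```python
import re
from collections import Counter

def bag(text):
    """Lowercase the text and count its words, ignoring punctuation and order."""
    return Counter(re.findall(r"\w+", text.lower()))

a = "I am happy, not sad."
b = "I am sad, not happy."

same_bag = bag(a) == bag(b)
print(same_bag)  # True: the two sentences are indistinguishable as bags of words
```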

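How can word order be taken into account while still using bag-of-words? <br/>
Use N-grams, that is, treat each sequence of N adjacent words (typically two or three) as a single token.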

## Definitions

Unigrams: single tokens (each token contains one word) <br/>
Bigrams: pairs of tokens (each token contains two words) <br/>
Trigrams: triples of tokens (each token contains three words) <br/>
N-grams: sequences of N tokens (each token contains N words)

Example:

"The weather today is wonderful"

This can be split into:

1. Unigrams: {The, weather, today, is, wonderful}
2. Bigrams: {The weather, weather today, today is, is wonderful}
3. Trigrams: {The weather today, weather today is, today is wonderful}
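
The three lists above can be generated with a sliding window over the words. This is a minimal sketch; the function name `ngrams` is made up for this example.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "The weather today is wonderful".split()

for n in (1, 2, 3):
    print(n, ngrams(tokens, n))
```

A sentence with k words yields k - n + 1 N-grams, so the five words here give 5 unigrams, 4 bigrams and 3 trigrams.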

## N-grams in Python

Specify the **ngram_range** parameter when initializing the CountVectorizer object

**ngram_range** takes a tuple of two values. The first is the minimum token length and the second is the maximum token length, both measured in words

For example, ngram_range = (1, 1) means only unigrams are considered, and ngram_range = (1, 2) means both unigrams and bigrams are considered
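
To see how the range affects the vocabulary size, the sketch below counts distinct features by hand for the two sentences from exercise 1. It imitates CountVectorizer's default tokenization (lowercase, tokens of two or more word characters) but is not scikit-learn's code; the helper name `ngram_features` is made up.

```python
import re

docs = ['Happy families are all alike;',
        'every unhappy family is unhappy in its own way']

def ngram_features(doc, min_n, max_n):
    """Distinct n-grams of one document for every n from min_n to max_n."""
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    return {" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)}

sizes = {}
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    vocab = set().union(*(ngram_features(d, *ngram_range) for d in docs))
    sizes[ngram_range] = len(vocab)

print(sizes)  # {(1, 1): 13, (1, 2): 25, (2, 2): 12}
```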

### How to choose N

A larger N means longer tokens and more features (all features from unigrams up to N-grams are included):<br/>
In theory, the number of possible bigrams is at most the number of unigrams squared, and the number of possible trigrams is at most the number of unigrams cubed.

Longer tokens produce more features, which lets a model capture more context and may make it **more accurate**, but also makes it more prone to **overfitting**. **Grid search with cross-validation** can therefore be used to decide which value of N works best.
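
One possible way to set up such a search is sketched below with scikit-learn's Pipeline and GridSearchCV. The twelve labelled texts are made up for illustration, and the choice of LogisticRegression as the classifier is an assumption; neither comes from the course material.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 6 positive and 6 negative reviews
texts = [
    "great soundtrack, I love it", "the best album ever",
    "amazing music and great sound", "I really love this book",
    "excellent story, great characters", "wonderful, would buy again",
    "terrible sound, I hate it", "the worst album ever",
    "boring music and bad sound", "I really hate this book",
    "awful story, bad characters", "disappointing, would not buy again",
]
labels = [1] * 6 + [0] * 6

# Vectorizer and classifier in one pipeline, so each fold refits the vocabulary
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Candidate values of N: unigrams only, up to bigrams, up to trigrams
param_grid = {"vect__ngram_range": [(1, 1), (1, 2), (1, 3)]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(texts, labels)

best_ngram_range = search.best_params_["vect__ngram_range"]
print(best_ngram_range)
```

On a corpus this small the selected range is not meaningful; the point is only the structure of the search.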

In-class exercise 2: same as exercise 1, but also including bigram features

In [21]:
annak = ['Happy families are all alike;', 'every unhappy family is unhappy in its own way']

# Build the vectorizer and fit it
anna_vect = CountVectorizer(ngram_range = (1, 2))
anna_vect.fit(annak)

# Create the bow representation
anna_bow = anna_vect.transform(annak)

# Print the bag-of-words result 
print(anna_bow.toarray())

# Print the bag-of-words dataframe
print(pd.DataFrame(anna_bow.toarray(), columns = anna_vect.get_feature_names()))

[[1 1 1 1 1 0 0 1 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1 1 2 1 1 1]]
   alike  all  all alike  are  are all  every  every unhappy  families  \
0      1    1          1    1        1      0              0         1   
1      0    0          0    0        0      1              1         0   

   families are  family  ...  is  is unhappy  its  its own  own  own way  \
0             1       0  ...   0           0    0        0    0        0   
1             0       1  ...   1           1    1        1    1        1   

   unhappy  unhappy family  unhappy in  way  
0        0               0           0    0  
1        2               1           1    1  

[2 rows x 25 columns]


## CountVectorizer parameters

max_features = N: keep only the N tokens with the highest total count across the corpus

max_df = N: if N is an integer, ignore tokens that **appear in more than** N documents; if N is a float, ignore tokens that appear in a **proportion of documents** greater than N (default = 1.0, which ignores nothing)

min_df = N: if N is an integer, ignore tokens that **appear in fewer than** N documents; if N is a float, ignore tokens that appear in a **proportion of documents** smaller than N (default = 1, which ignores nothing)

ngram_range = (N_min, N_max): include features from N_min-grams up to N_max-grams
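
The document-frequency filters can be illustrated by hand. The sketch below handles integer thresholds only, on a made-up corpus; for float thresholds the comparison would use the document frequency divided by the number of documents. The helper name `keep` is hypothetical.

```python
import re
from collections import Counter

# Toy corpus (made up for illustration)
docs = ["the plot was great", "the acting was bad",
        "the ending was great", "the soundtrack"]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(re.findall(r"\b\w\w+\b", doc.lower())))

def keep(min_df=1, max_df=len(docs)):
    """Tokens whose document frequency lies between min_df and max_df inclusive."""
    return sorted(t for t, n in doc_freq.items() if min_df <= n <= max_df)

print(keep(min_df=2))  # ['great', 'the', 'was']
print(keep(max_df=2))  # ['acting', 'bad', 'ending', 'great', 'plot', 'soundtrack']
```

A high min_df removes rare tokens such as typos, while a low max_df removes tokens so common that they carry little information.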

### In-class exercises

In-class exercise 3: keep only the 100 most frequent tokens

In [24]:
# Build the vectorizer, specify size of vocabulary and fit
vect = CountVectorizer(max_features=100)
vect.fit(data.review)

# Transform the review column
X_review = vect.transform(data.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())

   about  after  all  also  am  an  and  any  are  as  ...  what  when  which  \
0      0      0    1     0   0   0    2    0    0   0  ...     0     0      0   
1      0      0    0     0   0   0    3    1    1   0  ...     0     0      0   
2      0      0    3     0   0   1    4    0    1   1  ...     0     0      1   
3      0      0    0     0   0   0    9    0    1   0  ...     0     0      0   
4      0      1    0     0   0   0    3    0    1   0  ...     0     0      0   

   who  will  with  work  would  you  your  
0    2     0     1     0      2    0     1  
1    0     0     0     0      1    1     0  
2    1     0     0     1      1    2     0  
3    0     0     0     0      0    0     0  
4    0     0     0     0      0    3     1  

[5 rows x 100 columns]


In-class exercise 4: ignore tokens that appear in more than 200 documents

In [26]:
# Build and fit the vectorizer
vect = CountVectorizer(max_df=200)
vect.fit(data.review)

# Transform the review column
X_review = vect.transform(data.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())

   00  000  001  002  00290  007  0070412901  0072316373  008  00now  ...  \
0   0    0    0    0      0    0           0           0    0      0  ...   
1   0    0    0    0      0    0           0           0    0      0  ...   
2   0    0    0    0      0    0           0           0    0      0  ...   
3   0    0    0    0      0    0           0           0    0      0  ...   
4   0    0    0    0      0    0           0           0    0      0  ...   

   zzzzzzzzzzzzzzzzzzzzz  \
0                      0   
1                      0   
2                      0   
3                      0   
4                      0   

   zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz  \
0                                                  0                              
1                                                  0                              
2                                                  0                              
3                                   

In-class exercise 5: ignore tokens that appear in fewer than 50 documents

In [28]:
# Build and fit the vectorizer
vect = CountVectorizer(min_df=50)
vect.fit(data.review)

# Transform the review column
X_review = vect.transform(data.review)
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())

   10  100  12  15  1984  20  2nd  30  40  451  ...  year  years  yes  yet  \
0   0    0   0   0     0   0    0   0   0    0  ...     0      0    0    0   
1   0    0   0   0     0   0    0   0   0    0  ...     0      1    0    0   
2   0    0   0   0     0   0    0   0   0    0  ...     0      1    0    0   
3   0    0   0   0     0   0    0   0   0    0  ...     0      0    0    0   
4   0    0   0   0     0   0    0   0   0    0  ...     0      0    0    0   

   you  young  younger  your  yourself  zero  
0    0      0        0     1         0     0  
1    1      0        0     0         0     0  
2    2      0        0     0         0     0  
3    0      0        0     0         0     0  
4    3      0        0     1         0     0  

[5 rows x 1326 columns]


In-class exercise 6: Build the vectorizer and make sure to specify the following parameters: the size of the vocabulary should be limited to 1000, include only bigrams, and ignore terms that appear in more than 500 documents

In [30]:
#Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer, specify max features and fit
vect = CountVectorizer(max_features=1000, ngram_range=(2, 2), max_df=500)
vect.fit(data.review)

# Transform the review
X_review = vect.transform(data.review)

# Create a DataFrame from the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names())
print(X_df.head())

   able to  about how  about it  about the  about this  after reading  \
0        0          0         0          0           0              0   
1        0          0         0          0           0              0   
2        0          0         0          0           0              0   
3        0          0         0          0           0              0   
4        0          0         0          0           0              0   

   after the  again and  ago and  agree with  ...  you think  you to  you ve  \
0          0          0        0           0  ...          0       0       0   
1          0          0        0           0  ...          0       0       0   
2          0          0        0           0  ...          0       0       2   
3          0          0        0           0  ...          0       0       0   
4          0          0        0           0  ...          0       0       1   

   you want  you will  you won  you would  your money  your own  your time  
0  

---

---

---

# Build new features from text

In [1]:
import pandas as pd

reviews = pd.read_csv("Datasets/amazon_reviews_sample.csv")

reviews.head()

Unnamed: 0.1,Unnamed: 0,score,review
0,0,1,Stuning even for the non-gamer: This sound tr...
1,1,1,The best soundtrack ever to anything.: I'm re...
2,2,1,Amazing!: This soundtrack is my favorite musi...
3,3,1,Excellent Soundtrack: I truly like this sound...
4,4,1,"Remember, Pull Your Jaw Off The Floor After H..."


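## Purpose

Adding informative features to a dataset can often improve model performance, so we can build extra features from the text.

Besides bag-of-words, which turns the original text into numeric form, there are other ways to add features. For example:
1. Count the length of each review
2. Count how many sentences each review contains
3. What parts of speech are involved?
4. Count how many punctuation marks each review contains

These quantities can be computed after tokenization, before building the bag-of-words.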

## Tokenisation

Splitting documents or sentences into smaller tokens (a sentence can also be a token)

### Tokenisation in Python

Use nltk's word_tokenize function, or regular expressions together with functions from the re module such as search and findall

Example:

In [2]:
from nltk import word_tokenize

In [3]:
anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way'

In [7]:
print(type(word_tokenize(anna_k)))

print(word_tokenize(anna_k))

<class 'list'>
['Happy', 'families', 'are', 'all', 'alike', ',', 'every', 'unhappy', 'family', 'is', 'unhappy', 'in', 'its', 'own', 'way']


The result shows that not only words but also **punctuation marks** become tokens. Numbers become tokens as well.

word_tokenize returns a list containing all the tokens.
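
For comparison, a rough regular-expression tokenizer can be written with re.findall. This is a minimal sketch of the alternative mentioned above, not a replacement for word_tokenize.

```python
import re

anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way'

# \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
tokens = re.findall(r"\w+|[^\w\s]", anna_k)

print(tokens)
print(len(tokens))  # 15
```

For this sentence the tokens are the same as those from word_tokenize, but the two approaches differ on contractions: this pattern splits "can't" into "can", "'" and "t", whereas word_tokenize produces "ca" and "n't".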

---

### Implementation

Now compute the length of each review:

In [11]:
word_tokens = [word_tokenize(review) for review in reviews.review]

print(type(word_tokens))

print(type(word_tokens[0]))

<class 'list'>
<class 'list'>


In [13]:
len_tokens = []

for review in word_tokens:
    len_tokens.append(len(review))
    
print(len_tokens)

[87, 109, 165, 145, 109, 169, 168, 117, 113, 70, 43, 109, 66, 108, 151, 61, 135, 134, 83, 55, 87, 182, 187, 155, 65, 132, 120, 57, 119, 187, 112, 166, 67, 195, 38, 152, 129, 94, 48, 79, 69, 190, 38, 56, 25, 82, 132, 108, 55, 119, 52, 193, 108, 93, 55, 75, 85, 80, 29, 41, 86, 69, 126, 97, 78, 132, 99, 71, 84, 69, 82, 204, 55, 159, 132, 134, 142, 32, 30, 36, 143, 204, 63, 44, 29, 75, 36, 49, 131, 74, 114, 187, 78, 92, 49, 62, 36, 40, 35, 31, 18, 181, 95, 194, 46, 40, 134, 90, 59, 47, 215, 57, 45, 43, 33, 91, 74, 200, 166, 108, 108, 53, 37, 54, 122, 84, 91, 42, 31, 46, 45, 64, 70, 57, 177, 32, 144, 89, 78, 219, 103, 57, 99, 22, 114, 76, 198, 82, 88, 50, 94, 116, 122, 70, 68, 73, 158, 166, 167, 57, 96, 180, 179, 34, 63, 216, 48, 32, 109, 34, 58, 52, 35, 216, 47, 34, 29, 69, 106, 51, 97, 46, 94, 130, 157, 159, 26, 50, 77, 175, 191, 186, 82, 70, 47, 49, 89, 67, 64, 37, 135, 110, 101, 135, 24, 40, 46, 74, 126, 73, 68, 95, 163, 188, 100, 79, 98, 77, 114, 118, 133, 139, 44, 60, 83, 109, 68, 47,

Add the computed lengths to the reviews dataframe as a feature

In [15]:
reviews["n_tokens"] = len_tokens

reviews.head()

Unnamed: 0.1,Unnamed: 0,score,review,n_tokens
0,0,1,Stuning even for the non-gamer: This sound tr...,87
1,1,1,The best soundtrack ever to anything.: I'm re...,109
2,2,1,Amazing!: This soundtrack is my favorite musi...,165
3,3,1,Excellent Soundtrack: I truly like this sound...,145
4,4,1,"Remember, Pull Your Jaw Off The Floor After H...",109


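There are many more ways to create features:
1. Use the number of punctuation marks in a review as a feature. A review with many punctuation marks may express a very emotionally charged opinion
2. Use the number of sentences in each review as a feature (here each token is a sentence)

Counting punctuation marks: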

In [17]:
import re

In [19]:
pattern = r"[.,!?]"

punct_tokens = [re.findall(pattern, review) for review in reviews.review]
print(punct_tokens)

[['!', '.', '!', '!', '.', '!'], ['.', '.', '.', '.', ',', ',', '.'], ['!', ',', '.', '.', ',', ',', '.', ',', ',', ',', ',', '.', '.'], ['.', '.', '.', ',', ',', ',', ',', ',', '.', ',', ',', ',', ',', ',', '.', ',', ',', ',', ',', ',', ',', ',', ',', '.', ',', '.'], [',', ',', '!', ',', '!', ',', ',', ',', ',', '.', ',', '.', '.'], [',', '.', ',', '.', ',', ',', '.', ',', ',', ',', '.', ',', '.'], [',', '!', '.', ',', '!', '.', '.', '.', '.', '.', ',', ',', '!'], ['.', '.', ',', '.', '!', '!', '.', '.', ',', '.', ',', '.'], ['.', '.', ',', '.', '!', '.', '.', '!'], [',', '.', ',', '.', '.', '.', '.'], ['!', '.', ',', ',', '.', '.'], [',', ',', '.', '.', ',', '.', '.', '.', '.'], [',', '.', '.', '.', '.', '.'], [',', '.', '!', '.', ',', '.', '.', '.', '.', ',', '.', '.'], ['!', '.', '!', ',', ',', '.', ',', '.', ',', '.', '.', ',', '.', '!'], ['.', ',', '.', '.', '?'], [',', '.', ',', '.', ',', ',', '.', ',', ',', ',', '!'], ['!', '.', '.', '.', ',', '.', ',', '.', ',', '.', ',', '.',

In [20]:
n_punct_tokens = []

for review in punct_tokens:
    n_punct_tokens.append(len(review))
    
print(n_punct_tokens)

[6, 7, 13, 26, 13, 13, 13, 12, 8, 7, 6, 9, 6, 12, 14, 5, 11, 13, 9, 5, 11, 12, 19, 10, 4, 12, 17, 4, 15, 13, 10, 22, 7, 25, 5, 13, 11, 7, 10, 3, 5, 16, 4, 2, 2, 6, 11, 11, 7, 11, 6, 18, 8, 9, 3, 4, 15, 6, 3, 4, 11, 7, 12, 11, 7, 12, 12, 8, 5, 6, 12, 16, 4, 10, 6, 11, 13, 6, 5, 3, 8, 43, 14, 4, 0, 6, 4, 8, 18, 11, 7, 17, 11, 7, 8, 4, 4, 4, 3, 4, 1, 13, 13, 10, 4, 3, 9, 9, 9, 4, 25, 6, 3, 3, 3, 12, 6, 22, 17, 10, 9, 0, 0, 9, 18, 15, 6, 3, 0, 4, 2, 5, 8, 5, 23, 7, 10, 9, 8, 30, 12, 6, 9, 0, 10, 7, 19, 11, 12, 8, 7, 14, 19, 8, 11, 7, 12, 16, 18, 4, 11, 15, 17, 2, 7, 22, 6, 2, 8, 5, 8, 4, 4, 18, 10, 3, 1, 5, 14, 7, 7, 4, 11, 12, 13, 24, 0, 5, 6, 12, 38, 16, 7, 5, 4, 9, 9, 5, 6, 2, 19, 8, 7, 12, 4, 3, 4, 6, 17, 7, 11, 10, 18, 14, 9, 5, 6, 9, 21, 11, 9, 4, 6, 5, 7, 14, 5, 4, 4, 11, 6, 11, 10, 6, 13, 14, 4, 15, 6, 13, 7, 18, 14, 5, 14, 13, 11, 9, 16, 32, 2, 2, 7, 9, 9, 18, 5, 2, 7, 6, 11, 5, 6, 5, 2, 5, 6, 7, 11, 4, 3, 11, 16, 4, 8, 16, 18, 16, 17, 17, 8, 12, 13, 5, 27, 24, 13, 8, 5, 20, 6, 23

In [21]:
reviews["n_punctuations"] = n_punct_tokens

reviews.head()

Unnamed: 0.1,Unnamed: 0,score,review,n_tokens,n_punctuations
0,0,1,Stuning even for the non-gamer: This sound tr...,87,6
1,1,1,The best soundtrack ever to anything.: I'm re...,109,7
2,2,1,Amazing!: This soundtrack is my favorite musi...,165,13
3,3,1,Excellent Soundtrack: I truly like this sound...,145,26
4,4,1,"Remember, Pull Your Jaw Off The Floor After H...",109,13
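
Note that the pattern `[.,!?]` counts only four punctuation marks. If other marks should count as well, one option is string.punctuation from the standard library, as in this minimal sketch (the review text is made up):

```python
import string

# Hypothetical review text, for illustration only
review = "Amazing!: This soundtrack is my favorite; truly (truly) great..."

# Only the four marks matched by the pattern [.,!?]
n_basic = sum(ch in ".,!?" for ch in review)

# Every ASCII punctuation character, including ':', ';', '(' and ')'
n_all = sum(ch in string.punctuation for ch in review)

print(n_basic, n_all)  # 4 8
```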


Counting sentences:

In [22]:
from nltk import sent_tokenize

sent_tokens = [sent_tokenize(review) for review in reviews.review]

print(type(sent_tokens))

<class 'list'>


In [23]:
n_sent_tokens = []

for review in sent_tokens:
    n_sent_tokens.append(len(review))
    
print(n_sent_tokens)

[7, 4, 4, 4, 5, 4, 7, 8, 7, 4, 4, 6, 5, 8, 9, 4, 4, 9, 5, 3, 6, 7, 5, 5, 1, 7, 7, 3, 4, 7, 5, 6, 5, 8, 2, 9, 5, 4, 4, 3, 3, 6, 4, 1, 2, 3, 9, 8, 4, 5, 4, 13, 5, 6, 2, 4, 5, 4, 2, 2, 8, 5, 7, 5, 5, 7, 7, 6, 5, 6, 6, 9, 3, 9, 6, 4, 10, 2, 4, 1, 6, 17, 4, 3, 1, 4, 4, 3, 11, 4, 6, 4, 2, 6, 3, 4, 2, 2, 1, 3, 2, 9, 7, 8, 3, 2, 3, 5, 4, 4, 14, 6, 3, 2, 2, 2, 4, 8, 6, 5, 7, 1, 1, 1, 9, 7, 6, 3, 1, 2, 2, 4, 1, 3, 7, 5, 8, 6, 4, 21, 8, 3, 3, 1, 5, 3, 6, 10, 7, 5, 5, 8, 6, 6, 7, 2, 4, 8, 7, 2, 5, 7, 8, 2, 4, 9, 5, 1, 7, 4, 6, 4, 3, 10, 2, 1, 1, 3, 6, 2, 2, 4, 4, 3, 12, 11, 1, 1, 5, 9, 9, 5, 3, 2, 4, 3, 4, 5, 6, 1, 8, 6, 6, 6, 1, 3, 4, 6, 13, 5, 6, 3, 13, 6, 7, 4, 5, 6, 10, 7, 6, 4, 1, 4, 3, 5, 4, 4, 3, 3, 4, 6, 6, 4, 4, 7, 4, 5, 2, 7, 5, 6, 6, 3, 5, 5, 4, 5, 4, 18, 2, 3, 4, 7, 9, 8, 3, 2, 5, 2, 3, 4, 3, 5, 2, 4, 2, 6, 3, 3, 1, 5, 9, 3, 6, 6, 5, 6, 7, 7, 4, 5, 3, 4, 7, 7, 5, 5, 2, 8, 1, 10, 3, 9, 9, 1, 7, 4, 4, 2, 2, 3, 5, 16, 5, 2, 4, 2, 3, 4, 3, 7, 2, 7, 4, 9, 5, 3, 5, 4, 6, 4, 4, 7, 6, 6, 2, 6,

In [24]:
reviews["n_sentences"] = n_sent_tokens

reviews.head()

Unnamed: 0.1,Unnamed: 0,score,review,n_tokens,n_punctuations,n_sentences
0,0,1,Stuning even for the non-gamer: This sound tr...,87,6,7
1,1,1,The best soundtrack ever to anything.: I'm re...,109,7,4
2,2,1,Amazing!: This soundtrack is my favorite musi...,165,13,4
3,3,1,Excellent Soundtrack: I truly like this sound...,145,26,4
4,4,1,"Remember, Pull Your Jaw Off The Floor After H...",109,13,5
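
As a rough illustration of what sentence tokenization involves, the sketch below splits on whitespace that follows a period, exclamation mark or question mark. It is a naive stand-in, not how sent_tokenize works, and the sample text is made up.

```python
import re

text = "Make it your strength. Then it can never be your weakness! Is that clear?"

# Split at whitespace that directly follows '.', '!' or '?'
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

print(len(sentences))  # 3
print(sentences)
```

A rule this simple breaks on abbreviations such as "Dr. Smith", which is one reason to prefer a trained tokenizer like the one behind sent_tokenize.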


### In-class exercises

In-class exercise 7: practice word_tokenize

In [26]:
GoT = 'Never forget what you are, for surely the world will not. Make it your strength. Then it can never be your weakness. Armour yourself in it, and it will never be used to hurt you.'

In [27]:
print(word_tokenize(GoT))

['Never', 'forget', 'what', 'you', 'are', ',', 'for', 'surely', 'the', 'world', 'will', 'not', '.', 'Make', 'it', 'your', 'strength', '.', 'Then', 'it', 'can', 'never', 'be', 'your', 'weakness', '.', 'Armour', 'yourself', 'in', 'it', ',', 'and', 'it', 'will', 'never', 'be', 'used', 'to', 'hurt', 'you', '.']


In-class exercise 8: tokenize every sentence in a list

In [29]:
avengers = ["Cause if we can't protect the Earth, you can be d*** sure we'll avenge it",
           'There was an idea to bring together a group of remarkable people, to see if we could become something more',
           "These guys come from legend, Captain. They're basically Gods."]

In [30]:
print([word_tokenize(sent) for sent in avengers])

[['Cause', 'if', 'we', 'ca', "n't", 'protect', 'the', 'Earth', ',', 'you', 'can', 'be', 'd', '*', '*', '*', 'sure', 'we', "'ll", 'avenge', 'it'], ['There', 'was', 'an', 'idea', 'to', 'bring', 'together', 'a', 'group', 'of', 'remarkable', 'people', ',', 'to', 'see', 'if', 'we', 'could', 'become', 'something', 'more'], ['These', 'guys', 'come', 'from', 'legend', ',', 'Captain', '.', 'They', "'re", 'basically', 'Gods', '.']]


---

---

---

# Guess the language

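In practice, not every document that carries sentiment is written in English. We may therefore need to detect the language of a document before sentiment analysis, and then use the characteristics of that language to build extra features

## Guess the language in Python

Several Python packages can identify the language of a string

Here we use the detect_langs function from **langdetect**. It returns a list in which each element pairs a language with a number (its probability)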

In [32]:
# pip install langdetect
from langdetect import detect_langs

foreign = "Este libro ha sido uno de los mejores libros que he leido."

In [33]:
detect_langs(foreign)

[es:0.9999964880504906]

detect_langs identifies this text as Spanish with probability 0.9999965 (the returned list contains a single pair here because only one candidate language was reported)

Reviews are often stored in a dataframe. When they are not all in English, how can we create a new column that records the language of each review?

Method 1: apply

In [36]:
from langdetect import detect_langs

reviews = pd.read_csv("Datasets/amazon_reviews_sample.csv")
reviews.head()

Unnamed: 0.1,Unnamed: 0,score,review
0,0,1,Stuning even for the non-gamer: This sound tr...
1,1,1,The best soundtrack ever to anything.: I'm re...
2,2,1,Amazing!: This soundtrack is my favorite musi...
3,3,1,Excellent Soundtrack: I truly like this sound...
4,4,1,"Remember, Pull Your Jaw Off The Floor After H..."


In [38]:
reviews['lang'] = reviews.apply(
    lambda row: detect_langs(row.review),
    axis = 1
)

reviews.head()

Unnamed: 0.1,Unnamed: 0,score,review,lang
0,0,1,Stuning even for the non-gamer: This sound tr...,[en:0.9999969067663559]
1,1,1,The best soundtrack ever to anything.: I'm re...,[en:0.9999952717863452]
2,2,1,Amazing!: This soundtrack is my favorite musi...,[en:0.9999968904235674]
3,3,1,Excellent Soundtrack: I truly like this sound...,[en:0.9999965173225646]
4,4,1,"Remember, Pull Your Jaw Off The Floor After H...",[en:0.9999960547506581]


Method 2: list comprehension

In [47]:
from langdetect import detect_langs

reviews = pd.read_csv("Datasets/amazon_reviews_sample.csv")
reviews.head()

Unnamed: 0.1,Unnamed: 0,score,review
0,0,1,Stuning even for the non-gamer: This sound tr...
1,1,1,The best soundtrack ever to anything.: I'm re...
2,2,1,Amazing!: This soundtrack is my favorite musi...
3,3,1,Excellent Soundtrack: I truly like this sound...
4,4,1,"Remember, Pull Your Jaw Off The Floor After H..."


In [40]:
languages = [detect_langs(review) for review in reviews.review]
languages

[[en:0.9999978843031365],
 [en:0.9999965212719248],
 [en:0.9999986228380299],
 [en:0.9999959418926485],
 [en:0.9999976995668334],
 [en:0.9999971108273983],
 [en:0.9999966354971183],
 [en:0.9999966292443495],
 [en:0.999996139019806],
 [en:0.9999962973649532],
 [en:0.9999992661797201],
 [en:0.9999958149023931],
 [en:0.9999966193127277],
 [en:0.9999974563326305],
 [en:0.9999965503243425],
 [en:0.9999971638972667],
 [en:0.9999980002072468],
 [en:0.9999970406001106],
 [en:0.9999981538194448],
 [en:0.9999975751784247],
 [en:0.999996710967795],
 [en:0.9999971556452446],
 [en:0.9999987984765251],
 [en:0.9999973457211311],
 [en:0.9999967223515327],
 [en:0.9999970779444767],
 [en:0.9999967227367594],
 [en:0.9999947589348448],
 [en:0.9999980129836016],
 [en:0.9999971634327406],
 [en:0.9999962461885475],
 [en:0.9999975089114947],
 [en:0.9999958940814304],
 [en:0.99999784532727],
 [en:0.9999966378448485],
 [en:0.9999977456207687],
 [en:0.999997259579664],
 [en:0.9999969501361152],
 [en:0.9999952910

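Extract the language part of each pair

In [43]:
# Each element of languages is a list of candidates; take the code of the most probable one
langs = [candidates[0].lang for candidates in languages]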
langs

['en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',
 'en',

In [48]:
reviews['language'] = langs
reviews.head()

Unnamed: 0.1,Unnamed: 0,score,review,language
0,0,1,Stuning even for the non-gamer: This sound tr...,en
1,1,1,The best soundtrack ever to anything.: I'm re...,en
2,2,1,Amazing!: This soundtrack is my favorite musi...,en
3,3,1,Excellent Soundtrack: I truly like this sound...,en
4,4,1,"Remember, Pull Your Jaw Off The Floor After H...",en
