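## Assignment goal: gain a better understanding of count-based word vectors by thinking through and answering the questions below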

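### What are the advantages and disadvantages of the thesaurus approach?

The thesaurus approach builds a lexicon containing a large number of words and places words with the same meaning (synonyms) or similar meanings (near-synonyms) in the same group.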

Advantages:
1. Coded by hand, so (some) quality is guaranteed
2. Potentially useful in natural language generation (paraphrasing, building semantic networks for lexicalization, and serving as a basis for domain ontologies)

Disadvantages:
1. Building it by hand takes a great deal of time and labor
2. Potentially biased, as it depends on the builders' pre-existing knowledge of the world
3. Unable to distinguish between atomic and non-atomic semantic units (e.g., phrases, idioms, collocations)
4. Provides no frequency information about how words are actually used
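
To make the grouping idea concrete, here is a minimal sketch of a hand-built thesaurus. The groups, the words in them, and the `synonyms` helper are all hypothetical and only illustrate the data structure; a real thesaurus is far larger and usually hierarchical.

```python
# Hypothetical hand-built synonym groups; the entries are made up for illustration.
SYNONYM_GROUPS = [
    {"car", "automobile", "motorcar"},
    {"big", "large", "huge"},
]

def synonyms(word):
    """Return the other members of the group containing `word`, or an empty set."""
    for group in SYNONYM_GROUPS:
        if word in group:
            return group - {word}
    return set()

print(synonyms("car"))     # the other members of the "car" group
print(synonyms("selfie"))  # set(): a word nobody added by hand has no entry
```

The second lookup shows the maintenance cost listed above: any word the builders did not add simply has no neighbors.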




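### What are the advantages and disadvantages of the co-occurrence matrix?

According to the distributional hypothesis, similar words appear in similar contexts, so a word can be represented as a vector by counting the words that occur around it (within a window).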
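
The counting described above can be sketched as follows. The sentence, the window size of 1, and the helper name `build_cooccurrence` are assumptions chosen for illustration; this sentence happens to reproduce the you and I vectors quoted in the cosine similarity exercise later in this notebook.

```python
import numpy as np

def build_cooccurrence(tokens, window_size=1):
    """Count how often each word appears within `window_size` positions of another."""
    # Assign ids in order of first appearance.
    word_to_id = {}
    for word in tokens:
        word_to_id.setdefault(word, len(word_to_id))

    n = len(word_to_id)
    matrix = np.zeros((n, n), dtype=np.int32)
    for pos, word in enumerate(tokens):
        lo = max(0, pos - window_size)
        hi = min(len(tokens), pos + window_size + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:  # skip the center word itself
                matrix[word_to_id[word], word_to_id[tokens[ctx_pos]]] += 1
    return matrix, word_to_id

# Assumed example sentence, already lower-cased and tokenized by whitespace.
tokens = "you say goodbye and i say hello .".split()
C, word_to_id = build_cooccurrence(tokens)
print(C[word_to_id["you"]])  # [0 1 0 0 0 0 0]
print(C[word_to_id["i"]])    # [0 1 0 1 0 0 0]
```

Each row of `C` is the count-based vector for one word.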

Advantages:
1. Semantic relationships between words are preserved in the co-occurrence matrix
2. Can be combined with SVD to obtain more accurate word vector representations
3. Fast to use, since the matrix has to be computed only once for a given corpus

Disadvantages:
1. Requires a lot of storage, since the matrix has one row and one column per vocabulary word
2. Raw frequency is not discriminative enough, since some words (e.g., the, it) appear frequently but carry little information
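
One common remedy for the raw-frequency problem is positive pointwise mutual information (PPMI), which rescales each count by how often the two words occur overall: PPMI(x, y) = max(0, log2(C(x, y) * N / (C(x) * C(y)))). The sketch below is a minimal version under stated assumptions: the toy counts are invented for illustration, and `eps` is only there to avoid `log2(0)`.

```python
import numpy as np

def ppmi(C, eps=1e-8):
    """Positive PMI of a co-occurrence count matrix."""
    C = np.asarray(C, dtype=np.float64)
    total = C.sum()                        # N: sum of all counts
    row = C.sum(axis=1, keepdims=True)     # C(x)
    col = C.sum(axis=0, keepdims=True)     # C(y)
    pmi = np.log2(C * total / (row * col + eps) + eps)
    return np.maximum(0.0, pmi)            # clip negative values to zero

# Invented toy counts; rows and columns are: the, car, drive, cat
C = np.array([
    [0, 20, 20, 20],
    [20, 0, 15, 0],
    [20, 15, 0, 0],
    [20, 0, 0, 0],
])
P = ppmi(C)
print(np.round(P, 3))
```

In the raw counts "car" co-occurs with "the" more often than with "drive" (20 vs. 15), yet the PPMI score for (car, drive) is higher, because "the" co-occurs with everything.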

### Why do we need SVD dimensionality reduction on the co-occurrence matrix or the PPMI matrix?
Ans: The co-occurrence or PPMI matrix is large and sparse, and many of its dimensions either 1) carry no useful information or 2) add noise that can hurt the quality of models built on top of it. In addition, more dimensions mean more time and storage for data processing, which is bad for efficiency. SVD keeps the most informative dimensions and yields dense, lower-dimensional word vectors.
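
A minimal sketch of the reduction step with `np.linalg.svd` is shown here. The count matrix is hard-coded (window-1 counts for the assumed sentence "you say goodbye and i say hello ."), and both the target dimension `k = 2` and the choice to scale the kept columns by their singular values are illustrative assumptions, not the only option.

```python
import numpy as np

# Window-1 co-occurrence counts; rows/columns: you, say, goodbye, and, i, hello, .
C = np.array([
    [0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 0],
], dtype=np.float64)

# Full SVD: C = U @ diag(S) @ Vt, with singular values in descending order.
U, S, Vt = np.linalg.svd(C)

k = 2                           # assumed target dimensionality
reduced = U[:, :k] * S[:k]      # dense k-dimensional vector for every word

print(reduced.shape)            # (7, 2)
print(np.round(S, 3))           # how much each dimension contributes
```

Words with identical rows in `C` (here "goodbye" and "i") end up with identical reduced vectors, and the largest singular values indicate which dimensions are worth keeping.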

### Implement cosine similarity

Cosine similarity can be used to compare how similar two word vectors are:
$$
similarity(x,y) = \frac{x \cdot y}{||x||||y||} = \frac{x_1y_1+...+x_ny_n}{\sqrt{x_1^2+...+x_n^2}\sqrt{y_1^2+...+y_n^2}}
$$

Implement cosine similarity and compute the similarity between the you vector ([0,1,0,0,0,0,0]) and the I vector ([0,1,0,1,0,0,0]) from the co-occurrence matrix course example.

In [None]:
import numpy as np

# Vectors from the co-occurrence matrix course example
You = np.array([0,1,0,0,0,0,0])
I = np.array([0,1,0,1,0,0,0])

def cos_similarity(x, y, eps=1e-8):
    # eps keeps the denominator non-zero when either vector is all zeros
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + eps)

print(f"Similarity: {cos_similarity(I, You):.4f}")

# Alternatives:
# 1. Use sklearn.metrics.pairwise.cosine_similarity
# 2. Import dot from numpy and norm from numpy.linalg, then compute dot(a, b) / (norm(a) * norm(b))

Similarity: 0.7071
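
The printed value can be checked by hand. The only position where both vectors are non-zero is the second one, so the dot product is 1, and the two norms are $\sqrt{1}$ and $\sqrt{2}$:

$$
similarity(you, I) = \frac{0 \cdot 0 + 1 \cdot 1 + 0 \cdot 0 + 0 \cdot 1 + 0 \cdot 0 + 0 \cdot 0 + 0 \cdot 0}{\sqrt{1^2}\sqrt{1^2 + 1^2}} = \frac{1}{\sqrt{2}} \approx 0.7071
$$

Any tiny deviation from $1/\sqrt{2}$ in the code's result comes from the `eps` term added to the denominator.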


In [21]:
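# Read a string from the user (input() already returns a str)
user = input("Enter string: ")
# Number of characters, used to size the frame
number = len(user)
# Print the plain text followed by three dots
print(user + "\n.\n.\n.\n")
# Print the same text inside a "sudden death" style meme frame
print("_" + "人" * number + "_" + "\n" + ">" + user + "<" + "\n" + "=" + "Y^" * number + "=")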

Enter string: 142
142
.
.
.

_人人人_
>142<
=Y^Y^Y^=


enumerate

In [24]:
string = "oop"
# Iterating over a string yields its characters one at a time
print([char for char in string])

['o', 'o', 'p']
