##### Reference link
https://www.analyticsvidhya.com/blog/2020/08/top-4-sentence-embedding-techniques-using-python/

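- Embedding (vectorizing) individual words with Word2Vec, GloVe, and similar models is straightforward, but I had trouble vectorizing whole sentences, so I looked into it.
- Broadly, there are four approaches to sentence vectorization:
    - 1. Doc2Vec
    - 2. Sentence-BERT (SBERT)
    - 3. InferSent
    - 4. Universal Sentence Encoder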
    

Other options include TF-IDF, or tokenizing a sentence and averaging the vectors of the words in it => these struggle to capture the context of the sentence.
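A minimal sketch of the averaging approach, to show why it loses context. The `toy_vectors` values and the `average_embedding` helper are made up for illustration and do not come from a trained model or from the referenced article.

```python
import numpy as np

# Toy 3-dimensional word vectors (made-up values, not from a trained model).
toy_vectors = {
    "dog":   np.array([1.0, 0.0, 0.0]),
    "bites": np.array([0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 0.0, 1.0]),
}

def average_embedding(tokens, vectors):
    """Average the vectors of the tokens that have an entry in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no known tokens in sentence")
    return np.mean(known, axis=0)

a = average_embedding(["dog", "bites", "man"], toy_vectors)
b = average_embedding(["man", "bites", "dog"], toy_vectors)

# Word order is lost: both sentences map to the same vector.
print(np.allclose(a, b))
```

Because the mean ignores word order, "dog bites man" and "man bites dog" get identical vectors, which is one concrete way the sentence context goes missing.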

----

##### Step 1: 

Load the libraries. For English data, the nltk library (Natural Language Toolkit) is commonly used.

In [1]:
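from nltk.tokenize import word_tokenize
import nltk
import numpy as np

# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing, download it once:
# nltk.download('punkt')  # newer NLTK releases use 'punkt_tab'

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))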

##### Step 2 : 
Create an example sentence dataset

In [3]:
sentences = ["I ate dinner",
             "We had a three-course meal",
             "Brad came to dinner with us.",
             "He loves fish tacos.",
             "In the end, we all felt like we ate too much",
             "We all agreed; it was a magnificent evening."]

##### Step 3:

Tokenize the sentence dataset

In [4]:
tokenized_sent = []

for s in sentences:
    tokenized_sent.append(word_tokenize(s.lower()))
    
tokenized_sent

[['i', 'ate', 'dinner'],
 ['we', 'had', 'a', 'three-course', 'meal'],
 ['brad', 'came', 'to', 'dinner', 'with', 'us', '.'],
 ['he', 'loves', 'fish', 'tacos', '.'],
 ['in',
  'the',
  'end',
  ',',
  'we',
  'all',
  'felt',
  'like',
  'we',
  'ate',
  'too',
  'much'],
 ['we', 'all', 'agreed', ';', 'it', 'was', 'a', 'magnificent', 'evening', '.']]

##### Step 4:
Create the cosine similarity function

In [5]:
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

![](img/cosineSim.png)
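As a quick sanity check of the formula cos(u, v) = u·v / (‖u‖ ‖v‖), the sketch below evaluates the same `cosine` function on three hand-picked vector pairs. The input vectors are illustrative assumptions, not data from the article.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Same direction (different length), orthogonal, and opposite direction.
same_direction = cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
opposite = cosine(np.array([1.0, 2.0]), np.array([-1.0, -2.0]))

print(same_direction, orthogonal, opposite)
```

The three values should come out at approximately 1, 0, and -1: the measure depends only on the angle between the vectors, not on their lengths.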

----

### <center>Doc2Vec</center>
- introduced in 2014
- An unsupervised algorithm that extends the Word2Vec model by introducing an additional 'paragraph vector'. There are two ways to add the paragraph vector to the model:

![](img/doc2vec.png)

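1) __PV-DBOW (Distributed Bag of Words version of Paragraph Vector)__ : similar to Word2Vec's Skip-gram. Where Skip-gram uses a word to predict its surrounding words, PV-DBOW trains the paragraph vector to predict words sampled from that paragraph.

2) __PV-DM (Distributed Memory version of Paragraph Vector)__ : similar to Word2Vec's CBOW. The paragraph vector is combined (averaged or concatenated) with the context word vectors, and training proceeds by predicting the word that comes next.

- The authors recommend combining the two methods, though they seem to say PV-DM performs better on its own. With Word2Vec I understood Skip-gram to usually be the better choice, so this probably depends on the case. Note that gensim's `Doc2Vec` defaults to `dm=1` (PV-DM); pass `dm=0` to train with PV-DBOW.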
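To make the difference between the two variants concrete, here is a simplified sketch of the (input, target) pairs each one would learn from. The helper names `pv_dm_examples` and `pv_dbow_examples` are hypothetical, and the windowing is deliberately naive (only preceding words, no sampling); gensim's actual training procedure differs.

```python
def pv_dm_examples(doc_id, tokens, window=2):
    """PV-DM: (paragraph id + preceding context words) -> next word."""
    examples = []
    for i in range(window, len(tokens)):
        context = tuple(tokens[i - window:i])
        examples.append(((doc_id, context), tokens[i]))
    return examples

def pv_dbow_examples(doc_id, tokens):
    """PV-DBOW: paragraph id alone -> each word in the paragraph."""
    return [(doc_id, word) for word in tokens]

tokens = ["we", "had", "a", "three-course", "meal"]
dm = pv_dm_examples(1, tokens)
dbow = pv_dbow_examples(1, tokens)

for example in dm:
    print("PV-DM  ", example)
for example in dbow:
    print("PV-DBOW", example)
```

In PV-DM the paragraph id acts as an extra context token that remembers what the window is missing, while PV-DBOW ignores word context entirely and relies on the paragraph id alone.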

In [6]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_sent)]
tagged_data

[TaggedDocument(words=['i', 'ate', 'dinner'], tags=[0]),
 TaggedDocument(words=['we', 'had', 'a', 'three-course', 'meal'], tags=[1]),
 TaggedDocument(words=['brad', 'came', 'to', 'dinner', 'with', 'us', '.'], tags=[2]),
 TaggedDocument(words=['he', 'loves', 'fish', 'tacos', '.'], tags=[3]),
 TaggedDocument(words=['in', 'the', 'end', ',', 'we', 'all', 'felt', 'like', 'we', 'ate', 'too', 'much'], tags=[4]),
 TaggedDocument(words=['we', 'all', 'agreed', ';', 'it', 'was', 'a', 'magnificent', 'evening', '.'], tags=[5])]

In [10]:
model = Doc2Vec(tagged_data, vector_size = 20, window = 2, min_count = 1, epochs = 100)

'''
vector_size = Dimensionality of the feature vectors.
window = The maximum distance between the current and predicted word within a sentence.
min_count = Ignores all words with total frequency lower than this.
alpha = The initial learning rate.
epochs = Number of iterations (epochs) over the corpus.
'''

# gensim < 4.0 API; in gensim >= 4.0 use model.wv.key_to_index instead
model.wv.vocab

{'i': <gensim.models.keyedvectors.Vocab at 0x24029cc5408>,
 'ate': <gensim.models.keyedvectors.Vocab at 0x2402a0606c8>,
 'dinner': <gensim.models.keyedvectors.Vocab at 0x2402a0da6c8>,
 'we': <gensim.models.keyedvectors.Vocab at 0x2402815ff08>,
 'had': <gensim.models.keyedvectors.Vocab at 0x2402815f848>,
 'a': <gensim.models.keyedvectors.Vocab at 0x240281c65c8>,
 'three-course': <gensim.models.keyedvectors.Vocab at 0x240283199c8>,
 'meal': <gensim.models.keyedvectors.Vocab at 0x24028319bc8>,
 'brad': <gensim.models.keyedvectors.Vocab at 0x2402815fd48>,
 'came': <gensim.models.keyedvectors.Vocab at 0x2402815fac8>,
 'to': <gensim.models.keyedvectors.Vocab at 0x24028319908>,
 'with': <gensim.models.keyedvectors.Vocab at 0x24028319748>,
 'us': <gensim.models.keyedvectors.Vocab at 0x24028319288>,
 '.': <gensim.models.keyedvectors.Vocab at 0x24028319248>,
 'he': <gensim.models.keyedvectors.Vocab at 0x240283194c8>,
 'loves': <gensim.models.keyedvectors.Vocab at 0x24028319408>,
 'fish': <gensim

In [8]:
test_doc = word_tokenize("I had pizza and pasta".lower())
test_doc_vector = model.infer_vector(test_doc)

print(test_doc_vector)  # vector_size was set to 20, so a 20-dimensional vector is printed

# positive = list of vectors that contribute positively to the similarity
# (gensim >= 4.0 renames model.docvecs to model.dv)
model.docvecs.most_similar(positive = [test_doc_vector])

[ 0.0003346   0.02430088  0.01482583  0.01320496  0.00774708 -0.00479163
 -0.0043848   0.0011541   0.01091491 -0.00402519  0.02941764 -0.0013013
 -0.00784817  0.0232918  -0.02132636 -0.01285449 -0.00544751 -0.00254209
 -0.01468766 -0.00313348]


[(5, 0.41944509744644165),
 (4, 0.37895679473876953),
 (3, 0.3163576126098633),
 (1, 0.31625896692276),
 (2, 0.22266343235969543),
 (0, 0.20930778980255127)]

>- With so little data, it is hard to tell whether these results are actually good.

### <center> sBERT</center>

In [13]:
!pip install sentence-transformers

Collecting sentence-transformers
  Using cached sentence-transformers-0.3.9.tar.gz (64 kB)
Collecting transformers<3.6.0,>=3.1.0
  Using cached transformers-3.5.1-py3-none-any.whl (1.3 MB)


ERROR: Could not find a version that satisfies the requirement torch>=1.6.0 (from sentence-transformers) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.6.0 (from sentence-transformers)


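##### Could not run this section because torch failed to install. I suspected the missing GPU, but torch installs without one; this pip error more likely means no torch build matched the Python version or platform in use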

In [15]:
from sentence_transformers import SentenceTransformer

sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')  # appears to be a pre-trained model

##### STEP 1 : 
Encode the provided sentences. We can also display the sentence vectors

In [17]:
# Use the SBERT model here, not `model` (which is the Doc2Vec model trained earlier)
sentence_embeddings = sbert_model.encode(sentences)

#print('Sample BERT embedding vector - length', len(sentence_embeddings[0]))
#print('Sample embedding vector - note includes negative values', sentence_embeddings[0])

##### STEP 2:
Define a test query and encode it as well


In [None]:
query = "I had pizza and pasta"
query_vec = sbert_model.encode([query])[0]  # encode with the SBERT model, not the Doc2Vec `model`

##### STEP 3:
Compute the cosine similarity using the NumPy-based `cosine` function defined earlier (not scipy). We will retrieve the similarity values between the sentences and our test query

In [None]:
for sent in sentences:
    # Encode each sentence with the SBERT model and compare it to the query vector
    sim = cosine(query_vec, sbert_model.encode([sent])[0])
    print("Sentence = ", sent, "; similarity = ", sim)
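Since the SBERT cells above could not be run, the following sketch illustrates the same query-versus-sentences comparison on toy data. `rank_by_cosine`, `toy_embeddings`, and `toy_query` are hypothetical stand-ins for the encoder output; the only assumption is that embeddings arrive as a 2-D array with one row per sentence.

```python
import numpy as np

def rank_by_cosine(query_vec, embeddings):
    """Return (indices sorted by descending similarity, similarity scores)."""
    embeddings = np.asarray(embeddings, dtype=float)
    query_vec = np.asarray(query_vec, dtype=float)
    # Normalize rows and the query so the dot product equals cosine similarity.
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q_norm = query_vec / np.linalg.norm(query_vec)
    sims = emb_norm @ q_norm
    order = np.argsort(-sims)
    return order, sims

# Toy 2-D "embeddings" standing in for the output of a sentence encoder.
toy_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([1.0, 0.2])

order, sims = rank_by_cosine(toy_query, toy_embeddings)
for idx in order:
    print("row", idx, "similarity", sims[idx])
```

Computing all similarities in one matrix product avoids re-encoding each sentence inside the loop, and sorting the scores gives a ranking comparable to the `most_similar` output in the Doc2Vec section.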