Baseline for comparing word similarity with Word2Vec (CBOW, Skip-gram) and GloVe

# Import

In [None]:
!pip install gensim
!pip install glove-python-binary

In [61]:
from gensim.models import Word2Vec
from glove import Corpus, Glove
from transformers import AutoTokenizer, AutoModel
from torch.nn.functional import cosine_similarity
import torch
import pickle

# Load tokens

In [46]:
with open('./단어사전/disorder_token_sangjin.pkl', 'rb') as f:
    tokens = pickle.load(f)

# Model

## CBOW

In [41]:
# sg=0 selects the CBOW training algorithm
model = Word2Vec(tokens.values(), vector_size=100, window=5, min_count=1, sg=0)

word1 = "주의"

similar_words = model.wv.most_similar(word1, topn=5)

print(f"Words similar to '{word1}':")
for word, similarity in similar_words:
    print(f"{word}: {similarity}")


Words similar to '주의':
임무: 0.3143480122089386
운동: 0.25495392084121704
동안: 0.24983598291873932
비행: 0.23560838401317596
활동: 0.22014707326889038
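
As a rough illustration of what the `sg` flag changes, the sketch below builds the training examples each variant would see for one tokenized sentence. The function name `training_pairs` and the toy tokens are hypothetical, and the sketch leaves out details of real Word2Vec training such as negative sampling and randomly shrunk windows.

```python
def training_pairs(sentence, window=2):
    """Return (cbow, skipgram) training examples for one tokenized sentence."""
    cbow, skipgram = [], []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        context = [sentence[j] for j in range(lo, hi) if j != i]
        if not context:
            continue
        # CBOW: the whole context window predicts the center word
        cbow.append((context, target))
        # Skip-gram: the center word predicts each context word separately
        skipgram.extend((target, c) for c in context)
    return cbow, skipgram


# Toy sentence (hypothetical tokens, not taken from the pickle file)
cbow, skipgram = training_pairs(["a", "b", "c"], window=1)
print(cbow)
print(skipgram)
```

CBOW produces one example per position, while Skip-gram produces one example per (center, context) pair, which is one reason the two variants can rank neighbors differently on a small corpus.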


## Skip-gram

In [43]:
# Train the Skip-gram model (sg=1)
skipgram_model = Word2Vec(tokens.values(), vector_size=100, window=5, min_count=1, workers=4, sg=1)

# Compute the similarity between two words
similarity = skipgram_model.wv.similarity('주의', '활동')
print(f"'주의'-'활동' similarity: {similarity}")

'주의'-'활동' similarity: 0.2884923219680786
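
The score returned by `wv.similarity` is the cosine similarity of the two word vectors. A minimal sketch of that computation on plain Python lists follows; the helper name `cosine` and the toy vectors are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2-dimensional vectors at a 45-degree angle: the result is about 0.707
print(cosine([1.0, 0.0], [1.0, 1.0]))
```

Values close to 1 mean the vectors point in nearly the same direction, and values near 0 mean they are close to orthogonal.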


## GloVe

In [48]:
# Build the co-occurrence corpus and train the model
corpus = Corpus()
corpus.fit(tokens.values(), window=10)
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)

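# Find the words most similar to the query word
word = '집중력'
# Note: glove-python drops the query word from the result, so number=10 yields 9 neighbors
similar_words = glove.most_similar(word, number=10)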
print(f"Words similar to '{word}': {similar_words}")

Performing 30 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19
Epoch 20
Epoch 21
Epoch 22
Epoch 23
Epoch 24
Epoch 25
Epoch 26
Epoch 27
Epoch 28
Epoch 29
Words similar to '집중력': [('시작', 0.37474681930448084), ('참여', 0.2798269753093418), ('휴대', 0.2720635181098524), ('양식', 0.2401908446687907), ('쇼핑', 0.23388976725801533), ('관계', 0.22760232135279784), ('지향', 0.21636718687754952), ('지갑', 0.21214355773878119), ('느낌', 0.1849278445299922)]
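
GloVe is trained on a word-word co-occurrence matrix rather than on the raw sentences, which is what `corpus.fit(..., window=10)` builds before `glove.fit` runs. The sketch below is a simplified stand-in for that step, not the library's implementation: the function name `cooccurrence` is hypothetical, and the 1/distance weighting and the skipping of identical word pairs are assumptions of this sketch.

```python
from collections import defaultdict


def cooccurrence(sentences, window=10):
    """Count weighted co-occurrences of word pairs within a sliding window."""
    counts = defaultdict(float)
    for sent in sentences:
        for i, word in enumerate(sent):
            # Look only to the right; sorting the pair makes the counts symmetric
            for j in range(i + 1, min(len(sent), i + window + 1)):
                if word == sent[j]:
                    continue  # assumption: ignore a word co-occurring with itself
                pair = tuple(sorted((word, sent[j])))
                # assumption: nearer words contribute more (weight 1/distance)
                counts[pair] += 1.0 / (j - i)
    return dict(counts)


# Toy corpus (hypothetical tokens)
print(cooccurrence([["a", "b", "c"]], window=2))
```

With a small vocabulary most entries of this matrix are zero or close to it, which may help explain why the neighbors listed above have low and fairly similar scores.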


## transformers

### klue/bert-base

In [60]:
# Load the model and tokenizer
model_name = "klue/bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_word_embedding(word, tokenizer, model):
    # Tokenize the word on its own (the word itself is treated as the whole sentence)
    inputs = tokenizer(word, return_tensors="pt", padding=True, truncation=True, max_length=512)

    # Run the model to get the contextual embedding of every token
    with torch.no_grad():
        outputs = model(**inputs)

    # Drop the first ([CLS]) and last ([SEP]) tokens, then average the remaining subword embeddings
    hidden_states = outputs.last_hidden_state[:, 1:-1, :]
    word_embedding = torch.mean(hidden_states, dim=1)

    return word_embedding
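
The pooling step above can be illustrated without loading a model. The sketch below applies the same idea to plain Python lists: drop the first and last rows (standing in for `[CLS]` and `[SEP]`) and average the rest. The function name `mean_pool_without_specials` and the toy vectors are hypothetical.

```python
def mean_pool_without_specials(token_vectors):
    """Average per-token vectors, ignoring the first and last (special) tokens."""
    inner = token_vectors[1:-1]
    if not inner:
        raise ValueError("no word tokens left after removing the special tokens")
    dim = len(inner[0])
    return [sum(vec[k] for vec in inner) / len(inner) for k in range(dim)]


# Toy input: [CLS], two subword tokens, [SEP], each with a 2-dimensional vector
vectors = [[9.0, 9.0], [1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]
print(mean_pool_without_specials(vectors))  # [2.0, 4.0]
```

A word that the tokenizer splits into several subwords therefore still ends up as a single vector of the model's hidden size.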

In [58]:
def get_cosine_similarity(word1, word2):
    word1_embedding = get_word_embedding(word1, tokenizer, model)
    word2_embedding = get_word_embedding(word2, tokenizer, model)

    similarity = cosine_similarity(word1_embedding, word2_embedding)

    return similarity

In [59]:
print(get_cosine_similarity("사과", "바나나"))

tensor([0.7181])
