# Word Embedding

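- **Word Embedding** is a technique that converts words into fixed-dimensional vectors; the term also refers to the resulting vectors, which are trained to reflect semantic similarity between words.
- It is used in natural language processing to process and understand sentences.
- Once words are expressed as numbers, downstream tasks such as sentiment extraction become possible.
- Related words cluster together when they are represented as vectors in a multi-dimensional space; embedding is the process of mapping words or sentences into that vector space.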

**Embedding Matrix example**

*All vector values in the table below are learned by a machine learning model rather than assigned by hand.*  

| Dimension | Man (5391) | Woman (9853) | King (4914) | Queen (7157) | Apple (456) | Orange (6257) |
|-----------|------------|--------------|-------------|--------------|-------------|---------------|
| Gender    | -1         | 1            | -0.95       | 0.97         | 0.00        | 0.01          |
| Royalty   | 0.01       | 0.02         | 0.93        | 0.95         | -0.01       | 0.00          |
| Age       | 0.03       | 0.02         | 0.7         | 0.69         | 0.03        | -0.02         |
| Food      | 0.04       | 0.01         | 0.02        | 0.01         | 0.95        | 0.97          |

<br>

*The table below is the transpose of the one above.*

| Word          | Gender | Royalty | Age    | Food   |
|---------------|--------|--------|--------|--------|
| Man (5391)    | -1.00  | 0.01   | 0.03   | 0.04   |
| Woman (9853)  | 1.00   | 0.02   | 0.02   | 0.01   |
| King (4914)   | -0.95  | 0.93   | 0.70   | 0.02   |
| Queen (7157)  | 0.97   | 0.95   | 0.69   | 0.01   |
| Apple (456)   | 0.00   | -0.01  | 0.03   | 0.95   |
| Orange (6257) | 0.01   | 0.00   | -0.02  | 0.97   |
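
To make "close in vector space" concrete, the following minimal sketch computes cosine similarity on the toy vectors from the table above. The helper `cosine` and the variable names are illustrative assumptions, and the four hand-labeled dimensions are only a teaching device: a real model learns its dimensions without labels.

```python
import math

# Toy 4-dimensional vectors copied from the table above
# (dimension order: gender, royalty, age, food).
vectors = {
    "man":    [-1.00, 0.01, 0.03, 0.04],
    "woman":  [1.00, 0.02, 0.02, 0.01],
    "king":   [-0.95, 0.93, 0.70, 0.02],
    "queen":  [0.97, 0.95, 0.69, 0.01],
    "apple":  [0.00, -0.01, 0.03, 0.95],
    "orange": [0.01, 0.00, -0.02, 0.97],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two fruits share their dominant dimension, so they score close to 1.
fruit_similarity = cosine(vectors["apple"], vectors["orange"])
# A royal title and a fruit share almost nothing, so they score close to 0.
mixed_similarity = cosine(vectors["king"], vectors["apple"])

# Vector arithmetic: king - man + woman should land near queen.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
candidates = {w: v for w, v in vectors.items()
              if w not in ("king", "man", "woman")}
nearest = max(candidates, key=lambda w: cosine(target, candidates[w]))

print(round(fruit_similarity, 3), round(mixed_similarity, 3), nearest)
```

Cosine similarity is used here because it compares direction rather than length, which is also how embedding libraries usually rank neighbors.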

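- **Reflects semantic similarity**  
  - Each word is represented as a fixed-size real-valued vector, and words with similar meanings lie close together in the vector space.  
  - For example, "king" and "queen" are often used in similar contexts, so they are placed near each other in the vector space.  

- **Dense vector**  
  - Unlike BoW, DTM, and TF-IDF, Word Embedding maps words to low-dimensional dense vectors that carry rich semantic information despite the low dimensionality.  
  - The vector dimension is typically set somewhere around 100 to 300.  

- **Reflects context information**  
  - Word Embedding infers the meaning of a word by learning from the words that surround it.  
  - For example, "bank" appears next to "river" (riverbank) as well as next to "money" (financial institution). A static embedding such as Word2Vec assigns one vector per word, so the vector for "bank" reflects a blend of both usages rather than two separate senses.  

- **Learned vectors**  
  - Word Embedding generates vectors by learning associations between words from large-scale text data.  
  - In contrast, BoW and TF-IDF are count-based vectorization methods that follow fixed rules and involve no training.  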

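### Sparse Representation | Distributed Representation
- A one-hot vector produced by one-hot encoding has a 1 only at the word's index and 0 everywhere else.
- A representation that uses vectors or matrices in which most values are 0 is called a sparse representation.  
- Its drawback is that it cannot express meaningful similarity between word vectors: any two distinct one-hot vectors are orthogonal.
- To address this, a distributed representation is used, which encodes a word's meaning as a vector in a multi-dimensional space.
- Turning words into vectors that capture semantic similarity through a distributed representation is called word embedding, and the resulting vectors are called embedding vectors.  
- **One-hot encoding → sparse representation**  
- **Word embedding → distributed representation**  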

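**Distributed Representation**
- Distributed representation is based on the distributional hypothesis.
- The hypothesis states that "words that appear in similar contexts have similar meanings."
- For example, the word 'puppy' (강아지) frequently appears alongside words such as 'cute', 'pretty', and 'charming', so once vectorized, words that share these contexts end up with similar vectors.
- A distributed representation spreads the meaning of a word across multiple dimensions.  
- It does not need as many dimensions as the vocabulary size, unlike a one-hot vector, so the dimensionality is comparatively low.
- For example, if the vocabulary size is 10,000 and the index of 'puppy' is 4 (counting from 0), the one-hot vector is:
  
- **puppy = [0 0 0 0 1 0 0 ... 0]** (the 1 is followed by 9,995 zeros)  
- A vector embedded with Word2Vec, however, is independent of the vocabulary size and holds one real value per configured dimension:  
- **puppy = [0.2 0.3 0.5 0.7 0.2 ... 0.2]**  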

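**In summary,**
- A sparse representation is high-dimensional and gives each word its own separate dimension, whereas a distributed representation is low-dimensional and spreads a word's meaning across multiple dimensions.
- This makes it possible to compute meaningful similarity between word vectors; Word2Vec is a representative training method.  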
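
The limitation of the sparse representation can be checked directly. This short sketch, which assumes the vocabulary size of 10,000 and index 4 from the example above (the second index, 7, is a hypothetical choice), builds two one-hot vectors and shows that their dot product is 0, so no similarity signal is available.

```python
def one_hot(index, vocab_size):
    """Return a one-hot list with a 1 at `index` and 0 elsewhere."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

vocab_size = 10_000
word_a = one_hot(4, vocab_size)   # the example word at index 4
word_b = one_hot(7, vocab_size)   # hypothetical index for a different word

zeros_after = vocab_size - 1 - 4  # zeros that follow the single 1

print(sum(word_a), zeros_after, dot(word_a, word_b))  # prints: 1 9995 0
```

Every pair of distinct words yields the same dot product of 0, which is why one-hot vectors cannot rank words by relatedness.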

### Visualizing Embedding Vectors: wevi
https://ronxin.github.io/wevi/

### Word2Vec
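- A Word Embedding method developed at Google in 2013
- One of the earliest widely adopted neural embedding models
- Learns automatically from a very large corpus
    - Can be described as self-supervised learning (unlabeled data, supervised-style training)
    - Labels are inferred from the large body of raw text itself and then used for supervised training
- ex) Words that fill the same slot in otherwise identical sentences end up with similar vectors:
    - Swore allegiance to the **isageum** (이사금, a royal title of Silla).
    - Swore allegiance to the **king** (왕).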

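**Word2Vec variants by training method**
1. CBOW: predicts the center word from the context words (the vector values are learned along the way)
2. Skip-gram: predicts the context words from the center word (the vector values are learned along the way)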

In [31]:
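# Self-supervised learning: uses unlabeled data, as in unsupervised learning, but creates its own "pseudo-labels" from that data and trains on them as in supervised learning

In [32]:
# A method that hides some of the words in a sentence and trains the model to guess the hidden words, so that it learns to understand context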

##### CBOW (Continuous Bag of Words)  
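- CBOW takes one-hot vectors as input, but each one acts only as an index that selects a position (a row of the weight matrix); it carries no information about the word itself.  

**Example:**  

> The fat cat sat on the mat  

Given this sentence, CBOW's main task is to predict the word 'sat'.  
- **Center word:** the word to be predicted ('sat')  
- **Context words:** the words used to make the prediction  

The range that determines how many words before and after the center word are consulted is called the **window**.  
For example, if the window size is 2 and the center word is 'sat', the two preceding words (fat, cat) and the two following words (on, the) are used as input.  
With a window size of n, up to 2n context words are consulted in total. Moving the window along the sentence to generate training data is called a **sliding window**.  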
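
The sliding window described above can be sketched in a few lines. The function name `cbow_pairs` is an illustrative assumption, not part of any library; it yields one (context words, center word) training example per position.

```python
def cbow_pairs(tokens, window):
    """Build (context_words, center_word) examples with a sliding window."""
    pairs = []
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        pairs.append((left + right, center))
    return pairs

tokens = "The fat cat sat on the mat".split()
pairs = cbow_pairs(tokens, window=2)

for context, center in pairs:
    print(context, "->", center)
```

Near the sentence boundaries fewer than 2n context words are available, so the first and last examples have shorter context lists.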

![](https://wikidocs.net/images/page/22660/%EB%8B%A8%EC%96%B4.PNG)


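**Training process**

CBOW is a network structured to learn embedding vectors. Its weights start from random values and are optimized through backpropagation.  

![](https://wikidocs.net/images/page/22660/word2vec_renew_1.PNG)

Word2Vec uses a shallow neural network with a single hidden layer.  
Two main sets of weights are learned:  

1. **Projection layer:**  
   - It has no activation function and performs a lookup-table operation.  
   - The weights W between the input layer and the projection layer form a V × M matrix, where **V is the vocabulary size and M is the vector dimension**.  
   - After training, each row of W is treated as the M-dimensional embedding vector of one word.  
   - For example, if the vector dimension is set to 5, each word's embedding vector has 5 dimensions.  

2. **Output layer:**  
   - The weights W' between the projection layer and the output layer form an M × V matrix.  
   - The two matrices (W and W') are independent of each other and are initialized with random values before training.  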
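
The "lookup table" behavior of the projection layer follows from plain matrix arithmetic: multiplying a one-hot vector by W returns exactly one row of W. The sketch below checks this with toy sizes (V = 7, M = 5) and random weights; the helper names and the chosen word index are illustrative assumptions.

```python
import random

random.seed(0)  # reproducible toy weights

V, M = 7, 5  # toy sizes: 7 words in the vocabulary, 5-dimensional embeddings
W = [[random.uniform(-1.0, 1.0) for _ in range(M)] for _ in range(V)]  # V x M

def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def vec_matmul(x, matrix):
    """Multiply a length-V row vector by a V x M matrix."""
    return [sum(x[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(len(matrix[0]))]

word_index = 2  # hypothetical index of some word in the vocabulary
projected = vec_matmul(one_hot(word_index, V), W)  # full matrix product
looked_up = W[word_index]                          # direct row lookup

same = all(abs(a - b) < 1e-12 for a, b in zip(projected, looked_up))
print(same)
```

Because the two results match, implementations can skip the multiplication and index into W directly.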

![](https://wikidocs.net/images/page/22660/word2vec_renew_3.PNG)


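**Prediction process**
1. CBOW averages the vectors looked up for the context words and then multiplies the result by the output-layer weights W'.  
2. The result is passed through the **softmax** activation function, which turns it into a predicted probability for each word being the center word.  
3. The predicted values (the score vector) are compared with the one-hot vector of the actual target, and the loss is computed with the **cross-entropy** function.  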

![](https://wikidocs.net/images/page/22660/word2vec_renew_5.PNG)

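Loss function:  
$$
cost(\hat{y}, y) = -\sum_{j=1}^{V} y_{j} \cdot \log(\hat{y}_{j})
$$

Here, $\hat{y}_{j}$ is the predicted probability for word $j$, $y_{j}$ is the actual value (the $j$-th entry of the one-hot target vector), and $V$ is the vocabulary size.  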
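
The prediction steps and the loss can be traced end to end with a small forward pass. This is a minimal sketch with random, untrained weights; the sizes, the word indices, and the names `W_out` (standing in for W') and `cbow_forward` are assumptions made for illustration.

```python
import math
import random

random.seed(0)
V, M = 7, 5  # toy vocabulary size and embedding dimension
W = [[random.uniform(-0.5, 0.5) for _ in range(M)] for _ in range(V)]      # V x M
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(M)]  # M x V (W')

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids, target_id):
    # 1. Look up the context rows of W and average them.
    rows = [W[i] for i in context_ids]
    hidden = [sum(col) / len(rows) for col in zip(*rows)]
    # 2. Multiply by W' to get one score per vocabulary word, then softmax.
    scores = [sum(hidden[k] * W_out[k][j] for k in range(M)) for j in range(V)]
    probs = softmax(scores)
    # 3. Cross-entropy against the one-hot target vector.
    y = [1.0 if j == target_id else 0.0 for j in range(V)]
    loss = -sum(y[j] * math.log(probs[j]) for j in range(V))
    return probs, loss

# Hypothetical indices: fat=1, cat=2, sat=3, on=4, the=5
context_ids, target_id = [1, 2, 4, 5], 3
probs, loss = cbow_forward(context_ids, target_id)
print(round(sum(probs), 6), loss)
```

Since y is one-hot, the sum collapses to a single term, so the loss equals the negative log of the probability assigned to the true center word.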


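**Training result**  
- The weights W and W' are learned through backpropagation. 
- Once training is complete, each row of W can be used as a word's embedding vector, or the embedding vectors can be built from both W and W'.  
- Because CBOW predicts the center word from its context words, it learns semantic relationships between words effectively.  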

##### Skip-gram
- Skip-gram predicts the context words from the center word.
- With a window size of 2, the dataset is constructed as follows.

![](https://wikidocs.net/images/page/22660/skipgram_dataset.PNG)

![](https://wikidocs.net/images/page/22660/word2vec_renew_6.PNG)

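- Because it predicts context words from a single center word, there is no step that averages vectors in the projection layer.
- Comparisons in several papers have generally reported that Skip-gram performs better than CBOW.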
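
The Skip-gram dataset construction can be sketched in the same way as for CBOW, except that each example pairs the center word with a single context word. The function name `skipgram_pairs` is an illustrative assumption.

```python
def skipgram_pairs(tokens, window):
    """Build (center_word, context_word) examples with a sliding window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # skip the center word itself
                pairs.append((center, tokens[j]))
    return pairs

tokens = "The fat cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)

sat_pairs = [p for p in pairs if p[0] == "sat"]
print(sat_pairs)
print(len(pairs))
```

One sentence produces more training examples here than under CBOW, because every (center, context) combination becomes its own example.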

https://regexr.com/

In [33]:
# !pip install gensim   # library that provides a Word2Vec implementation

##### English Word Embedding

- Data acquisition and preprocessing

In [34]:
import gdown

url = 'https://drive.google.com/uc?id=1TF1yAHF3qRINbXWFOajFjUCxUF64QZMX'
output = 'ted_en.xml'

gdown.download(url, output)

Downloading...
From (original): https://drive.google.com/uc?id=1TF1yAHF3qRINbXWFOajFjUCxUF64QZMX
From (redirected): https://drive.google.com/uc?id=1TF1yAHF3qRINbXWFOajFjUCxUF64QZMX&confirm=t&uuid=bb923696-7901-4e2e-8438-d24c210d565b
To: c:\encore_skn11\07_nlp\03_word_embedding\ted_en.xml
100%|██████████| 74.5M/74.5M [00:01<00:00, 40.3MB/s]


'ted_en.xml'

In [35]:
from lxml import etree
import re
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

In [36]:
# Parse the XML data
with open('ted_en.xml', 'r', encoding='UTF-8') as f:
    xml = etree.parse(f)

contents = xml.xpath('//content/text()')    # text nodes under each content tag
# contents[:5]

corpus = '\n'.join(contents)
print(len(corpus))

# Use a regex to remove parenthesized annotations such as (Laughter) and (Applause)
corpus = re.sub(r'\([^)]*\)', '', corpus)  
print(len(corpus))

24222849
24062319
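
The effect of the pattern `\([^)]*\)` is easier to see on a short string. The sample sentence below is made up for illustration; the second substitution, which collapses the leftover double spaces, is an optional extra step that the notebook itself does not perform.

```python
import re

# Made-up sample in the style of a TED transcript
sample = "Thank you so much. (Applause) That was, (Laughter) well, unexpected."

# Remove "(" followed by any run of non-")" characters and the closing ")"
cleaned = re.sub(r'\([^)]*\)', '', sample)

# Optional: collapse the whitespace left behind by the removed annotations
collapsed = re.sub(r'\s+', ' ', cleaned).strip()

print(repr(cleaned))
print(repr(collapsed))
```

Note that the pattern removes every parenthesized span, not only stage directions, so ordinary parenthetical remarks in the talks are dropped as well.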


In [37]:
# Preprocessing (tokenization / lowercasing / stopword removal)
# Assumes the NLTK 'punkt' and 'stopwords' resources are already downloaded
sentences = sent_tokenize(corpus)
preprocessed_sentences = []
en_stopwords = set(stopwords.words('english'))  # set for fast membership tests

for sentence in sentences:
    sentence = sentence.lower()
    sentence = re.sub(r'[^a-z0-9]', ' ', sentence)  # replace everything except lowercase letters and digits with a space
    tokens = word_tokenize(sentence)
    tokens = [token for token in tokens if token not in en_stopwords]
    preprocessed_sentences.append(tokens)

preprocessed_sentences[:5]

[['two', 'reasons', 'companies', 'fail', 'new'],
 ['real',
  'real',
  'solution',
  'quality',
  'growth',
  'figuring',
  'balance',
  'two',
  'activities',
  'exploration',
  'exploitation'],
 ['necessary', 'much', 'good', 'thing'],
 ['consider', 'facit'],
 ['actually', 'old', 'enough', 'remember']]

- Training the embedding model

In [46]:
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=preprocessed_sentences,   # corpus (list of token lists)
    vector_size=100,                    # dimension of the embedding vectors
    sg=0,                               # training algorithm (1 = Skip-gram, 0 = CBOW)
    window=5,                           # number of context words considered on each side of the center word
    min_count=5                         # minimum frequency (words that occur fewer than 5 times are dropped)
)

model.wv.vectors.shape

(21462, 100)
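
The first number in the shape above is the vocabulary size that remains after `min_count` filtering. The sketch below approximates that filtering step with `collections.Counter` on a made-up toy corpus; `build_vocab` is a hypothetical helper, and gensim's own vocabulary construction involves more than this (for example, frequency ordering).

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {word: count for word, count in counts.items() if count >= min_count}

# Made-up toy corpus in the same list-of-token-lists format
toy_sentences = [
    ["real", "solution", "quality"],
    ["real", "growth"],
    ["real", "quality"],
]

vocab = build_vocab(toy_sentences, min_count=2)
print(vocab)
```

Rare words are dropped because a handful of occurrences gives too little context to learn a reliable vector.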

In [39]:
import pandas as pd

pd.DataFrame(model.wv.vectors, index=model.wv.index_to_key).head(10)    # index_to_key: the vocabulary words, used here as the row index

word,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99
one,-1.037543,0.256925,-0.456524,-0.051298,0.097585,-0.966302,-0.565377,0.368886,-1.929277,-0.312155,0.985646,0.283635,-0.439902,0.40408,0.278465,-0.432822,0.33519,-0.855497,-0.898538,-1.666382,1.503753,0.282579,1.233055,-0.711129,0.066786,-0.92289,0.114391,-0.07693,0.382532,-0.309766,-0.341735,0.777135,0.286894,-0.980891,-0.066363,0.540996,-0.048148,-0.064269,-0.326356,-1.481685,...,0.792139,0.442927,-1.040028,-0.302952,0.106998,0.361419,0.096842,0.079683,-0.219651,0.504311,-0.328717,0.169539,0.149502,0.098578,-0.451252,-1.259105,-0.056237,0.11175,0.059826,1.108103,0.91609,0.542874,-0.990058,-0.458527,-1.918749,-0.736803,-0.774773,-0.086868,-0.465951,2.017634,1.418058,0.45086,-0.933226,-0.133593,0.724407,-1.697631,-1.307543,0.098816,1.096606,1.468806
people,-1.624187,0.492345,-0.033345,0.237822,-0.694842,-0.407227,-0.084756,0.409875,-0.759424,-2.812324,1.183966,-0.271538,0.364884,0.055587,1.66347,-0.467956,0.383573,0.507944,0.22617,0.16507,0.774358,-1.994244,3.106387,0.459523,-0.296938,0.06842,-0.816073,-0.857725,0.013103,-0.018559,0.766922,0.355904,-0.148447,0.896573,-0.138577,1.520363,0.699693,-0.734261,-0.818657,1.123982,...,0.803025,-0.03262,-0.350681,-1.4185,-0.619981,0.490477,0.611523,0.065134,0.39706,1.95663,0.962901,0.404418,0.652151,0.324194,-0.513917,2.069163,-0.97755,-0.2058,0.381311,-0.616677,-2.694984,-0.257644,0.096314,0.586033,-1.093633,1.139184,-0.951657,-0.875583,-0.878273,0.422534,1.163053,1.030891,-1.037176,-0.815198,-0.723342,0.300006,0.148611,-0.626744,-1.62649,1.004843
like,-0.868234,-0.626363,-0.779085,-0.609282,-0.121492,-0.314972,-0.212054,0.368738,-1.920715,0.784323,-1.361491,0.273666,0.697903,-0.063869,1.657494,-0.895221,-0.091522,0.403542,-0.055418,0.411376,0.373937,-0.827234,-0.893167,-1.812549,0.670303,-0.312897,2.486383,0.731049,-0.667016,0.052039,1.28916,-0.590186,-0.934933,0.026615,-1.478675,-0.106513,0.520368,1.107007,-0.183793,-0.254251,...,0.828883,-0.598797,0.976088,-0.557595,-1.175766,2.331516,0.684848,-0.775479,-0.931835,0.281739,0.926357,0.553959,1.117526,-0.252998,0.341684,1.741089,1.889227,-0.213646,-0.578799,0.730605,-0.032705,-0.198634,-1.818456,0.788169,0.288374,-1.001983,-1.239816,0.27873,0.82154,0.11161,-0.176924,0.236012,0.292788,-0.369551,-0.42878,0.866908,-0.207208,-0.16089,0.909012,-1.139922
know,-0.730419,-0.223894,-0.771613,0.589559,0.273112,-0.37614,-0.063493,0.116479,-0.388547,-0.552971,-0.305082,-0.428757,-0.311401,-0.442544,0.727279,0.086084,0.789956,0.366774,0.117284,-1.048147,0.342976,-0.011824,0.322511,0.581102,0.480408,0.867339,0.964122,-0.400571,-0.324457,0.006519,0.462021,0.055638,-0.513282,-0.19281,-0.347237,0.546274,0.344883,-0.887153,-0.286458,-0.681864,...,0.007677,-0.161763,0.465145,-0.670182,-0.972113,0.790358,0.420402,0.912053,-1.478917,-0.493234,0.638826,0.930349,0.295155,0.509309,0.871249,-0.259527,-0.343176,0.163859,-0.866958,0.35093,-0.922148,0.139717,-1.530186,0.657039,-1.020263,0.47991,0.841832,1.167693,0.55082,-0.993412,0.004779,0.391151,-0.112773,-0.470254,-0.662986,0.079997,-0.005156,-0.46687,0.795719,-0.620436
going,-1.217457,0.998311,-0.800044,-0.558765,1.854455,0.207791,-0.531749,1.33199,-0.515191,-0.887948,-0.847039,-0.073535,0.236947,0.341424,0.872574,-1.281547,0.220324,0.553818,0.934403,-1.90359,-0.155387,-1.477983,-0.089042,1.551849,0.196964,-0.448708,-0.386514,0.339032,0.452127,-0.413506,1.20009,1.561357,0.691826,-0.068079,0.675684,-0.223103,-0.269838,0.020678,-0.414629,0.397458,...,-0.322329,1.212499,0.939324,0.555845,-1.513418,0.964989,0.357831,1.376425,-0.98346,-0.457857,-1.527039,1.028238,-0.319057,0.919059,0.67859,1.716021,0.942217,1.425267,-1.023851,0.364298,-0.254255,-0.475094,-0.697739,-0.833883,-0.856777,1.166101,0.981448,0.222386,-1.113844,0.347084,0.524775,-0.602731,-0.432099,1.684758,0.283094,-0.546439,1.327456,-0.315375,-0.844938,0.604256
think,-0.698013,-0.350541,0.439105,-0.42333,-0.002902,-0.713102,0.327641,0.41562,-0.558601,-0.681224,-0.027001,-0.239665,0.193703,-1.084151,-0.152699,-0.991992,-0.31068,-0.558592,0.166604,-0.305133,-0.781597,0.434186,1.815184,0.621272,0.554223,0.064271,-0.039505,-0.239199,0.447448,0.558048,0.337707,0.330063,0.839116,-0.221557,-0.168948,1.36848,0.748682,0.195159,0.482072,-1.026534,...,0.081276,0.401096,1.035606,-0.247496,-1.257211,0.892749,1.143865,-0.093701,-1.885954,-0.333048,0.90079,1.471134,0.611164,0.832383,0.666187,-0.225479,-0.044388,0.153181,-0.07148,0.287535,-0.585079,-0.035646,-0.987847,-0.225045,-0.679795,0.318586,0.175872,1.545713,0.192996,-0.120158,1.363064,1.348718,-0.199634,0.169713,0.251029,-0.326358,-0.393013,-1.12916,-0.103626,-0.687824
see,0.011837,-0.309356,0.814826,-0.893165,-0.756341,-0.831024,-0.533742,0.551236,-1.732524,0.980099,-0.524586,-0.111244,-0.283461,0.362067,0.524442,-1.457345,0.077236,-0.227996,1.078211,-0.944182,0.29136,-0.583276,-0.561987,0.0711,1.333853,0.514647,0.642655,0.008117,0.114323,0.840784,0.049981,0.986045,0.107874,0.631564,0.363832,1.531346,-0.794771,-2.028866,0.279526,-0.837822,...,-0.449252,-0.185145,-0.676833,1.000013,-0.433591,-0.496693,-0.112013,-0.708001,-0.923223,0.418192,-0.147092,0.373842,0.18192,0.050845,0.491707,-0.697894,0.771469,0.341502,-0.398357,0.812333,0.103181,0.507074,-0.809158,0.452946,0.616222,1.505453,0.373915,1.150921,0.372524,0.197352,-0.345581,0.676969,0.318996,0.219935,0.429069,0.235492,0.635551,-0.98452,0.250803,-0.373136
would,-0.314115,0.284673,1.002277,-0.718731,1.877864,0.647009,-0.391647,0.199008,-1.867468,-0.364277,-0.278587,0.040899,-0.007944,0.70294,-0.064164,-0.446113,0.474226,1.155997,-0.198069,-0.809132,0.296678,-0.34632,-0.127738,-0.852389,-1.028188,-0.699171,-0.983645,-0.342364,-0.832245,-0.00855,0.273549,1.094226,-0.31813,-0.862018,0.077014,0.912072,-1.73028,0.084342,-1.580748,1.797623,...,1.943131,0.692836,-1.499373,-0.32657,0.072018,0.535699,-0.300201,1.49277,-0.618633,-0.928613,1.31951,0.179915,0.722216,-0.468315,-0.292347,2.203519,2.158412,0.163281,0.565245,-0.836462,-1.206817,0.125269,-2.132327,-1.193671,-0.264488,0.483963,1.482911,0.709435,1.38721,0.314114,0.930601,-0.612731,-0.553153,1.390393,-0.459028,0.513495,-0.104244,-0.9936,-0.702535,-1.063961
really,-2.165626,-1.05455,-0.12516,0.53436,0.532016,0.173083,1.108847,0.669265,-0.679644,-1.174095,0.622346,-1.201398,1.22696,0.449152,0.399258,-0.542976,-0.269319,-0.370091,0.767597,-1.210803,-0.240126,0.092191,1.307999,0.267733,1.551126,1.068226,-0.266986,-1.865929,-0.000936,-0.182694,1.55254,0.005381,1.899784,0.758373,-0.508254,0.807865,0.571889,0.416567,-0.080914,-0.73567,...,0.434285,-0.455447,2.099265,-0.670725,-1.705803,0.197479,0.712666,0.543947,-0.808352,0.597578,1.002218,0.672947,-0.127098,0.267151,0.043273,0.107183,-0.14098,-0.335164,0.579778,-0.03655,0.082515,0.042893,-0.02338,0.685178,0.103946,1.034348,0.84858,-0.032466,0.243719,0.145248,0.639411,-0.069999,0.188896,0.038189,0.929764,0.230777,-0.806072,-1.730868,0.341294,-0.314023
get,-2.094868,-0.996378,-0.8717,-0.920379,0.193898,0.017109,-0.427439,1.126358,0.675815,-2.225879,0.588232,0.216891,-2.515471,0.455121,0.35045,0.358709,-0.540396,-0.344688,1.379858,-0.203664,0.293467,-0.451586,-0.202152,1.646159,0.669212,-0.195778,0.181725,-0.389433,1.292014,-0.533416,0.62796,0.697279,1.603717,0.399779,-0.955494,1.308291,-0.911537,-1.057855,-0.424855,-0.394519,...,0.293506,1.930882,0.431977,0.07428,-1.466985,-0.095761,0.247062,-0.461277,-1.476601,0.069654,0.824789,0.79026,0.61194,-0.559995,0.318212,0.021219,-0.209655,1.030852,0.345151,0.674453,-0.18989,-1.27138,-1.128333,0.947941,1.096415,-0.11777,0.062011,0.67198,-0.123901,1.216709,-0.044727,-0.322464,-0.151085,0.025218,0.591481,0.469718,0.618129,0.416137,-0.563816,0.200399


In [48]:
# Save the trained embedding vectors
model.wv.save_word2vec_format('ted_en_w2v')

In [None]:
# Load the saved embedding vectors
from gensim.models import KeyedVectors

load_model = KeyedVectors.load_word2vec_format('ted_en_w2v')

- Computing similarity

In [47]:
model.wv.most_similar('man')    # words most similar to 'man'
# model.wv.most_similar('abracadabra')    # looking up a word that is not in the vocabulary raises a KeyError

[('woman', 0.8962153792381287),
 ('daughter', 0.7890374660491943),
 ('girl', 0.7854432463645935),
 ('lady', 0.7824974060058594),
 ('father', 0.7647601962089539),
 ('son', 0.7621195316314697),
 ('boy', 0.7620626091957092),
 ('brother', 0.7458978891372681),
 ('grandfather', 0.7357702255249023),
 ('sister', 0.7320916652679443)]

In [44]:
load_model.most_similar('man')  # model.wv is itself a KeyedVectors instance, so the same methods are available

[('woman', 0.8906522989273071),
 ('girl', 0.8080182075500488),
 ('daughter', 0.7959010004997253),
 ('son', 0.7876567244529724),
 ('lady', 0.7868496179580688),
 ('grandfather', 0.7742390036582947),
 ('father', 0.7667587995529175),
 ('grandmother', 0.7577228546142578),
 ('sister', 0.752505898475647),
 ('boy', 0.7516880631446838)]

In [None]:
model.wv.similarity('man', 'husband')   # similarity between two words (the value varies slightly between training runs)

0.72041714

In [None]:
model.wv['man']

array([ 1.15428603e+00,  1.66146472e-01,  8.74931633e-01,  1.71548569e+00,
       -9.20389533e-01,  3.37695852e-02, -7.85597980e-01,  1.56282449e+00,
       -6.69489682e-01, -1.04619658e+00, -4.47931513e-02,  5.90489388e-01,
        8.13253641e-01,  7.34468579e-01,  7.59392917e-01, -4.78596717e-01,
        1.00773227e+00, -1.41518131e-01, -1.13767982e+00, -6.63283706e-01,
        6.64211631e-01,  1.28990185e+00, -8.04617479e-02,  7.56952912e-02,
        1.23693466e-01, -4.77305770e-01, -7.38497496e-01, -1.13054931e+00,
        2.51591444e-01,  1.29510498e+00, -1.27243352e+00, -1.70458579e+00,
       -2.18518943e-01, -9.02804971e-01, -6.44786477e-01,  1.14883518e+00,
       -5.48195302e-01, -5.57738185e-01,  5.35091460e-01,  2.99523026e-01,
        8.87771070e-01,  4.44204926e-01,  7.82241642e-01,  4.12249297e-01,
        2.11211872e+00,  5.84527075e-01, -4.46760476e-01,  8.21967185e-01,
        6.83324039e-01, -3.67053509e-01,  9.39221501e-01, -3.72545302e-01,
        1.04090892e-01, -

- Embedding visualization

https://projector.tensorflow.org/

- embedding vector (tensor) file (.tsv)
- metadata file (.tsv)

In [49]:
!python -m gensim.scripts.word2vec2tensor --input ted_en_w2v --output ted_en_w2v

2025-04-07 16:14:33,923 - word2vec2tensor - INFO - running c:\Users\USER\anaconda3\envs\pystudy_env\Lib\site-packages\gensim\scripts\word2vec2tensor.py --input ted_en_w2v --output ted_en_w2v
2025-04-07 16:14:33,924 - keyedvectors - INFO - loading projection weights from ted_en_w2v
2025-04-07 16:14:34,969 - utils - INFO - KeyedVectors lifecycle event {'msg': 'loaded (21462, 100) matrix of type float32 from ted_en_w2v', 'binary': False, 'encoding': 'utf8', 'datetime': '2025-04-07T16:14:34.810173', 'gensim': '4.3.3', 'python': '3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:49:16) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-11-10.0.26100-SP0', 'event': 'load_word2vec_format'}
2025-04-07 16:14:35,687 - word2vec2tensor - INFO - 2D tensor file saved to ted_en_w2v_tensor.tsv
2025-04-07 16:14:35,687 - word2vec2tensor - INFO - Tensor metadata file saved to ted_en_w2v_metadata.tsv
2025-04-07 16:14:35,687 - word2vec2tensor - INFO - finished running word2vec2tensor.py


##### Korean Word Embedding
- NSMC (Naver Sentiment Movie Corpus)

In [50]:
import numpy as np
import pandas as pd 
import urllib.request
from konlpy.tag import Okt

In [51]:
# Download the data
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/e9t/nsmc/master/ratings.txt",
    filename="naver_movie_ratings.txt"
)

('naver_movie_ratings.txt', <http.client.HTTPMessage at 0x1cd279bf680>)

In [52]:
# Create the DataFrame
ratings_df = pd.read_csv('naver_movie_ratings.txt', sep='\t')

In [None]:
# Check for missing values and drop the affected rows
display(ratings_df.isnull().sum())

ratings_df = ratings_df.dropna(how='any')

id          0
document    8
label       0
dtype: int64

In [55]:
ratings_df['document'][200:300]

200    많은 생각을 할 수 있는 영화~ 시간여행류의 스토리를 좋아하는 사람이라면 빠트릴 수...
201    고소한 19 정말 재미있게 잘 보고 있습니다^^ 방송만 보면 털털하고 인간적이신 것...
202                                                  가연세
203                         goodgoodgoodgoodgoodgoodgood
204                                           이물감. 시 같았다
                             ...                        
295                                   박력넘치는 스턴트 액션 평작이다!
296                                      엄청 재미있다 명작이다 ~~
297    나는 하정우랑 개그코드가 맞나보다 엄청 재밌게봤네요 특히 단발의사샘 장면에서 계속 ...
298                                                적당 ㅎㅎ
299                                    배경이 이쁘고 캐릭터도 귀엽네~
Name: document, Length: 100, dtype: object

In [56]:
# Remove every character except Korean letters, digits, and whitespace
ratings_df['document'] = ratings_df['document'].replace(r'[^0-9가-힣ㄱ-ㅎㅏ-ㅣ\s]', '', regex=True)
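
To see what this pattern keeps and drops, the same regular expression can be applied to a few of the reviews shown in the earlier output. This sketch uses `re.sub` on plain strings instead of pandas; the final filtering step is an optional suggestion and is not part of the notebook.

```python
import re

# Same pattern as above: anything that is not a digit, a Hangul syllable,
# a standalone Hangul consonant or vowel, or whitespace is removed.
pattern = r'[^0-9가-힣ㄱ-ㅎㅏ-ㅣ\s]'

# Reviews taken from the sample output shown earlier
samples = ["엄청 재미있다 명작이다 ~~", "goodgoodgoodgoodgoodgoodgood", "적당 ㅎㅎ"]

cleaned = [re.sub(pattern, '', s) for s in samples]

# Optional: drop reviews that became empty after cleaning
nonempty = [s for s in cleaned if s.strip()]

print(cleaned)
print(len(nonempty))
```

A review written entirely in Latin letters becomes an empty string rather than a missing value, so `dropna` would not catch it and it later yields an empty token list.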

In [57]:
# Preprocessing
from tqdm import tqdm   # progress bar

okt = Okt()
ko_stopwords = ['은', '는', '이', '가', '을', '를', '와', '과', '들', '도', '부터', '까지', '에', '나', '너', '그', '걔', '얘']

preprocessed_data = []

for sentence in tqdm(ratings_df['document']):
    tokens = okt.morphs(sentence, stem=True)
    tokens = [token for token in tokens if token not in ko_stopwords]
    preprocessed_data.append(tokens)


100%|██████████| 199992/199992 [07:24<00:00, 449.74it/s]


In [58]:
model = Word2Vec(
    sentences=preprocessed_data,
    vector_size=100,
    window=5,
    min_count=5,
    sg=0    # CBOW
)

model.wv.vectors.shape

(16841, 100)

In [59]:
model.wv.most_similar('극장')

[('영화관', 0.942000150680542),
 ('케이블', 0.7884935140609741),
 ('틀어주다', 0.766829252243042),
 ('학교', 0.7575247883796692),
 ('티비', 0.7262733578681946),
 ('대학로', 0.7220457792282104),
 ('방금', 0.7055728435516357),
 ('영화제', 0.7023119926452637),
 ('시사회', 0.6865780353546143),
 ('개봉관', 0.6836691498756409)]

In [60]:
model.wv.similarity('김혜수', '전도연')

0.87271345

In [61]:
# Save the embedding vectors
model.wv.save_word2vec_format('naver_movie_ratings_w2v')

In [62]:
!python -m gensim.scripts.word2vec2tensor --input naver_movie_ratings_w2v --output naver_movie_ratings_w2v

2025-04-07 17:11:10,096 - word2vec2tensor - INFO - running c:\Users\USER\anaconda3\envs\pystudy_env\Lib\site-packages\gensim\scripts\word2vec2tensor.py --input naver_movie_ratings_w2v --output naver_movie_ratings_w2v
2025-04-07 17:11:10,096 - keyedvectors - INFO - loading projection weights from naver_movie_ratings_w2v
2025-04-07 17:11:11,068 - utils - INFO - KeyedVectors lifecycle event {'msg': 'loaded (16841, 100) matrix of type float32 from naver_movie_ratings_w2v', 'binary': False, 'encoding': 'utf8', 'datetime': '2025-04-07T17:11:10.818605', 'gensim': '4.3.3', 'python': '3.12.9 | packaged by Anaconda, Inc. | (main, Feb  6 2025, 18:49:16) [MSC v.1929 64 bit (AMD64)]', 'platform': 'Windows-11-10.0.26100-SP0', 'event': 'load_word2vec_format'}
2025-04-07 17:11:11,620 - word2vec2tensor - INFO - 2D tensor file saved to naver_movie_ratings_w2v_tensor.tsv
2025-04-07 17:11:11,620 - word2vec2tensor - INFO - Tensor metadata file saved to naver_movie_ratings_w2v_metadata.tsv
2025-04-07 17:11:

- Pre-trained embeddings

In [63]:
url = 'https://drive.google.com/uc?id=1aL_xpWW-CjfCrLWeflIaipITOZ6zHI5c'
output = "GoogleNews_vecs.bins.gz"

gdown.download(url, output)

Downloading...
From (original): https://drive.google.com/uc?id=1aL_xpWW-CjfCrLWeflIaipITOZ6zHI5c
From (redirected): https://drive.google.com/uc?id=1aL_xpWW-CjfCrLWeflIaipITOZ6zHI5c&confirm=t&uuid=bb8a4442-822e-4514-b1e2-b90551199429
To: c:\encore_skn11\07_nlp\03_word_embedding\GoogleNews_vecs.bins.gz
100%|██████████| 1.65G/1.65G [05:20<00:00, 5.14MB/s]


'GoogleNews_vecs.bins.gz'

In [66]:
google_news_wv = KeyedVectors.load_word2vec_format('GoogleNews_vecs.bins.gz', binary=True)
google_news_wv.vectors.shape

(3000000, 300)

In [67]:
google_news_wv.similarity('king','man')

0.22942671

In [68]:
google_news_wv.most_similar('king')

[('kings', 0.7138044834136963),
 ('queen', 0.6510957479476929),
 ('monarch', 0.6413194537162781),
 ('crown_prince', 0.6204219460487366),
 ('prince', 0.6159994602203369),
 ('sultan', 0.5864822864532471),
 ('ruler', 0.5797566175460815),
 ('princes', 0.5646552443504333),
 ('Prince_Paras', 0.5432944297790527),
 ('throne', 0.5422106385231018)]

In [None]:
google_news_wv.n_similarity(['king', 'queen'], ['man', 'woman'])    # cosine similarity between the mean vectors of the two word lists

0.24791394
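
What `n_similarity` computes can be reproduced by hand: average the vectors in each list, then take the cosine similarity of the two averages. The sketch below does this on hypothetical 4-dimensional toy vectors (the ones from the embedding matrix example at the top), not on the GoogleNews vectors, so its score is unrelated to the value above.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def n_similarity_sketch(set_a, set_b):
    """Cosine similarity between the mean vectors of two vector sets."""
    return cosine(mean_vector(set_a), mean_vector(set_b))

# Hypothetical toy vectors (gender, royalty, age, food)
toy = {
    "man":   [-1.00, 0.01, 0.03, 0.04],
    "woman": [1.00, 0.02, 0.02, 0.01],
    "king":  [-0.95, 0.93, 0.70, 0.02],
    "queen": [0.97, 0.95, 0.69, 0.01],
}

score = n_similarity_sketch([toy["king"], toy["queen"]],
                            [toy["man"], toy["woman"]])
self_score = n_similarity_sketch([toy["king"]], [toy["king"]])
print(score, self_score)
```

Averaging first means that opposing components within a list cancel out before the comparison, which is different from averaging the pairwise similarities.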

In [71]:
google_news_wv.most_similar('king', topn=5)

[('kings', 0.7138044834136963),
 ('queen', 0.6510957479476929),
 ('monarch', 0.6413194537162781),
 ('crown_prince', 0.6204219460487366),
 ('prince', 0.6159994602203369)]

In [73]:
google_news_wv.similar_by_word('king', topn=5)

[('kings', 0.7138044834136963),
 ('queen', 0.6510957479476929),
 ('monarch', 0.6413194537162781),
 ('crown_prince', 0.6204219460487366),
 ('prince', 0.6159994602203369)]

In [None]:
google_news_wv.has_index_for('ㅋㅋㅋㅋㅋㅋ')    # checks whether the word exists in the model's vocabulary

False