# Title: Word Embeddings and Phrase Embeddings

## Author: 전은영

## Source: https://github.com/Eun0/NLP

## Reference: https://github.com/yandexdataschool/nlp_course/blob/master/week01_embeddings/seminar.ipynb

<br>

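The code below each \## marker (such as \## My Code) was written by the author.

# \* Loading the data

Load quora.txt, which contains 537,272 samples, each a single line of text.

Download quora.txt from: https://yadi.sk/i/BPQrUu1NaTduEw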

In [1]:
# Load data

# read every line; the with-block closes the file afterwards
with open('quora.txt', encoding='utf8') as f:
    data = list(f)

print(len(data))

537272


In [2]:
# check form of data
data[50]

"What TV shows or books help you read people's body language?\n"

# \* Preprocessing the data

## 1. Tokenization

Use nltk's WordPunctTokenizer.

In [3]:
# Tokenization

from nltk.tokenize import WordPunctTokenizer

tokenizer=WordPunctTokenizer()

tokenizer.tokenize(data[50])

['What',
 'TV',
 'shows',
 'or',
 'books',
 'help',
 'you',
 'read',
 'people',
 "'",
 's',
 'body',
 'language',
 '?']

## 2. Lowercasing

To lowercase every string in a list:

=> ```   list(map(str.lower, lst))   ```

In [4]:
# example
list(map(str.lower,['A','B']))

['a', 'b']
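The same lowercasing can also be written as a list comprehension. The snippet below is a small sketch on a made-up word list (`words` is a hypothetical example, not part of the dataset) showing that both styles give the same result.

```python
# Two equivalent ways to lowercase every string in a list.
# `words` is a hypothetical example list, not taken from quora.txt.
words = ['What', 'TV', 'Shows']

lowered_map = list(map(str.lower, words))   # functional style, as in the text
lowered_comp = [w.lower() for w in words]   # list-comprehension style

print(lowered_map)
print(lowered_map == lowered_comp)
```

Which form to use is mostly a matter of taste; the comprehension tends to read more easily once extra processing per item is needed.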

In [5]:
# TASK: lowercase everything and extract tokens with tokenizer. 
# data_tok should be a list of lists of tokens for each line in data.

## My Code

data_tok=[]

for line in data:
    
    data_tok.append(list(map(str.lower,tokenizer.tokenize(line))))

In [6]:
assert all(isinstance(row, (list, tuple)) for row in data_tok), "please convert each line into a list of tokens (strings)"
assert all(all(isinstance(tok, str) for tok in row) for row in data_tok), "please convert each line into a list of tokens (strings)"
is_latin = lambda tok: all('a' <= x.lower() <= 'z' for x in tok)
assert all(map(lambda l: not is_latin(l) or l.islower(), map(' '.join, data_tok))), "please make sure to lowercase the data"
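For readers without nltk at hand, the tokenize-then-lowercase step can be approximated with the standard library. This is only a sketch: it assumes that the regular expression `\w+|[^\w\s]+` splits text the same way as WordPunctTokenizer, and `tokenize_lower` is a hypothetical helper name introduced here.

```python
import re

# Assumption: this pattern approximates WordPunctTokenizer by splitting text
# into runs of word characters and runs of punctuation.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")

def tokenize_lower(line):
    """Lowercase a line, then split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(line.lower())

sample_lines = ["What TV shows or books help you read people's body language?\n"]
sample_tok = [tokenize_lower(line) for line in sample_lines]
print(sample_tok[0])
```

Lowercasing the whole line before tokenizing gives the same tokens as lowercasing each token afterwards, since the pattern does not depend on case.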

### Checking the result

In [7]:
# Check the result
print([' '.join(row) for row in data_tok[:2]])

["can i get back with my ex even though she is pregnant with another guy ' s baby ?", 'what are some ways to overcome a fast food addiction ?']


# \* Training word embeddings with Word2Vec

Use Word2Vec from the gensim module.

In [8]:
from gensim.models import Word2Vec
model_w2v = Word2Vec(data_tok, 
                 size=32,      # embedding vector size (renamed to vector_size in gensim >= 4.0)
                 min_count=5,  # consider words that occurred at least 5 times
                 window=5).wv  # define context as a 5-word window around the target word

### Checking the result

- ```get_vector(word)``` : returns the word vector for word

- ```most_similar(word)``` : returns a list of the 10 vocabulary words most similar to word, with their similarity scores

- ```vocab.keys()``` : returns the words in the vocabulary
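To get a feel for what a similarity query does, here is a minimal sketch that ranks words by cosine similarity over a tiny hand-made vocabulary. It is not gensim's implementation; `toy_vectors`, `cosine` and `toy_most_similar` are hypothetical names, and the 3-dimensional vectors are invented for illustration.

```python
import math

# Hypothetical 3-d embeddings, invented for illustration only.
toy_vectors = {
    'bread':  [1.0, 0.9, 0.0],
    'butter': [0.9, 1.0, 0.1],
    'rice':   [0.8, 0.7, 0.2],
    'tv':     [0.0, 0.1, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(word, topn=2):
    """Return the topn (word, similarity) pairs closest to `word`."""
    query = toy_vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in toy_vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

print(toy_most_similar('bread'))
```

The query word itself is excluded from the candidates, which matches the outputs shown in this notebook, where a word never appears in its own neighbour list.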

In [9]:
# now you can get word vectors !
model_w2v.get_vector('anything')

array([ 1.4799985 ,  0.5257579 ,  3.0423071 ,  0.2545758 , -0.8643857 ,
       -2.6472995 ,  0.80737066,  2.1281335 , -1.468268  , -1.3665662 ,
       -0.37811634, -0.3121864 , -2.8806322 ,  0.5670545 ,  1.4744236 ,
        0.46121374,  0.47948676,  4.3539786 ,  1.2901214 ,  2.1634343 ,
       -1.0742007 , -1.2146378 , -2.1564162 , -1.1870527 , -0.05559259,
        2.6080284 ,  5.288558  , -2.3956692 ,  2.0836742 , -1.3603    ,
        1.9552332 ,  1.3251094 ], dtype=float32)

In [10]:
# or query similar words directly. Go play with it!
model_w2v.most_similar('bread')

[('rice', 0.9562526941299438),
 ('cheese', 0.9328122138977051),
 ('sauce', 0.9258643388748169),
 ('butter', 0.924580991268158),
 ('fruit', 0.9201968908309937),
 ('honey', 0.9140980243682861),
 ('potatoes', 0.9066241979598999),
 ('potato', 0.9027183651924133),
 ('pasta', 0.9006110429763794),
 ('noodles', 0.8994271755218506)]

In [11]:
model_w2v.most_similar('what')

[('which', 0.652188777923584),
 ('that', 0.5350184440612793),
 ('the', 0.49438631534576416),
 ('activities', 0.44785767793655396),
 ('sort', 0.4475882649421692),
 ('unique', 0.44433557987213135),
 ('harmless', 0.4323808550834656),
 ('tchaikovsky', 0.43113014101982117),
 ('success', 0.4290727376937866),
 ('consist', 0.42593106627464294)]

In [12]:
model_w2v.most_similar('tv')

[('television', 0.8935765624046326),
 ('netflix', 0.809036374092102),
 ('game', 0.8028695583343506),
 ('games', 0.7749791741371155),
 ('anime', 0.7661345601081848),
 ('ps4', 0.7257632613182068),
 ('kapil', 0.7182228565216064),
 ('video', 0.7111369371414185),
 ('hbo', 0.7091661691665649),
 ('dvd', 0.7071616649627686)]

In [13]:
# word list in vocabulary
list(model_w2v.vocab.keys())[:20]

['can',
 'i',
 'get',
 'back',
 'with',
 'my',
 'ex',
 'even',
 'though',
 'she',
 'is',
 'pregnant',
 'another',
 'guy',
 "'",
 's',
 'baby',
 '?',
 'what',
 'are']

# \* Using a pre-trained model

Use gensim to load the pre-trained 'glove-twitter-100' model.

In [14]:
import gensim.downloader as api
model_trained = api.load('glove-twitter-100')

### Checking the result

In [15]:
# Pre-trained model: the word is passed as the first positional argument (positive)
model_trained.most_similar("pregnant")

[('married', 0.7712636590003967),
 ('preggo', 0.7219228744506836),
 ('wife', 0.7133387327194214),
 ('pregnancy', 0.7131091952323914),
 ('daughter', 0.7130356431007385),
 ('birth', 0.7080836296081543),
 ('she', 0.6912790536880493),
 ('girl', 0.68682861328125),
 ('ugly', 0.6823412775993347),
 ('shes', 0.67890864610672)]

In [16]:
# Pre-trained model: the same query with an explicit positive list
model_trained.most_similar(positive=["pregnant"])

[('married', 0.7712636590003967),
 ('preggo', 0.7219228744506836),
 ('wife', 0.7133387327194214),
 ('pregnancy', 0.7131091952323914),
 ('daughter', 0.7130356431007385),
 ('birth', 0.7080836296081543),
 ('she', 0.6912790536880493),
 ('girl', 0.68682861328125),
 ('ugly', 0.6823412775993347),
 ('shes', 0.67890864610672)]

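# \* Visualizing word vectors

## 1. Sorting words by frequency

```sorted(lst, key=ftn, reverse=True|False)```

: returns a new list with the items of lst sorted by the key function ftn, in descending order if reverse=True and in ascending order if reverse=False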
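A short sketch of `sorted` with a key function, using an invented word-count dictionary (`counts` is hypothetical and unrelated to the real model vocabulary):

```python
# Hypothetical word counts, invented for illustration.
counts = {'the': 120, 'cat': 7, 'sat': 3, 'on': 45}

# Iterating over a dict yields its keys, so this sorts the words by their count.
by_count_desc = sorted(counts, key=lambda word: counts[word], reverse=True)
by_count_asc = sorted(counts, key=lambda word: counts[word])

print(by_count_desc)
print(by_count_asc)
```

`sorted` leaves the original collection untouched and returns a new list, so slicing the result (for example `[:1000]`) is a convenient way to keep only the most frequent words.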

In [17]:
# Sort the pre-trained model's vocabulary by word count (most frequent first)
# Use only the top 1000

words_trained = sorted(model_trained.vocab.keys(), 
               key=lambda word: model_trained.vocab[word].count,
               reverse=True)[:1000]

words_trained[::100]

['<user>',
 '_',
 'please',
 'apa',
 'justin',
 'text',
 'hari',
 'playing',
 'once',
 'sei']

## 2. Getting the word vectors for the sorted word list

In [18]:
# for each word, compute its vector with model

## My Code

import numpy as np

word_vectors = np.array(list(map(model_trained.get_vector,words_trained)))

In [19]:
assert isinstance(word_vectors, np.ndarray)
assert word_vectors.shape == (len(words_trained), 100)
assert np.isfinite(word_vectors).all()

## Defining the draw_vectors function

A function that plots 2D word vectors.


In [20]:
import bokeh.models as bm, bokeh.plotting as pl
from bokeh.io import output_notebook
output_notebook()

def draw_vectors(x, y, radius=10, alpha=0.25, color='blue',
                 width=600, height=400, show=True, **kwargs):
    """ draws an interactive plot for data points with auxiliary info on hover """
    if isinstance(color, str): color = [color] * len(x)
    data_source = bm.ColumnDataSource({ 'x' : x, 'y' : y, 'color': color, **kwargs })

    fig = pl.figure(active_scroll='wheel_zoom', width=width, height=height)
    fig.scatter('x', 'y', size=radius, color='color', alpha=alpha, source=data_source)

    fig.add_tools(bm.HoverTool(tooltips=[(key, "@" + key) for key in kwargs.keys()]))
    if show: pl.show(fig)
    return fig

## 3-1). Visualizing with PCA

Use PCA from sklearn.

```PCA(n_components=N).fit_transform(data)```

: projects data onto N dimensions with PCA

In [21]:
from sklearn.decomposition import PCA

# map word vectors onto 2d plane with PCA. Use good old sklearn api (fit, transform)
# after that, normalize vectors to make sure they have zero mean and unit variance

## My Code

word_vectors_pca = PCA(n_components=2).fit_transform(word_vectors)

# Normalization
word_vectors_pca=(word_vectors_pca-word_vectors_pca.mean(axis=0))/word_vectors_pca.std(axis=0)


In [22]:
assert word_vectors_pca.shape == (len(word_vectors), 2), "there must be a 2d vector for each word"
assert max(abs(word_vectors_pca.mean(0))) < 1e-5, "points must be zero-centered"
assert max(abs(1.0 - word_vectors_pca.std(0))) < 1e-2, "points must have unit variance"
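The projection and normalization can also be reproduced by hand to see what each step does. The sketch below uses only numpy on random toy data (`toy_matrix` is a hypothetical stand-in for `word_vectors`) and computes the projection through an SVD of the centered data. It is not sklearn's implementation, and the component signs may differ from what `PCA` returns.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_matrix = rng.normal(size=(50, 8))   # hypothetical stand-in for word_vectors

# PCA by hand: center the data, then project onto the top-2 right singular vectors.
centered = toy_matrix - toy_matrix.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ vt[:2].T

# Standardize each output column to zero mean and unit variance.
standardized = (projected - projected.mean(axis=0)) / projected.std(axis=0)

print(standardized.shape)
print(np.abs(standardized.mean(axis=0)).max() < 1e-8)
```

Standardizing after the projection only rescales the two axes, which keeps the plot readable regardless of how much variance each component carries.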

### Plotting

In [23]:
draw_vectors(word_vectors_pca[:, 0], word_vectors_pca[:, 1], token=words_trained)

## 3-2). Visualizing with t-SNE

In [24]:
from sklearn.manifold import TSNE

# map word vectors onto 2d plane with TSNE. hint: use verbose=100 to see what it's doing.
# normalize them just like with pca

## My Code

word_tsne = TSNE(verbose=100).fit_transform(word_vectors)

word_tsne = (word_tsne-word_tsne.mean(axis=0))/word_tsne.std(axis=0)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1000 samples in 0.004s...
[t-SNE] Computed neighbors for 1000 samples in 0.212s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1000
[t-SNE] Mean sigma: 1.716134
[t-SNE] Computed conditional probabilities in 0.048s
[t-SNE] Iteration 50: error = 68.8632584, gradient norm = 0.3285225 (50 iterations in 16.211s)
[t-SNE] Iteration 100: error = 69.2891922, gradient norm = 0.2977110 (50 iterations in 18.170s)
[t-SNE] Iteration 150: error = 69.6372223, gradient norm = 0.2873838 (50 iterations in 18.674s)
[t-SNE] Iteration 200: error = 69.9386826, gradient norm = 0.2779433 (50 iterations in 18.983s)
[t-SNE] Iteration 250: error = 69.2451859, gradient norm = 0.2901278 (50 iterations in 18.395s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 69.245186
[t-SNE] Iteration 300: error = 1.2349916, gradient norm = 0.0033400 (50 iterations in 13.183s)
[t-SNE] Iteration 350: error = 1.1235009, gradient norm = 0

### Plotting

In [25]:
draw_vectors(word_tsne[:, 0], word_tsne[:, 1], color='green', token=words_trained)

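# \* Visualizing phrases

## Defining the get_phrase_embedding function

: a function that embeds a phrase

1. Lowercase the phrase

2. Tokenize it

3. The phrase vector is the average of its token vectors (tokens missing from the vocabulary are skipped)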
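The three steps above can be sketched on a toy vocabulary before applying them to the real model. Everything here is illustrative: `toy_embeddings` and `toy_phrase_embedding` are hypothetical names, the vectors are invented, and a simple regular expression stands in for the nltk tokenizer.

```python
import re
import numpy as np

# Hypothetical 3-d embeddings, invented for illustration only.
toy_embeddings = {
    'i':    np.array([1.0, 0.0, 0.0]),
    'like': np.array([0.0, 1.0, 0.0]),
    'tea':  np.array([0.0, 0.0, 1.0]),
}

def toy_phrase_embedding(phrase, dim=3):
    """Average the vectors of the known tokens; return zeros if none are known."""
    tokens = re.findall(r"\w+|[^\w\s]+", phrase.lower())   # steps 1 and 2
    known = [toy_embeddings[t] for t in tokens if t in toy_embeddings]
    if not known:
        return np.zeros(dim, dtype='float32')
    return np.mean(known, axis=0).astype('float32')        # step 3

print(toy_phrase_embedding("I like TEA !"))
print(toy_phrase_embedding("???"))
```

Returning a zero vector for phrases with no known tokens keeps the output shape fixed, so the results can be stacked into a single array later.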

In [26]:
def get_phrase_embedding(phrase):
    """
    Convert phrase to a vector by aggregating its word embeddings. See description above.
    """
    
    vector = np.zeros([model_trained.vector_size], dtype='float32')
    
    ## My Code
    
    # 1. lowercase phrase
    
    lowercase_phrase=phrase.lower()
    
    # 2. tokenize phrase
    
    tokens_phrase=tokenizer.tokenize(lowercase_phrase)
    #print(tokens_phrase)
    
    # 3. average word vectors for all words in tokenized phrase
    # skip words that are not in model's vocabulary
    # if all words are missing from vocabulary, return zeros
    
    num_word=0
    
    for token in tokens_phrase:
        
        if token in model_trained.vocab.keys():
            token_vector=model_trained.get_vector(token)
            vector+=token_vector
            num_word+=1
    
    if num_word==0:
        return np.zeros_like(vector)
    else:
        vector=vector/num_word
    
    return vector

### Check

In [27]:
vector = get_phrase_embedding("I'm very sure. This never happened to me before...")

assert np.allclose(vector[::10],
                   np.array([ 0.31807372, -0.02558171,  0.0933293 , -0.1002182 , -1.0278689 ,
                             -0.16621883,  0.05083408,  0.17989802,  1.3701859 ,  0.08655966],
                              dtype=np.float32))

## Choosing phrases and embedding them

In [28]:
# let's only consider ~1k phrases for a first run (every 537th line gives 1001 phrases).
chosen_phrases = data[::len(data) // 1000]

# compute vectors for chosen phrases

## My Code
phrase_vectors = np.array(list(map(get_phrase_embedding,chosen_phrases)))

In [29]:
assert isinstance(phrase_vectors, np.ndarray) and np.isfinite(phrase_vectors).all()
assert phrase_vectors.shape == (len(chosen_phrases), model_trained.vector_size)

In [30]:
# map vectors into 2d space with pca, tsne or your other method of choice
# don't forget to normalize

phrase_vectors_2d = TSNE(verbose=1000).fit_transform(phrase_vectors)

phrase_vectors_2d = (phrase_vectors_2d - phrase_vectors_2d.mean(axis=0)) / phrase_vectors_2d.std(axis=0)

[t-SNE] Computing 91 nearest neighbors...
[t-SNE] Indexed 1001 samples in 0.008s...
[t-SNE] Computed neighbors for 1001 samples in 0.228s...
[t-SNE] Computed conditional probabilities for sample 1000 / 1001
[t-SNE] Computed conditional probabilities for sample 1001 / 1001
[t-SNE] Mean sigma: 0.477874
[t-SNE] Computed conditional probabilities in 0.047s
[t-SNE] Iteration 50: error = 79.9959793, gradient norm = 0.2999685 (50 iterations in 21.542s)
[t-SNE] Iteration 100: error = 79.6029816, gradient norm = 0.3136635 (50 iterations in 23.609s)
[t-SNE] Iteration 150: error = 80.2882233, gradient norm = 0.3097210 (50 iterations in 23.922s)
[t-SNE] Iteration 200: error = 80.4504852, gradient norm = 0.3149351 (50 iterations in 24.720s)
[t-SNE] Iteration 250: error = 80.8826523, gradient norm = 0.2998649 (50 iterations in 24.202s)
[t-SNE] KL divergence after 250 iterations with early exaggeration: 80.882652
[t-SNE] Iteration 300: error = 1.9988880, gradient norm = 0.0031828 (50 iterations in 14

In [31]:
draw_vectors(phrase_vectors_2d[:, 0], phrase_vectors_2d[:, 1],
             phrase=[phrase[:50] for phrase in chosen_phrases],
             radius=20,)

## Embedding every phrase in data

In [32]:
# compute vector embedding for all lines in data
data_vectors = np.array([get_phrase_embedding(l) for l in data])

## Defining the find_nearest function

```find_nearest(query,k=10)```

: returns the k phrases most similar to query

In [33]:
def find_nearest(query, k=10):
    """
    given text line (query), return k most similar lines from data, sorted from most to least similar
    similarity should be measured as cosine between query and line embedding vectors
    hint: it's okay to use global variables: data and data_vectors. see also: np.argpartition, np.argsort
    """
    ## My Code
    
    query_vec=get_phrase_embedding(query)
    
    def compute_cos(vec):
        from numpy.linalg import norm
        
        if np.dot(query_vec,vec)==0:
            return 0
        else:
            return np.dot(query_vec,vec)/(norm(query_vec)*norm(vec))
    
    
    data_cos=np.array(list(map(compute_cos,data_vectors)))
    
    # argsort is ascending, so reverse it and take the first k (the k most similar)
    inds=data_cos.argsort()[::-1][:k]

    return [data[i] for i in inds]
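Computing the cosine one row at a time works, but over half a million rows it is slow. A vectorized variant following the `np.argpartition` hint in the docstring might look like the sketch below. `nearest_rows` and `toy_data` are hypothetical names, and the function returns row indices rather than text lines.

```python
import numpy as np

def nearest_rows(query_vec, matrix, k=3):
    """Return indices of the k rows of `matrix` most cosine-similar to `query_vec`."""
    dots = matrix @ query_vec
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(query_vec)
    # Rows (or a query) with zero norm get similarity 0 instead of dividing by zero.
    sims = np.divide(dots, norms, out=np.zeros_like(dots), where=norms > 0)
    k = min(k, len(sims))
    top = np.argpartition(-sims, k - 1)[:k]   # the k best rows, in arbitrary order
    return top[np.argsort(-sims[top])]        # sort only those k rows

# Hypothetical 2-d "phrase vectors", invented for illustration.
toy_data = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.0, 0.0], [-1.0, 0.0]])
print(nearest_rows(np.array([1.0, 0.0]), toy_data, k=2))
```

`np.argpartition` avoids sorting the whole similarity array, so only the k selected entries are fully ordered.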

In [34]:
results = find_nearest(query="How do i enter the matrix?", k=10)
print(''.join(results))

How do I get to the dark web?
What should I do to enter hollywood?
How do I use the Greenify app?
What can I do to save the world?
How do I win this?
How do I think out of the box? How do I learn to think out of the box?
How do I find the 5th dimension?
How do I use the pad in MMA?
How do I estimate the competition?
What do I do to enter the line of event management?



In [35]:
assert len(results) == 10 and isinstance(results[0], str)
assert results[0] == 'How do I get to the dark web?\n'
assert results[3] == 'What can I do to save the world?\n'

### Check

In [36]:
find_nearest(query="How does Trump?", k=10)

['What does Donald Trump think about Israel?\n',
 'What books does Donald Trump like?\n',
 'What does India think of Donald Trump?\n',
 'What does Donald Trump think of India?\n',
 'What does Donald Trump think of China?\n',
 'What does Donald Trump think about Pakistan?\n',
 'What companies does Donald Trump own?\n',
 'What does Dushka Zapata think about Donald Trump?\n',
 'How does it feel to date Ivanka Trump?\n',
 'What does salesforce mean?\n']

In [37]:
find_nearest(query="Why don't i ask a question myself?", k=10)

["Why don't I get a date?\n",
 "Why do you always answer a question with a question? I don't, or do I?\n",
 "Why can't I ask a question anonymously?\n",
 "Why don't I get a girlfriend?\n",
 "Why don't I have a boyfriend?\n",
 "I don't have no question?\n",
 "Why can't I take a joke?\n",
 "Why don't I ever get a girl?\n",
 "Can I ask a girl out that I don't know?\n",
 "Why don't I have a girlfriend?\n"]