<a href="https://colab.research.google.com/github/Kimminsu-ds/Deep-Learning-NLP-using-Tensorflow/blob/main/03_01_Word_Embedding(English).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 러닝스푼즈 - Tensorflow를 활용한 딥러닝 자연어처리

In [1]:
import re
from lxml import etree
import urllib.request
import zipfile
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

In [2]:
nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

## 데이터 세팅

In [8]:
urllib.request.urlretrieve("https://raw.githubusercontent.com/GaoleMeng/RNN-and-FFNN-textClassification/master/ted_en-20160408.xml", filename="ted_en-20160408.xml")

targetXML = open("ted_en-20160408.xml", 'r', encoding="UTF8")
target_text = etree.parse(targetXML)

# xml 파일로부터 <content>와 </content> 사이의 내용만 가져온다.
parse_text = '\n'.join(target_text.xpath("//content/text()"))
print(parse_text[:96])

Here are two reasons companies fail: they only do more of the same, or they only do what's new.



## 전처리

In [10]:
# 정규 표현식의 sub 모듈을 통해 content 중간에 등장하는 (Audio), (Laughter) 등의 배경음 부분을 제거
# 해당 코드는 괄호로 구성된 내용을 제거
content_text = re.sub(r'\([%)]*\)', '', parse_text)
print(content_text[:95])

Here are two reasons companies fail: they only do more of the same, or they only do what's new.


In [11]:
# 입력 코퍼스에 대해서 NLTK를 이용하여 문장 토큰화 수행
sent_text = sent_tokenize(content_text)

In [12]:
# 각 문장에 대해서 구두점을 제거하고, 대문자를 소문자로 변환
normalized_text = []
for string in sent_text:
  tokens = re.sub(r"[^a-z0-9]+", " ", string.lower())
  normalized_text.append(tokens)

In [13]:
# 각 문장에 대해서 NLTK를 이용하여 단어 토큰화를 수행
result = [word_tokenize(sentence) for sentence in normalized_text]

In [14]:
print("총 샘플의 개수: {}".format(len(result)))

총 샘플의 개수: 273649


In [15]:
for line in result[:3]:
  print(line)

['here', 'are', 'two', 'reasons', 'companies', 'fail', 'they', 'only', 'do', 'more', 'of', 'the', 'same', 'or', 'they', 'only', 'do', 'what', 's', 'new']
['to', 'me', 'the', 'real', 'real', 'solution', 'to', 'quality', 'growth', 'is', 'figuring', 'out', 'the', 'balance', 'between', 'two', 'activities', 'exploration', 'and', 'exploitation']
['both', 'are', 'necessary', 'but', 'it', 'can', 'be', 'too', 'much', 'of', 'a', 'good', 'thing']


## Word2Vec 임베딩
[Gensim Word2Vec API](https://radimrehurek.com/gensim/models/word2vec.html)
- size: 워드 벡터의 특징값. 즉, 임베딩 된 벡터의 차원
- window: 컨텍스트 윈도우 크기
- min_count: 단어 최소 빈도수 제한(빈도가 적은 단어들은 학습하지 않는다)
- workers: 학습을 위한 프로세스 수
- sg
  - 0: CBOW
  - 1: Skip-gram

In [16]:
from gensim.models import Word2Vec
model = Word2Vec(sentences = result, size = 100, window = 5, min_count = 5, workers = 4, sg = 0)

In [17]:
model_result = model.wv.most_similar("man")
print(model_result)

[('woman', 0.8565999865531921), ('guy', 0.8201879262924194), ('lady', 0.7860196828842163), ('girl', 0.767379641532898), ('soldier', 0.7628076076507568), ('boy', 0.7621582746505737), ('gentleman', 0.736214280128479), ('kid', 0.7107130289077759), ('poet', 0.6882884502410889), ('surgeon', 0.6674830913543701)]


In [18]:
from gensim.models import KeyedVectors
model.wv.save_word2vec_format("eng_w2v") # 모델 저장
loaded_model = KeyedVectors.load_word2vec_format("eng_w2v") # 모델 로드

In [19]:
model_result = loaded_model.most_similar("man")
print(model_result)

[('woman', 0.8565999865531921), ('guy', 0.8201879262924194), ('lady', 0.7860196828842163), ('girl', 0.767379641532898), ('soldier', 0.7628076076507568), ('boy', 0.7621582746505737), ('gentleman', 0.736214280128479), ('kid', 0.7107130289077759), ('poet', 0.6882884502410889), ('surgeon', 0.6674830913543701)]


## Visualization
- eng_w2v라는 Word2Vec 모델이 이미 존재한다는 가정 하에 아래 커맨드를 수행

In [24]:
%pwd

'/content'

In [25]:
!python -m gensim.scripts.word2vec2tensor --input eng_w2v --output eng_w2v

2021-10-19 12:46:07,984 - word2vec2tensor - INFO - running /usr/local/lib/python3.7/dist-packages/gensim/scripts/word2vec2tensor.py --input eng_w2v --output eng_w2v
2021-10-19 12:46:07,985 - utils_any2vec - INFO - loading projection weights from eng_w2v
2021-10-19 12:46:10,568 - utils_any2vec - INFO - loaded (21662, 100) matrix from eng_w2v
2021-10-19 12:46:12,611 - word2vec2tensor - INFO - 2D tensor file saved to eng_w2v_tensor.tsv
2021-10-19 12:46:12,611 - word2vec2tensor - INFO - Tensor metadata file saved to eng_w2v_metadata.tsv
2021-10-19 12:46:12,614 - word2vec2tensor - INFO - finished running word2vec2tensor.py


### [Embedding Projector](https://projector.tensorflow.org/)
  - 위 Choose file 버튼 클릭
  - eng_w2v_tensor.tsv 파일 업로드
  - 아래 Choose file 버튼 클릭
  - eng_w2v_metadata.tsv 파일 업로드
  - 두 파일을 업로드하면 워드 임베딩 모델이 시각화

## FastText
- Word2Vec의 OOV 문제

In [26]:
loaded_model = KeyedVectors.load_word2vec_format("eng_w2v") # Word2Vec 모델 로드
model_result = loaded_model.most_similar("overacting")
print(model_result)

KeyError: ignored

In [27]:
model_result = loaded_model.most_similar("memory")
print(model_result)

[('brain', 0.7454793453216553), ('body', 0.7242261171340942), ('imagination', 0.6850780248641968), ('perception', 0.6834355592727661), ('intuition', 0.6751922369003296), ('function', 0.668582558631897), ('activity', 0.6680774688720703), ('consciousness', 0.6617116332054138), ('strength', 0.6600881218910217), ('tissue', 0.6555130481719971)]


In [28]:
model_result = loaded_model.most_similar("memorry")
print(model_result)

KeyError: ignored

In [29]:
model_result = loaded_model.most_similar("electrofishing")
print(model_result)

KeyError: ignored

### FastText Embedding
[Gensim FastText API](https://radimrehurek.com/gensim/models/fasttext.html)

In [30]:
from gensim.models import FastText
fasttext_model = FastText(result, size=100, window=5, min_count=5, workers=4, sg=1)

In [31]:
fasttext_model.most_similar("overacting")

  """Entry point for launching an IPython kernel.


[('interacting', 0.8681800961494446),
 ('manipulating', 0.8505892753601074),
 ('subtracting', 0.8299702405929565),
 ('distracting', 0.8239017724990845),
 ('impacting', 0.8165470957756042),
 ('contracting', 0.8099298477172852),
 ('extracting', 0.7972195148468018),
 ('acting', 0.7911936640739441),
 ('behaving', 0.7858031988143921),
 ('cooperating', 0.7831429243087769)]

In [32]:
fasttext_model.most_similar("memorry")

  """Entry point for launching an IPython kernel.


[('memo', 0.8432082533836365),
 ('memory', 0.769365668296814),
 ('memorize', 0.7666587829589844),
 ('memoir', 0.7558838129043579),
 ('nemo', 0.7090966105461121),
 ('emory', 0.6855215430259705),
 ('dereck', 0.6816719770431519),
 ('memorial', 0.6809093952178955),
 ('memoirs', 0.6794471740722656),
 ('forgery', 0.6784298419952393)]

In [33]:
fasttext_model.most_similar("electrofishing")

  """Entry point for launching an IPython kernel.


[('electrolux', 0.8118401765823364),
 ('electro', 0.7946038246154785),
 ('electrolyte', 0.7906200289726257),
 ('electrochemical', 0.7681560516357422),
 ('electroshock', 0.766903281211853),
 ('electron', 0.758313775062561),
 ('airbus', 0.75740647315979),
 ('electric', 0.7544244527816772),
 ('petroleum', 0.7492835521697998),
 ('electrogram', 0.7476785182952881)]

## Glove

In [34]:
!pip install glove_python_binary

Collecting glove_python_binary
  Downloading glove_python_binary-0.2.0-cp37-cp37m-manylinux1_x86_64.whl (948 kB)
[?25l[K     |▍                               | 10 kB 26.4 MB/s eta 0:00:01[K     |▊                               | 20 kB 26.8 MB/s eta 0:00:01[K     |█                               | 30 kB 11.7 MB/s eta 0:00:01[K     |█▍                              | 40 kB 9.2 MB/s eta 0:00:01[K     |█▊                              | 51 kB 5.2 MB/s eta 0:00:01[K     |██                              | 61 kB 5.7 MB/s eta 0:00:01[K     |██▍                             | 71 kB 5.5 MB/s eta 0:00:01[K     |██▊                             | 81 kB 6.2 MB/s eta 0:00:01[K     |███                             | 92 kB 4.7 MB/s eta 0:00:01[K     |███▌                            | 102 kB 5.1 MB/s eta 0:00:01[K     |███▉                            | 112 kB 5.1 MB/s eta 0:00:01[K     |████▏                           | 122 kB 5.1 MB/s eta 0:00:01[K     |████▌                    

In [35]:
from glove import Corpus, Glove

# 훈련 데이터로부터 GloVE에서 사용할 동시 등장 행렬 생성
corpus = Corpus()
corpus.fit(result, window=5)

# 학습에 이용할 쓰레드의 개수는 4개, 에포크는 20으로 설정
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=20, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)

Performing 20 training epochs with 4 threads
Epoch 0
Epoch 1
Epoch 2
Epoch 3
Epoch 4
Epoch 5
Epoch 6
Epoch 7
Epoch 8
Epoch 9
Epoch 10
Epoch 11
Epoch 12
Epoch 13
Epoch 14
Epoch 15
Epoch 16
Epoch 17
Epoch 18
Epoch 19


In [36]:
model_result1 = glove.most_similar("man")
print(model_result1)

[('woman', 0.9561557091918471), ('guy', 0.8764421784552235), ('girl', 0.8723682461981662), ('boy', 0.8437721102413391)]


In [37]:
model_result2 = glove.most_similar("boy")
print(model_result2)

[('girl', 0.9457186703096012), ('woman', 0.8694451439550647), ('man', 0.843772110241339), ('kid', 0.8400019102507477)]


In [38]:
model_result3 = glove.most_similar("university")
print(model_result3)

[('harvard', 0.8885466417077313), ('mit', 0.8467218265307618), ('stanford', 0.8293999937217268), ('cambridge', 0.8249610860364013)]


In [39]:
model_result4 = glove.most_similar("water")
print(model_result4)

[('clean', 0.8569630638984247), ('fresh', 0.8388016032558542), ('air', 0.8299769157043785), ('food', 0.8238028760017486)]


In [40]:
model_result5 = glove.most_similar("physics")
print(model_result5)

[('economics', 0.906194820720972), ('chemistry', 0.8805944669589155), ('simplicity', 0.8641558617129502), ('beauty', 0.8591936877994484)]


In [41]:
model_result6 = glove.most_similar("muscle")
print(model_result6)

[('tissue', 0.8528104455396867), ('nerve', 0.816419353092629), ('skeletal', 0.7969838320863409), ('bone', 0.7723693962517788)]


In [42]:
model_result7 = glove.most_similar("clean")
print(model_result7)

[('fresh', 0.8600574993010727), ('water', 0.8569630638984247), ('heat', 0.8065985020930019), ('wind', 0.7973389015673037)]
