<a href="https://colab.research.google.com/github/DaeSeokSong/NLP-Aengmu/blob/main/NLP_traning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NLP training**

This notebook contains code I am writing for personal study, based on [GitHub - microsoft/nlp-recipes](https://github.com/microsoft/nlp-recipes), which is released under the MIT license.
I do not intend to use this code commercially; it is written for personal learning only.

Since the author is Korean, some annotations may be written in Korean for convenience while studying.

Some passages are translated from the original text with [Papago](https://papago.naver.com/), so the translations may be imprecise.

# **1. Embeddings**
### Developing Word Embeddings

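Instead of using pretrained embeddings, as the similarity baseline_deep_dive notebook does, we can train word embeddings on our own dataset.
This notebook demonstrates the training process for generating word embeddings with the word2vec, GloVe, and fastText models.

We will use the STS Benchmark dataset for this task.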

### *Importing and Preparing the Dataset*

The code below installs the [utils_nlp](https://github.com/microsoft/nlp-recipes/tree/master/utils_nlp) library from the nlp-recipes repository provided by Microsoft.

In [None]:
!pip install -e git+https://github.com/microsoft/nlp-recipes.git@master#egg=utils_nlp

In [None]:
import gensim
import sys
import os

# Set the environment path
sys.path.append("../..")

import numpy as np
from utils_nlp.dataset.preprocess import (
    to_lowercase,  # Lowercases every element of the DataFrame whose value is a str
    to_spacy_tokens,  # Extracts tokens
    rm_spacy_stopwords,  # Removes stopwords
)
from utils_nlp.dataset import stsbenchmark
from utils_nlp.common.timer import Timer
from gensim.models import Word2Vec
from gensim.models.fasttext import FastText

In [None]:
# Set the path for where your repo is located
NLP_REPO_PATH = os.path.join('..','..')

# Set the path for where your datasets are located
BASE_DATA_PATH = os.path.join(NLP_REPO_PATH, "data")

# Set the path for location to save embeddings
SAVE_FILES_PATH = os.path.join(BASE_DATA_PATH, "trained_word_embeddings")
# exist_ok=True makes this a no-op when the directory already exists
os.makedirs(SAVE_FILES_PATH, exist_ok=True)

### *Load and Preprocess Data*

In [None]:
# Produce a pandas dataframe for the training set
train_raw = stsbenchmark.load_pandas_df(BASE_DATA_PATH, file_split="train")

# Clean the sts dataset
sts_train = stsbenchmark.clean_sts(train_raw)

In [None]:
sts_train.head(5)

In [None]:
# Check the size of our dataframe
sts_train.shape

### *Training set preprocessing*

In [None]:
# Convert all text to lowercase
df_low = to_lowercase(sts_train)  
# Tokenize text
sts_tokenize = to_spacy_tokens(df_low) 
# Tokenize with removal of stopwords
sts_train_stop = rm_spacy_stopwords(sts_tokenize)
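
To make the three preprocessing steps above concrete, here is a minimal standard-library sketch of what they do to a single sentence. It is only an approximation: it uses a regular expression instead of spaCy's tokenizer, and `STOPWORDS`, `simple_tokens`, and `remove_stopwords` are hypothetical names introduced for illustration, not part of `utils_nlp`.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# spaCy's English stopword list is much larger.
STOPWORDS = {"a", "an", "the", "is", "are", "in", "on", "of", "and"}


def simple_tokens(text):
    """Lowercase the text and split it into word tokens with a regex (not spaCy)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens):
    """Keep only the tokens that are not in STOPWORDS."""
    return [t for t in tokens if t not in STOPWORDS]


tokens = simple_tokens("A plane is taking off.")
filtered = remove_stopwords(tokens)
print(tokens)
print(filtered)
```

The real helpers operate on whole DataFrame columns and rely on spaCy's tokenizer and stopword list, so their output on the same sentence may differ from this sketch.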

In [None]:
# Append together the two sentence columns to get a list of all tokenized sentences.
all_sentences = sts_train_stop[["sentence1_tokens_rm_stopwords", "sentence2_tokens_rm_stopwords"]]
# Flatten two columns into one list and remove all sentences that are size 0 after tokenization and stop word removal.
sentences = [i for i in all_sentences.values.flatten().tolist() if len(i) > 0]
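
The flatten-and-filter step can be hard to picture on a DataFrame, so the sketch below reproduces the same idea on plain Python lists. The `pairs` data and the `flatten_non_empty` helper are hypothetical stand-ins for the two token columns, assumed here only for illustration.

```python
from itertools import chain

# Hypothetical token pairs standing in for the two DataFrame columns:
# each row is (sentence1 tokens, sentence2 tokens).
pairs = [
    (["plane", "taking"], ["air", "plane", "taking"]),
    ([], ["man", "playing", "flute"]),  # first sentence became empty
]


def flatten_non_empty(rows):
    """Flatten rows of (sentence1, sentence2) into one list, dropping empty sentences."""
    return [tokens for tokens in chain.from_iterable(rows) if len(tokens) > 0]


flat = flatten_non_empty(pairs)
print(len(flat))
print(flat)
```

As in the cell above, sentences are read row by row, so the two sentences of each pair stay next to each other in the resulting list.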

In [None]:
len(sentences)

In [None]:
sentence_lengths = [len(i) for i in sentences]
print("Minimum sentence length is {} tokens".format(min(sentence_lengths)))
print("Maximum sentence length is {} tokens".format(max(sentence_lengths)))
print("Median sentence length is {} tokens".format(np.median(sentence_lengths)))

In [None]:
sentences[:10]