# fastText

https://github.com/facebookresearch/fastText

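**FastText** is an open-source library for natural language processing (NLP) that provides fast, efficient tools for text classification and word embedding. It was developed by the Facebook AI Research team and is designed to train quickly even on large text corpora. Its main features are: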


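**Key features**
1. **Word vector learning (Word Embeddings)**  
   - FastText learns a word embedding model that converts each word into a fixed-size vector. Word meaning is mapped into a vector space so that similar words end up with nearby vectors.
   - It resembles Word2Vec, but FastText additionally treats each word as a set of **subwords**.

2. **Subword-based model**  
   - Each word is wrapped in the boundary markers `<` and `>` and decomposed into character n-grams (for n = 3, 'apple' → ['<ap', 'app', 'ppl', 'ple', 'le>']), which makes the model robust to **rare words** and **misspellings**.
   - Beyond whole words, this lets the model pick up finer-grained information such as spelling patterns, prefixes, and suffixes.

3. **Text classification**  
   - FastText is optimized for classifying documents and sentences quickly.
   - Training is fast, models are small, and accuracy is often competitive with much heavier models.

4. **Efficient implementation**  
   - FastText is designed to perform well on CPUs alone and runs quickly even on large datasets.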

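**How FastText works**
1. **Word representation**  
   - Each word is split into character n-gram subwords, and a vector is learned for each subword.
   - For example, with n = 3 the word "cat" is first wrapped as `<cat>` and then decomposed into '<ca', 'cat', and 'at>'. The whole word `<cat>` is kept as an additional unit.
   - The word vector is then the sum of its subword vectors.

2. **Model architecture**  
   - FastText builds on the Skip-gram or CBOW model.
   - Unlike the original models, it trains on subwords rather than on whole words alone.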
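
The word-representation step can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `char_ngrams`, `word_vector`, and the random toy vectors are hypothetical names and data, and the sketch omits the separate whole-word vector that fastText also keeps. It follows the sum-of-subwords description given here; actual implementations may normalize the sum, for example by averaging.

```python
import random

def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, wrapped in < and > markers."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

def word_vector(word, subword_vectors, dim=4):
    """Sum the vectors of every n-gram of `word` that has a learned vector."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vec = subword_vectors.get(gram)
        if vec is not None:
            total = [t + v for t, v in zip(total, vec)]
    return total

# Toy "trained" table: random vectors for the n-grams of two known words.
rng = random.Random(0)
subword_vectors = {
    gram: [rng.uniform(-1, 1) for _ in range(4)]
    for known_word in ("cat", "cats")
    for gram in char_ngrams(known_word)
}

print(char_ngrams("cat", 3, 3))              # ['<ca', 'cat', 'at>']
print(word_vector("cat", subword_vectors))
print(word_vector("catz", subword_vectors))  # unseen word, built from shared n-grams
```

Because "catz" shares n-grams such as '<ca' and 'cat' with the known words, it still receives a non-zero vector, which is the property the rest of this notebook relies on.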

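**Advantages of FastText**
1. **Handling rare words**  
   - Thanks to the subword-based approach, it generalizes better to rare words and to words never seen during training.
2. **Fast training**  
   - The simple model architecture and optimized implementation allow very fast training.
3. **Support for many languages**  
   - It works across many languages and is particularly effective for morphologically rich (inflected) languages, where a single stem appears in many surface forms.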

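**Use cases**
1. **Word embeddings**  
   - Computing similarity between words, learning sentence representations.
2. **Text classification**  
   - Spam filtering, sentiment analysis, news categorization.
3. **Multilingual support**  
   - Fast training and inference on multilingual datasets.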
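
Word similarity in embedding models is usually measured with cosine similarity, which is also the score that gensim's `most_similar` reports. Below is a minimal standard-library sketch; the three 3-dimensional vectors are made-up illustrative values, not output from any trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, chosen only for illustration.
king = [0.9, 0.1, 0.4]
queen = [0.8, 0.2, 0.5]
banana = [-0.3, 0.9, -0.1]

print(cosine_similarity(king, queen))   # close to 1: similar direction
print(cosine_similarity(king, banana))  # negative: dissimilar direction
```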

### gensim FastText

In [2]:
from gensim.models import FastText
from lxml import etree
import re
from nltk.tokenize import word_tokenize, sent_tokenize
import pandas as pd

In [4]:
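# Parse the TED talks XML dump; the context manager closes the file afterwards.
with open('ted_en.xml', 'r', encoding='utf-8') as f:
    xml = etree.parse(f)

# Join the text of every <content> element and drop parenthesized
# annotations such as (Laughter) or (Applause).
corpus = '\n'.join(xml.xpath('//content/text()'))
corpus = re.sub(r'\([^)]*\)', '', corpus)

# Split into sentences (requires the NLTK punkt tokenizer data).
sentences = sent_tokenize(corpus)

preprocessed_sentences = []

for sentence in sentences:
    sentence = sentence.lower()
    # Replace everything except letters and digits with a space.
    sentence = re.sub(r'[^0-9a-zA-Z]', ' ', sentence)
    tokens = word_tokenize(sentence)
    preprocessed_sentences.append(tokens)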

In [6]:
from gensim.models import Word2Vec

w2v_model = Word2Vec(
    sentences=preprocessed_sentences,
    vector_size=100,
    window=5,
    min_count=5,
    sg=0
)

In [9]:
w2v_model.wv.vectors.shape

(21613, 100)

In [10]:
w2v_df = pd.DataFrame(w2v_model.wv.vectors, index=w2v_model.wv.index_to_key)
w2v_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
the,-1.602183,-0.754122,0.054211,0.197093,-0.343549,-0.359738,-0.439755,-0.628242,0.412635,0.294378,...,0.399949,1.003600,1.711362,-1.225475,1.281034,-0.631227,-0.346572,-0.308788,0.563423,0.139364
and,-0.960859,0.355629,-1.937655,0.191394,0.887632,-0.150058,-1.535106,-0.098262,-1.384128,0.010243,...,0.278464,-0.538096,1.244998,-0.282341,1.333021,-0.160819,-0.515773,0.325857,0.474988,0.573832
to,1.057269,1.325501,-2.271777,-1.905770,1.649639,-0.086139,-2.578136,-0.062404,-1.654634,1.582814,...,-1.203855,1.284397,0.643960,-0.724886,2.698494,0.406241,-1.976951,1.565815,3.284841,1.009592
of,-1.978508,2.332829,0.200989,-1.777468,0.513555,-0.945224,-1.850023,1.879423,0.149226,-1.886073,...,-1.238511,0.613360,0.719313,-0.293735,1.992131,-1.026388,0.696551,0.334138,-0.584948,2.398067
a,-2.096079,-0.624076,0.227985,-2.024044,0.670893,1.890247,0.873058,-1.302917,0.707418,1.289235,...,-0.656068,1.118097,1.191869,0.571720,2.612408,-2.233231,0.444938,0.469160,0.786502,2.924187
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
bullies,0.010675,0.038673,0.003364,-0.019312,0.051840,-0.113449,-0.020208,0.141567,-0.038554,-0.041842,...,0.105336,0.039803,0.002791,0.046667,0.108889,0.051507,0.062895,-0.009533,0.049379,-0.007275
splendor,0.006641,0.025374,-0.021623,-0.005972,-0.026425,-0.101663,0.038669,0.085754,-0.065216,-0.090220,...,0.063066,-0.004041,0.061301,-0.032802,0.117324,-0.015248,0.073585,-0.037586,0.023925,-0.021365
enslaving,-0.087698,0.032488,0.048653,0.018605,0.015937,-0.125980,0.038796,0.135697,-0.096660,-0.023390,...,0.013414,-0.003434,-0.006682,-0.044973,0.041363,0.076625,-0.007417,-0.031955,0.076854,-0.047980
inspirations,-0.003380,0.016162,0.014172,-0.034789,0.005911,-0.095611,-0.001882,0.175428,-0.037522,0.022480,...,0.040983,0.003520,-0.029165,-0.005583,0.120120,0.033486,0.025417,-0.056796,-0.011567,-0.065332


In [20]:
# w2v_model.wv.most_similar('father')
w2v_model.wv.most_similar('luckyfather')

KeyError: "Key 'luckyfather' not present in vocabulary"

In [14]:
# FastText
from gensim.models import FastText

fasttext_model = FastText(
    sentences=preprocessed_sentences,
    vector_size=100,
    window=5,
    min_count=5,
    sg=0
)

In [15]:
fasttext_model.wv.vectors.shape

(21613, 100)

In [16]:
fasttext_df = pd.DataFrame(fasttext_model.wv.vectors, index=fasttext_model.wv.index_to_key)
fasttext_df.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
the,1.884445,0.306065,-0.0391,-0.437819,-2.915466,0.837425,-2.247126,0.612085,-3.043438,1.9332,...,-2.221439,-0.330609,4.234833,-0.979546,1.873483,0.020404,0.49395,0.134106,1.32007,1.68862
and,-0.925729,1.344329,0.531703,1.310783,-1.809213,0.159564,-1.674048,-1.061774,-1.543813,3.514447,...,-0.915509,0.154899,2.817363,1.42159,0.440697,-2.932109,-0.25469,-2.181306,-0.109521,-0.320897
to,1.690793,6.124421,-0.61132,-0.206671,-3.219455,-2.77844,-0.383981,0.707338,-1.881064,0.92292,...,-1.761237,-0.338074,3.565907,0.142494,5.310114,-0.795193,2.050869,1.776596,5.077247,-6.552162
of,-2.852442,-1.107803,-2.979213,-3.902354,0.684918,6.356664,-2.288266,-0.56095,4.03617,-2.695691,...,1.950008,-1.206472,-4.020442,-2.265939,-2.418921,3.865775,-2.740152,-10.624485,-0.322602,0.556215
a,5.272165,1.742687,2.795201,-5.716394,-1.767586,7.387663,-5.325272,5.288905,0.597842,1.118361,...,2.283784,-5.048785,-2.526465,-3.285891,4.146706,3.938247,1.484104,4.67685,0.226703,2.482768
that,3.014732,-0.629949,2.135756,1.073808,-2.043225,0.008067,-0.912322,-2.073413,-1.534726,2.159022,...,0.956505,1.387371,2.62947,-0.249867,1.233294,-2.383852,0.798021,1.537379,1.909022,0.811682
i,10.242331,-3.876848,2.293705,3.167578,4.882856,2.002811,-1.932109,-17.129282,-2.111537,9.196996,...,3.028786,0.034507,9.147279,-1.500751,-7.368648,3.514261,-0.296399,9.704303,7.565924,-10.358772
in,-2.065645,-1.371473,6.577747,-1.62793,2.833375,-0.778094,0.404471,0.361873,-3.543106,-0.789999,...,-5.625654,-2.79616,-4.692937,3.268616,3.212688,2.409694,-2.042954,-7.056334,-3.822797,0.914039
it,2.785935,-2.793764,2.497077,2.729937,-1.778097,-2.026195,-0.416117,-1.838374,-7.269243,1.612041,...,2.126479,-0.839263,4.673944,-0.278593,-0.115846,-0.224461,0.327232,7.269763,2.151349,2.114584
you,1.012847,0.044959,0.696618,2.957201,-3.040638,0.969388,-2.165245,-6.37466,0.291031,3.073544,...,3.64082,-2.31662,2.68708,-4.118287,-2.581318,-2.835653,0.008869,2.873196,3.052771,-1.219708


In [None]:
# fasttext_model.wv.most_similar('father')
fasttext_model.wv.most_similar('luckyfather')   # Unlike Word2Vec above, this does not raise: FastText composes a vector for the unseen word from its character n-grams.

[('father', 0.9543288946151733),
 ('godfather', 0.9418891072273254),
 ('grandfather', 0.925786554813385),
 ('mother', 0.9017308950424194),
 ('grandmother', 0.8983917832374573),
 ('brother', 0.8912230730056763),
 ('stepfather', 0.8893412351608276),
 ('feather', 0.8775393962860107),
 ('luther', 0.8769897818565369),
 ('slaughter', 0.8685480952262878)]
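
One way to see why these neighbors come back is to count the character n-grams that 'luckyfather' shares with each of them. The sketch below is plain Python and does not call gensim; it assumes gensim's default n-gram range of 3 to 6 characters (the model above did not override it), and `char_ngrams` is a hypothetical helper name.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Set of character n-grams of `word`, wrapped in < and > markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    }

query = "luckyfather"
for neighbor in ("father", "feather", "banana"):
    shared = char_ngrams(query) & char_ngrams(neighbor)
    print(f"{neighbor}: {len(shared)} shared n-grams, e.g. {sorted(shared)[:5]}")
```

Words ending in "-ther" share suffix n-grams such as 'ther>' and 'her>' with the query, which would explain why 'feather' and 'luther' rank highly even though they are unrelated in meaning.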

In [None]:
fasttext_model.wv['abracadabra']    # Looked up by splitting the word into subwords, so a vector is returned even for an unseen word

array([ 0.25092041,  0.32342055,  0.28366297,  0.04466848, -0.40150365,
        0.18973528, -0.1157398 ,  0.05085277,  0.07350447,  0.314601  ,
       -0.12719235,  0.24691883,  0.11057214,  0.10992316, -0.20794125,
        0.320184  ,  0.03650503,  0.04627708, -0.21884   , -0.13797432,
        0.02486305,  0.29300675,  0.18621942,  0.22936836, -0.01200364,
        0.17893566, -0.04150484, -0.05518465, -0.03156302,  0.12966977,
        0.12998235,  0.33754033, -0.02690321, -0.14489959,  0.13249753,
        0.02198056, -0.18086907,  0.04038905, -0.00096031, -0.01142669,
        0.04960864, -0.22249827, -0.02679194,  0.09768704, -0.24788776,
       -0.4038155 ,  0.08912575, -0.13664585, -0.26976705,  0.3839891 ,
        0.03894145, -0.03931139, -0.36678648,  0.08726571, -0.0312816 ,
        0.11962115,  0.03238783, -0.12999655, -0.05044733,  0.23675969,
        0.42200425,  0.02478763,  0.27449378, -0.07859233,  0.3125196 ,
        0.33317038, -0.15556662, -0.29050404, -0.07675852, -0.27

### Installing the fasttext package

In [None]:
# !pip install fasttext-wheel



In [23]:
import fasttext
import fasttext.util

model = fasttext.train_unsupervised(
    'naver_movie_ratings.txt',
    model='skipgram',
    minCount=1,
    dim=100,    # vector dimension
    minn=3,     # minimum character n-gram length
    maxn=5      # maximum character n-gram length
)

In [None]:
model.get_word_vector('극장')   # vector for the word '극장' (theater)

array([ 0.49097133, -0.313208  , -0.5330142 ,  0.8234431 , -0.21739802,
        0.05225354, -0.13133241,  0.07269783,  0.22873338,  0.6822052 ,
        0.07438467,  0.4016128 ,  0.5904844 , -0.3304736 , -0.720209  ,
       -0.9982164 ,  0.03726093, -1.3955009 , -0.44415912, -0.19743568,
        0.48032144,  0.38558164,  0.24129906, -0.11955368,  0.6412722 ,
       -0.9159176 , -0.213653  ,  0.46298224, -0.82508254,  0.40884316,
       -0.21237008,  0.501677  , -0.02462932,  0.3216497 , -0.63968337,
        0.24529728,  0.04198614,  0.3109082 , -0.4031785 , -0.3072338 ,
        0.1141006 ,  0.3523968 ,  0.307692  , -0.2626483 ,  1.0313857 ,
       -0.1366069 ,  0.15914622, -0.5285372 , -0.12884441,  0.26838934,
        0.21005908, -0.8546542 , -0.1334615 ,  0.5188438 , -0.57247734,
        0.04643317,  0.7910253 ,  0.26133966,  1.1033598 , -0.07967176,
        0.4530004 ,  0.03895722,  0.24288367,  0.42147338, -0.02514918,
       -0.07914953, -0.14788286,  0.02455036,  0.4773987 ,  0.01

In [None]:
model.get_subwords('영화관')    # Character n-grams of length 3 to 5, counting the boundary markers < and >; the first entry is the word itself

(['영화관', '<영화', '<영화관', '<영화관>', '영화관', '영화관>', '화관>'],
 array([   2062, 1921845, 1442415, 1378913, 2245977, 1515139, 1352938]))
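
The subword list above can be reproduced by hand. The sketch below enumerates the n-grams of `<영화관>` in the same order and then hashes each one into a bucket. Everything about the hashing is an assumption based on my reading of the fastText C++ source rather than something this notebook confirms: that n-grams are hashed with 32-bit FNV-1a (with bytes sign-extended), that the default bucket count is 2,000,000, and that the returned id is the vocabulary size plus the bucket index. Since the vocabulary size is not shown here, the sketch prints only the bucket index and does not try to match the ids above. `subwords_in_order` and `fnv1a_32` are hypothetical helper names.

```python
def subwords_in_order(word, minn=3, maxn=5):
    """Character n-grams of <word>, ordered by start position, then by length."""
    wrapped = f"<{word}>"
    grams = []
    for start in range(len(wrapped)):
        for n in range(minn, maxn + 1):
            if start + n <= len(wrapped):
                grams.append(wrapped[start:start + n])
    return grams

def fnv1a_32(text):
    """32-bit FNV-1a over UTF-8 bytes, sign-extending each byte (assumed fastText behavior)."""
    h = 2166136261
    for byte in text.encode("utf-8"):
        signed = byte - 256 if byte >= 128 else byte
        h ^= signed & 0xFFFFFFFF
        h = (h * 16777619) & 0xFFFFFFFF
    return h

BUCKET = 2_000_000  # assumed default number of hash buckets

for gram in subwords_in_order("영화관"):
    print(gram, fnv1a_32(gram) % BUCKET)
```

Hashing n-grams into a fixed number of buckets keeps the subword table at a bounded size regardless of how many distinct n-grams the corpus contains, at the cost of occasional collisions between unrelated n-grams.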