# Analysis of Filipino Word Embeddings

In this task, two Filipino word embeddings trained with different techniques (GloVe, FastText, or Word2Vec) are compared. In the interest of time, we will only explore two techniques: Word2Vec and FastText. We will be using [@danjohnvelasco](https://github.com/danjohnvelasco)'s Filipino Word Embeddings. More information about these embeddings can be found in the [Filipino-Word-Embeddings repository](https://github.com/danjohnvelasco/Filipino-Word-Embeddings).

The first set of test cases will be word inputs where the program must show the top 10 related words according to each embedding. Five test samples will be prepared in this set.

The second set of test cases will be incomplete word analogies where the program must show the top 10 possible answers according to each embedding. Five test samples will also be prepared for this set.

## Import libraries

In [9]:
from gensim.models import Word2Vec
from gensim.models import FastText
import pandas as pd

## Load models

In [4]:
w2v_model = Word2Vec.load("word2vec_embeddings/word2vec_300dim_20epochs.model")
ft_model = FastText.load("fasttext_embeddings/fasttext_300dim_20epochs.model")

## 1st set of test cases
As mentioned previously, we will first test the two models by showing the top 10 most related words for each of the 5 inputs. 

In [None]:
word_test_cases = ["gamot", "ilaw", "tubig", "tao", "pusa"]
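
For reference, `most_similar` ranks candidate words by cosine similarity to the query vector. The sketch below reproduces that ranking on a toy vocabulary. The 3-dimensional vectors and the `top_n_similar` helper are made up for illustration and are not taken from either model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_n_similar(query, vectors, topn=10):
    """Return (word, score) pairs ranked by cosine similarity to `query`.

    The query word itself is excluded from the results.
    """
    scores = [
        (word, cosine(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, chosen only so that "aso" points roughly
# in the same direction as "pusa" while "tubig" does not.
toy_vectors = {
    "pusa": [0.9, 0.1, 0.0],
    "aso": [0.8, 0.2, 0.1],
    "tubig": [0.0, 0.1, 0.9],
}

toy_neighbours = top_n_similar("pusa", toy_vectors, topn=2)
```

The real models apply the same idea to 300-dimensional vectors over their full vocabularies.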

### Word2Vec model

In [15]:
w2v_word_results = {}

for test_case in word_test_cases:
    w2v_word_results[test_case] = w2v_model.wv.most_similar(test_case, topn=10)

w2v_word_df = pd.DataFrame(w2v_word_results)

w2v_word_df

rank,gamot,ilaw,tubig,tao,pusa
0,"(antibiotic, 0.6555707454681396)","(kuryente, 0.5558214783668518)","(kuryente, 0.6309226155281067)","(taong, 0.7032338380813599)","(aso, 0.8510643839836121)"
1,"(antibiotics, 0.6533203125)","(kandila, 0.5446860790252686)","(hangin, 0.6050646901130676)","(mamamayan, 0.5895366668701172)","(kuting, 0.6781439781188965)"
2,"(alak, 0.6216790080070496)","(aircon, 0.5370152592658997)","(gripo, 0.5996379256248474)","(babae, 0.5782827138900757)","(daga, 0.6219066381454468)"
3,"(pagkain, 0.6125994920730591)","(electricfan, 0.5120261311531067)","(yelo, 0.5941765904426575)","(bagay, 0.5741786360740662)","(langgam, 0.5900928378105164)"
4,"(inumin, 0.6038749814033508)","(liwanag, 0.5054677128791809)","(kumukulong, 0.5639724135398865)","(lalake, 0.5585845112800598)","(hamster, 0.5615367293357849)"
5,"(gatas, 0.596613347530365)","(bintana, 0.5049338340759277)","(tubig-baha, 0.5634187459945679)","(pilipino, 0.5494675636291504)","(langaw, 0.548129141330719)"
6,"(vitamins, 0.5947099924087524)","(kable, 0.5006621479988098)","(maiinom, 0.5614603757858276)","(bata, 0.5454597473144531)","(bubuyog, 0.5478506088256836)"
7,"(biogesic, 0.5867588520050049)","(kawad, 0.49458202719688416)","(dugo, 0.5602498650550842)","(nilalang, 0.5426891446113586)","(lamok, 0.5461353659629822)"
8,"(bakuna, 0.5765191912651062)","(kurtina, 0.4879303276538849)","(inuming, 0.5584797859191895)","(lalaki, 0.5294330716133118)","(ipis, 0.5432469844818115)"
9,"(antibiyotiko, 0.5671688914299011)","(tubig, 0.475870281457901)","(pagkain, 0.5566661953926086)","(indibidwal, 0.496071457862854)","(kambing, 0.542522132396698)"


### FastText model

In [16]:
ft_word_results = {}

for test_case in word_test_cases:
    ft_word_results[test_case] = ft_model.wv.most_similar(test_case, topn=10)

ft_word_df = pd.DataFrame(ft_word_results)

ft_word_df

rank,gamot,ilaw,tubig,tao,pusa
0,"(gamotea, 0.8889113068580627)","(pailaw, 0.7953163981437683)","(tubig*, 0.967059314250946)","(taoo, 0.8425832986831665)","(aso, 0.8613008260726929)"
1,"(panggamot, 0.8156951665878296)","(ilawom, 0.7897399067878723)","(tubigg, 0.9415788650512695)","(taod², 0.8216907382011414)","(pusakal, 0.7979550957679749)"
2,"(pampagamot, 0.7876476645469666)","(pilaw, 0.7762776017189026)","(tubig-dagat, 0.8721705675125122)","(taoos, 0.8000805974006653)","(pusan, 0.7868925929069519)"
3,"(manggamot, 0.7830954194068909)","(tilaw, 0.7544955611228943)","(catubig, 0.8658272624015808)","(taooo, 0.7624260783195496)","(pusaaa, 0.7693825960159302)"
4,"(paggamot, 0.7692429423332214)","(iilaw, 0.7525157928466797)","(patubig, 0.865227222442627)","(taong, 0.724565327167511)","(pusanggala, 0.7630274891853333)"
5,"(pagamot, 0.7635043859481812)","(ilawa, 0.7512474060058594)","(tubig-baha, 0.8589447736740112)","(taoooo, 0.7239161133766174)","(daga, 0.7318812012672424)"
6,"(gamos, 0.758536696434021)","(madilaw, 0.7464135885238647)","(tubig-alat, 0.8559607863426208)","(taob, 0.7121660113334656)","(pusang, 0.7297970056533813)"
7,"(gagamot, 0.7569298148155212)","(kilaw, 0.7269748449325562)","(matubig, 0.8499014377593994)","(taoyuan, 0.7102544903755188)","(pusaaaa, 0.7294673919677734)"
8,"(gamo, 0.7512171268463135)","(umiilaw, 0.7231032848358154)","(tubi, 0.8461125493049622)","(tao'y, 0.7097187638282776)","(kambing, 0.7218106985092163)"
9,"(panggagamot, 0.7384175062179565)","(ilawan, 0.7192785143852234)","(tubigan, 0.8407340049743652)","(taong-bayan, 0.6993751525878906)","(manok, 0.7201105952262878)"
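
One simple way to quantify how much the two models agree is to count the neighbours they share for the same input. The sketch below does this for `pusa`, with both word lists copied by hand from the Word2Vec and FastText tables above. `shared_neighbours` is a hypothetical helper name, not part of gensim.

```python
def shared_neighbours(results_a, results_b):
    """Return the words present in both lists, in the order of results_a."""
    words_b = set(results_b)
    return [word for word in results_a if word in words_b]


# Top-10 neighbours of "pusa", copied from the result tables.
w2v_pusa = [
    "aso", "kuting", "daga", "langgam", "hamster",
    "langaw", "bubuyog", "lamok", "ipis", "kambing",
]
ft_pusa = [
    "aso", "pusakal", "pusan", "pusaaa", "pusanggala",
    "daga", "pusang", "pusaaaa", "kambing", "manok",
]

common = shared_neighbours(w2v_pusa, ft_pusa)
print(common)  # ['aso', 'daga', 'kambing']
```

With the notebook's own variables, the word lists could be built as `[word for word, _ in w2v_word_results["pusa"]]` and likewise for `ft_word_results`.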


## 2nd set of test cases
Similarly, the 2nd set of test cases evaluates how well the two models complete analogies. The five samples cover one analogy type each: synonymy, antonymy, part-whole, superclass, and geography.

In [93]:
analogy_types = ["synonymy", "antonymy", "part-whole", "superclass", "geography"]

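analogy_test_cases = {
    # Each query is evaluated roughly as positive[0] - negative[0] + positive[1].
    # Note: for "a is to b, as c is to ?", the conventional query is b - a + c,
    # i.e. positive=[b, c], negative=[a]; the queries below compute a - b + c.
    # The results recorded in these comments come from the queries as written.

    # Pinggan is to plato, as gwapo is to pogi (expected)
    # synonymy | word2vec_result: gwapo → pogi (correct) | fasttext_result: gwapo → gwapo-gwapo (wrong)
    "synonymy": {"positive": ["pinggan", "gwapo"], "negative": ["plato"]},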
    # Buhay is to patay, as laki is to liit (expected)
    # antonymy | word2vec_result: laki → liit (correct) | fasttext_result: laki → laki-laki (wrong)
    "antonymy": {"positive": ["buhay", "laki"], "negative": ["patay"]},
    # Papel is to libro, as mata is to mukha (expected)
    # part-whole | word2vec_result: mata → paningin (wrong) | fasttext_result: mata → paningin (wrong)
    "part-whole": {"positive": ["papel", "mata"], "negative": ["libro"]},
    # Talong is to gulay, as asul is to kulay (expected)
    # superclass | word2vec_result: asul → kahel (wrong) | fasttext_result: asul → rasul (wrong)
    "superclass": {
        "positive": ["talong", "asul"],
        "negative": ["gulay"],
    },
    # Australia is to Canberra, as Thailand is to Bangkok (expected)
    # geography | word2vec_result: thailand → japan (wrong) | fasttext_result: thailand → australians (wrong)
    "geography": {"positive": ["australia", "thailand"], "negative": ["canberra"]},
}
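
For comparison, the conventional way to pose "a is to b, as c is to ?" with `most_similar` is to add `b` and `c` and subtract `a`. The helper below builds such a query. `build_analogy_query` is a hypothetical name, and since the test cases above place the words differently, results from this form may differ from the recorded ones.

```python
def build_analogy_query(a, b, c):
    """Build most_similar keyword arguments for 'a is to b, as c is to ?'.

    The resulting query corresponds to the vector b - a + c.
    """
    return {"positive": [b, c], "negative": [a]}


# "Australia is to Canberra, as Thailand is to ?"
query = build_analogy_query("australia", "canberra", "thailand")

# The query could then be passed to either model, for example:
#   w2v_model.wv.most_similar(**query, topn=10)
```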

### Word2Vec model

In [94]:
w2v_analogy_results = {}

for analogy_type, test_case in analogy_test_cases.items():
    w2v_analogy_results[analogy_type] = w2v_model.wv.most_similar(
        positive=test_case["positive"],
        negative=test_case["negative"],
        topn=10,
    )

w2v_analogy_df = pd.DataFrame(w2v_analogy_results)

w2v_analogy_df

rank,synonymy,antonymy,part-whole,superclass,geography
0,"(pogi, 0.7468219995498657)","(liit, 0.44972288608551025)","(paningin, 0.49562305212020874)","(kahel, 0.4754602909088135)","(japan, 0.5280833840370178)"
1,"(ampogi, 0.5760890245437622)","(anlaki, 0.44665172696113586)","(braso, 0.49183645844459534)","(berde, 0.4515073597431183)","(india, 0.5260334610939026)"
2,"(gwapooo, 0.5609509348869324)","(napakalaki, 0.4349755346775055)","(pisngi, 0.4549057185649872)","(pula, 0.4440903067588806)","(vietnam, 0.5219374299049377)"
3,"(napakagwapo, 0.5426948666572571)","(bohai, 0.4336892068386078)","(ilong, 0.4541238844394684)","(abuhing, 0.42844104766845703)","(indonesia, 0.5177457928657532)"
4,"(cute, 0.5409790873527527)","(malaki, 0.4159761965274811)","(leeg, 0.45015662908554077)","(matingkad, 0.4224248230457306)","(taiwan, 0.5176504254341125)"
5,"(gwapoo, 0.5348778367042542)","(lahat, 0.3834744989871979)","(dibdib, 0.44576549530029297)","(puting, 0.42004266381263733)","(singapore, 0.5071107149124146)"
6,"(popogi, 0.5269265174865723)","(laking, 0.3754507303237915)","(tainga, 0.4250698387622833)","(pulang, 0.41969069838523865)","(malaysia, 0.5032574534416199)"
7,"(gwapoooo, 0.5192775726318359)","(buhai, 0.3696788251399994)","(kamay, 0.42230093479156494)","(itim, 0.41946735978126526)","(myanmar, 0.49050167202949524)"
8,"(pogiii, 0.5108906626701355)","(napakaliit, 0.36925753951072693)","(matang, 0.4221208393573761)","(berdeng, 0.4161509871482849)","(turkey, 0.479464054107666)"
9,"(gwapooooo, 0.5016728639602661)","(paglaki, 0.36560046672821045)","(ulo, 0.417401522397995)","(matitingkad, 0.41305628418922424)","(korea, 0.4692491888999939)"


### FastText model

In [95]:
ft_analogy_results = {}

for analogy_type, test_case in analogy_test_cases.items():
    ft_analogy_results[analogy_type] = ft_model.wv.most_similar(
        positive=test_case["positive"],
        negative=test_case["negative"],
        topn=10,
    )

ft_analogy_df = pd.DataFrame(ft_analogy_results)

ft_analogy_df

rank,synonymy,antonymy,part-whole,superclass,geography
0,"(gwapo-gwapo, 0.818594217300415)","(laki-laki, 0.6621410250663757)","(paningin, 0.669924795627594)","(rasul, 0.5702042579650879)","(australians, 0.6875006556510925)"
1,"(g-gwapo, 0.7936439514160156)","(mlaki, 0.6123846173286438)","(mapapel, 0.6686519384384155)","(sul, 0.5469658970832825)","(australis, 0.6754210591316223)"
2,"(gugwapo, 0.7935822010040283)","(anlaki, 0.5819886922836304)","(takipmata, 0.6683081388473511)","(puting, 0.5453780889511108)","(australasia, 0.6733406186103821)"
3,"(pogi, 0.7768386602401733)","(laking, 0.5781446695327759)","(pilikmata, 0.6572210192680359)","(polong, 0.5427811741828918)","(japan, 0.6640950441360474)"
4,"(gwagwapo, 0.7765454649925232)","(lakin, 0.568845808506012)","(namamalikmata, 0.6521999835968018)","(itim, 0.5382503271102905)","(singapore, 0.6620980501174927)"
5,"(ga-gwapo, 0.7635363936424255)","(kalaki, 0.5619842410087585)","(pagkatawan, 0.6470500230789185)","(berdeng, 0.5317749977111816)","(indonesia, 0.6597210168838501)"
6,"(ggwapo, 0.7601233720779419)","(anlaking, 0.5489166975021362)","(matamlay, 0.6470266580581665)","(talolong, 0.5256481170654297)","(australian, 0.6456921100616455)"
7,"(gagwapo, 0.7546386122703552)","(lakim, 0.5487865805625916)","(kamatyan, 0.6458039283752441)","(talon-talon, 0.5174729228019714)","(malaysia, 0.6354982852935791)"
8,"(gwapoo, 0.7515375018119812)","(buhay~, 0.5481079816818237)","(pamata, 0.6439923644065857)","(hasul, 0.517160177230835)","(singaporeans, 0.6304317712783813)"
9,"(angwapo, 0.7432950139045715)","(apakalaki, 0.5442602634429932)","(pakatawa, 0.6435849666595459)","(zilong, 0.5165695548057556)","(australoid, 0.6273944973945618)"
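
To compare the two models on analogies without reading the tables by eye, one option is to look up the position of the expected answer in each top-10 list. The sketch below does this with word lists copied by hand from the tables above. `rank_of` is a hypothetical helper, and ranks are 0-based to match the table index.

```python
def rank_of(expected, words):
    """Return the 0-based position of `expected` in `words`, or None if absent."""
    for position, word in enumerate(words):
        if word == expected:
            return position
    return None


# Top-10 answers copied from the analogy result tables.
w2v_synonymy = [
    "pogi", "ampogi", "gwapooo", "napakagwapo", "cute",
    "gwapoo", "popogi", "gwapoooo", "pogiii", "gwapooooo",
]
ft_synonymy = [
    "gwapo-gwapo", "g-gwapo", "gugwapo", "pogi", "gwagwapo",
    "ga-gwapo", "ggwapo", "gagwapo", "gwapoo", "angwapo",
]
w2v_geography = [
    "japan", "india", "vietnam", "indonesia", "taiwan",
    "singapore", "malaysia", "myanmar", "turkey", "korea",
]

print(rank_of("pogi", w2v_synonymy))      # 0
print(rank_of("pogi", ft_synonymy))       # 3
print(rank_of("bangkok", w2v_geography))  # None
```

Here the expected synonym `pogi` is the top Word2Vec answer and the fourth FastText answer, while the expected capital does not appear in the Word2Vec top 10 at all.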
