# 2. Word embedding model training

Following the approach in this [Word2Vec tutorial](https://rare-technologies.com/word2vec-tutorial/).

In [None]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

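Preprocessing involves tokenising with gensim's `simple_preprocess` (which lowercases the text and uses the `(((?![\d])\w)+)` regex) and stopword removal (see below).

No stemming or lemmatisation is needed for now. Null value removal isn't necessary either: the regex only matches runs of non-digit word characters, so empty or non-text lines simply yield no tokens.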
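
To make the tokenising step concrete, here is a minimal standard-library sketch of roughly what `simple_preprocess` does with that regex. The function name `sketch_preprocess` is hypothetical, and the length limits (2 to 15 characters) are assumed to match gensim's defaults; this is an approximation, not gensim's own code.

```python
import re

# The pattern quoted above: runs of word characters that are not digits.
PAT_ALPHABETIC = re.compile(r"(((?![\d])\w)+)", re.UNICODE)


def sketch_preprocess(text, min_len=2, max_len=15):
    """Approximate simple_preprocess: lowercase, tokenise, then filter by length."""
    tokens = (match.group() for match in PAT_ALPHABETIC.finditer(text.lower()))
    # Assumed defaults: drop very short/long tokens and tokens starting with "_".
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]


# Digits split "B2B" into two one-letter tokens, which the length filter drops.
print(sketch_preprocess("She was born in 1985 and is a B2B analyst."))
# An empty line produces an empty token list, so no null handling is needed.
print(sketch_preprocess(""))
```

Numbers and single-letter fragments disappear at this stage, which is why no separate cleaning pass is applied before tokenising.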

In [None]:
from gensim.parsing.preprocessing import remove_stopword_tokens
from gensim.test.utils import datapath
from gensim import utils

In [None]:
STOPWORDS = """
a about above across after afterwards again against all almost alone along already also although always am among amongst an and another any anyhow anyone anything anyway anywhere are around as at back be
became because become becomes becoming been before beforehand being beside besides between beyond both bottom but by call can
cannot cant co con could couldnt cry de
did didn do does doesn doing don done down due during
each eight eg either eleven else elsewhere enough etc even ever every everyone everything everywhere except few fifteen
fifty fill find for former formerly forty found four from front full further get give go
had has hasnt have hence here hereafter hereby herein hereupon how however hundred i ie
if in inc indeed into is it its itself keep last latter latterly least less ltd
just
kg km
made make many may me meanwhile might mill mine more moreover most mostly move much must my myself name namely
neither never nevertheless next nine no nobody none noone nor not nothing now nowhere of off
often on once one only onto or other others otherwise our ours ourselves out over own part per
perhaps please put rather re
quite
rather really regarding
same say see seem seemed seeming seems several should show side since sincere six sixty so some somehow someone something sometime sometimes somewhere still such take ten
than that the then thence there thereafter thereby therefore therein thereupon these third this those though three through throughout thru thus to together too top toward towards twelve twenty two un under
until up unless upon us used using
various very via
was we well were what whatever when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who whoever whole whom whose why will with within without would yet you
your yours yourself yourselves

lrb rrb lcb rcb lsb rsb
"""
STOPWORDS = frozenset(w for w in STOPWORDS.split() if w)
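
As a rough illustration of the stopword step, the sketch below filters a token list against a set while preserving order. `DEMO_STOPWORDS` and `drop_stopwords` are hypothetical names used only here, and the set is a tiny stand-in for the full list above. The reading of `lrb`/`rrb` as bracket placeholders left over from the corpus is an assumption.

```python
# Hypothetical, much smaller stopword set used only for this illustration.
DEMO_STOPWORDS = frozenset({"is", "an", "the", "lrb", "rrb"})


def drop_stopwords(tokens, stopwords=DEMO_STOPWORDS):
    """Keep the tokens that are not in the stopword set, preserving their order."""
    return [token for token in tokens if token not in stopwords]


# Assumed example line after tokenising; "lrb"/"rrb" stand for bracket placeholders.
demo_tokens = ["john", "smith", "lrb", "born", "rrb", "is", "an", "actor"]
print(drop_stopwords(demo_tokens))
```

Using a `frozenset` keeps each membership check constant-time, which matters when the filter runs on every token of a large corpus.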

In [None]:
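class TrainingCorpus:
    """Streams the corpus line by line, yielding one list of tokens per line."""

    def __iter__(self):
        # An absolute path passes through datapath() unchanged.
        corpus_path = datapath('/Users/andrew.wang/Documents/academy/project/kainosRecruitmentApi-TeamA/ai/data/corpus/dataset_wikibios_merged.txt')
        # Assumes the corpus is UTF-8; the context manager closes the file after each pass.
        with open(corpus_path, encoding='utf-8') as corpus_file:
            for line in corpus_file:
                # Lowercase and tokenise, then drop stopwords.
                tokens = utils.simple_preprocess(line)
                tokens = remove_stopword_tokens(tokens, stopwords=STOPWORDS)
                yield tokens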
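
The corpus is wrapped in a class with `__iter__` rather than a plain generator because Word2Vec training iterates over the sentences more than once (one pass to build the vocabulary, then one per epoch). The sketch below shows the difference on a small temporary file. `LineCorpus` and `line_generator` are hypothetical names, and whitespace splitting stands in for the real preprocessing.

```python
import os
import tempfile


class LineCorpus:
    """Hypothetical restartable corpus: reopens the file on every iteration."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


def line_generator(path):
    """One-shot alternative: once exhausted, it yields nothing more."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.split()


# Write a two-line demo corpus to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("first line\nsecond line\n")
    demo_path = tmp.name

restartable = LineCorpus(demo_path)
one_shot = line_generator(demo_path)

# Count the sentences seen on two consecutive passes over each object.
passes_restartable = [len(list(restartable)) for _ in range(2)]
passes_one_shot = [len(list(one_shot)) for _ in range(2)]
os.remove(demo_path)

print(passes_restartable, passes_one_shot)
```

The class sees every sentence on both passes, while the generator is empty on the second, which would leave later training passes with no data.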

Test the tokeniser on the first six lines of the corpus

In [None]:
corpus = TrainingCorpus()
for i, sentence in enumerate(corpus):
    print(sentence)
    if i == 5:
        break

Train embedding model

In [None]:
from gensim.models import Word2Vec

In [None]:
corpus = TrainingCorpus()

The default number of epochs is 5; training takes about 10 minutes.

In [None]:
model = Word2Vec(sentences=corpus, workers=4)

In [None]:
model.save("models/word2vec_wikibios_merged.pt")

Simple testing: load the saved model and check that vocabulary lookups work.

In [None]:
model = Word2Vec.load("models/word2vec_wikibios_merged.pt")

In [None]:
model.wv.similarity("man", "male")
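
For reference, `similarity` returns the cosine similarity between the two word vectors. A minimal pure-Python sketch of that calculation follows; `cosine_similarity` is a hypothetical helper and the vectors are made up for illustration, not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 2-d vectors: same direction, orthogonal, and opposite direction.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))
```

Values near 1 mean the two words occur in similar contexts in the corpus; values near 0 mean little relation.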

In [None]:
model.wv["hello"]
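
Indexing `model.wv` with a word that is not in the vocabulary raises a `KeyError`. That can happen when the word was removed as a stopword or appeared fewer times than the model's `min_count` threshold. A guarded lookup can be sketched as below; `lookup_vector` is a hypothetical helper, and a plain dict stands in for `model.wv`, on the assumption that the real object supports the same `in` and `[]` operations.

```python
def lookup_vector(vectors, word, default=None):
    """Return vectors[word] if the word is present, otherwise the default."""
    if word in vectors:
        return vectors[word]
    return default


# A plain dict standing in for model.wv (values are made up).
demo_vectors = {"hello": [0.1, 0.2], "world": [0.3, 0.4]}

print(lookup_vector(demo_vectors, "hello"))
print(lookup_vector(demo_vectors, "the"))  # absent, e.g. filtered as a stopword
```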