# Generating a Latin WordVector


## Parameter suggestions brought to you by:
 * Word2vec applied to Recommendation: Hyperparameters Matter - https://arxiv.org/pdf/1804.04212
 * How to Generate a Good Word Embedding? - https://arxiv.org/pdf/1507.05523.pdf
   
   

## Guidelines/key points as quotes:
* for semantic property tasks, larger dimensions will lead to better performance 
* For most NLP tasks a dimensionality of 50 is typically sufficient.
* ... multiple iterations are necessary. The performance increases by a large margin when we iterate more than once, regardless of the task and the corpus.
* Early stopping for regularization based on minimizing the validation loss isn't as useful as with other ML tasks; ideally a specific test would be implemented, but this is difficult to implement.
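
The last guideline asks for a task-specific test in place of validation loss. One possible shape for such a test is sketched here; it is an illustration and not the check used in this notebook. It measures how often a hand-picked expected neighbor shows up among a word's nearest neighbors by cosine similarity. The function names `cosine` and `neighbor_hit_rate`, the toy 2-d vectors, and the word pairs are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def neighbor_hit_rate(vectors, expected_pairs, topn=2):
    """Fraction of (word, expected_neighbor) pairs where the expected
    neighbor appears among the word's topn nearest neighbors."""
    hits = 0
    for word, expected in expected_pairs:
        # Rank every other word by similarity to `word`, best first.
        ranked = sorted(
            (other for other in vectors if other != word),
            key=lambda other: cosine(vectors[word], vectors[other]),
            reverse=True,
        )
        if expected in ranked[:topn]:
            hits += 1
    return hits / len(expected_pairs)

# Toy 2-d vectors, invented purely for illustration.
toy_vectors = {
    'puella': [0.9, 0.1],
    'uirgo': [0.8, 0.2],
    'tuba': [0.1, 0.9],
    'tibia': [0.2, 0.8],
}
toy_pairs = [('puella', 'uirgo'), ('tuba', 'tibia')]
hit_rate = neighbor_hit_rate(toy_vectors, toy_pairs, topn=1)
```

With real vectors, the same score could be tracked across training runs with different `iter` or `size` values, assuming a curated list of pairs is available.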


In [1]:
import json
import logging 
import multiprocessing
from datetime import datetime
from pathlib import Path
import os

from gensim.models import Word2Vec
from gensim.models import KeyedVectors

from cltk.stop.latin import PERSEUS_STOPS 

In [2]:
LOG = logging.getLogger('make_word_vec')
logging.basicConfig(format='%(levelname)s : %(message)s', level=logging.INFO)

In [3]:
# corpus_characteristics = 'non_lemmatized'  
# corpus_filename =  'latin_library.preprocessed.cor' 

# corpus_characteristics = 'lemmatized'  
# corpus_filename ='latin_library.lemmatized.preprocessed.cor'

corpus_characteristics = ''  
corpus_filename ='latin_library.preprocessed.cor'


In [4]:
STOPWORDS = set(PERSEUS_STOPS)
# additional stops
additional_stops = 'ille iste ipse haec quem illic qui sic hic quae'.split()

for stop in additional_stops:
    STOPWORDS.add(stop)

In [5]:
corpus_file_wo_stopwords = 'latin_library.wostops.cor'
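# Read and write explicitly as UTF-8 so accented forms survive on every platform.
with open(corpus_filename, 'rt', encoding='utf-8') as infile:
    with open(corpus_file_wo_stopwords, 'wt', encoding='utf-8') as outfile:
        for line in infile:
            # Drop stopwords and write one space-joined sentence per line.
            words = [word for word in line.split() if word not in STOPWORDS]
            sent = ' '.join(words).strip()
            outfile.write('{}\n'.format(sent))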

In [6]:
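keyword_params = {
    'size': 50,  # vector dimensionality (renamed `vector_size` in gensim 4)
    'iter': 30,  # training epochs (renamed `epochs` in gensim 4)
    'min_count': 3,  # ignore all words with total frequency lower than this
    'max_vocab_size': None,  # no cap on vocabulary size
    'ns_exponent': 0.75,  # the default, optimal for linguistic tasks; also try -0.5 for recommenders
    'alpha': 0.025,  # initial learning rate
    'min_alpha': 0.004,  # learning rate decays linearly to this value
    'sg': 1,  # skip-gram
    'window': 10,  # number of surrounding words to consider
    'workers': max(1, multiprocessing.cpu_count() - 1),  # leave one core free, but never use zero workers
    'negative': 15,  # number of negative samples; 15 may be best
    'sample': 0,  # 0 disables downsampling of frequent words
    # For reference: sample=1e-05 downsamples the 4158 most-common words,
    # and sample=0.001 downsamples the 32 most-common words.
}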
LOG.info('Creating vector with parameters: %s', json.dumps(keyword_params))
latin_lib_vec = Word2Vec(corpus_file=corpus_file_wo_stopwords, **keyword_params)

INFO : Creating vector with parameters: {"size": 50, "iter": 30, "min_count": 3, "max_vocab_size": null, "ns_exponent": 0.75, "alpha": 0.025, "min_alpha": 0.004, "sg": 1, "window": 10, "workers": 7, "negative": 15, "sample": 0}
INFO : collecting all words and their counts
INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO : PROGRESS: at sentence #10000, processed 132618 words, keeping 23063 word types
INFO : PROGRESS: at sentence #20000, processed 252256 words, keeping 34768 word types
INFO : PROGRESS: at sentence #30000, processed 402348 words, keeping 44505 word types
INFO : PROGRESS: at sentence #40000, processed 547836 words, keeping 53444 word types
INFO : PROGRESS: at sentence #50000, processed 703812 words, keeping 60619 word types
INFO : PROGRESS: at sentence #60000, processed 845552 words, keeping 66467 word types
INFO : PROGRESS: at sentence #70000, processed 982392 words, keeping 75397 word types
INFO : PROGRESS: at sentence #80000, processed 11106

INFO : EPOCH 3 - PROGRESS: at 16.77% examples, 27770 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 6 more threads
INFO : EPOCH 3 - PROGRESS: at 32.56% examples, 53950 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 5 more threads
INFO : worker thread finished; awaiting finish of 4 more threads
INFO : worker thread finished; awaiting finish of 3 more threads
INFO : EPOCH 3 - PROGRESS: at 76.26% examples, 131214 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 2 more threads
INFO : worker thread finished; awaiting finish of 1 more threads
INFO : EPOCH 3 - PROGRESS: at 100.79% examples, 174609 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 0 more threads
INFO : EPOCH - 3 : training on 7707269 raw words (7558795 effective words) took 43.3s, 174604 effective words/s
INFO : EPOCH 4 - PROGRESS: at 16.77% examples, 27763 words/s, in_qsize -1, out_qsi

INFO : EPOCH 12 - PROGRESS: at 16.77% examples, 27394 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 6 more threads
INFO : EPOCH 12 - PROGRESS: at 32.56% examples, 53321 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 5 more threads
INFO : worker thread finished; awaiting finish of 4 more threads
INFO : EPOCH 12 - PROGRESS: at 61.99% examples, 104057 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 3 more threads
INFO : worker thread finished; awaiting finish of 2 more threads
INFO : worker thread finished; awaiting finish of 1 more threads
INFO : EPOCH 12 - PROGRESS: at 100.79% examples, 171997 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 0 more threads
INFO : EPOCH - 12 : training on 7707269 raw words (7558795 effective words) took 43.9s, 171992 effective words/s
INFO : EPOCH 13 - PROGRESS: at 16.77% examples, 27344 words/s, in_qsize -1, o

INFO : EPOCH - 20 : training on 7707269 raw words (7558795 effective words) took 43.9s, 172033 effective words/s
INFO : EPOCH 21 - PROGRESS: at 16.77% examples, 27410 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 6 more threads
INFO : EPOCH 21 - PROGRESS: at 32.56% examples, 53162 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 5 more threads
INFO : worker thread finished; awaiting finish of 4 more threads
INFO : worker thread finished; awaiting finish of 3 more threads
INFO : EPOCH 21 - PROGRESS: at 76.26% examples, 128988 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 2 more threads
INFO : worker thread finished; awaiting finish of 1 more threads
INFO : EPOCH 21 - PROGRESS: at 100.79% examples, 172135 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 0 more threads
INFO : EPOCH - 21 : training on 7707269 raw words (7558795 effective words) t

INFO : worker thread finished; awaiting finish of 0 more threads
INFO : EPOCH - 29 : training on 7707269 raw words (7558795 effective words) took 44.1s, 171364 effective words/s
INFO : EPOCH 30 - PROGRESS: at 16.77% examples, 27470 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 6 more threads
INFO : EPOCH 30 - PROGRESS: at 32.56% examples, 53124 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 5 more threads
INFO : worker thread finished; awaiting finish of 4 more threads
INFO : worker thread finished; awaiting finish of 3 more threads
INFO : EPOCH 30 - PROGRESS: at 76.26% examples, 129015 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 2 more threads
INFO : worker thread finished; awaiting finish of 1 more threads
INFO : EPOCH 30 - PROGRESS: at 100.79% examples, 171938 words/s, in_qsize -1, out_qsize 1
INFO : worker thread finished; awaiting finish of 0 more threads
INFO : EPOCH
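
The `ns_exponent` setting used in the parameters above shapes the distribution from which negative samples are drawn: each word's raw count is raised to that power before normalizing. The sketch below shows the effect on a toy vocabulary. The helper `negative_sampling_probs` and the counts are hypothetical, and the code illustrates only the arithmetic, not gensim's internal sampling table.

```python
def negative_sampling_probs(counts, ns_exponent=0.75):
    """Turn raw word counts into negative-sampling probabilities
    proportional to count ** ns_exponent."""
    weights = {word: count ** ns_exponent for word, count in counts.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}

# Toy counts, invented for illustration: one very frequent word,
# one mid-frequency word, one rare word.
toy_counts = {'et': 10000, 'urbs': 100, 'tubicines': 1}

for exponent in (1.0, 0.75, 0.0, -0.5):
    probs = negative_sampling_probs(toy_counts, exponent)
    print(exponent, {word: round(p, 4) for word, p in probs.items()})
```

An exponent of 1.0 samples in proportion to raw frequency, 0.0 samples all words equally, and a negative value favors rare words over frequent ones.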

In [7]:
LOG.info('Saving word2vec for latin library corpus')
latin_lib_vec.save('latin_library.{}.vec'.format( datetime.now().strftime('%Y.%m.%d')))

INFO : Saving word2vec for latin library corpus
INFO : saving Word2Vec object under latin_library.2019.06.01.vec, separately None
INFO : not storing attribute vectors_norm
INFO : not storing attribute cum_table
INFO : saved latin_library.2019.06.01.vec


In [8]:
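# One placeholder per argument, so both the corpus label and the date reach the filename.
params_filename = 'latin_library.vec.{}.{}.params'.format(
    corpus_characteristics, datetime.now().strftime('%Y.%m.%d'))
with open(params_filename, 'wt') as writer:
    json.dump(keyword_params, writer)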

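### Persist the word vectors to disk
The native `KeyedVectors.save` format used below is a gensim-specific pickle and can only be loaded from gensim. For vectors that load across platforms and languages, export with `save_word2vec_format`, as in the last cell of this notebook.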
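
The plain-text word2vec format written by `save_word2vec_format(..., binary=False)` can be read without gensim: a header line gives the vocabulary size and the dimensionality, and each following line holds a word and its space-separated vector components. The reader below is a minimal sketch under that assumption. The function name `read_word2vec_text` and the two sample vectors are hypothetical.

```python
import io

def read_word2vec_text(handle):
    """Parse word2vec text format from an open text handle into a
    dict mapping each word to a list of floats."""
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        line = line.rstrip()
        if not line:
            continue  # tolerate blank lines
        parts = line.split(' ')
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError('expected {} values for {!r}, got {}'.format(
                dim, word, len(values)))
        vectors[word] = values
    if len(vectors) != vocab_size:
        raise ValueError('header promised {} words, found {}'.format(
            vocab_size, len(vectors)))
    return vectors

# Tiny in-memory sample with made-up numbers, standing in for a real export.
sample = io.StringIO('2 3\npuella 0.1 0.2 0.3\nuir 0.4 0.5 0.6\n')
vectors = read_word2vec_text(sample)
```

For a real export, the handle would come from `open(path, 'rt', encoding='utf-8')`.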

In [9]:
word_vectors = latin_lib_vec.wv
the_filename = 'latin_library.{}.kv'.format(datetime.now().strftime('%Y.%m.%d'))
# word_vectors.save_word2vec_format(the_filename, binary=False)
word_vectors.save(the_filename)

INFO : saving Word2VecKeyedVectors object under latin_library.2019.06.01.kv, separately None
INFO : not storing attribute vectors_norm
INFO : saved latin_library.2019.06.01.kv


## Some QA

In [10]:
latin_lib_vec.wv.most_similar('puella')

INFO : precomputing L2-norms of word weight vectors


[('uirgo', 0.8270761966705322),
 ('puellae', 0.8253666162490845),
 ('coniuge', 0.8226670622825623),
 ('pudica', 0.8078868389129639),
 ('toro', 0.797211766242981),
 ('uxor', 0.7896655797958374),
 ('inuita', 0.7836523652076721),
 ('mulier', 0.78020840883255),
 ('formosa', 0.7798080444335938),
 ('nupta', 0.7792884111404419)]

In [11]:
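# Membership is tested on the KeyedVectors (`.wv`); testing the model object itself is deprecated.
if 'haec' in latin_lib_vec.wv:
    latin_lib_vec.wv.similar_by_word('haec')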



In [12]:
latin_lib_vec.wv.similar_by_word('uiolenter')

[('interemptis', 0.7972910404205322),
 ('obsessam', 0.7874257564544678),
 ('regionem', 0.7604485154151917),
 ('compellens', 0.7579995393753052),
 ('obsidione', 0.7536187171936035),
 ('urbem', 0.7497695088386536),
 ('populaturi', 0.7447846531867981),
 ('ciuibus', 0.7440009117126465),
 ('irruentes', 0.7433646321296692),
 ('nemine', 0.7394951581954956)]

In [13]:
the_filename = 'latin_library.{}.kv'.format( datetime.now().strftime('%Y.%m.%d'))
latin_word_vectors = KeyedVectors.load(the_filename, mmap='r')

INFO : loading Word2VecKeyedVectors object from latin_library.2019.06.01.kv
INFO : setting ignored attribute vectors_norm to None
INFO : loaded latin_library.2019.06.01.kv


In [14]:
latin_word_vectors.most_similar('uir')

INFO : precomputing L2-norms of word weight vectors


[('homo', 0.8561000227928162),
 ('ait', 0.7978405356407166),
 ('ecce', 0.7895816564559937),
 ('filius', 0.7844275236129761),
 ('itaque', 0.784336268901825),
 ('princeps', 0.7794846296310425),
 ('egressus', 0.7729367613792419),
 ('eo', 0.7721001505851746),
 ('ei', 0.7715548872947693),
 ('cuius', 0.770010232925415)]

In [15]:
latin_lib_vec.wv.most_similar('homo')

[('uir', 0.8561000227928162),
 ('mundus', 0.8154703378677368),
 ('factus', 0.8133982419967651),
 ('iustus', 0.8085343241691589),
 ('peccator', 0.8056743144989014),
 ('tuus', 0.7843465805053711),
 ('numquid', 0.7827498912811279),
 ('nemo', 0.7798494100570679),
 ('stultus', 0.7725919485092163),
 ('hominis', 0.7723689079284668)]

In [16]:
latin_lib_vec.wv.most_similar('canere', topn=10) 

[('liticines', 0.7464401721954346),
 ('tuba', 0.7431555390357971),
 ('tubicines', 0.7368534803390503),
 ('tibicines', 0.7339298129081726),
 ('intermitterent', 0.7267663478851318),
 ('cohortaretur', 0.7150977253913879),
 ('contionantem', 0.7062110304832458),
 ('tibiis', 0.694241464138031),
 ('fidibus', 0.6819440722465515),
 ('tripudiis', 0.6780459880828857)]

In [17]:
latin_lib_vec.wv.most_similar('piger', topn=10) 

[('nauita', 0.792907178401947),
 ('saeuus', 0.7922354340553284),
 ('lentus', 0.7881541848182678),
 ('durus', 0.7816176414489746),
 ('celer', 0.7715222239494324),
 ('patiens', 0.7708593606948853),
 ('rusticus', 0.76863032579422),
 ('subiectat', 0.7670884728431702),
 ('uelox', 0.7548305988311768),
 ('terit', 0.7522436380386353)]

In [18]:
latin_lib_vec.wv.most_similar('scandere')

[('descendere', 0.7483309507369995),
 ('conscendere', 0.7236344218254089),
 ('ire', 0.7135241627693176),
 ('ascendere', 0.7120531797409058),
 ('glomerare', 0.7041159868240356),
 ('tendere', 0.7014951109886169),
 ('insilit', 0.7014558911323547),
 ('subsidere', 0.69866544008255),
 ('attollere', 0.6982595920562744),
 ('subducere', 0.6909117698669434)]

In [19]:
latin_lib_vec.wv.most_similar('praelucere')

[('despectare', 0.7084847688674927),
 ('fatentes', 0.699190080165863),
 ('uisere', 0.6510680913925171),
 ('sternentem', 0.6478651762008667),
 ('properabam', 0.6439028382301331),
 ('tenderemus', 0.6410418152809143),
 ('serenâ', 0.6398051977157593),
 ('petrum', 0.6358171105384827),
 ('adstare', 0.6350216865539551),
 ('sagaci', 0.633321225643158)]

In [20]:
latin_lib_vec.wv.similar_by_word('ciuis')

[('frugalior', 0.736150324344635),
 ('necabatur', 0.7324003577232361),
 ('consularis', 0.7273312211036682),
 ('priuatus', 0.7270860075950623),
 ('clarissimus', 0.7148644924163818),
 ('optimus', 0.7088893055915833),
 ('improbissimos', 0.7083759307861328),
 ('senator', 0.7047805786132812),
 ('integerrimus', 0.7040387988090515),
 ('romanus', 0.7024648189544678)]

In [21]:
the_lemmatized_filename = 'latin_library.2019.03.07.kv' 
lem_lat_wordvec = KeyedVectors.load(the_lemmatized_filename, mmap='r')

INFO : loading Word2VecKeyedVectors object from latin_library.2019.03.07.kv
INFO : setting ignored attribute vectors_norm to None
INFO : loaded latin_library.2019.03.07.kv


In [22]:
lem_lat_wordvec.most_similar('puella')

INFO : precomputing L2-norms of word weight vectors


[('puer', 0.5749707818031311),
 ('iuuenis', 0.5151010751724243),
 ('uirgo', 0.49944934248924255),
 ('soror', 0.4523782730102539),
 ('mater', 0.45129919052124023),
 ('amare', 0.4469846189022064),
 ('uxor', 0.44040897488594055),
 ('maritus', 0.43844184279441833),
 ('at', 0.4366375207901001),
 ('coniunx', 0.4324739873409271)]
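
One rough way to compare the two models on `puella` is to look at the overlap of their top-10 neighbor lists, which avoids comparing raw similarity scores across differently trained models. The sketch below copies the two neighbor lists from the outputs above. The `jaccard` helper is hypothetical, and the result describes only this one query word.

```python
# Top-10 neighbors of 'puella', copied from the two outputs above.
unlemmatized_neighbors = ['uirgo', 'puellae', 'coniuge', 'pudica', 'toro',
                          'uxor', 'inuita', 'mulier', 'formosa', 'nupta']
lemmatized_neighbors = ['puer', 'iuuenis', 'uirgo', 'soror', 'mater',
                        'amare', 'uxor', 'maritus', 'at', 'coniunx']

def jaccard(first, second):
    """Size of the intersection divided by size of the union."""
    first, second = set(first), set(second)
    return len(first & second) / len(first | second)

shared = sorted(set(unlemmatized_neighbors) & set(lemmatized_neighbors))
overlap = jaccard(unlemmatized_neighbors, lemmatized_neighbors)
print(shared, round(overlap, 3))
```

Only `uirgo` and `uxor` appear in both lists. The non-lemmatized model returns several inflected forms (`puellae`, `coniuge`) that lemmatization would fold into a single entry.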

In [23]:
lem_lat_wordvec.most_similar('puer')

[('mater', 0.6095702052116394),
 ('iuuenis', 0.5838768482208252),
 ('puella', 0.5749707818031311),
 ('ille', 0.5479326248168945),
 ('ludere', 0.535244345664978),
 ('senex', 0.519166111946106),
 ('uirgo', 0.5188031196594238),
 ('at', 0.5050873756408691),
 ('ferre', 0.504031240940094),
 ('parare', 0.5006879568099976)]

In [24]:
'eccum' in lem_lat_wordvec

True

In [25]:
lem_lat_wordvec.most_similar('eccum')

[('eccam', 0.525143027305603),
 ('attat', 0.49384814500808716),
 ('popli', 0.44641977548599243),
 ('scibo', 0.414834201335907),
 ('surrupta', 0.41333162784576416),
 ('sycophantiam', 0.3958692252635956),
 ('quoia', 0.39479494094848633),
 ('uidulo', 0.39022475481033325),
 ('optume', 0.38850533962249756),
 ('erus', 0.38483574986457825)]

In [26]:
 
the_date = '2019.03.08'
# the_date = datetime.now().strftime('%Y.%m.%d')
the_filename = 'latin_library.{}.kv'.format(the_date)
latin_word_vectors = KeyedVectors.load(the_filename, mmap='r')

INFO : loading Word2VecKeyedVectors object from latin_library.2019.03.08.kv
INFO : loading vectors from latin_library.2019.03.08.kv.vectors.npy with mmap=r
INFO : setting ignored attribute vectors_norm to None
INFO : loaded latin_library.2019.03.08.kv


In [27]:
the_filename = 'latin_library.{}.txt'.format(the_date)
latin_word_vectors.save_word2vec_format(the_filename, binary=False)

INFO : storing 147262x600 projection weights into latin_library.2019.03.08.txt
