By @christabella, @NightmareNyx and @smspillaz
We present a comparison of word vectors generated by four models: skip-gram and continuous bag of words (Mikolov et al.), AWD-LSTM (Merity et al.), and Node2Vec over a graph of words (Grover et al.). We also compare word senses using the approach of Arora et al. All results are on the WikiText-103 dataset.
The results of our experiments can be found in the results directory.
To train the Node2Vec word vectors, use something like:
python word2node2vec/word2node2vec.py --train data/wiki.train.tokens --valid data/wiki.valid.tokens --test data/wiki.test.tokens --save vecs.out
The saved vectors will be in `gensim.Word2Vec` format. You can load them and save the underlying `KeyedVectors` for testing.
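A minimal sketch of that load-and-save step, assuming gensim is installed and that the model was written by the `--save` option above. The function name `export_keyed_vectors` and the output path `vecs.kv` are illustrative and not part of this repository.

```python
def export_keyed_vectors(model_path, output_path):
    """Load a saved gensim Word2Vec model and save only its KeyedVectors."""
    # Imported lazily so this helper can be defined without gensim present.
    from gensim.models import Word2Vec

    model = Word2Vec.load(model_path)  # full model, including training state
    model.wv.save(output_path)         # only the word-to-vector lookup
    return model.wv


# Example call (hypothetical paths; vecs.out is the --save target above):
# vectors = export_keyed_vectors("vecs.out", "vecs.kv")
# print(vectors.vector_size)
```

The saved file can later be reopened with `gensim.models.KeyedVectors.load`, which avoids loading the training state when only lookups are needed.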
First, train the language model:
pushd submodules/awd-lstm-lm;
python -u main.py --epochs 14 --nlayers 4 --emsize 400 --nhid 2500 --alpha 0 --beta 0 --dropoute 0 --dropouth 0.1 --dropouti 0.1 --dropout 0.1 --wdrop 0 --wdecay 0 --bptt 140 --batch_size 60 --optimizer adam --lr 1e-3 --data data/wikitext-103 --save WT103.12hr.QRNN.pt --when 12 --model QRNN
popd
This will also save the corpus/dictionary to `submodules/awd-lstm-lm/corpus.[md5].data`, where `[md5]` is an MD5 hash of the path of the corpus itself.
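To find the cached file without listing the directory, the name can be recomputed from the corpus path. The sketch below assumes the hash is taken over the exact string passed to `--data`, encoded as UTF-8; the helper name `corpus_cache_name` is illustrative.

```python
import hashlib


def corpus_cache_name(data_path):
    """Return the expected cache filename for a corpus at data_path."""
    # Assumption: the hash covers the path string, not the corpus contents.
    digest = hashlib.md5(data_path.encode()).hexdigest()
    return "corpus.{}.data".format(digest)


print(corpus_cache_name("data/wikitext-103"))
```

Because the hash covers the path string, `data/wikitext-103` and `data/wikitext-103/` would map to different cache files under this assumption.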
Then convert to gensim format with:
pushd submodules/awd-lstm-lm;
python -u lm2word2vec.py --model WT103.12hr.QRNN.pt --dictionary corpus.[md5].data --output WT103.12hr.QRNN.pt.vw
popd
Once word vectors have been trained, you can get the benchmark results on word-benchmarks by using:
pushd submodules/word-benchmarks/tests
METHOD=some-label ./run-tests ../../PATH/TO/GENSIM/VECTORS
popd
The results should be put into results/some-label
model | google-analogies.csv | jair.csv | msr.csv | sat.csv | semeval.csv |
---|---|---|---|---|---|
100.vw | 0.6403398 | 0 | 0.6013075 | 0.25918242 | 0.30312118 |
150.vw | 0.64039266 | 0 | 0.60108894 | 0.25918135 | 0.30323598 |
200.vw | 0.5558439 | 0 | 0.52386457 | 0.22404711 | 0.24454139 |
250.vw | 0.5342923 | 0 | 0.49946675 | 0.20587595 | 0.22498928 |
300.vw | 0.5533238 | 0 | 0.5254152 | 0.20573309 | 0.239287 |
400.vw | 0.47414735 | 0 | 0.45231628 | 0.17319292 | 0.19356392 |
50.vw | 0.6427122 | 0 | 0.59071636 | 0.30830148 | 0.3142907 |
model | ap.csv | battig.csv | bless.csv | essli-2008.csv |
---|---|---|---|---|
100.vw | 0.5111549817629505 | 0.4701409492539494 | 0.609097674518625 | 0.5948063443028617 |
150.vw | 0.48337606690742263 | 0.4726439058488439 | 0.553934936613524 | 0.4713177507990925 |
200.vw | 0.511383587009481 | 0.4992238436761192 | 0.6126116789022762 | 0.5054455414190702 |
250.vw | 0.534971147431235 | 0.4949624355341389 | 0.5926719306594618 | 0.5428752562899452 |
300.vw | 0.4758289068682214 | 0.4697019245667439 | 0.5340228077481495 | 0.45504684670414264 |
400.vw | 0.5413637749205396 | 0.4804087142051286 | 0.6253793226808986 | 0.5574793252585205 |
50.vw | 0.5144323695914503 | 0.4380701059353572 | 0.5196268914210909 | 0.5470529637342717 |
model | 8-8-8.csv | wikisem500.csv |
---|---|---|
100.vw | 0.78125 | 0.5843355119825708 |
150.vw | 0.78125 | 0.5848023653906007 |
200.vw | 0.890625 | 0.578145813881108 |
250.vw | 0.671875 | 0.5966472144413321 |
300.vw | 0.890625 | 0.5511367880485527 |
400.vw | 0.890625 | 0.5744026221599752 |
50.vw | 0.734375 | 0.6143921179582945 |
model | mc-30.csv | men.csv | mturk-287.csv | mturk-771.csv | rg-65.csv | rw.csv | semeval17.csv | simverb-3500.csv | verb-143.csv | wordsim353-rel.csv | wordsim353-sim.csv | yp-130.csv |
---|---|---|---|---|---|---|---|---|---|---|---|---|
100.vw | 0.23262541544040044 | 0.1987571170169611 | 0.2299824983457101 | 0.23243278729087308 | 0.24639078094409064 | 0.26130866001826786 | 0.21347329613235264 | 0.7104533505530943 | 0.29282450701225365 | 0.2349569548387237 | 0.16924773826374084 | 0.2913383059767576 |
150.vw | 0.23345235339403153 | 0.19875746533272165 | 0.229978757863782 | 0.2322595934252116 | 0.2456465694996027 | 0.2613362571071438 | 0.21357433823017793 | 0.7102285577319915 | 0.29339505643267483 | 0.23503139389656919 | 0.16901092648445346 | 0.2919843622524005 |
200.vw | 0.26101349561015763 | 0.24530788521412764 | 0.3517292108064354 | 0.3268703272746958 | 0.2492108909143851 | 0.3132374001824417 | 0.23401217016287978 | 0.7575145062073669 | 0.279708762641937 | 0.3188833066455769 | 0.24707455361022482 | 0.2816417728634981 |
250.vw | 0.2688238610188166 | 0.25876522520390655 | 0.3633188989739486 | 0.3423926644681169 | 0.25916253110766413 | 0.32137162352681575 | 0.2482190000116365 | 0.7717744624336146 | 0.27800052317716767 | 0.33567863427891453 | 0.2678640785343489 | 0.28081431128428536 |
300.vw | 0.25841380699078237 | 0.2503854629996047 | 0.3061668968330241 | 0.31148313830086743 | 0.25706328180203075 | 0.28782788833692075 | 0.24473696922308505 | 0.7548047950251773 | 0.281380884222095 | 0.3092829964800793 | 0.23902553946920194 | 0.28390958477373307 |
400.vw | 0.2860361686358848 | 0.2847717287496975 | 0.3803760932231476 | 0.3766193355413912 | 0.2763803583177236 | 0.3354976072341171 | 0.27245529542688596 | 0.7979207980398926 | 0.2717867930499571 | 0.3620669955474114 | 0.2956181045926499 | 0.27530058174236466 |
50.vw | 0.28006486112276713 | 0.23680829711709173 | 0.33807099371004945 | 0.28372421405925496 | 0.2554075000068316 | 0.3105925488978037 | 0.22681946402599595 | 0.7063345005552424 | 0.3300705308093438 | 0.26734758736550057 | 0.2216020986003687 | 0.3316088645703517 |
model | google-analogies.csv | jair.csv | msr.csv | sat.csv | semeval.csv |
---|---|---|---|---|---|
vectors.150.vw | 0.07235756 | 0 | 0.07478042 | 0.03397527 | 0.040563706 |
vectors.300.vw | 0.052819487 | 0 | 0.05203791 | 0.020160092 | 0.025098415 |
vectors.50.2.10.vw | 0.12740257 | 0 | 0.12945199 | 0.06371317 | 0.07074753 |
vectors.50.2.vw | 0.123734064 | 0 | 0.13750415 | 0.061421208 | 0.06865054 |
vectors.50.3.vw | 0.1268055 | 0 | 0.13204797 | 0.06705497 | 0.07329087 |
model | ap.csv | battig.csv | bless.csv | essli-2008.csv |
---|---|---|---|---|
vectors.150.vw | 0.2233648166831707 | 0.1985093423365453 | 0.3444428148656265 | 0.36205929622347505 |
vectors.300.vw | 0.18943355164112774 | 0.16159989347649026 | 0.24895186823020957 | 0.32783256064332633 |
vectors.50.2.10.vw | 0.3149895087745146 | 0.25520657037106165 | 0.39650575475763133 | 0.460464658899246 |
vectors.50.2.vw | 0.332559046268511 | 0.25667003695295193 | 0.37123533560082406 | 0.36601398667179713 |
vectors.50.3.vw | 0.29625278203581673 | 0.24695485897783215 | 0.41748016494741397 | 0.39248065660716464 |
model | 8-8-8.csv | wikisem500.csv |
---|---|---|
vectors.150.vw | 0.96875 | 0.4352611940298507 |
vectors.300.vw | 0.875 | 0.40423773987206824 |
vectors.50.2.10.vw | 0.84375 | 0.4597459132906894 |
vectors.50.2.vw | 0.75 | 0.45090174129353233 |
vectors.50.3.vw | 0.78125 | 0.45316275764036956 |
model | mc-30.csv | men.csv | mturk-287.csv | mturk-771.csv | rg-65.csv | rw.csv | semeval17.csv | simverb-3500.csv | verb-143.csv | wordsim353-rel.csv | wordsim353-sim.csv | yp-130.csv |
---|---|---|---|---|---|---|---|---|---|---|---|---|
vectors.150.vw | 0.45421816510955487 | 0.44709521411152875 | 0.5145976531481472 | 0.5605171253721606 | 0.43434111706912515 | 0.5567051749041152 | 0.43455471036608445 | 1.0247882937361021 | 0.4088670136987601 | 0.46214303871371687 | 0.4404123406294806 | 0.41382217590719983 |
vectors.300.vw | 0.46932814587936517 | 0.4729951437201099 | 0.5358358766150964 | 0.5803959295878485 | 0.45797111880824426 | 0.5692399127521439 | 0.45567100489543655 | 1.0311851822871347 | 0.41602743672354603 | 0.48018405934459074 | 0.4559738808273539 | 0.41826603622590763 |
vectors.50.2.10.vw | 0.45241435927436463 | 0.40961690992261834 | 0.47074437245647754 | 0.5274061602989087 | 0.3712963395046465 | 0.5388203667353184 | 0.408855536744672 | 0.9947995810571388 | 0.40386756843947014 | 0.41536981136041057 | 0.3885697232360154 | 0.40281220346270946 |
vectors.50.2.vw | 0.4812938427690948 | 0.40939629644750747 | 0.4850890708806759 | 0.5369209036975529 | 0.4141098256138238 | 0.5404103529586431 | 0.4021569165385375 | 1.0083735446250555 | 0.4059512368544655 | 0.42295771195297255 | 0.403765660424735 | 0.409976069673373 |
vectors.50.3.vw | 0.46251754138583223 | 0.4085646353832784 | 0.47842742245049 | 0.5353729284949493 | 0.3716052839430895 | 0.5423108534659385 | 0.40774824658356146 | 0.9976270813186842 | 0.39301769410270565 | 0.42282387197734844 | 0.39310650535093417 | 0.39275082177394316 |
model | google-analogies.csv | jair.csv | msr.csv | sat.csv | semeval.csv |
---|---|---|---|---|---|
sg.100.vw | 0.56514007 | 0 | 0.5282249 | 0.26186305 | 0.28606668 |
sg.150.vw | 0.564691 | 0 | 0.52989185 | 0.26213324 | 0.2860251 |
sg.200.vw | 0.56421906 | 0 | 0.5302714 | 0.26190928 | 0.28512627 |
sg.250.vw | 0.56673586 | 0 | 0.529467 | 0.26226997 | 0.28623167 |
sg.300.vw | 0.56275225 | 0 | 0.5279672 | 0.26199442 | 0.28614104 |
sg.400.vw | 0.56544155 | 0 | 0.5285368 | 0.26082078 | 0.2859096 |
sg.50.vw | 0.5645637 | 0 | 0.52917844 | 0.26293534 | 0.28640532 |
model | ap.csv | battig.csv | bless.csv | essli-2008.csv |
---|---|---|---|---|
sg.100.vw | 0.5780351367799206 | 0.5308714982991621 | 0.6802891644376551 | 0.4949234069716011 |
sg.150.vw | 0.5683078284839367 | 0.5346200255213555 | 0.7285094694865466 | 0.5573308757972358 |
sg.200.vw | 0.5573185524378818 | 0.5220812239035447 | 0.724033832855196 | 0.5460514546980405 |
sg.250.vw | 0.6219522503122408 | 0.5212961723940952 | 0.7118885556469524 | 0.5218275522815171 |
sg.300.vw | 0.586389209644308 | 0.5277810616756028 | 0.7177956996403545 | 0.5589448207676889 |
sg.400.vw | 0.6073188625468228 | 0.5425737692706525 | 0.7139109925636257 | 0.5654972845283717 |
sg.50.vw | 0.5885475777287145 | 0.5398500645739408 | 0.7198277232926009 | 0.5628185114529359 |
model | 8-8-8.csv | wikisem500.csv |
---|---|---|
sg.100.vw | 0.9375 | 0.7253855431061313 |
sg.150.vw | 0.9375 | 0.7023548863990041 |
sg.200.vw | 0.9375 | 0.7260546996576409 |
sg.250.vw | 0.90625 | 0.7200813103018985 |
sg.300.vw | 0.875 | 0.7260348583877996 |
sg.400.vw | 0.875 | 0.7012087612822908 |
sg.50.vw | 0.9375 | 0.7165931372549019 |
model | mc-30.csv | men.csv | mturk-287.csv | mturk-771.csv | rg-65.csv | rw.csv | semeval17.csv | simverb-3500.csv | verb-143.csv | wordsim353-rel.csv | wordsim353-sim.csv | yp-130.csv |
---|---|---|---|---|---|---|---|---|---|---|---|---|
sg.100.vw | 0.22457766694674894 | 0.16500178837932647 | 0.26203211413998617 | 0.257022740190531 | 0.22239366227961502 | 0.32538271345040903 | 0.1955761686579129 | 0.7753478213606123 | 0.22860458746599774 | 0.20954808050569368 | 0.17474445497594315 | 0.2300175332592084 |
sg.150.vw | 0.22627560336490476 | 0.16503211605573692 | 0.26136119470188557 | 0.2560657149660557 | 0.22103912269839873 | 0.3252307220086531 | 0.1928489686396939 | 0.7747899146833436 | 0.22940083722675603 | 0.20732885167703913 | 0.1745678984656626 | 0.23073822200412933 |
sg.200.vw | 0.23075113968650504 | 0.1650317048163712 | 0.26270622085262335 | 0.2577664801664301 | 0.22309861202194142 | 0.3247418034147028 | 0.19615788372946374 | 0.7749358436830007 | 0.23024944150069404 | 0.208367323123983 | 0.1730173220584283 | 0.23139012094002506 |
sg.250.vw | 0.22375898566444719 | 0.164888768336835 | 0.26350590302445503 | 0.2562372860778658 | 0.22100013875502805 | 0.3246530143408281 | 0.19532924690824233 | 0.775262754113634 | 0.2267156203275635 | 0.20645792081889489 | 0.1725011530793747 | 0.2281422318330178 |
sg.300.vw | 0.22865829121669137 | 0.1649957287545502 | 0.26296136197145004 | 0.25746302772061924 | 0.22340114461917143 | 0.3251233265563012 | 0.19348522284654554 | 0.7741839955347828 | 0.22931881706321997 | 0.20919915158007327 | 0.1737514746173912 | 0.2308629694665854 |
sg.400.vw | 0.22645711617271105 | 0.1653129494190837 | 0.26247695273350874 | 0.2576026289056974 | 0.21955526261834 | 0.32554172958831495 | 0.19368215688308818 | 0.7740991174773549 | 0.22892832564668994 | 0.20884658048658813 | 0.1730231713099139 | 0.23035785070680653 |
sg.50.vw | 0.22548519935806594 | 0.16495744984851526 | 0.26342656995723657 | 0.2570405497809845 | 0.22259370768528716 | 0.3258908928242255 | 0.1943813393124736 | 0.7732667210869834 | 0.22964877533303601 | 0.20870550771107696 | 0.17568209998568102 | 0.2308864334891049 |
model | google-analogies.csv | jair.csv | msr.csv | sat.csv | semeval.csv |
---|---|---|---|---|---|
vectors.100.vw | 0.56566936 | 0 | 0.5181261 | 0.22984047 | 0.24702403 |
vectors.150.vw | 0.52832794 | 0 | 0.49000055 | 0.2047717 | 0.2210983 |
vectors.200.vw | 0.50124794 | 0 | 0.46906146 | 0.18800929 | 0.20329203 |
vectors.250.vw | 0.47829783 | 0 | 0.4507169 | 0.17471649 | 0.18987046 |
vectors.300.vw | 0.46010032 | 0 | 0.43649906 | 0.16416313 | 0.17929642 |
vectors.400.vw | 0.42723885 | 0 | 0.4128722 | 0.14949596 | 0.16291577 |
vectors.50.vw | 0.6158859 | 0 | 0.5557572 | 0.27176368 | 0.28892532 |
model | ap.csv | battig.csv | bless.csv | essli-2008.csv |
---|---|---|---|---|
vectors.100.vw | 0.5186787217874035 | 0.44917405160498264 | 0.5985590153207353 | 0.6282470981385256 |
vectors.150.vw | 0.501222591664049 | 0.44069583479131175 | 0.6359226825050992 | 0.5203114337922603 |
vectors.200.vw | 0.4399885217561208 | 0.436792121638877 | 0.611799833990984 | 0.6318694693102543 |
vectors.250.vw | 0.48929031636804654 | 0.43948700170741606 | 0.642456145628704 | 0.5496525656946352 |
vectors.300.vw | 0.4611962171917175 | 0.43392883704347524 | 0.6421705161387511 | 0.5455943433060263 |
vectors.400.vw | 0.41830478633387075 | 0.41853018105564266 | 0.5601155861242109 | 0.564174537727787 |
vectors.50.vw | 0.49687971045889856 | 0.44207880376355224 | 0.6209848439434117 | 0.5717386581773791 |
model | 8-8-8.csv | wikisem500.csv |
---|---|---|
vectors.100.vw | 0.90625 | 0.6781006847183317 |
vectors.150.vw | 0.875 | 0.6730656707127295 |
vectors.200.vw | 0.875 | 0.6812889044506691 |
vectors.250.vw | 0.84375 | 0.6810914643635232 |
vectors.300.vw | 0.875 | 0.6759257314036726 |
vectors.400.vw | 0.90625 | 0.6899050731403672 |
vectors.50.vw | 0.9375 | 0.6652499610955495 |
model | mc-30.csv | men.csv | mturk-287.csv | mturk-771.csv | rg-65.csv | rw.csv | semeval17.csv | simverb-3500.csv | verb-143.csv | wordsim353-rel.csv | wordsim353-sim.csv | yp-130.csv |
---|---|---|---|---|---|---|---|---|---|---|---|---|
vectors.100.vw | 0.23038650089899698 | 0.21923658708872895 | 0.36832776654834054 | 0.32773828016643874 | 0.23611912248914058 | 0.4062063027807598 | 0.21038553543043906 | 0.7938249489157057 | 0.26696976556510676 | 0.2859681463096322 | 0.23796280732127478 | 0.2714475272500744 |
vectors.150.vw | 0.2506429349248608 | 0.23830748303716381 | 0.38604338547293915 | 0.35106527508638935 | 0.24412845624066315 | 0.42510492811087475 | 0.22434797749981097 | 0.8131645940380711 | 0.26086672615058837 | 0.31469117380763184 | 0.2624242294446318 | 0.2657948846791799 |
vectors.200.vw | 0.2631210482954979 | 0.25658957226411133 | 0.39800221384865253 | 0.36797061117768404 | 0.26311753586192543 | 0.4391687420138501 | 0.23939983296989936 | 0.826432495547041 | 0.26535408727901555 | 0.33026859247781526 | 0.2763637401989984 | 0.27022124161932326 |
vectors.250.vw | 0.2787220492303371 | 0.27069625033054506 | 0.4055108457351588 | 0.3823652068754352 | 0.27181525084720215 | 0.4505058808363015 | 0.2512847663258076 | 0.8377571090047548 | 0.2693417198378297 | 0.3450065469388701 | 0.29142332346638544 | 0.2745034476726674 |
vectors.300.vw | 0.2833251005935171 | 0.28271744849836455 | 0.41482171355165515 | 0.3922742199723817 | 0.28333529571902294 | 0.45705053903310966 | 0.26065227397085156 | 0.8471877858203759 | 0.2706125113557847 | 0.353636019557464 | 0.30176180617172954 | 0.2757952979987057 |
vectors.400.vw | 0.2953554815739393 | 0.3003277529611377 | 0.42721706318859665 | 0.4096689057298675 | 0.2931748999838646 | 0.47173839810948404 | 0.2756826099164692 | 0.860888173511247 | 0.27429627646192434 | 0.3686817812032121 | 0.31734611253417566 | 0.27954536144681846 |
vectors.50.vw | 0.2124405572752158 | 0.1976250445358455 | 0.3404563358804081 | 0.2943356926996086 | 0.23218345549473393 | 0.3768920889936449 | 0.19993329021251863 | 0.761389157968194 | 0.27770870635410144 | 0.24648373577430718 | 0.20545761788385555 | 0.28180398214298946 |
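The tables above can be compared programmatically. The following sketch parses one pipe-delimited table and reports the top-scoring model per benchmark. It assumes that a higher score is better, which this document does not state, and the helper names are illustrative.

```python
def parse_table(lines):
    """Parse a pipe-delimited table into {model: {benchmark: score}}."""
    rows = [[cell.strip() for cell in line.strip().strip("|").split("|")]
            for line in lines if line.strip()]
    header, body = rows[0], rows[2:]  # rows[1] is the ---|--- separator
    return {row[0]: dict(zip(header[1:], map(float, row[1:]))) for row in body}


def best_per_benchmark(table):
    """Return {benchmark: (model, score)} for the highest score per column."""
    best = {}
    for model, scores in table.items():
        for bench, score in scores.items():
            # Assumption: higher is better for every benchmark.
            if bench not in best or score > best[bench][1]:
                best[bench] = (model, score)
    return best


# Two rows copied from one of the tables above.
sample = """\
model | 8-8-8.csv | wikisem500.csv |
---|---|---|
sg.100.vw | 0.9375 | 0.7253855431061313 |
sg.300.vw | 0.875 | 0.7260348583877996 |
""".splitlines()

print(best_per_benchmark(parse_table(sample)))
```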