whatever2vec 🤷‍♂️

By @christabella, @NightmareNyx and @smspillaz

We present a comprehensive comparison of word vector generation models: the skip-gram and continuous bag of words algorithms (Mikolov et al.), the AWD-LSTM language model (Merity et al.), and Node2Vec over a graph of words (Grover et al.). We also compare word senses using the approach given in Arora et al. All results are reported on the WikiText-103 dataset.

The results of our experiments can be found in the results directory.

Generating the word vectors

Word2Node2Vec vectors

Use something like:

python word2node2vec/word2node2vec.py --train data/wiki.train.tokens --valid data/wiki.valid.tokens --test data/wiki.test.tokens --save vecs.out
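
Under the hood, the idea (node2vec over a graph of words, following Grover et al.) is roughly: build a co-occurrence graph over the corpus, sample random walks, and train Word2Vec on the walks as if they were sentences. The following is a minimal, hypothetical sketch of that pipeline using gensim 4 and networkx, simplified to unbiased (DeepWalk-style) walks rather than node2vec's biased p/q walks; it is not the repository's script.

import random

import networkx as nx
from gensim.models import Word2Vec

def build_cooccurrence_graph(tokens, window=2):
    # Words are nodes; an edge joins any two words that co-occur within `window`.
    graph = nx.Graph()
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            graph.add_edge(word, tokens[j])
    return graph

def random_walks(graph, num_walks=10, walk_length=40, seed=0):
    # Unbiased random walks from every node (a simplification of node2vec's p/q walks).
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for node in graph.nodes():
            walk = [node]
            while len(walk) < walk_length:
                neighbours = list(graph.neighbors(walk[-1]))
                if not neighbours:
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

tokens = open("data/wiki.train.tokens").read().split()
walks = random_walks(build_cooccurrence_graph(tokens))
model = Word2Vec(walks, vector_size=100, window=5, min_count=5, sg=1, workers=4)
model.save("vecs.out")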

The saved vectors will be in gensim Word2Vec format; you can load them and save the underlying KeyedVectors for testing.
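
For example (a minimal sketch assuming gensim 4; the file names are just placeholders):

from gensim.models import KeyedVectors, Word2Vec

# Load the full model saved by word2node2vec.py and keep only the word vectors.
model = Word2Vec.load("vecs.out")
model.wv.save("vecs.kv")

# The lightweight KeyedVectors can then be loaded directly for benchmarking.
vectors = KeyedVectors.load("vecs.kv")
print(vectors.most_similar("king", topn=5))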

LM2Word2Vec vectors

First, train the language model:

pushd submodules/awd-lstm-lm;
python -u main.py --epochs 14 --nlayers 4 --emsize 400 --nhid 2500 --alpha 0 --beta 0 --dropoute 0 --dropouth 0.1 --dropouti 0.1 --dropout 0.1 --wdrop 0 --wdecay 0 --bptt 140 --batch_size 60 --optimizer adam --lr 1e-3 --data data/wikitext-103 --save WT103.12hr.QRNN.pt --when 12 --model QRNN
popd

This will also save the corpus/dictionary to submodules/awd-lstm-lm/corpus.[md5].data, where [md5] is the MD5 hash of the corpus path.
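
If you need to locate that cache file programmatically, its name can be reproduced from the corpus path (a small sketch; whether the relative or the absolute path gets hashed depends on how main.py was invoked):

import hashlib
import os

# corpus.[md5].data is named after the MD5 hash of the corpus path, as noted above.
corpus_path = "data/wikitext-103"  # the value passed to --data
digest = hashlib.md5(corpus_path.encode()).hexdigest()
print(os.path.join("submodules", "awd-lstm-lm", "corpus.{}.data".format(digest)))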

Then convert to gensim format with:

pushd submodules/awd-lstm-lm;
python -u lm2word2vec.py --model WT103.12hr.QRNN.pt --dictionary submodules/awd-lstm-lm/corpus.[md5].data --output WT103.12hr.QRNN.pt.vw
popd
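
Conceptually, the conversion pairs the language model's input embedding matrix with the dictionary's vocabulary and writes the result as gensim KeyedVectors. A simplified sketch of that idea is below (assuming gensim 4; the attribute names follow awd-lstm-lm's conventions and this is not a drop-in replacement for lm2word2vec.py):

import torch
from gensim.models import KeyedVectors

# awd-lstm-lm saves the model as a [model, criterion, optimizer] triple.
model, criterion, optimizer = torch.load("WT103.12hr.QRNN.pt", map_location="cpu")
corpus = torch.load("corpus.[md5].data")  # keep the real hash in the file name

# The encoder is the input embedding; its rows line up with the dictionary indices.
embeddings = model.encoder.weight.detach().cpu().numpy()
words = corpus.dictionary.idx2word

vectors = KeyedVectors(vector_size=embeddings.shape[1])
vectors.add_vectors(words, embeddings)
vectors.save("WT103.12hr.QRNN.pt.vw")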

Running tests on word vectors

Once word vectors have been trained, you can generate the benchmark results with word-benchmarks by running:

pushd submodules/word-benchmarks/tests
METHOD=some-label ./run-tests ../../PATH/TO/GENSIM/VECTORS
popd

The results should then be placed in results/some-label.
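
As a quick sanity check alongside the full suite, you can also probe the vectors directly with gensim (a minimal sketch; vecs.kv stands for whichever KeyedVectors file you saved, and the probe words should be adjusted to ones present in your vocabulary):

from gensim.models import KeyedVectors

vectors = KeyedVectors.load("vecs.kv")

# Nearest neighbours give a quick qualitative feel for the vectors.
print(vectors.most_similar("paris", topn=5))

# A related pair should score well above an unrelated one.
print(vectors.similarity("king", "queen"), vectors.similarity("king", "carrot"))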

Results

LM2Word2Vec

Analogy

model google-analogies.csv jair.csv msr.csv sat.csv semeval.csv
100.vw 0.6403398 0 0.6013075 0.25918242 0.30312118
150.vw 0.64039266 0 0.60108894 0.25918135 0.30323598
200.vw 0.5558439 0 0.52386457 0.22404711 0.24454139
250.vw 0.5342923 0 0.49946675 0.20587595 0.22498928
300.vw 0.5533238 0 0.5254152 0.20573309 0.239287
400.vw 0.47414735 0 0.45231628 0.17319292 0.19356392
50.vw 0.6427122 0 0.59071636 0.30830148 0.3142907

Clustering

model ap.csv battig.csv bless.csv essli-2008.csv
100.vw 0.5111549817629505 0.4701409492539494 0.609097674518625 0.5948063443028617
150.vw 0.48337606690742263 0.4726439058488439 0.553934936613524 0.4713177507990925
200.vw 0.511383587009481 0.4992238436761192 0.6126116789022762 0.5054455414190702
250.vw 0.534971147431235 0.4949624355341389 0.5926719306594618 0.5428752562899452
300.vw 0.4758289068682214 0.4697019245667439 0.5340228077481495 0.45504684670414264
400.vw 0.5413637749205396 0.4804087142051286 0.6253793226808986 0.5574793252585205
50.vw 0.5144323695914503 0.4380701059353572 0.5196268914210909 0.5470529637342717

Outlier Detection

model 8-8-8.csv wikisem500.csv
100.vw 0.78125 0.5843355119825708
150.vw 0.78125 0.5848023653906007
200.vw 0.890625 0.578145813881108
250.vw 0.671875 0.5966472144413321
300.vw 0.890625 0.5511367880485527
400.vw 0.890625 0.5744026221599752
50.vw 0.734375 0.6143921179582945

Similarity

model mc-30.csv men.csv mturk-287.csv mturk-771.csv rg-65.csv rw.csv semeval17.csv simverb-3500.csv verb-143.csv wordsim353-rel.csv wordsim353-sim.csv yp-130.csv
100.vw 0.23262541544040044 0.1987571170169611 0.2299824983457101 0.23243278729087308 0.24639078094409064 0.26130866001826786 0.21347329613235264 0.7104533505530943 0.29282450701225365 0.2349569548387237 0.16924773826374084 0.2913383059767576
150.vw 0.23345235339403153 0.19875746533272165 0.229978757863782 0.2322595934252116 0.2456465694996027 0.2613362571071438 0.21357433823017793 0.7102285577319915 0.29339505643267483 0.23503139389656919 0.16901092648445346 0.2919843622524005
200.vw 0.26101349561015763 0.24530788521412764 0.3517292108064354 0.3268703272746958 0.2492108909143851 0.3132374001824417 0.23401217016287978 0.7575145062073669 0.279708762641937 0.3188833066455769 0.24707455361022482 0.2816417728634981
250.vw 0.2688238610188166 0.25876522520390655 0.3633188989739486 0.3423926644681169 0.25916253110766413 0.32137162352681575 0.2482190000116365 0.7717744624336146 0.27800052317716767 0.33567863427891453 0.2678640785343489 0.28081431128428536
300.vw 0.25841380699078237 0.2503854629996047 0.3061668968330241 0.31148313830086743 0.25706328180203075 0.28782788833692075 0.24473696922308505 0.7548047950251773 0.281380884222095 0.3092829964800793 0.23902553946920194 0.28390958477373307
400.vw 0.2860361686358848 0.2847717287496975 0.3803760932231476 0.3766193355413912 0.2763803583177236 0.3354976072341171 0.27245529542688596 0.7979207980398926 0.2717867930499571 0.3620669955474114 0.2956181045926499 0.27530058174236466
50.vw 0.28006486112276713 0.23680829711709173 0.33807099371004945 0.28372421405925496 0.2554075000068316 0.3105925488978037 0.22681946402599595 0.7063345005552424 0.3300705308093438 0.26734758736550057 0.2216020986003687 0.3316088645703517

Word2Node2Vec

Analogy

model google-analogies.csv jair.csv msr.csv sat.csv semeval.csv
vectors.150.vw 0.07235756 0 0.07478042 0.03397527 0.040563706
vectors.300.vw 0.052819487 0 0.05203791 0.020160092 0.025098415
vectors.50.2.10.vw 0.12740257 0 0.12945199 0.06371317 0.07074753
vectors.50.2.vw 0.123734064 0 0.13750415 0.061421208 0.06865054
vectors.50.3.vw 0.1268055 0 0.13204797 0.06705497 0.07329087

Clustering

model ap.csv battig.csv bless.csv essli-2008.csv
vectors.150.vw 0.2233648166831707 0.1985093423365453 0.3444428148656265 0.36205929622347505
vectors.300.vw 0.18943355164112774 0.16159989347649026 0.24895186823020957 0.32783256064332633
vectors.50.2.10.vw 0.3149895087745146 0.25520657037106165 0.39650575475763133 0.460464658899246
vectors.50.2.vw 0.332559046268511 0.25667003695295193 0.37123533560082406 0.36601398667179713
vectors.50.3.vw 0.29625278203581673 0.24695485897783215 0.41748016494741397 0.39248065660716464

Outlier Detection

model 8-8-8.csv wikisem500.csv
vectors.150.vw 0.96875 0.4352611940298507
vectors.300.vw 0.875 0.40423773987206824
vectors.50.2.10.vw 0.84375 0.4597459132906894
vectors.50.2.vw 0.75 0.45090174129353233
vectors.50.3.vw 0.78125 0.45316275764036956

Similarity

model mc-30.csv men.csv mturk-287.csv mturk-771.csv rg-65.csv rw.csv semeval17.csv simverb-3500.csv verb-143.csv wordsim353-rel.csv wordsim353-sim.csv yp-130.csv
vectors.150.vw 0.45421816510955487 0.44709521411152875 0.5145976531481472 0.5605171253721606 0.43434111706912515 0.5567051749041152 0.43455471036608445 1.0247882937361021 0.4088670136987601 0.46214303871371687 0.4404123406294806 0.41382217590719983
vectors.300.vw 0.46932814587936517 0.4729951437201099 0.5358358766150964 0.5803959295878485 0.45797111880824426 0.5692399127521439 0.45567100489543655 1.0311851822871347 0.41602743672354603 0.48018405934459074 0.4559738808273539 0.41826603622590763
vectors.50.2.10.vw 0.45241435927436463 0.40961690992261834 0.47074437245647754 0.5274061602989087 0.3712963395046465 0.5388203667353184 0.408855536744672 0.9947995810571388 0.40386756843947014 0.41536981136041057 0.3885697232360154 0.40281220346270946
vectors.50.2.vw 0.4812938427690948 0.40939629644750747 0.4850890708806759 0.5369209036975529 0.4141098256138238 0.5404103529586431 0.4021569165385375 1.0083735446250555 0.4059512368544655 0.42295771195297255 0.403765660424735 0.409976069673373
vectors.50.3.vw 0.46251754138583223 0.4085646353832784 0.47842742245049 0.5353729284949493 0.3716052839430895 0.5423108534659385 0.40774824658356146 0.9976270813186842 0.39301769410270565 0.42282387197734844 0.39310650535093417 0.39275082177394316

Word2Vec Skip-Gram

Analogy

model google-analogies.csv jair.csv msr.csv sat.csv semeval.csv
sg.100.vw 0.56514007 0 0.5282249 0.26186305 0.28606668
sg.150.vw 0.564691 0 0.52989185 0.26213324 0.2860251
sg.200.vw 0.56421906 0 0.5302714 0.26190928 0.28512627
sg.250.vw 0.56673586 0 0.529467 0.26226997 0.28623167
sg.300.vw 0.56275225 0 0.5279672 0.26199442 0.28614104
sg.400.vw 0.56544155 0 0.5285368 0.26082078 0.2859096
sg.50.vw 0.5645637 0 0.52917844 0.26293534 0.28640532

Clustering

model ap.csv battig.csv bless.csv essli-2008.csv
sg.100.vw 0.5780351367799206 0.5308714982991621 0.6802891644376551 0.4949234069716011
sg.150.vw 0.5683078284839367 0.5346200255213555 0.7285094694865466 0.5573308757972358
sg.200.vw 0.5573185524378818 0.5220812239035447 0.724033832855196 0.5460514546980405
sg.250.vw 0.6219522503122408 0.5212961723940952 0.7118885556469524 0.5218275522815171
sg.300.vw 0.586389209644308 0.5277810616756028 0.7177956996403545 0.5589448207676889
sg.400.vw 0.6073188625468228 0.5425737692706525 0.7139109925636257 0.5654972845283717
sg.50.vw 0.5885475777287145 0.5398500645739408 0.7198277232926009 0.5628185114529359

Outlier Detection

model 8-8-8.csv wikisem500.csv
sg.100.vw 0.9375 0.7253855431061313
sg.150.vw 0.9375 0.7023548863990041
sg.200.vw 0.9375 0.7260546996576409
sg.250.vw 0.90625 0.7200813103018985
sg.300.vw 0.875 0.7260348583877996
sg.400.vw 0.875 0.7012087612822908
sg.50.vw 0.9375 0.7165931372549019

Similarity

model mc-30.csv men.csv mturk-287.csv mturk-771.csv rg-65.csv rw.csv semeval17.csv simverb-3500.csv verb-143.csv wordsim353-rel.csv wordsim353-sim.csv yp-130.csv
sg.100.vw 0.22457766694674894 0.16500178837932647 0.26203211413998617 0.257022740190531 0.22239366227961502 0.32538271345040903 0.1955761686579129 0.7753478213606123 0.22860458746599774 0.20954808050569368 0.17474445497594315 0.2300175332592084
sg.150.vw 0.22627560336490476 0.16503211605573692 0.26136119470188557 0.2560657149660557 0.22103912269839873 0.3252307220086531 0.1928489686396939 0.7747899146833436 0.22940083722675603 0.20732885167703913 0.1745678984656626 0.23073822200412933
sg.200.vw 0.23075113968650504 0.1650317048163712 0.26270622085262335 0.2577664801664301 0.22309861202194142 0.3247418034147028 0.19615788372946374 0.7749358436830007 0.23024944150069404 0.208367323123983 0.1730173220584283 0.23139012094002506
sg.250.vw 0.22375898566444719 0.164888768336835 0.26350590302445503 0.2562372860778658 0.22100013875502805 0.3246530143408281 0.19532924690824233 0.775262754113634 0.2267156203275635 0.20645792081889489 0.1725011530793747 0.2281422318330178
sg.300.vw 0.22865829121669137 0.1649957287545502 0.26296136197145004 0.25746302772061924 0.22340114461917143 0.3251233265563012 0.19348522284654554 0.7741839955347828 0.22931881706321997 0.20919915158007327 0.1737514746173912 0.2308629694665854
sg.400.vw 0.22645711617271105 0.1653129494190837 0.26247695273350874 0.2576026289056974 0.21955526261834 0.32554172958831495 0.19368215688308818 0.7740991174773549 0.22892832564668994 0.20884658048658813 0.1730231713099139 0.23035785070680653
sg.50.vw 0.22548519935806594 0.16495744984851526 0.26342656995723657 0.2570405497809845 0.22259370768528716 0.3258908928242255 0.1943813393124736 0.7732667210869834 0.22964877533303601 0.20870550771107696 0.17568209998568102 0.2308864334891049

Word2Vec Continuous Bag of Words

Analogy

model google-analogies.csv jair.csv msr.csv sat.csv semeval.csv
vectors.100.vw 0.56566936 0 0.5181261 0.22984047 0.24702403
vectors.150.vw 0.52832794 0 0.49000055 0.2047717 0.2210983
vectors.200.vw 0.50124794 0 0.46906146 0.18800929 0.20329203
vectors.250.vw 0.47829783 0 0.4507169 0.17471649 0.18987046
vectors.300.vw 0.46010032 0 0.43649906 0.16416313 0.17929642
vectors.400.vw 0.42723885 0 0.4128722 0.14949596 0.16291577
vectors.50.vw 0.6158859 0 0.5557572 0.27176368 0.28892532

Clustering

model ap.csv battig.csv bless.csv essli-2008.csv
vectors.100.vw 0.5186787217874035 0.44917405160498264 0.5985590153207353 0.6282470981385256
vectors.150.vw 0.501222591664049 0.44069583479131175 0.6359226825050992 0.5203114337922603
vectors.200.vw 0.4399885217561208 0.436792121638877 0.611799833990984 0.6318694693102543
vectors.250.vw 0.48929031636804654 0.43948700170741606 0.642456145628704 0.5496525656946352
vectors.300.vw 0.4611962171917175 0.43392883704347524 0.6421705161387511 0.5455943433060263
vectors.400.vw 0.41830478633387075 0.41853018105564266 0.5601155861242109 0.564174537727787
vectors.50.vw 0.49687971045889856 0.44207880376355224 0.6209848439434117 0.5717386581773791

Outlier Detection

model 8-8-8.csv wikisem500.csv
vectors.100.vw 0.90625 0.6781006847183317
vectors.150.vw 0.875 0.6730656707127295
vectors.200.vw 0.875 0.6812889044506691
vectors.250.vw 0.84375 0.6810914643635232
vectors.300.vw 0.875 0.6759257314036726
vectors.400.vw 0.90625 0.6899050731403672
vectors.50.vw 0.9375 0.6652499610955495

Similarity

model mc-30.csv men.csv mturk-287.csv mturk-771.csv rg-65.csv rw.csv semeval17.csv simverb-3500.csv verb-143.csv wordsim353-rel.csv wordsim353-sim.csv yp-130.csv
vectors.100.vw 0.23038650089899698 0.21923658708872895 0.36832776654834054 0.32773828016643874 0.23611912248914058 0.4062063027807598 0.21038553543043906 0.7938249489157057 0.26696976556510676 0.2859681463096322 0.23796280732127478 0.2714475272500744
vectors.150.vw 0.2506429349248608 0.23830748303716381 0.38604338547293915 0.35106527508638935 0.24412845624066315 0.42510492811087475 0.22434797749981097 0.8131645940380711 0.26086672615058837 0.31469117380763184 0.2624242294446318 0.2657948846791799
vectors.200.vw 0.2631210482954979 0.25658957226411133 0.39800221384865253 0.36797061117768404 0.26311753586192543 0.4391687420138501 0.23939983296989936 0.826432495547041 0.26535408727901555 0.33026859247781526 0.2763637401989984 0.27022124161932326
vectors.250.vw 0.2787220492303371 0.27069625033054506 0.4055108457351588 0.3823652068754352 0.27181525084720215 0.4505058808363015 0.2512847663258076 0.8377571090047548 0.2693417198378297 0.3450065469388701 0.29142332346638544 0.2745034476726674
vectors.300.vw 0.2833251005935171 0.28271744849836455 0.41482171355165515 0.3922742199723817 0.28333529571902294 0.45705053903310966 0.26065227397085156 0.8471877858203759 0.2706125113557847 0.353636019557464 0.30176180617172954 0.2757952979987057
vectors.400.vw 0.2953554815739393 0.3003277529611377 0.42721706318859665 0.4096689057298675 0.2931748999838646 0.47173839810948404 0.2756826099164692 0.860888173511247 0.27429627646192434 0.3686817812032121 0.31734611253417566 0.27954536144681846
vectors.50.vw 0.2124405572752158 0.1976250445358455 0.3404563358804081 0.2943356926996086 0.23218345549473393 0.3768920889936449 0.19993329021251863 0.761389157968194 0.27770870635410144 0.24648373577430718 0.20545761788385555 0.28180398214298946
