# Homework 5: Distributional semantics

This is due on **11/27 (11:55pm)**, submitted electronically. 

## How to do this problem set

Most of these questions require writing Python code and computing results, and the rest of them have textual answers. Write all the textual answers in this document, show the output of your experiments in this document, and implement the functions in `distsim.py`. Once you are finished, you will upload this `.ipynb` file and `distsim.py` to Moodle.

* When creating your final version of the problem set to hand in, please do a fresh restart and execute every cell in order.  Then you'll be sure it's actually right.  Make sure to press "Save"!

**Your Name:**

**List collaborators, and how you collaborated, here:** (see our [grading and policies page](http://people.cs.umass.edu/~brenocon/inlp2016/grading.html) for details on our collaboration policy).

* Rafael Lizarralde 

## Cosine Similarity

Recall that, where $i$ indexes over the context types, cosine similarity is defined as follows. $x$ and $y$ are both vectors of context counts (each for a different word), where $x_i$ is the count of context $i$.

$$cossim(x,y) = \frac{ \sum_i x_i y_i }{ \sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2} }$$

The nice thing about cosine similarity is that it is normalized: for non-negative count vectors like ours, the output is always between 0 and 1, no matter how large the counts are. One way to think of this is that cosine similarity is just, um, the cosine function, which has this property for non-negative $x$ and $y$ (in general, with negative entries allowed, it ranges from -1 to 1). Another way to think of it is to work through the situations of maximum and minimum similarity between two context vectors, starting from the definition above.

Note: a good way to understand the cosine similarity function is that the numerator cares about whether the $x$ and $y$ vectors are correlated. If $x$ and $y$ tend to have high values for the same contexts, the numerator tends to be big. The denominator can be thought of as a normalization factor: if all the values of $x$ are really large, for example, dividing by the square root of their sum-of-squares prevents the whole thing from getting arbitrarily large. In fact, dividing by both these things (aka their norms) means the whole thing can’t go higher than 1.
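To make the numerator and denominator concrete, here is a minimal sketch of the formula on toy sparse vectors stored as Python dicts. The function name `cosine_sparse_sketch`, the context keys, and the counts are all invented for illustration; this is not the `distsim.py` implementation, and returning 0.0 for an empty vector is just one possible convention.

```python
import math

def cosine_sparse_sketch(x, y):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    # Numerator: only contexts present in both vectors contribute.
    dot = sum(count * y[ctx] for ctx, count in x.items() if ctx in y)
    # Denominator: product of the two Euclidean norms.
    norm_x = math.sqrt(sum(c * c for c in x.values()))
    norm_y = math.sqrt(sum(c * c for c in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for empty vectors (an assumption)
    return dot / (norm_x * norm_y)

# Toy context counts (hypothetical, not taken from the data files).
dog = {"the_*": 4, "*_barked": 3}
cat = {"the_*": 8, "*_barked": 6}   # same direction as dog, twice as long
tax = {"income_*": 5}               # shares no context with dog
```

Here `cosine_sparse_sketch(dog, cat)` reaches the maximum because the two vectors point in the same direction even though their lengths differ, while `cosine_sparse_sketch(dog, tax)` hits the minimum for count vectors because no context is shared.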

## Question 1 (10 points)

See the file `nytcounts.university_cat_dog`, which contains context count vectors for three words: “dog”, “cat”, and “university”. These are immediate left and right contexts from a New York Times corpus. You can open the file in a text editor since it’s quite small.

Please complete `cossim_sparse(v1,v2)` in `distsim.py` to compute and display the cosine similarities between each pair of these words. Briefly comment on whether the relative similarities make sense.

Note that we’ve provided very simple code that tests the context count vectors from the data file.

In [8]:
import distsim; reload(distsim)

word_to_ccdict = distsim.load_contexts("nytcounts.university_cat_dog")
print "Cosine similarity between cat and dog" ,distsim.cossim_sparse(word_to_ccdict['cat'],word_to_ccdict['dog'])
print "Cosine similarity between cat and university" ,distsim.cossim_sparse(word_to_ccdict['cat'],word_to_ccdict['university'])
print "Cosine similarity between university and dog" ,distsim.cossim_sparse(word_to_ccdict['university'],word_to_ccdict['dog'])

file nytcounts.university_cat_dog has contexts for 3 words
Cosine similarity between cat and dog 0.966891672715
Cosine similarity between cat and university 0.660442421144
Cosine similarity between university and dog 0.659230248969


**The relative similarities make sense, as the metric gives a significantly higher value to the (cat, dog) pair as opposed to (cat, university) or (dog, university).**

## Question 2 (15 points)

Implement `show_nearest()`. 
Given a dictionary of word-context vectors, the context vector of a particular query word `w`, the words you want to exclude from the responses (in this question, the query word `w` itself), and the similarity metric you want to use (here, the `cossim_sparse` function you just implemented), `show_nearest()` finds the 20 words most similar to `w`. For each, display the other word and its similarity to the query word `w`.
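The ranking step described above can be sketched as follows. This is only an illustration under stated assumptions: `nearest_sketch` and the toy `overlap` similarity are hypothetical names, the vectors are invented, and the function returns the ranked pairs rather than printing them as `show_nearest()` does.

```python
def nearest_sketch(word_to_vec, query_vec, exclude, sim_func, k=20):
    """Return the k (word, similarity) pairs most similar to query_vec."""
    scored = [(word, sim_func(query_vec, vec))
              for word, vec in word_to_vec.items()
              if word not in exclude]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def overlap(x, y):
    """Toy similarity (plain dot product), standing in for a real metric."""
    return sum(c * y.get(ctx, 0) for ctx, c in x.items())

# Invented vectors, only to exercise the ranking logic.
toy = {"dog": {"a": 1, "b": 1},
       "cat": {"a": 1, "b": 2},
       "university": {"c": 3}}
ranked = nearest_sketch(toy, toy["dog"], {"dog"}, overlap, k=2)
```

Passing the similarity function as an argument is what lets the same ranking code be reused later with a different metric.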

To make sure it’s working, feel free to use the small `nytcounts.university_cat_dog` database as follows.

In [9]:
import distsim
word_to_ccdict = distsim.load_contexts("nytcounts.university_cat_dog")
distsim.show_nearest(word_to_ccdict, word_to_ccdict['dog'], set(['dog']), distsim.cossim_sparse)

file nytcounts.university_cat_dog has contexts for 3 words
cat : 0.966891672715
university : 0.659230248969


## Question 3 (20 points)

Explore similarities in `nytcounts.4k`, which contains context counts for about 4000 words in a sample of New York Times. The news data was lowercased and URLs were removed. The context counts are for the 2000 most common words in twitter, as well as the most common 2000 words in the New York Times. (But all context counts are from New York Times.) The context counts only contain contexts that appeared for more than one word. The file `vocab` contains the list of all terms in this data, along with their total frequency.
Choose **six** words. For each, show the output of `show_nearest()` and comment on whether the output makes sense. Comment on whether this approach to distributional similarity makes more or less sense for certain terms.
Four of your words should be:

 * a name (for example: person, organization, or location)
 * a common noun
 * an adjective
 * a verb

You may also want to try exploring further words that are returned from a most-similar list from one of these. You can think of this as traversing the similarity graph among words.

*Implementation note:* 
On my laptop, loading the data with the `load_contexts()` function takes several hundred MB of memory. If you don’t have enough memory available, your computer will get very slow because the OS will start swapping. If you have to use a machine without that much memory available, you can instead implement a streaming approach by using the `stream_contexts()` generator function to access the data; this lets you iterate through the data from disk, one vector at a time, without putting everything into memory. You can see its use in the loading function. (You could alternatively use a key-value or other type of database, but that’s too much work for this assignment.)
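One way the streaming idea could look is sketched below: keep only the current best `k` candidates in a small heap while iterating. The exact interface of `stream_contexts()` is not shown here, so the sketch assumes a generic iterable of `(word, vector)` pairs; `stream_nearest_sketch`, `toy_stream`, and `overlap` are hypothetical names and the data is invented.

```python
import heapq

def stream_nearest_sketch(vector_stream, query_vec, exclude, sim_func, k=20):
    """Top-k neighbours from an iterable of (word, vector) pairs."""
    heap = []  # min-heap of (similarity, word); the weakest candidate is heap[0]
    for word, vec in vector_stream:
        if word in exclude:
            continue
        item = (sim_func(query_vec, vec), word)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Better than the weakest kept candidate: swap it in.
            heapq.heapreplace(heap, item)
    return [(word, sim) for sim, word in sorted(heap, reverse=True)]

def toy_stream():
    """Stand-in for a generator that reads one vector at a time from disk."""
    yield "cat", {"a": 1, "b": 2}
    yield "university", {"c": 3}
    yield "dog", {"a": 1, "b": 1}
    yield "puppy", {"a": 2, "b": 2}

def overlap(x, y):
    """Toy similarity (plain dot product), standing in for a real metric."""
    return sum(c * y.get(ctx, 0) for ctx, c in x.items())

top = stream_nearest_sketch(toy_stream(), {"a": 1, "b": 1}, {"dog"}, overlap, k=2)
```

Memory use then depends on `k` and the size of a single vector rather than on the whole vocabulary.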

*Extra note:* 
You don’t need this, but for reference, our preprocessing scripts we used to create the context data are in the `preproc/` directory.

In [23]:
import distsim; reload(distsim)
word_to_ccdict = distsim.load_contexts("nytcounts.4k")
###Provide your answer below; perhaps in another cell so you don't have to reload the data each time

file nytcounts.4k has contexts for 3648 words


In [11]:
###Answer examples
distsim.show_nearest(word_to_ccdict, word_to_ccdict['london'],set(['london']),distsim.cossim_sparse)

paris : 0.969922701547
washington : 0.966413121743
baghdad : 0.960228810316
iraq : 0.956542488425
atlanta : 0.954037925324
2000 : 0.948294403121
chicago : 0.947112215726
philadelphia : 0.947008542809
europe : 0.945073821088
manhattan : 0.943055053744
2002 : 0.942927981567
1998 : 0.942282912202
2003 : 0.939722684448
1996 : 0.939628513887
1999 : 0.934767034178
1994 : 0.931474672799
miami : 0.930494514481
1997 : 0.927593121476
1995 : 0.926635164544
jail : 0.926115332412
florida : 0.922942736016


The output makes sense: it lists mostly countries, cities and years that are relevant to London in a news setting, which fits since all the context counts come from the New York Times.

In [33]:
distsim.show_nearest(word_to_ccdict, word_to_ccdict['eric'],set(['eric']),distsim.cossim_sparse)

david : 0.934898603227
peter : 0.929509857942
jonathan : 0.923215513894
andrew : 0.922229193309
robert : 0.913586443217
susan : 0.913104358468
chris : 0.912410311969
daniel : 0.910650063728
james : 0.90213572096
steven : 0.901382235893
adam : 0.898352454499
jim : 0.891580206908
jennifer : 0.887926912004
richard : 0.881836031176
steve : 0.879818282158
william : 0.8755367179
nancy : 0.875534147136
kevin : 0.874963882108
anthony : 0.86890606037
justin : 0.865493576771
matt : 0.865227025353


Distributional similarity returns other proper nouns, mainly names of people, because they share common contexts. This is as expected.

In [13]:
distsim.show_nearest(word_to_ccdict, word_to_ccdict['lawyers'],set(['lawyers']),distsim.cossim_sparse)

neighbors : 0.781064727475
doctors : 0.776508758064
clothes : 0.753363197835
books : 0.749142873592
writers : 0.745011556811
developers : 0.743908047393
musicians : 0.740890960915
teachers : 0.740734917438
proposals : 0.734189373102
democrats : 0.732364286141
ideas : 0.731754421717
parents : 0.730339486114
drivers : 0.729310356034
republicans : 0.728771597727
clothing : 0.727037686977
players : 0.726445096246
horses : 0.724292847834
students : 0.723455822114
talent : 0.722829138602
witnesses : 0.720642712864
fans : 0.719940600682


Almost all the terms in the list are common nouns, though most of them are not as relevant to lawyers as one would expect.

In [14]:
distsim.show_nearest(word_to_ccdict, word_to_ccdict['serious'],set(['serious']),distsim.cossim_sparse)

significant : 0.896394608995
successful : 0.874107623898
good : 0.87261355999
strong : 0.871730541815
powerful : 0.865286620101
rare : 0.862773035956
small : 0.860530784884
large : 0.858372185891
terrible : 0.856937087099
wonderful : 0.853802055021
beautiful : 0.8481101697
lovely : 0.84390752851
strange : 0.843154034069
sharp : 0.842517415649
healthy : 0.842224450257
specific : 0.84009620961
special : 0.838683100859
huge : 0.837980367336
different : 0.836788475073
simple : 0.836450307935
brief : 0.835471025749


Again, like the query word, the results are mostly adjectives, but not all of them are relevant to serious.

In [16]:
distsim.show_nearest(word_to_ccdict, word_to_ccdict['city'],set(['city']),distsim.cossim_sparse)

region : 0.937184233374
state : 0.933963707974
country : 0.921173376926
company : 0.921073569216
governor : 0.913328882314
world : 0.90418263635
sun : 0.903656128098
bride : 0.898366691474
government : 0.896816725136
band : 0.893184292775
pentagon : 0.889174003997
planet : 0.886205908968
nation : 0.884027868613
agency : 0.88362360939
film : 0.881995169291
legislature : 0.881328609258
bronx : 0.876154206439
heat : 0.872157801128
ball : 0.871896409723
panel : 0.871348598272
sport : 0.871164815239


Result as expected.

In [19]:
distsim.show_nearest(word_to_ccdict, word_to_ccdict['find'],set(['find']),distsim.cossim_sparse)

buy : 0.966328233154
get : 0.964068235843
make : 0.961741234641
send : 0.959847755132
take : 0.959722625802
produce : 0.959285007969
develop : 0.958554140172
build : 0.95656186285
provide : 0.954616275432
write : 0.951020358045
see : 0.948156108201
hold : 0.9478776166
maintain : 0.947026040175
create : 0.943193650106
bring : 0.938538746987
earn : 0.937658288596
fill : 0.936945845902
carry : 0.935415124589
wear : 0.934061056165
prevent : 0.933595815203
kill : 0.933212055104


The results are mostly verbs. The word is too generic to define an expectation.

**As noted in the output for most query words, distributional similarity tends to capture the word type accurately due to the shared context words. Depending on how you qualitatively define similarity, this emphasis on word type may or may not be what you expect. Personally, I would like to explore similarity in a broader sense based on word meaning. Distributional similarity would then not match my expectation well. Also, it has poor generalizability.**

## Question 4 (10 points)

In the next several questions, you'll examine similarities in trained word embeddings, instead of raw context counts.

See the file `nyt_word2vec.university_cat_dog`, which contains word embedding vectors pretrained by word2vec [1] for three words: “dog”, “cat”, and “university”. You can open the file in a text editor since it’s quite small.

Please complete `cossim_dense(v1,v2)` in `distsim.py` to compute and display the cosine similarities between each pair of these words.

*Implementation note:*
Notice that the inputs of `cossim_dense(v1,v2)` are numpy arrays. If you are not very familiar with the basic operations in numpy, you can find some examples in the basic operations section here:
https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

If you know how to use Matlab but haven't tried numpy before, the following link should be helpful:
https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html
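As a hedged illustration of the numpy operations involved, the sketch below computes cosine similarity for dense arrays with `np.dot` and `np.linalg.norm`. The name `cosine_dense_sketch` and the toy vectors are assumptions made for this example, not the expected `distsim.py` solution; note that dense vectors can have negative entries, so the result can be negative.

```python
import numpy as np

def cosine_dense_sketch(v1, v2):
    """Cosine similarity between two dense numpy vectors."""
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    if denom == 0:
        return 0.0  # convention chosen here for all-zero vectors (an assumption)
    return float(np.dot(v1, v2) / denom)

# Toy vectors (invented): orthogonal and opposite directions.
a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])
c = np.array([-3.0, 0.0])
```

With these toy vectors, `cosine_dense_sketch(a, b)` is 0 (orthogonal) and `cosine_dense_sketch(a, c)` is -1 (opposite directions), which cannot happen with non-negative count vectors.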

[1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS 2013.

In [23]:
import distsim; reload(distsim)
word_to_vec_dict = distsim.load_word2vec("nyt_word2vec.university_cat_dog")
print "Cosine similarity between cat and dog" ,distsim.cossim_dense(word_to_vec_dict['cat'],word_to_vec_dict['dog'])
print "Cosine similarity between cat and university" ,distsim.cossim_dense(word_to_vec_dict['cat'],word_to_vec_dict['university'])
print "Cosine similarity between university and dog" ,distsim.cossim_dense(word_to_vec_dict['university'],word_to_vec_dict['dog'])

word_to_vec_dict = distsim.load_word2vec("nyt_word2vec.university_cat_dog")
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['dog'], set(['dog']),distsim.cossim_dense)

Cosine similarity between cat and dog 0.827517295965
Cosine similarity between cat and university -0.205394745036
Cosine similarity between university and dog -0.190753135501
cat : 0.827517295965
university : -0.190753135501


## Question 5 (25 points)

Repeat the process you did in question 3, but now use the dense vectors from word2vec. Comment on whether the outputs make sense. Compare the outputs of `show_nearest()` on word2vec with the outputs on the sparse context vectors (so we suggest you use the same words as in question 3). Which method works better on the query words you chose? Please briefly explain why one method works better than the other in each case.

Notice that we use default parameters of word2vec in [gensim](https://radimrehurek.com/gensim/models/word2vec.html) to get word2vec word embeddings.

In [2]:
import distsim
word_to_vec_dict = distsim.load_word2vec("nyt_word2vec.4k")
###Provide your answer below

In [4]:
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['london'],set(['london']),distsim.cossim_dense)

paris : 0.742107827129
chicago : 0.676770811474
philadelphia : 0.59429769204
england : 0.587948398039
newark : 0.578417918674
boston : 0.571829045744
madrid : 0.563767976184
seattle : 0.55322589154
spain : 0.544267480338
australia : 0.540475332018
york : 0.53997553081
manhattan : 0.535027539424
chelsea : 0.525994762298
atlanta : 0.525622409271
el : 0.524066431248
washington : 0.514216746133
1997 : 0.50900778333
houston : 0.508965859676
fashion : 0.503772688122
1999 : 0.503085817215
1996 : 0.50188590366


Significantly fewer years relative to places than with the sparse vectors. This output makes more sense to me.

In [32]:
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['eric'],set(['eric']),distsim.cossim_dense)

brian : 0.894186570284
daniel : 0.886016873831
kevin : 0.875281443728
gary : 0.869643812807
jeff : 0.868130392477
jonathan : 0.865974899608
adam : 0.865754187497
chris : 0.863424805911
scott : 0.857978086682
jim : 0.85097179025
andrew : 0.850539452137
bruce : 0.841079454436
justin : 0.840832595618
steve : 0.840444411046
anthony : 0.838173516411
tim : 0.834469543451
larry : 0.833590575026
david : 0.833193625701
matt : 0.832877268667
patrick : 0.831428141526
jennifer : 0.829940531379


Fewer female names than with the sparse vectors. If I were interested in a particular domain, say the movie Titanic, I would not prefer this result because it is too generic.

In [6]:
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['lawyers'],set(['lawyers']),distsim.cossim_dense)

prosecutors : 0.775297156601
investigators : 0.678165745168
judges : 0.663011352717
witnesses : 0.646656372143
lawyer : 0.629047434705
agents : 0.615615261957
employees : 0.608172244932
aides : 0.604734421243
officers : 0.598752840552
officials : 0.598640344387
doctors : 0.596266981157
actions : 0.591824133768
papers : 0.591820696789
clients : 0.589970894172
testimony : 0.587679618257
lawmakers : 0.581707235336
opponents : 0.580216463071
criminal : 0.573577738669
authorities : 0.572877082566
colleagues : 0.571763378261
supporters : 0.571481622549


A better representation of similarity than the sparse output, with more emphasis on word meaning.

In [15]:
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['serious'],set(['serious']),distsim.cossim_dense)

positive : 0.699890292893
significant : 0.68606746792
critical : 0.658510985884
negative : 0.609582955657
crucial : 0.601757842358
specific : 0.594108835655
potential : 0.586777590992
dangerous : 0.571456581401
such : 0.561674828979
common : 0.561278142439
long-term : 0.557334708765
physical : 0.546333122761
certain : 0.545411746723
terrible : 0.541384495838
sexual : 0.536974606489
particular : 0.535600508496
risks : 0.533087121555
accurate : 0.5295905936
mental : 0.528871096944
important : 0.523539731647
difficult : 0.520717070132


The output has a more negative connotation than the sparse output, which goes well with the query word.

In [25]:
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['city'],set(['city']),distsim.cossim_dense)

town : 0.727960723145
state : 0.646071801271
nation : 0.620838949994
area : 0.618970739821
region : 0.600232409368
country : 0.593311950317
neighborhood : 0.576728956949
parks : 0.557922559453
district : 0.553389193187
county : 0.539186049261
community : 0.528192322121
newark : 0.519678589125
downtown : 0.51847462255
housing : 0.513368482927
connecticut : 0.506858190453
building : 0.49473467596
westchester : 0.493845284477
mayor : 0.48483016808
communities : 0.4810677938
residents : 0.473806226518
authority : 0.472970790837


Similar to the sparse output.

In [18]:
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['find'],set(['find']),distsim.cossim_dense)

get : 0.721049380383
imagine : 0.718554980425
see : 0.712686678472
bring : 0.705328513736
enjoy : 0.695330500537
prove : 0.692737795753
create : 0.687239780648
ignore : 0.686548066275
carry : 0.684808359136
appreciate : 0.680417827671
consider : 0.669722646314
reach : 0.664784903209
fill : 0.658821036432
understand : 0.649937019684
draw : 0.647413203766
handle : 0.646085541075
hide : 0.640297032571
provide : 0.639882642894
keep : 0.634017689362
seek : 0.633712626869
make : 0.633535005841


The results are mostly verbs and similar to the sparse output. The word is too generic to define an expectation.

**The dense representation with word2vec in general seems to capture word similarities better than the sparse representation. One disadvantage of this approach is that it ignores the domain or context and might need some fine-tuning to perform better. Disadvantages of the sparse approach are the increased space and computation time, though the impact of these could be reduced with efficient algorithms. But this approach still depends on the quality of the training data and generalizes more poorly.**

## Question 7 (15 points)
After you have word embeddings, one of the interesting things you can do is perform analogical reasoning tasks. In the following example, we provide code that finds the closest words to the vector $v_{king}-v_{man}+v_{woman}$ to fill in the blank in the question:

king : man = ____ : woman

Notice that the word2vec is trained in an unsupervised manner; it is impressive that it can apparently do an interesting type of reasoning.  (For a contrary opinion, see [Linzen 2016](http://www.aclweb.org/anthology/W/W16/W16-2503.pdf).)

Please come up with another analogical reasoning task (another triple of words), and output the answer using the same method. Comment on whether the output makes sense. If the output makes sense, explain why we can capture such a relation between words using an unsupervised algorithm. Where does the information come from? On the other hand, if the output does not make sense, propose an explanation for why the algorithm fails in this case.
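To see the vector arithmetic in isolation, here is a small sketch on hand-made two-dimensional "embeddings". Everything in it is an assumption for illustration: the function `analogy_sketch`, the toy vectors, and the reading of the two dimensions as roughly "royalty" and "maleness". Real word2vec vectors are learned and are not this clean.

```python
import numpy as np

def analogy_sketch(word_to_vec, a, b, c, k=1):
    """Rank words by cosine similarity to vec(a) - vec(b) + vec(c)."""
    target = word_to_vec[a] - word_to_vec[b] + word_to_vec[c]

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    # The three query words are excluded, as in the cell below.
    scored = [(w, cos(target, v)) for w, v in word_to_vec.items()
              if w not in (a, b, c)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hand-made toy vectors: [royalty, maleness] (purely hypothetical).
toy = {
    "king":  np.array([1.0, 1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.5]),
}
best = analogy_sketch(toy, "king", "man", "woman")
```

In this toy space the offset `king - man` isolates the "royalty" direction, so adding `woman` lands exactly on `queen`; whether a learned space behaves this way for a given triple is what the question asks you to examine.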


In [26]:
import distsim
king = word_to_vec_dict['king']
man = word_to_vec_dict['man']
woman = word_to_vec_dict['woman']
distsim.show_nearest(word_to_vec_dict,
                     king-man+woman,
                     set(['king','man','woman']),
                     distsim.cossim_dense)
###Provide your answer below

queen : 0.725028631986
princess : 0.577900103401
prince : 0.566962392417
lord : 0.530919391111
royal : 0.520203296864
mary : 0.497698146284
mama : 0.495469636832
daughter : 0.493757946566
singer : 0.489838082014
kim : 0.488354695243
elizabeth : 0.482484843405
girl : 0.477338294808
grandma : 0.476990726681
sister : 0.470304371825
mother : 0.469422028833
clark : 0.46824004741
wedding : 0.46233629356
husband : 0.456851188179
boyfriend : 0.447550574504
jesus : 0.438572115806
wolf : 0.428880090916


In [48]:
import distsim
blue = word_to_vec_dict['blue']
man = word_to_vec_dict['man']
woman = word_to_vec_dict['woman']
distsim.show_nearest(word_to_vec_dict,
                     blue-man+woman,
                     set(['blue','man','woman']),
                     distsim.cossim_dense)
###Provide your answer below

pink : 0.665804014903
leather : 0.612621614795
yellow : 0.611314615177
red : 0.61031100343
jeans : 0.602957248795
black : 0.553939521626
shorts : 0.553855150597
shirt : 0.550835184438
gray : 0.5483526129
dress : 0.5470958903
shirts : 0.543401892557
bar : 0.542705288537
tea : 0.53655740934
pants : 0.52330855032
tiny : 0.517971218307
boots : 0.510116822657
orange : 0.506494216352
wings : 0.503183061064
gold : 0.501217061816
bright : 0.499989254229
hat : 0.499151557984


**Write your response here:**

Though the output might hold true statistically, it looks like the model is absorbing stereotypes from the training data. It would be wrong to rely on what the training data represents when the query in question is not factual. <br>
earn-man = <>-man <br>
receive : 0.671115468114 (best result)<br>
earn-woman = <>-man <br>
lose : 0.601115468114 (best result)<br>

## Extra credit (up to 5 points)

Analyze word similarities with WordNet, and compare and contrast against the distributional similarity results. For a fair comparison, limit yourself to words in the `nytcounts.4k` vocabulary. First, calculate how many of the words are present in WordNet, at least based on what method you’re doing lookups with. (There is an issue that WordNet requires a part-of-speech for lookup, but we don’t have those in our data; you’ll have to devise a solution). 

Second, for the words you analyzed with distributional similarity above, now do the same with WordNet-based similarity as implemented in NLTK, as described <a href="http://www.nltk.org/howto/wordnet.html">here</a>, or search for “nltk wordnet similarity”. For a fair comparison, do the nearest-similarity ranking among the words in the `nytcounts.4k` vocabulary. You may use `path_similarity`, or any of the other similarity methods (e.g. `res_similarity` for Resnik similarity, which is one of the better ones). Describe what you are doing. Compare and contrast the words you get. Does WordNet give similar or very different results? Why?

## Extra credit (up to 5 points)

Investigate a few of the alternative methods described in [Linzen 2016](http://www.aclweb.org/anthology/W/W16/W16-2503.pdf) on the man/woman/king/queen and your new example.  What does this tell you about the legitimacy of analogical reasoning tasks?  How do you assess Linzen's arguments?

In [34]:
#IGNORE-A - Gender
import distsim
king = word_to_vec_dict['king']
man = word_to_vec_dict['man']
woman = word_to_vec_dict['woman']
distsim.show_nearest(word_to_vec_dict,
                     king+woman,
                     set(['king','man','woman']),
                     distsim.cossim_dense)

girl : 0.792813108092
boy : 0.790081430397
queen : 0.681013287404
princess : 0.641849033577
prince : 0.622121026431
girlfriend : 0.613567895305
child : 0.612858977402
cat : 0.612271393217
soldier : 0.600440739677
dog : 0.584798842347
lord : 0.580856424775
blonde : 0.580381252379
singer : 0.577198587828
mother : 0.57236750517
kid : 0.565451894067
doctor : 0.564419288952
boyfriend : 0.554783097888
cousin : 0.553546769661
son : 0.547906679666
person : 0.546853109787
daughter : 0.546254916769


In [38]:
#IGNORE-A - Capital
import distsim
london = word_to_vec_dict['london']
england = word_to_vec_dict['england']
iraq = word_to_vec_dict['iraq']
distsim.show_nearest(word_to_vec_dict,
                     london+iraq,
                     set(['london','england','iraq']),
                     distsim.cossim_dense)

afghanistan : 0.762437706802
baghdad : 0.682714040439
europe : 0.669691210578
germany : 0.668746773533
britain : 0.660849776008
russia : 0.648489995123
vietnam : 0.640047093237
iran : 0.628611152017
china : 0.62222874257
france : 0.614384360429
iraqi : 0.61407321472
japan : 0.611422551556
gaza : 0.611051792062
military : 0.595761963932
israel : 0.586398079713
india : 0.563168459511
region : 0.550751918723
ukraine : 0.545003325069
america : 0.540400151195
italy : 0.538410391165
pentagon : 0.533084366599


Not as good as ADD; the results are mostly neighbours of baghdad.

In [20]:
#ONLY-B
import distsim
king = word_to_vec_dict['king']
man = word_to_vec_dict['man']
woman = word_to_vec_dict['woman']
distsim.show_nearest(word_to_vec_dict,
                     woman,
                     set(['king','man','woman']),
                     distsim.cossim_dense)

girl : 0.86383601422
boy : 0.786581684421
child : 0.678692596742
person : 0.678675262709
doctor : 0.643260732956
soldier : 0.64201026603
herself : 0.612077346824
mother : 0.575203768415
patient : 0.574240780618
someone : 0.56777877143
dog : 0.559724830678
cat : 0.558030561205
guy : 0.556801357775
baby : 0.556227160654
men : 0.555089484198
kid : 0.553023348019
boyfriend : 0.541439868409
girlfriend : 0.540814530321
daughter : 0.537658999991
blonde : 0.536857529045
she : 0.524530234881


As noted by Linzen, this approach has a limitation and doesn't seem to work in this case.

In [12]:
#ADD-OPPOSITE
import distsim
king = word_to_vec_dict['king']
man = word_to_vec_dict['man']
woman = word_to_vec_dict['woman']
distsim.show_nearest(word_to_vec_dict,
                     -(king-man)+woman,
                     set(['king','man','woman']),
                     distsim.cossim_dense)

girl : 0.777172950568
boy : 0.72237968541
person : 0.721604772155
soldier : 0.643904819677
guy : 0.609361570981
doctor : 0.605059792372
someone : 0.599356573486
child : 0.595538385446
patient : 0.595495936886
kid : 0.529667791095
men : 0.527004685204
herself : 0.519762400929
baby : 0.510893089607
people : 0.498358675263
dog : 0.498163543648
teenagers : 0.495894233496
smile : 0.488049182473
naked : 0.4813396728
artist : 0.466727763772
cat : 0.466153077219
girls : 0.462756807988


mostly picks a neighbour of queen

In [47]:
#MULTIPLY - GENDER
from __future__ import division
import distsim
king = word_to_vec_dict['king']
man = word_to_vec_dict['man']
woman = word_to_vec_dict['woman']
x_best = None
x_best_val = -1
for x in word_to_vec_dict:
    if x !='king' and x !='man' and x !='woman':
        x_king = distsim.cossim_dense(word_to_vec_dict[x],king)
        x_woman = distsim.cossim_dense(word_to_vec_dict[x],woman)
        x_man = distsim.cossim_dense(word_to_vec_dict[x],man)
        x_this = (x_king*x_woman)/x_man
        #print x_king,x_woman,x_man
        #print x, x_this
        if x_this > x_best_val:
            x_best = x
            x_best_val = x_this
print x_best, x_best_val

racing 46.2687226871


The similarity between racing and man seems to have a very small absolute value, which makes the objective function value very high once we divide by it. Not as expected.
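A possible reason for the blow-up is that raw cosines can be close to zero or negative, so the ratio is unbounded. As far as I understand the multiplicative objective (3CosMul, Levy and Goldberg 2014) that Linzen's MULTIPLY refers to, the cosines are first shifted to the range [0, 1] and a small constant is added to the denominator. The sketch below follows that reading on toy vectors; the function names, the toy data, and `eps=0.001` are assumptions, and it has not been run on the `nyt_word2vec.4k` data.

```python
import numpy as np

def cos(u, v):
    """Plain cosine similarity between two dense vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def multiply_sketch(word_to_vec, pos1, pos2, neg, eps=0.001):
    """Best word under a shifted multiplicative objective."""
    def shifted(u, v):
        return (cos(u, v) + 1.0) / 2.0  # map [-1, 1] to [0, 1]

    best_word, best_score = None, -1.0
    for word, vec in word_to_vec.items():
        if word in (pos1, pos2, neg):
            continue
        score = (shifted(vec, word_to_vec[pos1]) * shifted(vec, word_to_vec[pos2])
                 / (shifted(vec, word_to_vec[neg]) + eps))  # eps avoids division by ~0
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score

# Hand-made toy vectors (hypothetical), reused only to exercise the function.
toy = {
    "king":  np.array([1.0, 1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "queen": np.array([1.0, -1.0]),
    "apple": np.array([-1.0, 0.5]),
}
word, score = multiply_sketch(toy, "king", "woman", "man")
```

Because every shifted cosine lies in [0, 1], the score can never exceed `1 / eps`, and a word no longer wins simply because its similarity to the denominator word is tiny in absolute value.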

In [44]:
#MULTIPLY - CAPITALS
#England:London = Iraq:Baghdad
import distsim
london = word_to_vec_dict['london']
england = word_to_vec_dict['england']
iraq = word_to_vec_dict['iraq']
x_best = None
x_best_val = -1
for x in word_to_vec_dict:
    if x !='london' and x !='england' and x !='iraq':
        x_london = distsim.cossim_dense(word_to_vec_dict[x],london)
        x_england = distsim.cossim_dense(word_to_vec_dict[x],england)
        x_iraq = distsim.cossim_dense(word_to_vec_dict[x],iraq)
        x_this = x_london*x_iraq/x_england
        #print x_london,x_iraq,x_england
        #print x, x_this
        if x_this > x_best_val:
            x_best = x
            x_best_val = x_this
print x_best, x_best_val

suppose 136.644600796


Not as expected.

Linzen's claims hold for IGNORE-A, ONLY-B and ADD-OPPOSITE, but MULTIPLY doesn't work for the examples above.