# Generating incorrect answer suggestions
Using word embeddings, we're going to find the words most similar to an answer.

## Importing the word embeddings
Unfortunately our beloved *spacy* does not offer a most-similar-words lookup. We'll use **gensim** for that.

Make sure you download the embeddings file first. The instructions are in the *README* in the **data** folder.

In [31]:
import gensim
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors

In [32]:
glove_file = '../data/embeddings/glove.6B.300d.txt'
tmp_file = '../data/embeddings/word2vec-glove.6B.300d.txt'

In [33]:
import os

if not os.path.isfile(glove_file):
    print("Glove embeddings not found. Please download and place them in the following path: " + glove_file)

In [34]:
from gensim.scripts.glove2word2vec import glove2word2vec

# Convert the GloVe text file to word2vec format, then load the converted vectors.
glove2word2vec(glove_file, tmp_file)
model = KeyedVectors.load_word2vec_format(tmp_file)


## Similar words examples

In [35]:
model.most_similar(positive=['koala'], topn=10)

[('probo', 0.5426342487335205),
 ('koalas', 0.4729689657688141),
 ('orangutan', 0.4557289779186249),
 ('grizzly', 0.418164998292923),
 ('marsupial', 0.39361125230789185),
 ('wombat', 0.3832378387451172),
 ('cuddly', 0.3804109990596771),
 ('kodiak', 0.37843799591064453),
 ('kade', 0.37742382287979126),
 ('kangaroo', 0.3612629473209381)]

It seems to be working fine. Though what *the f\*\*\** is a probo?

![image.png](https://i.gyazo.com/8e982abd6da0025cb985b388c07507a8.png)

Ok.

At this point we assume that we have our answer, the sentence it's in, the entire text, and the title. Let's explore some words.

*__Oxygen__ is a chemical element with symbol O and atomic number 8.*  

In [36]:
model.most_similar(positive=['oxygen'], topn=10)

[('hydrogen', 0.63267982006073),
 ('nitrogen', 0.6251460313796997),
 ('helium', 0.5435217022895813),
 ('nutrients', 0.5369840860366821),
 ('breathing', 0.5023170113563538),
 ('chlorine', 0.494693785905838),
 ('monoxide', 0.4911428987979889),
 ('dioxide', 0.4911196231842041),
 ('ammonia', 0.4907909035682678),
 ('carbon', 0.4836854636669159)]

That was easy. Let's try something more difficult.

*The oldest Portuguese university was first established in **Lisbon** before moving to Coimbra.*

In [37]:
model.most_similar(positive=['lisbon'], topn=10)

[('portugal', 0.6408252716064453),
 ('porto', 0.5835251212120056),
 ('benfica', 0.550417423248291),
 ('copenhagen', 0.5288482308387756),
 ('portuguese', 0.5266897678375244),
 ('madrid', 0.5219067335128784),
 ('brussels', 0.5173485279083252),
 ('oporto', 0.5147968530654907),
 ('prague', 0.5037161707878113),
 ('amsterdam', 0.501822292804718)]

Seems like we are getting closer to *football teams* rather than *cities with old universities*. Let's add some more words from the sentence.

In [38]:
model.most_similar(positive=['lisbon', 'university'], topn=10)

[('faculty', 0.5288036465644836),
 ('college', 0.5237010717391968),
 ('professor', 0.5193325877189636),
 ('graduate', 0.5135288834571838),
 ('universities', 0.5098860263824463),
 ('copenhagen', 0.5022274851799011),
 ('campus', 0.4942850172519684),
 ('prague', 0.48807740211486816),
 ('madrid', 0.4852182865142822),
 ('portugal', 0.47880998253822327)]

Great! But now the words are getting too close to *university*. It would be good if we could give more weight to the original answer.

I can do it manually by taking the words closest to the original answer and counting how many times each of them occurs in the results for the joint queries.
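Here is a rough sketch of that counting idea. The function name `rerank_by_overlap` is made up for this example, and the two lists are just the words from the outputs above with the scores dropped. It is one possible reading of the idea, not a tuned method.

```python
from collections import Counter

def rerank_by_overlap(answer_neighbours, joint_neighbour_lists):
    """Order the answer's own neighbours by how many joint queries also return them."""
    counts = Counter()
    for joint in joint_neighbour_lists:
        for word in set(joint):
            counts[word] += 1
    # sorted() is stable, so ties keep their original similarity order.
    return sorted(answer_neighbours, key=lambda word: -counts[word])

# Words copied from the outputs of the 'lisbon' and 'lisbon' + 'university' queries.
lisbon = ['portugal', 'porto', 'benfica', 'copenhagen', 'portuguese',
          'madrid', 'brussels', 'oporto', 'prague', 'amsterdam']
lisbon_university = ['faculty', 'college', 'professor', 'graduate', 'universities',
                     'copenhagen', 'campus', 'prague', 'madrid', 'portugal']

reranked = rerank_by_overlap(lisbon, [lisbon_university])
print(reranked[:4])  # ['portugal', 'copenhagen', 'madrid', 'prague']
```

Only neighbours of the original answer can end up as candidates, so words like *faculty* or *campus* never make it in, while the joint queries decide which of the answer's neighbours move to the front.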

In [39]:
model.most_similar(positive=['lisbon', 'coimbra'], topn=10)

[('porto', 0.6089159250259399),
 ('portugal', 0.6070288419723511),
 ('oporto', 0.5988742709159851),
 ('braga', 0.5796492099761963),
 ('benfica', 0.5514551401138306),
 ('leiria', 0.5170067548751831),
 ('aveiro', 0.4983532130718231),
 ('viseu', 0.491713285446167),
 ('évora', 0.4914955496788025),
 ('são', 0.4868908226490021)]

Using another city really makes a difference and shows some good candidates. I think it'll be a good idea to use words in the sentence that are next to the answer.
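A rough sketch of picking those neighbouring words: drop a few stop words, then take a small window around the answer. The name `context_words`, the `window` parameter, and the tiny stop-word set are assumptions made for this example and are not part of the notebook.

```python
# A deliberately tiny stop-word set, for illustration only.
STOP_WORDS = {'the', 'a', 'an', 'of', 'in', 'to', 'was', 'is', 'and', 'before', 'after', 'with'}

def context_words(sentence, answer, window=2):
    """Return up to `window` non-stop-words on each side of `answer`."""
    tokens = [t.strip('.,;:!?*').lower() for t in sentence.split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    answer = answer.lower()
    if answer not in tokens:
        return []
    i = tokens.index(answer)
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]

sentence = ("The oldest Portuguese university was first established "
            "in Lisbon before moving to Coimbra.")
neighbours = context_words(sentence, 'lisbon')
print(neighbours)  # ['first', 'established', 'moving', 'coimbra']
```

Each returned word could then be tried alongside the answer, e.g. `positive=['lisbon', 'coimbra']`, as long as it is in the vocabulary. Which context word actually helps still needs a second look.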

### Words with the same stem

In [40]:
model.most_similar(positive=['write'], topn=10)

[('writing', 0.6969848871231079),
 ('read', 0.6291235089302063),
 ('wrote', 0.6251993179321289),
 ('written', 0.6065736413002014),
 ('publish', 0.5670630931854248),
 ("'d", 0.5343195796012878),
 ('writes', 0.5341792702674866),
 ('tell', 0.5337096452713013),
 ('you', 0.5316604971885681),
 ('books', 0.5285096168518066)]

We could just remove all similar words that have the same stem as the original answer.
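A minimal sketch of that filter. To keep it self-contained, `crude_stem` is a hand-rolled suffix stripper standing in for a real stemmer (a Porter stemmer, for example). Both function names are hypothetical.

```python
def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ('ing', 'ten', 'es', 'ed', 's', 'e'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def drop_same_stem(answer, candidates):
    """Remove candidates that share the answer's (crude) stem."""
    stem = crude_stem(answer)
    return [word for word in candidates if crude_stem(word) != stem]

# Words from the 'write' output above, scores dropped.
similar = ['writing', 'read', 'wrote', 'written', 'publish',
           "'d", 'writes', 'tell', 'you', 'books']
filtered = drop_same_stem('write', similar)
print(filtered)  # ['read', 'wrote', 'publish', "'d", 'tell', 'you', 'books']
```

Note that *wrote* survives: irregular forms do not share a suffix-stripped stem, so catching them would take a lemmatizer rather than a stemmer.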

Additionally, the incorrect answers should be the same part of speech as the answer. For **write**, *read*, *publish*, and *tell* are good candidates, but *books* could easily be discarded for being a noun.
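The part-of-speech check could look roughly like this. The tagger is passed in as a plain callable so the sketch does not depend on any particular library. `keep_same_pos` and the hand-written `toy_tags` dictionary are assumptions for illustration, and a real tagger would replace the dictionary.

```python
def keep_same_pos(answer, candidates, pos_of):
    """Keep candidates whose part of speech matches the answer's.

    `pos_of` is any callable mapping a word to a tag (or None if unknown).
    """
    target = pos_of(answer)
    if target is None:
        return []
    return [word for word in candidates if pos_of(word) == target]

# Hand-written tags, for illustration only.
toy_tags = {'write': 'VERB', 'read': 'VERB', 'publish': 'VERB', 'tell': 'VERB',
            'books': 'NOUN', 'you': 'PRON'}

same_pos = keep_same_pos('write', ['read', 'publish', 'tell', 'you', 'books'], toy_tags.get)
print(same_pos)  # ['read', 'publish', 'tell']
```

Tagging a single word out of context is unreliable (*books* can also be a verb), so in practice it would probably be safer to tag the candidate inside the original sentence.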

### Numbers

In [41]:
model.most_similar(positive=['1944'], topn=10)

[('1943', 0.9581360816955566),
 ('1942', 0.9418259859085083),
 ('1941', 0.9256348609924316),
 ('1940', 0.8975383043289185),
 ('1945', 0.8817086219787598),
 ('1939', 0.8315709233283997),
 ('1946', 0.8234673142433167),
 ('1938', 0.7819805145263672),
 ('1937', 0.7764102220535278),
 ('1935', 0.7516503930091858)]

Not that bad. They seem to gravitate around the events of WW2. That seems better than random numbers or the closest numbers if we need a multiple-choice question. But I think it may be a better question if you have to input the number yourself and get a better score the closer you are to the correct answer.
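A scoring rule for that kind of free-input question could be as simple as the sketch below. The linear fall-off and the default `tolerance` of 10 are arbitrary choices made for this example.

```python
def numeric_score(guess, correct, tolerance=10):
    """Score in [0, 1]: 1 for an exact answer, falling linearly to 0 at `tolerance` away."""
    distance = abs(guess - correct)
    return max(0.0, 1.0 - distance / tolerance)

print(numeric_score(1944, 1944))  # 1.0
print(numeric_score(1939, 1944))  # 0.5
print(numeric_score(1900, 1944))  # 0.0
```

For years a tolerance of a decade may be reasonable; for other quantities it would likely need to scale with the size of the correct answer.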

### Names

In [42]:
model.most_similar(positive=['bush'], topn=10)

[('clinton', 0.7889922261238098),
 ('obama', 0.7570987939834595),
 ('gore', 0.6871948838233948),
 ('w.', 0.6750579476356506),
 ('cheney', 0.6621242761611938),
 ('mccain', 0.6613168716430664),
 ('barack', 0.6568867564201355),
 ('administration', 0.6468126773834229),
 ('george', 0.6463572382926941),
 ('kerry', 0.6004412174224854)]

In [43]:
model.most_similar(positive=['euclid'], topn=10)

[('postulate', 0.4412064254283905),
 ('archimedes', 0.43941453099250793),
 ('n.e.', 0.39649108052253723),
 ('pythagoras', 0.39116498827934265),
 ('aristotle', 0.3895653486251831),
 ('avenue', 0.38695403933525085),
 ('proclus', 0.3855825662612915),
 ('greektown', 0.3836863040924072),
 ('ptolemy', 0.38028305768966675),
 ('berea', 0.37123367190361023)]

In [44]:
model.most_similar(positive=['uri'], topn=10)

[('savir', 0.5212430357933044),
 ('geller', 0.47964778542518616),
 ('lubrani', 0.43920308351516724),
 ('avnery', 0.42534583806991577),
 ('zvi', 0.4224642217159271),
 ('dromi', 0.4120088219642639),
 ('likud', 0.41067302227020264),
 ('saguy', 0.408449649810791),
 ('yosef', 0.39055609703063965),
 ('moshe', 0.38498955965042114)]

I expected it to be a lot worse. Names of famous people get us other names of people with the same profession: US presidents and Greek mathematicians come up pretty easily.

But with some lesser-known figures, like a general in a certain battle, it wouldn't work. In those cases it would be good to find other names in the same text or, if we're working with a textbook, to use the names from other topics.
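As a very rough sketch of pulling other names out of the same text, we can collect capitalised words that do not start a sentence. This is a crude heuristic standing in for proper named-entity recognition. The function name `candidate_names` and the sample text are made up for the example.

```python
import re

def candidate_names(text, answer):
    """Collect capitalised words that do not start a sentence, excluding the answer."""
    names = []
    for sentence in re.split(r'(?<=[.!?])\s+', text):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        for word in words[1:]:  # skip the sentence-initial word
            if (word[0].isupper()
                    and word.lower() != answer.lower()
                    and word not in names):
                names.append(word)
    return names

# Sample text written for this example.
text = ("At Waterloo the army of Wellington met the army of Napoleon. "
        "Later Blucher arrived with reinforcements.")
names = candidate_names(text, 'Wellington')
print(names)  # ['Waterloo', 'Napoleon', 'Blucher']
```

The heuristic cannot tell people from places (*Waterloo* slips through), which is exactly where an entity recognizer that labels persons would help.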

# Function

We'll keep it simple. We just need *count* distractors (incorrect answers).

In [45]:
def generate_distractors(answer, count):
    answer = answer.lower()

    # Look up the words closest to the answer.
    try:
        closest_words = model.most_similar(positive=[answer], topn=count)
    except (KeyError, NameError):
        # KeyError: the word is not in the vocabulary.
        # NameError: the embeddings were never loaded, so `model` is undefined.
        return []

    # Keep only the words, dropping the similarity scores.
    distractors = [word for word, _ in closest_words]

    return distractors

In [46]:
generate_distractors('green', 9)

['red',
 'blue',
 'purple',
 'yellow',
 'brown',
 'bright',
 'dark',
 'orange',
 'black']

In [47]:
generate_distractors('bulgaria', 6)

['romania', 'hungary', 'ukraine', 'slovakia', 'bulgarian', 'macedonia']