# Generating wrong answers from a given correct answer
None of the white papers I read proposed a way to generate multiple-choice questions. My idea is to use word embeddings to generate answers that are close to the correct answer and its context.

In [34]:
import gensim
#model = gensim.models.KeyedVectors.load_word2vec_format('data/embeddings/GoogleNews-vectors-negative300.bin', binary=True)

The word2vec dataset bricked my laptop (twice). A smaller pretrained embedding should suffice.

In [35]:
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Raw strings keep the Windows backslashes from being read as escape sequences.
glove_file = r'D:\ML\QG\QG\data\embeddings\glove.6B.300d.txt'
w2v_file = r'D:\ML\QG\QG\data\embeddings\word2vec-glove.6B.300d.txt'

# Convert the GloVe text file to word2vec format.
# CLI equivalent: python -m gensim.scripts.glove2word2vec --input <glove_file> --output <w2v_file>
glove2word2vec(glove_file, w2v_file)
model = KeyedVectors.load_word2vec_format(w2v_file)

In [39]:
model.most_similar(positive=['koala'], topn=10)

[('probo', 0.5426342487335205),
 ('koalas', 0.4729689359664917),
 ('orangutan', 0.4557289779186249),
 ('grizzly', 0.41816502809524536),
 ('marsupial', 0.39361128211021423),
 ('wombat', 0.3832378685474396),
 ('cuddly', 0.3804110288619995),
 ('kodiak', 0.37843799591064453),
 ('kade', 0.37742379307746887),
 ('kangaroo', 0.3612629175186157)]

It seems to be working fine. Though what is a probo?
![probo|512x397, 20%](https://www.thenakedscientists.com/sites/default/files/media/xProbo_3D_withtree_press.jpg.pagespeed.ic.XPBbt90-xd.jpg)![koala|512x397, 20%](https://www.teddybeartreasures.com.au/media/catalog/product/cache/1/image/650x650/d2bcc0a40912a9f6628bcdc83a25e9a6/k/a/kalypso_small.png)
Ok.

Once we have found a sentence worth turning into a question, and the phrase that is going to be the answer, we can look for similar phrases that could also fit the sentence.  

Let's see what we can do with the following sentence.  
*__Oxygen__ is a chemical element with symbol O and atomic number 8.*  

In [41]:
model.most_similar(positive=['oxygen'], topn=10)

[('hydrogen', 0.63267982006073),
 ('nitrogen', 0.6251459717750549),
 ('helium', 0.5435217022895813),
 ('nutrients', 0.5369840860366821),
 ('breathing', 0.5023170709609985),
 ('chlorine', 0.4946938157081604),
 ('monoxide', 0.4911428987979889),
 ('dioxide', 0.4911195933818817),
 ('ammonia', 0.49079084396362305),
 ('carbon', 0.4836854636669159)]

That was easy. Let's try something more difficult.

*the oldest portuguese university was first established in **lisbon** before moving to coimbra.*

In [43]:
model.most_similar(positive=['lisbon'], topn=10)

[('portugal', 0.6408252716064453),
 ('porto', 0.5835250616073608),
 ('benfica', 0.5504175424575806),
 ('copenhagen', 0.5288481712341309),
 ('portuguese', 0.5266897678375244),
 ('madrid', 0.5219067335128784),
 ('brussels', 0.5173484683036804),
 ('oporto', 0.5147969126701355),
 ('prague', 0.5037161707878113),
 ('amsterdam', 0.5018222332000732)]

Seems like we are drifting toward football teams and foreign capitals rather than cities that could have had the oldest university in the country. Let's add some more words from the sentence.

In [50]:
model.most_similar(positive=['lisbon', 'university'], topn=10)

[('faculty', 0.5288037061691284),
 ('college', 0.523701012134552),
 ('professor', 0.5193326473236084),
 ('graduate', 0.5135288834571838),
 ('universities', 0.5098860859870911),
 ('copenhagen', 0.5022274255752563),
 ('campus', 0.4942850172519684),
 ('prague', 0.4880773425102234),
 ('madrid', 0.4852182865142822),
 ('portugal', 0.4788099527359009)]

The words are now getting too close to *university*. It would be good if we could give more weight to the original answer.
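
One way to give the original answer more weight is to build the query vector by hand, as a weighted mean of the unit-normalized word vectors, and then rank the vocabulary by cosine similarity to it. The sketch below is only an illustration under assumptions: `weighted_most_similar` and the 2-D `toy_vectors` are made up so that it runs without loading GloVe, and the weights are arbitrary. Newer gensim releases may also accept `(key, weight)` tuples in `positive`, which is worth checking against the docs for the installed version.

```python
import math

def unit(v):
    """Scale a vector to length 1 so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def weighted_most_similar(vectors, weighted_words, topn=3):
    """Rank vocabulary words by cosine similarity to a weighted mean of the query words.

    vectors: dict word -> list of floats (hypothetical stand-in for the embedding model)
    weighted_words: list of (word, weight) pairs
    """
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word, weight in weighted_words:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += weight * x
    query = unit(query)
    query_words = {word for word, _ in weighted_words}
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in query_words  # never return the query words themselves
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up 2-D vectors: axis 0 is roughly "city", axis 1 is roughly "university".
toy_vectors = {
    'lisbon': [1.0, 0.1],
    'university': [0.1, 1.0],
    'porto': [0.9, 0.2],
    'faculty': [0.2, 0.9],
}

toward_answer = weighted_most_similar(toy_vectors, [('lisbon', 3.0), ('university', 1.0)])
toward_university = weighted_most_similar(toy_vectors, [('lisbon', 1.0), ('university', 3.0)])
print(toward_answer[0][0], toward_university[0][0])
```

With the real model the vectors would come from `model[word]` instead of `toy_vectors`, and the weight on the answer would have to be tuned by eye.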

I could do it manually by taking the 20 nearest neighbors of the original answer and checking whether they also show up in the neighbor lists of the combined words. Alternatively, I could extract the top 20 or even 50 neighbors of the original answer, add features such as occurrences with other words, and train a model...

In [51]:
model.most_similar(positive=['lisbon', 'coimbra'], topn=10)

[('porto', 0.6089159846305847),
 ('portugal', 0.6070287823677063),
 ('oporto', 0.5988742113113403),
 ('braga', 0.5796492099761963),
 ('benfica', 0.5514551401138306),
 ('leiria', 0.5170067548751831),
 ('aveiro', 0.4983532428741455),
 ('viseu', 0.491713285446167),
 ('évora', 0.4914955198764801),
 ('são', 0.4868907928466797)]

Using another city really makes a difference and shows some good candidates. I think it'll be a good idea to use a word in the sentence that is closest to the answer.
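
Picking that helper word can be sketched as a small search over the sentence words. This is a minimal illustration, not a tested part of the pipeline: `closest_context_word` is a hypothetical helper, and `toy_scores` holds made-up similarity values that stand in for `model.similarity(w1, w2)`.

```python
def closest_context_word(answer, context_words, similarity):
    """Return the context word most similar to the answer, or None if none can be scored."""
    best_word, best_score = None, float('-inf')
    for word in context_words:
        if word == answer:
            continue
        try:
            score = similarity(answer, word)
        except KeyError:
            # The word is missing from the embedding vocabulary, so skip it.
            continue
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Made-up similarity scores, only for illustration.
toy_scores = {
    ('oxygen', 'chemical'): 0.41,
    ('oxygen', 'element'): 0.33,
    ('oxygen', 'atomic'): 0.38,
    ('oxygen', 'number'): 0.12,
}

def toy_similarity(w1, w2):
    return toy_scores[(w1, w2)]

words = ['oxygen', 'chemical', 'element', 'symbol', 'atomic', 'number']
helper = closest_context_word('oxygen', words, toy_similarity)
print(helper)
```

With gensim, `model.similarity` could be passed as the `similarity` argument, and the result used as the second word in `model.most_similar(positive=[answer, helper])`.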

I suspect a couple more problems. 

In [58]:
model.most_similar(positive=['write'], topn=10)

[('writing', 0.6969849467277527),
 ('read', 0.6291235089302063),
 ('wrote', 0.6251993179321289),
 ('written', 0.6065735816955566),
 ('publish', 0.5670630931854248),
 ("'d", 0.5343195796012878),
 ('writes', 0.5341792702674866),
 ('tell', 0.5337096452713013),
 ('you', 0.5316603779792786),
 ('books', 0.5285096168518066)]

For our problem it would make more sense to work with the stems of the words. So after we gather the closest embeddings we should use stemming to remove the duplicates.
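
A possible shape for that deduplication step is sketched below. It keeps the first surface form for each stem, so the candidates stay readable, and drops inflections of the answer itself. `dedupe_by_stem` is a hypothetical helper, and `toy_stem` is a crude stand-in so the block runs without NLTK; a real stemmer such as `PorterStemmer().stem` would be passed in instead.

```python
def dedupe_by_stem(candidates, answer, stem):
    """Keep the first surface form per stem and drop forms that share the answer's stem."""
    seen = {stem(answer)}
    kept = []
    for word in candidates:
        key = stem(word)
        if key not in seen:
            seen.add(key)
            kept.append(word)
    return kept

def toy_stem(word):
    """Crude suffix stripper, only a stand-in for a real stemmer."""
    for suffix in ('ing', 'es', 's', 'e'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

candidates = ['writing', 'read', 'writes', 'reads', 'publish']
unique = dedupe_by_stem(candidates, 'write', toy_stem)
print(unique)
```

Irregular forms such as *wrote* and *written* have their own stems under a suffix-based stemmer, so they would survive this filter; a lemmatizer might handle them better.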

Another problem would be answers that are not of the same part of speech. If the correct answer is a verb, the incorrect answers should also be verbs. For **write**, *read*, *publish* and *tell* are good candidates, but *books* could easily be discarded for being a noun.
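
The filter itself is short once there is some way to look up a part of speech. In this sketch `same_pos_only` is a hypothetical helper and `toy_pos` is a hand-written tag table used only so the example runs on its own; a WordNet or tagger lookup would replace it in practice.

```python
def same_pos_only(answer, candidates, pos_of):
    """Keep candidates whose part of speech matches the answer's.

    pos_of: callable word -> tag, returning None for unknown words.
    """
    target = pos_of(answer)
    if target is None:
        # Nothing to compare against, so leave the candidates untouched.
        return list(candidates)
    return [word for word in candidates if pos_of(word) == target]

# Hand-written tags, only for illustration.
toy_pos = {'write': 'v', 'read': 'v', 'publish': 'v', 'tell': 'v', 'books': 'n', 'you': 'pron'}

kept = same_pos_only('write', ['read', 'publish', 'you', 'tell', 'books'], toy_pos.get)
print(kept)
```

Many words can be more than one part of speech, so a lookup that returns a single tag per word is a simplification.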

Let's carry on. How about numbers?

In [80]:
model.most_similar(positive=['1944'], topn=10)

[('1943', 0.9581360220909119),
 ('1942', 0.9418259859085083),
 ('1941', 0.9256348609924316),
 ('1940', 0.8975383043289185),
 ('1945', 0.8817087411880493),
 ('1939', 0.8315708637237549),
 ('1946', 0.8234671950340271),
 ('1938', 0.781980574131012),
 ('1937', 0.7764101028442383),
 ('1935', 0.7516504526138306)]

Seems like embeddings for numbers aren't that bad. I think they are better than random numbers or the numerically closest numbers, at least when there is an embedding for the number.
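
For numbers without an embedding, a fallback to nearby integers is one option. The sketch below assumes the answer is a non-negative integer written in digits; `number_distractors` and `lookup` are hypothetical names, and `lookup` replays the top three neighbors of `1944` from the output above (scores rounded) in place of a real `model.most_similar` call.

```python
def number_distractors(answer, neighbors, count=3):
    """Pick numeric distractors from embedding neighbors, falling back to nearby integers.

    neighbors: callable word -> list of (word, score); raises KeyError for unknown words.
    """
    try:
        found = [word for word, _ in neighbors(answer) if word.isdigit() and word != answer]
    except KeyError:
        found = []
    picked = found[:count]
    # Fallback: walk outwards from the answer (+1, -1, +2, -2, ...) until there are enough.
    value, step = int(answer), 1
    while len(picked) < count:
        for candidate in (value + step, value - step):
            text = str(candidate)
            if candidate >= 0 and text not in picked and len(picked) < count:
                picked.append(text)
        step += 1
    return picked

toy_neighbors = {'1944': [('1943', 0.958), ('1942', 0.942), ('1941', 0.926)]}

def lookup(word):
    return toy_neighbors[word]

print(number_distractors('1944', lookup))
print(number_distractors('73218', lookup))
```

With gensim, `neighbors` could be `lambda w: model.most_similar(positive=[w], topn=10)`, which raises `KeyError` for out-of-vocabulary words.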

What about names?

In [87]:
model.most_similar(positive=['bush'], topn=10)

[('clinton', 0.7889922261238098),
 ('obama', 0.7570987939834595),
 ('gore', 0.6871949434280396),
 ('w.', 0.6750580072402954),
 ('cheney', 0.6621242761611938),
 ('mccain', 0.6613168716430664),
 ('barack', 0.6568867564201355),
 ('administration', 0.6468127965927124),
 ('george', 0.6463572978973389),
 ('kerry', 0.6004412174224854)]

In [94]:
model.most_similar(positive=['euclid'], topn=10)

[('postulate', 0.4412064254283905),
 ('archimedes', 0.43941453099250793),
 ('n.e.', 0.39649108052253723),
 ('pythagoras', 0.39116495847702026),
 ('aristotle', 0.3895653486251831),
 ('avenue', 0.38695406913757324),
 ('proclus', 0.3855825662612915),
 ('greektown', 0.3836863040924072),
 ('ptolemy', 0.38028305768966675),
 ('berea', 0.37123364210128784)]

In [99]:
model.most_similar(positive=['atanasov'], topn=10)

[('atanas', 0.6365466713905334),
 ('fery', 0.4410214424133301),
 ('simeonov', 0.4386071562767029),
 ('atanassov', 0.4376071095466614),
 ('mladenov', 0.4347333312034607),
 ('sergeevich', 0.4314761757850647),
 ('neophytos', 0.4266960620880127),
 ('geleta', 0.419179230928421),
 ('vassilev', 0.41890764236450195),
 ('stoev', 0.414333313703537)]

I expected this to be a lot worse. Names attached to a well-known role, like president or Greek mathematician, come up pretty easily. But obviously it wouldn't work for less-known figures, like a general in a certain battle. In that case I think it would be better to train our own embeddings on a more specific dataset.

A bigger problem is that some of the answers contain multiple words. Looking at the answers from the SQuAD dataset, most of them are single words. Those that are not fall into a few groups:  

#### Some contain digits and some other words:  
*12 minutes after*  
*3 to 5 days*  

Though they could easily be handled by just messing with the digits alone. 
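
A minimal sketch of that idea: shift every integer in the phrase by the same offset and leave the words alone. `perturb_digits` and its default offsets are assumptions made for illustration, not something tuned on SQuAD.

```python
import re

def perturb_digits(phrase, offsets=(-2, 1, 3)):
    """Return copies of the phrase with every integer shifted by each offset."""
    variants = []
    for offset in offsets:
        def shift(match):
            # Clamp at 1 so a small number never turns into zero or a negative value.
            return str(max(1, int(match.group()) + offset))
        variant = re.sub(r'\d+', shift, phrase)
        if variant != phrase and variant not in variants:
            variants.append(variant)
    return variants

print(perturb_digits('12 minutes after'))
print(perturb_digits('3 to 5 days'))
```

Shifting all numbers by the same amount keeps ranges such as *3 to 5 days* in order. Fractions and decimals would be treated as separate integers here, so they would need their own handling.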

#### Some contain an additional descriptive word
*__chinese__ characters*  
*__german__ language*  
*have gained much __knowledge__*  

In those cases the answers could just be different names of languages and characters. But I think those additional words shouldn't even be in the answer.

#### Some are just long names of institutions
*the william allen white school of journalism and mass communications*

I guess these are similar to regular names of people, which need some special care because they rely heavily on context.

#### And some are just... hard
*over 20 years 1/5 of the women changed their sexual identity at least once*

## Now let's try some of the proposed techniques
I'll assume we have a sentence and a single word as an answer.

In [131]:
sentence = 'oxygen is a chemical element with symbol O and atomic number 8.'
answer = 'oxygen'

### Stemming
First we'll stem the sentence and answer, assuming it hasn't been done already.

In [132]:
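from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Note: stem() treats its argument as a single token, so passing the whole
# sentence only lowercases it and leaves the individual words unstemmed.
# The later cells rely on that, since they look the words up in the embedding
# vocabulary. Per-word stems would be [stemmer.stem(w) for w in sentence.split()].
sentence = stemmer.stem(sentence)
answer = stemmer.stem(answer)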

print(sentence)
print(answer)

oxygen is a chemical element with symbol o and atomic number 8.
oxygen


In [150]:
#Just to check it's working
print(stemmer.stem('writing'))
print(stemmer.stem('Koala'))
print(stemmer.stem('lisbon'))
print(stemmer.stem('amsterdam'))
print(stemmer.stem('portugal'))

write
koala
lisbon
amsterdam
portug


### Part of speech

In [141]:
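from nltk.corpus import wordnet as wn
words = ['write', 'oxygen', 'lisbon']

for w in words:
    # Use the part of speech of the first WordNet synset.
    # A word with no synsets would raise an IndexError here.
    pos = wn.synsets(w)[0].pos()
    print(w, ":", pos)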

write : v
oxygen : n
lisbon : n


### Removing stopwords from sentence

In [194]:
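from nltk.corpus import stopwords

# Build the stopword set once instead of reloading the list for every word.
stop_words = set(stopwords.words('english'))

# Note: the answer is not removed here, so 'oxygen' stays in the list and gets
# paired with itself in the cells below. Add `and word != answer` to the
# filter to drop it.
word_list = sentence.replace('.', '').split()

filtered_words = [word for word in word_list if word not in stop_words]

print(filtered_words)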

['oxygen', 'chemical', 'element', 'symbol', 'atomic', 'number', '8']


### Extracting closest embeddings

In [185]:
topEmbeddings = model.most_similar(positive=[answer], topn=30)

In [186]:
embeddings = []
for word, similarity in topEmbeddings:
    # Threshold: keep only neighbors whose similarity to the answer is above 0.45
    if similarity > 0.45:
        stem = stemmer.stem(word)
        # Different neighbors can share a stem, so keep each stem only once
        if stem not in embeddings:
            embeddings.append(stem)

print(embeddings)

['hydrogen', 'nitrogen', 'helium', 'nutrient', 'breath', 'chlorin', 'monoxid', 'dioxid', 'ammonia', 'carbon', 'liquid', 'hemoglobin', 'tissu', 'vapor', 'respir', 'atom', 'molecul', 'oxid', 'hypoxia', 'sulfur', 'phosphoru', 'photosynthesi']


In [195]:
# For each stem in `embeddings`, count for how many sentence words it also
# appears among the neighbors of (answer + sentence word)
embeddingsOccurences = [0] * len(embeddings)

for sentenceWord in filtered_words:
    sentenceWordEmbeddings = model.most_similar(positive=[answer, sentenceWord], topn=30)
    stemmedEmbeddings = []
    for word, similarity in sentenceWordEmbeddings:
        # Same threshold as before: similarity must be above 0.45
        if similarity > 0.45:
            stem = stemmer.stem(word)
            # Different neighbors can share a stem, so keep each stem only once
            if stem not in stemmedEmbeddings:
                stemmedEmbeddings.append(stem)

    for stem in stemmedEmbeddings:
        # Count the stem only if it is also a neighbor of the answer on its own
        if stem in embeddings:
            embeddingsOccurences[embeddings.index(stem)] += 1

print(embeddingsOccurences)

[6, 5, 3, 2, 1, 2, 2, 2, 2, 4, 3, 1, 2, 2, 1, 3, 3, 2, 1, 2, 2, 1]


In [207]:
combined = list(zip(embeddings, embeddingsOccurences))

In [208]:
sorted(combined, key=lambda x: x[1], reverse=True)

[('hydrogen', 6),
 ('nitrogen', 5),
 ('carbon', 4),
 ('helium', 3),
 ('liquid', 3),
 ('atom', 3),
 ('molecul', 3),
 ('nutrient', 2),
 ('chlorin', 2),
 ('monoxid', 2),
 ('dioxid', 2),
 ('ammonia', 2),
 ('tissu', 2),
 ('vapor', 2),
 ('oxid', 2),
 ('sulfur', 2),
 ('phosphoru', 2),
 ('breath', 1),
 ('hemoglobin', 1),
 ('respir', 1),
 ('hypoxia', 1),
 ('photosynthesi', 1)]

In [209]:
print(embeddings)

['hydrogen', 'nitrogen', 'helium', 'nutrient', 'breath', 'chlorin', 'monoxid', 'dioxid', 'ammonia', 'carbon', 'liquid', 'hemoglobin', 'tissu', 'vapor', 'respir', 'atom', 'molecul', 'oxid', 'hypoxia', 'sulfur', 'phosphoru', 'photosynthesi']


The order sorted by occurrences seems a lot better than the original one.


