# Question generation

## Research

After reading a few papers and finding real applications, a LinkedIn group on question generation, and many Quora questions and answers, it seems that question generation is not unheard of.

I learned some tricks, but I also came up with a method for generating multiple-choice questions that I didn't find anywhere else.

### Papers
Difficulty Controllable Question Generation for Reading Comprehension - https://arxiv.org/abs/1807.03586  
Identifying Where to Focus in Reading Comprehension for Neural Question Generation - https://aclanthology.coli.uni-saarland.de/papers/D17-1219/d17-1219  
SQuAD: 100,000+ Questions for Machine Comprehension of Text - https://arxiv.org/abs/1606.05250  
Know What You Don't Know: Unanswerable Questions for SQuAD - https://arxiv.org/abs/1806.03822  

### Applications
Questo AI - https://questo.ai/index.html - In development  
Quillionz - https://blog.quillionz.com/ - Generates quizzes for teachers. The generated questions are hand-curated before being provided.  

## Data
I chose the Stanford Question Answering Dataset (SQuAD). There are currently two versions of it.  

The first version contains 100k questions written by well-paid Amazon Mechanical Turk workers. The questions cover 536 Wikipedia articles.

The second version adds 50k more questions, but these are deliberately unanswerable: the answer cannot be found in the text. They were added so that neural nets trying to answer the questions learn to recognise when the information is insufficient instead of guessing.  

Obviously I cannot use the additional 50k questions for my purposes, so I used SQuAD v1.

I just found another dataset that could be used in addition to SQuAD: https://www.kaggle.com/rtatman/questionanswer-dataset
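For orientation, here is a minimal sketch of how the SQuAD v1.1 JSON can be walked to get at the contexts, questions and answers. The helper names `load_squad` and `iter_squad_examples`, the file name in the comment, and the tiny hand-written sample are assumptions made for illustration; only the nesting (`data` → `paragraphs` → `qas` → `answers`) comes from the dataset format.

```python
import json

def load_squad(path):
    """Read a SQuAD-style JSON file into a dict (hypothetical helper)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def iter_squad_examples(squad):
    """Yield (context, question, answer_text, answer_start) from a SQuAD v1.1-style dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield context, qa["question"], answer["text"], answer["answer_start"]

# Tiny hand-written sample in the same shape as the real file (not real SQuAD data).
context = "Oxygen is a chemical element with symbol O and atomic number 8."
sample = {
    "version": "1.1",
    "data": [{
        "title": "Oxygen",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "example-1",
                "question": "What is the atomic number of oxygen?",
                "answers": [{"text": "8", "answer_start": context.index("8")}],
            }],
        }],
    }],
}

examples = list(iter_squad_examples(sample))
print(examples[0][1], "->", examples[0][2])

# With the real file (the path is a placeholder):
# examples = list(iter_squad_examples(load_squad("train-v1.1.json")))
```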

## Idea
Instead of trying to generate a question directly from text with some impossible-to-train model (which seems to be what most people try), I propose a more step-by-step solution.

* Identify phrases that are worth asking a question about and take the sentence each phrase is contained in.
* Transform that sentence into a 'cloze'* type question with the phrase as the right answer.
* Generate wrong answers for multiple-choice questions.
* Transform the 'cloze' type question into a question looking... question.

\*Cloze questions are fill-in-the-blank sentences like this: *Today, I went to the ________ and bought some milk and eggs*

### Identifying phrases that are worth making a question for

For the neural network I guessed that a seq2seq approach would work. It's commonly used for translating between languages, but I could also train it to 'translate' from text to phrases worth asking a question about.

#### Preparing data
The dataset exploration with some data engineering is done in the 'prepare data' notebook.
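The notebook itself is not reproduced here, so the following is only a sketch of what a text-to-phrase training pair might look like. It assumes the preprocessing lowercases the text and masks digits with `#`, which is what the sample results further down suggest; `normalize` and `make_training_pair` are hypothetical names and the example sentence is made up.

```python
import re

def normalize(text):
    """Lowercase the text and mask every digit with '#' (assumed preprocessing)."""
    return re.sub(r"\d", "#", text.lower())

def make_training_pair(context, answer_text):
    """Return a (source, target) pair for a text-to-phrase seq2seq model."""
    return normalize(context), normalize(answer_text)

# Made-up sentence, used only to show the shape of a pair.
source, target = make_training_pair("The library opened in 1998.", "1998")
print(source)  # the library opened in ####.
print(target)  # ####
```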

#### Training the model
I trained the model for about 1.5h.

#### Testing
Given my limited time and resources, I didn't expect much from my model, and rightly so: some of the generated phrases weren't even in the text.
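Phrases that do not occur in the text can at least be detected and discarded before the later steps. This is a minimal sketch of such a check, not part of the original notebook; `phrase_in_text` is a hypothetical helper, and the two phrases are the spellings printed in the first result below.

```python
def phrase_in_text(phrase, text):
    """Return True if the generated phrase occurs verbatim in the text (case-insensitive)."""
    phrase = phrase.strip().lower()
    # An empty prediction is never a usable answer.
    return bool(phrase) and phrase in text.lower()

text = ("predominantly it is considered that the first head was michael i of kiev, "
        "however some sources also claim leontiy")

print(phrase_in_text("leontiy", text))  # True
print(phrase_in_text("leonity", text))  # False
```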

#### Results

##### Text
the exact date of creation of the kiev metropolis is uncertain, as well as who was the first leader of the church. predominantly it is considered that the first head was michael i of kiev, however some sources also claim **leontiy** who is often placed after michael or anastas chersonesos, became the first bishop of the church of the tithes. the first metropolitan to be confirmed by historical sources is theopemp, who was appointed by patriarch alexius of constantinople in ####. before #### there were five dioceses: kiev, chernihiv, bilhorod, volodymyr, novgorod, and soon thereafter yuriy-upon-ros. the kiev metropolitan sent his own delegation to the council of bari in ####.

##### Answer
**leonity**

##### Text
during the period in which the populares party controlled the city, they flouted convention by re-electing marius consul several times without observing the customary ten-year interval between offices. they also transgressed the established oligarchy by advancing unelected individuals to magisterial office, and by substituting magisterial edicts for popular legislation. sulla soon made peace with mithridates. in ## bc, he returned to rome, overcame all resistance, and recaptured the city. sulla and his supporters then slaughtered most of marius' supporters. sulla, having observed the violent results of radical popular reforms, was naturally conservative. as such, he sought to strengthen the aristocracy, and by extension the senate. sulla made himself dictator, passed a series of constitutional reforms, resigned the dictatorship, and served one last term as consul. he died in ## bc.

##### Answer
\# bc

##### Text
**oxygen** is a chemical element with symbol O and atomic number 8. It is a member of the chalcogen group on the periodic table and is a highly reactive nonmetal and oxidizing agent that readily forms compounds (notably oxides) with most elements.

##### Answer
oxygen

In [85]:
fullText = 'Oxygen is a chemical element with symbol O and atomic number 8. It is a member of the chalcogen group on the periodic table and is a highly reactive nonmetal and oxidizing agent that readily forms compounds (notably oxides) with most elements. '
phrase = 'Oxygen'

### Transforming sentences to cloze questions

In [86]:
sentences = fullText.split('.')
# Start with an empty string so that .replace() below still works when nothing matches
sentence = ''

# Keep the sentence that contains the phrase
for candidate in sentences:
    if phrase in candidate:
        sentence = candidate

print(sentence.replace(phrase, '_____'))

_____ is a chemical element with symbol O and atomic number 8
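The same step can be wrapped in a small function. The sketch below is a variation on the cell above rather than the original code: it assumes a naive regex sentence splitter (so that a `.` inside a number does not end a sentence) and matches the phrase case-insensitively. `make_cloze` and its `blank` parameter are hypothetical names.

```python
import re

def make_cloze(text, phrase, blank="_____"):
    """Return the first sentence containing `phrase` with the phrase blanked out, or None."""
    # Naive sentence split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    for sentence in sentences:
        if pattern.search(sentence):
            # A function replacement keeps `blank` literal even if it contains backslashes.
            return pattern.sub(lambda _match: blank, sentence)
    return None

text = ("Oxygen is a chemical element with symbol O and atomic number 8. "
        "It is a member of the chalcogen group on the periodic table.")
print(make_cloze(text, "Oxygen"))
# _____ is a chemical element with symbol O and atomic number 8.
```

Returning `None` when the phrase is missing makes it easy to skip phrases that the model produced but that are not in the text.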


### Generating wrong answers for multiple-choice questions

In [20]:
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords

# Load embeddings. Raw strings keep the backslashes in the Windows paths literal.
glove_file = r'D:\ML\QG\QG\data\embeddings\glove.6B.300d.txt'
word2vec_file = r'D:\ML\QG\QG\data\embeddings\word2vec-glove.6B.300d.txt'

# Convert the GloVe file to the word2vec text format, then load it
glove2word2vec(glove_file, word2vec_file)
model = KeyedVectors.load_word2vec_format(word2vec_file)


In [91]:
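import re

answer = phrase

# PorterStemmer.stem() expects a single word, so the sentence is only lowercased
# and the stemmer is applied to the answer alone
stemmer = PorterStemmer()

sentence = sentence.lower()
answer = stemmer.stem(answer)

# Removing stopwords, answer and punctuation from sentence
word_list = re.sub(r'[^\w\s]', '', sentence)
word_list = word_list.replace(answer, '').split()

# Build the stopword set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in word_list if word not in stop_words]

# Getting what part of speech the answer is (False if WordNet doesn't know the word)
answerSynsets = wn.synsets(answer)
answerPartOfSpeech = bool(answerSynsets) and answerSynsets[0].pos()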

# Extracting the closest embeddings for the answer
topEmbeddings = model.most_similar(positive=[answer], topn=30)

embeddings = []
for candidate, similarity in topEmbeddings:
    # Threshold: keep only candidates whose similarity to the answer is above 0.45
    if similarity > 0.45:
        word = stemmer.stem(candidate)
        # Removing words that are not of the same part of speech as the answer
        synsets = wn.synsets(word)
        if answerPartOfSpeech and synsets and synsets[0].pos() == answerPartOfSpeech:
            # Since we are stemming the candidates, a stem can appear more than once
            if word not in embeddings:
                embeddings.append(word)

# For each candidate, count for how many of the other sentence words it also
# shows up among the nearest neighbours of (answer + sentence word)
embeddingOccurrences = [0] * len(embeddings)

for sentenceWord in filtered_words:
    # Skip words that are missing from the embedding vocabulary
    if sentenceWord not in model:
        continue
    sentenceWordEmbeddings = model.most_similar(positive=[answer, sentenceWord], topn=30)
    stemmedEmbeddings = []
    for candidate, similarity in sentenceWordEmbeddings:
        # Same threshold as above
        if similarity > 0.45:
            word = stemmer.stem(candidate)
            # Since we are stemming the candidates, a stem can appear more than once
            if word not in stemmedEmbeddings:
                stemmedEmbeddings.append(word)

    for stemmedEmbedding in stemmedEmbeddings:
        # Checking if the candidate is also among the neighbours of the answer
        if stemmedEmbedding in embeddings:
            embeddingOccurrences[embeddings.index(stemmedEmbedding)] += 1

combined = list(zip(embeddings, embeddingOccurrences))
bestEmbeddings = sorted(combined, key=lambda x: x[1], reverse=True)

[('hydrogen', 5), ('nitrogen', 4), ('carbon', 3), ('helium', 2), ('liquid', 2), ('atom', 2), ('nutrient', 1), ('ammonia', 1), ('vapor', 1), ('sulfur', 1), ('breath', 0), ('hemoglobin', 0), ('hypoxia', 0)]
['hydrogen', 'nitrogen', 'helium', 'nutrient', 'breath', 'ammonia', 'carbon', 'liquid', 'hemoglobin', 'vapor', 'atom', 'hypoxia', 'sulfur']


Top 3 closest incorrect answers:

In [93]:
print(bestEmbeddings[0][0])
print(bestEmbeddings[1][0])
print(bestEmbeddings[2][0])

hydrogen
nitrogen
carbon
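With a cloze sentence, the right answer and a ranked list of wrong answers, assembling the final multiple-choice item is mostly bookkeeping. The following is a minimal sketch under that assumption; `build_multiple_choice` and the dictionary layout are hypothetical, and the option order depends on the random seed.

```python
import random

def build_multiple_choice(cloze, answer, distractors, n_distractors=3, seed=None):
    """Combine the right answer with the top distractors and shuffle the options."""
    rng = random.Random(seed)
    options = [answer] + list(distractors[:n_distractors])
    rng.shuffle(options)
    return {
        "question": cloze,
        "options": options,
        "answer_index": options.index(answer),
    }

question = build_multiple_choice(
    "_____ is a chemical element with symbol O and atomic number 8",
    "oxygen",
    ["hydrogen", "nitrogen", "carbon", "helium"],
    seed=0,
)

print(question["question"])
for letter, option in zip("ABCD", question["options"]):
    print(f"{letter}) {option}")
```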


Questions for which a sufficient number of good incorrect answers cannot be found could be turned into true/false questions.  
Some of the sentences would then have to be turned into negatives, so that it isn't obvious that the answer to every question is true.
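Grammatical negation is not attempted here. As a simpler stand-in, the sketch below makes a statement false by swapping the right answer for a wrong one about half of the time, which is a different technique from the negation described above. `build_true_false` is a hypothetical helper.

```python
import random

def build_true_false(sentence, answer, distractors, rng=None):
    """Return (statement, is_true); about half of the statements are made false
    by replacing the answer with a randomly chosen distractor."""
    rng = rng or random.Random()
    can_falsify = bool(distractors) and answer in sentence
    if can_falsify and rng.random() < 0.5:
        return sentence.replace(answer, rng.choice(distractors)), False
    return sentence, True

sentence = "Oxygen is a chemical element with symbol O and atomic number 8"
statement, is_true = build_true_false(sentence, "Oxygen", ["Hydrogen"], random.Random(1))
print(statement, "->", is_true)
```

Even a single usable wrong answer is enough for this, which fits the case where too few good distractors were found.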

### Transforming the 'cloze' type question to a question looking... question

I think it would be easier to train another seq2seq neural network that takes regular sentences and transforms them into questions.  
The training data could be generated relatively easily from the SQuAD dataset, and there may be other datasets ready for the task.
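One way such training data might be derived from SQuAD, sketched under assumptions: for every question, take the sentence of the context that contains the answer (located through `answer_start`) and pair it with the question. The helper names, the naive sentence splitter and the hand-written sample are all illustrative, not taken from the notebooks.

```python
import re

def sentence_containing(context, answer_start):
    """Return the sentence of `context` that covers the character offset `answer_start`."""
    start = 0
    # Naive sentence boundaries: '.', '!' or '?' followed by whitespace.
    for boundary in re.finditer(r"(?<=[.!?])\s+", context):
        if boundary.start() > answer_start:
            return context[start:boundary.start()]
        start = boundary.end()
    return context[start:]

def sentence_question_pairs(squad):
    """Yield (sentence, question) pairs from a SQuAD v1.1-style dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if not qa["answers"]:
                    continue  # nothing to locate the sentence with
                first_answer = qa["answers"][0]
                sentence = sentence_containing(paragraph["context"], first_answer["answer_start"])
                yield sentence, qa["question"]

# Hand-written sample in the SQuAD shape (not real SQuAD data).
context = ("Oxygen is a chemical element with symbol O and atomic number 8. "
           "It is a member of the chalcogen group on the periodic table.")
sample = {"data": [{"title": "Oxygen", "paragraphs": [{
    "context": context,
    "qas": [{
        "id": "example-1",
        "question": "What is the atomic number of oxygen?",
        "answers": [{"text": "8", "answer_start": context.index("8")}],
    }],
}]}]}

pairs = list(sentence_question_pairs(sample))
print(pairs[0])
```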