# Advanced Question Analysis

The goal of this assignment is to write a more flexible version of the interactive QA system using syntactic analysis (spaCy). As in the previous assignment, the system should take a question in natural language (Dutch) as input, analyse the question, and generate a SPARQL query for it.

## Assignment: Additional requirements

* Make sure that your system can analyse **at least two more question types**, e.g. questions that start with *Hoe lang/Hoe groot/Hoe oud*, *Hoe heet/Hoe noem*, or *Hoeveel*, questions where the property is expressed by a verb (*Wat eet*, *Waar leeft*), Yes/No questions, *welke* questions, etc.
* Apart from the techniques introduced in week 3 (matching tokens on the basis of their lemma or part-of-speech), also include at least one pattern where you use the **dependency relations** to find the relevant property or entity in the question.
* Include 10 examples of questions that your system can handle and that illustrate that you cover additional question types.

## Examples

Here is a non-representative list of questions and question types to consider. See the list with all questions for more examples.

* Hoe groot is een olifant?
* Hoe heet de studie van insecten?
* Hoe lang is een giraffe?
* Hoe noem je de studie van mieren?
* Hoeveel weegt een giraffe?
* Waar komen koala's vandaan?
* Waar leeft een orca?
* Wanneer zijn vliegen ontstaan?
* Welke kleur heeft een ijsbeer?
* Zijn impala's een bedreigde diersoort?
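
Yes/No questions such as *Zijn impala's een bedreigde diersoort?* map naturally onto a SPARQL `ASK` query, for which the endpoint returns a `boolean` field instead of a list of bindings. Below is a minimal sketch of building such a query. The function name `build_ask_query` and the placeholder IDs are illustrative assumptions, not real Wikidata identifiers for this question.

```python
def build_ask_query(entity_id, property_id, value_id):
    """Build a SPARQL ASK query that checks a single Wikidata statement.

    All three arguments are plain ID strings (an entity, a property, a value).
    """
    return ('ASK { wd:' + entity_id + ' wdt:' + property_id
            + ' wd:' + value_id + ' . }')


# Placeholder IDs only: the real ones have to be looked up on Wikidata.
query = build_ask_query('Q_ENTITY', 'P_PROPERTY', 'Q_VALUE')
print(query)
```

Which property and value express "bedreigde diersoort" is a modelling decision that the sketch leaves open.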



In [1]:
# IMPORTS
import requests
import time
import spacy
from spacy import displacy

In [2]:
import spacy

nlp = spacy.load("nl_core_news_lg") # this loads the (large) model for analysing Dutch text
                   

## Dependency Analysis with spaCy

All the functionality of spaCy used in the last assignment is still available for question analysis.

In addition, use the dependency relations assigned by spaCy. A dependency relation is a directed, labeled arc between two tokens in the input. For the question *Welke kleur heeft een ijsbeer?*, for instance, the parser detects that *ijsbeer* is the subject of *heeft* (label *nsubj*) and that *kleur* is a direct object (*obj*) dependent of *heeft*. Note also that *heeft* has lemma *hebben*. The cell below prints the lemma, part-of-speech tag, dependency label, and head lemma of every token in *Hoe groot is een olifant?*, where *olifant* is the *nsubj* of the root *groot*.

You can also use displacy to visualize the parse output. 


In [3]:
from spacy import displacy

question = 'Hoe groot is een olifant?'

parse = nlp(question) # parse the input 

for word in parse : # iterate over the token objects 
    print(word.lemma_, word.pos_, word.dep_, word.head.lemma_)

displacy.render(parse, jupyter=True, style="dep")

hoe ADV advmod groot
groot ADJ ROOT groot
zijn AUX cop groot
een DET det olifant
olifant NOUN nsubj groot
? PUNCT punct groot
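
The rows printed above already contain what a *Hoe* + adjective pattern needs: the adjective is the `ROOT`, *hoe* modifies it, and the entity is its `nsubj` dependent. Below is a minimal sketch of that pattern. It works on plain `(lemma, pos, dep, head lemma)` tuples copied from the output, so it runs without the model. The function name `hoe_pattern` is an illustrative assumption.

```python
# Rows as printed by the cell above: (lemma, pos, dep, head lemma)
rows = [
    ('hoe', 'ADV', 'advmod', 'groot'),
    ('groot', 'ADJ', 'ROOT', 'groot'),
    ('zijn', 'AUX', 'cop', 'groot'),
    ('een', 'DET', 'det', 'olifant'),
    ('olifant', 'NOUN', 'nsubj', 'groot'),
    ('?', 'PUNCT', 'punct', 'groot'),
]


def hoe_pattern(rows):
    """Return (adjective, entity) for a 'Hoe <adjective> is ...' parse, or None."""
    # The adjective that expresses the property is the root of the sentence.
    root = next((lemma for lemma, pos, dep, _ in rows
                 if dep == 'ROOT' and pos == 'ADJ'), None)
    if root is None:
        return None
    # 'hoe' must modify that adjective, and the entity is its subject.
    has_hoe = any(lemma == 'hoe' and dep == 'advmod' and head == root
                  for lemma, _, dep, head in rows)
    entity = next((lemma for lemma, _, dep, head in rows
                   if dep == 'nsubj' and head == root), None)
    if not has_hoe or entity is None:
        return None
    return root, entity


print(hoe_pattern(rows))  # ('groot', 'olifant')
```

With a real parse, the same tuples can be built from `word.lemma_`, `word.pos_`, `word.dep_` and `word.head.lemma_`.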


## Navigating the parse tree

The [spaCy website](https://spacy.io/usage/linguistic-features#dependency-parse) explains a few handy functions for navigating a dependency tree, which is useful for information extraction. (Question analysis for question answering is very similar to information extraction.)

### Phrases

You can match the full phrase that is the subject of the sentence, or that bears any other dependency relation, using the **subtree** attribute of a token.

### Chunks

Another useful feature is that spaCy can return noun chunks: combinations of a noun with its adjectives and determiner (you probably want to remove the determiner before searching on Wikidata). See the example below.

Note that subtree often identifies the same string in the input as a chunk does. Differences occur, for instance, when a noun phrase contains another noun phrase. Chunks do not handle recursion and thus recognize two phrases, whereas subtree identifies a single phrase.
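
Removing the determiner from a chunk can be done on the part-of-speech tag rather than on a fixed word list. Below is a minimal sketch, assuming tokens that expose `.text` and `.pos_` as spaCy tokens do. The stand-in objects copy the tags from the parse output shown earlier, so the snippet runs without the model. `strip_determiners` is a hypothetical helper name.

```python
from types import SimpleNamespace


def strip_determiners(tokens):
    """Join the token texts, skipping tokens tagged as determiner (DET)."""
    return ' '.join(tok.text for tok in tokens if tok.pos_ != 'DET')


# Stand-ins for the spaCy tokens of the phrase 'een olifant'
chunk = [
    SimpleNamespace(text='een', pos_='DET'),
    SimpleNamespace(text='olifant', pos_='NOUN'),
]

print(strip_determiners(chunk))  # olifant
```

A spaCy noun chunk can be passed to the function directly, since iterating over it yields its tokens.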



In [4]:
def phrase(word) :
    children = []
    for child in word.subtree :
        children.append(child.text)
    return " ".join(children)

question = nlp('Welke kleur heeft een ijsbeer?')

for word in question:
    if word.dep_ == 'nsubj' or word.dep_ == 'obj' :
        phrase_text = phrase(word)
        print(word.dep_, phrase_text)

print(1)

for chunk in question.noun_chunks:
    print(chunk.lemma_, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

print(2)
complicated_question = nlp('Hoeveel slagen per minuut maakt het hart van een giraffe?')

for word in complicated_question:
    if word.dep_ == 'nsubj' or word.dep_ == 'obj' :
        phrase_text = phrase(word)
        print(word.dep_, phrase_text)

print(3)
for chunk in complicated_question.noun_chunks:
    print(chunk.root.dep_, chunk.lemma_)

        

obj Welke kleur
nsubj een ijsbeer
1
welk kleur kleur obj heeft
een ijsbeer ijsbeer nsubj heeft
2
nsubj Hoeveel slagen per minuut
obj het hart van een giraffe
3
nsubj slag
nmod minuut
obj het hart
nmod een giraffe
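
The output above suggests a dependency pattern for *Welke* questions: the noun that *Welke* attaches to names the property, and the `nsubj` of the sentence names the entity. Below is a minimal sketch under those assumptions. It uses hand-built stand-ins for spaCy tokens (only `.text`, `.lemma_`, `.dep_` and `.head` are used), so it runs without the model. The labels given to *Welke*, *een* and *heeft* are assumptions, since the output above does not print them, and `welk_pattern` and `make_token` are hypothetical names.

```python
from types import SimpleNamespace


def welk_pattern(tokens):
    """Return (property_word, entity_word) for a 'Welke <noun> ...' question."""
    prop = entity = None
    for tok in tokens:
        if tok.lemma_ == 'welk':
            prop = tok.head.text      # the noun that 'Welke' depends on
        elif tok.dep_ == 'nsubj':
            entity = tok.text         # the subject of the sentence
    return prop, entity


def make_token(text, lemma, dep):
    """Create a stand-in token; its head is filled in afterwards."""
    return SimpleNamespace(text=text, lemma_=lemma, dep_=dep, head=None)


# Stand-in parse of 'Welke kleur heeft een ijsbeer?'
welke = make_token('Welke', 'welk', 'det')        # label assumed
kleur = make_token('kleur', 'kleur', 'obj')
heeft = make_token('heeft', 'hebben', 'ROOT')     # label assumed
een = make_token('een', 'een', 'det')             # label assumed
ijsbeer = make_token('ijsbeer', 'ijsbeer', 'nsubj')
welke.head, kleur.head, heeft.head = kleur, heeft, heeft
een.head, ijsbeer.head = ijsbeer, heeft

print(welk_pattern([welke, kleur, heeft, een, ijsbeer]))  # ('kleur', 'ijsbeer')
```

Because the function reads the head of *Welke* directly, it does not need to parse the sentence again for every word.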


In [5]:
# FROM PREVIOUS ASSIGNMENT #

def getAnswer(query): 
    url = 'https://query.wikidata.org/sparql'
    resultsx = requests.get(url, params={'query': query, 'format': 'json'})
    
    # To prevent a 429 request code (too many requests), wait 5 seconds when this occurs and try again.
    if resultsx.status_code == 429:
        time.sleep(5)
        resultsx = requests.get(url, params={'query': query, 'format': 'json'})

    results = resultsx.json()

    # Check for yes/no answer
    if 'boolean' in results.keys():
        return results['boolean'] 
    else:
        answers = []
        varNames = results['head']['vars']
        answerItems = results['results']['bindings']
        
        # Loop through multiple answers (if given)
        for item in answerItems: 
            for varName in varNames:
                answers.append(item[varName]['value'])
    
        return answers

# Function to find the IDs of a search query
def getIDs(query, p=False):
    # Create parameters
    url = 'https://www.wikidata.org/w/api.php'
    params = {'action':'wbsearchentities',               
              'language':'nl',
              'uselang':'nl',
              'format':'json',
              'search': query}
    if p: # If looking for property id
        params['type'] = 'property'
    json = requests.get(url,params).json()
    # Get IDs from different answers
    IDs = []
    for search in json['search']:
        IDs.append(search['id'])

    return IDs

# To filter on animals
def animalID(ID):
    query1 = 'SELECT ?ansLabel WHERE { wd:' + ID + ' wdt:P1417 ?ans. SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". } }'
    result1 = getAnswer(query1)
    isAnimal = False
    if result1 != []:
        # P1417 is the Encyclopaedia Britannica Online ID; for animals it starts with 'animal'
        if result1[0][:6] == 'animal':
            isAnimal = True

    query2 = 'SELECT ?ansLabel WHERE { wd:' + ID + ' schema:description ?ans. SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". } FILTER (langMatches(lang(?ans),"nl")) }'
    result2 = getAnswer(query2)
    if result2 != []:
        # If 'dier' (Dutch for 'animal') occurs in the description, it probably is an animal.
        if 'dier' in result2[0]:
            isAnimal = True
    return isAnimal

# Removes articles 'de', 'het' and 'een' from the input
def removeArticles(line):
    lineSplit = line.lower().split()
    articles = ['de', 'het', 'een']
    for article in articles:
        while article in lineSplit:
            lineSplit.remove(article)
    newLine = ' '.join(lineSplit)

    return newLine

# Function to retrieve the keywords, based on the question given
def getKeywords(question):
    sentence = nlp(question)
    keywords = {
        'subject': '',
        'property': ''
    }
    subject_root = None  # stays None if the parse contains no nsubj chunk

    # First find the property by looking at the subject of the sentence
    for chunk in sentence.noun_chunks:
        if chunk.root.dep_ == 'nsubj':
            keywords['property'] = removeArticles(chunk.text)
            subject_root = chunk.root.text

    # Then find the entity: the chunk whose head is the root of the property chunk
    for chunk in sentence.noun_chunks:
        if subject_root is not None and chunk.root.head.text == subject_root:
            keywords['subject'] = removeArticles(chunk.text)

    return keywords

# Function for creating possible queries for the retrieved keywords
def createQueries(question):
    keywords = getKeywords(question)
    IDx = getIDs(keywords['subject'])
    ID1s = []
    for ID in IDx:
        if animalID(ID):
            ID1s.append(ID)
    ID2s = getIDs(keywords['property'], p=True)
    qs = []
    for ID1 in ID1s:
        for ID2 in ID2s:
            query = 'SELECT ?ansLabel WHERE { wd:' + ID1 + ' wdt:' + ID2 + ' ?ans. SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". } }'
            qs.append(query)
    return qs

def answerQuestion(question):
    queries = createQueries(question)
    answers = []
    for query in queries:
        answer = getAnswer(query)
        if answer != []:
            answers.append(answer)

    for ans in answers:
        for ansLabel in ans:
            print(' -', ansLabel)
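
`getAnswer` retries once after a 429 response. If the second attempt is rate-limited as well, the function goes on to parse an error response. Below is a minimal sketch of a bounded retry loop that could wrap the request. `fetch_with_retry`, its parameters and the stub response are illustrative assumptions, and the request itself is passed in as a callable so the sketch runs without network access.

```python
import time
from types import SimpleNamespace


def fetch_with_retry(fetch, max_tries=3, wait=5.0):
    """Call fetch() until it returns a non-429 response or max_tries is reached.

    fetch is any zero-argument callable that returns an object with a
    status_code attribute, e.g. lambda: requests.get(url, params=params).
    """
    response = fetch()
    for _ in range(max_tries - 1):
        if response.status_code != 429:
            break
        time.sleep(wait)          # wait before trying again
        response = fetch()
    return response


# Stub that is rate-limited twice and then succeeds
codes = iter([429, 429, 200])
result = fetch_with_retry(lambda: SimpleNamespace(status_code=next(codes)), wait=0)
print(result.status_code)  # 200
```

The caller still has to check the final status code, because the last attempt may also have been rate-limited.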

In [6]:
def rm_punct(sent):
    clean_sent = ''
    for char in sent:
        if char not in ['.', ',', '?', '!']:
            clean_sent += char
    return clean_sent

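def find_pos(parse_sent, qword):
    funct = None  # returned when qword does not occur in the parse
    for word in parse_sent:
        if word.text == qword:
            funct = word.pos_

    return funct

def find_dep(parse_sent, qword):
    funct = None  # returned when qword does not occur in the parse
    for word in parse_sent:
        if word.text == qword:
            funct = word.dep_

    return funct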

def find_root(sent, head):
    parse = nlp(sent)
    root = []
    for word in parse:
        if word.head.text == head:
            root.append(word.text)
    return root

def find_head(sent, root):
    parse = nlp(sent)
    head = []
    for word in parse:
        if word.text == root:
            head.append(word.head.text)
    return head

def analyse(s):
    anal_d = {}
    for word in s:
        anal_d[word.text] = word.dep_
    return anal_d

def findSynonym(word):
    synonym = word
    for possibleword in ['hoogte', 'lengte', 'grootte', 'lang']:
        if word == possibleword:
            synonym = 'hoogte'
        elif word == possibleword[:len(word)]:
            synonym = 'hoogte'

    return synonym

q1 = 'welke kleur heeft een ijsbeer?'
q2 = 'welke YSO-identificatiecode heeft de olifant?'
q3 = 'hoe groot is een olifant?'
def find_QP(sent):
    sent_cl = rm_punct(sent)
    query_dict = {}
    parse = nlp(sent)

    # questions starting with 'welke'
    if parse[0].lemma_.lower() == 'welk':
        # The root of the sentence is the word that is its own head
        sent_ROOT = None
        for word in sent_cl.split():
            heads = find_head(sent_cl, word)
            if heads and word == heads[0]:
                sent_ROOT = word
        if sent_ROOT is None:
            return query_dict
        # Direct dependents of the root, without the root itself
        keys = find_root(sent_cl, sent_ROOT)
        keys.remove(sent_ROOT)
        for word in keys:
            for root in find_root(sent_cl, word):
                if root == parse[0].text: # the dependent modified by 'welke' is the property
                    query_dict['P'] = findSynonym(word)
                else:
                    query_dict['Q'] = findSynonym(word)
    # questions starting with 'hoe'
    elif parse[0].lemma_.lower() == 'hoe':
        for word in sent_cl.split():
            heads = find_head(sent_cl, word)
            if heads and word == heads[0]:
                # the root (e.g. the adjective 'groot') expresses the property
                query_dict['P'] = findSynonym(word)
            elif find_dep(parse, word) == 'nsubj':
                query_dict['Q'] = findSynonym(word)
    else: # 'wat' questions, handled as in the previous assignment
        d = getKeywords(sent)
        query_dict['Q'] = d['subject']
        query_dict['P'] = d['property']

    return query_dict


def createQueries2(qIDs, pIDs):
    ID1s = []
    for ID in qIDs:
        if animalID(ID):
            ID1s.append(ID)
    qs = []
    for ID1 in ID1s:
        for ID2 in pIDs:
            query = 'SELECT ?ansLabel WHERE { wd:' + ID1 + ' wdt:' + ID2 + ' ?ans. SERVICE wikibase:label { bd:serviceParam wikibase:language "nl,en". } }'
            qs.append(query)
    return qs


def answerQuestion2(question):
    keys = find_QP(question)
    answers = []
    # Only query Wikidata when both an entity (Q) and a property (P) were found
    if keys.get('Q') and keys.get('P'):
        q_ids = getIDs(keys['Q'])
        p_ids = getIDs(keys['P'], p=True)
        queries = createQueries2(q_ids, p_ids)
        for query in queries:
            answer = getAnswer(query)
            if answer != []:
                answers.append(answer)

    if len(answers) == 0:
        print(' - Excuses. Ik heb op deze vraag geen antwoord kunnen vinden.')
    else:
        for ans in answers:
            for ansLabel in ans:
                print(' -', ansLabel)
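
`findSynonym` maps several words onto the single property label *hoogte* through prefix matching. As more *Hoe* + adjective questions are added, an explicit lookup table may be easier to extend. Below is a minimal sketch. The entries follow the word list in `findSynonym`, while the names `ADJECTIVE_TO_PROPERTY` and `property_for` are illustrative assumptions.

```python
# Words that findSynonym maps onto the property label 'hoogte'
ADJECTIVE_TO_PROPERTY = {
    'groot': 'hoogte',
    'grootte': 'hoogte',
    'lang': 'hoogte',
    'lengte': 'hoogte',
    'hoogte': 'hoogte',
}


def property_for(word):
    """Return the property label for a word, or the word itself if unknown."""
    return ADJECTIVE_TO_PROPERTY.get(word.lower(), word)


print(property_for('groot'))  # hoogte
print(property_for('kleur'))  # kleur
```

Unlike the prefix match, the table only maps exact words, so a short string such as *h* is no longer rewritten to *hoogte*.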



# Examples

For the two additional question types, I chose questions starting with 'Hoe ...' and 'Welke ...'. More complex questions are not yet handled reliably, but the simpler questions starting with 'hoe', 'welke' or 'wat' (the latter from the previous assignment) work. Below are a few example questions that the current question-answering system can handle.

In [7]:
q1 = 'Hoe groot is een olifant?'
q2 = 'Welke kleur heeft een ijsbeer?'
q3 = 'Welke commonscategorie past bij de olifant?'
q4 = 'Hoe lang is een giraffe?'
q5 = 'Welke IUCN-status heeft de leeuw?'

In [8]:
questions = [q1, q2, q3, q4, q5]
for q in questions:
    print(q)
    answerQuestion2(q)
    print()

Hoe groot is een olifant?
 - 4

Welke kleur heeft een ijsbeer?
 - wit

Welke commonscategorie past bij de olifant?
 - Elephants
 - Elephants in heraldry

Hoe lang is een giraffe?
 - 5.5

Welke IUCN-status heeft de leeuw?
 - kwetsbare soort



In [12]:
question = input('Stel een vraag\n')
answerQuestion2(question)
print()

Stel een vraag
 Wat is de bijtkracht van een leeuw?


 - 112

