# Advanced Question Analysis

The goal of this assignment is to write a more flexible version of the interactive QA system using syntactic analysis (spaCy). As in the previous assignment, the system should be able to take a question in natural language (Dutch) as input, analyse the question, and generate a SPARQL query for it.

## Assignment: Additional requirements

* Make sure that your system can analyse **at least two more question types**, e.g. questions that start with *Hoe lang/Hoe groot/Hoe oud*, *Hoe heet/Hoe noem*, or *Hoeveel*, questions where the property is expressed by a verb (*Wat eet*, *Waar leeft*), Yes/No questions, *welke* questions, etc.
* Apart from the techniques introduced in week 3 (matching tokens on the basis of their lemma or part-of-speech), also include at least one pattern where you use the **dependency relations** to find the relevant property or entity in the question.
* Include 10 examples of questions that your system can handle and that show that you cover additional question types.
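
As a starting point for covering several question types, the opening words of a question can be used to route it to a type-specific pattern. The sketch below is only an illustration: the function name `question_type`, the type labels, and the word lists are assumptions, not part of the assignment, and it uses plain string matching rather than spaCy.

```python
import re

# Hypothetical mapping from opening words to a question-type label.
# More specific patterns come first so "hoe lang" is tried before generic ones.
QUESTION_PATTERNS = [
    (r"^hoe (lang|groot|oud|zwaar|snel)\b", "measure"),
    (r"^hoe (heet|heten|noem|noemt)\b", "name"),
    (r"^hoeveel\b", "quantity"),
    (r"^welke?\b", "which"),
    (r"^(waar|wanneer|wat|wie)\b", "wh"),
    (r"^(is|zijn|heeft|hebben|kan|kunnen)\b", "yes_no"),
]

def question_type(question: str) -> str:
    """Return a coarse type label based on the first word(s) of the question."""
    text = question.strip().lower()
    for pattern, label in QUESTION_PATTERNS:
        if re.search(pattern, text):
            return label
    return "unknown"

for q in ["Hoe groot is een olifant?", "Hoeveel weegt een giraffe?",
          "Zijn impala's een bedreigde diersoort?"]:
    print(q, "->", question_type(q))
```

The label only selects which pattern to try; the property and entity still have to be found with lemma, part-of-speech, or dependency information.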

## Examples

Here is a non-representative list of questions and question types to consider. See the list with all questions for more examples.

* Hoe groot is een olifant?
* Hoe heet de studie van insecten?
* Hoe lang is een giraffe?
* Hoe noem je de studie van mieren?
* Hoeveel weegt een giraffe?
* Waar komen koala's vandaan?
* Waar leeft een orca?
* Wanneer zijn vliegen ontstaan?
* Welke kleur heeft een ijsbeer?
* Zijn impala's een bedreigde diersoort?



In [1]:
import spacy
!python -m spacy download nl_core_news_sm
!python -m spacy download nl_core_news_lg
nlp = spacy.load("nl_core_news_lg") # this loads the (large) model for analysing Dutch text



## Dependency Analysis with spaCy

All the spaCy functionality used in the previous assignment is still available for question analysis.

In addition, use the dependency relations assigned by spaCy. A dependency relation is a directed, labeled arc between two tokens in the input. For example, in *Welke kleur heeft een ijsbeer?* the parser detects that *ijsbeer* is the subject of *heeft* (label *nsubj*) and that *kleur* is its direct object (label *obj*). Note also that *heeft* has the lemma *hebben*.

You can also use displacy to visualize the parse output.


In [2]:
from spacy import displacy

question = 'Hoe lang is een giraffe?'

parse = nlp(question) # parse the input

for word in parse : # iterate over the token objects
    print(word.lemma_, word.pos_, word.dep_, word.head.lemma_)

displacy.render(parse, jupyter=True, style="dep")

hoe ADV advmod lang
lang ADJ ROOT lang
zijn AUX cop lang
een DET det giraffe
giraffe NOUN nsubj lang
? PUNCT punct lang
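
The rows printed above are already enough to write a dependency-based pattern for *Hoe ADJ is X?* questions. The sketch below works on plain tuples copied from that output, so it runs without a spaCy model. The function name `match_hoe_adjective` is hypothetical, and identifying the head by its lemma is a simplification; with real tokens you would compare `token.head` directly.

```python
# Rows copied from the output above: (lemma, pos, dep, head lemma).
rows = [
    ("hoe", "ADV", "advmod", "lang"),
    ("lang", "ADJ", "ROOT", "lang"),
    ("zijn", "AUX", "cop", "lang"),
    ("een", "DET", "det", "giraffe"),
    ("giraffe", "NOUN", "nsubj", "lang"),
    ("?", "PUNCT", "punct", "lang"),
]

def match_hoe_adjective(rows):
    """Return (property_word, entity_word) for 'Hoe ADJ is X?', else None."""
    # The adjective must be the root of the sentence.
    root = next((r for r in rows if r[2] == "ROOT" and r[1] == "ADJ"), None)
    if root is None:
        return None
    root_lemma = root[0]
    # "hoe" must modify the adjective, and the subject must attach to it.
    has_hoe = any(lemma == "hoe" and dep == "advmod" and head == root_lemma
                  for lemma, pos, dep, head in rows)
    subject = next((lemma for lemma, pos, dep, head in rows
                    if dep == "nsubj" and head == root_lemma), None)
    if has_hoe and subject:
        return root_lemma, subject
    return None

print(match_hoe_adjective(rows))  # ('lang', 'giraffe')
```

The adjective (*lang*) can then be mapped to a property and the subject (*giraffe*) to an entity.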


## Navigating the parse tree

The [spaCy website](https://spacy.io/usage/linguistic-features#dependency-parse) explains a few handy functions for navigating a dependency tree and using it for information extraction. (Question analysis for question answering is very similar to information extraction.)

### Phrases

You can match the full phrase that is the subject of the sentence, or that fills any other dependency relation, using the **subtree** attribute of a token.

### Chunks

A useful feature of spaCy is `noun_chunks`, which returns chunks: combinations of a noun with its adjectives and determiner (you probably want to remove the determiner before searching on Wikidata). See the example below.

Note that subtree often identifies the same string in the input as chunks. Differences occur, for instance, when a noun phrase contains another noun phrase. Chunks do not handle recursion and thus recognize two phrases, whereas subtree identifies a single phrase. For *het hart van een giraffe*, chunks yield *het hart* and *een giraffe*, while subtree yields the whole phrase.
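
Removing the determiner before a Wikidata search can be done on the phrase text itself. The following is a minimal sketch, assuming a small hand-picked set of Dutch leading words; the name `strip_determiner` and the word list are illustrative, not part of spaCy.

```python
# Assumed set of Dutch determiners and question determiners to drop.
LEADING_WORDS = {"de", "het", "een", "welke", "welk"}

def strip_determiner(phrase: str) -> str:
    """Drop leading determiners from a phrase, keeping at least one word."""
    words = phrase.split()
    while len(words) > 1 and words[0].lower() in LEADING_WORDS:
        words = words[1:]
    return " ".join(words)

print(strip_determiner("een ijsbeer"))   # ijsbeer
print(strip_determiner("Welke kleur"))   # kleur
```

When working with spaCy tokens rather than strings, an alternative is to skip tokens whose `pos_` is `DET` while joining the chunk or subtree.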



In [3]:
def phrase(word) :
    children = []
    for child in word.subtree :
        children.append(child.text)
    return " ".join(children)

question = nlp('Welke kleur heeft een ijsbeer?')

for word in question:
    if word.dep_ == 'nsubj' or word.dep_ == 'obj' :
        phrase_text = phrase(word)
        print(word.dep_, phrase_text)

print()

for chunk in question.noun_chunks:
    print(chunk.lemma_, chunk.root.text, chunk.root.dep_,
            chunk.root.head.text)

print()
complicated_question = nlp('Hoeveel slagen per minuut maakt het hart van een giraffe?')

for word in complicated_question:
    if word.dep_ == 'nsubj' or word.dep_ == 'obj' :
        phrase_text = phrase(word)
        print(word.dep_, phrase_text)

print()
for chunk in complicated_question.noun_chunks:
    print(chunk.root.dep_, chunk.lemma_)



obj Welke kleur
nsubj een ijsbeer

welk kleur kleur obj heeft
een ijsbeer ijsbeer nsubj heeft

nsubj Hoeveel slagen per minuut
obj het hart van een giraffe

nsubj slag
nmod minuut
obj het hart
nmod een giraffe


In [21]:
import spacy
import requests

# Load SpaCy with Dutch language model
nlp = spacy.load("nl_core_news_lg")

def query_answer(query):
    url = 'https://query.wikidata.org/sparql'
    results = requests.get(url, params={'query': query, 'format': 'json'}).json()

    # Check if the query is an ASK query
    if 'boolean' in results:
        print(results['boolean'])
    else:
        # Loop through the results and print them
        for result in results['results']['bindings']:
            for var_name, var_value in result.items():
                print(f"{var_name}: {var_value['value']}")

def look_for_animal(results):
    # Prefer the first search hit whose description contains an animal-related word
    animal_words = {"dier", "huisdier", "zespotigen", "vliesvleugeligen",
                    "insecten", "roofdier", "vissen", "kraakbeenvissen"}
    for result in results:
        # Some search hits have no description; treat that as an empty one
        description = result.get('description', '')
        if any(word in animal_words for word in description.split()):
            return result['id']
    # Fall back to the top-ranked hit
    return results[0]['id']

def find_entity_id(entity_name):
    url = 'https://www.wikidata.org/w/api.php'
    params = {'action':'wbsearchentities',
              'language':'nl',
              'uselang':'nl',
              'format':'json',
              'search': entity_name}

    response = requests.get(url, params=params)
    data = response.json()
    if data.get('search'):
      return look_for_animal(data['search'])

    return None

def find_property_id(property_name):
    url = 'https://www.wikidata.org/w/api.php'
    params = {'action':'wbsearchentities',
              'language':'nl',
              'uselang':'nl',
              'format':'json',
              'search': property_name,
              'type': 'property'}

    response = requests.get(url, params=params)
    data = response.json()
    if data.get('search'):
        return data['search'][0]['id']
    return None

def match_property(term):
    synonyms_dict = {
        "hoogte": ["lang", "lengte", "groot", "hoog"],
        "massa": ["wegen", "gewicht", "weegt"],
        "belangrijkste voedselbron": ["eten", "voeden", "voedsel"],
        "endemisch in": ["herkomst", "komen", "vandaan"],
        "habitat": ["leven", "leeft", "leefomgeving"],
        "IUCN-status": ["bedreigde diersoort", "bedreigd", "gevaar", "bedreigde dieren", "diersoort"],
        "bestudeerd door": ["studie"],
        "snelheid": ["snelheid", "rennen"],
        "wetenschappelijke naam": ["naam", "wetenschappelijke"],
        "begindatum": ["ontstaan"],
        "moedertaxon": ["moedertaxon"]
    }
    # Search the synonyms to find the matching Wikidata property label
    for prop_label, synonyms in synonyms_dict.items():
        for synonym in synonyms:
            if synonym in term:
                return prop_label
    return term

def analyze_query(query):
    # Parse the sentence with Spacy
    doc = nlp(query)

    # Initialize variables to store entity and property
    entity_name = None
    property_name = None
    amod = None
    # Count the tokens that are subjects or nominal modifiers
    nsubj_nmod_count = sum(token.dep_ in ["nmod", "nsubj"] for token in doc)

    for token in doc:
        if token.pos_ in ["NOUN", "VERB"]:
            # Prefer a nominal modifier ("van een tijger") as the entity
            if token.dep_ == "nmod":
                entity_name = token.lemma_
            elif entity_name is None and token.dep_ == "nsubj":
                entity_name = token.lemma_
        # With at least two nsubj/nmod tokens, the subject names the property
        if nsubj_nmod_count >= 2 and token.dep_ == "nsubj":
            property_name = token.lemma_
        elif token.dep_ == "obj":
            property_name = token.lemma_
        elif property_name is None and token.dep_ == "ROOT":
            property_name = token.lemma_
        if token.dep_ == "amod":
            amod = token.text

    # Nothing to match on (e.g. an empty question)
    if property_name is None:
        return entity_name, None

    if amod:
        property_name = amod + " " + property_name

    matched_property_name = match_property(property_name)

    return entity_name, matched_property_name

def construct_sparql_query(entity_id, property_id):
    # Construct SPARQL query
    query = 'SELECT ?answerLabel WHERE {wd:' + entity_id + ' wdt:' + property_id + ' ?answer. SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],nl". } }'
    return query

def answer_user_query(question):
    #user_query = input("Stel je vraag: ")
    #user_query = question

    # Analyze query and extract keywords
    entity_name, property_name = analyze_query(question)

    #print(entity_name)
    #print(property_name)

    if entity_name and property_name:
        # Find entity ID
        entity_id = find_entity_id(entity_name)
        if not entity_id:
            print("Kon de ID van de entiteit niet vinden voor:", entity_name)
            return

        # Find property ID
        property_id = find_property_id(property_name)
        if not property_id:
            print("Kon de ID van de eigenschap niet vinden voor:", property_name)
            return

        # Construct SPARQL query
        sparql_query = construct_sparql_query(entity_id, property_id)
        #print(sparql_query)
        # Execute SPARQL query
        return query_answer(sparql_query)

    else:
        print("Niet genoeg relevante informatie gevonden om een SPARQL-query te maken.")

# Call the function to answer user queries
# answer_user_query()

def process_questions(questions):
    for question in questions:
        print("=" * 50)
        print("Vraag:", question)
        print("=" * 50)
        answer_user_query(question)
        print()

# List of questions to process
questions = [
    "Welke kleur heeft een ijsbeer?",
    "Hoe lang is een giraffe?",
    "Wat is het voedsel van een olifant?",
    "Waar komen koala's vandaan?",
    "Wat is de habitat van een tijger?",
    "Zijn impala's een bedreigde diersoort?",
    "Wat is het gewicht van een leeuw?",
    "Hoe snel kan een cheetah rennen?",
    "Wat is de moedertaxon van haaien?",
    "Hoe heet de studie van insecten?",
    "Waar leeft een orka?",
    "Wat is het eten van reuzenpanda's?"
]

# process the questions in the list
process_questions(questions)

Vraag: Welke kleur heeft een ijsbeer?
answerLabel: wit

Vraag: Hoe lang is een giraffe?
answerLabel: 5.5

Vraag: Wat is het voedsel van een olifant?
answerLabel: planten

Vraag: Waar komen koala's vandaan?
answerLabel: Australië

Vraag: Wat is de habitat van een tijger?
answerLabel: bos

Vraag: Zijn impala's een bedreigde diersoort?
answerLabel: niet bedreigde soort

Vraag: Wat is het gewicht van een leeuw?
answerLabel: 1.65
answerLabel: 126
answerLabel: 188

Vraag: Hoe snel kan een cheetah rennen?
answerLabel: 120

Vraag: Wat is de moedertaxon van haaien?
answerLabel: Neoselachii

Vraag: Hoe heet de studie van insecten?
answerLabel: entomologie

Vraag: Waar leeft een orka?
answerLabel: wereldoceaan

Vraag: Wat is het eten van reuzenpanda's?
answerLabel: bamboe
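
The `query_answer` function above already checks for a `boolean` key, which is what the endpoint returns for ASK queries, but `construct_sparql_query` only builds SELECT queries. A Yes/No question could instead be turned into an ASK query. The sketch below is one possible way to do that; `construct_ask_query` is a hypothetical helper, and the IDs in the usage line are placeholders, not the real IDs for any animal or status.

```python
import re

def construct_ask_query(entity_id: str, property_id: str, value_id: str) -> str:
    """Build an ASK query that tests whether entity has value for property."""
    # Only accept well-formed Wikidata IDs so nothing else ends up in the query.
    for ident, prefix in ((entity_id, "Q"), (property_id, "P"), (value_id, "Q")):
        if not re.fullmatch(prefix + r"\d+", ident):
            raise ValueError(f"unexpected Wikidata ID: {ident!r}")
    return f"ASK {{ wd:{entity_id} wdt:{property_id} wd:{value_id} . }}"

# Placeholder IDs for illustration only.
print(construct_ask_query("Q1", "P1", "Q2"))  # ASK { wd:Q1 wdt:P1 wd:Q2 . }
```

The resulting string can be passed to `query_answer`, which then prints `True` or `False` rather than a label.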



Name: Collin Krooneman

S4890396