# Advanced Question Analysis

The goal of this assignment is to write a more flexible version of the interactive QA system. As in the previous assignment, the system should be able to take a question in natural language (English) as input, analyse the question, and generate a SPARQL query for it.

## Assignment  // Additional requirements

* Make sure that your system can analyse at least two more question types, e.g. questions that start with *which* or *when*, or questions where the property is expressed by a verb.
* Apart from the techniques introduced last week (matching tokens on the basis of their lemma or part-of-speech), also include at least one pattern where you use the dependency relations to find the relevant property or entity in the question.
* Include 10 examples of questions that your system can handle and that illustrate that you cover additional question types.

## Examples

Here is a non-representative list of questions and question types to consider. See the list with all questions for more examples.

* For what movie did Leonardo DiCaprio win an Oscar?
* How long is Pulp Fiction?
* How many episodes does Twin Peaks have?
* In what capital was the film The Fault in Our Stars filmed?
* In what year was The Matrix released?
* When did Alan Rickman die?
* Where was Morgan Freeman born?
* Which actor played Aragorn in Lord of the Rings?
* Which actors played the role of James Bond?
* Who directed The Shawshank Redemption?
* Which movies are directed by Alice Wu?


In [None]:
import spacy

nlp = spacy.load("en_core_web_sm") # this loads the model for analysing English text
                   

## Dependency Analysis with Spacy

All the functionality of Spacy, as in the last assignment, is still available for doing question analysis. 

In addition, you can use the dependency relations assigned by spaCy. A dependency relation is a directed, labeled arc between two tokens in the input. In the example below, the system detects that *movies* (lemma *movie*) is the subject of the passive sentence (with label nsubjpass). The head of this subject is normally the main verb *directed* (lemma *direct*), while the auxiliary *are* (lemma *be*) is attached to that verb with the label auxpass.


In [None]:
question = 'Which movies are directed by Alice Wu?'

parse = nlp(question) # parse the input 

for word in parse : # iterate over the token objects 
    print(word.lemma_, word.pos_, word.dep_, word.head.lemma_)
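
The labels printed above can also drive a simple decision about the kind of question being asked. The block below is a minimal sketch of that idea, not a required approach: the function name `question_type_from_deps` and the three type names are illustrative assumptions, and the label lists are written by hand to mirror what spaCy typically assigns, so the block runs without loading a model.

```python
def question_type_from_deps(dep_labels):
    """Guess a question type from a list of dependency labels.

    The checks run from most to least specific; None means no pattern applies.
    """
    if any("pass" in label for label in dep_labels):
        return "passive"    # e.g. nsubjpass or auxpass is present
    if "dobj" in dep_labels:
        return "verb_prop"  # the verb has a direct object
    if "pobj" in dep_labels:
        return "XofY"       # only a prepositional object is present
    return None


# Hand-written label lists (assumed typical parses, not model output).
passive_labels = ["det", "nsubjpass", "auxpass", "ROOT", "agent", "compound", "pobj", "punct"]
verb_labels = ["nsubj", "ROOT", "compound", "dobj", "punct"]
xofy_labels = ["nsubj", "ROOT", "det", "attr", "prep", "compound", "pobj", "punct"]

for labels in (passive_labels, verb_labels, xofy_labels):
    print(question_type_from_deps(labels))
```

With a real parse, the same function can be called as `question_type_from_deps([word.dep_ for word in parse])`.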


## Phrases

You can also match the full phrase that is the subject of the sentence, or that fills any other dependency relation, using the *subtree* attribute of a token.


In [None]:
def phrase(word) :
    children = []
    for child in word.subtree :
        children.append(child.text)
    return " ".join(children)
        
for word in parse:
    if word.dep_ == 'nsubjpass' or word.dep_ == 'agent' :
        phrase_text = phrase(word)
        print(phrase_text)
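
The subtree idea can be combined with the ROOT label to pull both a property and an entity out of a question such as *Who directed Donnie Darko?*. The following is a sketch under stated assumptions: `verb_question_parts` and the helper `tok` are hypothetical names, and the tokens are hand-built stand-ins that expose the same attributes as spaCy tokens (`text`, `lemma_`, `dep_`, `subtree`), with labels that reflect an assumed typical parse rather than actual model output.

```python
from types import SimpleNamespace


def verb_question_parts(tokens):
    """Return (property, entity) for questions like 'Who directed X?'.

    `tokens` is any sequence of objects exposing the spaCy Token attributes
    `dep_`, `lemma_`, `text` and `subtree`. Returns None if there is no
    ROOT token or no direct object.
    """
    root = next((t for t in tokens if t.dep_ == "ROOT"), None)
    obj = next((t for t in tokens if t.dep_ == "dobj"), None)
    if root is None or obj is None:
        return None
    entity = " ".join(t.text for t in obj.subtree)
    return root.lemma_, entity


def tok(text, lemma, dep):
    """Build a stand-in token (hypothetical helper, not part of spaCy)."""
    return SimpleNamespace(text=text, lemma_=lemma, dep_=dep, subtree=[])


who = tok("Who", "who", "nsubj")
directed = tok("directed", "direct", "ROOT")
donnie = tok("Donnie", "Donnie", "compound")
darko = tok("Darko", "Darko", "dobj")
darko.subtree = [donnie, darko]  # the object phrase spans both tokens
demo_tokens = [who, directed, donnie, darko]

print(verb_question_parts(demo_tokens))  # ('direct', 'Donnie Darko')
```

Because the function only relies on those four attributes, it should also accept a parsed question directly, e.g. `verb_question_parts(nlp("Who directed Donnie Darko?"))`.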
        

## Visualisation

For a quick understanding of what the parser does and how it assigns part-of-speech tags, entities, etc., you can also visualise parse results. Below, the entity visualiser and the dependency visualiser are demonstrated.
This code is for illustration only; it is not part of the assignment.

In [None]:
from spacy import displacy

question = "Who is the main character of the movie Harry Potter"

parse = nlp(question)

displacy.render(parse, jupyter=True, style="ent")

displacy.render(parse, jupyter=True, style="dep")

# Below this cell, my assignment solution starts










The additional question types that are handled in the version for this assignment are listed below. The previous type, 'Who is the X of Y?', is also still functional. 

1.   **Questions in which the verb denotes the role (property) of the question**

      This entails questions of the form: *'Who [directed] X?'*. This is done using SpaCy dependencies: we obtain the property from the main verb and the entity from looking for the phrase of the direct object (and filtering this). More examples can be found below.

2.   **Questions that ask for a specific property: movie duration**

      This deals with the specific problem of asking for the length of a movie. Solving this problem was done by looking at all questions (from 'Test Questions') about duration and extracting all formulations. String matching turned out to be the best solution: phrases such as 'how long' and 'how many minutes' are then directly translated to having the property 'duration'. Since this seems like a simple solution, I added an additional type of question handled below:

3.   **Questions that are asked in a specific passive form**

      This entails questions of the form *'Which [movies]  are [directed] by X?'*. Interestingly, for such a question, the challenge is not only to find the right spans for the entity and property, but also to reformulate the query to accommodate searching 'in reverse' (i.e. not starting from a movie title, but looking for the movie titles themselves).
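
The 'in reverse' search for passive questions comes down to swapping the positions of the entity and the answer variable in the SPARQL triple. A minimal sketch of that idea follows; `build_query` and `LABEL_SERVICE` are illustrative names, and `Q123` is a placeholder entity ID rather than a looked-up value (`P57` is the Wikidata property for director).

```python
LABEL_SERVICE = (
    "SERVICE wikibase:label "
    "{ bd:serviceParam wikibase:language '[AUTO_LANGUAGE],en'. }"
)


def build_query(ent_id, prop_id, reverse=False):
    """Build a SPARQL query for one entity/property pair.

    reverse=False: the entity is the subject (wd:ENT wdt:PROP ?answer)
    reverse=True:  the entity is the object  (?answer wdt:PROP wd:ENT)
    """
    if reverse:
        triple = f"?answer wdt:{prop_id} wd:{ent_id} ."
    else:
        triple = f"wd:{ent_id} wdt:{prop_id} ?answer ."
    return f"SELECT ?answerLabel WHERE {{ {LABEL_SERVICE} {triple} }}"


# Q123 is a placeholder ID used only to show the shape of the query.
print(build_query("Q123", "P57"))                # 'Who directed X?'
print(build_query("Q123", "P57", reverse=True))  # 'Which movies are directed by X?'
```

Keeping the two directions in one function means only the triple pattern differs between question types, while the label service clause stays identical.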


Listed below are 10 example questions that the new system can answer.

1.   Who produced 'Shutter Island'?
2.   Who narrated the classic movie called Titanic?
3.   Which film was directed by Theo Maassen?
4.   How long will it take me to watch "The Truman Show"?
5.   How many minutes is the documentary The Social Dilemma?
6.   Which films were directed by James Cameron?
7.   Who are the directors of Shrek 2?
8.   Who was the screenwriter for the Dutch movie Gewoon Hans?
9.   Who directed Donnie Darko?
10.  Which pieces were composed by film music composer Jóhann Jóhannsson?

These example questions are meant to illustrate the range of questions that the system can now handle. Examples 7 and 8 illustrate that the *'Who is the X of Y?'*-questions are still handled. Examples 1, 2 and 9 illustrate how the verb can denote the role (extra question type 1 above). Examples 4 and 5 illustrate how asking for movie duration works (extra question type 2). Examples 3, 6 and 10 show how the passive construction (extra question type 3) can be used. As in the previous system, verb tense and the use of quotation marks around entities do not affect performance.
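
The string matching for duration questions (question type 2) and the stripping of quotation marks can be sketched in a few lines. This is an illustrative variant, not the exact code of the solution: it lowercases the question so that each phrase needs to be listed only once, and the names `DURATION_KEYWORDS`, `normalise` and `is_duration_question` are assumptions made for this example.

```python
DURATION_KEYWORDS = (
    "how long",
    "how many minutes",
    "how much time",
    "what is the length of",
    "duration",
)


def normalise(question):
    """Strip the question mark and both kinds of quotation marks."""
    for char in "?\"'":
        question = question.replace(char, "")
    return question.strip()


def is_duration_question(question):
    """Return True if the question contains a duration phrase (any case)."""
    lowered = normalise(question).lower()
    return any(keyword in lowered for keyword in DURATION_KEYWORDS)


print(normalise("Who produced 'Shutter Island'?"))
print(is_duration_question('How long will it take me to watch "The Truman Show"?'))
print(is_duration_question("Who directed Donnie Darko?"))
```

Note that removing every apostrophe also affects contractions and possessives inside a title, so this normalisation is only suitable for questions like the examples listed above.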




In [62]:
import spacy
import requests

def phrase(word) :
  """
  Returns the full phrase headed by a word, as a list of token texts.

  This was copied (but slightly modified) from the examples above.
  """

  # word.subtree yields the word and all of its dependents in sentence order
  return [child.text for child in word.subtree]


def answer_question():

  """
  This function deals with 3 additional question types, compared to the
  system for assignment 3.

  General approach:
  1) Determine question type
  2) Obtain and filter entity and property according to question type
  3) Send query to Wikidata and print answer(s)

  10 example questions (and explanations) are found in the text cell above.
  
  """

  nlp = spacy.load("en_core_web_sm") # load English text model

  url = 'https://www.wikidata.org/w/api.php'
  sparql_url = 'https://query.wikidata.org/sparql'

  params_prop = {'action':'wbsearchentities', 
            'language':'en',
            'format':'json',
            'type': 'property'}

  params_ent = {'action':'wbsearchentities', 
            'language':'en',
            'format':'json',
            'type': 'item'}

  duration_keywords = ['how long', 'How long', 'duration',
                    'How many minutes', 'how many minutes',
                    'How much time', 'how much time',
                    'What is the length of', 'what is the length of']

  question = input("Please enter your question: ")

  # Parse the input
  parse = nlp(question)

  # Extract sentence structure
  lemmas = []
  pos = []
  dep = []
  for word in parse:
      lemmas.append(word.lemma_)
      pos.append(word.pos_)
      dep.append(word.dep_)

  sent = parse.text.replace("?", "")  # Strip question mark
  sent = sent.replace('"', "")  # Strip double quotation marks
  sent = sent.replace("'", "")  # Strip single quotation marks

  # Determine question type
  question_type = ""
  for rel in dep:
    if 'pass' in rel:
      question_type = "passive"  # e.g. 'Which movies are directed by X?'
      break
    else:
      if 'pobj' in rel:
        question_type = "XofY"  # e.g. 'Who is the director of X?'
      if 'dobj' in rel:
        question_type = "verb_prop"  # e.g. 'Who directed X?'
      
  for keyword in duration_keywords:
    if keyword in sent:
      question_type = "duration"  # e.g. 'How long is X?'

  if not question_type:
    print("Question type could not be found ...")
    return  # without a question type, no entity or property can be extracted

  # GET ENTITY AND PROPERTY (DIFFERS PER QUESTION TYPE)

  if question_type == "XofY":

    # Property: from AUX to ADP
    if pos.count("ADP") == 1:
      prop = lemmas[pos.index('AUX')+1 :pos.index('ADP')]
      if len(prop) > 1:
        if "the" in prop:  # strip 'the'
          prop.remove("the")
        if "a" in prop:   # strip 'a'
          prop.remove("a")
        if "an" in prop:   # strip 'an'
          prop.remove("an")

    # More than one adposition: 'of' is likely in property (e.g. 'cause of death')
    # Filter these specific common cases out manually
    else: 
      if "cause of death" in parse.text:
        prop = ["cause of death"]
      elif "city of birth" in parse.text:
        prop = ["birth city"]
      elif "date of birth" in parse.text:
        prop = ["birth date"]
      elif "country of origin" in parse.text:
        prop = ["country of origin"]
      elif "country of citizenship" in parse.text:
        prop = ["country of citizenship"]

      # Perhaps there is an 'of' in the entity, such as in 'Lord of the Rings'
      else:
        prop = lemmas[pos.index('AUX')+1 :pos.index('PROPN')]
        if len(prop) > 1:
          if "the" in prop:  # strip 'the'
            prop.remove("the")
          if "a" in prop:   # strip 'a'
            prop.remove("a")
          if "an" in prop:   # strip 'an'
            prop.remove("an")
          if "of" in prop:   # strip 'of'
            prop.remove("of")

    # Entity: everything after the first adposition. This assumes that token
    # indices and whitespace-split word indices line up.
    ent = sent.split(" ")[pos.index('ADP')+1:]

  elif question_type == "verb_prop":
   # Find entity: direct object (phrase!)
    for word in parse:
      if word.dep_ == 'dobj':
        ent = phrase(word)

    main_verb = parse[dep.index("ROOT")]
    prop = [main_verb.lemma_]

  elif question_type == "duration":
    prop = ["duration"]
    # Find entity: probably follows main verb (ROOT)
    ent = sent.split(" ")[dep.index("ROOT")+1:]

  elif question_type == "passive":
    for word in parse:
      if word.dep_ == 'pobj':
        ent = phrase(word)
        
    # Find prop: the main verb (ROOT) itself
    for word in parse:
      if word.pos_ == "VERB" and word.dep_ == "ROOT":
        prop = [word.text]

  # Filter entity: starts with first capital letter and start is not an adjective (e.g. the Dutch movie ...)
  try:
    start = min([ent.index(word) for word in ent if word.istitle() and \
                  pos[sent.split(" ").index(word)] != "ADJ"])
    ent = ent[start:]
  except ValueError:
    pass

  # Convert entity and property from list to string
  ent = " ".join(ent)
  prop = " ".join(prop)

  prop = prop.replace('"', "")  # Strip double quotation marks
  prop = prop.replace("'", "")  # Strip single quotation marks

  # Find property in Wikidata
  params_prop['search'] = prop
  json = requests.get(url,params_prop).json()

  # Find entity in Wikidata
  params_ent['search'] = ent
  json_e = requests.get(url,params_ent).json()


  # Retrieve Wikidata answer
  try:
    for prop in json['search']:
      for ent in json_e['search']:  # double for loop: match prop and ent correctly
        prop_id = prop['id']
        ent_id = ent['id']

        # Build query
        if question_type != "passive":
          query = "SELECT ?answerLabel WHERE {SERVICE wikibase:label \
          { bd:serviceParam wikibase:language '[AUTO_LANGUAGE],en'. } wd:" + ent_id + " wdt:" + prop_id + " ?answer .}"
        else:
          query = "SELECT ?answerLabel WHERE {SERVICE wikibase:label \
          { bd:serviceParam wikibase:language '[AUTO_LANGUAGE],en'. } ?answer wdt:" + prop_id + " wd:" + ent_id + " .}"

        # Send query and print results
        results = requests.get(sparql_url, params={'query': query, 'format': 'json'}).json()
        if question_type != "duration":
          for item in results['results']['bindings']:  # We show all items: sometimes one name can refer to multiple possible entities!
            for var in item:
              print("The answer to your question is:", item[var]['value'])

        else:  # Code duplication, but gives a more meaningful answer for duration type questions
          for item in results['results']['bindings']:
            for var in item:
              print("The answer to your question is:", item[var]['value'] + " minutes")
  except (KeyError, ValueError, requests.RequestException):
    pass  # Not found, or the request or the JSON decoding failed


# Call function
answer_question()


Please enter your question: Who produced 'Shutter Island'?
The answer to your question is: Martin Scorsese
The answer to your question is: Mike Medavoy
