# NLU - First Assignment

- Student: Ziglio Riccardo
- Student Number: 224285

This notebook contains the solution for the first assignment of the NLU course.





# Dependency Parsing Visualizer

In this first cell I render the dependency parse of the test sentence used throughout the assignment, using displaCy, the visualizer that ships with spaCy.

In [2]:
import spacy
from spacy import displacy #to visualize dependency parse
sentence = "I saw the man with a telescope"
#parsing of the sentence
nlp = spacy.load("en_core_web_sm")
doc = nlp(sentence)
displacy.render(doc, style="dep", jupyter = True)
print("")




# Exercise 1 - getDependencyPath function

The first exercise asks for the path of dependency relations from the root to each token of the input sentence.
To do that, the head attribute of the Token class is used. This returns "The syntactic parent, or “governor”, of this token" (https://spacy.io/api/token). The root of the sentence is the token that is its own head, i.e. the one with no incoming arc (the node "saw" in the dependency graph shown before). For each token, the function therefore climbs from the token to its head, then to the head of that head, and so on until the root is reached, saving the text and dependency label of every token visited in a list called "dependency". The root element "saw" is then added as the last element of this list.
To display the root as the first element, the "dependency" list is copied in reverse order into the "paths" list, which, as the name suggests, holds the paths from the root to every token.

The output of the function is a list with one path per token, each in the following format: ['ROOT (root.text)', ' -> token.text: token.dep_', ...].

The cells below show the code of the described function and its application.

In [13]:
def getDependencyPath(sentence):
    #parsing sentence
    doc = nlp(sentence)
    paths = []
    for token in doc:
        dependency = []  
        while token != token.head: #a token that is its own head is the root of the sentence
            dependency.append(" -> " + token.text + ": " + token.dep_)
            token = token.head #move up to the parent; the loop ends at the root
        #token is now the root: add it as the last element of the list
        dependency.append("ROOT (" + token.text + ")")
        paths.append(dependency[::-1]) #copy list in reverse order to have ROOT as first element
    return paths 

In [17]:
dependency_paths = getDependencyPath(sentence)
print("--------------------output getDependencyPath (function 1)--------------------\n")
print(dependency_paths)
print("")

--------------------output getDependencyPath (function 1)--------------------

[['ROOT (saw)', ' -> I: nsubj'], ['ROOT (saw)'], ['ROOT (saw)', ' -> man: dobj', ' -> the: det'], ['ROOT (saw)', ' -> man: dobj'], ['ROOT (saw)', ' -> with: prep'], ['ROOT (saw)', ' -> with: prep', ' -> telescope: pobj', ' -> a: det'], ['ROOT (saw)', ' -> with: prep', ' -> telescope: pobj']]
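
The head-walking idea does not depend on spaCy itself. The following is a minimal, model-free sketch of it: `WORDS`, `HEADS` and `path_from_root` are hypothetical names that are not part of the assignment code, and the head indices were transcribed by hand from the output above (the root points to itself, mirroring the "own head" convention described earlier).

```python
# Toy encoding of the parse shown above (hand-transcribed, not model output).
# HEADS[i] is the index of the syntactic parent of WORDS[i];
# the root ("saw") is its own head.
WORDS = ["I", "saw", "the", "man", "with", "a", "telescope"]
HEADS = [1, 1, 3, 1, 1, 6, 4]


def path_from_root(index, words=WORDS, heads=HEADS):
    """Return the words on the path from the root down to words[index]."""
    path = [words[index]]
    # Climb towards the root: a token that is its own head is the root.
    while heads[index] != index:
        index = heads[index]
        path.append(words[index])
    # The path was collected bottom-up, so reverse it.
    return path[::-1]


for i, word in enumerate(WORDS):
    print(word, "<-", " -> ".join(path_from_root(i)))
```

Using an index instead of reassigning the loop variable keeps the token being described separate from the cursor that climbs the tree.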



# Exercise 2 - getDependencySubtree function

To obtain the dependency subtree of a given token, the token.subtree property is used. It makes it possible to get a whole phrase from its syntactic head, and yields the token together with all of its syntactic descendants as an ordered sequence of tokens (https://spacy.io/usage/linguistic-features#dependency-parse). Each node (called "child" in the function) in the token's subtree is added to the "dependents" list, which is then added to the "subtree" list in order to obtain the following format: [token -> [token's subtree]].

Note that, by the definition of subtree, which is: "A subtree of a tree T is a tree S consisting of a node in T and all of its descendants in T" (https://www.quora.com/What-is-the-exact-and-easily-understandable-definition-of-a-subtree-Is-the-following-tree-S-the-subtree-of-T), the subtree of the root node "saw" is the whole tree, i.e. the whole sentence.

The cells below show the code of the described function and its application.

In [9]:
def getDependencySubtree(sentence):
    #parsing sentence
    doc = nlp(sentence)
    subtree = []
    for token in doc:
        dependents = []
        for child in token.subtree: 
            #token.subtree -> returns a sequence containing the token and all the token’s syntactic descendants.
            dependents.append(child.text)
        subtree.append(token.text + " -> " + str(dependents))
    return subtree

In [11]:
dependency_subtree = getDependencySubtree(sentence)
print("--------------------output function dep.subtree (function 2)--------------------\n")
print(dependency_subtree)
print("")

--------------------output function dep.subtree (function 2)--------------------

["I -> ['I']", "saw -> ['I', 'saw', 'the', 'man', 'with', 'a', 'telescope']", "the -> ['the']", "man -> ['the', 'man']", "with -> ['with', 'a', 'telescope']", "a -> ['a']", "telescope -> ['a', 'telescope']"]
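
To make the notion of "a token and all of its descendants" concrete, here is a small model-free sketch that derives the same subtrees from head indices alone. `WORDS`, `HEADS` and `subtree_indices` are hypothetical names, and the head indices are hand-transcribed from the parse of the test sentence rather than produced by the model.

```python
# Hand-transcribed parse of the test sentence (an assumption, not model output).
WORDS = ["I", "saw", "the", "man", "with", "a", "telescope"]
HEADS = [1, 1, 3, 1, 1, 6, 4]  # the root ("saw") is its own head


def subtree_indices(index, heads=HEADS):
    """Return, in sentence order, the indices of a token and all its descendants."""
    members = []
    for i in range(len(heads)):
        # Walk up from token i; it belongs to the subtree
        # if the walk reaches `index` before stopping at the root.
        node = i
        while node != index and heads[node] != node:
            node = heads[node]
        if node == index:
            members.append(i)
    return members


for i, word in enumerate(WORDS):
    print(word, "->", [WORDS[j] for j in subtree_indices(i)])
```

Because the outer loop visits positions from left to right, the result comes out in sentence order, which is the same ordering the exercise relies on.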


# Exercise 3 - isSubtree function

As in the previous function, the token.subtree property is used to check whether a given segment of a sentence forms a subtree.
For each token in the input sentence, its subtree is collected as a list of strings and stored in a dictionary that maps the token to its subtree.
For example, for the sentence "I saw the man with a telescope", the entry for the "with" token holds ['with', 'a', 'telescope'].

To check whether the input sequence of words forms a subtree, it is then sufficient to check whether it matches one of the subtrees stored in the dictionary.

The cells below show the code of the described function and its application.

In [5]:
def isSubtree(sentence, sequence_of_words):
    #parsing sentence
    doc = nlp(sentence)
    subtrees = {}

    #find the subtree of each token in the sentence
    for token in doc:
        #key by token index, not text, so repeated words do not overwrite each other
        subtrees[token.i] = [descendant.text for descendant in token.subtree]

    #the sequence forms a subtree if it equals the subtree of some token
    return sequence_of_words in subtrees.values()

In [15]:
#test sequence of words
seq_words1 = ['with', 'a', 'telescope'] #is a subtree
#seq_words = ['I', 'man'] #not a subtree
seq_words2 = ['I','saw', 'the', 'man'] #not a subtree (subtree of root, saw, is whole tree)

print("--------------------output function isSubtree (function 3)--------------------\n")
if isSubtree(sentence, seq_words1):
    print(seq_words1, " IS a subtree of the sentence: ", sentence)
else: print(seq_words1, " IS NOT a subtree of the sentence: ", sentence)
print("\n")
if isSubtree(sentence, seq_words2):
    print(seq_words2, " IS a subtree of the sentence: ", sentence)
else: print(seq_words2, " IS NOT a subtree of the sentence: ", sentence)
print("")

--------------------output function isSubtree (function 3)--------------------

['with', 'a', 'telescope']  IS a subtree of the sentence:  I saw the man with a telescope


['I', 'saw', 'the', 'man']  IS NOT a subtree of the sentence:  I saw the man with a telescope
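
One detail worth illustrating is how the subtree dictionary is keyed. When the same word occurs twice in a sentence, a dictionary keyed by word text keeps only the last occurrence, whereas one keyed by token position keeps both. The sketch below shows the difference on a toy sentence; the parse is hand-assigned for illustration (an assumption, not model output), and all names in it are hypothetical.

```python
# Hypothetical hand-assigned parse of "big dogs chase dogs":
# big -> dogs (position 1), dogs (1) -> chase, chase is the root, dogs (3) -> chase.
WORDS = ["big", "dogs", "chase", "dogs"]
HEADS = [1, 2, 2, 2]


def subtree_words(index):
    """Words of the token at `index` and of its descendants, in sentence order."""
    result = []
    for i in range(len(WORDS)):
        node = i
        while node != index and HEADS[node] != node:
            node = HEADS[node]
        if node == index:
            result.append(WORDS[i])
    return result


# Keyed by text: the second "dogs" overwrites the first one.
by_text = {WORDS[i]: subtree_words(i) for i in range(len(WORDS))}
# Keyed by position: every token keeps its own entry.
by_index = {i: subtree_words(i) for i in range(len(WORDS))}

candidate = ["big", "dogs"]  # the subtree of the first "dogs"
print(candidate in by_text.values())   # False: that entry was overwritten
print(candidate in by_index.values())  # True
```

For the test sentence of this notebook the two keyings give the same answer, since no word is repeated.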


# Exercise 4 - headOfSpan function

To identify the head of a span given a sequence of words as input, it is sufficient to use the Span.root property (https://spacy.io/api/span).
As defined in the spaCy documentation, a span is a slice of a Doc object, which is itself a sequence of Token objects (so a span is formed by one or more tokens, https://spacy.io/api/doc).

To use the Span.root property, the input string must first be turned into a span. The string is parsed into a Doc and then sliced with the span = doc[start : end] instruction; here the slice doc[:] covers the whole input. Note that the input is parsed on its own, so its parse may differ from the one the same words receive inside a longer sentence.

The root is then returned by span.root.

The cells below show the code of the described function and its application.

In [12]:
#function 4: correct usage of Span.root
def headOfSpan(sentence):
    #parsing the input string
    doc = nlp(sentence)
    span = doc[:] #span --> slice of a Doc; a sequence of one or more tokens
    print("span: ", span) #ex. "the man with" -> head = man
    #span.root -> the token with the shortest path to the root of the sentence
    #(or the root itself). If multiple tokens are equally high in the tree, the first token is taken.
    return span.root

In [13]:
print("--------------------output function headOfSpan (function 4)--------------------\n")
span = "the man with"
print("head of span: ", headOfSpan(span))
print("")

--------------------output function headOfSpan (function 4)--------------------
span:  the man with
head of span:  man
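
The rule quoted in the code comment (shortest path to the root, first token on ties) can also be tried on a span taken from inside the full test sentence. The sketch below re-implements that rule in plain Python under stated assumptions: the head indices are hand-transcribed from the parse of "I saw the man with a telescope", and `depth` and `span_head` are hypothetical helpers. It is not spaCy's own implementation of Span.root.

```python
# Hand-transcribed parse of the test sentence (an assumption, not model output).
WORDS = ["I", "saw", "the", "man", "with", "a", "telescope"]
HEADS = [1, 1, 3, 1, 1, 6, 4]  # the root ("saw") is its own head


def depth(index):
    """Number of arcs between a token and the sentence root."""
    steps = 0
    while HEADS[index] != index:
        index = HEADS[index]
        steps += 1
    return steps


def span_head(start, end):
    """Index of the token in WORDS[start:end] that is closest to the root.

    min() returns the first minimal element, so ties go to the leftmost token.
    """
    return min(range(start, end), key=depth)


# "the man with" occupies positions 2..4 of the full sentence.
print(WORDS[span_head(2, 5)])  # man ("man" and "with" tie, the first wins)
# "a telescope" occupies positions 5..6.
print(WORDS[span_head(5, 7)])  # telescope
```

Here "man" and "with" are both one arc away from the root, so the tie-breaking rule is what selects "man".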


# Exercise 5 - extractSentenceSpan function

To extract the subject, direct object and indirect object spans of a sentence, the idea is to iterate over each token in the sentence and, for each of the token's children (the immediate syntactic dependents of the token), check whether its dependency label is one of:
- "dobj" for direct object,
- "nsubj" for nominal subject,
- "dative" for indirect object (in spaCy dative is used instead of iobj, as shown here: https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md).

To also extract the determiner attached to the object (e.g. to extract "the man" instead of "man" as the direct object in the sentence "I saw the man"), the function iterates over the object's token.lefts, i.e. the syntactic children that occur before the token in sentence order (https://spacy.io/usage/linguistic-features#dependency-parse).

I have used token.lefts because determiners appear before nouns, so on their left (e.g. "the man"; https://dictionary.cambridge.org/it/grammatica/grammatica-britannico/determiners-the-my-some-this).

The code also includes, as a comment, a version of the function that does not detect the object's determiner.

The cells below show the code of the described function and its application.

In [15]:
def extractSentenceSpan(sentence):
    #parsing sentence
    doc = nlp(sentence)
    relationships_dict = {}
    subjList = []
    dobjList = []
    iobjList = []

    #version without the determiner of object (ex. "man" instead of "the man" for the sentence "I saw the man.")
    # for token in doc:
    #     if token.dep_ == "nsubj":
    #         subjList.append(token.text)
    #     if token.dep_ == "dobj":
    #         dobjList.append(token.text)
    #     if token.dep_ == "dative":
    #         iobjList.append(token.text)
            
    for token in doc:
        for child in token.children:
            #collect the left children of the object (ex. the determiner in "the man")
            if child.dep_ == "dobj":
                for objects in child.lefts:
                    #look for determiner of object -> ex. the man
                    dobjList.append(objects.text) #the
                dobjList.append(child.text) #man
            if child.dep_ == "nsubj" or child.dep_ == "subj":
                subjList.append(child.text)
            if child.dep_ == "dative": #in spacy dative is used instead of iobj
                for objects in child.lefts:
                    iobjList.append(objects.text)
                iobjList.append(child.text) 
    
    relationships_dict["subj"] = subjList 
    relationships_dict["dep. obj"] = dobjList 
    relationships_dict["indep. obj"] = iobjList

    return relationships_dict

In [16]:
#sentence = "I saw the man with a telescope"
print("--------------------output function extractSentenceSpan (function 5)--------------------")
print(extractSentenceSpan(sentence))

--------------------output function extractSentenceSpan (function 5)--------------------
{'subj': ['I'], 'dep. obj': ['the', 'man'], 'indep. obj': []}
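
The test sentence has no indirect object, so the "dative" branch stays empty in the output above. As a rough illustration of how that branch would be filled, the sketch below applies the same label filter and left-children rule to a ditransitive sentence. The parse is hand-assigned (an assumption about what a parser using these labels might produce, not actual model output), and `LABEL_TO_KEY` and `extract_spans` are hypothetical names. Unlike the function above, the sketch applies the left-children rule to the subject as well, which makes no difference for this sentence.

```python
# Hypothetical hand-assigned parse of "She gave him a book" (not model output).
WORDS = ["She", "gave", "him", "a", "book"]
HEADS = [1, 1, 1, 4, 1]  # the root ("gave") is its own head
DEPS = ["nsubj", "ROOT", "dative", "det", "dobj"]

# Map dependency labels to the keys used in the result dictionary.
LABEL_TO_KEY = {"nsubj": "subj", "dobj": "dep. obj", "dative": "indep. obj"}


def extract_spans():
    """Collect each target token preceded by its left-hand children."""
    spans = {key: [] for key in LABEL_TO_KEY.values()}
    for i, dep in enumerate(DEPS):
        key = LABEL_TO_KEY.get(dep)
        if key is None:
            continue  # not a subject, direct object or indirect object
        # Left children: tokens before position i whose head is i.
        lefts = [WORDS[j] for j in range(i) if HEADS[j] == i]
        spans[key].extend(lefts + [WORDS[i]])
    return spans


print(extract_spans())
```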
